Fix bug in detaching disks when node is powered-off but still in inventory in vsphere CP #63413

Conversation

@w-leads (Contributor) commented May 3, 2018

What this PR does / why we need it:
The vSphere cloud provider (VCP) fails to detach disks from a powered-off node when the node is still in the vCenter inventory. Because of this, the disks have to be detached manually before the node is powered on again.
The issue occurs because the node's information is removed from the node manager when Kubernetes detects that the node is powered off, even though that information is still required when VCP later attempts to detach the disks.
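A minimal standalone sketch of the failure path described above (simplified stand-ins for the VCP types, not the actual provider code):

```go
package main

import (
	"errors"
	"fmt"
)

// errNoVMFound stands in for vclib.ErrNoVMFound.
var errNoVMFound = errors.New("no VM found")

// nodeManager mimics the in-memory registry that forgets powered-off nodes.
type nodeManager struct {
	registered map[string]bool
}

func (nm *nodeManager) getNode(nodeName string) error {
	if !nm.registered[nodeName] {
		return errNoVMFound
	}
	return nil
}

// detachDisk mirrors the control flow quoted later in the review: once the
// node manager no longer knows the node, the volume is assumed to be detached.
func detachDisk(nm *nodeManager, volPath, nodeName string) error {
	if err := nm.getNode(nodeName); err == errNoVMFound {
		// Bug: for a powered-off VM that is still in the vCenter inventory,
		// the disk is in fact still attached and must be detached by hand.
		fmt.Printf("Node %q does not exist, disk %s is already detached from node.\n", nodeName, volPath)
		return nil
	}
	// ... otherwise the real code calls the vSphere API to detach the disk ...
	return nil
}

func main() {
	// Kubernetes has detected the powered-off node, so it is already unregistered.
	nm := &nodeManager{registered: map[string]bool{}}
	_ = detachDisk(nm, "[datastore1] kubevols/pv-0001.vmdk", "node-1")
}
```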

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 3, 2018
@k8s-ci-robot k8s-ci-robot requested review from davidopp and a user May 3, 2018 23:09
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 3, 2018
@w-leads (Contributor, Author) commented May 4, 2018

/sig vmware

@k8s-ci-robot k8s-ci-robot added the area/provider/vmware Issues or PRs related to vmware provider label May 4, 2018
@dims (Member) commented May 7, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 7, 2018
@w-leads (Contributor, Author) commented May 7, 2018

/retest

@w-leads (Contributor, Author) commented May 7, 2018

/assign @abrarshivani

glog.Infof("Node %q does not exist, disk %s is already detached from node.", convertToString(nodeName), volPath)
if err == vclib.ErrNoVMFound {
// If node doesn't exist, discover the node to check if node is still in vcenter inventory
node, err := vs.nodeManager.GetRemovedNode(nodeName)
Review comment (Contributor):

I think this will not work if controller-manager restarts since this information will vanish. Can you please confirm this?

@@ -45,10 +45,12 @@ type NodeManager struct {
nodeInfoMap map[string]*NodeInfo
// Maps node name to node structure
registeredNodes map[string]*v1.Node
removedNodes map[string]*v1.Node
Review comment (Contributor):

The node information will still remain even after all the volumes attached by the cluster have been detached. I would suggest removing this information once it is no longer needed.

@w-leads (Contributor, Author) commented May 11, 2018

Hi @abrarshivani,
Thank you so much for reviewing my code!

To solve the two issues you mentioned, I'm thinking of taking a different approach instead:

  • Add/remove attached-disk information in VSphere.AttachDisk, DetachDisk, and DisksAreAttached
  • The information consists of the node name and volume path, and is stored in the node manager
  • When VSphere.NodeDeleted is called, try to detach disks using that disk info, if any

How do you feel about this approach?
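A rough, self-contained sketch of the bookkeeping described above (the Register/UnRegister method names follow those mentioned later in the review, but the types here are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"sync"
)

// NodeManager is a stand-in for the VCP node manager, reduced to the
// attached-disk bookkeeping only.
type NodeManager struct {
	attachedDisks     map[string][]string // node name -> attached volume paths
	attachedDisksLock sync.Mutex
}

// RegisterAttachedDisk records a volume path for a node; it would be called
// from AttachDisk and refreshed from DisksAreAttached.
func (nm *NodeManager) RegisterAttachedDisk(nodeName, volPath string) {
	nm.attachedDisksLock.Lock()
	defer nm.attachedDisksLock.Unlock()
	for _, p := range nm.attachedDisks[nodeName] {
		if p == volPath {
			return // already recorded
		}
	}
	nm.attachedDisks[nodeName] = append(nm.attachedDisks[nodeName], volPath)
}

// UnRegisterAttachedDisk drops a volume path; it would be called from
// DetachDisk, and NodeDeleted would walk the remaining entries to detach
// whatever is still recorded for a removed node.
func (nm *NodeManager) UnRegisterAttachedDisk(nodeName, volPath string) {
	nm.attachedDisksLock.Lock()
	defer nm.attachedDisksLock.Unlock()
	paths := nm.attachedDisks[nodeName]
	for i, p := range paths {
		if p == volPath {
			nm.attachedDisks[nodeName] = append(paths[:i], paths[i+1:]...)
			break
		}
	}
	if len(nm.attachedDisks[nodeName]) == 0 {
		delete(nm.attachedDisks, nodeName)
	}
}

func main() {
	nm := &NodeManager{attachedDisks: make(map[string][]string)}
	nm.RegisterAttachedDisk("node-1", "[datastore1] kubevols/pv-0001.vmdk")
	nm.UnRegisterAttachedDisk("node-1", "[datastore1] kubevols/pv-0001.vmdk")
	fmt.Println(len(nm.attachedDisks)) // 0: empty entries are pruned
}
```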

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 11, 2018
@w-leads (Contributor, Author) commented May 11, 2018

@abrarshivani Would you please take another look at the code?
I reverted the previous commit and took a different approach, keeping attached-disk information instead of removed-node info.
To handle the controller-manager restart case, the attached-disk information is updated in VSphere.DisksAreAttached, which is called roughly every minute.
Thank you,

disks, err := vs.nodeManager.GetAttachedDisks(convertToK8sType(node.ObjectMeta.Name))
if err == nil {
for _, disk := range disks {
if err := vs.DetachDisk(disk, convertToK8sType(node.ObjectMeta.Name)); err != nil {
@abrarshivani (Contributor) commented May 11, 2018:

Here DetachDisk won't be retried if it fails (for example, because of a network issue reaching vCenter), and the volume will still be attached to the powered-off node. When Kubernetes retries detaching the disk from the deleted node, it will fail, since the node will already be unregistered here.

@abrarshivani (Contributor) commented May 11, 2018

@w-leads Thanks for fixing this. I think the right approach would be to report the NodeName as the Virtual Machine UUID. This way the vSphere Cloud Provider will not need to maintain the mapping of nodeName to v1.Node or keep the state of attached and detached disks.

@w-leads (Contributor, Author) commented May 11, 2018

@abrarshivani Thanks a lot for reviewing again!

So what you're saying is the following; is my understanding correct?

  • Register a mapping of nodeName to VM UUID (or maybe the NodeInfo struct?), as done in the 1st commit, but with the UUID instead
  • When k8s tries to detach disks from a powered-off node and gets errors, detach the disks using that info instead
  • Compare nodeVolumes with the info in VSphere.DisksAreAttached, and remove the info if there's a node that is not in nodeVolumes. Or maybe there's a better way to do this...?

In this case, I have no idea how to get the info again if the controller-manager is restarted after the node is powered off, because only VSphere.NodeAdded and NodeDeleted receive the v1.Node that contains the UUID. Please correct me if I'm wrong.

@abrarshivani (Contributor) commented:

@w-leads Currently, the vSphere Cloud Provider uses *v1.Node to get the VM UUID and discovers the nodes with that UUID. kubelet (if given vsphere as the cloud provider) gets the nodeName from the vSphere Cloud Provider. If vsphere.CurrentNodeName is updated to return the Virtual Machine UUID instead of the hostname, then the nodeName will be reported as the VM UUID. In this case, there is no need to store *v1.Node. Also, in the DetachDisk case, if a node is powered off and unregistered from the nodeManager, the node can still be discovered by its UUID.
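A toy sketch of the payoff being described, with the vCenter lookup faked (in the real provider the search would go through the vSphere API, and the function names here are illustrative only):

```go
package main

import (
	"errors"
	"fmt"
)

// findVMByUUID stands in for a vCenter search-by-UUID call; here it is just a
// fake inventory map of VM UUID -> inventory path.
func findVMByUUID(inventory map[string]string, uuid string) (string, error) {
	if path, ok := inventory[uuid]; ok {
		return path, nil
	}
	return "", errors.New("no VM found")
}

// detachDiskByUUID shows the idea: with the node name being the VM UUID, the
// disk can still be detached after the node manager has dropped the node,
// because the VM is looked up directly in vCenter by UUID.
func detachDiskByUUID(inventory map[string]string, volPath, nodeUUID string) error {
	vmPath, err := findVMByUUID(inventory, nodeUUID)
	if err != nil {
		return err
	}
	fmt.Printf("detaching %s from %s\n", volPath, vmPath)
	return nil
}

func main() {
	inventory := map[string]string{
		// powered off, but still present in the vCenter inventory
		"4237a0a0-1111-2222-3333-444455556666": "/dc1/vm/worker-1",
	}
	_ = detachDiskByUUID(inventory, "[datastore1] kubevols/pv-0001.vmdk",
		"4237a0a0-1111-2222-3333-444455556666")
}
```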

@w-leads (Contributor, Author) commented May 14, 2018

@abrarshivani Thanks for the info. That makes sense; the code will be simpler with the VM UUID.
I suppose that change would affect many functions and make this pull request much bigger, so I'm wondering if we could fix this bug by keeping the attached-disk information for now. Once VCP is modified to use the VM UUID for nodeName instead of the hostname, we can then update the change from this pull request appropriately as well.

@w-leads force-pushed the bugfix/vsphere-detach-from-removed-node branch 3 times, most recently from 5ec4340 to 9c3b9cc on May 14, 2018 at 23:01
@w-leads (Contributor, Author) commented May 14, 2018

@abrarshivani I made a commit, and now detaching disks from a removed node will be retried when it fails. If this attached-disks-info method is okay for now, would you please take another look?

@abrarshivani (Contributor) commented:

@w-leads Sounds good to me. Will take another look. Can you please create an issue to track Virtual Machine UUID as Nodename approach?

if len(nm.attachedDisks[convertToString(nodeName)]) == 0 {
delete(nm.attachedDisks, convertToString(nodeName))
}
nm.attachedDisksLock.Unlock()
Review comment (Contributor):

I would suggest changing this to
defer nm.attachedDisksLock.Unlock()

func (nm *NodeManager) GetAttachedDisks(nodeName k8stypes.NodeName) []string {
nm.attachedDisksLock.Lock()
attachedDisks := nm.attachedDisks[convertToString(nodeName)]
nm.attachedDisksLock.Unlock()
Review comment (Contributor):

I think attachedDisks needs to be protected with locks, since it is the same object that will be accessed by RegisterAttachedDisk and UnRegisterAttachedDisk.
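One possible shape for this, assuming the concern is the returned slice aliasing the map entry that Register/UnRegisterAttachedDisk mutate (this continues the hypothetical NodeManager sketch from the earlier comment, not the PR's actual code):

```go
// GetAttachedDisks holds the lock for the whole call (released via defer, as
// suggested above) and returns a copy, so callers never share the slice that
// RegisterAttachedDisk and UnRegisterAttachedDisk mutate concurrently.
func (nm *NodeManager) GetAttachedDisks(nodeName string) []string {
	nm.attachedDisksLock.Lock()
	defer nm.attachedDisksLock.Unlock()
	return append([]string(nil), nm.attachedDisks[nodeName]...)
}
```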

if err != nil {
return nil, err
}
func (vs *VSphere) convertVolPathsToDevicePaths(ctx context.Context, nodeName k8stypes.NodeName, volPaths []string) ([]string, error) {
Review comment (Contributor):

Can we go with the earlier way, where we accept nodeVolumes here? The vclib.ErrNoVMFound error can be handled here.

Reply from author (Contributor):

I suppose this change is necessary.
Before this change, the function received the volumes on all nodes, and if any of the nodes was not found, the function and VSphere.DisksAreAttached returned an error. However, the attached-disks information needs to be re-registered in DisksAreAttached in case of a controller-manager restart. That's why I changed this function to handle the volumes on a single node, so that DisksAreAttached can simply ignore removed nodes instead of returning an error before re-registration.
Also, DisksAreAttached has a comment that says "// 5b. If VMs are removed from vSphere inventory they are ignored.", so I suppose DisksAreAttached shouldn't return an error at this point.
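A simplified standalone sketch of that per-node behaviour (illustrative names only, not the PR's actual convertVolPathsToDevicePaths or DisksAreAttached):

```go
package main

import (
	"errors"
	"fmt"
)

var errNoVMFound = errors.New("no VM found") // stand-in for vclib.ErrNoVMFound

// convertVolPathsForNode stands in for the per-node conversion; the real
// function talks to vCenter and can fail with ErrNoVMFound.
func convertVolPathsForNode(known map[string]bool, nodeName string, volPaths []string) ([]string, error) {
	if !known[nodeName] {
		return nil, errNoVMFound
	}
	return volPaths, nil
}

// disksAreAttached checks nodes one by one; a node missing from the vCenter
// inventory is skipped rather than failing the whole call, matching the
// "removed VMs are ignored" comment quoted above.
func disksAreAttached(known map[string]bool, nodeVolumes map[string][]string) (map[string]map[string]bool, error) {
	attached := make(map[string]map[string]bool)
	for nodeName, volPaths := range nodeVolumes {
		attached[nodeName] = make(map[string]bool)
		converted, err := convertVolPathsForNode(known, nodeName, volPaths)
		if err == errNoVMFound {
			continue // removed node: leave its volumes reported as not attached
		}
		if err != nil {
			return nil, err // other errors are still propagated
		}
		for _, p := range converted {
			attached[nodeName][p] = true
		}
	}
	return attached, nil
}

func main() {
	known := map[string]bool{"node-1": true} // node-2 has been removed from inventory
	nodeVolumes := map[string][]string{
		"node-1": {"[ds1] kubevols/pv-1.vmdk"},
		"node-2": {"[ds1] kubevols/pv-2.vmdk"},
	}
	attached, err := disksAreAttached(known, nodeVolumes)
	fmt.Println(attached, err)
}
```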

w-leads added 9 commits July 23, 2018 14:32
VCP fails to detach disk from powered-off node which is still in
vcenter inventory. That's caused because VCP tries to detach disks
after nodemanager has unregistered the node information.
This reverts commit 680ab089ce491f27c3602202761d2dd2625bec90.
This reverts commit 0738258143024a28917c4626f6474ffbee552a55.
This reverts commit eea9efdd61a25c3e74dbc54e11aae711ca499c85.
VCP fails to detach disk from powered-off node which is still in
vcenter inventory. That's caused because VCP tries to detach disks
after nodemanager has unregistered the node information.
@w-leads force-pushed the bugfix/vsphere-detach-from-removed-node branch from 673f44d to 7e323cf on July 23, 2018 at 21:43
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: w-leads
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: abrarshivani

If they are not already assigned, you can assign the PR to them by writing /assign @abrarshivani in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor) commented:

@w-leads: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 3, 2018
@@ -233,6 +236,19 @@ func init() {

// Initialize passes a Kubernetes clientBuilder interface to the cloud provider
func (vs *VSphere) Initialize(clientBuilder controller.ControllerClientBuilder) {
Review comment (Contributor):

Is this method getting invoked in kubelet? If not, then we should fail the initialization if vs.informerFactory is not set.

Can you confirm that Initialize is invoked after SetInformers?

You are also introducing a vsphere-cloud-provider service account; wouldn't this break Kubernetes upgrades? How do you plan to address upgrades?

Review comment (Member):

I think we will have to install the corresponding RBAC policies as defaults.

Review comment (Member):

This does not get called from kubelet. Also, Initialize is called after SetInformers, so I think we should be safe. Having said that, I think the code should also work when it can't find the configmap, and revert to the old behaviour in that case.
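A small standalone sketch of that fallback, with illustrative names (the real code would check a NotFound error from the API server, e.g. via apierrors.IsNotFound, rather than a local sentinel):

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("configmap not found")

// loadVMUUIDCache tries to seed the nodeName -> VM UUID cache from a
// configmap; if the configmap cannot be found (missing, or RBAC not yet in
// place), it returns an empty cache and no error, i.e. the old behaviour.
func loadVMUUIDCache(getConfigMap func() (map[string]string, error)) (map[string]string, error) {
	data, err := getConfigMap()
	if err == errNotFound {
		return make(map[string]string), nil // fall back: no persisted mapping yet
	}
	if err != nil {
		return nil, err
	}
	if data == nil {
		data = make(map[string]string)
	}
	return data, nil
}

func main() {
	cache, err := loadVMUUIDCache(func() (map[string]string, error) { return nil, errNotFound })
	fmt.Println(len(cache), err) // 0 <nil>: initialization still succeeds
}
```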

return nil, err
}
if configMap.Data == nil {
nm.vmUUIDConfigMapCache = make(map[string]string)
Review comment (Contributor):

Granting configmap read and write permissions to the vSphere Cloud Provider after the k8s cluster has been launched will make the volume operations fail. This happens because node events occur before the permissions are granted, so the nodeName-to-vmUUID mapping is not cached at that time. Can you fix this?

Review comment (Member):

Installing the RBAC permissions via bootstrap policies should fix that, right?

@gnufied (Member) commented Oct 10, 2018

@w-leads does this PR really need to be against master? IIRC master isn't affected by this bug.

@fejta-bot commented:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2019
@fejta-bot commented:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 7, 2019
@fejta-bot commented:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor) commented:

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/provider/vmware - Issues or PRs related to vmware provider
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase - Indicates a PR cannot be merged because it has merge conflicts with HEAD.
release-note-none - Denotes a PR that doesn't merit a release note.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
10 participants