-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066 #68056
Automated cherry pick of #67825: Fix VMWare VM freezing bug by reverting #51066 #68056
Conversation
/assign @mbohlool |
/ok-to-test |
/assign @divyenpatel someone more familiar with vsphere cloudprovider should approve these for cherry-picks. |
is it possible to add a test for this? |
@mbohlool if you mean a repro, it has been tested here: #67825 (comment) @divyenpatel @gnufied I believe consensus has to be reached for #65392 before merging this PR to 1.9, right? |
Created Deployment.
Pod
Powered Off Node
Pod remains in the ContainerCreating state for about 6 minutes. Observed following errors.
After 6 minutes, Pod state changed to running and volumes successfully attached to the new node
But Volumes never detached from the previous node To resolve above behavior we will either need to keep powered off node registered with the API server, or fix cloud provider with #63413 so that volumes gets detached from the powered off node which is also removed from API server. cc: @SandeepPissay |
when
kubernetes-node4 becomes
To resolve this we will need to force detach volumes from the pod with the Unknown state to allow attaching them to new node with the change made in #67847 |
@divyenpatel so in a nutshell this PR only prevents attach to another node for first 6 minutes. I think what might be happening is, I frankly do not see a point in backporting this PR to older releases like 1.9, until we fix real detach issue. Volume is still being attached to multiple nodes, so why bother? |
/hold |
@divyenpatel thanks for your repro work. @gnufied the PR fixes vmware-archive#500 / vmware-archive#502, and backporting it fixes it in previous versions as well. It was a custom behaviour for Vsphere on the attachdetach reconciler that forced multi-attach. It's a showstopper for us and we're running 1.9.10, thus the upstream backport. Many shops running older versions would happily bump the minor version instead of doing major upgrades. Overall, it seems that there are many open PRs that solve the above problems with broken & shutdown nodes and volumes, but there seems to be a stalemate with that holds the PRs because of the discussion for the shutdown taint in #65392. When the deadlock is solved and solutions are merged (and backported), this backport should be good to go. |
@nikopen how does this PR prevents multiattach? didn't @divyenpatel just proved we are still multiattaching? am i missing something. So if this PR does not fixes multiattach, I would like to know what does it actually fixes and why should we backport it? |
I understand validity of this PR on master branch btw. I know why we need there. On Master branch, this PR indeed prevents multiattach but on 1.9 branch, afaict this PR does not have much effect beyond delaying attach to another node for 6 minutes. |
As described on DM, this is the VM freeze: vmware-archive#500 (comment), step 4. The observation is that k8s is aggressively attempting to attach a disk to another node without first triggering the detach from the originating VM, causing the new VM to completely freeze as long as the volume is 'unavailable'/stuck elsewhere, starting from 1 minute and up to 3 mins or more. It's the main purpose of the original PR, as it fixes the freeze. Backporting to all versions affected makes sense to me. |
I agree with @nikopen. We should allow this change to be cherry-picked to other branches. At least when kubelet is not responding or not able to communicate with API server, instead of continuous fail attempts to attach volumes on the new node, Without this change, we will see lots of failure messages on the vSphere task console for failed attempt to attach hot disk on the new node. We can also avoid freezing node VM, which can disturb existing workload running on that node. |
Thanks @divyenpatel ! |
okay, I think this issue still does not handle real problem of actually detaching volumes from shutdown/deleted pods in 1.9 but we can merge this for now. /hold cancel |
/kind bug |
/milestone v1.9 |
/approve |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: childsb, nikopen The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Review the full test history for this PR. Silence the bot with an |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. |
Cherry pick of #67825 on release-1.9.
#67825: Fix VMWare VM freezing bug by reverting #51066