Fix logic error in graceful deletion #37721
I need to write a test case for this tomorrow, but this came up in a scenario where we create hundreds of namespaces, each with 100 pods, and then delete all namespaces en masse. With about 10k pods, we would end up with ~70 pods that could never be deleted. Inspection of the state showed that each stuck pod had gracePeriodSeconds=0 and deletionTimestamp=some_value. If a pod was in this state, forceful deletion would never occur. Why the resource ended up in this state needs a separate investigation that has yet to happen, but it appears that when a kubelet did a force-deletion, the deletion was given an OK response, yet the object was not truly removed and instead had its local state updated as above. Once this happened, the pod could never be removed without direct access to etcd. I think this is a release-blocker for 1.5, and I think I will need to cherry-pick it to 1.4.x as well. /cc @smarterclayton @ingvagabund @sjenning @caesarxuchao @deads2k @saad-ali @eparis
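For anyone trying to spot affected objects, the stuck state described above can be sketched as a simple predicate. This is only an illustration: `objectMeta`, `at`, `secs`, and `isStuckTerminating` are hypothetical names, and the struct is a trimmed-down stand-in rather than the real API type.

```go
package main

import "time"

// objectMeta is a hypothetical, trimmed-down stand-in for the two metadata
// fields discussed above. It is not the real Kubernetes type.
type objectMeta struct {
	DeletionTimestamp  *time.Time
	GracePeriodSeconds *int64
}

// at and secs are small illustrative helpers for building pointer values.
func at(unix int64) *time.Time { t := time.Unix(unix, 0); return &t }
func secs(n int64) *int64      { return &n }

// isStuckTerminating reports whether the object is in the state seen on the
// undeletable pods: a deletion timestamp is set and the grace period is 0.
func isStuckTerminating(m objectMeta) bool {
	return m.DeletionTimestamp != nil &&
		m.GracePeriodSeconds != nil &&
		*m.GracePeriodSeconds == 0
}
```

An object with a deletion timestamp but a non-zero grace period is still in an ordinary graceful termination, so the predicate deliberately returns false for it.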
Logically this PR makes sense; it should take only a small change to pkg/api/rest/resttest to verify it. In 1.7, once we move to etcd3, I will open a follow-up item to set the grace period to 0 and then delete the pod in a single transaction, which should avoid the potential race error here (we want consumers to see 0 and observe the deletion as discrete steps).
Marking do-not-merge pending unit tests
Jenkins GCE e2e failed for commit 1ec5411. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE etcd3 e2e failed for commit 1ec5411. Full PR test history. The magic incantation to run this job again is |
1ec5411 to 1473084
@smarterclayton -- added a test case. This happened because, if the update succeeded and the delete then failed (due to etcd being temporarily unreachable, or having an internal error), we would find ourselves in a non-recoverable state.
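The failure sequence above can be reproduced in miniature with an in-memory store whose delete step can be made to fail. This is a rough sketch under stated assumptions, not the registry code: `pod`, `fakeStore`, `errUnavailable`, and `forceDelete` are all hypothetical names, and `deleting` stands in for a non-nil deletionTimestamp.

```go
package main

import "errors"

// pod is a hypothetical, minimal record; only the two fields from the
// discussion are modelled.
type pod struct {
	deleting     bool  // stands in for deletionTimestamp being set
	graceSeconds int64 // stands in for gracePeriodSeconds
}

// fakeStore is an illustrative in-memory store whose remove step can be
// forced to fail, mimicking etcd being temporarily unreachable.
type fakeStore struct {
	pods       map[string]*pod
	failDelete bool
}

var errUnavailable = errors.New("storage temporarily unreachable")

// update records the deletion intent on the stored object.
func (s *fakeStore) update(name string, grace int64) {
	if p, ok := s.pods[name]; ok {
		p.deleting = true
		p.graceSeconds = grace
	}
}

// remove deletes the object unless the store is simulating an outage.
func (s *fakeStore) remove(name string) error {
	if s.failDelete {
		return errUnavailable
	}
	delete(s.pods, name)
	return nil
}

// forceDelete performs the two discrete steps: persist grace period 0,
// then remove the object. A failure between them leaves the object behind
// with deleting=true and graceSeconds=0.
func forceDelete(s *fakeStore, name string) error {
	s.update(name, 0)
	return s.remove(name)
}
```

After a failed `forceDelete`, the object is still present with the deletion markers set, which is the state a later delete request has to be able to recover from.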
Jenkins GCI GCE e2e failed for commit 1473084. Full PR test history. The magic incantation to run this job again is |
@k8s-bot gci gce e2e test this
Automatic merge from submit-queue
Automatic merge from submit-queue. Automated cherry pick of #35272 #37834 #37723 #37668 #37721 #37381 #37944 #37997 #37939 #37990 upstream release 1.5. Batch cherry pick of PRs #35272 #37834 #37723 #37668 #37721 #37381 #37944 #37997 #37939 #37990 from master to the release-1.5 branch. PR #37997 had merge conflicts that needed to be resolved (due to large PRs that merged to master but not 1.5, see this for details). CC PR Authors: @yarntime @ixdy @mtaufen @ymqytw @derekwaynecarr @jszczepkowski @Kargakis @foxish @jingxu97
Commit found in the "release-1.5" branch appears to be this PR. Removing the "cherrypick-candidate" label. If this is an error find help to get your PR picked. |
If a resource has a deletionTimestamp set and gracePeriodSeconds=0 (the state described in the conversation above), the resource could never be deleted, as we always returned pending graceful.
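The shape of the logic error can be sketched as follows. This is a simplified model and not the code from the PR: it assumes the earlier check treated any request whose grace period was not strictly shorter than the stored one as still pending, and every name here (`deletionState`, `pendingGracefulBefore`, `pendingGracefulAfter`) is hypothetical.

```go
package main

// deletionState is a hypothetical, flattened view of the two fields involved.
type deletionState struct {
	Deleting     bool  // stands in for deletionTimestamp being set
	GraceSeconds int64 // stands in for the stored grace period
}

// pendingGracefulBefore models the assumed pre-fix behaviour: a delete on an
// already-terminating object is reported as "pending graceful" unless the
// requested period is strictly shorter than the stored one. With a stored
// period of 0 nothing can be shorter, so the answer is always "pending".
func pendingGracefulBefore(s deletionState, requested int64) bool {
	if !s.Deleting {
		return false
	}
	return requested >= s.GraceSeconds
}

// pendingGracefulAfter adds the missing case: a stored grace period of 0
// means the object is due for immediate deletion, so it is never pending.
func pendingGracefulAfter(s deletionState, requested int64) bool {
	if !s.Deleting {
		return false
	}
	if s.GraceSeconds == 0 {
		return false
	}
	return requested >= s.GraceSeconds
}
```

Under this model, a request with grace period 0 against a stored period of 0 is reported as pending by the first function and as deletable by the second, while objects with a non-zero stored period behave the same in both.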