
Fix "PVC Volume not detached if pod deleted via namespace deletion" issue #29077

Merged

Conversation

@saad-ali (Member) commented Jul 17, 2016

Fixes #29051: "PVC Volume not detached if pod deleted via namespace deletion"

This PR:

  • Fixes a bug in desired_state_of_the_world_populator.go so that it checks the exists value returned by the podInformer and can therefore delete pods even if the delete event is missed (or fails); see the sketch after this list.
  • Reduces the desired state of the world populator's sleep period from 5 min to 1 min, reducing the amount of time a volume would remain attached if a delete event is missed or fails.
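
A minimal, self-contained sketch of the exists check described above (not the controller's actual code; podStore, fakeStore, and the simplified findAndRemoveDeletedPods below are hypothetical stand-ins for the populator and the pod informer's store):

    package main

    import "fmt"

    // podStore mimics the GetByKey contract of the pod informer's store:
    // it returns the object, whether the key exists, and an error.
    type podStore interface {
        GetByKey(key string) (obj interface{}, exists bool, err error)
    }

    // fakeStore is a trivial in-memory store for the example.
    type fakeStore map[string]interface{}

    func (s fakeStore) GetByKey(key string) (interface{}, bool, error) {
        obj, ok := s[key]
        return obj, ok, nil
    }

    // findAndRemoveDeletedPods sketches the pruning pass: any pod key tracked
    // in the desired state that the informer no longer knows about
    // (exists == false) is removed, so a missed delete event does not leave
    // the volume attached indefinitely.
    func findAndRemoveDeletedPods(desired map[string]bool, store podStore) {
        for podKey := range desired {
            obj, exists, err := store.GetByKey(podKey)
            if err != nil || (exists && obj == nil) {
                // Informer lookup failed; try again on the next pass.
                fmt.Printf("GetByKey failed for %q: %v\n", podKey, err)
                continue
            }
            if exists {
                // Pod is still present; keep its volumes in the desired state.
                continue
            }
            // The pod is gone from the informer (its delete event was missed
            // or dropped): remove it so its volumes can be detached.
            fmt.Printf("pruning deleted pod %q from desired state\n", podKey)
            delete(desired, podKey)
        }
    }

    func main() {
        desired := map[string]bool{"ns1/running-pod": true, "ns1/deleted-pod": true}
        store := fakeStore{"ns1/running-pod": "pod object"}
        findAndRemoveDeletedPods(desired, store)
        fmt.Println(desired) // only ns1/running-pod remains
    }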

@saad-ali added the release-note and cherrypick-candidate labels Jul 17, 2016
@saad-ali added this to the v1.3 milestone Jul 17, 2016
@saad-ali self-assigned this Jul 17, 2016
@k8s-github-robot added the size/XS label Jul 17, 2016
@sijnc commented Jul 18, 2016

I believe this:

"Modify attach/detach controller to cache PVC/PV for each volume so even if these objects are deleted, pod deletion events can be processed without error, resulting in even faster detaches.
Add a test to verify this scenario (PVC volumes are detached on namespace deletion) and prevent future regressions."

...has been causing us a lot of pain using PDs on GKE. We have a script that launches our deployments, PVs, PVCs, etc. It creates the PV/PVC first and then the deployment; however, the script that tears down the deployment deletes the PV/PVC before deleting the deployment. We never delete the namespace, and we deploy into a non-default namespace.

We randomly see PDs remaining attached for long periods after deletion; they usually detach after 5-15 minutes. We have a GKE engineer looking at the cluster, so if you want any details about the findings let me know.

We have another issue (which may not be related) where deployment times start at 5-10 seconds on a fresh cluster, and then after a few days or a week they climb to 1-2 minutes and never improve. If we delete the cluster and re-create it we get back to 5-10 seconds, but since it's GKE that's our only option for a fix. Not sure if this is related, but it seems like the PD issues don't start until the cluster starts doing slow deployments, so it might be worth looking into.

If this is the issue we've been seeing on GKE, it's been happening over the last couple of months and across almost every new version released during that time.

@sijnc commented Jul 18, 2016

@saad-ali: If you're interested, the Google GKE support case is 09807892.

@@ -84,7 +84,7 @@ func (dswp *desiredStateOfWorldPopulator) findAndRemoveDeletedPods() {
 	}
 	// retrieve the pod object from pod informer with the namespace key
 	informerPodObj, exists, err := dswp.podInformer.GetStore().GetByKey(dswPodKey)
-	if err != nil || informerPodObj == nil {
+	if err != nil || (exists && informerPodObj == nil) {
A reviewer (Contributor) commented on this line:
@saad-ali Do we want to make this check more descriptive? Is one case worth logging more than the other?

Trying to understand:

    if err != nil {
        glog.Errorf("podInformer GetByKey failed with an error: %v", err)
        continue
    }

    if exists && informerPodObj == nil {
        glog.Infof("podInformer GetByKey found pod %q that is already deleted", dswPodKey)
        continue
    }

saad-ali added 2 commits July 20, 2016 01:03
Fix bug in desired_state_of_the_world_populator.go to check exists so
that it can delete pods even if the delete event is missed (or fails)
@saad-ali force-pushed the fixIssue29051NamespaceDeletion branch from 0fd5e1f to afd8a58 on July 20, 2016 at 08:03
@saad-ali changed the title from '[WIP] Fix "PVC Volume not detached if pod deleted via namespace deletion" issue' to 'Fix "PVC Volume not detached if pod deleted via namespace deletion" issue' Jul 20, 2016
@saad-ali (Member, Author) commented:
@matchstick PTAL

Will do the following in a follow-up (#29262):

  • Modify attach/detach controller to cache PVC/PV for each volume so even if these objects are deleted, pod deletion events can be processed without error, resulting in even faster detaches (a rough sketch of this idea follows below).
  • Add a test to verify this scenario (PVC volumes are detached on namespace deletion) and prevent future regressions.
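
A rough, hypothetical sketch of the first follow-up item, caching the PVC/PV that back each volume so a detach can still be resolved after those API objects are deleted; volumeSpecCache and pvcPVPair are illustrative names, not the controller's actual types:

    package main

    import (
        "fmt"
        "sync"
    )

    // pvcPVPair records the identity of the objects that backed a volume
    // at the time it was added to the desired state of the world.
    type pvcPVPair struct {
        PVCName string // namespace/name of the claim
        PVName  string // name of the bound persistent volume
    }

    // volumeSpecCache remembers the PVC/PV pair per unique volume name.
    type volumeSpecCache struct {
        mu    sync.RWMutex
        specs map[string]pvcPVPair
    }

    func newVolumeSpecCache() *volumeSpecCache {
        return &volumeSpecCache{specs: map[string]pvcPVPair{}}
    }

    // Remember is called while the PVC and PV still exist.
    func (c *volumeSpecCache) Remember(volName string, pair pvcPVPair) {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.specs[volName] = pair
    }

    // Lookup is used when processing a pod deletion: even if the PVC/PV have
    // already been deleted from the API server, the cached pair is enough to
    // issue the detach without an error.
    func (c *volumeSpecCache) Lookup(volName string) (pvcPVPair, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        p, ok := c.specs[volName]
        return p, ok
    }

    func main() {
        cache := newVolumeSpecCache()
        cache.Remember("kubernetes.io/gce-pd/my-disk", pvcPVPair{PVCName: "ns1/data", PVName: "pv-123"})
        if pair, ok := cache.Lookup("kubernetes.io/gce-pd/my-disk"); ok {
            fmt.Printf("detach using cached objects: pvc=%s pv=%s\n", pair.PVCName, pair.PVName)
        }
    }

In this sketch the populator would call Remember when a volume enters the desired state, and the reconciler would fall back to Lookup when the live PVC/PV can no longer be fetched.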

@k8s-github-robot added the size/S label and removed the size/XS label Jul 20, 2016
@saad-ali (Member, Author) commented:
Opened #29311 to track improvements to this code.

@matchstick added the lgtm label Jul 20, 2016
@matchstick (Contributor) commented:
@saad-ali Giving a LGTM for 1.3 but we really need something better for 1.4 as this code made us bump heads several times.

@k8s-bot commented Jul 20, 2016

GCE e2e build/test passed for commit afd8a58.

@k8s-github-robot commented:
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-bot commented Jul 21, 2016

GCE e2e build/test passed for commit afd8a58.

@k8s-github-robot commented:
Automatic merge from submit-queue

@k8s-github-robot merged commit 99e24da into kubernetes:master Jul 21, 2016
@fabioy added the cherry-pick-approved label Jul 24, 2016
fabioy added a commit that referenced this pull request Jul 26, 2016
…77-upstream-release-1.3

Automated cherry pick of #29077 upstream release 1.3
@saad-ali deleted the fixIssue29051NamespaceDeletion branch August 15, 2016 22:21
shyamjvs pushed a commit to shyamjvs/kubernetes that referenced this pull request Dec 1, 2016
…ck-of-#29077-upstream-release-1.3

Automated cherry pick of kubernetes#29077 upstream release 1.3