-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix garbage collector when leader-elect=false #57340
Conversation
**What this PR does / why we need it**: In a 1.8.x master with --leader-elect=false, the garbage collector controller does not work. When deleting a deployment with v1meta.DeletePropagationForeground, the deployment had its deletionTimestamp set and a foreground Deletion finalizer was added, but the deployment, rs and pod were not deleted. This is an issue with how the garbage collector graph_builder behaves when the stopCh=nil. This PR creates a dummy stop channel for the garbage collector controller (and other controllers started by the controller-manager) so that they can work more like they do when when the controller-manager is configured with --leader-elect=true. **Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes kubernetes#57044 **Special notes for your reviewer**: **Release note**: <!-- Write your release note: 1. Enter your extended release note in the below block. If the PR requires additional action from users switching to the new release, include the string "action required". 2. If no release note is required, just write "NONE". --> ```release-note Garbage collection doesn't work when the controller-manager uses --leader-elect=false ```
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Verified CLA, hope that kicked something. |
/ok-to-test |
@jmcmeek Can you update the release note to something like
|
@cmluciano I have updated the release note as you requested. |
/test pull-kubernetes-unit |
/assign @deads2k @caesarxuchao |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, jmcmeek Associated issue: #57044 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
/retest |
/retest Review the full test history for this PR. |
@jmcmeek: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
@deads2k Can this be cherry picked into 1.8 and 1.9? I don't seem to be able to add labels (or don't know how to). |
run(nil) | ||
stopCh := make(chan struct{}) | ||
defer close(stopCh) | ||
run(stopCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused how this changes any behavior. The defer close(stopCh)
should never run, since run shouldn't return until stopCh is closed. Receiving from a nil channel blocks forever, which would behave the same as receiving from an unclosed stopCh. Was something in run() explicitly checking if the passed channel was nil?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With stopCh = nil, the graph_builder StartMonitors function exits without starting the monitors and no garbage collection happens. See https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/controller/garbagecollector/graph_builder.go#L273
I don't know if creating a dummy stopCh is the best solution, but it let things follow the same path that is taken with leader-elect=true, which seems to be better tested.
This is my only foray into go, but it does seem unlikely that the deferred close serves any purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt See my response above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, then it seems better to fix the garbage collector specific assumptions in https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/controller/garbagecollector/graph_builder.go#L85 by setting up a non-nil dummy stopCh in GraphBuilder#Run if a nil stopCh is passed in. otherwise we still have the possibility of other callers making the same mistake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt If there was some other caller of GraphBuilder#Run, is it safe to assume that we should create a dummy stop channel if one is not provided? I think it would be safer to require / expect that a stop channel be provided and maybe GraphBuilder#Run should fail in some fashion if one is not provided. Other comments in graph_builder suggest that a stop channel is expected.
Maybe GarbageCollector#Run (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/garbagecollector/garbagecollector.go#L129) or startGarbageCollectorController (https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/core.go#L344) would be better places.
Hopefully other controllers that use the stop channel do not need it to be set.
And not seeing a better communication channel... I am getting in well past my comfort level with go and internals of Kubernetes. If this is not the right way or place to fix this, I would prefer to leave this to others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a stop channel should really only be used for one thing... stopping the controller when a receive from that channel completes. receiving from a nil channel blocks forever, so from the top-level, passing nil should be valid.
it looks like the garbage collector and resource quota controllers are using the given channel for two purposes... as a stop channel, and as a signal that Run() has been called. since they're the ones using it in non-standard ways, I'd prefer a fix localized in those controllers, like #58306
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
…#58306-upstream-release-1.8 Automatic merge from submit-queue. Automated cherry pick of #57340: Fix garbage collector when leader-elect=false Cherry pick of #57340 #58306 on release-1.8. #57340: Fix garbage collector when leader-elect=false #58306: Track run status explicitly rather than non-nil check on ```release-note Fix garbage collection when the controller-manager uses --leader-elect=false ```
…#58306-upstream-release-1.9 Automatic merge from submit-queue. Automated cherry pick of #57340: Fix garbage collector when leader-elect=false Cherry pick of #57340 #58306 on release-1.9. #57340: Fix garbage collector when leader-elect=false #58306: Track run status explicitly rather than non-nil check on ```release-note Fix garbage collection and resource quota when the controller-manager uses --leader-elect=false ```
What this PR does / why we need it:
In a 1.8.x master with --leader-elect=false, the garbage collector controller
does not work.
When deleting a deployment with v1meta.DeletePropagationForeground, the deployment
had its deletionTimestamp set and a foreground Deletion finalizer was added,
but the deployment, rs and pod were not deleted.
This is an issue with how the garbage collector graph_builder behaves when the
stopCh=nil. This PR creates a dummy stop channel for the garbage collector controller (and other
controllers started by the controller-manager) so that they can work more like they do when
when the controller-manager is configured with --leader-elect=true.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #57044
Special notes for your reviewer:
Release note: