Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix garbage collector when leader-elect=false #57340

Merged
merged 1 commit into from
Jan 2, 2018

Conversation

jmcmeek
Copy link
Contributor

@jmcmeek jmcmeek commented Dec 18, 2017

What this PR does / why we need it:

In a 1.8.x master with --leader-elect=false, the garbage collector controller
does not work.

When deleting a deployment with v1meta.DeletePropagationForeground, the deployment
had its deletionTimestamp set and a foreground Deletion finalizer was added,
but the deployment, rs and pod were not deleted.

This is an issue with how the garbage collector graph_builder behaves when the
stopCh=nil. This PR creates a dummy stop channel for the garbage collector controller (and other
controllers started by the controller-manager) so that they can work more like they do when
when the controller-manager is configured with --leader-elect=true.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #57044

Special notes for your reviewer:

Release note:

Fix garbage collection and resource quota when the controller-manager uses --leader-elect=false

**What this PR does / why we need it**:

In a 1.8.x master with --leader-elect=false, the garbage collector controller
does not work.

When deleting a deployment with v1meta.DeletePropagationForeground, the deployment
had its deletionTimestamp set and a foreground Deletion finalizer was added,
but the deployment, rs and pod were not deleted.

This is an issue with how the garbage collector graph_builder behaves when the
stopCh=nil.  This PR creates a dummy stop channel for the garbage collector controller (and other
controllers started by the controller-manager) so that they can work more like they do when
when the controller-manager is configured with --leader-elect=true.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes kubernetes#57044

**Special notes for your reviewer**:

**Release note**:
<!--  Write your release note:
1. Enter your extended release note in the below block. If the PR requires additional action from users switching to the new release, include the string "action required".
2. If no release note is required, just write "NONE".
-->
```release-note
Garbage collection doesn't work when the controller-manager uses --leader-elect=false

```
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Dec 18, 2017
@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://github.com/kubernetes/kubernetes/wiki/CLA-FAQ to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 18, 2017
@jmcmeek
Copy link
Contributor Author

jmcmeek commented Dec 18, 2017

Verified CLA, hope that kicked something.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Dec 18, 2017
@dims
Copy link
Member

dims commented Dec 18, 2017

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 18, 2017
@cmluciano
Copy link

@jmcmeek Can you update the release note to something like

Fix garbage collection when the controller-manager uses --leader-elect=false

@jmcmeek
Copy link
Contributor Author

jmcmeek commented Dec 19, 2017

@cmluciano I have updated the release note as you requested.

@jmcmeek
Copy link
Contributor Author

jmcmeek commented Dec 19, 2017

/test pull-kubernetes-unit

@cmluciano
Copy link

/assign @deads2k @caesarxuchao

@deads2k
Copy link
Contributor

deads2k commented Jan 2, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 2, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jmcmeek

Associated issue: #57044

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 2, 2018
@jmcmeek
Copy link
Contributor Author

jmcmeek commented Jan 2, 2018

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Jan 2, 2018

@jmcmeek: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-device-plugin-gpu 880a68a link /test pull-kubernetes-e2e-gce-device-plugin-gpu

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit f5d5d18 into kubernetes:master Jan 2, 2018
@jmcmeek jmcmeek deleted the jmcmeek_57044 branch January 5, 2018 20:36
@jmcmeek
Copy link
Contributor Author

jmcmeek commented Jan 5, 2018

@deads2k Can this be cherry picked into 1.8 and 1.9?

I don't seem to be able to add labels (or don't know how to).

@cmluciano
Copy link

@deads2k @lavalamp @liggitt Any input on if we can cherry-pick this ^

run(nil)
stopCh := make(chan struct{})
defer close(stopCh)
run(stopCh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused how this changes any behavior. The defer close(stopCh) should never run, since run shouldn't return until stopCh is closed. Receiving from a nil channel blocks forever, which would behave the same as receiving from an unclosed stopCh. Was something in run() explicitly checking if the passed channel was nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With stopCh = nil, the graph_builder StartMonitors function exits without starting the monitors and no garbage collection happens. See https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/controller/garbagecollector/graph_builder.go#L273

I don't know if creating a dummy stopCh is the best solution, but it let things follow the same path that is taken with leader-elect=true, which seems to be better tested.

This is my only foray into go, but it does seem unlikely that the deferred close serves any purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt See my response above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, then it seems better to fix the garbage collector specific assumptions in https://github.com/kubernetes/kubernetes/blob/release-1.8/pkg/controller/garbagecollector/graph_builder.go#L85 by setting up a non-nil dummy stopCh in GraphBuilder#Run if a nil stopCh is passed in. otherwise we still have the possibility of other callers making the same mistake

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt If there was some other caller of GraphBuilder#Run, is it safe to assume that we should create a dummy stop channel if one is not provided? I think it would be safer to require / expect that a stop channel be provided and maybe GraphBuilder#Run should fail in some fashion if one is not provided. Other comments in graph_builder suggest that a stop channel is expected.

Maybe GarbageCollector#Run (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/garbagecollector/garbagecollector.go#L129) or startGarbageCollectorController (https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/core.go#L344) would be better places.

Hopefully other controllers that use the stop channel do not need it to be set.

And not seeing a better communication channel... I am getting in well past my comfort level with go and internals of Kubernetes. If this is not the right way or place to fix this, I would prefer to leave this to others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a stop channel should really only be used for one thing... stopping the controller when a receive from that channel completes. receiving from a nil channel blocks forever, so from the top-level, passing nil should be valid.

it looks like the garbage collector and resource quota controllers are using the given channel for two purposes... as a stop channel, and as a signal that Run() has been called. since they're the ones using it in non-standard ways, I'd prefer a fix localized in those controllers, like #58306

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

k8s-github-robot pushed a commit that referenced this pull request Jan 18, 2018
…#58306-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #57340: Fix garbage collector when leader-elect=false

Cherry pick of #57340 #58306 on release-1.8.

#57340: Fix garbage collector when leader-elect=false
#58306: Track run status explicitly rather than non-nil check on

```release-note
Fix garbage collection when the controller-manager uses --leader-elect=false
```
k8s-github-robot pushed a commit that referenced this pull request Jan 27, 2018
…#58306-upstream-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #57340: Fix garbage collector when leader-elect=false

Cherry pick of #57340 #58306 on release-1.9.

#57340: Fix garbage collector when leader-elect=false
#58306: Track run status explicitly rather than non-nil check on

```release-note
Fix garbage collection and resource quota when the controller-manager uses --leader-elect=false
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kube 1.8 deployment garbage collection doesn't work with --leader-elect=false