
List all pods on node added for attach_detach controller #44675

Conversation

jsravn
Contributor

@jsravn jsravn commented Apr 19, 2017

This ensures that all pods with volumes are managed immediately once a node becomes managed, which most commonly happens at startup of the controller manager.

This may produce duplicate pod processing, but the duplication should be minimal since nodes are generally added at a slow rate. If it becomes a problem, we'll need to move to a worker-queue approach and keep track of all nodes so pods can be requeued for processing.

The main drawback is at startup: in the worst case the attach/detach controller may end up processing all pods twice, and without a node index it will list all the pods for every node in the cluster.
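A minimal sketch of the shape of this change, assuming the handler and lister names visible in the diff excerpts below (nodeAdd, podAdd, processPodsForNode, podLister, ControllerManagedAttachAnnotation); import paths are the ones used around the 1.6 tree, and the exact wiring may differ from the real patch:

```go
import (
	"github.com/golang/glog"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/kubernetes/pkg/api/v1"
	"k8s.io/kubernetes/pkg/volume/util/volumehelper"
)

// nodeAdd is called by the node informer; for controller-managed nodes it
// immediately pulls in all pods already scheduled to the node.
func (adc *attachDetachController) nodeAdd(obj interface{}) {
	node, ok := obj.(*v1.Node)
	if node == nil || !ok {
		return
	}
	nodeName := types.NodeName(node.Name)
	adc.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)

	// Only nodes whose attach/detach is controller-managed are of interest.
	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
		adc.processPodsForNode(nodeName)
	}
}

// processPodsForNode lists every pod in the informer cache and feeds the ones
// scheduled to nodeName through the normal pod-add path. Without a node index
// this is O(all pods) per node-add event, which is the startup drawback noted
// above.
func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("failed to list pods for node %q: %v", nodeName, err)
		return
	}
	for _, pod := range pods {
		if types.NodeName(pod.Spec.NodeName) != nodeName {
			continue
		}
		// Reusing podAdd means the pod's volumes land in the desired state of
		// world exactly as they would for a watch event.
		adc.podAdd(pod)
	}
}
```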

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 19, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsravn
We suggest the following additional approver: @jsafrane

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @jsravn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 19, 2017
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 19, 2017
@eparis
Contributor

eparis commented Apr 19, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 19, 2017
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
Member

Is there a way we can use a selector to pick only pods which belong to this node? Not a blocker, but this definitely looks promising.

Contributor

We could index the pod cache with a NodeNameIndex and fetch pods for a specific node, but I haven't looked closely at the impact of using multiple indexes in the cache. #38868
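For illustration, a rough sketch of that indexing idea against a client-go shared index informer; the byNodeName index name and the helper functions are made up for this example and are not part of this PR:

```go
import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/kubernetes/pkg/api/v1"
)

// byNodeNameIndex is an illustrative index name for looking up pods by the
// node they are scheduled to.
const byNodeNameIndex = "byNodeName"

// addPodNodeNameIndexer registers a spec.nodeName index on the shared pod
// informer; it has to run before the informer is started.
func addPodNodeNameIndexer(podInformer cache.SharedIndexInformer) error {
	return podInformer.AddIndexers(cache.Indexers{
		byNodeNameIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return []string{}, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
}

// podsOnNode fetches only the cached pods scheduled to nodeName via the index,
// avoiding a full List over every pod in the cluster.
func podsOnNode(indexer cache.Indexer, nodeName string) ([]*v1.Pod, error) {
	objs, err := indexer.ByIndex(byNodeNameIndex, nodeName)
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}
```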

Contributor Author

This is especially a problem at startup: for every node in the cluster it will do a full list of the pods. I notice we do lots of list-alls elsewhere though (like in the daemonset controller), so I'm not sure if it's a big deal or not.

Member

podLister works with the shared informer cache, so it won't hit the API server. IMO it's OK for now, and it should be fixed in a subsequent PR.

Contributor Author

I'm happy to work on that. I agree it should be a separate PR, though, especially since I don't think it can be easily cherry-picked into older releases.

Member

This is adding a lot of work to the controller: if there are 5000 nodes with 30 pods each, then when the controller restarts it will fetch and process 150,000 pods for each node, meaning it will end up processing 450,000 pods (excluding the ones it will get from the watch).

Given that we have #42033, which periodically lists all pods, do we really need this as well to fix #43300?

Contributor Author

Yeah, we'd need a node index to do this efficiently, or use a work queue with pod-processing retries (see the sketch below). Considering #42033 also processes every single pod every 3 mins, is it a big deal in comparison? I wasn't sure.

The main purpose for me is to get a fix we can use in 1.6/1.5. I didn't think #42033 was a candidate for backporting, or was even going to stick around in its current form. I thought it was more of a backup, experimental solution. Is it okay to have to wait for the reconciler to fix things when new nodes get added?

If we just want #42033, that's okay with me, as long as we can get a fix in 1.6/1.5 as well, since this is a major problem in those releases.
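A rough sketch of the work-queue variant mentioned above, using client-go's rate-limited workqueue; the wiring is illustrative and assumes a processPodsForNode variant that returns an error so items can be retried:

```go
import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/kubernetes/pkg/api/v1"
)

// newNodeQueue builds a rate-limited queue of node names; the node informer's
// add handler only enqueues, which keeps event handling cheap.
func newNodeQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
}

// enqueueNode is the informer add-handler side: record the node name and move on.
func enqueueNode(queue workqueue.RateLimitingInterface, obj interface{}) {
	if node, ok := obj.(*v1.Node); ok && node != nil {
		queue.Add(node.Name)
	}
}

// runNodeWorker drains the queue; failed items are requeued with backoff so a
// transient error does not silently drop the node's pods.
func runNodeWorker(queue workqueue.RateLimitingInterface, process func(types.NodeName) error) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		if err := process(types.NodeName(key.(string))); err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```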

@cblecker
Member

I0419 09:31:36.636] Verifying hack/make-rules/../../hack/verify-bazel.sh
I0419 09:31:41.586] 33a34
I0419 09:31:41.587] >         "//vendor/k8s.io/apimachinery/pkg/labels:go_default_library",
W0419 09:31:41.687] DRY-RUN: wrote "pkg/controller/volume/attachdetach/BUILD"
W0419 09:31:42.228] 
W0419 09:31:42.228] 1 BUILD files not up-to-date.
I0419 09:31:42.329] 
I0419 09:31:42.329] Run ./hack/update-bazel.sh
I0419 09:31:42.329] FAILED   hack/make-rules/../../hack/verify-bazel.sh	6s

James Ravn added 3 commits April 20, 2017 11:00
@@ -293,6 +294,24 @@ func (adc *attachDetachController) nodeAdd(obj interface{}) {
	// the attached volumes field. This function ensures that we sync with
	// the actual status.
	adc.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)

	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
Member

@jsafrane jsafrane Apr 20, 2017

Is ADC locked at this time? I am not sure, but I think the reconciler runs in a separate goroutine, and it could run at this point, when we have the node and its attached volumes in the states of the world while the pods are not there yet. reconcile() will try to detach these volumes.

Contributor Author

I think you're right. That race already exists though, correct? At least by immediately adding the pod information here, we have a shot at avoiding it. If we do hit the race, the worst case is that it will detach and then re-attach when the pods are processed.

Contributor Author

@jsravn jsravn Apr 25, 2017

Actually, I think waitForCacheSync in #run will help here: it will prevent the reconciler from running until all the initial nodes/pods have been processed. There's still a small race, since the handlers operate asynchronously and may still be running even when waitForCacheSync returns.
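For context, the pattern being referred to looks roughly like this in the controller's Run method (a sketch; the synced-func field names are illustrative):

```go
import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	// Wait until the informer caches are populated before starting the
	// reconciler. The add handlers run asynchronously, so some may still be
	// in flight when this returns -- the small race noted above.
	if !cache.WaitForCacheSync(stopCh, adc.podsSynced, adc.nodesSynced) {
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}

	go adc.reconciler.Run(stopCh)
	go adc.desiredStateOfWorldPopulator.Run(stopCh)

	<-stopCh
}
```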

@k8s-github-robot

@jsravn PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2017
@jsravn
Contributor Author

jsravn commented Apr 25, 2017

I pinged @saad-ali to have a look.

@saad-ali saad-ali self-assigned this Apr 25, 2017
Member

@saad-ali saad-ali left a comment

My big question is whether we should do this at all.

@@ -250,10 +255,6 @@ func (adc *attachDetachController) podAdd(obj interface{}) {
	if pod == nil || !ok {
		return
	}
	if pod.Spec.NodeName == "" {
Member

Why remove this?

Contributor Author

It's redundant with ProcessPodVolumes, which does the same check (with a log message).

	}
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
Member

rename to fetchAndProcessPodsForNode(...) -- I want the name to emphasize that it is not passively processing but actively fetching.

Also add a comment above it summarizing what this method does.


@k8s-ci-robot
Contributor

@jsravn: The following test(s) failed:

Test name Commit Details Rerun command
Jenkins GCE etcd3 e2e 3388e73 link @k8s-bot gce etcd3 e2e test this
Jenkins kops AWS e2e 3388e73 link @k8s-bot kops aws e2e test this
Jenkins Kubemark GCE e2e 3388e73 link @k8s-bot kubemark e2e test this
Jenkins unit/integration 3388e73 link @k8s-bot unit test this
Jenkins verification 3388e73 link @k8s-bot verify test this
Jenkins GCE Node e2e 3388e73 link @k8s-bot node e2e test this
Jenkins Bazel Build 3388e73 link @k8s-bot bazel test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@saad-ali
Member

Yeah we'd need a node index to do this efficiently, or use a work queue w/ pod processing retries. Considering #42033 also processes every single pod every 3 mins is it a big deal in comparison? I wasn't sure.

The main purpose for me is to get a fix we can use in 1.6/1.5. I didn't think #42033 was a candidate for backporting, or was even going to stick around in its current form. I thought it was more of a backup, experimental solution. Is it okay to have to wait for the reconciler to fix things when new nodes get added?

If we just want #42033, that's okay with me, as long as we can get a fix in 1.6/1.5 as well, since this is a major problem in those releases.

The reason I prefer PR #42033 over this PR is:

  1. #42033 (fix TODO: find and add active pods for dswp) seems more robust--it will continuously ensure state is not skewed over the lifetime of the controller process instead of just at startup;
  2. #42033 seems like a smaller (although still non-trivial) performance hit: it processes potentially 5000 nodes * 30 pods = 150,000 pods at 3-minute increments, whereas this solution processes potentially 150,000 pods * 5000 nodes = 750,000,000 pods all up front--roughly a three-order-of-magnitude difference, processed all up front.

I understand that PR #42033 may be more difficult to port back to older branches, but because of the reasons above, and because PR #42033 appears to address the problem raised in issue #43300 (making the attach/detach controller tolerant to HA failover), I hesitate to endorse this PR.
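A quick check of the arithmetic in point 2:

```latex
\underbrace{5000 \times 30}_{\text{pods in the cluster}} = 150{,}000,
\qquad
\underbrace{150{,}000 \times 5000}_{\text{pods listed across all node adds}} = 750{,}000{,}000,
\qquad
\frac{750{,}000{,}000}{150{,}000} = 5000 \approx 10^{3.7}.
```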

@jsravn
Contributor Author

jsravn commented Apr 28, 2017

@saad-ali

#42033 seems more robust--it will continuously ensure state is not skewed over the lifetime of the controller process instead of just at startup,

This also takes care of processing pods when node state changes. Like adding nodes to the cluster.

#42033 seems like a smaller (although still non-trivial) performance hit: it processes potentially 5000 nodes * 30 pods = 150,000 pods at 3-minute increments, whereas this solution processes potentially 150,000 pods * 5000 nodes = 750,000,000 pods all up front--roughly a three-order-of-magnitude difference, processed all up front.

Agreed, so I think we'd need the node indexing. @jsafrane suggested we do that in a separate PR.

So do you think we should just try backporting #42033? Is it something @NickrenREN could pick up?

@saad-ali
Member

This also takes care of processing pods when node state changes. Like adding nodes to the cluster.

I think #42033 would handle that case too: a node goes from unmanaged to attach/detach-controller managed, the controller gets an add and adds the node to desired state. The populator comes around, finds that pods that were previously being ignored now have a node in desired state, and adds them.

Agreed, so I think we'd need the node indexing. @jsafrane suggested we do that in a separate PR.

Could you expand on what you mean by node indexing?

So do you think we should just try backporting #42033? Is it something @NickrenREN could pick up?

Yes, let's backport it but let's ensure that #42033 hasn't caused any major regressions. @NickrenREN is this something you can help with?
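For reference, the #42033 approach being preferred here amounts to a periodic full re-list in the desired-state-of-world populator, roughly like the sketch below; the loop period comes from the "every 3 mins" figure mentioned earlier, and the addPodVolumesToDSW helper name is illustrative:

```go
import (
	"time"

	"github.com/golang/glog"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"
)

// Run periodically re-lists every pod from the informer cache and makes sure
// its volumes are reflected in the desired state of world, so pods missed at
// startup or during failover are picked up within one loop period.
func (p *desiredStateOfWorldPopulator) Run(stopCh <-chan struct{}) {
	wait.Until(p.findAndAddActivePods, 3*time.Minute, stopCh)
}

func (p *desiredStateOfWorldPopulator) findAndAddActivePods() {
	pods, err := p.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("error listing pods: %v", err)
		return
	}
	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			// Not scheduled yet; nothing to attach.
			continue
		}
		// Add the pod's volumes to the desired state of world.
		p.addPodVolumesToDSW(pod)
	}
}
```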

@jingxu97
Contributor

For a cluster with M nodes and N pods per node, when the controller is restarted it will process M * (M*N) pods during startup. Will that cause high overhead?

@jsravn
Contributor Author

jsravn commented Apr 29, 2017

@saad-ali

Could you expand on what you mean by node indexing?

I mean adding a method to PodLister so we can list by node name, like we can already do for namespace. It adds an Indexer on pods keyed by node. There was another use case for this as well, I think in the daemonset controller.

@NickrenREN
Contributor

@saad-ali I'd like to help if needed

@jsravn
Contributor Author

jsravn commented May 4, 2017

We can close this then if we're deciding to backport #42033.

@gnufied
Member

gnufied commented May 10, 2017

I have opened a cherry-pick of #42033 - #45620

@NickrenREN can you please review?

@jsravn
Contributor Author

jsravn commented May 16, 2017

Closing in favour of #45620

@jsravn jsravn closed this May 16, 2017