
List all pods on node added for attach_detach controller #44675

Conversation

jsravn
Contributor

@jsravn jsravn commented Apr 19, 2017

This ensures that all pods with volumes are managed immediately once a node becomes managed, which most commonly happens at startup of the controller manager.

This may produce duplicate pod processing, but the duplication should be minimal since nodes are generally added at a slow rate. If it becomes a problem, we'll need to move to a worker-queue approach and keep track of all nodes so pods can be requeued for processing.

The main drawback is at startup: in the worst case the attach/detach controller may end up processing all pods twice, and without a node index it will list all the pods for every node in the cluster.
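A minimal sketch of the shape of this change, assuming the handler and lister names visible in the diff excerpts below (nodeAdd, podAdd, processPodsForNode, podLister, ControllerManagedAttachAnnotation); import paths are the ones used around the 1.6 tree, and the exact wiring may differ from the real patch:

```go
import (
	"github.com/golang/glog"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/kubernetes/pkg/api/v1"
	"k8s.io/kubernetes/pkg/volume/util/volumehelper"
)

// nodeAdd is called by the node informer; for controller-managed nodes it
// immediately pulls in all pods already scheduled to the node.
func (adc *attachDetachController) nodeAdd(obj interface{}) {
	node, ok := obj.(*v1.Node)
	if node == nil || !ok {
		return
	}
	nodeName := types.NodeName(node.Name)
	adc.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)

	// Only nodes whose attach/detach is controller-managed are of interest.
	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
		adc.processPodsForNode(nodeName)
	}
}

// processPodsForNode lists every pod in the informer cache and feeds the ones
// scheduled to nodeName through the normal pod-add path. Without a node index
// this is O(all pods) per node-add event, which is the startup drawback noted
// above.
func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("failed to list pods for node %q: %v", nodeName, err)
		return
	}
	for _, pod := range pods {
		if types.NodeName(pod.Spec.NodeName) != nodeName {
			continue
		}
		// Reusing podAdd means the pod's volumes land in the desired state of
		// world exactly as they would for a watch event.
		adc.podAdd(pod)
	}
}
```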

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 19, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsravn
We suggest the following additional approver: @jsafrane

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @jsravn. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with @k8s-bot ok to test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot k8s-github-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 19, 2017
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 19, 2017
@eparis
Contributor

eparis commented Apr 19, 2017

@k8s-bot ok to test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 19, 2017
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
Member

Is there a way we can use a selector to pick only pods which belong to this node? Not a blocker, but this definitely looks promising.

Contributor

We could index the pod cache with a NodeNameIndex and fetch pods for a specific node, but I haven't looked closely at the impact of using multiple indexes in the cache. #38868
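For illustration, a rough sketch of that indexing idea against a client-go shared index informer; the byNodeName index name and the helper functions are made up for this example and are not part of this PR:

```go
import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/kubernetes/pkg/api/v1"
)

// byNodeNameIndex is an illustrative index name for looking up pods by the
// node they are scheduled to.
const byNodeNameIndex = "byNodeName"

// addPodNodeNameIndexer registers a spec.nodeName index on the shared pod
// informer; it has to run before the informer is started.
func addPodNodeNameIndexer(podInformer cache.SharedIndexInformer) error {
	return podInformer.AddIndexers(cache.Indexers{
		byNodeNameIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return []string{}, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
}

// podsOnNode fetches only the cached pods scheduled to nodeName via the index,
// avoiding a full List over every pod in the cluster.
func podsOnNode(indexer cache.Indexer, nodeName string) ([]*v1.Pod, error) {
	objs, err := indexer.ByIndex(byNodeNameIndex, nodeName)
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}
```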

Contributor Author

This is especially a problem at startup: for every node in the cluster it will do a full list of the pods. I notice we do lots of list-alls elsewhere though (like in the daemonset controller), so I'm not sure if it's a big deal or not.

Member

podLister works with the shared informer cache, so it won't hit the API server. IMO it's OK for now, and it should be fixed in a subsequent PR.

Contributor Author

I'm happy to work on that. I agree it should be a separate PR, though, especially since I don't think it can be easily cherry-picked into older releases.

Member

This is adding a lot of work to the controller: if there are 5000 nodes with 30 pods each, then when the controller restarts it will fetch and process 150,000 pods for each node, meaning it will end up processing 450,000 pods (excluding the ones it will get from the watch).

Given that we have #42033, which periodically lists all pods, do we really need this as well to fix #43300?

Contributor Author

Yeah, we'd need a node index to do this efficiently, or use a work queue with pod-processing retries (see the sketch below). Considering #42033 also processes every single pod every 3 mins, is it a big deal in comparison? I wasn't sure.

The main purpose for me is to get a fix we can use in 1.6/1.5. I didn't think #42033 was a candidate for backporting, or was even going to stick around in its current form. I thought it was more of a backup, experimental solution. Is it okay to have to wait for the reconciler to fix things when new nodes get added?

If we just want #42033, that's okay with me, as long as we can get a fix in 1.6/1.5 as well, since this is a major problem in those releases.
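A rough sketch of the work-queue variant mentioned above, using client-go's rate-limited workqueue; the wiring is illustrative and assumes a processPodsForNode variant that returns an error so items can be retried:

```go
import (
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/kubernetes/pkg/api/v1"
)

// newNodeQueue builds a rate-limited queue of node names; the node informer's
// add handler only enqueues, which keeps event handling cheap.
func newNodeQueue() workqueue.RateLimitingInterface {
	return workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
}

// enqueueNode is the informer add-handler side: record the node name and move on.
func enqueueNode(queue workqueue.RateLimitingInterface, obj interface{}) {
	if node, ok := obj.(*v1.Node); ok && node != nil {
		queue.Add(node.Name)
	}
}

// runNodeWorker drains the queue; failed items are requeued with backoff so a
// transient error does not silently drop the node's pods.
func runNodeWorker(queue workqueue.RateLimitingInterface, process func(types.NodeName) error) {
	for {
		key, quit := queue.Get()
		if quit {
			return
		}
		if err := process(types.NodeName(key.(string))); err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(key)
	}
}
```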

@cblecker
Member

I0419 09:31:36.636] Verifying hack/make-rules/../../hack/verify-bazel.sh
I0419 09:31:41.586] 33a34
I0419 09:31:41.587] >         "//vendor/k8s.io/apimachinery/pkg/labels:go_default_library",
W0419 09:31:41.687] DRY-RUN: wrote "pkg/controller/volume/attachdetach/BUILD"
W0419 09:31:42.228] 
W0419 09:31:42.228] 1 BUILD files not up-to-date.
I0419 09:31:42.329] 
I0419 09:31:42.329] Run ./hack/update-bazel.sh
I0419 09:31:42.329] FAILED   hack/make-rules/../../hack/verify-bazel.sh	6s

James Ravn added 3 commits April 20, 2017 11:00
@@ -293,6 +294,24 @@ func (adc *attachDetachController) nodeAdd(obj interface{}) {
	// the attached volumes field. This function ensures that we sync with
	// the actual status.
	adc.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)

	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
Member

@jsafrane jsafrane Apr 20, 2017

Is ADC locked at this time? I am not sure, but I think the reconciler runs in a separate goroutine, and it could run at this point, when we have the node and its attached volumes in the states of the world while the pods are not there yet. reconcile() will try to detach these volumes.

Contributor Author

I think you're right. That race already exists though, correct? At least by immediately adding the pod information here, we have a shot at avoiding it. If we do hit the race, the worst case is that it will detach and then re-attach when the pods are processed.

Contributor Author

@jsravn jsravn Apr 25, 2017

Actually, I think waitForCacheSync in #run will help here: it will prevent the reconciler from running until all the initial nodes/pods have been processed. There's still a small race, since the handlers operate asynchronously and may still be running even when waitForCacheSync returns.
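For context, the pattern being referred to looks roughly like this in the controller's Run method (a sketch; the synced-func field names are illustrative):

```go
import (
	"fmt"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/client-go/tools/cache"
)

func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	// Wait until the informer caches are populated before starting the
	// reconciler. The add handlers run asynchronously, so some may still be
	// in flight when this returns -- the small race noted above.
	if !cache.WaitForCacheSync(stopCh, adc.podsSynced, adc.nodesSynced) {
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}

	go adc.reconciler.Run(stopCh)
	go adc.desiredStateOfWorldPopulator.Run(stopCh)

	<-stopCh
}
```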

@k8s-github-robot

@jsravn PR needs rebase

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2017
@jsravn
Contributor Author

jsravn commented Apr 25, 2017

I pinged @saad-ali to have a look.

@saad-ali saad-ali self-assigned this Apr 25, 2017
Member

@saad-ali saad-ali left a comment

My big question is whether we should do this at all.

@@ -250,10 +255,6 @@ func (adc *attachDetachController) podAdd(obj interface{}) {
	if pod == nil || !ok {
		return
	}
	if pod.Spec.NodeName == "" {
Member

Why remove this?

Contributor Author

It's redundant with ProcessPodVolumes, which does the same check (with a log message).

	}
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
Member

rename to fetchAndProcessPodsForNode(...) -- I want the name to emphasize that it is not passively processing but actively fetching.

Also add a comment above it summarizing what this method does.


@k8s-ci-robot
Contributor

@jsravn: The following test(s) failed:

Test name Commit Details Rerun command
Jenkins GCE etcd3 e2e 3388e73 link @k8s-bot gce etcd3 e2e test this
Jenkins kops AWS e2e 3388e73 link @k8s-bot kops aws e2e test this
Jenkins Kubemark GCE e2e 3388e73 link @k8s-bot kubemark e2e test this
Jenkins unit/integration 3388e73 link @k8s-bot unit test this
Jenkins verification 3388e73 link @k8s-bot verify test this
Jenkins GCE Node e2e 3388e73 link @k8s-bot node e2e test this
Jenkins Bazel Build 3388e73 link @k8s-bot bazel test this

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@saad-ali
Member

Yeah we'd need a node index to do this efficiently, or use a work queue w/ pod processing retries. Considering #42033 also processes every single pod every 3 mins is it a big deal in comparison? I wasn't sure.

The main purpose for me is to get a fix we can use in 1.6/1.5. I didn't think #42033 was a candidate for backporting, or was even going to stick around in its current form. I thought it was more of a backup, experimental solution. Is it okay to have to wait for the reconciler to fix things when new nodes get added?

If we just want #42033, that's okay with me, as long as we can get a fix in 1.6/1.5 as well, since this is a major problem in those releases.

The reason I prefer PR #42033 over this PR is:

  1. #42033 (fix TODO: find and add active pods for dswp) seems more robust--it will continuously ensure state is not skewed over the lifetime of the controller process instead of just at startup;
  2. #42033 seems like a smaller (although still non-trivial) performance hit: it processes potentially 5000 nodes * 30 pods = 150,000 pods at 3-minute increments, whereas this solution processes potentially 150,000 pods * 5000 nodes = 750,000,000 pods all up front--roughly a three-order-of-magnitude difference, processed all up front.

I understand that PR #42033 may be more difficult to port back to older branches, but because of the reasons above, and because PR #42033 appears to address the problem raised in issue #43300 (making the attach/detach controller tolerant to HA failover), I hesitate to endorse this PR.
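A quick check of the arithmetic in point 2:

```latex
\underbrace{5000 \times 30}_{\text{pods in the cluster}} = 150{,}000,
\qquad
\underbrace{150{,}000 \times 5000}_{\text{pods listed across all node adds}} = 750{,}000{,}000,
\qquad
\frac{750{,}000{,}000}{150{,}000} = 5000 \approx 10^{3.7}.
```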

@jsravn
Contributor Author

jsravn commented Apr 28, 2017

@saad-ali

#42033 seems more robust--it will continuously ensure state is not skewed over the lifetime of the controller process instead of just at startup,

This also takes care of processing pods when node state changes. Like adding nodes to the cluster.

#42033 seems like a smaller (although still non-trivial) performance hit: it processes potentially 5000 nodes * 30 pods = 150,000 pods at 3-minute increments, whereas this solution processes potentially 150,000 pods * 5000 nodes = 750,000,000 pods all up front--roughly a three-order-of-magnitude difference, processed all up front.

Agreed, so I think we'd need the node indexing. @jsafrane suggested we do that in a separate PR.

So do you think we should just try backporting #42033? Is it something @NickrenREN could pick up?

@saad-ali
Member

This also takes care of processing pods when node state changes. Like adding nodes to the cluster.

I think #42033 would handle that case too: a node goes from unmanaged to attach/detach-controller managed, the controller gets an add and adds the node to desired state. The populator comes around, finds that pods that were previously being ignored now have a node in desired state, and adds them.

Agreed, so I think we'd need the node indexing. @jsafrane suggested we do that in a separate PR.

Could you expand on what you mean by node indexing?

So do you think we should just try backporting #42033? Is it something @NickrenREN could pick up?

Yes, let's backport it but let's ensure that #42033 hasn't caused any major regressions. @NickrenREN is this something you can help with?
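For reference, the #42033 approach being preferred here amounts to a periodic full re-list in the desired-state-of-world populator, roughly like the sketch below; the loop period comes from the "every 3 mins" figure mentioned earlier, and the addPodVolumesToDSW helper name is illustrative:

```go
import (
	"time"

	"github.com/golang/glog"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/util/wait"
)

// Run periodically re-lists every pod from the informer cache and makes sure
// its volumes are reflected in the desired state of world, so pods missed at
// startup or during failover are picked up within one loop period.
func (p *desiredStateOfWorldPopulator) Run(stopCh <-chan struct{}) {
	wait.Until(p.findAndAddActivePods, 3*time.Minute, stopCh)
}

func (p *desiredStateOfWorldPopulator) findAndAddActivePods() {
	pods, err := p.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("error listing pods: %v", err)
		return
	}
	for _, pod := range pods {
		if pod.Spec.NodeName == "" {
			// Not scheduled yet; nothing to attach.
			continue
		}
		// Add the pod's volumes to the desired state of world.
		p.addPodVolumesToDSW(pod)
	}
}
```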

@jingxu97
Contributor

For a cluster with M nodes and N pods per node, when the controller is restarted it will process M * (M*N) pods during startup. Will that cause high overhead?

@jsravn
Contributor Author

jsravn commented Apr 29, 2017

@saad-ali

Could you expand on what you mean by node indexing?

I mean adding a method to PodLister so we can list by node name, like we can already do for namespace. It adds an Indexer on pods keyed by node. There was another use case for this as well, I think in the daemonset controller.

@NickrenREN
Contributor

@saad-ali I'd like to help if needed

@jsravn
Contributor Author

jsravn commented May 4, 2017

We can close this then if we're deciding to backport #42033.

@gnufied
Member

gnufied commented May 10, 2017

I have opened a cherry-pick of #42033 - #45620

@NickrenREN can you please review?

@jsravn
Contributor Author

jsravn commented May 16, 2017

Closing in favour of #45620

@jsravn jsravn closed this May 16, 2017