List all pods on node added for attach_detach controller #44675
Conversation
To ensure all pods with volumes are managed immediately upon a node becoming managed. This most commonly happens at startup of the controller manager. This may produce duplicate pod processing, but this should be minimal since nodes are generally added at a slow rate. If it's a problem, we'll need to move to a worker queue approach and keep track of all nodes so pods can be requeued for processing.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jsravn
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
Hi @jsravn. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@k8s-bot ok to test
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
Is there a way we can use a selector to pick only pods which belong to this node? Not a blocker, but this definitely looks promising.
We could index the pod cache based on a NodeNameIndex and fetch pods for a specific node, but I haven't looked closely at the impact of using multiple indexes in the cache. #38868
This is a problem at startup especially. For every node in the cluster it will do a list of all the pods. I notice we do lots of list-alls elsewhere though (like in the daemonset controller), so I'm not sure if it's a big deal or not.
podLister works with the shared informer cache, so it won't ask the API server. IMO it's OK for now and it should be fixed in a subsequent PR.
I'm happy to work on that. I agree it should be a separate PR though, especially since I don't think it can be easily cherry-picked into older releases.
This is adding a lot of work to the controller: if there are 5000 nodes with 30 pods each, then when the controller restarts, for each node it will fetch and process all 150,000 pods, meaning it ends up processing on the order of 750 million pod entries (excluding the ones it will get from watch).
Given that we have #42033, which periodically lists all pods, do we really need this as well to fix #43300?
Yeah, we'd need a node index to do this efficiently, or use a work queue with pod-processing retries. Considering #42033 also processes every single pod every 3 minutes, is it a big deal in comparison? I wasn't sure.
The main purpose for me is to get a fix we can use in 1.6/1.5. I didn't think #42033 was a candidate for backporting, or was even going to stick around in its current form. I thought it was more of a backup, experimental solution. Is it okay to have to wait for the reconciler to fix things when new nodes get added?
If we just want #42033, that's okay with me, as long as we can get a fix in 1.6/1.5 as well, since this is a major problem in those releases.
So it doesn't block the node handler.
@@ -293,6 +294,24 @@ func (adc *attachDetachController) nodeAdd(obj interface{}) {
	// the attached volumes field. This function ensures that we sync with
	// the actual status.
	adc.actualStateOfWorld.SetNodeStatusUpdateNeeded(nodeName)

	if _, exists := node.Annotations[volumehelper.ControllerManagedAttachAnnotation]; exists {
Is ADC locked at this time? I am not sure, but I think the reconciler runs in a separate goroutine and it could run at this point, where we have the node and its attached volumes in the states of the world while the pods are not there yet. reconcile() will try to detach these volumes.
I think you're right. That race already exists though, correct? At least by immediately adding the pod information here, we have a shot at avoiding it. If we do hit the race, the worst case is it will detach and then re-attach when the pods are processed.
Actually I think waitForCacheSync in #run will help here. It will prevent the reconciler from running until all the initial nodes/pods have been processed. There's still a small race since the handlers operate asynchronously, so they may still be running even when waitForCacheSync returns.
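To make the waitForCacheSync point concrete, here is a minimal sketch of how Run could gate the reconciler and the desired-state populator on the informer caches. The field names (adc.podsSynced, adc.nodesSynced, adc.reconciler, adc.desiredStateOfWorldPopulator) and the imports (kcache "k8s.io/client-go/tools/cache", utilruntime, glog, fmt) are assumptions for illustration, not a quote of the actual controller code:

// Sketch only: block reconcile() until the caches are synced so the node/pod
// add handlers can populate the states of the world before detach decisions.
func (adc *attachDetachController) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	glog.Infof("Starting attach detach controller")

	if !kcache.WaitForCacheSync(stopCh, adc.podsSynced, adc.nodesSynced) {
		utilruntime.HandleError(fmt.Errorf("timed out waiting for caches to sync"))
		return
	}

	// The informer event handlers run asynchronously, so a small window remains
	// where a handler is still executing after WaitForCacheSync returns.
	go adc.reconciler.Run(stopCh)
	go adc.desiredStateOfWorldPopulator.Run(stopCh)
	<-stopCh
}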
@jsravn PR needs rebase
I pinged @saad-ali to have a look
My big question is whether we should do this at all.
@@ -250,10 +255,6 @@ func (adc *attachDetachController) podAdd(obj interface{}) {
	if pod == nil || !ok {
		return
	}
	if pod.Spec.NodeName == "" {
Why remove this?
It's redundant with ProcessPodVolumes, which does the same check (with a log message).
	}
}

func (adc *attachDetachController) processPodsForNode(nodeName types.NodeName) {
Rename to fetchAndProcessPodsForNode(...)
-- I want the name to emphasize that it is not passively processing but actively fetching.
Also add a comment above it summarizing what this method does.
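A rough sketch of what the renamed method with its summary comment could look like; the body is illustrative only (it routes matching pods through the existing podAdd handler, which calls ProcessPodVolumes), not necessarily the exact code in this PR:

// fetchAndProcessPodsForNode actively lists every pod in the shared informer
// cache and processes the ones scheduled to nodeName, so a node that has just
// become controller-managed picks up its existing pods immediately instead of
// waiting for new pod events.
func (adc *attachDetachController) fetchAndProcessPodsForNode(nodeName types.NodeName) {
	pods, err := adc.podLister.List(labels.Everything())
	if err != nil {
		glog.Errorf("error listing pods: %v", err)
		return
	}
	for _, pod := range pods {
		if types.NodeName(pod.Spec.NodeName) != nodeName {
			// Skip pods scheduled to other nodes (or not scheduled at all).
			continue
		}
		// Reuse the normal add path; podAdd calls ProcessPodVolumes to put the
		// pod's volumes into the desired state of the world.
		adc.podAdd(pod)
	}
}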
@jsravn: The following test(s) failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
The reason I prefer PR #42033 over this PR is:
I understand that PR #42033 may be more difficult to port back to older branches, but because of the reasons above, and because PR #42033 appears to address the problem raised in issue #43300 (making the attach/detach controller tolerant to HA failover), I hesitate to endorse this PR.
This also takes care of processing pods when node state changes, like adding nodes to the cluster.
Agreed, so I think we'd need the node indexing. @jsafrane suggested we do that in a separate PR. So do you think we should just try backporting #42033? Is it something @NickrenREN could pick up?
I think #42033 would handle that case too: a node goes from unmanaged to attach/detach controller managed, the controller gets an add and adds the node to desired state. The populator comes around, finds that pods that were previously being ignored now have a node in desired state, and adds them.
Could you expand on what you mean by node indexing?
Yes, let's backport it but let's ensure that #42033 hasn't caused any major regressions. @NickrenREN is this something you can help with?
For a cluster with M nodes and N pods per node, when the controller restarts it will process M * (M*N) pods during startup. Will that cause high overhead?
I mean adding a method to PodLister so we can list by node name, like we can do for namespace. It adds an Indexer on pods keyed by node. There was another use case for this as well, I think in the daemonset controller.
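To illustrate the node-indexing idea, a hedged sketch using client-go's shared informer indexers; the index name "nodeName" and the helper names below are made up for illustration, and the imports (kcache "k8s.io/client-go/tools/cache" and the core v1 Pod type) are assumed:

const nodeNameIndex = "nodeName" // illustrative index name

// addNodeNameIndex registers an index keyed on spec.nodeName; it must be called
// before the pod informer is started.
func addNodeNameIndex(podInformer kcache.SharedIndexInformer) error {
	return podInformer.AddIndexers(kcache.Indexers{
		nodeNameIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*v1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return nil, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
}

// podsOnNode returns only the cached pods scheduled to nodeName, avoiding the
// full List(labels.Everything()) walk discussed earlier in this thread.
func podsOnNode(podInformer kcache.SharedIndexInformer, nodeName string) ([]*v1.Pod, error) {
	objs, err := podInformer.GetIndexer().ByIndex(nodeNameIndex, nodeName)
	if err != nil {
		return nil, err
	}
	pods := make([]*v1.Pod, 0, len(objs))
	for _, obj := range objs {
		if pod, ok := obj.(*v1.Pod); ok {
			pods = append(pods, pod)
		}
	}
	return pods, nil
}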
@saad-ali I'd like to help if needed
We can close this then if we're deciding to backport #42033.
I have opened a cherry-pick of #42033 - #45620. @NickrenREN can you please review?
Closing in favour of #45620
To ensure all pods with volumes are managed immediately upon a node
becoming managed. This most commonly happens at startup of the
controller manager.
This may produce duplicate pod processing, but this should be minimal
since nodes are generally added at a slow rate. If it's a problem, we'll
need to move to a worker queue approach and keep track of all nodes
so pods can be requeued for processing.
The main drawback is at startup. The attach/detach controller may end up
processing all pods twice in the worst case. And without a node index
it will list all the pods for every node in the cluster.
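The worker-queue approach mentioned above could look roughly like the sketch below, assuming the standard client-go workqueue package (k8s.io/client-go/util/workqueue): nodeAdd would enqueue a key per pod on the node, and a fixed pool of workers drains the queue with retries, so startup cost is bounded and a pod that fails processing is requeued rather than lost. The names runPodWorker and process are hypothetical:

// runPodWorker drains a queue of pod keys; on error the key is requeued with
// rate-limited backoff so it is retried later instead of being dropped.
func runPodWorker(queue workqueue.RateLimitingInterface, process func(key string) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := process(key); err != nil {
			queue.AddRateLimited(key)
		} else {
			queue.Forget(key)
		}
		queue.Done(item)
	}
}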