EBS fails to detach after controller manager is restarted #43300
This is what I see in the controller manager logs (here vol-02f69898d700456b0 fails to detach; I've filtered on all logs with that ID, sorted in reverse order): In this case controller-manager-10.50.241.133 got stopped at 15:58, and controller-manager-10.50.246.14 started just after. Then when the pod got deleted at 19:00, the new controller manager ignores the pod deletion completely and doesn't do any detach action. It tries to attach, which then hangs indefinitely. Kubelet on the node successfully unmounted the volume, but it remains attached to the old node.
To clarify, we've seen this a few times but I've been unable to reproduce it manually. When it happens, though, it's always shortly after a controller manager restart (which is pretty rare for us, happens maybe once every couple of weeks).
So digging more, my use case actually has the controller restart a few times over a 15 minute period (each controller is being updated, then restarted in a rolling deploy). In this case, the first restart looks fine - the informer updates the controller manager's view of the world correctly. But for the subsequent restarts over the 15 minute period, the desired state of the world is not updated at all. As a result, the last controller manager to gain leadership has no knowledge of the already mounted pods. I'm not sure what is happening - why would subsequent restarts not be the same as the first? (Might just be coincidence, though, that this is what happened for me.)
Isn't there a race condition between nodeAdd and podAdd at startup? It looks like podAdd is skipped if the node hasn't been added yet.
I think the right approach here would be to requeue the pod for processing once nodes have finished syncing, same as other controllers do.
@jsravn just discussed with @saad-ali, it seems possible to have this race when the controller manager is restarted. I will add some logging to test and confirm it. Could you also help add a log line at https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L245 and test with your setup? Thanks!
@jingxu97 Okay I've added that. I also turned on the logging in processPodVolumes, because that's where I suspect it is skipping (the pods didn't get deleted afaict during the controller manager restart).
@jsravn thanks! Are you able to make this change on your local setup first and test it? Thanks!
Yes, I'll roll this out shortly to our local clusters.
@jingxu97 we managed to trigger this condition again. It looks like the pods got processed before the nodes - so the attach detach controller skipped them, and the future attaches failed. Due to the 12 hr resync period this is a problem of course. In this case it happened to an influxdb pod which needs an EBS:
I suppose it's also possible the node informer never triggered, but I'd need to add additional logging to confirm.
@jingxu97 do you have any ideas on the best way to fix this? It looks like similar code to wait for sync has changed a bit in master. I could try adding a simple loop to wait for the node informer to sync.
Added that fix at jsravn@5d3c1da - going to deploy this version of kube-controller-manager and verify if it fixes the restart issue.
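For context, a wait loop along those lines might look roughly like the sketch below. This is an illustration only, not the actual change in jsravn@5d3c1da: the function name, the timeouts, and the processPodVolumes callback are assumptions, and the import paths reflect current client-go layout.

```go
package attachdetach

import (
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

// handlePodAdd blocks the pod event handler until the node informer reports
// HasSynced (or a timeout expires) before handing the pod off, so scheduled
// pods are not silently skipped just because their node is not cached yet.
func handlePodAdd(pod *v1.Pod, nodeInformer cache.SharedIndexInformer, processPodVolumes func(*v1.Pod)) {
	if pod.Spec.NodeName == "" {
		return // pod not scheduled yet, nothing to attach
	}
	err := wait.PollImmediate(100*time.Millisecond, 30*time.Second, func() (bool, error) {
		return nodeInformer.HasSynced(), nil
	})
	if err != nil {
		// Node cache never synced; give up rather than act on incomplete data.
		return
	}
	processPodVolumes(pod)
}
```

Because each shared informer listener gets its own goroutine, blocking here only delays the pod handler, not the rest of the controller manager.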
I can verify the node informer wasn't synced in time - I get the message "Waiting for node store to sync before processing nodes" on startup. So this is the bug I believe, and it will cause volume information to be out of sync on kube-controller-manager startup/restarts.
How about if we add all such pods to a pending data structure? Similarly, when a node is deleted we need to clear out any pending items for it.
I think the other controllers handle this pretty well - you can take a look at how endpoint controller does it in 1.4/1.5, it has its own work queue where it will stick services to process while it waits for the pod informer to finish updating. Although I noticed this changed quite a lot in master - haven't gone through those changes yet, though.
@jsravn so to clarify, jsravn@5d3c1da fixed this? Because now it correctly waits for the node store to sync before attempting processing? It was discovered by someone else too that the attach detach controller never bothers to wait for its caches to sync here https://github.com/kubernetes/kubernetes/pull/40385/files#diff-72e66cba2ced392dc0f9cd10c424d6d0R225 . So if we uncomment that and fix the mentioned test, that should fix this issue, right?
WaitForCacheSync seems like the standard way to wait. If I understand correctly, @ncdc just didn't want to touch things he didn't need to in what was mostly a mechanical PR, so he left that as a comment for discussion. So I'm saying that if @jsravn's fix worked, we can uncomment the WaitForCacheSync in my second link as the "proper" fix.
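For anyone following along, the conventional startup pattern looks roughly like this. It is a sketch, not the controller's actual Run function; the *Synced arguments stand in for each informer's HasSynced.

```go
package attachdetach

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// run refuses to start any workers until every informer cache has synced.
// The stop channel aborts the wait if the controller is shutting down.
func run(stopCh <-chan struct{}, podsSynced, nodesSynced, pvcsSynced, pvsSynced cache.InformerSynced) error {
	if !cache.WaitForCacheSync(stopCh, podsSynced, nodesSynced, pvcsSynced, pvsSynced) {
		return fmt.Errorf("timed out waiting for caches to sync")
	}
	// ... safe to start the reconciler / event processing from here ...
	return nil
}
```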
@wongma7 it's not as simple as uncommenting. You'll have to fix the integration test I referenced as well.
Also afaict just putting the WaitForCacheSync call in isn't enough - you will still miss events.
@wongma7 that's correct. @gnufied yes this is the dirty fix. But it won't block anything else, at least in 1.4 - each listener in the shared informer gets its own goroutine for handling events. So it will just block the pod handler in the attach detach controller. WaitForCacheSync won't solve it, like @gnufied says you will miss events - you need to queue the events to process, like how endpoint controller works, or just block the event handler goroutine like I did.
@jingxu97 sharedInformers.Start() is called after Run().
Me either
The informers will resync their items every so often, as defined by either the global shared informer factory resync period, or the per-event handler resync period. But it's better to use a work queue.
You can and should call WaitForCacheSync before processing any events.
See https://github.com/kubernetes/kubernetes/blob/8e26fa25da6d3b1deb333fe2484f794795d1c6b9/staging/src/k8s.io/kube-aggregator/pkg/controllers/autoregister/autoregister_controller.go for an example of a typical controller using a work queue. If all we do is uncomment the call to WaitForCacheSync, is that enough on its own?
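As a rough illustration of that pattern (all names here are made up; the linked controller is the authoritative example), the event handlers only enqueue object keys, and a worker retries keys whose processing fails:

```go
package attachdetach

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// queueController sketches the standard work-queue pattern: event handlers
// enqueue object keys, and worker goroutines pop keys, do the real work, and
// re-queue with rate limiting when processing fails.
type queueController struct {
	queue workqueue.RateLimitingInterface
	sync  func(key string) error // e.g. look up the pod and update desired state
}

// enqueue is what podAdd/podUpdate/podDelete handlers would call.
func (c *queueController) enqueue(obj interface{}) {
	key, err := cache.MetaNamespaceKeyFunc(obj)
	if err != nil {
		return
	}
	c.queue.Add(key)
}

// runWorker is started (in its own goroutine) only after WaitForCacheSync.
func (c *queueController) runWorker() {
	for {
		key, quit := c.queue.Get()
		if quit {
			return
		}
		if err := c.sync(key.(string)); err != nil {
			// A dependency (e.g. the node) may not be known yet; retry later
			// instead of dropping the event.
			c.queue.AddRateLimited(key)
		} else {
			c.queue.Forget(key)
		}
		c.queue.Done(key)
	}
}
```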
You answered that yourself in the last part of your reply. Just putting WaitForCacheSync in isn't enough on its own. The resync interval in the attach/detach controller is 12 hours and hence not super useful for any real use. That is why we need both work queues and WaitForCacheSync.
Ok, difference in terminology. The events are not missed per se. They're received and processed, possibly incorrectly without a work queue.
@ncdc, can I ask what might happen if the controller starts processing events before WaitForCacheSync returns true?
@ncdc Also, I'm still not sure why it doesn't set pvcsSynced and pvsSynced to alwaysReady.
@jingxu97
/assign @gnufied Going to take a stab at it.
@gnufied From my understanding, the pod event could still be processed before a node event even after WaitForCacheSync returns. WaitForCacheSync just makes sure that when the controller starts processing events, the informer cache is synced.
@jingxu97 that's correct. If you have 2 fully synced caches (pods, nodes) and events arrive, there's no guarantee as to the order in which they arrive. You have to be able to handle situations where either the pod or the node arrives first.
We can just add the pod back into the worker queue if its node hasn't been processed yet, right?
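Roughly, yes. Inside the per-key sync function that check could look something like the sketch below; nodeExists is a hypothetical stand-in for consulting the node lister or the desired state of the world.

```go
package attachdetach

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// requeueIfNodeMissing puts the pod's key back on the rate-limited queue when
// the pod's node has not been processed yet, instead of dropping the event.
// It returns true when the caller should stop and wait for the retry.
func requeueIfNodeMissing(queue workqueue.RateLimitingInterface, key string, pod *v1.Pod, nodeExists func(nodeName string) bool) bool {
	if nodeExists(pod.Spec.NodeName) {
		return false // node is known; safe to add the pod's volumes now
	}
	queue.AddRateLimited(key) // retry after the node add has (hopefully) landed
	return true
}
```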
This PR #42033 takes a different approach to solving the problem. There is a desiredStatePopulator in the attach_detach controller which currently periodically verifies with the API server that the pods in the desired state of the world still exist; if not, it removes them.
I'm not sure what the decided approach is here. I was thinking more about the worker queue approach and think it might be overkill, especially since it will have to do a sleep/loop anyway. I think a simple loop like I added in the pod handler should suffice, possibly with the restriction that it gives up after a period of time if the node informer doesn't update in time. My changes have fixed all issues we have in our clusters with kube-controller-manager restarts. Should I clean it up and open a PR? I don't know about #42033; it seems like yet another band-aid. There isn't any discussion or links as to why it's being done in the PR - I'm guessing this is to handle the edge case of a missing pod update during restarts? Having it list the entire world every 1m feels like a bad approach; that should be handled by the informers via notify and a proper delta protocol, with #39732 to recover missed pods on startup.
I don't think #42033 will fix this issue entirely either... hard to say whether it's intended to fix this issue or just a coincidence :). Disregarding whether it's a good/bad approach, there is still a minute-ish window for a detach to never occur if you delete the pod during a polling window, which is IMO totally reasonable to expect.
So I'm thinking now that one of these is needed to fix it fully:
(in addition to #39732 for fixing the restart case, which leads to the same symptoms as this issue) Edit: to be clear, I agree #42033 should be unnecessary; we should be able to trust the informers except at startup when their caches aren't even ready yet. After the first polls, it's basically not doing anything useful...
There are two race condition issues around controller restarts that need to be solved here. #39732 is an effort to fix issue 1; the idea is that the node status might still have the attached volume in the "VolumesAttached" field. This approach has a couple of minor issues.
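For reference, the idea behind #39732 amounts to roughly this kind of startup pass. This is a sketch of the concept only; markAttached stands in for whatever the actual-state-of-the-world API ends up being.

```go
package attachdetach

import (
	v1 "k8s.io/api/core/v1"
)

// seedActualStateFromNodeStatus reads the VolumesAttached entries the previous
// controller wrote into each node's status and records them as attached, so a
// freshly started controller does not forget existing attachments.
func seedActualStateFromNodeStatus(nodes []*v1.Node, markAttached func(volume v1.UniqueVolumeName, nodeName string)) {
	for _, node := range nodes {
		for _, attached := range node.Status.VolumesAttached {
			markAttached(attached.Name, node.Name)
		}
	}
}
```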
About issue 2, as @wongma7 mentioned, this thread discussed a few possible approaches. But I am thinking WaitForCacheSync is required no matter which approach we use, because otherwise when a pod is added to the desired state, the cached information from the informer might be out of date. Comparing the different approaches:
For approach 1, I do see there are other places that periodically check all the pods (or we might only check active pods?) and we need some stress tests to prove it. @jsravn could you give more details about your approach? Do you also have a PR for it? Thanks!
@jingxu97 it is this change jsravn@5d3c1da. I can do a PR for it on Monday.
I am quite late in this discussion and I just have a stupid idea. If we have issues with the podAdd event arriving before the nodeAdd event, can we add some code to nodeAdd that would list all pods on the node and call podAdd on them? podAdd might be called twice for the same pod and even in parallel (the node and pod informers run in separate goroutines and they may process nodeAdd and podAdd at the same time), but it's still much better than not calling podAdd at all. Work queues would help here to get rid of unwanted parallelism. One thing that can also happen is that the controller is running (no crash recovery or whatnot), a node is added to the cluster, and the daemon set controller adds a pod to the node. Our attach/detach controller can get these events in any order and it should process them reliably. I am afraid that #42033 is just a workaround and not a solution.
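A sketch of that nodeAdd relist idea is below. The lister call and the filtering by spec.nodeName are illustrative of the approach, not the eventual PR; podAdd is assumed to be the existing pod-add handler.

```go
package attachdetach

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// onNodeAdd re-drives every pod already scheduled to the new node through the
// normal podAdd path, so pod events that arrived before the node event are not
// lost. podAdd must be idempotent, since a pod may now be processed twice.
func onNodeAdd(node *v1.Node, podLister corelisters.PodLister, podAdd func(*v1.Pod)) error {
	pods, err := podLister.List(labels.Everything())
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if pod.Spec.NodeName == node.Name {
			podAdd(pod)
		}
	}
	return nil
}
```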
@jsafrane I'm a little disappointed to see #42033 got merged into master. Maybe it will help mitigate some of these bugs, though (with the drawbacks already mentioned). Anyway, I realised my approach of simply waiting for the node informer to sync in processPods still has a race; you really do need the work queues. Not sure if @gnufied got a chance to do this, otherwise I'll probably whip up a PR shortly for it.
@jsravn If you don't beat me to it, I will open a PR today that fixes this issue properly. I am sorry, I have been caught up in a bunch of other things. And yeah - #42033 is not a proper fix. The current situation is bad. For example, the other day I was running some tests and every single pod was getting skipped on controller restart - https://gist.github.com/gnufied/93135976373857e27fe14619f8e4ca7d However, @wongma7's PR #44566 should make it somewhat better.
Spending more time on it, I don't really get the point of desired_state_of_world_populator.go. I feel like this populator should be removed, and we should instead just reduce the informer resync period if we want to refresh occasionally. I've attempted @jsafrane's solution - it is probably the simplest approach. The queue approach would need heavier modification, e.g. it would need to store the state of all nodes, not just managed ones, so we know whether a pod should be skipped or re-queued.
I had a look into the attach/detach controller today and I was wondering why this is missing a WaitForCacheSync call.
Regarding desired_state_of_world_populator.go, I did some digging a while ago and it was because a flake was encountered where pod delete events were getting missed. There's no notion of "resyncing" delete events, so it was added. Let me find the exact issue. #42033 copied the pattern of "double checking" deletes to double checking adds/updates. I did bring up changing the resync period as an alternative in the PR; that would IMO be less error-prone and seen as less of a workaround/band-aid, simply because there's less code to write and we can rely on the existing podAdd/podUpdate mechanism, but since the PR was already written I didn't object any further than that. Important to note that at the time the attach/detach controller was first written, one could not set custom resync periods when using shared informers; that was added recently. See also https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L83
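For reference, the two resync knobs being discussed look roughly like this with current shared informers. The values and function name are illustrative, not the controller's real configuration.

```go
package attachdetach

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newPodInformer shows the factory-wide default resync period and the newer
// per-handler override via AddEventHandlerWithResyncPeriod.
func newPodInformer(client kubernetes.Interface, handler cache.ResourceEventHandler) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 12*time.Hour) // factory-wide default
	podInformer := factory.Core().V1().Pods().Informer()
	// This handler alone gets adds/updates replayed every minute.
	podInformer.AddEventHandlerWithResyncPeriod(handler, 1*time.Minute)
	return podInformer
}
```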
I just opened #44675, which does the relist of pods when a managed node is added. That handles the case of controller restart as well as nodes being dynamically added. I think this will be easy to cherry-pick for 1.5. I started working on the queue side but it's a bit more involved. Happy to do that instead, or let @gnufied take it, if the relist isn't seen as a good solution. @wongma7 makes sense. I was checking our cluster logs for the last 3 weeks and I haven't ever seen the populator do anything. With a shorter resync period it is probably unnecessary now.
@saad-ali is there any reason we can't reduce the resync period now? Most other controllers have switched to 30s default resync periods...
I haven't dug into it recently. With the populator changes this should be less critical now.
@gnufied Sure, although I haven't verified it yet. But I can reopen if I see it again.
Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
EBS
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT
Kubernetes version (use kubectl version): 1.4.8
Environment:
Kernel (e.g. uname -a): 4.8.0
What happened:
We have an HA controller manager setup. After a controller manager was restarted, a new controller manager became the leader. We observed that after this happens, some EBS volumes fail to detach when a pod is redeployed. It seems that the controller manager forgets about some attached EBS volumes after it is restarted.
What you expected to happen:
After a new controller manager starts, if an existing pod with an EBS gets redeployed, the EBS should be detached first.
How to reproduce it (as minimally and precisely as possible):
Unfortunately I've been unable to reproduce this consistently - it doesn't always happen, but when it does it's always after a controller manager got restarted.
Anything else we need to know: