EBS fails to detach after controller manager is restarted #43300

Closed
jsravn opened this issue Mar 17, 2017 · 52 comments

@jsravn
Contributor

jsravn commented Mar 17, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
EBS

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Kubernetes version (use kubectl version):
1.4.8

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.2
  • Kernel (e.g. uname -a): 4.8.0
  • Install tools: custom ansible
  • Others:

What happened:
We have an HA controller manager setup. After a controller manager was restarted, a new controller manager became the leader. We observed that after this happens, some EBS volumes fail to detach when a pod is redeployed. It seems that the controller manager forgets about some attached EBS volumes after it is restarted.

What you expected to happen:
After a new controller manager starts, if an existing pod with an EBS gets redeployed, the EBS should be detached first.

How to reproduce it (as minimally and precisely as possible):
Unfortunately I've been unable to reproduce this consistently - it doesn't always happen, but when it does it's always after a controller manager got restarted.

Anything else we need to know:

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

@jingxu97 do you know if there are any existing bugs that match the symptoms I'm seeing?

The closest I can find is #34242, but that is only if a pod is deleted during restart. In my case it happens when a pod is deleted after controller manager is restarted.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

This is what I see in the controller manager logs (here vol-02f69898d700456b0 fails to detach; I've filtered on all logs with that ID, sorted in reverse order):

[screenshot: controller manager log entries filtered on vol-02f69898d700456b0, 2017-03-17]

In this case controller-manager-10.50.241.133 was stopped at 15:58, and controller-manager-10.50.246.14 started just after. Then, when the pod was deleted at 19:00, the new controller manager ignored the pod deletion completely and didn't perform any detach action. It tried to attach, which then hung indefinitely. The kubelet on the node successfully unmounted the volume, but it remained attached to the old node.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

To clarify, we've seen this a few times, but I've been unable to reproduce it manually. When it happens, though, it's always shortly after a controller manager restart (which is pretty rare for us; it happens maybe once every couple of weeks).

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

Digging more, my use case actually has the controller restart a few times over a 15-minute period (each controller is updated, then restarted, in a rolling deploy). In this case the first restart looks fine: the informer updates the controller manager's view of the world correctly. But for the subsequent restarts over the 15-minute period, the desired state of the world is not updated at all. As a result, the last controller manager to gain leadership has no knowledge of the already-mounted pods. I'm not sure what is happening; why would subsequent restarts not behave the same as the first? (It might just be coincidence that this is what happened for me.)

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

Isn't there a race condition between nodeAdd and podAdd at startup? It looks like podAdd is skipped if the node hasn't been added yet.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

I think the right approach here would be to requeue the pod for processing once nodes have finished syncing, same as other controllers do.

@jingxu97
Contributor

@jsravn I just discussed this with @saad-ali; it seems possible to hit this race when the controller manager is restarted. I will add some logging to test and confirm it. Could you also help by adding a log line at https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L245 and testing with your setup? Thanks!

@jsravn
Contributor Author

jsravn commented Mar 20, 2017

@jingxu97 Okay, I've added that. I also turned on the logging in processPodVolumes, because that's where I suspect it is skipping (the pods weren't deleted during the controller manager restart, AFAICT).

jsravn added a commit to jsravn/kubernetes that referenced this issue Mar 20, 2017
@jingxu97
Contributor

@jsravn thanks! Are you able to make this change on your local setup first and test it? Thanks!

@jsravn
Contributor Author

jsravn commented Mar 20, 2017

Yes I'll roll this out shortly to our local clusters

@jsravn
Contributor Author

jsravn commented Mar 23, 2017

@jingxu97 we managed to trigger this condition again. It looks like the pods got processed before the nodes, so the attach/detach controller skipped them, and the subsequent attaches failed. Due to the 12-hour resync period, this is of course a problem.

In this case it happened to an influxdb pod which needs an EBS volume:

[screenshot: controller manager log entries for the influxdb pod's volume, 2017-03-23]

@jsravn
Contributor Author

jsravn commented Mar 23, 2017

I suppose it's also possible the node informer never triggered, but I'd need to add additional logging to confirm.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

@jingxu97 do you have any ideas on the best way to fix this? It looks like the similar wait-for-sync code has changed a bit in master. I could try adding a simple loop to wait for hasSynced in 1.4 to confirm this is our problem.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

Added that fix at jsravn@5d3c1da; I'm going to deploy this version of kube-controller-manager and verify whether it fixes the restart issue.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

I can verify the node informer wasn't synced in time: I get the message "Waiting for node store to sync before processing nodes" on startup. So I believe this is the bug, and it will cause volume information to be out of sync on kube-controller-manager startups/restarts.
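For reference, here is a minimal, standard-library-only sketch of the kind of wait described above. The function name, message wording, and timeout are illustrative only; this is not the actual jsravn@5d3c1da change.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForNodeSync blocks until hasSynced reports true or the timeout expires.
// hasSynced stands in for the node informer's HasSynced method.
func waitForNodeSync(hasSynced func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for !hasSynced() {
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for node store to sync")
		}
		fmt.Println("Waiting for node store to sync before processing pods")
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	start := time.Now()
	// Simulated node informer that reports synced after one second.
	hasSynced := func() bool { return time.Since(start) > time.Second }

	if err := waitForNodeSync(hasSynced, 10*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node store synced; safe to process pod events")
}
```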

@gnufied
Member

gnufied commented Mar 30, 2017

How about we add all such pods to a data structure called PodsOnHold, and when the nodeAdd event runs for such a node, trigger an event inside PodsOnHold to process all pending pods on that node?

Similarly, when a node is deleted, we need to clear out any pending items in PodsOnHold.
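A rough, standard-library-only sketch of what such a structure could look like; the PodsOnHold name comes from this comment, but the method names and signatures here are assumptions, not existing Kubernetes code:

```go
package main

import (
	"fmt"
	"sync"
)

// PodsOnHold buffers pods whose node has not been seen yet, keyed by node name.
type PodsOnHold struct {
	mu      sync.Mutex
	pending map[string][]string // node name -> pod keys waiting for that node
}

func NewPodsOnHold() *PodsOnHold {
	return &PodsOnHold{pending: map[string][]string{}}
}

// Hold records a pod that could not be processed because its node is unknown.
func (p *PodsOnHold) Hold(nodeName, podKey string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending[nodeName] = append(p.pending[nodeName], podKey)
}

// OnNodeAdd replays every held pod for the node through process().
func (p *PodsOnHold) OnNodeAdd(nodeName string, process func(podKey string)) {
	p.mu.Lock()
	pods := p.pending[nodeName]
	delete(p.pending, nodeName)
	p.mu.Unlock()
	for _, podKey := range pods {
		process(podKey)
	}
}

// OnNodeDelete drops any pods still waiting for a node that went away.
func (p *PodsOnHold) OnNodeDelete(nodeName string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.pending, nodeName)
}

func main() {
	hold := NewPodsOnHold()
	hold.Hold("node-a", "default/influxdb-0")
	hold.OnNodeAdd("node-a", func(podKey string) {
		fmt.Println("reprocessing", podKey)
	})
}
```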

@jsravn
Contributor Author

jsravn commented Mar 31, 2017

I think the other controllers handle this pretty well; you can take a look at how the endpoints controller does it in 1.4/1.5. It has its own work queue, where it puts services to process while it waits for the pod informer to finish syncing. Although I noticed this changed quite a lot in master; I haven't gone through those changes yet, though.

@wongma7
Contributor

wongma7 commented Apr 3, 2017

@jsravn so to clarify, jsravn@5d3c1da fixed this? Because now it correctly waits for the node store to sync before attempting processing?

Someone else also discovered that the attach/detach controller never bothers to wait for its caches to sync here: https://github.com/kubernetes/kubernetes/pull/40385/files#diff-72e66cba2ced392dc0f9cd10c424d6d0R225. So if we uncomment that and fix the mentioned test, that should fix this issue, right?

@gnufied
Member

gnufied commented Apr 3, 2017

@wongma7 We can't put a sleep in the code there, because events are processed in the main goroutine that is started by the controller manager. Putting a sleep there would potentially block other things.

I am leaning towards adding an internal queue, as the endpoints controller does. cc @saad-ali @jingxu97

@wongma7
Contributor

wongma7 commented Apr 3, 2017

WaitForCacheSync seems like the standard way to wait. If I understand correctly, @ncdc just didn't want to touch things he didn't need to in what was mostly a mechanical PR, so he left that as a comment for discussion. So I'm saying that if @jsravn's fix worked, we can uncomment the WaitForCacheSync in my second link as the "proper" fix.

@ncdc
Member

ncdc commented Apr 4, 2017

@wongma7 it's not as simple as uncommenting. You'll have to fix the integration test I referenced as well.

@gnufied
Member

gnufied commented Apr 4, 2017

Also, AFAICT just putting WaitForCacheSync in Run will not automatically fix this bug. The events are processed on a separate goroutine from the one that runs the attach/detach controller loop. Uncommenting that line in Run will simply block the controller loop. Events will still be missed.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@gnufied I am not sure about the comment "Events will still be missed". Shouldn't the informer replay the events from the period when the controller was stopped (not yet started)?
@ncdc why doesn't it set pvcsSynced and pvsSynced to alwaysReady?

@jsravn
Contributor Author

jsravn commented Apr 4, 2017

@wongma7 that's correct. @gnufied yes, this is the dirty fix. But it won't block anything else, at least in 1.4: each listener in the shared informer gets its own goroutine for handling events, so it will just block the pod handler in the attach/detach controller.

WaitForCacheSync alone won't solve it; as @gnufied says, you will miss events. You need to queue the events for processing, like the endpoints controller does, or just block the event handler goroutine like I did.

@jsravn
Contributor Author

jsravn commented Apr 4, 2017

@jingxu97 sharedInformers.Start() is called after Run() (see controllermanager.go#StartControllers), so I don't think you can just call WaitForCacheSync in Run. You could call WaitForCacheSync in processPods, though (like I did).

@ncdc
Member

ncdc commented Apr 4, 2017

@jingxu97 @gnufied @jsravn

I am not sure about the comment "Events will still be missed"

Me either

Shouldn't the informer replay the events from the period when the controller was stopped (not yet started)?

The informers will resync their items every so often, as defined by either the global shared informer factory resync period, or the per-event handler resync period. But it's better to use WaitForCacheSync in the controller's Run function.

sharedInformers.Start() is called after Run() (see controllermanager.go#StartControllers), so I don't think you can just call WaitForCacheSync in Run. You could call WaitForCacheSync in processPods, though (like I did).

You can and should call WaitForCacheSync in Run. This is the common pattern in our controllers. The typical flow is:

  1. In your controller's constructor, call someInformer.Informer().AddEventHandler()
  2. The event handlers should add to a work queue
  3. In your controller's Run, call WaitForCacheSync
  4. In your controller's Run, once the caches are synced, start your worker goroutines
  5. The workers pop items off of the work queue and process them
  6. After executing all the controllers' constructors and Run functions, we call Start on the shared informer factory

See https://github.com/kubernetes/kubernetes/blob/8e26fa25da6d3b1deb333fe2484f794795d1c6b9/staging/src/k8s.io/kube-aggregator/pkg/controllers/autoregister/autoregister_controller.go for an example of a typical controller using a work queue.

If all we do is uncomment the call to WaitForCacheSync and somehow fix the integration test I referenced, you'll still have a situation where the event handler functions (e.g. podAdd/Update/Delete, nodeAdd/Update/Delete) may be called before WaitForCacheSync unblocks. I'm guessing that might cause issues, which is why you'd want to convert to using work queues.
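For illustration, here is a stripped-down, standard-library-only sketch of that flow; a buffered channel and plain closures stand in for client-go's workqueue and HasSynced functions, and none of this is the real attach/detach controller code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct {
	queue      chan string // stand-in for a rate-limited workqueue
	nodeSynced func() bool // stand-in for the node informer's HasSynced
	podSynced  func() bool // stand-in for the pod informer's HasSynced
}

// podAdd is the event handler: it only enqueues, it never processes directly.
func (c *controller) podAdd(podKey string) {
	c.queue <- podKey
}

// Run waits for the caches to sync, then starts the workers.
func (c *controller) Run(workers int, stopCh <-chan struct{}) {
	for !(c.nodeSynced() && c.podSynced()) { // WaitForCacheSync equivalent
		time.Sleep(100 * time.Millisecond)
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case key := <-c.queue:
					// A real worker would sync the pod's volumes into the
					// desired state of the world here.
					fmt.Println("processing", key)
				case <-stopCh:
					return
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := &controller{
		queue:      make(chan string, 16),
		nodeSynced: func() bool { return true },
		podSynced:  func() bool { return true },
	}
	stop := make(chan struct{})
	c.podAdd("default/influxdb-0") // event arrives before Run; it just sits in the queue
	go func() {
		time.Sleep(500 * time.Millisecond)
		close(stop)
	}()
	c.Run(2, stop)
}
```

The key property is that the event handlers only enqueue keys, so events that arrive before the caches are synced are parked in the queue rather than processed against incomplete state.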

@gnufied
Member

gnufied commented Apr 4, 2017

Me either

You answered it yourself in the last part of your reply. Just putting the WaitForCacheSync line in the Run function will result in a situation where the event handler functions are called before WaitForCacheSync unblocks.

The resync interval in the attach/detach controller is 12 hours, and hence not much use in practice.

That is why we need both work queues and WaitForCacheSync to properly fix this problem.

@ncdc
Member

ncdc commented Apr 4, 2017

OK, a difference in terminology. The events are not missed per se; they're received and processed, possibly incorrectly without a work queue.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@ncdc, can I ask what might happen if the controller starts processing before WaitForCacheSync returns true?

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@ncdc Also, I'm still not sure why it doesn't set pvcsSynced and pvsSynced to alwaysReady.

@gnufied
Member

gnufied commented Apr 4, 2017

@jingxu97 WaitForCacheSync is a must-have if we are going to implement worker queues for processing events. If the controller starts processing events without waiting for WaitForCacheSync to return, we are back to square one with the race between the node and pod informer caches: a pod event from the queue could be processed before the corresponding node add event, which would make the whole queue implementation pointless.

@gnufied
Member

gnufied commented Apr 4, 2017

/assign @gnufied

Going to take a stab at it.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@gnufied From my understanding, a pod event could still be processed before a node event even after WaitForCacheSync returns. WaitForCacheSync just makes sure that, when the controller starts processing events, the informer cache is synced.

@ncdc
Member

ncdc commented Apr 4, 2017

@jingxu97 that's correct. If you have 2 fully synced caches (pods, nodes) and events arrive, there's no guarantee as to the order in which they arrive. You have to be able to handle situations where either the pod or the node arrives first.

@jsravn
Contributor Author

jsravn commented Apr 7, 2017

We can just add the pod back into the worker queue if its node hasn't been processed yet, right?
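As a sketch of that idea (the helper names and callbacks here are illustrative, not real controller code):

```go
package main

import "fmt"

// syncPod illustrates the requeue idea: if the pod's node is not known yet,
// put the pod key back on the queue instead of dropping it.
func syncPod(podKey, nodeName string, nodeKnown func(string) bool, requeue func(string)) {
	if !nodeKnown(nodeName) {
		requeue(podKey) // a real workqueue would rate-limit the retry
		return
	}
	fmt.Println("adding", podKey, "on", nodeName, "to the desired state of the world")
}

func main() {
	known := map[string]bool{"node-a": true}
	nodeKnown := func(n string) bool { return known[n] }
	requeue := func(key string) { fmt.Println("node not seen yet, requeueing", key) }

	syncPod("default/influxdb-0", "node-b", nodeKnown, requeue) // requeued
	syncPod("default/influxdb-0", "node-a", nodeKnown, requeue) // processed
}
```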

@jingxu97
Contributor

PR #42033 takes a different approach to solving the problem. There is a desiredStatePopulator in the attach/detach controller which currently periodically verifies with the API server that the pods in the desired state of the world still exist; if not, it removes them.
PR #42033 adds a check over all the active pods from the API server and ensures that each one that has volumes exists in the desired state of the world cache. If the pod and node events arrive out of order and podAdd is ignored, the periodic check will eventually add the pod to the desired state. I think it should solve the issue.
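As a rough, standard-library-only illustration of that kind of periodic reconciliation (this is not the actual #42033 code; the names and signatures are assumptions):

```go
package main

import (
	"fmt"
	"time"
)

type pod struct {
	key        string
	hasVolumes bool
}

// populatorLoop periodically re-lists the active pods and makes sure every pod
// that has volumes is present in the desired state of the world, catching any
// podAdd events that were dropped because of informer ordering.
func populatorLoop(listActivePods func() []pod, inDesiredState func(string) bool,
	addToDesiredState func(string), interval time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, p := range listActivePods() {
				if p.hasVolumes && !inDesiredState(p.key) {
					addToDesiredState(p.key)
				}
			}
		case <-stopCh:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(250 * time.Millisecond)
		close(stop)
	}()
	populatorLoop(
		func() []pod { return []pod{{key: "default/influxdb-0", hasVolumes: true}} },
		func(string) bool { return false }, // pretend the pod is missing from the desired state
		func(key string) { fmt.Println("re-adding", key, "to the desired state") },
		100*time.Millisecond,
		stop,
	)
}
```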

@jsravn
Contributor Author

jsravn commented Apr 13, 2017

I'm not sure what the decided approach is here. I was thinking more about the worker queue approach and think it might be overkill, especially since it will have to do a sleep/loop anyway. I think a simple loop like the one I added in the pod handler should suffice, possibly with the restriction that it gives up after a period of time if the node informer doesn't sync in time. My changes have fixed all the issues we had in our clusters with kube-controller-manager restarts. Should I clean it up and open a PR?

I don't know about #42033; it seems like yet another band-aid. There isn't any discussion or links in the PR as to why it's being done. I'm guessing this is to handle the edge case of a missed pod update during restarts? Having it list the entire world every minute feels like a bad approach; that should be handled by the informers via notifications and a proper delta protocol, with #39732 to recover missed pods on startup.

@wongma7
Contributor

wongma7 commented Apr 13, 2017

I don't think #42033 will fix this issue entirely either... hard to say whether it's intended to fix this issue or whether that's just a coincidence :). Disregarding whether it's a good or bad approach, there is still a roughly one-minute window in which a detach can be missed entirely if you delete the pod during a polling window, which is IMO totally reasonable to expect:

  1. Controller starts
  2. As reported in this issue, a pod from podAdd isn't processed because its node isn't in the informer cache yet and nodeAdd for the node hasn't fired yet
  3. The DSWP runs and the new 'findAndAddActivePods' runs; the node still hasn't been added to the cache, so the pod isn't processed
  4. User deletes the pod within ~1 minute after the restart.
  5. Pod's volume is never detached

So I'm thinking now that one of these is needed to fix it fully:

(in addition to #39732 for fixing the restart case which leads to the same symptoms as this issue)

edit: to be clear, I agree #42033 should be unnecessary; we should be able to trust the informers except at startup, when their caches aren't even ready yet. After the first poll, it's basically not doing anything useful...

@jingxu97
Contributor

jingxu97 commented Apr 14, 2017

There are two race condition issues around controller restarts that need to be solved here:
Issue 1: a pod is deleted while the controller is down. After the controller starts, this pod and its volume information will be missing from the desired state.
Issue 2: the podAdd event happens before the nodeAdd event, so the pod fails to be added to the desired state.

#39732 is an effort to fix issue 1. The idea is that the node status might still have the attached volume in its "VolumesAttached" field, so the desired state can be rebuilt from it. This approach has a couple of minor issues.

  • One is that the information in the "VolumesAttached" field is not enough to fully construct the data structure in the desired state, so some information has to be faked. The approach still works, though, because the faked information should not actually be used anywhere.
  • The other issue is that, because the controller deletes the volume from the "VolumesAttached" list before actually performing the detach operation, there is still a chance that we lose the information about a volume while it is still attached. The sequence of events might be: 1. the pod is deleted; 2. the controller checks the condition and removes the volume from "VolumesAttached", but has not yet called the "detach" operation; 3. the controller crashes; 4. the controller restarts, and the volume information cannot be found anywhere (neither in the pod nor in VolumesAttached in the node status).

About issue 2, as @wongma7 mentioned, this thread has discussed a few possible approaches. But I think WaitForCacheSync is required no matter which approach we use, because otherwise, when a pod is added to the desired state, the cached information from the informer might be out of date. Comparing the different approaches:

  1. fix TODO: find and add active pods for dswp #42033: overhead from periodically checking all the pods
  2. worker queue: requires a big change in the controller
  3. @jsravn's pod handler (could you explain this approach in more detail?)

For approach 1, I do see there are other places that periodically check all the pods (or we might only check active pods?), and we need some stress tests to prove it out. @jsravn could you give more details about your approach? Do you also have a PR for it? Thanks!

@jsravn
Contributor Author

jsravn commented Apr 14, 2017

@jingxu97 it is this change jsravn@5d3c1da. I can do a PR for it on Monday.

@jsafrane
Member

I am quite late to this discussion, and I just have a stupid idea. If we have issues with the podAdd event arriving before the nodeAdd event, can we add some code to nodeAdd that lists all pods on the node and calls podAdd on them? podAdd might be called twice for the same pod, and even in parallel (the node and pod informers run in separate goroutines and may process nodeAdd and podAdd at the same time), but that's still much better than not calling podAdd at all. Work queues would help here to get rid of the unwanted parallelism.

One thing that can also happen is that the controller is running (no crash recovery or anything) when a node is added to the cluster and the DaemonSet controller adds a pod to that node. Our attach/detach controller can get these events in any order, and it should process them reliably.

I am afraid that #42033 is just a workaround and not a solution.
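A minimal sketch of the nodeAdd relisting idea above (the lister and handler signatures are assumptions, not the real informer API): when a node is added, look up every pod scheduled to it and push each one back through the pod handler.

```go
package main

import "fmt"

type podInfo struct {
	key      string
	nodeName string
}

// nodeAdd re-drives podAdd for every pod already scheduled to the new node,
// so pods whose podAdd arrived before this nodeAdd are not lost.
func nodeAdd(nodeName string, listPods func() []podInfo, podAdd func(podInfo)) {
	for _, p := range listPods() {
		if p.nodeName == nodeName {
			// podAdd must tolerate being called more than once for the same pod.
			podAdd(p)
		}
	}
}

func main() {
	pods := []podInfo{
		{key: "default/influxdb-0", nodeName: "node-a"},
		{key: "default/web-1", nodeName: "node-b"},
	}
	nodeAdd("node-a",
		func() []podInfo { return pods },
		func(p podInfo) { fmt.Println("reprocessing", p.key) })
}
```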

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

@jsafrane I'm a little disappointed to see #42033 got merged into master. Maybe it will help mitigate some of these bugs, though (with the drawbacks already mentioned).

Anyway, I realised my approach of simply waiting for the node informer to sync in processPods still has a race; you really do need the work queues. Not sure if @gnufied has had a chance to do this; otherwise I'll probably whip up a PR for it shortly.

@gnufied
Member

gnufied commented Apr 19, 2017

@jsravn If you don't beat me to it, I will open a PR today that fixes this issue properly. I am sorry, I have been caught up in a bunch of other things.

And yeah, #42033 is not a proper fix. The current situation is bad. For example, the other day I was running some tests and every single pod was getting skipped on controller restart: https://gist.github.com/gnufied/93135976373857e27fe14619f8e4ca7d

However, @wongma7's PR #44566 should make it somewhat better.

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

Having spent more time on it, I don't really get the point of desired_state_of_world_populator.go. I feel like this populator should be removed and the informer resync period reduced instead, if we want to refresh occasionally.

I've attempted @jsafrane's solution; it is probably the simplest approach. The queue approach would need heavier modification, e.g. it would need to store the state of all nodes, not just the managed ones, so we know whether a pod should be skipped or re-queued.

@0xmichalis
Contributor

0xmichalis commented Apr 19, 2017

you really do need the work queues

I had a look into the attach/detach controller today and I was wondering why this is missing.

@wongma7
Contributor

wongma7 commented Apr 19, 2017

Regarding desired_state_of_world_populator.go: I did some digging a while ago, and it was added because a flake was encountered where pod delete events were getting missed, and there's no notion of "resyncing" delete events. Let me find the exact issue.

#42033 copied the pattern of "double checking" deletes over to double checking adds/updates. I did bring up changing the resync period as an alternative in the PR; that would IMO be less error-prone and seen as less of a workaround/band-aid, simply because there's less code to write and we can rely on the existing podAdd/podUpdate mechanism. But since the PR was already written, I didn't object any further than that.

It's important to note that at the time the attach/detach controller was first written, one could not set custom resync periods when using shared informers; that was added only recently. See also https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L83

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

I just opened #44675, which relists pods when a managed node is added. That handles the case of a controller restart as well as nodes being dynamically added. I think this will be easy to cherry-pick for 1.5. I started working on the queue side, but it's a bit more involved. Happy to do that instead, or leave it to @gnufied, if the relist isn't seen as a good solution.

@wongma7 makes sense. I was checking our cluster logs for the last 3 weeks and I haven't ever seen the populator do anything. With a shorter resync period it is probably unnecessary now.

@gnufied
Member

gnufied commented Apr 20, 2017

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

@0xmichalis
Contributor

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

@gnufied most? Can you point me to those?

@saad-ali
Member

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

I haven't dug into it recently. With the populator changes this should be less critical now.

k8s-github-robot added the needs-sig label (Indicates an issue or PR lacks a `sig/foo` label and requires one.) on May 31, 2017
@gnufied
Member

gnufied commented Jun 14, 2017

I think this should be fixed by #44566 and #42033.

@jsravn can we close this?

@jsravn
Contributor Author

jsravn commented Jun 15, 2017

@gnufied Sure, although I haven't verified it yet. I can reopen it if I see it again.

jsravn closed this as completed on Jun 15, 2017