EBS fails to detach after controller manager is restarted #43300

Closed
jsravn opened this issue Mar 17, 2017 · 52 comments

@jsravn
Contributor

jsravn commented Mar 17, 2017

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see http://kubernetes.io/docs/troubleshooting/.):
No

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
EBS

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

Kubernetes version (use kubectl version):
1.4.8

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.2
  • Kernel (e.g. uname -a): 4.8.0
  • Install tools: custom ansible
  • Others:

What happened:
We have an HA controller manager setup. After a controller manager was restarted, a new controller manager became the leader. We observed that after this happens, some EBS volumes fail to detach when a pod is redeployed. It seems that the controller manager forgets about some attached EBS volumes after it is restarted.

What you expected to happen:
After a new controller manager starts, if an existing pod with an EBS gets redeployed, the EBS should be detached first.

How to reproduce it (as minimally and precisely as possible):
Unfortunately I've been unable to reproduce this consistently - it doesn't always happen, but when it does it's always after a controller manager got restarted.

Anything else we need to know:

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

@jingxu97 do you know if there are any existing bugs that match the symptoms I'm seeing?

The closest I can find is #34242, but that is only if a pod is deleted during restart. In my case it happens when a pod is deleted after controller manager is restarted.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

This is what I see in the controller manager logs (here vol-02f69898d700456b0 fails to detach; I've filtered on all logs with that ID, sorted in reverse order):

[screenshot: controller manager log entries filtered on vol-02f69898d700456b0, 2017-03-17]

In this case controller-manager-10.50.241.133 was stopped at 15:58, and controller-manager-10.50.246.14 started just after. Then, when the pod was deleted at 19:00, the new controller manager ignored the pod deletion completely and didn't perform any detach action. It tried to attach, which then hung indefinitely. The kubelet on the node successfully unmounted the volume, but it remained attached to the old node.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

To clarify, we've seen this a few times, but I've been unable to reproduce it manually. When it happens, though, it's always shortly after a controller manager restart (which is pretty rare for us; it happens maybe once every couple of weeks).

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

Digging more, my use case actually has the controller restart a few times over a 15-minute period (each controller is updated, then restarted, in a rolling deploy). In this case the first restart looks fine: the informer updates the controller manager's view of the world correctly. But for the subsequent restarts over the 15-minute period, the desired state of the world is not updated at all. As a result, the last controller manager to gain leadership has no knowledge of the already-mounted pods. I'm not sure what is happening; why would subsequent restarts not behave the same as the first? (It might just be coincidence that this is what happened for me.)

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

Isn't there a race condition between nodeAdd and podAdd at startup? It looks like podAdd is skipped if the node hasn't been added yet.

@jsravn
Contributor Author

jsravn commented Mar 17, 2017

I think the right approach here would be to requeue the pod for processing once nodes have finished syncing, same as other controllers do.

@jingxu97
Contributor

@jsravn I just discussed this with @saad-ali; it seems possible to hit this race when the controller manager is restarted. I will add some logging to test and confirm it. Could you also help by adding a log line at https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L245 and testing with your setup? Thanks!

@jsravn
Contributor Author

jsravn commented Mar 20, 2017

@jingxu97 Okay, I've added that. I also turned on the logging in processPodVolumes, because that's where I suspect it is skipping (the pods weren't deleted during the controller manager restart, AFAICT).

jsravn added a commit to jsravn/kubernetes that referenced this issue Mar 20, 2017
@jingxu97
Contributor

@jsravn thanks! Are you able to make this change on your local setup first and test it? Thanks!

@jsravn
Contributor Author

jsravn commented Mar 20, 2017

Yes I'll roll this out shortly to our local clusters

@jsravn
Contributor Author

jsravn commented Mar 23, 2017

@jingxu97 we managed to trigger this condition again. It looks like the pods got processed before the nodes, so the attach/detach controller skipped them, and the subsequent attaches failed. Due to the 12-hour resync period, this is of course a problem.

In this case it happened to an influxdb pod which needs an EBS volume:

[screenshot: controller manager log entries for the influxdb pod's volume, 2017-03-23]

@jsravn
Contributor Author

jsravn commented Mar 23, 2017

I suppose it's also possible the node informer never triggered, but I'd need to add additional logging to confirm.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

@jingxu97 do you have any ideas on the best way to fix this? It looks like the similar wait-for-sync code has changed a bit in master. I could try adding a simple loop to wait for hasSynced in 1.4 to confirm this is our problem.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

Added that fix at jsravn@5d3c1da; I'm going to deploy this version of kube-controller-manager and verify whether it fixes the restart issue.

@jsravn
Contributor Author

jsravn commented Mar 30, 2017

I can verify the node informer wasn't synced in time: I get the message "Waiting for node store to sync before processing nodes" on startup. So I believe this is the bug, and it will cause volume information to be out of sync on kube-controller-manager startups/restarts.
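For reference, here is a minimal, standard-library-only sketch of the kind of wait described above. The function name, message wording, and timeout are illustrative only; this is not the actual jsravn@5d3c1da change.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForNodeSync blocks until hasSynced reports true or the timeout expires.
// hasSynced stands in for the node informer's HasSynced method.
func waitForNodeSync(hasSynced func() bool, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for !hasSynced() {
		if time.Now().After(deadline) {
			return errors.New("timed out waiting for node store to sync")
		}
		fmt.Println("Waiting for node store to sync before processing pods")
		time.Sleep(100 * time.Millisecond)
	}
	return nil
}

func main() {
	start := time.Now()
	// Simulated node informer that reports synced after one second.
	hasSynced := func() bool { return time.Since(start) > time.Second }

	if err := waitForNodeSync(hasSynced, 10*time.Second); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("node store synced; safe to process pod events")
}
```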

@gnufied
Member

gnufied commented Mar 30, 2017

How about we add all such pods to a data structure called PodsOnHold, and when the nodeAdd event runs for such a node, trigger an event inside PodsOnHold to process all pending pods on that node?

Similarly, when a node is deleted, we need to clear out any pending items in PodsOnHold.
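A rough, standard-library-only sketch of what such a structure could look like; the PodsOnHold name comes from this comment, but the method names and signatures here are assumptions, not existing Kubernetes code:

```go
package main

import (
	"fmt"
	"sync"
)

// PodsOnHold buffers pods whose node has not been seen yet, keyed by node name.
type PodsOnHold struct {
	mu      sync.Mutex
	pending map[string][]string // node name -> pod keys waiting for that node
}

func NewPodsOnHold() *PodsOnHold {
	return &PodsOnHold{pending: map[string][]string{}}
}

// Hold records a pod that could not be processed because its node is unknown.
func (p *PodsOnHold) Hold(nodeName, podKey string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending[nodeName] = append(p.pending[nodeName], podKey)
}

// OnNodeAdd replays every held pod for the node through process().
func (p *PodsOnHold) OnNodeAdd(nodeName string, process func(podKey string)) {
	p.mu.Lock()
	pods := p.pending[nodeName]
	delete(p.pending, nodeName)
	p.mu.Unlock()
	for _, podKey := range pods {
		process(podKey)
	}
}

// OnNodeDelete drops any pods still waiting for a node that went away.
func (p *PodsOnHold) OnNodeDelete(nodeName string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.pending, nodeName)
}

func main() {
	hold := NewPodsOnHold()
	hold.Hold("node-a", "default/influxdb-0")
	hold.OnNodeAdd("node-a", func(podKey string) {
		fmt.Println("reprocessing", podKey)
	})
}
```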

@jsravn
Contributor Author

jsravn commented Mar 31, 2017

I think the other controllers handle this pretty well; you can take a look at how the endpoints controller does it in 1.4/1.5. It has its own work queue, where it puts services to process while it waits for the pod informer to finish syncing. Although I noticed this changed quite a lot in master; I haven't gone through those changes yet, though.

@wongma7
Contributor

wongma7 commented Apr 3, 2017

@jsravn so to clarify, jsravn@5d3c1da fixed this? Because now it correctly waits for the node store to sync before attempting processing?

Someone else also discovered that the attach/detach controller never bothers to wait for its caches to sync here: https://github.com/kubernetes/kubernetes/pull/40385/files#diff-72e66cba2ced392dc0f9cd10c424d6d0R225. So if we uncomment that and fix the mentioned test, that should fix this issue, right?

@gnufied
Member

gnufied commented Apr 3, 2017

@wongma7 We can't put a sleep in the code there, because events are processed in the main goroutine that is started by the controller manager. Putting a sleep there would potentially block other things.

I am leaning towards adding an internal queue, as the endpoints controller does. cc @saad-ali @jingxu97

@wongma7
Contributor

wongma7 commented Apr 3, 2017

WaitForCacheSync seems like the standard way to wait. If I understand correctly, @ncdc just didn't want to touch things he didn't need to in what was mostly a mechanical PR, so he left that as a comment for discussion. So I'm saying that if @jsravn's fix worked, we can uncomment the WaitForCacheSync in my second link as the "proper" fix.

@ncdc
Member

ncdc commented Apr 4, 2017

@wongma7 it's not as simple as uncommenting. You'll have to fix the integration test I referenced as well.

@gnufied
Member

gnufied commented Apr 4, 2017

Also, AFAICT just putting WaitForCacheSync in Run will not automatically fix this bug. The events are processed on a separate goroutine from the one that runs the attach/detach controller loop. Uncommenting that line in Run will simply block the controller loop. Events will still be missed.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@gnufied I am not sure about the comment "Events will still be missed". Shouldn't the informer replay the events from the period when the controller was stopped (not yet started)?
@ncdc why doesn't it set pvcsSynced and pvsSynced to alwaysReady?

@jsravn
Contributor Author

jsravn commented Apr 4, 2017

@wongma7 that's correct. @gnufied yes, this is the dirty fix. But it won't block anything else, at least in 1.4: each listener in the shared informer gets its own goroutine for handling events, so it will just block the pod handler in the attach/detach controller.

WaitForCacheSync alone won't solve it; as @gnufied says, you will miss events. You need to queue the events for processing, like the endpoints controller does, or just block the event handler goroutine like I did.

@jsravn
Contributor Author

jsravn commented Apr 4, 2017

@jingxu97 sharedInformers.Start() is called after Run() (see controllermanager.go#StartControllers), so I don't think you can just call WaitForCacheSync in Run. You could call WaitForCacheSync in processPods, though (like I did).

@ncdc
Member

ncdc commented Apr 4, 2017

@jingxu97 @gnufied @jsravn

I am not sure about the comment "Events will still be missed"

Me either

Shouldn't the informer replay the events from the period when the controller was stopped (not yet started)?

The informers will resync their items every so often, as defined by either the global shared informer factory resync period, or the per-event handler resync period. But it's better to use WaitForCacheSync in the controller's Run function.

sharedInformers.Start() is called after Run() (see controllermanager.go#StartControllers), so I don't think you can just call WaitForCacheSync in Run. You could call WaitForCacheSync in processPods, though (like I did).

You can and should call WaitForCacheSync in Run. This is the common pattern in our controllers. The typical flow is:

  1. In your controller's constructor, call someInformer.Informer().AddEventHandler()
  2. The event handlers should add to a work queue
  3. In your controller's Run, call WaitForCacheSync
  4. In your controller's Run, once the caches are synced, start your worker goroutines
  5. The workers pop items off of the work queue and process them
  6. After executing all the controllers' constructors and Run functions, we call Start on the shared informer factory

See https://github.com/kubernetes/kubernetes/blob/8e26fa25da6d3b1deb333fe2484f794795d1c6b9/staging/src/k8s.io/kube-aggregator/pkg/controllers/autoregister/autoregister_controller.go for an example of a typical controller using a work queue.

If all we do is uncomment the call to WaitForCacheSync and somehow fix the integration test I referenced, you'll still have a situation where the event handler functions (e.g. podAdd/Update/Delete, nodeAdd/Update/Delete) may be called before WaitForCacheSync unblocks. I'm guessing that might cause issues, which is why you'd want to convert to using work queues.
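For illustration, here is a stripped-down, standard-library-only sketch of that flow; a buffered channel and plain closures stand in for client-go's workqueue and HasSynced functions, and none of this is the real attach/detach controller code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type controller struct {
	queue      chan string // stand-in for a rate-limited workqueue
	nodeSynced func() bool // stand-in for the node informer's HasSynced
	podSynced  func() bool // stand-in for the pod informer's HasSynced
}

// podAdd is the event handler: it only enqueues, it never processes directly.
func (c *controller) podAdd(podKey string) {
	c.queue <- podKey
}

// Run waits for the caches to sync, then starts the workers.
func (c *controller) Run(workers int, stopCh <-chan struct{}) {
	for !(c.nodeSynced() && c.podSynced()) { // WaitForCacheSync equivalent
		time.Sleep(100 * time.Millisecond)
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case key := <-c.queue:
					// A real worker would sync the pod's volumes into the
					// desired state of the world here.
					fmt.Println("processing", key)
				case <-stopCh:
					return
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := &controller{
		queue:      make(chan string, 16),
		nodeSynced: func() bool { return true },
		podSynced:  func() bool { return true },
	}
	stop := make(chan struct{})
	c.podAdd("default/influxdb-0") // event arrives before Run; it just sits in the queue
	go func() {
		time.Sleep(500 * time.Millisecond)
		close(stop)
	}()
	c.Run(2, stop)
}
```

The key property is that the event handlers only enqueue keys, so events that arrive before the caches are synced are parked in the queue rather than processed against incomplete state.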

@gnufied
Member

gnufied commented Apr 4, 2017

Me either

You answered it yourself in the last part of your reply. Just putting the WaitForCacheSync line in the Run function will result in a situation where the event handler functions are called before WaitForCacheSync unblocks.

The resync interval in the attach/detach controller is 12 hours, and hence not much use in practice.

That is why we need both work queues and WaitForCacheSync to properly fix this problem.

@ncdc
Member

ncdc commented Apr 4, 2017

OK, a difference in terminology. The events are not missed per se; they're received and processed, possibly incorrectly without a work queue.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@ncdc, can I ask what might happen if the controller starts processing before WaitForCacheSync returns true?

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@ncdc Also, I'm still not sure why it doesn't set pvcsSynced and pvsSynced to alwaysReady.

@gnufied
Member

gnufied commented Apr 4, 2017

@jingxu97 WaitForCacheSync is a must-have if we are going to implement worker queues for processing events. If the controller starts processing events without waiting for WaitForCacheSync to return, we are back to square one with the race between the node and pod informer caches: a pod event from the queue could be processed before the corresponding node add event, which would make the whole queue implementation pointless.

@gnufied
Member

gnufied commented Apr 4, 2017

/assign @gnufied

Going to take a stab at it.

@jingxu97
Contributor

jingxu97 commented Apr 4, 2017

@gnufied From my understanding, a pod event could still be processed before a node event even after WaitForCacheSync returns. WaitForCacheSync just makes sure that, when the controller starts processing events, the informer cache is synced.

@ncdc
Member

ncdc commented Apr 4, 2017

@jingxu97 that's correct. If you have 2 fully synced caches (pods, nodes) and events arrive, there's no guarantee as to the order in which they arrive. You have to be able to handle situations where either the pod or the node arrives first.

@jsravn
Contributor Author

jsravn commented Apr 7, 2017

We can just add the pod back into the worker queue if its node hasn't been processed yet, right?
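As a sketch of that idea (the helper names and callbacks here are illustrative, not real controller code):

```go
package main

import "fmt"

// syncPod illustrates the requeue idea: if the pod's node is not known yet,
// put the pod key back on the queue instead of dropping it.
func syncPod(podKey, nodeName string, nodeKnown func(string) bool, requeue func(string)) {
	if !nodeKnown(nodeName) {
		requeue(podKey) // a real workqueue would rate-limit the retry
		return
	}
	fmt.Println("adding", podKey, "on", nodeName, "to the desired state of the world")
}

func main() {
	known := map[string]bool{"node-a": true}
	nodeKnown := func(n string) bool { return known[n] }
	requeue := func(key string) { fmt.Println("node not seen yet, requeueing", key) }

	syncPod("default/influxdb-0", "node-b", nodeKnown, requeue) // requeued
	syncPod("default/influxdb-0", "node-a", nodeKnown, requeue) // processed
}
```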

@jingxu97
Contributor

PR #42033 takes a different approach to solving the problem. There is a desiredStatePopulator in the attach/detach controller which currently periodically verifies with the API server that the pods in the desired state of the world still exist; if not, it removes them.
PR #42033 adds a check over all the active pods from the API server and ensures that each one that has volumes exists in the desired state of the world cache. If the pod and node events arrive out of order and podAdd is ignored, the periodic check will eventually add the pod to the desired state. I think it should solve the issue.
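As a rough, standard-library-only illustration of that kind of periodic reconciliation (this is not the actual #42033 code; the names and signatures are assumptions):

```go
package main

import (
	"fmt"
	"time"
)

type pod struct {
	key        string
	hasVolumes bool
}

// populatorLoop periodically re-lists the active pods and makes sure every pod
// that has volumes is present in the desired state of the world, catching any
// podAdd events that were dropped because of informer ordering.
func populatorLoop(listActivePods func() []pod, inDesiredState func(string) bool,
	addToDesiredState func(string), interval time.Duration, stopCh <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			for _, p := range listActivePods() {
				if p.hasVolumes && !inDesiredState(p.key) {
					addToDesiredState(p.key)
				}
			}
		case <-stopCh:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go func() {
		time.Sleep(250 * time.Millisecond)
		close(stop)
	}()
	populatorLoop(
		func() []pod { return []pod{{key: "default/influxdb-0", hasVolumes: true}} },
		func(string) bool { return false }, // pretend the pod is missing from the desired state
		func(key string) { fmt.Println("re-adding", key, "to the desired state") },
		100*time.Millisecond,
		stop,
	)
}
```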

@jsravn
Contributor Author

jsravn commented Apr 13, 2017

I'm not sure what the decided approach is here. I was thinking more about the worker queue approach and think it might be overkill, especially since it will have to do a sleep/loop anyway. I think a simple loop like the one I added in the pod handler should suffice, possibly with the restriction that it gives up after a period of time if the node informer doesn't sync in time. My changes have fixed all the issues we had in our clusters with kube-controller-manager restarts. Should I clean it up and open a PR?

I don't know about #42033; it seems like yet another band-aid. There isn't any discussion or links in the PR as to why it's being done. I'm guessing this is to handle the edge case of a missed pod update during restarts? Having it list the entire world every minute feels like a bad approach; that should be handled by the informers via notifications and a proper delta protocol, with #39732 to recover missed pods on startup.

@wongma7
Contributor

wongma7 commented Apr 13, 2017

I don't think #42033 will fix this issue entirely either... hard to say whether it's intended to fix this issue or whether that's just a coincidence :). Disregarding whether it's a good or bad approach, there is still a roughly one-minute window in which a detach can be missed entirely if you delete the pod during a polling window, which is IMO totally reasonable to expect:

  1. Controller starts
  2. As reported in this issue, a pod from podAdd isn't processed because its node isn't in the informer cache yet and nodeAdd for the node hasn't fired yet
  3. The DSWP runs and the new 'findAndAddActivePods' runs; the node still hasn't been added to the cache, so the pod isn't processed
  4. User deletes the pod within ~1 minute after the restart.
  5. Pod's volume is never detached

So I'm thinking now that one of these is needed to fix it fully:

(in addition to #39732 for fixing the restart case which leads to the same symptoms as this issue)

edit: to be clear, I agree #42033 should be unnecessary; we should be able to trust the informers except at startup, when their caches aren't even ready yet. After the first poll, it's basically not doing anything useful...

@jingxu97
Contributor

jingxu97 commented Apr 14, 2017

There are two race condition issues around controller restarts that need to be solved here:
Issue 1: a pod is deleted while the controller is down. After the controller starts, this pod and its volume information will be missing from the desired state.
Issue 2: the podAdd event happens before the nodeAdd event, so the pod fails to be added to the desired state.

#39732 is an effort to fix issue 1. The idea is that the node status might still have the attached volume in its "VolumesAttached" field, so the desired state can be rebuilt from it. This approach has a couple of minor issues.

  • One is that the information in the "VolumesAttached" field is not enough to fully construct the data structure in the desired state, so some information has to be faked. The approach still works, though, because the faked information should not actually be used anywhere.
  • The other issue is that, because the controller deletes the volume from the "VolumesAttached" list before actually performing the detach operation, there is still a chance that we lose the information about a volume while it is still attached. The sequence of events might be: 1. the pod is deleted; 2. the controller checks the condition and removes the volume from "VolumesAttached", but has not yet called the "detach" operation; 3. the controller crashes; 4. the controller restarts, and the volume information cannot be found anywhere (neither in the pod nor in VolumesAttached in the node status).

About issue 2, as @wongma7 mentioned, this thread has discussed a few possible approaches. But I think WaitForCacheSync is required no matter which approach we use, because otherwise, when a pod is added to the desired state, the cached information from the informer might be out of date. Comparing the different approaches:

  1. fix TODO: find and add active pods for dswp #42033: overhead from periodically checking all the pods
  2. worker queue: requires a big change in the controller
  3. @jsravn's pod handler (could you explain this approach in more detail?)

For approach 1, I do see there are other places that periodically check all the pods (or we might only check active pods?), and we need some stress tests to prove it out. @jsravn could you give more details about your approach? Do you also have a PR for it? Thanks!

@jsravn
Contributor Author

jsravn commented Apr 14, 2017

@jingxu97 it is this change jsravn@5d3c1da. I can do a PR for it on Monday.

@jsafrane
Member

I am quite late to this discussion, and I just have a stupid idea. If we have issues with the podAdd event arriving before the nodeAdd event, can we add some code to nodeAdd that lists all pods on the node and calls podAdd on them? podAdd might be called twice for the same pod, and even in parallel (the node and pod informers run in separate goroutines and may process nodeAdd and podAdd at the same time), but that's still much better than not calling podAdd at all. Work queues would help here to get rid of the unwanted parallelism.

One thing that can also happen is that the controller is running (no crash recovery or anything) when a node is added to the cluster and the DaemonSet controller adds a pod to that node. Our attach/detach controller can get these events in any order, and it should process them reliably.

I am afraid that #42033 is just a workaround and not a solution.
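A minimal sketch of the nodeAdd relisting idea above (the lister and handler signatures are assumptions, not the real informer API): when a node is added, look up every pod scheduled to it and push each one back through the pod handler.

```go
package main

import "fmt"

type podInfo struct {
	key      string
	nodeName string
}

// nodeAdd re-drives podAdd for every pod already scheduled to the new node,
// so pods whose podAdd arrived before this nodeAdd are not lost.
func nodeAdd(nodeName string, listPods func() []podInfo, podAdd func(podInfo)) {
	for _, p := range listPods() {
		if p.nodeName == nodeName {
			// podAdd must tolerate being called more than once for the same pod.
			podAdd(p)
		}
	}
}

func main() {
	pods := []podInfo{
		{key: "default/influxdb-0", nodeName: "node-a"},
		{key: "default/web-1", nodeName: "node-b"},
	}
	nodeAdd("node-a",
		func() []podInfo { return pods },
		func(p podInfo) { fmt.Println("reprocessing", p.key) })
}
```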

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

@jsafrane I'm a little disappointed to see #42033 got merged into master. Maybe it will help mitigate some of these bugs, though (with the drawbacks already mentioned).

Anyway, I realised my approach of simply waiting for the node informer to sync in processPods still has a race; you really do need the work queues. Not sure if @gnufied has had a chance to do this; otherwise I'll probably whip up a PR for it shortly.

@gnufied
Member

gnufied commented Apr 19, 2017

@jsravn If you don't beat me to it, I will open a PR today that fixes this issue properly. I am sorry, I have been caught up in a bunch of other things.

And yeah, #42033 is not a proper fix. The current situation is bad. For example, the other day I was running some tests and every single pod was getting skipped on controller restart: https://gist.github.com/gnufied/93135976373857e27fe14619f8e4ca7d

However, @wongma7's PR #44566 should make it somewhat better.

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

Having spent more time on it, I don't really get the point of desired_state_of_world_populator.go. I feel like this populator should be removed and the informer resync period reduced instead, if we want to refresh occasionally.

I've attempted @jsafrane's solution; it is probably the simplest approach. The queue approach would need heavier modification, e.g. it would need to store the state of all nodes, not just the managed ones, so we know whether a pod should be skipped or re-queued.

@0xmichalis
Contributor

0xmichalis commented Apr 19, 2017

you really do need the work queues

I had a look into the attach/detach controller today and I was wondering why this is missing.

@wongma7
Contributor

wongma7 commented Apr 19, 2017

Regarding desired_state_of_world_populator.go: I did some digging a while ago, and it was added because a flake was encountered where pod delete events were getting missed, and there's no notion of "resyncing" delete events. Let me find the exact issue.

#42033 copied the pattern of "double checking" deletes over to double checking adds/updates. I did bring up changing the resync period as an alternative in the PR; that would IMO be less error-prone and seen as less of a workaround/band-aid, simply because there's less code to write and we can rely on the existing podAdd/podUpdate mechanism. But since the PR was already written, I didn't object any further than that.

It's important to note that at the time the attach/detach controller was first written, one could not set custom resync periods when using shared informers; that was added only recently. See also https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/volume/attachdetach/attach_detach_controller.go#L83

@jsravn
Contributor Author

jsravn commented Apr 19, 2017

I just opened #44675, which relists pods when a managed node is added. That handles the case of a controller restart as well as nodes being dynamically added. I think this will be easy to cherry-pick for 1.5. I started working on the queue side, but it's a bit more involved. Happy to do that instead, or leave it to @gnufied, if the relist isn't seen as a good solution.

@wongma7 makes sense. I was checking our cluster logs for the last 3 weeks and I haven't ever seen the populator do anything. With a shorter resync period it is probably unnecessary now.

@gnufied
Member

gnufied commented Apr 20, 2017

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

@0xmichalis
Contributor

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

@gnufied most? Can you point me to those?

@saad-ali
Member

@saad-ali is there any reason we can't reduce resync period now? Most other controllers have switched to 30s default resync periods...

I haven't dug into it recently. With the populator changes this should be less critical now.

k8s-github-robot added the needs-sig label (Indicates an issue or PR lacks a `sig/foo` label and requires one.) on May 31, 2017
@gnufied
Member

gnufied commented Jun 14, 2017

I think this should be fixed by #44566 and #42033.

@jsravn can we close this?

@jsravn
Contributor Author

jsravn commented Jun 15, 2017

@gnufied Sure, although I haven't verified it yet. I can reopen it if I see it again.

jsravn closed this as completed on Jun 15, 2017