Make the scheduler more resilient to zombie pods #6023
Labels
priority/important-soon
sig/scheduling
This documents some issues with scheduler resilience to zombie pods, and a proposed solution.
TL;DR
Background
The scheduler currently has three stores: assumed, scheduled, and queued. The last is a FIFO that holds new pods until the scheduler gets to them; certain scenarios can leave zombie pods in the other two.
We currently have an assumed pod store to avoid the race where we assign a pod via a binding and then make a wrong assignment for the next pod because we haven't yet heard confirmation from etcd (via the scheduled pods reflector). Previously, bound-pods prevented this situation. To solve it now, we write the pod to a local store that we consult before making decisions, as sketched below.
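A minimal sketch of that idea, with hypothetical names (not the real scheduler types): pods are assumed locally right after the binding is sent, consulted before the next decision, and forgotten once they are confirmed elsewhere.

```go
// Hypothetical assumed-pods store, simplified to names and host ports.
package main

import (
	"fmt"
	"sync"
)

type Pod struct {
	Name     string
	HostPort int
}

type AssumedPods struct {
	mu   sync.Mutex
	pods map[string]Pod
}

func NewAssumedPods() *AssumedPods {
	return &AssumedPods{pods: map[string]Pod{}}
}

// Assume is called right after the scheduler sends a binding, before the
// scheduled-pods reflector has observed the pod.
func (a *AssumedPods) Assume(p Pod) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.pods[p.Name] = p
}

// Forget is called once the pod shows up in the scheduled pods store
// (or is otherwise known to be gone).
func (a *AssumedPods) Forget(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.pods, name)
}

// List returns a snapshot that the scheduler consults before deciding.
func (a *AssumedPods) List() []Pod {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := make([]Pod, 0, len(a.pods))
	for _, p := range a.pods {
		out = append(out, p)
	}
	return out
}

func main() {
	assumed := NewAssumedPods()
	assumed.Assume(Pod{Name: "pod1", HostPort: 8080})
	fmt.Println(assumed.List()) // pod1 is visible before etcd confirms the binding
}
```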
Writing to the assumed store this way is racy: if the pod we just wrote is deleted before another pod is scheduled, it remains in the assumed pod store as a zombie, consuming resources forever. (If another pod is scheduled before this one is deleted, the pod shows up in the scheduled pods store, gets purged from assumed pods, and there is no zombie.)
So specifically:
2 pods: Pod1:8080, Pod2:8080
Scheduler: Assign Pod1:8080, write it to the assumed store, send the binding to the apiserver
Reflector: Insert Pod1 into the scheduled pod store
Kubectl: Delete Pod1
Reflector: Delete Pod1 from the scheduled pod store <- This needs to happen before the next pod is scheduled for the zombie to exist
Scheduler: Try to assign Pod2:8080, find Pod1 still in the assumed pod store, fail to schedule (see the sketch below)
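A sketch of the failing decision step, simplified to host ports only (the predicate and names are hypothetical, not the actual scheduler predicates): because the check covers the union of scheduled and assumed pods, the stale Pod1 entry makes Pod2:8080 unschedulable.

```go
// Hypothetical port-conflict check over both stores.
package main

import "fmt"

type Pod struct {
	Name     string
	HostPort int
}

// portFree reports whether hostPort is unused across both stores.
func portFree(hostPort int, scheduled, assumed []Pod) bool {
	for _, p := range append(append([]Pod{}, scheduled...), assumed...) {
		if p.HostPort == hostPort {
			return false
		}
	}
	return true
}

func main() {
	scheduled := []Pod{}                             // the reflector already processed the delete
	assumed := []Pod{{Name: "pod1", HostPort: 8080}} // zombie: never purged
	fmt.Println(portFree(8080, scheduled, assumed))  // false -> Pod2:8080 fails to schedule
}
```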
Another situation that could cause a zombie is a network flake. If a delete watch event is dropped, the scheduled pod store will retain the deleted pod, blocking its resources (see the sketch below).
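A sketch of why a dropped delete leaves a zombie, assuming a purely event-driven store (hypothetical types, not the real reflector code): the store only shrinks when a Deleted event is actually observed.

```go
// Hypothetical event-driven pod store: correctness depends on every event arriving.
package main

import "fmt"

type EventType string

const (
	Added   EventType = "Added"
	Deleted EventType = "Deleted"
)

type Event struct {
	Type EventType
	Name string
}

func apply(store map[string]bool, e Event) {
	switch e.Type {
	case Added:
		store[e.Name] = true
	case Deleted:
		delete(store, e.Name) // if this event is lost, the entry stays forever
	}
}

func main() {
	store := map[string]bool{}
	apply(store, Event{Added, "pod1"})
	// The Deleted event for pod1 is dropped by a network flake and never reaches apply().
	fmt.Println(store) // map[pod1:true] -> pod1's resources stay blocked
}
```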
Solutions
Changing lanes, we can solve this problem with a TTL on the assumed pod store, so that any stale pods get cleared out when they expire. Even this is racy: if it takes longer than the TTL for the watch event to hit the scheduled pods store (a real partition, a kube-proxy bug, or etcd going down for 30s), we lose the pod from the assumed pods store and make the wrong decision for newer pods. Also, a 30s TTL during which a user's pod remains strangely un-assigned makes for a bad experience. A sketch of the bare TTL approach follows.
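A minimal sketch of the TTL idea under the same simplified model (hypothetical names, locking omitted for brevity): each assumed entry carries a deadline, expired entries are pruned on read, so a zombie clears itself once the TTL passes.

```go
// Hypothetical TTL-based assumed-pods store.
package main

import (
	"fmt"
	"time"
)

type assumedEntry struct {
	hostPort int
	deadline time.Time
}

type TTLAssumedPods struct {
	ttl  time.Duration
	pods map[string]assumedEntry
}

func NewTTLAssumedPods(ttl time.Duration) *TTLAssumedPods {
	return &TTLAssumedPods{ttl: ttl, pods: map[string]assumedEntry{}}
}

func (s *TTLAssumedPods) Assume(name string, hostPort int) {
	s.pods[name] = assumedEntry{hostPort: hostPort, deadline: time.Now().Add(s.ttl)}
}

// Live returns only the entries whose TTL has not expired, pruning the rest.
func (s *TTLAssumedPods) Live() map[string]int {
	now := time.Now()
	out := map[string]int{}
	for name, e := range s.pods {
		if now.After(e.deadline) {
			delete(s.pods, name) // expired: drop the (possibly zombie) entry
			continue
		}
		out[name] = e.hostPort
	}
	return out
}

func main() {
	s := NewTTLAssumedPods(50 * time.Millisecond)
	s.Assume("pod1", 8080)
	fmt.Println(len(s.Live())) // 1: still assumed
	time.Sleep(60 * time.Millisecond)
	fmt.Println(len(s.Live())) // 0: expired, the zombie is cleared
}
```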
So, a combination is proposed in the TL;DR.
@lavalamp