
Make the scheduler more resilient to zombie pods #6023

Closed
bprashanth opened this issue Mar 26, 2015 · 1 comment

@bprashanth (Contributor)

This documents some issues and a solution for scheduler resilience.

TL;DR

  • We need a ttl that forces the system to re-evaluate assumed pods periodically
  • Every delete watch event needs to get piped from scheduled pods -> assumed pods (see the sketch after this list)
  • If a partition causes dropped watch events, both stores should recover:
    • the ttl should expire the pod from the assumed store
    • a periodic re-list should refresh pods in the scheduled pods store.
  • The scheduled pods store should not try to correct the assumed pods store with its relist.
  • If the scheduled pods store deletes a pod from the assumed pods store before the scheduler writes that same pod into the assumed store, we will have a zombie for ttl seconds.
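
To make the delete piping concrete, here is a minimal sketch with hypothetical names and simplified types (not the scheduler's actual structs): the reflector's delete handler forwards the event to the assumed store.

```go
package scheduler

import "sync"

// assumedStore is a hypothetical stand-in for the scheduler's assumed pod
// store; only the scheduler writes to it via Assume, but delete events
// observed by the scheduled pods reflector are piped in via Forget.
type assumedStore struct {
	mu   sync.Mutex
	pods map[string]struct{} // keyed by namespace/name
}

func newAssumedStore() *assumedStore {
	return &assumedStore{pods: map[string]struct{}{}}
}

func (s *assumedStore) Assume(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pods[key] = struct{}{}
}

func (s *assumedStore) Forget(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.pods, key)
}

// onScheduledPodDelete is what the reflector's delete handler would call:
// the event updates the scheduled pods store and is forwarded to the
// assumed store, so a confirmed delete cannot leave a zombie behind.
func onScheduledPodDelete(scheduled map[string]struct{}, assumed *assumedStore, key string) {
	delete(scheduled, key)
	assumed.Forget(key)
}
```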

Background
The scheduler currently has 3 stores: assumed, scheduled and queued. The last is a fifo that holds new pods until the scheduler has time for them; certain scenarios can leave zombie pods in the other 2.
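
For orientation, a rough sketch of how the three stores relate, using simplified hypothetical types rather than the scheduler's actual store machinery:

```go
package scheduler

type Pod struct {
	Namespace, Name string
	HostPort        int
}

type schedulerStores struct {
	queued    chan *Pod       // fifo: new pods wait here until the scheduler gets to them
	scheduled map[string]*Pod // keyed by namespace/name, kept up to date by the reflector
	assumed   map[string]*Pod // keyed by namespace/name, written by the scheduler at bind time
}
```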

We currently have an assumed pod store to avoid race conditions where we assign a pod via a binding and then make a wrong assignment for the next pod because we still haven't heard confirmation from etcd (via the scheduled pods reflector). Previously, bound-pods prevented this situation. To solve it, we write the pod to a local store that we consult before making decisions.

Doing so is racy, because if the pod we just wrote is deleted before another pod is scheduled, it will remain in the assumed pod store as a zombie consuming resources forever (if another pod is scheduled before this one is deleted, it will exist in the scheduled pods store => purged from assumed pods => no zombie).

So specifically:
2 pods: Pod1:8080, Pod2:8080
Scheduler: Assign Pod1:8080, write to assume store, send to apiserver
Reflector: Insert into scheduled pod store
Kubectl: delete
Reflector: Delete from scheduled pod store <- This needs to happen before the next pod is scheduled for the zombie to exist
Scheduler: Try to assign Pod2:8080, finds Pod1:8080 in the assumed pod store, fails to schedule
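
For illustration, a minimal sketch of the assume-before-bind flow behind this race, with nodes elided, a single host port standing in for the real predicates, and all names hypothetical:

```go
package scheduler

import "fmt"

type Pod struct {
	Name     string
	HostPort int
}

// conflicts reports whether any pod already recorded in store wants the same
// host port as p; it stands in for the scheduler's real predicate checks.
func conflicts(p *Pod, store map[string]*Pod) bool {
	for _, other := range store {
		if other.HostPort == p.HostPort {
			return true
		}
	}
	return false
}

// scheduleOne sketches the assume-before-bind flow: consult both the
// scheduled and assumed stores, then record the decision locally before the
// apiserver confirms the binding. If Pod1:8080 was deleted but lingers in
// the assumed store, Pod2:8080 fails here even though the port is free.
func scheduleOne(p *Pod, scheduled, assumed map[string]*Pod, bind func(*Pod) error) error {
	if conflicts(p, scheduled) || conflicts(p, assumed) {
		return fmt.Errorf("cannot schedule %s: host port %d already claimed", p.Name, p.HostPort)
	}
	assumed[p.Name] = p // assume immediately so the next decision sees this pod
	return bind(p)      // send the binding to the apiserver
}
```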

Another situation that could cause a zombie is network flake. If a delete watch event is dropped, the scheduled pod store will contain the deleted pod blocking its resources.

Solutions

  1. One way to solve the first problem is to have the delete remove the pod from both stores, scheduled and assumed. This is racy, because the delete could happen before the write to the assumed store.
  2. We could modify that solution so that the delete is allowed iff the pod exists, and only the assumer may write to the store (so the deleter would hang till the pod exists). Though this would work, it isn't resilient to network partitions: once the network recovers, the system can never correct itself, because the delete event was dropped.
  3. We could periodically re-list and correct assumed pods. This is racy because we could do a list just before a bind and clear out the pod we just scheduled.
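
Per the TL;DR, the re-list that does survive into the proposal only refreshes the scheduled pods store and leaves the assumed pods store to its ttl. A minimal sketch, with a hypothetical listPods standing in for the apiserver call:

```go
package scheduler

import (
	"sync"
	"time"
)

type Pod struct {
	Name     string
	HostPort int
}

// relistScheduledPods periodically replaces the contents of the scheduled
// pods store with a fresh list from the apiserver, so a dropped delete
// watch event is eventually corrected. Per the TL;DR it deliberately does
// not touch the assumed pods store; that one is healed by its ttl instead.
func relistScheduledPods(mu *sync.Mutex, scheduled map[string]*Pod, listPods func() []*Pod, period time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			fresh := listPods()
			mu.Lock()
			for k := range scheduled {
				delete(scheduled, k)
			}
			for _, p := range fresh {
				scheduled[p.Name] = p
			}
			mu.Unlock()
		case <-stop:
			return
		}
	}
}
```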

Changing lanes, we can solve this problem with a ttl on the assumed pod store. That way any weird stale pods get cleared out when the ttl expires. Even this is racy, because if it takes > ttl for the watch event to hit the scheduled pods store (so a real partition, a kube-proxy bug, or etcd going down for 30s) we will end up losing the pod from the assumed pods store and making the wrong decision for newer pods. Also, a ttl of 30s during which a user's pod remains weirdly un-assigned makes for a bad experience.
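
A minimal sketch of what a ttl on the assumed pod store could look like (illustrative names, lazy expiry on read; not the actual implementation):

```go
package scheduler

import (
	"sync"
	"time"
)

// ttlStore is an illustrative sketch of an assumed pod store whose entries
// expire after a ttl, so a zombie can block resources for at most ttl
// seconds even if its delete event never arrives.
type ttlStore struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]time.Time // pod key -> time the pod was assumed
}

func newTTLStore(ttl time.Duration) *ttlStore {
	return &ttlStore{ttl: ttl, entries: map[string]time.Time{}}
}

func (s *ttlStore) Assume(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[key] = time.Now()
}

func (s *ttlStore) Forget(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.entries, key)
}

// Contains expires stale entries lazily on read.
func (s *ttlStore) Contains(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	assumedAt, ok := s.entries[key]
	if !ok {
		return false
	}
	if time.Since(assumedAt) > s.ttl {
		delete(s.entries, key)
		return false
	}
	return true
}
```

The Contains check is where the tradeoff from the paragraph above shows up: a larger ttl means longer-lived zombies, while a smaller ttl raises the chance of forgetting a genuinely bound pod before confirmation arrives from etcd.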

So, a combination is proposed in the TL;DR.

@lavalamp

@bprashanth bprashanth added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. team/master labels Mar 26, 2015
@lavalamp lavalamp added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Mar 26, 2015
@bgrant0607 bgrant0607 added this to the v1.0 milestone Mar 28, 2015
@lavalamp (Member)

This is fixed now.
