Make the scheduler more resilient to zombie pods #6023
Labels
priority/important-soon
sig/scheduling
This documents some issues with scheduler resilience to zombie pods, and a proposed solution.
TL;DR
Background
The scheduler currently has three stores: assumed, scheduled, and queued. The last is a FIFO that holds new pods until the scheduler gets to them; certain scenarios can leave zombie pods in the other two.
We currently have an assumed pod store to avoid the race where we assign a pod via a binding and then make a wrong assignment for the next pod because we haven't yet heard confirmation from etcd (via the scheduled pods reflector). Previously, bound-pods prevented this situation. To solve it now, we write the pod to a local store that we consult before making decisions, as sketched below.
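A minimal sketch of that idea, with hypothetical names (not the real scheduler types): pods are assumed locally right after the binding is sent, consulted before the next decision, and forgotten once they are confirmed elsewhere.

```go
// Hypothetical assumed-pods store, simplified to names and host ports.
package main

import (
	"fmt"
	"sync"
)

type Pod struct {
	Name     string
	HostPort int
}

type AssumedPods struct {
	mu   sync.Mutex
	pods map[string]Pod
}

func NewAssumedPods() *AssumedPods {
	return &AssumedPods{pods: map[string]Pod{}}
}

// Assume is called right after the scheduler sends a binding, before the
// scheduled-pods reflector has observed the pod.
func (a *AssumedPods) Assume(p Pod) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.pods[p.Name] = p
}

// Forget is called once the pod shows up in the scheduled pods store
// (or is otherwise known to be gone).
func (a *AssumedPods) Forget(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.pods, name)
}

// List returns a snapshot that the scheduler consults before deciding.
func (a *AssumedPods) List() []Pod {
	a.mu.Lock()
	defer a.mu.Unlock()
	out := make([]Pod, 0, len(a.pods))
	for _, p := range a.pods {
		out = append(out, p)
	}
	return out
}

func main() {
	assumed := NewAssumedPods()
	assumed.Assume(Pod{Name: "pod1", HostPort: 8080})
	fmt.Println(assumed.List()) // pod1 is visible before etcd confirms the binding
}
```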
Writing to the assumed store this way is racy: if the pod we just wrote is deleted before another pod is scheduled, it remains in the assumed pod store as a zombie, consuming resources forever. (If another pod is scheduled before this one is deleted, the pod shows up in the scheduled pods store, gets purged from assumed pods, and there is no zombie.)
So specifically:
2 pods: Pod1:8080, Pod2:8080
Scheduler: Assign Pod1:8080, write it to the assumed store, send the binding to the apiserver
Reflector: Insert Pod1 into the scheduled pod store
Kubectl: Delete Pod1
Reflector: Delete Pod1 from the scheduled pod store <- This needs to happen before the next pod is scheduled for the zombie to exist
Scheduler: Try to assign Pod2:8080, find Pod1 still in the assumed pod store, fail to schedule (see the sketch below)
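A sketch of the failing decision step, simplified to host ports only (the predicate and names are hypothetical, not the actual scheduler predicates): because the check covers the union of scheduled and assumed pods, the stale Pod1 entry makes Pod2:8080 unschedulable.

```go
// Hypothetical port-conflict check over both stores.
package main

import "fmt"

type Pod struct {
	Name     string
	HostPort int
}

// portFree reports whether hostPort is unused across both stores.
func portFree(hostPort int, scheduled, assumed []Pod) bool {
	for _, p := range append(append([]Pod{}, scheduled...), assumed...) {
		if p.HostPort == hostPort {
			return false
		}
	}
	return true
}

func main() {
	scheduled := []Pod{}                             // the reflector already processed the delete
	assumed := []Pod{{Name: "pod1", HostPort: 8080}} // zombie: never purged
	fmt.Println(portFree(8080, scheduled, assumed))  // false -> Pod2:8080 fails to schedule
}
```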
Another situation that could cause a zombie is a network flake. If a delete watch event is dropped, the scheduled pod store will retain the deleted pod, blocking its resources (see the sketch below).
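A sketch of why a dropped delete leaves a zombie, assuming a purely event-driven store (hypothetical types, not the real reflector code): the store only shrinks when a Deleted event is actually observed.

```go
// Hypothetical event-driven pod store: correctness depends on every event arriving.
package main

import "fmt"

type EventType string

const (
	Added   EventType = "Added"
	Deleted EventType = "Deleted"
)

type Event struct {
	Type EventType
	Name string
}

func apply(store map[string]bool, e Event) {
	switch e.Type {
	case Added:
		store[e.Name] = true
	case Deleted:
		delete(store, e.Name) // if this event is lost, the entry stays forever
	}
}

func main() {
	store := map[string]bool{}
	apply(store, Event{Added, "pod1"})
	// The Deleted event for pod1 is dropped by a network flake and never reaches apply().
	fmt.Println(store) // map[pod1:true] -> pod1's resources stay blocked
}
```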
Solutions
Changing lanes, we can solve this problem with a TTL on the assumed pod store, so that any stale pods get cleared out when they expire. Even this is racy: if it takes longer than the TTL for the watch event to hit the scheduled pods store (a real partition, a kube-proxy bug, or etcd going down for 30s), we lose the pod from the assumed pods store and make the wrong decision for newer pods. Also, a 30s TTL during which a user's pod remains strangely un-assigned makes for a bad experience. A sketch of the bare TTL approach follows.
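A minimal sketch of the TTL idea under the same simplified model (hypothetical names, locking omitted for brevity): each assumed entry carries a deadline, expired entries are pruned on read, so a zombie clears itself once the TTL passes.

```go
// Hypothetical TTL-based assumed-pods store.
package main

import (
	"fmt"
	"time"
)

type assumedEntry struct {
	hostPort int
	deadline time.Time
}

type TTLAssumedPods struct {
	ttl  time.Duration
	pods map[string]assumedEntry
}

func NewTTLAssumedPods(ttl time.Duration) *TTLAssumedPods {
	return &TTLAssumedPods{ttl: ttl, pods: map[string]assumedEntry{}}
}

func (s *TTLAssumedPods) Assume(name string, hostPort int) {
	s.pods[name] = assumedEntry{hostPort: hostPort, deadline: time.Now().Add(s.ttl)}
}

// Live returns only the entries whose TTL has not expired, pruning the rest.
func (s *TTLAssumedPods) Live() map[string]int {
	now := time.Now()
	out := map[string]int{}
	for name, e := range s.pods {
		if now.After(e.deadline) {
			delete(s.pods, name) // expired: drop the (possibly zombie) entry
			continue
		}
		out[name] = e.hostPort
	}
	return out
}

func main() {
	s := NewTTLAssumedPods(50 * time.Millisecond)
	s.Assume("pod1", 8080)
	fmt.Println(len(s.Live())) // 1: still assumed
	time.Sleep(60 * time.Millisecond)
	fmt.Println(len(s.Live())) // 0: expired, the zombie is cleared
}
```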
So, a combination is proposed in the TL;DR.
@lavalamp