Prevent evicted pods from being scheduled back onto the same node #18853
Description
Add something like `AvoidPreviousNode bool` to `api.DeleteOptions`.
The semantics are: if the pod is managed by a replication controller, the scheduler will try to put the replacement pod on a node different from the one where the pod you are deleting was running. (If the pod is not managed by a replication controller, the option has no effect.)
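A minimal sketch of these semantics, to make the "no effect" case concrete. Everything here is illustrative: `DeleteOptions` and `Pod` are simplified stand-ins rather than the real API types, and `ManagedByRC` and `nodeToAvoid` are hypothetical names that do not exist in the codebase.

```go
package main

import "fmt"

// DeleteOptions is a simplified stand-in for api.DeleteOptions.
type DeleteOptions struct {
	GracePeriodSeconds *int64
	// AvoidPreviousNode is the field proposed in this issue.
	AvoidPreviousNode bool
}

// Pod is a hypothetical, minimal pod representation for illustration only.
type Pod struct {
	Name        string
	NodeName    string
	ManagedByRC bool // assumption: true if a replication controller owns the pod
}

// nodeToAvoid returns the node the replacement pod should try to avoid.
// The second result is false when the option has no effect: the flag is
// unset, the pod is not managed by a replication controller, or the pod
// was never bound to a node.
func nodeToAvoid(p Pod, opts DeleteOptions) (string, bool) {
	if !opts.AvoidPreviousNode || !p.ManagedByRC || p.NodeName == "" {
		return "", false
	}
	return p.NodeName, true
}

func main() {
	opts := DeleteOptions{AvoidPreviousNode: true}
	managed := Pod{Name: "web-1", NodeName: "node-a", ManagedByRC: true}
	bare := Pod{Name: "debug", NodeName: "node-a"}

	fmt.Println(nodeToAvoid(managed, opts)) // managed pod: node-a is avoided
	fmt.Println(nodeToAvoid(bare, opts))    // unmanaged pod: option is ignored
}
```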
The proposed implementation is:
- add an `AvoidPreviousNode bool` field to `api.DeleteOptions`
- before deleting the pod, the API server (the registry, I think) adds an annotation to the pod (maybe key `scheduler.alpha.kubernetes.io/avoid_previous_node` and no value)
- modify the controllers so that when they see this annotation added to a pod, they cache the name of the pod P and the name of the node N it is running on; when they see the deletion and go to create the replacement pod for P, they add a SoftNodeAffinity for "node name is not N" (I just realized this feature depends on the implementation of "Node affinity and NodeSelector design doc" #18261, since we can't express this today, and also on exposing node name as a node label, which is part of "Auto-populate node labels with node information from cloud provider" #9044)
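The controller-side steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Pod`, `avoidCache`, `observeUpdate`, and `replacementFor` are hypothetical names, and the `SoftAvoidNodes` field stands in for the SoftNodeAffinity "node name is not N" term, which cannot be expressed until #18261 lands.

```go
package main

import "fmt"

// Annotation key suggested in the proposal; the value is left empty.
const avoidPreviousNodeKey = "scheduler.alpha.kubernetes.io/avoid_previous_node"

// Pod is a hypothetical, minimal pod representation for illustration only.
type Pod struct {
	Name        string
	NodeName    string
	Annotations map[string]string
	// SoftAvoidNodes is an assumed placeholder for a soft (best-effort)
	// node affinity of the form "node name is not N".
	SoftAvoidNodes []string
}

// avoidCache is the controller's in-memory map from pod name P to node name N.
type avoidCache struct {
	prevNode map[string]string
}

func newAvoidCache() *avoidCache {
	return &avoidCache{prevNode: make(map[string]string)}
}

// observeUpdate records P -> N when the annotation is seen on a bound pod.
func (c *avoidCache) observeUpdate(p Pod) {
	if _, ok := p.Annotations[avoidPreviousNodeKey]; ok && p.NodeName != "" {
		c.prevNode[p.Name] = p.NodeName
	}
}

// replacementFor builds the replacement for a deleted pod. If the deleted pod
// was cached, the replacement gets a soft preference against its old node and
// the cache entry is dropped.
func (c *avoidCache) replacementFor(deletedName, replacementName string) Pod {
	r := Pod{Name: replacementName}
	if node, ok := c.prevNode[deletedName]; ok {
		r.SoftAvoidNodes = append(r.SoftAvoidNodes, node)
		delete(c.prevNode, deletedName)
	}
	return r
}

func main() {
	c := newAvoidCache()

	// The API server has annotated the pod just before deleting it.
	c.observeUpdate(Pod{
		Name:        "web-1",
		NodeName:    "node-a",
		Annotations: map[string]string{avoidPreviousNodeKey: ""},
	})

	// The controller sees the deletion and creates the replacement.
	r := c.replacementFor("web-1", "web-2")
	fmt.Printf("%s soft-avoids %v\n", r.Name, r.SoftAvoidNodes)
}
```

The preference is attached only to the replacement pod and the cache entry is removed once used, so later pods from the same controller are not steered away from the node.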
This should work even if the controller fails between the update and the deletion, but not if the controller fails between the deletion and creating the replacement pod. This should be OK as the feature is "best effort" (we intentionally use SoftNodeAffinity instead of HardNodeAffinity).
This is a prerequisite for the rescheduler, so that it can create a hole for a pending pod to schedule into without the pod that previously occupied that hole being scheduled back into the space first.
However, there are other uses; for example, a human operator might want to move a pod off a node for some reason (maybe it is receiving or producing interference from/to another pod on that node).
This was originally discussed in this comment: #12140 (comment)
cc/ @mikedanese @lavalamp @bgrant0607 @mml
Assigned to @HaiyangDING who said he would implement it (but note that it is dependent on implementation of #18261, as well as exposing node name as a node label).