Define a common Node autoscaling safe-to-evict/do-not-disrupt annotation #124800
base: master
Conversation
Currently, there are 2 Node autoscalers sponsored by sig-autoscaling, each supporting a different Pod annotation with the same semantics:

- Cluster Autoscaler: `cluster-autoscaler.kubernetes.io/safe-to-evict=true/false`
- Karpenter: `karpenter.sh/do-not-disrupt=true`

The semantics of `cluster-autoscaler.kubernetes.io/safe-to-evict=false` and `karpenter.sh/do-not-disrupt=true` are identical. Both of these annotations will be replaced by `node-autoscaling.kubernetes.io/safe-to-evict=false`.

`cluster-autoscaler.kubernetes.io/safe-to-evict=true` doesn't have an equivalent in Karpenter right now, as Karpenter doesn't have any pod-level conditions blocking consolidation. This means that the equivalent new annotation `node-autoscaling.kubernetes.io/safe-to-evict=true` should be trivially supported by Karpenter initially (but will require caution if Karpenter ever adds any pod-level conditions blocking consolidation).

Going with the Cluster Autoscaler wording for the common annotation, as otherwise we'd have a double negation (`do-not-disrupt=false`) in the `safe-to-evict=true` case, which doesn't seem ideal.

This is a part of a broader alignment between Cluster Autoscaler and Karpenter. More details about the alignment can be found in https://docs.google.com/document/d/1rHhltfLV5V1kcnKr_mKRKDC4ZFPYGP4Tde2Zy-LE72w
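As a minimal sketch of the intended equivalence, an autoscaler could resolve the common annotation while falling back to the legacy keys. The helper name, fallback order, and plain-map signature below are illustrative assumptions, not part of this PR:

```go
package main

import "fmt"

const (
	// Proposed common key from this PR, plus the two legacy keys it replaces.
	commonKey    = "node-autoscaling.kubernetes.io/safe-to-evict"
	casKey       = "cluster-autoscaler.kubernetes.io/safe-to-evict"
	karpenterKey = "karpenter.sh/do-not-disrupt"
)

// blocksConsolidation is a hypothetical helper: it reports whether a pod's
// annotations ask the autoscaler not to consolidate the pod's node. The
// common key wins if present; otherwise the legacy keys are consulted,
// reflecting the stated equivalence safe-to-evict=false == do-not-disrupt=true.
func blocksConsolidation(annotations map[string]string) bool {
	if v, ok := annotations[commonKey]; ok {
		return v == "false"
	}
	if v, ok := annotations[casKey]; ok {
		return v == "false"
	}
	return annotations[karpenterKey] == "true"
}

func main() {
	legacyKarpenterPod := map[string]string{karpenterKey: "true"}
	migratedPod := map[string]string{commonKey: "false"}
	fmt.Println(blocksConsolidation(legacyKarpenterPod)) // true
	fmt.Println(blocksConsolidation(migratedPod))        // true
}
```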
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: towca

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
I feel like we are caught between a rock and a hard place with this semantic. I can see why this semantic works well for CAS, but now it causes awkwardness for Karpenter users, since there's currently no scenario where `safe-to-evict=true` would change anything for Karpenter.

I need to do a bit more thinking on the trade-offs here between CAS and Karpenter supporting something common.
So the only way for a pod to block consolidation of its node in Karpenter is for the user to explicitly opt that exact pod/workload into the blocking somehow? Or how does it work? Do you have/anticipate any such blocking config options that would span multiple workloads? If so,
@jonathan-innis Have you maybe had a chance to give this more thought?
/sig node
I think the core of the problem here is that the defaults for CAS and Karpenter are different when it comes to blocking evictions. Karpenter takes the opinion that you aren't "blocking eviction" of a pod unless you have a pod that tolerates the Karpenter drain/disruption taint. Even still, with this in place, we will still drain the node with this pod on there; we just won't attempt to drain that pod, because we know that pod will immediately reschedule as soon as it has been drained.

Users in Karpenter explicitly opt in to block the eviction of their pods by setting the `do-not-disrupt` annotation on these pods (much like you select against workloads with a PDB to block the eviction of those pods). From what I remember, the decision to be safe and ask users to explicitly opt out of the eviction of specific pods was taken to match the drain logic of the

@towca What are your thoughts? I recognize that this is a bit of a change in thinking from the way that CAS thinks about this problem. What are the behaviors that you have seen from users? Have you seen users like this defaulting behavior with things like local storage or controllerless pods, or has it become more of a hindrance to users?

FWIW, I think that we should work to align on the behavior here before we align on the taint. Ideally, we have one opinion about how this drain operation works in Kubernetes; this applies across all SIGs and is seen as "the way to drain nodes gracefully", and the
I think we might be looking at this from slightly different perspectives, or confusing "blocking eviction" with "blocking consolidation". I fully agree that we need to align on one drain behavior and use it across Kubernetes components. However, I don't think the safe-to-evict/do-not-disrupt annotation should be a part of this alignment. I think the annotation works one layer above the drain mechanism.

At least for CAS, the annotation is supposed to mean "do not consolidate the Node on which this pod is scheduled (since this would disrupt the pod and it doesn't want that)". The drain behavior doesn't come into play at all, because consolidation (and in turn the drain) should not be attempted by CAS. It's not eviction of one particular pod that is blocked, it's consolidation of its whole node. Btw, this abstraction layer difference is also precisely why I see this annotation in the

The distinction seems useful even inside Karpenter (if I understand its behaviors correctly):
Same here, I think kubectl differentiates between the pods at the drain level here. If you do

After writing this,
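To make the layering described in this exchange concrete, here is a small sketch under the assumption of two separate, hypothetical checks (neither function exists in CAS or Karpenter): one at the consolidation layer, where the annotation lives, and one at the drain layer, where PDBs and drain options apply.

```go
package main

import "fmt"

const safeToEvictKey = "node-autoscaling.kubernetes.io/safe-to-evict"

// nodeConsolidatable runs at the consolidation layer: a single pod annotated
// safe-to-evict=false blocks consolidation of the whole node, so no drain
// is ever attempted. (Hypothetical helper for illustration.)
func nodeConsolidatable(podAnnotations []map[string]string) bool {
	for _, ann := range podAnnotations {
		if ann[safeToEvictKey] == "false" {
			return false
		}
	}
	return true
}

// evictionAllowed would run at the drain layer, once a drain is already in
// progress; per the discussion above it would consult PDBs and drain options,
// and the safe-to-evict annotation plays no role here. Stubbed out.
func evictionAllowed(podName string) bool { return true }

func main() {
	pods := []map[string]string{
		{},                        // ordinary pod
		{safeToEvictKey: "false"}, // opted out of consolidation
	}
	if !nodeConsolidatable(pods) {
		fmt.Println("node skipped at the consolidation layer; drain never starts")
	}
}
```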
@gjtempleton @MaciekPytel WDYT?
```go
// - If the value is "true": This pod is safe to evict, and Node autoscalers should not block consolidation of a
//   Node because of it, when they normally would. For example, if a Node autoscaler normally blocks consolidating
//   Nodes on which kube-system Pods are running, this can be used to opt some pods out of the blocking.
NodeAutoscalingSafeToEvictKey = nodeAutoscalingPrefix + "/safe-to-evict"
```
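A hedged sketch of the "true" direction documented in the snippet above; the kube-system default and the helper are assumptions for illustration, since the PR itself only defines the key:

```go
package main

import "fmt"

// podBlocksConsolidation sketches an autoscaler that by default refuses to
// consolidate nodes running kube-system Pods; an explicit safe-to-evict
// value overrides that default in either direction. (Hypothetical helper;
// the kube-system policy is an example, not mandated by this PR.)
func podBlocksConsolidation(namespace string, annotations map[string]string) bool {
	if v, ok := annotations["node-autoscaling.kubernetes.io/safe-to-evict"]; ok {
		return v == "false" // "true" opts out of blocking, "false" opts in
	}
	return namespace == "kube-system" // example default policy
}

func main() {
	optedOut := map[string]string{"node-autoscaling.kubernetes.io/safe-to-evict": "true"}
	fmt.Println(podBlocksConsolidation("kube-system", nil))      // true: default policy blocks
	fmt.Println(podBlocksConsolidation("kube-system", optedOut)) // false: annotation opts out
}
```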
Why not make this a spec field?
We usually don't changelog annotations in k/k, because they're not formally part of the API stabilization process. So, I think this doesn't need a changelog. However, if kubectl learns a special way to report that Pods are safe to evict, then we'd changelog that.
We don't remove annotations from the list of registered annotations; we just tell people they've stopped having a useful meaning. There's no lifecycle that I know of tied to Kubernetes releases. As an aside, you might like to request a mention in kubernetes/website#46948. That's a good place to communicate the upcoming deprecation.
In my experience that is something that does indeed come up (at least on GKE); there is a major difference between disruptions caused by consolidation vs other disruptions:

In CAS you can achieve both behaviors:

We see more usage of
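One way to illustrate the difference between the two kinds of protection discussed in this thread: `safe-to-evict=false` only stops the autoscaler from consolidating the node, while a PDB with `maxUnavailable: 0` blocks all voluntary evictions, whatever triggered them. A sketch using the real policy/v1 types; the object name and selector are placeholders:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// A PDB that forbids any voluntary eviction of the selected pods,
	// regardless of whether the eviction comes from consolidation or from
	// some other drain. Placeholder name and labels.
	zero := intstr.FromInt32(0)
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "block-all-voluntary-evictions"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector:       &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
		},
	}
	fmt.Printf("PDB spec: %+v\n", pdb.Spec)
}
```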
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
@jonathan-innis @njtran Ping, could we move this review along?
So to be clear, the way that we get around this problem today with

I think we still have some disparity between the meaning of these annotations, so I question whether using something like

As for the

What's the reason that y'all don't choose to block eviction using the annotation as well? Is it just a gap in having a mechanism to force-terminate if something was blocking the node deletion for too long?
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@towca: Reopened this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind api-change
What this PR does / why we need it:

Defines a common `node-autoscaling.kubernetes.io/safe-to-evict` Pod annotation replacing the autoscaler-specific `cluster-autoscaler.kubernetes.io/safe-to-evict` and `karpenter.sh/do-not-disrupt` annotations, as described at the top of this PR.
Which issue(s) this PR fixes:
Part of kubernetes/autoscaler#6648
Special notes for your reviewer:
The implementation in Cluster Autoscaler and Karpenter will follow this PR. If this is a problem, I could do the implementation first with the annotation hardcoded, then submit this PR, then clean up the implementation to use the annotation from the API.
@jonathan-innis this PR goes with the Cluster Autoscaler `safe-to-evict` wording for now, instead of the Karpenter `do-not-disrupt` one. `do-not-disrupt` would have to be negated to express `safe-to-evict=true`, which would result in a double negation. Would switching to `safe-to-evict` be a problem for Karpenter?

Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
/assign @jonathan-innis
/assign @MaciekPytel
/assign @gjtempleton
/hold
Want LGTMs from the Node autoscaling stakeholders above before unholding.