Define a common Node autoscaling safe-to-evict/do-not-disrupt annotation #124800

Open · wants to merge 1 commit into master
Conversation

towca (Contributor) commented on May 10, 2024

What type of PR is this?

/kind api-change

What this PR does / why we need it:

Currently, there are 2 Node autoscalers sponsored by sig-autoscaling, each supporting a different Pod annotation with the same semantics:

  • Cluster Autoscaler: cluster-autoscaler.kubernetes.io/safe-to-evict=true/false
  • Karpenter: karpenter.sh/do-not-disrupt=true

The semantics of cluster-autoscaler.kubernetes.io/safe-to-evict=false and karpenter.sh/do-not-disrupt=true are identical. Both of these annotations will be replaced by node-autoscaling.kubernetes.io/safe-to-evict=false.

cluster-autoscaler.kubernetes.io/safe-to-evict=true doesn't have an equivalent in Karpenter right now, as Karpenter doesn't have any pod-level conditions blocking consolidation. This means that the equivalent new annotation
node-autoscaling.kubernetes.io/safe-to-evict=true should be trivially supported by Karpenter initially (but will require caution if Karpenter ever adds any pod-level conditions blocking consolidation).

This PR goes with the Cluster Autoscaler wording for the common annotation; the Karpenter wording would produce a double negation (do-not-disrupt=false) in the safe-to-evict=true case, which doesn't seem ideal.

This is a part of a broader alignment between Cluster Autoscaler and Karpenter. More details about the alignment can be found in https://docs.google.com/document/d/1rHhltfLV5V1kcnKr_mKRKDC4ZFPYGP4Tde2Zy-LE72w
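
To illustrate the mapping above, here is a minimal Go sketch of how a Node autoscaler could honor the new annotation while still accepting the deprecated ones. It is not part of this PR's diff; the fallback order and helper names are assumptions for illustration only.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Annotation keys as discussed in this PR and in the existing autoscalers.
	commonSafeToEvictKey     = "node-autoscaling.kubernetes.io/safe-to-evict"
	casSafeToEvictKey        = "cluster-autoscaler.kubernetes.io/safe-to-evict"
	karpenterDoNotDisruptKey = "karpenter.sh/do-not-disrupt"
)

// safeToEvict returns (value, true) if any of the annotations express an
// explicit preference, preferring the new common annotation over the
// deprecated autoscaler-specific ones. The precedence order is an assumption
// for illustration, not something this PR defines.
func safeToEvict(pod *corev1.Pod) (bool, bool) {
	if v, ok := pod.Annotations[commonSafeToEvictKey]; ok {
		return v == "true", true
	}
	if v, ok := pod.Annotations[casSafeToEvictKey]; ok {
		return v == "true", true
	}
	if v, ok := pod.Annotations[karpenterDoNotDisruptKey]; ok {
		// do-not-disrupt=true is equivalent to safe-to-evict=false.
		return v != "true", true
	}
	return false, false // no explicit preference set on the pod
}

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Annotations: map[string]string{karpenterDoNotDisruptKey: "true"},
	}}
	safe, ok := safeToEvict(pod)
	fmt.Println(safe, ok) // false true: the pod opts out of consolidation
}
```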

Which issue(s) this PR fixes:

Part of kubernetes/autoscaler#6648

Special notes for your reviewer:

The implementation in Cluster Autoscaler and Karpenter will follow this PR. If this is a problem, I could do the implementation first with the annotation hardcoded, then submit this PR, then clean up the implementation to use the annotation from the API.

@jonathan-innis this PR goes with the Cluster Autoscaler safe-to-evict wording for now, instead of the Karpenter do-not-disrupt one. do-not-disrupt would have to be negated to express safe-to-evict=true, which would result in a double negation. Would switching to safe-to-evict be a problem for Karpenter?

Does this PR introduce a user-facing change?

A new Pod annotation node-autoscaling.kubernetes.io/safe-to-evict is introduced. The annotation can be used to control Node autoscaler drain behavior. Value "true" means that a Pod is safe to evict, and Node autoscalers should not block consolidation of a Node because of it, when they normally would. Value "false" means that a Pod is not safe to evict, and Node autoscalers shouldn't consolidate a Node where such a pod is present. The annotation is supported by
Cluster Autoscaler and Karpenter. The annotation is equivalent to autoscaler-specific cluster-autoscaler.kubernetes.io/safe-to-evict and karpenter.sh/do-not-disrupt annotations. The autoscaler-specific annotations are deprecated, and will be removed in a future release.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [Cluster Autoscaler/Karpenter alignment doc]: https://docs.google.com/document/d/1rHhltfLV5V1kcnKr_mKRKDC4ZFPYGP4Tde2Zy-LE72w/edit?usp=sharing

/assign @jonathan-innis
/assign @MaciekPytel
/assign @gjtempleton
/hold
Want LGTMs from the Node autoscaling stakeholders above before unholding.

@k8s-ci-robot added the do-not-merge/hold, release-note, kind/api-change, size/S, cncf-cla: yes, and do-not-merge/needs-sig labels on May 10, 2024
k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


@k8s-ci-robot added the needs-triage and needs-priority labels on May 10, 2024
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: towca
Once this PR has been reviewed and has the lgtm label, please assign msau42 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested review from cici37 and mwielgus on May 10, 2024
jonathan-innis commented on May 10, 2024

node-autoscaling.kubernetes.io/safe-to-evict=true

I feel like we are caught between a rock and a hard place with this semantic. I can see why it works well for CAS, but it causes awkwardness for Karpenter users: there's currently no scenario where node-autoscaling.kubernetes.io/safe-to-evict=true would apply, since all pods are safe to evict by default (and we layer blocking elements on top of that). Effectively, Karpenter users would only ever set node-autoscaling.kubernetes.io/safe-to-evict=false, which I think is a little awkward, since a boolean semantic like this kind of implies that you support the "truthy" value.

I need to do a bit more thinking on the trade-offs here between CAS and Karpenter supporting something common.

towca (Contributor, Author) commented on May 13, 2024

@jonathan-innis

since all pods are safe to evict by default (and we layer blocking elements on top of it)

So the only way for a pod to block consolidation of its node in Karpenter is for the user to explicitly opt that exact pod/workload into the blocking somehow? Or how does it work?

Do you have/anticipate any such blocking config options that would span multiple workloads? If so, safe-to-evict: true is still useful for "exceptions", something like:

  • I want to configure X workloads together because they mostly have the same requirements/have to run together/etc.
  • The behavior I want for most of the workloads is to block consolidation on some conditions - e.g. if a pod uses local storage.
  • For some of the workloads, I know that the local storage they use is safe to lose, so I can annotate them with safe-to-evict: true.
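
To make the exception pattern above concrete, here is a minimal Go sketch of a default blocking condition combined with the proposed annotation. The helper names and the local-storage check are hypothetical illustrations, not actual CAS or Karpenter code.

```go
package sketch

import corev1 "k8s.io/api/core/v1"

const safeToEvictKey = "node-autoscaling.kubernetes.io/safe-to-evict"

// usesLocalStorage stands in for a condition that blocks consolidation by
// default in this scenario (any EmptyDir volume); the condition itself is
// just an example, not something the annotation mandates.
func usesLocalStorage(pod *corev1.Pod) bool {
	for _, v := range pod.Spec.Volumes {
		if v.EmptyDir != nil {
			return true
		}
	}
	return false
}

// blocksConsolidation shows the "exception" pattern from the list above:
// the default is to block on local storage, but individual pods annotated
// with safe-to-evict=true opt out of that blocking, while safe-to-evict=false
// opts in unconditionally.
func blocksConsolidation(pod *corev1.Pod) bool {
	switch pod.Annotations[safeToEvictKey] {
	case "true":
		return false
	case "false":
		return true
	default:
		return usesLocalStorage(pod)
	}
}
```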

towca (Contributor, Author) commented on May 20, 2024

@jonathan-innis Have you maybe had a chance to give this more thought?

dims (Member) commented on May 22, 2024

/sig node
/sig autoscaling

@k8s-ci-robot added the sig/node and sig/autoscaling labels and removed the do-not-merge/needs-sig label on May 22, 2024
jonathan-innis commented:

The behavior I want for most of the workloads is to block consolidation on some conditions - e.g. if a pod uses local storage

I think the core of the problem here is that the defaults for CAS and Karpenter are different when it comes to blocking evictions. Karpenter takes the opinion that you aren't "blocking eviction" of a pod unless you have a pod that tolerates the Karpenter drain/disruption taint. Even then -- with this in place -- we will still drain the node with this pod on it; we just won't attempt to drain that pod, because we know that the pod will immediately reschedule as soon as it has been drained.

Users in Karpenter explicitly opt-in to block the eviction of their pods by setting the do-not-disrupt annotation on these pods (much like you select against workloads with a PDB to block the eviction of those pods).

From what I remember, the decision to be safe and ask users to explicitly opt-out of the eviction of specific pods was taken to match the drain logic of the kubectl command, but I do think that (in an automated, autoscaling environment) it may create more confusion to have a set of pods that will be evicted by default and a different set of pods that won't be.

@towca What are your thoughts? I recognize that this is a bit of a change from the way that CAS thinks about this problem. What behaviors have you seen from users? Have you seen that users like this defaulting behavior with things like local storage or controllerless pods, or has it become more of a hindrance to them?

FWIW, I think that we should work to align on the behavior here before we align on the taint. Ideally, we have one opinion about how this drain operation works in Kubernetes, this applies across all SIGs and is seen as "the way to drain nodes gracefully" and the do-not-disrupt or safe-to-evict annotations are just part of that story. Thoughts?

towca (Contributor, Author) commented on Jun 13, 2024

FWIW, I think that we should work to align on the behavior here before we align on the taint. Ideally, we have one opinion about how this drain operation works in Kubernetes, this applies across all SIGs and is seen as "the way to drain nodes gracefully" and the do-not-disrupt or safe-to-evict annotations are just part of that story. Thoughts?

I think we might be looking at this from slightly different perspectives, or confusing "blocking eviction" with "blocking consolidation".

I fully agree that we need to align on one drain behavior and use it across Kubernetes components. However, I don't think the safe-to-evict/do-not-disrupt annotation should be a part of this alignment. I think the annotation works one layer above the drain mechanism.

At least for CAS, the annotation is supposed to mean "do not consolidate the Node on which this pod is scheduled (since this would disrupt the pod and it doesn't want that)". The drain behavior doesn't come into play at all, because consolidation (and in turn the drain) should not be attempted by CAS. It's not eviction of one particular pod that is blocked, it's consolidation of its whole node.

Btw, this abstraction layer difference is also precisely why I see this annotation in the node-autoscaling prefix instead of the broader node-lifecycle one. My intention is for node-autoscaling.kubernetes.io/safe-to-evict=false to mean "pod not safe to evict by autoscalers, so autoscalers can't consolidate its node". In contrast, the way I'd understand node-lifecycle.kubernetes.io/safe-to-evict=false would be "pod not safe to evict by anyone using the standard draining mechanism, so its node can't be drained at all". Which is a much stronger guarantee, and some components can't guarantee not disrupting a pod while keeping other guarantees they promise.

The distinction seems useful even inside Karpenter (if I understand its behaviors correctly):

  • Karpenter provides both node-autoscaling features (consolidating underutilized Nodes), and other node-lifecycle features (e.g. consolidating expired Nodes, or consolidating Nodes manually selected by the user).
  • Some users would like to express that not disrupting a pod is more important to them than optimizing the cluster cost, but disruptions caused by other things (e.g. keeping the nodes fresh) are tolerable.
  • Other users would like to express that a pod is critical and not disrupting it is more important than everything else, including keeping the nodes fresh.
  • In my experience, the first scenario is way more common. Users want to avoid frequent, preventable disruptions but have to tolerate unpreventable disruptions anyway - so they're usually also ok tolerating rare "preventable" disruptions like node upgrade. The biggest source of preventable disruptions in autoscaled clusters is node autoscaling, so saying "I don't want autoscaling to disrupt my pods but the other rarer things that keep the cluster healthy like node upgrade are okay to disrupt" is frequently exactly what users want.
  • The second scenario seems way less common to me. Fully blocking draining a node for everything in the cluster has serious implications, and would only be practical for very specific workflows (something else would probably have to guarantee that the annotation is always removed at some point).
  • The node-autoscaling version of the annotation fits the first scenario way more than node-lifecycle IMO, and allows us to still add the node-lifecycle version in the future if the "fully blocking" semantics are truly needed.

From what I remember, the decision to be safe and ask users to explicitly opt-out of the eviction of specific pods was taken to match the drain logic of the kubectl command, but I do think that (in an automated, autoscaling environment) it may create more confusion to have a set of pods that will be evicted by default and a different set of pods that won't be.

Same here, I think kubectl differentiates between the pods at the drain level here. If you do kubectl drain, kubectl will by default evict some pods but not others (mostly because as you said some pods tolerate the cordoning so they'd come back anyway). And the annotation operates a level above - by analogy kubectl drain should refuse to evict any pods if any of the pods on the node has a kubectl.kubernetes.io/safe-to-evict=false annotation.

After writing this, node-autoscaling.kubernetes.io/safe-to-disrupt=false/true seems like an even better name - it's not tied to eviction.

towca (Contributor, Author) commented on Jun 13, 2024

@gjtempleton @MaciekPytel WDYT?

// - If the value is "true": This pod is safe to evict, and Node autoscalers should not block consolidation of a
// Node because of it, when they normally would. For example, if a Node autoscaler normally blocks consolidating
// Nodes on which kube-system Pods are running, this can be used to opt some pods out of the blocking.
NodeAutoscalingSafeToEvictKey = nodeAutoscalingPrefix + "/safe-to-evict"

A Member reviewer commented:

Why not make this a spec field?

sftim (Contributor) commented on Jun 25, 2024

A new Pod annotation node-autoscaling.kubernetes.io/safe-to-evict is introduced. The annotation can be used to control Node autoscaler drain behavior. Value "true" means that a Pod is safe to evict, and Node autoscalers should not block consolidation of a Node because of it, when they normally would. Value "false" means that a Pod is not safe to evict, and Node autoscalers shouldn't consolidate a Node where such a pod is present. The annotation is supported by Cluster Autoscaler and Karpenter. The annotation is equivalent to autoscaler-specific cluster-autoscaler.kubernetes.io/safe-to-evict and karpenter.sh/do-not-disrupt annotations. The autoscaler-specific annotations are deprecated, and will be removed in a future release.

We usually don't changelog annotations in k/k, because they're not formally part of the API stabilization process.
The way to register a new annotation is to update https://kubernetes.io/docs/reference/labels-annotations-taints/

So, I think this doesn't need a changelog. However, if kubectl learns a special way to report that Pods are safe to evict, then we'd changelog that.

The autoscaler-specific annotations are deprecated, and will be removed in a future release.

We don't remove annotations from the list of registered annotations, we just tell people they've stopped having a useful meaning. There's no lifecycle that I know of tied to Kubernetes releases.

You might like to request a mention in kubernetes/website#46948, as an aside. That's a good place to communicate the upcoming deprecation.

MaciekPytel (Contributor) commented:

Btw, this abstraction layer difference is also precisely why I see this annotation in the node-autoscaling prefix instead of the broader node-lifecycle one. My intention is for node-autoscaling.kubernetes.io/safe-to-evict=false to mean "pod not safe to evict by autoscalers, so autoscalers can't consolidate its node". In contrast, the way I'd understand node-lifecycle.kubernetes.io/safe-to-evict=false would be "pod not safe to evict by anyone using the standard draining mechanism, so its node can't be drained at all". Which is a much stronger guarantee, and some components can't guarantee not disrupting a pod while keeping other guarantees they promise.

In my experience that is something that does indeed come up (at least on GKE) - there is a major difference between disruptions caused by consolidation vs other disruptions:

  • Consolidation can potentially trigger a massively greater volume of disruption compared to something like node upgrades. I liked how one of our users described it as a cost - if it happens to a workload once every few weeks it's negligible, but if it happens multiple times per day it starts adding up.
  • Maintenance operations can be scheduled at a convenient time.

In CAS you can achieve both behaviors:

  • A PDB with maxUnavailable=0 will prevent CAS consolidation and will also be respected by a lot of other automation (obviously depending on what automation you have running - using GKE as an example, node upgrades will respect PDBs at least for a certain duration).
  • safe-to-evict: false annotation is only respected by autoscaler and ignored by any other automation related to node lifecycle.

We see more usage of safe-to-evict: false compared to PDBs completely preventing restarts, which supports the theory that it's useful to have controls scoped to just autoscaling. And from that perspective autoscaling alignment seems useful, even if later on we also add a broader annotation that's respected by all drain implementations (though - how would that differ from using a restrictive PDB?)
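
For reference, a minimal Go sketch of the restrictive-PDB approach mentioned above, using the policy/v1 types; the object name, namespace handling, and label selector are illustrative assumptions.

```go
package sketch

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// blockingPDB builds a PodDisruptionBudget with maxUnavailable=0, which
// prevents voluntary evictions (and therefore CAS consolidation) of the
// selected pods, and is also respected by other eviction-based automation.
func blockingPDB(namespace, appLabel string) *policyv1.PodDisruptionBudget {
	zero := intstr.FromInt32(0)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appLabel + "-no-disruption", // illustrative name
			Namespace: namespace,
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": appLabel},
			},
		},
	}
}
```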

k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Oct 7, 2024
towca (Contributor, Author) commented on Oct 29, 2024

@jonathan-innis @njtran Ping, could we move this review along?

jonathan-innis commented on Oct 30, 2024

Fully blocking draining a node for everything in the cluster has serious implications, and would only be practical for very specific workflows

So to be clear -- the way that we get around this problem today, with the karpenter.sh/do-not-disrupt annotation fully blocking the drain of pods on a node, is that we have a separate concept known as the terminationGracePeriod of the node. When we choose to disrupt a node, we start draining all of the pods on it -- if a pod has this annotation, then we wait to evict it until it either no longer has the annotation OR until we hit the terminationGracePeriod of the node.
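
To make that concrete, here is a rough Go sketch of the waiting behavior described above; the function signature, parameter names, and annotation lookup are illustrative assumptions, not Karpenter's actual implementation.

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const doNotDisruptKey = "karpenter.sh/do-not-disrupt"

// canEvictNow mirrors the behavior described above: during a node drain, a
// pod annotated do-not-disrupt=true is only evicted once the annotation is
// removed or the node-level terminationGracePeriod has elapsed.
func canEvictNow(pod *corev1.Pod, drainStarted time.Time, nodeTerminationGracePeriod time.Duration) bool {
	if pod.Annotations[doNotDisruptKey] != "true" {
		return true // nothing blocks eviction of this pod
	}
	return time.Since(drainStarted) >= nodeTerminationGracePeriod
}
```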

I think we still have some disparity between the meaning of these annotations -- so I question whether something like node-autoscaling.kubernetes.io/do-not-disrupt is something that CAS would want to consider when it comes to controlling the eviction of the pods, not just the consolidation decisions.

As for the node-autoscaling.kubernetes.io/safe-to-evict=true annotation -- we don't really have the same concept in Karpenter today either, because we don't have an annotation that only blocks our voluntary disruption without blocking our drain. Obviously, since it blocks our drain, we don't choose to voluntarily disrupt, so karpenter.sh/do-not-disrupt is effectively a superset of the node-autoscaling.kubernetes.io/safe-to-evict=false annotation.

What's the reason that y'all don't choose to block eviction using the annotation as well? Is it just a gap in having a mechanism to force terminate if something was blocking the node deletion for too long?

k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 29, 2024
k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closed this PR.

In response to the /close command issued above.

towca (Contributor, Author) commented on Dec 30, 2024

/reopen

@k8s-ci-robot reopened this on Dec 30, 2024
k8s-ci-robot (Contributor) commented:

@towca: Reopened this PR.

In response to this:

/reopen


Labels: cncf-cla: yes, do-not-merge/hold, kind/api-change, lifecycle/rotten, needs-priority, needs-triage, release-note, sig/autoscaling, sig/node, size/S
Projects: Status: not-only-sig-node
9 participants