
Proposal: Forgiveness #1574

Closed
bgrant0607 opened this issue Oct 3, 2014 · 22 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@bgrant0607
Member

Forked from #598 -- see that issue for more background. We don't need to implement this now, but recording the idea for posterity.

If we need more control over pod durability (e.g., to configure the node controller behavior -- #1366), rather than a single pin/persistence bit, I suggest we implement forgiveness: a list of (event type, optional duration, optional rate) tuples describing the disruption events (e.g., host unreachability) the pod will tolerate. We could support an "any" event type and an infinite duration for pods that want to be pinned regardless of what happens.

This approach would generalize nicely for cases where, for example, applications wanted to endure reboots but give up in the case of extended outages or in the case that the disk goes bad. We're also going to want to use a similar spec for availability requirements / failure tolerances of sets of pods.
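Purely for illustration, such a per-pod list might serialize along the lines below. None of these field or event names exist in any API; they simply mirror the (event type, duration, rate) tuples described above.

```yaml
# Hypothetical sketch only -- field and event names are illustrative, not part of any API.
forgiveness:
  - event: UnreachableNode
    durationSeconds: 600        # give up after 10 minutes of continuous unreachability
  - event: Reboot
    rate: 3                     # tolerate at most 3 reboots...
    ratePeriodSeconds: 86400    # ...per day
  - event: Any
    # no duration or rate => pinned regardless of what happens
```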

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. area/api Indicates an issue on api area. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Oct 3, 2014
@smarterclayton
Contributor

Is the decision to unbind a pod from a host (delete it) a decision a node controller makes (delete all pods on the node, then delete the node)? Or is it reactive by the system to the node being deleted (node deleted, therefore some process needs to clean up pods)? The former seems like something annotations and a custom node controller would solve (immediately delete all pods that have forgiveness true). The latter seems like it would have to be supported by the system itself.

This is probably a separate issue, but I don't know where it is -- should a node have a concept of being unavailable for scheduling?

@bgrant0607
Member Author

@smarterclayton Yes, a node should have a concept of being unavailable for scheduling, and the system (a node controller) should also be responsible for removing pods from non-functional nodes -- see #1366. This separates concerns of pod controlling vs. node controlling (for the most part). Both control systems will get more and more sophisticated over time.

@bgrant0607
Member Author

Sketch of forgiveness from #3058:

// Goes in the Pod spec:
//     ForgivenessPolicy *ForgivenessPolicy `json:"forgivenessPolicy,omitempty"`

// ForgivenessPolicy lists the disruption events a pod is willing to tolerate.
type ForgivenessPolicy struct {
    Thresholds []ForgivenessThreshold `json:"thresholds"`
}

// ForgivenEvent is the kind of disruption event being tolerated.
type ForgivenEvent string

const (
    ForgivenDead            ForgivenEvent = "Dead"
    ForgivenNotReady        ForgivenEvent = "NotReady"
    ForgivenDeadNode        ForgivenEvent = "DeadNode"
    ForgivenUnreachableNode ForgivenEvent = "UnreachableNode"
)

// ForgivenessThreshold bounds how much of a given event the pod tolerates,
// either as a contiguous duration or as a rate over a period.
type ForgivenessThreshold struct {
    Event             ForgivenEvent `json:"event"`
    ContiguousSeconds int           `json:"contiguousSeconds,omitempty"`
    Rate              int           `json:"rate,omitempty"`
    RatePeriodSeconds int           `json:"ratePeriodSeconds,omitempty"`
}
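Under that sketch, a pod tolerating up to 10 minutes of node unreachability and up to 3 NotReady episodes per hour might serialize as follows. This is only a sketch derived from the JSON tags above; the values are illustrative and this field never shipped in this form.

```yaml
# Sketch only: the serialization implied by the JSON tags in the types above.
forgivenessPolicy:
  thresholds:
    - event: UnreachableNode
      contiguousSeconds: 600    # tolerate 10 minutes of continuous unreachability
    - event: NotReady
      rate: 3                   # tolerate at most 3 NotReady episodes...
      ratePeriodSeconds: 3600   # ...per hour
```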

@smarterclayton
Contributor

OpenShift implemented a "manage-node" command which models an evacuate action. Currently it's up to the admin to decide whether the pods will be moved, but a forgiveness value on the pod could be the input to that command (just like the automatic evacuation implemented in the node controller). A node with a pod that has a long forgiveness policy is also less transient, so there is potential value in making scheduling decisions based on forgiveness (where increasing the forgiveness of a given node has a cost, but scheduling a pod with lower forgiveness is free). That's probably more of a model for traditional operations teams, and it's certainly not a high priority.

@smarterclayton
Contributor

Do we want to simplify forgiveness to try to get it for 1.1? As a general policy value it's useful to have in the system.

@davidopp
Member

I don't know if it would be high enough priority to commit to for 1.1, but I could imagine the first step would just be a bool (the Pod is treated normally, i.e., deleted by the NodeController after the standard timeout, vs. the Pod stays bound to the node forever unless it is manually deleted or the node is deleted).

@smarterclayton
Contributor

How about a duration?

We want to dramatically reduce the failure detection time for certain workloads on the cluster, which implies some way of distinguishing pods on an unhealthy node that should be aggressively considered down (such that the replication controller gets the earliest chance to proceed).

Maybe there's a more tactical way to manage that, but it does seem like something that would let admins start to manage SLAs more effectively and that fits into the long-term pattern.

@davidopp
Member

That seems reasonable to me; I was just suggesting the simplest thing we could do. What you suggested would also make it a subset of what @bgrant0607 suggested in his comment above (#1574 (comment)) -- you'd just have a ContiguousSeconds and basically nothing else. (As opposed to the bool, which doesn't really fit into what he suggested.)

Not sure about prioritization for 1.1 though.
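In terms of the sketch from the earlier comment, that duration-only simplification would be a single threshold carrying just a contiguousSeconds value. The following is only an illustration, not an implemented API:

```yaml
# Sketch of the "duration only" simplification (illustrative; never implemented in this form).
forgivenessPolicy:
  thresholds:
    - event: UnreachableNode
      contiguousSeconds: 60   # evict after 60 seconds on an unreachable node
```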

@smarterclayton
Contributor

If we added the value but did not leverage it in the node controller (or made it the simplest possible expression, such as changing the evacuate behavior to use the min forgiveness), would we be in a bad spot?

As far as prioritization, it's something we'd be interested in implementing, so it's really a question of API design and scope of change.

@erictune
Member

Is there any doc that has a couple of examples of concrete real world applications and what forgiveness values we recommend for Taints and Tolerations?

@smarterclayton
Contributor

We have a few apps running that are "must never be evicted", where the user has chosen to turn off node eviction today. I have fewer examples of >30 min, except a few production DBs that would prefer to be moved only if absolutely necessary, because a restore takes >30 min.


@davidopp
Member

davidopp commented Oct 29, 2016

Yeah I think there are three categories (each sketched as tolerations below):

  • want to be evicted sooner than the default 5 minutes -- we have an example of this from a GKE customer running a session-based application where every container handles a set of users, and if the node becomes unavailable that set of users can't be served until a replacement container is created on a new node (they requested 1 minute timeout)
  • want to be evicted after more than 5 minutes but less than infinite -- I think Clayton answered this one; unreplicated apps that take a long time to rebuild their state if restarted on a different node
  • never want to be evicted -- anything that runs as a DaemonSet today falls into this category (we can get rid of the hacky code that handles DaemonSet specially with respect to evictions, and just give them infinite tolerations); sounds like Clayton has other examples (I would think anything that is imported from a legacy VM world where the app can't handle "moving" once started without operator intervention and can't be rewritten to be a PersistentSet would fall into this category)
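A rough sketch of what each category looks like once expressed as NoExecute tolerations, as later implemented. The taint key is abbreviated to "unreachable" (the real key carries a node.* prefix), and the durations are only examples:

```yaml
# 1. Evicted sooner than the default 5 minutes (e.g. the session-based app above).
tolerations:
  - key: "unreachable"        # abbreviated; the real taint key carries a node.* prefix
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60     # give up after 1 minute
---
# 2. Evicted later than the default (e.g. an unreplicated app with expensive state).
tolerations:
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600   # tolerate up to 1 hour
---
# 3. Never evicted (e.g. DaemonSet-like workloads): omit tolerationSeconds entirely.
tolerations:
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"       # no tolerationSeconds => tolerate forever
```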

@smarterclayton
Contributor

Anything talking to legacy network-attached storage (iSCSI, Fibre Channel, NFS over SAN) will probably want to be using stateful set + infinite forgiveness for quite a while, until we implement volume locking and have a fencer in place. That covers most mission-critical data stores I'm aware of -- Oracle, SAP, etc. Some of those will be "nodes as pets", where extrinsic HA is applied to the node and eviction would be impractical.

In terms of general "not being surprised by the platform", there is a class of existing workloads that should have very high forgiveness until they have had time to verify they can be moved.


k8s-github-robot pushed a commit that referenced this issue Jan 27, 2017
Automatic merge from submit-queue (batch tested with PRs 39469, 40557)

Forgiveness api changes

**What this PR does / why we need it**:
Split from #34825; contains the API changes that are needed to implement forgiveness (a serialized sketch follows the list):
1. Update the toleration API types to support forgiveness: add a new field, forgivenessSeconds, to indicate how long the toleration tolerates a taint.
2. Update the taint API types: add a new field to indicate the time at which the taint was added.
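A sketch of the two additions in serialized form, with illustrative keys and values. Note that the duration field described as forgivenessSeconds here ultimately landed in the API as tolerationSeconds:

```yaml
# On a pod: a toleration that is honored only for a bounded time.
tolerations:
  - key: "example.com/some-taint"   # illustrative key
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300          # described as forgivenessSeconds in this PR text
---
# On a node: each taint records when it was added, so bounded tolerations
# can be measured against it.
taints:
  - key: "example.com/some-taint"
    effect: "NoExecute"
    timeAdded: "2017-01-27T00:00:00Z"
```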

**Which issue this PR fixes** : 
Related issue: #1574
Related PR: #34825 

**Special notes for your reviewer**:

**Release note**:

```release-note
forgiveness alpha version api definition
```
k8s-github-robot pushed a commit that referenced this issue Feb 3, 2017
…hanges

Automatic merge from submit-queue (batch tested with PRs 40696, 39914, 40374)

Forgiveness library changes

**What this PR does / why we need it**:
Split from #34825; contains the library changes that are needed to implement forgiveness:

1. ~~make taints-tolerations matching respect timestamps, so that one toleration can tolerate a taint for only a period of time.~~ As TaintManager is caching taints and observing taint changes, time-based checking is now outside the library (in TaintManager); see #40355.
2. make tolerations respect the wildcard key (see the sketch after this list).
3. add/refresh some related functions to wrap taints-tolerations operations.
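For item 2, a toleration with an empty key and operator Exists acts as a wildcard that matches taints with any key. A minimal sketch:

```yaml
# Sketch: an empty key with operator Exists tolerates every taint.
tolerations:
  - operator: "Exists"
```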

**Which issue this PR fixes**: 
Related issue: #1574
Related PR: #34825, #39469 
~~Please note that the first 2 commits in this PR come from #39469 .~~

**Special notes for your reviewer**:

~~Since currently we have `pkg/api/helpers.go` and `pkg/api/v1/helpers.go`, there are some duplicated pieces of code lying in these two files.~~

~~Ideally we should move taints-tolerations related functions into a separate package (pkg/util/taints) and make them a unified set of implementations. But I'd suggest doing that in a follow-up PR after the Forgiveness ones are done, in case the Forgiveness feature gets blocked too long.~~

**Release note**:

```release-note
make tolerations respect wildcard key
```
k8s-github-robot pushed a commit that referenced this issue Feb 19, 2017
…ission-controller

Automatic merge from submit-queue (batch tested with PRs 41043, 39058, 41021, 41603, 41414)

add defaultTolerationSeconds admission controller

**What this PR does / why we need it**:
Split from #34825; adds a new admission controller that (a sketch of the injected tolerations follows the list):
1. adds a toleration (with tolerationSeconds = 300) for the taint `notReady:NoExecute` to every pod that does not already have a toleration for that taint, and
2. adds a toleration (with tolerationSeconds = 300) for the taint `unreachable:NoExecute` to every pod that does not already have a toleration for that taint.
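The effect on a pod that declares neither toleration is roughly the following pair being injected. The taint keys are abbreviated as in the PR text; the real keys carry a node.* prefix:

```yaml
# Sketch of the tolerations injected by the admission controller
# (keys abbreviated; the real taint keys carry a node.* prefix).
tolerations:
  - key: "notReady"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
```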

**Which issue this PR fixes**: 
Related issue: #1574
Related PR: #34825

**Special notes for your reviewer**:

**Release note**:

```release-note
add defaultTolerationSeconds admission controller
```
k8s-github-robot pushed a commit that referenced this issue Feb 26, 2017
…ionseconds-admission-controller

Automatic merge from submit-queue (batch tested with PRs 40932, 41896, 41815, 41309, 41628)

enable DefaultTolerationSeconds admission controller by default

**What this PR does / why we need it**:
Continuation of PR #41414; enables the DefaultTolerationSeconds admission controller by default.


**Which issue this PR fixes**: 
fixes: #41860
related Issue: #1574, #25320
related PRs: #34825, #41133, #41414 

**Special notes for your reviewer**:

**Release note**:

```release-note
enable DefaultTolerationSeconds admission controller by default
```
k8s-github-robot pushed a commit that referenced this issue Feb 27, 2017
…-noexecute

Automatic merge from submit-queue (batch tested with PRs 41116, 41804, 42104, 42111, 42120)

make kubectl taint command respect effect NoExecute

**What this PR does / why we need it**:
Part of the forgiveness feature implementation; makes the kubectl taint command respect the NoExecute effect.

**Which issue this PR fixes**: 
Related Issue: #1574
Related PR: #39469

**Special notes for your reviewer**:

**Release note**:

```release-note
make kubectl taint command respect effect NoExecute
```
k8s-github-robot pushed a commit that referenced this issue Mar 1, 2017
…onds

Automatic merge from submit-queue

fix kubectl describe pod, show tolerationSeconds

**What this PR does / why we need it**:
tolerationSeconds is currently not shown in the kubectl describe result; this PR fixes that.

With this fix, pod tolerations with tolerationSeconds would look like the following:
```yaml
Name:           bar
Namespace:      foo
Node:           /
Labels:         <none>
Status:
IP:
Controllers:    <none>
Containers:     <none>
No volumes.
QoS Class:
Node-Selectors: <none>
Tolerations:    key1=value1
				key2=value2:NoSchedule
				key3=value3:NoExecute for 300s
```
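For reference, a pod spec fragment whose tolerations would render as above might look like this. This is only a sketch; the keys, values, effects, and the 300s duration are taken from the describe output, while the operators are assumptions:

```yaml
# Sketch: tolerations that would produce the describe output above.
tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"           # no effect specified => matches all effects
  - key: "key2"
    operator: "Equal"
    value: "value2"
    effect: "NoSchedule"
  - key: "key3"
    operator: "Equal"
    value: "value3"
    effect: "NoExecute"
    tolerationSeconds: 300    # rendered as "for 300s"
```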


**Which issue this PR fixes** : 
Related issue: #1574
Related PR: #39469

**Special notes for your reviewer**:

**Release note**:

```release-note
make kubectl describe pod show tolerationSeconds
```
@bgrant0607 bgrant0607 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triaged and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. team/control-plane (deprecated - do not use) area/api Indicates an issue on api area. kind/design Categorizes issue or PR as related to design. labels Mar 7, 2017
soltysh pushed a commit to soltysh/kubernetes that referenced this issue May 15, 2023
UPSTREAM: <carry>: move test rules from origin