
Proposal: Forgiveness #1574

Closed
bgrant0607 opened this issue Oct 3, 2014 · 22 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@bgrant0607
Member

Forked from #598 -- see that issue for more background. We don't need to implement this now, but recording the idea for posterity.

If we need more control over pod durability (e.g., to configure the node controller behavior -- #1366), rather than a single pin/persistence bit, I suggest we implement forgiveness: a list of (event type, optional duration, optional rate) tuples describing the disruption events (e.g., host unreachability) the pod will tolerate. We could support an "any" event type and an infinite duration for pods that want to be pinned regardless of what happens.

This approach would generalize nicely for cases where, for example, applications wanted to endure reboots but give up in the case of extended outages or in the case that the disk goes bad. We're also going to want to use a similar spec for availability requirements / failure tolerances of sets of pods.
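Purely for illustration, such a per-pod list might serialize along the lines below. None of these field or event names exist in any API; they simply mirror the (event type, duration, rate) tuples described above.

```yaml
# Hypothetical sketch only -- field and event names are illustrative, not part of any API.
forgiveness:
  - event: UnreachableNode
    durationSeconds: 600        # give up after 10 minutes of continuous unreachability
  - event: Reboot
    rate: 3                     # tolerate at most 3 reboots...
    ratePeriodSeconds: 86400    # ...per day
  - event: Any
    # no duration or rate => pinned regardless of what happens
```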

@bgrant0607 bgrant0607 added kind/design Categorizes issue or PR as related to design. area/api Indicates an issue on api area. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Oct 3, 2014
@smarterclayton
Contributor

Is the decision to unbind a pod from a host (delete it) a decision a node controller makes (delete all pods on the node, then delete the node)? Or is it reactive by the system to the node being deleted (node deleted, therefore some process needs to clean up pods)? The former seems like something annotations and a custom node controller would solve (immediately delete all pods that have forgiveness true). The latter seems like it would have to be supported by the system itself.

This is probably a separate issue, but I don't know where it is -- should a node have a concept of being unavailable for scheduling?

@bgrant0607
Member Author

@smarterclayton Yes, a node should have a concept of being unavailable for scheduling, and the system (a node controller) should also be responsible for removing pods from non-functional nodes -- see #1366. This separates concerns of pod controlling vs. node controlling (for the most part). Both control systems will get more and more sophisticated over time.

@bgrant0607
Member Author

Sketch of forgiveness from #3058:

// Goes in the Pod spec:
//     ForgivenessPolicy *ForgivenessPolicy `json:"forgivenessPolicy,omitempty"`

// ForgivenessPolicy lists the disruption events a pod is willing to tolerate.
type ForgivenessPolicy struct {
    Thresholds []ForgivenessThreshold `json:"thresholds"`
}

// ForgivenEvent is the kind of disruption event being tolerated.
type ForgivenEvent string

const (
    ForgivenDead            ForgivenEvent = "Dead"
    ForgivenNotReady        ForgivenEvent = "NotReady"
    ForgivenDeadNode        ForgivenEvent = "DeadNode"
    ForgivenUnreachableNode ForgivenEvent = "UnreachableNode"
)

// ForgivenessThreshold bounds how much of a given event the pod tolerates,
// either as a contiguous duration or as a rate over a period.
type ForgivenessThreshold struct {
    Event             ForgivenEvent `json:"event"`
    ContiguousSeconds int           `json:"contiguousSeconds,omitempty"`
    Rate              int           `json:"rate,omitempty"`
    RatePeriodSeconds int           `json:"ratePeriodSeconds,omitempty"`
}
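Under that sketch, a pod tolerating up to 10 minutes of node unreachability and up to 3 NotReady episodes per hour might serialize as follows. This is only a sketch derived from the JSON tags above; the values are illustrative and this field never shipped in this form.

```yaml
# Sketch only: the serialization implied by the JSON tags in the types above.
forgivenessPolicy:
  thresholds:
    - event: UnreachableNode
      contiguousSeconds: 600    # tolerate 10 minutes of continuous unreachability
    - event: NotReady
      rate: 3                   # tolerate at most 3 NotReady episodes...
      ratePeriodSeconds: 3600   # ...per hour
```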

@smarterclayton
Contributor

OpenShift implemented a "manage-node" command which models an evacuate action. Currently it's up to the admin to decide whether the pods will be moved, but a forgiveness value on the pod could be the input to that command (just like the automatic evacuation implemented in the node controller). A node with a pod that has a long forgiveness policy is also less transient, so there is potential value in making scheduling decisions based on forgiveness (where increasing the forgiveness of a given node has a cost, but scheduling a pod with lower forgiveness is free). That's probably more of a model for traditional operations teams, and it's certainly not a high priority.

@smarterclayton
Contributor

Do we want to simplify forgiveness to try to get it for 1.1? As a general policy value it's useful to have in the system.

@davidopp
Member

I don't know if it would be high enough priority to commit to for 1.1, but I could imagine the first step would just be a bool (the Pod is treated normally, i.e., deleted by the NodeController after the standard timeout, vs. the Pod stays bound to the node forever unless it is manually deleted or the node is deleted).

@smarterclayton
Contributor

How about a duration?

We want to dramatically reduce the failure detection time for certain workloads on the cluster, which implies some way of distinguishing pods on an unhealthy node that should be aggressively considered down (such that the replication controller gets the earliest chance to proceed).

Maybe there's a more tactical way to manage that, but it does seem like something that would let admins start to manage SLAs more effectively and that fits into the long-term pattern.

@davidopp
Member

That seems reasonable to me; I was just suggesting the simplest thing we could do. What you suggested would also make it a subset of what @bgrant0607 suggested in his comment above (#1574 (comment)) -- you'd just have a ContiguousSeconds and basically nothing else. (As opposed to the bool, which doesn't really fit into what he suggested.)

Not sure about prioritization for 1.1 though.
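In terms of the sketch from the earlier comment, that duration-only simplification would be a single threshold carrying just a contiguousSeconds value. The following is only an illustration, not an implemented API:

```yaml
# Sketch of the "duration only" simplification (illustrative; never implemented in this form).
forgivenessPolicy:
  thresholds:
    - event: UnreachableNode
      contiguousSeconds: 60   # evict after 60 seconds on an unreachable node
```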

@smarterclayton
Contributor

If we added the value but did not leverage it in the node controller (or made it the simplest possible expression, such as changing the evacuate behavior to use the min forgiveness), would we be in a bad spot?

As far as prioritization, it's something we'd be interested in implementing, so it's really a question of API design and scope of change.

@erictune
Member

Is there any doc that has a couple of examples of concrete real world applications and what forgiveness values we recommend for Taints and Tolerations?

@smarterclayton
Contributor

We have a few apps running that are "must never be evicted", where the user has chosen to turn off node eviction today. I have fewer examples of >30 min, except a few production DBs that would prefer to be moved only if absolutely necessary, because a restore takes >30 min.


@davidopp
Member

davidopp commented Oct 29, 2016

Yeah I think there are three categories (each sketched as tolerations below):

  • want to be evicted sooner than the default 5 minutes -- we have an example of this from a GKE customer running a session-based application where every container handles a set of users, and if the node becomes unavailable that set of users can't be served until a replacement container is created on a new node (they requested 1 minute timeout)
  • want to be evicted after more than 5 minutes but less than infinite -- I think Clayton answered this one; unreplicated apps that take a long time to rebuild their state if restarted on a different node
  • never want to be evicted -- anything that runs as a DaemonSet today falls into this category (we can get rid of the hacky code that handles DaemonSet specially with respect to evictions, and just give them infinite tolerations); sounds like Clayton has other examples (I would think anything that is imported from a legacy VM world where the app can't handle "moving" once started without operator intervention and can't be rewritten to be a PersistentSet would fall into this category)
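A rough sketch of what each category looks like once expressed as NoExecute tolerations, as later implemented. The taint key is abbreviated to "unreachable" (the real key carries a node.* prefix), and the durations are only examples:

```yaml
# 1. Evicted sooner than the default 5 minutes (e.g. the session-based app above).
tolerations:
  - key: "unreachable"        # abbreviated; the real taint key carries a node.* prefix
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 60     # give up after 1 minute
---
# 2. Evicted later than the default (e.g. an unreplicated app with expensive state).
tolerations:
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 3600   # tolerate up to 1 hour
---
# 3. Never evicted (e.g. DaemonSet-like workloads): omit tolerationSeconds entirely.
tolerations:
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"       # no tolerationSeconds => tolerate forever
```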

@smarterclayton
Contributor

Anything talking to legacy network-attached storage (iSCSI, Fibre Channel, NFS over SAN) will probably want to be using stateful set + infinite forgiveness for quite a while, until we implement volume locking and have a fencer in place. That covers most mission-critical data stores I'm aware of -- Oracle, SAP, etc. Some of those will be "nodes as pets", where extrinsic HA is applied to the node and eviction would be impractical.

In terms of general "not being surprised by the platform", there is a class of existing workloads that should have very high forgiveness until they have had time to verify they can be moved.


k8s-github-robot pushed a commit that referenced this issue Jan 27, 2017
Automatic merge from submit-queue (batch tested with PRs 39469, 40557)

Forgiveness api changes

**What this PR does / why we need it**:
Split from #34825; contains the API changes that are needed to implement forgiveness (a serialized sketch follows the list):
1. Update the toleration API types to support forgiveness: add a new field, forgivenessSeconds, to indicate how long the toleration tolerates a taint.
2. Update the taint API types: add a new field to indicate the time at which the taint was added.
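A sketch of the two additions in serialized form, with illustrative keys and values. Note that the duration field described as forgivenessSeconds here ultimately landed in the API as tolerationSeconds:

```yaml
# On a pod: a toleration that is honored only for a bounded time.
tolerations:
  - key: "example.com/some-taint"   # illustrative key
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300          # described as forgivenessSeconds in this PR text
---
# On a node: each taint records when it was added, so bounded tolerations
# can be measured against it.
taints:
  - key: "example.com/some-taint"
    effect: "NoExecute"
    timeAdded: "2017-01-27T00:00:00Z"
```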

**Which issue this PR fixes** : 
Related issue: #1574
Related PR: #34825 

**Special notes for your reviewer**:

**Release note**:

```release-note
forgiveness alpha version api definition
```
k8s-github-robot pushed a commit that referenced this issue Feb 3, 2017
…hanges

Automatic merge from submit-queue (batch tested with PRs 40696, 39914, 40374)

Forgiveness library changes

**What this PR does / why we need it**:
Split from #34825; contains the library changes that are needed to implement forgiveness:

1. ~~make taints-tolerations matching respect timestamps, so that one toleration can tolerate a taint for only a period of time.~~ As TaintManager is caching taints and observing taint changes, time-based checking is now outside the library (in TaintManager); see #40355.
2. make tolerations respect the wildcard key (see the sketch after this list).
3. add/refresh some related functions to wrap taints-tolerations operations.
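For item 2, a toleration with an empty key and operator Exists acts as a wildcard that matches taints with any key. A minimal sketch:

```yaml
# Sketch: an empty key with operator Exists tolerates every taint.
tolerations:
  - operator: "Exists"
```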

**Which issue this PR fixes**: 
Related issue: #1574
Related PR: #34825, #39469 
~~Please note that the first 2 commits in this PR come from #39469 .~~

**Special notes for your reviewer**:

~~Since currently we have `pkg/api/helpers.go` and `pkg/api/v1/helpers.go`, there are some duplicated pieces of code lying in these two files.~~

~~Ideally we should move taints-tolerations related functions into a separate package (pkg/util/taints) and make them a unified set of implementations. But I'd suggest doing that in a follow-up PR after the Forgiveness ones are done, in case the Forgiveness feature gets blocked too long.~~

**Release note**:

```release-note
make tolerations respect wildcard key
```
k8s-github-robot pushed a commit that referenced this issue Feb 19, 2017
…ission-controller

Automatic merge from submit-queue (batch tested with PRs 41043, 39058, 41021, 41603, 41414)

add defaultTolerationSeconds admission controller

**What this PR does / why we need it**:
Split from #34825; adds a new admission controller that (a sketch of the injected tolerations follows the list):
1. adds a toleration (with tolerationSeconds = 300) for the taint `notReady:NoExecute` to every pod that does not already have a toleration for that taint, and
2. adds a toleration (with tolerationSeconds = 300) for the taint `unreachable:NoExecute` to every pod that does not already have a toleration for that taint.
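The effect on a pod that declares neither toleration is roughly the following pair being injected. The taint keys are abbreviated as in the PR text; the real keys carry a node.* prefix:

```yaml
# Sketch of the tolerations injected by the admission controller
# (keys abbreviated; the real taint keys carry a node.* prefix).
tolerations:
  - key: "notReady"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  - key: "unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
```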

**Which issue this PR fixes**: 
Related issue: #1574
Related PR: #34825

**Special notes for your reviewer**:

**Release note**:

```release-note
add defaultTolerationSeconds admission controller
```
k8s-github-robot pushed a commit that referenced this issue Feb 26, 2017
…ionseconds-admission-controller

Automatic merge from submit-queue (batch tested with PRs 40932, 41896, 41815, 41309, 41628)

enable DefaultTolerationSeconds admission controller by default

**What this PR does / why we need it**:
Continuation of PR #41414; enables the DefaultTolerationSeconds admission controller by default.


**Which issue this PR fixes**: 
fixes: #41860
related Issue: #1574, #25320
related PRs: #34825, #41133, #41414 

**Special notes for your reviewer**:

**Release note**:

```release-note
enable DefaultTolerationSeconds admission controller by default
```
k8s-github-robot pushed a commit that referenced this issue Feb 27, 2017
…-noexecute

Automatic merge from submit-queue (batch tested with PRs 41116, 41804, 42104, 42111, 42120)

make kubectl taint command respect effect NoExecute

**What this PR does / why we need it**:
Part of the forgiveness feature implementation; makes the kubectl taint command respect the NoExecute effect.

**Which issue this PR fixes**: 
Related Issue: #1574
Related PR: #39469

**Special notes for your reviewer**:

**Release note**:

```release-note
make kubectl taint command respect effect NoExecute
```
k8s-github-robot pushed a commit that referenced this issue Mar 1, 2017
…onds

Automatic merge from submit-queue

fix kubectl describe pod, show tolerationSeconds

**What this PR does / why we need it**:
tolerationSeconds is currently not shown in the kubectl describe result; this PR fixes that.

With this fix, pod tolerations with tolerationSeconds would look like the following:
```yaml
Name:           bar
Namespace:      foo
Node:           /
Labels:         <none>
Status:
IP:
Controllers:    <none>
Containers:     <none>
No volumes.
QoS Class:
Node-Selectors: <none>
Tolerations:    key1=value1
				key2=value2:NoSchedule
				key3=value3:NoExecute for 300s
```
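For reference, a pod spec fragment whose tolerations would render as above might look like this. This is only a sketch; the keys, values, effects, and the 300s duration are taken from the describe output, while the operators are assumptions:

```yaml
# Sketch: tolerations that would produce the describe output above.
tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"           # no effect specified => matches all effects
  - key: "key2"
    operator: "Equal"
    value: "value2"
    effect: "NoSchedule"
  - key: "key3"
    operator: "Equal"
    value: "value3"
    effect: "NoExecute"
    tolerationSeconds: 300    # rendered as "for 300s"
```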


**Which issue this PR fixes** : 
Related issue: #1574
Related PR: #39469

**Special notes for your reviewer**:

**Release note**:

```release-note
make kubectl describe pod show tolerationSeconds
```
@bgrant0607 bgrant0607 added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triaged and removed priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. team/control-plane (deprecated - do not use) area/api Indicates an issue on api area. kind/design Categorizes issue or PR as related to design. labels Mar 7, 2017
soltysh pushed a commit to soltysh/kubernetes that referenced this issue May 15, 2023
UPSTREAM: <carry>: move test rules from origin