
Add proposal for node maintenance. #17393

Closed · wants to merge 1 commit from the drain-doc branch

Conversation

@mml (Contributor) commented Nov 17, 2015

AKA draining

@mml (Contributor, Author) commented Nov 17, 2015

This still needs some editing, but I'm putting it out there sooner rather than later.

@k8s-github-robot added the kind/design (Categorizes issue or PR as related to design.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Nov 17, 2015
@mml (Contributor, Author) commented Nov 17, 2015

@davidopp pointed out that I shouldn't be using events for anything important, so that will be revised shortly.

@davidopp (Member): @bgrant0607 this was auto-assigned to you but I'm happy to do the first review if you're busy

@ncdc (Member) commented Nov 18, 2015

cc @kubernetes/rh-cluster-infra @smarterclayton @pravisankar @timothysc

@mml force-pushed the drain-doc branch 2 times, most recently from 23c8275 to 0489c8e on November 18, 2015
@bgrant0607 assigned davidopp and unassigned bgrant0607 on Nov 19, 2015

Inline review comment on docs/proposals/maintenance.md:

## Service Quality

This may be thought of as the cell admin offering a guarantee or SLO, or as the

@pmorie (Member): s/cell/node, or am i missing something?

@mml (Contributor, Author) replied:

On Wed, Nov 18, 2015 at 10:18 PM, Paul Morie notifications@github.com wrote:

In docs/proposals/maintenance.md (#17393 (comment)):

+Nodes need periodic maintenance, most of which require partial or total
+unavailability to pods. The cluster admin would like for this to happen (1)
+automatically, and (2) respecting any service quality constraints.
+
+Examples of maintenance:
+
+- unmount/remount filesystems to change mkfs params
+- remove from service but leave up for debugging
+- reboot for kernel or image upgrade
+- remove from service for hardware repair
+- remove from service permanently
+
+## Service Quality
+
+This may be thought of as the cell admin offering a guarantee or SLO, or as the

s/cell/node, or am i missing something?

Fixed to 'cluster admin'. Borg terminology was seeping in.




@timothysc (Member): I see no notion of forgiveness/claim-life here; could we at least reference it so that once it's available we can address it?

Inline review comment on docs/proposals/maintenance.md:

### Reliable Event Log

If we wish our SLO to cover durations longer than an hour or so, the loss of historical state on restart becomes a real problem. Adding a reliable event log

(Member): s/event log/history service

@mml (Contributor, Author) replied:

s/event log/history service

Done.

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 1:13 PM, Timothy St. Clair <notifications@github.com> wrote:

I see no notion of forgiveness/claim-life here; could we at least reference it so that once it's available we can address it?

I understand what "forgiveness" means, but not "claim-life". Can you say more specifically what you're driving at?

@timothysc (Member): @mml there are no points in the proposal that outline how pods which may have been matched based on a predicate will maintain their place after maintenance. Namely, we are going to want data persistence across maintenance windows in the future.

/cc @kubernetes/rh-storage

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 2:17 PM, Timothy St. Clair <notifications@github.com> wrote:

@mml there are no points in the proposal that outline how pods which may have been matched based on a predicate will maintain their place after maintenance. Namely, we are going to want data persistence across maintenance windows in the future.

I think I understand. An example would be storage servers that need to go
back onto the machine that has the specific bytes they serve up.

How will such pods be managed in general? Intentional removal for maintenance is just one reason the pod might die.

Also, is there an issue covering this use case?

/cc @kubernetes/rh-storage


@timothysc (Member): #1574

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 2:39 PM, Timothy St. Clair <notifications@github.com> wrote:

#1574

OK. It sounds like forgiveness, if implemented, would cover one case:
brief removal from service for a reboot. You would simply configure
storage pods to forgive these events.

But, quoting @bgrant0607:

This approach would generalize nicely for cases where, for example,
applications wanted to endure reboots but give up in the case of extended
outages or in the case that the disk goes bad.

If you're taking a node down forever, you may have to get unreplicated
bytes off. If we need to distinguish this type of event and handle it
differently (this may not be a use case anyone has yet), we should file an
issue describing it.
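Forgiveness later surfaced in the Kubernetes API as tolerations with a tolerationSeconds bound. Purely as an illustration of the idea discussed above (not something this proposal defines), a storage pod that rides out a brief maintenance reboot but gives up on an extended outage might carry a toleration like the sketch below; the taint key is the present-day node.kubernetes.io/not-ready and the ten-minute window is an arbitrary example.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical illustration of "forgiveness": the pod tolerates the node
	// being not-ready (e.g. a maintenance reboot) for up to ten minutes
	// before the system gives up and evicts it.
	forgiveFor := int64(600) // seconds; example value only
	tol := corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &forgiveFor,
	}
	fmt.Printf("tolerate %q (%s) for %d seconds\n", tol.Key, tol.Effect, *tol.TolerationSeconds)
}
```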

Inline review comment on docs/proposals/maintenance.md:

parameters:

- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)

Sami Wagiaalla commented: Hmm.. what does "fraction of pods healthy" mean?

@mml (Contributor, Author) replied:

On Fri, Nov 20, 2015 at 6:52 AM, Sami Wagiaalla notifications@github.com wrote:

Hmm.. what does "fraction of pods healthy" mean?

What fraction of pods in the ReplicaSet are scheduled and have Ready=True,
assuming they have readiness probes defined. Would "ready" be a better
term than "healthy" here?

(Member): You should probably add "of pods in the same replica set" as you did in the previous line.

(Member): The use of "shard" is a little confusing to me. We don't have a mapping of shards to ReplicationControllers, much less multi-dimensionally (e.g. 10 shards, 3 copies of each). Calling it "shard strength" makes me think of the second dimension (e.g. it's ok to have only 1 copy of my shard as long as you don't do that to more than 1 shard at a time) but we have (AFAIK) no such concept.

@mml (Contributor, Author) replied:

The use of "shard" is a little confusing to me.

Agreed. How about "replica strength"? Considered "pod strength", but that's weaker.
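To make the "minimum fraction of pods ready" idea from this thread concrete, here is a small hypothetical sketch; the function name and parameters are illustrative only, not an API proposed here.

```go
package main

import "fmt"

// evictionAllowed is a hypothetical helper: evicting one pod from a
// ReplicaSet is permitted only if the pods still Ready afterwards meet the
// minimum fraction of the desired replica count.
func evictionAllowed(desiredReplicas, readyPods int, minReadyFraction float64) bool {
	if desiredReplicas == 0 {
		return true // nothing to protect
	}
	readyAfterEviction := readyPods - 1
	return float64(readyAfterEviction)/float64(desiredReplicas) >= minReadyFraction
}

func main() {
	// 10 desired replicas, 9 Ready, floor of 80% ready: eviction leaves 8/10.
	fmt.Println(evictionAllowed(10, 9, 0.8)) // true
	// Only 8 Ready: eviction would leave 7/10, below the floor.
	fmt.Println(evictionAllowed(10, 8, 0.8)) // false
}
```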

@ghost commented Nov 20, 2015

An example would be storage servers that need to go back onto the machine that has the specific bytes they serve up. How will such pods be managed in general? Intentional removal for maintenance is just one reason the pod might die.

I think that depends on the strength of the nodeSelector used to schedule the pod in the first place. If the node selector can only be fulfilled by the node which has been decommissioned, then the pod will wait until that node is back. If not, the pod will be scheduled to a new node and local storage will be lost. I think that is fair enough because the nodeSelector reflects the user's desire. So an application which uses node storage as a computation cache can be safely rescheduled to a new node but will be a bit slow at first, and an application which uses node storage for important state will not be rescheduled until the node is back up. Users can use nodeSelectors to express that. Network storage will of course follow the pod to the new node.
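As a concrete illustration of the point above, a pod pinned to a single node via nodeSelector might look like the sketch below (using today's Go client types); the node name and image are made up.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A nodeSelector that only one node can satisfy: if that node is
	// decommissioned, the pod stays Pending until the node returns. Without
	// such a selector, the pod would simply be rescheduled elsewhere and any
	// node-local storage would be lost.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "storage-server-0"},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{
				"kubernetes.io/hostname": "node-with-the-bytes", // hypothetical node
			},
			Containers: []corev1.Container{
				{Name: "server", Image: "example.com/storage-server:latest"},
			},
		},
	}
	fmt.Println(pod.Name, "pinned to", pod.Spec.NodeSelector["kubernetes.io/hostname"])
}
```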

Inline review comment on docs/proposals/maintenance.md:

- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)
- eviction notice

@erictune (Member): Is there an implicit parameter here, which is "maximum machines drained at once", which is equal to the number of concurrent drains? Maybe I want to drain N machines at once in a large cluster? Or maybe it is better expressed as "min machines still in cluster at any time"? Gets complicated if we add dedicated machines though, so maybe too complex? Should be explicit (allow draining two at once).

(Member): Why would the user ever care about that, as long as the system is satisfying the shard strength and eviction rate requirements of all the pods on the machines you are simultaneously draining? Is it to leave slack capacity for workload bursts? Seems like that should be a cluster autoscaling parameter rather than something here.

@mml (Contributor, Author) replied:

On Fri, Nov 20, 2015 at 7:47 AM, Eric Tune notifications@github.com wrote:

Is there an implicit parameter here, which is "maximum machines drained at
once", which is equal to the number of concurrent drains. Maybe I want to
drain N machines at once in a large cluster? Or maybe it is better
expressed as "min machines still in cluster at any time"? Gets complicated
if we add dedicated machines though, so maybe too complex? should be
explicit (allow draining two at once

We sort of implicitly manage it by guaranteeing that only a limited amount
of work is pending, but we don't offer any guarantees on slack capacity.
To give a concrete example of why this matters: you could imagine someone
downing a replication controller for an (unmanaged) upgrade, then the drain
controller sees the green light and is able to shut down a bunch of nodes,
then the user ups a new rc and now there is way less available capacity and
things go pending.

You could manage this with max-nodes-in-flight, as you suggest, or with a
"percentage of capacity" parameter. You'd also get defense in depth from
rate-limiting pod deletions. If the controller goes slow, it can only
do so much damage during the window.

Either way, I will add a mention of this issue.

(Contributor): I expect this to be requested frequently by admins who have strict policies on cluster management.

(Member): "Maximum machines drained at once" is an issue for bare metal. Suppose an HPA scales up an RC due to increased load. There is no such thing as cluster autoscaling on bare metal.
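A tiny hypothetical sketch of the "maximum machines drained at once" / percentage-of-capacity gate discussed above; the function and both thresholds are illustrative, not something the proposal defines.

```go
package main

import "fmt"

// mayStartDrain is a hypothetical gate a drain controller could consult
// before starting another drain, so a burst of eligible nodes cannot pull
// too much capacity out of service at once.
func mayStartDrain(totalNodes, currentlyDraining, maxInFlight int, minCapacityFraction float64) bool {
	if currentlyDraining >= maxInFlight {
		return false
	}
	remaining := totalNodes - (currentlyDraining + 1)
	return float64(remaining)/float64(totalNodes) >= minCapacityFraction
}

func main() {
	// 100-node cluster, at most 3 concurrent drains, keep 95% of nodes in service.
	fmt.Println(mayStartDrain(100, 2, 3, 0.95)) // true: 97 nodes would remain
	fmt.Println(mayStartDrain(100, 3, 3, 0.95)) // false: already at max in flight
}
```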

@pmorie (Member) commented Dec 2, 2015

Moving discussion to PR scope from: #17393 (comment), since GitHub does the incredibly annoying 'fold-outdated-diff' thing that makes this hard to find in the UI.

Let's get quorum on the final names for the verbs to put yellow tape and slippery-when-wet signs around a node while under maintenance. Quoting @mml:

I would like a term that cleanly has two verbs and a noun describing the state. Do you have any suggestions along those lines? In the current case: lame, unlame, and lamed.
Some suggestions from discussions with @bgrant0607 and @erictune:

  1. cordon, uncordon, cordoned
  2. decommission, recommission, decommissioned
  3. close, open, closed (as in a faucet or business)
  4. shut, reopen, shut (ditto)

My personal preference is for (1), cordon/uncordon/cordoned.

@mml @davidopp @erictune @bgrant0607 @derekwaynecarr thoughts?

@mml (Contributor, Author) commented Dec 2, 2015

Let's get quorum on the final names for the verb

I like cordon/uncordon, and it's had the least (0?) objections. If I don't hear anything by tomorrow morning, I'll rewrite this and #16698 with those names.

I don't think anyone objects to drain, but if there are objections (speak now or forever etc), evacuate is my second choice.
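For concreteness, "cordon" amounts to marking the node unschedulable (node.Spec.Unschedulable = true), and "uncordon" clears the flag. Below is a minimal sketch using today's client-go API (anachronistic for this 2015 thread); error handling, conflict retries, and the drain step are omitted, and the node name is a placeholder.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// cordon marks a node unschedulable so no new pods land on it; uncordon
// would simply set the flag back to false. This is a sketch, not the real
// kubectl implementation, which patches the object and handles conflicts.
func cordon(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // the "yellow tape" around the node
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)
	if err := cordon(context.Background(), cs, "node-1"); err != nil { // "node-1" is a placeholder
		panic(err)
	}
	fmt.Println("node-1 cordoned")
}
```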

@pmorie (Member) commented Dec 3, 2015

@mml I have no objections to drain


@davidopp (Member): @mml is this ready to re-review?

@mml (Contributor, Author) commented Dec 19, 2015

No, I still have some comments to address, and it really needs a new revision to give more than lip service to some of these issues.

We do not have a strong story for everything that has been raised (e.g. how complicated workflows would work), and I'm not sure we really need that unless we plan to implement this soon, so I wonder where I should stop. The unaddressed issues become fractal, without a clear stopping point: the more I write, the more new things I uncover that aren't addressed.

@bgrant0607 (Member): I'd like to wrap this up. Reassigning to myself to take a look, since I'm also reviewing #18263.

@bgrant0607 assigned bgrant0607 and unassigned davidopp on Jan 22, 2016
@bgrant0607 (Member): Ah, this is mostly about SLOs. Ok, transferring back to @davidopp for now.

@bgrant0607 assigned davidopp and unassigned bgrant0607 on Jan 22, 2016
@davidopp (Member):
Yeah, there is no design or doc yet about using taints/tolerations in machine lifecycle events. I intentionally left that out of #18263 because I felt the dedicated node use case was sufficient justification for it and the amount of time to get a design doc approved seems to scale with the number of words in it. :-) @mml can write a separate doc about that once he has had time to think about it (and we get the taints/tolerations design doc merged).

@k8s-bot commented Feb 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@ikehz mentioned this pull request Apr 22, 2016
k8s-github-robot pushed a commit that referenced this pull request Jul 10, 2016
Automatic merge from submit-queue

[WIP/RFC] Rescheduling in Kubernetes design proposal

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things. 

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107
@k8s-bot commented Jul 12, 2016

GCE e2e build/test passed for commit 0e9aa1d.

@k8s-github-robot commented:
This PR hasn't been active in 179 days. Feel free to reopen.

You can add 'keep-open' label to prevent this from happening again.
