
Add proposal for node maintenance. #17393

Closed · wants to merge 1 commit from the drain-doc branch

Conversation

@mml (Contributor) commented Nov 17, 2015

AKA draining

@mml (Contributor, Author) commented Nov 17, 2015

This still needs some editing, but I'm putting it out there sooner rather than later.

@k8s-github-robot added the kind/design (Categorizes issue or PR as related to design.) and size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) labels on Nov 17, 2015
@mml (Contributor, Author) commented Nov 17, 2015

@davidopp pointed out that I shouldn't be using events for anything important, so that will be revised shortly.

@davidopp (Member): @bgrant0607 this was auto-assigned to you but I'm happy to do the first review if you're busy

@ncdc (Member) commented Nov 18, 2015

cc @kubernetes/rh-cluster-infra @smarterclayton @pravisankar @timothysc

@mml force-pushed the drain-doc branch 2 times, most recently from 23c8275 to 0489c8e on November 18, 2015
@bgrant0607 assigned davidopp and unassigned bgrant0607 on Nov 19, 2015

Inline review comment on docs/proposals/maintenance.md:

## Service Quality

This may be thought of as the cell admin offering a guarantee or SLO, or as the

@pmorie (Member): s/cell/node, or am i missing something?

@mml (Contributor, Author) replied:

On Wed, Nov 18, 2015 at 10:18 PM, Paul Morie notifications@github.com wrote:

In docs/proposals/maintenance.md (#17393 (comment)):

+Nodes need periodic maintenance, most of which require partial or total
+unavailability to pods. The cluster admin would like for this to happen (1)
+automatically, and (2) respecting any service quality constraints.
+
+Examples of maintenance:
+
+- unmount/remount filesystems to change mkfs params
+- remove from service but leave up for debugging
+- reboot for kernel or image upgrade
+- remove from service for hardware repair
+- remove from service permanently
+
+## Service Quality
+
+This may be thought of as the cell admin offering a guarantee or SLO, or as the

s/cell/node, or am i missing something?

Fixed to 'cluster admin'. Borg terminology was seeping in.




@timothysc (Member): I see no notion of forgiveness/claim-life here; could we at least reference it so that once it's available we can address it?

Inline review comment on docs/proposals/maintenance.md:

### Reliable Event Log

If we wish our SLO to cover durations longer than an hour or so, the loss of historical state on restart becomes a real problem. Adding a reliable event log

(Member): s/event log/history service

@mml (Contributor, Author) replied:

s/event log/history service

Done.

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 1:13 PM, Timothy St. Clair <notifications@github.com> wrote:

I see no notion of forgiveness/claim-life here; could we at least reference it so that once it's available we can address it?

I understand what "forgiveness" means, but not "claim-life". Can you say more specifically what you're driving at?

@timothysc (Member): @mml there are no points in the proposal that outline how pods which may have been matched based on a predicate will maintain their place after maintenance. Namely, we are going to want data persistence across maintenance windows in the future.

/cc @kubernetes/rh-storage

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 2:17 PM, Timothy St. Clair <notifications@github.com> wrote:

@mml there are no points in the proposal that outline how pods which may have been matched based on a predicate will maintain their place after maintenance. Namely, we are going to want data persistence across maintenance windows in the future.

I think I understand. An example would be storage servers that need to go
back onto the machine that has the specific bytes they serve up.

How will such pods be managed in general? Intentional removal for maintenance is just one reason the pod might die.

Also, is there an issue covering this use case?

/cc @kubernetes/rh-storage


@timothysc (Member): #1574

@mml (Contributor, Author) commented Nov 19, 2015

On Thu, Nov 19, 2015 at 2:39 PM, Timothy St. Clair <notifications@github.com> wrote:

#1574

OK. It sounds like forgiveness, if implemented, would cover one case:
brief removal from service for a reboot. You would simply configure
storage pods to forgive these events.

But, quoting @bgrant0607:

This approach would generalize nicely for cases where, for example,
applications wanted to endure reboots but give up in the case of extended
outages or in the case that the disk goes bad.

If you're taking a node down forever, you may have to get unreplicated
bytes off. If we need to distinguish this type of event and handle it
differently (this may not be a use case anyone has yet), we should file an
issue describing it.
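Forgiveness later surfaced in the Kubernetes API as tolerations with a tolerationSeconds bound. Purely as an illustration of the idea discussed above (not something this proposal defines), a storage pod that rides out a brief maintenance reboot but gives up on an extended outage might carry a toleration like the sketch below; the taint key is the present-day node.kubernetes.io/not-ready and the ten-minute window is an arbitrary example.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical illustration of "forgiveness": the pod tolerates the node
	// being not-ready (e.g. a maintenance reboot) for up to ten minutes
	// before the system gives up and evicts it.
	forgiveFor := int64(600) // seconds; example value only
	tol := corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &forgiveFor,
	}
	fmt.Printf("tolerate %q (%s) for %d seconds\n", tol.Key, tol.Effect, *tol.TolerationSeconds)
}
```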

Inline review comment on docs/proposals/maintenance.md:

parameters:

- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)

Sami Wagiaalla commented: Hmm.. what does "fraction of pods healthy" mean?

@mml (Contributor, Author) replied:

On Fri, Nov 20, 2015 at 6:52 AM, Sami Wagiaalla notifications@github.com wrote:

Hmm.. what does "fraction of pods healthy" mean?

What fraction of pods in the ReplicaSet are scheduled and have Ready=True,
assuming they have readiness probes defined. Would "ready" be a better
term than "healthy" here?

(Member): You should probably add "of pods in the same replica set" as you did in the previous line.

(Member): The use of "shard" is a little confusing to me. We don't have a mapping of shards to ReplicationControllers, much less multi-dimensionally (e.g. 10 shards, 3 copies of each). Calling it "shard strength" makes me think of the second dimension (e.g. it's ok to have only 1 copy of my shard as long as you don't do that to more than 1 shard at a time) but we have (AFAIK) no such concept.

@mml (Contributor, Author) replied:

The use of "shard" is a little confusing to me.

Agreed. How about "replica strength"? Considered "pod strength", but that's weaker.
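To make the "minimum fraction of pods ready" idea from this thread concrete, here is a small hypothetical sketch; the function name and parameters are illustrative only, not an API proposed here.

```go
package main

import "fmt"

// evictionAllowed is a hypothetical helper: evicting one pod from a
// ReplicaSet is permitted only if the pods still Ready afterwards meet the
// minimum fraction of the desired replica count.
func evictionAllowed(desiredReplicas, readyPods int, minReadyFraction float64) bool {
	if desiredReplicas == 0 {
		return true // nothing to protect
	}
	readyAfterEviction := readyPods - 1
	return float64(readyAfterEviction)/float64(desiredReplicas) >= minReadyFraction
}

func main() {
	// 10 desired replicas, 9 Ready, floor of 80% ready: eviction leaves 8/10.
	fmt.Println(evictionAllowed(10, 9, 0.8)) // true
	// Only 8 Ready: eviction would leave 7/10, below the floor.
	fmt.Println(evictionAllowed(10, 8, 0.8)) // false
}
```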

@ghost commented Nov 20, 2015

An example would be storage servers that need to go back onto the machine that has the specific bytes they serve up. How will such pods be managed in general? Intentional removal for maintenance is just one reason the pod might die.

I think that depends on the strength of the nodeSelector used to schedule the pod in the first place. If the node selector can only be fulfilled by the node which has been decommissioned, then the pod will wait until that node is back. If not, the pod will be scheduled to a new node and local storage will be lost. I think that is fair enough because the nodeSelector reflects the user's desire. So an application which uses node storage as a computation cache can be safely rescheduled to a new node but will be a bit slow at first, and an application which uses node storage for important state will not be rescheduled until the node is back up. Users can use nodeSelectors to express that. Network storage will of course follow the pod to the new node.
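As a concrete illustration of the point above, a pod pinned to a single node via nodeSelector might look like the sketch below (using today's Go client types); the node name and image are made up.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A nodeSelector that only one node can satisfy: if that node is
	// decommissioned, the pod stays Pending until the node returns. Without
	// such a selector, the pod would simply be rescheduled elsewhere and any
	// node-local storage would be lost.
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "storage-server-0"},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{
				"kubernetes.io/hostname": "node-with-the-bytes", // hypothetical node
			},
			Containers: []corev1.Container{
				{Name: "server", Image: "example.com/storage-server:latest"},
			},
		},
	}
	fmt.Println(pod.Name, "pinned to", pod.Spec.NodeSelector["kubernetes.io/hostname"])
}
```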

Inline review comment on docs/proposals/maintenance.md:

- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)
- eviction notice

@erictune (Member): Is there an implicit parameter here, which is "maximum machines drained at once", which is equal to the number of concurrent drains? Maybe I want to drain N machines at once in a large cluster? Or maybe it is better expressed as "min machines still in cluster at any time"? Gets complicated if we add dedicated machines though, so maybe too complex? Should be explicit (allow draining two at once).

(Member): Why would the user ever care about that, as long as the system is satisfying the shard strength and eviction rate requirements of all the pods on the machines you are simultaneously draining? Is it to leave slack capacity for workload bursts? Seems like that should be a cluster autoscaling parameter rather than something here.

@mml (Contributor, Author) replied:

On Fri, Nov 20, 2015 at 7:47 AM, Eric Tune notifications@github.com wrote:

Is there an implicit parameter here, which is "maximum machines drained at
once", which is equal to the number of concurrent drains. Maybe I want to
drain N machines at once in a large cluster? Or maybe it is better
expressed as "min machines still in cluster at any time"? Gets complicated
if we add dedicated machines though, so maybe too complex? should be
explicit (allow draining two at once

We sort of implicitly manage it by guaranteeing that only a limited amount
of work is pending, but we don't offer any guarantees on slack capacity.
To give a concrete example of why this matters: you could imagine someone
downing a replication controller for an (unmanaged) upgrade, then the drain
controller sees the green light and is able to shut down a bunch of nodes,
then the user ups a new rc and now there is way less available capacity and
things go pending.

You could manage this with max-nodes-in-flight, as you suggest, or with a
"percentage of capacity" parameter. You'd also get defense in depth from
rate-limiting pod deletions. If the controller goes slow, it can only
do so much damage during the window.

Either way, I will add a mention of this issue.

(Contributor): I expect this to be requested frequently by admins who have strict policies on cluster management.

(Member): "Maximum machines drained at once" is an issue for bare metal. Suppose an HPA scales up an RC due to increased load. There is no such thing as cluster autoscaling on bare metal.
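A tiny hypothetical sketch of the "maximum machines drained at once" / percentage-of-capacity gate discussed above; the function and both thresholds are illustrative, not something the proposal defines.

```go
package main

import "fmt"

// mayStartDrain is a hypothetical gate a drain controller could consult
// before starting another drain, so a burst of eligible nodes cannot pull
// too much capacity out of service at once.
func mayStartDrain(totalNodes, currentlyDraining, maxInFlight int, minCapacityFraction float64) bool {
	if currentlyDraining >= maxInFlight {
		return false
	}
	remaining := totalNodes - (currentlyDraining + 1)
	return float64(remaining)/float64(totalNodes) >= minCapacityFraction
}

func main() {
	// 100-node cluster, at most 3 concurrent drains, keep 95% of nodes in service.
	fmt.Println(mayStartDrain(100, 2, 3, 0.95)) // true: 97 nodes would remain
	fmt.Println(mayStartDrain(100, 3, 3, 0.95)) // false: already at max in flight
}
```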

@pmorie (Member) commented Dec 2, 2015

Moving discussion to PR scope from: #17393 (comment), since GitHub does the incredibly annoying 'fold-outdated-diff' thing that makes this hard to find in the UI.

Let's get quorum on the final names for the verbs to put yellow tape and slippery-when-wet signs around a node while under maintenance. Quoting @mml:

I would like a term that cleanly has two verbs and a noun describing the state. Do you have any suggestions along those lines? In the current case: lame, unlame, and lamed.
Some suggestions from discussions with @bgrant0607 and @erictune:

  1. cordon, uncordon, cordoned
  2. decommission, recommission, decommissioned
  3. close, open, closed (as in a faucet or business)
  4. shut, reopen, shut (ditto)

My personal preference is for (1), cordon/uncordon/cordoned.

@mml @davidopp @erictune @bgrant0607 @derekwaynecarr thoughts?

@mml (Contributor, Author) commented Dec 2, 2015

Let's get quorum on the final names for the verb

I like cordon/uncordon, and it's had the least (0?) objections. If I don't hear anything by tomorrow morning, I'll rewrite this and #16698 with those names.

I don't think anyone objects to drain, but if there are objections (speak now or forever etc), evacuate is my second choice.
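For concreteness, "cordon" amounts to marking the node unschedulable (node.Spec.Unschedulable = true), and "uncordon" clears the flag. Below is a minimal sketch using today's client-go API (anachronistic for this 2015 thread); error handling, conflict retries, and the drain step are omitted, and the node name is a placeholder.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// cordon marks a node unschedulable so no new pods land on it; uncordon
// would simply set the flag back to false. This is a sketch, not the real
// kubectl implementation, which patches the object and handles conflicts.
func cordon(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true // the "yellow tape" around the node
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(config)
	if err := cordon(context.Background(), cs, "node-1"); err != nil { // "node-1" is a placeholder
		panic(err)
	}
	fmt.Println("node-1 cordoned")
}
```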

@pmorie (Member) commented Dec 3, 2015

@mml I have no objections to drain


@davidopp (Member): @mml is this ready to re-review?

@mml (Contributor, Author) commented Dec 19, 2015

No, I still have some comments to address, and it really needs a new revision to give more than lip service to some of these issues.

We do not have a strong story for everything that has been raised (e.g. how complicated workflows would work), and I'm not sure we really need that unless we plan to implement this soon, so I wonder where I should stop. The unaddressed issues become fractal, without a clear stopping point: the more I write, the more new things I uncover that aren't addressed.

@bgrant0607 (Member): I'd like to wrap this up. Reassigning to myself to take a look, since I'm also reviewing #18263.

@bgrant0607 assigned bgrant0607 and unassigned davidopp on Jan 22, 2016
@bgrant0607 (Member): Ah, this is mostly about SLOs. Ok, transferring back to @davidopp for now.

@bgrant0607 assigned davidopp and unassigned bgrant0607 on Jan 22, 2016
@davidopp (Member):
Yeah, there is no design or doc yet about using taints/tolerations in machine lifecycle events. I intentionally left that out of #18263 because I felt the dedicated node use case was sufficient justification for it and the amount of time to get a design doc approved seems to scale with the number of words in it. :-) @mml can write a separate doc about that once he has had time to think about it (and we get the taints/tolerations design doc merged).

@k8s-bot commented Feb 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@ikehz mentioned this pull request Apr 22, 2016
k8s-github-robot pushed a commit that referenced this pull request Jul 10, 2016
Automatic merge from submit-queue

[WIP/RFC] Rescheduling in Kubernetes design proposal

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega).

This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things. 

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107
@k8s-bot commented Jul 12, 2016

GCE e2e build/test passed for commit 0e9aa1d.

@k8s-github-robot commented:
This PR hasn't been active in 179 days. Feel free to reopen.

You can add 'keep-open' label to prevent this from happening again.
