Add proposal for node maintenance. #17393
Conversation
This still needs some editing, but putting it out there sooner than later.
@davidopp pointed out that I shouldn't be using events for anything important, so that will be revised shortly.
@bgrant0607 this was auto-assigned to you but I'm happy to do the first review if you're busy.
cc @kubernetes/rh-cluster-infra @smarterclayton @pravisankar @timothysc |
## Service Quality

This may be thought of as the cell admin offering a guarantee or SLO, or as the
s/cell/node, or am i missing something?
In docs/proposals/maintenance.md (#17393 (comment)):
+Nodes need periodic maintenance, most of which require partial or total
+unavailability to pods. The cluster admin would like for this to happen (1)
+automatically, and (2) respecting any service quality constraints.
+
+Examples of maintenance:
+
+- unmount/remount filesystems to change mkfs params
+- remove from service but leave up for debugging
+- reboot for kernel or image upgrade
+- remove from service for hardware repair
+- remove from service permanently
+
+## Service Quality
+
+This may be thought of as the cell admin offering a guarantee or SLO, or as the
Fixed to 'cluster admin'. Borg terminology was seeping in.
@pmorie xref to google nomenclature: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf
I see no notion of forgiveness/claim-life here; could we at least reference it so that, once available, we can address it?
### Reliable Event Log

If we wish our SLO to cover durations longer than an hour or so, the loss of
historical state on restart becomes a real problem. Adding a reliable event log
s/event log/history service
Done.
I understand what "forgiveness" means, but not "claim-life". Can you explain more?
@mml there are no points in the proposal that outline how pods which may have been matched based on a predicate will maintain their place after maintenance. Namely, we are going to want data persistence across maintenance windows in the future. /cc @kubernetes/rh-storage
I think I understand. An example would be storage servers that need to go ...
How will such pods be managed in general? Intentional removal for maintenance ...
Also, is there an issue covering this use case? /cc @kubernetes/rh-storage
OK. It sounds like forgiveness, if implemented, would cover one case: ...
But, quoting @bgrant0607: "If you're taking a node down forever, you may have to get unreplicated ..."
parameters:

- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)
Hmm.. what does "fraction of pods healthy" mean?
What fraction of pods in the ReplicaSet are scheduled and have Ready=True, assuming they have readiness probes defined. Would "ready" be a better term than "healthy" here?
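To make "fraction of pods ready" concrete, here is a minimal sketch against today's client-go (not part of the proposal; the helper name `readyFraction` and the `app=web` selector are made up for illustration):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// readyFraction is a hypothetical helper: it lists the pods matching a label
// selector and returns the fraction that are scheduled and report Ready=True.
func readyFraction(ctx context.Context, c kubernetes.Interface, ns, selector string) (float64, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return 0, err
	}
	if len(pods.Items) == 0 {
		return 0, nil
	}
	ready := 0
	for _, p := range pods.Items {
		if p.Spec.NodeName == "" {
			continue // not scheduled yet
		}
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
				break
			}
		}
	}
	return float64(ready) / float64(len(pods.Items)), nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	f, err := readyFraction(context.Background(), client, "default", "app=web")
	if err != nil {
		panic(err)
	}
	fmt.Printf("ready fraction: %.2f\n", f)
}
```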
you should probably add "of pods in the same replica set" as you did in the previous line
The use of "shard" is a little confusing to me. We don't have a mapping of shards to ReplicationControllers, much less multi-dimensionally (e.g. 10 shards, 3 copies of each). Calling it "shard strength" makes me think of the second dimension (e.g. it's ok to have only 1 copy of my shard as long as you don't do that to more than 1 shard at a time) but we have (AFAIK) no such concept.
Agreed. How about "replica strength"? Considered "pod strength", but that's weaker.
I think that depends on the strength of the nodeSelector used to schedule the pod in the first place. If the node selector can only be fulfilled by the node which has been decommissioned, then the pod will wait until that node is back. If not, the pod will be scheduled to a new node and local storage will be lost. I think that is fair enough, because the nodeSelector reflects the user's desire. So an application which uses node storage as a computation cache can be safely rescheduled to a new node but will be a bit slow at first, and an application which uses node storage for important state will not be rescheduled until the node is back up. Users can use nodeSelectors to express that. Network storage will of course follow the pod to the new node.
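As a small illustration of the two cases described above (the label values and images are made up), a pod pinned to one node via the well-known `kubernetes.io/hostname` label can only come back on that node, while a looser selector lets it be rescheduled anywhere that matches:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Case 1: a nodeSelector only one node can satisfy. If that node is
	// decommissioned, the pod stays Pending until a node carrying that
	// hostname label comes back.
	pinned := corev1.PodSpec{
		Containers:   []corev1.Container{{Name: "db", Image: "example/db:1.0"}},
		NodeSelector: map[string]string{"kubernetes.io/hostname": "node-1"},
	}

	// Case 2: a looser selector (e.g. a disk-type label) that many nodes can
	// satisfy; the pod can be rescheduled elsewhere, losing only local cache.
	portable := corev1.PodSpec{
		Containers:   []corev1.Container{{Name: "web", Image: "example/web:1.0"}},
		NodeSelector: map[string]string{"disktype": "ssd"},
	}

	_ = pinned
	_ = portable
}
```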
- time between evictions (of pods in the same replica set)
- minimum shard strength (minimum fraction of pods healthy)
- eviction notice
Is there an implicit parameter here, which is "maximum machines drained at once", which is equal to the number of concurrent drains? Maybe I want to drain N machines at once in a large cluster? Or maybe it is better expressed as "min machines still in cluster at any time"? Gets complicated if we add dedicated machines though, so maybe too complex? Should be explicit (allow draining two at once).
Why would the user ever care about that, as long as the system is satisfying the shard strength and eviction rate requirements of all the pods on the machines you are simultaneously draining? Is it to leave slack capacity for workload bursts? Seems like that should be a cluster autoscaling parameter rather than something here.
We sort of implicitly manage it by guaranteeing that only a limited amount of work is pending, but we don't offer any guarantees on slack capacity.

To give a concrete example of why this matters: you could imagine someone downing a replication controller for an (unmanaged) upgrade, then the drain controller sees the green light and is able to shut down a bunch of nodes, then the user ups a new rc and now there is way less available capacity and things go pending.

You could manage this with max-nodes-in-flight, as you suggest, or with a "percentage of capacity" parameter. You'd also get defense in depth from rate limiting pod deletion rates. If the controller goes slow, it can only do so much damage during the window.

Either way, I will add a mention of this issue.
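As a rough sketch of the two throttles mentioned above (max nodes in flight plus a pod-deletion rate limit), neither of which is specified by this proposal, the snippet below uses a channel as a drain-slot semaphore and a shared rate limiter; `drainNode` is a stand-in for whatever the real controller would do per node:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// drainNode is a stand-in for evicting every pod on a node. The rate limiter
// bounds how fast pods may be deleted cluster-wide, giving defense in depth
// if the drain controller misbehaves.
func drainNode(ctx context.Context, node string, pods []string, evictions *rate.Limiter) error {
	for _, pod := range pods {
		if err := evictions.Wait(ctx); err != nil {
			return err
		}
		fmt.Printf("evicting %s from %s\n", pod, node)
	}
	return nil
}

func main() {
	ctx := context.Background()

	// Throttle 1: at most maxInFlight nodes draining concurrently.
	const maxInFlight = 2
	inFlight := make(chan struct{}, maxInFlight)

	// Throttle 2: at most 1 pod eviction per second, with a small burst.
	evictions := rate.NewLimiter(rate.Limit(1), 3)

	nodes := map[string][]string{
		"node-1": {"web-a", "web-b"},
		"node-2": {"db-a"},
		"node-3": {"web-c"},
	}

	var wg sync.WaitGroup
	for node, pods := range nodes {
		wg.Add(1)
		go func(node string, pods []string) {
			defer wg.Done()
			inFlight <- struct{}{}        // acquire a drain slot
			defer func() { <-inFlight }() // release it when done
			if err := drainNode(ctx, node, pods, evictions); err != nil {
				fmt.Printf("drain of %s failed: %v\n", node, err)
			}
		}(node, pods)
	}
	wg.Wait()
}
```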
I expect this to be requested frequently by admins who have strict policies on cluster management.
"Maximum machines drained at once" is an issue for bare metal. Suppose an HPA scales up an RC due to increased load; there is no such thing as cluster autoscaling on bare metal.
Moving discussion to PR scope from #17393 (comment), since github does the incredibly annoying 'fold-outdated-diff' thing that makes this hard to find in the UI. Let's get quorum on the final names for the verb to put yellow tape and slippery-when-wet signs around a node while under maintenance. Quoting @mml: [...]

My personal preference is for (1), cordon/uncordon/cordoned. @mml @davidopp @erictune @bgrant0607 @derekwaynecarr thoughts?
I like cordon/uncordon, and it's had the least (0?) objections. If I don't hear anything by tomorrow morning, I'll rewrite this and #16698 with those names. I don't think anyone objects to drain, but if there are objections (speak now or forever etc), evacuate is my second choice.
@mml I have no objections to drain.
I'm fine with drain, cordon, and uncordon.
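For reference, what eventually shipped as `kubectl cordon` amounts to setting `spec.unschedulable` on the Node object, with `kubectl drain` additionally evicting the pods already running there. A minimal client-go sketch of the cordon half (the node name `node-1` is made up; eviction is omitted):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// cordon marks a node unschedulable (the "yellow tape"); passing false
// uncordons it again. Draining would additionally evict the pods already
// running on the node, which this sketch does not do.
func cordon(ctx context.Context, c kubernetes.Interface, nodeName string, desired bool) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if node.Spec.Unschedulable == desired {
		return nil // already in the desired state
	}
	node.Spec.Unschedulable = desired
	_, err = c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := cordon(context.Background(), client, "node-1", true); err != nil {
		panic(err)
	}
	fmt.Println("node-1 cordoned")
}
```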
@mml is this ready to re-review?
No, I still have some comments to address, and it really needs a new revision to give more than lip service to some of these issues. We do not have a strong story for everything that has been raised (e.g. how complicated workflows would work), and I'm not sure we really need that unless we plan to implement this soon, so I wonder where I should stop. The unaddressed issues become fractal without a clear stopping point: the more I write, the more new things I uncover that aren't addressed.
I'd like to wrap this up. Reassigning to myself to take a look, since I'm also reviewing #18263.
Ah, this is mostly about SLOs. Ok, transferring back to @davidopp for now.
Yeah, there is no design or doc yet about using taints/tolerations in machine lifecycle events. I intentionally left that out of #18263 because I felt the dedicated node use case was sufficient justification for it and the amount of time to get a design doc approved seems to scale with the number of words in it. :-) @mml can write a separate doc about that once he has had time to think about it (and we get the taints/tolerations design doc merged).
Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist") If this message is too spammy, please complain to ixdy.
Automatic merge from submit-queue

[WIP/RFC] Rescheduling in Kubernetes design proposal

Proposal by @bgrant0607 and @davidopp (and inspired by years of discussion and experience from folks who worked on Borg and Omega). This doc is a proposal for a set of inter-related concepts related to "rescheduling" -- that is, "moving" an already-running pod to a new node in order to improve where it is running. (Specific concepts discussed are priority, preemption, disruption budget, quota, `/evict` subresource, and rescheduler.)

Feedback on the proposal is very welcome. For now, please stick to comments about the design, not spelling, punctuation, grammar, broken links, etc., so we can keep the doc uncluttered enough to make it easy for folks to comment on the more important things.

ref/ #22054 #18724 #19080 #12611 #20699 #17393 #12140 #22212

@HaiyangDING @mqliang @derekwaynecarr @kubernetes/sig-scheduling @kubernetes/huawei @timothysc @mml @dchen1107
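The "disruption budget" concept mentioned in that proposal eventually landed as the PodDisruptionBudget API, which drain-driven evictions respect; it maps closely to the "minimum shard strength" parameter discussed above. A minimal sketch of one that keeps at least 80% of matching pods available (the name `web-pdb` and the `app=web` selector are illustrative):

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	minAvailable := intstr.FromString("80%")
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "web-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			// Voluntary disruptions (like node drains) are refused whenever
			// they would drop the available pods below this threshold.
			MinAvailable: &minAvailable,
			Selector:     &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
		},
	}

	if _, err := client.PolicyV1().PodDisruptionBudgets("default").Create(
		context.Background(), pdb, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```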
GCE e2e build/test passed for commit 0e9aa1d.
This PR hasn't been active in 179 days. Feel free to reopen. You can add the 'keep-open' label to prevent this from happening again.