Expose flags for new NodeEviction logic in NodeController #30138
Conversation
@@ -173,5 +174,10 @@ func (s *CMServer) AddFlags(fs *pflag.FlagSet) {
 	fs.DurationVar(&s.ControllerStartInterval.Duration, "controller-start-interval", s.ControllerStartInterval.Duration, "Interval between starting controller managers.")
 	fs.BoolVar(&s.EnableGarbageCollector, "enable-garbage-collector", s.EnableGarbageCollector, "Enables the generic garbage collector. MUST be synced with the corresponding flag of the kube-apiserver. WARNING: the generic garbage collector is an alpha feature.")
 	fs.Int32Var(&s.ConcurrentGCSyncs, "concurrent-gc-syncs", s.ConcurrentGCSyncs, "The number of garbage collector workers that are allowed to sync concurrently.")
+	fs.Float32Var(&s.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy.")
s/healthy./healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters./
Done.
@davidopp PTAL
ping @davidopp
Super-busy right now, will review within the next couple of days.
(Branch updated from d8892bb to f94b2fa.)
@@ -173,5 +174,10 @@ func (s *CMServer) AddFlags(fs *pflag.FlagSet) {
 	fs.DurationVar(&s.ControllerStartInterval.Duration, "controller-start-interval", s.ControllerStartInterval.Duration, "Interval between starting controller managers.")
 	fs.BoolVar(&s.EnableGarbageCollector, "enable-garbage-collector", s.EnableGarbageCollector, "Enables the generic garbage collector. MUST be synced with the corresponding flag of the kube-apiserver. WARNING: the generic garbage collector is an alpha feature.")
 	fs.Int32Var(&s.ConcurrentGCSyncs, "concurrent-gc-syncs", s.ConcurrentGCSyncs, "The number of garbage collector workers that are allowed to sync concurrently.")
+	fs.Float32Var(&s.NodeEvictionRate, "node-eviction-rate", 0.1, "Number of nodes per second on which pods are deleted in case of node failure when a zone is healthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters.")
+	fs.Float32Var(&s.SecondaryNodeEvictionRate, "secondary-node-eviction-rate", 0.01, "Number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy (see --unhealthy-zone-threshold for definition of healthy/unhealthy). Zone refers to entire cluster in non-multizone clusters. This value is implicitly overridden to 0 if the cluster size is smaller than --large-cluster-size-threshold.")
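Editor's note: for readers skimming the thread, here is a minimal sketch of how these flags interact according to the help text above. This is not the controller's actual code; the names `evictionConfig` and `effectiveRate` are invented for illustration, and the values 0.55 and 50 are the defaults settled on later in this discussion.

```go
package main

import "fmt"

// Hypothetical config mirroring the new flags; identifiers here are
// illustrative only, not the controller's actual names.
type evictionConfig struct {
	NodeEvictionRate          float32 // --node-eviction-rate (default 0.1)
	SecondaryNodeEvictionRate float32 // --secondary-node-eviction-rate (default 0.01)
	UnhealthyZoneThreshold    float32 // --unhealthy-zone-threshold
	LargeClusterSizeThreshold int     // --large-cluster-size-threshold
}

// effectiveRate picks the per-second eviction rate for a zone, following
// the semantics spelled out in the flag help text above.
func effectiveRate(c evictionConfig, zoneSize, notReady int) float32 {
	if float32(notReady)/float32(zoneSize) < c.UnhealthyZoneThreshold {
		return c.NodeEvictionRate // healthy zone: full rate
	}
	if zoneSize < c.LargeClusterSizeThreshold {
		return 0 // small unhealthy zone: evictions are stopped entirely
	}
	return c.SecondaryNodeEvictionRate // large unhealthy zone: slow down
}

func main() {
	c := evictionConfig{0.1, 0.01, 0.55, 50}
	fmt.Println(effectiveRate(c, 100, 10)) // healthy -> 0.1
	fmt.Println(effectiveRate(c, 100, 60)) // large and unhealthy -> 0.01
	fmt.Println(effectiveRate(c, 20, 15))  // small and unhealthy -> 0
}
```

The sketch only covers rate selection; the controller applies whichever rate is chosen through its rate-limited eviction queue.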
I agree with Daniel that 0.01 is probably too large, especially if --large-cluster-size-threshold is 20.
I think maybe we should change the default for --unhealthy-zone-threshold to 0.55 (this is a rough approximation for "can't reschedule anywhere" if the pods are evenly spread), and change --secondary-node-eviction-rate to 0.002.
We should also mention this new default behavior very prominently in the release notes so we don't get a ton of questions from people about why their pods aren't being evicted.
Actually, it would be great to add some kind of NodeCondition (generated by node controller) that indicates when evictions from that node are being rate-limited due to this new feature. Borg has something like this and it's really helpful for understandability. I would say it's not critical if you can't finish it by Friday (certainly I would not suggest to put it in this PR).
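Editor's note: to make the suggested condition concrete, here is a hypothetical shape it could take, using today's k8s.io/api/core/v1 types (the tree layout differed at the time of this PR). The condition type "EvictionsRateLimited" is invented here and does not exist in the Kubernetes API.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// "EvictionsRateLimited" is a made-up condition type for illustration;
	// no such NodeCondition exists in the API.
	cond := v1.NodeCondition{
		Type:    v1.NodeConditionType("EvictionsRateLimited"),
		Status:  v1.ConditionTrue,
		Reason:  "ZoneUnhealthy",
		Message: "pod evictions from this node are rate-limited by the node controller",
	}
	fmt.Printf("%s=%s: %s\n", cond.Type, cond.Status, cond.Message)
}
```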
0.002 means that we'll evict ~7 Nodes an hour - I think it's too slow.
For 20 Nodes with 1 Node per 100s it will take over 30 minutes to evict the whole cluster. If you think that's too little, I'd rather increase the large-cluster-threshold to 40 (it'll give an hour for full eviction) than decrease the rate drastically. This is actually a reason why I wanted to have 'time-to-drain' as a second flag here... @davidopp PTAL.
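Editor's note: for reference, the arithmetic behind the numbers quoted in this exchange, as a standalone sketch (nothing here is controller code):

```go
package main

import "fmt"

// drainMinutes: how long evicting a whole zone takes at a given
// eviction rate (nodes per second).
func drainMinutes(nodes int, rate float64) float64 {
	return float64(nodes) / rate / 60
}

func main() {
	fmt.Printf("%.0f min\n", drainMinutes(20, 0.01)) // ~33 min ("over 30 minutes")
	fmt.Printf("%.0f min\n", drainMinutes(40, 0.01)) // ~67 min ("an hour for full eviction")
	fmt.Printf("%.1f nodes/hour\n", 0.002*3600)      // 7.2 ("~7 Nodes an hour")
}
```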
I don't know... I mean, you probably never really want to evict the whole cluster, so 30 minutes is probably too short... I'm trying to think in terms of an ops team with a pager, what is a reasonable SLO for them to respond and stop the bleeding... But yeah, if you want to change large-cluster-threshold to 40 (or probably 50 would be better) instead of changing the rate, I think it's OK.
Made it 50.
I made a few small comments. Please add LGTM after you've made the changes. Do you think the tests are adequate or should there be more? Please file an issue and assign to yourself for the node condition I suggested in the comments. It would be great to get that in 1.4 but not the end of the world if you can't.
BTW what is the PodStatus/ContainerStatus for pods/containers on nodes that are NotReady but not being evicted? (I didn't review your other PRs for this and am too lazy or busy, depending on one's perspective, at the moment to check.)
I didn't change the logic there, so Pods are marked as NotReady, but containers are left as-is (it has been this way for as long as I can remember). I can send another PR that changes this behavior, but I don't want to do it in this PR.
That behavior sounds fine, actually.
I'm trying to check why the integration test failed...
LGTM
Self-applying LGTM label per #30138 (comment)
GCE e2e build/test passed for commit 4cf698e.
Automatic merge from submit-queue
Fix #28832
Last PR from the NodeController NodeEviction logic series.
cc @davidopp @lavalamp @mml