Update node-heartbeat KEP #846
Conversation
LGTM. @wojtek-t were you looking for anyone in particular to review this?
@bgrant0607 - it was requested by @spiffxp in #589 (comment)
keps/sig-node/0009-node-heartbeat.md
Outdated
@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proved scalability/performance gain. We expect significant decrease in total
Please note that this has been concerned (presumably via kubemark?).
sorry: confirmed
Both kubemark but also real clusters - also clarified in the text.
  etcd size (at least 2x) and no drop in any other scalability SLIs.

GA:
- Enabled by default for a release with no complaints.
Does that imply 1.16?
Given we're going to Beta in 1.14, I think we can do 1.15 still (assuming everything will be fine)?
To advance to GA in 1.15, the work to make this GA would have to be done shortly after shipping 1.14, by which time few people would have used it, and probably none in production scenarios.
If we have high confidence, it can GA in 1.15, but it's not as if it will be "enabled by default for a release" by the time that decision needs to be made.
This is better, and I won't necessarily block on these, but from the perspective of someone on the release team trying to consume these as a checklist, the question "is this good to go for release?" is still pretty hard to answer.
Having read more in depth I think I get the upgrade/downgrade stuff, but it would be nice if this were broken out into its own section, so that as a cluster operator I could quickly read the part relevant to me as I try to upgrade my cluster. AFAICT the gist is: if I want to keep the old behavior, I must explicitly opt out. By default, my controller manager will opt in, and then my nodes will opt in, if I'm following the appropriate upgrade procedure. If I have nodes that explicitly opt out, they will still work with a controller manager that has opted in. I will see reduced frequency of NodeStatus updates.
If I have node-problem-detector installed already, is there anything I need to do?
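(For an operator who wants to spot-check the new heartbeat path after an upgrade, here is a minimal sketch — not part of the KEP — assuming recent client-go and a reachable kubeconfig: it lists the node Leases and flags any that look stale.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// With the feature enabled, each node heartbeats via a Lease object
	// in the kube-node-lease namespace.
	leases, err := cs.CoordinationV1().Leases("kube-node-lease").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, l := range leases.Items {
		if l.Spec.RenewTime == nil {
			fmt.Printf("%s: lease never renewed\n", l.Name)
			continue
		}
		// The default renew interval is 10s; a lease much older than that
		// suggests the node is not heartbeating via Leases.
		if age := time.Since(l.Spec.RenewTime.Time); age > time.Minute {
			fmt.Printf("%s: stale lease, last renewed %v ago\n", l.Name, age)
		}
	}
}
```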
keps/sig-node/0009-node-heartbeat.md
Outdated
There is a set of dedicated end-to-end tests added for that feature exercising:
- whether the Lease object is being created and updated by Kubelet
- whether Kubelet is reducing the frequency of node status updates appropriately
- whether the Lease object is deleted on node deletion
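(As a rough sketch — not the actual e2e code, which lives in kubernetes/kubernetes — of what the second check above might look like with recent client-go; `cs` and `nodeName` are assumed placeholders supplied by the caller:)

```go
package heartbeatcheck

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyHeartbeat returns the LastHeartbeatTime of the node's Ready condition.
func readyHeartbeat(n *corev1.Node) time.Time {
	for _, c := range n.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.LastHeartbeatTime.Time
		}
	}
	return time.Time{}
}

// StatusHeartbeatIsReduced asserts that the node's status heartbeat holds
// still for a minute, which it should once the kubelet heartbeats via the
// Lease instead of via frequent NodeStatus updates.
func StatusHeartbeatIsReduced(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	before := readyHeartbeat(node)
	time.Sleep(time.Minute) // the old behavior updated status every ~10s
	node, err = cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !readyHeartbeat(node).Equal(before) {
		return fmt.Errorf("node %s status heartbeat changed within a minute; frequency not reduced", nodeName)
	}
	return nil
}
```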
From the perspective of the CI signal role on the release team:
- Where can I see these tests running?
Added job-names where those tests are running.
Additionally, if the feature gate is switched on, all existing test suites are
implicitly testing behavior of this feature, as this is then the signal for
healthiness of nodes.
Are there any test suites / testgrid dashboards that have this feature gate switched on that I can look at as CI signal?
OK, based on #589 (comment) the feature is switched on by default, so the place for CI signal to look is:
- existing release-blocking jobs
Good question.
There's kubemark:
https://testgrid.k8s.io/sig-scalability-kubemark#kubemark-5000
I don't see instructions for how to interpret the results in the kubemark guide though:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md
Since I was curious:
Example from latest run: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/1511/build-log.txt
Search for "Top latency metric".
W0221 01:26:34.024] I0221 01:26:34.024444 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:240.369ms Perc90:295.291ms Perc99:870.811ms} Count:323}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024572 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:44.882ms Perc90:51.798ms Perc99:318.453ms} Count:10128}; threshold: 5s
W0221 01:26:34.025] I0221 01:26:34.024584 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:cluster Latency:{Perc50:156µs Perc90:179.876ms Perc99:179.876ms} Count:3}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024592 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:{Perc50:1.862ms Perc90:8.634ms Perc99:146.831ms} Count:3273182}; threshold: 1s
W0221 01:26:34.026] I0221 01:26:34.024599 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:POST Scope:namespace Latency:{Perc50:1.492ms Perc90:3.788ms Perc99:124.283ms} Count:155016}; threshold: 1s
...
W0221 11:06:04.643] I0221 11:06:04.643028 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:243.395ms Perc90:345.626ms Perc99:1.178885s} Count:1701}; threshold: 10s
W0221 11:06:04.643] I0221 11:06:04.643171 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:LIST Scope:cluster Latency:{Perc50:101.436ms Perc90:129.024ms Perc99:405.32ms} Count:399}; threshold: 10s
W0221 11:06:04.644] I0221 11:06:04.643182 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:27.837ms Perc90:46.604ms Perc99:229.905ms} Count:46612}; threshold: 5s
W0221 11:06:04.644] I0221 11:06:04.643190 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.518ms Perc90:18.402ms Perc99:136.869ms} Count:334634}; threshold: 1s
W0221 11:06:04.644] I0221 11:06:04.643230 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:26.928ms Perc90:38.175ms Perc99:135.006ms} Count:8250}; threshold: 1s
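(To make those lines easier to read: each entry is a per-(resource, verb, scope) latency bucket with its p50/p90/p99 and the SLO threshold it is compared against. A toy re-implementation of that comparison — a sketch, not the test framework's actual api_responsiveness.go logic — using the "leases GET" line above:)

```go
package main

import (
	"fmt"
	"time"
)

// latencyMetric mirrors the fields printed in the log lines above.
type latencyMetric struct {
	Resource, Verb, Scope  string
	Perc50, Perc90, Perc99 time.Duration
	Count                  int
	Threshold              time.Duration
}

// violatesSLO reports whether the 99th percentile exceeds the threshold,
// which is what would fail the job.
func (m latencyMetric) violatesSLO() bool {
	return m.Perc99 > m.Threshold
}

func main() {
	// Values copied from the "leases GET" line in the log above.
	leaseGet := latencyMetric{
		Resource: "leases", Verb: "GET", Scope: "namespace",
		Perc50:    1862 * time.Microsecond,
		Perc90:    8634 * time.Microsecond,
		Perc99:    146831 * time.Microsecond,
		Count:     3273182,
		Threshold: time.Second,
	}
	fmt.Println("violates SLO:", leaseGet.violatesSLO()) // false: 146.831ms < 1s
}
```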
This is for observing no regression on SLI metrics + Pod startup time, e.g.:
W0222 09:18:49.232] I0222 09:18:49.196971 13101 pod_startup_latency.go:307] PodStartupLatency: labelSelector(group = latency): perc50: 2.588842476s, perc90: 3.365416811s, perc99: 4.421729689s; threshold: 5s
The test will fail if we exceed the SLO.
And for the gain, the max etcd db size, e.g.:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/315/artifacts/EtcdMetrics_density_2019-02-22T10:10:08Z.txt
(at the bottom).
I also extended the text.
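(For readers who would rather watch the etcd size directly than dig through job artifacts, a hypothetical spot check — the endpoint URL and the exact metric name are assumptions to adapt; older etcd versions expose the gauge under an etcd_debugging_ prefix:)

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed local etcd metrics endpoint; adjust host/port and TLS as needed.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Metric name is an assumption for recent etcd releases.
		if strings.HasPrefix(line, "etcd_mvcc_db_total_size_in_bytes") {
			fmt.Println(line) // e.g. "etcd_mvcc_db_total_size_in_bytes 1.234e+08"
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```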
keps/sig-node/0009-node-heartbeat.md
Outdated
@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proved scalability/performance gain. We expect significant decrease in total
  etcd size (at least 2x) and no drop in any other scalability SLIs.
How could a member of the release team verify this? Are there results you can point to?
There is a TODO above about higher QPS, should that be resolved?
Yes - it's verified. Removed the TODO.
Regarding release team verification: I don't expect this to be verified by the release team. It's SIG Scalability's responsibility (jointly with SIG Node), and that verification has already been done.
/lgtm
/assign @dchen1107 @derekwaynecarr
Force-pushed a3fbbe4 to 3419e4b
Force-pushed 3419e4b to 49e74c4
@spiffxp - PTAL
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, spiffxp, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
No description provided.