Update node-heartbeat KEP #846
Conversation
LGTM. @wojtek-t were you looking for anyone in particular to review this?
@bgrant0607 - it was requested by @spiffxp in #589 (comment)
keps/sig-node/0009-node-heartbeat.md
Outdated
@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proved scalability/performance gain. We expect significant decrease in total
Please note that this has been concerned (presumably via kubemark?).
sorry: confirmed
Both kubemark but also real clusters - also clarified in the text.
  etcd size (at least 2x) and no drop in any other scalability SLIs.

GA:
- Enabled by default for a release with no complaints.
Does that imply 1.16?
Given we're going to Beta in 1.14, I think we can do 1.15 still (assuming everything will be fine)?
To advance to GA in 1.15, the work to make this GA would have to be done shortly after shipping 1.14, by which time few people would have used it, and probably none in production scenarios.
If we have high confidence, it can GA in 1.15, but it's not as if it will be "enabled by default for a release" by the time that decision needs to be made.
This is better, and I won't necessarily block on these, but from the perspective of someone on the release team trying to consume these as a checklist, the question "is this good to go for release?" is still pretty hard to answer.
Having read more in depth I think I get the upgrade/downgrade stuff, but it would be nice if this were broken out into its own section, so that as a cluster operator I could quickly read the part relevant to me as I try to upgrade my cluster. AFAICT the gist is: if I want to keep the old behavior, I must explicitly opt out. By default, my controller manager will opt in, and then my nodes will opt in, if I'm following the appropriate upgrade procedure. If I have nodes that explicitly opt out, they will still work with a controller manager that has opted in. I will see reduced frequency of NodeStatus updates.
If I have node-problem-detector installed already, is there anything I need to do?
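(For an operator who wants to spot-check the new heartbeat path after an upgrade, here is a minimal sketch — not part of the KEP — assuming recent client-go and a reachable kubeconfig: it lists the node Leases and flags any that look stale.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// With the feature enabled, each node heartbeats via a Lease object
	// in the kube-node-lease namespace.
	leases, err := cs.CoordinationV1().Leases("kube-node-lease").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, l := range leases.Items {
		if l.Spec.RenewTime == nil {
			fmt.Printf("%s: lease never renewed\n", l.Name)
			continue
		}
		// The default renew interval is 10s; a lease much older than that
		// suggests the node is not heartbeating via Leases.
		if age := time.Since(l.Spec.RenewTime.Time); age > time.Minute {
			fmt.Printf("%s: stale lease, last renewed %v ago\n", l.Name, age)
		}
	}
}
```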
keps/sig-node/0009-node-heartbeat.md
Outdated
There is a set of dedicated end-to-end tests added for that feature exercising:
- whether the Lease object is being created and updated by Kubelet
- whether Kubelet is reducing the frequency of node status updates appropriately
- whether the Lease object is deleted on node deletion
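(As a rough sketch — not the actual e2e code, which lives in kubernetes/kubernetes — of what the second check above might look like with recent client-go; `cs` and `nodeName` are assumed placeholders supplied by the caller:)

```go
package heartbeatcheck

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyHeartbeat returns the LastHeartbeatTime of the node's Ready condition.
func readyHeartbeat(n *corev1.Node) time.Time {
	for _, c := range n.Status.Conditions {
		if c.Type == corev1.NodeReady {
			return c.LastHeartbeatTime.Time
		}
	}
	return time.Time{}
}

// StatusHeartbeatIsReduced asserts that the node's status heartbeat holds
// still for a minute, which it should once the kubelet heartbeats via the
// Lease instead of via frequent NodeStatus updates.
func StatusHeartbeatIsReduced(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	before := readyHeartbeat(node)
	time.Sleep(time.Minute) // the old behavior updated status every ~10s
	node, err = cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !readyHeartbeat(node).Equal(before) {
		return fmt.Errorf("node %s status heartbeat changed within a minute; frequency not reduced", nodeName)
	}
	return nil
}
```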
From the perspective of the CI signal role on the release team:
- Where can I see these tests running?
Added job-names where those tests are running.
Additionally, if the feature gate is switched on, all existing test suites are
implicitly testing behavior of this feature, as this is then the signal for
healthiness of nodes.
Are there any test suites / testgrid dashboards that have this feature gate switched on that I can look at as CI signal?
OK, based on #589 (comment) the feature is switched on by default, so the place for CI signal to look is:
- existing release-blocking jobs
Good question.
There's kubemark:
https://testgrid.k8s.io/sig-scalability-kubemark#kubemark-5000
I don't see instructions for how to interpret the results in the kubemark guide though:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md
Since I was curious:
Example from latest run: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/1511/build-log.txt
Search for "Top latency metric".
W0221 01:26:34.024] I0221 01:26:34.024444 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:240.369ms Perc90:295.291ms Perc99:870.811ms} Count:323}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024572 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:44.882ms Perc90:51.798ms Perc99:318.453ms} Count:10128}; threshold: 5s
W0221 01:26:34.025] I0221 01:26:34.024584 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:cluster Latency:{Perc50:156µs Perc90:179.876ms Perc99:179.876ms} Count:3}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024592 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:{Perc50:1.862ms Perc90:8.634ms Perc99:146.831ms} Count:3273182}; threshold: 1s
W0221 01:26:34.026] I0221 01:26:34.024599 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:POST Scope:namespace Latency:{Perc50:1.492ms Perc90:3.788ms Perc99:124.283ms} Count:155016}; threshold: 1s
...
W0221 11:06:04.643] I0221 11:06:04.643028 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:243.395ms Perc90:345.626ms Perc99:1.178885s} Count:1701}; threshold: 10s
W0221 11:06:04.643] I0221 11:06:04.643171 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:LIST Scope:cluster Latency:{Perc50:101.436ms Perc90:129.024ms Perc99:405.32ms} Count:399}; threshold: 10s
W0221 11:06:04.644] I0221 11:06:04.643182 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:27.837ms Perc90:46.604ms Perc99:229.905ms} Count:46612}; threshold: 5s
W0221 11:06:04.644] I0221 11:06:04.643190 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.518ms Perc90:18.402ms Perc99:136.869ms} Count:334634}; threshold: 1s
W0221 11:06:04.644] I0221 11:06:04.643230 22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:26.928ms Perc90:38.175ms Perc99:135.006ms} Count:8250}; threshold: 1s
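(To make those lines easier to read: each entry is a per-(resource, verb, scope) latency bucket with its p50/p90/p99 and the SLO threshold it is compared against. A toy re-implementation of that comparison — a sketch, not the test framework's actual api_responsiveness.go logic — using the "leases GET" line above:)

```go
package main

import (
	"fmt"
	"time"
)

// latencyMetric mirrors the fields printed in the log lines above.
type latencyMetric struct {
	Resource, Verb, Scope  string
	Perc50, Perc90, Perc99 time.Duration
	Count                  int
	Threshold              time.Duration
}

// violatesSLO reports whether the 99th percentile exceeds the threshold,
// which is what would fail the job.
func (m latencyMetric) violatesSLO() bool {
	return m.Perc99 > m.Threshold
}

func main() {
	// Values copied from the "leases GET" line in the log above.
	leaseGet := latencyMetric{
		Resource: "leases", Verb: "GET", Scope: "namespace",
		Perc50:    1862 * time.Microsecond,
		Perc90:    8634 * time.Microsecond,
		Perc99:    146831 * time.Microsecond,
		Count:     3273182,
		Threshold: time.Second,
	}
	fmt.Println("violates SLO:", leaseGet.violatesSLO()) // false: 146.831ms < 1s
}
```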
This is for observing no regression on SLI metrics + Pod startup time, e.g.:
W0222 09:18:49.232] I0222 09:18:49.196971 13101 pod_startup_latency.go:307] PodStartupLatency: labelSelector(group = latency): perc50: 2.588842476s, perc90: 3.365416811s, perc99: 4.421729689s; threshold: 5s
The test will fail if we exceed the SLO.
And for the gain, the max etcd db size, e.g.:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/315/artifacts/EtcdMetrics_density_2019-02-22T10:10:08Z.txt
(at the bottom).
I also extended the text.
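(For readers who would rather watch the etcd size directly than dig through job artifacts, a hypothetical spot check — the endpoint URL and the exact metric name are assumptions to adapt; older etcd versions expose the gauge under an etcd_debugging_ prefix:)

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Assumed local etcd metrics endpoint; adjust host/port and TLS as needed.
	resp, err := http.Get("http://127.0.0.1:2379/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := sc.Text()
		// Metric name is an assumption for recent etcd releases.
		if strings.HasPrefix(line, "etcd_mvcc_db_total_size_in_bytes") {
			fmt.Println(line) // e.g. "etcd_mvcc_db_total_size_in_bytes 1.234e+08"
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
}
```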
keps/sig-node/0009-node-heartbeat.md
Outdated
@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proved scalability/performance gain. We expect significant decrease in total
  etcd size (at least 2x) and no drop in any other scalability SLIs.
How could a member of the release team verify this? Are there results you can point to?
There is a TODO above about higher QPS, should that be resolved?
Yes - it's verified. Removed the TODO.
Regarding release team verification: I don't expect this to be verified by the release team. It's SIG Scalability's responsibility (jointly with SIG Node), and that verification has already been done.
/lgtm
/assign @dchen1107 @derekwaynecarr
Force-pushed a3fbbe4 to 3419e4b
Force-pushed 3419e4b to 49e74c4
@spiffxp - PTAL
/lgtm
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dchen1107, spiffxp, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
No description provided.