
Update node-heartbeat KEP #846

Merged: 1 commit into kubernetes:master from update_hartbeat_kep on Feb 25, 2019

Conversation

@wojtek-t (Member, Author):

No description provided.

k8s-ci-robot added the cncf-cla: yes label on Feb 20, 2019
k8s-ci-robot added the kind/kep, sig/architecture, sig/node, sig/pm, and size/S labels on Feb 20, 2019
@bgrant0607 (Member):

LGTM. @wojtek-t, were you looking for anyone in particular to review this?

@wojtek-t (Member, Author):

@bgrant0607 - it was requested by @spiffxp in #589 (comment)

@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proven scalability/performance gain. We expect a significant decrease in total
Member:

Please note that this has been concerned (presumably via kubemark?).

Member:

sorry: confirmed

Member Author:

Both kubemark and real clusters - also clarified in the text.

etcd size (at least 2x) and no drop in any other scalability SLIs.

GA:
- Enabled by default for a release with no complaints.
Member:

Does that imply 1.16?

Member Author:

Given we're going to Beta in 1.14, I think we can do 1.15 still (assuming everything will be fine)?

Member:

To advance to GA in 1.15, the work to make this GA would have to be done shortly after shipping 1.14, by which time few people would have used it, and probably none in production scenarios.

If we have high confidence, it can GA in 1.15, but it's not as if it will be "enabled by default for a release" by the time that decision needs to be made.

@spiffxp (Member) left a comment:

This is better, and I won't necessarily block on these, but from the perspective of someone on the release team trying to consume these as a checklist ("is this good to go for release?"), it's still pretty hard to answer.

Having read more in depth I think I get the upgrade/downgrade stuff, but it would be nice if this was broken out into its own section, so I could quickly, as a cluster operator, read the part relevant to me as I try to upgrade my cluster. AFAICT the gist is: if I want to keep the old behavior, I must explicitly opt-out. By default, my controller manager will opt-in, and then my nodes will opt-in, if I'm following the appropriate upgrade procedure. If I have nodes that explicitly opt-out, they will still work with a controller-manager that has opted in. I will see reduced frequency in NodeStatus updates.

If I have node-problem-detector installed already, is there anything I need to do?
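
For what it's worth, the opt-out described above is the feature gate itself, so keeping the old behavior would look roughly like this on both components (a sketch only; the NodeLease gate name comes from the implementation, not from this PR's diff):

  # Keep the pre-Lease heartbeat behavior. The gate is enabled by default
  # once the feature reaches Beta, so opting out has to be explicit.
  kubelet --feature-gates=NodeLease=false ...
  kube-controller-manager --feature-gates=NodeLease=false ...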

There is a set of dedicated end-to-end tests added for that feature exercising:
- whether the Lease object is being created and updated by Kubelet
- whether Kubelet is reducing the frequency of node status updates appropriately
- whether the Lease object is deleted on node deletion
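
The first of those checks can also be reproduced by hand. A minimal sketch using client-go (written against a current client-go, where the Lease type is served as coordination.k8s.io/v1; at the time of this PR it was still v1beta1):

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Usage: leasecheck <node-name>, with KUBECONFIG pointing at the cluster.
	nodeName := os.Args[1]

	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One Lease per node, named after the node, lives in the kube-node-lease
	// namespace; Kubelet renews spec.renewTime at its heartbeat interval.
	lease, err := client.CoordinationV1().Leases("kube-node-lease").
		Get(context.TODO(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	fmt.Printf("holder=%s leaseDurationSeconds=%d renewed %v ago\n",
		*lease.Spec.HolderIdentity,
		*lease.Spec.LeaseDurationSeconds,
		time.Since(lease.Spec.RenewTime.Time).Round(time.Second))
}

A Lease that keeps being renewed while NodeStatus updates slow down to the longer interval is exactly the behavior the first two tests assert.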
Member:

From the perspective of the CI signal role on the release team:

  • Where can I see these tests running?

Member Author:

Added job names where those tests are running.


Additionally, if the feature gate is switched on, all existing test suites are
implicitly testing the behavior of this feature, as it is then the signal for
node health.
Member:

Are there any test suites / testgrid dashboards that have this feature gate switched on that I can look at as CI signal?

Member:

OK, based on #589 (comment) the feature is switched on by default, so the place for CI signal to look is:

  • existing release-blocking jobs

Member:

Good question.

There's kubemark:
https://testgrid.k8s.io/sig-scalability-kubemark#kubemark-5000

I don't see instructions for how to interpret the results in the kubemark guide though:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-scalability/kubemark-guide.md

Member:

Since I was curious:
Example from latest run: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/1511/build-log.txt
Search for "Top latency metric".

W0221 01:26:34.024] I0221 01:26:34.024444   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:240.369ms Perc90:295.291ms Perc99:870.811ms} Count:323}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024572   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:44.882ms Perc90:51.798ms Perc99:318.453ms} Count:10128}; threshold: 5s
W0221 01:26:34.025] I0221 01:26:34.024584   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:cluster Latency:{Perc50:156µs Perc90:179.876ms Perc99:179.876ms} Count:3}; threshold: 10s
W0221 01:26:34.025] I0221 01:26:34.024592   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:leases Subresource: Verb:GET Scope:namespace Latency:{Perc50:1.862ms Perc90:8.634ms Perc99:146.831ms} Count:3273182}; threshold: 1s
W0221 01:26:34.026] I0221 01:26:34.024599   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:POST Scope:namespace Latency:{Perc50:1.492ms Perc90:3.788ms Perc99:124.283ms} Count:155016}; threshold: 1s
...
W0221 11:06:04.643] I0221 11:06:04.643028   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:nodes Subresource: Verb:LIST Scope:cluster Latency:{Perc50:243.395ms Perc90:345.626ms Perc99:1.178885s} Count:1701}; threshold: 10s
W0221 11:06:04.643] I0221 11:06:04.643171   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:LIST Scope:cluster Latency:{Perc50:101.436ms Perc90:129.024ms Perc99:405.32ms} Count:399}; threshold: 10s
W0221 11:06:04.644] I0221 11:06:04.643182   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:LIST Scope:namespace Latency:{Perc50:27.837ms Perc90:46.604ms Perc99:229.905ms} Count:46612}; threshold: 5s
W0221 11:06:04.644] I0221 11:06:04.643190   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:pods Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:4.518ms Perc90:18.402ms Perc99:136.869ms} Count:334634}; threshold: 1s
W0221 11:06:04.644] I0221 11:06:04.643230   22221 api_responsiveness.go:135] APIResponsiveness: Top latency metric: {Resource:services Subresource: Verb:DELETE Scope:namespace Latency:{Perc50:26.928ms Perc90:38.175ms Perc99:135.006ms} Count:8250}; threshold: 1s
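
(For anyone following along: a throwaway sketch for pulling those lines out of a downloaded build-log.txt; this is not part of the test harness, just local filtering in Go:)

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	// Usage: toplatency <path-to-build-log.txt>
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // build-log lines can be long
	for sc.Scan() {
		if strings.Contains(sc.Text(), "Top latency metric") {
			fmt.Println(sc.Text())
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}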

@wojtek-t (Member, Author) commented Feb 25, 2019:

This is for observing no regression on SLI metrics + pod startup time:

W0222 09:18:49.232] I0222 09:18:49.196971   13101 pod_startup_latency.go:307] PodStartupLatency: labelSelector(group = latency): perc50: 2.588842476s, perc90: 3.365416811s, perc99: 4.421729689s; threshold: 5s

But the test will fail if we exceed the SLO.

And for the gain in max etcd db size, see e.g.:
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/315/artifacts/EtcdMetrics_density_2019-02-22T10:10:08Z.txt
(at the bottom).

I also extended the text.

@@ -243,9 +255,19 @@ The changes in components logic (Kubelet, NodeController) should be done behind
a feature gate. We suggest making that enabled by default once the feature is
implemented.

Beta:
- Proven scalability/performance gain. We expect a significant decrease in total
etcd size (at least 2x) and no drop in any other scalability SLIs.
Member:

How could a member of the release team verify this? Are there results you can point to?

Member:

There is a TODO above about higher QPS, should that be resolved?

Member Author:

Yes - it's verified. Removed the TODO.

Regarding release team verification - I don't expect this to be verified by the release team.
It's SIG Scalability's responsibility (jointly with SIG Node), and we have already done that.

@spiffxp (Member) commented Feb 21, 2019:

/lgtm
/hold
if you want to address the above questions/comments

/assign @dchen1107 @derekwaynecarr
Needs an /approve from someone in sig-node-leads

k8s-ci-robot added the lgtm and do-not-merge/hold labels on Feb 21, 2019
wojtek-t force-pushed the update_hartbeat_kep branch from a3fbbe4 to 3419e4b on February 25, 2019 at 14:31
k8s-ci-robot added the size/M label and removed the lgtm and size/S labels on Feb 25, 2019
wojtek-t force-pushed the update_hartbeat_kep branch from 3419e4b to 49e74c4 on February 25, 2019 at 14:36
@wojtek-t (Member, Author):

@spiffxp - PTAL

@spiffxp (Member) commented Feb 25, 2019:

/lgtm
Needs approval from one of @dchen1107 or @derekwaynecarr

k8s-ci-robot added the lgtm label on Feb 25, 2019
@dchen1107 (Member):

/lgtm
/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, spiffxp, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the approved label on Feb 25, 2019
@spiffxp (Member) commented Feb 25, 2019:

/hold cancel

k8s-ci-robot removed the do-not-merge/hold label on Feb 25, 2019
k8s-ci-robot merged commit 021046f into kubernetes:master on Feb 25, 2019
wojtek-t deleted the update_hartbeat_kep branch on October 11, 2019 at 09:30