Emit event and retry when fail to start healthz server on kube-proxy #49267

MrHohn · 2017-07-20T06:42:19Z

What this PR does / why we need it: Improves kube-proxy's handling of a failure to start the healthz server.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): From #49263.

Special notes for your reviewer:
/assign @thockin @nicksardo @bowei

Release note:

kube-proxy will emit a "FailedToStartNodeHealthcheck" event when it fails to start the healthz server.

dixudx · 2017-07-20T07:33:27Z

/test pull-kubernetes-kubemark-e2e-gce

thockin · 2017-07-20T16:45:19Z

/lgtm
/approve

MrHohn · 2017-07-20T17:30:08Z

Some other component is occupying port 10256, so kube-proxy keeps restarting. kube-proxy logs from kubemark CI:

Flag --resource-container has been deprecated, This feature will be removed in a later release.
I0720 07:56:04.089459       8 feature_gate.go:144] feature gates: map[ExperimentalCriticalPodAnnotation:true AdvancedAuditing:true]
W0720 07:56:04.089543       8 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I0720 07:56:04.101823       8 iptables.go:564] couldn't get iptables-restore version; assuming it doesn't support --wait
I0720 07:56:04.101929       8 iptables.go:192] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
I0720 07:56:04.104894       8 server.go:478] Using iptables Proxier.
W0720 07:56:04.106620       8 server.go:790] Failed to retrieve node info: Get https://35.192.72.118/api/v1/nodes/e2e-40722-minion-group-29qh: dial tcp 35.192.72.118:443: getsockopt: connection refused
W0720 07:56:04.106706       8 proxier.go:483] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I0720 07:56:04.106736       8 proxier.go:518] minSyncPeriod: 10s, syncPeriod: 1m0s, burstSyncs: 2
I0720 07:56:04.106891       8 server.go:513] Tearing down userspace rules.
I0720 07:56:04.118816       8 server.go:621] setting OOM scores is unsupported in this build
E0720 07:56:04.119026       8 healthcheck.go:303] Failed to start healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
F0720 07:56:04.119063       8 helpers.go:119] error: listen tcp 0.0.0.0:10256: bind: address already in use
Flag --resource-container has been deprecated, This feature will be removed in a later release.
I0720 07:56:05.190499       8 feature_gate.go:144] feature gates: map[ExperimentalCriticalPodAnnotation:true AdvancedAuditing:true]
W0720 07:56:05.190559       8 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I0720 07:56:05.193049       8 iptables.go:564] couldn't get iptables-restore version; assuming it doesn't support --wait
I0720 07:56:05.193120       8 iptables.go:192] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
I0720 07:56:05.195908       8 server.go:478] Using iptables Proxier.
W0720 07:56:05.197214       8 server.go:790] Failed to retrieve node info: Get https://35.192.72.118/api/v1/nodes/e2e-40722-minion-group-29qh: dial tcp 35.192.72.118:443: getsockopt: connection refused
W0720 07:56:05.197286       8 proxier.go:483] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I0720 07:56:05.197311       8 proxier.go:518] minSyncPeriod: 10s, syncPeriod: 1m0s, burstSyncs: 2
I0720 07:56:05.197450       8 server.go:513] Tearing down userspace rules.
I0720 07:56:05.207668       8 server.go:621] setting OOM scores is unsupported in this build
E0720 07:56:05.207868       8 healthcheck.go:303] Failed to start healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

Will need to fix that before this could go in.
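
The fatal error in the log above is a plain bind conflict on the healthz port. As a minimal sketch (not taken from this PR), the Go program below shows one way to recognise that condition on Unix-like systems by checking for `syscall.EADDRINUSE`. The helper name `portConflict` and the use of an ephemeral loopback port are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
)

// portConflict (hypothetical helper) binds an ephemeral loopback port,
// tries to bind the same address a second time, and reports whether the
// second attempt failed with "address already in use".
func portConflict() (bool, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err == nil {
		// Unexpected: the second bind succeeded, so there was no conflict.
		second.Close()
		return false, nil
	}
	// errors.Is unwraps *net.OpError and *os.SyscallError down to the errno.
	return errors.Is(err, syscall.EADDRINUSE), nil
}

func main() {
	inUse, err := portConflict()
	fmt.Printf("address already in use: %v (setup error: %v)\n", inUse, err)
}
```

On Linux the second bind is expected to fail with the same "bind: address already in use" text that appears in the log; other platforms may report a different errno.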

yujuhong · 2017-07-20T17:56:40Z

> Will need to fix that before this could go in.

This cannot go in until the NPD's port is changed in the COS config...

MrHohn · 2017-07-20T18:53:10Z

> This cannot go in until the NPD's port is changed in the COS config...

Yep, Ajit sent a fix for that: #49316.

I also dug into the logs and confirmed that the "other component" I mentioned above is node-problem-detector. cc @bowei

yujuhong · 2017-07-20T18:56:33Z

Didn't look closely but you may also need to change the hollow node config used by kubemark https://github.com/kubernetes/kubernetes/blob/v1.7.1/test/kubemark/resources/hollow-node_template.yaml#L103

bowei

An alternative is a rate-limited loop that retries binding the healthz port and emits events. It seems bad to have kube-proxy die for this error?

bowei · 2017-07-20T23:43:25Z

pkg/proxy/healthcheck/healthcheck.go

 	}
-	go func() {
+	go wait.Until(func() {
 		glog.V(3).Infof("Starting goroutine for healthz on %s", hs.addr)
 		if err := server.Serve(listener); err != nil {
 			glog.Errorf("Healhz closed: %v", err)


Healthz is misspelled.

MrHohn · 2017-07-21T00:11:54Z

cc @dnardo

dixudx · 2017-07-21T02:15:55Z

pkg/proxy/healthcheck/healthcheck.go

 	}
-	go func() {
+	go wait.Until(func() {
 		glog.V(3).Infof("Starting goroutine for healthz on %s", hs.addr)
 		if err := server.Serve(listener); err != nil {
 			glog.Errorf("Healhz closed: %v", err)
 			return
 		}
 		glog.Errorf("Unexpected healhz closed.")


Also misspelled here.

s/healhz/healthz
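
For context on the diff above: `wait.Until` (from `k8s.io/apimachinery/pkg/util/wait`) calls a function repeatedly, pausing for a period between calls, until a stop channel is closed. Wrapping `server.Serve` in it therefore restarts serving whenever `Serve` returns. A rough stdlib-only approximation of that loop might look like the following. The names `until` and `countRuns` are hypothetical, and this is a sketch of the behaviour rather than the library's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// until (hypothetical) runs f, waits for period, and repeats until stop is
// closed. It approximates the behaviour of wait.Until using only the stdlib.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Do not start another run once a stop has been requested.
		select {
		case <-stop:
			return
		default:
		}

		f()

		// Wait out the period, but return early if stop is closed.
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// countRuns simulates a function that returns immediately (as Serve does
// when it fails) and closes the stop channel after n runs.
func countRuns(n int) int {
	stop := make(chan struct{})
	runs := 0
	until(func() {
		runs++
		if runs == n {
			close(stop)
		}
	}, time.Millisecond, stop)
	return runs
}

func main() {
	fmt.Println(countRuns(3)) // prints 3
}
```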

fejta · 2017-07-27T09:01:35Z

/retest
Automatic retester of approved PRs with test failures, please send feedback to fejta

fejta-bot · 2017-07-27T23:47:05Z

/retest
Automated flake /retester experiment. Please send feedback to fejta

MrHohn · 2017-07-28T02:44:23Z

> An alternative is a rate-limited loop that retries binding the healthz port and emits events. It seems bad to have kube-proxy die for this error?

Per discussion offline, we believe this is the better solution. The code has been revised accordingly.
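
A minimal sketch of the retry-and-emit approach described above, assuming a fixed retry interval and a hypothetical `eventFunc` callback standing in for the Kubernetes event recorder. It is not the code merged in this PR; the names `listenWithRetry` and `demo` are invented for illustration. The event reason and message format are copied from the release note and log lines in this thread.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// eventFunc is a hypothetical stand-in for an event recorder.
type eventFunc func(reason, message string)

// listenWithRetry tries to bind addr. On each failure it emits an event and
// sleeps for interval before retrying, instead of exiting the process.
// maxAttempts <= 0 means retry forever. It returns the listener and the
// number of attempts made.
func listenWithRetry(addr string, interval time.Duration, maxAttempts int, emit eventFunc) (net.Listener, int, error) {
	var lastErr error
	for attempt := 1; maxAttempts <= 0 || attempt <= maxAttempts; attempt++ {
		l, err := net.Listen("tcp", addr)
		if err == nil {
			return l, attempt, nil
		}
		lastErr = err
		emit("FailedToStartNodeHealthcheck",
			fmt.Sprintf("Failed to start node healthz on %s: %v", addr, err))
		time.Sleep(interval)
	}
	return nil, maxAttempts, lastErr
}

// demo occupies a loopback port for a short time so that the first bind
// attempt fails, then reports how many attempts and events it took to bind.
func demo() (int, int, error) {
	holder, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, 0, err
	}
	addr := holder.Addr().String()
	go func() {
		time.Sleep(50 * time.Millisecond)
		holder.Close() // the conflicting component releases the port
	}()

	events := 0
	l, attempts, err := listenWithRetry(addr, 20*time.Millisecond, 50,
		func(reason, message string) { events++ })
	if err != nil {
		return attempts, events, err
	}
	l.Close()
	return attempts, events, nil
}

func main() {
	attempts, events, err := demo()
	fmt.Printf("attempts=%d events=%d err=%v\n", attempts, events, err)
}
```

In this sketch every failed attempt produces exactly one event, so the process stays alive and the conflict remains visible through events until the port frees up.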

MrHohn · 2017-07-28T17:32:15Z

Sample event:

38s        38s         1         e2e-test-zihongz-minion-group-v02k   Node                  Warning   FailedToStartNodeHealthcheck   kube-proxy, e2e-test-zihongz-minion-group-v02k   Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

thockin · 2017-08-01T16:46:36Z

/lgtm

k8s-github-robot · 2017-08-01T16:47:19Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrHohn, thockin

Associated issue: 49263

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~cmd/OWNERS~~ [thockin]
~~pkg/proxy/OWNERS~~ [thockin]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

MrHohn · 2017-08-01T18:39:27Z

/retest

yujuhong · 2017-08-01T18:47:19Z

> > An alternative is a rate-limited loop that retries binding the healthz port and emits events. It seems bad to have kube-proxy die for this error?
>
> Per discussion offline, we believe this is the better solution. The code has been revised accordingly.

Would you care to explain why? :-)

kube-proxy would appear to be running and ready, which seems very misleading. Even though events could indicate problems, it's probably not the best mechanism to express the status of a system component...

MrHohn · 2017-08-01T19:02:32Z

> Would you care to explain why? :-)

The main idea is to keep kube-proxy running when 99% of its functionality is working.

> kube-proxy would appear to be running and ready, which seems very misleading. Even though events could indicate problems, it's probably not the best mechanism to express the status of a system component...

Agreed that kube-proxy shouldn't appear to be ready in this case. I will soon send a separate PR to add a readiness probe to kube-proxy.

MrHohn · 2017-08-02T02:50:01Z

/retest

k8s-github-robot · 2017-08-02T12:15:59Z

Automatic merge from submit-queue (batch tested with PRs 49992, 48861, 49267, 49356, 49886)

MrHohn · 2017-08-03T20:36:55Z

/release-note

k8s-ci-robot assigned bowei, nicksardo and thockin Jul 20, 2017

k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Jul 20, 2017

k8s-github-robot added the size/S (denotes a PR that changes 10-29 lines, ignoring generated files) and release-note-none (denotes a PR that doesn't merit a release note) labels Jul 20, 2017

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Jul 20, 2017

k8s-github-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jul 20, 2017

bowei reviewed Jul 20, 2017

dixudx reviewed Jul 21, 2017

dcbw added the area/kube-proxy label Jul 27, 2017

MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 359cf90 to 934d8d3 July 28, 2017 01:25

MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 934d8d3 to 677b2c5 July 28, 2017 01:30

MrHohn changed the title ~~Abort kube-proxy when fail to start healthz server.~~ Emit event and retry when fail to start healthz server on kube-proxy Jul 28, 2017

k8s-github-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Jul 28, 2017

MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 677b2c5 to dbb5a3e July 28, 2017 02:42

Emit event and retry when fail to start healthz server on kube-proxy.

db379de

MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from dbb5a3e to db379de July 28, 2017 03:55

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Aug 1, 2017

k8s-github-robot merged commit acc00af into kubernetes:master Aug 2, 2017

k8s-ci-robot added the release-note label (denotes a PR that will be considered when it comes time to generate release notes) and removed the release-note-none label (denotes a PR that doesn't merit a release note) Aug 3, 2017

MrHohn deleted the kube-proxy-abort-when-healthz-fails branch August 4, 2017 00:03

MrHohn mentioned this pull request Aug 4, 2017

Add livenessProbe to kube-proxy templates #50118

Closed
