Emit event and retry when fail to start healthz server on kube-proxy #49267


MrHohn
Member

@MrHohn MrHohn commented Jul 20, 2017

What this PR does / why we need it: Improves kube-proxy's handling of a failure to start the healthz server.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): From #49263.

Special notes for your reviewer:
/assign @thockin @nicksardo @bowei

Release note:

kube-proxy will emit a "FailedToStartNodeHealthcheck" event when it fails to start the healthz server.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 20, 2017
@k8s-github-robot k8s-github-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. labels Jul 20, 2017
@dixudx
Member

dixudx commented Jul 20, 2017

/test pull-kubernetes-kubemark-e2e-gce

@thockin
Member

thockin commented Jul 20, 2017

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2017
@MrHohn
Member Author

MrHohn commented Jul 20, 2017

Some other component is occupying port 10256, so kube-proxy keeps restarting. kube-proxy logs from kubemark CI:

Flag --resource-container has been deprecated, This feature will be removed in a later release.
I0720 07:56:04.089459       8 feature_gate.go:144] feature gates: map[ExperimentalCriticalPodAnnotation:true AdvancedAuditing:true]
W0720 07:56:04.089543       8 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I0720 07:56:04.101823       8 iptables.go:564] couldn't get iptables-restore version; assuming it doesn't support --wait
I0720 07:56:04.101929       8 iptables.go:192] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
I0720 07:56:04.104894       8 server.go:478] Using iptables Proxier.
W0720 07:56:04.106620       8 server.go:790] Failed to retrieve node info: Get https://35.192.72.118/api/v1/nodes/e2e-40722-minion-group-29qh: dial tcp 35.192.72.118:443: getsockopt: connection refused
W0720 07:56:04.106706       8 proxier.go:483] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I0720 07:56:04.106736       8 proxier.go:518] minSyncPeriod: 10s, syncPeriod: 1m0s, burstSyncs: 2
I0720 07:56:04.106891       8 server.go:513] Tearing down userspace rules.
I0720 07:56:04.118816       8 server.go:621] setting OOM scores is unsupported in this build
E0720 07:56:04.119026       8 healthcheck.go:303] Failed to start healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use
F0720 07:56:04.119063       8 helpers.go:119] error: listen tcp 0.0.0.0:10256: bind: address already in use
Flag --resource-container has been deprecated, This feature will be removed in a later release.
I0720 07:56:05.190499       8 feature_gate.go:144] feature gates: map[ExperimentalCriticalPodAnnotation:true AdvancedAuditing:true]
W0720 07:56:05.190559       8 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I0720 07:56:05.193049       8 iptables.go:564] couldn't get iptables-restore version; assuming it doesn't support --wait
I0720 07:56:05.193120       8 iptables.go:192] Could not connect to D-Bus system bus: dial unix /var/run/dbus/system_bus_socket: connect: no such file or directory
I0720 07:56:05.195908       8 server.go:478] Using iptables Proxier.
W0720 07:56:05.197214       8 server.go:790] Failed to retrieve node info: Get https://35.192.72.118/api/v1/nodes/e2e-40722-minion-group-29qh: dial tcp 35.192.72.118:443: getsockopt: connection refused
W0720 07:56:05.197286       8 proxier.go:483] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP
I0720 07:56:05.197311       8 proxier.go:518] minSyncPeriod: 10s, syncPeriod: 1m0s, burstSyncs: 2
I0720 07:56:05.197450       8 server.go:513] Tearing down userspace rules.
I0720 07:56:05.207668       8 server.go:621] setting OOM scores is unsupported in this build
E0720 07:56:05.207868       8 healthcheck.go:303] Failed to start healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

Will need to fix that before this could go in.

@yujuhong
Contributor

Will need to fix that before this could go in.

This cannot go in until the NPD's port is changed in the COS config...

@MrHohn
Member Author

MrHohn commented Jul 20, 2017

This cannot go in until the NPD's port is changed in the COS config...

Yep, Ajit sent a fix for that: #49316.

I also dug into the logs and confirmed that the other component I mentioned above is node-problem-detector. cc @bowei

@yujuhong
Contributor

I didn't look closely, but you may also need to change the hollow node config used by kubemark: https://github.com/kubernetes/kubernetes/blob/v1.7.1/test/kubemark/resources/hollow-node_template.yaml#L103

Member

@bowei bowei left a comment


An alternative is to retry binding the healthz port in a rate-limited loop and emit events. It seems bad to have kube-proxy die for this error.

}
go func() {
go wait.Until(func() {
glog.V(3).Infof("Starting goroutine for healthz on %s", hs.addr)
if err := server.Serve(listener); err != nil {
glog.Errorf("Healhz closed: %v", err)

Healthz is misspelled.

@MrHohn
Member Author

MrHohn commented Jul 21, 2017

cc @dnardo

}
go func() {
go wait.Until(func() {
glog.V(3).Infof("Starting goroutine for healthz on %s", hs.addr)
if err := server.Serve(listener); err != nil {
glog.Errorf("Healhz closed: %v", err)
return
}
glog.Errorf("Unexpected healhz closed.")

Also misspelled here.

s/healhz/healthz

@fejta
Contributor

fejta commented Jul 27, 2017

/retest
Automatic retester of approved PRs with test failures, please send feedback to fejta

@fejta-bot

/retest
Automated flake /retester experiment. Please send feedback to fejta

@MrHohn MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 359cf90 to 934d8d3 Compare July 28, 2017 01:25
@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 28, 2017
@MrHohn MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 934d8d3 to 677b2c5 Compare July 28, 2017 01:30
@MrHohn MrHohn changed the title Abort kube-proxy when fail to start healthz server. Emit event and retry when fail to start healthz server on kube-proxy Jul 28, 2017
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2017
@MrHohn MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from 677b2c5 to dbb5a3e Compare July 28, 2017 02:42
@MrHohn
Member Author

MrHohn commented Jul 28, 2017

An alternative is to retry binding the healthz port in a rate-limited loop and emit events. It seems bad to have kube-proxy die for this error.

Per offline discussion, we believe this is the better solution. The code has been revised accordingly.

@MrHohn MrHohn force-pushed the kube-proxy-abort-when-healthz-fails branch from dbb5a3e to db379de Compare July 28, 2017 03:55
@MrHohn
Member Author

MrHohn commented Jul 28, 2017

Sample event:

38s        38s         1         e2e-test-zihongz-minion-group-v02k   Node                  Warning   FailedToStartNodeHealthcheck   kube-proxy, e2e-test-zihongz-minion-group-v02k   Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

@thockin
Member

thockin commented Aug 1, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 1, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MrHohn, thockin

Associated issue: 49263

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@MrHohn
Member Author

MrHohn commented Aug 1, 2017

/retest

@yujuhong
Contributor

yujuhong commented Aug 1, 2017

An alternative is to retry binding the healthz port in a rate-limited loop and emit events. It seems bad to have kube-proxy die for this error.
Per offline discussion, we believe this is the better solution. The code has been revised accordingly.

Would you care to explain why? :-)

kube-proxy would appear to be running and ready, which seems very misleading. Even though events could indicate problems, they are probably not the best mechanism for expressing the status of a system component...

@MrHohn
Member Author

MrHohn commented Aug 1, 2017

Would you care to explain why? :-)

The main idea is to keep kube-proxy running when 99% of its functionality is working.

kube-proxy would appear to be running and ready, which seems very misleading. Even though events could indicate problems, they are probably not the best mechanism for expressing the status of a system component...

Agreed that kube-proxy shouldn't appear to be ready in this case. I will soon send a separate PR to add a readiness probe to kube-proxy.

@MrHohn
Member Author

MrHohn commented Aug 2, 2017

/retest

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 49992, 48861, 49267, 49356, 49886)

@k8s-github-robot k8s-github-robot merged commit acc00af into kubernetes:master Aug 2, 2017
@MrHohn
Member Author

MrHohn commented Aug 3, 2017

/release-note

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Aug 3, 2017
@MrHohn MrHohn deleted the kube-proxy-abort-when-healthz-fails branch August 4, 2017 00:03
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.