
Lot of restarts on Kube-proxy pod #61901

Closed
disha94 opened this issue Mar 29, 2018 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

disha94 commented Mar 29, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I have recently been facing issues with Kubernetes 1.8.9: the kube-proxy pod is restarting many times on a few nodes, and as a result all service pods scheduled on those nodes are crashing again and again.
What you expected to happen:
Stable setup
How to reproduce it (as minimally and precisely as possible):
It happens randomly on a few of the nodes, so I am not able to reproduce it reliably.
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.9
  • Cloud provider or hardware configuration: Installed with Kops 1.8.1 on AWS
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): 4.9.0-6-amd64
  • Install tools: Kops
  • Docker runtime: docker://1.13.1

@kubernetes/sig-node
Could someone please help me with this? It is causing a lot of problems in our setup.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 29, 2018
disha94 (Author) commented Apr 2, 2018

/sig cluster-ops
/sig node

@k8s-ci-robot k8s-ci-robot added sig/cluster-ops sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 2, 2018
disha94 (Author) commented Apr 6, 2018

When I check the node events and the kube-proxy pod logs, I see the following errors:
[screenshot: kube-proxy pod log errors]

This is not happening just for the metrics server / healthz; it is happening for all of the exposed node ports. When I ran "sudo netstat -tulpn | grep LISTEN" inside one of the nodes, I got this:
[screenshot: netstat output listing the listening ports]

I tried restarting / deleting the node, but it did not help.
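To narrow this down, it may help to check which process is actually holding the ports kube-proxy fails to bind, and whether that PID belongs to the currently running kube-proxy container. A rough sketch (the port numbers and the container name filter are assumptions based on the errors above):

# Show which PIDs are listening on the kube-proxy metrics/healthz ports
sudo netstat -tulpn | grep -E ':(10249|10256)'
# Compare against the PID of the kube-proxy container started by the kubelet
docker ps --filter name=kube-proxy --format '{{.ID}} {{.Names}}'
docker inspect --format '{{.State.Pid}}' <kube-proxy-container-id>

If the listening PID does not match, a stale kube-proxy process is still holding the ports.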

disha94 (Author) commented Apr 10, 2018

This issue is still open; could anyone please take a look at it?

MrHohn (Member) commented Apr 26, 2018

@kubernetes/sig-node-bugs It seems like two instances of kube-proxy were running, which caused one of them to fail to bind any ports?
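One quick way to confirm that, assuming shell access to the node (a sketch, not specific to any particular setup):

# List every kube-proxy process with its PID and full command line
sudo pgrep -af kube-proxy

More than one line of output here would indicate a duplicate or stale kube-proxy process.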

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 26, 2018
@ValeryNo

I had the same issue. It happened a couple of days after a 1.7.x -> 1.9.x cluster upgrade. Initially everything was fine, but then, as I was deploying more services, one node became NotReady and the rest shortly followed. We're using Kops on AWS. The problem was solved by a complete cluster restart (workers scaled to 0, each master restarted, then workers scaled back to the initial #).
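For reference, a rough sketch of that kind of restart with kops; the instance group name (nodes) and the cluster/state-store names are placeholders to adjust for your setup:

# Scale the worker instance group to 0 (set minSize/maxSize to 0 in the editor), then apply
kops edit ig nodes --name <cluster-name> --state s3://<kops-state-store>
kops update cluster <cluster-name> --state s3://<kops-state-store> --yes
# Restart/roll the masters, then scale the workers back up the same way and apply again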

arminmor commented Apr 27, 2018

I am facing the same issue here.

I have a cluster of 10 nodes, 1 master, 1 etcd.

Services on 6 nodes (out of 10) cannot reach, or be reached from, other containers. However, the other 4 nodes work perfectly: if I place pods on them (using nodeSelector), they can reach, and be reached from, other containers placed on those same 4 nodes.

I checked the node events (kubectl describe node) and the kube-proxy and calico logs for the 6 nodes:

describe node:

Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Warning FailedToStartNodeHealthcheck 3m (x73391 over 50d) kube-proxy, sael0688 Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

kube-proxy:

1 server.go:483] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
1 proxier.go:1379] can't open "nodePort for ingress-nginx/ingress-nginx:http" (:30868/tcp), skipping this nodePort: listen tcp :30868: bind: address already in use
1 proxier.go:1379] can't open "nodePort for ingress-nginx/ingress-nginx:https" (:30344/tcp), skipping this nodePort: listen tcp :30344: bind: address already in use
1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

calico-node:
[ERROR][101] health.go 193: Health endpoint failed, trying to restart it... error=listen tcp :9099: bind: address already in use

I also checked the nodes that have the issue and realized that two different processes (different PIDs) named kube-proxy are active, but only one of them is listening on ports 10249 and 10256:

[clgm.k8@sael0684 ~]$ sudo ss -lntp | grep kube-proxy
LISTEN     0      128    127.0.0.1:10249                    *:*                   users:(("kube-proxy",pid=459,fd=9))
LISTEN     0      128         :::30344                   :::*                   users:(("kube-proxy",pid=459,fd=10))
LISTEN     0      128         :::10256                   :::*                   users:(("kube-proxy",pid=459,fd=7))
LISTEN     0      128         :::30868                   :::*                   users:(("kube-proxy",pid=459,fd=8))
LISTEN     0      128         :::32221                   :::*                   users:(("kube-proxy",pid=18161,fd=5))
LISTEN     0      128         :::31294                   :::*                   users:(("kube-proxy",pid=18161,fd=7))

However, on healthy nodes a single PID does the job:

[clgm.k8@sael0689 ~]$ sudo ss -lntp | grep kube-proxy
LISTEN     0      128    127.0.0.1:10249                    *:*                   users:(("kube-proxy",pid=26603,fd=8))
LISTEN     0      128         :::30344                   :::*                   users:(("kube-proxy",pid=26603,fd=11))
LISTEN     0      128         :::10256                   :::*                   users:(("kube-proxy",pid=26603,fd=7))
LISTEN     0      128         :::30868                   :::*                   users:(("kube-proxy",pid=26603,fd=10))
LISTEN     0      128         :::32221                   :::*                   users:(("kube-proxy",pid=26603,fd=5))
LISTEN     0      128         :::31294                   :::*                   users:(("kube-proxy",pid=26603,fd=9))

Does anybody know what causes this issue and how I should resolve this?

@arminmor

FYI, deleting the kube-proxy and calico pods on the affected node resolved the problem for me (a rough sketch of the commands follows the log excerpt below).
It takes a few minutes for the changes to propagate through the cluster. During this period I was monitoring the calico-node logs and realized that, as soon as the following logs appear, service access is restored:

bird: Graceful restart done
bird: Mesh_172_21_87_86: State changed to feed
bird: Mesh_172_21_87_88: State changed to feed
bird: Mesh_172_21_87_93: State changed to feed
bird: Mesh_172_21_87_95: State changed to feed
bird: Mesh_172_21_87_96: State changed to feed
bird: Mesh_172_21_87_106: State changed to feed
bird: Mesh_172_21_87_86: State changed to up
bird: Mesh_172_21_87_88: State changed to up
bird: Mesh_172_21_87_93: State changed to up
bird: Mesh_172_21_87_95: State changed to up
bird: Mesh_172_21_87_96: State changed to up
bird: Mesh_172_21_87_106: State changed to up
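A minimal sketch of that fix (pod and node names are placeholders; calico-node is recreated by its DaemonSet and kube-proxy by its DaemonSet or static manifest, depending on the setup):

# Find the kube-proxy and calico-node pods running on the affected node
kubectl -n kube-system get pods -o wide | grep <node-name>
# Delete them so fresh copies are created
kubectl -n kube-system delete pod <kube-proxy-pod> <calico-node-pod>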

rodcloutier (Contributor) commented May 8, 2018

Did you check whether there was more than one kube-proxy process? I had this issue on one node in GKE where two kube-proxy processes were live. I fixed it by killing the processes on the node and deleting the corresponding pod.
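Roughly, assuming the stale PID has already been identified with ss or pgrep as above:

# Kill the stale kube-proxy process that is still holding the ports
sudo kill <stale-pid>
# Then delete the kube-proxy pod so it is recreated cleanly
kubectl -n kube-system delete pod <kube-proxy-pod>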

@garethlewin

We also have multiple instances of kube-proxy on one of our nodes, but we have no idea how that occurred. Is this a known issue?

@milosgajdos

We have noticed a similar problem with exactly the same symptoms as described in this issue.

As @arminmor mentioned in his description, we have also noticed exactly the same calico-node errors in the logs when multiple instances of kube-proxy are found on the worker nodes.

This makes me wonder whether there is some kind of cause-and-effect relationship, i.e. calico-node experiences issues which somehow cause kube-proxy to "crash"; for whatever reason kube-proxy doesn't seem to clean up after itself, leaving a "stale" instance running while a new kube-proxy instance is spun up.

This comment in a different issue mentions running out of disk space as a potential cause of similar problems. We have also noticed some "correlation" between resource pressure on the node and the kube-proxy issue; however, we can't confirm it with absolute certainty since we can't reproduce it. Either way, both the calico DaemonSet and the kube-proxy pods carry critical-pod annotations, so I wouldn't expect them to be killed off due to lack of node resources.
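For what it's worth, a quick way to check that the annotation is actually present (the pod name is a placeholder):

# Look for the critical-pod annotation on the kube-proxy pod
kubectl -n kube-system get pod <kube-proxy-pod> -o yaml | grep critical-pod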

aabed commented Jun 1, 2018

The problem was solved for us with these versions:

Kube 1.10.3
Calico-node v2.6.7
Calico-cni v1.11.2

@milosgajdos

Do you know the root cause, @aabed? Which of those components was at fault, and how?

aabed commented Jun 3, 2018

I really don't know the root cause. We started the cluster with kops on Kubernetes 1.9.3, and the problem existed there.

After upgrading to 1.10.3 it disappeared.

In both cases the Calico versions were the same, so I'd say it was solved by upgrading Kubernetes itself.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 1, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
