
Lot of restarts on Kube-proxy pod #61901

Closed
disha94 opened this issue Mar 29, 2018 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

disha94 commented Mar 29, 2018

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I have recently been facing issues with Kubernetes 1.8.9: the kube-proxy pod is restarting many times on a few nodes, and as a result all service pods scheduled on those nodes are crashing again and again.
What you expected to happen:
Stable setup
How to reproduce it (as minimally and precisely as possible):
It happens randomly on a few of the nodes, so I am not able to reproduce it reliably.
Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.9
  • Cloud provider or hardware configuration: Installed with Kops 1.8.1 on AWS
  • OS (e.g. from /etc/os-release): Debian GNU/Linux 9 (stretch)
  • Kernel (e.g. uname -a): 4.9.0-6-amd64
  • Install tools: Kops
  • Docker runtime: docker://1.13.1

@kubernetes/sig-node
Could someone please help me with this? It is causing a lot of problems in our setup.

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 29, 2018
disha94 (Author) commented Apr 2, 2018

/sig cluster-ops
/sig node

@k8s-ci-robot k8s-ci-robot added sig/cluster-ops sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 2, 2018
disha94 (Author) commented Apr 6, 2018

When I check the node events and the kube-proxy pod logs, I see the following errors:
[screenshot: kube-proxy pod log errors]

This is not happening just for the metrics server / healthz; it is happening for all of the exposed node ports. When I ran "sudo netstat -tulpn | grep LISTEN" inside one of the nodes, I got this:
[screenshot: netstat output listing the listening ports]

I tried restarting / deleting the node, but it did not help.
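To narrow this down, it may help to check which process is actually holding the ports kube-proxy fails to bind, and whether that PID belongs to the currently running kube-proxy container. A rough sketch (the port numbers and the container name filter are assumptions based on the errors above):

# Show which PIDs are listening on the kube-proxy metrics/healthz ports
sudo netstat -tulpn | grep -E ':(10249|10256)'
# Compare against the PID of the kube-proxy container started by the kubelet
docker ps --filter name=kube-proxy --format '{{.ID}} {{.Names}}'
docker inspect --format '{{.State.Pid}}' <kube-proxy-container-id>

If the listening PID does not match, a stale kube-proxy process is still holding the ports.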

disha94 (Author) commented Apr 10, 2018

This issue is still open; could anyone please take a look at it?

MrHohn (Member) commented Apr 26, 2018

@kubernetes/sig-node-bugs It seems like two instances of kube-proxy were running, which caused one of them to fail to bind any ports?
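One quick way to confirm that, assuming shell access to the node (a sketch, not specific to any particular setup):

# List every kube-proxy process with its PID and full command line
sudo pgrep -af kube-proxy

More than one line of output here would indicate a duplicate or stale kube-proxy process.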

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 26, 2018
@ValeryNo

I had the same issue. It happened a couple of days after a 1.7.x -> 1.9.x cluster upgrade. Initially everything was fine, but then, as I was deploying more services, one node became NotReady and the rest shortly followed. We're using Kops on AWS. The problem was solved by a complete cluster restart (workers scaled to 0, each master restarted, then workers scaled back to the initial #).
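For reference, a rough sketch of that kind of restart with kops; the instance group name (nodes) and the cluster/state-store names are placeholders to adjust for your setup:

# Scale the worker instance group to 0 (set minSize/maxSize to 0 in the editor), then apply
kops edit ig nodes --name <cluster-name> --state s3://<kops-state-store>
kops update cluster <cluster-name> --state s3://<kops-state-store> --yes
# Restart/roll the masters, then scale the workers back up the same way and apply again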

arminmor commented Apr 27, 2018

I am facing the same issue here.

I have a cluster of 10 nodes, 1 master, 1 etcd.

Services on 6 nodes (out of 10) cannot reach, or be reached from, other containers. However, the other 4 nodes work perfectly: if I place pods on them (using nodeSelector), they can reach, and be reached from, other containers placed on those same 4 nodes.

I checked the node events (kubectl describe node) and the kube-proxy and calico logs for the 6 nodes:

describe node:

Events:
 Type Reason Age From Message
 ---- ------ ---- ---- -------
 Warning FailedToStartNodeHealthcheck 3m (x73391 over 50d) kube-proxy, sael0688 Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

kube-proxy:

1 server.go:483] starting metrics server failed: listen tcp 127.0.0.1:10249: bind: address already in use
1 proxier.go:1379] can't open "nodePort for ingress-nginx/ingress-nginx:http" (:30868/tcp), skipping this nodePort: listen tcp :30868: bind: address already in use
1 proxier.go:1379] can't open "nodePort for ingress-nginx/ingress-nginx:https" (:30344/tcp), skipping this nodePort: listen tcp :30344: bind: address already in use
1 healthcheck.go:317] Failed to start node healthz on 0.0.0.0:10256: listen tcp 0.0.0.0:10256: bind: address already in use

calico-node:
[ERROR][101] health.go 193: Health endpoint failed, trying to restart it... error=listen tcp :9099: bind: address already in use

I also checked the nodes that have the issue and realized that two different processes (different PIDs) named kube-proxy are active, but only one of them is listening on ports 10249 and 10256:

[clgm.k8@sael0684 ~]$ sudo ss -lntp | grep kube-proxy
LISTEN     0      128    127.0.0.1:10249                    *:*                   users:(("kube-proxy",pid=459,fd=9))
LISTEN     0      128         :::30344                   :::*                   users:(("kube-proxy",pid=459,fd=10))
LISTEN     0      128         :::10256                   :::*                   users:(("kube-proxy",pid=459,fd=7))
LISTEN     0      128         :::30868                   :::*                   users:(("kube-proxy",pid=459,fd=8))
LISTEN     0      128         :::32221                   :::*                   users:(("kube-proxy",pid=18161,fd=5))
LISTEN     0      128         :::31294                   :::*                   users:(("kube-proxy",pid=18161,fd=7))

However, on healthy nodes a single PID does the job:

[clgm.k8@sael0689 ~]$ sudo ss -lntp | grep kube-proxy
LISTEN     0      128    127.0.0.1:10249                    *:*                   users:(("kube-proxy",pid=26603,fd=8))
LISTEN     0      128         :::30344                   :::*                   users:(("kube-proxy",pid=26603,fd=11))
LISTEN     0      128         :::10256                   :::*                   users:(("kube-proxy",pid=26603,fd=7))
LISTEN     0      128         :::30868                   :::*                   users:(("kube-proxy",pid=26603,fd=10))
LISTEN     0      128         :::32221                   :::*                   users:(("kube-proxy",pid=26603,fd=5))
LISTEN     0      128         :::31294                   :::*                   users:(("kube-proxy",pid=26603,fd=9))

Does anybody know what causes this issue and how I should resolve this?

@arminmor

FYI, deleting the kube-proxy and calico pods on the affected node resolved the problem for me (a rough sketch of the commands follows the log excerpt below).
It takes a few minutes for the changes to propagate through the cluster. During this period I was monitoring the calico-node logs and realized that, as soon as the following logs appear, service access is restored:

bird: Graceful restart done
bird: Mesh_172_21_87_86: State changed to feed
bird: Mesh_172_21_87_88: State changed to feed
bird: Mesh_172_21_87_93: State changed to feed
bird: Mesh_172_21_87_95: State changed to feed
bird: Mesh_172_21_87_96: State changed to feed
bird: Mesh_172_21_87_106: State changed to feed
bird: Mesh_172_21_87_86: State changed to up
bird: Mesh_172_21_87_88: State changed to up
bird: Mesh_172_21_87_93: State changed to up
bird: Mesh_172_21_87_95: State changed to up
bird: Mesh_172_21_87_96: State changed to up
bird: Mesh_172_21_87_106: State changed to up
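A minimal sketch of that fix (pod and node names are placeholders; calico-node is recreated by its DaemonSet and kube-proxy by its DaemonSet or static manifest, depending on the setup):

# Find the kube-proxy and calico-node pods running on the affected node
kubectl -n kube-system get pods -o wide | grep <node-name>
# Delete them so fresh copies are created
kubectl -n kube-system delete pod <kube-proxy-pod> <calico-node-pod>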

rodcloutier (Contributor) commented May 8, 2018

Did you check whether there was more than one kube-proxy process? I had this issue on one node in GKE where two kube-proxy processes were live. I fixed it by killing the processes on the node and deleting the corresponding pod.
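Roughly, assuming the stale PID has already been identified with ss or pgrep as above:

# Kill the stale kube-proxy process that is still holding the ports
sudo kill <stale-pid>
# Then delete the kube-proxy pod so it is recreated cleanly
kubectl -n kube-system delete pod <kube-proxy-pod>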

@garethlewin

We also have multiple instances of kube-proxy on one of our nodes, but we have no idea how that occurred. Is this a known issue?

@milosgajdos

We have noticed a similar problem with exactly the same symptoms as described in this issue.

As @arminmor mentioned in his description, we have also noticed exactly the same calico-node errors in the logs when multiple instances of kube-proxy are found on the worker nodes.

This makes me wonder whether there is some kind of cause-and-effect relationship, i.e. calico-node experiences issues which somehow cause kube-proxy to "crash"; for whatever reason kube-proxy doesn't seem to clean up after itself, leaving a "stale" instance running while a new kube-proxy instance is spun up.

This comment in a different issue mentions running out of disk space as a potential cause of similar problems. We have also noticed some "correlation" between resource pressure on the node and the kube-proxy issue; however, we can't confirm it with absolute certainty since we can't reproduce it. Either way, both the calico DaemonSet and the kube-proxy pods carry critical-pod annotations, so I wouldn't expect them to be killed off due to lack of node resources.
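For what it's worth, a quick way to check that the annotation is actually present (the pod name is a placeholder):

# Look for the critical-pod annotation on the kube-proxy pod
kubectl -n kube-system get pod <kube-proxy-pod> -o yaml | grep critical-pod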

aabed commented Jun 1, 2018

The problem was solved for us with these versions:

Kube 1.10.3
Calico-node v2.6.7
Calico-cni v1.11.2

@milosgajdos

Do you know the root cause, @aabed? Which of those components was at fault, and how?

aabed commented Jun 3, 2018

I really don't know the root cause. We started the cluster with kops on Kubernetes 1.9.3, and the problem existed there.

After upgrading to 1.10.3 it disappeared.

In both cases the Calico versions were the same, so I'd say it was solved by upgrading Kubernetes itself.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 1, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 1, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
