
k8s-1.10: One of the kube-proxy pod failed to get up after restart #63064

Closed
saurabh-chordiya opened this issue Apr 24, 2018 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
sig/node Categorizes an issue or PR as relevant to SIG Node.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

saurabh-chordiya commented Apr 24, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Deployed k8s 1.10 using kubeadm (60-node setup) and everything was up and running.
Restarted one of the kube-proxy pods and it failed to come up.
The pod was up and running before the restart, but afterwards it went into an error state.

$ kubectl get pod --namespace=kube-system -o wide |grep kube-proxy-pj4xw
kube-proxy-pj4xw 0/1 CrashLoopBackOff 8 17m 1.0.0.76 minion-30-5-0-5

Below is the error:
$ kubectl logs --namespace=kube-system kube-proxy-pj4xw
I0424 06:24:47.961665 1 feature_gate.go:226] feature gates: &{{} map[]}
error: unable to read certificate-authority /var/run/secrets/kubernetes.io/serviceaccount/ca.crt for default due to open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory

What you expected to happen:
The pod should be up and running after the restart.

How to reproduce it (as minimally and precisely as possible):
Delete one of the kube-proxy pods; see the sketch below.
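A rough sketch of the reproduction, assuming the namespace and pod naming shown above; <new-kube-proxy-pod> is a placeholder for whatever name the DaemonSet assigns to the recreated pod:

$ kubectl get pod --namespace=kube-system -o wide | grep kube-proxy    # pick one kube-proxy pod
$ kubectl delete pod --namespace=kube-system kube-proxy-pj4xw          # the DaemonSet recreates it
$ kubectl get pod --namespace=kube-system -o wide | grep kube-proxy    # recreated pod enters CrashLoopBackOff
$ kubectl logs --namespace=kube-system <new-kube-proxy-pod>            # shows the ca.crt "no such file or directory" error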

Anything else we need to know?:
This issue was not seen with k8s 1.9.1.

Disabled the MountPropagation feature through featureGates; it was not working with MountPropagation enabled either, so disabling it was tried as well.

Also added "MountFlags=shared" to /etc/systemd/system/multi-user.target.wants/docker.service, as without this all test-pod deployments were failing.

After running "mount --make-rshared /" and restarting the docker service it started to work, but the next time the pod was deleted it again failed to come up. A sketch of these workaround steps follows.
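A minimal sketch of the workaround described above, assuming docker is managed by systemd on the node (the unit path is the one quoted in this report):

# 1. Add "MountFlags=shared" to the [Service] section of
#    /etc/systemd/system/multi-user.target.wants/docker.service, then reload systemd.
$ sudo systemctl daemon-reload
# 2. Make the root mount shared and restart docker.
$ sudo mount --make-rshared /
$ sudo systemctl restart docker
# 3. Delete the failing kube-proxy pod so the DaemonSet recreates it.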

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Deployed using kubeadm (60 nodes setup)

  • OS (e.g. from /etc/os-release):

NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
  • Kernel (e.g. uname -a):
Linux minion-30-5-0-9 4.6.0-040600-generic #201606100558 SMP Fri Jun 10 10:01:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:

##########################

docker version

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.6.2
 Git commit:   092cba3
 Built:        Thu Nov  2 20:40:23 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.6.2
 Git commit:   092cba3
 Built:        Thu Nov  2 20:40:23 2017
 OS/Arch:      linux/amd64
 Experimental: false

##########################

kubectl describe pod --namespace=kube-system kube-proxy-pj4xw

Name:           kube-proxy-pj4xw
Namespace:      kube-system
Node:           minion-30-5-0-5/1.0.0.76
Start Time:     Tue, 24 Apr 2018 06:23:59 +0000
Labels:         controller-revision-hash=1193416634
                k8s-app=kube-proxy
                pod-template-generation=1
Annotations:    <none>
Status:         Running
IP:             1.0.0.76
Controlled By:  DaemonSet/kube-proxy
Containers:
  kube-proxy:
    Container ID:  docker://a7cce480ec211e0fdf53a5216e1d4f0601feb25df0b35370e3194203ec2f5165
    Image:         k8s.gcr.io/kube-proxy-amd64:v1.10.0
    Image ID:      docker-pullable://k8s.gcr.io/kube-proxy-amd64@sha256:fc944b06c14cb442916045a630d5e374dfb9c453dfc56d3cb59ac21ea4268875
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/kube-proxy
      --config=/var/lib/kube-proxy/config.conf
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 24 Apr 2018 06:34:44 +0000
      Finished:     Tue, 24 Apr 2018 06:34:45 +0000
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/kube-proxy from kube-proxy (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-proxy-token-b526l (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  kube-proxy:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-proxy
    Optional:  false
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  kube-proxy-token-b526l:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kube-proxy-token-b526l
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node-role.kubernetes.io/master:NoSchedule
                 node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                From                      Message
  ----     ------                 ----               ----                      -------
  Normal   SuccessfulMountVolume  14m                kubelet, minion-30-5-0-5  MountVolume.SetUp succeeded for volume "lib-modules"
  Normal   SuccessfulMountVolume  14m                kubelet, minion-30-5-0-5  MountVolume.SetUp succeeded for volume "xtables-lock"
  Normal   SuccessfulMountVolume  14m                kubelet, minion-30-5-0-5  MountVolume.SetUp succeeded for volume "kube-proxy"
  Normal   SuccessfulMountVolume  14m                kubelet, minion-30-5-0-5  MountVolume.SetUp succeeded for volume "kube-proxy-token-b526l"
  Normal   Started                13m (x4 over 14m)  kubelet, minion-30-5-0-5  Started container
  Normal   Pulled                 13m (x5 over 14m)  kubelet, minion-30-5-0-5  Container image "k8s.gcr.io/kube-proxy-amd64:v1.10.0" already present on machine
  Normal   Created                13m (x5 over 14m)  kubelet, minion-30-5-0-5  Created container
  Warning  BackOff                4m (x47 over 14m)  kubelet, minion-30-5-0-5  Back-off restarting failed container
k8s-ci-robot added the needs-sig and kind/bug labels on Apr 24, 2018
saurabh-chordiya (Author) commented Apr 24, 2018

/sig storage
/sig node

k8s-ci-robot added the sig/storage label and removed the needs-sig label on Apr 24, 2018
saurabh-chordiya changed the title from "k8s-1.10: One of the kube-proxy node failed to get up after restart" to "k8s-1.10: One of the kube-proxy pod failed to get up after restart" on Apr 24, 2018
saurabh-chordiya (Author) commented Apr 26, 2018

@kubernetes/sig-storage-bugs
@jsafrane

k8s-ci-robot added the sig/node label on Apr 26, 2018
jsafrane (Member) commented:
I am not sure it's related to mount propagation at all. It would produce different messages.

error: unable to read certificate-authority /var/run/secrets/kubernetes.io/serviceaccount/ca.crt for default due to open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory

This seems to be some issue with Secret volumes. Please check whether /var/run/secrets path's backing directory /var/lib/kubelet/pods/<your pod uid>/volumes/kubernetes.io~secret/<token name>/ contains ca.crt and looks "healthy".
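A concrete form of that check, keeping the placeholders from the path above (substitute the UID of the failing pod and its token volume name):

$ ls /var/lib/kubelet/pods/<your pod uid>/volumes/kubernetes.io~secret/<token name>/
# expected contents: ca.crt  namespace  token
$ findmnt -o TARGET,PROPAGATION /var/lib/kubelet/pods/<your pod uid>/volumes/kubernetes.io~secret/<token name>/
# shows the mount propagation of the tmpfs backing the Secret volume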

saurabh-chordiya (Author) commented Apr 30, 2018

Yes, this doesn't seem to be related to mount propagation, but something new in 1.10 compared to 1.9 is breaking it.

Checked the path below and it has all the required data:

root@minion-30-1-0-5:~# ls /var/lib/kubelet/pods/9e4ef60c-4a12-11e8-982e-222201000023/volumes/kubernetes.io~secret/default-token-xrbpk/
ca.crt  namespace  token

Mount propagation is "private,slave"

root@minion-30-1-0-5:~# findmnt -o TARGET,PROPAGATION /var/lib/kubelet/pods/9e4ef60c-4a12-11e8-982e-222201000023/volumes/kubernetes.io~secret/default-token-xrbpk/
TARGET                                                                                                      PROPAGATION
/var/lib/kubelet/pods/9e4ef60c-4a12-11e8-982e-222201000023/volumes/kubernetes.io~secret/default-token-xrbpk private,slave

After changing the mount propagation to shared and restarting docker, it started working:

mount --make-rshared /var/lib/kubelet/pods/9e4ef60c-4a12-11e8-982e-222201000023/volumes/kubernetes.io~secret/default-token-xrbpk/
root@minion-30-1-0-15:~# kubectl get pod -o wide --namespace=kube-system |grep 1.0.0.66
kube-proxy-jwdn4                           1/1       Running            12         37m       1.0.0.66   minion-30-1-0-5

Something is really wrong in 1.10, as the same scenario works fine with k8s 1.9.1.

jsafrane (Member) commented May 4, 2018

You should retry with #62633 (upcoming 1.10.3?) where we change the default back to private and use slave/shared only when explicitly requested.
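A hypothetical re-test once the node is on a release containing that change (<kube-proxy-pod> is a placeholder for any kube-proxy pod on an affected node):

$ kubectl version                                                      # confirm the patched release
$ kubectl delete pod --namespace=kube-system <kube-proxy-pod>
$ kubectl get pod --namespace=kube-system -o wide | grep kube-proxy    # the recreated pod should reach Running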

saurabh-chordiya (Author) commented May 4, 2018

Sure, will retry with 1.10.3 (once available) and update.

fejta-bot commented:
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 2, 2018
fejta-bot commented:
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 1, 2018
fejta-bot commented:
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot (Contributor) commented:
@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
