
kubelet status updates wedging after failed CRI call #53207

Closed · reported by @jsravn
Related: google/cadvisor#1830

Description

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

/kind bug

What happened:
Occasionally our nodes stop reporting their status to the apiserver and get marked NotReady. However, they remain connected to the apiserver and still receive the update to delete all their pods - which they do.

  Condition		Status		LastHeartbeatTime			LastTransitionTime			Reason			Message
  OutOfDisk 		Unknown 	Thu, 28 Sep 2017 07:55:44 +0000 	Thu, 28 Sep 2017 07:57:04 +0000 	NodeStatusUnknown 	Kubelet stopped posting node status.
  MemoryPressure 	Unknown 	Thu, 28 Sep 2017 07:55:44 +0000 	Thu, 28 Sep 2017 07:57:04 +0000 	NodeStatusUnknown 	Kubelet stopped posting node status.
  DiskPressure 		Unknown 	Thu, 28 Sep 2017 07:55:44 +0000 	Thu, 28 Sep 2017 07:57:04 +0000 	NodeStatusUnknown 	Kubelet stopped posting node status.
  Ready 		Unknown 	Thu, 28 Sep 2017 07:55:44 +0000 	Thu, 28 Sep 2017 07:57:04 +0000 	NodeStatusUnknown 	Kubelet stopped posting node status.
  KernelDeadlock 	False 		Thu, 28 Sep 2017 07:58:38 +0000 	Wed, 27 Sep 2017 08:44:28 +0000 	KernelHasNoDeadlock 	kernel has no deadlock

In this example, the node last reported at 7:55:44.

This always seems to happen after I see these sort of log messages:

Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064121    1748 remote_runtime.go:163] ListPodSandbox with filter "nil" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064194    1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064209    1748 generic.go:198] GenericPLEG: Unable to retrieve pods: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081115    1748 remote_runtime.go:163] ListPodSandbox with filter "&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081169    1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081187    1748 kubelet_pods.go:842] Error listing containers: &grpc.rpcError{code:0x4, desc:"context deadline exceeded"}
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081210    1748 kubelet.go:1886] Failed cleaning pods: rpc error: code = 4 desc = context deadline exceeded

This looks like dockershim timing out, perhaps blocked on something.

Oddly, I never see these errors again in the kubelet logs, which suggests the PLEG loop recovers and continues to run successfully.

However, from this point on, all heartbeats to the apiserver stop. This has even happened on an apiserver node, where kubelet talks to the apiserver over localhost - so it doesn't look like a network issue to me.

Next time this happens I'll try to capture a goroutine dump.

What you expected to happen:

Even if there is a temporary blip in the PLEG loop, kubelet should continue to send heartbeats, and then eventually go ready again.

How to reproduce it (as minimally and precisely as possible):
I've been unable to reproduce it yet - it happens quite infrequently.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.6.9 / 1.7.10
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release): Ubuntu 16.04.3
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Metadata

Assignees

No one assigned

    Labels

    kind/bug · milestone/needs-attention · priority/critical-urgent · sig/node
