kubelet status updates wedging after failed CRI call #53207
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
/kind bug
What happened:
Occasionally our nodes stop reporting their status to the apiserver and get marked NotReady. However, they remain connected to the apiserver and still receive the updates to delete all their pods, which they carry out.
Type             Status   LastHeartbeatTime                 LastTransitionTime                Reason              Message
OutOfDisk        Unknown  Thu, 28 Sep 2017 07:55:44 +0000   Thu, 28 Sep 2017 07:57:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
MemoryPressure Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
KernelDeadlock False Thu, 28 Sep 2017 07:58:38 +0000 Wed, 27 Sep 2017 08:44:28 +0000 KernelHasNoDeadlock kernel has no deadlock
In this example, the node last reported at 07:55:44.
This always seems to happen after I see log messages like these:
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064121 1748 remote_runtime.go:163] ListPodSandbox with filter "nil" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064194 1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064209 1748 generic.go:198] GenericPLEG: Unable to retrieve pods: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081115 1748 remote_runtime.go:163] ListPodSandbox with filter "&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081169 1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081187 1748 kubelet_pods.go:842] Error listing containers: &grpc.rpcError{code:0x4, desc:"context deadline exceeded"}
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081210 1748 kubelet.go:1886] Failed cleaning pods: rpc error: code = 4 desc = context deadline exceeded
This looks like the dockershim is timing out, perhaps blocked on something.
Oddly, I never see these errors again in the kubelet logs, which suggests the PLEG loop continues to run successfully.
However, from this point on, all heartbeats to the apiserver stop. This has even happened on an apiserver node, where the kubelet connects over localhost, so it doesn't look like a network issue to me.
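I don't know the kubelet internals well enough to say where exactly it wedges, but here is a minimal sketch of the failure mode I suspect (plain Go, not kubelet code; `mu`, `listPodSandbox` and `sendHeartbeat` are made-up stand-ins): a call that never returns while holding shared state starves every later heartbeat, with no further errors logged.

```go
// Illustration only, not kubelet code: a runtime call that never returns
// while holding a lock that the status-update path also needs will silently
// stall the heartbeat loop, even though the ticker itself keeps firing.
package main

import (
	"fmt"
	"sync"
	"time"
)

// mu stands in for any state shared between the pod-listing path and the
// node-status path.
var mu sync.Mutex

// listPodSandbox stands in for a CRI call that wedges and never returns.
func listPodSandbox() {
	mu.Lock()
	defer mu.Unlock()
	select {} // blocks forever; the lock is never released
}

// sendHeartbeat stands in for posting node status to the apiserver.
func sendHeartbeat() {
	mu.Lock()
	defer mu.Unlock()
	fmt.Println("posted node status at", time.Now().Format(time.RFC3339))
}

func main() {
	// A few heartbeats succeed, then a runtime call wedges...
	time.AfterFunc(5*time.Second, listPodSandbox)

	// ...and every subsequent heartbeat blocks on the lock forever.
	for range time.Tick(2 * time.Second) {
		sendHeartbeat()
	}
}
```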
Next time this happens I'll try to capture a goroutine dump.
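For reference, the dump can be taken without killing the process. The sketch below is a plain-Go illustration of a signal-triggered goroutine dump, not how the kubelet exposes it; I'll use whatever debug handlers the kubelet makes available, or SIGQUIT as a last resort (which makes any Go binary print all goroutine stacks before exiting).

```go
// Illustration only: dump every goroutine's stack to stderr on SIGUSR1
// without stopping the process.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			// debug level 2 prints full stack traces, like panic output
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the program's real work
}
```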
What you expected to happen:
Even if there is a temporary blip in the PLEG loop, the kubelet should continue to send heartbeats and eventually report Ready again.
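To make that concrete, here is a minimal sketch of the behaviour I'd expect (plain Go, not kubelet code; `listPodSandbox` is a made-up stand-in): each runtime call carries its own deadline, so a hung call becomes a per-iteration error like the ones in the logs above, and the loop, and with it the heartbeat, keeps running.

```go
// Illustration only: a per-call deadline turns a hung runtime call into an
// error for that iteration instead of wedging the whole loop.
package main

import (
	"context"
	"fmt"
	"time"
)

// listPodSandbox stands in for a CRI call that has hung.
func listPodSandbox(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // in practice, never finishes
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	for range time.Tick(2 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		if err := listPodSandbox(ctx); err != nil {
			fmt.Println("ListPodSandbox failed:", err) // logged, but not fatal
		}
		cancel()

		fmt.Println("posted node status at", time.Now().Format(time.RFC3339))
	}
}
```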
How to reproduce it (as minimally and precisely as possible):
I've been unable to reproduce it yet - it happens quite infrequently.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.6.9 / 1.7.10
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Ubuntu 16.04.3
- Kernel (e.g. `uname -a`):
- Install tools:
- Others: