kubelet status updates wedging after failed CRI call #53207
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
/kind bug
What happened:
Occasionally our nodes stop reporting their status to the apiserver and get marked NotReady. However, they remain connected to the apiserver and still receive the updates to delete all their pods, which they carry out.
Type             Status   LastHeartbeatTime                 LastTransitionTime                Reason              Message
OutOfDisk        Unknown  Thu, 28 Sep 2017 07:55:44 +0000   Thu, 28 Sep 2017 07:57:04 +0000   NodeStatusUnknown   Kubelet stopped posting node status.
MemoryPressure Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Thu, 28 Sep 2017 07:55:44 +0000 Thu, 28 Sep 2017 07:57:04 +0000 NodeStatusUnknown Kubelet stopped posting node status.
KernelDeadlock False Thu, 28 Sep 2017 07:58:38 +0000 Wed, 27 Sep 2017 08:44:28 +0000 KernelHasNoDeadlock kernel has no deadlock
In this example, the node last reported at 07:55:44.
This always seems to happen after I see log messages like these:
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064121 1748 remote_runtime.go:163] ListPodSandbox with filter "nil" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064194 1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.064209 1748 generic.go:198] GenericPLEG: Unable to retrieve pods: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081115 1748 remote_runtime.go:163] ListPodSandbox with filter "&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}" from runtime service failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081169 1748 kuberuntime_sandbox.go:185] ListPodSandbox failed: rpc error: code = 4 desc = context deadline exceeded
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081187 1748 kubelet_pods.go:842] Error listing containers: &grpc.rpcError{code:0x4, desc:"context deadline exceeded"}
Sep 28 07:57:54 ip-10-50-97-190 start-kubelet.sh[1748]: E0928 07:57:54.081210 1748 kubelet.go:1886] Failed cleaning pods: rpc error: code = 4 desc = context deadline exceeded
This looks like the dockershim is timing out, perhaps blocked on something.
Oddly, I never see these errors again in the kubelet logs, which suggests the PLEG loop continues to run successfully.
However, from this point on, all heartbeats to the apiserver stop. This has even happened on an apiserver node, where the kubelet connects over localhost, so it doesn't look like a network issue to me.
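I don't know the kubelet internals well enough to say where exactly it wedges, but here is a minimal sketch of the failure mode I suspect (plain Go, not kubelet code; `mu`, `listPodSandbox` and `sendHeartbeat` are made-up stand-ins): a call that never returns while holding shared state starves every later heartbeat, with no further errors logged.

```go
// Illustration only, not kubelet code: a runtime call that never returns
// while holding a lock that the status-update path also needs will silently
// stall the heartbeat loop, even though the ticker itself keeps firing.
package main

import (
	"fmt"
	"sync"
	"time"
)

// mu stands in for any state shared between the pod-listing path and the
// node-status path.
var mu sync.Mutex

// listPodSandbox stands in for a CRI call that wedges and never returns.
func listPodSandbox() {
	mu.Lock()
	defer mu.Unlock()
	select {} // blocks forever; the lock is never released
}

// sendHeartbeat stands in for posting node status to the apiserver.
func sendHeartbeat() {
	mu.Lock()
	defer mu.Unlock()
	fmt.Println("posted node status at", time.Now().Format(time.RFC3339))
}

func main() {
	// A few heartbeats succeed, then a runtime call wedges...
	time.AfterFunc(5*time.Second, listPodSandbox)

	// ...and every subsequent heartbeat blocks on the lock forever.
	for range time.Tick(2 * time.Second) {
		sendHeartbeat()
	}
}
```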
Next time this happens I'll try to capture a goroutine dump.
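For reference, the dump can be taken without killing the process. The sketch below is a plain-Go illustration of a signal-triggered goroutine dump, not how the kubelet exposes it; I'll use whatever debug handlers the kubelet makes available, or SIGQUIT as a last resort (which makes any Go binary print all goroutine stacks before exiting).

```go
// Illustration only: dump every goroutine's stack to stderr on SIGUSR1
// without stopping the process.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range sigs {
			// debug level 2 prints full stack traces, like panic output
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	select {} // stand-in for the program's real work
}
```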
What you expected to happen:
Even if there is a temporary blip in the PLEG loop, the kubelet should continue to send heartbeats and eventually report Ready again.
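To make that concrete, here is a minimal sketch of the behaviour I'd expect (plain Go, not kubelet code; `listPodSandbox` is a made-up stand-in): each runtime call carries its own deadline, so a hung call becomes a per-iteration error like the ones in the logs above, and the loop, and with it the heartbeat, keeps running.

```go
// Illustration only: a per-call deadline turns a hung runtime call into an
// error for that iteration instead of wedging the whole loop.
package main

import (
	"context"
	"fmt"
	"time"
)

// listPodSandbox stands in for a CRI call that has hung.
func listPodSandbox(ctx context.Context) error {
	select {
	case <-time.After(time.Hour): // in practice, never finishes
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

func main() {
	for range time.Tick(2 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		if err := listPodSandbox(ctx); err != nil {
			fmt.Println("ListPodSandbox failed:", err) // logged, but not fatal
		}
		cancel()

		fmt.Println("posted node status at", time.Now().Format(time.RFC3339))
	}
}
```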
How to reproduce it (as minimally and precisely as possible):
I've been unable to reproduce it yet - it happens quite infrequently.
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.6.9 / 1.7.10
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release): Ubuntu 16.04.3
- Kernel (e.g. `uname -a`):
- Install tools:
- Others: