
116 nodes gradually marked unhealthy over the course of 15 minutes #46641

Closed
@bjornswift

Description

Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2", GitCommit:"477efc3cbe6a7effca06bd1452fa356e2201e1ee", GitTreeState:"clean", BuildDate:"2017-04-19T20:33:11Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"6", GitVersion:"v1.6.2+coreos.0", GitCommit:"79fee581ce4a35b7791fdd92e0fc97e02ef1d5c0", GitTreeState:"clean", BuildDate:"2017-04-19T23:13:34Z", GoVersion:"go1.7.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: aws

  • OS:
    NAME="Container Linux by CoreOS"
    ID=coreos
    VERSION=1353.7.0
    VERSION_ID=1353.7.0
    BUILD_ID=2017-04-26-2154

  • Kernel: Linux 4.9.24-coreos #1 SMP Wed Apr 26 21:44:23 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz GenuineIntel GNU/Linux

  • Install tools: kube-aws 0.9.6


We're investigating an incident from last week in which 116 nodes in a 119-node cluster were marked NodeNotReady. The nodes went NotReady one by one over a period of roughly 15 minutes, and then all of them became healthy again at around the same time.

Here's what we know:

  1. We're running Kubernetes v1.6.2, spun up with kube-aws v0.9.6 on CoreOS's stable release branch.

  2. While the nodes were unhealthy, their pods remained running -- but they stopped receiving incoming traffic. The service fronting the pods returned an empty result set (no matching pods; see the endpoints sketch after this list). The applications running inside the pods showed no signs of network trouble and appeared able to serve already-established connections.

  3. Kubernetes initiated pod eviction for nodes that had been unhealthy for 5 minutes (the pod eviction timeout).

  4. The evictions were only processed once the nodes became healthy again (presumably because the controller could not reach the kubelets until then).

  5. The closest we've gotten to a root cause is "something related to iptables". The nodes' dmesg contains:

ip-10-0-7-5.ec2.internal.dmesg:[Fri May 26 00:05:34 2017] Netfilter messages via NETLINK v0.30.
ip-10-0-7-5.ec2.internal.dmesg-[Fri May 26 00:05:34 2017] ctnetlink v0.93: registering with nfnetlink.

A subset of the nodes then failed to ensure that the correct iptables rules were in place (see the lock-contention sketch after this list):

May 26 00:06:37 ip-10-0-7-5.ec2.internal kubelet-wrapper[1911]: E0526 00:06:37.336142    1911 kubelet_network.go:412] Failed to ensure marking rule for KUBE-MARK-MASQ: error checking rule: exit status 4: iptables: Resource temporarily unavailable.
May 26 00:20:38 ip-10-0-7-5.ec2.internal kubelet-wrapper[1911]: E0526 00:20:38.629536    1911 kubelet_network.go:378] Failed to ensure marking rule for KUBE-MARK-DROP: error checking rule: exit status 4: iptables: Resource temporarily unavailable.

  6. All nodes printed the Netfilter message between 2017-05-26T00:05:34Z and 2017-05-26T00:10:02Z, and all nodes were marked unhealthy between 2017-05-26T00:05:41Z and 2017-05-26T00:20:06Z. There is no clear correlation between which nodes printed the Netfilter message first and which were detected unhealthy first.
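
To illustrate the failure mode in (5): a minimal lock-contention sketch, meant for a throwaway host rather than the production cluster, showing how concurrent iptables invocations without -w can fail with exit status 4. TEST-LOCK is a hypothetical scratch chain standing in for KUBE-MARK-MASQ; the kubelet performs the same kind of check (iptables -t nat -C ...). Depending on the iptables version the message is either "Resource temporarily unavailable" or "Another app is currently holding the xtables lock", but the exit status is 4 in both cases, and whether contention actually triggers depends on timing.

# run as root on a scratch host; create a throwaway chain in the nat table
iptables -t nat -N TEST-LOCK 2>/dev/null || true

# background loop that repeatedly takes the iptables lock
for i in $(seq 1 500); do
  iptables -t nat -C TEST-LOCK -j MARK --set-xmark 0x4000/0x4000 >/dev/null 2>&1
done &

# foreground loop doing the same check, watching for exit status 4
for i in $(seq 1 500); do
  iptables -t nat -C TEST-LOCK -j MARK --set-xmark 0x4000/0x4000 >/dev/null 2>&1
  rc=$?
  if [ "$rc" -eq 4 ]; then
    echo "lock contention (exit status 4) on iteration $i"
  fi
done
wait

# cleanup
iptables -t nat -X TEST-LOCK

If the host's iptables supports it, passing -w makes the call wait for the lock instead of failing immediately.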
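
For the symptom in (2), a quick verification sketch with placeholder Service/namespace names ("myapp" in "default"), showing how a Service's endpoints empty out while its pods keep running on NotReady nodes:

kubectl get nodes | grep NotReady                 # the affected nodes
kubectl get endpoints myapp -n default -o wide    # ENDPOINTS column empty during the incident
kubectl get pods -n default -o wide | grep myapp  # pods still Running, scheduled on NotReady nodes

As far as we understand, this matches expected behavior: pods on a NotReady node are marked not ready and dropped from endpoints, so new traffic stops while established connections keep working.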

We don't know where to go from here and would appreciate pointers. We've collected the journal and dmesg from all nodes, the kubectl event stream, and the api-server, controller-manager, and scheduler logs. We don't believe they contain any sensitive data, but would prefer to share them over a private channel if possible.

Are there any other logs we should be gathering before tearing down the cluster?
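
In case it's useful, here's a rough sketch of the additional per-node state we could still grab (this assumes systemd units named kubelet.service/kube-proxy.service and that conntrack-tools is installed; adjust for your setup):

journalctl -u kubelet.service --since "2017-05-26" > kubelet.journal
journalctl -u kube-proxy.service --since "2017-05-26" > kube-proxy.journal   # or: kubectl -n kube-system logs <kube-proxy-pod> if it runs as a pod
dmesg -T > dmesg.txt
iptables-save > iptables-rules.txt                                  # full rule dump from each node
conntrack -S > conntrack-stats.txt 2>/dev/null || true              # per-CPU conntrack counters
cat /proc/sys/net/netfilter/nf_conntrack_count \
    /proc/sys/net/netfilter/nf_conntrack_max > conntrack-usage.txt  # conntrack table usage vs. limit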
