hyperkube kubelet: KillPodSandboxError ... networkPlugin cni failed to teardown pod ... network: neither iptables nor ip6tables usable #92272
Description
What happened:
- I attempted to bump the 1.16, 1.17, and 1.18 versions in my team's kubernetes image repo to 1.16.11, 1.17.7, and 1.18.4, respectively.
- The images built successfully, but none of the e2e tests we run, including the Kubernetes conformance e2e tests, ever actually ran correctly, with no clear indication as to why -- ginkgo was just mysteriously succeeding without any tests passed, skipped, or failed. This prompted me to take a closer look at a 1.18.4 cluster.
- I expected coredns pods to be in "Running" status but instead found:
kube-system coredns-6b6854dcbf-tqxgn 0/1 Completed 0 5h24m
kube-system coredns-6b6854dcbf-wztzn 0/1 Completed 0 5h24m
- ...and the following event on one of those pods:
Normal SandboxChanged 98s (x627 over 136m) kubelet, test-1-18-4-0b8b9360-3fhas Pod sandbox changed, it will be killed and re-created.
- This event prompted me to log on to a worker node, where I noticed repeated messages like the following in the kubelet logs:
Jun 18 21:41:53 test-1-18-4-06729635-3f4er docker[1211]: E0618 21:41:53.316852 1291 pod_workers.go:191] Error syncing pod f364dc8e-6e7d-4869-976a-678a3ce913ac ("coredns-6b6854dcbf-2t9cv_kube-system(f364dc8e-6e7d-4869-976a-678a3ce913ac)"), skipping: failed to "KillPodSandbox" for "f364dc8e-6e7d-4869-976a-678a3ce913ac" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"coredns-6b6854dcbf-2t9cv_kube-system\" network: neither iptables nor ip6tables usable"
This led me to grep through the kubernetes repo at v1.18.4 for the message "network: neither iptables nor ip6tables usable", but I didn't find anything. So I did the same thing in the containernetworking plugins repo and found the following references:
- https://github.com/containernetworking/plugins/blob/ad10b6fa91aacd720f1f9ab94341a97a82a24965/plugins/meta/portmap/portmap.go#L120
- https://github.com/containernetworking/plugins/blob/ad10b6fa91aacd720f1f9ab94341a97a82a24965/plugins/meta/portmap/portmap.go#L384
It's hard to tell which of the possible maybeGetIptables failures occurred because the error isn't lifted into the calling context (see the hypothetical sketch at the end of this section), but this led me to believe that there is something different about the iptables installed in the hyperkube container that we are using to run our kubelets. And I did find a potentially significant difference, at least between 1.18.3 and 1.18.4 (I haven't compared the patch-level versions on 1.16 and 1.17 yet):
zsh/4 10048 % docker run --rm -it --entrypoint "" gcr.io/google-containers/hyperkube:v1.18.3 iptables --version
iptables v1.6.0
zsh/4 10047 % docker run --rm -it --entrypoint "" gcr.io/google-containers/hyperkube:v1.18.4 iptables --version
iptables v1.8.2 (nf_tables)
I'm not familiar enough with iptables to say with any kind of certainty that this is really the root cause of the problem, but if not then it's a pretty good red herring.
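To illustrate why I can't see the underlying failure, here is a minimal, purely hypothetical sketch of the kind of error-swallowing pattern I suspect -- this is not the actual code from containernetworking/plugins, and probeIptables below is just my stand-in for maybeGetIptables -- which would explain why only the generic message ever reaches the kubelet log:

```go
// Hypothetical sketch of the error-swallowing pattern I suspect in the
// portmap plugin's teardown path. This is NOT the real plugin source; it
// only illustrates why the underlying failure never shows up in the logs.
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// probeIptables stands in for maybeGetIptables: it reports whether the
// given binary appears usable and discards the real reason when it isn't.
func probeIptables(binary string) bool {
	if err := exec.Command(binary, "--version").Run(); err != nil {
		// The actual cause (missing binary, permission problem, etc.)
		// is dropped right here.
		return false
	}
	return true
}

func teardown() error {
	v4 := probeIptables("iptables")
	v6 := probeIptables("ip6tables")
	if !v4 && !v6 {
		// Only this generic message surfaces in the KillPodSandboxError.
		return errors.New("neither iptables nor ip6tables usable")
	}
	// ... real teardown of the port-mapping rules would happen here ...
	return nil
}

func main() {
	if err := teardown(); err != nil {
		fmt.Println("teardown failed:", err)
	}
}
```

If the real code behaves anything like this, then whatever actually makes the container's iptables unusable is exactly the detail that gets thrown away.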
What you expected to happen:
I was hoping that 1.16.11, 1.17.7, and 1.18.4 would "just work" similar to the previous patch-level versions.
How to reproduce it (as minimally and precisely as possible):
I'm admittedly far from fully understanding what's going on here, but rather than dumping a bunch of configuration files that I don't think would be very helpful, here is what I think may be happening:
We have --network-plugin=cni passed to the kubelet in our systemd unit file, and some part of the kubelet delegates pod network setup and teardown to the CNI binaries that appear to be copied into the hyperkube image:
zsh/4 10049 % docker run --rm -it --entrypoint "" gcr.io/google-containers/hyperkube:v1.18.4 ls /opt/cni/bin
bandwidth bridge dhcp firewall flannel host-device host-local ipvlan loopback macvlan portmap ptp sbr static tuning vlan
So I'm guessing that portmap here corresponds to the code I linked to above in the cni plugins repo.
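For reference, here is a minimal sketch of how a runtime invokes a chained CNI plugin like portmap: the plugin binary is exec'd with CNI_* environment variables and the network config on stdin. The container ID, netns path, and config below are made-up placeholders (and kubelet's own CNI code does considerably more than this), so treat it as an illustration of the delegation, not our actual configuration:

```go
// Minimal sketch of invoking a CNI plugin binary per the CNI spec: exec the
// plugin with CNI_* environment variables and the network config on stdin.
// All values below are made-up placeholders, not taken from our cluster.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Placeholder network config; a real chained config would carry more
	// fields (prevResult, runtimeConfig with portMappings, etc.).
	conf := []byte(`{
	  "cniVersion": "0.4.0",
	  "name": "example-net",
	  "type": "portmap"
	}`)

	cmd := exec.Command("/opt/cni/bin/portmap")
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=DEL", // teardown -- the operation failing in our kubelet logs
		"CNI_CONTAINERID=0123456789abcdef",
		"CNI_NETNS=/var/run/netns/example",
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = bytes.NewReader(conf)

	out, err := cmd.CombinedOutput()
	fmt.Printf("plugin output: %s\nexec error: %v\n", out, err)
}
```

Run inside the hyperkube container, an invocation like this would presumably be using whatever iptables the image ships, which is why the version bump above caught my eye.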
Anything else we need to know?:
I considered the possibility of not filing this bug and instead chiming in on #92242 or #92250, but I thought it would be better to post a new issue rather than risk the information I've gathered being lost in the noise of comments.
Environment:
- Kubernetes version (use kubectl version): 1.16.11, 1.17.7, and 1.18.4
- Cloud provider or hardware configuration: DigitalOcean
- OS (e.g. cat /etc/os-release):
zsh/4 10051 % docker run --rm -it --entrypoint "" gcr.io/google-containers/hyperkube:v1.18.3 cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
zsh/4 10054 [2] % docker run --rm -it --entrypoint "" gcr.io/google-containers/hyperkube:v1.18.4 cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
- Kernel (e.g. uname -a):
Linux test-1-18-4-06729635-3f4er 4.19.0-0.bpo.6-amd64 #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12) x86_64 GNU/Linux
- Install tools: hyperkube
- Network plugin and version (if this is a network-related bug): happens on both v0.7.4 and v0.8.6 of https://github.com/containernetworking/plugins (we volume mount a directory containing the cni plugin binaries from our worker node host into the kubelet hyperkube containers, so it appears that we can end up using a different CNI version than upstream -- I don't know why we do this; maybe in the past hyperkube didn't include the binaries, or maybe we needed a specific version of CNI temporarily and just never went back to using the bundled binaries. Regardless, because I've tried both versions I assume this isn't the root cause of the problem I'm seeing)
(we volume mount a directory containing the cni plugin binaries from our worker node host into the kubelet hyperkube containers so it appears that we can end up using a different CNI version than upstream -- I don't know why we do this, maybe in the past hyperkube didn't include the binaries or maybe we needed a specific version of CNI temporarily and just never went back to using the bundled binaries; regardless, because I've tried both versions I assume this isn't the root cause of the problem I'm seeing)