apiserver evicted under imagefs pressure and controller node never recovers #1300
@mumoshu FYI, this appears to be the cause of the issue we chatted about: my dev cluster sporadically becoming unhealthy and never recovering.
Also seems relevant: kubernetes/kubernetes#57020, which is in k8s 1.10. Plus this comment: kubernetes/kubernetes#47526 (comment).
We're being hit by this issue too. We've suffered it on v1.9.2 and now on v1.10.2.
Container 5cb7d5e102ea is etcd... why is the eviction manager trying to kill etcd?
After looking a bit more into the issue, it seems that eviction_manager is trying to kill a running container (etcd) that is not managed by the control plane. In our case, what triggered the issue was hitting the low threshold for imagefs utilisation. The thing is, eviction_manager should never try to remove containers that were never managed by K8s; it seems to assume that every container belongs (or has belonged) to K8s.
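For reference, the kubelet settings that drive this behaviour look roughly like the sketch below. The hard thresholds are the upstream kubelet defaults as far as I know; the soft-eviction values are illustrative and not necessarily what kube-aws ships.

```bash
# Kubelet eviction flags involved in imagefs pressure. When imagefs.available
# drops below a threshold, the kubelet first deletes unused images and then
# starts evicting pods. Values below are upstream defaults / illustrative.
kubelet \
  --eviction-hard='memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%,imagefs.available<15%' \
  --eviction-soft='imagefs.available<20%' \
  --eviction-soft-grace-period='imagefs.available=2m' \
  --eviction-minimum-reclaim='imagefs.available=2Gi'
```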
@gscalise Hi! I'm a bit confused. kube-aws doesn't place etcd docker containers on controller nodes; it places etcd rkt containers on dedicated etcd nodes. Are you sure the container is etcd?
@c-knowles Thanks for the info. Just read all the relevant issues and it seems like you finally found the source of the problem! Can you try K8S v1.10.x to see if that fixes the problem? I'm also unsure how we could intentionally reproduce this in K8S 1.9.x. Can we just write a bunch of random files to make the disk almost full, so that eviction-manager removes the image of the running container?
@mumoshu yeah, write anything that it won't evict, for example a tonne of data to a path the kubelet can't reclaim (see the sketch below). If we want more resilience for this specific case of host logs, GKE and kops appear to have logrotate enabled by default on most of /var/log. Maybe we want to enable that by default as well, although it might interfere with audit logs.
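Something along these lines should trigger it; the path and size are arbitrary choices for illustration, just enough to push the filesystem past the eviction threshold.

```bash
# Fill the filesystem backing imagefs/nodefs with data the kubelet cannot
# reclaim by deleting images or dead containers (path and size are arbitrary).
sudo fallocate -l 20G /var/log/filler.log

# Then watch the kubelet report disk pressure and start evicting pods:
journalctl -u kubelet -f | grep -iE 'evict|pressure'
```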
Ah, sounds like a good idea. I had blindly decided that what we have today - docker container log rotation enabled by default - was enough to make controller nodes resilient to log spam. Out of curiosity, what logs actually filled up your disk so much?
It was the logs of fluentd itself, since a typical config means it doesn't monitor its own logs (see the upstream addon config for example). In my case some unrelated problem meant fluentd was logging more than normal.
@c-knowles Thanks for the response. Interesting. I was thinking that the fluentd docker container managed by a daemonset would write its own log to stdout, and dockerd should rotate the log every 50MB as configured by kube-aws. But it seems that's not always the case given your explanation.
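For context, the rotation I had in mind works at the docker daemon level, roughly as below. The 50MB figure is the one mentioned above; max-file=5 is illustrative, and this is not necessarily the exact way kube-aws wires it up. It also only covers stdout/stderr captured by docker, not files a pod writes straight to a hostPath like /var/log.

```bash
# Per-container log rotation at the daemon level: each container's json-file
# log is capped at 50MB and rotated, keeping at most 5 files (illustrative).
dockerd \
  --log-driver=json-file \
  --log-opt max-size=50m \
  --log-opt max-file=5
```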
@mumoshu I agree in general, it just wasn't always the upstream config - ref the base upstream config plus a daemonset change from quite a while ago. It's less of an issue given the upstream changes, so this just shows it could happen; any other directory written to from a pod could also bloat. How about for this task we focus on node health? Regardless of upstream updates, the apiserver could stop running under other conditions. How do we ensure the node is marked unhealthy in AWS or k8s? One possible shape for that is sketched below.
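This is just a sketch of one option, not an existing kube-aws feature: a watchdog on the controller that checks the local apiserver and, if it stays down, reports the instance as unhealthy to its AutoScaling group so it gets replaced. The port, the retry budget and the ASG approach itself are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical apiserver watchdog: after several consecutive failed health
# checks, mark this EC2 instance Unhealthy so the ASG replaces it.
set -euo pipefail

instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Derive the region by stripping the trailing zone letter from the AZ.
region=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/[a-z]$//')

for _ in $(seq 1 5); do
  # Assumes the apiserver serves /healthz on 443 locally.
  if curl -sk --max-time 5 https://127.0.0.1:443/healthz | grep -q ok; then
    exit 0   # apiserver answered; nothing to do
  fi
  sleep 60
done

# Five failures in a row: let the AutoScaling group terminate and replace us.
aws autoscaling set-instance-health \
  --region "$region" \
  --instance-id "$instance_id" \
  --health-status Unhealthy
```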
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Using kube-aws 5780b59 and k8s 1.9.7, I have experienced that imagefs pressure on a controller node can cause the apiserver to be evicted. That by itself might not be a problem, but the node never recovers, and because it is not marked as unhealthy it is never replaced. Some logs from kubelet below.
The eviction might need to be reported upstream. I found kubernetes/kubernetes#44951 but that was fixed, perhaps it's a similar error.
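On the node itself, the stuck state looks roughly like this; the paths assume the usual static-pod layout and are stated here as an assumption rather than verified against kube-aws.

```bash
df -h /var/lib/docker                 # imagefs is still short on space
ls /etc/kubernetes/manifests/         # the apiserver static-pod manifest is still present
docker ps | grep kube-apiserver       # ...but no apiserver container is running
journalctl -u kubelet --since "-1h" | grep -iE 'evict|pressure'
```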