Question: Does the pod need to talk to the master to restart? #23556
Comments
/cc @brendandburns @kubernetes/goog-node Can you clarify a bit: what was restarting, a machine, a pod, or both? "one of the machines in our cluster stopped trying to restart" vs. "the pod did not try to restart again after that"
To clarify the original question: our API job running in the container did not restart after going down, even though the ping handler the kubelet uses to check liveness had stopped responding. This happened while there was a network outage. To give more context on where we are in our investigation: just a few hours ago we discovered that this is an issue with the container not being able to connect to the DNS server. When we changed the resolv.conf contents in the container to match the one on the VM outside the container, we were able to connect to the outside world.
@r-tock kubelet doesn't need to rely on other components to continue managing the containers. However, if kubelet got restarted during the network outage, the new kubelet instance wouldn't know about any pod specs until it talks to the API server. This is due to the lack of persistent checkpointing on the node (the issue to add checkpointing: #489). Could you check the kubelet log to see if it got restarted during that time?
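If it helps, here is one way to check whether the kubelet itself restarted during the outage window. This is only a sketch: it assumes an SSH-able GCE node that logs to /var/log/kubelet.log, and the node name, log path, and time window are placeholders.

```sh
# SSH to the node that was hosting the pod (node name is a placeholder).
gcloud compute ssh <node-name>

# Each fresh start of the kubelet writes a startup banner; count the hits that
# fall inside the outage window (exact wording varies by version).
grep -i "starting kubelet" /var/log/kubelet.log

# On systemd-managed nodes, the journal records unit (re)starts instead.
sudo journalctl -u kubelet --since "<outage start time>" | grep -i "starting kubelet"
```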
The kubelet healthchecking a pod should not rely on network connectivity between the kubelet and the apiserver. As @yujuhong said, once the kubelet has the pod definition, it will run the pod and healthcheck it even if the apiserver goes down and the kubelet should certainly restart a failed container without needing to contact the apiserver (unless the kubelet restarts).
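For reference, the "ping handler" style of check corresponds to an HTTP liveness probe, which the kubelet executes and acts on locally using the pod's restartPolicy, with no apiserver round-trip. A minimal sketch; the pod name, image, port, and /ping path are placeholders, not the reporter's actual config:

```sh
cat <<'EOF' > api-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server                  # hypothetical pod name
spec:
  restartPolicy: Always             # kubelet restarts failed containers locally
  containers:
  - name: api
    image: example.com/api:latest   # placeholder image
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /ping                 # the "ping handler" described above
        port: 8080
      initialDelaySeconds: 30
      timeoutSeconds: 5
EOF
kubectl create -f api-pod.yaml
```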
Are you saying that a misconfigured resolv.conf made you lose all network connectivity (or was it just name resolution)? Could you get to the outside world by IP address?
Yes, we can get to the outside world by IP. However, since we use external hostnames, we still need name resolution to work.
Ok, I just wanted to verify that networking wasn't entirely broken (just name resolution).
Can you clarify what the "bad" resolv.conf says? @ArtfulCoder P0 to triage.
To clarify: the way kube-dns works is that we configure your container's DNS (via resolv.conf) to point at the cluster DNS service, e.g.:
search default.svc.cluster.local svc.cluster.local cluster.local c.rapid-depot-817.internal. google.internal.
nameserver 10.219.240.10
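As a quick way to separate "DNS is broken" from "networking is broken", a few checks run inside the affected container can help. This is a sketch: it assumes the image ships nslookup and ping, and the pod name is a placeholder.

```sh
# Does the cluster nameserver from resolv.conf answer for cluster names?
kubectl exec <api-pod> -- nslookup kubernetes.default.svc.cluster.local 10.219.240.10

# Does it answer for external names?
kubectl exec <api-pod> -- nslookup www.google.com 10.219.240.10

# Raw connectivity by IP, bypassing DNS entirely (8.8.8.8 is Google's public DNS IP).
kubectl exec <api-pod> -- ping -c 1 8.8.8.8
```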
@r-tock
I think I may have pasted the previous resolv.conf entry from our preprod cluster. For our prod cluster the resolv.conf is correct. Note that you can commandeer our preprod cluster for any experiment. Prod setup below (DNS works here):
This is the equivalent set of info from our preprod cluster. The DNS issue is reproducible there. Preprod setup below (DNS does not work here):
To clarify, where do you have DNS issues?
We are seeing DNS issues inside the docker container when connecting to VMs on Google Cloud and to public domains. We are using Kubernetes in both the preprod and prod clusters.
What do you get when you run that command on the offending cluster?
KubeDNS is not running at all on your preprod setup. If there is an error when kube-dns starts, can you get the logs for that pod? You can find it with kubectl get pod.
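A sketch of the usual sequence; the pod name is a placeholder, and the label selector and container names assume the standard 1.2-era kube-dns addon layout:

```sh
# Is the kube-dns pod scheduled and running at all?
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o wide

# Events usually explain a pod that never started.
kubectl describe pod <kube-dns-pod> --namespace=kube-system

# Logs from the individual containers of the kube-dns pod.
kubectl logs <kube-dns-pod> --namespace=kube-system -c kube2sky
kubectl logs <kube-dns-pod> --namespace=kube-system -c skydns
```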
Here is the kubelet.log
Can you run:
Can you manually scale your cluster to have 1 extra node? |
yeah, it's |
Could you paste the command for that? |
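In case it helps, a hedged sketch of what the resize command looks like on GKE; the cluster name and zone are placeholders, and depending on the gcloud release the flag is --size or --num-nodes:

```sh
# Grow the cluster by one node (from 3 to 4 in this example).
gcloud container clusters resize my-cluster --zone us-central1-b --size 4
```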
I looked into this very issue yesterday for our GKE cluster, and here is what I found (copied from the GKE buganizer entry):
I also tried to remove one of the possibilities for the docker daemon restarting through #23593, but it looks like that doesn't remove all of the possibilities, based on our tests. Another possibility is introduced by docker load: in the 1.2 release, docker load was introduced to load the kube-proxy image on every worker node, which can hang the docker daemon, so there is a script that detects that condition and restarts the docker daemon. My plan is to detect the issue and remedy the system by killing all running daemons. EDIT: @ArtfulCoder has PR #23662 to remedy the system here.
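For context, the detection described above boils down to a small health-check loop around the docker daemon. A rough sketch of the idea only, not the actual script shipped on the nodes:

```sh
#!/bin/bash
# Periodically probe the docker daemon; if it stops answering, restart it.
while true; do
  if ! timeout 10 docker ps > /dev/null 2>&1; then
    echo "$(date): docker daemon unresponsive, restarting" >> /var/log/docker-health.log
    # 1.2-era GCE images managed docker with init scripts; systemd-based nodes
    # would use `systemctl restart docker` instead.
    sudo service docker restart
  fi
  sleep 30
done
```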
Saw a restart this morning, about 30 minutes ago. This looks like a memory issue.
@r-tock The issue you reported at #23556 (comment) is a totally different issue. It looks like the node is heavily memory-overcommitted here, which triggers a system OOM by the kernel. Are your workloads asking for resource requests that are too small, while using much more memory than they initially requested?
This has not happened before. We have not specified a resource requirement, and the JVM max heap is set to 2G out of the 3.6G on that machine. It is very unlikely the JVM is causing this system to become overcommitted, as we have been running with this config for a while. In this case the kubelet restarted, which makes me wonder whether one of the new addons is causing this OOM.
@dchen1107 Yes, agreed the issue is different from the earlier ones, but everything points to the recent upgrade as the general root cause. I understand the usefulness of the cluster addons. However, pushing a release without a heads-up to production clients has left us in a state where we are unable to run our service reliably. Yes, we can peel the onion and figure out the aftermath of these issues one by one, and I am happy to help diagnose them, but wouldn't it be more prudent to roll back the release, allow us to get back into a stable state, and then roll forward with the right documentation about the breaking changes? Please do clarify if this is a wrong assessment; from where I sit, this is a new release gone bad.
Unfortunately, our partners don't care about Kubernetes or containers; they are trying to run a restaurant and make sure their guests have a great experience. We have to explain to them in plain terms why their guests are having a bad experience booking on our system. Some of these restaurants cannot afford a loss of conversions due to micro-outages. It is a big thing for them; for many, it eats directly into their bottom line. So quite simply:
In either case, what would be the ETA?
@r-tock We are cutting a 1.2.1 release this afternoon. We will begin rolling that out broadly over the next several days.
@cjcullen Thanks, I am talking with @roberthbailey offline to see what we can do to get back into a stable state immediately.
@r-tock All PRs required to fix your first issue are merged, and we are cutting the 1.2.1 release shortly today. On your second issue, I looked at your comment at #23556 (comment). The only newly introduced cluster-addons pod is kube-proxy. Kube-proxy itself is not new; the binary used to run as a host daemon, like the kubelet and docker do. In the 1.2 release, we build kube-proxy as a docker image and run it as a static pod. It looks like that small overhead is introduced by the kube-proxy base image; the related code is at: https://github.com/kubernetes/kubernetes/blob/master/build/common.sh#L106
Thank you, we will help test 1.2.1 once it's ready for testing.
If this is an OOM for the reason stated in #23556 (comment), simply increasing the size of the node should fix it. The increase in memory used by a debian base image vs busybox should be close to nothing. Perhaps running kube-proxy in a container (vs raw on the node) has increased the overall footprint by some amount that puts you over the edge; it's hard to say for sure without knowing your exact free amount, but it will not grow unbounded.
Setting resource limits/requests will give you Guaranteed QoS (https://github.com/kubernetes/kubernetes/blob/release-1.2/docs/proposals/resource-qos.md#qos-classes), which should give it a better OOM score (@dchen1107, correct me if I'm wrong). Also, from your kern.log, you have about 80M free:
This includes physical memory + swap. You have no swap (total swap = 0Kb).
And now the interesting bit: your java and kube-proxy processes are at the same OOM score, but you have a few java processes using more memory:
And the last bit shows the oom'd process memory stats:
I'm no OOM analysis expert, just trying to be helpful. I can't explain why this would suddenly start happening; perhaps it's because of some of the other conditions on the node, but it looks like java had the most intense memory profile in the posted kernel log.
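For anyone retracing this analysis, the relevant pieces can be pulled out of the kernel log roughly like this. A sketch only, since the log path and field layout vary by distro and kernel version:

```sh
# Every OOM-killer invocation, plus a few lines of the surrounding report.
grep -iA 10 "invoked oom-killer" /var/log/kern.log

# The kernel's final choice and the per-process score table it considered.
grep -iE "killed process|oom_score_adj" /var/log/kern.log

# Current OOM score of a live process (higher means more likely to be killed),
# here the first java process found.
cat /proc/$(pgrep java | head -n 1)/oom_score
```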
Good insight @bprashanth, appreciate it. It looks like the JVMs are taking a total of 4G of virtual memory, which is suspicious, but the resident size is only about ~780MB. I am trying to see how that matches up.
Setting memory limits is what is necessary to be in the Guaranteed class.
The more memory a container uses beyond what it originally requested, the more likely it is to be evicted in case of system memory pressure.
The kernel does not use RSS to identify a victim in case of system memory pressure. It uses the working set, which is usually greater than RSS.
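Concretely, being in the Guaranteed class means every container in the pod sets limits, with requests equal to them. A minimal sketch for the API container discussed here; the name, image, and resource values are illustrative placeholders, not recommendations:

```sh
cat <<'EOF' > api-guaranteed.yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server                  # hypothetical pod name
spec:
  containers:
  - name: api
    image: example.com/api:latest   # placeholder image
    resources:
      requests:
        cpu: "1"
        memory: 2560Mi              # headroom above the 2G JVM heap
      limits:
        cpu: "1"                    # limits == requests -> Guaranteed QoS
        memory: 2560Mi
EOF
kubectl create -f api-guaranteed.yaml
```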
@roberthbailey Looking at the RAM usage, I have to say (thanks to @bprashanth's tip) that we have not been doing a good job of accounting. We have 3 JVMs running: one for the API server, which is at 2G; one for the JMX exporter, which runs at the default JVM heap size, which I believe is 1G; and the log-stash exporter, which runs with a max heap of 1G. Overall we have oversubscribed the node, though both log-stash and JMX were also running on the old version of Kubernetes. Why this remained stable on the previous version of Kubernetes eludes me. Before switching to 1.2.1 today, I will try to move to a node with more RAM. Side note:
We didn't add the
will create a 3 node cluster with 4 cpu / 5 GB memory nodes. |
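The exact command did not survive in this thread. As a hedged guess at its shape, GCE custom machine types are written custom-<vCPUs>-<memory in MB>, so it would look something like this (cluster name and zone are placeholders):

```sh
gcloud container clusters create my-cluster \
  --zone us-central1-b \
  --num-nodes 3 \
  --machine-type custom-4-5120    # 4 vCPUs, 5 GB of memory per node
```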
Thanks, that is a surprising API choice but it works nevertheless. The new cluster is up with 5 GB of RAM and is ready for the upgrade to 1.2.1.
@r-tock -- can we close this issue or is there further debugging that you'd like to do? |
Feel free to close this. I would still like to follow the investigation if you continue digging into what went wrong with docker-manager. Maybe a separate bug to track it?
@dchen1107 do we have a bug tracking the docker issues that this exposed? |
We had a micro-outage the other day when one of the machines in our cluster stopped trying to restart. On startup, it hit a transient error that prevented the server from starting up, but the pod did not try to restart again after that. The logs showed nothing between when it got stuck starting up and when I initiated a manual restart 20 minutes later. It seems we hit this issue during a network partition, though we are still investigating.
Does the pod need to talk to the master in order for a restart to be triggered?
Attached is our configuration file, including the restart policy: apiserverProdConfig.txt
Thanks!
We're using Kubernetes 1.2.0