API status for pods is flaky #4188
I blame the pod cache.
Has @brendandburns recently changed the pod cache? FWIW, I also observed this behavior: 19 pods Running in one call and all Unknown in the next, with 0.10.0.
Yes, I think @brendandburns's PR on the pod cache, #4169, should fix the problem reported here. It should also fix the issue reported by @satnam6502 about add-on services. I have a pending PR #4194 to have the kubelet decide on PodStatus too.
Could this have caused #3950?
@yujuhong I don't see how. I'll comment on that issue.
@yujuhong, I don't think this could be the cause of that one.
I don't think @brendandburns's PR on the pod cache, #4169, will fix this. That one clears the node status. This issue is about pods going from Running to Unknown even when the nodes keep running just fine.
Totally consistent observation based on my personal experience.
Agree with @alex-mohr, #4169 clears node status only. One possibility is that we hit a transient network problem in between the calls that fails the health check, so the controller sets the node condition to not reachable, hence Unknown pods. I'll make this more robust anyway. Other than that, it's hard to think of a reason why node status would result in this flakiness.
In the absence of node information, the pod status defaults to Unknown. Maybe there's another source of this specific problem, but the pod cache is a source of continuous problems, partly because it's not just a cache, but is implementing or duplicating a significant chunk of system logic in a half-baked manner. My goal is to eliminate it -- see #2726. Short of that, we need to at least implement #1366, eliminate nodes from the pod cache, and implement a real node cache.
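To make that failure mode concrete, here is a minimal sketch of the fallback; names like podCache, computePodStatus, and reachableNodes are illustrative assumptions, not the actual pod_cache.go code:

```go
package podcache

// Minimal sketch of the fallback described above; the type and field names
// here are illustrative assumptions, not the actual pod_cache.go code.

type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodUnknown PodPhase = "Unknown"
)

type PodStatus struct {
	Phase PodPhase
	Host  string
}

type podCache struct {
	// reachableNodes stands in for whatever node information the cache holds.
	reachableNodes map[string]bool
}

func (p *podCache) computePodStatus(current PodStatus) PodStatus {
	newStatus := current
	if !p.reachableNodes[current.Host] {
		// No node information: the pod is reported as Unknown, even though its
		// containers may still be running happily on the node.
		newStatus.Phase = PodUnknown
	}
	return newStatus
}
```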
I can look into this. Repro info from Tim, who said he was running on a one-node cluster but has seen the problem with multi-node clusters also:
./cluster/kubectl.sh run-container hostnames --image=kubernetes/serve_hostname --replicas=1 --port=9376 --labels=app=hostnames
but I had to hack the node a bit to make it get past 100
I can verify that the situation is rather bad. Lots of fluctuation in the numbers of Pending, Running, and Unknown. serve_hostname containers on the node are exiting; I'm not sure why, or if the pod itself is terminating or just the container. Lots of the following in the log (not sure what this means yet):
Sat, 07 Feb 2015 00:06:22 +0000 hostnames-ecd2r Pod failedScheduling {scheduler } Binding rejected: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
The pause container is also exiting, so I'm guessing that means the whole pod is exiting, not just serve_hostname...
Ugh, there is an etcd size limit you might be hitting too, I forgot. If ...
Just a few more observations: This is reproducible (and a little less chaotic) with one node and 100 pods (but not with 50 pods, so it's definitely load-dependent). You can get to 100 Running pods (0 Pending or Unknown), and then a minute later a dozen or so will become Unknown, and then later go back to Running, etc. As mentioned earlier, the containers continue running on the node all this time (sudo docker ps|grep serve_hostname | wc -l), so it is definitely some kind of reporting problem. As an aside, it looks like there is some kind of race condition with IP address allocation, as during startup (but not at steady state) a number of pods exit with the following in their log (sudo docker logs xxx): |
Almost certainly the restart is just kubelet retrying after some time.
I found that if I comment out one particular line in pod_cache.go, the Pods stop oscillating between Running and Unknown. Commenting out that line changes the logic from "if we get an error from the Kubelet, change the pod status to Unknown" to "if we get an error from the Kubelet, keep the pod status unchanged."
I saw intermittent errors of this form in the master logs, which were causing the pods to flip back to Unknown:
E0209 04:19:17.155149 4996 pod_cache.go:270] Error getting info for pod default/hostnames-v5drx: kubelet "kubernetes-minion-1.c.striped-torus-760.internal" server responded with HTTP error code 500 for pod default/hostnames-v5drx
This seems to correspond to lines like this in the kubelet logs:
I0209 06:38:11.104796 4704 server.go:351] GET /stats/default/hostnames-gfw5j/2bb979c5-b026-11e4-a070-42010af0f256/hostnames?num_stats=33: (218.857218ms) 500
However, pods still flip between Running and Pending, even though no containers are exiting on the minion, so there's another problem here too.
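For clarity, the behavioral change being tested amounts to something like this; the helper below is an illustration, not the actual line that was commented out:

```go
package podcache

// Sketch of the two error-handling policies being compared above.

type PodPhase string

const PodUnknown PodPhase = "Unknown"

// statusAfterPoll decides what status the cache should hold after one poll of
// the kubelet. keepOnError toggles between the original behavior and the
// experiment described above.
func statusAfterPoll(cached, fresh PodPhase, kubeletErr error, keepOnError bool) PodPhase {
	if kubeletErr != nil {
		if keepOnError {
			// Experiment: a kubelet error leaves the last known status alone.
			return cached
		}
		// Original behavior: any kubelet error flips the pod to Unknown.
		return PodUnknown
	}
	return fresh
}
```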
Thanks for investigating this! Pods really shouldn't flip between ...
--brendan
I also investigated the same issue last Friday. I observed exactly the same thing as what @davidopp described above, except that the related kubelet log should be:
Meanwhile, I observed that the master is trying to query pod / and failing all the time:
Talked to @davidopp offline; he is going to look into why running pods go back to Pending. Meanwhile, I am going to figure out the Unknown cases here.
That would be great. It is very hard for me to make sense of any of the scalability experiments that I am doing while the pod status is in hysteresis.
Am I wrong, or are these instructions also within bounds of an e2e test? Just because our existing e2es only do one or two things doesn't mean we can't do more within O(minutes), assuming you're dealing with the same image.
The "flips back to pending" problem appears to be because updatePodStatus() writes to the cache but not back to the Pod's status. computePodStatus() initializes newStatus based on pod.Status, not based on the cached value of status. This second behavior seems wrong in light of the first behavior. However, I believe this only causes a problem in combination with my fix described earlier, which as a result doesn't actually do what I described (it doesn't "keep the status as-is" but rather copies it out of the pod's status; to "keep the status as-is" it should be copying it out of the cache). I can see if this fixes things. (We also should fix the root cause stuff on the node side, which Dawn is working on.) |
Assign the issue to me ... I don't see the flips between Pending and Running anymore; I don't know why. I consistently see flips between Unknown and Running on the heavily loaded nodes. The root issue is that once in a while the docker inspect API call returns EAGAIN. Please note that on my test nodes I have already increased the NOFILE limit for the docker daemon process.
The proposed solution is that the kubelet deals with all errors internally except the EAGAIN one. In that case, the computePodStatus logic in pod_cache does nothing when p.containerInfo.GetPodStatus returns an error. Working on a PR to fix this.
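Roughly, the proposal would look like the sketch below; only computePodStatus and GetPodStatus are names from the discussion, the rest of the wiring is assumed for illustration:

```go
package podcache

// Sketch of the proposed error handling: the kubelet resolves every error on
// its own except a transient EAGAIN from docker inspect, so an error that does
// reach the master is treated as transient and the cached status is left alone.

type PodStatus struct{ Phase string }

type containerInfoGetter interface {
	GetPodStatus(host, podID string) (PodStatus, error)
}

type podCache struct {
	containerInfo containerInfoGetter
	statuses      map[string]PodStatus
}

func newPodCache(ci containerInfoGetter) *podCache {
	return &podCache{containerInfo: ci, statuses: make(map[string]PodStatus)}
}

func (p *podCache) computePodStatus(host, podID string) PodStatus {
	status, err := p.containerInfo.GetPodStatus(host, podID)
	if err != nil {
		// Do nothing on error: keep whatever status is already cached instead
		// of flipping the pod to Unknown on a transient docker inspect failure.
		return p.statuses[podID]
	}
	p.statuses[podID] = status
	return status
}
```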
Dawn, my suggestion is that computePodStatus() should return the current cached value whenever it gets an error from Kubelet. I can implement this if you want.
@davidopp, why shouldn't updatePodStatus simply write through to the pod's status? I'd think either we always read the cached value first, or we always write through to the pod and read from there. Only reading the cached value upon an error seems counter-intuitive to me...
@davidopp, why would computePodStatus return a cached value here? The only place that calls computePodStatus is updatePodStatus, which is triggered by two operations: 1) the periodic update of all containers, and 2) a client asking for PodStatus when there is no cached one, in which case computePodStatus is called to retrieve one. I think a cleaner way is: if there is an error from kubelet, it should be transient; and if there is a cached one, simply return cached one upon request; if no cached one, it should be generated by computePodStatus in next round.
I have one working PR to avoid the flips between Running and Unknown, but with an increasing number of Pending pods on the node side caused by docker failures. Still investigating it...
@dchen1107 I didn't understand "if there is an error from kubelet, it should be transient; and if there is a cached one, simply return cached one upon request; if no cached one, it should be generated by computePodStatus in next round." If you're talking about the caching in the podCache on the master, then what you wrote is exactly what I was suggesting. (If you're talking about some kind of caching on the Kubelet, then that's different, because I am not aware of a cache for status on the Kubelet.) @yujuhong I think always writing back to the pod's status might be fine, but I don't know why the podCache was designed the way it was, so I didn't want to change that without asking Daniel first (I think he's the one who created it). I wasn't suggesting to only read the cached value upon an error; I meant always read the cached value under all circumstances.
@dchen1107 and @davidopp, here is my understanding. Correct me if I am wrong. UpdateAllContainers periodically updates all containers, but it gets the initial pods from etcd through this line ...
If you repeat the same steps again, you will start with the stale pod status in etcd, because ...
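If I'm reading that correctly, the loop would look very roughly like this; heavily simplified, and all names and the etcd access here are assumptions based on the description above:

```go
package podcache

// Very rough sketch of the staleness loop described above, assuming that the
// periodic update starts from pods (and their old status) as stored in etcd,
// while the freshly computed status only lands in the in-memory cache.

type PodStatus struct{ Phase string }

type Pod struct {
	Name   string
	Status PodStatus
}

type podCache struct {
	listPodsFromEtcd func() []Pod // stand-in for the real etcd/registry read
	statuses         map[string]PodStatus
}

func (c *podCache) UpdateAllContainers() {
	// Each round starts from the pods as stored in etcd; their Status field is
	// never written back by this loop, so the starting point can be stale.
	for _, pod := range c.listPodsFromEtcd() {
		newStatus := pod.Status // normally recomputed by asking the kubelet
		c.statuses[pod.Name] = newStatus
		// Nothing is written back to etcd here.
	}
}
```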
I have been testing #4376 since last night. The flaps between Running, Pending, and Unknown have been gone for about 10 hours while running 300 pods on 4 nodes, except for two scenarios:
I0212 20:18:30.119078 31254 pod_cache.go:175] node doesn't exist: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0], setting pod hostnames-yuy8d status to unknown
This has happened a couple of times in the last 10 hours. We need to figure out why.
@dchen1107 up to you whether to reopen this issue or file new ones for new issues. |
@dchen1107 I've also seen the "All the given peers are not reachable" message (see my second entry in this issue). I think that was what @thockin was referring to when he said "there is an etcd size limit you might be hitting too", but I may be misinterpreting.
That error is the etcd client's catch-all for "something went wrong" - it's ...
https://github.com/coreos/go-etcd/issues/185
@davidopp and @thockin Yes, that error message is a catch-all error message, which needs investigation. In my particular case, it shouldn't be etcd size limits at all, since that cluster was built and upgraded more than 10 hours before, and all 300 pods were scheduled back then. There have been no new events generated either from the kubelet or from other components, since all bound pods are indeed happily running on the nodes. There are no new writes to etcd except NodeStatus, which is much less frequent and not much data. The etcd problem here occurs when the master tries to read from etcd and gets an error.
I have a cluster with ~130 pods. Running back-to-back 'get pods' shows me wildly different understandings of their state, even though I can tell from docker that they have not crashed or anything.
thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
127 Running
1 STATUS
6 Unknown
thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
82 Running
1 STATUS
51 Unknown
Nothing changed in between these calls.