API status for pods is flaky #4188

Closed
thockin opened this issue Feb 5, 2015 · 40 comments · Fixed by #4376
Labels: kind/bug, priority/important-soon, sig/node

Comments

@thockin
Member

thockin commented Feb 5, 2015

I have a cluster with ~130 pods. Running back-to-back 'get pods' shows me wildly different understandings of their state, even though I can tell from docker that they have not crashed or anything.

thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
127 Running
1 STATUS
6 Unknown

thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
82 Running
1 STATUS
51 Unknown

Nothing changed in between these calls.

@bgrant0607
Member

cc @davidopp @dchen1107

I blame the pod cache.

@bgrant0607 added kind/bug, priority/important-soon, team/master, sig/node labels Feb 5, 2015
@alex-mohr
Contributor

@brendandburns has recently changed the pod cache?

FWIW, I also observed this behavior: 19 pods Running in one call and all Unknown in the next, with 0.10.0.

@dchen1107
Member

Yes, I think @brendandburns's PR on the pod cache, #4169, should fix the problem reported here. It should also fix the issue reported by @satnam6502 about add-on services. I have a pending PR #4194 to have the kubelet decide on PodStatus too.

@yujuhong
Contributor

yujuhong commented Feb 6, 2015

Could this have caused #3950?

@bgrant0607
Member

@yujuhong I don't see how. I'll comment on that issue.

@dchen1107
Member

@yujuhong, I don't think this could be the cause of that one.

@alex-mohr
Contributor

I don't think @brendandburns's PR on the pod cache, #4169, will fix this. That one clears the node status. This issue is about pods going from Running to Unknown even when the nodes keep running just fine.

@satnam6502
Contributor

Totally consistent observation based on my personal experience. I just assumed we were shooting for some kind of Heisenberg Uncertainty Principle.

S


@brendandburns added kind/bug and removed kind/bug labels Feb 6, 2015
@goltermann modified the milestone: P1 Issue Fix-it Feb 6, 2015
@ddysher
Contributor

ddysher commented Feb 6, 2015

Agree with @alex-mohr, #4169 clears node status only. One possibility is that we hit a transient network problem between the calls that fails the health check, so the controller sets the node condition to not reachable, hence the Unknown pods. I'll make this more robust anyway. Other than that, it's hard to think of a reason that node status would cause the flakiness.

@bgrant0607
Member

In the absence of node information, the pod status defaults to Unknown.

Maybe there's another source of this specific problem, but the pod cache is a source of continuous problems, partly because it's not just a cache, but is implementing or duplicating a significant chunk of system logic in a half-baked manner. My goal is to eliminate it -- see #2726. Short of that, we need to at least implement #1366, eliminate nodes from the pod cache, and implement a real node cache.

@davidopp self-assigned this Feb 6, 2015
@davidopp
Member

davidopp commented Feb 6, 2015

I can look into this. Repro info from Tim, who said he was running on a one-node cluster but has seen the problem with multi-node clusters also:

./cluster/kubectl.sh run-container hostnames --image=kubernetes/serve_hostname --replicas=1 --port=9376 --labels=app=hostnames
./cluster/kubectl.sh expose hostnames --port=80 --container-port=9376 --selector="app=hostnames"
./cluster/kubectl.sh resize --replicas=150 rc hostnames

but I had to hack the node a bit to make it get past 100 pods. On the node:
sysctl -w fs.inotify.max_user_instances=8192
echo "DOCKER_NOFILE=1000000" >> /etc/default/docker; /etc/init.d/docker restart (fails sometimes, dunno why)

@davidopp
Member

davidopp commented Feb 7, 2015

I can verify that the situation is rather bad. Lots of fluctuation in the numbers of Pending, Running, and Unknown. serve_hostname containers on the node are exiting; I'm not sure why, or if the pod itself is terminating or just the container. Lots of the following in the log (not sure what this means yet):

Sat, 07 Feb 2015 00:06:22 +0000 hostnames-ecd2r Pod failedScheduling {scheduler } Binding rejected: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@davidopp
Member

davidopp commented Feb 7, 2015

The pause container is also exiting, so I'm guessing that means the whole pod is exiting, not just serve_hostname...

@thockin
Member Author

thockin commented Feb 7, 2015

Ugh, there is an etcd size limit you might be hitting too, I forgot. If you are seeing the fluctuations, maybe you got far enough, though.

@davidopp
Member

davidopp commented Feb 8, 2015

Just a few more observations:

This is reproducible (and a little less chaotic) with one node and 100 pods (but not with 50 pods, so it's definitely load-dependent). You can get to 100 Running pods (0 Pending or Unknown), and then a minute later a dozen or so will become Unknown, and then later go back to Running, etc. As mentioned earlier, the containers continue running on the node all this time (sudo docker ps|grep serve_hostname | wc -l), so it is definitely some kind of reporting problem.

As an aside, it looks like there is some kind of race condition with IP address allocation, as during startup (but not at steady state) a number of pods exit with the following in their log (sudo docker logs xxx):
listen tcp :9376: bind: address already in use
They do restart OK (I'm not sure if this is being restarted locally on the node or making a round-trip back to the scheduler).
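
For reference, that message is the generic Go net package error for a second bind on an already-occupied port, whatever the underlying cause is here. A minimal, self-contained reproduction, unrelated to the Kubernetes code itself (the port number just mirrors the log above):

package main

import (
	"fmt"
	"net"
)

func main() {
	// First listener takes port 9376, as serve_hostname would.
	l1, err := net.Listen("tcp", ":9376")
	if err != nil {
		fmt.Println("first listen failed:", err)
		return
	}
	defer l1.Close()

	// A second bind on the same port fails with the error seen in the pod logs:
	// "listen tcp :9376: bind: address already in use"
	if _, err := net.Listen("tcp", ":9376"); err != nil {
		fmt.Println("second listen failed:", err)
	}
}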

@thockin
Member Author

thockin commented Feb 9, 2015

Almost certainly the restart is just the kubelet retrying after some time.

@davidopp
Member

davidopp commented Feb 9, 2015

I found that if I comment out the line with the arrow next to it (see below) in pod_cache.go, the pods stop oscillating between Running and Unknown. Commenting out this line changes the logic from "if we get an error from the Kubelet, change the pod status to Unknown" to "if we get an error from the Kubelet, keep the pod status unchanged."

result, err := p.containerInfo.GetPodStatus(pod.Status.Host, pod.Namespace, pod.Name)
newStatus.HostIP = nodeStatus.HostIP

if err != nil {
    newStatus.Phase = api.PodUnknown  <<<------------------------
} else {

I saw intermittent errors of this form in the master logs, which were causing the pods to flip back to Unknown:

E0209 04:19:17.155149 4996 pod_cache.go:270] Error getting info for pod default/hostnames-v5drx: kubelet "kubernetes-minion-1.c.striped-torus-760.internal" server responded with HTTP error code 500 for pod default/hostnames-v5drx

This seems to correspond to lines like this in the kubelet logs:

I0209 06:38:11.104796 4704 server.go:351] GET /stats/default/hostnames-gfw5j/2bb979c5-b026-11e4-a070-42010af0f256/hostnames?num_stats=33: (218.857218ms) 500
goroutine 730 [running]:
[ ... stack trace, but they're not always the same; I'll dig into this some more, the default stack trace depth wasn't going back past the various functions that just serve up the error ... ]

However, pods still flip between Running and Pending, even though no containers are exiting on the minion, so there's another problem here too.
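
For illustration, a minimal self-contained sketch of the two behaviors being compared; the names here are made up and this is not the actual pod_cache code:

package main

import (
	"errors"
	"fmt"
)

type PodPhase string

const PodUnknown PodPhase = "Unknown"

// phaseAfterKubeletCall contrasts the two behaviors: with markUnknownOnError
// set (the arrowed line in place), any Kubelet error flips the phase to
// Unknown; with it unset (the line commented out), the seeded phase is kept.
func phaseAfterKubeletCall(seeded PodPhase, kubeletErr error, markUnknownOnError bool) PodPhase {
	if kubeletErr != nil && markUnknownOnError {
		return PodUnknown
	}
	return seeded
}

func main() {
	err := errors.New("kubelet server responded with HTTP error code 500")
	fmt.Println(phaseAfterKubeletCall("Running", err, true))  // Unknown: status flaps on a transient 500
	fmt.Println(phaseAfterKubeletCall("Running", err, false)) // Running: the seeded status survives the error
}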

@brendandburns
Contributor

Thanks for investigating this! Pods really shouldn't flip between Running and Pending, as I believe that the only way that they go Pending is if Pod.Status.Host is empty, and this really shouldn't happen post scheduling, unless we are somehow dropping information. Maybe we need to make sure that we don't wipe out status on the error case?

--brendan


@dchen1107
Member

I also investigated the same issue last Friday. I observed exactly the same thing as @davidopp described above, except that the relevant kubelet log is:

E0209 13:03:04.032711    5035 kubelet.go:1038] Unable to get pod with name "hostnames-zt4ct.default.api" and uid "9e39ef09-ae57-11e4-bdee-42010af0483b" info, health checks may be invalid
I0209 13:30:23.205974    5035 server.go:351] GET /api/v1beta1/podInfo?podID=hostnames-zt4ct&podNamespace=default: (346.387µs) 500

Meanwhile, I observed the master trying to query pod / and failing every time:

E0209 23:00:13.286561    4961 pod_cache.go:255] Error getting info for pod /: kubelet "kubernetes-minion-1.c.golden-system-455.internal" server responded with HTTP error code 400 for pod /

I talked to @davidopp offline; he is going to look into why running pods go back to Pending, and meanwhile I am going to figure out the Unknown cases here.

@satnam6502
Contributor

That would be great. It is very hard for me to make sense of any of the scalability experiments that I am doing while the pod status is in hysteresis.

@dchen1107 self-assigned this and unassigned davidopp Feb 11, 2015
@zmerlynn
Member

Am I wrong, or are these instructions also within the bounds of an e2e test? Just because our existing e2e tests only do one or two things doesn't mean we can't do more within O(minutes), assuming you're dealing with the same image.

@davidopp
Member

The "flips back to Pending" problem appears to be because updatePodStatus() writes to the cache but not back to the pod's status, while computePodStatus() initializes newStatus based on pod.Status, not based on the cached value of the status. This second behavior seems wrong in light of the first. However, I believe this only causes a problem in combination with my fix described earlier, which as a result doesn't actually do what I described: it doesn't "keep the status as-is" but rather copies it out of the pod's status; to "keep the status as-is" it should be copying it out of the cache. I can see if this fixes things.

(We also should fix the root cause stuff on the node side, which Dawn is working on.)

@dchen1107
Member

Assign the issue to me ...

I didn't see the flips between Pending and Running anymore; I don't know why. I consistently see flips between Unknown and Running on heavily loaded nodes. The root issue is that once in a while the docker inspect API call returns EAGAIN. Note that on my test nodes I have already increased the NOFILE limit for the docker daemon process.

I0211 18:44:31.732281   27704 kubelet.go:1559] Query docker container info failed with error: dial unix /var/run/docker.sock: resource temporarily unavailable
I0211 18:44:31.732359   27704 server.go:351] GET /api/v1beta1/podInfo?podID=hostnames-bxfcy&podNamespace=default: (3.832648315s) 500

The proposed solution is that the kubelet deals with all errors internally except the EAGAIN one; in that case, the computePodStatus logic in pod_cache does nothing when p.containerInfo.GetPodStatus returns an error. I'm working on a PR to fix this.
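
A rough sketch of that split, with made-up names rather than the actual PR's code: the kubelet absorbs every error it can and only surfaces the transient EAGAIN ("resource temporarily unavailable") case, and the master-side cache update becomes a no-op when it sees an error:

package main

import (
	"errors"
	"fmt"
	"syscall"
)

type PodPhase string

// kubeletPodStatus handles docker errors locally and only surfaces the
// transient EAGAIN case to the caller.
func kubeletPodStatus(dockerErr error, observed PodPhase) (PodPhase, error) {
	if errors.Is(dockerErr, syscall.EAGAIN) {
		return "", dockerErr // let the master retry later
	}
	// Any other docker error is dealt with on the node and still yields a phase.
	return observed, nil
}

// updateCache leaves the cached status untouched when the kubelet reported
// a transient error, instead of overwriting it with Unknown.
func updateCache(cache map[string]PodPhase, name string, phase PodPhase, err error) {
	if err != nil {
		return
	}
	cache[name] = phase
}

func main() {
	cache := map[string]PodPhase{"hostnames-bxfcy": "Running"}

	phase, err := kubeletPodStatus(syscall.EAGAIN, "Running")
	updateCache(cache, "hostnames-bxfcy", phase, err)

	fmt.Println(cache["hostnames-bxfcy"]) // still Running: no flap to Unknown
}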

@davidopp
Member

Dawn, my suggestion is that computePodStatus() should return the current cached value whenever it gets an error from Kubelet. I can implement this if you want.

@yujuhong
Contributor

@davidopp, why shouldn't updatePodStatus simply write through to the pod's status? I'd think either we always read the cached value first, or we always write through to the pod and read from there. Only reading the cached value upon an error seems counter-intuitive to me...

@dchen1107
Member

@davidopp why would computePodStatus return a cached value here? The only place that calls computePodStatus is updatePodStatus, which is triggered by two operations: 1) periodically updating all containers, and 2) a client asking for PodStatus when there is no cached one, in which case computePodStatus is called to retrieve it. I think a cleaner way is:

  1. The kubelet handles all errors except this transient one. Of course, we also need to make sure docker auto-restarts and that there are no "too many open files" errors on the node side, but those are separate issues and should be handled on the node side through scripts or configuration.

  2. Now back on the master side: if the node status is ok, the kubelet should return something; if there is an error from the kubelet, it should be transient, and if there is a cached status, simply return the cached one upon request; if there is no cached one, it should be generated by computePodStatus in the next round.

I have a working PR that avoids the flips between Running and Unknown, but it shows an increasing number of Pending pods from the node side, caused by docker failures. Still investigating...

@davidopp
Member

@dchen1107 I didn't understand "if there is an error from the kubelet, it should be transient, and if there is a cached status, simply return the cached one upon request; if there is no cached one, it should be generated by computePodStatus in the next round." If you're talking about the caching in the podCache on the master, then what you wrote is exactly what I was suggesting. (If you're talking about some kind of caching on the Kubelet, then that's different, because I am not aware of a cache for status on the Kubelet.)

@yujuhong I think always writing back to the pod's status might be fine, but I don't know why the podCache was designed the way it was, so I didn't want to change that without asking Daniel first (I think he's the one who created it). I wasn't suggesting reading the cached value only upon an error; I meant always reading the cached value, under all circumstances.

@yujuhong
Contributor

@dchen1107 and @davidopp, here is my understanding. Correct me if I am wrong.

UpdateAllContainers periodically updates all containers, but it gets the initial pods from etcd through this line pods, err := p.pods.ListPods(ctx, labels.Everything()).

  1. For each pod from pods, UpdateAllContainers calls updatePodStatus
  2. updatePodStatus passes pod to computePodStatus.
  3. computePodStatus computes and returns a new status. Note that the status is computed based on pod, i.e., the information in etcd.
  4. updatePodStatus updates the pod cache with the new status. Note that the new status is not written to etcd.

If you repeat the same steps, you will start again from the stale pod status in etcd, because the cached value is never consulted.
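
To make those steps concrete, here is a toy end-to-end sketch with made-up names (not the real UpdateAllContainers code), modeling the behavior after the arrowed line was commented out: each round re-reads the stale pod status from "etcd", seeds the computation from it, and writes the result only to the cache, so an error makes the pod appear to flip back to Pending:

package main

import "fmt"

type PodPhase string

// Pod mimics what the master reads back from etcd; its Phase is never
// updated by the status-polling loop below.
type Pod struct {
	Name  string
	Phase PodPhase // stale value from etcd, e.g. "Pending" after binding
}

// computePodStatus seeds its result from the etcd copy of the pod (step 3),
// not from the cache, so the error path falls back to the stale etcd phase.
func computePodStatus(pod Pod, kubeletPhase PodPhase, kubeletErr error) PodPhase {
	if kubeletErr != nil {
		return pod.Phase
	}
	return kubeletPhase
}

// updatePodStatus writes the new status into the cache only (step 4);
// nothing is written back to etcd, so the next round starts stale again.
func updatePodStatus(cache map[string]PodPhase, pod Pod, kubeletPhase PodPhase, kubeletErr error) {
	cache[pod.Name] = computePodStatus(pod, kubeletPhase, kubeletErr)
}

func main() {
	cache := map[string]PodPhase{}
	etcdPod := Pod{Name: "hostnames-gfw5j", Phase: "Pending"}

	// Round 1: the kubelet answers, the cache says Running; etcd still says Pending.
	updatePodStatus(cache, etcdPod, "Running", nil)
	fmt.Println(cache["hostnames-gfw5j"]) // Running

	// Round 2: the kubelet returns a 500; the computation falls back to the
	// stale etcd phase, and the pod appears to flip back to Pending.
	updatePodStatus(cache, etcdPod, "", fmt.Errorf("HTTP 500"))
	fmt.Println(cache["hostnames-gfw5j"]) // Pending
}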

@dchen1107
Member

I have been testing #4376 since last night. The flaps between Running, Pending, and Unknown have been gone for about 10 hours of running 300 pods on 4 nodes, except in two scenarios:

  1. etcd issue:
    cat kube-apiserver.log | grep "status to unknown"

I0212 20:18:30.119078 31254 pod_cache.go:175] node doesn't exist: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0] , setting pod hostnames-yuy8d status to unknown
...

This happened a couple of times in the last 10 hours. We need to figure out why.

  2. NodeStatus becomes UNREADY. This happened once in the last 10 hours, and the pod statuses on that node became Unknown, which is expected. I haven't figured out why NodeStatus was UNREADY then, but I heard there was a network issue earlier this morning.

@erictune
Member

@dchen1107 up to you whether to reopen this issue or file new ones for new issues.

@davidopp
Member

@dchen1107 I've also seen the "All the given peers are not reachable" message (see my second entry in this issue). I think that was what @thockin was referring to when he said "there is an etcd size limit you might be hitting too" but I may be mis-interpreting.

@thockin
Member Author

thockin commented Feb 13, 2015

That error is the etcd client's catch-all for "something went wrong" - it's actually really bad that it obscures errors.


@thockin
Member Author

thockin commented Feb 13, 2015

https://github.com/coreos/go-etcd/issues/185


@dchen1107
Member

@davidopp and @thockin Yes, that error message is a catch-all, and it needs investigation. In my particular case, it shouldn't be etcd size limits at all, since that cluster was built and upgraded more than 10 hours earlier and all 300 pods were scheduled back then. No new events have been generated by the kubelet or other components since, because all bound pods are indeed happily running on the nodes. There are no new writes to etcd except NodeStatus, which is much less frequent and not much data. The etcd problem here occurs when the master tries to read from etcd and gets an error.
