API status for pods is flaky #4188

Closed
thockin opened this issue Feb 5, 2015 · 40 comments · Fixed by #4376
Labels: kind/bug, priority/important-soon, sig/node

Comments

@thockin
Member

thockin commented Feb 5, 2015

I have a cluster with ~130 pods. Running back-to-back 'get pods' shows me wildly different understandings of their state, even though I can tell from docker that they have not crashed or anything.

thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
127 Running
1 STATUS
6 Unknown

thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods -l app=hostnames | awk '{print $(NF)}' | sort | uniq -c
Running: ./cluster/../cluster/gce/../../_output/dockerized/bin/linux/amd64/kubectl get pods -l app=hostnames
82 Running
1 STATUS
51 Unknown

Nothing changed in between these calls.

@bgrant0607
Member

cc @davidopp @dchen1107

I blame the pod cache.

@bgrant0607 added kind/bug, priority/important-soon, team/master, sig/node labels Feb 5, 2015
@alex-mohr
Contributor

@brendandburns has recently changed the pod cache?

FWIW, I also observed this behavior: 19 pods Running in one call and all Unknown in the next, with 0.10.0.

@dchen1107
Member

Yes, I think @brendandburns's PR on the pod cache, #4169, should fix the problem reported here. It should also fix the issue reported by @satnam6502 about add-on services. I have a pending PR #4194 to have the kubelet decide on PodStatus too.

@yujuhong
Contributor

yujuhong commented Feb 6, 2015

Could this have caused #3950?

@bgrant0607
Member

@yujuhong I don't see how. I'll comment on that issue.

@dchen1107
Member

@yujuhong, I don't think this could be the cause of that one.

@alex-mohr
Contributor

I don't think @brendandburns's PR on the pod cache, #4169, will fix this. That one clears the node status. This issue is about pods going from Running to Unknown even when the nodes keep running just fine.

@satnam6502
Contributor

Totally consistent observation based on my personal experience. I just assumed we were shooting for some kind of Heisenberg Uncertainty Principle.

S


@brendandburns added kind/bug and removed kind/bug labels Feb 6, 2015
@goltermann modified the milestone: P1 Issue Fix-it Feb 6, 2015
@ddysher
Contributor

ddysher commented Feb 6, 2015

Agree with @alex-mohr, #4169 clears node status only. One possibility is that we hit a transient network problem between the calls that fails the health check, so the controller sets the node condition to not reachable, hence the Unknown pods. I'll make this more robust anyway. Other than that, it's hard to think of a reason that node status would cause the flakiness.

@bgrant0607
Member

In the absence of node information, the pod status defaults to Unknown.

Maybe there's another source of this specific problem, but the pod cache is a source of continuous problems, partly because it's not just a cache, but is implementing or duplicating a significant chunk of system logic in a half-baked manner. My goal is to eliminate it -- see #2726. Short of that, we need to at least implement #1366, eliminate nodes from the pod cache, and implement a real node cache.

@davidopp self-assigned this Feb 6, 2015
@davidopp
Member

davidopp commented Feb 6, 2015

I can look into this. Repro info from Tim, who said he was running on a one-node cluster but has seen the problem with multi-node clusters also:

./cluster/kubectl.sh run-container hostnames --image=kubernetes/serve_hostname --replicas=1 --port=9376 --labels=app=hostnames
./cluster/kubectl.sh expose hostnames --port=80 --container-port=9376 --selector="app=hostnames"
./cluster/kubectl.sh resize --replicas=150 rc hostnames

but I had to hack the node a bit to make it get past 100 pods. On the node:
sysctl -w fs.inotify.max_user_instances=8192
echo "DOCKER_NOFILE=1000000" >> /etc/default/docker; /etc/init.d/docker restart (fails sometimes, dunno why)

@davidopp
Member

davidopp commented Feb 7, 2015

I can verify that the situation is rather bad. Lots of fluctuation in the numbers of Pending, Running, and Unknown. serve_hostname containers on the node are exiting; I'm not sure why, or if the pod itself is terminating or just the container. Lots of the following in the log (not sure what this means yet):

Sat, 07 Feb 2015 00:06:22 +0000 hostnames-ecd2r Pod failedScheduling {scheduler } Binding rejected: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

@davidopp
Member

davidopp commented Feb 7, 2015

The pause container is also exiting, so I'm guessing that means the whole pod is exiting, not just serve_hostname...

@thockin
Member Author

thockin commented Feb 7, 2015

Ugh, there is an etcd size limit you might be hitting too, I forgot. If you are seeing the fluctuations, maybe you got far enough, though.

@davidopp
Member

davidopp commented Feb 8, 2015

Just a few more observations:

This is reproducible (and a little less chaotic) with one node and 100 pods (but not with 50 pods, so it's definitely load-dependent). You can get to 100 Running pods (0 Pending or Unknown), and then a minute later a dozen or so will become Unknown, and then later go back to Running, etc. As mentioned earlier, the containers continue running on the node all this time (sudo docker ps|grep serve_hostname | wc -l), so it is definitely some kind of reporting problem.

As an aside, it looks like there is some kind of race condition with IP address allocation, as during startup (but not at steady state) a number of pods exit with the following in their log (sudo docker logs xxx):
listen tcp :9376: bind: address already in use
They do restart OK (I'm not sure if this is being restarted locally on the node or making a round-trip back to the scheduler).
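
For reference, that message is the generic Go net package error for a second bind on an already-occupied port, whatever the underlying cause is here. A minimal, self-contained reproduction, unrelated to the Kubernetes code itself (the port number just mirrors the log above):

package main

import (
	"fmt"
	"net"
)

func main() {
	// First listener takes port 9376, as serve_hostname would.
	l1, err := net.Listen("tcp", ":9376")
	if err != nil {
		fmt.Println("first listen failed:", err)
		return
	}
	defer l1.Close()

	// A second bind on the same port fails with the error seen in the pod logs:
	// "listen tcp :9376: bind: address already in use"
	if _, err := net.Listen("tcp", ":9376"); err != nil {
		fmt.Println("second listen failed:", err)
	}
}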

@thockin
Member Author

thockin commented Feb 9, 2015

Almost certainly the restart is just the kubelet retrying after some time.

@davidopp
Member

davidopp commented Feb 9, 2015

I found that if I comment out the line with the arrow next to it (see below) in pod_cache.go, the pods stop oscillating between Running and Unknown. Commenting out this line changes the logic from "if we get an error from the Kubelet, change the pod status to Unknown" to "if we get an error from the Kubelet, keep the pod status unchanged."

result, err := p.containerInfo.GetPodStatus(pod.Status.Host, pod.Namespace, pod.Name)
newStatus.HostIP = nodeStatus.HostIP

if err != nil {
    newStatus.Phase = api.PodUnknown  <<<------------------------
} else {

I saw intermittent errors of this form in the master logs, which were causing the pods to flip back to Unknown:

E0209 04:19:17.155149 4996 pod_cache.go:270] Error getting info for pod default/hostnames-v5drx: kubelet "kubernetes-minion-1.c.striped-torus-760.internal" server responded with HTTP error code 500 for pod default/hostnames-v5drx

This seems to correspond to lines like this in the kubelet logs:

I0209 06:38:11.104796 4704 server.go:351] GET /stats/default/hostnames-gfw5j/2bb979c5-b026-11e4-a070-42010af0f256/hostnames?num_stats=33: (218.857218ms) 500
goroutine 730 [running]:
[ ... stack trace, but they're not always the same; I'll dig into this some more, the default stack trace depth wasn't going back past the various functions that just serve up the error ... ]

However, pods still flip between Running and Pending, even though no containers are exiting on the minion, so there's another problem here too.
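
For illustration, a minimal self-contained sketch of the two behaviors being compared; the names here are made up and this is not the actual pod_cache code:

package main

import (
	"errors"
	"fmt"
)

type PodPhase string

const PodUnknown PodPhase = "Unknown"

// phaseAfterKubeletCall contrasts the two behaviors: with markUnknownOnError
// set (the arrowed line in place), any Kubelet error flips the phase to
// Unknown; with it unset (the line commented out), the seeded phase is kept.
func phaseAfterKubeletCall(seeded PodPhase, kubeletErr error, markUnknownOnError bool) PodPhase {
	if kubeletErr != nil && markUnknownOnError {
		return PodUnknown
	}
	return seeded
}

func main() {
	err := errors.New("kubelet server responded with HTTP error code 500")
	fmt.Println(phaseAfterKubeletCall("Running", err, true))  // Unknown: status flaps on a transient 500
	fmt.Println(phaseAfterKubeletCall("Running", err, false)) // Running: the seeded status survives the error
}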

@brendandburns
Contributor

Thanks for investigating this! Pods really shouldn't flip between Running and Pending, as I believe that the only way that they go Pending is if Pod.Status.Host is empty, and this really shouldn't happen post scheduling, unless we are somehow dropping information. Maybe we need to make sure that we don't wipe out status on the error case?

--brendan


@dchen1107
Member

I also investigated the same issue last Friday. I observed exactly the same thing as @davidopp described above, except that the relevant kubelet log is:

E0209 13:03:04.032711    5035 kubelet.go:1038] Unable to get pod with name "hostnames-zt4ct.default.api" and uid "9e39ef09-ae57-11e4-bdee-42010af0483b" info, health checks may be invalid
I0209 13:30:23.205974    5035 server.go:351] GET /api/v1beta1/podInfo?podID=hostnames-zt4ct&podNamespace=default: (346.387µs) 500

Meanwhile, I observed the master trying to query pod / and failing every time:

E0209 23:00:13.286561    4961 pod_cache.go:255] Error getting info for pod /: kubelet "kubernetes-minion-1.c.golden-system-455.internal" server responded with HTTP error code 400 for pod /

I talked to @davidopp offline; he is going to look into why running pods go back to Pending, and meanwhile I am going to figure out the Unknown cases here.

@satnam6502
Contributor

That would be great. It is very hard for me to make sense of any of the scalability experiments that I am doing while the pod status is in hysteresis.

@dchen1107 self-assigned this and unassigned davidopp Feb 11, 2015
@zmerlynn
Member

Am I wrong, or are these instructions also within the bounds of an e2e test? Just because our existing e2e tests only do one or two things doesn't mean we can't do more within O(minutes), assuming you're dealing with the same image.

@davidopp
Member

The "flips back to Pending" problem appears to be because updatePodStatus() writes to the cache but not back to the pod's status, while computePodStatus() initializes newStatus based on pod.Status, not based on the cached value of the status. This second behavior seems wrong in light of the first. However, I believe this only causes a problem in combination with my fix described earlier, which as a result doesn't actually do what I described: it doesn't "keep the status as-is" but rather copies it out of the pod's status; to "keep the status as-is" it should be copying it out of the cache. I can see if this fixes things.

(We also should fix the root cause stuff on the node side, which Dawn is working on.)

@dchen1107
Member

Assign the issue to me ...

I didn't see the flips between Pending and Running anymore; I don't know why. I consistently see flips between Unknown and Running on heavily loaded nodes. The root issue is that once in a while the docker inspect API call returns EAGAIN. Note that on my test nodes I have already increased the NOFILE limit for the docker daemon process.

I0211 18:44:31.732281   27704 kubelet.go:1559] Query docker container info failed with error: dial unix /var/run/docker.sock: resource temporarily unavailable
I0211 18:44:31.732359   27704 server.go:351] GET /api/v1beta1/podInfo?podID=hostnames-bxfcy&podNamespace=default: (3.832648315s) 500

The proposed solution is that the kubelet deals with all errors internally except the EAGAIN one; in that case, the computePodStatus logic in pod_cache does nothing when p.containerInfo.GetPodStatus returns an error. I'm working on a PR to fix this.
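
A rough sketch of that split, with made-up names rather than the actual PR's code: the kubelet absorbs every error it can and only surfaces the transient EAGAIN ("resource temporarily unavailable") case, and the master-side cache update becomes a no-op when it sees an error:

package main

import (
	"errors"
	"fmt"
	"syscall"
)

type PodPhase string

// kubeletPodStatus handles docker errors locally and only surfaces the
// transient EAGAIN case to the caller.
func kubeletPodStatus(dockerErr error, observed PodPhase) (PodPhase, error) {
	if errors.Is(dockerErr, syscall.EAGAIN) {
		return "", dockerErr // let the master retry later
	}
	// Any other docker error is dealt with on the node and still yields a phase.
	return observed, nil
}

// updateCache leaves the cached status untouched when the kubelet reported
// a transient error, instead of overwriting it with Unknown.
func updateCache(cache map[string]PodPhase, name string, phase PodPhase, err error) {
	if err != nil {
		return
	}
	cache[name] = phase
}

func main() {
	cache := map[string]PodPhase{"hostnames-bxfcy": "Running"}

	phase, err := kubeletPodStatus(syscall.EAGAIN, "Running")
	updateCache(cache, "hostnames-bxfcy", phase, err)

	fmt.Println(cache["hostnames-bxfcy"]) // still Running: no flap to Unknown
}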

@davidopp
Member

Dawn, my suggestion is that computePodStatus() should return the current cached value whenever it gets an error from Kubelet. I can implement this if you want.

@yujuhong
Contributor

@davidopp, why shouldn't updatePodStatus simply write through to the pod's status? I'd think either we always read the cached value first, or we always write through to the pod and read from there. Only reading the cached value upon an error seems counter-intuitive to me...

@dchen1107
Member

@davidopp why would computePodStatus return a cached value here? The only place that calls computePodStatus is updatePodStatus, which is triggered by two operations: 1) periodically updating all containers, and 2) a client asking for PodStatus when there is no cached one, in which case computePodStatus is called to retrieve it. I think a cleaner way is:

  1. The kubelet handles all errors except this transient one. Of course, we also need to make sure docker auto-restarts and that there are no "too many open files" errors on the node side, but those are separate issues and should be handled on the node side through scripts or configuration.

  2. Now back on the master side: if the node status is ok, the kubelet should return something; if there is an error from the kubelet, it should be transient, and if there is a cached status, simply return the cached one upon request; if there is no cached one, it should be generated by computePodStatus in the next round.

I have a working PR that avoids the flips between Running and Unknown, but it shows an increasing number of Pending pods from the node side, caused by docker failures. Still investigating...

@davidopp
Member

@dchen1107 I didn't understand "if there is an error from the kubelet, it should be transient, and if there is a cached status, simply return the cached one upon request; if there is no cached one, it should be generated by computePodStatus in the next round." If you're talking about the caching in the podCache on the master, then what you wrote is exactly what I was suggesting. (If you're talking about some kind of caching on the Kubelet, then that's different, because I am not aware of a cache for status on the Kubelet.)

@yujuhong I think always writing back to the pod's status might be fine, but I don't know why the podCache was designed the way it was, so I didn't want to change that without asking Daniel first (I think he's the one who created it). I wasn't suggesting reading the cached value only upon an error; I meant always reading the cached value, under all circumstances.

@yujuhong
Contributor

@dchen1107 and @davidopp, here is my understanding. Correct me if I am wrong.

UpdateAllContainers periodically updates all containers, but it gets the initial pods from etcd through this line pods, err := p.pods.ListPods(ctx, labels.Everything()).

  1. For each pod from pods, UpdateAllContainers calls updatePodStatus
  2. updatePodStatus passes pod to computePodStatus.
  3. computePodStatus computes and returns a new status. Note that the status is computed based on pod, i.e., the information in etcd.
  4. updatePodStatus updates the pod cache with the new status. Note that the new status is not written to etcd.

If you repeat the same steps, you will start again from the stale pod status in etcd, because the cached value is never consulted.
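
To make those steps concrete, here is a toy end-to-end sketch with made-up names (not the real UpdateAllContainers code), modeling the behavior after the arrowed line was commented out: each round re-reads the stale pod status from "etcd", seeds the computation from it, and writes the result only to the cache, so an error makes the pod appear to flip back to Pending:

package main

import "fmt"

type PodPhase string

// Pod mimics what the master reads back from etcd; its Phase is never
// updated by the status-polling loop below.
type Pod struct {
	Name  string
	Phase PodPhase // stale value from etcd, e.g. "Pending" after binding
}

// computePodStatus seeds its result from the etcd copy of the pod (step 3),
// not from the cache, so the error path falls back to the stale etcd phase.
func computePodStatus(pod Pod, kubeletPhase PodPhase, kubeletErr error) PodPhase {
	if kubeletErr != nil {
		return pod.Phase
	}
	return kubeletPhase
}

// updatePodStatus writes the new status into the cache only (step 4);
// nothing is written back to etcd, so the next round starts stale again.
func updatePodStatus(cache map[string]PodPhase, pod Pod, kubeletPhase PodPhase, kubeletErr error) {
	cache[pod.Name] = computePodStatus(pod, kubeletPhase, kubeletErr)
}

func main() {
	cache := map[string]PodPhase{}
	etcdPod := Pod{Name: "hostnames-gfw5j", Phase: "Pending"}

	// Round 1: the kubelet answers, the cache says Running; etcd still says Pending.
	updatePodStatus(cache, etcdPod, "Running", nil)
	fmt.Println(cache["hostnames-gfw5j"]) // Running

	// Round 2: the kubelet returns a 500; the computation falls back to the
	// stale etcd phase, and the pod appears to flip back to Pending.
	updatePodStatus(cache, etcdPod, "", fmt.Errorf("HTTP 500"))
	fmt.Println(cache["hostnames-gfw5j"]) // Pending
}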

@dchen1107
Member

I have been testing #4376 since last night. The flaps between Running, Pending, and Unknown have been gone for about 10 hours of running 300 pods on 4 nodes, except in two scenarios:

  1. etcd issue:
    cat kube-apiserver.log | grep "status to unknown"

I0212 20:18:30.119078 31254 pod_cache.go:175] node doesn't exist: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0] , setting pod hostnames-yuy8d status to unknown
...

This happened a couple of times in the last 10 hours. We need to figure out why.

  2. NodeStatus becomes UNREADY. This happened once in the last 10 hours, and the pod statuses on that node became Unknown, which is expected. I haven't figured out why NodeStatus was UNREADY then, but I heard there was a network issue earlier this morning.

@erictune
Member

@dchen1107 up to you whether to reopen this issue or file new ones for new issues.

@davidopp
Member

@dchen1107 I've also seen the "All the given peers are not reachable" message (see my second entry in this issue). I think that was what @thockin was referring to when he said "there is an etcd size limit you might be hitting too" but I may be mis-interpreting.

@thockin
Member Author

thockin commented Feb 13, 2015

That error is the etcd client's catch-all for "something went wrong" - it's actually really bad that it obscures errors.


@thockin
Member Author

thockin commented Feb 13, 2015

https://github.com/coreos/go-etcd/issues/185


@dchen1107
Member

@davidopp and @thockin Yes, that error message is a catch-all, and it needs investigation. In my particular case, it shouldn't be etcd size limits at all, since that cluster was built and upgraded more than 10 hours earlier and all 300 pods were scheduled back then. No new events have been generated by the kubelet or other components since, because all bound pods are indeed happily running on the nodes. There are no new writes to etcd except NodeStatus, which is much less frequent and not much data. The etcd problem here occurs when the master tries to read from etcd and gets an error.
