Millions of conflicts on Nodes in kubemark-scale tests #46851

Closed
wojtek-t opened this issue Jun 2, 2017 · 19 comments

Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. release-blocker sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone
v1.7

@wojtek-t (Member) commented Jun 2, 2017

Kubemark-scale is currently failing constantly. There are several different reasons for it:

  • too high metrics
  • some nodes becoming unready
  • some pods not starting
  • some requests getting "429" errors

However, I looked into the logs, and it seems there is one common underlying root cause for all of them:
millions of conflicts on Node objects.

Those conflicts result in:

  • some nodes not being able to heartbeat -> becoming unready
  • more requests in the system in general -> higher latencies
  • more requests in the system -> 429 errors
  • system more overloaded -> pod status updates not being delivered on time

We should understand why those conflicts became so frequent and fix the problem.
This seems like a regression to me, so I'm putting it into the 1.7 milestone.
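
To make the failure chain above concrete, here is a minimal, self-contained sketch (made-up names, not Kubernetes code) of the optimistic concurrency that produces these conflicts: every object carries a resourceVersion, and a write built from a stale version is rejected with 409 Conflict, forcing the writer to re-read and retry, which is how millions of conflicts translate into extra requests, 429s and late heartbeats.

package main

import (
	"fmt"
	"sync"
)

// store stands in for the apiserver's view of a single Node object.
type store struct {
	mu              sync.Mutex
	resourceVersion int
	conflicts       int
}

// update succeeds only if the caller built its change from the current
// resourceVersion; otherwise it is rejected, the way the apiserver returns
// 409 Conflict for a write based on a stale object.
func (s *store) update(readVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.resourceVersion {
		s.conflicts++
		return fmt.Errorf("conflict: read rv=%d, current rv=%d", readVersion, s.resourceVersion)
	}
	s.resourceVersion++
	return nil
}

func main() {
	s := &store{}
	// Two writers racing on the same Node (e.g. the kubelet heartbeat and a
	// controller): whoever used the stale version gets a conflict and has to
	// re-read and retry, generating extra load on the apiserver.
	rv := s.resourceVersion
	_ = s.update(rv)    // first writer wins and bumps the resourceVersion
	err := s.update(rv) // second writer used the stale version: conflict
	fmt.Println(err, "| total conflicts:", s.conflicts)
}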

@kubernetes/sig-scalability-bugs

@wojtek-t wojtek-t added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jun 2, 2017
@wojtek-t wojtek-t added this to the v1.7 milestone Jun 2, 2017
@wojtek-t (Member Author) commented Jun 2, 2017

It seems to me that the main problem is the "Cacher" for Node objects lagging far behind reality.

The big unknown to me is why the other Cachers don't seem to lag that much. I'm looking into the currently running tests.

In particular, looking at the HWM (high-water mark) values for different resources:

  1. Pods:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Pod | tail -n 1
I0602 11:03:42.395256       7 cacher.go:566] cacher (*api.Pod): 73 objects queued in incoming channel.
  2. Services:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Service | tail -n 1
I0602 11:21:34.177855       7 cacher.go:566] cacher (*api.Service): 11 objects queued in incoming channel.
  3. Endpoints:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Endpoint | tail -n 1
I0602 11:27:05.117909       7 cacher.go:566] cacher (*api.Endpoints): 17 objects queued in incoming channel.
  4. Nodes:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Node | tail -n 1
I0602 05:53:49.921624       7 cacher.go:566] cacher (*api.Node): 100 objects queued in incoming channel.

Note that we created the cluster at ~05:51. Also note that HWM=100 is the maximal possible value (the size of the incoming channel buffer).
So for nodes we accumulated the maximal possible lag within 2 minutes, while that didn't happen at all for the other resources in the next 6 hours or so.

That said, our problem doesn't seem to be CPU starvation (because that would probably affect all resources).
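
For readers unfamiliar with that log line, here is a minimal, self-contained sketch (assumed shape and made-up names, not the actual cacher.go) of what it measures: events are pushed into a bounded channel and a high-water mark records the deepest queue seen; with a buffer of 100, HWM=100 means the dispatcher is not keeping up and senders are about to block.

package main

import (
	"fmt"
	"sync/atomic"
)

// highWaterMark records the largest queue depth observed so far.
type highWaterMark int64

// update raises the mark if current is a new maximum and reports whether it did.
func (h *highWaterMark) update(current int64) bool {
	for {
		old := atomic.LoadInt64((*int64)(h))
		if current <= old {
			return false
		}
		if atomic.CompareAndSwapInt64((*int64)(h), old, current) {
			return true
		}
	}
}

func main() {
	// Producers (watch events) enqueue faster than the dispatcher drains.
	incoming := make(chan string, 100) // buffer size matching the value the log suggests
	var hwm highWaterMark

	for i := 0; i < 100; i++ {
		incoming <- fmt.Sprintf("node-update-%d", i)
		if hwm.update(int64(len(incoming))) && len(incoming)%25 == 0 {
			fmt.Printf("cacher (*api.Node): %d objects queued in incoming channel\n", len(incoming))
		}
	}
	// One more send would now block until the dispatcher catches up.
}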

@wojtek-t (Member Author) commented Jun 2, 2017

Also note that this HWM=100 for nodes happened significantly before we started any test on the cluster.
The first test started at 05:56:28.

@wojtek-t (Member Author) commented Jun 2, 2017

In other words, we seem to have problems even in an empty cluster, without any user pods.

@wojtek-t (Member Author) commented Jun 2, 2017

My hypothesis is that we have some slow watchers that are not able to consume Node events fast enough. But looking into the logs, I'm not seeing any watchers that were forcibly closed, so it seems none of them is terribly slow.
Looking further into the logs, for any given node update there are exactly 6 watchers interested in that event:

  • one from the scheduler
  • one from the controller manager
  • one from the corresponding hollow-kubelet
  • three from a client named heapster (heapster, eventer and pod nanny, I guess)

With the trigger function in place, we are always considering only those 6 watchers, so this doesn't seem like much work to do (see the sketch at the end of this comment).

BTW, note that there is some interesting backpressure happening here: more conflicts means fewer real updates, so fewer events to deliver, so more time to deliver them :) But obviously, that's not something we should count on.
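
To make the trigger-function point concrete, here is a minimal, self-contained sketch (assumed shape and made-up names, not the real cacher code) of trigger-based watcher selection: watchers that filter on a single object, like a hollow kubelet watching its own Node by name, are indexed under that trigger value, so an incoming Node event only has to be offered to the few watchers registered for that name plus the unfiltered ones.

package main

import "fmt"

type event struct {
	triggerValue string // e.g. the Node name for Node events
	payload      string
}

type watcher struct{ id string }

type dispatcher struct {
	byTrigger  map[string][]*watcher // watchers selecting a single object (e.g. by metadata.name)
	unfiltered []*watcher            // watchers that see all Nodes (scheduler, controller manager, heapster, ...)
}

// dispatch offers the event only to unfiltered watchers plus those whose
// trigger value matches, instead of iterating over every registered watcher.
func (d *dispatcher) dispatch(ev event) {
	interested := append([]*watcher{}, d.unfiltered...)
	interested = append(interested, d.byTrigger[ev.triggerValue]...)
	for _, w := range interested {
		fmt.Printf("delivering %q to watcher %s\n", ev.payload, w.id)
	}
}

func main() {
	d := &dispatcher{
		byTrigger: map[string][]*watcher{
			"node-42": {{id: "hollow-kubelet-42"}},
		},
		unfiltered: []*watcher{{id: "scheduler"}, {id: "controller-manager"}, {id: "heapster"}},
	}
	// Only the unfiltered watchers plus the one registered for "node-42" are
	// considered for this update, not one watcher per node in the cluster.
	d.dispatch(event{triggerValue: "node-42", payload: "status update for node-42"})
}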

@wojtek-t (Member Author) commented Jun 2, 2017

The very good news is that it is extremely easy to reproduce: just start a 5000-node kubemark and you're done (you don't even need to run any tests).

@wojtek-t (Member Author) commented Jun 2, 2017

More evidence that this is a regression: in an empty cluster the apiserver is using ~24 cores.
The previous time I was running it manually (when I was bumping Go to 1.8.3) it was at least 2x lower.
So the regression happened somewhere in the meantime...

@wojtek-t (Member Author) commented Jun 2, 2017

BTW - we seem to be accumulating lag pretty fast. I added the following diff to the code:

 func (c *Cacher) dispatchEvents() {
+       num := 0
+       enable := (c.objectType.String() == "*api.Node")
        for {
                select {
                case event, ok := <-c.incoming:
                        if !ok {
                                return
                        }
+                       num++
+                       if enable && num%100 == 0 {
+                               glog.Errorf("AAA: dispatching event nr %d", num)
+                       }
                        c.dispatchEvent(&event)

and started kubemark-5000 with it.

Then I ssh-ed to the kubemark master and ran these:

wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep "v1\/nodes" | grep -v "GET"  | grep -v "trace" | grep -v "Get" | grep -v "Starting watch" | grep -v "List" | grep -v " 429 " | grep -v " 409 " | grep -v " 404 " > out.txt
wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep AAA > xxx.txt

[the first file contains logs of successful POST/PUT/PATCH operations on nodes, the second logs of when we dispatch an event].

And what I'm seeing is:

  • out.txt (line number: time it was logged)
1:      13:35:51.934856
10000:  13:36:54.648217
100000: 13:41:09.628872
200000: 13:48:19.224460
400000: 14:02:06.778121
  • xxx.txt (event number: time it was dispatched)
1:      13:35:52.500375
10000:  13:36:54.943128
100000: 13:42:07.613365
200000: 13:50:53.479439
400000: 14:07:41.702557

So after ~30 minutes, we are lagging by ~5 minutes...
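
As a sanity check on that number, here is a tiny, self-contained helper (not part of the original debugging, values copied from the lists above) that computes the lag by subtracting, for the same count, the write timestamp in out.txt from the dispatch timestamp in xxx.txt.

package main

import (
	"fmt"
	"time"
)

func main() {
	const layout = "15:04:05.000000"
	pairs := []struct {
		count      int
		written    string // out.txt: time of the successful write on /v1/nodes
		dispatched string // xxx.txt: time the cacher dispatched event number <count>
	}{
		{200000, "13:48:19.224460", "13:50:53.479439"},
		{400000, "14:02:06.778121", "14:07:41.702557"},
	}
	for _, p := range pairs {
		w, _ := time.Parse(layout, p.written)
		d, _ := time.Parse(layout, p.dispatched)
		// At event 400000 this prints a lag of roughly 5.5 minutes.
		fmt.Printf("at event %d the dispatcher lags by %v\n", p.count, d.Sub(w))
	}
}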

What is interesting is that:

@wojtek-t (Member Author) commented Jun 2, 2017

Well - I've just checked this run:
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/427/artifacts/kubemark-2000-kubemark-master/
[from ~2 weeks ago]

There were 4449 patch conflicts on nodes there (compared to the millions happening today).

Today's numbers are 3-4 orders of magnitude higher.

Looking into test logs should make it possible to find roughly when it happened.

wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
4449

@liggitt (Member) commented Jun 4, 2017

a couple of changes to strategic merge patching happened recently: #44597 and #45980

I'm not sure whether those increased the likelihood of patch conflicts.

@wojtek-t (Member Author) commented Jun 5, 2017

Unfortunately, due to these problems we are also missing logs from most of the kubemark-scale runs.
That said, I checked a bit more and:

  • it seems it was still good in run 442:
wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/442/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/442/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
/ [1 files][  1.8 GiB/  1.8 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.8 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
8688

It was worse in runs 446 and 447, but still not as bad as it is now:

wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/446/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/446/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
| [1 files][  1.8 GiB/  1.8 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.8 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
46305
wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/447/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/447/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
- [1 files][  1.7 GiB/  1.7 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.7 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
22171

So there might have been two separate regressions, and the PRs that @liggitt mentioned above are plausible suspects. We should verify this.
@mengqiy @pwittrock @apelisse - FYI

@wojtek-t (Member Author) commented Jun 5, 2017

I just checked: when I was reproducing this on Friday, I was running from commit:

7043372d05510b13a0928af92deda6110740dd5e

That means the second PR, #45980, was not even merged yet at that point.

@wojtek-t (Member Author) commented Jun 5, 2017

BTW - I think it's not clear from anything I wrote so far, but the huge number of conflicts may also be a consequence, not the root cause. E.g. slow watch event processing may be the root cause (though I don't really have a hypothesis for why that could have regressed...).

@wojtek-t (Member Author) commented Jun 5, 2017

OK - so I reverted the second PR that @liggitt mentioned (on top of the same commit as on Friday), and I still get 100,000 conflicts after 5 minutes of a running cluster (without even running tests). So it seems those PRs are unrelated to the regression...

@wojtek-t (Member Author) commented Jun 5, 2017

I did:

git rebase 1444d252e17ab7be24c1af07f4ef3bdb5dac8b48

and ran a 5000-node kubemark from there, and there are still millions of conflicts (after 30 minutes):

wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
1549982

@wojtek-t (Member Author) commented Jun 5, 2017

hmm - but maybe I did it wrong...

@wojtek-t (Member Author) commented Jun 5, 2017

@dchen1107 - FYI (as a release blocker)

@smarterclayton (Contributor) commented:
@liggitt this sounds like your slow informer lag thing

@wojtek-t (Member Author) commented Jun 6, 2017

So I have a victim. Ironically, this is my PR: #46588 (specifically its second commit).
Since that commit isn't critical by any means, I'm simply going to revert it.

@wojtek-t (Member Author) commented Jun 6, 2017

#47082 sent out as a fix for it.

k8s-github-robot pushed a commit that referenced this issue Jun 19, 2017
Automatic merge from submit-queue

Add logging to debug conflicts in kubemark-scale test

Ref #46851