Millions of conflicts on Nodes in kubemark-scale tests #46851

Closed
wojtek-t opened this issue Jun 2, 2017 · 19 comments

Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. release-blocker sig/scalability Categorizes an issue or PR as relevant to SIG Scalability.
Milestone
v1.7

@wojtek-t (Member) commented Jun 2, 2017

Kubemark-scale is currently failing constantly. There are several different reasons for it:

  • too high metrics
  • some nodes becoming unready
  • some pods not starting
  • some requests getting "429" errors

However, I looked into the logs, and it seems there is one common underlying root cause for all of them:
millions of conflicts on Node objects.

Those conflicts result in:

  • some nodes not being able to heartbeat -> becoming unready
  • more requests in the system in general -> higher latencies
  • more requests in the system -> 429 errors
  • system more overloaded -> pod status updates not being delivered on time

We should understand why those conflicts became so frequent and fix the problem.
This seems like a regression to me, so I'm putting it into the 1.7 milestone.
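
To make the failure chain above concrete, here is a minimal, self-contained sketch (made-up names, not Kubernetes code) of the optimistic concurrency that produces these conflicts: every object carries a resourceVersion, and a write built from a stale version is rejected with 409 Conflict, forcing the writer to re-read and retry, which is how millions of conflicts translate into extra requests, 429s and late heartbeats.

package main

import (
	"fmt"
	"sync"
)

// store stands in for the apiserver's view of a single Node object.
type store struct {
	mu              sync.Mutex
	resourceVersion int
	conflicts       int
}

// update succeeds only if the caller built its change from the current
// resourceVersion; otherwise it is rejected, the way the apiserver returns
// 409 Conflict for a write based on a stale object.
func (s *store) update(readVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.resourceVersion {
		s.conflicts++
		return fmt.Errorf("conflict: read rv=%d, current rv=%d", readVersion, s.resourceVersion)
	}
	s.resourceVersion++
	return nil
}

func main() {
	s := &store{}
	// Two writers racing on the same Node (e.g. the kubelet heartbeat and a
	// controller): whoever used the stale version gets a conflict and has to
	// re-read and retry, generating extra load on the apiserver.
	rv := s.resourceVersion
	_ = s.update(rv)    // first writer wins and bumps the resourceVersion
	err := s.update(rv) // second writer used the stale version: conflict
	fmt.Println(err, "| total conflicts:", s.conflicts)
}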

@kubernetes/sig-scalability-bugs

@wojtek-t wojtek-t added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. labels Jun 2, 2017
@wojtek-t wojtek-t added this to the v1.7 milestone Jun 2, 2017
@wojtek-t (Member Author) commented Jun 2, 2017

It seems to me that the main problem is the "Cacher" for Node objects lagging far behind reality.

The big unknown to me is why the other Cachers don't seem to lag that much. I'm looking into the currently running tests.

In particular, looking at the HWM (high-water mark) values for different resources:

  1. Pods:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Pod | tail -n 1
I0602 11:03:42.395256       7 cacher.go:566] cacher (*api.Pod): 73 objects queued in incoming channel.
  2. Services:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Service | tail -n 1
I0602 11:21:34.177855       7 cacher.go:566] cacher (*api.Service): 11 objects queued in incoming channel.
  3. Endpoints:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Endpoint | tail -n 1
I0602 11:27:05.117909       7 cacher.go:566] cacher (*api.Endpoints): 17 objects queued in incoming channel.
  4. Nodes:
wojtekt@kubemark-2000-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep Node | tail -n 1
I0602 05:53:49.921624       7 cacher.go:566] cacher (*api.Node): 100 objects queued in incoming channel.

Note that we created the cluster at ~05:51. Also note that HWM=100 is the maximal possible value (the size of the incoming channel buffer).
So for nodes we accumulated the maximal possible lag within 2 minutes, while that didn't happen at all for the other resources in the next 6 hours or so.

That said, our problem doesn't seem to be CPU starvation (because that would probably affect all resources).
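
For readers unfamiliar with that log line, here is a minimal, self-contained sketch (assumed shape and made-up names, not the actual cacher.go) of what it measures: events are pushed into a bounded channel and a high-water mark records the deepest queue seen; with a buffer of 100, HWM=100 means the dispatcher is not keeping up and senders are about to block.

package main

import (
	"fmt"
	"sync/atomic"
)

// highWaterMark records the largest queue depth observed so far.
type highWaterMark int64

// update raises the mark if current is a new maximum and reports whether it did.
func (h *highWaterMark) update(current int64) bool {
	for {
		old := atomic.LoadInt64((*int64)(h))
		if current <= old {
			return false
		}
		if atomic.CompareAndSwapInt64((*int64)(h), old, current) {
			return true
		}
	}
}

func main() {
	// Producers (watch events) enqueue faster than the dispatcher drains.
	incoming := make(chan string, 100) // buffer size matching the value the log suggests
	var hwm highWaterMark

	for i := 0; i < 100; i++ {
		incoming <- fmt.Sprintf("node-update-%d", i)
		if hwm.update(int64(len(incoming))) && len(incoming)%25 == 0 {
			fmt.Printf("cacher (*api.Node): %d objects queued in incoming channel\n", len(incoming))
		}
	}
	// One more send would now block until the dispatcher catches up.
}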

@wojtek-t (Member Author) commented Jun 2, 2017

Also note that this HWM=100 for nodes happened significantly before we started any test on the cluster.
The first test started at 05:56:28.

@wojtek-t (Member Author) commented Jun 2, 2017

In other words, we seem to have problems even in an empty cluster, without any user pods.

@wojtek-t (Member Author) commented Jun 2, 2017

My hypothesis is that we have some slow watchers that are not able to consume Node events fast enough. But looking into the logs, I'm not seeing any watchers that were forcibly closed, so it seems none of them is terribly slow.
Looking further into the logs, for any given node update there are exactly 6 watchers interested in that event:

  • one from the scheduler
  • one from the controller manager
  • one from the corresponding hollow-kubelet
  • three from a client named heapster (heapster, eventer and pod nanny, I guess)

With the trigger function in place, we are always considering only those 6 watchers, so this doesn't seem like much work to do (see the sketch at the end of this comment).

BTW, note that there is some interesting backpressure happening here: more conflicts means fewer real updates, so fewer events to deliver, so more time to deliver them :) But obviously, that's not something we should count on.
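
To make the trigger-function point concrete, here is a minimal, self-contained sketch (assumed shape and made-up names, not the real cacher code) of trigger-based watcher selection: watchers that filter on a single object, like a hollow kubelet watching its own Node by name, are indexed under that trigger value, so an incoming Node event only has to be offered to the few watchers registered for that name plus the unfiltered ones.

package main

import "fmt"

type event struct {
	triggerValue string // e.g. the Node name for Node events
	payload      string
}

type watcher struct{ id string }

type dispatcher struct {
	byTrigger  map[string][]*watcher // watchers selecting a single object (e.g. by metadata.name)
	unfiltered []*watcher            // watchers that see all Nodes (scheduler, controller manager, heapster, ...)
}

// dispatch offers the event only to unfiltered watchers plus those whose
// trigger value matches, instead of iterating over every registered watcher.
func (d *dispatcher) dispatch(ev event) {
	interested := append([]*watcher{}, d.unfiltered...)
	interested = append(interested, d.byTrigger[ev.triggerValue]...)
	for _, w := range interested {
		fmt.Printf("delivering %q to watcher %s\n", ev.payload, w.id)
	}
}

func main() {
	d := &dispatcher{
		byTrigger: map[string][]*watcher{
			"node-42": {{id: "hollow-kubelet-42"}},
		},
		unfiltered: []*watcher{{id: "scheduler"}, {id: "controller-manager"}, {id: "heapster"}},
	}
	// Only the unfiltered watchers plus the one registered for "node-42" are
	// considered for this update, not one watcher per node in the cluster.
	d.dispatch(event{triggerValue: "node-42", payload: "status update for node-42"})
}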

@wojtek-t (Member Author) commented Jun 2, 2017

The very good news is that it is extremely easy to reproduce: just start a 5000-node kubemark and you're done (you don't even need to run any tests).

@wojtek-t (Member Author) commented Jun 2, 2017

More evidence that this is a regression: in an empty cluster the apiserver is using ~24 cores.
The previous time I was running it manually (when I was bumping Go to 1.8.3) it was at least 2x lower.
So the regression happened somewhere in the meantime...

@wojtek-t (Member Author) commented Jun 2, 2017

BTW - we seem to be accumulating lag pretty fast. I added the following diff to the code:

 func (c *Cacher) dispatchEvents() {
+       num := 0
+       enable := (c.objectType.String() == "*api.Node")
        for {
                select {
                case event, ok := <-c.incoming:
                        if !ok {
                                return
                        }
+                       num++
+                       if enable && num%100 == 0 {
+                               glog.Errorf("AAA: dispatching event nr %d", num)
+                       }
                        c.dispatchEvent(&event)

and started kubemark-5000 with it.

Then I ssh-ed to the kubemark master and ran these:

wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep "v1\/nodes" | grep -v "GET"  | grep -v "trace" | grep -v "Get" | grep -v "Starting watch" | grep -v "List" | grep -v " 429 " | grep -v " 409 " | grep -v " 404 " > out.txt
wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep AAA > xxx.txt

[the first file contains logs of successful POST/PUT/PATCH operations on nodes, the second logs of when we dispatch an event].

And what I'm seeing is:

  • out.txt (line number: time it was logged)
1:      13:35:51.934856
10000:  13:36:54.648217
100000: 13:41:09.628872
200000: 13:48:19.224460
400000: 14:02:06.778121
  • xxx.txt (event number: time it was dispatched)
1:      13:35:52.500375
10000:  13:36:54.943128
100000: 13:42:07.613365
200000: 13:50:53.479439
400000: 14:07:41.702557

So after ~30 minutes, we are lagging by ~5 minutes...
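
As a sanity check on that number, here is a tiny, self-contained helper (not part of the original debugging, values copied from the lists above) that computes the lag by subtracting, for the same count, the write timestamp in out.txt from the dispatch timestamp in xxx.txt.

package main

import (
	"fmt"
	"time"
)

func main() {
	const layout = "15:04:05.000000"
	pairs := []struct {
		count      int
		written    string // out.txt: time of the successful write on /v1/nodes
		dispatched string // xxx.txt: time the cacher dispatched event number <count>
	}{
		{200000, "13:48:19.224460", "13:50:53.479439"},
		{400000, "14:02:06.778121", "14:07:41.702557"},
	}
	for _, p := range pairs {
		w, _ := time.Parse(layout, p.written)
		d, _ := time.Parse(layout, p.dispatched)
		// At event 400000 this prints a lag of roughly 5.5 minutes.
		fmt.Printf("at event %d the dispatcher lags by %v\n", p.count, d.Sub(w))
	}
}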

What is interesting is that:

@wojtek-t (Member Author) commented Jun 2, 2017

Well - I've just checked this run:
http://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/427/artifacts/kubemark-2000-kubemark-master/
[from ~2 weeks ago]

There were 4449 patch conflicts on nodes there (compared to the millions happening today).

Today's numbers are 3-4 orders of magnitude higher.

Looking into test logs should make it possible to find roughly when it happened.

wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
4449

@liggitt (Member) commented Jun 4, 2017

a couple of changes to strategic merge patching happened recently: #44597 and #45980

I'm not sure whether those increased the likelihood of patch conflicts.

@wojtek-t (Member Author) commented Jun 5, 2017

Unfortunately, due to these problems we are also missing logs from most of the kubemark-scale runs.
That said, I checked a bit more and:

  • it seems it was still good in run 442:
wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/442/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/442/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
/ [1 files][  1.8 GiB/  1.8 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.8 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
8688

It was worse in runs 446 and 447, but still not as bad as it is now:

wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/446/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/446/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
| [1 files][  1.8 GiB/  1.8 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.8 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
46305
wojtekt@wojtekt-work:~/Downloads$ gsutil cp gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/447/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log .
Copying gs://kubernetes-jenkins/logs/ci-kubernetes-kubemark-gce-scale/447/artifacts/kubemark-2000-kubemark-master/kube-apiserver.log...
Downloading to temp gzip filename ./kube-apiserver.log_.gztmp                   
Uncompressing temporarily gzipped file to ./kube-apiserver.log...               
- [1 files][  1.7 GiB/  1.7 GiB]      0.0 B/s                                   
Operation completed over 1 objects/1.7 GiB.                                      
wojtekt@wojtekt-work:~/Downloads$ cat kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
22171

So there might have been two separate regressions, and the PRs that @liggitt mentioned above are plausible suspects. We should verify this.
@mengqiy @pwittrock @apelisse - FYI

@wojtek-t (Member Author) commented Jun 5, 2017

I just checked: when I was reproducing this on Friday, I was running from commit:

7043372d05510b13a0928af92deda6110740dd5e

That means the second PR, #45980, was not even merged yet at that point.

@wojtek-t (Member Author) commented Jun 5, 2017

BTW - I think it's not clear from anything I wrote so far, but the huge number of conflicts may also be a consequence, not the root cause. E.g. slow watch event processing may be the root cause (though I don't really have a hypothesis for why that could have regressed...).

@wojtek-t (Member Author) commented Jun 5, 2017

OK - so I reverted the second PR that @liggitt mentioned (on top of the same commit as on Friday), and I still get 100,000 conflicts after 5 minutes of a running cluster (without even running tests). So it seems those PRs are unrelated to the regression...

@wojtek-t (Member Author) commented Jun 5, 2017

I did:

git rebase 1444d252e17ab7be24c1af07f4ef3bdb5dac8b48

and ran a 5000-node kubemark from there, and there are still millions of conflicts (after 30 minutes):

wojtekt@kubernetes-kubemark-master ~ $ cat /var/log/kube-apiserver.log | grep "v1\/nodes" | grep PATCH | grep " 409 " | wc -l
1549982

@wojtek-t (Member Author) commented Jun 5, 2017

hmm - but maybe I did it wrong...

@wojtek-t (Member Author) commented Jun 5, 2017

@dchen1107 - FYI (as a release blocker)

@smarterclayton (Contributor) commented:
@liggitt this sounds like your slow informer lag thing

@wojtek-t (Member Author) commented Jun 6, 2017

So I have a victim. Ironically, this is my PR: #46588 (specifically its second commit).
Since that commit isn't critical by any means, I'm simply going to revert it.

@wojtek-t (Member Author) commented Jun 6, 2017

#47082 sent out as a fix for it.

k8s-github-robot pushed a commit that referenced this issue Jun 19, 2017
Automatic merge from submit-queue

Add logging to debug conflicts in kubemark-scale test

Ref #46851