Millions of conflicts on Nodes in kubemark-scale tests #46851
It seems to me that the main problem is the "Cacher" for Node objects lagging far behind reality. The big unknown to me is why the other Cachers don't seem to lag that much. I'm looking into the currently running tests, in particular at the HWM marks for different resources:
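(For context: HWM here is a high-water mark, i.e. the peak number of watch events the apiserver saw queued for a given resource. A minimal, self-contained sketch of such a counter, with hypothetical names and not the actual apiserver code:)

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// highWaterMark remembers the largest value ever observed, e.g. the peak
// depth of the incoming watch-event channel for one resource type.
type highWaterMark struct {
	value int64
}

// update raises the mark if current exceeds it and reports whether a new
// high-water mark was set (so callers can log only on new peaks).
func (h *highWaterMark) update(current int64) bool {
	for {
		old := atomic.LoadInt64(&h.value)
		if current <= old {
			return false
		}
		if atomic.CompareAndSwapInt64(&h.value, old, current) {
			return true
		}
	}
}

func main() {
	var hwm highWaterMark
	for _, depth := range []int64{3, 17, 9, 100, 42} {
		if hwm.update(depth) {
			fmt.Printf("new HWM for the node watch channel: %d\n", depth)
		}
	}
}
```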
Note that we created the cluster at ~ That said, our problem doesn't seem to be CPU starvation (because that would probably affect all resources).
Also note that this HWM=100 for nodes happened significantly before we started any test on the cluster.
In other words, we seem to have problems even in an empty cluster without any user pods.
My hypothesis is that we have some slow watchers that are not able to consume Node events fast enough. But looking into the logs, I'm not seeing any watchers that were forced to break, so it seems none of them is terribly slow.
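(For readers unfamiliar with "forced to break": the cacher pushes events to each watcher through a bounded buffer, and a watcher that cannot keep up gets terminated instead of being allowed to stall everyone else. A toy sketch of that idea, with made-up names and timeouts rather than the real Cacher code:)

```go
package main

import (
	"fmt"
	"time"
)

// watcher models a single watch connection: events are pushed into a
// bounded channel; if the consumer cannot keep up, the buffer fills up.
type watcher struct {
	id     int
	events chan string
}

// dispatch delivers an event to every active watcher without blocking
// indefinitely. A watcher whose buffer stays full past a short grace
// period is dropped (roughly what "forced to break" means here).
func dispatch(watchers map[int]*watcher, event string) {
	for id, w := range watchers {
		select {
		case w.events <- event:
		case <-time.After(50 * time.Millisecond):
			fmt.Printf("watcher %d too slow, terminating it\n", id)
			close(w.events)
			delete(watchers, id)
		}
	}
}

func main() {
	fast := &watcher{id: 1, events: make(chan string, 100)}
	slow := &watcher{id: 2, events: make(chan string, 1)}
	watchers := map[int]*watcher{1: fast, 2: slow}

	// Only the fast watcher actually drains its channel.
	go func() {
		for range fast.events {
		}
	}()

	for i := 0; i < 5; i++ {
		dispatch(watchers, fmt.Sprintf("node-update-%d", i))
	}
	fmt.Printf("%d watcher(s) still active\n", len(watchers))
}
```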
With the trigger function in place, we are always considering only those 6, so this doesn't seem like much work to do. Additionally:
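(Rough picture of what the trigger function buys us, assuming the usual indexing of watchers by a trigger value, e.g. the node name for Pod watches; the data below is made up:)

```go
package main

import "fmt"

// A simplified view of trigger-based filtering in the watch cache:
// watchers are indexed by a "trigger" value, so an incoming event only
// has to be matched against the few watchers whose trigger value fits,
// rather than against every registered watcher.
type watcher struct{ id int }

func main() {
	// Watchers indexed by their trigger value (hypothetical data).
	byTrigger := map[string][]watcher{
		"node-1": {{id: 1}, {id: 2}},
		"node-2": {{id: 3}},
	}

	// An event whose trigger value is "node-1" is only considered by
	// the watchers registered under that key.
	eventTrigger := "node-1"
	candidates := byTrigger[eventTrigger]
	fmt.Printf("considering %d watcher(s) instead of all of them\n", len(candidates))
}
```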
BTW, note that there is some interesting backpressure happening here: more conflicts means fewer real updates, so fewer events to deliver, so more time to deliver them :) But obviously, it's not something we should count on.
The very good news is that it is extremely easy to reproduce: you just start a 5000-node kubemark and you're done (you don't even need to run any tests).
Further evidence that this is a regression is that in an empty cluster the apiserver is using ~24 cores.
BTW, we seem to be accumulating lag pretty fast. I added the following diff to the code:
and started kubemark-5000 with it. Then I ssh-ed to the kubemark master and ran these:
[the first are logs of successful POST/PUT/PATCH operations on nodes, the second of when we are dispatching an event]. And what I'm seeing is:
So after ~30 minutes, we are lagging by 5 minutes... What is interesting is that:
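(A trivial illustration of the lag computation above, with invented timestamps standing in for the actual log lines:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Invented timestamps: the latest successful PUT on a Node vs. the
	// latest event dispatched by the Node Cacher, in HH:MM:SS.micros form.
	const layout = "15:04:05.000000"
	lastWrite, _ := time.Parse(layout, "10:35:12.000000")
	lastDispatch, _ := time.Parse(layout, "10:30:07.000000")

	// Prints the lag between the two, here roughly 5 minutes.
	fmt.Printf("cacher lag: %v\n", lastWrite.Sub(lastDispatch))
}
```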
Well, I've just checked this run: there were ... This is 3-4 orders of magnitude more. Looking into the test logs should make it possible to find roughly when it happened.
So unfortunately, due to these problems, we are also missing logs from most of the kubemark-scale runs.
It was worse in runs 446 and 447, but still not as bad as it is now:
So there might have been two separate regressions. And the PRs that @liggitt mentioned above are plausible causes. We should verify this.
I just checked that when I was reproducing it on Friday, I was reproducing it from commit:
That means that the second PR, #45980, was not even merged yet at that point.
BTW, I think it's not clear from anything below, but the huge number of conflicts may also be a consequence, not the root cause. E.g. slow watch event processing may be the root cause (though I don't really have a hypothesis for why it could have regressed...)
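(To spell out that link: a client that reads a Node from a lagging cache works with a stale resourceVersion, so its update gets rejected with a conflict. A self-contained toy model of this optimistic-concurrency check, not the real apiserver code:)

```go
package main

import (
	"errors"
	"fmt"
)

// A minimal model of optimistic concurrency as used by the apiserver:
// every object carries a resourceVersion, and an update is rejected with
// a conflict if the caller's copy is stale.
type object struct {
	name            string
	resourceVersion int
}

type store struct {
	current object
}

var errConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

func (s *store) update(o object) error {
	if o.resourceVersion != s.current.resourceVersion {
		return errConflict
	}
	o.resourceVersion++
	s.current = o
	return nil
}

func main() {
	s := &store{current: object{name: "node-1", resourceVersion: 7}}

	// A controller reads the Node from a lagging cache and therefore
	// sees an old resourceVersion; its update conflicts.
	stale := object{name: "node-1", resourceVersion: 3}
	fmt.Println("update from lagging cache:", s.update(stale))

	// After re-reading the latest version, the update succeeds.
	fresh := s.current
	fmt.Println("update after re-read:", s.update(fresh))
}
```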
OK, so I reverted the second PR that @liggitt mentioned (on top of the same commit as on Friday), and I still have 100,000 conflicts after 5 minutes of the cluster running (not even running tests). So it seems those PRs are unrelated to the regression...
I did:
and ran 5000-kubemark from there, and still got millions of conflicts (after 30 minutes)
Hmm, but maybe I did it wrong...
@dchen1107 - FYI (as a release blocker)
@liggitt this sounds like your slow informer lag thing
So I have a culprit. Ironically, it is my own PR: #46588 (specifically its second commit).
#47082 has been sent out as a fix for it.
Automatic merge from submit-queue: Add logging to debug conflicts in kubemark-scale test. Ref #46851
Kubemark-scale is currently constantly failing. There are different reasons for it:
However, I looked into the logs, and it seems there is one common underlying root cause for all of them. The problem is:
millions of conflicts on Node objects
Those conflicts result in:
We should understand why those conflicts became so frequent and fix the problem.
This seems like a regression to me, so I'm putting it into the 1.7 milestone.
@kubernetes/sig-scalability-bugs