Make one-shot stats faster #46448
Conversation
Mirantis/cri-dockerd#107 should have made stats collection very fast.
To avoid such delays, Mirantis/cri-dockerd#38 set the concurrency for getting container stats to the number of containers; if the number of containers is far larger than the number of CPUs, that's not ideal.
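A common way to avoid one-goroutine-per-container fan-out is to bound the workers with a semaphore sized to the CPU count. The sketch below is illustrative only (`collectOne` and the container list are stand-ins, not moby's actual code):

```go
// Sketch: bound stats-collection concurrency to the CPU count instead of
// spawning one goroutine per container.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// collectOne is a stand-in for a single per-container stats read.
func collectOne(id int) int { return id * 2 }

func main() {
	containers := make([]int, 40) // e.g. 40 pod containers on the node
	for i := range containers {
		containers[i] = i
	}

	limit := runtime.GOMAXPROCS(0)       // cap workers at the CPU count
	sem := make(chan struct{}, limit)    // counting semaphore
	results := make([]int, len(containers))

	var wg sync.WaitGroup
	for i, c := range containers {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once `limit` workers run
		go func(i, c int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = collectOne(c)
		}(i, c)
	}
	wg.Wait()
	fmt.Println(results[39]) // 78
}
```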
Cache invalidation is famously one of the hard problems in computer science. This PR needs some more work on that front.
It seems like the collector is really being used for 2 things: …
For 1, I think we can use a …
Not sure how to move forward. The main purpose of this PR is to help …
The stats API has three modes of operation, summarized by the following table:

| `stream` | `one-shot` | Mode |
| --- | --- | --- |
| `true` | `false` | streaming: samples are pushed to the client until it disconnects |
| `false` | `false` | "two-shot": two samples are taken so a CPU-usage delta can be returned |
| `false` | `true` | one-shot: a single sample is returned |
For the streaming mode it seems reasonable to have a single collector which broadcasts results to all interested clients to amortize the cost of collection, as the clients expect not to have control over the sampling frequency or phase. Similarly, clients requesting "two-shot" stats expect there to be some latency, so the additional latency and jitter of using the collector to amortize the sampling costs is not a deal-breaker.

Then there are the clients that want more control over when a sample is taken. Prometheus exporters, for instance, are expected to capture a sample only upon request so that the scraper has full control over the sampling interval and phase, timestamps the sample using its own clock to sidestep issues with clock skew, has visibility into the end-to-end latency, and can compensate to reduce jitter. For one-shot stats requests to be suitable for that use case, the stats sample needs to be taken synchronously with the request and not coalesced with other requests.

Therefore I think what @xinfengliu has done in this PR to handle one-shot stats requests synchronously, without any single-flighting or queueing, is the right design choice.

On the topic of caching …

And wouldn't you know, tl;dr modify …
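The two paths discussed above can be illustrated with a toy Go model. All names here are illustrative, not moby's actual code: a shared collector broadcasts samples on an interval to streaming subscribers, while a one-shot request bypasses the collector and samples synchronously.

```go
// Toy model: broadcast collector for streaming clients vs. direct
// synchronous sampling for one-shot clients.
package main

import (
	"fmt"
	"time"
)

// sample is a stand-in for reading a container's stats once, synchronously.
func sample() int { return 42 }

// collector broadcasts a fresh sample to every subscriber once per interval;
// subscribers share the sampling cost but not its timing.
func collector(interval time.Duration, subs []chan int, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			v := sample()
			for _, ch := range subs {
				ch <- v
			}
		case <-stop:
			return
		}
	}
}

func main() {
	// Streaming clients share one collector (amortized cost, jittered phase).
	sub := make(chan int, 1)
	stop := make(chan struct{})
	go collector(10*time.Millisecond, []chan int{sub}, stop)
	fmt.Println("streamed:", <-sub)
	close(stop)

	// A one-shot client bypasses the collector entirely: no queueing,
	// and the sample is taken at request time.
	fmt.Println("one-shot:", sample())
}
```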
Thank you very much for the deep analysis. I will modify the PR per your suggestions. As for collecting stats inside …
Revert away!
@corhere …
I see no logic bugs in the code. There is still some room for further improvement, however.
This commit moves one-shot stats processing out of the publishing channels, i.e. collects stats directly. It also changes getSystemCPUUsage() on Linux to additionally return the number of online CPUs.

Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>
Thanks for your suggestions. I have made the modifications. As for …

BTW, the currently failing CI job seems unrelated to my PR, and I don't have permission to re-run only the failed CI jobs.
LGTM! I'm not super thrilled with where errors are handled and logged, but it is consistent with existing behaviour and the code is already way cleaner than before.
Signed-off-by: Xinfeng Liu <XinfengLiu@icloud.com>
@thaJeztah …
@xinfengliu let me defer that question to @cpuguy83 and @corhere, who had a much closer look into these changes. FWIW, we'll be working on 25.0 releases soon as well, so we'll have to look at which branches we want to backport to (if any); was there a specific reason you needed this in the 23.0 branch?
@xinfengliu open a backport PR targeting the 23.0 branch which cherry-picks these commits. (…
The backport PR for the 23.0 branch, #46617, has been created and CI has passed. @thaJeztah …
@xinfengliu can you also open a backport for the 24.0 branch? Otherwise 24.0 would "regress" compared to 23.0.
@thaJeztah Update: …
- What I did

The Kubernetes CRI interface `ListContainerStats` requires collecting stats for every pod container. Currently `daemon/stats/collector.go` runs serially and sleeps `1s` at the end of its `for` loop. This causes `ListContainerStats` to suffer long delays when there are many pod containers on the node. This PR addresses that issue.
- How I did it

- `time.Sleep` is not needed in the normal loop, since a condition variable is used; sleep only when a system-level error happens, then retry in the next loop.
- `time.Sleep(s.interval)` is still needed for `one-shot=0` or `stream=1`, otherwise the caller would receive successive stats without any interval.
- Move `bufReader` out of `Collector`, since it cannot be shared by multiple goroutines.
- Collect stats directly (synchronously) for `one-shot` requests.
- How to verify it
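One-shot stats can also be exercised directly against the Engine API (this assumes a local daemon socket and API version 1.41 or later; the container name below is a placeholder):

```shell
# Placeholder container name; substitute a real one.
CONTAINER=mycontainer
# stream=false + one-shot=true requests a single, synchronously collected sample.
URL="http://localhost/containers/${CONTAINER}/stats?stream=false&one-shot=true"
echo "Querying: $URL"
# Uncomment to run against a live daemon and measure latency:
# time curl --unix-socket /var/run/docker.sock "$URL"
```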
Run the PR build with `cri-dockerd` and observe the time taken for `ListContainerStats`. Without this code change, it could take over 3 seconds with about 40 pod containers on the node. With this fix, it takes less than 1 second. … `ListContainerStats` time to 2 seconds on the 4-CPU node. Collecting stats directly for `one-shot` requests reduces the `ListContainerStats` time to less than 200 ms. (The requests sent by `cri-dockerd` are `one-shot` requests.)
- Description for the changelog
Improves the performance of the stats collector
- A picture of a cute animal (not mandatory but encouraged)