cAdvisor /stats/summary endpoint in kubelet returns incorrect cpu usage numbers #27194
Comments
/cc @xiang90
/cc @vishh
@xiang90 @vishh is on vacation. I am guessing @timstclair is taking over while he is out?
Ok, we received several similar reports over different channels. It looks like there is a regression in the node monitoring pipeline, but @vishh and @timstclair are out this week. The rest of the node team will take a look.
A little clarification here:
@Random-Liu could you please see if we can reproduce the issue on GCE first? Then we can look more deeply.
I turned up a cluster and checked the cAdvisor reports; there is no issue on the cAdvisor side. This is good news. If there is an issue, it should be on the Kubelet side when generating the summary report.
This is definitely a cadvisor issue. I got the following data with ba5be34:
The script I use:
I think I know the root cause based on the data collected by @Random-Liu above, but I need to verify. UsageNanoCores was introduced to record the total CPU usage (sum of all cores) averaged over the sample window, but I couldn't find the code summing the usages across cores. In @Random-Liu's test, I think the node has 2 cores and the busyloop container is running on both cores. I believe cAdvisor does report the usage on both cores, but the summary only reports the first one here. In summary, it is a kubelet summary code bug, not a cAdvisor issue.
@Random-Liu Since I don't have the test environment ready for this yet, could you please help me quickly validate my theory? Please update your busyloop container's cpuset.cpus to 0. You can simply modify /sys/fs/cgroup/cpuset//cpuset.cpus from 0-1 to 0. Then run your stats collection script.
@dchen1107 - I sort of doubt that's the issue, since the summary just copies the field from the cAdvisor API: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/server/stats/summary.go#L289 Without looking too deep (vacation and all), how do the cumulative numbers look (
@timstclair you should be on vacation :-) Yes, I just saw the code; we simply use Total from the cAdvisor API. Also, @Random-Liu just mentioned to me that the initial report above was reporting node usage, not pod usage.
@timstclair @dchen1107 If we manually calculate with the cumulative number
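Purely as an illustration of that manual calculation (hypothetical names and types, not kubelet or cAdvisor code), the usage rate in nanocores can be derived from two cumulative UsageCoreNanoSeconds samples like this:

```go
package main

import (
	"fmt"
	"time"
)

// sample is a hypothetical pair: a cumulative CPU counter (nanoseconds of
// CPU time consumed, i.e. UsageCoreNanoSeconds) and the time it was scraped.
type sample struct {
	cumulativeNanoSeconds uint64
	timestamp             time.Time
}

// usageNanoCores derives the average usage over the window between two
// samples, expressed in nanocores (1e9 nanocores == one core fully used).
func usageNanoCores(prev, cur sample) uint64 {
	window := cur.timestamp.Sub(prev.timestamp)
	if window <= 0 || cur.cumulativeNanoSeconds < prev.cumulativeNanoSeconds {
		return 0 // counter reset or bad clock; skip this pair
	}
	delta := cur.cumulativeNanoSeconds - prev.cumulativeNanoSeconds
	// delta ns of CPU time / window ns of wall clock = fraction of a core;
	// multiply by 1e9 to express it in nanocores. Doing the division in
	// float64 sidesteps integer overflow for large deltas.
	return uint64(float64(delta) / float64(window.Nanoseconds()) * 1e9)
}

func main() {
	t0 := time.Now()
	prev := sample{cumulativeNanoSeconds: 50_000_000_000, timestamp: t0}
	cur := sample{cumulativeNanoSeconds: 60_000_000_000, timestamp: t0.Add(10 * time.Second)}
	fmt.Println(usageNanoCores(prev, cur)) // ~1e9, i.e. one full core busy
}
```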
@Random-Liu found the root cause here https://github.com/google/cadvisor/blob/master/info/v2/conversion.go#L209:
When valueDelta is too big, multiplying it by 1e9 makes it overflow.
Can we use https://golang.org/pkg/math/big/ in the above conversion code?
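As a rough sketch of the failure mode and of the math/big suggestion (illustrative only; naiveRate and bigRate are made-up names, not the actual cAdvisor conversion code, and uint64 arithmetic is assumed to match the stats types):

```go
package main

import (
	"fmt"
	"math/big"
)

// naiveRate mirrors the problematic pattern: multiplying the cumulative
// usage delta (nanoseconds of CPU time) by 1e9 before dividing can exceed
// the uint64 range when valueDelta is large, wrapping around silently.
func naiveRate(valueDelta, timeDeltaNs uint64) uint64 {
	return valueDelta * 1e9 / timeDeltaNs // wraps for large valueDelta
}

// bigRate does the same arithmetic with math/big, which cannot overflow.
func bigRate(valueDelta, timeDeltaNs uint64) uint64 {
	v := new(big.Int).SetUint64(valueDelta)
	v.Mul(v, big.NewInt(1e9))
	v.Div(v, new(big.Int).SetUint64(timeDeltaNs))
	return v.Uint64()
}

func main() {
	// ~100 CPU-seconds accumulated over a 10s window: valueDelta is 1e11 ns,
	// and 1e11 * 1e9 = 1e20 > 2^64-1 (~1.8e19), so the naive math wraps.
	valueDelta := uint64(100e9)
	timeDeltaNs := uint64(10e9)
	fmt.Println("naive:", naiveRate(valueDelta, timeDeltaNs)) // wrong (overflowed)
	fmt.Println("big:  ", bigRate(valueDelta, timeDeltaNs))   // 10000000000, i.e. 10 cores
}
```

Reordering the arithmetic (dividing before multiplying) or doing the intermediate step in float64 would also avoid the wrap-around, at the cost of some precision.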
Team, thanks for your attention! @dchen1107, do you mean the cpu_usage provided by heapster? If so, when I use any "source" provided by summary in heapster (=kubernetes.summary_api:'' or kubernetes.summary_api:https://kubernetes.default), the cpu/usage doesn't get populated! curl 10.125.7.224:31580/api/v1/model/namespaces/rubis-ns/pods/rubis-vhqc6/metrics/cpu/usage
@mwielgus Can you please help with debugging this? This seems important.
Can you please include logs from heapster?
@fgrzadkowski! Here are the logs from heapster: #kubectl logs heapster-p46du --namespace=kube-system
I am not sure if this is related, but I just fixed a bug in cAdvisor. I need to sweep cAdvisor to see if that pattern exists elsewhere.
I am running on v1.3.5 and this still seems to be an issue
Node version:
Looks like we have overflow in some other places in our stack.
@timstclair and I looked at @ichekrygin's node stats more closely together, and the status of Kubelet / cAdvisor looks sane. I believe the overflow is in heapster. I will close this one and open another one for heapster.
@dchen1107 - ha, I was grepping code for similar math that could be wrong and came to the same conclusion. Can you link to the heapster issue when opened?
@dchen1107 - for now I am commenting on kubernetes-retired/heapster#1168 (which is closed). I hope to get it re-opened - if not, I will create a new one.
@ichekrygin Thanks for pointing me to the proper heapster issue. I just opened #30939 and marked it for the 1.4 milestone. Thanks!
I'm facing a similar issue, but with kubelet. Here is the PromQL I use:
And here is the actual
@shamil That sounds like a separate issue; could you open a new one?
@timstclair isn't it related to
It is processed differently, and goes through a different pipeline. The original source of the numbers is the same, but I think these issues are not related.
OK, submitted #32414
Automatic merge from submit-queue: Rewrite summary e2e test to check metric sanity. Take two, forked from #28195. Adds a test library that extends the ginkgo matchers to check nested data structures, then uses the new matcher library to thoroughly check the validity of every field in the summary metrics API. This approach is more flexible than the previous one, since it allows for different tests per field and makes it easier to add case-by-case exceptions. It also places the lower & upper bounds side-by-side, making the test much easier to read & reason about. Most fields are expected to be within some bounds. This is not intended to be a performance test, so metric bounds are very loose; rather, I'm looking to check that the values are sane, to catch bugs like #27194. Fixes #23411, #31989. /cc @kubernetes/sig-node
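For context, here is a minimal plain-Go sketch of the "loose sanity bounds" idea behind that test (field names and limits are invented for illustration; the real test uses the extended ginkgo matchers described above):

```go
package main

import "fmt"

// bound is a loose sanity range for one summary field; the point is to catch
// wildly wrong values (like overflowed counters), not to benchmark the node.
type bound struct {
	name  string
	lower uint64
	upper uint64
}

// checkSanity reports every field whose value falls outside its loose bounds,
// with the lower and upper limits stated side-by-side.
func checkSanity(values map[string]uint64, bounds []bound) []string {
	var violations []string
	for _, b := range bounds {
		v, ok := values[b.name]
		if !ok {
			violations = append(violations, fmt.Sprintf("%s: missing", b.name))
			continue
		}
		if v < b.lower || v > b.upper {
			violations = append(violations,
				fmt.Sprintf("%s: %d outside [%d, %d]", b.name, v, b.lower, b.upper))
		}
	}
	return violations
}

func main() {
	values := map[string]uint64{
		// An overflowed UsageNanoCores value blows past any sane upper bound.
		"node.cpu.usageNanoCores": 7_766_279_631_452_241_920,
	}
	bounds := []bound{
		// Upper bound: 100 cores fully busy, far looser than any test node.
		{name: "node.cpu.usageNanoCores", lower: 0, upper: 100_000_000_000},
	}
	for _, v := range checkSanity(values, bounds) {
		fmt.Println("violation:", v)
	}
}
```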
Environment
Kubernetes version: 1.2.3
Docker version: 1.10.3
3-node (c4.xLarge) cluster on AWS running CoreOS 1010.4.0.
Issue
After facing an issue with incorrect metrics being reported by heapster (kubernetes-retired/heapster#1177), I tried querying the cAdvisor /stats/summary endpoint directly to see if that would give me consistent values for node CPU usage.
I have one pod with cpu request=1000m and limit=1000m. In that pod I run a busy loop to consume 100% of the CPU. This is what top shows on the node.
I query the /stats/summary endpoint every 5 seconds; however, it seems that the latest timestamps are only updated every 15 seconds or so. Checking the summary.Node.CPU.UsageNanoCores value from the summary returned gives me the following output (formatted):
As you can see, I'm not getting a steady report of near-100% CPU usage values for UsageNanoCores. Any idea why this might be the case, or how I can debug this issue? Also, is there any way I can change the resolution of the summary stats to get more fine-grained reporting?
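For anyone reproducing this, here is a minimal sketch of the kind of polling described above. It assumes the kubelet read-only port (10255) and the JSON field names of the Summary API at the time; the node address is a placeholder:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// summary mirrors only the fields we care about from /stats/summary; the
// JSON tags follow the kubelet Summary API, everything else is dropped.
type summary struct {
	Node struct {
		CPU struct {
			Time           time.Time `json:"time"`
			UsageNanoCores uint64    `json:"usageNanoCores"`
		} `json:"cpu"`
	} `json:"node"`
}

func main() {
	// Placeholder node address; replace with a real node IP or hostname.
	const nodeAddr = "10.0.0.1"
	url := fmt.Sprintf("http://%s:10255/stats/summary", nodeAddr)

	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("request failed:", err)
			time.Sleep(5 * time.Second)
			continue
		}
		var s summary
		err = json.NewDecoder(resp.Body).Decode(&s)
		resp.Body.Close()
		if err != nil {
			fmt.Println("decode failed:", err)
		} else {
			// Print the sample timestamp alongside the reported nanocores so
			// stale timestamps and jumping values are both visible.
			fmt.Printf("%s usageNanoCores=%d\n",
				s.Node.CPU.Time.Format(time.RFC3339), s.Node.CPU.UsageNanoCores)
		}
		time.Sleep(5 * time.Second)
	}
}
```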