-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failing kubelet /stats/summary API and ResourceMetricsAPI tests on cgroup v2 #99230
Comments
/sig node |
/assign @fromanirh |
@ehashman: GitHub didn't allow me to assign the following users: fromanirh. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign |
still working my way towards a reproducer. |
I've opened google/cadvisor#2837 to fix one issue I've seen. I think we will also need to modify libcontainer to return memory stats from /cc @kolyshkin |
another fix for cadvisor: google/cadvisor#2839 |
for completeness, the patch for runc/libcontainer: opencontainers/runc#2873 |
this should be fixed with #102483 |
@harche can we close now? |
Test is still failing. See testgrid: https://testgrid.k8s.io/sig-node-cri-o#pr-crio-cgrpv2-gce-e2e |
hm.. it is failing on:
not sure how to deal with this test. Should we handle it differently for cgroup v2? Or just enable it for cgroup v1? @kolyshkin @odinuge what do you think? |
I don't know how changes in conformance test should be done. The issue in this case is that for cgroup v1, Updating the value in the test is the "simplest" fix, but it doesn't solve the underlying difference in metrics when switching to cgroup v2 (eg. if this cause someones cluster to go wilde du to the "extreme" increase in this value). Who is responsible for these metrics/stats, and who should decide what to do? sig-node? I don't really know what to do.. Maybe @ehashman has some thoughts? |
I am not sure how that metric was used on cgroup v1. We could hardcode it to 0 for cgroup v2, but I don't think it is a good idea as it finally seems to make sense :-) Is it just something that should be documented? |
No idea myself. A quick test on my rpi using cgroup v1 I get this:
Where the first non-null value is Not sure who "own" this.. 🤔 |
on cgroup v2 the reported metric is recursive for the entire and it includes all the sub cgroups. Adjust the test accordingly. Closes: kubernetes#99230 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Thanks for trying that. Opened a PR for comments: #102726 If that is not the right approach, the only alternative I see is to hardcode the value to 0. |
Also, the structs were introduced here: #18544 /cc @tallclair |
I think changing the test is the right thing to do. It is a breaking change to change the metric, but it wasn't particularly useful before. Also, we won't be able to hide the change to the metric for pod cgroup metrics, so we might as well change it for system containers as well. |
Ahh, sounds good. Thanks! |
@harche, I'm trying to port the PR to 1.20 version which we're using with cgroups v2. Could you please give more information about the failing tests? such as detailed failure messages. Thanks very much! Actually, I got the same error when running the test against upstream master with this PR and 1.20: upstream
1.20
|
on cgroup v2 the reported metric is recursive for the entire and it includes all the sub cgroups. Adjust the test accordingly. Closes: kubernetes#99230 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Which jobs are failing:
pull-kubernetes-node-crio-cgrpv2-e2e
Which test(s) are failing:
E2eNode Suite: [k8s.io] Summary API [NodeConformance] when querying /stats/summary should report resource usage through the stats api
E2eNode Suite: [k8s.io] ResourceMetricsAPI [NodeFeature:ResourceMetrics] when querying /resource/metrics should report resource usage through the resource metrics api
Since when has it been failing:
From what I have observed these tests have always failed on cgroup v2 host.
Testgrid link:
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/99110/pull-kubernetes-node-crio-cgrpv2-e2e/1361561789094957056/
Reason for failure:
cgroup v2 has a different set of files compared to cgroup v1 to report the metrics. A lot of places in cadvisor, where the metrics end up getting read from, still assume cgroup v1 paths and files.
Anything else we need to know:
Some of the following PRs end up addressing the issue partially, but not completely.
google/cadvisor#2801
google/cadvisor#2800
The text was updated successfully, but these errors were encountered: