
[Flaky Test] TestRegistrationHandler/manage-resource-slices of kubelet/cm/dra plugin is Flaking #129066

Open
Rajalakshmi-Girish opened this issue Dec 3, 2024 · 11 comments · May be fixed by #129114
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Comments

@Rajalakshmi-Girish
Contributor

Rajalakshmi-Girish commented Dec 3, 2024

Which jobs are flaking?

k8s Unit test Job.

Which tests are flaking?

TestRegistrationHandler/manage-resource-slices in k8s.io/kubernetes/pkg/kubelet/cm/dra/plugin
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/dra/plugin/registration_test.go#L115

Reason for failure (if possible)

--- FAIL: TestRegistrationHandler (3.38s)
    --- FAIL: TestRegistrationHandler/manage-resource-slices (3.38s)
        registration_test.go:149: 
            	Error Trace:	/root/kubernetes/pkg/kubelet/cm/dra/plugin/registration_test.go:149
            	            				/root/kubernetes/staging/src/k8s.io/client-go/testing/fixture.go:882
            	            				/root/kubernetes/staging/src/k8s.io/client-go/testing/fake.go:145
            	            				/root/kubernetes/staging/src/k8s.io/client-go/gentype/fake.go:234
            	            				/root/kubernetes/pkg/kubelet/cm/dra/plugin/registration.go:116
            	            				/root/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/wait.go:154
            	            				/root/kubernetes/staging/src/k8s.io/apimachinery/pkg/util/wait/backoff.go:485
            	            				/root/kubernetes/pkg/kubelet/cm/dra/plugin/registration.go:102
            	            				/usr/local/go/src/runtime/asm_amd64.s:1700
            	Error:      	Not equal: 
            	            	expected: "spec.nodeName=worker"
            	            	actual  : "spec.driver=pluginB,spec.nodeName=worker"
            	            	
            	            	Diff:
            	            	--- Expected
            	            	+++ Actual
            	            	@@ -1 +1 @@
            	            	-spec.nodeName=worker
            	            	+spec.driver=pluginB,spec.nodeName=worker
            	Test:       	TestRegistrationHandler/manage-resource-slices
            	Messages:   	field selector in DeleteCollection
FAIL
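To illustrate the mismatch above: the assertion compares the field selector string passed to DeleteCollection. The sketch below is stdlib-only and mimics how apimachinery renders a field set (key=value pairs sorted by key, comma-joined); `selectorString` is a hypothetical helper, not the actual k8s.io/apimachinery code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// selectorString mimics how a fields.Set renders to a selector string:
// "key=value" pairs sorted by key and joined with commas.
func selectorString(set map[string]string) string {
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+set[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// What the test expects: a node-wide wipe of ResourceSlices.
	fmt.Println(selectorString(map[string]string{
		"spec.nodeName": "worker",
	}))
	// What the failing run observed: a per-driver delete for pluginB.
	fmt.Println(selectorString(map[string]string{
		"spec.nodeName": "worker",
		"spec.driver":   "pluginB",
	}))
}
```

So the failure indicates the handler issued a per-driver DeleteCollection (filtered by spec.driver) at a point where the test expected only the node-wide cleanup call, which is consistent with a timing race between deregistration and the wiping goroutine.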

Anything else we need to know?

TestRegistrationHandler failure

[root@raji-x86-workspace1 kubernetes]# stress ./plugin.test -test.run TestRegistrationHandler
5s: 16 runs so far, 0 failures, 16 active
10s: 32 runs so far, 0 failures, 16 active
15s: 48 runs so far, 0 failures, 16 active
20s: 64 runs so far, 0 failures, 16 active
25s: 80 runs so far, 0 failures, 16 active
30s: 96 runs so far, 0 failures, 16 active
35s: 113 runs so far, 0 failures, 16 active
40s: 131 runs so far, 0 failures, 16 active
45s: 160 runs so far, 0 failures, 16 active
50s: 176 runs so far, 0 failures, 16 active
55s: 193 runs so far, 0 failures, 16 active
1m0s: 210 runs so far, 0 failures, 16 active
1m5s: 226 runs so far, 0 failures, 16 active
1m10s: 242 runs so far, 0 failures, 16 active
1m15s: 258 runs so far, 0 failures, 16 active
1m20s: 275 runs so far, 0 failures, 16 active
1m25s: 303 runs so far, 0 failures, 16 active

/tmp/go-stress-20241202T035110-241152069
I1202 03:52:35.497329 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1202 03:52:35.497841 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1202 03:52:35.497928 1309458 registration.go:184] "DRA plugin already registered, the old plugin was replaced and will be forgotten by the kubelet till the next kubelet restart" logger="DRA registration handler" pluginName="pluginA" oldEndpoint="endpointA"
I1202 03:52:35.497993 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1202 03:52:35.498078 1309458 registration.go:226] "Deregister DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1202 03:52:35.498146 1309458 registration.go:237] "Deregister DRA plugin not necessary, was already removed" logger="DRA registration handler"
I1202 03:52:35.498436 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1202 03:52:35.498518 1309458 registration.go:184] "DRA plugin already registered, the old plugin was replaced and will be forgotten by the kubelet till the next kubelet restart" logger="DRA registration handler" pluginName="pluginA" oldEndpoint="endpointA"
I1202 03:52:35.498578 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1202 03:52:35.498659 1309458 registration.go:226] "Deregister DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1202 03:52:35.498715 1309458 registration.go:237] "Deregister DRA plugin not necessary, was already removed" logger="DRA registration handler"
I1202 03:52:35.498967 1309458 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1202 03:52:35.499036 13
…
1m30s: 322 runs so far, 1 failures (0.31%), 16 active

Relevant SIG(s)

/sig testing

@Rajalakshmi-Girish Rajalakshmi-Girish added the kind/flake Categorizes issue or PR as related to a flaky test. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 3, 2024
@Rajalakshmi-Girish
Contributor Author

@pohly any clue on this, please?

@pohly
Contributor

pohly commented Dec 4, 2024

Has this also been seen in some CI run?

@pohly
Contributor

pohly commented Dec 4, 2024

/cc @bart0sh

@pohly
Contributor

pohly commented Dec 4, 2024

/wg device-management
/sig node
/remove-sig testing

@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 4, 2024
@bart0sh
Contributor

bart0sh commented Dec 4, 2024

/assign

@bart0sh
Contributor

bart0sh commented Dec 4, 2024

@Rajalakshmi-Girish Is this reproducible often enough in your environment? Which commit did you use?
I was not able to reproduce it in mine using recent master:

commit 2b472fe4690c83a2b343995f88050b2a3e9ff0fa (HEAD -> master, upstream/master)
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Dec 3 19:07:38 2024 +0000

    CHANGELOG: Update directory for v1.32.0-rc.1 release

Here is the output so far:

ed@devel ~/go/src/k8s.io/kubernetes/pkg/kubelet (master) $ ~/go/bin/stress ./plugin.test -test.run TestRegistrationHandler
5s: 48 runs so far, 0 failures, 24 active
10s: 96 runs so far, 0 failures, 24 active
15s: 144 runs so far, 0 failures, 24 active
20s: 216 runs so far, 0 failures, 24 active
25s: 264 runs so far, 0 failures, 24 active
30s: 312 runs so far, 0 failures, 24 active
35s: 360 runs so far, 0 failures, 24 active
40s: 432 runs so far, 0 failures, 24 active
45s: 480 runs so far, 0 failures, 24 active
50s: 528 runs so far, 0 failures, 24 active
55s: 577 runs so far, 0 failures, 24 active
1m0s: 648 runs so far, 0 failures, 24 active
1m5s: 696 runs so far, 0 failures, 24 active
1m10s: 745 runs so far, 0 failures, 24 active
1m15s: 793 runs so far, 0 failures, 24 active
1m20s: 864 runs so far, 0 failures, 24 active
1m25s: 912 runs so far, 0 failures, 24 active
1m30s: 961 runs so far, 0 failures, 24 active
1m35s: 1022 runs so far, 0 failures, 24 active
1m40s: 1080 runs so far, 0 failures, 24 active
1m45s: 1129 runs so far, 0 failures, 24 active
1m50s: 1177 runs so far, 0 failures, 24 active
1m55s: 1248 runs so far, 0 failures, 24 active
2m0s: 1297 runs so far, 0 failures, 24 active
2m5s: 1345 runs so far, 0 failures, 24 active
2m10s: 1393 runs so far, 0 failures, 24 active
2m15s: 1465 runs so far, 0 failures, 24 active
2m20s: 1513 runs so far, 0 failures, 24 active
2m25s: 1561 runs so far, 0 failures, 24 active
2m30s: 1609 runs so far, 0 failures, 24 active
2m35s: 1681 runs so far, 0 failures, 24 active
2m40s: 1729 runs so far, 0 failures, 24 active
2m45s: 1777 runs so far, 0 failures, 24 active
2m50s: 1825 runs so far, 0 failures, 24 active
2m55s: 1897 runs so far, 0 failures, 24 active
3m0s: 1945 runs so far, 0 failures, 24 active
3m5s: 1993 runs so far, 0 failures, 24 active
3m10s: 2041 runs so far, 0 failures, 24 active
3m15s: 2113 runs so far, 0 failures, 24 active
3m20s: 2161 runs so far, 0 failures, 24 active
3m25s: 2209 runs so far, 0 failures, 24 active
3m30s: 2263 runs so far, 0 failures, 24 active
3m35s: 2329 runs so far, 0 failures, 24 active
3m40s: 2377 runs so far, 0 failures, 24 active
3m45s: 2425 runs so far, 0 failures, 24 active
3m50s: 2497 runs so far, 0 failures, 24 active
3m55s: 2545 runs so far, 0 failures, 24 active
4m0s: 2593 runs so far, 0 failures, 24 active
4m5s: 2641 runs so far, 0 failures, 24 active
4m10s: 2713 runs so far, 0 failures, 24 active
4m15s: 2761 runs so far, 0 failures, 24 active
4m20s: 2809 runs so far, 0 failures, 24 active
4m25s: 2857 runs so far, 0 failures, 24 active
4m30s: 2929 runs so far, 0 failures, 24 active
4m35s: 2977 runs so far, 0 failures, 24 active
4m40s: 3025 runs so far, 0 failures, 24 active
4m45s: 3073 runs so far, 0 failures, 24 active
4m50s: 3145 runs so far, 0 failures, 24 active
4m55s: 3193 runs so far, 0 failures, 24 active
5m0s: 3241 runs so far, 0 failures, 24 active
5m5s: 3290 runs so far, 0 failures, 24 active
5m10s: 3361 runs so far, 0 failures, 24 active
5m15s: 3409 runs so far, 0 failures, 24 active
5m20s: 3457 runs so far, 0 failures, 24 active
5m25s: 3529 runs so far, 0 failures, 24 active
5m30s: 3577 runs so far, 0 failures, 24 active
5m35s: 3625 runs so far, 0 failures, 24 active
5m40s: 3674 runs so far, 0 failures, 24 active
5m45s: 3745 runs so far, 0 failures, 24 active
5m50s: 3793 runs so far, 0 failures, 24 active
5m55s: 3842 runs so far, 0 failures, 24 active
6m0s: 3890 runs so far, 0 failures, 24 active
6m5s: 3961 runs so far, 0 failures, 24 active

I'll keep it running.

@bart0sh
Contributor

bart0sh commented Dec 4, 2024

@Rajalakshmi-Girish

k8s Unit test Job.

Can you point me to the testgrid for this job?

@Rajalakshmi-Girish
Contributor Author

@bart0sh Yes, we are seeing this flake quite often in our job.
TestGrid link: https://testgrid.k8s.io/ibm-k8s-unit-tests-ppc64le#periodic-k8s-unit-tests-ppc64le

I am able to reproduce this flake even on an x86 environment.
I hope you used the -race flag while compiling and generating plugin.test.

@bart0sh
Contributor

bart0sh commented Dec 5, 2024

@Rajalakshmi-Girish Yes, I used the -race flag.

Eventually I was able to reproduce the issue on my x86 setup, but it was not easy. It broke after 24 minutes of stress runs:

...
24m10s: 8369 runs so far, 0 failures, 24 active
24m15s: 8393 runs so far, 0 failures, 24 active

/tmp/go-stress-20241205T131045-1140091016
I1205 13:34:59.712641   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1205 13:34:59.713251   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1205 13:34:59.713374   81159 registration.go:184] "DRA plugin already registered, the old plugin was replaced and will be forgotten by the kubelet till the next kubelet restart" logger="DRA registration handler" pluginName="pluginA" oldEndpoint="endpointA"
I1205 13:34:59.713465   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1205 13:34:59.713574   81159 registration.go:226] "Deregister DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1205 13:34:59.713701   81159 registration.go:237] "Deregister DRA plugin not necessary, was already removed" logger="DRA registration handler"
I1205 13:34:59.713991   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1205 13:34:59.714090   81159 registration.go:184] "DRA plugin already registered, the old plugin was replaced and will be forgotten by the kubelet till the next kubelet restart" logger="DRA registration handler" pluginName="pluginA" oldEndpoint="endpointA"
I1205 13:34:59.714174   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1205 13:34:59.714260   81159 registration.go:226] "Deregister DRA plugin" logger="DRA registration handler" pluginName="pluginB" endpoint="endpointB"
I1205 13:34:59.714310   81159 registration.go:237] "Deregister DRA plugin not necessary, was already removed" logger="DRA registration handler"
I1205 13:34:59.714595   81159 registration.go:154] "Register new DRA plugin" logger="DRA registration handler" pluginName="pluginA" endpoint="endpointA"
I1205 13:34:59.714665   
…
24m20s: 8418 runs so far, 1 failures (0.01%), 24 active
24m25s: 8443 runs so far, 1 failures (0.01%), 24 active
...

@bart0sh bart0sh moved this from Triage to Issues - In progress in SIG Node CI/Test Board Dec 5, 2024
@bart0sh bart0sh moved this from 🆕 New to 🏗 In progress in SIG Node: Dynamic Resource Allocation Dec 5, 2024
@bart0sh
Contributor

bart0sh commented Dec 5, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 5, 2024
@bart0sh bart0sh linked a pull request Dec 9, 2024 that will close this issue
@bart0sh bart0sh moved this from 🏗 In progress to 👀 In review in SIG Node: Dynamic Resource Allocation Dec 16, 2024