Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "Should set the condition rejected when the provision fails" test. #3170

Conversation

mbobrovskyi
Copy link
Contributor

@mbobrovskyi mbobrovskyi commented Oct 1, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Fix "Provisioning when A workload is using a provision admission check Should set the condition rejected when the provision fails" flaky test.

Which issue(s) this PR fixes:

Fixes #3079

Special notes for your reviewer:

After investigating the logs in https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/3078/pull-kueue-test-integration-main/1836042561878233088, I found that the failure occurred because the LocalQueue reconciler ran after the Workload reserved quota. As a result, we now have QuotaReserved=False with the message "Workload is inadmissible because of missing LocalQueue."

STEP: Setting the quota reservation to the workload @ 09/17/24 14:06:29.885
  2024-09-17T14:06:29.886934773Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:653	Workload update event	{"workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "queue": "queue", "status": "pending"}
  2024-09-17T14:06:29.887149146Z	LEVEL(-2)	workload-reconciler	core/workload_controller.go:679	Queue for updated workload didn't exist; ignoring for now	{"workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "queue": "queue", "status": "pending"}
  2024-09-17T14:06:29.887557531Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "wl", "reconcileID": "17e56ed7-bcba-499a-be20-d3c715dc3d81", "workload": {"name":"wl","namespace":"provisioning-8pcjl"}}
  2024-09-17T14:06:29.887680353Z	LEVEL(-3)	core/workload_controller.go:283	Workload is inadmissible because of missing LocalQueue	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "wl", "reconcileID": "17e56ed7-bcba-499a-be20-d3c715dc3d81", "workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "localQueue": {"name":"queue","namespace":"provisioning-8pcjl"}}

...

  2024-09-17T14:06:29.957435239Z	LEVEL(-3)	core/workload_controller.go:283	Workload is inadmissible because of missing LocalQueue	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "wl", "reconcileID": "1826962b-9aa6-41f0-987c-5f93b3aaadc1", "workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "localQueue": {"name":"queue","namespace":"provisioning-8pcjl"}}
  2024-09-17T14:06:30.004706085Z	LEVEL(-2)	localqueue-reconciler	core/localqueue_controller.go:132	LocalQueue create event	{"localQueue": {"name":"queue","namespace":"provisioning-8pcjl"}}
  2024-09-17T14:06:30.005215351Z	LEVEL(-2)	core/workload_controller.go:144	Reconciling Workload	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "wl", "reconcileID": "88ce6e83-ef9a-4064-9dd8-741b776683d1", "workload": {"name":"wl","namespace":"provisioning-8pcjl"}}
  2024-09-17T14:06:30.005309193Z	LEVEL(-2)	core/localqueue_controller.go:102	Reconciling LocalQueue	{"controller": "localqueue", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "LocalQueue", "LocalQueue": {"name":"queue","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "queue", "reconcileID": "e57cae68-def8-43a0-9ebf-d495a0a73b85", "localQueue": {"name":"queue","namespace":"provisioning-8pcjl"}}
  2024-09-17T14:06:30.005400874Z	LEVEL(-3)	core/workload_controller.go:301	Workload is inadmissible because ClusterQueue is inactive	{"controller": "workload", "controllerGroup": "kueue.x-k8s.io", "controllerKind": "Workload", "Workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "namespace": "provisioning-8pcjl", "name": "wl", "reconcileID": "88ce6e83-ef9a-4064-9dd8-741b776683d1", "workload": {"name":"wl","namespace":"provisioning-8pcjl"}, "clusterQueue": {"name":"cluster-queue"}}

case !lqExists:
log.V(3).Info("Workload is inadmissible because of missing LocalQueue", "localQueue", klog.KRef(wl.Namespace, wl.Spec.QueueName))
if workload.UnsetQuotaReservationWithCondition(&wl, kueue.WorkloadInadmissible, fmt.Sprintf("LocalQueue %s doesn't exist", wl.Spec.QueueName)) {
err := workload.ApplyAdmissionStatus(ctx, r.client, &wl, true)
return ctrl.Result{}, client.IgnoreNotFound(err)
}

As seen in the logs above, even after the LocalQueue becomes active, we still have Active=False condition on the ClusterQueue. The workload status is now "Workload is inadmissible because ClusterQueue is inactive," due to a pending admission check.

The solution is to wait for the ClusterQueue and LocalQueue to become active and move pending admission to another When block.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 1, 2024
Copy link

netlify bot commented Oct 1, 2024

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit fce7ba8
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66fbd9f562271e00081175d5
😎 Deploy Preview https://deploy-preview-3170--kubernetes-sigs-kueue.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 1, 2024
@mbobrovskyi
Copy link
Contributor Author

/cc @mimowo

@k8s-ci-robot k8s-ci-robot requested a review from mimowo October 1, 2024 12:35
@mimowo
Copy link
Contributor

mimowo commented Oct 1, 2024

The code changes lgtm, but I'm not sure how they fix the issue. Let me ask some questions to better understand the problem.

  1. How the moved test "Should reject an admission check if a workload is not Admitted, the ProvisioningRequest's condition is set to BookingExpired, and there is no retries left" affected that the other test "Should set the condition rejected when the provision fails" flaked?
  2. Is it related to the recent change to reuse kueue manager across tests (if so, how)? Or could the flake also exist before?

@mbobrovskyi
Copy link
Contributor Author

mbobrovskyi commented Oct 1, 2024

How the moved test "Should reject an admission check if a workload is not Admitted, the ProvisioningRequest's condition is set to BookingExpired, and there is no retries left" affected that the other test "Should set the condition rejected when the provision fails" flaked?

Most of the tests fail after waiting for LocalQueue and ClusterQueue. This happens because LocalQueue and ClusterQueue are always inadmissible due to a pending admission check.

I would say that it's very strange that we haven't had issues with this before. :).

To fix this, we can move this test to a separate When block, as it is the only test using pending admission.

@mbobrovskyi
Copy link
Contributor Author

Is it related to the recent change to reuse kueue manager across tests (if so, how)? Or could the flake also exist before?

Yes, this happens before it.

@mbobrovskyi
Copy link
Contributor Author

The code changes lgtm, but I'm not sure how they fix the issue.

I tried to move creation of LocalQueue after quota reservation to mimic the current scenario and I got the same error that we have on prow.

@mimowo
Copy link
Contributor

mimowo commented Oct 1, 2024

Most of the tests fail after waiting for LocalQueue and ClusterQueue. This happens because LocalQueue and ClusterQueue are always inadmissible due to a pending admission check.

But the admission check is eventually moved to Ready? And so, LocalQueue and ClusterQueue are moved to active.
Could it be a race condition, that too many things had to happen within 5s? Adding the steps (as proposed in the diff) to wait for LocalQueue, and ClusterQueue could help then.

To fix this, we can move this test to a separate When block, as it is the only test using pending admission.

I see, makes sense to have only one test for Pending admission check. This will make the other tests run faster also.
The only thing is that in the extracted test we may also need to wait for LocalQueue and ClusterQueue to be active explicitly. Otherwise, the extracted test could still fail?

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
I synced with @mbobrovskyi and I agree that the PR has a good potential to eliminate the flake (though it wasn't confirmed by actual experiment, just static log analysis).

The change also makes it easier to reason about the scenarios by splitting the positive setup (CQ active), and the negative setup (CQ remains inactive due to pending admission check).

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 1, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 4da3c6d68c37b154c8ed8a78b442d6259cb1ff42

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mbobrovskyi, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2024
@k8s-ci-robot k8s-ci-robot merged commit 2f14cb3 into kubernetes-sigs:main Oct 1, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Oct 1, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
3 participants