testing: Fix pod delete timeout failures after InPlacePodVerticalScaling Graduate to Beta commit #128880
base: master
Conversation
8617463 to 55ad72d
@esotsal: GitHub didn't allow me to request PR reviews from the following users: hshiina. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/triage accepted
/test pull-kubernetes-e2e-capz-windows-master
/test
@kannon92: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test pull-kubernetes-node-kubelet-serial-containerd
/hold
Let's confirm that this fixes the CI issues before merge.
/test pull-kubernetes-node-kubelet-serial-containerd
test/e2e/common/node/pod_resize.go (outdated)
@@ -1006,7 +1008,7 @@ func doPodResizeTests(f *framework.Framework) {
 	}

 	ginkgo.By("deleting pod")
-	framework.ExpectNoError(podClient.Delete(ctx, newPod.Name, metav1.DeleteOptions{}))
+	podClient.DeleteSync(ctx, newPod.Name, metav1.DeleteOptions{}, timeouts.PodDelete)
why is the Delete call erroring now?
why should we ignore the error?
I agree using DeleteSync is better and very likely to fix, or at least significantly improve, the flakiness/failures we observed lately. But I also agree we should not ignore the error, or at the very least we should clearly document why.
We don't ignore the error, DeleteSync handles it internally, or does it?
kubernetes/test/e2e/framework/pod/pod_client.go
Lines 217 to 225 in c9092f6
// DeleteSync deletes the pod and wait for the pod to disappear for `timeout`. If the pod doesn't
// disappear before the timeout, it will fail the test.
func (c *PodClient) DeleteSync(ctx context.Context, name string, options metav1.DeleteOptions, timeout time.Duration) {
	err := c.Delete(ctx, name, options)
	if err != nil && !apierrors.IsNotFound(err) {
		framework.Failf("Failed to delete pod %q: %v", name, err)
	}
	framework.ExpectNoError(WaitForPodNotFoundInNamespace(ctx, c.f.ClientSet, name, c.namespace, timeout), "wait for pod %q to disappear", name)
}
My bad, podClient.DeleteSync checks for the error internally:
kubernetes/test/e2e/framework/pod/pod_client.go
Lines 219 to 225 in c9092f6 (same DeleteSync implementation quoted above)
/lgtm
Got it, thanks for the extra explanation @esotsal.
Could we replace the command in a pod for resize tests with InfiniteSleepCommand, or insert a trap for handling SIGTERM? I think this blocks pod deletion for 30 seconds:
kubernetes/test/e2e/framework/pod/resize.go
Line 161 in c9092f6
cmd := "grep Cpus_allowed_list /proc/self/status | cut -f2 && sleep 1d"
If the pods terminate faster that would help a lot, and we can and should apply the same fix in quite a few other places, but we will still need guaranteed cleanup between tests in the lanes marked serial (or we need to rethink a bunch of the resource management tests from scratch).
@tallclair I think stability should be more important than test duration. Yes, the flakes are due to undeleted pods, please check here.
Does this mean that we do not have to revert the promotion #128875?
We still need to fix #128917 (although it's an easy fix), and I don't think the windows test failures are due to undeleted pods (though I could be wrong).
Could we replace the command in a pod for resize tests with InfiniteSleepCommand or insert a trap for handling SIGTERM? I think this blocks pod deletion for 30 seconds:
kubernetes/test/e2e/framework/pod/resize.go
Line 161 in c9092f6
cmd := "grep Cpus_allowed_list /proc/self/status | cut -f2 && sleep 1d"

Missed this one, sorry @hshiina. Acknowledged, will give it a try as well.
@@ -549,6 +550,9 @@ func containerSucceeded(c *v1.Container, podStatus *kubecontainer.PodStatus) bool {
 	}

+func IsInPlacePodVerticalScalingAllowed(pod *v1.Pod) bool {
+	if runtime.GOOS == "windows" {
This looks darn important!!!
And ideally, this shouldn't be a runtime check. Isn't there something to mark a test as "does not work on Windows"?
We may move this function IsInPlacePodVerticalScalingAllowed to pkg/kubelet/kuberuntime/kuberuntime_manager_linux.go and (pkg/kubelet/kuberuntime/kuberuntime_manager_windows.go or kuberuntime_manager_others.go). Add "[LinuxOnly]" to the e2e test?
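A minimal sketch of that split, assuming a platform-gating helper whose name is invented here for illustration (the real function name, file layout, and any feature-gate checks may differ):

// kuberuntime_manager_linux.go (sketch)
//go:build linux

package kuberuntime

// inPlacePodVerticalScalingSupportedByOS reports whether the host OS supports
// in-place pod resize; on Linux it does.
func inPlacePodVerticalScalingSupportedByOS() bool { return true }

// kuberuntime_manager_windows.go / kuberuntime_manager_others.go (sketch)
//go:build !linux

package kuberuntime

// On non-Linux platforms (e.g. Windows) in-place pod resize is not supported;
// IsInPlacePodVerticalScalingAllowed could call this helper instead of
// checking runtime.GOOS at run time.
func inPlacePodVerticalScalingSupportedByOS() bool { return false }

With build constraints the compiler selects the right file per platform, and the e2e test could additionally carry "[LinuxOnly]" so it is skipped on Windows CI lanes.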
@@ -1006,7 +1006,7 @@ func doPodResizeTests(f *framework.Framework) {
 	}

 	ginkgo.By("deleting pod")
-	framework.ExpectNoError(podClient.Delete(ctx, newPod.Name, metav1.DeleteOptions{}))
+	podClient.DeleteSync(ctx, newPod.Name, metav1.DeleteOptions{}, f.Timeouts.PodDelete)
Are we OK to skip checking the error here?
We don't ignore the error, DeleteSync handles it internally:
kubernetes/test/e2e/framework/pod/pod_client.go
Lines 217 to 225 in c9092f6
// DeleteSync deletes the pod and wait for the pod to disappear for `timeout`. If the pod doesn't
// disappear before the timeout, it will fail the test.
func (c *PodClient) DeleteSync(ctx context.Context, name string, options metav1.DeleteOptions, timeout time.Duration) {
	err := c.Delete(ctx, name, options)
	if err != nil && !apierrors.IsNotFound(err) {
		framework.Failf("Failed to delete pod %q: %v", name, err)
	}
	framework.ExpectNoError(WaitForPodNotFoundInNamespace(ctx, c.f.ClientSet, name, c.namespace, timeout), "wait for pod %q to disappear", name)
}
Ack. I hadn't noticed the switch from Delete to DeleteSync.
Nice thanks @esotsal
FWIW, DeleteSync is not aligned with our E2E framework guidelines (it should return an error), but it is what it is.
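For illustration only, a guideline-conforming variant would propagate the error instead of failing inside the helper. The DeleteSyncE name below is hypothetical, not an existing framework function; it reuses the calls from the DeleteSync snippet quoted above and would additionally need the fmt import:

// Hypothetical sketch: an error-returning counterpart to DeleteSync, letting the
// caller decide how to handle delete/wait failures.
func (c *PodClient) DeleteSyncE(ctx context.Context, name string, options metav1.DeleteOptions, timeout time.Duration) error {
	if err := c.Delete(ctx, name, options); err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("failed to delete pod %q: %w", name, err)
	}
	if err := WaitForPodNotFoundInNamespace(ctx, c.f.ClientSet, name, c.namespace, timeout); err != nil {
		return fmt.Errorf("wait for pod %q to disappear: %w", name, err)
	}
	return nil
}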
@@ -115,7 +115,7 @@ var _ = SIGDescribe("Container Lifecycle Hook", func() {
 		}, postStartWaitTimeout, podCheckInterval).Should(gomega.BeNil())
 	}
 	ginkgo.By("delete the pod with lifecycle hook")
-	podClient.DeleteSync(ctx, podWithHook.Name, *metav1.NewDeleteOptions(15), e2epod.DefaultPodDeletionTimeout)
+	podClient.DeleteSync(ctx, podWithHook.Name, *metav1.NewDeleteOptions(15), f.Timeouts.PodDelete)
Can you please call out what is the difference between the two timeouts and highlight it in the PR text?
I am guessing the new one is longer than the old one?
Yes:
DefaultPodDeletionTimeout = 3 * time.Minute
vs.
kubernetes/test/e2e/framework/timeouts.go
Line 26 in 35d098a
PodDelete: 5 * time.Minute,
please add this info to the PR text
Done
test/e2e_node/criproxy_test.go (outdated)
ginkgo.AfterEach(func() {
	err := resetCRIProxyInjector(e2eCriProxy)
	framework.ExpectNoError(err)
ginkgo.DeferCleanup(func() {
this pattern is way better. thanks!
f1b71b9 to 4cf1a3d
/test pull-kubernetes-node-kubelet-serial-containerd
It looks like currently no job runs tests with this feature enabled. Especially, ...
Thanks for letting us know. For this commit I have not reverted the beta, in order to debug the issue; InPlacePodVerticalScaling should be enabled by default. Being beta might explain why pull-kubernetes-node-kubelet-serial-containerd-alpha-features does not run the tests?
I think I will rebase this commit, the findings are sufficient. Thank you all for your comments and feedback.
Will add it as an extra step when the beta is merged back, so as not to forget.
726dc3e to 08a700f
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
08a700f to bb94a06
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: aojea, esotsal. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
/test pull-kubernetes-verify-lint
@esotsal: The specified target(s) for
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/hold cancel
/retest
/test pull-kubernetes-e2e-gce-cos-alpha-features
What type of PR is this?
/kind bug
/kind failing-test
What this PR does / why we need it:
Please check analysis here and conversation in this PR below.
In brief, this commit modifies testing code and ensures the same delete timeout duration is used everywhere. More specifically, it replaces the pod delete timeouts used before with the framework's Timeouts.PodDelete; the previous timeouts were less than 5 minutes, please see the full commit for details.
kubernetes/test/e2e/framework/timeouts.go
Line 26 in 35d098a
Delete is replaced with DeleteSync in the pod_resize test, with a 5-minute timeout, to ensure pods are actually deleted and do not impact other tests that were not resilient to leftover pods.
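The pattern of the change, taken from the pod_resize diff above, is roughly the following:

// Before: fire-and-forget delete; the pod may still be terminating when later tests run.
framework.ExpectNoError(podClient.Delete(ctx, newPod.Name, metav1.DeleteOptions{}))

// After: block until the pod is gone, failing the test if it is not removed
// within the framework's 5-minute PodDelete timeout.
podClient.DeleteSync(ctx, newPod.Name, metav1.DeleteOptions{}, f.Timeouts.PodDelete)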
Which issues this PR fixes
Relates to #128837
Fixes test failures, due to pods not being deleted, that appeared with the introduction of the InPlacePodVerticalScaling Beta.
Special notes for your reviewer:
Please see files changes for more details.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: