E2E Node tests for image pull backoff and crashloopbackoff behavior #128559

lauralorenz · 2024-11-05T02:01:51Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Add e2e tests built on top of the CRI proxy framework to test the backoff behavior of image pulls and container restarts. Includes a case where container restarts are configured using the alpha feature from KEP-4603.

Which issue(s) this PR fixes:

Related to kubernetes/enhancements#4603

Special notes for your reviewer:

Test freeze exception: https://groups.google.com/g/kubernetes-sig-node/c/zYclDRIyD0w
How to run:

make test-e2e-node REMOTE=false PARALLELISM=1 FOCUS="Container Restart|Pull Image" SKIP="\[Flaky\]|\[Slow\]" TEST_ARGS='--kubelet-flags="--fail-swap-on=false" --cri-proxy-enabled=true'

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

/hold

lauralorenz · 2024-11-05T02:37:17Z

/test pull-kubernetes-node-e2e-containerd

lauralorenz · 2024-11-05T06:29:42Z

Hiya @SergeyKanzhelev (cc @tallclair) I'm going to keep working on this tomorrow, but heads up if you have any opinions on how this is shaping up (as I intend to use something similar to e2e test container restarts too), the latest commit has some informative TODOs of where I'm currently at. Most relevantly I am looking for (/ possibly need to make?) a util that can snag kubelet logs for a defined time period because that's where the data I need to parse really is, as I don't seem to be able to do it with the events API; if you know of something that already does that in the node e2e suite please point me in the direction as I haven't found anything so far. Thanks!

lauralorenz · 2024-11-05T23:17:56Z

Ideas:

set up a goroutine to watch the events and create my own event log on the side
can get the kubelet logs directly from the kubelet endpoint to parse them after the test runs
still thinking maybe there is an example of kubelet log parsing in the other e2e tests?
e2e slow downs can require timeouts more like 1-2 minutes minimum to observe behavior (e.g. pod running (? i think) timeout is 2 minutes)

SergeyKanzhelev · 2024-11-05T23:39:52Z

> e2e slow downs can require timeouts more like 1-2 minutes minimum to observe behavior (e.g. pod running (? i think) timeout is 2 minutes)

Is the question how to check that the image pull backoff did not inherit the container crash loop backoff?

The easiest thing you can do is check how fast you receive the next image pull in the CRI proxy. If the container crash loop backoff is configured to 5 seconds, make sure you are not getting image pull backoffs less than so many times in one minute.

aojea · 2024-11-06T19:25:43Z

This should merge with #128374; adding e2e tests separately from the feature is not good practice.

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

test/e2e_node/criproxy_test.go

thockin

/approve

I looked at the tests and they make sense to me, but I'm not able to dig into the underlying framework as much as someone who knows it very well.

k8s-ci-robot · 2024-11-13T01:19:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lauralorenz, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/e2e_node/OWNERS~~ [thockin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

thockin · 2024-11-13T01:21:49Z

/lgtm

k8s-ci-robot · 2024-11-13T01:21:57Z

LGTM label has been added.

Git tree hash: 7cf3281708e43a7febbc701a3195cec8c9df0420

lauralorenz · 2024-11-13T01:45:32Z

@tallclair I had changed the sleeps to shorter times in 285d433, but focused on running the container restart tests (which are green on prow here) and didn't update the expectation of the shorter sleep on the image pull test.

Now in 9ab0d81 I'm expecting >=3 for a timeout of 30s (first at 0s, second at ~0s, third at ~10s; the fourth won't happen until on or just after 30s), whereas before I expected strictly >3, back when the timeout was 1 minute.

tallclair · 2024-11-13T01:48:03Z

/test pull-kubernetes-node-e2e-cri-proxy-serial
/lgtm
/hold

Please make sure the pull-kubernetes-node-e2e-cri-proxy-serial run passes before removing the hold

k8s-ci-robot · 2024-11-13T01:48:10Z

LGTM label has been added.

Git tree hash: 0db6e203d06421fab66b369522ccf14644c3d914

tallclair · 2024-11-13T01:50:00Z

/triage accepted
/priority important-soon

lauralorenz · 2024-11-13T02:11:53Z

Can confirm
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/128559/pull-kubernetes-node-e2e-cri-proxy-serial/1856514342405541888#1:build-log.txt%3A524

/unhold

fsmunoz · 2024-11-13T13:35:26Z

/milestone v1.32

lauralorenz · 2024-11-13T16:40:06Z

/retest-required

k8s-ci-robot requested review from dchen1107 and tallclair November 5, 2024 02:02

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 5, 2024

lauralorenz mentioned this pull request Nov 5, 2024

KEP-4603: Refactor various hardcoded backoffs into separate constants #128369

Merged

This was referenced Nov 6, 2024

[WIP] ReduceCrashLoopBackOff delay #128614

Draft

Tune CrashLoopBackOff kubernetes/enhancements#4603

Open

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 8, 2024

lauralorenz added 3 commits November 11, 2024 17:55

Adding imagepull backoff test

f913b7a

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

Organize into its own context

6337a28

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

The idea of how this test should work

6ef05db

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

thockin reviewed Nov 13, 2024

test/e2e_node/criproxy_test.go

thockin reviewed Nov 13, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 13, 2024

k8s-ci-robot assigned thockin Nov 13, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2024

Now that sleep is shorter, only expect to reach 3 within 30s

9ab0d81

Focused too much on the container restart one in commit that fixed that

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2024

k8s-ci-robot requested a review from thockin November 13, 2024 01:41

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 13, 2024

k8s-ci-robot assigned tallclair Nov 13, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 13, 2024

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 13, 2024

k8s-ci-robot added this to the v1.32 milestone Nov 13, 2024

k8s-ci-robot merged commit 5ee686b into kubernetes:master Nov 13, 2024
16 checks passed

lauralorenz mentioned this pull request Nov 19, 2024

KEP-4602: Crashloopbackoff alpha docs PR kubernetes/website#48499

Merged

pacoxu mentioned this pull request Nov 19, 2024

skip if cri proxy is disabled/undefined #128851

Merged
