
Failing SIG-Node presubmit jobs #127831

Closed
8 tasks
bart0sh opened this issue Oct 3, 2024 · 19 comments
Assignees
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@bart0sh (Contributor) commented Oct 3, 2024

Which jobs are failing?

Which tests are failing?

  • E2eNode Suite: [It] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive] [NodeFeature:Eviction] when we run containers with PodAndContainerStatsFromCRI=false that should cause PIDPressure should eventually evict all of the correct pods
  • E2eNode Suite: [It] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive] [NodeFeature:Eviction] when we run containers with PodAndContainerStatsFromCRI=true that should cause PIDPressure should eventually evict all of the correct pods
  • E2eNode Suite: [It] [sig-node] [NodeFeature:SidecarContainers] Containers Lifecycle when A pod with restartable init containers is terminating when Restartable init containers are terminated during initialization should not hang in termination if terminated during initialization
  • E2eNode Suite: [It] [sig-node] Device Plugin [NodeFeature:DevicePlugin] [Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Flaky]

Since when has it been failing?

I believe most of these jobs had not been triggered for a long time, so it's hard to say how long they have been failing.

Testgrid links

https://testgrid.k8s.io/sig-node-presubmits
https://testgrid.k8s.io/sig-node-ec2

Reason for failure (if possible)

I've triggered all SIG-Node pull* jobs for my test PR (the codebase is the same as the latest master branch). Here is how the list of jobs was generated:

test-infra (master) $  git grep 'name: pull-' config/jobs/kubernetes/sig-node/*-presubmit.yaml | cut -f3 -d: | while read job ; do echo "/test $job"; done | sort -u

/test pull-crio-cgroupv1-node-e2e-eviction
/test pull-crio-cgroupv1-node-e2e-eviction-kubetest2
/test pull-crio-cgroupv1-node-e2e-features
/test pull-crio-cgroupv1-node-e2e-features-kubetest2
/test pull-crio-cgroupv1-node-e2e-hugepages
/test pull-crio-cgroupv1-node-e2e-hugepages-kubetest2
/test pull-crio-cgroupv1-node-e2e-resource-managers
/test pull-crio-cgroupv1-node-e2e-resource-managers-kubetest2
/test pull-crio-cgroupv2-imagefs-separatedisktest
/test pull-crio-cgroupv2-imagefs-separatedisktest-kubetest2
/test pull-crio-cgroupv2-node-e2e-eviction
/test pull-crio-cgroupv2-node-e2e-eviction-kubetest2
/test pull-crio-cgroupv2-node-e2e-hugepages
/test pull-crio-cgroupv2-node-e2e-hugepages-kubetest2
/test pull-crio-cgroupv2-node-e2e-resource-managers
/test pull-crio-cgroupv2-node-e2e-resource-managers-kubetest2
/test pull-crio-cgroupv2-splitfs-separate-disk
/test pull-crio-cgroupv2-splitfs-separate-disk-kubetest2
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-crio-node-memoryqos-cgrpv2
/test pull-kubernetes-crio-node-memoryqos-cgrpv2-kubetest2
/test pull-kubernetes-e2e-containerd-gce
/test pull-kubernetes-e2e-gce-kubelet-credential-provider
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
/test pull-kubernetes-e2e-relaxed-environment-variable-validation
/test pull-kubernetes-kind-dra
/test pull-kubernetes-kind-dra-all
/test pull-kubernetes-node-arm64-e2e-containerd-ec2
/test pull-kubernetes-node-arm64-e2e-containerd-ec2-canary
/test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
/test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2-canary
/test pull-kubernetes-node-arm64-ubuntu-serial-gce
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-e2e
/test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
/test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e
/test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
/test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial
/test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial-kubetest2
/test pull-kubernetes-node-crio-e2e
/test pull-kubernetes-node-crio-e2e-kubetest2
/test pull-kubernetes-node-e2e-alpha-ec2
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-containerd-alpha-features
/test pull-kubernetes-node-e2e-containerd-ec2
/test pull-kubernetes-node-e2e-containerd-ec2-canary
/test pull-kubernetes-node-e2e-containerd-ec2-eks-canary
/test pull-kubernetes-node-e2e-containerd-features
/test pull-kubernetes-node-e2e-containerd-features-kubetest2
/test pull-kubernetes-node-e2e-containerd-kubetest2
/test pull-kubernetes-node-e2e-containerd-serial-ec2
/test pull-kubernetes-node-e2e-containerd-serial-ec2-canary
/test pull-kubernetes-node-e2e-containerd-serial-ec2-eks
/test pull-kubernetes-node-e2e-containerd-serial-ec2-eks-canary
/test pull-kubernetes-node-e2e-containerd-standalone-mode
/test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
/test pull-kubernetes-node-e2e-cri-proxy-serial
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-resource-health-status
/test pull-kubernetes-node-kubelet-containerd-flaky
/test pull-kubernetes-node-kubelet-credential-provider
/test pull-kubernetes-node-kubelet-podresize
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
/test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
/test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1-kubetest2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2-kubetest2
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-podresources
/test pull-kubernetes-node-kubelet-serial-topology-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
/test pull-kubernetes-node-swap-conformance-fedora-serial
/test pull-kubernetes-node-swap-conformance-ubuntu-serial
/test pull-kubernetes-node-swap-fedora
/test pull-kubernetes-node-swap-fedora-serial
/test pull-kubernetes-node-swap-ubuntu-serial

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@bart0sh added the kind/failing-test label Oct 3, 2024
@k8s-ci-robot added the sig/node label Oct 3, 2024
@k8s-ci-robot added the needs-triage label Oct 3, 2024
@krzyzacy (Member) commented Oct 8, 2024

> I believe that most of the jobs were not triggered for a long time, so it's hard to say for how long they're failing.

We should at least ensure all the blocking presubmits are healthy, and we can probably clean up the older ones if nobody cares about them.

@pacoxu (Member) commented Oct 11, 2024

> I believe that most of the jobs were not triggered for a long time, so it's hard to say for how long they're failing.

Generally, most presubmit jobs have a corresponding periodic job, so most of these failures may already be tracked as flakes or failing tests of the periodic jobs.

  • pull-kubernetes-integration
  • pull-kubernetes-unit

Those will be tracked by the release signal team, as their failures/flakes can also be found in https://testgrid.k8s.io/sig-release-master-blocking. I suppose we don't need to track these two here.

@pacoxu (Member) commented Oct 11, 2024

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

This is tracked in #127312 for ci-crio-cgroupv1-evented-pleg.

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

Will use ginkgo flags as: -timeout=24h -nodes=8 -focus="\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeFeature\]" -skip="\[Flaky\]|\[Slow\]|\[Serial\]" --no-color -v

It fails with:

F1010 14:37:46.322342   10028 gce_runner.go:112] While preparing GCE images: Could not read image config file provided: open --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1-evented-pleg.yaml: no such file or directory

(Note that the whole --image-config-file=... argument is being treated as the file path, which suggests the flag itself is not being parsed.)

@pacoxu (Member) commented Oct 12, 2024

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

> This is tracked in #127312 for ci-crio-cgroupv1-evented-pleg.

After kubernetes/test-infra#33633, pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2 failed for #127312 as well.

@bart0sh changed the title from "Failing SIG-Node prestubmit jobs" to "Failing SIG-Node presubmit jobs" Oct 13, 2024
@pacoxu (Member) commented Oct 14, 2024

  • pull-kubernetes-node-arm64-e2e-containerd-ec2-canary
  • pull-kubernetes-node-kubelet-containerd-flaky

Flaky (known-flaking) and canary jobs have lower priority.

  • pull-kubernetes-node-e2e-cri-proxy-serial

This should be fixed by #127495.

  • pull-crio-cgroupv2-node-e2e-eviction

See #127996.

@bart0sh your fix kubernetes/test-infra#33640 was merged today. It looks green now.

  • pull-kubernetes-node-arm64-ubuntu-serial-gce

https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial looks green.

  • ci-kubernetes-node-arm64-ubuntu-serial runs [Slow] e2e tests and does not run swap-related e2e tests.
  • pull-kubernetes-node-arm64-ubuntu-serial-gce also runs [NodeSwap], and flakes a lot in other, non-swap-related tests:
    • Memory Manager [Disruptive] [Serial] [Feature:MemoryManager] with static policy when multiple guaranteed pods started should succeed to start all pods
    • Memory Manager [Disruptive] [Serial] [Feature:MemoryManager] with static policy when guaranteed pod has only app containers should succeed to start the pod
    • E2eNode Suite: [It] [sig-node] ImageGarbageCollect [Serial] [NodeFeature:GarbageCollect] when ImageMaximumGCAge is set should not GC unused images prematurely

Added to my todo list (will update here once I do a diff): kubernetes/test-infra#33641

  • pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial

Added recently. See #127484.

I planned to open an issue to track it; #128042 was opened.

Related to #127312; I suspect these failures are related to the Evented PLEG issue.

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-containerd-standalone-mode-all-alpha
failed for the same reason. I opened kubernetes/test-infra#33642 to see if we can disable EventedPLEG for the CI.

  • pull-kubernetes-node-e2e-containerd-alpha-features

https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-containerd-alpha-features

There are some flakes, and the job above is consistently failing.

@bart0sh (Contributor, Author) commented Oct 14, 2024

@pacoxu Thank you for the detailed overview and the links to the issues! I'll continue investigating (and hopefully fixing) the PR jobs and running them in the test PR.

@bart0sh (Contributor, Author) commented Oct 14, 2024

@pacoxu I excluded -kubetest2 jobs from the scope of this issue as they seem to be a work in progress.

@pacoxu (Member) commented Oct 14, 2024

> @pacoxu I excluded -kubetest2 jobs from the scope of this issue as they seem to be a work in progress.

Do you mean the migration process kubernetes/test-infra#32567?

@bart0sh (Contributor, Author) commented Oct 14, 2024

Yes. I decided to concentrate on the more stable jobs. I'm hoping that at the end of this road the -kubetest2 suffix will be removed and my test PR will trigger those jobs automatically. In my opinion they're still too buggy to pay attention to right now.

@kannon92 (Contributor) commented

/triage accepted
/priority important-longterm

TY for looking into this!

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage label Oct 16, 2024
@kannon92 (Contributor) commented

@bart0sh should you be assigned to this?

@kannon92 moved this from Triage to Issues - In progress in SIG Node CI/Test Board Oct 16, 2024
@bart0sh (Contributor, Author) commented Oct 17, 2024

@kannon92 thanks for the reminder!
/assign

@pacoxu (Member) commented Oct 22, 2024

#128251

This can also be seen in presubmit CI:

/test pull-kubernetes-node-e2e-containerd-serial-ec2
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial

@pacoxu (Member) commented Oct 23, 2024

I just noticed https://testgrid.k8s.io/presubmits-kubernetes-blocking; this is the blocking board for presubmit CIs.

  • We may treat these as the top priority and update the board if needed.

@bart0sh (Contributor, Author) commented Dec 18, 2024

@pacoxu @kannon92 @haircommander I'm going to close this issue, as almost all PR job failures have been fixed. There are currently only 4 failing jobs (see my test PR).

Any thoughts/objections/suggestions?

@kannon92 (Contributor) commented

Thank you for the work! Happy to close.

@bart0sh (Contributor, Author) commented Dec 18, 2024

Thank you! Closing.

@bart0sh closed this as completed Dec 18, 2024
@github-project-automation bot moved this from Issues - In progress to Done in SIG Node CI/Test Board Dec 18, 2024
@pacoxu (Member) commented Dec 19, 2024

Great job. Thanks @bart0sh.

@bart0sh (Contributor, Author) commented Dec 19, 2024

Thank you, it's been a great run!

The next step would be to develop a setup that ensures no PR job is forgotten; otherwise we'll end up in the same situation after some time. I'd propose making sure that every PR job maps to a CI job with the same name, e.g. pull-crio-cgroupv2-splitfs-separate-disk -> ci-crio-cgroupv2-splitfs-separate-disk, and developing automatic checks that enforce this mapping (a rough sketch of such a check is below). Generating the job configs from easy-to-maintain configuration file(s) would also help to keep the mappings in a healthy state.
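A minimal sketch of such a check, assuming a test-infra checkout and that all the relevant job configs live under config/jobs/kubernetes/sig-node/ (the path and the pull-/ci- prefix convention are assumptions for illustration, not an established test-infra rule):

test-infra (master) $ comm -23 \
    <(git grep -ho 'name: pull-[a-z0-9-]*' config/jobs/kubernetes/sig-node/ | sed 's/^name: pull-//' | sort -u) \
    <(git grep -ho 'name: ci-[a-z0-9-]*' config/jobs/kubernetes/sig-node/ | sed 's/^name: ci-//' | sort -u)

This prints the suffix of every pull-* job that has no ci-* job with the same suffix; empty output would mean the pull -> ci mapping is complete. Wired into a presubmit of test-infra itself, it would fail whenever someone adds a PR job without a matching CI job.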

Any other ideas?
