
Failing SIG-Node presubmit jobs #127831

Closed
8 tasks
bart0sh opened this issue Oct 3, 2024 · 19 comments
Assignees
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@bart0sh (Contributor) commented Oct 3, 2024

Which jobs are failing?

Which tests are failing?

  • E2eNode Suite: [It] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive] [NodeFeature:Eviction] when we run containers with PodAndContainerStatsFromCRI=false that should cause PIDPressure should eventually evict all of the correct pods
  • E2eNode Suite: [It] [sig-node] PriorityPidEvictionOrdering [Slow] [Serial] [Disruptive] [NodeFeature:Eviction] when we run containers with PodAndContainerStatsFromCRI=true that should cause PIDPressure should eventually evict all of the correct pods
  • E2eNode Suite: [It] [sig-node] [NodeFeature:SidecarContainers] Containers Lifecycle when A pod with restartable init containers is terminating when Restartable init containers are terminated during initialization should not hang in termination if terminated during initialization
  • E2eNode Suite: [It] [sig-node] Device Plugin [NodeFeature:DevicePlugin] [Serial] DevicePlugin [Serial] [Disruptive] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [Flaky]

Since when has it been failing?

I believe most of these jobs had not been triggered for a long time, so it's hard to say how long they have been failing.

Testgrid links

https://testgrid.k8s.io/sig-node-presubmits
https://testgrid.k8s.io/sig-node-ec2

Reason for failure (if possible)

I've triggered all SIG-Node pull* jobs for my test PR (the codebase is the same as the latest master branch). Here is how the list of jobs was generated:

test-infra (master) $  git grep 'name: pull-' config/jobs/kubernetes/sig-node/*-presubmit.yaml | cut -f3 -d: | while read job ; do echo "/test $job"; done | sort -u

/test pull-crio-cgroupv1-node-e2e-eviction
/test pull-crio-cgroupv1-node-e2e-eviction-kubetest2
/test pull-crio-cgroupv1-node-e2e-features
/test pull-crio-cgroupv1-node-e2e-features-kubetest2
/test pull-crio-cgroupv1-node-e2e-hugepages
/test pull-crio-cgroupv1-node-e2e-hugepages-kubetest2
/test pull-crio-cgroupv1-node-e2e-resource-managers
/test pull-crio-cgroupv1-node-e2e-resource-managers-kubetest2
/test pull-crio-cgroupv2-imagefs-separatedisktest
/test pull-crio-cgroupv2-imagefs-separatedisktest-kubetest2
/test pull-crio-cgroupv2-node-e2e-eviction
/test pull-crio-cgroupv2-node-e2e-eviction-kubetest2
/test pull-crio-cgroupv2-node-e2e-hugepages
/test pull-crio-cgroupv2-node-e2e-hugepages-kubetest2
/test pull-crio-cgroupv2-node-e2e-resource-managers
/test pull-crio-cgroupv2-node-e2e-resource-managers-kubetest2
/test pull-crio-cgroupv2-splitfs-separate-disk
/test pull-crio-cgroupv2-splitfs-separate-disk-kubetest2
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-eviction
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial
/test pull-kubernetes-crio-node-memoryqos-cgrpv2
/test pull-kubernetes-crio-node-memoryqos-cgrpv2-kubetest2
/test pull-kubernetes-e2e-containerd-gce
/test pull-kubernetes-e2e-gce-kubelet-credential-provider
/test pull-kubernetes-e2e-inplace-pod-resize-containerd-main-v2
/test pull-kubernetes-e2e-relaxed-environment-variable-validation
/test pull-kubernetes-kind-dra
/test pull-kubernetes-kind-dra-all
/test pull-kubernetes-node-arm64-e2e-containerd-ec2
/test pull-kubernetes-node-arm64-e2e-containerd-ec2-canary
/test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2
/test pull-kubernetes-node-arm64-e2e-containerd-serial-ec2-canary
/test pull-kubernetes-node-arm64-ubuntu-serial-gce
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e
/test pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-e2e
/test pull-kubernetes-node-crio-cgrpv2-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e
/test pull-kubernetes-node-crio-cgrpv2-imagefs-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e
/test pull-kubernetes-node-crio-cgrpv2-imagevolume-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e
/test pull-kubernetes-node-crio-cgrpv2-splitfs-e2e-kubetest2
/test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial
/test pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial-kubetest2
/test pull-kubernetes-node-crio-e2e
/test pull-kubernetes-node-crio-e2e-kubetest2
/test pull-kubernetes-node-e2e-alpha-ec2
/test pull-kubernetes-node-e2e-containerd
/test pull-kubernetes-node-e2e-containerd-1-7-dra
/test pull-kubernetes-node-e2e-containerd-alpha-features
/test pull-kubernetes-node-e2e-containerd-ec2
/test pull-kubernetes-node-e2e-containerd-ec2-canary
/test pull-kubernetes-node-e2e-containerd-ec2-eks-canary
/test pull-kubernetes-node-e2e-containerd-features
/test pull-kubernetes-node-e2e-containerd-features-kubetest2
/test pull-kubernetes-node-e2e-containerd-kubetest2
/test pull-kubernetes-node-e2e-containerd-serial-ec2
/test pull-kubernetes-node-e2e-containerd-serial-ec2-canary
/test pull-kubernetes-node-e2e-containerd-serial-ec2-eks
/test pull-kubernetes-node-e2e-containerd-serial-ec2-eks-canary
/test pull-kubernetes-node-e2e-containerd-standalone-mode
/test pull-kubernetes-node-e2e-containerd-standalone-mode-all-alpha
/test pull-kubernetes-node-e2e-cri-proxy-serial
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv1-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-crio-cgrpv2-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv2-dra
/test pull-kubernetes-node-e2e-resource-health-status
/test pull-kubernetes-node-kubelet-containerd-flaky
/test pull-kubernetes-node-kubelet-credential-provider
/test pull-kubernetes-node-kubelet-podresize
/test pull-kubernetes-node-kubelet-serial-containerd
/test pull-kubernetes-node-kubelet-serial-containerd-alpha-features
/test pull-kubernetes-node-kubelet-serial-containerd-kubetest2
/test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
/test pull-kubernetes-node-kubelet-serial-cpu-manager
/test pull-kubernetes-node-kubelet-serial-cpu-manager-kubetest2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv1-kubetest2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2
/test pull-kubernetes-node-kubelet-serial-crio-cgroupv2-kubetest2
/test pull-kubernetes-node-kubelet-serial-hugepages
/test pull-kubernetes-node-kubelet-serial-memory-manager
/test pull-kubernetes-node-kubelet-serial-podresources
/test pull-kubernetes-node-kubelet-serial-topology-manager
/test pull-kubernetes-node-kubelet-serial-topology-manager-kubetest2
/test pull-kubernetes-node-swap-conformance-fedora-serial
/test pull-kubernetes-node-swap-conformance-ubuntu-serial
/test pull-kubernetes-node-swap-fedora
/test pull-kubernetes-node-swap-fedora-serial
/test pull-kubernetes-node-swap-ubuntu-serial

Anything else we need to know?

No response

Relevant SIG(s)

/sig node

@bart0sh added the kind/failing-test label Oct 3, 2024
@k8s-ci-robot added the sig/node label Oct 3, 2024
@k8s-ci-robot added the needs-triage label Oct 3, 2024
@krzyzacy (Member) commented Oct 8, 2024

> I believe that most of the jobs were not triggered for a long time, so it's hard to say for how long they're failing.

We should at least ensure all the blocking presubmits are healthy, and we can probably clean up the older ones if nobody cares about them.

@pacoxu (Member) commented Oct 11, 2024

> I believe that most of the jobs were not triggered for a long time, so it's hard to say for how long they're failing.

Generally, most presubmit jobs have a corresponding periodic job, so most of these failures may already be tracked as flakes or failing tests of the periodic jobs.

  • pull-kubernetes-integration
  • pull-kubernetes-unit

Those will be tracked by the release signal team, as their failures/flakes can also be found in https://testgrid.k8s.io/sig-release-master-blocking. I suppose we don't need to track these two here.

@pacoxu (Member) commented Oct 11, 2024

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

This is tracked in #127312 for ci-crio-cgroupv1-evented-pleg.

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2

Will use ginkgo flags as: -timeout=24h -nodes=8 -focus="\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeFeature\]" -skip="\[Flaky\]|\[Slow\]|\[Serial\]" --no-color -v

It fails with:

F1010 14:37:46.322342   10028 gce_runner.go:112] While preparing GCE images: Could not read image config file provided: open --image-config-file=/home/prow/go/src/k8s.io/test-infra/jobs/e2e_node/crio/latest/image-config-cgroupv1-evented-pleg.yaml: no such file or directory

(Note that the whole --image-config-file=... argument is being treated as the file path, which suggests the flag itself is not being parsed.)

@pacoxu (Member) commented Oct 12, 2024

> pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e

> This is tracked in #127312 for ci-crio-cgroupv1-evented-pleg.

After kubernetes/test-infra#33633, pull-kubernetes-node-crio-cgrpv1-evented-pleg-e2e-kubetest2 failed for #127312 as well.

@bart0sh changed the title from "Failing SIG-Node prestubmit jobs" to "Failing SIG-Node presubmit jobs" Oct 13, 2024
@pacoxu (Member) commented Oct 14, 2024

  • pull-kubernetes-node-arm64-e2e-containerd-ec2-canary
  • pull-kubernetes-node-kubelet-containerd-flaky

Flaky (known-flaking) and canary jobs have lower priority.

  • pull-kubernetes-node-e2e-cri-proxy-serial

This should be fixed by #127495.

  • pull-crio-cgroupv2-node-e2e-eviction

See #127996.

@bart0sh your fix kubernetes/test-infra#33640 was merged today. It looks green now.

  • pull-kubernetes-node-arm64-ubuntu-serial-gce

https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial looks green.

  • ci-kubernetes-node-arm64-ubuntu-serial runs [Slow] e2e tests and does not run swap-related e2e tests.
  • pull-kubernetes-node-arm64-ubuntu-serial-gce also runs [NodeSwap], and flakes a lot in other, non-swap-related tests:
    • Memory Manager [Disruptive] [Serial] [Feature:MemoryManager] with static policy when multiple guaranteed pods started should succeed to start all pods
    • Memory Manager [Disruptive] [Serial] [Feature:MemoryManager] with static policy when guaranteed pod has only app containers should succeed to start the pod
    • E2eNode Suite: [It] [sig-node] ImageGarbageCollect [Serial] [NodeFeature:GarbageCollect] when ImageMaximumGCAge is set should not GC unused images prematurely

Added to my todo list (will update here once I do a diff): kubernetes/test-infra#33641

  • pull-kubernetes-node-crio-cgrpv2-userns-e2e-serial

Added recently. See #127484.

I planned to open an issue to track it; #128042 was opened.

Related to #127312; I suspect these failures are related to the Evented PLEG issue.

https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-containerd-standalone-mode-all-alpha
failed for the same reason. I opened kubernetes/test-infra#33642 to see if we can disable EventedPLEG for the CI.

  • pull-kubernetes-node-e2e-containerd-alpha-features

https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-containerd-alpha-features

There are some flakes, and the job above is consistently failing.

@bart0sh (Contributor, Author) commented Oct 14, 2024

@pacoxu Thank you for the detailed overview and the links to the issues! I'll continue investigating (and hopefully fixing) the PR jobs and running them in the test PR.

@bart0sh (Contributor, Author) commented Oct 14, 2024

@pacoxu I excluded -kubetest2 jobs from the scope of this issue as they seem to be a work in progress.

@pacoxu (Member) commented Oct 14, 2024

> @pacoxu I excluded -kubetest2 jobs from the scope of this issue as they seem to be a work in progress.

Do you mean the migration process kubernetes/test-infra#32567?

@bart0sh (Contributor, Author) commented Oct 14, 2024

Yes. I decided to concentrate on the more stable jobs. I'm hoping that at the end of this road the -kubetest2 suffix will be removed and my test PR will trigger those jobs automatically. In my opinion they're still too buggy to pay attention to right now.

@kannon92 (Contributor) commented

/triage accepted
/priority important-longterm

TY for looking into this!

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage label Oct 16, 2024
@kannon92 (Contributor) commented

@bart0sh should you be assigned to this?

@kannon92 moved this from Triage to Issues - In progress in SIG Node CI/Test Board Oct 16, 2024
@bart0sh (Contributor, Author) commented Oct 17, 2024

@kannon92 thanks for the reminder!
/assign

@pacoxu (Member) commented Oct 22, 2024

#128251

This can also be seen in presubmit CI:

/test pull-kubernetes-node-e2e-containerd-serial-ec2
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-serial

@pacoxu (Member) commented Oct 23, 2024

I just noticed https://testgrid.k8s.io/presubmits-kubernetes-blocking; this is the blocking board for presubmit CIs.

  • We may treat these as the top priority and update the board if needed.

@bart0sh (Contributor, Author) commented Dec 18, 2024

@pacoxu @kannon92 @haircommander I'm going to close this issue, as almost all PR job failures have been fixed. There are currently only 4 failing jobs (see my test PR).

Any thoughts/objections/suggestions?

@kannon92 (Contributor) commented

Thank you for the work! Happy to close.

@bart0sh (Contributor, Author) commented Dec 18, 2024

Thank you! Closing.

@bart0sh closed this as completed Dec 18, 2024
@github-project-automation bot moved this from Issues - In progress to Done in SIG Node CI/Test Board Dec 18, 2024
@pacoxu (Member) commented Dec 19, 2024

Great job. Thanks @bart0sh.

@bart0sh (Contributor, Author) commented Dec 19, 2024

Thank you, it's been a great run!

The next step would be to develop a setup that ensures no PR job is forgotten; otherwise we'll end up in the same situation after some time. I'd propose making sure that every PR job maps to a CI job with the same name, e.g. pull-crio-cgroupv2-splitfs-separate-disk -> ci-crio-cgroupv2-splitfs-separate-disk, and developing automatic checks that enforce this mapping (a rough sketch of such a check is below). Generating the job configs from easy-to-maintain configuration file(s) would also help to keep the mappings in a healthy state.
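A minimal sketch of such a check, assuming a test-infra checkout and that all the relevant job configs live under config/jobs/kubernetes/sig-node/ (the path and the pull-/ci- prefix convention are assumptions for illustration, not an established test-infra rule):

test-infra (master) $ comm -23 \
    <(git grep -ho 'name: pull-[a-z0-9-]*' config/jobs/kubernetes/sig-node/ | sed 's/^name: pull-//' | sort -u) \
    <(git grep -ho 'name: ci-[a-z0-9-]*' config/jobs/kubernetes/sig-node/ | sed 's/^name: ci-//' | sort -u)

This prints the suffix of every pull-* job that has no ci-* job with the same suffix; empty output would mean the pull -> ci mapping is complete. Wired into a presubmit of test-infra itself, it would fail whenever someone adds a PR job without a matching CI job.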

Any other ideas?
