
flaky pull-kubernetes-e2e-kind tests due to timeout #101275

Closed
gautierdelorme opened this issue Apr 20, 2021 · 24 comments

Comments

@gautierdelorme
Contributor

gautierdelorme commented Apr 20, 2021

Which jobs are flaking:

pull-kubernetes-e2e-kind

Which test(s) are flaking:

Different tests seem to randomly fail. Some examples:

  • [sig-network] KubeProxy should set TCP CLOSE_WAIT timeout [Privileged] : timed out waiting for the condition
  • [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]: Failed to connect to exposed host ports
  • [sig-node] RuntimeClass should run a Pod requesting a RuntimeClass with scheduling without taints: timed out waiting for the condition
  • [sig-storage] CSI mock volume CSI Volume expansion should expand volume by restarting pod if attach=on, nodeExpansion=on: timed out waiting for the condition
  • ...

Reason for failure:

Timeouts

Anything else we need to know:

/kind flake
/sig network
/sig node
/sig storage
/cc @dims

@gautierdelorme gautierdelorme added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 20, 2021
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2021
@aojea
Member

aojea commented Apr 20, 2021

/assign
It is not always failing (https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind), but it seems to affect only #100490.

@aojea
Member

aojea commented Apr 20, 2021

@BenTheElder this is not looking good, lots of timeouts: https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind&sort-by-flakiness=20
Are you aware of any issue that could be causing it? You and @spiffxp have some graphs with trends; is there anything unusual?

@BenTheElder
Member

It's looking worse, but timeouts plague presubmits in general. We could reduce the test load, since the ipv6 job looks healthier.

I'm not aware of anything specific, but you should have access to the same data as me and Aaron.

kubernetes/k8s.io#1187 (comment) is what I'd prefer to pursue for now, though if flakiness concerns become excessive I'd suggest bringing the group of tests we run in line with the ipv6 job.

None of the presubmits run everything because it would be excessive. We want to min-max coverage vs. reliability & speed ...

@SergeyKanzhelev
Member

Is there anything we can do from the test authoring perspective? Or is this purely an infra issue?

@BenTheElder
Member

I don't think it's mostly a test authoring issue; there are tentatively two things going on here:

purely infra issue?

It's not purely infra though: pull-kubernetes-e2e-kind-ipv6 is perfectly acceptable other than the startup probe issue, while pull-kubernetes-e2e-kind is doing horribly right now.

The current best guess is changes to the CSI hostpath driver landing in k8s around the time the failures shot up: #100637 (comment)

slack thread: https://kubernetes.slack.com/archives/C09QZ4DQB/p1619723516188000

possible fix kubernetes-csi/csi-driver-host-path#277

We're considering skipping more tests in the non-ipv6 job to match ipv6 and bring the failure rate back to normal levels, unless it seems likely that this was a specific regression.

@bridgetkromhout
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2021
@bridgetkromhout
Member

/assign @aojea

@BenTheElder
Member

/assign
Tentatively looking healthier after kubernetes/test-infra#22025: https://prow.k8s.io/?job=pull-kubernetes-e2e-kind

Still not ideal

@aojea
Member

aojea commented Apr 30, 2021

Success rate over time: 3h: 75%, 12h: 56%, 48h: 44%

@BenTheElder
Member

A quick sampling shows the remaining flakes are pretty much all:

[sig-node] Probing container should be ready immediately after startupProbe succeeds

So we should either look at that test or skip it, and we should look into fixing the CSI hostpath tests and re-enabling them. cc @msau42 @pohly

@BenTheElder
Member

the pass rate is still pretty bad with the startupProbe test.

Success rate over time: 3h: 48%, 12h: 52%, 48h: 44%

A normal e2e presubmit should be at something like 70%+ over 12h during active development periods.

@msau42
Member

msau42 commented Apr 30, 2021

To re-enable the csi hostpath tests, we'll need to compare container resource usage before/after the changes. Do we have metrics collection enabled on kind clusters?

Another possibility is that for csi hostpath, every single container is deployed as a separate Pod. I don't think that would impact CPU/memory consumption on the node, but it would reduce the parallelism of the tests. One of the changes we did make was to add 2 more sidecars = 2 more pods to the hostpath driver. Multiplied by 30-40 test cases, that is 60-80 more pods needing to be scheduled.

@BenTheElder
Member

Maybe then we're hitting pod scheduling limits? kind jobs run with 2 schedulable nodes by default (and one control plane).

When I hear "sidecar" I usually think of an additional container, not a pod, though.
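
For reference, a minimal kind config matching that topology (one control plane plus two schedulable workers) would look roughly like the sketch below; the CI jobs generate their own config, so this is only an illustration.

```yaml
# Illustrative sketch only, not the config the CI jobs actually use:
# one control-plane node and two schedulable workers.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```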

@BenTheElder
Member

We do not have metrics collection enabled; kubeadm does not support this OOTB, and we'd been waiting for a blessed addon manager from SCL. @aojea has an experiment related to this.

@msau42
Member

msau42 commented Apr 30, 2021

When I hear "sidecar" I usually think of an additional container, not a pod, though.

Correct: in a real production driver, we put all the sidecars in a single pod. However, we use the hostpath driver to validate that we have the correct RBACs set per container (because some sidecars are optional), so that's why in our CI we run each container as an individual pod. Perhaps we need to support both methods: kubernetes-csi/csi-driver-host-path#192
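
To illustrate the pattern being described (a hedged sketch only, not the actual csi-driver-host-path manifests; the ServiceAccount names and image tags are placeholders), the CI deployment runs roughly one pod per sidecar, each bound to its own ServiceAccount so the per-container RBAC can be validated:

```yaml
# Sketch of the "one pod per sidecar" CI pattern; names and versions are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpath-provisioner
spec:
  serviceAccountName: csi-provisioner-sa   # hypothetical, exercises only the provisioner RBAC
  containers:
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0   # placeholder tag
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpath-resizer
spec:
  serviceAccountName: csi-resizer-sa       # hypothetical, exercises only the resizer RBAC
  containers:
  - name: csi-resizer
    image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0       # placeholder tag
```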

@aojea
Member

aojea commented Apr 30, 2021

I have a way to install Prometheus and dump the database once the test fails: kubernetes-sigs/kind#2190. It monitors the ...

What kind of metrics are you looking for, @msau42? Because in KIND there are no per-node CPU/memory metrics, since the node containers use the host /proc...
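
As a hedged sketch (not the setup in kubernetes-sigs/kind#2190): the per-container cgroup metrics that are still meaningful inside kind nodes can be scraped from the kubelets' cAdvisor endpoint with a standard scrape job along these lines, assuming Prometheus runs in-cluster with a ServiceAccount allowed to read node metrics.

```yaml
# Minimal Prometheus scrape job for the kubelet cAdvisor endpoint.
scrape_configs:
- job_name: kubelet-cadvisor
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - target_label: __metrics_path__
    replacement: /metrics/cadvisor
```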

@msau42
Member

msau42 commented Apr 30, 2021

Our two leading theories as to why the kind job became more flaky after the changes to the hostpath driver:

  • We added 2 more pods per test case = 60-80 more pods. Potentially reduces the test parallelism and can contribute to test timeouts.
  • New csi sidecars are consuming more resources, slowing down the system. It would be good to look at any cpu/mem metrics to see if this could be the case.

@pohly
Contributor

pohly commented Apr 30, 2021

* We added 2 more pods per test case = 60-80 more pods. Potentially reduces the test parallelism and can contribute to test timeouts.

Which additional pods are these?

https://github.com/kubernetes/kubernetes/pull/100637/files initially added csi-external-health-monitor-agent and csi-external-health-monitor-controller, but those were additional containers in the driver pod, not separate pods.

Later I added code to disable them. Even later, I downgraded to the driver YAML file that doesn't contain them. All of that was done before merging the PR, so if the job is flaky now, it's not because of these two.

Now that I think about it, https://github.com/kubernetes/kubernetes/pull/101360/files, which then removed the code that disables the containers, didn't change anything, because we are still on the old driver release without those containers.

@pohly
Contributor

pohly commented May 1, 2021

I have a way to install Prometheus and dump the database once the test fails: kubernetes-sigs/kind#2190. It monitors the ...

What kind of metrics are you looking for, @msau42? Because in KIND there are no per-node CPU/memory metrics, since the node containers use the host /proc...

A summary of CPU and RAM consumption of the entire Prow test job (i.e. including all processes running inside KinD) amortized over time for the entire duration of the job would be a good start. Is that possible?

Then we can directly see if changes that we make really let the tests run more efficiently. Right now we are speculating based on very indirect observations like timeouts in random tests.

We'll need a pull job which ideally runs just the storage tests that were disabled in kubernetes/test-infra#22025.

The first step, then, will be to test whether the sidecar update in #100637 really made the situation worse; so far the evidence for that is not conclusive.
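
If the Prometheus data from the kind PR above is available, that amortized summary could be approximated with recording rules along these lines (a sketch only; the metric names are the standard cAdvisor ones, and the 2h window is a placeholder for the job duration):

```yaml
groups:
- name: e2e-job-resource-summary
  rules:
  # Average cluster-wide CPU usage (in cores), sampled every 5m across the run.
  - record: job:container_cpu_cores:avg_over_run
    expr: avg_over_time(sum(rate(container_cpu_usage_seconds_total[5m]))[2h:5m])
  # Average cluster-wide working-set memory (in bytes), sampled every 5m across the run.
  - record: job:container_memory_working_set_bytes:avg_over_run
    expr: avg_over_time(sum(container_memory_working_set_bytes)[2h:5m])
```

Note that this only covers the containers cAdvisor sees, not every process in the Prow job, so it would understate the total.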

@pohly
Contributor

pohly commented May 1, 2021

Regarding running more efficiently, here are some ideas ranging from "makes sense" to "what is Patrick smoking":

  • run sidecars inside the same pod as the driver (Consolidate all the hostpath driver specs into one pod, kubernetes-csi/csi-driver-host-path#192) -> fewer pods (see the sketch after this list)
  • link code from the different sidecars into one controller sidecar and one node sidecar -> fewer containers, less overhead (the Go runtime is linked only once, leader election and informers are shared, and the storage capacity tracking code in the provisioner can react to changes in the resizer)
  • link code from the sidecars directly into the CSI driver binary -> even fewer containers

The last one probably doesn't make sense because it only works for drivers written in Go. Listed for the sake of completeness...
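
For the first idea, the consolidated layout would look roughly like the sketch below (again only an illustration, not the manifests from kubernetes-csi/csi-driver-host-path#192; image tags are placeholders): one pod carrying the driver and its sidecars, so each test case schedules one pod instead of several.

```yaml
# Sketch of the consolidated "all sidecars in the driver pod" layout; placeholders only.
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpathplugin
spec:
  containers:
  - name: hostpath
    image: k8s.gcr.io/sig-storage/hostpathplugin:v1.6.0     # placeholder tag
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0    # placeholder tag
  - name: csi-resizer
    image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0        # placeholder tag
```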

@BenTheElder
Member

For the record: after @aojea reverted the most recent startup probe test change, we are now at a normal ~80% pass rate, which consists of a small number of startup probe flakes and normal failures due to bad PRs (code doesn't build, or debug code tanks performance, and either way all the e2e jobs fail, not just kind).

Linking the sidecars together makes sense to me. We probably still need to get actual before & after measurements. Prow will be a little noisy though.

@BenTheElder
Member

This still needs follow-up, but I want to close this issue because the job should not be running into this anymore, except for low-volume ongoing flakes with the startup probe test.

Right now the job is still at:

Success rate over time: 3h: 88%, 12h: 81%, 48h: 76%

So the CSI tests still need follow-up, and startupProbe still needs follow-up, but that's not the original issue; the original issue, that the job was flaking and blocking contributors, has long been resolved.

@pohly
Contributor

pohly commented May 11, 2021

Do we have an issue open about re-enabling the hostpath tests in these kind jobs (i.e. about reverting kubernetes/test-infra#22025)?

@msau42
Member

msau42 commented May 11, 2021

Opened #101913
