
flaky pull-kubernetes-e2e-kind tests due to timeout #101275

Closed
gautierdelorme opened this issue Apr 20, 2021 · 24 comments

Comments

@gautierdelorme
Contributor

gautierdelorme commented Apr 20, 2021

Which jobs are flaking:

pull-kubernetes-e2e-kind

Which test(s) are flaking:

Different tests seem to randomly fail. Some examples:

  • [sig-network] KubeProxy should set TCP CLOSE_WAIT timeout [Privileged] : timed out waiting for the condition
  • [sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]: Failed to connect to exposed host ports
  • [sig-node] RuntimeClass should run a Pod requesting a RuntimeClass with scheduling without taints: timed out waiting for the condition
  • [sig-storage] CSI mock volume CSI Volume expansion should expand volume by restarting pod if attach=on, nodeExpansion=on: timed out waiting for the condition
  • ...

Reason for failure:

Timeouts

Anything else we need to know:

/kind flake
/sig network
/sig node
/sig storage
/cc @dims

@gautierdelorme gautierdelorme added the kind/flake Categorizes issue or PR as related to a flaky test. label Apr 20, 2021
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 20, 2021
@aojea
Member

aojea commented Apr 20, 2021

/assign
It is not always failing (https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind), but it seems to affect only #100490.

@aojea
Member

aojea commented Apr 20, 2021

@BenTheElder this is not looking good, lots of timeouts: https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind&sort-by-flakiness=20
Are you aware of any issue that could be causing it? You and @spiffxp have some graphs with trends; is there anything unusual?

@BenTheElder
Member

It's looking worse, but timeouts plague presubmits in general. We could reduce the test load, since the ipv6 job looks healthier.

I'm not aware of anything specific, but you should have access to the same data as me and Aaron.

kubernetes/k8s.io#1187 (comment) is what I'd prefer to pursue for now, though if flakiness concerns become excessive I'd suggest bringing the group of tests we run in line with the ipv6 job.

None of the presubmits run everything because it would be excessive. We want to min-max coverage vs. reliability & speed ...

@SergeyKanzhelev
Member

Is there anything we can do from the test authoring perspective? Or is this purely an infra issue?

@BenTheElder
Member

I don't think it's mostly a test authoring issue; there are tentatively two things going on here:

purely infra issue?

It's not purely infra though: pull-kubernetes-e2e-kind-ipv6 is perfectly acceptable other than the startup probe issue, while pull-kubernetes-e2e-kind is doing horribly right now.

The current best guess is changes to the CSI hostpath driver landing in k8s around the time the failures shot up: #100637 (comment)

slack thread: https://kubernetes.slack.com/archives/C09QZ4DQB/p1619723516188000

possible fix kubernetes-csi/csi-driver-host-path#277

We're considering skipping more tests in the non-ipv6 job to match ipv6 and bring the failure rate back to normal levels, unless it seems likely that this was a specific regression.

@bridgetkromhout
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 29, 2021
@bridgetkromhout
Member

/assign @aojea

@BenTheElder
Member

/assign
Tentatively looking healthier after kubernetes/test-infra#22025: https://prow.k8s.io/?job=pull-kubernetes-e2e-kind

Still not ideal

@aojea
Member

aojea commented Apr 30, 2021

Success rate over time: 3h: 75%, 12h: 56%, 48h: 44%

@BenTheElder
Member

A quick sampling shows the remaining flakes are pretty much all:

[sig-node] Probing container should be ready immediately after startupProbe succeeds

So we should either look at that test or skip it, and we should look into fixing the CSI hostpath tests and re-enabling them. cc @msau42 @pohly

@BenTheElder
Member

the pass rate is still pretty bad with the startupProbe test.

Success rate over time: 3h: 48%, 12h: 52%, 48h: 44%

A normal e2e presubmit should be at something like 70%+ over 12h during active development periods.

@msau42
Member

msau42 commented Apr 30, 2021

To re-enable the csi hostpath tests, we'll need to compare container resource usage before/after the changes. Do we have metrics collection enabled on kind clusters?

Another possibility is that for csi hostpath, every single container is deployed as a separate Pod. I don't think that would impact CPU/memory consumption on the node, but it would reduce the parallelism of the tests. One of the changes we did make was to add 2 more sidecars = 2 more pods to the hostpath driver. Multiplied by 30-40 test cases, that is 60-80 more pods needing to be scheduled.

@BenTheElder
Member

Maybe then we're hitting pod scheduling limits? kind jobs run with 2 schedulable nodes by default (and one control plane).

When I hear "sidecar" I usually think of an additional container, not a pod, though.
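
For reference, a minimal kind config matching that topology (one control plane plus two schedulable workers) would look roughly like the sketch below; the CI jobs generate their own config, so this is only an illustration.

```yaml
# Illustrative sketch only, not the config the CI jobs actually use:
# one control-plane node and two schedulable workers.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
```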

@BenTheElder
Member

We do not have metrics collection enabled; kubeadm does not support this OOTB, and we'd been waiting for a blessed addon manager from SCL. @aojea has an experiment related to this.

@msau42
Member

msau42 commented Apr 30, 2021

When I hear "sidecar" I usually think of an additional container, not a pod, though.

Correct: in a real production driver, we put all the sidecars in a single pod. However, we use the hostpath driver to validate that we have the correct RBACs set per container (because some sidecars are optional), so that's why in our CI we run each container as an individual pod. Perhaps we need to support both methods: kubernetes-csi/csi-driver-host-path#192
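
To illustrate the pattern being described (a hedged sketch only, not the actual csi-driver-host-path manifests; the ServiceAccount names and image tags are placeholders), the CI deployment runs roughly one pod per sidecar, each bound to its own ServiceAccount so the per-container RBAC can be validated:

```yaml
# Sketch of the "one pod per sidecar" CI pattern; names and versions are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpath-provisioner
spec:
  serviceAccountName: csi-provisioner-sa   # hypothetical, exercises only the provisioner RBAC
  containers:
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0   # placeholder tag
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpath-resizer
spec:
  serviceAccountName: csi-resizer-sa       # hypothetical, exercises only the resizer RBAC
  containers:
  - name: csi-resizer
    image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0       # placeholder tag
```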

@aojea
Member

aojea commented Apr 30, 2021

I have a way to install Prometheus and dump the database once the test fails: kubernetes-sigs/kind#2190. It monitors the ...

What kind of metrics are you looking for, @msau42? Because in KIND there are no per-node CPU/memory metrics, since the node containers use the host /proc...
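
As a hedged sketch (not the setup in kubernetes-sigs/kind#2190): the per-container cgroup metrics that are still meaningful inside kind nodes can be scraped from the kubelets' cAdvisor endpoint with a standard scrape job along these lines, assuming Prometheus runs in-cluster with a ServiceAccount allowed to read node metrics.

```yaml
# Minimal Prometheus scrape job for the kubelet cAdvisor endpoint.
scrape_configs:
- job_name: kubelet-cadvisor
  scheme: https
  tls_config:
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  - target_label: __metrics_path__
    replacement: /metrics/cadvisor
```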

@msau42
Member

msau42 commented Apr 30, 2021

Our two leading theories as to why the kind job became more flaky after the changes to the hostpath driver:

  • We added 2 more pods per test case = 60-80 more pods. Potentially reduces the test parallelism and can contribute to test timeouts.
  • New csi sidecars are consuming more resources, slowing down the system. It would be good to look at any cpu/mem metrics to see if this could be the case.

@pohly
Contributor

pohly commented Apr 30, 2021

* We added 2 more pods per test case = 60-80 more pods. Potentially reduces the test parallelism and can contribute to test timeouts.

Which additional pods are these?

https://github.com/kubernetes/kubernetes/pull/100637/files initially added csi-external-health-monitor-agent and csi-external-health-monitor-controller, but those were additional containers in the driver pod, not separate pods.

Later I added code to disable them. Even later, I downgraded to the driver YAML file that doesn't contain them. All of that was done before merging the PR, so if the job is flaky now, it's not because of these two.

Now that I think about it, https://github.com/kubernetes/kubernetes/pull/101360/files, which then removed the code that disables the containers, didn't change anything, because we are still on the old driver release without those containers.

@pohly
Contributor

pohly commented May 1, 2021

I have a way to install Prometheus and dump the database once the test fails: kubernetes-sigs/kind#2190. It monitors the ...

What kind of metrics are you looking for, @msau42? Because in KIND there are no per-node CPU/memory metrics, since the node containers use the host /proc...

A summary of CPU and RAM consumption of the entire Prow test job (i.e. including all processes running inside KinD) amortized over time for the entire duration of the job would be a good start. Is that possible?

Then we can directly see if changes that we make really let the tests run more efficiently. Right now we are speculating based on very indirect observations like timeouts in random tests.

We'll need a pull job which ideally runs just the storage tests that were disabled in kubernetes/test-infra#22025.

The first step, then, will be to test whether the sidecar update in #100637 really made the situation worse; so far the evidence for that is not conclusive.
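
If the Prometheus data from the kind PR above is available, that amortized summary could be approximated with recording rules along these lines (a sketch only; the metric names are the standard cAdvisor ones, and the 2h window is a placeholder for the job duration):

```yaml
groups:
- name: e2e-job-resource-summary
  rules:
  # Average cluster-wide CPU usage (in cores), sampled every 5m across the run.
  - record: job:container_cpu_cores:avg_over_run
    expr: avg_over_time(sum(rate(container_cpu_usage_seconds_total[5m]))[2h:5m])
  # Average cluster-wide working-set memory (in bytes), sampled every 5m across the run.
  - record: job:container_memory_working_set_bytes:avg_over_run
    expr: avg_over_time(sum(container_memory_working_set_bytes)[2h:5m])
```

Note that this only covers the containers cAdvisor sees, not every process in the Prow job, so it would understate the total.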

@pohly
Contributor

pohly commented May 1, 2021

Regarding running more efficiently, here are some ideas ranging from "makes sense" to "what is Patrick smoking":

  • run sidecars inside the same pod as the driver (Consolidate all the hostpath driver specs into one pod, kubernetes-csi/csi-driver-host-path#192) -> fewer pods (see the sketch after this list)
  • link code from the different sidecars into one controller sidecar and one node sidecar -> fewer containers, less overhead (the Go runtime is linked only once, leader election and informers are shared, and the storage capacity tracking code in the provisioner can react to changes in the resizer)
  • link code from the sidecars directly into the CSI driver binary -> even fewer containers

The last one probably doesn't make sense because it only works for drivers written in Go. Listed for the sake of completeness...
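
For the first idea, the consolidated layout would look roughly like the sketch below (again only an illustration, not the manifests from kubernetes-csi/csi-driver-host-path#192; image tags are placeholders): one pod carrying the driver and its sidecars, so each test case schedules one pod instead of several.

```yaml
# Sketch of the consolidated "all sidecars in the driver pod" layout; placeholders only.
apiVersion: v1
kind: Pod
metadata:
  name: csi-hostpathplugin
spec:
  containers:
  - name: hostpath
    image: k8s.gcr.io/sig-storage/hostpathplugin:v1.6.0     # placeholder tag
  - name: csi-provisioner
    image: k8s.gcr.io/sig-storage/csi-provisioner:v2.1.0    # placeholder tag
  - name: csi-resizer
    image: k8s.gcr.io/sig-storage/csi-resizer:v1.1.0        # placeholder tag
```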

@BenTheElder
Member

For the record: after @aojea reverted the most recent startup probe test change, we are now at a normal ~80% pass rate, which consists of a small number of startup probe flakes and normal failures due to bad PRs (code doesn't build, or debug code tanks performance, and either way all the e2e jobs fail, not just kind).

Linking the sidecars together makes sense to me. We probably still need to get actual before & after measurements. Prow will be a little noisy though.

@BenTheElder
Member

This still needs follow-up, but I want to close this issue because the job should not be running into this anymore, except for low-volume ongoing flakes with the startup probe test.

Right now the job is still at:

Success rate over time: 3h: 88%, 12h: 81%, 48h: 76%

So the CSI tests still need follow-up, and startupProbe still needs follow-up, but that's not the original issue; the original issue, that the job was flaking and blocking contributors, has long been resolved.

@pohly
Contributor

pohly commented May 11, 2021

Do we have an issue open about re-enabling the hostpath tests in these kind jobs (i.e. about reverting kubernetes/test-infra#22025)?

@msau42
Member

msau42 commented May 11, 2021

Opened #101913
