flaky pull-kubernetes-e2e-kind tests due to timeout #101275
/assign
@BenTheElder this is not looking good https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind&sort-by-flakiness=20 , lots of timeouts; are you aware of any issue that could be causing them?
It's looking worse, but timeouts plague presubmit in general. We could reduce the test load, as the ipv6 job looks healthier. I'm not aware of anything specific, but you should have access to the same data as me and aaron. kubernetes/k8s.io#1187 (comment) is what I'd prefer to pursue for now, though if flakiness concerns become excessive I'd suggest bringing the group of tests we run in line with the ipv6 job. None of the presubmits run everything because it would be excessive; we want to min-max coverage vs. reliability & speed ...
Is there anything we can do from the test authoring perspective? Or is this purely an infra issue?
I don't think it's primarily a test authoring issue; there are tentatively two things going on here:
It's not purely infra, though. The current best guess is the changes to the CSI hostpath driver that landed in k8s around the time failures shot up: #100637 (comment). Slack thread: https://kubernetes.slack.com/archives/C09QZ4DQB/p1619723516188000. Possible fix: kubernetes-csi/csi-driver-host-path#277. We're considering skipping more tests in the non-ipv6 job to match ipv6 and bring the failure rate back to normal levels, unless it seems likely that it was a specific regression.
/triage accepted
/assign @aojea
/assign
Still not ideal
the pass rate is still pretty bad with the startupProbe test.
A normal e2e presubmit should sit at something like a 70%+ pass rate over 12h during active development periods.
To re-enable the csi hostpath tests, we'll need to compare container resource usage before/after the changes. Do we have metrics collection enabled on kind clusters? Another possibility is that for csi hostpath, every single container is deployed as a separate Pod. I don't think that would impact cpu/memory consumption on the node, but it would reduce parallelism of the tests. One of the changes that we did make was to add 2 more sidecars = 2 more pods to the hostpath driver. Multiplied by 30-40 test cases, that's 60-80 more pods needing to be scheduled.
maybe then we're hitting pod scheduling limits? kind jobs run with 2 schedulable nodes by default (and one control plane). When I hear "sidecar" I usually think additional container, not pod, though.
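For a rough sense of the headroom, here is a back-of-envelope sketch; the kubelet max-pods default and the per-node system-pod count are assumptions on my part rather than numbers measured in these jobs, and it pessimistically assumes all of those test cases have their pods running at the same time:

```python
# Back-of-envelope pod-capacity check for a default kind presubmit cluster.
# MAX_PODS_PER_NODE and SYSTEM_PODS_PER_NODE are assumptions, not CI measurements.
WORKER_NODES = 2                 # kind presubmit: 2 schedulable nodes
MAX_PODS_PER_NODE = 110          # kubelet's default --max-pods
SYSTEM_PODS_PER_NODE = 10        # rough guess: CNI, kube-proxy, etc.

EXTRA_SIDECAR_PODS = 2           # sidecars now deployed as standalone pods
PARALLEL_STORAGE_TESTS = 40      # upper end of the 30-40 test case estimate

capacity = WORKER_NODES * (MAX_PODS_PER_NODE - SYSTEM_PODS_PER_NODE)
extra_load = EXTRA_SIDECAR_PODS * PARALLEL_STORAGE_TESTS

print(f"schedulable pod capacity: {capacity}")            # 200
print(f"additional hostpath pods at peak: {extra_load}")  # 80
print(f"share of capacity used by the extra pods: {extra_load / capacity:.0%}")  # 40%
```

If those assumptions are roughly right, raw pod capacity isn't exhausted, though the extra churn could still slow scheduling and pod startup.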
we do not have metrics collection enabled; kubeadm does not support this OOTB and we'd been waiting for a blessed addon manager from SCL. @aojea has an experiment related to this.
Correct, in a real production driver we put all the sidecars in a single pod. However, we use the hostpath driver to validate that we have the correct RBACs set per container (because some sidecars are optional), so that's why in our CI we run each container as an individual pod. Perhaps we need to support both methods: kubernetes-csi/csi-driver-host-path#192
I have a way to install prometheus and dump the database once the test fails: kubernetes-sigs/kind#2190. What kind of metrics are you looking for, @msau42?
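To make the ask concrete, here is a minimal sketch of a query that could be run against such a Prometheus instance; the port-forwarded localhost URL and the choice of metric are assumptions on my part:

```python
# Sketch only: assumes the Prometheus from kubernetes-sigs/kind#2190 has been
# port-forwarded to localhost:9090; URL and metric selection are assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"

# cAdvisor metric scraped from the kubelet: per-pod CPU usage over the last 5m.
QUERY = 'sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))'

resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    cores = float(series["value"][1])
    print(f'{labels.get("namespace", "?")}/{labels.get("pod", "?")}: {cores:.3f} cores')
```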
Our two leading theories as to why the kind job became more flaky after making changes to the hostpath driver:
Which additional pods are these? https://github.com/kubernetes/kubernetes/pull/100637/files initially added csi-external-health-monitor-agent and csi-external-health-monitor-controller, but those were additional containers in the driver pod, not separate pods. Later I added code to disable them. Even later, I downgraded to the driver YAML file that doesn't contain them. All of that was done before merging the PR, so if the job is flaky now, it's not because of these two. Now that I think about it, https://github.com/kubernetes/kubernetes/pull/101360/files, which then removed the code that disables the containers, didn't change anything because we are still on the old driver release without those containers.
A summary of CPU and RAM consumption of the entire Prow test job (i.e. including all processes running inside KinD), amortized over time for the entire duration of the job, would be a good start. Is that possible? Then we can directly see whether the changes that we make really let the tests run more efficiently. Right now we are speculating based on very indirect observations like timeouts in random tests. We'll need a pull job which ideally runs just the storage tests that were disabled in kubernetes/test-infra#22025. The first step then will be to test whether the sidecar update in #100637 really made the situation worse; so far the evidence for that is not conclusive.
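To make "amortized over the whole job" concrete, a possible sketch along the lines of the Prometheus approach above (the URL, the job start/end timestamps, and the step are placeholders, not values from a real run):

```python
# Sketch: average cluster-wide CPU (cores) and RAM over the whole job window
# via Prometheus' range API. URL, timestamps and step are placeholders.
import requests

PROM_URL = "http://localhost:9090/api/v1/query_range"
JOB_START = 1620000000   # prow job start (unix seconds) - placeholder
JOB_END = 1620007200     # prow job end (unix seconds) - placeholder
STEP = "60s"

def average_over_job(query: str) -> float:
    resp = requests.get(
        PROM_URL,
        params={"query": query, "start": JOB_START, "end": JOB_END, "step": STEP},
        timeout=30,
    )
    resp.raise_for_status()
    samples = [
        float(value)
        for series in resp.json()["data"]["result"]
        for _, value in series["values"]
    ]
    return sum(samples) / len(samples) if samples else 0.0

cpu_cores = average_over_job('sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))')
ram_bytes = average_over_job('sum(container_memory_working_set_bytes{container!=""})')

print(f"avg CPU over job: {cpu_cores:.2f} cores")
print(f"avg RAM over job: {ram_bytes / 2**30:.2f} GiB")
```

Comparing those two numbers for runs before and after #100637 would be one way to get the before/after measurement discussed here.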
Regarding running more efficiently, here are some ideas ranging from "makes sense" to "what is Patrick smoking":
The last one probably doesn't make sense because it only works for drivers written in Go. Listed for the sake of completeness...
For the record: after @aojea reverted the most recent startup probe test change, we are now at a normal ~80% pass rate, which consists of a small number of startup probe flakes and normal failures due to bad PRs (code doesn't build, or debug code tanks performance, and either way all the e2e jobs fail, not just kind). Linking the sidecars together makes sense to me. We probably still need to get actual before & after measurements. Prow will be a little noisy though.
this still needs follow up, but I want to close this issue because the job should not be running into this anymore, apart from the low-volume ongoing flakes with the startup probe test that it is still seeing right now.
so the CSI tests still need follow up, and startupProbe still needs follow up, but that's not the original issue, which was that the job was flaking and blocking contributors; that's long been resolved.
Do we have an issue open about re-enabling the hostpath tests in these kind jobs (i.e. about reverting kubernetes/test-infra#22025)?
Opened #101913
Which jobs are flaking:
pull-kubernetes-e2e-kind
Which test(s) are flaking:
Different tests seem to randomly fail. Some examples:
[sig-network] KubeProxy should set TCP CLOSE_WAIT timeout [Privileged]: timed out waiting for the condition
[sig-network] HostPort validates that there is no conflict between pods with same hostPort but different hostIP and protocol [LinuxOnly] [Conformance]: Failed to connect to exposed host ports
[sig-node] RuntimeClass should run a Pod requesting a RuntimeClass with scheduling without taints: timed out waiting for the condition
[sig-storage] CSI mock volume CSI Volume expansion should expand volume by restarting pod if attach=on, nodeExpansion=on: timed out waiting for the condition
Reason for failure:
Timeouts
Anything else we need to know:
/kind flake
/sig network
/sig node
/sig storage
/cc @dims