[Flaky Test] capz-windows-master ci-kubernetes-e2e-capz-master-windows.Overall #127408

Open
drewhagen opened this issue Sep 17, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

@drewhagen
Member

drewhagen commented Sep 17, 2024

Which jobs are flaking

  • master-informing:
    • capz-windows-master

Which tests are flaking?

ci-kubernetes-e2e-capz-master-windows.Overall

Since when has it been flaking?

  • Often once or twice daily since 09-03 05:10 CDT Prow link

Failed runs:

Testgrid link

Testgrid link

Reason for failure (if possible)

Sun, 15 Sep 2024 19:12:24 +0000: cluster creation complete
Sun, 15 Sep 2024 19:12:25 +0000: bastion info: capi@null:22
Sun, 15 Sep 2024 19:12:25 +0000: wait for cluster to stabilize
Sun, 15 Sep 2024 19:17:25 +0000: cleaning up
./capz/run-capz-e2e.sh: line 103: capz::ci-build-azure-ccm::cleanup: command not found
E0915 19:17:55.212078    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
E0915 19:18:25.214460    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
E0915 19:18:55.216389    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
E0915 19:19:25.218324    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
E0915 19:19:55.220740    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
Unable to connect to the server: dial tcp 20.120.140.205:6443: i/o timeout
+ EXIT_VALUE=1
+ set +o xtrace
Cleaning up after docker in docker.
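
The excerpt above contains two separate signatures: a cleanup function that was called but not found (`command not found` at line 103 of `run-capz-e2e.sh`), and repeated `i/o timeout` errors when dialing the API server on port 6443. As a rough triage aid, a sketch like the following could count both in a saved build log. The name `classify_log` and the signature labels are hypothetical and not part of the CAPZ tooling; the patterns are based only on the lines shown here.

```python
import re
from collections import Counter

# Hypothetical signature labels, mapped to patterns taken from the log excerpt.
SIGNATURES = {
    "missing_cleanup_function": re.compile(r"line \d+: (\S+): command not found"),
    "apiserver_dial_timeout": re.compile(
        r"dial tcp (\d+\.\d+\.\d+\.\d+):6443: i/o timeout"
    ),
}


def classify_log(text: str) -> Counter:
    """Count how many log lines match each known failure signature."""
    counts = Counter()
    for line in text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Three lines copied verbatim from the failure log in this issue.
sample = r"""./capz/run-capz-e2e.sh: line 103: capz::ci-build-azure-ccm::cleanup: command not found
E0915 19:17:55.212078    2635 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: Get \"https://capz-conf-l4mu8v-a076bee8.westus2.cloudapp.azure.com:6443/api?timeout=32s\": dial tcp 20.120.140.205:6443: i/o timeout"
Unable to connect to the server: dial tcp 20.120.140.205:6443: i/o timeout
"""

print(classify_log(sample))
```

Counting the two signatures separately may help tell a cleanup script problem apart from an unreachable control plane, since both appear in the same run.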

Anything else we need to know?

Relevant SIG(s)

/sig windows
/kind flake

cc: @kubernetes/release-team-release-signal

@drewhagen drewhagen converted this from a draft issue Sep 17, 2024
@k8s-ci-robot k8s-ci-robot added sig/windows Categorizes an issue or PR as relevant to SIG Windows. kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 17, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@drewhagen
Member Author

Thanks y'all! I notice that #126096 is in active code review.

Also, @kubernetes/sig-windows-bugs The first release cut (1.32.0-alpha.1) is due in less than a week from today on Oct 1st 2024. Given that this flake is on master informing and is being addressed, can we consider this a Non-Blocker for this next release cut? Please advise - thank you!

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 25, 2024
@marosset
Contributor

marosset commented Oct 1, 2024

This isn't a blocker.

These errors are failures in bringing up a test cluster and happen before we run any of the e2e tests.
I think we need to figure out how to get more logs for these failures: either Azure ARM logs or possibly some logs from the capz-controllers.

/cc @jsturtevant @ritikaguptams

@knabben
Member

knabben commented Oct 11, 2024

Updated Windows 2022 in the job, but it does not seem to have had a clear effect.

In parallel, another infra issue is happening on CAPZ. It is related to region availability and only gets fixed on the next retry.

	--------------------------------------------------------------------------------
	RESPONSE 409: 409 Conflict
	ERROR CODE: SkuNotAvailable
	--------------------------------------------------------------------------------
	{
	  "error": {
	    "code": "SkuNotAvailable",
	    "message": "The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D2s_v3' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details.",
	    "target": "vmSize"
	  }
	}
	--------------------------------------------------------------------------------
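
Since this failure is a capacity problem rather than a test problem, it may help to tell it apart from the API server timeouts when triaging. Below is a minimal sketch, assuming the error body has the shape shown above. `parse_capacity_error` and `CapacityError` are hypothetical names, and the regular expressions are based only on this one message, so other `SkuNotAvailable` messages may not match.

```python
import json
import re
from typing import NamedTuple, Optional


class CapacityError(NamedTuple):
    sku: str
    location: str


# Patterns derived from the single message above (an assumption, not an Azure spec).
_SKU_RE = re.compile(r"Capacity Restrictions: ([A-Za-z0-9_]+)")
_LOCATION_RE = re.compile(r"not available in location '([^']+)'")


def parse_capacity_error(body: str) -> Optional[CapacityError]:
    """Return the SKU and location for a SkuNotAvailable error body, else None."""
    error = json.loads(body).get("error", {})
    if error.get("code") != "SkuNotAvailable":
        return None
    message = error.get("message", "")
    sku = _SKU_RE.search(message)
    location = _LOCATION_RE.search(message)
    if not (sku and location):
        return None
    return CapacityError(sku.group(1), location.group(1))


# Error body copied from the response in this comment.
body = """
{
  "error": {
    "code": "SkuNotAvailable",
    "message": "The requested VM size for resource 'Following SKUs have failed for Capacity Restrictions: Standard_D2s_v3' is currently not available in location 'westus2'. Please try another size or deploy to a different location or different zone. See https://aka.ms/azureskunotavailable for details.",
    "target": "vmSize"
  }
}
"""

print(parse_capacity_error(body))
```

A result other than `None` would point to the VM size or region rather than the cluster under test, which matches the observation that the next retry succeeds.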

@drewhagen
Member Author

/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Oct 18, 2024
@drewhagen
Member Author

Hello @knabben @marosset. Thanks for taking action on this!

A friendly reminder of what's ahead:

  • Code freeze is starting 02:00 UTC Friday November 8th 2024 (about 3 weeks from now), and while there is still time, we want to ensure that each PR has a chance to be merged on time.

Given this timeline and capacity, will a fix for this continue to be aimed for the 1.32 release?
Thanks! 😄 🚀

@drewhagen
Member Author

👋 @marosset @knabben
Thanks for updating Windows 2022 in that job. Is this still an issue, and do we plan to resolve it for v1.32?

To that end, I want to extend a friendly reminder that the code freeze is starting 02:00 UTC Friday November 8th 2024 (a little less than 1 week from now). Please make sure any new PRs have both lgtm and approved labels before the code freeze. Thanks! 👍

@drewhagen
Member Author

👋 Hello @marosset @knabben!
Appreciate all of your efforts with this! Is the plan still to resolve this issue for v1.32?
If so, a gentle reminder that the code freeze started at 02:00 UTC on Friday November 8th 2024. Please make sure any PRs have both lgtm and approved labels ASAP, and file an Exception.
Thanks!

@drewhagen drewhagen moved this from Tracked to At Risk in [sig-release] Bug Triage Nov 18, 2024
@drewhagen
Member Author

Since this isn't high priority and code freeze has passed, we're going to remove the 1.32 milestone.

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.32 milestone Dec 4, 2024
@drewhagen
Member Author

/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Dec 4, 2024
Projects
Status: At Risk
Status: No status