
kubernetes-test-go: Build timed out #24285

Closed
lavalamp opened this issue Apr 14, 2016 · 24 comments · Fixed by #24480
Labels
area/test-infra · kind/flake · priority/critical-urgent

Comments

@lavalamp
Member

@lavalamp added the priority/critical-urgent, area/test-infra, and kind/flake labels Apr 14, 2016
@lavalamp
Member Author

@spxtr Can you take a look? I think the gke build has the same problem.

@spxtr
Contributor

spxtr commented Apr 14, 2016

Looks like they've gotten significantly slower since the last time this came up. I'll try to figure out why.

@lavalamp
Member Author

This happened again and is continuing to block the merge queue. @spxtr can you make a bandaid that extends the timeout?

@spxtr
Contributor

spxtr commented Apr 14, 2016

Done, but I need to stress that we need to know why the duration is climbing. I'm afraid in a few months I'll get another issue that says "kubernetes-test-go is timing out".

@lavalamp
Member Author

Thanks. Yes, we can leave this issue open until we figure it out.

@lavalamp
Member Author

Actually, it was taking 14-17 minutes, so a 30-minute timeout should hopefully be enough. Hm.

@spxtr
Contributor

spxtr commented Apr 14, 2016

That build is the result of a PR going in that affects the build: it had to download a new build image, and it also needs to build more tarballs. We might need to legitimately bump that one's timeout.

@lavalamp
Member Author

It looks like it's about to time out again.

@spxtr
Contributor

spxtr commented Apr 14, 2016

Barely passed. The build got slower by a factor of two after #23931 went in. This was somewhat expected. I'll bump the timeout.

@lavalamp
Member Author

Maybe we should exclude those platforms from our testing? We don't need ppc or arm binaries. Should we roll back #23931?

@spxtr
Contributor

spxtr commented Apr 14, 2016

I don't think we should roll it back. We want kubernetes-build to do a full release build, which now includes building for those architectures.

However, if the PR Jenkins e2e job starts timing out, then we should probably change it to do a quick-release instead of the full release.

cc @luxas
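
For context, a sketch of the two build modes under discussion; `make release` and `make quick-release` are the targets the Kubernetes repo exposes for this, but treat the exact invocations here as assumptions rather than the job's literal configuration:

```bash
# Sketch: the two build modes discussed above (exact targets/flags are
# assumptions; the real invocations live in the Jenkins job config).

# Full release: cross-compile server/client/test targets for every
# supported platform and package all tarballs. This is what the
# kubernetes-build job wants to keep doing.
make release

# Quick release: build for the host platform (linux/amd64) only, skipping
# the extra architectures. Much faster, at the cost of cross-compile coverage.
make quick-release
```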

@spxtr added the priority/backlog label and removed the priority/critical-urgent label Apr 14, 2016
@spxtr
Contributor

spxtr commented Apr 14, 2016

Dropped the priority now that the bleeding has stopped. I'll try to figure out why the test-go job got slower.

@spxtr removed the kind/flake label Apr 14, 2016
@david-mcmahon
Contributor

@luxas this is a fairly significant increase in build and release times with #23931. Is there anything we can do to mitigate the increase? I see many packages downloaded during the build. Are we maybe downloading more than we need to? Can we cache anything somehow/somewhere? Can we parallelize package updating and/or building?

@luxas
Member

luxas commented Apr 15, 2016

The problem is not all the things that get downloaded into kube-cross; that only happens once.
It's the build time that's the problem. Building with go1.5+ can be ~2x slower, and we've now both upgraded to go1.6 and added more server platforms, so the increase in build time is expected. Sorry for not notifying you beforehand, though.

Steps we could take to decrease the time:

  • Move cmd/linkcheck to test targets. I don't know why that one is considered a server target.
  • Build test targets only for linux/amd64 (now linux/amd64, windows/amd64, darwin/amd64, linux/arm); see the sketch after this list
  • Remove addon images: don't ship kube-registry-proxy and pause images in tars. #23605
  • Consider dropping support for */386 for kubectl
  • Remove cmd/kubemark for arm, arm64, and ppc64le. I don't think it's required in official builds.
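
As a sketch of what the test-target trim could look like, modeled on the platform lists kept in hack/lib/golang.sh (the variable name and current contents shown here are assumptions):

```bash
# Hypothetical sketch: narrow the platforms that test targets are
# cross-compiled for. The real list lives in hack/lib/golang.sh; the
# variable name below is assumed.

# Current: test targets are built for all four platforms.
KUBE_TEST_PLATFORMS=(
  linux/amd64
  darwin/amd64
  windows/amd64
  linux/arm
)

# Proposed: keep linux/amd64 (and probably darwin/amd64, per the
# discussion below), dropping windows/amd64 and linux/arm.
KUBE_TEST_PLATFORMS=(
  linux/amd64
  darwin/amd64
)
```

Dropping two of the four platforms would roughly halve the test-target cross-compile work.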

@luxas
Member

luxas commented Apr 15, 2016

@spxtr We should still do a full build for all arches on every CI run, so we can detect regressions in the code.

@spxtr
Contributor

spxtr commented Apr 15, 2016

Move cmd/linkcheck to test targets. I don't know why that one is considered a server target.

Whoops, I meant to make a PR to do that a while ago. cc @caesarxuchao

@caesarxuchao
Member

@spxtr thanks for letting me know. Do you want me to send the PR or will you?

@spxtr
Contributor

spxtr commented Apr 15, 2016

Build test targets only for linux/amd64 (now linux/amd64, windows/amd64, darwin/amd64, linux/arm)

I think we definitely want to build test targets for at least darwin, since plenty of people develop on a Mac. It might be worth dropping tests for arm and windows.

Remove cmd/kubemark for arm, arm64 and ppc64le. I don't think it's required from official builds.

sgtm

k8s-github-robot pushed a commit that referenced this issue Apr 16, 2016
Automatic merge from submit-queue

Move cmd/linkcheck to test targets.

#24285 (comment)
@spxtr added the priority/critical-urgent label Apr 16, 2016
@spxtr added the kind/flake label Apr 16, 2016
@spxtr
Contributor

spxtr commented Apr 18, 2016

All of our kubernetes-build and kubernetes-test-go jobs are running on a single n1-highmem-32 instance, which is hitting 100% CPU usage fairly often. This is most likely why our test-go times are so inconsistent lately.

@luxas
Member

luxas commented Apr 18, 2016

Is there anything I can help with? (I don't have access to your servers.)

@spxtr
Contributor

spxtr commented Apr 18, 2016

@luxas I think I can handle the test-go problems. Feel free to work on your other suggestions; I think they're good. Thanks, though :)

@fejta removed the priority/backlog label Apr 19, 2016
@fejta
Contributor

fejta commented Apr 19, 2016

The timeout is too aggressive:

http://kubekins.dls.corp.google.com/job/kubernetes-test-go/buildTimeTrend shows a passing run at 75 minutes (build 11026), and the average passing run is in the mid-50-minute range. We need at least a 100-minute timeout (2x the ~50-minute average runtime) rather than 80.

Or to put it another way: to run this job reliably, I want the timeout to be twice the average runtime, not just a couple of minutes above it.
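
As back-of-the-envelope arithmetic, the sizing rule works out like this (a sketch using only the numbers quoted above):

```bash
# timeout = 2 x average passing runtime, per the rule proposed here.
avg_runtime_min=50                    # average passing run: mid-50 minutes
timeout_min=$((2 * avg_runtime_min))  # => 100 minutes, vs. the current 80
echo "kubernetes-test-go timeout should be at least ${timeout_min}m"
```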

@spxtr
Contributor

spxtr commented Apr 19, 2016

@fejta I'm wary of a 100-minute timeout. Last time this came up, the job was taking ~35 minutes (#23127), so an 80-minute timeout was fine. Some time last week it started taking 60-80 minutes and occasionally timing out. I'd like to know why. In the meantime, to fix the submit queue, let's bump the timeout.

I'm also starting to think we should run verify-*.sh, unit/integration tests, and test-cmd.sh in separate Jenkins jobs. kubernetes-test-go blocks the queue for over an hour now.
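
A rough sketch of that split; the script names are taken from the comment, while the aggregate verify runner and exact invocations are assumptions:

```bash
# Hypothetical decomposition of kubernetes-test-go into independent
# Jenkins jobs, one per phase, so no single phase blocks the queue
# for the whole hour. Entry points below are assumed, not confirmed.
hack/verify-all.sh     # job 1: the verify-*.sh style/codegen checks
make test              # job 2: unit tests
make test-integration  # job 2: integration tests (could pair with unit)
hack/test-cmd.sh       # job 3: kubectl command-line tests
```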

k8s-github-robot pushed a commit that referenced this issue Apr 19, 2016
Automatic merge from submit-queue

Bump kubernetes-test-go timeout.

It looks like the run times got more inconsistent because of load on the VM. Adding another Jenkins slave improved things so we're no longer constantly timing out, but runs still come close to the limit at times.

Average runtime is ~45 minutes, so I went with a 100-minute timeout.

Fixes #24285
openshift-publish-robot pushed a commit to openshift/kubernetes that referenced this issue Jan 2, 2020
Remove patch for sa public key configuration

Origin-commit: 7631cfb6243de670c962eeb677f47bb3338c5924