
Establish process for caring for non-merge-queue test suites #18116

Closed · 4 of 8 tasks
ikehz opened this issue Dec 3, 2015 · 6 comments
Assignees: @spxtr
Labels: area/test-infra, priority/important-soon (must be staffed and worked on either currently, or very soon, ideally in time for the next release)

ikehz (Contributor) commented Dec 3, 2015

We should have three different kinds of test suites: 'merge-queue', 'critical builds', and 'other'. The build cop is responsible for the first two.

Action items:

  • What is the definition of a critical build?
    • Write tools to enforce this state of affairs
  • Establish how to find the owner of a job and method (@spxtr)
  • Find and define an “owner” for each test method
  • Find and define the “owner” for each job (@spxtr)
  • Take action on soak tests: they should be green for a week
  • Identify candidates to put into critical (@jlowdermilk)
  • What is the point of the 'other' suites if no one fixes them?

Assigning to @spxtr to manage/delegate the above.

ikehz added the priority/critical-urgent and area/test-infra labels on Dec 3, 2015
ikehz added the priority/important-soon label and removed priority/critical-urgent on Dec 3, 2015
pmorie (Member) commented Dec 3, 2015

@kubernetes/rh-cluster-infra

eparis (Contributor) commented Dec 3, 2015

I feel like there must be a whole lot of 'why' and backstory that I'm missing. Can you explain it like I'm 5?

ikehz (Contributor, Author) commented Dec 3, 2015

Oh, sorry. This is kind of internal Google stuff.

We have test suites that block the merge-bot when they start failing; those are definitely watched by our build cop. We also have suites that are marked as "critical", but it's not clear what that means. Finally, there are a bunch of suites that aren't marked as "critical" at all, and a lot of them are failing consistently.

We need to establish how to decide when a build is allowed to be "critical" or "merge-blocking", and how to assign responsibility for the "other" builds, so that we don't end up with dozens of test suites lying around that aren't providing us any information.
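To make the three tiers concrete, here is a minimal Go sketch of how the classification might be encoded. The Suite type, tier names, and example field values are all hypothetical; the real job definitions live in the Jenkins configuration, not in code like this.

package suites

// Tier says how failures of a suite are handled (hypothetical encoding).
type Tier int

const (
	MergeQueue Tier = iota // failures block the submit queue; build cop watches
	Critical               // build cop watches, but failures don't block merges
	Other                  // the suite's owner alone is responsible for fixes
)

// Suite pairs a Jenkins job with its tier and a contact.
type Suite struct {
	Job   string // Jenkins job name, e.g. "kubernetes-e2e-gce-flannel"
	Tier  Tier
	Owner string // GitHub team or user to ping when it breaks
}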

spxtr (Contributor) commented Mar 3, 2016

I'm not sure that this is useful as a tracking issue, although the underlying issue still remains. For instance, kubernetes-e2e-gce-flannel has been broken for a long time, but nobody cares. Technically it has an owner, but that doesn't mean the owner needs to fix it anytime soon.

Soak tests are a long-term goal on their own that deserve their own issue.

I don't think we're going to get owners per test method. It might be worth making it policy to add a comment above each test naming the team to contact if it breaks, as in the sketch below. It's usually pretty obvious from the git history, though.
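A minimal sketch of what such a per-test annotation could look like. The package, test name, and owning team are invented for illustration, and nothing consumes the comment automatically; it only tells a human who to contact.

package e2e

import "testing"

// OWNER: @kubernetes/sig-network
// Contact the team above if this test starts failing.
// (Hypothetical example of the annotation proposed here.)
func TestPodNetworkConnectivity(t *testing.T) {
	t.Skip("illustrative placeholder; real test body elided")
}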

spxtr closed this as completed on Mar 16, 2016
bprashanth (Contributor) commented

> I'm not sure that this is useful as a tracking issue, although the underlying issue still remains. For instance, kubernetes-e2e-gce-flannel has been broken for a long time, but nobody cares.

Actually it times out; the suite is still running. The important part with flannel is to prove networking works for the scalability suite, which we've proven by running > 50% of the tests and the enormous cluster. I've been meaning to look at the timeouts but haven't had time (that's why it's a feature, right?).

bprashanth (Contributor) commented

Actually it WAS timing out at one point. Now it's:

7fb527513480d8229e363ca62d8548fbeea92cbe)
02:04:33 +++ kubernetes-salt.tar.gz uploaded (sha1 = 890376d5ba609abb84e0416510f4f69cfffa5aee)
02:04:34 Starting master and configuring firewalls
02:04:35 ERROR: (gcloud.compute.firewall-rules.create) Some requests did not succeed:
02:04:35  - The resource 'projects/kubernetes-flannel/global/firewalls/e2e-flannel-master-https' already exists
02:04:35 ERROR: (gcloud.compute.disks.create) Some requests did not succeed:
02:04:35  - The resource 'projects/kubernetes-flannel/zones/us-central1-f/disks/e2e-flannel-master-pd' already exists
02:04:35 2016/03/16 02:04:35 e2e.go:200: Error running up: exit status 1
02:04:35 2016/03/16 02:04:35 e2e.go:196: Step 'up' finished in 14.084397715s
02:04:35 

Which is #21564
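The errors in that log are resources leaked by an earlier run, which is why the next "up" step can't recreate them. A minimal cleanup sketch under that assumption, written in Go for consistency with the other sketches here: the resource names come from the log above and the flags are standard gcloud flags, but running this deletes live resources, so treat it as illustration only, not the project's actual cleanup tooling.

package main

import (
	"log"
	"os/exec"
)

func main() {
	// Delete the two resources the log reports as "already exists"
	// so that the next e2e "up" step can recreate them cleanly.
	cmds := [][]string{
		{"gcloud", "compute", "firewall-rules", "delete",
			"e2e-flannel-master-https",
			"--project=kubernetes-flannel", "--quiet"},
		{"gcloud", "compute", "disks", "delete",
			"e2e-flannel-master-pd",
			"--project=kubernetes-flannel",
			"--zone=us-central1-f", "--quiet"},
	}
	for _, c := range cmds {
		if out, err := exec.Command(c[0], c[1:]...).CombinedOutput(); err != nil {
			log.Printf("%v failed: %v\n%s", c, err, out)
		}
	}
}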
