
Master upgrade test harness #8081

Closed
roberthbailey opened this issue May 11, 2015 · 29 comments · Fixed by #9517
Comments

@roberthbailey
Contributor

Forked from #3135.

For testing master upgrades, we will need a test harness that does the following (a rough sketch follows at the end of this comment):

  1. Launches a cluster at some known version (e.g. latest release)
  2. Runs some amount of work on the cluster
    • Running e2es would be OK, but e2e tests tend to clean up after themselves, so a better-suited test would be one that keeps running during the upgrade
  3. Upgrades the master to a new version (e.g. latest release running in Jenkins)
  4. Verifies that the cluster is still in a sane state (from the previous tests)
  5. Runs e2e tests against the upgraded cluster.

/cc @zmerlynn @satnam6502
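
A rough shell sketch of the flow above. Everything here is illustrative: the stable.txt pointer matches what the eventual Jenkins job reads (see later in this thread), but the CI pointer, the upgrade script flag, and the workload helpers are assumptions, not an existing implementation.

```bash
#!/usr/bin/env bash
# Hypothetical harness outline for steps 1-5 above; not an existing script.
set -o errexit -o nounset -o pipefail

# 1. Launch a cluster at a known version (e.g. the latest stable release).
OLD_VERSION=$(gsutil cat gs://kubernetes-release/release/stable.txt)
# ...fetch and extract the release tarball for ${OLD_VERSION} (elided)...
cluster/kube-up.sh

# 2. Start a workload that keeps running across the upgrade (placeholder
#    helper; e2e tests clean up after themselves, so a persistent workload
#    is preferable here).
start_persistent_workload

# 3. Upgrade only the master to the candidate version (e.g. latest CI build).
NEW_VERSION=$(gsutil cat gs://kubernetes-release/ci/latest.txt)  # assumed pointer
cluster/gce/upgrade.sh -M "${NEW_VERSION}"                       # flag is illustrative

# 4. Verify the pre-upgrade workload is still healthy (placeholder helper).
verify_persistent_workload

# 5. Run the e2e suite against the upgraded cluster.
hack/ginkgo-e2e.sh
```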

@roberthbailey roberthbailey added area/test, priority/important-soon, area/hosting, area/test-infra labels May 11, 2015
@roberthbailey roberthbailey added this to the v1.0 milestone May 11, 2015
@satnam6502
Contributor

I volunteer a workload based on Elasticsearch.

@roberthbailey
Contributor Author

I forked #8082 to discuss the workload part. This one is for the harness (Jenkins?) itself. Feel free to assign #8082 to yourself.

@davidopp
Member

@ixdy are you available to work on this?

@ghost

ghost commented May 15, 2015

@ixdy is pretty busy right now getting e2es to run on every PR pre-merge, so that we don't keep wasting so much time tracking down regressions. I'd like him to stay focused on that until it's done, hopefully mid next week.

@davidopp
Member

Yeah, sorry I didn't update the issue; I checked with him a couple of days ago and he said the same, so I was leaving it unassigned for now. I am hoping @piosz can work on this but haven't discussed it with him yet.

@ghost

ghost commented May 15, 2015

@saad-ali Is also interested in working on it. I just discussed it with him.

@ghost

ghost commented May 15, 2015

Provisionally assigning to @saad-ali

@ghost ghost assigned saad-ali May 15, 2015
@davidopp
Member

@fgrzadkowski FYI

@zmerlynn
Member

cc @alex-mohr

@zmerlynn
Member

@saad-ali: What's the status here?

@saad-ali
Member

The kubernetes-upgrade-gce test is set up in Jenkins. It currently picks up gs://kubernetes-release/release/stable.txt, deploys it to a cluster, runs E2E to verify stability, then runs the GCE upgrade script to upgrade to latest_ci, and runs another E2E pass to verify the cluster afterwards. The test is not passing yet, but the harness works, so we can close this bug.

@zmerlynn
Member

Where is it failing? If you "upgrade" instead to just stable.txt again, does it pass?
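
The sanity check suggested here, in the same hypothetical shell form as the earlier sketch: make the target equal the source version, so any failure points at the harness or upgrade machinery rather than at the new build.

```bash
# Identity "upgrade": source and target are both the current stable release.
# If this fails, the harness or upgrade script is broken, not the new version.
STABLE=$(gsutil cat gs://kubernetes-release/release/stable.txt)
cluster/gce/upgrade.sh -M "${STABLE}"   # flag is illustrative
hack/ginkgo-e2e.sh
```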

@saad-ali
Member

For v17.1 to v17.1 it fails at: "line 71: set-master-htpasswd: command not found"
This should be fixed in v18.0

@davidopp
Member

davidopp commented Jun 2, 2015

@saad-ali do you think you could add what is described in #8082 (comment), namely "invoke kube-push and run e2e again"?

@saad-ali
Member

saad-ali commented Jun 2, 2015

Yep, I'll take a crack at it today.

@saad-ali
Member

saad-ali commented Jun 3, 2015

Done. Added kubernetes-upgrade-gce-step5-kube-push and kubernetes-upgrade-gce-step6-e2e to the Jenkins GCE upgrade test.
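
The two added steps correspond roughly to the following shell form; the script names are the standard repo entry points, used here illustratively rather than quoted from the Jenkins config.

```bash
# kubernetes-upgrade-gce-step5-kube-push: push freshly built binaries onto
# the already-upgraded cluster.
cluster/kube-push.sh

# kubernetes-upgrade-gce-step6-e2e: run e2e again to confirm the cluster is
# still healthy after the push.
hack/ginkgo-e2e.sh
```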

@piosz
Member

piosz commented Jun 3, 2015

Thanks a lot @saad-ali !

@davidopp
Member

davidopp commented Jun 3, 2015

+1, thanks!

@davidopp
Member

davidopp commented Jun 3, 2015

@roberthbailey I think we can close this. If you agree, please close it.

@alex-mohr
Contributor

Thanks for the work here!

My understanding from @saad-ali's comment at #8081 (comment) is that, of the steps in Robby's initial comment, the tests now do 1, 3, and 5, but not 2 or 4?

If the e2es clean up after themselves and we're testing upgrading an empty cluster, are we vulnerable to upgrades breaking existing pods/services/etc. in some way, or restarting all pods?

Or maybe turning this around: if we had these tests for borgmaster rollouts, would you be content? :)

@satnam6502
Contributor

Certainly happy to jump on this on Monday when I return, assuming @thockin is OK with it. I think I asked this question on another thread, but don't we also want to check that the upgrade actually occurred? Otherwise a "NOP" operation would satisfy the test. This seems to suggest that we want something running that spans the lifetime of the two clusters and applies a probe to check for the dy/dx.
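
A minimal sketch of that probe, assuming nothing beyond kubectl: record the server version before the upgrade and insist it changed afterwards, so a no-op cannot pass.

```bash
# Before the upgrade: remember what the master reports.
before=$(kubectl version | grep -i server)

# ... master upgrade runs here ...

# After the upgrade: the reported server version must have changed
# (a stricter check would also compare it to the intended target version).
after=$(kubectl version | grep -i server)
if [[ "${before}" == "${after}" ]]; then
  echo "Master version unchanged; the upgrade was effectively a no-op" >&2
  exit 1
fi
```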

@alex-mohr
Contributor

The issue with e2es is that they start from an empty cluster state and return it to an empty cluster state. I think making sure they pass before and after is necessary, but it'd be even better to test more things.

What are the invariants we want to ensure through an upgrade process? Something like:

  • components are actually running the new versions, as @satnam6502 described.
  • pods/services/whatevers created before the upgrade still exist and are running afterwards?
  • others?

If so, that would mean step 2 creates a bunch of pods/services/whatevers, ensures they're in a sane state (accessible via network, running, ...?), and step 4 then verifies they're still accessible, running, etc. Step 4 probably also verifies components are the new version. And maybe also that e.g. pods didn't get restarted or evicted or deleted?

Sorry for the super-high-level vagueness. I think I'm proposing that someone think about what's needed to ensure a good upgrade experience, maybe document it if the existing issues don't capture that, and then start implementing paths to achieve that good experience along with the tests that will verify it holds true across future versions?

Re: v1 explicitly, I'm fine with a cut wherever, though I think we'd get a lot of mileage out of doing tests to verify behavior first and then iteratively improving capabilities over time?
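
One concrete shape for the step 2 / step 4 pairing described above, again as a hedged kubectl sketch: the namespace is made up, and a real test would compare UIDs and restart counts via structured output, not just names.

```bash
# Step 2 (before the upgrade): snapshot the pods that should survive.
kubectl get pods --namespace=upgrade-test | awk 'NR>1 {print $1}' | sort > /tmp/pods-before.txt

# ... master upgrade runs here ...

# Step 4 (after the upgrade): the same pods should still exist; any drift
# means the upgrade deleted, recreated, or evicted them.
kubectl get pods --namespace=upgrade-test | awk 'NR>1 {print $1}' | sort > /tmp/pods-after.txt
diff /tmp/pods-before.txt /tmp/pods-after.txt
```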

@davidopp
Member

davidopp commented Jun 4, 2015

SGTM. Thanks @satnam6502 for taking over this issue. For reference, @piosz's plan was to take his load test and just add a final step where you create a service and then use it to test connectivity to all of the pods. The load test is having problems, though (and it's probably better not to conflate load testing and correctness testing right now), so it would probably be better if you were to just do something simple from scratch (it could still have the same basic idea, I guess, e.g. create a replication controller and a service, and verify you can connect to the pods through the service).
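
Something along these lines could start as plain kubectl before being promoted into a proper e2e; the names, image, replica count, and exact flags below are illustrative for roughly this era of kubectl, not existing test code.

```bash
# Before the upgrade: a replication controller plus a service in front of it.
kubectl run upgrade-nginx --image=nginx --replicas=3 --port=80
kubectl expose rc upgrade-nginx --port=80 --name=upgrade-nginx-svc

# ... master upgrade runs here ...

# After the upgrade: the service should still list all three pods as
# endpoints, and the RC should still report three running replicas. A fuller
# check would curl the service's cluster IP (from inside the cluster) to
# confirm traffic actually reaches the pods.
kubectl get endpoints upgrade-nginx-svc
kubectl get rc upgrade-nginx
```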

@davidopp davidopp added priority/critical-urgent and removed priority/important-soon labels Jun 8, 2015
@mikedanese mikedanese assigned mikedanese and unassigned satnam6502 Jun 8, 2015
@davidopp
Member

davidopp commented Jun 8, 2015

@mikedanese offered to work on this and @satnam6502 was OK with that (and it seems an especially good idea since @mikedanese just did the etcd-killing e2e test), so reassigning to Mike. Thanks, Mike!

@davidopp davidopp added priority/important-soon and removed priority/critical-urgent labels Jun 10, 2015
@mbforbes
Contributor

Hmm, I don't think this should have been closed. I'm adding my comment from #9517 below.

@mbforbes mbforbes reopened this Jun 16, 2015
@mbforbes
Contributor

The following could come as follow-up PRs / work, but are necessary to close #8081:

  1. [@mbforbes Verify upgrades change node/master software version #10273] Validate that the upgrade actually happened. We want to ensure that the new software is really running; otherwise, as @satnam6502 pointed out in Master upgrade test harness #8081, a no-op would satisfy the "upgrade".
  2. [@mbforbes Node upgrade test #9987] Have the webserver util create a replication controller rather than just a pod. At a minimum, we want to validate the core k8s resources, and I think pod/rc/svc are the minimum. We'd have to make sure the services test still works, and we'd probably want to pull this code out of services and into util.
  3. [@mbforbes Node upgrade test #9987] Do a bit more validation. I think we want to check, e.g., more than just the name of the service. Think of it this way: if you were running a cluster with a bunch of resources, what are all of the properties of your resources that you'd expect to hold after an upgrade? I think we need a bit deeper introspection.
  4. [@mikedanese] Make the script (hack/upgrade-e2e-test.sh) only run stable e2es when it runs e2es before and after the upgrade. We also don't want to run the 'restart' e2e test, as that does another upgrade. This will hopefully be solved soon by having the necessary ginkgo test arg strings checked into github (cc @ixdy), so we can just curl that and use it in the script (a rough sketch of the filtering idea appears at the end of this comment).
  5. [@mikedanese] Set up a Jenkins job to basically run what hack/upgrade-e2e-test.sh does. We'll have to see how this is best done given the current Jenkins script. The answer might be to have multiple "step" jobs for Jenkins that trigger each other.
  6. [@mbforbes] Tweak above code and config until Jenkins really runs green.
  7. [@mbforbes GKE upgrade tests #10133] Add GKE support for the above

The above, in my mind, will close out #8081. The following would be extra credit (after the above are fixed):

  1. Validate that more kubernetes resources (secrets, volumes, persistent volumes, ...?) can be created before an upgrade, (possibly) work during an upgrade, and (definitely) are correct after an upgrade.

I'll also extend the above for nodes.
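
To make item 4 above concrete: the idea is to plumb a skip (or focus) expression through to ginkgo for the pre- and post-upgrade e2e passes. The skip pattern below is made up, and whether the e2e wrapper of this era forwards extra flags is an assumption; --ginkgo.skip itself is a standard ginkgo flag.

```bash
# Illustrative only: run a restricted e2e pass that avoids tests which would
# themselves restart or re-upgrade the cluster.
hack/ginkgo-e2e.sh --ginkgo.skip="Restart|Reboot|Upgrade"
```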

@davidopp
Member

@mbforbes Do you have an ETA on the remaining work for this issue?

@mbforbes
Contributor

I updated the checklist; I can get a PR for the last item above out today.
