
Master upgrade test harness #8081

Closed
roberthbailey opened this issue May 11, 2015 · 29 comments · Fixed by #9517
Comments

@roberthbailey
Contributor

Forked from #3135.

For testing master upgrades, we will need a test harness that does the following (a rough sketch follows at the end of this comment):

  1. Launches a cluster at some known version (e.g. latest release)
  2. Runs some amount of work on the cluster
    • Running e2es would be OK, but e2e tests tend to clean up after themselves, so a better-suited test would be one that keeps running during the upgrade
  3. Upgrades the master to a new version (e.g. latest release running in Jenkins)
  4. Verifies that the cluster is still in a sane state (from the previous tests)
  5. Runs e2e tests against the upgraded cluster.

/cc @zmerlynn @satnam6502
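
A rough shell sketch of the flow above. Everything here is illustrative: the stable.txt pointer matches what the eventual Jenkins job reads (see later in this thread), but the CI pointer, the upgrade script flag, and the workload helpers are assumptions, not an existing implementation.

```bash
#!/usr/bin/env bash
# Hypothetical harness outline for steps 1-5 above; not an existing script.
set -o errexit -o nounset -o pipefail

# 1. Launch a cluster at a known version (e.g. the latest stable release).
OLD_VERSION=$(gsutil cat gs://kubernetes-release/release/stable.txt)
# ...fetch and extract the release tarball for ${OLD_VERSION} (elided)...
cluster/kube-up.sh

# 2. Start a workload that keeps running across the upgrade (placeholder
#    helper; e2e tests clean up after themselves, so a persistent workload
#    is preferable here).
start_persistent_workload

# 3. Upgrade only the master to the candidate version (e.g. latest CI build).
NEW_VERSION=$(gsutil cat gs://kubernetes-release/ci/latest.txt)  # assumed pointer
cluster/gce/upgrade.sh -M "${NEW_VERSION}"                       # flag is illustrative

# 4. Verify the pre-upgrade workload is still healthy (placeholder helper).
verify_persistent_workload

# 5. Run the e2e suite against the upgraded cluster.
hack/ginkgo-e2e.sh
```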

@roberthbailey roberthbailey added area/test, priority/important-soon, area/hosting, area/test-infra labels May 11, 2015
@roberthbailey roberthbailey added this to the v1.0 milestone May 11, 2015
@satnam6502
Contributor

I volunteer a workload based on Elasticsearch.

@roberthbailey
Contributor Author

I forked #8082 to discuss the workload part. This one is for the harness (Jenkins?) itself. Feel free to assign #8082 to yourself.

@davidopp
Member

@ixdy are you available to work on this?

@ghost

ghost commented May 15, 2015

@ixdy is pretty busy right now getting e2es to run on every PR pre-merge, so that we don't keep wasting so much time tracking down regressions. I'd like him to stay focused on that until it's done, hopefully mid next week.

@davidopp
Member

Yeah, sorry I didn't update the issue; I checked with him a couple of days ago and he said the same, so I was leaving it unassigned for now. I am hoping @piosz can work on this but haven't discussed it with him yet.

@ghost

ghost commented May 15, 2015

@saad-ali Is also interested in working on it. I just discussed it with him.

@ghost

ghost commented May 15, 2015

Provisionally assigning to @saad-ali

@ghost ghost assigned saad-ali May 15, 2015
@davidopp
Member

@fgrzadkowski FYI

@zmerlynn
Member

cc @alex-mohr

@zmerlynn
Member

@saad-ali: What's the status here?

@saad-ali
Member

The kubernetes-upgrade-gce test is set up in Jenkins. It currently picks up gs://kubernetes-release/release/stable.txt, deploys it to a cluster, runs E2E to verify stability, then runs the GCE upgrade script to upgrade to latest_ci, and runs another E2E pass to verify the cluster afterwards. The test is not passing yet, but the harness works, so we can close this bug.

@zmerlynn
Member

Where is it failing? If you "upgrade" instead to just stable.txt again, does it pass?
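
The sanity check suggested here, in the same hypothetical shell form as the earlier sketch: make the target equal the source version, so any failure points at the harness or upgrade machinery rather than at the new build.

```bash
# Identity "upgrade": source and target are both the current stable release.
# If this fails, the harness or upgrade script is broken, not the new version.
STABLE=$(gsutil cat gs://kubernetes-release/release/stable.txt)
cluster/gce/upgrade.sh -M "${STABLE}"   # flag is illustrative
hack/ginkgo-e2e.sh
```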

@saad-ali
Member

For v17.1 to v17.1 it fails at: "line 71: set-master-htpasswd: command not found"
This should be fixed in v18.0

@davidopp
Member

davidopp commented Jun 2, 2015

@saad-ali do you think you could add what is described in #8082 (comment), namely "invoke kube-push and run e2e again"?

@saad-ali
Member

saad-ali commented Jun 2, 2015

Yep, I'll take a crack at it today.

@saad-ali
Member

saad-ali commented Jun 3, 2015

Done. Added kubernetes-upgrade-gce-step5-kube-push and kubernetes-upgrade-gce-step6-e2e to the Jenkins GCE upgrade test.
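
The two added steps correspond roughly to the following shell form; the script names are the standard repo entry points, used here illustratively rather than quoted from the Jenkins config.

```bash
# kubernetes-upgrade-gce-step5-kube-push: push freshly built binaries onto
# the already-upgraded cluster.
cluster/kube-push.sh

# kubernetes-upgrade-gce-step6-e2e: run e2e again to confirm the cluster is
# still healthy after the push.
hack/ginkgo-e2e.sh
```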

@piosz
Member

piosz commented Jun 3, 2015

Thanks a lot @saad-ali !

@davidopp
Member

davidopp commented Jun 3, 2015

+1, thanks!

@davidopp
Member

davidopp commented Jun 3, 2015

@roberthbailey I think we can close this. If you agree, please close it.

@alex-mohr
Contributor

Thanks for the work here!

My understanding from @saad-ali's comment at #8081 (comment) is that, of the steps in Robby's initial comment, the tests now do 1, 3, and 5, but not 2 or 4?

If the e2es clean up after themselves and we're testing upgrading an empty cluster, are we vulnerable to upgrades breaking existing pods/services/etc. in some way, or restarting all pods?

Or maybe turning this around: if we had these tests for borgmaster rollouts, would you be content? :)

@satnam6502
Contributor

Certainly happy to jump on this on Monday when I return, assuming @thockin is OK with it. I think I asked this question on another thread, but don't we also want to check that the upgrade actually occurred? Otherwise a "NOP" operation would satisfy the test. This seems to suggest that we want something running that spans the lifetime of the two clusters and applies a probe to check for the dy/dx.
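
A minimal sketch of that probe, assuming nothing beyond kubectl: record the server version before the upgrade and insist it changed afterwards, so a no-op cannot pass.

```bash
# Before the upgrade: remember what the master reports.
before=$(kubectl version | grep -i server)

# ... master upgrade runs here ...

# After the upgrade: the reported server version must have changed
# (a stricter check would also compare it to the intended target version).
after=$(kubectl version | grep -i server)
if [[ "${before}" == "${after}" ]]; then
  echo "Master version unchanged; the upgrade was effectively a no-op" >&2
  exit 1
fi
```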

@alex-mohr
Contributor

The issue with e2es is that they start from an empty cluster state and return it to an empty cluster state. I think making sure they pass before and after is necessary, but it'd be even better to test more things.

What are the invariants we want to ensure through an upgrade process? Something like:

  • components are actually running the new versions, as @satnam6502 described.
  • pods/services/whatevers created before the upgrade still exist and are running afterwards?
  • others?

If so, that would mean step 2 creates a bunch of pods/services/whatevers, ensures they're in a sane state (accessible via network, running, ...?), and step 4 then verifies they're still accessible, running, etc. Step 4 probably also verifies components are the new version. And maybe also that e.g. pods didn't get restarted or evicted or deleted?

Sorry for the super-high-level vagueness. I think I'm proposing that someone think about what's needed to ensure a good upgrade experience, maybe document it if the existing issues don't capture that, and then start implementing paths to achieve that good experience along with the tests that will verify it holds true across future versions?

Re: v1 explicitly, I'm fine with a cut wherever, though I think we'd get a lot of mileage out of doing tests to verify behavior first and then iteratively improving capabilities over time?
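
One concrete shape for the step 2 / step 4 pairing described above, again as a hedged kubectl sketch: the namespace is made up, and a real test would compare UIDs and restart counts via structured output, not just names.

```bash
# Step 2 (before the upgrade): snapshot the pods that should survive.
kubectl get pods --namespace=upgrade-test | awk 'NR>1 {print $1}' | sort > /tmp/pods-before.txt

# ... master upgrade runs here ...

# Step 4 (after the upgrade): the same pods should still exist; any drift
# means the upgrade deleted, recreated, or evicted them.
kubectl get pods --namespace=upgrade-test | awk 'NR>1 {print $1}' | sort > /tmp/pods-after.txt
diff /tmp/pods-before.txt /tmp/pods-after.txt
```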

@davidopp
Member

davidopp commented Jun 4, 2015

SGTM. Thanks @satnam6502 for taking over this issue. For reference, @piosz's plan was to take his load test and just add a final step where you create a service and then use it to test connectivity to all of the pods. The load test is having problems, though (and it's probably better not to conflate load testing and correctness testing right now), so it would probably be better if you were to just do something simple from scratch (it could still have the same basic idea, I guess, e.g. create a replication controller and a service, and verify you can connect to the pods through the service).
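
Something along these lines could start as plain kubectl before being promoted into a proper e2e; the names, image, replica count, and exact flags below are illustrative for roughly this era of kubectl, not existing test code.

```bash
# Before the upgrade: a replication controller plus a service in front of it.
kubectl run upgrade-nginx --image=nginx --replicas=3 --port=80
kubectl expose rc upgrade-nginx --port=80 --name=upgrade-nginx-svc

# ... master upgrade runs here ...

# After the upgrade: the service should still list all three pods as
# endpoints, and the RC should still report three running replicas. A fuller
# check would curl the service's cluster IP (from inside the cluster) to
# confirm traffic actually reaches the pods.
kubectl get endpoints upgrade-nginx-svc
kubectl get rc upgrade-nginx
```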

@davidopp davidopp added priority/critical-urgent and removed priority/important-soon labels Jun 8, 2015
@mikedanese mikedanese assigned mikedanese and unassigned satnam6502 Jun 8, 2015
@davidopp
Member

davidopp commented Jun 8, 2015

@mikedanese offered to work on this and @satnam6502 was OK with that (and it seems an especially good idea since @mikedanese just did the etcd-killing e2e test), so reassigning to Mike. Thanks, Mike!

@davidopp davidopp added priority/important-soon and removed priority/critical-urgent labels Jun 10, 2015
@mbforbes
Contributor

Hmm, I don't think this should have been closed. I'm adding my comment from #9517 below.

@mbforbes mbforbes reopened this Jun 16, 2015
@mbforbes
Contributor

The following could come as follow-up PRs / work, but are necessary to close #8081:

  1. [@mbforbes Verify upgrades change node/master software version #10273] Validate that the upgrade actually happened. We want to ensure that the new software is really running; otherwise, as @satnam6502 pointed out in Master upgrade test harness #8081, a no-op would satisfy the "upgrade".
  2. [@mbforbes Node upgrade test #9987] Have the webserver util create a replication controller rather than just a pod. At a minimum, we want to validate the core k8s resources, and I think pod/rc/svc are the minimum. We'd have to make sure the services test still works, and we'd probably want to pull this code out of services and into util.
  3. [@mbforbes Node upgrade test #9987] Do a bit more validation. I think we want to check, e.g., more than just the name of the service. Think of it this way: if you were running a cluster with a bunch of resources, what are all of the properties of your resources that you'd expect to hold after an upgrade? I think we need a bit deeper introspection.
  4. [@mikedanese] Make the script (hack/upgrade-e2e-test.sh) only run stable e2es when it runs e2es before and after the upgrade. We also don't want to run the 'restart' e2e test, as that does another upgrade. This will hopefully be solved soon by having the necessary ginkgo test arg strings checked into github (cc @ixdy), so we can just curl that and use it in the script (a rough sketch of the filtering idea appears at the end of this comment).
  5. [@mikedanese] Set up a Jenkins job to basically run what hack/upgrade-e2e-test.sh does. We'll have to see how this is best done given the current Jenkins script. The answer might be to have multiple "step" jobs for Jenkins that trigger each other.
  6. [@mbforbes] Tweak above code and config until Jenkins really runs green.
  7. [@mbforbes GKE upgrade tests #10133] Add GKE support for the above

The above, in my mind, will close out #8081. The following would be extra credit (after the above are fixed):

  1. Validate that more kubernetes resources (secrets, volumes, persistent volumes, ...?) can be created before an upgrade, (possibly) work during an upgrade, and (definitely) are correct after an upgrade.

I'll also extend the above for nodes.
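
To make item 4 above concrete: the idea is to plumb a skip (or focus) expression through to ginkgo for the pre- and post-upgrade e2e passes. The skip pattern below is made up, and whether the e2e wrapper of this era forwards extra flags is an assumption; --ginkgo.skip itself is a standard ginkgo flag.

```bash
# Illustrative only: run a restricted e2e pass that avoids tests which would
# themselves restart or re-upgrade the cluster.
hack/ginkgo-e2e.sh --ginkgo.skip="Restart|Reboot|Upgrade"
```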

@davidopp
Member

@mbforbes Do you have an ETA on the remaining work for this issue?

@mbforbes
Contributor

I updated the checklist; I can get a PR for the last item above out today.
