test: ensure node upgrades between k8s versions work #7914

Closed
mbforbes opened this issue May 7, 2015 · 13 comments
Assignees
Labels
area/upgrade priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
Milestone

Comments

@mbforbes
Contributor

mbforbes commented May 7, 2015

This is blocked by using only MIG templates #7912 and a mechanism to do updates #6088.

Overview:

  • this needs a separate Jenkins job to run continuously on a different cluster (shouldn't be part of e2es yet because of version-skewed testing)
  • as a minimum bar, this should test only from the latest released version to head
  • as a minimum bar, this should run e2es after the upgrade happens
@mbforbes mbforbes added area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. area/test-infra area/upgrade labels May 7, 2015
@mbforbes mbforbes added this to the v1.0-candidate milestone May 7, 2015
@mbforbes mbforbes mentioned this issue May 7, 2015
10 tasks
@davidopp
Member

davidopp commented May 8, 2015

/subscribe

@davidopp
Member

Just to be sure, this issue is about upgrading API version, whereas #8082 is about upgrading binary version?

@mbforbes
Contributor Author

I was thinking this was about upgrading the binary version of the full cluster. That includes:

  1. upgrade the master
  2. then upgrade the nodes ← the focus of this issue

My impression was that #8082 was about the "master" specifically, whereas this is about the "nodes" because we want to make sure normal workloads / e2es pass. But, because the master is upgraded before the node upgrades happen, this necessarily also includes a master upgrade in its process, so it becomes a "whole cluster upgrade."

Sorry if the wording is unclear (or if I'm confused; @roberthbailey can you confirm this interpretation is correct?). Please feel free to rename stuff.

@roberthbailey
Contributor

I concur with @mbforbes's assessment: We want to be able to test node upgrades. To do this, we either need to first upgrade the master, or somehow start with a cluster that has skew (there isn't currently a way to do this). So I'd consider #8082 either to be blocking this issue, or to be something that can be combined with it into a single test.

@goltermann goltermann modified the milestones: v1.0, v1.0-candidate May 12, 2015
@dchen1107
Member

/sub

@goltermann goltermann changed the title test: ensure upgrades between k8s versions work test: ensure node upgrades between k8s versions work May 12, 2015
@mbforbes
Contributor Author

I'm happy to work on this, but because it's pretty separable, we explicitly split it off in case I'm holed up doing other upgrade work. (In other words, if you're reading this, it's unblocked, I haven't written another comment that I'm actively working on this, and you want to work on it, please feel free to self-assign.)

@mbforbes
Contributor Author

mbforbes commented Jun 9, 2015

OK, here is the plan:

Create a specific test that will be skipped by default that goes through the following process (where "validate" means "ensure the resources exist and function correctly"):

  • create a bunch of resources (rc-backed pods, services, external load balancer); validate
  • upgrade the master; validate
  • upgrade the nodes; validate
  • tear down the resources
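
The steps above can be sketched as follows. This is a minimal illustration in Python, not the actual e2e test code (which is not shown in this thread). Every name here (`FakeCluster`, `run_upgrade_test`, and the methods on the cluster) is hypothetical, and the fake cluster only records versions and resources so the ordering can be exercised without real infrastructure.

```python
from dataclasses import dataclass, field

# Resources named in the plan; the strings are labels only.
EXPECTED_RESOURCES = {"rc-backed pods", "service", "external load balancer"}


@dataclass
class FakeCluster:
    """Hypothetical stand-in for a real cluster (assumption, not a real API)."""
    master_version: str
    node_version: str
    resources: set = field(default_factory=set)
    log: list = field(default_factory=list)

    def create_resources(self):
        self.resources = set(EXPECTED_RESOURCES)
        self.log.append("create")

    def upgrade_master(self, target):
        self.master_version = target
        self.log.append("upgrade master")

    def upgrade_nodes(self, target):
        # The master is upgraded first, as discussed earlier in the thread.
        if self.master_version != target:
            raise RuntimeError("master must be upgraded before the nodes")
        self.node_version = target
        self.log.append("upgrade nodes")

    def validate(self):
        # Only checks existence here; a real test would also check that the
        # resources function correctly.
        if self.resources != EXPECTED_RESOURCES:
            raise AssertionError("resources missing after step: %s" % self.log[-1])
        self.log.append("validate")

    def tear_down(self):
        self.resources.clear()
        self.log.append("tear down")


def run_upgrade_test(cluster, target):
    """Create, validate, upgrade master, validate, upgrade nodes, validate, tear down."""
    cluster.create_resources()
    try:
        cluster.validate()
        cluster.upgrade_master(target)
        cluster.validate()
        cluster.upgrade_nodes(target)
        cluster.validate()
    finally:
        # Tear down even if a validation step fails.
        cluster.tear_down()
    return cluster.log
```

Calling `run_upgrade_test(FakeCluster("1.0", "1.0"), "1.1")` walks the steps in the order listed in the plan and returns the recorded step log.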

Run this test, alone, on Jenkins:

  • as a CI job, upgrading latest stable clusters to latest CI clusters
  • as release validation jobs for all versions X to all newer versions Y. For example, if we have versions 1.0, 1.1, and 1.2:
    • one job runs 1.0 to 1.1
    • one job runs 1.1 to 1.2
    • one job runs 1.0 to 1.2

We'll have to add more release validation jobs as we do more releases; this is a cost we can eat for a while.
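
To see how the number of release validation jobs grows, the "all versions X to all newer versions Y" rule can be enumerated in a few lines of Python. This is an illustrative sketch, not part of any Jenkins configuration; `upgrade_paths` is a hypothetical helper, and it assumes plain dotted numeric version strings such as `1.0`.

```python
from itertools import combinations


def upgrade_paths(versions):
    """Return every (older, newer) pair, i.e. all X -> Y where Y is newer than X."""
    # Sort numerically so that "1.10" counts as newer than "1.9".
    ordered = sorted(set(versions), key=lambda v: tuple(int(p) for p in v.split(".")))
    return list(combinations(ordered, 2))


for old, new in upgrade_paths(["1.0", "1.1", "1.2"]):
    print(f"one job runs {old} to {new}")
```

For n releases this yields n(n-1)/2 jobs: 3 for three releases, 6 for four, 10 for five.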

I'll own the PR to write this test, the GCE project creation and Jenkins setup/config to ensure this test is running.

Follow-up extra credit involves testing more things (secrets, volumes, persistent volumes, ...+?).

A non-goal for this issue is to upgrade the objects themselves (their serialized format on etcd). This is important but I think out of the scope of this issue. (Someone is likely addressing this elsewhere.)

@roberthbailey @alex-mohr let me know if this doesn't sound good.

@davidopp
Member

davidopp commented Jun 9, 2015

Sorry, wrong issue. I meant #8081.

@mbforbes
Contributor Author

mbforbes commented Jun 9, 2015

Thanks for the pointer—seems super related, not just vaguely!

@mbforbes
Contributor Author

As an update, I'm going to do this for the GKE provider first, as upgrade-master() for GCE is currently broken (will file an issue if there isn't one).

@alex-mohr alex-mohr added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Jun 10, 2015
@mbforbes
Contributor Author

Tiny update as I see this is now a P0: I've been working on the e2e test code today and will have it out for review today or tomorrow.

Regarding Jenkins, we've been rethinking how to do upgrade tests so that they involve just two or three Jenkins jobs that can run in parallel rather than six sequential ones; more details in this comment. The Jenkins tests that would close this issue would run in one of the slots outlined there.

Regarding the test extensions I mentioned in #8081, my first PR for this will close out number 2 for sure, possibly more.

@mbforbes
Contributor Author

Now that #9987 is merged, the remaining work for this issue is to

  • support this for GKE (GKE upgrade tests #10133)
  • get all of the Jenkins config sorted out so that we get the [deploy→upgrade master→upgrade nodes→e2e test] workflows green for GCE and GKE.

@mbforbes
Contributor Author

Both GCE and GKE upgrade (including node upgrade) builds are green. The last code thing is the final unfinished item from #8081 (comment):

We don't have any released (stable) versions that include these tests, but once we do, we'll need to continuously create new Jenkins jobs that verify supported release paths are upgrade-able. This should probably be in the GCE or GKE release instructions; I'll own getting those instructions written, but it's separate from this issue.

6 participants