Master upgrade test harness #8081
Comments
I volunteer a workload based on Elasticsearch.
@ixdy are you available to work on this?
@ixdy is pretty busy right now getting e2es to run on every PR pre-merge, so that we don't keep wasting so much time tracking down regressions. I'd like him to stay focused on that until it's done, hopefully mid next week.
Yeah, sorry I didn't update the issue. I checked with him a couple of days ago and he said the same, so I was leaving it unassigned for now. I am hoping @piosz can work on this but haven't discussed it with him yet.
@saad-ali is also interested in working on it. I just discussed it with him.
Provisionally assigning to @saad-ali.
@fgrzadkowski FYI
cc @alex-mohr
@saad-ali: What's the status here?
The
Where is it failing? If you "upgrade" instead to just
For v17.1 to v17.1 it fails at: "line 71: set-master-htpasswd: command not found"
@saad-ali do you think you could add what is described in this comment?
Yep, I'll take a crack at it today.
Done. Added
Thanks a lot @saad-ali!
+1, thanks!
@roberthbailey I think we can close this. If you agree, please close it.
Thanks for the work here! My understanding from @saad-ali's comment at #8081 (comment) is that, of the items in Robby's initial comment, the tests now do 1, 3, and 5, but not 2 or 4? If the e2es clean up after themselves and we're effectively testing the upgrade of an empty cluster, are we vulnerable to upgrades breaking existing pods/services/etc. in some way? Or restarting all pods? Or maybe turning this around: if we had these tests for borgmaster rollouts, would you be content? :)
Certainly happy to jump on this on Monday when I return, assuming @thockin is OK with it. I think I asked this question on another thread, but don't we also want to check that the upgrade actually occurred? Otherwise the "NOP" operation would satisfy the test. This seems to suggest that we want something running that spans the lifetime of the two clusters and applies a probe to check for the dy/dx.
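As an aside on what such a probe might look like: below is a minimal standalone sketch (not the actual harness) that polls the apiserver's /version endpoint before, during, and after the roll. A no-op "upgrade" would fail the final version comparison, and failed probes during the loop give a rough measure of master downtime. The master address, poll interval, and duration are placeholder assumptions.

```go
// Hypothetical probe that spans the upgrade: record the version before,
// poll for availability during the roll, and confirm the version changed.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type versionInfo struct {
	GitVersion string `json:"gitVersion"`
}

// probeVersion fetches the apiserver's reported git version from /version.
func probeVersion(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var v versionInfo
	if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
		return "", err
	}
	return v.GitVersion, nil
}

func main() {
	const url = "http://MASTER_IP:8080/version" // placeholder master address/port

	before, err := probeVersion(url)
	if err != nil {
		fmt.Println("could not reach master before upgrade:", err)
		return
	}

	// Poll for roughly ten minutes; each failed probe counts as downtime.
	failures := 0
	for i := 0; i < 60; i++ {
		if _, err := probeVersion(url); err != nil {
			failures++
		}
		time.Sleep(10 * time.Second)
	}

	after, _ := probeVersion(url)
	fmt.Printf("before=%s after=%s failedProbes=%d\n", before, after, failures)
	if before == after {
		fmt.Println("FAIL: reported version did not change; the upgrade may have been a no-op")
	}
}
```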
The issue with e2es is that they start from an empty cluster state and return it to an empty cluster state. I think making sure they pass before and after is necessary, but it'd be even better to test more things. What are the invariants we want to ensure through an upgrade process? Something like:
If so, that would mean step 2 creates a bunch of pods/services/whatevers and ensures they're in a sane state (accessible via network, running, ...?), and step 4 then verifies they're still accessible, running, etc. Step 4 probably also verifies components are the new version, and maybe also that e.g. pods didn't get restarted or evicted or deleted? Sorry for the super-high-level vagueness. I think I'm proposing that someone think about what's needed to ensure a good upgrade experience, maybe document it if the existing issues don't capture that, and then start implementing paths to achieve that good experience along with the tests that will verify it holds true across future versions. Re: v1 explicitly, I'm fine with a cut wherever, though I think we'd get a lot of mileage out of doing tests to verify behavior first and then iteratively improving capabilities over time.
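To make the step 4 idea a bit more concrete, here is a rough sketch of the kind of checks it could run, written against today's client-go purely for illustration (the e2e code of this era used the older in-tree client). The kubeconfig path, the "upgrade-test" namespace, and the target version are hypothetical placeholders.

```go
// Hypothetical post-upgrade verification: pods created before the upgrade
// should still be Running without restarts, and the master should report
// the target version.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// The workload created in step 2 should have survived the master roll.
	pods, err := client.CoreV1().Pods("upgrade-test").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning {
			fmt.Printf("FAIL: pod %s is %s, not Running\n", p.Name, p.Status.Phase)
		}
		for _, cs := range p.Status.ContainerStatuses {
			if cs.RestartCount > 0 {
				fmt.Printf("FAIL: pod %s restarted %d times\n", p.Name, cs.RestartCount)
			}
		}
	}

	// The master should now be the version we upgraded to.
	v, err := client.Discovery().ServerVersion()
	if err != nil {
		panic(err)
	}
	const wantVersion = "vX.Y.Z" // placeholder target version
	if v.GitVersion != wantVersion {
		fmt.Printf("FAIL: master reports %s, want %s\n", v.GitVersion, wantVersion)
	}
}
```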
SGTM. Thanks @satnam6502 for taking over this issue. For reference, @piosz's plan was to take his load test and just add a final step where you create a service and then use it to test connectivity to all of the pods. The load test is having problems, though (and it's probably better not to conflate load testing and correctness testing right now), so it would probably be better if you were to just do something simple from scratch (it could still have the same basic idea, I guess -- e.g. create a replication controller and a service, verify you can connect to the pods through the service).
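The "connect to the pods through the service" check could be as simple as the following sketch. It assumes the replication controller's pods run something like the serve_hostname test image (each pod answers HTTP with its own name) and that the service is exposed on a known node port; the address, port, and pod count are illustrative assumptions.

```go
// Hypothetical connectivity check: hit the service repeatedly and make sure
// every backend pod (identified by the hostname it serves) is reachable.
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	const (
		serviceURL   = "http://NODE_IP:30123" // placeholder node address + NodePort
		expectedPods = 3                      // placeholder replica count
		attempts     = 50
	)

	seen := map[string]bool{}
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(serviceURL)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		seen[string(body)] = true // each pod returns a distinct hostname
		time.Sleep(200 * time.Millisecond)
	}

	if len(seen) < expectedPods {
		fmt.Printf("FAIL: only reached %d of %d pods through the service\n", len(seen), expectedPods)
		return
	}
	fmt.Printf("OK: reached all %d pods via the service\n", expectedPods)
}
```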
@mikedanese offered to work on this and @satnam6502 was OK with that (and it seems an especially good idea since @mikedanese just did the etcd-killing e2e test), so reassigning to Mike. Thanks, Mike!
Hmm, I don't think this should have been closed. I'm adding my comment from #9517 below.
The following could come as follow-up PRs / work, but are necessary to close #8081:
The above, in my mind, will close out #8081. The following would be extra credit (after the above are fixed):
I'll also extend the above for nodes.
@mbforbes Do you have an ETA on the remaining work for this issue?
I updated the checklist; I can get a PR for the last item above out today.
Forked from #3135.
For testing master upgrades, we will need a test harness that:
/cc @zmerlynn @satnam6502