-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add etcd 3.x minor version rollback support to migrate-if-needed.sh #59298
Add etcd 3.x minor version rollback support to migrate-if-needed.sh #59298
Conversation
b2718cc
to
5343343
Compare
e93291e
to
83a4eaa
Compare
@@ -227,7 +262,7 @@ for step in ${SUPPORTED_VERSIONS}; do | |||
echo "Starting etcd ${step} in v3 mode failed" | |||
exit 1 | |||
fi | |||
${ETCDCTL_CMD} rm --recursive "${ETCD_DATA_PREFIX}" | |||
${ETCDCTL_CMD} --endpoints "http://127.0.0.1:${ETCD_PORT}" rm --recursive "${ETCD_DATA_PREFIX}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wojtek-t FYI. We found this when adding tests. It looks like this wasn't getting run correctly since the --endpoints
flag was not being set to the etcd ports etcd_start
starts etcd on. Maybe we should adjust the logic to check for the data in all 3.x migrations and delete it if it is found to make sure any clusters that have already upgraded to 3.1 get cleaned up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. But let's do this in a separate PR.
7f0cfbd
to
49047a4
Compare
} | ||
|
||
# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed. | ||
rollback_to_etcd2() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Body of this function is unchanged. The code was moved from inline script (below) to this function.
@@ -0,0 +1,68 @@ | |||
#!/bin/sh | |||
|
|||
# Copyright 2016 The Kubernetes Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved these functions to their own source file so they can be leveraged by the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better with tests and, yes, functions. Thanks!
cluster/images/etcd/Makefile
Outdated
cd $(TEMP_DIR)/rollback-etcd2 | ||
|
||
@echo "Starting $(ETCD2_ROLLBACK_NEW_TAG) etcd and writing some sample data." | ||
docker run -ti -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we spell out the docker flags? e.g. --tty
, --interactive
? Maybe I just haven't run docker in too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
cluster/images/etcd/Makefile
Outdated
test: test-rollback test-rollback-etcd2 test-migrate | ||
|
||
all: build test | ||
.PHONY: build push test-rollback test-rollback-etcd2 test-migrate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test is phony too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
} | ||
|
||
# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed. | ||
rollback_to_etcd2() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functions help. For clarity, can we move their definitions up to the top? It makes it easier to see what the flow control of the main script is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
@@ -163,6 +114,90 @@ if [ "${CURRENT_VERSION}" = "2.2.1" -a "${CURRENT_VERSION}" != "${TARGET_VERSION | |||
echo "Backup done in ${BACKUP_DIR}" | |||
fi | |||
|
|||
# Rollback to previous minor version of etcd 3.x, if needed. | |||
# This approach has only been tested with 3.2 and earlier. If newer versions of etcd support this | |||
# downgrade approach, please update the "${CURRENT_MINOR_VERSION} -le ..." check here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this "-le" check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I've removed this comment. We now have the tests, which ensures the rollback flow won't get broken without being noticed.
49047a4
to
c672f1b
Compare
cluster/images/etcd/Makefile
Outdated
-e "TARGET_VERSION=$(ETCD2_ROLLBACK_NEW_TAG)" \ | ||
-e "DATA_DIRECTORY=/var/etcd/data" \ | ||
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \ | ||
'/usr/local/bin/migrate-if-needed.sh && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this one is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's needed here to write version.txt
.
cluster/images/etcd/Makefile
Outdated
ETCDCTL_API=3 /usr/local/bin/etcdctl-$(ETCD2_ROLLBACK_NEW_TAG) --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \ | ||
stop_etcd' | ||
|
||
docker run --tty --interactive -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this one with the previous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Done.
cluster/images/etcd/Makefile
Outdated
source /usr/local/bin/start-stop-etcd.sh && \ | ||
START_STORAGE=etcd3 START_VERSION=$(REGISTRY_TAG) start_etcd && \ | ||
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \ | ||
stop_etcd' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we don't need to write version.txt file here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
migrate-if-needed.sh
writes it.
docker run --tty --interactive -v $(TEMP_DIR)/rollback-test:/var/etcd \ | ||
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \ | ||
'[ $$(cat /var/etcd/data/version.txt) = $(ROLLBACK_REGISTRY_TAG)/etcd3 ] && \ | ||
grep -q value1 /var/etcd/keyspace.txt' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two rollback tests seem pretty similar - would it be possible somehow share the code between those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked into it briefly and the main reason I kept them separate is because we will be deleting the etcd support later this year, and by keeping the tests separate we make the deletion trivial.
cluster/images/etcd/Makefile
Outdated
-e "TARGET_VERSION=$${tag}" \ | ||
-e "DATA_DIRECTORY=/var/etcd/data" \ | ||
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \ | ||
"/usr/local/bin/migrate-if-needed.sh && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems unneeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to write an initial version.txt
cluster/images/etcd/Makefile
Outdated
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \ | ||
"/usr/local/bin/migrate-if-needed.sh && \ | ||
source /usr/local/bin/start-stop-etcd.sh && \ | ||
START_STORAGE=etcd2 START_VERSION=$${tag} start_etcd && \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is always etcd v2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. This is incorrect. Fixing
ETCDCTL_CMD="/usr/local/bin/etcdctl-${TARGET_VERSION}" | ||
NAME="etcd-$(hostname)" | ||
ETCDCTL_API=3 ${ETCDCTL_CMD} snapshot restore "${SNAPSHOT_FILE}" \ | ||
--data-dir "${DATA_DIRECTORY}" --name "${NAME}" --initial-cluster "${NAME}=http://localhost:2380" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This initial cluster will work only in non-HA mode, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. After this restore, the etcd cluster is then started using the --initial-cluster {{ etcd_cluster }}
setting provided in the etcd.manifest
. Looks like we also need to pass that setting in to migrate-if-needed.sh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
9b336f7
to
5fb7ecf
Compare
Test failure is a parse error in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve no-issue
I will let @mml to also take a look.
@@ -227,7 +262,7 @@ for step in ${SUPPORTED_VERSIONS}; do | |||
echo "Starting etcd ${step} in v3 mode failed" | |||
exit 1 | |||
fi | |||
${ETCDCTL_CMD} rm --recursive "${ETCD_DATA_PREFIX}" | |||
${ETCDCTL_CMD} --endpoints "http://127.0.0.1:${ETCD_PORT}" rm --recursive "${ETCD_DATA_PREFIX}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. But let's do this in a separate PR.
@wojtek-t For splitting |
I didn't mean splitting this change to a separate PR. I meant splitting this thing:
|
@wojtek-t Ah, yes. Totally agree. Separate PR for that. |
/lgtm Might want a followup issue for making sure these tests get run by CI. Thanks again. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jpbetz, mml, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these OWNERS Files:
Approvers can indicate their approval by writing |
5fb7ecf
to
746e247
Compare
Clean rebase, reapplying lgtm |
/test all [submit-queue is verifying that this PR is safe to merge] |
@jpbetz: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Automatic merge from submit-queue (batch tested with PRs 59298, 59773, 59772). If you want to cherry-pick this change to another branch, please follow the instructions here. |
…-origin-release-1.9 Automatic merge from submit-queue. Automated cherry pick of #59298: Add etcd 3.x minor version rollback support to Cherry pick of #59298 on release-1.9. #59298: Add etcd 3.x minor version rollback support to ```release-note Add automatic etcd 3.2->3.1 and 3.1->3.0 minor version rollback support to gcr.io/google_container/etcd images. For HA clusters, all members must be stopped before performing a rollback. ```
Provide automatic etcd 3.x minor version downgrade when using the gcr.io/google_containers/etcd docker images to operate etcd.
Uses
etcdctl snapshot save
andetcdctl snapshot restore
to safely downgrade etcd from 3.2->3.1 or 3.1->3.0. This is safe because the data storage file formats used by etcd have not changed between these versions.Intended as a stop-gap until we can introduce more comprehensive downgrade support in etcd. The main limitation of this approach is that it is not able to perform zero downtime downgrades for HA clusters. For HA clusters, all members must be stopped and downgraded before the cluster may be restarted at the downgraded version.
Example usage:
Note that while this will rollback to an earlier etcd version, the newer etcd gcr.io image version must continue to be used throughout the downgrade. Only TARGET_VERSION is downgraded.
Test coverage was lacking for
migrate-if-needed.sh
so this adds some container level testing to theMakefile
for migrating and rolling back. This surfaced a couple bugs that are fixed by this PR as well.cc @mml @lavalamp @wenjiaswe