Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add etcd 3.x minor version rollback support to migrate-if-needed.sh #59298

Merged

Conversation

jpbetz
Copy link
Contributor

@jpbetz jpbetz commented Feb 3, 2018

Provide automatic etcd 3.x minor version downgrade when using the gcr.io/google_containers/etcd docker images to operate etcd.

Uses etcdctl snapshot save and etcdctl snapshot restore to safely downgrade etcd from 3.2->3.1 or 3.1->3.0. This is safe because the data storage file formats used by etcd have not changed between these versions.

Intended as a stop-gap until we can introduce more comprehensive downgrade support in etcd. The main limitation of this approach is that it is not able to perform zero downtime downgrades for HA clusters. For HA clusters, all members must be stopped and downgraded before the cluster may be restarted at the downgraded version.

Example usage:

  • Initially the etcd.manifest is set to gcr.io/google_containers/etcd:3.0.17, TARGET_VERSION=3.0.17
  • A upgrade to 3.1.11 is initiated.
  • etcd.manifest is updated to gcr.io/google_containers/etcd:3.1.11, TARGET_VERSION=3.1.11
  • etcd restarts and establishes 3.1 as it's "cluster version"
  • For whatever reason, a downgrade is initiated
  • etcd.manifest is updated gcr.io/google_containers/etcd:3.1.11, TARGET_VERSION=3.0.17
  • migrate-if-needed.sh detects that the current version (3.1.11) is newer than the target version, so it:
    • creates a snapshot using etcd & etcdctl 3.1.11
    • backs up the data dir
    • restores the snapshot using etcdctl 3.0.17 to create a replacement data dir
    • starts etcd 3.0.17

Note that while this will rollback to an earlier etcd version, the newer etcd gcr.io image version must continue to be used throughout the downgrade. Only TARGET_VERSION is downgraded.

Test coverage was lacking for migrate-if-needed.sh so this adds some container level testing to the Makefile for migrating and rolling back. This surfaced a couple bugs that are fixed by this PR as well.

cc @mml @lavalamp @wenjiaswe

Add automatic etcd 3.2->3.1 and 3.1->3.0 minor version rollback support to gcr.io/google_container/etcd images. For HA clusters, all members must be stopped before performing a rollback.

@jpbetz jpbetz added this to the v1.10 milestone Feb 3, 2018
@jpbetz jpbetz self-assigned this Feb 3, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 3, 2018
@k8s-ci-robot k8s-ci-robot requested review from eparis and jbeda February 3, 2018 01:50
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 3, 2018
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2018
@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch from b2718cc to 5343343 Compare February 3, 2018 07:20
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 3, 2018
@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch 2 times, most recently from e93291e to 83a4eaa Compare February 5, 2018 20:53
@@ -227,7 +262,7 @@ for step in ${SUPPORTED_VERSIONS}; do
echo "Starting etcd ${step} in v3 mode failed"
exit 1
fi
${ETCDCTL_CMD} rm --recursive "${ETCD_DATA_PREFIX}"
${ETCDCTL_CMD} --endpoints "http://127.0.0.1:${ETCD_PORT}" rm --recursive "${ETCD_DATA_PREFIX}"
Copy link
Contributor Author

@jpbetz jpbetz Feb 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wojtek-t FYI. We found this when adding tests. It looks like this wasn't getting run correctly since the --endpoints flag was not being set to the etcd ports etcd_start starts etcd on. Maybe we should adjust the logic to check for the data in all 3.x migrations and delete it if it is found to make sure any clusters that have already upgraded to 3.1 get cleaned up?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. But let's do this in a separate PR.

@wojtek-t wojtek-t self-assigned this Feb 5, 2018
@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch 2 times, most recently from 7f0cfbd to 49047a4 Compare February 6, 2018 17:36
}

# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed.
rollback_to_etcd2() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Body of this function is unchanged. The code was moved from inline script (below) to this function.

@@ -0,0 +1,68 @@
#!/bin/sh

# Copyright 2016 The Kubernetes Authors.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved these functions to their own source file so they can be leveraged by the tests.

Copy link
Contributor

@mml mml left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better with tests and, yes, functions. Thanks!

cd $(TEMP_DIR)/rollback-etcd2

@echo "Starting $(ETCD2_ROLLBACK_NEW_TAG) etcd and writing some sample data."
docker run -ti -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we spell out the docker flags? e.g. --tty, --interactive? Maybe I just haven't run docker in too long.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

test: test-rollback test-rollback-etcd2 test-migrate

all: build test
.PHONY: build push test-rollback test-rollback-etcd2 test-migrate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test is phony too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}

# Rollback from "3.0.x" version in 'etcd3' mode to "2.2.1" version in 'etcd2' mode, if needed.
rollback_to_etcd2() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions help. For clarity, can we move their definitions up to the top? It makes it easier to see what the flow control of the main script is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

@@ -163,6 +114,90 @@ if [ "${CURRENT_VERSION}" = "2.2.1" -a "${CURRENT_VERSION}" != "${TARGET_VERSION
echo "Backup done in ${BACKUP_DIR}"
fi

# Rollback to previous minor version of etcd 3.x, if needed.
# This approach has only been tested with 3.2 and earlier. If newer versions of etcd support this
# downgrade approach, please update the "${CURRENT_MINOR_VERSION} -le ..." check here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this "-le" check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I've removed this comment. We now have the tests, which ensures the rollback flow won't get broken without being noticed.

@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch from 49047a4 to c672f1b Compare February 7, 2018 00:50
-e "TARGET_VERSION=$(ETCD2_ROLLBACK_NEW_TAG)" \
-e "DATA_DIRECTORY=/var/etcd/data" \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
'/usr/local/bin/migrate-if-needed.sh && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this one is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed here to write version.txt.

ETCDCTL_API=3 /usr/local/bin/etcdctl-$(ETCD2_ROLLBACK_NEW_TAG) --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \
stop_etcd'

docker run --tty --interactive -v $(TEMP_DIR)/rollback-etcd2:/var/etcd \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this one with the previous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Done.

source /usr/local/bin/start-stop-etcd.sh && \
START_STORAGE=etcd3 START_VERSION=$(REGISTRY_TAG) start_etcd && \
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints http://127.0.0.1:$${ETCD_PORT} put /registry/k1 value1 && \
stop_etcd'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't need to write version.txt file here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

migrate-if-needed.sh writes it.

docker run --tty --interactive -v $(TEMP_DIR)/rollback-test:/var/etcd \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
'[ $$(cat /var/etcd/data/version.txt) = $(ROLLBACK_REGISTRY_TAG)/etcd3 ] && \
grep -q value1 /var/etcd/keyspace.txt'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two rollback tests seem pretty similar - would it be possible somehow share the code between those?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into it briefly and the main reason I kept them separate is because we will be deleting the etcd support later this year, and by keeping the tests separate we make the deletion trivial.

-e "TARGET_VERSION=$${tag}" \
-e "DATA_DIRECTORY=/var/etcd/data" \
gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
"/usr/local/bin/migrate-if-needed.sh && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems unneeded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to write an initial version.txt

gcr.io/google_containers/etcd-$(ARCH):$(REGISTRY_TAG) /bin/sh -c \
"/usr/local/bin/migrate-if-needed.sh && \
source /usr/local/bin/start-stop-etcd.sh && \
START_STORAGE=etcd2 START_VERSION=$${tag} start_etcd && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this is always etcd v2?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This is incorrect. Fixing

ETCDCTL_CMD="/usr/local/bin/etcdctl-${TARGET_VERSION}"
NAME="etcd-$(hostname)"
ETCDCTL_API=3 ${ETCDCTL_CMD} snapshot restore "${SNAPSHOT_FILE}" \
--data-dir "${DATA_DIRECTORY}" --name "${NAME}" --initial-cluster "${NAME}=http://localhost:2380"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This initial cluster will work only in non-HA mode, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. After this restore, the etcd cluster is then started using the --initial-cluster {{ etcd_cluster }} setting provided in the etcd.manifest. Looks like we also need to pass that setting in to migrate-if-needed.sh.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch 4 times, most recently from 9b336f7 to 5fb7ecf Compare February 7, 2018 23:24
@jpbetz
Copy link
Contributor Author

jpbetz commented Feb 7, 2018

Test failure is a parse error in etcd-empty-dir-cleanup.yaml which was not touched by this PR. Rebasing.

@jpbetz
Copy link
Contributor Author

jpbetz commented Feb 8, 2018

@mml @wojtek-t All feedback has been applied. PTAL.

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve no-issue

I will let @mml to also take a look.

@@ -227,7 +262,7 @@ for step in ${SUPPORTED_VERSIONS}; do
echo "Starting etcd ${step} in v3 mode failed"
exit 1
fi
${ETCDCTL_CMD} rm --recursive "${ETCD_DATA_PREFIX}"
${ETCDCTL_CMD} --endpoints "http://127.0.0.1:${ETCD_PORT}" rm --recursive "${ETCD_DATA_PREFIX}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. But let's do this in a separate PR.

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 9, 2018
@jpbetz
Copy link
Contributor Author

jpbetz commented Feb 9, 2018

@wojtek-t For splitting --endpoints "http://127.0.0.1:${ETCD_PORT}" out into a separate PR- WIthout the change the newly added tests fail and splitting them out into a separate PR (or sequencing that PR to go first) is quite a bit of extra work. Okay if instead we just open an issue explaining the bug that this fixes and note that this PR fixes that issue?

@wojtek-t
Copy link
Member

wojtek-t commented Feb 9, 2018

@wojtek-t For splitting --endpoints "http://127.0.0.1:${ETCD_PORT}" out into a separate PR- WIthout the change the newly added tests fail and splitting them out into a separate PR (or sequencing that PR to go first) is quite a bit of extra work. Okay if instead we just open an issue explaining the bug that this fixes and note that this PR fixes that issue?

I didn't mean splitting this change to a separate PR. I meant splitting this thing:

Maybe we should adjust the logic to check for the data in all 3.x migrations and delete it if it is found to make sure any clusters that have already upgraded to 3.1 get cleaned up?

@jpbetz
Copy link
Contributor Author

jpbetz commented Feb 9, 2018

@wojtek-t Ah, yes. Totally agree. Separate PR for that.

@mml
Copy link
Contributor

mml commented Feb 12, 2018

/lgtm

Might want a followup issue for making sure these tests get run by CI.

Thanks again.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpbetz, mml, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these OWNERS Files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpbetz jpbetz force-pushed the etcd3-minor-version-rollback branch from 5fb7ecf to 746e247 Compare February 12, 2018 23:02
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 12, 2018
@jpbetz jpbetz added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2018
@jpbetz
Copy link
Contributor Author

jpbetz commented Feb 13, 2018

Clean rebase, reapplying lgtm

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Copy link
Contributor

@jpbetz: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kubeadm-gce-canary 5fb7ecf104498693751be0e729ad6c34d626e817 link /test pull-kubernetes-e2e-kubeadm-gce-canary
pull-kubernetes-e2e-kops-aws 746e247 link /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 59298, 59773, 59772). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit c1216df into kubernetes:master Feb 13, 2018
k8s-github-robot pushed a commit that referenced this pull request Mar 1, 2018
…-origin-release-1.9

Automatic merge from submit-queue.

Automated cherry pick of #59298: Add etcd 3.x minor version rollback support to

Cherry pick of #59298 on release-1.9.

#59298: Add etcd 3.x minor version rollback support to

```release-note
Add automatic etcd 3.2->3.1 and 3.1->3.0 minor version rollback support to gcr.io/google_container/etcd images. For HA clusters, all members must be stopped before performing a rollback.
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants