Service update failure thresholds and rollback #26421

Merged: 2 commits, Oct 18, 2016

Conversation

@aaronlehmann (Contributor) commented Sep 8, 2016

This adds support for two enhancements to swarm service rolling updates:

  • Failure thresholds: In Docker 1.12, a service update could be set up
    to either pause or continue after a single failure occurs. This adds
    an --update-max-failure-ratio flag that controls how many tasks need to
    fail to update for the update as a whole to be considered a failure. A
    counterpart flag, --update-monitor, controls how long to monitor each
    task for a failure after starting it during the update.
  • Rollback flag: service update --rollback reverts the service to its
    previous version. If a service update encounters task failures, or
    fails to function properly for some other reason, the user can roll back
    the update.

SwarmKit also has the ability to roll back updates automatically after
hitting the failure thresholds, but we've decided not to expose this in
the Docker API/CLI for now, favoring a workflow where the decision to
roll back is always made by an admin. Depending on user feedback, we may
add a "rollback" option to --update-failure-action in the future.

SwarmKit PR: moby/swarmkit#1380
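To make the threshold semantics concrete, here is a minimal illustrative Go sketch of the check the flag implies (not SwarmKit's actual updater code; all names are invented for the example):

```go
package main

import "fmt"

// updateConfig mirrors the idea behind --update-max-failure-ratio:
// the update is considered failed once the fraction of failed tasks
// exceeds the configured ratio. Names here are illustrative only.
type updateConfig struct {
	MaxFailureRatio float32 // e.g. 0.1 == tolerate up to 10% failed tasks
}

// exceedsFailureThreshold reports whether the observed failures push the
// update over the configured threshold. Conceptually, each task counts as
// failed if it fails within the --update-monitor window after being started.
func exceedsFailureThreshold(cfg updateConfig, failed, total int) bool {
	if total == 0 {
		return false
	}
	return float32(failed)/float32(total) > cfg.MaxFailureRatio
}

func main() {
	cfg := updateConfig{MaxFailureRatio: 0.1}
	fmt.Println(exceedsFailureThreshold(cfg, 1, 20)) // false: 5% failed, under the threshold
	fmt.Println(exceedsFailureThreshold(cfg, 3, 20)) // true: 15% failed, over the threshold
}
```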

@aaronlehmann force-pushed the update-thresholds-rollbacks branch 2 times, most recently from 23d0359 to 0e0d238 on September 9, 2016 20:20
flagWorkdir = "workdir"
flagRegistryAuth = "with-registry-auth"
flagLogDriver = "log-driver"
flagLogOpt = "log-opt"
Contributor

gofmt working as intended

@dperny (Contributor) commented Sep 13, 2016

FWIW, LGTM. I am not a maintainer though.

@aaronlehmann force-pushed the update-thresholds-rollbacks branch from 0e0d238 to e04ccb3 on September 13, 2016 18:16
flagStopGracePeriod = "stop-grace-period"
flagUpdateDelay = "update-delay"
flagUpdateFailureAction = "update-failure-action"
flagUpdateFailureFraction = "update-failure-fraction"
Member

Not a big fan of --update-failure-fraction overall as I mentioned IRL but unfortunately I don't have a better idea.

@dnephin @vieux @aanand @ehazlett maybe?

Member

I agree, I usually prefer the term "ratio" over "fraction". Some ideas:

  • --update-min-success-ratio
  • --update-max-failure-ratio

Contributor Author

I think --update-max-failure-ratio could work. Ratio is indeed a better description.

@aluzzardi WDYT? Also, if we adopt this terminology, should we update SwarmKit to rename the protobuf field accordingly?

@aluzzardi (Member) left a comment

Overall LGTM - nits on flag names.

@dnephin (Member) left a comment

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

If the --rollback flag accepted a service version, would that prevent this scenario? The user would be informed of the error and could force an update to the specific version they want manually.

} else if rollback {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromPreviousSpec
} else {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromSpec
Member

I think a switch/case would be appropriate here
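For illustration, a self-contained sketch of the suggested switch form; the condition and identifiers for the first branch are not visible in the quoted hunk, so `sendRegistryAuth` and the string constants below are stand-ins rather than the real ones:

```go
package main

import "fmt"

// Stand-ins for the types.RegistryAuthFrom* constants referenced in the hunk.
const (
	registryAuthFromSpec         = "spec"
	registryAuthFromPreviousSpec = "previous-spec"
)

// registryAuthSource shows the if/else-if/else chain rewritten as a switch,
// as suggested in the review. The first case stands in for the branch that
// is not shown in the quoted diff.
func registryAuthSource(sendRegistryAuth, rollback bool) string {
	switch {
	case sendRegistryAuth:
		return "encoded-registry-auth"
	case rollback:
		return registryAuthFromPreviousSpec
	default:
		return registryAuthFromSpec
	}
}

func main() {
	fmt.Println(registryAuthSource(false, true))  // previous-spec
	fmt.Println(registryAuthSource(false, false)) // spec
}
```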

`docker service update`'s `--rollback` flag. This will revert the service
to the configuration that was in place before the most recent
`docker service update` command. Other options can be combined with
`--rollback`; for example, `--update-delay 0s` to execute the rollback without
@dnephin (Member) commented Sep 17, 2016

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

Has that changed? Or is rollback a special case that works differently?

Contributor Author

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

The current implementation allows all options supported by service update. I think update-* options are the most useful options to specify in combination with a rollback. But I am allowing all options for simplicity. If there's a good UX argument for restricting the set of options that can be combined with rollback, we could consider doing that.

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

I don't believe that's the case. When I've tested these options, they seem to take effect immediately. If there are exceptions to this, I don't think it's the intended behavior.

@aaronlehmann (Contributor, Author) commented Sep 19, 2016

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

Yes, that scenario is accurate. However I don't see this as any different from the way docker service update works today. If User A runs docker service update --env-add A=B and immediately afterwards, User B runs docker service update --env-add C=D, the service ends up with both environment variables. We don't have the concept of a "user" in the daemon (no multitenancy), or a way of associating a service version with a particular user, so I'm not sure how this could work differently. I think for now it's a reasonable assumption that if the user asks for a rollback, they want to go to the immediately preceding version.

If the --rollback flag accepted a service version, would that prevent this scenario? The user would be informed of the error and could force an update to the specific version they want manually.

In theory, yes, but I see a few hurdles:

  • We'd need to store multiple historic versions of the service spec. This raises a few resource management issues (how many do you keep?)
  • The UI around this would be pretty involved. Internal version numbers aren't meaningful to the user, so I think we'd need a way to diff service specs and present the differences in a way that's easy to interpret. Then we'd need to design a reasonable CLI workflow for this. (service diff subcommand? Something that shows the incremental evolution of the service from version to version?)
  • Once we get into the business of storing generalized history, it probably doesn't make sense to do this as a one-off for services. Ideally we would extend SwarmKit's store to keep historical versions of all objects. This could definitely come in handy for other types such as tasks (looking at the different versions of a particular task object could serve as an execution log). But this is a big redesign/refactor and @aluzzardi and I feel it's outside the scope of this rollback project.

@aaronlehmann (Contributor, Author)

Rebased and updated in response to feedback. --update-failure-fraction has been renamed to --update-max-failure-ratio.

@dnephin @aluzzardi PTAL

@dnephin (Member) commented Sep 22, 2016

design LGTM

@thaJeztah (Member)

design LGTM

@aaronlehmann (Contributor, Author)

Rebased.

Review ping?

@dnephin (Member) commented Sep 27, 2016

LGTM

@aaronlehmann force-pushed the update-thresholds-rollbacks branch 2 times, most recently from c2b9fd2 to 0f5df94 on October 13, 2016 00:09
@thaJeztah (Member) left a comment

docs changes LGTM

but ping @mstanleyjones if we need to have an example use somewhere in the documentation

@aaronlehmann (Contributor, Author)

User docs here: docker/docs#219

@thaJeztah (Member)

Oh, thanks!

@vdemeester (Member)

arf, @aaronlehmann needs a rebase 👼

@vdemeester (Member) left a comment

One comment/question, but docs LGTM 🐸

- **Delay** – Delay between restart attempts.
- **Attempts** – Maximum attempts to restart a given container before giving up (default value
- **Delay** – Delay between restart attempts, in nanoseconds.
- **MaxAttempts** – Maximum attempts to restart a given container before giving up (default value
Member

Was it wrongly documented, or has this changed?

Contributor Author

It was wrongly documented.
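For context, the nanosecond value presumably comes from the field being a Go `time.Duration`, which is an integer nanosecond count; a tiny illustrative sketch of the mapping (not the daemon's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A 5-second restart delay, expressed as the integer nanosecond value
	// that the documentation now describes for the Delay field.
	delay := 5 * time.Second
	fmt.Println(delay.Nanoseconds()) // 5000000000
}
```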

@@ -41,10 +41,14 @@ Placement:
{{- if .HasUpdateConfig }}
UpdateConfig:
Parallelism: {{ .UpdateParallelism }}
{{- if .HasUpdateDelay -}}
Contributor

Just curious, what does the - do here? What does removing it do?

Contributor Author

The trailing - (in -}}) tells the template engine to remove the whitespace that follows the action, including the newline. That was incorrect here because it ended up merging Delay onto the same line as Parallelism. Notice that the other conditionals don't have a - at the end.

I could have opened another PR for this, but found it easiest to bundle it with the other changes I was making to the template.
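For anyone curious, here is a small standalone example of that trim behavior (a toy template, not the actual one touched by this PR):

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	data := map[string]interface{}{"HasDelay": true}

	// With a trailing "-}}", the newline after the action is trimmed, so
	// "Delay:" is merged onto the same output line as "Parallelism:".
	merged := template.Must(template.New("merged").Parse(
		"Parallelism: 2\n{{- if .HasDelay -}}\nDelay: 10s\n{{- end }}\n"))
	_ = merged.Execute(os.Stdout, data) // prints "Parallelism: 2Delay: 10s"

	// Without it, the newline survives and each field gets its own line.
	separate := template.Must(template.New("separate").Parse(
		"Parallelism: 2\n{{- if .HasDelay }}\nDelay: 10s\n{{- end }}\n"))
	_ = separate.Execute(os.Stdout, data) // prints "Parallelism: 2" then "Delay: 10s"
}
```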

Delay: opts.update.delay,
Monitor: opts.update.monitor,
FailureAction: opts.update.onFailure,
MaxFailureRatio: opts.update.maxFailureRatio,
Contributor

This one is missing onFailure?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann force-pushed the update-thresholds-rollbacks branch from 8b24fdc to 6d4b527 on October 18, 2016 17:09
@aaronlehmann (Contributor, Author)

Rebased.

@thaJeztah (Member) left a comment

LGTM, thanks!

@ralphtheninja (Contributor)

How would a rollback look in terms of using the Docker API?

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Jan 16, 2017
Pull request moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PRs added new flags to docker 1.13 that affect the API:

- moby#23430 added `--cpu-rt-period` and `--cpu-rt-runtime`
- moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby#27702 added `--network` to `docker build`
- moby#25962 added `--attachable` to `docker network create`
- moby#27998 added `--compose-file` to `docker stack deploy`
- moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby#26061 added `--init` to `docker run` and `docker create`
- moby#26941 added `--init-path` to `docker run` and `docker create`
- moby#27958 added `--cpus` on `docker run` / `docker create`
- moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby#27596 added `--force` to `docker service update`
- moby#27857 added `--hostname` to `docker service create`
- moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@darklow commented Mar 6, 2017

@aaronlehmann Having rollback as an option for --update-failure-action would be great. Actually, I thought that was already how it works, so I kept running
`docker service update --rollback=true my_service` and wondering why I saw no rollback parameter in `docker service inspect`.

Using --update-failure-action=rollback would have helped in the following scenario, which I hit recently. I have a service running nginx and I made a mistake in its config file, so when I called docker service update ... the new container failed to start, the old one was stopped, and nothing was listening on port 80; the Amazon ELB health check immediately failed and took down the whole cluster. By the time I realised what had happened, pushed a fix, and the ELB had re-added the instances and resumed service, 1-2 minutes had passed.

Having --update-failure-action=rollback would eliminate this case: because nginx failed, the service would roll back and there would be no two minutes of downtime for our services.

@thaJeztah (Member)

@darklow see #31108, does that cover your use case?

@darklow commented Mar 6, 2017

@thaJeztah very nice, that's a good one, wasn't aware of it yet, thanks! It would totally cover my case.

@darklow commented Mar 6, 2017

@thaJeztah one more question you could maybe help with. Rollback will definitely help, but I wonder why, in my case, the old task was killed even though the new one started unhealthy: the nginx config had an error, so the main command failed to start, yet it still killed the existing task and kept restarting the failing new one.

Would any existing settings help solve this? It would make sense not to kill old service tasks until the new ones are ready; I wonder why that didn't happen. Thank you.

@thaJeztah (Member)

@darklow I think #30261 may cover that. The discussion on #30321 may also be relevant.

andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request May 19, 2017
Upstream-commit: 5d2722f83db9e301c6dcbe1c562c2051a52905db
Component: cli