Service update failure thresholds and rollback #26421

Merged: 2 commits, Oct 18, 2016

Conversation

@aaronlehmann (Contributor) commented Sep 8, 2016

This adds support for two enhancements to swarm service rolling updates:

  • Failure thresholds: In Docker 1.12, a service update could be set up
    to either pause or continue after a single failure occurs. This adds
    an --update-max-failure-ratio flag that controls how many tasks need to
    fail to update for the update as a whole to be considered a failure. A
    counterpart flag, --update-monitor, controls how long to monitor each
    task for a failure after starting it during the update.
  • Rollback flag: service update --rollback reverts the service to its
    previous version. If a service update encounters task failures, or
    fails to function properly for some other reason, the user can roll back
    the update.

SwarmKit also has the ability to roll back updates automatically after
hitting the failure thresholds, but we've decided not to expose this in
the Docker API/CLI for now, favoring a workflow where the decision to
roll back is always made by an admin. Depending on user feedback, we may
add a "rollback" option to --update-failure-action in the future.

SwarmKit PR: moby/swarmkit#1380
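To make the threshold semantics concrete, here is a minimal illustrative Go sketch of the check the flag implies (not SwarmKit's actual updater code; all names are invented for the example):

```go
package main

import "fmt"

// updateConfig mirrors the idea behind --update-max-failure-ratio:
// the update is considered failed once the fraction of failed tasks
// exceeds the configured ratio. Names here are illustrative only.
type updateConfig struct {
	MaxFailureRatio float32 // e.g. 0.1 == tolerate up to 10% failed tasks
}

// exceedsFailureThreshold reports whether the observed failures push the
// update over the configured threshold. Conceptually, each task counts as
// failed if it fails within the --update-monitor window after being started.
func exceedsFailureThreshold(cfg updateConfig, failed, total int) bool {
	if total == 0 {
		return false
	}
	return float32(failed)/float32(total) > cfg.MaxFailureRatio
}

func main() {
	cfg := updateConfig{MaxFailureRatio: 0.1}
	fmt.Println(exceedsFailureThreshold(cfg, 1, 20)) // false: 5% failed, under the threshold
	fmt.Println(exceedsFailureThreshold(cfg, 3, 20)) // true: 15% failed, over the threshold
}
```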

@aaronlehmann force-pushed the update-thresholds-rollbacks branch 2 times, most recently from 23d0359 to 0e0d238 on September 9, 2016 20:20
flagWorkdir = "workdir"
flagRegistryAuth = "with-registry-auth"
flagLogDriver = "log-driver"
flagLogOpt = "log-opt"
Contributor

gofmt working as intended

@dperny (Contributor) commented Sep 13, 2016

FWIW, LGTM. I am not a maintainer though.

@aaronlehmann force-pushed the update-thresholds-rollbacks branch from 0e0d238 to e04ccb3 on September 13, 2016 18:16
flagStopGracePeriod = "stop-grace-period"
flagUpdateDelay = "update-delay"
flagUpdateFailureAction = "update-failure-action"
flagUpdateFailureFraction = "update-failure-fraction"
Member

Not a big fan of --update-failure-fraction overall as I mentioned IRL but unfortunately I don't have a better idea.

@dnephin @vieux @aanand @ehazlett maybe?

Member

I agree, I usually prefer the term "ratio" over "fraction". Some ideas:

  • --update-min-success-ratio
  • --update-max-failure-ratio

Contributor Author

I think --update-max-failure-ratio could work. Ratio is indeed a better description.

@aluzzardi WDYT? Also, if we adopt this terminology, should we update SwarmKit to rename the protobuf field accordingly?

@aluzzardi (Member) left a comment

Overall LGTM - nits on flag names.

@dnephin (Member) left a comment

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

If the --rollback flag accepted a service version, would that prevent this scenario? The user would be informed of the error and could force an update to the specific version they want manually.

} else if rollback {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromPreviousSpec
} else {
updateOpts.RegistryAuthFrom = types.RegistryAuthFromSpec
Member

I think a switch/case would be appropriate here
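For illustration, a self-contained sketch of the suggested switch form; the condition and identifiers for the first branch are not visible in the quoted hunk, so `sendRegistryAuth` and the string constants below are stand-ins rather than the real ones:

```go
package main

import "fmt"

// Stand-ins for the types.RegistryAuthFrom* constants referenced in the hunk.
const (
	registryAuthFromSpec         = "spec"
	registryAuthFromPreviousSpec = "previous-spec"
)

// registryAuthSource shows the if/else-if/else chain rewritten as a switch,
// as suggested in the review. The first case stands in for the branch that
// is not shown in the quoted diff.
func registryAuthSource(sendRegistryAuth, rollback bool) string {
	switch {
	case sendRegistryAuth:
		return "encoded-registry-auth"
	case rollback:
		return registryAuthFromPreviousSpec
	default:
		return registryAuthFromSpec
	}
}

func main() {
	fmt.Println(registryAuthSource(false, true))  // previous-spec
	fmt.Println(registryAuthSource(false, false)) // spec
}
```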

`docker service update`'s `--rollback` flag. This will revert the service
to the configuration that was in place before the most recent
`docker service update` command. Other options can be combined with
`--rollback`; for example, `--update-delay 0s` to execute the rollback without
@dnephin (Member) commented Sep 17, 2016

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

Has that changed? Or is rollback a special case that works differently?

Contributor Author

When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?

The current implementation allows all options supported by service update. I think update-* options are the most useful options to specify in combination with a rollback. But I am allowing all options for simplicity. If there's a good UX argument for restricting the set of options that can be combined with rollback, we could consider doing that.

My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.

I don't believe that's the case. When I've tested these options, they seem to take effect immediately. If there are exceptions to this, I don't think it's the intended behavior.

@aaronlehmann (Contributor, Author) commented Sep 19, 2016

How does rollback behave with respect to service version? I'm thinking of something like this:

The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1

Yes, that scenario is accurate. However I don't see this as any different from the way docker service update works today. If User A runs docker service update --env-add A=B and immediately afterwards, User B runs docker service update --env-add C=D, the service ends up with both environment variables. We don't have the concept of a "user" in the daemon (no multitenancy), or a way of associating a service version with a particular user, so I'm not sure how this could work differently. I think for now it's a reasonable assumption that if the user asks for a rollback, they want to go to the immediately preceding version.

If the --rollback flag accepted a service version, would that prevent this scenario? The user would be informed of the error and could force an update to the specific version they want manually.

In theory, yes, but I see a few hurdles:

  • We'd need to store multiple historic versions of the service spec. This raises a few resource management issues (how many do you keep?)
  • The UI around this would be pretty involved. Internal version numbers aren't meaningful to the user, so I think we'd need a way to diff service specs and present the differences in a way that's easy to interpret. Then we'd need to design a reasonable CLI workflow for this. (service diff subcommand? Something that shows the incremental evolution of the service from version to version?)
  • Once we get into the business of storing generalized history, it probably doesn't make sense to do this as a one-off for services. Ideally we would extend SwarmKit's store to keep historical versions of all objects. This could definitely come in handy for other types such as tasks (looking at the different versions of a particular task object could serve as an execution log). But this is a big redesign/refactor and @aluzzardi and I feel it's outside the scope of this rollback project.

@aaronlehmann (Contributor, Author)

Rebased and updated in response to feedback. --update-failure-fraction has been renamed to --update-max-failure-ratio.

@dnephin @aluzzardi PTAL

@dnephin (Member) commented Sep 22, 2016

design LGTM

@thaJeztah (Member)

design LGTM

@aaronlehmann (Contributor, Author)

Rebased.

Review ping?

@dnephin (Member) commented Sep 27, 2016

LGTM

@aaronlehmann force-pushed the update-thresholds-rollbacks branch 2 times, most recently from c2b9fd2 to 0f5df94 on October 13, 2016 00:09
@thaJeztah (Member) left a comment

docs changes LGTM

but ping @mstanleyjones if we need to have an example use somewhere in the documentation

@aaronlehmann (Contributor, Author)

User docs here: docker/docs#219

@thaJeztah (Member)

Oh, thanks!

@vdemeester (Member)

arf, @aaronlehmann needs a rebase 👼

@vdemeester (Member) left a comment

One comment/question, but docs LGTM 🐸

- **Delay** – Delay between restart attempts.
- **Attempts** – Maximum attempts to restart a given container before giving up (default value
- **Delay** – Delay between restart attempts, in nanoseconds.
- **MaxAttempts** – Maximum attempts to restart a given container before giving up (default value
Member

Was it wrongly documented, or has this changed?

Contributor Author

It was wrongly documented.
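For context, the nanosecond value presumably comes from the field being a Go `time.Duration`, which is an integer nanosecond count; a tiny illustrative sketch of the mapping (not the daemon's code):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A 5-second restart delay, expressed as the integer nanosecond value
	// that the documentation now describes for the Delay field.
	delay := 5 * time.Second
	fmt.Println(delay.Nanoseconds()) // 5000000000
}
```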

@@ -41,10 +41,14 @@ Placement:
{{- if .HasUpdateConfig }}
UpdateConfig:
Parallelism: {{ .UpdateParallelism }}
{{- if .HasUpdateDelay -}}
Contributor

Just curious, what does the - do here? What does removing it do?

Contributor Author

The trailing - (in -}}) tells the template engine to remove the whitespace that follows the action, including the newline. That was incorrect here because it ended up merging Delay onto the same line as Parallelism. Notice that the other conditionals don't have a - at the end.

I could have opened another PR for this, but found it easiest to bundle it with the other changes I was making to the template.
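For anyone curious, here is a small standalone example of that trim behavior (a toy template, not the actual one touched by this PR):

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	data := map[string]interface{}{"HasDelay": true}

	// With a trailing "-}}", the newline after the action is trimmed, so
	// "Delay:" is merged onto the same output line as "Parallelism:".
	merged := template.Must(template.New("merged").Parse(
		"Parallelism: 2\n{{- if .HasDelay -}}\nDelay: 10s\n{{- end }}\n"))
	_ = merged.Execute(os.Stdout, data) // prints "Parallelism: 2Delay: 10s"

	// Without it, the newline survives and each field gets its own line.
	separate := template.Must(template.New("separate").Parse(
		"Parallelism: 2\n{{- if .HasDelay }}\nDelay: 10s\n{{- end }}\n"))
	_ = separate.Execute(os.Stdout, data) // prints "Parallelism: 2" then "Delay: 10s"
}
```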

Delay: opts.update.delay,
Monitor: opts.update.monitor,
FailureAction: opts.update.onFailure,
MaxFailureRatio: opts.update.maxFailureRatio,
Contributor

This one is missing onFailure?

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann force-pushed the update-thresholds-rollbacks branch from 8b24fdc to 6d4b527 on October 18, 2016 17:09
@aaronlehmann (Contributor, Author)

Rebased.

@thaJeztah (Member) left a comment

LGTM, thanks!

@ralphtheninja (Contributor)

How would a rollback look in terms of using the Docker API?

thaJeztah added a commit to thaJeztah/docker that referenced this pull request Jan 16, 2017
Pull request moby#27745 added support for the
client to talk to older versions of the daemon. Various flags were added to
docker 1.13 that are not compatible with older daemons.

This PR adds annotations to those flags, so that they are automatically hidden
if the daemon is older than docker 1.13 (API 1.25).

Not all new flags affect the API (some are client-side only). The following
PRs added new flags to docker 1.13 that affect the API:

- moby#23430 added `--cpu-rt-period` and `--cpu-rt-runtime`
- moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm`
- moby#27702 added `--network` to `docker build`
- moby#25962 added `--attachable` to `docker network create`
- moby#27998 added `--compose-file` to `docker stack deploy`
- moby#22566 added `--stop-timeout` to `docker run` and `docker create`
- moby#26061 added `--init` to `docker run` and `docker create`
- moby#26941 added `--init-path` to `docker run` and `docker create`
- moby#27958 added `--cpus` on `docker run` / `docker create`
- moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create`
- moby#27596 added `--force` to `docker service update`
- moby#27857 added `--hostname` to `docker service create`
- moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update`
- moby#28076 added `--tty` on `docker service create` / `docker service update`
- moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update`
- moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update`

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@darklow commented Mar 6, 2017

@aaronlehmann Having rollback as an option for --update-failure-action would be great. Actually, I thought that was already how it works, so I kept running
`docker service update --rollback=true my_service` and wondering why I saw no rollback parameter in `docker service inspect`.

Using --update-failure-action=rollback would have helped in the following scenario, which I hit recently. I have a service running nginx and I made a mistake in its config file, so when I called docker service update ... the new container failed to start, the old one was stopped, and nothing was listening on port 80; the Amazon ELB health check immediately failed and took down the whole cluster. By the time I realised what had happened, pushed a fix, and the ELB had re-added the instances and resumed service, 1-2 minutes had passed.

Having --update-failure-action=rollback would eliminate this case: because nginx failed, the service would roll back and there would be no two minutes of downtime for our services.

@thaJeztah (Member)

@darklow see #31108, does that cover your use case?

@darklow commented Mar 6, 2017

@thaJeztah very nice, that's a good one, wasn't aware of it yet, thanks! It would totally cover my case.

@darklow commented Mar 6, 2017

@thaJeztah one more question you could maybe help with. Rollback will definitely help, but I wonder why, in my case, the old task was killed even though the new one started unhealthy: the nginx config had an error, so the main command failed to start, yet it still killed the existing task and kept restarting the failing new one.

Would any existing settings help solve this? It would make sense not to kill old service tasks until the new ones are ready; I wonder why that didn't happen. Thank you.

@thaJeztah (Member)

@darklow I think #30261 may cover that. The discussion on #30321 may also be relevant.

andrewhsu pushed a commit to docker-archive/docker-ce that referenced this pull request May 19, 2017
Upstream-commit: 5d2722f83db9e301c6dcbe1c562c2051a52905db
Component: cli