-
Notifications
You must be signed in to change notification settings - Fork 18.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Service update failure thresholds and rollback #26421
Service update failure thresholds and rollback #26421
Conversation
23d0359
to
0e0d238
Compare
flagWorkdir = "workdir" | ||
flagRegistryAuth = "with-registry-auth" | ||
flagLogDriver = "log-driver" | ||
flagLogOpt = "log-opt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gofmt
working as intended
FWIW, LGTM. I am not a maintainer though. |
0e0d238
to
e04ccb3
Compare
flagStopGracePeriod = "stop-grace-period" | ||
flagUpdateDelay = "update-delay" | ||
flagUpdateFailureAction = "update-failure-action" | ||
flagUpdateFailureFraction = "update-failure-fraction" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I usually prefer the term "ratio" over "fraction". Some ideas:
--update-min-success-ratio
--update-max-failure-ratio
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think --update-max-failure-ratio
could work. Ratio is indeed a better description.
@aluzzardi WDYT? Also, if we adopt this terminology, should we update SwarmKit to rename the protobuf field accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM - nits on flag names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does rollback behave with respect to service version? I'm thinking of something like this:
The service starts at version 1
User A performs an update (service is at version 2)
User B performs an update immediately after (service is at version 3)
User A sees things going bad, so attempts to rollback.
Service ends up on version 2 (I assume?) instead of version 1
If the --rollback
flag accepted a service version it would prevent that scenario? The user would be informed of the error and they could force an update to the specific version they want manually.
} else if rollback { | ||
updateOpts.RegistryAuthFrom = types.RegistryAuthFromPreviousSpec | ||
} else { | ||
updateOpts.RegistryAuthFrom = types.RegistryAuthFromSpec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a switch/case would be appropriate here
`docker service update`'s `--rollback` flag. This will revert the service | ||
to the configuration that was in place before the most recent | ||
`docker service update` command. Other options can be combined with | ||
`--rollback`; for example, `--update-delay 0s` to execute the rollback without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?
My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.
Has that changed? Or is rollback a special case that works differently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When other options are combined with rollback are they applied to the service spec, or are they only used for the "update", so only "update" options would work?
The current implementation allows all options supported by service update
. I think update-*
options are the most useful options to specify in combination with a rollback. But I am allowing all options for simplicity. If there's a good UX argument for restricting the set of options that can be combined with rollback, we could consider doing that.
My understanding (which may be out of date by now) was that if you ran an update which changed the update config of a service, that change wouldn't be reflected in the current update, it would only apply to the next one, because the service uses the current configuration to perform the update, not the new configuration.
I don't believe that's the case. When I've tested these options, they seem to take effect immediately. If there are execptions to this, I don't think it's the intended behavior.
Yes, that scenario is accurate. However I don't see this as any different from the way
In theory, yes, but I see a few hurdles:
|
e04ccb3
to
0e8b69c
Compare
Rebased and updated in response to feeback. @dnephin @aluzzardi PTAL |
design LGTM |
1 similar comment
design LGTM |
0e8b69c
to
29e19b6
Compare
Rebased. Review ping? |
LGTM |
c2b9fd2
to
0f5df94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs changes LGTM
but ping @mstanleyjones if we need to have an example use somewhere in the documentation
User docs here: docker/docs#219 |
Oh, thanks! |
arf, @aaronlehmann needs a rebase 👼 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment/question, but docs LGTM 🐸
- **Delay** – Delay between restart attempts. | ||
- **Attempts** – Maximum attempts to restart a given container before giving up (default value | ||
- **Delay** – Delay between restart attempts, in nanoseconds. | ||
- **MaxAttempts** – Maximum attempts to restart a given container before giving up (default value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it wrongely documented, or has this changed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was wrongly documented.
@@ -41,10 +41,14 @@ Placement: | |||
{{- if .HasUpdateConfig }} | |||
UpdateConfig: | |||
Parallelism: {{ .UpdateParallelism }} | |||
{{- if .HasUpdateDelay -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what does the - do here? What does removing it do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The -
tells the template engine to remove whitespace that follows. This was incorrect because it ended up merging Delay onto the same line as Parallelism. Notice that the other conditionals don't have a -
at the end.
I could have opened another PR for this, but found it easiest to bundle it with the other changes I was making to the template.
Delay: opts.update.delay, | ||
Monitor: opts.update.monitor, | ||
FailureAction: opts.update.onFailure, | ||
MaxFailureRatio: opts.update.maxFailureRatio, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is missing onFailure?
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
This adds support for two enhancements to swarm service rolling updates: - Failure thresholds: In Docker 1.12, a service update could be set up to either pause or continue after a single failure occurs. This adds an --update-max-failure-ratio flag that controls how many tasks need to fail to update for the update as a whole to be considered a failure. A counterpart flag, --update-monitor, controls how long to monitor each task for a failure after starting it during the update. - Rollback flag: service update --rollback reverts the service to its previous version. If a service update encounters task failures, or fails to function properly for some other reason, the user can roll back the update. SwarmKit also has the ability to roll back updates automatically after hitting the failure thresholds, but we've decided not to expose this in the Docker API/CLI for now, favoring a workflow where the decision to roll back is always made by an admin. Depending on user feedback, we may add a "rollback" option to --update-failure-action in the future. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
8b24fdc
to
6d4b527
Compare
Rebased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
How would a rollback look in terms of using the docker api? |
Pull request moby#27745 added support for the client to talk to older versions of the daemon. Various flags were added to docker 1.13 that are not compatible with older daemons. This PR adds annotations to those flags, so that they are automatically hidden if the daemon is older than docker 1.13 (API 1.25). Not all new flags affect the API (some are client-side only). The following PR's added new flags to docker 1.13 that affect the API; - moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime` - moby#27800 / moby#25317 added `--group` / `--group-add` / `--group-rm` - moby#27702 added `--network` to `docker build` - moby#25962 added `--attachable` to `docker network create` - moby#27998 added `--compose-file` to `docker stack deploy` - moby#22566 added `--stop-timeout` to `docker run` and `docker create` - moby#26061 added `--init` to `docker run` and `docker create` - moby#26941 added `--init-path` to `docker run` and `docker create` - moby#27958 added `--cpus` on `docker run` / `docker create` - moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create` - moby#27596 added `--force` to `docker service update` - moby#27857 added `--hostname` to `docker service create` - moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update` - moby#28076 added `--tty` on `docker service create` / `docker service update` - moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update` - moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update` Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@aaronlehmann Having Using Having |
@thaJeztah very nice, that's a good one, wasn't aware of it yet, thanks! It would totally cover my case. |
@thaJeztah one more question maybe you could help with. Rollback will definitely help. But I wonder why in my case/example old task was killed even if new one started unhealthy, Any existing settings would help solving this? It would make sense not to kill old service tasks until new ones are ready, I wonder why it didn't happened. Thank you. |
Pull request moby/moby#27745 added support for the client to talk to older versions of the daemon. Various flags were added to docker 1.13 that are not compatible with older daemons. This PR adds annotations to those flags, so that they are automatically hidden if the daemon is older than docker 1.13 (API 1.25). Not all new flags affect the API (some are client-side only). The following PR's added new flags to docker 1.13 that affect the API; - moby/moby#23430 added `--cpu-rt-period`and `--cpu-rt-runtime` - moby/moby#27800 / moby/moby#25317 added `--group` / `--group-add` / `--group-rm` - moby/moby#27702 added `--network` to `docker build` - moby/moby#25962 added `--attachable` to `docker network create` - moby/moby#27998 added `--compose-file` to `docker stack deploy` - moby/moby#22566 added `--stop-timeout` to `docker run` and `docker create` - moby/moby#26061 added `--init` to `docker run` and `docker create` - moby/moby#26941 added `--init-path` to `docker run` and `docker create` - moby/moby#27958 added `--cpus` on `docker run` / `docker create` - moby/moby#27567 added `--dns`, `--dns-opt`, and `--dns-search` to `docker service create` - moby/moby#27596 added `--force` to `docker service update` - moby/moby#27857 added `--hostname` to `docker service create` - moby/moby#28031 added `--hosts`, `--host-add` / `--host-rm` to `docker service create` and `docker service update` - moby/moby#28076 added `--tty` on `docker service create` / `docker service update` - moby/moby#26421 added `--update-max-failure-ratio`, `--update-monitor` and `--rollback` on `docker service update` - moby/moby#27369 added `--health-cmd`, `--health-interval`, `--health-retries`, `--health-timeout` and `--no-healthcheck` options to `docker service create` and `docker service update` Signed-off-by: Sebastiaan van Stijn <github@gone.nl> Upstream-commit: 5d2722f83db9e301c6dcbe1c562c2051a52905db Component: cli
This adds support for two enhancements to swarm service rolling updates:
to either pause or continue after a single failure occurs. This adds
an
--update-max-failure-ratio
flag that controls how many tasks need tofail to update for the update as a whole to be considered a failure. A
counterpart flag,
--update-monitor
, controls how long to monitor eachtask for a failure after starting it during the update.
--rollback
reverts the service to itsprevious version. If a service update encounters task failures, or
fails to function properly for some other reason, the user can roll back
the update.
SwarmKit also has the ability to roll back updates automatically after
hitting the failure thresholds, but we've decided not to expose this in
the Docker API/CLI for now, favoring a workflow where the decision to
roll back is always made by an admin. Depending on user feedback, we may
add a "rollback" option to
--update-failure-action
in the future.SwarmKit PR: moby/swarmkit#1380