
Automatic rollback for failed deployments #23211

Closed
therc opened this issue Mar 18, 2016 · 36 comments
Assignees
Labels
area/app-lifecycle area/workload-api/deployment priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@therc
Member

therc commented Mar 18, 2016

I don't see any existing issues filed for this.

Allow the user to instruct Kubernetes, based on data from liveness/readiness probes (or failures to start new pods at all), to automatically undo a rollout, not just pause it. This could even have a configurable maxFailures knob to specify how many pods in a "failed after upgrade" state to allow before rolling back, and similarly for pending pods.
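A hypothetical manifest sketch of such a knob might look like the following. The autoRollback stanza and its maxFailures/maxPending fields are invented for illustration; none of them exist in the Deployment API:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
  # Hypothetical stanza, invented for illustration only:
  autoRollback:
    maxFailures: 2   # roll back once this many upgraded pods have failed probes
    maxPending: 2    # likewise for pods stuck in Pending
```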

@bgrant0607
Member

See also #14519

@bgrant0607 bgrant0607 added priority/backlog Higher priority than priority/awaiting-more-evidence. area/app-lifecycle labels Mar 18, 2016
@0xmichalis 0xmichalis self-assigned this May 14, 2016
@0xmichalis
Contributor

I will start working on this once #19343 is on master.

@krmayankk

@Kargakis @bgrant0607 is there a proposal for how we plan to implement this? We are interested in this feature and, if possible, would like to help out with the implementation as well.

@0xmichalis
Contributor

@krmayankk we first need to add the concept of a failed deployment (see #19343) before proceeding with this. The way it is going to work is:

User sets spec.progressDeadlineSeconds for a Deployment. If there is no progress for that amount of time, the Deployment controller will stop trying to deploy the latest template and the Deployment will be considered failed.

I guess automatic rollback will be configurable (e.g. spec.autoRollback=true|false). If it's enabled, then the latest successfully deployed template should be scaled up instead of the current failed Deployment template. If it's not set, then the controller should scale the new RS for the Deployment down to zero.
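The flow described above could be sketched as a manifest fragment like this. progressDeadlineSeconds is a real field; autoRollback is a hypothetical field from this discussion and does not exist in the API:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  # Real field: if the rollout makes no progress for this many seconds,
  # the Progressing condition flips to False with reason ProgressDeadlineExceeded.
  progressDeadlineSeconds: 600
  # Hypothetical field sketched in this thread; it does not exist in the API.
  autoRollback: true
```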

@krmayankk

@Kargakis when a deployment rolls back, the deployment spec will still be the old desired state, right? But somehow it will be marked as failed and rolled back, and hence the controller won't try to reconcile it again with the desired state until something on the deployment spec changes again. Does that look right?

@0xmichalis
Contributor

Exactly, we shouldn't change the spec at any point - only users should do that. Conditions on the Deployment status, and annotations on Replica Sets denoting that a template failed or succeeded being deployed, will help in determining which old template we should automatically roll back to.


@krmayankk

@Kargakis is there a sig-* group for this kind of discussion that I can join/attend?

@0xmichalis
Contributor

@krmayankk @kubernetes/sig-apps it is

https://github.com/kubernetes/community/tree/master/sig-apps

@0xmichalis
Contributor

Exactly, we shouldn't change the spec at any point - only users should do that. Conditions on the Deployment status and annotations on Replica Sets denoting that this template failed or succeeded being deployed will help in determining to what old template should we automatically rollback.

We had a discussion with @smarterclayton and we may want to use the rollback spec for this which means we will change the deployment template back to its working state (so the deployment template is always the latest running template and that actually is much easier to do and fits into the current state of the code like a charm). We could include an object reference inside Conditions and have a Condition that lets users know that they have been automatically rolled back because we couldn't deploy the replica set in the object reference.

@bgrant0607
Member

@Kargakis Yes, I imagined we would change the pod template back, as with manual rollback.

I could imagine an object reference in an event, but don't know where it would fit in a condition.

@ghodss
Contributor

ghodss commented Oct 12, 2016

@bgrant0607 Automatically changing the pod template back in the case of a failed deployment breaks the declarative config model, doesn't it? Currently we continuously run kubectl apply across an entire git repo to sync kubernetes objects into the clusters, but this would just result in a fight between the deployment controller changing the spec and the applier trying to change it back.

@Kargakis' point above that "we shouldn't change the spec at any point - only users should do that" is fundamental to our workflow - can that be a core requirement for this issue?

@0xmichalis
Contributor

@Kargakis' point above that "we shouldn't change the spec at any point - only users should do that" is fundamental to our workflow - can that be a core requirement for this issue?

@ghodss there are pros and cons to not changing the template.
Pros:

  • More predictable behavior. Users won't be surprised to see their spec has changed. But this really depends on how we document what autoRollback is doing in case of a failure.

Cons:

  • We break the assumption that the latest replica set is the active replica set, something that has bitten us in OpenShift.
  • It will be much harder to maintain.

@bgrant0607 I am not sure events are discoverable enough for this kind of info. I would prefer to have it somewhere in the deployment status (either in Conditions or as a separate object reference) so clients can use it directly and let users know what's what.

@bgrant0607
Member

@ghodss Yes, it would deliberately break the declarative model. I'd assume that a declarative user would want to detect lack of progress and then rollback via version-control rollback and re-apply.

Anyway, I don't think we could even get to the reviews for this until next year.

@krmayankk

krmayankk commented Oct 20, 2016

We break the assumption that the latest replica set is the active replica set, something that has bitten us in OpenShift.

@Kargakis what does this mean? A replica set is always identified by the pod template hash. Do you mean latest by age? If we plan to automagically change the deployment spec to reflect the rollback, how do you suggest scenarios like the one @ghodss describes should be handled, where kubectl apply is called and there is a race between what is applied and what was rolled back? Would kubectl apply be intelligent enough not to apply, since it was rolled back? Also, how do we proceed after a rollback, i.e. push a new updated deployment?

@krmayankk

@Kargakis I saw your comment on the perma-failed deployments and now I understand the first part of this. Basically you mean that if the current deployment spec doesn't change, the active replica set no longer has a template which matches the current Deployment template. I am curious: why is that assumption important for OpenShift?

@0xmichalis
Contributor

0xmichalis commented Oct 20, 2016

@krmayankk I didn't mention anywhere that the assumption is important for OpenShift.

I don't think kubectl apply should have any logic specific to deployments or any other resource for that matter.

@krmayankk

Sorry, I meant you said it 'has bitten us in the past at OpenShift' and I was just curious how, or in what scenario @Kargakis

@0xmichalis
Contributor

If you are not going to keep the latest replica set as the running replica set, you are going to scale up the last successfully deployed replica set, which may be far behind in your history (and you lose all deployment guarantees when you do that). Tooling that expects the latest replica set to be the running replica set (what's happening right now in Kube Deployments) is suddenly broken. We had to maintain ugliness in our deployment controller that nobody wanted to touch for 3 releases in order to support oc scale, which was making the aforementioned assumption in 1.1. The worst part is that the patch was affecting our system in subtle ways. I don't want to see this happen for Deployments, and at the same time we need to think about @ghodss's case.

@0xmichalis 0xmichalis added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Apr 3, 2017
@kevin-wangzefeng
Member

Any updates or plan on this? We are interested in this feature and would like to start writing a proposal if no one else is working on it.

@0xmichalis
Contributor

@kevin-wangzefeng nobody is working on it currently, feel free to open a proposal.

@kevin-wangzefeng
Member

Thanks. Assigning to myself, as I'm going to work on this.

@kevin-wangzefeng kevin-wangzefeng self-assigned this Apr 25, 2017
@0xmichalis
Contributor

0xmichalis commented Apr 26, 2017

@kevin-wangzefeng my thoughts, not something set in stone, though I have been thinking about this for the past few months:

  • this can be layered on top of spec.progressDeadlineSeconds. I think a timeout-based approach is simple and works, because you can only know that you failed once you decide that you have waited longer than you should - most if not all of the possible failures afaik are transient, but I may not be thinking of all of them.
  • there is an open question of whether to actually roll back the Deployment pod template spec or not (see previous discussion on this thread). Today, I think we should not actually roll back the Deployment, because then the autoRollback option stops being declarative and the API starts messing with a promising workflow IMO. Our Ops team in OpenShift is also using a similar approach to https://github.com/box/kube-applier.
  • I think this is a chance to introduce a spec.failurePolicy field, analogous to spec.strategy (which is really spec.updateStrategy). Into the new struct we can move progressDeadlineSeconds and also add different kinds of strategies, such as retry (what we do today) and rollback, which scales down the bad new replicas and scales back up the old replicas. Maybe extend rollback later to do canaries at a specific interval.
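A rough shape for such a failurePolicy stanza, purely as a sketch; no such field exists in the Deployment API:

```yaml
spec:
  failurePolicy:
    progressDeadlineSeconds: 600
    # Hypothetical actions: Retry is today's behavior; Rollback would scale
    # down the failed replica set and scale the old one back up.
    action: Rollback
```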

@0xmichalis
Contributor

this can be layered on top of spec.progressDeadlineSeconds. I think a timeout-based approach is simple and works because you can only know that you failed when you decide that you have waited more than you should - most if not all of the possible failures afaik are transient but I may not be thinking about all.

We can make progressDeadlineSeconds optional for the failurePolicy and allow infant mortality (#18568) but I think users would have to opt in.

@0xmichalis
Contributor

We discussed this briefly in the sig-apps meeting yesterday and I am echoing bits from the discussion: the bar for new features is very high at this point (1.8) for various reasons: 1) we want to bring StatefulSets and DaemonSets on par feature-wise as much as possible (API consistency) 2) we need more feedback on the existing features; lots of good work has gone in 1.7 and we need to verify it. I think we are going to conclude on this in a follow-up meeting.

@smarterclayton @kow3ns @janetkuo @erictune @kubernetes/sig-apps-feature-requests

Also, there is general consensus that workload controllers (Deployments, StatefulSets, DaemonSets, Jobs, etc.) are building blocks for higher-level orchestrators/worklflow engines. That being said, my thought is that today you should be able to set ProgressDeadlineSeconds on the Deployment spec, and automatically rollback to the previous revision once you observe that the Progressing condition in the Deployment status has Reason=ProgressDeadlineExceeded. You can use something like a controller in a pod (#5164), AppController, a cronjob, whatever. This does not mean that I am against supporting this directly in the API in the future.
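As a sketch of that external-controller approach, the core check can live entirely outside the API: watch the Deployment object and trigger a rollback once the Progressing condition reports ProgressDeadlineExceeded. A minimal Python sketch of the condition check follows; the Deployment is represented as a plain dict, as a Kubernetes client would return it, and the actual undo would be issued separately:

```python
def needs_rollback(deployment: dict) -> bool:
    """True once the rollout has stalled past progressDeadlineSeconds:
    the Progressing condition is False with reason ProgressDeadlineExceeded."""
    for cond in deployment.get("status", {}).get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False

# A stalled Deployment as the controller would observe it:
stalled = {"status": {"conditions": [
    {"type": "Progressing", "status": "False",
     "reason": "ProgressDeadlineExceeded"},
]}}
print(needs_rollback(stalled))  # True
```

A controller in a pod, a CronJob, or a CI step could poll this and then run kubectl rollout undo deployment/<name> (the name placeholder is generic), keeping the rollback logic out of the core API as suggested above.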

@bgrant0607
Member

At this point, I think automatic rollback needs to be a feature of higher-level deployment pipelines.

Example: I want to deploy to staging. If it succeeds, I want to deploy to prod. If it fails, I want to rollback. Implementing auto rollback in K8s wouldn't fit with that.

Making lack-of-progress reporting more usable in such higher-level pipelines seems more valuable than auto-rollback in Deployment.

@tnozicka
Contributor

tnozicka commented Aug 4, 2017

@bgrant0607 what if it succeeds in staging and fails in prod? There will be a broken Deployment until manual action is taken, which can take a long time. Or you can check it and roll it back manually in every deployment script/pipeline, and that assumes you have one. There are other complementary concepts, like triggers. Also, it will result in duplicating this logic in every pipeline you have. It's not a trivial action to check conditions and timeouts.

Implementing auto rollback in K8s wouldn't fit with that.

We can surely make auto rollback configurable to be enabled/disabled, depending on what the default is. That should allow both use cases.

@0xmichalis
Contributor

Or you can check it and rollback it manually in every deployment script/pipeline

Or you can write a k8s controller that does this for you.

At this point, I think it's more valuable to make sure that all workload controllers report lack of progress.
cc: @kow3ns @janetkuo

@bgrant0607
Member

I agree with @Kargakis on reporting lack of progress consistently. This is needed in order to facilitate building higher-level deployment pipelines. This issue, among others, is discussed more in #34363.

See also my comments on auto-pause:
#11505 (comment)

They doubly apply to auto-rollback.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@contra: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@prudhvigodithi

Hey, so automated rollbacks through Kubernetes are currently done only with kubectl's "undo" command.

  1. Is there a way Kubernetes identifies a failed deployment and rolls back automatically to the previous running deployment by itself, rather than requiring kubectl rollout undo deploy/ to be run?

  2. Also, with the command kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":100}}', Kubernetes does not by default roll back to the previous successful deployment after the new deployment fails and exceeds the 100s progressDeadlineSeconds.

  3. Can we make Kubernetes know to roll back by itself depending on progressDeadlineSeconds?

  4. I have tried kubectl patch deployment/nginx-deployment -p '{"spec":{"spec.autoRollback":true}}', which didn't seem to work.

Please provide me with a solution if there is a way Kubernetes can automatically roll back to the previous successful deployment by itself.
Thank you

@TomaszKlosinski

TomaszKlosinski commented Oct 13, 2020

So is there any plan to implement spec.autoRollback?

/reopen

@rasert

rasert commented Mar 6, 2024

Please implement this.

@k8s-ci-robot
Contributor

@rasert: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

Please implement this.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 6, 2024