
Automatic rollback for failed deployments #23211

Closed
therc opened this issue Mar 18, 2016 · 36 comments
Assignees
Labels
area/app-lifecycle area/workload-api/deployment priority/backlog Higher priority than priority/awaiting-more-evidence. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@therc
Member

therc commented Mar 18, 2016

I don't see any existing issues filed for this.

Allow the user to instruct Kubernetes, based on data from liveness/readiness probes (or failures to start new pods at all), to automatically undo a rollout, not just pause it. This could even have a configurable maxFailures knob to specify how many pods in a "failed after upgrade" state to allow before rolling back, and similarly for pending pods.
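A hypothetical manifest sketch of such a knob might look like the following. The autoRollback stanza and its maxFailures/maxPending fields are invented for illustration; none of them exist in the Deployment API:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
  # Hypothetical stanza, invented for illustration only:
  autoRollback:
    maxFailures: 2   # roll back once this many upgraded pods have failed probes
    maxPending: 2    # likewise for pods stuck in Pending
```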

@bgrant0607
Member

See also #14519

@bgrant0607 bgrant0607 added priority/backlog Higher priority than priority/awaiting-more-evidence. area/app-lifecycle labels Mar 18, 2016
@0xmichalis 0xmichalis self-assigned this May 14, 2016
@0xmichalis
Contributor

I will start working on this once #19343 is on master.

@krmayankk

@Kargakis @bgrant0607 is there a proposal for how we plan to implement this? We are interested in this feature and, if possible, would like to help out with the implementation as well.

@0xmichalis
Contributor

@krmayankk we first need to add the concept of a failed deployment (see #19343) before proceeding with this. The way it is going to work is:

User sets spec.progressDeadlineSeconds for a Deployment. If there is no progress for that amount of time, the Deployment controller will stop trying to deploy the latest template and the Deployment will be considered failed.

I guess automatic rollback will be configurable (e.g. spec.autoRollback=true|false). If it's enabled, then the latest successfully deployed template should be scaled up instead of the current failed Deployment template. If it's not set, then the controller should scale the new RS for the Deployment down to zero.
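The flow described above could be sketched as a manifest fragment like this. progressDeadlineSeconds is a real field; autoRollback is a hypothetical field from this discussion and does not exist in the API:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  # Real field: if the rollout makes no progress for this many seconds,
  # the Progressing condition flips to False with reason ProgressDeadlineExceeded.
  progressDeadlineSeconds: 600
  # Hypothetical field sketched in this thread; it does not exist in the API.
  autoRollback: true
```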

@krmayankk

@Kargakis when a deployment rolls back, the deployment spec will still be the old desired state, right? But somehow it will be marked as failed and rolled back, and hence the controller won't try to reconcile it again with the desired state until something on the deployment spec changes again. Does that look right?

@0xmichalis
Contributor

Exactly, we shouldn't change the spec at any point - only users should do that. Conditions on the Deployment status, and annotations on Replica Sets denoting that a template failed or succeeded being deployed, will help in determining which old template we should automatically roll back to.


@krmayankk

@Kargakis is there a sig-* group for this kind of discussion that I can join/attend?

@0xmichalis
Contributor

@krmayankk @kubernetes/sig-apps it is

https://github.com/kubernetes/community/tree/master/sig-apps

@0xmichalis
Contributor

Exactly, we shouldn't change the spec at any point - only users should do that. Conditions on the Deployment status and annotations on Replica Sets denoting that this template failed or succeeded being deployed will help in determining to what old template should we automatically rollback.

We had a discussion with @smarterclayton and we may want to use the rollback spec for this which means we will change the deployment template back to its working state (so the deployment template is always the latest running template and that actually is much easier to do and fits into the current state of the code like a charm). We could include an object reference inside Conditions and have a Condition that lets users know that they have been automatically rolled back because we couldn't deploy the replica set in the object reference.

@bgrant0607
Member

@Kargakis Yes, I imagined we would change the pod template back, as with manual rollback.

I could imagine an object reference in an event, but don't know where it would fit in a condition.

@ghodss
Contributor

ghodss commented Oct 12, 2016

@bgrant0607 Automatically changing the pod template back in the case of a failed deployment breaks the declarative config model, doesn't it? Currently we continuously run kubectl apply across an entire git repo to sync kubernetes objects into the clusters, but this would just result in a fight between the deployment controller changing the spec and the applier trying to change it back.

@Kargakis' point above that "we shouldn't change the spec at any point - only users should do that" is fundamental to our workflow - can that be a core requirement for this issue?

@0xmichalis
Contributor

@Kargakis' point above that "we shouldn't change the spec at any point - only users should do that" is fundamental to our workflow - can that be a core requirement for this issue?

@ghodss there are pros and cons to not changing the template.
Pros:

  • More predictable behavior. Users won't be surprised to see their spec has changed. But this really depends on how we document what autoRollback is doing in case of a failure.

Cons:

  • We break the assumption that the latest replica set is the active replica set, something that has bitten us in OpenShift.
  • It will be much harder to maintain.

@bgrant0607 I am not sure events are discoverable enough for this kind of info. I would prefer to have it somewhere in the deployment status (either in Conditions or as a separate object reference) so clients can use it directly and let users know what's what.

@bgrant0607
Member

@ghodss Yes, it would deliberately break the declarative model. I'd assume that a declarative user would want to detect lack of progress and then rollback via version-control rollback and re-apply.

Anyway, I don't think we could even get to the reviews for this until next year.

@krmayankk

krmayankk commented Oct 20, 2016

We break the assumption that the latest replica set is the active replica set, something that has bitten us in OpenShift.

@Kargakis what does this mean? A replica set is always identified by the pod template hash. Do you mean latest by age? If we plan to automagically change the deployment spec to reflect the rollback, how do you suggest scenarios like the one @ghodss describes should be handled, where kubectl apply is called and there is a race between what is applied and what was rolled back? Would kubectl apply be intelligent enough not to apply, since it was rolled back? Also, how do we proceed after a rollback, i.e. push a new updated deployment?

@krmayankk

@Kargakis I saw your comment on the perma-failed deployments and now I understand the first part of this. Basically you mean that if the current deployment spec doesn't change, the active replica set no longer has a template which matches the current Deployment template. I am curious: why is that assumption important for OpenShift?

@0xmichalis
Contributor

0xmichalis commented Oct 20, 2016

@krmayankk I didn't mention anywhere that the assumption is important for OpenShift.

I don't think kubectl apply should have any logic specific to deployments or any other resource for that matter.

@krmayankk

Sorry, I meant you said it 'has bitten us in the past at OpenShift' and I was just curious how, or in what scenario @Kargakis

@0xmichalis
Contributor

If you are not going to keep the latest replica set as the running replica set, you are going to scale up the last successfully deployed replica set, which may be far behind in your history (and you lose all deployment guarantees when you do that). Tooling that expects the latest replica set to be the running replica set (what's happening right now in Kube Deployments) is suddenly broken. We had to maintain ugliness in our deployment controller that nobody wanted to touch for 3 releases in order to support oc scale, which was making the aforementioned assumption in 1.1. The worst part is that the patch was affecting our system in subtle ways. I don't want to see this happen for Deployments, and at the same time we need to think about @ghodss's case.

@0xmichalis 0xmichalis added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Apr 3, 2017
@kevin-wangzefeng
Member

Any updates or plan on this? We are interested in this feature and would like to start writing a proposal if no one else is working on it.

@0xmichalis
Contributor

@kevin-wangzefeng nobody is working on it currently, feel free to open a proposal.

@kevin-wangzefeng
Member

Thanks. Assigning to myself, as I'm going to work on this.

@kevin-wangzefeng kevin-wangzefeng self-assigned this Apr 25, 2017
@0xmichalis
Contributor

0xmichalis commented Apr 26, 2017

@kevin-wangzefeng my thoughts, not something set in stone, though I have been thinking about this for the past few months:

  • this can be layered on top of spec.progressDeadlineSeconds. I think a timeout-based approach is simple and works, because you can only know that you failed once you decide that you have waited longer than you should - most if not all of the possible failures afaik are transient, but I may not be thinking of all of them.
  • there is an open question of whether to actually roll back the Deployment pod template spec or not (see previous discussion on this thread). Today, I think we should not actually roll back the Deployment, because then the autoRollback option stops being declarative and the API starts messing with a promising workflow IMO. Our Ops team in OpenShift is also using a similar approach to https://github.com/box/kube-applier.
  • I think this is a chance to introduce a spec.failurePolicy field, analogous to spec.strategy (which is really spec.updateStrategy). Into the new struct we can move progressDeadlineSeconds and also add different kinds of strategies, such as retry (what we do today) and rollback, which scales down the bad new replicas and scales back up the old replicas. Maybe extend rollback later to do canaries at a specific interval.
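A rough shape for such a failurePolicy stanza, purely as a sketch; no such field exists in the Deployment API:

```yaml
spec:
  failurePolicy:
    progressDeadlineSeconds: 600
    # Hypothetical actions: Retry is today's behavior; Rollback would scale
    # down the failed replica set and scale the old one back up.
    action: Rollback
```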

@0xmichalis
Contributor

this can be layered on top of spec.progressDeadlineSeconds. I think a timeout-based approach is simple and works because you can only know that you failed when you decide that you have waited more than you should - most if not all of the possible failures afaik are transient but I may not be thinking about all.

We can make progressDeadlineSeconds optional for the failurePolicy and allow infant mortality (#18568) but I think users would have to opt in.

@0xmichalis
Contributor

We discussed this briefly in the sig-apps meeting yesterday and I am echoing bits from the discussion: the bar for new features is very high at this point (1.8) for various reasons: 1) we want to bring StatefulSets and DaemonSets on par feature-wise as much as possible (API consistency) 2) we need more feedback on the existing features; lots of good work has gone in 1.7 and we need to verify it. I think we are going to conclude on this in a follow-up meeting.

@smarterclayton @kow3ns @janetkuo @erictune @kubernetes/sig-apps-feature-requests

Also, there is general consensus that workload controllers (Deployments, StatefulSets, DaemonSets, Jobs, etc.) are building blocks for higher-level orchestrators/worklflow engines. That being said, my thought is that today you should be able to set ProgressDeadlineSeconds on the Deployment spec, and automatically rollback to the previous revision once you observe that the Progressing condition in the Deployment status has Reason=ProgressDeadlineExceeded. You can use something like a controller in a pod (#5164), AppController, a cronjob, whatever. This does not mean that I am against supporting this directly in the API in the future.
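As a sketch of that external-controller approach, the core check can live entirely outside the API: watch the Deployment object and trigger a rollback once the Progressing condition reports ProgressDeadlineExceeded. A minimal Python sketch of the condition check follows; the Deployment is represented as a plain dict, as a Kubernetes client would return it, and the actual undo would be issued separately:

```python
def needs_rollback(deployment: dict) -> bool:
    """True once the rollout has stalled past progressDeadlineSeconds:
    the Progressing condition is False with reason ProgressDeadlineExceeded."""
    for cond in deployment.get("status", {}).get("conditions", []):
        if (cond.get("type") == "Progressing"
                and cond.get("status") == "False"
                and cond.get("reason") == "ProgressDeadlineExceeded"):
            return True
    return False

# A stalled Deployment as the controller would observe it:
stalled = {"status": {"conditions": [
    {"type": "Progressing", "status": "False",
     "reason": "ProgressDeadlineExceeded"},
]}}
print(needs_rollback(stalled))  # True
```

A controller in a pod, a CronJob, or a CI step could poll this and then run kubectl rollout undo deployment/<name> (the name placeholder is generic), keeping the rollback logic out of the core API as suggested above.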

@bgrant0607
Member

At this point, I think automatic rollback needs to be a feature of higher-level deployment pipelines.

Example: I want to deploy to staging. If it succeeds, I want to deploy to prod. If it fails, I want to rollback. Implementing auto rollback in K8s wouldn't fit with that.

Making lack-of-progress reporting more usable in such higher-level pipelines seems more valuable than auto-rollback in Deployment.

@tnozicka
Contributor

tnozicka commented Aug 4, 2017

@bgrant0607 what if it succeeds in staging and fails in prod? There will be a broken Deployment until manual action is taken, which can take a long time. Or you can check it and roll it back manually in every deployment script/pipeline, and that assumes you have one. There are other complementary concepts, like triggers. Also, it will result in duplicating this logic in every pipeline you have. It's not a trivial action to check conditions and timeouts.

Implementing auto rollback in K8s wouldn't fit with that.

We can surely make auto rollback configurable to be enabled/disabled, depending on what the default is. That should allow both use cases.

@0xmichalis
Contributor

Or you can check it and rollback it manually in every deployment script/pipeline

Or you can write a k8s controller that does this for you.

At this point, I think it's more valuable to make sure that all workload controllers report lack of progress.
cc: @kow3ns @janetkuo

@bgrant0607
Member

I agree with @Kargakis on reporting lack of progress consistently. This is needed in order to facilitate building higher-level deployment pipelines. This issue, among others, is discussed more in #34363.

See also my comments on auto-pause:
#11505 (comment)

They doubly apply to auto-rollback.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 6, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 9, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@contra: you can't re-open an issue/PR unless you authored it or you are assigned to it.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@prudhvigodithi

Hey, so automated rollbacks through Kubernetes are currently done only with kubectl's "undo" command.

  1. Is there a way Kubernetes identifies a failed deployment and rolls back automatically to the previous running deployment by itself, rather than requiring kubectl rollout undo deploy/ to be run?

  2. Also, with the command kubectl patch deployment/nginx-deployment -p '{"spec":{"progressDeadlineSeconds":100}}', Kubernetes does not by default roll back to the previous successful deployment after the new deployment fails and exceeds the 100s progressDeadlineSeconds.

  3. Can we make Kubernetes know to roll back by itself depending on progressDeadlineSeconds?

  4. I have tried kubectl patch deployment/nginx-deployment -p '{"spec":{"spec.autoRollback":true}}', which didn't seem to work.

Please provide me with a solution if there is a way Kubernetes can automatically roll back to the previous successful deployment by itself.
Thank you

@TomaszKlosinski

TomaszKlosinski commented Oct 13, 2020

So is there any plan to implement spec.autoRollback?

/reopen

@rasert

rasert commented Mar 6, 2024

Please implement this.

@k8s-ci-robot
Contributor

@rasert: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen
/remove-lifecycle rotten

Please implement this.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 6, 2024