Automatic rollback for failed deployments #23211
See also #14519.
I will start working on this once #19343 is on master.
@Kargakis @bgrant0607 is there a proposal for how we plan to implement this? We are interested in this feature and, if possible, would like to help out with the implementation as well.
@krmayankk we first need to add the concept of a failed deployment (see #19343) before proceeding with this. The way it is going to work is: the user sets spec.progressDeadlineSeconds for a Deployment - if there is no progress for that amount of time, the Deployment controller will stop trying to deploy the latest template and the Deployment will be considered failed. I guess automatic rollback will be configurable (e.g. spec.autoRollback=true|false). If it's enabled, then the latest successfully deployed template should be scaled up instead of the current failed Deployment template. If it's not set, then the controller should scale the new RS for the Deployment down to zero.
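For illustration, here is a minimal sketch of what that shape might look like on a Deployment manifest. progressDeadlineSeconds is a real field in the Deployment API today; autoRollback is only the hypothetical knob floated in this comment and does not exist, so it is shown as a comment:

```bash
# A sketch of the proposed shape. progressDeadlineSeconds exists today;
# autoRollback is hypothetical and NOT part of the API.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  progressDeadlineSeconds: 600   # report lack of progress after 10 minutes
  # autoRollback: true           # hypothetical: roll back to the last working template on failure
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: nginx:1.25
EOF
```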
@Kargakis when a deployment rolls back, the deployment spec will still be the old desired state, right? But somehow it will be marked as failed and rolled back, and hence the controller won't try to reconcile it with the desired state again until something on the deployment spec changes. Does that look right?
Exactly, we shouldn't change the spec at any point - only users should do that.
@Kargakis is there a sig-* group for this kind of discussion that I can join/attend?
@krmayankk @kubernetes/sig-apps it is https://github.com/kubernetes/community/tree/master/sig-apps
We had a discussion with @smarterclayton and we may want to use the rollback spec for this, which means we will change the deployment template back to its working state (so the deployment template is always the latest running template - that actually is much easier to do and fits into the current state of the code like a charm). We could include an object reference inside Conditions and have a Condition that lets users know that they have been automatically rolled back because we couldn't deploy the replica set in the object reference.
@Kargakis Yes, I imagined we would change the pod template back, as with manual rollback. I could imagine an object reference in an event, but don't know where it would fit in a condition.
@bgrant0607 Automatically changing the pod template back in the case of a failed deployment breaks the declarative config model, doesn't it? Currently we continuously run kubectl apply across an entire git repo to sync kubernetes objects into the clusters, but this would just result in a fight between the deployment controller changing the spec and the applier trying to change it back. @Kargakis' point above that "we shouldn't change the spec at any point - only users should do that" is fundamental to our workflow - can that be a core requirement for this issue?
@ghodss there are pros and cons to not changing the template.
Cons:
@bgrant0607 I am not sure events are discoverable enough for this kind of info. I would prefer to have it somewhere in the deployment status (either in Conditions or as a separate object reference) so clients can use it directly and let users know what's what.
@ghodss Yes, it would deliberately break the declarative model. I'd assume that a declarative user would want to detect lack of progress and then roll back via version-control rollback and re-apply. Anyway, I don't think we could even get to the reviews for this until next year.
@Kargakis what does this mean? A replica set is always identified by the pod template hash. Do you mean latest by age? If we plan to automagically change the deployment spec to reflect the rollback, how do you suggest scenarios like the one @ghodss describes should be handled, where kubectl apply is called and there will be a race between what is applied and what was rolled back? Would kubectl apply be intelligent enough to not apply since it's rolled back? Also, how do we proceed after rollback - push a new updated deployment?
@Kargakis I saw your comment on the perma-failed deployments and now I understand the first part of this. Basically you mean that if the current deployment spec doesn't change, the active replica set no longer has a template which matches the current deployment template. I am curious why that assumption is important for OpenShift?
@krmayankk I didn't mention anywhere that the assumption is important for OpenShift. I don't think kubectl apply should have any logic specific to deployments or any other resource for that matter.
Sorry, I meant you said "has bitten us in the past at OpenShift" and I was just curious how, or in what scenario, @Kargakis.
If you are not going to keep the latest replica set as the running replica set, you are going to scale up the last successfully deployed replica set, which may be far behind in your history (and you lose all deployment guarantees when you do that). Tooling that expects the latest replica set to be the running replica set (what's happening right now in Kube Deployments) is suddenly broken. We had to maintain ugliness in our deployment controller that nobody wanted to touch for 3 releases in order to support that.
Any updates or plan on this? We are interested in this feature and would like to start writing a proposal if no one else is working on it.
@kevin-wangzefeng nobody is working on it currently, feel free to open a proposal.
Thanks. Assigning to myself, as I'm going to work on this.
@kevin-wangzefeng some thoughts I've had over the past months - not something set in stone:
We can make progressDeadlineSeconds optional for the failurePolicy and allow infant mortality (#18568), but I think users would have to opt in.
We discussed this briefly in the sig-apps meeting yesterday and I am echoing bits from the discussion: the bar for new features is very high at this point (1.8) for various reasons: 1) we want to bring StatefulSets and DaemonSets on par feature-wise as much as possible (API consistency); 2) we need more feedback on the existing features - lots of good work has gone into 1.7 and we need to verify it. I think we are going to conclude on this in a follow-up meeting. @smarterclayton @kow3ns @janetkuo @erictune @kubernetes/sig-apps-feature-requests
Also, there is general consensus that workload controllers (Deployments, StatefulSets, DaemonSets, Jobs, etc.) are building blocks for higher-level orchestrators/workflow engines. That being said, my thought is that today you should be able to set ProgressDeadlineSeconds on the Deployment spec, and automatically roll back to the previous revision once you observe that the Progressing condition in the Deployment status has Reason=ProgressDeadlineExceeded. You can use something like a controller in a pod (#5164), AppController, a cronjob, whatever - see the sketch below. This does not mean that I am against supporting this directly in the API in the future.
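As a rough illustration of that external-controller approach (not an official recipe; the deployment name and namespace below are placeholders), something like the following could run from a CronJob:

```bash
# Poll the Deployment's Progressing condition and roll back to the previous
# revision if the progress deadline was exceeded.
DEPLOYMENT=my-app     # placeholder
NAMESPACE=default     # placeholder

reason=$(kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
  -o jsonpath='{.status.conditions[?(@.type=="Progressing")].reason}')

if [ "$reason" = "ProgressDeadlineExceeded" ]; then
  # Undo to the previous revision recorded by the Deployment controller.
  kubectl -n "$NAMESPACE" rollout undo deployment/"$DEPLOYMENT"
fi
```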
At this point, I think automatic rollback needs to be a feature of higher-level deployment pipelines. Example: I want to deploy to staging. If it succeeds, I want to deploy to prod. If it fails, I want to roll back. Implementing auto rollback in K8s wouldn't fit with that. Making lack-of-progress reporting more usable in such higher-level pipelines seems more valuable than auto-rollback in Deployment.
@bgrant0607 what if it succeeds in staging and fails in prod? There will be a broken Deployment until manual action is taken, which can take a long time. Or you can check it and roll it back manually in every deployment script/pipeline - and that assumes you have one; there are other complementary concepts like triggers. Also, it will result in duplicating this logic in every pipeline you have. It's not a trivial action to check conditions and timeouts.
We can surely make auto rollback configurable to be enabled/disabled, depending on what the default is. That should allow both use cases.
I agree with @Kargakis on reporting lack of progress consistently. This is needed in order to facilitate building higher-level deployment pipelines. This issue, among others, is discussed more in #34363. See also my comments on auto-pause; they doubly apply to auto-rollback.
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@contra: you can't re-open an issue/PR unless you authored it or you are assigned to it. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hey, so automated rollbacks through Kubernetes are currently done only with the kubectl "undo" command.
Please let me know if there is a way Kubernetes can automatically roll back to the previous successful deployment by itself.
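One way to approximate this from a deployment script or CI pipeline today is to let kubectl rollout status fail and then undo the rollout. A sketch, assuming the Deployment has progressDeadlineSeconds set and deployment/my-app is a placeholder name:

```bash
# Wait for the rollout to complete; if it fails or times out, roll it back.
kubectl rollout status deployment/my-app --timeout=5m \
  || kubectl rollout undo deployment/my-app
```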
So is there any plan to implement this?
/reopen
Please implement this.
@rasert: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I don't see any existing issues filed for this.
Allow the user to instruct Kubernetes, based on data from liveness/readiness probes (or failures to start new pods at all), to automatically undo a rollout, not just pause it. This could even have a configurable maxFailures knob to specify how many pods in "failed after upgrade" state to allow before rolling back. Similar for pending pods.
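Purely as an illustration of the requested shape (none of the commented fields exist in the Deployment API; maxFailures is the knob proposed above and rollbackOnFailure is an invented placeholder):

```bash
# Illustration only: progressDeadlineSeconds and the readiness probe are real;
# rollbackOnFailure and maxFailures are hypothetical fields that do NOT exist.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 5
  progressDeadlineSeconds: 300
  # rollbackOnFailure: true   # hypothetical: undo the rollout instead of just reporting failure
  # maxFailures: 3            # hypothetical: tolerate up to 3 "failed after upgrade" pods before rolling back
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: nginx:1.25
        readinessProbe:
          httpGet:
            path: /
            port: 80
EOF
```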