
Promotable canary deployments (aka auto-pause) #11505

Open
bgrant0607 opened this issue Jul 18, 2015 · 29 comments
Labels
area/app-lifecycle
area/workload-api/deployment
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

Comments

@bgrant0607
Member

Writing the new user guide made me think about this.

It's currently easy to run multiple release tracks (e.g., daily release and weekly release), by just running separate replication controllers indefinitely.

However, if someone wants to run a couple canary instances for a while and then roll out the same image to replace the full replica set once it has been sufficiently validated, we don't have direct support for that.

It is relatively easy to just kill kubectl rolling-update in the middle and resume or roll back later, but only if the rollout is sufficiently slow and one is watching it closely.

The simplest solution I can think of is to automate killing kubectl rolling-update: a --canaries=N flag, which would cause it to break out of the update loop after ramping up the new replication controller to N. The rolling update should be resumable just as if it were killed manually, to promote the canary to the current version.
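
To illustrate, a hypothetical invocation might look like this (the --canaries flag does not exist; the controller names and image are placeholders):

# Ramp the new replication controller up to 2 canaries, then stop (hypothetical flag):
kubectl rolling-update frontend-v1 frontend-v2 --image=example.com/frontend:v2 --canaries=2
# After validating the canaries, re-running the same command without the flag would
# resume the rollout and promote the canary image to the full replica count:
kubectl rolling-update frontend-v1 frontend-v2 --image=example.com/frontend:v2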

cc @kelseyhightower

@bgrant0607 bgrant0607 added help-wanted, priority/backlog, area/app-lifecycle, area/kubectl, and sig/api-machinery labels Jul 18, 2015
@bgrant0607 bgrant0607 added team/ux and removed sig/api-machinery labels Aug 4, 2015
@nikhiljindal
Contributor

/sub

cc @ironcladlou

@0xmichalis
Contributor

0xmichalis commented Feb 3, 2016

Deployments should also support canary checks.

@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update Promotable canary deployments in kubectl rolling-update (aka autoPause) Feb 6, 2016
@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update (aka autoPause) Promotable canary deployments in kubectl rolling-update (aka auto-pause) Feb 6, 2016
@bgrant0607
Member Author

It would be hard to implement predictable behavior for this in the presence of scaling, failures, etc. See the iterations on #20273. Since this would require external orchestration anyway, I'm leaning back toward just suggesting the use of multiple Deployments, one per release track.

@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update (aka auto-pause) Promotable canary deployments (aka auto-pause) Apr 28, 2016
@bgrant0607
Member Author

Actually, I think we could implement this similarly to rollback.

@0xmichalis
Contributor

@bgrant0607 meaning users would need to explicitly ask for a canary check before rolling out?

@bgrant0607
Member Author

By "similar to rollback", I meant that the autopause specification would be cleared when the Deployment was paused, similar to how the rollback spec is cleared when the pod template is rolled back. Autopause implies higher-level (and potentially manual/imperative) orchestration, anyway.

I'm also thinking that paused should be a list (or map) of reasons why the deployment is paused, similar to finalizers, initializers, taints (vs. unschedulable), etc.
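
Purely as a hypothetical sketch of that shape (these field names do not exist in the API; they are invented for illustration):

spec:
  # hypothetical: pause automatically once the new ReplicaSet reaches 2 replicas
  autoPause:
    canaryReplicas: 2
  # hypothetical: paused as a set of reasons rather than a boolean, analogous to
  # finalizers; the rollout would resume only once the list is empty
  paused:
    - autoPause
    - manual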

@therc
Member

therc commented Oct 11, 2016

I might be biased, but I think it would be desirable to have this baked in, both in Deployments and, eventually, in DaemonSets. In Borg land (where I started, oversaw, or rescued hundreds or thousands of updates), the kubectl equivalent can update X canaries and watch them for N minutes, all from the client side. If no more than M failures have occurred in the new instances, the rest of the tasks get updated; otherwise the canaries get rolled back. I don't think it's a surprise that each time someone reinvented a server-side automation framework on top of either the CLI tool or the Borg API directly (I can think of at least three such efforts, all from different parts of the organization...), canarying was kept or even expanded, e.g. with a full deployment in an alpha cluster acting as the canary before new binaries are deployed to other clusters.

Multiple deployments are nice, but they still require either a human or some other kind of automation to watch for failures and initiate rollbacks. I suspect that people will reinvent that wheel multiple times and do a less-than-ideal job of it.

@0xmichalis
Contributor

By "similar to rollback", I meant that the autopause specification would be cleared when the Deployment was paused, similar to how the rollback spec is cleared when the pod template is rolled back. Autopause implies higher-level (and potentially manual/imperative) orchestration, anyway.

@bgrant0607 right, I think this makes sense now that the StatefulSet proposal is touching on it: kubernetes/community#503

I'm also thinking that paused should be a list (or map) of reasons why the deployment is paused, similar to finalizers, initializers, taints (vs. unschedulable), etc.

Agreed on this one too

@0xmichalis
Contributor

/assign tnozicka

@bgrant0607
Member Author

I am less keen to support autopause than I once was.

A simple alternative is to just create a separate Deployment:
https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments

which can be updated, checked for success or lack of progress, and rolled back independently.
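
Roughly the pattern from that page: two Deployments that share an app label but differ in a track label, fronted by a Service that selects only the shared label so traffic spreads across both tracks (names, labels, and images below are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-stable
spec:
  replicas: 3
  selector:
    matchLabels: {app: frontend, track: stable}
  template:
    metadata:
      labels: {app: frontend, track: stable}
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-canary
spec:
  replicas: 1
  selector:
    matchLabels: {app: frontend, track: canary}
  template:
    metadata:
      labels: {app: frontend, track: canary}
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:v2
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend # no track label, so both Deployments receive traffic
  ports:
    - port: 80

Each Deployment can then be updated, checked, scaled, and rolled back on its own; promoting the canary means updating the stable Deployment to the validated image and scaling the canary back down (or deleting it).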

As we found when trying to implement proportional scaling, it's hard to distinguish scaling from rollout progress, especially when surge instances are in play.

And it would be hard to determine whether the autopause threshold had been hit. That property would have to somehow be recorded in the object, and then cleared when the spec changed again.

Also, the semantics become less clear in the presence of edge cases such as a series of updates where the threshold had not yet been hit by any single ReplicaSet. Are the partition-ordinal update semantics in StatefulSet, as described by kubernetes/community#503, well defined in the presence of such multiple subsequent updates?

I don't ever want to support an array of pod templates. A similar effect could be driven by a combination of autoscaling them independently and driving load via traffic splitting using something like Istio. However, I think the more common case would be a non-autoscaled minimally sized canary Deployment and a scalable primary production Deployment.
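
As a sketch of the traffic-splitting variant (this assumes Istio is installed and a DestinationRule already defines stable and canary subsets; host names and weights are illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
    - frontend
  http:
    - route:
        - destination:
            host: frontend
            subset: stable
          weight: 95
        - destination:
            host: frontend
            subset: canary
          weight: 5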

Many users seem to also want deployment pipelines that span namespaces or even clusters, such as when deploying to a staging instance.

StatefulSet is somewhat different in that each instance has a particular identity and there are no surge instances. It should still be possible to split them into multiple StatefulSets for more control, though that would probably be less convenient.

Where multiple DaemonSet versions are needed simultaneously, we have found that they often depend on node configuration and are coordinated with node lifecycle. This is most easily achieved with multiple DaemonSets, each with a different nodeSelector.
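
A minimal sketch of that approach, assuming nodes carry an example agent-track label (all names below are placeholders): each DaemonSet targets a disjoint set of nodes via nodeSelector.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v1
spec:
  selector:
    matchLabels: {name: node-agent, version: v1}
  template:
    metadata:
      labels: {name: node-agent, version: v1}
    spec:
      nodeSelector:
        agent-track: stable # example node label
      containers:
        - name: agent
          image: example.com/node-agent:v1
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v2
spec:
  selector:
    matchLabels: {name: node-agent, version: v2}
  template:
    metadata:
      labels: {name: node-agent, version: v2}
    spec:
      nodeSelector:
        agent-track: canary # example node label
      containers:
        - name: agent
          image: example.com/node-agent:v2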

@tonglil
Contributor

tonglil commented Dec 8, 2017

I have an idea of using probes to allow apps to self-declare their (canary) promotion/deployment continuation:

type: Deployment
spec.strategy.rollingUpdate:
  advancementProbes: # or deploymentProbes:
    - percentage: 10 # roll out 10% of the deployment and wait for a probe status
      exec:
        command:
          - /bin/true
    - percentage: 50
      http: /canary?amount=50 # apps can report satisfying a per-pod/self SLO, or fetch SLOs from aggregated statistics like Prometheus
    - full: true # 100%, completes the rollout
      http: /canary?promote=true

Apps can give a signal of how they are doing and report their own state using these "health" endpoints.

Apps would pass a probe after x minutes of metrics, or after whatever check they want, but all pods need to pass a probe in order to advance to the next probe / complete the deployment.

When a probe succeeds, the app continues to the next percentage until the "full" probe succeeds.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 8, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Apr 7, 2018
@janetkuo
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Apr 10, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 9, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Aug 8, 2018
@tnozicka
Contributor

/remove-lifecycle rotten
/lifecycle frozen
/unassign

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/rotten labels Aug 15, 2018
@discordianfish
Contributor

Maybe the scope of this could be limited to just providing an 'auto-pause', and more advanced deployment strategies could be addressed later? I don't see this making any progress otherwise.

@bgrant0607 To explain my use case: currently we auto-deploy master to production. Our repos contain the deployment manifests with thin templating to adjust the image. A second deployment would mean that rolling out a change requires two changes: one to introduce the new code and update the canary manifest, and another to update the production deployment. Now if something breaks in the canary deployment, we have already updated the code in master and would have to revert it. But we want to keep the master history clean.

So instead I'm thinking of introducing a pre-prod branch, which would be deployed to our production namespace the same way master is, but without the changes being rolled out (-> deployment paused). Then I can roll it out manually. If I'm not happy, I can just roll back the changes. If I'm happy, I can merge into master, which would be a no-op deployment. That seems like a better option to me, but it's not very safe when it relies on opportunistically running rollout pause.
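
For reference, the imperative knobs that exist today for this kind of manual promotion (deployment and file names are placeholders):

kubectl rollout pause deployment/frontend
kubectl apply -f frontend.yaml # apply pre-prod changes; nothing rolls out while paused
kubectl rollout resume deployment/frontend # promote manually once satisfied
kubectl rollout undo deployment/frontend # or roll back instead

The race here is exactly the gap: if the pause does not happen before the apply, the rollout starts immediately.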

@irvifa
Member

irvifa commented Nov 12, 2018

Will this be included on the roadmap? Currently I'm thinking about using Istio; however, Istio sometimes seems to be overly sensitive.

@smarterclayton
Contributor

The Kueue proposal suggested auto-pause / access control on unpausing Jobs so that a higher-level controller could take charge. It has similarities to this issue, which I noted in https://docs.google.com/document/d/1jFdQPlGnvjCSOrtAFxzGxEMi9z-OS0VVD1uTfSGHXts/edit?resourcekey=0-BgDvCZcpwFVaCEZj2tlfyw

Initializers tried to target some of this (on creation of a job you could assign some properties without implementing admission control), and one of the initializer use cases would be allowing workload APIs to be components of larger orchestrations and potentially take some control away from end users (autostart behavior of deployments / jobs, for instance). Deployment hooks also had some use cases where a higher level controller would pause / unpause the workload while performing other actions, and the controller might want to prevent interference.

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023
@sftim
Contributor

sftim commented Nov 6, 2024

How do we feel about formally deciding that this should be solved out of tree?

(maybe the answer is: no; either way I'd like to know)
