
Promotable canary deployments (aka auto-pause) #11505

Open
bgrant0607 opened this issue Jul 18, 2015 · 29 comments
Labels
area/app-lifecycle
area/workload-api/deployment
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

Comments

@bgrant0607
Member

Writing the new user guide made me think about this.

It's currently easy to run multiple release tracks (e.g., daily release and weekly release), by just running separate replication controllers indefinitely.

However, if someone wants to run a couple canary instances for a while and then roll out the same image to replace the full replica set once it has been sufficiently validated, we don't have direct support for that.

It is relatively easy to just kill kubectl rolling-update in the middle and resume or roll back later, but only if the rollout is sufficiently slow and one is watching it closely.

The simplest solution I can think of is to automate killing kubectl rolling-update: a --canaries=N flag, which would cause it to break out of the update loop after ramping up the new replication controller to N. The rolling update should be resumable just as if it were killed manually, to promote the canary to the current version.
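
To illustrate, a hypothetical invocation might look like this (the --canaries flag does not exist; the controller names and image are placeholders):

# Ramp the new replication controller up to 2 canaries, then stop (hypothetical flag):
kubectl rolling-update frontend-v1 frontend-v2 --image=example.com/frontend:v2 --canaries=2
# After validating the canaries, re-running the same command without the flag would
# resume the rollout and promote the canary image to the full replica count:
kubectl rolling-update frontend-v1 frontend-v2 --image=example.com/frontend:v2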

cc @kelseyhightower

@bgrant0607 bgrant0607 added help-wanted, priority/backlog, area/app-lifecycle, area/kubectl, and sig/api-machinery labels Jul 18, 2015
@bgrant0607 bgrant0607 added team/ux and removed sig/api-machinery labels Aug 4, 2015
@nikhiljindal
Contributor

/sub

cc @ironcladlou

@0xmichalis
Contributor

0xmichalis commented Feb 3, 2016

Deployments should also support canary checks.

@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update Promotable canary deployments in kubectl rolling-update (aka autoPause) Feb 6, 2016
@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update (aka autoPause) Promotable canary deployments in kubectl rolling-update (aka auto-pause) Feb 6, 2016
@bgrant0607
Member Author

It would be hard to implement predictable behavior for this in the presence of scaling, failures, etc. See the iterations on #20273. Since this would require external orchestration anyway, I'm leaning back toward just suggesting the use of multiple Deployments, one per release track.

@bgrant0607 bgrant0607 changed the title Promotable canary deployments in kubectl rolling-update (aka auto-pause) Promotable canary deployments (aka auto-pause) Apr 28, 2016
@bgrant0607
Member Author

Actually, I think we could implement this similarly to rollback.

@0xmichalis
Contributor

@bgrant0607 meaning users would need to explicitly ask for a canary check before rolling out?

@bgrant0607
Member Author

By "similar to rollback", I meant that the autopause specification would be cleared when the Deployment was paused, similar to how the rollback spec is cleared when the pod template is rolled back. Autopause implies higher-level (and potentially manual/imperative) orchestration, anyway.

I'm also thinking that paused should be a list (or map) of reasons why the deployment is paused, similar to finalizers, initializers, taints (vs. unschedulable), etc.
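
Purely as a hypothetical sketch of that shape (these field names do not exist in the API; they are invented for illustration):

spec:
  # hypothetical: pause automatically once the new ReplicaSet reaches 2 replicas
  autoPause:
    canaryReplicas: 2
  # hypothetical: paused as a set of reasons rather than a boolean, analogous to
  # finalizers; the rollout would resume only once the list is empty
  paused:
    - autoPause
    - manual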

@therc
Member

therc commented Oct 11, 2016

I might be biased, but I think it would be desirable to have this baked in, both in Deployments and, eventually, in DaemonSets. In Borg land (where I started, oversaw, or rescued hundreds or thousands of updates), the kubectl equivalent can update X canaries and watch them for N minutes, all from the client side. If no more than M failures have occurred in the new instances, the rest of the tasks get updated; otherwise the canaries get rolled back. I don't think it's a surprise that each time someone reinvented a server-side automation framework on top of either the CLI tool or the Borg API directly (I can think of at least three such efforts, all from different parts of the organization...), canarying was kept or even expanded, e.g. with a full deployment in an alpha cluster acting as the canary before new binaries are deployed to other clusters.

Multiple deployments are nice, but they still require either a human or some other kind of automation to watch for failures and initiate rollbacks. I suspect that people will reinvent that wheel multiple times and do a less-than-ideal job of it.

@0xmichalis
Contributor

By "similar to rollback", I meant that the autopause specification would be cleared when the Deployment was paused, similar to how the rollback spec is cleared when the pod template is rolled back. Autopause implies higher-level (and potentially manual/imperative) orchestration, anyway.

@bgrant0607 right, I think this makes sense now that the StatefulSet proposal is touching on it: kubernetes/community#503

I'm also thinking that paused should be a list (or map) of reasons why the deployment is paused, similar to finalizers, initializers, taints (vs. unschedulable), etc.

Agreed on this one too

@0xmichalis
Contributor

/assign tnozicka

@bgrant0607
Member Author

I am less keen to support autopause than I once was.

A simple alternative is to just create a separate Deployment:
https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments

which can be updated, checked for success or lack of progress, and rolled back independently.
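
Roughly the pattern from that page: two Deployments that share an app label but differ in a track label, fronted by a Service that selects only the shared label so traffic spreads across both tracks (names, labels, and images below are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-stable
spec:
  replicas: 3
  selector:
    matchLabels: {app: frontend, track: stable}
  template:
    metadata:
      labels: {app: frontend, track: stable}
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend-canary
spec:
  replicas: 1
  selector:
    matchLabels: {app: frontend, track: canary}
  template:
    metadata:
      labels: {app: frontend, track: canary}
    spec:
      containers:
        - name: frontend
          image: example.com/frontend:v2
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    app: frontend # no track label, so both Deployments receive traffic
  ports:
    - port: 80

Each Deployment can then be updated, checked, scaled, and rolled back on its own; promoting the canary means updating the stable Deployment to the validated image and scaling the canary back down (or deleting it).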

As we found when trying to implement proportional scaling, it's hard to distinguish scaling from rollout progress, especially when surge instances are in play.

And it would be hard to determine whether the autopause threshold had been hit. That property would have to somehow be recorded in the object, and then cleared when the spec changed again.

Also, the semantics become less clear in the presence of edge cases such as a series of updates where the threshold had not yet been hit by any single ReplicaSet. Are the partition-ordinal update semantics in StatefulSet, as described by kubernetes/community#503, well defined in the presence of such multiple subsequent updates?

I don't ever want to support an array of pod templates. A similar effect could be driven by a combination of autoscaling them independently and driving load via traffic splitting using something like Istio. However, I think the more common case would be a non-autoscaled minimally sized canary Deployment and a scalable primary production Deployment.
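
As a sketch of the traffic-splitting variant (this assumes Istio is installed and a DestinationRule already defines stable and canary subsets; host names and weights are illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
    - frontend
  http:
    - route:
        - destination:
            host: frontend
            subset: stable
          weight: 95
        - destination:
            host: frontend
            subset: canary
          weight: 5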

Many users seem to also want deployment pipelines that span namespaces or even clusters, such as when deploying to a staging instance.

StatefulSet is somewhat different in that each instance has a particular identity and there are no surge instances. It should still be possible to split them into multiple StatefulSets for more control, though that would probably be less convenient.

Where multiple DaemonSet versions are needed simultaneously, we have found that they often depend on node configuration and are coordinated with node lifecycle. This is most easily achieved with multiple DaemonSets, each with a different nodeSelector.
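
A minimal sketch of that approach, assuming nodes carry an example agent-track label (all names below are placeholders): each DaemonSet targets a disjoint set of nodes via nodeSelector.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v1
spec:
  selector:
    matchLabels: {name: node-agent, version: v1}
  template:
    metadata:
      labels: {name: node-agent, version: v1}
    spec:
      nodeSelector:
        agent-track: stable # example node label
      containers:
        - name: agent
          image: example.com/node-agent:v1
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v2
spec:
  selector:
    matchLabels: {name: node-agent, version: v2}
  template:
    metadata:
      labels: {name: node-agent, version: v2}
    spec:
      nodeSelector:
        agent-track: canary # example node label
      containers:
        - name: agent
          image: example.com/node-agent:v2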

@tonglil
Contributor

tonglil commented Dec 8, 2017

I have an idea of using probes to allow apps to self-declare their (canary) promotion/deployment continuation:

type: Deployment
spec.strategy.rollingUpdate:
  advancementProbes: # or deploymentProbes:
    - percentage: 10 # roll out 10% of the deployment and wait for a probe status
      exec:
        command:
          - /bin/true
    - percentage: 50
      http: /canary?amount=50 # apps can report satisfying a per-pod/self SLO, or fetch SLOs from aggregated statistics like Prometheus
    - full: true # 100%, completes the rollout
      http: /canary?promote=true

Apps can give a signal of how they are doing and report their own state using these "health" endpoints.

Apps would pass a probe after x minutes of metrics, or after whatever check they want, but all pods need to pass a probe in order to advance to the next probe / complete the deployment.

When a probe succeeds, the app continues to the next percentage until the "full" probe succeeds.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Mar 8, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Apr 7, 2018
@janetkuo
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten label Apr 10, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jul 9, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Aug 8, 2018
@tnozicka
Contributor

/remove-lifecycle rotten
/lifecycle frozen
/unassign

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen and removed lifecycle/rotten labels Aug 15, 2018
@discordianfish
Contributor

Maybe the scope of this could be limited to just providing an 'auto-pause', and more advanced deployment strategies could be addressed later? I don't see this making any progress otherwise.

@bgrant0607 To explain my use case: currently we auto-deploy master to production. Our repos contain the deployment manifests with thin templating to adjust the image. A second deployment would mean that rolling out a change requires two changes: one to introduce the new code and update the canary manifest, and another to update the production deployment. Now if something breaks in the canary deployment, we have already updated the code in master and would have to revert it. But we want to keep the master history clean.

So instead I'm thinking of introducing a pre-prod branch, which would be deployed to our production namespace the same way master is, but without the changes being rolled out (-> deployment paused). Then I can roll it out manually. If I'm not happy, I can just roll back the changes. If I'm happy, I can merge into master, which would be a no-op deployment. That seems like a better option to me, but it's not very safe when it relies on opportunistically running rollout pause.
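
For reference, the imperative knobs that exist today for this kind of manual promotion (deployment and file names are placeholders):

kubectl rollout pause deployment/frontend
kubectl apply -f frontend.yaml # apply pre-prod changes; nothing rolls out while paused
kubectl rollout resume deployment/frontend # promote manually once satisfied
kubectl rollout undo deployment/frontend # or roll back instead

The race here is exactly the gap: if the pause does not happen before the apply, the rollout starts immediately.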

@irvifa
Member

irvifa commented Nov 12, 2018

Will this be included on the roadmap? Currently I'm thinking about using Istio; however, Istio sometimes seems to be overly sensitive.

@smarterclayton
Contributor

The Kueue proposal suggested auto-pause / access control on unpausing Jobs so that a higher-level controller could take charge. It has similarities to this issue, which I noted in https://docs.google.com/document/d/1jFdQPlGnvjCSOrtAFxzGxEMi9z-OS0VVD1uTfSGHXts/edit?resourcekey=0-BgDvCZcpwFVaCEZj2tlfyw

Initializers tried to target some of this (on creation of a job you could assign some properties without implementing admission control), and one of the initializer use cases would be allowing workload APIs to be components of larger orchestrations and potentially take some control away from end users (autostart behavior of deployments / jobs, for instance). Deployment hooks also had some use cases where a higher level controller would pause / unpause the workload while performing other actions, and the controller might want to prevent interference.

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023
@sftim
Contributor

sftim commented Nov 6, 2024

How do we feel about formally deciding that this should be solved out of tree?

(maybe the answer is: no; either way I'd like to know)
