Promotable canary deployments (aka auto-pause) #11505
/sub cc @ironcladlou
Deployments should also support canary checks.
It will be hard to implement predictable behavior for this in the presence of scaling, failures, etc. See the iterations on #20273. Since this would require external orchestration anyway, I'm leaning back toward just suggesting the use of multiple Deployments, one per release track (see the sketch below).
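For illustration, a minimal sketch of the multiple-Deployments approach. The app name, `track` labels, images, and replica counts are all assumed for the example; a Service that selects only on the shared `app` label splits traffic roughly in proportion to the replica counts of the two tracks.

```yaml
# Sketch only: names, labels, images, and replica counts are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
      track: stable
  template:
    metadata:
      labels:
        app: myapp
        track: stable
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.4.0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.5.0-rc1
---
# Selecting only on `app` sends traffic to both tracks, roughly in
# proportion to their replica counts.
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 80
    targetPort: 8080
```

Promoting the canary is then just updating the stable Deployment's image (and scaling down or deleting the canary Deployment), with no changes to the Deployment API.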
Actually, I think we could implement this similarly to rollback.
@bgrant0607 meaning users would need to explicitly ask for a canary check before rolling out?
By "similar to rollback", I meant that the autopause specification would be cleared when the Deployment was paused, similar to how the rollback spec is cleared when the pod template is rolled back. Autopause implies higher-level (and potentially manual/imperative) orchestration, anyway. I'm also thinking that |
I might be biased, but I think it would be desirable to have this baked in, both in Deployments and, eventually, in DaemonSets. In Borg land (where I started, oversaw, or rescued hundreds or thousands of updates), the kubectl equivalent can update X canaries and watch them for N minutes, all from the client side. If no more than M failures have occurred among the new instances, the rest of the tasks get updated; otherwise the canaries get rolled back. I don't think it's a surprise that each time someone reinvented a server-side automation framework on top of either the CLI tool or the Borg API directly (I can think of at least three such efforts, all from different parts of the organization...), canarying was kept or even expanded, e.g. with a full deployment in an alpha cluster serving as the canary before new binaries are deployed to other clusters.
@bgrant0607 right, I think this makes sense now that the StatefulSet proposal is touching on it: kubernetes/community#503
Agreed on this one too.
/assign tnozicka
I am less keen to support autopause than I once was. A simple alternative is to just create a separate Deployment, which can be updated, checked for success or lack of progress, and rolled back independently.

As we found when trying to implement proportional scaling, it's hard to distinguish scaling from rollout progress, especially when surge instances are in play. It would also be hard to determine whether the autopause threshold had been hit: that property would have to be recorded in the object somehow, and then cleared when the spec changed again. Also, the semantics become less clear in the presence of edge cases, such as a series of updates where the threshold had not yet been hit by any single ReplicaSet. Are the partition-ordinal update semantics in StatefulSet described by kubernetes/community#503 well defined in the presence of such multiple subsequent updates?

I don't ever want to support an array of pod templates. A similar effect could be achieved by autoscaling separate Deployments independently and driving load via traffic splitting, using something like Istio. However, I think the more common case would be a non-autoscaled, minimally sized canary Deployment and a scalable primary production Deployment. Many users also seem to want deployment pipelines that span namespaces or even clusters, such as when deploying to a staging instance.

StatefulSet is somewhat different in that each instance has a particular identity and there are no surge instances. It should still be possible to split the instances into multiple StatefulSets for more control, though that is probably less convenient.

Where multiple DaemonSet versions are needed simultaneously, we have found that they often depend on node configurations and are coordinated with node lifecycle. This is most easily achieved with multiple DaemonSets, each with a different nodeSelector (sketched below).
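As a rough illustration of that last point, under assumed names: the `node-config` label key/values and the images are made up for the example. Each DaemonSet is pinned to one node configuration via `nodeSelector`, and nodes move between versions by being relabeled.

```yaml
# Sketch only: the node-config label key/values and images are illustrative.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v1
spec:
  selector:
    matchLabels:
      name: node-agent
      config: v1
  template:
    metadata:
      labels:
        name: node-agent
        config: v1
    spec:
      # Runs only on nodes labeled with the old configuration.
      nodeSelector:
        node-config: v1
      containers:
      - name: agent
        image: registry.example.com/node-agent:1.0
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-v2
spec:
  selector:
    matchLabels:
      name: node-agent
      config: v2
  template:
    metadata:
      labels:
        name: node-agent
        config: v2
    spec:
      # Runs only on nodes labeled with the new configuration.
      nodeSelector:
        node-config: v2
      containers:
      - name: agent
        image: registry.example.com/node-agent:2.0
```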
I have an idea of using probes to allow apps to self-declare their (canary) promotion/deployment continuation:
Apps can signal how they are doing through these "health" endpoints, each exposing its own shape of check. An app might only pass a probe after x minutes of good metrics, or run whatever check it wants, but all pods would need to pass a probe to advance to the next probe / complete the deployment. When a probe succeeds, the rollout continues to the next percentage, until the "full" probe succeeds (rough sketch below).
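There is no built-in promotion probe today; the closest approximation I can sketch uses an ordinary readinessProbe plus minReadySeconds, which gates rollout progress on each new pod. The `/canary-health` path and the timings are assumptions, not an existing convention.

```yaml
# Sketch only: the /canary-health path and the timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  # A new pod must stay Ready for 5 minutes before it counts as available,
  # so the rolling update only advances once new pods have held their
  # "healthy" signal for that long.
  minReadySeconds: 300
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.5.0-rc1
        readinessProbe:
          httpGet:
            path: /canary-health   # hypothetical app-defined check
            port: 8080
          periodSeconds: 30
          failureThreshold: 3
```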
Maybe the scope of this could be limited to just providing an 'auto-pause', and more advanced deployment strategies could be addressed later? I don't see this making any progress otherwise. @bgrant0607

To explain my use case: currently we auto-deploy master to production. Our repos contain the deployment manifests, with thin templating to adjust the image. A second deployment would mean that rolling out a change requires two changes: one to introduce the new code and update the canary manifest, and another to update the production deployment. If something breaks in the canary deployment, we have already updated the code in master and would have to revert it, but we want to keep the master history clean. So instead I'm thinking of introducing a pre-prod branch that would be deployed to our production namespace the same way master is, but whose changes would not be rolled out (i.e. the deployment stays paused; sketch below). Then I can roll it out manually. If I'm not happy, I can just roll back the changes; if I'm happy, I merge into master, which results in a no-op deployment. Seems like a better option to me, but not very safe with opportunistic running
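A minimal sketch of the "ship paused, promote manually" flow described above, assuming illustrative names and an illustrative image tag; `spec.paused` is the real Deployment field that prevents template changes from being rolled out.

```yaml
# Sketch only: names and image are illustrative; spec.paused is a real field.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  # While paused, changes to the pod template are recorded but not rolled out.
  paused: true
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:pre-prod   # updated by the pre-prod pipeline
```

Promotion is then `kubectl rollout resume deployment/myapp` (or setting `paused: false` in the manifest), and backing out before resuming is just reverting the template change.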
Will this be included on the roadmap? Currently I'm thinking about using Istio; however, sometimes Istio seems to be
The Kueue proposal suggested auto-pause / access control on unpausing Jobs so that a higher-level controller could take charge; it has similarities to this issue, which I noted in https://docs.google.com/document/d/1jFdQPlGnvjCSOrtAFxzGxEMi9z-OS0VVD1uTfSGHXts/edit?resourcekey=0-BgDvCZcpwFVaCEZj2tlfyw (a suspended-Job sketch follows below). Initializers tried to target some of this (on creation of a Job you could assign some properties without implementing admission control), and one of the initializer use cases was allowing workload APIs to be components of larger orchestrations and potentially take some control away from end users (the autostart behavior of Deployments / Jobs, for instance). Deployment hooks also had use cases where a higher-level controller would pause / unpause the workload while performing other actions, and the controller might want to prevent interference.
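For reference, a minimal sketch of the Job-level analogue using the real `spec.suspend` field (the Job name, sizing, and image are illustrative): the Job is created suspended, and a higher-level controller decides when to let it start by flipping the field.

```yaml
# Sketch only: name, sizing, and image are illustrative;
# spec.suspend is a real batch/v1 Job field.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-training-job
spec:
  # No pods are created until a controller (or a user) flips this to false.
  suspend: true
  parallelism: 4
  completions: 4
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest
```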
How do we feel about formally deciding that this should be solved out of tree? (Maybe the answer is no; either way I'd like to know.)
Writing the new user guide made me think about this.
It's currently easy to run multiple release tracks (e.g., daily release and weekly release), by just running separate replication controllers indefinitely.
However, if someone wants to run a couple canary instances for a while and then roll out the same image to replace the full replica set once it has been sufficiently validated, we don't have direct support for that.
It is relatively easy to just kill kubectl rolling-update in the middle and resume or roll back later, but only if the rate of the rollout is sufficiently slow and one is watching it closely.
The simplest solution I can think of is to automate killing kubectl rolling-update: a --canaries=N flag, which would cause it to break out of the update loop after ramping the new replication controller up to N replicas. The rolling update should be resumable just as if it had been killed manually, in order to promote the canary to the current version (sketched below).
cc @kelseyhightower
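A sketch of what the proposed usage might look like; the `--canaries` flag is the proposal of this issue, not an existing option, and `kubectl rolling-update` itself has since been superseded by Deployments. The controller name and image are illustrative.

```sh
# Hypothetical: --canaries is the flag proposed above, not an existing option.
# Ramp the new replication controller up to 2 replicas, then stop and wait.
kubectl rolling-update myapp --image=registry.example.com/myapp:canary --canaries=2

# After validating the canaries, re-run the same rolling update (rolling-update
# was designed to be resumable if interrupted) to promote the new image to the
# full replica count.
kubectl rolling-update myapp --image=registry.example.com/myapp:canary
```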