Proportionally scale paused and rolling deployments #20273
Conversation
// modified since. Such is life.
	return nil
}
_, err = dc.scaleRCAndRecordEvent(newRC, deployment.Spec.Replicas, deployment)
We don't want to scale the new RC to the full replica count of the Deployment. That would radically change the pace of the rolling update and violate maxSurge.
The least impactful behavior with respect to update progress would be to proportionally scale all RCs with non-zero replicas, rounding to whole replicas.
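A minimal sketch of that idea, with made-up types and the simplest possible rounding (not the controller's actual code):

```go
package main

import "fmt"

// rcState is a minimal stand-in for a ReplicationController's scale.
type rcState struct {
	Name     string
	Replicas int
}

// proportionalScale sketches "scale all RCs with non-zero replicas
// proportionally": each RC keeps its share of the old total, rounded down,
// and the rounding leftovers are handed out one replica at a time.
func proportionalScale(rcs []rcState, newTotal int) []rcState {
	oldTotal := 0
	for _, rc := range rcs {
		oldTotal += rc.Replicas
	}
	out := make([]rcState, len(rcs))
	copy(out, rcs)
	if oldTotal == 0 {
		return out
	}
	allocated := 0
	for i, rc := range rcs {
		if rc.Replicas == 0 {
			continue // fully scaled-down RCs stay at zero
		}
		out[i].Replicas = rc.Replicas * newTotal / oldTotal
		allocated += out[i].Replicas
	}
	// Distribute rounding leftovers across RCs that still own replicas.
	for i := range out {
		if allocated >= newTotal {
			break
		}
		if rcs[i].Replicas > 0 {
			out[i].Replicas++
			allocated++
		}
	}
	return out
}

func main() {
	// Example: deployment scaled from 10 to 12 while two RCs own 5 pods each.
	fmt.Println(proportionalScale([]rcState{{"old", 5}, {"new", 5}}, 12))
}
```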
I don't think the un-paused scaling behavior is what we want, either. If the new scale is greater than the scale of the newest RC, then it scales down just the newest RC. However, that doesn't take into account that older RCs may still have replicas. Scaling proportionally makes sense to me while un-paused, also.
It also doesn't take into account which pods are ready or not ready, but that's another issue.
If the new scale is greater than the scale of the newest RC, then it scales down just the newest RC.
You mean lesser, right?
Yes, lesser. :-)
fat fingers
We don't want to scale the new RC to the full replica count of the Deployment. That would radically change the pace of the rolling update and violate maxSurge.
I was thinking about this today and I believe maxSurge makes sense only for a progressing deployment, right? If a deployment is paused but not in the middle of a rollout (i.e., its latest RS owns all of the replicas), then we should scale to the full count. If a deployment is paused in the middle of a rollout, we should respect maxSurge/maxUnavailable and scale down the older RCs while scaling up the new one. Also, this is specific to rolling deployments.
Why might a user pause a rollout? They might want to treat new instances as a canary to observe how they behave under real load. Or they might want to run tests in prod. Or they may have observed a problem that they want to investigate.
In any of these cases, I'd imagine they'd want to freeze the rollout in whatever state it was in.
And, whether paused or not, I'd imagine scaling (e.g. by HPA) would change the current state proportionally.
Which makes me think that we should implement canary checks. Also regarding test deployments see openshift/origin#6930.
I don't think the un-paused scaling behavior is what we want, either. If the new scale is lesser than the scale of the newest RC, then it scales down just the newest RC. However, that doesn't take into account that older RCs may still have replicas.
Then we are susceptible to violating maxUnavailable, right? It seems to me that we want exactly the behavior of unpaused, minus the generation of a new RC in case of pod template changes.
@bgrant0607 updated to use the normal path when there is already a new rc for a paused deployment. The intent of a paused deployment is to make changes in the pod template without having the controller automatically reconcile those new changes. I now believe it makes sense to provide everything else we already have: proportional scaling in both directions to ensure we won't violate maxSurge/maxUnavailable, and the cleanup policy if defined.
@Kargakis I'll take a look, but I would not expect a rollout to advance (e.g., scale all old RCs down to 0 while the Deployment replicas count were unchanged) if the Deployment were paused.
@bgrant0607 we can check that d.status.replicas (current pods from all RCs) equals d.spec.replicas. If not, we can interpret this as a change in the scale (most probably it will be that - otherwise it's just the difference created by maxSurge/maxUnavailable) and allow the deployment to proceed.
@Kargakis We should take surge into account: compare spec.replicas with (status.replicas - surge amount), assuming that a rollout is underway. If no rollout is underway, then there would be no surge amount.
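A rough sketch of that check; all names here are illustrative and the surge amount is assumed to have already been resolved to an absolute number:

```go
// scalingEvent compares spec.replicas against status.replicas minus the
// surge amount, where a surge only exists while a rollout is underway.
func scalingEvent(specReplicas, statusReplicas, maxSurge int, rolloutUnderway bool) bool {
	surge := 0
	if rolloutUnderway {
		surge = maxSurge
	}
	// With no rollout underway this reduces to a plain spec vs. status check.
	return specReplicas != statusReplicas-surge
}
```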
When paused, I don't see much harm in creating new RCs with 0 replicas, so that they appear in the revision history. OTOH, the revision history is about the only reason to. Note also that an RC might already exist that matches the updated pod template, anyway. If there's only one RC with non-zero replicas, scaling is easy. Otherwise, it's not so easy. Scenario axes when a scaling (up or down) event occurs:
Unless we record the number of surge pods in status or in annotations on the RCs (#14062 (comment)), it's not obvious to me that we could figure out the previous desired number of replicas, which I'd like to do in order to compute the scaling factor. If we detect a scaling event, I'd like to scale all the RCs with non-zero replicas proportionally, independent of the normal reconciliation that enables rollout progress.
One advantage of annotations on the RCs is that they can be written atomically with RC scale changes.
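For illustration only, recording the deployment's replica count on the RC in the same write that changes its scale might look roughly like this (the types, client interface, and annotation key are all assumed, not the actual implementation):

```go
import "strconv"

// replicationController and rcClient are minimal stand-ins; the point is only
// that the annotation and the new scale land in the same Update call.
type replicationController struct {
	Annotations map[string]string
	Replicas    int
}

type rcClient interface {
	Update(rc *replicationController) error
}

// The annotation key is illustrative.
const desiredReplicasAnnotation = "deployment.kubernetes.io/desired-replicas"

func scaleAndRecord(c rcClient, rc *replicationController, newSize, deploymentReplicas int) error {
	if rc.Annotations == nil {
		rc.Annotations = map[string]string{}
	}
	rc.Annotations[desiredReplicasAnnotation] = strconv.Itoa(deploymentReplicas)
	rc.Replicas = newSize
	return c.Update(rc) // one write: the scale and the record stay consistent
}
```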
The proposal to scale proportionally was to mitigate risk. Otherwise, scaling up would increase the size of the newRC and scaling down would decrease the sizes of the old ones, both of which would have the effect of hastening the rollout progress, which could produce a higher proportion of unavailable replicas in the event of a problem with the rolled out template. An alternative would be to scale up the RC with the largest number of available replicas and scale down the one(s) with the most unavailable replicas. However, in order to be obviously better than proportional scaling, we'd have to be able to do it without actually figuring out whether we were scaling up or down. The lack of multi-resource transactions certainly makes this harder.
We should be able to tell whether there was a scaling event. If there is one RC with non-zero replicas, just compare its replicas count with the Deployment's. If there are multiple RCs with non-zero replicas, compare the total pod count with Deployment's plus maxSurge and newRC's replica count with Deployment's replica count (since newRC's shouldn't exceed Deployment's). Concrete example, similar to the one I posted to #20368:
Deployment is updated, newRC is created with 3 replicas, oldRC is scaled down to 8, and newRC is scaled up to 5. 3 newRC replicas become available, so oldRC is scaled down to 5 and newRC is scaled up to 8. Deployment is updated again, a newnewRC is created with 0 replicas (allPodsCount is already 13). Let's say 5 of the 8 replicas of (the old) newRC are available, so 3 will be scaled down and then newnewRC will be scaled up to 3: 5, 5, and 3. What if we scale the Deployment up to 12 at this point? Well, there are multiple RCs with non-zero replicas, so we could have up to 12 + maxSurge replicas, 15 in this case. But there are only 13 pods. Where should the 2 new pods go? The current policy would always add them to newRC:
And shrinking the Deployment replicas count shrinks the newRC if it exceeds the new replicas count, and otherwise just shrinks the oldRCs. Could we do proportional distribution? Sort the RCs in descending order by number of available pods: 5, 5, 0. Compute the distribution weighted by proportion of available pods, round off, distribute leftovers evenly across the RCs in order, and cap at the number of pods we want to add. 5/10*2 = 1, so +1, +1, +0. Shrinking down should work the same way. However, one catch is that we currently create the newRC with 0 replicas, so a pod template update would always look like a scaling event. We could fix that by setting the replicas count at creation time, though, to maxTotalPods - currentPodCount.
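Sketched as code, under the assumption that availability counts are already known (the helper name and shapes are made up; shrinking would mirror this with the delta handled in the opposite direction):

```go
import "sort"

// distributeProportionally weights the pods to add by each RC's share of
// available pods, rounds down, then hands out the rounding leftovers in
// descending order of availability, capped at the number of pods to add.
func distributeProportionally(available []int, podsToAdd int) []int {
	idx := make([]int, len(available))
	total := 0
	for i, a := range available {
		idx[i] = i
		total += a
	}
	// Sort indices descending by available pods (e.g. 5, 5, 0).
	sort.Slice(idx, func(a, b int) bool { return available[idx[a]] > available[idx[b]] })

	add := make([]int, len(available))
	if total == 0 || podsToAdd <= 0 {
		return add
	}
	allocated := 0
	for _, i := range idx {
		add[i] = available[i] * podsToAdd / total // 5/10*2 = 1, so +1, +1, +0
		allocated += add[i]
	}
	// Distribute any leftovers evenly across the RCs in order, up to the cap.
	for _, i := range idx {
		if allocated >= podsToAdd {
			break
		}
		add[i]++
		allocated++
	}
	return add
}
```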
Note that if the controller failed in the middle of effecting the proportional changes it could skew the proportions. I think that's acceptable, especially if we use available pod counts, since newly created pods wouldn't become available immediately when scaling up, and RCs that had already been scaled down would be scaled proportionally less after recovery. That would still be more graceful than always scaling the newest RC.
@bgrant0607 thanks for the thorough analysis. I have pushed changes that implement proportional scaling for paused deployments. Please have a look. Note that tomorrow I'll be out for DevConf here in Brno and I am not sure I can come back to this first thing on Monday. If needed, @janetkuo can pick it up.
@Kargakis Thanks. I'll take a look. Scaling proportionally while paused would definitely be an improvement over not scaling while paused. However, I would like unpaused deployments to be scaled proportionally, also.
if err != nil {
	return fmt.Errorf("invalid value for MaxSurge: %v", err)
}
if isPercent {
I've seen this code other places. We need a GetIntValueFromIntOrPercent() helper.
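Roughly what such a helper could look like; this sketch works on a plain string rather than the real IntOrString util type, so the signature here is an assumption:

```go
import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// getIntValueFromIntOrPercent resolves a value that is either an absolute
// integer or a percentage ("25%") against a total, rounding up or down.
func getIntValueFromIntOrPercent(v string, total int, roundUp bool) (int, error) {
	if strings.HasSuffix(v, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(v, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %v", v, err)
		}
		if roundUp {
			return int(math.Ceil(float64(pct) * float64(total) / 100.0)), nil
		}
		return pct * total / 100, nil
	}
	return strconv.Atoi(v)
}
```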
We can update this once #21044 merges
My pleasure
allErrs = append(allErrs, apivalidation.ValidateNonnegativeField(int64(intOrPercent.IntValue()), fldPath)...)
default:
	allErrs = append(allErrs, field.Invalid(fldPath, intOrPercent, "must be an integer or percentage (e.g '5%')"))
You need to escape % with another one. So this needs to look like this: 5%%.
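For example, with any printf-style formatter:

```go
package main

import "fmt"

func main() {
	// "%%" is how a printf-style format string spells a literal '%'.
	fmt.Printf("must be an integer or percentage (e.g '5%%')\n")
	// Output: must be an integer or percentage (e.g '5%')
}
```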
correct. will update
Enable paused and rolling deployments to be proportionally scaled. Also have cleanup policy work for paused deployments.
GCE e2e build/test passed for commit f3d2e3f.
@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]
GCE e2e build/test passed for commit f3d2e3f.
Automatic merge from submit-queue
Proportional scaling of deployments was introduced in v1.4 (kubernetes#20273), which caused this 1.3 test to fail on 1.4 clusters. This is cherrypicked into 1.3 merely for fixing test failures on kubernetes-e2e-gke-1.3-1.4-upgrade-cluster
Proportional scaling of deployments was introduced in v1.4 (kubernetes#20273), which caused this 1.2 test to fail on 1.4 clusters. This is cherrypicked into 1.2 merely for fixing test failures on kubernetes-e2e-gke-1.2-1.4-upgrade-cluster
Automatic merge from submit-queue
Fix 1.3 paused deployment test failure against >1.4 cluster
**What this PR does / why we need it**: Fixes the upgrade test kubernetes-e2e-gke-1.3-1.4-upgrade-cluster. Proportional scaling of deployments was introduced in v1.4 (#20273), which caused this 1.3 test to fail on 1.4 clusters. This is cherrypicked into 1.3 merely for fixing test failures on kubernetes-e2e-gke-1.3-1.4-upgrade-cluster.
**Which issue this PR fixes**: fixes #32710
@Kargakis @pwittrock this PR has a release-note-action-required label, but I can't actually tell what the action is; CHANGELOG.md only lists the PR title. Can you clarify what the action required is?
I don't think there is any action required before upgrading, but this is definitely a notable change to existing behavior. Maybe there should be a section for that.
Enable paused and rolling deployments to be proportionally scaled.
Also have cleanup policy work for paused deployments.
Fixes #20853
Fixes #20966
Fixes #20754
@bgrant0607 @janetkuo @ironcladlou @nikhiljindal