
Ability to enable/disable replication controller #37086

Closed
MarkRx opened this issue Nov 18, 2016 · 21 comments
Labels
area/workload-api/replicaset lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@MarkRx

MarkRx commented Nov 18, 2016

At present there isn't any way to temporarily disable a replication controller, so attempting to manage pod counts outside of a replication controller can be infeasible. Adding the ability to enable/disable a replication controller would pause/unpause its reconciliation of the replica count. While paused, pods could be manually created, deleted, or moved around without interference from the replication controller.

Examples:

  • Manually creating pods and adding them to a replication controller by having the pod use labels that will get picked up by the replication controller
  • Taking existing pods and adding them to a replication controller by changing the pod labels
  • Moving pods between replication controllers
  • Any case in which a pod might be shared by two replication controllers
  • Having pods available in a "standby / hot" mode that can be added to an RC by changing labels, without the RC accidentally scaling them down

In all of the above cases there is currently contention between the replication controller trying to maintain a set number of pods and the user performing manual work. If a pod is added to a replication controller, the controller will see that there are too many pods and scale one down. If a pod is removed from a replication controller, the controller will see that there are not enough pods and create one. In both cases the replica count could be modified to reflect the true desired number of pods, but since removing a pod and changing the count cannot be done atomically, there is a catch-22.

With the ability to temporarily disable replication controllers, pods could be reshuffled while the controllers are disabled, and the controllers could be updated to reflect the new state before being re-enabled.
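
A minimal sketch of what such a toggle might look like from the client side, assuming a hypothetical `spec.paused` field on the ReplicationController (no such field exists today; the real `spec.paused` on Deployments only pauses rollouts), using the official Python client:

```python
# Hypothetical only: ReplicationController has no spec.paused field today.
# Sketch of how a pause/unpause toggle could be driven from the Python client.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()


def set_rc_paused(name: str, namespace: str, paused: bool) -> None:
    """Patch a hypothetical spec.paused flag on a ReplicationController."""
    v1.patch_namespaced_replication_controller(
        name, namespace, {"spec": {"paused": paused}}
    )


# set_rc_paused("my-rc", "default", True)   # pause: reshuffle pods freely
# ... relabel / move / delete pods, set spec.replicas to the new truth ...
# set_rc_paused("my-rc", "default", False)  # resume reconciliation
```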

@0xmichalis
Contributor

@kubernetes/sig-apps @kubernetes/deployment Pausing a Deployment foo does not stop the deployment controller from managing replicas for foo. Not sure if we can have pausing for a ReplicationController foo-1 that would make the RC manager stop managing its replicas.

@smarterclayton
Contributor

Manually creating pods and adding them to a replication controller by having the pod use labels that will get picked up by the replication controller

I'm not sure this is a use case that we intended to support, although I understand the desire to do it in some cases. Can you describe in more detail the underlying use case that matters to you?

@MarkRx
Author

MarkRx commented Nov 18, 2016

This relates to my thread here: openshift/origin#11954.

The setup is as follows using a blue/green deployment strategy with an additional abstraction of stage/active.

Before (dc = deployment config / deployment):

stg rte -> stg svc -> null
act rte -> act svc -> dc X

Stage Y

stg rte -> stg svc -> dc Y (give the rc template a label of status: stage)
act rte -> act svc -> dc X

Promote Y (delete dc X)

stg rte -> stg svc -> null
act rte -> act svc -> dc Y (give the rc template a label of status: active)

At the "promote Y" stage the pods created by dc Y will still have the label of status:stage. We were investigating how we could update those labels on both the RC and the pre-existing pods without creating new pods or fighting with the RC autoscaler. This doesn't seem possible at the moment.

One of the ideas we had around this was to perform promotion at the pod level instead of carrying a dc through the stage -> active promotion.
Before:

stg rte -> stg svc -> stg dc
act rte -> act svc -> act dc

Stage Y

stg rte -> stg svc -> stg dc (update dc and rc template)
act rte -> act svc -> act dc

Promote stg -> act in the following steps:

  • disable stg dc (in particular the replication controller) and update it
  • disable act dc (in particular the replication controller) and update it
  • get a list L1 of pods from stg dc, L2 of pods from act dc
  • for each pod in L1, re-parent it to be owned by "act dc"
  • for each pod in L2, re-parent it to be owned by "stg dc"
  • (optional) scale down stg dc to 0
  • enable stg dc
  • enable act dc

A "PodSwitch" strategy could be written as a custom deployment strategy (openshift concept but I think it has trickled back into kubernetes with the deployment object). It would have an input of a source dc / rc:

  1. disable the source and target dc / rc
  2. swap pods by changing their labels
  3. (optional) scale down the source dc to 0 thereby deleting the old active pods
  4. re-enable the source and target dc / rc
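
A rough sketch of those steps with the official Python client, assuming the hypothetical pause toggle from the earlier sketch and illustrative RC names (`stg-rc`, `act-rc`) whose selectors use the `status` label:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "default"  # assumed namespace; RC names and labels are illustrative


def set_paused(rc: str, paused: bool) -> None:
    # Hypothetical: RCs have no spec.paused field today (see sketch above).
    v1.patch_namespaced_replication_controller(rc, ns, {"spec": {"paused": paused}})


def pods_with(selector: str):
    return v1.list_namespaced_pod(ns, label_selector=selector).items


def relabel(pods, value: str) -> None:
    for p in pods:
        v1.patch_namespaced_pod(
            p.metadata.name, ns, {"metadata": {"labels": {"status": value}}}
        )


# 1. disable both controllers so neither fights the relabeling
set_paused("stg-rc", True)
set_paused("act-rc", True)

# 2. swap pods by changing their labels
stage_pods, active_pods = pods_with("status=stage"), pods_with("status=active")
relabel(stage_pods, "active")
relabel(active_pods, "stage")

# 3. (optional) scale the source RC down to 0, deleting the old active pods
v1.patch_namespaced_replication_controller("stg-rc", ns, {"spec": {"replicas": 0}})

# 4. re-enable both controllers
set_paused("stg-rc", False)
set_paused("act-rc", False)
```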

Another use case: Having pods in "standby mode". This can be helpful for quick scaling when pod startup is slow.

Example:

  1. RC controls 3 pods A, B, and C. Pods D and E are in standby mode - they exist and are ready to perform work but just are not being selected by a service.
  2. Load increases; add pods D and E to the mix and have them be controlled by the RC by changing their labels to match what the RC and service are expecting

Problems happen with step 2: the RC replica count needs to change at the exact same time as the RC gains more pods. If the replica count changes too soon, the RC may try to spin up spurious pods; if it changes too late, it may try to spin down pods that were just added to the RC.
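
As it stands, step 2 has to be two separate API calls, which is exactly where the race lives. A sketch with the official Python client (pod, label, RC names, and counts are illustrative assumptions):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "default"  # assumed namespace

# Bring standby pods D and E under the RC by giving them the selected label.
for name in ("pod-d", "pod-e"):
    v1.patch_namespaced_pod(name, ns, {"metadata": {"labels": {"app": "web"}}})

# ... race window: the RC (selector app=web, replicas=3) now sees 5 matching
# pods and may delete two of them before the scale-up below is observed ...

# Raise the replica count to match the pods that were just adopted.
v1.patch_namespaced_replication_controller("web-rc", ns, {"spec": {"replicas": 5}})
```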

@smarterclayton
Contributor

Independent of the merits of pause on the RC, I did want to mention the comment from the other thread that, in general, doing a blue/green swap is about minimizing the number of changes in either direction. Ideally a single atomic operation can do the flip. Re-labelling of pods is not atomic like that, so I would generally suggest for this specific case that we try to find a way to make service / ingress / route + deployments behave the way we want, where only a single change is necessary to swap.


@bgrant0607 bgrant0607 added area/workload-api/replicaset sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Nov 22, 2016
@bgrant0607
Member

One could always delete the RCs and then re-create them.

I think the most compelling case is scaling down and choosing a victim, which some other systems (e.g., Marathon) support. For that, a simple grace period before choosing a victim would suffice. We've discussed that previously in a number of issues.
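
A sketch of the delete-and-recreate approach with the official Python client, orphaning the pods so they keep running while no controller manages them (names are illustrative); as the next comment points out, a failure between the delete and the re-create leaves you with no RC at all:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
ns = "default"  # assumed namespace and RC name

# Save the RC, then delete it while orphaning its pods so they keep running.
rc = v1.read_namespaced_replication_controller("my-rc", ns)
v1.delete_namespaced_replication_controller(
    "my-rc", ns, body=client.V1DeleteOptions(propagation_policy="Orphan")
)

# ... reshuffle labels / pods freely; nothing is reconciling them now ...

# Re-create the RC from the saved object (strip server-populated fields first).
rc.metadata.resource_version = None
rc.metadata.uid = None
rc.status = None
v1.create_namespaced_replication_controller(ns, rc)
```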

@MarkRx
Author

MarkRx commented Nov 23, 2016

Deleting the RCs could leave you in a bad state should a failure occur. I had considered it, but it also means that if something died along the way I could be left without an RC, which could make a mess.

@0xmichalis
Contributor

0xmichalis commented Nov 23, 2016

Any case in which a pod might be shared by two replication controllers

I am interested in what cases you have for this kind of setup. We have a lot of issues with overlapping controllers, and I am not sure we want to allow ways of making this any more complex than it already is. Managing labels in general is hard for average users. Wouldn't #36897 be helpful for you?

@MarkRx
Author

MarkRx commented Nov 28, 2016

@Kargakis yes #36897 would be helpful.

The case I was thinking of was a rolling strategy that reused old pods when the content of the pods did not change. Hence, instead of new pods and old pods being spun up / down one by one, they would be transferred over one by one, and then any additional scaling that is needed would be performed. At present my desire for this is to work around not being able to change RC selectors and pod labels atomically (openshift/origin#11954, #36897).

@0xmichalis
Contributor

@MarkRx #9043 is related

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 23, 2017
@MarkRx
Author

MarkRx commented Dec 26, 2017

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 25, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@MarkRx: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Jun 11, 2019
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 11, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 9, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 9, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 8, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 7, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
