
Rolling-update by node #18450

Closed
titilambert opened this issue Dec 9, 2015 · 23 comments
Labels
area/app-lifecycle area/kubectl priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@titilambert

Hello,
I would like to do a rolling update node by node:

  1. Select one node
  2. Stop all pods of the RC on this node.
  3. Start new pods on this node
  4. Select another node
  5. Stop all pods of the RC on this node.
  6. ...

Questions:

  • Is this usage OK with the Kubernetes philosophy?
  • Is there any way to do it right now?
  • If not, which files are concerned by this?
    I could patch kubectl to support something like: kubectl rolling-update frontend-v1 -f frontend-v2.json --by-node

Thanks

@nikhiljindal
Contributor

No, there is no way to do it right now.

I'm curious why you want to do a rolling update by node.
Is the current rolling update mechanism not enough?

@nikhiljindal nikhiljindal added kind/enhancement area/app-lifecycle sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. area/kubectl labels Dec 9, 2015
@titilambert
Author

I can't have instances with different versions on the same node because my processes use shared memory.

@bgrant0607 bgrant0607 added team/ux priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 23, 2016
@bgrant0607
Member

In general, we should do rolling updates by the failure domains that pods are spread across.

To clarify: these pods are communicating via shared memory? How? Why not put all the containers in the same pod? I don't see how this would work without hard affinity #18265.

@titilambert
Author

@bgrant0607

  • Are the pods communicating via shared memory? No, the pods are just reading the same memory; this is just to save memory usage (avoid duplication).
  • How? Using /dev/shm.
  • Why not put all the containers in the same pod? Because that would remove flexibility around scaling...

@titilambert
Author

@nikhiljindal I just made an implementation of rolling update by node using labels on nodes.
Do you think this could be integrated into kubectl?

@bgrant0607
Member

@titilambert Have you seen kubectl drain? Would that do what you need?

cc @mml @janetkuo @Kargakis @mqliang
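(For reference, kubectl drain cordons a node and then deletes the pods running on it, so nothing new is scheduled there until the node is uncordoned. A minimal invocation, with a hypothetical node name, might look like:

    # Mark node-1 unschedulable and delete the pods running on it
    kubectl drain node-1 --force

    # Make node-1 schedulable again once the update is done
    kubectl uncordon node-1

--force is only needed if some pods on the node are not managed by a controller such as an RC.)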

@titilambert
Author

@bgrant0607 Hello!
I made a first draft of this using node selectors. (#22442)
kubectl drain does not really fit because it stops scheduling on the current node. (https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/kubectl/kubectl_drain.md)
In this first draft, only the pods of the targeted RC are impacted by the rolling update, and the node can still receive other pods from other RCs.
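As an illustration of that behaviour (not the code from #22442): assuming the RC's pods carry a hypothetical label app=frontend, the node name is node-1, and kubectl is recent enough to support --field-selector, deleting only that RC's pods on a given node while leaving the node schedulable for everything else is roughly:

    # Remove only the frontend pods on node-1; pods from other RCs are untouched
    # and the node is never cordoned.
    kubectl delete pods -l app=frontend --field-selector spec.nodeName=node-1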

@davidopp
Member

davidopp commented Mar 6, 2016

Sorry, I just saw this issue. Would #9043 solve your problem?

@titilambert
Author

Hello!
Not really. The objective of the rolling update by node is to be sure that you can never get 2 different versions of the same RC running on the same node. I don't think that issue covers this case.

@bgrant0607
Member

@titilambert If you use a hostPort in your pods, only one can schedule per node.

We also have some anti-affinity features coming that may help:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/podaffinity.md

@titilambert
Author

@bgrant0607 Thanks for your reply!
It sounds interesting, but I cannot see how I can be sure that the old RC will not deploy a new pod on the current node. One of the main requirements of this rolling update by node is to delete all pods of the old RC (on the current node) and then, once there are no more of its pods on this node, start creating the pods of the new RC.
Maybe I missed something... could you give me more details on your thinking?

BTW, I'm pretty sure the anti-affinity feature will help make this PR better (maybe achieving this without using a node selector?)

@bgrant0607
Member

@titilambert I still don't understand why you want to stop all pods on a given node at the same time. However, this sounds like a fairly niche use case.

Maybe there is something we could do to make this easier to implement outside of Kubernetes?

@bgrant0607
Member

Additionally, as I mentioned in the PR, we're trying to reduce the amount of logic in kubectl (#12143).

@djsly
Contributor

djsly commented Apr 29, 2016

Hi Brian, let me try to explain the use case here in more detail.

We have a single-threaded service that requires a lot of RAM
(it loads a model into memory).

Since the process is single-threaded, we run multiple instances of the
service on the same machine, and we share the RAM across the different
instances using /dev/shm.

Now in k8s, we have managed to migrate the service into a single docker
container and we can scale the pods accordingly. The main problem is that
during a rolling update, we cannot have service 1 and service 2 running and
sharing /dev/shm with state X, start an update, have service 1 stop,
restart, and try to update /dev/shm while service 2 is still using it
(here we are assuming that service 1 will fill /dev/shm with new data
incompatible with service 2's version).

So the only way right now to fix this (at the infra level) is to stop all
the pods running on the node; this ensures that the mounted /dev/shm is
released by the last pod being destroyed. Once the host isn't running any
pod of that service (v1), we can move on to the upgrade of the service (v2). As the
service (v2) boots up, the first pod on the host will reload /dev/shm and the
subsequent pods will simply use the shared /dev/shm.

Going one node at a time allows the service to be upgraded with no downtime.

We do understand that this is not aligned with microservices best
practices, but unfortunately limitations of the service prevent us from moving
to a better paradigm that fits well in k8s.

Hopefully this niche case is now clearer for you.

Regards!

Sylvain


@davidopp
Member

In 1.3 we're adding "pod affinity", which lets you say "when you're trying to schedule a pod from [service, RC, RS, Job, whatever] X, only do so on a node that is already running at least one pod from [service, RC, RS, Job, whatever] Y."

There is a variant of this (that we're not implementing in 1.3, but might later) that says "in addition, if the pod from Y stops running, then kill X."

If you really only have two services, then this variant (that we're not implementing in 1.3) sounds like it would solve your problem. In particular:

  1. Create Y
  2. Create X, give it pod affinity so that it can only run on nodes with Y
  3. Later, update the pod templates in the RCs for X and Y
  4. Run a script that walks through the nodes in the cluster, killing Y. This will in turn also kill X and both will reload with their updated pod template. Of course there is a bit of a race here, where you need some way to make sure X dies before Y restarts.

I'm not saying this is the best way to address your problem, and of course it's hard to compare one nonexistent solution to other nonexistent solutions, but I thought I'd mention it, as this at least fits in with something we're building.

@djsly
Contributor

djsly commented Apr 29, 2016

Hi Daniel. Thanks for the feedback.
I might not have explained it well the first time.
In my case, service1 and service2 are in fact the same RC; they are just different replicas. I would like to use the concept of an instance number appended to the service name... We can run up to 24 replicas of the same pod on one node.
When we do a rolling upgrade, we need any new replica of the new RC to be started on a particular node only when the previous RC's replicas are all stopped on that particular node, to prevent corruption of the mounted /dev/shm partition.

Regards 

@davidopp
Member

davidopp commented May 2, 2016

Hi @djsly. Thanks for the clarification. Now I understand -- you want to "roll" one node at a time rather than one replica at a time, and you want to ensure that no updated replica starts on the node until all of the old replicas on the node have been killed.

There's no automated way to do what you're asking. But here's an approach that might be good enough.

Let's say rc1 is the ReplicationController that's managing the current replicas, rc1's PodSpec has a node selector "version=1", and all the nodes in the cluster start out labeled "version=1".

First, you create rc2, a ReplicationController that will manage the new version; it is identical to rc1 except that it uses the image name you're upgrading to and it has the node selector "version=2" instead of "version=1" (and its name is rc2 instead of rc1, of course). Then

for (each node N)
   set N's NodeSpec.Unschedulable = true
   delete all the pods on N; wait for them to actually be gone
   change N's label from version=1 to version=2
   set N's NodeSpec.Unschedulable = false

Once you're done upgrading the nodes, you can delete rc1.

I realize this isn't perfect, but I think it's the closest you can get without writing your own controller.
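
For what it's worth, a rough shell sketch of that loop, narrowed to deleting only the pods labeled version=1 on each node (rather than literally all pods), and assuming rc2's pod template uses the node selector version=2 and kubectl is recent enough to support --field-selector. This is an illustration only, not the code from #22442:

    for NODE in $(kubectl get nodes -o name | cut -d/ -f2); do
        # NodeSpec.Unschedulable = true
        kubectl cordon "$NODE"

        # Delete the old pods on this node and wait for them to actually be gone
        kubectl delete pods -l version=1 --field-selector spec.nodeName="$NODE"
        while [ -n "$(kubectl get pods -l version=1 --field-selector spec.nodeName="$NODE" -o name)" ]; do
            sleep 5
        done

        # Relabel the node so rc2's pods (node selector version=2) can land here
        kubectl label node "$NODE" version=2 --overwrite

        # NodeSpec.Unschedulable = false
        kubectl uncordon "$NODE"
    done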

@djsly
Contributor

djsly commented May 3, 2016

Hi David, Thanks for the proposal!

This is indeed exactly what we coded in PR #22442, with the only exception that we skip setting

NodeSpec.Unschedulable = true

so that other RCs can still deploy replicas that are independent of /dev/shm, keeping the node's resources available for other types of services.

What we would like to do in the end is contribute the changes needed to support such a scenario upstream, so that we can stop relying on our own fork of the project and eventually get back to using the official releases.

We understand that this should be coded server side, which makes a lot of sense, and we would like guidance on what you would prefer, to ensure that we can work on getting a future PR accepted.

Thanks!

Sylvain

@djsly
Contributor

djsly commented Oct 28, 2016

@davidopp, if we are interested in resuming this work by migrating the previous PR to the Deployment object, where would be the best place to start in terms of a proposal? Is #sig-apps the right venue for the initial design discussion?

@davidopp
Member

Yes, sig-apps is probably the right place.

@davidopp
Member

Hi, sorry we did not get a chance to talk in person at KubeCon.

Is it possible to do this using your own client? We now have a Go client:
https://github.com/kubernetes/client-go

If you want to see this built into Deployment, you should write a proposal and discuss it with sig-apps.

@djsly
Contributor

djsly commented Jan 11, 2017

@titilambert I guess we can close this, since for now we have coded the logic on the client side. Eventually we will look at using either Operators from CoreOS or ThirdPartyResources directly.

@0xmichalis
Contributor

@djsly mind sharing your implementation if it's open-sourced?
