
Rolling-update by node #18450

Closed
titilambert opened this issue Dec 9, 2015 · 23 comments
Labels
area/app-lifecycle area/kubectl priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@titilambert

Hello,
I would like to do a rolling update node by node:

  1. Select one node
  2. Stop all pods of the RC on this node.
  3. Start new pods on this node
  4. Select another node
  5. Stop all pods of the RC on this node.
  6. ...

Questions:

  • Is this usage OK with the Kubernetes philosophy?
  • Is there any way to do it right now?
  • If not, which files are concerned by this?
    I could patch kubectl to support something like: kubectl rolling-update frontend-v1 -f frontend-v2.json --by-node

Thanks

@nikhiljindal
Contributor

No, there is no way to do it right now.

I'm curious why you want to do a rolling update by node.
Is the current rolling update mechanism not enough?

@nikhiljindal nikhiljindal added kind/enhancement area/app-lifecycle sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. area/kubectl labels Dec 9, 2015
@titilambert
Author

I can't have instances with different versions on the same node because my processes use shared memory.

@bgrant0607 bgrant0607 added team/ux priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. and removed sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Jan 23, 2016
@bgrant0607
Member

In general, we should do rolling updates by the failure domains that pods are spread across.

To clarify: these pods are communicating via shared memory? How? Why not put all the containers in the same pod? I don't see how this would work without hard affinity #18265.

@titilambert
Author

@bgrant0607

  • Are the pods communicating via shared memory? No, the pods are just reading the same memory; this is just to save memory usage (avoid duplication).
  • How? Using /dev/shm.
  • Why not put all the containers in the same pod? Because that would remove flexibility around scaling...

@titilambert
Author

@nikhiljindal I just made an implementation of rolling update by node using labels on nodes.
Do you think this could be integrated into kubectl?

@bgrant0607
Member

@titilambert Have you seen kubectl drain? Would that do what you need?

cc @mml @janetkuo @Kargakis @mqliang
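(For reference, kubectl drain cordons a node and then deletes the pods running on it, so nothing new is scheduled there until the node is uncordoned. A minimal invocation, with a hypothetical node name, might look like:

    # Mark node-1 unschedulable and delete the pods running on it
    kubectl drain node-1 --force

    # Make node-1 schedulable again once the update is done
    kubectl uncordon node-1

--force is only needed if some pods on the node are not managed by a controller such as an RC.)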

@titilambert
Author

@bgrant0607 Hello!
I made a first draft of this using node selectors. (#22442)
kubectl drain does not really fit because it stops scheduling on the current node. (https://github.com/kubernetes/kubernetes/blob/master/docs/user-guide/kubectl/kubectl_drain.md)
In this first draft, only the pods of the targeted RC are impacted by the rolling update, and the node can still receive other pods from other RCs.
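As an illustration of that behaviour (not the code from #22442): assuming the RC's pods carry a hypothetical label app=frontend, the node name is node-1, and kubectl is recent enough to support --field-selector, deleting only that RC's pods on a given node while leaving the node schedulable for everything else is roughly:

    # Remove only the frontend pods on node-1; pods from other RCs are untouched
    # and the node is never cordoned.
    kubectl delete pods -l app=frontend --field-selector spec.nodeName=node-1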

@davidopp
Member

davidopp commented Mar 6, 2016

Sorry, I just saw this issue. Would #9043 solve your problem?

@titilambert
Author

Hello!
Not really. The objective of the rolling update by node is to be sure that you can never get 2 different versions of the same RC running on the same node. I don't think that issue covers this case.

@bgrant0607
Member

@titilambert If you use a hostPort in your pods, only one can schedule per node.

We also have some anti-affinity features coming that may help:
https://github.com/kubernetes/kubernetes/blob/master/docs/design/podaffinity.md

@titilambert
Author

@bgrant0607 Thanks for your reply!
It sounds interesting, but I cannot see how I can be sure that the old RC will not deploy a new pod on the current node. One of the main requirements of this rolling update by node is to delete all pods of the old RC (on the current node) and then, once there are no more of its pods on this node, start creating the pods of the new RC.
Maybe I missed something... could you give me more details on your thinking?

BTW, I'm pretty sure the anti-affinity feature will help make this PR better (maybe achieving this without using a node selector?)

@bgrant0607
Member

@titilambert I still don't understand why you want to stop all pods on a given node at the same time. However, this sounds like a fairly niche use case.

Maybe there is something we could do to make this easier to implement outside of Kubernetes?

@bgrant0607
Member

Additionally, as I mentioned in the PR, we're trying to reduce the amount of logic in kubectl (#12143).

@djsly
Contributor

djsly commented Apr 29, 2016

Hi Brian, let me try to explain the use case here in more detail.

We have a single-threaded service that requires a lot of RAM
(it loads a model into memory).

Since the process is single-threaded, we run multiple instances of the
service on the same machine, and we share the RAM across the different
instances using /dev/shm.

Now in k8s, we have managed to migrate the service into a single docker
container and we can scale the pods accordingly. The main problem is that
during a rolling update, we cannot have service 1 and service 2 running and
sharing /dev/shm with state X, start an update, have service 1 stop,
restart, and try to update /dev/shm while service 2 is still using it
(here we are assuming that service 1 will fill /dev/shm with new data
incompatible with service 2's version).

So the only way right now to fix this (at the infra level) is to stop all
the pods running on the node; this ensures that the mounted /dev/shm is
released by the last pod being destroyed. Once the host isn't running any
pod of that service (v1), we can move on to the upgrade of the service (v2). As the
service (v2) boots up, the first pod on the host will reload /dev/shm and the
subsequent pods will simply use the shared /dev/shm.

Going one node at a time allows the service to be upgraded with no downtime.

We do understand that this is not aligned with microservices best
practices, but unfortunately limitations of the service prevent us from moving
to a better paradigm that fits well in k8s.

Hopefully this niche case is now clearer for you.

Regards!

Sylvain


@davidopp
Member

In 1.3 we're adding "pod affinity", which lets you say "when you're trying to schedule a pod from [service, RC, RS, Job, whatever] X, only do so on a node that is already running at least one pod from [service, RC, RS, Job, whatever] Y."

There is a variant of this (that we're not implementing in 1.3, but might later) that says "in addition, if the pod from Y stops running, then kill X."

If you really only have two services, then this variant (that we're not implementing in 1.3) sounds like it would solve your problem. In particular:

  1. Create Y
  2. Create X, give it pod affinity so that it can only run on nodes with Y
  3. Later, update the pod templates in the RCs for X and Y
  4. Run a script that walks through the nodes in the cluster, killing Y. This will in turn also kill X and both will reload with their updated pod template. Of course there is a bit of a race here, where you need some way to make sure X dies before Y restarts.

I'm not saying this is the best way to address your problem, and of course it's hard to compare one nonexistent solution to other nonexistent solutions, but I thought I'd mention it, as this at least fits in with something we're building.

@djsly
Contributor

djsly commented Apr 29, 2016

Hi Daniel. Thanks for the feedback.
I might not have explained it well the first time.
In my case, service1 and service2 are in fact the same RC; they are just different replicas. I would like to use the concept of an instance number appended to the service name... We can run up to 24 replicas of the same pod on one node.
When we do a rolling upgrade, we need any new replica of the new RC to be started on a particular node only when the previous RC's replicas are all stopped on that particular node, to prevent corruption of the mounted /dev/shm partition.

Regards 

@davidopp
Member

davidopp commented May 2, 2016

Hi @djsly. Thanks for the clarification. Now I understand -- you want to "roll" one node at a time rather than one replica at a time, and you want to ensure that no updated replica starts on the node until all of the old replicas on the node have been killed.

There's no automated way to do what you're asking. But here's an approach that might be good enough.

Let's say rc1 is the ReplicationController that's managing the current replicas, rc1's PodSpec has a node selector "version=1", and all the nodes in the cluster start out labeled "version=1".

First, you create rc2, a ReplicationController that will manage the new version; it is identical to rc1 except that it uses the image name you're upgrading to and it has the node selector "version=2" instead of "version=1" (and its name is rc2 instead of rc1, of course). Then

for (each node N)
   set N's NodeSpec.Unschedulable = true
   delete all the pods on N; wait for them to actually be gone
   change N's label from version=1 to version=2
   set N's NodeSpec.Unschedulable = false

Once you're done upgrading the nodes, you can delete rc1.

I realize this isn't perfect, but I think it's the closest you can get without writing your own controller.
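
For what it's worth, a rough shell sketch of that loop, narrowed to deleting only the pods labeled version=1 on each node (rather than literally all pods), and assuming rc2's pod template uses the node selector version=2 and kubectl is recent enough to support --field-selector. This is an illustration only, not the code from #22442:

    for NODE in $(kubectl get nodes -o name | cut -d/ -f2); do
        # NodeSpec.Unschedulable = true
        kubectl cordon "$NODE"

        # Delete the old pods on this node and wait for them to actually be gone
        kubectl delete pods -l version=1 --field-selector spec.nodeName="$NODE"
        while [ -n "$(kubectl get pods -l version=1 --field-selector spec.nodeName="$NODE" -o name)" ]; do
            sleep 5
        done

        # Relabel the node so rc2's pods (node selector version=2) can land here
        kubectl label node "$NODE" version=2 --overwrite

        # NodeSpec.Unschedulable = false
        kubectl uncordon "$NODE"
    done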

@djsly
Contributor

djsly commented May 3, 2016

Hi David, Thanks for the proposal!

This is indeed exactly what we coded in PR #22442, with the only exception that we skip setting

NodeSpec.Unschedulable = true

so that other RCs can still deploy replicas that are independent of /dev/shm, keeping the node's resources available for other types of services.

What we would like to do in the end is contribute the changes needed to support such a scenario upstream, so that we can stop relying on our own fork of the project and eventually get back to using the official releases.

We understand that this should be coded server side, which makes a lot of sense, and we would like guidance on what you would prefer, to ensure that we can work on getting a future PR accepted.

Thanks!

Sylvain

@djsly
Contributor

djsly commented Oct 28, 2016

@davidopp, if we are interested in resuming this work by migrating the previous PR to the Deployment object, where would be the best place to start in terms of a proposal? Is #sig-apps the right venue for the initial design discussion?

@davidopp
Member

Yes, sig-apps is probably the right place.

@davidopp
Member

Hi, sorry we did not get a chance to talk in person at KubeCon.

Is it possible to do this using your own client? We now have a Go client:
https://github.com/kubernetes/client-go

If you want to see this built into Deployment, you should write a proposal and discuss it with sig-apps.

@djsly
Contributor

djsly commented Jan 11, 2017

@titilambert I guess we can close this, since for now we have coded the logic on the client side. Eventually we will look at using either Operators from CoreOS or ThirdPartyResources directly.

@0xmichalis
Contributor

@djsly mind sharing your implementation if it's open-sourced?
