Pet set upgrades #28706
One ask is that the containers / pets be notified of the update.
We don't really have any kind of orchestrated node upgrade in Kubernetes right now (neither "kubectl drain" nor cluster/gce/upgrade.sh have any notion of when it's safe to move on to the next node), so I think documenting is the only thing we can do. With PodDisruptionBudget it will be possible to configure a PetSet so that no more than one instance of the set can be down at a time, so higher-level upgrade orchestrators won't need to reason directly about the rule you mentioned.
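For reference, a minimal sketch of such a budget on current API versions, assuming the pets carry a label like `app: my-pet` (the name and label here are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pet-pdb              # illustrative name
spec:
  maxUnavailable: 1             # at most one pet may be voluntarily disrupted at a time
  selector:
    matchLabels:
      app: my-pet               # assumed label carried by the pets
```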
@kubernetes/huawei
For OpenDJ LDAP servers, I have a requirement to update the topology when a pet is scaled down. The OpenDJ server requires that the scaled-down node be removed from the topology. Some kind of scale-down hook would be useful.
@wstrange the same exists for Elastic and Cassandra: upgrades, restarts, scale down.
Let's straw man what it would take to have PetSets be declarative config; without some deployment mechanism for that, we can't really get there. For pet sets to be declarative, we either need a higher-level resource, or we need a way to manage that on the petset itself. We would need to be able to roll back to a previous entry in that history. Unlike deployments, running two pet sets at the same time is not an option (the petset needs to provide a strong lock on the cluster config, which two pet sets can't do). How would we record the history of a petset? We can't put it in spec (apply has to overwrite it).
Rolling-upgrade is maybe ok for beta, but I really want to talk about the final design first and then walk back to rolling-upgrade. Agree that we can probably solve rolling upgrade (have an external process manage the update).
If there are no volunteers, I will be glad to help here and land at least part of it in 1.5.
I think the problem with minimizing downtime during upgrades is different for pet sets vs replica sets. As indicated, the former cannot tolerate old and new alive at the same time, so we'll have to figure out ways to address cluster uptime, an alien concept to your typical rs/deployment. Moreover, the petset controller already enforces ordered, one-at-a-time pod management. I'm going to assume petset upgrades happen on existing pets, one at a time, and that our task is to orchestrate this with minimum downtime. We should converge to a solution that allows us to record history and roll back to a known good version, but IMO the goal should NOT be 100% uptime on all cluster topologies. Here's one way to proceed:
Note that as proposed, the deployment orchestrator is different from the existing deployment controller in 2 key ways:
This might result in some duplication of logic between the deployment-petset-controller and the petset-controller. There might be nicer ways for the deployment-petset-controller to drive the petset-controller.
I have some ideas about Pet Set upgrade.
@bprashanth Any comments? BTW, I am working on the Pet Set upgrade project and hope I can be of some help.
A few observations from recent discussion:

I do not think a pet set should allow pets of more than 2 versions at a time - a rollout must complete for a new version to be created. That implies we need at most three versions of the pod template.

Since we don't keep arbitrary history for deployments (that is what …

Occasionally an admin must interrupt an update/upgrade. That could be …

I forgot what my other thing was but it'll come to me.

I agree with …
I think in-place upgrade is reasonable - we need it for pods in general and there should be a plan for it.
PetSet upgrade has to be about the whole pod template, not just images - we can always optimize the special case (only image change) but image is not the only thing that can change.
Yes, I agree. However, we need to be more cautious about updating some template fields such as resource requests, as an in-place update may fail forever (the node doesn't have enough resources and the scheduler is not aware of the pod because it is already bound to that node).
We also have to manage the risk that two controllers race to create pods of divergent specs (possible today). If the petset controller loses its lease, another controller could observe different historical versions and create arbitrarily older specs. Even with a work log on the petset approach, a node controller could delete the pod (which removes the lock) and two versions of the pet set controller could try to create two versions of the pod. However, the pet pod would still have the generation of the petset embedded into it (we need to add this via the downward API; see the sketch at the end of this comment) and could use that as the config version record to control joining the cluster. So basically, we can skip excessive cleverness in managing divergent specs by assuming that pods can be created out of order (by arbitrarily delayed defunct petset controllers) and requiring the pod to manage that as part of joining the cluster. So each pod is embedded with a numerically increasing integer representing the config version that it is based on (the generation).

Straw man 1 - the pet set controller actively describes its state transitions for clients on the petset. A client uses …
Pros: …
Cons: …

Variant 1 - describe deltas with patches (from spec or target spec).
Pros: …
Cons: …

Straw man 2 - use a sub resource / child resource to represent the history in etcd, rather than as part of PetSet. Create a new endpoint …
Cons: …

Straw man 3 - create a new resource type PetSetVersion that records the state a controller is targeting. The controller creates a new PetSetVersion every time it observes one, and then always selects the newest (as recorded by generation) one as its creation target.
Pros: …
Cons: …
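To make the "generation embedded in the pod" idea concrete, here is a minimal sketch using the downward API. The annotation key is hypothetical (something the petset controller would stamp on each pet); only the downward API plumbing itself is existing Kubernetes behavior:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pet-0
  annotations:
    petset.kubernetes.io/generation: "3"   # hypothetical key, stamped by the controller
spec:
  containers:
  - name: db
    image: example/db:1.0                  # placeholder image
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: generation                   # readable at /etc/podinfo/generation
        fieldRef:
          fieldPath: metadata.annotations['petset.kubernetes.io/generation']
```

The pod's entrypoint could read this file and refuse to join (or wait) if the cluster already contains members built from a newer generation.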
In theory, today's petset already does upgrade, if you forget about rollback. Update the spec and its sync routine should kick in and recreate pods with a strategy of maxSurge=0, maxUnavailable=1 (I'm sure it doesn't work exactly as I'm thinking, but I don't think it would be hard to make it). We're just augmenting it with the ability to remember exactly one older petset and roll back. This feels like your (1). Then I started looking for a way to involve a wrapper controller that allows us to do this safely (i.e. snapshot each pet before upgrade, and restore from snapshot before rollback). A wrapper controller would benefit from (2): client creates … Either way, it sounds like we'd still need a cluster config int (essentially generation) piped down into the pets. Some modern databases like etcd maintain an internal cluster version; for those that don't, educating users on how to use it might be tricky (especially because the race is between master components).
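For comparison, this is how a Deployment spells out the same "no surge, at most one down" constraint. PetSet has no such strategy field, so this is only the Deployment analogue of the behavior described above, with illustrative names and image:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0          # never create an extra replica during the rollout
      maxUnavailable: 1    # take down at most one replica at a time
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: app
        image: example/app:1.0   # placeholder image
```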
When you say wrapper controller, I hear babysitter. Is that true? If so, I …
The more I think about this, the more convinced I am that more than two versions …
I was looking for a lighter-weight deployment. We will soon support volume snapshotting, so that feels a little like a strategy. Not all the concepts of RS deployment map cleanly to petset, but the separation might afford us some flexibility. Even in the mixed-version scenario, the deployment becomes the gatekeeper of the petset spec. You can either wait for the petset to finish the current upgrade, or roll back, but if during a petset upgrade the user keeps applying new deployments, it just updates its target, not the petset.Spec. The deployment.targetSpec is propagated to the petset.spec when the petset updates generation. Maybe it is easier to invert this dependence and have the petset reach out to a babysitter, though.
Hrm - I don't think we can make user-facing objects behave like that (have the user update something other than spec that is acting like spec). I think consistency and past practice alone dictate that whatever the user sets as spec is described as the goal, and the controller may just take longer to converge than it does for deployments (which can already be quite long). I don't think intermediate updates before the next deployment have to be reflected, just like in deployment it's possible for the controller to miss intermediates.

But even just discussing it, I don't see a strong need to have a separate object (3 above) as long as the action of update and the controller's reaction to it feel natural and progressive. There may be an argument to be made that anything we can't hide from the pet set author completely we should not make too rare - otherwise they're unlikely to consider it and capable of being surprised later. Since a defunct petset controller could be arbitrarily delayed, it's still possible to see >2 versions in the cluster - you could imagine in a particularly fraught series of failovers that you have multiple all racing to create a version that no longer exists and end up with 3 distinct versions. I guess if all of the controllers are converging by updating the oldest generation first, you can ensure that you eventually reach a stable set.
It sounds like the focus is currently on detecting, from within a pet, when previous pets may be at different versions, and either hanging or exiting. The value-add of rollback and pause feels marginal, because we have version control for a single revision back, and "pause" already kind of works as a hack through a debug hook I put in (http://kubernetes.io/docs/user-guide/petset/#troubleshooting), right? Maybe we should focus on this one problem first? As an afterthought: storing targetSpec in status also feels odd; I thought our policy around status was that it could be blown away at any time and recomputed by examining the state of the cluster?
Should we prevent scale up/down operations while an upgrade is in progress? I find that hard to do, as Kubernetes has no mechanism that can easily achieve it. If we don't prevent it, that means we can end up with more than replicas+1 or fewer than replicas-1 instances in a cluster.
Target spec would be reconstructable by observing the state of the cluster. Pod with highest generation becomes target spec (or alternatively second oldest generation).
I think pause is mandatory - all of the arguments with pause around deployments apply here, because pause is the higher level primitive saying "don't apply this newest spec yet".
So ordered state transitions have to be done by a strongly consistent agent. I think that needs to be the petset controller (so you don't have to implement a babysitter in all cases), and then the petset controller needs to delegate to a babysitter in a controlled fashion. We could conceivably also make it easy to fork the petset controller and customize it, but in practice that only works so well. Also, there should be the ability to set an annotation / field on petset that says "let someone else manage my semantics", as per #31571
I think the invocation of the babysitter is going to be important. There are going to be some very common patterns like snapshot upgrade that we can write simple babysitters for, so everyone gets them for free. I also don't think it is specific to upgrades. On scale down, I want to delegate to a babysitter to pick the non-master.
Ingress does this through annotations; we could start with that.
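A rough sketch of what that could look like, in the spirit of the `kubernetes.io/ingress.class` annotation on Ingress. The annotation key and its value are hypothetical; the PetSet fields shown are the real apps/v1alpha1 ones from the time of this discussion, with illustrative names:

```yaml
apiVersion: apps/v1alpha1            # PetSet API group/version at the time of this thread
kind: PetSet
metadata:
  name: galera
  annotations:
    # Hypothetical key: "let someone else manage my upgrade semantics",
    # analogous to kubernetes.io/ingress.class on Ingress.
    petset.kubernetes.io/upgrade-manager: snapshot-babysitter
spec:
  serviceName: galera
  replicas: 3
  template:
    metadata:
      labels:
        app: galera
    spec:
      containers:
      - name: galera
        image: example/galera:1.0    # placeholder image
```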
Currently a pet can block forever. If we support a rollback strategy for upgrade, should we give up the in-progress upgrade and roll back on a timeout (we may set a timeout value beforehand), or just keep blocking?
Annotations are useful to carry something like upgrade strategies in a transitional period, but the …
What if we decoupled identity sync from upgrade?
We can probably write a standard babysitter that handles the following cases:
We'd need some strong fencing guarantees to ensure there are never 2 babysitters. Babysitter plugins would be great; then one could leverage the same framework to move a …
Tried to brainstorm some of this today, some notes:
I still prefer the babysitter as "one or more bash scripts in the same image as the server" for ease of reasoning by end admins (write a bash script like "pause petset, get list of pets with same version, delete first pet, unpause petset, wait, etc"). Prashanth still prefers a "controller loop like server process", but we agreed that we could make the scripts common to both approaches, as the "oc observe" example demonstrated. We need to sketch out an actual upgrade. Probably more things I forgot. We'll try to schedule a follow-up with people - in the meantime we've got some grist to chew on.
I was looking for a good ansible / pacemaker example for DBs but haven't found one that I consider "something I would really trust". Will keep digging (although I really recommend people read the pacemaker docs - they're excellent at explaining the expected mechanics).
Ah, found it for postgres: http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
When are we going to get a design doc for this? We're going to have PetSets in prod soon, and not being able to upgrade makes me squirm a bit.
FYI, you can already upgrade pets like daemons: change the image and delete pets. If you don't care about the listed inadequacies, you can head to the races.
Good workaround, but I need to bump the container version ;(
FWIW, DaemonSets will most probably use PodTemplates for retaining history: #31693
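For context, a bare `PodTemplate` is already a core v1 object, so retaining history that way would mean stamping out per-revision objects roughly like this (names and image are illustrative):

```yaml
apiVersion: v1
kind: PodTemplate
metadata:
  name: my-daemon-revision-2     # illustrative per-revision name
  labels:
    app: my-daemon
template:
  metadata:
    labels:
      app: my-daemon
  spec:
    containers:
    - name: daemon
      image: example/daemon:2.0  # placeholder image
```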
Part of the hold on daemon sets is so we can talk about upgrade for …
/close
One should be able to upgrade a petset through the standard kubectl rolling-update syntax. While it's desirable to have a controlled upgrade strategy (eg: deployments), it may not be necessary for beta.

The tricky part is making sure that `pet-v1-0` and `pet-v2-0` come up with the same data. For beta I propose we just make `kubectl rolling-update` on a petset update the image and kill pets one by one (as opposed to creating a new petset like a classic rolling-update). If any pet doesn't become ready within a timeout, the rolling-update reverts.

Node upgrade: mark the node as unschedulable, delete pets one by one, wait till they show up on a new node as ready, upgrade the node. Can we get away with just documenting this (upgrading the kubelet on a node shouldn't lead to container restarts anyway)?

@smarterclayton @kubernetes/sig-apps Is this enough?

We could add hooks like `pre-delete` and `pre-update`. We can use `terminationGrace` simply to leave the cluster, `pre-delete` to do something drastic like backing up all data, and `pre-update` to do something like giving up leadership (but retaining data). When a member is removed/rejoined, the rest of the cluster should receive `on-change`. Feature request "A way to signal pods" (#24957) should help with the events. This might work as follows:

- `on-upgrade` to the master so it gives up mastership and doesn't compete in re-election. A leader election service might help here (#28658).
- `on-change` sent to all other members needs to exclude the upgrading member by re-writing the config, so writes aren't blocked on its ack. The upgraded member will have to play catchup when it rejoins.
- `kubectl drain` to move pets off a node slowly. If people are running 1 pet per node (eg: pod anti-affinity), we have less of a problem.
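The `pre-delete` / `pre-update` / `on-change` hooks above are proposals rather than existing API. The closest existing knobs are `terminationGracePeriodSeconds` and the container `preStop` lifecycle hook, sketched here with a placeholder image and a hypothetical script baked into it:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pet-0
spec:
  terminationGracePeriodSeconds: 120       # time allowed to leave the cluster cleanly
  containers:
  - name: db
    image: example/db:1.0                  # placeholder image
    lifecycle:
      preStop:
        exec:
          command: ["/scripts/leave-cluster.sh"]   # hypothetical script shipped in the image
```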