
Pet set upgrades #28706

Closed

bprashanth opened this issue Jul 8, 2016 · 40 comments
Labels
area/stateful-apps priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@bprashanth
Contributor

One should be able to upgrade a petset through the standard kubectl rolling-update syntax. While it's desirable to have a controlled upgrade strategy (eg: deployments), it may not be necessary for beta.

The tricky part is making sure that pet-v1-0 and pet-v2-0 come up with the same data. For beta I propose we just make kubectl rolling-update on petset update the image and kill pets one by one (as opposed to creating a new petset like a classic rolling-update). If any pet doesn't become ready within a timeout, the rolling-update reverts.
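
A minimal sketch of that delete-and-wait loop. The petClient interface and its methods are hypothetical stand-ins for whatever kubectl/API calls end up being used, and pets are assumed to be named <petset>-0 through <petset>-(N-1):

package petupgrade

import (
  "fmt"
  "time"
)

// Hypothetical client; the real calls would go through kubectl or the API server.
type petClient interface {
  PatchPetSetImage(petSet, image string) error
  DeletePod(pod string) error
  PodReady(pod string) (bool, error)
}

// rollingUpdate patches the petset template to the new image, then deletes pets
// one by one and waits for each replacement to become ready. If a pet never
// becomes ready within the timeout, it restores the old image and stops.
func rollingUpdate(c petClient, petSet, oldImage, newImage string, replicas int, timeout time.Duration) error {
  if err := c.PatchPetSetImage(petSet, newImage); err != nil {
    return err
  }
  for i := 0; i < replicas; i++ {
    pod := fmt.Sprintf("%s-%d", petSet, i)
    if err := c.DeletePod(pod); err != nil {
      return err
    }
    if !waitReady(c, pod, timeout) {
      // Revert so subsequent recreations use the old image again.
      _ = c.PatchPetSetImage(petSet, oldImage)
      return fmt.Errorf("pet %s not ready after %v; reverted petset to %s", pod, timeout, oldImage)
    }
  }
  return nil
}

func waitReady(c petClient, pod string, timeout time.Duration) bool {
  deadline := time.Now().Add(timeout)
  for time.Now().Before(deadline) {
    if ok, err := c.PodReady(pod); err == nil && ok {
      return true
    }
    time.Sleep(5 * time.Second)
  }
  return false
}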

Node upgrade: mark node as unschedulable, delete pets one by one, wait till they show up on new node as ready, upgrade node. Can we get away with just documenting this (upgrading kubelet on a node shouldn't lead to container restarts anyway)?

@smarterclayton @kubernetes/sig-apps Is this enough?

  • To improve the upgrade UX we could use 2 new events: pre-delete and pre-update. We can use terminationGrace simply to leave the cluster, pre-delete to do something drastic like backup all data, and pre-update to do something like give-up leadership (but retain data). When a member is removed/rejoined the rest of the cluster should receive on-change. Feature request: A way to signal pods #24957 should help with the events.
  • To improve master/slave upgrade UX, we need to avoid "chasing" the master: we pick the master to upgrade, a new master gets elected while that upgrade is happening, the next pod we pick is this new master, and so on.

This might work as follows (sketched in code at the end of this comment):

  • upgrade all slaves
  • send on-upgrade to master so it gives up mastership, and doesn't compete in re-election
  • wait for new master
  • upgrade old master

A leader election service might help here (#28658).

  • To improve active-active upgrade UX, the on-change sent to all other members needs to exclude the upgrading member by re-writing the config, so writes aren't blocked on its ack. The upgraded member will have to play catchup when it rejoins.
  • Some databases like Cassandra have special roles, like seed provider. These need to be dealt with in a similar way as leadership transfer (promote new seed, update peers, upgrade existing seed).
  • To improve node upgrade UX, we could teach kubectl drain to move pets off a node slowly. If people are running 1 pet per node (eg: pod anti-affinity), we have less of a problem.
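
A minimal sketch of the master-last ordering described above, again against hypothetical helpers (CurrentMaster, StepDown, UpgradePet and WaitForNewMaster stand in for whatever mechanism ends up being used, e.g. on-upgrade events or a leader-election sidecar):

package petupgrade

import (
  "fmt"
  "time"
)

// Hypothetical cluster-aware helpers; how they are implemented is exactly
// what is being discussed in this issue.
type clusterClient interface {
  Pets() ([]string, error)
  CurrentMaster() (string, error)
  StepDown(pet string) error   // give up leadership and stay out of re-election
  UpgradePet(pet string) error // delete/recreate (or in-place update) one pet
  WaitForNewMaster(old string, timeout time.Duration) (string, error)
}

// upgradeMasterLast upgrades all slaves first, asks the master to step down,
// waits for a new master, and only then upgrades the old master. This avoids
// "chasing" the master across elections.
func upgradeMasterLast(c clusterClient, timeout time.Duration) error {
  master, err := c.CurrentMaster()
  if err != nil {
    return err
  }
  pets, err := c.Pets()
  if err != nil {
    return err
  }
  for _, p := range pets {
    if p == master {
      continue // slaves first
    }
    if err := c.UpgradePet(p); err != nil {
      return err
    }
  }
  if err := c.StepDown(master); err != nil {
    return err
  }
  if _, err := c.WaitForNewMaster(master, timeout); err != nil {
    return fmt.Errorf("no new master elected: %v", err)
  }
  return c.UpgradePet(master)
}
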
@chrislovecnm
Contributor

One ask is that the container / pets be notified of the update.

@davidopp
Member

davidopp commented Jul 9, 2016

Node upgrade: mark node as unschedulable, delete pets one by one, wait till they show up on new node as ready, upgrade node. Can we get away with just documenting this (upgrading kubelet on a node shouldn't lead to container restarts anyway)?

We don't really have any kind of orchestrated node upgrade in Kubernetes right now (neither "kubectl drain" nor cluster/gce/upgrade.sh have any notion of when it's safe to move on to the next node), so I think documenting is the only thing we can do. With PodDisruptionBudget it will be possible to configure a PetSet so that no more than one instance of the set can be down at a time, so higher-level upgrade orchestrators won't need to reason directly about the rule you mentioned.
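
For concreteness, the budget described here would look roughly like this. The sketch uses today's policy/v1 Go types, which postdate this discussion (at the time PDB was policy/v1beta1 and only supported minAvailable); the object name and selector are placeholders:

package petupgrade

import (
  policyv1 "k8s.io/api/policy/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/apimachinery/pkg/util/intstr"
)

// onePetAtATime returns a PodDisruptionBudget that keeps voluntary disruptions
// (drains, evictions) from taking down more than one pet of the set at a time.
func onePetAtATime() *policyv1.PodDisruptionBudget {
  maxUnavailable := intstr.FromInt(1)
  return &policyv1.PodDisruptionBudget{
    ObjectMeta: metav1.ObjectMeta{Name: "web-pdb"},
    Spec: policyv1.PodDisruptionBudgetSpec{
      MaxUnavailable: &maxUnavailable,
      // Must match the pet set's pod labels.
      Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
    },
  }
}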

@magicwang-cn
Contributor

@kubernetes/huawei

@wstrange
Contributor

For OpenDJ ldap servers, I have a requirement to update the topology when a pet is scaled down.

The OpenDJ server requires that the scaled-down node be taken out of the replication topology so other servers do not attempt to replicate to it.

Some kind of scale down hook would be useful.

@chrislovecnm
Contributor

@wstrange same exists for Elastic and Cassandra. Upgrades, restarts, scale down.

@chrislovecnm
Contributor

chrislovecnm commented Jul 15, 2016

@wstrange here is the issue that I wrote about scale up, scale down, upgrades and restarts: #25275

@pwittrock pwittrock added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 18, 2016
@m1093782566
Contributor

/subscribe

@smarterclayton
Contributor

Let's straw man what it would take to have PetSets be declarative config. Without some deployment mechanism for that, we can't kubectl apply pet sets more than once, which I think is problematic.

For pet sets to be declarative, we either need a higher level resource, or we need a way to manage that on the petset. We would need a way to roll back to a previous revision. Unlike deployments, running two pet sets at the same time is not an option (the petset needs to provide a strong lock on the cluster config, which two pet sets can't do). How would we record the history of a petset? We can't put it in spec (apply has to overwrite it).

@smarterclayton
Contributor

Rolling-upgrade is maybe ok for beta, but I really want to talk about the final design first and then walk back to rolling-upgrade. Agree that we can probably solve rolling upgrade (have an external process manage the update).

@dshulyak
Contributor

dshulyak commented Sep 6, 2016

If there are no volunteers, I will be glad to help here and land at least part of it in 1.5.

@bprashanth
Contributor Author

I think the problem with minimizing downtime during upgrades is different for pet sets vs replica sets. As indicated, the former cannot tolerate old and new alive at the same time, so we'll have to figure out ways to address cluster uptime, an alien concept to your typical rs/deployment. Moreover the petset controller already enforces MaxUnavailable=1, and if we don't create 2 petsets, we have no MaxSurge concept.

I'm going to assume petset upgrades happen on existing pets, one at a time, and that our task is to orchestrate this with minimum downtime. We should converge to a solution that allows us to record history and roll back to a known good version, but IMO the goal should NOT be 100% uptime on all cluster topologies.

Here is one way to proceed:

  1. Implement rolling-update as described, this should be relatively easy because you can already do it by hand.

    • The order of pets to delete needs to be considered.
      • We might get away with random order for a first cut
      • There is benefit in deleting them in reverse order because 0 is usually the master, and having a predictable delete order allows an admin to "pause" upgrade, transfer leadership, "resume" upgrade. You can probably reuse the library the petset controller uses to scale down and get ordered deletion for free.
    • Document that rolling-update has an implicit disruption budget of 1 pet. Write some tests around this invariant. Cluster upgrade is the same provided you use pod anti-affinity/hostPort hack/node labels to schedule 1 pet per node.
  2. Write proposals. I see at least 2 that could fall out of this:
    a. Minimized downtime (or establish that we don't care about this)

    b. Upgrades orchestrated through a controller

    • The dumbest petset deployment will do exactly what rolling update does + pause/resume.
    • A smarter deployment might send event, delete pet, send next event, ....
    • Store history as annotations on the petset (maybe restrict this to just "image" initially?)
    • Add "undo/rollback" support

Note that as proposed, the deployment orchestrator is different from the existing deployment controller in 2 key ways:

  1. It actually deletes pods
  2. It doesn't create a new petset

This might result in some duplication of logic between deployment-petset-controller and petset-controller. There might be nicer ways for the deployment-petset-controller to drive the petset-controller.

@m1093782566
Contributor

m1093782566 commented Sep 9, 2016

I have some ideas about Pet Set upgrade.

  1. Should we support in-place upgrade, for those Pet Sets which only use local volumes? See In-place rolling updates #9043

    IMO, the difference between an in-place upgrade and a normal upgrade is whether old pods are deleted or not. An in-place upgrade only needs to change the Pod container's image field and doesn't need to re-schedule the Pod/Pet.

  2. For adding "undo/rollback" support, I propose one simple way: store the current container images as Pet Set Annotations, such as Annotation[pod.alpha.kubernetes.io/backups]=image1;image2...imageN. When an upgrade fails, use the backup images to roll back (a rough sketch follows this list).

  3. To support a richer upgrade strategy (rollback, QoS, recreate Pod, in-place, etc.), do we need to define a new field (such as UpgradeStrategy) in PetSetSpec, like DeploymentStrategy in Deployment? Or just make use of Annotations on the Pet Set?

  4. A detailed issue at the code level: do we need to add a new petLifeCycleEvent in the Pet Set Controller, for example upgradePet, to handle the Pet Set upgrade procedure in syncPetSet()? Or are the current syncPet and deletePet events enough?
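
A rough sketch of the annotation-backup idea from item 2; the annotation key and the trimmed-down PetSet stand-in are made up for illustration:

package petupgrade

import "strings"

// Hypothetical key, in the spirit of the pod.alpha.kubernetes.io/* annotations above.
const backupAnnotation = "petset.alpha.kubernetes.io/image-backup"

// PetSetLike is a stand-in for the real object: just its annotations plus the
// container images of its pod template, in container order.
type PetSetLike struct {
  Annotations map[string]string
  Images      []string
}

// recordBackup snapshots the current images before an upgrade begins.
func recordBackup(ps *PetSetLike) {
  if ps.Annotations == nil {
    ps.Annotations = map[string]string{}
  }
  ps.Annotations[backupAnnotation] = strings.Join(ps.Images, ";")
}

// rollback restores the images recorded by recordBackup, if any.
func rollback(ps *PetSetLike) bool {
  backup, ok := ps.Annotations[backupAnnotation]
  if !ok || backup == "" {
    return false
  }
  ps.Images = strings.Split(backup, ";")
  return true
}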

@bprashanth Any comments?

BTW, I am working on the Pet Set upgrade project and hope I can be of some help.

@smarterclayton
Contributor

A few observations from recent discussion.

I do not think a pet set should allow pets of more than 2 versions at any time. Having more may be actively harmful. A rollback is an atomic change, driven to completion by the controller, just like a rollout. While some admins may reason about a rollout between N and N+1, I doubt many can reason about N, N+1, and N+2. So an additional guarantee might be:

A PetSet only allows 2 versions of live pets at any time - a rollout must complete for a new version to be created.

That implies we need at most three versions of the pod template - current, transitioning to, and next. We don't want to prevent updates, but for instance if you specify N+3 while N and N+1 are running, then N+2 would be ignored and N+3 is your target after N+1.

Since we don't keep arbitrary history for deployments (that is what Git and config driven flows are for), this would allow us to describe spec (target), previous, and active transitioning. Previous and active should be immutable, and changes must be atomic to an observer. It must be required for a controller to see an ordered list of changes to the PetSet configuration, so we have to store those on the same object. This also preserves apply being able to update objects (previous and active are part of status).

Occasionally an admin must interrupt an update/upgrade. That could be pause, reset, or abort. They must be able to recover a broken set somehow. Those also need to be transactional, but also predictable to declarative config.

I forgot what my other thing was but it'll come to me. I agree with the points you both made - strategies may eventually be relevant, and 100% uptime is less important than safety (we chose CP over AP).
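
A minimal sketch of the two-live-versions guarantee as a controller-side gate, assuming each live pet carries the template generation it was created from (that per-pod generation is an assumption of this sketch, not an existing field):

package petupgrade

// liveGenerations returns the distinct template generations represented by live pets.
func liveGenerations(podGenerations []int64) map[int64]bool {
  gens := map[int64]bool{}
  for _, g := range podGenerations {
    gens[g] = true
  }
  return gens
}

// mayCreate reports whether creating a pod at newGen keeps the "at most two
// live versions" invariant: either newGen is already live, or at most one
// other generation is.
func mayCreate(podGenerations []int64, newGen int64) bool {
  gens := liveGenerations(podGenerations)
  if gens[newGen] {
    return len(gens) <= 2
  }
  return len(gens) <= 1
}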

@smarterclayton
Contributor

I think in place upgrade is reasonable - we need it for pods in general and there should be a plan for it.

@smarterclayton
Contributor

PetSet upgrade has to be about the whole pod template, not just images - we can always optimize the special case (only image change) but image is not the only thing that can change.

@m1093782566
Contributor

m1093782566 commented Sep 10, 2016

PetSet upgrade has to be about the whole pod template, not just images

Yes, I agree. However, we need to be more cautious about updating some template fields such as resource requests, as an in-place update may fail forever (the node doesn't have enough resources and the scheduler is not aware of the pod because Pod.Spec.Node is not nil). Is there a good way to solve it?
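
One hedged way to guard against that particular failure: before attempting an in-place bump of requests, check that the node's allocatable capacity minus what every other pod already requests still fits the new value. A sketch using resource.Quantity arithmetic (how the controller obtains these three numbers is left open):

package petupgrade

import "k8s.io/apimachinery/pkg/api/resource"

// fitsOnNode reports whether raising this pod's request to newRequest still
// fits on its current node: allocatable minus what all *other* pods on the
// node request must be at least newRequest. Quantities are for a single
// resource (e.g. memory).
func fitsOnNode(allocatable, otherPodsRequests, newRequest resource.Quantity) bool {
  free := allocatable.DeepCopy()
  free.Sub(otherPodsRequests)
  return free.Cmp(newRequest) >= 0
}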

@smarterclayton
Contributor

smarterclayton commented Sep 10, 2016

We also have to manage the risk that two controllers race to create pods of divergent specs (possible today). If the petset controller loses its lease, another controller could observe different historical versions and create arbitrarily older specs. Even with a work log on the petset approach, a node controller could delete the pod (which removes the lock) and two versions of the pet set controller could try to create two versions of the pod. However, the pet pod would still have the generation of the petset embedded into it (we need to add this via downward API) and could use that as the config version record to control joining the cluster.

So basically, we can skip excessive cleverness in managing divergent specs by assuming that pods can be created out of order (by arbitrarily delayed defunct petset controllers) and requiring the pod to manage that as part of joining the cluster. So each pod is embedded with a numerically increasing integer representing the config version that it is based on (the generation).
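
A sketch of the pod-side half of that: the pet reads the generation it was stamped with (assumed here to arrive via a hypothetical downward-API env var) and refuses to join if the application's own recorded config version is already newer:

package main

import (
  "fmt"
  "os"
  "strconv"
)

// fetchClusterConfigVersion is a stand-in for however the application stores
// its cluster-wide config version (etcd's cluster version, a row in the DB, ...).
func fetchClusterConfigVersion() (int64, error) { return 3, nil }

func main() {
  // PETSET_GENERATION is a hypothetical env var populated via the downward API.
  myGen, err := strconv.ParseInt(os.Getenv("PETSET_GENERATION"), 10, 64)
  if err != nil {
    fmt.Fprintln(os.Stderr, "no generation stamped on this pet:", err)
    os.Exit(1)
  }
  clusterGen, err := fetchClusterConfigVersion()
  if err != nil {
    fmt.Fprintln(os.Stderr, "cannot read cluster config version:", err)
    os.Exit(1)
  }
  if myGen < clusterGen {
    // Created by a stale controller; refuse to join rather than re-introduce an old spec.
    fmt.Fprintf(os.Stderr, "generation %d is older than cluster config version %d; not joining\n", myGen, clusterGen)
    os.Exit(1)
  }
  fmt.Printf("joining cluster at config version %d\n", myGen)
}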

Straw man 1 - pet set controller actively describes its state transitions for clients on the petset

type PetSetStatus struct {
  // TargetSpec is set if the cluster is trying to transition to a pet set state that is divergent from spec
  // MUST include a generation
  TargetSpec *PetSetSpec
  // PreviousTargetSpec records a previous target set from the cluster - this would be the rollback target
  // MUST include a generation
  PreviousTargetSpec *PetSetSpec
}

A client uses petset.spec as the desired spec, and is able to view a transactional record of the mutations of the spec that the controller is trying to perform via targetSpec and previousTargetSpec. Example flow:

  1. User creates petset, spec is set
  2. Controller observes new petset, before it creates the first pod it must record a version of that spec in targetSpec via updating petset status and set the generation onto targetSpec.
  3. Controller then begins creating pods using the generation
  4. User modifies petset, spec is set
  5. Controller detects that petset spec != petset targetSpec, promotes spec -> targetSpec and targetSpec -> previousTargetSpec
  6. Controller updates with whatever targetSpec is
  7. User elects to begin a rollback by setting spec == previousTargetSpec
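
A minimal sketch of the promotion in step 5, with the spec trimmed down to an image plus generation so the snippet stays self-contained (these stand-in types are not the real API objects):

package petupgrade

// Stand-ins; the real spec would carry the whole pod template, not just an image.
type SpecSnapshot struct {
  Image      string
  Generation int64
}

type StatusSnapshot struct {
  TargetSpec         *SpecSnapshot
  PreviousTargetSpec *SpecSnapshot
}

// promoteIfChanged implements step 5: when the user-visible spec diverges from
// the recorded target, the old target becomes the rollback target and the new
// spec, stamped with the object's current generation, becomes the target.
func promoteIfChanged(spec SpecSnapshot, observedGeneration int64, status *StatusSnapshot) bool {
  if status.TargetSpec != nil && status.TargetSpec.Image == spec.Image {
    return false // nothing to do
  }
  status.PreviousTargetSpec = status.TargetSpec
  next := spec
  next.Generation = observedGeneration
  status.TargetSpec = &next
  return true
}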

Pros

  • Keeps all client reads on the object
  • Keeps history tidy

Cons

  • Object can get quite large with status - most info is unlikely to change
  • Limited history

Variant 1

Describe deltas with patches (from spec or target spec)

Pros

  • Concisely represent deltas

Cons

  • More cumbersome for naive clients to work with (can be mitigated via a subresource that expands it)
  • Have to formalize patch type into stored API - patches are not perfect (may force us to make patches correct for petset though which we have to do anyway).

Straw man 2 - use a sub resource / child resource to represent the history in etcd, rather than as part of PetSet

Create a new endpoint petsets/NAME/targetspec that can be retrieved / set by the controller to record whatever it is targeting. When target spec is set, previousTargetSpec is automatically updated.

Cons:

  • Hard to get everything at once as a client - client may have to execute 3 requests to figure out what is going on

Straw man 3 - Create a new resource type PetSetVersion that records the state a controller is targeting

Controller creates a new PetSetVersion every time it observes a new spec, and then always selects the newest (as recorded by generation) one as its creation target

Pros:

  • More consistent with Deployments
  • Allows longer history

Cons:

  • Much more going on
  • Longer history has to be cleaned up

@bprashanth
Contributor Author

In theory, today's petset already does upgrade, if you forget about rollback. Update the spec and its sync routine should kick in and recreate pods with a strategy of maxSurge=0, maxUnavailable=1 (I'm sure it doesn't work exactly as I'm thinking but I don't think it should be hard to make it). We're just augmenting it with the ability to remember exactly one older petset and rollback.

This feels like your (1).

Then I started looking for a way to involve a wrapper controller that allows us to do this safely (i.e., snapshot each pet before upgrade, and restore from snapshot before rollback). A wrapper controller would benefit from 2: client creates petset/NAME/targetSpec, wrapper copies petset.Spec into petset.Status.PreviousSpec, takes any snapshotting/eventing actions, updates petset.Spec to initiate upgrade.

Either way it sounds like we'd still need a cluster config int (essentially generation) piped down into the pets. Some modern databases like etcd maintain an internal cluster version; for those that don't, educating users on how to use it might be tricky (especially because the race is between master components).

@smarterclayton
Contributor

When you say wrapper controller, I hear babysitter. Is that true? If so I was envisioning the petset controller delegating to a babysitter when it detects spec changes, and waiting for some condition to be updated before continuing. Requiring a client to set something other than spec breaks apply, edit, replace, etc.


@smarterclayton
Contributor

The more I think about this the more convinced I am that more than two versions of pets in the cluster at any one time is probably surprising to users. Not sure we can guarantee it, but requiring the controller to observe that all running pets + the new pet results in no more than 2 versions before creating a pod seems safer and less surprising.


@bprashanth
Contributor Author

I was looking for a lighter weight deployment. We will soon support volume snapshotting, so that feels a little like a strategy. Not all the concepts of RS deployment map cleanly to petset, but the separation might afford us some flexibility. Even in the mixed version scenario, the deployment becomes the gatekeeper of petset spec. You can either wait for the petset to finish the current upgrade, or rollback, but if during a petset upgrade the user keeps applying new deployments it just updates its target, not the petset.Spec. The deployment.targetSpec is propagated to the petset.spec when the petset updates generation.

Maybe it is easier to invert this dependence and have petset reach out to a babysitter though.

@smarterclayton
Contributor

smarterclayton commented Sep 11, 2016 via email

@bprashanth
Contributor Author

It sounds like the focus is currently on detecting, from within a pet, when previous pets may be at different versions, and either hanging or exiting. The value add of rollback and pause feels marginal, because we have version control for a single revision back and "pause" already kind of works as a hack through a debug hook I put in (http://kubernetes.io/docs/user-guide/petset/#troubleshooting), right? Maybe we should focus on this one problem first?

as an afterthought: storing targetSpec in status also feels odd, I thought our policy around status was that it could be blown away at any time and recomputed by examining the state of the cluster?

@m1093782566
Contributor

m1093782566 commented Sep 12, 2016

Should we prevent scale up/down operations while an upgrade is in progress? I find it's hard to do, as Kubernetes has no mechanism that can easily achieve that. If we don't prevent it, it means we can end up with more than replicas+1 or fewer than replicas-1 instances in a cluster.

@smarterclayton
Contributor

as an afterthought: storing targetSpec in status also feels odd, I thought our policy around status was that it could be blown away at any time and recomputed by examining the state of the cluster?

Target spec would be reconstructable by observing the state of the cluster. Pod with highest generation becomes target spec (or alternatively second oldest generation).

pause

I think pause is mandatory - all of the arguments with pause around deployments apply here, because pause is the higher level primitive saying "don't apply this newest spec yet".

@smarterclayton
Contributor

So ordered state transition has to be done by a strongly consistent agent. I think we need that to be petset controller (so you don't have to implement a babysitter in all cases), and then the petset controller needs to delegate to a babysitter in a controlled fashion. We could conceivably also make it easy to fork the petset controller and customize it, but in practice that only works so well. Also, there should be the ability to set an annotation / field on petset that says "let someone else manage my semantics" as per #31571

@bprashanth
Contributor Author

I think the invocation of the babysitter is going to be important. There are going to be some very common patterns like snapshot upgrade that we can write simple babysitters for so everyone gets them for free. I also don't think it is specific to upgrades. On scale down, I want to delegate to a babysitter to pick the non master.

Also, there should be the ability to set an annotation / field on petset that says "let someone else manage my semantics" as per #31571

Ingress does this through annotations, we could start with that.
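
A sketch of that annotation-based opt-out, in the spirit of the Ingress class annotation (the key below is made up, not an existing Kubernetes annotation):

package petupgrade

// Hypothetical annotation; analogous to kubernetes.io/ingress.class.
const managedByAnnotation = "petset.alpha.kubernetes.io/managed-by"

// shouldManage reports whether the built-in petset controller should act on
// this object: it does unless another manager has explicitly claimed it.
func shouldManage(annotations map[string]string, self string) bool {
  owner, ok := annotations[managedByAnnotation]
  return !ok || owner == "" || owner == self
}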

@m1093782566
Contributor

m1093782566 commented Sep 13, 2016

Currently a pet can block forever. If we support a rollback strategy for upgrades, should we give up the in-progress upgrade and roll back on timeout (we may set a timeout value beforehand), or just keep blocking?

@m1093782566
Contributor

m1093782566 commented Sep 13, 2016

Annotations are useful for holding something like upgrade strategies in a transitional period, but the map[string]string type also has its limitations - it's unable to describe complicated things such as upgrade revisions.

@bprashanth
Contributor Author

what if we decoupled identity sync from upgrade?

  • the petset will sync identity, always, that's its job. Eg: if you modify the pod annotations so it gets a different hostname, or disconnect the PVC from pet-0, the petset controller fixes this.
  • Image is not part of identity, nor are secrets, extra hostpath volumes etc.
  • If you don't care about how image upgrade works, use kubectl rolling-update, it will patch the petset, then kill pods and wait for petset controller to recreate them
  • If you do care about image upgrade, run a babysitter in an RC (HA pod) that observes the petset for changes to spec (we currently disallow all changes except replicas and image), and takes the required actions.

We can probably write a standard babysitter that handles the following cases:

  1. Rollback: babysitter tracks LKG-spec through a subresource/status on petset.
  2. Pause: babysitter observes something written to petset. This has a chilling effect on both petset and babysitter, meaning identities can go out of sync.
  3. Upgrade: Unlike deployment, babysitter will record LKG-spec, finish an upgrade, wait for cluster to become healthy, then process next upgrade. If you wrote the wrong image version in a way that the upgrade is not going to complete, rollback.
  4. Upgrade + replica change: if the spec differs from the previous spec by both replicas and image, babysitter waits for petset to create new replicas (this happens with the new image), then deletes old replicas so petset creates them with the new image. This is overall risky so do them one at a time.
  5. In place upgrade: There are things you can do to ensure in-place upgrade, e.g. if only the image changes, babysitter modifies each pod with the new image and kubelet restarts the container in place. If you add a new hostPort, in place might not work.

We'd need some strong fencing guarantees to guarantee there are never 2 babysitters.

Babysitter plugins would be great, then one can leverage the same framework to move a role:master label around, or snapshot each pet before upgrade.
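
A rough sketch of the standard babysitter's outer loop for cases 1-3 above (pause, LKG tracking, rollback on a stalled upgrade); every method on the interface is a hypothetical stand-in:

package petupgrade

import "time"

type babysitterClient interface {
  PetSetSpec() (spec string, paused bool, err error) // some serialization of the pod template to diff on
  Pets() ([]string, error)
  DeletePet(name string) error // the petset controller recreates it from the current spec
  ClusterHealthy(timeout time.Duration) bool
  PatchPetSetSpec(spec string) error // used only to restore LKG on rollback
}

// syncOnce handles one observed spec change: remember the outgoing spec as
// last-known-good (LKG), recycle pets one at a time, and roll the petset back
// to LKG if the cluster never becomes healthy again. It returns the spec to
// treat as LKG on the next iteration.
func syncOnce(c babysitterClient, lkg string, healthTimeout time.Duration) (string, error) {
  spec, paused, err := c.PetSetSpec()
  if err != nil {
    return lkg, err
  }
  if paused || spec == lkg {
    return lkg, nil // pause has a chilling effect; do nothing
  }
  pets, err := c.Pets()
  if err != nil {
    return lkg, err
  }
  for _, p := range pets {
    if err := c.DeletePet(p); err != nil {
      return lkg, err
    }
    if !c.ClusterHealthy(healthTimeout) {
      // Wrong image or broken config: restore LKG and let the petset
      // controller recreate the remaining pets from it.
      return lkg, c.PatchPetSetSpec(lkg)
    }
  }
  return spec, nil // the new spec becomes the next LKG
}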

@smarterclayton
Contributor

smarterclayton commented Sep 14, 2016

Tried to brainstorm some of this today, some notes:

  • If we can guarantee storage locking is correct (if all storage formats either use an innate lock, or use the PV labels as locks and aren't broken by other controllers) then neither pet sets nor RCs need to do anything special to get pods to be "safe" in the presence of splits.
    • I.e., either proposed local PV or an NFS PV, if "attached" to a node at schedule time by the volume attach/detach controller prior to the kubelet actually allowing the pod to start, and only unbound by detach controller after clean shutdown ack'd by kubelet (not by pod deletion) would be sufficient to make arbitrary PV "safe" in the presence of down nodes
    • A fencer running on top of this system could observe a "stuck" volume and a down node and fence them. It could also preemptively fence by performing selective traffic operations, but that's a more advanced case
    • That removes the need for the PetSet to fence storage
  • There is a spectrum of "pets":
    • Single node, fast restart - "available"
      • A deployment controller with a PV that is correctly locked (as described above) and has scale 1 can satisfy this as can a petset. It's actually better to use a deployment controller here than a pet set because the D can spin up new pods optimistically.
      • A future variant can also include the controller trying to keep nodes "warm" by prepulling images or perform other actions so that a failover action would prefer those other nodes.
    • Two nodes, hot standby / failover
      • A PetSet is required because we must be able to ensure identity is locked
      • Failover is "at most one IP is answering traffic", plus possible fencing of network, plus some execution action in the failover pod to take over (mysql "step up" etc)
    • Quorum based sets incapable of doing dynamic member reconfiguration safely
      • Old zookeeper
    • Quorum based sets capable of doing dynamic member reconfiguration safely if they can be given a deterministic order
  • The petset controller, if it registers as a finalizer on a pod (@caesarxuchao also relevant from the GC discussion) would be able to control cleanup on the pod to ensure that all state has been terminated and observe the final cleanup of the pod. That allows us to have deterministic control over the state of the pet set and guarantee that that pod cannot be split brained, which means that we can guarantee at-most one instance of that pod is running in the cluster.
    • This could be leveraged as a way of making one pod responsible for reconfiguration duties a la babysitter
  • Babysitters could be
    • A hook run inside the non-splitbrainable pod as an exec job (could be accidentally killed if server restarts)
    • An init container in the first pod (would have to be reset on config changes which causes downtime)
    • A side car container watching the PetSet resource and reacting to config changes via list/watch
    • A separate pod run like a job that gets injected with a specific state and does what it needs
    • Whichever path we go down, we have to take resources into account. If you need to start a JVM client to trigger a reconfig (zookeeper) then you need enough memory to run that.
  • Babysitters can use the PetSet as their coordination record - using things like Paused and potentially other flags like "PauseAfterEachPet" or "PauseOnSpecChange" flags to coordinate with the petset controller in a consistent way.

I still prefer the babysitter as "one or more bash scripts in the same image as the server" for ease of reasoning by end admins (write a bash script like "pause petset, get list of pets with same version, delete first pet, unpause petset, wait, etc"). Prashanth still prefers "controller loop like server process", but we agreed that we could make the scripts common to both approaches like the "oc observe" demonstrated. We need to sketch out an actual upgrade.

Probably more things, I forgot. We'll try to schedule a follow up with people - in the meantime we've got some grist to chew on.
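
A sketch of the finalizer point in the list above. Pod finalizers are a real mechanism (the object is not removed until metadata.finalizers is empty), but the finalizer name and the surrounding bookkeeping here are made up:

package petupgrade

// Hypothetical finalizer name owned by the petset controller (or babysitter).
const petFinalizer = "petset.alpha.kubernetes.io/cleanup"

// releaseIfTerminated removes our finalizer once the controller has confirmed
// the pet's state is fully terminated, returning the updated finalizer list
// and whether anything changed. Until then the pod object cannot disappear,
// which lets the controller account for the pet's state before a replacement
// is created.
func releaseIfTerminated(finalizers []string, confirmedTerminated bool) ([]string, bool) {
  if !confirmedTerminated {
    return finalizers, false
  }
  out := make([]string, 0, len(finalizers))
  for _, f := range finalizers {
    if f != petFinalizer {
      out = append(out, f)
    }
  }
  return out, len(out) != len(finalizers)
}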

@smarterclayton
Contributor

I was looking for a good ansible / pacemaker example for DBs but haven't found one that I consider "something I would really trust". Will keep digging (although I really recommend people read the pacemaker docs - they're excellent at explaining the expected mechanics)

@smarterclayton
Contributor

Ah, found it for postgres: http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster

@chrislovecnm
Contributor

When are we going to get a design doc for this? Going to have PetSets in prod soon, and not being able to upgrade makes me squirm a bit.

@bprashanth
Contributor Author

fyi you can already upgrade pets like daemons, change the image and delete pets. If you don't care about the listed inadequacies, you can head to the races.

@chrislovecnm
Contributor

good work around, but I need to bump the container version ;(

@0xmichalis
Contributor

FWIW, DaemonSets will most probably use PodTemplates for retaining history: #31693

@smarterclayton
Contributor

Part of the hold on daemon sets is so we can talk about upgrade for stateful at the same time. I don't think pod templates are terrible, but it's not a foregone conclusion.


@k8s-github-robot

@bprashanth There are no sig labels on this issue. Please add a sig label by:
(1) mentioning a sig: @kubernetes/sig-<team-name>-misc
(2) specifying the label manually: /sig <label>

Note: method (1) will trigger a notification to the team. You can find the team list here.

@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 31, 2017
@enisoc enisoc added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jun 1, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 1, 2017
@kow3ns
Member

kow3ns commented Jul 11, 2017

/close
