Proposal for implementing nominal services AKA StatefulSets AKA The-Proposal-Formerly-Known-As-PetSets #18016

Merged on Oct 27, 2016 (1 commit)


smarterclayton
Contributor

@smarterclayton smarterclayton commented Dec 1, 2015

This is the draft proposal for #260.



@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 1, 2015
Like a replication controller, a PetSet may be targeted by an autoscaler. The PetSet makes no assumptions
about upgrading or altering the pods in the set (similar to a DaemonSet) - instead, the user can trigger
graceful deletion and the PetSet will replace the terminated member with the newer template once it exits.
Future proposals may offer update capabilities. A PetSet requires RunAlways pods.
Member

I think you mean RestartPolicyAlways (or "a restart policy of Always")
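
A minimal sketch of the replace-on-delete behavior described in the excerpt above. This is a toy in-memory model, not the controller's implementation: the `ToySet` class, the `pet-N` naming, and the use of plain strings as templates are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToySet:
    """Hypothetical in-memory model of a set with stable member identities."""
    name: str
    replicas: int
    template: str  # stands in for the pod template, e.g. an image tag
    members: dict = field(default_factory=dict)  # identity -> template used at creation

    def reconcile(self) -> None:
        # Create any missing member from the *current* template. Members that
        # still exist are left alone, even if their template is stale.
        for ordinal in range(self.replicas):
            identity = f"{self.name}-{ordinal}"
            if identity not in self.members:
                self.members[identity] = self.template

    def delete(self, identity: str) -> None:
        # Models a user-triggered graceful deletion that has completed.
        self.members.pop(identity)


s = ToySet(name="pet", replicas=3, template="v1")
s.reconcile()
s.template = "v2"   # changing the template alone alters no running member
s.reconcile()
s.delete("pet-1")   # the user triggers deletion of one member
s.reconcile()       # the replacement keeps its identity and gets the new template
print(s.members)
```

In this model only `pet-1` ends up on the newer template; the other members keep running the old one until they are deleted in turn.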

@davidopp
Member

davidopp commented Dec 1, 2015

I like this design a lot. I think this is going to make a lot of users very happy, and will make it practical for normal users to deploy applications that today require bizarre contortions (like creating one controller per pod).

@eswarbala

Nice proposal, and looking forward to this! I am one such user, looking forward to moving away from one RC per instance to achieve this.

@ant31
Member

ant31 commented Dec 2, 2015

Active-active
*Need active-active master example - galera? *

I propose RabbitMQ as an active-active example. It also has some specific requirements, being master-master.
i.e., when the entire cluster is brought down, the last node to go down must be the first node to be brought online.
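
A minimal sketch of that ordering constraint, assuming the order in which nodes went down has been recorded somewhere. The `startup_order` helper and the node names are hypothetical; only "last down, first up" comes from the comment above, and starting the remaining nodes in reverse order is an assumption.

```python
def startup_order(shutdown_order):
    """Given node names in the order they went down, return a start order
    in which the last node to stop is the first node to start."""
    return list(reversed(shutdown_order))


# Hypothetical three-node cluster, stopped in ordinal order.
order = startup_order(["rabbit-0", "rabbit-1", "rabbit-2"])
print(order)
```

A controller that always starts members in a fixed ordinal order would only satisfy this constraint if it also stopped them in the reverse of that order.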

@ant31
Member

ant31 commented Dec 2, 2015

Instances can migrate from machine to machine as necessary and are not tied to an instance

This requires using network storage. I would prefer to be able to choose whether or not my pets can be rescheduled onto another node. If that were possible, using local storage on the cluster would be easy.
A simple example is when someone wants to mount a volume like `hostPath: /mnt/mypet-$IDENTITY`.
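
A sketch of what such a per-identity volume might look like, written as a Python dict that mirrors a pod volume entry. The `$IDENTITY` substitution is the commenter's hypothetical notation, and the `host_path_volume` helper and the volume name `data` are invented here for illustration.

```python
def host_path_volume(identity: str, base: str = "/mnt") -> dict:
    """Build a hostPath volume entry whose path embeds the member identity.

    Hypothetical helper: it stands in for whatever templating mechanism
    would expand $IDENTITY for each member of the set.
    """
    return {
        "name": "data",
        "hostPath": {"path": f"{base}/mypet-{identity}"},
    }


print(host_path_volume("3"))
```

Because a hostPath volume lives on one node, a member using it could only keep its data if it were always scheduled back onto that node, which is the locality concern raised here.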

@ncdc
Member

ncdc commented Dec 2, 2015

cc @kubernetes/rh-cluster-infra @kubernetes/rh-scalability

@smarterclayton
Contributor Author

Thanks, will add rabbit as an example.


@smarterclayton
Contributor Author

This is a broader topic, and I agree it is important. I will add a section
to describe the impacts and summarize the current state. Unfortunately
locality is probably out of scope for rev 1 of this proposal because it
requires touch points elsewhere in the stack.

It is likely that locality is associated with volumes, not the sets - so as
long as we target the templates to allow locality to be specified it's not
a serious issue here.

If you care about locality today the DaemonSet is the appropriate tool. I
will make sure to clarify that in the design assumptions section.


@timothysc
Member

So the key sticking point that I'm missing is gravity, forgiveness, and recovery. Once a pet has found its home, it's not going to want to leave unless there is a maintenance / migration plan.

Otherwise many clustered systems will attempt to recover from a failure condition when in fact it was a planned outage.

@smarterclayton
Contributor Author

The first two are separate issues and while they can and should be
mentioned here, my base position is that they are orthogonal (unless we can
demonstrate otherwise).

I'm actually going to walk back on saying update is out of scope a bit - a
pragmatic requirement that reduces the need for gravity is the potential
for in place update.

Forgiveness is described in another issue but we can implement this without
forgiveness being implemented. There is not a lot of disagreement on
forgiveness that I've heard, just that how we record the policy for
forgiveness is less clear.

For recovery, is this post-pod-death recovery? The new pod is allowed to
init whatever it wants upon startup. What examples of recovery above the
pods were you thinking of?


replicas exist as quickly as possible (by creating new pods as soon as old ones begin graceful deletion, for
instance). In addition, pods by design have no stable network identity other than their assigned pod IP,
which can change over the lifetime of a pod resource. ReplicaSets are best leveraged for shared-nothing,
zero-coordination software.
Member

other applicable adjectives: stateless, embarrassingly parallel, fungible

@chrislovecnm
Contributor

chrislovecnm commented Sep 5, 2016

@solsson https://github.com/kubernetes/charts is probably a great place for your kafka example.

@m1093782566
Contributor

/subscribe

@hex108
Contributor

hex108 commented Sep 19, 2016

@smarterclayton Thanks for the proposal!

We are using PetSet, and we find these additional features would be useful for our use case:

  1. Stop and restart a user-specified pet, e.g. stop pet-3, after which pet-3 will never run again unless the user restarts it.

  2. During a rolling update, we'd like to update the image version of only some pets and keep testing them for a sufficiently long time. This means that, for a long period, different pets in the same PetSet would use different image versions.

Will these be supported in the future? Thanks!
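
A minimal sketch of the second request, assuming pets are numbered by ordinal as in `pet-3` above. The `image_for` helper, the `canary_from` parameter, and the image names are hypothetical and are not part of the proposal.

```python
def image_for(ordinal: int, canary_from: int, old: str, new: str) -> str:
    """Pick the image for one pet during a staged rollout.

    Pets with an ordinal at or above `canary_from` run the new image;
    all others stay on the old one until the threshold is lowered.
    """
    return new if ordinal >= canary_from else old


# Hypothetical five-pet set where only pet-3 and pet-4 run the new image.
images = [image_for(i, canary_from=3, old="app:1.0", new="app:1.1") for i in range(5)]
print(images)
```

Holding `canary_from` at a fixed value keeps the set in this mixed state for as long as testing requires; lowering it to 0 would complete the rollout.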

@k8s-bot

k8s-bot commented Sep 19, 2016

GCE e2e build/test passed for commit b9d998f.

@m1093782566
Contributor

m1093782566 commented Sep 20, 2016

  1. Stop and restart user specified pet, e.g. stop pet-3, then pet-3 will never be running unless user restarts it.

We can integrate with deployment to achieve this (perhaps in the future).

  2. When rolling update, we'd like to update some pets' image version and keep testing for some time, then we could test them for a long and enough time.

Did you try the initialized annotation?

@hex108
Contributor

hex108 commented Sep 20, 2016

We can integrate with deployment to achieve this (perhaps in the future).

Very glad to know it. PetSet and deployment are used for different use cases, so I think PetSet needs it too. Is there any plan to support it in PetSet?

Did you try the initialized annotation?

If I understand correctly, it is used for initialization, so it cannot be used for a rolling update after the app has been running for some time. Could you explain it further? Thanks!

@smarterclayton
Contributor Author

Will update this after the rename and the pod safety proposal #34160 is reviewed.

@thockin thockin assigned smarterclayton and unassigned thockin Oct 20, 2016
@smarterclayton
Contributor Author

Proposal has been updated to reflect the changes to naming as decided in #27430 and has included the beta and GA criteria as described.

@bprashanth I think this is ready for merge and subsequent changes can be reflected as updates.

@smarterclayton smarterclayton changed the title Proposal for implementing nominal services AKA PetSets Proposal for implementing nominal services AKA StatefulSets AKA The-Proposal-Formerly-Known-As-PetSets Oct 26, 2016
Contributor

@bprashanth bprashanth left a comment

LGTM. The comments were nits; I'm fine with fixing them or letting it in as is.

* Add examples
* Discuss failure modes for various types of clusters
* Provide an active-active example
* Templating proposals need to be argued through to reduce options
Contributor

aren't all these already done?

rebalances.
* Active-active
* Galera - has multiple active masters which must remain in sync
* ???
Contributor

master slave? (non-quorum, unilateral master)



## Design Assumptions

Contributor

In retrospect these assumptions feel a little pessimistic. We've discussed:
* External access direct to cluster members is out of scope
* No built-in update
* Limited scaling

on issues for longer than we should have if we were just designing something that ignores any of that ;)

Contributor

@bprashanth Out of curiosity - does the "limited scaling" also include limited down-scaling? Is it possible to say I can't go lower than 3-4 replicas?

## Proposed Design

Add a new resource to Kubernetes to represent a set of pods that are individually distinct but each
individual can safely be replaced-- the name **StatefulSet** (working name) is chosen to convey that the
Contributor

no longer (working name)?

individual can safely be replaced-- the name **StatefulSet** (working name) is chosen to convey that the
individual members of the set are themselves "members" and thus each one is preserved. A relevant analogy
is that a StatefulSet is composed of members, but the members are like goldfish. If you have a blue, red, and
yellow goldfish, and the red goldfish dies, you replace it with another red goldfish and no one would
Contributor

goldfish analogy doesn't work half as well

Contributor Author

shame, it was an awesome analogy.


Requested features:

* IPs per member for clustered software like Cassandra that cache resolved DNS addresses that can be used outside the cluster (scope growth)
Contributor

Might be worth noting that service per pod might also not solve this case because there is delay in updating endpoints

@paralin
Contributor

paralin commented Oct 26, 2016

StatefulSets now? Oh boy. I guess this is just the nature of alpha / work in progress features. Will get started refactoring now.

... also, isn't the idea behind petsets / statefulsets keeping things healthy with persistent identities rather than culling off unhealthy pets? I certainly would NOT kill off one of my pets if it got sick. And I don't think Cassandra would appreciate random nodes dying, either.

@bprashanth bprashanth added this to the v1.5 milestone Oct 26, 2016
@smarterclayton
Contributor Author

Applied prashanth's changes - labelling. Thanks for round 1 of feedback :)

@smarterclayton smarterclayton added lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Oct 27, 2016
@bprashanth
Contributor

LGTM (we're just a month away from 1 year to merge though)

@smarterclayton
Contributor Author

The amount of comments definitely breaks github, for sure.


@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 4773b71 into kubernetes:master Oct 27, 2016

Requested features:

* IPs per member for clustered software like Cassandra that cache resolved DNS addresses that can be used outside the cluster
Contributor

Does this mean it would be possible to access a specific cluster member from the outside of the cloud? Or by outside of the cluster you mean "within the same namespace"?

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Automatic merge from submit-queue

Proposal for implementing nominal services AKA StatefulSets AKA The-Proposal-Formerly-Known-As-PetSets

This is the draft proposal for kubernetes#260.
Labels
area/stateful-apps kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.