Convert ReplicationController to a plugin (was ReplicationController redesign) #3058
I want to push back here a little. This is a pretty big change in our model. I can see the reasons to do this, but I think we are making our system look very complicated. It is perhaps unstated, but my understanding of a core principle for Kubernetes is: Small, easy to understand composable objects that can be combined for higher order behavior. The number of options here busts the "small and easy to understand" part of this. One other thing -- we always imagined that the ReplicationController was one of a class/set/type of objects in our system. We want to encourage users to think about other "active" controller objects that can take on management policy for them. What if we have a set of objects but have them share implementation in the back end? There is a bunch of overlap here, but I think we can steer users toward a set of concepts that are easier to understand, and encourage them to think about and build other controllers that can play in this world.
As for the feature set implied here, I'd expect that "PerNode" would include a node selector specifying which nodes to run this pod on.
I think we should separate the questions of whether these policies should [...]
I like the idea of putting these policies into a single object; decomposing [...]
However, I think we should start simple with the policies. For example, the [...]
I wasn't sure what the two different fields of CancellationPolicy meant but [...]
In this way, basically everything from ForgivenessPolicy down in Brian's [...]
@jbeda Much of this functionality we don't need right now. I wrote this up to force a conscious decision on our direction. At present, ReplicationController is the only simple object. Service and Pod are complex and getting more complex over time. I think the questions are:
The per-node case doesn't need a selector. Pods already have one. There's no point in creating pods that won't pass the selector fit predicate. That should be easy for the per-node controller to figure out. @davidopp CancellationGracePeriodSeconds sounds like what we call "preemption notice" internally. That should be a property of the pod, not the controller. Forgiveness is on the controller, because it's responsible for performing the replacement. I could see a case for putting it on the pod as well, though, even though Kubelet would never act upon it. The node controller could act upon the node-related events. We could put some of these in separate objects. For example, a separate entity could independently enforce deadlines. However, functionality that would modify the core behavior of the replication controller could not be separated out. Considerations for inclusion include similarity to other responsibilities, whether inclusion would obstruct composability, and the complexity of more moving parts. Deadline enforcement, for instance, wouldn't preclude an external agent from shutting down the controller, so it doesn't obstruct composability, and it's simple enough and similar enough to other termination decisions that it should be straightforward to include, assuming we don't split control of terminating workloads into another object. I'm fine with starting with persistent rather than full-blown forgiveness, though no users have requested it yet. I'd still put it into ForgivenessPolicy. The cancellation policies assumed a delta between creation and start -- deferred scheduling. I wouldn't imagine we'd include this anytime soon. In the immediate future, I'd imagine only including stop and replicationPolicy (in addition to the selector, template, and ref) -- almost no additional complexity added to the current object. What I explicitly want to keep out of the controller:
(1) is self-explanatory. (2), (3), and (4) need to be able to span multiple controllers. Users will want an unbounded variety of policies for (2), (4), and (5). (4) also potentially requires a greater degree of consistency than we'll eventually want to promise.
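For concreteness, a minimal sketch of what that "immediate future" object could look like (illustrative only; the shape and names such as ControllerSpec, ReplicationPolicy, and Stop are assumptions, not the proposed API):

```go
// Hypothetical sketch: a controller spec limited to the fields mentioned
// above (selector, template, replication policy, stop). Names are
// illustrative, not the actual Kubernetes API.
package api

// ReplicationPolicy says how many copies of the template to keep running.
type ReplicationPolicy struct {
	Replicas int
}

// ControllerSpec is the whole object: almost nothing beyond what the
// current ReplicationController already carries.
type ControllerSpec struct {
	// Selector identifies the pods this controller manages.
	Selector map[string]string
	// Template describes the pod to create when adding or replacing pods.
	Template PodTemplateSpec
	// ReplicationPolicy holds the desired replica count.
	ReplicationPolicy ReplicationPolicy
	// Stop, when true, tells the controller to wind down its pods
	// instead of replacing them.
	Stop bool
}

// PodTemplateSpec stands in for the existing pod template type.
type PodTemplateSpec struct {
	Labels map[string]string
	// containers, volumes, restart policy, etc. elided
}
```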
Some alternatives to manage complexity without separate objects:
I forgot one of my most important principles: neither pods nor replication controllers should be considered permanent, durable entities (i.e., pets). Pods should not have identities that survive rescheduling to new nodes, for the following reasons:
Replication controllers should not convey identities to pods upon creation, either:
Work/role assignment should really be dynamic, using master election, fine-grained locks, shard assignment, pubsub, task queues, load balancing, etc. Nominal services (#260) are a convenience to address relatively static cases.
I also want to minimize mutation of pods vs. the template. That should be relegated to higher-level objects, client-side configuration generation, setting of defaults when creating the pod template, and other controllers, such as auto-sizers (e.g., setting pod/container cpu and memory requirements/limits).
I think I must have misunderstood CancellationPolicy. What does it mean? Why is forgiveness in the overseer spec rather than the pod template? Is there a general rule to know when some property should be a property of the overseer, and when it should be part of the pod definition?
Don't worry about CancellationPolicy. It's just an example of a policy we might add. If Kubelet, the node controller, or the scheduler will consume it, then it needs to go in the pod. If only the overseer needs it, then it doesn't belong in the pod.
Also, in general, properties about sets of pods don't belong in the pod/podtemplate.
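To illustrate the placement rule from the two comments above (an illustrative sketch; CancellationGracePeriodSeconds is the example mentioned earlier in the thread, the other names are assumptions):

```go
// Illustrative sketch of the placement rule: properties consumed by
// Kubelet, the node controller, or the scheduler go on the pod;
// properties about the set of pods go on the controller ("overseer").
// Names are hypothetical.
package api

// PodSpec carries properties that node-level components act on.
type PodSpec struct {
	// Consumed by the scheduler when placing the pod.
	NodeSelector map[string]string
	// Consumed by Kubelet when shutting the pod down (the
	// "preemption notice" mentioned above).
	CancellationGracePeriodSeconds int64
}

// OverseerSpec carries properties about the *set* of pods, which only
// the controller itself needs; nothing on the node ever reads these.
type OverseerSpec struct {
	Replicas int
	Selector map[string]string
}
```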
To play the devil's advocate for a moment here, why should any configuration information go in the overseer (outside of the pod template)? It seems there are many benefits to putting configuration information in pods rather than in the overseer. For example,
I agree it would be nice if the overseer didn't have to read any pod information to do its work, but this seems a somewhat weak reason if it's the only reason. Moreover, IIUC the overseer needs to list pods anyway to know which pods it is managing. So it seems to me that the only information you really need to put in the overseer is exactly the information it has today -- number of replicas, label query, and pod template. I guess the one counter-argument would be if you want to hand off control of a pod from one overseer to another and with the handoff automatically change some behavior. Making that behavior be a configuration property of the controller would avoid having to update the pod. But are there any examples of this use case? And wouldn't you need to update the pod's labels to hand off the pod anyway?
@davidopp Pods can be started without any controller. It doesn't make sense to specify behavior in the pod that can't be implemented without the assistance of another entity.
I want to strongly object to the term "overseer". This is a pretty generic term that could apply to any number of objects in Kubernetes. We might as well call it the "manager".
I just want to echo jbeda's comments here, "Small, easy to understand composable objects that can be combined for higher order behavior". For me this is really the advantage of Kubernetes over many heavyweight orchestration systems. We should not overload Kubernetes itself but rather compose an ecosystem of tools around it. Kubernetes as a low-level cluster management building block should do a few things really well and then allow others to build systems around it. It's very early days, and the key to a sustainable, long-lived project that gains mass adoption is to provide an easy-to-use, powerful v1.0 that doesn't try to be everything for everyone.
Can we try to sketch out the $whatever_we're_renaming_replication_controller_to use cases to try to figure this out? My recent experience was with a system where having too many components, and designing the mechanisms for them to interact, was a major source of pain, so I'm going to be biased towards the maximally monolithic approach in pretty much any situation. But I think we can approach this objectively by thinking about use cases. For example, having a separate controller for run-everywhere pods vs. non-run-everywhere pods seems like overkill to me, but probably the only downside is code duplication. But once we start talking about having a separate replication controller for each of the properties in Brian's sample code (which is what you guys are suggesting, or am I misunderstanding?), and having multiple controllers simultaneously manage multiple pods, I think things get confusing quickly. But sketching out more scenarios of how we see controllers being used might help.
I think people are mostly reacting to example policies, so I've removed them. I also removed forgiveness, which should go in Pod, and the suspend/stop bits, which are not specific to replication controller. What remains still stands out as by far the simplest object in Kubernetes. I agree that we want composable building blocks, and I thought fairly carefully about what functionality I proposed including in replication controller vs. what functionality was important to keep separate. The rationale was documented above. As for what we call it, let's bikeshed on that in #3024.
You've now reduced it to the same structure as volumes. I've always been on the fence about this aspect, finding value in both approaches. My main concern is that we had everyone in the room and the general consensus was to use different objects, and now this proposal is the opposite of that. I'm afraid that if we try to be everything to every case, we will serve none of them well. What does Spec.Selector represent? What about the proposed "job controller" - is that another policy, or are we going to overload logic based on the template's restart policy? What about the proposals around durable/replicated data (I'm still catching up on email)? Will policy be arbitrarily extensible (like volume plugins should be)? How will that actually work for a cluster admin to install extensions?
It feels to me like we understand two categories of use cases of replication controllers well: using them for heterogeneous deployments (rolling upgrade, multiple release tracks, etc.), and using them for different node selections (e.g. run-on-every-node vs. normal services). But there's another axis of customization that people keep alluding to but I don't yet understand, namely having multiple replication controllers (or different kinds of controllers, maybe one replication controller plus other types of controllers) manage different behaviors of the same set of pods. How does this work? What are some examples? I'm not saying I think this is a bad idea, but I think that being more concrete about this will help us make decisions about how to decompose the configuration of these behaviors.
@davidopp I think different "controllers" for different semantic purposes is not a bad thing, even if it seems like two or four or eight things could be rolled into one with a few config points. As my point of view is mainly as a software engineer doing functional programming, I imagine that what I want is a large set of simple primitives with absolutely minimal config (and where possible, consistent config fields across these primitives). There are lots of other analogies: lego bricks, etc. I honestly feel that anything other than trivial config files is incredibly confusing and frustrating, hiding what should be programming logic. I would be very happy to have a [...]
Thoughts on the separate-object approach:
I think the best motivating example for the separate-object approach is the job controller (#1624). That's likely to need a number of features specific to bounded-duration/deferred-execution batch/workflow types of jobs: queuing and execution deadlines, success/failure aggregation, gang scheduling and/or admission control, max-in-flight limits (e.g., run 50 pods, no more than 10 at a time), inter-job dependencies (A before B), pre-start auto-scaling/auto-sizing, intelligent response to out-of-resource events, ... Potentially nominal services could be considered "controllers" also (controlling services). So could auto-scalers (controlling replication controllers). The flavor of controller could maybe be inferred from the Kind of object controlled, or perhaps that needs to be part of the reflection API, also. I think the separate-object approach could work, but we need to design the reflection API that would enable meta-programming for multiple kinds of controllers with similar features. Other thoughts on the per-node controller:
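To make the per-node case concrete (a hypothetical sketch; neither the policy type nor the helper below comes from the thread): the controller needs no replica count, because the desired count falls out of how many nodes match the pod template's own node selector.

```go
// Hypothetical per-node ("one pod per eligible node") controller logic.
// Purely illustrative; names do not correspond to a real API.
package controller

// PerNodePolicy carries no replica count: the count is implied by the
// number of nodes matching the pod template's node selector.
type PerNodePolicy struct{}

// desiredNodes returns the indices of nodes a per-node controller would
// target, given each node's labels and the pod template's node selector.
func desiredNodes(nodeLabels []map[string]string, nodeSelector map[string]string) []int {
	var matching []int
	for i, labels := range nodeLabels {
		fits := true
		for k, v := range nodeSelector {
			if labels[k] != v {
				fits = false
				break
			}
		}
		if fits {
			matching = append(matching, i)
		}
	}
	return matching
}
```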
To also capture what Brian and I discussed a bit yesterday: The first question to ask is: Do we really think people will want to make controllers outside of Kubernetes code? If the answer is no, then a monolithic compound object seems net simpler. If the answer is yes, then I think separate objects are simpler. I'm going to assume the answer is yes, because I want it to be :) Here's how I mentally run through the differences in how the models should work. Separate API objects:
Monolithic API object:
In both cases we need an extension point for plugins. In both cases we need the API to be more dynamic. In both cases we need to do a bunch of library abstraction so 3rd party plugins can re-use logic (if they want). If we want a discovery system ("dear apiserver, please tell me all Kinds that act as pod controllers"), we need it in both cases. If we want to have generic code that can operate on things without knowing what they are, the problem is isomorphic for both models. But the separate objects case gives us generic REST plugins, which I think we want in both models (net less work, net fewer concepts). It also retains the existing ReplicationController API. So I fall on the side of separate objects, though only slightly. Now we can argue about services - are services the same? Much of the same argument applies.
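One way to picture the shared library abstraction and discovery question above (entirely hypothetical; no such interface is defined in the thread): whether controllers are separate Kinds or policies on one monolithic object, each plugin ends up needing roughly the same operations against the apiserver.

```go
// Hypothetical shape of a pod-controller plugin and the client operations
// it needs. Illustrative only; not an interface that existed at the time.
package controller

type Pod struct {
	Name   string
	Labels map[string]string
}

// PodClient stands in for the apiserver operations every controller
// needs: list the pods it manages, create replacements, delete extras.
type PodClient interface {
	ListPods(selector map[string]string) ([]Pod, error)
	CreatePod(template Pod) error
	DeletePod(name string) error
}

// PodController is the common shape of any controller plugin: reconcile
// the pods it observes against its desired state.
type PodController interface {
	Reconcile(client PodClient) error
}
```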
Ah OK thanks, some parts of the thread were about separating different bits of functionality for the same set of pods into separate controllers, so I didn't realize the most recent comments were just about different controllers for different pod types. So I withdraw my objection, which was only applicable to that situation. :-)
For the record, we should do this with generic REST plugins, in or out of [...]
To finalize the proposals for the controllers planned in the immediate future:
This is decided enough for 1.0, so demoting implementation to P3. |
Currently the implementation is on hold for 1.0, correct? Is there any work in progress on this feature at the moment? In the spirit of Kubernetes being modular, multiple small [...]
The decision was to create new controllers for new use cases (e.g., per-node/daemon, RestartOnFailure pods) rather than make ReplicationController more complicated. To facilitate that, we need to:
There is work underway on (3), #5012, and (4), #5270. AFAIK, there's no direct work on (1) or (2), but we are converting existing objects, including ReplicationController, to use the generic registry implementation and have created a pattern for posting status back from controllers as part of #2726. If the amount of code required for a new controller were sufficiently reduced by other ongoing efforts, we needn't block implementation of new controllers on the generic plugin mechanism, but the plugin mechanism would be of independent utility.
I'm not sure we should just say every conceivable variation merits a new replication controller. Per-node/daemon might be different enough from a regular replication controller to deserve being separate, but things that can be expressed trivially with one bool (like restart-on-failure vs. don't) should just be configuration parameters for the RC we already have IMO.
Obsolete
Continued from #1518 and #3024, and also relates to #1624.
Since there will necessarily be a higher barrier to entry for new objects, we should make replication controller composable (e.g., with auto-scalers and deployment managers), pluggable, and hookable (#2804), so that we can delegate decisions such as which pod to kill to decrease the replica count, when pods should be killed and replaced (esp. moving to new nodes), when the whole collection should terminate (for RestartPolicyOnFailure), etc. A sketch of what such a hook could look like follows.
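For example, a victim-selection hook might look roughly like this (a minimal sketch with assumed names; no such hook existed in the API at the time):

```go
// Hypothetical hook for delegating the "which pod do we kill when the
// replica count decreases" decision to an external policy. Illustrative only.
package controller

type Pod struct {
	Name     string
	NodeName string
	Labels   map[string]string
}

// ScaleDownHook picks which pods to delete when the controller needs to
// reduce the replica count by n; the controller would fall back to its
// own default ordering if no hook is registered.
type ScaleDownHook interface {
	SelectVictims(candidates []Pod, n int) ([]Pod, error)
}
```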
For jobs that terminate (RestartPolicyOnFailure), I'd make the count auto-decrement on success. I'm tempted to say we should support terminating jobs of the per-node variety, but then we'd have to keep track of which nodes the pods had executed on, which seems ugly.
It should support graceful termination (#1535), but I've removed that from this sketch, since it's not specific to replication controller.
Stab at a definition, also using my latest proposed name change:
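The definition itself is not preserved in this copy of the thread. As a rough placeholder, a policy-union shape in the style of volumes, as described in the comments above ("the same structure as volumes"), might look like the following; all names are assumed:

```go
// Rough, hypothetical reconstruction of the kind of definition discussed
// in this thread: a selector, a pod template, and a union of policies in
// the same style as volume sources. Not the original proposal.
package api

// PerNodePolicy: run one pod on each node eligible per the pod template.
type PerNodePolicy struct{}

// ReplicatedPolicy: keep a fixed number of copies running.
type ReplicatedPolicy struct {
	Count int
}

// ControllerPolicy is a union: exactly one field should be set,
// analogous to how volume sources are expressed.
type ControllerPolicy struct {
	PerNode    *PerNodePolicy
	Replicated *ReplicatedPolicy
}

// PodControllerSpec is the controller object itself.
type PodControllerSpec struct {
	// Selector identifies the pods this controller manages.
	Selector map[string]string
	// Template describes pods to create.
	Template PodTemplateSpec
	// Policy selects the controller's behavior.
	Policy ControllerPolicy
}

// PodTemplateSpec stands in for the existing pod template type.
type PodTemplateSpec struct {
	Labels map[string]string
	// containers, volumes, restart policy elided
}
```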
Forgiveness example was moved to #1574.
Additional example policies have been removed.
/cc @smarterclayton @thockin @brendandburns @erictune @lavalamp