Initial Daemon controller proposal #10210
Conversation
@bgrant0607 @davidopp @AnanyaKumar @vmarmol @vishh @erictune Saving the daemon controller discussion as a proposal here.
GCE e2e build/test failed for commit 4395a996c248282f2d7ebc7cf42e5ae1f1df4dca.
GCE e2e build/test passed for commit 36a97c071860a90caa4f30ef497f447788309074.
Thanks a lot @rjnagal!
@rjnagal can you pick a reviewer, please?
@bgrant0607 for the interim
Ref #1518
Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The daemon controller can be used to run a data collection service (for example fluentd) and send the data to a service like Elasticsearch for analysis.
### Cluster-Level Applications
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A daemon controller is a convenient way to implement such a datastore. At Google, D is an example of such an application.
I don't think we've published D, but the GFS chunkserver was described in the GFS paper:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
1. Include the daemon in the machine image
2. Use config files to launch daemons
3. Use static pod manifests to launch daemon pods when the node initializes (a sketch follows this list)
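For context on alternative 3, here is a minimal sketch, not taken from the proposal, of launching a daemon via a static pod manifest: a pod definition is written into the kubelet's manifest directory (commonly /etc/kubernetes/manifests, configurable via --pod-manifest-path), and the kubelet runs and restarts it without involving the API server. The pod name, image, and path below are illustrative assumptions.

```go
// Illustrative sketch of alternative 3: write a static pod manifest into the
// kubelet's manifest directory so the kubelet launches the daemon at node startup.
// The path and pod contents are examples, not values from the proposal.
package main

import (
	"encoding/json"
	"log"
	"os"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := v1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"}, // required in a manifest file
		ObjectMeta: metav1.ObjectMeta{Name: "node-logging-agent"},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{
				Name:  "fluentd",
				Image: "fluent/fluentd:latest",
			}},
			RestartPolicy: v1.RestartPolicyAlways,
		},
	}

	data, err := json.MarshalIndent(pod, "", "  ")
	if err != nil {
		log.Fatal(err)
	}
	// The kubelet watches this directory (set with --pod-manifest-path) and runs
	// any pod manifests it finds there, restarting them if they die.
	if err := os.WriteFile("/etc/kubernetes/manifests/node-logging-agent.json", data, 0o644); err != nil {
		log.Fatal(err)
	}
}
```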
- Listens for the addition of new nodes to the cluster by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager loops through each daemon controller. If the node's labels match the selector of a daemon controller, the daemon manager creates the corresponding daemon pod on the new node.
- The daemon manager will create a pod on a node by sending a command to the API server, requesting that a pod be bound to the node (the node will be specified via its hostname); a sketch of this step follows below.
#### Kubelet
- Does not need to be modified; it health-checks the daemon pods and revives them if they are killed (we will set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always.
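A minimal sketch of the pod-creation step described above, assuming today's client-go API rather than the 2015-era client. Whether the manager issues a Binding or sets spec.nodeName directly is an implementation detail; the helper names and labels here are hypothetical.

```go
// Illustrative sketch: the daemon manager creates a pod that is pinned to a specific
// node, bypassing the scheduler. Helper names are assumptions, not code from this PR.
package daemon

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// newDaemonPod builds a pod from the daemon controller's template and binds it to
// the given node by setting spec.nodeName.
func newDaemonPod(template v1.PodTemplateSpec, nodeName string) *v1.Pod {
	spec := template.Spec
	spec.NodeName = nodeName                    // pin the pod to this node
	spec.RestartPolicy = v1.RestartPolicyAlways // daemons must always be restarted
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			GenerateName: "daemon-",
			Labels:       template.Labels,
		},
		Spec: spec,
	}
}

// CreateDaemonPod asks the API server to create the daemon pod on the named node.
func CreateDaemonPod(ctx context.Context, client kubernetes.Interface, namespace string, template v1.PodTemplateSpec, nodeName string) error {
	pod, err := client.CoreV1().Pods(namespace).Create(ctx, newDaemonPod(template, nodeName), metav1.CreateOptions{})
	if err != nil {
		return fmt.Errorf("creating daemon pod on node %q: %w", nodeName, err)
	}
	fmt.Printf("created %s on %s\n", pod.Name, nodeName)
	return nil
}
```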
Validation of restartPolicy belongs under Apiserver.
Yup, I realized when I implemented the Apiserver part of the daemon controller :) Thanks!
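Following up on the exchange above, a rough sketch of what the apiserver-side check could look like; the DaemonControllerSpec type and the function name are placeholders for illustration, not the code that was actually merged.

```go
// Illustrative sketch of apiserver-side validation: reject any Daemon Controller
// whose pod template does not set restartPolicy to Always.
package validation

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/validation/field"
)

// DaemonControllerSpec is a stand-in for the proposed API object's spec.
type DaemonControllerSpec struct {
	Selector map[string]string
	Template *v1.PodTemplateSpec
}

// ValidateDaemonControllerSpec returns a field error if the pod template is missing
// or its restart policy is anything other than Always.
func ValidateDaemonControllerSpec(spec *DaemonControllerSpec, path *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.Template == nil {
		allErrs = append(allErrs, field.Required(path.Child("template"), ""))
		return allErrs
	}
	if spec.Template.Spec.RestartPolicy != v1.RestartPolicyAlways {
		allErrs = append(allErrs, field.NotSupported(
			path.Child("template", "spec", "restartPolicy"),
			spec.Template.Spec.RestartPolicy,
			[]string{string(v1.RestartPolicyAlways)}))
	}
	return allErrs
}
```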
Thanks. My only leftover question is whether showing aggregated status makes sense for higher-level objects like Replication Controller and Daemon Controller. Yes, pod status can be looked at for all the pods controlled by one of these controllers, but if the Controller's status encoded a union of all the Pods' statuses, it would be helpful to the end user, imo.
@ravigadde Aggregated status: yes. I agree it would be helpful.
I know I'm talking way out of turn here... And I do NOT want to derail real good forward progress. This is a real proposal for something I really want. But I really like the start of the scheduler thoughts that @timothysc had: some relatively simple (to conceptualize) ideas making changes to the normal replication controller. His proposal can likely solve this too. Then again, his proposal is just starting to work through concepts and this one is getting to the coding point, so I'd rather take a simple thing like this now and get a more flexible solution over there to solve things where the 'daemon controller' isn't simple... His proposal had at least 2/3 key concepts required. (Tim, what's that PR?)

- Better label selection: !, ||, Exists, (). Right now we only have && and ==.
- SPREAD_BY(label) and GROUP_BY(label_selector) scheduling concepts.

So in Tim's thoughts your proposal would look like an rc with replicas = NUM_NODES(EXISTS("kubernetes.io/hostname")). To run a pod adding iSCSI offload to all of the 3com nics (but not the trusty old 3c509b) you'd have something like: replicas = NUM_NODES(nic_man == 3com && nic != 3c509b). I think these are the exact problems you are trying to solve here? SPREAD_BY("kubernetes.io/hostname") seems to be the same thing in his thoughts.

But let's not derail here. I want this functionality badly. Let's just not try to overengineer here; something simple and to the point, which we can re-use when we have more flexible scheduling.
One reason to prefer a separate object (even though we can represent a wide …). It's not that we couldn't represent the spectrum of choice between the …

> On Jun 27, 2015, at 11:12 AM, Eric Paris notifications@github.com wrote: …
(My responses below are the same whether we implement this as a separate object or as an RC. But I think the arguments for putting the DC functionality inside the node controller are compelling, and justify implementing this as a separate object.)

@eparis With respect to your point (1), it seems that if we expand the label selector language over time, the feature as described in the proposal will eventually work the way you want. And I think we surely will want to make the label selector language more sophisticated in time (I'm sure I've seen an issue for this...)

Your point (2) is a different beast than (1), along two axes: it is a "soft constraint" rather than a hard one (try to do this rather than you must do this), and it describes a relationship among multiple pods rather than something that can be evaluated in isolation. The first part of that is easy; the second part is not as easy. It seems like these SPREAD_BY and GROUP_BY rules have to be defined at a cluster level, not a controller level. Otherwise what happens if you have one controller that says to spread pods by label X and another controller that says to group pods by label X?

I agree on point (3). At some point in the past, I thought this proposal allowed you to specify a % of machines (that match the selector) to deploy onto. Maybe that's not required for the first version, but I think it's a useful feature.
@davidopp (2) I was not thinking of something 'cluster' level; definitely controller level. I could say 'create 15 replicas of my DB and spread by the "row=" label', which is very, very similar to this daemon controller. Although as I think about it more, a group_by concept (although not the one I was intending to outline) does seem like it could be more interesting at a cluster level instead of a controller level, as you say. I was thinking GROUP_BY(label=rack), so all of the pods would land on a single rack for low network latency. But I can totally see how I may want to group my db pods and my web server pods against each other. That isn't what I was originally thinking, though...
There are three cases where guaranteed scheduling matters: (1) placing daemon pods on a newly added node, (2) increasing the resources of an existing daemon pod, and (3) adding a new daemon pod to nodes that are already full. Integrating with the node controller solves (1). For (2) and (3) we could say this is a rare event and you have to drain the machine, remove it, and re-add it, which makes it a case of (1). I'm not saying we shouldn't have preemptions, I'm just saying I don't think they're strictly required for daemon controller even if we want guaranteed scheduling.

My view on priorities (since it came up and I can never resist offering my opinion on things) is that it should be a layered concept. The lowest level should be based on numerical values, but the knob we offer users could be higher-level (like different qualitative categories or whatever). The "everything is part of the API" concept makes the distinction between internal and external a bit muddled, though.
- API object for Daemon Controller will be created in api/v1/types.go and api/v1/register.go
#### DaemonManager
- Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector.
- Listens for the addition of new nodes to the cluster by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager loops through each daemon controller. If the node's labels match the selector of a daemon controller, the daemon manager creates the corresponding daemon pod on the new node.
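The hunk above describes the watch-for-new-nodes loop. Below is a rough sketch of the same idea using today's client-go shared informers (the framework package named in the proposal was later folded into client-go); the DaemonController stand-in type and the logging inside the handler are illustrative assumptions.

```go
// Illustrative sketch: watch for Node additions and, for each new node, check every
// daemon controller's selector against the node's labels.
package daemon

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// DaemonController is a stand-in for the proposed API object: a node selector plus
// a pod template.
type DaemonController struct {
	Name     string
	Selector labels.Selector
	Template v1.PodTemplateSpec
}

// RunNodeWatcher wires a Node informer to a callback that reacts to newly added
// nodes whose labels match a daemon controller's selector.
func RunNodeWatcher(client kubernetes.Interface, controllers []DaemonController, stop <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				return
			}
			for _, dc := range controllers {
				if dc.Selector.Matches(labels.Set(node.Labels)) {
					// The real manager would create the daemon pod bound to
					// node.Name here, as sketched earlier.
					fmt.Printf("node %s matches daemon controller %s\n", node.Name, dc.Name)
				}
			}
		},
	})

	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}
```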
hmm, I feel like you want another point here. How does the DCManager notice daemon death and what pod states does it ignore? What does api.PodSuccess even mean for a daemon? Are we just going to rely on the kubelet restarting the daemon forever?
+1 to @bprashanth's point. You wrote above that the DC won't monitor the health of the daemon, but it'd be good to mention that the NC won't be evicting daemons in the case of node death. We'll also need to add some garbage collection for such forever-hanging daemons.
I really wanted to read through this proposal, but literally didn't have time today :/ After a first read I had mostly two questions: how this interacts with multiple schedulers, and how it interacts with machine drains.
If it's addressed somewhere, please shout at me. I'll make rereading this proposal my P0 for tomorrow.
@gmarek Multiple schedulers or multiple controllers? What issue are you concerned about? Machine drains: again, what specific issue(s) are you concerned about? We don't support machine drains yet.
He asked me about this last night, so I think I know what he is asking.

(1) Multiple schedulers: Say a daemon pod is accidentally deleted from a machine (not crashed and restarted locally -- accidentally deleted in etcd). Now the node controller/scheduler will try to put the daemon pod back at the same time as the regular scheduler may try to schedule a regular pod on the machine. Do we have a mechanism to avoid race conditions between two schedulers (e.g. two schedulers each see the machine as free, then each sends a binding for a pod that takes up the whole machine -- will both be accepted)? We have resourceVersion, but that is on the binding -- do we need something like in Omega, where we "touch" the machine when we write a binding so that subsequent bindings that don't take this binding into account will be rejected? (I believe this would require a multi-object etcd transaction.)

(2) Machine drains: Without preemption, the only way to increase the resources on a daemon pod or add a new daemon pod (in a guaranteed-to-work way) is to drain the machine. I think he is just observing that we don't have machine drain yet. But you can just remove the machine, which will evict everything (and then re-add it), so we do have machine drain, just not a graceful one.
Yes, thanks David.
Sorry for the lag, I'm still catching up from vacation. @eparis I don't believe the two ideas are mutually exclusive, and it's entirely possible they could be built atop one another. However, I do believe this proposal jumps the shark (so to speak) before preemption and priority schemas are created. +1 to @davidopp's prio layering.
GCE e2e build/test failed for commit e495dad.
@bgrant0607 - what are the plans for this PR? Will we merge it or drop it?
@gmarek Let's close this, I'll send a PR to add a slightly updated doc to docs/design soon.
Thanks, closing.
@AnanyaKumar where is this at now?
@timothysc it's been moved to #13368!
Transcribing Daemon controller proposal from @AnanyaKumar