Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial Daemon controller proposal #10210

Closed
wants to merge 1 commit into from
Closed

Conversation

rjnagal
Copy link
Contributor

@rjnagal rjnagal commented Jun 23, 2015

Transcribing Daemon controller proposal from @AnanyaKumar

@rjnagal
Copy link
Contributor Author

rjnagal commented Jun 23, 2015

@bgrant0607 @davidopp @AnanyaKumar @vmarmol @vishh @erictune

Saving the daemon controller discussion as a proposal here.

@k8s-bot
Copy link

k8s-bot commented Jun 23, 2015

GCE e2e build/test failed for commit 4395a996c248282f2d7ebc7cf42e5ae1f1df4dca.

@k8s-bot
Copy link

k8s-bot commented Jun 23, 2015

GCE e2e build/test passed for commit 36a97c071860a90caa4f30ef497f447788309074.

@AnanyaKumar
Copy link
Contributor

Thanks a lot @rjnagal !

@j3ffml j3ffml added the kind/design Categorizes issue or PR as related to design. label Jun 23, 2015
@mbforbes
Copy link
Contributor

@rjnagal can you pick a reviewer, please?

@mbforbes
Copy link
Contributor

@bgrant0607 for the interim

@bgrant0607
Copy link
Member

Ref #1518

Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The daemon controller can be used to run a data collection service (for example fluentd) and send the data to a service like ElasticSearch for analysis.

### Cluster-Level Applications
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled ‘datastore’, might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A daemon controller is a convenient way to implement such a datastore. At Google, D is an example of such an application.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we've published D, but the GFS chunkserver was described in the GFS paper:
http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf


1. Include the daemon in the machine image
2. Use config files to launch daemons
3. Use static pod manifests to launch daemon pods when the node initializes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some of the text from #10093 would be useful in describing this alternative.
cc @jsafrane

- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each daemon controller. If the label of the node matches the selector of the daemon controller, then the daemon manager will create the corresponding daemon pod in the new node.
- The daemon manager will create a pod on a node by sending a command to the API server, requesting for a pod to be bound to the node (the node will be specified via its hostname)
#### Kubelet
- Does not need to be modified, but health checking for the daemon pods and revive the pods if they are killed (we will set the pod restartPolicy to Always). We reject Daemon Controller objects with pod templates that don’t have restartPolicy set to Always.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation of restartPolicy belongs under Apiserver.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I realized when I implemented the Apiserver part of the daemon controller :) Thanks!

@bgrant0607
Copy link
Member

cc @bprashanth @gmarek

@ravigadde
Copy link
Contributor

@bgrant0607

Thanks. My only leftover question is on whether showing aggregated status makes sense for higher level objects like Replication Controller and Daemon Controller. Yes, pod status can be looked at for all the pods controlled by one of these controllers. If the Controller's status encodes a union of all the Pods statuses, it would be helpful imo to the end user.

@bgrant0607
Copy link
Member

@ravigadde Aggregated status: yes. I agree it would be helpful.

@eparis
Copy link
Contributor

eparis commented Jun 27, 2015

I know I'm talking way out of turn here... And I do NOT want to derail real good forward progress work. This is a real proposal for something I really want.

But I really like the start of scheduler thoughts that @timothysc had. It was some relatively simple (to conceptualize) ideas making changes to the normal replication controller. His proposal can likely solve this too.

Then again, his proposal is just starting to work through concepts and this one is getting to the coding point, so I'd rather take a simple thing like this now and get a more flexible solution over there to solve things where the 'daemon controller' isn't simple...

His proposal had at least 2/3 key concepts required. (Tim, what's that PR?)

  1. Better label selection. !, ||, Exists, (). Right now we only have && and ==. Might be nice here too. The ability to say "all nodes with 3com nics but not nodes with the 3c509b nic" Yes I can create another set of labels on the nodes. So maybe this isn't a P0 to solve his problem.
  2. SPREAD_BY(label) and GROUP_BY(label_selector) scheduling concepts. So after the selector has decided the set of nodes in question the _BY hints would cause the scheduler to either co-locate pods where the label value was the same, or make sure no 2 pods went onto the same node where the label was the same. (The exact syntax of these is quite vague in my mind)
  3. A new rc 'replicas' option. Some form of $NUM_NODES(label_selector) so the system dynamically picks the number of replicas, much like you propose here.

In Tim's thoughts your proposal would look like an rc with

replicas = NUM_NODES(EXISTS("kubernetes.io/hostname"))
SPREAD_BY("kubernetes.io/hostname")

To run a pod adding iSCSI offload to all of the 3com nics (but not the trusty old 3c509b) you'd have something:

replics = NUM_NODES(nic_man == 3com && nic != 3c509b)
SPREAD_BY("kubernetes.io/hostname")

I think these are the exact problems you are trying to solve here? SPREAD_BY("kubernetes.io/hostname") seems to be the same thing in his thoughts.

But lets not derail here. I want this functionality badly. Let's just not try to overengineer here, something simple and to the point. Which we can re-use when we have more flexible scheduling.

@smarterclayton
Copy link
Contributor

One reason to prefer a separate object (even though we can represent a wide
variety of scenarios with a more powerful policy) is that it makes it
easier for users and tools to reason about the use cases for choosing one
or the other. The use cases for daemon controllers almost all overlap with
some characteristic of the underlying host - file system, device, network,
or location. Those use cases require elevated privilege, and also in some
cases undermine the assumption that topology and placement are details that
users do not need. On the contrary, replication controllers represent a
stronger need for a particular number of an item (for performance, or
reliability). A daemon controller is satisfied by covering a set - a
replication controller that is oversubscribed to cluster resources may
indicate the cluster should scale up.

It's not that we couldn't represent the spectrum of choice between the
two positions (I need to guarantee that all nodes that look like this can
run this process, to I need a certain number of these processes that may
need these nodes) with a single object - but that we perhaps shouldn't
because those distinctions are important for users and tools to understand
and choose between.

On Jun 27, 2015, at 11:12 AM, Eric Paris notifications@github.com wrote:

I know I'm talking way out of turn here... And I do NOT want to derail real
good forward progress work. This is a real proposal for something I really
want.

But I really like the start of scheduler thoughts that @timothysc
https://github.com/timothysc had. It was some relatively simple (to
conceptualize) ideas making changes to the normal replication controller.
His proposal can likely solve this too.

Then again, his proposal is just starting to work through concepts and this
one is getting to the coding point, so I'd rather take a simple thing like
this now and get a more flexible solution over there to solve things where
the 'daemon controller' isn't simple...

His proposal had at least 2/3 key concepts required. (Tim, what's that PR?)

Better label selection. !, ||, Exists, (). Right now we only have && and
==. Might be nice here too. The ability to say "all nodes with 3com nics
but not nodes with the 3c509b nic" Yes I can create another set of labels
on the nodes. So maybe this isn't a P0 to solve his problem.
2.

SPREAD_BY(label) and GROUP_BY(label_selector) scheduling concepts. So
after the selector has decided the set of nodes in question the _BY hints
would cause the scheduler to either co-locate pods where the label value
was the same, or make sure no 2 pods went onto the same node where the
label was the same. (The exact syntax of these is quite vague in my mind)

  1. A new rc 'replicas' option. Some form of $NUM_NODES(label_selector) so
    the system dynamically picks the number of replicas, much like you propose
    here.

In Tim's thoughts your proposal would look like an rc with

replicas = NUM_NODES(EXISTS("kubernetes.io/hostname"))
SPREAD_BY("kubernetes.io/hostname")

To run a pod adding iSCSI offload to all of the 3com nics (but not the
trusty old 3c509b) you'd have something:

replics = NUM_NODES(nic_man == 3com && nic != 3c509b)
SPREAD_BY("kubernetes.io/hostname")

I think these are the exact problems you are trying to solve here?
SPREAD_BY("kubernetes.io/hostname") seems to be the same thing in his
thoughts.

But lets not derail here. I want this functionality badly. Let's just not
try to overengineer here, something simple and to the point. Which we can
re-use when we have more flexible scheduling.


Reply to this email directly or view it on GitHub
#10210 (comment)
.

@davidopp
Copy link
Member

(My responses below are the same whether we implement this as a separate object or as an RC. But I think the arguments for putting the DC functionality inside the node controller are compelling, and justify implementing this as a separate object.)

@eparis With respect to your point (1), it seems that if we expand the label selector language over time, the feature as described in the proposal will eventually work the way you want. And I think we surely will want to make the label selector language more sophisticated in time (I'm sure I've seen an issue for this...)

Your point (2) is a different beast than (1), along two axes: it is a "soft constraint" rather than a hard one (try to do this rather than you must do this) and it describes a relationship among multiple pods rather than something that can be evaluated in isolation. The first part of that is easy, the second part is not as easy. It seems like these SPREAD_BY and GROUP_BY rules have to be defined at a cluster level, not a controller level. Otherwise what happens if you have one controller that says to spread pods by label X and another controller that says to group pods by label X?

I agree on point (3). At some point in the past, I thought this proposal allowed you to specify a % of machines (that match the selector) to deploy onto. Maybe that's not required for the first version but I think it's a useful feature.

@eparis
Copy link
Contributor

eparis commented Jun 28, 2015

@davidopp (2) I was not thinking of something 'cluster' level. Definitely controller level. I could say 'create on 15 replicas of my DB and spread by the "row=" label' very very similar to this daemon controller.

Although as I think more a group_by concept (although not the one I was intending to outline) does seem that it could be more interesting on a cluster level instead of controller level, as you say. I was thinking GROUP_BY(label=rack) so all of the pods would land on a single rack for low network latency. But I can totally see how I many want to group my db pods and my web server pods against each other. But that isn't what I was originally thinking....

@davidopp
Copy link
Member

I didn't say I was opposed to preemption, only to numerical priorities. Daemons don't make much
sense, IMO, unless the placement can be guaranteed, so they should be able to preempt anything
else. We might want to perform such preemptions gracefully (e.g., in a rolling fashion), however.

There are three cases
(1) new nodes
(2) add a new daemon pod, or increase resource limit of existing daemon pod
(3) daemon pod was deleted (perhaps accidentally) and you want it back

Integrating with node controller solves (1). For (2) and (3) we could say this is a rare event and you have to drain the machine, remove it, and re-add it, which makes it a case of (1).

I'm not saying we shouldn't have preemptions, I'm just saying I don't think they're strictly required for daemon controller even if we want guaranteed scheduling.

My view on priorities (since it came up and I can never resist offering my opinion on things) is that it should be a layered concept. The lowest level should be based on numerical values but the knob we offer users could be higher-level (like different qualitative categories or whatever). The "everything is part of the API" concept makes the distinction between internal and external a bit muddled, though.

- API object for Daemon Controller will be created in api/v1/types.go and api/v1/register.go
#### DaemonManager
- Creates new daemon controllers when requested. Launches the corresponding daemon pod on all nodes with labels matching the new daemon controller’s selector.
- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each daemon controller. If the label of the node matches the selector of the daemon controller, then the daemon manager will create the corresponding daemon pod in the new node.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I feel like you want another point here. How does the DCManager notice daemon death and what pod states does it ignore? What does api.PodSuccess even mean for a daemon? Are we just going to rely on the kubelet restarting the daemon forever?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @bprashanth point. You wrote that DC won't monitor health of the Daemon somewhere above, but it'd be good to mention that NC won't be evicting Daemons in the case of Nodes death. We'll also need to add some garbage collection for such forever hanging Daemons.

@gmarek
Copy link
Contributor

gmarek commented Jun 29, 2015

I really wanted to read through this proposal, but literally didn't have time today:/ After first read I had mostly two questions:

  1. how we handle multiple schedulers? (IUUC design moved from writing separate controller to extending NodeController, so it's probably no longer an issue)
  2. how we handle machine drains?

If it's addressed somewhere, please shout at me. I'll make rereading this proposal my P0 for tomorrow.

@bgrant0607
Copy link
Member

@gmarek Multiple schedulers or multiple controllers? What issue are you concerned about?

Machine drains: Again, what specific issue(s) are you concerned about? We don't support machine drains yet.

@davidopp
Copy link
Member

He asked me about this last night so I think I know what he is asking.

(1) Multiple schedulers: Say a daemon pod is accidentally deleted from a machine (not crashes and restarts locally -- accidentally deleted in etcd). Now the node controller/scheduler will try to put the daemon pod back at the same time as the regular scheduler may try to schedule a regular pod on the machine. Do we have a mechanism to avoid race conditions between two schedulers (e.g. two schedulers each see the machine as free, then send binding for pod that takes up the whole machine -- will it be accepted?)? We have resourceVersion but that is on the binding -- do we need something like in Omega where we "touch" the machine when we write a binding so that subsequent bindings that don't take this binding into account will be rejected? (I believe this would require a multi-object etcd transaction.)

(2) Machine drains: Without preemption, the only way to increase the resources on a daemon pod or add a new daemon pod (in a guaranteed-to-work way) is the drain the machine. I think he is just observing that we don't have machine drain yet. But you can just remove the machine which will evict everything (and then re-add it), so we do have machine drain, just not a graceful one.

@gmarek
Copy link
Contributor

gmarek commented Jun 29, 2015

Yes, thanks David.

  1. is exactly as David wrote. I meant DaemonController of course.

  2. I just recall some discussion on horizontal node autoscaling, where someone assumed that nodes that are possible to remove are completely empty. I have no idea what's the status of it, but I'm sure something like that was discussed. The question is how/if we want to support eviction of daemons from nodes, as I don't recall seeing it in the doc (but I might have missed it). Of course if we don't support graceful drains, then it's OK to ignore this.

@bgrant0607 bgrant0607 added this to the v1.0 milestone Jun 30, 2015
@bgrant0607
Copy link
Member

@rjnagal Please just ensure the proposal accurately reflects what we're doing, and then we can merge the proposal. It can cite this PR and the relevant issues (#1518, #3058) for people to read the full discussion.

I'll respond to the comments.

@bgrant0607 bgrant0607 modified the milestones: v1.0-post, v1.0 Jul 6, 2015
@timothysc
Copy link
Member

Sorry for the lag, I'm still catching up from vacation.

@eparis I don't believe the two ideas are mutually exclusive, and it's entirely possible they could be built atop.

However, I do believe this is proposal jumps the shark (so to speak) before preemption and priority schemas are created.

+1 to @davidopp's layering prio.

@bgrant0607 bgrant0607 removed this from the v1.0-post milestone Jul 24, 2015
@bgrant0607 bgrant0607 added the area/api Indicates an issue on api area. label Jul 30, 2015
@k8s-bot
Copy link

k8s-bot commented Jul 31, 2015

GCE e2e build/test failed for commit e495dad.

@gmarek
Copy link
Contributor

gmarek commented Aug 24, 2015

@bgrant0607 - what are the plans for this PR? Will we merge it or drop it?

@AnanyaKumar
Copy link
Contributor

@gmarek Let's close this, I'll send a PR to add a slightly updated doc to docs/design soon.

@gmarek
Copy link
Contributor

gmarek commented Aug 24, 2015

Thanks, closing.

@timothysc
Copy link
Member

@AnanyaKumar where is this at now?

@AnanyaKumar
Copy link
Contributor

@timothysc it's been moved to #13368!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api Indicates an issue on api area. kind/design Categorizes issue or PR as related to design.
Projects
None yet
Development

Successfully merging this pull request may close these issues.