Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

jbeda · 2014-09-30T23:11:49Z

There are cases where we want to run a pod on every node. This'll be useful for monitoring things (cAdvisor, DataDog) or replicated storage agents (HDFS node).

Right now you can approximate this by (a) using a hostPort and (b) setting replication count > nodes. It would be better if we had an explicit way of doing this.

thockin · 2014-09-30T23:17:52Z

Do we really want this to be a scheduling feature? Config files can achieve
this. We avoided adding this feature internally because it add complexity
that we did not feel was required.

It's a cute idea, but it's sort of a layering violation.

On Tue, Sep 30, 2014 at 4:12 PM, Joe Beda notifications@github.com wrote:

There are cases where we want to run a pod on every node. This'll be
useful for monitoring things (cAdvisor, DataDog) or replicated storage
agents (HDFS node).

Right now you can approximate this by (a) using a hostPort and (b) setting
replication count > nodes. It would be better if we had an explicit way of
doing this.

Reply to this email directly or view it on GitHub
#1518.

bketelsen · 2014-09-30T23:32:35Z

I'm +1 on this. Some things like statsd collectors need to run one per node, and are dependencies of my pods. Keeping all the runtime dependencies together means I have fewer orchestration tools to worry about. And Kubernetes makes sure it stays running.

thockin · 2014-09-30T23:35:19Z

As opposed to dropping a pod config on each machine and letting kubelet run
it? Is there a reason the simpler solution is not adequate?

On Tue, Sep 30, 2014 at 4:32 PM, Brian Ketelsen notifications@github.com
wrote:

I'm +1 on this. Some things like statsd collectors need to run one per
node, and are dependencies of my pods. Keeping all the runtime dependencies
together means I have fewer orchestration tools to worry about. And
Kubernetes makes sure it stays running.

Reply to this email directly or view it on GitHub
#1518 (comment)
.

bketelsen · 2014-09-30T23:38:22Z

forgive my ignorance, but how does one "drop a pod config on each machine" My only interaction with k8s has been through kubecfg so far.

thockin · 2014-09-30T23:41:04Z

The assumption is that if you want to run something on each machine, you're
essentially a cluster admin, and can arrange for a config file to appear on
each node, rather than scheduling.

/etc/kubernetes/manifests holds files which are the "manifest" section of a
pod, and kubelet will run those as if it had been scheduled to do so.

On Tue, Sep 30, 2014 at 4:38 PM, Brian Ketelsen notifications@github.com
wrote:

forgive my ignorance, but how does one "drop a pod config on each machine"
My only interaction with k8s has been through kubecfg so far.

Reply to this email directly or view it on GitHub
#1518 (comment)
.

bketelsen · 2014-09-30T23:48:17Z

interesting. I'll poke around with that concept too. sounds like it would solve this use case well enough.

bgrant0607 · 2014-10-01T00:01:46Z

This request sounds more for a custom auto-scaler than anything else, though some other features would be useful, such as per-attribute limits (discussed in #367 (comment)).

Most such agents that have been discussed do want host ports, though I understand we want to get rid of host ports. If we did eliminate host ports, we'd need an alternative discovery mechanism; I don't think we want to use existing k8s services. Even for the file-based approach, we probably need that. In #386, I proposed that we represent such services in /etc/hosts within containers. We could give them local magic IPs.

thockin · 2014-10-01T00:07:50Z

Or we could add host networking

On Tue, Sep 30, 2014 at 5:02 PM, bgrant0607 notifications@github.com
wrote:

This request sounds more for a custom auto-scaler than anything else,
though some other features would be useful, such as per-attribute limits
(discussed in #367 (comment)
#367 (comment)).

Most such agents that have been discussed do want host ports, though I
understand we want to get rid of host ports. If we did eliminate host
ports, we'd need an alternative discovery mechanism; I don't think we want
to use existing k8s services. Even for the file-based approach, we probably
need that. In #386
#386, I
proposed that we represent such services in /etc/hosts within containers.
We could give them local magic IPs.

Reply to this email directly or view it on GitHub
#1518 (comment)
.

brendandburns · 2014-10-01T00:20:07Z

I agree with Tim. We should use manifest files on the host nodes to
achieve this.

Brendan
On Sep 30, 2014 5:08 PM, "Tim Hockin" notifications@github.com wrote:

Or we could add host networking

On Tue, Sep 30, 2014 at 5:02 PM, bgrant0607 notifications@github.com
wrote:

This request sounds more for a custom auto-scaler than anything else,
though some other features would be useful, such as per-attribute limits
(discussed in #367 (comment)
<
#367 (comment)
).

Most such agents that have been discussed do want host ports, though I
understand we want to get rid of host ports. If we did eliminate host
ports, we'd need an alternative discovery mechanism; I don't think we
want
to use existing k8s services. Even for the file-based approach, we
probably
need that. In #386
#386, I
proposed that we represent such services in /etc/hosts within containers.
We could give them local magic IPs.

Reply to this email directly or view it on GitHub
<
#1518 (comment)

.

—
Reply to this email directly or view it on GitHub
#1518 (comment)
.

jbeda · 2014-10-01T00:30:31Z

The big thing with manifest files is that resultant pods aren't named/tracked by kubernetes. As we do GUI visualizations as to what is running these won't show up.

Finally -- installing/distributing manifest files requires an out of band management system. It would be great if, once k8s was bootstrapped, there was one system for managing all work.

If, instead, we had the kubernetes master/API track local pods in a read-only way, that might help...

thockin · 2014-10-01T00:31:17Z

Always with the caveat that we're all ears if someone has a use case that
CAN NOT be satisfied this way :)

On Tue, Sep 30, 2014 at 5:20 PM, Brendan Burns notifications@github.com
wrote:

I agree with Tim. We should use manifest files on the host nodes to
achieve this.

Brendan
On Sep 30, 2014 5:08 PM, "Tim Hockin" notifications@github.com wrote:

Or we could add host networking

On Tue, Sep 30, 2014 at 5:02 PM, bgrant0607 notifications@github.com
wrote:

This request sounds more for a custom auto-scaler than anything else,
though some other features would be useful, such as per-attribute
limits
(discussed in #367 (comment)
<

#367 (comment)

).

Most such agents that have been discussed do want host ports, though I
understand we want to get rid of host ports. If we did eliminate host
ports, we'd need an alternative discovery mechanism; I don't think we
want
to use existing k8s services. Even for the file-based approach, we
probably
need that. In #386
#386, I
proposed that we represent such services in /etc/hosts within
containers.
We could give them local magic IPs.

Reply to this email directly or view it on GitHub
<

#1518 (comment)

.

Reply to this email directly or view it on GitHub
<
https://github.com/GoogleCloudPlatform/kubernetes/issues/1518#issuecomment-57401979>

.

Reply to this email directly or view it on GitHub
#1518 (comment)
.

erictune · 2014-10-01T14:36:47Z

I predict that some significant fraction of K8s cluster owners will want to run a control loop that automatically adds nodes to a K8s cluster based on a demand signal, such as pending pods. The approach of "setting replication count > nodes" won't work well with that.

erictune · 2014-10-01T14:46:07Z

I agree with jbeda that we would want these locally-configured pods to have useful names and to show up in visualizations. And I agree with his suggestion that we should track them as "Read-only" pods. We have experience internally that suggests this works.

I think the "readonly pod" solution is hard to avoid. People are going to think about their (physical or virtual) machines in terms of their raw capacity. But, then some amount of resources will be taken out of that total by the kernel memory, root-namespace files. And then there will be some daemons that people won't want to start using kubernetes, such as sshd (for emergency debugging of kubelet problems) and kubelet (for starting pods in the first place). Those need memory and cpu too. Once we solve exporting that information and visualizing it, we are most of the way to generally handling locally-configured pods.

thockin · 2014-10-01T17:26:10Z

set replication count to "inf" could work, but it still feels like a hack
to me.

On Wed, Oct 1, 2014 at 7:37 AM, erictune notifications@github.com wrote:

I predict that some significant fraction of K8s cluster owners will want
to run a control loop that automatically adds nodes to a K8s cluster based
on a demand signal, such as pending pods. The approach of "setting
replication count > nodes" won't work well with that.

Reply to this email directly or view it on GitHub
#1518 (comment)
.

thockin · 2014-10-01T17:27:16Z

I think the master should become aware of pods that are running
but that it did not create, and manage in a read-only mode as you suggest.
This is more or less how internal stuff works.

On Tue, Sep 30, 2014 at 5:30 PM, Joe Beda notifications@github.com wrote:

The big thing with manifest files is that resultant pods aren't
named/tracked by kubernetes. As we do GUI visualizations as to what is
running these won't show up.

Finally -- installing/distributing manifest files requires an out of band
management system. It would be great if, once k8s was bootstrapped, there
was one system for managing all work.

If, instead, we had the kubernetes master/API track local pods in a
read-only way, that might help...

Reply to this email directly or view it on GitHub
#1518 (comment)
.

bgrant0607 · 2014-10-01T18:55:05Z

Replication controller doesn't auto-scale on its own.

dchen1107 · 2014-10-01T18:57:11Z

I filed #490 a while back to track all pods including kubelet and other daemons which created by manifest files. The pods created through manifest files are having a separate and reserved namespace as appendix to pod name. I don't see any potential issues with read-only mode.

bgrant0607 · 2014-10-01T18:59:41Z

Filed #1523 for the more specific issue of representing such pods in the apiserver/etcd

sdake · 2014-10-17T00:11:08Z

My thoughts on this issue. It is critical for our application to be able to run on every minion without necessarily having to modify the host filesystem to do so.

The specific use case is a project to run OpenStack on top of k8s (http://github.com/stackforge/kolla). This upstream project wants a defined non-hacky way to run a libvirt container and nova-compute container in 1 pod on every minion to provide virtual machine services via OpenStack. Without such a feature, it is impossible to make OpenStack actually run on top of k8s without installing 400+ packages in the host OS. Essentially it would erase any gains containerizing our two containers (1 pod) would provide.

I think the hostport hack would be acceptable, by setting replicationcount to 2^32 or something similar. But as of yet I haven't got this to work. We want to manage all OpenStack services through the kube-apiserver process, rather then having to manually modify the host filesystem system as this creates more complex deployment models. In some cases, manually modifying the host filesystem is extremely difficult for us, especially in the case of something like Atomic, a RHEL7 based operating system without a package manager.

I am hopeful we can come to agreement that this feature is helpful and doesn't add much in the way of complexity or scope creep.

Regarding the mention that adding this feature would result in more complexity, I believe the existing solutions are either hacky (hostport) or more complex (kubelet config file). In the case of the kublet config file, there is no single management interface, instead requiring adding two methods of interfacing with the system.

Regards
-steve

thockin · 2014-10-17T02:35:48Z

I disagree that kubelet config files are more complex - but reasonable
people can disagree.

As for the host port hack, it is a hack, and if you set 2^32 replicas it
will actually try to create and schedule 4 billion pods, failing on all but
a tiny fraction of them, and then it will retry that periodically. A
really bad idea. :)

Certainly you have SOMETHING managing your host machine filesystems? We
have found, internally, that it is way easier to manage one-per-machine
jobs as config files.

That sad, I am not STRONGLY against this idea. Looking forward to some
discussion.
On Oct 16, 2014 5:11 PM, "Steven Dake" notifications@github.com wrote:

My thoughts on this issue. It is critical for our application to be able
to run on every minion without necessarily having to modify the host
filesystem to do so.

The specific use case is a project to run OpenStack on top of k8s (
http://github.com/stackforge/kolla https://github.com/stackforge/kolla).
This upstream project wants a defined non-hacky way to run a libvirt
container and nova-compute container in 1 pod on every minion to provide
virtual machine services via OpenStack. Without such a feature, it is
impossible to make OpenStack actually run on top of k8s without installing
400+ packages in the host OS. Essentially it would erase any gains
containerizing our two containers (1 pod) would provide.

I think the hostport hack would be acceptable, by setting replicationcount
to 2^32 or something similar. But as of yet I haven't got this to work. We
want to manage all OpenStack services through the kube-apiserver process,
rather then having to manually modify the host filesystem system as this
creates more complex deployment models. In some cases, manually modifying
the host filesystem is extremely difficult for us, especially in the case
of something like Atomic, a RHEL7 based operating system without a package
manager.

I am hopeful we can come to agreement that this feature is helpful and
doesn't add much in the way of complexity or scope creep.

Regarding the mention that adding this feature would result in more
complexity, I believe the existing solutions are either hacky (hostport) or
more complex (kubelet config file). In the case of the kublet config
file, there is no single management interface, instead requiring adding two
methods of interfacing with the system.

Regards
-steve

Reply to this email directly or view it on GitHub
#1518 (comment)
.

erictune · 2014-10-17T17:59:58Z

Custom autoscaler seems like the way to go.

have object just like a replication controller, except that instead of a replicas count field, you have a node-selector (would select all nodes in your case).
controller manager watches for new minions and creates pods constrained to them.
use hostname constraints
garbage collect pods after 1 day of not seeing a node or something.

lavalamp · 2014-10-17T18:01:13Z

We need to add constraints to scheduler to make this work.

bgrant0607 · 2014-10-17T23:32:33Z

@erictune's proposal is along the lines of what I was thinking. Definitely NOT a replication controller with an infinite count.

Rather than a hostname constraint, we could expose a node parameter on POST to /pods. Since pods don't reschedule, this constraint doesn't need to be part of the pod spec. This would also enable use of pod templates without getting into field overrides. Internally, we could add it to scheduling constraints, if we chose, but one could potentially also just bypass the scheduler in this case, so long as the apiserver verified feasibility, which is necessary if we want to support multiple schedulers, anyway.

The main thing to worry about is races: daemons getting evicted due to missing nodes, then not being the first thing to schedule back on the nodes when they become healthy again. To prevent this, we'd almost certainly need to add forgiveness (#1574).

As for whether we should support this feature:

One one hand, it's another feature, so it adds complexity.
On the other hand, we likely want to reduce our reliance on external configuration management systems, especially for application deployment, of which this is one example.

The core functionality seems pretty isolated from everything else, and what it needs are things we'd like to add for other reasons. Maybe it could be implemented as a plugin, which would also make it easier to rip out if we decided it was a bad idea. I'm in favor, but can't see it as high priority at the moment since there is a workaround.

bgrant0607 · 2015-06-25T03:51:47Z

cc @AnanyaKumar

bgrant0607 · 2015-06-25T04:22:44Z

@ravigadde Please comment regarding whether #10210 satisfies your use case.

davidopp · 2015-06-25T04:40:25Z

I like the idea of having the node controller schedule daemon pods. Putting in one place all the logic that has to run before a node is considered ready to accept regular pods makes a lot of sense.

But I'm not sure why you're mixing Deployment into this. It seems that just as we built ReplicationController as one component and plan to layer Deployment on top to allow declaratively describing state of the whole cluster and orchestrating rolling updates, we can build DaemonController as one component (that is mostly identical to ReplicationController except it uses % of nodes where ReplicationController uses # of replicas) and then layer Deployment on top to do those other things. The only tricky part is ensuring the % maps consistently, but IIRC you and @lavalamp had some ideas on how we could do that. Anyway, we can discuss that on the other issue.

davidopp · 2015-06-25T04:42:57Z

(To clarify my previous comment, when I refer to DaemonController I'm talking about logic, not necessarily a separate component; as you said, we could put the logic in the node controller).

ravigadde · 2015-06-25T23:45:09Z

@bgrant0607 Thanks for your comments. I will add my comments to #10210

+1 for adding this logic to node controller

mikedanese · 2015-09-23T03:12:21Z

I think this is done and we can close. Thanks @AnanyaKumar! Still todo (and discuss) is:

handle daemonset podtemplate update
fold daemonset controller into nodecontroller

zhangxiaoyu-zidif · 2017-09-06T07:45:26Z

I have a question:
when i create the ds, the memory did not meet the request memory, and then when momory is enough, does daemonset start its pod?
If the pod was not created when memory is insuffient, when memory is suffient, could the pod of ds start?

…e-disable UPSTREAM: <carry>: OCPNODE-1548,OCPNODE-1584: disable load balancing on created cgroups when managed is enabled

jbeda added enhancement sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Sep 30, 2014

bgrant0607 added the area/api Indicates an issue on api area. label Sep 30, 2014

bgrant0607 added the bootstrapping label Oct 1, 2014

bgrant0607 mentioned this issue Oct 1, 2014

pods created locally on nodes should be represented #1523

Closed

erictune mentioned this issue Nov 13, 2014

Consider a feature to deploy a pod on each node #2351

Closed

ddysher mentioned this issue Nov 20, 2014

WIP - a Per-Node controller #2491

Closed

bgrant0607 added area/nodecontroller and removed kind/gsoc labels Jun 25, 2015

AnanyaKumar mentioned this issue Aug 7, 2015

Daemon Registry and Client #12383

Closed

bgrant0607 mentioned this issue Aug 17, 2015

Per instance pods #12786

Closed

davidopp added team/control-plane and removed team/master labels Aug 22, 2015

bgrant0607 changed the title ~~Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)~~ Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) Aug 27, 2015

bgrant0607 mentioned this issue Aug 27, 2015

Proposal to rework Kubernetes deployment CLI #5472

Closed

4 tasks

a-robinson mentioned this issue Sep 10, 2015

Update Fluentd Configurations to extract labels and other metadata #8001

Closed

bgrant0607 mentioned this issue Sep 16, 2015

Generalize label selectors #341

Closed

This was referenced Sep 23, 2015

Add daemonset support to kubectl #13183

Merged

Add daemon controller #13182

Merged

Add Daemon API #11801

Merged

Design doc for DaemonSet #14326

Closed

Job and DaemonSet documentation. #14079

Merged

mikedanese closed this as completed Sep 23, 2015

mikedanese assigned AnanyaKumar and unassigned rjnagal Sep 23, 2015

davidopp mentioned this issue Sep 24, 2015

Daemon design, take 3 #14529

Merged

agalitsyn mentioned this issue May 4, 2016

Research basic k8s resource for deployment agalitsyn/yagoda#7

Closed

zhangxiaoyu-zidif unassigned AnanyaKumar Sep 6, 2017

kow3ns mentioned this issue Oct 13, 2017

Add "DaemonSet create-first rolling update" design proposal. kubernetes/community#977

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

jbeda commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

bgrant0607 commented Oct 1, 2014

thockin commented Oct 1, 2014

brendandburns commented Oct 1, 2014

jbeda commented Oct 1, 2014

thockin commented Oct 1, 2014

erictune commented Oct 1, 2014

erictune commented Oct 1, 2014

thockin commented Oct 1, 2014

thockin commented Oct 1, 2014

bgrant0607 commented Oct 1, 2014

dchen1107 commented Oct 1, 2014

bgrant0607 commented Oct 1, 2014

sdake commented Oct 17, 2014

thockin commented Oct 17, 2014

erictune commented Oct 17, 2014

lavalamp commented Oct 17, 2014

bgrant0607 commented Oct 17, 2014

bgrant0607 commented Jun 25, 2015

bgrant0607 commented Jun 25, 2015

davidopp commented Jun 25, 2015

davidopp commented Jun 25, 2015

ravigadde commented Jun 25, 2015

mikedanese commented Sep 23, 2015

zhangxiaoyu-zidif commented Sep 6, 2017

Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518

Comments

jbeda commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

thockin commented Sep 30, 2014

bketelsen commented Sep 30, 2014

bgrant0607 commented Oct 1, 2014

thockin commented Oct 1, 2014

brendandburns commented Oct 1, 2014

jbeda commented Oct 1, 2014

thockin commented Oct 1, 2014

erictune commented Oct 1, 2014

erictune commented Oct 1, 2014

thockin commented Oct 1, 2014

thockin commented Oct 1, 2014

bgrant0607 commented Oct 1, 2014

dchen1107 commented Oct 1, 2014

bgrant0607 commented Oct 1, 2014

sdake commented Oct 17, 2014

thockin commented Oct 17, 2014

erictune commented Oct 17, 2014

lavalamp commented Oct 17, 2014

bgrant0607 commented Oct 17, 2014

bgrant0607 commented Jun 25, 2015

bgrant0607 commented Jun 25, 2015

davidopp commented Jun 25, 2015

davidopp commented Jun 25, 2015

ravigadde commented Jun 25, 2015

mikedanese commented Sep 23, 2015

zhangxiaoyu-zidif commented Sep 6, 2017