Daemon (was Feature: run-on-every-node scheduling/replication (aka per-node controller or daemon controller)) #1518
Do we really want this to be a scheduling feature? Config files can achieve this already. It's a cute idea, but it's sort of a layering violation.
I'm +1 on this. Some things like statsd collectors need to run one per node, and are dependencies of my pods. Keeping all the runtime dependencies together means I have fewer orchestration tools to worry about. And Kubernetes makes sure it stays running.
As opposed to dropping a pod config on each machine and letting the kubelet run it?
Forgive my ignorance, but how does one "drop a pod config on each machine"? My only interaction with k8s has been through kubecfg so far.
The assumption is that if you want to run something on each machine, you're already managing that machine's configuration somehow. /etc/kubernetes/manifests holds files, each of which is the "manifest" section of a pod config that the kubelet will run locally.
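For readers following along, here is a minimal sketch of such a file in present-day YAML syntax (the name and image are hypothetical, not from this thread): the kubelet watches this directory and runs anything it finds there as a "static pod", independently of the apiserver.

```yaml
# /etc/kubernetes/manifests/node-agent.yaml  (hypothetical example)
apiVersion: v1
kind: Pod
metadata:
  name: node-agent                      # hypothetical name
spec:
  containers:
  - name: agent
    image: example.com/node-agent:1.0   # hypothetical image
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
```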
Interesting. I'll poke around with that concept too; it sounds like it would solve this use case well enough.
This request sounds more like a job for a custom auto-scaler than anything else, though some other features would be useful, such as per-attribute limits (discussed in #367 (comment)). Most such agents that have been discussed do want host ports, though I understand we want to get rid of host ports. If we did eliminate host ports, we'd need an alternative discovery mechanism; I don't think we want to use existing k8s services. Even for the file-based approach, we probably need that. In #386, I proposed that we represent such services in /etc/hosts within containers. We could give them local magic IPs.
Or we could add host networking.
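For context, host networking did later become a pod-level setting. A minimal sketch of what that looks like in today's API (name and image are illustrative): with hostNetwork the pod shares the node's network namespace, so no hostPort mapping is needed.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: statsd-collector            # hypothetical name
spec:
  hostNetwork: true                 # share the node's network namespace
  containers:
  - name: statsd
    image: example.com/statsd:1.0   # hypothetical image
```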
I agree with Tim. We should use manifest files on the host nodes to accomplish this.
The big thing with manifest files is that the resultant pods aren't named/tracked by Kubernetes. As we build GUI visualizations of what is running, these won't show up. Finally, installing/distributing manifest files requires an out-of-band management system. It would be great if, once k8s was bootstrapped, there were a single system for managing all work. If, instead, we had the kubernetes master/API track local pods in a read-only way, that might help...
Always with the caveat that we're all ears if someone has a use case that this doesn't cover.
I predict that some significant fraction of K8s cluster owners will want to run a control loop that automatically adds nodes to a K8s cluster based on a demand signal, such as pending pods. The approach of "setting replication count > nodes" won't work well with that.
I agree with jbeda that we would want these locally-configured pods to have useful names and to show up in visualizations. And I agree with his suggestion that we should track them as "read-only" pods. We have experience internally that suggests this works.

I think the "read-only pod" solution is hard to avoid. People are going to think about their (physical or virtual) machines in terms of their raw capacity. But then some amount of resources will be taken out of that total by the kernel, root-namespace files, and so on. And there will be some daemons that people won't want to start using Kubernetes, such as sshd (for emergency debugging of kubelet problems) and the kubelet itself (for starting pods in the first place). Those need memory and CPU too. Once we solve exporting that information and visualizing it, we are most of the way to generally handling locally-configured pods.
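As a side note on the resource-accounting point above: the modern kubelet lets operators carve capacity out for system daemons such as sshd and the kubelet itself. A sketch of those settings in a KubeletConfiguration (the specific values below are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:          # capacity set aside for OS daemons (sshd, etc.)
  cpu: 500m
  memory: 1Gi
kubeReserved:            # capacity set aside for kubelet and other k8s components
  cpu: 250m
  memory: 512Mi
```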
Setting the replication count to "inf" could work, but it still feels like a hack.
I think the master should become aware of pods that are running on each node, regardless of how they were started.
Replication controller doesn't auto-scale on its own.
I filed #490 a while back to track all pods, including the kubelet and other daemons created by manifest files. The pods created through manifest files have a separate, reserved namespace appended to the pod name. I don't see any potential issues with read-only mode.
Filed #1523 for the more specific issue of representing such pods in the apiserver/etcd.
My thoughts on this issue: it is critical for our application to be able to run on every minion without necessarily having to modify the host filesystem to do so. The specific use case is a project to run OpenStack on top of k8s (http://github.com/stackforge/kolla). This upstream project wants a defined, non-hacky way to run a libvirt container and a nova-compute container in one pod on every minion to provide virtual machine services via OpenStack. Without such a feature, it is impossible to make OpenStack actually run on top of k8s without installing 400+ packages in the host OS, which would essentially erase any gains that containerizing our two containers (one pod) would provide.

I think the hostPort hack would be acceptable, by setting the replication count to 2^32 or something similar, but as of yet I haven't gotten this to work. We want to manage all OpenStack services through the kube-apiserver process, rather than having to manually modify the host filesystem, as this creates more complex deployment models. In some cases, manually modifying the host filesystem is extremely difficult for us, especially in the case of something like Atomic, a RHEL7-based operating system without a package manager.

I am hopeful we can come to agreement that this feature is helpful and doesn't add much in the way of complexity or scope creep. Regarding the mention that adding this feature would result in more complexity, I believe the existing solutions are either hacky (hostPort) or more complex (kubelet config file). In the case of the kubelet config file, there is no single management interface; it instead requires two methods of interfacing with the system.
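To make the Kolla use case concrete, here is a rough sketch in present-day YAML of the kind of two-container pod that would need to land on every minion. The image names and security settings are my assumptions for illustration, not Kolla's actual manifests.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nova-compute                              # hypothetical name
spec:
  hostNetwork: true                               # compute services generally need the host's network
  containers:
  - name: libvirt
    image: example.com/kolla/libvirt:latest       # hypothetical image
    securityContext:
      privileged: true                            # libvirt needs broad access to the host
  - name: nova-compute
    image: example.com/kolla/nova-compute:latest  # hypothetical image
```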
I disagree that kubelet config files are more complex, but reasonable people can disagree. As for the host port hack, it is a hack, and if you set 2^32 replicas it is not going to behave well. Certainly you have SOMETHING managing your host machine filesystems? That said, I am not STRONGLY against this idea. Looking forward to a concrete proposal.
Custom autoscaler seems like the way to go.
We need to add constraints to the scheduler to make this work.
@erictune's proposal is along the lines of what I was thinking. Definitely NOT a replication controller with an infinite count. Rather than a hostname constraint, we could expose a node parameter on POST to /pods. Since pods don't reschedule, this constraint doesn't need to be part of the pod spec. This would also enable use of pod templates without getting into field overrides. Internally, we could add it to scheduling constraints, if we chose, but one could potentially also just bypass the scheduler in this case, so long as the apiserver verified feasibility, which is necessary if we want to support multiple schedulers, anyway. The main thing to worry about is races: daemons getting evicted due to missing nodes, then not being the first thing to schedule back on the nodes when they become healthy again. To prevent this, we'd almost certainly need to add forgiveness (#1574). As for whether we should support this feature:
The core functionality seems pretty isolated from everything else, and what it needs are things we'd like to add for other reasons. Maybe it could be implemented as a plugin, which would also make it easier to rip out if we decided it was a bad idea. I'm in favor, but can't see it as high priority at the moment since there is a workaround.
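For reference, the "node parameter on POST to /pods" idea described above roughly corresponds to what spec.nodeName does in today's API: a pod created with that field set bypasses the scheduler and is run directly by the named kubelet. A minimal sketch (node name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: per-node-daemon-node1       # hypothetical name
spec:
  nodeName: node1                   # bind directly to this node; the scheduler is skipped
  containers:
  - name: daemon
    image: example.com/daemon:1.0   # hypothetical image
```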
cc @AnanyaKumar
@ravigadde Please comment regarding whether #10210 satisfies your use case.
I like the idea of having the node controller schedule daemon pods. Putting in one place all the logic that has to run before a node is considered ready to accept regular pods makes a lot of sense. But I'm not sure why you're mixing Deployment into this. It seems that just as we built ReplicationController as one component and plan to layer Deployment on top to allow declaratively describing state of the whole cluster and orchestrating rolling updates, we can build DaemonController as one component (that is mostly identical to ReplicationController except it uses % of nodes where ReplicationController uses # of replicas) and then layer Deployment on top to do those other things. The only tricky part is ensuring the % maps consistently, but IIRC you and @lavalamp had some ideas on how we could do that. Anyway, we can discuss that on the other issue.
(To clarify my previous comment, when I refer to DaemonController I'm talking about logic, not necessarily a separate component; as you said, we could put the logic in the node controller).
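For readers arriving later: this logic eventually shipped as the DaemonSet API, which runs one copy of the pod template on every eligible node rather than a percentage of nodes. A minimal sketch in the current API (names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor                     # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-monitor
  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      containers:
      - name: monitor
        image: example.com/monitor:1.0   # hypothetical image
```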
@bgrant0607 Thanks for your comments. I will add my comments to #10210. +1 for adding this logic to the node controller.
I think this is done and we can close. Thanks @AnanyaKumar! Still to do (and discuss):
I have a question:
There are cases where we want to run a pod on every node. This'll be useful for monitoring things (cAdvisor, DataDog) or replicated storage agents (HDFS node).
Right now you can approximate this by (a) using a hostPort and (b) setting replication count > nodes. It would be better if we had an explicit way of doing this.
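A sketch of that approximation in today's terms (name, image, and replica count are illustrative): the hostPort creates a port conflict that prevents two copies from landing on the same node, so setting replicas higher than the node count yields at most one pod per node.

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: cadvisor                        # hypothetical name
spec:
  replicas: 100                         # deliberately larger than the number of nodes
  selector:
    app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
      - name: cadvisor
        image: example.com/cadvisor:1.0 # hypothetical image
        ports:
        - containerPort: 8080
          hostPort: 8080                # port conflict keeps copies on distinct nodes
```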