Service scale-to-zero (fka "Socket activation") #484
Comments
I like this idea, and the idea of making more sophisticated Service objects in general. Basically, you would bind a Service to a ReplicationController, and if the size of that replication controller was zero, you'd spin up a new container and then make the connection. What do others think? Do you want to send us a PR?
I'm a Go neophyte, plus I'm getting a startup going, so I'm probably not the right person to tackle this atm, unfortunately. This would go well with an auto-scaling story: scale down to zero resources when there's no activity and up to however many containers are needed. So basically the # of replicas set in a ReplicationController would be a max, not the set number.
I like the general idea, but I don't know when we'll get to it. I'd probably argue that this is a different type of controller, or at least a separate responsibility from the replication controller.
Yeah, in terms of the replication controller, I think the # there is always the desired count rather than a max. --brendan
It would be awesome to make a feedback loop between a service's QPS and the number in the replicas field. We have the proxy, so we could in theory automatically count QPS...
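As a rough sketch of what such a feedback loop could look like (not an existing component - `observedQPS` and `setReplicas` are hypothetical stand-ins for reading the proxy's counters and patching the ReplicationController, and the per-replica target and ceiling are arbitrary):

```go
package main

import (
	"math"
	"time"
)

// Hypothetical hooks: observedQPS would scrape the service proxy's counters,
// setReplicas would patch the ReplicationController's replica count.
func observedQPS() float64 { return 0 }
func setReplicas(n int)    {}

const (
	targetQPSPerReplica = 50.0 // assumed capacity of one pod
	maxReplicas         = 10   // the configured replica count, treated as a ceiling
)

func main() {
	for range time.Tick(30 * time.Second) {
		want := int(math.Ceil(observedQPS() / targetQPSPerReplica))
		if want > maxReplicas {
			want = maxReplicas
		}
		// With no traffic this scales to zero; the first request that arrives
		// afterwards becomes the demand signal that scales it back up.
		setReplicas(want)
	}
}
```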
When we went through this debate with Docker + systemd + geard + OpenShift, it did seem like it's best to model it as a characteristic of whichever proxy you were using (service, external load balancer, etc.).

The challenge with systemd activation on a minion is that the port/interface has to be defined before the pod is started, which today means the pod being scheduled to the host. However, the pod has to be "idle" (stopped), which complicates resource scheduling - the scheduler now can't make resource decisions without double-checking whether things have been unidled, and potentially you would wake too many things. The complexity for OpenShift didn't seem worth it, so we said that in OpenShift-next on Kubernetes we would just do the idling/unidling at the edge traffic proxy, which fits in with what Brendan was suggesting. We also generally prefer the idea of removing unused config from the minions to simplify administration. The downside is much higher latency before you can release traffic from the proxy (service -> apiserver -> scheduler -> kubelet poll -> docker container start).

However, if the scheduling problems could be solved (i.e. if that feedback loop from minion to host on unidle could be made sound), it could definitely benefit some use cases.
Speaking as a systemd maintainer, we're always +1 on new integrations with socket activation. I'll keep watching this thread to see if there's any way I can provide implementation guidance. I know the CoreOS folks have extensive experience with Go and socket activation.
Agree on autoscaling being distinct from the replication controller's responsibility.
David: Does idle mean you terminate the daemon process? How do you decide what the right number of total daemons per machine is before you need to add more physical or virtual machines? What do you do when, due to a stroke of bad luck, a bunch of daemons all wake up at the same time and you OOM? Or do you have so many daemons per machine that this is very unlikely? In that case, I guess you know that wakeups are relatively uncorrelated?
I can answer for OpenShift today (not for David, who has a much higher density than us and a slightly different scenario): idle means terminate the daemon, capacity is preplanned and works best at higher densities (it works less well at lower densities), and if you get too many wakes you either throttle, OOM, or fail. In practice, in our case they are uncorrelated.
I can see how this is great for the case where you have lots of infrequently used daemons, lots of tenants, and pre-planned resources. I can also see that things get complicated fast if you try to mix this model with a more "VM hosting" type of model, with guaranteed resources. If k8s is going to support both models, then it would be good to carefully define different names for how much memory a daemon can ever use (for kubelet resource limits #168), versus how much it uses at a moment in time, versus how much it is assumed to use "on average" (for use in scheduling #274 and quota #442). And then display the right numbers in the right context (#317).
A minimalistic approach discussed today: the service could generate an event when it received traffic and had no pods to send it to. These events could simply be another demand signal to an auto-scaler.
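A rough sketch of what that minimal approach could look like inside a service proxy (illustrative only - `readyEndpoints` and `recordDemandEvent` are hypothetical hooks standing in for the endpoints watch and the event API, and the port and timeouts are arbitrary):

```go
package main

import (
	"io"
	"net"
	"time"
)

// Hypothetical hooks: readyEndpoints would come from the endpoints watch,
// recordDemandEvent would post the "traffic but no pods" event an autoscaler watches.
func readyEndpoints() []string { return nil }
func recordDemandEvent()       {}

// handle parks an incoming connection while the autoscaler reacts to the
// demand event, then forwards it once an endpoint shows up.
func handle(conn net.Conn) {
	defer conn.Close()
	backends := readyEndpoints()
	if len(backends) == 0 {
		recordDemandEvent()
		deadline := time.Now().Add(30 * time.Second)
		for len(backends) == 0 && time.Now().Before(deadline) {
			time.Sleep(500 * time.Millisecond)
			backends = readyEndpoints()
		}
		if len(backends) == 0 {
			return // nothing came up in time; drop the request
		}
	}
	upstream, err := net.Dial("tcp", backends[0])
	if err != nil {
		return
	}
	defer upstream.Close()
	go io.Copy(upstream, conn)
	io.Copy(conn, upstream)
}

func main() {
	ln, err := net.Listen("tcp", ":8080") // illustrative service port
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go handle(conn)
	}
}
```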
There'd need to be some way of holding onto the request until a pod has started.
Agreed. It sounds like a nice, generic approach: treating un-hibernation like auto-scaling up from zero.
Yeah, I started prototyping that in Kube proxy and it was fairly straightforward.
There was a proposal: #3247.
Another proposal: #29471
Keep alive!
Why is this stale bot back? When was that decided?
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
/remove-lifecycle rotten
We have the stale bot enabled because an issue being "open" generally means we are either in some state of triage or the issue has some level of prioritization. Currently we don't have any prioritization for this issue (we the maintainers have limited capacity, and the project often relies on the community members asking for features to also be the ones who help deliver them), and it would appear we need people from the community to step forward and champion it if we are going to make this happen.

We would ask that if you intend to bump an issue out of stale/rotten, you do so when you are ready to be assigned the issue and become one of its champions to move it forward. @ShadowJonathan (or others here), please let us know: is this an issue that you want to take on? We will be happy to support your efforts: please put it on the agenda for the next SIG Network community meeting so we can discuss it with you!
I don't know if I have the capacity to work on the issue, but it is a feature that would be helpful for some ideas that have come up in the past (ones that would build on Kubernetes scale-to-zero), and I know that the wider community would gain another useful tool in their toolbox, one that could make Kubernetes a useful replacement for already-existing Linux workload managers such as systemd.
If nobody is planning to pick this up, we should put back the stale marker.
The issue is not the implementation itself. What the community would need is a plan for what steps to take and how to implement it. I researched this issue extensively and came to the conclusion that it would need:
- an extension of the Pod definition for init containers
- an extension of the Pod definition for socket resources
- an extension of the Pod container definition for startup (a container "waiting" phase)
- an extension of the Pod container health check
This looks like sidecar containers, which is currently in beta.
@sathieu Yes, this is exactly the feature; with it, socket activation can be implemented. A sidecar container can create a TCP socket and get into the ready state on the first connection. The only thing that would be missing is a way to disable/postpone the health check.
While we appreciate the details and the interest, we still don't seem to have anyone stepping forward to be assigned the issue for the moment, so it seems the previous lifecycle was still accurate: /lifecycle rotten
Please do let us know if you are interested in taking this one on, and we can provide support.
@Zetanova Why is the current conversation around this feature talking about "suspending" a pod? As @thockin said in this issue a year ago, as long as a pod object exists in etcd, you've incurred the cost required to schedule it, at which point "suspending" it does nothing for you. The value here is specifically in having zero pod objects existing underneath a service and only having a pod created and scheduled once a web request comes in. I apologize if I missed the answer to this question further up in the thread; this is a long one.
The idea of a "suspended" pod which has a minimized footprint (perhaps literally 0, perhaps just "less than nominal") is not off the table. For example, you could have N running pods and M suspended pods - when an up-scale is needed, pick some subset of the suspended pods (1 <= K <= M) to try to "wake up" (in-place resize + state change). Not all wakeup calls will succeed, but that's OK. What I just described is NOT particularly related to this issue, though. It's more of a "make horizontal scaling faster" thing. True scale-to-zero means scaling in real time to traffic, which means SOMEONE has to hold the traffic, almost certainly BEFORE a destination node has been selected. |
Suspending pods means suspending applications, which is an entirely different can of worms than scaling. Apps have tasks and event loops that time out, connection pools to refresh, etc. To make suspending useful, these resources may have to be refreshed/reopened.
What about the ingress controllers? It would seem they might be in a good position for that.
Sure. Obviously, if we did something like this, it would be opt-in, and pod owners would know what's going on.
If the traffic you want goes through one, sure. Pod-to-service traffic doesn't, for example.
@wreed4 I don't know who started with "suspending"; I explained it here multiple times and in different ways. After the process/container starts, it handles the first connection on the already-created socket. Of course the pod, its IP, and its resource limits and requests are still used. With the new sidecar init container feature, it should be possible to use socket activation for the main container. The simple benefit is that an idling service does not consume any memory or CPU cycles; the possible savings are the idle memory and CPU of every such service. @sathieu When I have some time and can update my k8s, I will test it out.
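A minimal sketch of what such an activator sidecar could look like (an illustration under several assumptions, not an existing component): it listens on the port the Service targets, reports startup-probe success only after the first connection - so that, with the beta native-sidecar behaviour, the kubelet only then starts the main container - and then forwards the held traffic to the app on localhost. The ports and addresses are made up, and as noted elsewhere in the thread this only helps for traffic that can reach the pod before it is Ready.

```go
package main

import (
	"io"
	"net"
	"net/http"
	"sync/atomic"
	"time"
)

// Illustrative addresses: the Service would target :8080 (this activator),
// the real application in the main container would listen on 127.0.0.1:9090,
// and :8081 serves the startup probe.
const (
	listenAddr = ":8080"
	appAddr    = "127.0.0.1:9090"
	probeAddr  = ":8081"
)

var activated atomic.Bool

func main() {
	// Startup probe endpoint: reports success only once traffic has arrived,
	// which is what defers the start of the main application container.
	go http.ListenAndServe(probeAddr, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if activated.Load() {
			w.WriteHeader(http.StatusOK)
		} else {
			w.WriteHeader(http.StatusServiceUnavailable)
		}
	}))

	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		activated.Store(true) // the first connection "activates" the pod
		go forward(conn)
	}
}

// forward holds the connection until the application accepts on appAddr,
// then splices the two connections together.
func forward(conn net.Conn) {
	defer conn.Close()
	var upstream net.Conn
	var err error
	for i := 0; i < 60; i++ {
		if upstream, err = net.Dial("tcp", appAddr); err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	if err != nil {
		return
	}
	defer upstream.Close()
	go io.Copy(upstream, conn)
	io.Copy(conn, upstream)
}
```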
@Zetanova socket activation works, but it depends on the fact that the agent which receives the traffic is able to hand off to the final process. When it's all local to a machine, that works pretty well. What this proposal needs is to flesh out all of the OTHER cases where such a hand-off is not obvious. If a service uses […], who does the hand-off? If a Pod is talking to a Service in the same cluster, same problem but for […]. There are lots of cases to consider. It may be viable to spec some of them away - e.g. you can't use scale-to-zero for […].

As for why the conversation shifted to "suspended" pods, you yourself described it in #484 (comment), though you did not use the word "suspend". In order to assign a pod to a node, you have to make resource requests. Those requested resources are unavailable for other scheduling. If you specify too little, your app may not start or run. If you specify the real amount, then there's no point in playing games - just run the pod. In #484 (comment) I suggested using in-place pod resize plus a "shotgun" approach to increase the probability of success (but still no guarantee, and still needs traffic parking).

If we are OK with scale-from-zero just failing sometimes, then what you describe COULD work. Pick a ready-to-run pod, wake it up, try to do an in-place resize but send the traffic to it anyway. If it fails, the user gets an error, probably. That sounds like an experience nobody would choose to use, IMO. As always, if I am missing your point, please clarify.
@thockin Socket activation solves the issue of idle CPU load and real memory allocation. Many processes use 1-5m of idle CPU and >50mb of memory just to idle around. Sidecar containers like sshd, or rsync for backups, or an admission controller can profit from k8s-defined socket activation. We can make a tech demo with the above-mentioned sidecar init container; this would already be a start.
That's an entirely different problem than "scale-to-zero", and I don't think it needs Kubernetes changes to support it? If you don't mind pre-scheduling your pods (and committing the resources they request), an in-pod workaround can already do this today.
Obviously there's a little more to it than that, but I am no longer confident in my understanding of what you think Kubernetes should be providing.
@thockin In k8s it should now be possible with a `sidecar-init-container` to achieve nearly the same thing. But there are still a few improvements to make regarding the health check, resource.requests and so on. I can think of one possible extension for the health-check definition.
I think I see what you are getting at, but I'll be honest - I am doubtful that we want to do this, when in-pod workarounds exist. Without solving the scheduling/requests part of it, the return-on-complexity doesn't seem justifiable. As always, prove me wrong - show a demo, write a KEP, convince me I am not seeing the whole picture.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
I want to start this issue to drum up support for adding socket activation support to Kubernetes. My CTO at the last place I worked at, @davidstrauss, invented this technique.
Normally, for a service to be available, its daemon has to be running and listening on a socket. With socket activation, another service, generally systemd, listens for all incoming traffic and, on an incoming request, passes the request along to its service, waking it if necessary.
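For reference, the daemon-side mechanics are small. Here is a minimal sketch in Go using the CoreOS go-systemd activation package, assuming a matching `.socket` unit has been set up so that systemd owns the port and only starts the process when the first connection arrives:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/coreos/go-systemd/v22/activation"
)

func main() {
	// Listeners returns the already-open sockets systemd passed in via
	// LISTEN_FDS; the process itself never binds the port.
	listeners, err := activation.Listeners()
	if err != nil || len(listeners) == 0 {
		panic("expected a listening socket handed over by systemd")
	}
	http.Serve(listeners[0], http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "woken up by socket activation")
	}))
}
```

The open question in this thread is who plays systemd's role in Kubernetes - the service proxy, an external load balancer, or a sidecar.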
By idling inactive containers, there can be significant savings on memory. At Pantheon, which hosts Drupal and WordPress sites, at any time some ~80-90% of containers are idle, waiting for traffic. This means they can proportionally increase the density of containers, resulting in significant operations and server cost savings.
This is of course not a technique for everyone. You need to have services which won't get any use for significant periods of time, and you need to be able to accept a few seconds' delay for the initial wake. But for many people, especially those running multi-tenant architectures, this would be a very valuable feature if baked in and easy to use, and it would help Kubernetes be even more elastic.
Some more information on socket activation: