
Service scale-to-zero (fka "Socket activation") #484

Closed
KyleAMathews opened this issue Jul 16, 2014 · 72 comments
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
  • sig/network: Categorizes an issue or PR as relevant to SIG Network.
  • sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@KyleAMathews
Contributor

I want to start this issue to drum up support for adding socket activation support to Kubernetes. The CTO at my last company, @davidstrauss, invented this technique.

Normally, for a service to be available, its daemon has to be running and listening on a socket. With socket activation, another service, generally systemd, listens for all incoming traffic and, on an incoming request, passes the request to the target service, waking it if necessary.
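
For readers unfamiliar with the mechanism, the daemon side of systemd socket activation is small. Here is a minimal sketch in Go using the coreos/go-systemd activation package; the port ownership sits entirely in a systemd .socket unit, and the handler plus the single-listener assumption are illustrative, not anything this issue specifies:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/coreos/go-systemd/v22/activation"
)

func main() {
    // systemd (not this process) binds the socket, starts us on the first
    // incoming connection, and hands over the already-listening fd via the
    // LISTEN_FDS/LISTEN_PID environment variables.
    listeners, err := activation.Listeners()
    if err != nil || len(listeners) != 1 {
        log.Fatal("expected exactly one socket-activated listener")
    }
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, "woken by your request")
    })
    // The connection that triggered the wake-up is still queued on the
    // socket, so no request is lost while the daemon starts.
    log.Fatal(http.Serve(listeners[0], nil))
}

The service itself only runs while there is work; when it exits, the socket unit keeps the port open for the next request.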

By idling inactive containers, there can be significant savings on memory. At Pantheon, which hosts Drupal and WordPress sites, roughly 80-90% of containers at any given time are idle, waiting for traffic. This lets them proportionally increase container density, yielding significant operations and server cost savings.

This is of course not a technique for everyone. You need services that go unused for significant periods, and you must accept a delay of a few seconds on the initial wake. But for many people, especially those running multi-tenant architectures, this would be a very valuable feature if baked in and easy to use, and it would help Kubernetes be even more elastic.

Some more information on socket activation:

@brendandburns
Contributor

I like this idea, and the idea of making more sophisticated Service objects in general.

Basically, you would bind a Service to a ReplicationController; if the size of that ReplicationController were zero, you'd spin up a new container and then make the connection.

What do others think?

Do you want to send us a PR?

@KyleAMathews
Contributor Author

I'm a Go neophyte, plus I'm getting a startup going, so I'm probably not the right person to tackle this atm, unfortunately.

This would fit well in an auto-scaling story. Scale down to zero resources when there's no activity, and up to however many containers are needed. So basically the # of replicas set in a ReplicationController would be a max, not the set number.

@thockin
Member

thockin commented Jul 16, 2014

I like the general idea, but I don't know when we'll get to it - we've been sort of inundated in the last month :)

I'd probably argue that this is a different type of controller, or at least a different parameter to the replication controller.


@brendandburns
Contributor

Yeah, in terms of the replication controller, I think the # there is always the truth. If we want to introduce auto-scaling, we should introduce an additional controller that is in charge of auto-scaling.

--brendan


@lavalamp
Member

It would be awesome to make a feedback loop between a service's QPS and the number in the replicas field. We have the proxy, so we could in theory automatically count QPS...
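
As a sketch of what such a feedback loop could compute, assuming the proxy can count QPS per service; the target-QPS-per-pod knob and the max cap are assumptions for illustration, not an existing Kubernetes API:

package main

import (
    "fmt"
    "math"
)

// desiredReplicas maps QPS observed at the proxy onto the replicas field.
// targetQPSPerPod and maxReplicas are hypothetical tuning knobs.
func desiredReplicas(observedQPS, targetQPSPerPod float64, maxReplicas int) int {
    n := int(math.Ceil(observedQPS / targetQPSPerPod))
    if n < 0 {
        n = 0
    }
    if n > maxReplicas {
        n = maxReplicas
    }
    return n
}

func main() {
    fmt.Println(desiredReplicas(0, 50, 10))   // 0 -> scale to zero while idle
    fmt.Println(desiredReplicas(420, 50, 10)) // 9 -> scale with load
}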

@smarterclayton
Contributor

When we went through this debate with Docker + systemd + geard + OpenShift, it seemed best to model it as a characteristic of whichever proxy you were using (service, external load balancer, etc.).

The challenge with systemd activation on a minion is that the port/interface has to be defined before the pod is started, which today means the pod being scheduled to the host. However, the pod has to be "idle" (stopped), which complicates resource scheduling - the scheduler can no longer make resource decisions without double-checking whether things have been unidled, and potentially you would wake too many things. The complexity for OpenShift didn't seem worth it, so we decided that in the next OpenShift on Kubernetes we would just do the idling/unidling at the edge traffic proxy, which fits with what Brendan was suggesting. We also generally prefer the idea of removing unused config from the minions to simplify administration.

The downside is that you have much higher latency before you can release traffic from the proxy (service -> apiserver -> scheduler -> kubelet poll -> docker container start).

However, if the scheduling problems could be solved (i.e. if that feedback loop from minion to host on unidle could be made sound) it could definitely benefit some use cases.

@davidstrauss

Speaking as a systemd maintainer, we're always +1 on new integrations with socket activation. I'll keep watching this thread to see if there's any way I can provide implementation guidance. I know the CoreOS folks have extensive experience with Go and socket activation.

@smarterclayton
Contributor

Agree on autoscaling being distinct from the replication controller's responsibility.

@erictune
Member

David:

Does idle mean you terminate the daemon process?

How do you decide what is the right number of total daemons per machine, before you need to add more physical or virtual machines?

What do you do when, due to a stroke of bad luck, a bunch of daemons all wake up at the same time, and you OOM? Or do you have so many daemons per machine that this is very unlikely? In that case, I guess you know that wakeups are relatively uncorrelated?

@smarterclayton
Contributor

I can answer for OpenShift today (not for David, who has a much higher density than us and a slightly different scenario) - idle means terminating the daemon; capacity is preplanned and works best at higher densities (it works less well at lower densities); if you get too many wakes you either throttle, OOM, or fail. In practice, in our case, they are uncorrelated.

@erictune
Member

I can see how this is great for the case where you have lots of infrequently used daemons, lots of tenants, and pre-planned resources.

I can also see that things get complicated fast if you try to mix this model with a more "VM hosting" type of model, with guaranteed resources.

If k8s is going to support both models, then it would be good to carefully define different names for how much memory a daemon can ever use (for kubelet resource limits #168), versus how much it uses at a moment in time, versus how much it is assumed to use "on average" (for use in scheduling #274 and quota #442). And then display the right numbers in the right context (#317).

@bgrant0607 bgrant0607 added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Sep 30, 2014
@bgrant0607
Member

A minimalistic approach discussed today: The service could generate an event when it received traffic and had no pods to send it to. These events could simply be another demand signal to an auto-scaler.
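
A rough sketch of what that minimalistic approach could look like inside a proxy, in Go. Everything here is hypothetical: the port, the recordEvent helper standing in for writing a real Kubernetes Event, and the choice to drop rather than park the connection:

package main

import (
    "io"
    "log"
    "net"
)

// endpoints for the service; empty simulates "scaled to zero".
var endpoints []string

// recordEvent stands in for emitting a Kubernetes Event (hypothetical helper).
func recordEvent(reason, msg string) { log.Printf("event %s: %s", reason, msg) }

func main() {
    ln, err := net.Listen("tcp", ":8080") // placeholder service port
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        if len(endpoints) == 0 {
            // The demand signal: traffic arrived and there is nothing to
            // send it to. An autoscaler watching these events can react.
            recordEvent("NeedPods", "traffic for service with no endpoints")
            conn.Close() // a real implementation might park the connection
            continue
        }
        go proxy(conn, endpoints[0])
    }
}

func proxy(c net.Conn, backend string) {
    defer c.Close()
    b, err := net.Dial("tcp", backend)
    if err != nil {
        return
    }
    defer b.Close()
    go io.Copy(b, c)
    io.Copy(c, b)
}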

@bgrant0607 bgrant0607 added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Dec 4, 2014
@KyleAMathews
Contributor Author

There'd need to be some way of holding onto the request until a pod has been instantiated, but otherwise that sounds great.

@davidstrauss

Agreed. It sounds like a nice, generic approach: treating un-hibernation like auto-scaling up from zero.

@smarterclayton
Contributor

Yeah, I started prototyping that in Kube proxy and it was fairly straightforward.


@bgrant0607
Member

There was a proposal: #3247.

@bgrant0607
Member

Another proposal: #29471

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2024
@bhack

bhack commented Jan 30, 2024

Keep alive!

@ShadowJonathan

Why is this stale bot back? When was that decided?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 29, 2024
@ShadowJonathan

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 29, 2024
@shaneutt
Member

shaneutt commented Mar 21, 2024

> Why is this stale bot back? When was that decided?

We have the stale bot enabled because an issue being "open" generally means we are either in some state of triage, or the issue has some level of prioritization.

Currently we don't have any prioritization for this issue (we the maintainers have limited capacity, and the project often relies on the community members asking for features to also be the ones who help deliver them), and it would appear we need people from the community to step forward and champion it if we are going to make this happen.

We would ask that if you intend to bump an issue out of stale/rotten, you do so when you are ready to be assigned the issue and become one of its champions to move it forward.

@ShadowJonathan (or others here) please let us know: is this an issue that you want to take on? We will be happy to support your efforts: please put it on the agenda for the next SIG Network community meeting so we can discuss it with you!

@ShadowJonathan

I don't know if I have the capacity to work on this issue, but it's a feature that would help some ideas I've come across in the past (which would build on Kubernetes scale-to-zero), and I know the wider community would gain another useful tool in their toolbox, one that could make Kubernetes a viable replacement for existing Linux workload managers such as systemd.

@sftim
Contributor

sftim commented Mar 21, 2024

If nobody is planning to pick this up, we should put back the stale marker.

@Zetanova

The issue is not the implementation itself.
It requires an extension of the k8s core pod definition and lifecycle.

What the community would need is a plan for the steps in which to implement it.

I researched this issue extensively and came to the conclusion that it can be split into 4 separate features:

.) Extension of the Pod definition for init containers
An init container needs to be declarable as long-running.
This would require the init container to signal the pod that initialization has completed,
but without using the return code and exiting the process.
With this extension it would already be possible to make a simple socket-activation init container.

.) Extension of the Pod definition for socket resources
It should be possible to define all types of sockets/pipes inside the Pod definition (like AF_UNIX, AF_INET, ...).
The pod can then create them at initialization time (pod pending), as is done for volumes/ConfigMaps.
This would mirror the behavior of systemd.
This feature by itself would already be a great addition; container restarts and container IPC could profit from it.
Example CSI: kubelet <> sidecars <> csi-driver communication.

.) Extension of the Pod container definition for startup (container waiting phase)
It should be possible to define a container that is scheduled but not started until a run signal/event arrives.
The run signal/event can contain dynamically generated ENV vars to be used for the startup of the container.
This would also resolve another big issue/feature with the STS ordinal index.

.) Extension of the Pod container health check
It should be possible to mark a container as statically live and/or ready.
This would deactivate the health check until the label is removed.
This would enable a service to transfer traffic to the pod for activation.

@sathieu

sathieu commented Mar 21, 2024

> Extension of the Pod definition for Init-Container

This looks like sidecar containers, which are currently beta.

@Zetanova

Zetanova commented Mar 21, 2024

@sathieu Yes, this is exactly that feature; with it, socket activation can be implemented.

A sidecar container can create a tcp socket and get into the ready state on the first connection.
This would postpone the main container's initialization (hoping that k8s pre-downloads the image).

The only thing still missing is a way to disable/postpone the health check.
Otherwise the service needs the publishNotReadyAddresses: true field set.
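
A sketch of such a wake-on-connect sidecar, under stated assumptions: the port, the /tmp/ready path, and the idea that a file-based startup/readiness probe gates the main container are all illustrative, and the first client has to retry once the real server is up:

package main

import (
    "log"
    "net"
    "os"
)

func main() {
    ln, err := net.Listen("tcp", ":8080") // placeholder port
    if err != nil {
        log.Fatal(err)
    }
    conn, err := ln.Accept() // park here for as long as the service is idle
    if err != nil {
        log.Fatal(err)
    }
    conn.Close() // the first client retries once the real server is up
    ln.Close()   // release the port for the main container
    // Flip the hypothetical file-based probe so the main container starts.
    if err := os.WriteFile("/tmp/ready", []byte("ok"), 0o644); err != nil {
        log.Fatal(err)
    }
    select {} // a sidecar has to keep running
}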

@shaneutt
Member

While we appreciate the details and the interest, we still don't seem to have anyone who's stepping forward to be assigned the issue for the moment, so it seems the previous lifecycle was still accurate:

/lifecycle rotten

Please do let us know if you are interested in taking this one on, and we can provide support.

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 22, 2024
@wreed4

wreed4 commented Mar 28, 2024

@Zetanova Why is the current conversation around this feature talking about "suspending" a pod? As @thockin said in this issue a year ago, as long as a pod object exists in etcd, you've incurred the cost required to schedule it, at which point "suspending" does nothing for you. The value here is specifically in having zero pod objects existing underneath a service, and only having a pod created and scheduled once a web request comes in. I apologize if I missed the answer to this question further up in the thread; this is a long one.

@thockin
Member

thockin commented Mar 28, 2024

The idea of a "suspended" pod which has a minimized footprint (perhaps literally 0, perhaps just "less than nominal") is not off the table. For example, you could have N running pods and M suspended pods - when an up-scale is needed, pick some subset of the suspended pods (1 <= K <= M) to try to "wake up" (in-place resize + state change). Not all wakeup calls will succeed, but that's OK. What I just described is NOT particularly related to this issue, though. It's more of a "make horizontal scaling faster" thing.

True scale-to-zero means scaling in real time to traffic, which means SOMEONE has to hold the traffic, almost certainly BEFORE a destination node has been selected.

@christf

christf commented Mar 28, 2024

> The idea of a "suspended" pod which has a minimized footprint (perhaps literally 0, perhaps just "less than nominal") is not off the table. For example, you could have N running pods and M suspended pods - when an up-scale is needed, pick some subset of the suspended pods (1 <= K <= M) to try to "wake up" (in-place resize + state change). Not all wakeup calls will succeed, but that's OK. What I just described is NOT particularly related to this issue, though. It's more of a "make horizontal scaling faster" thing.

Suspending pods means suspending applications, which is an entirely different can of worms than scaling. Apps have tasks and event loops that time out, connection pools to refresh, etc. To make suspending useful, these resources may have to be refreshed/reopened.

> True scale-to-zero means scaling in real time to traffic, which means SOMEONE has to hold the traffic, almost certainly BEFORE a destination node has been selected.

What about ingress controllers? It would seem they might be in a good position for that.

@thockin
Member

thockin commented Mar 29, 2024

> Suspending pods means suspending applications, which is an entirely different can of worms

Sure. Obviously, if we did something like this, it would be opt-in, and pod owners would know what's going on.

> What about ingress controllers? It would seem they might be in a good position for that.

If the traffic you want goes through one, sure. Pod-to-service traffic doesn't, for example.

@Zetanova

@wreed4 I don't know who started with "suspended". I think there is a misunderstanding of the socket-activation function.

I explained it here multiple times and in different ways.

Socket activation is a well-known feature/pattern of systemd; docker itself supports it too.
It is basically the postponing of a process's start until the signal from a new tcp connection arrives.
The process picks up the pre-created socket (e.g. a UDS) via a well-known ENV var.

After the process/container starts, it handles the first connection on the created socket,
and it uses the same socket for all following connections until it decides to exit (scale down to zero).
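
For concreteness, the env-var convention referred to here is systemd's LISTEN_PID/LISTEN_FDS protocol, where passed sockets start at file descriptor 3. A minimal Go sketch of the receiving side (the single-socket assumption is for brevity):

package main

import (
    "errors"
    "log"
    "net"
    "os"
    "strconv"
)

// Per the systemd convention: LISTEN_PID names the intended recipient and
// LISTEN_FDS counts the passed sockets, which start at file descriptor 3.
func inheritedListener() (net.Listener, error) {
    if os.Getenv("LISTEN_PID") != strconv.Itoa(os.Getpid()) {
        return nil, errors.New("not socket-activated")
    }
    if n, _ := strconv.Atoi(os.Getenv("LISTEN_FDS")); n < 1 {
        return nil, errors.New("no fds passed")
    }
    return net.FileListener(os.NewFile(3, "inherited-socket"))
}

func main() {
    ln, err := inheritedListener()
    if err != nil {
        log.Fatal(err)
    }
    defer ln.Close()
    conn, err := ln.Accept() // the activating connection is already queued
    if err != nil {
        log.Fatal(err)
    }
    conn.Close()
}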

Of course the pod, its IP, and its resource limits and requests are still used.

With the new sidecar init container feature, it should be possible to use socket activation for the main-container:
https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/

The simple benefit is that an idling service does not consume any memory or cpu cycles.
Multiple services can also start in a self-defining sequence;
systemd solved boot-sequence and resource-burst problems with it.

@sathieu
Socket activation for sidecars like backup or sshd is still not possible.

"When" I have some time and can update my k8s, I will test it out.

Possible savings:

  • a dotnet service would save at least 60mb memory and 1-4m idle cpu
  • a PHP container like phpmyadmin: 45mb mem, 1m cpu
  • kubelet-csr-approver: 2x ~20mb mem, 1-2m cpu
  • powerdnsadmin: 250mb mem, 2m cpu
  • cert-manager-webhook: 49mb mem, 4m cpu
  • cert-manager: maybe 117mb mem, 155m cpu
  • and so on

@thockin
Member

thockin commented Mar 29, 2024

@Zetanova socket activation works, but it depends on the agent that receives the traffic being able to hand off to the final process. When it's all local to a machine, that works pretty well.

What this proposal needs is to flesh out all of the OTHER cases where such hand-off is not obvious.

If a service uses externalTrafficPolicy: Local then either: a) the upstream LB can't choose a node to deliver to until the new Pod is scheduled; or b) we sometimes violate the externalTrafficPolicy.

If a Pod is talking to a Service in the same cluster, same problem but for internalTrafficPolicy.

There are lots of cases to consider. It may be viable to spec some of them away - e.g. "you can't use scale-to-zero with iTP/eTP: Local". As stated elsewhere, this would need a KEP.

As for why the conversation shifted to "suspended" pods: you yourself described it in #484 (comment), though you did not use the word "suspend".

In order to assign a pod to a node, you have to make resource requests. Those requested resources are unavailable for other scheduling. If you specify too little, your app may not start or run. If you specify the real amount, then there's no point in playing games - just run the pod. In #484 (comment) I suggested using in-place pod resize plus a "shotgun" approach to increase the probability of success (but still no guarantee, and still needing traffic parking).

If we are OK with scale-from-zero just failing sometimes, then what you describe COULD work. Pick a ready-to-run pod, wake it up, try to do an in-place resize but send the traffic to it anyway. If it fails, the user gets an error, probably. That sounds like an experience nobody would choose to use, IMO.

As always, if I am missing your point, please clarify.

@Zetanova

@thockin Socket activation is pod-internal.

It solves the issue of idle cpu load and real memory allocation.
Not every process runs as smoothly as nginx does.

Many processes use 1-5m of idle cpu and >50mb of memory just to idle around.
Bad ones use >50m cpu and over >124mb.

Sidecar containers like sshd or rsync for backups, or admission controllers, can profit from socket activation.

k8s-defined resources.requests and under/over-commitment are related but a separate issue.

We can make a tech demo with the above-mentioned sidecar init container; this would already be a start.
A KEP for resources.requests, static health-check status, sockets as a resource, and a startup signal can be made separately.

@thockin
Member

thockin commented Mar 29, 2024

> It solves the issue of idle cpu load and real memory allocation.

That's an entirely different problem than "scale-to-zero", and I don't think it needs Kubernetes changes to support it?

If you don't mind pre-scheduling your pods (and committing the requests) then what's left seems like it can be worked around within the pod or even within a container:

while true {
    accept()            // park until a client connects
    fork()              // the child inherits the socket and the connection
    exec(realCommand)   // the child becomes the real server; the parent loops
}

Obviously there's a little more than that, but I am no longer confident in my understanding of what you think Kubernetes should be providing.
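
For what it's worth, a runnable rendering of that inetd-style loop might look like the following in Go; the port and the /app/real-server path are placeholders, and the real server is assumed to serve one client on stdin/stdout and then exit:

package main

import (
    "log"
    "net"
    "os/exec"
)

func main() {
    ln, err := net.Listen("tcp", ":8080") // placeholder port
    if err != nil {
        log.Fatal(err)
    }
    for {
        conn, err := ln.Accept() // park here while the workload is idle
        if err != nil {
            log.Fatal(err)
        }
        f, err := conn.(*net.TCPConn).File() // dup the connection fd for the child
        if err != nil {
            log.Fatal(err)
        }
        cmd := exec.Command("/app/real-server") // placeholder binary
        cmd.Stdin, cmd.Stdout = f, f            // classic inetd: conn is stdin/stdout
        if err := cmd.Run(); err != nil {       // serve one client, then return
            log.Print(err)
        }
        f.Close()
        conn.Close()
    }
}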

@Zetanova

@thockin
systemd and docker both currently support the socket-activation pattern.

In k8s it should now be possible with the sidecar init container to achieve nearly the same.

But there are still a few improvements to make regarding the health check, resources.requests, and so on.

One extension to the health-check definition that I can think of could be:

  • Define a container with a statically healthy readiness and liveness status, and remove that label when the container starts.
  • When the defined container exits with code X, don't restart it, and set it back to the statically healthy readiness and liveness status.

@thockin
Member

thockin commented Mar 29, 2024

I think I see what you are getting at, but I'll be honest - I am doubtful that we want to do this, when in-pod workarounds exist. Without solving the scheduling/requests part of it, the return-on-complexity doesn't seem justifiable.

As always, prove me wrong - show a demo, write a KEP, convince me I am not seeing the whole picture.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage robot's /close not-planned message above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 28, 2024