We have applications that require many processes to be spawned on the same machine.
In order to port those applications to Kubernetes/OpenShift, we plan to have many containers (with only one process inside) in a POD.
The first remark is that a process being started does not mean it is ready to work and process traffic.
For example, we can have a process that starts its life by fetching some configuration elements from somewhere and doing some expensive configuration stuff before being really ready to work.
Further in this document, we'll make a distinction between the `starting` and `ready` states:
- `starting` means that the container is running and the process is live, but it is not ready to do its real "work". For example, it may still be loading libraries, waiting for a database connection, building some internal caches, etc.
- `ready` means that the container is running, is fully functional and is ready to process incoming traffic.
By extension, we can define POD states as:
- if a POD has at least one container in the `starting` state, then the POD is `starting`;
- if all the containers of a POD are `ready`, then the POD is `ready`.
In today's Kubernetes terminology, we don't distinguish those two states and both are named `running`.
The initialization of the containers might be CPU expensive. We have examples of applications that:
- load a huge number of dynamic libraries and the symbol relocation takes a significant amount of time;
- retrieve some configuration from a file or a remote DB and denormalize that configuration, which can be expensive as well;
- create some local caches that need to be fed before the application can work;
- etc.
A starting container which consumes a lot of resources can have two kinds of consequences:
- It slows down the `ready` containers which are already running on the same machine, and causes a response time degradation of the services provided by PODs running on the same machine as the starting one.
- It slows down the `starting` containers themselves, which take longer to become `ready`.
This issue can be solved by adjusting the priorities of the different PODs so that one POD cannot starve other ones.
The impact is that it slows down the starting time of the processes themselves because of resource starvation.
The main issue we have is that our processes have internal health checks that verify that the `starting` phase does not last longer than a pre-defined time-out.
In case of resource starvation, the time-outs expire and the programs think they have fallen into an infinite loop or similar.
We used to solve that issue by limiting the number of processes that can start simultaneously, in order to avoid resource starvation and to have more predictable start-up times.
Let’s consider the situation where, on a given machine, we have:
- a POD in the `ready` state which is not consuming a lot of CPU, but which is handling requests whose response time is critical;
- a POD in the `starting` state which is very CPU greedy.
We want to prevent the `starting` POD from starving the `ready` one, in order not to degrade the response time of the processes of the `ready` POD. We must guarantee that even if the `starting` POD has many more containers than the `ready` one.
Today, docker containers are spawned in their own cgroup, but those cgroups are all children of /system.slice. As a consequence, all the containers have the same weight. If the `starting` POD has ten times more containers than the `ready` one, it will be allocated ten times more CPU.
Having an additional layer with a slice per POD would allow a fair resource allocation per POD instead of per container.
We propose to enhance docker to be able to specify the parent slice of the containers' cgroups (Pull request #9436, Pull request #9551) and to enhance Kubernetes to create one slice per POD.
Before:
systemd-cgls
├─1 /sbin/init
├─system.slice
│ ├─docker-123456….scope
│ │ └─100 /foo/bar/baz
│ ├─docker-123457….scope
│ │ └─101 /foo/bar/baz
│ ├─docker-123458….scope
│ │ └─103 /foo/bar/baz
│ ├─docker-123459….scope
│ │ └─104 /foo/bar/baz
After:
systemd-cgls
├─1 /sbin/init
├─system.slice
│ ├─kubernetes.slice
│ │ ├─k8s_pod_X.slice
│ │ │ ├─docker-123456….scope
│ │ │ │ └─100 /foo/bar/baz
│ │ │ └─docker-123457….scope
│ │ │ └─101 /foo/bar/baz
│ │ └─k8s_pod_Y.slice
│ │ ├─docker-123458….scope
│ │ │ └─103 /foo/bar/baz
│ │ └─docker-123459….scope
│ │ └─104 /foo/bar/baz
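To make the intent concrete, here is a minimal sketch of how kubelet could derive the per-POD slice name shown above and inject it as the parent of each container's cgroup. The naming scheme and the `--cgroup-parent`-style option are assumptions based on the proposed docker enhancement, not an existing API:

```go
package main

import "fmt"

// podSliceName derives a per-POD slice name matching the hierarchy above.
// The exact naming scheme is illustrative only.
func podSliceName(podID string) string {
	return fmt.Sprintf("k8s_pod_%s.slice", podID)
}

// runInPodSlice shows where the parent slice would be injected: the container's
// cgroup is created under the POD slice (via a "--cgroup-parent" style option,
// as proposed in the pull requests above) instead of directly under system.slice.
func runInPodSlice(podID, image string) {
	fmt.Printf("docker run --cgroup-parent=%s %s\n", podSliceName(podID), image)
}

func main() {
	runInPodSlice("X", "me/app_a")
	runInPodSlice("X", "me/app_b")
	runInPodSlice("Y", "me/app_c")
}
```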
In order to not overload the machine, we propose to not start a container as soon as its image has been pulled, but to control that start with a policy.
When a POD is assigned to a minion, kubelet creates the following FSM for each container:
When an image is pulled, the container is not started immediately. Instead, it enters a `pending` state.
The `pending` to `starting` transition is triggered when the throttling policy authorizes it. The throttling policy is described below.
The `starting` to `ready` transition is triggered when the container is considered ready.
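As an illustration, a minimal sketch of the per-container FSM kubelet could maintain (Go; the type and function names are purely illustrative, not an existing Kubernetes API):

```go
package main

import "fmt"

// ContainerState models the per-container FSM held by kubelet.
type ContainerState int

const (
	Pending  ContainerState = iota // image pulled, waiting for the throttling policy
	Starting                       // process launched, not yet ready to handle traffic
	Ready                          // readiness check succeeded, serving traffic
)

type Container struct {
	Name  string
	State ContainerState
}

// authorizeStart is called when the throttling policy allows one more start.
func (c *Container) authorizeStart() {
	if c.State == Pending {
		c.State = Starting
		fmt.Println(c.Name, "-> starting")
	}
}

// markReady is called on the first successful readiness check
// (notification or probe, see below).
func (c *Container) markReady() {
	if c.State == Starting {
		c.State = Ready
		fmt.Println(c.Name, "-> ready")
	}
}

func main() {
	c := &Container{Name: "app_a", State: Pending}
	c.authorizeStart()
	c.markReady()
}
```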
The `starting` to `ready` transition is new. It can be triggered in two ways:
- either the containers send a notification to signal that they are ready;
- or the containers are polled by docker/kube to check their readiness.
The first solution (notification) requires the containers to notify their readiness.
systemd has a similar requirement and lets services notify their readiness via different means:
- simple: the service is immediately ready;
- forking: the service is ready as soon as the parent process exits;
- dbus: the service is ready as soon as it acquires a name on D-Bus;
- notify: the service is ready as soon as it has explicitly notified systemd about it by posting a message on a dedicated UNIX socket via the `sd_notify` function.
For containers, the `notify` service type seems to be the most suitable.
- Notification is sent as soon as possible.
- Either it introduces a systemd dependency, or it requires implementing another notification mechanism inspired by the `sd_notify` feature. In all cases, it is intrusive since it requires implementing something in the containerized processes.
- This may be perceived as an advantage because existing programs may already implement this `sd_notify` call. In practice, when possible, it is preferable to decouple the public resource allocation (socket binding for example) from the program start-up. Concretely, the sockets are bound by systemd itself, the program is `Type=simple` (considered ready immediately) and the program receives the file descriptor of the socket.
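For reference, a minimal sketch of what an `sd_notify`-like readiness notification could look like from inside a containerized process. It speaks the sd_notify wire protocol directly (a datagram containing `READY=1` sent to the socket named by `$NOTIFY_SOCKET`); the supervisor side that would listen on this socket in docker/kube is not shown and would have to be defined:

```go
package main

import (
	"net"
	"os"
)

// notifyReady sends "READY=1" on the datagram socket advertised in
// $NOTIFY_SOCKET, which is the core of the sd_notify protocol.
func notifyReady() error {
	addr := os.Getenv("NOTIFY_SOCKET")
	if addr == "" {
		return nil // no supervisor asked to be notified
	}
	conn, err := net.DialUnix("unixgram", nil, &net.UnixAddr{Name: addr, Net: "unixgram"})
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte("READY=1"))
	return err
}

func main() {
	// ... expensive start-up work: load libraries, warm caches, connect to the DB ...
	_ = notifyReady()
	// ... start serving traffic ...
}
```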
The second solution (polling) consists in having docker/kube regularly check the readiness of containers.
Such a mechanism already exists in Kubernetes as LivenessProbe. There are already different flavours of LivenessProbes:
- HTTP probe: try to do an HTTP GET on a given URL
- TCP probe: try to connect on a given port
- Exec: try to execute an arbitrary command inside the container
The last one seems generic enough to implement anything.
- LivenessProbe is a mechanism that already exists;
- It is not intrusive since it doesn't require implementing an `sd_notify`-like call in the programs.
- It is based on "polling".
We propose to reuse the LivenessProbe mechanism.
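As a sketch of how such a poll-based readiness check could work, assuming for illustration that the exec probe is run through `docker exec` (in kubelet it would go through the existing LivenessProbe machinery; the container ID and command below are hypothetical):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// waitReady polls an exec-style probe until it succeeds, then reports the
// container as ready. The probe is run with "docker exec" for the sake of the
// example only.
func waitReady(containerID, probeCmd string, period time.Duration) {
	for {
		// exit status 0 means the readiness check passed
		if err := exec.Command("docker", "exec", containerID, probeCmd).Run(); err == nil {
			fmt.Println(containerID, "is ready")
			return
		}
		time.Sleep(period)
	}
}

func main() {
	waitReady("123456", "/check_A_health", 2*time.Second)
}
```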
The throttling policy is what triggers the `pending` to `starting` transition.
Several strategies are possible.
A first policy ("maximum number of `starting` containers") allows a container to go from `pending` to `starting` as soon as the number of containers in the `starting` state on the minion (whatever POD they belong to) drops below a pre-defined threshold.
- Simple
- It's non-trivial to set the threshold:
  - If the processes are CPU bound and consume 100% of a CPU while they are in the `starting` state, then the optimal threshold would be the number of cores of the machine.
  - If the processes are mostly waiting for external resources, then the above recommendation is suboptimal.
  - If the processes are multi-threaded and consume 100% of 3 CPUs, then the optimal threshold is a third of the number of cores of the machine.
- If processes are stuck in the `starting` state, they will prevent other containers from being started. It is thus mandatory to implement a time-out mechanism that moves `starting` containers to the `failed` state if they stay in `starting` for too long.
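A minimal sketch of this first policy and of the associated time-out (Go; the one-starting-container-per-core threshold and the names are illustrative):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// maxStarting is the pre-defined threshold discussed above; one starting
// container per core is a sensible default only if the start-up phases are
// CPU bound and single-threaded.
var maxStarting = runtime.NumCPU()

// canStart tells whether one more container may go from pending to starting.
func canStart(numStarting int) bool {
	return numStarting < maxStarting
}

// isStuck implements the mandatory time-out: a container that has been in the
// starting state for longer than the threshold should be moved to failed so
// that it stops blocking pending containers.
func isStuck(startedAt time.Time, timeout time.Duration) bool {
	return time.Since(startedAt) > timeout
}

func main() {
	fmt.Println("can start one more:", canStart(2))
	fmt.Println("stuck:", isStuck(time.Now().Add(-10*time.Minute), 5*time.Minute))
}
```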
A second policy ("maximum start-up rate") allows only a maximum number of containers to become `starting` per given amount of time. For example: at most 2 containers can be started per second on each minion.
- Simple
- Doesn’t require a time-out mechanism
- Does not guarantee that:
  - the machine is never overloaded;
  - the resources of the machine are optimized (we don't want to uselessly throttle a container while the CPU is idle, there is no I/O, etc.).
A third policy ("resource monitoring") allows a container to become `starting` only if:
- the CPU consumption drops below a given threshold during a given amount of time
- the I/O usage drops below a given threshold during a given amount of time
- the load average of the machine drops below a given threshold
- Really takes into account the available resources of the machine to optimize the startup time without overloading it
- Implementation more complex
- If the machine is loaded by things other than `starting` containers (like `ready` containers, or even processes running on the machine that are not docker containers), it will prevent containers from starting.
The "resource monitoring" policy is the one that makes the best use of the resources, but it relies on resource consumption averaged over a given amount of time.
It should be combined with a maximum number of `starting` containers and a maximum start-up rate, in order not to start too many containers before the "average CPU for the last 10s", "average I/O for the last 10s" or "load average for 1 min" metrics increase.
If we limit the maximum number of `starting` containers, we must have a time-out mechanism that prevents containers from staying in `starting` for too long.
If the resource consumption doesn't drop when containers leave the `starting` state (either because the `ready` containers also consume resources, or because some resources are consumed by processes outside Kubernetes/OpenShift), it will prevent `pending` containers from starting forever. In order to avoid that, we need a minimum start-up rate that guarantees that we will eventually start all the containers.
- The only one that works in all cases?
- Complex
The throttling mechanism described in this section is about avoiding resource starvation. The resources are global to the machine. As a consequence, the settings cannot be at the POD level: they need to be at the minion level.
We could have a configuration file attached to minions, for example `m1_config.json`:
{
"kind": "MinionConfig",
"apiVersion": "v1beta1",
"throttling": {
"maxStartingContainers": "3 NbCores",
"maxLoadAvg": "1.5 NbCores",
"maxCPU": "80%",
"minRate": "0.1",
"maxRate": "10"
}
}
kubecfg -c m1_config.json update minions/192.168.10.1
- maxStartingContainers: Can be an absolute value or a factor multiplied by the number of cores of the machine
- maxLoadAvg: Can be an absolute value or a factor multiplied by the number of cores of the machine
- minRate: Whatever the other settings, we'll start at least one container every 10 s (a rate of 0.1 containers per second)
- maxRate: Whatever the other settings, we'll start at most 10 containers per second
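To make the combination concrete, here is a sketch of how a minion-level admission check could combine those settings (Go; the struct fields mirror the JSON above, rates are expressed as intervals between starts, and how CPU usage and load average are measured is left out of scope):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// MinionConfig mirrors the throttling block of the JSON above; the "3 NbCores"
// style factors are assumed to be resolved into absolute values at load time.
type MinionConfig struct {
	MaxStartingContainers int           // e.g. 3 * number of cores
	MaxLoadAvg            float64       // e.g. 1.5 * number of cores
	MaxCPU                float64       // e.g. 0.80
	MinInterval           time.Duration // minRate 0.1/s -> at least one start every 10s
	MaxInterval           time.Duration // maxRate 10/s  -> at most one start every 100ms
}

// minion holds the state the policy needs; how cpuUsage and loadAvg are
// measured (cgroup stats, /proc/loadavg, ...) is not shown here.
type minion struct {
	numStarting int
	cpuUsage    float64
	loadAvg     float64
	lastStart   time.Time
}

// mayStart combines the three policies: never start faster than maxRate,
// always honour minRate, and otherwise start only if the resource limits
// of the machine are respected.
func (m *minion) mayStart(cfg MinionConfig, now time.Time) bool {
	sinceLast := now.Sub(m.lastStart)
	if sinceLast < cfg.MaxInterval {
		return false // maxRate: don't start containers faster than this
	}
	if sinceLast >= cfg.MinInterval {
		return true // minRate: guarantee progress even if the machine looks busy
	}
	return m.numStarting < cfg.MaxStartingContainers &&
		m.cpuUsage < cfg.MaxCPU &&
		m.loadAvg < cfg.MaxLoadAvg
}

func main() {
	cores := runtime.NumCPU()
	cfg := MinionConfig{
		MaxStartingContainers: 3 * cores,
		MaxLoadAvg:            1.5 * float64(cores),
		MaxCPU:                0.80,
		MinInterval:           10 * time.Second,
		MaxInterval:           100 * time.Millisecond,
	}
	m := &minion{numStarting: 1, cpuUsage: 0.35, loadAvg: 0.8, lastStart: time.Now().Add(-2 * time.Second)}
	fmt.Println("may start:", m.mayStart(cfg, time.Now()))
}
```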
“A POD is a collocated group of containers […] which are tightly coupled — in a pre-container world, they would have executed on the same physical or virtual host.” (extract from the PODs definition)
As they are tightly coupled, there might be dependencies between containers. In a “pre-container world”, they would have been spawned on a host by an init system able to handle dependencies.
We propose to enhance Kubernetes to use a dependency graph inside PODs to decide which containers can be started and which containers must wait for others.
Let's consider, as an example, a POD with 5 containers linked together by the following dependencies: app_b depends on app_a; app_c and app_d depend on app_b; and app_e depends on both app_c and app_d.
Such a dependency graph could be described in the POD json like this:
{
"kind": "Pod",
"apiVersion": "v1beta1",
"id": "app",
"desiredState": {
"manifest": {
"version": "v1beta1",
"id": "app",
"containers": [{
"name": "app_a",
"image": "me/app_a",
"livenessProbe": {
"exec": {
"command": "/check_A_health"
}
}
},{
"name": "app_b",
"image": "me/app_b",
"livenessProbe": {
"exec": {
"command": "/check_B_health"
}
},
"dependsOn": [
"mag"
]
},{
"name": "app_c",
"image": "me/app_c",
"livenessProbe": {
"exec": {
"command": "/check_C_health"
}
},
"dependsOn": [
"app_b"
]
},{
"name": "app_d",
"image": "me/app_d",
"livenessProbe": {
"exec": {
"command": "/check_D_health"
}
},
"dependsOn": [
"app_b"
]
},{
"name": "app_e",
"image": "me/app_e",
"livenessProbe": {
"exec": {
"command": "/check_E_health"
}
},
"dependsOn": [
"app_c",
"app_d"
]
}]
}
},
"labels": {
"name": "app"
}
}
The POD start-up sequence is amended so that, when a POD is assigned to a minion, kubelet creates the following FSM for each container:
The containers are initially in the `downloading` state. Once the image is pulled, they move to the new `blocked` state.
When a container X reaches the `blocked` state, the states of all its dependencies are checked. If all of them are `ready`, the container X moves to `starting` immediately. This is, for example, always the case for containers which have no dependency.
Then, when a container X passes the `starting` to `ready` transition, for each container Yi in the `blocked` state that depends on X, we check the state of all the dependencies of Yi. If all of them are `ready`, then the Yi container becomes `starting`.
In the example above, when the app_c container becomes `ready`, the state of app_d is checked. If it is `ready` too, then the state of app_e moves from `blocked` to `starting`.
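A sketch of this unblocking step (Go; the map layout and the reuse of the container names from the example above are illustrative):

```go
package main

import "fmt"

type state int

const (
	blocked state = iota
	starting
	ready
)

type container struct {
	name      string
	state     state
	dependsOn []string
}

// onReady is called when container x passes the starting to ready transition:
// every blocked container of the POD is re-examined, and those whose
// dependencies are now all ready move on to starting.
func onReady(pod map[string]*container, x string) {
	pod[x].state = ready
	for _, c := range pod {
		if c.state != blocked {
			continue
		}
		allReady := true
		for _, dep := range c.dependsOn {
			if pod[dep].state != ready {
				allReady = false
				break
			}
		}
		if allReady {
			c.state = starting
			fmt.Println(c.name, "-> starting")
		}
	}
}

func main() {
	pod := map[string]*container{
		"app_c": {name: "app_c", state: starting},
		"app_d": {name: "app_d", state: ready},
		"app_e": {name: "app_e", state: blocked, dependsOn: []string{"app_c", "app_d"}},
	}
	onReady(pod, "app_c") // app_d is already ready, so app_e can now start
}
```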
We must ensure that the dependency graph is a DAG.
This could, for example, be checked when parsing the json if we enforce a rule saying that, in the `dependsOn` list of a container, we can only have containers which are declared above in the file.
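A sketch of that parse-time check: since every `dependsOn` entry must reference a container declared earlier in the manifest, every edge points backwards in the file and the graph cannot contain a cycle.

```go
package main

import "fmt"

type ContainerSpec struct {
	Name      string
	DependsOn []string
}

// validateDependencies enforces the rule described above: a container may only
// depend on containers declared before it in the manifest.
func validateDependencies(containers []ContainerSpec) error {
	declared := map[string]bool{}
	for _, c := range containers {
		for _, dep := range c.DependsOn {
			if !declared[dep] {
				return fmt.Errorf("container %q depends on %q, which is not declared before it", c.Name, dep)
			}
		}
		declared[c.Name] = true
	}
	return nil
}

func main() {
	containers := []ContainerSpec{
		{Name: "app_a"},
		{Name: "app_b", DependsOn: []string{"app_a"}},
		{Name: "app_c", DependsOn: []string{"app_b"}},
	}
	if err := validateDependencies(containers); err != nil {
		fmt.Println(err)
	}
}
```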
Let’s imagine a POD with 3 containers A, B and C. Those 3 containers can be started in any order in the sense that they won’t fail, crash or prematurely exit if the others are not there.
However, B and C need to communicate with A in order to become `ready`.
For example:
- A is a database. It notifies its readiness as soon as it is ready to process requests.
- B is an application that needs to connect to the database to configure itself. It notifies its readiness as soon as it is configured.
- C is similar to B.
The throttling settings limit the number of containers authorized to be in `starting` at the same time to 2.
We have no dependency expressed in the POD json.
If we are lucky, things can happen in this order:
- B starts. It cannot configure itself because A is not there. It is waiting for A.
- A starts.
- A is ready.
- As A becomes `ready`, one "starting" slot becomes available, so C can start.
- B connects to A, configures itself and eventually becomes `ready`.
- C connects to A, configures itself and becomes `ready`.
Note that even if this works, before A becomes `ready`, B "consumes" a `starting` slot although it is not consuming resources. It is sub-optimal: if we knew that B depends on A, we could have started a container of another POD which doesn't depend on A.
If we are unlucky, things can happen in this order:
- B and C are started. They are waiting to be able to connect to A.
- A is not started because we already have two processes in the `starting` state.
- We're dead-locked.
This trivial example shows that:
- limiting the number of processes starting at a time, and
- having process readiness conditioned by other processes' readiness
cannot be combined if we cannot enforce a start-up order. That's why, if the readiness of some containers depends on the readiness of others, this dependency needs to be handled as described in the previous section.
Here is the complete FSM of containers:
Some states have associated actions that are triggered when the state is entered:
- `downloading`: the image is fetched with a `docker pull`;
- `blocked`: the container can be created with `docker create` but not started;
- `starting`: the container is started with `docker start`.
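As an illustration of those entry actions (the docker CLI is used here for brevity; kubelet would of course talk to the docker daemon through its API):

```go
package main

import (
	"fmt"
	"os/exec"
)

// enterDownloading fetches the image when a container enters downloading.
func enterDownloading(image string) error {
	return exec.Command("docker", "pull", image).Run()
}

// enterBlocked creates the container but deliberately does not start it yet.
func enterBlocked(name, image string) error {
	return exec.Command("docker", "create", "--name", name, image).Run()
}

// enterStarting actually starts the previously created container.
func enterStarting(name string) error {
	return exec.Command("docker", "start", name).Run()
}

func main() {
	if err := enterDownloading("me/app_a"); err != nil {
		fmt.Println("pull failed:", err)
	}
}
```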
The transitions are defined as follows:
- When a POD is assigned to a minion, all its containers are put in the `downloading` state.
- When the image is pulled, the container becomes `blocked`.
- Immediately, the dependencies of that container are checked. If they are all in the `ready` state, then the container becomes `pending` immediately; otherwise, it stays in the `blocked` state.
- When there are containers in the `pending` state, the throttling mechanism regularly checks whether the conditions are met to take a `pending` container and make it progress to `starting`.
- LivenessProbes regularly check whether `starting` containers are `ready`. The first time a LivenessProbe reports a success, the container becomes `ready`.
- If a container remains `starting` for longer than a pre-defined threshold without its LivenessProbe reporting a success, the container becomes `failed`.
- Each time a container leaves the `starting` state, we check all the dependencies of the `blocked` containers that depend on it. If they are all `ready`, then that `blocked` container becomes `pending`.