diff --git a/docs/proposals/container-init.md b/docs/proposals/container-init.md
new file mode 100644
index 0000000000000..15aac1012e80a
--- /dev/null
+++ b/docs/proposals/container-init.md
@@ -0,0 +1,473 @@
+
+WARNING

PLEASE NOTE: This document applies to the HEAD of the source tree

+
+If you are using a released version of Kubernetes, you should
+refer to the docs that go with that version.
+
+Documentation for other releases can be found at
+[releases.k8s.io](http://releases.k8s.io).
+
+--
+
+# Pod initialization
+
+@smarterclayton
+
+March 2016
+
+## Proposal and Motivation
+
+Within a pod there is a need to initialize local data or adapt to the current
+cluster environment that is not easily achieved in the current container model.
+Containers start in parallel after volumes are mounted, leaving no opportunity
+for coordination between containers without specialization of the image. If
+two containers need to share common initialization data, both images must
+be altered to cooperate using filesystem or network semantics, which introduces
+coupling between images. Likewise, if an image requires configuration in order
+to start and that configuration is environment dependent, the image must be
+altered to add the necessary templating or retrieval.
+
+This proposal introduces the concept of an **init container**, one or more
+containers started in sequence before the pod's normal containers are started.
+These init containers may share volumes, perform network operations, and perform
+computation prior to the start of the remaining containers. They may also, by
+virtue of their sequencing, block or delay the startup of application containers
+until some precondition is met. In this document we refer to the existing pod
+containers as **app containers**.
+
+This proposal also provides a high level design of **volume containers**, which
+initialize a particular volume, as a feature that specializes some of the tasks
+defined for init containers.
+The init container design anticipates the existence
+of volume containers and highlights where they will require future work.
+
+## Design Points
+
+* Init containers should be able to:
+  * Perform initialization of shared volumes
+  * Download binaries that will be used in app containers as execution targets
+  * Inject configuration or extension capability to generic images at startup
+  * Perform complex templating of information available in the local environment
+  * Initialize a database by starting a temporary execution process and applying
+    schema info.
+  * Delay the startup of application containers until preconditions are met
+  * Register the pod with other components of the system
+* Reduce coupling:
+  * Between application images, eliminating the need to customize those images for
+    Kubernetes generally or specific roles
+  * Inside of images, by specializing which containers perform which tasks
+    (install git into init container, use filesystem contents
+    in web container)
+  * Between initialization steps, by supporting multiple sequential init containers
+* Init containers allow simple start preconditions to be implemented that are
+  decoupled from application code
+  * The order init containers start should be predictable and allow users to easily
+    reason about the startup of a container
+  * Complex ordering and failure will not be supported - all complex workflows can
+    if necessary be implemented inside of a single init container, and this proposal
+    aims to enable that ordering without adding undue complexity to the system.
+    Pods in general are not intended to support DAG workflows.
+* Both run-once and run-forever pods should be able to use init containers
+* As much as possible, an init container should behave like an app container
+  to reduce complexity for end users, for clients, and for divergent use cases.
+  An init container is a container with the minimum alterations to accomplish
+  its goal.
+* Volume containers should be able to:
+  * Perform initialization of a single volume
+  * Start in parallel
+  * Perform computation to initialize a volume, and delay start until that
+    volume is initialized successfully.
+  * Using a volume container that does not populate a volume to delay pod start
+    (in the absence of init containers) would be an abuse of the goal of volume
+    containers.
+* Container pre-start hooks are not sufficient for all initialization cases:
+  * They cannot easily coordinate complex conditions across containers
+  * They can only function with code in the image or code in a shared volume,
+    which would have to be statically linked (not a common pattern in wide use)
+  * They cannot be implemented with the current Docker implementation - see
+    [#140](https://github.com/kubernetes/kubernetes/issues/140)
+
+
+## Alternatives
+
+* Any mechanism that runs user code on a node before regular pod containers
+  should itself be a container and modeled as such - we explicitly reject
+  creating new mechanisms for running user processes.
+* The container pre-start hook (not yet implemented) requires execution within
+  the container's image and so cannot adapt existing images. It also cannot
+  block startup of containers
+* Running a "pre-pod" would defeat the purpose of the pod being an atomic
+  unit of scheduling.
+
+
+## Design
+
+Each pod may have 0..N init containers defined along with the existing
+1..M app containers.
+
+On startup of the pod, after the network and volumes are initialized, the
+init containers are started in order. Each container must exit successfully
+before the next is invoked. If a container fails to start (due to the runtime)
+or exits with failure, it is retried according to the pod RestartPolicy.
+RestartPolicyNever pods will immediately fail and exit. RestartPolicyAlways
+pods will retry the failing init container with increasing backoff until it
+succeeds.
+To align with the design of application containers, init containers
+will only support "infinite retries" (RestartPolicyAlways) or "no retries"
+(RestartPolicyNever).
+
+A pod cannot be ready until all init containers have succeeded. The ports
+on an init container are not aggregated under a service. A pod that is
+being initialized is in the `Pending` phase but should have a distinct
+condition. Each app container and all future init containers should have
+the reason `PodInitializing`. The pod should have a condition `Initializing`
+set to `false` until all init containers have succeeded, and `true` thereafter.
+If the pod is restarted, the `Initializing` condition should be set to `false`.
+
+If the pod is "restarted" (all containers stopped and started due to
+a node restart, a change to the pod definition, or admin interaction), all
+init containers must execute again. Restartable conditions are defined as:
+
+* An init container image is changed
+* The pod infrastructure container is restarted (shared namespaces are lost)
+* The Kubelet detects that all containers in a pod are terminated AND
+  no record of init container completion is available on disk (due to GC)
+
+Changes to the init container spec are limited to the container image field.
+Altering the container image field is equivalent to restarting the pod.
+
+Because init containers can be restarted, retried, or reexecuted, container
+authors should make their init behavior idempotent by handling volumes that
+are already populated or the possibility that this instance of the pod has
+already contacted a remote system.
+
+Each init container has all of the fields of an app container. The following
+fields are prohibited from being used on init containers by validation:
+
+* `readinessProbe` - init containers must exit for pod startup to continue,
+  are not included in rotation, and so cannot define readiness distinct from
+  completion.
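
As a rough illustration of the startup sequence described earlier in this section, the following Python sketch simulates running init containers in order under the two supported restart policies. The function name `run_init_containers`, the `(name, run)` tuple shape, and the `max_attempts` cap are assumptions made for the example; they are not part of the proposed API, and a real implementation would wait with increasing backoff between retries rather than loop immediately.

```python
from typing import Callable, List, Tuple


def run_init_containers(
    init_containers: List[Tuple[str, Callable[[], bool]]],
    restart_policy: str,
    max_attempts: int = 5,
) -> Tuple[bool, List[str]]:
    """Run each init container in order; return (initialized, event log).

    Each entry is (name, run), where run() returns True on a successful exit.
    max_attempts is a simulation-only cap (hypothetical); the proposal
    retries indefinitely under RestartPolicyAlways.
    """
    log: List[str] = []
    for name, run in init_containers:
        attempts = 0
        while True:
            attempts += 1
            if run():
                log.append(f"{name}: succeeded on attempt {attempts}")
                break  # only now may the next init container start
            log.append(f"{name}: failed on attempt {attempts}")
            if restart_policy == "Never" or attempts >= max_attempts:
                return False, log  # pod fails; app containers never start
            # A real retry would sleep with increasing backoff here.
    return True, log


if __name__ == "__main__":
    flaky = iter([False, True])  # fails once, then succeeds
    print(run_init_containers(
        [("init-container1", lambda: True),
         ("init-container2", lambda: next(flaky))],
        "Always",
    ))
```

App containers would be started only when the first element of the returned tuple is `True`.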
+
+Init container authors may use `activeDeadlineSeconds` on the pod and
+`livenessProbe` on the container to prevent init containers from failing
+forever. The active deadline includes init containers.
+
+Because init containers are semantically different in lifecycle from app
+containers (they are run serially, rather than in parallel), for backwards
+compatibility and design clarity they will be identified as distinct fields
+in the API:
+
+    pod:
+      spec:
+        containers: ...
+        initContainers:
+        - name: init-container1
+          image: ...
+          ...
+        - name: init-container2
+        ...
+      status:
+        containerStatuses: ...
+        initContainerStatuses:
+        - name: init-container1
+          ...
+        - name: init-container2
+        ...
+
+This separation also serves to make the order of container initialization
+clear - init containers are executed in the order that they appear, then all
+app containers are started at once.
+
+The name of each app and init container in a pod must be unique - it is a
+validation error for any container to share a name.
+
+While init containers are in alpha state, they will be serialized as an annotation
+on the pod with the name `pod.alpha.kubernetes.io/init-containers` and the status
+of the containers will be stored as `pod.alpha.kubernetes.io/init-container-statuses`.
+Mutation of these annotations is prohibited on existing pods.
+
+
+### Resources
+
+Given the ordering and execution for init containers, the following rules
+for resource usage apply:
+
+* The highest of any particular resource request or limit defined on all init
+  containers is the **effective init request/limit**
+* The pod's **effective request/limit** for a resource is the higher of:
+  * sum of all app containers request/limit for a resource
+  * effective init request/limit for a resource
+* Scheduling is done based on effective requests/limits, which means
+  init containers can reserve resources for initialization that are not used
+  during the life of the pod.
+* The lowest QoS tier of init containers per resource is the **effective init QoS tier**,
+  and the highest QoS tier of both init containers and regular containers is the
+  **effective pod QoS tier**.
+
+So the following pod:
+
+    pod:
+      spec:
+        initContainers:
+        - limits:
+            cpu: 100m
+            memory: 1GiB
+        - limits:
+            cpu: 50m
+            memory: 2GiB
+        containers:
+        - limits:
+            cpu: 10m
+            memory: 1100MiB
+        - limits:
+            cpu: 10m
+            memory: 1100MiB
+
+has an effective pod limit of `cpu: 100m`, `memory: 2200MiB` (highest init
+container cpu is larger than sum of all app containers, sum of container
+memory is larger than the max of all init containers). The scheduler, node,
+and quota must respect the effective pod request/limit.
+
+In the absence of a defined request or limit on a container, the effective
+request/limit will be applied. For example, the following pod:
+
+    pod:
+      spec:
+        initContainers:
+        - limits:
+            cpu: 100m
+            memory: 1GiB
+        containers:
+        - request:
+            cpu: 10m
+            memory: 1100MiB
+
+will have an effective request of `10m / 1100MiB`, and an effective limit
+of `100m / 1GiB`, i.e.:
+
+    pod:
+      spec:
+        initContainers:
+        - request:
+            cpu: 10m
+            memory: 1GiB
+        - limits:
+            cpu: 100m
+            memory: 1100MiB
+        containers:
+        - request:
+            cpu: 10m
+            memory: 1GiB
+        - limits:
+            cpu: 100m
+            memory: 1100MiB
+
+and thus have the QoS tier **Burstable** (because request is not equal to
+limit).
+
+Quota and limits will be applied based on the effective pod request and
+limit.
+
+Pod level cGroups will be based on the effective pod request and limit, the
+same as the scheduler.
+
+
+### Kubelet and container runtime details
+
+Container runtimes should treat the set of init and app containers as one
+large pool. An individual init container execution should be identical to
+an app container, including all standard container environment setup
+(network, namespaces, hostnames, DNS, etc).
+
+All app container operations are permitted on init containers.
+The logs for an init container should be available for the duration of the pod
+lifetime or until the pod is restarted.
+
+During initialization, app container status should be shown with the reason
+PodInitializing if any init containers are present. Each init container
+should show appropriate container status, and all init containers that are
+waiting for earlier init containers to finish should have the `reason`
+PendingInitialization.
+
+The container runtime should aggressively prune failed init containers.
+The container runtime should record whether all init containers have
+succeeded internally, and only invoke new init containers if a pod
+restart is needed (for Docker, if all containers terminate or if the pod
+infra container terminates). Init containers should follow backoff rules
+as necessary. The Kubelet *must* preserve at least the most recent instance
+of an init container to serve logs and data for end users and to track
+failure states. The Kubelet *should* prefer to garbage collect completed
+init containers over app containers, as long as the Kubelet is able to
+track that initialization has been completed. In the future, container
+state checkpointing in the Kubelet may remove or reduce the need to
+preserve old init containers.
+
+For the initial implementation, the Kubelet will use the last termination
+container state of the highest indexed init container to determine whether
+the pod has completed initialization. During a pod restart, initialization
+will be restarted from the beginning (all initializers will be rerun).
+
+
+### API Behavior
+
+All APIs that access containers by name should operate on both init and
+app containers. Because names are unique the addition of the init container
+should be transparent to use cases.
+
+A client with no knowledge of init containers should see appropriate
+container status `reason` and `message` fields while the pod is in the
+`Pending` phase, and so be able to communicate that to end users.
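
As a worked check of the rules in the Resources section, the sketch below computes the effective pod value for one resource: the larger of the highest init container value and the sum of the app container values. The helper name `effective_value` and the choice of plain integers (millicores for CPU, MiB for memory, with 1GiB taken as 1024MiB) are assumptions made for this example and are not part of the proposal.

```python
from typing import Iterable


def effective_value(init_values: Iterable[int], app_values: Iterable[int]) -> int:
    """Return the effective pod request or limit for a single resource.

    Init containers run one at a time, so only the largest value matters;
    app containers run together, so their values are summed.
    """
    effective_init = max(init_values, default=0)
    return max(effective_init, sum(app_values))


# Limits from the first example pod in the Resources section.
cpu_limit = effective_value([100, 50], [10, 10])            # millicores
memory_limit = effective_value([1024, 2048], [1100, 1100])  # MiB

if __name__ == "__main__":
    print(cpu_limit, memory_limit)  # 100 2200
```

The result agrees with the `cpu: 100m`, `memory: 2200MiB` effective limit stated for that pod.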
+
+
+### Example init containers
+
+* Wait for a service to be created
+
+      pod:
+        spec:
+          initContainers:
+          - name: wait
+            image: centos:centos7
+            command: ["/bin/sh", "-c", "for i in {1..100}; do sleep 1; if dig myservice; then exit 0; fi; done; exit 1"]
+          containers:
+          - name: run
+            image: application-image
+            command: ["/my_application_that_depends_on_myservice"]
+
+* Register this pod with a remote server
+
+      pod:
+        spec:
+          initContainers:
+          - name: register
+            image: centos:centos7
+            command: ["/bin/sh", "-c", "curl -X POST http://$MANAGEMENT_SERVICE_HOST:$MANAGEMENT_SERVICE_PORT/register -d 'instance=$(POD_NAME)&ip=$(POD_IP)'"]
+            env:
+            - name: POD_NAME
+              valueFrom:
+                field: metadata.name
+            - name: POD_IP
+              valueFrom:
+                field: status.podIP
+          containers:
+          - name: run
+            image: application-image
+            command: ["/my_application_that_depends_on_myservice"]
+
+* Wait for an arbitrary period of time
+
+      pod:
+        spec:
+          initContainers:
+          - name: wait
+            image: centos:centos7
+            command: ["/bin/sh", "-c", "sleep 60"]
+          containers:
+          - name: run
+            image: application-image
+            command: ["/static_binary_without_sleep"]
+
+* Clone a git repository into a volume (can be implemented by volume containers in the future):
+
+      pod:
+        spec:
+          initContainers:
+          - name: download
+            image: image-with-git
+            command: ["git", "clone", "https://github.com/myrepo/myrepo.git", "/var/lib/data"]
+            volumeMounts:
+            - mountPath: /var/lib/data
+              volumeName: git
+          containers:
+          - name: run
+            image: centos:centos7
+            command: ["/var/lib/data/binary"]
+            volumeMounts:
+            - mountPath: /var/lib/data
+              volumeName: git
+          volumes:
+          - emptyDir: {}
+            name: git
+
+* Execute a template transformation based on environment (can be implemented by volume containers in the future):
+
+      pod:
+        spec:
+          initContainers:
+          - name: copy
+            image: application-image
+            command: ["/bin/cp", "mytemplate.j2", "/var/lib/data/"]
+            volumeMounts:
+            - mountPath: /var/lib/data
+              volumeName: data
+          - name: transform
+            image: image-with-jinja
+            command: ["/bin/sh", "-c", "jinja /var/lib/data/mytemplate.j2 > /var/lib/data/mytemplate.conf"]
+            volumeMounts:
+            - mountPath: /var/lib/data
+              volumeName: data
+          containers:
+          - name: run
+            image: application-image
+            command: ["/myapplication", "-conf", "/var/lib/data/mytemplate.conf"]
+            volumeMounts:
+            - mountPath: /var/lib/data
+              volumeName: data
+          volumes:
+          - emptyDir: {}
+            name: data
+
+* Perform a container build
+
+      pod:
+        spec:
+          initContainers:
+          - name: copy
+            image: base-image
+            workingDir: /home/user/source-tree
+            command: ["make"]
+          containers:
+          - name: commit
+            image: image-with-docker
+            command:
+            - /bin/sh
+            - -c
+            - docker commit $(complex_bash_to_get_container_id_of_copy) \
+              docker push $(commit_id) myrepo:latest
+            volumeMounts:
+            - mountPath: /var/run/docker.sock
+              volumeName: dockersocket
+
+## Backwards compatibility implications
+
+Since this is a net new feature in the API and Kubelet, new API servers during upgrade may not
+be able to rely on Kubelets implementing init containers. The management of feature skew between
+master and Kubelet is tracked in issue [#4855](https://github.com/kubernetes/kubernetes/issues/4855).
+
+
+## Future work
+
+* Unify pod QoS class with init containers
+* Implement container / image volumes to make composition of runtime from images efficient