- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Implementation History
- Alternatives
- kubernetes/enhancements issue in release milestone, which links to KEP (this should be a link to the KEP location in kubernetes/enhancements, not the initial KEP PR)
- KEP approvers have set the KEP status to `implementable`
- Design details are appropriately documented
- Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- Graduation criteria is in place
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
We will add support for configuring pod-shared process namespaces by adding a
new boolean field `ShareProcessNamespace` to the pod spec. The default of `false`
means that each container will have a separate process namespace. When set to
`true`, all containers in the pod will share a single process namespace.
The Container Runtime Interface (CRI) will be updated to support three namespace modes: Container, Pod & Node. The Runtime Manager will translate the pod spec into one of these modes as follows:
Pod shareProcessNamespace | Pod hostPID | CRI PID Mode |
---|---|---|
false | false | Container |
false | true | Node |
true | false | Pod |
true | true | Error |
If a runtime does not implement a particular PID mode, it must return an error. For reference, Docker will support all three modes when using version >= 1.13.1.
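The translation table above can be sketched as a small Go function. This is only an illustration, not the Runtime Manager's actual code: the type `NamespaceMode`, the helper `pidModeFor`, and the error value are hypothetical names chosen for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode is a hypothetical stand-in for the three CRI PID modes.
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
)

// String returns the mode name as it appears in the table.
func (m NamespaceMode) String() string {
	return [...]string{"Pod", "Container", "Node"}[m]
}

// errBothSet is an illustrative error for the "true | true" row.
var errBothSet = errors.New("shareProcessNamespace and hostPID cannot both be set")

// pidModeFor maps the two pod-level booleans to a single PID mode,
// following the table row by row. The returned mode is meaningless
// when the error is non-nil.
func pidModeFor(shareProcessNamespace, hostPID bool) (NamespaceMode, error) {
	switch {
	case shareProcessNamespace && hostPID:
		return 0, errBothSet
	case hostPID:
		return Node, nil
	case shareProcessNamespace:
		return Pod, nil
	default:
		return Container, nil
	}
}

func main() {
	cases := []struct{ share, host bool }{
		{false, false}, {false, true}, {true, false}, {true, true},
	}
	for _, c := range cases {
		mode, err := pidModeFor(c.share, c.host)
		if err != nil {
			fmt.Printf("share=%t hostPID=%t -> error: %v\n", c.share, c.host, err)
			continue
		}
		fmt.Printf("share=%t hostPID=%t -> %s\n", c.share, c.host, mode)
	}
}
```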
The shared PID functionality will be hidden behind a new feature gate in both
the API server and the kubelet, and the existing `--docker-disable-shared-pid`
flag will be removed from the kubelet, subject to the deprecation policy.
Pods share namespaces where possible, but support for sharing the PID namespace had not been defined due to lack of support in Docker. This created an implicit API on which certain container images now rely. This document proposes adding support for sharing a process namespace between containers in a pod while maintaining backwards compatibility with the existing implicit API.
Goals:
- Backwards compatibility with container images expecting `pid == 1` semantics
- Per-pod configuration of PID namespace sharing
- Ability to change default sharing behavior in `v2.Pod`

Non-goals:
- Creating a general purpose container init solution
- Multiple shared PID namespaces per pod
- Per-container configuration of PID namespace sharing
Sharing a PID namespace between containers in a pod is discussed in #1615 and enables:
- signaling between containers, which is useful for side cars (e.g. for signaling a daemon process after rotating logs).
- easier troubleshooting of pods.
- addressing Docker's zombie problem by reaping orphaned zombies in the infra container.
`v1.PodSpec` gains a new field named `ShareProcessNamespace`:
// PodSpec is a description of a pod.
type PodSpec struct {
...
// Use the host's pid namespace.
// Note that HostPID and ShareProcessNamespace cannot both be set.
// Optional: Default to false.
// +k8s:conversion-gen=false
// +optional
HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
// Share a single process namespace between all of the containers in a pod.
// Note that HostPID and ShareProcessNamespace cannot both be set.
// Optional: Default to false.
// +k8s:conversion-gen=false
// +optional
ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"`
...
}
The field name deviates from that of `HostPID` in an attempt to better signal
the consequences of setting the option. Setting both `ShareProcessNamespace`
and `HostPID` will cause a validation error.
Namespace options in the CRI are currently specified for both `PodSandbox` and
`Container` creation requests via booleans in `NamespaceOption`:
message NamespaceOption {
// If set, use the host's network namespace.
bool host_network = 1;
// If set, use the host's PID namespace.
bool host_pid = 2;
// If set, use the host's IPC namespace.
bool host_ipc = 3;
}
We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the
existing namespace options:
enum NamespaceMode {
POD = 0;
CONTAINER = 1;
NODE = 2;
}
// NamespaceOption provides options for Linux namespaces.
message NamespaceOption {
// Network namespace for this container/sandbox.
// Runtimes must support: POD, NODE
NamespaceMode network = 1;
// PID namespace for this container/sandbox.
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
// Runtimes must support: POD, CONTAINER, NODE
NamespaceMode pid = 2;
// IPC namespace for this container/sandbox.
// Runtimes must support: POD, NODE
NamespaceMode ipc = 3;
}
Note that this breaks backwards compatibility in the CRI, which is still in alpha.
The protocol default for a namespace is `POD` because that's the default for
network and IPC, and we will consider making it the default for PID in `v2.Pod`.
The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so
that the default behavior of `v1.Pod` does not change.
This CRI design allows different namespace configuration for each of the
containers in the pod and the sandbox, but currently we have no plans to support
this in the Kubernetes API. The kubelet will translate namespace booleans from
`v1.PodSpec` into a single `NamespaceMode` to be used for the sandbox and all
regular and init containers in a pod.
Though we don't intend to support this in general pod configuration, there is a use case for mixed process namespaces within a single pod. Ephemeral Containers allow inserting an ephemeral Debug Container into an existing, running pod. For this to be useful, the new container performing the debugging must share a process namespace with its existing target container within the pod.
This is done with the additional `NamespaceMode` value `TARGET` and the field `target_id`:
enum NamespaceMode {
POD = 0;
CONTAINER = 1;
NODE = 2;
TARGET = 3;
}
// NamespaceOption provides options for Linux namespaces.
message NamespaceOption {
// Network namespace for this container/sandbox.
// Runtimes must support: POD, NODE
NamespaceMode network = 1;
// PID namespace for this container/sandbox.
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
// Runtimes must support: POD, CONTAINER, NODE, TARGET
NamespaceMode pid = 2;
// IPC namespace for this container/sandbox.
// Runtimes must support: POD, NODE
NamespaceMode ipc = 3;
// Target Container ID for NamespaceMode of TARGET. This container must be in the
// same pod as the target container.
string target_id = 4;
}
When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new
container in the namespace used by the container ID in `target_id`. If the
target container has `NamespaceOption.pid` set to `POD`, then the new container
should also use the pod namespace. If the target container has an isolated
process namespace, then the new container will join only that container's
namespace. Examples are provided for dockershim below.
There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It
exists for the kubelet to run automation or debugging from a container image in
the namespace of an existing pod and container. Additionally, we choose to
explicitly not support sharing namespaces between different pods. The kubelet
must not generate such a reference, and the runtime should not accept it. That
is, for pod{`Container A`, `InitContainer B`, `Sandbox S`} and any other
unrelated `Container C`:
Valid target_id values for TARGET | Invalid target_id values for TARGET |
---|---|
containerID(A) | sandboxID(S) |
containerID(B) | containerID(C) |
Note that targeting init containers is allowed and has no special handling. The result of targeting a container which is not running is left to the discretion of the CRI implementer. For PID namespace targeting with Docker this is an error, but it may be allowed by other runtimes.
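The rules for `target_id` can be sketched as a check a runtime might perform before creating a container in `TARGET` mode. Everything here is hypothetical: the `podSketch` type, the `validateTargetID` helper, and the error values are names invented for illustration, and the sketch deliberately ignores whether the target is running, since that is left to the CRI implementer.

```go
package main

import (
	"errors"
	"fmt"
)

// podSketch is a hypothetical view of the IDs a runtime knows for one pod.
type podSketch struct {
	SandboxID        string
	ContainerIDs     []string // regular containers
	InitContainerIDs []string // init containers, targetable without special handling
}

var (
	errTargetIsSandbox = errors.New("target_id refers to the pod sandbox")
	errTargetNotInPod  = errors.New("target_id does not belong to this pod")
)

// validateTargetID accepts regular and init containers of the same pod and
// rejects the sandbox as well as any ID that is not part of the pod.
func validateTargetID(p podSketch, targetID string) error {
	if targetID == p.SandboxID {
		return errTargetIsSandbox
	}
	for _, id := range p.ContainerIDs {
		if id == targetID {
			return nil
		}
	}
	for _, id := range p.InitContainerIDs {
		if id == targetID {
			return nil
		}
	}
	return errTargetNotInPod
}

func main() {
	pod := podSketch{
		SandboxID:        "S",
		ContainerIDs:     []string{"A"},
		InitContainerIDs: []string{"B"},
	}
	for _, id := range []string{"A", "B", "S", "C"} {
		fmt.Printf("target_id=%s -> %v\n", id, validateTargetID(pod, id))
	}
}
```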
The Docker runtime implements the pod sandbox as a container running the pause
container image. When configured for `POD` namespace sharing, the PID namespace
of the sandbox will become the single PID namespace for the pod. This means that
the `POD` and `CONTAINER` namespace modes are equivalent for the sandbox. The
mapping of the sandbox's PID mode to Docker's `HostConfig.PidMode` is (`v1.Pod`
settings provided as reference):
ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode |
---|---|---|---|
false | false | CONTAINER | unset |
true | false | POD | unset |
false | true | NODE | "host" |
- | - | TARGET | Error |
For containers, `HostConfig.PidMode` will be set as follows:
ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode |
---|---|---|---|
false | false | CONTAINER | unset |
true | false | POD | "container:[sandbox-container-id]" |
false | true | NODE | "host" |
false | false | TARGET | "container:[target-container-id]" |
true | false | TARGET | "container:[sandbox-container-id]" |
false | true | TARGET | "host" |
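The container table above can be read as a function from the requested PID mode (and, for `TARGET`, the mode of the target container) to a `HostConfig.PidMode` string. The following is a sketch under that reading, not dockershim's implementation: `NamespaceMode`, `containerPidMode`, and its parameters are hypothetical, and rejecting a target that is itself in `TARGET` mode is an assumption, since the table does not cover that case.

```go
package main

import "fmt"

// NamespaceMode is a hypothetical stand-in for the CRI enumeration.
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
	Target
)

// containerPidMode returns the HostConfig.PidMode value for a container.
// An empty string means the field is left unset. targetMode is only
// consulted when mode is Target.
func containerPidMode(mode, targetMode NamespaceMode, sandboxID, targetID string) (string, error) {
	switch mode {
	case Container:
		return "", nil
	case Pod:
		return "container:" + sandboxID, nil
	case Node:
		return "host", nil
	case Target:
		switch targetMode {
		case Container:
			// The target has an isolated namespace: join only that container.
			return "container:" + targetID, nil
		case Pod:
			// The target shares the pod namespace, which the sandbox owns.
			return "container:" + sandboxID, nil
		case Node:
			return "host", nil
		}
		// Assumption: a target in any other mode is rejected.
		return "", fmt.Errorf("unsupported target PID mode %d", targetMode)
	}
	return "", fmt.Errorf("unknown PID mode %d", mode)
}

func main() {
	for _, targetMode := range []NamespaceMode{Container, Pod, Node} {
		got, err := containerPidMode(Target, targetMode, "sandbox-id", "target-id")
		fmt.Printf("TARGET with target mode %d -> %q, err=%v\n", targetMode, got, err)
	}
}
```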
If the Docker runtime version does not support sharing PID namespaces, a
`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return
an error.
Docker's zombie problem mentioned above also theoretically applies to containers targeting another container's isolated PID namespace. Interestingly, this does not occur in practice. At least in Docker version 19.03, the zombie of the main process of the targeting container is cleaned up.
Either way, this does not introduce a new problem: users of isolated namespaces
without an init process will already be enjoying zombies created by other
debugging methods such as `kubectl exec`. Users are encouraged to enable
pod-level process namespace sharing.
SIG Node did not anticipate the strong objections to migrating from isolated to shared process namespaces for Docker. The previous (now abandoned) migration plan introduced a kubelet flag to toggle the shared namespace behavior, but objections did not materialize until the flag had moved from experimental to GA.
The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use
of shared process namespaces for the Docker runtime. We will immediately mark it
as deprecated, but according to the deprecation policy we must support it for
6 months.
We must provide a transition path for users setting this kubelet flag to false.
Setting this flag asserts a desire to override the default Kubernetes behavior
for all pods. Until the flag is removed, the kubelet will honor this assertion
by ignoring the value of `ShareProcessNamespace` and logging a warning to the
event log.
Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's a significant departure from the traditional behavior of Docker. This may break container images and development patterns that have come to rely on process isolation. Notably:
- The main container process no longer has PID 1. It cannot be signalled using `kill 1`, and attempting to do so will instead signal the infrastructure container and potentially restart the pod. Containers shipping an init system like systemd may require additional flags.
- Processes are visible to other containers in the pod. This includes all information visible in `/proc`, such as passwords as arguments or environment variables, and process signalling. This can be somewhat mitigated by running processes as separate, non-root users.
- Container filesystems are visible to other containers in the pod through the `/proc/$pid/root` magic symlink. This makes debugging easier, but it also means that secrets are protected only by standard filesystem permissions.
Enhancement #127 proposes node-level user namespace remapping. The CRI changes
proposed in the current proposal are compatible with the Container Runtime
Interface Changes described above. The implementers of the user namespace
should decide whether to support a `TARGET` mode. A non-goal of the current
proposal is "support pod/container level user namespace isolation", so it's
likely unnecessary.
This feature launched with test coverage in node-e2e.
At the time this KEP was written, the feature was already in beta.
- 3 articles describing using `shareProcessNamespace` in a task.
- `shareProcessNamespace` referenced in documentation for a Kubernetes cloud provider (as proxy indicator for feature enabled on cloud provider).
- Feature spends at least 2 releases in beta.
- No open bug reports for latest release.
- 2016-12-21: Original (pre-KEP) proposal created
- 1.10: Feature released in Alpha.
- 1.12: Feature released in Beta.
- 2019-09-20: Ported original proposal to KEP.
- 1.17: Feature generally available.
Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string
that explicitly targets a container or sandbox ID:
// NamespaceOption provides options for Linux namespaces.
message NamespaceOption {
...
// ID of Sandbox or Container to use for PID namespace, or "host"
string pid = 2;
...
}
This removes the need for a separate `TARGET` mode, but a mode enumeration
better captures the intent of the option.
Other Kubernetes runtimes already share a single PID namespace between containers in a pod. We could easily change the Docker runtime to always share a PID namespace when supported by the installed Docker version, but this would cause problems for container images that assume they will always be PID 1.
Rather than adding support to the API for configuring namespaces, we could allow changing the default behavior with pod annotations, with the intention of removing support for isolated PID namespaces in v2.Pod. Many members of the community want to use the isolated namespaces as a security boundary between containers in a pod, however.