- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
For Linux containers, the Kubelet instructs container runtimes to mask and set as read-only certain paths in /proc.
This is to prevent data from being exposed into a container that should not be.
However, there are certain use-cases where it is necessary to turn this off.
This KEP proposes adding a field to the Pod security context to allow bypassing the usual restrictions.
In 1.12, this was introduced as the ProcMountType feature gate, and it has languished in alpha ever since. This KEP is a successor to (and heavily based on) kubernetes/community#1934, updated for the modern era.
Some end users would like to run unprivileged containers nested inside a Kubernetes container using user namespaces. The outer container is started by the CRI implementation.
Kubernetes defaults to masking the /proc mount of a container, setting some paths as read-only. To run a nested container within an unprivileged Pod, a user would need a way to override that default masking behavior.
Please see the following filed issues for more information:
- Allow users to opt out of the CRI masking /proc for Linux containers.
Add a new string named procMount to the securityContext definition for choosing from a set of proc mount isolation mode options.
The default for procMount is Default, which instructs the container runtime to mask the aforementioned paths.
This will look like the following in the spec:
type ProcMountType string

const (
	// DefaultProcMount uses the container runtime default ProcType. Most
	// container runtimes mask certain paths in /proc to avoid accidental security
	// exposure of special devices or information.
	DefaultProcMount ProcMountType = "Default"

	// UnmaskedProcMount bypasses the default masking behavior of the container
	// runtime and ensures the newly created /proc in the container stays intact
	// with no modifications.
	UnmaskedProcMount ProcMountType = "Unmasked"
)
procMount *ProcMountType
where nil is default, and is interpreted as "Default" ProcMountType.
When the kubelet is presented with a pod whose ProcMountType is Unmasked, it empties the default list of masked paths that it passes down in the CRI request.
This requires changes to the CRI runtime integrations so that kubelet will add the specific unmasked option.
This was done after alpha:
- CRI-O has support in v1.25.0 after https://github.com/cri-o/cri-o/pull/6025/commits/4102586132214263c5d0ae93ec257432653ab82b
- containerd has support in 1.6. See https://github.com/containerd/containerd/pull/5070/commits/07f1df4541d6a81c205d194f4f6ea3a6a95c3e29
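The kubelet behavior described above can be sketched roughly as follows. This is a minimal illustration, not the kubelet's actual code: the helper name `maskedPathsFor` and the three-entry `defaultMaskedPaths` slice are hypothetical, and the real default list is maintained by the kubelet and the container runtime.

```go
package main

import "fmt"

// ProcMountType mirrors the type proposed in this KEP.
type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// defaultMaskedPaths is an illustrative subset only (assumption); the
// authoritative list lives in the kubelet and the container runtime.
var defaultMaskedPaths = []string{"/proc/kcore", "/proc/keys", "/proc/timer_list"}

// maskedPathsFor is a hypothetical helper that returns the masked-path list
// to send in the CRI request for a given procMount value.
func maskedPathsFor(pm *ProcMountType) []string {
	if pm != nil && *pm == UnmaskedProcMount {
		// Unmasked: send an empty list so the runtime leaves /proc untouched.
		return []string{}
	}
	// nil is interpreted as "Default": keep the default masking.
	return defaultMaskedPaths
}

func main() {
	unmasked := UnmaskedProcMount
	fmt.Println("default:", maskedPathsFor(nil))
	fmt.Println("unmasked:", maskedPathsFor(&unmasked))
}
```

Returning an explicitly empty slice (rather than the default list) is what lets the runtime distinguish "mask nothing" from "use your defaults" in this sketch.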
The main use case for unmasking paths in /proc is a user nesting unprivileged containers within a container. However, having an Unmasked ProcMountType is a privileged operation, and thus is part of the privileged Pod Security Admission (PSA) policy. Since a user must be in the privileged policy, they are also trusted to choose the correct user ID and run a workload that won't interfere with the host.
A container running as the root user on the host with an unmasked /proc could write to the host /proc, and thus this privileged designation is appropriate.
As a cluster admin, I would like a way to nest containers within containers. To do so, the top-level containers need an unmasked /proc.
As a Kubernetes user, I may want to build containers from within a Kubernetes container. See this article for more information.
- A user turning this on without user namespaces enabled
- Admission should deny a pod that tries to use ProcMountType: Unmasked with HostUsers: true
- More trust in user namespacing/the kernel instead of container runtime
- This is probably the correct direction to head in.
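The admission rule in the first mitigation above could look something like the sketch below. The function name `validateProcMount` and the error text are hypothetical, and treating an unset `hostUsers` as `true` is an assumption about how the Pod API defaults that field.

```go
package main

import (
	"errors"
	"fmt"
)

// ProcMountType mirrors the type proposed in this KEP.
type ProcMountType string

const (
	DefaultProcMount  ProcMountType = "Default"
	UnmaskedProcMount ProcMountType = "Unmasked"
)

// validateProcMount is a hypothetical check: it rejects Unmasked unless the
// pod opts into a pod-level user namespace (hostUsers: false).
func validateProcMount(pm *ProcMountType, hostUsers *bool) error {
	if pm == nil || *pm != UnmaskedProcMount {
		return nil // nil and "Default" are always allowed
	}
	// Assumption: an unset hostUsers means the pod shares the host user namespace.
	if hostUsers == nil || *hostUsers {
		return errors.New("procMount Unmasked requires hostUsers: false")
	}
	return nil
}

func main() {
	unmasked := UnmaskedProcMount
	no := false
	fmt.Println(validateProcMount(&unmasked, nil)) // rejected
	fmt.Println(validateProcMount(&unmasked, &no)) // accepted
}
```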
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- pkg/securitycontext: 10-05-2023 - 70.04
- N/A (Kubelet barely defines integration tests today, focusing on e2e_node tests instead)
- test/e2e_node
- Additional tests should be added to the e2e_node suite to test adherence to the ProcMount field
- Test default behavior actually masks /proc paths.
- Test Unmasked behavior is not masking /proc paths.
- Test PSA integration (if possible to test in e2e)
- Test that a Windows pod cannot be scheduled when a ProcMount value is specified
- Feature implemented behind a feature flag
- Add e2e tests for the feature (must be done before beta)
- Including ones for enabling/disabling the feature
- Explicitly require hostUsers option to be false if this option is enabled.
- Otherwise, this option effectively becomes another "privileged" field
- Allowing time for feedback
Turn off the feature gate to turn off the feature.
The feature gate is only processed by the API server--Kubelet has no awareness of it. API server will scrub the ProcMount field from the request if it doesn't support the feature gate. Since all supported Kubelet versions support ProcMountType field, there's no version skew worry. API server can have the feature gate toggled without worrying about doing the same for Kubelets.
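As a rough illustration of the scrubbing described above, the API server side can be thought of as clearing the field when the gate is off. The `SecurityContext` struct here is a pared-down stand-in and `dropProcMountIfDisabled` is a hypothetical name; the real API server logic is not reproduced here.

```go
package main

import "fmt"

// ProcMountType mirrors the type proposed in this KEP.
type ProcMountType string

// SecurityContext is a minimal stand-in for the container security context
// (assumption: only the field relevant to this KEP is modeled).
type SecurityContext struct {
	ProcMount *ProcMountType
}

// dropProcMountIfDisabled is a hypothetical helper: when the ProcMountType
// feature gate is off, it clears the field so the pod falls back to Default.
func dropProcMountIfDisabled(sc *SecurityContext, gateEnabled bool) {
	if sc == nil || gateEnabled {
		return
	}
	sc.ProcMount = nil
}

func main() {
	unmasked := ProcMountType("Unmasked")
	sc := &SecurityContext{ProcMount: &unmasked}
	dropProcMountIfDisabled(sc, false)
	fmt.Println("procMount cleared:", sc.ProcMount == nil)
}
```

Because the field is cleared before the pod is persisted, the Kubelet only ever sees a nil (Default) value in this sketch, which is why it needs no awareness of the gate.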
- Feature gate (also fill in values in kep.yaml)
- Feature gate name: ProcMountType
- Components depending on the feature gate: kube-apiserver (kube-apiserver filters the procMount field if it's not enabled).
No, only gives a user access to the Unmasked ProcMountType
Yes. This can be done by removing the feature gate from all kube-apiservers. To fully roll back, the nodes will need to be drained or rebooted, as the Kubelet will not remove the procMount of an already running container.
Nothing special. The pod's procMount field depends on where in the enablement process the kube-apiserver was when it was created.
The container has to be restarted to be up to date with the kube-apiserver.
Yes. I have manually tested feature enablement and disablement on kube-apiserver, and verified that pods are not recreated without a drain. There will be an e2e test to verify this as well.
It cannot. Either the kube-apiserver has the feature gate on or not. If it has it on, then workloads with the feature enabled will get an Unmasked ProcMountType if they request it. If it's off, then the kube-apiserver will force it to default, and the container's creation will move forward without an Unmasked ProcMountType.
Already running workloads aren't stopped and restarted on a feature revert, so an admin would need to reboot or drain to impact running workloads.
The behavior of this feature has been consistent for more than 10 minor releases, so these tests are less relevant now. Put differently: there is no upgrade->downgrade->upgrade path between supported versions of kubernetes that support this feature.
Manual testing has been done between versions that do support it, toggling the feature on and off. In these cases, the feature works as described.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
kubectl get pods --all-namespaces -o jsonpath="{range .items[*]}{.metadata.name}{' '}{.spec.containers[*].securityContext.procMount}{'\n'}{end}" | grep -i unmasked
This prints every pod that has an Unmasked ProcMountType, along with the pod name.
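For operators who prefer to post-process JSON instead of grepping jsonpath output, the same check can be sketched in Go. This assumes the input is the output of `kubectl get pods --all-namespaces -o json`; the `podList` struct and the `unmaskedPods` helper are hypothetical and model only the fields needed. Like the command above, it inspects regular containers only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList models only the fields needed from a pod list (assumption: the
// input has the shape produced by `kubectl get pods -o json`).
type podList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec struct {
			Containers []struct {
				SecurityContext *struct {
					ProcMount string `json:"procMount"`
				} `json:"securityContext"`
			} `json:"containers"`
		} `json:"spec"`
	} `json:"items"`
}

// unmaskedPods is a hypothetical helper returning the names of pods that have
// at least one container with procMount set to Unmasked.
func unmaskedPods(raw []byte) ([]string, error) {
	var list podList
	if err := json.Unmarshal(raw, &list); err != nil {
		return nil, err
	}
	var names []string
	for _, pod := range list.Items {
		for _, c := range pod.Spec.Containers {
			if c.SecurityContext != nil && c.SecurityContext.ProcMount == "Unmasked" {
				names = append(names, pod.Metadata.Name)
				break // one match per pod is enough
			}
		}
	}
	return names, nil
}

func main() {
	// Illustrative input only; pod names are made up.
	raw := []byte(`{"items":[
	  {"metadata":{"name":"nested-builder"},"spec":{"containers":[{"securityContext":{"procMount":"Unmasked"}}]}},
	  {"metadata":{"name":"plain"},"spec":{"containers":[{}]}}]}`)
	names, err := unmaskedPods(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```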
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details: Containers created with the Unmasked ProcMountType have the paths below writable, not read-only:
- "/proc/asound",
- "/proc/acpi",
- "/proc/kcore",
- "/proc/keys",
- "/proc/latency_stats",
- "/proc/timer_list",
- "/proc/timer_stats",
- "/proc/sched_debug",
- "/proc/scsi",
- "/sys/firmware",
- Another option is to run kubectl exec $podname -- mount | grep /proc.
- If there's just one mount, and it looks like proc on /proc type proc (rw,nosuid,nodev,noexec,relatime), this is an unmasked /proc.
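The mount-based check above can be automated along these lines. This is a sketch under assumptions: `looksUnmasked` is a hypothetical helper that takes the raw output of `mount` from inside the container, and the masked sample line in `main` is illustrative rather than captured from a real runtime.

```go
package main

import (
	"fmt"
	"strings"
)

// looksUnmasked is a hypothetical helper. It filters the output of `mount`
// the way `grep /proc` would and reports whether the only match is the
// procfs mount itself, which is the signature of an unmasked /proc.
func looksUnmasked(mountOutput string) bool {
	var matches []string
	for _, line := range strings.Split(mountOutput, "\n") {
		if strings.Contains(line, "/proc") {
			matches = append(matches, line)
		}
	}
	return len(matches) == 1 &&
		strings.HasPrefix(matches[0], "proc on /proc type proc")
}

func main() {
	unmasked := "proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
	// Illustrative extra mount of the kind a masking runtime might add.
	masked := unmasked + "tmpfs on /proc/acpi type tmpfs (ro,relatime)\n"
	fmt.Println(looksUnmasked(unmasked), looksUnmasked(masked))
}
```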
No noticeable change in pod start times when this feature is enabled.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: kubelet_pod_start_sli_duration_seconds
- [Optional] Aggregation method:
- Components exposing the metric: kubelet
I don't think any would be useful.
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
I don't think any would be useful.
- A CRI implementation that supports this feature
- All supported versions currently do.
No
ProcMountType in the pod spec
No
There is one additional field in the pod API: procMount. It is an enum with two values: Default and Unmasked.
The Kubelet is also passing the MaskedPaths to the CRI, which involves a single slice of strings.
When the value Default is chosen, the slice is defined here.
If Unmasked, the slice is empty.
Both of these are size changes on the order of bytes and can be considered negligible.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Potentially, a malicious user given access and running a root container in the host context could interfere with host processes. PSA has already been configured to mitigate this by requiring that a user be in a privileged namespace to get access to the field.
No effect
Malicious user gaining access to the host /proc with a rootful container
- admission should be updated to deny unmasked ProcMountType without user namespaces (hostUsers: true)
The field can be unset in a pod spec (or the feature gate turned off) to see if SLOs are met after the feature is disabled for pods.
- 2018-05-07: k/community update opened
- 2018-05-27: k/kubernetes PR merged with support.
- 2023-10-02: KEP opened and retargeted at Alpha
- 2024-02-26: Update Unmasked ProcMountType to fail validation without a pod level user namespace.
- 2024-05-31: Added e2e tests
- 2024-05-31: KEP updated to Beta
- --oci-worker-no-process-sandbox like in BuildKit
- Not broadly supported with other container runtimes/builders.
- Update the kernel to allow mounting a new procfs with masks.
- Proposed, but denied in the kernel
- Adopt a similar approach to LXD where /proc and /sys are mounted to different locations within the container, instead of masked.
- Give all pods with hostUsers: false (pod level user namespace) access to these mounts by default
- Even though it potentially is safe, it opens an argument that user namespaced pods are less secure than non user namespaced pods. The weakening of these boundaries should be opt-in.
- Ditch this option
- Most use cases don't really need this. However, if a pod wants to be able to, for instance, set its own sysctls, it would need this option.