- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Make read-only volumes recursively read-only.
For example, if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage should be read-only too.
The current readOnly volumes are not recursively read-only, and may result in compromise of data; e.g., even if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage are not read-only.
This issue can be fixed by utilizing OCI Runtime's "rro" bind mount option (https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config.md#linux-mount-options) to make read-only bind mounts recursively read-only.
The "rro" bind mount option is implemented by calling mount_setattr(2) with MOUNT_ATTR_RDONLY and AT_RECURSIVE.
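The mount_setattr(2) call described above can be sketched in Go as follows. This is only an illustration, not the code used by runc or crun: the helper name setRecursiveReadOnly is hypothetical, the constants are copied from the Linux UAPI headers, and the sketch assumes a Linux host with kernel >= 5.12 and CAP_SYS_ADMIN.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// Values copied from the Linux UAPI headers (linux/mount.h, linux/fcntl.h).
const (
	sysMountSetattr = 442    // syscall number of mount_setattr(2)
	mountAttrRdonly = 0x1    // MOUNT_ATTR_RDONLY
	atRecursive     = 0x8000 // AT_RECURSIVE
)

// mountAttr mirrors struct mount_attr (32 bytes, MOUNT_ATTR_SIZE_VER0).
type mountAttr struct {
	AttrSet     uint64
	AttrClr     uint64
	Propagation uint64
	UsernsFd    uint64
}

// setRecursiveReadOnly (hypothetical helper) marks the mount at path, and every
// mount below it, read-only. path must be a mount point.
func setRecursiveReadOnly(path string) error {
	p, err := syscall.BytePtrFromString(path)
	if err != nil {
		return err
	}
	attr := mountAttr{AttrSet: mountAttrRdonly}
	atFdcwd := -100 // AT_FDCWD: resolve path relative to the current directory
	_, _, errno := syscall.Syscall6(sysMountSetattr,
		uintptr(atFdcwd),
		uintptr(unsafe.Pointer(p)),
		uintptr(atRecursive),
		uintptr(unsafe.Pointer(&attr)),
		unsafe.Sizeof(attr),
		0)
	if errno != 0 {
		return fmt.Errorf("mount_setattr %s: %w", path, errno)
	}
	return nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: rro <mountpoint>")
		return
	}
	if err := setRecursiveReadOnly(os.Args[1]); err != nil {
		fmt.Println("error:", err)
	}
}
```

On kernels older than 5.12 the call is expected to fail with ENOSYS, which is why the feature cannot be provided there.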
Requires kernel >= 5.12, with one of the following OCI runtimes:
- runc >= 1.1
- crun >= 1.4
Goal: support recursive read-only mounts for kernel >= 5.12.
Non-goal: support recursive read-only mounts for old runc and old kernel releases.
A user wants to mount /mnt, including its submounts such as /mnt/usbstorage, as read-only.
Constraints: needs runc >= 1.1 && kernel >= 5.12.
- Increased API surface but still not secure-by-default, for sake of compatibility.
  - Mitigation: None
- False sense of security when not implemented
  - Mitigation: VolumeMountStatus indicating actual RRO setting
Add RecursiveReadOnly: (Disabled|IfPossible|Enabled) to the VolumeMount struct.
A pod manifest will look like this:
spec:
  volumes:
    - name: foo
      hostPath:
        path: /mnt
        type: Directory
  containers:
  - volumeMounts:
    - mountPath: /mnt
      name: foo
      mountPropagation: None
      readOnly: true
      # NEW
      recursiveReadOnly: IfPossible
See the comment lines in the diff below for the constraints of the VolumeMount options:
diff --git a/pkg/apis/core/types.go b/pkg/apis/core/types.go
index e40b8bfa104..09c88222c2d 100644
--- a/pkg/apis/core/types.go
+++ b/pkg/apis/core/types.go
@@ -1914,6 +1914,31 @@ type VolumeMount struct {
// Optional: Defaults to false (read-write).
// +optional
ReadOnly bool
+ // RecursiveReadOnly specifies recursive-readonly mode.
+ //
+ // 1. If ReadOnly is false, RecursiveReadOnly must be unspecified.
+ // 2. If ReadOnly is true:
+ // 2.1. If RecursiveReadOnly is unspecified:
+ // 2.1.1. if it belongs to a Pod being created, it is initialized to Disabled.
+ // 2.1.2 if it belongs to a PodSpec under Deployment, Job, etc., it remains unspecified
+ // (and will be set to Disabled eventually, when the Pod is created).
+ // 2.2. If RecursiveReadOnly is set to Disabled, the mount is not made recursively read-only.
+ // 2.3. If RecursiveReadOnly is set to IfPossible, the mount is made recursively read-only,
+ // if it is supported by the runtime.
+ // If it is not supported by the runtime, the mount is not made recursively read-only.
+ // MountPropagation must be None or unspecified (which defaults to None).
+ // 2.4. If RecursiveReadOnly is set to Enabled, the mount is made recursively read-only.
+ // If it is not supported by the runtime, the Pod will be terminated by kubelet,
+ // and an error will be generated to indicate the reason.
+ // MountPropagation must be None or unspecified (which defaults to None).
+ // 2.5. If RecursiveReadOnly is set to unknown value, it will result in an error.
+ //
+ // When this property is recognized by kubelet and kube-apiserver,
+ // VolumeMountStatus.RecursiveReadOnly will be set to either Disabled or Enabled.
+ //
+ // +featureGate=RecursiveReadOnlyMounts
+ // +optional
+ RecursiveReadOnly *RecursiveReadOnlyMode
// Required. If the path is not an absolute path (e.g. some/path) it
// will be prepended with the appropriate root prefix for the operating
// system. On Linux this is '/', on Windows this is 'C:\'.
@@ -1926,6 +1951,8 @@ type VolumeMount struct {
// to container and the other way around.
// When not set, MountPropagationNone is used.
// This field is beta in 1.10.
+ // When RecursiveReadOnly is set to IfPossible or to Enabled, MountPropagation must be None or unspecified
+ // (which defaults to None).
// +optional
MountPropagation *MountPropagationMode
// Expanded path within the volume from which the container's volume should be mounted.
@@ -1961,6 +1988,18 @@ const (
MountPropagationBidirectional MountPropagationMode = "Bidirectional"
)
+// RecursiveReadOnlyMode describes recursive-readonly mode.
+type RecursiveReadOnlyMode string
+
+const (
+ // RecursiveReadOnlyDisabled disables recursive-readonly mode.
+ RecursiveReadOnlyDisabled RecursiveReadOnlyMode = "Disabled"
+ // RecursiveReadOnlyIfPossible enables recursive-readonly mode if possible.
+ RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
+ // RecursiveReadOnlyEnabled enables recursive-readonly mode, or raise an error.
+ RecursiveReadOnlyEnabled RecursiveReadOnlyMode = "Enabled"
+)
+
// VolumeDevice describes a mapping of a raw block device within a container.
type VolumeDevice struct {
// name must match the name of a persistentVolumeClaim in the pod
@@ -2591,6 +2630,10 @@ type ContainerStatus struct {
// +featureGate=InPlacePodVerticalScaling
// +optional
Resources *ResourceRequirements
+ // Status of volume mounts.
+ // +listType=atomic
+ // +optional
+ VolumeMounts []VolumeMountStatus
}
// PodPhase is a label for the condition of a pod at the current time.
@@ -2664,6 +2707,21 @@ const (
PodResizeStatusInfeasible PodResizeStatus = "Infeasible"
)
+// VolumeMountStatus shows status of volume mounts.
+type VolumeMountStatus struct {
+ // Name corresponds to the name of the original VolumeMount.
+ Name string
+ // ReadOnly corresponds to the original VolumeMount.
+ // +optional
+ ReadOnly bool
+ // RecursiveReadOnly must be set to Disabled, Enabled, or unspecified (for non-readonly mounts).
+ // An IfPossible value in the original VolumeMount must be translated to Disabled or Enabled,
+ // depending on the mount result.
+ // +featureGate=RecursiveReadOnlyMounts
+ // +optional
+ RecursiveReadOnly *RecursiveReadOnlyMode
+}
+
// RestartPolicy describes how the container should be restarted.
// Only one of the following restart policies may be specified.
// If none of the following policies is specified, the default one
@@ -4591,6 +4649,24 @@ type NodeDaemonEndpoints struct {
KubeletEndpoint DaemonEndpoint
}
+// RuntimeClassFeatures is a set of runtime features.
+type RuntimeClassFeatures struct {
+ // RecursiveReadOnlyMounts is set to true if the runtime class supports RecursiveReadOnlyMounts.
+ // +optional
+ RecursiveReadOnlyMounts *bool
+}
+
+// RuntimeClass is a set of runtime class information.
+type RuntimeClass struct {
+ // Runtime class name.
+ // Empty for the default runtime class.
+ // +optional
+ Name string
+ // Supported features.
+ // +optional
+ Features *RuntimeClassFeatures
+}
+
// NodeSystemInfo is a set of ids/uuids to uniquely identify the node.
type NodeSystemInfo struct {
// MachineID reported by the node. For unique machine identification
@@ -4701,6 +4777,9 @@ type NodeStatus struct {
// Status of the config assigned to the node via the dynamic Kubelet config feature.
// +optional
Config *NodeConfigStatus
+ // The available runtime classes.
+ // +optional
+ RuntimeClasses []RuntimeClass
}
// UniqueVolumeName defines the name of attached volume
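The constraints listed in the RecursiveReadOnly comment of the diff above could be checked along the following lines. This is a minimal sketch, not the actual kube-apiserver validation: the VolumeMount struct is a trimmed-down stand-in, and validateRecursiveReadOnly and ptr are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
)

type RecursiveReadOnlyMode string
type MountPropagationMode string

const (
	RecursiveReadOnlyDisabled   RecursiveReadOnlyMode = "Disabled"
	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
	RecursiveReadOnlyEnabled    RecursiveReadOnlyMode = "Enabled"

	MountPropagationNone MountPropagationMode = "None"
)

// VolumeMount is a trimmed-down stand-in for the core API type.
type VolumeMount struct {
	ReadOnly          bool
	RecursiveReadOnly *RecursiveReadOnlyMode
	MountPropagation  *MountPropagationMode
}

// ptr returns a pointer to v (hypothetical convenience helper).
func ptr[T any](v T) *T { return &v }

// validateRecursiveReadOnly applies rules 1 and 2.2 to 2.5 from the comment.
func validateRecursiveReadOnly(m VolumeMount) error {
	if m.RecursiveReadOnly == nil {
		return nil // rule 2.1: defaulting is handled elsewhere
	}
	if !m.ReadOnly {
		return errors.New("recursiveReadOnly must be unspecified when readOnly is false") // rule 1
	}
	switch *m.RecursiveReadOnly {
	case RecursiveReadOnlyDisabled:
		return nil // rule 2.2
	case RecursiveReadOnlyIfPossible, RecursiveReadOnlyEnabled:
		// rules 2.3 and 2.4: MountPropagation must be None or unspecified
		if m.MountPropagation != nil && *m.MountPropagation != MountPropagationNone {
			return fmt.Errorf("mountPropagation must be None or unspecified, got %q", *m.MountPropagation)
		}
		return nil
	default:
		return fmt.Errorf("unknown recursiveReadOnly value %q", *m.RecursiveReadOnly) // rule 2.5
	}
}

func main() {
	ok := VolumeMount{ReadOnly: true, RecursiveReadOnly: ptr(RecursiveReadOnlyIfPossible)}
	bad := VolumeMount{ReadOnly: false, RecursiveReadOnly: ptr(RecursiveReadOnlyEnabled)}
	fmt.Println(validateRecursiveReadOnly(ok))
	fmt.Println(validateRecursiveReadOnly(bad))
}
```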
Add bool recursive_read_only to the Mount message.
CRI implementations will also expose the availability of the feature via the RuntimeHandlerFeatures message.
As kubelet can inspect the availability of the feature via the RuntimeHandlerFeatures message, there is no concept of "IfPossible" in the CRI API; kubelet translates an "IfPossible" value in the Core API into true or false in the CRI API.
The RuntimeHandlerFeatures message is also propagated to the Core API, as the RuntimeClasses field of the NodeStatus struct.
Diff:
diff --git a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
index e16688d8386..194d591c27f 100644
--- a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
+++ b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
@@ -235,6 +235,15 @@ message Mount {
repeated IDMapping uidMappings = 6;
// GidMappings specifies the runtime GID mappings for the mount.
repeated IDMapping gidMappings = 7;
+ // If set to true, the mount is made recursive read-only.
+ // In this CRI API, recursive_read_only is a plain true/false boolean, although its equivalent
+ // in the Kubernetes core API is a quaternary that can be nil, "Enabled", "IfPossible", or "Disabled".
+ // kubelet translates that quaternary value in the core API into a boolean in this CRI API.
+ // Remarks:
+ // - nil is just treated as false
+ // - when set to true, readonly must be explicitly set to true, and propagation must be PRIVATE (0).
+ // - (readonly == false && recursive_read_only == false) does not make the mount read-only.
+ bool recursive_read_only = 8;
}
// IDMapping describes host to container ID mappings for a pod sandbox.
@@ -1524,6 +1533,22 @@ message StatusRequest {
bool verbose = 1;
}
+message RuntimeHandlerFeatures {
+ // recursive_read_only_mounts is set to true if the runtime handler supports
+ // recursive read-only mounts.
+ // For runc-compatible runtimes, availability of this feature can be detected by checking whether
+ // the Linux kernel version is >= 5.12, and, `runc features | jq .mountOptions` contains "rro".
+ bool recursive_read_only_mounts = 1;
+}
+
+message RuntimeHandler {
+ // Name must be unique in StatusResponse.
+ // An empty string denotes the default handler.
+ string name = 1;
+ // Supported features.
+ RuntimeHandlerFeatures features = 2;
+}
+
message StatusResponse {
// Status of the Runtime.
RuntimeStatus status = 1;
@@ -1532,6 +1557,8 @@ message StatusResponse {
// debug, e.g. plugins used by the container runtime.
// It should only be returned non-empty when Verbose is true.
map<string, string> info = 2;
+ // Runtime handlers.
+ repeated RuntimeHandler runtime_handlers = 3;
}
message ImageFsInfoRequest {}
diff --git a/staging/src/k8s.io/cri-api/pkg/errors/errors.go b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
index a4538669122..c8e4a18dec5 100644
--- a/staging/src/k8s.io/cri-api/pkg/errors/errors.go
+++ b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
@@ -29,6 +29,9 @@ var (
// ErrSignatureValidationFailed - Unable to validate the image signature on the PullImage RPC call.
ErrSignatureValidationFailed = errors.New("SignatureValidationFailed")
+
+ // ErrRROUnsupported - Unable to enforce recursive readonly mounts
+ ErrRROUnsupported = errors.New("RROUnsupported")
)
// IsNotFound returns a boolean indicating whether the error
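The translation between the Core API mode and the CRI boolean described for the CRI API above might look roughly like this. It is a sketch under assumptions, not kubelet's actual code: toCRI and toStatus are hypothetical helper names, and runtimeSupportsRRO stands for the recursive_read_only_mounts value reported in RuntimeHandlerFeatures.

```go
package main

import (
	"errors"
	"fmt"
)

// RecursiveReadOnlyMode mirrors the core API type from the diff.
type RecursiveReadOnlyMode string

const (
	RecursiveReadOnlyDisabled   RecursiveReadOnlyMode = "Disabled"
	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
	RecursiveReadOnlyEnabled    RecursiveReadOnlyMode = "Enabled"
)

// ErrRROUnsupported mirrors the error added to the CRI errors package.
var ErrRROUnsupported = errors.New("RROUnsupported")

// toCRI (hypothetical) resolves the core API mode into the CRI boolean.
func toCRI(readOnly bool, mode *RecursiveReadOnlyMode, runtimeSupportsRRO bool) (bool, error) {
	if !readOnly || mode == nil {
		return false, nil // nil is just treated as false
	}
	switch *mode {
	case RecursiveReadOnlyDisabled:
		return false, nil
	case RecursiveReadOnlyIfPossible:
		return runtimeSupportsRRO, nil // silently falls back when unsupported
	case RecursiveReadOnlyEnabled:
		if !runtimeSupportsRRO {
			return false, ErrRROUnsupported
		}
		return true, nil
	default:
		return false, fmt.Errorf("unknown mode %q", *mode)
	}
}

// toStatus (hypothetical) reports the effective mode for VolumeMountStatus:
// nil for non-readonly mounts, otherwise Disabled or Enabled, never IfPossible.
func toStatus(readOnly, rro bool) *RecursiveReadOnlyMode {
	if !readOnly {
		return nil
	}
	m := RecursiveReadOnlyDisabled
	if rro {
		m = RecursiveReadOnlyEnabled
	}
	return &m
}

func main() {
	modes := []RecursiveReadOnlyMode{RecursiveReadOnlyDisabled, RecursiveReadOnlyIfPossible, RecursiveReadOnlyEnabled}
	for _, mode := range modes {
		for _, supported := range []bool{false, true} {
			m := mode
			rro, err := toCRI(true, &m, supported)
			fmt.Printf("mode=%s supported=%t -> recursive_read_only=%t err=%v\n", mode, supported, rro, err)
		}
	}
}
```

Keeping "IfPossible" out of the CRI API means the runtime only ever sees a definite request, and the status field can always report what was actually applied.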
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
The existing tests will continue to pass. New tests have to be added to cover the proposed feature.
- kubelet unit tests: take a CRI status and populate the RecursiveReadOnly field in the VolumeMountStatus struct. Implemented in https://github.com/kubernetes/kubernetes/blob/v1.30.0/pkg/kubelet/kubelet_pods_test.go#L6080-L6201. The unit test set covers 16 conditions as of Kubernetes v1.30.0. There is no branch coverage data (go test -cover), as the feature is not implemented as a dedicated Go package.
- CRI test: similar to e2e tests below but without using Kubernetes Core API. Implemented in https://github.com/kubernetes-sigs/cri-tools/blob/v1.30.0/pkg/validate/container_linux.go#L311-L413.
See e2e tests below.
- run a pod in each RecursiveReadOnly mode and verify that the status comes back correctly
- run RecursiveReadOnly="Enabled" on a runtime that does not support it and ensure that an error is raised
- run RecursiveReadOnly="Enabled", and verify that the mount is actually recursively read-only
- run RecursiveReadOnly="Disabled", and verify that the mount is actually not recursively read-only
Tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.30.0/test/e2e_node/mount_rro_linux_test.go, and will be executed on the CI when the CI is upgraded to use containerd v2.0. So, there is no link to the testgrid yet.
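One way to check whether a mount is "actually recursively read-only", as the tests above require, is to inspect the per-mount options in /proc/self/mountinfo from inside the container. The sketch below is an assumption about how such a check could be written, not the method used by the linked e2e tests; writableMountsUnder and the sample data are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sample imitates /proc/self/mountinfo for a read-only /mnt whose submount
// /mnt/usbstorage is still writable (the non-recursive case).
const sample = `100 90 8:1 /mnt /mnt ro,relatime - ext4 /dev/sda1 rw
101 100 8:17 / /mnt/usbstorage rw,relatime - vfat /dev/sdb1 rw
102 90 0:45 / /tmp rw - tmpfs tmpfs rw`

// writableMountsUnder returns the mount points at or below root whose
// per-mount options (6th field of mountinfo) do not contain "ro".
func writableMountsUnder(mountinfo, root string) []string {
	var writable []string
	root = strings.TrimSuffix(root, "/")
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 6 {
			continue // not a well-formed mountinfo line
		}
		mountPoint, opts := fields[4], fields[5]
		if mountPoint != root && !strings.HasPrefix(mountPoint, root+"/") {
			continue // outside the tree being checked
		}
		readOnly := false
		for _, o := range strings.Split(opts, ",") {
			if o == "ro" {
				readOnly = true
			}
		}
		if !readOnly {
			writable = append(writable, mountPoint)
		}
	}
	return writable
}

func main() {
	// An empty result would mean the tree is recursively read-only.
	fmt.Println(writableMountsUnder(sample, "/mnt"))
}
```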
- Feature implemented behind a feature flag
- Unit tests and CRI tests will pass
- e2e tests pass with containerd, CRI-O, and cri-dockerd
- containerd/containerd#9787
- cri-o/cri-o#7962
- Mirantis/cri-dockerd#370
- Two beta releases of Kubernetes at least
- containerd, CRI-O, and cri-dockerd support the feature with their GA releases
Upgrade: No action is needed. Existing readonly mounts will remain non-recursively readonly.
Downgrade:
- On downgrading kube-apiserver, the []volumeMounts.recursiveReadOnly property will be lost and will not be propagated to kubelet. If the mode was set to non-Disabled, this will result in producing writable mounts. It is the user's responsibility to use the correct version of kube-apiserver when they need non-Disabled mode.
- On downgrading kubelet, the []volumeMounts.recursiveReadOnly properties will be lost, and the []containerStatuses.[]volumeMount.recursiveReadOnly status will not be updated. It is the user's responsibility to use the correct version of kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly.
- On downgrading the CRI or OCI runtime, if the RecursiveReadOnly mode is set to Enabled, kubelet will raise an error. IfPossible will be just treated as Disabled.
- It is the user's responsibility to use the correct version of kube-apiserver when they need non-Disabled mode. Otherwise the mode will not be propagated to kubelet.
- It is the user's responsibility to use the correct version of kube-apiserver and kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly. Otherwise the property may have an inconsistent value.
- CRI and OCI runtimes have to be updated before kubelet, otherwise kubelet will not be aware whether they support the feature or not, and it will assume that they do not support the feature.
- If only some of the nodes support the feature, Disabled and IfPossible will continue to work on all the nodes, but Enabled will fail on a node that does not support the feature. kube-scheduler does not care about this, and it is the user's responsibility to set nodeSelector, nodeAffinity, etc. to avoid scheduling a pod with Enabled to a node that does not support the feature.
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name: RecursiveReadOnlyMounts
  - Components depending on the feature gate: kube-apiserver, kubelet
No
Yes, by unsetting RecursiveReadOnly=Enabled.
Components can be downgraded too, but it should be noted that VolumeMountStatus may still see an inconsistent state when kubelet was downgraded.
The pod manifest has to be recreated to get a consistent state in this case.
It works just the same as a fresh roll-out, as long as the user has recreated the pod manifests. (See "Can the feature be disabled once ..." section above)
Unit tests will run with and without the feature gate.
A rollout may fail when at least one of the following components is too old:
Component | recursiveReadOnly value that will cause an error |
---|---|
kube-apiserver | any value |
kubelet | any value |
CRI runtime | Enabled |
OCI runtime | Enabled |
kernel | Enabled |
For example, an error will be returned like this if kube-apiserver is too old:
$ kubectl apply -f rro.yaml
Error from server (BadRequest): error when creating "rro.yaml": Pod in version "v1" cannot be handled as a Pod:
strict decoding error: unknown field "spec.containers[0].volumeMounts[0].recursiveReadOnly"
No impact on already running workloads.
Look for an event indicating that RRO is not supported by the runtime.
$ kubectl get events -o json -w
...
{
...
"kind": "Event",
"message": "Error: RRONotSupported",
...
}
...
During the beta phase, the following test will be manually performed:
- Enable the RecursiveReadOnlyMounts feature gate for kube-apiserver and kubelet.
- Create a pod with recursiveReadOnly specified.
- Disable the RecursiveReadOnlyMounts feature gate for kube-apiserver, and confirm that the pod gets rejected.
- Enable the RecursiveReadOnlyMounts feature gate again, and confirm that the pod gets scheduled again.
- Do the same for kubelet too.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
Yes, the feature is used if the following jq command prints a non-zero number:
kubectl get pods -A -o json | jq '[.items[].spec.containers[].volumeMounts[]? | select(.recursiveReadOnly)] | length'
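For operators who prefer not to depend on jq, the same count could be computed from the JSON printed by kubectl get pods -A -o json with a small Go program. This is a sketch: countRecursiveReadOnlyMounts is a hypothetical helper, the struct declares only the fields the jq filter touches, and the sample document is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podList declares only the fields read by the jq filter above.
type podList struct {
	Items []struct {
		Spec struct {
			Containers []struct {
				VolumeMounts []struct {
					RecursiveReadOnly *string `json:"recursiveReadOnly"`
				} `json:"volumeMounts"`
			} `json:"containers"`
		} `json:"spec"`
	} `json:"items"`
}

// countRecursiveReadOnlyMounts counts volume mounts whose recursiveReadOnly
// field is present, mirroring select(.recursiveReadOnly) in the jq command.
func countRecursiveReadOnlyMounts(data []byte) (int, error) {
	var pods podList
	if err := json.Unmarshal(data, &pods); err != nil {
		return 0, err
	}
	n := 0
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			for _, vm := range c.VolumeMounts {
				if vm.RecursiveReadOnly != nil {
					n++
				}
			}
		}
	}
	return n, nil
}

// sampleJSON is a made-up, heavily trimmed pod list.
const sampleJSON = `{"items":[
 {"spec":{"containers":[{"volumeMounts":[
   {"name":"foo","readOnly":true,"recursiveReadOnly":"IfPossible"},
   {"name":"bar"}]}]}},
 {"spec":{"containers":[{"name":"no-mounts"}]}}
]}`

func main() {
	n, err := countRecursiveReadOnlyMounts([]byte(sampleJSON))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n)
}
```

Like the jq filter, this counts every mount where the field is set, including mounts whose value is Disabled.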
- API .status
  - Condition name: volumeMountStatus.recursiveReadOnly
- recursiveReadOnly=Enabled: 100% of pods that were scheduled into a node must run with recursive read-only mounts, or, 100% of them must fail to run.
- recursiveReadOnly=IfPossible: 100% of pods that were scheduled into a node must run with or without recursive read-only mounts
- recursiveReadOnly=Disabled, or unset: 100% of pods that were scheduled into a node must run without recursive read-only mounts
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: Event
  - [Optional] Aggregation method: kubectl get events -o json -w
  - Components exposing the metric: kubelet -> kube-apiserver
If recursiveReadOnly is set to Enabled but it is not supported, kubelet will raise an event like this:
$ kubectl get events -o json -w
...
{
...
"kind": "Event",
"message": "Error: RRONotSupported",
...
}
...
If the OCI runtime claims that it supports recursive read-only mounts but it actually fails to mount them, the pod will enter CrashLoopBackOff. The error from the OCI runtime can be inspected by running:
kubectl get pod -o json foo | jq .status.containerStatuses[0].lastState.terminated.message
Are there any missing metrics that would be useful to have to improve observability of this feature?
Potentially, kube-scheduler could be implemented to avoid scheduling a pod with recursiveReadOnly: Enabled to a node running an old kernel.
In this way, the Event metric described above would not happen, and users would instead see Pending pods as an error metric.
However, this is not planned to be implemented in kube-scheduler, as it seems overengineering.
Users may use nodeSelector, nodeAffinity, etc. to work around this.
Specific version of CRI, OCI, and Linux kernel
A pod with recursiveReadOnly: Enabled may be rejected by kubelet with the probability of
To evaluate this risk, users may run
kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]'
to see how many nodes support RecursiveReadOnlyMounts: true.
No
No
No
A dozen bytes
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
No
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No
A pod cannot be created, just as in other pods.
None
- Make sure that the node is running Linux kernel v5.12 or later.
- Make sure that runc features | jq .mountOptions contains "rro". Otherwise update runc.
- Make sure that crictl info (with the latest crictl) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in the CRI runtime's log.
- Make sure that kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]' (with the latest kubectl and control planes) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in kubelet's log.
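The first troubleshooting step (kernel v5.12 or later) can be automated with a small check such as the one below. It is a sketch only: kernelAtLeast is a hypothetical helper, and reading /proc/sys/kernel/osrelease is one assumed way of obtaining the release string on a Linux node.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "5.15.0-91-generic" is at least major.minor.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	var gotMajor, gotMinor int
	// Only the leading "<major>.<minor>" part is parsed; the rest is ignored.
	if _, err := fmt.Sscanf(release, "%d.%d", &gotMajor, &gotMinor); err != nil {
		return false, fmt.Errorf("cannot parse kernel release %q: %w", release, err)
	}
	if gotMajor != major {
		return gotMajor > major, nil
	}
	return gotMinor >= minor, nil
}

func main() {
	data, err := os.ReadFile("/proc/sys/kernel/osrelease")
	if err != nil {
		fmt.Println("cannot read kernel release:", err)
		return
	}
	release := strings.TrimSpace(string(data))
	ok, err := kernelAtLeast(release, 5, 12)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("kernel %s supports mount_setattr(2): %t\n", release, ok)
}
```

A passing kernel check is necessary but not sufficient; the runc, CRI runtime, and kubelet checks in the list above still apply.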
- v1.30: alpha
- v1.31: beta
See "Alternatives" below.
Plan B is to keep the Kubernetes Core API and the CRI API completely unmodified, and just let the CRI runtime treat "readonly" as "recursive readonly".
This would be much easier to implement and adopt; however, a small portion of users may find this to be a breaking change.
Actually, containerd has once adopted the Plan B (containerd/containerd#9713) in its main branch (not in any GA release), but it is being reverted in favor of this KEP now (containerd/containerd#9747).
runc >= 1.1 && kernel >= 5.12