KEP-3857: Recursive read-only (RRO) mounts

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Make read-only volumes recursively read-only. For example, if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage should be read-only too.

Motivation

The current readOnly volumes are not recursively read-only, which may lead to data being compromised. For example, even if /mnt is mounted as read-only, its submounts such as /mnt/usbstorage are not read-only.

This issue can be fixed by utilizing OCI Runtime's "rro" bind mount option (https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config.md#linux-mount-options) to make read-only bind mounts recursively read-only.

The "rro" bind mount option is implemented by calling mount_setattr(2) with MOUNT_ATTR_RDONLY and AT_RECURSIVE.

This requires Linux kernel >= 5.12 and one of the following OCI runtimes:

  • runc >= 1.1
  • crun >= 1.4

Goals

Support recursive read-only mounts for kernel >= 5.12.

Non-Goals

Support recursive read-only mounts for old runc and old kernel releases.

Proposal

User Stories (Optional)

Story 1

A user wants to mount /mnt, including its submounts such as /mnt/usbstorage, as read-only.

Notes/Constraints/Caveats (Optional)

Constraints: needs runc >= 1.1 && kernel >= 5.12.

Risks and Mitigations

  • Increased API surface, but still not secure-by-default, for the sake of compatibility.

    • Mitigation: None
  • False sense of security when not implemented

    • Mitigation: VolumeMountStatus indicating actual RRO setting

Design Details

Core API

Add RecursiveReadOnly: (Disabled|IfPossible|Enabled) to the VolumeMount struct.

A pod manifest will look like this:

spec:
  volumes:
    - name: foo
      hostPath:
        path: /mnt
        type: Directory
  containers:
  - volumeMounts:
    - mountPath: /mnt
      name: foo
      mountPropagation: None
      readOnly: true
      # NEW
      recursiveReadOnly: IfPossible

See the comment lines in the diff below for the constraints of the VolumeMount options:

diff --git a/pkg/apis/core/types.go b/pkg/apis/core/types.go
index e40b8bfa104..09c88222c2d 100644
--- a/pkg/apis/core/types.go
+++ b/pkg/apis/core/types.go
@@ -1914,6 +1914,31 @@ type VolumeMount struct {
 	// Optional: Defaults to false (read-write).
 	// +optional
 	ReadOnly bool
+	// RecursiveReadOnly specifies recursive-readonly mode.
+	//
+	// 1. If ReadOnly is false, RecursiveReadOnly must be unspecified.
+	// 2. If ReadOnly is true:
+	//   2.1. If RecursiveReadOnly is unspecified:
+	//        2.1.1. if it belongs to a Pod being created, it is initialized to Disabled.
+	//        2.1.2  if it belongs to a PodSpec under Deployment, Job, etc., it remains unspecified
+	//               (and will be set to Disabled eventually, when the Pod is created).
+	//   2.2. If RecursiveReadOnly is set to Disabled, the mount is not made recursively read-only.
+	//   2.3. If RecursiveReadOnly is set to IfPossible, the mount is made recursively read-only,
+	//        if it is supported by the runtime.
+	//        If it is not supported by the runtime, the mount is not made recursively read-only.
+	//        MountPropagation must be None or unspecified (which defaults to None).
+	//   2.4. If RecursiveReadOnly is set to Enabled, the mount is made recursively read-only.
+	//        If it is not supported by the runtime, the Pod will be terminated by kubelet,
+	//        and an error will be generated to indicate the reason.
+	//        MountPropagation must be None or unspecified (which defaults to None).
+	//   2.5. If RecursiveReadOnly is set to unknown value, it will result in an error.
+	//
+	// When this property is recognized by kubelet and kube-apiserver,
+	// VolumeMountStatus.RecursiveReadOnly will be set to either Disabled or Enabled.
+	//
+	// +featureGate=RecursiveReadOnlyMounts
+	// +optional
+	RecursiveReadOnly *RecursiveReadOnlyMode
 	// Required. If the path is not an absolute path (e.g. some/path) it
 	// will be prepended with the appropriate root prefix for the operating
 	// system.  On Linux this is '/', on Windows this is 'C:\'.
@@ -1926,6 +1951,8 @@ type VolumeMount struct {
 	// to container and the other way around.
 	// When not set, MountPropagationNone is used.
 	// This field is beta in 1.10.
+	// When RecursiveReadOnly is set to IfPossible or to Enabled, MountPropagation must be None or unspecified
+	// (which defaults to None).
 	// +optional
 	MountPropagation *MountPropagationMode
 	// Expanded path within the volume from which the container's volume should be mounted.
@@ -1961,6 +1988,18 @@ const (
 	MountPropagationBidirectional MountPropagationMode = "Bidirectional"
 )
 
+// RecursiveReadOnlyMode describes recursive-readonly mode.
+type RecursiveReadOnlyMode string
+
+const (
+	// RecursiveReadOnlyDisabled disables recursive-readonly mode.
+	RecursiveReadOnlyDisabled RecursiveReadOnlyMode = "Disabled"
+	// RecursiveReadOnlyIfPossible enables recursive-readonly mode if possible.
+	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
+	// RecursiveReadOnlyEnabled enables recursive-readonly mode, or raise an error.
+	RecursiveReadOnlyEnabled RecursiveReadOnlyMode = "Enabled"
+)
+
 // VolumeDevice describes a mapping of a raw block device within a container.
 type VolumeDevice struct {
 	// name must match the name of a persistentVolumeClaim in the pod
@@ -2591,6 +2630,10 @@ type ContainerStatus struct {
 	// +featureGate=InPlacePodVerticalScaling
 	// +optional
 	Resources *ResourceRequirements
+	// Status of volume mounts.
+	// +listType=atomic
+	// +optional
+	VolumeMounts []VolumeMountStatus
 }
 
 // PodPhase is a label for the condition of a pod at the current time.
@@ -2664,6 +2707,21 @@ const (
 	PodResizeStatusInfeasible PodResizeStatus = "Infeasible"
 )
 
+// VolumeMountStatus shows status of volume mounts.
+type VolumeMountStatus struct {
+	// Name corresponds to the name of the original VolumeMount.
+	Name string
+	// ReadOnly corresponds to the original VolumeMount.
+	// +optional
+	ReadOnly bool
+	// RecursiveReadOnly must be set to Disabled, Enabled, or unspecified (for non-readonly mounts).
+	// An IfPossible value in the original VolumeMount must be translated to Disabled or Enabled,
+	// depending on the mount result.
+	// +featureGate=RecursiveReadOnlyMounts
+	// +optional
+	RecursiveReadOnly *RecursiveReadOnlyMode
+}
+
 // RestartPolicy describes how the container should be restarted.
 // Only one of the following restart policies may be specified.
 // If none of the following policies is specified, the default one
@@ -4591,6 +4649,24 @@ type NodeDaemonEndpoints struct {
 	KubeletEndpoint DaemonEndpoint
 }
 
+// RuntimeClassFeatures is a set of runtime features.
+type RuntimeClassFeatures struct {
+	// RecursiveReadOnlyMounts is set to true if the runtime class supports RecursiveReadOnlyMounts.
+	// +optional
+	RecursiveReadOnlyMounts *bool
+}
+
+// RuntimeClass is a set of runtime class information.
+type RuntimeClass struct {
+	// Runtime class name.
+	// Empty for the default runtime class.
+	// +optional
+	Name string
+	// Supported features.
+	// +optional
+	Features *RuntimeClassFeatures
+}
+
 // NodeSystemInfo is a set of ids/uuids to uniquely identify the node.
 type NodeSystemInfo struct {
 	// MachineID reported by the node. For unique machine identification
@@ -4701,6 +4777,9 @@ type NodeStatus struct {
 	// Status of the config assigned to the node via the dynamic Kubelet config feature.
 	// +optional
 	Config *NodeConfigStatus
+	// The available runtime classes.
+	// +optional
+	RuntimeClasses []RuntimeClass
 }
 
 // UniqueVolumeName defines the name of attached volume
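
The constraints listed in the RecursiveReadOnly comment can be read as a small validation function. The following is a sketch that follows the rules as they are written in the comment; it is not the upstream validation code, and the local types and the validateRecursiveReadOnly and ptr helpers are hypothetical stand-ins for the real API types.

```go
package main

import (
	"errors"
	"fmt"
)

type RecursiveReadOnlyMode string
type MountPropagationMode string

const (
	RecursiveReadOnlyDisabled   RecursiveReadOnlyMode = "Disabled"
	RecursiveReadOnlyIfPossible RecursiveReadOnlyMode = "IfPossible"
	RecursiveReadOnlyEnabled    RecursiveReadOnlyMode = "Enabled"

	MountPropagationNone MountPropagationMode = "None"
)

// VolumeMount is a trimmed-down stand-in for the core API struct.
type VolumeMount struct {
	Name              string
	ReadOnly          bool
	RecursiveReadOnly *RecursiveReadOnlyMode
	MountPropagation  *MountPropagationMode
}

// ptr returns a pointer to v (convenience for optional fields).
func ptr[T any](v T) *T { return &v }

// validateRecursiveReadOnly applies rules 1 and 2.2-2.5 from the comment.
func validateRecursiveReadOnly(m VolumeMount) error {
	if m.RecursiveReadOnly == nil {
		return nil // unspecified is always acceptable
	}
	if !m.ReadOnly {
		// Rule 1: must be unspecified when ReadOnly is false.
		return errors.New("recursiveReadOnly must be unspecified when readOnly is false")
	}
	switch *m.RecursiveReadOnly {
	case RecursiveReadOnlyDisabled:
		return nil
	case RecursiveReadOnlyIfPossible, RecursiveReadOnlyEnabled:
		// Rules 2.3 and 2.4: propagation must be None or unspecified.
		if m.MountPropagation != nil && *m.MountPropagation != MountPropagationNone {
			return errors.New("mountPropagation must be None or unspecified")
		}
		return nil
	default:
		// Rule 2.5: unknown values are rejected.
		return fmt.Errorf("unknown recursiveReadOnly value %q", *m.RecursiveReadOnly)
	}
}

func main() {
	mounts := []VolumeMount{
		{Name: "ok", ReadOnly: true, RecursiveReadOnly: ptr(RecursiveReadOnlyIfPossible)},
		{Name: "rw-with-rro", ReadOnly: false, RecursiveReadOnly: ptr(RecursiveReadOnlyEnabled)},
		{Name: "bad-propagation", ReadOnly: true, RecursiveReadOnly: ptr(RecursiveReadOnlyEnabled),
			MountPropagation: ptr(MountPropagationMode("HostToContainer"))},
	}
	for _, m := range mounts {
		fmt.Printf("%s: %v\n", m.Name, validateRecursiveReadOnly(m))
	}
}
```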

CRI API

Add bool recursive_read_only to the Mount message. CRI implementations will also expose the availability of the feature via the RuntimeHandlerFeatures message.

As kubelet can inspect the availability of the feature via the RuntimeHandlerFeatures message, there is no concept of "IfPossible" in the CRI API; kubelet translates an "IfPossible" value in the Core API into true or false in the CRI API.

The RuntimeHandlerFeatures message is also propagated to the NodeStatus struct of the Core API, as shown in the Core API diff above.

Diff:

diff --git a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
index e16688d8386..194d591c27f 100644
--- a/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
+++ b/staging/src/k8s.io/cri-api/pkg/apis/runtime/v1/api.proto
@@ -235,6 +235,15 @@ message Mount {
     repeated IDMapping uidMappings = 6;
     // GidMappings specifies the runtime GID mappings for the mount.
     repeated IDMapping gidMappings = 7;
+    // If set to true, the mount is made recursive read-only.
+    // In this CRI API, recursive_read_only is a plain true/false boolean, although its equivalent
+    // in the Kubernetes core API is a quaternary that can be nil, "Enabled", "IfPossible", or "Disabled".
+    // kubelet translates that quaternary value in the core API into a boolean in this CRI API.
+    // Remarks:
+    // - nil is just treated as false
+    // - when set to true, readonly must be explicitly set to true, and propagation must be PRIVATE (0).
+    // - (readonly == false && recursive_read_only == false) does not make the mount read-only.
+    bool recursive_read_only = 8;
 }
 
 // IDMapping describes host to container ID mappings for a pod sandbox.
@@ -1524,6 +1533,22 @@ message StatusRequest {
     bool verbose = 1;
 }
 
+message RuntimeHandlerFeatures {
+    // recursive_read_only_mounts is set to true if the runtime handler supports
+    // recursive read-only mounts.
+    // For runc-compatible runtimes, availability of this feature can be detected by checking whether
+    // the Linux kernel version is >= 5.12, and,  `runc features | jq .mountOptions` contains "rro".
+    bool recursive_read_only_mounts = 1;
+}
+
+message RuntimeHandler {
+    // Name must be unique in StatusResponse.
+    // An empty string denotes the default handler.
+    string name = 1;
+    // Supported features.
+    RuntimeHandlerFeatures features = 2;
+}
+
 message StatusResponse {
     // Status of the Runtime.
     RuntimeStatus status = 1;
@@ -1532,6 +1557,8 @@ message StatusResponse {
     // debug, e.g. plugins used by the container runtime.
     // It should only be returned non-empty when Verbose is true.
     map<string, string> info = 2;
+    // Runtime handlers.
+    repeated RuntimeHandler runtime_handlers = 3;
 }
 
 message ImageFsInfoRequest {}
diff --git a/staging/src/k8s.io/cri-api/pkg/errors/errors.go b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
index a4538669122..c8e4a18dec5 100644
--- a/staging/src/k8s.io/cri-api/pkg/errors/errors.go
+++ b/staging/src/k8s.io/cri-api/pkg/errors/errors.go
@@ -29,6 +29,9 @@ var (
 
        // ErrSignatureValidationFailed - Unable to validate the image signature on the PullImage RPC call.
        ErrSignatureValidationFailed = errors.New("SignatureValidationFailed")
+
+       // ErrRROUnsupported - Unable to enforce recursive readonly mounts
+       ErrRROUnsupported = errors.New("RROUnsupported")
 )
 
 // IsNotFound returns a boolean indicating whether the error
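
The detection rule stated in the RuntimeHandlerFeatures comment (kernel >= 5.12 and "rro" listed in the mountOptions of `runc features`) might be sketched as below. This is not taken from any CRI runtime; kernelAtLeast and runcSupportsRRO are hypothetical helpers, and the inputs in main are made-up samples rather than captured output.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether a kernel release string such as
// "5.15.0-91-generic" is at least major.minor.
func kernelAtLeast(release string, major, minor int) (bool, error) {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release %q", release)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	// The minor part may carry a suffix (e.g. "12-rc1"); keep leading digits.
	end := 0
	for end < len(parts[1]) && parts[1][end] >= '0' && parts[1][end] <= '9' {
		end++
	}
	mnr, err := strconv.Atoi(parts[1][:end])
	if err != nil {
		return false, err
	}
	return maj > major || (maj == major && mnr >= minor), nil
}

// runcSupportsRRO checks whether the JSON printed by `runc features`
// lists "rro" under mountOptions.
func runcSupportsRRO(featuresJSON []byte) (bool, error) {
	var features struct {
		MountOptions []string `json:"mountOptions"`
	}
	if err := json.Unmarshal(featuresJSON, &features); err != nil {
		return false, err
	}
	for _, opt := range features.MountOptions {
		if opt == "rro" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Sample inputs for illustration only.
	kernelOK, err := kernelAtLeast("5.15.0-91-generic", 5, 12)
	if err != nil {
		fmt.Println("kernel check failed:", err)
		return
	}
	runcOK, err := runcSupportsRRO([]byte(`{"mountOptions":["bind","ro","rro"]}`))
	if err != nil {
		fmt.Println("runc check failed:", err)
		return
	}
	fmt.Println("recursive_read_only_mounts:", kernelOK && runcOK)
}
```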

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

The existing tests will continue to pass. New tests have to be added to cover the proposed feature.

Unit tests
Integration tests

See e2e tests below.

e2e tests
  • run a pod in each RecursiveReadOnly mode and verify that the status comes back correctly
  • run RecursiveReadOnly="Enabled" on a runtime that does not support it, and verify that the expected error is raised
  • run RecursiveReadOnly="Enabled", and verify that the mount is actually recursively read-only
  • run RecursiveReadOnly="Disabled", and verify that the mount is actually not recursively read-only

Tests are implemented in https://github.com/kubernetes/kubernetes/blob/v1.30.0/test/e2e_node/mount_rro_linux_test.go, and will be executed on the CI when the CI is upgraded to use containerd v2.0. So, there is no link to the testgrid yet.
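
One way to check from inside a container that a mount is actually recursively read-only is to inspect /proc/self/mountinfo. The sketch below is not the linked e2e test; writableMountsUnder is a hypothetical helper that lists every mount at or below a path whose per-mount options lack "ro". It assumes mount points without escaped characters (mountinfo encodes spaces as \040).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// writableMountsUnder parses mountinfo content and returns the mount points at
// or below root whose per-mount options (6th field) do not include "ro".
func writableMountsUnder(mountinfo, root string) []string {
	trimmed := strings.TrimSuffix(root, "/")
	var writable []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 6 {
			continue // not a well-formed mountinfo line
		}
		mountPoint, opts := fields[4], fields[5]
		if mountPoint != trimmed && !strings.HasPrefix(mountPoint, trimmed+"/") {
			continue // outside the subtree of interest
		}
		readOnly := false
		for _, o := range strings.Split(opts, ",") {
			if o == "ro" {
				readOnly = true
				break
			}
		}
		if !readOnly {
			writable = append(writable, mountPoint)
		}
	}
	return writable
}

func main() {
	root := "/"
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, mp := range writableMountsUnder(string(data), root) {
		fmt.Println("writable mount:", mp)
	}
}
```

With RecursiveReadOnly="Enabled" the helper would be expected to return nothing for the volume's mount path, whereas with "Disabled" a writable submount such as /mnt/usbstorage would show up.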

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Unit tests and CRI tests will pass

Beta

GA

  • Two beta releases of Kubernetes at least
  • containerd, CRI-O, and cri-dockerd support the feature in their GA releases

Upgrade / Downgrade Strategy

Upgrade: No action is needed. Existing readonly mounts will remain non-recursively readonly.

Downgrade:

  • On downgrading kube-apiserver, the []volumeMounts.recursiveReadOnly property will be lost and will not be propagated to kubelet. If the mode was set to non-Disabled, this will result in producing writable mounts. It is the user's responsibility to use the correct version of kube-apiserver when they need non-Disabled mode.

  • On downgrading kubelet, the []volumeMounts.recursiveReadOnly properties will be lost, and the []containerStatuses.[]volumeMount.recursiveReadOnly status will not be updated. It is the user's responsibility to use the correct version of kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly.

  • On downgrading the CRI or OCI runtime, if the RecursiveReadOnly mode is set to Enabled, kubelet will raise an error. IfPossible will be just treated as Disabled.

Version Skew Strategy

  • It is the user's responsibility to use the correct version of kube-apiserver when they need non-Disabled mode. Otherwise the mode will not be propagated to kubelet.

  • It is the user's responsibility to use the correct version of kube-apiserver and kubelet when they need to check []containerStatuses.[]volumeMount.recursiveReadOnly. Otherwise the property may have an inconsistent value.

  • CRI and OCI runtimes have to be updated before kubelet; otherwise kubelet will not know whether they support the feature, and it will assume that they do not.

  • If only some of the nodes support the feature, Disabled and IfPossible will continue to work on all the nodes, but Enabled will fail on a node that does not support the feature. kube-scheduler does not take this into account, so it is the user's responsibility to set nodeSelector, nodeAffinity, etc. to avoid scheduling a pod with Enabled to a node that does not support the feature.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate (also fill in values in kep.yaml)
    • Feature gate name: RecursiveReadOnlyMounts
    • Components depending on the feature gate: kube-apiserver,kubelet
Does enabling the feature change any default behavior?

No

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, by unsetting RecursiveReadOnly=Enabled.

Components can be downgraded too, but it should be noted that VolumeMountStatus may still see an inconsistent state when kubelet was downgraded. The pod manifest has to be recreated to get a consistent state in this case.

What happens if we reenable the feature if it was previously rolled back?

It works the same as a fresh rollout, as long as the user has recreated the pod manifests. (See the "Can the feature be disabled once ..." section above.)

Are there any tests for feature enablement/disablement?

Unit tests will run with and without the feature gate.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout may fail when at least one of the following components is too old:

| Component      | recursiveReadOnly value that will cause an error |
|----------------|--------------------------------------------------|
| kube-apiserver | any value                                        |
| kubelet        | any value                                        |
| CRI runtime    | Enabled                                          |
| OCI runtime    | Enabled                                          |
| kernel         | Enabled                                          |

For example, an error will be returned like this if kube-apiserver is too old:

$ kubectl apply -f rro.yaml
Error from server (BadRequest): error when creating "rro.yaml": Pod in version "v1" cannot be handled as a Pod:
strict decoding error: unknown field "spec.containers[0].volumeMounts[0].recursiveReadOnly"

No impact on already running workloads.

What specific metrics should inform a rollback?

Look for an event indicating that RRO is not supported by the runtime.

$ kubectl get events -o json -w
...
{
    ...
    "kind": "Event",
    "message": "Error: RRONotSupported",
    ...
}
...
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

During the beta phase, the following test will be manually performed:

  • Enable the RecursiveReadOnlyMounts feature gate for kube-apiserver and kubelet.
  • Create a pod with recursiveReadOnly specified.
  • Disable the RecursiveReadOnlyMounts feature gate for kube-apiserver, and confirm that the pod gets rejected.
  • Enable the RecursiveReadOnlyMounts feature gate again, and confirm that the pod gets scheduled again.
  • Do the same for kubelet too.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

The feature is in use if the following command prints a non-zero number:

kubectl get pods -A -o json | jq '[.items[].spec.containers[].volumeMounts[]? | select(.recursiveReadOnly)] | length'
How can someone using this feature know that it is working for their instance?
  • API .status
    • Condition name: volumeMountStatus.recursiveReadOnly
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
  • recursiveReadOnly=Enabled: 100% of pods that were scheduled into a node must run with recursive read-only mounts, or, 100% of them must fail to run.

  • recursiveReadOnly=IfPossible: 100% of pods that were scheduled into a node must run with or without recursive read-only mounts

  • recursiveReadOnly=Disabled, or unset: 100% of pods that were scheduled into a node must run without recursive read-only mounts

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: Event
    • [Optional] Aggregation method: kubectl get events -o json -w
    • Components exposing the metric: kubelet -> kube-apiserver

If recursiveReadOnly is set to Enabled but it is not supported, kubelet will raise an event like this:

$ kubectl get events -o json -w
...
{
    ...
    "kind": "Event",
    "message": "Error: RRONotSupported",
    ...
}
...

If the OCI runtime claims that it supports recursive read-only mounts but actually fails to mount them, the pod will enter CrashLoopBackOff. The error from the OCI runtime can be inspected by running:

kubectl get pod -o json foo | jq .status.containerStatuses[0].lastState.terminated.message
Are there any missing metrics that would be useful to have to improve observability of this feature?

Potentially, kube-scheduler could be implemented to avoid scheduling a pod with recursiveReadOnly: Enabled to a node running an old kernel.

In this way, the Event metric described above would not happen, and users would instead see Pending pods as an error metric.

However, this is not planned to be implemented in kube-scheduler, as it seems to be over-engineering. Users may use nodeSelector, nodeAffinity, etc. to work around this.

Dependencies

Does this feature depend on any specific services running in the cluster?

Specific version of CRI, OCI, and Linux kernel

Scalability

A pod with recursiveReadOnly: Enabled may be rejected by kubelet with probability $$B/A$$, where $$A$$ is the number of nodes that may potentially accept the pod, and $$B$$ is the number of those nodes that do not support RRO. This may affect scalability.

To evaluate this risk, users may run kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]' to see how many nodes support RecursiveReadOnlyMounts: true.
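
The ratio B/A can also be computed from the node list directly. The following Go sketch assumes the node JSON exposes status.runtimeClasses[].features.recursiveReadOnlyMounts as in the Core API diff of this proposal, and it treats a node as supporting RRO only when its default runtime class (empty name) reports the feature. Both are assumptions, the helper names are hypothetical, and the sample document is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type runtimeClass struct {
	Name     string `json:"name"`
	Features *struct {
		RecursiveReadOnlyMounts *bool `json:"recursiveReadOnlyMounts"`
	} `json:"features"`
}

type node struct {
	Status struct {
		RuntimeClasses []runtimeClass `json:"runtimeClasses"`
	} `json:"status"`
}

type nodeList struct {
	Items []node `json:"items"`
}

// nodeSupportsRRO looks at the default runtime class (empty name) only.
func nodeSupportsRRO(n node) bool {
	for _, rc := range n.Status.RuntimeClasses {
		if rc.Name == "" && rc.Features != nil && rc.Features.RecursiveReadOnlyMounts != nil {
			return *rc.Features.RecursiveReadOnlyMounts
		}
	}
	return false
}

// rejectionProbability returns A (all nodes), B (nodes without RRO support)
// and the ratio B/A for a node list in `kubectl get nodes -o json` form.
func rejectionProbability(nodesJSON []byte) (a, b int, p float64, err error) {
	var list nodeList
	if err = json.Unmarshal(nodesJSON, &list); err != nil {
		return 0, 0, 0, err
	}
	a = len(list.Items)
	for _, n := range list.Items {
		if !nodeSupportsRRO(n) {
			b++
		}
	}
	if a > 0 {
		p = float64(b) / float64(a)
	}
	return a, b, p, nil
}

// sampleNodes is a made-up two-node cluster used for illustration.
const sampleNodes = `{"items":[
 {"status":{"runtimeClasses":[{"name":"","features":{"recursiveReadOnlyMounts":true}}]}},
 {"status":{"runtimeClasses":[{"name":"","features":{"recursiveReadOnlyMounts":false}}]}}
]}`

func main() {
	a, b, p, err := rejectionProbability([]byte(sampleNodes))
	if err != nil {
		fmt.Println("cannot parse node list:", err)
		return
	}
	fmt.Printf("%d of %d nodes lack RRO support (B/A = %.2f)\n", b, a, p)
}
```

This counts every node in the list as a candidate; a real estimate would first narrow A down to the nodes that match the pod's nodeSelector, affinity, and taints.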

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

About a dozen bytes.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

A pod cannot be created, just as with any other pod.

What are other known failure modes?

None

What steps should be taken if SLOs are not being met to determine the problem?
  • Make sure that the node is running Linux kernel v5.12 or later.
  • Make sure that runc features | jq .mountOptions contains "rro". Otherwise update runc.
  • Make sure that crictl info (with the latest crictl) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in the CRI runtime's log.
  • Make sure that kubectl get nodes -o json | jq '[.items[].status.runtimeClasses[].Features]' (with the latest kubectl and control planes) reports that RecursiveReadOnlyMounts is supported. Otherwise update the CRI runtime, and make sure that no relevant error is printed in kubelet's log.

Implementation History

  • v1.30: alpha
  • v1.31: beta

Drawbacks

See "Alternatives" below.

Alternatives

Plan B is to keep the Kubernetes Core API and the CRI API completely unmodified, and just let the CRI runtime treat "readonly" as "recursive readonly".

This would be much easier to implement and adopt; however, a small portion of users may find this to be a breaking change.

In fact, containerd once adopted Plan B (containerd/containerd#9713) in its main branch (not in any GA release), but it is now being reverted in favor of this KEP (containerd/containerd#9747).

Infrastructure Needed (Optional)

runc >= 1.1 && kernel >= 5.12