- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks [optional]
- Alternatives [optional]
- Infrastructure Needed [optional]
- References
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This proposal applies to the use of quotas for ephemeral-storage metrics gathering. Use of quotas for ephemeral-storage limit enforcement is a non-goal, but as the architecture and code will be very similar, there are comments interspersed related to enforcement. These comments will be italicized.
Local storage capacity isolation, aka ephemeral-storage, was introduced into Kubernetes via #361. It provides support for capacity isolation of shared storage between pods, such that a pod can be limited in its consumption of shared resources and can be evicted if its consumption of shared storage exceeds that limit. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
The current mechanism relies on periodically walking each ephemeral volume (emptydir, logdir, or container writable layer) and summing the space consumption. This method is slow, can be fooled, and has high latency (i.e., a pod could consume a lot of storage before the kubelet becomes aware of its overage and terminates it).
The mechanism proposed here utilizes filesystem project quotas to provide monitoring of resource consumption and optionally enforcement of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of monitoring and restricting filesystem consumption that can be applied to one or more directories.
A prototype is in progress; see kubernetes/kubernetes#66928.
Project quotas are a form of filesystem quota that apply to arbitrary groups of files, as opposed to file user or group ownership. They were first implemented in XFS, as described here: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html.
Project quotas for ext4fs were proposed in late 2014 and added to the Linux kernel in early 2016, with commit 391f2a16b74b95da2f05a607f53213fc8ed24b8e. They were designed to be compatible with XFS project quotas.
Each inode contains a 32-bit project ID, to which quotas (hard and soft limits for blocks and inodes) may optionally be applied. The total blocks and inodes for all files with a given project ID are maintained by the kernel. Project quotas can be managed from userspace by means of the `xfs_quota(8)` command in foreign filesystem (`-f`) mode; the traditional Linux quota tools do not manipulate project quotas. Programmatically, they are managed by the `quotactl(2)` system call, using in part the standard quota commands and in part the XFS quota commands; the man page implies incorrectly that the XFS quota commands apply only to XFS filesystems.

The project ID applied to a directory is inherited by files created under it. Files cannot be (hard) linked across directories with different project IDs. A file's project ID cannot be changed by a non-privileged user, but a privileged user may use the `xfs_io(8)` command to change the project ID of a file.
Filesystems using project quotas may be mounted with quotas either enforced or not; the non-enforcing mode tracks usage without enforcing it. A non-enforcing project quota may be implemented on a filesystem mounted with enforcing quotas by setting a quota too large to be hit. The maximum size that can be set varies with the filesystem; on a 64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for ext4fs.
Conventionally, project quota mappings are stored in `/etc/projects` and `/etc/projid`; these files exist for user convenience and do not have any direct importance to the kernel. `/etc/projects` contains a mapping from project ID to directory/file; this can be a one-to-many mapping (the same project ID can apply to multiple directories or files, but any given directory/file can be assigned only one project ID). `/etc/projid` contains a mapping from named projects to project IDs.
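For illustration, a hypothetical pair of entries might look like the following (the project name, ID, and path are made up; the code described later manipulates these files programmatically rather than by hand):

```sh
$ cat /etc/projid
kube-emptydir-example:1048577
$ cat /etc/projects
1048577:/var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
```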
This proposal utilizes hard project quotas for both monitoring and enforcement. Soft quotas are of no utility; they allow for temporary overage that, after a programmable period of time, is converted to the hard quota limit.
The mechanism presently used to monitor storage consumption involves the use of `du` and `find` to periodically gather information about storage and inode consumption of volumes. This mechanism suffers from a number of drawbacks:
- It is slow. If a volume contains a large number of files, walking the directory can take a significant amount of time. There has been at least one known report of nodes becoming not ready due to volume metrics: kubernetes/kubernetes#62917
- It is possible to conceal a file from the walker by creating it and removing it while holding an open file descriptor on it. POSIX behavior is to not remove the file until the last open file descriptor pointing to it is closed. This has legitimate uses; it ensures that a temporary file is deleted when the processes using it exit, and it minimizes the attack surface by not having a file that can be found by an attacker. The following pod does this; it will never be caught by the present mechanism:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: "diskhog"
spec:
  containers:
  - name: "perl"
    resources:
      limits:
        ephemeral-storage: "2048Ki"
    image: "perl"
    command:
    - perl
    - -e
    - >
      my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
    volumeMounts:
    - name: a
      mountPath: /data/a
  volumes:
  - name: a
    emptyDir: {}
```
- It is reactive rather than proactive. It does not prevent a pod from overshooting its limit; at best it catches it after the fact. On a fast storage medium, such as NVMe, a pod may write 50 GB or more of data before the housekeeping performed once per minute catches up to it. If the primary volume is the root partition, this will completely fill the partition, possibly causing serious problems elsewhere on the system. This proposal does not address this issue; a future enforcing project would.
In many environments, these issues may not matter, but shared multi-tenant environments need these issues addressed.
These goals apply only to local ephemeral storage, as described in #361.
- Primary: improve performance of monitoring by using project quotas in a non-enforcing way to collect information about storage utilization of ephemeral volumes.
- Primary: detect storage used by pods that is concealed by deleted files being held open.
- Primary: ensure that this does not interfere with the more common user and group quotas.
- Application to storage other than local ephemeral storage.
- Application to container copy on write layers. That will be managed by the container runtime. For a future project, we should work with the runtimes to use quotas for their monitoring.
- Elimination of eviction as a means of enforcing ephemeral-storage limits. Pods that hit their ephemeral-storage limit will still be evicted by the kubelet even if their storage has been capped by enforcing quotas.
- Enforcing node allocatable (a limit over the sum of all pods' disk usage, including, e.g., images).
- Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit.
- Enforcing limits on per-volume storage consumption by using enforced project quotas.
This proposal applies project quotas to emptydir volumes on qualifying filesystems (ext4fs and xfs with project quotas enabled). Project quotas are applied by selecting an unused project ID (a 32-bit unsigned integer), setting a limit on space and/or inode consumption, and attaching the ID to one or more files. By default (and as utilized herein), if a project ID is attached to a directory, it is inherited by any files created under that directory.
To use quotas to track a pod's resource usage, the pod must be in a user namespace. Within user namespaces, the kernel restricts changes to projectIDs on the filesystem, ensuring the reliability of storage metrics calculated by quotas.
If we elect to use the quota as enforcing, we impose a quota consistent with the desired limit. If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs).
For discussion of the low level implementation strategy, see below.
At present, three feature gates control operation of quotas:
- `LocalStorageCapacityIsolation` must be enabled for any use of quotas.
- `LocalStorageCapacityIsolationFSQuotaMonitoring` must be enabled in addition. If this is enabled, quotas are used for monitoring, but not enforcement. At present, this defaults to False, but the intention is that this will default to True by initial release.
- `UserNamespacesSupport` must be enabled, and the kernel, CRI implementation, and OCI runtime must support user namespaces.
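As a sketch, the gates could be enabled as follows (illustrative only; the gates can equally be set via the `featureGates` field of each component's configuration file, and other flags are omitted):

```sh
# API server: user namespace support
kube-apiserver --feature-gates=UserNamespacesSupport=true        # ...other flags omitted
# Kubelet: capacity isolation plus quota-based monitoring, with user namespaces
kubelet --feature-gates=LocalStorageCapacityIsolation=true,LocalStorageCapacityIsolationFSQuotaMonitoring=true,UserNamespacesSupport=true   # ...other flags omitted
```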
- Caller (emptydir volume manager or container runtime) creates an emptydir volume, with an empty directory at a location of its choice.
- Caller requests that a quota be applied to a directory.
- Determine whether a quota can be imposed on the directory, by asking each quota provider (one per filesystem type) whether it can apply a quota to the directory. If no provider claims the directory, an error status is returned to the caller.
- Select an unused project ID (see below).
- Set the desired limit on the project ID, in a filesystem-dependent manner (see below).
- Apply the project ID to the directory in question, in a filesystem-dependent manner.
An error at any point results in no quota being applied and no change to the state of the system. The caller in general should not assume a priori that the attempt will be successful. It could choose to reject a request if a quota cannot be applied, but at this time it will simply ignore the error and proceed as today.
- Caller (kubelet metrics code, cadvisor, container runtime) asks the quota code to compute the amount of storage used under the directory.
- Determine whether a quota applies to the directory, in a filesystem-dependent manner (see below).
- If so, determine how much storage or how many inodes are utilized, in a filesystem dependent manner.
If the quota code is unable to retrieve the consumption, it returns an error status and it is up to the caller to utilize a fallback mechanism (such as the directory walk performed today).
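A minimal shell sketch of this fallback, assuming the `xfs_quota` invocation described later in this document (the mounts file, mountpoint, project ID, and directory are placeholders):

```sh
ID=1048577
DIR=/var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
# Ask the kernel for the blocks charged to the project ID...
xfs_quota -t /tmp/kubelet-mounts -P/dev/null -D/dev/null -f /var/lib/kubelet \
  -c "quota -p -N -n -v -b $ID" \
  || du -s -B1 "$DIR"   # ...and fall back to a directory walk if that fails
```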
- Caller requests that the quota be removed from a directory.
- Determine whether a project quota applies to the directory.
- Remove the limit from the project ID associated with the directory.
- Remove the association between the directory and the project ID.
- Return the project ID to the system to allow its use elsewhere (see below).
- Caller may delete the directory and its contents (normally it will).
Project IDs are a shared space within a filesystem. If the same project ID is assigned to multiple directories, the space consumption reported by the quota will be the sum of that of all of the directories. Hence, it is important to ensure that each directory is assigned a unique project ID (unless it is desired to pool the storage use of multiple directories).
The canonical mechanism to record persistently that a project ID is reserved is to store it in the `/etc/projid` (`projid(5)`) and/or `/etc/projects` (`projects(5)`) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience, but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a project ID is in active use on a given filesystem:
- The quota values (in blocks and/or inodes) assigned to the project ID are non-zero.
- The storage consumption (in blocks and/or inodes) reported under the project ID is non-zero.
The algorithm to be used is as follows:
- Lock this instance of the quota code against re-entrancy.
- Open and `flock()` the `/etc/projects` and `/etc/projid` files, so that other uses of this code are excluded.
- Start from a high number (the prototype uses 1048577).
- Iterate from there, performing the following tests:
  - Is the ID reserved by this instance of the quota code?
  - Is the ID present in `/etc/projects`?
  - Is the ID present in `/etc/projid`?
  - Are the quota values and/or consumption reported by the kernel non-zero? This test is restricted to 128 iterations to ensure that a bug here or elsewhere does not result in an infinite loop looking for a quota ID.
- If an ID has been found:
  - Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so that any other uses of project quotas do not reuse it.
  - Write temporary copies of `/etc/projects` and `/etc/projid` that are `flock()`ed.
  - If successful, rename the temporary files appropriately (if rename of one succeeds but the other fails, we have a problem that we cannot recover from, and the files may be inconsistent).
- Unlock `/etc/projid` and `/etc/projects`.
- Unlock this instance of the quota code.
A minor variation of this is used if we want to reuse an existing quota ID.
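The per-ID "in use" tests above can be approximated from the command line as follows (a hedged sketch; the kubelet performs the equivalent checks in-process, and the candidate ID, mounts file, and mountpoint are placeholders):

```sh
ID=1048577
grep -q "^${ID}:" /etc/projects      # referenced in /etc/projects?
grep -q ":${ID}\$" /etc/projid       # referenced in /etc/projid?
# Non-zero limits or consumption reported by the kernel also mark the ID as in use:
xfs_quota -t /tmp/kubelet-mounts -P/dev/null -D/dev/null -f /var/lib/kubelet \
  -c "quota -p -N -n -v -b ${ID}"
```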
It is possible to determine whether a directory has a project ID applied to it by requesting (via the `lsattr(1)` command) the project ID associated with the directory. While the specifics are filesystem-dependent, the basic method is the same for at least XFS and ext4fs.

It is not possible to determine in a constant number of operations the directory or directories to which a project ID is applied. It is possible to determine whether a given project ID has been applied to some existing directory or files (although those will not be known); the reported consumption will be non-zero.

The code records internally the project ID applied to a directory, but it cannot always rely on this. In particular, if the kubelet has exited and been restarted (and hence the quota applying to the directory should be removed), the map from directory to project ID is lost. If it cannot find a map entry, it falls back on the approach discussed above.
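For example, the project ID on a volume directory might be inspected as follows (illustrative output; the exact formatting of the attribute flags varies by e2fsprogs version and filesystem):

```sh
$ lsattr -pd /var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
1048577 --------------------- /var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
```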
The algorithm used to return a project ID to the system is very similar to the algorithm used to select one, except, of course, that no new project ID is selected. It performs the same sequence of locking `/etc/projects` and `/etc/projid`, editing a copy of each file, and restoring it.

If the project ID is applied to multiple directories and the code can determine that, it will not remove the project ID from `/etc/projid` until the last reference is removed. While it is not anticipated in this KEP that this mode of operation will be used, at least initially, this can be detected even on kubelet restart by looking at the reference count in `/etc/projects`.
The initial implementation will be done by shelling out to the `xfs_quota(8)` and `lsattr(1)` commands to manipulate quotas. It is possible to use the `quotactl(2)` and `ioctl(2)` system calls to do this, but the `quotactl(2)` system call is not supported by Go at all, and the `ioctl(2)` subcommands required for manipulating quotas are only partially supported. Use of these system calls would require either use of cgo, which is not supported in Kubernetes, or copying the necessary structure definitions and constants into the source.
The use of Linux commands rather than system calls likely poses some efficiency issues, but these commands are only issued when ephemeral volumes are created and destroyed, or during monitoring, when they are called once per minute. At present, monitoring is done by shelling out to the `du(1)` and `find(1)` commands, which are much less efficient, as they perform filesystem scans. The performance of these commands must be measured under load prior to making this feature used by default.
All `xfs_quota(8)` commands are invoked as

`xfs_quota -t <tmp_mounts_file> -P/dev/null -D/dev/null -f <mountpoint> -c <command>`

The `-P` and `-D` arguments tell `xfs_quota` not to read the usual `/etc/projid` and `/etc/projects` files, and to use the empty special file `/dev/null` as a stand-in. `xfs_quota` reads the projid and projects files and attempts to access every listed mountpoint for every command. As we use numeric project IDs for all purposes, it is not necessary to incur this overhead.
The `-t` argument mitigates a hazard of `xfs_quota(8)` in the presence of stuck NFS (or similar) mounts. By default, `xfs_quota(8)` `stat(2)`s every mountpoint on the system, which could hang on a stuck mount. The `-t` option is used to pass in a temporary mounts file containing only the filesystem we care about.
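Such a temporary mounts file is simply an mtab/fstab-style listing restricted to the filesystem of interest; a hypothetical example (the device name and mountpoint are made up):

```sh
$ cat /tmp/kubelet-mounts
/dev/nvme0n1p5 /var/lib/kubelet xfs rw,prjquota 0 0
```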
The following operations are performed, with the listed commands (using `xfs_quota(8)` except as noted):

- Determine whether quotas are enabled on the specified filesystem: `state -p`
- Apply a limit to a project ID (note that at present the largest possible limit is applied, allowing the quota system to be used for monitoring only): `limit -p bhard=<blocks> bsoft=<blocks> <projectID>`
- Apply a project ID to a directory, enabling a quota on that directory: `project -s -p <directory> <projectID>`
- Retrieve the number of blocks used by a given project ID: `quota -p -N -n -v -b <projectID>`
- Retrieve the number of inodes used by a given project ID: `quota -p -N -n -v -i <projectID>`
- Determine whether a specified directory has a quota ID applied to it, and if so, what that ID is (using `lsattr(1)`): `lsattr -pd <path>`
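Putting the pieces together, a hedged end-to-end illustration against a single emptydir directory might look like the following (mountpoint, mounts file, project ID, limit value, and paths are all placeholders; the kubelet issues the equivalent commands itself):

```sh
MNT=/var/lib/kubelet
MTAB=/tmp/kubelet-mounts
ID=1048577
DIR=$MNT/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch

xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "state -p"                        # quotas enabled?
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "limit -p bhard=1g bsoft=1g $ID"  # set the limit
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "project -s -p $DIR $ID"          # attach the ID to the directory
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "quota -p -N -n -v -b $ID"        # blocks used
lsattr -pd $DIR                                                                         # which ID is applied?
```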
In the long run, we should work to add the necessary constructs to the Go language, allowing direct use of the necessary system calls without the use of cgo.
The primary new interface defined is the quota interface in `pkg/volume/util/quota/quota.go`. This defines five operations:

- Does the specified directory support quotas?
- Assign a quota to a directory. If a non-empty pod UID is provided, the quota assigned is that of any other directories under this pod UID; if an empty pod UID is provided, a unique quota is assigned.
- Retrieve the consumption of the specified directory. If the quota code cannot handle it efficiently, it returns an error and the caller falls back on the existing mechanism.
- Retrieve the inode consumption of the specified directory; same description as above.
- Remove the quota from a directory. If a non-empty pod UID is passed, it is checked against that recorded in memory (if any). The quota is removed from the specified directory. This can be used even if AssignQuota has not been used; it inspects the directory and removes the quota from it. This permits stale quotas from an interrupted kubelet to be cleaned up.

Two implementations are provided: `quota_linux.go` (for Linux) and `quota_unsupported.go` (for other operating systems). The latter returns an error for all requests.
As the quota mechanism is intended to support multiple filesystems, and different filesystems require different low-level code for manipulating quotas, a provider is supplied that finds an appropriate quota applier implementation for the filesystem in question. The low-level quota applier provides similar operations to the top-level quota code, with two exceptions:

- No operation exists to determine whether a quota can be applied (that is handled by the provider).
- An additional operation is provided to determine whether a given quota ID is in use within the filesystem (outside of `/etc/projects` and `/etc/projid`).

The two quota providers in the initial implementation are in `pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While some quota operations do require different system calls, a lot of the code is common, and is factored into `pkg/volume/util/quota/common/quota_linux_common_impl.go`.
The prototype for this project is mostly self-contained within `pkg/volume/util/quota`, plus a few changes to `pkg/volume/empty_dir/empty_dir.go`. However, a few changes were required elsewhere:

- The operation executor needs to pass the desired size limit to the volume plugin where appropriate so that the volume plugin can impose a quota. The limit is passed as 0 (do not use quotas), a positive number (impose an enforcing quota if possible, measured in bytes), or -1 (impose a non-enforcing quota, if possible) on the volume. This requires changes to `pkg/volume/util/operationexecutor/operation_executor.go` (to add `DesiredSizeLimit` to `VolumeToMount`), `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and `pkg/kubelet/eviction/helpers.go` (the latter in order to determine whether the volume is a local ephemeral one).
- The volume manager (in `pkg/volume/volume.go`) changes the `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to allow passing the desired size and pod UID (in the event we choose to implement quotas shared between multiple volumes; see below). This required small changes to all volume plugins and their tests, but will in the future allow adding additional data without having to change code other than that which uses the new information.
- Using user namespaces prevents users from changing project IDs on the filesystem, which is essential for preserving the reliability of xfs-quota metrics under user namespaces.
- When LocalStorageCapacityIsolationFSQuotaMonitoring is enabled, the kubelet utilizes xfs-quota to monitor disk usage. With user namespaces enforced, any changes that might manipulate project IDs are inherently restricted, avoiding inaccuracies in the collected metrics.
- During setup of the volumeToMount object, the hostUsersEnabled field is added, which indicates whether user namespaces are in effect. This field is critical for determining the environment in which the volume operates.
- When mounting the filesystem, before the assignQuota method is invoked, a check for SupportQuota is conducted. This check should now also verify the hostUsersEnabled status to ensure that quotas are only assigned when operating within a valid user namespace context.
- The kubelet's behavior must be updated to accommodate these changes. If the LocalStorageCapacityIsolationFSQuotaMonitoring flag is enabled and hostUsersEnabled is false (i.e., user namespaces are being used), the kubelet should proceed with xfs-quota based monitoring. If user namespaces are not enabled, the kubelet should revert to using the traditional methods for disk monitoring, such as du and find.
- The SIG raised the possibility of a container being unable to exit should we enforce quotas, and the quota interferes with writing the log. This can be mitigated either by not applying a quota to the log directory and using the du mechanism, or by applying a separate non-enforcing quota to the log directory.

  As log directories are write-only by the container, and consumption can be limited by other means (as the log is filtered by the runtime), I do not consider the ability to write uncapped to the log to be a serious exposure.

  Note in addition that even without quotas it is possible for writes to fail due to lack of filesystem space, which is effectively (and in some cases operationally) indistinguishable from exceeding quota, so even at present code must be able to handle those situations.
- Filesystem quotas may impact performance to an unknown degree. Information on that is hard to come by in general, and one of the reasons for using quotas is indeed to improve performance. If this is a problem in the field, merely turning off quotas (or selectively disabling project quotas) on the filesystem in question will avoid the problem. Against the possibility that that cannot be done (because project quotas are needed for other purposes), we should provide a way to disable use of quotas altogether via a feature gate.

  A report (https://blog.pythonanywhere.com/110/) notes that an unclean shutdown on Linux kernel versions between 3.11 and 3.17 can result in prolonged downtime while quota information is restored. Unfortunately, the link referenced there is no longer available.
- Bugs in the quota code could result in a variety of regression behavior. For example, if a quota is incorrectly applied, it could result in the inability to write any data at all to the volume. This could be mitigated by use of non-enforcing quotas. XFS in particular offers the `pqnoenforce` mount option, which makes all quotas non-enforcing.
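For instance, a filesystem could be mounted with accounting-only project quotas roughly as follows (the device and mountpoint are illustrative):

```sh
mount -o pqnoenforce /dev/nvme0n1p5 /var/lib/kubelet
```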
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
The quota code is by and large not very amenable to unit tests. While there are simple unit tests for parsing the mounts file, and there could be tests for parsing the projects and projid files, the real work (and risk) involves interactions with the kernel and with multiple instances of this code (e.g., in the kubelet and the runtime manager, particularly under stress). It also requires setup in the form of a prepared filesystem. It would be better served by appropriate end-to-end tests.
The main unit tests are in the package under `pkg/volume/util/fsquota/`.

Coverage for `pkg/volume/util/fsquota/` as of 2024-06-12: 73.9%
- project.go: 75.7%
- quota.go: 100%
- quota_linux.go: 72.2%
See details in https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&include-filter-by-regex=fsquota.
N/A
The node e2e test (LocalStorageCapacityIsolationQuotaMonitoring [Slow] [Serial] [Disruptive] [Feature:LocalStorageCapacityIsolationQuota][NodeFeature:LSCIQuotaMonitoring]) can be found in `test/e2e_node/quota_lsci_test.go`.
The e2e tests are slow and serial, so we will not promote them to conformance tests. There is no failure history or flakes in https://storage.googleapis.com/k8s-triage/index.html?test=LocalStorageCapacityIsolationQuotaMonitoring
I performed a microbenchmark consisting of various operations on a directory containing 4096 subdirectories each containing 2048 1Kbyte files. The operations performed were as follows, in sequence:
- Create Files: create 4K directories each containing 2K files as described, in depth-first order.
- du: run `du` immediately after creating the files.
- quota: where applicable, run `xfs_quota` immediately after `du`.
- du (repeat): repeat the `du` invocation.
- quota (repeat): repeat the `xfs_quota` invocation.
- du (after remount): run `mount -o remount <filesystem>` immediately followed by `du`.
- quota (after remount): run `mount -o remount <filesystem>` immediately followed by `xfs_quota`.
- unmount: `umount` the filesystem.
- mount: `mount` the filesystem.
- quota after umount/mount: run `xfs_quota` after unmounting and mounting the filesystem.
- du after umount/mount: run `du` after unmounting and mounting the filesystem.
- Remove Files: remove the test files.
The test was performed on four separate filesystems:
- XFS filesystem, with quotas enabled (256 GiB, 128 Mi inodes)
- XFS filesystem, with quotas disabled (64 GiB, 32 Mi inodes)
- ext4fs filesystem, with quotas enabled (250 GiB, 16 Mi inodes)
- ext4fs filesystem, with quotas disabled (60 GiB, 16 Mi inodes)
Other notes:
- All filesystems reside on an otherwise idle HP EX920 1TB NVMe on a Lenovo ThinkPad P50 with 64 GB RAM and Intel Core i7-6820HQ CPU (4 cores/8 threads total) running Fedora 30 (5.1.5-300.fc30.x86_64)
- Five runs were conducted with each combination; the median value is used. All times are in seconds.
- Space consumption was calculated with the filesystem hot (after files were created), warm (after `mount -o remount`), and cold (after umount/mount of the filesystem).
- Note that in all cases xfs_quota consumed zero time as reported by `/usr/bin/time`.
- User and system time are not available for file creation.
- All calls to `xfs_quota` and `mount` consumed less than 0.01 seconds of elapsed, user, and CPU time and are not reported here.
- Removing files was consistently faster with quotas disabled than with quotas enabled. With ext4fs, du was faster with quotas disabled than with quotas enabled. With XFS, creating files may have been faster with quotas disabled than with quotas enabled, but the difference was small. In other cases, the difference was within noise.
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
Create Files | 435.0 | 419.0 | 348.0 | 343.0 |
du | 12.1 | 12.6 | 14.3 | 14.3 |
du (repeat) | 12.0 | 12.1 | 14.1 | 14.0 |
du (after remount) | 23.2 | 23.2 | 39.0 | 24.6 |
unmount | 12.2 | 12.1 | 9.8 | 9.8 |
du after umount/mount | 103.6 | 138.8 | 40.2 | 38.8 |
Remove Files | 196.0 | 159.8 | 105.2 | 90.4 |
All calls to `umount` consumed less than 0.01 second of user CPU time and are not reported here.
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
du | 3.7 | 3.7 | 3.7 | 3.7 |
du (repeat) | 3.7 | 3.7 | 3.7 | 3.8 |
du (after remount) | 3.3 | 3.3 | 3.7 | 3.6 |
du after umount/mount | 8.1 | 10.2 | 3.9 | 3.7 |
Remove Files | 4.3 | 4.1 | 4.2 | 4.3 |
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
du | 8.3 | 8.6 | 10.5 | 10.5 |
du (repeat) | 8.3 | 8.4 | 10.4 | 10.4 |
du (after remount) | 19.8 | 19.8 | 28.8 | 20.9 |
unmount | 10.2 | 10.1 | 8.1 | 8.1 |
du after umount/mount | 66.0 | 82.4 | 29.2 | 28.1 |
Remove Files | 188.6 | 156.6 | 90.4 | 81.8 |
The following criteria apply to `LocalStorageCapacityIsolationFSQuotaMonitoring`:
- Support integrated in kubelet
- Alpha-level documentation
- Unit test coverage
- Node e2e test
- User feedback
- Benchmarks to determine latency and overhead of using quotas relative to existing monitoring solution
- Cleanup
- Use Ephemeral-Storage-Quotas in User Namespace
- TBD
Turn off the feature gate to turn off the feature.
- If the API server is on the latest version with the user namespace feature flag enabled, and the kubelet also has the user namespace feature along with the LocalStorageCapacityIsolationFSQuotaMonitoring feature flag enabled, Pods with hostUsers set to false will have XFS quotas within user namespaces enabled.
- If any of the necessary feature flags (user namespace on either the API server or kubelet, or LocalStorageCapacityIsolationFSQuotaMonitoring on the kubelet) are not enabled, then XFS quotas within user namespaces will not be supported.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: LocalStorageCapacityIsolationFSQuotaMonitoring
  - Components depending on the feature gate: kubelet
- Feature gate
  - Feature gate name: UserNamespacesSupport
  - Components depending on the feature gate: kubelet, kube-apiserver
This feature uses project quotas to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy. This feature can be enabled only for pods running within a user namespace.
None. Behavior will not change. The change is in how volumes such as emptyDir ephemeral-storage volumes are monitored. When LocalStorageCapacityIsolation is enabled for local ephemeral storage, and the backing filesystem for emptyDir volumes supports project quotas and they are enabled, project quotas are used to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy.
Yes, but only for newly created pods.
- Existing Pods: if a pod was created with an enforcing quota, it will not use the enforcing quota after the feature gate is disabled.
- Newly Created Pods: after setting the feature gate to false, newly created pods will not use the enforcing quota.
As above, after we re-enable the feature, newly created pods will use it. If a pod was created before rolling back, the pod will benefit from this feature as well.
Yes, in test/e2e_node/quota_lsci_test.go
No. In the event of a rollback, the system would revert to the previous method of disk usage monitoring. This switch should not impact the operational state of already running workloads. Additionally, pods created while the feature was active (created with user namespaces) will have to be re-created to run without user namespaces. If those weren't recreated, they will continue to run in a user namespace.
`kubelet_volume_metric_collection_duration_seconds` was added in v1.24 to record the duration in seconds of volume stats calculation. This metric can help compare `fsquota` monitoring and `du` for disk usage.
Yes. I tested it locally and fixed a bug that appeared after restarting the kubelet.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
- In kubelet metrics, an operator can check the histogram metric `kubelet_volume_metric_collection_duration_seconds` with `metric_source` equal to "fsquota". If there is no `metric_source=fsquota`, this feature should be disabled.
- However, to figure out whether a workload is using this feature, there is no direct way at present; see below for how to check fsquota settings on a node.
  - `xfs_quota -x -c 'report -h' /path/to/directory` will provide information on project quota usage and limits.
  - When trying to change the project ID associated with the directory in a user namespace (`xfs_quota -x -c 'project -s -p newProjectID /path/to/directory' /`), the operation will be denied, indicating the restriction the user namespace imposes.
99.9% of volume stats calculations will take less than 1 s (or even 500 ms).
This can be calculated from the `kubelet_volume_metric_collection_duration_seconds` metric.
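A hedged PromQL sketch for checking this, assuming the standard histogram `_bucket` series and the `metric_source` label mentioned elsewhere in this questionnaire:

```sh
# 99.9th percentile of volume stats collection time, per metric source
QUERY='histogram_quantile(0.999,
  sum(rate(kubelet_volume_metric_collection_duration_seconds_bucket[5m])) by (le, metric_source))'
```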
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `kubelet_volume_metric_collection_duration_seconds`
  - Aggregation method: histogram
  - Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes; there are no histogram metrics for each individual volume. The above metric is grouped by volume type because exposing it for every volume would be too expensive. As a result, users cannot determine directly from the metrics whether the feature is used by a workload. A cluster admin can check the kubelet configuration on each node; if the feature gate is disabled, workloads on that node will not use it.
For example, run `xfs_quota -x -c 'report -h' /dev/sdc` to check quota settings on the device.
Check `spec.containers[].resources.limits.ephemeral-storage` of each container to compare.
Yes, the feature depends on project quotas. Once quotas are enabled, the xfs_quota tool can be used to set limits and report on disk usage.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No. It will use less CPU time and IO during ephemeral storage monitoring. The kubelet now allows use of XFS quotas (on XFS and suitably configured ext4fs filesystems) to monitor storage consumption for ephemeral storage (currently for emptydir volumes only). This method of monitoring consumption is faster and more accurate than the old method of walking the filesystem tree. It does not enforce limits, only monitors consumption.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Enabling XFS quotas within user namespaces is unlikely to result in resource exhaustion of node resources like PIDs or sockets. However, quota settings could potentially lead to inode exhaustion if limits are set too low for the workload.
No changes to current kubelet behaviors. The feature only uses kubelet-local information.
- If the ephemeral storage limit is reached, the pod will be evicted by the kubelet.
- Quota-based monitoring should be skipped when the node image is not configured correctly (unsupported filesystem or quotas not enabled).
- For "out of space" failures, kubelet eviction should be triggered.
If the metrics show problems, we can check the logs and the quota directories with the commands below.
- There will be warning logs (after the # is merged) if volume calculation takes longer than 1 second.
- If quota is enabled, you can find the volume information and the processing time with `time repquota -P /var/lib/kubelet -s -v`.
- `LocalStorageCapacityIsolationFSQuotaMonitoring` implemented at Alpha
- `kubelet_volume_metric_collection_duration_seconds` metric was added
- A bug where quotas did not work after the kubelet restarted was fixed
- `LocalStorageCapacityIsolationFSQuotaMonitoring` was promoted to Beta, but a regression caused it to be reverted to Alpha.
A ConfigMap rendering issue was found in the 1.25.0 release: when ConfigMaps are updated via the API, they do not get rendered to the resulting pod's filesystem by the kubelet. The feature was reverted to Alpha in the 1.25.1 release.
- Fix the blocking issue that caused the revert to alpha: kubernetes/kubernetes#112624 and kubernetes/kubernetes#115314.
- Add test in sig-node test grid for this feature https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-fsquota-ubuntu: kubernetes/test-infra#28616
- Promote `LocalStorageCapacityIsolationFSQuotaMonitoring` to Beta
- Use of quotas, particularly the less commonly used project quotas, requires additional action on the part of the administrator (see the sketch after this list). In particular:
  - ext4fs filesystems must be created with additional options that are not enabled by default: `mkfs.ext4 -O quota,project -E quotatype=usrquota:grpquota:prjquota <device>`
  - An additional option (`prjquota`) must be applied in `/etc/fstab`.
  - If the root filesystem is to be quota-enabled, it must be set in the grub options.
- Use of project quotas for this purpose will preclude future use within containers.
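A sketch of the administrator setup for an ext4 filesystem, combining the steps above (the device and mountpoint are placeholders; the mkfs options are those quoted in the drawback above):

```sh
mkfs.ext4 -O quota,project -E quotatype=usrquota:grpquota:prjquota /dev/sdc
# /etc/fstab entry adding the prjquota mount option:
#   /dev/sdc  /var/lib/kubelet  ext4  defaults,prjquota  0  2
```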
I have considered two classes of alternatives:

- Alternatives based on quotas, with a different implementation
- Alternatives based on loop filesystems without use of quotas
Within the basic framework of using quotas to monitor and potentially enforce storage utilization, there are a number of possible options:
- Utilize per-volume non-enforcing quotas to monitor storage (the first stage of this proposal).

  This mostly preserves the current behavior, but with more efficient determination of storage utilization and the possibility of building further on it. The one change from current behavior is the ability to detect space used by deleted files.

- Utilize per-volume enforcing quotas to monitor and enforce storage (the second stage of this proposal).

  This allows partial enforcement of storage limits. As local storage capacity isolation works at the level of the pod, and we have no control over user utilization of ephemeral volumes, we would have to give each volume a quota of the full limit. For example, if a pod had a limit of 1 MB but had four ephemeral volumes mounted, it would be possible for storage utilization to reach (at least temporarily) 4 MB before being capped.

- Utilize per-pod enforcing user or group quotas to enforce storage consumption, and per-volume non-enforcing quotas for monitoring.

  This would offer the best of both worlds: a fully capped storage limit combined with efficient reporting. However, it would require each pod to run under a distinct UID or GID. This may prevent pods from using setuid or setgid or their variants, and would interfere with any other use of group or user quotas within Kubernetes.

- Utilize per-pod enforcing quotas to monitor and enforce storage.

  This allows for full enforcement of storage limits, at the expense of being able to efficiently monitor per-volume storage consumption. As there have already been reports of monitoring causing trouble, I do not advise this option.

  A variant of this would report 1/N of the storage for each covered volume, so for a pod with a 4 MiB quota and 1 MiB total consumption spread across 4 ephemeral volumes, each volume would report a consumption of 256 KiB. Another variant would change the API to report statistics for all ephemeral volumes combined. I do not advise this option either.
Another way of isolating storage is to utilize filesystems of pre-determined size, using the loop filesystem facility within Linux. It is possible to create a file and run `mkfs(8)` on it, and then to mount that filesystem on the desired directory. This both limits the storage available within that directory and enables quick retrieval of usage via `statfs(2)`.

Cleanup of such a filesystem involves unmounting it and removing the backing file.

The backing file can be created as a sparse file, and the discard option can be used to return unused space to the system, allowing for thin provisioning.
I conducted preliminary investigations into this. While at first it appeared promising, it turned out to have multiple critical flaws:
- If the filesystem is mounted without the `discard` option, it can grow to the full size of the backing file, negating any possibility of thin provisioning. If the file is created dense in the first place, there is never any possibility of thin provisioning without use of `discard`.

  If the backing file is created densely, it additionally may require significant time to create if the ephemeral limit is large.

- If the filesystem is mounted `nosync`, and is sparse, it is possible for writes to succeed and then fail later with I/O errors when synced to the backing storage. This will lead to data corruption that cannot be detected at the time of write.

  This can easily be reproduced by, e.g., creating a 64 MB filesystem and within it creating a 128 MB sparse file and building a filesystem on it. When that filesystem is in turn mounted, writes to it will succeed, but I/O errors will be seen in the log and the file will be incomplete:
```
# mkdir /var/tmp/d1 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
# mkfs.ext4 /var/tmp/fs1
# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
# mkfs.ext4 /var/tmp/d1/fs2
# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
  ...will normally succeed...
# sync
  ...fails with I/O error!...
```
- If the filesystem is mounted `sync`, all writes to it are immediately committed to the backing store, and the `dd` operation above fails as soon as it fills up `/var/tmp/d1`. However, performance is drastically slowed, particularly with small writes; with 1K writes, I observed performance degradation in some cases exceeding three orders of magnitude.

  I performed a test comparing writing 64 MB to a base (partitioned) filesystem, to a loop filesystem without `sync`, and to a loop filesystem with `sync`. Total I/O was sufficient to run for at least 5 seconds in each case. All filesystems involved were XFS. Loop filesystems were 128 MB and dense. Times are in seconds. The erratic behavior (e.g. the 65536 case) was observed repeatedly, although the exact amount of time and which I/O sizes were affected varied. The underlying device was an HP EX920 1TB NVMe SSD.
I/O Size | Partition | Loop w/o sync | Loop w/sync |
---|---|---|---|
1024 | 0.104 | 0.120 | 140.390 |
4096 | 0.045 | 0.077 | 21.850 |
16384 | 0.045 | 0.067 | 5.550 |
65536 | 0.044 | 0.061 | 20.440 |
262144 | 0.043 | 0.087 | 0.545 |
1048576 | 0.043 | 0.055 | 7.490 |
4194304 | 0.043 | 0.053 | 0.587 |
The only potentially viable combination in my view would be a dense loop filesystem without sync, but that would render any thin provisioning impossible.
- Decision: who is responsible for quota management of all volume types (and especially ephemeral volumes of all types)? At present, emptydir volumes are managed by the kubelet, and logdirs and writable layers by either the kubelet or the runtime, depending upon the choice of runtime. Beyond the specific proposal that the runtime should manage quotas for volumes it creates, there are broader issues that I request assistance from the SIG in addressing.
- Location of the quota code. If the quotas for different volume types are to be managed by different components, each such component needs access to the quota code. The code is substantial and should not be copied; it would more appropriately be vendored.
The following is a list of known security issues referencing filesystem quotas on Linux, and other bugs referencing filesystem quotas in Linux since 2012. These bugs are not necessarily in the quota system.
- CVE-2012-2133: Use-after-free vulnerability in the Linux kernel before 3.3.6, when huge pages are enabled, allows local users to cause a denial of service (system crash) or possibly gain privileges by interacting with a hugetlbfs filesystem, as demonstrated by a umount operation that triggers improper handling of quota data.

  The issue is actually related to huge pages, not quotas specifically. The demonstration of the vulnerability resulted in incorrect handling of quota data.

- CVE-2012-3417: The good_client function in rquotad (rquota_svc.c) in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl function the first time without a host name, which might allow remote attackers to bypass TCP Wrappers rules in hosts.deny (related to rpc.rquotad).

  This issue is related to remote quota handling, which is not the use case for the proposal at hand.

- Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and Create Large Files

  A setuid root binary inheriting file descriptors from an unprivileged user process may write to the file without respecting quota limits. If this issue is still present, it would allow a setuid process to exceed any enforcing limits, but does not affect the quota accounting (use of quotas for monitoring).

- ext4: report delalloc reserve as non-free in statfs mangled by project quota

  This bug, fixed in Feb. 2018, properly accounts for reserved but not committed space in project quotas. At this point I have not determined the impact of this issue.

- XFS quota doesn't work after rebooting because of crash

  This bug resulted in XFS quotas not working after a crash or forced reboot. Under this proposal, Kubernetes would fall back to du for monitoring should a bug of this nature manifest itself again.

- quota can show incorrect filesystem name

  This issue, which will not be fixed, results in the quota command possibly printing an incorrect filesystem name when used on remote filesystems. It is a display issue with the quota command, not a quota bug at all, and does not result in incorrect quota information being reported. As this proposal does not utilize the quota command, rely on filesystem names, or currently use quotas on remote filesystems, it should not be affected by this bug.
In addition, the e2fsprogs have had numerous fixes over the years.