- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks [optional]
- Alternatives [optional]
- Infrastructure Needed [optional]
- References
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This proposal applies to the use of quotas for ephemeral-storage metrics gathering. Use of quotas for ephemeral-storage limit enforcement is a non-goal, but as the architecture and code will be very similar, there are comments interspersed related to enforcement. These comments will be italicized.
Local storage capacity isolation, aka ephemeral-storage, was introduced into Kubernetes via #361. It provides support for capacity isolation of shared storage between pods, such that a pod can be limited in its consumption of shared resources and can be evicted if its consumption of shared storage exceeds that limit. The limits and requests for shared ephemeral-storage are similar to those for memory and CPU consumption.
The current mechanism relies on periodically walking each ephemeral volume (emptydir, logdir, or container writable layer) and summing the space consumption. This method is slow, can be fooled, and has high latency (i.e., a pod could consume a lot of storage before the kubelet becomes aware of its overage and terminates it).
The mechanism proposed here utilizes filesystem project quotas to provide monitoring of resource consumption and optionally enforcement of limits. Project quotas, initially in XFS and more recently ported to ext4fs, offer a kernel-based means of monitoring and restricting filesystem consumption that can be applied to one or more directories.
A prototype is in progress; see kubernetes/kubernetes#66928.
Project quotas are a form of filesystem quota that apply to arbitrary groups of files, as opposed to file user or group ownership. They were first implemented in XFS, as described here: http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/xfs-quotas.html.
Project quotas for ext4fs were proposed in late 2014 and added to the Linux kernel in early 2016, with commit 391f2a16b74b95da2f05a607f53213fc8ed24b8e. They were designed to be compatible with XFS project quotas.
Each inode contains a 32-bit project ID, to which quotas (hard and soft limits for blocks and inodes) may optionally be applied. The total blocks and inodes for all files with a given project ID are maintained by the kernel. Project quotas can be managed from userspace by means of the `xfs_quota(8)` command in foreign filesystem (`-f`) mode; the traditional Linux quota tools do not manipulate project quotas. Programmatically, they are managed by the `quotactl(2)` system call, using in part the standard quota commands and in part the XFS quota commands; the man page implies incorrectly that the XFS quota commands apply only to XFS filesystems.

The project ID applied to a directory is inherited by files created under it. Files cannot be (hard) linked across directories with different project IDs. A file's project ID cannot be changed by a non-privileged user, but a privileged user may use the `xfs_io(8)` command to change the project ID of a file.
Filesystems using project quotas may be mounted with quotas either enforced or not; the non-enforcing mode tracks usage without enforcing it. A non-enforcing project quota may be implemented on a filesystem mounted with enforcing quotas by setting a quota too large to be hit. The maximum size that can be set varies with the filesystem; on a 64-bit filesystem it is 2^63-1 bytes for XFS and 2^58-1 bytes for ext4fs.
Conventionally, project quota mappings are stored in `/etc/projects` and `/etc/projid`; these files exist for user convenience and do not have any direct importance to the kernel. `/etc/projects` contains a mapping from project ID to directory/file; this can be a one-to-many mapping (the same project ID can apply to multiple directories or files, but any given directory/file can be assigned only one project ID). `/etc/projid` contains a mapping from named projects to project IDs.
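For illustration, a hypothetical pair of entries might look like the following (the project name, ID, and path are made up; the code described later manipulates these files programmatically rather than by hand):

```sh
$ cat /etc/projid
kube-emptydir-example:1048577
$ cat /etc/projects
1048577:/var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
```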
This proposal utilizes hard project quotas for both monitoring and enforcement. Soft quotas are of no utility; they allow for temporary overage that, after a programmable period of time, is converted to the hard quota limit.
The mechanism presently used to monitor storage consumption involves the use of `du` and `find` to periodically gather information about storage and inode consumption of volumes. This mechanism suffers from a number of drawbacks:
- It is slow. If a volume contains a large number of files, walking the directory can take a significant amount of time. There has been at least one known report of nodes becoming not ready due to volume metrics: kubernetes/kubernetes#62917
- It is possible to conceal a file from the walker by creating it and removing it while holding an open file descriptor on it. POSIX behavior is to not remove the file until the last open file descriptor pointing to it is closed. This has legitimate uses; it ensures that a temporary file is deleted when the processes using it exit, and it minimizes the attack surface by not having a file that can be found by an attacker. The following pod does this; it will never be caught by the present mechanism:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: "diskhog"
spec:
  containers:
  - name: "perl"
    resources:
      limits:
        ephemeral-storage: "2048Ki"
    image: "perl"
    command:
    - perl
    - -e
    - >
      my $file = "/data/a/a"; open OUT, ">$file" or die "Cannot open $file: $!\n"; unlink "$file" or die "cannot unlink $file: $!\n"; my $a="0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789"; foreach my $i (0..200000000) { print OUT $a; }; sleep 999999
    volumeMounts:
    - name: a
      mountPath: /data/a
  volumes:
  - name: a
    emptyDir: {}
```
- It is reactive rather than proactive. It does not prevent a pod from overshooting its limit; at best it catches it after the fact. On a fast storage medium, such as NVMe, a pod may write 50 GB or more of data before the housekeeping performed once per minute catches up to it. If the primary volume is the root partition, this will completely fill the partition, possibly causing serious problems elsewhere on the system. This proposal does not address this issue; a future enforcing project would.
In many environments, these issues may not matter, but shared multi-tenant environments need these issues addressed.
These goals apply only to local ephemeral storage, as described in #361.
- Primary: improve performance of monitoring by using project quotas in a non-enforcing way to collect information about storage utilization of ephemeral volumes.
- Primary: detect storage used by pods that is concealed by deleted files being held open.
- Primary: ensure that this does not interfere with the more common user and group quotas.
- Application to storage other than local ephemeral storage.
- Application to container copy on write layers. That will be managed by the container runtime. For a future project, we should work with the runtimes to use quotas for their monitoring.
- Elimination of eviction as a means of enforcing ephemeral-storage limits. Pods that hit their ephemeral-storage limit will still be evicted by the kubelet even if their storage has been capped by enforcing quotas.
- Enforcing node allocatable (a limit over the sum of all pods' disk usage, including, e.g., images).
- Enforcing limits on total pod storage consumption by any means, such that the pod would be hard restricted to the desired storage limit.
- Enforcing limits on per-volume storage consumption by using enforced project quotas.
This proposal applies project quotas to emptydir volumes on qualifying filesystems (ext4fs and xfs with project quotas enabled). Project quotas are applied by selecting an unused project ID (a 32-bit unsigned integer), setting a limit on space and/or inode consumption, and attaching the ID to one or more files. By default (and as utilized herein), if a project ID is attached to a directory, it is inherited by any files created under that directory.
To use quotas to track a pod's resource usage, the pod must be in a user namespace. Within user namespaces, the kernel restricts changes to projectIDs on the filesystem, ensuring the reliability of storage metrics calculated by quotas.
If we elect to use the quota as enforcing, we impose a quota consistent with the desired limit. If we elect to use it as non-enforcing, we impose a large quota that in practice cannot be exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs).
For discussion of the low level implementation strategy, see below.
At present, three feature gates control operation of quotas:
- `LocalStorageCapacityIsolation` must be enabled for any use of quotas.
- `LocalStorageCapacityIsolationFSQuotaMonitoring` must be enabled in addition. If this is enabled, quotas are used for monitoring, but not enforcement. At present, this defaults to False, but the intention is that this will default to True by initial release.
- `UserNamespacesSupport` must be enabled, and the kernel, CRI implementation, and OCI runtime must support user namespaces.
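As a sketch, the gates could be enabled as follows (illustrative only; the gates can equally be set via the `featureGates` field of each component's configuration file, and other flags are omitted):

```sh
# API server: user namespace support
kube-apiserver --feature-gates=UserNamespacesSupport=true        # ...other flags omitted
# Kubelet: capacity isolation plus quota-based monitoring, with user namespaces
kubelet --feature-gates=LocalStorageCapacityIsolation=true,LocalStorageCapacityIsolationFSQuotaMonitoring=true,UserNamespacesSupport=true   # ...other flags omitted
```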
- Caller (emptydir volume manager or container runtime) creates an emptydir volume, with an empty directory at a location of its choice.
- Caller requests that a quota be applied to a directory.
- Determine whether a quota can be imposed on the directory, by asking each quota provider (one per filesystem type) whether it can apply a quota to the directory. If no provider claims the directory, an error status is returned to the caller.
- Select an unused project ID (see below).
- Set the desired limit on the project ID, in a filesystem-dependent manner (see below).
- Apply the project ID to the directory in question, in a filesystem-dependent manner.
An error at any point results in no quota being applied and no change to the state of the system. The caller in general should not assume a priori that the attempt will be successful. It could choose to reject a request if a quota cannot be applied, but at this time it will simply ignore the error and proceed as today.
- Caller (kubelet metrics code, cadvisor, container runtime) asks the quota code to compute the amount of storage used under the directory.
- Determine whether a quota applies to the directory, in a filesystem-dependent manner (see below).
- If so, determine how much storage or how many inodes are utilized, in a filesystem dependent manner.
If the quota code is unable to retrieve the consumption, it returns an error status and it is up to the caller to utilize a fallback mechanism (such as the directory walk performed today).
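A minimal shell sketch of this fallback, assuming the `xfs_quota` invocation described later in this document (the mounts file, mountpoint, project ID, and directory are placeholders):

```sh
ID=1048577
DIR=/var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
# Ask the kernel for the blocks charged to the project ID...
xfs_quota -t /tmp/kubelet-mounts -P/dev/null -D/dev/null -f /var/lib/kubelet \
  -c "quota -p -N -n -v -b $ID" \
  || du -s -B1 "$DIR"   # ...and fall back to a directory walk if that fails
```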
- Caller requests that the quota be removed from a directory.
- Determine whether a project quota applies to the directory.
- Remove the limit from the project ID associated with the directory.
- Remove the association between the directory and the project ID.
- Return the project ID to the system to allow its use elsewhere (see below).
- Caller may delete the directory and its contents (normally it will).
Project IDs are a shared space within a filesystem. If the same project ID is assigned to multiple directories, the space consumption reported by the quota will be the sum of that of all of the directories. Hence, it is important to ensure that each directory is assigned a unique project ID (unless it is desired to pool the storage use of multiple directories).
The canonical mechanism to record persistently that a project ID is reserved is to store it in the `/etc/projid` (`projid(5)`) and/or `/etc/projects` (`projects(5)`) files. However, it is possible to utilize project IDs without recording them in those files; they exist for administrative convenience, but neither the kernel nor the filesystem is aware of them. Other ways can be used to determine whether a project ID is in active use on a given filesystem:
- The quota values (in blocks and/or inodes) assigned to the project ID are non-zero.
- The storage consumption (in blocks and/or inodes) reported under the project ID is non-zero.
The algorithm to be used is as follows:
- Lock this instance of the quota code against re-entrancy.
- Open and `flock()` the `/etc/projects` and `/etc/projid` files, so that other uses of this code are excluded.
- Start from a high number (the prototype uses 1048577).
- Iterate from there, performing the following tests:
  - Is the ID reserved by this instance of the quota code?
  - Is the ID present in `/etc/projects`?
  - Is the ID present in `/etc/projid`?
  - Are the quota values and/or consumption reported by the kernel non-zero? This test is restricted to 128 iterations to ensure that a bug here or elsewhere does not result in an infinite loop looking for a quota ID.
- If an ID has been found:
  - Add it to an in-memory copy of `/etc/projects` and `/etc/projid` so that any other uses of project quotas do not reuse it.
  - Write temporary copies of `/etc/projects` and `/etc/projid` that are `flock()`ed.
  - If successful, rename the temporary files appropriately (if rename of one succeeds but the other fails, we have a problem that we cannot recover from, and the files may be inconsistent).
- Unlock `/etc/projid` and `/etc/projects`.
- Unlock this instance of the quota code.
A minor variation of this is used if we want to reuse an existing quota ID.
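The per-ID "in use" tests above can be approximated from the command line as follows (a hedged sketch; the kubelet performs the equivalent checks in-process, and the candidate ID, mounts file, and mountpoint are placeholders):

```sh
ID=1048577
grep -q "^${ID}:" /etc/projects      # referenced in /etc/projects?
grep -q ":${ID}\$" /etc/projid       # referenced in /etc/projid?
# Non-zero limits or consumption reported by the kernel also mark the ID as in use:
xfs_quota -t /tmp/kubelet-mounts -P/dev/null -D/dev/null -f /var/lib/kubelet \
  -c "quota -p -N -n -v -b ${ID}"
```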
It is possible to determine whether a directory has a project ID applied to it by requesting (via the `lsattr(1)` command) the project ID associated with the directory. While the specifics are filesystem-dependent, the basic method is the same for at least XFS and ext4fs.

It is not possible to determine in a constant number of operations the directory or directories to which a project ID is applied. It is possible to determine whether a given project ID has been applied to some existing directory or files (although those will not be known); the reported consumption will be non-zero.

The code records internally the project ID applied to a directory, but it cannot always rely on this. In particular, if the kubelet has exited and been restarted (and hence the quota applying to the directory should be removed), the map from directory to project ID is lost. If it cannot find a map entry, it falls back on the approach discussed above.
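For example, the project ID on a volume directory might be inspected as follows (illustrative output; the exact formatting of the attribute flags varies by e2fsprogs version and filesystem):

```sh
$ lsattr -pd /var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
1048577 --------------------- /var/lib/kubelet/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch
```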
The algorithm used to return a project ID to the system is very similar to the algorithm used to select one, except, of course, that no new project ID is selected. It performs the same sequence of locking `/etc/projects` and `/etc/projid`, editing a copy of each file, and restoring it.

If the project ID is applied to multiple directories and the code can determine that, it will not remove the project ID from `/etc/projid` until the last reference is removed. While it is not anticipated in this KEP that this mode of operation will be used, at least initially, this can be detected even on kubelet restart by looking at the reference count in `/etc/projects`.
The initial implementation will be done by shelling out to the `xfs_quota(8)` and `lsattr(1)` commands to manipulate quotas. It is possible to use the `quotactl(2)` and `ioctl(2)` system calls to do this, but the `quotactl(2)` system call is not supported by Go at all, and the `ioctl(2)` subcommands required for manipulating quotas are only partially supported. Use of these system calls would require either use of cgo, which is not supported in Kubernetes, or copying the necessary structure definitions and constants into the source.
The use of Linux commands rather than system calls likely poses some efficiency issues, but these commands are only issued when ephemeral volumes are created and destroyed, or during monitoring, when they are called once per minute. At present, monitoring is done by shelling out to the `du(1)` and `find(1)` commands, which are much less efficient, as they perform filesystem scans. The performance of these commands must be measured under load prior to making this feature used by default.
All `xfs_quota(8)` commands are invoked as

`xfs_quota -t <tmp_mounts_file> -P/dev/null -D/dev/null -f <mountpoint> -c <command>`

The `-P` and `-D` arguments tell `xfs_quota` not to read the usual `/etc/projid` and `/etc/projects` files, and to use the empty special file `/dev/null` as a stand-in. `xfs_quota` reads the projid and projects files and attempts to access every listed mountpoint for every command. As we use numeric project IDs for all purposes, it is not necessary to incur this overhead.
The `-t` argument mitigates a hazard of `xfs_quota(8)` in the presence of stuck NFS (or similar) mounts. By default, `xfs_quota(8)` `stat(2)`s every mountpoint on the system, which could hang on a stuck mount. The `-t` option is used to pass in a temporary mounts file containing only the filesystem we care about.
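Such a temporary mounts file is simply an mtab/fstab-style listing restricted to the filesystem of interest; a hypothetical example (the device name and mountpoint are made up):

```sh
$ cat /tmp/kubelet-mounts
/dev/nvme0n1p5 /var/lib/kubelet xfs rw,prjquota 0 0
```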
The following operations are performed, with the listed commands (using `xfs_quota(8)` except as noted):

- Determine whether quotas are enabled on the specified filesystem: `state -p`
- Apply a limit to a project ID (note that at present the largest possible limit is applied, allowing the quota system to be used for monitoring only): `limit -p bhard=<blocks> bsoft=<blocks> <projectID>`
- Apply a project ID to a directory, enabling a quota on that directory: `project -s -p <directory> <projectID>`
- Retrieve the number of blocks used by a given project ID: `quota -p -N -n -v -b <projectID>`
- Retrieve the number of inodes used by a given project ID: `quota -p -N -n -v -i <projectID>`
- Determine whether a specified directory has a quota ID applied to it, and if so, what that ID is (using `lsattr(1)`): `lsattr -pd <path>`
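Putting the pieces together, a hedged end-to-end illustration against a single emptydir directory might look like the following (mountpoint, mounts file, project ID, limit value, and paths are all placeholders; the kubelet issues the equivalent commands itself):

```sh
MNT=/var/lib/kubelet
MTAB=/tmp/kubelet-mounts
ID=1048577
DIR=$MNT/pods/POD_UID/volumes/kubernetes.io~empty-dir/scratch

xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "state -p"                        # quotas enabled?
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "limit -p bhard=1g bsoft=1g $ID"  # set the limit
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "project -s -p $DIR $ID"          # attach the ID to the directory
xfs_quota -t $MTAB -P/dev/null -D/dev/null -f $MNT -c "quota -p -N -n -v -b $ID"        # blocks used
lsattr -pd $DIR                                                                         # which ID is applied?
```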
In the long run, we should work to add the necessary constructs to the Go language, allowing direct use of the necessary system calls without the use of cgo.
The primary new interface defined is the quota interface in `pkg/volume/util/quota/quota.go`. This defines five operations:

- Does the specified directory support quotas?
- Assign a quota to a directory. If a non-empty pod UID is provided, the quota assigned is that of any other directories under this pod UID; if an empty pod UID is provided, a unique quota is assigned.
- Retrieve the consumption of the specified directory. If the quota code cannot handle it efficiently, it returns an error and the caller falls back on the existing mechanism.
- Retrieve the inode consumption of the specified directory; same description as above.
- Remove the quota from a directory. If a non-empty pod UID is passed, it is checked against that recorded in memory (if any). The quota is removed from the specified directory. This can be used even if AssignQuota has not been used; it inspects the directory and removes the quota from it. This permits stale quotas from an interrupted kubelet to be cleaned up.

Two implementations are provided: `quota_linux.go` (for Linux) and `quota_unsupported.go` (for other operating systems). The latter returns an error for all requests.
As the quota mechanism is intended to support multiple filesystems, and different filesystems require different low-level code for manipulating quotas, a provider is supplied that finds an appropriate quota applier implementation for the filesystem in question. The low-level quota applier provides similar operations to the top-level quota code, with two exceptions:

- No operation exists to determine whether a quota can be applied (that is handled by the provider).
- An additional operation is provided to determine whether a given quota ID is in use within the filesystem (outside of `/etc/projects` and `/etc/projid`).

The two quota providers in the initial implementation are in `pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While some quota operations do require different system calls, a lot of the code is common, and is factored into `pkg/volume/util/quota/common/quota_linux_common_impl.go`.
The prototype for this project is mostly self-contained within `pkg/volume/util/quota`, plus a few changes to `pkg/volume/empty_dir/empty_dir.go`. However, a few changes were required elsewhere:

- The operation executor needs to pass the desired size limit to the volume plugin where appropriate so that the volume plugin can impose a quota. The limit is passed as 0 (do not use quotas), a positive number (impose an enforcing quota if possible, measured in bytes), or -1 (impose a non-enforcing quota, if possible) on the volume. This requires changes to `pkg/volume/util/operationexecutor/operation_executor.go` (to add `DesiredSizeLimit` to `VolumeToMount`), `pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and `pkg/kubelet/eviction/helpers.go` (the latter in order to determine whether the volume is a local ephemeral one).
- The volume manager (in `pkg/volume/volume.go`) changes the `Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new `MounterArgs` type rather than an `FsGroup` (`*int64`). This is to allow passing the desired size and pod UID (in the event we choose to implement quotas shared between multiple volumes; see below). This required small changes to all volume plugins and their tests, but will in the future allow adding additional data without having to change code other than that which uses the new information.
- Using user namespaces prevents users from changing project IDs on the filesystem, which is essential for preserving the reliability of xfs-quota metrics under user namespaces.
- When LocalStorageCapacityIsolationFSQuotaMonitoring is enabled, the kubelet utilizes xfs-quota to monitor disk usage. With user namespaces enforced, any changes that might manipulate project IDs are inherently restricted, avoiding inaccuracies in the collected metrics.
- During setup of the volumeToMount object, the hostUsersEnabled field is added, which indicates whether user namespaces are in effect. This field is critical for determining the environment in which the volume operates.
- When mounting the filesystem, before the assignQuota method is invoked, a check for SupportQuota is conducted. This check should now also verify the hostUsersEnabled status to ensure that quotas are only assigned when operating within a valid user namespace context.
- The kubelet's behavior must be updated to accommodate these changes. If the LocalStorageCapacityIsolationFSQuotaMonitoring flag is enabled and hostUsersEnabled is false (i.e., user namespaces are being used), the kubelet should proceed with xfs-quota based monitoring. If user namespaces are not enabled, the kubelet should revert to using the traditional methods for disk monitoring, such as du and find.
- The SIG raised the possibility of a container being unable to exit should we enforce quotas, and the quota interferes with writing the log. This can be mitigated either by not applying a quota to the log directory and using the du mechanism, or by applying a separate non-enforcing quota to the log directory.

  As log directories are write-only by the container, and consumption can be limited by other means (as the log is filtered by the runtime), I do not consider the ability to write uncapped to the log to be a serious exposure.

  Note in addition that even without quotas it is possible for writes to fail due to lack of filesystem space, which is effectively (and in some cases operationally) indistinguishable from exceeding quota, so even at present code must be able to handle those situations.
- Filesystem quotas may impact performance to an unknown degree. Information on that is hard to come by in general, and one of the reasons for using quotas is indeed to improve performance. If this is a problem in the field, merely turning off quotas (or selectively disabling project quotas) on the filesystem in question will avoid the problem. Against the possibility that that cannot be done (because project quotas are needed for other purposes), we should provide a way to disable use of quotas altogether via a feature gate.

  A report (https://blog.pythonanywhere.com/110/) notes that an unclean shutdown on Linux kernel versions between 3.11 and 3.17 can result in prolonged downtime while quota information is restored. Unfortunately, the link referenced there is no longer available.
- Bugs in the quota code could result in a variety of regression behavior. For example, if a quota is incorrectly applied, it could result in the inability to write any data at all to the volume. This could be mitigated by use of non-enforcing quotas. XFS in particular offers the `pqnoenforce` mount option, which makes all quotas non-enforcing.
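For instance, a filesystem could be mounted with accounting-only project quotas roughly as follows (the device and mountpoint are illustrative):

```sh
mount -o pqnoenforce /dev/nvme0n1p5 /var/lib/kubelet
```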
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
The quota code is by and large not very amenable to unit tests. While there are simple unit tests for parsing the mounts file, and there could be tests for parsing the projects and projid files, the real work (and risk) involves interactions with the kernel and with multiple instances of this code (e.g., in the kubelet and the runtime manager, particularly under stress). It also requires setup in the form of a prepared filesystem. It would be better served by appropriate end-to-end tests.
The main unit tests are in the package under `pkg/volume/util/fsquota/`.

Coverage for `pkg/volume/util/fsquota/` as of 2024-06-12: 73.9%
- project.go: 75.7%
- quota.go: 100%
- quota_linux.go: 72.2%
See details in https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&include-filter-by-regex=fsquota.
N/A
The node e2e test (LocalStorageCapacityIsolationQuotaMonitoring [Slow] [Serial] [Disruptive] [Feature:LocalStorageCapacityIsolationQuota][NodeFeature:LSCIQuotaMonitoring]) can be found in `test/e2e_node/quota_lsci_test.go`.
The e2e tests are slow and serial, so we will not promote them to conformance tests. There is no failure history or flakes in https://storage.googleapis.com/k8s-triage/index.html?test=LocalStorageCapacityIsolationQuotaMonitoring
I performed a microbenchmark consisting of various operations on a directory containing 4096 subdirectories each containing 2048 1Kbyte files. The operations performed were as follows, in sequence:
- Create Files: create 4K directories each containing 2K files as described, in depth-first order.
- du: run `du` immediately after creating the files.
- quota: where applicable, run `xfs_quota` immediately after `du`.
- du (repeat): repeat the `du` invocation.
- quota (repeat): repeat the `xfs_quota` invocation.
- du (after remount): run `mount -o remount <filesystem>` immediately followed by `du`.
- quota (after remount): run `mount -o remount <filesystem>` immediately followed by `xfs_quota`.
- unmount: `umount` the filesystem.
- mount: `mount` the filesystem.
- quota after umount/mount: run `xfs_quota` after unmounting and mounting the filesystem.
- du after umount/mount: run `du` after unmounting and mounting the filesystem.
- Remove Files: remove the test files.
The test was performed on four separate filesystems:
- XFS filesystem, with quotas enabled (256 GiB, 128 Mi inodes)
- XFS filesystem, with quotas disabled (64 GiB, 32 Mi inodes)
- ext4fs filesystem, with quotas enabled (250 GiB, 16 Mi inodes)
- ext4fs filesystem, with quotas disabled (60 GiB, 16 Mi inodes)
Other notes:
- All filesystems reside on an otherwise idle HP EX920 1TB NVMe on a Lenovo ThinkPad P50 with 64 GB RAM and Intel Core i7-6820HQ CPU (4 cores/8 threads total) running Fedora 30 (5.1.5-300.fc30.x86_64)
- Five runs were conducted with each combination; the median value is used. All times are in seconds.
- Space consumption was calculated with the filesystem hot (after files were created), warm (after `mount -o remount`), and cold (after umount/mount of the filesystem).
- Note that in all cases xfs_quota consumed zero time as reported by `/usr/bin/time`.
- User and system time are not available for file creation.
- All calls to `xfs_quota` and `mount` consumed less than 0.01 seconds of elapsed, user, and CPU time and are not reported here.
- Removing files was consistently faster with quotas disabled than with quotas enabled. With ext4fs, du was faster with quotas disabled than with quotas enabled. With XFS, creating files may have been faster with quotas disabled than with quotas enabled, but the difference was small. In other cases, the difference was within noise.
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
Create Files | 435.0 | 419.0 | 348.0 | 343.0 |
du | 12.1 | 12.6 | 14.3 | 14.3 |
du (repeat) | 12.0 | 12.1 | 14.1 | 14.0 |
du (after remount) | 23.2 | 23.2 | 39.0 | 24.6 |
unmount | 12.2 | 12.1 | 9.8 | 9.8 |
du after umount/mount | 103.6 | 138.8 | 40.2 | 38.8 |
Remove Files | 196.0 | 159.8 | 105.2 | 90.4 |
All calls to `umount` consumed less than 0.01 second of user CPU time and are not reported here.
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
du | 3.7 | 3.7 | 3.7 | 3.7 |
du (repeat) | 3.7 | 3.7 | 3.7 | 3.8 |
du (after remount) | 3.3 | 3.3 | 3.7 | 3.6 |
du after umount/mount | 8.1 | 10.2 | 3.9 | 3.7 |
Remove Files | 4.3 | 4.1 | 4.2 | 4.3 |
Operation | XFS+Quota | XFS | Ext4fs+Quota | Ext4fs |
---|---|---|---|---|
du | 8.3 | 8.6 | 10.5 | 10.5 |
du (repeat) | 8.3 | 8.4 | 10.4 | 10.4 |
du (after remount) | 19.8 | 19.8 | 28.8 | 20.9 |
unmount | 10.2 | 10.1 | 8.1 | 8.1 |
du after umount/mount | 66.0 | 82.4 | 29.2 | 28.1 |
Remove Files | 188.6 | 156.6 | 90.4 | 81.8 |
The following criteria apply to `LocalStorageCapacityIsolationFSQuotaMonitoring`:
- Support integrated in kubelet
- Alpha-level documentation
- Unit test coverage
- Node e2e test
- User feedback
- Benchmarks to determine latency and overhead of using quotas relative to existing monitoring solution
- Cleanup
- Use Ephemeral-Storage-Quotas in User Namespace
- TBD
Turn off the feature gate to turn off the feature.
- If the API server is on the latest version with the user namespace feature flag enabled, and the kubelet also has the user namespace feature along with the LocalStorageCapacityIsolationFSQuotaMonitoring feature flag enabled, Pods with hostUsers set to false will have XFS quotas within user namespaces enabled.
- If any of the necessary feature flags (user namespace on either the API server or kubelet, or LocalStorageCapacityIsolationFSQuotaMonitoring on the kubelet) are not enabled, then XFS quotas within user namespaces will not be supported.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: LocalStorageCapacityIsolationFSQuotaMonitoring
  - Components depending on the feature gate: kubelet
- Feature gate
  - Feature gate name: UserNamespacesSupport
  - Components depending on the feature gate: kubelet, kube-apiserver
This feature uses project quotas to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy. This feature can be enabled only for pods running within a user namespace.
None. Behavior will not change. The change is in how volumes such as emptyDir ephemeral-storage volumes are monitored. When LocalStorageCapacityIsolation is enabled for local ephemeral storage, and the backing filesystem for emptyDir volumes supports project quotas and they are enabled, project quotas are used to monitor emptyDir volume storage consumption rather than a filesystem walk, for better performance and accuracy.
Yes, but only for newly created pods.
- Existing Pods: if a pod was created with an enforcing quota, it will not use the enforcing quota after the feature gate is disabled.
- Newly Created Pods: after setting the feature gate to false, newly created pods will not use the enforcing quota.
As above, after we re-enable the feature, newly created pods will use it. If a pod was created before rolling back, the pod will benefit from this feature as well.
Yes, in test/e2e_node/quota_lsci_test.go
No. In the event of a rollback, the system would revert to the previous method of disk usage monitoring. This switch should not impact the operational state of already running workloads. Additionally, pods created while the feature was active (created with user namespaces) will have to be re-created to run without user namespaces. If those weren't recreated, they will continue to run in a user namespace.
`kubelet_volume_metric_collection_duration_seconds` was added in v1.24 to record the duration in seconds of volume stats calculation. This metric can help compare `fsquota` monitoring and `du` for disk usage.
Yes. I tested it locally and fixed a bug that appeared after restarting the kubelet.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No
- In kubelet metrics, an operator can check the histogram metric `kubelet_volume_metric_collection_duration_seconds` with `metric_source` equal to "fsquota". If there is no `metric_source=fsquota`, this feature should be disabled.
- However, to figure out whether a workload is using this feature, there is no direct way at present; see below for how to check fsquota settings on a node.
  - `xfs_quota -x -c 'report -h' /path/to/directory` will provide information on project quota usage and limits.
  - When trying to change the project ID associated with the directory in a user namespace (`xfs_quota -x -c 'project -s -p newProjectID /path/to/directory' /`), the operation will be denied, indicating the restriction the user namespace imposes.
99.9% of volume stats calculations will take less than 1 s (or even 500 ms).
This can be calculated from the `kubelet_volume_metric_collection_duration_seconds` metric.
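A hedged PromQL sketch for checking this, assuming the standard histogram `_bucket` series and the `metric_source` label mentioned elsewhere in this questionnaire:

```sh
# 99.9th percentile of volume stats collection time, per metric source
QUERY='histogram_quantile(0.999,
  sum(rate(kubelet_volume_metric_collection_duration_seconds_bucket[5m])) by (le, metric_source))'
```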
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name: `kubelet_volume_metric_collection_duration_seconds`
  - Aggregation method: histogram
  - Components exposing the metric: kubelet
Are there any missing metrics that would be useful to have to improve observability of this feature?

Yes; there are no histogram metrics for each individual volume. The above metric is grouped by volume type because exposing it for every volume would be too expensive. As a result, users cannot determine directly from the metrics whether the feature is used by a workload. A cluster admin can check the kubelet configuration on each node; if the feature gate is disabled, workloads on that node will not use it.
For example, run `xfs_quota -x -c 'report -h' /dev/sdc` to check quota settings on the device.
Check `spec.containers[].resources.limits.ephemeral-storage` of each container to compare.
Yes, the feature depends on project quotas. Once quotas are enabled, the xfs_quota tool can be used to set limits and report on disk usage.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by [existing SLIs/SLOs]?
No.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No. It will use less CPU time and IO during ephemeral storage monitoring. The kubelet now allows use of XFS quotas (on XFS and suitably configured ext4fs filesystems) to monitor storage consumption for ephemeral storage (currently for emptydir volumes only). This method of monitoring consumption is faster and more accurate than the old method of walking the filesystem tree. It does not enforce limits, only monitors consumption.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Enabling XFS quotas within user namespaces is unlikely to result in resource exhaustion of node resources like PIDs or sockets. However, quota settings could potentially lead to inode exhaustion if limits are set too low for the workload.
No changes to current kubelet behaviors. The feature only uses kubelet-local information.
- If the ephemeral storage limit is reached, the pod will be evicted by the kubelet.
- Quota-based monitoring should be skipped when the node image is not configured correctly (unsupported filesystem or quotas not enabled).
- For "out of space" failures, kubelet eviction should be triggered.
If the metrics show problems, we can check the logs and the quota directories with the commands below.
- There will be warning logs (after the # is merged) if volume calculation takes longer than 1 second.
- If quota is enabled, you can find the volume information and the processing time with `time repquota -P /var/lib/kubelet -s -v`.
- `LocalStorageCapacityIsolationFSQuotaMonitoring` implemented at Alpha
- `kubelet_volume_metric_collection_duration_seconds` metric was added
- A bug where quotas did not work after the kubelet restarted was fixed
- `LocalStorageCapacityIsolationFSQuotaMonitoring` was promoted to Beta, but a regression caused it to be reverted to Alpha.
A ConfigMap rendering issue was found in the 1.25.0 release: when ConfigMaps are updated via the API, they do not get rendered to the resulting pod's filesystem by the kubelet. The feature was reverted to Alpha in the 1.25.1 release.
- Fix the blocking issue that caused the revert to alpha: kubernetes/kubernetes#112624 and kubernetes/kubernetes#115314.
- Add test in sig-node test grid for this feature https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-fsquota-ubuntu: kubernetes/test-infra#28616
- Promote `LocalStorageCapacityIsolationFSQuotaMonitoring` to Beta
- Use of quotas, particularly the less commonly used project quotas, requires additional action on the part of the administrator (see the sketch after this list). In particular:
  - ext4fs filesystems must be created with additional options that are not enabled by default: `mkfs.ext4 -O quota,project -E quotatype=usrquota:grpquota:prjquota <device>`
  - An additional option (`prjquota`) must be applied in `/etc/fstab`.
  - If the root filesystem is to be quota-enabled, it must be set in the grub options.
- Use of project quotas for this purpose will preclude future use within containers.
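A sketch of the administrator setup for an ext4 filesystem, combining the steps above (the device and mountpoint are placeholders; the mkfs options are those quoted in the drawback above):

```sh
mkfs.ext4 -O quota,project -E quotatype=usrquota:grpquota:prjquota /dev/sdc
# /etc/fstab entry adding the prjquota mount option:
#   /dev/sdc  /var/lib/kubelet  ext4  defaults,prjquota  0  2
```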
I have considered two classes of alternatives:

- Alternatives based on quotas, with a different implementation
- Alternatives based on loop filesystems without use of quotas
Within the basic framework of using quotas to monitor and potentially enforce storage utilization, there are a number of possible options:
- Utilize per-volume non-enforcing quotas to monitor storage (the first stage of this proposal).

  This mostly preserves the current behavior, but with more efficient determination of storage utilization and the possibility of building further on it. The one change from current behavior is the ability to detect space used by deleted files.

- Utilize per-volume enforcing quotas to monitor and enforce storage (the second stage of this proposal).

  This allows partial enforcement of storage limits. As local storage capacity isolation works at the level of the pod, and we have no control over user utilization of ephemeral volumes, we would have to give each volume a quota of the full limit. For example, if a pod had a limit of 1 MB but had four ephemeral volumes mounted, it would be possible for storage utilization to reach (at least temporarily) 4 MB before being capped.

- Utilize per-pod enforcing user or group quotas to enforce storage consumption, and per-volume non-enforcing quotas for monitoring.

  This would offer the best of both worlds: a fully capped storage limit combined with efficient reporting. However, it would require each pod to run under a distinct UID or GID. This may prevent pods from using setuid or setgid or their variants, and would interfere with any other use of group or user quotas within Kubernetes.

- Utilize per-pod enforcing quotas to monitor and enforce storage.

  This allows for full enforcement of storage limits, at the expense of being able to efficiently monitor per-volume storage consumption. As there have already been reports of monitoring causing trouble, I do not advise this option.

  A variant of this would report 1/N of the storage for each covered volume, so for a pod with a 4 MiB quota and 1 MiB total consumption spread across 4 ephemeral volumes, each volume would report a consumption of 256 KiB. Another variant would change the API to report statistics for all ephemeral volumes combined. I do not advise this option either.
Another way of isolating storage is to utilize filesystems of pre-determined size, using the loop filesystem facility within Linux. It is possible to create a file and run `mkfs(8)` on it, and then to mount that filesystem on the desired directory. This both limits the storage available within that directory and enables quick retrieval of usage via `statfs(2)`.

Cleanup of such a filesystem involves unmounting it and removing the backing file.

The backing file can be created as a sparse file, and the discard option can be used to return unused space to the system, allowing for thin provisioning.
I conducted preliminary investigations into this. While at first it appeared promising, it turned out to have multiple critical flaws:
- If the filesystem is mounted without the `discard` option, it can grow to the full size of the backing file, negating any possibility of thin provisioning. If the file is created dense in the first place, there is never any possibility of thin provisioning without use of `discard`.

  If the backing file is created densely, it additionally may require significant time to create if the ephemeral limit is large.

- If the filesystem is mounted `nosync`, and is sparse, it is possible for writes to succeed and then fail later with I/O errors when synced to the backing storage. This will lead to data corruption that cannot be detected at the time of write.

  This can easily be reproduced by, e.g., creating a 64 MB filesystem and within it creating a 128 MB sparse file and building a filesystem on it. When that filesystem is in turn mounted, writes to it will succeed, but I/O errors will be seen in the log and the file will be incomplete:
```
# mkdir /var/tmp/d1 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
# mkfs.ext4 /var/tmp/fs1
# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
# mkfs.ext4 /var/tmp/d1/fs2
# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
  ...will normally succeed...
# sync
  ...fails with I/O error!...
```
- If the filesystem is mounted `sync`, all writes to it are immediately committed to the backing store, and the `dd` operation above fails as soon as it fills up `/var/tmp/d1`. However, performance is drastically slowed, particularly with small writes; with 1K writes, I observed performance degradation in some cases exceeding three orders of magnitude.

  I performed a test comparing writing 64 MB to a base (partitioned) filesystem, to a loop filesystem without `sync`, and to a loop filesystem with `sync`. Total I/O was sufficient to run for at least 5 seconds in each case. All filesystems involved were XFS. Loop filesystems were 128 MB and dense. Times are in seconds. The erratic behavior (e.g. the 65536 case) was observed repeatedly, although the exact amount of time and which I/O sizes were affected varied. The underlying device was an HP EX920 1TB NVMe SSD.
I/O Size | Partition | Loop w/o sync | Loop w/sync |
---|---|---|---|
1024 | 0.104 | 0.120 | 140.390 |
4096 | 0.045 | 0.077 | 21.850 |
16384 | 0.045 | 0.067 | 5.550 |
65536 | 0.044 | 0.061 | 20.440 |
262144 | 0.043 | 0.087 | 0.545 |
1048576 | 0.043 | 0.055 | 7.490 |
4194304 | 0.043 | 0.053 | 0.587 |
The only potentially viable combination in my view would be a dense loop filesystem without sync, but that would render any thin provisioning impossible.
- Decision: who is responsible for quota management of all volume types (and especially ephemeral volumes of all types)? At present, emptydir volumes are managed by the kubelet, and logdirs and writable layers by either the kubelet or the runtime, depending upon the choice of runtime. Beyond the specific proposal that the runtime should manage quotas for volumes it creates, there are broader issues that I request assistance from the SIG in addressing.
- Location of the quota code. If the quotas for different volume types are to be managed by different components, each such component needs access to the quota code. The code is substantial and should not be copied; it would more appropriately be vendored.
The following is a list of known security issues referencing filesystem quotas on Linux, and other bugs referencing filesystem quotas in Linux since 2012. These bugs are not necessarily in the quota system.
- CVE-2012-2133: Use-after-free vulnerability in the Linux kernel before 3.3.6, when huge pages are enabled, allows local users to cause a denial of service (system crash) or possibly gain privileges by interacting with a hugetlbfs filesystem, as demonstrated by a umount operation that triggers improper handling of quota data.

  The issue is actually related to huge pages, not quotas specifically. The demonstration of the vulnerability resulted in incorrect handling of quota data.

- CVE-2012-3417: The good_client function in rquotad (rquota_svc.c) in Linux DiskQuota (aka quota) before 3.17 invokes the hosts_ctl function the first time without a host name, which might allow remote attackers to bypass TCP Wrappers rules in hosts.deny (related to rpc.rquotad).

  This issue is related to remote quota handling, which is not the use case for the proposal at hand.

- Linux Kernel Quota Flaw Lets Local Users Exceed Quota Limits and Create Large Files

  A setuid root binary inheriting file descriptors from an unprivileged user process may write to the file without respecting quota limits. If this issue is still present, it would allow a setuid process to exceed any enforcing limits, but does not affect the quota accounting (use of quotas for monitoring).

- ext4: report delalloc reserve as non-free in statfs mangled by project quota

  This bug, fixed in Feb. 2018, properly accounts for reserved but not committed space in project quotas. At this point I have not determined the impact of this issue.

- XFS quota doesn't work after rebooting because of crash

  This bug resulted in XFS quotas not working after a crash or forced reboot. Under this proposal, Kubernetes would fall back to du for monitoring should a bug of this nature manifest itself again.

- quota can show incorrect filesystem name

  This issue, which will not be fixed, results in the quota command possibly printing an incorrect filesystem name when used on remote filesystems. It is a display issue with the quota command, not a quota bug at all, and does not result in incorrect quota information being reported. As this proposal does not utilize the quota command, rely on filesystem names, or currently use quotas on remote filesystems, it should not be affected by this bug.
In addition, the e2fsprogs have had numerous fixes over the years.