- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- (R) Graduation criteria is in place
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Kubernetes currently does not support the use of swap memory on Linux, as it is difficult to provide guarantees and account for pod memory utilization when swap is involved. As part of Kubernetes’ earlier design, swap support was considered out of scope.
However, there are a number of use cases that would benefit from Kubernetes nodes supporting swap. Hence, this proposal aims to add swap support to nodes in a controlled, predictable manner so that Kubernetes users can perform testing and provide data to continue building cluster capabilities on top of swap.
There are two distinct types of user for swap, who may overlap:
- node administrators, who may want swap available for node-level performance tuning and stability/reducing noisy neighbour issues
- application developers, who have written applications that would benefit from using swap memory
There are hence a number of possible ways that one could envision swap use on a node.
- Swap is enabled on a node's host system, but the kubelet does not permit Kubernetes workloads to use swap. (This scenario is a prerequisite for the following use cases.)
- Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
- Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.
This KEP will be limited in scope to the first two scenarios. The third can be addressed in a follow-up KEP. The enablement work that is in scope for this KEP will be necessary to implement the third scenario.
- On Linux systems, when swap is provisioned and available, Kubelet can start up with swap on.
- Configuration is available for kubelet to set swap utilization available to Kubernetes workloads, defaulting to 0 swap.
- Cluster administrators can enable and configure kubelet swap utilization on a per-node basis.
- Use of swap memory for cgroupsv2.
- Addressing non-Linux operating systems. Swap support will only be available for Linux.
- Provisioning swap. Swap must already be available on the system.
- Setting swappiness. This can already be set on a system-wide level outside of Kubernetes.
- Allocating swap on a per-workload basis with accounting (e.g. pod-level specification of swap). If desired, this should be designed and implemented as part of a follow-up KEP. This KEP is a prerequisite for that work. Hence, swap will be an overcommitted resource in the context of this KEP.
- Supporting zram, zswap, or other memory types like SGX EPC. These could be addressed in a follow-up KEP, and are out of scope.
- Use of swap for cgroupsv1.
We propose that, when swap is provisioned and available on a node, cluster administrators can configure the kubelet such that:
- It can start with swap on.
- It will direct the CRI to allocate Kubernetes workloads 0 swap by default.
- It will have configuration options to configure swap utilization for the entire node.
This proposal enables scenarios 1 and 2 above, but not 3.
Before enabling swap support through the pod API, it is crucial to build confidence in this feature by carefully assessing its impact on workloads and Kubernetes. As an initial step, we propose enabling swap support for Burstable QoS Pods by automatically calculating the appropriate swap values, rather than allowing users to input these values manually.
Swap access is granted only for pods of Burstable QoS. Guaranteed QoS pods are usually higher-priority pods, therefore we want to avoid swap's performance penalty for them. Best-Effort pods, on the contrary, are low-priority pods that are the first to be killed during node pressures. In addition, they're unpredictable, therefore it's hard to assess how much swap memory is a reasonable amount to allocate for them.
By doing so, we can ensure a thorough understanding of the feature's performance and stability before considering the manual input of swap values in a subsequent beta release. This cautious approach will ensure the efficient allocation of resources and the smooth integration of swap support into Kubernetes.
Allocate the swap limit equal to the requested memory for each container and adjust the proportion of swap based on the total swap memory available.
Note: In Beta2, we found that allowing system critical daemons to swap memory could cause degradation of services.
System critical daemons (such as Kubelet) are essential for node health. Usually, an appropriate portion of system resources (e.g., memory, CPU) is reserved as system reserved. However, swap doesn't inherently support reserving a portion out of the total available. For instance, in the case of memory, we set `memory.min` on the node-level cgroup to ensure an adequate amount of memory is set aside, away from the pods, and for system critical daemons. But there is no equivalent for swap; i.e., no `memory.swap.min` is supported in the kernel.
Since this proposal advocates enabling swap only for the Burstable QoS pods, this can be done by setting `memory.swap.max` on the cgroups used by the Burstable QoS pods. The value of this `memory.swap.max` can be calculated by:
memory.swap.max = total swap memory available on the system - system reserve (memory)
This is the total amount of swap available for all the Burstable QoS pods; let's call it `TotalPodsSwapAvailable`. This will ensure that the system critical daemons will have access to the swap at least equal to the system reserved memory. This will indirectly act as having support for swap in system reserved.
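The subtraction above can be sketched in Go as follows. The function name and the clamp at zero (for nodes whose swap is smaller than the system reserved memory) are assumptions made for illustration; they are not taken from the kubelet source.

```go
package main

import "fmt"

// totalPodsSwapAvailable returns the swap budget shared by all Burstable QoS
// pods: total node swap minus the system reserved memory, both in bytes.
// Clamping negative results to zero is an assumption of this sketch.
func totalPodsSwapAvailable(totalSwapBytes, systemReservedMemoryBytes int64) int64 {
	available := totalSwapBytes - systemReservedMemoryBytes
	if available < 0 {
		return 0
	}
	return available
}

func main() {
	const gib = int64(1) << 30
	// 40 GiB of swap with 2 GiB of system reserved memory.
	fmt.Println(totalPodsSwapAvailable(40*gib, 2*gib) / gib)
}
```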
This section is a recommendation for how to set up your nodes with swap if using this feature.
As we were testing this feature, we found degradation of services when system critical daemons were allowed to swap. This could mean that the kubelet performs slower than normal, so if you experience this, we recommend configuring the cgroup for the system slice to avoid swap (i.e., `memory.swap.max` set to `0`).
While doing this, we found that it is still possible for workloads to impact critical daemons. Even with swap disabled for the system slice, we saw cases where `system.slice` was still impacted by workloads swapping. The workloads need to have lower I/O priority than the system slice. We found that setting `io.latency` for `system.slice` fixes these issues.
See io-control for more details.
We only recommend enabling swap for the worker nodes. The control plane contains mostly Guaranteed QoS Pods, so swap may be disabled for the most part. The main concern would be swapping in the critical services on the control plane which can cause a performance impact.
We recommend using a separate disk for your swap partition, and we recommend that this disk be encrypted. If swap is on a partition or the root filesystem, workloads can interfere with system processes that need to write to disk. If they occupy the same disk, processes can overwhelm swap and disrupt the I/O of the kubelet, container runtime, and systemd, which would affect other workloads. See the `io.latency` recommendation for system critical daemons above for more details. Swap space is located on a disk, so it is imperative to make sure your disk is fast enough for your use cases.
We will turn the feature on for Beta 2 but the default setting will be `NoSwap`.
Enabling swap on nodes is an advanced feature which requires tuning and knowledge of the kernel.
We do not recommend swap on all nodes, so we still suggest `--fail-swap-on=true` for most Kubernetes deployments.
If there is interest in trying out this feature, we suggest provisioning swap space on the worker node, setting `--fail-swap-on=false`, and restarting the kubelet.
- Calculate the container's memory proportionate to the node's memory:
  - Divide the container's memory request by the total node's physical memory. Let's call this value `ContainerMemoryProportion`.
  - If a container is defined with memory requests == memory limits, its `ContainerMemoryProportion` is defined as 0. Therefore, as can be seen below, its overall swap limit is also 0.
- Multiply the container memory proportion by the available swap memory for Pods:
  - Meaning: `ContainerMemoryProportion * TotalPodsSwapAvailable`.
Suppose we have a Burstable QoS pod with two containers:
- Container A: Memory request 20 GB
- Container B: Memory request 10 GB
Let's assume the total physical memory is 40 GB and the total swap memory available is also 40 GB. Also assume that the system reserved memory is configured at 2 GB, so `TotalPodsSwapAvailable` is 38 GB.
Step 1: Determine the containers' memory proportion:
- Container A: `20G/40G` = `0.5`.
- Container B: `10G/40G` = `0.25`.
Step 2: Determine swap limitation for the containers:
- Container A: `ContainerMemoryProportion * TotalPodsSwapAvailable` = `0.5 * 38G` = `19G`.
- Container B: `ContainerMemoryProportion * TotalPodsSwapAvailable` = `0.25 * 38G` = `9.5G`.
In this example, Container A would have a swap limit of 19 GB, and Container B would have a swap limit of 9.5 GB.
This approach allocates swap limits based on each container's memory request and adjusts the proportion based on the total swap memory available in the system. It ensures that each container gets a fair share of the swap space and helps maintain resource allocation efficiency.
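The two steps and the worked example can be sketched in Go as below. The function name and signature are hypothetical, and treating a memory limit of 0 as "no limit set" is an assumption of this sketch rather than documented kubelet behaviour.

```go
package main

import "fmt"

// containerSwapLimit sketches the calculation described above. All quantities
// are in bytes. memoryLimit == 0 is treated as "no limit set" (an assumption).
func containerSwapLimit(memoryRequest, memoryLimit, nodeMemory, totalPodsSwapAvailable int64) int64 {
	if nodeMemory <= 0 || memoryRequest <= 0 {
		return 0
	}
	// Containers with requests == limits get a proportion of 0, hence no swap.
	if memoryLimit != 0 && memoryRequest == memoryLimit {
		return 0
	}
	// Step 1: ContainerMemoryProportion.
	proportion := float64(memoryRequest) / float64(nodeMemory)
	// Step 2: ContainerMemoryProportion * TotalPodsSwapAvailable.
	return int64(proportion * float64(totalPodsSwapAvailable))
}

func main() {
	const gib = int64(1) << 30
	podsSwap := 40*gib - 2*gib // total swap minus system reserved memory
	a := containerSwapLimit(20*gib, 0, 40*gib, podsSwap)
	b := containerSwapLimit(10*gib, 0, 40*gib, podsSwap)
	fmt.Printf("A: %.1f GiB, B: %.1f GiB\n", float64(a)/float64(gib), float64(b)/float64(gib))
}
```

With the example values, this yields 19 GiB for Container A and 9.5 GiB for Container B, matching the figures above.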
Improved memory management tooling built on cgroups v2, such as oomd, strongly recommends the use of swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery.
- https://man7.org/linux/man-pages/man8/systemd-oomd.service.8.html
- https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#id1
- https://chrisdown.name/2018/01/02/in-defence-of-swap.html
- https://media.ccc.de/v/ASG2018-175-oomd
- https://github.com/facebookincubator/oomd/blob/master/docs/production_setup.md#swap
This user story is addressed by scenarios 1 and 2, and could benefit from 3.
- Applications such as the Java and Node runtimes rely on swap for optimal performance kubernetes/kubernetes#53533 (comment)
- Initialization logic of applications can be safely swapped out without affecting long-running application resource usage kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This user story addresses cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments).
- Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to overprovisioning/high costs kubernetes/kubernetes#53533 (comment)
- Lack of swap support would require provisioning 3x the amount of memory as required with swap kubernetes/kubernetes#53533 (comment)
- On-premise deployment can’t horizontally scale available memory based on load kubernetes/kubernetes#53533 (comment)
- Scaling resources is technically feasible but cost-prohibitive, swap provides flexibility at lower cost kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters).
- Single node, local Kubernetes deployment on laptop kubernetes/kubernetes#53533 (comment)
- Linux has optimizations for swap on SSD, allowing for performance boosts kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenarios 1 and 2, and could benefit from 3.
For example, edge devices with limited memory.
- Edge compute systems/devices with small memory footprints (<2Gi) kubernetes/kubernetes#53533 (comment) k0sproject/k0s#3830
- Clusters with nodes <4Gi memory kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
This would apply to virtualized Kubernetes workloads such as VMs launched by kubevirt.
Every VM comes with a management related overhead which can sporadically be pretty significant (memory streaming, SRIOV attachment, gpu attachment, virtio-fs, …). Swap helps to not request much more memory to deal with short term worst-case scenarios.
With virtualization, clusters are typically provisioned based on the workloads’ memory consumption, and any infrastructure container overhead is overcommitted. This overhead could be safely swapped out.
- Required for live migration of VMs kubernetes/kubernetes#53533 (comment)
This user story is addressed by scenario 2, and could benefit from 3.
In updating the CRI, we must ensure that container runtime downstreams are able to support the new configurations.
We considered adding parameters for both per-workload `memory-swap` and `swappiness`. These are documented as part of the Open Containers runtime specification for Linux memory configuration. Since `memory-swap` is a per-workload parameter, and `swappiness` is optional and can be set globally, we are choosing to only expose `memory-swap`, which will adjust swap available to workloads.
Since we are not currently setting `memory-swap` in the CRI, the current default behaviour when `--fail-swap-on=false` is set is to allocate the same amount of swap for a workload as memory requested. We will update the default to not permit the use of swap by setting `memory-swap` equal to `limit`.
Having swap available on a system reduces predictability. Swap's performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system's behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
This risk is mitigated by preventing any workloads from using swap by default, even if swap is enabled and available on a system. This will allow a cluster administrator to test swap utilization just at the system level without introducing unpredictability to workload resource utilization.
Additionally, we will mitigate this risk by determining a set of metrics to quantify system stability and then gathering test and production data to determine if system stability changes when swap is available to the system and/or workloads in a number of different scenarios.
Since swap provisioning is out of scope of this proposal, this enhancement poses low risk to Kubernetes clusters that will not enable swap.
As beta2 was being worked on, we discovered use cases where `--fail-swap-on=false` is used but Kubernetes is not utilizing swap. Kind e2e tests run kubelet with `--fail-swap-on=false`, and the default developer configuration for `hack/local-up-cluster` allows for running developer clusters with swap enabled.
We need to support `--fail-swap-on=false` for both cgroup v1 and cgroup v2. We will not support KEP-2400 with cgroup v1. So when this feature goes GA, we need a way to prevent workloads from using swap while keeping the feature toggle on.
To address this, we propose a new value for `MemorySwap` called `NoSwap`. This will disable swap usage for workloads on the node while keeping the feature active. This addresses existing use cases where `--fail-swap-on=false` is set on cgroup v1 and still allows us to turn this feature on.
In previous releases of swap support, we had an `UnlimitedSwap` option for workloads. This can cause problems where workloads can use up all swap. If all swap is used up on a node, the node can become unhealthy. To avoid exhausting swap on a node, `UnlimitedSwap` was dropped from the API in beta2.
Enabling swap on a system without encryption poses a security risk, as critical information, such as Kubernetes secrets, may be swapped out to the disk. If an unauthorized individual gains access to the disk, they could potentially obtain these secrets. To mitigate this risk, it is recommended to use encrypted swap. However, handling encrypted swap is not within the scope of kubelet; rather, it is a general OS configuration concern and should be addressed at that level. Nevertheless, it is essential to provide documentation that warns users of this potential issue, ensuring they are aware of the potential security implications and can take appropriate steps to safeguard their system.
To guarantee that system daemons are not swapped, the kubelet must configure the `memory.swap.max` setting to `0` within the system reserved cgroup. Moreover, to make sure that burstable pods are able to utilize swap space, the kubelet should verify that the cgroup associated with burstable pods is not nested under the cgroup designated for system reserved.
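A minimal sketch of such a nesting check is shown below, assuming cgroups are identified by slash-separated paths. The helper name and the example cgroup paths are illustrative only and do not reflect how the kubelet performs this verification.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isNestedUnder (hypothetical helper) reports whether the cgroup path child
// is the same as, or located below, the cgroup path parent.
func isNestedUnder(child, parent string) bool {
	rel, err := filepath.Rel(parent, child)
	if err != nil {
		return false
	}
	if rel == "." {
		return true
	}
	// A relative path that climbs out of parent means child is not nested.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	// Example paths are assumptions made for illustration.
	burstable := "/kubepods.slice/kubepods-burstable.slice"
	systemReserved := "/system.slice"
	fmt.Println(isNestedUnder(burstable, systemReserved))
}
```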
Additionally, end users may decide to disable swap completely for a Pod or a container in beta 1 by making the Pod Guaranteed or by setting request == limit for a container. This way, there will be no swap enabled for the corresponding containers and there will be no information exposure risks.
In the early release of this feature, there was a goal to support cgroup v1. As the feature progressed, sig-node realized that supporting swap with cgroup v1 would be very difficult. Therefore, this feature is limited to cgroup v2 only. The main goal is to deprecate cgroup v1 eventually, so this should not be a major inconvenience.
We summarize the implementation plan as follows:
- Add a feature gate `NodeSwap` to enable swap support.
- Leave the default value of kubelet flag `--fail-swap-on` set to `true`, to avoid changing default behaviour.
- Introduce a new kubelet config parameter, `MemorySwap`, which configures how much swap Kubernetes workloads can use on the node.
- Introduce a new CRI parameter, `memory_swap_limit_in_bytes`.
- Ensure container runtimes are updated so they can make use of the new CRI configuration.
- Based on the behaviour set in the kubelet config, the kubelet will instruct the CRI on the amount of swap to allocate to each container. The container runtime will then write the swap settings to the container level cgroup.
- Add node stats to report swap usage.
Swap can be enabled as follows:
- Provision swap on the target worker nodes,
- Enable the `NodeSwap` feature flag on the kubelet,
- Set the `--fail-swap-on` flag to `false`, and
- (Optional) Allow Kubernetes workloads to use swap by setting `MemorySwap.SwapBehavior` to `LimitedSwap` in the kubelet config.
We will add an optional `MemorySwap` value to the `KubeletConfig` struct in pkg/kubelet/apis/config/types.go as follows:
// KubeletConfiguration contains the configuration for the Kubelet
type KubeletConfiguration struct {
	metav1.TypeMeta
	...
	// Configure swap memory available to container workloads.
	// +featureGate=NodeSwap
	// +optional
	MemorySwap MemorySwapConfiguration
}
type MemorySwapConfiguration struct {
	// Configure swap memory available to container workloads. May be one of
	// "", "NoSwap": workload will not use swap
	// "LimitedSwap": workload combined memory and swap usage cannot exceed pod memory limit
	SwapBehavior string
}
We want to expose common swap configurations based on the Docker and open container specification for the `--memory-swap` flag. Thus, the `MemorySwapConfiguration.SwapBehavior` setting will have the following effects:
- If `SwapBehavior` is set to `"LimitedSwap"`, containers do not have access to swap beyond their memory limit. This value prevents a container from using swap in excess of their memory limit, even if it is enabled on a system.
  - With cgroups v2, swap is configured independently from memory. Thus, the container runtimes can set `memory.swap.max` to 0 in this case, and no swap usage will be permitted.
- If `SwapBehavior` is set to `""` or `"NoSwap"`, no workloads will utilize swap.
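As an illustration of the accepted values, the following Go sketch maps a `SwapBehavior` string to whether workloads may use swap, and rejects anything else (including the dropped `UnlimitedSwap`). The helper name is hypothetical and this is not the kubelet's actual validation code.

```go
package main

import "fmt"

const (
	NoSwap      = "NoSwap"
	LimitedSwap = "LimitedSwap"
)

// workloadsMayUseSwap (hypothetical helper) interprets a SwapBehavior value.
// The empty string and "NoSwap" both mean workloads get no swap.
func workloadsMayUseSwap(swapBehavior string) (bool, error) {
	switch swapBehavior {
	case "", NoSwap:
		return false, nil
	case LimitedSwap:
		return true, nil
	default:
		return false, fmt.Errorf("unsupported SwapBehavior %q", swapBehavior)
	}
}

func main() {
	for _, behavior := range []string{"", NoSwap, LimitedSwap, "UnlimitedSwap"} {
		allowed, err := workloadsMayUseSwap(behavior)
		fmt.Printf("%q: allowed=%v err=%v\n", behavior, allowed, err)
	}
}
```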
The CRI requires a corresponding change in order to allow the kubelet to set swap usage in container runtimes. We will introduce a parameter `memory_swap_limit_in_bytes` to the CRI API (found in k8s.io/cri-api/pkg/apis/runtime/v1/api.proto):
// LinuxContainerResources specifies Linux specific configuration for
// resources.
message LinuxContainerResources {
    ...
    // Memory + swap limit in bytes. Default: 0 (not specified).
    int64 memory_swap_limit_in_bytes = 9;
    ...
}
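Because the field carries memory plus swap, while cgroup v2 configures swap independently from memory, a runtime has to convert between the two. The Go sketch below shows one plausible conversion; both function names are hypothetical, and the subtraction is an assumption about how a runtime might derive `memory.swap.max`, not a description of any specific runtime.

```go
package main

import (
	"errors"
	"fmt"
)

// memorySwapLimit returns the combined "memory + swap" value carried by
// memory_swap_limit_in_bytes, given separate limits in bytes.
func memorySwapLimit(memoryLimit, swapLimit int64) int64 {
	return memoryLimit + swapLimit
}

// swapMaxForCgroupV2 derives a swap-only value from the combined limit.
// This conversion is an assumption of the sketch.
func swapMaxForCgroupV2(memorySwap, memoryLimit int64) (int64, error) {
	if memorySwap < memoryLimit {
		return 0, errors.New("memory+swap limit is smaller than the memory limit")
	}
	return memorySwap - memoryLimit, nil
}

func main() {
	const gib = int64(1) << 30
	combined := memorySwapLimit(20*gib, 19*gib)
	swapMax, err := swapMaxForCgroupV2(combined, 20*gib)
	if err != nil {
		panic(err)
	}
	fmt.Println(combined/gib, swapMax/gib)
}
```

Under this conversion, setting the combined value equal to the memory limit yields a swap allowance of zero.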
We added metrics to the summary stats for the Node to report `SwapAvailableBytes` and `SwapUsageBytes`.
type NodeStats struct {
	...
	// Stats pertaining to swap resources. This is reported to non-windows systems only.
	// +optional
	Swap *SwapStats `json:"swap,omitempty"`
}
// SwapStats contains data about memory usage
type SwapStats struct {
	// The time at which these stats were updated.
	Time metav1.Time `json:"time"`
	// Available swap memory for use. This is defined as the <swap-limit> - <current-swap-usage>.
	// If swap limit is undefined, this value is omitted.
	// +optional
	SwapAvailableBytes *uint64 `json:"swapAvailableBytes,omitempty"`
	// Total swap memory in use.
	// +optional
	SwapUsageBytes *uint64 `json:"swapUsageBytes,omitempty"`
}
[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
All existing tests need to pass with and without swap enabled.
This KEP introduces minor additions of memory swap controlling configuration parameters.
- Kubelet configuration parameters are tested in the package `k8s.io/kubernetes/pkg/kubelet/apis/config/validation`
- Passing parameters to runtime is tested in `k8s.io/kubernetes/pkg/kubelet/kuberuntime`
Both packages have near 100% coverage, and the new functionality is covered.
In alpha2, tests will be extended in these packages to support kube-reserved swap settings.
NA.
These tasks require e2e test setup so we did not add any integration tests for this.
For alpha:
- Swap scenarios are enabled in test-infra for at least two Linux
distributions. e2e suites will be run against them.
- Container runtimes must be bumped in CI to use the new CRI.
- Data should be gathered from a number of use cases to guide beta graduation
and further development efforts.
- Focus should be on supported user stories as listed above.
Test grid tabs enabled:
- kubelet-gce-e2e-swap-ubuntu: Green
- kubelet-gce-e2e-swap-ubuntu-serial: Green
- kubelet-gce-e2e-swap-fedora: Green
- kubelet-gce-e2e-swap-fedora-serial: Green
No new e2e tests introduced.
For alpha2:
- Add e2e tests that exercise all available swap configurations via the CRI.
- Verify MemoryPressure behavior with swap enabled and document any changes for configuring eviction.
- Verify new system-reserved settings for swap memory.
For beta 1:
- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs.
- Kubelet can be started with swap enabled and will support two configurations for Kubernetes workloads: `LimitedSwap` and `NoSwap`.
- Kubelet can configure CRI to allocate swap to Kubernetes workloads. By default, workloads will not be allocated any swap.
- e2e test jobs are configured for Linux systems with swap enabled.
In alpha2 the focus will be on making sure that the feature can be used in a subset of production scenarios to collect more feedback before entering beta. Specifically, security and test coverage will be increased, and a new setting that splits swap between the kubelet and workloads will be introduced.
Once the functional work is resolved in alpha, beta will focus on performance and feedback from a wider range of scenarios.
This will allow feedback to be collected reasonably safely from the following scenarios:
- on cgroupv2: allow host system processes to use swap to increase system reliability under memory pressure.
- enable swap for the workload in "single large pod per node" scenarios.
Here are specific improvements to be made:
- Address swap impact on memory-backed volumes: kubernetes/kubernetes#105978.
- Investigate swap security when enabling on system processes on the node.
- Improve coverage for appropriate scenarios in testgrid.
- Add the ability to set a system-reserved quantity of swap from what kubelet detects on the host.
- Consider introducing new configuration modes for swap, such as a node-wide swap limit for workloads.
- Investigate eviction behavior with swap enabled.
- Enable Swap Support using Burstable QoS Pods only.
- Enable Swap Support for Cgroup v2 Only.
- Add swap memory to the Kubelet stats api.
- Determine a set of metrics for node QoS in order to evaluate the performance of nodes with and without swap enabled.
- Make sure node e2e jobs that use swap are healthy
- Improve coverage for appropriate scenarios in testgrid.
- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
- Add swap-specific tests, such as handling the usage of swap across container restart boundaries for writes to tmpfs (which may require pod cgroup changes beyond what the container runtime will do at the container cgroup boundary).
- Fix flaking/failing swap node e2e jobs.
- Address eviction related issue in swap implementation.
- Add `NoSwap` as the default setting.
- Remove `UnlimitedSwap` as a supported option.
- Add e2e test confirming that `NoSwap` will actually not swap.
- Add e2e test confirming that swap is used for `LimitedSwap`.
- Document best practices for setting up Kubernetes with swap.
(Tentative.)
- Test a wide variety of scenarios that may be affected by swap support.
- Remove feature flag.
- Remove the Swap Support using Burstable QoS Pods only deprecated in Beta 2.
No changes are required on upgrade to maintain previous behaviour.
It is possible to downgrade a kubelet on a node that was using swap, but this would require disabling the use of swap and running `swapoff` on the node.
Feature flag will apply to kubelet only, so version skew strategy is N/A.
- Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: NodeSwap
  - Components depending on the feature gate: API Server, Kubelet
- Other
  - Describe the mechanism: `--fail-swap-on=false` flag for kubelet must also be set at kubelet start
  - Will enabling / disabling the feature require downtime of the control plane? Yes. Flag must be set on kubelet start. To disable, kubelet must be restarted. Hence, there would be brief control component downtime on a given node.
  - Will enabling / disabling the feature require downtime or reprovisioning of a node? Yes. See above; disabling would require brief node downtime.
No. If the feature flag is enabled, the user must still set `--fail-swap-on=false` to adjust the default behaviour.
A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning the node, they could stop the kubelet, run `swapoff` on the node, and restart the kubelet with `--fail-swap-on=true`. The setting of the feature flag will be ignored in this case.
In Beta2, we realized that we cannot rely on `--fail-swap-on=false` as a flag for this feature. The flag predates this feature and it has been used over time. We propose a configuration in `MemorySwap` called `NoSwap`.
Users could also set `NoSwap` in `MemorySwap` to prevent all workloads from using swap without having to disable swap on the node.
In Beta releases of this feature, one could turn off the `NodeSwap` feature toggle, but once this feature is GA, users could use another option to disable swap for workloads.
N/A
N/A. This should be tested separately for scenarios with the flag enabled and disabled.
If a new node with swap memory fails to come online, it will not impact any running components.
It is possible that if a cluster administrator adds swap memory to an already running node, and then performs an in-place upgrade, the new kubelet could fail to start unless the configuration was modified to tolerate swap. However, we would expect that if a cluster admin is adding swap to the node, they will also update the kubelet's configuration to not fail with swap present.
Generally, it is considered best practice to add a swap memory partition at node image/boot time and not provision it dynamically after a kubelet is already running and reporting Ready on a node.
Workload churn or performance degradations on nodes. The metrics will be application/use-case specific, but we can provide some suggestions, based on the stability metrics identified earlier.
N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must be restarted with or without swap support.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
See #swap-metrics
- Kubelet stats API will be extended to show swap usage details.
KubeletConfiguration has set `failSwapOn: false`.
The Prometheus `node_exporter` will also export stats on swap memory utilization.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
TBD. We will determine a set of metrics as a requirement for beta graduation. We will need more production data; there is not a single metric or set of metrics that can be used to generally quantify node performance.
This section to be updated before the feature can be marked as graduated, and to be worked on during 1.23 development.
We will also add swap memory utilization to the Kubelet stats API, to provide a means of monitoring this beyond cadvisor Prometheus stats.
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
N/A
Are there any missing metrics that would be useful to have to improve observability of this feature?
We added metrics to the node stats to report how much swap is used and the capacity of swap.
No.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
No.
No.
No.
The KubeletConfig API object may slightly increase in size due to new config fields.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Yes, enabling swap can affect the performance of other critical daemons on the system. Any scenario where swap memory gets utilized is a result of the system running out of physical RAM. Hence, to maintain the SLIs/SLOs of critical daemons on the node, we highly recommend disabling swap for the system.slice and reserving adequate system reserved memory.
The SLI that could potentially be impacted is pod startup latency. If the container runtime or kubelet are performing slower than expected, pod startup latency would be impacted. In addition to this SLI, general areas around pod lifecycle (image pulls, sandbox creation, storage) could become slow.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Yes. It will permit the utilization of swap memory (i.e. disk) on nodes. This is expected, as this enhancement is enabling cluster administrators to access this resource.
No change. Feature is specific to individual nodes.
Individual nodes with swap memory enabled may experience performance degradations under load. This could potentially cause a cascading failure on nodes without swap: if nodes with swap fail Ready checks, workloads may be rescheduled en masse.
Thus, cluster administrators should be careful while enabling swap. To minimize disruption, you may want to taint nodes with swap available to protect against this problem. Taints will ensure that workloads which tolerate swap will not spill onto nodes without swap under load.
It is suggested that if nodes with swap memory enabled cause performance or stability degradations, those nodes are cordoned, drained, and replaced with nodes that do not use swap memory.
- 2015-04-24: Discussed in #7294.
- 2017-10-06: Discussed in #53533.
- 2021-01-05: Initial design discussion document for swap support and use cases.
- 2021-04-05: Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
- 2021-08-09: New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/.
- 2023-04-17: KEP update for beta1 #3957.
- 2023-08-15: Beta1 released in kubernetes 1.28
- 2024-01-12: Updates to Beta2 KEP.
When swap is enabled, particularly for workloads, the kubelet’s resource accounting may become much less accurate. This may make cluster administration more difficult and less predictable.
Currently, there exists an unsupported workaround, which is setting the kubelet flag `--fail-swap-on` to false.
This is insufficient for most use cases because there is inconsistent control over how swap will be used by various container runtimes. Dockershim currently sets swap available for workloads to 0. The CRI does not restrict it at all. This inconsistency makes it difficult or impossible to use swap in production, particularly if a user wants to restrict workloads from using swap when using the CRI rather than dockershim.
This is also a breaking change. Users have used --fail-swap-on=false to allow for kubernetes to run on a swap enabled node.
Setting a swap limit at the cgroup level would allow us to restrict the usage of swap on a pod-level, rather than container-level basis.
For alpha, we are opting for the container-level basis to simplify the implementation (as the container runtimes already support configuration of swap with the `memory-swap-limit` parameter). This will also provide the necessary plumbing for container-level accounting of swap, if that is proposed in the future.
In beta, we may want to revisit this.
See the Pod Resource Management design proposal for more background on the cgroup limits the kubelet currently sets based on each QoS class.
We may need Linux VM images built with swap partitions for e2e testing in CI.