[KEP-2400] Address swap impact on memory-backed volumes #105978

Closed
Tracked by #2400
liggitt opened this issue Oct 28, 2021 · 35 comments · Fixed by #124060
Labels
kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@liggitt
Member

liggitt commented Oct 28, 2021

Some volumes (notably secret volumes) are memory-backed in order to ensure they do not hit disk.

When swap is enabled, we need to ensure the content of memory-backed volumes continues not to touch disk.

  • On cgroupsv2, the kubelet should set cgroupv2 controls to prevent memory-backed volumes from swapping out (see the sketch below)
  • On cgroupsv1, where these controls are not available
    • document (in PRR / KEP / docs) the impact of running with swap on memory-backed volumes
    • consider not supporting swap on cgroupsv1 because of this impact?
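
As a concrete illustration of the cgroupv2 route, a minimal sketch assuming a hypothetical pod cgroup path (the real kubelet would plumb this through its cgroup manager rather than writing the interface file directly):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// disableSwapForPodCgroup writes "0" to the cgroupv2 memory.swap.max
// interface file, which forbids the kernel from swapping out pages
// charged to that cgroup, including tmpfs pages of memory-backed volumes.
func disableSwapForPodCgroup(podCgroupDir string) error {
	return os.WriteFile(filepath.Join(podCgroupDir, "memory.swap.max"), []byte("0"), 0o644)
}

func main() {
	// Hypothetical path; real paths depend on the cgroup driver and pod QoS class.
	if err := disableSwapForPodCgroup("/sys/fs/cgroup/kubepods.slice/kubepods-pod123.slice"); err != nil {
		log.Fatal(err)
	}
}
```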

/kind feature
/sig node storage
/assign @ehashman

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 28, 2021
@mmdriley
Contributor

I'm not exactly sure what cgroup controls are being gestured at, but it seems like the generic answer here would be encrypted swap with an ephemeral in-memory key. Unfortunately the details of how to set that up are aggressively distro-specific.

@ehashman
Member

ehashman commented Dec 2, 2021

/kind documentation
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 2, 2021
@ehashman
Member

/unassign
/priority important-soon
/milestone v1.24

Clearing assignees for this release to ensure we can distribute the load.

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Jan 18, 2022
@249043822
Member

Some volumes (notably secret volumes) are memory-backed in order to ensure they do not hit disk.

When swap is enabled, we need to ensure the content of memory-backed volumes continues not to touch disk.

  • On cgroupsv2, the kubelet should set cgroupv2 controls to prevent memory-backed volumes from swapping out

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

@liggitt
Member Author

liggitt commented Jan 20, 2022

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

I'm not sure. Is node host configuration possible to do once, outside the kubelet, or does it have to be applied for every new pod/container?

@249043822
Member

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

I'm not sure. Is node host configuration possible to do once, outside the kubelet, or does it have to be applied for every new pod/container?

We also haven't yet found a cgroupv2 config that could prevent tmpfs from swapping out

@mmiranda96
Contributor

/cc

@liggitt
Member Author

liggitt commented Feb 15, 2022

I see tmpfs mentioned in the beta graduation criteria for the swap feature (https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#beta), but I can't tell if this covers preventing memory-backed volumes from getting swapped to disk:

Handle usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what container runtime will do at container cgroup boundary).

edit: oh, it looks like this issue is referenced from the beta task list in kubernetes/enhancements#2400, so I'm guessing it does

@helayoty
Member

@liggitt 👋 from the release 1.24 bug triage shadow. Does this issue still target release 1.24?

@liggitt
Member Author

liggitt commented Mar 21, 2022

That's more a question for @ehashman; I'm not sure what milestone completion of swap-to-beta is planned for

@ehashman
Member

/milestone clear

Per kubernetes/enhancements#2400 (comment), this was removed from the milestone

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Mar 28, 2022
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 5, 2022
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 7, 2023
@liggitt
Member Author

liggitt commented Sep 21, 2023

I don't have a good sense for whether disk encryption actually helps here... if a swap file is written to an encrypted disk, while the disk is mounted, are fewer privileges required to inspect swap content than would be required to inspect the memory backing a tmpfs volume (or the kubelet memory)?

@iholder101
Contributor

Hey @liggitt! Circling back to this.

So first, I'd like to understand the scope of this. Do we specifically care about secrets? Or do we also care about other memory-backed volumes like emptyDirs?

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space

Do you think that's a reasonable path forward?

[1] https://en.wikipedia.org/wiki/Tmpfs#Linux
[2] https://unix.stackexchange.com/a/50486
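
To make the comparison concrete, a minimal sketch of the ramfs variant at the mount level (illustrative only: requires root, and the mount point is hypothetical). One caveat worth noting up front is that ramfs ignores size= mount options, so such a volume would not be capped:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	target := "/mnt/secret" // hypothetical mount point
	if err := os.MkdirAll(target, 0o700); err != nil {
		log.Fatal(err)
	}
	// ramfs pages live purely in the page cache and are never swapped out;
	// unlike tmpfs, however, ramfs does not enforce any size limit.
	if err := unix.Mount("ramfs", target, "ramfs", 0, ""); err != nil {
		log.Fatal(err)
	}
}
```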

@liggitt
Member Author

liggitt commented Mar 20, 2024

So first, I'd like to understand the scope of this. Do we specifically care about secrets?

I specifically care about

  • tmpfs volumes containing known credentials:
    • secrets
    • projected service account tokens
  • not breaking expectations of users who explicitly said to use the memory medium with their emptyDir volume
    • emptyDir exposes control over the medium via the REST API, so end-users can choose to place their emptydir in memory rather than disk
    • some might choose to do this for speed
    • others might do this to prevent content written to the volume from being persisted to disk

It looks like all downward API / projected volumes are configured with StorageMediumMemory, though I'm not sure that's required for configmap / non-token projected volumes / metadata downward-API volumes.

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space

Do you think that's a reasonable path forward?

I don't know enough about the implications to know (is it as widely supported as tmpfs? does it have other performance or functional tradeoffs? does it behave differently than tmpfs in ways that would be observable to the consumer?), so I'll defer to node and storage folks, but it sounds like a possibility at first glance.
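
As a reference for the emptyDir medium control mentioned above, a hedged illustration using the k8s.io/api types (the volume name is made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	vol := corev1.Volume{
		Name: "scratch", // hypothetical volume name
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{
				// StorageMediumMemory asks the kubelet for a tmpfs mount
				// instead of node-local disk.
				Medium: corev1.StorageMediumMemory,
			},
		},
	}
	fmt.Printf("%+v\n", vol)
}
```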

@iholder101
Contributor

So first, I'd like to understand the scope of this. Do we specifically care about secrets?

I specifically care about

  • tmpfs volumes containing known credentials:

    • secrets
    • projected service account tokens
  • not breaking expectations of users who explicitly said to use the memory medium with their emptyDir volume

    • emptyDir exposes control over the medium via the REST API, so end-users can choose to place their emptydir in memory rather than disk
    • some might choose to do this for speed
    • others might do this to prevent content written to the volume from being persisted to disk

It looks like all downward API / projected volumes are configured with StorageMediumMemory, though I'm not sure that's required for configmap / non-token projected volumes / metadata downward-API volumes.

Thanks for clarifying!

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space
Do you think that's a reasonable path forward?

I don't know enough about the implications to know (is it as widely supported as tmpfs? does it have other performance or functional tradeoffs? does it behave differently than tmpfs in ways that would be observable to the consumer?), so I'll defer to node and storage folks, but it sounds like a possibility at first glance.

Great, I'll look deeper into it then.

@kannon92
Contributor

We talked about this on slack. It looks like there is a mount option for tmpfs to say that you don't want swap.

https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html

Podman added support for this: containers/podman#19659.

/sig storage

I think it would make sense to add the noswap option to the mounts for tmpfs. But I think we should bring this to sig-storage and see what they think.

It looks to be a newer kernel feature (https://lore.kernel.org/all/20230207025259.2522793-1-mcgrof@kernel.org/T/#u).
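
For illustration, a minimal sketch of what such a mount could look like, assuming a recent kernel with tmpfs noswap support and a hypothetical mount point (this is not the kubelet's actual mount path or code):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	target := "/mnt/secret" // hypothetical mount point
	if err := os.MkdirAll(target, 0o700); err != nil {
		log.Fatal(err)
	}
	// "noswap" is passed as filesystem-specific mount data; kernels
	// without support reject the unknown option and the mount fails.
	if err := unix.Mount("tmpfs", target, "tmpfs", 0, "noswap,size=64m"); err != nil {
		log.Fatalf("tmpfs noswap mount failed (kernel lacks support?): %v", err)
	}
}
```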

@iholder101
Contributor

We talked about this on slack. It looks like there is a mount option for tmpfs to say that you don't want swap.

kernel.org/doc/html/latest/filesystems/tmpfs.html

Podman added support for this: containers/podman#19659.

/sig storage

I think it would make sense to add the noswap option to the mounts for tmpfs. But I think we should bring this to sig-storage and see what they think.

It looks to be a newer kernel feature (lore.kernel.org/all/20230207025259.2522793-1-mcgrof@kernel.org/T/#u).

Thank you for summarizing the Slack conversation.
I'm working on a PR which will be ready soon. I'll also raise this in the sig-storage Slack to get feedback.

@mattcary
Contributor

IIUC ramfs does not limit the volume size (ref). So a process writing to the ram drive can consume all ram in the system.

Using brd (block ram device) might be another option but I don't know how it behaves for swap.

@kwilczynski
Member

IIUC ramfs does not limit the volume size (ref). So a process writing to the ram drive can consume all ram in the system.

Using brd (block ram device) might be another option but I don't know how it behaves for swap.

Neither brd nor zram (if you wanted to leverage compression) would have its backing pages swapped out. Thus, despite brd's inherent complexity, it would fulfil this requirement.

@fabiand
Contributor

fabiand commented Apr 9, 2024

I don't have a good sense for whether disk encryption actually helps here... if a swap file is written to an encrypted disk, while the disk is mounted, are fewer privileges required to inspect swap content than would be required to inspect the memory backing a tmpfs volume (or the kubelet memory)?

I think we need to differentiate between the "online" (system is running, logged in as root) and "offline" (admin took the disk) cases.

In the online case, root can grep /dev/mem as well as /dev/[vs]d*.
In the online case, non-root usually cannot grep /dev/mem, nor /dev/[vs]d*.
Encryption should not make a difference in the online case (root can still grep the encrypted block device).

In the offline case, thanks to encryption, an admin cannot read the plaintext of the disk unless they have the encryption key.

ramfs

IMHO switching to ramfs is too invasive a change just to avoid swap, especially knowing that tmpfs is gaining noswap support.

tmpfs for performance

If anybody has opted in to swap, then that consumer is already accepting a compromise on performance.
Thus to me, the risk of a performance impact on tmpfs is acceptable, because the user accepted it as soon as they turned on swap.

tl;dr: to me the beta2 scope is a good foundation, and valuable to our (KubeVirt) use cases. We'd really like to get it GA'ed.
It's clear that there are additional challenges, like this noswap-for-tmpfs issue for volumes, as well as other issues, e.g. around stressed clusters with swap enabled, which we should address in follow-up swap work (which we are committed to).

@deads2k
Contributor

deads2k commented Apr 10, 2024

The expectation when using a kubernetes cluster is that some volume types are backed by RAM. This expectation is reflected in administrative guides and development guides. While it is possible that some cluster-admins and cluster-operators would see the potential exposure and then dig deeply enough to determine whether they are exposed, it is likely that most will not.

We know the path that is required to safely enable swap on nodes, while avoiding this exposure for the core platform. While I understand the urge to enable features as quickly as possible, enabling features that are needlessly less generally safe erodes upgrade trust. Let's take the path of building the functionality needed to keep cluster security profiles the same over upgrades and ensure our kubelets fail safely (refuse to run pods) when that cannot be achieved.

@fabiand
Contributor

fabiand commented Apr 11, 2024

I agree that expectations about volume types would be undermined.

@kannon92
Contributor

kannon92 commented Apr 11, 2024

So, predating this feature, we have --fail-swap-on=false|true, which allows users to deploy the kubelet and not fail on swap. KEP-2400 did not introduce this flag, and we have found a lot more cases than we thought where this flag was set to false. Our CI sets it, and a lot of developer machines allow this (our own hack/local-up-cluster.sh).

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

I think we need to clearly state what our intention is for --fail-swap-on=false and tmpfs noswap. If I understand your point correctly (@deads2k), we want to only enable swap on nodes if the tmpfs noswap option is available in the kernel. If not, we should fail to start the kubelet.

I wonder if we should consider a separate KEP to target both cases (KEP-2400 and legacy swap enablement).

@haircommander
Contributor

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

small correction: the kernel would swap out memory, but yes, this is already a problem for nodes that run with --fail-swap-on=false. Good point

Would folks be open to a swapBehavior option like LimitedSwapInsecure / NoSwapInsecure (name pending), or something that doesn't require the tmpfs noswap option to be available? I think we as a community should take the stance that tmpfs noswap should be required to use swap, but there are many non-production environments that would want to continue to use swap (kind being one) and maybe wouldn't care to ensure their secrets aren't being swapped.

@BenTheElder
Member

BenTheElder commented Apr 18, 2024

[...] but there are many non-production environments that would want to continue to use swap (kind being one) and maybe wouldn't care to ensure their secrets aren't being swapped.

I can't speak for every user, but kind is not intended for production, and personally I agree: we prefer the cluster to function on as many hosts as we can, even if the functionality is less secure.

The reason we enable --fail-swap-on=false is that so many developer environments have swap enabled, and resource limits are already one of the few things that just don't quite work reliably in the nested container environment yet. It's expected that you can otherwise get, say, 90% of the way to testing controllers and core k8s logic (node_e2e aside).

My past understanding was that this wasn't even necessarily a well-supported option, but it enabled you to run on a host with swap; IIRC we log a warning on kubelet startup if you opt into this flag.

Would folks be open to a swapBehavior option like LimitedSwapInsecure / NoSwapInsecure (name pending), or something that doesn't require the tmpfs noswap option to be available?

For KIND & local-up-cluster.sh (cc @dims) I think we're fine adopting a new option, but what about other users of fail-swap-on? I think generally Kubernetes shouldn't require new configuration just to upgrade safely...?

For the feature of "swap is permitted by default" without setting this flag, it seems like we should require noswap, and possibly allow an explicit opt-in to permit it anyway with a warning.

@BenTheElder
Member

I think minikube also does this at least in the "docker" / "podman" driver modes for the same reasons. FYI @medyagh @afbjorklund

@afbjorklund

Right, I don't think minikube expects to be swapping, even if it does allow running with swap enabled on the machine

It (--fail-swap-on=false) has been enabled since 2017, and with kubeadm it uses --ignore-preflight-errors=Swap

@iholder101
Contributor

iholder101 commented Apr 21, 2024

So, predating this feature, we have --fail-swap-on=false|true, which allows users to deploy the kubelet and not fail on swap. KEP-2400 did not introduce this flag, and we have found a lot more cases than we thought where this flag was set to false. Our CI sets it, and a lot of developer machines allow this (our own hack/local-up-cluster.sh).

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

Yes, the kernel could indeed swap memory-backed volumes if --fail-swap-on=false, even if NodeSwap is not being used.

I think this is an important point.
The fact that the flag has existed for a long time and was not introduced by the NodeSwap feature makes it somewhat out of scope for the progress of the NodeSwap feature. This is of course not entirely true: there is a connection between the two, as NodeSwap cannot be used without --fail-swap-on=false, but the security issue we're discussing has existed for a long time and can be relevant even if NodeSwap is not enabled.

For this reason I think it's reasonable to consider improving --fail-swap-on's behavior instead of introducing a new safety mechanism to the NodeSwap feature. We can do so by using the tmpfs noswap option on a best-effort basis (a detection sketch follows at the end of this comment):

  • If --fail-swap-on=false is used and tmpfs noswap is not supported: fire a warning and move on.
  • If --fail-swap-on=false is used and tmpfs noswap is supported: use the tmpfs noswap option to mount memory-backed volumes.
    • It's important to note that eventually the vast majority of environments would use this option, as tmpfs noswap becomes widely supported.

This way we improve the existing behavior without breaking changes, without intermediary API changes, and without forcing swap users to adapt. In the future, when tmpfs noswap is widely supported, we may consider deprecating --fail-swap-on entirely and using the tmpfs noswap option unconditionally. IMHO we could track this effort with a new KEP that focuses on the --fail-swap-on flag and would not block NodeSwap, as they are essentially two different "features" / APIs.

I would appreciate feedback on this approach.
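
A hedged sketch of that best-effort detection (all names are illustrative, not the actual kubelet implementation): probe once at startup with a scratch mount, then decide.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// tmpfsNoswapSupported probes for kernel support by attempting a tiny
// noswap tmpfs mount in a scratch directory; kernels without support
// reject the unknown option, so the mount call returns an error.
func tmpfsNoswapSupported() bool {
	dir, err := os.MkdirTemp("", "noswap-probe")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)

	if err := unix.Mount("tmpfs", dir, "tmpfs", 0, "noswap,size=4k"); err != nil {
		return false
	}
	_ = unix.Unmount(dir, 0)
	return true
}

func main() {
	if tmpfsNoswapSupported() {
		log.Println("mounting memory-backed volumes with the tmpfs noswap option")
	} else {
		log.Println("warning: tmpfs noswap is unsupported; memory-backed volumes may swap")
	}
}
```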

@kannon92
Contributor

  • If --fail-swap-on=false is used and tmpfs noswap is not supported: fire a warning and move on.
  • If --fail-swap-on=false is used and tmpfs noswap is supported: use the tmpfs noswap option to mount memory-backed volumes.

+1, but I think we should also detect whether swap is enabled on the node (fail-swap-on=false + swap present on the node); fail-swap-on could be false while the node doesn't actually have swap on it.

In your "not supported" case, I think you mean to try the mount command and, if it fails, fall back to mounting tmpfs without the noswap option.

@liggitt
Member Author

liggitt commented Apr 25, 2024

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |
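
For clarity, the table's logic as a hedged Go sketch (names are illustrative; --fail-swap-on defaults to true when unset, so the first two rows coincide):

```go
package main

import (
	"errors"
	"fmt"
)

// startupSwapCheck mirrors the table above. failSwapOn is the effective
// flag value (it defaults to true when unset), while swapOn and
// noswapSupported are detected on the node at kubelet startup.
func startupSwapCheck(failSwapOn, swapOn, noswapSupported bool) (useNoswap bool, err error) {
	switch {
	case !swapOn:
		return false, nil // swap off: nothing to do
	case failSwapOn:
		return false, errors.New("running with swap on is not supported, please disable swap")
	case noswapSupported:
		return true, nil // mount memory-backed volumes with noswap
	default:
		fmt.Println("warning: tmpfs noswap is unsupported; memory-backed volumes may swap")
		return false, nil
	}
}

func main() {
	useNoswap, err := startupSwapCheck(false, true, true)
	fmt.Println(useNoswap, err) // true <nil>
}
```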

@iholder101
Contributor

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |

Awesome 👍 if there are no objections I'll start working on a PR.
In the future, when tmpfs noswap is widely supported, we should revisit something similar to #124060, which always mounts with the noswap option.

@iholder101
Contributor

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |

@liggitt I've reworked #124060 to follow this approach. Can you PTAL?
