[KEP-2400] Address swap impact on memory-backed volumes #105978

Closed
Tracked by #2400
liggitt opened this issue Oct 28, 2021 · 35 comments · Fixed by #124060
Labels
kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments

@liggitt
Member

liggitt commented Oct 28, 2021

Some volumes (notably secret volumes) are memory-backed in order to ensure they do not hit disk.

When swap is enabled, we need to ensure the content of memory-backed volumes continues not to touch disk.

  • On cgroupsv2, the kubelet should set cgroupv2 controls to prevent memory-backed volumes from swapping out (see the sketch below)
  • On cgroupsv1, where these controls are not available
    • document (in PRR / KEP / docs) the impact of running with swap on memory-backed volumes
    • consider not supporting swap on cgroupsv1 because of this impact?
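
As a concrete illustration of the cgroupv2 route, a minimal sketch assuming a hypothetical pod cgroup path (the real kubelet would plumb this through its cgroup manager rather than writing the interface file directly):

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// disableSwapForPodCgroup writes "0" to the cgroupv2 memory.swap.max
// interface file, which forbids the kernel from swapping out pages
// charged to that cgroup, including tmpfs pages of memory-backed volumes.
func disableSwapForPodCgroup(podCgroupDir string) error {
	return os.WriteFile(filepath.Join(podCgroupDir, "memory.swap.max"), []byte("0"), 0o644)
}

func main() {
	// Hypothetical path; real paths depend on the cgroup driver and pod QoS class.
	if err := disableSwapForPodCgroup("/sys/fs/cgroup/kubepods.slice/kubepods-pod123.slice"); err != nil {
		log.Fatal(err)
	}
}
```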

/kind feature
/sig node storage
/assign @ehashman

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 28, 2021
@mmdriley
Contributor

I'm not exactly sure what cgroup controls are being gestured at, but it seems like the generic answer here would be encrypted swap with an ephemeral in-memory key. Unfortunately the details of how to set that up are aggressively distro-specific.

@ehashman
Member

ehashman commented Dec 2, 2021

/kind documentation
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 2, 2021
@ehashman
Member

/unassign
/priority important-soon
/milestone v1.24

Clearing assignees for this release to ensure we can distribute the load.

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 18, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Jan 18, 2022
@249043822
Member

Some volumes (notably secret volumes) are memory-backed in order to ensure they do not hit disk.

When swap is enabled, we need to ensure the content of memory-backed volumes continues not to touch disk.

  • On cgroupsv2, the kubelet should set cgroupv2 controls to prevent memory-backed volumes from swapping out

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

@liggitt
Member Author

liggitt commented Jan 20, 2022

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

I'm not sure. Is node host configuration possible to do once, outside the kubelet, or does it have to be applied for every new pod/container?

@249043822
Member

@liggitt Do the controls for preventing memory-backed volumes from swapping out need to be implemented in kubelet code, or should they be documented for users to configure themselves on node hosts?

I'm not sure. Is node host configuration possible to do once, outside the kubelet, or does it have to be applied for every new pod/container?

We also haven't yet found a cgroupv2 config that could prevent tmpfs from swapping out

@mmiranda96
Contributor

/cc

@liggitt
Member Author

liggitt commented Feb 15, 2022

I see tmpfs mentioned in the beta graduation criteria for the swap feature (https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#beta), but I can't tell if this covers preventing memory-backed volumes from getting swapped to disk:

Handle usage of swap during container restart boundaries for writes to tmpfs (which may require pod cgroup change beyond what container runtime will do at container cgroup boundary).

edit: oh, it looks like this issue is referenced from the beta task list in kubernetes/enhancements#2400, so I'm guessing it does

@helayoty
Member

@liggitt 👋 from the release 1.24 bug triage shadow. Does this issue still target release 1.24?

@liggitt
Member Author

liggitt commented Mar 21, 2022

That's more a question for @ehashman; I'm not sure what milestone completion of swap-to-beta is planned for

@ehashman
Member

/milestone clear

Per kubernetes/enhancements#2400 (comment), this was removed from the milestone

@k8s-ci-robot k8s-ci-robot removed this from the v1.24 milestone Mar 28, 2022
@liggitt liggitt added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 5, 2022
@k8s-triage-robot

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Feb 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 7, 2023
@liggitt
Member Author

liggitt commented Sep 21, 2023

I don't have a good sense for whether disk encryption actually helps here... if a swap file is written to an encrypted disk, while the disk is mounted, are fewer privileges required to inspect swap content than would be required to inspect the memory backing a tmpfs volume (or the kubelet memory)?

@iholder101
Contributor

Hey @liggitt! Circling back to this.

So first, I'd like to understand the scope of this. Do we specifically care about secrets? Or do we also care about other memory-backed volumes like emptyDirs?

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space

Do you think that's a reasonable path forward?

[1] https://en.wikipedia.org/wiki/Tmpfs#Linux
[2] https://unix.stackexchange.com/a/50486
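
To make the comparison concrete, a minimal sketch of the ramfs variant at the mount level (illustrative only: requires root, and the mount point is hypothetical). One caveat worth noting up front is that ramfs ignores size= mount options, so such a volume would not be capped:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	target := "/mnt/secret" // hypothetical mount point
	if err := os.MkdirAll(target, 0o700); err != nil {
		log.Fatal(err)
	}
	// ramfs pages live purely in the page cache and are never swapped out;
	// unlike tmpfs, however, ramfs does not enforce any size limit.
	if err := unix.Mount("ramfs", target, "ramfs", 0, ""); err != nil {
		log.Fatal(err)
	}
}
```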

@liggitt
Member Author

liggitt commented Mar 20, 2024

So first, I'd like to understand the scope of this. Do we specifically care about secrets?

I specifically care about

  • tmpfs volumes containing known credentials:
    • secrets
    • projected service account tokens
  • not breaking expectations of users who explicitly said to use the memory medium with their emptyDir volume
    • emptyDir exposes control over the medium via the REST API, so end-users can choose to place their emptydir in memory rather than disk
    • some might choose to do this for speed
    • others might do this to prevent content written to the volume from being persisted to disk

It looks like all downward API / projected volumes are configured with StorageMediumMemory, though I'm not sure that's required for configmap / non-token projected volumes / metadata downward-API volumes.

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space

Do you think that's a reasonable path forward?

I don't know enough about the implications to know (is it as widely supported as tmpfs? does it have other performance or functional tradeoffs? does it behave differently than tmpfs in ways that would be observable to the consumer?), so I'll defer to node and storage folks, but it sounds like a possibility at first glance.
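
As a reference for the emptyDir medium control mentioned above, a hedged illustration using the k8s.io/api types (the volume name is made up):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	vol := corev1.Volume{
		Name: "scratch", // hypothetical volume name
		VolumeSource: corev1.VolumeSource{
			EmptyDir: &corev1.EmptyDirVolumeSource{
				// StorageMediumMemory asks the kubelet for a tmpfs mount
				// instead of node-local disk.
				Medium: corev1.StorageMediumMemory,
			},
		},
	}
	fmt.Printf("%+v\n", vol)
}
```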

@iholder101
Contributor

So first, I'd like to understand the scope of this. Do we specifically care about secrets?

I specifically care about

  • tmpfs volumes containing known credentials:

    • secrets
    • projected service account tokens
  • not breaking expectations of users who explicitly said to use the memory medium with their emptyDir volume

    • emptyDir exposes control over the medium via the REST API, so end-users can choose to place their emptydir in memory rather than disk
    • some might choose to do this for speed
    • others might do this to prevent content written to the volume from being persisted to disk

It looks like all downward API / projected volumes are configured with StorageMediumMemory, though I'm not sure that's required for configmap / non-token projected volumes / metadata downward-API volumes.

Thanks for clarifying!

One possible path forward is using ramfs instead of tmpfs since it seems that ramfs doesn't allow swapping [1][2]. As Wikipedia claims:
Linux tmpfs is based on the ramfs code used during bootup and also uses the page cache, but unlike ramfs it supports swapping out less-used pages to swap space
Do you think that's a reasonable path forward?

I don't know enough about the implications to know (is it as widely supported as tmpfs? does it have other performance or functional tradeoffs? does it behave differently than tmpfs in ways that would be observable to the consumer?), so I'll defer to node and storage folks, but it sounds like a possibility at first glance.

Great, I'll look deeper into it then.

@kannon92
Contributor

We talked about this on slack. It looks like there is a mount option for tmpfs to say that you don't want swap.

https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html

Podman added support for this: containers/podman#19659.

/sig storage

I think it would make sense to add the noswap option to the mounts for tmpfs. But I think we should bring this to sig-storage and see what they think.

It looks to be a newer kernel feature (https://lore.kernel.org/all/20230207025259.2522793-1-mcgrof@kernel.org/T/#u).
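
For illustration, a minimal sketch of what such a mount could look like, assuming a recent kernel with tmpfs noswap support and a hypothetical mount point (this is not the kubelet's actual mount path or code):

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	target := "/mnt/secret" // hypothetical mount point
	if err := os.MkdirAll(target, 0o700); err != nil {
		log.Fatal(err)
	}
	// "noswap" is passed as filesystem-specific mount data; kernels
	// without support reject the unknown option and the mount fails.
	if err := unix.Mount("tmpfs", target, "tmpfs", 0, "noswap,size=64m"); err != nil {
		log.Fatalf("tmpfs noswap mount failed (kernel lacks support?): %v", err)
	}
}
```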

@iholder101
Contributor

We talked about this on slack. It looks like there is a mount option for tmpfs to say that you don't want swap.

kernel.org/doc/html/latest/filesystems/tmpfs.html

Podman added support for this: containers/podman#19659.

/sig storage

I think it would make sense to add the noswap option to the mounts for tmpfs. But I think we should bring this to sig-storage and see what they think.

It looks to be a newer kernel feature (lore.kernel.org/all/20230207025259.2522793-1-mcgrof@kernel.org/T/#u).

Thank you for summarizing the Slack conversation.
I'm working on a PR which will be ready soon. I'll also raise this in the sig-storage Slack to get feedback.

@mattcary
Contributor

IIUC ramfs does not limit the volume size (ref). So a process writing to the ram drive can consume all ram in the system.

Using brd (block ram device) might be another option but I don't know how it behaves for swap.

@kwilczynski
Member

IIUC ramfs does not limit the volume size (ref). So a process writing to the ram drive can consume all ram in the system.

Using brd (block ram device) might be another option but I don't know how it behaves for swap.

Neither brd nor zram (if you wanted to leverage compression) would have its backing pages swapped out. Thus, despite brd's inherent complexity, it would fulfil this requirement.

@fabiand
Contributor

fabiand commented Apr 9, 2024

I don't have a good sense for whether disk encryption actually helps here... if a swap file is written to an encrypted disk, while the disk is mounted, are fewer privileges required to inspect swap content than would be required to inspect the memory backing a tmpfs volume (or the kubelet memory)?

I think we need to differentiate between the "online" (system is running, logged in as root) and "offline" (admin took the disk) cases.

In the online case, root can grep /dev/mem as well as /dev/[vs]d*.
In the online case, non-root usually cannot grep /dev/mem, nor /dev/[vs]d*.
Encryption should not make a difference in the online case (root can still grep the encrypted block device).

In the offline case, thanks to encryption, an admin cannot read the plaintext of the disk unless they have the encryption key.

ramfs

IMHO switching to ramfs is too invasive a change just to avoid swap, especially knowing that tmpfs is gaining noswap support.

tmpfs for performance

If anybody has opted in to swap, then that consumer is already accepting a compromise on performance.
Thus to me, the risk of a performance impact on tmpfs is acceptable, because the user accepted it as soon as they turned on swap.

tl;dr: to me the beta2 scope is a good foundation, and valuable to our (KubeVirt) use cases. We'd really like to get it GA'ed.
It's clear that there are additional challenges, like this noswap-for-tmpfs issue for volumes, as well as other issues, e.g. around stressed clusters with swap enabled, which we should address in follow-up swap work (which we are committed to).

@deads2k
Contributor

deads2k commented Apr 10, 2024

The expectation when using a kubernetes cluster is that some volume types are backed by RAM. This expectation is reflected in administrative guides and development guides. While it is possible that some cluster-admins and cluster-operators would see the potential exposure and then dig deeply enough to determine whether they are exposed, it is likely that most will not.

We know the path that is required to safely enable swap on nodes, while avoiding this exposure for the core platform. While I understand the urge to enable features as quickly as possible, enabling features that are needlessly less generally safe erodes upgrade trust. Let's take the path of building the functionality needed to keep cluster security profiles the same over upgrades and ensure our kubelets fail safely (refuse to run pods) when that cannot be achieved.

@fabiand
Contributor

fabiand commented Apr 11, 2024

I agree that expectations about volume types would be undermined.

@kannon92
Contributor

kannon92 commented Apr 11, 2024

So, predating this feature, we have --fail-swap-on=false|true, which allows users to deploy the kubelet and not fail on swap. KEP-2400 did not introduce this flag, and we have found a lot more cases than we thought where this flag was set to false. Our CI sets it, and a lot of developer machines allow this (our own hack/local-up-cluster.sh).

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

I think we need to clearly state what our intention is for --fail-swap-on=false and tmpfs noswap. If I understand your point correctly (@deads2k), we want to only enable swap on nodes if the tmpfs noswap option is available in the kernel. If not, we should fail to start the kubelet.

I wonder if we should consider a separate KEP to target both cases (KEP-2400 and legacy swap enablement).

@haircommander
Contributor

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

small correction: the kernel would swap out memory, but yes, this is already a problem for nodes that run with --fail-swap-on=false. Good point

Would folks be open to a swapBehavior option like LimitedSwapInsecure / NoSwapInsecure (name pending), or something that doesn't require the tmpfs noswap option to be available? I think we as a community should take the stance that tmpfs noswap should be required to use swap, but there are many non-production environments that would want to continue to use swap (kind being one) and maybe wouldn't care to ensure their secrets aren't being swapped.

@BenTheElder
Member

BenTheElder commented Apr 18, 2024

[...] but there are many non-production environments that would want to continue to use swap (kind being one) and maybe wouldn't care to ensure their secrets aren't being swapped.

I can't speak for every user, but kind is not intended for production, and personally I agree: we prefer the cluster to function on as many hosts as we can, even if the functionality is less secure.

The reason we enable --fail-swap-on=false is that so many developer environments have swap enabled, and resource limits are already one of the few things that just don't quite work reliably in the nested container environment yet. It's expected that you can otherwise get, say, 90% of the way to testing controllers and core k8s logic (node_e2e aside).

My past understanding was that this wasn't even necessarily a well-supported option, but it enabled you to run on a host with swap; IIRC we log a warning on kubelet startup if you opt into this flag.

Would folks be open to a swapBehavior option like LimitedSwapInsecure / NoSwapInsecure (name pending), or something that doesn't require the tmpfs noswap option to be available?

For KIND & local-up-cluster.sh (cc @dims) I think we're fine adopting a new option, but what about other users of fail-swap-on? I think generally Kubernetes shouldn't require new configuration just to upgrade safely...?

For the feature of "swap is permitted by default" without setting this flag, it seems like we should require noswap, and possibly allow an explicit opt-in to permit it anyway with a warning.

@BenTheElder
Member

I think minikube also does this at least in the "docker" / "podman" driver modes for the same reasons. FYI @medyagh @afbjorklund

@afbjorklund

Right, I don't think minikube expects to be swapping, even if it does allow running with swap enabled on the machine

It (--fail-swap-on=false) has been enabled since 2017, and with kubeadm it uses --ignore-preflight-errors=Swap

@iholder101
Contributor

iholder101 commented Apr 21, 2024

So, predating this feature, we have --fail-swap-on=false|true, which allows users to deploy the kubelet and not fail on swap. KEP-2400 did not introduce this flag, and we have found a lot more cases than we thought where this flag was set to false. Our CI sets it, and a lot of developer machines allow this (our own hack/local-up-cluster.sh).

In the case of --fail-swap-on=false without this feature, I am not sure if kubelet would still swap out memory in a volume if the node was close to memory pressure.

Yes, the kernel could indeed swap memory-backed volumes if --fail-swap-on=false, even if NodeSwap is not being used.

I think this is an important point.
The fact that the flag has existed for a long time and was not introduced by the NodeSwap feature makes it somewhat out of scope for the progress of the NodeSwap feature. This is of course not entirely true: there is a connection between the two, as NodeSwap cannot be used without --fail-swap-on=false, but the security issue we're discussing has existed for a long time and can be relevant even if NodeSwap is not enabled.

For this reason I think it's reasonable to consider improving --fail-swap-on's behavior instead of introducing a new safety mechanism to the NodeSwap feature. We can do so by using the tmpfs noswap option on a best-effort basis (a detection sketch follows at the end of this comment):

  • If --fail-swap-on=false is used and tmpfs noswap is not supported: fire a warning and move on.
  • If --fail-swap-on=false is used and tmpfs noswap is supported: use the tmpfs noswap option to mount memory-backed volumes.
    • It's important to note that eventually the vast majority of environments would use this option, as tmpfs noswap becomes widely supported.

This way we improve the existing behavior without breaking changes, without intermediary API changes, and without forcing swap users to adapt. In the future, when tmpfs noswap is widely supported, we may consider deprecating --fail-swap-on entirely and using the tmpfs noswap option unconditionally. IMHO we could track this effort with a new KEP that focuses on the --fail-swap-on flag and would not block NodeSwap, as they are essentially two different "features" / APIs.

I would appreciate feedback on this approach.
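
A hedged sketch of that best-effort detection (all names are illustrative, not the actual kubelet implementation): probe once at startup with a scratch mount, then decide.

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// tmpfsNoswapSupported probes for kernel support by attempting a tiny
// noswap tmpfs mount in a scratch directory; kernels without support
// reject the unknown option, so the mount call returns an error.
func tmpfsNoswapSupported() bool {
	dir, err := os.MkdirTemp("", "noswap-probe")
	if err != nil {
		return false
	}
	defer os.RemoveAll(dir)

	if err := unix.Mount("tmpfs", dir, "tmpfs", 0, "noswap,size=4k"); err != nil {
		return false
	}
	_ = unix.Unmount(dir, 0)
	return true
}

func main() {
	if tmpfsNoswapSupported() {
		log.Println("mounting memory-backed volumes with the tmpfs noswap option")
	} else {
		log.Println("warning: tmpfs noswap is unsupported; memory-backed volumes may swap")
	}
}
```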

@kannon92
Contributor

  • If --fail-swap-on=false is used and tmpfs noswap is not supported: fire a warning and move on.
  • If --fail-swap-on=false is used and tmpfs noswap is supported: use the tmpfs noswap option to mount memory-backed volumes.

+1, but I think we should also detect whether swap is enabled on the node (fail-swap-on=false + swap present on the node); fail-swap-on could be false while the node doesn't actually have swap on it.

In your "not supported" case, I think you mean to try the mount command and, if it fails, fall back to mounting tmpfs without the noswap option.

@liggitt
Member Author

liggitt commented Apr 25, 2024

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |
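
For clarity, the table's logic as a hedged Go sketch (names are illustrative; --fail-swap-on defaults to true when unset, so the first two rows coincide):

```go
package main

import (
	"errors"
	"fmt"
)

// startupSwapCheck mirrors the table above. failSwapOn is the effective
// flag value (it defaults to true when unset), while swapOn and
// noswapSupported are detected on the node at kubelet startup.
func startupSwapCheck(failSwapOn, swapOn, noswapSupported bool) (useNoswap bool, err error) {
	switch {
	case !swapOn:
		return false, nil // swap off: nothing to do
	case failSwapOn:
		return false, errors.New("running with swap on is not supported, please disable swap")
	case noswapSupported:
		return true, nil // mount memory-backed volumes with noswap
	default:
		fmt.Println("warning: tmpfs noswap is unsupported; memory-backed volumes may swap")
		return false, nil
	}
}

func main() {
	useNoswap, err := startupSwapCheck(false, true, true)
	fmt.Println(useNoswap, err) // true <nil>
}
```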

@iholder101
Contributor

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |

Awesome 👍 if there are no objections I'll start working on a PR.
In the future, when tmpfs noswap is widely supported, we should revisit something similar to #124060, which always mounts with the noswap option.

@iholder101
Contributor

I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):

|  | swap off | swap on, supports noswap | swap on, doesn't support noswap |
| --- | --- | --- | --- |
| `--fail-swap-on` unset | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=true` | - | fail on kubelet start | fail on kubelet start |
| `--fail-swap-on=false` | - | detect on start, use noswap | detect on start, warn, don't use noswap |

@liggitt I've reworked #124060 to follow this approach. Can you PTAL?
