-
Notifications
You must be signed in to change notification settings - Fork 39.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KEP-2400] Address swap impact on memory-backed volumes #105978
Comments
I'm not exactly sure what cgroup controls are being gestured at, but it seems like the generic answer here would be encrypted swap with an ephemeral in-memory key. Unfortunately the details of how to set that up are aggressively distro-specific. |
/kind documentation |
/unassign Clearing assignees for this release to ensure we can distribute the load. |
@liggitt Does the controls for preventing memory-backed volumes from swapping out need be implemented in kubelet code or be documented for users self-configured on node hosts? |
I'm not sure? Is node host configuration possible to do once outside the kubelet, or is it a configuration that has to be applied for every new pod/container? |
We also haven't found a config that could limit the tmpfs swapping out in cgroupv2 yet |
/cc |
I see tmpfs mentioned in the beta graduation criteria for the swap feature (https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#beta), but I can't tell if this covers preventing memory-backed volumes from getting swapped to disk:
edit: oh, it looks like this issue is referenced from the |
@liggitt 👋 the release 1.24 bug triage shadow. Does this issue still target release 1.24? |
That's more a question for @ehashman , not sure what milestone completion of swap-to-beta is planned for |
/milestone clear Per kubernetes/enhancements#2400 (comment) this was removed from milestone |
This issue is labeled with You can:
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/ /remove-triage accepted |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I don't have a good sense for whether disk encryption actually helps here... if a swap file is written to an encrypted disk, while the disk is mounted, are fewer privileges required to inspect swap content than would be required to inspect the memory backing a tmpfs volume (or the kubelet memory)? |
Hey @liggitt! Circling back to this. So firstly I'd like to understand what's the scope of this. Do we specifically care about secrets? Or do we also care about other memory-backed volumes like EmptyDirs? One possible path forward is using Do you think that's a reasonable path forward? [1] https://en.wikipedia.org/wiki/Tmpfs#Linux |
I specifically care about
It looks like all downward API / projected volumes are configured with StorageMediumMemory, though I'm not sure that's required for configmap / non-token projected volumes / metadata downward-API volumes.
I don't know enough about the implications to know (is it as widely supported as tmpfs? does it have other performance or functional tradeoffs? does it behave differently than tmpfs in ways that would be observable to the consumer?), so I'll defer to node and storage folks, but it sounds like a possibility at first glance. |
Thanks for clarifying!
Great, I'll look deeper into it than. |
We talked about this on slack. It looks like there is a mount option for tmpfs to say that you don't want swap. https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html Podman added support for this: containers/podman#19659. /sig storage I think it would make sense to add the noswap option to the mounts for tmpfs. But I think we should bring this to sig-storage and see what they think. It looks to be a newer kernel feature (https://lore.kernel.org/all/20230207025259.2522793-1-mcgrof@kernel.org/T/#u). |
Thank you for summarizing the slack conversation. |
Neither the brd nor zram (if you wanted to leverage compression) would have their backing pages swapped out. Thus, despite brd's inherent complexity, it would fulfil this requirement. |
I think we need to differentiate between the "online" (system is running, logged in as root) and "offline" (adin took disk) case. In the online case, root can grep In the offline case, with due to encrpytion, and admin can not read the plaintext of the disk, unless they have the enryption key.
IMHO switching to ramfs is a too invasive change just to avoid swap, specifically knowning that tmpfs is gaining noswap support.
If anybody has opted in to swap, then that consumer is already accepting a compromise on performance. tl;dr to me beta2 scope is a good foundation, and valueable to our (KubeVirt) use-cases. We'd really like to get it GA'ed. |
The expectation when using a kubernetes cluster is that some volume types are backed by RAM. This expectation is reflected in administrative guides and development guides. While it is possible that some cluster-admins and cluster-operators would see the potential exposure and then dig deeply enough to determine whether they are exposed, it is likely that most will not. We know the path that is required to safely enable swap on nodes, while avoiding this exposure for the core platform. While I understand the urge to enable features as quickly as possible, enabling features that are needlessly less generally safe erodes upgrade trust. Let's take the path of building the functionality needed to keep cluster security profiles the same over upgrades and ensure our kubelets fail safely (refuse to run pods) when that cannot be achieved. |
I agree that expectations about volumes types would be undermined. |
So predating this feature, we have In the case of I think we need to clearly state what our intention is for I wonder if we should consider a separate KEP to target both cases (KEP-2400 and legacy swap enablement). |
small correction: kernel would swap out memory, but yes this is already a problem for nodes that enable fail-swap-on. Good point Would folks be open to a |
I can't speak for every user, but it's not intended for production and personally I agree, we prefer the cluster to function on as many hosts as we can even if the functionality is less secure. The reason we enable My past understanding was that this wasn't even necessarily a well supported option but it enabled you to run on a host with swap, IIRC we log a warning on kubelet startup if you opt-into this flag.
For KIND & local-up-cluster.sh (cc @dims) I think we're fine adopting a new option, but what about other users of fail-swap-on? I think generally Kubernetes shouldn't require new configuration just to upgrade safely ..? For the feature of "swap is permitted by default" without setting this flag, it seems like we should require noswap and possibly enable opt-in to allow it anyhow with warning. |
I think minikube also does this at least in the "docker" / "podman" driver modes for the same reasons. FYI @medyagh @afbjorklund |
Right, I don't think minikube expects to be swapping - even if it does allow running with swap enabled on the machine It ( |
Yes, the kernel could indeed swap memory-backed volumes if I think this is an important point. For this reason I think it's reasonable to consider improving
This way we improve the existing behavior without breaking changes, introducing intermediary API changes or forcing swap users to adapt. In the future, when tmpfs noswap is widely supported, we may consider to deprecate I would appreciate feedback on this approach. |
+1 but I think we should also detect if swap is enabled on the node. (fail-swap-on=false + swap is on node) fail-swap-on could be false but the node does not have to have swap on it in that case. I think in your case of "not supported" I think you mean to try the mount command and if it fails, revert back to non tmpfs case. |
I think this is the behavior that would be consistent with flag defaults and use today, and should be a strict improvement while remaining compatible (warn if noswap isn't supported but they've opted into running with swap in ways that could make their tmpfs volumes unsafe, use noswap if available and they've opted into running with swap):
|
Awesome 👍 if there are no objections I'll start working on a PR. |
@liggitt I've reworked #124060 to follow this approach. Can you PTAL? |
Some volumes (notably secret volumes) are memory-backed in order to ensure they do not hit disk.
When swap is enabled, we need to ensure the content of memory-backed volumes continue to not touch disk.
/kind feature
/sig node storage
/assign @ehashman
The text was updated successfully, but these errors were encountered: