-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node system swap support #2602
Node system swap support #2602
Conversation
creating a (dummy) approve comment to re-trigger CI, the approval mechanism was broken for a bit - kubernetes/test-infra#21687 /approve |
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@ike-ma it appears you haven't signed the Kubernetes CLA, can you please follow the instructions in the comment above to do so and run |
### Non-Goals | ||
|
||
- Provisioning swap. Swap must already be available on the system. | ||
- Setting [swappiness]. This can already be set on a system-wide level outside of Kubernetes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is meant by "system-wide level" here? Is it referring to vm.swappiness
sysctl?
If that's the case, the downside I see there is it's system wide. For example maybe there are some usecases where you want certain workloads to swap often (and as such set a high swappiness
), but for system workloads (e.g those in /system.slice
, i.e. kubelet, container runtime, etc), you don't want them to swap at all (and want a zero swappiness). If it's a system wide setting I guess you lose the ability to make those types of adjustments.
Not sure how important that is and probably a post alpha thing, but maybe something to consider :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This KEP doesn't enable setting per-workload swap utilization, so I think it's YAGNI on per-workload swappiness configuration in the CRI. When we're ready to add per-workload swap configuration to k8s, then we should add it IMO.
but for system workloads (e.g those in /system.slice, i.e. kubelet, container runtime, etc), you don't want them to swap at all (and want a zero swappiness).
Typically one wouldn't run the container runtime, kubelet, etc. using k8s though, they're prerequisites for a node that run directly on the host. So potentially this is still individually configurable for them, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we're ready to add per-workload swap configuration to k8s, then we should add it IMO.
Make sense!
So potentially this is still individually configurable for them, no?
I guess what you would do in that case is set vm.swappiness
sysctl so that it applies system wide, but then you would have to edit all of your systemd units for kubelet, container runtime etc and override their swap settings. Might be helpful to note somewhere if that's the recommended path to configure swappiness since I would imagine most existing kubelet/container runtime units don't override swap settings.
edit: actually looking at the systemd config options, I don't think it exposes overriding swap for individual units... https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html
PRR looks good. An overall note when going to beta: I'm not clear from the proposal as a whole how a pod author
|
PRR /approve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A number of requests throughout, let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more requests, I think this is close to ready.
Let me know what you think.
1. Add a feature gate `NodeSwapEnabled` to enable swap support. | ||
1. Leave the default value of kubelet flag `--fail-on-swap` to `true`, to avoid | ||
changing default behaviour. | ||
1. Introduce a new kubelet config parameter, `MemorySwapLimit`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would prefer to never introduce the MemorySwapLimit
parameter.
I would update this section to say:
- kubelet detects available swap on startup (all available swap is basically allocatable)
- kubelet enables reservation of swap to reduce node allocatable swap (apha or beta period)
changing default behaviour. | ||
1. Introduce a new kubelet config parameter, `MemorySwapLimit`. | ||
1. Introduce a new CRI parameter, `memory_swap_limit_in_bytes`. | ||
1. Integrate new kubelet config and pass values to CRI for container creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a little more clarification.
The container runtime will write the swap settings to the container level cgroup. This will include ephemeral containers. We need to account for the proper setting that is applied at the pod cgroup boundary. Please list addressing this as a blocker to beta for the feature
I believe all comments have been addressed. PTAL! |
/lgtm |
int64 memory_swap_limit_in_bytes = 9; | ||
... | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to include this as part of CRI stats as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
/lgtm
thanks for the updates, look forward to experimenting with this as it evolves.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, derekwaynecarr, ehashman, nikhita The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hello @ehashman 👋, 1.22 Docs Shadow here. Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT. Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release. Thank you! 🙏 |
See: #2400
This is an initial draft, based on the community design doc. I wanted to make sure we're all in high-level alignment before I begin on the implementation specifics.
That should be the only section left TODO.(2021-04-28) I've added @ike-ma's initial draft of the implementation specifics as well, so this should be ready for overall review.I've tried to keep the scope as narrow as possible to reduce unknowns and make this as easy/clear to graduate as possible. This will necessarily mean (IMO) splitting any workload-level accounting for swap into a separate KEP.
/sig node
/cc @dims @SergeyKanzhelev @dchen1107 @derekwaynecarr @mrunalp
cc @ike-ma