
[KEP-2400] - Update KEP with findings from beta2 #4401

Merged
merged 12 commits into kubernetes:master from kep-2400-beta-2-findings on Feb 6, 2024

Conversation

kannon92 (Contributor)

  • One-line PR description:
    Update Swap KEP with findings from beta2 testing.
  • Other comments:

Our recommendation is to implement MemorySwap = NoSwap and to let this feature sit in beta2. We don't want to promote this to GA at this time, but we realize that some guard rails are necessary for this feature.
In this KEP, we propose some best practices and allow for opt-in support for this feature regardless of the feature toggle.
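
For illustration, a minimal sketch of the kubelet configuration this direction implies, assuming the existing KubeletConfiguration fields (failSwapOn, featureGates, memorySwap.swapBehavior); the NoSwap value is the new behavior proposed here, and the file path is illustrative:

```bash
# Hedged sketch: kubelet config enabling the NodeSwap feature gate with the
# NoSwap behavior this PR proposes. Not the final KEP text, just a sketch.
cat <<'EOF' > /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false        # allow the kubelet to start on a node that has swap enabled
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: NoSwap   # proposed default: workloads get no swap access;
                         # LimitedSwap would be the opt-in for swap usage
EOF
```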

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 12, 2024
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 12, 2024
@kannon92 kannon92 force-pushed the kep-2400-beta-2-findings branch from af16a2d to d97e2bf Compare January 16, 2024 22:50
@kannon92 (Contributor, Author)

@@ -1059,6 +1169,7 @@ nodes that do not use swap memory.
- **2021-01-05:** Initial design discussion document for swap support and use cases.
- **2021-04-05:** Alpha KEP drafted for initial node-level swap support and implementation (KEP-2400).
- **2021-08-09:** New in Kubernetes v1.22: alpha support for using swap memory: https://kubernetes.io/blog/2021/08/09/run-nodes-with-swap-alpha/
- **2024-01-12:** Updates to Beta2.
Contributor:

Would it make sense to add a reference to the PR that added swap support in the stats summary? kubernetes/kubernetes#118865

Contributor:

Although, I see, that section only has links to issues, not PRs.

kannon92 (Contributor, Author):

This section does need some love. I don't really know the dates, but I guess I can find some PRs so I can fill this in.

kannon92 (Contributor, Author):

I updated the section with what I could find. There are a lot of PRs, so I decided to focus on the promotion to beta1. PTAL.

@kannon92 kannon92 requested review from harche and fabiand January 18, 2024 18:47
@harche (Contributor) left a comment:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2024
@kannon92 (Contributor, Author):

/assign @dchen1107 @mrunalp @SergeyKanzhelev

Since this is staying in beta, I am not sure if we need PRR reviews/approvals at the moment.

@mrunalp (Contributor) commented Feb 2, 2024:

This looks fine to me overall.
@dchen1107 if you want to review as well.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2024
@deads2k (Contributor) commented Feb 5, 2024:

Thank you for updating PRR as you go. The PRR changes lgtm

- Add [swap specific tests](https://github.com/kubernetes/kubernetes/issues/120798), such as handling swap usage
across container restart boundaries for writes to tmpfs (which may require pod-level cgroup changes beyond what
the container runtime does at the container cgroup boundary).
- Fix flaking/failing swap node e2e jobs.
- Address eviction related [issue](https://github.com/kubernetes/kubernetes/issues/120800) in swap implementation.
- Add `NoSwap` as the default setting.
- Remove `UnlimitedSwap` as a supported option.
Member:

+1000

@@ -608,12 +714,18 @@ Here are specific improvements to be made:
- Improve coverage for appropriate scenarios in testgrid.

#### Beta 2
- Publish a Kubernetes doc page encouraging users to use encrypted swap if they wish to enable this feature.
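
As a rough illustration of what such a doc page might recommend, here is one common Linux approach to ephemeral encrypted swap (the device path is a placeholder and the cipher settings are illustrative assumptions, not a KEP recommendation):

```bash
# Hedged sketch: encrypted swap via crypttab, re-keyed from /dev/urandom on
# every boot so swap contents never survive a reboot. /dev/sdb2 is a placeholder.
echo 'cryptswap /dev/sdb2 /dev/urandom swap,cipher=aes-xts-plain64,size=256' >> /etc/crypttab
echo '/dev/mapper/cryptswap none swap defaults 0 0' >> /etc/fstab
# After a reboot, verify that the encrypted device backs the swap area:
swapon --show
```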

Member:

Anything planned for debuggability enhancements, since enabling the feature might affect the performance of the node? I know we introduced swap metrics, but what about kubectl describe node?

Comment:

FWIW, we are still exploring what the best way of monitoring swap is: whether it's direct metrics, such as swap traffic, or indirect ones, such as *wait times.

kannon92 (Contributor, Author):

I'll put this down as something to look into, @dchen1107! The issue with describe node is that I think it gives the capacity for swap (i.e. how much swap the node actually has), but a user may get confused because they are only allowed to use up to a computed limit on swap for Burstable QoS. So it's a bit misleading to show a capacity when you can't really use it.

I think we should add some way to detect if swap is enabled on a node. I talked with the NFD (Node Feature Discovery) team about potentially adding a way to detect whether swap is on a node so we can add useful node labels for swap.
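
For context, a sketch of the computed limit mentioned above, assuming the proportional formula the KEP describes for Burstable pods under LimitedSwap (all numbers are hypothetical):

```bash
# Hedged sketch of the per-container swap limit for Burstable QoS:
#   swap limit = (container memory request / node memory capacity) * node swap size
# Hypothetical numbers, in GiB:
request=4; node_mem=16; node_swap=8
echo "swap limit: $(( request * node_swap / node_mem ))Gi"   # -> 2Gi
```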

Contributor:

I think the node-level swap capacity would still be useful information, even if the pods may not be able to consume it. That's the same for other resources on the node too.

Q: Is there a good way to expose per-pod swap usage?

Contributor:

A quick way to examine whether system-critical daemons are using swap would also be helpful

kannon92 (Contributor, Author):

> A quick way to examine whether system-critical daemons are using swap would also be helpful

I don't have a great way to do this. @fabiand any ideas here?

I would look at free and then see what is using swap.
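
A hedged sketch of that approach from a node shell (VmSwap in /proc/<pid>/status reports per-process swap usage on Linux):

```bash
# Node-wide swap usage:
free -h
# Per-process swap usage, largest consumers first (0 kB entries filtered out):
grep VmSwap /proc/[0-9]*/status | awk '$2 > 0 {print $2 " kB", $1}' | sort -rn | head
```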

@kannon92 (Contributor, Author) commented Mar 5, 2024:

> I think the node-level swap capacity would still be useful information, even if the pods may not be able to consume it. That's the same for other resources on the node too.

I've been asked that a few times, so I think we can add this to the node information when doing kubectl describe node.

> Q: Is there a good way to expose per-pod swap usage?

I was actually thinking that crictl stats or something like that may be useful here.

I think one could view the summary API using kubectl get --raw and look at the swap stats.

It could maybe be useful to expose this information in a lower-level tool if that is of interest.

We would look into the container and view memory.swap.current to see if swap was being used. The scoping was a bit confusing to me, but exec'ing into the pod and viewing memory.swap.current seems to be the best way to get swap information.
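
A hedged sketch of both inspection paths ($NODE and $POD are placeholders, and exact summary-API field names may differ by version; swap stats in the summary API came from kubernetes/kubernetes#118865, mentioned earlier in this thread):

```bash
# Swap stats from the kubelet summary API for one node:
kubectl get --raw "/api/v1/nodes/$NODE/proxy/stats/summary" | grep -i swap
# Swap usage in bytes from inside a container on a cgroup v2 node:
kubectl exec "$POD" -- cat /sys/fs/cgroup/memory.swap.current
```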

@iholder101 any other ways you found useful to look at per-pod swap usage?

Contributor:

> I've been asked that a few times so I think we can add this to the node information when doing kubectl describe node

I think it makes sense. I'll get to it.

> @iholder101 any other ways you found useful to look at per-pod swap usage?

What about using kubectl top to get this information? I can dive deeper into it.

Comment:

> A quick way to examine whether system-critical daemons are using swap would also be helpful

Critical daemons can be provided in two places:

  1. node level (part of the OS)
  2. as containers

(1) is easy to monitor: system.slice contains all these services. system.slice should also be configured to not use swap (part of the KEP; see the sketch below).
(2) is more difficult, i.e. Rook or other CNI/CSI projects will have critical pods.

Monitoring and configuring those is more difficult.
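
For case (1), a hedged sketch of how system.slice could be kept off swap with a systemd drop-in on a cgroup v2 node (the drop-in file name is illustrative):

```bash
# Keep node-level critical services (system.slice) from using swap.
mkdir -p /etc/systemd/system/system.slice.d
cat <<'EOF' > /etc/systemd/system/system.slice.d/50-noswap.conf
[Slice]
MemorySwapMax=0
EOF
systemctl daemon-reload
systemctl show system.slice -p MemorySwapMax   # verify the setting took effect
```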

If we assume that pods are correctly assigned to PriorityClasses, then monitoring can be improved by displaying swap consumption by priority class.
Further (unsure about it), we could consider implicitly controlling swap for workloads based on priority class, i.e. disabling it for node- and cluster-critical workloads.

Additional thoughts:
I've often seen that pods from (2) run as Burstable, making them eligible for swap.
We either need to find a way to (a) let pods opt out of swap independent of the QoS class, or (b) make pods opt in to swap regardless of the QoS class (Burstable pods can also benefit from swap as a safety mechanism).

The broad decision we should think about is: do we want opt-in or opt-out for workloads?

> I've been asked that a few times so I think we can add this to the node information when doing kubectl describe node

@iholder101 +1 on adding this to the kubelet and then kubectl. Let's include two metrics: one around the amount of swap used, and the other on the amount of swap traffic. The latter is worse than the former.

@dchen1107 (Member):

/lgtm with a minor comment which can be added later.
/approve

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 6, 2024
@mrunalp (Contributor) commented Feb 6, 2024:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 6, 2024
@k8s-ci-robot k8s-ci-robot merged commit 1464eaf into kubernetes:master Feb 6, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 6, 2024
@pacoxu pacoxu mentioned this pull request Aug 9, 2024