
promote ephemeral-storage-quotas to beta in 1.25 #2697

Merged · 6 commits · Jun 22, 2022

Conversation

@pacoxu (Member) commented May 8, 2021

part of #1029

#1029 (comment)

I'd like to follow up on this feature. In the 1.22-1.24 cycle, I will benchmark it and then check with the SIG whether we should promote it to beta.

Action Items in 1.22~1.24

More details can be found in #1029 (comment)

Action Items in 1.25

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 8, 2021
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 8, 2021
@pacoxu pacoxu force-pushed the ephemeral-storage-quotas-beta branch 2 times, most recently from 4653a2f to 11900a7 on May 8, 2021 04:49
@pacoxu (Member Author) commented May 8, 2021

/assign @deads2k
for prod-readiness approval

@pacoxu (Member Author) commented May 10, 2021

@ehashman I opened the status update PR and will follow up on the actions in 1.22.

@deads2k (Contributor) commented May 11, 2021

You need to add and complete the Production Readiness Review Questionnaire in your KEP.

See the template here: https://github.com/kubernetes/enhancements/blame/master/keps/NNNN-kep-template/README.md#L359-L667

@pacoxu pacoxu force-pushed the ephemeral-storage-quotas-beta branch from 11900a7 to f23fcb4 on May 12, 2021 11:32
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 12, 2021
@pacoxu (Member Author) commented May 12, 2021

Updated, @deads2k.
As I don't have enough time today, I may go through and improve the answers later.


###### What happens if we reenable the feature if it was previously rolled back?

Performance changes.
Contributor:

Is the performance different compared to the first time it was enabled, or is this generally true of enabling the feature?

Member Author:

Only for this scenario: "use project quotas to monitor emptyDir volume storage consumption rather than filesystem walk for better performance and accuracy."

Contributor:

Only for this scenario: "use project quotas to monitor emptyDir volume storage consumption rather than filesystem walk for better performance and accuracy."

You'll want to add this to the KEP.


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Set the feature gate to false.
Contributor:

Will volumes created while the feature gate was on still enforce quota?

Member Author:

If a pod was created with quota enforcement, disabling the feature gate will not change the running pod.
Newly created pods will not use the quota.
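
A minimal sketch of that rollback on one node (the feature gate name is real; the drop-in file path and service name are assumptions that vary by how the node was provisioned):

```bash
# Sketch: disable quota-based monitoring on one node and restart the kubelet.
# /etc/default/kubelet is an assumed location for extra kubelet flags.
echo 'KUBELET_EXTRA_ARGS=--feature-gates=LocalStorageCapacityIsolationFSQuotaMonitoring=false' \
  | sudo tee /etc/default/kubelet

# Running pods keep their existing quotas; newly created pods fall back to
# filesystem walks for usage accounting.
sudo systemctl restart kubelet
```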


###### How can a rollout or rollback fail? Can it impact already running workloads?

None.
Contributor:

I'm surprised by this. What happens if the kube-apiserver starts using the feature gate and the kubelets have not yet restarted? Is there any impact?

Member Author (@pacoxu, May 13, 2021):

If I understand correctly, the kube-apiserver feature gate setting has no effect for this feature.

When you set the feature gate on the kubelet and restart it, newly created pods will use the quota strategy.

The rollout/rollback will not impact running workloads.


- [x] Feature gate (also fill in values in `kep.yaml`)
  - Feature gate name: `LocalStorageCapacityIsolationFSQuotaMonitoring`
  - Components depending on the feature gate: kubelet, apiserver, controller-manager, scheduler
Contributor:

I'm a little surprised the scheduler and controller-manager are involved. What do they enforce?

Member Author:

It is in the scheduler and controller-manager feature-gate lists. However, it should only matter for the kubelet.
https://kubernetes.io/docs/reference/command-line-tools-reference/kube-scheduler/

`LocalStorageCapacityIsolationFSQuotaMonitoring` is only used in `pkg/volume/util/fsquota/quota.go`.

I updated it to be kubelet only.

Member Author:

`LocalStorageCapacityIsolation` will affect scheduling.


###### What specific metrics should inform a rollback?

None. Users can focus on volume metrics.
Contributor:

It seems like cluster-admins probably want to be able to tell when pods start failing because they run out of space.

Member Author:

To see failures, I think we can read the kubelet log for eviction-related entries, or use `xfs_quota` to check the quota settings (a log-reading sketch follows the example below).

Here's an example with volume1048581 (an emptyDir volume).

```console
[root@daocloud ~]# xfs_quota -x -c 'report -h' /dev/sdc
Project quota on /var/lib/kubelet (/dev/sdc)
                        Blocks
Project ID   Used   Soft   Hard Warn/Grace
---------- ---------------------------------
#0           156K      0      0  00 [------]
volume1048577    16K      0      0  00 [------]
volume1048578     8K      0      0  00 [------]
volume1048579     8K      0      0  00 [------]
volume1048581      0     8E     8E  00 [------]
volume1048582     8K      0      0  00 [------]
volume1048583     8K      0      0  00 [------]
volume1048584      0     8E     8E  00 [------]
[root@daocloud ~]# docker ps | grep test
2e848e2ae430   k8s.gcr.io/test-webserver   "/test-webserver"        4 seconds ago   Up 3 seconds             k8s_test-container_test-pd-1_default_b6fcd777-ffef-4148-ba9a-fbdf4e7cfc8d_0
19281193d915   k8s.gcr.io/test-webserver   "/test-webserver"        6 seconds ago   Up 5 seconds             k8s_test-container_test-pd-2_default_ce4c955b-a04a-4b98-8454-d904ba6341eb_0
57f96f8680eb   k8s.gcr.io/pause:3.2        "/pause"                 9 seconds ago   Up 8 seconds             k8s_POD_test-pd-2_default_ce4c955b-a04a-4b98-8454-d904ba6341eb_0
b14274ba66e8   k8s.gcr.io/pause:3.2        "/pause"                 9 seconds ago   Up 8 seconds             k8s_POD_test-pd-1_default_b6fcd777-ffef-4148-ba9a-fbdf4e7cfc8d_0
[root@daocloud ~]# docker cp data 2e848e2ae430:/cache/data/
[root@daocloud ~]# xfs_quota -x -c 'report -h' /dev/sdc
Project quota on /var/lib/kubelet (/dev/sdc)
                        Blocks
Project ID   Used   Soft   Hard Warn/Grace
---------- ---------------------------------
#0           160K      0      0  00 [------]
volume1048577    16K      0      0  00 [------]
volume1048578     8K      0      0  00 [------]
volume1048579     8K      0      0  00 [------]
volume1048580      0     8E     8E  00 [------]
volume1048581   2.6G     8E     8E  00 [------]
volume1048582     8K      0      0  00 [------]
volume1048583     8K      0      0  00 [------]
volume1048584      0     8E     8E  00 [------]
```
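
And a sketch of the log-reading side (assuming the kubelet runs as a systemd unit; the unit name and the exact message wording are assumptions):

```bash
# Scan the kubelet journal for eviction-related entries (sketch).
journalctl -u kubelet --since "1 hour ago" | grep -i evict
```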

### Monitoring Requirements

* **How can an operator determine if the feature is in use by workloads?**
- A cluster-admin can check the kubelet feature gates on each node. If the feature gate is disabled, workloads on that node will not use it.
Contributor:

This lets you know whether a kubelet supports it, but it's not obvious to me how a cluster-admin knows how close their pods are to the limit.

Member Author:

- `xfs_quota -x -c 'report -h' /dev/sdc` to check the current usage.
- Check `spec.containers[].resources.limits.ephemeral-storage` of each container (see the sketch after this list).

Other ways to collect some more data:

- `docker ps -s` to see container storage usage.
- `df -h` on the node.
- `du` on the container or pod dir.
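
A sketch of the per-container limits check mentioned above (the `kubectl` flags are real; the jsonpath expression is just one way to write it):

```bash
# Print each pod's containers with their ephemeral-storage limits, cluster-wide.
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {range .spec.containers[*]}{.name}={.resources.limits.ephemeral-storage} {end}{"\n"}{end}'
```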

Member Author (@pacoxu, May 13, 2021):

The kubelet `/stats/summary` endpoint can show ephemeral storage usage: fetch `https://127.0.0.1:10250/stats/summary` and pipe it through `grep ephemeral-storage -A 10`.
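
A sketch of the same check through the API server, which avoids authenticating to the kubelet directly (the node-proxy path is real; `<node-name>` is a placeholder):

```bash
# Fetch one node's kubelet stats summary via the API server proxy and
# show ephemeral-storage usage entries.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
  | grep -A 10 "ephemeral-storage"
```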

Member:

Actually, to me, this question is more about how an operator can tell across their fleet whether the feature is in use or not. I don't think it's critical for this feature; I think of this more as useful for allowing the operator to know what workloads would be impacted by a rollout, or for figuring out if this feature could be the cause of higher error rates, for example.

That said, is there, say, a per-node metric saying which mechanism is in use? That could be useful; if you don't have it, perhaps add it to the "possible useful metrics" section below.


* **What are the SLIs (Service Level Indicators) an operator can use to determine
the health of the service?**
- Set a quota for the specified volume and try to write to the volume to check if there is a limitation.
Contributor:

I don't see how this scales for even a single cluster. If a cluster-admin is enforcing a limit, how do they know whether pods are being impacted by this limit?

Member Author:

I don't know a simple way (like `kubectl top pod` or `docker stats` for checking CPU/memory usage and limits) to check that. The comment above lists some methods that can help identify which pods will be impacted.

- `xfs_quota` is the recommended way to see the usage.
- `du` is the traditional way (see the sketch after this list).
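
A sketch of the `du` approach (the `/var/lib/kubelet/pods/...` layout is the kubelet's standard one; the UID and volume name are placeholders):

```bash
# Measure an emptyDir volume's usage the traditional way, straight from
# the kubelet's pod directory. <pod-uid> and <volume-name> are placeholders.
sudo du -sh /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/<volume-name>
```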

- N/A.

* **Are there any missing metrics that would be useful to have to improve observability of this feature?**
- No.
Contributor:

I'm not seeing any cluster-admin visibility into the health of this feature.

Member Author:

kubelet metrics can show `kubelet_evictions{eviction_signal="ephemeralpodfs.limit"} 1`.

I'm not sure if it helps.
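
A sketch of reading that counter without shelling into the node (`kubelet_evictions` is a real kubelet metric; `<node-name>` is a placeholder):

```bash
# Count evictions reported by one node's kubelet, broken down by eviction signal.
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics" \
  | grep "^kubelet_evictions"
```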


* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
- No.
Contributor:

A previous answer was "Performance changes." I'm not clear on how you can have no changes here if that's the case.

Member Author:

I was careless here.

kubelet now allows use of XFS quotas (on XFS and suitably configured ext4fs filesystems) to monitor storage consumption for ephemeral storage (currently for emptyDir volumes only). This method of monitoring consumption is faster and more accurate than the old method of walking the filesystem tree. It does not enforce limits, only monitors consumption. To utilize this functionality, you must set the feature gate `LocalStorageCapacityIsolationFSQuotaMonitoring=true`. For ext4fs filesystems, you must create the filesystem with `mkfs.ext4 -O project <block_device>` and run `tune2fs -Q prjquota <block_device>`; XFS filesystems need no additional preparation. The filesystem must be mounted with option `project` in `/etc/fstab`. If your primary partition is the root filesystem, you must also add `rootflags=pquota` to your GRUB config file.
kubernetes/kubernetes#66928

Updated.
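
A condensed sketch of that ext4 preparation (`/dev/sdc` and the `/var/lib/kubelet` mount point are placeholders; `prjquota` is the usual spelling of the project-quota mount option):

```bash
# One-time preparation of an ext4 filesystem for project quotas (sketch).
mkfs.ext4 -O project /dev/sdc   # create the filesystem with the 'project' feature
tune2fs -Q prjquota /dev/sdc    # turn on project quota tracking in the superblock

# Mount with project quotas enabled; persist the option in /etc/fstab.
mount -o prjquota /dev/sdc /var/lib/kubelet
```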

@deads2k (Contributor) commented May 12, 2021

Updated, @deads2k.
As I don't have enough time today, I may go through and improve the answers later.

I'm happy to see improvement later, but I think greater granularity is required to meet merge requirements. I've left some detailed comments.

@pacoxu pacoxu force-pushed the ephemeral-storage-quotas-beta branch from f23fcb4 to 2c431ce on May 13, 2021 08:02
@pacoxu (Member Author) commented May 13, 2021

I replied to most of the questions and updated the KEP.

Updated, @deads2k.
As I don't have enough time today, I may go through and improve the answers later.

I'm happy to see improvement later, but I think greater granularity is required to meet merge requirements. I've left some detailed comments.

Sorry, I mistakenly thought yesterday was the deadline. 😂

@pacoxu (Member Author) commented May 30, 2022

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 30, 2022
@pacoxu (Member Author) commented Jun 10, 2022

/cc @dchen1107

Can we add this to the v1.25 release cycle?

@k8s-ci-robot k8s-ci-robot requested a review from chendave June 10, 2022 01:30
@pacoxu (Member Author) commented Jun 10, 2022

/cc @dchen1107
/uncc chendave

@k8s-ci-robot k8s-ci-robot removed the request for review from chendave June 10, 2022 01:58
@dchen1107 dchen1107 added this to the v1.25 milestone Jun 14, 2022
keps/sig-node/1029-ephemeral-storage-quotas/kep.yaml (outdated, resolved)
keps/sig-node/1029-ephemeral-storage-quotas/kep.yaml (outdated, resolved)
keps/prod-readiness/sig-node/1029.yaml (outdated, resolved)

###### What steps should be taken if SLOs are not being met to determine the problem?

- Restart the kubelet and wait for 1 minute to make the SLOs clear. (The volume stats checking interval is determined by the kubelet flag `volumeStatsAggPeriod`, default 1m.)
Member:

I don't see how this helps figure out why the SLO isn't being met. Is the recommendation to disable the feature in this case?

Member Author:

I updated it here to:

If the metrics show problems, we can check the log and the quota dir with the commands below.
- There will be warning logs ([after kubernetes/kubernetes#107490 is merged](https://github.com/kubernetes/kubernetes/pull/107490)) if a volume calculation takes longer than 1 second.
- If quota is enabled, you can find the volume information and the processing time with `time repquota -P /var/lib/kubelet -s -v`.

No, we recommend enabling this feature gate for performance.

I wanted to explain how to make the metric clear in the old message. Probably, this is not needed.

- The metric `volume_metric_collection_duration_seconds` is a histogram that contains a lot of old data. A restart may help you to see clearly what the current status of volume monitoring is.
- The `volumeStatsAggPeriod` flag is the time we need to wait before checking the metrics (see the sketch after this list).
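
A sketch of tuning that interval while debugging (`--volume-stats-agg-period` is a real kubelet flag mirroring the `volumeStatsAggPeriod` config field; the drop-in file path is an assumption, and 30s is just an illustrative value):

```bash
# Recalculate volume stats more often while investigating (sketch).
echo 'KUBELET_EXTRA_ARGS=--volume-stats-agg-period=30s' | sudo tee /etc/default/kubelet
sudo systemctl restart kubelet
```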

@pacoxu pacoxu requested a review from johnbelamaric June 20, 2022 02:56
@pacoxu (Member Author) commented Jun 20, 2022

@johnbelamaric I updated the KEP according to your comments. Thanks for your detailed review.

@johnbelamaric (Member):

Ok. Minor comment on PRR, but I think it's OK. But it needs SIG approval.

@pacoxu pacoxu force-pushed the ephemeral-storage-quotas-beta branch from ead5a8f to 67e3eaa on June 21, 2022 01:37
@pacoxu (Member Author) commented Jun 21, 2022

/assign @derekwaynecarr @dchen1107

@johnbelamaric (Member):

Thanks, that looks good. Once there is SIG approval I will approve (I am root in this repo so I have to wait or it will approve the whole thing).

@dchen1107 (Member):

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 22, 2022
@johnbelamaric (Member):

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, johnbelamaric, pacoxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.