
proposal: huge page support #33601

Closed
wants to merge 4 commits

Conversation


@sjenning sjenning commented Sep 27, 2016

Proposal for supporting applications that desire pre-allocated huge pages in Kubernetes

@derekwaynecarr @kubernetes/rh-cluster-infra @dchen1107 @vishh @jeremyeder @kubernetes/sig-node




A proposal to allow huge page use by applications running in a Kubernetes cluster.

A pod should be able to request a number of a huge pages for use by the application. The scheduler should be able have visibility into the node capacity of huge pages, for each huge page size, and make a decision about if the pod can be scheduled on that node. The kubelet should report the number of available huge pages (per size), and set up the environment such that the pod can successfully use the number of huge pages requested in the pod definition.

nit: be able to have


This proposal only includes pre-allocated huge pages configured on the node by the administrator at boot time or by manual dynamic allocation. It does not discuss the kubelet attempting to allocate huge pages dynamically in an attempt to accommodate a scheduling pod, or the use of Transparent Huge Pages (THP). THP does not require any knowledge on the part of Kubernetes; it simply requires the node to have THP enabled and the application to ```madvise()``` with ```MADV_HUGEPAGE``` the memory regions it desires to be backed by huge pages. Note that THP might lead to performance degradation on nodes with high memory utilization or fragmentation due to the defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to use (or recommend using) pre-allocated huge pages instead of THP.
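As background (not part of the proposal), a minimal sketch of how a Linux application might opt an anonymous mapping into THP; it assumes the golang.org/x/sys/unix package, and the mapping size and error handling are illustrative only:

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 64 << 20 // 64 MiB anonymous mapping (illustrative size)

	// Map anonymous memory; with THP in "madvise" mode the kernel only backs
	// regions with huge pages when the application explicitly asks for it.
	buf, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		log.Fatalf("mmap: %v", err)
	}
	defer unix.Munmap(buf)

	// Ask the kernel to back this region with transparent huge pages.
	if err := unix.Madvise(buf, unix.MADV_HUGEPAGE); err != nil {
		log.Fatalf("madvise(MADV_HUGEPAGE): %v", err)
	}

	buf[0] = 1 // touch the region so pages are actually faulted in
	fmt.Println("region advised for THP")
}
```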

The proposal is also limited to support on x86_64 where two huge page sizes are supported: 2MB and 1G. The design, however, should accommodate addition huge page sizes available on other architectures.

s/addition/additional

@k8s-github-robot k8s-github-robot added kind/design, size/L, and release-note-label-needed labels Sep 27, 2016
@pmorie

pmorie commented Sep 27, 2016

@sjenning can you wrap lines at 80 or 100 chars? It makes it much much easier to comment.


Take a program that operates on a large 2MB structure as an example. If the program accesses that space in such a way that one byte in each regular 4k page is touched, 2M/4k = 512 TLB entries are needed to map the address range. Each TLB miss results in an expensive walk of the page table. However, if the allocation is backed by a single 2M huge page, only 1 TLB entry is required, resulting in a high likelihood that the entry will remain in the cache and hit on accesses to the entire 2M structure.

On x86_64, there are two huge page sizes: 2MB and 1G. 1G huge pages are also called gigantic pages. 1G pages must be enabled on the kernel boot line with ```hugepagesz=1G```. Huge pages, especially 1G ones, should be allocated early, before memory fragments (i.e. at/near boot time), to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defreg) required.


s/defreg/defrag


Huge pages cannot be overcommitted on a node.

While a system may support multiple huge page sizes, it is assumed that nodes configured with huge pages will only use one huge page size, namely the default page size shown in ```cat /proc/meminfo | grep Hugepagesize```. In Linux, this is 2M unless overridden by ```default_hugepagesz=1g``` in the kernel boot parameters.


s/default_//

Contributor Author

hugepagesz=1g enables 1G pages, but doesn't make them the default IIUC. I'll go try it out.

https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
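For context on the values discussed above, a minimal, hypothetical sketch (not taken from the kubelet) of how a node agent could read the default huge page size and the pre-allocated pool counts from /proc/meminfo:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// readHugePageInfo scans /proc/meminfo for the default huge page size and the
// pre-allocated/free huge page counts a kubelet-like agent would report.
func readHugePageInfo() (map[string]string, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	want := map[string]bool{
		"Hugepagesize":    true, // default huge page size, e.g. "2048 kB"
		"HugePages_Total": true, // size of the pre-allocated pool
		"HugePages_Free":  true, // pages not yet backing a mapping
	}
	out := map[string]string{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.SplitN(s.Text(), ":", 2)
		if len(parts) == 2 && want[parts[0]] {
			out[parts[0]] = strings.TrimSpace(parts[1])
		}
	}
	return out, s.Err()
}

func main() {
	info, err := readHugePageInfo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info)
}
```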

@bgrant0607 bgrant0607 assigned derekwaynecarr and vishh and unassigned bgrant0607 Sep 27, 2016
@k8s-ci-robot

Jenkins verification failed for commit 5b1b312. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot

Jenkins GCI GCE e2e failed for commit 5b1b312. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@vishh

vishh commented Sep 27, 2016

Is this meant for v1.5?

@derekwaynecarr

@vishh - from our side, we want to start the discussion and ideally agree
on a design in 1.5. I would like to do the implementation in 1.6


@yifan-gu

limits:
  hugepages: "10"
nodeSelector:
  kubernetes.io/huge-page-size: "2M"

If this annotation is omitted? Will the pod be rejected by validation? The numbers above are only meaningful with this annotation.


It's a label, not an annotation of course.

Contributor Author

@sttts yes a validation check is probably the best way. I agree that hugepages limit is ambiguous without the nodeSelector.
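To make the suggested validation concrete, a hypothetical sketch: the `alpha.kubernetes.io/huge-page-size` label key and the `hugepages` resource come from the proposal, while the `Pod` type and `validateHugePages` function below are simplified stand-ins rather than actual Kubernetes validation code:

```go
package main

import "fmt"

// Pod is a trimmed-down stand-in for the real API type, for illustration only.
type Pod struct {
	NodeSelector   map[string]string
	HugePagesLimit int64 // value of the "hugepages" resource limit, in pages
}

// hugePageSizeLabel is the node label proposed for advertising the page size.
const hugePageSizeLabel = "alpha.kubernetes.io/huge-page-size"

// validateHugePages rejects pods that request huge pages without pinning the
// page size via the node selector, since the request count alone is ambiguous.
func validateHugePages(p *Pod) error {
	if p.HugePagesLimit == 0 {
		return nil // no huge pages requested, nothing to check
	}
	if _, ok := p.NodeSelector[hugePageSizeLabel]; !ok {
		return fmt.Errorf("pods requesting hugepages must also set the %q node selector", hugePageSizeLabel)
	}
	return nil
}

func main() {
	bad := &Pod{HugePagesLimit: 10} // no node selector: ambiguous request
	fmt.Println(validateHugePages(bad))

	good := &Pod{
		HugePagesLimit: 10,
		NodeSelector:   map[string]string{hugePageSizeLabel: "2M"},
	}
	fmt.Println(validateHugePages(good)) // <nil>
}
```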

@dchen1107

cc/ @davidopp from the resource management and scheduler side.

@derekwaynecarr derekwaynecarr added the release-note-none label and removed the release-note-label-needed label Sep 28, 2016
@davidopp davidopp self-assigned this Oct 2, 2016
@davidopp

davidopp commented Oct 2, 2016

I have a couple of questions

  1. I'm confused about how the number of available hugepages is supposed to be computed for scheduling. All of the current resources we have work like this:
  • kubelet reports a capacity (somewhat confusingly called allocatable, but I'm just going to call it capacity here)
  • the reported capacity never changes
  • scheduler computes amount free for a resource as capacity minus sum of requests of the pods that are running on that node (as seen from the scheduler's state)
  • the only thing that affects the amount free is scheduler binding a pod or noticing that a pod has terminated

I did not fully understand how the amount free for hugepages is calculated. If it works like the above, then great. But it sounds like memory fragmentation can change the number of "free" hugepages over time even in the absence of new pods being bound to the node? If that's the case, things get pretty messy -- the requests don't give you the full picture (since some hugepages might be effectively destroyed due to memory allocation) and neither does the usage (since some containers may have requested hugepages but not yet started using any memory).

  2. I am not clear on the relationship between the hugepages request and the memory request. Is there some kind of validation requirement, such as: the memory request must be greater than or equal to the value of alpha.kubernetes.io/huge-page-size times the hugepages request?

@sjenning

sjenning commented Oct 3, 2016

@davidopp so hugepages will behave in the manner you describe, barring badness on the part of the node admin.

hugepages=X on the kernel parameters or echo X > /proc/sys/vm/nr_hugepages will fix the number of huge pages on the node before the kubelet is started. After the kubelet has started and reports node status, the number of huge pages should not be changed. This number is the capacity. No node-local processes should consume huge pages. This ensures that only pods known to the scheduler can consume the huge pages so that the available number of huge pages can be computed.

The memory request and huge page request would be completely disjoint. Unused huge pages are not reported as general free memory; they sit in a reserved pool that the kernel maintains. Now, that might impact the scheduler's calculation of available memory on the system if it computes that by subtracting the total memory requests of the pods on the node from the memory capacity. However, if it uses the actual node-reported available memory, it should be no problem, as that number will reflect the huge pages as used memory.

Let me know if I didn't clarify something well enough.
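A small sketch of the accounting sjenning describes above, assuming hypothetical types rather than actual scheduler code: available huge pages are the fixed node capacity minus the sum of requests from pods already bound to the node:

```go
package main

import "fmt"

// nodeHugePages tracks the fixed capacity reported by the kubelet and the
// per-pod requests the scheduler has already bound to the node.
type nodeHugePages struct {
	capacity    int64   // e.g. HugePages_Total reported at kubelet startup
	podRequests []int64 // hugepages requested by pods bound to this node
}

// available is capacity minus the sum of scheduled requests; it never depends
// on runtime usage because the pre-allocated pool does not change underneath us.
func (n nodeHugePages) available() int64 {
	free := n.capacity
	for _, r := range n.podRequests {
		free -= r
	}
	return free
}

func main() {
	n := nodeHugePages{capacity: 1024, podRequests: []int64{10, 256}}
	fmt.Println(n.available()) // 758
}
```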

@davidopp

davidopp commented Oct 3, 2016

@sjenning Thanks, that was very clear. I didn't realize hugepages came out of a separate pool. So as you say, as long as Kubelet excludes hugepages when it reports memory capacity of the node, then everything is very straightforward.

@jeremyeder

@davidopp thanks for looking at this proposal. Indeed the 2nd pool hopefully makes calculations easier on the Kubelet, because memory allocated to hugepages by the admin is immediately shown as "used" (whether or not an application has actually used them) -- so it's a several-step process as @sjenning has described:

  1. admin allocates hugepages in whatever fashion they'd like (1GB pages should be allocated on the kernel cmdline; 2MB is a little more flexible, but really should be done either on the cmdline or in an early init service such as tuned)
  2. kubelet starts, and sees the state of the union
  3. user starts a pod that wants N-hugepages, scheduler identifies a node with hugepage "capacity", schedules it there, and then it's up to the pod author to actually configure her app to use them.

A proposal to allow huge page use by applications running in a Kubernetes
cluster.

A pod should be able to have a number of a huge pages for use by the

s/of a huge pages/of huge pages

memory regions it desires to be backed by huge pages. Note that THP might lead
to performance degradation on nodes with high memory utilization or
fragmentation due to the defragmenting efforts of THP, which can lock memory
pages. For this reason some, applications may be designed to (or recommend) use

"For this reason, some applications..."

pages. For this reason some, applications may be designed to (or recommend) use
pre-allocated huge pages instead of THP.

The proposal is also limited to support on x86_64 where two huge page sizes are

s/to support on x86_64/to support x86_64

supported: 2MB and 1G. The design, however, should accommodate additional huge
page sizes available on other architectures.

**NOTE: This design, as currently proposed, requires the use of pod-levels

@rhcarvalho rhcarvalho Oct 7, 2016


s/pod-levels cgroups/pod-level cgroups

## Background

Huge pages are a hardware feature designed to reduce pressure on the Translation
Lookaside Buffer (TLB) The TLB is a small hardware cache of

"... (TLB). The ..."

```hugepagesz=1G```. Huge pages, especially 1G ones, should to be allocated
early before memory fragments (i.e. at/near boot time) to increase the
likelihood that they can be allocated successfully with minimal memory migration
(i.e. defreg) required.

s/defreg/defrag


## Use Cases

The class of applications that benefit from huge pages are typically have

s/are typically have/typically have


Implement huge page support with pod-level cgroups to enforce per-pod huge page
limits (not yet available). Enforcing huge page limits with pod-level cgroups
avoids, at least temporarily, the need for 1) docker to support the

s/docker/Docker


@sjenning sjenning Oct 7, 2016


actually, in this case, i do mean docker the command, not Docker the company. i'll put code ticks around it.

The kubelet will also need to create the ```alpha.kubernetes.io/huge-page-size```
label for its node resource (if self registering).

pkg/aip/unversioned/well_known_labels.go

s/aip/api

memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The the user running the java app must be a member of the gid set in the ```vm.huge_tlb_shm_group``` sysctl

s/The the/The

s/java/Java

@rhcarvalho

Database workloads, e.g., MongoDB [1], suggest that THP should be disabled on the host for optimal performance.

I see in the proposal a way to request huge pages, but I'm missing a way to specify that a pod should be scheduled on a node with THP disabled.

Is that something we could include as part of the design, or is that a separate matter?

[1] https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

@sjenning

sjenning commented Oct 7, 2016

@rhcarvalho thanks for the review!

Regarding disabling THP, I am going to say that is outside the scope of this proposal. However, the cluster admin can do this with existing mechanisms. It could be understood that nodes with pre-allocated huge pages are also configured with THP disabled. Or you could disable THP on the nodes, add a node label thp=disabled, and have the MongoDB pods select nodes with that label.

@k8s-ci-robot

Jenkins GKE smoke e2e failed for commit 8caeda4. Full PR test history.

The magic incantation to run this job again is @k8s-bot gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot

Jenkins GCI GKE smoke e2e failed for commit 8caeda4. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@vishh

vishh commented Oct 8, 2016

@sjenning I'm prioritizing v1.5 PRs as of now. I will start reviewing this next week. Apologies for the delay.

@vishh vishh added this to the next-candidate milestone Oct 31, 2016
@k8s-github-robot

This PR hasn't been active in 30 days. It will be closed in 59 days (Jan 6, 2017).

cc @davidopp @derekwaynecarr @sjenning @vishh

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@k8s-github-robot

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@k8s-github-robot k8s-github-robot added kind/design, kind/old-docs, and do-not-merge labels Dec 1, 2016
@sjenning

migrating to kubernetes/community#181

@sjenning sjenning closed this Dec 15, 2016
@sjenning sjenning deleted the hugepage-proposal branch August 16, 2017 02:17