proposal: huge page support #33601
A proposal to allow huge page use by applications running in a Kubernetes cluster.

A pod should be able to request a number of a huge pages for use by the application. The scheduler should be able have visibility into the node capacity of huge pages, for each huge page size, and make a decision about if the pod can be scheduled on that node. The kubelet should report the number of available huge pages (per size), and set up the environment such that the pod can successfully use the number of huge pages requested in the pod definition.
nit: be able to have
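As context for "report the number of available huge pages (per size)": on Linux the per-size pool counters live under ```/sys/kernel/mm/hugepages/```. The sketch below is purely illustrative and is not kubelet code; it only reads the standard sysfs counters to show where such per-size counts come from.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* List each huge page size directory under /sys/kernel/mm/hugepages
 * (e.g. hugepages-2048kB) and print its free page count. */
int main(void) {
    const char *base = "/sys/kernel/mm/hugepages";
    DIR *d = opendir(base);
    if (!d) {
        perror("opendir");
        return 1;
    }
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "hugepages-", 10) != 0)
            continue;
        char path[512];
        snprintf(path, sizeof(path), "%s/%s/free_hugepages", base, e->d_name);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        unsigned long free_pages = 0;
        if (fscanf(f, "%lu", &free_pages) == 1)
            printf("%s: %lu free\n", e->d_name, free_pages);
        fclose(f);
    }
    closedir(d);
    return 0;
}
```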
This proposal only includes pre-allocated huge pages configured on the node by the administrator at boot time or by manual dynamic allocation. It does not discuss the kubelet attempting to allocate huge pages dynamically in an attempt to accommodate a scheduling pod or the use of Transparent Huge Pages (THP). THP does not require any knowledge on the part of Kubernetes; it simply requires the node to have THP enabled and the application to ```madvise()``` with ```MADV_HUGEPAGE``` the memory regions it desires to be backed by huge pages. Note that THP might lead to performance degradation on nodes with high memory utilization or fragmentation due to the defragmenting efforts of THP, which can lock memory pages. For this reason some, applications may be designed to (or recommend) use pre-allocated huge pages instead of THP.
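For readers unfamiliar with the THP opt-in path, here is a minimal sketch (illustrative only; the region size is an arbitrary assumption) of an application hinting that a mapping should be backed by transparent huge pages:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_SIZE (64UL * 1024 * 1024)  /* 64MB working set, chosen arbitrarily */

int main(void) {
    /* Reserve an anonymous, page-aligned region. */
    void *buf = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Hint that this region should be backed by transparent huge pages.
     * The kernel honors this only if THP is enabled on the node. */
    if (madvise(buf, REGION_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    memset(buf, 0, REGION_SIZE);  /* touch the memory so pages are faulted in */
    munmap(buf, REGION_SIZE);
    return 0;
}
```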
The proposal is also limited to support on x86_64 where two huge page sizes are supported: 2MB and 1G. The design, however, should accommodate addition huge page sizes available on other architectures.
s/addition/additional
@sjenning can you wrap lines at 80 or 100 chars? It makes it much much easier to comment.
Take a program that operates on a large 2MB structure as an example. If the program accesses that space in such a way that one byte in each regular 4k page is accessed, 2M/4k = 512 TLB entries are needed to map the address range. Each TLB miss results in an expensive walk of the page table. However, if the allocation is backed by a 2M huge page, only one TLB entry is required, resulting in a high likelihood that the entry will remain in the cache and hit on accesses to the entire 2M structure.
On x86_64, there are two huge page sizes: 2MB and 1G. 1G huge pages are also called gigantic pages. 1G must be enabled on the kernel boot line with ```hugepagesz=1G```. Huge pages, especially 1G ones, should be allocated early before memory fragments (i.e. at/near boot time) to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defreg) required.
s/defreg/defrag
Huge pages can not be overcommitted on a node.
While a system may support multiple huge page sizes, it is assumed that nodes configured with huge pages will only use one huge page size, namely the default page size in ```cat /proc/meminfo | grep Hugepagesize```. In Linux, this is 2M unless overridden by ```default_hugepagesz=1g``` in the kernel boot parameters.
s/default_//
```hugepagesz=1g``` enables 1G pages, but doesn't make them the default IIUC. I'll go try it out.
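For illustration only (the counts below are assumptions, not anything mandated by the proposal), a node administrator who wants pools of both sizes with 1G as the default might boot with:

```
default_hugepagesz=1G hugepagesz=1G hugepages=4 hugepagesz=2M hugepages=512
```

Here each ```hugepagesz=``` selects which pool the following ```hugepages=``` count applies to, while ```default_hugepagesz=``` is what changes the size reported as ```Hugepagesize``` in ```/proc/meminfo```.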
Jenkins verification failed for commit 5b1b312. Full PR test history.

Jenkins GCI GCE e2e failed for commit 5b1b312. Full PR test history.

Is this meant for v1.5?
@vishh - from our side, we want to start the discussion and ideally agree
```yaml
limits:
  hugepages: "10"
nodeSelector:
  kubernetes.io/huge-page-size: "2M"
```
What if this annotation is omitted? Will the pod be rejected by validation? The numbers above are only meaningful with this annotation.
It's a label, not an annotation of course.
@sttts yes a validation check is probably the best way. I agree that hugepages limit is ambiguous without the nodeSelector.
cc/ @davidopp from the resource management and scheduler side.
I have a couple of questions
I did not fully understand how
@davidopp so hugepages will behave in the manner you describe, barring badness on the part of the node admin.
The memory request and huge page request would be completely disjoint. Unused huge pages are not reported as general free memory; they sit in a reserved pool the kernel maintains. Now, that might impact the scheduler's calculation of available memory on the system if it computes that by subtracting the memory … Let me know if I didn't clarify something well enough.
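To make the separate-pool point concrete, here is a small illustrative sketch (not kubelet code; it only reads the standard ```/proc/meminfo``` hugetlb counters) showing where the pre-allocated pages are accounted. Memory reserved for these pages does not appear in MemFree/MemAvailable, which is why the kubelet has to account for it separately when reporting node memory capacity.

```c
#include <stdio.h>
#include <string.h>

/* Print the hugetlb pool counters from /proc/meminfo, e.g.
 * HugePages_Total, HugePages_Free, HugePages_Rsvd, Hugepagesize. */
int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "HugePages_", 10) == 0 ||
            strncmp(line, "Hugepagesize", 12) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```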
@sjenning Thanks, that was very clear. I didn't realize hugepages came out of a separate pool. So as you say, as long as Kubelet excludes hugepages when it reports memory capacity of the node, then everything is very straightforward.
@davidopp thanks for looking at this proposal. Indeed the second pool hopefully makes calculations easier on the Kubelet, because memory allocated to hugepages by the admin is immediately shown as "used" (whether or not an application has actually used them) -- so it's a several-step process as @sjenning has described:
A proposal to allow huge page use by applications running in a Kubernetes cluster.
A pod should be able to have a number of a huge pages for use by the
s/of a huge pages/of huge pages
memory regions it desires to be backed by huge pages. Note that THP might lead to performance degradation on nodes with high memory utilization or fragmentation due to the defragmenting efforts of THP, which can lock memory pages. For this reason some, applications may be designed to (or recommend) use
"For this reason, some applications..."
pre-allocated huge pages instead of THP.
The proposal is also limited to support on x86_64 where two huge page sizes are
s/to support on x86_64/to support x86_64
supported: 2MB and 1G. The design, however, should accommodate additional huge page sizes available on other architectures.
**NOTE: This design, as currently proposed, requires the use of pod-levels
s/pod-levels cgroups/pod-level cgroups
## Background

Huge pages are a hardware feature designed to reduce pressure on the Translation Lookaside Buffer (TLB) The TLB is a small hardware cache of
"... (TLB). The ..."
```hugepagesz=1G```. Huge pages, especially 1G ones, should be allocated early before memory fragments (i.e. at/near boot time) to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defreg) required.
s/defreg/defrag
## Use Cases
The class of applications that benefit from huge pages are typically have
s/are typically have/typically have
Implement huge page support with pod-level cgroups to enforce per-pod huge page limits (not yet available). Enforcing huge page limits with pod-level cgroups avoids, at least temporarily, the need for 1) docker to support the
s/docker/Docker
actually, in this case, i do mean docker the command, not Docker the company. i'll put code ticks around it.
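For background on the enforcement mechanism in the hunk above, here is an illustrative sketch (assuming the cgroup v1 hugetlb controller; the pod cgroup path is hypothetical and not something this proposal defines) of setting a pod-level limit of ten 2MB pages by writing to the pod's hugetlb cgroup:

```c
#include <stdio.h>

/* Write a byte limit to a hugetlb cgroup control file, e.g.
 * /sys/fs/cgroup/hugetlb/<pod-cgroup>/hugetlb.2MB.limit_in_bytes.
 * The <pod-cgroup> path used below is hypothetical. */
static int set_hugetlb_limit(const char *cgroup_path, const char *page_size,
                             unsigned long long limit_bytes) {
    char path[512];
    snprintf(path, sizeof(path), "%s/hugetlb.%s.limit_in_bytes",
             cgroup_path, page_size);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%llu\n", limit_bytes);
    return fclose(f);
}

int main(void) {
    /* 10 pages of 2MB each = a 20MB huge page limit for the pod-level cgroup. */
    if (set_hugetlb_limit("/sys/fs/cgroup/hugetlb/pod-example", "2MB",
                          10ULL * 2 * 1024 * 1024) != 0) {
        perror("set_hugetlb_limit");
        return 1;
    }
    return 0;
}
```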
The kubelet will also need to create the ```alpha.kubernetes.io/huge-page-size``` label for its node resource (if self registering).
pkg/aip/unversioned/well_known_labels.go
s/aip/api
memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The the user running the java app must be a member of the gid set in the ```vm.hugetlb_shm_group``` sysctl
s/The the/The
s/java/Java
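To illustrate the System V shared memory path described above (a sketch only; the segment size is an arbitrary assumption, and this is not code from the proposal), an application such as a JVM with large pages enabled obtains huge pages roughly like this:

```c
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000  /* from <linux/shm.h>, if the libc header lacks it */
#endif

#define SEG_SIZE (256UL * 1024 * 1024)  /* 256MB, a multiple of the huge page size */

int main(void) {
    /* Request a shared memory segment backed by huge pages. The calling user
     * must be in the group named by the vm.hugetlb_shm_group sysctl (or have
     * CAP_IPC_LOCK), and enough pages must be free in the pre-allocated pool. */
    int shmid = shmget(IPC_PRIVATE, SEG_SIZE,
                       SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
    if (shmid < 0) {
        perror("shmget(SHM_HUGETLB)");
        return 1;
    }

    void *addr = shmat(shmid, NULL, 0);
    if (addr == (void *)-1) {
        perror("shmat");
    } else {
        memset(addr, 0, SEG_SIZE);  /* fault in the huge pages */
        shmdt(addr);
    }
    shmctl(shmid, IPC_RMID, NULL);  /* mark the segment for removal */
    return 0;
}
```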
Database workloads, e.g., MongoDB [1], suggest that THP should be disabled on the host for optimal performance. I see in the proposal a way to request huge pages, but I'm missing a way to specify that a pod should be scheduled on a node with THP disabled. Is that something we could include as part of the design, or is that a separate matter? [1] https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

@rhcarvalho thanks for the review! Regarding disabling THP, I am going to say that is outside the scope of this proposal. However, the cluster admin can do this with existing mechanisms. It could be understood that nodes with pre-allocated huge pages are also configured with THP disabled. Or you could disable THP on the nodes, add a node label
Jenkins GKE smoke e2e failed for commit 8caeda4. Full PR test history.

Jenkins GCI GKE smoke e2e failed for commit 8caeda4. Full PR test history.

@sjenning I'm prioritizing v1.5 PRs as of now. I will start reviewing this next week. Apologies for the delay.
This PR hasn't been active in 30 days. It will be closed in 59 days (Jan 6, 2017). cc @davidopp @derekwaynecarr @sjenning @vishh You can add the 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days.

Adding label:do-not-merge because PR changes docs prohibited to auto merge

migrating to kubernetes/community#181
Proposal for supporting applications that desire pre-allocated huge pages in Kubernetes
@derekwaynecarr @kubernetes/rh-cluster-infra @dchen1107 @vishh @jeremyeder @kubernetes/sig-node