
proposal: huge page support #33601

Closed
wants to merge 4 commits

Conversation


@sjenning sjenning commented Sep 27, 2016

Proposal for supporting applications that desire pre-allocated huge pages in Kubernetes

@derekwaynecarr @kubernetes/rh-cluster-infra @dchen1107 @vishh @jeremyeder @kubernetes/sig-node




A proposal to allow huge page use by applications running in a Kubernetes cluster.

A pod should be able to request a number of a huge pages for use by the application. The scheduler should be able have visibility into the node capacity of huge pages, for each huge page size, and make a decision about if the pod can be scheduled on that node. The kubelet should report the number of available huge pages (per size), and set up the environment such that the pod can successfully use the number of huge pages requested in the pod definition.

nit: be able to have


This proposal only includes pre-allocated huge pages configured on the node by the administrator at boot time or by manual dynamic allocation. It does not discuss the kubelet attempting to allocate huge pages dynamically in an attempt to accommodate a scheduling pod, or the use of Transparent Huge Pages (THP). THP does not require any knowledge on the part of Kubernetes; it simply requires the node to have THP enabled and the application to ```madvise()``` with ```MADV_HUGEPAGE``` the memory regions it desires to be backed by huge pages. Note that THP might lead to performance degradation on nodes with high memory utilization or fragmentation due to the defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to use (or recommend using) pre-allocated huge pages instead of THP.
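As background (not part of the proposal), a minimal sketch of how a Linux application might opt an anonymous mapping into THP; it assumes the golang.org/x/sys/unix package, and the mapping size and error handling are illustrative only:

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 64 << 20 // 64 MiB anonymous mapping (illustrative size)

	// Map anonymous memory; with THP in "madvise" mode the kernel only backs
	// regions with huge pages when the application explicitly asks for it.
	buf, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		log.Fatalf("mmap: %v", err)
	}
	defer unix.Munmap(buf)

	// Ask the kernel to back this region with transparent huge pages.
	if err := unix.Madvise(buf, unix.MADV_HUGEPAGE); err != nil {
		log.Fatalf("madvise(MADV_HUGEPAGE): %v", err)
	}

	buf[0] = 1 // touch the region so pages are actually faulted in
	fmt.Println("region advised for THP")
}
```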

The proposal is also limited to support on x86_64 where two huge page sizes are supported: 2MB and 1G. The design, however, should accommodate addition huge page sizes available on other architectures.

s/addition/additional

@k8s-github-robot k8s-github-robot added kind/design, size/L, and release-note-label-needed labels Sep 27, 2016
@pmorie

pmorie commented Sep 27, 2016

@sjenning can you wrap lines at 80 or 100 chars? It makes it much much easier to comment.


Take a program that operates on a large 2MB structure as an example. If the program accesses that space in such a way that one byte in each regular 4k page is touched, 2M/4k = 512 TLB entries are needed to map the address range. Each TLB miss results in an expensive walk of the page table. However, if the allocation is backed by a single 2M huge page, only 1 TLB entry is required, resulting in a high likelihood that the entry will remain in the cache and hit on accesses to the entire 2M structure.

On x86_64, there are two huge page sizes: 2MB and 1G. 1G huge pages are also called gigantic pages. 1G pages must be enabled on the kernel boot line with ```hugepagesz=1G```. Huge pages, especially 1G ones, should be allocated early, before memory fragments (i.e. at/near boot time), to increase the likelihood that they can be allocated successfully with minimal memory migration (i.e. defreg) required.


s/defreg/defrag


Huge pages cannot be overcommitted on a node.

While a system may support multiple huge page sizes, it is assumed that nodes configured with huge pages will only use one huge page size, namely the default page size shown in ```cat /proc/meminfo | grep Hugepagesize```. In Linux, this is 2M unless overridden by ```default_hugepagesz=1g``` in the kernel boot parameters.


s/default_//

Contributor Author

hugepagesz=1g enables 1G pages, but doesn't make them the default IIUC. I'll go try it out.

https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
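For context on the values discussed above, a minimal, hypothetical sketch (not taken from the kubelet) of how a node agent could read the default huge page size and the pre-allocated pool counts from /proc/meminfo:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// readHugePageInfo scans /proc/meminfo for the default huge page size and the
// pre-allocated/free huge page counts a kubelet-like agent would report.
func readHugePageInfo() (map[string]string, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	want := map[string]bool{
		"Hugepagesize":    true, // default huge page size, e.g. "2048 kB"
		"HugePages_Total": true, // size of the pre-allocated pool
		"HugePages_Free":  true, // pages not yet backing a mapping
	}
	out := map[string]string{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		parts := strings.SplitN(s.Text(), ":", 2)
		if len(parts) == 2 && want[parts[0]] {
			out[parts[0]] = strings.TrimSpace(parts[1])
		}
	}
	return out, s.Err()
}

func main() {
	info, err := readHugePageInfo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(info)
}
```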

@bgrant0607 bgrant0607 assigned derekwaynecarr and vishh and unassigned bgrant0607 Sep 27, 2016
@k8s-ci-robot

Jenkins verification failed for commit 5b1b312. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot

Jenkins GCI GCE e2e failed for commit 5b1b312. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@vishh

vishh commented Sep 27, 2016

Is this meant for v1.5?

@derekwaynecarr

@vishh - from our side, we want to start the discussion and ideally agree
on a design in 1.5. I would like to do the implementation in 1.6


@yifan-gu

limits:
  hugepages: "10"
nodeSelector:
  kubernetes.io/huge-page-size: "2M"

If this annotation is omitted? Will the pod be rejected by validation? The numbers above are only meaningful with this annotation.


It's a label, not an annotation of course.

Contributor Author

@sttts yes a validation check is probably the best way. I agree that hugepages limit is ambiguous without the nodeSelector.
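To make the suggested validation concrete, a hypothetical sketch: the `alpha.kubernetes.io/huge-page-size` label key and the `hugepages` resource come from the proposal, while the `Pod` type and `validateHugePages` function below are simplified stand-ins rather than actual Kubernetes validation code:

```go
package main

import "fmt"

// Pod is a trimmed-down stand-in for the real API type, for illustration only.
type Pod struct {
	NodeSelector   map[string]string
	HugePagesLimit int64 // value of the "hugepages" resource limit, in pages
}

// hugePageSizeLabel is the node label proposed for advertising the page size.
const hugePageSizeLabel = "alpha.kubernetes.io/huge-page-size"

// validateHugePages rejects pods that request huge pages without pinning the
// page size via the node selector, since the request count alone is ambiguous.
func validateHugePages(p *Pod) error {
	if p.HugePagesLimit == 0 {
		return nil // no huge pages requested, nothing to check
	}
	if _, ok := p.NodeSelector[hugePageSizeLabel]; !ok {
		return fmt.Errorf("pods requesting hugepages must also set the %q node selector", hugePageSizeLabel)
	}
	return nil
}

func main() {
	bad := &Pod{HugePagesLimit: 10} // no node selector: ambiguous request
	fmt.Println(validateHugePages(bad))

	good := &Pod{
		HugePagesLimit: 10,
		NodeSelector:   map[string]string{hugePageSizeLabel: "2M"},
	}
	fmt.Println(validateHugePages(good)) // <nil>
}
```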

@dchen1107

cc/ @davidopp from the resource management and scheduler side.

@derekwaynecarr derekwaynecarr added the release-note-none label and removed the release-note-label-needed label Sep 28, 2016
@davidopp davidopp self-assigned this Oct 2, 2016
@davidopp

davidopp commented Oct 2, 2016

I have a couple of questions

  1. I'm confused about how the number of available hugepages is supposed to be computed for scheduling. All of the current resources we have work like this:
  • kubelet reports a capacity (somewhat confusingly called allocatable, but I'm just going to call it capacity here)
  • the reported capacity never changes
  • scheduler computes amount free for a resource as capacity minus sum of requests of the pods that are running on that node (as seen from the scheduler's state)
  • the only thing that affects the amount free is scheduler binding a pod or noticing that a pod has terminated

I did not fully understand how the amount free for hugepages is calculated. If it works like the above, then great. But it sounds like memory fragmentation can change the number of "free" hugepages over time even in the absence of new pods being bound to the node? If that's the case, things get pretty messy -- the requests don't give you the full picture (since some hugepages might be effectively destroyed due to memory allocation) and neither does the usage (since some containers may have requested hugepages but not yet started using any memory).

  2. I am not clear on the relationship between the hugepages request and the memory request. Is there some kind of validation requirement, such as: the memory request must be greater than or equal to the value of alpha.kubernetes.io/huge-page-size times the hugepages request?

@sjenning

sjenning commented Oct 3, 2016

@davidopp so hugepages will behave in the manner you describe, barring badness on the part of the node admin.

hugepages=X on the kernel parameters or echo X > /proc/sys/vm/nr_hugepages will fix the number of huge pages on the node before the kubelet is started. After the kubelet has started and reports node status, the number of huge pages should not be changed. This number is the capacity. No node-local processes should consume huge pages. This ensures that only pods known to the scheduler can consume the huge pages so that the available number of huge pages can be computed.

The memory request and huge page request would be completely disjoint. Unused huge pages are not reported as general free memory; they sit in a reserved pool that the kernel maintains. Now, that might impact the scheduler's calculation of available memory on the system if it computes that by subtracting the total memory requests of the pods on the node from the memory capacity. However, if it uses the actual node-reported available memory, it should be no problem, as that number will reflect the huge pages as used memory.

Let me know if I didn't clarify something well enough.
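A small sketch of the accounting sjenning describes above, assuming hypothetical types rather than actual scheduler code: available huge pages are the fixed node capacity minus the sum of requests from pods already bound to the node:

```go
package main

import "fmt"

// nodeHugePages tracks the fixed capacity reported by the kubelet and the
// per-pod requests the scheduler has already bound to the node.
type nodeHugePages struct {
	capacity    int64   // e.g. HugePages_Total reported at kubelet startup
	podRequests []int64 // hugepages requested by pods bound to this node
}

// available is capacity minus the sum of scheduled requests; it never depends
// on runtime usage because the pre-allocated pool does not change underneath us.
func (n nodeHugePages) available() int64 {
	free := n.capacity
	for _, r := range n.podRequests {
		free -= r
	}
	return free
}

func main() {
	n := nodeHugePages{capacity: 1024, podRequests: []int64{10, 256}}
	fmt.Println(n.available()) // 758
}
```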

@davidopp

davidopp commented Oct 3, 2016

@sjenning Thanks, that was very clear. I didn't realize hugepages came out of a separate pool. So as you say, as long as Kubelet excludes hugepages when it reports memory capacity of the node, then everything is very straightforward.

@jeremyeder

@davidopp thanks for looking at this proposal. Indeed the 2nd pool hopefully makes calculations easier on the Kubelet, because memory allocated to hugepages by the admin is immediately shown as "used" (whether or not an application has actually used them) -- so it's a several-step process as @sjenning has described:

  1. admin allocates hugepages in whatever fashion they'd like (1GB pages should be allocated on the kernel cmdline; 2MB is a little more flexible, but really should be done either on the cmdline or in an early init service such as tuned)
  2. kubelet starts, and sees the state of the union
  3. user starts a pod that wants N-hugepages, scheduler identifies a node with hugepage "capacity", schedules it there, and then it's up to the pod author to actually configure her app to use them.

A proposal to allow huge page use by applications running in a Kubernetes
cluster.

A pod should be able to have a number of a huge pages for use by the

s/of a huge pages/of huge pages

memory regions it desires to be backed by huge pages. Note that THP might lead
to performance degradation on nodes with high memory utilization or
fragmentation due to the defragmenting efforts of THP, which can lock memory
pages. For this reason some, applications may be designed to (or recommend) use

"For this reason, some applications..."

pages. For this reason some, applications may be designed to (or recommend) use
pre-allocated huge pages instead of THP.

The proposal is also limited to support on x86_64 where two huge page sizes are

s/to support on x86_64/to support x86_64

supported: 2MB and 1G. The design, however, should accommodate additional huge
page sizes available on other architectures.

**NOTE: This design, as currently proposed, requires the use of pod-levels

@rhcarvalho rhcarvalho Oct 7, 2016


s/pod-levels cgroups/pod-level cgroups

## Background

Huge pages are a hardware feature designed to reduce pressure on the Translation
Lookaside Buffer (TLB) The TLB is a small hardware cache of

"... (TLB). The ..."

```hugepagesz=1G```. Huge pages, especially 1G ones, should to be allocated
early before memory fragments (i.e. at/near boot time) to increase the
likelihood that they can be allocated successfully with minimal memory migration
(i.e. defreg) required.

s/defreg/defrag


## Use Cases

The class of applications that benefit from huge pages are typically have

s/are typically have/typically have


Implement huge page support with pod-level cgroups to enforce per-pod huge page
limits (not yet available). Enforcing huge page limits with pod-level cgroups
avoids, at least temporarily, the need for 1) docker to support the

s/docker/Docker


@sjenning sjenning Oct 7, 2016


actually, in this case, i do mean docker the command, not Docker the company. i'll put code ticks around it.

The kubelet will also need to create the ```alpha.kubernetes.io/huge-page-size```
label for its node resource (if self registering).

pkg/aip/unversioned/well_known_labels.go

s/aip/api

memlocks them to prevent the system from moving or swapping them out.

There are several issues here:
- The the user running the java app must be a member of the gid set in the ```vm.huge_tlb_shm_group``` sysctl

s/The the/The

s/java/Java

@rhcarvalho

Database workloads, e.g., MongoDB [1], suggest that THP should be disabled on the host for optimal performance.

I see in the proposal a way to request huge pages, but I'm missing a way to specify that a pod should be scheduled on a node with THP disabled.

Is that something we could include as part of the design, or is that a separate matter?

[1] https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/

@sjenning

sjenning commented Oct 7, 2016

@rhcarvalho thanks for the review!

Regarding disabling THP, I am going to say that is outside the scope of this proposal. However, the cluster admin can do this with existing mechanisms. It could be understood that nodes with pre-allocated huge pages are also configured with THP disabled. Or you could disable THP on the nodes, add a node label thp=disabled, and have the MongoDB pods select nodes with that label.

@k8s-ci-robot

Jenkins GKE smoke e2e failed for commit 8caeda4. Full PR test history.

The magic incantation to run this job again is @k8s-bot gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot

Jenkins GCI GKE smoke e2e failed for commit 8caeda4. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@vishh

vishh commented Oct 8, 2016

@sjenning I'm prioritizing v1.5 PRs as of now. I will start reviewing this next week. Apologies for the delay.

@vishh vishh added this to the next-candidate milestone Oct 31, 2016
@k8s-github-robot

This PR hasn't been active in 30 days. It will be closed in 59 days (Jan 6, 2017).

cc @davidopp @derekwaynecarr @sjenning @vishh

You can add 'keep-open' label to prevent this from happening, or add a comment to keep it open another 90 days

@k8s-github-robot

Adding label:do-not-merge because PR changes docs prohibited to auto merge
See http://kubernetes.io/editdocs/ for information about editing docs

@k8s-github-robot k8s-github-robot added kind/design, kind/old-docs, and do-not-merge labels Dec 1, 2016
@sjenning

migrating to kubernetes/community#181

@sjenning sjenning closed this Dec 15, 2016
@sjenning sjenning deleted the hugepage-proposal branch August 16, 2017 02:17