Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node Topology Manager #693

Closed
10 of 11 tasks
lmdaly opened this issue Jan 17, 2019 · 190 comments · Fixed by kubernetes/kubernetes#115137, kubernetes/kubernetes#115590 or #3988
Closed
10 of 11 tasks

Node Topology Manager #693

lmdaly opened this issue Jan 17, 2019 · 190 comments · Fixed by kubernetes/kubernetes#115137, kubernetes/kubernetes#115590 or #3988
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status
Milestone

Comments

@lmdaly
Copy link
Contributor

lmdaly commented Jan 17, 2019

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jan 17, 2019
@lmdaly
Copy link
Contributor Author

lmdaly commented Jan 17, 2019

/sig node
/kind feature
cc @ConnorDoyle @balajismaniam @nolancon

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2019
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 17, 2019
@vishh
Copy link
Contributor

vishh commented Feb 4, 2019

I can help inform this design based on learning from Borg. So count me in as a reviewer/approver.

@jeremyeder
Copy link

I can help inform this design based on learning from Borg. So count me in as a reviewer/approver.

Is there any public documentation on how this feature works in borg?

@vishh
Copy link
Contributor

vishh commented Feb 11, 2019 via email

@spiffxp spiffxp added the tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team label Feb 24, 2019
@spiffxp
Copy link
Member

spiffxp commented Feb 24, 2019

FYI @claurence

This tracking issue and KEP (#781) did not make it in time for the v1.14 enhancements freeze nor the extended deadline. I appreciate that you opened these before the deadlines, but they didn't seem to get sufficient review or sign off. This will need to go through the exception process.

Until we decide whether this is worth the exception, I'm inclined to put a hold on all PR's associated with this enhancement.

ref: kubernetes/kubernetes#72828

@jiayingz
Copy link

/cc @jiayingz @dchen1107

@claurence
Copy link

@lmdaly I see y'all are have 1.14 listed in the description as the alpha milestone - since there wasn't a merged implementable KEP this issue is not being tracked for 1.14 - if there are intentions for it to be included in that release please submit an exception request.

@lmdaly
Copy link
Contributor Author

lmdaly commented Mar 5, 2019

@lmdaly I see y'all are have 1.14 listed in the description as the alpha milestone - since there wasn't a merged implementable KEP this issue is not being tracked for 1.14 - if there are intentions for it to be included in that release please submit an exception request.

@claurence the KEP is now merged (KEP had been previously merged in the community repo. this was just to move it to the new enhancements repo as per the new guidelines), do we still need to submit an exception request to get this issue tracked for 1.14?

@resouer
Copy link

resouer commented Mar 11, 2019

While after reading the design & WIP PRs througoutly, I have concerns that the current implementation is not generic as the original topology design we proposed in #781. This one currently reads more like NUMA topology in node level.

I left some comments for further discussion here: kubernetes/kubernetes#74345 (comment)

@k82cn
Copy link
Member

k82cn commented Mar 20, 2019

the current implementation is not generic

Share the same concern about on that :) How about others, e.g. links between device (nvlinke for GPU)?

@lmdaly
Copy link
Contributor Author

lmdaly commented Mar 25, 2019

@resouer @k82cn The initial proposal deals only with aligning the decisions made by cpu manager and device manager to ensure proximity of devices with the cpu the container runs on. Satisfying the inter-device affinity was a non-goal of the proposal.

If however, the current implementation is blocking the addition of inter-device affinity in the future then I am happy to change the implementation once I get an understanding of how it is doing so,

@klueska
Copy link
Contributor

klueska commented Apr 5, 2019

I think the main issue I see with the current implementation and the ability to support inter-device affinity is the following:

To support inter-device affinity you normally need to first figure out which devices you would like to allocate to a container before deciding what socket affinity you would like the container to have.

For example, with Nvidia GPUs, for optimal connectivity, you first need to find and allocate the set of GPUs with the most connected NVLINKs before determining what socket affinity that set has.

From what I can tell in the current proposal, the assumption is that these operations happen in reverse order, i.e. the socket affinity is decided before doing the allocation of devices.

@ConnorDoyle
Copy link
Contributor

That’s not necessarily true @klueska. If the topology hints were extended to encode point-to-point device topology, the Device Manager could consider that when reporting socket affinity. In other words, cross device topology wouldn’t need to leak out of the scope of the device manager. Does that seem feasible?

@klueska
Copy link
Contributor

klueska commented Apr 5, 2019

Maybe I'm confused about the flow somehow. This is how I understand it:

  1. At initialization, device plugins (not the devicemanager) register themselves with the topologymanager so it can issue callbacks on it at a later time.

  2. When a pod is submitted the kubelet calls the lifecycle.PodAdmitHandler on the topologymanager.

  3. The lifecycle.PodAdmitHandler calls GetTopologyHints on each registered device plugin

  4. It then merges these hints to produce a consolidated TopologyHint associated with the pod

  5. If it decided to admit the pod, it returns successfully from lifecycle.PodAdmitHandler storing the consolidated TopologyHint for the pod in a local state store

  6. At some point in the future, the cpumanager and the devicemanager call GetAffinity(pod) on the topology manager to retrieve the TopologyHint associated with the pod

  7. The cpumanager uses this TopologyHint` to allocate a CPU

  8. The devicemanager uses this TopologyHint` to allocate a set of devices

  9. Initialization of the pod continues...

If this is correct, I guess I'm struggling with what happens between the point in time when the device plugin reports its TopologyHints and the time when the devicemanager does the actual allocation.

If these hints are meant to encode "preferences" for allocation, then I think what you are saying is to have a structure more like:

type TopologyHints struct {
    hints []struct {
        SocketID int
        DeviceIDs []int
    }
}

Where we not only pass a list of socket affinity preferences, but how those socket affinity preferences pair with allocatable GPU preferences.

If this is the direction you are thinking, then I think we could make it work, but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

Is there something in place that allows this already that I missed?

@eloyekunle
Copy link

eloyekunle commented Apr 6, 2019

@klueska

I think what happens, making some minor corrections to your flow is:

  1. At initialization, device plugins register themselves with the devicemanager so it can issue callbacks on it at a later time.

  2. The lifecycle.PodAdmitHandler calls GetTopologyHints on each topology-aware component in the Kubelet, currently devicemanager and cpumanager.

In this case, what will be represented as topology-aware in the Kubelet are the cpumanager and the devicemanager. The topology manager is only intended to coordinate allocations between topology-aware components.

For this:

but we would need to somehow coordinate between the cpumanager and the devicemanager to make sure they "accepted" the same hint when making their allocations.

This is what the topologymanager itself was introduced to achieve. From one of the earlier drafts,

These components should coordinate in order to avoid cross NUMA assignments. The problems related to this coordination are tricky; cross domain requests such as “An exclusive core on the same NUMA node as the assigned NIC” involves both CNI and the CPU manager. If the CPU manager picks first, it may select a core on a NUMA node without an available NIC and vice-versa.

@klueska
Copy link
Contributor

klueska commented Apr 9, 2019

I see.

So the devicemanager and cpumanager both implement GetTopologyHints() as well as call GetAffinity(), avoiding direction interaction from the topologymanager with any underlying device plugins. Looking more closely at the code, I see that the devicemanager simply delegates control to the plugins to help fill in TopologyHints, which makes more sense in the end anyway.

Circling back to the original question / issue I raised though....

From Nvidia's perspective, I think we can make everything work with this proposed flow, assuming more information is added to the TopologyHints struct (and consequently the device plugin interface) to report point-to-point link information in the future.

However, I think starting with a SocketMask as the primary data structure for advertising socket affinity may limit our ability to expand TopologyHints with point-to-point information in the future without breaking the existing interface. The primary reason being that (at least in the case of Nvidia GPUs) the preferred socket depends on which GPUs are actually going to be allocated in the end.

For example, consider the figure below, when attempting to allocate 2 GPUs to a pod with optimal connectivity:

Bildschirmfoto 2019-04-09 um 15 51 37

The GPU combinations of (2, 3) and (6, 7) both have 2 NVLINKs and reside on the same PCIe bus. They should therefore be considered equal candidates when attempting to allocate 2 GPUs to a pod. Depending on which combination is chosen, however, a different socket will obviously be preferred as (2, 3) is connected to socket 0 and (6, 7) is connected to socket 1.

This information will somehow need to be encoded in the TopologyHints struct so that the devicemanager can perform one of these desired allocations in the end (i.e. whichever one the topologymanager consolidates the hints down to). Likewise, the dependency between the preferred device allocations and the preferred socket will need to be encoded in TopologyHints so that the cpumanager can allocate CPUs from the correct socket.

A potential solution specific to Nvidia GPUs for this example would look something like:

type TopologyHint struct {
    SocketID int
    DeviceIDs []int
}

type TopologyHints []TopologyHint

devicemanagerhints := &TopologyHints{
    {SocketID: 0, DeviceIDs: []int{2, 3}},
    {SocketID: 1, DeviceIDs: []int{6, 7}},
}

cpumanagerhints := &TopologyHints{
    {SocketID: 1},
}

Where the topologymanager would consolidate these hints to return {SocketID: 1, DeviceIDs: []int{6, 7}} as the preferred hint when the devicemanager and cpumanager later call GetAffinity().

While this may or may not provide a generic enough solution for all accelerators, replacing SocketMask in the TopologyHints struct with something structured more like the following would allow us to expand each individual hint with more fields in the future:

Note that GetTopologyHints() still return TopologyHints, while GetAffinity()has been modified to return a single TopologyHint rather than TopologyHints.

type TopologyHint struct {
    SocketID int
}

type TopologyHints []TopologyHint

&TopologyHints{
    {SocketID: 0},
    {SocketID: 1},
}

type HintProvider interface {
	GetTopologyHints(pod v1.Pod, container v1.Container) TopologyHints
}

type Store interface {
	GetAffinity(podUID string, containerName string) TopologyHint
}

Thoughts?

@Atharva-Shinde
Copy link
Contributor

Hey again @swatisehgal @klueska
Please try to get the KEP PR #3745 (addressing the changes mentioned above), merged before tomorrow's Enhancement Freeze :)
The status of this enhancement is still marked as at risk

@swatisehgal
Copy link
Contributor

Hey again @swatisehgal @klueska Please try to get the KEP PR #3745 (addressing the changes mentioned above), merged before tomorrow's Enhancement Freeze :) The status of this enhancement is still marked as at risk

Thanks for the reminder, I am waiting for another round of PRR review.

@swatisehgal
Copy link
Contributor

I have also pinged SIG node approvers on slack. Let's see if we can manage to get this in this release.

@johnbelamaric
Copy link
Member

@swatisehgal ok, just commented - please take your reply and put it in the KEP and we're good to go for PRR

@SergeyKanzhelev
Copy link
Member

@Atharva-Shinde this KEP should satisfy all requirements now. Ready to be marked as Tracked

@Atharva-Shinde
Copy link
Contributor

Awesome!
With all the KEP requirements in place and merged into k/enhancements, this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is marked as tracked. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@swatisehgal
Copy link
Contributor

swatisehgal commented Feb 15, 2023

/reopen

Some PRs are still pending and need to be merged for this work to be marked as complete.

@k8s-ci-robot k8s-ci-robot reopened this Feb 15, 2023
@k8s-ci-robot
Copy link
Contributor

@swatisehgal: Reopened this issue.

In response to this:

/reopen

Some PRs are still pending for this work to be marked as complete.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mickeyboxell
Copy link

Hi @klueska @khenidak @lmdaly 👋, I’m reaching out from the 1.27 Release Docs team. This enhancement is marked as ‘Needs Docs’ for the 1.27 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.27 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by March 16. For more information, please take a look at Documenting for a release to familiarize yourself with the documentation requirements for the release.

Please feel free to reach out with any questions. Thanks!

@klueska
Copy link
Contributor

klueska commented Mar 7, 2023

@swatisehgal ^^

@Atharva-Shinde
Copy link
Contributor

Hey again @swatisehgal @klueska 👋 Enhancements team here,
Just checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023.

Here's where this enhancement currently stands:

Also please let me know if there are other PRs in k/k we should be tracking for this KEP.
As always, we are here to help if any questions come up. Thanks!

@swatisehgal
Copy link
Contributor

Please follow the steps detailed in the documentation to open a PR against dev-1.27 branch in the k/website repo. This PR can be just a placeholder at this time, and must be created by March 16. For more information, please take a look at Documenting for a release to familiarize yourself with the documentation requirements for the release.

Please feel free to reach out with any questions. Thanks!

@mickeyboxell I have created docs PR and linked it to the comment tracking GA graduation work here.

@marosset marosset removed the tracked/no Denotes an enhancement issue is NOT actively being tracked by the Release Team label Mar 20, 2023
@marosset
Copy link
Contributor

/stage stable

@k8s-ci-robot k8s-ci-robot added stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status and removed stage/beta Denotes an issue tracking an enhancement targeted for Beta status labels Mar 20, 2023
@SergeyKanzhelev
Copy link
Member

NEXT: remove the feature gate in 1.29 and mark it as "implemented" in kep.yaml.

@salehsedghpour
Copy link

/remove-label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label Jan 6, 2024
@KunWuLuan
Copy link

Hi, I am new to use kubelet topology-manager with single-numa-node to manage my workloads. I have some questions, why we make this option as a node-level setting? Is there any problem if the option is a pod-level setting?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. sig/node Categorizes an issue or PR as relevant to SIG Node. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status
Projects
Status: Tracked