KEP-3327: Add CPUManager policy option to align CPUs by Socket instead of by NUMA node

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

Summary

Starting with Kubernetes 1.22, a new CPUManager flag has facilitated the use of CPUManager policy options (#2625), which enable users to customize the CPUManager's behavior based on workload requirements without having to introduce an entirely new policy. These policy options work together to ensure that an optimized CPU set is allocated for workloads running on a cluster. The two policy options that already exist are full-pcpus-only (#2625) and distribute-cpus-across-numa (#2902). This KEP introduces a new CPUManager policy option that treats all CPUs on a socket as aligned. As a result, the CPUManager sends a broader set of preferred hints to the TopologyManager, which increases the likelihood that the best hint is socket aligned with respect to CPUs and other devices managed by DeviceManager.

Motivation

With the evolution of CPU architectures, the number of NUMA nodes per socket has increased. The devices managed by DeviceManager may not be uniformly distributed across all NUMA nodes, so there can be scenarios where perfect alignment between devices and CPUs is not possible. For optimal performance, latency-sensitive applications want resources to be aligned at least within the same socket when NUMA alignment is not possible. By default, the CPUManager prefers CPU allocations which require the minimum number of NUMA nodes. However, if the NUMA nodes selected for allocation are spread across sockets, performance degrades. By ensuring that the selected NUMA nodes are socket aligned, predictable performance can be achieved. The best possible alignment of CPUs with other resources (namely those managed by DeviceManager) is crucial to guarantee predictable performance for latency-sensitive applications.

Goals

Ensure CPUs are aligned at socket boundary rather than NUMA node boundary.

Non-Goals

Guarantee optimal NUMA allocation for cpu distribution.

Proposal

We propose to add a new CPUManager policy option called align-by-socket to the static policy of CPUManager. With this policy option, the CPUManager will prefer hints in which all CPUs are within the same socket, in addition to the existing hints which require the minimum number of NUMA nodes. CPUs will be considered aligned at the socket boundary instead of the NUMA boundary during allocation. Thus, if the best hint consists of NUMA nodes within one socket, the CPUManager may try to assign available CPUs from all NUMA nodes of that socket.

Risks and Mitigations

The risks of adding this new feature are quite low. It is isolated to a specific policy option within the CPUManager, and is protected both by the option itself, as well as the CPUManagerPolicyAlphaOptions feature gate (which is disabled by default).

Risk	Impact	Mitigation
Bugs in the implementation lead to kubelet crash	High	Disable the policy option and restart the kubelet. The workload will run, but CPU allocations may spread across sockets in cases where they could have stayed within the same socket.

Design Details

Proposed Change

When align-by-socket is enabled as a policy option, the CPUManager's GetTopologyHints function will prefer hints which are socket aligned, in addition to hints which require the minimum number of NUMA nodes.

To achieve this, the following updates are needed to the generateCPUTopologyHints function of the CPUManager static policy:

func (p *staticPolicy) generateCPUTopologyHints(availableCPUs cpuset.CPUSet, reusableCPUs cpuset.CPUSet, request int) []topologymanager.TopologyHint {
	...

	// Loop back through all hints and update the 'Preferred' field based on
	// counting the number of bits set in the affinity mask and comparing it
	// to the minAffinitySize. Those with an equal number of bits set (and
	// with a minimal set of NUMA nodes) will be considered preferred.
	// If the align-by-socket policy option is enabled, socket aligned hints
	// are also considered preferred.
	for i := range hints {
		if p.options.AlignBySocket && isSocketAligned(hints[i].NUMANodeAffinity) {
			hints[i].Preferred = true
			continue
		}
		if hints[i].NUMANodeAffinity.Count() == minAffinitySize {
			hints[i].Preferred = true
		}
	}

	return hints
}

At the end, we will have a list of desired hints. These hints will then be passed to the topology manager whose job it is to select the best hint (with an increased likelihood of selecting a hint that has CPUs which are aligned by socket now).
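The effect of the extra check can be illustrated with a minimal, self-contained sketch. It does not use the kubelet types: the hint struct, the numaToSocket lookup table, and this version of isSocketAligned are simplified stand-ins assumed for illustration only, and the topology (two sockets with two NUMA nodes each) is made up.

```go
package main

import "fmt"

// hint is a simplified, hypothetical stand-in for topologymanager.TopologyHint.
type hint struct {
	NUMANodes []int
	Preferred bool
}

// isSocketAligned reports whether every NUMA node in the hint belongs to the
// same socket. numaToSocket is an assumed lookup table (NUMA node ID -> socket ID).
func isSocketAligned(numaNodes []int, numaToSocket map[int]int) bool {
	if len(numaNodes) == 0 {
		return false
	}
	first, ok := numaToSocket[numaNodes[0]]
	if !ok {
		return false
	}
	for _, n := range numaNodes[1:] {
		if s, ok := numaToSocket[n]; !ok || s != first {
			return false
		}
	}
	return true
}

// markPreferred mirrors the loop shown above on the simplified types.
func markPreferred(hints []hint, minAffinitySize int, alignBySocket bool, numaToSocket map[int]int) {
	for i := range hints {
		if alignBySocket && isSocketAligned(hints[i].NUMANodes, numaToSocket) {
			hints[i].Preferred = true
			continue
		}
		if len(hints[i].NUMANodes) == minAffinitySize {
			hints[i].Preferred = true
		}
	}
}

func main() {
	// Assumed topology: NUMA nodes 0 and 1 on socket 0, nodes 2 and 3 on socket 1.
	numaToSocket := map[int]int{0: 0, 1: 0, 2: 1, 3: 1}

	for _, alignBySocket := range []bool{false, true} {
		hints := []hint{
			{NUMANodes: []int{0}},
			{NUMANodes: []int{0, 1}},
			{NUMANodes: []int{0, 2}},
		}
		// The request is assumed to fit on a single NUMA node, so minAffinitySize is 1.
		markPreferred(hints, 1, alignBySocket, numaToSocket)
		fmt.Println("align-by-socket:", alignBySocket)
		for _, h := range hints {
			fmt.Println("  NUMA nodes", h.NUMANodes, "preferred:", h.Preferred)
		}
	}
}
```

In this sketch, the hint covering NUMA nodes 0 and 1 becomes preferred only when the option is enabled, because both nodes sit on socket 0. The hint covering nodes 0 and 2 crosses sockets and stays non-preferred in both runs.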

During CPU allocation, in the allocateCPUs() function, alignedCPUs will consist of CPUs which are socket aligned, instead of CPUs from the NUMA nodes in the numaAffinity hint, when the align-by-socket policy option is enabled.

func (p *staticPolicy) allocateCPUs(s state.State, numCPUs int, numaAffinity bitmask.BitMask, reusableCPUs cpuset.CPUSet) (cpuset.CPUSet, error) {
	...
	if numaAffinity != nil {
		alignedCPUs := cpuset.NewCPUSet()

		bits := numaAffinity.GetBits()
		// If the align-by-socket policy option is enabled, the NUMA based hint is
		// expanded to a socket aligned hint. This ensures that the available socket
		// aligned CPUs are allocated first, before we try to find CPUs across
		// sockets to satisfy the allocation request.
		if p.options.AlignBySocket {
			bits = p.topology.ExpandToFullSocketBits(bits)
		}
		for _, numaNodeID := range bits {
			alignedCPUs = alignedCPUs.Union(assignableCPUs.Intersection(p.topology.CPUDetails.CPUsInNUMANodes(numaNodeID)))
		}

		...
	}
	...
}

This will ensure that, for the purpose of allocation, CPUs are considered aligned at the socket boundary rather than the NUMA boundary.
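The expansion step can be sketched as follows. This is only an illustration of the idea behind ExpandToFullSocketBits, not its actual implementation: the function name expandToFullSocket and the numaToSocket table are assumptions, and plain integer slices are used in place of the kubelet's bitmask type.

```go
package main

import (
	"fmt"
	"sort"
)

// expandToFullSocket returns, in ascending order, every NUMA node that shares
// a socket with at least one of the given NUMA nodes. numaToSocket is an
// assumed lookup table (NUMA node ID -> socket ID).
func expandToFullSocket(numaNodes []int, numaToSocket map[int]int) []int {
	// Collect the sockets touched by the original hint.
	sockets := make(map[int]bool)
	for _, n := range numaNodes {
		if s, ok := numaToSocket[n]; ok {
			sockets[s] = true
		}
	}
	// Gather every NUMA node that lives on one of those sockets.
	var expanded []int
	for n, s := range numaToSocket {
		if sockets[s] {
			expanded = append(expanded, n)
		}
	}
	sort.Ints(expanded)
	return expanded
}

func main() {
	// Assumed topology: NUMA nodes 0 and 1 on socket 0, nodes 2 and 3 on socket 1.
	numaToSocket := map[int]int{0: 0, 1: 0, 2: 1, 3: 1}

	fmt.Println(expandToFullSocket([]int{1}, numaToSocket))    // [0 1]
	fmt.Println(expandToFullSocket([]int{1, 2}, numaToSocket)) // [0 1 2 3]
}
```

With a hint that names only NUMA node 1, the expanded set covers both NUMA nodes of socket 0, so CPUs from the whole socket are treated as aligned before the allocator looks at the other socket.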

The align-by-socket policy option works well for the general case where each socket contains one or more NUMA nodes. In rare cases such as DualNumaMultiSocketPerNumaHT, where one NUMA node can span multiple sockets, the option is not applicable. We will error out when align-by-socket is enabled and the underlying topology has multiple sockets per NUMA node. We may address such scenarios in the future if a real-world use case emerges.

When each socket contains one or more NUMA nodes and the TopologyManager single-numa-node policy is enabled, the align-by-socket policy option is redundant, since an allocation guaranteed to stay within a single NUMA node is by definition socket aligned. Hence, we will error out if align-by-socket is enabled along with the TopologyManager single-numa-node policy.
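The two error conditions described above could be checked along the following lines. This is a hedged sketch rather than the kubelet's validation code: the function validateAlignBySocket and its parameters are hypothetical, and comparing the socket count against the NUMA node count is a simplifying assumption for detecting a NUMA node that spans multiple sockets on a symmetric topology.

```go
package main

import (
	"errors"
	"fmt"
)

// singleNUMANodePolicy is the TopologyManager policy name that conflicts with align-by-socket.
const singleNUMANodePolicy = "single-numa-node"

// validateAlignBySocket is a hypothetical helper that rejects the two
// configurations in which align-by-socket is not applicable.
func validateAlignBySocket(enabled bool, numSockets, numNUMANodes int, topologyManagerPolicy string) error {
	if !enabled {
		return nil
	}
	// Assumption: on a symmetric topology, more sockets than NUMA nodes means
	// that at least one NUMA node spans multiple sockets.
	if numSockets > numNUMANodes {
		return errors.New("align-by-socket is not supported when a NUMA node spans multiple sockets")
	}
	// single-numa-node already guarantees socket alignment, so the option is redundant.
	if topologyManagerPolicy == singleNUMANodePolicy {
		return errors.New("align-by-socket cannot be combined with the single-numa-node TopologyManager policy")
	}
	return nil
}

func main() {
	fmt.Println(validateAlignBySocket(true, 2, 4, "best-effort"))      // accepted
	fmt.Println(validateAlignBySocket(true, 4, 2, "best-effort"))      // rejected: sockets per NUMA node > 1
	fmt.Println(validateAlignBySocket(true, 2, 4, "single-numa-node")) // rejected: redundant combination
}
```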

The align-by-socket policy option can work in conjunction with the TopologyManager best-effort and restricted policies without any conflict.

Test Plan

[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

Unit tests

k8s.io/kubernetes/pkg/kubelet/cm/cpumanager/policy_static.go: 06-13-2022 - 91.1

Integration tests

These cases will be added in the existing integration tests:
- Feature gate enable/disable tests
- align-by-socket policy option works as expected. When the policy option is enabled:
  - generateCPUTopologyHints prefers socket aligned hints in conjunction with hints with minimum NUMA nodes.
  - allocateCPUs allocates CPUs at the socket boundary.
- Verify no significant performance degradation

e2e tests

These cases will be added in the existing e2e tests:
- Feature gate enable/disable tests
- align-by-socket policy option works as expected.

Graduation Criteria

Alpha

Implement the new policy option.
Ensure proper unit tests are in place.
Ensure proper e2e node tests are in place.

Beta

Gather feedback from consumers of the new policy option.
Verify no major bugs reported in the previous cycle.

GA

Allow time for feedback (1 year).
Make sure all risks have been addressed.

Upgrade / Downgrade Strategy

We expect no impact. The new policy option is opt-in and orthogonal to the existing ones.

Version Skew Strategy

No changes needed

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?

Feature gate (also fill in values in kep.yaml)
- Feature gate name: CPUManagerPolicyAlphaOptions
- Components depending on the feature gate: kubelet
Change the kubelet configuration to set a CPUManager policy of static and a CPUManager policy option of align-by-socket
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? Yes -- a kubelet restart is required.

Does enabling the feature change any default behavior?

No. In order to trigger any of the new logic, three things have to be true:

- The CPUManagerPolicyAlphaOptions feature gate must be enabled
- The static CPUManager policy must be selected
- The new align-by-socket policy option must be selected
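The three conditions can be expressed as a single predicate. The sketch below is illustrative only: alignBySocketActive is a hypothetical helper, the configuration is modelled with plain maps and strings rather than the kubelet's configuration types, and parsing the option value with strconv.ParseBool is an assumption about how the option string is interpreted.

```go
package main

import (
	"fmt"
	"strconv"
)

// alignBySocketActive reports whether all three conditions listed above hold.
// The parameters are simplified stand-ins for the kubelet configuration.
func alignBySocketActive(featureGates map[string]bool, cpuManagerPolicy string, policyOptions map[string]string) bool {
	if !featureGates["CPUManagerPolicyAlphaOptions"] {
		return false
	}
	if cpuManagerPolicy != "static" {
		return false
	}
	// A missing or malformed option value is treated as "not selected".
	enabled, err := strconv.ParseBool(policyOptions["align-by-socket"])
	return err == nil && enabled
}

func main() {
	gates := map[string]bool{"CPUManagerPolicyAlphaOptions": true}
	opts := map[string]string{"align-by-socket": "true"}

	fmt.Println(alignBySocketActive(gates, "static", opts)) // true
	fmt.Println(alignBySocketActive(gates, "none", opts))   // false
	fmt.Println(alignBySocketActive(nil, "static", opts))   // false
}
```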

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, the feature can be disabled by any of the following:

- Disabling the CPUManagerPolicyAlphaOptions feature gate
- Switching the CPUManager policy to none
- Removing align-by-socket from the list of CPUManager policy options

Existing workloads will continue to run uninterrupted, with any future workloads having their CPUs allocated according to the policy in place after the rollback.

What happens if we reenable the feature if it was previously rolled back?

No changes. Existing containers will not see their allocation changed. New containers will.

Are there any tests for feature enablement/disablement?

Dedicated e2e tests will demonstrate that the default behaviour is preserved when the feature gate is disabled, or when the feature is not used (2 separate tests).

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

What specific metrics should inform a rollback?

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

Inspect the kubelet configuration of a node -- check for the presence of the feature gate and usage of the new policy option.

How can someone using this feature know that it is working for their instance?

In order to verify this feature is working, one should:

- Pick a node with at least 2 sockets and multiple NUMA nodes per socket.
- Ensure no other pods with exclusive CPUs are running on that node.
- Launch 2 pods, each with a nodeSelector for that node and a single container.
- In each container, run a sleep infinity command and request exclusive CPUs in the amount of (4*NUM_CPUS_PER_NUMA_NODE - 8).
- Verify that, for both pods, all CPUs are within the same socket instead of being distributed across sockets.

To verify the list of CPUs allocated to the container, one can either:

- exec into the container and run taskset -cp 1 (assuming this command is available in the container).
- Call the GetCPUS() method of the CPUProvider interface in the kubelet's podresources API.
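Once the CPU list has been obtained, checking it against the socket layout can be automated. The following is a minimal sketch under stated assumptions: parseCPUList and sameSocket are hypothetical helpers, the CPU list is assumed to be in the Linux list format (for example "0-3,8"), and cpuToSocket is an assumed lookup table that, on Linux, could be filled from /sys/devices/system/cpu/cpuN/topology/physical_package_id.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a Linux CPU list such as "0-3,8" into individual CPU IDs.
func parseCPUList(list string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(list), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("invalid CPU %q: %w", lo, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("invalid CPU %q: %w", hi, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("invalid CPU range %q", part)
		}
		for c := start; c <= end; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

// sameSocket reports whether all CPUs map to one socket according to the
// assumed cpuToSocket table (CPU ID -> socket ID).
func sameSocket(cpus []int, cpuToSocket map[int]int) bool {
	if len(cpus) == 0 {
		return false
	}
	first, ok := cpuToSocket[cpus[0]]
	if !ok {
		return false
	}
	for _, c := range cpus[1:] {
		if s, ok := cpuToSocket[c]; !ok || s != first {
			return false
		}
	}
	return true
}

func main() {
	// Assumed layout: CPUs 0-7 on socket 0, CPUs 8-15 on socket 1.
	cpuToSocket := make(map[int]int)
	for c := 0; c < 16; c++ {
		cpuToSocket[c] = c / 8
	}

	for _, list := range []string{"0-3,6", "6-9"} {
		cpus, err := parseCPUList(list)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(list, "same socket:", sameSocket(cpus, cpuToSocket))
	}
}
```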

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

There are no specific SLOs for this feature. Parallel workloads will benefit from this feature in application-specific ways.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

None

Are there any missing metrics that would be useful to have to improve observability of this feature?

None

Dependencies

Does this feature depend on any specific services running in the cluster?

This feature is Linux specific, and requires a version of CRI that includes the LinuxContainerResources.CpusetCpus field. This has been available since v1alpha2.

Scalability

Will enabling / using this feature result in any new API calls?

No

Will enabling / using this feature result in introducing new API types?

No

Will enabling / using this feature result in any new calls to the cloud provider?

No

Will enabling / using this feature result in increasing size or count of the existing API objects?

No

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

The algorithm required to implement this feature could delay:

- Pod admission time
- The time it takes to launch each container after pod admission

This delay should be minimal.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No, the algorithm will run on a single goroutine with minimal memory requirements.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

What are other known failure modes?

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

2022-06-02: Initial KEP created
2022-06-08: Addressed review comments on KEP

Drawbacks

When the align-by-socket policy option is enabled, it might result in CPU allocations which are not perfectly aligned by NUMA node with other resources managed by DeviceManager.

Alternatives

align-by-socket could alternatively be introduced as a TopologyManager policy which chooses socket aligned hints from those offered by the hint providers. However, it was concluded that the function of the TopologyManager is to act as an arbitrator for hint providers: it should make decisions based on the hints provided rather than influencing the best hint through its own policy.
