
topology-aware: internal error from changing containers' NUMA nodes by adjusting AvailableResources #92

Open
@askervin

Description

Assume that a container runs on CPUs of NUMA node 0.

An admin wants to reorganize server resources so that containers no longer use CPUs on NUMA node/die/socket 0, and does so by removing those CPUs from AvailableResources.

When this is done, restarting the topology-aware NRI plugin with the new configuration fails with an internal error:

E0710 07:30:57.289447       1 nri.go:784] <= Synchronize FAILED: failed to start policy topology-aware: topology-aware: failed to start:
topology-aware: failed to restore allocations from cache:
topology-aware: failed to allocate <CPU request pod0/pod0c0: exclusive: 3><Memory request: limit:95.37M, req:95.37M> from <NUMA node #1 allocatable: MemLimit: DRAM 1.85G>:
topology-aware: internal error: NUMA node #1: can't slice 3 exclusive CPUs from , 0m available
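For illustration only, here is a minimal Go sketch of why the restore fails; it is not the plugin's actual code, and sliceExclusive is a hypothetical stand-in for the policy's exclusive-CPU allocation step. The cached allocation still asks for 3 exclusive CPUs, but after the AvailableResources change the pool it maps to has an empty available CPU set, so there is nothing left to slice the request from.

package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

// sliceExclusive is a hypothetical stand-in for the exclusive CPU
// allocation step: it tries to carve 'count' CPUs out of the pool's
// currently available set.
func sliceExclusive(available cpuset.CPUSet, count int) (cpuset.CPUSet, error) {
	if available.Size() < count {
		return cpuset.New(), fmt.Errorf("can't slice %d exclusive CPUs from %q, %d available",
			count, available.String(), available.Size())
	}
	return cpuset.New(available.List()[:count]...), nil
}

func main() {
	// After the AvailableResources change, the pool the cached allocation
	// maps to has no CPUs left, so restoring the 3-CPU exclusive request
	// cannot succeed.
	available := cpuset.New()
	if _, err := sliceExclusive(available, 3); err != nil {
		fmt.Println("restore failed:", err)
	}
}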

Let's discuss whether this is a bug or expected behavior, or whether we should provide a configuration option for forcing new CPU/memory pinning even if it leads to costly memory accesses/moves.

The current workaround for this error is to delete the cache, thereby forcing reassignment of resources from scratch. Both this workaround and draining the node before changing AvailableResources are heavier operations than forcing new pinning would be.
