schedule different containers on the same gpu device #46070

Closed
@WIZARD-CXY

Description

What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
gpu

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG maybe

Kubernetes version (use kubectl version):
1.6.1

Environment:

  • Cloud provider or hardware configuration: bare-metal
  • OS (e.g. from /etc/os-release): ubuntu
  • Kernel (e.g. uname -a): 4.4
  • Install tools: kubeadm
  • Others:

What happened:
I submitted some GPU jobs and found that different containers are using the same GPU.
The attached nvidia-smi pmon screenshot (2017-05-19) shows device 7 occupied by processes 41470 and 45265, which belong to different containers; a sketch of the commands used to check this is given below.
(screenshot of nvidia-smi pmon output)
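
A rough sketch of how the sharing can be observed on the node (the PIDs come from the report above; mapping a PID to its container through /proc cgroups assumes a Docker-based node and is just one way to do it):

# sample per-process GPU usage once; the "gpu" column is the device index
nvidia-smi pmon -c 1

# map each PID back to its container id through the cgroup hierarchy
grep -o 'docker/[0-9a-f]*' /proc/41470/cgroup
grep -o 'docker/[0-9a-f]*' /proc/45265/cgroup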

What you expected to happen:
As the docs say (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/):
"Containers (and pods) do not share GPUs."

How to reproduce it (as minimally and precisely as possible):
Find a GPU-enabled cluster and run multiple Jobs based on this YAML (a submission sketch follows the manifest):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-mnist-example-nvidia1235fd6556
  labels:
    name: gpu-mnist-example-nvidia
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        name: gpu-mnist-example-nvidia
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia-gpu-type
                operator: In
                values:
                - "Tesla-K80"
              - key: nvidia-driver-version
                operator: In
                values:
                - "375.66"
      containers:
      - name: gpu-mnist-example
        image: alexwei/gpu-mnist:0.1.1
        args:
        - python3
        - mnist_deep.py
        - --data_dir=./MNIST_data
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: false
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: lib
          readOnly: true
      volumes:
      - name: lib
        hostPath:
          path: /var/lib/nvidia-docker/volumes/nvidia_driver/latest
      restartPolicy: Never
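
One way to submit several copies of the Job (the file name gpu-job.yaml and the loop are illustrative; each Job needs a unique metadata.name, hence the sed substitution):

# submit four copies of the Job, each with a unique name
for i in 1 2 3 4; do
  sed "s/gpu-mnist-example-nvidia1235fd6556/gpu-mnist-example-$i/" gpu-job.yaml | kubectl create -f -
done

Once the pods are running, run nvidia-smi pmon on the node and check whether PIDs from different containers share a device index.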

Anything else we need to know:

Labels

kind/bug: Categorizes issue or PR as related to a bug.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
