Different containers scheduled on the same GPU device #46070
Closed
Description
What keywords did you search in Kubernetes issues before filing this one? (If you have found any duplicates, you should instead reply there.):
gpu
Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT (probably)
Kubernetes version (use kubectl version):
1.6.1
Environment:
- Cloud provider or hardware configuration: bare-metal
- OS (e.g. from /etc/os-release): Ubuntu
- Kernel (e.g. uname -a): 4.4
- Install tools: kubeadm
- Others:
What happened:
I submitted some GPU jobs and found that different containers are using the same GPU.
Below is the nvidia-smi pmon output: device 7 is occupied by processes 41470 and 45265, which belong to different containers.
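For reference, this is roughly how the overlap can be confirmed on the node. This is a minimal sketch, assuming Docker as the container runtime and shell access to the GPU node; the PIDs are the ones reported above:

# one sample of per-process GPU usage (shows device index and PID per process)
nvidia-smi pmon -c 1
# map each GPU-using PID back to its container via its cgroup path
cat /proc/41470/cgroup
cat /proc/45265/cgroup
# resolve the container ID taken from the cgroup path back to a container name
docker ps --no-trunc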
What you expected to happen:
As the documentation says (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/):
"Containers (and pods) do not share GPUs."
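For completeness, the GPU count the node advertises to the scheduler can be checked like this (a sketch; <node-name> is a placeholder for the affected node):

kubectl describe node <node-name> | grep -i nvidia-gpu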
How to reproduce it (as minimally and precisely as possible):
Find a GPU-enabled cluster and run multiple jobs based on this YAML (a sketch of launching several copies follows after the manifest):
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-mnist-example-nvidia1235fd6556
  labels:
    name: gpu-mnist-example-nvidia
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        name: gpu-mnist-example-nvidia
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia-gpu-type
                operator: In
                values:
                - "Tesla-K80"
              - key: nvidia-driver-version
                operator: In
                values:
                - "375.66"
      containers:
      - name: gpu-mnist-example
        image: alexwei/gpu-mnist:0.1.1
        args:
        - python3
        - mnist_deep.py
        - --data_dir=./MNIST_data
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: false
        resources:
          requests:
            alpha.kubernetes.io/nvidia-gpu: 1
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia
          name: lib
          readOnly: true
      volumes:
      - name: lib
        hostPath:
          path: /var/lib/nvidia-docker/volumes/nvidia_driver/latest
      restartPolicy: Never
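A minimal sketch of launching several copies of this Job (job names must be unique, so the name is rewritten per copy; job.yaml is an assumed local filename for the manifest above, and three copies is arbitrary):

# create three copies of the job with distinct names
for i in 1 2 3; do
  sed "s/gpu-mnist-example-nvidia1235fd6556/gpu-mnist-example-$i/" job.yaml | kubectl create -f -
done
# once the pods are running, check per-process GPU usage on the node
nvidia-smi pmon -c 1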
Anything else we need to know: