[IMPROVEMENT] Instance manager spdk_tgt resilience due to spdk_tgt crash #6155
Closed
Description
Is your improvement request related to a feature? Please describe (👍 if you like this request)
It is possible for spdk_tgt to crash, leaving the SPDK volume stuck in the attaching state:
Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54:
failed to create instance:
rpc error: code = Unknown desc = failed to start SPDK engine:
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create,
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}:
write unix @->/var/tmp/spdk.sock: write: broken pipe
This is hard to reproduce, and we have not yet found a reliable way to trigger it. The following steps may reproduce it:
- Create a 3-node k3s cluster, enable SPDK, and add a block-type disk to each node
- Create an SPDK volume from the Longhorn UI, then create a PV/PVC for this volume from the Longhorn UI, but leave the storage class name blank instead of filling in the correct one
- Create a workload that uses this volume; the volume will be stuck in the attaching state:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  containers:
  - name: sleep
    image: busybox
    imagePullPolicy: IfNotPresent
    args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
    volumeMounts:
    - name: pod-data
      mountPath: /data
  volumes:
  - name: pod-data
    persistentVolumeClaim:
      claimName: test-1
- Create the SPDK storage class:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/v2/storageclass.yaml
- Create a StatefulSet that uses an SPDK volume:
# statefulset.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: statefulset-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-v2-data-engine
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-statefulset
  namespace: default
spec:
  selector:
    matchLabels:
      app: test-statefulset
  serviceName: test-statefulset
  replicas: 1
  template:
    metadata:
      labels:
        app: test-statefulset
    spec:
      terminationGracePeriodSeconds: 10
      nodeSelector:
        node-role.kubernetes.io/control-plane: 'true'
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          args: ['/bin/sh', '-c', 'while true;do date;sleep 5; done']
          volumeMounts:
            - name: test-pod
              mountPath: /data
      volumes:
        - name: test-pod
          persistentVolumeClaim:
            claimName: statefulset-pvc
# kubectl apply -f statefulset.yaml
- The SPDK volume is created and attached, and the workload runs without problems; then write some data into the volume
- Delete replicas of this volume from the Longhorn UI
- Scale down the StatefulSet to 0 to detach the volume
- The volume is attached automatically and replica rebuilding is triggered
- After all replicas are rebuilt, the volume is detached automatically
- Scale up the StatefulSet to 1 to attach the volume again => the volume gets stuck in the attaching state forever with the error message:
Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54:
failed to create instance:
rpc error: code = Unknown desc = failed to start SPDK engine:
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create,
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}:
write unix @->/var/tmp/spdk.sock: write: broken pipe
Describe alternatives you've considered
Since the issue is hard to reproduce, we can first improve the resilience of the instance manager instead of struggling to reproduce it.
Additional context
Detailed support bundle and logs can be found in #6071