[IMPROVEMENT] Improve instance manager resilience to spdk_tgt crashes

## Is your improvement request related to a feature? Please describe (👍 if you like this request)

`spdk_tgt` can crash, after which an SPDK volume gets stuck in the `attaching` state:
```
Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54: 
failed to create instance: 
rpc error: code = Unknown desc = failed to start SPDK engine: 
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create, 
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}: 
write unix @->/var/tmp/spdk.sock: write: broken pipe
```

The crash is hard to reproduce, and we have not yet found a reliable way to trigger it. Possible reproduction steps:
1. Create a 3-node k3s cluster, enable SPDK, and add a block-type disk to each node.
2. Create an SPDK volume from the Longhorn UI, then create a PV/PVC for this volume from the Longhorn UI, leaving the storage class name blank instead of filling in the correct one.
3. Create a workload that uses this volume. The volume will be stuck in the `attaching` state:
```
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
```
4. Create the SPDK storage class:
```
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/v2/storageclass.yaml
```
5. Create a StatefulSet that uses an SPDK volume:
```
# statefulset.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: statefulset-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-v2-data-engine
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-statefulset
  namespace: default
spec:
  selector:
    matchLabels:
      app: test-statefulset
  serviceName: test-statefulset
  replicas: 1
  template:
    metadata:
      labels:
        app: test-statefulset
    spec:
      terminationGracePeriodSeconds: 10
      nodeSelector:
        node-role.kubernetes.io/control-plane: 'true'
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          args: ['/bin/sh', '-c', 'while true;do date;sleep 5; done']
          volumeMounts:
            - name: test-pod
              mountPath: /data
      volumes:
        - name: test-pod
          persistentVolumeClaim:
            claimName: statefulset-pvc

# kubectl apply -f statefulset.yaml
```
6. The SPDK volume is created and attached, and the workload runs without problems. Write some data to the volume.
7. Delete the replicas of this volume from the Longhorn UI.
8. Scale the StatefulSet down to 0 to detach the volume.
9. The volume is attached automatically and replica rebuilding is triggered.
10. After all replicas are rebuilt, the volume detaches automatically.
11. Scale the StatefulSet up to 1 to attach the volume again. The volume gets stuck in the `attaching` state indefinitely with this error message:
```
Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54: 
failed to create instance: 
rpc error: code = Unknown desc = failed to start SPDK engine: 
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create, 
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}: 
write unix @->/var/tmp/spdk.sock: write: broken pipe
```

## Describe alternatives you've considered

Since the crash is hard to reproduce, we can improve the resilience of the instance manager first instead of struggling to reproduce it.

## Additional context

Detailed support bundle and logs can be found in https://github.com/longhorn/longhorn/issues/6071

