
[IMPROVEMENT] Instance manager spdk_tgt resilience due to spdk_tgt crash #6155

Closed
@yangchiu

Description

Is your improvement request related to a feature? Please describe (👍 if you like this request)

spdk_tgt can crash, which leaves an SPDK volume stuck in the attaching state:

Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54: 
failed to create instance: 
rpc error: code = Unknown desc = failed to start SPDK engine: 
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create, 
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}: 
write unix @->/var/tmp/spdk.sock: write: broken pipe
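The message above is several errors wrapped inside each other, so the actual cause (the failed socket write) is easy to miss. Below is a minimal sketch for splitting such a message into its layers. The helper names `unwrap_error_chain` and `root_cause` are hypothetical and not part of Longhorn, and the sketch assumes every wrapping layer uses the exact `rpc error: code = Unknown desc = ` prefix seen in this log.

```python
# Hypothetical helpers for reading the nested error above; not Longhorn code.
# Assumption: each wrapping layer is introduced by this exact prefix.
WRAP_MARKER = "rpc error: code = Unknown desc = "


def unwrap_error_chain(message: str) -> list[str]:
    """Split a nested error string into its layers, outermost first."""
    # Collapse the line breaks used when the message is pasted into an issue.
    flat = " ".join(message.split())
    layers = [part.strip().rstrip(":").strip() for part in flat.split(WRAP_MARKER)]
    return [layer for layer in layers if layer]


def root_cause(message: str) -> str:
    """Return the text after the last ': ' of the innermost layer."""
    layers = unwrap_error_chain(message)
    if not layers:
        return ""
    return layers[-1].rsplit(": ", 1)[-1]


if __name__ == "__main__":
    sample = (
        "failed to create instance: "
        "rpc error: code = Unknown desc = failed to start SPDK engine: "
        "rpc error: code = Unknown desc = error sending message, id 70374, "
        "method bdev_raid_create: "
        "write unix @->/var/tmp/spdk.sock: write: broken pipe"
    )
    for depth, layer in enumerate(unwrap_error_chain(sample)):
        print("  " * depth + layer)
    print("root cause:", root_cause(sample))
```

For this log the innermost layer ends in `broken pipe`, meaning the write to `/var/tmp/spdk.sock` failed because nothing was reading on the other end of the socket.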

This is hard to reproduce, and we have not yet found a reliable way to trigger it. The possible reproduction steps are:

  1. Create a 3-node k3s cluster, enable spdk, and add a block-type disk to each node
  2. Create an spdk volume from the Longhorn UI, then create a PV/PVC for this volume from the Longhorn UI, but leave the storage class name blank instead of filling in the right one
  3. Create a workload that uses this volume. The volume will be stuck in the attaching state:
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
spec:
  containers:
    - name: sleep
      image: busybox
      imagePullPolicy: IfNotPresent
      args: ["/bin/sh", "-c", "while true;do date;sleep 5; done"]
      volumeMounts:
        - name: pod-data
          mountPath: /data
  volumes:
    - name: pod-data
      persistentVolumeClaim:
        claimName: test-1
  4. Create the spdk storageclass:
kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/master/examples/v2/storageclass.yaml
  5. Create a statefulset that uses an spdk volume:
# statefulset.yaml 
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: statefulset-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-v2-data-engine
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: test-statefulset
  namespace: default
spec:
  selector:
    matchLabels:
      app: test-statefulset
  serviceName: test-statefulset
  replicas: 1
  template:
    metadata:
      labels:
        app: test-statefulset
    spec:
      terminationGracePeriodSeconds: 10
      nodeSelector:
        node-role.kubernetes.io/control-plane: 'true'
      containers:
        - image: busybox
          imagePullPolicy: IfNotPresent
          name: sleep
          args: ['/bin/sh', '-c', 'while true;do date;sleep 5; done']
          volumeMounts:
            - name: test-pod
              mountPath: /data
      volumes:
        - name: test-pod
          persistentVolumeClaim:
            claimName: statefulset-pvc

# kubectl apply -f statefulset.yaml
  6. The spdk volume is created and attached, and the workload runs without problems. Write some data into the volume
  7. Delete replicas of this volume from the Longhorn UI
  8. Scale down the statefulset to 0 to detach the volume
  9. The volume is attached automatically and replica rebuilding is triggered
  10. After all replicas are rebuilt, the volume is detached automatically
  11. Scale up the statefulset to 1 to attach the volume again => The volume gets stuck in the attaching state forever with the error message:
Error starting pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54: 
failed to create instance: 
rpc error: code = Unknown desc = failed to start SPDK engine: 
rpc error: code = Unknown desc = error sending message, id 70374, method bdev_raid_create, 
params {pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-e-7b4b0d54 1 0 [pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-8eb29e9bn1 disk-1/pvc-1deb58b5-c8b8-4cf2-bb10-02c83877abd6-r-b7cafc48]}: 
write unix @->/var/tmp/spdk.sock: write: broken pipe
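When the volume is stuck like this, one quick check is whether anything is still listening on the socket named in the error. The sketch below only tries to connect to the Unix socket and reports whether that succeeds. The function name `spdk_socket_is_listening` is hypothetical, the default path is taken from the error message, and a successful connect does not prove that spdk_tgt is healthy, only that some process accepts connections there.

```python
import socket


def spdk_socket_is_listening(path: str = "/var/tmp/spdk.sock",
                             timeout: float = 1.0) -> bool:
    """Return True if a process accepts connections on the Unix socket at path."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(path)
        except OSError:
            # Covers a missing socket file, a refused connection (stale file
            # left behind by a crashed process), and a connect timeout.
            return False
    return True


if __name__ == "__main__":
    print("listening" if spdk_socket_is_listening() else "not listening")
```

This would have to run where the socket file is visible, which is an assumption about the environment rather than something stated in this report.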

Describe alternatives you've considered

Since the crash is hard to reproduce, we can improve the resilience of the instance manager first instead of struggling to reproduce it.
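One possible shape for that resilience is to treat a broken pipe as "the peer process is gone", re-establish the connection, and retry the call a bounded number of times. The sketch below illustrates only that idea and is not how the instance manager is implemented. `call_with_reconnect`, `send`, and `reconnect` are hypothetical names, and whether a call such as `bdev_raid_create` can be replayed safely after spdk_tgt restarts is not addressed here.

```python
import errno
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# errno values suggesting the peer process is gone, as opposed to a bad request.
PEER_GONE = {errno.EPIPE, errno.ECONNRESET, errno.ECONNREFUSED, errno.ENOENT}


def call_with_reconnect(send: Callable[[], T],
                        reconnect: Callable[[], None],
                        attempts: int = 3,
                        delay: float = 0.5) -> T:
    """Run send(); on a peer-gone error, reconnect and retry up to attempts times.

    send and reconnect are caller-supplied callables (hypothetical here):
    send performs one RPC, reconnect re-dials the Unix socket.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            if attempt > 1:
                reconnect()  # may itself fail while the peer is still down
            return send()
        except OSError as exc:
            # Re-raise errors that are not about a dead peer, and give up
            # after the final attempt.
            if exc.errno not in PEER_GONE or attempt == attempts:
                raise
            time.sleep(delay)
    raise AssertionError("unreachable")
```

Bounding the attempts matters: if spdk_tgt never comes back, the caller should surface the error rather than leave the volume in the attaching state indefinitely.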

Additional context

A detailed support bundle and logs can be found in #6071.

Metadata

Labels

  • area/resilience: System or volume resilience
  • area/v2-data-engine: v2 data engine (SPDK)
  • kind/improvement: Request for improvement of existing function
  • priority/0: Must be implemented or fixed in this release (managed by PO)
