[BUG] User created snapshot deleted after node drain and uncordon #5992

Closed
@futuretea

Description

Describe the bug (🐛 if you encounter this issue)

The user-created snapshot is deleted after node drain and uncordon.

To Reproduce

Steps to reproduce the behavior:

  1. Prepare a 3-node Harvester cluster
$ kubectl get no
NAME           STATUS   ROLES                              AGE   VERSION
harv-yh-1-01   Ready    control-plane,etcd,master,worker   18h   v1.24.11+rke2r1
harv-yh-1-02   Ready    control-plane,etcd,master,worker   15h   v1.24.11+rke2r1
harv-yh-1-03   Ready    control-plane,etcd,master,worker   12h   v1.24.11+rke2r1
  2. Create a VM demo
  3. Wait for the VM's state to change to Running
  4. Take a VM Snapshot demo-snapshot
  5. Wait for the VM snapshot's state to change to Ready
  6. Check Longhorn's snapshot CRs; there is a Longhorn snapshot with userCreated: true in the longhorn-system namespace (a jq filter for this is sketched after the output below)
The output of get snapshots.longhorn.io:
$ kubectl -n longhorn-system get snapshots.longhorn.io -oyaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Snapshot
  metadata:
    creationTimestamp: "2023-05-24T03:33:06Z"
    finalizers:
    - longhorn.io
    generation: 1
    labels:
      longhornvolume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
    name: snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c
    namespace: longhorn-system
    ownerReferences:
    - apiVersion: longhorn.io/v1beta2
      kind: Volume
      name: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
      uid: fd3f6f2a-c674-4619-a72c-40f08a79af6a
    resourceVersion: "956629"
    uid: 78ef3f38-6aac-411c-b512-84106e689dd2
  spec:
    createSnapshot: false
    labels: null
    volume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
  status:
    checksum: ""
    children:
      volume-head: true
    creationTime: "2023-05-24T03:33:02Z"
    labels:
      type: snap
    markRemoved: false
    ownerID: ""
    parent: ""
    readyToUse: true
    restoreSize: 10737418240
    size: 544698368
    userCreated: true
kind: List
metadata:
  resourceVersion: ""

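As a convenience check at this point, the user-created snapshot can be picked out of the snapshot CR list with jq; a minimal sketch, not part of the original reproduction steps:
$ kubectl -n longhorn-system get snapshots.longhorn.io -o json \
    | jq -r '.items[] | select(.status.userCreated == true) | .metadata.name'
Before the drain this should print only snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c.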

  7. Drain the node on which the VM is scheduled
$ kubectl -n default get vmi demo -ojson | jq -r '.status.nodeName'
harv-yh-1-03

$ kubectl drain harv-yh-1-03 --ignore-daemonsets --delete-emptydir-data

After the drain, the volume status:
(screenshot)
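
The same state can also be read from the Volume CR instead of the UI; a small sketch, assuming the volume name seen in the snapshot output above:
$ kubectl -n longhorn-system get volumes.longhorn.io \
    pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61 \
    -o jsonpath='{.status.state}{"\t"}{.status.robustness}{"\n"}'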

  8. Uncordon the node after the drain finishes
$ kubectl uncordon harv-yh-1-03
  9. Wait for the volume's state to change to Healthy again (a kubectl wait sketch follows below)
(screenshot)
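
As a CLI alternative to watching the UI, kubectl wait can block until the Volume CR reports healthy again; a minimal sketch reusing the volume name from above (the timeout is arbitrary):
$ kubectl -n longhorn-system wait volumes.longhorn.io/pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61 \
    --for=jsonpath='{.status.robustness}'=healthy --timeout=10m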

  10. Check the output of get snapshots.longhorn.io again (see the jq re-check sketched after the output below)

$ kubectl -n longhorn-system get snapshots.longhorn.io
NAME                                   VOLUME                                     CREATIONTIME           READYTOUSE   RESTORESIZE   SIZE        AGE
15463930-c07c-47e0-b9f9-23ddb3ed10c2   pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61   2023-05-24T03:39:47Z   false        10737418240   545931264   95s

We can see that the user-created snapshot was deleted and only a system-created snapshot remains in its place.

The output of get snapshots.longhorn.io:
$ kubectl -n longhorn-system get snapshots.longhorn.io -oyaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Snapshot
  metadata:
    creationTimestamp: "2023-05-24T03:39:52Z"
    finalizers:
    - longhorn.io
    generation: 1
    labels:
      longhornvolume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
    name: 15463930-c07c-47e0-b9f9-23ddb3ed10c2
    namespace: longhorn-system
    ownerReferences:
    - apiVersion: longhorn.io/v1beta2
      kind: Volume
      name: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
      uid: fd3f6f2a-c674-4619-a72c-40f08a79af6a
    resourceVersion: "963356"
    uid: 0bdf1236-97c7-40f3-9584-4007e89c1692
  spec:
    createSnapshot: false
    labels: null
    volume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
  status:
    checksum: ""
    children:
      volume-head: true
    creationTime: "2023-05-24T03:39:47Z"
    labels: {}
    markRemoved: true
    ownerID: ""
    parent: ""
    readyToUse: false
    restoreSize: 10737418240
    size: 545931264
    userCreated: false
kind: List
metadata:
  resourceVersion: ""
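
Re-running the jq filter from step 6 at this point should return nothing, which is a quick confirmation that no user-created snapshot is left; only the system-created snapshot with markRemoved: true remains:
$ kubectl -n longhorn-system get snapshots.longhorn.io -o json \
    | jq -r '.items[] | select(.status.userCreated == true) | .metadata.name'
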
  11. Restore the VM snapshot demo-snapshot to create a new VM demo-restore
  12. The restored PVC is stuck in Pending and the restored VM is stuck in Restoring
(screenshot)
Events:
  Type     Reason                Age                From                                                                                      Message
  ----     ------                ----               ----                                                                                      -------
  Warning  ProvisioningFailed    89s (x2 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61', code=Server Error] from [http://longhorn-backend:9500/v1/volumes]
  Warning  ProvisioningFailed    64s (x2 over 72s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61', code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Normal   Provisioning          40s (x8 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  External provisioner is provisioning volume for claim "default/restore-demo-snapshot-793b83f2-3557-4249-9bcb-ca44f46b81a9-rootdisk"
  Warning  ProvisioningFailed    40s (x4 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61'] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  5s (x8 over 89s)   persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator

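The provisioning errors show the restore still references the deleted Longhorn snapshot by name. One way to see where that name comes from is the CSI VolumeSnapshotContent backing the VM snapshot; a rough sketch, assuming the standard snapshot.storage.k8s.io/v1 resources that the restore path uses:
$ kubectl get volumesnapshotcontents \
    -o custom-columns=NAME:.metadata.name,SNAPSHOT:.spec.volumeSnapshotRef.name,HANDLE:.status.snapshotHandle
The HANDLE value should reference the missing snapshot name snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c, matching the "cannot find snapshot" error above.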

Expected behavior

A clear and concise description of what you expected to happen.

The user-created snapshot should not be deleted.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Harvester Support bundle:
supportbundle_37f7d382-4226-4272-9f96-c86340b18085_2023-05-24T03-50-55Z.zip

Longhorn Support bundle:
supportbundle_35af9b9c-b457-44ab-8ecd-ab44f4d595ff_2023-05-24T03-52-19Z.zip

Environment

  • Longhorn version: v1.4.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Harvester master-e123ab14-head
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Add any other context about the problem here.

Metadata

Labels

  • area/snapshot: Volume snapshot (in-cluster snapshot or external backup)
  • backport/1.3.4
  • backport/1.4.3
  • component/longhorn-manager: Longhorn manager (control plane)
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • require/auto-e2e-test: Require adding/updating auto e2e test cases if they can be automated
  • severity/1: Function broken (a critical incident with very high impact, e.g. data corruption, failed upgrade)
