[BUG] User created snapshot deleted after node drain and uncordon #5992
Description
A user-created Longhorn snapshot is deleted after the node the VM runs on is drained and uncordoned.
To Reproduce
Steps to reproduce the behavior:
- Prepare a 3-node Harvester cluster:

```
$ kubectl get no
NAME           STATUS   ROLES                              AGE   VERSION
harv-yh-1-01   Ready    control-plane,etcd,master,worker   18h   v1.24.11+rke2r1
harv-yh-1-02   Ready    control-plane,etcd,master,worker   15h   v1.24.11+rke2r1
harv-yh-1-03   Ready    control-plane,etcd,master,worker   12h   v1.24.11+rke2r1
```
- Create a VM `demo`
- Wait for the VM's state to change to `Running`
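For reference, the VM's phase can be watched like this (a sketch, not part of the original report; it queries the same `vmi` resource used later in these steps):

```
$ kubectl -n default get vmi demo -w
```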
- Take a VM Snapshot `demo-snapshot`
- Wait for the VM snapshot's state to change to `Ready`
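The snapshot state can be polled similarly (a sketch; it assumes the snapshot is exposed through KubeVirt's VirtualMachineSnapshot API, which the report does not confirm):

```
$ kubectl -n default get vmsnapshot demo-snapshot -o jsonpath='{.status.readyToUse}'
```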
- Check Longhorn's snapshot CR; a `userCreated: true` Longhorn snapshot exists in the `longhorn-system` namespace.
The output of `get snapshots.longhorn.io`:
```
$ kubectl -n longhorn-system get snapshots.longhorn.io -oyaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Snapshot
  metadata:
    creationTimestamp: "2023-05-24T03:33:06Z"
    finalizers:
    - longhorn.io
    generation: 1
    labels:
      longhornvolume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
    name: snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c
    namespace: longhorn-system
    ownerReferences:
    - apiVersion: longhorn.io/v1beta2
      kind: Volume
      name: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
      uid: fd3f6f2a-c674-4619-a72c-40f08a79af6a
    resourceVersion: "956629"
    uid: 78ef3f38-6aac-411c-b512-84106e689dd2
  spec:
    createSnapshot: false
    labels: null
    volume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
  status:
    checksum: ""
    children:
      volume-head: true
    creationTime: "2023-05-24T03:33:02Z"
    labels:
      type: snap
    markRemoved: false
    ownerID: ""
    parent: ""
    readyToUse: true
    restoreSize: 10737418240
    size: 544698368
    userCreated: true
kind: List
metadata:
  resourceVersion: ""
```
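A quicker way to check just the name and `userCreated` flag of each Longhorn snapshot (a sketch using kubectl's JSONPath output; the result shown is taken from the YAML above):

```
$ kubectl -n longhorn-system get snapshots.longhorn.io \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.userCreated}{"\n"}{end}'
snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c   true
```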
- Drain the node on which the VM is scheduled:

```
$ kubectl -n default get vmi demo -ojson | jq -r '.status.nodeName'
harv-yh-1-03
$ kubectl drain harv-yh-1-03 --ignore-daemonsets --delete-emptydir-data
```

After the drain, check the volume status.
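One way to do so (a sketch; the volume name is taken from the snapshot output above):

```
$ kubectl -n longhorn-system get volumes.longhorn.io pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
```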
- Uncordon the node after the drain finishes:

```
$ kubectl uncordon harv-yh-1-03
```
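After uncordoning, the node should report `Ready` without `SchedulingDisabled` (a quick check, not from the original report):

```
$ kubectl get node harv-yh-1-03
```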
- Check the output of `get snapshots.longhorn.io` again:

```
$ kubectl -n longhorn-system get snapshots.longhorn.io
NAME                                   VOLUME                                     CREATIONTIME           READYTOUSE   RESTORESIZE   SIZE        AGE
15463930-c07c-47e0-b9f9-23ddb3ed10c2   pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61   2023-05-24T03:39:47Z   false        10737418240   545931264   95s
```
The user-created snapshot has been deleted, and a system-created snapshot is found in its place.
The output of `get snapshots.longhorn.io`:
```
$ kubectl -n longhorn-system get snapshots.longhorn.io -oyaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: Snapshot
  metadata:
    creationTimestamp: "2023-05-24T03:39:52Z"
    finalizers:
    - longhorn.io
    generation: 1
    labels:
      longhornvolume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
    name: 15463930-c07c-47e0-b9f9-23ddb3ed10c2
    namespace: longhorn-system
    ownerReferences:
    - apiVersion: longhorn.io/v1beta2
      kind: Volume
      name: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
      uid: fd3f6f2a-c674-4619-a72c-40f08a79af6a
    resourceVersion: "963356"
    uid: 0bdf1236-97c7-40f3-9584-4007e89c1692
  spec:
    createSnapshot: false
    labels: null
    volume: pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61
  status:
    checksum: ""
    children:
      volume-head: true
    creationTime: "2023-05-24T03:39:47Z"
    labels: {}
    markRemoved: true
    ownerID: ""
    parent: ""
    readyToUse: false
    restoreSize: 10737418240
    size: 545931264
    userCreated: false
kind: List
metadata:
  resourceVersion: ""
```
- Restore the VM snapshot `demo-snapshot` to create a new VM `demo-restore`
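The restore progress can be watched with (a sketch checking the restored VM and its PVC):

```
$ kubectl -n default get vm demo-restore
$ kubectl -n default get pvc
```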
- The restored PVC is stuck in `Pending` and the restored VM is stuck in `Restoring`:

```
Events:
  Type     Reason                Age                From                                                                                       Message
  ----     ------                ----               ----                                                                                       -------
  Warning  ProvisioningFailed    89s (x2 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [detail=, message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61', code=Server Error] from [http://longhorn-backend:9500/v1/volumes]
  Warning  ProvisioningFailed    64s (x2 over 72s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61', code=Server Error, detail=] from [http://longhorn-backend:9500/v1/volumes]
  Normal   Provisioning          40s (x8 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  External provisioner is provisioning volume for claim "default/restore-demo-snapshot-793b83f2-3557-4249-9bcb-ca44f46b81a9-rootdisk"
  Warning  ProvisioningFailed    40s (x4 over 89s)  driver.longhorn.io_csi-provisioner-68d95fd6d7-znlg2_50877834-6b2a-46df-a45e-f01c94e77cf5  failed to provision volume with StorageClass "longhorn-image-nwd54": rpc error: code = Internal desc = Bad response statusCode [500]. Status [500 Internal Server Error]. Body: [code=Server Error, detail=, message=unable to create volume: unable to create volume pvc-f5ec8db9-9e4e-4ab6-9f3e-3a066ba044aa: failed to verify data source: cannot find snapshot 'snapshot-beb1a302-0c19-45de-9da0-35cec0bc2c9c' for volume 'pvc-06dc0095-f847-4ec5-a290-f2a6ed2ecf61'] from [http://longhorn-backend:9500/v1/volumes]
  Normal   ExternalProvisioning  5s (x8 over 89s)   persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "driver.longhorn.io" or manually created by system administrator
```
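The events above can be retrieved by describing the stuck PVC (the claim name is taken from the `Provisioning` event above):

```
$ kubectl -n default describe pvc restore-demo-snapshot-793b83f2-3557-4249-9bcb-ca44f46b81a9-rootdisk
```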
Expected behavior
The user-created snapshot should not be deleted.
Log or Support bundle
Harvester Support bundle:
supportbundle_37f7d382-4226-4272-9f96-c86340b18085_2023-05-24T03-50-55Z.zip
Longhorn Support bundle:
supportbundle_35af9b9c-b457-44ab-8ecd-ab44f4d595ff_2023-05-24T03-52-19Z.zip
Environment
- Longhorn version: v1.4.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Harvester master-e123ab14-head
- Number of management nodes in the cluster:
- Number of worker nodes in the cluster:
- Node config
- OS type and version:
- CPU per node:
- Memory per node:
- Disk type (e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster: