
[BUG][v1.8.0-rc2] DR volume becomes faulted after encountering restoration error: error initiating incremental backup restore: cannot find .cfg in backupstore #10105

Open
yangchiu opened this issue Dec 31, 2024 · 21 comments
Labels
area/volume-backup-restore Volume backup restore area/volume-disaster-recovery Volume DR kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/always 100% reproducible require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-reproduce Require QA to reproduce, especially for issues reported from community severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)

@yangchiu
Member

Describe the bug

The DR volume becomes permanently faulted after encountering the restoration error "error initiating incremental backup restore: cannot find .cfg in backupstore":

$ kubectl get volumes -n longhorn-system
NAME     DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
test-3   v1            detached   faulted                  2147483648          17m
$ kubectl describe volumes -n longhorn-system test-3
Name:         test-3
Namespace:    longhorn-system
Labels:       backup-target=default
              backup-volume=test-3
              longhornvolume=test-3
              recurring-job-group.longhorn.io/default=enabled
              setting.longhorn.io/remove-snapshots-during-filesystem-trim=ignored
              setting.longhorn.io/replica-auto-balance=ignored
              setting.longhorn.io/snapshot-data-integrity=ignored
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         Volume
Metadata:
  Creation Timestamp:  2024-12-31T03:51:58Z
  Finalizers:
    longhorn.io
  Generation:        3
  Resource Version:  4569
  UID:               7f37f878-0f5d-430f-8e6a-1b40025596f5
Spec:
  Standby:                    true
  Access Mode:                rwo
  Backend Store Driver:       
  Backing Image:              
  Backup Compression Method:  lz4
  Backup Target Name:         default
  Data Engine:                v1
  Data Locality:              disabled
  Data Source:                
  Disable Frontend:           false
  Disk Selector:
  Encrypted:                       false
  Engine Image:                    
  Freeze Filesystem For Snapshot:  ignored
  From Backup:                     s3://yang-test-19@us-east-1/?backup=backup-0a2000d2f70d4470&volume=test-3
  Frontend:                        
  Image:                           longhornio/longhorn-engine:v1.8.0-rc2
  Last Attached By:                
  Migratable:                      false
  Migration Node ID:               
  Node ID:                         
  Node Selector:
  Number Of Replicas:               3
  Replica Auto Balance:             ignored
  Replica Disk Soft Anti Affinity:  ignored
  Replica Soft Anti Affinity:       ignored
  Replica Zone Soft Anti Affinity:  ignored
  Restore Volume Recurring Job:     ignored
  Revision Counter Disabled:        false
  Size:                             2147483648
  Snapshot Data Integrity:          ignored
  Snapshot Max Count:               250
  Snapshot Max Size:                0
  Stale Replica Timeout:            0
  Unmap Mark Snap Chain Removed:    ignored
Status:
  Actual Size:  121638912
  Clone Status:
    Attempt Count:            0
    Next Allowed Attempt At:  
    Snapshot:                 
    Source Volume:            
    State:                    
  Conditions:
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   False
    Type:                     WaitForBackingImage
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   False
    Type:                     TooManySnapshots
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   True
    Type:                     Scheduled
    Last Probe Time:          
    Last Transition Time:     2024-12-31T04:02:21Z
    Message:                  All replica restore failed and the volume became Faulted
    Reason:                   RestoreFailure
    Status:                   False
    Type:                     Restore
  Current Image:              longhornio/longhorn-engine:v1.8.0-rc2
  Current Migration Node ID:  
  Current Node ID:            
  Expansion Required:         false
  Frontend Disabled:          false
  Is Standby:                 true
  Kubernetes Status:
    Last PVC Ref At:  2024-12-31T03:47:16Z
    Last Pod Ref At:  2024-12-31T03:47:16Z
    Namespace:        default
    Pv Name:          
    Pv Status:        
    Pvc Name:         test-3
    Workloads Status:
      Pod Name:          test-pod-3
      Pod Status:        Running
      Workload Name:     
      Workload Type:     
  Last Backup:           backup-1deac75057cd4efa
  Last Backup At:        2024-12-31T04:09:06Z
  Last Degraded At:      
  Owner ID:              ip-10-0-1-187
  Pending Node ID:       
  Remount Requested At:  
  Restore Initiated:     true
  Restore Required:      true
  Robustness:            faulted
  Share Endpoint:        
  Share State:           
  State:                 detached
Events:
  Type     Reason         Age                    From                        Message
  ----     ------         ----                   ----                        -------
  Normal   Attached       17m                    longhorn-volume-controller  volume test-3 has been attached to ip-10-0-1-187
  Warning  FailedRestore  7m27s                  longhorn-volume-controller  replica test-3-r-224bfb79 failed the restore: tcp://10.42.2.9:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Warning  FailedRestore  7m26s (x2 over 7m27s)  longhorn-volume-controller  replica test-3-r-4ca31fa0 failed the restore: tcp://10.42.1.10:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Warning  FailedRestore  7m26s (x3 over 7m27s)  longhorn-volume-controller  replica test-3-r-8ed84b65 failed the restore: tcp://10.42.3.10:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Normal   Degraded       7m26s (x2 over 7m26s)  longhorn-volume-controller  volume test-3 became degraded
  Normal   Detached       7m26s (x2 over 17m)    longhorn-volume-controller  volume test-3 has been detached

To Reproduce

  1. Prepare 2 Longhorn clusters and a remote backup store
  2. In the 1st cluster, create a volume and create a backup for it
  3. In the 2nd cluster, create a DR volume from the backup created in the 1st cluster (a manifest sketch follows after this list)
  4. The DR volume should be restored successfully and remain in the attached/healthy state
  5. In the 1st cluster, create a recurring job for the volume that creates a backup every minute
  6. After a couple of minutes, the DR volume in the 2nd cluster eventually becomes detached/faulted
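
For reference, here is a minimal sketch of the DR (standby) volume manifest used in step 3, reconstructed from the kubectl describe output above. Field names follow the longhorn.io/v1beta2 Volume CRD; the file name is illustrative, and in practice the DR volume is usually created through the Longhorn UI or API, so treat this only as a sketch.

dr-volume.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: test-3
  namespace: longhorn-system
spec:
  standby: true                  # marks this as a DR volume that keeps restoring incrementally
  fromBackup: "s3://yang-test-19@us-east-1/?backup=backup-0a2000d2f70d4470&volume=test-3"
  backupTargetName: default
  dataEngine: v1
  numberOfReplicas: 3
  size: "2147483648"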

Expected behavior

The DR volume should keep performing incremental restores and remain in the attached/healthy state instead of becoming faulted.

Support bundle for troubleshooting

supportbundle_dc5cb6f7-5584-4153-8d1e-1b823b393fe2_2024-12-31T04-05-09Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.30.0+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: ubuntu 24.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Workaround and Mitigation

@yangchiu yangchiu added kind/bug reproduce/always 100% reproducible priority/1 Highly recommended to implement or fix in this release (managed by PO) area/volume-backup-restore Volume backup restore severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Dec 31, 2024
@yangchiu yangchiu added this to the v1.8.0 milestone Dec 31, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Dec 31, 2024
@derekbit
Member

derekbit commented Dec 31, 2024

@c3y1huang Could you help check it? Thank you.

@derekbit
Member

derekbit commented Jan 1, 2025

@yangchiu

  • Does it happen in v1.8.0-rc1, v1.7.2 and v1.6.3 as well?
  • Do you have the support bundles of the target and source clusters?

@derekbit derekbit assigned COLDTURNIP and unassigned c3y1huang Jan 1, 2025
@derekbit
Member

derekbit commented Jan 1, 2025

@COLDTURNIP Can you help investigate the issue? Thanks.

@derekbit derekbit added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label Jan 1, 2025
@derekbit
Member

derekbit commented Jan 1, 2025

cc @c3y1huang @ChanYiLin

@derekbit derekbit added priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) and removed priority/1 Highly recommended to implement or fix in this release (managed by PO) severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Jan 1, 2025
@mantissahz
Contributor

mantissahz commented Jan 2, 2025

  1. In the 1st cluster, create a recurring job for the volume that creates a backup every minute

Hi @yangchiu,

As @derekbit mentioned, do you have the support bundles of the source clusters?
What is the retain count of this recurring job?

Would it be like this scenario:

  1. The recurring job in the 1st cluster started to create backup B.
  2. The 2nd cluster observed that the last backup had been updated to backup A.
  3. When the recurring job in the 1st cluster completed backup B, it started deleting backup A.
  4. When the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A.
  5. The 2nd cluster failed to perform the incremental restoration.

@COLDTURNIP, could you check whether this scenario caused the issue?

@yangchiu
Member Author

yangchiu commented Jan 2, 2025

What is the retain count of this recurring job?

Yes, Retain is set to 1 for this recurring job.

@COLDTURNIP
Contributor

I have tried to reproduce the problem in a local environment:

  • Two 2-node clusters in Vagrant VirtualBox
  • Both clusters connected to a local MinIO
  • 1 empty volume

Unfortunately, the issue did not occur during an overnight trial. We are now trying to insert some delay into the manager to verify whether there is a possible data race between the clusters and the backup store.

@derekbit
Member

derekbit commented Jan 3, 2025

@yangchiu Please help check whether @COLDTURNIP's steps are correct. Also, reproduce it a few more times to confirm that the reproduce/always label is valid. Thank you.

cc @longhorn/qa

@COLDTURNIP
Contributor

COLDTURNIP commented Jan 3, 2025

Unable to reproduce the issue even after adding more delay while fetching backups from the backup store.

@derekbit derekbit added the require/qa-reproduce Require QA to reproduce, especially for issues reported from community label Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

Thanks @COLDTURNIP

Moved this to Ready For Testing, added the require/qa-reproduce label, and will wait for @yangchiu's update.

@derekbit derekbit moved this from New Issues to Ready For Testing in Longhorn Sprint Jan 3, 2025
@yangchiu
Member Author

yangchiu commented Jan 3, 2025

This should be reproducible by repeatedly deleting the .cfg file in the backupstore manually. If this isn't an issue, feel free to close it. @derekbit @COLDTURNIP
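
For reference, a minimal way to simulate the missing .cfg with the AWS CLI, assuming the S3 bucket layout shown in the error events above (the backup name is only illustrative, and the credentials/region must match the backup target):

# List the backup .cfg files of the volume; the volumes/1d/fa/test-3 prefix is taken from the error message above.
$ aws s3 ls --region us-east-1 s3://yang-test-19/backupstore/volumes/1d/fa/test-3/backups/

# Delete one backup_<name>.cfg while the DR volume in the 2nd cluster is performing an incremental restore from that backup.
$ aws s3 rm --region us-east-1 s3://yang-test-19/backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg

With retain set to 1, the recurring job in the 1st cluster performs essentially the same deletion automatically, which is the scenario described earlier in this thread.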

@yangchiu yangchiu moved this from Ready For Testing to Implement in Longhorn Sprint Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

This should be reproducible by repeatedly deleting the .cfg file in the backupstore manually. If this isn't an issue, feel free to close it. @derekbit @COLDTURNIP

@yangchiu
Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?
Was the .cfg file deleted by you in the original description #10105 (comment)? From the description, the .cfg file seems to have gone missing rather than having been deleted intentionally.

@yangchiu
Member Author

yangchiu commented Jan 3, 2025

@yangchiu Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?

Yes, it easily simulates "when the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A".

Was the .cfg file deleted by you in the original description #10105 (comment)?

No, it happens naturally, but it's not reproducible at this point.

From the description, the .cfg file seems to have gone missing rather than having been deleted intentionally.

From the investigation, it's deleted by the recurring job with retain = 1.

Let's close this for now, as it is not reproducible at this time.

@yangchiu yangchiu closed this as not planned Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Implement to Closed in Longhorn Sprint Jan 3, 2025
@github-actions github-actions bot removed this from the v1.8.0 milestone Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

@yangchiu Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?

Yes, it easily simulates "when the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A".

Was the .cfg file deleted by you in the original description #10105 (comment)?

No, it happens naturally, but it's not reproducible at this point.

No worries. Let's keep an eye on DR and incremental backup, and test these functionalities more.

@derekbit derekbit reopened this Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Closed to Implement in Longhorn Sprint Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

@COLDTURNIP Before closing the issue, can you set up the destination and source clusters and automatically run the steps in #10105 (comment) for one day to make sure the issue doesn't happen?

cc @yangchiu, let's do a long-term test first.

@derekbit derekbit removed the wontfix label Jan 3, 2025
@roger-ryao

@COLDTURNIP Before closing the issue, can you set up the destination and source clusters and automatically run the steps in #10105 (comment) for one day to make sure the issue doesn't happen?

cc @yangchiu, let's do a long-term test first.

I replicated a similar situation this morning. I provided the support bundle to @COLDTURNIP and also shared the environment with him to see if he could replicate it again.

On Cluster A:

  1. Apply the statefulset.yaml file: k apply -f statefulset.yaml
statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
#      accessModes:
#        - ReadWriteOnce
#        - ReadWriteMany
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]      
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
  2. Execute the node-disconnect.sh script every 20 seconds: watch -n 20 './node-disconnect.sh 10'
node-disconnect.sh
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 <file_size_in_MBs>"
  exit 1
fi

file_size=$1

# Write some data to the volume and sync
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "sync"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "sync"

  3. Create a recurring backup job schedule on Cluster A (see the example sketch below).
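
An illustrative sketch of such a recurring backup job (the metadata name is an assumption; the cron and retain values follow what is discussed in this thread, i.e. a backup every minute with retain set to 1, and the default group matches the recurring-job-group.longhorn.io/default=enabled label shown on the volume above):

recurringjob.yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-every-minute
  namespace: longhorn-system
spec:
  name: backup-every-minute
  task: backup          # take a backup of the volume
  cron: "* * * * *"     # run every minute
  retain: 1             # keep only the latest backup; older ones are deleted
  concurrency: 1
  groups:
  - default             # applies to volumes in the default recurring-job group
  labels: {}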

On Cluster B:

  1. Create DR volumes using the backup from Cluster A.

@innobead innobead added this to the v1.8.0 milestone Jan 6, 2025
@innobead
Member

innobead commented Jan 6, 2025

Added back to 1.8.0 first.

@derekbit
Member

derekbit commented Jan 6, 2025

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

@roger-ryao

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

Okay, I will set up another environment and follow the same steps to run it for a day. I will update the results tomorrow afternoon.

@roger-ryao

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

Okay, I will set up another environment and follow the same steps to run it for a day. I will update the results tomorrow afternoon.

Hi @COLDTURNIP @derekbit

I was able to reproduce the issue on v1.7.2.

Volume name: pvc-9e855f07-e3d1-4d46-b394-2a11b71c274a
supportbundle_1aea7d34-0d13-46e8-a558-976713acc163_2025-01-06T10-10-51Z.zip

@innobead
Member

innobead commented Jan 6, 2025

Okay, this is not a regression introduced in v1.8.0 but an existing issue that has been present since at least v1.7.2.
