
[BUG][v1.8.0-rc2] DR volume becomes faulted after encountering restoration error: error initiating incremental backup restore: cannot find .cfg in backupstore #10105

Open
yangchiu opened this issue Dec 31, 2024 · 21 comments
Labels
area/volume-backup-restore Volume backup restore area/volume-disaster-recovery Volume DR kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/always 100% reproducible require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-reproduce Require QA to reproduce, especially for issues reported from community severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade)

@yangchiu
Member

Describe the bug

The DR volume becomes permanently faulted after encountering the restoration error "error initiating incremental backup restore: cannot find .cfg in backupstore":

$ kubectl get volumes -n longhorn-system
NAME     DATA ENGINE   STATE      ROBUSTNESS   SCHEDULED   SIZE         NODE   AGE
test-3   v1            detached   faulted                  2147483648          17m
$ kubectl describe volumes -n longhorn-system test-3
Name:         test-3
Namespace:    longhorn-system
Labels:       backup-target=default
              backup-volume=test-3
              longhornvolume=test-3
              recurring-job-group.longhorn.io/default=enabled
              setting.longhorn.io/remove-snapshots-during-filesystem-trim=ignored
              setting.longhorn.io/replica-auto-balance=ignored
              setting.longhorn.io/snapshot-data-integrity=ignored
Annotations:  <none>
API Version:  longhorn.io/v1beta2
Kind:         Volume
Metadata:
  Creation Timestamp:  2024-12-31T03:51:58Z
  Finalizers:
    longhorn.io
  Generation:        3
  Resource Version:  4569
  UID:               7f37f878-0f5d-430f-8e6a-1b40025596f5
Spec:
  Standby:                    true
  Access Mode:                rwo
  Backend Store Driver:       
  Backing Image:              
  Backup Compression Method:  lz4
  Backup Target Name:         default
  Data Engine:                v1
  Data Locality:              disabled
  Data Source:                
  Disable Frontend:           false
  Disk Selector:
  Encrypted:                       false
  Engine Image:                    
  Freeze Filesystem For Snapshot:  ignored
  From Backup:                     s3://yang-test-19@us-east-1/?backup=backup-0a2000d2f70d4470&volume=test-3
  Frontend:                        
  Image:                           longhornio/longhorn-engine:v1.8.0-rc2
  Last Attached By:                
  Migratable:                      false
  Migration Node ID:               
  Node ID:                         
  Node Selector:
  Number Of Replicas:               3
  Replica Auto Balance:             ignored
  Replica Disk Soft Anti Affinity:  ignored
  Replica Soft Anti Affinity:       ignored
  Replica Zone Soft Anti Affinity:  ignored
  Restore Volume Recurring Job:     ignored
  Revision Counter Disabled:        false
  Size:                             2147483648
  Snapshot Data Integrity:          ignored
  Snapshot Max Count:               250
  Snapshot Max Size:                0
  Stale Replica Timeout:            0
  Unmap Mark Snap Chain Removed:    ignored
Status:
  Actual Size:  121638912
  Clone Status:
    Attempt Count:            0
    Next Allowed Attempt At:  
    Snapshot:                 
    Source Volume:            
    State:                    
  Conditions:
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   False
    Type:                     WaitForBackingImage
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   False
    Type:                     TooManySnapshots
    Last Probe Time:          
    Last Transition Time:     2024-12-31T03:51:59Z
    Message:                  
    Reason:                   
    Status:                   True
    Type:                     Scheduled
    Last Probe Time:          
    Last Transition Time:     2024-12-31T04:02:21Z
    Message:                  All replica restore failed and the volume became Faulted
    Reason:                   RestoreFailure
    Status:                   False
    Type:                     Restore
  Current Image:              longhornio/longhorn-engine:v1.8.0-rc2
  Current Migration Node ID:  
  Current Node ID:            
  Expansion Required:         false
  Frontend Disabled:          false
  Is Standby:                 true
  Kubernetes Status:
    Last PVC Ref At:  2024-12-31T03:47:16Z
    Last Pod Ref At:  2024-12-31T03:47:16Z
    Namespace:        default
    Pv Name:          
    Pv Status:        
    Pvc Name:         test-3
    Workloads Status:
      Pod Name:          test-pod-3
      Pod Status:        Running
      Workload Name:     
      Workload Type:     
  Last Backup:           backup-1deac75057cd4efa
  Last Backup At:        2024-12-31T04:09:06Z
  Last Degraded At:      
  Owner ID:              ip-10-0-1-187
  Pending Node ID:       
  Remount Requested At:  
  Restore Initiated:     true
  Restore Required:      true
  Robustness:            faulted
  Share Endpoint:        
  Share State:           
  State:                 detached
Events:
  Type     Reason         Age                    From                        Message
  ----     ------         ----                   ----                        -------
  Normal   Attached       17m                    longhorn-volume-controller  volume test-3 has been attached to ip-10-0-1-187
  Warning  FailedRestore  7m27s                  longhorn-volume-controller  replica test-3-r-224bfb79 failed the restore: tcp://10.42.2.9:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Warning  FailedRestore  7m26s (x2 over 7m27s)  longhorn-volume-controller  replica test-3-r-4ca31fa0 failed the restore: tcp://10.42.1.10:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Warning  FailedRestore  7m26s (x3 over 7m27s)  longhorn-volume-controller  replica test-3-r-8ed84b65 failed the restore: tcp://10.42.3.10:10000: failed to restore backup data s3://yang-test-19@us-east-1/?backup=backup-47533d00228b4e85&volume=test-3 to snapshot file volume-snap-ef0d21c3-acf5-4831-aa40-4b7520c491b5.img: rpc error: code = Unknown desc = error starting backup restore: error initiating incremental backup restore: cannot find backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg in backupstore
  Normal   Degraded       7m26s (x2 over 7m26s)  longhorn-volume-controller  volume test-3 became degraded
  Normal   Detached       7m26s (x2 over 17m)    longhorn-volume-controller  volume test-3 has been detached

To Reproduce

  1. Prepare 2 Longhorn clusters and a remote backup store
  2. In the 1st cluster, create a volume and create a backup for it
  3. In the 2nd cluster, create a DR volume from the backup created in the 1st cluster (a manifest sketch follows after this list)
  4. The DR volume should be restored successfully and remain in the attached/healthy state
  5. In the 1st cluster, create a recurring job for the volume that creates a backup every minute
  6. After a couple of minutes, the DR volume in the 2nd cluster eventually becomes detached/faulted
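
For reference, here is a minimal sketch of the DR (standby) volume manifest used in step 3, reconstructed from the kubectl describe output above. Field names follow the longhorn.io/v1beta2 Volume CRD; the file name is illustrative, and in practice the DR volume is usually created through the Longhorn UI or API, so treat this only as a sketch.

dr-volume.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: test-3
  namespace: longhorn-system
spec:
  standby: true                  # marks this as a DR volume that keeps restoring incrementally
  fromBackup: "s3://yang-test-19@us-east-1/?backup=backup-0a2000d2f70d4470&volume=test-3"
  backupTargetName: default
  dataEngine: v1
  numberOfReplicas: 3
  size: "2147483648"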

Expected behavior

The DR volume should keep performing incremental restores and remain in the attached/healthy state instead of becoming faulted.

Support bundle for troubleshooting

supportbundle_dc5cb6f7-5584-4153-8d1e-1b823b393fe2_2024-12-31T04-05-09Z.zip

Environment

  • Longhorn version: v1.8.0-rc2
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.30.0+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: ubuntu 24.04
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Workaround and Mitigation

@yangchiu yangchiu added kind/bug reproduce/always 100% reproducible priority/1 Highly recommended to implement or fix in this release (managed by PO) area/volume-backup-restore Volume backup restore severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Dec 31, 2024
@yangchiu yangchiu added this to the v1.8.0 milestone Dec 31, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Dec 31, 2024
@derekbit
Member

derekbit commented Dec 31, 2024

@c3y1huang Could you help check it? Thank you.

@derekbit
Member

derekbit commented Jan 1, 2025

@yangchiu

  • Does it happen in v1.8.0-rc1, v1.7.2 and v1.6.3 as well?
  • Do you have the support bundles of the target and source clusters?

@derekbit derekbit assigned COLDTURNIP and unassigned c3y1huang Jan 1, 2025
@derekbit
Member

derekbit commented Jan 1, 2025

@COLDTURNIP Can you help investigate the issue? Thanks.

@derekbit derekbit added the require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated label Jan 1, 2025
@derekbit
Member

derekbit commented Jan 1, 2025

cc @c3y1huang @ChanYiLin

@derekbit derekbit added priority/0 Must be implement or fixed in this release (managed by PO) severity/1 Function broken (a critical incident with very high impact (ex: data corruption, failed upgrade) and removed priority/1 Highly recommended to implement or fix in this release (managed by PO) severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) labels Jan 1, 2025
@mantissahz
Contributor

mantissahz commented Jan 2, 2025

  1. In the 1st cluster, create a recurring job for the volume that creates a backup every minute

Hi @yangchiu,

As @derekbit mentioned, do you have the support bundles of the source clusters?
What is the retain count of this recurring job?

Would it be like this scenario:

  1. The recurring job in the 1st cluster started to create backup B.
  2. The 2nd cluster observed that the last backup had been updated to backup A.
  3. When the recurring job in the 1st cluster completed backup B, it started deleting backup A.
  4. When the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A.
  5. The 2nd cluster failed to perform the incremental restoration.

@COLDTURNIP, could you check whether this scenario caused the issue?

@yangchiu
Member Author

yangchiu commented Jan 2, 2025

What is the retain count of this recurring job?

Yes, Retain is set to 1 for this recurring job.

@COLDTURNIP
Contributor

I have tried to reproduce the problem in a local environment:

  • Two 2-node clusters in Vagrant VirtualBox
  • Both clusters connected to a local MinIO
  • 1 empty volume

Unfortunately, the issue did not occur during an overnight trial. We are now trying to insert some delay into the manager to verify whether there is a possible data race between the clusters and the backup store.

@derekbit
Member

derekbit commented Jan 3, 2025

@yangchiu Please help check whether @COLDTURNIP's steps are correct. Also, reproduce it a few more times to confirm that the reproduce/always label is valid. Thank you.

cc @longhorn/qa

@COLDTURNIP
Contributor

COLDTURNIP commented Jan 3, 2025

Unable to reproduce the issue even after adding more delay while fetching backups from the backup store.

@derekbit derekbit added the require/qa-reproduce Require QA to reproduce, especially for issues reported from community label Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

Thanks @COLDTURNIP

Moved this to Ready For Testing, added the require/qa-reproduce label, and will wait for @yangchiu's update.

@derekbit derekbit moved this from New Issues to Ready For Testing in Longhorn Sprint Jan 3, 2025
@yangchiu
Member Author

yangchiu commented Jan 3, 2025

This should be reproducible by repeatedly deleting the .cfg file in the backupstore manually. If this isn't an issue, feel free to close it. @derekbit @COLDTURNIP
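
For reference, a minimal way to simulate the missing .cfg with the AWS CLI, assuming the S3 bucket layout shown in the error events above (the backup name is only illustrative, and the credentials/region must match the backup target):

# List the backup .cfg files of the volume; the volumes/1d/fa/test-3 prefix is taken from the error message above.
$ aws s3 ls --region us-east-1 s3://yang-test-19/backupstore/volumes/1d/fa/test-3/backups/

# Delete one backup_<name>.cfg while the DR volume in the 2nd cluster is performing an incremental restore from that backup.
$ aws s3 rm --region us-east-1 s3://yang-test-19/backupstore/volumes/1d/fa/test-3/backups/backup_backup-a568665c5b524b50.cfg

With retain set to 1, the recurring job in the 1st cluster performs essentially the same deletion automatically, which is the scenario described earlier in this thread.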

@yangchiu yangchiu moved this from Ready For Testing to Implement in Longhorn Sprint Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

This should be reproducible by repeatedly deleting the .cfg file in the backupstore manually. If this isn't an issue, feel free to close it. @derekbit @COLDTURNIP

@yangchiu
Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?
Was the .cfg file deleted by you in the original description #10105 (comment)? From the description, the .cfg file seems to have gone missing rather than having been deleted intentionally.

@yangchiu
Member Author

yangchiu commented Jan 3, 2025

@yangchiu Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?

Yes, it easily simulates "when the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A".

Was the .cfg file deleted by you in the original description #10105 (comment)?

No, it happens naturally, but it's not reproducible at this point.

From the description, the .cfg file seems to have gone missing rather than having been deleted intentionally.

From the investigation, it's deleted by the recurring job with retain = 1.

Let's close this for now, as it is not reproducible at this time.

@yangchiu yangchiu closed this as not planned Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Implement to Closed in Longhorn Sprint Jan 3, 2025
@github-actions github-actions bot removed this from the v1.8.0 milestone Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

@yangchiu Can you elaborate more on "reproduced by manually deleting the .cfg file"? Do you mean deleting the file intentionally?

Yes, it easily simulates "when the 2nd cluster tried to restore the DR volume from backup A, the recurring job in the 1st cluster had already deleted backup A".

Was the .cfg file deleted by you in the original description #10105 (comment)?

No, it happens naturally, but it's not reproducible at this point.

No worries. Let's keep an eye on DR and incremental backup, and test these functionalities more.

@derekbit derekbit reopened this Jan 3, 2025
@github-project-automation github-project-automation bot moved this from Closed to Implement in Longhorn Sprint Jan 3, 2025
@derekbit
Member

derekbit commented Jan 3, 2025

@COLDTURNIP Before closing the issue, can you set up the destination and source clusters and automatically run the steps in #10105 (comment) for one day to make sure the issue doesn't happen?

cc @yangchiu, let's do a long-term test first.

@derekbit derekbit removed the wontfix label Jan 3, 2025
@roger-ryao

@COLDTURNIP Before closing the issue, can you set up the destination and source clusters and automatically run the steps in #10105 (comment) for one day to make sure the issue doesn't happen?

cc @yangchiu, let's do a long-term test first.

I replicated a similar situation this morning. I provided the support bundle to @COLDTURNIP and also shared the environment with him to see if he could replicate it again.

On Cluster A:

  1. Apply the statefulset.yaml file: k apply -f statefulset.yaml
statefulset.yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwo
  labels:
    app: nginx-state-rwo
spec:
  ports:
  - port: 80
    name: web-state-rwo
  selector:
    app: nginx-state-rwo
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwo
spec:
  selector:
    matchLabels:
      app: nginx-state-rwo # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwo"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwo # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwo
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwo
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
#      accessModes:
#        - ReadWriteOnce
#        - ReadWriteMany
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-state-rwx
  labels:
    app: nginx-state-rwx
spec:
  ports:
  - port: 80
    name: web-state-rwx
  selector:
    app: nginx-state-rwx
  type: NodePort
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web-state-rwx
spec:
  selector:
    matchLabels:
      app: nginx-state-rwx # has to match .spec.template.metadata.labels
  serviceName: "nginx-state-rwx"
  replicas: 1 # by default is 1
  template:
    metadata:
      labels:
        app: nginx-state-rwx # has to match .spec.selector.matchLabels
    spec:
      restartPolicy: Always
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx-state-rwx
        image: nginx:stable
        livenessProbe:
          exec:
            command:
              - ls
              - /usr/share/nginx/html/lost+found
          initialDelaySeconds: 5
          periodSeconds: 5
        ports:
        - containerPort: 80
          name: web-state-rwx
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteMany" ]      
      storageClassName: "longhorn"
      resources:
        requests:
          storage: 1Gi
  2. Execute the node-disconnect.sh script every 20 seconds: watch -n 20 './node-disconnect.sh 10'
node-disconnect.sh
#!/bin/bash

if [ $# -ne 1 ]; then
  echo "Usage: $0 <file_size_in_MBs>"
  exit 1
fi

file_size=$1

# Write some data to the volume and sync
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "dd if=/dev/urandom of=/usr/share/nginx/html/"$file_size"m bs=1M count=$file_size oflag=direct status=progress && md5sum /usr/share/nginx/html/"$file_size"m"
kubectl exec -it web-state-rwo-0 -- /bin/bash -c "sync"
kubectl exec -it web-state-rwx-0 -- /bin/bash -c "sync"

  3. Create a recurring backup job schedule on Cluster A (see the example sketch below).
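
An illustrative sketch of such a recurring backup job (the metadata name is an assumption; the cron and retain values follow what is discussed in this thread, i.e. a backup every minute with retain set to 1, and the default group matches the recurring-job-group.longhorn.io/default=enabled label shown on the volume above):

recurringjob.yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-every-minute
  namespace: longhorn-system
spec:
  name: backup-every-minute
  task: backup          # take a backup of the volume
  cron: "* * * * *"     # run every minute
  retain: 1             # keep only the latest backup; older ones are deleted
  concurrency: 1
  groups:
  - default             # applies to volumes in the default recurring-job group
  labels: {}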

On Cluster B:

  1. Create DR volumes using the backup from Cluster A.

@innobead innobead added this to the v1.8.0 milestone Jan 6, 2025
@innobead
Member

innobead commented Jan 6, 2025

Added back to 1.8.0 first.

@derekbit
Member

derekbit commented Jan 6, 2025

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

@roger-ryao

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

Okay, I will set up another environment and follow the same steps to run it for a day. I will update the results tomorrow afternoon.

@roger-ryao

@roger-ryao Can you check whether the issue happens in v1.7.2 as well?

Okay, I will set up another environment and follow the same steps to run it for a day. I will update the results tomorrow afternoon.

Hi @COLDTURNIP @derekbit

I was able to reproduce the issue on v1.7.2.

Volume name: pvc-9e855f07-e3d1-4d46-b394-2a11b71c274a
supportbundle_1aea7d34-0d13-46e8-a558-976713acc163_2025-01-06T10-10-51Z.zip

@innobead
Member

innobead commented Jan 6, 2025

Okay, this is not a regression introduced in v1.8.0 but an existing issue that has been present since at least v1.7.2.
