
[BUG] test_dr_volume_with_restore_command_error failed #6130

Closed
@roger-ryao

Description

Describe the bug (🐛 if you encounter this issue)

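The test case test_dr_volume_with_restore_command_error fails intermittently on both the master-head and v1.5.x-head branches. The DR volume becomes Faulted (restore condition: 'All replica restore failed and the volume became Faulted'), so it stays detached and never reaches the attached state the test waits for.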

>       wait_for_volume_restoration_start(client, dr_volume_name, b2.name)

test_ha.py:1689: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:4350: in wait_for_volume_restoration_start
    wait_for_volume_status(client, volume_name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

client = <longhorn.Client object at 0x7faccd4516a0>, name = 'longhorn-testvol-yawbqn-dr', key = 'state'
value = 'attached', retry_count = 150

    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E       AssertionError:  value=attached
E       .             volume[key]=detached
E       . volume={'accessMode': 'rwo', 'backendStoreDriver': 'longhorn', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'restore': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:33Z', 'message': 'All replica restore failed and the volume became Faulted', 'reason': 'RestoreFailure', 'status': 'False'}, 'scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:10Z', 'message': '', 'reason': '', 'status': 'True'}, 'toomanysnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:10Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '159383552', 'address': '', 'currentImage': '', 'endpoint': '', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'hostId': '', 'instanceManagerName': '', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'longhorn-testvol-yawbqn-dr-e-6a691bda', 'requestedBackupRestore': 'backup-f8cd5acdedd249ac', 'running': False, 'size': '1073741824', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2023-06-15 06:00:09 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'fromBackup': 'nfs://longhorn-test-nfs-svc.default:/opt/backupstore?backup=backup-d7b5e622c7434813&volume=longhorn-testvol-yawbqn-std', 'frontend': '', 'kubernetesStatus': {'lastPVCRefAt': '2023-06-15T06:00:04Z', 'lastPodRefAt': '2023-06-15T06:00:04Z', 'namespace': 'default', 'pvName': '', 'pvStatus': '', 'pvcName': 'longhorn-testvol-yawbqn-std-pvc', 'workloadsStatus': [{'podName': 'longhorn-testvol-yawbqn-std-pod', 'podStatus': 'Running', 'workloadName': '', 'workloadType': ''}]}, 'lastAttachedBy': '', 'lastBackup': 'backup-f8cd5acdedd249ac', 
'lastBackupAt': '2023-06-15T06:00:30Z', 'migratable': False, 'name': 'longhorn-testvol-yawbqn-dr', 'nodeSelector': [], 'numberOfReplicas': 3, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': None, 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-841000ca', 'diskID': 'd433bb4e-1e0d-4309-ae21-a8843b497ea0', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-134', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-17e46904', 'running': False}, {'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-5a47bf71', 'diskID': '74124f91-977f-4413-adfd-93ba220f9141', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-190', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-28137cdb', 'running': False}, {'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-725f8ba7', 'diskID': '2bd94598-1fe2-47db-b95e-b52cd21d5edd', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-252', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-42d861ad', 'running': False}], 'restoreInitiated': True, 'restoreRequired': True, 'restoreStatus': [], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'faulted', 
'shareEndpoint': '', 'shareState': '', 'size': '1073741824', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': True, 'state': 'detached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'volume-restore-controller-longhorn-testvol-yawbqn-dr': {'attachmentID': 'volume-restore-controller-longhorn-testvol-yawbqn-dr', 'attachmentType': 'volume-restore-controller', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:33Z', 'message': '', 'reason': '', 'status': 'False'}], 'nodeID': 'ip-10-0-2-252', 'parameters': {'disableFrontend': 'true'}, 'satisfied': False}}, 'volume': 'longhorn-testvol-yawbqn-dr'}}

common.py:1844: AssertionError
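
In the trace above, wait_for_volume_status keeps polling for state == 'attached' for all 150 retries even though the volume already reports robustness 'faulted' with reason RestoreFailure. A minimal sketch of a polling helper that stops as soon as the fault is visible could look like the following. This is an illustration only, not the helper in common.py: the function name, the VolumeFaultedError class, the retry_interval default, and the use of plain dicts with .get() are all assumptions. Since this test deliberately injects a restore error, whether an early abort is appropriate here (rather than a transient fault the volume is expected to recover from) would need to be checked against the test's intent.

```python
import time


class VolumeFaultedError(AssertionError):
    """Hypothetical error: the volume faulted while we waited for a state."""


def wait_for_volume_status_fail_fast(client, name, key, value,
                                     retry_count=150, retry_interval=1.0,
                                     sleep=time.sleep):
    """Poll volume[key] until it equals value; abort early on a fault.

    Assumes client.by_id_volume(name) returns a dict-like volume with the
    'robustness' and 'conditions' fields shown in the dump above.
    """
    volume = None
    for _ in range(retry_count):
        volume = client.by_id_volume(name)
        if volume[key] == value:
            return volume
        # Stop polling once the volume reports a fault and surface the
        # restore condition instead of the full volume dump.
        if volume.get("robustness") == "faulted":
            restore = volume.get("conditions", {}).get("restore", {})
            raise VolumeFaultedError(
                f"volume {name} faulted while waiting for {key}={value}: "
                f"reason={restore.get('reason')}, "
                f"message={restore.get('message')}")
        sleep(retry_interval)
    actual = volume[key] if volume is not None else None
    raise AssertionError(
        f"timed out waiting for {key}={value}; last value was {actual}")
```

With a helper like this, the failure above would be reported on the first poll after 06:00:33Z with the RestoreFailure reason in the message, which makes it easier to tell a faulted restore apart from a slow attach.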

Screenshot_20230615_135925
Screenshot_20230615_135934

To Reproduce

Steps to reproduce the behavior:

1. Open https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/
2. Check the test result of test_dr_volume_with_restore_command_error

Expected behavior

The test should pass consistently on all distros.

Log or Support bundle

c.zip

Environment

  • Longhorn version: v1.5.0-rc2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.9+k3s1
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: SLES 15.4
    • CPU per node: 4
    • Memory per node: 16G
    • Disk type(e.g. SSD/NVMe): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:
