[BUG] test_dr_volume_with_restore_command_error failed #6130
Closed
Description
Describe the bug (🐛 if you encounter this issue)
The test case test_dr_volume_with_restore_command_error is intermittently failing on both the master-head and v1.5.x-head branches. The DR volume status is being displayed as Faulted.
> wait_for_volume_restoration_start(client, dr_volume_name, b2.name)
test_ha.py:1689:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
common.py:4350: in wait_for_volume_restoration_start
wait_for_volume_status(client, volume_name,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
client = <longhorn.Client object at 0x7faccd4516a0>, name = 'longhorn-testvol-yawbqn-dr', key = 'state'
value = 'attached', retry_count = 150
    def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
E AssertionError: value=attached
E . volume[key]=detached
E . volume={'accessMode': 'rwo', 'backendStoreDriver': 'longhorn', 'backingImage': '', 'backupCompressionMethod': 'lz4', 'backupStatus': [], 'cloneStatus': {'snapshot': '', 'sourceVolume': '', 'state': ''}, 'conditions': {'restore': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:33Z', 'message': 'All replica restore failed and the volume became Faulted', 'reason': 'RestoreFailure', 'status': 'False'}, 'scheduled': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:10Z', 'message': '', 'reason': '', 'status': 'True'}, 'toomanysnapshots': {'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:10Z', 'message': '', 'reason': '', 'status': 'False'}}, 'controllers': [{'actualSize': '159383552', 'address': '', 'currentImage': '', 'endpoint': '', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'hostId': '', 'instanceManagerName': '', 'isExpanding': False, 'lastExpansionError': '', 'lastExpansionFailedAt': '', 'lastRestoredBackup': '', 'name': 'longhorn-testvol-yawbqn-dr-e-6a691bda', 'requestedBackupRestore': 'backup-f8cd5acdedd249ac', 'running': False, 'size': '1073741824', 'unmapMarkSnapChainRemovedEnabled': False}], 'created': '2023-06-15 06:00:09 +0000 UTC', 'currentImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'dataLocality': 'disabled', 'dataSource': '', 'disableFrontend': False, 'diskSelector': [], 'encrypted': False, 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'fromBackup': 'nfs://longhorn-test-nfs-svc.default:/opt/backupstore?backup=backup-d7b5e622c7434813&volume=longhorn-testvol-yawbqn-std', 'frontend': '', 'kubernetesStatus': {'lastPVCRefAt': '2023-06-15T06:00:04Z', 'lastPodRefAt': '2023-06-15T06:00:04Z', 'namespace': 'default', 'pvName': '', 'pvStatus': '', 'pvcName': 'longhorn-testvol-yawbqn-std-pvc', 'workloadsStatus': [{'podName': 'longhorn-testvol-yawbqn-std-pod', 'podStatus': 'Running', 'workloadName': '', 'workloadType': ''}]}, 'lastAttachedBy': '', 'lastBackup': 'backup-f8cd5acdedd249ac', 
'lastBackupAt': '2023-06-15T06:00:30Z', 'migratable': False, 'name': 'longhorn-testvol-yawbqn-dr', 'nodeSelector': [], 'numberOfReplicas': 3, 'offlineReplicaRebuilding': 'disabled', 'offlineReplicaRebuildingRequired': False, 'purgeStatus': None, 'ready': False, 'rebuildStatus': [], 'recurringJobSelector': None, 'replicaAutoBalance': 'ignored', 'replicaSoftAntiAffinity': 'ignored', 'replicaZoneSoftAntiAffinity': 'ignored', 'replicas': [{'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-841000ca', 'diskID': 'd433bb4e-1e0d-4309-ae21-a8843b497ea0', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-134', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-17e46904', 'running': False}, {'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-5a47bf71', 'diskID': '74124f91-977f-4413-adfd-93ba220f9141', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-190', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-28137cdb', 'running': False}, {'address': '', 'backendStoreDriver': 'longhorn', 'currentImage': '', 'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-yawbqn-dr-725f8ba7', 'diskID': '2bd94598-1fe2-47db-b95e-b52cd21d5edd', 'diskPath': '/var/lib/longhorn/', 'engineImage': 'longhornio/longhorn-engine:v1.5.0-rc2', 'failedAt': '2023-06-15T06:00:33Z', 'hostId': 'ip-10-0-2-252', 'instanceManagerName': '', 'mode': '', 'name': 'longhorn-testvol-yawbqn-dr-r-42d861ad', 'running': False}], 'restoreInitiated': True, 'restoreRequired': True, 'restoreStatus': [], 'restoreVolumeRecurringJob': 'ignored', 'revisionCounterDisabled': False, 'robustness': 'faulted', 
'shareEndpoint': '', 'shareState': '', 'size': '1073741824', 'snapshotDataIntegrity': 'ignored', 'staleReplicaTimeout': 0, 'standby': True, 'state': 'detached', 'unmapMarkSnapChainRemoved': 'ignored', 'volumeAttachment': {'attachments': {'volume-restore-controller-longhorn-testvol-yawbqn-dr': {'attachmentID': 'volume-restore-controller-longhorn-testvol-yawbqn-dr', 'attachmentType': 'volume-restore-controller', 'conditions': [{'lastProbeTime': '', 'lastTransitionTime': '2023-06-15T06:00:33Z', 'message': '', 'reason': '', 'status': 'False'}], 'nodeID': 'ip-10-0-2-252', 'parameters': {'disableFrontend': 'true'}, 'satisfied': False}}, 'volume': 'longhorn-testvol-yawbqn-dr'}}
common.py:1844: AssertionError
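The volume dump in the assertion message is hard to scan, so a small helper that pulls out only the fields relevant to the restore failure may help when triaging similar runs. The following is a minimal sketch, not part of the Longhorn test suite: `summarize_restore_failure` is a hypothetical name, it assumes the volume has been converted to a plain dict shaped like the dump above, and `sample` is a hand-trimmed copy of that dump.

```python
def summarize_restore_failure(volume):
    """Return a short summary of a failed DR restore.

    Hypothetical helper: `volume` is assumed to be a plain dict shaped like
    the dump in the AssertionError above.
    """
    restore = volume.get("conditions", {}).get("restore", {})
    # A replica counts as failed when its 'failedAt' timestamp is non-empty.
    failed = [
        f"{r.get('name')} on {r.get('hostId')} at {r.get('failedAt')}"
        for r in volume.get("replicas", [])
        if r.get("failedAt")
    ]
    lines = [
        f"volume: {volume.get('name')}",
        f"state: {volume.get('state')}, robustness: {volume.get('robustness')}",
        f"restore condition: {restore.get('reason')} ({restore.get('message')})",
        f"failed replicas: {len(failed)}",
    ]
    lines.extend(f"  - {entry}" for entry in failed)
    return "\n".join(lines)


# Hand-trimmed copy of the fields from the dump above.
sample = {
    "name": "longhorn-testvol-yawbqn-dr",
    "state": "detached",
    "robustness": "faulted",
    "conditions": {
        "restore": {
            "reason": "RestoreFailure",
            "message": "All replica restore failed and the volume became Faulted",
            "status": "False",
        }
    },
    "replicas": [
        {"name": "longhorn-testvol-yawbqn-dr-r-17e46904",
         "hostId": "ip-10-0-2-134", "failedAt": "2023-06-15T06:00:33Z"},
        {"name": "longhorn-testvol-yawbqn-dr-r-28137cdb",
         "hostId": "ip-10-0-2-190", "failedAt": "2023-06-15T06:00:33Z"},
        {"name": "longhorn-testvol-yawbqn-dr-r-42d861ad",
         "hostId": "ip-10-0-2-252", "failedAt": "2023-06-15T06:00:33Z"},
    ],
}

print(summarize_restore_failure(sample))
```

For this run the summary shows that all three replicas failed at the same second (06:00:33Z), which matches the lastTransitionTime of the restore condition.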
To Reproduce
Steps to reproduce the behavior:
1. Go to https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/
2. Verify the test result of test_dr_volume_with_restore_command_error
Expected behavior
We should have consistent test results on all distros.
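In the traceback above, the wait loop is given 150 retries to see state == 'attached' even though the volume had already reported robustness 'faulted' with a RestoreFailure condition. One way to make such intermittent failures quicker to spot on any distro is a wait helper that stops as soon as the volume is faulted. The following is only a sketch under stated assumptions: `wait_for_volume_status_fail_fast`, `VolumeFaultedError`, and `FakeClient` are hypothetical names, the 1-second default interval is a guess (the real RETRY_INTERVAL is not shown in the report), the original call to wait_for_volume_creation is omitted, and treating 'faulted' as terminal may not be appropriate for every test.

```python
import time


class VolumeFaultedError(AssertionError):
    """Raised when the volume becomes faulted while we are still waiting."""


def wait_for_volume_status_fail_fast(client, name, key, value,
                                     retry_count=150, retry_interval=1.0):
    """Poll until volume[key] == value, but stop early on a faulted volume.

    Hypothetical variant of wait_for_volume_status from the traceback.
    """
    volume = None
    for _ in range(retry_count):
        volume = client.by_id_volume(name)
        if volume[key] == value:
            return volume
        # Assumption: once robustness is 'faulted', further polling is pointless.
        if volume["robustness"] == "faulted":
            restore = volume["conditions"]["restore"]
            raise VolumeFaultedError(
                f"{name} faulted while waiting for {key}={value}: "
                f"{restore['reason']}: {restore['message']}"
            )
        time.sleep(retry_interval)
    last = None if volume is None else volume[key]
    raise AssertionError(
        f"{name}: {key}={last!r} after {retry_count} retries, expected {value!r}"
    )


class FakeClient:
    """Hypothetical stand-in for longhorn.Client, for illustration only."""

    def __init__(self, volume):
        self._volume = volume

    def by_id_volume(self, name):
        return self._volume


# Fields copied from the volume dump in the traceback.
faulted = {
    "state": "detached",
    "robustness": "faulted",
    "conditions": {
        "restore": {
            "reason": "RestoreFailure",
            "message": "All replica restore failed and the volume became Faulted",
        }
    },
}

try:
    wait_for_volume_status_fail_fast(
        FakeClient(faulted), "longhorn-testvol-yawbqn-dr", "state", "attached",
        retry_interval=0,
    )
except VolumeFaultedError as err:
    print(err)
```

This does not address why the restore fails. It only shortens the wait and puts the restore condition at the top of the failure message instead of inside the full volume dump.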
Log or Support bundle
Environment
- Longhorn version: v1.5.0-rc2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.9+k3s1
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 3
- Node config
  - OS type and version: SLES 15.4
  - CPU per node: 4
  - Memory per node: 16G
  - Disk type (e.g. SSD/NVMe): SSD
  - Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster:
Additional context