Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] After crashed engine process, volume stuck in Unknown state #6699

Closed
roger-ryao opened this issue Sep 15, 2023 · 18 comments
Closed

[BUG] After crashed engine process, volume stuck in Unknown state #6699

roger-ryao opened this issue Sep 15, 2023 · 18 comments
Assignees
Labels
area/stability System or volume stability backport/1.5.2 kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage severity/3 Function working but has a major issue w/ workaround
Milestone

Comments

@roger-ryao
Copy link

Describe the bug (🐛 if you encounter this issue)

Test case test_recurring_jobs_when_volume_detached_unexpectedly randomly fails intermittently on master-head and v1.5.x-head with a failure rate of approximately 2 out of 20.

E AssertionError: value = healthy
E.volume[key] = unknown
E.volume = {
    'accessMode': 'rwo',
    'backendStoreDriver': 'v1',
    'backingImage': '',
    'backupCompressionMethod': 'lz4',
    'backupStatus': [{
        'backupURL': '',
        'error': 'proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to get backup-97f0ecc4095b460b backup status: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
        'progress': 54,
        'replica': 'tcp://10.42.1.10:10000',
        'size': '',
        'snapshot': 'recurrin-ff62e23a-fa5d-4c60-8214-72b854e38f02',
        'state': 'Error'
    }, {
        'backupURL': '',
        'error': 'error getting snapshot \'recurrin-cf217437-01d7-4f1b-acc9-1fd9d513632c\', engine \'longhorn-testvol-0gbgrf-e-b242988d\': proxyServer=10.42.1.10:8501 destination=10.42.2.9:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
        'progress': 0,
        'replica': '',
        'size': '',
        'snapshot': '',
        'state': 'Error'
    }],
    'cloneStatus': {
        'snapshot': '',
        'sourceVolume': '',
        'state': ''
    },
    'conditions': {
        'restore': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'False'
        },
        'scheduled': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'True'
        },
        'toomanysnapshots': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'False'
        }
    },
    'controllers': [{
        'actualSize': '1150484480',
        'address': '',
        'currentImage': '',
        'endpoint': '',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'hostId': '',
        'instanceManagerName': '',
        'isExpanding': False,
        'lastExpansionError': '',
        'lastExpansionFailedAt': '',
        'lastRestoredBackup': '',
        'name': 'longhorn-testvol-0gbgrf-e-b242988d',
        'requestedBackupRestore': '',
        'running': False,
        'size': '2147483648',
        'unmapMarkSnapChainRemovedEnabled': False
    }],
    'created': '2023-09-12 06:30:12 +0000 UTC',
    'currentImage': 'longhornio/longhorn-engine:master-head',
    'dataLocality': 'disabled',
    'dataSource': '',
    'disableFrontend': False,
    'diskSelector': [],
    'encrypted': False,
    'engineImage': 'longhornio/longhorn-engine:master-head',
    'fromBackup': '',
    'frontend': 'blockdev',
    'kubernetesStatus': {
        'lastPVCRefAt': '',
        'lastPodRefAt': '2023-09-12T06:30:54Z',
        'namespace': 'default',
        'pvName': 'longhorn-testvol-0gbgrf-pv',
        'pvStatus': 'Bound',
        'pvcName': 'longhorn-testvol-0gbgrf-pvc',
        'workloadsStatus': [{
            'podName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88-xmxzx',
            'podStatus': 'Succeeded',
            'workloadName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88',
            'workloadType': 'ReplicaSet'
        }]
    },
    'lastAttachedBy': '',
    'lastBackup': '',
    'lastBackupAt': '',
    'migratable': False,
    'name': 'longhorn-testvol-0gbgrf',
    'nodeSelector': [],
    'numberOfReplicas': 3,
    'offlineReplicaRebuilding': 'disabled',
    'offlineReplicaRebuildingRequired': False,
    'purgeStatus': None,
    'ready': True,
    'rebuildStatus': [],
    'recurringJobSelector': None,
    'replicaAutoBalance': 'ignored',
    'replicaDiskSoftAntiAffinity': 'ignored',
    'replicaSoftAntiAffinity': 'ignored',
    'replicaZoneSoftAntiAffinity': 'ignored',
    'replicas': [{
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-0ff7d9b8',
        'diskID': '2a1fc81e-3ab8-463f-a818-b008f7e858a3',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-49',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-0600c183',
        'running': False
    }, {
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-5b039c13',
        'diskID': '51cbadf6-8b10-438a-b1ce-8ae00126e6ce',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-97',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-319ab5fc',
        'running': False
    }, {
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-c166e1b5',
        'diskID': '4d810ed2-9514-4b04-9559-0b2f6d830310',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-213',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-895589ea',
        'running': False
    }],
    'restoreInitiated': False,
    'restoreRequired': False,
    'restoreStatus': [],
    'restoreVolumeRecurringJob': 'ignored',
    'revisionCounterDisabled': False,
    'robustness': 'unknown',
    'shareEndpoint': '',
    'shareState': '',
    'size': '2147483648',
    'snapshotDataIntegrity': 'ignored',
    'staleReplicaTimeout': 0,
    'standby': False,
    'state': 'detached',
    'unmapMarkSnapChainRemoved': 'ignored',
    'volumeAttachment': {
        'attachments': {},
        'volume': 'longhorn-testvol-0gbgrf'
    }
}

To Reproduce

Excute test case test_recurring_jobs_when_volume_detached_unexpectedly

Expected behavior

We should have consistent test results on all distro.

Support bundle for troubleshooting

longhorn-tests-regression-4900-bundle (1).zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.1+k3s1
    • Number of management node in the cluster:1
    • Number of worker node in the cluster:3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node: 4 cores
    • Memory per node: 16GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:1
  • Impacted Longhorn resources:
    • Volume names: longhorn-testvol-0gbgrf

Additional context

N/A

@roger-ryao roger-ryao added kind/bug reproduce/rare < 50% reproducible require/qa-review-coverage Require QA to review coverage require/backport Require backport. Only used when the specific versions to backport have not been definied. labels Sep 15, 2023
@innobead innobead added this to the v1.6.0 milestone Sep 19, 2023
@innobead
Copy link
Member

@roger-ryao For any new issues found, let's just add it to the current milestone.

cc @longhorn/qa

@roger-ryao
Copy link
Author

roger-ryao commented Sep 19, 2023

@roger-ryao For any new issues found, let's just add it to the current milestone.

cc @longhorn/qa

Hi @innobead
Got it.
You previously told me not to add the milestone myself, which is why I didn't add it.

@innobead
Copy link
Member

I see, sorry for the confusion. Let's add it to the milestone to make it outstanding enough.

@ChanYiLin ChanYiLin added the priority/0 Must be implement or fixed in this release (managed by PO) label Sep 21, 2023
@derekbit derekbit added the area/resilience System or volume resilience label Sep 21, 2023
@ChanYiLin ChanYiLin added the investigation-needed Identified the issue but require further investigation for resolution (won't be stale) label Sep 22, 2023
@ChanYiLin
Copy link
Contributor

I have add more Debug message and ran the test again.
Here is the result
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4937

@ChanYiLin
Copy link
Contributor

ChanYiLin commented Sep 25, 2023

The flow of the test is

  1. recurring job creates snapshot (volume would attach and detach)
  2. recurring job creates backup (volume would attach and detach)
  3. crash engine so the Volume is in faulted state and will come back in a while (volume would detach)
  4. recurring job retries
  5. recurring job creates snapshot (volume would attach and detach)
  6. recurring job creates backup (volume would attach and detach)

The root cause of the issue can be found from my test with Debug message
Here are the logs

## Snapshot had been created, so deleted the attachment ticket
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG snapshot controller] Snapshot created so delete delete Attachment Ticket, snapshot: recurrin-f4852903-2357-46ec-9037-16137d6aea80\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"type\":\"snapshot-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80, ticket: &{snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80 snapshot-controller ip-10-0-2-223 map[disableFrontend:any] 0}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-223, attachmentTicket.NodeID: ip-10-0-2-223\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: attached\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: running\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"

## Attachment was deleted, so the new VA status was empty, and the nodeID of the Volume was also empty so Longhorn could detach the Volume 
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{}}\n"

## First time backup controller create attacment but failed because CR conflicts (Not a big deal)
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-660589ed3d0b44b8" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-660589ed3d0b44b8: handleAttachmentTicketCreation: failed to create/update attachment: Operation cannot be fulfilled on volumeattachments.longhorn.io \"longhorn-testvol-1czczl\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-223

## backup controller create attachment ticket 2nd time
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{}}"

# At the same time, Engine was being deleted
time="2023-09-22T10:37:11Z" level=info msg="Deleting instance longhorn-testvol-1czczl-e-db3461c5"
time="2023-09-22T10:37:11Z" level=info msg="Deleting engine instance" controller=longhorn-engine engine=longhorn-testvol-1czczl-e-db3461c5 node=ip-10-0-2-223

## Volume Attachment Controller handled Backup Controler Attachment Ticket
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: backup-controller-backup-660589ed3d0b44b8, ticket: &{backup-controller-backup-660589ed3d0b44b8 backup-controller ip-10-0-2-223 map[disableFrontend:any] 0}\n"

## Just in the same reconcile loop, the ticket was satisfied beacuse the volume was attached and the engine was still running
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] Create VA status for the first time, attachmentTicketID: backup-controller-backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-223, attachmentTicket.NodeID: ip-10-0-2-223\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: attached\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"

## In Backup Controller, since the attachment was satisfied and the engine was running so it tried to request engine to create backup
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-660589ed3d0b44b8 "
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: attached, volume.Robustness: healthy\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] volume.State: attached, volume.Robustness: healthy\

time="2023-09-22T10:37:11Z" level=info msg="Deleting instance longhorn-testvol-1czczl-e-db3461c5"
time="2023-09-22T10:37:11Z" level=info msg="Deleting engine instance" controller=longhorn-engine engine=longhorn-testvol-1czczl-e-db3461c5 node=ip-10-0-2-223
time="2023-09-22T10:37:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"longhorn-testvol-1czczl-e-db3461c5\", UID:\"f26c0786-d60d-4a7e-883d-8883ceaca977\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"12008\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops longhorn-testvol-1czczl-e-db3461c5"

## Failed to create Backup so the test failed.
time="2023-09-22T10:37:11Z" level=warning msg="Cannot take snapshot backup" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223
time="2023-09-22T10:37:11Z" level=warning msg="Failed to enable the backup monitor for backup backup-660589ed3d0b44b8" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223

The race condition flow

  1. When snapshot was created, Snapshot Controller deleted the attachment ticket to detach the volume
  2. Volume Controller detached the volume, set the v.Status.State = longhorn.VolumeStateDetaching and set the e.Spec.DesiredState = stopped here
  3. Before the state was updated, Backup Controller created the attachment ticket
  4. Volume Controller updated status follow the order Replica->Engine->Volume here
  5. After Volume Controller updated Engine spec and status
  • so the e.Spec.DesiredState=stopped and e.Status.CurrentState=running
  1. VolumeAttachment Controller updated the status of the attachment ticket and it was satisfied because volume was still attached. (But the volume was about to be detached, just because the state was not updated yet)
  2. Engine Controller requested IM to delete the engine process, but not yet updated engine state
  3. Backup Controller verified in checkMonitor() that
  • Attachment ticket was satisfied here
  • Engine was running here
  1. Backup Controller requested Engine to create backup and failed

I think there might have more race condition case for engine/volume/volumeattachment statue like this one.
We might need to review and see if we can improve from volumeattachment side, so it won't satisfied just because the volume is not fully detached from the previous ticket?
cc @PhanLe1010 @derekbit @innobead @roger-ryao @longhorn/dev-control-plane

@ChanYiLin
Copy link
Contributor

ChanYiLin commented Sep 25, 2023

For this issues, one simple solution is to add more verification before creating backup here

As we can see from the previous log

## In Backup Controller, since the attachment was satisfied and the engine was running so it tried to request engine to create backup
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-660589ed3d0b44b8 "
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: attached, volume.Robustness: healthy\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] volume.State: attached, volume.Robustness: healthy\

time="2023-09-22T10:37:11Z" level=info msg="Deleting instance longhorn-testvol-1czczl-e-db3461c5"
time="2023-09-22T10:37:11Z" level=info msg="Deleting engine instance" controller=longhorn-engine engine=longhorn-testvol-1czczl-e-db3461c5 node=ip-10-0-2-223

## Failed to create Backup so the test failed.
time="2023-09-22T10:37:11Z" level=warning msg="Cannot take snapshot backup" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223
time="2023-09-22T10:37:11Z" level=warning msg="Failed to enable the backup monitor for backup backup-660589ed3d0b44b8" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223

We can see that engine.desireState: stopped before creating backup, and that would lead to failure because it means the current engine is about to be deleted, we shouldn't request this engine now.

So we should at least check following condition
engine.desireState: running && engine.currentState: running && volume.State: attached before creating backup

@ChanYiLin
Copy link
Contributor

Hi @innobead
volume attachment is only introduced after 1.5.0, we don't have to backport to v1.4.x

@PhanLe1010
Copy link
Contributor

Is it correct that this issue only happen when user want to take backup of detached volume?

@ChanYiLin
Copy link
Contributor

I believe so.
User can only take backup of attached volume from UI.
In this case, since the volume is already attached, volume would not be detached by snapshot-controller and volume-attachment-controller after the creation of snapshot.
So there is no detaching and attaching between taking snapshot and taking backup.

This issue happens because the volume is detaching by previous controller and attaching by the next controller.
There is a chance that the attachment ticket of the next controller might get satisfied immediately due to the volume is not fully detached (state is not updated yet). Thus the following code logic gets mislead.

@ChanYiLin
Copy link
Contributor

Here is the log after the fix
You can see that volume needed some time to update its state to detaching, before that, the attachment ticket of backup controller was true
After that, attachment controller would then update the satisfied to false.

## attachment ticket of backup controller is satisfied, but engine desireState is stopped
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: attached, volume.Robustness: healthy\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=warning msg="Failed to get snapshot recurrin-be1c0640-132c-4695-9c4c-5b1cf621923d when syncing backup status" controller=longhorn-backup error="proxyServer=10.42.3.8:8501 destination=10.42.3.8:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.3.8:10010: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.3.8:10010: connect: connection refused\"" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"

## volume attachment controller update the ticket to no satisfied because volume is now in detaching state
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: backup-controller-backup-a38cd85ddfc34e94, ticket: &{backup-controller-backup-a38cd85ddfc34e94 backup-controller ip-10-0-2-104 map[disableFrontend:any] 0}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-104, attachmentTicket.NodeID: ip-10-0-2-104\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: detaching\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: false\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-a38cd85ddfc34e94\":{\"id\":\"backup-controller-backup-a38cd85ddfc34e94\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-26T09:32:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-a38cd85ddfc34e94\":{\"id\":\"backup-controller-backup-a38cd85ddfc34e94\",\"satisfied\":false,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"False\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-26T09:32:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-26T09:32:11Z" level=warning msg="Failed to get snapshot recurrin-be1c0640-132c-4695-9c4c-5b1cf621923d when syncing backup status" controller=longhorn-backup error="proxyServer=10.42.3.8:8501 destination=10.42.3.8:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.3.8:10010: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.3.8:10010: connect: connection refused\"" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104

## attachment ticket of backup controller is now not satisfied
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] false\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"

@longhorn-io-github-bot
Copy link

longhorn-io-github-bot commented Sep 27, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at:

    • this case is a flaky issue, 2~4/20 times in jenkins regression test.
    • -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 10
  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at

@innobead innobead added area/stability System or volume stability and removed backport/1.4.4 area/resilience System or volume resilience labels Oct 1, 2023
@innobead
Copy link
Member

innobead commented Oct 2, 2023

ref: The logic is used to determine whether a ticket is satisfied. That means having more than one ticket satisfied is possible when vol.Status.CurrentNodeID == attachmentTicket.NodeID && vol.Status.State == longhorn.VolumeStateAttached is true.

Ideally, there is only one attachment is satisfied at a time correct to achieve exclusive attachment, correct? @PhanLe1010

/controller/volume_attachment_controller.go#L728-L752

	if vol.Status.CurrentNodeID == attachmentTicket.NodeID && vol.Status.State == longhorn.VolumeStateAttached {
		if !verifyAttachmentParameters(attachmentTicket.Parameters, vol) {
			attachmentTicketStatus.Satisfied = false
			cond := types.GetCondition(attachmentTicketStatus.Conditions, longhorn.AttachmentStatusConditionTypeSatisfied)
			if cond.Reason != longhorn.AttachmentStatusConditionReasonAttachedWithIncompatibleParameters {
				log.Warnf("Volume %v has already attached to node %v with incompatible parameters", vol.Name, vol.Status.CurrentNodeID)
			}
			attachmentTicketStatus.Conditions = types.SetCondition(
				attachmentTicketStatus.Conditions,
				longhorn.AttachmentStatusConditionTypeSatisfied,
				longhorn.ConditionStatusFalse,
				longhorn.AttachmentStatusConditionReasonAttachedWithIncompatibleParameters,
				fmt.Sprintf("volume %v has already attached to node %v with incompatible parameters", vol.Name, vol.Status.CurrentNodeID),
			)
			return
		}
		attachmentTicketStatus.Satisfied = true
		attachmentTicketStatus.Conditions = types.SetCondition(
			attachmentTicketStatus.Conditions,
			longhorn.AttachmentStatusConditionTypeSatisfied,
			longhorn.ConditionStatusTrue,
			"",
			"",
		)
	}

@nitendra-suse
Copy link

Verified Failed : using longhorn yaml "https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml"
Longhorn Manager commit id "6b910e9" .

Steps followed :

  • Ran test case: pytest -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 5

Result :
Constantly some iterations of tests are failing approx 10%-20% of all runs.

longhorn-test:/integration/tests # pytest -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 5

============================================================================================================= test session starts =============================================================================================================
platform linux -- Python 3.9.18, pytest-5.3.1, py-1.11.0, pluggy-0.13.1 -- /usr/bin/python3.9
cachedir: .pytest_cache
rootdir: /integration, inifile: pytest.ini
plugins: repeat-0.9.1, order-1.0.1
collected 10 items                                                                                                                                                                                                                            

test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-1-5] 
**PASSED**
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-2-5] FAILED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-3-5] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-4-5] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-5-5] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-1-5] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-2-5] 
FAILED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-3-5] 
PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-4-5] 



PASSED

test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-5-5] 


PASSED

Exact Error :

def wait_for_volume_status(client, name, key, value,
                               retry_count=RETRY_COUNTS):
        wait_for_volume_creation(client, name)
        for i in range(retry_count):
            volume = client.by_id_volume(name)
            if volume[key] == value:
                break
            time.sleep(RETRY_INTERVAL)
>       assert volume[key] == value, f" value={value}\n. \
                volume[key]={volume[key]}\n. volume={volume}"
**E       AssertionError:  value=healthy
E       .             volume[key]=unknown**

@nitendra-suse
Copy link

@ChanYiLin : Tests are still flaky.

@innobead
Copy link
Member

innobead commented Oct 6, 2023

@nitendra-suse Let's move back to the implement pipeline.

@roger-ryao as you are the original reporter, please check the verification result by @nitendra-suse .

@roger-ryao
Copy link
Author

test_recurring_jobs_when_volume_detached_unexpectedly

Hi @nitendra-suse @innobead
I triggered the test case 60 times, and all tests passed successfully. I believe that @ChanYiLin has already resolved the issue.

You can review the test results at the following links:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5013/testReport/tests/test_recurring_job/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5014/testReport/tests/test_recurring_job/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5015/testReport/tests/test_recurring_job/

@nitendra-suse
Copy link

nitendra-suse commented Oct 12, 2023

I triggered the test case 60 times, and all tests passed successfully. I believe that @ChanYiLin has already resolved the issue.

Let me run again on my system.Thanks.

@roger-ryao roger-ryao added the severity/3 Function working but has a major issue w/ workaround label Oct 13, 2023
@nitendra-suse
Copy link

Verified Passed : using longhorn yaml "https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml"
Longhorn Manager commit id "d8dbcd1" .

Steps followed :

Ran test case: pytest -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 3

============================================================================================================= test session starts =============================================================================================================
platform linux -- Python 3.9.18, pytest-5.3.1, py-1.11.0, pluggy-0.13.1 -- /usr/bin/python3.9
cachedir: .pytest_cache
rootdir: /integration, inifile: pytest.ini
plugins: repeat-0.9.1, order-1.0.1
collected 6 items                                                                                                                                                                                                                             

test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-1-3] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-2-3] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[s3-3-3] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-1-3] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-2-3] PASSED
test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly[nfs-3-3] PASSED

@innobead innobead removed the investigation-needed Identified the issue but require further investigation for resolution (won't be stale) label Oct 29, 2023
@derekbit derekbit moved this to Closed in Longhorn Sprint Aug 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/stability System or volume stability backport/1.5.2 kind/bug priority/0 Must be implement or fixed in this release (managed by PO) reproduce/rare < 50% reproducible require/backport Require backport. Only used when the specific versions to backport have not been definied. require/qa-review-coverage Require QA to review coverage severity/3 Function working but has a major issue w/ workaround
Projects
Status: Closed
Development

No branches or pull requests

7 participants