
[BUG] timestamp or checksum not matched in test_snapshot_hash_detect_corruption test case #6145

Closed · @yangchiu

Description

Describe the bug

In the test cases test_snapshot_hash_detect_corruption_in_global_fast_check_mode and test_snapshot_hash_detect_corruption_in_global_enabled_mode, check_snapshot_checksums_and_change_timestamps verifies the checksum value and the ctime of the checksum file before the snapshot is corrupted:

                # Check checksums in snapshot resource and the calculated value
                # are matched
                checksum = get_checksum_from_snapshot_disk_file(data_path,
                                                                s.name)
                print(f'snapshot {s.name}: '
                      f'checksum in resource={s.checksum}, '
                      f'checksum recorded={checksum}')
                assert checksum == s.checksum

                # Check ctime in checksum file and from stat are matched
                ctime_recorded = get_ctime_in_checksum_file(disk_path)
                ctime = get_ctime_from_snapshot_disk_file(data_path, s.name)

                print(f'snapshot {s.name}: '
                      f'ctime recorded={ctime_recorded}, '
                      f'ctime={ctime}')

                assert ctime_recorded == ctime
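
For context, a minimal sketch of what these helpers could look like is below. The real implementations live in the longhorn-tests repo; the sidecar file name, JSON field names, and timestamp handling shown here are assumptions for illustration, as is the assumption that disk_path in the snippet above points at the checksum sidecar file.

    # Hypothetical sketch of the helpers used above; file layout and field
    # names are assumptions, not the exact longhorn-tests implementation.
    import json
    import subprocess


    def snapshot_disk_file(data_path, snapshot_name):
        # Snapshot disk files on a replica are assumed to be named
        # volume-snap-<snapshot name>.img under the replica data path.
        return f"{data_path}/volume-snap-{snapshot_name}.img"


    def get_checksum_from_snapshot_disk_file(data_path, snapshot_name):
        # Read the checksum recorded in the ".checksum" sidecar file
        # next to the snapshot disk file.
        with open(snapshot_disk_file(data_path, snapshot_name) + ".checksum") as f:
            return json.load(f)["checksum"]


    def get_ctime_in_checksum_file(checksum_file_path):
        # Read the change time that was recorded inside the checksum file
        # when the checksum was calculated.
        with open(checksum_file_path) as f:
            return json.load(f)["change_time"]


    def get_ctime_from_snapshot_disk_file(data_path, snapshot_name):
        # Query stat(1) for the current change time of the snapshot disk file.
        out = subprocess.check_output(
            ["stat", "-c", "%z", snapshot_disk_file(data_path, snapshot_name)],
            text=True)
        return out.strip()

Under a layout like this, a ctime mismatch means the snapshot disk file changed after its checksum was recorded, and a checksum mismatch means the value stored in the snapshot resource diverged from the one recorded on disk.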

But this check fails intermittently. In some runs the checksum does not match:
https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/524/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/64/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
In other runs the ctime of the checksum file does not match:
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/59/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-arm64/15/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/6/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_enabled_mode/
https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/12/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/
https://ci.longhorn.io/job/public/job/master/job/rhel/job/amd64/job/longhorn-tests-rhel-amd64/62/testReport/tests/test_snapshot/test_snapshot_hash_detect_corruption_in_global_fast_check_mode/

This could be hard to reproduce manually because of the tedious and time-consuming test setup. In addition, another issue, #6129, also affects this test case, so a failure could be caused by either the issue addressed in this ticket or the one addressed in #6129.

This issue may have been introduced after v1.5.0-rc2; at least, we did not observe it in v1.5.0-rc1.

To Reproduce

Run test case test_snapshot_hash_detect_corruption_in_global_fast_check_mode or test_snapshot_hash_detect_corruption_in_global_enabled_mode
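
One hypothetical way to run only these two cases from the longhorn-tests repository with pytest is sketched below; the test file path and the usual prerequisites of the integration suite (a deployed Longhorn installation reachable from the test environment) are assumptions here.

    # Hypothetical invocation: select the two affected cases by name with
    # pytest's -k expression. Assumes the longhorn-tests integration suite's
    # prerequisites (a running Longhorn cluster) are already in place.
    import pytest

    pytest.main([
        "manager/integration/tests/test_snapshot.py",
        "-k",
        "test_snapshot_hash_detect_corruption_in_global_fast_check_mode or "
        "test_snapshot_hash_detect_corruption_in_global_enabled_mode",
    ])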

Expected behavior

The checksum stored in the snapshot resource should match the checksum recorded in the checksum file on disk, and the ctime recorded in the checksum file should match the ctime of the snapshot disk file, so the checks in check_snapshot_checksums_and_change_timestamps should pass consistently before the snapshot is corrupted.

Log or Support bundle

If applicable, add the Longhorn managers' log or support bundle when the issue happens.
You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Add any other context about the problem here.

Labels

  • area/snapshot: Volume snapshot (in-cluster snapshot or external backup)
  • area/volume-data-integrity: Volume data integrity related
  • investigation-needed: Identified the issue but require further investigation for resolution (won't be stale)
  • kind/bug
  • priority/0: Must be implemented or fixed in this release (managed by PO)
  • reproduce/often: 80 - 50% reproducible
