[BUG] Regression in 1.6.x-head, significant increase in execution time #9439

Closed
roger-ryao opened this issue Sep 11, 2024 · 6 comments
Labels
  • kind/bug
  • kind/regression: Regression which has worked before
  • priority/1: Highly recommended to implement or fix in this release (managed by PO)
  • require/backport: Require backport. Only used when the specific versions to backport have not been defined.

@roger-ryao

roger-ryao commented Sep 11, 2024

Describe the bug

While running the regression tests for the v1.6.3-rc1 build, we found that the Jenkins v1.6.x-head amd64 job has been taking approximately 20 hours to complete since build 164, against an expected completion time of about 16 hours. The test results show a significant increase in execution time for test_restore_basic and test_allow_volume_creation_with_degraded_availability_restore.

Version      Test                 Execution time
v1.6.3-rc1   test_restore_basic   12 min
v1.6.2       test_restore_basic   7 min

To Reproduce

Rerun the jenkins job https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7559/

Expected behavior

We need to investigate whether this is a performance issue.

Support bundle for troubleshooting

longhorn-tests-regression-7559-bundle.zip

Environment

  • Longhorn version: v1.6.3-rc1
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version: SLES 15-sp6
    • Kernel version:
    • CPU per node: 4
    • Memory per node: 16
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster: 1

Additional context

N/A

Workaround and Mitigation

N/A

@roger-ryao roger-ryao added the kind/bug, require/qa-review-coverage, and require/backport labels Sep 11, 2024
@github-project-automation github-project-automation bot moved this to New Issues in Longhorn Sprint Sep 11, 2024
@derekbit
Member

@ChanYiLin Can you help investigate the slowness of the test cases? Thank you.

@derekbit derekbit added the priority/1 label Sep 11, 2024
@derekbit derekbit added this to the v1.6.3 milestone Sep 11, 2024
@derekbit derekbit added the investigation-needed label Sep 11, 2024
@yangchiu
Member

Based on the debug console log for #9455 on v1.6.x-head for amd64, a replica restoration could take more than 10 minutes to start:

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/7568/consoleFull

wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (870)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (871)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (872)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (873)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (874)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (875)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (876)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (877)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (878)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (879)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (880)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (881)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (882)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 25 ... (883)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 25 ... (884)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 25 ... (885)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 25 ... (886)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 25 ... (887)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 44 ... (888)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 44 ... (889)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 44 ... (890)
wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 44 ... (891)

A delay this long before the restore starts is likely not expected behavior.
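To locate the point where a restore finally begins in a long console log, the wait-loop lines can be scanned for the first non-zero progress value. The following is a minimal sketch, assuming the line format shown in the excerpt above; the function name `first_progress_attempt` is hypothetical and not part of the Longhorn test suite. The log does not state the retry interval, so the attempt number is reported as-is rather than converted to elapsed time.

```python
import re

# Matches the wait-loop lines from the console log excerpt. The state field
# is empty until the restore actually starts, so \w* must allow zero characters.
LINE_RE = re.compile(
    r"r\.state = (?P<state>\w*), r\.progress = (?P<progress>\d+) \.\.\. \((?P<attempt>\d+)\)"
)


def first_progress_attempt(lines):
    """Return the attempt number of the first line with non-zero progress, or None."""
    for line in lines:
        match = LINE_RE.search(line)
        if match and int(match.group("progress")) > 0:
            return int(match.group("attempt"))
    return None


sample = [
    "wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = , r.progress = 0 ... (877)",
    "wait for volume longhorn-testvol-20ngs4-restore-14 restored r.state = in_progress, r.progress = 9 ... (878)",
]
print(first_progress_attempt(sample))  # prints 878
```

Running this over the full console log of a fast build and a slow build would show whether the extra time is spent waiting for the restore to start or in the restore itself.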

@yangchiu yangchiu mentioned this issue Sep 12, 2024
19 tasks
@longhorn-io-github-bot

longhorn-io-github-bot commented Sep 13, 2024

Pre Ready-For-Testing Checklist

  • Where are the reproduction steps/test steps documented?
    The reproduction steps/test steps are at:
    • Steps:
      • Rerun the v1.6.x regression test; there should be no performance drop.

Fixes:

@derekbit derekbit added the kind/regression label and removed the investigation-needed label Sep 13, 2024
@derekbit
Member

The execution time is back to normal.

Image

@roger-ryao
Author

Verified on v1.6.3-rc2 20240918

Test steps: #9439 (comment)

Result: Passed

@github-project-automation github-project-automation bot moved this from Testing to Closed in Longhorn Sprint Sep 18, 2024
@roger-ryao
Author

Remove the require/qa-review-coverage label because the review is complete and no further action is required for this ticket.

@roger-ryao roger-ryao removed the require/qa-review-coverage label Sep 19, 2024
Projects
Status: Closed
Development

No branches or pull requests

5 participants