[BUG] Enabling replica-auto-balance tries to replicate to disabled nodes causing lots of errors in the logs and in the UI #6508

Closed
jsalatiel opened this issue Aug 11, 2023 · 4 comments
Assignees
Labels
area/volume-replica-scheduling Volume replica scheduling related backport/1.4.5 backport/1.5.4 kind/bug priority/0 Must be implement or fixed in this release (managed by PO) require/auto-e2e-test Require adding/updating auto e2e test cases if they can be automated require/qa-review-coverage Require QA to review coverage
Milestone

Comments

@jsalatiel

Describe the bug (🐛 if you encounter this issue)

I have a three-zone cluster with create-default-disk-labeled-nodes set to true.
Zone 1 and Zone 2 each have 2 untainted nodes labeled node.longhorn.io/create-default-disk=true, and each of those nodes stores replicas.
Zone 3 has a single node that stores no replicas (so the label is not set), but it can still mount volumes from the longhorn storage class.
It looks like this in the UI.

image

All the volumes are in healthy state:

image

The moment I set replica-auto-balance to best-effort, every volume starts reporting that it cannot be scheduled:
image

image

I suppose that, since node0 is in another zone, best-effort tries to schedule a replica there even though that node is disabled, which it should not do.

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubespray
    • Number of management node in the cluster: 3
    • Number of worker node in the cluster: 5

Additional context

@jsalatiel jsalatiel added kind/bug require/qa-review-coverage Require QA to review coverage labels Aug 11, 2023
@jsalatiel jsalatiel changed the title [BUG] Enabling replica-auto-balance try to replicated for disabled nodes causing lots of errors in the logs and in the UI [BUG] Enabling replica-auto-balance tries to replicate to disabled nodes causing lots of errors in the logs and in the UI Aug 11, 2023
@jsalatiel
Author

@c3y1huang any progress on this?

@innobead innobead added this to the v1.6.0 milestone Sep 12, 2023
@innobead innobead added area/volume-replica-scheduling Volume replica scheduling related priority/0 Must be implement or fixed in this release (managed by PO) labels Sep 12, 2023
@c3y1huang c3y1huang moved this from New to Resolved/Scheduled in Community Review Sprint Sep 26, 2023
@c3y1huang
Contributor

> @c3y1huang any progress on this?

@jsalatiel I've had a few other tasks on my plate, but I will have the fix out soon.

@longhorn-io-github-bot

longhorn-io-github-bot commented Dec 7, 2023

Pre Ready-For-Testing Checklist

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: in issue description

  • Is there a workaround for the issue? If so, where is it documented?
    The workaround is at:

  • Does the PR include the explanation for the fix or the feature?

  • Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart?
    The PR for the YAML change is at:
    The PR for the chart change is at:

  • Have the backend code been merged (Manager, Engine, Instance Manager, BackupStore etc) (including backport-needed/*)?
    The PR is at fix(replica-auto-balance): loop when node has no schedulable disk longhorn-manager#2336

  • Which areas/issues this PR might have potential impacts on?
    Area replica scheduling, replica auto-balance
    Issues

  • If labeled: require/LEP Has the Longhorn Enhancement Proposal PR submitted?
    The LEP PR is at

  • If labeled: area/ui Has the UI issue filed or ready to be merged (including backport-needed/*)?
    The UI issue/PR is at

  • If labeled: require/doc Has the necessary document PR submitted or merged (including backport-needed/*)?
    The documentation issue/PR is at

  • If labeled: require/automation-e2e Has the end-to-end test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue (including backport-needed/*)
    The automation skeleton PR is at test(integration): replica-auto-balance when disabled disk scheduling in zone longhorn-tests#1615
    The automation test case PR is at test(integration): replica-auto-balance when disabled disk scheduling in zone longhorn-tests#1615
    The issue of automation test case implementation is at (please create by the template)

  • If labeled: require/automation-engine Has the engine integration test been merged (including backport-needed/*)?
    The engine automation PR is at

  • If labeled: require/manual-test-plan Has the manual test plan been documented?
    The updated manual test plan is at

  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at

@chriscchien
Contributor

Verified pass on longhorn-master (longhorn-manager a074cd) with test steps

On master-head, setting replica-auto-balance to best-effort no longer schedules replicas on nodes that have no schedulable disk.

Projects
Status: Resolved
Status: Closed

5 participants