[Bug] Degraded volume generates a failed replica that makes the volume unschedulable #3220
cc @longhorn/qa

Update: This issue is still present in recent builds and makes the test case flaky.
In this test case, there is a failed replica, and a new replica is created to replenish it during rebuilding. The scheduler periodically tries to schedule the replicas to nodes/disks. When the failed replica is scheduled first and succeeds, the newly created one can be cleaned up according to the logic. If the new replica is scheduled first instead, the failed one can be neither scheduled nor cleaned up, and the volume becomes unschedulable. So the flaky result is due to the scheduling order of the replicas. I have confirmed the root cause: after sorting the replicas by their creationTimestamp, the failed replica is always scheduled first, so the newly created replica is not scheduled and can be cleaned up (see the sketch below).
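A minimal sketch of the idea behind the fix, assuming illustrative types: sorting candidates oldest-first makes the scheduling order deterministic, so the older failed replica is always considered before its newer replacement. The `Replica` struct and `sortReplicasByCreation` name here are hypothetical stand-ins, not the actual longhorn-manager API.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Replica is an illustrative stand-in for the real longhorn-manager type.
type Replica struct {
	Name              string
	FailedAt          string // non-empty once the replica has failed
	CreationTimestamp time.Time
}

// sortReplicasByCreation orders candidates oldest-first, so the failed
// replica (created before its replacement) is always scheduled first and
// the redundant new replica can then be cleaned up deterministically.
func sortReplicasByCreation(replicas []*Replica) {
	sort.SliceStable(replicas, func(i, j int) bool {
		return replicas[i].CreationTimestamp.Before(replicas[j].CreationTimestamp)
	})
}

func main() {
	now := time.Now()
	replicas := []*Replica{
		{Name: "replica-new", CreationTimestamp: now},
		{Name: "replica-failed", FailedAt: "2021-11-04T09:09:49Z", CreationTimestamp: now.Add(-time.Hour)},
	}
	sortReplicasByCreation(replicas)
	fmt.Println(replicas[0].Name) // the older, failed replica is considered first
}
```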
@derekbit is this a regression, or an existing issue that has been around for a while (since 1.2.x)?
After checking the logic, I confirmed it's an existing issue in v1.2.x as well.
In the test case longhorn/longhorn#3220 (comment), a replica cannot be scheduled to a node, but its spec.failedAt is set in https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L1317. The strict constraints of cleanupFailedToScheduledReplicas() prevent the failed replica from being cleaned up.

Longhorn 3220
Signed-off-by: Derek Su <derek.su@suse.com>
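To make the failure mode concrete, here is a simplified, hypothetical sketch of how an overly strict cleanup guard strands a failed-to-schedule replica. It does not reproduce the real cleanupFailedToScheduledReplicas() code; the `replica` struct and the healthy/desired-count condition are assumptions for illustration only.

```go
package main

import "fmt"

// replica is an illustrative stand-in for the real longhorn-manager type.
type replica struct {
	Name     string
	FailedAt string // set even when the failure is only a scheduling failure
}

// cleanupFailedToScheduledReplicas (sketch) removes failed replicas only
// when the volume already has the desired number of healthy replicas.
// Because a replica that merely failed to schedule also has FailedAt set,
// it is never removed while the volume is degraded, so the volume stays
// unschedulable.
func cleanupFailedToScheduledReplicas(replicas []replica, healthyCount, desiredCount int) []replica {
	kept := make([]replica, 0, len(replicas))
	for _, r := range replicas {
		if r.FailedAt != "" && healthyCount >= desiredCount {
			continue // cleaned up only when the volume is already fully redundant
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	rs := []replica{{Name: "replica-failed", FailedAt: "2021-11-04T09:09:49Z"}}
	// Degraded volume: 2 healthy of 3 desired, so the failed replica survives.
	fmt.Println(len(cleanupFailedToScheduledReplicas(rs, 2, 3))) // prints 1
}
```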
Pre Ready-For-Testing Checklist
longhorn/longhorn-manager#1371
Closing this ticket because the issue no longer happens in recent builds after the fix was merged.
In the test case longhorn/longhorn#3220 (comment), a replica cannot be scheduled to a node, but its spec.failedAt is set in https://github.com/longhorn/longhorn-manager/blob/master/controller/volume_controller.go#L1317. The strict constraints of cleanupFailedToScheduledReplicas() prevent the failed replica from being cleaned up.

Longhorn 3220
Signed-off-by: Derek Su <derek.su@suse.com>
(cherry picked from commit fedb7eb)
Describe the bug
This was found by the automated test case test_basic.py::test_allow_volume_creation_with_degraded_availability and can be reproduced by hand (reproduction rate about 20%).
Write data, detach and reattach a degraded volume, then enable scheduling on the node that caused the volume to be degraded; sometimes an extra failed replica appears and makes volume scheduling fail.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The volume should become schedulable, but sometimes a failed replica leaves the volume unschedulable.
Log
longhorn-support-bundle_8f6ba9b9-f232-4b14-a2e0-794fc8caa96f_2021-11-04T09-09-49Z.zip
Log related to failed replica name from support-bundle:
Environment:
Longhorn master-head