[BUG] After crashed engine process, volume stuck in Unknown state #6699
Comments
@roger-ryao For any new issues found, let's just add it to the current milestone. cc @longhorn/qa |
Hi @innobead |
I see, sorry for the confusion. Let's add it to the milestone to make it outstanding enough. |
I have added more |
The flow of the test is
The root cause of the issue can be found from my test with
The race condition flow
I think there might be more race condition cases like this one for the engine/volume/volumeattachment status. |
For this issue, one simple solution is to add more verification before creating the backup here. As we can see from the previous log
We can see that
So we should at least check the following condition |
Hi @innobead |
Is it correct that this issue only happens when the user wants to take a backup of a detached volume? |
I believe so. This issue happens because the volume is being detached by the previous controller and attached by the next controller at the same time. |
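To make the race described above concrete, here is a minimal sketch of the kind of guard that could run before a backup is created. It is not the actual fix: `VolumeView`, `readyForBackup`, and the three checks are hypothetical names and assumptions made for illustration, and the exact conditions proposed in this thread are not reproduced here.

```go
package main

import "fmt"

// VolumeState is a simplified, hypothetical stand-in for the volume states
// discussed in this thread. It is not the Longhorn API type.
type VolumeState string

const (
	StateAttaching VolumeState = "attaching"
	StateAttached  VolumeState = "attached"
	StateDetaching VolumeState = "detaching"
)

// VolumeView holds only the fields this sketch inspects (all assumed).
type VolumeView struct {
	State         VolumeState
	CurrentNodeID string
	EngineRunning bool
}

// readyForBackup reports whether the observed volume looks settled on the
// expected node. It returns false plus a reason while the volume is still
// in transition, for example when one controller is detaching it and
// another is about to attach it.
func readyForBackup(v VolumeView, expectedNode string) (bool, string) {
	if v.State != StateAttached {
		return false, fmt.Sprintf("volume is %s, not attached", v.State)
	}
	if v.CurrentNodeID != expectedNode {
		return false, fmt.Sprintf("volume is on node %q, expected %q", v.CurrentNodeID, expectedNode)
	}
	if !v.EngineRunning {
		return false, "engine is not running"
	}
	return true, ""
}

func main() {
	observations := []VolumeView{
		{State: StateDetaching, CurrentNodeID: "node-1", EngineRunning: true},
		{State: StateAttached, CurrentNodeID: "node-1", EngineRunning: true},
	}
	for _, v := range observations {
		ok, reason := readyForBackup(v, "node-1")
		fmt.Printf("state=%s ready=%v reason=%q\n", v.State, ok, reason)
	}
}
```

A caller would retry later instead of starting the backup whenever the guard returns false, so a backup never starts against a volume that is mid-detach.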
Here is the log after the fix
|
Pre Ready-For-Testing Checklist
|
ref: The logic is used to determine whether a ticket is satisfied. That means having more than one ticket satisfied is possible when
Ideally, only one attachment is satisfied at a time to achieve exclusive attachment, correct? @PhanLe1010
/controller/volume_attachment_controller.go#L728-L752
if vol.Status.CurrentNodeID == attachmentTicket.NodeID && vol.Status.State == longhorn.VolumeStateAttached {
	if !verifyAttachmentParameters(attachmentTicket.Parameters, vol) {
		attachmentTicketStatus.Satisfied = false
		cond := types.GetCondition(attachmentTicketStatus.Conditions, longhorn.AttachmentStatusConditionTypeSatisfied)
		if cond.Reason != longhorn.AttachmentStatusConditionReasonAttachedWithIncompatibleParameters {
			log.Warnf("Volume %v has already attached to node %v with incompatible parameters", vol.Name, vol.Status.CurrentNodeID)
		}
		attachmentTicketStatus.Conditions = types.SetCondition(
			attachmentTicketStatus.Conditions,
			longhorn.AttachmentStatusConditionTypeSatisfied,
			longhorn.ConditionStatusFalse,
			longhorn.AttachmentStatusConditionReasonAttachedWithIncompatibleParameters,
			fmt.Sprintf("volume %v has already attached to node %v with incompatible parameters", vol.Name, vol.Status.CurrentNodeID),
		)
		return
	}
	attachmentTicketStatus.Satisfied = true
	attachmentTicketStatus.Conditions = types.SetCondition(
		attachmentTicketStatus.Conditions,
		longhorn.AttachmentStatusConditionTypeSatisfied,
		longhorn.ConditionStatusTrue,
		"",
		"",
	)
} |
Verified Failed: using longhorn yaml "https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml"
Steps followed:
Result:
longhorn-test:/integration/tests # pytest -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 5
Exact Error:
|
@ChanYiLin : Tests are still flaky. |
@nitendra-suse Let's move back to the implement pipeline. @roger-ryao as you are the original reporter, please check the verification result by @nitendra-suse . |
Hi @nitendra-suse @innobead You can review the test results at the following links: |
Let me run it again on my system. Thanks. |
Verified Passed: using longhorn yaml "https://raw.githubusercontent.com/longhorn/longhorn/master/deploy/longhorn.yaml"
Steps followed:
Ran test case: pytest -vs test_recurring_job.py::test_recurring_jobs_when_volume_detached_unexpectedly --count 3
|
Describe the bug (🐛 if you encounter this issue)
Test case test_recurring_jobs_when_volume_detached_unexpectedly fails intermittently on master-head and v1.5.x-head, with a failure rate of approximately 2 out of 20.
To Reproduce
Execute test case
test_recurring_jobs_when_volume_detached_unexpectedly
Expected behavior
We should have consistent test results on all distros.
Support bundle for troubleshooting
longhorn-tests-regression-4900-bundle (1).zip
Environment
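Installation method: Kubectl
Kubernetes distro and version: v1.28.1+k3s1
Number of management nodes in the cluster: 1
Number of worker nodes in the cluster: 3
OS type and version: SLES 15-sp5
CPU per node: 4 cores
Memory per node: 16GB
Disk type: SSD
Underlying infrastructure: AWS
Number of Longhorn volumes in the cluster: 1
Impacted Longhorn resources (volume names): longhorn-testvol-0gbgrf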
Additional context
N/A