[BUG] CSI components CrashLoopBackOff, failed to connect to unix://csi/csi.sock after cluster restart #7116
Comments
I cannot find any clue in the support bundle. My immediate guess was SELinux. However, SELinux should have been set to permissive given the Jenkins parameters (I am not able to confirm or deny this from any of the logs we have). Additionally, it should have affected all nodes similarly. I kicked off https://ci.longhorn.io/job/private/job/longhorn-e2e-test/129/ to try to reproduce it for live debugging. If that doesn't work, we will likely need to wait for another occurrence.
I hit this one when restarting a one-node cluster. I was planning to create a ticket, but it looks like one already exists here.
This didn't help to reproduce the issue.
I'll look into this a bit when I have time to see if there is an improvement we can make. It seems different from the original event, since, in that case, the CSI components never recovered.
I'm having this same issue. What kind of troubleshooting can we actually perform to identify the cause? This is on a five-node cluster, and it was working fine until a power issue took it out; now one node is looping on this error:
Hello @yaleman. Could we have a support bundle, please, to confirm the identical symptoms and look for additional clues? Please upload it (or paste a link) here, or send it to longhorn-support-bundle@suse.com. Also, per the analysis below:
Here are the latest longhorn-csi-plugin logs from the support bundle:
There is no evidence of anything wrong with longhorn-csi-plugin in this latest run. Here are the latest livenessprobe logs from the support bundle:
30 seconds after livenessprobe starts, it fails. From the code, livenessprobe is expected to loop here indefinitely instead of failing. Due to the changes mentioned above, that does not happen (a rough sketch of the bounded-dial behavior follows this comment). The following sequence of events seems possible (though I have not managed to reproduce it):
Additional evidence:
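To make the failure mode above concrete, here is a minimal Go sketch. It is a hypothetical illustration, not the actual livenessprobe source: the socket path and the 30-second deadline mirror the symptoms in the logs, and a blocking `grpc.DialContext` stands in for the sidecar's startup connection. With the deadline, a missing socket makes the process exit, which drives the container into CrashLoopBackOff; without the deadline (the older behavior the comment describes), the dial would keep retrying until the socket appeared.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical stand-in for the sidecar's startup connection: a blocking
	// dial bounded by a 30s deadline gives up if the CSI socket is absent.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	conn, err := grpc.DialContext(ctx, "unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithBlock()) // block until connected or the deadline expires
	if err != nil {
		// Exiting non-zero here is what sends the container into
		// CrashLoopBackOff after repeated failures.
		log.Fatalf("failed to connect to unix://csi/csi.sock: %v", err)
	}
	defer conn.Close()
	log.Println("connected to the CSI socket")
}
```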
One mystery solved: #6920 bumped the version of livenessprobe (and other CSI components) in the deployment manifests. This change hasn't made it into any released versions yet, though. It is expected to see it in
Steps to reproduce:
In the example below, I kept the
Deleting
I replaced
Since the containers were deep into backoff when I restored the service, it still took a while to recover (I had to wait for the longhorn-csi-plugin container to be allowed to restart). However, full recovery eventually occurred.
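The slow recovery follows from kubelet's restart backoff: per the Kubernetes documentation, a crashing container is restarted with an exponentially growing delay (roughly 10s, doubling per failed restart) capped at five minutes. The small Go sketch below only models that documented schedule; it is not kubelet source code, and the exact numbers are illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative model of kubelet's CrashLoopBackOff schedule: ~10s initially,
// doubling per failed restart, capped at 5m. Not kubelet source code.
func main() {
	delay := 10 * time.Second
	const maxDelay = 5 * time.Minute

	var elapsed time.Duration
	for restart := 1; restart <= 8; restart++ {
		fmt.Printf("restart %d: next attempt in %v (elapsed %v)\n",
			restart, delay, elapsed)
		elapsed += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	// A pod "deep into backoff" is retried only every 5 minutes, so even a
	// fixed root cause takes several minutes to clear.
}
```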
Ideas to fix (in order of preference?):
Mine seems to have been sorted by restarting the node; it's transient weirdness, unfortunately. I will send a support bundle once I get one.
Yes, I think that fits with the analysis I've done. Even without the bundle, can you confirm which version of the
I reviewed #6916, and it looks like we really do need to go to livenessprobe
The upstream bug has some traction, but it's unlikely we'll see a version we can grab and test before v1.6.0 releases. I'll move forward with my mitigation PR.
Pre Ready-For-Testing Checklist
ref: #7428
Verified passed on master-head (longhorn-manager 859d438) following test steps #7116 (comment) and longhorn/longhorn-manager#2388 (comment). Also ran negative test case Restart Cluster While Workload Heavy Writing.
Describe the bug (🐛 if you encounter this issue)
After a cluster restart (rebooting all k8s nodes, including the control plane node), all CSI components got stuck in CrashLoopBackOff:
They are all unable to connect to unix://csi/csi.sock:

To Reproduce
Run negative test case Restart Cluster While Workload Heavy Writing repeatedly.

Expected behavior
Support bundle for troubleshooting
supportbundle_ae1d1892-8da7-4733-97d7-16326976bb0e_2023-11-16T03-17-09Z.zip
worker node logs:
worker_nodes_log.txt
Environment
Additional context
https://suse.slack.com/archives/C02DR3N5T24/p1700095454722879?thread_ts=1699951700.806219&cid=C02DR3N5T24