[BUG] All volumes degraded at the same time #10091
Can you provide a support bundle?
Yes, I just did. I wanted to complete the bug form first so I could provide the GitHub bug report link with the support bundle. I sent it to longhorn-support-bundle at suse dot com.
One quick question: are the failed replicas on the same node?
No, they were all over the place, mostly on wk01, 02 and 03 (I assume you see hostnames in the support bundle). It did recover fairly quickly; only one volume is rebuilding. Maybe I panicked too quickly, but I still don't think this should happen, so I really hope you can see why it happens. Thanks for the quick responses.
BTW, the network bandwidth is not suitable for Longhorn; a 10 Gbps NIC is recommended. Please check the best practice https://longhorn.io/docs/1.7.2/best-practices/#volume-performance-optimization.
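(For reference, the negotiated link speed on each Longhorn node can be checked with ethtool; `eth0` here is a placeholder for whatever interface carries the storage traffic.)

```bash
# Show the negotiated link speed of the NIC carrying Longhorn traffic
# (replace eth0 with the actual interface name on your nodes)
ethtool eth0 | grep -i 'speed'
```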
It's not a production cluster; it's one of the compromises to keep costs to a minimum. It's also a single disk, which is not recommended. There will be dedicated disks added in the near future though.
@Hr46ph Has the support bundle been provided?
Thanks @Hr46ph.
In which log do you see those, so that I can look into them further? The network is professional equipment with certified cabling. In the network management software I don't see errors (which doesn't mean there aren't other issues). When I grep -ri over the extracted support bundle I see many hits on timeouts, but most are harmless. If you see anything specific, can you share it please so I can look into it? Thanks!
You can check the messages log.
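For anyone following along, a rough way to search the extracted support bundle for the timeouts mentioned above (the directory name is whatever your extracted bundle is called):

```bash
# Search all files in the extracted support bundle for timeout messages
grep -rin -E 'timed out|timeout' ./longhorn-support-bundle/

# If the bundle includes the node 'messages' logs, search those specifically
find ./longhorn-support-bundle/ -name 'messages*' -exec grep -in -E 'timeout|error' {} +
```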
For this kind of low-spec/non-prod environment, you might tune the timeout via https://longhorn.io/docs/1.7.2/references/settings/#engine-replica-timeout.
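A sketch of how that setting could be inspected and raised with kubectl, assuming the default longhorn-system namespace (the value is in seconds, and 15 is just an illustrative number; the Longhorn UI works equally well):

```bash
# Show the current engine replica timeout (seconds)
kubectl -n longhorn-system get settings.longhorn.io engine-replica-timeout -o jsonpath='{.value}'

# Raise it for a low-spec environment, e.g. to 15 seconds
kubectl -n longhorn-system patch settings.longhorn.io engine-replica-timeout \
  --type merge -p '{"value":"15"}'
```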
I made a couple of changes; I think I have been asking too much of Longhorn. The timeouts you are referring to were all towards a single Longhorn manager instance. Around 9 AM that particular Longhorn manager pod crashed:
From the logs I can see that before it crashed, it was attempting to hash snapshots but gave warnings that another node was already doing that. In Grafana I noticed several instance manager pods were capped at just below 1 core of CPU usage, which more or less matches the 12% default in the Longhorn settings (12% works out to roughly 75% of a single core, so it's close enough I suppose). I can't really see a reason for the termination of the instance manager pod, but I do think my recurring jobs might have been asking too much of Longhorn, so I increased the Guaranteed Instance Manager CPU for the V1 Data Engine to 33% (about 2 cores). All in all I think I might have caused this issue myself by making the settings too demanding and asking too much of Longhorn 😳. With these more relaxed settings I will monitor. Thanks for the assistance, appreciate it!
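To make the core arithmetic explicit: the Guaranteed Instance Manager CPU setting is a percentage of each node's allocatable CPU, so on a node with 6 allocatable cores (a guess based on the numbers above) 12% is about 0.72 core and 33% is about 2 cores. The allocatable CPU can be checked with:

```bash
# Allocatable CPU per node; the guaranteed instance manager CPU is a percentage of this
kubectl get nodes -o custom-columns='NODE:.metadata.name,ALLOCATABLE_CPU:.status.allocatable.cpu'

# Example with 6 allocatable cores:
#   12% * 6 = 0.72 core  (matches the "just below 1 core" cap seen in Grafana)
#   33% * 6 = 1.98 cores (~2 cores)
```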
I cannot find any clues related to the termination of the instance manager pod.
For the CPU usage caused by snapshot checksum hashing, we will improve the background IO throttling. |
Let's keep the ticket open for tracking this.
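For reference, the hashing discussed above is controlled by the snapshot data integrity setting; a sketch of checking or relaxing it with kubectl (an assumption on my part that this is the relevant knob here):

```bash
# Show the current snapshot data integrity mode (disabled / enabled / fast-check)
kubectl -n longhorn-system get settings.longhorn.io snapshot-data-integrity -o jsonpath='{.value}'

# Relax it if the periodic checksum hashing is too heavy for the hardware
kubectl -n longhorn-system patch settings.longhorn.io snapshot-data-integrity \
  --type merge -p '{"value":"fast-check"}'
```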
Possibly, because I had 5 concurrent jobs (reduced to 2 now). I didn't realize the hashing, or the checking of it, could cause a lot of disk I/O. Having said that, I don't really see that many IOPS or much throughput, but it might have been a combination of things, such as the limiting of CPU usage.
75% of a single core isn't that much, hence I increased it a bit. In combination with fewer concurrent actions this should hopefully help the situation. I will keep an eye on it!
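Assuming the 5-to-2 change refers to the `concurrency` field on the RecurringJob resources, a sketch of applying it with kubectl (the job name here is hypothetical):

```bash
# Lower the number of volumes a recurring job processes in parallel
# ('daily-snapshot' is a placeholder for the actual RecurringJob name)
kubectl -n longhorn-system patch recurringjobs.longhorn.io daily-snapshot \
  --type merge -p '{"spec":{"concurrency":2}}'
```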
Describe the bug
Without apparent reason, all volumes suddenly became degraded: some with one failed replica, others with two. Several pods are left without data locality.
To Reproduce
I have no idea how to reproduce it; it just happened (not for the first time, I might add).
Expected behavior
Reliability. This stuff scares me.
Support bundle for troubleshooting
Will add after posting, as I need a ticket number.
Environment
Additional context
I restarted a postgres pod, went to check its actual size in the Longhorn GUI, and all volumes went degraded at once.
Workaround and Mitigation
Patience, I hope. It's still recovering.