[Bug]: Database crashing into a degraded state #36217
Comments
@nairan-deshaw quick questions:
/assign @nairan-deshaw
@nairan-deshaw Logs in the 30-minute window before the crash are okay, but I suggest setting the Milvus log level to debug to give us more information. If you need to know how to configure this, see: https://milvus.io/docs/configure-helm.md?tab=component
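For example, with the Helm chart something like the following should work; the release name and namespace are placeholders, and `log.level` is the chart value described in the doc above:

```sh
# Hedged sketch: raise the Milvus log level to debug via Helm.
# "my-release" and "milvus" are placeholders for your release name and namespace.
helm upgrade my-release milvus/milvus \
  --namespace milvus \
  --reuse-values \
  --set log.level=debug
```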
@yanliang567 attaching a file with all the logs.
/assign @weiliu1031
I think from the current log there is no obvious reason, except for some timeouts. What do you mean by "none of the DB operations work"? Does that mean all the coords crashed? Could you run pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?
The inserts, queries, etc. all time out and we end up seeing the below error on the client side:
The mixcoord itself does not crash, which is why k8s does not restart the pod; instead it goes into a degraded state from which it does not recover. In addition, Attu also becomes unresponsive and keeps loading for a long time before erroring out with "context deadline exceeded".
Any steps / references on how to do this?
To get pprof, check these scripts.
The first command captures goroutines, mutex, CPU profiling, and block points, which will give us more details to analyze. To check logs, check these scripts. Most likely there is a very slow operation in the coordinator, but we need to know what the operation is; most likely it will be etcd.
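As a rough sketch of the kind of data to collect, assuming Milvus serves the standard Go pprof handlers on its metrics port (9091) and the mixcoord deployment follows the default Helm naming (both are assumptions here):

```sh
# Hedged sketch: pull pprof profiles from the mixcoord pod.
# Deployment name, namespace, and the 9091 pprof port are assumptions.
kubectl -n milvus port-forward deploy/my-release-milvus-mixcoord 9091:9091 &

curl -s "http://localhost:9091/debug/pprof/goroutine?debug=2" -o goroutine.txt
curl -s "http://localhost:9091/debug/pprof/profile?seconds=30" -o cpu.pprof
curl -s "http://localhost:9091/debug/pprof/mutex" -o mutex.pprof
curl -s "http://localhost:9091/debug/pprof/block" -o block.pprof

# Inspect locally with the Go toolchain, e.g.:
# go tool pprof -top cpu.pprof
```

Note that the mutex and block profiles may be empty unless the corresponding profiling rates are enabled in the process.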
Hi @xiaofan-luan, attaching the logs for the suggested endpoints from the current mixcoord component. Let me know if that helps or if more logs are needed:
From the log, there isn't anything suspicious except for one index node failure. Is there any warn/error or any node crash? From the initial log, the query filters on many section_ids, which is not recommended (it could be very slow). There are many timeouts at that moment, but we don't have enough detail to tell whether this is a querynode timeout, a crash, or something else.
Right now the datacoord works fine, but CPU usage is high. One guess is that we have a file-leak bug, see #36161: after running for a while there may be enough leakage to make the datacoord slower. This is fixed in 2.4.12. My question is: how many collections/partitions are there in your use case? The files seem to be accumulating very fast, and we don't see this happen even on large clusters.
Checked that no other pod crashes right before this state happens
We have 12 collections and about 1300 partitions, which increase at regular intervals. We currently have under 150 million embeddings.
Try upgrading to 2.4.12 and see if it helps. Maybe also try increasing the resources of the coords. Did you see any increase in mixcoord's CPU usage?
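Roughly, the upgrade via Helm looks like the following; the release name, namespace, and the `image.all.tag` key are assumptions based on the chart's default values layout:

```sh
# Hedged sketch: bump only the Milvus image tag through the Helm chart.
helm repo update
helm upgrade my-release milvus/milvus \
  --namespace milvus \
  --reuse-values \
  --set image.all.tag=v2.4.12
```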
We did see some normal spikes on the pods, but nothing unusual. Currently we have allocated 4 CPUs and 8 GB for the mixcoord. We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan
I doubt that adding a standby mixcoord would help. We probably need to figure out the reason for the panic before taking the next step. Please check:
Logs before and during the crash are attached in #36217 (comment)
Any specific filters for looking at these logs?
Below are the usage patterns for CPU and memory from our cluster during the crash window:
Are you using bulk insertions for all the inserted data?
Yes, we are using bulk inserts.
This might be a known bug: some of the imported segment info is leaked, and CPU usage accumulates for a while until the coord reboots.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@bigsheeper were all the fixes for this issue merged? Is it related to a PR that we reverted recently?
Yes, the imported segment leak issue has been resolved, which should help prevent high CPU usage in mixcoord.
Nope.
Thanks, does 2.4.15 have the potential fix in place? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Is there an existing issue for this?
Environment
Current Behavior
Since upgrading our Milvus deployment to v2.4.7, we are facing crashes where the database goes into a degraded state. This has happened twice in the last 30 days. The components stay functional, but none of the DB operations work. The Attu UI gives gRPC "context deadline exceeded" errors. Usually restarting the mixcoord component fixes this issue (a sketch of the restart we perform is below).
We took a deeper look at the logs prior to and during the issue, but we could not find a trigger point or a reason why the database was not able to recover from the failures. Attaching the logs for the window with this issue. Can you help with what could be causing this and why the database needs the mixcoord to be restarted for things to be functional again?
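A minimal sketch of the restart, assuming a default Helm install (the release name and namespace below are placeholders):

```sh
# Restart only the mixcoord workload; the deployment name follows the
# default Helm naming convention and may differ in other installs.
kubectl -n milvus rollout restart deployment/my-release-milvus-mixcoord
kubectl -n milvus rollout status deployment/my-release-milvus-mixcoord
```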
Expected Behavior
Database should continue working normally
Steps To Reproduce
Milvus Log
lines.txt
Anything else?
From the logs, the errors continue to cascade and the database becomes inoperable. While we do see the same error logs every now and then, they do not cause DB crashes every time.