
[Bug]: Database crashing into a degraded state #36217

Open
nairan-deshaw opened this issue Sep 12, 2024 · 24 comments
Assignees
Labels
kind/bug Issues or changes related to a bug · stale Indicates no updates for 30 days · triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@nairan-deshaw

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.7
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0
- OS(Ubuntu or CentOS): RHEL
- CPU/Memory: 512 CPU / 1TB K8s cluster
- GPU: NA
- Others: NA

Current Behavior

Since upgrading our Milvus deployment to v2.4.7, we are facing incidents where the database crashes into a degraded state. This has happened twice in the last 30 days. The component pods stay up, but none of the DB operations work. The Attu UI gives gRPC "context deadline exceeded" errors. Usually restarting the mixcoord component fixes the issue.

We took a deeper look at the logs from before and during the issue, but could not find a trigger point or an explanation for why the database was not able to recover from the failures. Attaching the logs for the window in which this happened. Can you help identify what could be causing this and why mixcoord needs to be restarted for things to become functional again?

Expected Behavior

Database should continue working normally

Steps To Reproduce

Reproduction steps are unclear, but this has been happening on v2.4.7.

Milvus Log

lines.txt

Anything else?

From the logs, the errors continue to cascade and the database becomes inoperable. While we do see the same error logs every now and then, they do not cause DB crashes every time.

@nairan-deshaw nairan-deshaw added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 12, 2024
@yanliang567
Contributor

yanliang567 commented Sep 14, 2024

@nairan-deshaw quick questions:

  1. which version of milvus did you upgrade from?
  2. the logs you attached seem to only have rootcoord pod info after the crash, but we need the complete logs from all Milvus pods, so could you please help collect the logs with the tool below after the crash happens.
    Please refer to this doc to export the whole Milvus logs for investigation (the script collects the logs from before the crash, which helps us understand what caused it).

/assign @nairan-deshaw

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 14, 2024
@nairan-deshaw
Author

  • We upgraded from v2.3.12
  • We might not be able to provide full logs, but we can send logs from all components for the 30-minute window before the crash. The attached logs have all the WARN logs from all pods in that window.

@yanliang567
Contributor

@nairan-deshaw logs in the 30 min window before the crash are okay, but I suggest setting the Milvus log level to debug to provide more info to us. In case you need to know how to change the Milvus config: https://milvus.io/docs/configure-helm.md?tab=component
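
For reference, a minimal sketch of raising the log level through Helm; the release name my-release is a placeholder and the log.level key should be verified against your chart's values.yaml:

# my-release is a placeholder; confirm the log.level key in your milvus-helm values.yaml
helm upgrade my-release milvus/milvus --reuse-values --set log.level=debug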

@nairan-deshaw
Author

@yanliang567 attaching a file with all the logs (DEBUG level) in the 30 min window. Let me know if that helps.
logs.tar.gz

@yanliang567
Contributor

/assign @weiliu1031
please help to take a look
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed triage/needs-information Indicates an issue needs more information in order to work on it. labels Sep 18, 2024
@xiaofan-luan
Collaborator


I think from the current log there is no obvious cause except for some timeouts.

What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?

Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?

@nairan-deshaw
Author

What do you mean by "none of the DB operations work"? Does that mean all the coords crashed?

The inserts, queries, etc. all time out and we end up seeing the errors below on the client side:

E0819 01:13:41.110673102 1719680 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:14:41.120204252 1720476 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:15:41.133475728 1721014 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:16:41.166404019 1721536 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
E0819 01:17:41.259244512 1722012 hpack_parser.cc:999]                  Error parsing 'content-type' metadata: invalid value
[20240819 01:17:41:259 decorators.py:100 WARNING] [release_partitions] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNKNOWN, Stream removed>

The mixcoord itself does not crash, which is why k8s does not restart the pod; instead it goes into a degraded state from which it does not recover. In addition, Attu also becomes unresponsive and keeps loading for a long time before erroring out with "context deadline exceeded".
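
For reference, a quick way to probe the coordinator's health during such an episode (a sketch assuming the default metrics port 9091; the pod name is a placeholder):

# port-forward the coordinator's metrics port and hit the health endpoint
kubectl port-forward pod/<mixcoord-pod> 9091:9091 &
curl http://localhost:9091/healthz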

Could you try pprof and capture goroutine info and CPU usage of the coordinator so we can analyze it?

Any steps / references on how to do this?

@xiaofan-luan
Collaborator

To get pprof data, check these commands.

curl "http://${ip}:9091/debug/pprof/goroutine?debug=10" -o ./$log_dir/goroutine/$pod.log || echo "export goroutine for pod $pod failed"

The following commands capture goroutine, mutex, CPU profiling, and blocking info, which will give us more details to analyze:
curl -o goroutine.txt http://10.28.191.52:9091/debug/pprof/goroutine
curl -o mutex.txt http://10.28.191.52:9091/debug/pprof/mutex
curl -o profile.txt http://10.28.191.52:9091/debug/pprof/profile
curl -o block.txt http://10.28.191.52:9091/debug/pprof/block
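
Once collected, the profiles can be inspected locally with the Go toolchain (a sketch; note that profile.txt is a binary pprof dump despite the .txt extension):

# list the hottest functions in the captured CPU profile
go tool pprof -top profile.txt
# or browse it interactively in a web UI
go tool pprof -http=:8080 profile.txt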

To check logs, use this script:
https://github.com/milvus-io/milvus/tree/master/deployments/export-log
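
A hedged example of invoking that script; the flags and values below are assumptions, so double-check the README in that directory:

# instance name and namespace are placeholders for your deployment
./export-milvus-log.sh -i my-release -n milvus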

Most likely there is a very slow operation in the coordinator, but we need to know what the operation is. Most likely this will be etcd.

@nairan-deshaw
Author

Hi @xiaofan-luan, attaching the output for the suggested endpoints from the current mixcoord component. Let me know if that helps or if more logs are needed:
logs.tar.gz

@xiaofan-luan
Collaborator

From the logs, there isn't anything suspicious except for one index node failure.
[datacoord/indexnode_manager.go:131] ["get IndexNode slots failed"] [nodeID=5293]

Is there any warn/error or any node crash?

From the initial log, the query filters on many section_ids, and this is not recommended (it could be very slow). There are many timeouts at that moment, but we don't have many details, so we have no idea whether this is a querynode timeout, due to some crash, or something else.

@xiaofan-luan
Collaborator

Right now the datacoord works fine; CPU usage is high on the operation shown in the attached profile. [screenshot: pprof CPU profile of datacoord]

One guess is that we have a bug that leaks files, see #36161. What might happen is that after running for a while there may be a lot of leakage, causing the datacoord to become slower. This is fixed in 2.4.12.

My question is: how many collections/partitions are there in your use case? The files seem to be accumulating very fast, because we don't see this happen even on large clusters.

@nairan-deshaw
Author

nairan-deshaw commented Sep 27, 2024

Is there any warn/error or any node crash?

We checked that no other pod crashes right before this state happens.

how many collections/partitions are there in your use case?

We have 12 collections and about 1,300 partitions, which increase at regular intervals. We currently have under 150 million embeddings.

@xiaofan-luan
Collaborator


Try upgrading to 2.4.12 and see if it helps. Maybe also try to increase the resources of the coords. Did you see any increase in mixcoord's CPU usage?
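
If you do increase the coord resources via Helm, something along these lines might work; the mixCoordinator key and the values are assumptions, so verify them against your chart's values.yaml:

# my-release and the resource values are placeholders; key names depend on the chart version
helm upgrade my-release milvus/milvus --reuse-values \
  --set mixCoordinator.resources.limits.cpu=8 \
  --set mixCoordinator.resources.limits.memory=16Gi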

@nairan-deshaw
Author

Did you see any increase in mixcoord's CPU usage?

We did see some normal spikes on the pods, but not anything unusual. Currently we have allocated 4 CPU and 8GB for the mixcoord.

We will try upgrading to v2.4.12. In addition, we are planning to deploy a standby mixcoord. Will that provide any benefits in this scenario? @xiaofan-luan

@xiaofan-luan
Collaborator


I doubt that adding a standby mixcoord would help. We probably need to figure out the reason for the panic before taking the next step.

Please check:

  1. if there are any warn/error logs when the coord exits
  2. is there any etcd slow message (a grep sketch follows below)
  3. why CPU usage or memory usage is high. Is there a trend of something going up? That might be a signal of leakage.
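
For item 2, a rough sketch of the kind of filter that can surface slow etcd requests; the pod name and namespace are placeholders, and etcd typically logs warnings such as "apply request took too long":

# pod name and namespace are placeholders for your deployment
kubectl logs my-release-etcd-0 -n milvus | grep -iE "took too long|slow"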

@nairan-deshaw
Author

if there are any warn/error logs when the coord exits

Logs before and during the crash are attached in #36217 (comment)

is there any etcd slow message

Any specific filters for looking at these logs?

why CPU usage or memory usage is high. Is there a trend of something going up? That might be a signal of leakage.

Below are the usage patterns for CPU and memory from our cluster during the crash window:

[screenshots: CPU and memory usage graphs during the crash window]

@xiaofan-luan
Collaborator


Are you using bulk insertions for all the inserted data?

@nairan-deshaw
Author

Yes, we are using bulk inserts.

@xiaofan-luan
Collaborator

This might be a known bug where some of the imported segment info is leaked and CPU usage accumulates over time until the coord is rebooted.
@bigsheeper is working on solving it


stale bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label Nov 9, 2024
@yanliang567 yanliang567 assigned bigsheeper and unassigned weiliu1031 Nov 11, 2024
@stale stale bot removed the stale Indicates no updates for 30 days label Nov 11, 2024
@yanliang567
Contributor

@bigsheeper were all the fixes for this issue merged? Is it related to a PR that we reverted recently?

@bigsheeper
Contributor

@bigsheeper were all the fixes for this issue merged?

Yes, the imported segment leak issue has been resolved, which should help prevent high CPU usage in mixcoord.

Is it related to a PR that we reverted recently?

Nope.

@nairan-deshaw
Author

Thanks, does 2.4.15 have the potential fix in place?


stale bot commented Dec 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label Dec 21, 2024