
[Bug]: Poor performance with a large collection loading #36318

Open · kish5430 opened this issue Sep 17, 2024 · 12 comments

Labels: kind/bug · triage/accepted · stale

@kish5430
Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster): Cluster
- MQ type(rocksmq, pulsar or kafka): Kafka
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

We have ingested 3.5 billion vectors into a collection that uses the IVF_FLAT index type with an nlist of 65,536. The cluster was built with default configurations, and the query nodes, query coordinator, and other components have sufficient resources per the sizing tool. However, when attempting to load the collection into memory, the process is extremely slow, with only 1% of the collection loading per hour. During this time, CPU and memory utilization for the Milvus components remain below 30%. What can be done to speed up the loading process?
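For reference, the load percentage quoted above can be read from the server side. A minimal pymilvus sketch, with placeholder connection details (the collection name is the one shown later in this thread):

```python
from pymilvus import connections, utility

# Placeholder endpoint; adjust to your cluster's proxy address.
connections.connect(host="localhost", port="19530")

# Reports the server-side loading percentage for the collection,
# useful for tracking the ~1%/hour progress described above.
print(utility.loading_progress("CVL_image_vfm"))
```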

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@kish5430 kish5430 added the kind/bug and needs-triage labels on Sep 17, 2024
@yanliang567
Contributor

@kish5430 I don't think using nlist=65536 makes sense if you want high recall, because Milvus builds the index at segment granularity. Using a very big nlist will also lead to poor indexing performance.
@congqixia any ideas on how to improve the loading performance?

/assign @congqixia
/unassign

@yanliang567 yanliang567 added the triage/accepted label and removed the needs-triage label on Sep 18, 2024
@kish5430
Author


@yanliang567 According to Milvus documentation, nlist can be set to 4 * sqrt(n), where n is the total number of vectors, and nlist ranges from 1 to 65,536. With 3.5 billion records, the calculated nlist exceeds 65,536, so we set it to the maximum value for the cluster.
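For reference, the arithmetic behind that choice (a back-of-the-envelope sketch, not from the original report):

```python
import math

n_total = 3_500_000_000              # vectors ingested into the collection
nlist = 4 * math.sqrt(n_total)       # documented rule of thumb: nlist = 4 * sqrt(n)
print(round(nlist))                  # ~236,643 -- above the 65,536 cap, hence the max was used
```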

@yanliang567
Contributor

yanliang567 commented Sep 18, 2024

As I mentioned, Milvus builds the index at segment granularity, so here n is the total number of vectors in a segment. If you did not change the max segment size, it is 512 MB by default (which can hold about 1 million 128-dim vectors). @kish5430
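Applying the same rule of thumb per segment rather than per collection, a rough sketch (the 256-dim figure comes from the schema shared below; the 512 MB default and float32 embeddings are assumed, and non-vector fields are ignored):

```python
import math

segment_size_bytes = 512 * 1024 * 1024    # default max size of a sealed segment
bytes_per_vector = 256 * 4                # 256-dim float32 embedding
vectors_per_segment = segment_size_bytes // bytes_per_vector
print(vectors_per_segment)                # ~524,288 vectors per segment (upper bound)

nlist_per_segment = 4 * math.sqrt(vectors_per_segment)
print(round(nlist_per_segment))           # ~2,896 -- the scale nlist should actually be chosen at
```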

@congqixia
Contributor

It's odd that Milvus loads at such a slow speed. It would help to have some metrics from the querycoord & querynodes in your cluster.
Also, please upgrade to 2.4.11 if possible to avoid the known issues fixed in recent releases.

@kish5430
Author


Please find the metrics below.

querycoord:
[screenshot]

QueryNode:
[screenshot]

Runtime:
[screenshot]
Collection Schema:

collection.describe output:

name: CVL_image_vfm
description: Schema for image & video embeddings
schema: {'auto_id': False, 'description': 'Schema for image & video embeddings', 'fields': [{'name': 'uuid', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}, 'is_primary': True, 'auto_id': False}, {'name': 'phash', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}}, {'name': 'embedding', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 256}}, {'name': 'path', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}}, {'name': 'meta_data', 'description': '', 'type': <DataType.JSON: 23>}, {'name': 'dataset_id', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 50}}], 'enable_dynamic_field': False}
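For readability, the same schema expressed as the pymilvus definitions that would produce it (a reconstruction from the dump above, not the original ingestion code):

```python
from pymilvus import CollectionSchema, FieldSchema, DataType

fields = [
    FieldSchema("uuid", DataType.VARCHAR, max_length=50, is_primary=True, auto_id=False),
    FieldSchema("phash", DataType.VARCHAR, max_length=50),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=256),
    FieldSchema("path", DataType.VARCHAR, max_length=512),
    FieldSchema("meta_data", DataType.JSON),
    FieldSchema("dataset_id", DataType.VARCHAR, max_length=50),
]
schema = CollectionSchema(fields, description="Schema for image & video embeddings",
                          enable_dynamic_field=False)
```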

Please let me know if any other info is required.

@yhmo
Contributor

yhmo commented Sep 18, 2024

There is a known issue with bulkinsert where segments were not compacted:
#35349

But this issue cannot explain why there are so many tiny segments (50k ~ 150k per query node).
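If uncompacted bulkinsert segments are indeed the cause, compaction can also be requested manually while the fix is pending. A sketch with pymilvus (placeholder connection details; whether this actually merges the segments in question depends on the issue above):

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")   # placeholder endpoint

collection = Collection("CVL_image_vfm")
collection.compact()                          # ask the data coordinator to merge small segments
collection.wait_for_compaction_completed()    # block until the compaction job finishes
print(collection.get_compaction_state())
```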

@yanliang567
Contributor


Then we'd suggest you upgrade your Milvus to 2.4.7 or above; 2.4.11 is recommended. @kish5430

@congqixia
Contributor

It's obvious that there are a huge number of segments in your system. Could you check the entry count for each segment, say by providing some sample row counts? It's not a healthy state for each segment to hold only a few rows, since every segment adds some extra cost during the loading procedure.
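One way to pull those per-segment row counts with pymilvus (a sketch; placeholder connection details):

```python
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")   # placeholder endpoint

# Lists the sealed segments currently served by the query nodes, with their row counts.
for seg in utility.get_query_segment_info("CVL_image_vfm"):
    print(seg.segmentID, seg.num_rows)
```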

@xiaofan-luan
Collaborator


Hi @kish5430,
if you need some extra help making this work, please contact us at james.luan@zilliz.com.
We are excited about your use case.


stale bot commented Nov 9, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale label on Nov 9, 2024
@yanliang567
Contributor

@kish5430 any updates for this issue?

@stale stale bot removed the stale label on Nov 11, 2024

stale bot commented Dec 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale label on Dec 21, 2024