[Bug]: Query after Insertion timed out in v2.5.0-beta #38585

Open
1 task done
Andy6132024 opened this issue Dec 19, 2024 · 2 comments
Assignees
Labels
kind/bug Issues or changes related to a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@Andy6132024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.5.0-beta
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): v2.5.0
- OS(Ubuntu or CentOS): RockyLinux8
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I executed tasks to insert around 2k entities into a collection concurrently (concurrency is 12), with each task inserting only one entity. In the Stage environment, where Milvus has been upgraded to v2.5.0-beta, queries timed out for roughly 10 minutes afterwards. In the Prod environment, where Milvus is still at v2.4.15, the same query returned results almost immediately.
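For clarity, a minimal sketch of the insertion pattern described above: ~2k single-entity inserts issued from 12 concurrent workers. `do_insert` stands in for the real pymilvus insert call, which is not shown in the report; the stub below only demonstrates the concurrency shape.

```python
from concurrent.futures import ThreadPoolExecutor

def run_inserts(entities, do_insert, concurrency=12):
    """Issue one insert call per entity from `concurrency` workers."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Each task inserts exactly one entity, as in the report.
        list(pool.map(do_insert, entities))

# Demonstration with a stub insert (a real run would call collection.insert):
inserted = []
run_inserts(range(2000), inserted.append)
```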

I enabled TimeTick protection in the Stage environment and noticed that the TimeTick lag rose to more than 3 minutes during the insertion and stayed there for about 10 minutes before gradually subsiding to the normal level. At the same time, the TimeTick lag recorded at the QueryNode (for consumed inserts) also rose to a couple of minutes. All of the evidence seems to point to a slowdown in consumption from the DML channel in the QueryNode.

[screenshot: tt-delay — TimeTick lag graph]

I'd appreciate anyone looking into this issue, since it could block the upgrade to v2.5+ in our Prod environment.

Expected Behavior

TimeTick lag should not increase noticeably during insertion.

Steps To Reproduce

No response

Milvus Log

[2024/12/18 10:57:54.092 +00:00] [WARN] [querynodev2/handlers.go:227] ["failed to query on delegator"] [traceID=62bcacd674186cd5d570ed04369a103f] [msgID=454692378587327825] [collectionID=454692378588586550] [channel=by-dev-rootcoord-dml_12_454692378588586550v0] [scope=All] [error="context canceled"]
[2024/12/18 10:57:54.092 +00:00] [WARN] [delegator/delegator.go:563] ["delegator query failed to wait tsafe"] [traceID=62bcacd674186cd5d570ed04369a103f] [collectionID=454692378588586550] [channel=by-dev-rootcoord-dml_12_454692378588586550v0] [replicaID=454692378786922498] [error="context canceled"]

Anything else?

No response

@Andy6132024 Andy6132024 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2024
@yanliang567
Contributor

@Andy6132024 May I ask why you insert only 1 entity per insert request? I'm asking because doing that is highly discouraged: Milvus would generate many small segments, which keeps the system busy with compaction and timetick sync.
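To illustrate the recommendation above: batch entities and issue one insert per batch instead of one insert per entity. The `collection.insert` call and batch size of 500 below are illustrative, not taken from the report; only the batching helper is concrete.

```python
def batch(entities, size):
    """Split a list of entities into insert-sized batches."""
    return [entities[i:i + size] for i in range(0, len(entities), size)]

# entities = [...]             # ~2k rows prepared by the application
# for rows in batch(entities, 500):
#     collection.insert(rows)  # one RPC per 500 rows, not per row
```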

/assign @Andy6132024

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 20, 2024
@yanliang567 yanliang567 removed their assignment Dec 24, 2024
@xiaofan-luan
Collaborator

I think this is definitely a potential problem.
Could you share logs so we can investigate, especially from the QueryNode that shows this timetick lag?
