
[Bug]: msg: failed to get delegator 624 for channel base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624 #37710

sunwsh opened this issue Nov 15, 2024 · 4 comments

sunwsh commented Nov 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:  v2.4.14
- Deployment mode(standalone or cluster):   cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): milvus-sdk-java v2.3.3
- OS(Ubuntu or CentOS):  ubuntu20.04
- CPU/Memory:    24c/64g
- GPU:    null
- Others:

Current Behavior

We occasionally see "failed to get delegator". We upgraded to v2.4.14 only two days ago, and it has happened on both of the last two days; on the version we used before, 2.3.21, we never saw a similar problem.

Expected Behavior

For load balancing we use:
proxy.replicaSelectionPolicy: round_robin


Steps To Reproduce

No response

Milvus Log

The affected collection hitting "failed to get delegator":
collection ID: 453896578611993902
querynode-624: 10.152.131.184:21123

| collection | node ID | error count |
| --- | --- | --- |
| m185_128_dssm_v2_1 | 624 | 2399 |

The query-side logs are all on the proxies; the first "failed to get delegator 624" query log appeared at 2024/11/15 10:32:45.

[2024/11/15 10:32:45.729 +08:00] [ACCESS] <root: tcp-10.151.41.250:12682> Search [status: Failed] [code: 65535] [sdk: Java-Unknown] [msg: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624] [traceID: dc11a62b1e3a406ae3fda09c4f1230a2] [timeCost: 1.657523ms] [database: seareco] [collection: m185_128_dssm_v2_1] [partitions: [0]] [expr: ]
	
[2024/11/15 10:32:45.728 +08:00] [WARN] [retry/retry.go:46] ["retry func failed"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [retried=0] [error="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (3) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]
	
[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/task_search.go:552] ["search execute failed"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [nq=1] [error="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (3) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString"]
	
[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/task_scheduler.go:477] ["Failed to execute task: "] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [error="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_search.go:553\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:474\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:553\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/conc/pool.go:81\n  | [...repeated from below...]\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (4) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (5) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString"]
	
[2024/11/15 10:32:45.729 +08:00] [WARN] [proxy/impl.go:3014] ["Search failed to WaitToFinish"] [traceID=dc11a62b1e3a406ae3fda09c4f1230a2] [role=proxy] [db=seareco] [collection=m185_128_dssm_v2_1] [partitions="[0]"] [dsl=] [len(PlaceholderGroup)=524] [OutputFields="[]"] [search_params="[{\"key\":\"anns_field\",\"value\":\"vec\"},{\"key\":\"topk\",\"value\":\"3000\"},{\"key\":\"metric_type\",\"value\":\"IP\"},{\"key\":\"round_decimal\",\"value\":\"-1\"},{\"key\":\"ignore_growing\",\"value\":\"false\"},{\"key\":\"offset\",\"value\":\"0\"},{\"key\":\"params\",\"value\":\"{\\\"nprobe\\\":20}\"}]"] [ConsistencyLevel=Strong] [useDefaultConsistency=true] [nq=1] [error="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624"] [errorVerbose="failed to search: failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0: can not find client of node 624\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_search.go:553\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:474\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/task_scheduler.go:553\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/conc/pool.go:81\n  | [...repeated from below...]\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:190\n  | github.com/milvus-io/milvus/pkg/util/retry.Do\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/pkg/util/retry/retry.go:44\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:163\n  | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n  | \t/home/jenkins/workspace/milvus2.api.vip.com/internal/proxy/lb_policy.go:231\n  | golang.org/x/sync/errgroup.(*Group).Go.func1\n  | \t/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (4) failed to get delegator 624 for channel gd16-seareco-base-rootcoord-dml_14_453896578611993902v0\nWraps: (5) can not find client of node 624\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString"]

On node 624 there are no abnormal logs for this collection; it kept logging queries for that collection right up until it was killed (last at 2024/11/15 11:54:09).
From the symptoms, only some proxies fail to reach it.

[2024/11/15 11:54:09.136 +08:00] [INFO] [querynodev2/services.go:821] ["Received SearchRequest done"] [traceID=7dd04234bfc44f6b14bf5d7c1d069b5b] [collectionID=453896578611993902] [channels="[gd16-seareco-base-rootcoord-dml_14_453896578611993902v0]"] [nq=1] [segmentIDs="[]"] [reduceElapse=0] [requestElapse=23]

Looking at the proxy logs: we run 30 proxy processes, and only 8 of them report this failure. The proxies sit behind a layer-7 load balancer, so every node receives an equal share of traffic; this suggests only some proxies have an inconsistent cache.

Counting the failures per proxy, the distribution is also uneven:

| proxy pod ip | count |
| --- | --- |
| proxy-76f4ccd59f-djzdc@10.142.68.246 | 921 |
| proxy-76f4ccd59f-ndk8r@10.151.200.71 | 957 |
| proxy-76f4ccd59f-rldqc@10.142.248.58 | 914 |
| proxy-76f4ccd59f-8xtbw@10.147.20.176 | 931 |
| proxy-76f4ccd59f-qcpdt@10.147.88.54 | 893 |
| proxy-76f4ccd59f-tt7wr@10.151.227.228 | 930 |
| proxy-76f4ccd59f-8qn7c@10.142.123.211 | 30241 |
| proxy-76f4ccd59f-475cf@10.142.39.120 | 952 |

Cache-update requests on pod 10.142.123.211:
[2024/11/15 10:54:16.688 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=9e4453573fdf71e2a9eb334bc3a895e9] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:03:21.151 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=26fbf23dd2a15edd1f69a8c4d2aa55d8] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:06:04.896 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=81eff78d220cfa78cf5b8afb212b28fe] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:52:46.585 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=aea7c1acf8876da60fd5b86e97196c1b] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:52:51.864 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=6783b3df043cbb52295df11675753d0d] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]
[2024/11/15 11:54:22.087 +08:00] [INFO] [proxy/meta_cache.go:966] ["no shard cache for collection, try to get shard leaders from QueryCoord"] [traceID=13d4302a1b92b39c5f47f2b87d481b22] [collectionName=m185_128_dssm_v2_1] [collectionID=453896578611993902]

For the failing proxies above, checking the "meta update success" logs shows the cache was last updated on the previous day and never updated again after that.

[2024/11/14 15:03:38.879 +08:00] [INFO] [proxy/meta_cache.go:500] ["meta update success"] [database=seareco] [collectionName=m185_128_dssm_v2_1] ["actual collection Name"=m185_128_dssm_v2_1] [collectionID=453896578611993902]

Reading the code, this should happen in UpdateShardLeaders: only when the old and new leader data differ is the connection recycled.

func (c *shardClientMgrImpl) UpdateShardLeaders(oldLeaders map[string][]nodeInfo, newLeaders map[string][]nodeInfo)
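
To make this hypothesis concrete, below is a minimal, self-contained Go sketch of the suspected diff-and-release behaviour. It is not the Milvus source: only the UpdateShardLeaders signature above comes from the code, while nodeInfo, shardClient, GetClient and every body here are simplified assumptions. The point is that once a leader update based on an inconsistent view drops node 624 from the new leader set, its client is closed and removed, and any proxy that keeps routing to 624 from a stale shard cache then fails with "can not find client of node 624".

```go
package main

import "fmt"

// Simplified illustration of the suspected behaviour, not the Milvus source.

type nodeInfo struct {
	nodeID  int64
	address string
}

type shardClient struct{ addr string }

func (c *shardClient) Close() {}

type shardClientMgrImpl struct {
	// per-querynode client pool, keyed by node ID
	clients map[int64]*shardClient
}

// UpdateShardLeaders closes the client of any node that appears in the old
// leader view but not in the new one. If the two views disagree because of a
// stale or partial refresh, a node such as 624 loses its client here even
// though it is still the delegator for the channel.
func (c *shardClientMgrImpl) UpdateShardLeaders(oldLeaders, newLeaders map[string][]nodeInfo) {
	stillLeader := make(map[int64]struct{})
	for _, nodes := range newLeaders {
		for _, n := range nodes {
			stillLeader[n.nodeID] = struct{}{}
		}
	}
	for _, nodes := range oldLeaders {
		for _, n := range nodes {
			if _, ok := stillLeader[n.nodeID]; ok {
				continue
			}
			if cli, exists := c.clients[n.nodeID]; exists {
				cli.Close()
				delete(c.clients, n.nodeID) // connection recycled
			}
		}
	}
}

// GetClient is what a later search would call for the cached delegator; once
// the client has been recycled, the lookup can only fail.
func (c *shardClientMgrImpl) GetClient(nodeID int64) (*shardClient, error) {
	cli, ok := c.clients[nodeID]
	if !ok {
		return nil, fmt.Errorf("can not find client of node %d", nodeID)
	}
	return cli, nil
}

func main() {
	mgr := &shardClientMgrImpl{clients: map[int64]*shardClient{
		624: {addr: "10.152.131.184:21123"},
	}}

	channel := "base-rootcoord-dml_14_453896578611993902v0"
	oldView := map[string][]nodeInfo{channel: {{nodeID: 624, address: "10.152.131.184:21123"}}}
	newView := map[string][]nodeInfo{channel: {}} // inconsistent view: 624 missing

	mgr.UpdateShardLeaders(oldView, newView)

	if _, err := mgr.GetClient(624); err != nil {
		// surfaces as: failed to get delegator 624 for channel ...: can not find client of node 624
		fmt.Printf("failed to get delegator 624 for channel %s: %v\n", channel, err)
	}
}
```

Under this reading, a proxy whose shard cache is never refreshed afterwards would keep choosing node 624 as the delegator while the client pool no longer contains it, which matches the per-proxy failure counts above.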

Anything else?

No response

yanliang567 (Contributor) commented

@sunwsh could you please provide complete Milvus logs for investigation?
Please refer to this doc to export the whole Milvus logs for investigation.
For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.

/assign @sunwsh

sunwsh (Author) commented Nov 18, 2024

> @sunwsh could you please provide complete Milvus logs for investigation? Please refer to this doc to export the whole Milvus logs for investigation. For Milvus installed with docker-compose, you can use docker-compose logs > milvus.log to export the logs.
>
> /assign @sunwsh

Our logs are written directly to the host and then shipped to the company's log system, so pulling them out is fairly troublesome.
The day before yesterday I restarted the 8 proxy processes mentioned above; it did not happen again for the next day, but after another day it occurred again, so the restart did not avoid the problem.

yanliang567 (Contributor) commented

We cannot tell the root cause without logs. If it happens again, please help to collect the logs.


stale bot commented Dec 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
