[Bug]: [benchmark][cluster] search RT almost doubled after enabling streamingNode #36804

Open
1 task done
wangting0128 opened this issue Oct 12, 2024 · 18 comments
Labels: feature/streaming node, kind/bug, test/benchmark, triage/accepted

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20241011-3fe0f829-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-streaming-node-corn-1728615600
test case name: test_bitmap_locust_dql_dml_partition_key_cluster

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-streami15600-1-63-8844-etcd-0                             1/1     Running     0                6h47m   10.104.17.43    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-etcd-1                             1/1     Running     0                6h47m   10.104.19.190   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-etcd-2                             1/1     Running     0                6h47m   10.104.20.152   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-8ddv7   1/1     Running     1 (6h43m ago)    6h47m   10.104.14.20    4am-node18   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-8pfrl   1/1     Running     1 (6h43m ago)    6h47m   10.104.23.25    4am-node27   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-hftd6   1/1     Running     1 (6h43m ago)    6h47m   10.104.19.188   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-jppv7   1/1     Running     1 (6h43m ago)    6h47m   10.104.1.84     4am-node10   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-mzh45   1/1     Running     1 (6h43m ago)    6h47m   10.104.13.167   4am-node16   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-mzzd6   1/1     Running     0                6h47m   10.104.6.193    4am-node13   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-ptgqq   1/1     Running     1 (6h43m ago)    6h47m   10.104.18.162   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-qnq6b   1/1     Running     1 (6h43m ago)    6h47m   10.104.25.250   4am-node30   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-r5vwm   1/1     Running     0                6h47m   10.104.4.30     4am-node11   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-datanode-7586bdfbbc-tdvgt   1/1     Running     1 (6h43m ago)    6h47m   10.104.17.35    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-7k9qc   1/1     Running     0                6h47m   10.104.6.194    4am-node13   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-8rgnw   1/1     Running     0                6h47m   10.104.20.147   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-nft65   1/1     Running     0                6h47m   10.104.30.147   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-indexnode-d78d47c54-x45z8   1/1     Running     0                6h47m   10.104.1.85     4am-node10   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-mixcoord-8668dd7f99-24fwm   1/1     Running     1 (6h43m ago)    6h47m   10.104.30.146   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-proxy-654bbf597f-q6kgl      1/1     Running     1 (6h43m ago)    6h47m   10.104.9.223    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf445887gwx9   1/1     Running     0                6h47m   10.104.34.160   4am-node37   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588db2qn   1/1     Running     0                6h47m   10.104.9.224    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588fcsls   1/1     Running     0                6h47m   10.104.21.172   4am-node24   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588lqpgn   1/1     Running     0                6h47m   10.104.18.163   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-querynode-6dcbf44588q9gbl   1/1     Running     0                6h47m   10.104.4.29     4am-node11   <none>           <none>
fouramf-streami15600-1-63-8844-milvus-streamingnode-7dcd45vjxg7   1/1     Running     1 (6h43m ago)    6h47m   10.104.17.36    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-minio-0                            1/1     Running     0                6h47m   10.104.18.165   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-minio-1                            1/1     Running     0                6h47m   10.104.17.44    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-minio-2                            1/1     Running     0                6h47m   10.104.33.198   4am-node36   <none>           <none>
fouramf-streami15600-1-63-8844-minio-3                            1/1     Running     0                6h47m   10.104.20.153   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-0                    1/1     Running     0                6h47m   10.104.30.148   4am-node38   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-1                    1/1     Running     0                6h47m   10.104.17.45    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-2                    1/1     Running     0                6h47m   10.104.20.154   4am-node22   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-bookie-init-ww94p           0/1     Completed   0                6h47m   10.104.17.33    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-broker-0                    1/1     Running     0                6h47m   10.104.13.168   4am-node16   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-proxy-0                     1/1     Running     0                6h47m   10.104.18.161   4am-node25   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-pulsar-init-tbfcf           0/1     Completed   0                6h47m   10.104.9.225    4am-node14   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-recovery-0                  1/1     Running     0                6h47m   10.104.19.187   4am-node28   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-0                 1/1     Running     0                6h47m   10.104.17.42    4am-node23   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-1                 1/1     Running     0                6h47m   10.104.33.207   4am-node36   <none>           <none>
fouramf-streami15600-1-63-8844-pulsar-zookeeper-2                 1/1     Running     0                6h46m   10.104.19.196   4am-node28   <none>           <none>

enabled streamingNode👇
[image: screenshot 2024-10-12 10 55 33]

disabled streamingNode👇
release name: fouramf-bitmap-scenes-q27w2-7
[image: screenshot 2024-10-12 10 56 55]

client log:
[image: screenshot 2024-10-12 10 52 30]

Expected Behavior

No response

Steps To Reproduce

concurrent test and calculation of RT and QPS

        :purpose:  `partition_key on scalar int64_1 field`, shards_num=16
            verify DQL & DML scenario,
            which has 1 vector fields(IVF_SQ8) and building `BITMAP` index on all supported 12 scalar fields

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim

                'int64_1': partition_key, num_partitions=1024
                all scalar fields: varchar max_length=100, array max_capacity=13
            2. build indexes:
                IVF_SQ8: 'float_vector'

                BITMAP: all scalar fields
            3. insert 50 million data
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - search
                - query
                - hybrid_search
                - load
                - insert
                - delete: delete data 90%
                - flush: ignore RateLimiter

Milvus Log

No response

Anything else?

test result:

[2024-10-11 09:48:33,214 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     delete                                                                            36     0(0.00%) | 108027       7 1324820     21 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     flush                                                                             31     0(0.00%) | 722038    1130 2007647 344000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     hybrid_search                                                                     16    4(25.00%) |1409925       0 2362680 1832000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     insert                                                                            28     0(0.00%) |  83340      21 1322431     96 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,214 -  INFO - fouram]: grpc     load                                                                              19     0(0.00%) |  68352      22 1256742    360 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: grpc     query                                                                             17     0(0.00%) |1307332  277148 2358389 1260000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: grpc     search                                                                            28     0(0.00%) |1532973  766062 2383995 1592000 |    0.00        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]:          Aggregated                                                                       175     4(2.29%) | 672063       0 2383995  12000 |    0.02        0.00 (stats.py:789)
[2024-10-11 09:48:33,215 -  INFO - fouram]:  (stats.py:790)
[2024-10-11 09:48:33,218 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c2m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '16.0', 'memory': '32Gi'}, 'requests': {'cpu': '9.0', 'memory': '17Gi'}}, 'replicas': 5},
                       'indexNode': {'resources': {'limits': {'cpu': '4.0', 'memory': '8Gi'}, 'requests': {'cpu': '3.0', 'memory': '5Gi'}}, 'replicas': 4},
                       'dataNode': {'resources': {'limits': {'cpu': '4.0', 'memory': '16Gi'}, 'requests': {'cpu': '3.0', 'memory': '9Gi'}}, 'replicas': 10},
                       'cluster': {'enabled': True},
                       'pulsar': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'streaming': {'enabled': True},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241011-3fe0f829-amd64'}}},
            'host': 'fouramf-streami15600-1-63-8844-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_bitmap_locust_dql_dml_partition_key_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'max_length': 512,
                                                    'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
                                                                      'int16_1': {'index_type': 'BITMAP'},
                                                                      'int32_1': {'index_type': 'BITMAP'},
                                                                      'int64_1': {'index_type': 'BITMAP'},
                                                                      'varchar_1': {'index_type': 'BITMAP'},
                                                                      'bool_1': {'index_type': 'BITMAP'},
                                                                      'array_int8_1': {'index_type': 'BITMAP'},
                                                                      'array_int16_1': {'index_type': 'BITMAP'},
                                                                      'array_int32_1': {'index_type': 'BITMAP'},
                                                                      'array_int64_1': {'index_type': 'BITMAP'},
                                                                      'array_varchar_1': {'index_type': 'BITMAP'},
                                                                      'array_bool_1': {'index_type': 'BITMAP'}},
                                                    'scalars_params': {'int64_1': {'params': {'is_partition_key': True}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 50000000,
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1', 'bool_1', 'array_int8_1',
                                                                        'array_int16_1', 'array_int32_1', 'array_int64_1', 'array_varchar_1', 'array_bool_1'],
                                                       'shards_num': 16,
                                                       'num_partitions': 1024},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 15, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 1,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 10,
                                                                  'search_param': {'nprobe': 16},
                                                                  'expr': 'int8_1 == 100',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'output_fields': ['id', 'float_vector', 'int64_1'],
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 3000,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'nq': 1000}}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': 10,
                                                                  'ignore_growing': False,
                                                                  'partition_names': None,
                                                                  'timeout': 3000,
                                                                  'consistency_level': None,
                                                                  'random_data': False,
                                                                  'random_count': 0,
                                                                  'random_range': [0, 1],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_query_output',
                                                                  'check_items': {'expect_length': 10}}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'reqs': [{'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': '(array_contains_any(array_int32_1, [0]) || array_contains(array_int64_1, '
                                                                                    '1)) || ((varchar_1 like "1%") and (bool_1 == True))',
                                                                            'top_k': 30},
                                                                           {'search_param': {'nprobe': 64},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'not (int16_1 == int8_1) && ARRAY_CONTAINS_ANY(array_int64_1, [-1, 0, '
                                                                                    '1])'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': None,
                                                                  'timeout': 3000,
                                                                  'random_data': True,
                                                                  'check_task': 'check_search_output',
                                                                  'check_items': {'output_fields': ['int8_1', 'int16_1', 'int32_1', 'int64_1', 'varchar_1',
                                                                                                    'bool_1', 'array_int8_1', 'array_int16_1', 'array_int32_1',
                                                                                                    'array_int64_1', 'array_varchar_1', 'array_bool_1', 'id',
                                                                                                    'float_vector'],
                                                                                  'nq': 10}}},
                                                      {'type': 'load',
                                                       'weight': 1,
                                                       'params': {'replica_number': 1, 'timeout': 180, 'check_task': 'check_response', 'check_items': None}},
                                                      {'type': 'insert',
                                                       'weight': 1,
                                                       'params': {'nb': 10,
                                                                  'timeout': 30,
                                                                  'random_id': True,
                                                                  'random_vector': True,
                                                                  'varchar_filled': False,
                                                                  'start_id': 50000000,
                                                                  'shuffle_id': False,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'delete',
                                                       'weight': 1,
                                                       'params': {'expr': '',
                                                                  'delete_length': 9,
                                                                  'timeout': 30,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'flush',
                                                       'weight': 1,
                                                       'params': {'timeout': 600,
                                                                  'check_task': 'check_ignore_expected_errors',
                                                                  'check_items': [{'message': 'request is rejected by grpc RateLimiter middleware, please '
                                                                                              'retry later'},
                                                                                  {'message': 'wait for flush timeout'}]}}]},
            'run_id': 2024101156796516,
            'datetime': '2024-10-11 03:01:19.694088',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 4672.48,
                                      'int8_1': {'RT': 0.9997},
                                      'int16_1': {'RT': 0.5649},
                                      'int32_1': {'RT': 0.6071},
                                      'int64_1': {'RT': 0.548},
                                      'varchar_1': {'RT': 0.5494},
                                      'bool_1': {'RT': 0.7083},
                                      'array_int8_1': {'RT': 0.7709},
                                      'array_int16_1': {'RT': 0.5407},
                                      'array_int32_1': {'RT': 0.5794},
                                      'array_int64_1': {'RT': 0.5391},
                                      'array_varchar_1': {'RT': 0.541},
                                      'array_bool_1': {'RT': 0.547}},
                            'insert': {'total_time': 7252.4755, 'VPS': 6894.1977, 'batch_time': 0.7252, 'batch': 5000},
                            'flush': {'RT': 141.8577},
                            'load': {'RT': 21.2851},
                            'Locust': {'Aggregated': {'Requests': 175,
                                                      'Fails': 4,
                                                      'RPS': 0.02,
                                                      'fail_s': 0.02,
                                                      'RT_max': 2383995.06,
                                                      'RT_avg': 672063.63,
                                                      'TP50': 12000.0,
                                                      'TP99': 2382000.0},
                                       'delete': {'Requests': 36,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 1324820.55,
                                                  'RT_avg': 108027.29,
                                                  'TP50': 24,
                                                  'TP99': 1325000.0},
                                       'flush': {'Requests': 31,
                                                 'Fails': 0,
                                                 'RPS': 0.0,
                                                 'fail_s': 0.0,
                                                 'RT_max': 2007647.31,
                                                 'RT_avg': 722038.97,
                                                 'TP50': 344000.0,
                                                 'TP99': 2008000.0},
                                       'hybrid_search': {'Requests': 16,
                                                         'Fails': 4,
                                                         'RPS': 0.0,
                                                         'fail_s': 0.25,
                                                         'RT_max': 2362680.11,
                                                         'RT_avg': 1409925.17,
                                                         'TP50': 1982000.0,
                                                         'TP99': 2363000.0},
                                       'insert': {'Requests': 28,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 1322431.89,
                                                  'RT_avg': 83340.22,
                                                  'TP50': 99,
                                                  'TP99': 1322000.0},
                                       'load': {'Requests': 19,
                                                'Fails': 0,
                                                'RPS': 0.0,
                                                'fail_s': 0.0,
                                                'RT_max': 1256742.25,
                                                'RT_avg': 68352.56,
                                                'TP50': 360.0,
                                                'TP99': 1257000.0},
                                       'query': {'Requests': 17,
                                                 'Fails': 0,
                                                 'RPS': 0.0,
                                                 'fail_s': 0.0,
                                                 'RT_max': 2358389.62,
                                                 'RT_avg': 1307332.69,
                                                 'TP50': 1260000.0,
                                                 'TP99': 2358000.0},
                                       'search': {'Requests': 28,
                                                  'Fails': 0,
                                                  'RPS': 0.0,
                                                  'fail_s': 0.0,
                                                  'RT_max': 2383995.06,
                                                  'RT_avg': 1532973.61,
                                                  'TP50': 1682000.0,
                                                  'TP99': 2384000.0}}}}}
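
One way to read the Locust numbers above is to compare the mean against the median for each request type. The following is a minimal sketch, not part of the benchmark tooling: the values are copied from the report, while `skew_ratio`, `classify`, and the threshold of 10 are hypothetical names and choices made only for illustration.

```python
# Per-request Locust stats copied from the report above (units as reported).
LOCUST = {
    "delete":        {"RT_avg": 108027.29,  "TP50": 24,        "TP99": 1325000.0},
    "flush":         {"RT_avg": 722038.97,  "TP50": 344000.0,  "TP99": 2008000.0},
    "hybrid_search": {"RT_avg": 1409925.17, "TP50": 1982000.0, "TP99": 2363000.0},
    "insert":        {"RT_avg": 83340.22,   "TP50": 99,        "TP99": 1322000.0},
    "load":          {"RT_avg": 68352.56,   "TP50": 360.0,     "TP99": 1257000.0},
    "query":         {"RT_avg": 1307332.69, "TP50": 1260000.0, "TP99": 2358000.0},
    "search":        {"RT_avg": 1532973.61, "TP50": 1682000.0, "TP99": 2384000.0},
}


def skew_ratio(stats):
    """Mean divided by median; a large value means a few slow outliers dominate."""
    return stats["RT_avg"] / stats["TP50"]


def classify(locust, threshold=10.0):
    """Split request types into (tail_dominated, uniformly_slow) by skew ratio.

    The threshold is an arbitrary illustrative cut-off, not a Locust setting.
    """
    tail_dominated, uniformly_slow = [], []
    for name, stats in sorted(locust.items()):
        if skew_ratio(stats) >= threshold:
            tail_dominated.append(name)
        else:
            uniformly_slow.append(name)
    return tail_dominated, uniformly_slow


if __name__ == "__main__":
    tail, slow = classify(LOCUST)
    print("tail-dominated:", tail)
    print("uniformly slow:", slow)
```

With these numbers, delete, insert, and load have a small median and a very large mean, so they are usually fast with a few extreme outliers, while search, query, hybrid_search, and flush are slow at the median as well.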
wangting0128 added the kind/bug, needs-triage, and test/benchmark labels on Oct 12, 2024
wangting0128 added this to the 2.5.0 milestone on Oct 12, 2024
@chyezh
Contributor

chyezh commented Oct 12, 2024

It seems that the difference in flush policy makes the final segment sizes different.
Also, the workload is too high, so the root cause may be the scheduling policy of the querynode.
Most of the cost is queue time, not execution time.
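
To illustrate why queue time can dominate once the workload exceeds capacity, here is a toy FIFO model. It is only a sketch under simplifying assumptions (fixed arrival interval, fixed execution time, identical workers) and does not reflect the actual querynode scheduler; `simulate_fifo` and `mean_queue_time` are hypothetical helpers.

```python
def simulate_fifo(arrival_interval, exec_time, n_requests, workers=1):
    """Return a (queue_time, exec_time) pair per request for a FIFO worker pool.

    Request i arrives at i * arrival_interval and is served by whichever
    worker becomes idle first.
    """
    free_at = [0.0] * workers  # time at which each worker becomes idle
    results = []
    for i in range(n_requests):
        arrival = i * arrival_interval
        w = min(range(workers), key=lambda k: free_at[k])
        start = max(arrival, free_at[w])  # wait if the worker is still busy
        free_at[w] = start + exec_time
        results.append((start - arrival, exec_time))
    return results


def mean_queue_time(results):
    """Average time requests spent waiting before execution started."""
    return sum(queue for queue, _ in results) / len(results)


if __name__ == "__main__":
    # Arrivals slower than execution: nothing ever waits.
    print(mean_queue_time(simulate_fifo(2.0, 1.0, 10)))
    # Arrivals faster than execution: the backlog grows with every request.
    print(mean_queue_time(simulate_fifo(1.0, 2.0, 10)))
```

In the overloaded case the execution time per request stays constant while the wait grows linearly with the request index, so the measured RT is mostly queue time even though each request is no slower to execute.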

@chyezh
Contributor

chyezh commented Oct 14, 2024

This may be related to #36761.

@yanliang567
Contributor

/unassign

yanliang567 added the triage/accepted label and removed the needs-triage label on Oct 14, 2024
@chyezh
Contributor

chyezh commented Oct 14, 2024

#36761 is merged, but I am not sure that it fixes this issue.
@wangting0128 please help rerun this test with commit f0f5147aefe581b87e30b7b144dc801d7926322e.
Thanks!

@wangting0128
Contributor Author

> #36761 is merged, but I am not sure that it fixes this issue. @wangting0128 please help rerun this test with commit f0f5147aefe581b87e30b7b144dc801d7926322e. Thanks!

Verification failed

argo task: fouramf-kmshk
image: master-20241014-d566b0ce-amd64

server:

NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
verify-36804-rt-etcd-0                                            1/1     Running            0               9h      10.104.18.119   4am-node25   <none>           <none>
verify-36804-rt-etcd-1                                            1/1     Running            0               9h      10.104.34.235   4am-node37   <none>           <none>
verify-36804-rt-etcd-2                                            1/1     Running            0               9h      10.104.19.59    4am-node28   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-2xctx                  1/1     Running            2 (9h ago)      9h      10.104.32.154   4am-node39   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-6mxxn                  1/1     Running            2 (9h ago)      9h      10.104.4.149    4am-node11   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7x59m                  1/1     Running            1 (9h ago)      9h      10.104.14.14    4am-node18   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-7xz8j                  1/1     Running            2 (9h ago)      9h      10.104.17.25    4am-node23   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-c5gt4                  1/1     Running            2 (9h ago)      9h      10.104.15.61    4am-node20   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-mz7p6                  1/1     Running            2 (9h ago)      9h      10.104.18.115   4am-node25   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-r8nzw                  1/1     Running            2 (9h ago)      9h      10.104.25.133   4am-node30   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-vdsww                  1/1     Running            2 (9h ago)      9h      10.104.9.182    4am-node14   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-xwdv4                  1/1     Running            1 (9h ago)      9h      10.104.13.126   4am-node16   <none>           <none>
verify-36804-rt-milvus-datanode-7b9b88b76b-z6k7m                  1/1     Running            2 (9h ago)      9h      10.104.19.52    4am-node28   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-msbkh                  1/1     Running            2 (9h ago)      9h      10.104.4.150    4am-node11   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-qm5xp                  1/1     Running            2 (9h ago)      9h      10.104.34.230   4am-node37   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-rwrl6                  1/1     Running            0               9h      10.104.14.16    4am-node18   <none>           <none>
verify-36804-rt-milvus-indexnode-c646989df-sbp24                  1/1     Running            2 (9h ago)      9h      10.104.5.14     4am-node12   <none>           <none>
verify-36804-rt-milvus-mixcoord-c7cc55b48-xfq49                   1/1     Running            1 (9h ago)      9h      10.104.14.17    4am-node18   <none>           <none>
verify-36804-rt-milvus-proxy-5cb97c6d46-rdb62                     1/1     Running            3 (9h ago)      9h      10.104.4.151    4am-node11   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-86wj5                 1/1     Running            2 (9h ago)      9h      10.104.9.183    4am-node14   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-hqdm5                 1/1     Running            2 (9h ago)      9h      10.104.4.152    4am-node11   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-krb4q                 1/1     Running            0               9h      10.104.14.18    4am-node18   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-s5277                 1/1     Running            2 (9h ago)      9h      10.104.15.62    4am-node20   <none>           <none>
verify-36804-rt-milvus-querynode-7c464946df-wz7nb                 1/1     Running            1 (9h ago)      9h      10.104.20.147   4am-node22   <none>           <none>
verify-36804-rt-milvus-streamingnode-59dbdd5fc8-brg85             1/1     Running            3 (9h ago)      9h      10.104.4.147    4am-node11   <none>           <none>
verify-36804-rt-minio-0                                           1/1     Running            0               9h      10.104.20.149   4am-node22   <none>           <none>
verify-36804-rt-minio-1                                           1/1     Running            0               9h      10.104.34.236   4am-node37   <none>           <none>
verify-36804-rt-minio-2                                           1/1     Running            0               9h      10.104.17.27    4am-node23   <none>           <none>
verify-36804-rt-minio-3                                           1/1     Running            0               9h      10.104.19.60    4am-node28   <none>           <none>
verify-36804-rt-pulsar-bookie-0                                   1/1     Running            0               9h      10.104.30.6     4am-node38   <none>           <none>
verify-36804-rt-pulsar-bookie-1                                   1/1     Running            0               9h      10.104.34.239   4am-node37   <none>           <none>
verify-36804-rt-pulsar-bookie-2                                   1/1     Running            0               9h      10.104.17.33    4am-node23   <none>           <none>
verify-36804-rt-pulsar-bookie-init-jtskk                          0/1     Completed          0               9h      10.104.4.148    4am-node11   <none>           <none>
verify-36804-rt-pulsar-broker-0                                   1/1     Running            0               9h      10.104.14.15    4am-node18   <none>           <none>
verify-36804-rt-pulsar-proxy-0                                    1/1     Running            0               9h      10.104.5.15     4am-node12   <none>           <none>
verify-36804-rt-pulsar-pulsar-init-hnnh5                          0/1     Completed          0               9h      10.104.14.12    4am-node18   <none>           <none>
verify-36804-rt-pulsar-recovery-0                                 1/1     Running            0               9h      10.104.14.13    4am-node18   <none>           <none>
verify-36804-rt-pulsar-zookeeper-0                                1/1     Running            0               9h      10.104.19.55    4am-node28   <none>           <none>
verify-36804-rt-pulsar-zookeeper-1                                1/1     Running            0               9h      10.104.24.66    4am-node29   <none>           <none>
verify-36804-rt-pulsar-zookeeper-2                                1/1     Running            0               9h      10.104.21.206   4am-node24   <none>           <none>

image

client log: hybrid_search request timeout
Screenshot 2024-10-15 10 54 03

@chyezh

@chyezh
Contributor

chyezh commented Oct 15, 2024

After the flush policy was fixed:

First

Milvus with the streaming service ends up with 1.11k sealed segments, while Milvus without the streaming service ends up with 2k sealed segments.
So segments in Milvus with the streaming service are roughly twice the size of those in Milvus without it.
This is the major difference between the two test cases.
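
As a rough check of the "double size" estimate: if both runs ingest the same total amount of data (an assumption, the thread does not state totals), the average segment size is inversely proportional to the sealed segment count. The helper name `avg_size_ratio` is made up for this sketch.

```python
def avg_size_ratio(count_a: int, count_b: int) -> float:
    """Average segment size of run A relative to run B, assuming equal total data."""
    return count_b / count_a


# Segment counts quoted above: 1.11k with streaming, 2k without.
ratio = avg_size_ratio(1110, 2000)
print(f"{ratio:.2f}x")  # prints "1.80x"
```

Under that assumption the streaming run's segments are about 1.8 times larger on average, which is consistent with "roughly double".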

Streaming:
image

No Streaming:
image

Second

The message consumer of Milvus with the streaming service works correctly, so the problem is not introduced by the streaming service.

image
[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]

The timestamp `453229488314253321` corresponds to `2024-10-14 19:14:49.773`, about 250 ms before the log line above was written.
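
The conversion can be reproduced with a short sketch. It assumes the hybrid-timestamp layout in which the low 18 bits are a logical counter and the remaining high bits are the physical time in Unix milliseconds; the helper name `decode_hybrid_ts` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumed width of the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_hybrid_ts(ts: int):
    """Split a hybrid timestamp into (UTC wall-clock time, logical counter)."""
    physical_ms = ts >> LOGICAL_BITS
    logical = ts & ((1 << LOGICAL_BITS) - 1)
    # Integer milliseconds keep the conversion exact (no float rounding).
    return EPOCH + timedelta(milliseconds=physical_ms), logical


wall, logical = decode_hybrid_ts(453229488314253321)
print(wall.isoformat(timespec="milliseconds"), logical)
# prints "2024-10-14T19:14:49.773+00:00 9"
```

The decoded time is in UTC, matching the `+00:00` offset of the log lines.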

Third

Found that some requests still wait a long time for tsafe whether streaming is used or not, and the ProcessInsert delay increases periodically:

No Streaming:
image
image

Streaming:
image
image

Fourth

Found that inserting a new message took about 20 minutes when a new segment had to be created (19:14:50 to 19:35:09 in the logs below).

[2024/10/14 19:14:50.024 +00:00] [DEBUG] [pipeline/insert_node.go:80] ["pipeline fetch insert msg"] [collectionID=453223040984548132] [segmentID=453223041097817164] [insertRowNum=1] [timestampMin=453229488314253321] [timestampMax=453229488314253321]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [delegator/delegator_data.go:341] ["add growing segments to delegator"] [collectionID=453223040984548132] [channel=by-dev-rootcoord-dml_3_453223040984548132v3] [replicaID=453223041145765889] [segmentIDs="[453223041097817164]"]

@chyezh
Contributor

chyezh commented Oct 15, 2024

Found that the insert operation is blocked on acquiring the mutex growingSegmentLock.
This mutex is also acquired by ReleaseSegments.
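
The blocking pattern can be illustrated with a minimal sketch. This is not the querynode code: the function names, the 0.2 s hold time and the plain `threading.Lock` are all stand-ins chosen for illustration. It only shows that an insert path sharing a lock with a slow release path waits for as long as the release path holds it.

```python
import threading
import time

growing_segment_lock = threading.Lock()  # stand-in for growingSegmentLock


def release_segments(hold_seconds: float, started: threading.Event) -> None:
    """Hypothetical release path: keeps the lock for the whole operation."""
    with growing_segment_lock:
        started.set()
        time.sleep(hold_seconds)  # stands in for a slow step done under the lock


def process_insert() -> float:
    """Hypothetical insert path: returns how long it waited for the lock."""
    t0 = time.monotonic()
    with growing_segment_lock:
        return time.monotonic() - t0


started = threading.Event()
releaser = threading.Thread(target=release_segments, args=(0.2, started))
releaser.start()
started.wait()  # make sure the release path owns the lock first
waited = process_insert()
releaser.join()
print(f"insert waited {waited:.2f}s for the lock")
```

The insert's wait tracks the release's hold time, so anything that stalls the release while it holds the lock stalls inserts by the same amount.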

The release operation for segment 453223041097816587 took 1h3m.
The release operation itself is blocked because of distribution expiration.

[2024/10/14 18:32:37.675 +00:00] [INFO] [querynodev2/services.go:545] ["received release segment request"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [shard=by-dev-rootcoord-dml_3_453223040984548132v3] [segmentIDs="[453223041097816587]"] [currentNodeID=6] [scope=Streaming] [needTransfer=true]
...
[2024/10/14 19:35:09.020 +00:00] [INFO] [segments/segment.go:1467] ["delete segment from memory"] [traceID=be09552d1e1e59448c7da62ffb1a9f5f] [collectionID=453223040984548132] [partitionID=453223040984548984] [segmentID=453223041097816587] [segmentType=Growing] [insertCount=1]
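
The durations can be checked directly from the log timestamps quoted in this comment and the previous one. A small sketch; the `elapsed` helper is made up for illustration and the timestamps are copied from the log lines.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %H:%M:%S.%f"  # timestamp format used in the log lines


def elapsed(start: str, end: str) -> timedelta:
    """Time between two log timestamps given without the UTC offset."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# "pipeline fetch insert msg" -> "add growing segments to delegator"
insert_wait = elapsed("2024/10/14 19:14:50.024", "2024/10/14 19:35:09.020")
# "received release segment request" -> "delete segment from memory"
release_time = elapsed("2024/10/14 18:32:37.675", "2024/10/14 19:35:09.020")

print(insert_wait)   # prints "0:20:18.996000"
print(release_time)  # prints "1:02:31.345000"
```

Both intervals end at the same instant, 19:35:09.020, which fits the reading that the insert could only proceed once the release finished and gave up the lock.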

@chyezh
Contributor

chyezh commented Oct 21, 2024

The tsafe timeout should be fixed by PR #36997.
Another difference found:

After the flush policy was fixed:
Milvus with the streaming service continuously generates flushed segments, and compaction runs more frequently and smoothly.
image
The flush operation is triggered by the binlog file number policy.
image
So Milvus with the streaming service performs more compaction and handoff operations than Milvus without it,
and ends up with fewer segments: about 1000 L1 sealed segments.

Meanwhile, Milvus without the streaming service does not generate flushed segments as steadily.
image
It performs fewer compaction and handoff operations and ends up with about 1750 L1 sealed segments.

So Milvus without the streaming service encounters fewer race conditions during handoff than Milvus with it, and achieves better RT.

sre-ci-robot pushed a commit that referenced this issue Oct 21, 2024
issue: #36804

Signed-off-by: chyezh <chyezh@outlook.com>
@chyezh
Contributor

chyezh commented Oct 21, 2024

@wangting0128 please retry the test at commit ac178eeea569cb5c1f86e57ebe448ac4e15f4cb4.
thx.

@chyezh
Contributor

chyezh commented Oct 22, 2024

At the latest commit, the tsafe problem is fixed.
But the search latency is still high.

image

sre-ci-robot pushed a commit that referenced this issue Oct 23, 2024
issue: #36804

Signed-off-by: chyezh <chyezh@outlook.com>
@chyezh
Contributor

chyezh commented Oct 25, 2024

Found that f43527e increases the RT.
03a78ec keeps the RT unchanged.

@chyezh
Contributor

chyezh commented Oct 25, 2024

Found that scalar search latency increased:
df7070e2-95be-40b5-8a0f-f43e004753f2

b1e520f2-2963-4ca5-b09a-774ebb2e72e4

@xiaofan-luan
Collaborator

> Found that scalar search latency increased: df7070e2-95be-40b5-8a0f-f43e004753f2
>
> b1e520f2-2963-4ca5-b09a-774ebb2e72e4

What is master being compared with here? Could this be impacted by null?

@wangting0128
Contributor Author

> > Found that scalar search latency increased: df7070e2-95be-40b5-8a0f-f43e004753f2
> > b1e520f2-2963-4ca5-b09a-774ebb2e72e4
>
> What is master being compared with here? Could this be impacted by null?

This is a comparison of the deployment of instances with and without streamingNode on the same case.

@chyezh
Contributor

chyezh commented Oct 28, 2024

> This is a comparison of the deployment of instances with and without streamingNode on the same case.

No, both tests ran on Milvus at different commits, with streaming disabled in both.

@xiaofan-luan
Collaborator

> The tsafe timeout should be fixed by PR #36997. Another difference found:
>
> After the flush policy was fixed: Milvus with the streaming service continuously generates flushed segments, and compaction runs more frequently and smoothly. The flush operation is triggered by the binlog file number policy. So Milvus with the streaming service performs more compaction and handoff operations than Milvus without it, and ends up with fewer segments: about 1000 L1 sealed segments.
>
> Meanwhile, Milvus without the streaming service does not generate flushed segments as steadily. It performs fewer compaction and handoff operations and ends up with about 1750 L1 sealed segments.
>
> So Milvus without the streaming service encounters fewer race conditions during handoff than Milvus with it, and achieves better RT.

Is there a special reason why so many binlogs are actually generated?

@chyezh
Contributor

chyezh commented Nov 7, 2024

> Is there a special reason why so many binlogs are actually generated?

There is a binlog-num-based flush policy in Milvus.
In the previous implementation:

  1. Milvus without streaming: used stats-log-num to determine the "binlog-num".
  2. Milvus with streaming: used the real bin-log-num to determine the "binlog-num", so the count was multiplied by the field count.

This has been fixed by #37037; Milvus with streaming is now consistent with Milvus without streaming.
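
The effect of the two counting methods can be sketched as follows. This is an illustration under stated assumptions, not the Milvus implementation: it assumes one stats log per flush and one binlog per field per flush, and the threshold of 32 and field count of 8 are hypothetical values, not Milvus defaults.

```python
def binlog_num_from_stats_logs(flushes: int) -> int:
    """Old non-streaming counting: one stats log per flush (assumed)."""
    return flushes


def binlog_num_from_real_binlogs(flushes: int, field_count: int) -> int:
    """Old streaming counting: one binlog per field per flush (assumed)."""
    return flushes * field_count


def flushes_until_trigger(threshold: int, counted_per_flush: int) -> int:
    """Flushes needed before the counted binlog-num reaches the threshold."""
    return -(-threshold // counted_per_flush)  # ceiling division


THRESHOLD = 32    # hypothetical policy limit
FIELD_COUNT = 8   # hypothetical schema width

without_streaming = flushes_until_trigger(THRESHOLD, binlog_num_from_stats_logs(1))
with_streaming = flushes_until_trigger(
    THRESHOLD, binlog_num_from_real_binlogs(1, FIELD_COUNT)
)
print(without_streaming, with_streaming)  # prints "32 4"
```

With these numbers the policy fires after 4 flushes instead of 32, that is, field-count times sooner, which is consistent with the more frequent flushing observed with streaming before the fix.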


stale bot commented Dec 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Dec 8, 2024
@yanliang567 yanliang567 modified the milestones: 2.5.0, 2.5.1 Dec 24, 2024
@stale stale bot removed the stale indicates no udpates for 30 days label Dec 24, 2024