[Fix](Rowset Id) Use a randomly generated rowset ID to handle memory write failures #42949

Merged: 7 commits, Nov 14, 2024

Conversation

Yukang-Lian
Collaborator

Proposed changes

Issue Number: close #xxx

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@cambyzju
Contributor

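1. This is a very fundamental change. Is there a reasonable scenario that requires it?
2. Could randomly generating a rowset ID risk mismatches with metadata stored elsewhere?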

be/src/olap/olap_common.h (outdated, resolved review thread)
@Yukang-Lian
Collaborator Author

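1. This is a very fundamental change. Is there a reasonable scenario that requires it? 2. Could randomly generating a rowset ID risk mismatches with metadata stored elsewhere?

  1. When memory has been corrupted by a stray write, the BE fails to start; this change avoids that startup failure. 2. The random generation does need to change: the ID will be produced through the common rowset ID generation method instead.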

@Yukang-Lian
Collaborator Author

run buildall

Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 41779 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4c87a4c864fcb9b46ff45d08c0ab2f5a216743c9, data reload: false

------ Round 1 ----------------------------------
q1	17610	7522	7342	7342
q2	2053	168	183	168
q3	10621	1129	1187	1129
q4	10270	860	960	860
q5	7781	3160	3132	3132
q6	246	151	149	149
q7	1034	618	604	604
q8	9361	1957	2007	1957
q9	6639	6499	6471	6471
q10	7082	2395	2469	2395
q11	449	247	243	243
q12	417	222	220	220
q13	17809	3039	3004	3004
q14	238	226	222	222
q15	588	543	513	513
q16	659	599	573	573
q17	985	559	605	559
q18	7509	6779	6787	6779
q19	1343	1024	1057	1024
q20	466	182	182	182
q21	3990	3328	3265	3265
q22	1112	988	1003	988
Total cold run time: 108262 ms
Total hot run time: 41779 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7358	7344	7320	7320
q2	337	232	226	226
q3	2922	2821	2835	2821
q4	1922	1736	1741	1736
q5	5475	5485	5544	5485
q6	221	142	141	141
q7	2139	1740	1733	1733
q8	3270	3429	3440	3429
q9	8588	8643	8641	8641
q10	3521	3470	3436	3436
q11	569	480	494	480
q12	787	590	566	566
q13	8054	3007	3009	3007
q14	299	287	266	266
q15	573	525	509	509
q16	701	648	632	632
q17	1839	1628	1581	1581
q18	7958	7433	7360	7360
q19	1706	1687	1499	1499
q20	2091	1808	1832	1808
q21	5475	5260	5227	5227
q22	1122	1016	1031	1016
Total cold run time: 66927 ms
Total hot run time: 58919 ms

@doris-robot

TPC-DS: Total hot run time: 191901 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4c87a4c864fcb9b46ff45d08c0ab2f5a216743c9, data reload: false

query1	969	366	365	365
query2	6513	2140	2086	2086
query3	6802	214	220	214
query4	33991	23717	23591	23591
query5	4330	443	449	443
query6	262	169	168	168
query7	4593	299	292	292
query8	298	230	228	228
query9	9372	2718	2693	2693
query10	470	270	258	258
query11	18298	15423	15323	15323
query12	154	103	100	100
query13	1657	423	416	416
query14	10569	7316	6833	6833
query15	319	177	181	177
query16	8100	464	437	437
query17	1753	576	562	562
query18	2144	315	298	298
query19	361	160	150	150
query20	115	107	106	106
query21	204	102	102	102
query22	4619	4291	4460	4291
query23	34867	33913	34224	33913
query24	11238	2792	2777	2777
query25	646	405	394	394
query26	1224	158	160	158
query27	2830	280	293	280
query28	8142	2421	2457	2421
query29	857	426	421	421
query30	321	173	158	158
query31	1045	820	868	820
query32	96	56	61	56
query33	807	279	272	272
query34	988	510	531	510
query35	907	739	759	739
query36	1127	974	972	972
query37	138	72	74	72
query38	4468	4258	4200	4200
query39	1445	1448	1457	1448
query40	200	101	102	101
query41	50	47	46	46
query42	113	99	105	99
query43	533	515	509	509
query44	1320	825	813	813
query45	187	166	171	166
query46	1159	684	704	684
query47	1949	1878	1880	1878
query48	409	333	332	332
query49	1172	409	401	401
query50	809	384	385	384
query51	7319	7149	7094	7094
query52	100	88	89	88
query53	253	184	185	184
query54	1304	419	436	419
query55	86	83	82	82
query56	280	244	246	244
query57	1309	1205	1169	1169
query58	247	205	215	205
query59	3482	3152	3145	3145
query60	269	249	249	249
query61	104	105	103	103
query62	857	648	688	648
query63	212	191	189	189
query64	5286	618	591	591
query65	3334	3236	3198	3198
query66	1447	309	311	309
query67	16284	15844	15684	15684
query68	4913	539	541	539
query69	427	252	250	250
query70	1206	1163	1153	1153
query71	430	258	252	252
query72	6424	3999	3898	3898
query73	780	355	370	355
query74	9520	9104	8993	8993
query75	3440	2709	2661	2661
query76	2946	1194	1151	1151
query77	398	266	273	266
query78	10337	9427	9426	9426
query79	1316	601	607	601
query80	964	454	420	420
query81	552	237	238	237
query82	960	119	120	119
query83	216	136	134	134
query84	232	69	73	69
query85	1279	297	294	294
query86	363	306	303	303
query87	4788	4922	4669	4669
query88	3463	2216	2168	2168
query89	403	296	293	293
query90	2029	190	187	187
query91	137	100	98	98
query92	56	47	51	47
query93	1088	551	536	536
query94	991	301	295	295
query95	353	249	248	248
query96	613	276	277	276
query97	2921	2714	2726	2714
query98	208	198	192	192
query99	1555	1312	1299	1299
Total cold run time: 302421 ms
Total hot run time: 191901 ms

@doris-robot

ClickBench: Total hot run time: 31.77 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4c87a4c864fcb9b46ff45d08c0ab2f5a216743c9, data reload: false

query1	0.03	0.03	0.02
query2	0.07	0.02	0.03
query3	0.23	0.07	0.06
query4	1.64	0.10	0.10
query5	0.42	0.41	0.41
query6	1.15	0.66	0.65
query7	0.02	0.01	0.02
query8	0.04	0.03	0.02
query9	0.56	0.48	0.50
query10	0.55	0.55	0.56
query11	0.14	0.11	0.11
query12	0.15	0.11	0.12
query13	0.60	0.59	0.59
query14	2.71	2.70	2.86
query15	0.91	0.82	0.83
query16	0.38	0.36	0.41
query17	1.00	1.06	1.00
query18	0.24	0.22	0.22
query19	1.95	1.80	1.93
query20	0.02	0.01	0.01
query21	15.36	0.62	0.60
query22	2.61	2.58	1.58
query23	16.96	0.87	0.85
query24	3.47	0.64	1.73
query25	0.29	0.06	0.05
query26	0.56	0.14	0.13
query27	0.05	0.04	0.06
query28	10.60	1.11	1.07
query29	12.52	3.24	3.24
query30	0.25	0.06	0.07
query31	2.86	0.37	0.38
query32	3.27	0.46	0.46
query33	3.00	2.99	3.05
query34	17.10	4.45	4.45
query35	4.51	4.54	4.43
query36	0.66	0.50	0.48
query37	0.09	0.06	0.06
query38	0.04	0.03	0.04
query39	0.03	0.02	0.02
query40	0.15	0.12	0.12
query41	0.07	0.02	0.02
query42	0.03	0.02	0.02
query43	0.03	0.02	0.03
Total cold run time: 107.32 s
Total hot run time: 31.77 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.94% (9850/25962)
Line Coverage: 29.20% (82090/281115)
Region Coverage: 28.45% (42356/148858)
Branch Coverage: 25.02% (21516/85984)
Coverage Report: http://coverage.selectdb-in.cc/coverage/4c87a4c864fcb9b46ff45d08c0ab2f5a216743c9_4c87a4c864fcb9b46ff45d08c0ab2f5a216743c9/report/index.html

be/src/olap/olap_common.h (outdated, resolved review thread)
@Yukang-Lian Yukang-Lian force-pushed the Fix-RowsetId-Init-Fault branch from 8b53bbe to 893d440 Compare November 5, 2024 18:03
@Yukang-Lian
Collaborator Author

run buildall

Contributor

github-actions bot commented Nov 5, 2024

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TPC-H: Total hot run time: 41670 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 893d4405e4076662ddbafa9d7e5388a1d34d48ac, data reload: false

------ Round 1 ----------------------------------
q1	17566	7481	7300	7300
q2	2054	166	163	163
q3	10560	1148	1130	1130
q4	10248	849	838	838
q5	7708	3100	3136	3100
q6	235	147	145	145
q7	1000	616	611	611
q8	9373	2020	1998	1998
q9	6678	6495	6479	6479
q10	7047	2431	2425	2425
q11	454	257	255	255
q12	414	221	211	211
q13	17760	3048	3035	3035
q14	243	213	206	206
q15	563	521	516	516
q16	656	583	581	581
q17	983	522	550	522
q18	7573	6719	6844	6719
q19	1330	1018	1018	1018
q20	479	189	185	185
q21	4117	3411	3231	3231
q22	1146	1002	1025	1002
Total cold run time: 108187 ms
Total hot run time: 41670 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7289	7268	7247	7247
q2	319	227	226	226
q3	2965	2816	2814	2814
q4	1980	1722	1704	1704
q5	5525	5511	5519	5511
q6	219	138	142	138
q7	2161	1775	1715	1715
q8	3277	3438	3437	3437
q9	8692	8636	8564	8564
q10	3537	3492	3455	3455
q11	596	503	499	499
q12	777	574	607	574
q13	10052	3001	3031	3001
q14	293	263	267	263
q15	572	514	512	512
q16	679	624	640	624
q17	1845	1604	1619	1604
q18	7817	7446	7577	7446
q19	1674	1524	1541	1524
q20	2057	1853	1832	1832
q21	5654	5204	5393	5204
q22	1141	1025	1023	1023
Total cold run time: 69121 ms
Total hot run time: 58917 ms

@doris-robot

TPC-DS: Total hot run time: 191648 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 893d4405e4076662ddbafa9d7e5388a1d34d48ac, data reload: false

query1	984	370	369	369
query2	7086	2105	2090	2090
query3	6784	226	224	224
query4	33889	23585	23880	23585
query5	4328	462	444	444
query6	269	164	158	158
query7	4608	297	298	297
query8	292	232	222	222
query9	9713	2664	2655	2655
query10	474	259	253	253
query11	18173	15453	15498	15453
query12	148	99	100	99
query13	1686	421	422	421
query14	10127	7619	7140	7140
query15	292	179	180	179
query16	7980	441	493	441
query17	1771	564	545	545
query18	2121	298	292	292
query19	360	153	147	147
query20	118	109	106	106
query21	207	103	104	103
query22	4559	4471	4264	4264
query23	34605	33851	33910	33851
query24	11081	2812	2800	2800
query25	680	397	409	397
query26	1323	164	162	162
query27	2703	280	279	279
query28	7986	2470	2438	2438
query29	871	425	429	425
query30	320	159	160	159
query31	1033	809	806	806
query32	96	57	59	57
query33	777	274	278	274
query34	977	507	519	507
query35	882	736	747	736
query36	1097	951	972	951
query37	140	75	76	75
query38	4342	4276	4327	4276
query39	1460	1408	1463	1408
query40	267	100	102	100
query41	52	50	48	48
query42	110	98	95	95
query43	538	477	470	470
query44	1272	817	795	795
query45	188	164	170	164
query46	1154	693	692	692
query47	1970	1803	1831	1803
query48	439	322	329	322
query49	1162	398	405	398
query50	814	388	394	388
query51	7395	7107	7207	7107
query52	104	88	88	88
query53	258	180	177	177
query54	1370	417	419	417
query55	79	79	79	79
query56	277	234	252	234
query57	1300	1172	1188	1172
query58	242	204	233	204
query59	3222	3082	3083	3082
query60	287	251	240	240
query61	107	103	106	103
query62	859	683	655	655
query63	216	184	184	184
query64	5361	652	617	617
query65	3294	3193	3219	3193
query66	1454	303	299	299
query67	16040	15618	15466	15466
query68	4970	559	554	554
query69	422	253	248	248
query70	1117	1129	1139	1129
query71	422	257	247	247
query72	6654	4060	4043	4043
query73	765	354	362	354
query74	10419	9006	9077	9006
query75	3474	2662	2664	2662
query76	2906	1000	1039	1000
query77	417	274	289	274
query78	10441	9421	9314	9314
query79	1256	587	590	587
query80	805	435	419	419
query81	532	238	237	237
query82	1330	128	113	113
query83	192	136	140	136
query84	236	66	69	66
query85	1125	306	299	299
query86	309	300	302	300
query87	4743	4714	4650	4650
query88	3479	2226	2165	2165
query89	404	288	292	288
query90	2272	188	183	183
query91	138	103	106	103
query92	64	47	51	47
query93	1053	550	545	545
query94	932	291	296	291
query95	350	248	249	248
query96	611	278	288	278
query97	2853	2703	2780	2703
query98	202	195	206	195
query99	1709	1294	1284	1284
Total cold run time: 302775 ms
Total hot run time: 191648 ms

@doris-robot

ClickBench: Total hot run time: 32.44 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 893d4405e4076662ddbafa9d7e5388a1d34d48ac, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.04
query3	0.23	0.06	0.06
query4	1.64	0.10	0.10
query5	0.42	0.40	0.41
query6	1.15	0.66	0.65
query7	0.03	0.02	0.02
query8	0.04	0.03	0.03
query9	0.55	0.50	0.50
query10	0.55	0.54	0.57
query11	0.13	0.11	0.11
query12	0.14	0.12	0.10
query13	0.60	0.60	0.59
query14	2.83	2.70	2.74
query15	0.91	0.84	0.82
query16	0.38	0.38	0.37
query17	1.03	1.07	1.03
query18	0.24	0.22	0.23
query19	1.96	1.81	1.95
query20	0.02	0.02	0.01
query21	15.38	0.59	0.59
query22	2.95	1.74	1.83
query23	16.96	0.96	0.88
query24	3.37	1.42	0.76
query25	0.32	0.18	0.12
query26	0.29	0.14	0.13
query27	0.05	0.04	0.03
query28	10.52	1.10	1.07
query29	12.51	3.29	3.26
query30	0.26	0.07	0.06
query31	2.85	0.40	0.38
query32	3.27	0.47	0.46
query33	3.00	3.10	3.06
query34	17.14	4.48	4.48
query35	4.54	4.51	4.51
query36	0.66	0.49	0.48
query37	0.09	0.06	0.07
query38	0.04	0.04	0.03
query39	0.03	0.02	0.02
query40	0.16	0.12	0.12
query41	0.08	0.03	0.02
query42	0.03	0.03	0.03
query43	0.03	0.03	0.03
Total cold run time: 107.48 s
Total hot run time: 32.44 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.86% (9838/25985)
Line Coverage: 29.02% (81804/281862)
Region Coverage: 28.27% (42165/149165)
Branch Coverage: 24.86% (21402/86098)
Coverage Report: http://coverage.selectdb-in.cc/coverage/893d4405e4076662ddbafa9d7e5388a1d34d48ac_893d4405e4076662ddbafa9d7e5388a1d34d48ac/report/index.html

@Yukang-Lian
Collaborator Author

run buildall

Contributor

github-actions bot commented Nov 7, 2024

clang-tidy review says "All clean, LGTM! 👍"

@Yukang-Lian Yukang-Lian requested a review from gavinchou November 7, 2024 12:50
@Yukang-Lian
Collaborator Author

run beut

@Yukang-Lian
Collaborator Author

run buildall

@Yukang-Lian
Collaborator Author

run beut

@Yukang-Lian
Collaborator Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.94% (9888/26064)
Line Coverage: 29.14% (82644/283615)
Region Coverage: 28.26% (42466/150253)
Branch Coverage: 24.85% (21542/86694)
Coverage Report: http://coverage.selectdb-in.cc/coverage/3f2f669e520928dd06c12f14265a08ff9dad246b_3f2f669e520928dd06c12f14265a08ff9dad246b/report/index.html

@gavinchou gavinchou merged commit 002bbdc into apache:master Nov 14, 2024
25 of 28 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 14, 2024
Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Nov 17, 2024
dataroaring pushed a commit that referenced this pull request Dec 9, 2024
gavinchou pushed a commit that referenced this pull request Dec 9, 2024
…ndle memory write failures #42949 (#43970)

Cherry-picked from #42949

Co-authored-by: abmdocrt <lianyukang@selectdb.com>
Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Dec 9, 2024
yiguolei pushed a commit that referenced this pull request Dec 10, 2024
…ated rowset ID to handle memory write failures (#42949)" (#44086)
Yukang-Lian added a commit to Yukang-Lian/doris that referenced this pull request Dec 27, 2024
gavinchou pushed a commit that referenced this pull request Dec 27, 2024
In PR #42949, during the rowset ID initialization process, we used a
random ID to replace the rowset ID that failed during serialization.
However, the generation of random IDs depends on the storage engine,
which hasn't been initialized during the rowset ID initialization
process, leading to a core dump. This PR fixes this issue by uniformly
using MAX_ROWSET_ID-1 to replace the failed rowset ID. This approach is
safe because the rowset ID generator won't generate such a large ID, and
we can consider all rowsets with rowset ID equal to MAX_ROWSET_ID-1 as
failed initialization rowsets that should rely on multiple replicas for
automatic recovery.
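
The sentinel approach described in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: `kMaxRowsetId` (and its value), `kFailedRowsetId`, `parse_rowset_id`, `init_rowset_id`, and `is_failed_rowset` are hypothetical names standing in for whatever `be/src/olap/olap_common.h` actually defines. The point it shows is that the fallback is a compile-time constant, so it needs no storage engine at initialization time.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Assumed upper bound for generated rowset IDs (hypothetical value).
constexpr int64_t kMaxRowsetId = 1LL << 56;
// Sentinel that marks a rowset whose ID could not be parsed.
// The ID generator is assumed never to hand out a value this large.
constexpr int64_t kFailedRowsetId = kMaxRowsetId - 1;

// Parse a decimal rowset ID. Returns nullopt if the text is empty,
// contains a non-digit, or reaches the assumed upper bound.
std::optional<int64_t> parse_rowset_id(const std::string& text) {
    if (text.empty()) return std::nullopt;
    int64_t value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return std::nullopt;
        value = value * 10 + (c - '0');
        if (value >= kMaxRowsetId) return std::nullopt;
    }
    return value;
}

// Initialize a rowset ID without aborting: corrupt input maps to the
// sentinel instead of a fatal error or an engine-generated random ID.
int64_t init_rowset_id(const std::string& text) {
    return parse_rowset_id(text).value_or(kFailedRowsetId);
}

// Rowsets carrying the sentinel are treated as failed initializations
// and left to replica-based recovery.
bool is_failed_rowset(int64_t id) {
    return id == kFailedRowsetId;
}
```

With this shape, `init_rowset_id("12345")` yields 12345, while a corrupted string such as `"12a45"` yields `kFailedRowsetId`, which callers can detect with `is_failed_rowset` and skip rather than crash on.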
gavinchou pushed a commit that referenced this pull request Jan 2, 2025
…ated rowset ID to handle memory write failures (#42949) (#46102)
Labels
approved Indicates a PR has been approved by one committer. dev/2.1.8-merged dev/3.0.3-merged reviewed

7 participants