[bug](parquet)Fix the problem that the parquet reader reads the missing sub-columns of the struct and fails. #38718

Merged

hubgeter (Contributor) commented Aug 1, 2024

Proposed changes

Fix a failure in the Parquet reader when it reads a struct whose sub-columns are missing from the Parquet file.
For example:
Suppose we have a column of type array<struct<a:int>> that contains two rows:

[{1},{2},{3}]
[{4},{5}]

Then we add a sub-column b to the struct, so the column type becomes array<struct<a:int,b:string>>.
A query should then return the following data instead of failing with an error:

[{1,null},{2,null},{3,null}]
[{4,null},{5,null}]

doris-robot commented

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

github-actions bot added the doing label Aug 1, 2024

hubgeter (Contributor Author) commented Aug 1, 2024

run buildall

github-actions bot (Contributor) commented Aug 1, 2024

clang-tidy review says "All clean, LGTM! 👍"

doris-robot commented

TPC-H: Total hot run time: 41341 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b4f37c53afbdedb0d33ddb369fe3f0ffd9672f91, data reload: false

------ Round 1 ----------------------------------
q1	17597	4057	4042	4042
q2	2021	197	194	194
q3	10471	1328	1299	1299
q4	10177	874	984	874
q5	7632	2895	2973	2895
q6	221	144	140	140
q7	1031	618	627	618
q8	9443	1759	1932	1759
q9	8466	6597	6570	6570
q10	8763	3803	3862	3803
q11	432	246	249	246
q12	418	226	234	226
q13	17764	2928	2960	2928
q14	270	251	248	248
q15	530	483	478	478
q16	515	386	392	386
q17	969	915	918	915
q18	7901	7311	7196	7196
q19	2531	1195	1210	1195
q20	557	331	331	331
q21	5364	4779	4713	4713
q22	349	292	285	285
Total cold run time: 113422 ms
Total hot run time: 41341 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4116	4030	4021	4021
q2	328	231	218	218
q3	3014	2983	3151	2983
q4	1975	1992	1968	1968
q5	5539	5536	5448	5448
q6	221	134	133	133
q7	2116	1767	1806	1767
q8	3297	3380	3320	3320
q9	8650	8567	8756	8567
q10	3963	4048	3900	3900
q11	539	450	457	450
q12	740	614	581	581
q13	16669	3161	3105	3105
q14	308	266	270	266
q15	524	477	488	477
q16	472	419	408	408
q17	1752	1724	1756	1724
q18	8182	7660	7825	7660
q19	1719	1735	1739	1735
q20	2080	1831	1822	1822
q21	5743	5483	5337	5337
q22	509	450	478	450
Total cold run time: 72456 ms
Total hot run time: 56340 ms
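
The bot output does not label its columns. Reading them as one cold run, two hot runs, and the faster of the two hot runs is an assumption, but it is consistent with the reported totals. A minimal sketch that recomputes the Round 1 totals from the first three columns, with the numbers copied from the table above:

```python
# (cold, hot1, hot2) in ms for q1..q22 of Round 1; the column meanings are assumed.
ROUND1 = [
    (17597, 4057, 4042), (2021, 197, 194), (10471, 1328, 1299),
    (10177, 874, 984), (7632, 2895, 2973), (221, 144, 140),
    (1031, 618, 627), (9443, 1759, 1932), (8466, 6597, 6570),
    (8763, 3803, 3862), (432, 246, 249), (418, 226, 234),
    (17764, 2928, 2960), (270, 251, 248), (530, 483, 478),
    (515, 386, 392), (969, 915, 918), (7901, 7311, 7196),
    (2531, 1195, 1210), (557, 331, 331), (5364, 4779, 4713),
    (349, 292, 285),
]

# Cold total: sum of the first column.
total_cold = sum(cold for cold, _, _ in ROUND1)
# Hot total: sum of the faster of the two hot runs for each query.
total_hot = sum(min(hot1, hot2) for _, hot1, hot2 in ROUND1)

print(f"Total cold run time: {total_cold} ms")
print(f"Total hot run time: {total_hot} ms")
```

Under this reading the script reproduces the 113422 ms cold and 41341 ms hot totals reported for Round 1.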

doris-robot commented

TPC-DS: Total hot run time: 170010 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b4f37c53afbdedb0d33ddb369fe3f0ffd9672f91, data reload: false

query1	915	371	363	363
query2	6457	1697	1701	1697
query3	6653	220	247	220
query4	19517	17637	17278	17278
query5	3632	524	522	522
query6	304	207	173	173
query7	4599	295	307	295
query8	264	206	198	198
query9	8507	2366	2365	2365
query10	451	281	281	281
query11	10368	10098	9938	9938
query12	120	91	87	87
query13	1639	375	377	375
query14	9830	7623	7731	7623
query15	201	164	169	164
query16	6782	460	449	449
query17	945	554	544	544
query18	1833	279	285	279
query19	187	144	144	144
query20	91	87	84	84
query21	206	124	97	97
query22	4359	4306	3959	3959
query23	33785	33849	33375	33375
query24	10377	3118	3098	3098
query25	694	417	418	417
query26	1681	149	159	149
query27	2913	283	304	283
query28	7509	2019	1996	1996
query29	1249	436	445	436
query30	246	161	164	161
query31	917	767	782	767
query32	108	58	57	57
query33	693	326	340	326
query34	909	506	526	506
query35	893	772	758	758
query36	1029	892	894	892
query37	285	87	87	87
query38	2916	2876	2811	2811
query39	884	835	842	835
query40	267	120	121	120
query41	48	47	75	47
query42	120	100	102	100
query43	495	428	432	428
query44	1199	738	723	723
query45	211	175	175	175
query46	1106	816	806	806
query47	1794	1709	1694	1694
query48	362	292	290	290
query49	922	430	431	430
query50	895	437	447	437
query51	6726	6676	6599	6599
query52	104	94	86	86
query53	265	182	194	182
query54	626	466	461	461
query55	77	76	75	75
query56	293	272	256	256
query57	1162	1073	1055	1055
query58	268	276	292	276
query59	2590	2651	2344	2344
query60	306	279	288	279
query61	96	93	92	92
query62	864	669	677	669
query63	252	184	187	184
query64	5646	1912	1865	1865
query65	3160	3122	3102	3102
query66	1300	337	329	329
query67	15234	14722	14848	14722
query68	4269	561	589	561
query69	448	302	305	302
query70	1068	1056	1074	1056
query71	426	284	281	281
query72	7170	2691	2490	2490
query73	770	330	335	330
query74	5988	5642	5641	5641
query75	3364	2732	2743	2732
query76	2450	1179	1285	1179
query77	432	325	317	317
query78	9375	8930	8819	8819
query79	1951	542	534	534
query80	1159	525	522	522
query81	573	228	232	228
query82	1063	151	136	136
query83	242	183	180	180
query84	268	79	79	79
query85	1311	317	304	304
query86	400	303	315	303
query87	3303	3070	3057	3057
query88	2910	2434	2386	2386
query89	386	302	299	299
query90	1728	193	193	193
query91	126	106	100	100
query92	59	52	51	51
query93	1584	618	619	618
query94	805	296	320	296
query95	378	271	272	271
query96	611	289	284	284
query97	3269	3061	3076	3061
query98	218	197	196	196
query99	1671	1289	1291	1289
Total cold run time: 261111 ms
Total hot run time: 170010 ms

doris-robot commented

ClickBench: Total hot run time: 29.95 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b4f37c53afbdedb0d33ddb369fe3f0ffd9672f91, data reload: false

query1	0.04	0.03	0.04
query2	0.07	0.04	0.04
query3	0.22	0.04	0.04
query4	1.68	0.07	0.07
query5	0.50	0.48	0.48
query6	1.16	0.73	0.72
query7	0.02	0.02	0.01
query8	0.05	0.04	0.05
query9	0.56	0.50	0.51
query10	0.55	0.56	0.54
query11	0.15	0.11	0.12
query12	0.15	0.13	0.12
query13	0.61	0.61	0.60
query14	0.78	0.80	0.79
query15	0.89	0.86	0.87
query16	0.34	0.35	0.35
query17	0.98	0.99	1.01
query18	0.22	0.21	0.21
query19	1.79	1.74	1.72
query20	0.02	0.01	0.01
query21	15.42	0.79	0.68
query22	4.78	7.79	1.27
query23	17.80	1.29	1.24
query24	2.27	0.22	0.22
query25	0.18	0.08	0.08
query26	0.32	0.21	0.22
query27	0.46	0.24	0.23
query28	13.17	1.00	0.98
query29	12.58	3.34	3.32
query30	0.25	0.06	0.06
query31	2.85	0.39	0.41
query32	3.26	0.49	0.48
query33	2.95	2.99	2.98
query34	15.42	4.25	4.25
query35	4.33	4.30	4.28
query36	0.68	0.46	0.50
query37	0.20	0.16	0.16
query38	0.16	0.15	0.15
query39	0.04	0.03	0.04
query40	0.16	0.12	0.14
query41	0.10	0.04	0.04
query42	0.06	0.05	0.05
query43	0.05	0.04	0.04
Total cold run time: 108.27 s
Total hot run time: 29.95 s

hubgeter (Contributor Author) commented Aug 6, 2024

run p0

morningman (Contributor) left a comment

LGTM

github-actions bot (Contributor) commented Aug 7, 2024

PR approved by at least one committer and no changes requested.

github-actions bot added the approved and reviewed labels Aug 7, 2024

github-actions bot (Contributor) commented Aug 7, 2024

PR approved by anyone and no changes requested.

kaka11chen (Contributor) left a comment

LGTM

morningman merged commit 1a00e5e into apache:master Aug 7, 2024
28 of 30 checks passed
hubgeter added a commit to hubgeter/doris that referenced this pull request Aug 10, 2024
…ng sub-columns of the struct and fails. (apache#38718)

morningman pushed a commit that referenced this pull request Aug 11, 2024
dataroaring pushed a commit that referenced this pull request Aug 11, 2024
…ng sub-columns of the struct and fails. (#38718)

wyxxxcat pushed a commit to wyxxxcat/doris that referenced this pull request Aug 14, 2024
…ng sub-columns of the struct and fails. (apache#38718)

dataroaring pushed a commit that referenced this pull request Aug 16, 2024
…ng sub-columns of the struct and fails. (#38718)

Labels: approved (indicates a PR has been approved by one committer), dev/2.1.6-merged, dev/3.0.2-merged, doing, reviewed

5 participants