[fix](Outfile) Fix the data type mapping for complex types in Doris to the ORC and Parquet file formats. #44041

Open
wants to merge 5 commits into
base: master
@BePPPower BePPPower commented Nov 15, 2024

What problem does this PR solve?

Problem Summary:

Before this PR, Doris exported complex data types as follows:

| Doris type | ORC type | Parquet type | CSV |
|---|---|---|---|
| bitmap | string | Not Supported | Not Supported |
| quantile_state | Not Supported | Not Supported | Not Supported |
| hll | string | string | invisible string |
| jsonb | Not Supported | string | string |
| variant | Not Supported | string | string |

In addition, exporting complex data types to the ORC file format had several issues.

This PR does two things:

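  1. Fix the problems with exporting complex data types from Doris.
  2. Support exporting the three complex types bitmap, quantile_state, and hll to both the ORC and Parquet file formats.

After this PR, the export behavior is as follows:

| Doris type | ORC type | Parquet type | CSV |
|---|---|---|---|
| bitmap | binary | binary | "NULL" |
| quantile_state | binary | binary | "NULL" |
| hll | binary | binary | "NULL" |
| jsonb | string | string | string |
| variant | string | string | string |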
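
For readers who want the new mapping in code form, here is a minimal sketch of the table above as a lookup function. It is not the code in this PR: the enum names `ComplexType` and `FileFormat` and the function `export_type` are hypothetical and exist only for illustration. It also assumes the CSV cell for bitmap, quantile_state, and hll is the plain text NULL.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical enums for illustration only; these are not Doris's real type identifiers.
enum class ComplexType { Bitmap, QuantileState, Hll, Jsonb, Variant };
enum class FileFormat { Orc, Parquet, Csv };

// Returns the target type listed in the "after this PR" table.
// Assumption: for CSV, bitmap/quantile_state/hll are written as the text NULL.
constexpr std::string_view export_type(ComplexType type, FileFormat format) {
    switch (type) {
    case ComplexType::Bitmap:
    case ComplexType::QuantileState:
    case ComplexType::Hll:
        return format == FileFormat::Csv ? std::string_view("NULL")
                                         : std::string_view("binary");
    case ComplexType::Jsonb:
    case ComplexType::Variant:
        return "string";
    }
    return "unknown"; // not reached for valid enum values
}

// Compile-time spot check of one table cell.
static_assert(export_type(ComplexType::Hll, FileFormat::Orc) == "binary");

// Runtime spot check of another table cell.
inline void check_example() {
    assert(export_type(ComplexType::Variant, FileFormat::Parquet) == "string");
}
```

Keeping the mapping in one function makes it easy to compare against the table when reviewing the per-type SerDe changes.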

Release note

None

Check List (For Author)

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BePPPower
Contributor Author

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@@ -17,6 +17,7 @@

#pragma once

#include <arrow/array/builder_binary.h>
warning: 'arrow/array/builder_binary.h' file not found [clang-diagnostic-error]

#include <arrow/array/builder_binary.h>
         ^

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.00% (9901/26057)
Line Coverage: 29.16% (82698/283600)
Region Coverage: 28.29% (42481/150152)
Branch Coverage: 24.87% (21541/86622)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ee5770b3619e2c1a7754f00864c5eb11e750b1f3_ee5770b3619e2c1a7754f00864c5eb11e750b1f3/report/index.html

@BePPPower BePPPower changed the title from "[fix](Outfile) export the complex types of Doris to binary type of orc and parquet file format" to "[fix](Outfile) Fix the data type mapping for complex types in Doris to the ORC and Parquet file formats." on Nov 18, 2024
@BePPPower BePPPower marked this pull request as ready for review November 18, 2024 10:10
@BePPPower
Contributor Author

run buildall

@BePPPower
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.03% (9902/26039)
Line Coverage: 29.20% (82819/283663)
Region Coverage: 28.33% (42550/150175)
Branch Coverage: 24.89% (21565/86646)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b113fc1e1319fd688b4d8dbad4fcb094b40f3510_b113fc1e1319fd688b4d8dbad4fcb094b40f3510/report/index.html

@BePPPower
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.02% (9900/26039)
Line Coverage: 29.20% (82820/283663)
Region Coverage: 28.32% (42536/150175)
Branch Coverage: 24.88% (21556/86646)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b113fc1e1319fd688b4d8dbad4fcb094b40f3510_b113fc1e1319fd688b4d8dbad4fcb094b40f3510/report/index.html

@doris-robot

TPC-H: Total hot run time: 39950 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b113fc1e1319fd688b4d8dbad4fcb094b40f3510, data reload: false

------ Round 1 ----------------------------------
q1	17575	7423	7285	7285
q2	2044	177	169	169
q3	10542	1042	1096	1042
q4	10217	736	763	736
q5	7593	2731	2735	2731
q6	239	148	146	146
q7	982	614	599	599
q8	9251	1822	1914	1822
q9	6641	6339	6423	6339
q10	7051	2309	2329	2309
q11	459	266	251	251
q12	422	222	220	220
q13	18031	3052	3099	3052
q14	243	207	218	207
q15	572	530	508	508
q16	664	584	599	584
q17	988	536	594	536
q18	7335	6781	6790	6781
q19	1333	978	1067	978
q20	487	183	180	180
q21	4015	3253	3162	3162
q22	371	325	313	313
Total cold run time: 107055 ms
Total hot run time: 39950 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7350	7362	7359	7359
q2	324	228	230	228
q3	2905	2863	3024	2863
q4	2032	1839	1829	1829
q5	5728	5725	5707	5707
q6	227	142	147	142
q7	2262	1801	1876	1801
q8	3428	3652	3550	3550
q9	9124	9165	9236	9165
q10	3658	3646	3594	3594
q11	614	544	536	536
q12	861	650	664	650
q13	16719	3231	3201	3201
q14	310	283	283	283
q15	579	520	532	520
q16	693	677	661	661
q17	1903	1648	1654	1648
q18	8624	7965	7886	7886
q19	1745	1588	1505	1505
q20	2146	1891	1873	1873
q21	5588	5468	5263	5263
q22	639	589	592	589
Total cold run time: 77459 ms
Total hot run time: 60853 ms

@doris-robot

TPC-DS: Total hot run time: 196086 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b113fc1e1319fd688b4d8dbad4fcb094b40f3510, data reload: false

query1	1264	967	944	944
query2	6260	2141	2066	2066
query3	10812	4113	4005	4005
query4	67622	29043	23985	23985
query5	4960	466	454	454
query6	411	187	195	187
query7	5553	289	285	285
query8	305	211	234	211
query9	8725	2678	2655	2655
query10	446	248	248	248
query11	17458	15245	15971	15245
query12	152	108	107	107
query13	1510	450	460	450
query14	10710	7825	7357	7357
query15	211	187	197	187
query16	7056	480	482	480
query17	1079	584	592	584
query18	1777	295	294	294
query19	219	155	161	155
query20	116	111	109	109
query21	202	104	100	100
query22	4755	4258	4389	4258
query23	34590	34004	34167	34004
query24	5397	2483	2468	2468
query25	482	390	398	390
query26	658	148	153	148
query27	1711	281	284	281
query28	4328	2475	2396	2396
query29	670	424	413	413
query30	209	147	146	146
query31	1011	802	829	802
query32	68	61	55	55
query33	389	268	279	268
query34	930	518	498	498
query35	846	735	706	706
query36	1064	938	953	938
query37	129	95	78	78
query38	4340	4145	4213	4145
query39	1454	1440	1407	1407
query40	196	101	96	96
query41	44	42	41	41
query42	106	96	99	96
query43	532	513	512	512
query44	1201	829	808	808
query45	179	163	167	163
query46	1145	708	709	708
query47	1940	1854	1838	1838
query48	402	305	322	305
query49	714	386	398	386
query50	815	387	401	387
query51	7292	7184	7194	7184
query52	97	92	95	92
query53	262	177	179	177
query54	524	399	388	388
query55	77	74	79	74
query56	252	247	243	243
query57	1292	1173	1148	1148
query58	226	210	230	210
query59	3297	3041	2930	2930
query60	270	248	253	248
query61	107	140	106	106
query62	803	680	689	680
query63	219	188	187	187
query64	1335	655	627	627
query65	3260	3206	3301	3206
query66	701	314	365	314
query67	15921	15922	15664	15664
query68	4067	565	555	555
query69	430	255	275	255
query70	1173	1094	1156	1094
query71	366	258	249	249
query72	6460	4028	3882	3882
query73	768	354	356	354
query74	10200	8989	9012	8989
query75	3401	2700	2697	2697
query76	2036	986	1036	986
query77	508	297	304	297
query78	10599	9475	9370	9370
query79	2360	598	608	598
query80	1407	423	457	423
query81	532	227	236	227
query82	1265	122	119	119
query83	180	155	152	152
query84	295	74	67	67
query85	975	304	357	304
query86	409	297	287	287
query87	4730	4525	4608	4525
query88	3395	2221	2169	2169
query89	430	286	302	286
query90	1971	184	182	182
query91	137	104	103	103
query92	65	49	49	49
query93	3272	548	539	539
query94	808	283	273	273
query95	345	249	250	249
query96	622	284	289	284
query97	2858	2730	2718	2718
query98	215	200	203	200
query99	1632	1314	1277	1277
Total cold run time: 320999 ms
Total hot run time: 196086 ms

@doris-robot

ClickBench: Total hot run time: 33.37 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b113fc1e1319fd688b4d8dbad4fcb094b40f3510, data reload: false

query1	0.04	0.03	0.02
query2	0.06	0.05	0.04
query3	0.23	0.08	0.07
query4	1.62	0.10	0.10
query5	0.43	0.41	0.38
query6	1.13	0.65	0.65
query7	0.02	0.02	0.02
query8	0.03	0.03	0.03
query9	0.58	0.51	0.49
query10	0.55	0.54	0.57
query11	0.15	0.11	0.10
query12	0.14	0.11	0.11
query13	0.61	0.59	0.60
query14	2.71	2.81	2.76
query15	0.89	0.82	0.83
query16	0.38	0.40	0.37
query17	1.06	1.06	1.03
query18	0.20	0.19	0.21
query19	1.98	1.86	1.96
query20	0.02	0.01	0.01
query21	15.37	0.63	0.60
query22	2.33	1.62	2.07
query23	16.95	1.08	0.82
query24	2.92	2.07	2.04
query25	0.22	0.08	0.08
query26	0.64	0.14	0.14
query27	0.05	0.05	0.04
query28	8.95	1.11	1.07
query29	12.57	3.22	3.24
query30	0.26	0.06	0.07
query31	2.88	0.38	0.38
query32	3.27	0.47	0.48
query33	2.99	3.01	3.04
query34	17.06	4.48	4.41
query35	4.51	4.47	4.50
query36	0.66	0.48	0.47
query37	0.09	0.06	0.06
query38	0.04	0.03	0.03
query39	0.03	0.02	0.02
query40	0.16	0.12	0.12
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 104.93 s
Total hot run time: 33.37 s

@BePPPower
Contributor Author

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 38.03% (9902/26039)
Line Coverage: 29.20% (82843/283663)
Region Coverage: 28.33% (42541/150175)
Branch Coverage: 24.89% (21566/86646)
Coverage Report: http://coverage.selectdb-in.cc/coverage/21abf3dbe6a856b1e962e565300f44b98efa32a0_21abf3dbe6a856b1e962e565300f44b98efa32a0/report/index.html

@doris-robot

TPC-H: Total hot run time: 40632 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 21abf3dbe6a856b1e962e565300f44b98efa32a0, data reload: false

------ Round 1 ----------------------------------
q1	17608	7477	7284	7284
q2	2044	171	158	158
q3	10823	1081	1161	1081
q4	10514	734	712	712
q5	7967	2783	2694	2694
q6	248	153	151	151
q7	981	642	623	623
q8	9243	1795	1939	1795
q9	6594	6414	6474	6414
q10	7052	2254	2292	2254
q11	470	272	253	253
q12	418	225	225	225
q13	18106	3120	3073	3073
q14	244	212	211	211
q15	569	548	520	520
q16	647	585	595	585
q17	985	699	616	616
q18	8593	7207	7313	7207
q19	2541	1053	1059	1053
q20	746	187	183	183
q21	4023	3226	3229	3226
q22	384	314	316	314
Total cold run time: 110800 ms
Total hot run time: 40632 ms

----- Round 2, with runtime_filter_mode=off -----
q1	7295	7283	7250	7250
q2	327	229	226	226
q3	3051	2976	2954	2954
q4	2116	1815	1813	1813
q5	5604	5632	5725	5632
q6	234	151	146	146
q7	2241	1822	1795	1795
q8	3400	3532	3392	3392
q9	8938	8883	8790	8790
q10	3631	3531	3572	3531
q11	582	517	516	516
q12	843	618	612	612
q13	16871	3265	3276	3265
q14	302	291	274	274
q15	574	524	542	524
q16	714	657	641	641
q17	1851	1630	1608	1608
q18	8266	7820	7708	7708
q19	1689	1616	1684	1616
q20	2135	1905	1922	1905
q21	5730	5376	5432	5376
q22	638	572	565	565
Total cold run time: 77032 ms
Total hot run time: 60139 ms

@doris-robot

TPC-DS: Total hot run time: 195881 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 21abf3dbe6a856b1e962e565300f44b98efa32a0, data reload: false

query1	1234	759	813	759
query2	6249	2174	2203	2174
query3	10831	4034	3816	3816
query4	67412	29506	23627	23627
query5	4858	474	450	450
query6	408	179	184	179
query7	5633	303	294	294
query8	320	231	229	229
query9	9111	2677	2666	2666
query10	457	243	248	243
query11	17519	15262	15952	15262
query12	157	119	117	117
query13	1587	426	452	426
query14	10502	7431	7521	7431
query15	198	178	177	177
query16	6997	471	427	427
query17	1078	545	547	545
query18	1821	282	293	282
query19	197	151	145	145
query20	110	105	107	105
query21	211	100	101	100
query22	4536	4286	4488	4286
query23	34422	34301	33907	33907
query24	5435	2537	2485	2485
query25	476	382	374	374
query26	636	151	146	146
query27	1720	292	288	288
query28	4221	2465	2432	2432
query29	651	414	429	414
query30	207	147	142	142
query31	988	804	811	804
query32	66	54	54	54
query33	391	280	276	276
query34	902	500	512	500
query35	825	712	713	712
query36	1110	947	983	947
query37	114	74	75	74
query38	4346	4372	4175	4175
query39	1451	1447	1445	1445
query40	193	99	100	99
query41	46	42	45	42
query42	108	95	95	95
query43	562	503	504	503
query44	1144	809	800	800
query45	194	159	163	159
query46	1119	727	674	674
query47	1966	1843	1873	1843
query48	413	305	317	305
query49	716	392	380	380
query50	818	376	397	376
query51	7326	7116	7165	7116
query52	105	88	92	88
query53	252	173	178	173
query54	524	408	386	386
query55	78	71	85	71
query56	247	241	239	239
query57	1265	1182	1165	1165
query58	225	222	217	217
query59	3408	3427	3113	3113
query60	271	258	241	241
query61	129	111	111	111
query62	784	676	666	666
query63	213	185	183	183
query64	1385	665	711	665
query65	3302	3233	3211	3211
query66	721	313	318	313
query67	16167	16067	15717	15717
query68	3876	584	566	566
query69	429	258	253	253
query70	1213	1148	1139	1139
query71	350	249	252	249
query72	6405	4147	4074	4074
query73	765	357	350	350
query74	10316	8912	9012	8912
query75	3421	2720	2688	2688
query76	1852	1060	1008	1008
query77	464	281	282	281
query78	10529	9498	9462	9462
query79	1812	587	604	587
query80	1101	430	436	430
query81	523	233	234	233
query82	1321	122	120	120
query83	262	151	156	151
query84	287	68	71	68
query85	903	296	332	296
query86	332	307	301	301
query87	4776	4556	4605	4556
query88	3409	2235	2155	2155
query89	420	290	299	290
query90	2016	186	191	186
query91	140	102	104	102
query92	69	51	52	51
query93	1977	528	536	528
query94	834	299	266	266
query95	351	246	248	246
query96	627	274	274	274
query97	2913	2675	2702	2675
query98	210	201	208	201
query99	1705	1317	1355	1317
Total cold run time: 318676 ms
Total hot run time: 195881 ms

@doris-robot

ClickBench: Total hot run time: 33.54 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 21abf3dbe6a856b1e962e565300f44b98efa32a0, data reload: false

query1	0.03	0.03	0.03
query2	0.07	0.03	0.03
query3	0.23	0.08	0.07
query4	1.62	0.10	0.10
query5	0.41	0.42	0.39
query6	1.14	0.65	0.66
query7	0.03	0.02	0.02
query8	0.04	0.04	0.03
query9	0.60	0.48	0.50
query10	0.55	0.56	0.56
query11	0.15	0.11	0.11
query12	0.14	0.11	0.11
query13	0.61	0.61	0.61
query14	2.83	2.74	2.74
query15	0.90	0.83	0.83
query16	0.38	0.39	0.39
query17	1.05	1.00	1.05
query18	0.20	0.20	0.20
query19	1.98	1.87	1.98
query20	0.02	0.01	0.01
query21	15.37	0.61	0.60
query22	2.49	2.45	1.77
query23	17.32	0.85	0.68
query24	3.29	2.10	1.88
query25	0.18	0.21	0.18
query26	0.48	0.15	0.14
query27	0.05	0.04	0.05
query28	8.80	1.10	1.08
query29	12.58	3.28	3.26
query30	0.24	0.06	0.06
query31	2.88	0.38	0.39
query32	3.27	0.47	0.46
query33	3.00	3.01	3.05
query34	17.01	4.50	4.49
query35	4.61	4.55	4.49
query36	0.64	0.49	0.49
query37	0.09	0.06	0.06
query38	0.04	0.04	0.04
query39	0.03	0.02	0.02
query40	0.15	0.12	0.12
query41	0.08	0.02	0.03
query42	0.03	0.02	0.02
query43	0.04	0.02	0.02
Total cold run time: 105.65 s
Total hot run time: 33.54 s

@@ -48,28 +49,17 @@ Status DataTypeHLLSerDe::serialize_column_to_json(const IColumn& column, int64_t
Status DataTypeHLLSerDe::serialize_one_cell_to_json(const IColumn& column, int64_t row_num,
BufferWritable& bw,
FormatOptions& options) const {
if (!options._output_object_data) {

Is the _output_object_data option used only by export?

3 participants