
[GLUTEN-7267][CORE][CH] Support nested column pruning for HiveTableScan json/parquet/orc format #7268

Merged

Conversation

KevinyhZou
Contributor

@KevinyhZou KevinyhZou commented Sep 18, 2024

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #7267)

How was this patch tested?

By unit tests (UT).

@KevinyhZou KevinyhZou marked this pull request as draft September 18, 2024 12:35
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Sep 18, 2024

#7267


Run Gluten Clickhouse CI

@KevinyhZou KevinyhZou changed the title [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format Sep 18, 2024
@KevinyhZou KevinyhZou force-pushed the support_nested_project_push_down_json branch from 5ba3026 to 4c202a6 Compare September 19, 2024 07:08

@KevinyhZou KevinyhZou changed the title [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format Sep 26, 2024

@KevinyhZou KevinyhZou marked this pull request as ready for review September 26, 2024 06:33

@KevinyhZou
Contributor Author

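Performance test

Table schema: test_tbl (a STRING, b STRUCT<x1: STRING, x2: STRING, x3: STRING, x4: STRING, x5: STRING>)
Test SQL: select count(b.x1) from test_tbl
Data volume: 12 million rows
The data was stored in each of the three formats (json/parquet/orc), and the end-to-end time of this SQL query was measured for each.

Average time before the optimization:
json: 16.52s
parquet: 2.02s
orc: 1.25s

Average time after the optimization:
json: 12.71s
parquet: 0.63s
orc: 0.36s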
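
To make the comparison easier to read, the reported timings can be turned into speedup ratios. The snippet below is only a small arithmetic sketch over the numbers quoted in this comment; the names `before`, `after`, and `speedups` are made up for illustration and are not part of the PR.

```python
# End-to-end timings in seconds, copied from the performance test in this comment.
before = {"json": 16.52, "parquet": 2.02, "orc": 1.25}
after = {"json": 12.71, "parquet": 0.63, "orc": 0.36}


def speedups(before, after):
    """Return the before/after ratio per format (values above 1 mean faster)."""
    return {fmt: before[fmt] / after[fmt] for fmt in before}


for fmt, ratio in speedups(before, after).items():
    print(f"{fmt}: {ratio:.2f}x")
```

With these inputs the ratio is about 1.3x for json and a little over 3x for parquet and orc.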


Run Gluten Clickhouse CI on x86


@KevinyhZou
Contributor Author

@rui-mo could you take a look at this PR? Does the Velox backend need this feature?

Contributor

@rui-mo rui-mo left a comment


@KevinyhZou In the Velox backend, this is handled by the reader. For the case you mentioned below, the Velox reader does not output the 'a2' and 'a3' columns. We are also working on some enhancements to schema pruning in Velox; see facebookincubator/velox#5962.

Struct<a1 string, a2 string, a3 string>, when query s.a1 from table

@rui-mo
Contributor

rui-mo commented Nov 14, 2024

I'm a little confused on the issue this PR is addressing. Are we supporting cases like those in the below suite?
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala

@KevinyhZou
Contributor Author

KevinyhZou commented Nov 14, 2024

I'm a little confused on the issue this PR is addressing. Are we supporting cases like those in the below suite? https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala

Yes. We do this to read only the fields we need from complex types such as array, map, and struct, that is, to prune nested columns. We enabled this feature for HiveTableScan, which Gluten uses to read JSON and Hive text data; it can also be used for parquet/orc, although that is not the default path. I think it works the same way for Velox.
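
As an illustration of what pruning nested columns means, here is a minimal sketch that narrows a nested schema to the fields a query references. It is not the code from this PR: the dict-based schema representation and the `prune_schema` helper are hypothetical and only model the idea.

```python
import copy


def prune_schema(schema, required_paths):
    """Keep only the parts of a nested schema reached by dotted field paths.

    `schema` maps a field name either to a type name (leaf) or to another
    dict (struct). This representation is an assumption for illustration.
    """
    pruned = {}
    for path in required_paths:
        node, out = schema, pruned
        parts = path.split(".")
        for i, name in enumerate(parts):
            if not isinstance(node, dict) or name not in node:
                raise KeyError(f"unknown field path: {path}")
            child = node[name]
            if i == len(parts) - 1:
                # Last path component: keep the whole subtree below it.
                out[name] = copy.deepcopy(child)
            else:
                # Intermediate struct: descend, creating it in the result if needed.
                out = out.setdefault(name, {})
                node = child
    return pruned


# Schema of test_tbl from the performance test in this PR.
test_tbl = {
    "a": "string",
    "b": {"x1": "string", "x2": "string", "x3": "string",
          "x4": "string", "x5": "string"},
}

# `select count(b.x1) from test_tbl` only needs b.x1.
print(prune_schema(test_tbl, ["b.x1"]))
```

For the query in the performance test, the pruned schema keeps only `b.x1`, so a reader given that schema can skip `a` and `b.x2` through `b.x5`.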

@rui-mo
Contributor

rui-mo commented Nov 14, 2024

@KevinyhZou Thanks for your feedback. We have enabled 'GlutenParquetSchemaSuite' on Velox backend. The way we enable it is: Gluten passes user-specified schema to Velox, and the Velox reader handles the schema mismatch. For example, Gluten passes 'struct<a1>' as the output type of Velox scan node, and Velox reader handles the schema pruning internally. I wonder if CH needs this PR because the CH reader cannot handle schema pruning. Would you clarify? Thanks.

Struct<a1 string, a2 string, a3 string>, when query s.a1 from table
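
The reader-side approach described in this comment (the scan is given a narrower output type such as 'struct<a1>', and the reader resolves the mismatch with the file schema) can be modeled roughly as below. This is a toy sketch over Python dicts, not the Velox or ClickHouse reader API; `project_record` and the schema representation are hypothetical.

```python
def project_record(record, requested):
    """Project one decoded record onto a requested, possibly narrower, schema.

    `requested` maps a field name either to a type name (leaf) or to a dict
    describing a struct. Fields absent from the record come back as None.
    """
    out = {}
    for name, sub in requested.items():
        value = record.get(name)
        if isinstance(sub, dict) and isinstance(value, dict):
            # Struct field: recurse so only the requested sub-fields survive.
            out[name] = project_record(value, sub)
        else:
            out[name] = value
    return out


# File data has s: struct<a1, a2, a3>; the query only touches s.a1.
row = {"s": {"a1": "x", "a2": "y", "a3": "z"}}
print(project_record(row, {"s": {"a1": "string"}}))
```

In a columnar reader the benefit comes from never decoding the dropped sub-columns at all; this sketch only shows the shape of the output, not that saving.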

@KevinyhZou
Contributor Author

OK, I see. This feature is already enabled for the ClickHouse backend's FileSourceScan, which is the default way to read parquet/orc data: we pass a pruned schema from FileSourceScanTransformer to the CH parquet/orc reader. I don't think we will change that. Thanks, @rui-mo.


@KevinyhZou KevinyhZou force-pushed the support_nested_project_push_down_json branch from b2e4229 to e4af358 Compare November 14, 2024 10:14

@taiyang-li taiyang-li force-pushed the support_nested_project_push_down_json branch from e4af358 to 7d50be4 Compare November 14, 2024 12:33

@taiyang-li taiyang-li force-pushed the support_nested_project_push_down_json branch from 7d50be4 to 444f043 Compare November 15, 2024 01:25
Contributor

@taiyang-li taiyang-li left a comment


LGTM


@taiyang-li taiyang-li merged commit 596858a into apache:main Nov 15, 2024
46 checks passed
@zhztheplayer zhztheplayer changed the title [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format [GLUTEN-7267][CH] Support nested column pruning for HiveTableScan json/parquet/orc format Nov 20, 2024
@zhztheplayer zhztheplayer changed the title [GLUTEN-7267][CH] Support nested column pruning for HiveTableScan json/parquet/orc format [GLUTEN-7267][CORE][CH] Support nested column pruning for HiveTableScan json/parquet/orc format Nov 20, 2024
Labels
CLICKHOUSE CORE works for Gluten Core
Development

Successfully merging this pull request may close these issues.

[CH] Support nested colum pruning for HiveTableScanExec
4 participants