[GLUTEN-7267][CORE][CH] Support nested column pruning for `HiveTableScan` json/parquet/orc format #7268

KevinyhZou · 2024-09-18T12:35:15Z

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #7267)

How was this patch tested?

By unit tests (UT).

github-actions · 2024-09-18T12:35:32Z

#7267

github-actions · 2024-09-18T12:35:49Z

Run Gluten Clickhouse CI

github-actions · 2024-09-19T07:08:49Z

Run Gluten Clickhouse CI

github-actions · 2024-09-25T04:31:34Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T02:47:12Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T02:53:18Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T04:28:06Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T04:30:46Z

Run Gluten Clickhouse CI

github-actions · 2024-09-26T06:35:00Z

Run Gluten Clickhouse CI

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala

KevinyhZou · 2024-10-09T08:07:08Z

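Performance test

Table schema: test_tbl (a STRING, b STRUCT<x1: STRING, x2: STRING, x3: STRING, x4: STRING, x5: STRING>)
Test SQL: select count(b.x1) from test_tbl
Data volume: 12 million rows
The data was stored in each of three formats (json/parquet/orc), and the end-to-end time of this query was measured for each.

Average time before the optimization:
json: 16.52s
parquet: 2.02s
orc: 1.25s

Average time after the optimization:
json: 12.71s
parquet: 0.63s
orc: 0.36s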
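
For readers who want the relative gains, the short Python sketch below derives speedup ratios and time reductions from the averages reported above. Only the six timings come from this comment; the helper names `speedup` and `reduction_pct` are made up for illustration.

```python
# End-to-end averages in seconds, copied from the measurements above.
before = {"json": 16.52, "parquet": 2.02, "orc": 1.25}
after = {"json": 12.71, "parquet": 0.63, "orc": 0.36}


def speedup(fmt: str) -> float:
    """Ratio of the time before the optimization to the time after it."""
    return before[fmt] / after[fmt]


def reduction_pct(fmt: str) -> float:
    """Percentage of end-to-end time saved by the optimization."""
    return 100.0 * (before[fmt] - after[fmt]) / before[fmt]


for fmt in before:
    print(f"{fmt}: {speedup(fmt):.2f}x faster, {reduction_pct(fmt):.1f}% less time")
```

The columnar formats (parquet, orc) appear to benefit far more than json, which would be consistent with a json reader still having to scan every row's full text even when only one field is kept.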

github-actions · 2024-10-30T12:53:11Z

Run Gluten Clickhouse CI

github-actions · 2024-10-31T07:02:18Z

Run Gluten Clickhouse CI

shims/common/src/main/scala/org/apache/gluten/GlutenConfig.scala

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala

...en-substrait/src/main/scala/org/apache/spark/sql/hive/HiveTableScanNestedColumnPruning.scala

github-actions · 2024-11-12T04:08:38Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-12T06:58:58Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-12T08:22:02Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-13T04:27:33Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-13T11:21:22Z

Run Gluten Clickhouse CI on x86

KevinyhZou · 2024-11-14T02:21:07Z

@rui-mo could you take a look at this PR and check whether the Velox backend needs this feature?

rui-mo

@KevinyhZou In the Velox backend, this is handled by the reader. For the case you mentioned below, the Velox reader does not output the 'a2' and 'a3' columns. We are also working on some enhancements to schema pruning in Velox, see facebookincubator/velox#5962.

Struct<a1 string, a2 string, a3 string>, when query s.a1 from table

rui-mo · 2024-11-14T02:50:23Z

I'm a little confused on the issue this PR is addressing. Are we supporting cases like those in the below suite?
https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala

KevinyhZou · 2024-11-14T03:07:51Z

> I'm a little confused on the issue this PR is addressing. Are we supporting cases like those in the below suite? https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruningSuite.scala

Yes. The goal is to read only the fields we need from complex types such as array, map, and struct, that is, to prune nested columns. This PR enables the feature for HiveTableScan, which Gluten uses to read JSON and Hive text data and which can also be used for Parquet/ORC (though that is not the default path). I think it works the same way for Velox.
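
To make the idea concrete, here is a minimal Python sketch of nested column pruning. It is not the code in this PR: schemas are modeled as plain dicts, accessed fields as tuples of names, and `prune_schema` is a hypothetical helper. The sample schema mirrors the `test_tbl` table from the performance test earlier in this thread.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumption: a struct is a dict of field name -> type; a primitive type is a string.
Schema = Dict[str, Any]
Path = Tuple[str, ...]


def prune_schema(schema: Schema, paths: Iterable[Path]) -> Schema:
    """Keep only the fields reachable through the accessed (non-empty) paths."""
    # Group the remaining path components by their first component.
    by_head: Dict[str, List[Path]] = {}
    for path in paths:
        by_head.setdefault(path[0], []).append(path[1:])

    pruned: Schema = {}
    for name, field_type in schema.items():  # preserves the original field order
        if name not in by_head:
            continue  # field is never accessed, so drop it
        rests = by_head[name]
        if isinstance(field_type, dict) and all(rests):
            # Every access goes into the struct, so recurse and prune inside it.
            pruned[name] = prune_schema(field_type, rests)
        else:
            # The whole field is needed (primitive, or struct selected as a whole).
            pruned[name] = field_type
    return pruned


TEST_TBL: Schema = {
    "a": "STRING",
    "b": {"x1": "STRING", "x2": "STRING", "x3": "STRING",
          "x4": "STRING", "x5": "STRING"},
}

# select count(b.x1) from test_tbl only touches b.x1
print(prune_schema(TEST_TBL, [("b", "x1")]))
```

With a pruned schema like this, a scan only has to materialize `b.x1` instead of all five struct fields.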

rui-mo · 2024-11-14T03:19:41Z

@KevinyhZou Thanks for your feedback. We have enabled 'GlutenParquetSchemaSuite' on Velox backend. The way we enable it is: Gluten passes user-specified schema to Velox, and the Velox reader handles the schema mismatch. For example, Gluten passes 'struct<a1>' as the output type of Velox scan node, and Velox reader handles the schema pruning internally. I wonder if CH needs this PR because the CH reader cannot handle schema pruning. Would you clarify? Thanks.

Struct<a1 string, a2 string, a3 string>, when query s.a1 from table
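
As a rough illustration of a reader handling the schema mismatch, the Python sketch below projects a full record onto a requested (pruned) schema. This is only a toy model and not how the Velox or CH readers are implemented: records and schemas are plain dicts, and `project` is a hypothetical name.

```python
from typing import Any, Dict

Schema = Dict[str, Any]   # struct: field name -> type string or nested struct
Record = Dict[str, Any]   # decoded row: field name -> value or nested dict


def project(record: Record, requested: Schema) -> Record:
    """Return only the fields named in `requested`, recursing into structs."""
    out: Record = {}
    for name, field_type in requested.items():
        value = record.get(name)  # a field absent from the file becomes None
        if isinstance(field_type, dict) and isinstance(value, dict):
            out[name] = project(value, field_type)
        else:
            out[name] = value
    return out


# File schema: s Struct<a1 string, a2 string, a3 string>; the query reads s.a1 only.
row = {"s": {"a1": "v1", "a2": "v2", "a3": "v3"}}
requested = {"s": {"a1": "string"}}
print(project(row, requested))
```

In this model the engine only states the output type it wants, and the reader decides which parts of each row to decode.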

KevinyhZou · 2024-11-14T04:05:03Z

OK, I see. This feature is already enabled for the ClickHouse backend's FileSourceScan, which is the default way to read Parquet/ORC data: we pass a pruned schema from FileSourceScanTransformer to the CH Parquet/ORC reader. I don't think we will change that. Thanks, @rui-mo.

github-actions · 2024-11-14T07:57:20Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-14T10:15:24Z

Run Gluten Clickhouse CI on x86

taiyang-li · 2024-11-14T12:32:54Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-11-14T12:33:44Z

Run Gluten Clickhouse CI on x86

taiyang-li

LGTM

github-actions · 2024-11-15T01:26:22Z

Run Gluten Clickhouse CI on x86

taiyang-li · 2024-11-15T03:21:27Z

Run Gluten Clickhouse CI on x86

zzcclp · 2024-11-15T03:42:33Z

Run Gluten Clickhouse CI on x86

KevinyhZou marked this pull request as draft September 18, 2024 12:35

github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels Sep 18, 2024

KevinyhZou changed the title ~~[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format~~ [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format Sep 18, 2024

KevinyhZou force-pushed the support_nested_project_push_down_json branch from 5ba3026 to 4c202a6 Compare September 19, 2024 07:08

KevinyhZou changed the title ~~[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json format~~ [GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format Sep 26, 2024

KevinyhZou marked this pull request as ready for review September 26, 2024 06:33

taiyang-li reviewed Sep 29, 2024

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala (outdated, resolved)

taiyang-li reviewed Nov 12, 2024

shims/common/src/main/scala/org/apache/gluten/GlutenConfig.scala (outdated, resolved)

taiyang-li reviewed Nov 12, 2024

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala (outdated, resolved)

taiyang-li reviewed Nov 12, 2024

...ckhouse/src/test/scala/org/apache/gluten/execution/hive/GlutenClickHouseHiveTableSuite.scala (resolved)

taiyang-li reviewed Nov 12, 2024

...en-substrait/src/main/scala/org/apache/spark/sql/hive/HiveTableScanNestedColumnPruning.scala (outdated, resolved)

taiyang-li reviewed Nov 12, 2024

...en-substrait/src/main/scala/org/apache/spark/sql/hive/HiveTableScanNestedColumnPruning.scala (outdated, resolved)

rui-mo reviewed Nov 14, 2024

KevinyhZou force-pushed the support_nested_project_push_down_json branch from b2e4229 to e4af358 Compare November 14, 2024 10:14

taiyang-li force-pushed the support_nested_project_push_down_json branch from e4af358 to 7d50be4 Compare November 14, 2024 12:33

support nested column pruning

444f043

taiyang-li force-pushed the support_nested_project_push_down_json branch from 7d50be4 to 444f043 Compare November 15, 2024 01:25

taiyang-li approved these changes Nov 15, 2024

taiyang-li merged commit 596858a into apache:main Nov 15, 2024
46 checks passed

zhztheplayer changed the title ~~[GLUTEN-7267][CH]Support nested column pruning for HiveTableScan json/parquet/orc format~~ [GLUTEN-7267][CH] Support nested column pruning for HiveTableScan json/parquet/orc format Nov 20, 2024

zhztheplayer changed the title ~~[GLUTEN-7267][CH] Support nested column pruning for HiveTableScan json/parquet/orc format~~ [GLUTEN-7267][CORE][CH] Support nested column pruning for HiveTableScan json/parquet/orc format Nov 20, 2024

zhztheplayer mentioned this pull request Nov 20, 2024

[GLUTEN-7267][CORE][CH] Move schema pruning optimization of HiveTableScan to an individual post-transform rule #8008

Merged
