
[GLUTEN-8094][CH][Part-1] Support reading data from the iceberg with CH backend #8095

Merged: 2 commits into apache:main on Nov 29, 2024

Conversation

@zzcclp (Contributor) commented on Nov 29, 2024:

What changes were proposed in this pull request?

Support reading data from Iceberg with the CH backend:

  • basic Iceberg scan transformer
  • read from Iceberg tables in copy-on-write mode (a usage sketch follows this list)
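
For context, a minimal usage sketch of the kind of query this targets, assuming a standard Iceberg Spark catalog configuration; the catalog name, warehouse path, table name, and the Gluten plugin setting shown here are illustrative assumptions, not taken from this PR:

  import org.apache.spark.sql.SparkSession

  // Illustrative configuration only; the keys and values are assumptions for this sketch.
  val spark = SparkSession.builder()
    .appName("iceberg-ch-read-sketch")
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    .config("spark.sql.catalog.iceberg_cat", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg_cat.type", "hadoop")
    .config("spark.sql.catalog.iceberg_cat.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()

  // Copy-on-write is the write mode this part of the PR targets.
  spark.sql("CREATE NAMESPACE IF NOT EXISTS iceberg_cat.db")
  spark.sql(
    """CREATE TABLE iceberg_cat.db.tbl (id BIGINT, name STRING)
      |USING iceberg
      |TBLPROPERTIES ('write.delete.mode' = 'copy-on-write')""".stripMargin)
  spark.sql("INSERT INTO iceberg_cat.db.tbl VALUES (1, 'a'), (2, 'b')")

  // This scan is what the new Iceberg scan transformer is meant to offload to the
  // ClickHouse backend instead of falling back to vanilla Spark.
  spark.sql("SELECT id, name FROM iceberg_cat.db.tbl WHERE id > 1").show()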

How was this patch tested?



Referenced issue: #8094


Run Gluten Clickhouse CI on x86

@zhztheplayer (Member) left a comment:

+1 on the common change, thanks.

@@ -76,4 +76,7 @@ trait TransformerApi {
def invalidateSQLExecutionResource(executionId: String): Unit = {}

def genWriteParameters(fileFormat: FileFormat, writeOptions: Map[String, String]): Any

/** use Hadoop Path class to encode the file path */
def encodeFilePathIfNeed(filePath: String): String = filePath
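
As a rough sketch of what encoding through Hadoop's Path gives you (the helper name and example path below are made up; the actual CH-side override in this PR may differ):

  import org.apache.hadoop.fs.Path

  // Hypothetical helper: round-trip the raw path through Hadoop's Path so that
  // characters such as spaces come back percent-encoded before the path is handed
  // to the native reader.
  def encodeFilePath(filePath: String): String =
    new Path(filePath).toUri.toString

  // encodeFilePath("/warehouse/db/part=a b/data.parquet")
  //   returns "/warehouse/db/part=a%20b/data.parquet"
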
A reviewer (Member) commented on this diff:

It turns out that a similar difference exists on the regular scan path:

VL:

  paths.add(
    GlutenURLDecoder
      .decode(file.filePath.toString, StandardCharsets.UTF_8.name()))

CH:

Maybe the two code paths should be consolidated at some point in the future; I am not sure.
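
For illustration, the decode direction on a percent-encoded path looks roughly like this; java.net.URLDecoder stands in here for Gluten's GlutenURLDecoder, which is assumed to behave similarly for %-escapes:

  import java.net.URLDecoder
  import java.nio.charset.StandardCharsets

  // Decode the percent-encoded file path back to its raw form, as the VL snippet
  // above does before adding the path to the scan's path list.
  val encodedPath = "/warehouse/db/part=a%20b/data.parquet"
  val decodedPath = URLDecoder.decode(encodedPath, StandardCharsets.UTF_8.name())
  // decodedPath == "/warehouse/db/part=a b/data.parquet"

  // Caveat: java.net.URLDecoder also turns '+' into a space, which is usually not
  // what you want for file paths; a path-safe decoder would leave '+' alone.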


Run Gluten Clickhouse CI on x86

import org.apache.spark.SparkConf
import org.apache.spark.sql.Row

class ClickHouseIcebergSuite extends GlutenClickHouseWholeStageTransformerSuite {
A reviewer (Contributor) commented:

Is it possible to allow CH and Velox to share this part of the test cases?

@zzcclp (Author) replied:

Will share this part after the CH backend supports the merge-on-read mode for Iceberg; there is also a bug when using the timestamp type as a partition column with the CH backend.
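
One possible shape for that sharing, sketched purely as an assumption (the trait name, test body, and the way the SparkSession is provided are made up, not code from this PR):

  import org.apache.spark.sql.{Row, SparkSession}
  import org.scalatest.funsuite.AnyFunSuite

  // Hypothetical backend-agnostic trait: each backend's suite mixes this in and
  // supplies its own SparkSession (configured for CH or Velox) plus an Iceberg catalog.
  trait IcebergReadTests { self: AnyFunSuite =>
    def spark: SparkSession

    test("read a copy-on-write iceberg table") {
      // Assumes the session's default catalog is configured for Iceberg.
      spark.sql(
        """CREATE TABLE iceberg_cow_tbl (id BIGINT, name STRING) USING iceberg
          |TBLPROPERTIES ('write.delete.mode' = 'copy-on-write')""".stripMargin)
      spark.sql("INSERT INTO iceberg_cow_tbl VALUES (1, 'a'), (2, 'b')")
      val rows = spark.sql("SELECT id, name FROM iceberg_cow_tbl ORDER BY id").collect()
      assert(rows.toSeq == Seq(Row(1L, "a"), Row(2L, "b")))
    }
  }

  // Usage sketch: both suites mix in the shared trait.
  //   class ClickHouseIcebergSuite extends GlutenClickHouseWholeStageTransformerSuite
  //     with IcebergReadTests
  //   class VeloxIcebergSuite extends <the Velox suite base> with IcebergReadTests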

@zzcclp merged commit ea0bcd5 into apache:main on Nov 29, 2024
48 checks passed
Labels: CLICKHOUSE, CORE (works for Gluten Core), DATA_LAKE
3 participants