Feat: Batch ingest iceberg/file source #14742

chenzl25 · 2024-01-23T06:17:04Z

Is your feature request related to a problem? Please describe.

According to the RFC "Combine Historical and Incremental Data", we need to support ingesting data from an external source (e.g. iceberg or file source) as historical data. This is a typical bulk-loading scenario, which is expected to be faster than streaming the data in from Kafka.

To support this feature, we need to

Performance Improvement:

Icelake provides a file-level read interface, so we can use it in the iceberg scan executor. cc @ZENOTME
Column pruning for iceberg scan: feat(batch,optimizer): support column pruning for iceberg source #16429
Use iceberg zone-map statistics to perform data skipping, a.k.a. predicate pushdown.
Accelerate scan file planning and data access. Use Alluxio as a cache.
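
To make the data-skipping item concrete, here is a minimal sketch of zone-map pruning during scan planning. It assumes each data file carries per-column min/max bounds. `DataFile`, `may_contain`, and `plan_scan` are hypothetical names for illustration only, not Icelake or RisingWave APIs.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry with zone-map statistics."""
    path: str
    # column name -> (min, max) recorded for that column in this file
    bounds: Dict[str, Tuple[int, int]]


def may_contain(f: DataFile, column: str, lo: int, hi: int) -> bool:
    """Return False only if the file's [min, max] range cannot overlap [lo, hi]."""
    if column not in f.bounds:
        return True  # no statistics for this column: the file must be scanned
    fmin, fmax = f.bounds[column]
    return fmax >= lo and fmin <= hi


def plan_scan(files: List[DataFile], column: str, lo: int, hi: int) -> List[str]:
    """Keep only the files that might hold rows with lo <= column <= hi."""
    return [f.path for f in files if may_contain(f, column, lo, hi)]


files = [
    DataFile("a.parquet", {"ts": (0, 99)}),
    DataFile("b.parquet", {"ts": (100, 199)}),
    DataFile("c.parquet", {}),  # statistics missing
]
selected = plan_scan(files, "ts", 120, 150)
```

In this sketch `a.parquet` is skipped without being opened, while `c.parquet` is kept because nothing can be proven about it. The filter is conservative: it may keep files that turn out to hold no matching rows, but it never drops a file that could hold one.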

Others:

Provide a type compatibility table between Iceberg and RisingWave data types. feat(batch): support decimal type for iceberg type #15298
Integration tests: test(batch): add iceberg source integration test #15491; test(batch): support hive catalog for iceberg source #15550; feat(batch): support jdbc catalog for iceberg source #15551; feat(batch): support rest catalog for iceberg source #15535
Support more catalog types for iceberg via JNI.
Support querying iceberg table metadata: feat(batch,meta): support iceberg snapshots sys table #16175; feat(meta,batch): support iceberg files system table #16180
Read a specific snapshot from an iceberg source: feat(batch): support time travel for iceberg source #15866
Support iceberg merge-on-read: https://www.notion.so/risingwave-labs/Iceberg-Support-read-from-MoR-table-4ca08d580ff74ab6a0c3296f059bf158?pvs=4
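
For the "read a specific snapshot" item, a rough sketch of how a timestamp-based time-travel read could resolve its snapshot: pick the most recent snapshot committed at or before the requested time. `Snapshot` and `snapshot_as_of` are hypothetical names, and this is not necessarily how #15866 implements it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical, simplified view of a table snapshot entry."""
    snapshot_id: int
    timestamp_ms: int  # commit time in milliseconds since the epoch


def snapshot_as_of(snapshots: List[Snapshot], ts_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before ts_ms, or None if none exists."""
    candidates = [s for s in snapshots if s.timestamp_ms <= ts_ms]
    return max(candidates, key=lambda s: s.timestamp_ms, default=None)


history = [
    Snapshot(snapshot_id=1, timestamp_ms=1_000),
    Snapshot(snapshot_id=2, timestamp_ms=2_000),
    Snapshot(snapshot_id=3, timestamp_ms=3_000),
]
chosen = snapshot_as_of(history, 2_500)
```

Here `chosen` is snapshot 2, the table state that was current at `2_500`. A request earlier than the first commit yields `None`, which a caller would likely surface as an error.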

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

lmatz · 2024-01-23T06:34:00Z

Just to provide the complete context: the details of the POC user request can be found at https://www.notion.so/risingwave-labs/optimize-parquet-source-for-batch-load-dc498a043d504621bf56461690b14bd7?d=84ebdf5d7469412680278059c5898be8

In short, if implementing the batch iceberg source takes much time due to its complexity, a parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's iceberg batch source is fast enough.

chenzl25 · 2024-01-23T07:10:05Z

> In short, if implementing the batch iceberg source takes much time due to its complexity, a parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching to RW only if RW's iceberg batch source is fast enough.

Must the file format be Parquet? Is it possible to use a CSV file that has been supported in our file source? If it is ok to test a CSV file first, we can support file source batch read first, to test the performance. BTW, I tested insert select from a RisingWave table to another RisingWave table last week. Can we just compare the streaming load from Kafka to a table with insert select from a table to another table?

lmatz · 2024-02-01T08:19:35Z

> Is it possible to use a CSV file that has been supported in our file source? If it is ok to test a CSV file first,

Good point. I think it is OK, as CSV is a less efficient format than Parquet in terms of read and write performance.
If we achieve decent performance with CSV files, we can be even faster with Parquet. I think that would be a very convincing argument for the POC user.

> BTW, I tested insert select from a RisingWave table to another RisingWave table last week. Can we just compare the streaming load from Kafka to a table with insert select from a table to another table?

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏
I am also thinking of adding this to the daily performance tests.

chenzl25 · 2024-02-01T08:37:19Z

> I can try to communicate this first. The closer to the user's real use case, the better, but definitely nothing wrong if we use what we have at the moment, could you post the link to the last week's results? 🙏 I am also thinking of adding this to the daily performance tests

#14630 (comment)

chenzl25 · 2024-07-31T09:49:13Z

#15784

chenzl25 · 2024-12-26T05:13:18Z

Finished

chenzl25 added the type/feature label Jan 23, 2024

github-actions bot added this to the release-1.7 milestone Jan 23, 2024

chenzl25 assigned chenzl25 and wcy-fdu Jan 23, 2024

chenzl25 mentioned this issue Feb 1, 2024

feat(batch): support iceberg scan executor #14915

Merged

9 tasks

chenzl25 mentioned this issue Feb 4, 2024

feat(frontend): support create iceberg source #14971

Merged

9 tasks

This was referenced Feb 23, 2024

feat(batch): support batch read iceberg source #15214

Merged

feat(batch): support decimal type for iceberg type #15298

Merged

chore(batch): bump icelake #15323

Merged

This was referenced Mar 4, 2024

feat(frontend): derive columns from iceberg source automatically #15415

Merged

feat(batch): auto decide distributed dml for iceberg ingestion #15481

Merged

wcy-fdu modified the milestones: release-1.7, release-1.8 Mar 6, 2024

chenzl25 mentioned this issue Mar 22, 2024

feat(batch): support time travel for iceberg source #15866

Merged

9 tasks

This was referenced Apr 7, 2024

feat(batch,meta): support iceberg snapshots sys table #16175

Merged

feat(meta,batch): support iceberg files system table #16180

Merged

chenzl25 modified the milestones: release-1.8, release-1.9 Apr 8, 2024

chenzl25 mentioned this issue Apr 22, 2024

feat(batch,optimizer): support column pruning for iceberg source #16429

Merged

9 tasks

chenzl25 modified the milestones: release-1.9, release-1.10 May 14, 2024

This was referenced Jul 8, 2024

feat(batch): support batch s3 parquet file executor #17606

Merged

feat(batch): support batch s3 parquet frontend part #17625

Merged

chenzl25 modified the milestones: release-1.10, release-1.11 Jul 10, 2024

chenzl25 mentioned this issue Jul 25, 2024

feat(batch): support file scan a directory of parquet files #17811

Merged

9 tasks

chenzl25 modified the milestones: release-2.0, future-release-2.2 Aug 19, 2024

chenzl25 closed this as completed Dec 26, 2024
