Feat: Batch ingest iceberg/file source #14742

Closed
20 of 21 tasks
chenzl25 opened this issue Jan 23, 2024 · 6 comments

@chenzl25

chenzl25 commented Jan 23, 2024

Is your feature request related to a problem? Please describe.

According to the RFC "Combine Historical and Incremental Data", we need to support ingesting data from an external source (e.g. Iceberg or a file source) as historical data. This is a typical bulk-loading scenario, which is expected to be faster than streaming the same data in from Kafka.

To support this feature, we need to

Performance Improvement:

Others:

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@lmatz

lmatz commented Jan 23, 2024

Just to provide the complete context: the details of the POC user request can be found at https://www.notion.so/risingwave-labs/optimize-parquet-source-for-batch-load-dc498a043d504621bf56461690b14bd7?d=84ebdf5d7469412680278059c5898be8

In short, if implementing the batch Iceberg source takes a long time due to its complexity, a Parquet file source with decent performance is good enough to move the POC forward. The user will consider switching to RW only if RW's Iceberg batch source is fast enough.

@chenzl25

> In short, if implementing the batch iceberg source takes much time due to its complexity, a parquet file source with decent performance is good enough to help move the POC forward. The user will consider switching RW only if RW's iceberg batch source is fast enough.

Must the file format be Parquet? Could we use CSV, which our file source already supports? If it is OK to test with a CSV file first, we can add batch read support to the file source first and use it to test the performance. BTW, I tested insert select from one RisingWave table to another RisingWave table last week. Can we just compare a streaming load from Kafka into a table with an insert select from one table into another?

@lmatz

lmatz commented Feb 1, 2024

> Is it possible to use a CSV file that has been supported in our file source? If it is ok to test a CSV file first,

Good point. I think it is OK, since CSV is a less efficient format than Parquet in terms of read and write performance.
If we achieve decent performance with CSV files, we can expect to be even faster with Parquet. I think that is a very convincing argument for the POC user.

> BTW, I tested insert select from a RisingWave table to another RisingWave table last week. Can we just compare the streaming load from Kafka to a table with insert select from a table to another table?

I can try to communicate this first. The closer to the user's real use case, the better, but there is definitely nothing wrong with using what we have at the moment. Could you post the link to last week's results? 🙏
I am also thinking of adding this to the daily performance tests.

@chenzl25

chenzl25 commented Feb 1, 2024

> I can try to communicate this first. The closer to the user's real use case, the better, but definitely nothing wrong if we use what we have at the moment, could you post the link to the last week's results? 🙏 I am also thinking of adding this to the daily performance tests

#14630 (comment)

@chenzl25

#15784

@chenzl25

Finished
