feat(batch): support batch s3 parquet file executor #17606
The rest LGTM
let file_io_builder = FileIOBuilder::new("s3");
let file_io = file_io_builder.with_props(props.into_iter()).build()?;
let parquet_file = file_io.new_input(&self.location)?;
Is there any reason for using iceberg::io to access the object storage, instead of OpenDAL or the AWS client?
Initially, I implemented it with OpenDAL, but found that reading a parquet file that way means reimplementing a bunch of logic that already exists in iceberg-rust, and I don't want to reinvent the wheel. BTW, the iceberg FileIO looks good to me, because in the iceberg scan we need this interface anyway to avoid resolving the iceberg metadata again in the compute node while scanning parquet files.
I see. I noticed that iceberg::io actually uses OpenDAL under the hood.
However, I don't know why the implementation limits the storage to AWS S3. Do you know?
I think the core module of iceberg-rust is still under development, so only AWS S3 is supported currently.
batch_stream_builder = batch_stream_builder.with_batch_size(self.batch_size);

let record_batch_stream = batch_stream_builder.build().map_err(|e| anyhow!(e))?;
I recommend using a specific error type here instead of anyhow!(e).
Sure, I will improve it together with the next PR.
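For reference, the suggestion above could look roughly like the sketch below. It is a minimal illustration using only the standard library, not the change that was eventually made: `ParquetScanError`, `build_stream`, and `open_stream` are hypothetical names, and `build_stream` is a stand-in for the fallible `build()` call in the diff. The point is that the failing step gets its own variant, so callers can match on it, while the underlying error remains available via `source()`.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical error type for the parquet scan path (not taken from the PR).
#[derive(Debug)]
pub enum ParquetScanError {
    /// Building the record batch stream failed; the cause is kept as the source.
    BuildStream(Box<dyn Error + Send + Sync>),
}

impl fmt::Display for ParquetScanError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParquetScanError::BuildStream(e) => {
                write!(f, "failed to build record batch stream: {e}")
            }
        }
    }
}

impl Error for ParquetScanError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            // Expose the wrapped error so callers can walk the error chain.
            ParquetScanError::BuildStream(e) => Some(&**e),
        }
    }
}

/// Stand-in for the fallible builder call; it rejects a batch size of zero.
pub fn build_stream(batch_size: usize) -> Result<usize, io::Error> {
    if batch_size == 0 {
        Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "batch size must be positive",
        ))
    } else {
        Ok(batch_size)
    }
}

/// Maps the builder failure to a specific variant instead of `anyhow!(e)`.
pub fn open_stream(batch_size: usize) -> Result<usize, ParquetScanError> {
    let stream = build_stream(batch_size)
        .map_err(|e| ParquetScanError::BuildStream(Box::new(e)))?;
    Ok(stream)
}
```

In a real codebase the same shape is often written with a derive macro or folded into the project's existing error enum; the hand-written impls here just keep the sketch dependency-free.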
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.