-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refine: refine writer interface #741
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr, left some question about this change.
@@ -37,11 +37,11 @@ pub trait FileWriterBuilder<O = DefaultOutput>: Send + Clone + 'static { | |||
/// The associated file writer type. | |||
type R: FileWriter<O>; | |||
/// Build file writer. | |||
fn build(self) -> impl Future<Output = Result<Self::R>> + Send; | |||
fn build(self, schema: SchemaRef) -> impl Future<Output = Result<Self::R>> + Send; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, why this can't be passed as constructor parameter of the FileWriterBuilder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use original process of creating equality delete writer as example:
let equality_ids = vec![0_i32, 8];
let equality_config =
EqualityDeleteWriterConfig::new(equality_ids, Arc::new(schema), None).unwrap();
let delete_schema =
arrow_schema_to_schema(equality_config.projected_arrow_schema_ref()).unwrap();
let projector = equality_config.projector.clone();
// prepare writer
let pb = ParquetWriterBuilder::new(
WriterProperties::builder().build(),
Arc::new(delete_schema),
file_io.clone(),
location_gen,
file_name_gen,
);
let mut equality_delete_writer = EqualityDeleteFileWriterBuilder::new(pb)
.build(equality_config)
.await?;
We need to get the projected schema from EqualityDeleteWriterConfig. Otherwise, the user have to project the schema by themselves. After remove writer config, we need to other thing help us to project the schema and let user get from it.
One pattern I realized for now is that the schema of file writer is always rely on the iceberg writer which use it. So I think move it to the build function can make thing looks more simple.🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I'm not convinced that this simplified things a lot. I think the construction of writers will not be called directly by user, since they are low level apis. Schema is only part of the arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a beginner rustacean, but you could still pass the equality_config
to the constructor, right?
And I agree with @liurenjie1024, I think in general we should think more about the high-level API. For example, constructing the writers by yourself can be pretty error-prone. Much rather I would see a Table
object returning a fully configured writer based on the current-schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense. Let's remove the change.
crates/iceberg/src/writer/mod.rs
Outdated
/// Get the current file path. | ||
fn current_file_path(&self) -> String; | ||
/// Get the current file row number. | ||
fn current_row_num(&self) -> usize; | ||
/// Get the current file written size. | ||
fn current_written_size(&self) -> usize; | ||
/// Get the schema of the current file. | ||
fn current_schema(&self) -> SchemaRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When this will be useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function will be useful in delta writer. But it's fine to remove for now, let's add it until we find it's required.
738a4a2
to
db06454
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr, LGTM!
crates/iceberg/src/writer/mod.rs
Outdated
@@ -83,7 +81,7 @@ pub trait IcebergWriter<I = DefaultInput, O = DefaultOutput>: Send + 'static { | |||
|
|||
/// The current file status of iceberg writer. It implement for the writer which write a single | |||
/// file. | |||
pub trait CurrentFileStatus { | |||
pub trait CurrentWriterStatus { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to split to another pr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove them.
Let's wait for a while to see other reviewer's point of view. |
db06454
to
d0a2897
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @ZENOTME.
I think we should derive Debug
for most structs in public interfaces, I just commented the first, but there are a few touched by this PR. At least for EqualityDeleteWriterConfig
it would be helpful. For EqualityDeleteWriterConfig
, or configs in general, PartialEq
would also be good.
@@ -29,38 +29,27 @@ use crate::Result; | |||
#[derive(Clone)] | |||
pub struct DataFileWriterBuilder<B: FileWriterBuilder> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maybe also derive Debug?
Hi, I'm working on writer recently and find some weaknesses of our original writer interface design:
And I realized that this custom config param cause we can't combine the write flexibility. E.g. In partition writer
So avoid this problem, we should pass the custom param when create the builder and the build interface should looks like
fn build() -> Self
In our original design, user should pass the schema to file writer builder when create them like
However, sometimes the schema is hard to determine when we create them. E.g. equality delete writer, we only know what the schema looks like util we pass the equality id and create the equality delete writer. To avoid the problem, we change
fn build() -> Self
of FileWriterBuilder tofn build(schema:SchemaRef) -> Self
. By this way, the schema of FileWriter is determined by base writer.I send this discussion as a PR to make it easier to express my idea. BTW, I applied this change and complete partition writer, delta writer in https://github.com/ZENOTME/iceberg-rust/tree/partition_writer and it looks well.
Feel free for any suggestion. cc @liurenjie1024 @Xuanwo @Fokko @c-thiel