Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refine: refine writer interface #741

Merged
merged 2 commits into from
Dec 10, 2024

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Nov 29, 2024

Hi, I'm working on writer recently and find some weaknesses of our original writer interface design:

  1. Our IcebergWriter Builder interface looks like following:
trait IcebergWriterBuilder {
 type C;
 async fn build(self, config: Self::C) -> Result<Self::R>;
}

And I realized that this custom config param cause we can't combine the write flexibility. E.g. In partition writer

struct PartitionWriter<IcebergWriterBuilder> {
  inner_writer_builder: B
}

impl  PartitionWriter<IcebergWriterBuilder>  {
  pub async fn write(..) {
      self.inner_writer_builder.build(...) // We can't build because we don't know pass which param.
  }
}

So avoid this problem, we should pass the custom param when create the builder and the build interface should looks like fn build() -> Self

  1. The schema of FileWriter can determined by base writer like: data file writer, position delete writer, equality delete writer.
    In our original design, user should pass the schema to file writer builder when create them like
let file_builder = ParquetWriterBuilder(schema)

However, sometimes the schema is hard to determine when we create them. E.g. equality delete writer, we only know what the schema looks like util we pass the equality id and create the equality delete writer. To avoid the problem, we change fn build() -> Self of FileWriterBuilder to fn build(schema:SchemaRef) -> Self. By this way, the schema of FileWriter is determined by base writer.

I send this discussion as a PR to make it easier to express my idea. BTW, I applied this change and complete partition writer, delta writer in https://github.com/ZENOTME/iceberg-rust/tree/partition_writer and it looks well.

Feel free for any suggestion. cc @liurenjie1024 @Xuanwo @Fokko @c-thiel

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this pr, left some question about this change.

@@ -37,11 +37,11 @@ pub trait FileWriterBuilder<O = DefaultOutput>: Send + Clone + 'static {
/// The associated file writer type.
type R: FileWriter<O>;
/// Build file writer.
fn build(self) -> impl Future<Output = Result<Self::R>> + Send;
fn build(self, schema: SchemaRef) -> impl Future<Output = Result<Self::R>> + Send;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, why this can't be passed as constructor parameter of the FileWriterBuilder?

Copy link
Contributor Author

@ZENOTME ZENOTME Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use original process of creating equality delete writer as example:

        let equality_ids = vec![0_i32, 8];
        let equality_config =
            EqualityDeleteWriterConfig::new(equality_ids, Arc::new(schema), None).unwrap();
        let delete_schema =
            arrow_schema_to_schema(equality_config.projected_arrow_schema_ref()).unwrap();
        let projector = equality_config.projector.clone();

        // prepare writer
        let pb = ParquetWriterBuilder::new(
            WriterProperties::builder().build(),
            Arc::new(delete_schema),
            file_io.clone(),
            location_gen,
            file_name_gen,
        );

        let mut equality_delete_writer = EqualityDeleteFileWriterBuilder::new(pb)
            .build(equality_config)
            .await?;

We need to get the projected schema from EqualityDeleteWriterConfig. Otherwise, the user have to project the schema by themselves. After remove writer config, we need to other thing help us to project the schema and let user get from it.
One pattern I realized for now is that the schema of file writer is always rely on the iceberg writer which use it. So I think move it to the build function can make thing looks more simple.🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I'm not convinced that this simplified things a lot. I think the construction of writers will not be called directly by user, since they are low level apis. Schema is only part of the arguments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a beginner rustacean, but you could still pass the equality_config to the constructor, right?

And I agree with @liurenjie1024, I think in general we should think more about the high-level API. For example, constructing the writers by yourself can be pretty error-prone. Much rather I would see a Table object returning a fully configured writer based on the current-schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Let's remove the change.

/// Get the current file path.
fn current_file_path(&self) -> String;
/// Get the current file row number.
fn current_row_num(&self) -> usize;
/// Get the current file written size.
fn current_written_size(&self) -> usize;
/// Get the schema of the current file.
fn current_schema(&self) -> SchemaRef;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this will be useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function will be useful in delta writer. But it's fine to remove for now, let's add it until we find it's required.

@ZENOTME ZENOTME force-pushed the refine_writer_interface branch 5 times, most recently from 738a4a2 to db06454 Compare December 6, 2024 13:51
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this pr, LGTM!

@@ -83,7 +81,7 @@ pub trait IcebergWriter<I = DefaultInput, O = DefaultOutput>: Send + 'static {

/// The current file status of iceberg writer. It implement for the writer which write a single
/// file.
pub trait CurrentFileStatus {
pub trait CurrentWriterStatus {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to split to another pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove them.

@liurenjie1024
Copy link
Contributor

Let's wait for a while to see other reviewer's point of view.

@ZENOTME ZENOTME force-pushed the refine_writer_interface branch from db06454 to d0a2897 Compare December 9, 2024 10:45
Copy link
Collaborator

@c-thiel c-thiel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @ZENOTME.

I think we should derive Debug for most structs in public interfaces, I just commented the first, but there are a few touched by this PR. At least for EqualityDeleteWriterConfig it would be helpful. For EqualityDeleteWriterConfig, or configs in general, PartialEq would also be good.

@@ -29,38 +29,27 @@ use crate::Result;
#[derive(Clone)]
pub struct DataFileWriterBuilder<B: FileWriterBuilder> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maybe also derive Debug?

@liurenjie1024 liurenjie1024 merged commit 8b90c84 into apache:main Dec 10, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants