Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add equality delete writer #703

Merged
merged 7 commits into from
Nov 28, 2024
Merged

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Nov 19, 2024

This PR fix tests and refines the interface based on #372. Thanks, @Dysprosium0626 again for this great job! The fix is included in the extra commit. Feel free for any suggestions. @Xuanwo @liurenjie1024 @sdd @Fokko

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this pr, generally LGTM! There are some structs I don't understand quite well, please add comment.


/// Help to project specific field from `RecordBatch`` according to the column id.
#[derive(Clone)]
pub struct ArrowFieldProjector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite confusing what's the meaning of each field, please add more comment here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we actually need to expose this struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we actually need to expose this struct?

Yes, this struct can help the user to get the projected schema from the original schema. Our lowel writer API requires user to provide the final schema in file writer, see:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like an enhanced version of this method, which handles nested schema. Also this is too arrow specific, so I would following changes:

  1. Move this struct to arrow module
  2. We could maintain the logic here for now, but it's better to contribute back to arrow community.
  3. I still prefer this to be crate private, as user only need to pass equality_field_id list to build the writer config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still prefer this to be crate private, as user only need to pass equality_field_id list to build the writer config.

But how does the user pass the projected schema to the parquet writer?

} else {
Err(Error::new(
ErrorKind::Unexpected,
"Equality delete inner writer does not exist",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Equality delete inner writer does not exist",
"Equality delete inner writer has been closed.",

} else {
Err(Error::new(
ErrorKind::Unexpected,
"Equality delete inner writer does not exist",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Equality delete inner writer does not exist",
"Equality delete inner writer has been closed.",


/// Help to project specific field from `RecordBatch`` according to the column id.
#[derive(Clone)]
pub struct ArrowFieldProjector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is like an enhanced version of this method, which handles nested schema. Also this is too arrow specific, so I would following changes:

  1. Move this struct to arrow module
  2. We could maintain the logic here for now, but it's better to contribute back to arrow community.
  3. I still prefer this to be crate private, as user only need to pass equality_field_id list to build the writer config.

/// Create a new `DataFileWriterConfig` with equality ids.
pub fn new(
equality_ids: Vec<usize>,
projector: ArrowFieldProjector,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think user doesn't need to pass this, they just need to pass original iceberg schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about letting the user get projected schema from config🤔 Like:

let config = EqualityDeleteWriterConfig::new(original_schema);
let projected_schema = config.projected_schema();


/// Help to project specific field from `RecordBatch`` according to the fields id of meta of field.
#[derive(Clone)]
pub struct RecordBatchProjector {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make this pub(crate), otherwise it would be easy to leak this when export module by mistake.

crates/iceberg/src/arrow/record_batch_projector.rs Outdated Show resolved Hide resolved
Comment on lines 87 to 96
DataType::Float16 | DataType::Float32 | DataType::Float64 => {
return Err(Error::new(
ErrorKind::DataInvalid,
"Delete column data type cannot be float or double",
));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this check happens here? It means that any field is float then we throw an error, even it's not deletion field? Also as a column batch projector, we should not do this since this is iceberg specific logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my bad. I'm trying to separate this out.

crates/iceberg/src/arrow/record_batch_projector.rs Outdated Show resolved Hide resolved
liurenjie1024
liurenjie1024 previously approved these changes Nov 27, 2024
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this pr! Generally LGTM, just left some minor points

&field_id_fetch_func,
&searchable_field_func,
)?
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found"))?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found"))?;
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found").with_context("field_id", id)?;

@liurenjie1024 liurenjie1024 dismissed their stale review November 27, 2024 06:55

Sorry, it's approved by mistake

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ZENOTME for this great pr!

@liurenjie1024 liurenjie1024 merged commit 62541fc into apache:main Nov 28, 2024
16 checks passed
shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* copy from apache#372

* fix test and refine ArrowFieldProjector

* add comment and rename to make code more clear

* refine code

* refine RecordBatchProjector

* refine error

* refine function name

---------

Co-authored-by: ZENOTME <[email protected]>
@ZENOTME ZENOTME deleted the eq_delete branch December 23, 2024 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants