-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add equality delete writer #703
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr, generally LGTM! There are some structs I don't understand quite well, please add comment.
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
|
||
/// Help to project specific field from `RecordBatch`` according to the column id. | ||
#[derive(Clone)] | ||
pub struct ArrowFieldProjector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite confusing what's the meaning of each field, please add more comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we actually need to expose this struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we actually need to expose this struct?
Yes, this struct can help the user to get the projected schema from the original schema. Our lowel writer API requires user to provide the final schema in file writer, see:
Arc::new(delete_schema), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like an enhanced version of this method, which handles nested schema. Also this is too arrow specific, so I would following changes:
- Move this struct to arrow module
- We could maintain the logic here for now, but it's better to contribute back to arrow community.
- I still prefer this to be crate private, as user only need to pass
equality_field_id
list to build the writer config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still prefer this to be crate private, as user only need to pass equality_field_id list to build the writer config.
But how does the user pass the projected schema to the parquet writer?
} else { | ||
Err(Error::new( | ||
ErrorKind::Unexpected, | ||
"Equality delete inner writer does not exist", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Equality delete inner writer does not exist", | |
"Equality delete inner writer has been closed.", |
} else { | ||
Err(Error::new( | ||
ErrorKind::Unexpected, | ||
"Equality delete inner writer does not exist", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Equality delete inner writer does not exist", | |
"Equality delete inner writer has been closed.", |
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
|
||
/// Help to project specific field from `RecordBatch`` according to the column id. | ||
#[derive(Clone)] | ||
pub struct ArrowFieldProjector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is like an enhanced version of this method, which handles nested schema. Also this is too arrow specific, so I would following changes:
- Move this struct to arrow module
- We could maintain the logic here for now, but it's better to contribute back to arrow community.
- I still prefer this to be crate private, as user only need to pass
equality_field_id
list to build the writer config.
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
/// Create a new `DataFileWriterConfig` with equality ids. | ||
pub fn new( | ||
equality_ids: Vec<usize>, | ||
projector: ArrowFieldProjector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think user doesn't need to pass this, they just need to pass original iceberg schema.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about letting the user get projected schema from config🤔 Like:
let config = EqualityDeleteWriterConfig::new(original_schema);
let projected_schema = config.projected_schema();
|
||
/// Help to project specific field from `RecordBatch`` according to the fields id of meta of field. | ||
#[derive(Clone)] | ||
pub struct RecordBatchProjector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this pub(crate)
, otherwise it would be easy to leak this when export module by mistake.
DataType::Float16 | DataType::Float32 | DataType::Float64 => { | ||
return Err(Error::new( | ||
ErrorKind::DataInvalid, | ||
"Delete column data type cannot be float or double", | ||
)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this check happens here? It means that any field is float then we throw an error, even it's not deletion field? Also as a column batch projector, we should not do this since this is iceberg specific logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my bad. I'm trying to separate this out.
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this pr! Generally LGTM, just left some minor points
&field_id_fetch_func, | ||
&searchable_field_func, | ||
)? | ||
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found"))?; | |
.ok_or_else(|| Error::new(ErrorKind::Unexpected, "Field not found").with_context("field_id", id)?; |
crates/iceberg/src/writer/base_writer/equality_delete_writer.rs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ZENOTME for this great pr!
* copy from apache#372 * fix test and refine ArrowFieldProjector * add comment and rename to make code more clear * refine code * refine RecordBatchProjector * refine error * refine function name --------- Co-authored-by: ZENOTME <[email protected]>
This PR fix tests and refines the interface based on #372. Thanks, @Dysprosium0626 again for this great job! The fix is included in the extra commit. Feel free for any suggestions. @Xuanwo @liurenjie1024 @sdd @Fokko