RecordBatch normalization (flattening) #6758

Draft
wants to merge 11 commits into
base: main

ngli-me
Contributor

@ngli-me ngli-me commented Nov 20, 2024

Which issue does this PR close?

Closes #6369.

Rationale for this change

Adds normalization (flattening) for RecordBatch, with normalization via Schema. Based on pandas/pola-rs.

What changes are included in this PR?

Are there any user-facing changes?

@ngli-me ngli-me changed the title Feature/record batch flatten RecordBatch normalization (flattening) Nov 20, 2024
@ngli-me ngli-me changed the title RecordBatch normalization (flattening) RecordBatch normalization (flattening) Nov 20, 2024
… iterative function for `RecordBatch`. Not sure which one is better currently.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 23, 2024
Contributor Author

@ngli-me ngli-me left a comment

I had some questions regarding the implementation of this, since the one example from PyArrow doesn't seem to clarify the edge cases here. Normalizing the Schema seems fairly straightforward to me; I'm just not sure about:

  1. Whether the iterative or recursive approach is better (or whether there is something I missed)
  2. Whether DataType::Struct is the only DataType that requires flattening. To me, it looks like that's the only one that can contain nested Fields.

(I'm also not sure if I'm missing something with unwrapping something like a List<Struct>.)
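
As a rough illustration of question 2, the sketch below flattens a toy schema recursively. `ToyType`, `ToyField`, `flatten_fields`, and `flatten_into` are hypothetical stand-ins, not the arrow-schema types or the API in this PR. It assumes that only struct fields are expanded and that a list of structs stays a single column; that is an assumption of the sketch, not a decision made in this thread.

```rust
// Simplified stand-ins for arrow's DataType and Field; NOT the real arrow-schema types.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    Struct(Vec<ToyField>),
    List(Box<ToyField>),
}

#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
}

impl ToyField {
    fn new(name: &str, data_type: ToyType) -> Self {
        Self { name: name.to_string(), data_type }
    }
}

/// Recursively flatten `fields`, joining nested names with `separator`.
/// Only `Struct` is descended into; every other type is emitted unchanged.
fn flatten_fields(fields: &[ToyField], separator: &str) -> Vec<ToyField> {
    let mut out = Vec::new();
    for field in fields {
        flatten_into(field, None, separator, &mut out);
    }
    out
}

fn flatten_into(
    field: &ToyField,
    prefix: Option<&str>,
    separator: &str,
    out: &mut Vec<ToyField>,
) {
    // Top-level fields have no prefix, so no leading separator is produced.
    let name = match prefix {
        Some(p) => format!("{}{}{}", p, separator, field.name),
        None => field.name.clone(),
    };
    match &field.data_type {
        // Struct children are visited in order, so the original column order is kept.
        ToyType::Struct(children) => {
            for child in children {
                flatten_into(child, Some(&name), separator, out);
            }
        }
        // Lists (even lists of structs) and primitives are kept as single columns.
        other => out.push(ToyField { name, data_type: other.clone() }),
    }
}
```

With separator ".", a toy schema of `id`, `a: {b, c: {d}}`, and `items: List<Struct>` would come out as `id`, `a.b`, `a.c.d`, `items`. Passing the prefix as an `Option` is one way to avoid a leading separator on top-level names without a separate entry point.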

Any feedback/help would be appreciated!

arrow-array/src/record_batch.rs (outdated)
arrow-array/src/record_batch.rs (outdated)
arrow-schema/src/schema.rs
@ngli-me ngli-me marked this pull request as ready for review November 23, 2024 19:03
@ngli-me ngli-me marked this pull request as draft November 23, 2024 23:30
@ngli-me ngli-me marked this pull request as ready for review November 25, 2024 04:02
@alamb
Contributor

alamb commented Dec 18, 2024

@kszlim can you please help review this PR? You requested the feature, and we are currently quite short on review capacity in arrow-rs.

Contributor

@alamb alamb left a comment

Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.

Hopefully @kszlim can give this a look and help us review / get it moving too.

@@ -394,6 +396,56 @@ impl RecordBatch {
)
}

/// Normalize a semi-structured [`RecordBatch`] into a flat table.
///
/// If max_level is 0, normalizes all levels.
Contributor

Can you please improve this documentation (maybe copy from the pyarrow version)?

  1. Document what max_level means (in addition to the 0 case)
  2. Document what separator does
  3. Provide a doc example of flattening a record batch

For example, something like https://docs.rs/arrow/latest/arrow/index.html#columnar-format

Contributor Author

Ah, missed doing this, will do!

@@ -413,6 +413,81 @@ impl Schema {
&self.metadata
}

/// Returns a new schema, normalized based on the max_level
/// This carries metadata from the parent schema over as well
Contributor

Likewise, please document the parameters to this function and add a documentation example.

Contributor Author

Sounds good, thanks!

@kszlim
Contributor

kszlim commented Dec 19, 2024

I'll take a look, though please feel free to disregard anything I say and especially defer to the maintainers.

if max_level == 0 {
max_level = usize::MAX;
}
if self.num_rows() == 0 {
Contributor

Is this worth keeping? I could be reading this wrong, but it looks like there's a lot of code strictly to support normalization for the 0 row case (which is likely very rare)?

Contributor Author

Yeah, I'm not sure of the best way to handle this.

I think the secondary case can be removed. I missed that even a new_empty RecordBatch still has an (empty) columns field, good catch!

For the usize::MAX setting, this was because polars/pandas had this, with a default value of 0. However, since Rust does not have default parameters, I wasn't sure of the best way to adapt this.

One possible idea would be to set up something like

enum Depth {
    Default, // All levels
    Value(usize)
}

then do some matching? Might be overkill though.

And of course, there's the option of removing it altogether, although then a value of 0 would mean no change?
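
To make the trade-off concrete, here is a minimal sketch of how the `Depth` idea could replace the "0 means all levels" convention. The `limit` method and the `Default` derive are hypothetical additions for illustration and are not part of this PR.

```rust
/// Hypothetical depth parameter, following the `Depth` idea above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Depth {
    /// Flatten every level.
    #[default]
    Default,
    /// Flatten at most this many levels; `Value(0)` would mean no change.
    Value(usize),
}

impl Depth {
    /// Resolve the variant to a concrete level limit.
    fn limit(self) -> usize {
        match self {
            Depth::Default => usize::MAX,
            Depth::Value(n) => n,
        }
    }
}
```

With this shape, `Depth::default().limit()` resolves to `usize::MAX`, and `Depth::Value(0)` can keep its literal meaning of "flatten nothing". An `Option<usize>` with `None` for "all levels" would be a lighter alternative with the same effect.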

@@ -413,6 +413,81 @@ impl Schema {
&self.metadata
}

/// Returns a new schema, normalized based on the max_level
/// This carries metadata from the parent schema over as well
pub fn normalize(&self, separator: &str, mut max_level: usize) -> Result<Self, ArrowError> {
Contributor

I mean, this is unused outside of the 0-row case, right?

Contributor Author

This was to help with the recursion, since using just the helper function would result in the separator as a prefix in the field.name(). I do agree that this is not the best option though, maybe I can count this as a vote against the recursion approach? 😆

DataType::Struct(ff) => {
// Need to zip these in reverse to maintain original order
for (cff, fff) in c.as_struct().columns().iter().zip(ff.into_iter()).rev() {
let new_key = format!("{}{}{}", f.name(), separator, fff.name());
Contributor

Not sure if there's a better way to structure it, but is there a way to keep the field name parts in a Vec and create the flattened fields at the end? That allows you to avoid the repeated format! in a deeply nested schema.

Might not be worth the trouble though.

Contributor Author

I think this is a good point; the repeated format! is definitely not my favorite way to do this. I'll have to do some testing and think about it some more, but it may be better to construct the queue with the components of each Field, then go through and construct all of the Fields at the very end.
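
A minimal sketch of that idea, assuming a toy tree in place of real arrow fields: `SchemaNode` and `flatten_names` are hypothetical names, not arrow types or functions from this PR. Each stack entry carries the name parts of its ancestors, children are pushed in reverse so they pop in their original order, and each flattened name is joined once at the end. In this sketch `max_level` is a plain limit (0 flattens nothing); a caller wanting every level would pass `usize::MAX`.

```rust
// Simplified stand-in for a nested schema node; not an arrow type.
#[derive(Debug, Clone)]
enum SchemaNode {
    Leaf(String),
    Struct(String, Vec<SchemaNode>),
}

/// Iteratively flatten `roots`, returning the joined leaf names in original order.
fn flatten_names(roots: &[SchemaNode], separator: &str, max_level: usize) -> Vec<String> {
    // Each stack entry carries a node plus the name parts of its ancestors.
    let mut stack: Vec<(&SchemaNode, Vec<&str>)> = Vec::new();
    // Push in reverse so that popping yields the original left-to-right order.
    for node in roots.iter().rev() {
        stack.push((node, Vec::new()));
    }

    let mut paths: Vec<Vec<&str>> = Vec::new();
    while let Some((node, mut parts)) = stack.pop() {
        match node {
            // `parts.len()` is the current nesting depth of this node.
            SchemaNode::Struct(name, children) if parts.len() < max_level => {
                parts.push(name.as_str());
                for child in children.iter().rev() {
                    stack.push((child, parts.clone()));
                }
            }
            // Leaves, and structs at or beyond the depth limit, become one output column.
            SchemaNode::Leaf(name) | SchemaNode::Struct(name, _) => {
                parts.push(name.as_str());
                paths.push(parts);
            }
        }
    }

    // Each flattened name is built exactly once, at the very end.
    paths.iter().map(|p| p.join(separator)).collect()
}
```

This trades the repeated `format!` for cloning small vectors of `&str`; whether that is actually cheaper on deeply nested schemas has not been measured here.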

@ngli-me
Contributor Author

ngli-me commented Dec 19, 2024

> Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.
>
> Hopefully @kszlim can give this a look and help us review / get it moving too.

No problem at all, it's the holiday season! Hope everyone's taking a good break.

Appreciate the feedback though! I'll get to work on it :)

@ngli-me ngli-me marked this pull request as draft December 20, 2024 12:27
Development

Successfully merging this pull request may close these issues.

Support RecordBatch.flatten
3 participants