`RecordBatch` normalization (flattening) #6758

ngli-me · 2024-11-20T03:10:29Z

Which issue does this PR close?

Closes #6369.

Rationale for this change

Adds normalization (flattening) for RecordBatch, with normalization via Schema. Based on pandas/pola-rs.

What changes are included in this PR?

Are there any user-facing changes?

ngli-me

I had some questions regarding the implementation of this, since the one example from PyArrow doesn't seem to clarify the edge cases here. Normalizing the Schema seems fairly straightforward to me; I'm just not sure about:

- Whether the iterative or recursive approach is better (or whether there is something I missed).
- Whether `DataType::Struct` is the only DataType that requires flattening. To me, it looks like that's the only one that can contain nested Fields.

(I'm also not sure if I'm missing something with unwrapping something like a `List<Struct>`.)

Any feedback/help would be appreciated!
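
For readers following the two questions above, here is a minimal sketch of both traversal styles on a toy schema model. It is not the code from this PR: `Ty`, `Fld`, `fld`, `flatten_recursive`, `flatten_iterative`, `example_schema`, and `names` are hypothetical stand-ins that use only the standard library. The sketch assumes that only struct-typed fields are flattened and that a list of structs is left as a single column.

```rust
// Toy stand-ins for a schema; hypothetical, NOT the arrow-schema types.
#[derive(Clone, Debug, PartialEq)]
enum Ty {
    Int64,
    Struct(Vec<Fld>),
    List(Box<Fld>),
}

#[derive(Clone, Debug, PartialEq)]
struct Fld {
    name: String,
    ty: Ty,
}

fn fld(name: &str, ty: Ty) -> Fld {
    Fld { name: name.to_string(), ty }
}

/// Recursive variant: descend into struct children, emit anything else as a leaf.
fn flatten_recursive(fields: &[Fld], sep: &str) -> Vec<Fld> {
    fn go(f: &Fld, prefix: String, sep: &str, out: &mut Vec<Fld>) {
        match &f.ty {
            Ty::Struct(children) => {
                for c in children {
                    go(c, format!("{}{}{}", prefix, sep, c.name), sep, out);
                }
            }
            other => out.push(Fld { name: prefix, ty: other.clone() }),
        }
    }
    let mut out = Vec::new();
    for f in fields {
        go(f, f.name.clone(), sep, &mut out);
    }
    out
}

/// Iterative variant: an explicit stack replaces the call stack.
fn flatten_iterative(fields: &[Fld], sep: &str) -> Vec<Fld> {
    // Push in reverse so that popping yields fields in their original order.
    let mut stack: Vec<(String, &Fld)> =
        fields.iter().rev().map(|f| (f.name.clone(), f)).collect();
    let mut out = Vec::new();
    while let Some((name, f)) = stack.pop() {
        match &f.ty {
            Ty::Struct(children) => {
                for c in children.iter().rev() {
                    stack.push((format!("{}{}{}", name, sep, c.name), c));
                }
            }
            // Lists (even lists of structs) and primitives are kept as-is.
            other => out.push(Fld { name, ty: other.clone() }),
        }
    }
    out
}

/// A small nested schema: "!" holds a struct "1" and a list-of-struct "2".
fn example_schema() -> Vec<Fld> {
    let ab = vec![fld("a", Ty::Int64), fld("b", Ty::Int64)];
    vec![
        fld(
            "!",
            Ty::Struct(vec![
                fld("1", Ty::Struct(ab.clone())),
                fld("2", Ty::List(Box::new(fld("item", Ty::Struct(ab))))),
            ]),
        ),
        fld("id", Ty::Int64),
    ]
}

fn names(fields: &[Fld]) -> Vec<String> {
    fields.iter().map(|f| f.name.clone()).collect()
}

fn main() {
    let flat = flatten_iterative(&example_schema(), ".");
    // prints ["!.1.a", "!.1.b", "!.2", "id"]
    println!("{:?}", names(&flat));
}
```

Both variants produce the same output here. The iterative one avoids deep call stacks on heavily nested schemas, at the cost of managing the push order by hand.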

alamb · 2024-12-18T13:04:18Z

@kszlim can you please help review this PR? You requested the feature, and we are currently quite short on review capacity in arrow-rs.

alamb

Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.

Hopefully @kszlim can give this a look and help us review / get it moving too.

kszlim · 2024-12-19T06:10:58Z

I'll take a look, though please feel free to disregard anything I say and especially defer to the maintainers.

ngli-me · 2024-12-19T13:33:59Z

> Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.
>
> Hopefully @kszlim can give this a look and help us review / get it moving too.

No problem at all, it's the holiday season! Hope everyone's taking a good break.

Appreciate the feedback though! I'll get to work on it :)

ngli-me · 2024-12-31T06:41:17Z

Sorry for the delays on this one. I made changes based on the feedback and would appreciate another look! Hopefully the new documentation is clearer.

Jefffrey

Some potential simplifications

ngli-me · 2025-01-05T06:13:02Z

Appreciate the feedback, as always. Changed some bits of the code, added some responses (and some stuff to work on).

ngli-me · 2025-01-20T05:09:57Z

Sorry, I fell ill there for a good while. I added some additional tests to hopefully cover more of the edge cases. I was trying to adapt it over for Schema as well, but I had some trouble initializing the ListArray with the inner StructArray. I tried a few different things but was unable to get `two_field` working (I think I'm misunderstanding something about the buffers, as I get `InvalidArgumentError("ListArray data should contain a single buffer only (value offsets), had 0")`). Otherwise, this should be ready for review!

        // Initialize schema
        let a = Arc::new(Field::new("a", DataType::Int64, true));
        let b = Arc::new(Field::new("b", DataType::Int64, false));
        let c = Arc::new(Field::new("c", DataType::Int64, true));

        let one = Arc::new(Field::new(
            "1",
            DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])),
            false,
        ));
        let two = Arc::new(Field::new(
            "2",
            DataType::List(Arc::new(Field::new_list_field(
                DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])),
                true,
            ))),
            false,
        ));

        let exclamation = Arc::new(Field::new(
            "!",
            DataType::Struct(Fields::from(vec![one.clone(), two.clone()])),
            false,
        ));

        let schema = Schema::new(vec![exclamation.clone()]);

        // Initialize fields
        let a_field = Int64Array::from(vec![Some(0), Some(1)]);
        let b_field = Int64Array::from(vec![Some(2), Some(3)]);
        let c_field = Int64Array::from(vec![None, Some(4)]);

        let one_field = StructArray::from(vec![
            (a.clone(), Arc::new(a_field.clone()) as ArrayRef),
            (b.clone(), Arc::new(b_field.clone()) as ArrayRef),
            (c.clone(), Arc::new(c_field.clone()) as ArrayRef),
        ]);

        let two_field_data = ArrayData::builder(DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])))
            .len(2)
            .add_child_data(Arc::new(a_field.clone()).to_data())
            .add_child_data(Arc::new(b_field.clone()).to_data())
            .add_child_data(Arc::new(c_field.clone()).to_data())
            .build()
            .unwrap();
        let two_field = ListArray::from(two_field_data);

        let exclamation_field = Arc::new(StructArray::from(vec![
            (one.clone(), Arc::new(one_field) as ArrayRef),
            (two.clone(), Arc::new(two_field) as ArrayRef),
        ]));

        // Normalize all levels
        let normalized = RecordBatch::try_new(Arc::new(schema), vec![exclamation_field])
            .expect("valid conversion")
            .normalize(".", None)
            .expect("valid normalization");

        let expected = RecordBatch::try_from_iter_with_nullable(vec![
            ("!.1.a", Arc::new(a_field.clone()) as ArrayRef, true),
            ("!.1.b", Arc::new(b_field.clone()) as ArrayRef, false),
            ("!.1.c", Arc::new(c_field.clone()) as ArrayRef, true),
            ("!.2.a", Arc::new(a_field.clone()) as ArrayRef, true),
            ("!.2.b", Arc::new(b_field.clone()) as ArrayRef, false),
            ("!.2.c", Arc::new(c_field.clone()) as ArrayRef, true),
        ])
            .expect("valid conversion");

        assert_eq!(expected, normalized);
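
The error message quoted above says that list data is expected to carry a value-offsets buffer. In the snippet, `two_field_data` is built with a struct data type and no buffers before being handed to `ListArray::from`. A toy model of that layout, using only the standard library, may make the requirement clearer. `ToyList`, `try_new`, `len`, and `row` are hypothetical names for illustration and are not part of arrow-array. The sketch assumes row `i` of a list column spans child rows `offsets[i]..offsets[i + 1]`.

```rust
/// Toy model of a variable-size list column; hypothetical, NOT arrow-array's ListArray.
/// Row i spans child rows offsets[i]..offsets[i + 1].
#[derive(Debug)]
struct ToyList<T> {
    offsets: Vec<i32>,
    values: Vec<T>,
}

impl<T> ToyList<T> {
    fn try_new(offsets: Vec<i32>, values: Vec<T>) -> Result<Self, String> {
        // Without offsets there is no way to tell where each list row starts or ends.
        if offsets.is_empty() {
            return Err("list data needs an offsets buffer, had 0".to_string());
        }
        if offsets[0] < 0 || offsets.windows(2).any(|w| w[0] > w[1]) {
            return Err("offsets must be non-negative and non-decreasing".to_string());
        }
        if *offsets.last().unwrap() as usize > values.len() {
            return Err("last offset exceeds child length".to_string());
        }
        Ok(Self { offsets, values })
    }

    /// Number of list rows (one fewer than the number of offsets).
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Child rows that belong to list row `i`.
    fn row(&self, i: usize) -> &[T] {
        &self.values[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}

fn main() {
    // Child "struct" rows modelled as (a, b, c) tuples, mirroring the fields above.
    let child = vec![(Some(0i64), 2i64, None::<i64>), (Some(1), 3, Some(4))];
    // Two list rows, each holding exactly one struct row.
    let list = ToyList::try_new(vec![0, 1, 2], child).unwrap();
    // prints: 2 rows, row 1 = [(Some(1), 3, Some(4))]
    println!("{} rows, row 1 = {:?}", list.len(), list.row(1));
}
```

In this model the struct values are a child of the list rather than the list itself, which is one reading of why data without an offsets buffer is rejected.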

Jefffrey

Some minor comments (which are also applicable for the schema code); otherwise it looks good to me

Jefffrey

Looks good to me 👍

alamb · 2025-01-21T22:01:30Z

The next release is shaping up to be quite nice -- thank you @ngli-me and @Jefffrey

Jefffrey · 2025-01-22T11:39:14Z

I'll merge this tomorrow or day after to leave some time for any last comments

Jefffrey · 2025-01-24T13:56:43Z

Thanks @ngli-me

nglime added 2 commits November 18, 2024 14:11

Added set up for the example of flattening from pyarrow.

bbd7c8b

Logic for recursive normalizer with a base normalize function, based on pola-rs.

8abcd25

ngli-me changed the title ~~Feature/record batch flatten~~ RecordBatch normalization (flattening) Nov 20, 2024

ngli-me changed the title ~~RecordBatch normalization (flattening)~~ RecordBatch normalization (flattening) Nov 20, 2024

Added recursive normalize function for Schema, and started building iterative function for `RecordBatch`. Not sure which one is better currently.

6bba7d3

github-actions bot added the arrow Changes to the arrow crate label Nov 23, 2024

Built out a bit more of the iterative normalize.

55eb953

ngli-me commented Nov 23, 2024

ngli-me marked this pull request as ready for review November 23, 2024 19:03

ngli-me marked this pull request as draft November 23, 2024 23:30

nglime added 2 commits November 23, 2024 21:03

Fixed normalize function for RecordBatch. Adjusted test case to match the example from PyArrow.

30d6294

Added tests for Schema normalization. Partial tests for RecordBatch.

0ed979d

ngli-me commented Nov 25, 2024

nglime added 2 commits November 24, 2024 21:54

Removed stray comments.

d9d08cd

Commenting out exclamation field.

d1b3260

ngli-me marked this pull request as ready for review November 25, 2024 04:02

nglime added 3 commits December 4, 2024 22:04

Merge remote-tracking branch 'upstream/main' into feature/record-batch-flatten

a12082c

Fixed test for RecordBatch.

7adda58

Formatting.

9c9c699

alamb reviewed Dec 18, 2024

kszlim reviewed Dec 19, 2024

ngli-me marked this pull request as draft December 20, 2024 12:27

nglime added 5 commits December 30, 2024 22:38

Additional documentation for normalize functions. Switched Schema normalization to iterative approach.

4422add

Forgot to push to the columns in the else case.

d0dc5a7

Adjusted the documentation to include the parameters.

1e40c98

Formatting.

3c424d1

Edited examples to not be ran as tests.

6d6b026

ngli-me marked this pull request as ready for review December 31, 2024 06:41

Jefffrey reviewed Jan 5, 2025

Adjusted based on some of the suggestions. Simplified the matching and if statements, simplified the VecDeque fields.

71380b6

ngli-me marked this pull request as draft January 8, 2025 21:54

nglime added 2 commits January 10, 2025 22:23

Additional test cases for List and FixedSizeList in Schema.

af7946b

Additional test cases for deeply nested normalization.

e97cc9c

ngli-me marked this pull request as ready for review January 20, 2025 05:10

Jefffrey reviewed Jan 20, 2025

nglime added 2 commits January 20, 2025 14:34

Suggestions from Jefffrey on the descriptions and stack initialization.

b90e8f5

Forgot parenthesis.

6a2e3ca

Jefffrey approved these changes Jan 21, 2025

Jefffrey merged commit 001239d into apache:main Jan 24, 2025
26 checks passed

ngli-me deleted the feature/record-batch-flatten branch January 24, 2025 14:58
