Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordBatch normalization (flattening) #6758

Merged
merged 21 commits into from
Jan 24, 2025

Conversation

ngli-me
Copy link
Contributor

@ngli-me ngli-me commented Nov 20, 2024

Which issue does this PR close?

Closes #6369.

Rationale for this change

Adds normalization (flattening) for RecordBatch, with normalization via Schema. Based on pandas/pola-rs.

What changes are included in this PR?

Are there any user-facing changes?

@ngli-me ngli-me changed the title Feature/record batch flatten RecordBatch normalization (flattening) Nov 20, 2024
@ngli-me ngli-me changed the title RecordBatch normalization (flattening) RecordBatch normalization (flattening) Nov 20, 2024
… iterative function for `RecordBatch`. Not sure which one is better currently.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 23, 2024
Copy link
Contributor Author

@ngli-me ngli-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some questions regarding the implementation of this, since the one example from PyArrow doesn't seem to clarify on the edge cases here. Normalizing the Schema seems fairly straight forward to me, I'm just not sure on

  1. Whether the iterative or recursive approach is better (or something I missed)
  2. If DataType::Struct is the only DataType that requires flattening. To me, it looks like that's the only one that can contained nested Fields.

(I'm also not sure if I'm missing something with unwrapping like a List<Struct>)

Any feedback/help would be appreciated!

arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-schema/src/schema.rs Outdated Show resolved Hide resolved
@ngli-me ngli-me marked this pull request as ready for review November 23, 2024 19:03
@ngli-me ngli-me marked this pull request as draft November 23, 2024 23:30
@ngli-me ngli-me marked this pull request as ready for review November 25, 2024 04:02
@alamb
Copy link
Contributor

alamb commented Dec 18, 2024

@kszlim can you please help review this PR ? You requested the feature and we are currently quite short on review capacity in arrow-rs

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.

Hopefully @kszlim can give this a look and help us review / get it moving too.

arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-schema/src/schema.rs Outdated Show resolved Hide resolved
@kszlim
Copy link
Contributor

kszlim commented Dec 19, 2024

I'll take a look, though please feel free to disregard anything I say and especially defer to the maintainers.

arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-schema/src/schema.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
@ngli-me
Copy link
Contributor Author

ngli-me commented Dec 19, 2024

Thank you for this contribution @ngli-me and I apologize for the delay in reviewing.

Hopefully @kszlim can give this a look and help us review / get it moving too.

No problem at all, it's the holiday season! Hope everyone's taking a good break.

Appreciate the feedback though! I'll get to work on it :)

@ngli-me ngli-me marked this pull request as draft December 20, 2024 12:27
@ngli-me
Copy link
Contributor Author

ngli-me commented Dec 31, 2024

Sorry for the delays on this one, made changes based on the feedback, would appreciate another look! Hopefully the new documentation is more clear.

@ngli-me ngli-me marked this pull request as ready for review December 31, 2024 06:41
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some potential simplifications

arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Show resolved Hide resolved
…d if statements, simplified the VecDeque fields.
@ngli-me
Copy link
Contributor Author

ngli-me commented Jan 5, 2025

Appreciate the feedback, as always. Changed some bits of the code, added some responses (and some stuff to work on).

@ngli-me ngli-me marked this pull request as draft January 8, 2025 21:54
@ngli-me
Copy link
Contributor Author

ngli-me commented Jan 20, 2025

Sorry, fell ill there for a good while. Added some additional tests to hopefully cover some more of the edges. I was trying to adapt it over for Schema as well, but I had some trouble initializing the ListArray with the inner StructArray, tried a few different things but was unable to get two_field working (I think I'm misunderstanding something with the buffer, as I get "InvalidArgumentError("ListArray data should contain a single buffer only (value offsets), had 0")"). Otherwise, should be ready for review!

        // Initialize schema
        let a = Arc::new(Field::new("a", DataType::Int64, true));
        let b = Arc::new(Field::new("b", DataType::Int64, false));
        let c = Arc::new(Field::new("c", DataType::Int64, true));

        let one = Arc::new(Field::new(
            "1",
            DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])),
            false,
        ));
        let two = Arc::new(Field::new(
            "2",
            DataType::List(Arc::new(Field::new_list_field(
                DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])),
                true,
            ))),
            false,
        ));

        let exclamation = Arc::new(Field::new(
            "!",
            DataType::Struct(Fields::from(vec![one.clone(), two.clone()])),
            false,
        ));

        let schema = Schema::new(vec![exclamation.clone()]);

        // Initialize fields
        let a_field = Int64Array::from(vec![Some(0), Some(1)]);
        let b_field = Int64Array::from(vec![Some(2), Some(3)]);
        let c_field = Int64Array::from(vec![None, Some(4)]);

        let one_field = StructArray::from(vec![
            (a.clone(), Arc::new(a_field.clone()) as ArrayRef),
            (b.clone(), Arc::new(b_field.clone()) as ArrayRef),
            (c.clone(), Arc::new(c_field.clone()) as ArrayRef),
        ]);

        let two_field_data = ArrayData::builder(DataType::Struct(Fields::from(vec![a.clone(), b.clone(), c.clone()])))
            .len(2)
            .add_child_data(Arc::new(a_field.clone()).to_data())
            .add_child_data(Arc::new(b_field.clone()).to_data())
            .add_child_data(Arc::new(c_field.clone()).to_data())
            .build()
            .unwrap();
        let two_field = ListArray::from(two_field_data);

        let exclamation_field = Arc::new(StructArray::from(vec![
            (one.clone(), Arc::new(one_field) as ArrayRef),
            (two.clone(), Arc::new(two_field) as ArrayRef),
        ]));

        // Normalize all levels
        let normalized = RecordBatch::try_new(Arc::new(schema), vec![exclamation_field])
            .expect("valid conversion")
            .normalize(".", None)
            .expect("valid normalization");

        let expected = RecordBatch::try_from_iter_with_nullable(vec![
            ("!.1.a", Arc::new(a_field.clone()) as ArrayRef, true),
            ("!.1.b", Arc::new(b_field.clone()) as ArrayRef, false),
            ("!.1.c", Arc::new(c_field.clone()) as ArrayRef, true),
            ("!.2.a", Arc::new(a_field.clone()) as ArrayRef, true),
            ("!.2.b", Arc::new(b_field.clone()) as ArrayRef, false),
            ("!.2.c", Arc::new(c_field.clone()) as ArrayRef, true),
        ])
            .expect("valid conversion");

        assert_eq!(expected, normalized);

@ngli-me ngli-me marked this pull request as ready for review January 20, 2025 05:10
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments (which are also applicable for the schema code); otherwise it looks good to me

arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
arrow-array/src/record_batch.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍

@alamb
Copy link
Contributor

alamb commented Jan 21, 2025

The next release is shaping up to be quite nice :bowtie: -- thank you @ngli-me and @Jefffrey

@Jefffrey
Copy link
Contributor

I'll merge this tomorrow or day after to leave some time for any last comments

@Jefffrey Jefffrey merged commit 001239d into apache:main Jan 24, 2025
26 checks passed
@Jefffrey
Copy link
Contributor

Thanks @ngli-me

@ngli-me ngli-me deleted the feature/record-batch-flatten branch January 24, 2025 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support RecordBatch.flatten
4 participants