
feat: add DataFrame and LazyFrame explode method #1542

Merged
merged 17 commits into main on Dec 22, 2024

Conversation

FBruzzesi
Member

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

@github-actions github-actions bot added the enhancement New feature or request label Dec 9, 2024
@@ -743,3 +744,78 @@ def unpivot(
)
# TODO(Unassigned): Even with promote_options="permissive", pyarrow does not
# upcast numeric to non-numeric (e.g. string) datatypes

def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
Member Author

pyarrow has two paths:

  • if nulls or empty lists are not present, then it is enough to:
    1. make sure element counts are same
    2. explode each array individually
  • if nulls or empty lists are present, then they are ignored by pc.list_parent_indices and pc.list_flatten, which is a problem. This implementation falls back to a Python list both to flatten the array(s) and to create the corresponding indices.

After flattening, a new table is created by take-ing the indices of the non-flattened arrays and the flattened arrays.

@@ -937,3 +937,52 @@ def unpivot(
value_name=value_name if value_name is not None else "value",
)
)

def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
Member Author

If a single column is to be exploded, we use the native pandas method. For multiple columns, the strategy is to explode the first column together with the rest of the dataframe, explode the other series individually, and finally concatenate them back, sorting by the original column name order.
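That strategy can be sketched roughly as follows. This is a simplified illustration, not the PR's exact code, and the sample data is made up:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [[1, 2], [3]], "b": [[10, 20], [30]], "c": ["x", "y"]}
)
to_explode = ["a", "b"]
first, *rest = to_explode

# Explode the first list column together with every other non-exploding column...
base = df[[c for c in df.columns if c not in rest]].explode(first)
# ...explode the remaining list columns individually...
others = [df[col].explode() for col in rest]
# ...then concatenate and restore the original column order.
out = pd.concat(
    [part.reset_index(drop=True) for part in [base, *others]], axis=1
)[df.columns.tolist()]
```

Resetting the indices before concatenating is what re-aligns the individually exploded series with the exploded base frame.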

narwhals/dataframe.py (outdated review thread, resolved)

def explode_null_array(array: pa.ChunkedArray) -> pa.ChunkedArray:
    exploded_values = []  # type: ignore[var-annotated]
    for lst_element in array.to_pylist():
Member

i might be missing something but this looks potentially very expensive

Member Author

It definitely is 😐 happy to raise for nullable cases for now and try to figure out a native alternative

Member Author

@FBruzzesi FBruzzesi Dec 17, 2024

I made some improvements πŸ™ŒπŸΌ now there is only one .to_pylist call to create the indices, and no loop in the case of nulls or empty lists

@MarcoGorelli
Member

thanks - tbh i'm still not too keen on having to_pylist, should we just not support pyarrow and raise this as a feature request with them?

@FBruzzesi
Member Author

thanks - tbh i'm still not too keen on having to_pylist, should we just not support pyarrow and raise this as a feature request with them?

Let me take a detour in the rabbit hole before a final decision πŸ˜‡

@FBruzzesi
Member Author

FBruzzesi commented Dec 17, 2024

@MarcoGorelli thoughts on the following?

The idea being: if consecutive parent indices are the same, their diff will be zero, so the counter for that index should increase; otherwise it should not. By how much, you ask? By the cumulative count of the matching parent indices, starting from 0 (hence the -1 after the cumulative sum).

I will definitely add a comment if we decide to move forward with this.

This is the else block:

import numpy as np
parent_indices = pc.list_parent_indices(native_frame[to_explode[0]])
diff = pc.pairwise_diff(parent_indices.combine_chunks())
is_change_point = pc.or_kleene(pc.equal(diff, 0), pc.is_null(diff))
indices = pc.subtract(
    pc.add(
        parent_indices,
        pc.cumulative_sum(is_change_point.cast(pa.int32()))
    ), 1
)

exploded_size = pc.sum(pc.max_element_wise(counts, 1, skip_nulls=True)).as_py()
valid_mask = pc.is_in(pa.array(np.arange(exploded_size), type=indices.type), value_set=indices)

def flatten_func(array: pa.ChunkedArray) -> pa.ChunkedArray:
    dtype = array.type.value_type
    return pc.replace_with_mask(
        pa.array([None] * exploded_size, type=dtype),
        valid_mask,
        pc.list_flatten(array).combine_chunks(),
    )

Edit: as the initial explanation above was a stream of consciousness, I will try to rephrase it properly, which could actually end up documenting the "algorithm".

The issue

Both pc.list_parent_indices and pc.list_flatten ignore list elements that are null or [] (an empty list).
Therefore the previous block (the if fast_path case) cannot be used, since its output would not match what polars does.

Polars' explode results in nulls for both original nulls and empty lists.

The solution

The native solution is not yet present in the code; the proposal is the code block above in this comment.

The idea goes as follows:

  1. Create the indices corresponding to the elements resulting from pc.list_flatten

  2. Consider the example we have in the test:

     data = [[3, None], None, [42], []]
     array = pa.chunked_array([data])
     pc.list_flatten(array)  # [3, null, 42]
     pc.list_parent_indices(array)  # [0, 0, 2]
  3. Now, as stated, nulls and empty lists are ignored, therefore instead of a length-5 output we get length 3

  4. These 3 elements should end up at indices [0, 1, 3], with indices 2 and 4 filled with nulls

  5. To obtain these indices ([0, 1, 3]), the idea is to take the pc.list_parent_indices output and increase it by 1 each time consecutive parent indices do not change (meaning the elements originally belong to the same list element and will therefore be sequential in the result); that is obtained with diff, is_change_point, and the addition of the parent indices with the cumulative sum

  6. To construct the final output:

    • the final size is the sum of elements, counting nulls and empty lists as one each (exploded_size)
    • the mask marking where the pc.list_flatten elements should go is where the index appears in the constructed indices
    • flatten_func starts by creating an array of nulls, and then replaces the values at the right indices

Edit pt.2

It turns out that the solution is still missing how to compute the indices for the .take method on the rest of the dataframe, and as of now I don't have a better option than:

indices = pa.array(
    [
        i
        for i, count in enumerate(filled_counts.to_pylist())
        for _ in range(count)
    ]
)

@MarcoGorelli
Member

nice - imma have to spend some time understanding this πŸ˜„ 😳

@FBruzzesi
Member Author

nice - imma have to spend some time understanding this πŸ˜„ 😳

I deduce that my explanation was garbage. Let me explode the previous comment

@MarcoGorelli
Copy link
Member

Not at all, I just hadn't spent enough time on it, will take a look and make sense of it!

Member

@MarcoGorelli MarcoGorelli left a comment

OK thanks for explaining - sure, happy to try this out, maybe I could write a hypothesis test to stress test it

would also be open to merging for now without pyarrow (as the rest is quite uncontroversial) and then adding pyarrow later?

narwhals/_pandas_like/dataframe.py (outdated review thread, resolved)
Comment on lines 799 to 808
if fast_path:
    indices = pc.list_parent_indices(native_frame[to_explode[0]])
    flatten_func = pc.list_flatten
else:
    msg = (
        "`DataFrame.explode` is not supported for pyarrow backend and column "
        "containing nulls or empty list elements"
    )
    raise NotImplementedError(msg)
Member

@MarcoGorelli MarcoGorelli Dec 21, 2024

do we want a value-dependent check determining whether an error is raised? would you be opposed to raising unconditionally for pyarrow and opening a feature request on their side?

Member Author

@FBruzzesi FBruzzesi Dec 21, 2024

Oh, I see! I am happy to keep pyarrow on a dedicated PR. I will adjust here

Edit: feat/pyarrow-explode branch

Member Author

Maybe I should also check how pandas does it with pyarrow backend?

Member

@MarcoGorelli MarcoGorelli left a comment

thanks @FBruzzesi ! feel free to merge when ready

@FBruzzesi FBruzzesi merged commit e112a99 into main Dec 22, 2024
24 checks passed
@FBruzzesi FBruzzesi deleted the feat/explode-method branch December 22, 2024 11:31
@FBruzzesi FBruzzesi mentioned this pull request Dec 22, 2024
10 tasks
Labels
enhancement New feature or request