feat: add `DataFrame` and `LazyFrame` `explode` method #1542
Conversation
narwhals/_arrow/dataframe.py
Outdated
```diff
@@ -743,3 +744,78 @@ def unpivot(
         )
         # TODO(Unassigned): Even with promote_options="permissive", pyarrow does not
         # upcast numeric to non-numeric (e.g. string) datatypes
+
+    def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
pyarrow has two paths:

- if nulls or empty lists are not present, then it is enough to:
  - make sure the element counts are the same
  - explode each array individually
- if nulls or empty lists are present, then these are ignored by `pc.list_parent_indices` and `pc.list_flatten`, which is a problem. This implementation falls back to a python list both to flatten the array(s) and to create the corresponding indices.

After flattening, a new table is created by `take`-ing the indices on the non-flattened arrays and combining them with the flattened arrays.
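As a rough sketch of the fast path described above (illustrative only; the table and names here are mine, not the PR's, and `a` is a list column with no nulls or empty lists):

```python
import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({"a": [[1, 2], [3]], "b": ["x", "y"]})

# Row index of each flattened value; with multiple list columns the PR
# first checks that the per-row element counts match.
indices = pc.list_parent_indices(tbl["a"])  # [0, 0, 1]
exploded = pa.table(
    {
        "a": pc.list_flatten(tbl["a"]),  # [1, 2, 3]
        "b": tbl["b"].take(indices),     # ["x", "x", "y"]
    }
)
```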
```diff
@@ -937,3 +937,52 @@ def unpivot(
                 value_name=value_name if value_name is not None else "value",
             )
         )
+
+    def explode(self: Self, columns: str | Sequence[str], *more_columns: str) -> Self:
```
If a single column is to be exploded, we use the pandas native method. If there are multiple columns, the strategy is to explode the first column together with the rest of the dataframe, explode the other series individually, and finally concatenate them back, sorting by the original column name order.
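A rough sketch of that strategy (my own illustration under the assumption of equal list lengths per row, not the PR's exact code):

```python
import pandas as pd

def explode_multi(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    first, *rest = columns
    # Explode the first list column together with all non-exploding columns...
    exploded = df[[c for c in df.columns if c not in rest]].explode(first)
    # ...explode the remaining list columns individually...
    others = [df[c].explode().reset_index(drop=True) for c in rest]
    # ...then concatenate and restore the original column order.
    out = pd.concat([exploded.reset_index(drop=True), *others], axis=1)
    return out[list(df.columns)]

explode_multi(
    pd.DataFrame({"a": [[1, 2], [3]], "b": [[4, 5], [6]], "c": ["x", "y"]}),
    ["a", "b"],
)
```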
narwhals/_arrow/dataframe.py
Outdated
```python
def explode_null_array(array: pa.ChunkedArray) -> pa.ChunkedArray:
    exploded_values = []  # type: ignore[var-annotated]
    for lst_element in array.to_pylist():
```
I might be missing something, but this looks potentially very expensive
It definitely is! Happy to raise for nullable cases for now and try to figure out a native alternative
I did some improvements: now there is only one `.to_pylist` call to create the indices, and no loop in the case of nulls or empty lists
Thanks - tbh I'm still not too keen on having `to_pylist`. Should we just not support pyarrow here and raise this as a feature request with them?
Let me take a detour down the rabbit hole before a final decision
@MarcoGorelli thoughts on the following? The idea being: if consecutive parent indices are the same, their diff will be zero, therefore the counter for such an index should increase, otherwise it should not. By how much, you ask? By the cumulative count of the matching parent indices, starting from 0 (hence the final `- 1`). I will definitely add a comment if we decide to move forward with this. This is the else block:

```python
import numpy as np

parent_indices = pc.list_parent_indices(native_frame[to_explode[0]])
diff = pc.pairwise_diff(parent_indices.combine_chunks())
is_change_point = pc.or_kleene(pc.equal(diff, 0), pc.is_null(diff))
indices = pc.subtract(
    pc.add(
        parent_indices,
        pc.cumulative_sum(is_change_point.cast(pa.int32())),
    ),
    1,
)
exploded_size = pc.sum(pc.max_element_wise(counts, 1, skip_nulls=True)).as_py()
valid_mask = pc.is_in(pa.array(np.arange(exploded_size), type=indices.type), indices)

def flatten_func(array: pa.ChunkedArray) -> pa.ChunkedArray:
    dtype = array.type.value_type
    return pc.replace_with_mask(
        pa.array([None] * exploded_size, type=dtype),
        valid_mask,
        pc.list_flatten(array).combine_chunks(),
    )
```

(Here `counts` is not shown in the snippet; it is the per-row element count of the column to explode, e.g. via `pc.list_value_length`.)

Edit: as the initial explanation above was a flow of consciousness, I will try to rephrase it in a proper way, which could actually end up documenting the "algorithm".

**The issue**

Both Polars and pandas `explode` result in a null row for null and empty list elements, whereas pyarrow's `pc.list_parent_indices` and `pc.list_flatten` simply skip them.

**The solution**

So far the native solution is not present in the code; the proposal is the code block above in this comment. The idea goes as follows:

- compute the parent indices of the column to explode: since null and empty lists are skipped, consecutive equal values flag rows that explode to more than one element;
- take the pairwise diff of the parent indices: a diff of zero (or a null diff, for the very first element) marks a position whose running offset must grow;
- add the cumulative sum of those flags to the parent indices and subtract 1 to obtain the positions that the flattened values occupy in the exploded result;
- the exploded size is the sum of the element counts, with null and empty lists counted as 1;
- any position not hit by the computed indices corresponds to a null or empty list: `valid_mask` marks the hit positions;
- each column to explode is flattened natively and its values are scattered into a null-initialized array via `pc.replace_with_mask`.
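To make the arithmetic concrete, here is a hypothetical walk-through on a tiny input (the values in the comments follow from the snippet above; `counts` is again assumed to be the per-row list lengths):

```python
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

col = pa.chunked_array([[[1, 2], None, [3], []]])
counts = pc.list_value_length(col)                             # [2, null, 1, 0]

parent_indices = pc.list_parent_indices(col).combine_chunks()  # [0, 0, 2]
diff = pc.pairwise_diff(parent_indices)                        # [null, 0, 2]
is_change_point = pc.or_kleene(pc.equal(diff, 0), pc.is_null(diff))  # [T, T, F]
indices = pc.subtract(
    pc.add(parent_indices, pc.cumulative_sum(is_change_point.cast(pa.int32()))), 1
)                                                              # [0, 1, 3]
exploded_size = pc.sum(pc.max_element_wise(counts, 1, skip_nulls=True)).as_py()  # 5
valid_mask = pc.is_in(
    pa.array(np.arange(exploded_size), type=indices.type), value_set=indices
)                                                              # [T, T, F, T, F]
exploded = pc.replace_with_mask(
    pa.array([None] * exploded_size, type=col.type.value_type),
    valid_mask,
    pc.list_flatten(col).combine_chunks(),
)                                                              # [1, 2, null, 3, null]
```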
Edit pt. 2: it turns out that the solution is still missing how to compute the indices used to `take` the non-exploded columns:

```python
indices = pa.array(
    [
        i
        for i, count in enumerate(filled_counts.to_pylist())
        for _ in range(count)
    ]
)
```
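For instance, continuing the hypothetical example above, `filled_counts` (presumably the counts with null/empty lists bumped to 1, e.g. `pc.max_element_wise(counts, 1, skip_nulls=True)`) would give:

```python
filled_counts = pc.max_element_wise(counts, 1, skip_nulls=True)  # [2, 1, 1, 1]
indices = pa.array(
    [i for i, count in enumerate(filled_counts.to_pylist()) for _ in range(count)]
)  # [0, 0, 1, 2, 3] -- repeats each source row index `count` times for `take`
```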
nice - imma have to spend some time understanding this!
I deduce that my explanation was garbage. Let me explode the previous comment
Not at all, I just hadn't spent enough time on it, will take a look and make sense of it!
OK, thanks for explaining - sure, happy to try this out; maybe I could write a hypothesis test to stress-test it.
Would also be open to merging for now without pyarrow (as the rest is quite uncontroversial) and then adding pyarrow later?
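For what it's worth, such a hypothesis test could compare against a plain-Python reference; a sketch, where `explode_column` stands in for whichever pyarrow implementation is under test:

```python
import pyarrow as pa
from hypothesis import given
from hypothesis import strategies as st

rows_strategy = st.lists(
    st.one_of(st.none(), st.lists(st.integers(-10, 10), max_size=3)),
    max_size=20,
)

def reference_explode(rows: list) -> list:
    # Null and empty lists each explode to a single null row.
    out: list = []
    for row in rows:
        out.extend(row if row else [None])
    return out

@given(rows_strategy)
def test_explode_matches_reference(rows: list) -> None:
    col = pa.chunked_array([rows], type=pa.list_(pa.int64()))
    assert explode_column(col).to_pylist() == reference_explode(rows)
```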
narwhals/_arrow/dataframe.py
Outdated
```python
if fast_path:
    indices = pc.list_parent_indices(native_frame[to_explode[0]])
    flatten_func = pc.list_flatten

else:
    msg = (
        "`DataFrame.explode` is not supported for pyarrow backend and column "
        "containing nulls or empty list elements"
    )
    raise NotImplementedError(msg)
```
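The `fast_path` flag isn't shown in this hunk; a hypothetical way to compute it (an assumption on my part, not necessarily the PR's code) is to check that no list element is null or empty:

```python
counts = pc.list_value_length(native_frame[to_explode[0]])
# Fast path only when every list element is non-null and non-empty; the PR
# also requires element counts to match across all columns being exploded.
fast_path = counts.null_count == 0 and not pc.any(pc.equal(counts, 0)).as_py()
```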
Do we want value-dependent behaviour determining whether an error is raised? Would you be opposed to raising unconditionally for pyarrow and opening a feature request on their side?
Oh, I see! I am happy to keep pyarrow in a dedicated PR; I will adjust here.
Edit: feat/pyarrow-explode branch
Maybe I should also check how pandas does it with the pyarrow backend?
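For reference, a quick way to check that (assuming a recent pandas with `ArrowDtype` support in `explode`):

```python
import pandas as pd
import pyarrow as pa

s = pd.Series([[1, 2], None, []], dtype=pd.ArrowDtype(pa.list_(pa.int64())))
print(s.explode())  # 1, 2, <NA>, <NA> -- nulls and empty lists become single null rows
```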
Thanks @FBruzzesi! Feel free to merge when ready.