feat(python): add filters argument to DeltaTable.to_pandas() for filter pushdown #1316

Closed
MrPowers opened this issue Apr 29, 2023 · 16 comments · Fixed by #1349
Labels
binding/python (Issues for the Python package) · enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

@MrPowers
Contributor

Reading Delta tables into pandas DataFrames could be more intuitive.

Here's the current syntax:

import pathlib

import pyarrow.dataset as ds
from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
dataset = dt.to_pyarrow_dataset()
condition = ds.field("id1") == "id016"
df = dataset.to_table(filter=condition, columns=["id1", "id2", "v1"]).to_pandas()

Could we provide an interface that's more similar to pandas.read_parquet? Something like this:

import pathlib

from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
df = dt.filters([("id1", "==", "id016")]).to_pandas()

The list of tuples mirrors how predicates are passed to pandas.read_parquet.
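
For reference, this is roughly how that predicate format is used with pandas.read_parquet (the path here is illustrative; the filters keyword is forwarded to the PyArrow engine):

import pandas as pd

df = pd.read_parquet(
    "data/parquet/G1_1e8_1e2_0_0",
    columns=["id1", "id2", "v1"],
    filters=[("id1", "==", "id016")],
)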

I'm open to other interfaces too. I think something that hides pyarrow.dataset, requires less code, and is familiar to users of the existing pandas syntax would be good.

MrPowers added the enhancement (New feature or request) label on Apr 29, 2023
@wjones127
Collaborator

Have you tried DeltaTable.to_pandas()? I think that's exactly what you are looking for. https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pandas

@MrPowers
Contributor Author

@wjones127 - Here's the signature for that function: to_pandas(partitions=None, columns=None, filesystem=None).

I don't see any filters argument for metadata-based file skipping. Let me know if I am missing something!

@wjones127
Collaborator

I see. We can extend that method to pass down filters in the dataset.
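
A rough sketch of the idea, not the actual implementation: to_pandas already scans the table through a PyArrow dataset, so a filters argument could be converted to an expression and forwarded to that scan. The _filters_to_expression helper here is hypothetical at this point (a similar helper is discussed later in the thread):

# Simplified sketch; partition and filesystem forwarding elided for brevity
def to_pandas(self, partitions=None, columns=None, filesystem=None, filters=None):
    expression = _filters_to_expression(filters) if filters is not None else None  # hypothetical helper
    dataset = self.to_pyarrow_dataset()
    return dataset.to_table(filter=expression, columns=columns).to_pandas()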

wjones127 changed the title from "Provide a nicer pandas read interface" to "feat(python): add filters argument to DeltaTable.to_pandas() for filter pushdown" on Apr 29, 2023
wjones127 added the binding/python (Issues for the Python package) and good first issue (Good for newcomers) labels on Apr 29, 2023
@MrPowers
Contributor Author

Great! Can we expose the filters to use this syntax [("id1", "==", "id016")] instead of this ds.field("id1") == "id016"?

@MrPowers
Contributor Author

We may even want to combine the partitions and filters functionality. That's what pandas and pyarrow do from what I understand. See here.

@wjones127
Collaborator

Yes, I think we'll deprecate partitions in favor of filters, but leave both in for now. And we can use DNF filters if you want. Or maybe we can accept either?

@ognis1205
Contributor

Hi, guys. I apologize for interrupting your discussion. Although it may take me some time to fully grasp the structure of delta-rs, I would like to work on this issue. Would it be possible to assign it to me?

@wjones127
Collaborator

We would gladly accept a PR for this. 😄

Note there is a function called filters_to_expression that can handle converting the DNF filter format into PyArrow expressions. (code).
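
A minimal sketch of that conversion, assuming a PyArrow version that exposes the helper (see the version discussion further down):

import pyarrow.dataset as ds
from pyarrow.parquet import filters_to_expression  # only in newer PyArrow releases

# Both lines should yield an equivalent pyarrow.compute.Expression
expr_from_dnf = filters_to_expression([("id1", "==", "id016")])
expr_manual = ds.field("id1") == "id016"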

@ognis1205
Contributor

Thank you @wjones127 @MrPowers!

Once there's a consensus on the implementation, please feel free to assign me to this issue, and I will follow the discussion here as well. Also, I will check the source code that you linked.

@MrPowers
Contributor Author

MrPowers commented May 4, 2023

@ognis1205 - I think there is a consensus on the next steps and you can get started.

See here for more information on the DNF filter format that Will is referring to.

Here's the existing to_pandas method signature: to_pandas(partitions=None, columns=None, filesystem=None)

You can submit a PR that will update the method signature to this: to_pandas(partitions=None, columns=None, filesystem=None, filters=None)
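
A hypothetical usage sketch of that extended signature (the path and column names are illustrative):

import pathlib

from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
# The predicate is pushed down to the dataset scan instead of being applied after loading
df = dt.to_pandas(
    columns=["id1", "id2", "v1"],
    filters=[("id1", "==", "id016")],
)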

Let me know if you'd like any additional clarification!

@ognis1205
Contributor

Thanks again @MrPowers @wjones127 !!

@ognis1205
Contributor

ognis1205 commented May 9, 2023

@wjones127 @MrPowers

Hi, everyone! I have a question.
While making a PR for this issue, I encountered the following error:

https://github.com/delta-io/delta-rs/actions/runs/4926020116/jobs/8800992032?pr=1349#step:8:18

So, I am considering implementing a fallback function for pyarrow.parquet.core.filters_to_expression that is compatible with PyArrow 7.0.0. However, I came across this comment on another PR:

#1315 (comment)

Should I still implement the fallback function if the Python deltalake package will soon drop support for Python 3.7?
Would it be okay to implement a stub function that does nothing instead?

EDIT:
I checked the PyArrow API doc. Only PyArrow >= 10.0.0 supports pyarrow.parquet.core.filters_to_expression.
So I think I should implement the fallback function in either case.
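
For illustration, a minimal sketch of what such a fallback could look like, covering only the comparison and membership operators of the DNF format (this is not the implementation that landed in #1349):

import operator
from functools import reduce

import pyarrow.dataset as ds


def fallback_filters_to_expression(filters):
    """Convert DNF-style filters into a pyarrow.compute.Expression.

    Hypothetical fallback for PyArrow versions without
    pyarrow.parquet.core.filters_to_expression (i.e. < 10.0.0).
    """
    ops = {
        "=": operator.eq, "==": operator.eq, "!=": operator.ne,
        "<": operator.lt, "<=": operator.le,
        ">": operator.gt, ">=": operator.ge,
    }

    def convert(col, op, val):
        field = ds.field(col)
        if op in ops:
            return ops[op](field, val)
        if op == "in":
            return field.isin(val)
        if op == "not in":
            return ~field.isin(val)
        raise ValueError(f"Unsupported operator: {op!r}")

    # Accept both a flat AND-list and a full DNF list of lists
    if isinstance(filters[0], tuple):
        filters = [filters]

    conjunctions = [
        reduce(operator.and_, (convert(*predicate) for predicate in conjunction))
        for conjunction in filters
    ]
    return reduce(operator.or_, conjunctions)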

@wjones127
Collaborator

Yes, we will drop support for Python 3.7, but we are keeping support (for now) for PyArrow 7.

@ognis1205
Contributor

> Yes, we will drop support for Python 3.7, but we are keeping support (for now) for PyArrow 7.

Thanks for answering!

@ognis1205
Contributor

@MrPowers @wjones127
FYI:
I created a PR for this issue.

wjones127 pushed a commit that referenced this issue May 12, 2023
…er pushdown (#1349)

# Description

This pull request adds support for filtering to the `to_pandas` method
of `DeltaTable`. The new filters argument allows
users to specify filtering criteria in a format that is compatible with
`pyarrow.compute.Expression`.

- Adding the `filters` argument to `DeltaTable.to_pandas`.
- Adding the `_filters_to_expression` function to the `table` module, which
is based on [this
implementation](https://github.com/apache/arrow/blob/b9cc5df8a4f7c7fe09f40ba92a74981dee5e536a/python/pyarrow/parquet/core.py#LL155C5-L155C26),
but with improved type consistency.
- ~~Based on [the existing conventional unit
tests](https://github.com/delta-io/delta-rs/blob/b5230835bc1d1b01d59da3649033f1180232ddb7/python/tests/test_table_read.py#L392),
I did not add any additional test cases for this feature. Instead, I
tested the feature on my local development environment.~~
- Adding a unit test for the feature.

# Related Issue(s)

- closes #1316

# Documentation

[The DNF filter
format](https://github.com/apache/arrow/blob/b9cc5df8a4f7c7fe09f40ba92a74981dee5e536a/python/pyarrow/parquet/core.py#LL155C5-L155C26).

---------

Signed-off-by: Shingo OKAWA <[email protected]>
@FrankPortman

FrankPortman commented Jan 24, 2025

Sorry for bumping an ancient ticket. Maybe the online docs are outdated, but my understanding is that the filters route added here supports OR clauses (or at least that was the intent; either way it appears to support OR), while partition_filters still doesn't. Furthermore, the more flexible filters don't extend to the files or file_uris methods; the change was only for to_pandas(). See, e.g., this ticket: #1923

Is my understanding of the current limitations correct? I'm interested in contributing either code or documentation updates.
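
For what it's worth, a sketch of the OR behavior being described, using the DNF list-of-lists form (path and column names are illustrative; following the PyArrow DNF convention, the outer list is OR-ed and each inner list is AND-ed):

from deltalake import DeltaTable

dt = DeltaTable("path/to/delta_table")  # illustrative path
df = dt.to_pandas(
    filters=[
        [("id1", "==", "id016")],                   # first OR branch
        [("id1", "==", "id017"), ("v1", ">", 5)],   # second OR branch (AND-ed predicates)
    ]
)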
