feat(python): add filters argument to DeltaTable.to_pandas() for filter pushdown #1316

Closed
MrPowers opened this issue Apr 29, 2023 · 16 comments · Fixed by #1349
Labels
binding/python (Issues for the Python package) · enhancement (New feature or request) · good first issue (Good for newcomers)

Comments

@MrPowers
Contributor

Reading Delta tables into pandas DataFrames could be more intuitive.

Here's the current syntax:

import pathlib

import pyarrow.dataset as ds
from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
dataset = dt.to_pyarrow_dataset()
condition = ds.field("id1") == "id016"
df = dataset.to_table(filter=condition, columns=["id1", "id2", "v1"]).to_pandas()

Could we provide an interface that's more similar to pandas.read_parquet? Something like this:

import pathlib

from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
df = dt.filters([("id1", "==", "id016")]).to_pandas()

The list of tuples mirrors how predicates are passed to pandas.read_parquet.
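
For reference, this is roughly how that predicate format is used with pandas.read_parquet (the path here is illustrative; the filters keyword is forwarded to the PyArrow engine):

import pandas as pd

df = pd.read_parquet(
    "data/parquet/G1_1e8_1e2_0_0",
    columns=["id1", "id2", "v1"],
    filters=[("id1", "==", "id016")],
)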

I'm open to other interfaces too. I think something that hides pyarrow.dataset, requires less code, and is familiar to users of the existing pandas syntax would be good.

MrPowers added the enhancement (New feature or request) label on Apr 29, 2023
@wjones127
Collaborator

Have you tried DeltaTable.to_pandas()? I think that's exactly what you are looking for. https://delta-io.github.io/delta-rs/python/api_reference.html#deltalake.table.DeltaTable.to_pandas

@MrPowers
Contributor Author

@wjones127 - Here's the signature for that function: to_pandas(partitions=None, columns=None, filesystem=None).

I don't see any filters argument for metadata-based file skipping. Let me know if I am missing something!

@wjones127
Collaborator

I see. We can extend that method to pass down filters in the dataset.
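
A rough sketch of the idea, not the actual implementation: to_pandas already scans the table through a PyArrow dataset, so a filters argument could be converted to an expression and forwarded to that scan. The _filters_to_expression helper here is hypothetical at this point (a similar helper is discussed later in the thread):

# Simplified sketch; partition and filesystem forwarding elided for brevity
def to_pandas(self, partitions=None, columns=None, filesystem=None, filters=None):
    expression = _filters_to_expression(filters) if filters is not None else None  # hypothetical helper
    dataset = self.to_pyarrow_dataset()
    return dataset.to_table(filter=expression, columns=columns).to_pandas()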

wjones127 changed the title from "Provide a nicer pandas read interface" to "feat(python): add filters argument to DeltaTable.to_pandas() for filter pushdown" on Apr 29, 2023
wjones127 added the binding/python (Issues for the Python package) and good first issue (Good for newcomers) labels on Apr 29, 2023
@MrPowers
Contributor Author

Great! Can we expose the filters to use this syntax [("id1", "==", "id016")] instead of this ds.field("id1") == "id016"?

@MrPowers
Contributor Author

We may even want to combine the partitions and filters functionality. That's what pandas and pyarrow do from what I understand. See here.

@wjones127
Collaborator

Yes, I think we'll deprecate partitions in favor of filters, but leave both in for now. And we can use DNF filters if you want. Or maybe we can accept either?

@ognis1205
Contributor

Hi, guys. I apologize for interrupting your discussion. Although it may take me some time to fully grasp the structure of delta-rs, I would like to work on this issue. Would it be possible to assign it to me?

@wjones127
Collaborator

We would gladly accept a PR for this. 😄

Note there is a function called filters_to_expression that can handle converting the DNF filter format into PyArrow expressions. (code).
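
A minimal sketch of that conversion, assuming a PyArrow version that exposes the helper (see the version discussion further down):

import pyarrow.dataset as ds
from pyarrow.parquet import filters_to_expression  # only in newer PyArrow releases

# Both lines should yield an equivalent pyarrow.compute.Expression
expr_from_dnf = filters_to_expression([("id1", "==", "id016")])
expr_manual = ds.field("id1") == "id016"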

@ognis1205
Contributor

Thank you @wjones127 @MrPowers!

Once there's a consensus on the implementation, please feel free to assign me to this issue, and I will follow the discussion here as well. Also, I will check the source code that you linked.

@MrPowers
Contributor Author

MrPowers commented May 4, 2023

@ognis1205 - I think there is a consensus on the next steps and you can get started.

See here for more information on the DNF filter format that Will is referring to.

Here's the existing to_pandas method signature: to_pandas(partitions=None, columns=None, filesystem=None)

You can submit a PR that will update the method signature to this: to_pandas(partitions=None, columns=None, filesystem=None, filters=None)
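
A hypothetical usage sketch of that extended signature (the path and column names are illustrative):

import pathlib

from deltalake import DeltaTable

dt = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
# The predicate is pushed down to the dataset scan instead of being applied after loading
df = dt.to_pandas(
    columns=["id1", "id2", "v1"],
    filters=[("id1", "==", "id016")],
)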

Let me know if you'd like any additional clarification!

@ognis1205
Contributor

Thanks again @MrPowers @wjones127 !!

@ognis1205
Contributor

ognis1205 commented May 9, 2023

@wjones127 @MrPowers

Hi, everyone! I have a question.
While making a PR for this issue, I encountered the following error:

https://github.com/delta-io/delta-rs/actions/runs/4926020116/jobs/8800992032?pr=1349#step:8:18

So, I am considering implementing a fallback function for pyarrow.parquet.core.filters_to_expression that is compatible with PyArrow 7.0.0. However, I came across this comment on another PR:

#1315 (comment)

Should I still implement the fallback function if the Python deltalake package will soon drop support for Python 3.7?
Would it be okay to implement a stub function that does nothing instead?

EDIT:
I checked the PyArrow API doc. Only PyArrow >= 10.0.0 supports pyarrow.parquet.core.filters_to_expression.
So I think I should implement the fallback function in either case.
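
For illustration, a minimal sketch of what such a fallback could look like, covering only the comparison and membership operators of the DNF format (this is not the implementation that landed in #1349):

import operator
from functools import reduce

import pyarrow.dataset as ds


def fallback_filters_to_expression(filters):
    """Convert DNF-style filters into a pyarrow.compute.Expression.

    Hypothetical fallback for PyArrow versions without
    pyarrow.parquet.core.filters_to_expression (i.e. < 10.0.0).
    """
    ops = {
        "=": operator.eq, "==": operator.eq, "!=": operator.ne,
        "<": operator.lt, "<=": operator.le,
        ">": operator.gt, ">=": operator.ge,
    }

    def convert(col, op, val):
        field = ds.field(col)
        if op in ops:
            return ops[op](field, val)
        if op == "in":
            return field.isin(val)
        if op == "not in":
            return ~field.isin(val)
        raise ValueError(f"Unsupported operator: {op!r}")

    # Accept both a flat AND-list and a full DNF list of lists
    if isinstance(filters[0], tuple):
        filters = [filters]

    conjunctions = [
        reduce(operator.and_, (convert(*predicate) for predicate in conjunction))
        for conjunction in filters
    ]
    return reduce(operator.or_, conjunctions)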

@wjones127
Collaborator

Yes, we will drop support for Python 3.7, but we are keeping support (for now) for PyArrow 7.

@ognis1205
Contributor

> Yes, we will drop support for Python 3.7, but we are keeping support (for now) for PyArrow 7.

Thanks for answering!

@ognis1205
Contributor

@MrPowers @wjones127
FYI:
I created a PR for this issue.

wjones127 pushed a commit that referenced this issue May 12, 2023
…er pushdown (#1349)

# Description

This pull request adds support for filtering to the `to_pandas` method
of `DeltaTable`. The new filters argument allows
users to specify filtering criteria in a format that is compatible with
`pyarrow.compute.Expression`.

- Adding the `filters` argument to `DeltaTable.to_pandas`.
- Adding the `_filters_to_expression` function to the `table` module, which
is based on [this
implementation](https://github.com/apache/arrow/blob/b9cc5df8a4f7c7fe09f40ba92a74981dee5e536a/python/pyarrow/parquet/core.py#LL155C5-L155C26),
but with improved type consistency.
- ~~Based on [the existing conventional unit
tests](https://github.com/delta-io/delta-rs/blob/b5230835bc1d1b01d59da3649033f1180232ddb7/python/tests/test_table_read.py#L392),
I did not add any additional test cases for this feature. Instead, I
tested the feature on my local development environment.~~
- Adding a unit test for the feature.

# Related Issue(s)

- closes #1316

# Documentation

[The DNF filter
format](https://github.com/apache/arrow/blob/b9cc5df8a4f7c7fe09f40ba92a74981dee5e536a/python/pyarrow/parquet/core.py#LL155C5-L155C26).

---------

Signed-off-by: Shingo OKAWA <[email protected]>
@FrankPortman

FrankPortman commented Jan 24, 2025

Sorry for bumping an ancient ticket. Maybe the online docs are outdated, but my understanding is that the filters route added here supports OR clauses (or at least that was the intent; either way it appears to support OR), while partition_filters still doesn't. Furthermore, the more flexible filters don't extend to the files or file_uris methods; the change was only for to_pandas(). See, e.g., this ticket: #1923

Is my understanding of the current limitations correct? I'm interested in contributing either code or documentation updates.
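
For what it's worth, a sketch of the OR behavior being described, using the DNF list-of-lists form (path and column names are illustrative; following the PyArrow DNF convention, the outer list is OR-ed and each inner list is AND-ed):

from deltalake import DeltaTable

dt = DeltaTable("path/to/delta_table")  # illustrative path
df = dt.to_pandas(
    filters=[
        [("id1", "==", "id016")],                   # first OR branch
        [("id1", "==", "id017"), ("v1", ">", 5)],   # second OR branch (AND-ed predicates)
    ]
)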
