
[Python]: Support PyCapsule Interface Objects as input in more places #43410

Closed
kylebarron opened this issue Jul 24, 2024 · 4 comments

@kylebarron
Contributor

Describe the enhancement requested

Now that the PyCapsule Interface is starting to gain more traction (#39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.

Do people have opinions on which functions should or should not check for these objects? I'd argue that file format writers should check for them, because it's only a couple of lines of code, and the input stream will be fully iterated over regardless. E.g. looking at the Parquet writer: the high-level API doesn't currently accept a RecordBatchReader either, so support for both can come at the same time.

from dataclasses import dataclass
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq


# Minimal wrapper that only exposes the Arrow PyCapsule stream protocol.
@dataclass
class ArrowCStream:
    obj: Any

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)


table = pa.table({"a": [1, 2, 3, 4]})
pq.write_table(table, "test.parquet")  # works

reader = pa.RecordBatchReader.from_stream(table)
pq.write_table(reader, "test.parquet")  # fails
pq.write_table(ArrowCStream(table), "test.parquet")  # fails

I'd argue that the writer should be generalized to accept any object with an __arrow_c_stream__ dunder, and that it should ensure the stream is not materialized as a table.
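For illustration, a wrapper along these lines shows what that could look like, reusing the ArrowCStream wrapper and table from the snippet above (a sketch only; write_stream is a hypothetical name, not an existing pyarrow API):

import pyarrow as pa
import pyarrow.parquet as pq


def write_stream(data, where):
    # Hypothetical helper: accept any object exposing __arrow_c_stream__
    # (a Table, a RecordBatchReader, or the ArrowCStream wrapper above)
    # and write it to Parquet batch by batch, never materializing a Table.
    if not isinstance(data, pa.RecordBatchReader):
        data = pa.RecordBatchReader.from_stream(data)
    with pq.ParquetWriter(where, data.schema) as writer:
        for batch in data:
            writer.write_batch(batch)


write_stream(ArrowCStream(table), "test.parquet")  # works with this wrapper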

Component(s)

Python

@jorisvandenbossche
Member

Specifically for pq.write_table(), this might be a bit trickier (without first consuming the stream), because it currently uses parquet::arrow::FileWriter::WriteTable, which explicitly requires a table as input. The FileWriter interface supports writing record batches as well, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called write_table?)

But in general, certainly +1 on more widely supporting the interface.

Some other possible areas:

  • The dataset API for writing. In this case, pyarrow.dataset.write_dataset already accepts a RecordBatchReader, so this should be straightforward to extend.
  • Compute functions from pyarrow.compute? Those could certainly accept objects with __arrow_c_array__, and in theory also __arrow_c_stream__, but they would fully consume the stream and return a materialized result, so I'm not sure whether that is what users would expect (although if you know those functions, it is kind of expected, so maybe this just requires good documentation).
  • Many of the methods on the Array/RecordBatch/Table classes accept similar objects (e.g. arr.take(..)). I'm not sure whether we want to make those work with interface objects as well. What we currently accept as input is also a bit inconsistent (only strictly a pyarrow Array, or also a numpy array, a list, anything array-like, or any sequence or collection?), so if we harmonized that with some helper, we could at the same time easily add support for any Arrow-array-like object (see the sketch below).
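For the last point, such a helper might look roughly like this (a sketch only; _as_arrow_data is a hypothetical name, and Array._import_from_c_capsule is a private pyarrow helper used here purely for illustration):

import pyarrow as pa


def _as_arrow_data(obj):
    # Hypothetical input-normalization helper.
    if isinstance(obj, (pa.Array, pa.ChunkedArray, pa.RecordBatch, pa.Table)):
        return obj
    if hasattr(obj, "__arrow_c_array__"):
        # Any Arrow-array-like object: import it via the PyCapsule protocol
        # (private helper, shown only to illustrate the idea).
        schema_capsule, array_capsule = obj.__arrow_c_array__()
        return pa.Array._import_from_c_capsule(schema_capsule, array_capsule)
    if hasattr(obj, "__arrow_c_stream__"):
        # Any Arrow-stream-like object: wrap it without consuming it.
        return pa.RecordBatchReader.from_stream(obj)
    # Fall back to the existing conversion rules (lists, numpy arrays, ...).
    return pa.array(obj)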

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Aug 20, 2024
@jorisvandenbossche
Member

Started with exploring write_dataset -> #43771

@kylebarron
Contributor Author

That sounds awesome.

For reference in my own experiments in https://github.com/kylebarron/arro3, I created an ArrayReader class, essentially just a RecordBatchReader but generalized to yield generic Arrays. Then for example cast is overloaded. So if it sees an object with __arrow_c_array__ it will immediately return an arro3.Array with the result. If it sees an object with __arrow_c_stream__ it will create a new ArrayReader holding an iterator with the compute function. So it will lazily yield casted chunks.
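In pyarrow terms, the lazy branch of that dispatch could look roughly like the following (a sketch only; cast_stream is a hypothetical name, and arro3 returns its own Array/ArrayReader types rather than pyarrow ones). An object with __arrow_c_array__ would instead be cast eagerly and returned directly.

import pyarrow as pa
import pyarrow.compute as pc


def cast_stream(obj, target_schema):
    # Wrap any __arrow_c_stream__ object and cast its chunks lazily.
    reader = pa.RecordBatchReader.from_stream(obj)

    def cast_batches():
        for batch in reader:
            arrays = [
                pc.cast(column, field.type)
                for column, field in zip(batch.columns, target_schema)
            ]
            yield pa.RecordBatch.from_arrays(arrays, schema=target_schema)

    # Nothing is computed until the returned reader is consumed.
    return pa.RecordBatchReader.from_batches(target_schema, cast_batches())


# Lazily cast the stream of the earlier table from int64 to float64.
casted = cast_stream(table, pa.schema({"a": pa.float64()}))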

jorisvandenbossche added a commit to jorisvandenbossche/arrow that referenced this issue Sep 23, 2024
pitrou pushed a commit to jorisvandenbossche/arrow that referenced this issue Nov 18, 2024
pitrou pushed a commit that referenced this issue Nov 18, 2024
…taset (#43771)

### Rationale for this change

Expanding the places inside pyarrow where we accept objects implementing the Arrow PyCapsule interface. This PR adds support in `ds.write_dataset()`, since we already accept a RecordBatchReader there as well.

### What changes are included in this PR?

`ds.write_dataset()` and `ds.Scanner.from_batches()` now accept any object implementing the Arrow PyCapsule interface for streams.

### Are these changes tested?

Yes

### Are there any user-facing changes?

No
* GitHub Issue: #43410

Authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
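
A minimal usage sketch of the behaviour added by that PR, reusing the ArrowCStream wrapper and table from the first comment (the output directory and format are illustrative):

import pyarrow.dataset as ds

# Any object exposing __arrow_c_stream__ can now be passed directly,
# in addition to tables, batches, and RecordBatchReaders.
ds.write_dataset(ArrowCStream(table), "dataset_root", format="parquet")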
pitrou added this to the 19.0.0 milestone Nov 18, 2024
@pitrou
Member

pitrou commented Nov 18, 2024

Issue resolved by pull request #43771

pitrou closed this as completed Nov 18, 2024