Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyArrow input should result in PyArrow output? #187

Closed
MarcoGorelli opened this issue Jul 21, 2024 · 5 comments
Closed

PyArrow input should result in PyArrow output? #187

MarcoGorelli opened this issue Jul 21, 2024 · 5 comments
Labels
enhancement New feature or request question Further information is requested
Milestone

Comments

@MarcoGorelli
Copy link
Contributor

If I run the README example with PyArrow input, I get pandas output:

import pandas as pd
from formulaic import Formula
import pyarrow as pa

df = pa.table({
    'y': [0, 1, 2],
    'x': ['A', 'B', 'C'],
    'z': [0.3, 0.1, 0.2],
})

y, X = Formula('y ~ x + z').get_model_matrix(df)

print(y)
print(X)
   y
0  0
1  1
2  2
   Intercept  x[T.B]  x[T.C]    z
0        1.0       0       0  0.3
1        1.0       1       0  0.1
2        1.0       0       1  0.2

I think I'd have expected

pyarrow.Table
y: int64
----
y: [[0,1,2]]
pyarrow.Table
Intercept: double
x[T.B]: int64
x[T.C]: int64
z: double
----
Intercept: [[1,1,1]]
x[T.B]: [[0,1,0]]
x[T.C]: [[0,0,1]]
z: [[0.3,0.1,0.2]]

I'm asking in the context of #160 , because there, I think Polars input should probably result in Polars output?

@matthewwardrop
Copy link
Owner

matthewwardrop commented Aug 6, 2024

This is an interesting question. While none the internal plumbing hard-codes a specific data-type (very intentionally), a lot of the transforms were designed to work with pandas or sparse datatypes. They do have a mechanism (single-dispatch) for customising the behaviour with other data types, but it isn't implemented for most transforms. Obviously we could cast back to a pyarrow data type at the end if we wanted to.

At least historically, I'd always viewed arrow as an interchange format, since there were few routines that ran directly on the arrow datastructures themselves. I think this is changing, so I'm totally open to thinking through this more.

Do you have specific use-cases where having the output be an arrow table would make more sense for you?

@matthewwardrop matthewwardrop added question Further information is requested enhancement New feature or request labels Aug 6, 2024
@MarcoGorelli
Copy link
Contributor Author

Thanks for your response!

Do you have specific use-cases where having the output be an arrow table would make more sense for you?

I think if a user passes in Polars, they expect to get back Polars. And as I was looking into preserving the input data class for Polars, I noticed that for PyArrow the input data class isn't preserved

If you're open to it, I could put up a PR demonstrating how Narwhals could work here, as suggested in #160 (comment)? No obligations nor hard feelings if it then gets rejected of course, it just looks like a good use-case (for Polars in particular it would be good to keep things Polars-native if possible...maybe they can also stay lazy, not sure yet)

@matthewwardrop
Copy link
Owner

Hi @MarcoGorelli !

I was toying with Narwhals a bit this morning, and it looks great. I'm still leveling up, but I have most of an implementation working now in Formulaic that can use it as the materialization backend. Given your heavy involvement in Narwhals, I suspect you will know various tricks that I don't, so when I put up a PR soon, I'll let you chime in on it (and feel free at that time to make further contributions :)).

@MarcoGorelli
Copy link
Contributor Author

Cool, thanks!

Given your heavy involvement in Narwhals

😄 I'm the original author (maybe I should make that clearer somewhere)

when I put up a PR soon, I'll let you chime in on it

Sounds great! And feel free to join our Discord if you have any question/request which doesn't quite fit into a GitHub issue

@matthewwardrop
Copy link
Owner

Given that we have now merged in (experimental) support for narwhals, I'll close this issue in favour of specific bugs that may arise :). Thanks again for your help @MarcoGorelli !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants