Support polars #160

Open
deanm0000 opened this issue Oct 20, 2023 · 5 comments
Labels
enhancement New feature or request
Comments

@deanm0000

Polars is a (relatively) new dataframe library that is gaining popularity and blows pandas away in performance, using Arrow memory in the backend.

@matthewwardrop
Owner

Hi @deanm0000 !

Thanks for the suggestion. There's a trivial way to implement support (converting to pandas, which is how we currently support pyarrow Tables), or a more complicated integration that fully adds support for polars arrays everywhere, perhaps via just using Arrow arrays. The framework itself doesn't care about the datatypes, but some of the transforms do... and that will be the bulk of the work.

Of course, if the goal is the performance benefits, converting everything to pandas defeats the purpose.

Do you have any instances where you are bottlenecked on performance? Or is this more just a quality of life feature request?

@matthewwardrop matthewwardrop added the enhancement New feature or request label Nov 3, 2023
@deanm0000
Author

I guess, in those terms, it's a quality of life improvement. From a pure usability perspective it isn't hard to convert to pandas. I didn't realize that the pyarrow input just converted to pandas under the hood. I poked around really quickly and I couldn't find where in the code the transformations happen. Could you point me to that, e.g. for a formula like Y~X+I(X^2)?
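
For readers following along, here is a hand-built sketch of the columns a formula like Y~X+I(X^2) describes: an intercept, X, and X squared. It is not formulaic's code. The function name and column labels are made up for illustration, and the term is read as "X squared", which is an assumption (in plain Python the power operator is ** while ^ is bitwise XOR).

```python
# Hand-built sketch of the right-hand-side columns for Y ~ X + I(X squared).
# NOT formulaic's implementation: build_design_rows and the column labels
# are hypothetical names used only for illustration.


def build_design_rows(x_values):
    """Return (column_names, rows) with an intercept, X, and X squared."""
    column_names = ["Intercept", "X", "I(X ** 2)"]
    # One row per observation: constant 1.0, the raw value, and its square.
    rows = [[1.0, float(x), float(x) ** 2] for x in x_values]
    return column_names, rows


names, rows = build_design_rows([1, 2, 3])
for row in rows:
    print(dict(zip(names, row)))
```

Whatever dataframe library holds X, these are the values a materializer would need to produce; the open question in this thread is only which array type carries them.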

@matthewwardrop
Owner

matthewwardrop commented Nov 3, 2023

The lazy arrow -> pandas conversion happens here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/arrow.py . In practice, under the hood, the data can sometimes pass through this conversion uncopied, but compute is then done in numpy arrays or pandas Series depending on the transform. Again, the framework is datatype agnostic, so it is happy with other types... but we'd need to go through and update the transforms (like contrast encodings) to make sure they have implementations for these types.
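
To make the "transforms need an implementation per type" point concrete, here is a minimal sketch using Python's functools.singledispatch. It is not how formulaic is written: the center transform is hypothetical, and list and array.array are only stand-ins for, say, pandas and polars columns.

```python
from array import array
from functools import singledispatch


@singledispatch
def center(values):
    """Subtract the mean from a column. Unregistered types are rejected."""
    raise TypeError(f"center() has no implementation for {type(values).__name__}")


@center.register
def _(values: list):
    # Implementation for the first (stand-in) column type.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


@center.register
def _(values: array):
    # A second column type needs its own implementation, and gets back
    # the same container type it passed in (no conversion to list).
    mean = sum(values) / len(values)
    return array(values.typecode, (v - mean for v in values))


print(center([1.0, 2.0, 3.0]))
print(center(array("d", [1.0, 2.0, 3.0])))
```

The calling code stays the same for both inputs; supporting a new dataframe library in this style would mean registering one more implementation per transform, which is the bulk of the work described above.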

@glemaitre

Maybe one thing to consider here is the effort to come up with a DataFrame API: https://data-apis.org/dataframe-api/draft/

It could be handy to write DataFrame agnostic code.

@MarcoGorelli
Contributor

Hi @matthewwardrop - would you be open to using Narwhals for this? Altair recently adopted it for this purpose (vega/altair#3452), as did scikit-lego.

Happy to put up a POC if you'd be interested (just checking first!)

4 participants