-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Series.index_of() #19894
base: main
Are you sure you want to change the base?
feat: Series.index_of() #19894
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #19894 +/- ##
===========================================
+ Coverage 58.69% 79.03% +20.33%
===========================================
Files 1564 1566 +2
Lines 220765 220918 +153
Branches 2504 2504
===========================================
+ Hits 129584 174598 +45014
+ Misses 90607 45747 -44860
+ Partials 574 573 -1 ☔ View full report in Codecov by Sentry. |
I think I've figured out how to use row encoding, so now I just need to write lots and lots of tests and make sure it actually works beyond the trivial case I've already tested. |
Unfortunately categorical and enum don't work (they also don't work for E.g. for Categorical: >>> import polars as pl
>>> pl.Series(["a", "b", "a"], dtype=pl.Categorical).index_of("a")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 4771, in index_of
return F.select(F.lit(self).index_of(element)).item()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/functions/lazy.py", line 1913, in select
return pl.DataFrame().select(*exprs, **named_exprs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 9113, in select
return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/itamarst/devel/polars/py-polars/polars/lazyframe/frame.py", line 2029, in collect
return wrap_df(ldf.collect(callback))
^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: got invalid or ambiguous dtypes: '[cat, str]' in expression 'index_of'
Consider explicitly casting your input types to resolve potential ambiguity.
Resolved plan until failure:
---> FAILED HERE RESOLVING 'select' <---
SELECT [Series.index_of([String(a)])] FROM
DF []; PROJECT */0 COLUMNS; SELECTION: None |
My guess is that you are treating a categorical as a string when it goes into the row encoding. If you want to compare the row encoding of a series with the row encoding of another series they need to have been encoded with the exact same dtype (i.e. so the same RevMap as well) otherwise the output is undefined. If search_sorted doesn't do that either, that is a bug and I can look into it. |
@coastalwhite |
@coastalwhite and the question is how/where do I convert to an enum/categorical, my attempts have failed so far. |
It should raise. We should not allow for lossy comparisons. That's why we need a new supertype flag. |
To clarify, |
Or to put it another way, it is correctly searching float needles in int haystacks, but I can certainly see that you'd want to disallow that. |
After #20323 is merged I can change this PR to start testing Enum as non-xfail tests. |
Thank you for the new casting logic! I've updated to use it, and addressed the other two comments. |
Alright, looks great @itamarst. Thanks. I believe we only need docs entries on the python side (so that they end up in the ref guide), then it is good to go. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need the tiny user-guide page? It's pretty much the same as the docstrings, so I feel like it's enough to have the docstrings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this page? The examples shown here are already available in the docstrings, so I think we can delete this.
The docstrings in the Python side are enough to make this appear in the API reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was what Rich was asking for, maybe I misunderstood.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant adding the entries in the reference guide. We don't need this in the User guide indeed.
API, which is similar to Python lists' `index()` method. | ||
Given a dataframe: | ||
|
||
{{code_block('user-guide/expressions/casting', 'dfnum', [])}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the page is not removed, as I'd like it to, then the []
here need to reference index_of
, which will also require adding Python and Rust entries to docs/source/_build/API_REFERENCE_LINKS.yml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one is just setting up the DataFrame, so doesn't need it, but I'll add to the other two.
Fixes #5503
Categoricals don't work yet; see #20171 and #20318.