Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Series.index_of() #19894

Open
wants to merge 96 commits into
base: main
Choose a base branch
from
Open

Conversation

itamarst
Copy link
Contributor

@itamarst itamarst commented Nov 20, 2024

Fixes #5503

Categoricals don't work yet; see #20171 and #20318.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Nov 20, 2024
Copy link

codecov bot commented Nov 20, 2024

Codecov Report

Attention: Patch coverage is 98.68421% with 2 lines in your changes missing coverage. Please review.

Project coverage is 79.03%. Comparing base (11dd4b3) to head (3cb65df).

Files with missing lines Patch % Lines
crates/polars-ops/src/series/ops/index_of.rs 98.76% 1 Missing ⚠️
.../polars-python/src/lazyframe/visitor/expr_nodes.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main   #19894       +/-   ##
===========================================
+ Coverage   58.69%   79.03%   +20.33%     
===========================================
  Files        1564     1566        +2     
  Lines      220765   220918      +153     
  Branches     2504     2504               
===========================================
+ Hits       129584   174598    +45014     
+ Misses      90607    45747    -44860     
+ Partials      574      573        -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@itamarst itamarst marked this pull request as ready for review November 21, 2024 14:13
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
crates/polars-ops/src/series/ops/index_of.rs Outdated Show resolved Hide resolved
@itamarst itamarst marked this pull request as draft November 27, 2024 15:00
@itamarst itamarst closed this Dec 2, 2024
@itamarst
Copy link
Contributor Author

itamarst commented Dec 2, 2024

I think I've figured out how to use row encoding, so now I just need to write lots and lots of tests and make sure it actually works beyond the trivial case I've already tested.

@itamarst itamarst reopened this Dec 2, 2024
@itamarst
Copy link
Contributor Author

itamarst commented Dec 2, 2024

Unfortunately categorical and enum don't work (they also don't work for search_sorted(), which would be nice to fix); they ought to work, since e.g. pl.Series(["A", "B"], dtype=pl.Categorical) == "B" works, but I'm not sure how that is different than what I'm doing, so would appreciate any hints.

E.g. for Categorical:

>>> import polars as pl
>>> pl.Series(["a", "b", "a"], dtype=pl.Categorical).index_of("a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/itamarst/devel/polars/py-polars/polars/series/series.py", line 4771, in index_of
    return F.select(F.lit(self).index_of(element)).item()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/functions/lazy.py", line 1913, in select
    return pl.DataFrame().select(*exprs, **named_exprs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/dataframe/frame.py", line 9113, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/itamarst/devel/polars/py-polars/polars/lazyframe/frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: got invalid or ambiguous dtypes: '[cat, str]' in expression 'index_of'

Consider explicitly casting your input types to resolve potential ambiguity.

Resolved plan until failure:

        ---> FAILED HERE RESOLVING 'select' <---
 SELECT [Series.index_of([String(a)])] FROM
  DF []; PROJECT */0 COLUMNS; SELECTION: None

@itamarst itamarst changed the title feat: Start of Series.index_of(), for primitive numeric types feat: Series.index_of() Dec 2, 2024
@coastalwhite
Copy link
Collaborator

My guess is that you are treating a categorical as a string when it goes into the row encoding. If you want to compare the row encoding of a series with the row encoding of another series they need to have been encoded with the exact same dtype (i.e. so the same RevMap as well) otherwise the output is undefined. If search_sorted doesn't do that either, that is a bug and I can look into it.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 4, 2024

@coastalwhite search_sorted() does gets it wrong, yes. And separately if memory serves, you pass in a non-matching pl.lit("a", dtype=pl.Categorical) it doesn't error out with mismatching categoricals, it gives the wrong result.

@itamarst
Copy link
Contributor Author

itamarst commented Dec 4, 2024

@coastalwhite and the question is how/where do I convert to an enum/categorical, my attempts have failed so far.

@ritchie46
Copy link
Member

That was mostly before the rebase. If I remember correctly I talked to ritchie and we both agreed that pl.Series([1, 2]).index_of(2.1) should not return 1. I am not sure if it still does that.

It should raise. We should not allow for lossy comparisons. That's why we need a new supertype flag.

@itamarst
Copy link
Contributor Author

To clarify, pl.Series([1, 2]).index_of(2.1) == None, and pl.Series([1, 2]).index_of(2.0) == 1. So neither result is incorrect, at least. Would you like it to error out in both cases?

@itamarst
Copy link
Contributor Author

Or to put it another way, it is correctly searching float needles in int haystacks, but I can certainly see that you'd want to disallow that.

@itamarst
Copy link
Contributor Author

After #20323 is merged I can change this PR to start testing Enum as non-xfail tests.

@itamarst
Copy link
Contributor Author

itamarst commented Jan 2, 2025

Thank you for the new casting logic! I've updated to use it, and addressed the other two comments.

@itamarst itamarst requested a review from ritchie46 January 2, 2025 17:28
@ritchie46
Copy link
Member

Alright, looks great @itamarst. Thanks. I believe we only need docs entries on the python side (so that they end up in the ref guide), then it is good to go.

Copy link
Collaborator

@rodrigogiraoserrao rodrigogiraoserrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need the tiny user-guide page? It's pretty much the same as the docstrings, so I feel like it's enough to have the docstrings.

Copy link
Collaborator

@rodrigogiraoserrao rodrigogiraoserrao Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this page? The examples shown here are already available in the docstrings, so I think we can delete this.
The docstrings in the Python side are enough to make this appear in the API reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was what Rich was asking for, maybe I misunderstood.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant adding the entries in the reference guide. We don't need this in the User guide indeed.

API, which is similar to Python lists' `index()` method.
Given a dataframe:

{{code_block('user-guide/expressions/casting', 'dfnum', [])}}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the page is not removed, as I'd like it to, then the [] here need to reference index_of, which will also require adding Python and Rust entries to docs/source/_build/API_REFERENCE_LINKS.yml.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one is just setting up the DataFrame, so doesn't need it, but I'll add to the other two.

docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
docs/source/user-guide/expressions/searching.md Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Find index of item in list
6 participants