feat: Add index_of() function to Series and Expr #19894

Merged on Jan 7, 2025 (100 commits)
Changes from 85 commits
5b3821f
Proof of concept end-to-end implementation of Series.index_of().
pythonspeed Oct 16, 2024
a270c02
Proof-of-concept NaN handling.
pythonspeed Oct 16, 2024
c34588b
Simplify.
pythonspeed Oct 16, 2024
0950682
Minimal support for null values.
pythonspeed Oct 16, 2024
95682e8
Document the motivation.
pythonspeed Oct 16, 2024
d062f98
Integer casting.
pythonspeed Oct 17, 2024
7d9be50
Implement lists.
pythonspeed Oct 17, 2024
e7234a6
Support for arrays.
pythonspeed Oct 17, 2024
7879c63
Start refactoring to be in polars-ops and use expressions.
pythonspeed Oct 23, 2024
c0864a4
Continued work on sketch of expr-based index_of().
pythonspeed Oct 23, 2024
8a3be1f
Forgot to check this on.
pythonspeed Oct 23, 2024
eeef428
Skip lists and arrays for now.
pythonspeed Nov 12, 2024
0ade91d
Maybe fine as is?
pythonspeed Nov 12, 2024
b48e840
Only use the downcast macro once.
pythonspeed Nov 12, 2024
e80c450
Simplify by using AnyValue throughout.
pythonspeed Nov 12, 2024
baccdc4
Not used
pythonspeed Nov 12, 2024
09a4496
Don't support multiple entries in this way
pythonspeed Nov 12, 2024
c9c6152
Fast path for sorted values.
pythonspeed Nov 12, 2024
8240ae2
Start of unit tests.
pythonspeed Nov 13, 2024
aabddbe
search_sorted() can give (from our perspective) false positives.
pythonspeed Nov 13, 2024
afc1892
Null Series.
pythonspeed Nov 13, 2024
d2c0a21
Test the lazy API too.
pythonspeed Nov 13, 2024
4fd40b6
Basic integer tests.
pythonspeed Nov 13, 2024
65328c9
Fix bug where index_of() searched for the rounded value of floats.
pythonspeed Nov 13, 2024
c7ffd79
Another test
pythonspeed Nov 14, 2024
bfb5473
Better phrasing
pythonspeed Nov 14, 2024
101ecce
Handle scalar columns.
pythonspeed Nov 14, 2024
7a59e0e
Hypothesis-based tests.
pythonspeed Nov 14, 2024
bc6872d
Better explanation
pythonspeed Nov 14, 2024
094da9c
Fix some bugs and lints
pythonspeed Nov 14, 2024
a64c69e
Add a feature.
pythonspeed Nov 15, 2024
5172116
Improve docs and type annotations.
pythonspeed Nov 15, 2024
c3da35c
Lint fix.
pythonspeed Nov 15, 2024
91db617
Catch Python stdlib floats too.
pythonspeed Nov 20, 2024
0afb5d9
Start moving towards code sharing.
pythonspeed Nov 20, 2024
0213d01
Start removing usage of ChunkedArray.iter().
pythonspeed Nov 20, 2024
3aeada3
Remove more usage of iter().
pythonspeed Nov 20, 2024
5e4d0ad
Pacify clippy.
pythonspeed Nov 20, 2024
0cacdf2
Not used.
pythonspeed Nov 20, 2024
362fb33
Improve test coverage.
pythonspeed Nov 20, 2024
661ce4f
Fix lint.
pythonspeed Nov 20, 2024
2146a8d
Pacify clippy
pythonspeed Nov 21, 2024
8ca042b
Only load if feature enabled.
pythonspeed Nov 21, 2024
c523bc1
Reformat.
pythonspeed Nov 21, 2024
0742f02
Documentation improvements.
pythonspeed Nov 21, 2024
837d0ed
Fix location.
pythonspeed Nov 21, 2024
5217198
Get rid of trait.
pythonspeed Nov 22, 2024
73df3b9
Simplify by using TotalEq.
pythonspeed Nov 27, 2024
463eaad
Do supertype casting at a higher level.
pythonspeed Nov 27, 2024
2b2da76
Sketch of row-encoding based search for non-numeric data types.
pythonspeed Dec 2, 2024
3d33717
Fix searching for nulls (and speed it up!) by moving it earlier in th…
pythonspeed Dec 2, 2024
01bb644
Basic tests for other dtypes
pythonspeed Dec 2, 2024
f95a228
Improve comments.
pythonspeed Dec 2, 2024
d5aef83
Fix decimals.
pythonspeed Dec 2, 2024
5bc19bf
Fix lists and arrays.
pythonspeed Dec 2, 2024
1377e6e
Lint
pythonspeed Dec 2, 2024
a340e7f
Improve test coverage.
pythonspeed Dec 4, 2024
1192d59
Handle edge case where we pass in an array literal, so the series is …
pythonspeed Dec 4, 2024
d8d2376
Gate on feature
pythonspeed Dec 4, 2024
d97bdf9
Skip enum and categorical for now
pythonspeed Dec 5, 2024
73427ae
Fix some lints
pythonspeed Dec 5, 2024
5b86e6a
Fix lint
pythonspeed Dec 5, 2024
86c13db
Only cast value in edge case that requires it.
pythonspeed Dec 5, 2024
5ddf2d4
Remove unnecessary trait
pythonspeed Dec 6, 2024
98543bd
Arrays and Lists are now better supported in row encoding
pythonspeed Dec 6, 2024
e0ae7a2
Test a couple more edge cases
pythonspeed Dec 6, 2024
50d55d7
Fix formatting.
pythonspeed Dec 6, 2024
d3977bb
fix small rebase error
coastalwhite Dec 16, 2024
6a8aaa4
fix: Properly handle no item find in group-by
coastalwhite Dec 16, 2024
97ad11b
fix: Error on multiple values
coastalwhite Dec 16, 2024
19d0365
fix: Properly error on type mismatch
coastalwhite Dec 16, 2024
7498bc3
Less code duplication.
pythonspeed Dec 16, 2024
9560a6b
Better tests for Enum and Categorical
pythonspeed Dec 16, 2024
ddaf69b
Point at the correct issue
pythonspeed Dec 16, 2024
4aa7210
Restore missing coverage
pythonspeed Dec 16, 2024
b7d689b
Null can be cast to anything.
pythonspeed Dec 16, 2024
ccecc38
Merge remote-tracking branch 'origin/main' into 5503-series-index_of
pythonspeed Dec 16, 2024
8cfe43d
Update to latest API.
pythonspeed Dec 16, 2024
21eedc9
Give good error messages, which can be removed when corresponding bug…
pythonspeed Dec 16, 2024
aad9544
Error out instead of giving the wrong result
pythonspeed Dec 16, 2024
a52a4e6
Format
pythonspeed Dec 16, 2024
0a02b48
Merge remote-tracking branch 'origin/main' into 5503-series-index_of
pythonspeed Dec 17, 2024
dbebe7c
Enum literals work now.
pythonspeed Dec 17, 2024
138bf73
Add missing cfg
pythonspeed Dec 17, 2024
dbc0cbd
Remove redundant type annotations
pythonspeed Dec 17, 2024
35250de
Merge remote-tracking branch 'origin/main' into 5503-series-index_of
pythonspeed Jan 2, 2025
731fd6a
Switch to strict casting.
pythonspeed Jan 2, 2025
7ee4ede
Remove duplicate logic.
pythonspeed Jan 2, 2025
3179992
Don't panic.
pythonspeed Jan 2, 2025
a9a06af
Improve testing slightly, and pacify mypy.
pythonspeed Jan 2, 2025
b0196ae
Merge remote-tracking branch 'origin/main' into 5503-series-index_of
pythonspeed Jan 6, 2025
0fe814e
Minimal guide level documentation for `index_of()`.
pythonspeed Jan 6, 2025
b358e22
Pacify linter
pythonspeed Jan 6, 2025
541049a
Reformat so dprint is happy.
pythonspeed Jan 6, 2025
68ebd34
fix reference
pythonspeed Jan 6, 2025
3cb65df
Add index references
pythonspeed Jan 6, 2025
0965950
Remove user guide.
pythonspeed Jan 7, 2025
22f3e88
Add index_of to Python API docs
pythonspeed Jan 7, 2025
0a61902
Merge remote-tracking branch 'origin/main' into 5503-series-index_of
pythonspeed Jan 7, 2025
8c3e0d4
Update to changed API.
pythonspeed Jan 7, 2025
4 changes: 4 additions & 0 deletions crates/polars-core/src/datatypes/dtype.rs
@@ -355,6 +355,10 @@ impl DataType {
return Some(true);
}

if self.is_null() {
return Some(true);
}

use DataType as D;
Some(match (self, to) {
#[cfg(feature = "dtype-categorical")]
3 changes: 3 additions & 0 deletions crates/polars-lazy/Cargo.toml
@@ -245,6 +245,7 @@ string_pad = ["polars-plan/string_pad"]
string_reverse = ["polars-plan/string_reverse"]
string_to_integer = ["polars-plan/string_to_integer"]
arg_where = ["polars-plan/arg_where"]
index_of = ["polars-plan/index_of"]
search_sorted = ["polars-plan/search_sorted"]
merge_sorted = ["polars-plan/merge_sorted", "polars-stream?/merge_sorted"]
meta = ["polars-plan/meta"]
@@ -312,6 +313,7 @@ test_all = [
"row_hash",
"string_pad",
"string_to_integer",
"index_of",
"search_sorted",
"top_k",
"pivot",
@@ -358,6 +360,7 @@ features = [
"fused",
"futures",
"hist",
"index_of",
"interpolate",
"interpolate_by",
"ipc",
1 change: 1 addition & 0 deletions crates/polars-ops/Cargo.toml
@@ -114,6 +114,7 @@ rolling_window = ["polars-core/rolling_window"]
rolling_window_by = ["polars-core/rolling_window_by"]
moment = []
mode = []
index_of = []
search_sorted = []
merge_sorted = []
top_k = []
120 changes: 120 additions & 0 deletions crates/polars-ops/src/series/ops/index_of.rs
@@ -0,0 +1,120 @@
use arrow::array::{BinaryArray, PrimitiveArray};
use polars_core::downcast_as_macro_arg_physical;
use polars_core::prelude::*;
use polars_utils::total_ord::TotalEq;
use row_encode::encode_rows_unordered;

/// Find the index of the value, or `None` if it can't be found.
fn index_of_value<'a, DT, AR>(ca: &'a ChunkedArray<DT>, value: AR::ValueT<'a>) -> Option<usize>
where
DT: PolarsDataType,
AR: StaticArray,
AR::ValueT<'a>: TotalEq,
{
let req_value = &value;
let mut index = 0;
for chunk in ca.chunks() {
let chunk = chunk.as_any().downcast_ref::<AR>().unwrap();
if chunk.validity().is_some() {
for maybe_value in chunk.iter() {
if maybe_value.map(|v| v.tot_eq(req_value)) == Some(true) {
return Some(index);
} else {
index += 1;
}
}
} else {
// A lack of a validity bitmap means there are no nulls, so we
// can simplify our logic and use a faster code path:
for value in chunk.values_iter() {
if value.tot_eq(req_value) {
return Some(index);
} else {
index += 1;
}
}
}
}
None
}

fn index_of_numeric_value<T>(ca: &ChunkedArray<T>, value: T::Native) -> Option<usize>
where
T: PolarsNumericType,
{
index_of_value::<_, PrimitiveArray<T::Native>>(ca, value)
}

/// Try casting the value to the correct type, then call
/// index_of_numeric_value().
macro_rules! try_index_of_numeric_ca {
($ca:expr, $value:expr) => {{
let ca = $ca;
let value = $value;
// extract() returns None if casting failed. index_of() has already
// checked that the needle's dtype matches the series, and nulls were
// handled earlier, so the unwrap() below is not expected to fail.
let value = value.value().extract().unwrap();
index_of_numeric_value(ca, value)
}};
}

/// Find the index of a given value (`needle`) within the series, or `None`
/// if it is not present.
pub fn index_of(series: &Series, needle: Scalar) -> PolarsResult<Option<usize>> {
polars_ensure!(
series.dtype() == needle.dtype(),
InvalidOperation: "Cannot perform index_of with mismatching datatypes: {:?} and {:?}",
series.dtype(),
needle.dtype(),
);

// Series is null:
if series.dtype().is_null() {
if needle.is_null() {
return Ok((series.len() > 0).then_some(0));
} else {
return Ok(None);
}
}

// Series is not null, and the value is null:
if needle.is_null() {
let mut index = 0;
for chunk in series.chunks() {
let length = chunk.len();
if let Some(bitmap) = chunk.validity() {
let leading_ones = bitmap.leading_ones();
if leading_ones < length {
return Ok(Some(index + leading_ones));
}
}
// Chunks whose bitmap has no unset bits must be skipped too:
index += length;
}
return Ok(None);
}

if series.dtype().is_numeric() {
return Ok(downcast_as_macro_arg_physical!(
series,
try_index_of_numeric_ca,
needle
));
}

if series.dtype().is_categorical() {
unimplemented!("Categorical index_of() can give incorrect result until https://github.com/pola-rs/polars/issues/20318 is fixed")
}

// For non-numeric dtypes, we convert to row-encoding, which essentially has
// us searching the physical representation of the data as a series of
// bytes.
let value_as_column = Column::new_scalar(PlSmallStr::EMPTY, needle, 1);
let value_as_row_encoded_ca = encode_rows_unordered(&[value_as_column])?;
let value = value_as_row_encoded_ca
.first()
.expect("Shouldn't have nulls in a row-encoded result");
let ca = encode_rows_unordered(&[series.clone().into()])?;
Ok(index_of_value::<_, BinaryArray<i64>>(&ca, value))
}
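The linear-search path in this file can be illustrated without any Polars types. The following is a minimal, std-only sketch, not the code from this PR: `Chunk`, `tot_eq`, `index_of_value` and `index_of_null` here are hypothetical stand-ins, and the validity mask is a plain `Vec<bool>` rather than an Arrow bitmap. It assumes that total equality treats NaN as equal to NaN, which is the behaviour the NaN-handling commits appear to be after.

```rust
/// Hypothetical stand-in for an Arrow chunk: values plus an optional
/// validity mask (`None` means the chunk contains no nulls).
struct Chunk {
    values: Vec<f64>,
    validity: Option<Vec<bool>>,
}

/// Total equality for floats: unlike `==`, NaN compares equal to NaN.
fn tot_eq(a: f64, b: f64) -> bool {
    if a.is_nan() { b.is_nan() } else { a == b }
}

/// Index of the first non-null value equal to `needle`, counted across chunks.
fn index_of_value(chunks: &[Chunk], needle: f64) -> Option<usize> {
    let mut offset = 0;
    for chunk in chunks {
        let found = match &chunk.validity {
            // Slow path: ignore entries whose validity flag is unset.
            Some(valid) => chunk
                .values
                .iter()
                .zip(valid)
                .position(|(v, ok)| *ok && tot_eq(*v, needle)),
            // Fast path: no mask, so every value is valid.
            None => chunk.values.iter().position(|v| tot_eq(*v, needle)),
        };
        if let Some(i) = found {
            return Some(offset + i);
        }
        // Always advance past the chunk, whether or not it had a mask.
        offset += chunk.values.len();
    }
    None
}

/// Index of the first null, counted across chunks.
fn index_of_null(chunks: &[Chunk]) -> Option<usize> {
    let mut offset = 0;
    for chunk in chunks {
        if let Some(valid) = &chunk.validity {
            if let Some(i) = valid.iter().position(|ok| !*ok) {
                return Some(offset + i);
            }
        }
        offset += chunk.values.len();
    }
    None
}
```

Both functions advance the running offset for every chunk they skip, which is what keeps the returned index global rather than chunk-local.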
4 changes: 4 additions & 0 deletions crates/polars-ops/src/series/ops/mod.rs
@@ -21,6 +21,8 @@ mod floor_divide;
mod fused;
mod horizontal;
mod index;
#[cfg(feature = "index_of")]
mod index_of;
mod int_range;
#[cfg(any(feature = "interpolate_by", feature = "interpolate"))]
mod interpolation;
@@ -84,6 +86,8 @@ pub use floor_divide::*;
pub use fused::*;
pub use horizontal::*;
pub use index::*;
#[cfg(feature = "index_of")]
pub use index_of::*;
pub use int_range::*;
#[cfg(feature = "interpolate")]
pub use interpolation::interpolate::*;
3 changes: 3 additions & 0 deletions crates/polars-ops/src/series/ops/search_sorted.rs
@@ -8,6 +8,9 @@ pub fn search_sorted(
side: SearchSortedSide,
descending: bool,
) -> PolarsResult<IdxCa> {
if s.dtype().is_categorical() {
unimplemented!("Categorical search_sorted() can give incorrect result until https://github.com/pola-rs/polars/issues/20318 is fixed");
}
let original_dtype = s.dtype();
let s = s.to_physical_repr();
let phys_dtype = s.dtype();
2 changes: 2 additions & 0 deletions crates/polars-plan/Cargo.toml
@@ -159,6 +159,7 @@ string_pad = ["polars-ops/string_pad"]
string_reverse = ["polars-ops/string_reverse"]
string_to_integer = ["polars-ops/string_to_integer"]
arg_where = []
index_of = ["polars-ops/index_of"]
search_sorted = ["polars-ops/search_sorted"]
merge_sorted = ["polars-ops/merge_sorted"]
meta = []
@@ -263,6 +264,7 @@ features = [
"find_many",
"string_encoding",
"ipc",
"index_of",
"search_sorted",
"unique_counts",
"dtype-u8",
61 changes: 61 additions & 0 deletions crates/polars-plan/src/dsl/function_expr/index_of.rs
@@ -0,0 +1,61 @@
use polars_ops::series::index_of as index_of_op;

use super::*;

/// Given two columns, find the index of a value (the second column) within the
/// first column. Will use binary search if possible, as an optimization.
pub(super) fn index_of(s: &mut [Column]) -> PolarsResult<Column> {
let series = if let Column::Scalar(ref sc) = s[0] {
// We only care about the first value:
&sc.as_single_value_series()
} else {
s[0].as_materialized_series()
};

let needle_s = &s[1];
polars_ensure!(
needle_s.len() == 1,
InvalidOperation: "needle of `index_of` can only contain a single value, found {} values",
needle_s.len()
);
let needle = Scalar::new(
needle_s.dtype().clone(),
needle_s.get(0).unwrap().into_static(),
);

let is_sorted_flag = series.is_sorted_flag();
let result = match is_sorted_flag {
// If the Series is sorted, we can use an optimized binary search to
// find the value.
IsSorted::Ascending | IsSorted::Descending
if !needle.is_null() &&
// search_sorted() doesn't support decimals at the moment.
!series.dtype().is_decimal() =>
{
search_sorted(
series,
needle_s.as_materialized_series(),
SearchSortedSide::Left,
IsSorted::Descending == is_sorted_flag,
)?
.get(0)
.and_then(|idx| {
// search_sorted() gives an index even if it's not an exact
// match! So we want to make sure it actually found the value.
if series.get(idx as usize).ok()? == needle.as_any_value() {
Some(idx as usize)
} else {
None
}
})
},
_ => index_of_op(series, needle)?,
};

let av = match result {
None => AnyValue::Null,
Some(idx) => AnyValue::from(idx as IdxSize),
};
let scalar = Scalar::new(IDX_DTYPE, av);
Ok(Column::new_scalar(series.name().clone(), scalar, 1))
}
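The sorted fast path above depends on one subtle point: a binary search for the leftmost insertion position always returns an index, even when the needle is absent, so the result has to be checked against the needle. Here is a minimal, std-only sketch of that idea for an ascending `i64` slice. `index_of_sorted` is a hypothetical name, and the sketch uses `slice::partition_point` in place of Polars' `search_sorted()`.

```rust
/// Find the first index of `needle` in an ascending-sorted slice.
fn index_of_sorted(haystack: &[i64], needle: i64) -> Option<usize> {
    // Leftmost insertion point: the number of elements strictly less
    // than `needle`. This is well defined even if `needle` is absent.
    let idx = haystack.partition_point(|v| *v < needle);
    // Verify that the element at the insertion point really is the needle;
    // `get` also covers the case where `idx` is one past the end.
    (haystack.get(idx) == Some(&needle)).then_some(idx)
}
```

For a descending slice the predicate would flip to `*v > needle`; the verification step stays the same.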
12 changes: 12 additions & 0 deletions crates/polars-plan/src/dsl/function_expr/mod.rs
@@ -34,6 +34,8 @@ mod ewm_by;
mod fill_null;
#[cfg(feature = "fused")]
mod fused;
#[cfg(feature = "index_of")]
mod index_of;
mod list;
#[cfg(feature = "log")]
mod log;
@@ -154,6 +156,8 @@ pub enum FunctionExpr {
Hash(u64, u64, u64, u64),
#[cfg(feature = "arg_where")]
ArgWhere,
#[cfg(feature = "index_of")]
IndexOf,
#[cfg(feature = "search_sorted")]
SearchSorted(SearchSortedSide),
#[cfg(feature = "range")]
@@ -395,6 +399,8 @@ impl Hash for FunctionExpr {
#[cfg(feature = "business")]
Business(f) => f.hash(state),
Pow(f) => f.hash(state),
#[cfg(feature = "index_of")]
IndexOf => {},
#[cfg(feature = "search_sorted")]
SearchSorted(f) => f.hash(state),
#[cfg(feature = "random")]
@@ -640,6 +646,8 @@ impl Display for FunctionExpr {
Hash(_, _, _, _) => "hash",
#[cfg(feature = "arg_where")]
ArgWhere => "arg_where",
#[cfg(feature = "index_of")]
IndexOf => "index_of",
#[cfg(feature = "search_sorted")]
SearchSorted(_) => "search_sorted",
#[cfg(feature = "range")]
@@ -929,6 +937,10 @@ impl From<FunctionExpr> for SpecialEq<Arc<dyn ColumnsUdf>> {
ArgWhere => {
wrap!(arg_where::arg_where)
},
#[cfg(feature = "index_of")]
IndexOf => {
map_as_slice!(index_of::index_of)
},
#[cfg(feature = "search_sorted")]
SearchSorted(side) => {
map_as_slice!(search_sorted::search_sorted_impl, side)
2 changes: 2 additions & 0 deletions crates/polars-plan/src/dsl/function_expr/schema.rs
@@ -49,6 +49,8 @@ impl FunctionExpr {
Hash(..) => mapper.with_dtype(DataType::UInt64),
#[cfg(feature = "arg_where")]
ArgWhere => mapper.with_dtype(IDX_DTYPE),
#[cfg(feature = "index_of")]
IndexOf => mapper.with_dtype(IDX_DTYPE),
#[cfg(feature = "search_sorted")]
SearchSorted(_) => mapper.with_dtype(IDX_DTYPE),
#[cfg(feature = "range")]
22 changes: 22 additions & 0 deletions crates/polars-plan/src/dsl/mod.rs
@@ -377,6 +377,28 @@ impl Expr {
)
}

#[cfg(feature = "index_of")]
/// Find the index of a value.
pub fn index_of<E: Into<Expr>>(self, element: E) -> Expr {
let element = element.into();
Expr::Function {
input: vec![self, element],
function: FunctionExpr::IndexOf,
options: FunctionOptions {
flags: FunctionFlags::default() | FunctionFlags::RETURNS_SCALAR,
fmt_str: "index_of",
cast_options: FunctionCastOptions {
supertype: Some(
(SuperTypeFlags::default() & !SuperTypeFlags::ALLOW_PRIMITIVE_TO_STRING)
.into(),
),
..Default::default()
},
..Default::default()
},
}
}

#[cfg(feature = "search_sorted")]
/// Find indices where elements should be inserted to maintain order.
pub fn search_sorted<E: Into<Expr>>(self, element: E, side: SearchSortedSide) -> Expr {
2 changes: 2 additions & 0 deletions crates/polars-python/Cargo.toml
@@ -134,6 +134,7 @@ repeat_by = ["polars/repeat_by"]

streaming = ["polars/streaming"]
meta = ["polars/meta"]
index_of = ["polars/index_of"]
search_sorted = ["polars/search_sorted"]
decompress = ["polars/decompress-fast"]
regex = ["polars/regex"]
@@ -211,6 +212,7 @@ operations = [
"asof_join",
"cross_join",
"pct_change",
"index_of",
"search_sorted",
"merge_sorted",
"top_k",
6 changes: 6 additions & 0 deletions crates/polars-python/src/expr/general.rs
@@ -318,13 +318,19 @@ impl PyExpr {
self.inner.clone().arg_min().into()
}

#[cfg(feature = "index_of")]
fn index_of(&self, element: Self) -> Self {
self.inner.clone().index_of(element.inner).into()
}

#[cfg(feature = "search_sorted")]
fn search_sorted(&self, element: Self, side: Wrap<SearchSortedSide>) -> Self {
self.inner
.clone()
.search_sorted(element.inner, side.0)
.into()
}

fn gather(&self, idx: Self) -> Self {
self.inner.clone().gather(idx.inner).into()
}
2 changes: 2 additions & 0 deletions crates/polars-python/src/lazyframe/visitor/expr_nodes.rs
@@ -1097,6 +1097,8 @@ pub(crate) fn into_py(py: Python<'_>, expr: &AExpr) -> PyResult<PyObject> {
("hash", seed, seed_1, seed_2, seed_3).into_py_any(py)
},
FunctionExpr::ArgWhere => ("argwhere",).into_py_any(py),
#[cfg(feature = "index_of")]
FunctionExpr::IndexOf => ("index_of",).into_py_any(py),
#[cfg(feature = "search_sorted")]
FunctionExpr::SearchSorted(side) => (
"search_sorted",