-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make random data in Python tests deterministic #14071
base: branch-23.12
Are you sure you want to change the base?
Changes from all commits
c1689dc
aa8d4c8
52f439a
10e9e41
a87641f
515ceac
614f911
6bcd0e8
6e35758
29ccfbb
fbd023d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3346,7 +3346,7 @@ def sample( | |
frac=None, | ||
replace=False, | ||
weights=None, | ||
random_state=None, | ||
random_state=1, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. issue: Similarly here, I don't think we should set a specific seed as a default argument to sample. This is also creating a difference in the default API wrt pandas (which defaults to None https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) |
||
axis=None, | ||
ignore_index=False, | ||
): | ||
|
@@ -3387,7 +3387,7 @@ def sample( | |
equal to the number of rows to sample from, and will be normalized | ||
to have a sum of 1. Unlike pandas, index alignment is not currently | ||
not performed. | ||
random_state : int, numpy/cupy RandomState, or None, default None | ||
random_state : int, numpy/cupy RandomState, or None, default 1 | ||
If None, default cupy random state is chosen. | ||
If int, the seed for the default cupy random state. | ||
If RandomState, rows-to-sample are generated from the RandomState. | ||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,4 +1,4 @@ | ||||||
# Copyright (c) 2020-2022, NVIDIA CORPORATION. | ||||||
# Copyright (c) 2020-2023, NVIDIA CORPORATION. | ||||||
|
||||||
import numpy as np | ||||||
import pandas as pd | ||||||
|
@@ -18,7 +18,7 @@ def timeseries( | |||||
freq="1s", | ||||||
dtypes=None, | ||||||
nulls_frequency=0, | ||||||
seed=None, | ||||||
seed=1, | ||||||
): | ||||||
"""Create timeseries dataframe with random data | ||||||
|
||||||
|
@@ -80,7 +80,7 @@ def timeseries( | |||||
return gdf | ||||||
|
||||||
|
||||||
def randomdata(nrows=10, dtypes=None, seed=None): | ||||||
def randomdata(nrows=10, dtypes=None, seed=1): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. note (non-blocking): I am on the fence about these defaults. I suppose it is OK. Perhaps better would be to flip this to a no-default keyword only argument, forcing the caller to specify a seed:
Suggested change
|
||||||
"""Create a dataframe with random data | ||||||
|
||||||
Parameters | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -181,8 +181,8 @@ def test_ufunc_series(request, ufunc, has_nulls, indexed): | |
# Converting nullable integer cudf.Series to pandas will produce a | ||
# float pd.Series, so instead we replace nulls with an arbitrary | ||
# integer value, precompute the mask, and then reapply it afterwards. | ||
for arg in args: | ||
set_random_null_mask_inplace(arg) | ||
for idx, arg in enumerate(args): | ||
set_random_null_mask_inplace(arg, seed=idx) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. seed=idx to ensure different null masks for different columns |
||
pandas_args = [arg.fillna(0) for arg in args] | ||
|
||
# Note: Different indexes must be aligned before the mask is computed. | ||
|
@@ -261,8 +261,8 @@ def test_binary_ufunc_series_array( | |
# Converting nullable integer cudf.Series to pandas will produce a | ||
# float pd.Series, so instead we replace nulls with an arbitrary | ||
# integer value, precompute the mask, and then reapply it afterwards. | ||
for arg in args: | ||
set_random_null_mask_inplace(arg) | ||
for idx, arg in enumerate(args): | ||
set_random_null_mask_inplace(arg, seed=idx) | ||
|
||
# Cupy doesn't support nulls, so we fill with nans before converting. | ||
args[1] = args[1].fillna(cp.nan) | ||
|
@@ -403,8 +403,8 @@ def test_ufunc_dataframe(request, ufunc, has_nulls, indexed): | |
# Converting nullable integer cudf.Series to pandas will produce a | ||
# float pd.Series, so instead we replace nulls with an arbitrary | ||
# integer value, precompute the mask, and then reapply it afterwards. | ||
for arg in args: | ||
set_random_null_mask_inplace(arg["foo"]) | ||
for idx, arg in enumerate(args): | ||
set_random_null_mask_inplace(arg["foo"], seed=idx) | ||
pandas_args = [arg.copy() for arg in args] | ||
for arg in pandas_args: | ||
arg["foo"] = arg["foo"].fillna(0) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I am not sure I like this change, it means that user code that previously worked to draw a sequence of independent samples from groupby objects now always returns the same result for each sample.