-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make random data in Python tests deterministic #14071
base: branch-23.12
Are you sure you want to change the base?
Conversation
…bug-deterministic-tests
for arg in args: | ||
set_random_null_mask_inplace(arg) | ||
for idx, arg in enumerate(args): | ||
set_random_null_mask_inplace(arg, seed=idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seed=idx to ensure different null masks for different columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realise that tracking down all uses of random sampling in the test suite is a big thing, and providing a default fixed seed everywhere is a pragmatic choice to get deterministic tests, but I think I don't want to break API compatibility with pandas for the two sample calls.
@@ -950,7 +950,7 @@ def sample( | |||
frac: Optional[float] = None, | |||
replace: bool = False, | |||
weights: Union[abc.Sequence, "cudf.Series", None] = None, | |||
random_state: Union[np.random.RandomState, int, None] = None, | |||
random_state: Union[np.random.RandomState, int, None] = 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I am not sure I like this change, it means that user code that previously worked to draw a sequence of independent samples from groupby objects now always returns the same result for each sample.
@@ -3346,7 +3346,7 @@ def sample( | |||
frac=None, | |||
replace=False, | |||
weights=None, | |||
random_state=None, | |||
random_state=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: Similarly here, I don't think we should set a specific seed as a default argument to sample. This is also creating a difference in the default API wrt pandas (which defaults to None https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)
@@ -80,7 +80,7 @@ def timeseries( | |||
return gdf | |||
|
|||
|
|||
def randomdata(nrows=10, dtypes=None, seed=None): | |||
def randomdata(nrows=10, dtypes=None, seed=1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note (non-blocking): I am on the fence about these defaults. I suppose it is OK. Perhaps better would be to flip this to a no-default keyword only argument, forcing the caller to specify a seed:
def randomdata(nrows=10, dtypes=None, seed=1): | |
def randomdata(nrows=10, dtypes=None, *, seed): |
Description
Some random data generators in cuDF default to
seed=None
, which means that an OS or time dependent seed is used, leading to different test data between systems/runs.This PR changes the default to a fixed integer so that the same data is always generated.
Contributes to #17045.
Checklist