Skip to content
This repository has been archived by the owner on Jan 14, 2025. It is now read-only.

set_index calls quietly sort the Ensemble dataframes #242

Closed
dougbrn opened this issue Sep 28, 2023 · 4 comments
Closed

set_index calls quietly sort the Ensemble dataframes #242

dougbrn opened this issue Sep 28, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@dougbrn
Copy link
Collaborator

dougbrn commented Sep 28, 2023

Discovered this recently that the default behavior of dd.set_index is to sort the dataframe based on the index values. This introduces a costly overhead to the workflow, and is needless when the user already has their data sorted. In principle, TAPE should be able to function without a sorted index. We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.

Additional note, if we are sorting the dataframes, it may be worthwhile to investigate generation of division information.

@dougbrn dougbrn added the bug Something isn't working label Sep 28, 2023
@hombit
Copy link
Contributor

hombit commented Sep 29, 2023

We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.

The third option, which could be a default behavior, is checking the index to be sorted while reading the data. This is significantly cheaper than force-sorting, and would inform user if data must be sorted.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Sep 29, 2023

Yeah it would be nice if we could scan and check for that. I'm not sure how costly it will be since Dask will have to actually verify that it is sorted, and that is an operation that can't be done lazily.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Oct 25, 2023

Auto-sorting is removed as the default behavior in #276, and users now opt in to it via the sort and sorted flags in data loader functions a la Dask.

There is the remaining question of whether to do a sort check on data load, per @hombit. I agree that it isn't too costly, but am not sure about the sequencing. If it checks to see if the data is sorted on load, the check will be triggered by the call where the user already has the opportunity to specify whether the data is sorted or not. This means they might need to immediately reload the ensemble data with a different kwarg set. Maybe this is fine? Having the ability to sort (#247 ) might also resolve this, as a user may load data, see that it's not sorted, and then use the sort call if it's wanted.

@dougbrn
Copy link
Collaborator Author

dougbrn commented Nov 29, 2023

Going to close this as the original scope of the issue has been addressed.

@dougbrn dougbrn closed this as completed Nov 29, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants