This repository has been archived by the owner on Jan 14, 2025. It is now read-only.
I recently discovered that the default behavior of dd.set_index is to sort the dataframe by the index values. This adds a costly overhead to the workflow and is unnecessary when the user's data is already sorted. In principle, TAPE should be able to function without a sorted index. We should consider how best to make this sorting an optional feature, so that users can skip it for datasets that don't require it.
As an additional note, if we are sorting the dataframes, it may be worthwhile to investigate generating division information.
> We should consider how to best implement this sorting functionality as an optional feature, and give users the ability to not do it for datasets that don't require it.
A third option, which could be the default behavior, is to check whether the index is sorted while reading the data. This is significantly cheaper than force-sorting, and it would tell the user whether the data needs to be sorted.
Yeah, it would be nice if we could scan and check for that. I'm not sure how costly it will be, since Dask will have to actually verify that the index is sorted, and that operation can't be done lazily.
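To make the cost trade-off concrete, here is a rough pure-Python sketch of what such a check involves. It is not Dask or TAPE code: partitions are modeled as plain lists of index values, and the function names are hypothetical. The check is a single linear pass over every partition, which is cheaper than a sort but still has to read all of the index values, so it cannot be deferred the way a lazy operation can. If the data passes, the same pass yields division information in the Dask convention (the minimum of each partition plus the maximum of the last one).

```python
from typing import Optional, Sequence, Tuple


def is_sorted(values: Sequence) -> bool:
    """Return True if values are in non-decreasing order (one linear pass)."""
    return all(a <= b for a, b in zip(values, values[1:]))


def divisions_if_sorted(partitions: Sequence[Sequence]) -> Optional[Tuple]:
    """Return divisions if the partitions are globally sorted, else None.

    Hypothetical helper: each partition is a list of index values.
    """
    non_empty = [part for part in partitions if len(part) > 0]
    if not non_empty:
        return None

    divisions = []
    prev_max = None
    for part in non_empty:
        # Each partition must be sorted internally...
        if not is_sorted(part):
            return None
        # ...and must not start below the end of the previous partition.
        if prev_max is not None and part[0] < prev_max:
            return None
        divisions.append(part[0])
        prev_max = part[-1]

    # Close the divisions with the maximum of the last partition.
    divisions.append(prev_max)
    return tuple(divisions)


sorted_parts = [[1, 2, 3], [3, 5], [6]]
unsorted_parts = [[1, 2], [0, 5]]
print(divisions_if_sorted(sorted_parts))    # (1, 3, 6, 6)
print(divisions_if_sorted(unsorted_parts))  # None
```

A result of None would be the signal to warn the user that the data needs sorting, rather than sorting it silently.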
Auto-sorting is removed as the default behavior in #276, and users now opt in to it via the sort and sorted flags in data loader functions a la Dask.
There is the remaining question of whether to do a sort check on data load, per @hombit. I agree that it isn't too costly, but I am not sure about the sequencing. If the check runs on load, it is triggered by the same call in which the user already has the opportunity to specify whether the data is sorted. This means they might need to immediately reload the ensemble data with a different set of kwargs. Maybe this is fine? Having the ability to sort (#247) might also resolve this, as a user may load data, see that it's not sorted, and then use the sort call if it's wanted.