Feature request
As far as I understand, automated construction of splits for Hub datasets is currently based on either file names or directory structure (as described here).
It would be useful to also allow splits to be defined by identifiers of individual examples.
This could be configured like:
{"split_name": {"column_name": [column values in split]}}
(This in turn requires a unique 'index' column, which could be explicitly supported or simply assumed to be defined appropriately by the user.)
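As a rough sketch, the README.md config for such a feature might look like the following (the split_indices key and example_id column are invented here purely for illustration; nothing like this exists today):

configs:
- config_name: my_first_set_of_splits
  data_files: data/*.parquet
  split_indices:            # hypothetical key, not currently supported
    train:
      example_id: [0, 1, 2, 3]
    test:
      example_id: [4, 5]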
I guess a potential downside would be that shards would end up spanning different splits. Is this something that can be handled somehow? Would it only affect streaming from the Hub?
Motivation
The main motivation is that all data files could be stored in a single directory, and multiple sets of splits could be generated from the same data. This is often useful for large datasets with multiple distinct sets of splits.
This could all be configured via the README.md YAML configs.
Your contribution
I may be able to contribute if this seems like a good idea.
Currently, the YAML-configured splits only allow specifying which file names go to each split (see the documented pattern sketched below).
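For reference, the existing filename-based configuration in README.md looks roughly like this (the paths are placeholders):

configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*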
But what if I know which individual examples I want to put in each split?
I could build split-specific files; however, for large datasets with overlapping splits (e.g. multiple sets of splits), this could result in significant duplication of data.
I can see that this could well be intended (i.e. to discourage overlapping splits), but I wondered whether some support for defining splits from individual identifiers is something that could be considered.
You can create such a dataset by adapting this code, for example:
from datasets import Dataset, DatasetDict

# upload the full dataset (full_dataset is assumed to be an existing Dataset)
full_dataset.push_to_hub("username/dataset")

# then upload the indices for each set of splits as a separate config
DatasetDict({
    "train": Dataset.from_dict({"indices": [0, 1, 2, 3]}),
    "test": Dataset.from_dict({"indices": [4, 5]}),
}).push_to_hub("username/dataset", "my_first_set_of_split")
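A minimal sketch of rebuilding the splits from those indices on the consumer side (assuming the repo and config names above):

from datasets import load_dataset

# load the full data and the index lists for one set of splits
full = load_dataset("username/dataset", split="train")
splits = load_dataset("username/dataset", "my_first_set_of_split")

# select the rows belonging to each split by index
train_set = full.select(splits["train"]["indices"])
test_set = full.select(splits["test"]["indices"])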