
Support for identifier-based automated split construction #7287

Open
alex-hh opened this issue Nov 10, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@alex-hh
Contributor

alex-hh commented Nov 10, 2024

Feature request

As far as I understand, automated construction of splits for Hub datasets is currently based on either file names or directory structure (as described here).

It would also be useful to allow splits to be defined by identifiers of individual examples.

This could be configured like:
{"split_name": {"column_name": [column values in split]}}

(This in turn requires a unique 'index' column, which could be explicitly supported or simply assumed to be defined appropriately by the user.)

A potential downside is that shards could end up spanning different splits. Is this something that can be handled, and would it only affect streaming from the Hub?
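The intended semantics of the proposed spec could be sketched in plain Python. This is only an illustration under the assumption that each example carries a unique identifier column; the rows, column names, and helper function here are hypothetical, not part of the `datasets` library.

```python
# Hypothetical sketch of the proposed identifier-based split spec.
# Rows are plain dicts standing in for dataset examples.
rows = [
    {"id": "a", "value": 1},
    {"id": "b", "value": 2},
    {"id": "c", "value": 3},
]

# Proposed config shape: {"split_name": {"column_name": [column values in split]}}
split_spec = {
    "train": {"id": ["a", "b"]},
    "test": {"id": ["c"]},
}

def build_splits(rows, split_spec):
    """Partition rows into splits by matching values of an identifier column."""
    splits = {}
    for split_name, spec in split_spec.items():
        # Each split names one identifier column and the values belonging to it.
        (column, values), = spec.items()
        wanted = set(values)
        splits[split_name] = [row for row in rows if row[column] in wanted]
    return splits

splits = build_splits(rows, split_spec)
# splits["train"] contains the rows with id "a" and "b"; splits["test"] the row with id "c"
```

Note that nothing here enforces that the splits are disjoint or that they cover all rows, which mirrors the open question about overlapping splits above.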

Motivation

The main motivation would be that all data files could be stored in a single directory, and multiple sets of splits could be generated from the same data. This is often useful for large datasets with multiple distinct sets of splits.

This could all be configured via the README.md YAML configs.

Your contribution

I may be able to contribute if this seems like a good idea.

@alex-hh alex-hh added the enhancement New feature or request label Nov 10, 2024
@lhoestq
Member

lhoestq commented Nov 18, 2024

Hi! You can already configure the README.md to define multiple sets of splits, e.g.:

configs:
- config_name: my_first_set_of_split
  data_files:
  - split: train
    path: "*.csv"
- config_name: my_second_set_of_split
  data_files:
  - split: train
    path: train-*.csv
  - split: test
    path: test-*.csv

@alex-hh
Contributor Author

alex-hh commented Nov 18, 2024

Hi - I had something slightly different in mind:

Currently, YAML configs like this only specify which file names to pass to each split. But what if I know which individual training examples I want to put in each split?

I could build split-specific files; however, for large datasets with overlapping splits (e.g. multiple sets of splits), this could result in significant duplication of data.

I can see that this could well be intended (i.e. to discourage overlapping splits), but I wondered whether support for splits based on individual identifiers could be considered.

@lhoestq
Member

lhoestq commented Nov 19, 2024

This is not supported right now :/ Though you can load the data in two steps, like this:

from datasets import load_dataset

# Load the full data (default config) and the indices for one set of splits
full_dataset = load_dataset("username/dataset", split="train")
my_first_set_indices = load_dataset("username/dataset", "my_first_set_of_split", split="train")

# Select the examples belonging to this set by their indices
my_first_set = full_dataset.select(my_first_set_indices["indices"])

You can create such a dataset by adapting this code, for example:

from datasets import Dataset, DatasetDict

# upload the full dataset
full_dataset.push_to_hub("username/dataset")
# then upload the indices for each set
DatasetDict({
    "train": Dataset.from_dict({"indices": [0, 1, 2, 3]}),
    "test": Dataset.from_dict({"indices": [4, 5]}),
}).push_to_hub("username/dataset", "my_first_set_of_split")
