Add general example of split/subset on the main "Data Files Configuration" page #1538

Merged
merged 2 commits into from
Dec 19, 2024
24 changes: 22 additions & 2 deletions docs/hub/datasets-data-files-configuration.md
Expand Up @@ -11,17 +11,37 @@ Machine learning datasets typically have splits and may also have subsets. A dat

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## File names and splits
## Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with the `train`, `test`, and `validation` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).
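As a rough illustration of the idea, the sketch below groups the files from the example repository by split name. It is a simplified stand-in, not the Hub's actual detection logic: `SPLIT_NAMES`, `guess_split`, and `group_by_split` are hypothetical names, and the sketch only matches a file whose stem is exactly a split name.

```python
from __future__ import annotations

from pathlib import PurePosixPath

# Hypothetical set of split names for this sketch; the real matching rules
# are described in the "File names and splits" documentation.
SPLIT_NAMES = {"train", "test", "validation"}


def guess_split(path: str) -> str | None:
    """Return the split name if the file stem equals one, else None."""
    stem = PurePosixPath(path).stem
    return stem if stem in SPLIT_NAMES else None


def group_by_split(paths: list[str]) -> dict[str, list[str]]:
    """Group data files by guessed split, skipping files with no match."""
    splits: dict[str, list[str]] = {}
    for path in paths:
        split = guess_split(path)
        if split is not None:
            splits.setdefault(split, []).append(path)
    return splits


files = ["README.md", "train.csv", "test.csv", "validation.csv"]
detected = group_by_split(files)
print(detected)
```

`README.md` is skipped because its stem is not a split name; each CSV file lands in the split it is named after.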

## Manual configuration
## Manual splits and subsets configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
This is useful when you want to specify manually which file goes into which split.

You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).

Here is an example of a configuration defining a subset called "benchmark" with a `test` split.

```yaml
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
```

See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. See also the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).
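To make the mapping concrete, here is a minimal sketch of how the "benchmark" configuration above resolves a subset and split to data files. The dictionary mirrors what a YAML parser would return for that snippet; `files_for` is a hypothetical helper written for this illustration, not part of any library.

```python
# Plain-Python mirror of the YAML configuration shown above.
config = {
    "configs": [
        {
            "config_name": "benchmark",
            "data_files": [{"split": "test", "path": "benchmark.csv"}],
        }
    ]
}


def files_for(config, subset, split):
    """Return the paths declared for one split of one subset."""
    for entry in config["configs"]:
        if entry["config_name"] == subset:
            return [f["path"] for f in entry["data_files"] if f["split"] == split]
    raise KeyError(f"unknown subset: {subset}")


paths = files_for(config, "benchmark", "test")
print(paths)
```

With the `datasets` library, the same subset and split would typically be requested with `load_dataset("<repo_id>", "benchmark", split="test")`, where `<repo_id>` is a placeholder for your repository.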

## Supported file formats
27 changes: 15 additions & 12 deletions docs/hub/datasets-manual-configuration.md
Expand Up @@ -103,6 +103,21 @@ configs:
---
```

<Tip>

You can set a default subset using `default: true`:

```yaml
- config_name: main_data
  data_files: "main_data.csv"
  default: true
```

This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.

</Tip>


## Builder parameters

Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:
Expand All @@ -120,15 +135,3 @@ configs:
```

Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.
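To see why the separator matters, the following sketch parses the same tab-separated text twice using only Python's standard `csv` module. The sample text and the `read_rows` helper are made up for this illustration; they are not how the builders are implemented.

```python
import csv
import io

# Hypothetical file contents: a header and one row, separated by tabs.
raw = "id\ttext\n1\thello, world\n"


def read_rows(text, sep):
    """Parse delimited text into a list of rows using the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=sep))


with_tab = read_rows(raw, "\t")    # correct separator: two clean columns
with_comma = read_rows(raw, ",")   # wrong separator: columns are mangled

print(with_tab)
print(with_comma)
```

With the tab separator each row has two fields. With a comma, the header collapses into a single field and the data row is split inside the text value, which is the kind of mismatch a per-subset separator setting avoids.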

<Tip>

You can set a default subset using `default: true`:

```yaml
- config_name: main_data
  data_files: "main_data.csv"
  default: true
```

</Tip>
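As a small sketch of what the flag expresses, the snippet below picks the subset marked `default: true` from a parsed list of configs. The second subset (`additional_data`) and the `default_subset` helper are hypothetical, and returning `None` or raising an error are choices made for this sketch rather than documented behavior.

```python
# Parsed form of a `configs` list; "additional_data" is a made-up second subset.
configs = [
    {"config_name": "main_data", "data_files": "main_data.csv", "default": True},
    {"config_name": "additional_data", "data_files": "additional_data.csv"},
]


def default_subset(configs):
    """Return the name of the subset flagged as default, or None if none is."""
    flagged = [c["config_name"] for c in configs if c.get("default", False)]
    if len(flagged) > 1:
        # This sketch treats several flagged subsets as a configuration error.
        raise ValueError(f"more than one default subset: {flagged}")
    return flagged[0] if flagged else None


default_name = default_subset(configs)
print(default_name)
```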