Add general example of split/subset on the main "Data Files Configuration" page (#1538)

* add general example of split/subset

* minor
lhoestq authored Dec 19, 2024
1 parent 859b2d7 commit 1070319
Showing 2 changed files with 37 additions and 14 deletions.
24 changes: 22 additions & 2 deletions docs/hub/datasets-data-files-configuration.md
@@ -11,17 +11,37 @@ Machine learning datasets typically have splits and may also have subsets. A dat

![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)

## File names and splits
## Automatic splits detection

Splits are automatically detected based on file and directory names. For example, this is a dataset with the `train`, `test`, and `validation` splits:

```
my_dataset_repository/
├── README.md
├── train.csv
├── test.csv
└── validation.csv
```

To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).

## Manual configuration
## Manual splits and subsets configuration

You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
It is useful if you want to specify which file goes into which split manually.

You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).

Here is an example of a configuration defining a subset called "benchmark" with a `test` split.

```yaml
configs:
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
```
See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. See also the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).
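
As an illustration that is not part of this commit, the sketch below combines splits and subsets in a single `configs` block. It maps the three CSV files from the automatic detection example to splits explicitly, and the "benchmark" subset additionally passes a builder parameter. The subset name `default`, the `sep` value, and the premise that `benchmark.csv` is tab-separated are assumptions made for the example. The block is assumed to sit in the YAML header at the top of the dataset's `README.md`.

```yaml
configs:
# Subset 1: explicit version of the train/test/validation layout
- config_name: default
  data_files:
  - split: train
    path: train.csv
  - split: test
    path: test.csv
  - split: validation
    path: validation.csv
# Subset 2: a single test split, read with a tab separator (assumed for illustration)
- config_name: benchmark
  data_files:
  - split: test
    path: benchmark.csv
  sep: "\t"
```

Spelling out the mapping this way is mainly useful when file names do not match the split names, or when different subsets need different loading parameters.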
## Supported file formats
27 changes: 15 additions & 12 deletions docs/hub/datasets-manual-configuration.md
@@ -103,6 +103,21 @@ configs:
---
```

<Tip>

You can set a default subset using `default: true`:

```yaml
- config_name: main_data
  data_files: "main_data.csv"
  default: true
```
This is useful for setting which subset the Dataset Viewer shows first and which subset data libraries load by default.
</Tip>
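
For context, here is a slightly fuller sketch (not part of this commit) that shows `default: true` alongside a second subset. The `additional_data` subset and its file name are hypothetical, and the block is assumed to live in the YAML header of the dataset's `README.md`.

```yaml
configs:
# Shown first in the Dataset Viewer and loaded when no subset name is given
- config_name: main_data
  data_files: "main_data.csv"
  default: true
# Hypothetical second subset; it has to be requested by name
- config_name: additional_data
  data_files: "additional_data.csv"
```

Only one subset should normally carry `default: true`, since the flag exists to single out one of several subsets.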
## Builder parameters
In addition to `data_files`, other builder-specific parameters can be passed via YAML, allowing more flexibility in how the data is loaded without requiring any custom code. For example, you can define which separator each subset uses to load its `csv` files:
@@ -120,15 +135,3 @@ configs:
```

Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.

<Tip>

You can set a default subset using `default: true`

```yaml
- config_name: main_data
  data_files: "main_data.csv"
  default: true
```
</Tip>
