From 285d4c666c4a09f9909c166ada093dbf40e15be6 Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 19 Dec 2024 16:17:42 +0100
Subject: [PATCH 1/2] add general example of split/subset

---
 docs/hub/datasets-data-files-configuration.md | 24 +++++++++++++++--
 docs/hub/datasets-manual-configuration.md | 27 ++++++++++---------
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
index a60009ace..06bdb7335 100644
--- a/docs/hub/datasets-data-files-configuration.md
+++ b/docs/hub/datasets-data-files-configuration.md
@@ -11,17 +11,37 @@ Machine learning datasets typically have splits and may also have subsets. A dat
 
 ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)
 
-## File names and splits
+## Automatic splits and subsets detection
+
+Splits are automatically detected based on file and directory names. For example, this is a dataset with the `train`, `test`, and `validation` splits:
+
+```
+my_dataset_repository/
+├── README.md
+├── train.csv
+├── test.csv
+└── validation.csv
+```
 
 To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).
 
-## Manual configuration
+## Manual splits and subsets configuration
 
 You can choose the data files to show in the Dataset Viewer for your dataset using YAML. It is useful if you want to specify which file goes into which split manually. You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).
 
+Here is an example of a configuration defining a subset called "benchmark" with a `test` split.
+
+```yaml
+configs:
+- config_name: benchmark
+  data_files:
+  - split: test
+    path: benchmark.csv
+```
+
 See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Look also to the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).
 
 ## Supported file formats
diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md
index b75631351..e4ba4dde1 100644
--- a/docs/hub/datasets-manual-configuration.md
+++ b/docs/hub/datasets-manual-configuration.md
@@ -103,6 +103,21 @@ configs:
 ---
 ```
 
+
+
+You can set a default subset using `default: true`
+
+```yaml
+- config_name: main_data
+  data_files: "main_data.csv"
+  default: true
+```
+
+This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.
+
+
+
+
 ## Builder parameters
 
 Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:
@@ -120,15 +135,3 @@ configs:
 ```
 
 Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.
-
-
-
-You can set a default subset using `default: true`
-
-```yaml
-- config_name: main_data
-  data_files: "main_data.csv"
-  default: true
-```
-
-

From 813aebb9fc2606afed1ee13d3c94c6d654a30a5b Mon Sep 17 00:00:00 2001
From: Quentin Lhoest
Date: Thu, 19 Dec 2024 16:21:48 +0100
Subject: [PATCH 2/2] minor

---
 docs/hub/datasets-data-files-configuration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
index 06bdb7335..2fb7bd184 100644
--- a/docs/hub/datasets-data-files-configuration.md
+++ b/docs/hub/datasets-data-files-configuration.md
@@ -11,7 +11,7 @@ Machine learning datasets typically have splits and may also have subsets. A dat
 
 ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)
 
-## Automatic splits and subsets detection
+## Automatic splits detection
 
 Splits are automatically detected based on file and directory names. For example, this is a dataset with the `train`, `test`, and `validation` splits: