Add general example of split/subset on the main "Data Files Configuration" page (#1538)

* add general example of split/subset
* minor
huggingface · Dec 19, 2024 · commit 1070319
1 parent 859b2d7
Showing 2 changed files with 37 additions and 14 deletions.
diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md
@@ -11,17 +11,37 @@ Machine learning datasets typically have splits and may also have subsets. A dat
 
 ![split-configs-server](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif)
 
-## File names and splits
+## Automatic splits detection
+
+Splits are automatically detected based on file and directory names. For example, this is a dataset with `train`, `test`, and `validation` splits:
+
+```
+my_dataset_repository/
+├── README.md
+├── train.csv
+├── test.csv
+└── validation.csv
+```
 
 To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation and the [companion collection of example datasets](https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135).
 
-## Manual configuration
+## Manual splits and subsets configuration
 
 You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
 It is useful if you want to specify which file goes into which split manually.
 
 You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).
 
+Here is an example of a configuration defining a subset called "benchmark" with a `test` split.
+
+```yaml
+configs:
+- config_name: benchmark
+  data_files:
+  - split: test
+    path: benchmark.csv
+```
+
 See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. Look also to the [example datasets](https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87).
 
 ## Supported file formats
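
Reviewer note: the "Automatic splits detection" section added above says splits are inferred from file and directory names. A minimal sketch of that idea, assuming a simple exact-match rule on the file stem or a parent directory name, might look like the following. The `SPLIT_KEYWORDS` table and the `guess_split` and `group_by_split` helpers are hypothetical names for illustration only; the Hub's actual rules are described on the "File names and splits" page and are richer than this.

```python
from pathlib import PurePosixPath

# Hypothetical keyword table: only the three split names used in the
# example repository from the diff. This is NOT the Hub's real rule set.
SPLIT_KEYWORDS = {
    "train": "train",
    "test": "test",
    "validation": "validation",
}


def guess_split(path):
    """Return the split named by the file stem or a parent directory, else None."""
    p = PurePosixPath(path)
    # Check the file stem first, then parent directories from nearest to farthest.
    candidates = [p.stem, *reversed(p.parts[:-1])]
    for name in candidates:
        split = SPLIT_KEYWORDS.get(name.lower())
        if split is not None:
            return split
    return None


def group_by_split(paths):
    """Group repository file paths by guessed split, skipping unmatched files."""
    groups = {}
    for path in paths:
        split = guess_split(path)
        if split is not None:
            groups.setdefault(split, []).append(path)
    return groups


files = ["README.md", "train.csv", "test.csv", "validation.csv"]
splits = group_by_split(files)
```

With the example repository layout from the diff, `README.md` matches no keyword and is skipped, while each CSV file lands in the split matching its name.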

diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md
@@ -103,6 +103,21 @@ configs:
 ---
 ```
 
+<Tip>
+
+You can set a default subset using `default: true`
+
+```yaml
+- config_name: main_data
+  data_files: "main_data.csv"
+  default: true
+```
+
+This is useful to set which subset the Dataset Viewer shows first, and which subset data libraries load by default.
+
+</Tip>
+
+
 ## Builder parameters
 
 Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which subset to load your `csv` files:
@@ -120,15 +135,3 @@ configs:
 ```
 
 Refer to the [specific builders' documentation](/docs/datasets/package_reference/builder_classes) to see what parameters they have.
-
-<Tip>
-
-You can set a default subset using `default: true`
-
-```yaml
-- config_name: main_data
-  data_files: "main_data.csv"
-  default: true
-```
-
-</Tip>
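
Reviewer note: the relocated tip says `default: true` marks the subset that the Dataset Viewer shows first and that data libraries load by default. As a rough sketch of how a consumer could read that flag from an already-parsed `configs` list, assuming the list is available as plain Python dictionaries: `pick_default` is a hypothetical helper, the `extra` subset is made up for the example, and returning `None` when no subset is flagged is an assumption of this sketch rather than documented library behavior.

```python
def pick_default(configs):
    """Return the config_name of the first subset flagged `default: true`, else None."""
    for config in configs:
        # Only an explicit boolean True counts as the default marker here.
        if config.get("default") is True:
            return config["config_name"]
    # Assumption of this sketch: no fallback is applied when nothing is flagged.
    return None


# Parsed form of the YAML from the tip, plus one hypothetical extra subset.
configs = [
    {"config_name": "extra", "data_files": "extra.csv"},
    {"config_name": "main_data", "data_files": "main_data.csv", "default": True},
]
default_name = pick_default(configs)
```

Here the flagged subset is selected even though it is not listed first, which is the point of the `default` key.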