From 35884aee417291f53b33d5b99011a524ea9e4e02 Mon Sep 17 00:00:00 2001 From: Quentin Lhoest <42851186+lhoestq@users.noreply.github.com> Date: Wed, 15 Nov 2023 12:51:02 +0100 Subject: [PATCH] Dataset Viewer, Structure and Libraries docs (#1070) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * more datasets docs * add configure your dataset * minor * minor * update toc * minor * add dataset structure docs * rename sections * Apply suggestions from code review Co-authored-by: Sylvain Lesage Co-authored-by: Lucain Co-authored-by: Julien Chaumond * sylvain's comments: Adding a new dataset * sylvain-s comments: Configure the Dataset Viewer * sylvain's comments: File names and splits * sylvain's comments: Manual Configuration * sylvain's comments: Libraries * sylvain's comment: login * sylvain's comments: Using 🤗 Datasets * lucain's comment: Adding a new dataset * add duckdb write * add create repo step for dask and pandas * minor * minor * typo * Update docs/hub/_toctree.yml Co-authored-by: Mishig * Apply suggestions from code review Co-authored-by: Daniel van Strien * rename titles for consistency * Add Downloading Datasets * Apply suggestions from code review Co-authored-by: Polina Kazakova * more links to list of supported formats * move supported file formats to upload page * fix link to row * fix links * again * remove duplicate * move configure viewer to bottom of page * fix to_parquet example * minor fix * Update docs/hub/datasets-file-names-and-splits.md Co-authored-by: Lucain * Apply suggestions from code review Co-authored-by: Polina Kazakova --------- Co-authored-by: Sylvain Lesage Co-authored-by: Lucain Co-authored-by: Julien Chaumond Co-authored-by: Mishig Co-authored-by: Daniel van Strien Co-authored-by: Polina Kazakova --- docs/hub/_toctree.yml | 31 +++- docs/hub/datasets-adding.md | 103 +++++++++++-- docs/hub/datasets-dask.md | 47 ++++++ docs/hub/datasets-data-files-configuration.md | 29 ++++ docs/hub/datasets-downloading.md | 44 ++++++ docs/hub/datasets-duckdb.md | 41 ++++++ docs/hub/datasets-file-names-and-splits.md | 135 +++++++++++++++++ docs/hub/datasets-libraries.md | 15 ++ docs/hub/datasets-manual-configuration.md | 136 ++++++++++++++++++ docs/hub/datasets-pandas.md | 47 ++++++ docs/hub/datasets-usage.md | 29 +++- docs/hub/datasets-viewer-configure.md | 27 ++++ docs/hub/datasets-viewer.md | 28 ++-- docs/hub/datasets-webdataset.md | 20 +++ docs/hub/index.md | 8 +- 15 files changed, 710 insertions(+), 30 deletions(-) create mode 100644 docs/hub/datasets-dask.md create mode 100644 docs/hub/datasets-data-files-configuration.md create mode 100644 docs/hub/datasets-downloading.md create mode 100644 docs/hub/datasets-duckdb.md create mode 100644 docs/hub/datasets-file-names-and-splits.md create mode 100644 docs/hub/datasets-libraries.md create mode 100644 docs/hub/datasets-manual-configuration.md create mode 100644 docs/hub/datasets-pandas.md create mode 100644 docs/hub/datasets-viewer-configure.md create mode 100644 docs/hub/datasets-webdataset.md diff --git a/docs/hub/_toctree.yml b/docs/hub/_toctree.yml index 68760c34a..ad3406ade 100644 --- a/docs/hub/_toctree.yml +++ b/docs/hub/_toctree.yml @@ -127,12 +127,35 @@ title: Dataset Cards - local: datasets-gated title: Gated Datasets + - local: datasets-adding + title: Uploading Datasets + - local: datasets-downloading + title: Downloading Datasets + - local: datasets-libraries + title: Integrated Libraries + sections: + - local: datasets-dask + title: Dask + - local: datasets-usage + 
title: Datasets + - local: datasets-duckdb + title: DuckDB + - local: datasets-pandas + title: Pandas + - local: datasets-webdataset + title: WebDataset - local: datasets-viewer title: Dataset Viewer - - local: datasets-usage - title: Using Datasets - - local: datasets-adding - title: Adding New Datasets + sections: + - local: datasets-viewer-configure + title: Configure the Dataset Viewer + - local: datasets-data-files-configuration + title: Data files Configuration + sections: + - local: datasets-file-names-and-splits + title: File names and splits + - local: datasets-manual-configuration + title: Manual Configuration - local: spaces title: Spaces isExpanded: true diff --git a/docs/hub/datasets-adding.md b/docs/hub/datasets-adding.md index 1a7f01fc8..a7a10e6f5 100644 --- a/docs/hub/datasets-adding.md +++ b/docs/hub/datasets-adding.md @@ -1,13 +1,100 @@ -# Adding new datasets +# Uploading datasets -Any Hugging Face user can create a dataset! You can start by [creating your dataset repository](https://huggingface.co/new-dataset) and choosing one of the following methods to upload your dataset: +The [Hub](https://huggingface.co/datasets) is home to an extensive collection of community-curated and research datasets. We encourage you to share your dataset to the Hub to help grow the ML community and accelerate progress for everyone. All contributions are welcome; adding a dataset is just a drag and drop away! -* [Add files manually to the repository through the UI](https://huggingface.co/docs/datasets/upload_dataset#upload-your-files) -* [Push files with the `push_to_hub` method from 🤗 Datasets](https://huggingface.co/docs/datasets/upload_dataset#upload-from-python) -* [Use Git to commit and push your dataset files](https://huggingface.co/docs/datasets/share#clone-the-repository) +Start by [creating a Hugging Face Hub account](https://huggingface.co/join) if you don't have one yet. -While in many cases it's possible to just add raw data to your dataset repo in any supported formats (JSON, CSV, Parquet, text, images, audio files, …), for some large datasets you may want to [create a loading script](https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script). This script defines the different configurations and splits of your dataset, as well as how to download and process the data. +## Upload using the Hub UI -## Datasets outside a namespace +The Hub's web-based interface allows users without any developer experience to upload a dataset. -Datasets outside a namespace are maintained by the Hugging Face team. Unlike the naming convention used for community datasets (`username/dataset_name` or `org/dataset_name`), datasets outside a namespace can be referenced directly by their name (e.g. [`glue`](https://huggingface.co/datasets/glue)). If you find that an improvement is needed, use their "Community" tab to open a discussion or submit a PR on the Hub to propose edits. \ No newline at end of file +### Create a repository + +A repository hosts all your dataset files, including the revision history, making storing more than one dataset version possible. + +1. Click on your profile and select **New Dataset** to create a [new dataset repository](https://huggingface.co/new-dataset). +2. Pick a name for your dataset, and choose whether it is a public or private dataset. A public dataset is visible to anyone, whereas a private dataset can only be viewed by you or members of your organization. + +
+ +
+ +### Upload dataset + +1. Once you've created a repository, navigate to the **Files and versions** tab to add a file. Select **Add file** to upload your dataset files. We support many text, audio, and image data extensions, such as `.csv`, `.mp3`, and `.jpg` (see the full list [here](./datasets-viewer-configure)). + +
+ +
+ +2. Drag and drop your dataset files. + +
+ +
+ +3. After uploading your dataset files, they are stored in your dataset repository. + +
+ +
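+If you prefer to upload from a script rather than the UI, the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/index) client library can do the same thing (see also the dedicated section below). A minimal sketch, assuming your files live in a local `data/` folder and your repository is `username/my_dataset`:
+
+```python
+from huggingface_hub import HfApi
+
+api = HfApi()
+
+# Upload a single file to the dataset repository
+api.upload_file(
+    path_or_fileobj="data/train.csv",
+    path_in_repo="train.csv",
+    repo_id="username/my_dataset",
+    repo_type="dataset",
+)
+
+# Or upload the whole folder at once
+api.upload_folder(
+    folder_path="data",
+    repo_id="username/my_dataset",
+    repo_type="dataset",
+)
+```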
+ +### Create a Dataset card + +Adding a Dataset card is super valuable for helping users find your dataset and understand how to use it responsibly. + +1. Click on **Create Dataset Card** to create a [Dataset card](./datasets-cards). This button creates a `README.md` file in your repository. + +
+ +
+ +2. At the top, you'll see the **Metadata UI** with several fields to select from, such as license, language, and task categories. These are the most important tags for helping users discover your dataset on the Hub (when applicable). When you select an option for a field, it will be automatically added to the top of the dataset card. + + You can also look at the [Dataset Card specifications](https://github.com/huggingface/hub-docs/blob/main/datasetcard.md?plain=1), which have a complete set of allowed tags, including optional tags like `annotations_creators`, to help you choose the ones that are useful for your dataset. + +
+ +
+ +3. Write your dataset documentation in the Dataset Card to introduce your dataset to the community and help users understand what is inside: what are the use cases and limitations, where the data comes from, what are important ethical considerations, and any other relevant details. + + You can click on the **Import dataset card template** link at the top of the editor to automatically create a dataset card template. For a detailed example of what a good Dataset card should look like, take a look at the [CNN DailyMail Dataset card](https://huggingface.co/datasets/cnn_dailymail). + +### Dataset Viewer + +The [Dataset Viewer](./datasets-viewer) is useful to know how the data actually looks like before you download it. +It is enabled by default for all public datasets. + +Make sure the Dataset Viewer correctly shows your data, or [Configure the Dataset Viewer](./datasets-viewer-configure). + +## Using the `huggingface_hub` client library + +The rich features set in the `huggingface_hub` library allows you to manage repositories, including creating repos and uploading datasets to the Model Hub. Visit [the client library's documentation](https://huggingface.co/docs/huggingface_hub/index) to learn more. + +## Using other libraries + +Some libraries like [🤗 Datasets](https://huggingface.co/docs/datasets/index), [Pandas](https://pandas.pydata.org/), [Dask](https://www.dask.org/) or [DuckDB](https://duckdb.org/) can upload files to the Hub. +See the list of [Libraries supported by the Datasets Hub](./datasets-libraries) for more information. + +## Using Git + +Since dataset repos are just Git repositories, you can use Git to push your data files to the Hub. Follow the guide on [Getting Started with Repositories](repositories-getting-started) to learn about using the `git` CLI to commit and push your datasets. + +## File formats + +The Hub natively supports multiple file formats: + +- CSV (.csv, .tsv) +- JSON Lines, JSON (.jsonl, .json) +- Parquet (.parquet) +- Text (.txt) +- Images (.png, .jpg, etc.) +- Audio (.wav, .mp3, etc.) + +It also supports files compressed using ZIP (.zip), GZIP (.gz), ZSTD (.zst), BZ2 (.bz2), LZ4 (.lz4) and LZMA (.xz). + +Image and audio resources can also have additional metadata files, see the [Data files Configuration](./datasets-data-files-configuration) on image and audio datasets. + +You may want to convert your files to these formats to benefit from all the Hub features. +Other formats and structures may not be recognized by the Hub. diff --git a/docs/hub/datasets-dask.md b/docs/hub/datasets-dask.md new file mode 100644 index 000000000..00e5ef840 --- /dev/null +++ b/docs/hub/datasets-dask.md @@ -0,0 +1,47 @@ +# Dask + +[Dask](https://github.com/dask/dask) is a parallel and distributed computing library that scales the existing Python and PyData ecosystem. 
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: + +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: + +``` +huggingface-cli login +``` + +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally, you can use Hugging Face paths in Dask: + +```python +import dask.dataframe as dd + +df.to_parquet("hf://datasets/username/my_dataset") + +# or write in separate directories if the dataset has train/validation/test splits +df_train.to_parquet("hf://datasets/username/my_dataset/train") +df_valid.to_parquet("hf://datasets/username/my_dataset/validation") +df_test .to_parquet("hf://datasets/username/my_dataset/test") +``` + +This creates a dataset repository `username/my_dataset` containing your Dask dataset in Parquet format. +You can reload it later: + +```python +import dask.dataframe as dd + +df = dd.read_parquet("hf://datasets/username/my_dataset") + +# or read from separate directories if the dataset has train/validation/test splits +df_train = dd.read_parquet("hf://datasets/username/my_dataset/train") +df_valid = dd.read_parquet("hf://datasets/username/my_dataset/validation") +df_test = dd.read_parquet("hf://datasets/username/my_dataset/test") +``` + +For more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system). diff --git a/docs/hub/datasets-data-files-configuration.md b/docs/hub/datasets-data-files-configuration.md new file mode 100644 index 000000000..20c2a3963 --- /dev/null +++ b/docs/hub/datasets-data-files-configuration.md @@ -0,0 +1,29 @@ +# Data files Configuration + +There are no constraints on how to structure dataset repositories. + +However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly. +Often it is as simple as naming your data files according to their split names, e.g. `train.csv` and `test.csv`. + +## File names and splits + +To structure your dataset by naming your data files or directories according to their split names, see the [File names and splits](./datasets-file-names-and-splits) documentation. + +## Manual configuration + +You can choose the data files to show in the Dataset Viewer for your dataset using YAML. +It is useful if you want to specify which file goes into which split manually. + +You can also define multiple configurations (or subsets) for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files). + +See the documentation on [Manual configuration](./datasets-manual-configuration) for more information. + +## Image and Audio datasets + +For image and audio classification datasets, you can also use directories to name the image and audio classes. +And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them. 
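+For instance, an image dataset with per-image captions could keep a `metadata.csv` file next to the images. This is a sketch following the metadata file convention described in the guides below (the `file_name` column links each row to an image; the repository and file names are illustrative):
+
+```
+my_image_dataset_repository/
+├── README.md
+└── train/
+    ├── metadata.csv
+    ├── 0001.png
+    └── 0002.png
+```
+
+Here `metadata.csv` would contain a header such as `file_name,caption` followed by one row per image, e.g. `0001.png,A photo of a cat`.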
+ +We provide two guides that you can check out: + +- [How to create an image dataset](https://huggingface.co/docs/datasets/image_dataset) +- [How to create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset) diff --git a/docs/hub/datasets-downloading.md new file mode 100644 index 000000000..1ed63e38d --- /dev/null +++ b/docs/hub/datasets-downloading.md @@ -0,0 +1,44 @@ +# Downloading datasets + +## Integrated libraries + +If a dataset on the Hub is tied to a [supported library](./datasets-libraries), loading the dataset can be done in just a few lines. For information on accessing the dataset, click on the "Use in _Library_" button on the dataset page to see how. For example, `samsum` shows how to load it with 🤗 Datasets below. + +
+ + +
+ +
+ + +
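+For instance, for `samsum` the copied code boils down to a single `load_dataset` call (a sketch; the exact snippet shown by the button may differ slightly):
+
+```python
+from datasets import load_dataset
+
+# Load all available splits of the samsum dataset from the Hub
+dataset = load_dataset("samsum")
+```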
+ +## Using the Hugging Face Client Library + +You can use the [`huggingface_hub`](https://github.com/huggingface/huggingface_hub) library to create, delete, update and retrieve information from repos. You can also download files from repos or integrate them into your library! For example, you can quickly load a CSV dataset with a few lines using Pandas. + +```py +from huggingface_hub import hf_hub_download +import pandas as pd + +REPO_ID = "YOUR_REPO_ID" +FILENAME = "data.csv" + +dataset = pd.read_csv( + hf_hub_download(repo_id=REPO_ID, filename=FILENAME, repo_type="dataset") +) +``` + +## Using Git + +Since all datasets on the dataset Hub are Git repositories, you can clone the datasets locally by running: + +```bash +git lfs install +git clone git@hf.co:datasets/ # example: git clone git@hf.co:datasets/allenai/c4 +``` + +If you have write-access to the particular dataset repo, you'll also have the ability to commit and push revisions to the dataset. + +Add your SSH public key to [your user settings](https://huggingface.co/settings/keys) to push changes and/or access private repos. diff --git a/docs/hub/datasets-duckdb.md b/docs/hub/datasets-duckdb.md new file mode 100644 index 000000000..a308a972d --- /dev/null +++ b/docs/hub/datasets-duckdb.md @@ -0,0 +1,41 @@ +# DuckDB + +[DuckDB](https://github.com/duckdb/duckdb) is an in-process SQL [OLAP](https://en.wikipedia.org/wiki/Online_analytical_processing) database management system. +Since it supports [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: + +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: + +``` +huggingface-cli login +``` + +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally, you can use Hugging Face paths in DuckDB: + +```python +>>> from huggingface_hub import HfFileSystem +>>> import duckdb + +>>> fs = HfFileSystem() +>>> duckdb.register_filesystem(fs) +>>> duckdb.sql("COPY tbl TO 'hf://datasets/username/my_dataset/data.parquet' (FORMAT PARQUET);") +``` + +This creates a file `data.parquet` in the dataset repository `username/my_dataset` containing your dataset in Parquet format. +You can reload it later: + +```python +>>> from huggingface_hub import HfFileSystem +>>> import duckdb + +>>> fs = HfFileSystem() +>>> duckdb.register_filesystem(fs) +>>> df = duckdb.query("SELECT * FROM 'hf://datasets/username/my_dataset/data.parquet' LIMIT 10;").df() +``` diff --git a/docs/hub/datasets-file-names-and-splits.md b/docs/hub/datasets-file-names-and-splits.md new file mode 100644 index 000000000..897af7ff4 --- /dev/null +++ b/docs/hub/datasets-file-names-and-splits.md @@ -0,0 +1,135 @@ +# File names and splits + +To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files. + +This guide will show you how to name your files and directories in your dataset repository when you upload it and enable all the Dataset Hub features like the Dataset Viewer. +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a dataset viewer on its page on the Hub. 
+ +Note that if none of the structures below suits your case, you can have more control over how you define splits and subsets with the [Manual Configuration](./datasets-manual-configuration). + +## Basic use-case + +If your dataset isn't split into [train/validation/test splits](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets), the simplest dataset structure is to have one file: `data.csv` (this works with any [supported file format](./datasets-adding#files-formats) and any file name). + +Your repository will also contain a `README.md` file, the [dataset card](./dataset-cards) displayed on your dataset page. + +``` +my_dataset_repository/ +├── README.md +└── data.csv +``` + +## Splits + +Some patterns in the dataset repository can be used to assign certain files to train/validation/test splits. + +### File name + + +You can name your data files after the `train`, `test`, and `validation` splits: + +``` +my_dataset_repository/ +├── README.md +├── train.csv +├── test.csv +└── validation.csv +``` + +If you don't have any non-traditional splits, then you can place the split name anywhere in the data file. The only rule is that the split name must be delimited by non-word characters, like `test-file.csv` for example instead of `testfile.csv`. Supported delimiters include underscores, dashes, spaces, dots, and numbers. + +For example, the following file names are all acceptable: + +- train split: `train.csv`, `my_train_file.csv`, `train1.csv` +- validation split: `validation.csv`, `my_validation_file.csv`, `validation1.csv` +- test split: `test.csv`, `my_test_file.csv`, `test1.csv` + +### Directory name + +You can place your data files into different directories named `train`, `test`, and `validation` where each directory contains the data files for that split: + +``` +my_dataset_repository/ +├── README.md +└── data/ + ├── train/ + │ └── data.csv + ├── test/ + │ └── more_data.csv + └── validation/ + └── even_more_data.csv +``` + +### Keywords + +There are several ways to refer to train/validation/test splits. Validation splits are sometimes called "dev", and test splits may be referred to as "eval". +These other split names are also supported, and the following keywords are equivalent: + +- train, training +- validation, valid, val, dev +- test, testing, eval, evaluation + +Therefore, the structure below is a valid repository: + +``` +my_dataset_repository/ +├── README.md +└── data/ + ├── training.csv + ├── eval.csv + └── valid.csv +``` + +### Multiple files per split + +Splits can span several files, for example: + +``` +my_dataset_repository/ +├── README.md +├── train_0.csv +├── train_1.csv +├── train_2.csv +├── train_3.csv +├── test_0.csv +└── test_1.csv +``` + +Make sure all the files of your `train` set have *train* in their names (same for test and validation). +You can even add a prefix or suffix to `train` in the file name (like `my_train_file_00001.csv` for example). + +For convenience, you can also place your data files into different directories. +In this case, the split name is inferred from the directory name. + +``` +my_dataset_repository/ +├── README.md +└── data/ + ├── train/ + │ ├── shard_0.csv + │ ├── shard_1.csv + │ ├── shard_2.csv + │ └── shard_3.csv + └── test/ + ├── shard_0.csv + └── shard_1.csv +``` + +### Custom split name + +If your dataset splits have custom names that aren't `train`, `test`, or `validation`, then you can name your data files like `data/-xxxxx-of-xxxxx.csv`. 
+ +Here is an example with three splits, `train`, `test`, and `random`: + +``` +my_dataset_repository/ +├── README.md +└── data/ + ├── train-00000-of-00003.csv + ├── train-00001-of-00003.csv + ├── train-00002-of-00003.csv + ├── test-00000-of-00001.csv + ├── random-00000-of-00003.csv + ├── random-00001-of-00003.csv + └── random-00002-of-00003.csv +``` diff --git a/docs/hub/datasets-libraries.md b/docs/hub/datasets-libraries.md new file mode 100644 index 000000000..bdf6470cf --- /dev/null +++ b/docs/hub/datasets-libraries.md @@ -0,0 +1,15 @@ +# Libraries + +The Dataset Hub has support for several libraries in the Open Source ecosystem. +Thanks to the [huggingface_hub Python library](../huggingface_hub), it's easy to enable sharing your datasets on the Hub. +We're happy to welcome to the Hub a set of Open Source libraries that are pushing Machine Learning forward. + +The table below summarizes the supported libraries and their level of integration. + +| Library | Description | Download from Hub | Push to Hub | +|-----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|---|----| +| [Dask](https://github.com/dask/dask) | Parallel and distributed computing library that scales the existing Python and PyData ecosystem. | ✅ | ✅ | +| [Datasets](https://github.com/huggingface/datasets) | 🤗 Datasets is a library for accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP). | ✅ | ✅ | +| [DuckDB](https://github.com/duckdb/duckdb) | In-process SQL OLAP database management system. | ✅ | ✅ | +| [Pandas](https://github.com/pandas-dev/pandas) | Python data analysis toolkit. | ✅ | ✅ | +| [WebDataset](https://github.com/webdataset/webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ❌ | diff --git a/docs/hub/datasets-manual-configuration.md b/docs/hub/datasets-manual-configuration.md new file mode 100644 index 000000000..28586cd7f --- /dev/null +++ b/docs/hub/datasets-manual-configuration.md @@ -0,0 +1,136 @@ +# Manual Configuration + +This guide will show you how to configure a custom structure for your dataset repository. + +A dataset with a supported structure and [file formats](./datasets-adding#file-formats) automatically has a Dataset Viewer on its dataset page on the Hub. You can use YAML to define the splits, configurations and builder parameters that are used by the Viewer. + +It is also possible to define multiple configurations for the same dataset (e.g. if the dataset has various independent files). + +## Define your splits and subsets in YAML + +## Splits + +If you have multiple files and want to define which file goes into which split, you can use YAML at the top of your README.md. 
+ +For example, given a repository like this one: + +``` +my_dataset_repository/ +├── README.md +├── data.csv +└── holdout.csv +``` + +You can define a configuration for your splits by adding the `configs` field in the YAML block at the top of your README.md: + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: "data.csv" + - split: test + path: "holdout.csv" +--- +``` + +You can select multiple files per split using a list of paths: + +``` +my_dataset_repository/ +├── README.md +├── data/ +│ ├── abc.csv +│ └── def.csv +└── holdout/ + └── ghi.csv +``` + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: + - "data/abc.csv" + - "data/def.csv" + - split: test + path: "holdout/ghi.csv" +--- +``` + +Or you can use glob patterns to automatically list all the files you need: + +```yaml +--- +configs: +- config_name: default + data_files: + - split: train + path: "data/*.csv" + - split: test + path: "holdout/*.csv" +--- +``` + + + +Note that `config_name` field is required even if you have a single configuration. + + + +## Multiple Configurations + +Your dataset might have several subsets of data that you want to be able to use separately. +For example each configuration has its own dropdown in the Dataset Viewer the Hugging Face Hub. + +In that case you can define a list of configurations inside the `configs` field in YAML: + +``` +my_dataset_repository/ +├── README.md +├── main_data.csv +└── additional_data.csv +``` + +```yaml +--- +configs: +- config_name: main_data + data_files: "main_data.csv" +- config_name: additional_data + data_files: "additional_data.csv" +--- +``` + +## Builder parameters + +Not only `data_files`, but other builder-specific parameters can be passed via YAML, allowing for more flexibility on how to load the data while not requiring any custom code. For example, define which separator to use in which configuration to load your `csv` files: + +```yaml +--- +configs: +- config_name: tab + data_files: "main_data.csv" + sep: "\t" +- config_name: comma + data_files: "additional_data.csv" + sep: "," +--- +``` + +Refer to the [specific builders' documentation](../datasets/package_reference/builder_classes) to see what configuration parameters they have. + + + +You can set a default configuration using `default: true` + +```yaml +- config_name: main_data + data_files: "main_data.csv" + default: true +``` + + diff --git a/docs/hub/datasets-pandas.md b/docs/hub/datasets-pandas.md new file mode 100644 index 000000000..9972816fb --- /dev/null +++ b/docs/hub/datasets-pandas.md @@ -0,0 +1,47 @@ +# Pandas + +[Pandas](https://github.com/pandas-dev/pandas) is a widely used Python data analysis toolkit. 
+Since it uses [fsspec](https://filesystem-spec.readthedocs.io) to read and write remote data, you can use the Hugging Face paths (`hf://`) to read and write data on the Hub: + +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: + +``` +huggingface-cli login +``` + +Then you can [Create a dataset repository](../huggingface_hub/quick-start#create-a-repository), for example using: + +```python +from huggingface_hub import HfApi + +HfApi().create_repo(repo_id="username/my_dataset", repo_type="dataset") +``` + +Finally, you can use Hugging Face paths in Pandas: + +```python +import pandas as pd + +df.to_parquet("hf://datasets/username/my_dataset/data.parquet") + +# or write in separate files if the dataset has train/validation/test splits +df_train.to_parquet("hf://datasets/username/my_dataset/train.parquet") +df_valid.to_parquet("hf://datasets/username/my_dataset/validation.parquet") +df_test .to_parquet("hf://datasets/username/my_dataset/test.parquet") +``` + +This creates a dataset repository `username/my_dataset` containing your Pandas dataset in Parquet format. +You can reload it later: + +```python +import pandas as pd + +df = pd.read_parquet("hf://datasets/username/my_dataset/data.parquet") + +# or read from separate files if the dataset has train/validation/test splits +df_train = pd.read_parquet("hf://datasets/username/my_dataset/train.parquet") +df_valid = pd.read_parquet("hf://datasets/username/my_dataset/validation.parquet") +df_test = pd.read_parquet("hf://datasets/username/my_dataset/test.parquet") +``` + +To have more information on the Hugging Face paths and how they are implemented, please refer to the [the client library's documentation on the HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system). diff --git a/docs/hub/datasets-usage.md b/docs/hub/datasets-usage.md index b32807d7b..d00bdd148 100644 --- a/docs/hub/datasets-usage.md +++ b/docs/hub/datasets-usage.md @@ -2,8 +2,31 @@ Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset. -Some datasets on the Hub contain a [loading script](https://huggingface.co/docs/datasets/dataset_script), which allows you to easily [load the dataset when you need it](https://huggingface.co/docs/datasets/load_hub). +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: -Many datasets however do not need to include a loading script, for instance when their data is stored directly in the repository in formats such as CSV, JSON and Parquet. 🤗 Datasets can [load those kinds of datasets](https://huggingface.co/docs/datasets/loading#hugging-face-hub) automatically without a loading script. +``` +huggingface-cli login +``` -For more information about using 🤗 Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the 🤗 Datasets documentation. 
\ No newline at end of file +And then you can load a dataset from the Hugging Face Hub using: + +```python +from datasets import load_dataset + +dataset = load_dataset("username/my_dataset") + +# or load the separate splits if the dataset has train/validation/test splits +train_dataset = load_dataset("username/my_dataset", split="train") +valid_dataset = load_dataset("username/my_dataset", split="validation") +test_dataset = load_dataset("username/my_dataset", split="test") +``` + +You can also upload datasets to the Hugging Face Hub: + +```python +my_new_dataset.push_to_hub("username/my_new_dataset") +``` + +This creates a dataset repository `username/my_new_dataset` containing your dataset in Parquet format, which you can reload later. + +For more information about using 🤗 Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the 🤗 Datasets documentation. diff --git a/docs/hub/datasets-viewer-configure.md new file mode 100644 index 000000000..dd08f7848 --- /dev/null +++ b/docs/hub/datasets-viewer-configure.md @@ -0,0 +1,27 @@ +# Configure the Dataset Viewer + +The Dataset Viewer supports many [data file formats](./datasets-adding#file-formats), from text to tabular and from image to audio formats. +It also separates the train/validation/test splits based on file and folder names. + +To configure the Dataset Viewer for your dataset, first make sure your dataset is in a [supported data format](./datasets-adding#file-formats). + +## Configure dropdowns for splits or subsets + +In the Dataset Viewer you can view the [train/validation/test](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets) splits of datasets, and sometimes additionally choose between multiple subsets (e.g. one per language). + +To define those dropdowns, you can name the data files or their folders after their split names (train/validation/test). +It is also possible to customize your splits manually using YAML. + +For more information, feel free to check out the documentation on [Data files Configuration](./datasets-data-files-configuration). + +## Disable the viewer + +The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`. + +``` +--- +viewer: false +--- +``` + +Note that the viewer is always disabled for private datasets. diff --git a/docs/hub/datasets-viewer.md index 61377f2e7..05c21eb3c 100644 --- a/docs/hub/datasets-viewer.md +++ b/docs/hub/datasets-viewer.md @@ -7,17 +7,26 @@ The dataset page includes a table with the contents of the dataset, arranged by + +## Inspect data distributions + +At the top of the columns you can see graphs representing the distribution of their data. This gives you a quick insight into how balanced your classes are, the range and distribution of numerical values and text lengths, and what portion of the column data is missing. + +## Filter by value + +If you click on a bar of a histogram from a numerical column, the dataset viewer will filter the data and show only the rows with values that fall in the selected range. +Similarly, if you select one class from a categorical column, it will show only the rows from the selected category. 
+ ## Search a word in the dataset -You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of type `string`, even if the values are nested in a dictionary. +You can search for a word in the dataset by typing it in the search bar at the top of the table. The search is case-insensitive and will match any row containing the word. The text is searched in the columns of `string`, even if the values are nested in a dictionary or a list. ## Share a specific row -You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/glue/viewer/mrpc/test?row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row. +You can share a specific row by clicking on it, and then copying the URL in the address bar of your browser. For example https://huggingface.co/datasets/glue/viewer/mrpc/test?p=2&row=241 will open the dataset viewer on the MRPC dataset, on the test split, and on the 241st row. ## Access the parquet files -Every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset with libraries such as Polars, Pandas or DuckDB. +To power the dataset viewer, every dataset is auto-converted to the Parquet format. Click on [_"Auto-converted to Parquet"_](https://huggingface.co/datasets/glue/tree/refs%2Fconvert%2Fparquet/cola) to access the Parquet files. Refer to the [Datasets Server docs](/docs/datasets-server/parquet_process) to learn how to query the dataset parquet files with libraries such as Polars, Pandas or DuckDB. You can also access the list of Parquet files programmatically using the [Hub API](./api#endpoints-table): https://huggingface.co/api/datasets/glue/parquet. @@ -35,14 +44,9 @@ For the biggest datasets, the page shows a preview of the first 100 rows instead -## Disable the viewer - -The dataset viewer can be disabled. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add a `viewer` property with the value `false`. +## Configure the Dataset Viewer -``` ---- -viewer: false ---- -``` +To have a properly working Dataset Viewer for your dataset, make sure your dataset is in a supported format and structure. +There is also an option to configure your dataset using YAML. -Note that the viewer is always disabled on the private datasets. +For more information see our guide on [How to configure the Dataset Viewer](./datasets-viewer-configure). diff --git a/docs/hub/datasets-webdataset.md b/docs/hub/datasets-webdataset.md new file mode 100644 index 000000000..15842fc98 --- /dev/null +++ b/docs/hub/datasets-webdataset.md @@ -0,0 +1,20 @@ +# WebDataset + +[WebDataset](https://github.com/webdataset/webdataset) is a library to write I/O pipelines for large datasets. 
+Since it supports streaming data using HTTP, you can use the Hugging Face data files URLs to stream a dataset in WebDataset format: + +First you need to [Login with your Hugging Face account](../huggingface_hub/quick-start#login), for example using: + +``` +huggingface-cli login +``` + +And then you can stream Hugging Face datasets in WebDataset: + +```python +>>> import webdataset as wds +>>> from huggingface_hub import HfFolder + +>>> hf_token = HfFolder().get_token() +>>> dataset = wds.WebDataset(f"pipe:curl -s -L https://huggingface.co/datasets/username/my_wds_dataset/resolve/main/train-000000.tar -H 'Authorization:Bearer {hf_token}'") +``` diff --git a/docs/hub/index.md b/docs/hub/index.md index 221c2155f..8c84da8ff 100644 --- a/docs/hub/index.md +++ b/docs/hub/index.md @@ -40,9 +40,11 @@ The Hugging Face Hub is a platform with over 350k models, 75k datasets, and 150k Datasets Overview Dataset Cards Gated Datasets -Dataset viewer -Using Datasets -Adding New Datasets +Uploading Datasets +Downloading Datasets +Libraries +Dataset Viewer +Data files Configuration