
Merge remote-tracking branch 'upstream/main' into rm-deprecated
albertvillanova committed Jul 1, 2024
2 parents 89426a5 + 054e57a commit 065bce0
Showing 130 changed files with 104 additions and 15,605 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -28,8 +28,8 @@ jobs:
       pip install .[quality]
     - name: Check quality
       run: |
-        ruff check tests src benchmarks metrics utils setup.py # linter
-        ruff format --check tests src benchmarks metrics utils setup.py # formatter
+        ruff check tests src benchmarks utils setup.py # linter
+        ruff format --check tests src benchmarks utils setup.py # formatter

   test:
     needs: check_code_quality
@@ -56,7 +56,7 @@ jobs:
     - name: Install uv
       run: pip install --upgrade uv
     - name: Install dependencies
-      run: uv pip install --system "datasets[tests,metrics-tests] @ ."
+      run: uv pip install --system "datasets[tests] @ ."
     - name: Install dependencies (latest versions)
       if: ${{ matrix.os == 'ubuntu-latest' }}
       run: uv pip install --system -r additional-tests-requirements.txt --no-deps
7 changes: 0 additions & 7 deletions .gitignore
@@ -42,13 +42,6 @@ venv.bak/
 .idea
 .vscode

-# keep only the empty datasets and metrics directory with it's __init__.py file
-/src/*/datasets/*
-!/src/*/datasets/__init__.py
-
-/src/*/metrics/*
-!/src/*/metrics/__init__.py
-
 # Vim
 .*.swp
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
 .PHONY: quality style test

-check_dirs := tests src benchmarks metrics utils
+check_dirs := tests src benchmarks utils

 # Check that source code meets quality standards
4 changes: 0 additions & 4 deletions additional-tests-requirements.txt
@@ -1,5 +1 @@
-unbabel-comet>=1.0.0
 git+https://github.com/pytorch/data.git
-git+https://github.com/google-research/bleurt.git
-git+https://github.com/ns-moosavi/coval.git
-git+https://github.com/hendrycks/math.git
1 change: 0 additions & 1 deletion docs/source/_redirects.yml
@@ -8,7 +8,6 @@ splits: loading#slice-splits
 processing: process
 faiss_and_ea: faiss_es
 features: about_dataset_features
-using_metrics: how_to_metrics
 exploring: access
 package_reference/logging_methods: package_reference/utilities
 # end of first_section
6 changes: 0 additions & 6 deletions docs/source/_toctree.yml
@@ -15,8 +15,6 @@
   title: Know your dataset
 - local: use_dataset
   title: Preprocess
-- local: metrics
-  title: Evaluate predictions
 - local: create_dataset
   title: Create a dataset
 - local: upload_dataset
@@ -48,8 +46,6 @@
   title: Search index
 - local: cli
   title: CLI
-- local: how_to_metrics
-  title: Metrics
 - local: troubleshoot
   title: Troubleshooting
 title: "General usage"
@@ -111,8 +107,6 @@
   title: Build and load
 - local: about_map_batch
   title: Batch mapping
-- local: about_metrics
-  title: All about metrics
   title: "Conceptual guides"
 - sections:
   - local: package_reference/main_classes
25 changes: 0 additions & 25 deletions docs/source/about_metrics.mdx

This file was deleted.

4 changes: 1 addition & 3 deletions docs/source/audio_dataset.mdx
@@ -14,8 +14,6 @@ There are several methods for creating and sharing an audio dataset:

 * Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

-* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale audio datasets.
-

 <Tip>

@@ -175,7 +173,7 @@ Some audio datasets, like those found in [Kaggle competitions](https://www.kaggl

 </Tip>

-## Loading script
+## (Legacy) Loading script

 Write a dataset loading script to manually create a dataset.
 It defines a dataset's splits and configurations, and handles downloading and generating the dataset examples.
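For reference, the `AudioFolder` bullet above is the no-code path that remains after this change. A minimal sketch of using it, where the directory layout under `data_dir` is a hypothetical example:

```py
>>> from datasets import load_dataset
>>> # AudioFolder infers splits and labels from the folder structure,
>>> # e.g. data_dir/train/dog/bark.wav (hypothetical layout)
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/audio/data")
```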
20 changes: 0 additions & 20 deletions docs/source/cache.mdx
@@ -24,13 +24,6 @@ When you load a dataset, you also have the option to change where the data is ca
 >>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
 ```

-Similarly, you can change where a metric is cached with the `cache_dir` parameter:
-
-```py
->>> from datasets import load_metric
->>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
-```
-
 ## Download mode

 After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
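A minimal sketch of the re-download that paragraph describes, assuming the string form of `download_mode` accepted by [`load_dataset`]; the dataset name `'squad'` is an illustrative stand-in:

```py
>>> from datasets import load_dataset
>>> # 'force_redownload' ignores any cached copy and fetches the source files again
>>> dataset = load_dataset('squad', download_mode='force_redownload')
```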
@@ -77,19 +70,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par

 </Tip>

-You can also avoid caching your metric entirely, and keep it in CPU memory instead:
-
-```py
->>> from datasets import load_metric
->>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)
-```
-
-<Tip warning={true}>
-
-Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.
-
-</Tip>
-
 <a id='load_dataset_enhancing_performance'></a>

 ## Improve performance
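The `load_metric` snippets removed above have close equivalents in the 🤗 Evaluate library, which superseded 🤗 Datasets' metrics. A hedged sketch, assuming `evaluate.load` kept the `cache_dir` and `keep_in_memory` parameters of the old `load_metric`:

```py
>>> import evaluate
>>> # assumed to mirror the removed load_metric parameters
>>> metric = evaluate.load('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
>>> metric = evaluate.load('glue', 'mrpc', keep_in_memory=True)
```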
2 changes: 1 addition & 1 deletion docs/source/create_dataset.mdx
@@ -105,7 +105,7 @@ You can also create a dataset from local files by specifying the path to the dat
 ## Next steps

-We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
+We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and it is not well supported on Hugging Face, though in some rare cases it can still be helpful.

 To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
8 changes: 0 additions & 8 deletions docs/source/filesystems.mdx
@@ -142,14 +142,6 @@ Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugg
 >>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
 ```

-Load a dataset builder using a loading script (see [how to load a local loading script](./loading#local-loading-script)):
-
-```py
->>> output_dir = "s3://my-bucket/imdb"
->>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
->>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
-```
-
 Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

 ```py
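A plausible completion of the data-files variant above, assuming the `storage_options` mapping defined earlier in the guide; the CSV path and the `"csv"` builder name are illustrative assumptions:

```py
>>> from datasets import load_dataset_builder
>>> data_files = {"train": ["path/to/train.csv"]}  # hypothetical local file
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```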