
Merge remote-tracking branch 'upstream/main' into rm-deprecated
albertvillanova committed Jul 1, 2024
2 parents 89426a5 + 054e57a commit 065bce0
Showing 130 changed files with 104 additions and 15,605 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/ci.yml
@@ -28,8 +28,8 @@ jobs:
       pip install .[quality]
     - name: Check quality
       run: |
-        ruff check tests src benchmarks metrics utils setup.py # linter
-        ruff format --check tests src benchmarks metrics utils setup.py # formatter
+        ruff check tests src benchmarks utils setup.py # linter
+        ruff format --check tests src benchmarks utils setup.py # formatter

   test:
     needs: check_code_quality
@@ -56,7 +56,7 @@ jobs:
     - name: Install uv
       run: pip install --upgrade uv
     - name: Install dependencies
-      run: uv pip install --system "datasets[tests,metrics-tests] @ ."
+      run: uv pip install --system "datasets[tests] @ ."
     - name: Install dependencies (latest versions)
       if: ${{ matrix.os == 'ubuntu-latest' }}
       run: uv pip install --system -r additional-tests-requirements.txt --no-deps
7 changes: 0 additions & 7 deletions .gitignore
@@ -42,13 +42,6 @@ venv.bak/
 .idea
 .vscode

-# keep only the empty datasets and metrics directory with it's __init__.py file
-/src/*/datasets/*
-!/src/*/datasets/__init__.py
-
-/src/*/metrics/*
-!/src/*/metrics/__init__.py
-
 # Vim
 .*.swp
2 changes: 1 addition & 1 deletion Makefile
@@ -1,6 +1,6 @@
 .PHONY: quality style test

-check_dirs := tests src benchmarks metrics utils
+check_dirs := tests src benchmarks utils

 # Check that source code meets quality standards
4 changes: 0 additions & 4 deletions additional-tests-requirements.txt
@@ -1,5 +1 @@
-unbabel-comet>=1.0.0
 git+https://github.com/pytorch/data.git
-git+https://github.com/google-research/bleurt.git
-git+https://github.com/ns-moosavi/coval.git
-git+https://github.com/hendrycks/math.git
1 change: 0 additions & 1 deletion docs/source/_redirects.yml
@@ -8,7 +8,6 @@ splits: loading#slice-splits
 processing: process
 faiss_and_ea: faiss_es
 features: about_dataset_features
-using_metrics: how_to_metrics
 exploring: access
 package_reference/logging_methods: package_reference/utilities
 # end of first_section
6 changes: 0 additions & 6 deletions docs/source/_toctree.yml
@@ -15,8 +15,6 @@
   title: Know your dataset
 - local: use_dataset
   title: Preprocess
-- local: metrics
-  title: Evaluate predictions
 - local: create_dataset
   title: Create a dataset
 - local: upload_dataset
@@ -48,8 +46,6 @@
   title: Search index
 - local: cli
   title: CLI
-- local: how_to_metrics
-  title: Metrics
 - local: troubleshoot
   title: Troubleshooting
 title: "General usage"
@@ -111,8 +107,6 @@
   title: Build and load
 - local: about_map_batch
   title: Batch mapping
-- local: about_metrics
-  title: All about metrics
   title: "Conceptual guides"
 - sections:
   - local: package_reference/main_classes
25 changes: 0 additions & 25 deletions docs/source/about_metrics.mdx

This file was deleted.

4 changes: 1 addition & 3 deletions docs/source/audio_dataset.mdx
@@ -14,8 +14,6 @@ There are several methods for creating and sharing an audio dataset:

 * Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.

-* Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale audio datasets.
-

 <Tip>

@@ -175,7 +173,7 @@ Some audio datasets, like those found in [Kaggle competitions](https://www.kaggl

 </Tip>

-## Loading script
+## (Legacy) Loading script

 Write a dataset loading script to manually create a dataset.
 It defines a dataset's splits and configurations, and handles downloading and generating the dataset examples.
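For reference, the `AudioFolder` bullet above is the no-code path that remains after this change. A minimal sketch of using it, where the directory layout under `data_dir` is a hypothetical example:

```py
>>> from datasets import load_dataset
>>> # AudioFolder infers splits and labels from the folder structure,
>>> # e.g. data_dir/train/dog/bark.wav (hypothetical layout)
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/audio/data")
```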
20 changes: 0 additions & 20 deletions docs/source/cache.mdx
@@ -24,13 +24,6 @@ When you load a dataset, you also have the option to change where the data is ca
 >>> dataset = load_dataset('LOADING_SCRIPT', cache_dir="PATH/TO/MY/CACHE/DIR")
 ```

-Similarly, you can change where a metric is cached with the `cache_dir` parameter:
-
-```py
->>> from datasets import load_metric
->>> metric = load_metric('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
-```
-
 ## Download mode

 After you download a dataset, control how it is loaded by [`load_dataset`] with the `download_mode` parameter. By default, 🤗 Datasets will reuse a dataset if it exists. But if you need the original dataset without any processing functions applied, re-download the files as shown below:
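A minimal sketch of the re-download that paragraph describes, assuming the string form of `download_mode` accepted by [`load_dataset`]; the dataset name `'squad'` is an illustrative stand-in:

```py
>>> from datasets import load_dataset
>>> # 'force_redownload' ignores any cached copy and fetches the source files again
>>> dataset = load_dataset('squad', download_mode='force_redownload')
```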
@@ -77,19 +70,6 @@ If you want to reuse a dataset from scratch, try setting the `download_mode` par

 </Tip>

-You can also avoid caching your metric entirely, and keep it in CPU memory instead:
-
-```py
->>> from datasets import load_metric
->>> metric = load_metric('glue', 'mrpc', keep_in_memory=True)
-```
-
-<Tip warning={true}>
-
-Keeping the predictions in-memory is not possible in a distributed setting since the CPU memory spaces of the various processes are not shared.
-
-</Tip>
-
 <a id='load_dataset_enhancing_performance'></a>

 ## Improve performance
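The `load_metric` snippets removed above have close equivalents in the 🤗 Evaluate library, which superseded 🤗 Datasets' metrics. A hedged sketch, assuming `evaluate.load` kept the `cache_dir` and `keep_in_memory` parameters of the old `load_metric`:

```py
>>> import evaluate
>>> # assumed to mirror the removed load_metric parameters
>>> metric = evaluate.load('glue', 'mrpc', cache_dir="MY/CACHE/DIRECTORY")
>>> metric = evaluate.load('glue', 'mrpc', keep_in_memory=True)
```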
2 changes: 1 addition & 1 deletion docs/source/create_dataset.mdx
@@ -105,7 +105,7 @@ You can also create a dataset from local files by specifying the path to the dat
 ## Next steps

-We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
+We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and it is not well supported on Hugging Face, though in some rare cases it can still be helpful.

 To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.
8 changes: 0 additions & 8 deletions docs/source/filesystems.mdx
@@ -142,14 +142,6 @@ Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugg
 >>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
 ```

-Load a dataset builder using a loading script (see [how to load a local loading script](./loading#local-loading-script)):
-
-```py
->>> output_dir = "s3://my-bucket/imdb"
->>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
->>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
-```
-
 Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

 ```py
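A plausible completion of the data-files variant above, assuming the `storage_options` mapping defined earlier in the guide; the CSV path and the `"csv"` builder name are illustrative assumptions:

```py
>>> from datasets import load_dataset_builder
>>> data_files = {"train": ["path/to/train.csv"]}  # hypothetical local file
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```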