redo logging #415

Merged: 39 commits, Dec 5, 2024

Commits (changes below are from 28 of the 39 commits)
8bda6bb
adding closed source models
NathanHB Nov 21, 2024
8122fb8
refacto CLI
NathanHB Nov 26, 2024
8e6e615
use correct parallelism manager for each model we use
NathanHB Nov 26, 2024
e85d31a
adds typer
NathanHB Nov 26, 2024
f6e18e8
adds typer
NathanHB Nov 29, 2024
372b89b
use typer as cli tool
NathanHB Nov 29, 2024
94031db
redo logging
NathanHB Nov 29, 2024
8c7f67c
redo logging
NathanHB Dec 2, 2024
c18f1be
lazy load rouge scorer
NathanHB Dec 2, 2024
4960dd5
fixes
NathanHB Dec 2, 2024
a3ac7a3
change log level of missing task
NathanHB Dec 2, 2024
00f3962
remove unused variable
NathanHB Dec 2, 2024
1360abe
fixes
NathanHB Dec 2, 2024
e99c268
fix from review
NathanHB Dec 3, 2024
85d2ef0
Merge branch 'main' into nathan-refacto-cli
NathanHB Dec 3, 2024
e950828
remove uneeded files
NathanHB Dec 3, 2024
7864c6b
Merge branch 'nathan-refacto-cli' of github.com:huggingface/lighteval…
NathanHB Dec 3, 2024
f989ed8
add typer to deps
NathanHB Dec 3, 2024
2c2748c
fix docs
NathanHB Dec 3, 2024
a80a1db
fix docs
NathanHB Dec 3, 2024
ae4caba
fix docs
NathanHB Dec 3, 2024
a79453a
fix tests
NathanHB Dec 3, 2024
3481562
fix tests
NathanHB Dec 3, 2024
b0ca7f1
fix tests
NathanHB Dec 3, 2024
39ba282
fix tests
NathanHB Dec 3, 2024
9110d96
Update src/lighteval/metrics/metrics_sample.py
NathanHB Dec 3, 2024
39d70a5
Merge branch 'nathan-refacto-cli' into nathan-refacto-logging
NathanHB Dec 3, 2024
7b9ab20
Merge branch 'nathan-refacto-logging' of github.com:huggingface/light…
NathanHB Dec 3, 2024
7339568
Update src/lighteval/metrics/metrics_sample.py
NathanHB Dec 3, 2024
6ef4e81
Merge remote-tracking branch 'origin/main' into nathan-refacto-logging
NathanHB Dec 4, 2024
5280fc0
fix dependencies
NathanHB Dec 4, 2024
1c30dec
rm hirarchical logger file
NathanHB Dec 4, 2024
38ce291
fix logging level
NathanHB Dec 4, 2024
4a1b94a
fix readme
NathanHB Dec 4, 2024
2e55920
fix dependencies and readme
NathanHB Dec 4, 2024
c96c6ca
Update src/lighteval/pipeline.py
NathanHB Dec 5, 2024
051f5e2
Update src/lighteval/tasks/registry.py
NathanHB Dec 5, 2024
aac6c82
Merge branch 'main' into nathan-refacto-logging
NathanHB Dec 5, 2024
f1013bb
fix styling
NathanHB Dec 5, 2024
2 changes: 2 additions & 0 deletions .github/workflows/tests.yaml
@@ -36,6 +36,8 @@ jobs:
- name: Test
env:
HF_TEST_TOKEN: ${{ secrets.HF_TEST_TOKEN }}
HF_HOME: "cache/models"
HF_DATASETS_CACHE: "cache/datasets"
run: | # PYTHONPATH="${PYTHONPATH}:src" HF_DATASETS_CACHE="cache/datasets" HF_HOME="cache/models"
python -m pytest --disable-pytest-warnings
- name: Write cache
7 changes: 3 additions & 4 deletions docs/source/adding-a-custom-task.mdx
@@ -191,8 +191,7 @@ Once your file is created you can then run the evaluation with the following command

```bash
lighteval accelerate \
--model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
--tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
--custom_tasks {path_to_your_custom_task_file} \
--output_dir "./evals"
"pretrained=HuggingFaceH4/zephyr-7b-beta" \
"community|{custom_task}|{fewshots}|{truncate_few_shot}" \
--custom-tasks {path_to_your_custom_task_file} \
```
8 changes: 7 additions & 1 deletion docs/source/available-tasks.mdx
@@ -3,7 +3,13 @@
You can get a list of all the available tasks by running:

```bash
lighteval tasks --list
lighteval tasks list
```

You can also inspect a specific task by running:

```bash
lighteval tasks inspect <task_name>
```
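
For example, to inspect one of the leaderboard tasks referenced elsewhere in these docs (the task string below is illustrative, and the exact format accepted by `inspect` may differ):

```bash
# Illustrative task name; substitute any task printed by `lighteval tasks list`
lighteval tasks inspect "leaderboard|truthfulqa:mc"
```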

## List of tasks
23 changes: 19 additions & 4 deletions docs/source/evaluate-the-model-on-a-server-or-container.mdx
@@ -6,10 +6,9 @@ to the server. The command is the same as before, except you specify a path to
a yaml config file (detailed below):

```bash
lighteval accelerate \
--model_config_path="/path/to/config/file"\
--tasks <task parameters> \
--output_dir output_dir
lighteval endpoint {tgi,inference-endpoint} \
"/path/to/config/file"\
<task parameters>
```
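
For instance, a TGI evaluation run might look like the following sketch (the config path and task are illustrative; the config file format is detailed below):

```bash
# Illustrative config path and task
lighteval endpoint tgi \
    "examples/model_configs/tgi_model.yaml" \
    "leaderboard|truthfulqa:mc|0|0"
```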

There are two types of configuration files that can be provided for running on
@@ -65,3 +64,19 @@ model:
inference_server_auth: null
model_id: null # Optional, only required if the TGI container was launched with model_id pointing to a local directory
```
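
Pulling these fields together, a minimal TGI config file might look like this sketch (the server address is an illustrative placeholder):

```yaml
model:
  instance:
    inference_server_address: "http://localhost:8080"  # illustrative address of a running TGI container
    inference_server_auth: null
    model_id: null  # only required if the TGI container was launched with model_id pointing to a local directory
```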

### OpenAI API

Lighteval also supports evaluating models on the OpenAI API. To do so you need to set your OpenAI API key in the environment variable.

```bash
export OPENAI_API_KEY={your_key}
```

And then run the following command:

```bash
lighteval endpoint openai \
{model-name} \
<task parameters>
```
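
For instance, to run the TruthfulQA task used in the quicktour against an OpenAI model (the model name is illustrative):

```bash
# Model name is illustrative; use any model available through your OpenAI account
lighteval endpoint openai \
    "gpt-4o-mini" \
    "leaderboard|truthfulqa:mc|0|0"
```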
2 changes: 1 addition & 1 deletion docs/source/index.mdx
@@ -5,7 +5,7 @@ backends—whether it's
[transformers](https://github.com/huggingface/transformers),
[tgi](https://github.com/huggingface/text-generation-inference),
[vllm](https://github.com/vllm-project/vllm), or
[nanotron](https://github.com/huggingface/nanotron)with
[nanotron](https://github.com/huggingface/nanotron) with
ease. Dive deep into your model’s performance by saving and exploring detailed,
sample-by-sample results to debug and see how your models stack-up.

2 changes: 0 additions & 2 deletions docs/source/package_reference/model_config.mdx
@@ -8,5 +8,3 @@
[[autodoc]] models.model_config.InferenceModelConfig
[[autodoc]] models.model_config.TGIModelConfig
[[autodoc]] models.model_config.VLLMModelConfig

[[autodoc]] models.model_config.create_model_config
39 changes: 23 additions & 16 deletions docs/source/quicktour.mdx
@@ -1,11 +1,24 @@
# Quicktour

We provide two main entry points to evaluate models:

> [!TIP]
> We recommend using the `--help` flag to get more information about the
> available options for each command.
> `lighteval --help` and `lighteval accelerate --help`

Lighteval can be used with a few different commands.

- `lighteval accelerate` : evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: evaluate models in distributed settings using [⚡️
Nanotron](https://github.com/huggingface/nanotron)
- `lighteval vllm`: evaluate models on one or more GPUs using [🚀
VLLM](https://github.com/vllm-project/vllm)
- `lighteval endpoint`
- `inference-endpoint`: evaluate models on one or more GPUs using [🔗
Inference Endpoint](https://huggingface.co/inference-endpoints/dedicated)
- `tgi`: evaluate models on one or more GPUs using [🔗 Text Generation Inference](https://huggingface.co/docs/text-generation-inference/en/index)
- `openai`: evaluate models on one or more GPUs using [🔗 OpenAI API](https://platform.openai.com/)

## Accelerate

@@ -15,10 +28,8 @@ To evaluate `GPT-2` on the Truthful QA benchmark, run:

```bash
lighteval accelerate \
--model_args "pretrained=gpt2" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--override_batch_size 1 \
--output_dir="./evals/"
"pretrained=gpt2" \
"leaderboard|truthfulqa:mc|0|0"
```

Here, `--tasks` refers to either a comma-separated list of supported tasks from
@@ -51,10 +62,8 @@ You can then evaluate a model using data parallelism on 8 GPUs like follows:
```bash
accelerate launch --multi_gpu --num_processes=8 -m \
lighteval accelerate \
--model_args "pretrained=gpt2" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--override_batch_size 1 \
--output_dir="./evals/"
"pretrained=gpt2" \
"leaderboard|truthfulqa:mc|0|0"
```

Here, `--override_batch_size` defines the batch size per device, so the effective
@@ -66,10 +75,8 @@ To evaluate a model using pipeline parallelism on 2 or more GPUs, run:

```bash
lighteval accelerate \
--model_args "pretrained=gpt2,model_parallel=True" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--override_batch_size 1 \
--output_dir="./evals/"
"pretrained=gpt2,model_parallel=True" \
"leaderboard|truthfulqa:mc|0|0"
```

This will automatically use accelerate to distribute the model across the GPUs.
@@ -81,7 +88,7 @@ GPUs.

### Model Arguments

The `--model_args` argument takes a string representing a list of model
The `model-args` argument takes a string representing a list of model
argument. The arguments allowed vary depending on the backend you use (vllm or
accelerate).

@@ -150,8 +157,8 @@ To evaluate a model trained with nanotron on a single gpu.
```bash
torchrun --standalone --nnodes=1 --nproc-per-node=1 \
src/lighteval/__main__.py nanotron \
--checkpoint_config_path ../nanotron/checkpoints/10/config.yaml \
--lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
--checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
--lighteval-config-path examples/nanotron/lighteval_config_override_template.yaml
```

The `nproc-per-node` argument should match the data, tensor and pipeline
16 changes: 8 additions & 8 deletions docs/source/saving-and-reading-results.mdx
@@ -3,30 +3,30 @@
## Saving results locally

Lighteval will automatically save results and evaluation details in the
directory set with the `--output_dir` argument. The results will be saved in
directory set with the `--output-dir` option. The results will be saved in
`{output_dir}/results/{model_name}/results_{timestamp}.json`. [Here is an
example of a result file](#example-of-a-result-file). The output path can be
any [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html)
compliant path (local, s3, hf hub, gdrive, ftp, etc).
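
As a quick sketch, a saved results file can be read back with the standard library (the concrete path is illustrative and simply follows the pattern above):

```python
import json
from pathlib import Path

# Pattern: {output_dir}/results/{model_name}/results_{timestamp}.json (values below are illustrative)
results_path = Path("evals/results/gpt2/results_2024-12-05T12-00-00.json")
with results_path.open() as f:
    results = json.load(f)

print(results.keys())  # top-level sections of the results file
```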

To save the details of the evaluation, you can use the `--save_details`
argument. The details will be saved in a parquet file
To save the details of the evaluation, you can use the `--save-details`
option. The details will be saved in a parquet file
`{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet`.

## Pushing results to the HuggingFace hub

You can push the results and evaluation details to the HuggingFace hub. To do
so, you need to set the `--push_to_hub` as well as the `--results_org`
argument. The results will be saved in a dataset with the name at
so, you need to set the `--push-to-hub` as well as the `--results-org`
option. The results will be saved in a dataset with the name at
`{results_org}/{model_org}/{model_name}`. To push the details, you need to set
the `--save_details` argument.
the `--save-details` option.
The dataset created will be private by default, you can make it public by
setting the `--public_run` argument.
setting the `--public-run` option.


## Pushing results to Tensorboard

You can push the results to Tensorboard by setting `--push_to_tensorboard`.
You can push the results to Tensorboard by setting `--push-to-tensorboard`.


## How to load and investigate details
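
As a minimal sketch, the per-sample details saved above can be loaded with pandas (the concrete path is illustrative and follows the pattern described earlier):

```python
import pandas as pd

# Pattern: {output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet
# The concrete path below is illustrative.
details = pd.read_parquet(
    "evals/details/gpt2/2024-12-05T12-00-00/details_leaderboard|truthfulqa:mc|0_2024-12-05T12-00-00.parquet"
)

print(details.columns)  # per-sample fields available for inspection
print(details.head())   # first few evaluated samples
```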
22 changes: 9 additions & 13 deletions docs/source/use-vllm-as-backend.mdx
@@ -4,10 +4,9 @@ Lighteval allows you to use `vllm` as backend allowing great speedups.
To use, simply change the `model_args` to reflect the arguments you want to pass to vllm.

```bash
lighteval accelerate \
--model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--output_dir="./evals/"
lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
"leaderboard|truthfulqa:mc|0|0"
```

`vllm` is able to distribute the model across multiple GPUs using data
@@ -17,19 +16,17 @@ You can choose the parallelism method by setting in the the `model_args`.
For example if you have 4 GPUs you can split it across using `tensor_parallelism`:

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval accelerate \
--model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--output_dir="./evals/"
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
"leaderboard|truthfulqa:mc|0|0"
```

Or, if your model fits on a single GPU, you can use `data_parallelism` to speed up the evaluation:

```bash
lighteval accelerate \
--model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
--tasks "leaderboard|truthfulqa:mc|0|0" \
--output_dir="./evals/"
lighteval vllm \
"pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
"leaderboard|truthfulqa:mc|0|0"
```

Available arguments for `vllm` can be found in the `VLLMModelConfig`:
Expand All @@ -50,4 +47,3 @@ Available arguments for `vllm` can be found in the `VLLMModelConfig`:
> [!WARNING]
> In the case of OOM issues, you might need to reduce the context size of the
> model as well as reduce the `gpu_memory_utilisation` parameter.
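
For example, lowering `gpu_memory_utilisation` might look like this (the value 0.8 is illustrative; tune it to your hardware):

```bash
# gpu_memory_utilisation value is illustrative
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,gpu_memory_utilisation=0.8" \
    "leaderboard|truthfulqa:mc|0|0"
```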

1 change: 0 additions & 1 deletion examples/model_configs/base_model.yaml
@@ -1,5 +1,4 @@
model:
type: "base" # can be base, tgi, or endpoint
base_params:
model_args: "pretrained=HuggingFaceH4/zephyr-7b-beta,revision=main" # pretrained=model_name,trust_remote_code=boolean,revision=revision_to_use,model_parallel=True ...
dtype: "bfloat16"
1 change: 0 additions & 1 deletion examples/model_configs/endpoint_model.yaml
@@ -1,5 +1,4 @@
model:
type: "endpoint" # can be base, tgi, or endpoint
base_params:
endpoint_name: "llama-2-7B-lighteval" # needs to be lower case without special characters
model: "meta-llama/Llama-2-7b-hf"
1 change: 0 additions & 1 deletion examples/model_configs/peft_model.yaml
@@ -1,5 +1,4 @@
model:
type: "base"
base_params:
model_args: "pretrained=predibase/customer_support,revision=main" # pretrained=model_name,trust_remote_code=boolean,revision=revision_to_use,model_parallel=True ... For a PEFT model, the pretrained model should be the one trained with PEFT and the base model below will contain the original model on which the adapters will be applied.
dtype: "4bit" # Specifying the model to be loaded in 4 bit uses BitsAndBytesConfig. The other option is to use "8bit" quantization.
1 change: 0 additions & 1 deletion examples/model_configs/quantized_model.yaml
@@ -1,5 +1,4 @@
model:
type: "base"
base_params:
model_args: "pretrained=HuggingFaceH4/zephyr-7b-beta,revision=main" # pretrained=model_name,trust_remote_code=boolean,revision=revision_to_use,model_parallel=True ...
dtype: "4bit" # Specifying the model to be loaded in 4 bit uses BitsAndBytesConfig. The other option is to use "8bit" quantization.
1 change: 0 additions & 1 deletion examples/model_configs/tgi_model.yaml
@@ -1,5 +1,4 @@
model:
type: "tgi" # can be base, tgi, or endpoint
instance:
inference_server_address: ""
inference_server_auth: null
4 changes: 1 addition & 3 deletions examples/nanotron/lighteval_config_override_template.yaml
@@ -4,9 +4,7 @@ generation: null
logging:
output_dir: "outputs"
save_details: false
push_results_to_hub: false
push_details_to_hub: false
push_results_to_tensorboard: false
push_to_hub: false
public_run: false
results_org: null
tensorboard_metric_prefix: "eval"
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -61,6 +61,7 @@ dependencies = [
"datasets>=2.14.0",
"numpy<2", # pinned to avoid incompatibilities
# Prettiness
"typer",
"termcolor==2.3.0",
"pytablewriter",
"colorama",
@@ -114,4 +115,4 @@ Issues = "https://github.com/huggingface/lighteval/issues"
# Changelog = "https://github.com/huggingface/lighteval/blob/master/CHANGELOG.md"

[project.scripts]
lighteval = "lighteval.__main__:cli_evaluate"
lighteval = "lighteval.__main__:app"
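
The console script now points at a Typer application object named `app` rather than the old `cli_evaluate` function. A minimal sketch of the shape this implies for `lighteval.__main__` (a hypothetical illustration, not the actual module):

```python
# Hypothetical sketch of a Typer-based entry point; not the actual lighteval source.
import typer

app = typer.Typer()


@app.command()
def accelerate(
    model_args: str = typer.Argument(..., help="e.g. 'pretrained=gpt2'"),
    tasks: str = typer.Argument(..., help="e.g. 'leaderboard|truthfulqa:mc|0|0'"),
):
    """Evaluate a model on CPU or GPUs with the accelerate backend."""
    ...


if __name__ == "__main__":
    app()
```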