Support for nanotron #11

Merged
merged 16 commits on Feb 7, 2024
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -37,4 +37,5 @@ repos:
rev: 'v0.1.6'
hooks:
- id: ruff
args: ['--fix']
- id: ruff-format
18 changes: 12 additions & 6 deletions README.md
@@ -8,10 +8,10 @@ LightEval is an evaluation suite which gathers a selection of features from wide

It is still an early, internal version - it should be nice to use but don't expect 100% stability!

In case of problems or questions, feel free to open an issue!

## How to install and use
### Requirements
### Installation
0) Create your virtual environment using virtualenv or conda, depending on your preferences. We require Python 3.10.

1) Clone the repository using `git clone`, then run `cd lighteval-harness` and `pip install -e .`. Once the dependencies are installed, `cd src`.
@@ -22,6 +22,12 @@ Optional:

2) Add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN` if you want to push your results to the hub
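Putting these steps together, a minimal install sketch (the clone URL and environment name below are assumptions; adapt them to your setup):

```bash
# Create and activate a Python 3.10 environment
conda create -n lighteval python=3.10
conda activate lighteval

# Clone and install the package in editable mode
git clone https://github.com/huggingface/lighteval.git lighteval-harness
cd lighteval-harness
pip install -e .

# Optional: expose your Hugging Face token so results can be pushed to the hub
export HUGGING_FACE_HUB_TOKEN=<your_token>
```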

For the linting:
```bash
pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```


### Usage
- Launching on CPU
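A hypothetical CPU invocation, run from the repository root and using the flag names from `run_evals_accelerate.py` in this PR (the model name, task id, and `pretrained=` syntax for `--model_args` are placeholders/assumptions):

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "original|mmlu:abstract_algebra|5" \
    --override_batch_size 1 \
    --output_dir ./evals/
```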
@@ -50,11 +56,11 @@ Lastly, create a **line summary** of your evaluation, in `metadata_table.json` (a sketch of one entry follows the field list below).
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for testing, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train, valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
- `balanced` selects examples from the `few_shots_split` with balanced labels, to avoid skewing the few shot examples (hence the model generations) towards one specific label
- `random` selects examples at random from the `few_shots_split`
- `random_sampling` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
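For illustration only, here is a hedged sketch of a single line-summary entry using just the fields described above (the values are hypothetical, and the real file contains additional fields that are elided in this diff):

```json
{
    "suite": ["custom"],
    "prompt_function": "my_task_prompt",
    "hf_repo": "your_org/your_dataset",
    "hf_subset": "default",
    "hf_avail_splits": ["train", "validation", "test"],
    "evaluation_splits": ["test"],
    "few_shots_split": "validation",
    "few_shots_select": "balanced"
}
```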
@@ -102,7 +108,7 @@ These metrics need the model to generate an output. They are therefore slower.
- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
- `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
- `f1_score_macro`: Corpus level macro F1 score
- `f1_score_micro`: Corpus level micro F1 score
- Summarization:
- `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
@@ -141,7 +147,7 @@ These metrics need both the generation and its logprob. They are not working at
- `prediction_perplexity` (HELM): Measure of the logprob of a given input.

## Adding a new metric
If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `metrics_sample`. If not, add it to either of these files depending on the level at which it is applied. Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.

## Examples of scripts to launch lighteval on the cluster
### Evaluate a whole suite on one node, 8 GPUs
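As a rough sketch only (not the exact launcher script from the repo), an 8-GPU run with `accelerate` could look like the following; the model name and task file path are placeholders:

```bash
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=your_org/your_model" \
    --tasks ./tasks_to_run.txt \
    --override_batch_size 1 \
    --output_dir ./evals/
```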
3 changes: 1 addition & 2 deletions pyproject.toml
@@ -82,8 +82,7 @@ optimum = ["optimum==1.12.0"]
quantization = ["bitsandbytes>=0.41.0", "auto-gptq>=0.4.2"]
adapters = ["peft==0.3.0"]
nanotron = [
"nanotron@git+https://github.com/huggingface/nanotron@8c1a49588d0745a6404644a86547c2dd6a63640e",
"brrr@git+https://github.com/huggingface/brrr@e8a503e2ec08b34eed7522d331aec3bee8cdd29b",
"nanotron@git+https://github.com/huggingface/nanotron",
"tensorboardX"
]
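With this extra defined, nanotron support would presumably be installed with:

```bash
pip install -e ".[nanotron]"
```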

78 changes: 78 additions & 0 deletions run_evals_accelerate.py
@@ -0,0 +1,78 @@
import argparse

from lighteval.main_accelerate import CACHE_DIR, main


def get_parser():
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
weight_type_group = parser.add_mutually_exclusive_group()

weight_type_group.add_argument(
"--delta_weights",
action="store_true",
default=False,
help="set to True of your model should be merged with a base model, also need to provide the base model name",
)
weight_type_group.add_argument(
"--adapter_weights",
action="store_true",
default=False,
help="set to True of your model has been trained with peft, also need to provide the base model name",
)
parser.add_argument(
"--base_model", type=str, default=None, help="name of the base model to be used for delta or adapter weights"
)

parser.add_argument("--model_args", required=True)
parser.add_argument("--output_dir", required=True)
parser.add_argument("--model_dtype", type=str, default=None)
parser.add_argument(
"--multichoice_continuations_start_space",
action="store_true",
help="Whether to force multiple choice continuations starts with a space",
)
parser.add_argument(
"--no_multichoice_continuations_start_space",
action="store_true",
help="Whether to force multiple choice continuations do not starts with a space",
)
parser.add_argument("--push_results_to_hub", default=False, action="store_true")
parser.add_argument("--save_details", action="store_true")
parser.add_argument("--push_details_to_hub", default=False, action="store_true")
parser.add_argument(
"--public_run", default=False, action="store_true", help="Push results and details to a public repo"
)
parser.add_argument("--max_samples", type=int, default=None)
parser.add_argument("--override_batch_size", type=int, default=-1)
parser.add_argument("--dataset_loading_processes", type=int, default=1)
parser.add_argument("--inference_server_address", type=str, default=None)
parser.add_argument("--inference_server_auth", type=str, default=None)
parser.add_argument("--num_fewshot_seeds", type=int, default=1, help="Number of trials the few shots")
parser.add_argument("--cache_dir", type=str, default=CACHE_DIR)
parser.add_argument(
"--results_org",
type=str,
help="Hub organisation where you want to store the results. Your current token must have write access to it",
)
parser.add_argument("--job_id", type=str, help="Optional Job ID for future reference", default="")
parser.add_argument("--use_chat_template", default=False, action="store_true")
parser.add_argument(
"--custom_tasks_file",
type=str,
default=None,
help="Path to a file with custom tasks (a TASK list of dict and potentially prompt formating functions)",
)
group.add_argument(
"--tasks",
type=str,
default=None,
help="Id of a task, e.g. 'original|mmlu:abstract_algebra|5' or path to a texte file with a list of tasks",
)
return parser


if __name__ == "__main__":
parser = get_parser()
args, unknowns = parser.parse_known_args()
main(args)
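For reference, a hypothetical invocation of this script that pushes results to the hub (flag names come from the parser above; the model name, task id, organisation, and `pretrained=` syntax are placeholders):

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=your_org/your_model" \
    --tasks "original|mmlu:abstract_algebra|5" \
    --use_chat_template \
    --push_results_to_hub \
    --results_org your_org \
    --output_dir ./evals/
```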
33 changes: 33 additions & 0 deletions run_evals_nanotron.py
@@ -0,0 +1,33 @@
# flake8: noqa: C901
import argparse

from lighteval.main_nanotron import main


def get_parser():
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint-config-path",
type=str,
required=True,
help="Path to the brr checkpoint YAML or python config file, potentially on S3",
)
parser.add_argument(
"--lighteval-override",
type=str,
help="Path to an optional YAML or python Lighteval config to override part of the checkpoint Lighteval config",
)
parser.add_argument(
"--cache-dir",
type=str,
default="",
help="Cache directory",
)

return parser


if __name__ == "__main__":
parser = get_parser()
args, unknowns = parser.parse_known_args()
main(args.checkpoint_config_path, args.lighteval_override, args.cache_dir)
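A hypothetical invocation of this script (the paths below are placeholders):

```bash
python run_evals_nanotron.py \
    --checkpoint-config-path /path/to/checkpoint/config.yaml \
    --lighteval-override /path/to/lighteval_override.yaml \
    --cache-dir /scratch/lighteval_cache
```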
36 changes: 35 additions & 1 deletion src/lighteval/data.py
@@ -189,7 +189,41 @@ def _sorting_criteria(self, x) -> int:
Returns:
int: The negative total length (context tokens + generation length), used as the sort key for dynamic batching.
"""
toks, (stop_tokens, gen_length) = x
toks = x[0]
meta_data = x[1]
_, gen_length = meta_data[0], meta_data[1]
return -(len(toks) + gen_length)


class GenerativeTaskDatasetNanotron(DynamicBatchDataset):
def __getitem__(self, index) -> Request:
Member: Why do you need your own class? (Is it only to return the index with the item?)

Member Author: Nathan's requirement

Member: base_model does not use the index for each sample, which means we need to adapt the dataset to nanotron.

Member: Yes, but I'm unsure why we need to grab the index for brrr.

"""
Get an item from the dataset depending on the split we are currently in.
For instance, if we are in split 0, we will get the item at index 0, if
we are in split 1, we will get the item at index self.split_size, etc.
Used for dynamic batching.

Args:
index (int): The index of the item.

Returns:
Any: The item at the specified index.
"""
return index, self.sorted_data[index + self.split_start]

def _sorting_criteria(self, x) -> int:
"""
Sorting criterion used for dynamic batching.

Args:
x (Any): The input item (tokenized context plus generation metadata).

Returns:
int: The negative total length (context tokens + generation length), so that an ascending sort puts the longest samples first.
"""
toks = x[0]
meta_data = x[1]
_, gen_length = meta_data[0], meta_data[1]
return -(len(toks) + gen_length)

