Support for nanotron #11

Merged
merged 16 commits on Feb 7, 2024
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -37,4 +37,5 @@ repos:
rev: 'v0.1.6'
hooks:
- id: ruff
args: ['--fix']
- id: ruff-format
18 changes: 12 additions & 6 deletions README.md
@@ -8,10 +8,10 @@ LightEval is an evaluation suite which gathers a selection of features from wide

It is still an early, internal version - it should be nice to use but don't expect 100% stability!

In case of problems or questions, feel free to open an issue!

## How to install and use
### Requirements
### Installation
0) Create your virtual environment using virtualenv or conda, depending on your preferences. We require Python 3.10.

1) Clone the repository using `git clone`, then run `cd lighteval-harness` and `pip install -e .`. Once the dependencies are installed, `cd src`.
@@ -22,6 +22,12 @@ Optional:

2) Add your user token to the environment variable `HUGGING_FACE_HUB_TOKEN` if you want to push your results to the hub
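Putting these steps together, a minimal install sketch (the clone URL and environment name below are assumptions; adapt them to your setup):

```bash
# Create and activate a Python 3.10 environment
conda create -n lighteval python=3.10
conda activate lighteval

# Clone and install the package in editable mode
git clone https://github.com/huggingface/lighteval.git lighteval-harness
cd lighteval-harness
pip install -e .

# Optional: expose your Hugging Face token so results can be pushed to the hub
export HUGGING_FACE_HUB_TOKEN=<your_token>
```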

For the linting:
```bash
pre-commit install
pre-commit run --config .pre-commit-config.yaml --all-files
```


### Usage
- Launching on CPU
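A hypothetical CPU invocation, run from the repository root and using the flag names from `run_evals_accelerate.py` in this PR (the model name, task id, and `pretrained=` syntax for `--model_args` are placeholders/assumptions):

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "original|mmlu:abstract_algebra|5" \
    --override_batch_size 1 \
    --output_dir ./evals/
```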
@@ -50,11 +56,11 @@ Lastly, create a **line summary** of your evaluation, in `metadata_table.json` (a sketch of one entry follows the field list below).
- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations and is used as a task selection to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for testing, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train, valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
- `balanced` selects examples from the `few_shots_split` with balanced labels, to avoid skewing the few shot examples (hence the model generations) towards one specific label
- `random` selects examples at random from the `few_shots_split`
- `random_sampling` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
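For illustration only, here is a hedged sketch of a single line-summary entry using just the fields described above (the values are hypothetical, and the real file contains additional fields that are elided in this diff):

```json
{
    "suite": ["custom"],
    "prompt_function": "my_task_prompt",
    "hf_repo": "your_org/your_dataset",
    "hf_subset": "default",
    "hf_avail_splits": ["train", "validation", "test"],
    "evaluation_splits": ["test"],
    "few_shots_split": "validation",
    "few_shots_select": "balanced"
}
```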
@@ -102,7 +108,7 @@ These metrics need the model to generate an output. They are therefore slower.
- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
- `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
- `f1_score_macro`: Corpus level macro F1 score
- `f1_score_micro`: Corpus level micro F1 score
- Summarization:
- `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
@@ -141,7 +147,7 @@ These metrics need both the generation and its logprob. They are not working at
- `prediction_perplexity` (HELM): Measure of the logprob of a given input.

## Adding a new metric
If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `metrics_sample`. If not, add it to either of these files depending on the level at which it is applied. Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.

## Examples of scripts to launch lighteval on the cluster
### Evaluate a whole suite on one node, 8 GPUs
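As a rough sketch only (not the exact launcher script from the repo), an 8-GPU run with `accelerate` could look like the following; the model name and task file path are placeholders:

```bash
accelerate launch --multi_gpu --num_processes=8 run_evals_accelerate.py \
    --model_args "pretrained=your_org/your_model" \
    --tasks ./tasks_to_run.txt \
    --override_batch_size 1 \
    --output_dir ./evals/
```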
3 changes: 1 addition & 2 deletions pyproject.toml
@@ -82,8 +82,7 @@ optimum = ["optimum==1.12.0"]
quantization = ["bitsandbytes>=0.41.0", "auto-gptq>=0.4.2"]
adapters = ["peft==0.3.0"]
nanotron = [
"nanotron@git+https://github.com/huggingface/nanotron@8c1a49588d0745a6404644a86547c2dd6a63640e",
"brrr@git+https://github.com/huggingface/brrr@e8a503e2ec08b34eed7522d331aec3bee8cdd29b",
"nanotron@git+https://github.com/huggingface/nanotron",
"tensorboardX"
]
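With this extra defined, nanotron support would presumably be installed with:

```bash
pip install -e ".[nanotron]"
```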

78 changes: 78 additions & 0 deletions run_evals_accelerate.py
@@ -0,0 +1,78 @@
import argparse

from lighteval.main_accelerate import CACHE_DIR, main


def get_parser():
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
weight_type_group = parser.add_mutually_exclusive_group()

weight_type_group.add_argument(
"--delta_weights",
action="store_true",
default=False,
help="set to True of your model should be merged with a base model, also need to provide the base model name",
)
weight_type_group.add_argument(
"--adapter_weights",
action="store_true",
default=False,
help="set to True of your model has been trained with peft, also need to provide the base model name",
)
parser.add_argument(
"--base_model", type=str, default=None, help="name of the base model to be used for delta or adapter weights"
)

parser.add_argument("--model_args", required=True)
parser.add_argument("--output_dir", required=True)
parser.add_argument("--model_dtype", type=str, default=None)
parser.add_argument(
"--multichoice_continuations_start_space",
action="store_true",
help="Whether to force multiple choice continuations starts with a space",
)
parser.add_argument(
"--no_multichoice_continuations_start_space",
action="store_true",
help="Whether to force multiple choice continuations do not starts with a space",
)
parser.add_argument("--push_results_to_hub", default=False, action="store_true")
parser.add_argument("--save_details", action="store_true")
parser.add_argument("--push_details_to_hub", default=False, action="store_true")
parser.add_argument(
"--public_run", default=False, action="store_true", help="Push results and details to a public repo"
)
parser.add_argument("--max_samples", type=int, default=None)
parser.add_argument("--override_batch_size", type=int, default=-1)
parser.add_argument("--dataset_loading_processes", type=int, default=1)
parser.add_argument("--inference_server_address", type=str, default=None)
parser.add_argument("--inference_server_auth", type=str, default=None)
parser.add_argument("--num_fewshot_seeds", type=int, default=1, help="Number of trials the few shots")
parser.add_argument("--cache_dir", type=str, default=CACHE_DIR)
parser.add_argument(
"--results_org",
type=str,
help="Hub organisation where you want to store the results. Your current token must have write access to it",
)
parser.add_argument("--job_id", type=str, help="Optional Job ID for future reference", default="")
parser.add_argument("--use_chat_template", default=False, action="store_true")
parser.add_argument(
"--custom_tasks_file",
type=str,
default=None,
help="Path to a file with custom tasks (a TASK list of dict and potentially prompt formating functions)",
)
group.add_argument(
"--tasks",
type=str,
default=None,
help="Id of a task, e.g. 'original|mmlu:abstract_algebra|5' or path to a texte file with a list of tasks",
)
return parser


if __name__ == "__main__":
parser = get_parser()
args, unknowns = parser.parse_known_args()
main(args)
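For reference, a hypothetical invocation of this script that pushes results to the hub (flag names come from the parser above; the model name, task id, organisation, and `pretrained=` syntax are placeholders):

```bash
python run_evals_accelerate.py \
    --model_args "pretrained=your_org/your_model" \
    --tasks "original|mmlu:abstract_algebra|5" \
    --use_chat_template \
    --push_results_to_hub \
    --results_org your_org \
    --output_dir ./evals/
```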
33 changes: 33 additions & 0 deletions run_evals_nanotron.py
@@ -0,0 +1,33 @@
# flake8: noqa: C901
import argparse

from lighteval.main_nanotron import main


def get_parser():
parser = argparse.ArgumentParser()
parser.add_argument(
"--checkpoint-config-path",
type=str,
required=True,
help="Path to the brr checkpoint YAML or python config file, potentially on S3",
)
parser.add_argument(
"--lighteval-override",
type=str,
help="Path to an optional YAML or python Lighteval config to override part of the checkpoint Lighteval config",
)
parser.add_argument(
"--cache-dir",
type=str,
default="",
help="Cache directory",
)

return parser


if __name__ == "__main__":
parser = get_parser()
args, unknowns = parser.parse_known_args()
main(args.checkpoint_config_path, args.lighteval_override, args.cache_dir)
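A hypothetical invocation of this script (the paths below are placeholders):

```bash
python run_evals_nanotron.py \
    --checkpoint-config-path /path/to/checkpoint/config.yaml \
    --lighteval-override /path/to/lighteval_override.yaml \
    --cache-dir /scratch/lighteval_cache
```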
36 changes: 35 additions & 1 deletion src/lighteval/data.py
@@ -189,7 +189,41 @@ def _sorting_criteria(self, x) -> int:
Returns:
int: The negative total length (context tokens + generation length), used as the sort key for dynamic batching.
"""
toks, (stop_tokens, gen_length) = x
toks = x[0]
meta_data = x[1]
_, gen_length = meta_data[0], meta_data[1]
return -(len(toks) + gen_length)


class GenerativeTaskDatasetNanotron(DynamicBatchDataset):
def __getitem__(self, index) -> Request:
Member: Why do you need your own class? (Is it only to return the index with the item?)

Member Author: Nathan's requirement

Member: base_model does not use the index for each sample, which means we need to adapt the dataset to nanotron.

Member: Yes, but I'm unsure why we need to grab the index for brrr.

"""
Get an item from the dataset depending on the split we are currently in.
For instance, if we are in split 0, we will get the item at index 0, if
we are in split 1, we will get the item at index self.split_size, etc.
Used for dynamic batching.

Args:
index (int): The index of the item.

Returns:
Any: The item at the specified index.
"""
return index, self.sorted_data[index + self.split_start]

def _sorting_criteria(self, x) -> int:
"""
Sorting criterion used for dynamic batching.

Args:
x (Any): The input item (tokenized context plus generation metadata).

Returns:
int: The negative total length (context tokens + generation length), so that an ascending sort puts the longest samples first.
"""
toks = x[0]
meta_data = x[1]
_, gen_length = meta_data[0], meta_data[1]
return -(len(toks) + gen_length)

