adding documentation #282

Closed
wants to merge 29 commits into from
Commits (29):
a324d63
adding documentation
NathanHB Aug 28, 2024
26d8402
adding documentation nanotron
NathanHB Aug 28, 2024
203045a
commit
NathanHB Sep 3, 2024
cbdcf1b
commit
NathanHB Sep 3, 2024
dd67ce4
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
015e924
undo unecessary changes
NathanHB Sep 3, 2024
4e9c30e
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
8aabbc8
still working on docs
NathanHB Sep 5, 2024
3a74186
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 5, 2024
db0c06d
Merge remote-tracking branch 'origin/main' into nathan-add-doc
NathanHB Sep 6, 2024
57b0cd4
commit
NathanHB Sep 9, 2024
7e4d56d
commit
NathanHB Sep 11, 2024
e533074
commit
NathanHB Sep 11, 2024
2f1c7f5
Update docs/source/installation.md
NathanHB Sep 17, 2024
0d1da5d
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
7a8782a
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
1c7454b
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
2539035
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
b5f2942
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
9825950
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
fa67cf0
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f17ce92
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f3c319d
Update docs/source/adding_new_metric.md
NathanHB Sep 18, 2024
bcd6f50
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
33c1e7f
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
016cea4
fix
NathanHB Sep 18, 2024
e86912a
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 18, 2024
3aba2a1
fix
NathanHB Sep 18, 2024
af1ad13
commit
NathanHB Sep 18, 2024
28 changes: 28 additions & 0 deletions docs/source/_toctree.yml
@@ -0,0 +1,28 @@
- local: index
  title: 🌤️ Lighteval
- title: "Getting Started"
  sections:
    - local: installation
      title: Installation
    - local: quicktour
      title: Quicktour
- title: "Guides"
  sections:
    - local: saving_results
      title: Saving Results
    - local: use_python_api
      title: Use The Python API
    - local: adding_new_task
      title: Adding a Custom Task
    - local: adding_new_metric
      title: Adding a Custom Metric
    - local: use_vllm
      title: Using VLLM as backend
    - local: use_tgi
      title: Evaluate on Server
- title: "API Reference"
  sections:
    - local: metric_list
      title: Available Metrics
    - local: tasks
      title: Available Tasks
87 changes: 87 additions & 0 deletions docs/source/adding_new_metric.md
@@ -0,0 +1,87 @@
# Adding a New Metric

First, check if you can use one of the parametrized functions in
[src.lighteval.metrics.metrics_corpus]() or
[src.lighteval.metrics.metrics_sample]().

If not, you can use the `custom_task` system to register your new metric:

<Tip>
To see an example of a custom metric added along with a custom task, look at
<a href="">the IFEval custom task</a>.
</Tip>

- Create a new Python file which should contain the full logic of your metric.
- The file also needs to start with these imports

```python
from aenum import extend_enum
from lighteval.metrics import Metrics
```

You then need to define a sample-level metric:

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]
```

Here the sample-level metric only returns one value. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and their values as values.

```python
def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    response = predictions[0]
    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
```

Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.

```python
def agg_function(items):
    flat_items = [item for sublist in items for item in sublist]
    score = sum(flat_items) / len(flat_items)
    return score
```
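
For instance, the aggregation above flattens the nested per-sample results and averages them. A quick sanity check (plain Python, not lighteval-specific):

```python
# Three samples returned the sub-scores [1, 0], [1] and [0, 1]:
# flattened -> [1, 0, 1, 0, 1], average -> 3 / 5 = 0.6
print(agg_function([[1, 0], [1], [0, 1]]))  # 0.6
```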

Finally, you can define your metric. If it's a sample level metric, you can use the following code:

```python
my_custom_metric = SampleLevelMetric(
    metric_name={custom_metric_name},
    higher_is_better={either True or False},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn=agg_function,
)
```

If your metric defines multiple metrics per sample, you can use the following code:

```python
custom_metric = SampleLevelMetricGrouping(
    metric_name={submetric_names},
    higher_is_better={n: {True or False} for n in submetric_names},
    category={MetricCategory},
    use_case={MetricUseCase},
    sample_level_fn=custom_metric,
    corpus_level_fn={
        "accuracy": np.mean,
        "other_metric": agg_function,
    },
)
```

End with the following, so that your metric is added to our metrics list when the file is loaded as a module.

```python
# Adds the metric to the metric list!
extend_enum(Metrics, "metric_name", metric_function)
if __name__ == "__main__":
print("Imported metric")
```

You can then give your custom metric to lighteval by using `--custom-tasks
path_to_your_file` when launching it.
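
Putting the steps together, here is a minimal sketch of what a complete custom-metric file could look like. The import locations of `SampleLevelMetric`, `MetricCategory`, `MetricUseCase`, and `Doc`, as well as the chosen category and use case, are assumptions to double-check against your lighteval version:

```python
import numpy as np
from aenum import extend_enum

from lighteval.metrics import Metrics
# NOTE: these import paths are assumptions; check where your lighteval version
# exposes SampleLevelMetric, MetricCategory, MetricUseCase and Doc.
from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetric
from lighteval.tasks.requests import Doc


def exact_match_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
    """Scores a single sample: True if the first prediction equals the gold choice."""
    response = predictions[0]
    return response == formatted_doc.choices[formatted_doc.gold_index]


my_exact_match = SampleLevelMetric(
    metric_name="my_exact_match",      # hypothetical name
    higher_is_better=True,
    category=MetricCategory.IGNORED,   # placeholder category
    use_case=MetricUseCase.NONE,       # placeholder use case
    sample_level_fn=exact_match_metric,
    corpus_level_fn=np.mean,           # average the per-sample scores
)

# Adds the metric to the metric list when the file is loaded as a custom task module.
extend_enum(Metrics, "my_exact_match", my_exact_match)

if __name__ == "__main__":
    print("Imported metric")
```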
194 changes: 194 additions & 0 deletions docs/source/adding_new_task.md
@@ -0,0 +1,194 @@
# Adding a Custom Task

To add a new task, first open an issue to determine whether it should be
integrated into the core evaluations of lighteval, the extended tasks, or the
community tasks, and add its dataset to the Hugging Face Hub.

- Core evaluations are evaluations that only require standard logic in their
  metrics and processing, and that we will add to our test suite to ensure
  non-regression over time. They already see high usage in the community.
- Extended evaluations are evaluations that require custom logic in their
  metrics (complex normalisation, an LLM as a judge, ...), which we added to
  make users' lives easier. They already see high usage in the community.
- Community evaluations are new tasks submitted by the community.

A popular community evaluation can move to become an extended or core evaluation over time.

<Tip>
You can find examples of custom tasks in the <a
href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_tasks</a>
directory.
</Tip>

## Step by step creation of a custom task

First, create a Python file under the `community_tasks` directory.

You need to define a prompt function that will convert a line from your
dataset to a document to be used for evaluation.

```python
# Define as many as you need for your different tasks
def prompt_fn(line, task_name: str = None):
    """Defines how to go from a dataset line to a doc object.
    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
    about what this function should do in the README.
    """
    return Doc(
        task_name=task_name,
        query="",
        choices="",
        gold_index=0,
        instruction="",
    )
```
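
As an illustration, a filled-in prompt function for a hypothetical multiple-choice dataset with `question`, `choices`, and `answer` columns (these column names are assumptions about your own dataset) could look like this:

```python
from lighteval.tasks.requests import Doc  # import path may differ across lighteval versions


def mcq_prompt_fn(line, task_name: str = None):
    """Builds a Doc from a row with `question`, `choices` and `answer` (gold index) columns."""
    return Doc(
        task_name=task_name,
        query=f"Question: {line['question']}\nAnswer:",
        choices=[f" {choice}" for choice in line["choices"]],
        gold_index=int(line["answer"]),
        instruction="",
    )
```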

Then, you need to choose a metric: you can either use an existing one (defined
in `lighteval/metrics/metrics.py`) or [create a custom one](./adding_new_metric).

```python
custom_metric = SampleLevelMetric(
metric_name="my_custom_metric_name",
higher_is_better=True,
category=MetricCategory.IGNORED,
use_case=MetricUseCase.NONE,
sample_level_fn=lambda x: x, # how to compute score for one sample
corpus_level_fn=np.mean, # How to aggreagte the samples metrics
)
```

Then, you need to define your task. You can define a task with or without subsets.
To define a task with no subsets:

```python
# This is how you create a simple task (like hellaswag) which has one single subset
# attached to it, and one evaluation possible.
task = LightevalTaskConfig(
name="myothertask",
prompt_function=prompt_fn, # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
suite=["community"],
hf_repo="",
hf_subset="default",
hf_avail_splits=[],
evaluation_splits=[],
few_shots_split=None,
few_shots_select=None,
metric=[], # select your metric in Metrics
)
```

If you want to create a task with multiple subsets, add them to the
`SAMPLE_SUBSETS` list and create a task for each subset.

```python
SAMPLE_SUBSETS = [] # list of all the subsets to use for this eval


class CustomSubsetTask(LightevalTaskConfig):
    def __init__(
        self,
        name,
        hf_subset,
    ):
        super().__init__(
            name=name,
            hf_subset=hf_subset,
            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
            hf_repo="",
            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
            hf_avail_splits=[],
            evaluation_splits=[],
            few_shots_split=None,
            few_shots_select=None,
            suite=["community"],
            generation_size=-1,
            stop_sequence=None,
            output_regex=None,
            frozen=False,
        )


SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
```

Here is a list of the parameters and their meaning (a filled-in example follows the list):

- `name` (str), your evaluation name
- `suite` (list), the suite(s) to which your evaluation should belong. This
field allows us to compare different task implementations and is used as a
task selection to differentiate the versions to launch. At the moment, you'll
find the keywords ["helm", "bigbench", "original", "lighteval", "community",
"custom"]; for core evals, please choose `lighteval`.
- `prompt_function` (Callable), the prompt function you defined in the step
above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation
(note: when the dataset has no subset, fill this field with `"default"`, not
with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train,
valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you
want to select samples for your few-shot examples. It should be different
from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to
select items for your few-shot examples. Can be `null`, or one of:
- `balanced` select examples from the `few_shots_split` with balanced
labels, to avoid skewing the few shot examples (hence the model
generations) toward one specific label
- `random` selects examples at random from the `few_shots_split`
- `random_sampling` selects new examples at random from the
`few_shots_split` for every new item, but if a sampled item is equal to
the current one, it is removed from the available samples
- `random_sampling_from_train` selects new examples at random from the
`few_shots_split` for every new item, but if a sampled item is equal to
the current one, it is kept! Only use this if you know what you are
doing.
- `sequential` selects the first `n` examples of the `few_shots_split`
- `generation_size` (int), the maximum number of tokens allowed for a
generative evaluation. If your evaluation is a log likelihood evaluation
(multi-choice), this value should be -1
- `stop_sequence` (list), a list of strings acting as end of sentence tokens
for your generation
- `metric` (list), the metrics you want to use for your evaluation (see next
section for a detailed explanation)
- `output_regex` (str), A regex string that will be used to filter your
generation. (Generative metrics will only select tokens that are between the
first and the second sequence matched by the regex. For example, for a regex
matching `\n` and a generation `\nModel generation output\nSome other text`
the metric will only be fed with `Model generation output`)
- `frozen` (bool), for now, is set to False, but we will steadily pass all
stable tasks to True.
- `trust_dataset` (bool), set to True if you trust the dataset.
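
To make these parameters more concrete, here is a hypothetical filled-in configuration; the dataset repository, splits, and generation settings are illustrative assumptions, not a real dataset:

```python
task = LightevalTaskConfig(
    name="mytask",
    suite=["community"],
    prompt_function=prompt_fn,
    hf_repo="your-username/your-eval-dataset",  # hypothetical dataset on the Hub
    hf_subset="default",                        # no subset -> "default", not None or ""
    hf_avail_splits=["train", "validation", "test"],
    evaluation_splits=["test"],
    few_shots_split="validation",               # must differ from evaluation_splits
    few_shots_select="balanced",
    generation_size=256,                        # use -1 for log-likelihood (multi-choice) evals
    stop_sequence=["\n"],
    metric=[custom_metric],
    trust_dataset=True,
)
```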


Then you need to add your task to the `TASKS_TABLE` list.

```python
# STORE YOUR EVALS

# tasks with subset:
TASKS_TABLE = SUBSET_TASKS

# tasks without subset:
# TASKS_TABLE = [task]
```

Finally, you need to add the module logic to convert your task to a dict for lighteval.

```python
# MODULE LOGIC
# You should not need to touch this
# Convert to dict for lighteval
if __name__ == "__main__":
print(t.name for t in TASKS_TABLE)
print(len(TASKS_TABLE))
```

Once your file is created, you can run the evaluation with the following command:

```bash
lighteval accelerate \
--model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
--tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
--custom_tasks {path_to_your_custom_task_file} \
--output_dir "./evals"
```
12 changes: 12 additions & 0 deletions docs/source/index.md
@@ -0,0 +1,12 @@
# 🌤️ Lighteval

A lightweight framework for LLM evaluation

LightEval is a lightweight LLM evaluation suite that Hugging Face has been
using internally with the recently released LLM data processing library
datatrove and LLM training library nanotron.

We're releasing it with the community in the spirit of building in the open.

Note that it is still very much early so don't expect 100% stability ^^' In
case of problems or questions, feel free to open an issue!
Collaborator comment:

This is still the same text as from the original release I believe. Maybe at this point it's ok to rephrase a bit into something less "it's very alpha" and more along the lines of "Even though lighteval has successfully been used in a variety of different research projects both internally and externally, keep in mind that parts of it are liable to change, and things may break. In case of problems or questions, feel free to open an issue!"

43 changes: 43 additions & 0 deletions docs/source/installation.md
@@ -0,0 +1,43 @@
# Installation

You can install Lighteval either from PyPI or from source.

## From PyPI

```bash
pip install lighteval
```

## From source

```bash
git clone https://github.com/huggingface/lighteval.git
cd lighteval
pip install -e .
```
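
To verify the installation, here is a quick sanity check from Python; it uses only the standard library's `importlib.metadata`, nothing lighteval-specific:

```python
from importlib.metadata import version

# Prints the installed lighteval version; raises PackageNotFoundError if the install failed.
print(version("lighteval"))
```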

### Extras

Lighteval has optional dependencies that you can install by specifying the
appropriate extras group: `pip install lighteval[<group>]` or `pip install -e .[<group>]`.
For example, `pip install lighteval[accelerate,vllm]` installs both the `accelerate` and `vllm` extras.

| extra name | description |
|--------------|---------------------------------------------------------------------------|
| accelerate | To use accelerate for model and data parallelism with transformers models |
| tgi | To use Text Generation Inference API to evaluate your model |
| nanotron | To evaluate nanotron models |
| quantization | To evaluate quantized models |
| adapters | To evaluate adapters models (delta and peft) |
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |

## Hugging Face login

If you want to push your results to the Hugging Face Hub or evaluate your own
private models, don't forget to add your access token to the environment
variable `HF_TOKEN`. You can do this by running:

```bash
huggingface-cli login
```
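
If you prefer to do this from Python, here is a small sketch using `huggingface_hub` (assuming it is available in your environment); the token value is a placeholder:

```python
import os

from huggingface_hub import login

# Option 1: expose the token through the environment variable read by the Hub libraries.
os.environ["HF_TOKEN"] = "hf_xxx"  # placeholder, replace with your own access token

# Option 2: log in interactively; prompts for a token and caches it locally.
login()
```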