Merge pull request #7 from Sagacify/scorer
feat(scorer): add scorer class support
LucieNvz authored Oct 26, 2023
2 parents fbced3b + d76676d commit 6607b27
Showing 7 changed files with 2,198 additions and 87 deletions.
103 changes: 100 additions & 3 deletions README.md
@@ -2,10 +2,105 @@

[![ci](https://github.com/sagacify/saga-llm-evaluation-ml/actions/workflows/cd.yaml/badge.svg)](https://github.com/Sagacify/saga-llm-evaluation-ml/actions/workflows/cd.yaml)

Welcome to the Saga LLM Evaluation ML library, a versatile Python tool designed for evaluating the performance of various large language models in Natural Language Processing (NLP) tasks. Whether you're developing language models, chatbots, or other NLP applications, our library provides a comprehensive suite of metrics to help you assess the quality of your language models.

## Key Metrics
### Embedding-based Metrics
- BERTScore: A metric that measures the similarity between model-generated text and human-generated references using contextual embeddings. It's a valuable tool for evaluating semantic content. [Read more](https://arxiv.org/pdf/1904.09675.pdf).
- MAUVE: A metric that measures how closely the distribution of model-generated text matches the distribution of human-written text. [Read more](https://arxiv.org/pdf/2212.14578.pdf).

### Language-based Metrics
- BLEURTScore: A learned metric built on BERT and fine-tuned on human ratings to evaluate text generated by LLMs. [Read more](https://arxiv.org/pdf/2004.04696.pdf).
- Q-Squared: A reference-free metric that evaluates the factual consistency of knowledge-grounded dialogue using automatic question generation and question answering.

### LLM-based Metrics
- SelfCheck-GPT: A metric that evaluates the consistency of a language model's output by comparing it against multiple samples drawn from the same model, flagging likely hallucinations. [Read more](https://arxiv.org/pdf/2303.08896.pdf).
- G-Eval: A metric for evaluating any aspect of the generated text using chain-of-thought prompting and another LLM. [Read more](https://arxiv.org/pdf/2303.16634.pdf).
- GPT-Score: A metric that provides an evaluation of any aspect of LLM-generated text. [Read more](https://arxiv.org/pdf/2302.04166.pdf).

You can compute these metrics either all at once using the Scorer provided by this library, or individually, depending on the availability of references, context, and other parameters.

## Installation
To install the Saga LLM Evaluation ML library, use the following command:

```poetry add saga_llm_evaluation_ml```
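
After installation, you can check that the package is importable; the top-level module name matches the one used in the examples below:

```python
# Quick sanity check that the package is importable after installation
import saga_llm_evaluation_ml

print(saga_llm_evaluation_ml.__name__)
```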

## Usage
Here's a complete example of how to use this library:

### Default use of the Scorer
```python
# Note: the import path below is an assumption based on the package layout
# (see the standalone example further down); check the package for the exact
# module that exposes LLMScorer.
from saga_llm_evaluation_ml.score import LLMScorer

model = "TheBloke/Llama-2-7b-Chat-GGUF"
scorer = LLMScorer(model=model)

llm_input = "I am a dog."
prompt = f"System: You are a cat. You don't like dogs. User: {llm_input}"
context = "Examples: Eww, I hate dogs."
prediction = "I am a cat, I don't like dogs."
reference = "I am a cat, I don't like dogs, miau."
task = "diag"
aspect = ["CON"]
custom_prompt = {
    "name": "Fluency",
    "task": "Dialog",
    "aspect": "Evaluate the fluency of the following dialog.",
}

# All default
scores = scorer.score(llm_input, prompt, prediction)

# With reference
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    reference=reference,
    n_samples=10,
)

# With context
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    context=context,
    n_samples=10,
)

# With task and aspect
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    task=task,
    aspects=aspect,
    n_samples=10,
)

# With custom prompt
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    custom_prompt=custom_prompt,
    n_samples=10,
)
```
The result is a dictionary with two keys: `metrics`, which holds the score computed for each metric, and `metadata`, which holds metadata computed from the text inputs.
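
For example, you can inspect the returned dictionary as follows. This is a minimal sketch; the exact metric names present under `metrics` depend on which inputs (reference, context, task, etc.) were provided:

```python
# Minimal sketch: the metric names below depend on the inputs provided above.
print(scores.keys())  # expected: dict_keys(['metrics', 'metadata'])

for metric_name, value in scores["metrics"].items():
    print(f"{metric_name}: {value}")

print(scores["metadata"])
```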


### Standalone use of the metrics
You can also use the metrics individually as shown below:
```python
from saga_llm_evaluation_ml.helpers.embedding_metrics import BERTScore

bert_score = BERTScore()
scores = bert_score.compute_score(
references=["This is a reference sentence"],
candidates=["This is a candidate sentence"],
)
```
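
Other metrics follow the same pattern. The sketch below is hypothetical: the module location, the `MAUVE` class name, and the `compute_score` keyword arguments are assumptions modeled on the BERTScore example above, so check the package source for the actual interface:

```python
# Hypothetical sketch: class location and argument names are assumptions
# modeled on the BERTScore example; verify them against the package source.
from saga_llm_evaluation_ml.helpers.embedding_metrics import MAUVE

mauve = MAUVE()
scores = mauve.compute_score(
    references=["This is a reference sentence"],
    candidates=["This is a candidate sentence"],
)
print(scores)
```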

## Development

@@ -32,4 +127,6 @@ poetry run pytest
poetry run black saga_llm_evaluation_ml/ tests/
# Linter
poetry run pylint saga_llm_evaluation_ml/ tests/
```

Feel free to contribute and make this library even more powerful! We appreciate your support. 💻💪🏻