Merge pull request #7 from Sagacify/scorer
feat(scorer): add scorer class support
LucieNvz authored Oct 26, 2023
2 parents fbced3b + d76676d commit 6607b27
Showing 7 changed files with 2,198 additions and 87 deletions.
103 changes: 100 additions & 3 deletions README.md
@@ -2,10 +2,105 @@

[![ci](https://github.com/sagacify/saga-llm-evaluation-ml/actions/workflows/cd.yaml/badge.svg)](https://github.com/Sagacify/saga-llm-evaluation-ml/actions/workflows/cd.yaml)

Welcome to the Saga LLM Evaluation ML library, a versatile Python tool designed for evaluating the performance of various large language models in Natural Language Processing (NLP) tasks. Whether you're developing language models, chatbots, or other NLP applications, our library provides a comprehensive suite of metrics to help you assess the quality of your language models.

## Key Metrics
### Embedding-based Metrics
- BERTScore: A metric that measures the similarity between model-generated text and human-generated references using contextual embeddings. It's a valuable tool for evaluating semantic content. [Read more](https://arxiv.org/pdf/1904.09675.pdf).
- MAUVE: A metric that measures how closely the distribution of model-generated text matches the distribution of human-written text. [Read more](https://arxiv.org/pdf/2212.14578.pdf).

### Language-based Metrics
- BLEURTScore: A learned metric built on BERT and fine-tuned on human ratings to evaluate text generated by LLMs. [Read more](https://arxiv.org/pdf/2004.04696.pdf).
- Q-Squared: A reference-free metric that evaluates the factual consistency of knowledge-grounded dialogue using automatic question generation and question answering.

### LLM-based Metrics
- SelfCheck-GPT: A metric that evaluates the consistency of a language model's output by comparing it against multiple samples drawn from the same model, flagging likely hallucinations. [Read more](https://arxiv.org/pdf/2303.08896.pdf).
- G-Eval: A metric for evaluating any aspect of the generated text using chain-of-thought prompting and another LLM. [Read more](https://arxiv.org/pdf/2303.16634.pdf).
- GPT-Score: A metric that provides an evaluation of any aspect of LLM-generated text. [Read more](https://arxiv.org/pdf/2302.04166.pdf).

You can compute these metrics either all at once using the Scorer provided by this library, or individually, depending on the availability of references, context, and other parameters.

## Installation
To install the Saga LLM Evaluation ML library, use the following command:

```poetry add saga_llm_evaluation_ml```
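
After installation, you can check that the package is importable; the top-level module name matches the one used in the examples below:

```python
# Quick sanity check that the package is importable after installation
import saga_llm_evaluation_ml

print(saga_llm_evaluation_ml.__name__)
```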

## Usage
Here's a complete example of how to use this library:

### Default use of the Scorer
```python
# Note: the import path below is an assumption based on the package layout
# (see the standalone example further down); check the package for the exact
# module that exposes LLMScorer.
from saga_llm_evaluation_ml.score import LLMScorer

model = "TheBloke/Llama-2-7b-Chat-GGUF"
scorer = LLMScorer(model=model)

llm_input = "I am a dog."
prompt = f"System: You are a cat. You don't like dogs. User: {llm_input}"
context = "Examples: Eww, I hate dogs."
prediction = "I am a cat, I don't like dogs."
reference = "I am a cat, I don't like dogs, miau."
task = "diag"
aspect = ["CON"]
custom_prompt = {
    "name": "Fluency",
    "task": "Dialog",
    "aspect": "Evaluate the fluency of the following dialog.",
}

# All default
scores = scorer.score(llm_input, prompt, prediction)

# With reference
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    reference=reference,
    n_samples=10,
)

# With context
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    context=context,
    n_samples=10,
)

# With task and aspect
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    task=task,
    aspects=aspect,
    n_samples=10,
)

# With custom prompt
scores = scorer.score(
    llm_input,
    prompt,
    prediction,
    custom_prompt=custom_prompt,
    n_samples=10,
)
```
The result is a dictionary with two keys: `metrics`, which holds the score computed for each metric, and `metadata`, which holds metadata computed from the text inputs.
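
For example, you can inspect the returned dictionary as follows. This is a minimal sketch; the exact metric names present under `metrics` depend on which inputs (reference, context, task, etc.) were provided:

```python
# Minimal sketch: the metric names below depend on the inputs provided above.
print(scores.keys())  # expected: dict_keys(['metrics', 'metadata'])

for metric_name, value in scores["metrics"].items():
    print(f"{metric_name}: {value}")

print(scores["metadata"])
```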


### Standalone use of the metrics
You can also use the metrics individually as shown below:
```python
from saga_llm_evaluation_ml.helpers.embedding_metrics import BERTScore

bert_score = BERTScore()
scores = bert_score.compute_score(
references=["This is a reference sentence"],
candidates=["This is a candidate sentence"],
)
```
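
Other metrics follow the same pattern. The sketch below is hypothetical: the module location, the `MAUVE` class name, and the `compute_score` keyword arguments are assumptions modeled on the BERTScore example above, so check the package source for the actual interface:

```python
# Hypothetical sketch: class location and argument names are assumptions
# modeled on the BERTScore example; verify them against the package source.
from saga_llm_evaluation_ml.helpers.embedding_metrics import MAUVE

mauve = MAUVE()
scores = mauve.compute_score(
    references=["This is a reference sentence"],
    candidates=["This is a candidate sentence"],
)
print(scores)
```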

## Development

@@ -32,4 +127,6 @@ poetry run pytest
poetry run black saga_llm_evaluation_ml/ tests/
# Linter
poetry run pylint saga_llm_evaluation_ml/ tests/
```

Feel free to contribute and make this library even more powerful! We appreciate your support. 💻💪🏻