docs: add ADR for RAG Evaluations Framework #842

Merged
merged 25 commits on Oct 4, 2024
Changes from 9 commits

Commits
a8b18bd
add first pass at ADR
jalling97 Jul 26, 2024
f6913f4
add datasets rationale
jalling97 Jul 26, 2024
51496c1
Merge remote-tracking branch 'origin/main' into 823-adr-rag-evaluatio…
jalling97 Jul 26, 2024
bf293c2
Merge remote-tracking branch 'origin/main' into 823-adr-rag-evaluatio…
jalling97 Jul 26, 2024
ca3b1de
Merge branch 'main' into 823-adr-rag-evaluations-framework
jalling97 Aug 27, 2024
f7482a1
Update 0007-rag-eval-framework.md
jalling97 Aug 27, 2024
21b9fa1
Update 0007-rag-eval-framework.md
jalling97 Aug 27, 2024
7a8c902
Update 0007-rag-eval-framework.md
jalling97 Aug 28, 2024
7fc1d93
Update 0007-rag-eval-framework.md
jalling97 Sep 6, 2024
6ad5891
Add models to evaluate section
jalling97 Sep 10, 2024
ec66679
add note about customizing RAG
jalling97 Sep 10, 2024
733f6af
Expand on QA documentation reasoning
jalling97 Sep 10, 2024
0534d4e
fix typo
jalling97 Sep 10, 2024
4cba064
Merge branch 'main' into 823-adr-rag-evaluations-framework
jalling97 Sep 10, 2024
40ca0d0
add execution/delivery decision and rationale
jalling97 Sep 19, 2024
a4eeb64
Update 0007-rag-eval-framework.md
jalling97 Sep 25, 2024
5854b66
Update 0007-rag-eval-framework.md
jalling97 Sep 30, 2024
8b06fc1
Merge branch 'main' into 823-adr-rag-evaluations-framework
jalling97 Sep 30, 2024
8828f09
Merge branch 'main' into 823-adr-rag-evaluations-framework
justinthelaw Oct 1, 2024
d208e66
Update adr/0007-rag-eval-framework.md
jalling97 Oct 1, 2024
4965111
Update adr/0007-rag-eval-framework.md
jalling97 Oct 1, 2024
13fce77
Update 0007-rag-eval-framework.md
jalling97 Oct 1, 2024
c654c25
Update 0007-rag-eval-framework.md
jalling97 Oct 3, 2024
30f55f0
Update 0007-rag-eval-framework.md
jalling97 Oct 3, 2024
04a482f
Update 0007-rag-eval-framework.md
jalling97 Oct 3, 2024
168 changes: 168 additions & 0 deletions adr/0007-rag-eval-framework.md
@@ -0,0 +1,168 @@
# LeapfrogAI RAG Evaluation Framework

## Table of Contents

- [LeapfrogAI RAG Evaluation Framework](#leapfrogai-rag-evaluation-framework)
- [Table of Contents](#table-of-contents)
- [Status](#status)
- [Context](#context)
- [Decisions and Rationale](#decisions-and-rationale)
- [Tools](#tools)
- [Datasets](#datasets)
- [Models to Evaluate](#models-to-evaluate)
- [LLM-as-Judge / LLMs-as-Jury](#llm-as-judge--llms-as-jury)
- [Metrics / Evaluations](#metrics--evaluations)
- [Execution](#execution)
- [Delivery](#delivery)
- [Rationale](#rationale)
- [Alternatives](#alternatives)
- [Related ADRs](#related-adrs)
- [References](#references)

## Status

DRAFT

## Context

LeapfrogAI uses RAG to provide context-aware responses to users who have specific data they need to reference. To make sure RAG is operating at the level we need it to, we need measurable feedback from the RAG pipeline so we can improve it, and we need a standard we can show to mission heroes to demonstrate that we are in fact operating at that level. We do this with RAG-focused evaluations. This ADR documents the decisions and lessons learned in enabling a full-scale RAG evaluations pipeline MVP.

## Decisions and Rationale

This section covers each of the decision points that needed to be made, alongside an explanation of how those decisions were made. Each subsection covers a different aspect of the RAG evaluations framework.

### Tools
<details>
<summary>Details</summary>

#### Decision
The primary toolset for architecting RAG evaluations will be **[DeepEval](https://docs.confident-ai.com/)**.
#### Rationale
Please see the [RAG Evaluations Toolset](/adr/0004-rag-eval-toolset.md) ADR for an in-depth discussion of why DeepEval was chosen over other alternatives.
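
For illustration, a minimal sketch of what a DeepEval RAG test case and metric run can look like is shown below (all values are made up for the example; the actual LeapfrogAI evaluation harness may wire this up differently):

```python
# Minimal DeepEval sketch: one RAG test case scored by one metric.
# Values are illustrative only.
from deepeval import evaluate
from deepeval.metrics import ContextualRecallMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the maximum takeoff weight?",                # question sent to the LLM
    actual_output="The maximum takeoff weight is 30,000 lbs.",  # answer the RAG pipeline produced
    expected_output="30,000 lbs",                                # ground truth answer
    retrieval_context=["Section 3.2: maximum takeoff weight is 30,000 lbs."],  # retrieved chunks
)

# Contextual Recall judges how well the retrieved context covers the expected output.
metric = ContextualRecallMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```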

</details>

### Datasets
<details>
<summary>Details</summary>

#### Decision
To handle RAG evaluations, two types of datasets were determined to be needed:
- Question/Answer (QA)
- Needle in a Haystack (NIAH)

A QA dataset should contain a set of [test cases](https://docs.confident-ai.com/docs/evaluation-test-cases) that have:
- Questions, which will be prompted to the LLM
- Ground truth answers, which will be compared against the answers generated by the LLM
- Context, which will contain the correct piece of source documentation that supports the true answer
- The full source documentation from which the context is derived

A dataset for [NIAH Testing](https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/) should contain:
- A series of irrelevant texts of varying context lengths, each with one point of information hidden within

To support these needs, two datasets were created:
- [LFAI_RAG_qa_v1](https://huggingface.co/datasets/defenseunicorns/LFAI_RAG_qa_v1)
- [LFAI_RAG_niah_v1](https://huggingface.co/datasets/defenseunicorns/LFAI_RAG_niah_v1)

These two datasets will be used as the basis for LFAI RAG evaluations that require data sources.
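
For reference, a single LFAI_RAG_qa_v1 entry can be thought of as a record like the sketch below (field names here are illustrative placeholders, not necessarily the dataset's actual column names):

```python
# Illustrative QA record covering the four required components.
# Field names are placeholders, not the dataset's exact schema.
qa_record = {
    "question": "When was the program first funded?",
    "ground_truth_answer": "The program was first funded in 2019.",
    "context": "…the initiative received its initial round of funding in 2019…",
    "source_document": "program_history.pdf",  # the full document the context was drawn from
}
```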

#### Rationale

These datasets were created because they filled a gap in the openly available datasets that could have been used. For QA datasets, for example, no existing dataset contained all **4** components listed above: many had the questions, answers, and context, but none also included the source documents in a readily accessible manner. The fastest and most effective course of action was therefore to generate a QA dataset from source documentation using the [DeepEval Synthesizer](https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data).

The NIAH dataset had a similar "incompleteness" problem. While other iterations of NIAH datasets are more readily available than QA datasets, some [datasets](https://huggingface.co/datasets/nanotron/simple_needle_in_a_hay_stack) had haystacks constructed of short repeating sentences, which does not mirror what a deployment context is likely to look like. Other implementations mirrored the original [NIAH experiment](https://x.com/GregKamradt/status/1722386725635580292?lang=en) using [Paul Graham essays](https://paulgraham.com/articles.html), but did not release their specific datasets. It therefore made sense to quickly generate a custom dataset that uses the same Paul Graham essays as context, inserting individual "needles" at certain context lengths. LFAI_RAG_niah_v1 includes context lengths from 512 to 128k characters.
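
The construction approach can be sketched as follows (a simplified illustration of the idea, not the exact code used to build LFAI_RAG_niah_v1):

```python
# Rough sketch of NIAH example construction: hide one "needle" sentence inside
# a haystack of unrelated essay text at a chosen depth. Illustrative only.

def build_niah_example(haystack: str, needle: str, context_length: int, depth: float) -> str:
    """Truncate the haystack to roughly `context_length` characters and insert
    the needle at a fractional `depth` (0.0 = start, 1.0 = end)."""
    haystack = haystack[:context_length]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + " " + needle + " " + haystack[insert_at:]

# Example: a 512-character haystack with the needle buried halfway in.
essay_text = "..."  # e.g., concatenated Paul Graham essays
needle = "The secret ingredient in the recipe is cardamom."  # made-up needle
example = build_niah_example(essay_text, needle, context_length=512, depth=0.5)
```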

</details>

### Models to Evaluate
<details>
<summary>Details</summary>

#### Decision

#### Rationale

</details>

### LLM-as-Judge / LLMs-as-Jury
<details>
<summary>Details</summary>

#### Decision

For the RAG Evals MVP, [Claude 3.5 Sonnet](https://www.anthropic.com/news/claude-3-5-sonnet) by Anthropic will be used as a single LLM-as-Judge.

#### Rationale

There are two points to rationalize: the model choice and the decision to use a single judge.

In order to reach an MVP, a single LLM judge will be utilized for the evaluations that require one. This is the first stage, so that the evaluation framework can begin receiving results. As progress is made, additional LLM-based judges will be incorporated to develop an LLM-jury style approach. For context, please see the following [paper](https://arxiv.org/pdf/2404.18796).

Claude 3.5 Sonnet was chosen as the first judge due to its high level of [performance](https://artificialanalysis.ai/models/claude-35-sonnet), which is crucial when utilizing an LLM judge. Additionally, it sits outside the family of models that will be evaluated, which has been shown to be more effective than judging with a model from the same family due to [self-enhancement bias](https://arxiv.org/pdf/2306.05685).
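
As a rough sketch of what this looks like in practice, DeepEval's custom-model pattern can wrap the Anthropic SDK so that judge-based metrics call Claude 3.5 Sonnet (the class and method names below follow DeepEval's documented `DeepEvalBaseLLM` interface; the exact wiring in the evaluation code may differ):

```python
# Hedged sketch: using Claude 3.5 Sonnet as the DeepEval judge model.
from anthropic import Anthropic
from deepeval.models import DeepEvalBaseLLM


class ClaudeJudge(DeepEvalBaseLLM):
    """Wraps the Anthropic SDK so DeepEval metrics can use Claude as the judge."""

    def __init__(self, model_name: str = "claude-3-5-sonnet-20240620"):
        self.model_name = model_name
        self.client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    async def a_generate(self, prompt: str) -> str:
        # Simple sketch: reuse the synchronous call.
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_name


# Judge-based metrics can then be pointed at Claude, e.g.:
# FaithfulnessMetric(model=ClaudeJudge())
```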

</details>

### Metrics / Evaluations
<details>
<summary>Details</summary>

#### Decision

The LeapfrogAI RAG evaluation framework will utilize the following evaluations:

LLM-as-a-judge metrics to use:
- [Contextual Recall](https://docs.confident-ai.com/docs/metrics-contextual-recall) (for evaluating retrieval)
- [Answer Correctness](https://docs.confident-ai.com/docs/metrics-llm-evals) (for evaluating generation)
- [Faithfulness](https://docs.confident-ai.com/docs/metrics-faithfulness) (for evaluating generation)

Non-LLM-enabled evaluations:
- Needle in a Haystack (for evaluating retrieval and generation)
- Annotation Relevancy (for evaluating retrieval)

Performance Metrics:
- Total Execution Runtime

Non-RAG LLM benchmarks:
- [HumanEval](https://docs.confident-ai.com/docs/benchmarks-human-eval) (for evaluating generation)

#### Rationale

These metrics were chosen to balance the explainability/understandability of non-LLM-based evaluations and the flexibility/scalability of LLM-as-judge evaluations.
- Contextual Recall: evaluates the extent to which the context retrieved by RAG corresponds to an expected output
- Answer Correctness: evaluates if an answer generated by an LLM is accurate when compared to the question asked and its context
- Faithfulness: evaluates whether an answer generated by an LLM factually aligns with the context provided
- Needle in a Haystack (retrieval): determines if a needle of information is correctly retrieved from the vector store by RAG
- Needle in a Haystack (response): determines if a needle of information is correctly given in the final response of the LLM in a RAG pipeline
- HumanEval: evaluates an LLM's code generation abilities (not RAG-enabled, but useful as an established baseline to compare against)
- Annotation Relevancy: a custom metric that measures how often documents that have nothing to do with the question are cited in the annotations (higher is better); a minimal sketch of the computation follows this list
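
Since Annotation Relevancy is a custom metric, here is a minimal sketch of the assumed computation (the fraction of cited documents that are actually relevant to the question, so higher is better):

```python
# Hedged sketch of the Annotation Relevancy computation; the real metric in the
# evaluation harness may be defined differently.
def annotation_relevancy(cited_doc_ids: list[str], relevant_doc_ids: set[str]) -> float:
    """Return the share of cited annotations that point at relevant documents."""
    if not cited_doc_ids:
        return 0.0
    relevant_citations = sum(1 for doc_id in cited_doc_ids if doc_id in relevant_doc_ids)
    return relevant_citations / len(cited_doc_ids)

# Example: 3 of 4 cited documents are relevant -> 0.75
score = annotation_relevancy(["doc1", "doc2", "doc3", "junk"], {"doc1", "doc2", "doc3"})
```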

While these metrics will be utilized first to balance value gained against time to implement, additional evaluation metrics will be added soon after MVP status is reached. Potential options include:
- RAG retrieval Hit Rate: a non-LLM metric that evaluates how often a retrieved context matches the expected context for a question/answer scenario
- Performance metrics: non-LLM metrics that measure performance targets such as runtime and compute (CPU and GPU), etc. (requires a standardized deployment context)

</details>

### Execution
<details>
<summary>Details</summary>

#### Decision

#### Rationale

</details>

### Delivery
<details>
<summary>Details</summary>

#### Decision

#### Rationale

</details>

## Related ADRs
This ADR was influenced by the [RAG Evaluations Toolset](/adr/0004-rag-eval-toolset.md) ADR.

## References