RAG Interface Rewrite (#95)
svilupp authored Mar 20, 2024
1 parent 251c076 commit feb03ed
Showing 39 changed files with 2,780 additions and 686 deletions.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -6,18 +6,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

### Fixed

## [0.16.0]

### Added
- Added pretty-printing via `PT.pprint` that does NOT depend on Markdown and splits text to adjust to the width of the output terminal.
It is useful in notebooks to add newlines.
- Added support annotations for RAGTools (see `?RAGTools.Experimental.annotate_support` for more information) to highlight which parts of the generated answer come from the provided context versus the model's knowledge base. It's useful for transparency and debugging, especially in the context of AI-generated content. You can experience it if you run the output of `airag` through pretty printing (`PT.pprint`).
- Added utility `distance_longest_common_subsequence` to find the normalized distance between two strings (or a vector of strings). Always returns a number between 0-1, where 0 means the strings are identical and 1 means they are completely different. It's useful for comparing the similarity between the context provided to the model and the generated answer.
- Added a new documentation section "Extra Tools" to highlight key functionality in various modules, eg, the available text utilities, which were previously hard to discover.
- Extended documentation FAQ with tips on tackling rate limits and other common issues with OpenAI API.
- Extended documentation with all available prompt templates. See section "Prompt Templates" in the documentation.
- Added new RAG interface underneath `airag` in `PromptingTools.RAGTools.Experimental`. Each step now has a dedicated function and a type that can be customized to achieve arbitrary logic (via defining methods for your own types). `airag` is split into two main steps: `retrieve` and `generate!`. You can use them separately or together. See `?airag` for more information.

### Updated
- Renamed `split_by_length` text splitter to `recursive_splitter` to make it easier to discover and understand its purpose. `split_by_length` is still available as a deprecated alias.

### Fixed
- Fixed a bug where `LOCAL_SERVER` default value was not getting picked up. Now, it defaults to `http://localhost:8000` if not set in the preferences, which is the address of the server started by Llama.jl.
- Fixed a bug in multi-line code annotation, which was assigning too optimistic scores to the generated code. Now the score of the chunk is the length-weighted score of the "top" source chunk divided by the full length of score tokens (much more robust and demanding).

## [0.15.0]

6 changes: 3 additions & 3 deletions docs/src/examples/building_RAG.md
@@ -57,7 +57,7 @@ What does it do?
- [OPTIONAL] extracts any potential tags/filters from the question and applies them to filter down the potential candidates (use `extract_metadata=true` in `build_index`, you can also provide some filters explicitly via `tag_filter`)
- [OPTIONAL] re-ranks the candidate chunks (define and provide your own `rerank_strategy`, eg Cohere ReRank API)
- build a context from the closest chunks (use `chunks_window_margin` to tweak if we include preceding and succeeding chunks as well, see `?build_context` for more details)
- generate an answer from the closest chunks (use `return_details=true` to see under the hood and debug your application)
- generate an answer from the closest chunks (use `return_all=true` to see under the hood and debug your application)

You should save the index for later to avoid re-embedding / re-extracting the document chunks!

@@ -124,7 +124,7 @@ Let's evaluate this QA item with a "judge model" (often GPT-4 is used as a judge

````julia
# Note: we used the same question, but generated a different context and answer via `airag`
msg, ctx = airag(index; evals[1].question, return_details = true);
ctx = airag(index; evals[1].question, return_all = true);
# ctx is a RAGContext object that keeps all intermediate states of the RAG pipeline for easy evaluation
judged = aiextract(:RAGJudgeAnswerFromContext;
ctx.context,
@@ -173,7 +173,7 @@ Let's run each question & answer through our eval loop in async (we do it only f
````julia
results = asyncmap(evals[1:10]) do qa_item
# Generate an answer -- often you want the model_judge to be the highest quality possible, eg, "GPT-4 Turbo" (alias "gpt4t")
msg, ctx = airag(index; qa_item.question, return_details = true,
msg, ctx = airag(index; qa_item.question, return_all = true,
top_k = 3, verbose = false, model_judge = "gpt4t")
# Evaluate the response
# Note: you can log key parameters for easier analysis later
210 changes: 207 additions & 3 deletions docs/src/extra_tools/rag_tools_intro.md
@@ -6,6 +6,10 @@ CurrentModule = PromptingTools.Experimental.RAGTools

`RAGTools` is an experimental module that provides a set of utilities for building Retrieval-Augmented Generation (RAG) applications, ie, applications that generate answers by combining knowledge of the underlying AI model with the information from the user's knowledge base.

It is designed to be powerful and flexible, allowing you to build RAG applications with minimal effort. Extend any step of the pipeline with your own custom code (see the [RAG Interface](@ref) section), or use the provided defaults to get started quickly.

Once the API stabilizes (near term), we hope to carve it out into a separate package.

Import the module as follows:

```julia
@@ -22,13 +26,213 @@ const RT = PromptingTools.Experimental.RAGTools
The main functions to be aware of are:
- `build_index` to build a RAG index from a list of documents (type `ChunkIndex`)
- `airag` to generate answers using the RAG model on top of the `index` built above
- `annotate_support` to highlight which parts of the RAG answer are supported by the documents in the index vs which are generated by the model
- `retrieve` to retrieve relevant chunks from the index for a given question
- `generate!` to generate an answer from the retrieved chunks
- `annotate_support` to highlight which parts of the RAG answer are supported by the documents in the index vs which are generated by the model; it is applied automatically if you use pretty printing with `pprint` (eg, `pprint(result)`)
- `build_qa_evals` to build a set of question-answer pairs for evaluation of the RAG model from your corpus

See example `examples/building_RAG.jl` for an end-to-end example of how to use these tools.

The hope is to provide a modular and easily extensible set of tools for building RAG applications in Julia. Feel free to open an issue or ask in the `#generative-ai` channel in the JuliaLang Slack if you have a specific need.

## Examples

Let's build an index. We need to provide a starter list of documents:
```julia
sentences = [
"Find the most comprehensive guide on Julia programming language for beginners published in 2023.",
"Search for the latest advancements in quantum computing using Julia language.",
"How to implement machine learning algorithms in Julia with examples.",
"Looking for performance comparison between Julia, Python, and R for data analysis.",
"Find Julia language tutorials focusing on high-performance scientific computing.",
"Search for the top Julia language packages for data visualization and their documentation.",
"How to set up a Julia development environment on Windows 10.",
"Discover the best practices for parallel computing in Julia.",
"Search for case studies of large-scale data processing using Julia.",
"Find comprehensive resources for mastering metaprogramming in Julia.",
"Looking for articles on the advantages of using Julia for statistical modeling.",
"How to contribute to the Julia open-source community: A step-by-step guide.",
"Find the comparison of numerical accuracy between Julia and MATLAB.",
"Looking for the latest Julia language updates and their impact on AI research.",
"How to efficiently handle big data with Julia: Techniques and libraries.",
"Discover how Julia integrates with other programming languages and tools.",
"Search for Julia-based frameworks for developing web applications.",
"Find tutorials on creating interactive dashboards with Julia.",
"How to use Julia for natural language processing and text analysis.",
"Discover the role of Julia in the future of computational finance and econometrics."
]
```

Let's index these "documents":

```julia
index = build_index(sentences; chunker_kwargs=(; sources=map(i -> "Doc$i", 1:length(sentences))))
```

This is equivalent to `index = build_index(SimpleIndexer(), sentences)`, which dispatches to the default implementation of each step via the `SimpleIndexer` struct. These default implementations are accepted as an optional first argument to the main functions - there is no need to provide them if you're running the default pipeline.

Notice that we have provided a `chunker_kwargs` argument to the `build_index` function. These kwargs are passed on to the `chunker` step.

Now let's generate an answer to a question.

1. Run end-to-end RAG (retrieve + generate!), return `AIMessage`
```julia
question = "What are the best practices for parallel computing in Julia?"

msg = airag(index; question) # short for airag(RAGConfig(), index; question)
## Output:
## [ Info: Done with RAG. Total cost: \$0.0
## AIMessage("Some best practices for parallel computing in Julia include us...
```

2. Explore what's happening under the hood by changing the return type - `RAGResult` contains all intermediate steps.
```julia
result = airag(index; question, return_all=true)
## RAGResult
## question: String "What are the best practices for parallel computing in Julia?"
## rephrased_questions: Array{String}((1,))
## answer: SubString{String}
## final_answer: SubString{String}
## context: Array{String}((5,))
## sources: Array{String}((5,))
## emb_candidates: CandidateChunks{Int64, Float32}
## tag_candidates: CandidateChunks{Int64, Float32}
## filtered_candidates: CandidateChunks{Int64, Float32}
## reranked_candidates: CandidateChunks{Int64, Float32}
## conversations: Dict{Symbol, Vector{<:PromptingTools.AbstractMessage}}
```

You can still get the underlying message from the result; see `result.conversations[:final_answer]` (the dictionary keys correspond to the function names of the individual steps).
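For example, a minimal sketch (assuming the default pipeline, where the last message of the `:final_answer` conversation is the AI reply):

```julia
# The dictionary keys mirror the step names; the last message of the
# :final_answer conversation is the generated answer itself
msg = result.conversations[:final_answer][end]
```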


3. If you need to customize the pipeline, break it into its sub-steps, `retrieve` and `generate!` - `RAGResult` serves as the intermediate result.
```julia
# Retrieve which chunks are relevant to the question
result = retrieve(index, question)
# Generate an answer
result = generate!(index, result)
```

You can leverage the pretty-printing system via `pprint`, which automatically annotates how well the answer is supported by the chunks we provided to the model.
It is configurable, and you can enable only some of its features (eg, scores, sources).

```julia
pprint(result)
```

You'll see the following in the REPL, but with color highlighting in the terminal.

```plaintext
--------------------
QUESTION(s)
--------------------
- What are the best practices for parallel computing in Julia?
--------------------
ANSWER
--------------------
Some of the best practices for parallel computing in Julia include:[1,0.7]
- Using [3,0.4]`@threads` for simple parallelism[1,0.34]
- Utilizing `Distributed` module for more complex parallel tasks[1,0.19]
- Avoiding excessive memory allocation
- Considering task granularity for efficient workload distribution
--------------------
SOURCES
--------------------
1. Doc8
2. Doc15
3. Doc5
4. Doc2
5. Doc9
```

**How to read the output**
- Color legend:
- No color: High match with the context, can be trusted more
- Blue: Partial match against some words in the context, investigate
- Magenta (Red): No match with the context, fully generated by the model
- Square brackets: The best matching context ID + Match score of the chunk (eg, `[3,0.4]` means the highest support for the sentence is from the context chunk number 3 with a 40% match).

Want more?

See `examples/building_RAG.jl` for one more example.

## RAG Interface

### System Overview

This system is designed for information retrieval and response generation, structured in three main phases:
- Preparation, when you create an instance of `AbstractIndex`
- Retrieval, when you surface the most relevant chunks/items in the `index` and return an `AbstractRAGResult`, which contains references to the chunks (`AbstractCandidateChunks`)
- Generation, when you generate an answer based on the context built from the retrieved chunks; it returns either an `AIMessage` or an `AbstractRAGResult`

The system is designed to be hackable and extensible at almost every entry point.
If you want to customize the behavior of any step, you can do so by defining a new type and a corresponding method for the step you're changing, eg,
```julia
struct MyReranker <: AbstractReranker end
RT.rerank(::MyReranker, index, candidates) = ...
```
And then you'd ask the `retrieve` step to use your custom `MyReranker`, eg, `retrieve(....; reranker = MyReranker())` (or customize the main dispatching `AbstractRetriever` struct).
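For illustration, here is a minimal sketch that combines the two snippets above. The pass-through `rerank` body is an assumption made for the example (a real reranker would reorder the candidates, and its method may accept additional arguments):

```julia
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

# A hypothetical no-op reranker: returns the candidates unchanged
struct MyReranker <: RT.AbstractReranker end
RT.rerank(::MyReranker, index, candidates) = candidates

# Ask the retrieval step to use it; all other sub-steps keep their defaults.
# `index` is the ChunkIndex built earlier in the Examples section.
question = "What are the best practices for parallel computing in Julia?"
result = retrieve(index, question; reranker = MyReranker())
```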

The overarching principles are:
- Always dispatch / customize the behavior by defining a new struct and the corresponding method for the existing functions (eg, the `rerank` function for the re-ranking step).
- Custom types are provided as the first argument (the high-level functions will work without them as we provide some defaults).
- Custom types do NOT have any internal fields or DATA (with the exception of managing sub-steps of the pipeline like `AbstractRetriever` or `RAGConfig`).
- Additional data should be passed around as keyword arguments (eg, `chunker_kwargs` in `build_index` to pass data to the chunking step). The intention is to have clearly documented default values in the docstrings of each step and to keep the various options in one place.

### RAG Diagram

The main functions are:

`build_index`:
- signature: `(indexer::AbstractIndexBuilder, files_or_docs::Vector{<:AbstractString}) -> AbstractChunkIndex`
- flow: `get_chunks` -> `get_embeddings` -> `get_tags` -> `build_tags`
- dispatch types: `AbstractIndexBuilder`, `AbstractChunker`, `AbstractEmbedder`, `AbstractTagger`

`airag`:
- signature: `(cfg::AbstractRAGConfig, index::AbstractChunkIndex; question::AbstractString)` -> `AIMessage` or `AbstractRAGResult`
- flow: `retrieve` -> `generate!`
- dispatch types: `AbstractRAGConfig`, `AbstractRetriever`, `AbstractGenerator`

`retrieve`:
- signature: `(retriever::AbstractRetriever, index::AbstractChunkIndex, question::AbstractString) -> AbstractRAGResult`
- flow: `rephrase` -> `get_embeddings` -> `find_closest` -> `get_tags` -> `find_tags` -> `rerank`
- dispatch types: `AbstractRetriever`, `AbstractRephraser`, `AbstractEmbedder`, `AbstractSimilarityFinder`, `AbstractTagger`, `AbstractTagFilter`, `AbstractReranker`

`generate!`:
- signature: `(generator::AbstractGenerator, index::AbstractChunkIndex, result::AbstractRAGResult)` -> `AIMessage` or `AbstractRAGResult`
- flow: `build_context!` -> `answer!` -> `refine!` -> `postprocess!`
- dispatch types: `AbstractGenerator`, `AbstractContextBuilder`, `AbstractAnswerer`, `AbstractRefiner`, `AbstractPostprocessor`

To discover the currently available implementations, use the `subtypes` function, eg, `subtypes(AbstractReranker)`.
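For example (the exact list of implementations will vary with the package version):

```julia
using InteractiveUtils # provides `subtypes`; already loaded in the REPL
using PromptingTools.Experimental.RAGTools
const RT = PromptingTools.Experimental.RAGTools

subtypes(RT.AbstractReranker)   # lists the shipped reranker implementations
subtypes(RT.AbstractRephraser)  # same for the rephrasing step
```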

### Deepdive

**Preparation Phase:**
- Begins with `build_index`, which creates a user-defined index type from an abstract chunk index using the specified models and function strategies.
- `get_chunks` then divides the indexed data into manageable pieces based on a chunking strategy.
- `get_embeddings` generates embeddings for each chunk using an embedding strategy to facilitate similarity searches.
- Finally, `get_tags` extracts relevant metadata from each chunk, enabling tag-based filtering (hybrid search index). If there are `tags` available, `build_tags` is called to build the corresponding sparse matrix for filtering with tags. A short sketch of this phase follows below.
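A minimal sketch of the preparation phase, explicitly passing the default `SimpleIndexer` (equivalent to the `build_index` call in the Examples section above):

```julia
# Explicitly passing the default indexer; omitting it gives the same behavior.
# Internally this runs get_chunks -> get_embeddings -> get_tags -> build_tags.
indexer = SimpleIndexer()
index = build_index(indexer, sentences;
    chunker_kwargs = (; sources = map(i -> "Doc$i", 1:length(sentences))))
```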

**Retrieval Phase:**
- The `retrieve` step is intended to find the most relevant chunks in the `index`.
- `rephrase` is called first if we want to rephrase the query (methods like `HyDE` can improve retrieval quite a bit!)
- `get_embeddings` generates embeddings for the original and rephrased queries
- `find_closest` looks up the most relevant candidates (`CandidateChunks`) using a similarity search strategy.
- `get_tags` extracts the potential tags (can be provided as part of the `airag` call, eg, when we want to use only some small part of the indexed chunks)
- `find_tags` filters the candidates to strictly match _at least one_ of the tags (if provided)
- `rerank` is called to rerank the candidates based on the reranking strategy (ie, to improve the ordering of the chunks in the context). A short sketch follows below.
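Run on its own, the retrieval phase looks like this (the same call as in the Examples section; the candidate fields in the comment are the ones visible in the `RAGResult` printout above):

```julia
result = retrieve(index, "What are the best practices for parallel computing in Julia?")
# result now carries the intermediate retrieval state, eg:
# result.emb_candidates, result.tag_candidates, result.filtered_candidates, result.reranked_candidates
```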

**Generation Phase:**
- The `generate!` step is intended to generate a response based on the retrieved chunks, provided via `AbstractRAGResult` (eg, `RAGResult`).
- `build_context!` constructs the context for response generation based on a context strategy and applies the necessary formatting
- `answer!` generates the response based on the context and the query
- `refine!` is called to refine the response (optional, defaults to passthrough)
- `postprocess!` is available for any final touches to the response or to potentially save or format the results (eg, automatically save to the disk)

Note that all generation steps mutate the `RAGResult` object in place.
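A matching sketch of the generation phase, reusing the `result` returned by `retrieve` (as in the Examples section above):

```julia
result = generate!(index, result)
# The mutating sub-steps (build_context!, answer!, refine!, postprocess!) have now filled in
# result.context, result.answer, result.final_answer, and result.conversations
```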

See more details and corresponding functions and types in `src/Experimental/RAGTools/rag_interface.jl`.

## References

```@docs; canonical=false

2 comments on commit feb03ed

@svilupp (Owner, Author) commented:

@JuliaRegistrator register

Release notes:

Added

  • Added pretty-printing via PT.pprint that does NOT depend on Markdown and splits text to adjust to the width of the output terminal.
    It is useful in notebooks to add newlines.
  • Added support annotations for RAGTools (see ?RAGTools.Experimental.annotate_support for more information) to highlight which parts of the generated answer come from the provided context versus the model's knowledge base. It's useful for transparency and debugging, especially in the context of AI-generated content. You can experience it if you run the output of airag through pretty printing (PT.pprint).
  • Added utility distance_longest_common_subsequence to find the normalized distance between two strings (or a vector of strings). Always returns a number between 0-1, where 0 means the strings are identical and 1 means they are completely different. It's useful for comparing the similarity between the context provided to the model and the generated answer.
  • Added a new documentation section "Extra Tools" to highlight key functionality in various modules, eg, the available text utilities, which were previously hard to discover.
  • Extended documentation FAQ with tips on tackling rate limits and other common issues with OpenAI API.
  • Extended documentation with all available prompt templates. See section "Prompt Templates" in the documentation.
  • Added new RAG interface underneath airag in PromptingTools.RAGTools.Experimental. Each step now has a dedicated function and a type that can be customized to achieve arbitrary logic (via defining methods for your own types). airag is split into two main steps: retrieve and generate!. You can use them separately or together. See ?airag for more information.

Updated

  • Renamed split_by_length text splitter to recursive_splitter to make it easier to discover and understand its purpose. split_by_length is still available as a deprecated alias.

Fixed

  • Fixed a bug where LOCAL_SERVER default value was not getting picked up. Now, it defaults to http://localhost:8000 if not set in the preferences, which is the address of the server started by Llama.jl.
  • Fixed a bug in multi-line code annotation, which was assigning too optimistic scores to the generated code. Now the score of the chunk is the length-weighted score of the "top" source chunk divided by the full length of score tokens (much more robust and demanding).

Commits

@JuliaRegistrator commented:

Registration pull request created: JuliaRegistries/General/103298

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.16.0 -m "<description of version>" feb03edcb82382d9e47ae73a52071d7dbb9a7c78
git push origin v0.16.0
