docs(document-search): create docs (#180)
Co-authored-by: kdziedzic68 <[email protected]>
1 parent d11ff73, commit 46a6974
Showing 4 changed files with 279 additions and 0 deletions.
@@ -0,0 +1,118 @@
# How-To: Search Documents

The `ragbits-document-search` package comes with all the functionality required to perform document search. The whole process can be divided into 3 steps:
1. Load documents
2. Process documents, embed them and store them in the vector database
3. Do the search

This guide will walk you through all those steps and explain the details. Let's start with a minimal example to get the main idea:
```python
import asyncio
from pathlib import Path

from ragbits.core.embeddings.litellm import LiteLLMEmbeddings
from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentMeta
from ragbits.document_search.documents.sources import GCSSource

async def main() -> None:
    # Load documents (there are multiple possible sources)
    documents = [
        DocumentMeta.from_local_path(Path("<path_to_your_document>")),
        DocumentMeta.create_text_document_from_literal("Test document"),
        DocumentMeta.from_source(GCSSource(bucket="<your_bucket>", object_name="<your_object_name>"))
    ]

    embedder = LiteLLMEmbeddings()
    vector_store = InMemoryVectorStore()
    document_search = DocumentSearch(
        embedder=embedder,
        vector_store=vector_store,
    )

    # Ingest documents - here they are processed, embedded and stored
    await document_search.ingest(documents)

    # Actual search
    results = await document_search.search("I'm boiling my water and I need a joke")
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

## Documents loading
Before doing any search, we need some documents to build our knowledge base. Ragbits offers a handy `Document` class that stores all the information needed for document loading.
Objects of this class are usually instantiated using the `DocumentMeta` helper class, which supports loading files from your local storage, GCS or HuggingFace.
You can easily add support for your custom sources by extending the `Source` class and implementing its abstract methods:
```python
from pathlib import Path

from ragbits.document_search.documents.sources import Source

class CustomSource(Source):
    @property
    def id(self) -> str:
        pass

    async def fetch(self) -> Path:
        pass
```
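
As a rough illustration of the `id` / `fetch` contract, the sketch below implements a source that copies a file into a temporary directory and returns the local path. It deliberately does not import ragbits: `SourceSketch` is a hypothetical stand-in for the real `Source` base class, and `CopiedFileSource` and its `origin` attribute are made-up names, so treat it as an outline rather than the library's API.

```python
import asyncio
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class SourceSketch(ABC):
    """Hypothetical stand-in for ragbits' `Source`; not the real base class."""

    @property
    @abstractmethod
    def id(self) -> str:
        """Return an identifier of the document."""

    @abstractmethod
    async def fetch(self) -> Path:
        """Make the document available locally and return its path."""


class CopiedFileSource(SourceSketch):
    """Illustrative source that copies a file into a temporary directory."""

    def __init__(self, origin: Path) -> None:
        self.origin = origin

    @property
    def id(self) -> str:
        # Prefix the identifier so it cannot clash with ids of other source types.
        return f"copied:{self.origin.name}"

    async def fetch(self) -> Path:
        target = Path(tempfile.mkdtemp()) / self.origin.name
        # Run the blocking copy in a worker thread to keep the event loop free.
        await asyncio.to_thread(shutil.copy, self.origin, target)
        return target
```

A real source would replace the copy with a download from wherever the documents live; the shape (a stable `id` plus an async `fetch` returning a `Path`) is the part that mirrors the skeleton above.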

## Processing, embedding and storing
With the documents loaded, we can proceed with the pipeline. The next step covers processing, embedding and storing. Embeddings and Vector Stores have their own sections in the documentation,
so here we will focus on the processing.

Before a document can be ingested into the system, it needs to be processed into a collection of elements that the system supports. Right now there are two supported elements:
`TextElement` and `ImageElement`. You can introduce your own elements by extending the `Element` class.

Depending on the document type, different `providers` work under the hood to return a list of supported elements. Ragbits relies mainly on the [Unstructured](https://unstructured.io/)
library, which supports parsing and chunking of the most common document types (e.g. pdf, md, doc, jpg). You can specify a mapping of file type to provider when creating a document search instance:
```python
from ragbits.document_search import DocumentSearch
from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
from ragbits.document_search.documents.document import DocumentType
from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider

# `embedder` and `vector_store` are created as in the first example
document_search = DocumentSearch(
    embedder=embedder,
    vector_store=vector_store,
    document_processor_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()})
)
```
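
The type-to-provider mapping can be pictured as a dictionary lookup. The sketch below is a simplified, ragbits-free illustration: `DocTypeSketch`, `RouterSketch` and the two toy providers are hypothetical names, and merging user entries over a set of defaults is an assumption about how such a router might behave, not a description of `DocumentProcessorRouter`.

```python
from enum import Enum
from typing import Callable, Optional


class DocTypeSketch(Enum):
    """Hypothetical stand-in for `DocumentType`."""

    TXT = "txt"
    MD = "md"
    PDF = "pdf"


# In this sketch a "provider" is any callable that turns raw text into chunks.
Provider = Callable[[str], list[str]]


def whole_text_provider(text: str) -> list[str]:
    return [text]


def line_provider(text: str) -> list[str]:
    return [line for line in text.splitlines() if line.strip()]


class RouterSketch:
    """Maps each document type to the provider that should process it."""

    def __init__(self, overrides: Optional[dict[DocTypeSketch, Provider]] = None) -> None:
        # Assumed behaviour: start from defaults, then let user entries win.
        self._providers: dict[DocTypeSketch, Provider] = {
            DocTypeSketch.TXT: whole_text_provider,
            DocTypeSketch.MD: whole_text_provider,
        }
        self._providers.update(overrides or {})

    def get_provider(self, doc_type: DocTypeSketch) -> Provider:
        if doc_type not in self._providers:
            raise ValueError(f"No provider configured for {doc_type.value}")
        return self._providers[doc_type]
```

Failing loudly for an unmapped type, rather than silently skipping the document, makes a missing provider visible at ingestion time.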

If you want to implement a new provider, extend the `BaseProvider` class:
```python
from ragbits.document_search.documents.document import DocumentMeta, DocumentType
from ragbits.document_search.documents.element import Element
from ragbits.document_search.ingestion.providers.base import BaseProvider


class CustomProvider(BaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {DocumentType.TXT}  # provide supported document types

    async def process(self, document_meta: DocumentMeta) -> list[Element]:
        pass
```
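
To give a feel for what a `process()` implementation might do, here is a minimal sketch that turns raw text into one element per paragraph. It avoids importing ragbits: `TextElementSketch` and `ParagraphProviderSketch` are hypothetical stand-ins, and the method takes an id and the text directly instead of a `DocumentMeta`, so the signature differs from the real `BaseProvider.process`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class TextElementSketch:
    """Hypothetical stand-in for `TextElement`: one chunk of a document."""

    document_id: str
    content: str


class ParagraphProviderSketch:
    """Illustrative provider that emits one element per paragraph."""

    SUPPORTED_DOCUMENT_TYPES = {"txt"}  # plain strings instead of `DocumentType`

    async def process(self, document_id: str, text: str) -> list[TextElementSketch]:
        # Blank lines separate paragraphs; whitespace-only chunks are dropped.
        paragraphs = [part.strip() for part in text.split("\n\n")]
        return [TextElementSketch(document_id, p) for p in paragraphs if p]
```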

## Search
After storing the indexed documents in the system, we can move on to the search part. Searching is straightforward: you only need to call the `search()` method.
The response is a sequence of the elements that are most similar to the provided query.
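
To make "most similar" concrete: vector search typically embeds the query and ranks the stored vectors by a similarity measure such as cosine similarity. The sketch below shows that ranking in plain Python with hand-written toy vectors. It is a generic illustration, not the implementation of `InMemoryVectorStore`, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero-length vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the keys of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda key: cosine_similarity(query, store[key]), reverse=True)
    return ranked[:k]


# Toy 2-dimensional "embeddings"; a real embedder returns much longer vectors.
store = {
    "joke about tea": [0.9, 0.1],
    "kettle manual": [0.6, 0.4],
    "tax form": [0.0, 1.0],
}
print(top_k([1.0, 0.0], store))
```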

## Advanced configuration
The `DocumentSearch` class can also be created from a config that describes the complete setup.
```python
config = {
    "embedder": {...},
    "vector_store": {...},
    "reranker": {...},
    "providers": {...},
    "rephraser": {...},
}

document_search = DocumentSearch.from_config(config)
```
For a complete example, please refer to `examples/document-search/from_config.py`.

If you want to improve your search results, read more about how to adjust the [QueryRephraser](use_rephraser.md) or the [Reranker](use_reranker.md).
@@ -0,0 +1,68 @@
# How-To: Use Rephraser
`ragbits-document-search` contains a `QueryRephraser` module that can be used to create an additional query that
improves the original user query (fixes typos, handles abbreviations, etc.). Both queries are then sent to the document search
module, which can use them to find better matches.

This guide will show you how to use `QueryRephraser` and how to create your custom implementation.

## LLM rephraser usage
To use a rephraser within the retrieval pipeline, you need to provide it during `DocumentSearch` construction. The following example uses
`LLMQueryRephraser` with the default `QueryRephraserPrompt`.
```python
import asyncio
from ragbits.core.llms.litellm import LiteLLM
from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt

async def main():
    document_search = DocumentSearch(
        query_rephraser=LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt),
        ...
    )
    results = await document_search.search("<query>")

asyncio.run(main())
```

The next example shows how to use the same rephraser as an independent component:

```python
import asyncio
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt
from ragbits.core.llms.litellm import LiteLLM


async def main():
    rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
    rephrased = await rephraser.rephrase("Wht tim iz id?")
    print(rephrased)

asyncio.run(main())
```
The console should print:
```text
['What time is it?']
```

To change the prompt, you need to create your own class in the following way:
```python
from ragbits.core.prompt import Prompt
from ragbits.document_search.retrieval.rephrasers.llm import QueryRephraserInput

class QueryRephraserPrompt(Prompt[QueryRephraserInput, str]):
    user_prompt = "{{ query }}"
    system_prompt = ("<your_prompt>")
```
You should only change the `system_prompt`, as the `user_prompt` will contain the query passed to `DocumentSearch.search()` later.

## Custom rephraser
It is possible to create a custom rephraser by extending the base class:
```python
from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser

class CustomRephraser(QueryRephraser):
    async def rephrase(self, query: str) -> list[str]:
        pass
```
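
As one possible direction for a custom implementation, the sketch below expands known abbreviations without calling an LLM. It mirrors the `rephrase(query) -> list[str]` shape from the skeleton above but does not subclass the real `QueryRephraser`; `AbbreviationRephraserSketch` and its `abbreviations` argument are hypothetical.

```python
import asyncio
import re


class AbbreviationRephraserSketch:
    """Illustrative rephraser that expands known abbreviations in a query."""

    def __init__(self, abbreviations: dict[str, str]) -> None:
        # Lower-case the keys so that lookups are case-insensitive.
        self.abbreviations = {key.lower(): value for key, value in abbreviations.items()}

    async def rephrase(self, query: str) -> list[str]:
        def expand(match: re.Match) -> str:
            word = match.group(0)
            # Unknown words are returned unchanged.
            return self.abbreviations.get(word.lower(), word)

        # \w+ matches whole words, so "db" is not replaced inside "dbt".
        return [re.sub(r"\w+", expand, query)]
```

A rule-based rephraser like this is deterministic and needs no API key, at the cost of only handling the cases it was told about.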
@@ -0,0 +1,90 @@
# How-To: Use Reranker
`ragbits-document-search` contains a `Reranker` module that can be used to select the most relevant and high-quality information from a set of retrieved documents.

This guide will show you how to use `LiteLLMReranker` and how to create your custom implementation.

## LLM Reranker
`LiteLLMReranker` is based on [litellm.rerank()](https://docs.litellm.ai/docs/rerank), which supports three providers: Cohere, Azure AI and Together AI.
You will need to set the appropriate API key to use the reranking functionality.

To use `LiteLLMReranker` within the retrieval pipeline, simply provide it as an argument to `DocumentSearch`.
```python
import os
from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker

os.environ["COHERE_API_KEY"] = "<api_key>"

document_search = DocumentSearch(
    reranker=LiteLLMReranker("cohere/rerank-english-v3.0"),
    ...
)
```

The next example shows basic usage of the same reranker as an independent component:

```python
import asyncio
import os
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker
from ragbits.document_search.documents.element import TextElement
from ragbits.document_search.documents.document import DocumentMeta

os.environ["COHERE_API_KEY"] = "<api_key>"


def create_text_element(text: str) -> TextElement:
    document_meta = DocumentMeta.create_text_document_from_literal(content=text)
    text_element = TextElement(document_meta=document_meta, content=text)
    return text_element


async def main():
    reranker = LiteLLMReranker(model="cohere/rerank-english-v3.0")
    text_elements = [
        create_text_element(
            text="The artificial intelligence development is a milestone for global information accessibility"
        ),
        create_text_element(text="The redpill will show you the true nature of things"),
        create_text_element(text="The bluepill will make you stay in the state of ignorance"),
    ]
    query = "Take the pill and follow the rabbit!"
    ranked = await reranker.rerank(elements=text_elements, query=query)
    for element in ranked:
        print(element.content)


asyncio.run(main())
```

The console should print the contents of the ranked elements in order of their relevance to the query, as determined by the model.

```text
The redpill will show you the true nature of things
The bluepill will make you stay in the state of ignorance
The artificial intelligence development is a milestone for global information accessibility
```

## Custom Reranker
To create a custom reranker, you need to extend the `Reranker` class:
```python
from collections.abc import Sequence

from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions
from ragbits.document_search.documents.element import Element

class CustomReranker(Reranker):
    async def rerank(
        self,
        elements: Sequence[Element],
        query: str,
        options: RerankerOptions | None = None,
    ) -> Sequence[Element]:
        pass

    @classmethod
    def from_config(cls, config: dict) -> "CustomReranker":
        pass
```
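
For intuition about what a `rerank()` body could look like, here is a minimal, model-free sketch that orders elements by how many words they share with the query. It does not subclass the real `Reranker` and omits `options` and `from_config`; `TextElementSketch` and `KeywordOverlapRerankerSketch` are hypothetical stand-ins, and word overlap is only a toy relevance score.

```python
import asyncio
from collections.abc import Sequence
from dataclasses import dataclass


@dataclass
class TextElementSketch:
    """Hypothetical stand-in for `TextElement`."""

    content: str


class KeywordOverlapRerankerSketch:
    """Illustrative reranker that orders elements by words shared with the query."""

    async def rerank(
        self,
        elements: Sequence[TextElementSketch],
        query: str,
    ) -> Sequence[TextElementSketch]:
        query_words = set(query.lower().split())

        def overlap(element: TextElementSketch) -> int:
            return len(query_words & set(element.content.lower().split()))

        # sorted() is stable, so elements with equal scores keep their retrieval order.
        return sorted(elements, key=overlap, reverse=True)
```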