docs(document-search): create docs (#180)
Co-authored-by: kdziedzic68 <[email protected]>
konrad-czarnota-ds and kdziedzic68 authored Nov 20, 2024
1 parent d11ff73 commit 46a6974
Showing 4 changed files with 279 additions and 0 deletions.
118 changes: 118 additions & 0 deletions docs/how-to/document-search/search_documents.md
@@ -0,0 +1,118 @@
# How-To: Search Documents

The `ragbits-document-search` package comes with all the functionality required to perform document search. The whole process can be divided into three steps:
1. Load documents
2. Process the documents, embed them and store them in the vector database
3. Run the search

This guide will walk you through all those steps and explain the details. Let's start with a minimalistic example to get the main idea:
```python
import asyncio
from pathlib import Path

from ragbits.core.embeddings.litellm import LiteLLMEmbeddings
from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentMeta
from ragbits.document_search.documents.sources import GCSSource

async def main() -> None:
    # Load documents (there are multiple possible sources)
    documents = [
        DocumentMeta.from_local_path(Path("<path_to_your_document>")),
        DocumentMeta.create_text_document_from_literal("Test document"),
        DocumentMeta.from_source(GCSSource(bucket="<your_bucket>", object_name="<your_object_name>")),
    ]

    embedder = LiteLLMEmbeddings()
    vector_store = InMemoryVectorStore()
    document_search = DocumentSearch(
        embedder=embedder,
        vector_store=vector_store,
    )

    # Ingest documents - here they are processed, embedded and stored
    await document_search.ingest(documents)

    # Actual search
    results = await document_search.search("I'm boiling my water and I need a joke")
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

## Document loading
Before running any search we need some documents that will build our knowledge base. Ragbits offers a handy `Document` class that stores all the information needed for document loading.
Objects of this class are usually instantiated with the `DocumentMeta` helper class, which supports loading files from local storage, GCS or Hugging Face.
You can easily add support for your custom sources by extending the `Source` class and implementing its abstract methods:
```python
from pathlib import Path

from ragbits.document_search.documents.sources import Source

class CustomSource(Source):
    @property
    def id(self) -> str:
        # Return a unique identifier of the source, used e.g. for deduplication
        pass

    async def fetch(self) -> Path:
        # Download the document and return the path to its local copy
        pass
```
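
For illustration, here is a minimal sketch of a source that downloads a file over HTTP. The `HttpSource` name, the `url` field and the pydantic-style field declaration are assumptions made for this example, not part of the documented API:
```python
import asyncio
import tempfile
import urllib.request
from pathlib import Path

from ragbits.document_search.documents.sources import Source


class HttpSource(Source):
    url: str  # assumption: Source subclasses can declare fields like a pydantic model

    @property
    def id(self) -> str:
        # Use the URL itself as a stable identifier
        return f"http:{self.url}"

    async def fetch(self) -> Path:
        # Download the file to a temporary location without blocking the event loop
        target = Path(tempfile.gettempdir()) / Path(self.url).name
        await asyncio.to_thread(urllib.request.urlretrieve, self.url, str(target))
        return target
```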

## Processing, embedding and storing
With the documents loaded, we can proceed with the pipeline. The next step covers processing, embedding and storing. Embeddings and Vector Stores have their own sections in the documentation,
so here we will focus on the processing.

Before a document can be ingested into the system it needs to be processed into a collection of elements that the system supports. Right now there are two supported elements:
`TextElement` and `ImageElement`. You can introduce your own elements by simply extending the `Element` class.

Depending on the type of the document, different `providers` work under the hood to return a list of supported elements. Ragbits relies mainly on the [Unstructured](https://unstructured.io/)
library, which supports parsing and chunking of the most common document types (e.g. PDF, MD, DOC, JPG). You can specify a mapping of file type to provider when creating a document search instance:
```python
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentType
from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider

# embedder and vector_store are created as in the previous example
document_search = DocumentSearch(
    embedder=embedder,
    vector_store=vector_store,
    document_processor_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()}),
)
```

If you want to implement a new provider you should extend the `BaseProvider` class:
```python
from ragbits.document_search.documents.document import DocumentMeta, DocumentType
from ragbits.document_search.documents.element import Element
from ragbits.document_search.ingestion.providers.base import BaseProvider


class CustomProvider(BaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {DocumentType.TXT}  # provide supported document types

    async def process(self, document_meta: DocumentMeta) -> list[Element]:
        # Convert the document into a list of supported elements
        pass
```
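
As an illustration, below is a minimal sketch of a provider that splits a plain-text document into paragraph elements. The `document_meta.fetch()` call and the `local_path` attribute are assumptions about how the document content can be accessed and may differ from the actual API:
```python
from ragbits.document_search.documents.document import DocumentMeta, DocumentType
from ragbits.document_search.documents.element import Element, TextElement
from ragbits.document_search.ingestion.providers.base import BaseProvider


class PlainTextProvider(BaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {DocumentType.TXT}

    async def process(self, document_meta: DocumentMeta) -> list[Element]:
        # Assumption: fetching the meta yields a document object exposing a local file path
        document = await document_meta.fetch()
        text = document.local_path.read_text()

        # Treat every non-empty paragraph as a separate text element
        return [
            TextElement(document_meta=document_meta, content=paragraph.strip())
            for paragraph in text.split("\n\n")
            if paragraph.strip()
        ]
```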

## Search
After the indexed documents are stored in the system we can move on to the search part. It is very simple and straightforward: you just call the `search()` method, as shown below.
The response will be a sequence of elements that are the most similar to the provided query.
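
For example, inside an async function and with `document_search` configured as in the first example:
```python
results = await document_search.search("<your_query>")
for element in results:
    print(element)
```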

## Advanced configuration
The `DocumentSearch` class additionally allows you to provide a config that describes the complete setup.
```python
config = {
    "embedder": {...},
    "vector_store": {...},
    "reranker": {...},
    "providers": {...},
    "rephraser": {...},
}

document_search = DocumentSearch.from_config(config)
```
For a complete example please refer to `examples/document-search/from_config.py`.

If you want to improve your search results, you can read more on how to adjust the [QueryRephraser](use_rephraser.md) or the [Reranker](use_reranker.md).
68 changes: 68 additions & 0 deletions docs/how-to/document-search/use_rephraser.md
@@ -0,0 +1,68 @@
# How-To: Use Rephraser
`ragbits-document-search` contains a `QueryRephraser` module that can be used to create an additional query that
improves the original user query (fixes typos, expands abbreviations, etc.). Both queries are then sent to the document search
module, which can use them to find better matches.

This guide will show you how to use `QueryRephraser` and how to create your custom implementation.

## LLM rephraser usage
To use a rephraser within the retrieval pipeline you need to provide it during `DocumentSearch` construction. In the following example we will use
the `LLMQueryRephraser` with the default `QueryRephraserPrompt`.
```python
import asyncio
from ragbits.core.llms.litellm import LiteLLM
from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt

async def main():
    document_search = DocumentSearch(
        query_rephraser=LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt),
        ...
    )
    results = await document_search.search("<query>")

asyncio.run(main())
```

The next example shows how to use the same rephraser as an independent component:

```python
import asyncio
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt
from ragbits.core.llms.litellm import LiteLLM


async def main():
    rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
    rephrased = await rephraser.rephrase("Wht tim iz id?")
    print(rephrased)

asyncio.run(main())
```
The console should print:
```text
['What time is it?']
```

To change the prompt you need to create your own class in the following way:
```python
from ragbits.core.prompt import Prompt
from ragbits.document_search.retrieval.rephrasers.llm import QueryRephraserInput

class QueryRephraserPrompt(Prompt[QueryRephraserInput, str]):
    user_prompt = "{{ query }}"
    system_prompt = "<your_prompt>"
```
You should only change the `system_prompt`, as the `user_prompt` will contain the query passed to `DocumentSearch.search()` later.
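
The custom prompt class is then passed to `LLMQueryRephraser` in exactly the same way as the default one:
```python
from ragbits.core.llms.litellm import LiteLLM
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser

rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
```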

## Custom rephraser
It is possible to create a custom rephraser by extending the base class:
```python
from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser

class CustomRephraser(QueryRephraser):
    async def rephrase(self, query: str) -> list[str]:
        # Return one or more rephrased versions of the query
        pass
```
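
As a trivial sketch (a made-up strategy used only to illustrate the interface), a rephraser could return the original query together with a lowercase variant:
```python
from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser


class MultiVariantRephraser(QueryRephraser):
    async def rephrase(self, query: str) -> list[str]:
        # Return the original query plus a lowercase variant
        return [query, query.lower()]
```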
90 changes: 90 additions & 0 deletions docs/how-to/document-search/use_reranker.md
@@ -0,0 +1,90 @@
# How-To: Use Reranker
`ragbits-document-search` contains a `Reranker` module that can be used to select the most relevant and high-quality information from a set of retrieved documents.

This guide will show you how to use `LiteLLMReranker` and how to create your custom implementation.


## LLM Reranker
`LiteLLMReranker` is based on [litellm.rerank()](https://docs.litellm.ai/docs/rerank), which supports three providers: Cohere, Azure AI and Together AI.
You will need to set a proper API key to use the reranking functionality.

To use a `LiteLLMReranker` within the retrieval pipeline you simply need to provide it as an argument to `DocumentSearch`.
```python
import os

from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker

os.environ["COHERE_API_KEY"] = "<api_key>"

document_search = DocumentSearch(
    reranker=LiteLLMReranker("cohere/rerank-english-v3.0"),
    ...
)
```

The next example shows the basic usage of the same reranker as an independent component:

```python
import asyncio
import os
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker
from ragbits.document_search.documents.element import TextElement
from ragbits.document_search.documents.document import DocumentMeta

os.environ["COHERE_API_KEY"] = "<api_key>"


def create_text_element(text: str) -> TextElement:
    document_meta = DocumentMeta.create_text_document_from_literal(content=text)
    text_element = TextElement(document_meta=document_meta, content=text)
    return text_element


async def main():
    reranker = LiteLLMReranker(model="cohere/rerank-english-v3.0")
    text_elements = [
        create_text_element(
            text="The artificial intelligence development is a milestone for global information accessibility"
        ),
        create_text_element(text="The redpill will show you the true nature of things"),
        create_text_element(text="The bluepill will make you stay in the state of ignorance"),
    ]
    query = "Take the pill and follow the rabbit!"
    ranked = await reranker.rerank(elements=text_elements, query=query)
    for element in ranked:
        print(element.content + "\n")


asyncio.run(main())
```

The console should print the contents of the ranked elements in order of their relevance to the query, as determined by the model.

```text
The redpill will show you the true nature of things
The bluepill will make you stay in the state of ignorance
The artificial intelligence development is a milestone for global information accessibility
```

## Custom Reranker
To create a custom Reranker you need to extend the `Reranker` class:
```python
from collections.abc import Sequence

from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions
from ragbits.document_search.documents.element import Element

class CustomReranker(Reranker):
    async def rerank(
        self,
        elements: Sequence[Element],
        query: str,
        options: RerankerOptions | None = None,
    ) -> Sequence[Element]:
        # Return the elements ordered by their relevance to the query
        pass

    @classmethod
    def from_config(cls, config: dict) -> "CustomReranker":
        # Build the reranker instance from a config dictionary
        pass
```
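
For illustration, here is a minimal sketch of a reranker that orders elements by simple word overlap with the query. The heuristic is made up for this example, and the `content` attribute access is an assumption about the element type:
```python
from collections.abc import Sequence

from ragbits.document_search.documents.element import Element
from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions


class KeywordOverlapReranker(Reranker):
    async def rerank(
        self,
        elements: Sequence[Element],
        query: str,
        options: RerankerOptions | None = None,
    ) -> Sequence[Element]:
        query_words = set(query.lower().split())

        def overlap(element: Element) -> int:
            # Assumption: text-based elements expose their text via a `content` attribute
            content = getattr(element, "content", "") or ""
            return len(query_words & set(content.lower().split()))

        # Elements with the largest overlap come first
        return sorted(elements, key=overlap, reverse=True)

    @classmethod
    def from_config(cls, config: dict) -> "KeywordOverlapReranker":
        return cls()
```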
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -13,6 +13,9 @@ nav:
- Document Search:
- how-to/document_search/async_processing.md
- how-to/document_search/create_custom_execution_strategy.md
- how-to/document-search/search_documents.md
- how-to/document-search/use_rephraser.md
- how-to/document-search/use_reranker.md
- API Reference:
- Core:
- api_reference/core/prompt.md
