docs(document-search): create docs (#180)
Co-authored-by: kdziedzic68 <[email protected]>
konrad-czarnota-ds and kdziedzic68 authored Nov 20, 2024
1 parent d11ff73 commit 46a6974
Showing 4 changed files with 279 additions and 0 deletions.
118 changes: 118 additions & 0 deletions docs/how-to/document-search/search_documents.md
@@ -0,0 +1,118 @@
# How-To: Search Documents

The `ragbits-document-search` package comes with all the functionality required to perform document search. The whole process can be divided into three steps:
1. Load documents
2. Process the documents, embed them and store them in the vector database
3. Run the search

This guide will walk you through all those steps and explain the details. Let's start with a minimalistic example to get the main idea:
```python
import asyncio
from pathlib import Path

from ragbits.core.embeddings.litellm import LiteLLMEmbeddings
from ragbits.core.vector_stores.in_memory import InMemoryVectorStore
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentMeta
from ragbits.document_search.documents.sources import GCSSource

async def main() -> None:
    # Load documents (there are multiple possible sources)
    documents = [
        DocumentMeta.from_local_path(Path("<path_to_your_document>")),
        DocumentMeta.create_text_document_from_literal("Test document"),
        DocumentMeta.from_source(GCSSource(bucket="<your_bucket>", object_name="<your_object_name>")),
    ]

    embedder = LiteLLMEmbeddings()
    vector_store = InMemoryVectorStore()
    document_search = DocumentSearch(
        embedder=embedder,
        vector_store=vector_store,
    )

    # Ingest documents - here they are processed, embedded and stored
    await document_search.ingest(documents)

    # Actual search
    results = await document_search.search("I'm boiling my water and I need a joke")
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

## Document loading
Before running any search we need some documents that will build our knowledge base. Ragbits offers a handy `Document` class that stores all the information needed for document loading.
Objects of this class are usually instantiated with the `DocumentMeta` helper class, which supports loading files from local storage, GCS or Hugging Face.
You can easily add support for your custom sources by extending the `Source` class and implementing its abstract methods:
```python
from pathlib import Path

from ragbits.document_search.documents.sources import Source

class CustomSource(Source):
    @property
    def id(self) -> str:
        # Return a unique identifier of the source, used e.g. for deduplication
        pass

    async def fetch(self) -> Path:
        # Download the document and return the path to its local copy
        pass
```
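
For illustration, here is a minimal sketch of a source that downloads a file over HTTP. The `HttpSource` name, the `url` field and the pydantic-style field declaration are assumptions made for this example, not part of the documented API:
```python
import asyncio
import tempfile
import urllib.request
from pathlib import Path

from ragbits.document_search.documents.sources import Source


class HttpSource(Source):
    url: str  # assumption: Source subclasses can declare fields like a pydantic model

    @property
    def id(self) -> str:
        # Use the URL itself as a stable identifier
        return f"http:{self.url}"

    async def fetch(self) -> Path:
        # Download the file to a temporary location without blocking the event loop
        target = Path(tempfile.gettempdir()) / Path(self.url).name
        await asyncio.to_thread(urllib.request.urlretrieve, self.url, str(target))
        return target
```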

## Processing, embedding and storing
With the documents loaded, we can proceed with the pipeline. The next step covers processing, embedding and storing. Embeddings and Vector Stores have their own sections in the documentation,
so here we will focus on the processing.

Before a document can be ingested into the system it needs to be processed into a collection of elements that the system supports. Right now there are two supported elements:
`TextElement` and `ImageElement`. You can introduce your own elements by simply extending the `Element` class.

Depending on the type of the document, different `providers` work under the hood to return a list of supported elements. Ragbits relies mainly on the [Unstructured](https://unstructured.io/)
library, which supports parsing and chunking of the most common document types (e.g. PDF, MD, DOC, JPG). You can specify a mapping of file type to provider when creating a document search instance:
```python
from ragbits.document_search import DocumentSearch
from ragbits.document_search.documents.document import DocumentType
from ragbits.document_search.ingestion.document_processor import DocumentProcessorRouter
from ragbits.document_search.ingestion.providers.unstructured.default import UnstructuredDefaultProvider

# embedder and vector_store are created as in the previous example
document_search = DocumentSearch(
    embedder=embedder,
    vector_store=vector_store,
    document_processor_router=DocumentProcessorRouter({DocumentType.TXT: UnstructuredDefaultProvider()}),
)
```

If you want to implement a new provider you should extend the `BaseProvider` class:
```python
from ragbits.document_search.documents.document import DocumentMeta, DocumentType
from ragbits.document_search.documents.element import Element
from ragbits.document_search.ingestion.providers.base import BaseProvider


class CustomProvider(BaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {DocumentType.TXT}  # provide supported document types

    async def process(self, document_meta: DocumentMeta) -> list[Element]:
        # Convert the document into a list of supported elements
        pass
```
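
As an illustration, below is a minimal sketch of a provider that splits a plain-text document into paragraph elements. The `document_meta.fetch()` call and the `local_path` attribute are assumptions about how the document content can be accessed and may differ from the actual API:
```python
from ragbits.document_search.documents.document import DocumentMeta, DocumentType
from ragbits.document_search.documents.element import Element, TextElement
from ragbits.document_search.ingestion.providers.base import BaseProvider


class PlainTextProvider(BaseProvider):
    SUPPORTED_DOCUMENT_TYPES = {DocumentType.TXT}

    async def process(self, document_meta: DocumentMeta) -> list[Element]:
        # Assumption: fetching the meta yields a document object exposing a local file path
        document = await document_meta.fetch()
        text = document.local_path.read_text()

        # Treat every non-empty paragraph as a separate text element
        return [
            TextElement(document_meta=document_meta, content=paragraph.strip())
            for paragraph in text.split("\n\n")
            if paragraph.strip()
        ]
```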

## Search
After the indexed documents are stored in the system we can move on to the search part. It is very simple and straightforward: you just call the `search()` method, as shown below.
The response will be a sequence of elements that are the most similar to the provided query.
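
For example, inside an async function and with `document_search` configured as in the first example:
```python
results = await document_search.search("<your_query>")
for element in results:
    print(element)
```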

## Advanced configuration
The `DocumentSearch` class additionally allows you to provide a config that describes the complete setup.
```python
config = {
    "embedder": {...},
    "vector_store": {...},
    "reranker": {...},
    "providers": {...},
    "rephraser": {...},
}

document_search = DocumentSearch.from_config(config)
```
For a complete example please refer to `examples/document-search/from_config.py`.

If you want to improve your search results, you can read more on how to adjust the [QueryRephraser](use_rephraser.md) or the [Reranker](use_reranker.md).
68 changes: 68 additions & 0 deletions docs/how-to/document-search/use_rephraser.md
@@ -0,0 +1,68 @@
# How-To: Use Rephraser
`ragbits-document-search` contains a `QueryRephraser` module that can be used to create an additional query that
improves the original user query (fixes typos, expands abbreviations, etc.). Both queries are then sent to the document search
module, which can use them to find better matches.

This guide will show you how to use `QueryRephraser` and how to create your custom implementation.

## LLM rephraser usage
To use a rephraser within the retrieval pipeline you need to provide it during `DocumentSearch` construction. In the following example we will use
the `LLMQueryRephraser` with the default `QueryRephraserPrompt`.
```python
import asyncio
from ragbits.core.llms.litellm import LiteLLM
from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt

async def main():
    document_search = DocumentSearch(
        query_rephraser=LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt),
        ...
    )
    results = await document_search.search("<query>")

asyncio.run(main())
```

The next example shows how to use the same rephraser as an independent component:

```python
import asyncio
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser
from ragbits.document_search.retrieval.rephrasers.prompts import QueryRephraserPrompt
from ragbits.core.llms.litellm import LiteLLM


async def main():
    rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
    rephrased = await rephraser.rephrase("Wht tim iz id?")
    print(rephrased)

asyncio.run(main())
```
The console should print:
```text
['What time is it?']
```

To change the prompt you need to create your own class in the following way:
```python
from ragbits.core.prompt import Prompt
from ragbits.document_search.retrieval.rephrasers.llm import QueryRephraserInput

class QueryRephraserPrompt(Prompt[QueryRephraserInput, str]):
    user_prompt = "{{ query }}"
    system_prompt = "<your_prompt>"
```
You should only change the `system_prompt`, as the `user_prompt` will contain the query passed to `DocumentSearch.search()` later.
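
The custom prompt class is then passed to `LLMQueryRephraser` in exactly the same way as the default one:
```python
from ragbits.core.llms.litellm import LiteLLM
from ragbits.document_search.retrieval.rephrasers.llm import LLMQueryRephraser

rephraser = LLMQueryRephraser(LiteLLM("gpt-3.5-turbo"), QueryRephraserPrompt)
```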

## Custom rephraser
It is possible to create a custom rephraser by extending the base class:
```python
from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser

class CustomRephraser(QueryRephraser):
    async def rephrase(self, query: str) -> list[str]:
        # Return one or more rephrased versions of the query
        pass
```
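
As a trivial sketch (a made-up strategy used only to illustrate the interface), a rephraser could return the original query together with a lowercase variant:
```python
from ragbits.document_search.retrieval.rephrasers.base import QueryRephraser


class MultiVariantRephraser(QueryRephraser):
    async def rephrase(self, query: str) -> list[str]:
        # Return the original query plus a lowercase variant
        return [query, query.lower()]
```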
90 changes: 90 additions & 0 deletions docs/how-to/document-search/use_reranker.md
@@ -0,0 +1,90 @@
# How-To: Use Reranker
`ragbits-document-search` contains a `Reranker` module that can be used to select the most relevant and high-quality information from a set of retrieved documents.

This guide will show you how to use `LiteLLMReranker` and how to create your custom implementation.


## LLM Reranker
`LiteLLMReranker` is based on [litellm.rerank()](https://docs.litellm.ai/docs/rerank), which supports three providers: Cohere, Azure AI and Together AI.
You will need to set a proper API key to use the reranking functionality.

To use a `LiteLLMReranker` within the retrieval pipeline you simply need to provide it as an argument to `DocumentSearch`.
```python
import os

from ragbits.document_search import DocumentSearch
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker

os.environ["COHERE_API_KEY"] = "<api_key>"

document_search = DocumentSearch(
    reranker=LiteLLMReranker("cohere/rerank-english-v3.0"),
    ...
)
```

The next example shows the basic usage of the same reranker as an independent component:

```python
import asyncio
import os
from ragbits.document_search.retrieval.rerankers.litellm import LiteLLMReranker
from ragbits.document_search.documents.element import TextElement
from ragbits.document_search.documents.document import DocumentMeta

os.environ["COHERE_API_KEY"] = "<api_key>"


def create_text_element(text: str) -> TextElement:
    document_meta = DocumentMeta.create_text_document_from_literal(content=text)
    text_element = TextElement(document_meta=document_meta, content=text)
    return text_element


async def main():
    reranker = LiteLLMReranker(model="cohere/rerank-english-v3.0")
    text_elements = [
        create_text_element(
            text="The artificial intelligence development is a milestone for global information accessibility"
        ),
        create_text_element(text="The redpill will show you the true nature of things"),
        create_text_element(text="The bluepill will make you stay in the state of ignorance"),
    ]
    query = "Take the pill and follow the rabbit!"
    ranked = await reranker.rerank(elements=text_elements, query=query)
    for element in ranked:
        print(element.content + "\n")


asyncio.run(main())
```

The console should print the contents of the ranked elements in order of their relevance to the query, as determined by the model.

```text
The redpill will show you the true nature of things
The bluepill will make you stay in the state of ignorance
The artificial intelligence development is a milestone for global information accessibility
```

## Custom Reranker
To create a custom Reranker you need to extend the `Reranker` class:
```python
from collections.abc import Sequence

from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions
from ragbits.document_search.documents.element import Element

class CustomReranker(Reranker):
    async def rerank(
        self,
        elements: Sequence[Element],
        query: str,
        options: RerankerOptions | None = None,
    ) -> Sequence[Element]:
        # Return the elements ordered by their relevance to the query
        pass

    @classmethod
    def from_config(cls, config: dict) -> "CustomReranker":
        # Build the reranker instance from a config dictionary
        pass
```
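
For illustration, here is a minimal sketch of a reranker that orders elements by simple word overlap with the query. The heuristic is made up for this example, and the `content` attribute access is an assumption about the element type:
```python
from collections.abc import Sequence

from ragbits.document_search.documents.element import Element
from ragbits.document_search.retrieval.rerankers.base import Reranker, RerankerOptions


class KeywordOverlapReranker(Reranker):
    async def rerank(
        self,
        elements: Sequence[Element],
        query: str,
        options: RerankerOptions | None = None,
    ) -> Sequence[Element]:
        query_words = set(query.lower().split())

        def overlap(element: Element) -> int:
            # Assumption: text-based elements expose their text via a `content` attribute
            content = getattr(element, "content", "") or ""
            return len(query_words & set(content.lower().split()))

        # Elements with the largest overlap come first
        return sorted(elements, key=overlap, reverse=True)

    @classmethod
    def from_config(cls, config: dict) -> "KeywordOverlapReranker":
        return cls()
```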
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -13,6 +13,9 @@ nav:
- Document Search:
- how-to/document_search/async_processing.md
- how-to/document_search/create_custom_execution_strategy.md
- how-to/document-search/search_documents.md
- how-to/document-search/use_rephraser.md
- how-to/document-search/use_reranker.md
- API Reference:
- Core:
- api_reference/core/prompt.md
