Project Summary: Intelligent Document Analysis with Retrieval-Augmented Generation (RAG) and Vector Search
This open-source project leverages Optical Character Recognition (OCR) to convert files in various formats (PDF, TIFF, PNG, JPEG) into text. It integrates Retrieval-Augmented Generation (RAG) to extract relevant attributes from the text. The core functionality takes a query text as input, performs a vector search to identify relevant parts of the file, and uses Large Language Model (LLM) providers such as OpenAI, Kimi, and Tencent Hunyuan to generate answers from the search results.
| Feature | Description |
| --- | --- |
| File Upload | Facilitates the upload of files in supported formats for processing. |
| Multi-format OCR | Supports OCR for PDF, TIFF, PNG, and JPEG files, converting them into text. |
| Vector Search | Performs vector search to identify relevant parts of the text based on embeddings. |
| LLM Integration | Integrates with LLM providers like OpenAI, Kimi, and Tencent Hunyuan for generating responses. |
| Embedding-based Retrieval | Uses vector embeddings for accurate and efficient information retrieval. |
## Install with Docker

- Clone the repo
- Set necessary environment variables
  Make sure to set your required environment variables in the `.env` file. You can read more about how to set them up in the API Keys section.
- Deploy using Docker
With Docker installed and the rag repository cloned, navigate to the directory containing the Dockerfile in your terminal or command prompt. Run the following commands to build and start the rag application in detached mode, which allows it to run in the background:
```sh
# clone the rag repo
git clone https://github.com/likid1412/rag

# navigate to rag
cd rag

# build; this will download the necessary Docker images
docker build -t rag .

# run and start rag
docker run --env-file .env -dt --name rag -p 80:80 rag

# check the rag logs; once startup succeeds, you should see `Application startup complete.`
docker container logs rag
```
Remember, Docker must be installed on your system to use this method. For installation instructions and more details about Docker, visit the official Docker documentation.
You can read FastAPI in Containers for a quick start.
- Access rag
  - You can access your local rag Interactive API docs (by FastAPI's defaults, at http://localhost/docs)
  - You can access your local rag Alternative API docs (at http://localhost/redoc)
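To quickly verify the container is serving requests, you can curl the interactive docs page (the `/docs` path is FastAPI's default, and port 80 matches the `docker run` mapping above):

```sh
# should return the HTML of the interactive docs page
curl -s http://localhost/docs | head
```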
## Logs

We send logged messages to the `app.log` file and to stdout using loguru.

- The `app.log` file is located at `/rag/app.log`.
- For stdout, you can check it with a command such as `docker container logs -f rag`; use `docker container logs --help` to read more.
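Since `app.log` lives inside the container's filesystem, one way to follow it (assuming the container name `rag` from the run command above) is:

```sh
# follow the log file inside the running container
docker exec rag tail -f /rag/app.log
```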
## API Keys

Before starting rag you'll need to configure access to various components depending on your chosen technologies, such as OpenAI, hunyuan, and Kimi, via a `.env` file. Create this `.env` file in the same directory you want to start rag in. Check the `.env.example` as an example.

Make sure to only set the environment variables you intend to use; environment variables with missing or incorrect values may lead to errors.

Below is a comprehensive list of the API keys and variables you may require:
| Environment Variable | Value | Description |
| --- | --- | --- |
| MINIO_ENDPOINT | The endpoint of your Minio storage | See Minio as local storage |
| MINIO_ACCESS_KEY | Minio access key | See Minio as local storage |
| MINIO_SECRET_KEY | Minio secret key | See Minio as local storage |
| TENCENT_VECTOR_URL | URL for Tencent Vector Database | Access to Tencent Vector Database |
| TENCENT_VECTOR_USER | Username for Tencent Vector Database | Access to Tencent Vector Database |
| TENCENT_VECTOR_KEY | API key for Tencent Vector Database | Access to Tencent Vector Database |
| TENCENTCLOUD_SECRET_ID | Tencent Cloud Secret ID for the Tencent hunyuan LLM | Access to the Tencent API for the Tencent hunyuan LLM |
| TENCENTCLOUD_SECRET_KEY | Tencent Cloud Secret Key for the Tencent hunyuan LLM | Access to the Tencent API for the Tencent hunyuan LLM |
| TENCENT_MODEL | Tencent hunyuan model name | Tencent hunyuan model |
| API_KEY | OpenAI SDK API key | API key for OpenAI or a compatible LLM provider such as Kimi |
| BASE_URL | OpenAI SDK base URL | Base URL for OpenAI or a compatible LLM provider such as Kimi |
| MODEL | OpenAI SDK model name | Model of OpenAI or a compatible LLM provider |
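As a minimal sketch, a `.env` using the OpenAI-compatible path might look like the following. The Kimi (Moonshot) base URL and model name are shown as one example of a compatible provider; verify them against your provider's docs, and treat every `<...>` value as a placeholder:

```sh
# Minio object storage (see Minio as local storage)
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=<your-minio-access-key>
MINIO_SECRET_KEY=<your-minio-secret-key>

# Tencent Vector Database
TENCENT_VECTOR_URL=<your-vector-db-url>
TENCENT_VECTOR_USER=<your-vector-db-user>
TENCENT_VECTOR_KEY=<your-vector-db-key>

# OpenAI-compatible LLM provider (Kimi via Moonshot shown as an example)
API_KEY=<your-api-key>
BASE_URL=https://api.moonshot.cn/v1
MODEL=moonshot-v1-8k
```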
- Minio: use Minio as local storage; see Minio as local storage for more detail.
- Tencent Vector Database: you can get credentials from Tencent Vector Database; you can find instructions for obtaining a key here.
- OpenAI: you can get an API key from OpenAI.
- Kimi: check Moonshot for more detail; you can find instructions for obtaining a key here.
- Tencent hunyuan: check hunyuan and hunyuan-embedding-API for more detail; you can find instructions for obtaining a key here.
Once you have access to rag, you can interact with the API using the Interactive API docs. Below are usage examples for each endpoint.
## Upload

Functionality

- Accepts one or more file uploads (limited to pdf, tiff, png, jpeg formats).
- Saves the processed file to a storage solution (e.g., MinIO), returning one or more unique file identifiers or signed URLs for the upload.

Usage example

- Read the alternative automatic documentation for more: Upload - ReDoc
- Try it out: File Upload Endpoint: /upload
- Click `Add string item`, choose a file to upload, and it will return the uploaded file info: the original file name from the uploaded file, a unique file id, a signed URL, and a unique file name which you can search in Minio (see the curl sketch below).
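Outside the docs UI, here is a hedged curl sketch; the multipart field name `files` is an assumption, so check the Interactive API docs for the exact schema:

```sh
# upload one file; repeat -F to upload several
# (the field name "files" is assumed, not confirmed by the docs above)
curl -X POST http://localhost/upload \
  -F "files=@./example.pdf"
```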
## OCR

Functionality

- Runs an OCR service on the file downloaded from the `signed_url`.
- Processes OCR results with embedding models (e.g., OpenAI, Tencent hunyuan).
- Uploads the embeddings to a vector database (e.g., Pinecone, Tencent Vector Database) for future searches.

Usage example

- Read the alternative automatic documentation for more: Ocr - ReDoc
- Try it out: OCR Endpoint: /ocr
- Fill the `signed_url` value with the URL returned by the upload endpoint (see the curl sketch below). This endpoint returns immediately because the tasks mentioned above take some time and run in the background. You can check progress using the Get OCR Progress endpoint.
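A hedged curl sketch; whether `signed_url` is passed as a query parameter or in a JSON body is an assumption, so confirm in the Interactive API docs:

```sh
# kick off OCR + embedding in the background for an uploaded file
# (passing signed_url as a query parameter is assumed)
curl -X POST "http://localhost/ocr?signed_url=<signed-url-from-upload>"
```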
## Get OCR Progress

Functionality

- Gets the OCR progress.

Usage example

- Read the alternative automatic documentation for more: Get Ocr Progress - ReDoc
- Try it out: Get OCR Progress Endpoint: /ocr_progress/{file_id}
- Fill in the `file_id` that was passed to the OCR endpoint to get the current progress (see the polling sketch below).
- If still processing, it returns `{"status": "processing", "progress": 0.xxx}`.
- If completed, it returns `{"status": "completed"}`.
## Extract

Functionality

- Takes a query text and `file_id` as input, performs a vector search, and returns relevant text based on the embeddings.
- Chats with an LLM provider (e.g., OpenAI, Tencent hunyuan) to generate the answer from the search results.

Usage example

- Read the alternative automatic documentation for more: Extract - ReDoc
- Try it out: Attribute Extraction Endpoint: /extract
- Takes a query text and `file_id` as input; choose the LLM provider API (`OpenAI` or `hunyuan`) and it returns the answer to the query, generated by the LLM from the relevant texts retrieved from the vector database for that `file_id` (see the curl sketch below).
  - For the `OpenAI` API, you can use an OpenAI model or a compatible LLM provider model such as Kimi.
  - For the `hunyuan` API, you can use a Tencent hunyuan model.
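A hedged curl sketch; the JSON field names `query`, `file_id`, and `llm` are assumptions, so check the Interactive API docs for the exact schema:

```sh
# ask a question against one processed file
# (field names below are assumed, not confirmed by the docs above)
curl -X POST "http://localhost/extract" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the total amount?", "file_id": "<file_id>", "llm": "OpenAI"}'
```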
## TODO

- Upload large files using stream upload
- Add a requestId for tracing
- Add monitoring and observability
- Address TODO/FIXME items in code
## Chunking strategy

It seems the OCR result has already divided the content based on its structure and hierarchy, i.e., into paragraphs, resulting in more semantically coherent chunks, so we can simply use fixed-size chunking based on the paragraphs.

Read more: Chunking Strategies for LLM Applications | Pinecone
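As an illustration only, here is a fixed-size, paragraph-respecting chunker over a plain-text OCR dump. It assumes paragraphs are separated by blank lines in a hypothetical `ocr.txt`, and the 1000-character budget is arbitrary:

```sh
# group whole paragraphs into chunks of at most ~1000 characters
awk 'BEGIN { RS="" }   # RS="" makes awk read one paragraph per record
{
  if (buf != "" && length(buf) + length($0) > 1000) {  # next paragraph would overflow: flush
    print buf "\n--- chunk boundary ---"
    buf = ""
  }
  buf = buf (buf == "" ? "" : "\n\n") $0               # append the paragraph, never splitting it
}
END { if (buf != "") print buf }' ocr.txt
```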