This repository implements a Retrieval-Augmented Generation (RAG) system that uses FAISS for vector-based retrieval and GPT for response generation. It is designed to process large datasets, index them with FAISS, and answer queries with GPT using context retrieved from the documents.
- Document Loading: Load and preprocess datasets (e.g., CSV, plain text).
- Embedding Generation: Convert documents and queries into vector embeddings.
- Efficient Retrieval: Use FAISS for similarity search over large corpora.
- GPT Integration: Generate answers using GPT with context from retrieved documents.
- Modular Design: Easily extend the system with new vector stores, LLMs, or document loaders.
- Interactive User Interface: A Gradio-powered UI for easy interaction with the system. Supports:
  - Uploading and viewing CSV files.
  - Searching the indexed documents.
  - Managing document chunks.
  - Interacting with the system through intuitive input fields.
- CRUD Operations: Add, delete, update, and query document chunks in real-time.
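The retrieve-then-generate flow behind these features can be sketched end to end. The snippet below is a toy illustration only: it substitutes a bag-of-words vector for the real OpenAI embeddings and an exhaustive cosine-similarity scan for the FAISS index, and stops short of the GPT call.

```python
# Toy sketch of retrieve-then-generate (illustrative only; the real
# system uses OpenAI embeddings, a FAISS index, and a GPT completion).
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Wireless noise-cancelling headphones with 30-hour battery life.",
    "Stainless steel water bottle, keeps drinks cold for 24 hours.",
]
context = retrieve("long battery headphones", docs, k=1)
# The retrieved context would then be packed into the GPT prompt:
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: ..."
```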
- Python 3.8 or higher
- FAISS
- OpenAI API Key
- Gradio (for the UI)
Install the required packages using `pip`:

```bash
pip install -U -r requirements.txt
```

`requirements.txt`:

```text
langchain
openai
faiss-cpu
python-dotenv
pandas
langchain-community
tiktoken
gradio
```
Create a `.env` file in the project root directory and add your OpenAI API key:

```text
OPENAI_API_KEY=your_openai_api_key
```
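At startup the `python-dotenv` package (listed in `requirements.txt`) loads this file into the process environment. As a self-contained illustration of what that step does, here is a minimal stand-in parser for `.env`-style lines (the real project would call `load_dotenv()` instead):

```python
def parse_env(text):
    """Minimal stand-in for python-dotenv: parse KEY=value lines,
    skipping blanks and # comments (illustration only)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

env = parse_env("OPENAI_API_KEY=your_openai_api_key\n# a comment\n")
```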
- Use the provided example dataset (`amazon_products.csv`) or upload your own CSV dataset.
- Ensure the dataset contains a column with text-based content (e.g., `description`) to generate embeddings.
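The shape of that preparation step can be sketched as follows. This is a hypothetical simplification: the actual loading lives in `rag_system/loaders.py` and the splitting in `rag_system/utils.py`, and the real splitter may use token counts and overlapping chunks rather than fixed character windows.

```python
import csv
import io

def load_rows(csv_text, text_column="description"):
    """Yield the text content of each row, skipping rows missing the column.
    (Column name mirrors the example above; adjust to your dataset.)"""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get(text_column):
            yield row[text_column]

def split_into_chunks(text, chunk_size=200):
    """Naive fixed-size character chunking (assumption: the real splitter
    in rag_system/utils.py likely uses token-aware, overlapping chunks)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "name,description\nHeadphones,Wireless over-ear headphones with long battery life\n"
docs = [c for t in load_rows(sample) for c in split_into_chunks(t, chunk_size=20)]
```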
You can run the system in two modes:
Run the system interactively via CLI:
```bash
python main.py
```
This mode allows you to interact with the RAG system by asking questions and retrieving answers.
Run the system with the Gradio-powered UI:
```bash
python app.py
```
This launches a web-based interface for uploading datasets, managing document chunks, and querying the RAG system.
```text
.
├── amazon_products.csv    # Example dataset for testing
├── app.py                 # Gradio-based user interface
├── app.log                # Log file for application events
├── dataset_cache.ipynb    # Notebook for dataset caching or analysis
├── main.py                # CLI entry point for the RAG system
├── rag_system/            # Main source code for the RAG system
│   ├── __init__.py
│   ├── config.py          # Configuration settings (API keys, paths, etc.)
│   ├── core.py            # Core logic for RAG system initialization
│   ├── loaders.py         # Document loading and preprocessing
│   ├── llms.py            # Integration with GPT or other LLMs
│   ├── utils.py           # Utility functions (e.g., splitting documents)
│   └── vector_stores.py   # FAISS and other vector store implementations
├── vector_store_index/    # Directory for storing FAISS index files
├── requirements.txt       # Python dependencies
├── SearchQ.md             # Markdown for documenting queries or use cases
├── README.md              # Project documentation (this file)
└── .env                   # Environment variables for API keys
```
The Gradio UI provides an intuitive interface for interacting with the RAG system. It supports:

- Upload CSV Files:
  - Upload datasets containing documents for indexing.
  - Automatically preprocesses and splits documents into chunks for embedding generation.
- Search Documents:
  - Enter natural language queries in the search box.
  - The system retrieves the most relevant documents and generates a response using GPT.
- CRUD Operations:
  - Add new document chunks to the indexed dataset.
  - Delete or update existing chunks based on specific criteria.
Run the Gradio UI with:
```bash
python app.py
```
After launching, open the provided URL in a web browser to interact with the system.
The system uses FAISS for vector-based retrieval by default. To integrate another vector database (e.g., Pinecone, Weaviate, Milvus):

- Create a new class in `rag_system/vector_stores.py` inheriting from `BaseVectorStore`.
- Update `Config.VECTOR_STORE_TYPE` in `config.py`.
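The pattern looks roughly like the sketch below. Note the caveats: the `add`/`search` method names are assumptions (match them to the actual `BaseVectorStore` interface in `rag_system/vector_stores.py`), and the in-memory exact-distance store stands in for a real backend client.

```python
from abc import ABC, abstractmethod

class BaseVectorStore(ABC):
    """Stand-in for the interface in rag_system/vector_stores.py;
    the real method names and signatures may differ."""
    @abstractmethod
    def add(self, ids, vectors): ...
    @abstractmethod
    def search(self, vector, k): ...

class InMemoryVectorStore(BaseVectorStore):
    """Toy backend: exact nearest neighbours by squared Euclidean distance.
    A Pinecone/Weaviate/Milvus class would call that service's client here."""
    def __init__(self):
        self._store = {}

    def add(self, ids, vectors):
        self._store.update(zip(ids, vectors))

    def search(self, vector, k=3):
        def dist(item):
            return sum((a - b) ** 2 for a, b in zip(item[1], vector))
        return [doc_id for doc_id, _ in sorted(self._store.items(), key=dist)[:k]]

store = InMemoryVectorStore()
store.add(["a", "b"], [[0.0, 1.0], [1.0, 0.0]])
```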
The system integrates with OpenAI GPT. To switch to another LLM (e.g., Hugging Face models):

- Add a new class in `rag_system/llms.py` inheriting from `BaseLLM`.
- Update `Config.LLM_TYPE` in `config.py`.
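A minimal sketch of that subclassing pattern, with the caveat that the `generate` method name is an assumption (check the actual `BaseLLM` interface in `rag_system/llms.py`), and the echo backend is a placeholder for a real model call:

```python
from abc import ABC, abstractmethod

class BaseLLM(ABC):
    """Stand-in for the interface in rag_system/llms.py;
    the real method name/signature may differ."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoLLM(BaseLLM):
    """Toy backend that echoes the prompt; a real subclass would call
    a Hugging Face pipeline or another provider's API here."""
    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"
```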
To support additional formats (e.g., PDFs, JSON):

- Add a new class in `rag_system/loaders.py` inheriting from `BaseDocumentLoader`.
- Update the `load_documents` function to detect and handle the new format.
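For example, a JSON loader might look like the sketch below. The `load` method name and the `description` field are assumptions; align them with the actual `BaseDocumentLoader` interface in `rag_system/loaders.py` and your data.

```python
import json
from abc import ABC, abstractmethod

class BaseDocumentLoader(ABC):
    """Stand-in for the interface in rag_system/loaders.py;
    the real method name/signature may differ."""
    @abstractmethod
    def load(self, source): ...

class JSONDocumentLoader(BaseDocumentLoader):
    """Loads a JSON array of objects and extracts one text field per record."""
    def __init__(self, text_key="description"):
        self.text_key = text_key  # assumed field name, as in the CSV example

    def load(self, text):
        records = json.loads(text)
        return [r[self.text_key] for r in records if self.text_key in r]

loader = JSONDocumentLoader()
```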
The UI is modular and can be extended. To add new components:

- Modify `app.py` to include new Gradio widgets.
- Update the callback functions to handle the added functionality.
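The usual shape is a plain Python callback plus a widget wired to it. The hypothetical "word count" feature below is not code from `app.py`; only the callback is executable here, with the Gradio wiring shown as comments so the sketch stays framework-independent:

```python
# Hypothetical callback for a new "word count" widget; names below
# are assumptions, not existing code from app.py.
def count_words(text: str) -> str:
    """Callback: takes the textbox value, returns the label text."""
    n = len(text.split())
    return f"{n} word(s)"

# Wiring inside app.py would look roughly like (not executed here):
#   box = gr.Textbox(label="Text")
#   out = gr.Label()
#   gr.Button("Count").click(count_words, inputs=box, outputs=out)
```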
The system logs important events (e.g., errors, indexing operations) in `app.log`. Check this file for debugging or monitoring purposes.
- Distributed Vector Stores: Add support for scalable vector stores like Pinecone or Weaviate.
- Advanced Query Features: Implement query expansion, semantic search, and ranking.
- Custom Embeddings: Allow users to upload precomputed embeddings.
- User Authentication: Add authentication and access control for the Gradio interface.
- Visualization: Display results with data visualizations (e.g., charts for document relevance scores).
- Batch Processing: Optimize retrieval and generation for bulk queries.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with detailed explanations of your changes.
This project is licensed under the MIT License.