This repository implements a Retrieval-Augmented Generation (RAG) system that uses FAISS for vector-based retrieval and GPT for response generation. It is designed to process large datasets, index them with FAISS, and answer queries with GPT using context retrieved from the documents.
- Document Loading: Load and preprocess datasets (e.g., CSV, plain text).
- Embedding Generation: Convert documents and queries into vector embeddings.
- Efficient Retrieval: Use FAISS for similarity search over large corpora.
- GPT Integration: Generate answers using GPT with context from retrieved documents.
- Modular Design: Easily extend the system with new vector stores, LLMs, or document loaders.
- Interactive User Interface: A Gradio-powered UI for easy interaction with the system. Supports:
  - Uploading and viewing CSV files.
  - Searching the indexed documents.
  - Managing document chunks.
  - Interacting with the system through intuitive input fields.
- CRUD Operations: Add, delete, update, and query document chunks in real-time.
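The retrieve-then-generate flow behind these features can be sketched end to end. The snippet below is a toy illustration only: it substitutes a bag-of-words vector for the real OpenAI embeddings and an exhaustive cosine-similarity scan for the FAISS index, and stops short of the GPT call.

```python
# Toy sketch of retrieve-then-generate (illustrative only; the real
# system uses OpenAI embeddings, a FAISS index, and a GPT completion).
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Wireless noise-cancelling headphones with 30-hour battery life.",
    "Stainless steel water bottle, keeps drinks cold for 24 hours.",
]
context = retrieve("long battery headphones", docs, k=1)
# The retrieved context would then be packed into the GPT prompt:
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: ..."
```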
- Python 3.8 or higher
- FAISS
- OpenAI API Key
- Gradio (for the UI)
Install the required packages using `pip`:

```bash
pip install -U -r requirements.txt
```

`requirements.txt`:

```text
langchain
openai
faiss-cpu
python-dotenv
pandas
langchain-community
tiktoken
gradio
```
Create a `.env` file in the project root directory and add your OpenAI API key:

```text
OPENAI_API_KEY=your_openai_api_key
```
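At startup the `python-dotenv` package (listed in `requirements.txt`) loads this file into the process environment. As a self-contained illustration of what that step does, here is a minimal stand-in parser for `.env`-style lines (the real project would call `load_dotenv()` instead):

```python
def parse_env(text):
    """Minimal stand-in for python-dotenv: parse KEY=value lines,
    skipping blanks and # comments (illustration only)."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

env = parse_env("OPENAI_API_KEY=your_openai_api_key\n# a comment\n")
```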
- Use the provided example dataset (`amazon_products.csv`) or upload your own CSV dataset.
- Ensure the dataset contains a column with text-based content (e.g., `description`) to generate embeddings.
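The shape of that preparation step can be sketched as follows. This is a hypothetical simplification: the actual loading lives in `rag_system/loaders.py` and the splitting in `rag_system/utils.py`, and the real splitter may use token counts and overlapping chunks rather than fixed character windows.

```python
import csv
import io

def load_rows(csv_text, text_column="description"):
    """Yield the text content of each row, skipping rows missing the column.
    (Column name mirrors the example above; adjust to your dataset.)"""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get(text_column):
            yield row[text_column]

def split_into_chunks(text, chunk_size=200):
    """Naive fixed-size character chunking (assumption: the real splitter
    in rag_system/utils.py likely uses token-aware, overlapping chunks)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "name,description\nHeadphones,Wireless over-ear headphones with long battery life\n"
docs = [c for t in load_rows(sample) for c in split_into_chunks(t, chunk_size=20)]
```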
You can run the system in two modes:
Run the system interactively via CLI:
```bash
python main.py
```
This mode allows you to interact with the RAG system by asking questions and retrieving answers.
Run the system with the Gradio-powered UI:
```bash
python app.py
```
This launches a web-based interface for uploading datasets, managing document chunks, and querying the RAG system.
```text
.
├── amazon_products.csv    # Example dataset for testing
├── app.py                 # Gradio-based user interface
├── app.log                # Log file for application events
├── dataset_cache.ipynb    # Notebook for dataset caching or analysis
├── main.py                # CLI entry point for the RAG system
├── rag_system/            # Main source code for the RAG system
│   ├── __init__.py
│   ├── config.py          # Configuration settings (API keys, paths, etc.)
│   ├── core.py            # Core logic for RAG system initialization
│   ├── loaders.py         # Document loading and preprocessing
│   ├── llms.py            # Integration with GPT or other LLMs
│   ├── utils.py           # Utility functions (e.g., splitting documents)
│   └── vector_stores.py   # FAISS and other vector store implementations
├── vector_store_index/    # Directory for storing FAISS index files
├── requirements.txt       # Python dependencies
├── SearchQ.md             # Markdown for documenting queries or use cases
├── README.md              # Project documentation (this file)
└── .env                   # Environment variables for API keys
```
The Gradio UI provides an intuitive interface for interacting with the RAG system. It supports:

- Upload CSV Files:
  - Upload datasets containing documents for indexing.
  - Automatically preprocesses and splits documents into chunks for embedding generation.
- Search Documents:
  - Enter natural language queries in the search box.
  - The system retrieves the most relevant documents and generates a response using GPT.
- CRUD Operations:
  - Add new document chunks to the indexed dataset.
  - Delete or update existing chunks based on specific criteria.
Run the Gradio UI with:
```bash
python app.py
```
After launching, open the provided URL in a web browser to interact with the system.
The system uses FAISS for vector-based retrieval by default. To integrate another vector database (e.g., Pinecone, Weaviate, Milvus):

- Create a new class in `rag_system/vector_stores.py` inheriting from `BaseVectorStore`.
- Update `Config.VECTOR_STORE_TYPE` in `config.py`.
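The pattern looks roughly like the sketch below. Note the caveats: the `add`/`search` method names are assumptions (match them to the actual `BaseVectorStore` interface in `rag_system/vector_stores.py`), and the in-memory exact-distance store stands in for a real backend client.

```python
from abc import ABC, abstractmethod

class BaseVectorStore(ABC):
    """Stand-in for the interface in rag_system/vector_stores.py;
    the real method names and signatures may differ."""
    @abstractmethod
    def add(self, ids, vectors): ...
    @abstractmethod
    def search(self, vector, k): ...

class InMemoryVectorStore(BaseVectorStore):
    """Toy backend: exact nearest neighbours by squared Euclidean distance.
    A Pinecone/Weaviate/Milvus class would call that service's client here."""
    def __init__(self):
        self._store = {}

    def add(self, ids, vectors):
        self._store.update(zip(ids, vectors))

    def search(self, vector, k=3):
        def dist(item):
            return sum((a - b) ** 2 for a, b in zip(item[1], vector))
        return [doc_id for doc_id, _ in sorted(self._store.items(), key=dist)[:k]]

store = InMemoryVectorStore()
store.add(["a", "b"], [[0.0, 1.0], [1.0, 0.0]])
```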
The system integrates with OpenAI GPT. To switch to another LLM (e.g., Hugging Face models):

- Add a new class in `rag_system/llms.py` inheriting from `BaseLLM`.
- Update `Config.LLM_TYPE` in `config.py`.
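A minimal sketch of that subclassing pattern, with the caveat that the `generate` method name is an assumption (check the actual `BaseLLM` interface in `rag_system/llms.py`), and the echo backend is a placeholder for a real model call:

```python
from abc import ABC, abstractmethod

class BaseLLM(ABC):
    """Stand-in for the interface in rag_system/llms.py;
    the real method name/signature may differ."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class EchoLLM(BaseLLM):
    """Toy backend that echoes the prompt; a real subclass would call
    a Hugging Face pipeline or another provider's API here."""
    def generate(self, prompt: str) -> str:
        return f"[echo] {prompt}"
```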
To support additional formats (e.g., PDFs, JSON):

- Add a new class in `rag_system/loaders.py` inheriting from `BaseDocumentLoader`.
- Update the `load_documents` function to detect and handle the new format.
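For example, a JSON loader might look like the sketch below. The `load` method name and the `description` field are assumptions; align them with the actual `BaseDocumentLoader` interface in `rag_system/loaders.py` and your data.

```python
import json
from abc import ABC, abstractmethod

class BaseDocumentLoader(ABC):
    """Stand-in for the interface in rag_system/loaders.py;
    the real method name/signature may differ."""
    @abstractmethod
    def load(self, source): ...

class JSONDocumentLoader(BaseDocumentLoader):
    """Loads a JSON array of objects and extracts one text field per record."""
    def __init__(self, text_key="description"):
        self.text_key = text_key  # assumed field name, as in the CSV example

    def load(self, text):
        records = json.loads(text)
        return [r[self.text_key] for r in records if self.text_key in r]

loader = JSONDocumentLoader()
```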
The UI is modular and can be extended. To add new components:

- Modify `app.py` to include new Gradio widgets.
- Update the callback functions to handle the added functionality.
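The usual shape is a plain Python callback plus a widget wired to it. The hypothetical "word count" feature below is not code from `app.py`; only the callback is executable here, with the Gradio wiring shown as comments so the sketch stays framework-independent:

```python
# Hypothetical callback for a new "word count" widget; names below
# are assumptions, not existing code from app.py.
def count_words(text: str) -> str:
    """Callback: takes the textbox value, returns the label text."""
    n = len(text.split())
    return f"{n} word(s)"

# Wiring inside app.py would look roughly like (not executed here):
#   box = gr.Textbox(label="Text")
#   out = gr.Label()
#   gr.Button("Count").click(count_words, inputs=box, outputs=out)
```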
The system logs important events (e.g., errors, indexing operations) in `app.log`. Check this file for debugging or monitoring purposes.
- Distributed Vector Stores: Add support for scalable vector stores like Pinecone or Weaviate.
- Advanced Query Features: Implement query expansion, semantic search, and ranking.
- Custom Embeddings: Allow users to upload precomputed embeddings.
- User Authentication: Add authentication and access control for the Gradio interface.
- Visualization: Display results with data visualizations (e.g., charts for document relevance scores).
- Batch Processing: Optimize retrieval and generation for bulk queries.
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Submit a pull request with detailed explanations of your changes.
This project is licensed under the MIT License.