This repository contains the source code for the paper Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy.
Outstanding Paper at Machine Learning for Ancient Languages (ACL Workshop), 2024
If you find this work useful in your research, please consider citing:
@inproceedings{mandikal2024vedantany10m,
title = "Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented {LLM}s for {A}ncient {I}ndian Philosophy",
author = "Mandikal, Priyanka",
booktitle = "Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)",
year = "2024",
publisher = "Association for Computational Linguistics",
}
LLMs have revolutionized the landscape of information retrieval and knowledge dissemination. However, their application in specialized areas is often hindered by factual inaccuracies and hallucinations, especially in long-tail knowledge distributions. We explore the potential of retrieval-augmented generation (RAG) models for long-form question answering (LFQA) in a specialized knowledge domain. We present VedantaNY-10M, a dataset curated from extensive public discourses on the ancient Indian philosophy of Advaita Vedanta. We develop and benchmark a RAG model against a standard, non-RAG LLM, focusing on transcription, retrieval, and generation performance. Human evaluations by computational linguists and domain experts show that the RAG model significantly outperforms the standard model in producing factual and comprehensive responses with fewer hallucinations. In addition, a keyword-based hybrid retriever that emphasizes unique low-frequency terms further improves results. Our study provides insights into effectively integrating modern large language models with ancient knowledge systems.
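The keyword-based hybrid retriever mentioned above combines sparse keyword matching, which up-weights rare low-frequency terms such as Sanskrit vocabulary, with dense embedding similarity. Below is a minimal, hypothetical sketch of that idea; it is not the repository's CustomTFIDFRetriever or CustomEnsembleRetriever, and the function name, blending weight, and sample passages are illustrative assumptions.

```python
# Hypothetical sketch of a keyword-emphasizing hybrid retriever.
# The actual implementation lives in utils/tfidf_retriever.py and
# utils/ensemble_retriever.py; everything below is illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def hybrid_scores(query, chunks, dense_scores, alpha=0.5):
    """Blend sparse TF-IDF similarity (which emphasizes unique,
    low-frequency terms) with pre-computed dense-embedding scores."""
    vectorizer = TfidfVectorizer(sublinear_tf=True)
    chunk_vecs = vectorizer.fit_transform(chunks)
    query_vec = vectorizer.transform([query])
    sparse_scores = cosine_similarity(query_vec, chunk_vecs).ravel()
    return alpha * sparse_scores + (1 - alpha) * dense_scores


# Example: rank three transcript passages for a query.
chunks = [
    "Brahman alone is real; the world is an appearance ...",
    "The mahavakya 'tat tvam asi' teaches the identity of atman and Brahman ...",
    "Today we continue our study of the Mandukya Upanishad ...",
]
dense = np.array([0.32, 0.41, 0.27])  # stand-in dense-retriever scores
scores = hybrid_scores("What does tat tvam asi mean?", chunks, dense)
print(chunks[int(np.argmax(scores))])
```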
We use transcripts of 612 lectures on the Indian philosophy of Advaita Vedanta from the Vedanta Society of New York (VSNY). They are automatically generated using OpenAI's Whisper large-v2 model.
Code for generating transcripts is provided in the transcription folder. Please follow the steps below:
- Download audio: You can either generate transcripts for all videos on VSNY's YouTube channel up to the current date or use the videos used in the paper (up to March 24th, 2024). The list of videos used is provided in bot.csv. Download the csv file and place it in data/metadata/large-v2/episodes.
  - Download audio files using the list used in the paper:
    python transcription/download_audio.py --download-from-csv --csv-file bot.csv
  - Alternatively, download all audio files from VSNY up to the current date:
    python transcription/download_audio.py
  - To update the list from time to time, you can use the --skip-saved flag:
    python transcription/download_audio.py --skip-saved
- Split audio: Based on available resources, split the metadata into chunks that can be processed in parallel.
  - Single machine: the number of chunks is 1 by default (n=1):
    python split_audio.py
  - Multiple GPUs and/or machines: set n as per your requirements:
    python split_audio.py --n 8
- Generate transcripts:
  - To run on a single machine, you can run the following command:
    python transcription/run_whisper.py
  - To run in parallel, edit scripts/run_whisper.sh as per your requirements and run:
    bash scripts/run_whisper.sh

Once the transcription is complete, the data folder should contain all the transcript data to be chunked, embedded, and stored in the vector DB.
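For reference, the core of this step, transcribing one audio file with Whisper large-v2, looks roughly like the sketch below. It is only an illustration: transcription/run_whisper.py is the authoritative implementation, and the audio path used here is hypothetical.

```python
# Minimal sketch of transcribing one lecture with openai-whisper (large-v2).
# transcription/run_whisper.py adds batching, resuming, and metadata handling.
import whisper

model = whisper.load_model("large-v2")  # downloads model weights on first use
result = model.transcribe("data/audio/lecture_001.mp3")  # hypothetical path

print(result["text"][:500])  # full transcript as a single string

# Segment-level timestamps are also available:
for seg in result["segments"][:3]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```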
- Create a conda environment called 'vedantany10m':
  conda create -n vedantany10m python=3.10
  conda activate vedantany10m
- Install required packages:
  pip install -r requirements.txt
- Install spaCy models:
  python -m spacy download en_core_web_sm
  python -m spacy download en_core_web_lg
Make sure that you are able to import the following packages in Python without any errors:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter
from utils.tfidf_retriever import CustomTFIDFRetriever
from utils.ensemble_retriever import CustomEnsembleRetriever
Note: If you plan to use an open-source embedder (Nomic), LLM (Mixtral) and VectorDB (ChromaDB), you can skip this section.
- OpenAI: If you want to use OpenAI models such as ada for embedding chunks or GPT as the LLM, you have to open an account and obtain API keys. Please follow the steps below:
  - Set up an OpenAI developer account here
  - Get your API key
  - Add it to your ~/.bashrc as follows:
    export OPENAI_API_KEY=<enter-api-key>
- Pinecone: If you want to use Pinecone as the vector DB, you'll again need API keys. Please follow the steps below:
  - Set up a Pinecone account here
  - Get your API key
  - Add it to your ~/.bashrc as follows:
    export PINECONE_API_KEY=<enter-api-key>
  - Also set the environment, e.g. us-west1-gcp-free:
    export PINECONE_ENV=<enter-env>
  - After all additions, source the bash file:
    source ~/.bashrc
- Chunk transcripts: This splits each transcript into multiple passages (see the sketch after this step):
  python create_chunks.py --whisper_model large-v2
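The sketch below illustrates what this chunking step does with LangChain's RecursiveCharacterTextSplitter (the same class imported in the sanity check above). The chunk size, overlap, and transcript path are assumptions for illustration, not the settings used by create_chunks.py.

```python
# Illustrative sketch of splitting one transcript into overlapping passages.
# Chunk size, overlap, and the input path are assumptions, not the values
# used by create_chunks.py.
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("data/transcripts/large-v2/lecture_001.txt") as f:  # hypothetical path
    transcript = f.read()

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
passages = splitter.split_text(transcript)
print(f"{len(passages)} passages; first passage:\n{passages[0][:300]}")
```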
- Embed chunks: Use an embedding model and vector DB of your choice. There are options to use closed-source embedders and databases (OpenAI ada embedding, Pinecone vector DB) or open-source ones (Nomic embedding, ChromaDB). Run any of the following as per your choice (a sketch of the open-source path follows this step):
  - Nomic embedding and ChromaDB:
    python embed_chunks.py --embedding_model nomic --vectorstore chroma
  - OpenAI embedding and ChromaDB:
    python embed_chunks.py --embedding_model openai --vectorstore chroma
  - OpenAI embedding and Pinecone:
    python embed_chunks.py --embedding_model openai --vectorstore pinecone
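As a rough picture of the open-source path (Nomic embedding + ChromaDB), the sketch below embeds a few passages and stores them in a persistent Chroma collection. embed_chunks.py is the authoritative implementation; the model name, task prefix, collection name, and storage path here are assumptions.

```python
# Rough sketch of the open-source embedding path (Nomic + ChromaDB).
# Model name, "search_document" prefix, collection name, and path are assumptions.
import chromadb
from sentence_transformers import SentenceTransformer

passages = [
    "...chunked transcript passage 1...",
    "...chunked transcript passage 2...",
]

# nomic-embed-text models expect a task prefix on each input.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode([f"search_document: {p}" for p in passages])

client = chromadb.PersistentClient(path="data/chroma")        # hypothetical path
collection = client.get_or_create_collection("vedantany10m")  # hypothetical name
collection.add(
    ids=[f"chunk-{i}" for i in range(len(passages))],
    documents=passages,
    embeddings=embeddings.tolist(),
)
print(collection.count(), "passages stored")
```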
- Run chatbot: We provide options to run open or closed LLMs of your choice. While OpenAI's GPTs are closed, Mixtral is open-source and can be run natively on your local machine. Run any of the following according to the model and embedding of your choice (a sketch of the underlying retrieve-then-generate loop follows this step):
    python test_bot.py --llm gpt-4 --embedding_model openai --vectorstore pinecone
    python test_bot.py --llm gpt-3.5-turbo --embedding_model openai --vectorstore pinecone
    python test_bot.py --llm gpt-3.5-turbo --embedding_model nomic --vectorstore chroma
    python test_bot.py --llm mixtral --embedding_model nomic --vectorstore chroma
  - To pass an input query at run time, use the --pass-query flag. Example:
    python test_bot.py --llm mixtral --embedding_model nomic --vectorstore chroma --pass-query
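At its core, the retrieve-then-generate loop inside test_bot.py works along the lines of the sketch below: embed the query, pull the top-k passages from the vector DB, and condition the LLM on them. This is a simplified illustration combining the open-source embedder with a GPT model; the prompt wording, top-k value, and collection details are assumptions.

```python
# Simplified sketch of the RAG loop: retrieve top-k passages, then generate.
# Prompt wording, k, and collection details are assumptions, not test_bot.py's.
import chromadb
from openai import OpenAI
from sentence_transformers import SentenceTransformer

query = "What does Advaita Vedanta say about the nature of the self?"

embedder = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
collection = chromadb.PersistentClient(path="data/chroma").get_collection("vedantany10m")
hits = collection.query(
    query_embeddings=[embedder.encode(f"search_query: {query}").tolist()],
    n_results=5,
)
context = "\n\n".join(hits["documents"][0])

prompt = (
    "Answer the question using only the lecture excerpts below.\n\n"
    f"Excerpts:\n{context}\n\nQuestion: {query}"
)
client = OpenAI()  # reads OPENAI_API_KEY from the environment
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```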
The evaluation dataset comprising 25 queries across 5 categories is in eval/2-rag-vs-kwrag/queries.xlsx. Answers from different models are in eval/2-rag-vs-kwrag/answers.
We provide anonymized responses from our human survey. They contain numerical ratings for each LLM answer as well as long-form feedback.
- RAG vs Non-RAG
- KW-RAG
We evaluate five automatic metrics as follows:
- length: number of words and sentences (answer-only)
- self_bleu: self-BLEU-5 as reported in colm (answer-only)
- perplexity: computed with GPT-2 (answer-only)
- rank_gen: question as prefix, answer as suffix (question-answer)
- qafact_eval: run from the qafacteval conda env (retrieval-answer)
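As a concrete illustration of one of these metrics, answer-only perplexity under GPT-2 can be computed as in the sketch below; this is not the exact implementation behind scripts/metrics.sh, and the sample answer is made up.

```python
# Illustrative GPT-2 perplexity for a single answer (answer-only metric).
# scripts/metrics.sh is the authoritative implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

answer = "According to Advaita Vedanta, the self (atman) is identical with Brahman."
inputs = tokenizer(answer, return_tensors="pt")
with torch.no_grad():
    # With labels == input_ids, the returned loss is the mean token NLL.
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print("perplexity:", torch.exp(loss).item())
```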
To run the above automatic metrics reported in the paper, follow the steps below.
Install supporting repos for computing metrics:
- RankGen
  - Clone repo:
    git clone git@github.com:martiansideofthemoon/rankgen
- QAFactEval
  - Clone repo and install packages:
    conda create --name qafacteval python=3.10
    conda activate qafacteval
    pip install qafacteval
    pip install gdown==4.6.0
  - Download models:
    git clone git@github.com:salesforce/QAFactEval
    cd ../QAFactEval
    bash download_models.sh
We already provide responses from different models in eval/2-rag-vs-kwrag/answers. If you want to generate them yourself, you can run the following scripts:
bash scripts/keyword_extraction.sh
bash scripts/mixtral_bot.sh
Generated answers are saved in eval/2-rag-vs-kwrag/answers.
To obtain all the automatic metrics, please run the following script:
bash scripts/metrics.sh
Note: For QAFactEval, please run the commands in the qafacteval conda env.
Metrics are saved in the eval/2-rag-vs-kwrag/metrics folder.
- All data used in this project has been acquired from public lectures on YouTube delivered by Swami Sarvapriyananda of the Vedanta Society of New York. These transcripts have not been proofread for accuracy.
- While our study explores integrating ancient knowledge systems with modern machine learning techniques, we recognize their inherent limitations. Users of these tools need to be aware that these models can make errors, and should therefore seek guidance from qualified teachers to carefully progress on the path.