
VedantaNY-10M

This repository contains the source code for the paper Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented LLMs for Ancient Indian Philosophy.

Outstanding Paper at Machine Learning for Ancient Languages (ACL Workshop), 2024

Citing this work

If you find this work useful in your research, please consider citing:

@inproceedings{mandikal2024vedantany10m,
  title = "Ancient Wisdom, Modern Tools: Exploring Retrieval-Augmented {LLM}s for {A}ncient {I}ndian Philosophy",
  author = "Mandikal, Priyanka",
  booktitle = "Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)",
  year = "2024",
  publisher = "Association for Computational Linguistics",
}

Overview

LLMs have revolutionized the landscape of information retrieval and knowledge dissemination. However, their application in specialized areas is often hindered by factual inaccuracies and hallucinations, especially in long-tail knowledge distributions. We explore the potential of retrieval-augmented generation (RAG) models for long-form question answering (LFQA) in a specialized knowledge domain. We present VedantaNY-10M, a dataset curated from extensive public discourses on the ancient Indian philosophy of Advaita Vedanta. We develop and benchmark a RAG model against a standard, non-RAG LLM, focusing on transcription, retrieval, and generation performance. Human evaluations by computational linguists and domain experts show that the RAG model significantly outperforms the standard model in producing factual and comprehensive responses with fewer hallucinations. In addition, a keyword-based hybrid retriever that emphasizes unique low-frequency terms further improves results. Our study provides insights into effectively integrating modern large language models with ancient knowledge systems.

(Figure: Overview of VedantaNY-10M)

Dataset

We use transcripts of 612 lectures on the Indian philosophy of Advaita Vedanta from the Vedanta Society of New York (VSNY). They are automatically generated with the OpenAI Whisper large-v2 model.

Code for generating transcripts is provided in the transcription folder. Please follow the steps below:

  1. Download audio: You can either generate transcripts for all videos on VSNY's YouTube channel up to the current date or use the videos from the paper (up to March 24, 2024). The list of videos used is provided in bot.csv. Download the CSV file and place it in data/metadata/large-v2/episodes.

    • Download audio files using the list used in the paper:
      python transcription/download_audio.py --download-from-csv --csv-file bot.csv
    • Alternately, download all audio files from VSNY up to the current date:
      python transcription/download_audio.py
    • To update the list from time to time, you can use the --skip-saved flag:
      python transcription/download_audio.py --skip-saved
  2. Split audio: Based on available resources, split the metadata into chunks that can be processed in parallel.

    • Single machine: number of chunks is 1 by default (n=1)
      python split_audio.py
    • Multiple GPUs and/or machines: Set n as per your requirement
      python split_audio.py --n 8
  3. Generate transcripts:

    • To run on a single machine, you can run the following command:
      python transcription/run_whisper.py
    • To run in parallel, please edit scripts/run_whisper.sh as per your requirements and run as:
      bash scripts/run_whisper.sh

Once the transcription is complete, the data folder should contain all the transcript data to be chunked, embedded, and stored in the vector database. A minimal sketch of the transcription step is shown below.
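For orientation, here is a minimal sketch of what the transcription step does, assuming the openai-whisper package and a hypothetical data/audio/ directory; the actual transcription/run_whisper.py handles metadata, chunked runs, and output locations itself.

    import glob
    import whisper  # pip install openai-whisper

    # Load the same checkpoint used to build the dataset.
    model = whisper.load_model("large-v2")

    # data/audio/*.mp3 is a hypothetical location; point this to wherever download_audio.py saved the files.
    for path in sorted(glob.glob("data/audio/*.mp3")):
        result = model.transcribe(path)
        with open(path.rsplit(".", 1)[0] + ".txt", "w") as f:
            f.write(result["text"])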

Setup

Conda environment

  1. Create a conda environment called 'vedantany10m':

    conda create -n vedantany10m python=3.10
    conda activate vedantany10m
  2. Install required packages:

    pip install -r requirements.txt 
  3. Install spaCy models (a quick load check is sketched after this list):

    python -m spacy download en_core_web_sm
    python -m spacy download en_core_web_lg
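
As a quick check that the spaCy models downloaded correctly (a small sketch, not a script in this repo):

    import spacy

    # Both pipelines should load without raising OSError.
    for name in ("en_core_web_sm", "en_core_web_lg"):
        nlp = spacy.load(name)
        print(name, "loaded with", len(nlp.pipe_names), "pipeline components")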

Verify installation

Make sure that you are able to import the following packages in Python without any errors:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter
# The two imports below come from this repository's utils/ package,
# so run the check from the repo root.
from utils.tfidf_retriever import CustomTFIDFRetriever
from utils.ensemble_retriever import CustomEnsembleRetriever

OpenAI & Pinecone API keys (optional)

Note: If you plan to use an open-source embedder (Nomic), LLM (Mixtral) and VectorDB (ChromaDB), you can skip this section.

  1. OpenAI: If you want to use OpenAI models such as ada for embedding chunks or GPT for the LLM, you have to open an account and obtain API keys. Please follow the steps below:

    • Set up an OpenAI developer account
    • Get your API key
    • Add it to your ~/.bashrc as follows:
      • export OPENAI_API_KEY=<enter-api-key>
  2. Pinecone: If you want to use Pinecone as the vectorDB, you'll again need API keys. Please follow the steps below:

    • Set up a Pinecone account
    • Get your API key
    • Add it to your ~/.bashrc as follows:
      • export PINECONE_API_KEY=<enter-api-key>
    • Also set the Pinecone environment, e.g. us-west1-gcp-free:
      • export PINECONE_ENV=<enter-env>
    • After all additions, source the bash file (a quick sanity check is sketched after this list):
      • source ~/.bashrc
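
As a sanity check (a minimal sketch, not part of the repo), confirm that the keys are visible to Python before running the bot:

    import os

    # Fails loudly if a variable was not exported or ~/.bashrc was not sourced.
    for var in ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENV"):
        assert os.environ.get(var), f"{var} is not set"
    print("All keys found.")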

Run Bot

  1. Chunk transcripts

    • This splits each transcript into multiple passages
      • python create_chunks.py --whisper_model large-v2

  2. Embed chunks: Use an embedding model and vectorDB of your choice. There are options to use closed-source embedders and databases (OpenAI ada embedding, Pinecone vectorDB) or open-source ones (Nomic embedding, ChromaDB). You can run any of the following as per your choice:

    • Nomic embedding and ChromaDB:
      • python embed_chunks.py --embedding_model nomic --vectorstore chroma
    • OpenAI embedding and ChromaDB:
      • python embed_chunks.py --embedding_model openai --vectorstore chroma
    • OpenAI embedding and Pinecone:
      • python embed_chunks.py --embedding_model openai --vectorstore pinecone

  3. Run chatbot

    • We provide options to run open or closed LLMs of your choice. While OpenAI's GPTs are closed, Mixtral is open-source and can be run natively on your local machine. Run any of the following according to your choice of model and embedding (a minimal end-to-end sketch of steps 1-3 follows this list):

      python test_bot.py --llm gpt-4 --embedding_model openai --vectorstore pinecone
      python test_bot.py --llm gpt-3.5-turbo --embedding_model openai --vectorstore pinecone
      python test_bot.py --llm gpt-3.5-turbo --embedding_model nomic --vectorstore chroma
      python test_bot.py --llm mixtral --embedding_model nomic --vectorstore chroma
    • To pass an input query at run time, use the --pass-query flag. Example: python test_bot.py --llm mixtral --embedding_model nomic --vectorstore chroma --pass-query
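
For orientation, the sketch below shows what steps 1-3 amount to, using stock LangChain components (RecursiveCharacterTextSplitter, Chroma, TFIDFRetriever, EnsembleRetriever, ChatOpenAI) in place of this repo's create_chunks.py, embed_chunks.py, test_bot.py, and the custom retrievers in utils/. The file path, chunk sizes, weights, and prompt are illustrative, and import paths may differ across LangChain versions.

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.vectorstores import Chroma
    from langchain.retrievers import TFIDFRetriever, EnsembleRetriever
    from langchain.chat_models import ChatOpenAI

    # 1. Chunk: split one transcript into overlapping passages (sizes are illustrative).
    text = open("data/transcripts/example.txt").read()  # hypothetical path
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_text(text)

    # 2. Embed: store dense embeddings in a local Chroma collection.
    vectordb = Chroma.from_texts(chunks, OpenAIEmbeddings(), persist_directory="chroma_db")

    # Hybrid retrieval: combine the dense retriever with a sparse TF-IDF retriever,
    # in the spirit of the keyword-based retriever described in the paper.
    dense = vectordb.as_retriever(search_kwargs={"k": 5})
    sparse = TFIDFRetriever.from_texts(chunks, k=5)
    retriever = EnsembleRetriever(retrievers=[dense, sparse], weights=[0.5, 0.5])

    # 3. Generate: answer a query from the retrieved passages.
    query = "What does Advaita Vedanta say about the nature of the self?"
    context = "\n\n".join(d.page_content for d in retriever.get_relevant_documents(query))
    llm = ChatOpenAI(model_name="gpt-3.5-turbo")
    print(llm.predict(f"Answer using only this context:\n{context}\n\nQuestion: {query}"))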

Evaluation

The evaluation dataset, comprising 25 queries across 5 categories, is in eval/2-rag-vs-kwrag/queries.xlsx. Answers from different models are in eval/2-rag-vs-kwrag/answers.

Human survey

We provide anonymized responses from our human survey. They contain numerical ratings for each LLM answer as well as long-form feedback.

Automatic metrics

We evaluate five automatic metrics as follows; a minimal GPT-2 perplexity sketch follows the list:

  1. length: number of words and sentences (answer-only)
  2. self_bleu: reported self-BLEU-5 in colm (answer-only)
  3. perplexity: computed with GPT-2 (answer-only)
  4. rank_gen: question as prefix, answer as suffix (question-answer)
  5. qafact_eval: run from the qafacteval conda env (retrieval-answer)
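
For illustration, here is a minimal sketch of the GPT-2 perplexity metric (metric 3) using the transformers library; the exact implementation behind scripts/metrics.sh may differ.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def gpt2_perplexity(answer: str) -> float:
        # Perplexity = exp of the mean token-level negative log-likelihood.
        enc = tokenizer(answer, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    print(gpt2_perplexity("Advaita Vedanta holds that the self is not different from Brahman."))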

To run the above automatic metrics reported in the paper, follow the steps below.

Installation

Install supporting repos for computing metrics:

  1. RankGen

  2. QAFactEval

    • Clone repo and install packages
      conda create --name qafacteval python=3.10
      conda activate qafacteval
      pip install qafacteval
      pip install gdown==4.6.0
    • Download models
      git clone git@github.com:salesforce/QAFactEval
      cd QAFactEval
      bash download_models.sh

Generate LLM responses (optional)

We already provide responses from different models in eval/2-rag-vs-kwrag/answers. If you want to generate them yourself, you can run the following scripts:

bash scripts/keyword_extraction.sh
bash scripts/mixtral_bot.sh

Generated answers are saved in eval/2-rag-vs-kwrag/answers.

Run metrics

To obtain all the automatic metrics, please run the following script:

bash scripts/metrics.sh

Note: For QAFactEval, please run the commands in the qafacteval conda env. Metrics are saved in the eval/2-rag-vs-kwrag/metrics folder.

Ethics Statement

  • All data used in this project has been acquired from public lectures on YouTube delivered by Swami Sarvapriyananda of the Vedanta Society of New York. These transcripts have not been proofread for accuracy.
  • While our study explores integrating ancient knowledge systems with modern machine learning techniques, we recognize the inherent limitations of these models. Users of these tools should be aware that the models can make errors, and should therefore seek guidance from qualified teachers to carefully progress on the path.