This framework provides an end-to-end solution for evaluating retrieval systems using Wikipedia articles as a test corpus. It specializes in Arabic-language content but can be adapted to other languages. This is a work in progress: I am open sourcing only part of it for now, with plans to release the full set of experiments and methodology later.
The framework implements the following pipeline:
- Data Collection: Fetches Wikipedia articles
- Text Processing: Chunks articles into meaningful segments
- Query Generation: Creates natural language queries using GPT-4
- Embedding Generation: Generates embeddings using Cohere's multilingual model
- Index Creation: Builds a vector index for efficient retrieval
- Evaluation: Measures retrieval performance using standard IR metrics
```bash
pip install uv
uv venv .venv
uv pip install ir-measures openai cohere python-dotenv usearch pymediawiki numpy
```
You'll need API keys for:
- OpenAI (for query generation)
- Cohere (for embeddings)
Create a `.env` file with your API keys:

```
OPENAI_API_KEY=your_key_here
COHERE_API_KEY=your_key_here
```
The framework is organized into several sequential steps:
```python
from mediawiki import MediaWiki

wikipedia = MediaWiki(lang="ar")
results = wikipedia.search("الثورة التونسية")  # example search term: "the Tunisian revolution"
page = wikipedia.page(results[0])  # full article text is available as page.content
```
Uses cluster semantic chunking to break documents into meaningful segments:
```python
from chunking import ClusterSemanticChunker

text_splitter = ClusterSemanticChunker()
```
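`ClusterSemanticChunker` comes from this repo's `chunking` module. As a rough, purely illustrative stand-in (not the actual algorithm), a fixed-size word-window chunker with overlap looks like this:

```python
def chunk_words(text, size=200, overlap=50):
    """Naive stand-in for semantic chunking: fixed-size word windows
    with overlap. The real ClusterSemanticChunker groups sentences by
    embedding similarity instead of cutting at fixed word counts."""
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

The overlap keeps context that straddles a boundary retrievable from both neighboring chunks, at the cost of some duplicated text in the index.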
Generates diverse queries for each document chunk using GPT-4:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Generate queries using the Query class
```
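A sketch of what this step might look like; the prompt wording and the `generate_queries` helper are my own illustration, not the repo's actual `Query` class:

```python
import json

def build_prompt(chunk, n_queries=3):
    """Build a prompt asking the model for n natural-language queries
    that this chunk answers (illustrative wording, not the repo's)."""
    return (
        f"Read the following passage and write {n_queries} diverse, "
        "natural-language search queries (in the passage's language) that "
        "this passage answers. Return them as a JSON list of strings.\n\n"
        f"Passage:\n{chunk}"
    )

def generate_queries(client, chunk, n_queries=3):
    """Call the chat API and parse the returned JSON list of queries."""
    resp = client.chat.completions.create(
        model="gpt-4",  # assumption: any GPT-4-class model works here
        messages=[{"role": "user", "content": build_prompt(chunk, n_queries)}],
    )
    return json.loads(resp.choices[0].message.content)
```

Asking for *diverse* queries matters: if every generated query paraphrases the chunk's first sentence, the benchmark overestimates real retrieval quality.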
Creates embeddings using Cohere's multilingual model:
```python
import os
from cohere import ClientV2

co = ClientV2(api_key=os.environ["COHERE_API_KEY"])
# Generate embeddings for documents and queries
```
Creates a vector index for efficient retrieval:
```python
from usearch.index import Index

chunks_index = Index(ndim=1024, metric='cos')  # 1024 dims matches Cohere's multilingual embeddings
chunks_index.add(keys, embeddings)
```
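Under the hood, a cosine-metric index is answering nearest-neighbor queries over the embedding vectors. This pure-Python brute-force version shows exactly what is being computed; usearch just does it much faster at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(index, query_vec, k=3):
    """Brute-force top-k by cosine similarity; index is {key: vector}."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return scored[:k]
```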
Evaluates retrieval performance using standard IR metrics:
```python
import ir_measures
from ir_measures import P, R

ir_measures.calc_aggregate([P@1, P@3, P@5, R@1, R@3, R@5], qrels, results)
```
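For intuition, P@k and R@k reduce to simple counting; this pure-Python version mirrors what `ir_measures` aggregates per query (use the library itself for real runs):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / len(relevant)
```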
The framework generates several JSON files during processing:
- `data.json`: Raw Wikipedia articles
- `chunks.json`: Chunked documents
- `chunks_with_queries.json`: Documents with generated queries
- `query_with_ground_truth.json`: Query-document relevance pairs
- `chunks_with_queries_and_embeddings.json`: Complete document data with embeddings
- `query_with_ground_truth_and_embeddings.json`: Complete query data with embeddings
- `chunk_key_mapping.json`: Mapping between chunk IDs and index keys
- `qrels.txt`: TREC-format relevance judgments
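`qrels.txt` follows the standard TREC format, one judgment per line: `query_id iteration doc_id relevance`, where the iteration field is conventionally 0. Producing it is straightforward:

```python
def qrels_lines(qrels):
    """Format {query_id: {doc_id: relevance}} as TREC qrels lines."""
    return [
        f"{qid} 0 {doc_id} {rel}"
        for qid, docs in qrels.items()
        for doc_id, rel in docs.items()
    ]
```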
These files aren't committed to the repository, since I don't want to set up Git LFS.
The framework evaluates retrieval performance using:
- Precision@k (k=1,3,5)
- Recall@k (k=1,3,5)
To adapt the framework for different languages:
- Modify the MediaWiki language parameter
- Adjust the prompt for query generation
- Use appropriate multilingual embeddings
Feel free to submit issues and enhancement requests. If you spot any glaring errors in my methodology, please let me know; I am new to all of this and always eager to learn.