This is the repository for the implementation of the IR project for the TU Delft MSc course IN4325 Information Retrieval. In this project we implement two baselines for a Passage Ranking task and perform an analysis of the results.
- Python 3.6+
- Pyserini (latest)
- Pandas (latest)
- Numpy (latest)
- autocorrect (latest)
- jupyter notebook (latest)
- Maven 3.3+
- Java 11 or higher
- PyTerrier
- Download the TREC 2019 Deep Learning Track Passage Ranking Dataset and place the following (extracted) files in the src/data folder:
  - Collection
  - Queries
  - All Test files
Activate the virtual environment in the venv folder to start a Python environment with the appropriate dependencies.
The steps for this installation are taken from the anserini and pyserini documentation. This process has been tested on macOS and Windows. Packages are installed from PyPI using pip; using conda is not recommended.
- Clone the anserini repository using git clone with the --recurse-submodules option, which also fetches the eval subfolder (note: the following command uses SSH; HTTPS is also possible):
git clone git@github.com:castorini/anserini.git --recurse-submodules
Verify that the subfolders eval and tools are non-empty; otherwise, download them manually.
- Move all the contents of the anserini folder into the src folder of the current project.
- Build the Anserini project using Maven (tests can be skipped since we are only building):
mvn clean package appassembler:assemble -DskipTests
- Build the tools directory as follows:
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
- Install pyserini using PyPI:
pip install pyserini
- Download and extract the MS MARCO dataset for Passage Ranking. We will create a new directory for this (make sure to do this from the src folder):
mkdir -p collections/msmarco-passage
wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage
# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage
tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage
If desired, the checksum of the downloaded collection file collectionandqueries.tar.gz can be checked: it should have an MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
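The checksum comparison can also be done from Python. The following is a minimal sketch using only the standard library; the function name md5_of_file and the chunk size are illustrative choices, not part of this repository.

```python
import hashlib

# Expected checksum as stated above.
EXPECTED_MD5 = "31644046b18952c1386cd4564ba2ae69"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("collections/msmarco-passage/collectionandqueries.tar.gz") from the src folder and comparing the result with EXPECTED_MD5 shows whether the download is intact.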
- Download the test queries file and place it in the directory with the other collection files:
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2019-queries.tsv.gz -P collections/msmarco-passage
- Convert the MS MARCO .tsv collection into Anserini's jsonl files:
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
--collection-path collections/msmarco-passage/collection.tsv \
--output-folder collections/msmarco-passage/collection_jsonl
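To illustrate what this conversion produces, here is a simplified sketch, not the actual Anserini script. It assumes each line of collection.tsv holds a passage id and the passage text separated by a tab, and it writes one JSON object per line with the id and contents fields that a JsonCollection reads. The function names are hypothetical, and the real script may additionally split the output over several files.

```python
import json


def tsv_line_to_json(line):
    """Turn one 'passage_id<TAB>text' line into a JSON document string."""
    passage_id, text = line.rstrip("\n").split("\t", 1)
    return json.dumps({"id": passage_id, "contents": text})


def convert_collection(tsv_path, jsonl_path):
    """Convert a whole TSV collection to a single JSONL file; returns the document count."""
    count = 0
    with open(tsv_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(tsv_line_to_json(line) + "\n")
            count += 1
    return count
```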
- Now index these docs as a JsonCollection using Anserini (this may take a few minutes):
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
-threads 9 -input collections/msmarco-passage/collection_jsonl \
-index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw
- This should complete the installation process. Verify that everything is correct by running verify_installation.py in the src folder. This should print INSTALLATION OK if everything is working correctly. If not, please refer to the installation of anserini, pyserini and the following doc to debug.
- Install PyTerrier using PyPI (this should install Terrier as well):
pip install python-terrier
- Also install LightGBM, first using Homebrew and then through PyPI:
brew install lightgbm
pip install lightgbm
- Make sure that scikit-learn is installed:
pip install -U scikit-learn
- Indexing the MS MARCO dataset is performed by the pipeline. If the dataset has already been indexed, the pipeline reloads the existing index.
- Install pygaggle via PyPI:
pip install pygaggle
- Clone the repo recursively so that the submodules are downloaded as well:
git clone --recursive https://github.com/castorini/pygaggle.git
- Move all the contents of the repository into the src folder.
- Make sure all the pygaggle requirements are installed:
pip install -r requirements.txt
- Installation can be verified by opening and running the src/pygaggle-reference.ipynb notebook.
To run the BM25 algorithm, run src/BM25_pyserini.py. Make sure that the paths to the index and query files specified at the bottom of the script are correct.
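A quick way to catch a wrong path before starting a run is to check that the configured locations exist. This is a minimal sketch and not part of BM25_pyserini.py; the helper name missing_paths is hypothetical, and the example paths in the note below are assumptions based on the installation steps.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of the given paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]
```

For example, missing_paths(["indexes/lucene-index-msmarco-passage", "collections/msmarco-passage/msmarco-test2019-queries.tsv"]) returns an empty list when both locations are present.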
Run the entire LambdaMART pipeline by running src/terrier-l2r-pipeline.py. If no arguments are passed, default parameters will be used. Otherwise, make sure to pass the correct arguments:
- [algorithm]: either lambdamart or randomforest, default: lambdamart
- [no. passages to retrieve in stage 1]: default: 1000
- [number of train topics to use (values <= 0 will be interpreted as using all)]: default: 100
- [number of validation topics to use (values <= 0 will be interpreted as using all)]: default: 100
- [run name for file output]: default: 00
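How such arguments and defaults fit together can be sketched with argparse. This is an illustration only: it assumes the five arguments are positional and given in the order listed, and all names (parse_pipeline_args, topic_limit, the attribute names) are hypothetical rather than taken from src/terrier-l2r-pipeline.py.

```python
import argparse


def parse_pipeline_args(argv):
    """Parse five optional positional arguments, falling back to the documented defaults."""
    parser = argparse.ArgumentParser(description="Hypothetical L2R pipeline arguments")
    parser.add_argument("algorithm", nargs="?", default="lambdamart",
                        choices=["lambdamart", "randomforest"])
    parser.add_argument("num_passages", nargs="?", type=int, default=1000)
    parser.add_argument("train_topics", nargs="?", type=int, default=100)
    parser.add_argument("validation_topics", nargs="?", type=int, default=100)
    parser.add_argument("run_name", nargs="?", default="00")
    return parser.parse_args(argv)


def topic_limit(n):
    """Values <= 0 mean 'use all topics', represented here as None."""
    return n if n > 0 else None
```

With this sketch, parse_pipeline_args([]) yields the defaults, while parse_pipeline_args(["randomforest", "500", "0", "50", "07"]) selects the random forest, retrieves 500 passages in stage 1, and uses all train topics.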
Run the T5 pipeline as follows:
python t5-passage-ranking.py OUTPUT_PATH INDEX_PATH TEST_QUERIES_PATH RUN
The error analysis is performed in the analysis.ipynb notebook. The other notebooks contain exploratory analyses of the algorithms.