IN4325 Information Retrieval Project

`Authors:` Thomas Bos (4543408) & Daniël van Gelder (4551028), group 7

This is the repository for the implementation of the IR project for the TU Delft MSc course IN4325 Information Retrieval. In this project we implement two baselines for a Passage Ranking task and perform an analysis of the results.

Prerequisites:

- Python 3.6+
- Pyserini (latest)
- Pandas (latest)
- Numpy (latest)
- autocorrect (latest)
- jupyter notebook (latest)
- Maven 3.3+
- Java 11 or higher
- PyTerrier
- Download the TREC 2019 Deep Learning Track Passage Ranking Dataset and place the following (extracted) files in the src/data folder:
  - Collection
  - Queries
  - All Test files

Activate the virtual environment in the venv folder to get a Python environment with the appropriate dependencies.

Installing Anserini/Pyserini/PyTerrier/pygaggle and the MS-MARCO dataset

The installation steps are taken from the Anserini and Pyserini documentation. The process has been tested on macOS and ~~Windows~~. Packages are installed from PyPI with pip; using conda is not recommended.

Pyserini

Clone the anserini repository using git clone with the --recurse-submodules option, which also fetches the eval subfolder (note: the command below uses SSH; HTTPS also works):

git clone git@github.com:castorini/anserini.git --recurse-submodules

Verify that the subfolders eval and tools are non-empty; if either is empty, download its contents manually.
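
This check can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `empty_subfolders` is hypothetical, and the default paths tools and tools/eval are an assumption based on the build commands further down in this README, so adjust them if your checkout is laid out differently.

```python
from pathlib import Path


def empty_subfolders(repo_root, names=("tools", "tools/eval")):
    """Return the given subfolders that are missing or contain no entries."""
    root = Path(repo_root)
    problems = []
    for name in names:
        folder = root / name
        # A submodule that was not fetched shows up as an empty or absent directory.
        if not folder.is_dir() or not any(folder.iterdir()):
            problems.append(name)
    return problems
```

Calling `empty_subfolders("anserini")` returns an empty list when both folders are populated, and the names of the offending folders otherwise.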

Move all the contents of the anserini folder into the src folder of the current project.
Build the Anserini project using Maven (tests can be skipped since we are only building):

mvn clean package appassembler:assemble -DskipTests

Build the tools directory as follows:

cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..

Install pyserini using PyPI:

pip install pyserini

MS-MARCO

Download and extract the MS MARCO dataset for Passage Ranking. We will create a new directory for this (make sure to do this from the src folder):

mkdir -p collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

If desired, the checksum of the downloaded collection file collectionandqueries.tar.gz can be checked: it should have an MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
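
For platforms without an md5 command-line tool, the checksum can be computed in Python. This is a minimal sketch using only the standard library; the function name `md5_of_file` and the constant `EXPECTED_MD5` are introduced here for illustration, and the expected value is the one stated above.

```python
import hashlib

# Expected checksum as stated in this README.
EXPECTED_MD5 = "31644046b18952c1386cd4564ba2ae69"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so the archive never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The download is intact if `md5_of_file("collections/msmarco-passage/collectionandqueries.tar.gz") == EXPECTED_MD5` evaluates to `True`.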

Download the test queries file and place it in the directory with the other collection files:

wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2019-queries.tsv.gz -P collections/msmarco-passage
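
To inspect the downloaded queries without extracting the archive first, something like the following may help. It is a sketch that assumes each line of the file holds a query id and the query text separated by a tab; the function name `load_queries` is hypothetical and not part of this repository.

```python
import csv
import gzip


def load_queries(path):
    """Read a gzipped, tab-separated queries file into a {query_id: text} dict."""
    queries = {}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside queries untouched.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip lines that do not match the assumed layout
            queries[row[0]] = row[1]
    return queries
```

For example, `len(load_queries("collections/msmarco-passage/msmarco-test2019-queries.tsv.gz"))` gives the number of test queries that were read.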

Indexing MS-MARCO for PySerini

Convert the MS MARCO .tsv collection into Anserini's jsonl files:

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl
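
To give an idea of what the conversion produces, the sketch below turns one line of collection.tsv into one JSON line. It assumes the tab-separated passage-id/passage-text layout of the collection and the id and contents field names used by Anserini's JsonCollection; the function name `passage_to_json_line` is hypothetical, and the actual script additionally splits the output over several files.

```python
import json


def passage_to_json_line(tsv_line):
    """Turn one 'passage_id<TAB>passage_text' line into a JSON line."""
    # Split on the first tab only, in case the passage text contains tabs.
    passage_id, text = tsv_line.rstrip("\n").split("\t", 1)
    return json.dumps({"id": passage_id, "contents": text})
```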

Now index these docs as a JsonCollection using Anserini (this may take a few minutes):

python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/msmarco-passage/collection_jsonl \
 -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw

This completes the installation. To verify it, run verify_installation.py in the src folder; it should print INSTALLATION OK if everything works correctly. If it does not, refer to the installation instructions of Anserini and Pyserini and the following doc to debug.

PyTerrier

- Install PyTerrier from PyPI: pip install python-terrier. This should install Terrier as well.
- Install LightGBM, first with Homebrew (brew install lightgbm) and then from PyPI (pip install lightgbm).
- Make sure that scikit-learn is installed: pip install -U scikit-learn
- The pipeline indexes the MSMARCO dataset itself. If the dataset has been indexed before, the pipeline reloads the existing index.

pygaggle

- Install from PyPI: pip install pygaggle
- Clone the repository recursively so that the submodules are downloaded as well: git clone --recursive https://github.com/castorini/pygaggle.git
- Move all the contents of the repository into the src folder.
- Make sure all the pygaggle requirements are installed: pip install -r requirements.txt
- Verify the installation by opening and running the src/pygaggle-reference.ipynb notebook.

Running the pipeline

BM25

To run the BM25 algorithm, run src/BM25_pyserini.py. Make sure that the paths to the index and query files specified at the bottom of the script are correct.
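
For readers unfamiliar with BM25, the sketch below shows the textbook scoring formula on tokenized toy documents. It is not the code in src/BM25_pyserini.py, where Pyserini and Lucene do the scoring; the function name `bm25_scores` and the values of k1 and b are choices made for this illustration only.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=0.9, b=0.4):
    """Score tokenized documents against a tokenized query with BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization dampens the term frequency in long documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document that contains none of the query terms scores 0.0, and documents are ranked by descending score.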

LambdaMART

Run the entire LambdaMART pipeline with src/terrier-l2r-pipeline.py. If no arguments are passed, the default parameters are used. Otherwise, make sure to pass the correct arguments:

- [algorithm]: either lambdamart or randomforest, default: lambdamart
- [number of passages to retrieve in stage 1]: default: 1000
- [number of training topics to use (values <= 0 are interpreted as using all)]: default: 100
- [number of validation topics to use (values <= 0 are interpreted as using all)]: default: 100
- [run name for file output]: default: 00
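
The argument handling described above could look roughly as follows. This is a sketch under assumptions, not the parsing code of terrier-l2r-pipeline.py: it assumes the arguments are positional in the order listed, it allows trailing arguments to be omitted, and the names `DEFAULTS` and `parse_args` as well as the use of None for "all topics" are hypothetical.

```python
# Defaults as listed in this README, in the assumed positional order.
DEFAULTS = ["lambdamart", 1000, 100, 100, "00"]


def parse_args(argv):
    """Fill in positional arguments from argv, falling back to the defaults."""
    values = list(DEFAULTS)
    for position, raw in enumerate(argv[:len(DEFAULTS)]):
        # Keep the type of the default: ints stay ints, strings stay strings.
        values[position] = type(DEFAULTS[position])(raw)
    algorithm, n_passages, n_train, n_validation, run_name = values
    if algorithm not in ("lambdamart", "randomforest"):
        raise ValueError(f"unknown algorithm: {algorithm}")
    # Values <= 0 mean "use all topics"; None is the marker chosen for this sketch.
    n_train = n_train if n_train > 0 else None
    n_validation = n_validation if n_validation > 0 else None
    return algorithm, n_passages, n_train, n_validation, run_name
```

In a script this would be called as `parse_args(sys.argv[1:])` after `import sys`.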

T5

Run the T5 pipeline with: python t5-passage-ranking.py OUTPUT_PATH INDEX_PATH TEST_QUERIES_PATH RUN

Reading the analysis

The error analysis is in the analysis.ipynb notebook. The other notebooks contain exploratory analyses of the algorithms.
