IN4325 Information Retrieval Project

Authors: Thomas Bos (4543408) & Daniël van Gelder (4551028), group 7

This is the repository for our implementation of the IR project for the TU Delft MSc course IN4325 Information Retrieval. In this project, we implement two baselines for a Passage Ranking task and perform an analysis of the results.

Prerequisites:

  • Python 3.6+
  • Pyserini (latest)
  • Pandas (latest)
  • Numpy (latest)
  • autocorrect (latest)
  • jupyter notebook (latest)
  • Maven 3.3+
  • Java 11 or higher
  • PyTerrier
  • Download the TREC 2019 Deep Learning Track Passage Ranking Dataset and place the following (extracted) files in the src/data folder:
    • Collection
    • Queries
    • All Test files

Activate the virtual environment in the venv folder to start a Python environment with the appropriate dependencies. A quick way to check that the core dependencies resolve is sketched below.
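A minimal sketch of such a check (the script name is hypothetical; Pyserini and PyTerrier are installed in the steps below, so they may report as missing until those steps are done):

# sanity_check.py (hypothetical helper): verify the core prerequisites resolve
import sys

assert sys.version_info >= (3, 6), "Python 3.6+ is required"

for pkg in ("pandas", "numpy", "autocorrect", "pyserini", "pyterrier"):
    try:
        __import__(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: MISSING ({err})")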

Installing Anserini/Pyserini/PyTerrier/pygaggle and the MS-MARCO dataset

The steps for this installation are taken from the Anserini and Pyserini documentation. This process has been tested on macOS and Windows. The package manager used is PyPI; using conda is not recommended.

Pyserini

  1. Clone the anserini repository using git clone with the --recurse-submodules option, which also fetches the eval and tools submodules (note: the following command uses SSH; HTTPS is also possible):
git clone git@github.com:castorini/anserini.git --recurse-submodules

Verify that the subfolders eval and tools are non-empty; otherwise, download them manually.

  2. Move all the contents of the anserini folder into the src folder of the current project.

  3. Build the Anserini project using maven (tests can be skipped since we are only building):

mvn clean package appassembler:assemble -DskipTests
  4. Build the tools directory as follows:
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
  5. Install pyserini using PyPI:
pip install pyserini

MS-MARCO

  1. Download and extract the MS MARCO dataset for Passage Ranking. We will create a new directory for this (make sure to do this from the src folder):
mkdir -p collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

If desired, verify the downloaded collection file collectionandqueries.tar.gz: its MD5 checksum should be 31644046b18952c1386cd4564ba2ae69 (a sketch for checking this follows after this list).

  2. Download the test queries file and place it in the directory with the other collection files:
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-test2019-queries.tsv.gz -P collections/msmarco-passage
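The checksum verification, plus a quick peek at the downloaded queries file, can be done with a short script like the following (a sketch using only the standard library; the script name is hypothetical and the paths assume the layout created above):

# verify_download.py (hypothetical helper)
import gzip
import hashlib

# MD5 of the collection archive; expected: 31644046b18952c1386cd4564ba2ae69
md5 = hashlib.md5()
with open("collections/msmarco-passage/collectionandqueries.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)
print(md5.hexdigest())

# The test queries file is tab-separated: query id <TAB> query text
with gzip.open("collections/msmarco-passage/msmarco-test2019-queries.tsv.gz", "rt") as f:
    print(f.readline().rstrip())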

Indexing MS-MARCO for Pyserini

  1. Convert the MS MARCO .tsv collection into Anserini's jsonl files:
python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl
  2. Now index these docs as a JsonCollection using Anserini (this may take a few minutes):
python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 9 -input collections/msmarco-passage/collection_jsonl \
 -index indexes/lucene-index-msmarco-passage -storePositions -storeDocvectors -storeRaw
  3. This should complete the installation process. Verify that everything is correct by running verify_installation.py in the src folder. This should print INSTALLATION OK if everything is working correctly. If not, refer to the Anserini and Pyserini installation documentation to debug.
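The kind of end-to-end check performed here can be sketched as follows (a minimal illustration, not the contents of verify_installation.py; SimpleSearcher matches Pyserini versions of this era and was renamed LuceneSearcher in later releases):

# search_check.py (hypothetical): retrieve a few passages from the new index
from pyserini.search import SimpleSearcher  # LuceneSearcher in newer Pyserini

searcher = SimpleSearcher("indexes/lucene-index-msmarco-passage")
hits = searcher.search("what is information retrieval", k=5)
for rank, hit in enumerate(hits, start=1):
    print(f"{rank:2} {hit.docid:10} {hit.score:.4f}")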

PyTerrier

  1. Install PyTerrier using PyPI as follows: pip install python-terrier. This should install Terrier as well.

  2. Also install LightGBM: on macOS, via Homebrew (brew install lightgbm), then through PyPI: pip install lightgbm

  3. Make sure that scikit-learn is installed: pip install -U scikit-learn

  4. Indexing the MS MARCO dataset is performed by the pipeline itself; if the dataset has previously been indexed, the pipeline reloads the existing index (the pattern is sketched below).
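A minimal sketch of that index-or-reload pattern, assuming a hypothetical index location (the pipeline script handles this itself):

# terrier_index.py (hypothetical sketch)
import os
import pyterrier as pt

if not pt.started():
    pt.init()

index_dir = "./indexes/terrier-msmarco-passage"  # hypothetical path

def msmarco_docs(path="collections/msmarco-passage/collection.tsv"):
    # Yield the collection as PyTerrier indexing dicts: {"docno": ..., "text": ...}
    with open(path, encoding="utf-8") as f:
        for line in f:
            docno, text = line.rstrip("\n").split("\t", 1)
            yield {"docno": docno, "text": text}

if os.path.exists(os.path.join(index_dir, "data.properties")):
    index = pt.IndexFactory.of(index_dir)  # reload the existing index
else:
    index_ref = pt.IterDictIndexer(index_dir).index(msmarco_docs())
    index = pt.IndexFactory.of(index_ref)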

pygaggle

  1. Install via PyPI: pip install pygaggle

  2. Clone the repo recursively such that the submodules are downloaded as well: git clone --recursive https://github.com/castorini/pygaggle.git

  3. Move all the contents of the repository into the src folder.

  4. Make sure all the pygaggle requirements are installed: pip install -r requirements.txt

  5. Installation can be verified by opening and running the src/pygaggle-reference.ipynb notebook; a condensed reranking example is sketched below.
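For reference, a condensed reranking example based on pygaggle's documented MonoT5 usage (the query and passages here are made up; the default model weights are downloaded on first use):

# rerank_check.py (hypothetical): score two passages against a query with MonoT5
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()  # defaults to the castorini/monot5-base-msmarco model

query = Query("who proposed the theory of evolution")
passages = [
    ("d1", "Charles Darwin proposed the theory of evolution by natural selection."),
    ("d2", "The theory of relativity was developed by Albert Einstein."),
]
texts = [Text(body, {"docid": docid}, 0) for docid, body in passages]

for result in reranker.rerank(query, texts):
    print(f'{result.metadata["docid"]} {result.score:.5f}')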

Running the pipeline

BM25

In order to run the BM25 algorithm, run src/BM25_pyserini.py. Make sure that the paths to the index and query files specified at the bottom of the script are correct.
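For orientation, the retrieval loop such a script performs looks roughly like this (a sketch: the paths, BM25 parameters, and run tag are illustrative, check src/BM25_pyserini.py for the actual ones):

# bm25_run.py (hypothetical sketch): BM25 retrieval to a TREC-format run file
import gzip
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher("indexes/lucene-index-msmarco-passage")
searcher.set_bm25(k1=0.9, b=0.4)  # Anserini's default BM25 parameters

queries_path = "collections/msmarco-passage/msmarco-test2019-queries.tsv.gz"
with gzip.open(queries_path, "rt") as queries, open("bm25-test2019.run", "w") as out:
    for line in queries:
        qid, query = line.rstrip("\n").split("\t", 1)
        for rank, hit in enumerate(searcher.search(query, k=1000), start=1):
            # TREC run format: qid Q0 docid rank score tag
            out.write(f"{qid} Q0 {hit.docid} {rank} {hit.score:.6f} bm25\n")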

LambdaMART

Run the entire LambdaMART pipeline by running src/terrier-l2r-pipeline.py. If no arguments are passed, default parameters are used. Otherwise, make sure to pass the correct arguments (a sketch of the pipeline's shape follows after this list):

  • [algorithm]: either lambdamart or randomforest, default: lambdamart
  • [no. passages to retrieve in stage 1]: default: 1000
  • [number of training topics to use (values <= 0 are interpreted as using all)]: default: 100
  • [number of validation topics to use (values <= 0 are interpreted as using all)]: default: 100
  • [run name for file output]: default: 00
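The overall shape of such a two-stage pipeline in PyTerrier can be sketched as follows (illustrative only: the index path, features, and LightGBM parameters are assumptions, not the script's actual configuration):

# l2r_sketch.py (hypothetical): first-stage retrieval with features,
# then a LightGBM LambdaMART re-ranker
import pyterrier as pt
import lightgbm as lgb

if not pt.started():
    pt.init()

index = pt.IndexFactory.of("./indexes/terrier-msmarco-passage")  # hypothetical path

# Stage 1: BM25 candidates, with extra query-document features per candidate
stage1 = pt.FeaturesBatchRetrieve(
    index, wmodel="BM25", features=["WMODEL:Tf", "WMODEL:PL2"], num_results=1000)

# Stage 2: LambdaMART (LightGBM's lambdarank objective)
lmart = lgb.LGBMRanker(objective="lambdarank", metric="ndcg", n_estimators=100)
pipeline = stage1 >> pt.ltr.apply_learned_model(lmart, form="ltr")

# Topics and qrels are pandas DataFrames in PyTerrier's standard format
dataset = pt.get_dataset("trec-deep-learning-passages")
topics, qrels = dataset.get_topics("train"), dataset.get_qrels("train")
train_topics, valid_topics = topics.head(100), topics.iloc[100:200]

pipeline.fit(train_topics, qrels, valid_topics, qrels)
results = pipeline.transform(dataset.get_topics("test-2019"))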

T5

Run the T5 pipeline with: python t5-passage-ranking.py OUTPUT_PATH INDEX_PATH TEST_QUERIES_PATH RUN.

Reading the analysis

The notebook used for the error analysis is analysis.ipynb. The other notebooks contain exploratory analyses of the algorithms.
