This is our solution for WSDM-DiggSci 2020. We implemented a simple yet robust search pipeline, which ranked 2nd on the validation set and 4th on the test set. We won the gold prize in the innovation track and the bronze prize in the dataset track. [Video] [Slides] [Report]
Related Project: KDD-Multimodalities-Recall
- An end-to-end system with zero feature engineering.
- Performed data cleaning with self-designed saliency-based rules, removing redundant data that has an insignificant impact on results; this improved MAP@3 by 3%.
- Designed a novel early-stopping strategy for reranking based on the confidence score, avoiding up to 40% of unnecessary BERT inference computation.
- Scores are stable (nearly identical) across the train_val, validation, and test sets.
- Open the jupyter notebook: run `jupyter lab` or `jupyter notebook`.
- Clean the dataset: open `01WASH.ipynb` and run all cells. In this notebook, we clean the dataset by removing description text that is not highly related to the query topic, and we drop the NA rows in the candidate set. This shrinks the recall pool from 838,939 to 636,439 documents without sacrificing much recall rate; a simplified sketch follows.
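  A minimal sketch of this step, assuming a pandas DataFrame for the candidate set; the keyword-overlap filter is a simplified stand-in for the hand-designed saliency rules, and the file/column names are assumptions (see `01WASH.ipynb` for the real rules):

  ```python
  import pandas as pd

  # assumed file and column names; see 01WASH.ipynb for the real cleaning rules
  candidates = pd.read_csv("candidate_paper_for_wsdm2020.csv")
  candidates = candidates.dropna(subset=["abstract"])  # drop the NA rows

  def keep_salient_sentences(description: str, query_terms: set) -> str:
      """Toy saliency rule: keep only sentences that share at least one
      keyword with the query topic, dropping unrelated description text."""
      sentences = description.split(". ")
      return ". ".join(s for s in sentences
                       if any(t in s.lower() for t in query_terms))
  ```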
- Recall: open `02RECALL.ipynb` and run all cells. In this notebook, we use BM25, a scoring method that recalls documents sharing important keywords with the query. For faster computation, we adopted cupyx to run the scoring on GPU, so the matrix multiplication of (valid_size, vocab_size) by (600k, vocab_size)ᵀ finishes in about 15 minutes on a single GPU card; a sketch follows.
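  A minimal sketch of the GPU-accelerated scoring, assuming term-frequency CSR matrices (`docs_tf`, `queries_tf`) were already built, e.g. with sklearn's `CountVectorizer`; variable names are illustrative and the real code lives in `02RECALL.ipynb`:

  ```python
  import numpy as np
  import cupy as cp
  from cupyx.scipy import sparse as cusparse

  def bm25_weight(tf, k1=1.5, b=0.75):
      """Turn a (n_docs, vocab) term-frequency CSR matrix into BM25 weights."""
      tf = tf.tocsr().astype(np.float32)       # astype copies the matrix
      n_docs = tf.shape[0]
      doc_len = np.asarray(tf.sum(axis=1)).ravel()
      avg_len = doc_len.mean()
      df = np.bincount(tf.indices, minlength=tf.shape[1])  # document frequency
      idf = np.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
      rows = np.repeat(np.arange(n_docs), np.diff(tf.indptr))
      tf.data = (tf.data * (k1 + 1)
                 / (tf.data + k1 * (1 - b + b * doc_len[rows] / avg_len))
                 * idf[tf.indices])
      return tf

  # docs_tf: (600k, vocab) CSR; queries_tf: (valid_size, vocab) CSR
  docs_gpu = cusparse.csr_matrix(bm25_weight(docs_tf).T.tocsr())  # (vocab, 600k)
  queries_gpu = cusparse.csr_matrix(queries_tf.astype(np.float32))

  top50 = []
  for i in range(0, queries_gpu.shape[0], 128):      # chunk to bound GPU memory
      scores = queries_gpu[i:i + 128].dot(docs_gpu)  # (chunk, 600k) on the GPU
      dense = scores.toarray()
      top50.append(cp.asnumpy(cp.argsort(-dense, axis=1)[:, :50]))
  recall_ids = np.vstack(top50)                   # top-50 doc indices per query
  ```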
- Rerank: open `05BERT_ADARERANK.ipynb` and run all cells. In this notebook, we use the finetuned BioBERT model to score every (query, document) pair. A novel early-stopping strategy saves computation: while reranking the candidates of a query, if a document is scored with high confidence (above a threshold), the reranking for that query stops early; a sketch follows.
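  A minimal sketch of the early-stopping loop; `score_pair` (the finetuned BioBERT scorer) and the threshold value are assumptions, see `05BERT_ADARERANK.ipynb` for the actual implementation:

  ```python
  def adaptive_rerank(query, candidates, score_pair, threshold=0.95):
      """Rescore BM25-ordered candidates; stop once a confident hit appears."""
      scored = []
      for doc in candidates:                 # candidates sorted by BM25 score
          s = score_pair(query, doc)         # BERT relevance score in [0, 1]
          scored.append((doc, s))
          if s >= threshold:                 # confident match: skip the rest
              break
      reranked = [d for d, _ in sorted(scored, key=lambda x: -x[1])]
      # unscored candidates keep their BM25 order behind the rescored ones
      return reranked + candidates[len(scored):]
  ```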
How to finetune BioBERT for scoring:
As mentioned in step 4, we need to finetune BioBERT for the reranking task. Please refer to the notebooks `03BERT_PREPARE.ipynb` and `04BERT_TRAIN.ipynb` for coding details. We also share some tips worth mentioning:
- Use pairwise BERT: when using BERT to score sentence pairs, we recommend taking the [CLS] token vector as the output, followed by a single-layer neural network with dropout; a sketch follows.
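  A minimal PyTorch sketch of such a scoring head; the Hugging Face model id and hyperparameters are assumptions, see `04BERT_TRAIN.ipynb` for the real training code:

  ```python
  import torch.nn as nn
  from transformers import AutoModel, AutoTokenizer

  class PairScorer(nn.Module):
      """BERT encodes the (query, document) pair; dropout plus a single
      linear layer on the [CLS] vector produces the relevance score."""
      def __init__(self, name="dmis-lab/biobert-v1.1", dropout=0.1):
          super().__init__()
          self.bert = AutoModel.from_pretrained(name)
          self.drop = nn.Dropout(dropout)
          self.head = nn.Linear(self.bert.config.hidden_size, 1)

      def forward(self, input_ids, attention_mask):
          out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
          cls = out.last_hidden_state[:, 0]            # [CLS] token vector
          return self.head(self.drop(cls)).squeeze(-1)

  # query and document are encoded jointly, truncated to 512 tokens
  tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")
  batch = tokenizer(["example query"], ["example document"], truncation=True,
                    max_length=512, padding=True, return_tensors="pt")
  ```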
- Use RankNet loss: cross-entropy is not the best choice for ranking, because it trains the scoring function towards +inf or -inf. Such a loss benefits classification, but in ranking we do not need extreme scores; we need discriminative scores, where the more relevant document scores higher. That is what the RankNet loss optimizes. Limited by GPU resources, our team could not train BERT with the RankNet loss; instead, we selected the finetuned checkpoints that performed well on the ranking task, i.e., what would be called underfitting models for classification. This practice improved MAP@3 by 0.03+ on the validation set. A sketch of the loss follows.
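  A sketch of the pairwise RankNet loss referenced here, assuming each batch pairs a relevant document's score with an irrelevant one's for the same query (not our final training code, which fell back to checkpoint selection):

  ```python
  import torch
  import torch.nn.functional as F

  def ranknet_loss(pos_scores, neg_scores):
      """RankNet (Burges et al., 2005): penalizes the score *gap* between a
      relevant and an irrelevant document instead of pushing absolute scores
      to +/-inf; note softplus(x) = log(1 + exp(x))."""
      return F.softplus(neg_scores - pos_scores).mean()

  # example: three (relevant, irrelevant) score pairs from the same queries
  loss = ranknet_loss(torch.tensor([2.1, 0.3, 1.0]),
                      torch.tensor([1.5, 0.8, -0.2]))
  ```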
- Use 512 tokens in training: for both the training and inference phases, longer inputs let the model capture more semantic information. In our tests, increasing the token length from 256 to 512 improved MAP@3 by 0.02+.
- Upsample positive items: as in classification tasks, you can upsample the positive (query, doc) pairs or reweight them in the loss term; see the one-liner below.
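  For the reweighting variant, one option in PyTorch is the `pos_weight` argument of `BCEWithLogitsLoss` (the weight value below is illustrative, not the one we used):

  ```python
  import torch
  import torch.nn as nn

  # weight positive (query, doc) pairs ~20x in the loss; the exact value
  # should mirror the negative:positive sampling ratio
  criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(20.0))
  ```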
- Chengxuan Ying, Dalian University of Technology (应承轩 大连理工大学)
- Chen Huo (server sponsor), WeChat (霍晨 微信)
We did not use any data-leak tricks, though we know the data leak exists.
Thanks to Yanming Shen, who provided an 8-GPU server for 4 days.
- Chi-Yu Yang and Kuei-Chun Huang: WSDM_SimpleBaseline
- supercoderhawk: wsdm-digg-2020
- shuiliwanwu: wsdm_cup2020
- just4fun, greedisgood, slowdown and funny: wsdm2020-solution
- xiong, wzm, Yinxiang Xu, Xiaohao Xu and Yongqiang Liu: wsdm2020_diggsci
- Seiya, eclipse, will and ferryman: wsdm_cup_2020_solution
- Nogueira R, Cho K. Passage Re-ranking with BERT[J]. arXiv preprint arXiv:1901.04085, 2019.
- Burges C, Shaked T, Renshaw E, et al. Learning to rank using gradient descent[C]//Proceedings of the 22nd International Conference on Machine learning (ICML-05). 2005: 89-96.
- Severyn A, Moschitti A. Learning to rank short text pairs with convolutional deep neural networks[C]//Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, 2015: 373-382.
I will graduate from Dalian University of Technology in the summer of 2021. If you can refer me to any company, please contact me at [email protected].