This repository contains our system for the ChemProt task (extraction of chemical-protein interactions from the literature).
Requirements: Ubuntu, Python 3.6.4. Install the required packages:
$ pip install -r requirements.txt
confusion.py: Calculate the confusion matrix and other statistics given a file with predicted relations (see the sketch after this list).
create_embeddings.py: Create pre-trained part-of-speech and dependency embedding vectors (sketch below).
main.py: Train and test a deep learning model, which can be either a bidirectional long short-term memory (BiLSTM) recurrent network or a convolutional neural network (CNN). Edit the script to choose the input arguments; only the seed number can be passed on the command line (a minimal BiLSTM sketch follows this list):
$ python main.py 2
mfuncs.py: Functions used by the main.py script.
support.py: Auxiliary code to treat the ChemProt dataset.
utils.py: General use utilities.
voting.py: Average several output files of predicted probabilities. Edit the script to choose the input directory and the group to be evaluated (sketch below).
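For illustration, a minimal sketch of computing a confusion matrix and per-class statistics with scikit-learn. The input format (tab-separated gold and predicted labels) and file name are assumptions, not necessarily what confusion.py expects:

    # Minimal sketch, not the actual confusion.py implementation.
    # Assumes a tab-separated file with one "gold<TAB>predicted" pair per line.
    from sklearn.metrics import classification_report, confusion_matrix

    gold, pred = [], []
    with open('predictions.tsv') as f:  # hypothetical file name
        for line in f:
            g, p = line.rstrip('\n').split('\t')
            gold.append(g)
            pred.append(p)

    print(confusion_matrix(gold, pred))
    print(classification_report(gold, pred, digits=4))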
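A minimal sketch of pre-training part-of-speech embedding vectors with word2vec, here using the gensim (>= 4.0) API. The tag sequences and output file name are placeholders; in the actual script they come from the TEES-processed ChemProt data:

    # Minimal sketch of training part-of-speech embeddings with word2vec
    # (gensim >= 4.0 API); not the actual create_embeddings.py implementation.
    from gensim.models import Word2Vec

    # Placeholder tag sequences: one list of part-of-speech tags per sentence.
    pos_sentences = [
        ['DT', 'NN', 'VBZ', 'DT', 'JJ', 'NN', '.'],
        ['NNP', 'VBD', 'IN', 'NN', '.'],
    ]

    model = Word2Vec(sentences=pos_sentences, vector_size=50, window=5,
                     min_count=1, sg=1, epochs=10)
    model.wv.save_word2vec_format('pos_embeddings.txt')  # hypothetical name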
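An illustrative BiLSTM classifier in Keras showing how the command-line seed can be used; the architecture, dimensions, and number of classes are placeholders, not the model implemented in main.py:

    # Illustrative BiLSTM classifier with the random seed taken from the
    # command line, as in "python main.py 2"; not the architecture of main.py.
    import sys

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, models

    seed = int(sys.argv[1])
    np.random.seed(seed)
    tf.random.set_seed(seed)

    vocab_size, embedding_dim, n_classes = 10000, 300, 6  # placeholder values

    model = models.Sequential([
        layers.Input(shape=(None,), dtype='int32'),
        layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
        layers.Bidirectional(layers.LSTM(128)),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()  # training on the prepared ChemProt instances omitted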
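A minimal sketch of averaging several probability outputs (soft voting) with NumPy. The directory layout and .npy file format are assumptions, not necessarily what voting.py reads:

    # Minimal soft-voting sketch: average class probabilities over several runs.
    # Assumes each run saved an array of shape (n_instances, n_classes);
    # the directory and file pattern are hypothetical.
    import glob

    import numpy as np

    prob_files = sorted(glob.glob('outputs/run_*.npy'))
    probabilities = np.stack([np.load(f) for f in prob_files])

    averaged = probabilities.mean(axis=0)    # (n_instances, n_classes)
    predicted = averaged.argmax(axis=1)
    print(predicted)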
The datasets were pre-processed (tokenization, sentence splitting,
part-of-speech tagging, and dependency parsing) by the Turku Event
Extraction System (TEES).
Available for download as data.zip
[Mirror 1]
[Mirror 2].
Our word embedding models were created from PubMed English abstracts.
We also pre-trained part-of-speech and dependency embedding vectors from
the ChemProt dataset. Available for download as word2vec.zip
[Mirror 1]
[Mirror 2].
We also tested the word embedding model created by Chen et al. (2018) [Paper] [Code].
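For reference, a minimal sketch of loading one of these word2vec models with gensim (>= 4.0); the file name and binary flag are placeholders:

    # Minimal sketch of loading a word2vec model with gensim (>= 4.0).
    # The file name is a placeholder for one of the models in word2vec.zip.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('pubmed_word2vec.bin', binary=True)
    print(wv.most_similar('protein', topn=5))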
Statistics about the datasets and some prediction files.
Available for download as supp.zip
[Mirror 1]
[Mirror 2].
If you use this code or data in your work, please cite our publication:
@article{antunes2019a,
author = {Antunes, Rui and Matos, S{\'e}rgio},
journal = {Database},
month = oct,
number = {baz095},
publisher = {{Oxford University Press}},
title = {Extraction of chemical--protein interactions from the literature using neural networks and narrow instance representation},
url = {https://doi.org/10.1093/database/baz095},
volume = {2019},
year = {2019},
}