gile — This repository contains a Keras implementation of GILE, a generalized input-label embedding for large-scale text classification, which was proposed in TACL 2019 [1]. The overall model consists of a joint non-linear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance. GILE improves over monolingual and multilingual models which do not leverage label semantics [2] and previous joint input-label space models for text [3, 4] in both full-resource and low or zero-resource scenarios.
@article{Pappas_TACL_2019,
    author  = {Pappas, Nikolaos and Henderson, James},
    title   = {GILE: A Generalized Input-Label Embedding for Text Classification},
    journal = {Transactions of the Association for Computational Linguistics (TACL)},
    volume  = {7},
    year    = {2019}
}
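For intuition, below is a minimal NumPy sketch of the kind of joint input-label scoring described above. The variable names, the tanh non-linearities and the per-label sigmoid are illustrative assumptions rather than the exact parametrization used in this repository; see [1] and the code for the precise formulation. The size of the joint space is the "controllable capacity" mentioned above (presumably the --ladim option in the CLI below).
# Illustrative sketch of a GILE-style output layer (not the repository's exact code).
import numpy as np

def gile_scores(h, E, U, V, u):
    # h : (d_h,)          encoded input document
    # E : (n_labels, d_e) encoded label descriptions
    # U : (d_j, d_e)      label-side projection into the joint space
    # V : (d_j, d_h)      input-side projection into the joint space
    # u : (d_j,)          weights of the joint-space classification unit
    labels_joint = np.tanh(np.dot(E, U.T))   # label view of the joint space, (n_labels, d_j)
    input_joint = np.tanh(np.dot(V, h))      # input view of the joint space, (d_j,)
    joint = labels_joint * input_joint       # multiplicative joint input-label representation
    logits = np.dot(joint, u)                # one score per label from the classification unit
    return 1.0 / (1.0 + np.exp(-logits))     # independent per-label probabilities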
The provided code makes it possible to reproduce the two experiments described in the paper:
- Multilingual text classification: Under the main folder (./) you can find the code related to the multilingual text classification experiments on the DW dataset from [2].
- Biomedical semantic indexing: Under the hdf5/ folder you can find the code related to the biomedical text classification experiments on the BioASQ dataset from [4]. We provide separate code to support large-scale datasets like BioASQ (10M) using the HDF5 format and trainable word embeddings.
The code for GILE requires the Python 2.7 programming language and the pip package manager to run. For detailed instructions on how to install them, please refer to the corresponding links. You should then be able to install the required libraries using the provided list of dependencies:
pip install -r dependencies.txt
To avoid creating library conflicts in your existing pip environment, it may be more convenient to use a folder-specific environment with pipenv instead. For setting up your GPUs to work with Theano, please refer to here and here. For example, the Theano configuration for running our experiments on GPUs with CUDA 8.0 and cuDNN 5110 was the following:
THEANO_FLAGS='cuda.root=<path>/CUDA-8.0, mode=FAST_RUN, dnn.enabled=True, device=gpu, lib.cnmem=0.9'
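To double-check that these flags are actually picked up, a quick hedged check from Python (assuming Theano installed correctly from the dependencies above) is:
# Sanity check: confirm Theano sees the device and mode set via THEANO_FLAGS above.
import theano
print(theano.config.device)   # should report the GPU device when the flags above are exported
print(theano.config.mode)     # should report FAST_RUN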
Note: A large portion of this repository is borrowed from the mhan toolkit, so for networks with a typical sigmoid output-layer parametrization please consider using and acknowledging that toolkit instead.
This section describes how to obtain the task prerequisites (datasets, pretrained word vectors) and how to use the basic code functionality needed to perform multilingual text classification experiments on the DW dataset from [2].
To get started, please follow the steps below to obtain the required data and the pretrained word vectors:
- Request and download the compressed files containing the pre-processed DW dataset from [2] (see Datasets in mhan) and unzip their contents under the data/ folder:
mkdir data; cd data;
# Copy the compressed data files to data/ folder.
unzip dw_general.zip; unzip dw_specific.zip
- Download the compressed files with the pretrained aligned word embeddings in pickle format from the mhan repository and store them under the word_vectors/ folder in the main directory:
cd ../;mkdir word_vectors; cd word_vectors;
for lang in {english,german,spanish,portuguese,ukrainian,russian,arabic,persian}; \
do wget https://github.com/idiap/mhan/raw/master/word_vectors/$lang.pkl.gz ; done
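As a quick sanity check that the downloads are usable, each file can be opened directly with gzip and pickle; the exact structure of the unpickled object is an assumption here, so this is only a hedged example:
# Hedged check: the word vector files are gzipped pickles (see the step above).
import gzip
import pickle

with gzip.open('word_vectors/english.pkl.gz', 'rb') as f:
    vectors = pickle.load(f)
print(type(vectors))   # inspect the loaded object before training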
$ python run.py --help
Using Theano backend.
usage: run.py [-h] [--wdim WDIM] [--swpad SWPAD] [--spad SPAD] [--sdim SDIM]
[--ddim DDIM] [--ep EP] [--ep_size EP_SIZE] [--bs BS]
[--enc ENC] [--act ACT] [--gruact GRUACT] [--share SHARE]
[--t T] [--seed SEED] [--args_file ARGS_FILE]
[--wordemb_path WORDEMB_PATH]
[--languages LANGUAGES [LANGUAGES ...]] [--data_path DATA_PATH]
[--path PATH] [--target TARGET] [--source SOURCE]
[--store_file STORE_FILE] [--max_num MAX_NUM] [--train] [--test]
[--store_test] [--onlylabel] [--onlyinput] [--la]
[--ladim LADIM] [--laact LAACT] [--lashare]
(...)
To train a model, we have to pass the --train argument to run.py. For each specified language, the script will automatically load the training and validation sets stored under the specified data folder and train the model on them. At each epoch, the script stores a snapshot of the model along with its validation scores (precision, recall and F1-score) under the specified folder for each of the languages involved in the training, e.g. under exp/<language_1>/, exp/<language_2>/. If the script detects already stored models for multiple epochs in the specified folder, it will continue training from the model stored at the last epoch.
For example, to train a monolingual HAN with DENSE encoders and a GILE output layer (GILE-HAN) on general German categories, we execute the following command:
python run.py --train --languages german --wordemb_path word_vectors/ --data_path=data/dw_general \
--path exp/bi-gen/mono/gile-han-att --wdim 40 --swpad 30 --spad 30 --sdim 100 --ddim 100 --ep 300 \
--bs 16 --enc attdense --act relu --la
For example, to train a multilingual HAN with DENSE encoders, shared attention and GILE output layer (GILE-MHAN-Att) on general English and German categories we execute the following command:
python run.py --train --languages english german --wordemb_path word_vectors/ --data_path=data/dw_general \
--path exp/bi-gen/multi/en-de/gile-mhan-att --wdim 40 --swpad 30 --spad 30 --sdim 100 --ddim 100 --ep 300 \
--bs 16 --enc attdense --act relu --share att --la --lashare
Note: With the above commands you can also run the experiments for the rest of the languages in the monolingual case and the rest of the language-pair combinations in the multilingual case by changing the --languages argument.
To test a model, we have to pass the --test argument to run.py, point to the directory of the model that we would like to evaluate, and specify the language to evaluate on with the --target argument. The script will select the model with the best validation score in the specified directory and test it on the corresponding test set (a rough sketch of this selection is given after the examples below). When using the testing function, the architecture of the model is also plotted and stored in the specified directory (see below).
Using the previous example, we can test the above monolingual HAN with DENSE encoders and GILE output layer (GILE-HAN) on the corresponding German test set as follows:
python run.py --test --path exp/bi-gen/mono/gile-han-att --target german --la --t 0.40
Using the previous example, we can evaluate the above multilingual HAN with DENSE encoders, shared attention and GILE output layer (GILE-MHAN-Att) on the corresponding German test set as follows:
python run.py --test --path exp/bi-gen/multi/en-de/gile-mhan-att --target german --t 0.40
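The best-snapshot selection mentioned above can be pictured roughly as follows; the filename pattern and the score format are hypothetical stand-ins for whatever run.py actually stores per epoch, so treat this purely as a sketch of the idea:
# Purely illustrative: pick the per-epoch snapshot with the highest validation F1.
# The '*_scores.json' pattern and 'val_f1' key are hypothetical, not the toolkit's format.
import glob
import json
import os

def best_snapshot(model_dir):
    best_path, best_f1 = None, -1.0
    for score_file in glob.glob(os.path.join(model_dir, '*_scores.json')):
        with open(score_file) as f:
            f1 = json.load(f)['val_f1']
        if f1 > best_f1:
            best_path, best_f1 = score_file, f1
    return best_path, best_f1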
Architecture plots: HAN with Dense encoders + GILE output layer; MHAN with Dense encoders + GILE output layer.
For other functionalities such as storing attention weights and visualizing them please check mhan toolkit.
Apart from the code, we also provide under the pretrained/ folder the configurations of the best-performing models with the GILE output layer from the full-resource experiments in [1] (Tables 2 and 4):
- bi-gen/: Bilingual models with DENSE encoders and GILE output layer trained on general categories (Table 2, upper part).
- bi-spe/: Bilingual models with DENSE encoders and GILE output layer trained on specific categories (Table 2, lower part).
- enc-gen/: Monolingual models with varying encoders (DENSE, GRU, biGRU) and GILE output layer trained on general categories (Table 4).
The command below evaluates the English-German MHAN model with DENSE encoders and GILE output layer, which was trained on general English and German categories, on the corresponding English test set. The resulting F1-score should match the one in the corresponding column of Table 3 in [1].
python run.py --test --path pretrained/bi-gen/multi/en-de/gile-mhan-att --target english --t 0.4
To train the model from scratch using the same configuration (args.json) and initial weights as in [1], one has to simply remove the optimal pretrained model files from the specified path folder as follows:
rm pretrained/bi-gen/multi/en-de/gile-mhan-att/english/*_[1-9]*-*
rm pretrained/bi-gen/multi/en-de/gile-mhan-att/german/*_[1-9]*-*
python run.py --path pretrained/bi-gen/multi/en-de/gile-mhan-att/ --languages english german --train
Note: We also provide upon request the configurations and initial weights of any other model used in the paper.
This section describes how to obtain the task prerequisites (datasets, pretrained word vectors) and how to use the basic code functionality needed to perform biomedical semantic indexing experiments on the BioASQ dataset from [4]. Compared to the code for the previous experiments, this version differs in the following ways: i) it supports trainable word embeddings, ii) it uses the HDF5 format for efficiently storing and accessing large-scale datasets, and iii) it supports zero-shot evaluation.
To get started, please follow the steps below to obtain the required data and (optionally) the pretrained word vectors:
- Download our pre-processed version of the BioASQ dataset according to [4], stored in HDF5 format (or pre-process the data yourself and store it in the same format); a small example of inspecting the extracted HDF5 files is given after the setup steps below:
cd hdf5;mkdir data;cd data;
wget https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl; chmod +x gdown.pl;
./gdown.pl https://drive.google.com/open?id=16TcpunnhNyPIWoy4RSkIICVYDZSKDpzU bioasq.tar.gz
tar -zxvf bioasq.tar.gz
- (Optional) Use the pretrained word vectors by storing them under the hdf5/ folder:
cd ../;mkdir word_vectors;
wget https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl; chmod +x gdown.pl;
./gdown.pl https://drive.google.com/open?id=1zu4l4bRcqiTwqNXrZI8gDPC1sxrDJsBf bioasq-word_vectors.zip
unzip bioasq-word_vectors.zip
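Once the BioASQ archive from the first step is extracted, the HDF5 files can be inspected from Python with h5py to see which groups and datasets they contain. The path below is an assumption about what the archive unpacks to, so adjust it to whatever tar actually produced:
# Hedged example: list the top-level groups/datasets of an extracted HDF5 file.
import h5py

with h5py.File('data/bioasq/train.h5', 'r') as f:   # hypothetical path/filename
    print(list(f.keys()))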
$ cd hdf5/; python run.py --help
Using Theano backend.
usage: run.py [-h] [--wdim WDIM] [--wsize WSIZE] [--lpad LPAD] [--wpad WPAD]
[--sampling SAMPLING] [--swpad SWPAD] [--spad SPAD]
[--sdim SDIM] [--ddim DDIM] [--ep EP] [--ep_size EP_SIZE]
[--epval_size EPVAL_SIZE] [--bs BS] [--enc ENC] [--act ACT]
[--gruact GRUACT] [--share SHARE] [--t T] [--seed SEED]
[--args_file ARGS_FILE] [--wordemb_path WORDEMB_PATH]
[--languages LANGUAGES [LANGUAGES ...]] [--data_path DATA_PATH]
[--path PATH] [--target TARGET] [--source SOURCE]
[--store_file STORE_FILE] [--chunks CHUNKS] [--train] [--test]
[--store_test] [--mode MODE] [--pretrained] [--maskedavg]
[--chunks_offset CHUNKS_OFFSET] [--onlylabel] [--onlyinput]
[--la] [--ladim LADIM] [--laact LAACT] [--lashare]
[--seen_ids SEEN_IDS] [--unseen_ids UNSEEN_IDS]
(...)
As in the previous section, to train a model we pass the --train argument to run.py. For each specified language, the script loads the training and validation sets from the specified data folder, stores a per-epoch snapshot of the model along with its validation scores (precision, recall and F1-score) under the specified folder, and continues training from the last stored epoch if it detects already stored models in that folder.
For example, to train a WAN with DENSE encoders and a GILE output layer (GILE-WAN) on the BioASQ dataset, we execute the following command:
python run.py --languages english --data_path data/bioasq/ --path exp/gile-wan --train \
--wdim 100 --bs 64 --sampling 0.03 --la --ladim 500 --lpad 50 --maskedavg
As in the previous section, to test a model we have to pass the --test argument to run.py, point to the directory of the model that we would like to evaluate, and specify the language to evaluate on with the --target argument. The script will select the model with the best validation score in the specified directory and test it on the corresponding test set. When using the testing function, the architecture of the model is also plotted and stored in the specified directory (see below). One key difference from the previous section is that the test functionality allows evaluating on particular subsets of the test set, namely the subset whose labels have been seen during training and the subset whose labels have not been seen during training (zero-shot setting).
For example, to evaluate a WAN with DENSE encoders and a GILE output layer (GILE-WAN) on the labels seen during training, we execute the following command:
python run.py --test --path exp/gile-wan --target english --mode seen --chunks 50 --bs 8
To evaluate on the unseen labels during training, we execute the following command:
python run.py --test --path exp/gile-wan --target english --mode unseen --chunks 5 --bs 8
Note: The results of the first script above are stored under the specified folder (--path) in separate files and should be averaged to obtain the final score over the whole test set (a small illustrative sketch follows).
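Since the per-chunk results live in separate files, you need to average them yourself; the snippet below shows the idea, but the glob pattern and file contents (a single F1 value per file) are assumptions rather than the toolkit's actual output format:
# Illustrative only: average per-chunk F1 scores into a single test-set score.
import glob

scores = []
for path in glob.glob('exp/gile-wan/english/*chunk*_f1.txt'):   # hypothetical pattern
    with open(path) as f:
        scores.append(float(f.read().strip()))
print(sum(scores) / len(scores) if scores else 'no chunk result files found')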
Architecture plot: WAN with Dense encoders + GILE output layer.
For other functionalities such as storing attention weights and visualizing them please check mhan toolkit.
Apart from the code, we also provide under the pretrained/ folder the configuration of the best-performing model with the GILE output layer from the experiments in [1] (Table 1):
- gile-wan/: Model with DENSE encoders and GILE output layer trained on the BioASQ dataset (Table 1). Due to its large size, this pretrained model is hosted externally and can be obtained as follows:
cd hdf5; wget https://raw.githubusercontent.com/circulosmeos/gdown.pl/master/gdown.pl ; chmod +x gdown.pl;
./gdown.pl https://drive.google.com/open?id=1Xq8-9KBLEBRoTzOMAIqqzXueq0_ciLML bioasq-pretrained.zip
unzip bioasq-pretrained.zip
The command below evaluates the WAN model with DENSE encoders and GILE output layer on the labels seen during training. The resulting average F1-score should match the one in the corresponding column of Table 1 in [1].
python run.py --test --path pretrained/gile-wan --target english --mode seen --chunks 50 --bs 8
To train the model from scratch using the same configuration (args.json) and initial weights as in [1], one has to simply remove the optimal pretrained model files from the specified path folder as follows:
rm pretrained/gile-wan/*_[1-9]*-*
python run.py --path pretrained/gile-wan/ --train
Note: We also provide upon request the configurations and initial weights of any other model used in the paper.
- [1] Nikolaos Pappas, James Henderson, GILE: A Generalized Input-Label Embedding for Text Classification, Transactions of the Association for Computational Linguistics, 2019
- [2] Nikolaos Pappas, Andrei Popescu-Belis, Multilingual Hierarchical Attention Networks for Document Classification, 8th International Joint Conference on Natural Language Processing, Taipei, Taiwan, 2017
- [3] Majid Yazdani, James Henderson, A Model of Zero-Shot Learning of Spoken Language Understanding, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015
- [4] Jinseok Nam, Eneldo Loza Mencía, Johannes Fürnkranz, All-in Text: Learning Document, Label, and Word Representations Jointly, Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, USA, 2016
We gratefully acknowledge funding from the European Union's Horizon 2020 programme through the SUMMA project (Research and Innovation Action, grant agreement no. 688139): Scalable Understanding of Multilingual Media, see http://www.summa-project.eu/.