We now provide our trained models and the generated development and test set outputs for the results reported in the paper. They can be found in the folder experiments_reported; more documentation is available in the section Pretrained models and generated outputs below.
This repository contains the code to reproduce the results of the data-to-text natural language generation (NLG) experiments reported in the INLG 2018 paper Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity.
Two datasets from the 2017 NLG challenges (E2E and WebNLG) were used to train and evaluate sequence-to-sequence (Seq2Seq) NLG models with the OpenNMT-py library.
If you use this code, please cite the paper as follows:
@inproceedings{Seq2SeqData2Text_Jagfeld_18,
author = {Jagfeld, Glorianna and Jenne, Sabrina and Vu, Ngoc Thang},
title = {{Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity}},
booktitle = {Proceedings of the 11th International Natural Language Generation Conference (INLG)},
year = {2018},
doi = {10.18653/v1/w18-6529},
pages = {221--232}
}
The code is written for Python 2.7 and PyTorch 0.3.0 and was run on GPUs with CUDA 9.1.
The following Python packages are needed:
- pytorch 0.3.0.post4
- torchtext 0.2.3 (pip install torchtext==0.2.3)
- six (torchtext requirement)
- matplotlib 2.2.3 (needed for evaluation)
- scikit-image (needed for evaluation)
- regex 2018.11.22 (needed for evaluation)
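For example, in a Python 2.7 environment the dependencies other than PyTorch can be installed with pip (PyTorch 0.3.0 itself should be installed following the instructions for your platform and CUDA version on pytorch.org; the exact pins below are only a suggestion):
pip install torchtext==0.2.3 six matplotlib==2.2.3 scikit-image regex==2018.11.22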
Download the dataset from https://github.com/tuetschek/e2e-dataset/releases/download/v1.0.0/e2e-dataset.zip and unpack it into the folder datasets/e2e.
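For example, from the repository root (assuming wget and unzip are available; depending on the archive layout you may need to move the extracted files so that they sit directly in datasets/e2e):
wget https://github.com/tuetschek/e2e-dataset/releases/download/v1.0.0/e2e-dataset.zip
unzip e2e-dataset.zip -d datasets/e2e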
To create the preprocessed training, development and test splits for word-based generation, run the following from within the folder run-scripts/e2e:
python preprocess.py word
This lowercases and tokenizes the inputs and references and replaces the values of the NAME and NEAR slots with placeholders. The output is stored in datasets/e2e/preprocessed/word_based/.
- input: name [ NAME-X ] , eattype [ coffee shop ] , food [ italian ] , pricerange [ less than £20 ] , customer rating [ low ] , area [ riverside ] , familyfriendly [ yes ] , near [ NEAR-X ]
- reference: NAME-X is an inexpensive coffee shop near NEAR-X and the river . it is family-friendly and serves pasta .
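As an illustration of the delexicalization step, here is a minimal sketch of replacing the NAME and NEAR slot values with placeholders (the actual implementation is in run-scripts/e2e/preprocess.py; the function below is only a hypothetical example):

def delexicalise(mr, reference):
    # replace the NAME and NEAR slot values by placeholders in both the input and the reference
    for slot, placeholder in [('name', 'NAME-X'), ('near', 'NEAR-X')]:
        if slot in mr:
            reference = reference.replace(mr[slot], placeholder)
            mr = dict(mr, **{slot: placeholder})
    return mr, reference

mr = {'name': 'the eagle', 'near': 'burger king', 'food': 'italian'}
print(delexicalise(mr, 'the eagle serves italian food near burger king .'))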
To train N models with different random seeds in parallel, and subsequently generate texts for the development set and compute automatic evaluation metrics, run the following from within the folder run-scripts/e2e:
bash run_word_based.sh N
To generate texts and evaluate on the test set, run:
bash run_word_based.sh N test
Consult the bash script to activate or deactivate the individual steps of data preprocessing for OpenNMT-py, training, generation and evaluation.
Running all steps produces the following files in the folder experiments/e2e/word_based, where n ∈ {1, ..., N}:
- train_word_based_n.log: training log file
- word_based_n.pt: trained model
- generate_{dev/test}_multi_ref_word_based_n.log: generation log file
- {dev/test}_multi_ref_word_based_n_hypotheses.txt: delexicalized generated texts (one best hypothesis per input)
- {dev/test}_multi_ref_word_based_n_hypotheses.txt_postprocessed: lexicalized generated texts
- {dev/test}_multi_ref_word_based_n_hypotheses.txt_eval: evaluation file (see example below)
SCORES:
==========
BLEU: 0.7310
NIST: 8.8055
METEOR: 0.4803
ROUGE_L: 0.7551
CIDEr: 2.3788
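To aggregate metrics over the N runs, a small hypothetical helper that averages one metric over all evaluation files could look as follows (it assumes the file naming and the SCORES format shown above):

import glob, re

def mean_metric(pattern, metric='BLEU'):
    # collect the metric value from every matching *_hypotheses.txt_eval file and average it
    scores = []
    for path in glob.glob(pattern):
        for line in open(path):
            match = re.match(r'%s:\s*([\d.]+)' % metric, line)
            if match:
                scores.append(float(match.group(1)))
    return sum(scores) / len(scores)

print(mean_metric('experiments/e2e/word_based/dev_multi_ref_word_based_*_hypotheses.txt_eval'))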
To create the preprocessed training, development and test splits for character-based generation, run the following from within the folder run-scripts/e2e:
python preprocess.py character
This only lowercases the inputs and references. The output is stored in datasets/e2e/preprocessed/character_based/.
- input: name[the eagle],eattype[coffee shop],food[italian],pricerange[less than £20],customer rating[low],area[riverside],familyfriendly[yes],near[burger king]
- reference: the eagle is an inexpensive coffee shop near burger king and the river. it is family-friendly and serves pasta.
Training, generation and evaluation work analogously to the word-based models, but use the script run_character_based.sh instead.
SCORES:
==============
BLEU: 0.7103
NIST: 8.7457
METEOR: 0.4706
ROUGE_L: 0.7354
CIDEr: 2.2999
Download the dataset from https://gitlab.com/shimorina/webnlg-dataset/tree/master/release_v2/xml and place the train, dev and test folders (with their subfolders containing XML files) from webnlg-dataset-master/release_v2/xml/ in the folder datasets/webnlg; example commands are sketched below. To obtain single files for the training, development and test splits, run the following from the folder run-scripts/webnlg:
bash preprocess_to_single_files.sh
This creates one file per split in the folder datasets/webnlg (train.txt, dev.txt, test.txt), each containing one input and reference per line. Inputs with multiple references are duplicated for each reference.
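For reference, one possible way to carry out the download-and-place step above is to download the repository archive from GitLab, extract it next to this repository (so that a folder webnlg-dataset-master exists), and copy the split folders; the exact paths are assumptions and may need adjusting:
cp -r webnlg-dataset-master/release_v2/xml/train webnlg-dataset-master/release_v2/xml/dev webnlg-dataset-master/release_v2/xml/test datasets/webnlg/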
To create the preprocessed training, development and test splits for word-based generation, run the following from within the folder run-scripts/webnlg:
python preprocess.py word
The output is stored in datasets/webnlg/preprocessed/word_based/. This splits entity and property names at camel-case boundaries, lowercases the inputs and references, and delexicalizes all entities in the inputs as well as in the references (if an input entity appears verbatim in the reference). For more details about the delexicalization, consult Section 4 of the paper.
- input: capital ( BRIDGE-0 [ PATIENT-0 ] ) , material ( AGENT-0 [ PATIENT-1 ] ) , leader title ( BRIDGE-0 [ PATIENT-2 ] ) , dedicated to ( AGENT-0 [ PATIENT-3 ] ) , location ( AGENT-0 [ BRIDGE-0 ] ) , designer ( AGENT-0 [ PATIENT-4 ] ) , legislature ( BRIDGE-0 [ PATIENT-5 ] )
- reference: huyseyin butuner and hilmi guner designed the AGENT-0 . it is located in PATIENT-0 , BRIDGE-0 , which has legislature of national assembly , and led by the prime minster . the memorial is made from PATIENT-1 , and is dedicated to the PATIENT-3 .
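As an illustration of the camel-case splitting mentioned above, a minimal sketch (the actual implementation is in run-scripts/webnlg/preprocess.py; the regex below is an assumption):

import re

def split_camel_case(name):
    # insert a space before every upper-case letter that follows a lower-case letter, then lowercase
    return re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', name).lower()

print(split_camel_case('leaderTitle'))  # -> leader title
print(split_camel_case('dedicatedTo'))  # -> dedicated to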
Training, generation and evaluation work analogously to the E2E models, but use the script run_word_based.sh from the folder run-scripts/webnlg.
SCORES:
========
BLEU: 0.4806
NIST: 9.2032
METEOR: 0.3705
ROUGE_L: 0.6450
CIDEr: 3.3558
To create the preprocessed training, development and test splits for character-based generation, run the following from within the folder run-scripts/webnlg:
python preprocess.py character
This splits entity and property names at camel-case boundaries and lowercases the inputs and references. The output is stored in datasets/webnlg/preprocessed/character_based/.
- input: capital(azerbaijan[baku]),material(baku turkish martyrs' memorial[red granite and white marble]),leader title(azerbaijan[prime minister of azerbaijan]),dedicated to(baku turkish martyrs' memorial[ottoman army soldiers killed in the battle of baku]),location(baku turkish martyrs' memorial[azerbaijan]),designer(baku turkish martyrs' memorial[hüseyin bütüner and hilmi güner]),legislature(azerbaijan[national assembly (azerbaijan)])
- reference: huyseyin butuner and hilmi guner designed the baku turkish martyrs' memorial. it is located in baku, azerbaijan, which has legislature of national assembly, and led by the prime minster. the memorial is made from red granite and white marble, and is dedicated to the ottoman army soldiers killed in the battle of baku.
Training, generation and evaluation work analogously to the E2E models, but use the script run_character_based.sh from the folder run-scripts/webnlg.
SCORES:
========
BLEU: 0.5737
NIST: 9.5549
METEOR: 0.4020
ROUGE_L: 0.7151
CIDEr: 3.6612
We generate synthetic training data based on two templates using the scripts systems/template_1.py and systems/template_2.py. Template 1 corresponds to UKP-TUDA's submission to the E2E challenge, in which the order of describing the input information is fixed; specifically, the restaurant's customer rating is always mentioned before its location. For Template 2, we change the beginning of the template and switch the order in which the rating and the location of the restaurant are mentioned.
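To make the idea concrete, here is a purely illustrative sketch of two fixed-order templates that differ only in whether the rating or the location is mentioned first (this is not the code in systems/template_1.py or systems/template_2.py; the slot names and wording are assumptions):

def template_1(mr):
    # customer rating is mentioned before the location
    return ('%(name)s is a %(food)s %(eattype)s with a %(customer rating)s customer rating '
            'in the %(area)s area near %(near)s .') % mr

def template_2(mr):
    # same content, but the location is mentioned before the rating
    return ('%(name)s is a %(food)s %(eattype)s in the %(area)s area near %(near)s '
            'with a %(customer rating)s customer rating .') % mr

example = {'name': 'the eagle', 'food': 'italian', 'eattype': 'coffee shop',
           'customer rating': 'low', 'area': 'riverside', 'near': 'burger king'}
print(template_1(example))
print(template_2(example))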
To generate the training, development and test data using the two templates, run the following from the folder run-scripts/e2e/:
bash create_template_data.sh
To train N models on the data created with template 1, template 2 or both templates, run the following from the folder run-scripts/e2e/:
bash run_template_experiment.sh N TEMPLATE_NUMBER
where TEMPLATE_NUMBER is either 1, 2 or 1+2.
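For example, to train 5 models on the data created from both templates (the number of models is just an illustration):
bash run_template_experiment.sh 5 1+2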
Texts are generated for the 10 random test set inputs that we evaluated for the results discussed in Section 7 and Table 7 of the paper; these inputs can be found in the file datasets/e2e/test_template_10_random_inputs.txt.
Pretrained models and generated outputs for the results reported in Section 6 of the paper can be found in the folder experiments_reported.
We provide the outputs and models for all 10 random seeds we used for each configuration. Model files end with .pt. Generated outputs end with _hypotheses.txt and contain the top 5/15 outputs for each input line-by-line before the outputs for the next input follow. For example, the first 10 lines of the file experiments_reported/e2e/character_based/dev_multi_ref_5_default_bidir_adam1_hypotheses.txt correspond to the top 5 outputs for the first two development set inputs of the E2E dataset. For the word-based delexicalised models, _hypotheses.txt_postprocessed files additionally contain the lexicalised outputs.
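Since the hypotheses are stored line by line, a small hypothetical snippet can group them back into per-input lists (here for 5 hypotheses per input, as in the example above):

def read_hypotheses(path, n_best=5):
    # every consecutive block of n_best lines belongs to one input
    lines = [line.rstrip('\n') for line in open(path)]
    return [lines[i:i + n_best] for i in range(0, len(lines), n_best)]

groups = read_hypotheses('experiments_reported/e2e/character_based/dev_multi_ref_5_default_bidir_adam1_hypotheses.txt')
print(groups[0])  # the top 5 outputs for the first development set input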
The E2E models and outputs are licensed under CC BY-SA 4.0, and the WebNLG models and outputs under CC BY-NC-SA 4.0, in accordance with the licenses of the respective datasets.