Parse QASRL using Seq2Seq technologies.
This repository contains code for our experiments and analysis in the paper "QASem Parsing: Text-to-Text Modeling of QA-based Semantics" (Klein et al., EMNLP 2022).
- `notebook.py` - convenient entry point, with 3 relevant API functions:
  - `full_experiment` - train and evaluate a new model. See a few example calls at `small_experiments.py`.
  - `load_and_evaluate` - evaluate an existing model.
  - `load_and_predict` - run inference on new texts using an existing model.
- `run_parsing_model.py` - main file for the train, evaluate and predict logic. It is adapted from the Huggingface `run_summarization.py` script, but quite heavily refactored.
- `pipeline.py` - introduces a wrapper pipeline class for easily running text-to-text inference. It can load a model from the Huggingface Hub or from a local directory.
- QASRL State machine example - a docker project running the original QASRL State Machine, used to analyze the question format for evaluation. Eventually, for better efficiency, we use our own machinery for that (see the `seq2seq_constrained_decoding` submodule below).
- `evaluation` - main file for the evaluation logic. Since we compare with previous QA-SRL and QANom works, we utilize their evaluation functionality from their own packages, and just adapt the input format per package.
Sub-modules:
- `qasrl-gs` - used for evaluation of the results, according to Roit et al., 2020. We call this the "qasrl evaluation methodology" throughout the repo.
- `QANom` - used for evaluation of the results, according to Klein et al., 2020. We call this the "qanom evaluation methodology" throughout the repo.
- `seq2seq_constrained_decoding` - was used in some preliminary experiments to constrain model generation according to the QASRL output sequence format. We ended up not using constrained decoding, as it introduced noise related to the mismatch between words and tokens. The Deterministic Finite Automaton (DFA) defining the QASRL question format is still used in `dfa_fill_qasrl_slots.py` for analyzing the question format (i.e., decomposing the output question string into the 7-slot template defining a valid QASRL question).
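For illustration, a 7-slot decomposition of a question looks roughly like the sketch below. The slot names follow the standard QASRL question template; the actual data structure produced by `dfa_fill_qasrl_slots.py` may differ.

```python
# Hypothetical illustration of decomposing a generated QASRL question string
# into the standard 7-slot template (WH, AUX, SUBJ, VERB, OBJ, PREP, OBJ2).
# The actual slot names / return type in dfa_fill_qasrl_slots.py may differ.
question = "Who blamed someone for something?"
slots = {
    "wh": "who",
    "aux": "_",        # "_" marks an empty slot
    "subj": "_",
    "verb": "blamed",
    "obj": "someone",
    "prep": "for",
    "obj2": "something",
}
```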
Make sure to init and update all submodules --- they are required for the `run_parsing_model.py` script to run properly. You can use the `--recurse-submodules` switch in the `git clone` command, or alternatively run `git submodule update --init --recursive` after cloning.
There are two recommended ways to run a training experiment, in which you select a certain set of hyperparameters (including the training dataset, the pretrained model, and various modeling-related options), train (=fine-tune) a new model, and then evaluate it on the dev or test set.
Inside `notebook.py` there is the `full_experiment` function. This function basically prepares the desired arguments for the `run_parsing_model` script based on simpler parameters, and wraps it with `wandb` utilities for logging the experiment. You can specify any seq2seq model from the Huggingface Model Hub as `model_type` (though the repo is maintained mainly for T5-type models).
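A training run might look like the following minimal sketch. Only `model_type` and `do_eval_on` are mentioned in this README; the remaining argument names are illustrative (see `small_experiments.py` for real example calls):

```python
# Minimal sketch of launching a training experiment via notebook.full_experiment.
# Argument names other than model_type and do_eval_on are hypothetical and may
# not match the actual function signature -- check notebook.py / small_experiments.py.
from notebook import full_experiment

full_experiment(
    model_type="t5-small",      # any seq2seq model from the Huggingface Model Hub
    train_dataset="qanom",      # hypothetical name for the training-dataset argument
    do_eval_on="validation",    # evaluate on the dev set after training
)
```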
Commonly it is desirable to run a series of training experiments with different hyperparameters, in order to tune them for maximal performance on the validation set (`do_eval_on=validation`). In this project we use WandB and its sweep utility for performing hyperparameter search and other experiments. A sweep (i.e. one hyperparameter search experiment) is defined by a YAML file - you can find some examples in the `sweeps/` directory.
To run a sweep, run `wandb sweep sweep_config.yaml`, and then initialize agents to execute actual calls to the command defined by `sweep_config.yaml`: `wandb agent [sweep-id]`.
In our context, you can take `sweeps/main_basic_grid.yaml` as a starting point. For training a single-task QASRL / QANom model, you should set `program` to `run_parsing_model.py` and provide the desired `--dataset_name`. Alternatively, for training a joint QASRL-QANom model on both datasets, comment out the `--dataset_name` arguments and set `program` to `joint_train_experiment.py`, which will specify the dataset for you (while duplicating the QANom train set by a factor of 14 to match the size of the QASRL train set).
The `full_experiment` function triggers evaluation after training by default. It does so by sending the `--do_eval` argument to the main `run_parsing_model.py` script. You can specify `do_eval_on` to evaluate over the "validation" set or the "test" set.
In order to run evaluation on a given trained model, call the `notebook.load_and_evaluate` function.
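A minimal sketch of such a call is shown below; the argument names are hypothetical, as the function's exact signature is not spelled out here:

```python
# Hypothetical sketch of evaluating an existing trained model with
# notebook.load_and_evaluate. Parameter names may differ -- check notebook.py.
from notebook import load_and_evaluate

load_and_evaluate(
    model_name_or_path="trained_models/my_qasrl_model",  # local dir or Huggingface Hub id
    do_eval_on="test",                                   # or "validation"
)
```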
If you wish to perform evaluation through WandB sweeps (e.g. for experimenting with decoding options, such as "num_beams"), take a look at `sweeps/inference.yaml` for an example.
The models from our paper are uploaded to Huggingface. We wrapped the model with an easy-to-use API, utilizing the Huggingface Pipeline, in an independent repository termed QASem; see the documentation there. It is also available through `pip install qasem`.
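As a rough sketch of what inference with the QASem package looks like (the import path, class name, and arguments below are assumptions -- consult the QASem documentation for the actual API):

```python
# Rough sketch of end-to-end inference with the QASem package.
# The import path, class name, and constructor arguments are assumptions --
# see the QASem repository documentation for the actual API.
from qasem.end_to_end_pipeline import QASemEndToEndPipeline

pipe = QASemEndToEndPipeline(annotation_layers=("qasrl", "qanom"))
sentences = ["The doctor was interested in Luke's treatment."]
outputs = pipe(sentences)
print(outputs)
```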
Locally, you can use the `notebook.load_and_predict` function, which can take a text file of instances (sentences marked with a predicate) or a Dataset as input. We recommend using the `pipeline.QASRL_Pipeline` wrapper class.
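A hedged sketch of using the wrapper class is given below. The model name is illustrative, and the exact input format (how the predicate is marked) and keyword arguments may differ -- see `pipeline.py` for details:

```python
# Hedged sketch of running inference with the pipeline.QASRL_Pipeline wrapper.
# The model id, predicate-marking format, and keyword arguments are assumptions;
# consult pipeline.py for the actual interface.
from pipeline import QASRL_Pipeline

# Load a model from the Huggingface Hub or from a local checkpoint directory.
pipe = QASRL_Pipeline("kleinay/qanom-seq2seq-model-baseline")

# Input sentences are marked with the target predicate.
output = pipe(
    "The student was interested in Luke 's <predicate> research about sea animals .",
    verb_form="research",        # assumed keyword arguments
    predicate_type="nominal",
)
print(output)
```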
@TBD