This repository contains the code used in my dissertation research on sentence compression and fusion. The system implements supervised structured prediction for text transformation. Inference relies on integer programming to jointly produce output sentences characterized by:
- a sequence of n-grams (bigrams or trigrams)
- an edge-factored dependency tree
- a SEMAFOR-style frame-semantic parse (compression only)
These models are described and evaluated in Chapter 3, the latter half of Chapter 6, and Chapter 7 of my dissertation, Multi-Structured Models for Transforming and Aligning Text.
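To make the n-gram view of an output sentence concrete, here is a minimal sketch that lists the bigrams (or trigrams) implied by a deletion-based compression. It only illustrates the representation and is not taken from this repository: the function name `output_ngrams` and the padding symbols `<s>` and `</s>` are hypothetical, and the integer programming inference itself is not shown.

```python
def output_ngrams(tokens, kept, n=2, start="<s>", end="</s>"):
    """Return the n-grams over the tokens retained in a compression.

    tokens: list of input tokens
    kept:   indices of the tokens that survive in the output
    n:      n-gram order (2 for bigrams, 3 for trigrams)
    """
    # Assumption: compression only deletes tokens, so input order is kept.
    retained = [tokens[i] for i in sorted(kept)]
    # Pad with n-1 start symbols and one end symbol (hypothetical markers).
    padded = [start] * (n - 1) + retained + [end]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    for ngram in output_ngrams(sentence, kept=[0, 1, 2]):
        print(ngram)
```

Under this view, a candidate output is scored through the n-grams it contains, which is why the n-gram sequence can be treated as one of the jointly predicted structures.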
Honestly, it's unlikely that this code will be directly usable. It was extracted from a larger library without modification, hasn't been tested outside the original development environment and ultimately suffers from all the usual pitfalls of research code written under deadline pressure. Instead, interested users are encouraged to use this repository for reference or as a source of piecemeal solutions in reimplementation efforts.
Nevertheless, if you want to try to get this code running, here is a list of the known requirements:
- Python 2.6 or 2.7
- Ensure the distributed modules are on the $PYTHONPATH
- Module dependencies:
  - argparse (for Python 2.6)
  - nltk 3 (with WordNet and FrameNet corpora)
  - psutil
  - pyutilib.enum
  - simplejson
  - swig-srilm
  - stemming
- External software:
  - Gurobi 6.0 (offers academic licensing)
  - LPsolve
  - SRILM
  - Stanford parser 2.0.4 (or similar older version which produces projective trees)
  - SEMAFOR
  - RASP 3.x
  - TagChunk
- Data:
  - Dependency-converted Penn treebank for interfaces/treebank/depmodel.py (not necessary for default features)
  - Clarke & Lapata datasets for compression (contact me for dataset splits)
  - Pyramid evaluation data from DUC 2005-2007 and TAC 2008-2011 for fusion, available from NIST
- Update all paths in the code with appropriate paths to your installations.
- Launch servers:
  - LM servers through interfaces/srilm.py
  - Optionally, PTB servers through interfaces/treebank/depmodel.py
- Entry points to the code are transduction/compression.py and transduction/pyrfusion.py.
  - Run these with --help for command-line options.
  - Structural configurations are inferred through feature configurations, defined in transduction/featconfigs.py. The default options have simple names like word, ngram, dep and are listed at the top of the file.
- Contact me if you want the model files or system outputs from my experiments.
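Before working through the steps above, it may help to check which of the listed Python modules can be imported in your environment. The following is a minimal sketch and not part of the distributed code. The import names in `ASSUMED_MODULES` are assumptions based on the package names listed above; in particular, swig-srilm may install under a name other than `srilm` depending on how it was built.

```python
from __future__ import print_function

# Import names are assumptions derived from the package list in this README;
# adjust them (especially "srilm") to match your installation.
ASSUMED_MODULES = [
    "argparse",
    "nltk",
    "psutil",
    "pyutilib.enum",
    "simplejson",
    "srilm",
    "stemming",
]


def find_missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            # __import__ is used instead of importlib for Python 2.6 support.
            __import__(name)
        except Exception:
            # Any failure to import (not only ImportError) is reported,
            # since a broken install is as unusable as a missing one.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(ASSUMED_MODULES)
    if missing:
        print("Could not import: " + ", ".join(missing))
    else:
        print("All listed modules were importable.")
```

This covers only the Python modules. The external software, data and servers listed above still need to be set up and checked separately.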
This code is provided as-is and without any implicit or explicit assurance of support. Minor bugs may not be addressed but will be listed in this README.