This is a README for training on the FrameNet 1.5 full text annotations. Dipanjan Das, [email protected], 2/18/2012.
Training models for frame-semantic parsing with SEMAFOR is still a very laborious and clunky set of steps. Your kind patience is required to train the models :-)
git clone git@github.com:Noahs-ARK/semafor.git
cd semafor
Modify the variables in ./bin/config.sh as needed.
Run:
mvn package
Make sure you have the required data. You can download FrameNet 1.5 here, but please also fill out the request form here if you haven't already.
Set the luxmldir environment variable in training/config to point at the lu folder.
The train/dev/test splits that were used in the NAACL '12 and subsequent papers can be found here.
Used to train and test the frame identification and argument identification models (please refer to our NAACL 2010 paper to understand these two steps). The first step is to create two maps, which I name framenet.original.map and framenet.frame.element.map.
- The first map is of type THashMap<String, THashSet<String>>. It maps a frame to a set of disambiguated predicates (words along with part-of-speech tags, but in the style of FrameNet).
- The second map is of type THashMap<String, THashSet<String>>, which maps each frame to a set of frame element names. In other words, the argument identification model needs this data structure to know what the frame elements are for each frame.
My versions of these two maps are present in this directory (these are just serialized Java objects).
Use the semafor-deps.jar file in the lib/ directory of the googlecode repository to get the right version of GNU Trove, and read (deserialize) these two maps. Then print the keys and their corresponding values to see exactly what is stored in these maps. After that, you will need to create your own versions of these two maps for your domain, in exactly the same format as these maps. If you want to use existing SEMAFOR code to create them, you can call the method writeSerializedObject(Object object, String outFile) in SerializedObjects.java to serialize the maps, so creating your own should be easy. You can also read the maps using that class.
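The read-and-print step above can be sketched roughly as follows. This is only a minimal sketch, not code from SEMAFOR: the class InspectFrameMaps and its methods are hypothetical names, and it assumes the map files are plain Java serialization streams. It relies on THashMap implementing java.util.Map and THashSet implementing java.util.Set, so the shipped maps can be read through those interfaces as long as semafor-deps.jar is on the classpath. The toy entry in main uses java.util.HashMap only so the example runs without Trove; the frame element names and their string format are illustrative, so check the shipped maps for the real format, and build your own maps with THashMap and THashSet.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of SEMAFOR.
class InspectFrameMaps {

    // Deserialize a "frame -> set of strings" map. Assumes the file is a plain
    // Java serialization stream; Trove must be on the classpath for the shipped maps.
    @SuppressWarnings("unchecked")
    static Map<String, Set<String>> readMap(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return (Map<String, Set<String>>) in.readObject();
        }
    }

    // Serialize a map to disk with standard Java serialization.
    static void writeMap(Map<String, Set<String>> map, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeObject(map);
        }
    }

    // Print one line per frame: the key, a tab, then the sorted values.
    static void printMap(Map<String, Set<String>> map) {
        for (Map.Entry<String, Set<String>> entry : map.entrySet()) {
            System.out.println(entry.getKey() + "\t" + new TreeSet<>(entry.getValue()));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Inspect an existing map file given on the command line.
            printMap(readMap(args[0]));
            return;
        }
        // No argument: round-trip a toy map (illustrative values only).
        Map<String, Set<String>> toy = new HashMap<>();
        toy.put("Commerce_buy", new HashSet<>(Arrays.asList("Buyer", "Goods", "Seller")));
        writeMap(toy, "toy.frame.element.map");
        printMap(readMap("toy.frame.element.map"));
    }
}
```

After compiling with javac, something like java -cp lib/semafor-deps.jar:. InspectFrameMaps framenet.original.map should print the contents of a shipped map, assuming the jar path matches your checkout.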
Used for the training and inference procedures.
./training/2_createRequiredData.sh
./training/trainIdModel.sh consists of:
- alphabet creation and combination:
./training/3_1_idCreateAlphabet.sh
Takes ~1 minute using 8 threads (AMD Opteron(TM) 6272 2.1 GHz processors, using the "ancestor" model).
- creating feature events for each datapoint:
./training/3_2_idCreateFeatureEvents.sh
Takes ~3-4 minutes.
- training the frame identification model:
./training/3_3_idTrainBatch.sh
Takes ~40 minutes. Line search in L-BFGS may fail at the end, but that does not mean training failed. A model is written to models_0.0 every few iterations; if line search fails, take the last model written there.
- convert the alphabet file:
./training/3_4_idConvertAlphabetFile.sh
Takes <1 minute.
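If line search fails during frame identification training and you need the last model from models_0.0, one way to find it is by modification time. The sketch below is a hypothetical helper (LatestModel is not part of SEMAFOR). It assumes the most recently modified file in the directory is the last model written; that may not hold if the files were copied or touched afterwards, and it makes no assumption about how the model files are named.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Optional;

// Hypothetical helper, not part of SEMAFOR.
class LatestModel {

    // Return the most recently modified regular file in dir, if any.
    static Optional<File> latestFile(File dir) {
        File[] files = dir.listFiles(File::isFile);
        if (files == null) {
            // dir does not exist or is not a directory
            return Optional.empty();
        }
        return Arrays.stream(files).max(Comparator.comparingLong(File::lastModified));
    }

    public static void main(String[] args) {
        // Defaults to models_0.0 in the current directory; pass another path to override.
        File dir = new File(args.length > 0 ? args[0] : "models_0.0");
        Optional<File> latest = latestFile(dir);
        System.out.println(latest.map(File::getPath)
                .orElse("no model files found in " + dir));
    }
}
```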
./training/trainArgModel.sh consists of:
- alphabet creation:
./training/4_1_createAlphabet.sh
Takes ~7 minutes.
- caching feature vectors:
./training/4_2_cacheFeatureVectors.sh
Takes ~10 minutes.
- training:
./training/4_3_training.sh
Takes about a day. This step has a regularization hyperparameter, lambda, which you may tune on a development set to get the best results.