Skip to content

This is the implementation of the double chained CRF used for predicting MWE and supersenses.

Notifications You must be signed in to change notification settings

mjhosseini/2-CRF-MWE

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is the implementation of the double chained CRF used for predicting Multiword Expressions (MWE) and supersenses.

UW-CSE at SemEval-2016 Task 10: Detecting multiword expressions and supersenses using double-chained conditional random fields. Mohammad Javad Hosseini, Noah A. Smith, and Su-In Lee. In Proceedings of the NAACL Workshop on Semantic Evaluations (SemEval 2016), San Diego, CA, June 2016.

We participated at the SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM). Our submitted models ranked first overall in the competition.

We have implemented a Conditional Random Field and a Double-Chained Conditional Random Field model for joint learning of multiword expressions and supersenses.

The feature extraction is based on AMALGrAM 2.0 (A Machine Analyzer of Lexical Groupings And Meanings) and the dependencies are the same as AMALGrAM 2.0.

Software

  • Python 2.7
  • Cython (tested on 0.21.1)
  • NLTK 3.0.2+ with the WordNet resource installed

Running:

After downloading the code, given the above softwares are installed, you can run the code from the scripts folder to replicate the paper's results and/or test on new data. (best model: Double_CRF_open.sh)

Tagging Scheme

Multiword Expressions:

The annotation for MWEs extends the conventional BIO scheme to include gappy MWEs with one level of nesting. Segmentations are represented using six tags; the lower-case variants indicate that an expression is within another MWE’s gap.

-- O and o: single word expression -- B and b: the first word of a MWE -- I and i: a word continuing a MWE

Supersenses:

Each noun or verb expression is also annotated with a supersense; there are 26 supersenses for nouns and 15 for verbs. Only the first word of a MWE receives a supersense tag.

The input must be sentence and word tokenized and part-of-speech tagged (with the Penn Treebank POS tagset).

Please refer to dimsum-data-1.5/TAGSET.md for more details.

Data:

The datasets are in the folder dimsum-data-1.5. There is a readme file in the folder explaining the format. For prediction on new data, input should be formatted as described there. Our original submission is in the folder submitted_results.

Please email the first author ([email protected]) in case of any questions and/or requests.

About

This is the implementation of the double chained CRF used for predicting MWE and supersenses.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages