In recent years, constituency parsing has shifted from grammars and feature-rich lexicons to recurrent neural networks. This work analyzes the extent to which the information provided directly by model structure in classical systems is still captured by neural methods.
We propose a parser that implicitly learns to encode much of the same information that was explicitly provided by grammars and lexicons in the past. We replace externally predicted part-of-speech (PoS) tags with character-level word representations.
Past work in constituency parsing used context-free grammars with production rules governing adjacent labels to propagate information and capture correlations between output decisions. Many recent parsers no longer have explicit grammar production rules, but still use information about other predictions, allowing them to capture output correlations.
We also show that we can completely remove output correlations from our model with a variant of our parser that makes independent span label decisions without any tree constraints.
We find that the same sources of information that were effective for grammar-driven parsers are also captured by parsers based on recurrent neural networks.
We propose a span-based parsing model that combines components from several recent neural parsers. The model has a single scoring function s(i, j, l), assigning a real-valued score to every label l for each span (i, j) in an input sentence.
In contrast to other chart parsers, our model can directly score n-ary trees without binarization or other tree transformations. The parsing problem is to find the tree with the highest total score. Our implementation of s(i, j, l) proceeds in three steps:
- Word representation
- Span representation
- Label scoring
We have a separate embedding for each word type in the training vocabulary and map all unknown words at test time to a single token.
Character-level representations alleviate the unknown word problem by working with smaller, more frequent units. We obtain a character-level representation for a word by running it through a bidirectional character LSTM and concatenating the final forward and backward outputs.
Complete representation of a word = concatenation of its word embedding + its character LSTM representation.
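A minimal sketch of this word-representation step in PyTorch; the dimensions, the `<UNK>` handling, and the module and variable names are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Word embedding concatenated with a character-level BiLSTM summary.

    Hyperparameters and <UNK> handling are illustrative assumptions.
    """
    def __init__(self, vocab_size, char_vocab_size,
                 word_dim=100, char_dim=50, char_hidden=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)   # index 0 reserved for <UNK>
        self.char_emb = nn.Embedding(char_vocab_size, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_id, char_ids):
        # word_id: scalar LongTensor; char_ids: (1, num_chars) LongTensor for one word
        w = self.word_emb(word_id)                            # (word_dim,)
        c = self.char_emb(char_ids)                           # (1, num_chars, char_dim)
        _, (h_n, _) = self.char_lstm(c)                       # h_n: (2, 1, char_hidden)
        # concatenate the final forward and backward character states
        char_repr = torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)
        return torch.cat([w, char_repr], dim=-1)              # (word_dim + 2*char_hidden,)
```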
To build up to spans, we first run a bidirectional LSTM over the sequence of word representations for an input sentence to obtain context-sensitive forward and backward representations f_i and b_i for each fencepost i. Then we concatenate the corresponding forward and backward span differences.
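A short sketch of one common way to realize these span differences in span-based parsers; the exact indexing convention here is an assumption:

```python
import torch

def span_representation(f, b, i, j):
    """Span (i, j) as concatenated forward/backward fencepost differences.

    f[k] and b[k] are the forward and backward sentence-LSTM outputs at
    fencepost k (indexing convention is an assumption of this sketch).
    """
    forward_diff = f[j] - f[i]    # what the forward LSTM read inside the span
    backward_diff = b[i] - b[j]   # what the backward LSTM read inside the span
    return torch.cat([forward_diff, backward_diff], dim=-1)
```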
We feed the span representation through a one-layer feedforward NN whose output dimensionality equals the number of possible labels. The score of a specific label l is the corresponding component of the output vector (formula in paper).
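A sketch of the label scorer; only the output shape (one score per label) comes from the description above, while the hidden size and ReLU nonlinearity are assumptions:

```python
import torch.nn as nn

class LabelScorer(nn.Module):
    """Maps a span representation to one score per label."""
    def __init__(self, span_dim, num_labels, hidden=250):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(span_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_labels),
        )

    def forward(self, span_repr):
        return self.mlp(span_repr)   # s(i, j, l) = output[l]
```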
We introduce an auxiliary empty label ∅ with s(i, j, ∅) = 0 for all (i, j) to handle spans that are not constituents.
We can independently select the best label for the current span and the best split point, where the score of a split is the sum of the best scores for the corresponding subtrees.
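Written out, this recurrence is (reconstructed from the description above, with s_best(i, j) denoting the best score of any subtree over span (i, j)):

s_best(i, i+1) = max_l s(i, i+1, l)
s_best(i, j) = max_l s(i, j, l) + max_{i<k<j} [ s_best(i, k) + s_best(k, j) ]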
To parse the full sentence, we compute s_best(0, n) using a bottom-up chart decoder, then traverse backpointers to recover the tree achieving that score.
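A minimal pure-Python sketch of such a bottom-up decoder, assuming the label scores (including the empty label, fixed at score 0) have already been computed into a dictionary keyed by span; collapsing empty-labeled spans back into an n-ary tree is omitted:

```python
def cky_decode(scores, n):
    """Bottom-up chart decoder over precomputed span label scores.

    scores[(i, j)] maps each label (including the empty label) to s(i, j, l).
    Returns the best total score and backpointers for tree recovery.
    """
    best = {}   # best[(i, j)] = best score of any subtree over span (i, j)
    back = {}   # back[(i, j)] = (best label, best split point or None)

    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            # best label for this span, chosen independently of the split
            label, label_score = max(scores[(i, j)].items(), key=lambda kv: kv[1])
            if length == 1:
                best[(i, j)] = label_score
                back[(i, j)] = (label, None)
            else:
                # best split point: child scores are independent given the split
                split = max(range(i + 1, j),
                            key=lambda k: best[(i, k)] + best[(k, j)])
                best[(i, j)] = label_score + best[(i, split)] + best[(split, j)]
                back[(i, j)] = (label, split)

    return best[(0, n)], back

def recover_tree(back, i, j):
    """Follow backpointers to rebuild the (binarized) tree over span (i, j)."""
    label, split = back[(i, j)]
    if split is None:
        return (label, i, j)
    return (label, recover_tree(back, i, split), recover_tree(back, split, j))
```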
We use margin-based training; further details are in the paper.
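For reference, the standard structured margin objective used by span-based parsers of this kind is

max(0, max_{T'} [ s(T') + Δ(T', T*) ] − s(T*))

where T* is the gold tree, s(T) sums the scores of the labeled spans in T, and Δ counts incorrect spans; the loss-augmented inner maximization can be carried out with the same chart decoder.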
Output correlations are information about compatibility between outputs in a structured prediction model.
We investigate several common choices for lexical representations of words and their role in neural parsing.
In this section, we analyze where the information in the sentence-level LSTM hidden vectors comes from.