Paper, Tags: #nlp, #architectures, #ner
State-of-the-art NER systems have relied heavily on hand-crafted features and domain-specific knowledge. In this paper, we introduce two new neural architectures:
- One combines a bidirectional LSTM with a CRF layer (BiLSTM-CRF)
- The other constructs and labels segments using a transition-based approach inspired by shift-reduce parsers
The models rely on two sources of information about words:
- Character-based word representations learned from the supervised corpus
- Unsupervised word representations learned from unannotated corpora
Our models obtain state-of-the-art NER performance in four languages (English, Dutch, German, and Spanish) without resorting to gazetteers or other language-specific knowledge, which is costly to develop for new languages and new domains.
Our architectures use no language-specific resources or features beyond a small amount of supervised training data and unlabeled corpora. They are designed to capture two intuitions:
- Names often consist of multiple tokens, and reasoning jointly over tagging decisions for each token is important. We compare two models:
- BiLSTM-CRF
- A new model that constructs and labels chunks of input sentences using an algorithm inspired by transition-based parsing, with states represented by stack LSTMs
- Token-level evidence for 'being a name' includes both orthographic evidence (what does the word being tagged as a name look like?) and distributional evidence (where does the word being tagged tend to occur in a corpus?)
- To capture orthographic sensitivity, we use a character-based word representation model
- To capture distributional sensitivity, we combine these representations with distributional word representations learned from unannotated corpora
Our word representations combine both sources of evidence.
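A minimal sketch of this combination (PyTorch; the dimensions and variable names are illustrative assumptions, not the paper's code): the final representation of a word simply concatenates a pretrained distributional embedding with the character-derived embedding.

```python
# Illustrative sketch: combine distributional and orthographic evidence by
# concatenation. Dimensions and names are assumptions, not the paper's.
import torch

pretrained_word_emb = torch.randn(100)  # looked up from embeddings trained on unannotated text
char_derived_emb = torch.randn(50)      # output of the character BiLSTM (sketched at the end of this section)

word_representation = torch.cat([pretrained_word_emb, char_derived_emb])
print(word_representation.shape)        # torch.Size([150])
```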
Our models capture output label dependencies, either with a simple CRF architecture or with a transition-based algorithm that explicitly constructs and labels chunks of the input.
In NER, the grammar that characterizes interpretable sequences of tags imposes several hard constraints (e.g., I-PER cannot follow B-LOC), which would be impossible to model under an independence assumption.
Instead of modelling tagging decisions independently, we model them jointly with a CRF: rather than applying a softmax to the last layer's output for each token, the CRF scores entire tag sequences, taking neighboring tags into account.
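A small sketch of this joint-scoring idea (PyTorch; the names, shapes, and penalty value are assumptions, not the authors' implementation): the score of a candidate tag sequence adds per-token emission scores from the BiLSTM to learned transition scores between adjacent tags, so an impossible bigram such as I-PER after B-LOC can be given a prohibitively low score.

```python
# Sketch of CRF-style joint scoring over a tag sequence (assumed names/shapes).
import torch

tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
T = len(tags)

emissions = torch.randn(6, T)    # per-token BiLSTM scores for a 6-token sentence (illustrative)
transitions = torch.randn(T, T)  # transitions[i, j] = score of tag j following tag i

# Hard constraint from the tag grammar: I-PER may not follow B-LOC.
transitions[tags.index("B-LOC"), tags.index("I-PER")] = -1e4

def sequence_score(tag_ids):
    """Score one candidate tag sequence jointly (emissions + transitions)."""
    score = emissions[0, tag_ids[0]]
    for t in range(1, len(tag_ids)):
        score = score + transitions[tag_ids[t - 1], tag_ids[t]] + emissions[t, tag_ids[t]]
    return score

print(sequence_score([0, 1, 2, 0, 3, 4]))  # O B-PER I-PER O B-LOC I-LOC
```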
The second model directly constructs representations of multi-token names. It relies on a stack data structure to incrementally build chunks of the input, using Stack-LSTMs in which the LSTM is augmented with a 'stack pointer'.
Whereas sequential LSTMs model sequences from left to right, stack LSTMs embed a stack of objects that can be both added to (push) and removed from (pop). This allows the S-LSTM to work like a stack that maintains a summary embedding of its contents.
We use two stacks, output (for completed chunks) and stack (scratch space), plus a buffer containing the words that have yet to be processed.
The transition inventory contains the following transitions:
- The SHIFT transition moves a word from the buffer to the stack
- The OUT transition moves a word from the buffer directly into the output stack
- The REDUCE(y) transition pops all items from the top of the stack to create a "chunk," labels it with the label y, and pushes a representation of this chunk onto the output stack
The algorithm concludes when the stack and buffer are both empty.
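The toy simulation below runs this transition inventory on plain Python lists, just to make the bookkeeping concrete; in the actual model, the stack, buffer, and output are each summarized by a Stack-LSTM, and a learned classifier chooses the actions. The action sequence follows the paper's "Mark Watney visited Mars" illustration.

```python
# Toy simulation of the SHIFT / OUT / REDUCE(y) transitions on plain lists
# (the model itself embeds these structures with Stack-LSTMs).
def run_transitions(words, actions):
    buffer = list(words)  # words not yet processed (front = next word)
    stack = []            # scratch space for the chunk being built
    output = []           # completed, labeled chunks

    for action in actions:
        if action == "SHIFT":                 # buffer -> stack
            stack.append(buffer.pop(0))
        elif action == "OUT":                 # buffer -> output (word is not part of a name)
            output.append((buffer.pop(0), "O"))
        elif action.startswith("REDUCE("):    # pop the whole stack into one labeled chunk
            label = action[len("REDUCE("):-1]
            output.append((tuple(stack), label))
            stack.clear()
    assert not buffer and not stack           # the algorithm ends with both empty
    return output

print(run_transitions(
    ["Mark", "Watney", "visited", "Mars"],
    ["SHIFT", "SHIFT", "REDUCE(PER)", "OUT", "SHIFT", "REDUCE(LOC)"],
))
# [(('Mark', 'Watney'), 'PER'), ('visited', 'O'), (('Mars',), 'LOC')]
```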
We want representations that are sensitive to the spelling of words, since we deal with many languages. We use a model that constructs representations of words from representations of the characters they are composed of.
Unlike previous approaches, we learn character-level features while training instead of hand-engineering prefix and suffix information about words. We learn representations specific to the task and domain at hand.
A character lookup table, initialized at random, contains an embedding for every character. The character embeddings of a word are fed in direct and reverse order to a forward and a backward LSTM; the character-derived embedding of the word is the concatenation of the final forward and backward representations from this bidirectional LSTM.
RNNs and LSTMs can in principle capture long-range dependencies, but in practice their representations are biased towards their most recent inputs. We therefore expect the final representation of the forward LSTM to be an accurate representation of the word's suffix, and the final state of the backward LSTM to be a better representation of its prefix.
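A minimal PyTorch sketch of this character BiLSTM (hyperparameters, vocabulary, and names are assumptions, not the authors' implementation): the word's character-derived embedding concatenates the final forward state (summarizing the suffix) with the final backward state (summarizing the prefix).

```python
# Character-level BiLSTM word embedding (illustrative sketch, assumed sizes).
import torch
import torch.nn as nn

char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = nn.Embedding(len(char_vocab), 25)  # character lookup table, randomly initialized
char_lstm = nn.LSTM(input_size=25, hidden_size=25, bidirectional=True, batch_first=True)

def char_word_embedding(word):
    ids = torch.tensor([[char_vocab[c] for c in word]])  # shape: (1, len(word))
    _, (h_n, _) = char_lstm(char_emb(ids))               # h_n: (2, 1, 25) = final fwd/bwd states
    return torch.cat([h_n[0, 0], h_n[1, 0]])             # forward (suffix) ++ backward (prefix)

print(char_word_embedding("watney").shape)               # torch.Size([50])
```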