Targeted syntactic evaluation of language models, Marvin and Linzen, 2018

Paper, Tags: #nlp

We present a dataset to evaluate the grammaticality of an LM's predictions. We automatically construct a large number of minimally different pairs of English sentences, each consisting of a grammatical and an ungrammatical sentence. The pairs cover phenomena such as:

  • subject-verb agreement
  • reflexive anaphora (himself)
  • negative polarity items (any, ever, which can only be used in the scope of negation)
    • No students have ever lived here
    • * Most students have ever lived here

We expect an LM to assign higher probability to the grammatical sentence. An LSTM performed poorly on many of the constructions. Multi-task training with a syntactic objective (CCG supertagging) improved accuracy, but a large gap remained between the LM's performance and human performance.

This suggests that there is considerable room for improvement over LSTMs in capturing syntax in an LM. The code is available on GitHub.

Language models

An LM defines a probability distribution over sequences of words. The most popular neural network-based LMs to date are recurrent neural networks (RNNs), in particular LSTMs. LMs are typically evaluated using perplexity: it's desirable for an LM to assign high probability to held-out data from the same corpus as the training data. The quality of an LM's syntactic predictions is, however, difficult to measure with perplexity: since most sentences are grammatically simple and most words can be predicted from their local context, perplexity mostly rewards LMs for collocational and semantic predictions.
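
As a rough sketch of how perplexity is computed (my illustration, not from the paper), assume a hypothetical list of per-token natural-log probabilities assigned by a model to held-out text:

```python
import math

def perplexity(token_log_probs):
    """Perplexity is the exponential of the average negative log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical per-token log probabilities for a held-out sentence.
print(perplexity([-2.3, -0.7, -1.5, -4.0, -0.2]))  # ~5.7
```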

We supplement perplexity with a metric that assesses whether the probability distribution defined by the model conforms to the grammar of the language. Given two sentences that differ minimally from each other, one grammatical and the other not, it's desirable for the model to assign a higher probability to the grammatical one.
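
A minimal sketch of this criterion (my illustration; `sentence_logprob` and the scores are assumptions, not the authors' code):

```python
def prefers_grammatical(sentence_logprob, grammatical, ungrammatical):
    """True if the model assigns higher probability to the grammatical variant of a pair."""
    return sentence_logprob(grammatical) > sentence_logprob(ungrammatical)

# Toy stand-in scorer with made-up numbers; a real LM would sum per-token log probabilities.
toy_scores = {
    "No students have ever lived here": -21.3,
    "Most students have ever lived here": -24.8,
}
print(prefers_grammatical(toy_scores.get,
                          "No students have ever lived here",
                          "Most students have ever lived here"))  # True

# Accuracy on the dataset = fraction of minimal pairs for which this returns True.
```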

We test three LMs on the data set we develop:

  • an n-gram baseline
    • largely performed at chance, suggesting that good performance on the task requires syntactic representations
  • an RNN LM trained on an unannotated corpus
  • an RNN LM trained on a multitask objective: language modeling and combinatory categorial grammar (CCG) supertagging.

The RNNs performed well on simple cases, but struggled on more complex ones. Multi-task training with a supervised syntactic objective improved the performance of the RNN, but it was still much weaker than humans.

Dataset construction and composition

We use templates to automatically construct a large number of English sentence pairs (~350k); a sketch of the template mechanism appears at the end of this section. The dataset covers three common phenomena:

Subject-verb agreement

The relevant subject doesn't have to be the first noun in the sentence, or the most recent noun before the verb.

  • Agreement across a prepositional phrase
    • The farmer near the parents smiles
  • Agreement across a subject relative clause
    • The officers that love the skater smile
  • Agreement in verb phrase coordination: both verbs need to agree with the subject
    • The senator smiles and laughs
  • We also include a coordination condition with a longer dependency (long VP coordination)
    • The manager writes in a journal every day and likes/*like to watch television

Agreement and object relative clauses

Object relative clauses require a hierarchical representation.

  • Agreement across an object relative clause
    • The farmer that the parents love swims/*swim
  • Agreement in an object relative clause
    • The farmer that the parents love/*loves swims

In these sentences, the model needs to distinguish the embedded subject (parents) from the main clause subject (farmer) when making predictions.

Reflexive anaphora

Reflexive pronouns need to agree in number and gender with their antecedent.

  • Simple reflexive
    • the senators embarrassed themselves
  • Reflexive in a sentential complement
    • the bankers thought the pilot embarrassed himself
  • Reflexive across an object relative clause
    • the manager that the architects like doubted himself
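
To illustrate the template mechanism, here is a hedged sketch (my own, not the authors' code) of how agreement-across-a-prepositional-phrase pairs like the ones above could be generated; the mini-lexicon is invented:

```python
from itertools import product

# Invented mini-lexicon; the paper's actual templates and vocabulary are larger.
subjects = {"farmer": "sg", "farmers": "pl"}
distractor_phrases = ["the parents", "the manager"]
verbs = {"sg": "smiles", "pl": "smile"}

pairs = []
for (subject, number), distractor in product(subjects.items(), distractor_phrases):
    other_number = "pl" if number == "sg" else "sg"
    grammatical = f"The {subject} near {distractor} {verbs[number]}"
    ungrammatical = f"The {subject} near {distractor} {verbs[other_number]}"
    pairs.append((grammatical, ungrammatical))

# e.g. ("The farmer near the parents smiles", "The farmer near the parents smile")
```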

Experiments

We trained three LMs with increasing levels of syntactic sophistication. All three were trained on a 90M-word subset of Wikipedia.

  1. N-gram model: doesn't need annotated data; we trained a 5-gram model using the SRILM toolkit
  2. Single-task RNN: doesn't need annotated data either; 2 layers of 650 LSTM units, trained for 40 epochs (sketched below)
  3. Multi-task RNN: trained to optimize an objective function that combines the objectives of two tasks, language modeling and CCG supertagging
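
A hedged PyTorch sketch of what the single-task RNN LM above looks like architecturally (2 layers of 650 LSTM units; the embedding size and other hyperparameters here are assumptions, not details given in this summary):

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Word-level LSTM LM: embed tokens, run a 2-layer LSTM, project to vocabulary logits."""
    def __init__(self, vocab_size, embed_dim=650, hidden_dim=650, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.embed(tokens)                 # (batch, seq_len, embed_dim)
        outputs, hidden = self.lstm(emb, hidden)
        return self.out(outputs), hidden         # logits: (batch, seq_len, vocab_size)

# Training objective: next-word cross-entropy over the 90M-word corpus, e.g.
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))
```

The multi-task variant would presumably add a second output head for CCG supertags and combine the two cross-entropy losses; that is my reading of the setup, not a detail given in this summary.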

Results

  • Local agreement:
    • The n-gram model had poor overall accuracy; RNNs generalize better beyond the specific bigrams occurring in the corpus. RNNs don't rely on the heuristic whereby the first noun of the sentence is likely to be the subject.
  • Non-local agreement:
    • Local collocational information isn't useful to the n-gram model; the RNNs also perform worse than in the local agreement conditions, but multi-task learning was very helpful.
  • Agreement inside an object RC:
    • This is a purely local dependency; performance was worse than in other cases overall, but the RNNs performed better than humans.
  • Reflexive anaphora:
    • RNNs perform worse on simple reflexives than on simple agreement, with no difference between the single-task and multi-task models; human performance did not differ between these constructions.
  • NPIs:
    • The n-gram model performed poorly here as well.

Discussion

An RNN LM performs well on local subject-verb agreement dependencies, significantly outperforming an n-gram baseline, but its accuracy drops on non-local dependencies and is sensitive to the particular lexical items occurring in the sentence. Multi-task learning with a syntactic objective (CCG supertagging) mitigates this drop in performance for some but not all of the tested dependencies.