A survey on text simplification, Sikka and Mago, 2020

Paper, Tags: #nlp, #text-summarization

Current research has shifted towards using DL, with a specific focus on developing solutions to combat the lack of data available for simplification. Many techniques in TS come from machine translation, monolingual text-to-text generation, text summarization and paraphrase generation.

Text summarization aims to reduce the length and content of the input, whereas text simplification does not necessarily shorten the text: it reduces the linguistic complexity of a text while retaining the original information and meaning. Summarization also aims at reducing content, removing material that may be less important or redundant. That is typically not done in simplification, where all the content is usually kept.

1. Text simplification approaches

Most early work involved extractive methods of summarization: extracting the sentences from a document that convey the most meaning. Many resources for this are available online. Lately, research has shifted towards abstractive approaches: the actual generation of text.

Extractive approach

Involves simply selecting the sentences from a paragraph or document that convey the most 'meaning'; this is essentially extractive text summarization. The simplest method scores sentences with TF-IDF, sketched below. More on how to apply this in section 2.1 of the paper.
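As a rough illustration, here is a minimal sketch of TF-IDF sentence extraction, assuming scikit-learn is available; the period-based sentence splitting and the mean-weight scoring heuristic are simplifications for illustration, not the paper's method.

```python
# Minimal TF-IDF extractive sketch: score each sentence by the mean
# TF-IDF weight of its terms and keep the top-k sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_top_sentences(document, k=2):
    # Naive sentence splitting on periods (a real system would use
    # a proper sentence tokenizer).
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # Treat each sentence as its own "document" for TF-IDF weighting.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Mean TF-IDF weight of the non-zero terms in each sentence.
    scores = tfidf.sum(axis=1).A1 / (tfidf.getnnz(axis=1) + 1e-9)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    # Return the top-k sentences in their original document order.
    return [sentences[i] + "." for i in sorted(ranked[:k])]

print(extract_top_sentences(
    "Text simplification reduces linguistic complexity. "
    "It keeps the original meaning. The weather was nice today.", k=2))
```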

Abstractive approach

Involves generating new, novel text that is lexically and/or syntactically simpler than the original, e.g. by simplifying the vocabulary of a text.

Truly abstractive simplification involves sentence splitting, text deletion and addition. The most common approach is seq2seq modeling, using tree-based approaches, RNNs and LSTMs. Abstractive approaches fall into two categories: lexical simplification and novel text generation.

2. Abstractive approach: lexical simplification (LS)

Replaces complex words with simpler alternatives; success depends heavily on the database being referenced and on the criteria for selecting a suitable word or phrase. The most commonly used resource for LS is PPDB, a collection of lexical, phrasal and syntactic pairings, with over 40 variables describing each relationship.

LS pipeline

  • Complex word identification: determine which words to simplify
    • Ways to do this in section 3.1.1 in paper
  • Substitution generation: generate candidates for replacement of complex words
    • Ways to do this in section 3.1.2
  • Substitution selection: choose which candidates fit the context of the sentence, with respect to grammatical construction and meaning. 3.1.3
  • Substitution ranking: rank the selected substitutes and decide which one produces the simplest output in the given context, based on some quantitative measure. 3.1.4 (a toy sketch of the full pipeline follows this list)
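A toy end-to-end sketch of these four stages, using corpus frequency as the complexity signal (a common heuristic in complex word identification). The substitution dictionary and frequency table below are hypothetical stand-ins for a real resource such as PPDB and a large frequency list.

```python
# Toy LS pipeline: identify complex words, generate candidates,
# select simpler ones, and rank by frequency.
FREQ = {"use": 900, "utilize": 12, "big": 800, "colossal": 5,
        "huge": 300, "buy": 700, "purchase": 150}
CANDIDATES = {"utilize": ["use"], "colossal": ["huge", "big"],
              "purchase": ["buy"]}

def simplify(sentence, freq_threshold=200):
    out = []
    for word in sentence.split():
        # 1. Complex word identification: low corpus frequency is a
        #    common (rough) proxy for complexity.
        if FREQ.get(word, 0) >= freq_threshold:
            out.append(word)
            continue
        # 2. Substitution generation: look up candidate paraphrases.
        candidates = CANDIDATES.get(word, [])
        # 3. Substitution selection: keep only candidates that are
        #    simpler (more frequent) than the original; a real system
        #    would also check grammar and context here.
        candidates = [c for c in candidates
                      if FREQ.get(c, 0) > FREQ.get(word, 0)]
        # 4. Substitution ranking: pick the most frequent candidate.
        out.append(max(candidates, key=lambda c: FREQ[c])
                   if candidates else word)
    return " ".join(out)

print(simplify("they purchase colossal machines"))
# -> they buy big machines
```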

Discussion of LS systems

There are several open challenges, such as word sense ambiguity and phrasal substitution.

3. Abstractive approach: novel text generation (NTG)

This technique is heavily data-driven, taking advantage of complex data structures and recently developed neural techniques.

Syntactic simplification

Aims to identify grammatically complex text and rewrite it so that it is easier to comprehend. This can involve splitting long sentences into shorter, more digestible chunks, changing passive voice to active, and resolving ambiguities and anaphora. Syntactic simplification is usually done in 3 phases (a toy rule sketch follows the list):

  1. Analyze text, identify structure, create a parse tree
  2. Transformation phase: modify the parse tree according to a set of rewrite rules, simplifying the text by splitting sentences, rearranging clauses or dropping clauses. Many syntactic simplification systems use hand-written rules
  3. Regeneration phase: further modifications to the text to improve cohesion, relevance and readability.
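As a minimal illustration of phases 2 and 3, here is one hand-written rewrite rule that splits a compound sentence at a coordinating conjunction. For brevity it operates on surface strings; a real system would transform the parse tree produced in phase 1 (e.g. from spaCy or Stanford CoreNLP).

```python
import re

# Hand-written rule: "X, and Y." / "X, but Y." -> "X. Y."
SPLIT_RULE = re.compile(r",\s+(?:and|but)\s+")

def split_compound(sentence):
    clauses = SPLIT_RULE.split(sentence.rstrip("."))
    # Regeneration phase (very simplified): recapitalize each clause
    # and close it with a period so the output reads as sentences.
    return [c.strip().capitalize() + "." for c in clauses]

print(split_compound(
    "The committee approved the plan, but the board rejected it."))
# -> ['The committee approved the plan.', 'The board rejected it.']
```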

Statistical machine translation

SMT is used to translate from the source language of complex English to the target language of simple English. It has also been applied to Brazilian Portuguese, and has been proposed for German, Chinese and Swedish. The performance of seq2seq modeling for SMT relies heavily on the dataset used to train the model.

Latest DL techniques

RNN-based NMT, LSTM encoder-decoders, unsupervised methods and other advances are covered in section 4.3 of the paper.
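For a concrete picture of the encoder-decoder idea, here is a minimal LSTM seq2seq sketch, assuming PyTorch. Dimensions and vocabulary sizes are arbitrary toy values; a real simplification model would add attention, beam search and training on a parallel complex-simple corpus.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encode the complex sentence into a final (h, c) state.
        _, state = self.encoder(self.embed(src))
        # Decode the simple sentence conditioned on that state
        # (teacher forcing: the gold target is fed as input).
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)  # per-step vocabulary logits

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 12))  # toy batch, complex sentences
tgt = torch.randint(0, 1000, (2, 9))   # toy batch, simple sentences
print(model(src, tgt).shape)           # torch.Size([2, 9, 1000])
```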

4. Datasets

Datasets for lexical simplification

  • SemEval 2012
  • LSeval
  • LexMTurk
  • BenchLS
  • NNseval

Datasets for novel text generation

  • Simple English Wikipedia
  • Wikipedia-Simple Wikipedia
  • PWKP
  • SS corpus
  • WikiSmall
  • Turk corpus
  • Newsela

Dataset extraction and compilation techniques are covered in sections 5.3 and 5.4 of the paper.

5. Evaluation metrics

Evaluating novel text generation

  • BLEU
  • SARI
  • Readability indices (one is sketched after this list)
  • SAMSA
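BLEU and SARI compare system output against reference simplifications, while readability indices score the output text directly. As an illustration of the latter, here is a minimal sketch of the Flesch-Kincaid grade level, using a rough vowel-group syllable heuristic rather than a dictionary-based count.

```python
import re

def count_syllables(word):
    # Approximate syllables as runs of vowels (at least one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# The simpler paraphrase should score a lower (easier) grade level.
print(fk_grade("The committee promulgated comprehensive regulations."))
print(fk_grade("The group made new rules."))
```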

6. Discussion and conclusion

Most current approaches to TS are abstractive, involving either LS or NTG. TS is still in its relative infancy, both as a field in itself and as a part of NLP. This results in challenges such as the unavailability of the extensive, high-quality data sources needed for automated TS, especially with deep learning techniques.