A survey on text simplification, Sikka and Mago, 2020

Paper, Tags: #nlp, #text-summarization

Current research has shifted towards using DL, with a specific focus on developing solutions to combat the lack of data available for simplification. Many techniques in TS come from machine translation, monolingual text-to-text generation, text summarization and paraphrase generation.

Text summarization aims to reduce the length and content of the input, whereas text simplification does not necessarily shorten the text: it reduces the linguistic complexity of a text while retaining the original information and meaning. Summarization also aims at reducing content, removing material that may be less important or redundant. That is typically not done in simplification, where all the content is usually kept.

1. Text simplification approaches

Most early work involved extractive methods of summarization: extracting the sentences from a document that convey the most meaning. Many resources for this are available online. Lately, research has shifted towards abstractive approaches: the actual generation of text.

Extractive approach

Involves simply selecting the sentences from a paragraph or document that convey the most 'meaning'; this is essentially extractive text summarization. The simplest method scores sentences with TF-IDF, sketched below. More on how to apply this in section 2.1 of the paper.
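As a rough illustration, here is a minimal sketch of TF-IDF sentence extraction, assuming scikit-learn is available; the period-based sentence splitting and the mean-weight scoring heuristic are simplifications for illustration, not the paper's method.

```python
# Minimal TF-IDF extractive sketch: score each sentence by the mean
# TF-IDF weight of its terms and keep the top-k sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_top_sentences(document, k=2):
    # Naive sentence splitting on periods (a real system would use
    # a proper sentence tokenizer).
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    # Treat each sentence as its own "document" for TF-IDF weighting.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Mean TF-IDF weight of the non-zero terms in each sentence.
    scores = tfidf.sum(axis=1).A1 / (tfidf.getnnz(axis=1) + 1e-9)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    # Return the top-k sentences in their original document order.
    return [sentences[i] + "." for i in sorted(ranked[:k])]

print(extract_top_sentences(
    "Text simplification reduces linguistic complexity. "
    "It keeps the original meaning. The weather was nice today.", k=2))
```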

Abstractive approach

Involves generating new, novel text that is lexically and/or syntactically simpler than the original, e.g. by simplifying the vocabulary of a text.

Truly abstractive simplification involves sentence splitting, text deletion and addition. The most common approach is seq2seq modeling, using tree-based approaches, RNNs and LSTMs. Abstractive approaches fall into two categories: lexical simplification and novel text generation.

2. Abstractive approach: lexical simplification (LS)

Replaces complex words with simpler alternatives; success depends heavily on the database being referenced and on the criteria for selecting a suitable word or phrase. The most commonly used resource for LS is PPDB, a collection of lexical, phrasal and syntactic pairings, with over 40 variables describing each relationship.

LS pipeline

  • Complex word identification: determine which words to simplify
    • Ways to do this in section 3.1.1 in paper
  • Substitution generation: generate candidates for replacement of complex words
    • Ways to do this in section 3.1.2
  • Substitution selection: choose which candidates fit the context of the sentence, with respect to grammatical construction and meaning. 3.1.3
  • Substitution ranking: rank the selected substitutes and decide which one produces the simplest output in the given context, based on some quantitative measure. 3.1.4 (a toy sketch of the full pipeline follows this list)
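A toy end-to-end sketch of these four stages, using corpus frequency as the complexity signal (a common heuristic in complex word identification). The substitution dictionary and frequency table below are hypothetical stand-ins for a real resource such as PPDB and a large frequency list.

```python
# Toy LS pipeline: identify complex words, generate candidates,
# select simpler ones, and rank by frequency.
FREQ = {"use": 900, "utilize": 12, "big": 800, "colossal": 5,
        "huge": 300, "buy": 700, "purchase": 150}
CANDIDATES = {"utilize": ["use"], "colossal": ["huge", "big"],
              "purchase": ["buy"]}

def simplify(sentence, freq_threshold=200):
    out = []
    for word in sentence.split():
        # 1. Complex word identification: low corpus frequency is a
        #    common (rough) proxy for complexity.
        if FREQ.get(word, 0) >= freq_threshold:
            out.append(word)
            continue
        # 2. Substitution generation: look up candidate paraphrases.
        candidates = CANDIDATES.get(word, [])
        # 3. Substitution selection: keep only candidates that are
        #    simpler (more frequent) than the original; a real system
        #    would also check grammar and context here.
        candidates = [c for c in candidates
                      if FREQ.get(c, 0) > FREQ.get(word, 0)]
        # 4. Substitution ranking: pick the most frequent candidate.
        out.append(max(candidates, key=lambda c: FREQ[c])
                   if candidates else word)
    return " ".join(out)

print(simplify("they purchase colossal machines"))
# -> they buy big machines
```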

Discussion of LS systems

There are several open challenges, such as word sense ambiguity and phrasal substitution.

3. Abstractive approach: novel text generation (NTG)

This technique is heavily data-driven, taking advantage of complex data structures and recently developed neural techniques.

Syntactic simplification

Aims to identify grammatically complex text and rewrite it so that it is easier to comprehend. This can involve splitting long sentences into shorter, more digestible chunks, changing passive voice to active, and resolving ambiguities and anaphora. Syntactic simplification is usually done in 3 phases (a toy rule sketch follows the list):

  1. Analyze text, identify structure, create a parse tree
  2. Transformation phase: modify the parse tree according to a set of rewrite rules, simplifying the text by splitting sentences, rearranging clauses or dropping clauses. Many syntactic simplification systems use hand-written rules
  3. Regeneration phase: further modifications to the text to improve cohesion, relevance and readability.
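As a minimal illustration of phases 2 and 3, here is one hand-written rewrite rule that splits a compound sentence at a coordinating conjunction. For brevity it operates on surface strings; a real system would transform the parse tree produced in phase 1 (e.g. from spaCy or Stanford CoreNLP).

```python
import re

# Hand-written rule: "X, and Y." / "X, but Y." -> "X. Y."
SPLIT_RULE = re.compile(r",\s+(?:and|but)\s+")

def split_compound(sentence):
    clauses = SPLIT_RULE.split(sentence.rstrip("."))
    # Regeneration phase (very simplified): recapitalize each clause
    # and close it with a period so the output reads as sentences.
    return [c.strip().capitalize() + "." for c in clauses]

print(split_compound(
    "The committee approved the plan, but the board rejected it."))
# -> ['The committee approved the plan.', 'The board rejected it.']
```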

Statistical machine translation

SMT is used to translate from the source language of complex English to the target language of simple English. It has also been applied to Brazilian Portuguese, and has been proposed for German, Chinese and Swedish. The performance of seq2seq modeling for SMT relies heavily on the dataset used to train the model.

Latest DL techniques

RNN-based NMT, LSTM encoder-decoders, unsupervised methods and other advances are covered in section 4.3 of the paper.
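For a concrete picture of the encoder-decoder idea, here is a minimal LSTM seq2seq sketch, assuming PyTorch. Dimensions and vocabulary sizes are arbitrary toy values; a real simplification model would add attention, beam search and training on a parallel complex-simple corpus.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt):
        # Encode the complex sentence into a final (h, c) state.
        _, state = self.encoder(self.embed(src))
        # Decode the simple sentence conditioned on that state
        # (teacher forcing: the gold target is fed as input).
        dec_out, _ = self.decoder(self.embed(tgt), state)
        return self.out(dec_out)  # per-step vocabulary logits

model = Seq2Seq()
src = torch.randint(0, 1000, (2, 12))  # toy batch, complex sentences
tgt = torch.randint(0, 1000, (2, 9))   # toy batch, simple sentences
print(model(src, tgt).shape)           # torch.Size([2, 9, 1000])
```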

4. Datasets

Datasets for lexical simplification

  • SemEval 2012
  • LSeval
  • LexMTurk
  • BenchLS
  • NNseval

Datasets for novel text generation

  • Simple English Wikipedia
  • Wikipedia-Simple Wikipedia
  • PWKP
  • SS corpus
  • WikiSmall
  • Turk corpus
  • Newsela

Dataset extraction and compilation techniques are covered in sections 5.3 and 5.4 of the paper.

5. Evaluation metrics

Evaluating novel text generation

  • BLEU
  • SARI
  • Readability indices (one is sketched after this list)
  • SAMSA
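BLEU and SARI compare system output against reference simplifications, while readability indices score the output text directly. As an illustration of the latter, here is a minimal sketch of the Flesch-Kincaid grade level, using a rough vowel-group syllable heuristic rather than a dictionary-based count.

```python
import re

def count_syllables(word):
    # Approximate syllables as runs of vowels (at least one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    # FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

# The simpler paraphrase should score a lower (easier) grade level.
print(fk_grade("The committee promulgated comprehensive regulations."))
print(fk_grade("The group made new rules."))
```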

6. Discussion and conclusion

Most current approaches to TS are abstractive, involving either LS or NTG. TS is still in its relative infancy, both as a field in itself and as a part of NLP. This results in challenges such as the unavailability of the extensive, high-quality data sources needed for automated TS, especially with deep learning techniques.