Paper, Tags: #nlp, #text-summarization
Current research has shifted towards deep learning (DL), with a particular focus on developing ways to work around the scarcity of data available for simplification. Many techniques in text simplification (TS) come from machine translation, monolingual text-to-text generation, text summarization and paraphrase generation.
Text summarization aims to reduce the length and content of the input, whereas text simplification does not necessarily shorten the text: it reduces the linguistic complexity of a text while still retaining the original information and meaning. Summarization also removes content that is less important or redundant, which is typically not done in simplification, where all of the content is usually kept.
Most early work involved extractive methods of summarization, selecting the sentences from a document that convey the most meaning. Lately, research has shifted towards abstractive approaches: actual generation of new text.
Extractive summarization involves simply selecting the sentences from a paragraph or document that convey the most 'meaning'; it is a summarization rather than a simplification technique. The simplest method scores sentences with TF-IDF. More on how to apply this in section 2.1 of the paper.
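As an illustration (my sketch, not the paper's method), a minimal TF-IDF extractive summarizer can score each sentence by the sum of its TF-IDF term weights and keep the top-scoring sentences in document order; the scoring is deliberately naive and scikit-learn is an assumed dependency.

```python
# Minimal TF-IDF extractive summarization sketch: score each sentence by the
# sum of its TF-IDF weights, then keep the top-k sentences in document order.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences, k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)  # (n_sentences, n_terms)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()      # one score per sentence
    top = sorted(np.argsort(scores)[::-1][:k])          # top-k, original order
    return [sentences[i] for i in top]

doc = ["Text simplification reduces linguistic complexity.",
       "It keeps the original meaning of the text.",
       "The weather was nice that day."]
print(extractive_summary(doc, k=2))
```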
Abstractive approaches involve generating new and novel text that is lexically and/or syntactically simpler than the original, for example by simplifying the vocabulary of a text.
Truly abstractive simplification involves sentence splitting, text deletion and addition. The most common approach is seq2seq modeling, using tree-based approaches, RNNs and LSTMs. Abstractive approaches fall into lexical simplification (LS) and novel text generation (NTG).
Lexical simplification replaces complex words with simpler alternatives, and depends heavily on the database being referenced and on the criteria for selecting a suitable replacement word or phrase. The most commonly used resource for LS is PPDB, a collection of lexical, phrasal and syntactic paraphrase pairs with over 40 features describing each relationship.
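For concreteness, here is a hedged sketch of reading paraphrase pairs out of a PPDB release; it assumes the plain-text ' ||| '-delimited format of the public releases and the PPDB2.0Score feature, so check the fields of the specific download you use:

```python
# Sketch: load phrase -> (paraphrase, score) pairs from a PPDB release file.
# Assumes " ||| "-separated fields (label, phrase, paraphrase, features, ...)
# and a PPDB2.0Score feature; verify against the release you downloaded.
def load_ppdb_pairs(path, min_score=3.0):
    pairs = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            fields = line.split(" ||| ")
            if len(fields) < 4:
                continue  # skip malformed lines
            _, phrase, paraphrase, features = fields[:4]
            feats = dict(f.split("=", 1) for f in features.split() if "=" in f)
            score = float(feats.get("PPDB2.0Score", 0.0))
            if score >= min_score:
                pairs.setdefault(phrase, []).append((paraphrase, score))
    return pairs
```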
- Complex word identification: determine which words to simplify (ways to do this in section 3.1.1 of the paper)
- Substitution generation: generate candidate replacements for the complex words (section 3.1.2)
- Substitution selection: choose which candidates fit the context of the sentence, with respect to grammatical construction and meaning (section 3.1.3)
- Substitution ranking: rank the selected substitutes and decide which produces the simplest output in the given context, based on some quantitative measure (section 3.1.4; a toy sketch of the full pipeline follows this list)
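To make the four steps concrete, here is a toy end-to-end version (my choices, not the paper's): WordNet supplies the candidates, and word frequency stands in for both the complexity test and the simplicity ranking; the Zipf threshold is illustrative.

```python
# Toy lexical simplification pipeline: identification, generation, selection,
# ranking. Requires: pip install nltk wordfreq; nltk.download("wordnet").
from nltk.corpus import wordnet as wn
from wordfreq import zipf_frequency

def simplify_word(word, zipf_threshold=3.5):
    # 1. Complex word identification: treat rare words (low Zipf) as complex.
    if zipf_frequency(word, "en") >= zipf_threshold:
        return word
    # 2. Substitution generation: WordNet synonyms as candidates.
    candidates = {l.name().replace("_", " ")
                  for s in wn.synsets(word) for l in s.lemmas()} - {word}
    # 3. Substitution selection: keep single-word candidates (a real system
    #    would also check word sense and grammatical fit in context).
    candidates = {c for c in candidates if " " not in c}
    if not candidates:
        return word
    # 4. Substitution ranking: the most frequent candidate is the "simplest".
    return max(candidates, key=lambda c: zipf_frequency(c, "en"))

print(simplify_word("utilize"))  # -> a frequent synonym such as "use"
```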
There are several challenges, including word-sense ambiguity and phrasal (multi-word) substitution. The technique is extensively data-driven, taking advantage of complex data structures and recently developed neural techniques.
Syntactic simplification aims to identify grammatically complex text and rewrite it so that it is easier to comprehend. It can involve splitting long sentences into shorter, more digestible chunks, changing passive voice to active, and resolving ambiguities and anaphora. Syntactic simplification is usually done in three phases:
- Analysis phase: analyze the text, identify its structure and create a parse tree
- Transformation phase: modify the parse tree according to a set of rewrite rules, simplifying the text by splitting sentences, rearranging clauses or dropping clauses. Many syntactic simplification systems use hand-written rules (a minimal one is sketched after this list)
- Regeneration phase: apply further modifications to the text to improve cohesion, relevance and readability
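As one illustrative hand-written rule (my sketch using spaCy's dependency labels, not a rule from the paper): split a sentence at a coordinated main clause by cutting the parse at a verb conjoined to the root. A real transformation phase applies many such rules over the tree.

```python
# Illustrative transformation-phase rule: split "X ... and Y ..." into two
# sentences using the dependency parse. Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def split_coordinated(sentence):
    doc = nlp(sentence)
    root = next(t for t in doc if t.dep_ == "ROOT")
    # Look for a verb conjoined to the main verb ("... and <clause>").
    conj = next((t for t in root.children
                 if t.dep_ == "conj" and t.pos_ == "VERB"), None)
    if conj is None:
        return [sentence]  # no applicable coordination; rule does not fire
    start = conj.left_edge.i  # leftmost token of the second clause
    keep = lambda t: t.dep_ != "cc" and not t.is_punct  # drop "and" and punct
    first = "".join(t.text_with_ws for t in doc[:start] if keep(t)).strip()
    second = "".join(t.text_with_ws for t in doc[start:] if keep(t)).strip()
    return [first + ".", second[0].upper() + second[1:] + "."]

print(split_coordinated(
    "The committee approved the budget and the chairman announced the decision."))
```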
SMT is used to translate from the 'source language' of complex English to the 'target language' of simple English. It has also been performed for Brazilian Portuguese, and has been proposed for German, Chinese and Swedish. The performance of seq2seq modeling for this task relies heavily on the dataset being used to train the model.
RNN-based NMT, LSTM encoder-decoders, unsupervised methods and other advances are covered in section 4.3 of the paper.
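A minimal sketch of the LSTM encoder-decoder idea, treating simplification as monolingual translation (illustrative PyTorch code; vocabulary size and hyperparameters are made up, and real systems add attention, pretrained embeddings and beam search):

```python
# Minimal LSTM encoder-decoder (seq2seq) sketch for sentence simplification.
import torch
import torch.nn as nn

class Seq2SeqSimplifier(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, complex_ids, simple_ids):
        # Encode the complex sentence; the final (h, c) state summarizes it.
        _, state = self.encoder(self.embed(complex_ids))
        # Teacher forcing: condition the decoder on the gold simple sentence.
        dec_out, _ = self.decoder(self.embed(simple_ids), state)
        return self.out(dec_out)  # (batch, tgt_len, vocab_size) logits

model = Seq2SeqSimplifier()
complex_batch = torch.randint(0, 10000, (8, 20))  # toy token-id batches
simple_batch = torch.randint(0, 10000, (8, 15))
print(model(complex_batch, simple_batch).shape)  # torch.Size([8, 15, 10000])
```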
Datasets for evaluating lexical simplification:
- SemEval 2012
- LSeval
- LexMTurk
- BenchLS
- NNSeval
Parallel complex-simple corpora for training and evaluating TS systems:
- Simple English Wikipedia
- Wikipedia-Simple Wikipedia
- PWKP
- SS corpus
- WikiSmall
- TurkCorpus
- Newsela
Techniques for dataset extraction and compilation are covered in sections 5.3 and 5.4 of the paper.
Common metrics for evaluating TS output:
- BLEU
- SARI
- Readability indices (one is sketched after this list)
- SAMSA
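As an example of a readability index (my sketch, not from the paper), the Flesch-Kincaid grade level can be computed from sentence, word and syllable counts; the vowel-group syllable counter below is a crude heuristic, and libraries such as textstat do this properly:

```python
# Flesch-Kincaid grade level with a naive vowel-group syllable counter.
import re

def count_syllables(word):
    # Approximate syllables as runs of consecutive vowels (crude heuristic).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    # Standard FKGL formula: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat. It was happy."))
```

A lower grade level after simplification is the desired direction; BLEU and SARI instead compare system output against reference simplifications.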
Most current approaches to TS are abstractive, involving either LS or NTG. TS is still in relative infancy, both as a field in itself and as a part of NLP. This results in challenges such as the unavailability of the extensive, high-quality data sources needed for automated TS, especially with deep learning techniques.