The paper reviewed here is a survey on data augmentation and was mentioned in a footnote of the "A Survey of Data Augmentation Approaches for NLP" paper. I specifically selected it to begin my literature review, to gain insight into the hierarchy of data augmentation.
What is data augmentation?
Generating new data that is equivalent to the training samples, without needing to collect significantly more data.
Data augmentation = data enhancement (the two terms are used interchangeably below)
Data augmentation was first used in computer vision and speech recognition.
Methods of text data enhancement include:
back translation
random word replacement
non-core word replacement
data enhancement based on text generation models
Why is data augmentation needed?
The number of training samples is limited, and the model faces the risk of overfitting.
High cost of manual annotation
Data augmentation is also used to balance the data. Imbalanced data may lead to low accuracy and low recall.
Model performance depends on data quality, so augmentation can improve the model's effectiveness.
Data augmentation can be divided into 3 categories:
Supervised data enhancement: applying data transformation rules to the available data. The transformation rules are divided into 2 categories:
| Category | Description |
| --- | --- |
| Single-sample data enhancement | Transforming individual samples |
| Multi-sample data enhancement | Transforming multiple samples together. Example methods: 1. SMOTE (synthetic minority over-sampling technique), 2. SamplePairing, 3. mixup (sketched below) |
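For intuition, here is a minimal mixup sketch (assuming inputs are already encoded as fixed-length numeric vectors and labels are one-hot; this is an illustration, not the original paper's exact implementation): each new sample is a convex combination of two training examples, with the mixing coefficient drawn from a Beta distribution.

```python
# Minimal mixup sketch: blends two (feature vector, one-hot label) pairs.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2       # blend the inputs
    y = lam * y1 + (1 - lam) * y2       # blend the one-hot labels
    return x, y

x, y = mixup(np.array([1.0, 0.0]), np.array([1, 0]),
             np.array([0.0, 1.0]), np.array([0, 1]))
print(x, y)
```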
Semi-supervised data enhancement
Unsupervised data enhancement: Unsupervised data enhancement can be separated into two main areas: creating new data and learning how to improve the existing data.
| Unsupervised data enhancement method | Description |
| --- | --- |
| Creating new data | Uses a model to learn the pattern of the data distribution and generates new samples that match the patterns found in the training set. This approach is commonly known as the GAN. |
| Learning a data enhancement policy | Uses a model to learn a data enhancement technique suited to the specific task at hand. Example: AutoAugment. |
Data augmentation in NLP
Adding noise - supervised method
Back-translation - supervised method
Challenges
In NLP, data is discrete, which makes it impossible to transform the input directly with small continuous perturbations (unlike image pixels)
A transformation may change the meaning or the tag (label) of the sentence
Classification of Data augmentation
Unconditional Data augmentation
Back-translation
Translate the original data into other languages and then back to the original language
More than one intermediate language can be used (>= 1)
Both translation models and translation tools can be used for this method.
Translation models:
=> produce text with richer language
=> seq2seq, NMT, transformer
Translation tools:
=> generate text that is more accurate and coherent in terms of meaning.
=> Google Translate, Baidu Translate, Youdao Translate
The quality of data generated through back translation relies on translation accuracy, which may not always be high.
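A minimal back-translation sketch using the Hugging Face transformers library with MarianMT checkpoints (the model names below are real public checkpoints; the choice of French as the intermediate language is arbitrary):

```python
# Back-translation sketch: English -> French -> English.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate(texts):
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")  # en -> fr
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")   # fr -> en

print(back_translate(["The weather was surprisingly pleasant today."]))
```

Chaining more than one intermediate language (e.g. en -> fr -> de -> en) yields more diverse paraphrases at the cost of accumulating more translation error.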
Lexical Substitution
Divided into two categories: 1. random word substitution, 2. non-core word substitution
Random word substitution
Selects random words and performs synonym replacement or deletion
Easy Data Augmentation (EDA)
In 2019, a new method was proposed that consists of 4 operations:
1. Synonym replacement
Randomly select non-stop-words from the sentence and replace them with randomly chosen synonyms.
2. Random insertion
Random insertion picks a random non-stop-word from the sentence, finds a random synonym of that word, and inserts the synonym at a random position in the sentence. Repeat n times.
3. Random swap
Random swap randomly chooses two words in the sentence and exchanges their positions.
4. Random deletion
Random deletion removes each word in the sentence with probability p.
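To make the operations concrete, here is a minimal sketch of two of the four EDA operations (random swap and random deletion) using only the standard library; synonym replacement and random insertion would additionally need a synonym source such as WordNet, omitted here:

```python
# Sketch of EDA's random swap and random deletion operations.
import random

def random_swap(words, n=1):
    words = words[:]  # don't mutate the caller's list
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]  # never return an empty sentence

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```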
There are 3 ways to obtain replacement words in random word substitution:
synonym dictionary
Using character-level replacement in text classification
word vector
Finding adjacent words in the embedding space to replace. Some famous pre-trained word embeddings include Word2Vec, GloVe, FastText, Sent2Vec
language model
Using a language model such as BERT, substitution words are obtained by masking a word and predicting its replacements. This method generates contextually aware text with improved semantic coherence. However, there are risks, such as changing the original text's semantics or its category label. To reduce this risk, words with lower TF-IDF scores can be replaced.
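A minimal sketch of language-model-based substitution using the transformers fill-mask pipeline (bert-base-uncased is a real public checkpoint; the example sentence is illustrative):

```python
# Mask a word and let BERT propose contextually plausible replacements.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The movie was absolutely [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```

Each proposed token can replace the masked word to create an augmented sentence; filtering candidates by prediction score, or restricting substitution to low-TF-IDF words as noted above, helps preserve the original label.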
Conditional Data augmentation: Involves adding "text label" information to the model and using it to generate data.
Deep Generative Model
This method needs a large quantity of high-quality text; however, data augmentation is usually employed precisely because data availability is limited. So the practicability of deep generative models is poor.
Pre-trained Models
Pre-trained models, such as Contextual Augmentation, use label information and abundant unlabeled data for pre-training to improve results.