The paper reviewed here is a survey on data augmentation and was mentioned in a footnote of the "A Survey of Data Augmentation Approaches for NLP" paper. I specifically selected it to begin my literature review, to gain insight into the hierarchy of data augmentation.
What is data augmentation?
Generating new data that is equivalent to the training samples, without needing to collect significantly more data.
Data augmentation = data enhancement (the two terms are used interchangeably below)
Data augmentation was first used in computer vision and speech recognition.
Methods of text data enhancement include:
back translation
random word replacement
non-core word replacement
data enhancement based on text generation models
Why is data augmentation needed?
The number of training samples is limited, and the model faces the risk of overfitting.
High cost of manual annotation
Data augmentation is also used to balance the data. Imbalanced data may lead to low accuracy and low recall.
Model performance depends on data quality, so augmentation can improve the model's effectiveness.
Data augmentation can be divided into 3 categories:
Supervised data enhancement: applying data transformation rules to the available data. The transformation rules are divided into 2 categories:
| Category | Description |
| --- | --- |
| Single-sample data enhancement | Transforming individual samples |
| Multi-sample data enhancement | Transforming multiple samples together. Example methods: 1. SMOTE (synthetic minority over-sampling technique), 2. SamplePairing, 3. mixup (sketched below) |
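For intuition, here is a minimal mixup sketch (assuming inputs are already encoded as fixed-length numeric vectors and labels are one-hot; this is an illustration, not the original paper's exact implementation): each new sample is a convex combination of two training examples, with the mixing coefficient drawn from a Beta distribution.

```python
# Minimal mixup sketch: blends two (feature vector, one-hot label) pairs.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x = lam * x1 + (1 - lam) * x2       # blend the inputs
    y = lam * y1 + (1 - lam) * y2       # blend the one-hot labels
    return x, y

x, y = mixup(np.array([1.0, 0.0]), np.array([1, 0]),
             np.array([0.0, 1.0]), np.array([0, 1]))
print(x, y)
```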
Semi-supervised data enhancement
Unsupervised data enhancement: Unsupervised data enhancement can be separated into two main areas: creating new data and learning how to improve the existing data.
| Unsupervised data enhancement method | Description |
| --- | --- |
| Creating new data | Uses a model to learn the pattern of the data distribution and generates new samples that match the patterns found in the training set. This approach is commonly known as the GAN. |
| Learning a data enhancement policy | Uses a model to learn a data enhancement technique suited to the specific task at hand. Example: AutoAugment. |
Data augmentation in NLP
Adding noise - supervised method
Back-translation - supervised method
Challenges
In NLP, data is discrete, which makes it impossible to transform the input directly with small continuous perturbations (unlike image pixels)
A transformation may change the meaning or the tag (label) of the sentence
Classification of Data augmentation
Unconditional Data augmentation
Back-translation
Translate the original data into other languages and then back to the original language
More than one intermediate language can be used (>= 1)
Both translation models and translation tools can be used for this method.
Translation models:
=> produce text with richer language
=> seq2seq, NMT, transformer
Translation tools:
=> generate text that is more accurate and coherent in terms of meaning.
=> Google Translate, Baidu Translate, Youdao Translate
The quality of data generated through back translation relies on translation accuracy, which may not always be high.
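A minimal back-translation sketch using the Hugging Face transformers library with MarianMT checkpoints (the model names below are real public checkpoints; the choice of French as the intermediate language is arbitrary):

```python
# Back-translation sketch: English -> French -> English.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(g, skip_special_tokens=True) for g in generated]

def back_translate(texts):
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")  # en -> fr
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")   # fr -> en

print(back_translate(["The weather was surprisingly pleasant today."]))
```

Chaining more than one intermediate language (e.g. en -> fr -> de -> en) yields more diverse paraphrases at the cost of accumulating more translation error.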
Lexical Substitution
Divided into two categories: 1. random word substitution, 2. non-core word substitution
Random word substitution
Selects random words and performs synonym replacement or deletion
Easy Data Augmentation (EDA)
In 2019, a new method was proposed that consists of 4 operations:
1. Synonym replacement
Randomly select non-stop-words from the sentence and replace them with randomly chosen synonyms.
2. Random insertion
Random insertion picks a random non-stop-word from the sentence, finds a random synonym of that word, and inserts the synonym at a random position in the sentence. Repeat n times.
3. Random swap
Random swap randomly chooses two words in the sentence and exchanges their positions.
4. Random deletion
Random deletion removes each word in the sentence with probability p.
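To make the operations concrete, here is a minimal sketch of two of the four EDA operations (random swap and random deletion) using only the standard library; synonym replacement and random insertion would additionally need a synonym source such as WordNet, omitted here:

```python
# Sketch of EDA's random swap and random deletion operations.
import random

def random_swap(words, n=1):
    words = words[:]  # don't mutate the caller's list
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]  # never return an empty sentence

sentence = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(random_swap(sentence, n=2)))
print(" ".join(random_deletion(sentence, p=0.2)))
```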
There are 3 ways to obtain replacement words in random word substitution:
synonym dictionary
Using character-level replacement in text classification
word vector
Finding adjacent words in the embedding space to replace. Some famous pre-trained word embeddings include Word2Vec, GloVe, FastText, Sent2Vec
language model
Using a language model such as BERT, substitution words are obtained by masking a word and predicting its replacements. This method generates contextually aware text with improved semantic coherence. However, there are risks, such as changing the original text's semantics or its category label. To reduce this risk, words with lower TF-IDF scores can be replaced.
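A minimal sketch of language-model-based substitution using the transformers fill-mask pipeline (bert-base-uncased is a real public checkpoint; the example sentence is illustrative):

```python
# Mask a word and let BERT propose contextually plausible replacements.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for candidate in unmasker("The movie was absolutely [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```

Each proposed token can replace the masked word to create an augmented sentence; filtering candidates by prediction score, or restricting substitution to low-TF-IDF words as noted above, helps preserve the original label.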
Conditional Data augmentation: Involves adding "text label" information to the model and using it to generate data.
Deep Generative Model
This method needs a large quantity of high-quality text; however, data augmentation is usually employed precisely because data availability is limited. So the practicability of deep generative models is poor.
Pre-trained Models
Pre-trained models, such as Contextual Augmentation, use label information and abundant unlabeled data for pre-training to improve results.