2020 CCNS A Survey of Text Data Augmentation #26

DelaramRajaei opened this issue Jun 19, 2023 · 0 comments
Labels: literature-review (Summary of the paper related to the work)
Title: A Survey of Text Data Augmentation
Year: 2020
Venue: CCNS
Paper's Link: https://ieeexplore.ieee.org/document/9240734

This paper is a survey on data augmentation and was mentioned in a footnote of the paper "A Survey of Data Augmentation Approaches for NLP". I chose it as the starting point of my literature review to get an overview of the hierarchy of data augmentation methods.

What is data augmentation?

  • Generating new data that is equivalent to existing training samples, without having to collect significantly more data.
  • Data augmentation = data enhancement (the paper uses the two terms interchangeably).
  • Data augmentation was first used in computer vision and speech recognition.
  • Methods of text data enhancement include:
    • back translation
    • random word replacement
    • non-core word replacement
    • data enhancement based on text generation models

Why is data augmentation needed?

  • The number of training samples is limited, so the model risks overfitting.
  • Manual annotation is expensive.
  • Data augmentation is also used to balance the data; imbalanced data may lead to low precision and recall.
  • Model performance depends on the quality of the data, so improving the data improves the model.

Data augmentation can be divided into 3 categories:

  • Supervised data enhancement: Using data transformation rules on available data. Transformation rules are divided into 2 categories.

| Category | Description |
| --- | --- |
| Single-sample data enhancement | Transforming individual samples |
| Multi-sample data enhancement | Transforming multiple samples together. Example methods: 1. SMOTE (synthetic minority over-sampling technique), 2. SamplePairing, 3. mixup |
  • Semi-supervised data enhancement

  • Unsupervised data enhancement: Unsupervised data enhancement can be separated into two main areas: creating new data and learning how to improve the existing data.

| Unsupervised method | Description |
| --- | --- |
| Generating new data | A model learns the pattern of the data distribution and generates new samples that match the patterns found in the training dataset. This approach is commonly known as GAN. |
| Learning an augmentation policy | A model learns a data enhancement technique suitable for the specific task at hand. Example: AutoAugment. |
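The mixup method listed under multi-sample enhancement can be sketched in a few lines: two examples are combined by a convex interpolation of both their features and their (one-hot) labels. This is a minimal stdlib-only sketch; the `alpha=0.2` default and plain-list vectors are illustrative assumptions, not the paper's setup.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Interpolation weight drawn from Beta(alpha, alpha),
    # as in the original mixup formulation.
    lam = random.betavariate(alpha, alpha)
    # Convex combination of both the feature vectors and the labels.
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Mixing two one-hot-labelled examples yields a soft label whose entries still sum to 1, which is why mixup is usually paired with a cross-entropy loss that accepts soft targets.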

Data augmentation in NLP

  • Adding noise - supervised method
  • Back-translation - supervised method
  • Challenges
    • In NLP the data is discrete, so the input cannot be perturbed directly the way continuous image pixels can
    • Augmentation may change the meaning or the tag (label) of the sentence
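The "adding noise" method above can be illustrated at the character level. This is only a sketch of one possible noising scheme (swapping adjacent characters to simulate typos), not the specific operation used in the surveyed paper; the label is assumed to stay unchanged, which is what makes the method supervised.

```python
import random

def add_char_noise(sentence, p=0.1):
    """Swap two adjacent characters inside each word with probability p,
    simulating typos. The sentence's label is assumed unchanged."""
    noisy = []
    for word in sentence.split():
        chars = list(word)
        if len(chars) > 1 and random.random() < p:
            i = random.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        noisy.append("".join(chars))
    return " ".join(noisy)
```

Because the perturbation is intra-word, the token count is preserved and the noise is unlikely to flip the sentence's label, which is the main risk noted in the challenges above.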

Classification of Data augmentation

  • Unconditional Data augmentation
    • Back-translation

      • Translate the original data into other languages and then back to the original language
      • One or more intermediate languages can be used (>= 1)
      • Both translation models and translation tools can be used in this method.
      • Translation models:
        => produce text with richer language
        => seq2seq, NMT, Transformer
      • Translation tools:
        => generate text that is more accurate and coherent in terms of meaning
        => Google Translate, Baidu Translate, Youdao Translate
      • The quality of data generated through back translation relies on translation accuracy, which may not always be high.
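The round-trip procedure above can be sketched independently of any particular backend. Here `translate(text, src, tgt)` is a hypothetical placeholder for whichever translation model or tool (seq2seq/Transformer model, Google/Baidu/Youdao API, etc.) is actually used; it is not a real library call.

```python
def back_translate(text, translate, pivots=("de",), src="en"):
    """Round-trip `text` through one or more pivot languages.

    `translate(text, src_lang, tgt_lang)` is a caller-supplied stand-in
    for any MT model or tool. More than one pivot language may be
    chained, matching the ">= 1 intermediate languages" note above.
    """
    current, lang = text, src
    for pivot in pivots:
        current = translate(current, lang, pivot)
        lang = pivot
    # Final hop back to the source language.
    return translate(current, lang, src)
```

With a real MT backend the round trip yields a paraphrase of the input; as noted above, the quality of the augmented text depends directly on the translation accuracy of the chosen backend.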
    • Lexical Substitution

      • Divided into two categories: 1. random word substitution, 2. non-core word substitution
| Method | Description |
| --- | --- |
| Random word substitution | Selects random words and performs synonym replacement or deletion |
| Easy Data Augmentation (EDA) | A method proposed in 2019 that consists of 4 operations |
| 1. Synonym replacement | Randomly select non-stop words from the sentence and replace them with randomly chosen synonyms. |
| 2. Random insertion | Randomly select a non-stop word, pick a random synonym of it, and insert that synonym at a random position in the sentence. Repeat n times. |
| 3. Random swap | Randomly choose two words in the sentence and exchange their positions. |
| 4. Random deletion | Delete each word in the sentence with probability p. |
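The four EDA operations can be sketched as follows. The tiny synonym dictionary and stop-word list below are toy assumptions; the actual EDA paper uses WordNet and a full stop-word list.

```python
import random

# Toy resources -- real EDA uses WordNet synonyms and a full stop-word list.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}
STOP_WORDS = {"the", "a", "is", "over"}

def synonym_replacement(words, n=1):
    """Replace up to n random non-stop words with a random synonym."""
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n=1):
    """Insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w in SYNONYMS]
        if not candidates:
            break
        syn = random.choice(SYNONYMS[random.choice(candidates)])
        out.insert(random.randrange(len(out) + 1), syn)
    return out

def random_swap(words, n=1):
    """Exchange the positions of two randomly chosen words, n times."""
    out = list(words)
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.1):
    """Delete each word independently with probability p; never return empty."""
    out = [w for w in words if random.random() > p]
    return out if out else [random.choice(words)]
```

Each operation takes and returns a token list, so the four can be composed or sampled from to produce several augmented variants per sentence.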
Three ways to obtain replacement words in random word replacement:

| Source | Description |
| --- | --- |
| Synonym dictionary | Look up synonyms in a thesaurus; used with character-level replacement in text classification |
| Word vector | Find adjacent words in the embedding space as replacements. Famous pre-trained word embeddings include Word2Vec, GloVe, FastText, Sent2Vec |
| Language model | Using a language model such as BERT, substitution words are obtained by masking tokens and predicting replacements. This generates context-aware text with improved semantic coherence. However, it risks changing the original text's semantics or category label; to reduce this risk, words with lower TF-IDF scores can be replaced. |
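The TF-IDF criterion mentioned above (replace low-scoring, non-core words to reduce the risk of changing the label) can be sketched with a small stdlib-only implementation. The corpus-of-token-lists representation and the `log(N / (1 + df)) + 1` smoothing are illustrative choices, not the paper's exact formula.

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """TF-IDF score for each token of one document, given a corpus
    represented as a list of token lists."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF; exact smoothing is an assumption here.
        idf = math.log(n_docs / (1 + df)) + 1
        scores[word] = (count / len(doc_tokens)) * idf
    return scores

def replacement_candidates(doc_tokens, corpus, k=2):
    """Return the k lowest-TF-IDF words: the least informative ones,
    hence the safest to mask and replace without flipping the label."""
    scores = tfidf_scores(doc_tokens, corpus)
    return sorted(scores, key=scores.get)[:k]
```

Words that appear in nearly every document (e.g. function words) get low IDF and are selected first, matching the intuition that replacing non-core words is least likely to change the sentence's category.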
  • Conditional Data augmentation: Involves adding "text label" information to the model and using it to generate data.
    • Deep Generative Model
      • This method needs a large quantity of high-quality text; however, data augmentation is typically employed precisely because data availability is limited, so the practicability of deep generative models is poor.
    • Pre-trained Models
      • Pre-trained models, such as Contextual Augment, utilize tag information and abundant unlabeled data for pre-training to improve results.
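The label-conditioning idea behind conditional augmentation can be illustrated by prepending the class label as a control token before feeding text to a generative or masked language model. The `<label>` token format below is an illustrative assumption, not the exact input format of Contextual Augment.

```python
def with_label_token(label, text):
    """Prepend the class label as a control token so a language model
    can condition its generations or mask predictions on it -- the core
    idea of conditional augmentation such as Contextual Augment."""
    return f"<{label}> {text}"
```

For example, masking a word in `with_label_token("positive", "the movie was great")` encourages a label-aware model to propose replacements consistent with the positive class, addressing the label-flipping risk noted for unconditional substitution.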
@DelaramRajaei DelaramRajaei added the literature-review Summary of the paper related to the work label Jun 19, 2023
@DelaramRajaei DelaramRajaei self-assigned this Jun 19, 2023