A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios, Hedderich et al., 2020
Paper, Tags: #nlp
We give an overview of promising approaches for low-resource NLP.
We propose to categorize low-resource settings along the following three dimensions:
- Availability of task-specific labels in the target language, usually created by manual annotation
- Availability of unlabeled language or domain-specific text
- Availability of auxiliary data
Different references set different thresholds for what counts as low-resource. See Section 3.2 of the paper.
Data augmentation: new labeled instances are created by modifying the features of existing instances with transformations that do not change the instance's label. For text, this can be done by replacing words with equivalents from a collection of synonyms, with entities of the same type, or with words that share the same morphology.
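A minimal sketch of such a label-preserving transformation, assuming a toy synonym table and a sentiment task (both illustrative, not from the paper):

```python
import random

# Toy stand-in for a real resource such as a thesaurus or entity gazetteer.
SYNONYMS = {
    "movie": ["film", "picture"],
    "great": ["excellent", "superb"],
    "bad": ["poor", "terrible"],
}

def augment(tokens, p=0.3, rng=random):
    """Return a new token list where some words are swapped for synonyms.

    Because a synonym swap does not change the sentiment of the sentence,
    the original label can be reused for the augmented instance.
    """
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

# One labeled instance yields several augmented copies with the same label.
instance = (["a", "great", "movie"], "positive")
augmented = [(augment(instance[0]), instance[1]) for _ in range(3)]
```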
Data augmentation is ubiquitous in the computer vision community, but it has not found such widespread use in NLP. A reason might be that the presented techniques are often task-specific, requiring either hand-crafted systems or the training of task-dependent language models.
Distant supervision uses unlabeled data and keeps it unmodified. This has been used in NER, where a list of location names can be obtained from a dictionary and matches of tokens in the text with entities in the list are automatically labeled as locations. Distant supervision is popular for information extraction tasks like NER, but it is not so prevalent in other areas of NLP.
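A minimal sketch of this dictionary-based labeling, assuming a toy location gazetteer (the entries are illustrative; in practice the list could come from a knowledge base or geographical dictionary):

```python
# Distant supervision for NER: tokens that match a location gazetteer are
# automatically labeled as locations, everything else gets the O tag.
LOCATION_GAZETTEER = {"berlin", "lagos", "nairobi", "jakarta"}

def distant_labels(tokens):
    """Produce (noisy) NER labels by dictionary matching."""
    return [
        "B-LOC" if tok.lower() in LOCATION_GAZETTEER else "O"
        for tok in tokens
    ]

tokens = ["She", "flew", "from", "Lagos", "to", "Berlin", "."]
labels = distant_labels(tokens)
# -> ['O', 'O', 'O', 'B-LOC', 'O', 'B-LOC', 'O']
# The labels are noisy: unseen names are missed and ambiguous matches are wrong.
```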
Distant supervision methods heavily rely on auxiliary data, which can be difficult to obtain in a low-resource setting.
Cross-lingual annotation projection: a task-specific classifier is trained in a high-resource language. Using parallel corpora, the unlabeled low-resource data is then aligned to its equivalent in the high-resource language, where labels can be obtained from the classifier. These labels are then projected back to the text in the low-resource language based on the alignment between tokens in the parallel texts.
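A minimal sketch of the projection step, assuming the source-side labels (from the high-resource classifier) and the word alignments (from an alignment tool) are already given; both are hypothetical here:

```python
def project_labels(source_labels, alignment, target_len, default="O"):
    """Copy labels from source tokens to their aligned target tokens.

    alignment: list of (source_index, target_index) pairs from a word aligner.
    Unaligned target tokens keep the default label.
    """
    target_labels = [default] * target_len
    for src_i, tgt_i in alignment:
        target_labels[tgt_i] = source_labels[src_i]
    return target_labels

# English side tagged by a high-resource NER model (labels assumed):
source_labels = ["O", "O", "B-LOC", "O"]      # e.g. "We visited Berlin ."
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]  # word alignment pairs
target_labels = project_labels(source_labels, alignment, target_len=4)
```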
Training directly on noisily-labeled data can actually hurt performance. Many recent approaches therefore use a noise handling method to diminish the negative effects of distant supervision:
- Noise filtering: remove instances from the training data that have a high probability of being incorrectly labeled. This decision can be made by a classifier.
- Noise modeling: a confusion matrix estimates the relationship between clean and noisy labels. A noise model is appended which shifts from the noisy labels to the (unseen) clean label distribution (see the sketch after this list).
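A minimal sketch of such a noise-modeling layer, assuming a simple linear base classifier and a learned confusion matrix in PyTorch (the concrete architecture and initialization are assumptions, not taken from a specific paper): the base model predicts the clean label distribution, the confusion matrix maps it to a noisy label distribution for training against the distant labels, and at test time only the base model is used.

```python
import torch
import torch.nn as nn

class NoisyChannelClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.base = nn.Linear(input_dim, num_classes)
        # Confusion matrix initialised close to identity: clean label == noisy label.
        self.noise_logits = nn.Parameter(torch.eye(num_classes) * 5.0)

    def forward(self, x):
        clean_probs = torch.softmax(self.base(x), dim=-1)    # p(clean | x)
        channel = torch.softmax(self.noise_logits, dim=-1)   # row i: p(noisy | clean=i)
        noisy_probs = clean_probs @ channel                  # p(noisy | x)
        return clean_probs, noisy_probs

# Training: cross-entropy between noisy_probs and the distant/noisy labels.
# Inference: use clean_probs, i.e. the base classifier without the noise layer.
model = NoisyChannelClassifier(input_dim=128, num_classes=5)
x = torch.randn(8, 128)
noisy_y = torch.randint(0, 5, (8,))
clean_probs, noisy_probs = model(x)
loss = nn.functional.nll_loss(torch.log(noisy_probs + 1e-8), noisy_y)
```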
Alabi et al. (2020) found that word embeddings trained on massive amounts of unlabeled data from low-resource languages are not competitive with embeddings trained on smaller, but curated, data sources.
While pre-trained language models achieve significant performance increases, it is still questionable whether these methods are suited for real-world low-resource scenarios. For example, all of these models have large hardware requirements, in particular considering that transformer model sizes keep increasing to boost performance.
Low-resource languages can also benefit from labeled resources available in other high-resource languages.
Multilingual BERT and XLM-RoBERTa are trained on unlabeled, monolingual corpora from many different languages and can be used in cross- and multilingual settings.
However, it has recently been questioned whether these models are truly multilingual; evidence was found that language-specific information is stored at least in the upper transformer layers, while more general, language-independent representations sit in the lower layers.
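A minimal sketch of using such a model for zero-shot cross-lingual transfer, assuming the Hugging Face transformers library (the model name, classification task, and example sentence are illustrative, not prescribed by the survey):

```python
# Zero-shot cross-lingual transfer: fine-tune a multilingual encoder on a
# high-resource language, then apply it directly to the low-resource language.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 1) Fine-tune `model` on labeled data in a high-resource language such as
#    English with a standard training loop (omitted here).
# 2) Apply it unchanged to target-language text; the shared multilingual
#    representations are what make this transfer possible.
batch = tokenizer(["Ein Beispielsatz in der Zielsprache."],
                  return_tensors="pt", padding=True, truncation=True)
logits = model(**batch).logits
```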
In this survey, we gave a structured overview of recent work in the field of low-resource natural language processing. We showed that it is essential to analyze resource-lean scenarios across the different dimensions of data availability. This can reveal which techniques are expected to be applicable and helpful in a specific low-resource setting.