Link: ACL Anthology
Main problem
NMT models struggle when trained on low-resource (LR) data.
Proposed method
The authors propose two pipelines that exploit monolingual data: one (P1) for extremely low-resource settings and one (P2) for low-resource settings.
Methods employed in the pipeline (P1) for extremely low-resource settings:
Pre-trained with MLM (Masked Language Modeling), CLM (Causal Language Modeling), and DAE (Denoising Autoencoding)
Fine-tuned with MTL (Multi-task Learning) and CLM
Methods employed in the pipeline (P2) for low-resource settings:
Pre-trained with MLM, CLM, DAE
Fine-tuned with MTL, BT (Back-Translation), CLM, and tDAE (target-side DAE); a rough sketch of these monolingual objectives follows this list
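The sketch below is my own illustration, not the authors' code: `dae_noise` shows the kind of corruption used for DAE-style pre-training on monolingual text, and `back_translate` shows how target-side monolingual sentences become synthetic parallel pairs for BT during fine-tuning. The function names, masking/deletion rates, and the `reverse_model` interface are assumptions for illustration only.

```python
import random

MASK = "<mask>"

def dae_noise(tokens, mask_prob=0.15, delete_prob=0.1):
    """DAE-style corruption of a monolingual sentence: randomly mask and drop
    tokens; the model is trained to reconstruct the clean sentence.
    (Rates are illustrative, not taken from the paper.)"""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < delete_prob:
            continue                      # drop the token
        elif r < delete_prob + mask_prob:
            noised.append(MASK)           # mask the token
        else:
            noised.append(tok)
    return noised

def back_translate(target_monolingual, reverse_model):
    """BT: translate target-side monolingual sentences back into the source
    language with a reverse (tgt->src) model, yielding synthetic (src, tgt)
    pairs that augment the small parallel corpus."""
    synthetic_pairs = []
    for tgt_sent in target_monolingual:
        src_sent = reverse_model.translate(tgt_sent)  # hypothetical model API
        synthetic_pairs.append((src_sent, tgt_sent))
    return synthetic_pairs

# DAE noising applied to one example sentence:
print(dae_noise("the quick brown fox jumps over the lazy dog".split()))
```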
My Summary
In the paper, both pipelines (P1 and P2) improve BLEU over a small Transformer baseline (base-Enc3Dec1) on all the datasets tested.
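Since the results are reported in BLEU, here is a minimal sacreBLEU usage sketch for corpus-level scoring; the hypotheses and references below are made-up placeholders, not data from the paper.

```python
import sacrebleu

# Placeholder system outputs and a single reference stream (one reference per hypothesis).
hypotheses = ["the cat sat on the mat", "he read the book"]
references = [["the cat sat on the mat", "he read a book"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale
```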
Datasets
(1) WMT14 English-German
(2) WMT18 English-Turkish
(3) WMT16 English-Romanian
(4) OPUS-100 (contains EN-FR, EN-RU, EN-AR, and EN-ZH)
Random subsets from (1) of 5k, 10k, and 15k for extreme LR settings and 50k, 100k, 200k, and 500k for LR settings (a small sampling sketch follows this list)
Random subsets from (3) of 10k for extreme LR settings
Random subsets from (4) of 10k and 100k
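A minimal sketch of how such fixed-size random subsets could be drawn from a line-aligned parallel corpus to simulate the LR/extreme-LR conditions; the file names, seed, and helper name are placeholders, not from the paper.

```python
import random

def sample_parallel_subset(src_path, tgt_path, k, seed=0):
    """Draw k random sentence pairs from a line-aligned parallel corpus
    (e.g., k=10_000 for a 10k extreme-LR condition)."""
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        pairs = list(zip(f_src.read().splitlines(), f_tgt.read().splitlines()))
    return random.Random(seed).sample(pairs, k)

# Placeholder file names; the actual corpora come from the WMT/OPUS releases above.
# subset = sample_parallel_subset("train.en", "train.de", k=10_000)
```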
Put the name of the venue (conference) in the header, like {year}-{venue}-{title}. E.g., this one is from EMNLP.
This paper and papers similar to it are only loosely related to our work, since we don't have a low-resource problem in our domain. Still, it helps you get familiar with the neural machine translation literature and its keywords/terminology.
thangk changed the title from "Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation (2024)" to "2024-EMNLP-Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation" on Jun 25, 2024