
2024-EMNLP-Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation #242

Open
thangk opened this issue Jun 24, 2024 · 1 comment
Labels
literature-review (Summary of the paper related to the work)

Comments

@thangk (Collaborator) commented Jun 24, 2024

Link: ACL Anthology

Main problem

NMT models struggle when trained on low-resource (LR) data.

Proposed method

The authors propose two pipelines that exploit monolingual data: one for extremely low-resource settings (P1) and one for low-resource settings (P2). A rough sketch of these monolingual objectives at the data level follows the lists below.

Methods employed in pipeline P1 (extremely low-resource settings):

  • Pre-trained with MLM (Masked Language Modeling), CLM (Causal Language Modeling), and DAE (Denoising Autoencoding)
  • Fine-tuned with MTL (Multi-task Learning) and CLM

Methods employed in pipeline P2 (low-resource settings):

  • Pre-trained with MLM, CLM, and DAE
  • Fine-tuned with MTL, BT (Back Translation), CLM, and tDAE (target-side DAE)
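
As a minimal, framework-free sketch (my own illustration, not the authors' code), this is roughly what these monolingual objectives look like at the data level; the function names and noising choices are assumptions made for illustration only:

```python
# Illustrative only: turning monolingual sentences into training examples for
# MLM, CLM, DAE, and BT. Function names and noise settings are my own
# assumptions, not taken from the paper.
import random

MASK = "<mask>"

def mlm_example(tokens, mask_prob=0.15):
    """MLM: randomly mask tokens; the model predicts the originals."""
    corrupted = [MASK if random.random() < mask_prob else t for t in tokens]
    return corrupted, tokens                      # (model input, target)

def clm_example(tokens):
    """CLM: predict each next token from its left context."""
    return tokens[:-1], tokens[1:]

def dae_example(tokens, drop_prob=0.1):
    """DAE: corrupt the sentence (crude word dropout + shuffle here) and
    train the model to reconstruct the clean original."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    random.shuffle(kept)
    return kept, tokens

def back_translate(tgt_sentences, reverse_model):
    """BT: translate target-side monolingual text with a reverse (tgt->src)
    model to obtain synthetic parallel pairs for fine-tuning."""
    return [(reverse_model(t), t) for t in tgt_sentences]

if __name__ == "__main__":
    sent = "monolingual data can stand in for missing parallel data".split()
    print(mlm_example(sent))
    print(clm_example(sent))
    print(dae_example(sent))
    # a dummy reverse model stands in for a trained tgt->src NMT system
    print(back_translate(["ein Beispielsatz"], lambda s: f"<src for: {s}>"))
```

Per the paper's setup as summarized above, P1 stops at the MLM/CLM/DAE pre-training plus MTL/CLM fine-tuning, while P2 additionally generates synthetic pairs via BT and applies DAE on the target side (tDAE) during fine-tuning.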

My Summary

In the paper's experiments, both pipelines (P1 and P2) improve BLEU over a small baseline Transformer (e.g., base-Enc3Dec1) on all of the datasets tested.

Datasets

(1) WMT14 English-German
(2) WMT18 English-Turkish
(3) WMT16 English-Romanian
(4) OPUS-100 (contains EN-FR, EN-RU, EN-AR, and EN-ZH)
Random subsets from (1) of 5k, 10k, and 15k pairs for extreme-LR settings and 50k, 100k, 200k, and 500k pairs for LR settings
Random subsets from (3) of 10k pairs for extreme-LR settings
Random subsets from (4) of 10k and 100k pairs
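
A minimal sketch (my own, assuming line-aligned source/target files; names are hypothetical) of how such random subsets are typically drawn to simulate the LR conditions:

```python
# Illustration only: draw n aligned sentence pairs at random from a parallel
# corpus to simulate a low-resource condition.
import random

def sample_subset(src_lines, tgt_lines, n, seed=42):
    assert len(src_lines) == len(tgt_lines), "corpus sides must be aligned"
    rng = random.Random(seed)
    idx = rng.sample(range(len(src_lines)), n)
    return [(src_lines[i], tgt_lines[i]) for i in idx]

# e.g. a 5k-pair subset to mimic the extreme-LR WMT14 En-De condition:
# pairs_5k = sample_subset(ende_src, ende_tgt, 5_000)
```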

@thangk added the literature-review label Jun 25, 2024
@hosseinfani (Member)

@thangk
Thanks for the summary. A few notes:

  • Check if the code is online.
  • Put the name of the venue (conference) in the header, like {year}-{venue}-{title}; e.g., this one is from EMNLP.
  • This paper and similar papers are only loosely related to our work, since we don't have a low-resource problem in our domain. Still, it helps you get familiar with the neural machine translation literature and its keywords/terminology.

@thangk changed the title from "Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation (2024)" to "2024-EMNLP-Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation" Jun 25, 2024