
2024-EMNLP-Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation #242

Open
thangk opened this issue Jun 24, 2024 · 1 comment
Labels
literature-review (Summary of the paper related to the work)

Comments

@thangk (Collaborator) commented Jun 24, 2024

Link: ACL Anthology

Main problem

NMT models struggle when trained on low-resource (LR) data.

Proposed method

The authors propose two pipelines that exploit monolingual data: one for extremely low-resource settings (P1) and one for low-resource settings (P2). A rough sketch of these monolingual objectives at the data level follows the lists below.

Methods employed in pipeline P1 (extremely low-resource settings):

  • Pre-trained with MLM (Masked Language Modeling), CLM (Causal Language Modeling), and DAE (Denoising Autoencoding)
  • Fine-tuned with MTL (Multi-task Learning) and CLM

Methods employed in pipeline P2 (low-resource settings):

  • Pre-trained with MLM, CLM, and DAE
  • Fine-tuned with MTL, BT (Back Translation), CLM, and tDAE (target-side DAE)
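
As a minimal, framework-free sketch (my own illustration, not the authors' code), this is roughly what these monolingual objectives look like at the data level; the function names and noising choices are assumptions made for illustration only:

```python
# Illustrative only: turning monolingual sentences into training examples for
# MLM, CLM, DAE, and BT. Function names and noise settings are my own
# assumptions, not taken from the paper.
import random

MASK = "<mask>"

def mlm_example(tokens, mask_prob=0.15):
    """MLM: randomly mask tokens; the model predicts the originals."""
    corrupted = [MASK if random.random() < mask_prob else t for t in tokens]
    return corrupted, tokens                      # (model input, target)

def clm_example(tokens):
    """CLM: predict each next token from its left context."""
    return tokens[:-1], tokens[1:]

def dae_example(tokens, drop_prob=0.1):
    """DAE: corrupt the sentence (crude word dropout + shuffle here) and
    train the model to reconstruct the clean original."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    random.shuffle(kept)
    return kept, tokens

def back_translate(tgt_sentences, reverse_model):
    """BT: translate target-side monolingual text with a reverse (tgt->src)
    model to obtain synthetic parallel pairs for fine-tuning."""
    return [(reverse_model(t), t) for t in tgt_sentences]

if __name__ == "__main__":
    sent = "monolingual data can stand in for missing parallel data".split()
    print(mlm_example(sent))
    print(clm_example(sent))
    print(dae_example(sent))
    # a dummy reverse model stands in for a trained tgt->src NMT system
    print(back_translate(["ein Beispielsatz"], lambda s: f"<src for: {s}>"))
```

Per the paper's setup as summarized above, P1 stops at the MLM/CLM/DAE pre-training plus MTL/CLM fine-tuning, while P2 additionally generates synthetic pairs via BT and applies DAE on the target side (tDAE) during fine-tuning.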

My Summary

In the paper's experiments, both pipelines (P1 and P2) improve BLEU over a small baseline Transformer (e.g., base-Enc3Dec1) on all of the datasets tested.

Datasets

(1) WMT14 English-German
(2) WMT18 English-Turkish
(3) WMT16 English-Romanian
(4) OPUS-100 (contains EN-FR, EN-RU, EN-AR, and EN-ZH)
Random subsets from (1) of 5k, 10k, and 15k pairs for extreme-LR settings and 50k, 100k, 200k, and 500k pairs for LR settings
Random subsets from (3) of 10k pairs for extreme-LR settings
Random subsets from (4) of 10k and 100k pairs
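
A minimal sketch (my own, assuming line-aligned source/target files; names are hypothetical) of how such random subsets are typically drawn to simulate the LR conditions:

```python
# Illustration only: draw n aligned sentence pairs at random from a parallel
# corpus to simulate a low-resource condition.
import random

def sample_subset(src_lines, tgt_lines, n, seed=42):
    assert len(src_lines) == len(tgt_lines), "corpus sides must be aligned"
    rng = random.Random(seed)
    idx = rng.sample(range(len(src_lines)), n)
    return [(src_lines[i], tgt_lines[i]) for i in idx]

# e.g. a 5k-pair subset to mimic the extreme-LR WMT14 En-De condition:
# pairs_5k = sample_subset(ende_src, ende_tgt, 5_000)
```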

@thangk added the literature-review label Jun 25, 2024
@hosseinfani (Member)

@thangk
Thanks for the summary. A few notes:

  • Check if the code is online.
  • Put the name of the venue (conference) in the header, like {year}-{venue}-{title}; e.g., this one is from EMNLP.
  • This paper and similar papers are only loosely related to our work, since we don't have a low-resource problem in our domain. Still, it helps you get familiar with the neural machine translation literature and its keywords/terminology.

@thangk changed the title from "Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation (2024)" to "2024-EMNLP-Rethinking the Exploitation of Monolingual Data for Low-Resource Neural Machine Translation" Jun 25, 2024