Main Problem
The main problem addressed in this work is the generation of synthetic data for back translation in Neural Machine Translation (NMT) and understanding the factors that affect the performance of back translation.
Proposed Method
The authors propose two methods to improve the synthetic data for back translation: Data Manipulation and Gamma Score. In Data Manipulation, they combine synthetic corpora generated by beam search and sampling to balance the trade-off between importance and quality. They tune the combination ratio to optimize the back-translation performance. In Gamma Score, they introduce a score that balances both quality and importance to generate translations. The score is based on an interpolation of importance weight and the probability of the translation given the source sentence. They select the translation with the highest score or sample a translation based on the score distribution.
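The gamma-score idea described above can be sketched as a simple scoring function. This is a minimal illustration, not the paper's implementation: the function names, the dictionary fields, and the interpolation coefficient `gamma` are hypothetical, and the interpolation is done in log space for numerical convenience.

```python
def gamma_score(log_quality, log_importance, gamma=0.5):
    """Interpolate (in log space) between the model probability of the
    translation given the source (quality) and an importance weight.
    `gamma` is a hypothetical interpolation coefficient: gamma=0 scores
    purely by quality, gamma=1 purely by importance."""
    return gamma * log_importance + (1.0 - gamma) * log_quality

def select_best(candidates, gamma=0.5):
    """Pick the candidate translation with the highest gamma score.
    Each candidate is a dict with 'log_quality' and 'log_importance'."""
    return max(
        candidates,
        key=lambda c: gamma_score(c["log_quality"], c["log_importance"], gamma),
    )
```

Instead of taking the argmax, one could also sample a candidate with probability proportional to the (exponentiated) scores, matching the sampling variant the authors describe.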
Input/Output
The input to the proposed methods is a monolingual corpus in the source language and a pretrained NMT model. The output is a synthetic corpus generated through either data manipulation or the gamma score method.
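For the data-manipulation variant, the output corpus is built by mixing the two synthetic corpora in a tunable ratio. The sketch below shows only that mixing step, under assumed inputs: lists of (source, back-translated) pairs from beam search and from sampling; the function name and the `beam_ratio` parameter are illustrative, not from the paper.

```python
import random

def mix_corpora(beam_pairs, sampled_pairs, beam_ratio, target_size, seed=0):
    """Combine back-translated pairs produced by beam search (higher
    quality) with pairs produced by sampling (higher importance/diversity).
    `beam_ratio` controls the fraction of the output drawn from the
    beam-search corpus; it is the combination ratio the authors tune."""
    rng = random.Random(seed)
    n_beam = round(target_size * beam_ratio)
    n_samp = target_size - n_beam
    mixed = rng.sample(beam_pairs, n_beam) + rng.sample(sampled_pairs, n_samp)
    rng.shuffle(mixed)  # avoid ordering effects during training
    return mixed
```

The mixed synthetic corpus would then be concatenated with any available bitext to train the forward translation model.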
Example
In an experiment on the WMT14 DE-EN dataset, the authors compared their proposed methods with baseline back-translation methods. Data Manipulation achieved BLEU scores similar to sampling-based back translation, even without using bitext, and outperformed beam-search back translation. The Gamma Score method achieved significantly better results than both sampling and beam-search back translation. Results were measured with the SacreBLEU and COMET metrics.
Related Works & Their Gaps
The related works discussed include the initial proposal of back translation by Bojar and Tamchyna, its extension to NMT by Sennrich et al., and the exploration of various back-translation generation methods by Imamura et al., Edunov et al., and others. Data augmentation methods for NMT, such as token frequency balancing and SwitchOut, are also mentioned, as are the use of monolingual data in semi-supervised machine translation and improvements to translation quality through back translation. The gaps in these works include the limited exploration of balancing importance and quality in synthetic data, inconsistent improvements across translation tasks for data augmentation methods, and the need for more efficient ways to leverage monolingual data in NMT.