Learning from Wrong Predictions in Low-Resource Neural Machine Translation (LREC-COLING 2024)
Link: ACL Anthology
Main problem
NMT models for low-resource languages produce many wrong predictions, and data augmentation methods that could help correct those predictions are often unavailable for these languages.
Proposed method
The author proposes USKI (Unaligned Sentences Keytokens pre-training), which "leverages the relationships and similarities that exist between unaligned sentences." The method claims to improve predictions by enlarging the dataset to roughly the square of its initial size, thus approaching the dataset sizes of high-resource languages and yielding improved performance.
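To make the "squared dataset" arithmetic concrete, here is a minimal sketch of the pairing idea as I understand it: taking the Cartesian product of the unaligned source and target sides. This is my own illustration, not the paper's USKI code; in particular, the keytoken selection that gives the method its name is not shown.

```python
from itertools import product

# Toy unaligned corpora (placeholder strings, not real data).
src_sentences = [f"src sentence {i}" for i in range(3)]
tgt_sentences = [f"tgt sentence {j}" for j in range(3)]

# Pair every source sentence with every target sentence. With n lines
# on each side this yields n * n pairs, which is where the claimed
# "squaring" of the dataset size comes from.
augmented_pairs = list(product(src_sentences, tgt_sentences))

print(f"{len(src_sentences)} lines per side -> {len(augmented_pairs)} pairs")
# 3 lines per side -> 9 pairs
```

With n sentences per side this produces n² (mostly non-parallel) pairs, which is how a small LRL corpus can nominally approach the scale of a high-resource dataset.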
My Summary
This is an interesting paper. According to the introduction, there are over 7,000 spoken languages, and more than half of them are estimated to go extinct by 2100. My mother tongue is a very low-resource language (LRL) as well. However, I am still skeptical of the claims about squaring the LRL dataset and learning from unaligned sentences in other LRL datasets. The approach assumes that every sentence has some translation counterpart in the other language, even when the pairing is not a direct translation, and I am not sure this assumption holds. That said, the reported performance improvement is not very large, which makes the claim realistically plausible.
Datasets
Selkup-Russian
Evenki-Russian
Griko-Italian
Uzbek-English
Wolof-Ukrainian