Main repository for sharing Quechua-Spanish Speech Translation data as part of the low-resource shared task at IWSLT 2024.
This corpus is a small extraction from the Siminchik corpus (Cardenas et al., 2018), a Quechua corpus created from several radio audio recordings. The recordings have been transcribed and translated into Spanish. The total recording time for the clean speech data is 1 hour and 40 minutes. It can be found in the que_spa_constrained folder, which contains three sub-folders: training, valid, and test. The test folder will be made visible after the submissions have been received.
The raw text transcriptions are located in que_spa_constrained/<split>/txt/<split>.<lang>.
True-cased Spanish target translations are found in que_spa_constrained/<split>/txt/<split>.spa.tc.
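The path layout above can be sketched with a small helper. The folder and split names come from this README; the function itself is purely illustrative and not part of the released data.

```python
# Sketch of how the corpus paths described above fit together.
# Folder, split, and extension names follow this README; the helper
# function is a hypothetical convenience, not part of the release.

def corpus_path(split: str, lang: str, truecased: bool = False) -> str:
    """Build the path to a transcription/translation file for a given split."""
    suffix = f"{lang}.tc" if truecased else lang
    return f"que_spa_constrained/{split}/txt/{split}.{suffix}"

# Raw Quechua transcriptions for the training split:
print(corpus_path("training", "que"))
# -> que_spa_constrained/training/txt/training.que

# True-cased Spanish translations for the validation split:
print(corpus_path("valid", "spa", truecased=True))
# -> que_spa_constrained/valid/txt/valid.spa.tc
```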
True-casing was done with a sacremoses Truecaser model trained on the Spanish side of WMT13 EN-ES.
In addition to the 1 hour and 40 minutes of Quechua audio data aligned with Spanish translations, we also provide participants with a corpus of 48 hours of fully transcribed Quechua audio without translations for the unconstrained task. The audio data and corresponding transcriptions are a larger extract from the Siminchik data set. The hope is that this data can be used directly to assist in the development of speech recognition components for the unconstrained task. The data can be downloaded directly from here: Unconstrained QUE-SPA Additional Audio 1.
Please note: Participants are not required to use this data but are free to use it under the license below.
@article{cardenas2018siminchik,
  title={Siminchik: A speech corpus for preservation of Southern Quechua},
  author={Cardenas, Ronald and Zevallos, Rodolfo and Baquerizo, Reynaldo and Camacho, Luis},
  journal={ISI-NLP 2},
  pages={21},
  year={2018}
}
As part of the constrained task, we allow the use of Machine Translation parallel text from previous work. Participants are likewise not required to use this data.
The data is found in this repository in the folder additional_mt_text. It was extracted from the JW300 and Hinantin websites and used in the work cited below.
Please make sure to cite the work below if you use this data.
@article{ortega2020neural,
  title={Neural machine translation with a polysynthetic low resource language},
  author={Ortega, John E and Castro Mamani, Richard and Cho, Kyunghyun},
  journal={Machine Translation},
  volume={34},
  number={4},
  pages={325--346},
  year={2020},
  publisher={Springer}
}
All audio recordings are property of Siminchikkunarayku and Llamacha.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
Part of this work has been funded by AmericasNLP-2022, John E. Ortega, and Llamacha. Special thanks to Eva Mühlbauer, Maximilian Torres, and Anku Kichka for their support.