# Tatoeba Challenge Data

This is the "zero" sub-set of the Tatoeba data. Download the data files from the link in the table below. There is a total of

* 40 language pairs in this sub-set

| languages | lang-pair | test | dev | train |
|-----------|-----------|-----:|----:|------:|
| Kotava - French | avk-fra | 1244 | | |
| Kotava - Spanish | avk-spa | 274 | | |
| Awadhi - English | awa-eng | 279 | | |
| German - Hunsrik | deu-hrx | 471 | | |
| German - Swabian | deu-swg | 1523 | | |
| German - Toki Pona | deu-toki | 10000 | 13422 | |
| Kadazan Dusun - English | dtp-eng | 1929 | 1000 | |
| Kadazan Dusun - Japanese | dtp-jpn | 251 | | |
| Kadazan Dusun - Malay (macrolanguage) | dtp-msa | 440 | | |
| Emilian - Italian | egl-ita | 202 | | |
| English - Gronings | eng-gos | 1152 | | |
| English - Ho | eng-hoc | 660 | | |
| English - Hunsrik | eng-hrx | 221 | | |
| English - Khasi | eng-kha | 1314 | | |
| English - Tase Naga | eng-nst | 805 | | |
| English - Old Russian | eng-orv | 322 | | |
| English - Ottoman Turkish (1500-1928) | eng-ota | 678 | | |
| English - Prussian | eng-prg | 213 | | |
| English - Toki Pona | eng-toki | 5000 | 7603 | |
| Esperanto - Interlingue | epo-ile | 302 | | |
| Esperanto - Ladino | epo-lad | 210 | | |
| Esperanto - Lingua Franca Nova | epo-lfn | 426 | | |
| Esperanto - Toki Pona | epo-toki | 2738 | 1000 | |
| Esperanto - Volapük | epo-vol | 759 | | |
| Finnish - Kven Finnish | fin-fkv | 297 | | |
| French - Guadeloupean Creole French | fra-gcf | 1164 | | |
| French - Lingua Franca Nova | fra-lfn | 214 | | |
| French - Toki Pona | fra-toki | 511 | | |
| Gronings - Dutch | gos-nld | 1852 | | |
| Interlingua (International Auxiliary Language Association) - Ladino | ina-lad | 215 | | |
| Interlingua (International Auxiliary Language Association) - Lingua Franca Nova | ina-lfn | 435 | | |
| Ladino - Yiddish | lad-yid | 400 | | |
| Lingua Franca Nova - Portuguese | lfn-por | 1945 | | |
| Lingua Franca Nova - Yiddish | lfn-yid | 380 | | |
| Dutch - Toki Pona | nld-toki | 592 | | |
| Old Russian - Ukrainian | orv-ukr | 973 | | |
| Ottoman Turkish (1500-1928) - Turkish | ota-tur | 268 | | |
| Portuguese - Toki Pona | por-toki | 1719 | | |
| Russian - Toki Pona | rus-toki | 868 | | |
| Spanish - Toki Pona | spa-toki | 533 | | |