UBC-NLP/Cheetah

This is the repository accompanying our ACL 2024 paper Cheetah: Natural Language Generation for 517 African Languages. In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster linguistic diversity. We demonstrate the effectiveness of Cheetah through comprehensive evaluations across seven generation downstream tasks. In five of the seven tasks, Cheetah significantly outperforms other models, showcasing its remarkable performance for generating coherent and contextually appropriate text in a wide range of African languages. We additionally conduct a detailed human evaluation to delve deeper into the linguistic capabilities of Cheetah. The introduction of Cheetah has far-reaching benefits for linguistic diversity. By leveraging pretrained models and adapting them to specific languages, our approach facilitates the development of practical NLG applications for African communities. The findings of this study contribute to advancing NLP research in low-resource settings, enabling greater accessibility and inclusion for African languages in a rapidly expanding digital landscape.


1. Our Language Models

1.1 Training Data

Cheetah Training Data: We are guided by three main principles in developing this data: quality, linguistic diversity, and coverage.

Quality. Developing NLP technologies for low resource languages poses a significant challenge due to the limited availability of high-quality training data. To address this issue, we undertook the task of manually curating a diverse corpus spanning multiple domains, including news articles, health documents, religious texts, legal documents, and social media feeds. This manual curation approach was necessary because there were no existing datasets available for the majority of the languages we aimed to support, and we wanted to ensure the utilization of reliable and high-quality data.

Coverage. In all, we train Cheetah using a 42G multi-domain corpus across 517 African languages and language varieties. The languages are spoken in 50 of 54 African countries and they are written with five scripts. This provides support to at least 500M Africans.

Linguistic Diversity. The inclusion of languages from various domains, geographical regions, and linguistic typologies, along with the utilization of reliable data sources, contributes to enhancing the robustness and quality of Cheetah. Our data consists of languages from 14 language families in Africa written in five different orthographies. Furthermore, our data spans languages with a vast array of exotic linguistic features including tone, vowel and consonant harmony, reduplication, word orders, and word classes.

  • Religious Domain. Our religious data is taken from online Bibles, Qurans, and data crawled from the Jehovah’s Witnesses website. We also include religious texts from the Book of Mormon.
  • News Domain. We collect data from online newspapers (Adebara and Abdul-Mageed, 2022) and from news sites such as Voice of America, Voice of Nigeria, BBC, Global Voices, and DW. We also collect local newspapers in 27 languages from across Africa.
  • Government Documents. We collect government documents from the South African Centre for Digital Language Resources (SADiLaR), as well as the Universal Declaration of Human Rights (UDHR) in multiple languages.
  • Health Documents. We collect multiple health documents from the Department of Health, State Government of Victoria, Australia. We collect documents in Amharic, Dinka, Harari, Oromo, Somali, Swahili, and Tigrinya.
  • Existing Corpora. We collect corpora available on the web for different African languages, including Project Gutenberg for Afrikaans; South African News data for Sepedi and Setswana; and OSCAR (Abadji et al., 2021) for Afrikaans, Amharic, Somali, Swahili, Oromo, Malagasy, and Yoruba. We also used Tatoeba for Afrikaans, Amharic, Bemba, Igbo, Kanuri, Kongo, Luganda, Malagasy, Sepedi, Ndebele, Kinyarwanda, Somali, Swahili, Tsonga, Xhosa, Yoruba, and Zulu; Swahili Language Modelling Data for Swahili; Ijdutse corpus for Hausa; Data4Good corpora for Luganda; CC-100 for Amharic, Fulah, Igbo, Yoruba, Hausa, Tswana, Lingala, Luganda, Afrikaans, Somali, Swahili, Swati, North Sotho, Oromo, Wolof, Xhosa, and Zulu; Afriberta-Corpus for Afaan / Oromo, Amharic, Gahuza, Hausa, Igbo, Pidgin, Somali, Swahili, Tigrinya, and Yoruba; and mC4 for Afrikaans, Amharic, Hausa, Igbo, Malagasy, Chichewa, Shona, Somali, Sepedi, Swahili, Xhosa, Yoruba, and Zulu. Further details are available in the paper.

1.2 Model Architecture

We pretrain Cheetah using the encoder-decoder architecture of mT5 (Xue et al., 2021). The encoder and decoder are each similar in size and configuration to T5: for the base model, each has 12 layers with 12 attention heads per layer and 768 hidden units. In total, this results in a model with ~580 million parameters.
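As a rough cross-check of the ~580 million figure, the sketch below counts the weights of a T5-style encoder-decoder. The layer count, head count, and hidden size come from the text above. The vocabulary size (250,112), feed-forward width (2,048), per-head dimension (64), gated feed-forward layers, and untied input/output embeddings are assumptions borrowed from the public mT5-base configuration; this README does not state them for Cheetah.

```python
def t5_param_count(vocab_size, d_model, d_ff, num_layers, num_heads, d_kv,
                   rel_buckets=32, gated_ffn=True, tied_embeddings=False):
    """Approximate weight count for a T5-style encoder-decoder (no bias terms)."""
    attn = 4 * d_model * (num_heads * d_kv)         # q, k, v, o projections
    ffn = (3 if gated_ffn else 2) * d_model * d_ff  # gated FFN has two input projections
    norm = d_model                                  # one scale vector per layer norm
    enc_layer = attn + ffn + 2 * norm               # self-attention + FFN
    dec_layer = 2 * attn + ffn + 3 * norm           # self-attention + cross-attention + FFN
    rel_bias = rel_buckets * num_heads              # relative position bias, one table per stack
    encoder = num_layers * enc_layer + rel_bias + norm
    decoder = num_layers * dec_layer + rel_bias + norm
    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    return embeddings + encoder + decoder


total = t5_param_count(
    vocab_size=250_112,  # assumption: mT5 vocabulary size
    d_model=768,         # from the text: 768 hidden units
    d_ff=2_048,          # assumption: mT5-base feed-forward width
    num_layers=12,       # from the text: 12 layers per stack
    num_heads=12,        # from the text: 12 attention heads
    d_kv=64,             # assumption: mT5-base per-head dimension
)
print(f"{total / 1e6:.0f}M parameters")
```

Under these assumptions, roughly two thirds of the weights sit in the embedding tables rather than in the transformer layers, which is typical for models with very large multilingual vocabularies.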

1.3 Cheetah Model

For pretraining Cheetah, we use a learning rate of 0.01, a batch size of 1,024 sequences, and a maximum sequence length of 1,024. We pretrain each model for 1M steps. We train our models on a Google Cloud TPU with 128 cores (v3-128) from the TensorFlow Research Cloud (TFRC). Cheetah PyTorch and TensorFlow checkpoints are available on the Hugging Face Hub for direct download and are for research use only. For commercial use, please contact the authors by email (muhammad.mageed[at]ubc[dot]ca).

Model Link
🔥Cheetah-base🔥 https://huggingface.co/UBC-NLP/cheetah-base
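To give a feel for the scale of the pretraining run described above, here is a small back-of-the-envelope calculation using only the quoted hyperparameters. It gives an upper bound on the number of token positions processed, since sequences shorter than the maximum length are padded or packed in ways this README does not describe. The variable names are illustrative only.

```python
# Hyperparameters quoted in the pretraining description.
LEARNING_RATE = 0.01
BATCH_SIZE = 1_024         # sequences per step
MAX_SEQ_LEN = 1_024        # tokens per sequence (maximum)
TRAIN_STEPS = 1_000_000    # 1M steps

# Upper bound: every sequence is assumed to be filled to the maximum length.
positions_per_step = BATCH_SIZE * MAX_SEQ_LEN
total_positions = positions_per_step * TRAIN_STEPS

print(f"{positions_per_step:,} token positions per step (at most)")
print(f"{total_positions:,} token positions over the full run (at most)")
```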

2. AfroNLG Benchmark and Evaluation

We create AfroNLG, a multilingual, multi-task benchmark comprising 67 test sets across six task clusters. Specifically, AfroNLG includes the following tasks: code-switching, cloze, machine translation, paraphrase, question answering, summarization, and title generation. AfroNLG supports 517 African languages and language varieties. To the best of our knowledge, this is the most extensive benchmark to date for African languages.

2.1 Machine Translation

Lang-Pairs Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
English $\rightarrow$ Afrikaans Bleu 20.38±0.3 12.35±1.1 7.12±2.67 7.75±1.67 19.72±0.75
English $\rightarrow$ Bemba Bleu 19.19±0.3 12.28±0.48 11.73±12.3 20.5±0.87 18.9±1.22
English $\rightarrow$ Lingala Bleu 15.98±1.16 14.12±0.56 14.32±12.74 13.88±1.04 9.64±1.11
English $\rightarrow$ Rundi Bleu 12.26±0.47 8.82±0.43 9.57±0.42 7.83±1.04 10.54±0.54
English $\rightarrow$ Sesotho Bleu 11.04±1.2 12.74±0.75 10.0±1.79 10.76±1.4 13.3±1.38
English $\rightarrow$ Swahili Bleu 10.59±1.84 9.33±0.58 3.08±0.57 7.24±0.46 11.08±0.61
English $\rightarrow$ Xhosa Bleu 10.04±0.98 8.25±0.7 3.86±1.35 7.5±0.32 12.34±0.51
English $\rightarrow$ Zulu Bleu 17.65±1.86 17.97±1.69 1.9±1.11 13.45±1.81 19.49±1.16
English $\rightarrow$ Hausa Bleu 5.06±0.21 4.96±0.16 0.85±0.04 7.32±0.00 9.22±0.08
English $\rightarrow$ Igbo Bleu 13.05±0.17 11.57±0.23 1.12±0.09 12.34±0.23 16.75±0.26
English $\rightarrow$ Luganda Bleu 2.17±2.77 3.33±0.35 0.09±0.01 4.21±0.77 9.75±0.01
English $\rightarrow$ N. Pidgin Bleu 33.17±0.28 32.65±0.19 2.39±0.23 9.39±0.18 32.64±0.14
English $\rightarrow$ Swahili Bleu 22.04±2.89 23.2±0.23 2.79±0.08 22.39±0.28 28.11±0.14
English $\rightarrow$ Zulu Bleu 6.83±0.29 0.58±1.37 0.4±0.03 4.45±0.37 11.75±0.38
English $\rightarrow$ Twi Bleu 3.4±0.12 1.23±0.03 0.03±0.0 1.68±0.94 4.64±0.13
English $\rightarrow$ Yoruba Bleu 5.42±0.85 2.58±3.1 0.04±0.0 3.63±4.01 7.83±0.14
English $\rightarrow$ Zulu Bleu 10.28±0.49 1.31±2.26 0.14±0.03 3.8±4.2 12.13±0.1
French $\rightarrow$ Bambara Bleu 2.0±2.6 0.37±0.19 0.15±0.01 3.18±0.18 3.06±0.27
French $\rightarrow$ Ghomálá’ Bleu 0.4±0.09 0.33±0.01 0.07±0.0 0.96±0.01 0.28±0.25
French $\rightarrow$ Ewe Bleu 0.7±0.35 0.31±0.36 0.09±0.07 0.84±0.16 3.47±0.03
French $\rightarrow$ Fon Bleu 0.69±0.31 0.8±0.13 1.52±0.06 1.73±0.53 1.29±0.16
French $\rightarrow$ Moore Bleu 0.27±0.06 0.12±0.05 0.19±0.02 0.47±0.04 1.66±0.86
French $\rightarrow$ Wolof Bleu 4.02±0.12 0.3±0.05 0.11±0.01 3.08±0.25 3.01±0.07
English $\rightarrow$ N. Pidgin (UNMT) Bleu 27.44±0.26 23.42±1.61 7.05±1.37 22.54±0.84 26.56±0.04
Acholi $\rightarrow$ English Bleu 16.41±0.08 11.16±4.77 4.9±0.11 8.37±8.12 19.33±0.1
Acholi $\rightarrow$ Lugbara Bleu 2.57±0.21 1.48±1.31 2.44±0.37 8.29±0.14 7.21±0.69
Acholi $\rightarrow$ Luganda Bleu 3.64±0.07 1.74±0.12 0.92±0.01 5.53±0.34 8.03±0.38
Acholi $\rightarrow$ Nyankore Bleu 2.17±0.14 0.79±0.51 0.46±0.03 4.26±0.54 5.1±0.14
Acholi $\rightarrow$ Ateso Bleu 1.64±2.34 1.94±0.25 4.9±0.11 7.74±0.33 6.33±0.6
English $\rightarrow$ Lugbara Bleu 6.19±6.33 8.38±0.49 5.93±0.22 10.95±0.32 11.61±0.28
English $\rightarrow$ Luganda Bleu 12.08±0.03 10.58±0.25 2.59±0.73 12.41±0.35 17.12±0.16
English $\rightarrow$ Nyankore Bleu 6.46±0.08 5.69±0.02 1.4±0.39 7.88±0.18 9.04±0.24
English $\rightarrow$ Ateso (salt) Bleu 10.24±0.06 8.28±0.19 4.91±0.59 11.64±0.49 11.12±0.38
Lugbara $\rightarrow$ Ateso Bleu 2.21±0.35 1.5±0.2 2.22±0.15 6.67±0.32 3.68±0.31
Luganda $\rightarrow$ Lugbara Bleu 3.96±0.57 2.61±0.12 3.44±0.32 8.05±0.23 7.99±0.47
Luganda $\rightarrow$ Ateso Bleu 4.47±0.08 3.01±0.16 2.5±0.22 8.17±0.18 8.13±0.33
Nyankore $\rightarrow$ Lugbara Bleu 3.45±0.29 2.1±0.32 2.6±0.29 7.5±0.09 7.29±0.09
Nyankore $\rightarrow$ Luganda Bleu 8.54±0.17 6.91±0.23 2.01±0.25 6.77±6.73 6.25±10.26
Nyankore $\rightarrow$ Ateso Bleu 3.33±0.11 2.25±0.23 2.12±0.4 6.27±0.12 6.36±0.4
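Each cell in the results tables has the form `score±deviation`, with the model columns in the order mT0, mT5, Afri-MT5, AfriTeVa, Cheetah. For readers who want to work with these numbers programmatically, the following is a minimal sketch that parses a table row and reports the model with the highest mean score. The helper names (`parse_row`, `best_model`) are hypothetical, and the three sample rows are copied from the table above.

```python
import re

# Column order used by every results table in this README.
MODELS = ["mT0", "mT5", "Afri-MT5", "AfriTeVa", "Cheetah"]

# Matches one "mean±deviation" cell, e.g. "9.22±0.08".
CELL = re.compile(r"(\d+(?:\.\d+)?)±(\d+(?:\.\d+)?)")


def parse_row(line):
    """Return {model: (mean, deviation)} for one table row."""
    cells = CELL.findall(line)
    if len(cells) != len(MODELS):
        raise ValueError(f"expected {len(MODELS)} score cells, found {len(cells)}")
    return {m: (float(mean), float(dev)) for m, (mean, dev) in zip(MODELS, cells)}


def best_model(line):
    """Return the model with the highest mean score in a row."""
    scores = parse_row(line)
    return max(scores, key=lambda m: scores[m][0])


rows = [
    r"English $\rightarrow$ Hausa Bleu 5.06±0.21 4.96±0.16 0.85±0.04 7.32±0.00 9.22±0.08",
    r"French $\rightarrow$ Wolof Bleu 4.02±0.12 0.3±0.05 0.11±0.01 3.08±0.25 3.01±0.07",
    r"Acholi $\rightarrow$ Lugbara Bleu 2.57±0.21 1.48±1.31 2.44±0.37 8.29±0.14 7.21±0.69",
]
for row in rows:
    print(best_model(row))
```

Comparing means alone ignores the reported deviations, so rows where the top two models are within one deviation of each other are better treated as ties.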

2.2 Paraphrase

Langs Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
Multilingual Bleu 41.79±0.28 41.75±0.21 34.72±0.51 43.02±1.25 43.23±0.09
Berber Bleu 44.84±0.31 44.03±0.24 36.08±0.83 46.41±0.71 46.0±0.27
Kabyle Bleu 25.91±0.13 25.32±0.46 11.56±0.73 16.06±14.79 26.27±0.56

2.3 Question Answering

Langs Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
QA Swahili F1 79.84±0.19 72.04±0.54 0 62.64±0.78 71.98±1.18
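This README does not say how the F1 score above is computed. A common choice for question answering is token-overlap F1 between the predicted and reference answers, and the sketch below illustrates that definition. It is an assumption about the metric, uses plain whitespace tokenization, and is not the evaluation script used for the reported numbers.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two whitespace-tokenized answers."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("a b c", "a b d"))
```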

2.4 Summarization

Langs Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
Multilingual RougeL 22.31±0.12 22.23±0.04 5.34±0.48 18.97±0.06 24.86±0.02
Igbo RougeL 18.9±0.73 13.22±0.46 14.24±0.39 16.05±0.49 17.36±0.43
Oromo RougeL 11.28±0.03 10.51±0.07 3.52±0.49 7±1.73 14.53±0.1
Rundi RougeL 19.63±0.01 18.02±0.13 11.82±0.39 16.13±0.03 22.57±0.04
Swahili RougeL 26.38±0.02 24.81±0.11 15.07±0.17 21.59±0.13 29.05±0.13
Yoruba RougeL 21.57±0.05 20.06±0.12 13.52±0.18 17.3±0.11 22.49±0.0
Hausa RougeL 26.46±0.06 25.76±0.02 19.96±0.26 25.19±0.11 30.07±0.31
Nigerian Pidgin RougeL 26.54±0.05 25.79±0.1 14.28±1.23 20.29±0.12 27.08±0.02
Somali RougeL 20.69±0.08 19.21±0.06 13.62±0.81 19.27±0.18 23.92±0.04
Tigrinya RougeL 15.84±0.13 13.93±0.11 6.53±0.42 10.07±0.09 16.88±0.12
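RougeL scores a summary by the longest common subsequence (LCS) it shares with the reference. The following is a simplified sketch of that idea using whitespace tokens and an F-measure with equal weight on precision and recall. It is for illustration only: standard ROUGE implementations apply their own tokenization and normalization, and this README does not state which implementation produced the numbers above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 over whitespace tokens."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat is on the mat"))
```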

2.5 Title Generation

Langs Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
Multilingual Bleu 6.53±0.02 6.65±0.08 0.1±0.02 5.2±0.02 7.52±0.07
Amharic Bleu 3.13±0.23 2.65±0.68 0.34±0.14 2.31±0.14 4.34±0.34
Igbo Bleu 6.95±0.13 6.9±0.22 0.77±0.12 4.61±0.14 8.47±0.07
Oromo Bleu 1.1±1.84 2.66±0.19 0.21±0.06 1.54±0.17 3.26±0.21
Rundi Bleu 4.4±0.28 4.13±0.22 0.84±0.07 3.33±0.23 6.05±0.5
Swahili Bleu 9.1±0.23 9.31±0.11 1.22±0.09 7.01±0.09 10.59±0.6
Yoruba Bleu 6.8±0.16 7.23±0.59 0.34±0.05 5.04±2.0 7.97±0.32
Hausa Bleu 8.11±0.24 7.3±0.34 2.59±0.01 6.69±0.18 8.48±0.23
Nigerian Pidgin Bleu 6.75±0.6 3.96±4.3 0.89±0.02 4.72±0.84 6.22±0.28
Somali Bleu 3.37±0.21 3.31±0.16 0.38±0.11 2.82±0.47 5.25±0.14
Tigrinya Bleu 2.99±0.1 2.94±1.09 0.7±0.18 1.92±0.26 5.1±0.05

2.6 Cloze

Task Metric mT0 mT5 Afri-MT5 AfriTeVa Cheetah
Mask-one - 517 Languages Bleu 13.61±0.91 8.18±3.94 0.00±0.00 8.36±3.42 13.98±0.32
Mask-at-least-one - 517 Languages Bleu 2.36±0.11 2.66±0.09 0.93±0.12 0.68±0.09 7.07±0.09
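The two cloze settings differ in how many tokens are hidden from the model. As an illustration of the input format, the sketch below builds a T5-style (input, target) pair by replacing chosen word positions with sentinel tokens such as `<extra_id_0>`. The function name and the word-level masking are assumptions made for this example; this README does not describe how the AfroNLG cloze test sets were actually constructed.

```python
def mask_tokens(sentence, positions):
    """Replace the words at `positions` with T5 sentinel tokens.

    Adjacent masked words share one sentinel, following the usual
    T5 span-corruption convention. Returns (model_input, target).
    """
    masked = set(positions)
    inputs, targets = [], []
    sentinel_id = 0
    previous_masked = False
    for i, token in enumerate(sentence.split()):
        if i in masked:
            if not previous_masked:
                tag = f"<extra_id_{sentinel_id}>"
                inputs.append(tag)   # placeholder in the input
                targets.append(tag)  # marker before the hidden span in the target
                sentinel_id += 1
            targets.append(token)
            previous_masked = True
        else:
            inputs.append(token)
            previous_masked = False
    return " ".join(inputs), " ".join(targets)


# Mask-one: a single hidden word.
print(mask_tokens("a b c d", [1]))
# Mask-at-least-one: several hidden words, possibly non-adjacent.
print(mask_tokens("a b c d", [0, 2]))
```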

3. How to use Cheetah model

Below is an example of using Cheetah to predict masked tokens.

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = T5Tokenizer.from_pretrained("UBC-NLP/cheetah-base")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/cheetah-base")

# Yoruba prompt; <extra_id_0> marks the masked span the model should fill in.
yor_prompt = "ìròyìn kan nípa owó ìjọba <extra_id_0> kan"

# Encode the prompt and generate the missing span.
input_ids = tokenizer(yor_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print("Tokenized input:", tokenizer.tokenize(yor_prompt))
print("Decoded output:", tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

Tokenized input: ['▁ìròyìn', '▁kan', '▁nípa', '▁owó', '▁ìjọba', '<extra_id_0>', '▁kan']
Decoded output:  ìpínlẹ̀
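The decoded output above contains only the predicted span. To rebuild the full sentence, one option is to decode with `skip_special_tokens=False` and substitute each predicted span back into the prompt. The helper below is a minimal sketch of that step in plain Python. The function name `fill_masks` is hypothetical, and the raw output string in the example is an assumed T5-style format (`<pad>`, sentinels, `</s>`), not a string copied from an actual run.

```python
import re

# Matches T5 sentinel tokens and captures their index.
SENTINEL = re.compile(r"<extra_id_(\d+)>")


def fill_masks(prompt, generated):
    """Insert the spans predicted in `generated` into the sentinels of `prompt`."""
    # Drop padding and end-of-sequence markers before splitting on sentinels.
    generated = generated.replace("<pad>", "").replace("</s>", "")
    # With one capture group, split() yields [prefix, id0, span0, id1, span1, ...].
    parts = SENTINEL.split(generated)
    spans = {}
    for i in range(1, len(parts) - 1, 2):
        spans.setdefault(int(parts[i]), parts[i + 1].strip())
    # Sentinels without a predicted span are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), prompt)


prompt = "ìròyìn kan nípa owó ìjọba <extra_id_0> kan"
raw_output = "<pad> <extra_id_0> ìpínlẹ̀ <extra_id_1></s>"  # assumed raw format
print(fill_masks(prompt, raw_output))
```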

4. Ethics

Cheetah aligns with Afrocentric NLP, where the needs of African people are taken into consideration when developing technology. We believe Cheetah will be useful not only to speakers of the supported languages, but also to researchers of African languages such as anthropologists and linguists. We discuss below some use cases for Cheetah and offer a number of broad impacts.

  • Cheetah aims to address the lack of access to technology in about 90% of the world's languages, which automatically discriminates against native speakers of those languages. More precisely, it does so by focusing on Africa. To the best of our knowledge, Cheetah is the first massively multilingual PLM developed for African languages and language varieties. A model with knowledge of 517 African languages is by far the largest to date for African NLP.
  • Cheetah enables improved access of important information to the African community in Indigenous African languages. This is especially beneficial for people who may not be fluent in other languages. This will potentially connect more people globally.
  • Cheetah affords opportunities for language preservation for many African languages. To the best of our knowledge, Cheetah consists of languages that have not been used for any NLP task until now. We believe that it can help encourage continued use of these languages in several domains, as well as trigger future development of language technologies for many of these languages.
  • Although LMs are useful for a wide range of applications, they can also be misused. Cheetah is developed using publicly available datasets that may carry biases. Although we strive to perform analyses and diagnostic case studies to probe the performance of our models, our investigations are by no means comprehensive, nor do they guarantee the absence of bias in the data. In particular, we do not have access to native speakers of most of the languages covered. This hinders our ability to investigate samples from each (or at least the majority) of the languages.

Supported languages

Please refer to supported-languages

Citation

If you use the pre-trained model (Cheetah) for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows (to be updated):

@inproceedings{adebara-etal-2024-cheetah,
    title = "Cheetah: Natural Language Generation for 517 {A}frican Languages",
    author = "Adebara, Ife  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.691",
    pages = "12798--12823",
}

Acknowledgments

We gratefully acknowledge support from Canada Research Chairs (CRC), the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2018-04267), the Social Sciences and Humanities Research Council of Canada (SSHRC; 435-2018-0576; 895-2020-1004; 895-2021-1008), Canadian Foundation for Innovation (CFI; 37771), Digital Research Alliance of Canada, UBC ARC-Sockeye, Advanced Micro Devices, Inc. (AMD), and Google. Any opinions, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of CRC, NSERC, SSHRC, CFI, the Alliance, AMD, Google, or UBC ARC-Sockeye.
