As part of Mission Bhashini, we are working on two major Machine Translation projects: ISHAAN and VIDYAAPATI. Both projects build high-quality parallel corpora and bi-directional Neural Machine Translation (NMT) models for translation between English, Hindi, and regional Indian languages.
- ISHAAN develops models for translation between English, Hindi, and North-East Indian languages.
- VIDYAAPATI addresses translation needs for Hindi and several Indo-Aryan languages.

The projects provide two kinds of resources:
- Datasets: The parallel corpora generated in these projects cover multiple languages and domains and can be used to train and fine-tune Machine Translation (MT) models.
- Pre-trained Machine Translation Models: The models handle a range of translation tasks and can be used directly or fine-tuned for specific applications.

The ISHAAN project covers the following language pairs:

| Language Pair |
| --- |
| English-Assamese |
| English-Bodo |
| English-Manipuri |
| English-Nepali |
| Hindi-Manipuri |
| Assamese-Bodo |

The VIDYAAPATI project covers the following language pairs:

| Language Pair |
| --- |
| Hindi-Bengali |
| Hindi-Konkani |
| Hindi-Maithili |
| Hindi-Marathi |

The following prestigious institutes are collaborating on these projects:
- IIT Bombay
- Gauhati University
- IIIT Manipur
- NIT Silchar
- North Bengal University
- IIT Patna
- C-DAC Pune
- Goa University
- Jadavpur University
- Jawaharlal Nehru University (JNU)
- Indian Statistical Institute (ISI), Kolkata
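
Install the required Python packages with pip (`codecs` is part of the Python standard library and does not need to be installed separately):

    pip3 install ctranslate2 mosestokenizer indic-nlp-library subword-nmt
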
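
To confirm that the installation worked before running any model, a small check like the one below can help. It is only a sketch: the function name is hypothetical, and the mapping assumes the usual import names for these packages (for example, `indic-nlp-library` is imported as `indicnlp` and `subword-nmt` as `subword_nmt`).

```python
import importlib.util
from typing import Dict, List

# pip distribution name -> importable module name (they differ for some packages).
# The import names are assumptions based on common usage of these packages.
REQUIRED_MODULES: Dict[str, str] = {
    "ctranslate2": "ctranslate2",
    "mosestokenizer": "mosestokenizer",
    "indic-nlp-library": "indicnlp",
    "subword-nmt": "subword_nmt",
}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED_MODULES)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name reported as missing can be passed to `pip3 install` again.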
We provide two example inference scripts that use different tokenization schemes.
- Hindi to Konkani: This script uses a BPE (byte-pair encoding) tokenizer.
- English to Manipuri: This script uses a SentencePiece (spm) tokenizer.
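
As a rough illustration of how the two tokenization schemes differ at the output stage, the sketch below undoes each kind of segmentation in plain Python and shows where a CTranslate2 model would sit in between. It is not taken from the repository's scripts: the function names and the `model_dir` argument are hypothetical, and it assumes the default conventions of subword-nmt (a trailing `@@` on non-final subword pieces) and SentencePiece (a leading `▁` on word-initial pieces).

```python
from typing import List

BPE_SEPARATOR = "@@"        # subword-nmt default continuation marker
SPM_WORD_START = "\u2581"   # SentencePiece word-boundary marker "▁"


def merge_bpe(tokens: List[str]) -> str:
    """Join subword-nmt style pieces, e.g. ["trans@@", "late"] -> "translate"."""
    text = " ".join(tokens)
    # A piece ending in "@@" continues into the next piece.
    text = text.replace(BPE_SEPARATOR + " ", "")
    # Drop a dangling marker if the final piece was a continuation.
    if text.endswith(BPE_SEPARATOR):
        text = text[: -len(BPE_SEPARATOR)]
    return text


def merge_spm(tokens: List[str]) -> str:
    """Join SentencePiece style pieces, e.g. ["▁trans", "late"] -> "translate"."""
    return "".join(tokens).replace(SPM_WORD_START, " ").strip()


def translate_tokens(model_dir: str, tokens: List[str]) -> List[str]:
    """Translate one pre-tokenized sentence with a CTranslate2 model.

    model_dir is a hypothetical path to a converted model directory.
    """
    import ctranslate2  # imported lazily so the helpers above work without it

    translator = ctranslate2.Translator(model_dir, device="cpu")
    results = translator.translate_batch([tokens])
    # Each result holds one or more hypotheses; take the best one.
    return results[0].hypotheses[0]
```

With a model in place, the Hindi to Konkani flow would be roughly `merge_bpe(translate_tokens(model_dir, bpe_pieces))`, while English to Manipuri would use `merge_spm` on the output instead. The provided scripts remain the reference for the exact preprocessing each model expects.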
The resources in this repository are released under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). This license allows for sharing and adaptation of the material, provided appropriate credit is given.
We welcome contributions to this repository. Feel free to raise issues or submit pull requests.
For any inquiries, reach out to our team at [email protected].