🔖 The Indic NLP Catalog

A Collaborative Catalog of Resources for Indic Language NLP

The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
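As a minimal sketch of the entry format above, the helper below renders a resource as a markdown link followed by an optional description. The function name `format_entry` and the colon separator are assumptions for illustration (the separator mirrors how existing entries read); they are not part of any catalog tooling.

```python
def format_entry(name, url, description=""):
    """Render a proposed catalog entry as a markdown link plus description.

    Hypothetical helper: the catalog itself only asks for the
    [name](url) link form followed by a short description.
    """
    entry = f"[{name}]({url})"
    if description.strip():
        # Assumed layout: link, colon, then the description text.
        entry += f": {description.strip()}"
    return entry


if __name__ == "__main__":
    print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))
```

The printed line can be pasted directly into a pull request or issue.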

👍 Featured Resources

Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future.

Universal Language Contribution API (ULCA): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the Bhashini mission. You can upload and discover models, datasets and benchmarks here. This is one repository we really need, and we hope to see it evolve into a standard, large-scale platform for resource discovery and dissemination.
We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs), etc. These are being built using either large-scale mining of web resources or large human annotation efforts or both.
As we aim higher, the datasets and models are achieving higher language coverage. Earlier datasets were available for only a handful of Indian languages, then for 10-12 languages. We are now reaching the next frontier, creating resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), IndoWordNet (wordnet/18 languages) that span almost all languages listed in the Indian constitution and more.
Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, etc.
From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, BUET CSE NLP, KMI, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages.

Browse the entire catalog...

🙋 Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

Major Indic Language NLP Repositories
Libraries and Tools
Evaluation Benchmarks
Standards
- Unicode Standard
Text Corpora
Models
Speech Corpora
OCR Corpora
Multimodal Corpora
Language Specific Catalogs

Major Indic Language NLP Repositories

Libraries and Tools

Indic NLP Library: Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc.
- Devanagari to Roman transliteration using hand-crafted rules and lexicons.
pyiwn: Python Interface to IndoWordNet
Indic-OCR : OCR for Indic Scripts
CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
Smart Sanskrit Annotator: Annotation tool for Sanskrit paper
BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, NER support
CodeSwitch: Language identification, POS Tagging, NER, sentiment analysis support for code-mixed data including Hindi and Nepali language
IndIE: An Open Information Extraction tool (triple extractor) in Hindi. It is conjectured to work for Tamil, Telugu, and Urdu as well.
Hindi-BenchIE: A triple evaluation tool for 112 Hindi sentences.

Evaluation Benchmarks

Benchmarks spanning multiple tasks.

AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation.
GLUECoS: Hindi-English code-mixed benchmark containing the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.

Standards

Unicode Standard for Indic Scripts
- An Introduction to Indic Scripts
- Unicode Standard for South Asian Scripts
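
To illustrate why the Unicode layout matters for Indic NLP: the blocks for the major Brahmi-derived scripts are 128 code points wide and are laid out largely in parallel (a design inherited from ISCII), so a character's script can be found from its code point range, and a rough script conversion can be done by shifting code points between blocks. The sketch below is an assumption-laden illustration using only the standard library; the names `SCRIPT_BLOCK_START`, `script_of` and `shift_script` are hypothetical, and the offset trick ignores script-specific gaps (for example, Tamil has no letters at many positions) as well as shared punctuation such as the danda.

```python
# Start of the 128-code-point Unicode block for nine major Indic scripts.
SCRIPT_BLOCK_START = {
    "Devanagari": 0x0900,
    "Bengali": 0x0980,
    "Gurmukhi": 0x0A00,
    "Gujarati": 0x0A80,
    "Oriya": 0x0B00,
    "Tamil": 0x0B80,
    "Telugu": 0x0C00,
    "Kannada": 0x0C80,
    "Malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def script_of(ch):
    """Return the script name for a single character, or None if not listed."""
    cp = ord(ch)
    for name, start in SCRIPT_BLOCK_START.items():
        if start <= cp < start + BLOCK_SIZE:
            return name
    return None


def shift_script(text, src, tgt):
    """Naive script conversion: move each src-block character to the same
    offset in the tgt block. Characters outside the src block are kept."""
    src_start = SCRIPT_BLOCK_START[src]
    delta = SCRIPT_BLOCK_START[tgt] - src_start
    out = []
    for ch in text:
        cp = ord(ch)
        if src_start <= cp < src_start + BLOCK_SIZE:
            out.append(chr(cp + delta))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(script_of("\u0915"))                               # Devanagari letter KA
    print(shift_script("\u0915", "Devanagari", "Bengali"))   # Bengali letter KA
```

Libraries listed under Libraries and Tools handle the exceptions this sketch skips, so the offset approach is best treated as a way to understand the encoding rather than a production converter.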

Text Corpora

Monolingual Corpus

AI4Bharat IndicCorp: Text corpora for Indian languages
- v1: contains 8.9 billion tokens from 12 Indian languages (including Indian English). [paper]
- v2: contains 20 billion tokens from 22 Indian languages (including Indian English). [paper]
Wikipedia Dumps
Common Crawl
- OSCAR Corpus: Released in 2019, large-scale processed CommonCrawl.
- WMT Common Crawl Dumps: Crawls between 2012 and 2016. Noisy text, needs to be filtered.
- CC-100 Corpus: Facebook CommonCrawl extracted data. They provide scripts for processing CommonCrawl. StatMT has built a replica of the CC-100 corpus using these scripts. You can find it HERE. This corpus also has romanized corpora for some Indian languages.
WMT NEWS Crawl
LDCIL Monolingual Corpus
Charles University Hindi Monolingual Corpus
Charles University Urdu Monolingual Corpus
IIT Bombay Hindi Monolingual Corpus
EMILLE Corpus (multiple Indian languages)
Janmabhumi Malayalam Corpus
Leipzig Corpus
Sanskrit Monolingual and Sandhi-split Corpus
Lot Of Indic Tweets Corpus: Large Twitter datasets for Telugu (7.9 million) and Hindi (17.6 million), along with fastText skip-gram and CBOW word vectors for the same.
CMU Romanized Hinglish Corpus: See THIS PAPER for details.
JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 45k sentences.
KMI Magahi Corpus:
KMI Awadhi Corpus:
KMI Linguistics Bodo: Contains the Bodo corpus and the frequency-ordered word and punctuation list.
SMC Malayalam text corpus
DNLP-Tel Telugu Corpus: Telugu corpus of 280M tokens and 23M sentences along with skip-gram model trained with word2vec.
Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with the monolingual data comprising 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2.
SinMin Corpus: Contains texts of different genres and styles of the modern and old Sinhala language.
Kangri_corpus: Monolingual corpus of Kangri, a low-resource endangered Himachali language, comprising 1,81,552 sentences. Described in this paper.
Sanskrit-Hindi-MT: The Sanskrit Monolingual Data is available here.
FacebookDecadeCorpora: Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in this paper.
Nepali National corpus: The Nepali Monolingual written corpus comprises the core corpus containing 802,000 words and the general corpus containing 1,400,000 words. Described here.

Language Identification

VarDial 2018 Language Identification Dataset: 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi.

Lexical Resources and Semantic Similarity

IndoWordNet
IIIT-Hyderabad Word Similarity Database: 7 Indian languages
Facebook Hindi Analogy Dataset
MGAD Hindi Analogy dataset
AI4Bharat Word Frequency Lists: Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus.
Hindi RG-63: Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset
IITB Cognate Datasets: Dataset of Cognates and False Friend Pairs for 12 Indian Languages. (Paper)
AI4Bharat Cross-lingual Semantic Textual Similarity: 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines.
Toxicity-200: Toxicity Lists for 200 languages including 27 Indian languages.
FacebookDecadeCorpora: Contains a list of algorithmically derived stopwords extracted from Corpus-Sinhala-Redux. Described in this paper.

NER Corpora

FIRE 2013 AUKBC NER Corpus
FIRE 2014 AUKBC NER Corpus
IIT Bombay Marathi NER Corpus
WikiAnn NER Corpus (Noisy) DOWNLOAD (Old broken LINK)
IJCNLP 2008 NER Corpus: NER corpora for hi, bn, or, te, ur.
a-mma NER data
AI4Bharat Naamapadam: NER dataset for 11 Indic languages.
AsNER: A named entity annotation dataset for low resource Assamese language containing 99k tokens.
L3Cube-MahaNER: The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in this paper.
CFILT HiNER: A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in this paper.
MultiCoNER: A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages (including datasets for the Indic languages Hindi and Bangla) representing three domains (wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets. The NER tag-set consists of six classes, viz. PER, LOC, CORP, GRP, PROD and CW. Described in this paper.

Parallel Translation Corpus

BPCC Parallel Corpus: Largest parallel corpus for English and 22 Indian languages (as of Jan 2024). It comprises 230 million sentence pairs between English and Indian languages. A subset of this corpus is the BPCC-Human Corpus containing 2.2 million English-Indic pairs for 22 Indic languages.
Samanantar Parallel Corpus: Largest parallel corpus for English and 11 Indian languages (as of 2021). It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
FLORES-101: Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
FLORES-200: Human translated evaluation sets for 200 languages released by Facebook. It includes 24 Indic languages. The testsets are n-way parallel.
IIT Bombay English-Hindi Parallel Corpus: Largest en-hi parallel corpora in public domain (about 1.5 million segments)
CVIT-IIITH PIB Multilingual Corpus: Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
CVIT-IIITH Mann ki Baat Corpus: Mined from Indian PM Narendra Modi's Mann ki Baat speeches.
PMIndia: Parallel corpus for En-Indian languages mined from Mann ki Baat speeches of the PM of India (paper).
OPUS corpus
WAT 2018 Parallel Corpus: There may be significant overlap between WAT and OPUS.
Charles University Parallel Corpora Collection
- Charles University English-Hindi Parallel Corpus: This is included in the IITB parallel corpus.
- Charles University English-Tamil Parallel Corpus
- Charles University English-Odia Parallel Corpus v1.0
- Charles University English-Odia Parallel Corpus v2.0
- Charles University English-Urdu Religious Parallel Corpus
Indian Language Corpora Initiative: Available on TDIL portal on request
IndoWordnet Parallel Corpus: Parallel corpora mined from IndoWordNet gloss and/or examples for Indian-Indian language corpora (6.3 million segments, 18 languages).
MTurk Indian Parallel Corpus
TED Parallel Corpus
JW300 Corpus: Parallel corpus mined from jw.org. Religious text from Jehovah's Witnesses.
ALT Parallel Corpus: 10k sentences for Bengali, Hindi in parallel with English and many East Asian languages.
FLORES dataset: English-Sinhala and English-Nepali corpora
Uka Tarsadia University Corpus: 65k English-Gujarati sentence pairs. Corpus is described in this paper
NLPC-UoM English-Tamil Corpus: 9k sentences, 24k glossary terms
Wikititles: from statmt
- English-Tamil Wiki Titles
- English-Gujarati Wiki Titles
JNU-BHLTR Bhojpuri Corpus: English-Bhojpuri corpus of 65k sentences
EILMT Corpus
QED Corpus: English-Hindi corpus of 43k sentences from the educational domain.
WikiMatrix Corpus: Mined from Wikipedia, looks noisy.
CCMatrix: Parallel corpus mined from CommonCrawl, looks noisy (statmt repo).
CGNetSwara: Hindi-Gondi parallel corpus (19k sentence pairs)
MTEnglish2Odia: English-Odia (42k pairs)
SAP Software Documentation: test and evaluation set for English-Hindi in the software documentation domain [paper]
BUET English-Bangla Corpus, EMNLP-2020: 2.7M sentences (has overlaps with OPUS)
CLE Parallel Corpus: Parallel corpus for English, Urdu and Nepali.
Itihasa Parallel Corpus: 93k parallel sentences between English and Sanskrit from the Ramayana and Mahabharata.
Ema-lon Manipuri Corpus: The first comparable corpus built for the Manipuri (mni)-English (eng) language pair, with parallel data comprising 124,975 Manipuri-English aligned sentences.
PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in this paper.
IIIT-H en-hi-codemixed-corpus: A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in this paper.
CALCS 2021 Eng-Hinglish dataset: Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in this paper.
Kangri_corpus: The corpus contains 27,362 Hindi-Kangri parallel corpora. Described in [this paper](https://arxiv.org/abs/2103.11596).
NLLB-Seed: Small human-translated parallel corpora from Wikipedia articles for very low resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chhattisgarhi.
NLLB-MD: NLLB Multi Domain is a set of professionally translated sentences in News, Unscripted informal speech, and Health domains. Covers Bhojpuri amongst Indian languages.
NLLB-Mined: All the parallel corpora mined by the NLLB project. This repository was reconstructed by AllenAI based on metadata released by the NLLB Project.
Sanskrit-Hindi-MT: Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel (test) data.
Nepali National corpus: The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described here.
Kathmandu University-English–Nepali Parallel Corpus: A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in this paper.
CCAligned: A Massive Collection of more than 100 million cross-lingual web-document pairs in 137 languages aligned with English.
CoPara: Long-context parallel corpora for 4 Dravidian languages. Contains 2586 passage pairs mined from New India Samachar [paper]

MT Evaluation

WMT23 QE task: QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference-based metrics. For Marathi, post-edits and word-level error annotations are also available. 26k training sentences for Marathi, 7k for the others. report
AI4Bharat IndicMT-Eval: MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems).

Parallel Transliteration Corpus

Dakshina Dataset: The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs.
BrahmiNet Corpus: 110 language pairs mined from ILCI parallel corpus.
Xlit-Crowd: Hindi-English Transliteration Corpus created via crowdsourcing.
Xlit-IITB-Par: Hindi-English Transliteration Corpus mined from parallel translation corpora.
FIRE 2013 Track on Transliterated Search: Transliteration dataset of native words in Hindi, Bengali and Gujarati.
NEWS 2018 Shared Task dataset: Transliteration datasets for Kannada, Tamil, Bengali and Hindi created by Microsoft Research India.
AI4Bharat StoryWeaver Xlit Dataset - Transliteration datasets for Hindi, Maithili & Konkani
Hindi WikiData Transliteration Pairs - Hindi dataset (90k pairs)
NotAI-tech English-Telugu: Around 38k word pairs
AI4Bharat Aksharantar: The largest publicly available transliteration dataset for 21 Indic languages consisting of 26M Indic language-English transliteration pairs. Described in this paper.

Text Classification

BBC news articles classification dataset: 14 class classification
iNLTK News Headlines classification: Datasets for multiple Indian languages.
AI4Bharat IndicNLP News Articles: Word embeddings for 10 Indian languages.
KMI Linguistics TRAC - 1: Contains aggression-annotated dataset (in English and Hindi) for the Shared Task on Aggression Identification during First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING - 2018.
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning in 11 languages (includes Tamil). Described in this paper.

Textual Entailment/Natural Language Inference

XNLI corpus: Hindi and Urdu test sets and machine translated training sets (from English MultiNLI).
csebuetnlp Bangla NLI: A Natural Language Inference (NLI) dataset for Bengali. Described in this paper.

Paraphrase

Amrita University-DPIL Corpus: Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi).

Sentiment, Sarcasm, Emotion Analysis

IIT Bombay movie review datasets for Hindi and Marathi
IIT Patna movie review datasets for Hindi
IIIT-H LTRC Multi-domain dataset for Telugu
ACTSA corpus for Telugu
BHAAV (भाव) Corpus: A Text Corpus for Emotion Analysis from Hindi Stories
SentiWordNet - SAIL - Hindi, Bangla, Tamil & Telugu
Dravidian-CodeMix - FIRE 2020 - Tamil & Malayalam
Bengali Sentiment Analysis - Classification Benchmark, 2020: 8k sentences
SentNoB: sentiment dataset for Bangla from 3 domains on user comments containing 15k examples (Paper) (Dataset)
UoM-Sinhala Sentiment Analysis: Sentiment Analysis for Sinhala Language. Consists of a multi-class annotated data set with 15059 sentiment annotated Sinhala news comments extracted from two Sinhala online news papers with four sentiment categories namely POSITIVE, NEGATIVE, NEUTRAL and CONFLICT and a corpus of 9.48 million tokens. Described in this paper.

Hate Speech and Offensive Comments

Hate Speech and Offensive Content Identification in Indo-European Languages: (HASOC FIRE-2020)
An Indian Language Social Media Collection for Hate and Offensive Speech, 2020: Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request)
Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018: Scraped from Facebook (21k) & Twitter (18k) (Paper)
Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018: 3k tweets (Paper)
A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018: 4.5k Tweets (Paper)
Roman Urdu Offensive Language Detection, 2020: 10k tweets, can also be used for Hindi (Paper)
Bengali Hate Speech - Classification Benchmark, 2020: 1.5k sentences
Offensive Language Identification in Dravidian Languages, EACL 2021: Tamil, Malayalam, Kannada
Fear Speech in Indian WhatsApp Groups, 2021
HateCheckHIn: An evaluation dataset for Hindi Hate Speech Detection Models having a total of 34 functionalities out of which 28 functionalities are monolingual and the remaining 6 are multilingual. Hindi is used as the base language. Described in this paper.

Question Answering

Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
TyDi QA datasets: QA dataset for Bengali and Telugu.
bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
MMQA dataset: Hindi QA dataset described in this paper
XQuAD: testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in this paper
XQA: testset for Tamil QA. Described in this paper
HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
Chaii: A Kaggle challenge consisting of 1104 questions in Hindi and Tamil. Moreover, here is a good collection of papers on multilingual Question Answering.
csebuetnlp Bangla QA: A Question Answering (QA) dataset for Bengali. Described in this paper.
XOR QA: A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in this paper. More information is available here.
IITB HiQuAD: A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in this paper.

Dialog

a-mma Indic Casual Dialogs Datasets
A Code-Mixed Medical Task-Oriented Dialog Dataset: The dataset contains 3005 Telugu–English Code-Mixed dialogs with 29 k utterances covering ten specializations with an average code-mixing index (CMI) of 33.3%. Described in this paper.

Discourse

MIDAS-Hindi Discourse Analysis

Information Extraction

EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
[EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.
Amazon MASSIVE: A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in this paper.
Facebook - MTOP Benchmark: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising 100k annotated utterances in 6 languages (including Indic language: Hindi) across 11 domains. Described in this paper.

POS Tagged corpus

Indian Language Corpora Initiative
Universal Dependencies
IIITH Paninian Treebank: POS annotations for hi, bn, kn, ml and mr.
Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task
JNU-BHLTR Bhojpuri Corpus: Bhojpuri corpus of 5000 sentences.
KMI Magahi Corpus:
KMI Awadhi Corpus:
Tham Khasi Corpus: An annotated Khasi POS tagged corpus containing 83,312 words, 4,386 sentences, 5,465 word types which amounts to 94,651 tokens (including punctuations).

Chunk Corpus

Indian Language Corpora Initiative
Indian Languages Treebanking Project: Chunk annotations for hi, bn, kn, ml and mr.

Dependency Parse Corpus

IIIT Hyderabad Hindi Treebank
Universal Dependencies
Universal Dependencies Hindi Treebank
Universal Dependencies Urdu Treebank
IIITH Paninian Treebank: Paninian Grammar Framework annotations along with mappings to Stanford dependency annotations for hi, bn, kn, ml and mr.
Vedic Sanskrit Treebank: 4k Sanskrit dependency treebank [paper]

Coreference Corpus

IIITH Coreference Anaphora Annotated Data: Hindi
IIITH Coreference Annotated Data: Hindi
TransMuCoRes: Synthetic data in multiple Indian languages. Finetuned model included.

Summarization

XL-Sum: A large-scale multilingual abstractive summarization dataset for 44 languages, comprising 1 million professionally annotated article-summary pairs from BBC. Spans 150k examples across 10 Indic languages. Described in this paper.
TeSum: Telugu Abstractive Summarization dataset containing 20k+ article-summary pairs, with the summaries being manually created. [paper]
WikiLingua: Cross-lingual summarization dataset created from WikiHow. Contains 9k English-Hindi article-summary pairs. [paper]
MassiveSum: A large summarization dataset containing 13 Indian languages with ~1.9 million article-summary pairs. The summaries are mined from article metadata. [paper]

Data to Text

XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages, comprising a high-quality XF2T dataset in 7 languages (Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali) and a monolingual dataset in English. The dataset is available upon request. Described in this paper.

Models

Language Identification

NLLB-200: LID for 200 languages including 27 Indic languages.

Word Embeddings

AI4Bharat IndicFT: FastText word embeddings for 11 Indian languages.
FastText CommonCrawl+Wikipedia
FastText Wikipedia
Polyglot
EM-FT: The first FastText word embedding available for Manipuri language trained on 1,880,035 Manipuri sentences.
Sanskrit-Hindi-MT: The FastText embeddings for Sanskrit are available here and for Hindi here.
UoM-Sinhala Sentiment Analysis- FastText 300: The FastText word embedding model for Sinhala language. Described in this paper.

Pre-trained Language Models

AI4Bharat IndicBERT: Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
AI4Bharat IndicBART: A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in this paper.
MuRIL: Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
mBART50: seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages).
BLOOM: GPT-3-like multilingual transformer-decoder language model (includes major Indic languages).
iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
RoBERTa-hindi-guj-san: Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati.
Bangla-BERT-Base: Bengali BERT model trained on Bengali wikipedia and OSCAR datasets.
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in this paper.
EM-ALBERT: The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences.
LaBSE: Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [paper].
LASER3: Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languages).

Multilingual Word Embeddings

Morphanalyzers

AI4Bharat IndicNLP Project: Unsupervised morphanalyzers for 10 Indian languages learnt using morfessor.

Translation Models

IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
Shata-Anuvaadak: SMT for 110 language pairs (all pairs between English and 10 Indian languages).
LTRC Vanee: Dependency based Statistical MT system from English to Hindi.
NLLB-200: Models for 200 languages including 27 Indic languages.

Transliteration Models

AI4Bharat IndicXlit: A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in this paper.

Speech Models

AI4Bharat IndicWav2Vec: Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0.
Vakyansh CLSRIL-23: Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages (documentation) (experimentation platform).
arijitx/wav2vec2-large-xlsr-bengali: Pretrained wav2vec2-large-xlsr trained on ~50 hrs (40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.
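
For readers comparing the speech models above, word error rate (WER) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch of that standard definition, assuming whitespace tokenization; it is not the evaluation script used by any of the listed models, and the name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumes whitespace tokenization, which is a simplification for
    languages and scripts where tokenization is less trivial.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.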

NER

AI4Bharat IndicNER: NER model for 11 Indic languages.
AsNER: A Baseline Assamese NER model.
L3Cube-MahaNER-BERT: A 752 million token multilingual BERT model. Described in this paper.
CFILT HiNER: Hindi NER models trained on CFILT HiNER dataset. Described in this paper.

Speech Corpora

Microsoft Speech Corpus: Speech corpus for Telugu, Tamil and Gujarati.
Microsoft-IITB Marathi Speech Corpus: 109 hours of speech data collected via crowdsourcing.
AccentDB: Database of Indian English accents from native speakers in Bangla, Malayalam, Telugu and Oriya.
IIT Madras TTS database
BABEL Speech Corpus: includes some Indian languages
WikiPron: Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. paper
CVIT IndicSpeech: TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each).
Google Speech Corpus: TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (up to 9 hours each). Resources SLR#63-#66, #78-#79. (paper)
CoVoST 2: Tamil 2 hrs data
SMC Malayalam Speech Corpus - Download link
Vāksañcayaḥ Sanskrit Speech Corpus: 78 hours of speech corpus in Sanskrit prose, with speaker-disjoint train, dev and test splits. It also contains additional out-of-domain test data with speakers having pronunciation influences from L1 (paper).
IISc-MILE Kannada ASR Corpus: Transcribed speech corpus containing ~350 hours of read speech data for training ASR systems for Kannada language. Described in this paper.
IISc-MILE Tamil ASR Corpus: Transcribed speech corpus containing ~150 hours of read speech data for training ASR systems for Tamil language. Described in this paper.
MUCS 2021 Dataset: (Gujarati, Hindi, Marathi, Odia, Tamil, Telugu) Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages
Gramvaani: 100 hours of labelled data and 1000 hours of pretraining data for Hindi
Kashmiri Data Corpus: Collection of transcribed Kashmiri recordings taken from native speakers
Hindi-Tamil-English ASR Challenge: 490 hours of transcribed speech data in three Indian Languages
Large Sinhala ASR training data set: Sinhala ASR training data set containing ~185K utterances
Large Bengali ASR training data set: Bengali ASR training data set containing ~196K utterances
Large Nepali ASR training data set: Nepali ASR training data set containing ~157K utterances
Crowdsourced high-quality Gujarati multi-speaker speech data set: Contains recordings of native speakers of Gujarati
Crowdsourced high-quality Kannada multi-speaker speech data set: Contains recordings of native speakers of Kannada
Crowdsourced high-quality Malayalam multi-speaker speech data set: Contains recordings of native speakers of Malayalam
Crowdsourced high-quality Marathi multi-speaker speech data set: Contains recordings of native speakers of Marathi
Crowdsourced high-quality Tamil multi-speaker speech data set: Contains recordings of native speakers of Tamil
Crowdsourced high-quality Telugu multi-speaker speech data set: Contains recordings of native speakers of Telugu
Nepali National corpus: The Nepali Spoken Corpus contains audio recordings from 17 different types of social activities with a total recording duration of 31 hours and 26 minutes. Described here.
Shrutilipi: Over 6400 hours of transcribed speech corpus across 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu

OCR Corpora

Multimodal Corpora

English-Hindi Visual Genome: Images captioned in both English and Hindi.
English-Hindi Flickr 8k: A subset of images from Flickr8k images captioned by native speakers in both English and Hindi. Code and data available here.

Language Specific Catalogs

Pointers to language-specific NLP resource catalogs
