From e0af027604bd594ed72d3d709423ff7725dab8fc Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Fri, 8 Jan 2021 22:06:20 +0530 Subject: [PATCH 001/119] Add Hate Speech section Src: https://hatespeechdata.com/ --- README.md | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/README.md b/README.md index 08e8c13..6047d2e 100644 --- a/README.md +++ b/README.md @@ -41,6 +41,7 @@ _Add a small, informative description of the dataset and provide links to any pa * [Textual Entailment/Natural Language Inference](#TextualEntailment) * [Paraphrase](#Paraphrase) * [Sentiment, Sarcasm, Emotion Analysis](#SentimentAnalysis) + * [Hate Speech and Offensive Comments](#HateSpeech) * [Question Answering](#QuestionAnswering) * [Dialog](#Dialog) * [Discourse](#Discourse) @@ -220,6 +221,14 @@ Benchmarks spanning multiple tasks. - [SentiWordNet - SAIL](http://amitavadas.com/SAIL/il_res.html) - Hindi, Bangla, Tamil & Telugu - [Dravidian-CodeMix - FIRE 2020](https://dravidian-codemix.github.io/2020/datasets.html) - Tamil & Malayalam +### Hate Speech and Offensive Comments + +- [Hate Speech and Offensive Content Identification in Indo-European Languages](https://hasocfire.github.io/hasoc/2020/dataset.html): (HASOC FIRE-2020) +- [An Indian Language Social Media Collection for Hate and Offensive Speech, 2020](https://www.aclweb.org/anthology/2020.restup-1.2/): Hinglish Tweets and FB Comments collected during Parliamentary Election 2019 of India (Dataset available on request) +- [Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018](https://sites.google.com/view/trac1/shared-task): Scraped from Facebook (21k) & Twitter (18k) ([Paper](https://arxiv.org/abs/1803.09402)) +- [Did You Offend Me? 
Classification of Offensive Tweets in Hinglish Language, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408): 3k tweets ([Paper](https://www.aclweb.org/anthology/W18-5118)) +- [A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018](https://github.com/punyajoy/HateSpeech-Hindi-English-Code-Mixed-Social-Media-Text): 4.5k Tweets ([Paper](https://www.aclweb.org/anthology/W18-1105)) + ### Question Answering - [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA): Contains dev and test sets for Hindi. - [TyDi QA datasets](https://github.com/google-research-datasets/tydiqa): QA dataset for Bengali and Telugu. From 52061bc06a2491e88517da6d59b67d68116aa29a Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Fri, 8 Jan 2021 22:12:31 +0530 Subject: [PATCH 002/119] Set theme jekyll-theme-hacker --- _config.yml | 1 + 1 file changed, 1 insertion(+) create mode 100644 _config.yml diff --git a/_config.yml b/_config.yml new file mode 100644 index 0000000..fc24e7a --- /dev/null +++ b/_config.yml @@ -0,0 +1 @@ +theme: jekyll-theme-hacker \ No newline at end of file From e4b313d9bac55e5e0179238aec8d414cd5164e49 Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Fri, 8 Jan 2021 22:29:58 +0530 Subject: [PATCH 003/119] Add BengFastText datasets --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 6047d2e..61e3529 100644 --- a/README.md +++ b/README.md @@ -220,6 +220,7 @@ Benchmarks spanning multiple tasks. 
- [BHAAV (भाव) Corpus](https://github.com/midas-research/bhaav): A Text Corpus for Emotion Analysis from Hindi Stories - [SentiWordNet - SAIL](http://amitavadas.com/SAIL/il_res.html) - Hindi, Bangla, Tamil & Telugu - [Dravidian-CodeMix - FIRE 2020](https://dravidian-codemix.github.io/2020/datasets.html) - Tamil & Malayalam +- [Bengali Sentiment Analysis - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 8k sentences ### Hate Speech and Offensive Comments @@ -228,6 +229,7 @@ Benchmarks spanning multiple tasks. - [Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018](https://sites.google.com/view/trac1/shared-task): Scraped from Facebook (21k) & Twitter (18k) ([Paper](https://arxiv.org/abs/1803.09402)) - [Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408): 3k tweets ([Paper](https://www.aclweb.org/anthology/W18-5118)) - [A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018](https://github.com/punyajoy/HateSpeech-Hindi-English-Code-Mixed-Social-Media-Text): 4.5k Tweets ([Paper](https://www.aclweb.org/anthology/W18-1105)) +- [Bengali Hate Speech - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 1.5k sentences ### Question Answering - [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA): Contains dev and test sets for Hindi. 
From abcd05125c43c7de16a259dabf61a0ff34c579f6 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 9 Jan 2021 08:40:21 +0530 Subject: [PATCH 004/119] Update CONTRIBUTORS.md --- CONTRIBUTORS.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index e2ada60..16bdcc3 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -17,3 +17,4 @@ - Rahul Gupta - Kavya Manohar - Amrith Krishna +- Gokul NC From 0d2229f4fd51e2dd3c00e0ef8edb2edbb6207415 Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Mon, 11 Jan 2021 11:52:56 +0530 Subject: [PATCH 005/119] Add EACL 2021 - Dravidian Offensive Classification --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 61e3529..bfd213e 100644 --- a/README.md +++ b/README.md @@ -230,6 +230,7 @@ Benchmarks spanning multiple tasks. - [Did You Offend Me? Classification of Offensive Tweets in Hinglish Language, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408): 3k tweets ([Paper](https://www.aclweb.org/anthology/W18-5118)) - [A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018](https://github.com/punyajoy/HateSpeech-Hindi-English-Code-Mixed-Social-Media-Text): 4.5k Tweets ([Paper](https://www.aclweb.org/anthology/W18-1105)) - [Bengali Hate Speech - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 1.5k sentences +- [Offensive Language Identification in Dravidian Languages, EACL 2021](https://dravidianlangtech.github.io/2021/): Tamil, Malayalam, Kannada ### Question Answering - [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA): Contains dev and test sets for Hindi. 
From 70842c7b5d42cc20667279e28ae01fb4b961582a Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Wed, 3 Feb 2021 17:00:15 +0530 Subject: [PATCH 006/119] Add NPLT --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index bfd213e..2a7186c 100644 --- a/README.md +++ b/README.md @@ -77,6 +77,7 @@ _Add a small, informative description of the dataset and provide links to any pa - [AI4Bharat IndicNLP](https://indicnlp.ai4bharat.org) - [Linguistic Data Consortium For Indian Languages (LDCIL)](https://data.ldcil.org) - [University of Hyderabad - Sanskrit NLP](http://sanskrit.uohyd.ac.in/scl) +- [National Platform for Language Technology](https://nplt.in/demo/index.php?route=product/category&path=75_59&limit=100) ## Libraries and Tools From 7d147ec07651eb6f3e00d25c50e24eb4803e7c3c Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Sun, 7 Feb 2021 13:52:38 +0530 Subject: [PATCH 007/119] Add Roman Urdu Hate Speech dataset Credits: https://github.com/leondz/hatespeechdata/pull/10 --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 2a7186c..939bb4d 100644 --- a/README.md +++ b/README.md @@ -230,6 +230,7 @@ Benchmarks spanning multiple tasks. - [Aggression-annotated Corpus of Hindi-English Code-mixed Data, 2018](https://sites.google.com/view/trac1/shared-task): Scraped from Facebook (21k) & Twitter (18k) ([Paper](https://arxiv.org/abs/1803.09402)) - [Did You Offend Me? 
Classification of Offensive Tweets in Hinglish Language, 2018](https://github.com/pmathur5k10/Hinglish-Offensive-Text-Classification/tree/b8433ff1ebb885bd657f5117eab6bd3798f20408): 3k tweets ([Paper](https://www.aclweb.org/anthology/W18-5118)) - [A Dataset of Hindi-English Code-Mixed Social Media Text for Hate Speech Detection, 2018](https://github.com/punyajoy/HateSpeech-Hindi-English-Code-Mixed-Social-Media-Text): 4.5k Tweets ([Paper](https://www.aclweb.org/anthology/W18-1105)) +- [Roman Urdu Offensive Language Detection, 2020](https://github.com/haroonshakeel/roman_urdu_hate_speech): 10k tweets, can also used for Hindi, ([Paper](https://www.aclweb.org/anthology/2020.emnlp-main.197)) - [Bengali Hate Speech - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 1.5k sentences - [Offensive Language Identification in Dravidian Languages, EACL 2021](https://dravidianlangtech.github.io/2021/): Tamil, Malayalam, Kannada From e4e8f53c8a061c8e3a17da1b1ce2e595e9930c45 Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Sat, 13 Feb 2021 11:29:29 +0530 Subject: [PATCH 008/119] Add BUET en-bn corpus Fixes https://github.com/AI4Bharat/indicnlp_catalog/issues/57 --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 939bb4d..bcce707 100644 --- a/README.md +++ b/README.md @@ -185,6 +185,7 @@ Benchmarks spanning multiple tasks. 
- [CGNetSwara](http://cgnetswara.org/hindi-gondi-corpus.html): Hindi-Gondi parallel corpus (19k sentence pairs) - [MTEnglish2Odia](https://github.com/soumendrak/MTEnglish2Odia): English-Odia (42k pairs) - [SAP Software Documentation](https://github.com/SAP/software-documentation-data-set-for-machine-translation): test and evaluation set for English-Hindi in the software documentation domain [[paper](https://arxiv.org/abs/2008.04550)] +- [BUET English-Bangla Corpus, EMNLP-2020](https://github.com/csebuetnlp/banglanmt) - 2.7M sentences (has overlaps with OPUS) ### Parallel Transliteration Corpus From 5edd3974078e25a5c4101944b66fa549e13e24cd Mon Sep 17 00:00:00 2001 From: Gokul NC Date: Sat, 13 Feb 2021 13:23:03 +0530 Subject: [PATCH 009/119] Add Indian Fear-speech-analysis --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index bcce707..9f48cee 100644 --- a/README.md +++ b/README.md @@ -234,6 +234,7 @@ Benchmarks spanning multiple tasks. - [Roman Urdu Offensive Language Detection, 2020](https://github.com/haroonshakeel/roman_urdu_hate_speech): 10k tweets, can also used for Hindi, ([Paper](https://www.aclweb.org/anthology/2020.emnlp-main.197)) - [Bengali Hate Speech - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 1.5k sentences - [Offensive Language Identification in Dravidian Languages, EACL 2021](https://dravidianlangtech.github.io/2021/): Tamil, Malayalam, Kannada +- [Fear Speech in Indian WhatsApp Groups, 2021](https://github.com/punyajoy/Fear-speech-analysis) ### Question Answering - [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA): Contains dev and test sets for Hindi. 
From 142a8b477fa8a3dff52494a4bc90722a28fff910 Mon Sep 17 00:00:00 2001 From: Rishav Kundu Date: Thu, 11 Mar 2021 13:42:48 +0530 Subject: [PATCH 010/119] Fix some typos in README.md --- README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 9f48cee..9dcf537 100644 --- a/README.md +++ b/README.md @@ -14,7 +14,7 @@ _Add a small, informative description of the dataset and provide links to any pa - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. -- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentece splitting, normalization, script conversion, transliteration, _etc_ +- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, _etc_ - [iNLTK](https://github.com/goru001/inltk): iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages. - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). 
- [Amrita University-DPIL Corpus](https://nlp.amrita.edu/dpil_cen/index.html): Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi). @@ -82,12 +82,12 @@ _Add a small, informative description of the dataset and provide links to any pa ## Libraries and Tools -- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentece splitting, normalization, script conversion, transliteration, _etc_ +- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, _etc_ - [pyiwn](https://github.com/riteshpanjwani/pyiwn): Python Interface to IndoWordNet - [Indic-OCR](https://indic-ocr.github.io/) : OCR for Indic Scripts - [CLTK](https://github.com/cltk/cltk/tree/master/cltk): Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library. - [iNLTK](https://github.com/goru001/inltk): iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages. -- [Sanskrit Coders Indic Transliteration](https://github.com/sanskrit-coders/indic_transliteration): Script conversion and ronaization for Indian languages. +- [Sanskrit Coders Indic Transliteration](https://github.com/sanskrit-coders/indic_transliteration): Script conversion and romanization for Indian languages. - [Smart Sanskirt Annotator](https://github.com/iamdsc/smart-sanskrit-annotator): Annotation tool for Sanskrit [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.874.pdf) ## Evaluation Benchmarks @@ -154,7 +154,7 @@ Benchmarks spanning multiple tasks. 
### Parallel Translation Corpus -- [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel/): Largest en-hi parallel corpora in public domain (about 1.5 million semgents) +- [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel/): Largest en-hi parallel corpora in public domain (about 1.5 million segments) - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. - [PMIndia](http://data.statmt.org/pmindia): Parallel corpus for En-Indian languages mined from _Mann ki Baat_ speeches of the PM of India ([paper](https://arxiv.org/abs/2001.09907)). From b3d64c43baf5a7b910cc4d5253ce5202e9ad01f1 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Mar 2021 10:58:52 +0530 Subject: [PATCH 011/119] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 9dcf537..b2c263e 100644 --- a/README.md +++ b/README.md @@ -185,7 +185,8 @@ Benchmarks spanning multiple tasks. 
- [CGNetSwara](http://cgnetswara.org/hindi-gondi-corpus.html): Hindi-Gondi parallel corpus (19k sentence pairs) - [MTEnglish2Odia](https://github.com/soumendrak/MTEnglish2Odia): English-Odia (42k pairs) - [SAP Software Documentation](https://github.com/SAP/software-documentation-data-set-for-machine-translation): test and evaluation set for English-Hindi in the software documentation domain [[paper](https://arxiv.org/abs/2008.04550)] -- [BUET English-Bangla Corpus, EMNLP-2020](https://github.com/csebuetnlp/banglanmt) - 2.7M sentences (has overlaps with OPUS) +- [BUET English-Bangla Corpus, EMNLP-2020](https://github.com/csebuetnlp/banglanmt): 2.7M sentences (has overlaps with OPUS) +- [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. ### Parallel Transliteration Corpus @@ -349,3 +350,4 @@ http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf)) - [Nepali](https://github.com/amitness/ml-datasets) - [Odia](https://github.com/shantipriyap/Odia-NLP-Resource-Catalog) - [Tamil](https://narvidhai.github.io/tamil-nlp-catalog/) + - [Sinhala](https://lknlp.github.io): [[git repo]](https://github.com/lknlp/lknlp.github.io) From dc5f68e911353b4df8bbda5536093ea51680bad1 Mon Sep 17 00:00:00 2001 From: Ashwini Vaidya Date: Sat, 3 Apr 2021 12:40:49 +0530 Subject: [PATCH 012/119] added new lexical resource, Hindi RG-63 --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index b2c263e..affa7b6 100644 --- a/README.md +++ b/README.md @@ -142,6 +142,7 @@ Benchmarks spanning multiple tasks. 
- [Facebook Hindi Analogy Dataset](https://dl.fbaipublicfiles.com/fasttext/word-analogies/questions-words-hi.txt) - [MGAD Hindi Analogy dataset](https://github.com/rutrastone/MGAD) - [AI4Bharat Word Frequency Lists](https://github.com/AI4Bharat/indicnlp_corpus#text-corpora): Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus. +- [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset ### NER Corpora From f21d462f2664dd44855089e7a2ac367adf94ab5d Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Tue, 18 May 2021 21:33:01 +0530 Subject: [PATCH 013/119] Update README.md --- README.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index affa7b6..7b4a03b 100644 --- a/README.md +++ b/README.md @@ -11,13 +11,15 @@ _Add a small, informative description of the dataset and provide links to any pa :+1: **Featured Resources** + +- :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. 
- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, _etc_ - [iNLTK](https://github.com/goru001/inltk): iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages. - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). -- [Amrita University-DPIL Corpus](https://nlp.amrita.edu/dpil_cen/index.html): Sentence level paraphrase identification for four Indian languages (Tamil, Malayalam, Hindi and Punjabi). + **Browse the entire catalog...** @@ -155,7 +157,8 @@ Benchmarks spanning multiple tasks. ### Parallel Translation Corpus -- [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel/): Largest en-hi parallel corpora in public domain (about 1.5 million segments) +- [Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. +- [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel): Largest en-hi parallel corpora in public domain (about 1.5 million segments) - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. 
- [PMIndia](http://data.statmt.org/pmindia): Parallel corpus for En-Indian languages mined from _Mann ki Baat_ speeches of the PM of India ([paper](https://arxiv.org/abs/2001.09907)). From 55aed688e1300f24116177f73a44a93f5db266c8 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Fri, 11 Jun 2021 13:19:57 +0530 Subject: [PATCH 014/119] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 7b4a03b..e34e5d4 100644 --- a/README.md +++ b/README.md @@ -13,6 +13,7 @@ _Add a small, informative description of the dataset and provide links to any pa - :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. +- :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. @@ -191,6 +192,7 @@ Benchmarks spanning multiple tasks. 
- [SAP Software Documentation](https://github.com/SAP/software-documentation-data-set-for-machine-translation): test and evaluation set for English-Hindi in the software documentation domain [[paper](https://arxiv.org/abs/2008.04550)] - [BUET English-Bangla Corpus, EMNLP-2020](https://github.com/csebuetnlp/banglanmt): 2.7M sentences (has overlaps with OPUS) - [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. +- [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. ### Parallel Transliteration Corpus From 29970e3356d457866e5fa5310d6090b6dec44ade Mon Sep 17 00:00:00 2001 From: sagorbrur Date: Mon, 14 Jun 2021 07:49:04 +0600 Subject: [PATCH 015/119] added Bangla-BERT-base, and other bengali libraries --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index e34e5d4..7b0f164 100644 --- a/README.md +++ b/README.md @@ -92,6 +92,8 @@ _Add a small, informative description of the dataset and provide links to any pa - [iNLTK](https://github.com/goru001/inltk): iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages. - [Sanskrit Coders Indic Transliteration](https://github.com/sanskrit-coders/indic_transliteration): Script conversion and romanization for Indian languages. 
- [Smart Sanskirt Annotator](https://github.com/iamdsc/smart-sanskrit-annotator): Annotation tool for Sanskrit [paper](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.874.pdf) +- [BNLP](https://github.com/sagorbrur/bnlp): Bengali language processing toolkit with tokenization, embedding, POS tagging, NER suppport +- [CodeSwitch](https://github.com/sagorbrur/codeswitch): Language identification, POS Tagging, NER, sentiment analysis support for code mixed data including Hindi and Nepali language ## Evaluation Benchmarks @@ -307,6 +309,7 @@ Benchmarks spanning multiple tasks. - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. - [albert-base-sanskrit](https://huggingface.co/surajp/albert-base-sanskrit): ALBERT-based model trained on Sanskrit Wikipedia. - [RoBERTa-hindi-guj-san](https://huggingface.co/surajp/RoBERTa-hindi-guj-san): Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati. +- [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets ### Multilingual Word Embeddings From cc335f98280465914fe09724309fe7f9ecb02ad8 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 14 Jun 2021 16:11:46 +0530 Subject: [PATCH 016/119] Update README.md --- README.md | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/README.md b/README.md index 7b0f164..52e0c3c 100644 --- a/README.md +++ b/README.md @@ -15,11 +15,7 @@ _Add a small, informative description of the dataset and provide links to any pa - :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. 
- :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. -- [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). -- [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. -- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library): Python Library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, _etc_ -- [iNLTK](https://github.com/goru001/inltk): iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages. -- [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). 
+ From 98a40e0d059e2f83d9d471c630de4d4ee68e2744 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 14 Jun 2021 16:25:04 +0530 Subject: [PATCH 017/119] Update README.md --- README.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 52e0c3c..4cd5422 100644 --- a/README.md +++ b/README.md @@ -11,14 +11,11 @@ _Add a small, informative description of the dataset and provide links to any pa :+1: **Featured Resources** - +- :new:[FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel. - :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. - :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. - - - **Browse the entire catalog...** :raising_hand:**Note**: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo. @@ -157,6 +154,7 @@ Benchmarks spanning multiple tasks. ### Parallel Translation Corpus - [Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. +- [FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel. 
- [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel): Largest en-hi parallel corpora in public domain (about 1.5 million segments) - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. From 6567da5679a1038317ede2d40e1a4f987ffab868 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Fri, 18 Jun 2021 23:15:11 +0530 Subject: [PATCH 018/119] Update README.md --- README.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 4cd5422..99e28d1 100644 --- a/README.md +++ b/README.md @@ -11,10 +11,11 @@ _Add a small, informative description of the dataset and provide links to any pa :+1: **Featured Resources** +- :new:[Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). - :new:[FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel. - :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. - :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. 
-- [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. +- [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. **Browse the entire catalog...** @@ -51,7 +52,8 @@ _Add a small, informative description of the dataset and provide links to any pa * [Sentence Embeddings](#SentenceEmbeddings) * [Multilingual Word Embeddings](#MultilingualWordEmbeddings) * [Morphanalyzers](#Morphanalyzers) - * [SMT Models](#SMTModels) + * [Translation Models](#TranslationModels) + * [Speech Models](#SpeechModels) * [Speech Corpora](#SpeechCorpora) * [OCR Corpora](#OCRCorpora) * [Multimodal Corpora](#MultimodalCorpora) @@ -314,11 +316,15 @@ Benchmarks spanning multiple tasks. - [AI4Bharat IndicNLP Project](https://github.com/ai4bharat/indicnlp_corpus): Unsupervised morphanalyzers for 10 Indian languages learnt using morfessor. -### SMT Models +### Translation Models +- [IndicTrans](https://indicnlp.ai4bharat.org/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian langauges as well. A total of 110 translation directions are supported. - [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): 110 language pairs - [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi +### Speech Models + +- [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). 
## Speech Corpora - [Microsoft Speech Corpus](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e): Speech corpus for Telugu, Tamil and Gujarati. From 6d0c31a62351b5b17e3eef506ae7c821e43b28bc Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 19 Jun 2021 10:19:07 +0530 Subject: [PATCH 019/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 99e28d1..3558e4e 100644 --- a/README.md +++ b/README.md @@ -14,6 +14,7 @@ _Add a small, informative description of the dataset and provide links to any pa - :new:[Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). - :new:[FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel. - :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. +- :new:[IndicTrans](https://indicnlp.ai4bharat.org/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian langauges as well. A total of 110 translation directions are supported. - :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages. 
From 2976d49d9b16d4a25261fc1660849c3b19386a53 Mon Sep 17 00:00:00 2001 From: Amrith Krishna Date: Sat, 19 Jun 2021 10:30:47 +0100 Subject: [PATCH 020/119] =?UTF-8?q?Added=20V=C4=81ksa=C3=B1caya=E1=B8=A5,?= =?UTF-8?q?=20a=20corpus=20for=20Sanskrit=20ASR?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1. --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 3558e4e..03f6147 100644 --- a/README.md +++ b/README.md @@ -339,6 +339,8 @@ Benchmarks spanning multiple tasks. - [Google Speech Corpus](http://www.openslr.org/resources.php): TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (upto 9 hours each). Resources SLR#63-#66, #78-#79. [(paper)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf) - [CoVoST 2](https://github.com/facebookresearch/covost): Tamil 2 hrs data - [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) - [Download link](https://releases.smc.org.in/msc-reviewed-speech/) +- [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1. 
+ ## OCR Corpora From 41223e44d19690c625f329200ca5e1338de43adb Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 19 Jun 2021 16:55:27 +0530 Subject: [PATCH 021/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 03f6147..9bb4e8b 100644 --- a/README.md +++ b/README.md @@ -339,7 +339,7 @@ Benchmarks spanning multiple tasks. - [Google Speech Corpus](http://www.openslr.org/resources.php): TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (upto 9 hours each). Resources SLR#63-#66, #78-#79. [(paper)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf) - [CoVoST 2](https://github.com/facebookresearch/covost): Tamil 2 hrs data - [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) - [Download link](https://releases.smc.org.in/msc-reviewed-speech/) -- [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1. +- [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1 [(paper)](https://arxiv.org/abs/2106.05852). 
From a3495e11b7c5c31c6846e01308b10b0673a8686f Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Tue, 22 Jun 2021 14:17:56 +0530 Subject: [PATCH 022/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 9bb4e8b..ee49970 100644 --- a/README.md +++ b/README.md @@ -363,3 +363,4 @@ http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf)) - [Odia](https://github.com/shantipriyap/Odia-NLP-Resource-Catalog) - [Tamil](https://narvidhai.github.io/tamil-nlp-catalog/) - [Sinhala](https://lknlp.github.io): [[git repo]](https://github.com/lknlp/lknlp.github.io) + - [Urdu](https://github.com/urduhack/awesome-urdu) From 4795a1c463ea3394dc842aacfde97f27fac6ba2c Mon Sep 17 00:00:00 2001 From: Arijit Date: Thu, 24 Jun 2021 21:06:19 +0530 Subject: [PATCH 023/119] bengali huggingface wav2vec2 based ASR model --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index ee49970..68206a0 100644 --- a/README.md +++ b/README.md @@ -326,6 +326,8 @@ Benchmarks spanning multiple tasks. ### Speech Models - [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). +- [arijitx/wav2vec2-large-xlsr-bengali](https://huggingface.co/arijitx/wav2vec2-large-xlsr-bengali): Pretrained wav2vec2-large-xlsr trained on ~50 hrs(40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM. + ## Speech Corpora - [Microsoft Speech Corpus](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e): Speech corpus for Telugu, Tamil and Gujarati. 
From e59c6e09da65ecc349b929c33980693404c4903e Mon Sep 17 00:00:00 2001 From: Sundar Date: Sun, 29 Aug 2021 09:11:55 +0530 Subject: [PATCH 024/119] fixing link to IIIT-H treebank --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 68206a0..785a56b 100644 --- a/README.md +++ b/README.md @@ -264,7 +264,7 @@ Benchmarks spanning multiple tasks. - [Indian Language Corpora Initiative](http://sanskrit.jnu.ac.in/ilci/index.jsp) - [Universal Dependencies](https://universaldependencies.org/) -- [IIITH Paninian Treebank](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi): POS annotations for hi, bn, kn, ml and mr. +- [IIITH Paninian Treebank](https://kcis.iiit.ac.in/LT/): POS annotations for hi, bn, kn, ml and mr. - [Code Mixed Dataset for Hindi, Bengali and Telugu, ICON 2016 shared task](https://amitavadas.com/Code-Mixing.html) - [JNU-BHLTR Bhojpuri Corpus](https://github.com/shashwatup9k/bho-resources/tree/master/mono-bho-corpus): Bhojpuri corpus of 5000 sentences. - [KMI Magahi Corpus](https://github.com/kmi-linguistics/magahi): @@ -273,7 +273,7 @@ Benchmarks spanning multiple tasks. ### Chunk Corpus - [Indian Language Corpora Initiative](http://sanskrit.jnu.ac.in/ilci/index.jsp) -- [IIITH Paninian Treebank](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi): Chunk annotations for hi, bn, kn, ml and mr. +- [Indian Languages Treebanking Project](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi): Chunk annotations for hi, bn, kn, ml and mr. ### Dependency Parse Corpus From 274635e731cdc5bc7417a2a8d90d9059eba7cc24 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Wed, 15 Sep 2021 12:03:55 +0530 Subject: [PATCH 025/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 68206a0..2551247 100644 --- a/README.md +++ b/README.md @@ -185,7 +185,7 @@ Benchmarks spanning multiple tasks. 
- [EILMT Corpus](http://tdil-dc.in/index.php?searchword=EILMT&searchphrase=all&option=com_search&lang=en) - [QED Corpus](http://alt.qcri.org/resources/qedcorpus): English-Hindi corpus of 43k sentences from the educational domain. - [WikiMatrix Corpus](https://ai.facebook.com/blog/wikimatrix): Mined from Wikipedia, looks noisy. -- [CCMatrix](https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix): Parallel corpus mined from CommonCrawl, looks noisy. +- [CCMatrix](https://github.com/facebookresearch/LASER/tree/master/tasks/CCMatrix): Parallel corpus mined from CommonCrawl, looks noisy ([statmt repo](http://data.statmt.org/cc-matrix)). - [CGNetSwara](http://cgnetswara.org/hindi-gondi-corpus.html): Hindi-Gondi parallel corpus (19k sentence pairs) - [MTEnglish2Odia](https://github.com/soumendrak/MTEnglish2Odia): English-Odia (42k pairs) - [SAP Software Documentation](https://github.com/SAP/software-documentation-data-set-for-machine-translation): test and evaluation set for English-Hindi in the software documentation domain [[paper](https://arxiv.org/abs/2008.04550)] From 6c57e7da370d55cc5feb903f4499c24923b986e8 Mon Sep 17 00:00:00 2001 From: Ritwik Mishra Date: Thu, 6 Jan 2022 19:41:04 +0530 Subject: [PATCH 026/119] Updated coreference dataset URLs --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2551247..29fcece 100644 --- a/README.md +++ b/README.md @@ -286,8 +286,8 @@ Benchmarks spanning multiple tasks. 
### Coreference Corpus -- [IIITH Coreference Anaphora Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi): Hindi -- [IIITH Coreference Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi): Hindi +- [IIITH Coreference Anaphora Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/): Hindi +- [IIITH Coreference Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/): Hindi ## Models From 91e648fea175a524659493992f89cd85f8c50140 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Thu, 6 Jan 2022 21:21:36 +0530 Subject: [PATCH 027/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 2551247..3661443 100644 --- a/README.md +++ b/README.md @@ -229,6 +229,7 @@ Benchmarks spanning multiple tasks. - [SentiWordNet - SAIL](http://amitavadas.com/SAIL/il_res.html) - Hindi, Bangla, Tamil & Telugu - [Dravidian-CodeMix - FIRE 2020](https://dravidian-codemix.github.io/2020/datasets.html) - Tamil & Malayalam - [Bengali Sentiment Analysis - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 8k sentences +- [SentNoB](https://github.com/KhondokerIslam/SentNoB): sentiment dataset for Bangla from 3 domains on user comments containing 15k examples [(Paper)](https://aclanthology.org/2021.findings-emnlp.278.pdf) [(Dataset)](https://www.kaggle.com/cryptexcode/sentnob-sentiment-analysis-in-noisy-bangla-texts) ### Hate Speech and Offensive Comments From a2f6fd3316d0fa98e29012fd54310bc939f5e509 Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Thu, 6 Jan 2022 21:45:56 +0530 Subject: [PATCH 028/119] Update README.md --- README.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/README.md b/README.md index 3661443..c8d754f 100644 --- a/README.md +++ b/README.md @@ -144,6 +144,10 @@ Benchmarks spanning multiple tasks. 
- [MGAD Hindi Analogy dataset](https://github.com/rutrastone/MGAD) - [AI4Bharat Word Frequency Lists](https://github.com/AI4Bharat/indicnlp_corpus#text-corpora): Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus. - [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset +- [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. [(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) + + + ### NER Corpora From ed903d6c7dcc70d18f7d243d7bc4bfa5e0773068 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Thu, 6 Jan 2022 21:52:10 +0530 Subject: [PATCH 029/119] Update README.md --- README.md | 3 --- 1 file changed, 3 deletions(-) diff --git a/README.md b/README.md index c8d754f..d2fdb62 100644 --- a/README.md +++ b/README.md @@ -146,9 +146,6 @@ Benchmarks spanning multiple tasks. - [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset - [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. [(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) - - - ### NER Corpora - [FIRE 2013 AUKBC NER Corpus](http://au-kbc.org/nlp/NER-FIRE2013) From 497418e12916fdda36e84e6071936ddd0da7d7da Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 8 Jan 2022 20:50:50 +0530 Subject: [PATCH 030/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 58a5675..0535b3d 100644 --- a/README.md +++ b/README.md @@ -251,6 +251,7 @@ Benchmarks spanning multiple tasks. 
- [MMQA dataset](https://github.com/deepaknlp/MMQA): Hindi QA dataset described in [this paper](https://www.aclweb.org/anthology/L18-1440.pdf) - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) +- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi.Described in [this paper]https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From f6b6e42edf4473a6e575b2b471fabca94bbffc53 Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 8 Jan 2022 20:55:47 +0530 Subject: [PATCH 031/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0535b3d..0525446 100644 --- a/README.md +++ b/README.md @@ -251,7 +251,7 @@ Benchmarks spanning multiple tasks. - [MMQA dataset](https://github.com/deepaknlp/MMQA): Hindi QA dataset described in [this paper](https://www.aclweb.org/anthology/L18-1440.pdf) - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. 
Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) -- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi.Described in [this paper]https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) +- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi.Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From 682c983d471ef5011b3fed804ea04558ac96f20e Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 8 Jan 2022 20:56:54 +0530 Subject: [PATCH 032/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0525446..ac8421a 100644 --- a/README.md +++ b/README.md @@ -251,7 +251,7 @@ Benchmarks spanning multiple tasks. - [MMQA dataset](https://github.com/deepaknlp/MMQA): Hindi QA dataset described in [this paper](https://www.aclweb.org/anthology/L18-1440.pdf) - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) -- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi.Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) +- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi. 
Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From 4ba8bd924ffb13a386f5a31074ad4c7dbab1e39b Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 8 Jan 2022 21:17:13 +0530 Subject: [PATCH 033/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ac8421a..e28c756 100644 --- a/README.md +++ b/README.md @@ -251,7 +251,7 @@ Benchmarks spanning multiple tasks. - [MMQA dataset](https://github.com/deepaknlp/MMQA): Hindi QA dataset described in [this paper](https://www.aclweb.org/anthology/L18-1440.pdf) - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) -- [HindiRC](https://github.com/erzaliator/HindiRC-Data):A Dataset for Reading Comprehension in Hindi. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) +- [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. 
Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From 82b060fe74cf53333f31f0368aa0e7bec0ed0827 Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Tue, 11 Jan 2022 08:08:34 +0530 Subject: [PATCH 034/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index e28c756..ef80765 100644 --- a/README.md +++ b/README.md @@ -252,6 +252,7 @@ Benchmarks spanning multiple tasks. - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) - [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) +- [HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf). 
[Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From db4a60eb8ce63caa75a1afc626a9e5ff4e6a391d Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Tue, 11 Jan 2022 08:19:17 +0530 Subject: [PATCH 035/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ef80765..1a4d210 100644 --- a/README.md +++ b/README.md @@ -252,7 +252,7 @@ Benchmarks spanning multiple tasks. - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) - [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) -- [HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf). [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) +- [HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) for Hindi consisting of 1k/1k/5k (train/validation/test) split. 
Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From 08363d068bdc27bd1b0abb896d4a2a8aa6e1ab52 Mon Sep 17 00:00:00 2001 From: sangeeta-anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Tue, 11 Jan 2022 08:29:36 +0530 Subject: [PATCH 036/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 1a4d210..862fc28 100644 --- a/README.md +++ b/README.md @@ -252,7 +252,7 @@ Benchmarks spanning multiple tasks. - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) - [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) -- [HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf) +- [IITH HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) for Hindi consisting of 1k/1k/5k (train/validation/test) split. 
Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf) ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From fdf2f4ff567399e5ef587766641761bf77ae0998 Mon Sep 17 00:00:00 2001 From: Ritwik Mishra Date: Fri, 14 Jan 2022 16:34:54 +0530 Subject: [PATCH 037/119] Added Chaii QnA dataset in the QA subsection --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 39893b6..ffe26a9 100644 --- a/README.md +++ b/README.md @@ -252,6 +252,7 @@ Benchmarks spanning multiple tasks. - [XQuAD](https://github.com/deepmind/xquad): testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in [this paper](https://arxiv.org/abs/1910.11856) - [XQA](http://github.com/thunlp/XQA): testset for Tamil QA. Described in [this paper](https://www.aclweb.org/anthology/P19-1227.pdf) - [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) +- [Chaii](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview) a Kaggle challenge which consists of 1104 Questions in Hindi and Tamil. Moreover, [here](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264695) is a good collection of papers on multilingual Question Answering. 
### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From cd3d68263a9eac72234a5a9c0b8e24c49061e4a9 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 24 Jan 2022 08:01:12 +0530 Subject: [PATCH 038/119] Update CONTRIBUTORS.md --- CONTRIBUTORS.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md index 16bdcc3..a28af34 100644 --- a/CONTRIBUTORS.md +++ b/CONTRIBUTORS.md @@ -18,3 +18,4 @@ - Kavya Manohar - Amrith Krishna - Gokul NC +- Ritwik Mishra From 636fc0a908d36aa42efff3d155af10e646d112e0 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Tue, 22 Mar 2022 19:16:47 +0530 Subject: [PATCH 039/119] Update README.md --- README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 501d65d..fbed348 100644 --- a/README.md +++ b/README.md @@ -50,7 +50,7 @@ _Add a small, informative description of the dataset and provide links to any pa * [Co-reference Corpus](#CoreferenceCorpus) * [Models](#Models) * [Word Embeddings](#WordEmbeddings) - * [Sentence Embeddings](#SentenceEmbeddings) + * [Pre-trained Language Models](#PreTrainedLanguageModels) * [Multilingual Word Embeddings](#MultilingualWordEmbeddings) * [Morphanalyzers](#Morphanalyzers) * [Translation Models](#TranslationModels) @@ -304,9 +304,11 @@ Benchmarks spanning multiple tasks. - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) -### Sentence Embeddings +### Pre-trained Language Models -- [AI4Bharat IndicBERT](https://indicnlp.ai4bharat.org/indic-bert): Multilingual ALBERT embeddings spanning 12 languages (including Indian English). +- [AI4Bharat IndicBERT](https://indicnlp.ai4bharat.org/indic-bert): Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English). 
+- [AI4Bharat IndicBART](https://indicnlp.ai4bharat.org/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). +- [MuRIL](https://huggingface.co/google/muril-base-cased): Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding [(paper)](https://arxiv.org/abs/2103.10730). - [BERT Multilingual](https://github.com/google-research/bert): BERT model trained on Wikipedias of many languages (including major Indic languages). - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. - [albert-base-sanskrit](https://huggingface.co/surajp/albert-base-sanskrit): ALBERT-based model trained on Sanskrit Wikipedia. @@ -330,6 +332,7 @@ Benchmarks spanning multiple tasks. ### Speech Models +- [AI4Bharat IndicWav2Vec](https://indicnlp.ai4bharat.org/indicwav2vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0 - [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). - [arijitx/wav2vec2-large-xlsr-bengali](https://huggingface.co/arijitx/wav2vec2-large-xlsr-bengali): Pretrained wav2vec2-large-xlsr trained on ~50 hrs(40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM. 
From de4e0edbf129d0f28f3a2ab12442ba54dc734e4c Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Tue, 22 Mar 2022 20:11:35 +0530 Subject: [PATCH 040/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fbed348..f7eae2d 100644 --- a/README.md +++ b/README.md @@ -27,8 +27,8 @@ _Add a small, informative description of the dataset and provide links to any pa * [Libraries and Tools](#Libraries) * [Evaluation Benchmarks](#Benchmarks) * [Standards](#Standards) -* [Text Corpora](#TextCorpora) * [Unicode Standard](#UnicodeStandard) +* [Text Corpora](#TextCorpora) * [Monolingual Corpus](#MonolingualCorpus) * [Language Identification](#LanguageIdentification) * [Lexical Resources](#LexicalResources) From 057f83a1bec8d8aa3e01ac824978e6400015e6eb Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Tue, 22 Mar 2022 20:16:57 +0530 Subject: [PATCH 041/119] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index f7eae2d..2b555b7 100644 --- a/README.md +++ b/README.md @@ -96,8 +96,10 @@ _Add a small, informative description of the dataset and provide links to any pa Benchmarks spanning multiple tasks. - [AI4Bharat IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue): NLU benchmark for 11 languages. +- [AI4Bharat IndicNLG Suite](https://indicnlp.ai4bharat.org/indicnlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). - [AI4Bharat Text Classification](https://github.com/ai4bharat/indicnlp_corpus#publicly-available-classification-datasets): A compilation of classification datasets for 10 languages. 
+- [WAT 2021 Translation Dataset](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual): Standard train and test sets for translation between English and 10 Indian languages.
 ## Standards

From efacfb947c15b6c738f298135eab7e9ad22aec59 Mon Sep 17 00:00:00 2001
From: Anoop Kunchukuttan
Date: Mon, 2 May 2022 10:14:36 +0530
Subject: [PATCH 042/119] Update README.md

---
 README.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 2b555b7..ba4b999 100644
--- a/README.md
+++ b/README.md
@@ -11,12 +11,12 @@ _Add a small, informative description of the dataset and provide links to any pa
 :+1: **Featured Resources**
-- :new:[Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation).
-- :new:[FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
-- :new:[Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
-- :new:[IndicTrans](https://indicnlp.ai4bharat.org/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian langauges as well. A total of 110 translation directions are supported.
-- :new:[Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata.
-- [AI4Bharat IndicNLPSuite](https://indicnlp.ai4bharat.org): Text corpora, word embeddings, BERT for Indian languages and NLU resources for Indian languages.
+- :new:[AI4Bharat IndicNLG Suite](https://indicnlp.ai4bharat.org/indicnlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available.
+- :new:[AI4Bharat IndicBART](https://indicnlp.ai4bharat.org/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
+- :new:[HiNER](https://huggingface.co/datasets/cfilt/HiNER-original): Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [[paper](https://arxiv.org/abs/2204.13743)]
 - [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines
 - [XL-Sum](https://github.com/csebuetnlp/xl-sum): Extreme Summarization data for many Indian languages
 - [BUILD](https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline): Indian Legal Data Benchmark for rhetorical roles
 **Browse the entire catalog...**
@@ -334,7 +334,7 @@ Benchmarks spanning multiple tasks.
 ### Speech Models
-- [AI4Bharat IndicWav2Vec](https://indicnlp.ai4bharat.org/indicwav2vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0
+- [AI4Bharat IndicWav2Vec](https://indicnlp.ai4bharat.org/indicwav2vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0.
 - [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation).
 - [arijitx/wav2vec2-large-xlsr-bengali](https://huggingface.co/arijitx/wav2vec2-large-xlsr-bengali): Pretrained wav2vec2-large-xlsr trained on ~50 hrs(40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.

From 13b18fd4173088514770ef2c7aeae792af86a360 Mon Sep 17 00:00:00 2001
From: Anoop Kunchukuttan
Date: Sat, 6 Aug 2022 19:21:08 +0530
Subject: [PATCH 043/119] Update CONTRIBUTORS.md

---
 CONTRIBUTORS.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index a28af34..b66c7d5 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -19,3 +19,4 @@
 - Amrith Krishna
 - Gokul NC
 - Ritwik Mishra
+- Sangeeta Rajagopal

From 931a9f2ba9cb21d8232cc8cb6dac78de2c67e7c0 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Sun, 7 Aug 2022 20:28:37 +0530
Subject: [PATCH 044/119] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index ba4b999..bd826ce 100644
--- a/README.md
+++ b/README.md
@@ -156,6 +156,7 @@ Benchmarks spanning multiple tasks.
 - [WikiAnn NER Corpus](https://elisa-ie.github.io/wikiann) (_Noisy_) [DOWNLOAD](https://drive.google.com/drive/folders/1Q-xdT99SeaCghihGa7nRkcXGwRGUIsKN?usp=sharing) (Old broken [LINK](http://nlp.cs.rpi.edu))
 - [IJCNLP 200 NER Corpus](http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5): NER corpora for hi, bn, or, te, ur.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
+- [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
 ### Parallel Translation Corpus

From d7412dd499932c9b1bad274fb8a93f8a2c0f0e1f Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Sun, 7 Aug 2022 20:38:18 +0530
Subject: [PATCH 045/119] Update README.md

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index bd826ce..f87f5b8 100644
--- a/README.md
+++ b/README.md
@@ -339,6 +339,10 @@ Benchmarks spanning multiple tasks.
 - [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation).
 - [arijitx/wav2vec2-large-xlsr-bengali](https://huggingface.co/arijitx/wav2vec2-large-xlsr-bengali): Pretrained wav2vec2-large-xlsr trained on ~50 hrs(40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM.
+### NER
+
+- [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
+
 ## Speech Corpora
 - [Microsoft Speech Corpus](https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e): Speech corpus for Telugu, Tamil and Gujarati.
From 6bb0e8c41a1bab1bcffd16b82e28868f30e2b54f Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Sun, 7 Aug 2022 20:45:51 +0530
Subject: [PATCH 046/119] Update README.md

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index f87f5b8..dfb0bc2 100644
--- a/README.md
+++ b/README.md
@@ -55,6 +55,7 @@ _Add a small, informative description of the dataset and provide links to any pa
 * [Morphanalyzers](#Morphanalyzers)
 * [Translation Models](#TranslationModels)
 * [Speech Models](#SpeechModels)
+ * [NER](#NER)
 * [Speech Corpora](#SpeechCorpora)
 * [OCR Corpora](#OCRCorpora)
 * [Multimodal Corpora](#MultimodalCorpora)

From 66b66369ad0237ac5d5897b8986bc7b016309954 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 17:21:31 +0530
Subject: [PATCH 047/119] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index dfb0bc2..ef45247 100644
--- a/README.md
+++ b/README.md
@@ -158,6 +158,7 @@ Benchmarks spanning multiple tasks.
 - [IJCNLP 200 NER Corpus](http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5): NER corpora for hi, bn, or, te, ur.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
+- [AsNER]: A named entity annotation dataset for low resource Assamese language containing 99k tokens. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf)
 ### Parallel Translation Corpus
@@ -343,6 +344,7 @@ Benchmarks spanning multiple tasks.
 ### NER
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
+- [AsNER]: A Baseline Assamese NER model.Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf)
 ## Speech Corpora

From 26897b6742b50c33438a3259a7d4844c1cbeac67 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 17:31:50 +0530
Subject: [PATCH 048/119] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index ef45247..cc68d3b 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,7 @@ Benchmarks spanning multiple tasks.
 - [IJCNLP 200 NER Corpus](http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5): NER corpora for hi, bn, or, te, ur.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
-- [AsNER]: A named entity annotation dataset for low resource Assamese language containing 99k tokens. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf)
+- [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
 ### Parallel Translation Corpus
@@ -344,7 +344,7 @@ Benchmarks spanning multiple tasks.
 ### NER
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
-- [AsNER]: A Baseline Assamese NER model.Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf)
+- [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A Baseline Assamese NER model.
 ## Speech Corpora

From 4e777b604dc1119351befce831f9f6ff0d7d06dc Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 19:14:16 +0530
Subject: [PATCH 049/119] Update README.md

---
 README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index cc68d3b..9e95ba0 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,8 @@ Benchmarks spanning multiple tasks.
 - [IJCNLP 200 NER Corpus](http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5): NER corpora for hi, bn, or, te, ur.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
-- [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
+- [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
+- [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consistion of 25,000 sentences in Marathi language.Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf)
 ### Parallel Translation Corpus
@@ -345,6 +346,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A Baseline Assamese NER model.
+- [L3Cube-MahaNER-BERT](https://huggingface.co/l3cube-pune/marathi-ner): A 752 million token multilingual BERT model.Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf)
 ## Speech Corpora

From ebe65f96eb91cd6bb0659a7260da4804d39916b7 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 19:16:03 +0530
Subject: [PATCH 050/119] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9e95ba0..6926054 100644
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@ Benchmarks spanning multiple tasks.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
-- [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consistion of 25,000 sentences in Marathi language.Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf)
+- [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consistion of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
 ### Parallel Translation Corpus
@@ -346,7 +346,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A Baseline Assamese NER model.
-- [L3Cube-MahaNER-BERT](https://huggingface.co/l3cube-pune/marathi-ner): A 752 million token multilingual BERT model.Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf)
+- [L3Cube-MahaNER-BERT](https://huggingface.co/l3cube-pune/marathi-ner): A 752 million token multilingual BERT model. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
 ## Speech Corpora

From 7531ed0e04451f3ab822c7d5ec3d5748c9a330aa Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 19:16:56 +0530
Subject: [PATCH 051/119] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6926054..2b185d0 100644
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@ Benchmarks spanning multiple tasks.
 - [a-mma NER data](https://github.com/a-mma/NER_Open_Data)
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
-- [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consistion of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
+- [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
 ### Parallel Translation Corpus

From 73283eac0c2e031b64320424404f0d9262e96904 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 19:55:38 +0530
Subject: [PATCH 052/119] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index 2b185d0..866edfc 100644
--- a/README.md
+++ b/README.md
@@ -160,6 +160,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
 - [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
+- [CFILT HiNER](https://github.com/cfiltnlp/hiner): A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in [this paper](https://arxiv.org/abs/2204.13743)
 ### Parallel Translation Corpus
@@ -347,6 +348,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A Baseline Assamese NER model.
 - [L3Cube-MahaNER-BERT](https://huggingface.co/l3cube-pune/marathi-ner): A 752 million token multilingual BERT model. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
+- [CFILT HiNER](https://github.com/cfiltnlp/hiner#models): Hindi NER models trained on CFILT HiNER dataset. Described in [this paper](https://arxiv.org/abs/2204.13743)
 ## Speech Corpora

From 84144682a12fa385b44394bb4bf8903f72646520 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Tue, 9 Aug 2022 19:58:23 +0530
Subject: [PATCH 053/119] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 866edfc..8baf3ef 100644
--- a/README.md
+++ b/README.md
@@ -160,7 +160,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens.
 - [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
-- [CFILT HiNER](https://github.com/cfiltnlp/hiner): A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in [this paper](https://arxiv.org/abs/2204.13743)
+- [CFILT HiNER](https://github.com/cfiltnlp/hiner): A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in [this paper](https://arxiv.org/abs/2204.13743).
 ### Parallel Translation Corpus
@@ -348,7 +348,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat IndicNER](https://huggingface.co/ai4bharat/IndicNER): NER model for 11 Indic languages.
 - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A Baseline Assamese NER model.
 - [L3Cube-MahaNER-BERT](https://huggingface.co/l3cube-pune/marathi-ner): A 752 million token multilingual BERT model. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf).
-- [CFILT HiNER](https://github.com/cfiltnlp/hiner#models): Hindi NER models trained on CFILT HiNER dataset. Described in [this paper](https://arxiv.org/abs/2204.13743)
+- [CFILT HiNER](https://github.com/cfiltnlp/hiner#models): Hindi NER models trained on CFILT HiNER dataset. Described in [this paper](https://arxiv.org/abs/2204.13743).
 ## Speech Corpora

From addc1808bd4f7b1693cafb15f70dfe1888442c97 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Wed, 10 Aug 2022 12:29:22 +0530
Subject: [PATCH 054/119] Update README.md

---
 README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/README.md b/README.md
index 8baf3ef..b354d65 100644
--- a/README.md
+++ b/README.md
@@ -212,6 +212,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat StoryWeaver Xlit Dataset](http://transliteration.ai4bharat.org/#/resources) - Transliteration datasets for Hindi, Maithili & Konkani
 - [Hindi WikiData Transliteration Pairs](https://trigonaminima.github.io/2019/11/transliteration-wikidata/) - Hindi dataset (90k pairs)
 - [NotAI-tech English-Telugu](https://github.com/notAI-tech/Datasets/tree/master/En-Te_Transliteration): Around 38k word pairs
+- [AI4Bharat Aksharantar](https://indicnlp.ai4bharat.org/aksharantar/): The largest publicly available transliteration dataset for 21 Indic languages consisting of 26M Indic language-English transliteration pairs. Described in [this paper](https://arxiv.org/abs/2205.03018).
 ### Text Classification
@@ -337,6 +338,10 @@ Benchmarks spanning multiple tasks.
 - [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): 110 language pairs
 - [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi
+### Transliteration Models
+
+- [AI4Bharat IndicXlit](https://indicnlp.ai4bharat.org/indic-xlit/): A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018).
+
 ### Speech Models
 - [AI4Bharat IndicWav2Vec](https://indicnlp.ai4bharat.org/indicwav2vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0.

From 230e3a9e8dca24d2a491a12623668920e4e8d759 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Wed, 10 Aug 2022 18:09:53 +0530
Subject: [PATCH 055/119] Update README.md

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index b354d65..9405670 100644
--- a/README.md
+++ b/README.md
@@ -369,6 +369,8 @@ Benchmarks spanning multiple tasks.
 - [CoVoST 2](https://github.com/facebookresearch/covost): Tamil 2 hrs data
 - [SMC Malayalam Speech Corpus](https://blog.smc.org.in/malayalam-speech-corpus/) - [Download link](https://releases.smc.org.in/msc-reviewed-speech/)
 - [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1 [(paper)](https://arxiv.org/abs/2106.05852).
+- [IISc-MILE Kannada ASR Corpus](http://www.openslr.org/126/): Transcribed speech corpus containing ~350 hours of read speech data for training ASR systems for Kannada language. Described in [this paper](https://arxiv.org/abs/2207.13331).
+- [IISc-MILE Tamil ASR Corpus](http://www.openslr.org/127/): Transcribed speech corpus containing ~150 hours of read speech data for training ASR systems for Tamil language. Described in [this paper](https://arxiv.org/abs/2207.13331).

From 31a2a4a91de6a9e91279985b24abe8f1765d60e3 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Sat, 13 Aug 2022 09:12:22 +0530
Subject: [PATCH 056/119] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9405670..5d28dae 100644
--- a/README.md
+++ b/README.md
@@ -212,7 +212,7 @@ Benchmarks spanning multiple tasks.
 - [AI4Bharat StoryWeaver Xlit Dataset](http://transliteration.ai4bharat.org/#/resources) - Transliteration datasets for Hindi, Maithili & Konkani
 - [Hindi WikiData Transliteration Pairs](https://trigonaminima.github.io/2019/11/transliteration-wikidata/) - Hindi dataset (90k pairs)
 - [NotAI-tech English-Telugu](https://github.com/notAI-tech/Datasets/tree/master/En-Te_Transliteration): Around 38k word pairs
-- [AI4Bharat Aksharantar](https://indicnlp.ai4bharat.org/aksharantar/): The largest publicly available transliteration dataset for 21 Indic languages consisting of 26M Indic language-English transliteration pairs. Described in [this paper](https://arxiv.org/abs/2205.03018).
+- [AI4Bharat Aksharantar](https://ai4bharat.iitm.ac.in/aksharantar): The largest publicly available transliteration dataset for 21 Indic languages consisting of 26M Indic language-English transliteration pairs. Described in [this paper](https://arxiv.org/abs/2205.03018).
 ### Text Classification
@@ -340,7 +340,7 @@ Benchmarks spanning multiple tasks.
 ### Transliteration Models
-- [AI4Bharat IndicXlit](https://indicnlp.ai4bharat.org/indic-xlit/): A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018).
+- [AI4Bharat IndicXlit](https://ai4bharat.iitm.ac.in/indic-xlit): A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018).
 ### Speech Models

From ad6541767f4d47f490da4024c3939cddde9ee1b3 Mon Sep 17 00:00:00 2001
From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com>
Date: Sat, 13 Aug 2022 09:40:48 +0530
Subject: [PATCH 057/119] Update README.md

---
 README.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 5d28dae..d2da028 100644
--- a/README.md
+++ b/README.md
@@ -11,8 +11,8 @@ _Add a small, informative description of the dataset and provide links to any pa
 :+1: **Featured Resources**
-- :new:[AI4Bharat IndicNLG Suite](https://indicnlp.ai4bharat.org/indicnlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available.
-- :new:[AI4Bharat IndicBART](https://indicnlp.ai4bharat.org/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
+- :new:[AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available.
+- :new:[AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
 - :new:[HiNER](https://huggingface.co/datasets/cfilt/HiNER-original): Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [[paper](https://arxiv.org/abs/2204.13743)]
 - [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines
 - [XL-Sum](https://github.com/csebuetnlp/xl-sum): Extreme Summarization data for many Indian languages
@@ -96,8 +96,8 @@ _Add a small, informative description of the dataset and provide links to any pa
 Benchmarks spanning multiple tasks.
-- [AI4Bharat IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue): NLU benchmark for 11 languages.
-- [AI4Bharat IndicNLG Suite](https://indicnlp.ai4bharat.org/indicnlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks.
+- [AI4Bharat IndicGLUE](https://ai4bharat.iitm.ac.in/indic-glue): NLU benchmark for 11 languages.
+- [AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks.
 - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
 - [AI4Bharat Text Classification](https://github.com/ai4bharat/indicnlp_corpus#publicly-available-classification-datasets): A compilation of classification datasets for 10 languages.
 - [WAT 2021 Translation Dataset](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual): Standard train and test sets for translation between English and 10 Indian languages.
@@ -112,7 +112,7 @@ Benchmarks spanning multiple tasks.
 ### Monolingual Corpus
-- [AIBharat IndicCorp](https://indicnlp.ai4bharat.org/corpora): contains 8.9 billion tokens from 12 Indian languages (including Indian English).
+- [AIBharat IndicCorp](https://ai4bharat.iitm.ac.in/indic-corp): contains 8.9 billion tokens from 12 Indian languages (including Indian English).
 - [Wikipedia Dumps](https://dumps.wikimedia.org)
 - Common Crawl
 - [OSCAR Corpus](https://traces1.inria.fr/oscar): Released in 2019, large-scaled processed CommonCrawl.
@@ -164,7 +164,7 @@ Benchmarks spanning multiple tasks.
 ### Parallel Translation Corpus
-- [Samanantar Parallel Corpus](https://indicnlp.ai4bharat.org/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
+- [Samanantar Parallel Corpus](https://ai4bharat.iitm.ac.in/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages.
 - [FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel.
 - [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel): Largest en-hi parallel corpora in public domain (about 1.5 million segments)
 - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language).
@@ -306,7 +306,7 @@ Benchmarks spanning multiple tasks.
 ### Word Embeddings
-- [AI4Bharat IndicFT](https://indicnlp.ai4bharat.org/indicft): Fast-text word embeddings for 11 Indian languages.
+- [AI4Bharat IndicFT](https://ai4bharat.iitm.ac.in/indic-ft): Fast-text word embeddings for 11 Indian languages.
 - [FastText CommonCrawl+Wikipedia](https://fasttext.cc/docs/en/crawl-vectors.html)
 - [FastText Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html)
 - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot)
@@ -314,8 +314,8 @@ Benchmarks spanning multiple tasks.
 ### Pre-trained Language Models
-- [AI4Bharat IndicBERT](https://indicnlp.ai4bharat.org/indic-bert): Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
-- [AI4Bharat IndicBART](https://indicnlp.ai4bharat.org/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
+- [AI4Bharat IndicBERT](https://ai4bharat.iitm.ac.in/indic-bert): Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
+- [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English).
 - [MuRIL](https://huggingface.co/google/muril-base-cased): Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding [(paper)](https://arxiv.org/abs/2103.10730).
 - [BERT Multilingual](https://github.com/google-research/bert): BERT model trained on Wikipedias of many languages (including major Indic languages).
 - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
@@ -334,7 +334,7 @@ Benchmarks spanning multiple tasks.
 ### Translation Models
-- [IndicTrans](https://indicnlp.ai4bharat.org/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian langauges as well. A total of 110 translation directions are supported.
+- [IndicTrans](https://ai4bharat.iitm.ac.in/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported. - [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): 110 language pairs - [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi @@ -344,7 +344,7 @@ Benchmarks spanning multiple tasks. ### Speech Models -- [AI4Bharat IndicWav2Vec](https://indicnlp.ai4bharat.org/indicwav2vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0. +- [AI4Bharat IndicWav2Vec](https://ai4bharat.iitm.ac.in/indic-wav-2-vec): Multilingual pre-trained models for 40 Indian languages based on Wav2Vec 2.0. - [Vakyansh CLSRIL-23](https://github.com/Open-Speech-EkStep/vakyansh-models): Pretrained wav2vec2 model trained on 10,000 hours of Speech data in 23 Indic Languages [(documentation)](https://open-speech-ekstep.github.io/) [(experimentation platform)](https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). - [arijitx/wav2vec2-large-xlsr-bengali](https://huggingface.co/arijitx/wav2vec2-large-xlsr-bengali): Pretrained wav2vec2-large-xlsr trained on ~50 hrs(40,000 utterances) of OpenSLR Bengali data. Test WER 32.45% without LM. From 51de10855b70c68ca9071a27522fe08d8371e697 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Aug 2022 09:53:43 +0530 Subject: [PATCH 058/119] Update README.md --- README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/README.md b/README.md index d2da028..ce762a7 100644 --- a/README.md +++ b/README.md @@ -7,9 +7,8 @@ _Please suggest any other resources you may be aware of. Raise a pull request or _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too.
We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ -:new: Added _Evaluation Benchmarks_ sections -:+1: **Featured Resources** +:+1: ##Featured Resources - :new:[AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available. - :new:[AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). From 523b75c4e3010f2a4686d057181f881cad7273ce Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Aug 2022 09:58:21 +0530 Subject: [PATCH 059/119] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index ce762a7..d6b3610 100644 --- a/README.md +++ b/README.md @@ -8,8 +8,8 @@ _Please suggest any other resources you may be aware of. Raise a pull request or _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ -:+1: ##Featured Resources - +## Featured Resources :+1: + - :new:[AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available. - :new:[AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). 
- :new:[HiNER](https://huggingface.co/datasets/cfilt/HiNER-original): Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [[paper](https://arxiv.org/abs/2204.13743)] @@ -17,7 +17,7 @@ _Add a small, informative description of the dataset and provide links to any pa - [XL-Sum](https://github.com/csebuetnlp/xl-sum): Extreme Summarization data for many Indian languages - [BUILD](https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline): Indian Legal Data Benchmark for rhetorical roles -**Browse the entire catalog...** +## Browse the entire catalog... :raising_hand:**Note**: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo. From e39e22e4838d8caf7416411109dc0dde62ef93c1 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Aug 2022 10:08:49 +0530 Subject: [PATCH 060/119] Update README.md --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index d6b3610..3aee726 100644 --- a/README.md +++ b/README.md @@ -10,13 +10,13 @@ _Add a small, informative description of the dataset and provide links to any pa ## Featured Resources :+1: +- :new:[Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part for the [Bhasini mission](https://bhashini.gov.in). - :new:[AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available. -- :new:[AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). 
- :new:[HiNER](https://huggingface.co/datasets/cfilt/HiNER-original): Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [[paper](https://arxiv.org/abs/2204.13743)] -- [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines - [XL-Sum](https://github.com/csebuetnlp/xl-sum): Extreme Summarization data for many Indian languages - [BUILD](https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline): Indian Legal Data Benchmark for rhetorical roles + ## Browse the entire catalog... :raising_hand:**Note**: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo. @@ -69,7 +69,7 @@ _Add a small, informative description of the dataset and provide links to any pa ## Major Indic Language NLP Repositories - +- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca) - [Technology Development for Indian Languages (TDIL)](http://tdil-dc.in) - [Center for Indian Language Technology (CFILT)](http://www.cfilt.iitb.ac.in/) - [Language Technologies Research Center (LTRC)](https://ltrc.iiit.ac.in/download.php) @@ -138,7 +138,7 @@ Benchmarks spanning multiple tasks. - [VarDial 2018 Language Identification Dataset](https://github.com/kmi-linguistics/vardial2018): 5 languages - Hindi, Braj, Awadhi, Bhojpuri, Magahi. -### Lexical Resources +### Lexical Resources and Semantic Similarity - [IndoWordNet](http://www.cfilt.iitb.ac.in/indowordnet/) - [IIIT-Hyderabad Word Similarity Database](https://github.com/syedsarfarazakhtar/Word-Similarity-Datasets-for-Indian-Languages): 7 Indian languages @@ -147,6 +147,7 @@ Benchmarks spanning multiple tasks. 
- [AI4Bharat Word Frequency Lists](https://github.com/AI4Bharat/indicnlp_corpus#text-corpora): Tokens and their frequencies from the AI4Bharat corpus, a large monolingual corpus. - [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset - [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. [(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) +- [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines. ### NER Corpora From 4353f8b51a829d25e3c1121b700def7876349560 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 13 Aug 2022 11:53:28 +0530 Subject: [PATCH 061/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d2da028..ac4856d 100644 --- a/README.md +++ b/README.md @@ -315,7 +315,7 @@ Benchmarks spanning multiple tasks. ### Pre-trained Language Models - [AI4Bharat IndicBERT](https://ai4bharat.iitm.ac.in/indic-bert): Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English). -- [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). +- [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): A multilingual,sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages.Described in [this paper](https://arxiv.org/abs/2109.02903). 
- [MuRIL](https://huggingface.co/google/muril-base-cased): Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding [(paper)](https://arxiv.org/abs/2103.10730). - [BERT Multilingual](https://github.com/google-research/bert): BERT model trained on Wikipedias of many languages (including major Indic languages). - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. From 9eb58b4ea7fa0a38419dee39b278551cbe22a09b Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 13 Aug 2022 12:20:32 +0530 Subject: [PATCH 062/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index ac4856d..c10b3e9 100644 --- a/README.md +++ b/README.md @@ -161,6 +161,7 @@ Benchmarks spanning multiple tasks. - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens. - [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf). - [CFILT HiNER](https://github.com/cfiltnlp/hiner): A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in [this paper](https://arxiv.org/abs/2204.13743). 
+- [MultiCoNER](https://multiconer.github.io/): A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages(including dataset for Indic languages Hindi and Bangla) representing three domains(wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets.The NER tag-set consists of six classes viz.:PER,LOC,CORP,GRP,PROD and CW. Described in [this paper](https://aclanthology.org/2022.semeval-1.196.pdf). ### Parallel Translation Corpus From 3ca7fbf37db92d38a8715c8ba8fdc3095c465a67 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 13 Aug 2022 12:26:55 +0530 Subject: [PATCH 063/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c10b3e9..8025f66 100644 --- a/README.md +++ b/README.md @@ -161,7 +161,7 @@ Benchmarks spanning multiple tasks. - [AsNER](https://arxiv.org/ftp/arxiv/papers/2207/2207.03422.pdf): A named entity annotation dataset for low resource Assamese language containing 99k tokens. - [L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf). - [CFILT HiNER](https://github.com/cfiltnlp/hiner): A large Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens. Described in [this paper](https://arxiv.org/abs/2204.13743). 
-- [MultiCoNER](https://multiconer.github.io/): A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages(including dataset for Indic languages Hindi and Bangla) representing three domains(wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets.The NER tag-set consists of six classes viz.:PER,LOC,CORP,GRP,PROD and CW. Described in [this paper](https://aclanthology.org/2022.semeval-1.196.pdf). +- [MultiCoNER](https://multiconer.github.io/): A multilingual complex Named Entity Recognition dataset composed of 2.3 million instances for 11 languages (including datasets for the Indic languages Hindi and Bangla) representing three domains (wiki sentences, questions, and search queries) plus multilingual and code-mixed subsets. The NER tag-set consists of six classes, viz.: PER, LOC, CORP, GRP, PROD and CW. Described in [this paper](https://aclanthology.org/2022.semeval-1.196.pdf). ### Parallel Translation Corpus From 1d13bc72bba7a649292c491c7e180d33f736b2d5 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sat, 13 Aug 2022 12:28:32 +0530 Subject: [PATCH 064/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 8025f66..d39dd71 100644 --- a/README.md +++ b/README.md @@ -316,7 +316,7 @@ Benchmarks spanning multiple tasks. ### Pre-trained Language Models - [AI4Bharat IndicBERT](https://ai4bharat.iitm.ac.in/indic-bert): Multilingual ALBERT based embeddings spanning 12 languages for Natural Language Understanding (including Indian English). -- [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): A multilingual,sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages.Described in [this paper](https://arxiv.org/abs/2109.02903).
+- [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in [this paper](https://arxiv.org/abs/2109.02903). - [MuRIL](https://huggingface.co/google/muril-base-cased): Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding [(paper)](https://arxiv.org/abs/2103.10730). - [BERT Multilingual](https://github.com/google-research/bert): BERT model trained on Wikipedias of many languages (including major Indic languages). - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. From e3d8558ee50666407d9da7daf47394c55453e666 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Aug 2022 12:39:01 +0530 Subject: [PATCH 065/119] Update README.md --- README.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index 3aee726..f86667c 100644 --- a/README.md +++ b/README.md @@ -10,11 +10,10 @@ _Add a small, informative description of the dataset and provide links to any pa ## Featured Resources :+1: -- :new:[Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part for the [Bhasini mission](https://bhashini.gov.in). -- :new:[AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. Pre-trained models ara also available.
-- :new:[HiNER](https://huggingface.co/datasets/cfilt/HiNER-original): Large manually annotated NER dataset for Hindi (100k sentences, 2m+ tokens) [[paper](https://arxiv.org/abs/2204.13743)] -- [XL-Sum](https://github.com/csebuetnlp/xl-sum): Extreme Summarization data for many Indian languages -- [BUILD](https://github.com/Legal-NLP-EkStep/rhetorical-role-baseline): Indian Legal Data Benchmark for rhetorical roles +- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the [Bhashini mission](https://bhashini.gov.in). +- :new:[BLOOM](https://huggingface.co/bigscience/bloom): GPT-3-like multilingual transformer-decoder language model (includes major Indic languages). +- :new:[AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages. +- :new:[L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf). ## Browse the entire catalog... @@ -318,6 +317,8 @@ Benchmarks spanning multiple tasks. - [AI4Bharat IndicBART](https://ai4bharat.iitm.ac.in/indic-bart): Multilingual mBART based embeddings spanning 12 languages for Natural Language Generation (including Indian English). - [MuRIL](https://huggingface.co/google/muril-base-cased): Multilingual mBERT based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding [(paper)](https://arxiv.org/abs/2103.10730). - [BERT Multilingual](https://github.com/google-research/bert): BERT model trained on Wikipedias of many languages (including major Indic languages).
+- [mBART50](https://huggingface.co/facebook/mbart-large-50): seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages). +- [BLOOM](https://huggingface.co/bigscience/bloom): GPT-3-like multilingual transformer-decoder language model (includes major Indic languages). - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. - [albert-base-sanskrit](https://huggingface.co/surajp/albert-base-sanskrit): ALBERT-based model trained on Sanskrit Wikipedia. - [RoBERTa-hindi-guj-san](https://huggingface.co/surajp/RoBERTa-hindi-guj-san): Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati. From d6518094b06036614482ce3455eddbbbf74adf08 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sat, 13 Aug 2022 12:43:16 +0530 Subject: [PATCH 066/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index ed4f312..bc1a7ff 100644 --- a/README.md +++ b/README.md @@ -72,7 +72,7 @@ _Add a small, informative description of the dataset and provide links to any pa - [Technology Development for Indian Languages (TDIL)](http://tdil-dc.in) - [Center for Indian Language Technology (CFILT)](http://www.cfilt.iitb.ac.in/) - [Language Technologies Research Center (LTRC)](https://ltrc.iiit.ac.in/download.php) -- [AI4Bharat IndicNLP](https://indicnlp.ai4bharat.org) +- [AI4Bharat IndicNLP](https://ai4bharat.iitm.ac.in/) - [Linguistic Data Consortium For Indian Languages (LDCIL)](https://data.ldcil.org) - [University of Hyderabad - Sanskrit NLP](http://sanskrit.uohyd.ac.in/scl) - [National Platform for Language Technology](https://nplt.in/demo/index.php?route=product/category&path=75_59&limit=100) From a918677199c7322ba10b83ebc12bfe82621da35e Mon Sep 17 00:00:00 2001 From: Kaushal Bhogale Date: Sat, 13 Aug 2022 17:51:52 +0530 Subject: [PATCH 067/119] Add speech corpus ---
README.md | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index bc1a7ff..3f36ae2 100644 --- a/README.md +++ b/README.md @@ -364,7 +364,6 @@ Benchmarks spanning multiple tasks. - [AccentDB](https://accentdb.org/): Database of Indian English accents from native speakers in Bangla, Malayalam, Telugu and Oriya. - [IIT Madras TTS database](https://www.iitm.ac.in/donlab/tts/index.php) - [BABEL Speech Corpus](https://en.wikipedia.org/wiki/BABEL_Speech_Corpus): includes some Indian languages -- [Pratham ASER dataset](https://github.com/shashwatup9k/bho-resources): Dataset for research on reading level assessment. - [WikiPron](https://pypi.org/project/wikipron/): Words and their pronunciations in IPA mined from Wiktionary. Includes Indian languages. [paper](https://www.aclweb.org/anthology/2020.lrec-1.521) - [CVIT IndicSpeech](http://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages): TTS data for 3 Indian languages: Malayalam, Bengali and Hindi (24 hours each). - [Google Speech Corpus](http://www.openslr.org/resources.php): TTS data for 6 Indian languages: Malayalam, Marathi, Telugu, Kannada, Gujarati, Tamil (upto 9 hours each). Resources SLR#63-#66, #78-#79. [(paper)](http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.800.pdf) @@ -373,6 +372,19 @@ Benchmarks spanning multiple tasks. - [Vāksañcayaḥ Sanskrit Speech Corpus](https://github.com/cyfer0618/Vaksanca) : 78 hours of speech corpus in Sanskrit prose, with a speaker disjoint splits of train, dev and test. It also contains an additional out of domain test data with speakers having pronunciation influences from L1 [(paper)](https://arxiv.org/abs/2106.05852). - [IISc-MILE Kannada ASR Corpus](http://www.openslr.org/126/): Transcribed speech corpus containing ~350 hours of read speech data for training ASR systems for Kannada language. Described in [this paper](https://arxiv.org/abs/2207.13331). 
- [IISc-MILE Tamil ASR Corpus](http://www.openslr.org/127/): Transcribed speech corpus containing ~150 hours of read speech data for training ASR systems for Tamil language. Described in [this paper](https://arxiv.org/abs/2207.13331). +- [MUCS 2021 Dataset](https://navana-tech.github.io/MUCS2021/data.html): (Gujarati, Hindi, Marathi, Odia, Tamil, Telugu) Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages +- [Gramvaani](https://sites.google.com/view/gramvaaniasrchallenge/dataset): 100 hours of labelled data and 1000 hours of pretraining data for Hindi +- [Kashmiri Data Corpus](https://www.openslr.org/122/): Collection of transcribed Kashmiri recordings taken from native speakers +- [Hindi-Tamil-English ASR Challenge](https://sites.google.com/view/indian-language-asrchallenge/home): 490 hours of transcribed speech data in three Indian Languages +- [Large Sinhala ASR training data set](http://openslr.org/52): Sinhala ASR training data set containing ~185K utterances +- [Large Bengali ASR training data set](http://openslr.org/53): Bengali ASR training data set containing ~196K utterances +- [Large Nepali ASR training data set](http://openslr.org/54): Nepali ASR training data set containing ~157K utterances +- [Crowdsourced high-quality Gujarati multi-speaker speech data set](https://www.openslr.org/78/): Contains recordings of native speakers of Gujarati +- [Crowdsourced high-quality Kannada multi-speaker speech data set](https://www.openslr.org/79/): Contains recordings of native speakers of Kannada +- [Crowdsourced high-quality Malayalam multi-speaker speech data set](https://www.openslr.org/63/): Contains recordings of native speakers of Malayalam +- [Crowdsourced high-quality Marathi multi-speaker speech data set](https://www.openslr.org/64/): Contains recordings of native speakers of Marathi +- [Crowdsourced high-quality Tamil multi-speaker speech data set](https://www.openslr.org/65/): Contains recordings of native speakers of Tamil
+- [Crowdsourced high-quality Telugu multi-speaker speech data set](https://www.openslr.org/66/): Contains recordings of native speakers of Telugu From 689421d6230a09e0ca439418b258fea8bd4594bf Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:20:02 +0530 Subject: [PATCH 068/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 3f36ae2..ac25568 100644 --- a/README.md +++ b/README.md @@ -272,6 +272,7 @@ Benchmarks spanning multiple tasks. ### Information Extraction - [EventXtract-IL](http://78.46.86.133/EventXtractionIL-FIRE2018): Event extraction for Tamil and Hindi. Described in [this paper](http://ceur-ws.org/Vol-2266/T5-1.pdf). - [EDNIL-FIRE2020]https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in [this paper](http://ceur-ws.org/Vol-2266/T5-1.pdf). +- [Amazon MASSIVE](https://github.com/alexa/massive): A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in [this paper](https://arxiv.org/abs/2204.08582). 
### POS Tagged corpus From a9dd2200bd6c7b0ab8053f2774ebf1f2dbf96a44 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:35:51 +0530 Subject: [PATCH 069/119] Update README.md --- README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/README.md b/README.md index ac25568..9841ca4 100644 --- a/README.md +++ b/README.md @@ -46,6 +46,7 @@ _Add a small, informative description of the dataset and provide links to any pa * [Chunk Corpus](#ChunkCorpus) * [Dependency Parse Corpus](#DependencyParseCorpus) * [Co-reference Corpus](#CoreferenceCorpus) + * [Summarization](#Summarization) * [Models](#Models) * [Word Embeddings](#WordEmbeddings) * [Pre-trained Language Models](#PreTrainedLanguageModels) @@ -303,6 +304,10 @@ Benchmarks spanning multiple tasks. - [IIITH Coreference Anaphora Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/): Hindi - [IIITH Coreference Annotated Data](https://ltrc.iiit.ac.in/showfile.php?filename=downloads/kolhi/): Hindi +### Summarization + +- [XL-Sum](https://github.com/csebuetnlp/xl-sum): A Large-Scale Multilingual Abstractive Summarization for 44 Languages with a comprehensive and diverse dataset comprising of 1 million professionally annotated article-summary pairs from BBC. Described in [this paper](https://arxiv.org/abs/2106.13822). 
+ ## Models ### Word Embeddings From 572fd7e889b7b2de1d0bb7d445082c9cac67acfc Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:50:06 +0530 Subject: [PATCH 070/119] Update README.md --- README.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/README.md b/README.md index 9841ca4..fcdc589 100644 --- a/README.md +++ b/README.md @@ -47,6 +47,7 @@ _Add a small, informative description of the dataset and provide links to any pa * [Dependency Parse Corpus](#DependencyParseCorpus) * [Co-reference Corpus](#CoreferenceCorpus) * [Summarization](#Summarization) + * [Data to Text](#Data to Text) * [Models](#Models) * [Word Embeddings](#WordEmbeddings) * [Pre-trained Language Models](#PreTrainedLanguageModels) @@ -308,6 +309,10 @@ Benchmarks spanning multiple tasks. - [XL-Sum](https://github.com/csebuetnlp/xl-sum): A Large-Scale Multilingual Abstractive Summarization for 44 Languages with a comprehensive and diverse dataset comprising of 1 million professionally annotated article-summary pairs from BBC. Described in [this paper](https://arxiv.org/abs/2106.13822). 
+### Models ### Word Embeddings From b840254a4feb214d769f352ed7c4ea5709dd2aaf Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:51:31 +0530 Subject: [PATCH 071/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fcdc589..0df6745 100644 --- a/README.md +++ b/README.md @@ -47,7 +47,7 @@ _Add a small, informative description of the dataset and provide links to any pa * [Dependency Parse Corpus](#DependencyParseCorpus) * [Co-reference Corpus](#CoreferenceCorpus) * [Summarization](#Summarization) - * [Data to Text](#Data to Text) + * [Data to Text](#DatatoText) * [Models](#Models) * [Word Embeddings](#WordEmbeddings) * [Pre-trained Language Models](#PreTrainedLanguageModels) From 7e2626bb867830f30994a835c92c1fb69c1ce7bf Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:54:53 +0530 Subject: [PATCH 072/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 0df6745..72e97e8 100644 --- a/README.md +++ b/README.md @@ -309,7 +309,7 @@ Benchmarks spanning multiple tasks. - [XL-Sum](https://github.com/csebuetnlp/xl-sum): A Large-Scale Multilingual Abstractive Summarization for 44 Languages with a comprehensive and diverse dataset comprising of 1 million professionally annotated article-summary pairs from BBC. Described in [this paper](https://arxiv.org/abs/2106.13822). -### Data to Text -[XAlign](https://github.com/tushar117/XAlign): Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages comprising of a high quality XF2T dataset in 7 languages: Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali, and monolingual dataset in English. The dataset is available upon request. Described in [this paper](https://arxiv.org/abs/2202.00291). 
From db4eaf9f83525ce4ccd3a064fbf8342a7d8ce8b4 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 12:55:57 +0530 Subject: [PATCH 073/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 72e97e8..555d51a 100644 --- a/README.md +++ b/README.md @@ -311,7 +311,7 @@ Benchmarks spanning multiple tasks. ### Data to Text --[XAlign](https://github.com/tushar117/XAlign): Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages comprising of a high quality XF2T dataset in 7 languages: Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali, and monolingual dataset in English. The dataset is available upon request. Described in [this paper](https://arxiv.org/abs/2202.00291). +- [XAlign](https://github.com/tushar117/XAlign): Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages comprising of a high quality XF2T dataset in 7 languages: Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali, and monolingual dataset in English. The dataset is available upon request. Described in [this paper](https://arxiv.org/abs/2202.00291). ## Models From f8da4f4891def0827869174f358601dec1ca8f17 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 13:06:56 +0530 Subject: [PATCH 074/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 555d51a..35d2cb5 100644 --- a/README.md +++ b/README.md @@ -285,6 +285,7 @@ Benchmarks spanning multiple tasks. - [JNU-BHLTR Bhojpuri Corpus](https://github.com/shashwatup9k/bho-resources/tree/master/mono-bho-corpus): Bhojpuri corpus of 5000 sentences. 
- [KMI Magahi Corpus](https://github.com/kmi-linguistics/magahi): - [KMI Awadhi Corpus](https://github.com/kmi-linguistics/awadhi): +- [Tham Khasi Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0321/#): An annotated Khasi POS tagged corpus containing 83,312 words, 4,386 sentences, 5,465 word types which amounts to 94,651 tokens (including punctuations). ### Chunk Corpus From 5ef8f7badaffbdfdecf5e4389ce68f6e4beaffdf Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 16:54:20 +0530 Subject: [PATCH 075/119] Update README.md --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 35d2cb5..b6006ef 100644 --- a/README.md +++ b/README.md @@ -335,7 +335,8 @@ Benchmarks spanning multiple tasks. - [iNLTK](https://github.com/goru001/inltk): ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles. - [albert-base-sanskrit](https://huggingface.co/surajp/albert-base-sanskrit): ALBERT-based model trained on Sanskrit Wikipedia. - [RoBERTa-hindi-guj-san](https://huggingface.co/surajp/RoBERTa-hindi-guj-san): Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati. -- [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets +- [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets. +- [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla.Two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction are made available. Described in [this paper](https://arxiv.org/abs/2101.00204). 
### Multilingual Word Embeddings From 7751d154356cab234cc0c0dcdebeab0aa3d0673b Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 17:42:49 +0530 Subject: [PATCH 076/119] Update README.md --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index b6006ef..d8258ed 100644 --- a/README.md +++ b/README.md @@ -134,6 +134,7 @@ Benchmarks spanning multiple tasks. - [KMI Awadhi Corpus](https://github.com/kmi-linguistics/awadhi): - [SMC Malayalam text corpus](https://gitlab.com/smc/corpus) - [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/0B7LLASJiB2m6cDVzbnNjUVZ5dUE): Telugu corpus of 280M tokens and 23M sentences. +- [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. ### Language Identification @@ -202,6 +203,7 @@ Benchmarks spanning multiple tasks. - [BUET English-Bangla Corpus, EMNLP-2020](https://github.com/csebuetnlp/banglanmt): 2.7M sentences (has overlaps with OPUS) - [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. - [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. +- [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. ### Parallel Transliteration Corpus @@ -322,7 +324,7 @@ Benchmarks spanning multiple tasks. 
- [FastText CommonCrawl+Wikipedia](https://fasttext.cc/docs/en/crawl-vectors.html) - [FastText Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html) - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) - +- [EM-FT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first FastText word embedding available for Manipuri language trained on 1,880,035 Manipuri sentences. ### Pre-trained Language Models @@ -337,6 +339,7 @@ Benchmarks spanning multiple tasks. - [RoBERTa-hindi-guj-san](https://huggingface.co/surajp/RoBERTa-hindi-guj-san): Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati. - [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets. - [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla.Two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction are made available. Described in [this paper](https://arxiv.org/abs/2101.00204). +- [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. ### Multilingual Word Embeddings From 47277d70a9dbc338b360dbc1bbe7b62a53798476 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 17:55:34 +0530 Subject: [PATCH 077/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index d8258ed..bad6c7a 100644 --- a/README.md +++ b/README.md @@ -135,6 +135,7 @@ Benchmarks spanning multiple tasks. 
- [SMC Malayalam text corpus](https://gitlab.com/smc/corpus) - [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/0B7LLASJiB2m6cDVzbnNjUVZ5dUE): Telugu corpus of 280M tokens and 23M sentences. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. +- [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. ### Language Identification From ffa7e41b34e9e4d4cb6524f62f85aafbd4dea4e3 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 18:05:05 +0530 Subject: [PATCH 078/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index bad6c7a..079df7c 100644 --- a/README.md +++ b/README.md @@ -205,6 +205,7 @@ Benchmarks spanning multiple tasks. - [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. - [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. +- [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. 
Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). ### Parallel Transliteration Corpus From 4b2b26eae331ce162120f609a0516cd1062f3d64 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 20:15:12 +0530 Subject: [PATCH 079/119] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 079df7c..952a14f 100644 --- a/README.md +++ b/README.md @@ -135,7 +135,8 @@ Benchmarks spanning multiple tasks. - [SMC Malayalam text corpus](https://gitlab.com/smc/corpus) - [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/0B7LLASJiB2m6cDVzbnNjUVZ5dUE): Telugu corpus of 280M tokens and 23M sentences. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. -- [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. +- [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. +- [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. ### Language Identification @@ -225,6 +226,7 @@ Benchmarks spanning multiple tasks. - [BBC news articles classification dataset](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1): 14 class classification - [iNLTK News Headlines classification](https://github.com/goru001/inltk): Datasets for multiple Indian languages. - [AI4Bharat IndicNLP News Articles](https://github.com/ai4bharat/indicnlp_corpus): Word embeddings for 10 Indian languages. 
+- [KMI Linguistics TRAC - 1](https://github.com/kmi-linguistics/trac-1): Contains aggression-annotated dataset (in English and Hindi) for the Shared Task on Aggression Identification during First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING - 2018. ### Textual Entailment/Natural Language Inference From 36863f40466b4e2247013ac929b406fc8cffe861 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 20:42:50 +0530 Subject: [PATCH 080/119] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 952a14f..2eb96a3 100644 --- a/README.md +++ b/README.md @@ -231,6 +231,7 @@ Benchmarks spanning multiple tasks. ### Textual Entailment/Natural Language Inference - [XNLI corpus](https://github.com/facebookresearch/XNLI): Hindi and Urdu test sets and machine translated training sets (from English MultiNLI). +- [csebuetnlp Bangla NLI](https://huggingface.co/datasets/csebuetnlp/xnli_bn): A Natural Language Inference (NLI) dataset for Bengali. Described in [this paper](https://arxiv.org/abs/2101.00204). ### Paraphrase @@ -270,6 +271,7 @@ Benchmarks spanning multiple tasks. - [HindiRC](https://github.com/erzaliator/HindiRC-Data): A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in [this paper](https://www.researchgate.net/publication/342424208_HindiRC_A_Dataset_for_Reading_Comprehension_in_Hindi) - [IITH HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf) - [Chaii](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview) a Kaggle challenge which consists of 1104 Questions in Hindi and Tamil. 
Moreover, [here](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264695) is a good collection of papers on multilingual Question Answering. +- [csebuetnlp Bangla QA](https://huggingface.co/datasets/csebuetnlp/squad_bn): A Question Answering (QA) dataset for Bengali. Described in [this paper](https://arxiv.org/abs/2101.00204). ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) @@ -342,7 +344,7 @@ Benchmarks spanning multiple tasks. - [albert-base-sanskrit](https://huggingface.co/surajp/albert-base-sanskrit): ALBERT-based model trained on Sanskrit Wikipedia. - [RoBERTa-hindi-guj-san](https://huggingface.co/surajp/RoBERTa-hindi-guj-san): Multilingual RoBERTa like model trained on Hindi, Sanskrit and Gujarati. - [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets. -- [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla.Two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction are made available. Described in [this paper](https://arxiv.org/abs/2101.00204). +- [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in [this paper](https://arxiv.org/abs/2101.00204). - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. 
### Multilingual Word Embeddings From 04d4f7059e5c42857d2eba7b02fbc84df44257d8 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Sun, 14 Aug 2022 23:49:32 +0530 Subject: [PATCH 081/119] Update README.md --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 2eb96a3..a93406d 100644 --- a/README.md +++ b/README.md @@ -206,7 +206,8 @@ Benchmarks spanning multiple tasks. - [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. - [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. -- [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). +- [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). +- [IIITH en-hi-codemixed-corpus](https://github.com/mrinaldhar/en-hi-codemixed-corpus): A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in [this paper](https://aclanthology.org/W18-3817.pdf). 
### Parallel Transliteration Corpus From 8d566d736b640e8b6ed3f599b08ba64fa8ec8798 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sun, 14 Aug 2022 23:59:21 +0530 Subject: [PATCH 082/119] Update README.md --- README.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 2eb96a3..7163ac2 100644 --- a/README.md +++ b/README.md @@ -97,7 +97,7 @@ _Add a small, informative description of the dataset and provide links to any pa Benchmarks spanning multiple tasks. - [AI4Bharat IndicGLUE](https://ai4bharat.iitm.ac.in/indic-glue): NLU benchmark for 11 languages. -- [AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks. +- [AI4Bharat IndicNLG Suite](https://ai4bharat.iitm.ac.in/indic-nlg-suite): NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation. - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). - [AI4Bharat Text Classification](https://github.com/ai4bharat/indicnlp_corpus#publicly-available-classification-datasets): A compilation of classification datasets for 10 languages. - [WAT 2021 Translation Dataset](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual): Standard train and test sets for translation between English and 10 Indian languages. @@ -132,11 +132,11 @@ Benchmarks spanning multiple tasks. - [JNU-BHLTR Bhojpuri Corpus](https://github.com/shashwatup9k/bho-resources/tree/master/mono-bho-corpus): Bhojpuri corpus of 45k sentences. 
- [KMI Magahi Corpus](https://github.com/kmi-linguistics/magahi): - [KMI Awadhi Corpus](https://github.com/kmi-linguistics/awadhi): +- [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [SMC Malayalam text corpus](https://gitlab.com/smc/corpus) - [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/0B7LLASJiB2m6cDVzbnNjUVZ5dUE): Telugu corpus of 280M tokens and 23M sentences. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. - [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. -- [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. ### Language Identification @@ -152,6 +152,7 @@ Benchmarks spanning multiple tasks. - [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset - [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. [(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) - [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines. +- [Toxicity-200](https://github.com/facebookresearch/flores/blob/main/toxicity): Toxicity lists for 200 languages including 26 Indian languages.
### NER Corpora @@ -171,6 +172,7 @@ Benchmarks spanning multiple tasks. - [Samanantar Parallel Corpus](https://ai4bharat.iitm.ac.in/samanantar): Largest parallel corpus for English and 11 Indian languages. It comprises 46m sentence pairs between English-Indian languages and 82m sentence pairs between Indian languages. - [FLORES-101](https://github.com/facebookresearch/flores): Human translated evaluation sets for 101 languages released by Facebook. It includes 14 Indic languages. The testsets are n-way parallel. +- [FLORES-200](https://github.com/facebookresearch/flores/tree/main/flores200): Human translated evaluation sets for 200 languages released by Facebook. It includes 24 Indic languages. The testsets are n-way parallel. - [IIT Bombay English-Hindi Parallel Corpus](http://www.cfilt.iitb.ac.in/iitb_parallel): Largest en-hi parallel corpora in public domain (about 1.5 million segments) - [CVIT-IIITH PIB Multilingual Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/pib-v0.tar): Mined from Press Information Bureau for many Indian languages. Contains both English-IL and IL-IL corpora (IL=Indian language). - [CVIT-IIITH Mann ki Baat Corpus](http://preon.iiit.ac.in/~jerin/resources/datasets/mkb-v0.tar): Mined from Indian PM Narendra Modi's _Mann ki Baat_ speeches. @@ -206,6 +208,8 @@ Benchmarks spanning multiple tasks. - [CLE Parallel Corpus](https://www.cle.org.pk/software/ling_resources/UrduNepaliEnglishParallelCorpus.htm): Parallel corpus for English, Urdu and Nepali. - [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. 
+- [NLLB-Seed](https://github.com/facebookresearch/flores/tree/main/nllb_seed): Small human-translated parallel corpora from Wikipedia articles for very low-resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chhattisgarhi. +- [NLLB-MD](https://github.com/facebookresearch/flores/tree/main/nllb_seed): NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Covers Bhojpuri among Indian languages. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). ### Parallel Transliteration Corpus From 2db3c4b0f4572d7f9a546ea323eb1f3a53b43d77 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 00:19:54 +0530 Subject: [PATCH 083/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index a93406d..650983b 100644 --- a/README.md +++ b/README.md @@ -208,6 +208,7 @@ Benchmarks spanning multiple tasks. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf).
- [IIITH en-hi-codemixed-corpus](https://github.com/mrinaldhar/en-hi-codemixed-corpus): A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in [this paper](https://aclanthology.org/W18-3817.pdf). +- [CALCS 2021 Eng-Hinglish dataset](https://ritual.uh.edu/lince/datasets): Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in [this paper](https://arxiv.org/pdf/2202.09625.pdf). ### Parallel Transliteration Corpus From 227ed64c0a2850b7a234dbbba19616327e2c7f13 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 00:21:37 +0530 Subject: [PATCH 084/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 650983b..16e156a 100644 --- a/README.md +++ b/README.md @@ -207,7 +207,7 @@ Benchmarks spanning multiple tasks. - [Itihasa Parallel Corpus](https://github.com/rahular/itihasa): 93k parallel sentences between English and Sanskrit from the Ramanyana and Mahabharata. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). -- [IIITH en-hi-codemixed-corpus](https://github.com/mrinaldhar/en-hi-codemixed-corpus): A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in [this paper](https://aclanthology.org/W18-3817.pdf). 
+- [IIIT-H en-hi-codemixed-corpus](https://github.com/mrinaldhar/en-hi-codemixed-corpus): A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in [this paper](https://aclanthology.org/W18-3817.pdf). - [CALCS 2021 Eng-Hinglish dataset](https://ritual.uh.edu/lince/datasets): Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in [this paper](https://arxiv.org/pdf/2202.09625.pdf). ### Parallel Transliteration Corpus From be3220555638451555985c2e68842d713adc9e1c Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 00:22:26 +0530 Subject: [PATCH 085/119] Update README.md --- README.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 7163ac2..81517bb 100644 --- a/README.md +++ b/README.md @@ -350,6 +350,8 @@ Benchmarks spanning multiple tasks. - [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets. - [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in [this paper](https://arxiv.org/abs/2101.00204). - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. +- [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including 26 all major Indic languages). +- [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 26 Indic languages). ### Multilingual Word Embeddings @@ -363,12 +365,13 @@ Benchmarks spanning multiple tasks.
### Translation Models - [IndicTrans](https://ai4bharat.iitm.ac.in/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported. -- [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): 110 language pairs -- [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi +- [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): SMT for 110 language pairs (all pairs between English and 10 Indian languages). +- [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi. +- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#open-sourced-models-and-community-integrations): Models for 200 languages including 26 Indic languages. ### Transliteration Models -- [AI4Bharat IndicXlit](https://ai4bharat.iitm.ac.in/indic-xlit): A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018). +- [AI4Bharat IndicXlit](https://ai4bharat.iitm.ac.in/indic-xlit): A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in [this paper](https://arxiv.org/abs/2205.03018). ### Speech Models From 117ddf4d85875861b4635b01bf4edbc8028fda4a Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 00:28:01 +0530 Subject: [PATCH 086/119] Update README.md --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 81517bb..ace77a2 100644 --- a/README.md +++ b/README.md @@ -328,6 +328,10 @@ Benchmarks spanning multiple tasks.
## Models +### Language Identification + +- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#lid-model): LID for 200 languages including 26 Indic languages. + ### Word Embeddings - [AI4Bharat IndicFT](https://ai4bharat.iitm.ac.in/indic-ft): Fast-text word embeddings for 11 Indian languages. @@ -350,7 +354,7 @@ Benchmarks spanning multiple tasks. - [Bangla-BERT-Base](https://github.com/sagorbrur/bangla-bert): Bengali BERT model trained on Bengali wikipedia and OSCAR datasets. - [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in [this paper](https://arxiv.org/abs/2101.00204). - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. -- [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including 26 all major Indic languages). +- [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [[paper]](https://arxiv.org/abs/2007.01852). - [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 26 Indic languages). ### Multilingual Word Embeddings From 89493d8b763e65226b7ae4cffc6b76c8357fe537 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 00:40:12 +0530 Subject: [PATCH 087/119] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 16e156a..f923f52 100644 --- a/README.md +++ b/README.md @@ -137,6 +137,7 @@ Benchmarks spanning multiple tasks.
- [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. - [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. +- [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). ### Language Identification @@ -208,7 +209,8 @@ Benchmarks spanning multiple tasks. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with parallel data comprising of 124,975 Manipuri-English aligned sentences. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). - [IIIT-H en-hi-codemixed-corpus](https://github.com/mrinaldhar/en-hi-codemixed-corpus): A gold standard parallel corpus consisting of 6096 English-Hindi code-mixed sentences containing a total of 63,913 tokens and monolingual English. Described in [this paper](https://aclanthology.org/W18-3817.pdf). -- [CALCS 2021 Eng-Hinglish dataset](https://ritual.uh.edu/lince/datasets): Eng-Hinglish parallel corpus containing 10k pairs of sentences. 
Described in [this paper](https://arxiv.org/pdf/2202.09625.pdf). +- [CALCS 2021 Eng-Hinglish dataset](https://ritual.uh.edu/lince/datasets): Eng-Hinglish parallel corpus containing 10k pairs of sentences. Described in [this paper](https://arxiv.org/pdf/2202.09625.pdf). +- [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): The corpus contains 27,362 Hindi-Kangri Parallel corpora. Described in [this paper] (https://arxiv.org/abs/2103.11596). ### Parallel Transliteration Corpus From cc865b77770e70e8265281d129fc42f466187d05 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 09:25:27 +0530 Subject: [PATCH 088/119] Update README.md --- README.md | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index c8a131e..f84366e 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,7 @@ -# A Catalog of resources for Indian language NLP +# A Catalog of Resources for Indian Language NLP + +This repository is an attempt to _collaboratively_ build the _most comprehensive_ catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent. _Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:_ @@ -7,14 +9,13 @@ _Please suggest any other resources you may be aware of. Raise a pull request or _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. 
We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ - -## Featured Resources :+1: +## Featured Resources :+1 _(75th Indian Independence Day Special)_: -- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part for the [Bhasini mission](https://bhashini.gov.in). -- :new:[BLOOM](https://huggingface.co/bigscience/bloom): GPT3 like multilingual transformer-decoder language model (includes major Indic languages. -- :new:[AI4Bharat Naamapadam](https://huggingface.co/datasets/ai4bharat/naamapadam): NER dataset for 11 Indic languages. -- :new:[L3Cube-MahaNER](https://github.com/l3cube-pune/MarathiNLP/tree/main/L3Cube-MahaNER): The first major gold standard named entity recognition dataset in Marathi consisting of 25,000 sentences in Marathi language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/workshops/WILDRE6/pdf/2022.wildre6-1.6.pdf). - +- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the [Bhasini mission](https://bhashini.gov.in). You can upload, discover models, datasets and benchmarks here. This is one repository we really need and hope to see this evolving into a standard, large-scale platform for resource discovery and dissemination. +- We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs) , _etc_. 
These are being built using either large-scale mining of web-resource or large human annotation efforts or both. +- As we aim higher, the datasets and models are achieving higher language coverage. While earlier datasets would be available for only a handful of Indian languages, then for 10-12 languages - we are now reaching the next frontier where we are creating resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), IndoWordnet (wordnet/18 languages) spanning almost all languages listed in the Indian constitution and more. Datasets and models spanning a large number of languages. +- Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, _etc_. +- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, KMI Linguistics, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages. ## Browse the entire catalog... @@ -154,7 +155,7 @@ Benchmarks spanning multiple tasks. - [Hindi RG-63](https://github.com/ashwinivd/similarity_hindi): Hindi version of the Rubenstein and Goodenough (RG-65) word similarity dataset - [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. [(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) - [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines. -- [Toxicity-200](https://github.com/facebookresearch/flores/blob/main/toxicity): Toxicity Lists for 200 languges including 26 Indian languages. 
+- [Toxicity-200](https://github.com/facebookresearch/flores/blob/main/toxicity): Toxicity Lists for 200 languages including 27 Indian languages. ### NER Corpora @@ -336,7 +337,7 @@ Benchmarks spanning multiple tasks. ### Language Identification -- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#lid-model): LID for 200 languages including 26 Indic languages. +- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#lid-model): LID for 200 languages including 27 Indic languages. ### Word Embeddings @@ -361,7 +362,7 @@ Benchmarks spanning multiple tasks. - [BanglaBERT](https://github.com/csebuetnlp/banglabert): Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in [this paper](https://arxiv.org/abs/2101.00204). - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. - [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [[paper]](https://arxiv.org/abs/2007.01852). -- [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 26 Indic languges). +- [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languges). ### Multilingual Word Embeddings @@ -377,7 +378,7 @@ Benchmarks spanning multiple tasks. - [IndicTrans](https://ai4bharat.iitm.ac.in/indic-trans): Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian langauges as well. A total of 110 translation directions are supported. 
- [Shata-Anuvaadak](http://www.cfilt.iitb.ac.in/~moses/shata_anuvaadak/): SMT for 110 language pairs (all pairs between English and 10 Indian languages. - [LTRC Vanee](https://ltrc.iiit.ac.in/downloads/tools/Vaanee.tgz): Dependency based Statistical MT system from English to Hindi. -- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#open-sourced-models-and-community-integrations): Models for 200 languages including 26 Indic languages. +- [NLLB-200](https://github.com/facebookresearch/fairseq/tree/nllb#open-sourced-models-and-community-integrations): Models for 200 languages including 27 Indic languages. ### Transliteration Models From 7b88846c3cd28a79b8856de7435f4f5fbed1cfe2 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 10:15:31 +0530 Subject: [PATCH 089/119] Update README.md --- README.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index f84366e..22df3f5 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # A Catalog of Resources for Indian Language NLP -This repository is an attempt to _collaboratively_ build the _most comprehensive_ catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent. +This repository is an attempt to **collaboratively** build the **most comprehensive** catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent. _Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:_ @@ -9,13 +9,13 @@ _Please suggest any other resources you may be aware of. Raise a pull request or _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. 
We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ -## Featured Resources :+1 _(75th Indian Independence Day Special)_: +## :+1: Featured Resources _(75th Indian Independence Day Special)_: -- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): ULCA is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the [Bhasini mission](https://bhashini.gov.in). You can upload, discover models, datasets and benchmarks here. This is one repository we really need and hope to see this evolving into a standard, large-scale platform for resource discovery and dissemination. -- We are seeing the rise of large-scale datasets across many tasks like IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs) , _etc_. These are being built using either large-scale mining of web-resource or large human annotation efforts or both. -- As we aim higher, the datasets and models are achieving higher language coverage. While earlier datasets would be available for only a handful of Indian languages, then for 10-12 languages - we are now reaching the next frontier where we are creating resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages), IndoWordnet (wordnet/18 languages) spanning almost all languages listed in the Indian constitution and more. Datasets and models spanning a large number of languages. -- Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, _etc_. 
-- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, KMI Linguistics, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages. +- [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): **ULCA** is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the [Bhasini mission](https://bhashini.gov.in). You can upload, discover models, datasets and benchmarks here. This is one repository we really need and hope to see this evolving into a standard, large-scale platform for resource discovery and dissemination. +- We are seeing the rise of large-scale datasets across many tasks like **IndicCorp** (text corpus/9 billion tokens), **Samanantar** (parallel corpus/50 million sentence pairs), **Naamapadam** (named entity/5.7 million sentences), **HiNER** (named entity/100k sentences), **Aksharantar** (transliteration/26 million pairs) , _etc_. These are being built using either large-scale mining of web-resource or large human annotation efforts or both. +- As we aim higher, the datasets and models are achieving higher language coverage. While earlier datasets would be available for only a handful of Indian languages, then for 10-12 languages - we are now reaching the next frontier where we are creating resources like **Aksharantar** (transliteration/21 languages), **FLORES-200** (translation/27 languages), **IndoWordNet** (wordnet/18 languages) spanning almost all languages listed in the Indian constitution and more. Datasets and models spanning a large number of languages. +- Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like **Bodo**, **Kangri**, **Khasi**, _etc_. 
+- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like **AI4Bharat**, **KMI**, **L3Cube**, **iNLTK**, **IIT Patna**, _etc_. who are contributing to building resources for Indian languages.
 ## Browse the entire catalog...

From 3bd9bab805ffca932bc60144dc7f36a5ddf9c1ac Mon Sep 17 00:00:00 2001
From: Anoop Kunchukuttan
Date: Mon, 15 Aug 2022 10:18:53 +0530
Subject: [PATCH 090/119] Update CONTRIBUTORS.md

---
 CONTRIBUTORS.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/CONTRIBUTORS.md b/CONTRIBUTORS.md
index b66c7d5..2adc4bc 100644
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -20,3 +20,6 @@
 - Gokul NC
 - Ritwik Mishra
 - Sangeeta Rajagopal
+- Kaushal Bhosale
+- Tahir Javed
+- Maharaja Brahma

From 939377183c8e3bb0c99410639330fc8d5ce8c46d Mon Sep 17 00:00:00 2001
From: Anoop Kunchukuttan
Date: Mon, 15 Aug 2022 10:28:35 +0530
Subject: [PATCH 091/119] Update README.md

---
 README.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 22df3f5..4d39b58 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
-# A Catalog of Resources for Indian Language NLP
+# :bookmark: The Indic NLP Catalog
+_A Catalog of Resources for Indian Language NLP_
-This repository is an attempt to **collaboratively** build the **most comprehensive** catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.
+The **Indic NLP Catalog** repository is an attempt to **collaboratively** build the **most comprehensive** catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.
 _Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog.
Put the proposed entry in the following format:_ From f6543836a998c233741b7151d20668fa926ad0e5 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 11:14:55 +0530 Subject: [PATCH 092/119] Update README.md --- README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 4d39b58..72d30cc 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,9 @@ _Please suggest any other resources you may be aware of. Raise a pull request or _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ -## :+1: Featured Resources _(75th Indian Independence Day Special)_: +## :+1: Featured Resources _(76_$^{ th}$ _Indian Independence Day Special)_: + +Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future. - [Universal Language Contribution API (ULCA)](https://bhashini.gov.in/ulca): **ULCA** is a standard API and open scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the [Bhasini mission](https://bhashini.gov.in). You can upload, discover models, datasets and benchmarks here. This is one repository we really need and hope to see this evolving into a standard, large-scale platform for resource discovery and dissemination. - We are seeing the rise of large-scale datasets across many tasks like **IndicCorp** (text corpus/9 billion tokens), **Samanantar** (parallel corpus/50 million sentence pairs), **Naamapadam** (named entity/5.7 million sentences), **HiNER** (named entity/100k sentences), **Aksharantar** (transliteration/26 million pairs) , _etc_. 
These are being built using either large-scale mining of web-resource or large human annotation efforts or both. From 5c5126b4c42cb1f57e27c40bf847ae4a0b04596d Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Mon, 15 Aug 2022 11:22:09 +0530 Subject: [PATCH 093/119] Update README.md --- README.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 72d30cc..cc80327 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ Indian language NLP has come a long way. We feature a few resources that are ill - We are seeing the rise of large-scale datasets across many tasks like **IndicCorp** (text corpus/9 billion tokens), **Samanantar** (parallel corpus/50 million sentence pairs), **Naamapadam** (named entity/5.7 million sentences), **HiNER** (named entity/100k sentences), **Aksharantar** (transliteration/26 million pairs) , _etc_. These are being built using either large-scale mining of web-resource or large human annotation efforts or both. - As we aim higher, the datasets and models are achieving higher language coverage. While earlier datasets would be available for only a handful of Indian languages, then for 10-12 languages - we are now reaching the next frontier where we are creating resources like **Aksharantar** (transliteration/21 languages), **FLORES-200** (translation/27 languages), **IndoWordNet** (wordnet/18 languages) spanning almost all languages listed in the Indian constitution and more. Datasets and models spanning a large number of languages. - Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like **Bodo**, **Kangri**, **Khasi**, _etc_. -- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like **AI4Bharat**, **KMI**, **L3Cube**, **iNLTK**, **IIT Patna**, _etc_. 
who are contributing to building resources for Indian languages.
+- From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like **AI4Bharat**, **BUET CSE NLP**, **KMI**, **L3Cube**, **iNLTK**, **IIT Patna**, _etc_. who are contributing to building resources for Indian languages.
 ## Browse the entire catalog...
@@ -82,6 +82,10 @@ Indian language NLP has come a long way. We feature a few resources that are ill
 - [Linguistic Data Consortium For Indian Languages (LDCIL)](https://data.ldcil.org)
 - [University of Hyderabad - Sanskrit NLP](http://sanskrit.uohyd.ac.in/scl)
 - [National Platform for Language Technology](https://nplt.in/demo/index.php?route=product/category&path=75_59&limit=100)
+- [BUET CSE NLP Group](https://csebuetnlp.github.io)
+- [KMI Linguistics](https://github.com/kmi-linguistics)
+- [L3Cube](https://github.com/l3cube-pune/MarathiNLP)
+- [IIT Patna](https://www.iitp.ac.in/~ai-nlp-ml/resources.html)
 ## Libraries and Tools

From 61d1d26c31b632c876221d94f44100c7f6cb2f35 Mon Sep 17 00:00:00 2001
From: Anoop Kunchukuttan
Date: Mon, 15 Aug 2022 12:09:30 +0530
Subject: [PATCH 094/119] Update README.md

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index cc80327..bf72818 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # :bookmark: The Indic NLP Catalog
-_A Catalog of Resources for Indian Language NLP_
+_A Collabortive Catalog of Resources for Indic Language NLP_
 The **Indic NLP Catalog** repository is an attempt to **collaboratively** build the **most comprehensive** catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.
 _Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:_
@@ -10,7 +10,7 @@ _Please suggest any other resources you may be aware of. Raise a pull request or
 _Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource.
Mention your name too. We would like to acknowlege your contribution to building this catalog in the [CONTRIBUTORS](CONTRIBUTORS.md) list._ -## :+1: Featured Resources _(76_$^{ th}$ _Indian Independence Day Special)_: +## :+1: Featured Resources Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future. From c684602255a72063c059adc15cfb76b5943890b1 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 12:26:01 +0530 Subject: [PATCH 095/119] Update README.md --- README.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index c8a131e..2e712ca 100644 --- a/README.md +++ b/README.md @@ -139,6 +139,7 @@ Benchmarks spanning multiple tasks. - [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). +- [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here] (https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). ### Language Identification @@ -216,7 +217,8 @@ Benchmarks spanning multiple tasks. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): The corpus contains 27,362 Hindi-Kangri Parallel corpora. Described in [this paper] (https://arxiv.org/abs/2103.11596). 
- [NLLB-Seed](https://github.com/facebookresearch/flores/tree/main/nllb_seed): Small human-translated parallel corpora from Wikipedia articles for very low resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chattisgarhi. - [NLLB-MD](https://github.com/facebookresearch/flores/tree/main/nllb_seed): NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Cover Bhojpuri amongst Indian languages. -- [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). +- [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). +- [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. ### Parallel Transliteration Corpus @@ -345,6 +347,7 @@ Benchmarks spanning multiple tasks. - [FastText Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html) - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) - [EM-FT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first FastText word embedding available for Manipuri language trained on 1,880,035 Manipuri sentences. 
+- [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The FastText embeddings for Sanskrit is available [here](https://drive.google.com/file/d/1k5INFw9oaxV7yoWRg0qscmcFrOHVhdzW/view?usp=sharing) and for Hindi [here](https://drive.google.com/file/d/1Md9N7Ux2P9JCky1_9RgL2KjXRGb_lpXj/view?usp=sharing). ### Pre-trained Language Models From b7fb4833701423cd765cc0286d24ce2f59367a5c Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 12:30:52 +0530 Subject: [PATCH 096/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 2e712ca..b6ae7b3 100644 --- a/README.md +++ b/README.md @@ -139,7 +139,7 @@ Benchmarks spanning multiple tasks. - [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). -- [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here] (https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). +- [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here](https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). 
### Language Identification From 1a6f2ba7feab608d31aa975b11c1643d9f53db0b Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 15 Aug 2022 17:47:45 +0530 Subject: [PATCH 097/119] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 7c4bed0..5442736 100644 --- a/README.md +++ b/README.md @@ -148,6 +148,7 @@ Benchmarks spanning multiple tasks. - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here](https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). +- [FacebookDecadeCorpora](https://github.com/samithaj/FacebookDecadeCorpora): Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2007/2007.07884.pdf). ### Language Identification @@ -164,6 +165,7 @@ Benchmarks spanning multiple tasks. - [IITB Cognate Datasets](https://github.com/dipteshkanojia/challengeCognateFF): Dataset of Cognates and False Friend Pairs for 12 Indian Languages. 
[(Paper)](https://aclanthology.org/2020.lrec-1.378.pdf) - [AI4Bharat Cross-lingual Semantic Textual Similarity](https://storage.googleapis.com/samanantar-public/human_annotations.tsv): 10 sentences across 11 en-Indic language pairs annotated on a scale of 0-5 as per SemEval cross-lingual STS guidelines. - [Toxicity-200](https://github.com/facebookresearch/flores/blob/main/toxicity): Toxicity Lists for 200 languages including 27 Indian languages. +- [FacebookDecadeCorpora](https://github.com/samithaj/FacebookDecadeCorpora): Contains a list of algorithmically derived stopwords extracted from Corpus-Sinhala-Redux. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2007/2007.07884.pdf). ### NER Corpora From 5de972c159acf0b00e13ad290c107ca4beafffda Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Tue, 16 Aug 2022 11:33:51 +0530 Subject: [PATCH 098/119] Update README.md --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 5442736..6f6ae02 100644 --- a/README.md +++ b/README.md @@ -270,6 +270,7 @@ Benchmarks spanning multiple tasks. - [Dravidian-CodeMix - FIRE 2020](https://dravidian-codemix.github.io/2020/datasets.html) - Tamil & Malayalam - [Bengali Sentiment Analysis - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 8k sentences - [SentNoB](https://github.com/KhondokerIslam/SentNoB): sentiment dataset for Bangla from 3 domains on user comments containing 15k examples [(Paper)](https://aclanthology.org/2021.findings-emnlp.278.pdf) [(Dataset)](https://www.kaggle.com/cryptexcode/sentnob-sentiment-analysis-in-noisy-bangla-texts) +- [UoM-Sinhala Sentiment Analysis](https://github.com/LahiruSen/sinhala_sentiment_anlaysis_tallip#data-set): Sentiment Analysis for Sinhala Language. 
Consists of a multi-class annotated data set with 15059 sentiment annotated Sinhala news comments extracted from two Sinhala online news papers with four sentiment categories namely POSITIVE, NEGATIVE, NEUTRAL and CONFLICT and a corpus of 9.48 million tokens. Described in [this paper](https://arxiv.org/pdf/2011.07280.pdf). ### Hate Speech and Offensive Comments @@ -358,6 +359,7 @@ Benchmarks spanning multiple tasks. - [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) - [EM-FT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first FastText word embedding available for Manipuri language trained on 1,880,035 Manipuri sentences. - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The FastText embeddings for Sanskrit is available [here](https://drive.google.com/file/d/1k5INFw9oaxV7yoWRg0qscmcFrOHVhdzW/view?usp=sharing) and for Hindi [here](https://drive.google.com/file/d/1Md9N7Ux2P9JCky1_9RgL2KjXRGb_lpXj/view?usp=sharing). +- [UoM-Sinhala Sentiment Analysis- FastText 300](https://github.com/LahiruSen/sinhala_sentiment_anlaysis_tallip#word-embedding-models): The FastText word embedding model for Sinhala language. Described in [this paper](https://arxiv.org/pdf/2011.07280.pdf). ### Pre-trained Language Models From 326851a380b414edfb90cd9e53bccffee7b10573 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Tue, 16 Aug 2022 12:44:00 +0530 Subject: [PATCH 099/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 6f6ae02..d00573d 100644 --- a/README.md +++ b/README.md @@ -109,6 +109,7 @@ Benchmarks spanning multiple tasks. 
- [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). - [AI4Bharat Text Classification](https://github.com/ai4bharat/indicnlp_corpus#publicly-available-classification-datasets): A compilation of classification datasets for 10 languages. - [WAT 2021 Translation Dataset](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual): Standard train and test sets for translation between English and 10 Indian languages. +- [Facebook - MTOP Benchmark](https://fb.me/mtop_dataset): A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising of 100k annotated utterances in 6 languages(including Indic language: Hindi) across 11 domains. Described in [this paper](https://arxiv.org/pdf/2008.09335.pdf). ## Standards From 7aa2b18aabaf58d6b84f25a462e121b08728c45b Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Wed, 17 Aug 2022 17:16:25 +0530 Subject: [PATCH 100/119] Update README.md --- README.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/README.md b/README.md index d00573d..923d57d 100644 --- a/README.md +++ b/README.md @@ -150,6 +150,7 @@ Benchmarks spanning multiple tasks. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here](https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). 
- [FacebookDecadeCorpora](https://github.com/samithaj/FacebookDecadeCorpora): Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2007/2007.07884.pdf). +- [Nepali National corpus](http://catalog.elra.info/product_info.php?products_id=1216): The Nepali Monolingual written corpus comprises the core corpus containing 802,000 words and the general corpus containing 1,400,000 words. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). ### Language Identification @@ -230,6 +231,7 @@ Benchmarks spanning multiple tasks. - [NLLB-MD](https://github.com/facebookresearch/flores/tree/main/nllb_seed): NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Cover Bhojpuri amongst Indian languages. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. +- [Nepali National corpus]( http://catalog.elra.info/product_info.php?products_id=1217): The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. 
An additional set of monolingual data is also provided with 386,879 words in Nepali. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). ### Parallel Transliteration Corpus @@ -440,6 +442,7 @@ Benchmarks spanning multiple tasks. - [Crowdsourced high-quality Marathi multi-speaker speech data set](https://www.openslr.org/64/): Contains recordings of native speakers of Marathi - [Crowdsourced high-quality Tamil multi-speaker speech data set](https://www.openslr.org/65/): Contains recordings of native speakers of Tamil - [Crowdsourced high-quality Telugu multi-speaker speech data set](https://www.openslr.org/66/): Contains recordings of native speakers of Telugu +- [Nepali National corpus](http://catalog.elra.info/product_info.php?products_id=1219): The Nepali Spoken Corpus contains audio recordings from different 17 types of social activities with a total temporal recording duration of 31 hours and 26 minutes. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). From c68bf331668fb95ab1295aa9bff8f07819afea57 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Wed, 17 Aug 2022 17:54:20 +0530 Subject: [PATCH 101/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 923d57d..5c0b4c2 100644 --- a/README.md +++ b/README.md @@ -232,6 +232,7 @@ Benchmarks spanning multiple tasks. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. 
- [Nepali National corpus]( http://catalog.elra.info/product_info.php?products_id=1217): The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). +- [English–Nepali Parallel Corpus](https://github.com/sharad461/nepali-translator): A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in [this paper](https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.94.pdf). ### Parallel Transliteration Corpus From 7cd8407648dacb5fe0bc54713d54ba054c98ad07 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Wed, 17 Aug 2022 17:57:45 +0530 Subject: [PATCH 102/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5c0b4c2..7229133 100644 --- a/README.md +++ b/README.md @@ -232,7 +232,7 @@ Benchmarks spanning multiple tasks. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. 
- [Nepali National corpus]( http://catalog.elra.info/product_info.php?products_id=1217): The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). -- [English–Nepali Parallel Corpus](https://github.com/sharad461/nepali-translator): A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in [this paper](https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.94.pdf). +- [Kathmandu University-English–Nepali Parallel Corpus](https://github.com/sharad461/nepali-translator): A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in [this paper](https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.94.pdf). ### Parallel Transliteration Corpus From a4d0a2140c74bc30940c9f7db352439f8a167bde Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Fri, 19 Aug 2022 23:23:14 +0530 Subject: [PATCH 103/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7229133..00b31d5 100644 --- a/README.md +++ b/README.md @@ -109,7 +109,6 @@ Benchmarks spanning multiple tasks. - [GLUECoS](https://microsoft.github.io/GLUECoS): For Hindi-English code-mixed benchmark containing the following tasks - Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI). 
- [AI4Bharat Text Classification](https://github.com/ai4bharat/indicnlp_corpus#publicly-available-classification-datasets): A compilation of classification datasets for 10 languages. - [WAT 2021 Translation Dataset](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual): Standard train and test sets for translation between English and 10 Indian languages. -- [Facebook - MTOP Benchmark](https://fb.me/mtop_dataset): A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising of 100k annotated utterances in 6 languages(including Indic language: Hindi) across 11 domains. Described in [this paper](https://arxiv.org/pdf/2008.09335.pdf). ## Standards @@ -310,6 +309,7 @@ Benchmarks spanning multiple tasks. - [EventXtract-IL](http://78.46.86.133/EventXtractionIL-FIRE2018): Event extraction for Tamil and Hindi. Described in [this paper](http://ceur-ws.org/Vol-2266/T5-1.pdf). - [EDNIL-FIRE2020]https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in [this paper](http://ceur-ws.org/Vol-2266/T5-1.pdf). - [Amazon MASSIVE](https://github.com/alexa/massive): A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in [this paper](https://arxiv.org/abs/2204.08582). +- [Facebook - MTOP Benchmark](https://fb.me/mtop_dataset): A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising of 100k annotated utterances in 6 languages(including Indic language: Hindi) across 11 domains. Described in [this paper](https://arxiv.org/pdf/2008.09335.pdf). 
### POS Tagged corpus From ccfd69185ebb8824ce0c0cd6fbc5e0c3b5ee9ffa Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Wed, 24 Aug 2022 22:45:48 +0530 Subject: [PATCH 104/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 00b31d5..b70f76d 100644 --- a/README.md +++ b/README.md @@ -232,6 +232,7 @@ Benchmarks spanning multiple tasks. - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. - [Nepali National corpus]( http://catalog.elra.info/product_info.php?products_id=1217): The English-Nepali Parallel Corpus consists of a small set of data aligned at the sentence level with 27,060 English words and 21,756 Nepali words and a larger set of texts at the document level with 617,340 English words and 596,571 Nepali words. An additional set of monolingual data is also provided with 386,879 words in Nepali. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). - [Kathmandu University-English–Nepali Parallel Corpus](https://github.com/sharad461/nepali-translator): A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in [this paper](https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.94.pdf). +- [CCAligned](https://statmt.org/cc-aligned/): A Massive Collection of more than 100 million cross-lingual web-document pairs in 137 languages aligned with English. 
### Parallel Transliteration Corpus From ebfd58ad8bd2ea50607b7c04fa157f013e4cc642 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Thu, 25 Aug 2022 09:40:08 +0530 Subject: [PATCH 105/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 5442736..a755627 100644 --- a/README.md +++ b/README.md @@ -227,6 +227,7 @@ Benchmarks spanning multiple tasks. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): The corpus contains 27,362 Hindi-Kangri Parallel corpora. Described in [this paper] (https://arxiv.org/abs/2103.11596). - [NLLB-Seed](https://github.com/facebookresearch/flores/tree/main/nllb_seed): Small human-translated parallel corpora from Wikipedia articles for very low resource languages. Includes 5 Indian languages: Kashmiri, Manipuri, Maithili, Bhojpuri, Chattisgarhi. - [NLLB-MD](https://github.com/facebookresearch/flores/tree/main/nllb_seed): NLLB Multi Domain is a set of professionally-translated sentences in News, Unscripted informal speech, and Health domains. Cover Bhojpuri amongst Indian languages. +- [NLLB-Mined](https://huggingface.co/datasets/allenai/nllb): All the parallel corpora mined by the NLLB project. This repository was reconstructed by AllenAI based on metadata released by the NLLB Project. - [PHINC](https://zenodo.org/record/3605597#.YvjqEXZBy5d): A Parallel Hinglish Social Media Code-Mixed Corpus consisting of 13,738 code-mixed English-Hindi sentences and their corresponding translation in English. Described in [this paper](https://aclanthology.org/2020.wnut-1.7.pdf). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): Machine Translation from Sanskrit to Hindi using Unsupervised and Supervised Learning. Contains Sanskrit-English parallel data and Sanskrit-Hindi parallel(test) data. 
From a828610cde31847da9b60fd7c64fcfcee5d52e45 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Thu, 25 Aug 2022 11:26:43 +0530 Subject: [PATCH 106/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index b70f76d..27c7e95 100644 --- a/README.md +++ b/README.md @@ -253,6 +253,7 @@ Benchmarks spanning multiple tasks. - [iNLTK News Headlines classification](https://github.com/goru001/inltk): Datasets for multiple Indian languages. - [AI4Bharat IndicNLP News Articles](https://github.com/ai4bharat/indicnlp_corpus): Word embeddings for 10 Indian languages. - [KMI Linguistics TRAC - 1](https://github.com/kmi-linguistics/trac-1): Contains aggression-annotated dataset (in English and Hindi) for the Shared Task on Aggression Identification during First Workshop on Trolling, Aggression and Cyberbullying (TRAC - 1) at COLING - 2018. +- [XCOPA](https://github.com/cambridgeltl/xcopa): A Multilingual Dataset for Causal Commonsense Reasoning in 11 languages (includes Tamil). Described in [this paper](https://ducdauge.github.io/files/xcopa.pdf). ### Textual Entailment/Natural Language Inference From e6a112899165e9e9fe12793738f23993e715190b Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Thu, 25 Aug 2022 11:56:38 +0530 Subject: [PATCH 107/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 27c7e95..62361dc 100644 --- a/README.md +++ b/README.md @@ -300,6 +300,7 @@ Benchmarks spanning multiple tasks. - [IITH HiDG](https://github.com/kaushal0494/ZmBART): A Distractor Generation [Dataset](https://drive.google.com/drive/folders/1XlY9yOfk0XcfHNO5k0QGsbLQU1nMekG-) for Hindi consisting of 1k/1k/5k (train/validation/test) split. 
Described in [this paper](https://arxiv.org/pdf/2106.01597.pdf) - [Chaii](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview) a Kaggle challenge which consists of 1104 Questions in Hindi and Tamil. Moreover, [here](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264695) is a good collection of papers on multilingual Question Answering. - [csebuetnlp Bangla QA](https://huggingface.co/datasets/csebuetnlp/squad_bn): A Question Answering (QA) dataset for Bengali. Described in [this paper](https://arxiv.org/abs/2101.00204). +- [XOR QA](https://github.com/AkariAsai/XORQA): A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in [this paper](https://arxiv.org/pdf/2010.11856.pdf). More information is available [here](https://nlp.cs.washington.edu/xorqa/). ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From 0267bea63e21c5d9e0ce9fcdcc8710af93398608 Mon Sep 17 00:00:00 2001 From: Ganesh Katrapati Date: Thu, 1 Sep 2022 20:48:31 +0530 Subject: [PATCH 108/119] Changed the link for DNLP-Tel corpora --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a755627..855ed60 100644 --- a/README.md +++ b/README.md @@ -142,7 +142,7 @@ Benchmarks spanning multiple tasks. - [KMI Awadhi Corpus](https://github.com/kmi-linguistics/awadhi): - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [SMC Malayalam text corpus](https://gitlab.com/smc/corpus) -- [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/0B7LLASJiB2m6cDVzbnNjUVZ5dUE): Telugu corpus of 280M tokens and 23M sentences. 
+- [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/1fEt7aIzYWGQKto3Nt51M5CdjtzxMqdCz?usp=sharing): Telugu corpus of 280M tokens and 23M sentences along with skip-gram model trained with word2vec. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. - [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. - [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. From 7cd592ffd8d9c2c345344fb4e7e7bea005665012 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 5 Sep 2022 18:20:21 +0530 Subject: [PATCH 109/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index d284ac2..03ae0f2 100644 --- a/README.md +++ b/README.md @@ -385,6 +385,7 @@ Benchmarks spanning multiple tasks. - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. - [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [[paper]](https://arxiv.org/abs/2007.01852). - [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languges). 
+- [MuRIL](https://tfhub.dev/google/MuRIL/1): A BERT base (12L) model pre-trained on 17 Indian languages, and their transliterated counterparts. Described in [this paper](https://arxiv.org/abs/2103.10730). ### Multilingual Word Embeddings From daa73d7590b2a626b9166d4bc5cc7e0f1acf2e74 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Mon, 5 Sep 2022 18:42:13 +0530 Subject: [PATCH 110/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 03ae0f2..08b90f3 100644 --- a/README.md +++ b/README.md @@ -302,6 +302,7 @@ Benchmarks spanning multiple tasks. - [Chaii](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/overview) a Kaggle challenge which consists of 1104 Questions in Hindi and Tamil. Moreover, [here](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/discussion/264695) is a good collection of papers on multilingual Question Answering. - [csebuetnlp Bangla QA](https://huggingface.co/datasets/csebuetnlp/squad_bn): A Question Answering (QA) dataset for Bengali. Described in [this paper](https://arxiv.org/abs/2101.00204). - [XOR QA](https://github.com/AkariAsai/XORQA): A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in [this paper](https://arxiv.org/pdf/2010.11856.pdf). More information is available [here](https://nlp.cs.washington.edu/xorqa/). +- [IITB HiQuAD](https://www.cse.iitb.ac.in/~ganesh/HiQuAD/clqg/clqg_data.tar.gz): A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in [this paper](https://www.cse.iitb.ac.in/~ganesh/papers/acl2019a.pdf). 
### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) From cd40e005380f2e6bbd68c73e50188aec52357477 Mon Sep 17 00:00:00 2001 From: Kaushal Bhogale Date: Wed, 14 Sep 2022 15:50:16 +0530 Subject: [PATCH 111/119] Add Shrutilipi to Speech Corpora --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index d284ac2..c59621e 100644 --- a/README.md +++ b/README.md @@ -448,7 +448,7 @@ Benchmarks spanning multiple tasks. - [Crowdsourced high-quality Tamil multi-speaker speech data set](https://www.openslr.org/65/): Contains recordings of native speakers of Tamil - [Crowdsourced high-quality Telugu multi-speaker speech data set](https://www.openslr.org/66/): Contains recordings of native speakers of Telugu - [Nepali National corpus](http://catalog.elra.info/product_info.php?products_id=1219): The Nepali Spoken Corpus contains audio recordings from different 17 types of social activities with a total temporal recording duration of 31 hours and 26 minutes. Described [here](https://www.sketchengine.eu/nepali-national-corpus/). 
- +- [Shrutilipi](https://ai4bharat.org/shrutilipi): Over 6400 hours of transcribed speech corpus across 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, Urdu ## OCR Corpora From 69178575955d12a26a26042c977584cf8a8244a4 Mon Sep 17 00:00:00 2001 From: Prasanna Venkatesh T S <17687018+vipranarayan14@users.noreply.github.com> Date: Tue, 10 Jan 2023 17:29:55 -0800 Subject: [PATCH 112/119] fix typo in README --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c59621e..43dcbc2 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # :bookmark: The Indic NLP Catalog -_A Collabortive Catalog of Resources for Indic Language NLP_ +_A Collaborative Catalog of Resources for Indic Language NLP_ The **Indic NLP Catalog** repository is an attempt to **collaboratively** build the **most comprehensive** catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent. From 108d64dd2b8259015e68a2ff104619514b380a02 Mon Sep 17 00:00:00 2001 From: Maharaj Brahma Date: Fri, 3 Feb 2023 01:23:45 +0530 Subject: [PATCH 113/119] Remove redundant KMI Linguistics Bodo catalog --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index c59621e..4b6599a 100644 --- a/README.md +++ b/README.md @@ -145,7 +145,6 @@ Benchmarks spanning multiple tasks. - [DNLP-Tel Telugu Corpus](https://drive.google.com/drive/folders/1fEt7aIzYWGQKto3Nt51M5CdjtzxMqdCz?usp=sharing): Telugu corpus of 280M tokens and 23M sentences along with skip-gram model trained with word2vec. - [Ema-lon Manipuri Corpus](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first comparable corpus built for the Manipuri (mni)-English (eng) language pair with the monolingual data comprising of 1,034,715 Manipuri sentences and 846,796 English sentences in version 1 and 1,880,035 Manipuri sentences and 1,450,053 English sentences in version 2. 
- [SinMin Corpus](https://osf.io/a5quv/): Contains texts of different genres and styles of the modern and old Sinhala language. -- [KMI Linguistics Bodo](https://github.com/kmi-linguistics/bodo): Contains the Bodo corpus and the frequency-ordered word and punctuation list. - [Kangri_corpus](https://github.com/chauhanshweta/Kangri_corpus): Monolingual corpus of Himachali low resource endangered language, Kangri comprising of 1,81,552 sentences. Described in [this paper](https://arxiv.org/abs/2103.11596). - [Sanskrit-Hindi-MT](https://github.com/priyanshu2103/Sanskrit-Hindi-Machine-Translation): The Sanskrit Monolingual Data is available [here](https://drive.google.com/file/d/1_qclc7unNLvToiDK8t2scIgj5oxJDEGm/view?usp=sharing). - [FacebookDecadeCorpora](https://github.com/samithaj/FacebookDecadeCorpora): Contains two language corpora of colloquial Sinhala content extracted from Facebook using the Crowdtangle platform. The larger corpus contains 28,825,820 to 29,549,672 words of text, mostly in Sinhala, English and Tamil and the smaller corpus amounts to 5,402,76 words of only Sinhala text extracted from Corpus-Alpha. Described in [this paper](https://arxiv.org/ftp/arxiv/papers/2007/2007.07884.pdf). From 4d4a20df1367374a54d38221d0ec2082ff52b1c4 Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Fri, 17 Mar 2023 13:42:12 +0530 Subject: [PATCH 114/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index bf6e38e..910f3d1 100644 --- a/README.md +++ b/README.md @@ -288,6 +288,7 @@ Benchmarks spanning multiple tasks. 
- [Bengali Hate Speech - Classification Benchmark, 2020](https://github.com/rezacsedu/BengFastText): 1.5k sentences - [Offensive Language Identification in Dravidian Languages, EACL 2021](https://dravidianlangtech.github.io/2021/): Tamil, Malayalam, Kannada - [Fear Speech in Indian WhatsApp Groups, 2021](https://github.com/punyajoy/Fear-speech-analysis) +- [HateCheckHIn](https://github.com/hate-alert/HateCheckHIn): An evaluation dataset for Hindi Hate Speech Detection Models having a total of 34 functionalities out of which 28 functionalities are monolingual and the remaining 6 are multilingual. Hindi is used as the base language. Described in [this paper](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.575.pdf). ### Question Answering - [Facebook Multilingual QA datasets](https://github.com/facebookresearch/MLQA): Contains dev and test sets for Hindi. From 590bc21786edde7e5bc54a96b4b1fd78f9da6bda Mon Sep 17 00:00:00 2001 From: Sangeeta Anoop <88283677+sangeeta-anoop@users.noreply.github.com> Date: Wed, 22 Mar 2023 13:15:54 +0530 Subject: [PATCH 115/119] Update README.md --- README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/README.md b/README.md index 910f3d1..799d964 100644 --- a/README.md +++ b/README.md @@ -306,6 +306,7 @@ Benchmarks spanning multiple tasks. ### Dialog - [a-mma Indic Casual Dialogs Datasets](https://github.com/a-mma/indic_casual_dialogs_dataset) +- [A Code-Mixed Medical Task-Oriented Dialog Dataset](https://github.com/suman101112/Code-Mixed-TOD-Medical-Dataset): The dataset contains 3005 Telugu–English Code-Mixed dialogs with 29 k utterances covering ten specializations with an average code-mixing index (CMI) of 33.3%. Described in [this paper](https://www.sciencedirect.com/science/article/abs/pii/S0885230822000729). 
### Discourse - [MIDAS-Hindi Discourse Analysis](https://github.com/midas-research/hindi-discourse) From 19ec948b9159265dac0cd0f98c5777faa6248d58 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Wed, 22 Mar 2023 13:26:48 +0530 Subject: [PATCH 116/119] Update README.md --- README.md | 1 - 1 file changed, 1 deletion(-) diff --git a/README.md b/README.md index e646244..05e67d4 100644 --- a/README.md +++ b/README.md @@ -387,7 +387,6 @@ Benchmarks spanning multiple tasks. - [EM-ALBERT](http://catalog.elra.info/en-us/repository/browse/ELRA-W0316/): The first ALBERT model available for Manipuri language which is trained on 1,034,715 Manipuri sentences. - [LaBSE](https://tfhub.dev/google/LaBSE/2): Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [[paper]](https://arxiv.org/abs/2007.01852). - [LASER3](https://github.com/facebookresearch/fairseq/tree/nllb#laser3-encoder-models): Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languges). -- [MuRIL](https://tfhub.dev/google/MuRIL/1): A BERT base (12L) model pre-trained on 17 Indian languages, and their transliterated counterparts. Described in [this paper](https://arxiv.org/abs/2103.10730). ### Multilingual Word Embeddings From 84ddc81f4847a290f7ede9d3523e561e04125e73 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Sun, 24 Dec 2023 08:51:42 +0530 Subject: [PATCH 117/119] Update README.md --- README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/README.md b/README.md index 05e67d4..f1fe584 100644 --- a/README.md +++ b/README.md @@ -36,6 +36,7 @@ Indian language NLP has come a long way. 
We feature a few resources that are ill * [Lexical Resources](#LexicalResources) * [NER Corpora](#NERCorpora) * [Parallel Translation Corpus](#ParallelTranslationCorpus) + * [MT Evaluation](#MTEvaluation) * [Parallel Transliteration Corpus](#ParallelTransliterationCorpus) * [Text Classification](#TextualClassification) * [Textual Entailment/Natural Language Inference](#TextualEntailment) @@ -234,6 +235,12 @@ Benchmarks spanning multiple tasks. - [Kathmandu University-English–Nepali Parallel Corpus](https://github.com/sharad461/nepali-translator): A parallel corpus of size 1.8 million sentence pairs for a low resource language pair Nepali–English. Described in [this paper](https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.94.pdf). - [CCAligned](https://statmt.org/cc-aligned/): A Massive Collection of more than 100 million cross-lingual web-document pairs in 137 languages aligned with English. +### MT Evaluation + +- [WMT23 QE task](https://wmt-qe-task.github.io): QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference based metrics. For Marathi, post-edits are also available as are word-level annotations error annotations are also available. 26k training sentences for Marathi, 7k for the others. [report](https://aclanthology.org/2023.wmt-1.52) +- [AI4Bharat IndicMT-Eval]: MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems). + + ### Parallel Transliteration Corpus - [Dakshina Dataset](https://github.com/google-research-datasets/dakshina): The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. Contains an aggregate of around 300k word pairs and 120k sentence pairs. 
From 61f11e6811677dbf6c89b554b231ae1191d6c2c4 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Thu, 28 Dec 2023 14:01:12 +0530 Subject: [PATCH 118/119] Update README.md --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index f1fe584..eb9d591 100644 --- a/README.md +++ b/README.md @@ -36,7 +36,7 @@ Indian language NLP has come a long way. We feature a few resources that are ill * [Lexical Resources](#LexicalResources) * [NER Corpora](#NERCorpora) * [Parallel Translation Corpus](#ParallelTranslationCorpus) - * [MT Evaluation](#MTEvaluation) + * [MT Evaluation](#MTEvaluation) * [Parallel Transliteration Corpus](#ParallelTransliterationCorpus) * [Text Classification](#TextualClassification) * [Textual Entailment/Natural Language Inference](#TextualEntailment) @@ -238,7 +238,7 @@ Benchmarks spanning multiple tasks. ### MT Evaluation - [WMT23 QE task](https://wmt-qe-task.github.io): QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference based metrics. For Marathi, post-edits are also available as are word-level annotations error annotations are also available. 26k training sentences for Marathi, 7k for the others. [report](https://aclanthology.org/2023.wmt-1.52) -- [AI4Bharat IndicMT-Eval]: MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems). +- [AI4Bharat IndicMT-Eval])(https://github.com/AI4Bharat/IndicMT-Eval): MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems). 
### Parallel Transliteration Corpus From 26735950f752d7f5d6e64ad2a109e8ce067897f7 Mon Sep 17 00:00:00 2001 From: Anoop Kunchukuttan Date: Thu, 28 Dec 2023 14:03:22 +0530 Subject: [PATCH 119/119] Update README.md --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index eb9d591..cc67274 100644 --- a/README.md +++ b/README.md @@ -238,7 +238,7 @@ Benchmarks spanning multiple tasks. ### MT Evaluation - [WMT23 QE task](https://wmt-qe-task.github.io): QE datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, te) with DA annotations. The references are also available, so these can also be used for reference based metrics. For Marathi, post-edits are also available as are word-level annotations error annotations are also available. 26k training sentences for Marathi, 7k for the others. [report](https://aclanthology.org/2023.wmt-1.52) -- [AI4Bharat IndicMT-Eval])(https://github.com/AI4Bharat/IndicMT-Eval): MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems). +- [AI4Bharat IndicMT-Eval](https://github.com/AI4Bharat/IndicMT-Eval): MT evaluation datasets for 5 Indian languages in En to Indic directions (mr, hi, gu, ta, ml) with Multidimensional Quality Metric (MQM) annotations. 1400 sentence annotations per language (200 sentences and outputs from 7 MT systems). ### Parallel Transliteration Corpus