Telugu Natural Language Processing Project

Welcome to the Telugu Natural Language Processing (NLP) Project! This open-source initiative is dedicated to developing resources and tools for language learners, teachers, linguists, and researchers in the Telugu language community. Our mission is to enhance the capabilities and accessibility of NLP technologies for Telugu.

Project Goals

1. Resource Compilation

To create a comprehensive database of publicly available linguistic resources for Telugu, including dictionaries, grammar guides, texts, audio-visual materials, and research papers, and ensure they are accessible and properly formatted for NLP applications.

2. Educational Tools Development

To develop a range of interactive educational tools, such as language learning apps, grammar checkers, and writing aids, specifically designed to support and enhance the learning experience for Telugu learners and educators.

3. Machine Translation Enhancement

To significantly improve the machine translation capabilities for Telugu, facilitating better translation both from and into other languages, thus enhancing communication and understanding.

4. Linguistic Research Support

To provide advanced tools and comprehensive datasets for linguistic research, including sophisticated corpus analysis tools and extensive databases, to document and analyze the unique linguistic features of Telugu.

5. Cultural Preservation and Promotion

To leverage the platform to actively promote and preserve the Telugu language and culture on a global scale, thereby contributing to its broader appreciation and recognition.

These goals aim to foster a vibrant ecosystem around the Telugu language, enabling better learning, research, and cultural exchange.

Corpora and word lists

Telugu Dictionary Words by Anusha Motamarri.
Telugu Newspaper Article Dataset by Anusha Motamarri.
Telugu News Articles by Shubham Jain.
Telugu Books Dataset by Anusha Motamarri.
Telugu Wikipedia Dataset by Shubham Jain.
Parallel Corpus for Indian Languages by Kartik.
Indic NLP Catalog by AI4Bharat.
Indic Tagger (Indian Language Tagger) by Avinesh PVS.
All words in all languages - Telugu by Eymen Efe Altun.
Thousand most common Telugu words by Samuel Menigat.
Likitham - Repo containing scripts and datasets for processing Telugu language data by Chillar Anand.
LOIT (Lot of Indic Tweets) by Bedapudi Praneeth.
English Telugu Bilingual Sentence Pairs by Sai Kumar Yava. The dataset contains English sentences translated into Telugu, 155,798 sentences in total.
Telugu Terms by etymology by Wiktionary.

Code Mixed or Code Switching

1. [en_te_wiki_titles](https://github.com/notAI-tech/Datasets/releases/download/En-Te_Transliteration/v1.en_te_wiki_titles.txt):

    Contains 13,811 en-te word pairs, generated from Wikipedia by comparing the titles of parallel articles.


2. [ni_bondha_comments](https://github.com/notAI-tech/Datasets/releases/download/En-Te_Transliteration/v1.ni_bondha_comment_words.txt):

    Contains 24,757 en-te word pairs.

    The English versions of the Telugu words are obtained from the subreddit [r/Ni_Bondha](https://www.reddit.com/r/Ni_Bondha/).

    The corresponding Telugu words are obtained by ranking transliterations of the subreddit comments from multiple models and APIs, using a [flair](https://github.com/zalandoresearch/flair)-based character language model trained on Telugu text.

    Please note that the English words are not lower-cased in this data. Since the English words are human-written, we decided to retain the capitalization in this release. Only punctuation was removed.
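
The ranking step described for ni_bondha_comments can be illustrated with a small sketch. It is not the code used to build the dataset: it swaps the flair character language model for a simple add-one smoothed character bigram model, and the class name `CharBigramLM`, the function `rank_candidates`, and the tiny training word list are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

BOS, EOS = "^", "$"  # word-boundary markers (assumed not to occur in the data)


class CharBigramLM:
    """Add-one smoothed character bigram model, a stand-in for a neural character LM."""

    def __init__(self, words):
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of each previous char
        self.vocab = {EOS}
        for word in words:
            chars = [BOS, *word, EOS]
            self.vocab.update(word)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def score(self, word):
        """Average log-probability per character transition; higher is more plausible."""
        chars = [BOS, *word, EOS]
        size = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            total += math.log((self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + size))
        return total / (len(chars) - 1)


def rank_candidates(lm, candidates):
    """Sort candidate transliterations from most to least plausible under the model."""
    return sorted(candidates, key=lm.score, reverse=True)


if __name__ == "__main__":
    # Toy training text; a real model would be trained on a large Telugu corpus.
    lm = CharBigramLM(["తెలుగు", "తెలుపు", "మంచి", "మంచు", "పని"])
    # Two hypothetical candidates for the romanized word "telugu".
    for candidate in rank_candidates(lm, ["ుగతెల", "తెలుగు"]):
        print(candidate, round(lm.score(candidate), 3))
```

Averaging the log-probability over transitions keeps longer candidates from being penalized purely for their length, which matters when different systems propose transliterations of different lengths.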

Text Generation

Telugu Text Generation by Gnanendra Avvaru.

ASR - Automatic Speech Recognition

IndicASR (Speech Recognition for Indian Languages) by notAI-tech.

Topic Classification/Modeling/Extraction

Topic Modeling/Extraction for Telugu articles by Nirupam Purushothama - Medium (Topic Modeling — 2: Performing LDA on Telugu (తెలుగు) Articles).
Telugu Text classification — Part 1 by Pradeep Miriyala.

OCR

The Banti Framework (Comprehensive OCR System for Telugu Language)

Transliteration

Rice Transliteration Scheme for Telugu

Tokenizer, Stemmer and Lemmatizer

Telugu Tokenizer and Stemmer by chraghavendra
Telugu Language — Lemmatization & POS Tag Extraction by Nirupam Purushothama
Sangita - A Natural Language Toolkit for Indian Languages (currently supports only Hindi).
Program for tokenizing Indian language inp by Anoop Kunchukuttan.

Spell Checker and Error Correction

Syllables

Script to get Telugu syllables by Chillar Anand.
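
As a rough illustration of what syllable extraction involves, here is a minimal sketch that splits a Telugu word into orthographic syllables (aksharas) with a regular expression over the Telugu Unicode block. It is not the script listed above; the function name `split_syllables` and the grouping convention (a consonant cluster joined by viramas, plus an optional vowel sign and modifiers) are assumptions, and other conventions exist.

```python
import re

# Character classes from the Telugu Unicode block (U+0C00 to U+0C7F).
CONSONANT = "\u0C15-\u0C39\u0C58-\u0C5A"
VOWEL = "\u0C05-\u0C14\u0C60\u0C61"                    # independent vowels
VOWEL_SIGN = "\u0C3E-\u0C4C\u0C55\u0C56\u0C62\u0C63"   # dependent vowel signs
MODIFIER = "\u0C00-\u0C04"                             # candrabindu, anusvara, visarga
VIRAMA = "\u0C4D"

# One syllable is either:
#   (consonant + virama)* consonant [vowel sign | virama] modifiers*
# or an independent vowel followed by optional modifiers.
SYLLABLE = re.compile(
    f"(?:[{CONSONANT}]{VIRAMA})*[{CONSONANT}](?:[{VOWEL_SIGN}]|{VIRAMA})?[{MODIFIER}]*"
    f"|[{VOWEL}][{MODIFIER}]*"
)


def split_syllables(word):
    """Return the orthographic syllables of a Telugu word, in order."""
    return SYLLABLE.findall(word)


if __name__ == "__main__":
    for word in ["తెలుగు", "అమ్మ"]:
        print(word, split_syllables(word))
```

Characters outside the Telugu block (digits, Latin letters, punctuation) are skipped by `findall`, so mixed text would need separate handling.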

Other

Language Modeling for (తెలుగు) Telugu by Karthik Uppuluri.
Telugu Experiments by Karthik Uppuluri
Telugu Language Research Project by Luke Carlson.
NLP for Telugu by Shubham Jain.
TTD Selenium Crawler by Pradeep Miriyala.
Telugu POS by Pradeep Miriyala.
Deep Learning Language Model for Telugu Corpus using LSTM by Akanksha Telagamsetty.
advertools
UGC-NET/JRF Code 103 Indian Knowledge System (IKS) Syllabus by Heera Samvaya.
Sentiment Analysis of Twitter Data using NLTK in Python.
Text Analytics with Python.
Project Chalam - Telugu Books.
Memorize - Code and real data for "Enhancing Human Learning via Spaced Repetition Optimization", PNAS 2019

Surveys

Chatbot System in Indian Languages: A survey by Heera Samvaya.

Reading

మాతృభాషే ఎందుకు?
The digital language divide
Creative Writing and Translation - An interdisciplinary approach
Large language models: A guide on its benefits, use cases, and types
Bhasha - MT.
Tirumala Tirupati Devasthanams - TTD E-books.
archive.org - Telugu : Books by Language.
Free Gurukul - ఉచిత గురుకుల విద్య ఫౌండేషన్.
స్తోత్రనిధి - To collect sanskrit stotras and translate them to Telugu.
ai4bharat
- Areas
  - Translation
  - Transliteration
  - Speech Recognition
  - Language Understanding
  - Language Generation
  - Sign Language
  - Text to Speech
  - Shoonya
  - Chitralekha
  - Anuvaad
- Applications
  - SHOONYA - https://ai4bharat.iitm.ac.in/shoonya/
  - Chitralekha - https://ai4bharat.iitm.ac.in/chitralekha/
  - Anuvaad - https://ai4bharat.iitm.ac.in/anuvaad/
- Data Collection
- Models
https://niceorg.in/
హార్ట్ ఫుల్ నెస్ - ప్రేమతో పురోగమనం - ప్రేమపూర్వక సంభాషణ
సైకో థెరపీ అంటే ఏమిటి?
కళ ఆధారిత అభ్యాసన

Learning Telugu

Agriculture, Farming, Gardening, Herbs and Ayurveda

Self Employment

Papers

Enhancing human learning via spaced repetition optimization - Behzad Tabibian, Utkarsh Upadhyay, Abir De, Ali Zarezade, Bernhard Schölkopf, and Manuel Gomez-Rodriguez. Networks Learning Group, Max Planck Institute for Software Systems, 67663 Kaiserslautern, Germany; and Empirical Inference Department, Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany. Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 14, 2018 (received for review September 3, 2018).
- Memorize - Code and real data for "Enhancing Human Learning via Spaced Repetition Optimization", PNAS 2019
Hindi Shabdamitra - A Wordnet based Tool for Enhancing Teaching-Learning Process by Hanumant Redkar, Nilesh Joshi, Sayali Khare, Lata Popale, Malhar Kulkarni and Pushpak Bhattacharyya - Center for Indian Language Technology, Indian Institute of Technology Bombay, India.
Hindi Shabdamitra - A Wordnet based E-Learning Tool for Language Learning and Teaching by Hanumant Redkar, Sandhya Singh, Meenakshi Somasundaram, Dhara Gorasia, Malhar Kulkarni and Pushpak Bhattacharyya - Center for Indian Language Technology, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India.
The WordNet in Indian Languages
Using Parallel Corpora for Language Learning by Michael H. Brown, Kanda Institute of Foreign Languages in Tokyo, Japan.
- (another article) Language Learning via Parallel Corpora.
Learning in Parallel: Using Parallel Corpora to Enhance Written Language Acquisition at the Beginning Level by Brody Bluemel, The Pennsylvania State University.
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences by Devansh Gautam, Prashant Kodali, Kshitij Gupta, Anmol Goel, Manish Shrivastava, Ponnurangam Kumaraguru - International Institute of Information Technology Hyderabad & Indraprastha Institute of Information Technology Delhi & Guru Gobind Singh Indraprastha University, Delhi
The LTRC Hindi-Telugu Parallel Corpus by Vandan Mujadia, Dipti Misra Sharma, MT-NLP Lab, LTRC, KCIS, IIIT-Hyderabad, India.

DuoLingo - Persuasive Language Learning - Qualitative research on user engagement in the persuasive system design of Duolingo

    Authors: Sofie Kastelli, Napsugár Takács
    Supervisor: Ulf Linnman
    Field of research: Informatics
    Date: 1st of June 2023
    Jönköping University 2023

Exploring Persuasive Design Elements in Duolingo
The Duolingo Method for App-based Teaching and Learning by Cassie Freeman, Audrey Kittredge, Hope Wilson, and Bozena Pajak - Duolingo Research Report
A Novel Approach to Telugu Stemming Using N-gram Process by N.V. Ganapathi Raju (Associate Professor, Dept. of CSE, GRIET, Hyderabad, INDIA.), Chinta Someswara Rao (Assistant Professor, Dept. of CSE, SRKR Engineering College, Bhimavaram, INDIA.) and G. Meghana (P.G. Scholar, GRIET, Hyderabad, INDIA).
Telugu OCR Framework using Deep Learning by Rakesh Achanta and Trevor Hastie - Stanford University.
Building specialised corpora for translation studies by Sattar Izwaini, Centre for Computational Linguistics, UMIST, PO Box 88, Manchester M60 1QD, UK.
Building parallel corpora for eContent professionals by M. Gavrilidou, P. Labropoulou, E. Desipri, V. Giouli, V. Antonopoulos, S. Piperidis, Institute for Language and Speech Processing.
Text Simplification - Building a Monolingual Parallel Corpus for Text Simplification Using Sentence Similarity Based on Alignment between Word Embeddings
Text Summarisation - Using Parallel Corpora for Multilingual (Multi-document) Summarisation Evaluation
Co-Writing Screenplays and Theatre Scripts with Language Models.

Identifying Context-Dependent Translations for Evaluation Set Production

    Rachel Wicks 1,2 and Matt Post 1−3

    1. Human Language Technology Center of Excellence, Johns Hopkins University
    2. Center of Language and Speech Processing, Johns Hopkins University
    3. Microsoft

Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach

    By:
    1. Mrinal Dhar, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India
    2. Vaibhav Kumar, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India
    3. Manish Shrivastava, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India

Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization

    By:
    1. Kusampudi Siva Subrahamanyam Varma, Language Technologies Research Centre, IIIT Hyderabad, India.
    2. Preetham Sathineni, Language Technologies Research Centre, IIIT Hyderabad, India.
    3. Radhika Mamidi, Language Technologies Research Centre, IIIT Hyderabad, India.

Development of Telugu-Tamil Transfer-Based Machine Translation system: With Special reference to Divergence Index

    By K. Parameswari, Centre for Applied Linguistics and Translation Studies, University of Hyderabad

A Rule-based Dependency Parser for Telugu: An Experiment with Simple Sentences

    By:
    1. SANGEETHA P., PARAMESWARI K.
    2. AMBA KULKARNI

Computational Morphology for Telugu

    By:
    1. B. Srinivasu, Department of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad 500001, India
    2. R. Manivannan, Department of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad 500001, India

Neural Dependency Parsing of Low-resource Languages: A Case Study on Marathi
Telugu dependency parsing using different statistical parsers
Dative Case in Telugu: A Parsing Perspective
Parsing Hindi with MDParser
Two-stage Approach for Hindi Dependency Parsing Using MaltParser
Hindi Dependency Parsing using a combined model of Malt and MST
Ensembling Various Dependency Parsers: Adopting Turbo Parser for Indian Languages
