Balu-Varanasi/telugu-nlp

Telugu Natural Language Processing Project

Welcome to the Telugu Natural Language Processing (NLP) Project! This open-source initiative is dedicated to developing resources and tools for language learners, teachers, linguists, and researchers in the Telugu language community. Our mission is to enhance the capabilities and accessibility of NLP technologies for Telugu.


Project Goals

1. Resource Compilation

To create a comprehensive database of publicly available linguistic resources for Telugu, including dictionaries, grammar guides, texts, audio-visual materials, and research papers, and ensure they are accessible and properly formatted for NLP applications.

2. Educational Tools Development

To develop a range of interactive educational tools, such as language learning apps, grammar checkers, and writing aids, specifically designed to support and enhance the learning experience for Telugu learners and educators.

3. Machine Translation Enhancement

To significantly improve the machine translation capabilities for Telugu, facilitating better translation both from and into other languages, thus enhancing communication and understanding.

4. Linguistic Research Support

To provide advanced tools and comprehensive datasets for linguistic research, including sophisticated corpus analysis tools and extensive databases, to document and analyze the unique linguistic features of Telugu.

5. Cultural Preservation and Promotion

To leverage the platform to actively promote and preserve the Telugu language and culture on a global scale, thereby contributing to its broader appreciation and recognition.

These goals aim to foster a vibrant ecosystem around the Telugu language, enabling better learning, research, and cultural exchange.


Corpora and word lists

  1. Telugu Dictionary Words by Anusha Motamarri.
  2. Telugu Newspaper Article Dataset by Anusha Motamarri.
  3. Telugu News Articles by Shubham Jain.
  4. Telugu Books Dataset by Anusha Motamarri.
  5. Telugu Wikipedia Dataset by Shubham Jain.
  6. Parallel Corpus for Indian Languages by Kartik.
  7. Indic NLP Catalog by AI4Bharat.
  8. Indic Tagger (Indian Language Tagger) by Avinesh PVS.
  9. All words in all languages - Telugu by Eymen Efe Altun.
  10. Thousand most common Telugu words by Samuel Menigat
  11. Likitham - Repo containing scripts and datasets for processing Telugu language data by Chillar Anand.
  12. LOIT (Lot of Indic Tweets) by Bedapudi Praneeth
  13. English Telugu Bilingual Sentence Pairs by Sai Kumar Yava. The dataset contains English sentences translated into Telugu, 155,798 sentences in total.
  14. Telugu Terms by etymology by Wiktionary

Code Mixed or Code Switching

  1. Word Level Language Identification in English Telugu Code Mixed Data
  2. EN-TE Transliteration Dataset
  3. Code Switched Papers
  4. A Tale of Two Languages: The Code Mixing story by Arindam Chatterjee
1. [en_te_wiki_titles](https://github.com/notAI-tech/Datasets/releases/download/En-Te_Transliteration/v1.en_te_wiki_titles.txt):

    contains 13,811 en-te word pairs, generated from Wikipedia by comparing titles of parallel articles.


2. [ni_bondha_comments](https://github.com/notAI-tech/Datasets/releases/download/En-Te_Transliteration/v1.ni_bondha_comment_words.txt):

    contains 24,757 en-te word pairs.

    The English versions of the Telugu words are obtained from the subreddit [r/Ni_Bondha](https://www.reddit.com/r/Ni_Bondha/).

    The corresponding Telugu words are obtained by ranking transliterations of the subreddit comments from multiple models and APIs,
    using a [flair](https://github.com/zalandoresearch/flair) based character LM trained on Telugu text.

    Please note that the English words are not lower-cased in this data. Since the English words are human-written, we decided to retain the capitalization information in this release. Only punctuation was removed.
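Because the English side keeps its original capitalization, consumers of these files may want to normalize case themselves. The sketch below is only an illustration: the helper name `parse_pairs` is hypothetical, and it assumes each line holds one English word and one Telugu word separated by a tab, which should be checked against the downloaded files before use.

```python
from typing import Iterable, List, Tuple


def parse_pairs(
    lines: Iterable[str], sep: str = "\t", lowercase: bool = False
) -> List[Tuple[str, str]]:
    """Parse (english, telugu) word pairs from an iterable of text lines.

    Assumption: one pair per line, separated by `sep` (a tab by default).
    Lines that do not split into exactly two fields are skipped.
    """
    pairs: List[Tuple[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(sep)
        if len(parts) != 2:
            continue  # ignore lines that do not match the assumed layout
        english, telugu = parts[0].strip(), parts[1].strip()
        if lowercase:
            # The release keeps capitalization; lower-casing is opt-in.
            english = english.lower()
        pairs.append((english, telugu))
    return pairs


# Made-up sample lines, not taken from the dataset.
sample = [
    "Hyderabad\tహైదరాబాద్",
    "",
    "manchi\tమంచి",
    "line without a separator",
]
pairs = parse_pairs(sample, lowercase=True)
```

To read a downloaded file, pass an open file handle instead of the sample list, for example `parse_pairs(open(path, encoding="utf-8"))`, where `path` points to a local copy.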

Text Generation

  1. Telugu Text Generation by Gnanendra Avvaru.

ASR - Automatic Speech Recognition

  1. IndicASR (Speech Recognition for Indian Languages) by notAI-tech.

Topic Classification/Modeling/Extraction

  1. Topic Modeling/Extraction for Telugu articles by Nirupam Purushothama - Medium (Topic Modeling — 2: Performing LDA on Telugu (తెలుగు) Articles).
  2. Telugu Text classification — Part 1 by Pradeep Miriyala.

OCR

  1. The Banti Framework (Comprehensive OCR System for Telugu Language)

Transliteration

  1. Rice Transliteration Scheme for Telugu

Tokenizer, Stemmer and Lemmatizer

  1. Telugu Tokenizer and Stemmer by chraghavendra
  2. Telugu Language — Lemmatization & POS Tag Extraction by Nirupam Purushothama
  3. Sangita, a Natural Language Toolkit for Indian Languages (currently supports only Hindi).
  4. Program for tokenizing Indian language inp by Anoop Kunchukuttan.

Spell Checker and Error Correction

  1. Telugu Spell check by Chillar Anand.
  2. Grammatical Error Correction using Deep Learning

Syllables

  1. Script to get telugu syllables by Chillar Anand.

Other

  1. Language Modeling for (తెలుగు) Telugu by Karthik Uppuluri.
  2. Telugu Experiments by Karthik Uppuluri
  3. Telugu Language Research Project by Luke Carlson.
  4. NLP for Telugu by Shubham Jain.
  5. TTD Selenium Crawler by Pradeep Miriyala.
  6. Telugu POS by Pradeep Miriyala.
  7. Deep Learning Language Model for Telugu Corpus using LSTM by Akanksha Telagamsetty.
  8. advertools
  9. UGC-NET/JRF Code 103 Indian Knowledge System (IKS) Syllabus by Heera Samvaya.
  10. Sentiment Analysis of Twitter Data using NLTK in Python.
  11. Text Analytics with Python.
  12. Project Chalam - Telugu Books.
  13. Memorize - Code and real data for "Enhancing Human Learning via Spaced Repetition Optimization", PNAS 2019

Surveys

Reading

Learning Telugu

Agriculture, Farming, Gardening, Herbs and Ayurveda

Self Employment

Papers

  1. Enhancing human learning via spaced repetition optimization by Behzad Tabibian, Utkarsh Upadhyay, Abir De, Ali Zarezade, Bernhard Schölkopf, and Manuel Gomez-Rodriguez - Networks Learning Group, Max Planck Institute for Software Systems, 67663 Kaiserslautern, Germany; and Empirical Inference Department, Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany. Edited by Richard M. Shiffrin, Indiana University, Bloomington, IN, and approved December 14, 2018 (received for review September 3, 2018).

    • Memorize - Code and real data for "Enhancing Human Learning via Spaced Repetition Optimization", PNAS 2019
  2. Hindi Shabdamitra - A Wordnet based Tool for Enhancing Teaching-Learning Process by Hanumant Redkar, Nilesh Joshi, Sayali Khare, Lata Popale, Malhar Kulkarni and Pushpak Bhattacharyya - Center for Indian Language Technology, Indian Institute of Technology Bombay, India.

  3. Hindi Shabdamitra - A Wordnet based E-Learning Tool for Language Learning and Teaching by Hanumant Redkar, Sandhya Singh, Meenakshi Somasundaram, Dhara Gorasia, Malhar Kulkarni and Pushpak Bhattacharyya - Center for Indian Language Technology, Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India.

  4. The WordNet in Indian Languages

  5. Using Parallel Corpora for Language Learning by Michael H. Brown, Kanda Institute of Foreign Languages in Tokyo, Japan.

  6. Learning in Parallel: Using Parallel Corpora to Enhance Written Language Acquisition at the Beginning Level by Brody Bluemel, The Pennsylvania State University.

  7. CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences by Devansh Gautam, Prashant Kodali, Kshitij Gupta, Anmol Goel, Manish Shrivastava, Ponnurangam Kumaraguru - International Institute of Information Technology Hyderabad & Indraprastha Institute of Information Technology Delhi & Guru Gobind Singh Indraprastha University, Delhi

  8. The LTRC Hindi-Telugu Parallel Corpus by Vandan Mujadia, Dipti Misra Sharma, MT-NLP Lab, LTRC, KCIS, IIIT-Hyderabad, India.

  9. DuoLingo - Persuasive Language Learning - Qualitative research on user engagement in the persuasive system design of Duolingo

        Authors: Sofie Kastelli, Napsugár Takács
        Supervisor: Ulf Linnman
        Field of research: Informatics
        Date: 1st of June 2023
        Jönköping University 2023
    
  10. Exploring Persuasive Design Elements in Duolingo

  11. The Duolingo Method for App-based Teaching and Learning by Cassie Freeman, Audrey Kittredge, Hope Wilson, and Bozena Pajak - Duolingo Research Report

  12. A Novel Approach to Telugu Stemming Using N-gram Process by N.V. Ganapathi Raju (Associate Professor, Dept. of CSE, GRIET, Hyderabad, INDIA.), Chinta Someswara Rao (Assistant Professor, Dept. of CSE, SRKR Engineering College, Bhimavaram, INDIA.) and G. Meghana (P.G. Scholar, GRIET, Hyderabad, INDIA).

  13. Telugu OCR Framework using Deep Learning by Rakesh Achanta and Trevor Hastie - Stanford University.

  14. Building specialised corpora for translation studies by Sattar Izwaini, Centre for Computational Linguistics, UMIST, PO Box 88, Manchester M60 1QD, UK.

  15. Building parallel corpora for eContent professionals by M. Gavrilidou, P. Labropoulou, E. Desipri, V. Giouli, V. Antonopoulos, S. Piperidis, Institute for Language and Speech Processing.

  16. Text Simplification - Building a Monolingual Parallel Corpus for Text Simplification Using Sentence Similarity Based on Alignment between Word Embeddings

  17. Text Summarisation - Using Parallel Corpora for Multilingual (Multi-document) Summarisation Evaluation

  18. Co-Writing Screenplays and Theatre Scripts with Language Models.

  19. Identifying Context-Dependent Translations for Evaluation Set Production

        Rachel Wicks 1,2 and Matt Post 1−3
    
        1. Human Language Technology Center of Excellence, Johns Hopkins University
        2. Center of Language and Speech Processing, Johns Hopkins University
        3. Microsoft
    
  20. Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach

        By:
        1. Mrinal Dhar, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India
        2. Vaibhav Kumar, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India
        3. Manish Shrivastava, IIIT Hyderabad, Gachibowli, Hyderabad, Telangana, India
    
  21. Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization

        By:
        1. Kusampudi Siva Subrahamanyam Varma, Language Technologies Research Centre, IIIT Hyderabad, India.
        2. Preetham Sathineni, Language Technologies Research Centre, IIIT Hyderabad, India.
        3. Radhika Mamidi, Language Technologies Research Centre, IIIT Hyderabad, India.
    
  22. Development of Telugu-Tamil Transfer-Based Machine Translation system: With Special reference to Divergence Index

        By K. Parameswari, Centre for Applied Linguistics and Translation Studies, University of Hyderabad
    
  23. A Rule-based Dependency Parser for Telugu: An Experiment with Simple Sentences

        By:
        1. SANGEETHA P., PARAMESWARI K.
        2. AMBA KULKARNI
    
  24. Computational Morphology for Telugu

        By:
        1. B. Srinivasu, Department of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad 500001, India
        2. R. Manivannan, Department of Computer Science and Engineering, Stanley College of Engineering and Technology for Women, Hyderabad 500001, India
    
  25. Neural Dependency Parsing of Low-resource Languages: A Case Study on Marathi

  26. Telugu dependency parsing using different statistical parsers

  27. Dative Case in Telugu: A Parsing Perspective

  28. Parsing Hindi with MDParser

  29. Two-stage Approach for Hindi Dependency Parsing Using MaltParser

  30. Hindi Dependency Parsing using a combined model of Malt and MST

  31. Ensembling Various Dependency Parsers: Adopting Turbo Parser for Indian Languages
