Projects for 2022 VPT Programming Contest
The projects are listed below. Their underlying theme is natural language processing, with a specific focus on Indic languages.
Most existing open source libraries use prebuilt models or rely on sources that are hard to reassemble. Most of them also focus on high-resource languages (English, French, Spanish, etc.).
AI4Bharat provides documentation and links to the sources from which its models are generated. Replicating the process of building the model will help validate the approach and will help in understanding the end-to-end process: data collection, pre-processing, and the parameters involved.
- Follow the steps detailed in the IndicTrans repository
- Validate by comparing the generated model against the AI4Bharat model
- AI4Bharat repository: https://github.com/AI4Bharat/indicTrans
- Corpus to build the model: https://indicnlp.ai4bharat.org/samanantar
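One way to approach the validation step is to translate the same source sentences with both the replicated model and the released AI4Bharat model, then measure how similar the two outputs are. The sketch below is only an illustration of that idea: it uses a simplified character n-gram F1 score (in the spirit of chrF, not the official metric), and the function names `ngram_f1` and `compare_models` are hypothetical, not part of IndicTrans.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_f1(hypothesis: str, reference: str, max_n: int = 3) -> float:
    """Average character n-gram F1 for n = 1..max_n (simplified, chrF-like)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this size
        overlap = sum((hyp & ref).values())
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def compare_models(ours: list[str], released: list[str]) -> float:
    """Mean similarity between two models' outputs for the same inputs."""
    if len(ours) != len(released):
        raise ValueError("both models must translate the same sentences")
    if not ours:
        return 0.0
    return sum(ngram_f1(a, b) for a, b in zip(ours, released)) / len(ours)
```

A score close to 1.0 would suggest the replicated model behaves much like the released one on that test set. For a reportable comparison, an established metric implementation would be a better choice than this sketch.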
The Indic language model is built primarily on data from news sites and Wikipedia. So while it is adequate for general translation, it falls short for domain-specific translation needs such as law or medicine, because the domain-specific terms might not have been modeled.
We could attempt to focus on the literary domain.
- Collate a corpus from the literary domain where parallel translations exist
- Process the text (sentence alignment, parameter tuning)
- Compare against test or gold-standard data, with and without the fine-tuned model
- Steps to fine-tune the model: https://github.com/AI4Bharat/indicTrans#finetuning-the-model-on-your-input-dataset
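The sentence alignment step in the list above could start from a length-based aligner. The following is a minimal sketch under stated assumptions, not the method used by IndicTrans: it pairs sentences by character length with dynamic programming, allows 1-1, 2-1, and 1-2 groupings as well as skips, and the `ratio`, `skip_cost`, and merge penalty values are hypothetical and would need tuning per language pair.

```python
def align_sentences(src, tgt, ratio=1.0, skip_cost=1.0):
    """Align two sentence lists by character length.

    ratio is the assumed target/source length ratio for the language pair.
    Returns a list of (source_indices, target_indices) pairs.
    """
    # Allowed groupings: (source sentences consumed, target sentences consumed).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def bead_cost(i, j, di, dj):
        if di == 0 or dj == 0:
            return skip_cost  # a sentence left without a counterpart
        a = ratio * sum(len(s) for s in src[i - di:i])
        b = sum(len(t) for t in tgt[j - dj:j])
        # Relative length difference, plus a small penalty for merged sentences.
        return abs(a - b) / max(a, b, 1) + 0.1 * (di + dj - 2)

    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in beads:
                if i >= di and j >= dj and cost[i - di][j - dj] < inf:
                    c = cost[i - di][j - dj] + bead_cost(i, j, di, dj)
                    if c < cost[i][j]:
                        cost[i][j] = c
                        back[i][j] = (di, dj)

    # Walk back from the end to recover the cheapest sequence of groupings.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Length alone is a weak signal for literary text, where translators often restructure sentences, so the output would likely need manual review or a lexical check before it is used for fine-tuning.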
The current MediaWiki search uses MySQL database-based full-text search by default, which has various limitations. Implement full-text search functionality using Bluge https://github.com/blugelabs/bluge or ZincSearch https://zincsearch.com/. Package it as a plugin so that it can be used to add search functionality to a MediaWiki site. An example of such a plugin: https://www.mediawiki.org/wiki/Extension:Elastica
Implement the search functionality for the https://tamil.wiki site and compare test results from the existing search against the custom implementation.
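To make the core idea concrete, here is a minimal sketch of the data structure behind a full-text search engine: an inverted index with TF-IDF ranking. It is an illustration only and uses neither Bluge nor ZincSearch; the class name `SearchIndex` and the scoring formula are assumptions chosen for brevity.

```python
import math
import string
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Split on whitespace and trim ASCII punctuation.

    Whitespace splitting is used instead of a \\w+ regex so that Tamil
    combining vowel signs stay attached to their consonants.
    """
    tokens = (raw.strip(string.punctuation).lower() for raw in text.split())
    return [t for t in tokens if t]


class SearchIndex:
    def __init__(self):
        # term -> Counter mapping doc_id to term frequency
        self.postings = defaultdict(Counter)
        self.doc_count = 0

    def add(self, doc_id: str, text: str) -> None:
        """Index one document; each doc_id is assumed to be added once."""
        for term in tokenize(text):
            self.postings[term][doc_id] += 1
        self.doc_count += 1

    def search(self, query: str, limit: int = 10) -> list[str]:
        """Return doc_ids ranked by summed TF-IDF over the query terms."""
        scores = Counter()
        for term in tokenize(query):
            docs = self.postings.get(term)
            if not docs:
                continue
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common(limit)]
```

A set of queries run against both this kind of index and the existing site search would give a simple baseline for the comparison described above. A real plugin would also need stemming for Tamil inflections, incremental updates, and persistence.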
The project has two tasks.
- Develop a custom parser to parse the text generated by OCR
- Create an online search interface for the content of AbithanaSinthamani (a Tamil encyclopedia of mythology)
- Dictionary maker: https://github.com/knadh/dictmaker
- Metaphone for Tamil: https://github.com/cmrajan/taphone
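The parsing task depends on how the OCR output is laid out, which is not specified here. As a rough sketch only, the code below assumes a hypothetical layout in which each entry starts with `headword - definition` and wrapped lines continue the previous definition. The function names `parse_entries` and `prefix_search` are illustrative and are not part of dictmaker or taphone.

```python
import re

# Hypothetical entry layout: "headword - definition text".
ENTRY_START = re.compile(r"^(?P<head>\S[^-]*?)\s+-\s+(?P<body>.+)$")


def parse_entries(ocr_text: str) -> dict[str, str]:
    """Turn raw OCR text into a headword -> definition mapping."""
    entries = {}
    current = None
    for raw in ocr_text.splitlines():
        line = " ".join(raw.split())  # collapse stray OCR whitespace
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            current = match.group("head")
            entries[current] = match.group("body")
        elif current is not None:
            previous = entries[current]
            if previous.endswith("-"):
                # Re-join a word that was hyphenated across a line break.
                entries[current] = previous[:-1] + line
            else:
                entries[current] = previous + " " + line
    return entries


def prefix_search(entries: dict[str, str], query: str) -> list[str]:
    """Return headwords that start with the query, in sorted order."""
    return sorted(head for head in entries if head.startswith(query))
```

A continuation line that itself contains ` - ` would be misread as a new entry under this assumption, so the pattern would have to be adapted once the actual OCR output has been inspected. The parsed mapping could then be exported in whatever format the chosen search front end expects.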
Create an algorithm for named entity recognition (NER) in Tamil text. Identify proper nouns (places, persons, etc.) in the Venmurasu text.
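A simple baseline to start from is a gazetteer lookup that tolerates case suffixes. The sketch below assumes a hand-built name list (the `gazetteer` argument is hypothetical input, not an existing resource) and relies on one observation about Tamil spelling: when a vowel-initial suffix attaches to a name ending in a consonant, the final pulli is dropped, so names are matched by prefix after removing it.

```python
import string

PULLI = "\u0bcd"  # Tamil virama, dropped when a vowel-initial suffix attaches


def stem(name: str) -> str:
    """Remove a trailing pulli so inflected forms share the same prefix."""
    return name[:-1] if name.endswith(PULLI) else name


def find_entities(text: str, gazetteer: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (token, canonical_name, label) for tokens that match a known name.

    gazetteer maps a canonical name to a label such as "PERSON" or "PLACE".
    """
    stems = {stem(name): (name, label) for name, label in gazetteer.items()}
    found = []
    for raw in text.split():
        token = raw.strip(string.punctuation)
        # Prefer the longest matching stem to reduce accidental matches.
        best = max((s for s in stems if token.startswith(s)), key=len, default=None)
        if best is not None:
            name, label = stems[best]
            found.append((token, name, label))
    return found
```

For example, with the illustrative entries `{"அர்ஜுனன்": "PERSON", "மதுரை": "PLACE"}`, the tokens `அர்ஜுனனை` and `மதுரையில்` would both be matched to their base names. Prefix matching produces false positives for short names and misses nouns whose stem changes under inflection, so a statistical or rule-based morphological step would be the natural next refinement.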