Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English

Project Mentor
1. Dr. Uthayasanker Thayasivam
Contributors
1. Charangan Vasantharajan
2. Laksika Tharmalingam

Summary

This research develops a simple, automatic OCR engine that extracts text from documents that use legacy fonts and printer-friendly encodings, which are not optimized for text extraction, in order to create a parallel corpus.

For this purpose, we improved the performance of Tesseract 4.1.1 by applying LSTM-based training on many legacy fonts to recognize printed Tamil and Sinhala characters. In particular, our model detects code-mixed text, numbers, and special characters in printed documents.

Description

This project consists of the following.

Dataset
Model Training
Model
Improvements
Corpus Creation

Dataset

We created box files that specify the coordinates of each character. Using jTessBoxEditor, we then corrected misidentified characters and adjusted letter tracking (the spacing between characters) to eliminate overlapping bounding boxes.

The following command generates the TIFF/box file pairs.

tesstrain.sh --fonts_dir data/fonts \
	--fontlist \
	--lang tam \
	--linedata_only \
	--noextract_font_properties \
	--training_text data/langdata/tam/tam.training_text \
	--langdata_dir data/langdata \
	--tessdata_dir data/tessdata \
	--save_box_tiff \
	--maxpages 100 \
	--output_dir data/output
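
Before opening the generated box files in jTessBoxEditor, it can help to flag likely bounding box overlaps automatically. The sketch below is not part of the project; it assumes the standard Tesseract box file line format (symbol, left, bottom, right, top, page, with the origin at the bottom-left corner) and only compares each box with the one that follows it. The helper names are hypothetical, and some overlaps between a base letter and its vowel sign may be legitimate, so the result is a review list rather than a list of errors.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # The last five fields are integers; whatever precedes them is the symbol.
    symbol, left, bottom, right, top, page = line.rstrip("\n").rsplit(" ", 5)
    return Box(symbol, int(left), int(bottom), int(right), int(top), int(page))


def overlaps(a: Box, b: Box) -> bool:
    # Two boxes overlap when they share a page and intersect on both axes.
    if a.page != b.page:
        return False
    return (a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def find_overlaps(lines: List[str]) -> List[Tuple[str, str]]:
    # Assumption: only consecutive boxes are compared, which catches
    # tracking problems between neighbouring characters on a line.
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    return [(a.symbol, b.symbol) for a, b in zip(boxes, boxes[1:]) if overlaps(a, b)]


if __name__ == "__main__":
    sample = ["க 10 20 30 40 0", "ம 28 20 50 40 0", "ல 60 20 80 40 0"]
    for first, second in find_overlaps(sample):
        print(f"overlap between {first!r} and {second!r}")
```

Reading a real file would amount to passing the lines of a .box file in data/output to find_overlaps.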

Model Training

The table below lists the command-line flags used during training. We settled on these values after several experiments with different settings.

Flag	Value
traineddata	path to the traineddata file that contains the unicharset, word DAWG, punctuation pattern DAWG, and number DAWG
model_output	path for the output model files / checkpoints
learning_rate	1e-05
max_iterations	5000
target_error_rate	0.001
continue_from	path to the previous checkpoint from which to continue training
stop_training	converts the training checkpoint to a full traineddata file
train_listfile	path to a file listing the training data files
eval_listfile	path to a file listing the evaluation data files

The following command starts training.

OMP_THREAD_LIMIT=8 lstmtraining \
	--continue_from data/model/tam.lstm \
	--model_output data/finetuned_model/ \
	--traineddata data/tessdata/tam.traineddata \
	--train_listfile data/output/tam.training_files.txt \
	--eval_listfile data/output/tam.training_files.txt \
	--max_iterations 5000
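
The flag table lists a learning rate and a target error rate that the command above leaves at their defaults. As a hedged illustration of how the table maps onto an lstmtraining invocation, the sketch below assembles the argument list in Python. The function names and the checkpoint path are hypothetical and not taken from the repository; only flags that appear in the table are used.

```python
import shlex
from typing import List


def build_training_command(
    continue_from: str,
    model_output: str,
    traineddata: str,
    train_listfile: str,
    eval_listfile: str,
    learning_rate: float = 1e-05,
    max_iterations: int = 5000,
    target_error_rate: float = 0.001,
) -> List[str]:
    # Defaults mirror the values in the flag table.
    return [
        "lstmtraining",
        "--continue_from", continue_from,
        "--model_output", model_output,
        "--traineddata", traineddata,
        "--train_listfile", train_listfile,
        "--eval_listfile", eval_listfile,
        "--learning_rate", str(learning_rate),
        "--max_iterations", str(max_iterations),
        "--target_error_rate", str(target_error_rate),
    ]


def build_finalize_command(checkpoint: str, traineddata: str, model_output: str) -> List[str]:
    # --stop_training converts a checkpoint into a full traineddata file.
    return [
        "lstmtraining",
        "--stop_training",
        "--continue_from", checkpoint,
        "--traineddata", traineddata,
        "--model_output", model_output,
    ]


train_cmd = build_training_command(
    continue_from="data/model/tam.lstm",
    model_output="data/finetuned_model/",
    traineddata="data/tessdata/tam.traineddata",
    train_listfile="data/output/tam.training_files.txt",
    eval_listfile="data/output/tam.training_files.txt",
)
print(shlex.join(train_cmd))

# Hypothetical checkpoint and output paths, for illustration only.
finalize_cmd = build_finalize_command(
    checkpoint="data/finetuned_model/example_checkpoint",
    traineddata="data/tessdata/tam.traineddata",
    model_output="data/finetuned_model/tam.traineddata",
)
print(shlex.join(finalize_cmd))
```

Either list can be passed to subprocess.run once the paths match the local layout.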

Performance Evaluation

In this analysis, we consider two metrics to evaluate OCR output, namely Character Error Rate (CER) and Word Error Rate (WER).
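
As a minimal sketch of how these two metrics are commonly computed, the example below uses the Levenshtein edit distance between the ground truth and the OCR output, divided by the length of the ground truth. The exact tool and text normalization behind the reported numbers are not stated here, so this is an assumption: characters are counted as Unicode code points (not grapheme clusters) and words are split on whitespace.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate in percent, relative to the reference length.
    if not reference:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate in percent, with whitespace-separated words.
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
    print(wer("the cat sat", "the bat sat"))
```

Because both rates are normalized by the reference length, an output with many insertions can exceed 100%.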

Tamil

Font	No. of Chars	Original RC	Original CER (%)	Original WER (%)	Fine-tuned RC	Fine-tuned CER (%)	Fine-tuned WER (%)
Aabohi	757	757	0.19	2.67	757	0.19	2.67
AnbeSivam	762	774	7.87	57.89	765	2.71	31.58
Baamini	762	770	7.44	56.26	762	2.42	31.58
Eelanadu	762	773	4.88	43.42	763	0.58	9.21
Kamaas	762	756	3.38	28.95	766	0.43	9.21
Keeravani	767	764	0.68	13.16	764	0.19	1.32
Kilavi	762	767	0.48	9.21	763	0.14	2.63
Klaimakal	762	765	0.82	14.47	766	0.48	3.95
Tamilweb	762	808	20.39	88.89	772	11.13	67.90
Nagananthini	762	783	14.2	82.89	785	7.83	46.05
-	-	-	-	-	-	-	-
Mean			6.03	39.68		2.61	20.61

Sinhala

Font	No. of Chars	Original RC	Original CER (%)	Original WER (%)	Fine-tuned RC	Fine-tuned CER (%)	Fine-tuned WER (%)
Bhasitha	731	701	25.97	84.62	725	8.73	46.15
BhashitaComplex	731	728	5.11	27.35	731	3.94	23.08
Bhasitha2Sans	731	726	4.68	23.93	730	3.88	22.22
Bhasitha Screen	731	726	4.79	24.79	729	3.99	23.93
Dinaminal Uni Web	731	728	5.64	29.91	731	4.52	22.22
Hodipotha	731	726	6.07	35.90	729	4.10	24.79
Malithi Web	731	718	6.01	34.19	726	4.74	29.91
Noto Sans Sinhala	731	730	3.94	23.08	732	3.73	21.37
Sarasavi Unicode	731	709	9.10	38.46	728	5.64	27.35
Warna	731	726	4.74	28.21	732	4.10	24.79
-	-	-	-	-	-	-	-
Mean			7.61	35.04		4.74	26.58

Model

The architecture of the OCR pipeline is as follows. First, we detect the file type and, if the input file is a PDF, convert it to images. The images are then binarized, and character boundary detection techniques are applied to find character boxes. Next, deep learning modules identify word and line boundaries, after which the characters are recognized. Finally, the output is post-processed using a language model.
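
The first two stages can be illustrated with a small, dependency-free sketch. It is not the project's implementation: file type detection is assumed to rely on the PDF signature at the start of the file, binarization is shown as a fixed global threshold (the actual method is not stated), and all function names are hypothetical.

```python
from typing import List

PDF_MAGIC = b"%PDF-"  # PDF files begin with this signature


def is_pdf(header: bytes) -> bool:
    # Decide from the first bytes of the file whether it is a PDF.
    return header.startswith(PDF_MAGIC)


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    # Assumption: simple global threshold on 0-255 grayscale pixels.
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def plan_stages(header: bytes) -> List[str]:
    # Lists the stages in the order described above; PDFs get an extra first step.
    stages = []
    if is_pdf(header):
        stages.append("convert PDF pages to images")
    stages += [
        "binarize images",
        "detect character boxes",
        "detect word and line boundaries",
        "recognize characters",
        "post-process with language model",
    ]
    return stages


if __name__ == "__main__":
    for step in plan_stages(b"%PDF-1.7\n"):
        print(step)
    print(binarize([[0, 127, 128, 255]]))
```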

Corpus Creation

To create the parallel corpus, we downloaded the required PDFs in all three languages from the www.parliament.lk website and fed them into our model to extract the text.

Corpus statistics

Language	No. of Files	No. of Sentences	No. of Words	No. of Unique Words	Total Size
Tamil	100	185.4K	2.11M	334.16K	45.3MB
Sinhala	100	168.9K	2.22M	407.99K	35.7MB
English	100	181.04K	2.33M	372.03K	20.8MB
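
Statistics of this kind can be gathered with a few lines of code. The sketch below is an assumption about the counting rules, which are not specified here: sentences are split after a full stop, question mark, or exclamation mark followed by whitespace, words are whitespace-separated tokens (punctuation stays attached), and size is the UTF-8 byte length. Tamil and Sinhala letters take three bytes each in UTF-8, which is one reason their totals can be larger than the English one for a similar word count.

```python
import re
from typing import Dict


def corpus_stats(text: str) -> Dict[str, int]:
    # Assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = text.split()
    return {
        "sentences": len(sentences),
        "words": len(words),
        "unique_words": len(set(words)),
        "bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    print(corpus_stats("The House met. The House adjourned."))
    # A single Tamil letter occupies three bytes in UTF-8.
    print(len("க".encode("utf-8")))
```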

Cite this work

@INPROCEEDINGS{9961304, 
    author={Vasantharajan, Charangan and Tharmalingam, Laksika and Thayasivam, Uthayasanker},
    booktitle={2022 International Conference on Asian Language Processing (IALP)}, 
    title={Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English}, 
    year={2022},
    volume={}, 
    number={}, 
    pages={143-149}, 
    doi={10.1109/IALP57159.2022.9961304}
}

License

Apache License 2.0

Code of Conduct

Please read our code of conduct document here.
