Vietnamese OCR Error Correction Toolbox

This project provides a set of tools to correct OCR errors in Vietnamese text. The project currently supports four features: OCR corrector, address corrector, letter case corrector, and datetime corrector.

Features

1. OCR Corrector

The OCR Corrector uses a seq2seq or transformer model to predict the most likely correction for OCR errors in Vietnamese text. Statistics show that Vietnamese OCR errors skew heavily towards wrong or missing diacritics, so the OCR Corrector focuses mainly on diacritics correction. See the Appendix for more details. The code for this feature is heavily based on this tutorial.

See it in action

Before:

Sau khi có y kien của Phó thủ tướng Trần Hồng Hà, UBND tinh Đồng Nai đồng ý gia han 4 mỏ đất phục vụ đắp nền cho tuyen cao tốc chạy qua địa bàn

After:

Sau khi có ý kiến của Phó thủ tướng Trần Hồng Hà, UBND tỉnh Đồng Nai đồng ý gia hạn 4 mỏ đất phục vụ đắp nên cho tuyến cao tốc chạy qua địa bàn

2. Address Corrector

The Address Corrector uses a trie to identify and correct common mistakes in Vietnamese addresses. A trie was preferred over a dictionary-based approach because of its efficiency. The trie was built from the Vietnamese address data in vietnam_dataset.

See it in action

Before:

16.5 C/C 4 Nguyễn Đinh Chieu Đa Kao, Quận 1, TP HoChí Minh

After:

16.5 C/C 4 Nguyễn Đình Chiểu, Phường Đa Kao, Quận 1, Thành Phố Hồ Chí Minh
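
The trie lookup described above might be sketched as follows. This is a minimal illustration, not the project's implementation: the `AddressTrie` class, the `normalize` helper, and the key scheme (lowercase, diacritics and spaces removed) are all assumptions made for the example.

```python
import unicodedata

_END = "_end"  # terminal marker; cannot collide with single-character child keys


def normalize(text: str) -> str:
    """Hypothetical key scheme: lowercase, strip diacritics, drop whitespace."""
    text = text.lower().replace("đ", "d")  # "đ" has no Unicode decomposition
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(stripped.split())


class AddressTrie:
    """Maps normalized spellings to their canonical address component."""

    def __init__(self):
        self.root = {}

    def insert(self, canonical: str) -> None:
        node = self.root
        for ch in normalize(canonical):
            node = node.setdefault(ch, {})
        node[_END] = canonical

    def lookup(self, text: str):
        """Return the canonical form for a noisy spelling, or None if unknown."""
        node = self.root
        for ch in normalize(text):
            node = node.get(ch)
            if node is None:
                return None
        return node.get(_END)


trie = AddressTrie()
for name in ["Hồ Chí Minh", "Nguyễn Đình Chiểu", "Đa Kao"]:
    trie.insert(name)

print(trie.lookup("HoChí Minh"))
print(trie.lookup("Nguyễn Đinh Chieu"))
```

Because every key is stored character by character, the same walk could be extended to prefix or longest-match queries, which a flat dictionary does not offer directly.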

3. Letter Case Corrector

The Letter Case Corrector uses heuristics to identify and correct common letter case mistakes in Vietnamese text. These errors mainly result from characters whose uppercase and lowercase forms have a similar shape, such as C, O, and S.

See it in action

Before:

tôi có một CƠ SỞ sản xuất bún đậu mắm tôm

After:

tôi có một cơ sở sản xuất bún đậu mắm tôm
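
One possible heuristic of this kind is sketched below. It is an assumption for illustration only, not the project's actual rule set: in a mostly lowercase sentence, an all-uppercase word is lowercased when every base letter belongs to a set of look-alike letters. The `CONFUSABLE` set and the function names are hypothetical.

```python
import unicodedata

# Hypothetical set of letters whose uppercase and lowercase glyphs look alike.
CONFUSABLE = set("COSUVWXZ")


def base_letters(word: str) -> str:
    """Drop combining marks so that 'Ơ' is compared as 'O'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fix_letter_case(sentence: str) -> str:
    words = sentence.split()
    lower_count = sum(1 for w in words if w.islower())
    mostly_lower = lower_count > len(words) / 2
    fixed = []
    for w in words:
        # Suspicious: fully uppercase and built only from look-alike letters.
        suspicious = w.isupper() and all(
            ch in CONFUSABLE for ch in base_letters(w) if ch.isalpha()
        )
        fixed.append(w.lower() if mostly_lower and suspicious else w)
    return " ".join(fixed)


print(fix_letter_case("tôi có một CƠ SỞ sản xuất bún đậu mắm tôm"))
```

Acronyms such as UBND are left alone in this sketch because B, N, and D are not in the look-alike set.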

4. Datetime Corrector

The Datetime Corrector uses heuristics to identify and correct common mistakes in Vietnamese datetime strings. The targeted datetime format is widely used on documents such as the Vietnamese National ID Card, the Vietnamese Driver's License, and Vietnamese public documents.

See it in action

Before:

ngày /date 01 tháng /month 04 năm/year 2022

After:

01/04/2022
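
The transformation above could be approximated with a regular expression, as in the sketch below. The pattern and the `fix_datetime` name are assumptions for illustration; the project's real heuristics may cover more OCR noise than this.

```python
import re
import unicodedata
from datetime import date

# Matches "ngày [/date] DD tháng [/month] MM năm [/year] YYYY" with loose spacing.
_DATE_RE = re.compile(
    unicodedata.normalize(
        "NFC",
        r"ngày\s*(?:/\s*date)?\s*(\d{1,2})\s*"
        r"tháng\s*(?:/\s*month)?\s*(\d{1,2})\s*"
        r"năm\s*(?:/\s*year)?\s*(\d{4})",
    ),
    re.IGNORECASE,
)


def fix_datetime(text: str) -> str:
    """Rewrite bilingual Vietnamese date phrases as DD/MM/YYYY."""
    text = unicodedata.normalize("NFC", text)

    def to_slash_date(match):
        day, month, year = (int(group) for group in match.groups())
        try:
            date(year, month, day)  # reject impossible dates such as 31/02
        except ValueError:
            return match.group(0)  # leave the original text untouched
        return f"{day:02d}/{month:02d}/{year:04d}"

    return _DATE_RE.sub(to_slash_date, text)


print(fix_datetime("ngày /date 01 tháng /month 04 năm/year 2022"))
```

Validating with `datetime.date` keeps the sketch from turning a misread number into a confident but impossible date.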

Installation

git clone https://github.com/buiquangmanhhp1999/VietnameseOcrCorrection.git
cd VietnameseOcrCorrection
pip install -r requirements.txt

Usage

The main.py file provides a sample snippet showing how to use the project.

from main import Corrector
corrector = Corrector(kwargs_address=address_corrector_cfg, kwargs_ocr=ocr_corrector_cfg)
corrector(text, mode)

The corrector's call method accepts two arguments: the text to correct and the correction mode. The correction mode can be one of the following: "ocr", "address", "lettercase", or "datetime".

Appendix

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing shows statistically that text errors from human typing and from OCR output have different distributions. Briefly, humans are more likely to produce shorter, non-existent, and space-missing words, while OCR errors tend to be longer, to form real words, and to contain redundant spaces.

OCR	Human
59.21% real-word error (Hoàng → Hoàn)	67.5% non-word error (Hoàng → Hoàgn)
2.36x higher incorrect split error (Hoàng → Ho àng)	6.5x higher run-on error (Hoàng hôn → Hoànghôn)
42.1% short-word error	63% short-word error

The distribution of errors acts as an input to the OCR correction model and is essential to its performance, hence the need for a statistical study of Vietnamese OCR output errors.

Using 214 document pages (549,799 characters, about 137,643 words) crawled from Vietnamese Wikipedia, I found that almost 60% of correctable Vietnamese OCR errors are caused by missing or wrong diacritics. See the Exploratory data analysis report for more details.
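
A share like this could be measured along the following lines. This is a hedged sketch, not the analysis from the report: it assumes the ground-truth and OCR words are already aligned one to one, treats đ/d confusion as a diacritic error, and uses hypothetical function names.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove tone and vowel marks; đ/Đ are mapped by hand (an assumption)."""
    word = word.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_diacritic_error(truth: str, ocr: str) -> bool:
    """True when the words differ only in their diacritics."""
    truth = unicodedata.normalize("NFC", truth)
    ocr = unicodedata.normalize("NFC", ocr)
    return truth != ocr and strip_diacritics(truth) == strip_diacritics(ocr)


def diacritic_error_share(pairs) -> float:
    """Fraction of erroneous (truth, ocr) word pairs that are diacritic errors."""
    errors = [
        (t, o)
        for t, o in pairs
        if unicodedata.normalize("NFC", t) != unicodedata.normalize("NFC", o)
    ]
    if not errors:
        return 0.0
    return sum(is_diacritic_error(t, o) for t, o in errors) / len(errors)


# Word pairs taken from the examples in this README, assumed pre-aligned.
pairs = [
    ("ý", "y"),
    ("kiến", "kien"),
    ("tỉnh", "tinh"),
    ("hạn", "han"),
    ("của", "của"),      # correct word, ignored
    ("Hoàng", "Hoàn"),   # real-word error, not a diacritic error
]
print(diacritic_error_share(pairs))
```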

A sample document page crawled from Vietnamese Wikipedia:
