Toolbox to correct Vietnamese OCR output in terms of diacritics, addresses, letter case, and datetime strings.

mrlasdt/vietnamese-ocr-error-corrector

Vietnamese OCR Error Correction Toolbox

This project provides a set of tools to correct OCR errors in Vietnamese text. The project currently supports four features: OCR corrector, address corrector, letter case corrector, and datetime corrector.

Features

1. OCR Corrector

The OCR Corrector uses a seq2seq or transformer model to predict the most likely correction for OCR errors in Vietnamese text. Statistical analysis shows that Vietnamese OCR errors skew significantly towards wrong diacritics, so the OCR Corrector focuses mainly on diacritics correction. See the Appendix for more details. The code for this feature is heavily based on this tutorial.

    See it in action
    Before:
    Sau khi có y kien của Phó thủ tướng Trần Hồng Hà, UBND tinh Đồng Nai đồng ý gia han 4 mỏ đất phục vụ đắp nền cho tuyen cao tốc chạy qua địa bàn
    After:
    Sau khi có ý kiến của Phó thủ tướng Trần Hồng Hà, UBND tỉnh Đồng Nai đồng ý gia hạn 4 mỏ đất phục vụ đắp nên cho tuyến cao tốc chạy qua địa bàn
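
The diacritics-correction task can be pictured as the inverse of diacritic removal. The following is a minimal sketch, assuming that (noisy, clean) training pairs could be synthesized by stripping diacritics from clean text. The source does not describe how its training data is built, and `strip_diacritics` is a hypothetical helper, not part of the project.

```python
import unicodedata

def strip_diacritics(text):
    """Return text with Vietnamese diacritics removed (hypothetical helper)."""
    # "đ"/"Đ" have no Unicode decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (category "Mn") and keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

clean = "Sau khi có ý kiến của Phó thủ tướng"
# (noisy input, correction target) pair for a sequence-to-sequence model
pair = (strip_diacritics(clean), clean)
print(pair[0])
```

A model trained on such pairs would learn to restore the marks that this function removes. Real OCR noise also contains wrong (not only missing) diacritics, which this sketch does not simulate.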

2. Address Corrector

The Address Corrector uses a trie to identify and correct common mistakes in Vietnamese addresses. A trie was preferred over a dictionary-based approach because of its efficiency. The trie is built from Vietnamese address data in vietnam_dataset.

    See it in action
    Before:
    16.5 C/C 4 Nguyễn Đinh Chieu Đa Kao, Quận 1, TP HoChí Minh
    After:
    16.5 C/C 4 Nguyễn Đình Chiểu, Phường Đa Kao, Quận 1, Thành Phố Hồ Chí Minh
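
To illustrate how a trie can map a noisy address component back to its canonical spelling, here is a minimal sketch. It assumes lookups are keyed on a lowercased, diacritic-free form of each name; the class and function names are hypothetical, and the project's actual trie and matching logic may differ.

```python
import unicodedata

def normalize(text):
    """Lowercase and strip diacritics so noisy and clean spellings share a key."""
    text = text.lower().replace("đ", "d")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

class TrieNode:
    def __init__(self):
        self.children = {}     # next character -> TrieNode
        self.canonical = None  # set only on nodes that end a known name

class AddressTrie:
    def __init__(self, names=()):
        self.root = TrieNode()
        for name in names:
            self.insert(name)

    def insert(self, name):
        node = self.root
        for ch in normalize(name):
            node = node.children.setdefault(ch, TrieNode())
        # Names that normalize to the same key overwrite each other here.
        node.canonical = name

    def lookup(self, noisy):
        """Return the canonical name for a noisy spelling, or None if unknown."""
        node = self.root
        for ch in normalize(noisy):
            node = node.children.get(ch)
            if node is None:
                return None
        return node.canonical

trie = AddressTrie(["Nguyễn Đình Chiểu", "Đa Kao", "Hồ Chí Minh"])
print(trie.lookup("Nguyễn Đinh Chieu"))
```

This exact-key lookup only repairs diacritic and case errors. Handling missing spaces (such as "HoChí Minh") or missing prefixes (such as "Phường") would need additional matching steps that are not shown.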

3. Letter Case Corrector

The Letter Case Corrector uses heuristics to identify and correct common letter-case mistakes in Vietnamese text. These errors mainly result from characters whose uppercase and lowercase forms have similar shapes, such as C, O, and S.

    See it in action
    Before:
    tôi có một CƠ SỞ sản xuất bún đậu mắm tôm
    After:
    tôi có một cơ sở sản xuất bún đậu mắm tôm
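
One possible heuristic of this kind is sketched below. It assumes that an all-uppercase word inside a mostly lowercase sentence is an OCR mistake when every base letter is shape-confusable. The letter set beyond C, O, and S and the majority threshold are assumptions for illustration, not the project's actual rules.

```python
import unicodedata

# Letters whose upper- and lowercase shapes look alike. C, O, S come from the
# text above; the remaining letters are an assumption.
CONFUSABLE = set("cosuvwxz")

def base_letters(word):
    """Return the lowercase base letters of a word, without diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return [ch for ch in decomposed if unicodedata.category(ch).startswith("L")]

def fix_lettercase(sentence):
    words = sentence.split()
    lowercase_words = sum(w.islower() for w in words)
    # Only act when the sentence is mostly lowercase (assumed threshold: > 50%).
    if lowercase_words * 2 <= len(words):
        return sentence
    fixed = []
    for word in words:
        letters = base_letters(word)
        if word.isupper() and letters and all(ch in CONFUSABLE for ch in letters):
            fixed.append(word.lower())
        else:
            fixed.append(word)
    # Note: joining with single spaces collapses any repeated whitespace.
    return " ".join(fixed)

print(fix_lettercase("tôi có một CƠ SỞ sản xuất bún đậu mắm tôm"))
```

Uppercase words containing non-confusable letters, such as the abbreviation "UBND", are left untouched by this rule.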

4. Datetime Corrector

The Datetime Corrector uses heuristics to identify and correct common mistakes in Vietnamese datetime strings. The targeted datetime format is widely used on documents such as the Vietnamese National ID Card, the Vietnamese Driver's License, and Vietnamese public documents.

    See it in action
    Before:
    ngày /date 01 tháng /month 04 năm/year 2022
    After:
    01/04/2022
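
The example above can be approximated with a small regular-expression heuristic. This is a sketch under the assumption that the string contains a day, a month, and a four-digit year in that order, separated by arbitrary non-digit text; `normalize_date` is a hypothetical name, and the project's own heuristics are likely more extensive.

```python
import re
from datetime import date

# day (1-2 digits), month (1-2 digits), year (4 digits), separated by non-digits
DATE_PATTERN = re.compile(r"(\d{1,2})\D+?(\d{1,2})\D+?(\d{4})")

def normalize_date(text):
    """Return the date found in text as dd/mm/yyyy, or None if none is valid."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month, year = (int(group) for group in match.groups())
    try:
        date(year, month, day)  # rejects impossible dates such as 31/02
    except ValueError:
        return None
    return f"{day:02d}/{month:02d}/{year:04d}"

print(normalize_date("ngày /date 01 tháng /month 04 năm/year 2022"))
```

Validating with `datetime.date` keeps the heuristic from emitting impossible dates when OCR noise corrupts a digit.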

Installation

git clone https://github.com/buiquangmanhhp1999/VietnameseOcrCorrection.git
cd VietnameseOcrCorrection
pip install -r requirements.txt

Usage

The main.py file provides a sample snippet showing how to use the project.

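from main import Corrector

# address_corrector_cfg and ocr_corrector_cfg are keyword-argument
# configurations that you must define first (see main.py for a sample).
corrector = Corrector(kwargs_address=address_corrector_cfg, kwargs_ocr=ocr_corrector_cfg)
corrector(text, mode)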

The corrector's call method accepts two arguments: the text to correct and the correction mode. The correction mode can be one of the following: "ocr", "address", "lettercase", or "datetime".
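
For readers curious how a single callable can serve several modes, here is a hypothetical sketch of mode dispatch. `ModeDispatcher` and the placeholder callables are invented for illustration; this is not the project's `Corrector` implementation.

```python
class ModeDispatcher:
    """Hypothetical stand-in that routes text to one corrector per mode."""

    def __init__(self, correctors):
        # correctors maps a mode name to a callable taking and returning str
        self.correctors = dict(correctors)

    def __call__(self, text, mode):
        try:
            corrector = self.correctors[mode]
        except KeyError:
            valid = ", ".join(sorted(self.correctors))
            raise ValueError(f"unknown mode {mode!r}; expected one of: {valid}") from None
        return corrector(text)

dispatch = ModeDispatcher({
    "lettercase": str.lower,  # placeholder callables, not the real correctors
    "datetime": str.strip,
})
print(dispatch("CƠ SỞ", "lettercase"))
```

Failing loudly on an unknown mode makes typos in the mode string easy to spot.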

Appendix

"Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing" shows statistically that text errors from human typing and from OCR output follow different distributions. Briefly, humans are more likely to produce short, non-existent, and space-missing (run-on) words, while OCR errors tend to be longer, real-word, and space-redundant (incorrectly split).

| OCR | Human |
| --- | --- |
| 59.21% real-word error (Hoàng → Hoàn) | 67.5% non-word error (Hoàng → Hoàgn) |
| 2.36x higher incorrect split error (Hoàng → Ho àng) | 6.5x higher run-on error (Hoàng hôn → Hoànghôn) |
| 42.1% short-word error | 63% short-word error |
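
The error categories compared above can be made concrete with a small classifier. This is a sketch under the assumption that a ground-truth string, the observed string, and a vocabulary of valid words are available; `classify_error` and the tiny vocabulary are hypothetical.

```python
def classify_error(truth, observed, vocabulary):
    """Label the error type of an observed string against its ground truth."""
    if truth == observed:
        return "correct"
    if truth.replace(" ", "") == observed.replace(" ", ""):
        # Same characters, only the spacing differs.
        if observed.count(" ") > truth.count(" "):
            return "incorrect split"  # an extra space was inserted
        if observed.count(" ") < truth.count(" "):
            return "run-on"           # a space was dropped
    # Otherwise decide by whether the wrong string is itself a valid word.
    return "real-word" if observed in vocabulary else "non-word"

vocabulary = {"Hoàng", "Hoàn", "hôn"}
for truth, observed in [
    ("Hoàng", "Hoàn"),
    ("Hoàng", "Hoàgn"),
    ("Hoàng", "Ho àng"),
    ("Hoàng hôn", "Hoànghôn"),
]:
    print(truth, "->", observed, ":", classify_error(truth, observed, vocabulary))
```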

Because the error distribution defines the input that the OCR correction model sees and is essential to its performance, a statistical study of Vietnamese OCR output errors is needed.

Using 214 document pages (549,799 characters, about 137,643 words) crawled from Vietnamese Wikipedia, I found that almost 60% of correctable Vietnamese OCR errors are caused by missing or wrong diacritics. See the Exploratory data analysis report for more details.
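
A share like the one reported above could be measured roughly as follows. This is a minimal sketch, assuming word-aligned (ground truth, OCR output) pairs are available; the helper names and the toy pairs are invented for illustration and are not the data or code behind the reported figure.

```python
import unicodedata

def strip_diacritics(word):
    """Remove Vietnamese diacritics, mapping đ/Đ by hand."""
    word = word.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def diacritic_error_share(pairs):
    """Fraction of erroneous (truth, ocr) pairs that differ only in diacritics."""
    nfc = lambda s: unicodedata.normalize("NFC", s)
    errors = [(t, o) for t, o in pairs if nfc(t) != nfc(o)]
    if not errors:
        return 0.0
    diacritic_only = sum(strip_diacritics(t) == strip_diacritics(o) for t, o in errors)
    return diacritic_only / len(errors)

# Toy pairs: two diacritic errors, one other error, one correct word.
pairs = [("kiến", "kien"), ("tỉnh", "tinh"), ("Hoàng", "Hoàn"), ("hạn", "hạn")]
share = diacritic_error_share(pairs)
print(f"{share:.2%} of errors are diacritic-only")
```

A pair counts as a diacritic error when both sides become identical after diacritics are removed, which covers missing and wrong diacritics alike.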

Figure: A sample document page crawled from Vietnamese Wikipedia.
