Zhir Benchmark

This repo contains a test suite that measures the accuracy of our models and of our pre-processing and post-processing scripts. We keep the suite diverse so that we don't regress in some areas while focusing on others.

Areas we are now focusing on include:

Area	Description
Clean Scan	Shows how our models perform when the documents are scanned with proper hardware. This use case is very important for businesses, because they usually don't mind using proper hardware (e.g. a normal scanner/printer) to scan documents. Performance mostly depends on the models, and on the pre-processing scripts not degrading images that are already good.
Screenshots	Screenshots of images. This use case is important for extracting text from PDFs whose encoding has been corrupted. Performance mostly depends on the models, and on the pre-processing scripts not degrading images that are already good.
Phones	Images taken with a cell phone. These images might need quite a bit of pre-processing before Tesseract can do a good job. Performance depends on the pre-processing scripts.
Posters	Images that have colorful or complex backgrounds. Examples: book covers, posters, memes, infographics, etc.
Edge Cases	Images that exercise edge cases. Example: white text on a black background.
Tables	TBD
Two-columns	TBD

Format

Each test case is made up of two files:

An image file, either JPG or PNG.
A text file with the same base name that contains the Ground Truth.

Example: s-1.jpg and s-1.txt
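
The pairing rule above can be checked before a run. The following is a minimal sketch, not part of the benchmark itself: the function name find_test_cases is hypothetical, and the recursive search assumes that test cases may live in sub-folders of the data directory.

```python
from pathlib import Path

# Image extensions named in the Format section.
IMAGE_SUFFIXES = {".jpg", ".png"}


def find_test_cases(data_dir):
    """Pair every image with its ground-truth file.

    Returns (cases, missing): cases is a list of (image, truth) paths,
    missing lists the images that have no matching .txt file.
    """
    cases, missing = [], []
    for image in sorted(Path(data_dir).rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        truth = image.with_suffix(".txt")  # s-1.jpg -> s-1.txt
        if truth.is_file():
            cases.append((image, truth))
        else:
            missing.append(image)
    return cases, missing
```

Calling find_test_cases("./data") and inspecting the second return value is a quick way to spot images that were added without their ground truth.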

How to run tests

Install Tesseract

See the official Tesseract documentation for installation instructions.

Put your model file in TESSDATA_PREFIX

Tesseract expects model files to be in TESSDATA_PREFIX. The value of TESSDATA_PREFIX depends on the OS and the version of Tesseract. For example, on Ubuntu, Tesseract 5-alpha expects the models to be in /usr/share/tesseract-ocr/5/tessdata/. Model files have the .traineddata extension.
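
To confirm that the models are where Tesseract will look for them, something like the sketch below may help. It is an illustration only: missing_models is a hypothetical helper, and it assumes each language code in a string such as ckb+eng maps to a file named <code>.traineddata directly inside the tessdata directory.

```python
from pathlib import Path


def missing_models(languages, tessdata_dir):
    """Return the language codes that have no .traineddata file.

    languages uses the same form as the command line, e.g. "ckb+eng".
    """
    tessdata = Path(tessdata_dir)
    return [
        lang
        for lang in languages.split("+")
        if not (tessdata / f"{lang}.traineddata").is_file()
    ]
```

For example, missing_models("ckb+eng", "/usr/share/tesseract-ocr/5/tessdata/") returns an empty list when both model files are installed.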

Install ocreval

The tests depend on ocreval, so its commands must be present in PATH. Head over to the official repo for instructions on how to install it. Note: We only support Linux and Mac because ocreval is not available for Windows.
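
A quick way to verify the PATH requirement is to look the commands up with shutil.which. This is a sketch under an assumption: the command names listed here (accuracy, wordacc, accsum, wordaccsum) are ocreval tools, but which of them src/eval.py actually calls is not stated in this README, so adjust the tuple as needed.

```python
import shutil

# Assumed list of required ocreval commands; check src/eval.py for the real one.
OCREVAL_COMMANDS = ("accuracy", "wordacc", "accsum", "wordaccsum")


def missing_commands(commands=OCREVAL_COMMANDS):
    """Return the commands that cannot be found in PATH."""
    return [name for name in commands if shutil.which(name) is None]
```

An empty list from missing_commands() means every listed command was found.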

Run the python script

python3 ./src/eval.py source dest languages [--tessdata] [--dirty]

Examples:

python3 ./src/eval.py ./data ./out ckb

python3 ./src/eval.py ./data ./out ckb+eng

If you don't want to run ZhirPy on the images:

python3 ./src/eval.py ./data ./out ckb --dirty

This will create these files for each image:

image.jpg: The image.
image.txt: The ground truth of the image.
image.actual.txt: Tesseract result of the image.
image.ca.txt: Character accuracy report for the image.
image.wa.txt: Word accuracy report for the image.

It also creates two aggregate reports:

word_accuracy.txt: Aggregate word accuracy of all images.
character_accuracy.txt: Aggregate character accuracy of all images.
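
The naming scheme above makes it easy to check that a run completed for a given image. The sketch below only derives the expected file names from the lists above; expected_outputs is a hypothetical helper, and it assumes the per-image files sit next to each other in the destination folder.

```python
from pathlib import Path


def expected_outputs(image_name):
    """List the per-image files a run should produce for one image."""
    stem = Path(image_name).stem  # "s-1.jpg" -> "s-1"
    return [
        Path(image_name).name,  # the image itself
        f"{stem}.txt",          # ground truth
        f"{stem}.actual.txt",   # Tesseract result
        f"{stem}.ca.txt",       # character accuracy report
        f"{stem}.wa.txt",       # word accuracy report
    ]
```

Comparing this list against the contents of the destination folder shows which reports, if any, are missing for an image.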
