
Zhir Benchmark


This repo contains a test suite for measuring the accuracy of our models, pre-processing, and post-processing scripts. We keep the suite diverse to make sure we don't regress in some areas while focusing on others.

Areas we are now focusing on include:

  • Clean Scan: Shows how our models perform when proper hardware (e.g. a normal scanner/printer) is used to scan the documents. This use case is very important for businesses because they usually don't mind using proper hardware. Performance mostly depends on the models, and on the ability of the pre-processing scripts not to mess up already good images.
  • Screenshots: Screenshots of images. This use case is important for extracting text from PDFs whose encoding has been corrupted. Performance mostly depends on the models, and on the ability of the pre-processing scripts not to mess up already good images.
  • Phones: Images taken with a cell phone. These might need quite a bit of pre-processing before Tesseract can do a good job, so performance depends on the pre-processing scripts.
  • Posters: Images with colorful or complex backgrounds, e.g. book covers, posters, memes, and infographics.
  • Edge Cases: Images that exercise edge cases, e.g. white text on a black background.
  • Tables: TBD
  • Two-columns: TBD

Format

Each test case is made up of two files:

  1. An image file, which can be either a JPG or a PNG.
  2. A text file that contains the ground truth.

Example: s-1.jpg and s-1.txt
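For illustration, the pairing convention above can be sketched like this (`collect_test_cases` is a hypothetical helper; the actual logic lives in src/eval.py):

```python
from pathlib import Path

def collect_test_cases(data_dir: str) -> list[tuple[Path, Path]]:
    # Hypothetical helper: pair each image (JPG or PNG) with the
    # ground-truth .txt file that shares its stem, e.g. s-1.jpg with s-1.txt.
    # Images without a matching .txt file are skipped.
    pairs = []
    for pattern in ("*.jpg", "*.png"):
        for image in sorted(Path(data_dir).glob(pattern)):
            truth = image.with_suffix(".txt")
            if truth.exists():
                pairs.append((image, truth))
    return pairs
```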

How to run tests

Install Tesseract

You can find the instructions for installing Tesseract here.

Put your model file in TESSDATA_PREFIX

Tesseract expects model files to be in TESSDATA_PREFIX, whose location depends on the OS and the Tesseract version. For example, on Ubuntu, Tesseract 5-alpha expects the models to be in /usr/share/tesseract-ocr/5/tessdata/. Model files have the .traineddata extension.
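As an illustration of how Tesseract resolves a language name to a model file (`model_path` is a hypothetical helper, not part of this repo):

```python
from pathlib import Path

def model_path(tessdata_prefix: str, lang: str) -> Path:
    # Tesseract resolves a language code such as "ckb" to the file
    # <TESSDATA_PREFIX>/<lang>.traineddata.
    return Path(tessdata_prefix) / f"{lang}.traineddata"

# e.g. on Ubuntu with Tesseract 5-alpha:
# model_path("/usr/share/tesseract-ocr/5/tessdata", "ckb")
# -> /usr/share/tesseract-ocr/5/tessdata/ckb.traineddata
```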

Install ocreval

The tests depend on ocreval, so its commands must be present in PATH. Head over to the official repo for instructions on how to install it. Note: we only support Linux and macOS because ocreval is not available for Windows.

Run the python script

python3 ./src/eval.py source dest languages [--tessdata] [--dirty]

Examples:

python3 ./src/eval.py ./data ./out ckb
python3 ./src/eval.py ./data ./out ckb+eng

If you don't want to run ZhirPy on the images:

python3 ./src/eval.py ./data ./out ckb --dirty

This will create these files for each image:

  • image.jpg: The image.
  • image.txt: The ground truth of the image.
  • image.actual.txt: Tesseract result of the image.
  • image.ca.txt: Character accuracy report for the image.
  • image.wa.txt: Word accuracy report for the image.

And will create these two aggregate reports:

  • word_accuracy.txt: Aggregate word accuracy of all images.
  • character_accuracy.txt: Aggregate character accuracy of all images.
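The per-image naming scheme above can be sketched as follows (`expected_outputs` is a hypothetical helper for checking a run's output directory, not part of eval.py):

```python
def expected_outputs(stem: str) -> list[str]:
    # For a test case with stem "s-1", eval.py writes the image, the
    # ground truth, the Tesseract result, and the character/word
    # accuracy reports, all sharing the stem.
    suffixes = (".jpg", ".txt", ".actual.txt", ".ca.txt", ".wa.txt")
    return [stem + suffix for suffix in suffixes]
```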

Resources

  1. https://github.com/eddieantonio/ocreval
  2. Shreeshrii/tesstrain-ckb#1
