This repo contains a test suite that measures the accuracy of our models and of our pre-processing and post-processing scripts. We need to keep the test suite diverse so that we don't regress in some areas while focusing on others.
Areas we are now focusing on include:
Area | Description |
---|---|
Clean Scan | Shows how our models perform when proper hardware is used to scan the documents. This use case is very important for businesses because they usually don't mind using proper hardware (e.g. a normal scanner/printer) to scan their documents. Performance mostly depends on the models, and on the ability of the pre-processing scripts not to degrade images that are already good. |
Screenshots | Screenshots of images. This use case is important for extracting text from PDFs whose encoding has been corrupted. Performance mostly depends on the models, and on the ability of the pre-processing scripts not to degrade images that are already good. |
Phones | Images taken with a cell phone. These images might need quite a bit of pre-processing before Tesseract can do a good job. Performance depends on the pre-processing scripts. |
Posters | Images that have colorful or complex backgrounds. Examples: book covers, posters, memes, infographics, etc. |
Edge Cases | Images that exercise edge cases. Example: white text on a black background. |
Tables | TBD |
Two-columns | TBD |
Each test case is made up of two files:
- An image file, which can be either a JPG or a PNG file.
- A text file that contains the Ground Truth.
Example: s-1.jpg and s-1.txt
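The pairing described above can be sketched in a few lines of Python. This is only an illustration, not the code used by the test suite: the function name `find_test_cases` is hypothetical, and it assumes the ground-truth file sits next to the image and shares its base name, as in the s-1 example.

```python
from pathlib import Path

# Image extensions accepted for a test case (JPG or PNG, per the description above).
IMAGE_EXTENSIONS = {".jpg", ".png"}


def find_test_cases(data_dir):
    """Pair every image under data_dir with its ground-truth text file.

    Returns (pairs, missing): pairs is a list of (image, ground_truth) paths,
    missing is a list of images that have no matching .txt file.
    Hypothetical helper; assumes both files share the same base name.
    """
    pairs, missing = [], []
    for image in sorted(Path(data_dir).rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        truth = image.with_suffix(".txt")
        if truth.is_file():
            pairs.append((image, truth))
        else:
            missing.append(image)
    return pairs, missing
```

Running something like this over the data directory before an evaluation is a cheap way to spot images that were added without their ground truth.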
You can see the instructions to install Tesseract here.
Tesseract expects model files to be in TESSDATA_PREFIX. TESSDATA_PREFIX depends on the OS and the version of Tesseract. For example, on Ubuntu, Tesseract 5-alpha expects the models to be in /usr/share/tesseract-ocr/5/tessdata/. Model files have the .traineddata extension.
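As a rough sketch, the following Python helper checks that a model file exists for every language you plan to evaluate. The name `missing_models` is hypothetical, and the sketch assumes that TESSDATA_PREFIX (or the directory you pass in) points directly at the folder that holds the .traineddata files, which may not be true for every Tesseract version.

```python
import os
from pathlib import Path


def missing_models(languages, tessdata_dir=None):
    """Return the language codes that have no .traineddata file.

    languages uses Tesseract's "+" syntax, e.g. "ckb+eng".
    tessdata_dir defaults to the TESSDATA_PREFIX environment variable
    (assumption: it points at the directory containing the model files).
    """
    tessdata_dir = Path(tessdata_dir or os.environ.get("TESSDATA_PREFIX", ""))
    codes = [code for code in languages.split("+") if code]
    return [
        code
        for code in codes
        if not (tessdata_dir / f"{code}.traineddata").is_file()
    ]
```

For example, `missing_models("ckb+eng", "/usr/share/tesseract-ocr/5/tessdata/")` returns an empty list when both models are installed in that directory.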
The tests depend on ocreval, so its commands must be present in PATH. Head over to the official repo for instructions on how to install it. Note: We only support Linux and Mac because ocreval is not available for Windows.
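A quick way to confirm that the ocreval commands are reachable is to look them up in PATH. The sketch below is illustrative only: `missing_commands` is a hypothetical helper, and the list of command names is an assumption about which ocreval programs the evaluation needs (they are ocreval's per-file and summary accuracy tools).

```python
import shutil

# Assumption: the evaluation relies on these ocreval programs.
REQUIRED_COMMANDS = ("accuracy", "wordacc", "accsum", "wordaccsum")


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the commands that cannot be found in PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found in PATH:", ", ".join(missing))
    else:
        print("All ocreval commands found.")
```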
python3 ./src/eval.py source dest languages [--tessdata] [--dirty]
Examples:
python3 ./src/eval.py ./data ./out ckb
python3 ./src/eval.py ./data ./out ckb+eng
If you don't want to run ZhirPy on the images:
python3 ./src/eval.py ./data ./out ckb --dirty
This will create these files for each image:
- image.jpg: The image.
- image.txt: The ground truth of the image.
- image.actual.txt: Tesseract result of the image.
- image.ca.txt: Character accuracy report for the image.
- image.wa.txt: Word accuracy report for the image.
And will create these two aggregate reports:
- word_accuracy.txt: Aggregate word accuracy of all images.
- character_accuracy.txt: Aggregate character accuracy of all images.
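The output naming scheme can be summarized with a small Python sketch, which is handy for checking that a run produced everything it should. The helper names are hypothetical, and the sketch assumes all outputs are written flat into the destination directory, which the description above does not confirm.

```python
from pathlib import Path

# Aggregate reports written once per run.
AGGREGATE_REPORTS = ("word_accuracy.txt", "character_accuracy.txt")


def per_image_outputs(image_name):
    """List the file names expected in the output directory for one image."""
    name = Path(image_name).name
    stem = Path(image_name).stem
    return [
        name,                  # copy of the image
        f"{stem}.txt",         # ground truth
        f"{stem}.actual.txt",  # Tesseract result
        f"{stem}.ca.txt",      # character accuracy report
        f"{stem}.wa.txt",      # word accuracy report
    ]


def missing_outputs(out_dir, image_names):
    """Return expected output files that are absent from out_dir.

    Assumption: outputs are not nested in subdirectories.
    """
    expected = list(AGGREGATE_REPORTS)
    for image_name in image_names:
        expected.extend(per_image_outputs(image_name))
    return [name for name in expected if not (Path(out_dir) / name).is_file()]
```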