Add Tesseract training setup scripts and example data #339

Draft
wants to merge 2 commits into
base: develop
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1039279885.gt.txt
@@ -0,0 +1 @@
1039279885
Binary file added dataScripts/tessTrain/example_truth/1039279885.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/104069277.gt.txt
@@ -0,0 +1 @@
104069277
Binary file added dataScripts/tessTrain/example_truth/104069277.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1120865009.gt.txt
@@ -0,0 +1 @@
1120865009
Binary file added dataScripts/tessTrain/example_truth/1120865009.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1157374083.gt.txt
@@ -0,0 +1 @@
1157374083
Binary file added dataScripts/tessTrain/example_truth/1157374083.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1242187802.gt.txt
@@ -0,0 +1 @@
1242187802
Binary file added dataScripts/tessTrain/example_truth/1242187802.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1293982005.gt.txt
@@ -0,0 +1 @@
1293982005
Binary file added dataScripts/tessTrain/example_truth/1293982005.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1307855064.gt.txt
@@ -0,0 +1 @@
1307855064
Binary file added dataScripts/tessTrain/example_truth/1307855064.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1335408497.gt.txt
@@ -0,0 +1 @@
1335408497
Binary file added dataScripts/tessTrain/example_truth/1335408497.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1401381498.gt.txt
@@ -0,0 +1 @@
1401381498
Binary file added dataScripts/tessTrain/example_truth/1401381498.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1494819970.gt.txt
@@ -0,0 +1 @@
1494819970
Binary file added dataScripts/tessTrain/example_truth/1494819970.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1588056367.gt.txt
@@ -0,0 +1 @@
1588056367
Binary file added dataScripts/tessTrain/example_truth/1588056367.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1668624399.gt.txt
@@ -0,0 +1 @@
1668624399
Binary file added dataScripts/tessTrain/example_truth/1668624399.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1713389101.gt.txt
@@ -0,0 +1 @@
1713389101
Binary file added dataScripts/tessTrain/example_truth/1713389101.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1763858340.gt.txt
@@ -0,0 +1 @@
1763858340
Binary file added dataScripts/tessTrain/example_truth/1763858340.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/1996779286.gt.txt
@@ -0,0 +1 @@
1996779286
Binary file added dataScripts/tessTrain/example_truth/1996779286.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/2014646864.gt.txt
@@ -0,0 +1 @@
2014646864
Binary file added dataScripts/tessTrain/example_truth/2014646864.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/404341978.gt.txt
@@ -0,0 +1 @@
404341978
Binary file added dataScripts/tessTrain/example_truth/404341978.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/428256746.gt.txt
@@ -0,0 +1 @@
428256746
Binary file added dataScripts/tessTrain/example_truth/428256746.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/97984949.gt.txt
@@ -0,0 +1 @@
97984949
Binary file added dataScripts/tessTrain/example_truth/97984949.png
1 change: 1 addition & 0 deletions dataScripts/tessTrain/example_truth/984313802.gt.txt
@@ -0,0 +1 @@
984313802
Binary file added dataScripts/tessTrain/example_truth/984313802.png
89 changes: 89 additions & 0 deletions dataScripts/tessTrain/tessTrain.sh
@@ -0,0 +1,89 @@
#!/bin/bash
# Author: Charles Bock
# Email: [email protected]
# Date: 2024-02-27
# Tested on a bare install of Ubuntu 22.04.3 LTS

set -e

# Color Stuff
BGreen='\033[1;32m'
NC='\033[0m'

greentext () {
    echo -e "\n${BGreen}### $1 ${NC}\n"
}

# Build tesseract
# Upstream Docs: https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation#installing-with-autoconf-tools
greentext "Installing Deps and Creating File Structure"

# Don't pollute the directory
mkdir -p ./tess
Owner


Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep the sole .gitignore so it's consolidated in one place.

Author

@Penguin2600 Penguin2600 Mar 5, 2024


Good callout, I'll work out what the new entries need to be.
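
One way the ignore entries could be added without duplicating lines on repeated runs is sketched below. `add_ignore_entry` is a hypothetical helper, not part of this PR, and the entry in the usage note assumes the script is run from `dataScripts/tessTrain` so that its artifacts land in `dataScripts/tessTrain/tess/`.

```shell
#!/bin/bash
# Sketch only: add_ignore_entry is a hypothetical helper, not part of the PR.

# Append an entry to a .gitignore file unless an identical line is already there.
add_ignore_entry () {
    local ignore_file="$1"
    local entry="$2"
    touch "$ignore_file"
    # -x: match the whole line, -F: fixed string (no regex), -q: quiet
    if ! grep -qxF -- "$entry" "$ignore_file"; then
        # If the file does not end in a newline, add one so the entry
        # starts on its own line.
        if [ -s "$ignore_file" ] && [ -n "$(tail -c 1 "$ignore_file")" ]; then
            printf '\n' >> "$ignore_file"
        fi
        printf '%s\n' "$entry" >> "$ignore_file"
    fi
}
```

From the repository root this might be called as `add_ignore_entry .gitignore "dataScripts/tessTrain/tess/"`, which keeps everything in the single top-level .gitignore as requested.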

cd tess
pwd

# Get Deps
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install libpango1.0-dev libleptonica-dev
Comment on lines +27 to +29
Owner


I think it would make sense to extract this into the README as a ## training tesseract section.

I would split this script into two parts - a setup.sh script (also mention it in the README in the setup instructions) and a train.sh script that takes in a ground truth path.

Author


Totally, I had the same thought. One will likely run once, while the other may need many runs.
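
A rough sketch of what the training half of that split could look like, taking a ground truth path as its argument. The file name `train.sh` comes from the review comment; the function names are illustrative, and the sketch assumes the setup half has already cloned tesstrain into `./tess/tesstrain`. `MODEL_NAME` and `GROUND_TRUTH_DIR` are tesstrain make variables.

```shell
#!/bin/bash
# Sketch of a hypothetical train.sh; function names are illustrative.

# Verify that every PNG line image has a matching .gt.txt transcription.
# Reports each unpaired image on stderr and returns non-zero if any are
# found, or if the directory holds no PNG files at all.
check_ground_truth () {
    local dir="$1"
    local missing=0 count=0 img
    for img in "$dir"/*.png; do
        [ -e "$img" ] || continue        # the glob matched nothing
        count=$((count + 1))
        if [ ! -f "${img%.png}.gt.txt" ]; then
            echo "missing transcription for: $img" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$count" -gt 0 ] && [ "$missing" -eq 0 ]
}

# Run tesstrain against a ground truth directory.
# Assumes tesstrain was already cloned to ./tess/tesstrain by the setup step.
run_training () {
    local truth_dir
    truth_dir="$(cd "$1" && pwd)" || return 1   # make the path absolute
    check_ground_truth "$truth_dir" || return 1
    make -C ./tess/tesstrain training MODEL_NAME=noita GROUND_TRUTH_DIR="$truth_dir"
}
```

With these in place, `run_training ./example_truth` would train against the example data in this PR, and later runs can point at any other directory of paired `.png` / `.gt.txt` files.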


greentext "Cloning Tesseract"

git clone https://github.com/tesseract-ocr/tesseract.git 2> /dev/null || (cd tesseract ; git pull; cd ..)

# Build the training tools
# Upstream Docs: https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation#build-with-training-tools
greentext "Building Tesseract WITH Training Tools - This can take a long time"

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
Comment on lines +39 to +41
Owner


These too
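
Both flagged blocks install overlapping packages, so wherever the install step ends up it could be reduced to a single call. A minimal sketch, using only the package names already present in tessTrain.sh; the helper names are illustrative, and the `apt-get update` call and `-y` flag are additions that assume a non-interactive run is wanted.

```shell
#!/bin/bash
# Sketch only: package names are the ones already used in tessTrain.sh;
# the helper names are illustrative.

# Print the arguments one per line, sorted, with duplicates removed.
unique_packages () {
    printf '%s\n' "$@" | sort -u
}

# Install all build and training dependencies with a single apt-get call.
install_build_deps () {
    local pkgs
    pkgs=$(unique_packages \
        libicu-dev libpango1.0-dev libcairo2-dev \
        automake ca-certificates g++ git libtool libleptonica-dev make pkg-config \
        libpango1.0-dev libleptonica-dev)
    sudo apt-get update
    # Word splitting is intended here: package names contain no whitespace.
    # shellcheck disable=SC2086
    sudo apt-get install -y $pkgs
}
```

The same list could be pasted into the README section suggested above, so the script and the documentation stay in step.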


cd tesseract

./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make training
sudo make training-install

greentext "Finished Building Tesseract and Training Tools"

cd ..
pwd

# Install and configure tesstrain
# Upstream Docs: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#choose-model-name

greentext "Pulling the required ENG traineddata from github"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata
Owner


Is there a way to avoid populating paths outside of noitool? It would be great if this could be confined to this directory. It looks like you can use the TESSDATA variable to make it local. https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#train

Author


Yes, I think that's a great idea, will incorporate.
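
A possible shape for that change, as a sketch rather than a tested replacement. The `./tess/tessdata` location and the function names are assumptions. `TESSDATA` is the tesstrain make variable mentioned above; `TESSDATA_PREFIX` is the environment variable tesseract itself reads, set here on the assumption that the tesseract calls made during training also need to find `eng.traineddata` locally.

```shell
#!/bin/bash
# Sketch only: the ./tess/tessdata location and function names are assumptions.

# Create a project-local tessdata directory and print its absolute path.
local_tessdata_dir () {
    local base="${1:-./tess}"
    mkdir -p "$base/tessdata" || return 1
    (cd "$base/tessdata" && pwd)
}

# Download eng.traineddata into the given directory unless it is already there.
# The download goes to a temporary name first so a failed transfer does not
# leave a truncated file that looks valid on the next run.
fetch_eng_traineddata () {
    local dir="$1"
    [ -f "$dir/eng.traineddata" ] && return 0
    wget -O "$dir/eng.traineddata.part" \
        https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata \
        && mv "$dir/eng.traineddata.part" "$dir/eng.traineddata"
}

# Train with both tesstrain and tesseract pointed at the local directory.
train_with_local_tessdata () {
    local dir
    dir="$(local_tessdata_dir ./tess)" || return 1
    fetch_eng_traineddata "$dir" || return 1
    TESSDATA_PREFIX="$dir" make -C ./tess/tesstrain training \
        MODEL_NAME=noita TESSDATA="$dir"
}
```

This removes the `sudo mv` into `/usr/local/share/tessdata`, so nothing outside the working directory is touched by the download step.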


greentext "Cloning Tesstrain"
git clone https://github.com/tesseract-ocr/tesstrain.git 2> /dev/null || (cd tesstrain ; git pull; cd ..)
cd tesstrain
pwd

greentext "Generating Tesstrain Langdata"
make tesseract-langdata

greentext "Creating and populating tesstrain ground truth Directory Structure"
mkdir -p ./data/noita-ground-truth

# Copy our example data in; assumes the script was started from dataScripts/tessTrain, next to example_truth
cp -ar ../../example_truth/* ./data/noita-ground-truth

greentext "Running Example Training - This can take some time"
# Run training against our example data
make training MODEL_NAME=noita

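if test -f ./data/noita.traineddata; then
    greentext "Example Model was trained successfully"
    greentext "Setup is complete and tested"
    greentext "You are now ready for Training"
    exit 0
else
    # Without this branch a missing model would still end with exit status 0
    echo "Expected ./data/noita.traineddata was not produced" >&2
    exit 1
fi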
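
The script uses the same clone-or-pull one-liner for both tesseract and tesstrain, and redirecting stderr to /dev/null hides the reason when a clone fails for something other than "directory already exists". A small helper could make the decision explicit. This is a sketch; `repo_action` and `clone_or_update` are hypothetical names, and `--ff-only` is a choice made here so that an update never creates a merge commit.

```shell
#!/bin/bash
# Sketch only: repo_action and clone_or_update are hypothetical helpers.

# Decide what to do with a checkout directory: prints "pull" when it already
# holds a git repository, "clone" otherwise.
repo_action () {
    if [ -d "$1/.git" ]; then
        echo pull
    else
        echo clone
    fi
}

# Clone a repository, or fast-forward it when it is already checked out.
# Errors from git are left visible instead of being sent to /dev/null.
clone_or_update () {
    local url="$1" dir="$2"
    case "$(repo_action "$dir")" in
        pull)  git -C "$dir" pull --ff-only ;;
        clone) git clone "$url" "$dir" ;;
    esac
}
```

Usage would then be `clone_or_update https://github.com/tesseract-ocr/tesseract.git tesseract` and the equivalent call for tesstrain.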