Add Tesseract training setup scripts and example data #339
base: develop
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1039279885 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
104069277 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1120865009 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1157374083 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1242187802 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1293982005 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1307855064 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1335408497 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1401381498 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1494819970 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1588056367 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1668624399 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1713389101 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1763858340 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
1996779286 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
2014646864 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
404341978 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
428256746 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
97984949 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
984313802 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,89 @@ | ||
#!/bin/bash | ||
# Author: Charles Bock | ||
# Email: [email protected] | ||
# Date: 2024-02-27 | ||
# Tested on a bare install of Ubuntu 22.04.3 LTS | ||
|
||
set -e | ||
|
||
# Color Stuff | ||
BGreen='\033[1;32m' | ||
NC='\033[0m' | ||
|
||
greentext () { | ||
echo -e "\n${BGreen}### $1 ${NC}\n" | ||
} | ||
|
||
# Build tesseract | ||
# Upstream Docs: https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation#installing-with-autoconf-tools | ||
greentext "Installing Deps and Creating File Structure" | ||
|
||
# Don't pollute the directory | ||
mkdir -p ./tess | ||
cd tess | ||
pwd | ||
|
||
# Get Deps | ||
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev | ||
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config | ||
sudo apt-get install libpango1.0-dev libleptonica-dev | ||
Comment on lines +27 to +29
Review comment: I think it would make sense to extract this into the README as a I would split this script into two parts - a
Reply: Totally, I had the same thought. One will likely run once while the other may need many runs. |
||
|
||
greentext "Cloning Tesseract" | ||
|
||
git clone https://github.com/tesseract-ocr/tesseract.git 2> /dev/null || (cd tesseract ; git pull; cd ..) | ||
|
||
# Build the training tools | ||
# Upstream Docs: https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation#build-with-training-tools | ||
greentext "Building Tesseract WITH Training Tools - This can take a long time" | ||
|
||
sudo apt-get install libicu-dev | ||
sudo apt-get install libpango1.0-dev | ||
sudo apt-get install libcairo2-dev | ||
Comment on lines +39 to +41
Review comment: These too |
||
|
||
cd tesseract | ||
|
||
./autogen.sh | ||
./configure | ||
make | ||
sudo make install | ||
sudo ldconfig | ||
make training | ||
sudo make training-install | ||
|
||
greentext "Finished Building Tesseract and Training Tools" | ||
|
||
cd .. | ||
pwd | ||
|
||
# Install and configure tesstrain | ||
# Upstream Docs: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#choose-model-name | ||
|
||
greentext "Pulling the required ENG traineddata from github" | ||
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata | ||
sudo mv eng.traineddata /usr/local/share/tessdata | ||
Review comment: Is there a way to not populate paths outside of noitool? It would be great if this would be confined to this directory. It looks like you can use
Reply: Yes, I think that's a great idea, will incorporate. |
||
|
||
greentext "Cloning Tesstrain" | ||
git clone https://github.com/tesseract-ocr/tesstrain.git 2> /dev/null || (cd tesstrain ; git pull; cd ..) | ||
cd tesstrain | ||
pwd | ||
|
||
greentext "Generating Tesstrain Langdata" | ||
make tesseract-langdata | ||
|
||
greentext "Creating and populating tesstrain ground truth Directory Structure" | ||
mkdir -p ./data/noita-ground-truth | ||
|
||
# Copy our example data in | ||
cp -ar ../../truth/example_truth/* ./data/noita-ground-truth | ||
|
||
greentext "Running Example Training - This can take some time" | ||
# Run training against our example data | ||
make training MODEL_NAME=noita | ||
|
||
if test -f ./data/noita.traineddata; then | ||
greentext "Example Model was trained successfully" | ||
greentext "Setup is complete and tested" | ||
greentext "You are now ready for Training" | ||
|
||
exit 0 | ||
fi |
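
On the review thread about `eng.traineddata` being moved to `/usr/local/share/tessdata`: one possible way to keep the file inside the working tree is sketched below. This is an assumption about how the change could look, not the approach the reviewer named or the one the author adopted. `TESS_ROOT` and the `tessdata/` layout are hypothetical names. `TESSDATA_PREFIX` is the environment variable Tesseract consults to locate `*.traineddata` files, and the tesstrain README also describes a `TESSDATA` make variable for the start-model directory, which may be the better fit for the training step.

```shell
# Hypothetical sketch: keep traineddata inside the working tree.
# TESS_ROOT and the tessdata/ layout are illustrative, not part of the PR.
TESS_ROOT="${TESS_ROOT:-$PWD/tess}"
mkdir -p "$TESS_ROOT/tessdata"

# Tesseract consults TESSDATA_PREFIX when locating *.traineddata files.
export TESSDATA_PREFIX="$TESS_ROOT/tessdata"

# The download from the PR could then target the local directory instead of
# /usr/local/share/tessdata. It is left commented out so the sketch has no
# network dependency:
# wget -P "$TESSDATA_PREFIX" https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata

echo "traineddata directory: $TESSDATA_PREFIX"
```

With this layout the `sudo mv` step would no longer be needed, since nothing is written outside the project directory.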
Review comment: Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep the sole .gitignore so it's consolidated in one place.
Reply: Good callout, I'll consider what the new entries might need to be.
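
As a rough illustration of the `.gitignore` point, the sketch below appends an ignore entry only when it is not already present. The entry `tess/` is an assumption based on the `mkdir -p ./tess` step in the script; the actual entries depend on where the script is run relative to the repository's `.gitignore`, which the thread leaves open. `GITIGNORE` and `add_ignore_entry` are hypothetical names.

```shell
# Hypothetical sketch: ignore the artifacts created by the setup script.
# GITIGNORE and add_ignore_entry are illustrative names, not part of the PR.
GITIGNORE="${GITIGNORE:-.gitignore}"

add_ignore_entry () {
    # Create the file if it is missing, then append the entry only when an
    # identical line is not already present (-x whole line, -F fixed string).
    touch "$GITIGNORE"
    grep -qxF -- "$1" "$GITIGNORE" || printf '%s\n' "$1" >> "$GITIGNORE"
}

# Assumed entry: the working directory the script creates with `mkdir -p ./tess`.
add_ignore_entry "tess/"
```

Calling `add_ignore_entry` repeatedly with the same argument leaves a single line in the file, so the snippet is safe to rerun.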