Add Tesseract training setup scripts and example data #339
base: develop
Looks good!
Left a couple of minor comments.
I also have a question about dataScripts/tessTrain/example_truth/97984949.png and similar ones. Would the extra M hinder the training in any way?
sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev
sudo apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
sudo apt-get install libpango1.0-dev libleptonica-dev
I think it would make sense to extract this into the README as a ## training tesseract section.
I would split this script into two parts - a setup.sh script (also mention it in the README in the setup instructions) and a train.sh script that takes in a ground truth path.
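A minimal sketch of the argument handling such a train.sh could start with. The function name check_ground_truth is hypothetical, and the pairing of each .png with a .gt.txt transcription is an assumption based on the tesstrain ground-truth convention, not something confirmed by this PR:

```shell
# Sketch only: validate a ground-truth directory before training.
# check_ground_truth is a hypothetical helper name.
# Assumes each NAME.png is paired with a NAME.gt.txt transcription.
check_ground_truth() {
    local dir="$1"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "usage: train.sh <ground-truth-dir>" >&2
        return 2
    fi
    local missing=0 img
    for img in "$dir"/*.png; do
        # Skip the unexpanded glob when the directory has no .png files.
        [ -e "$img" ] || continue
        if [ ! -f "${img%.png}.gt.txt" ]; then
            echo "missing transcription for $img" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only if every image has a transcription.
    [ "$missing" -eq 0 ]
}
```

A train.sh built this way would call check_ground_truth "$1" first and only start training when it succeeds, so repeated runs fail fast on an incomplete data directory.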
Totally, I had the same thought. One will likely run once, while the other may need many runs.
greentext "Installing Deps and Creating File Structure"

# Don't pollute the directory
mkdir -p ./tess
Since this script creates artifacts, we'll need to add them to a .gitignore file. Ideally, we would keep a single .gitignore so everything is consolidated in one place.
Good callout. I'll consider what the new entries might need to be.
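One way to keep the entries in the single top-level .gitignore is a small idempotent helper, sketched below. The name add_ignore is hypothetical, and which paths actually need ignoring depends on where the script is run from:

```shell
# Sketch only: append a pattern to a .gitignore file if it is not already there.
# add_ignore is a hypothetical helper name.
add_ignore() {
    local pattern="$1" file="${2:-.gitignore}"
    touch "$file"
    # -x matches the whole line, -F treats the pattern as a literal string.
    grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern" >> "$file"
}

# Usage (assumed path; the script above creates ./tess relative to the
# directory it is run from):
#   add_ignore "dataScripts/tessTrain/tess/"
```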
sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev
These too
greentext "Pulling the required ENG traineddata from github"
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata
sudo mv eng.traineddata /usr/local/share/tessdata
Is there a way to avoid populating paths outside of noitool? It would be great if this were confined to this directory. It looks like you can use the TESSDATA variable to make it local: https://github.com/tesseract-ocr/tesstrain?tab=readme-ov-file#train
Yes, I think that's a great idea, will incorporate.
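A rough sketch of what keeping the model local could look like, with no sudo and no writes to /usr/local/share. TESSDATA_DIR and fetch_model are illustrative names, the ./tess/tessdata location is an assumption, and the MODEL_NAME value in the usage comment is a placeholder:

```shell
# Sketch only: keep traineddata inside the project directory.
# TESSDATA_DIR and fetch_model are hypothetical names.
TESSDATA_DIR="${TESSDATA_DIR:-$PWD/tess/tessdata}"

fetch_model() {
    local lang="$1" dest="$2"
    local target="$dest/$lang.traineddata"
    mkdir -p "$dest"
    # Nothing to do if the model is already present.
    if [ -f "$target" ]; then
        return 0
    fi
    # Download to a temporary name so a failed download leaves no partial file behind.
    if wget -q -O "$target.part" "https://github.com/tesseract-ocr/tessdata/raw/main/$lang.traineddata"; then
        mv "$target.part" "$target"
    else
        rm -f "$target.part"
        return 1
    fi
}

# Usage (not run here; MODEL_NAME=example is a placeholder):
#   fetch_model eng "$TESSDATA_DIR"
#   make training MODEL_NAME=example START_MODEL=eng TESSDATA="$TESSDATA_DIR"
```

The make invocation follows the variables described in the tesstrain README linked above and is run from the tesstrain checkout.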
Co-authored-by: Seva Maltsev <[email protected]>
Short answer: Long answer:
Work in progress; opening for visibility.
Current status:
tessTrain/tessTrain.sh - works and will set up a baseline Ubuntu 22.04 environment (WSL, container, etc.) with the tools and binaries required for training Tesseract. It also runs an example training session with the included example training data. Documentation and sources are in comments inside the script; look there for further details for now.
tessTrain/example_truth/ - Example of what a training data directory needs to look like. Used by tessTrain.sh to confirm that setup was successful.
Ping me on Discord with any questions or comments! Thanks.