[Not an issue] Train a calamari model on RTL languages print or handwritten #368
Comments
Training works out of the box for RTL texts, but you should set the text direction in the preprocessing options. You might also want to start from the def_arabic model in calamari_models_experimental, at least for printed Arabic material. Good luck with the training!
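Before choosing a direction setting, it can help to confirm which direction the ground-truth lines actually have. The following is a minimal sketch using only Python's standard library; it is not part of calamari, and the function name `dominant_direction` is a hypothetical helper introduced here for illustration. It counts characters with a strong bidirectional class and reports the majority.

```python
import unicodedata

# Strong bidi classes from the Unicode Character Database:
# "R" (e.g. Hebrew) and "AL" (Arabic letters) are right-to-left, "L" is left-to-right.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def dominant_direction(text: str) -> str:
    """Return 'rtl', 'ltr', or 'neutral' from counts of strong bidi characters.

    Hypothetical helper for inspecting ground truth; not a calamari function.
    """
    rtl = sum(1 for ch in text if unicodedata.bidirectional(ch) in RTL_CLASSES)
    ltr = sum(1 for ch in text if unicodedata.bidirectional(ch) in LTR_CLASSES)
    if rtl == 0 and ltr == 0:
        # Only digits, punctuation, or whitespace: no strong direction.
        return "neutral"
    # Ties are reported as 'ltr'.
    return "rtl" if rtl > ltr else "ltr"
```

Running this over every ground-truth line and flagging the ones that disagree with the expected direction is a cheap way to spot mixed-direction or mislabeled lines before training.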
Thank you so much!!! Many thanks!
Calamari needs the input to be pre-segmented, either with coordinates in PAGE XML or as line images. I've been using LAREX for the semi-automatic region segmentation, ocropus for line segmentation, calamari for training and recognition, and a custom web app for manual post-correction. Especially for Arabic handwriting, I'd expect kraken to perform much better at the segmentation task. If you want a full pipeline with a user interface, there is https://www.ocr4all.org/, which contains calamari and other OCR engines, segmentation tools, etc., but I'm not sure whether it has been thoroughly tested with RTL texts.
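When working from line images, a quick sanity check is to make sure every image has a transcription next to it. The sketch below assumes the common layout where a line image such as `0001.png` or `0001.bin.png` is paired with `0001.gt.txt`, i.e. the base name is everything before the first dot; that naming scheme and the helper name `find_unpaired_lines` are assumptions for illustration, so adjust them to your own data.

```python
from pathlib import Path


def find_unpaired_lines(folder, image_ext=".png", gt_ext=".gt.txt"):
    """List line images in `folder` that have no matching ground-truth file.

    Assumption: the base name is the part of the file name before the
    first dot, so '0001.bin.png' pairs with '0001.gt.txt'.
    """
    folder = Path(folder)
    missing = []
    for image in sorted(folder.glob(f"*{image_ext}")):
        base = image.name.split(".", 1)[0]
        if not (folder / f"{base}{gt_ext}").exists():
            missing.append(image.name)
    return missing
```

An empty result means every line image has a transcription; anything listed should be transcribed or excluded before training.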
Thank you so much for your help!
Sorry for opening this as an issue.
I want to train a calamari model on RTL languages, printed or handwritten.
Are there any additional steps needed for RTL languages and scripts? Do I have to specify the RTL direction somehow, or do I have to reverse my texts? Should I use python-bidi or something similar? I'm used to kraken and other OCR/HTR tools, but I have never tried calamari.
Thank you so much!