[Not an issue] Train a calamari model on RTL languages print or handwritten #368

Open
johnlockejrr opened this issue Oct 16, 2024 · 4 comments

Comments

@johnlockejrr

johnlockejrr commented Oct 16, 2024

Sorry for opening this as an issue.
I want to train a calamari model on RTL languages print or handwritten:

  • arabic
  • hebrew
  • samaritan
  • syriac

Are there any additional steps for RTL languages and scripts? Do I have to specify the RTL direction somehow, or do I have to reverse my texts, for example with python-bidi?
I'm used to kraken and other OCR/HTR tools, but I have never tried calamari.

Thank you so much!

@andbue
Member

andbue commented Oct 16, 2024

Training works out of the box for RTL texts, but you should set the preprocessing option bidi_direction to RTL to prevent the bidi algorithm from guessing the wrong order in some cases. Have a look at my demo notebook for an example training procedure and the command line options needed for Arabic texts. Ignore the printouts of the example predictions between epochs: that output has been broken for ages and does not represent the performance of the model at all.
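
To illustrate why an automatically guessed direction can go wrong, here is a minimal sketch of the "first strong character" heuristic from the Unicode Bidirectional Algorithm. It is not Calamari's code; the function name guess_base_direction is hypothetical, and the rule is simplified (it ignores directional isolates). It only shows that a right-to-left line starting with a Latin word is classified as left-to-right, which is the kind of mistake a fixed RTL setting avoids.

```python
import unicodedata


def guess_base_direction(line: str) -> str:
    """Guess the base direction from the first strongly directional character.

    Hypothetical helper for illustration only; a simplified version of the
    first-strong rule, not taken from Calamari.
    """
    for ch in line:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "RTL"
        if bidi_class == "L":  # strongly left-to-right letter
            return "LTR"
    return "LTR"  # no strong character found: fall back to LTR


# Escape sequences keep the source readable regardless of editor bidi support.
hebrew_line = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word
mixed_line = "OCR " + hebrew_line  # same word after a Latin prefix

print(guess_base_direction(hebrew_line))  # RTL
print(guess_base_direction(mixed_line))   # LTR, although the line is Hebrew text
```

Digits and spaces are not strong characters, so a line consisting only of numbers falls through to the LTR default in this sketch.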

You might also want to start from the def_arabic-model in calamari_models_experimental – at least for Arabic printed material.

Good luck with the training!

@johnlockejrr
Author

Thank you so much!!!
I'm a console guy but anyway... is there something for calamari like eScriptorium is for kraken? I mean a web UI where you can annotate transcriptions etc.
And last question: for calamari I only need to train recognition models, not segmentation models, if I'm right?

Many thanks!

@andbue
Member

andbue commented Oct 16, 2024

Calamari needs the input to be pre-segmented, either with coordinates in PAGE XML or as line images. I've been using LAREX for the semi-automatic region segmentation, ocropus line segmentation, calamari for the training/recognition, and a custom web app for the manual post correction. Especially for Arabic handwriting I'd expect kraken to perform much better for the segmentation task.
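
For the line-image route, a quick sanity check before training might look like the sketch below. It assumes the common layout in which each line image sits next to a ground-truth text file with the same base name and a .gt.txt suffix (for example 0001.png and 0001.gt.txt); that naming convention and the helper name pair_lines_with_ground_truth are assumptions for illustration, not something stated in this thread.

```python
from pathlib import Path


def pair_lines_with_ground_truth(line_dir):
    """Pair each line image with its transcription file.

    Assumes (hypothetically) that 'NAME.png' or 'NAME.bin.png' belongs to
    'NAME.gt.txt' in the same directory. Returns (paired, missing), where
    missing lists the images that have no ground-truth file.
    """
    paired, missing = [], []
    for image in sorted(Path(line_dir).glob("*.png")):
        base = image.name.split(".")[0]  # strip every extension
        gt_file = image.with_name(base + ".gt.txt")
        if gt_file.is_file():
            paired.append((image, gt_file))
        else:
            missing.append(image)
    return paired, missing
```

Calling pair_lines_with_ground_truth("lines/") on a hypothetical lines/ directory returns the usable pairs plus the images that still need a transcription. The ground-truth files should be saved as UTF-8 so that RTL text survives intact.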

If you want a full pipeline and user interface, there is https://www.ocr4all.org/ – it contains calamari and other OCR engines, segmentation tools etc., but I'm not sure if it has been thoroughly tested with RTL texts.

@johnlockejrr
Author

Thank you so much for your help!
