[Not an issue] Train a calamari model on RTL languages print or handwritten #368

johnlockejrr · 2024-10-16T14:53:47Z

Sorry for opening this as an issue.
I want to train a calamari model on RTL languages, printed or handwritten:

Arabic
Hebrew
Samaritan
Syriac

Are there any additional steps for RTL languages and scripts? Do I have to specify the RTL direction somehow, or do I have to reverse my texts? Should I use python-bidi or something similar?
I'm used to kraken and other OCR/HTR tools, but I have never tried calamari.

Thank you so much!

andbue · 2024-10-16T15:16:44Z

Training works out of the box for RTL texts, but you should set the preprocessing option bidi_direction to RTL to prevent the bidi algorithm from guessing the wrong order in some cases. Have a look at my demo notebook for an example training procedure and the command-line options needed for Arabic texts. Ignore the printouts of the example predictions between epochs: they have been broken for ages and do not reflect the performance of the model at all.
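
To illustrate why a guessed direction can go wrong, here is a minimal sketch that is not taken from calamari or from the demo notebook. It assumes the guess follows a first-strong-character heuristic, roughly what the Unicode bidi algorithm does when no base direction is given; the function name guess_base_direction is hypothetical. A line that is mostly Arabic but starts with a Latin word would be classified as LTR, which is the kind of case an explicit RTL setting avoids.

```python
import unicodedata


def guess_base_direction(text: str) -> str:
    """Guess the base direction from the first strong character.

    Simplified first-strong heuristic (an assumption for illustration,
    not calamari's actual implementation).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type or Arabic-type letter
            return "RTL"
        if bidi_class == "L":  # strong left-to-right letter
            return "LTR"
    # Digits, spaces and punctuation are not strong; fall back to LTR.
    return "LTR"


# "marhaba" in Arabic letters, written with escapes to keep the source ASCII
arabic_word = "\u0645\u0631\u062d\u0628\u0627"
# "shalom" in Hebrew letters
hebrew_word = "\u05e9\u05dc\u05d5\u05dd"

pure_rtl_line = arabic_word
latin_first_line = "OCR " + arabic_word  # Latin word precedes the Arabic text
digits_first_line = "123 " + hebrew_word  # digits are weak, Hebrew decides

for line in (pure_rtl_line, latin_first_line, digits_first_line):
    print(guess_base_direction(line))
```

With a fixed direction there is nothing to guess, so a leading Latin word or abbreviation in a ground-truth line no longer changes how the rest of the line is ordered.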

You might also want to start from the def_arabic-model in calamari_models_experimental – at least for Arabic printed material.

Good luck with the training!

johnlockejrr · 2024-10-16T15:20:27Z

Thank you so much!!!
I'm a console guy, but anyway: is there something for calamari like eScriptorium for kraken? I mean a web UI where you can annotate transcriptions, etc.
And one last question: for calamari I only need to train recognition models, not segmentation models, if I'm right?

Many thanks!

andbue · 2024-10-16T15:35:03Z

Calamari needs the input to be pre-segmented, either with coordinates in PAGE XML or as line images. I've been using LAREX for the semi-automatic region segmentation, ocropus line segmentation, calamari for the training/recognition, and a custom web app for the manual post correction. Especially for Arabic handwriting I'd expect kraken to perform much better for the segmentation task.
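
For the line-image route, a small sanity check before training can be sketched as follows. This is an illustration only: it assumes each line image (for example line1.png) sits next to a ground-truth file with the same base name and a .gt.txt suffix, and the helper name find_unpaired_lines is hypothetical. Adjust the suffixes if your data is laid out differently.

```python
from pathlib import Path


def find_unpaired_lines(data_dir, image_suffix=".png", gt_suffix=".gt.txt"):
    """Return the names of line images that have no ground-truth file.

    The suffix convention is an assumption for this sketch.
    """
    missing = []
    for image in sorted(Path(data_dir).glob(f"*{image_suffix}")):
        # Swap the image suffix for the ground-truth suffix.
        gt_file = image.with_name(image.name[: -len(image_suffix)] + gt_suffix)
        if not gt_file.is_file():
            missing.append(image.name)
    return missing


if __name__ == "__main__":
    # Check the current directory and report images without a transcription.
    for name in find_unpaired_lines("."):
        print(f"no ground truth for {name}")
```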

If you want a full pipeline and user interface, there is https://www.ocr4all.org/ – it contains calamari and other OCR engines, segmentation tools etc., but I'm not sure if it has been thoroughly tested with RTL texts.

johnlockejrr · 2024-10-16T15:41:00Z

Thank you so much for your help!
