In this step, you will convert the scanned dictionary pages into plain text. However, there is a complication: virtually all historical dictionaries are typeset in obsolete fonts or use special characters that today's OCR software does not recognize. If you use an OCR engine as-is, you will get many transcription errors that are too time-consuming and costly to correct manually. It is therefore better to first train the OCR engine to recognize the dictionary's distinct typography before transcribing all the pages. Fortunately, OCR software such as Tesseract is trainable and will serve our purpose well.
Tesseract uses deep learning to recognize text from images. As of Version 4.0, it supports over 150 languages and scripts. To teach it novel characters or symbols, you will fine-tune it. Instead of making Tesseract learn the entire orthography of a language from scratch, you start with the pre-trained language model that does the best job of recognizing your target orthography, then adjust that model so it learns the novel symbols. With this approach, training Tesseract is faster and requires much less training data.
Below is a visual of what you will do.
Text Capture has 3 substeps:
- Fine-tune - train the OCR to recognize the new characters/symbols in the dictionary
- Transcribe - use the trained model to convert the pages into text
- Post-edit - correct any residual OCR errors
To help make these concepts clearer, we will use the Hanunoo dictionary as an example.
Let's begin!
Note: The following instructions were distilled from the Tesseract 4.0 training guide. See that guide for more details.
Prerequisite 1: Install the needed fonts
$ sudo apt update
$ sudo apt install ttf-mscorefonts-installer
$ sudo apt install fonts-dejavu
$ fc-cache -vf
Prerequisite 2: Install some required dependencies
PDFtk is a handy tool for splitting, joining, and rotating PDF files. ImageMagick converts PDF files into TIFF and provides many image processing features. Leptonica is needed to build Tesseract from source code.
$ sudo apt install pdftk-java
$ sudo apt install imagemagick
$ sudo apt install libleptonica-dev
Edit the ImageMagick policy file /etc/ImageMagick-6/policy.xml to allow converting PDF files. Look for a line like the one below and change the value of rights to "read | write":
<policy domain="coder" rights="read | write" pattern="PDF" />
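If you prefer not to edit the file by hand, the same change can be made with sed. This is only a sketch: it assumes the stock line has the form <policy domain="coder" rights="none" pattern="PDF" />, and the function name allow_pdf is made up for illustration. It runs against a scratch copy so the real policy file is not touched.

```shell
#!/bin/sh
# Sketch: grant read/write rights to the PDF coder in an ImageMagick policy file.
# Assumption: the file contains a line such as
#   <policy domain="coder" rights="none" pattern="PDF" />
allow_pdf() {
    # $1 = path to a policy file; a backup is saved next to it with a .bak suffix
    sed -i.bak '/pattern="PDF"/ s/rights="[^"]*"/rights="read | write"/' "$1"
}

# Dry run on a scratch copy (hypothetical excerpt of a policy file).
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
<policymap>
  <policy domain="coder" rights="none" pattern="PS" />
  <policy domain="coder" rights="none" pattern="PDF" />
</policymap>
EOF

allow_pdf "$scratch"
grep 'pattern="PDF"' "$scratch"   # show the edited line
```

Only the line containing pattern="PDF" is changed; other coders keep their rights. To apply it for real, run the same sed expression with sudo against /etc/ImageMagick-6/policy.xml, and check the result before converting any files.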
Prerequisite 3: Install Tesseract
You must compile Tesseract from its source code in order to use the training tools, because the Tesseract executables available for download do not include them.
$ cd retro-digitization/tutorial
- Download the source code of the latest release (e.g., "v4.1.1") and unzip it
- Rename the folder (for convenience)
$ mv tesseract-4.1.1 tesseract
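The download-and-unzip step can also be done from the command line. The sketch below assumes that the release is available as a GitHub tag archive at the URL shown and that wget and tar are installed; the function name fetch_tesseract is hypothetical. Check the URL against the project's releases page before relying on it.

```shell
#!/bin/sh
# Sketch: fetch and unpack a Tesseract source release.
# Assumption: the release is published as a GitHub tag archive at this URL.
version="4.1.1"
url="https://github.com/tesseract-ocr/tesseract/archive/${version}.tar.gz"

fetch_tesseract() {
    # Download the tarball, then unpack it into ./tesseract-<version>
    wget -O "tesseract-${version}.tar.gz" "$url" &&
        tar -xzf "tesseract-${version}.tar.gz"
}

echo "Source archive: $url"
# Uncomment to perform the download (needs network access):
# fetch_tesseract
```

If the archive unpacks as expected, the resulting folder is the one renamed by the mv command above.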
- Configure and build Tesseract and its training tools.
NOTE: If the ./configure file does not exist or gives "undefined M4 macro" errors when you run it, run "autoreconf --install" first and then run "./configure" again.
$ cd tesseract
$ ./configure
$ make
$ make training
- Using your favorite text editor, open the file src/training/language-specific.sh. Look for the line that starts with "LATIN_FONTS=". Comment out the following lines and save the file.
"URW Bookman L Bold"
"URW Bookman L Italic"
"URW Bookman L Bold Italic"
"Century Schoolbook L Bold"
"Century Schoolbook L Italic"
"Century Schoolbook L Bold Italic"
"Century Schoolbook L Medium" \
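The manual edit can also be scripted. The following is a minimal sketch, not the tutorial's own method: it assumes each font sits on its own line as the first quoted token, and the function name comment_out_fonts is invented for illustration. It is demonstrated on a scratch file whose contents are a made-up excerpt, not the real font list.

```shell
#!/bin/sh
# Sketch: comment out selected font entries in a shell array definition.
comment_out_fonts() {
    # $1 = script to edit; remaining arguments = exact font names to disable
    file=$1
    shift
    for font in "$@"; do
        # Put '#' in front of lines whose first non-blank token is the quoted name.
        # The closing quote in the pattern keeps "X Bold" from matching "X Bold Italic".
        sed -i "s|^\([[:space:]]*\)\"$font\"|\1#\"$font\"|" "$file"
    done
}

# Dry run on a scratch file (hypothetical excerpt).
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
LATIN_FONTS=(
    "Arial Bold" \
    "URW Bookman L Bold" \
    "URW Bookman L Bold Italic" \
    )
EOF

comment_out_fonts "$scratch" "URW Bookman L Bold" "URW Bookman L Bold Italic"
cat "$scratch"
```

To use it on the real file, make a backup copy of src/training/language-specific.sh first, then pass that path followed by the seven font names listed above. Inspect the result afterwards, since the layout of the actual file may differ from the assumption here.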
Optional: Install Tesseract and the training tools so that you can run them without giving the full path. They will be installed in /usr/local/bin.
$ sudo make install training-install
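After installing, you may want to confirm that the programs are reachable from your shell. The sketch below checks a few of them by name; the helper missing_tools is hypothetical, and the list of tools is only a sample of what the training build provides.

```shell
#!/bin/sh
# Sketch: report which of the named programs are not found on PATH.
missing_tools() {
    for tool in "$@"; do
        # Print the name only when the shell cannot locate the program
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}

missing=$(missing_tools tesseract text2image lstmtraining combine_tessdata)
if [ -z "$missing" ]; then
    echo "All tools found."
else
    # Unquoted on purpose so the names print on one line
    echo "Not on PATH:" $missing
fi
```

If any names are reported, either the optional install step was skipped or /usr/local/bin is not on your PATH. In that case you can still run the tools by giving their full path inside the build folder.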
Step 3.1 - Prepare the Training Data