Nicholas Wolf
ORCID 0000-0001-5512-6151
This lesson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Overview
This session explores the tools and techniques for building a workflow that turns digital images of text into computer-readable text. It focuses on the skills an individual needs to assemble a corpus of texts for text analysis, a website, a Digital Humanities project, or a small-scale digital library.
We'll focus on the following goals:
- Review a few options for digitally capturing a text
- Walk through some solutions for bulk transformation of image file types into formats that are OCR-ready
- Take a look at using Tesseract 4 for OCR, using a few different languages as examples
- Examine the hOCR output structure
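To give a sense of where these goals lead, here is a minimal Python sketch of the OCR step, assuming Tesseract is installed and available on your PATH. The function names (`build_tesseract_command`, `ocr_folder`) and the choice of `.tif` input files are illustrative assumptions, not part of the lesson materials. The command itself follows Tesseract's standard command-line form: input image, output base name, the `-l` language option, and the `hocr` config name to request hOCR output.

```python
from pathlib import Path
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Return the argument list for one Tesseract run that writes hOCR output.

    Hypothetical helper for illustration; `lang` is a Tesseract language
    code such as "eng" or "fra".
    """
    image_path = Path(image_path)
    # Tesseract appends the file extension itself, so the output base
    # name is the image path with its suffix removed.
    output_base = image_path.with_suffix("")
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "hocr"]


def ocr_folder(folder, lang="eng"):
    """Run Tesseract on every .tif file in `folder` (an assumed layout)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    for image_path in sorted(Path(folder).glob("*.tif")):
        # check=True raises an error if Tesseract exits with a failure code.
        subprocess.run(build_tesseract_command(image_path, lang), check=True)
```

Calling `ocr_folder("scans", lang="fra")` would, under these assumptions, write one `.hocr` file next to each TIFF in a folder named `scans`, using the French language model if it is installed.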
Materials
You can participate in this lesson in any of the following ways, depending on whether you want to install the software and whether your local system can run it. For those who cannot install the software, we will get a flavor of how to perform this work using a cloud-based interface.