GitHub - NYU-DataServices/RDM_tesseract_ocr: Class materials for a remote-modality version of Extract Text Using OCR that focuses on Tesseract-based workflows.

Extracting Text Using Optical Character Recognition (Tesseract Version for Remote Teaching)

Nicholas Wolf
ORCID 0000-0001-5512-6151

This lesson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview

This session explores the techniques and tools for building a workflow that turns digital images of text into machine-readable text. It focuses on the skills needed by an individual who is assembling a corpus of texts for text analysis, a website, a Digital Humanities project, or a small-scale digital library.

We'll focus on the following goals:

Review a few options for capturing a text digitally.
Walk through some solutions for bulk conversion of image files into OCR-ready formats.
Use Tesseract 4 for OCR, with a few different languages as examples.
Examine the structure of hOCR output.
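
As a rough preview of the last two goals, the sketch below builds a Tesseract command line that requests hOCR output and then reads word text, bounding boxes, and confidence values out of an hOCR fragment. It is a minimal illustration rather than the lesson's own notebook code: the helper names `tesseract_command` and `HocrWordParser` are hypothetical, the language code `eng` for the sample TIFF is an assumption, and `SAMPLE_HOCR` is a hand-written fragment in the style of Tesseract output, not real output from the repository images.

```python
import re
from html.parser import HTMLParser


def tesseract_command(image_path, output_base, lang="eng", hocr=True):
    """Build the argument list for a Tesseract CLI call (it is not run here)."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        # The "hocr" config file switches the output from plain text to hOCR.
        cmd.append("hocr")
    return cmd


class HocrWordParser(HTMLParser):
    """Collect text, bounding box, and confidence for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            title = attrs.get("title") or ""
            bbox = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
            conf = re.search(r"x_wconf (\d+)", title)
            self._current = {
                "text": "",
                "bbox": tuple(int(n) for n in bbox.groups()) if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


# Hand-written sample in the style of Tesseract's hOCR output (an assumption).
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 122'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Universal</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 109 92 235 116; x_wconf 91'>Declaration</span>
</span>
"""

# The command that would be passed to subprocess.run() on a machine with Tesseract installed.
print(" ".join(tesseract_command("wpa-lifehistory-swenson-1938.tif", "swenson")))

parser = HocrWordParser()
parser.feed(SAMPLE_HOCR)
for word in parser.words:
    print(word["text"], word["bbox"], word["conf"])
```

Building the command as a list, rather than as a single string, keeps file names with spaces intact if the list is later handed to `subprocess.run`. The parser uses only the standard library, so it can be tried without installing Tesseract.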

Materials

You can participate in this lesson in any of the following ways, depending on whether you want to install the software and whether your local system can run it. For those who cannot install the software, we will get a flavor of how to perform this work using a cloud-based interface.

.ipynb_checkpoints
.gitignore
146.56750609.88c618c0-5293-0134-1278-00505686a51c.jpeg
265.56750728.939e2500-5293-0134-d453-00505686a51c.jpeg
README.md
Tesseract-OCR.ipynb
UN-DeclarationHumanRights.pdf
bbox.jpeg
dc-reader-export.png
fontanills-1862-1.jpg
fontanills-1862-2.jpg
fontanills-1862-3.jpg
pdf-arabic-example.pdf
tess-uploads.sh
wpa-lifehistory-swenson-1938-bboxes.tif
wpa-lifehistory-swenson-1938.tif
