Class materials for a remote-modality version of Extract Text Using OCR that focuses on Tesseract-based workflows.

NYU-DataServices/RDM_tesseract_ocr

Extracting Text Using Optical Character Recognition (Tesseract Version for Remote Teaching)

Nicholas Wolf
ORCID 0000-0001-5512-6151

This lesson is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Overview

This session explores the tricks and tools available for building a workflow that turns digital images of text into computer-readable text. It focuses on the skills needed by an individual who is assembling a corpus of texts for text analysis, a website, a Digital Humanities project, or a small-scale digital library.

We'll focus on the following goals:

  • Review a few options for making the digital capture of a text.
  • Walk through some solutions for bulk transformation of image files into OCR-ready formats.
  • Take a look at using Tesseract 4 for OCR, using a few different languages as examples.
  • Examine the structure of the hOCR output.
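As a rough preview of the last two goals, the sketch below builds a Tesseract command line that requests hOCR output and then reads word text and bounding boxes back out of an hOCR fragment. It is a minimal illustration rather than the lesson's own code: the helper names `tesseract_command` and `HocrWords` are hypothetical, the file name and the `fra` language code in the usage note are placeholders, and it assumes the `tesseract` executable and the relevant language data are installed.

```python
from html.parser import HTMLParser
from pathlib import Path


def tesseract_command(image, lang="eng"):
    """Build the CLI call `tesseract IMAGE OUTPUTBASE -l LANG hocr`.

    Hypothetical helper: it only assembles the argument list and does
    not run Tesseract itself.
    """
    image = Path(image)
    output_base = image.with_suffix("")  # Tesseract appends ".hocr" itself
    return ["tesseract", str(image), str(output_base), "-l", lang, "hocr"]


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR spans of class `ocrx_word`."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = None
            # The title attribute looks like "bbox 36 92 96 116; x_wconf 95"
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(n) for n in parts[1:5])

    def handle_data(self, data):
        # Only keep text that sits inside a word span with a known bbox
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None
```

To run the OCR step, the list returned by `tesseract_command("page_001.tif", lang="fra")` could be passed to `subprocess.run(..., check=True)`; the resulting `page_001.hocr` file can then be read as text and given to `HocrWords().feed(...)`, after which the `words` attribute holds each word with its pixel coordinates.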

Materials

You can participate in this lesson in any of the following ways, depending on whether you want to install the software and whether your local system can run it. For those who cannot install the software, we will get a flavor of how to perform this work using a cloud-based interface.
