PDF Processing (convert to images and perform OCR to detect document structure) in python

Kagigz/python-pdf-processing

Python PDF Processing

-- Work in progress --

This repo contains 4 files:

  1. pdfToImages.py: converts PDF files to images
  2. imageOCR.py: performs OCR on a set of images with the Azure Computer Vision API, then stores the results in JSON files
  3. OCRinterpretation.py: interprets the results from the JSON files and generates new JSON files describing the PDF's structure
  4. pdfProcessing.py: takes care of the whole pipeline described above
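
To make the interpretation step more concrete, here is a minimal sketch of how OCR output could be turned into a simple structure. It is not the code from OCRinterpretation.py. It assumes the JSON follows the regions / lines / words layout returned by the Computer Vision OCR endpoint, with each boundingBox given as an "x,y,w,h" string. The function names and the heading heuristic (lines clearly taller than the median line are treated as headings) are hypothetical and only meant as an illustration.

```python
import json
from statistics import median


def parse_box(box):
    """Turn an 'x,y,w,h' string into a tuple of four ints."""
    x, y, w, h = (int(value) for value in box.split(","))
    return x, y, w, h


def extract_lines(ocr_result):
    """Flatten regions -> lines -> words into dicts, sorted top to bottom."""
    lines = []
    for region in ocr_result.get("regions", []):
        for line in region.get("lines", []):
            _, top, _, height = parse_box(line["boundingBox"])
            text = " ".join(word["text"] for word in line.get("words", []))
            lines.append({"text": text, "top": top, "height": height})
    return sorted(lines, key=lambda item: item["top"])


def label_structure(lines, heading_ratio=1.3):
    """Hypothetical heuristic: lines much taller than the median are headings."""
    if not lines:
        return []
    typical = median(item["height"] for item in lines)
    return [
        {
            "type": "heading" if item["height"] > heading_ratio * typical else "paragraph",
            "text": item["text"],
        }
        for item in lines
    ]


# Hand-written sample in the assumed layout (not real API output).
sample = {
    "regions": [
        {
            "boundingBox": "20,10,400,110",
            "lines": [
                {"boundingBox": "20,10,300,40",
                 "words": [{"boundingBox": "20,10,140,40", "text": "Annual"},
                           {"boundingBox": "170,10,150,40", "text": "Report"}]},
                {"boundingBox": "20,70,380,18",
                 "words": [{"boundingBox": "20,70,60,18", "text": "Body"},
                           {"boundingBox": "85,70,50,18", "text": "text."}]},
                {"boundingBox": "20,95,380,18",
                 "words": [{"boundingBox": "20,95,60,18", "text": "More"},
                           {"boundingBox": "85,95,50,18", "text": "text."}]},
            ],
        }
    ]
}

structure = label_structure(extract_lines(sample))
print(json.dumps(structure, indent=2))
```

Real documents usually need more than line height (font weight, indentation, position on the page), so treat the ratio as a starting point to tune.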

Installs

  • Install GhostScript on Windows
  • Install ImageMagick on Windows
  • Install Wand: pip install wand
  • Install OpenCV: pip install opencv-python
  • Install Matplotlib: pip install matplotlib
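
Before running the scripts, it can help to check that the items above are actually visible to Python. The sketch below is not part of the repo. The executable names (gswin64c, gswin32c, gs, magick) are assumptions about typical installs, and finding a tool on PATH is only a rough proxy for a working setup.

```python
import importlib.util
import shutil

# pip package name -> module name that the package provides
PYTHON_MODULES = {
    "wand": "wand",
    "opencv-python": "cv2",
    "matplotlib": "matplotlib",
}

# Tool -> executable names to look for on PATH (assumed, may differ per install)
EXECUTABLES = {
    "Ghostscript": ("gswin64c", "gswin32c", "gs"),
    "ImageMagick": ("magick",),
}


def check_dependencies():
    """Return a dict mapping each dependency to True (found) or False (missing)."""
    report = {}
    for package, module in PYTHON_MODULES.items():
        # find_spec locates the module without importing it
        report[package] = importlib.util.find_spec(module) is not None
    for tool, names in EXECUTABLES.items():
        report[tool] = any(shutil.which(name) for name in names)
    return report


for name, found in check_dependencies().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```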

How to use

Clone or download this repo, then use either the sample files or your own. If you choose to use your own, change the path names in the scripts accordingly. /!\ If you use your own files, try with a small number of pages first: converting PDF pages to images takes a while.
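
One way to follow the advice about starting small is to convert only a page range. The sketch below is not taken from pdfToImages.py. It assumes Wand with ImageMagick and Ghostscript installed, and relies on ImageMagick's "file.pdf[first-last]" read syntax (zero-based, inclusive). The function names, output file pattern and default resolution are hypothetical.

```python
from pathlib import Path


def page_spec(pdf_path, first_page, last_page):
    """Build an ImageMagick read spec for a zero-based, inclusive page range."""
    if first_page < 0 or last_page < first_page:
        raise ValueError("expected 0 <= first_page <= last_page")
    return f"{pdf_path}[{first_page}-{last_page}]"


def convert_pages(pdf_path, out_dir, first_page=0, last_page=1, resolution=150):
    """Render only the selected pages to PNG files and return their paths."""
    # Imported here so page_spec stays usable even if Wand is not installed.
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    spec = page_spec(pdf_path, first_page, last_page)
    with Image(filename=spec, resolution=resolution) as pdf:
        for offset, page in enumerate(pdf.sequence):
            with Image(image=page) as img:
                img.format = "png"
                target = out_dir / f"page_{first_page + offset:03d}.png"
                img.save(filename=str(target))
                written.append(target)
    return written


print(page_spec("sample.pdf", 0, 2))
```

Calling convert_pages("sample.pdf", "images", 0, 2) would render the first three pages only. A lower resolution also speeds things up, at the cost of OCR accuracy.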

If you wish to use the OCR capabilities, set up a Computer Vision API service on Azure and fill in your subscription key and region in imagesOCR.py and/or pdfProcessing.py.
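
As a rough illustration of where the subscription key and region end up, here is a sketch of a request to the Computer Vision OCR endpoint using only the standard library. It is not the code from the repo: the API version (v3.2), the regional endpoint form and the environment variable names AZURE_CV_KEY and AZURE_CV_REGION are assumptions, and the scripts may target a different version or pass the values differently. Reading the key from the environment is one way to avoid committing it by accident.

```python
import json
import os
import urllib.request


def build_ocr_request(image_bytes, subscription_key, region, api_version="v3.2"):
    """Prepare (but do not send) a POST request with the image as binary body."""
    url = f"https://{region}.api.cognitive.microsoft.com/vision/{api_version}/ocr"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=image_bytes, headers=headers)


def ocr_image(image_path):
    """Send one image to the service and return the parsed JSON response."""
    key = os.environ["AZURE_CV_KEY"]        # hypothetical variable name
    region = os.environ["AZURE_CV_REGION"]  # e.g. "westeurope"
    with open(image_path, "rb") as image_file:
        request = build_ocr_request(image_file.read(), key, region)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building a request needs no network access or valid key.
demo = build_ocr_request(b"not-a-real-image", "your-key-here", "westeurope")
print(demo.get_method(), demo.full_url)
```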

Once everything is ready, run the script you need (replace filename.py with its name) and watch the magic happen: python .\filename.py
