Python PDF Processing

-- Work in progress --

This repo contains 4 files:

pdfToImages.py: converts PDF files to images
imageOCR.py: performs OCR on a set of images with the Azure Computer Vision API, then stores the results in JSON files
OCRinterpretation.py: interprets the results from the JSON files and generates new JSON files describing the PDF's structure
pdfProcessing.py: takes care of the whole pipeline described above
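To give an idea of what interpreting the OCR results can involve, here is a minimal sketch that flattens an OCR response into lines of text. It assumes the regions / lines / words layout returned by the Computer Vision OCR endpoint. The function name ocr_lines and the sample data are hypothetical, and this is not necessarily how OCRinterpretation.py works.

```python
import json


def ocr_lines(ocr_result):
    """Return the text of every line found in an OCR response dictionary."""
    lines = []
    for region in ocr_result.get("regions", []):
        for line in region.get("lines", []):
            # Each line holds a list of words; join them back into a string.
            words = [word["text"] for word in line.get("words", [])]
            lines.append(" ".join(words))
    return lines


# Hand-written sample in the assumed response shape (not real API output).
sample = {
    "language": "en",
    "regions": [
        {
            "boundingBox": "10,10,200,50",
            "lines": [
                {
                    "boundingBox": "10,10,200,20",
                    "words": [
                        {"boundingBox": "10,10,90,20", "text": "Python"},
                        {"boundingBox": "110,10,50,20", "text": "PDF"},
                    ],
                },
                {
                    "boundingBox": "10,40,120,20",
                    "words": [
                        {"boundingBox": "10,40,120,20", "text": "Processing"},
                    ],
                },
            ],
        }
    ],
}

print(json.dumps({"lines": ocr_lines(sample)}, indent=2))
```

The bounding boxes are ignored here. A structure-aware interpretation would probably use them to group lines into columns, headings or paragraphs.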

Installs

Install Ghostscript on Windows
Install ImageMagick on Windows
Install Wand: pip install wand
Install OpenCV: pip install opencv-python
Install Matplotlib: pip install matplotlib
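To check that the installs above are in place before running anything, a small script along these lines may help. It is only a sketch: the executable names (gswin64c, gswin32c, gs, magick) are the usual ones but may differ on your machine, and Wand locates ImageMagick through its library rather than through PATH, so treat a missing executable as a hint, not a verdict.

```python
import importlib.util
import shutil

# Python modules installed with pip (opencv-python is imported as cv2).
PYTHON_MODULES = ["wand", "cv2", "matplotlib"]

# Assumed executable names for the external tools; adjust if yours differ.
EXECUTABLES = {
    "Ghostscript": ["gswin64c", "gswin32c", "gs"],
    "ImageMagick": ["magick"],
}


def missing_dependencies():
    """Return the names of modules and tools that could not be found."""
    missing = [
        name for name in PYTHON_MODULES
        if importlib.util.find_spec(name) is None
    ]
    for tool, candidates in EXECUTABLES.items():
        # A tool counts as present if any candidate executable is on PATH.
        if not any(shutil.which(candidate) for candidate in candidates):
            missing.append(tool)
    return missing


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Not found: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```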

How to use

Clone or download this repo, then use either the sample files or your own. If you choose to use your own, change the path names accordingly. /!\ If you use your own files, try with a small number of pages first: converting a PDF to images takes a while.
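One way to try a small number of pages first is to let ImageMagick read only the first few pages of the PDF. The sketch below uses Wand with ImageMagick's page-range syntax (file.pdf[0-2] selects the first three pages). The function names, the output file names and the default resolution are assumptions for illustration, not what pdfToImages.py does.

```python
from pathlib import Path


def first_pages_spec(pdf_path, page_count):
    """Build an ImageMagick read spec selecting the first page_count pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count == 1:
        return f"{pdf_path}[0]"
    # Page indices are zero-based and the range is inclusive.
    return f"{pdf_path}[0-{page_count - 1}]"


def convert_first_pages(pdf_path, out_dir, page_count=3, resolution=150):
    """Convert the first pages of a PDF to PNG files and return their paths."""
    # Imported here so first_pages_spec can be used without Wand installed.
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    spec = first_pages_spec(pdf_path, page_count)
    with Image(filename=spec, resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            with Image(image=page) as img:
                img.format = "png"
                target = out_dir / f"page_{index + 1}.png"
                img.save(filename=str(target))
                written.append(target)
    return written
```

For example, convert_first_pages("sample.pdf", "images", page_count=2) would write images/page_1.png and images/page_2.png, provided Ghostscript and ImageMagick are installed. A lower resolution speeds up conversion but may hurt OCR quality.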

If you wish to use the OCR capabilities, set up a Computer Vision API service on Azure and fill in your subscription key and region in imageOCR.py and/or pdfProcessing.py.
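For reference, this is roughly how the subscription key and region end up in a call to the OCR endpoint. It is a sketch using only the standard library. The API version (v3.2), the region-based host name, and the environment variable names AZURE_CV_KEY and AZURE_CV_REGION are assumptions: the scripts in this repo may target a different version, and newer Azure resources may use a custom endpoint URL instead.

```python
import json
import os
import urllib.request


def build_ocr_request(image_bytes, subscription_key, region, api_version="v3.2"):
    """Build (but do not send) a POST request for the Computer Vision OCR API."""
    url = (
        f"https://{region}.api.cognitive.microsoft.com/vision/{api_version}/ocr"
        "?language=unk&detectOrientation=true"
    )
    headers = {
        # The key authenticates the call; the body is the raw image.
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(
        url, data=image_bytes, headers=headers, method="POST"
    )


def run_ocr(image_path):
    """Send one image to the OCR endpoint and return the parsed JSON result."""
    # Hypothetical variable names; reading them from the environment keeps
    # the key out of the source files.
    key = os.environ["AZURE_CV_KEY"]
    region = os.environ["AZURE_CV_REGION"]
    with open(image_path, "rb") as image_file:
        request = build_ocr_request(image_file.read(), key, region)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the key in an environment variable rather than in the script makes it harder to commit it to a public repo by accident.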

Once everything is ready, run the script you need (replace filename with the script's name) and watch the magic happen: python .\filename.py


Kagigz/python-pdf-processing
