-- Work in progress --
This repo contains 4 files:
- pdfToImages.py: converts PDF files to images
- imageOCR.py: performs OCR on a set of images with the Azure Computer Vision API, then stores the results in JSON files
- OCRinterpretation.py: interprets the results from the JSON files and generates new JSON files that describe the PDF's structure
- pdfProcessing.py: takes care of the whole pipeline described above
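
As an illustration of the first step, here is a minimal sketch of converting a PDF into one PNG per page with Wand. It is not the code from pdfToImages.py: the function names, the output naming scheme and the default resolution of 200 are assumptions made for this example.

```python
from pathlib import Path


def page_image_path(pdf_path, page_index, out_dir):
    """Build the output file name for one page (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page{page_index + 1:03d}.png"


def pdf_to_images(pdf_path, out_dir, resolution=200):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Imported here so the helper above works even if Wand is not installed yet.
    from wand.image import Image

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Wand relies on ImageMagick and Ghostscript to rasterize the PDF.
    with Image(filename=str(pdf_path), resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            with Image(image=page) as img:
                img.format = "png"
                target = page_image_path(pdf_path, index, out_dir)
                img.save(filename=str(target))
                written.append(target)
    return written
```

A call such as `pdf_to_images("sample.pdf", "images")` would write `images/sample_page001.png`, `images/sample_page002.png`, and so on (`sample.pdf` is a placeholder name). A higher resolution generally helps OCR but makes the conversion slower.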
Prerequisites:
- Install Ghostscript on Windows
- Install ImageMagick on Windows
- Install Wand:
pip install wand
- Install OpenCV:
pip install opencv-python
- Install Matplotlib:
pip install matplotlib
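
To confirm that the prerequisites listed above are in place before running anything, a small check along these lines may help. It is only a sketch: the executable names (`gswin64c`, `gswin32c`, `gs`, `magick`) are assumptions about a typical install, and a tool that is missing from PATH may still be installed elsewhere.

```python
import importlib.util
import shutil

# Import names for the pip packages:
# wand -> "wand", opencv-python -> "cv2", matplotlib -> "matplotlib".
PYTHON_MODULES = ["wand", "cv2", "matplotlib"]

# Assumed executable names; adjust them to match your installation.
EXTERNAL_TOOLS = {
    "Ghostscript": ["gswin64c", "gswin32c", "gs"],
    "ImageMagick": ["magick"],
}


def missing_modules(names):
    """Return the module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(tools):
    """Return the labels of tools for which no candidate executable is on PATH."""
    return [
        label
        for label, executables in tools.items()
        if not any(shutil.which(exe) for exe in executables)
    ]


if __name__ == "__main__":
    for name in missing_modules(PYTHON_MODULES):
        print(f"Missing Python module: {name}")
    for label in missing_tools(EXTERNAL_TOOLS):
        print(f"Not found on PATH: {label}")
```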
Clone or download this repo, then use either the sample files or your own. If you choose to use your own files, change the path names accordingly. /!\ If you use your own files, try a small number of pages first: converting PDF files to images takes a while.
If you wish to use the OCR capabilities, set up a Computer Vision API service on Azure and fill in your subscription key and region in imagesOCR.py and/or pdfProcessing.py.
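
For reference, a request to the service could be built roughly as below. This is a sketch under assumptions, not the code used in this repo: the `/vision/v3.2/ocr` path, the regional endpoint form, the environment variable names `AZURE_CV_KEY` and `AZURE_CV_REGION`, and the function names are all chosen for illustration. Check your Azure resource for the endpoint and API version it actually exposes.

```python
import json
import os
import urllib.request


def build_ocr_request(image_bytes, key, region, api_path="/vision/v3.2/ocr"):
    """Build a POST request that sends raw image bytes to the OCR endpoint."""
    # Assumed regional endpoint form; some resources use a custom subdomain instead.
    url = f"https://{region}.api.cognitive.microsoft.com{api_path}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=image_bytes, headers=headers, method="POST")


def run_ocr(image_path, json_path):
    """Send one image to the service and store the JSON response on disk."""
    # Hypothetical variable names; reading them from the environment keeps
    # the subscription key out of the source files.
    key = os.environ["AZURE_CV_KEY"]
    region = os.environ["AZURE_CV_REGION"]
    with open(image_path, "rb") as image_file:
        request = build_ocr_request(image_file.read(), key, region)
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(result, json_file, indent=2)
    return result
```

Keeping the key in an environment variable rather than in the script also avoids committing it by accident.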
Once everything is ready, run the script you need (replace filename.py with its name) and watch the magic happen:
python .\filename.py