OCR Extractor API

A FastAPI-based service that extracts text and images from PDFs with OCR (Optical Character Recognition), using Marker. It accepts a PDF either as a local file path or as a remote URL.

Features

Extract text from PDFs using OCR
Support for multiple languages
Image extraction from PDFs
Local and remote PDF processing
Force OCR processing when needed
Optional pagination support

Technology Stack

Python 3.10+ (prerequisite)
Poetry (prerequisite)
FastAPI
Marker PDF
PyTorch

Installation and Running

Using Docker (Recommended)

Build the Docker image:

make docker/build

Start the Docker container:

make docker/up

Using Poetry

Install dependencies:

poetry install

Run the service:

poetry run python src/main.py

API Documentation

PDF Conversion Endpoint

Endpoint: POST /marker

Request Parameters

Parameter	Type	Required	Default	Description
url	string	No	-	The URL to the PDF file to convert
filepath	string	No	-	The path to the PDF file to convert
max_pages	integer	No	null	The maximum number of pages in the document to convert
langs	string	No	null	Languages to use for OCR, comma separated (e.g., "en,es"). Uses codes from Surya's language file
force_ocr	boolean	No	false	Force OCR on all pages. Warning: Can lead to worse results if PDFs already have good text
paginate	boolean	No	false	If true, separates output pages with horizontal rules containing page numbers
extract_images	boolean	No	true	Whether to extract images from the PDF
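
As a minimal sketch, the parameters above can be modelled on the client side with a small dataclass. The class name `MarkerRequest` and the method `to_payload` are hypothetical and are not part of the service. The check that at least one of `url` or `filepath` is set is an assumption, since the table marks both as optional.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MarkerRequest:
    """Hypothetical client-side model of the POST /marker request body."""

    url: Optional[str] = None
    filepath: Optional[str] = None
    max_pages: Optional[int] = None
    langs: Optional[str] = None  # comma-separated codes, e.g. "en,es"
    force_ocr: bool = False
    paginate: bool = False
    extract_images: bool = True

    def to_payload(self) -> dict:
        # Assumption: the service needs at least one source to convert.
        if self.url is None and self.filepath is None:
            raise ValueError("provide url or filepath")
        # Drop unset optional fields so the service applies its own defaults.
        return {key: value for key, value in asdict(self).items() if value is not None}
```

For example, `MarkerRequest(url="https://example.com/path/to/pdf.pdf", max_pages=10).to_payload()` yields a dictionary that can be serialized as the JSON body.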

Example Request

curl -X POST "http://localhost:8000/marker" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/path/to/pdf.pdf", "max_pages": 10, "langs": "en,es", "force_ocr": false, "paginate": false, "extract_images": true}'
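
The same request can be sent from Python using only the standard library. This is a sketch, not the project's own client: the function names `build_marker_request` and `convert_pdf` are hypothetical, the base URL is taken from the curl example, and the 300-second timeout is an arbitrary assumption for long-running conversions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the curl example


def build_marker_request(payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /marker request with a JSON body."""
    return urllib.request.Request(
        url=f"{base_url}/marker",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_pdf(payload: dict, timeout: float = 300.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_marker_request(payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `convert_pdf({"url": "https://example.com/path/to/pdf.pdf", "max_pages": 10})` requires the service to be running. Note that `urlopen` raises `urllib.error.HTTPError` for non-2xx status codes, so callers may want to catch it.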

Response Format

The API returns a JSON object with the following structure:

{
    "markdown": "Extracted text content in markdown format",
    "images": {
        "image_key": "base64_encoded_image_string"
    },
    "metadata": {
        "languages": ["detected_language_codes"],
        "filetype": "pdf",
        "pdf_toc": [],
        "pages": 5,
        "ocr_stats": {
            "ocr_pages": 0,
            "ocr_failed": 0,
            "ocr_success": 0,
            "ocr_engine": "none"
        },
        "block_stats": {
            "header_footer": 0,
            "code": 0,
            "table": 0,
            "equations": {
                "successful_ocr": 0,
                "unsuccessful_ocr": 0,
                "equations": 0
            }
        },
        "computed_toc": []
    },
    "success": true
}
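
Because `images` maps keys to base64-encoded strings, a client has to decode them before use. The sketch below writes each image to disk. The helper name `save_images` is hypothetical, and treating each key as a file name is an assumption, since the response format does not say what the keys look like.

```python
import base64
from pathlib import Path


def save_images(response: dict, out_dir: str) -> list:
    """Decode each base64 entry in response["images"] and write it into out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for key, encoded in (response.get("images") or {}).items():
        # Assumption: the key works as a file name; keep only its last path component.
        path = target / Path(key).name
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```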

Response Fields

Field	Type	Description
markdown	string	The extracted text content in markdown format
images	object	Dictionary of extracted images (if any) as base64 encoded strings
metadata	object	Processing metadata and statistics
metadata.languages	array	Detected languages in the document
metadata.pages	integer	Total number of pages processed
metadata.ocr_stats	object	Statistics about OCR processing
metadata.block_stats	object	Statistics about different content blocks found
success	boolean	Whether the conversion was successful
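
As an illustration of how the metadata fields might be consumed, the following sketch builds a one-line summary from a response dictionary. The function name `summarize` is hypothetical. It reads only fields listed above and falls back to neutral values when a field is missing.

```python
def summarize(response: dict) -> str:
    """Return a short, human-readable summary of the response metadata."""
    metadata = response.get("metadata") or {}
    ocr = metadata.get("ocr_stats") or {}
    languages = ", ".join(metadata.get("languages") or []) or "unknown"
    return (
        f"{metadata.get('pages', 0)} pages, languages: {languages}, "
        f"OCR pages: {ocr.get('ocr_pages', 0)} "
        f"(failed: {ocr.get('ocr_failed', 0)}, engine: {ocr.get('ocr_engine', 'none')})"
    )
```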

Error Response

In case of an error, the API returns:

{
    "success": false,
    "error": "Error message description"
}
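
A client can turn this error shape into an exception so that failures are not silently ignored. This is a sketch under the assumption that every response carries the `success` flag shown above. The names `MarkerError` and `unwrap` are hypothetical.

```python
class MarkerError(Exception):
    """Raised when the service reports "success": false (hypothetical name)."""


def unwrap(response: dict) -> str:
    """Return the markdown on success, raise MarkerError otherwise."""
    if not response.get("success", False):
        raise MarkerError(response.get("error", "unknown error"))
    return response.get("markdown", "")
```

With this helper, `unwrap({"success": False, "error": "Error message description"})` raises `MarkerError` carrying the message from the `error` field.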

Interactive Documentation

For interactive API documentation, visit:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc
