A FastAPI-based service that extracts text from PDFs with OCR (Optical Character Recognition) using Marker. The service can process PDFs from a local file path or from a remote URL.
- Extract text from PDFs using OCR
- Support for multiple languages
- Image extraction from PDFs
- Local and remote PDF processing
- Force OCR processing when needed
- Optional pagination support
- Python 3.10+ (prerequisite)
- Poetry (prerequisite)
- FastAPI
- Marker PDF
- PyTorch
- Build the Docker image:
make docker/build
- Start the Docker container:
make docker/up
- Install dependencies:
poetry install
- Run the service:
poetry run python src/main.py
Endpoint: POST /marker
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | No | - | The URL to the PDF file to convert |
| filepath | string | No | - | The path to the PDF file to convert |
| max_pages | integer | No | null | The maximum number of pages in the document to convert |
| langs | string | No | null | Languages to use for OCR, comma separated (e.g., "en,es"). Uses codes from Surya's language file |
| force_ocr | boolean | No | false | Force OCR on all pages. Warning: Can lead to worse results if PDFs already have good text |
| paginate | boolean | No | false | If true, separates output pages with horizontal rules containing page numbers |
| extract_images | boolean | No | true | Whether to extract images from the PDF |
curl -X POST "http://localhost:8000/marker" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/path/to/pdf.pdf",
    "max_pages": 10,
    "langs": "en,es",
    "force_ocr": false,
    "paginate": false,
    "extract_images": true
  }'
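The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_marker_request` and `convert_pdf`, the `endpoint` argument, and the 300-second timeout are illustrative assumptions, not part of the service.

```python
import json
import urllib.request

# Default address taken from the curl example above.
MARKER_URL = "http://localhost:8000/marker"


def build_marker_request(endpoint=MARKER_URL, **params):
    """Build a POST request for /marker (hypothetical helper).

    Parameters whose value is None are left out of the JSON body so the
    service falls back to its documented defaults.
    """
    payload = {key: value for key, value in params.items() if value is not None}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_pdf(timeout=300, **params):
    """Send the request and return the decoded JSON response body."""
    request = build_marker_request(**params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running service on localhost:8000.
    result = convert_pdf(url="https://example.com/path/to/pdf.pdf", max_pages=10)
    print(result["markdown"][:200])
```

For a local file, pass `filepath="..."` instead of `url`. Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for non-2xx status codes, so callers may want to catch that separately.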
The API returns a JSON object with the following structure:
{
  "markdown": "Extracted text content in markdown format",
  "images": {
    "image_key": "base64_encoded_image_string"
  },
  "metadata": {
    "languages": ["detected_language_codes"],
    "filetype": "pdf",
    "pdf_toc": [],
    "pages": 5,
    "ocr_stats": {
      "ocr_pages": 0,
      "ocr_failed": 0,
      "ocr_success": 0,
      "ocr_engine": "none"
    },
    "block_stats": {
      "header_footer": 0,
      "code": 0,
      "table": 0,
      "equations": {
        "successful_ocr": 0,
        "unsuccessful_ocr": 0,
        "equations": 0
      }
    },
    "computed_toc": []
  },
  "success": true
}
| Field | Type | Description |
| --- | --- | --- |
| markdown | string | The extracted text content in markdown format |
| images | object | Dictionary of extracted images (if any) as base64 encoded strings |
| metadata | object | Processing metadata and statistics |
| metadata.languages | array | Detected languages in the document |
| metadata.pages | integer | Total number of pages processed |
| metadata.ocr_stats | object | Statistics about OCR processing |
| metadata.block_stats | object | Statistics about different content blocks found |
| success | boolean | Whether the conversion was successful |
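Because `images` holds base64 encoded strings, a client has to decode them before writing them to disk. The sketch below shows one way to do that; the function name `save_images` is hypothetical, and it assumes each image key is usable as a file name, which the response format above does not guarantee.

```python
import base64
from pathlib import Path


def save_images(response, out_dir):
    """Decode every entry of response["images"] and write it under out_dir.

    Returns the list of paths written (empty if there are no images).
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for key, encoded in response.get("images", {}).items():
        # Assumption: the key works as a file name. Only its final path
        # component is used, so separators in a key do not create
        # nested paths.
        target = out_path / Path(key).name
        target.write_bytes(base64.b64decode(encoded))
        written.append(target)
    return written
```

Calling `save_images(result, "extracted_images")` on a parsed response would write one file per key in `images`.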
In case of an error, the API returns:
{
  "success": false,
  "error": "Error message description"
}
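A client can check the `success` flag before using the result. The following sketch turns an error response into a Python exception; `MarkerError` and `unwrap_response` are hypothetical names, and the fallback message "unknown error" is an assumption for responses that lack an `error` field.

```python
class MarkerError(Exception):
    """Raised when the service reports success == false (hypothetical name)."""


def unwrap_response(response):
    """Return (markdown, metadata) or raise MarkerError on failure."""
    if not response.get("success", False):
        # Fallback message is an assumption, used only if "error" is absent.
        raise MarkerError(response.get("error", "unknown error"))
    return response["markdown"], response.get("metadata", {})
```

Keeping this check in one place means the rest of the client code only ever sees successful conversions.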
For interactive API documentation, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc