Stream OCR result by page & code restructure #112

Status: Open · wants to merge 22 commits into `main`
README.md (26 changes: 18 additions, 8 deletions)
@@ -15,7 +15,7 @@ The general logic:
- Pass each image to GPT and ask nicely for Markdown
- Aggregate the responses and return Markdown
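
A minimal sketch of that flow (the helper functions here are hypothetical stand-ins, not zerox internals):

```typescript
// Conceptual sketch only: the helpers below are hypothetical stand-ins.
declare function convertToImages(filePath: string): Promise<string[]>; // file => one image per page
declare function imageToMarkdown(imagePath: string): Promise<string>;  // image => vision model => Markdown

async function ocrDocument(filePath: string): Promise<string> {
  const imagePaths = await convertToImages(filePath);
  const pages = await Promise.all(imagePaths.map(imageToMarkdown));
  return pages.join("\n\n"); // aggregate the per-page Markdown into one document
}
```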

Try out the hosted version here: <https://getomni.ai/ocr-demo>

## Getting Started

@@ -76,9 +76,13 @@ const result = await zerox({
cleanup: true, // Clear images from tmp after run.
concurrency: 10, // Number of pages to run at a time.
correctOrientation: true, // True by default, attempts to identify and correct page orientation.
errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE.
maintainFormat: false, // Slower but helps maintain consistent formatting.
maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1.
maxTesseractWorkers: -1, // Maximum number of tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed.
model: "gpt-4o-mini", // Model to use (gpt-4o-mini or gpt-4o).
onPostProcess: async ({ page, progressSummary }) => Promise<void>, // Callback function to run after each page is processed.
onPreProcess: async ({ imagePath, pageNumber }) => Promise<void>, // Callback function to run before each page is processed.
outputDir: undefined, // Save combined result.md to a file.
pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages.
tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory).
@@ -132,17 +136,23 @@ Request #3 => page_2_markdown + page_3_image
'**Terms:** \n' +
'Order ID : CA-2012-AB10015140-40974 ',
page: 1,
contentLength: 747,
status: 'SUCCESS',
}
],
summary: {
failedPages: 0,
successfulPages: 1,
totalPages: 1,
},
}
```
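
These per-page fields pair with the `onPreProcess`/`onPostProcess` callbacks listed in the options above, which is what enables streaming results page by page. A minimal consumption sketch, assuming the callback payloads mirror the fields shown in this README (`page`, `status`, and a `progressSummary` with the same counters as `summary`), and using `filePath`/`openaiAPIKey` from the library's basic usage:

```typescript
// Sketch only: callback payload fields are assumed from the example output above,
// and `filePath`/`openaiAPIKey` are assumed from the README's basic usage.
import { zerox } from "zerox";

const result = await zerox({
  filePath: "path/to/file.pdf",             // placeholder input
  openaiAPIKey: process.env.OPENAI_API_KEY, // assumed standard option
  onPreProcess: async ({ imagePath, pageNumber }) => {
    console.log(`OCR starting for page ${pageNumber} (${imagePath})`);
  },
  onPostProcess: async ({ page, progressSummary }) => {
    // Stream each page out as soon as it is done, e.g. push it to a queue or websocket.
    console.log(`Page ${page.page}: ${page.status}, ${page.contentLength ?? 0} chars`);
    console.log(`${progressSummary.successfulPages}/${progressSummary.totalPages} pages processed`);
  },
});

// Once every page has been handled, `result.summary` reports the final totals.
console.log(result.summary);
```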

## Python Zerox

(Python SDK: supports vision models from different providers such as OpenAI, Azure OpenAI, Anthropic, and AWS Bedrock.)

### Installation

- Install **poppler-utils** on the system and make sure it is available on the PATH
- Install py-zerox:
@@ -285,7 +295,7 @@ Returns
- ZeroxOutput:
Contains the markdown content generated by the model along with some metadata (see below).

### Example Output (Output from "azure/gpt-4o-mini")

`Note: The output is manually wrapped for better readability in this documentation.`

@@ -340,7 +350,7 @@ ZeroxOutput(
)
````

## Supported File Types

We use a combination of `libreoffice` and `graphicsmagick` for document-to-image conversion. For non-image, non-PDF files, we first use `libreoffice` to convert the file to a PDF, and then convert the PDF to images.
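
For illustration, a rough sketch of that two-step conversion using the same command-line tools; the flags, extension list, and file naming are assumptions, not zerox's exact invocation:

```typescript
// Illustrative only: flags and extension handling are assumptions, not zerox internals.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import { basename, extname, join } from "node:path";

const run = promisify(execFile);
const IMAGE_EXTS = new Set([".png", ".jpg", ".jpeg", ".tiff", ".bmp"]); // hypothetical list

async function documentToImages(inputPath: string, outDir: string): Promise<void> {
  const ext = extname(inputPath).toLowerCase();
  let pagePath = inputPath;
  if (ext !== ".pdf" && !IMAGE_EXTS.has(ext)) {
    // Non-image / non-pdf files go through libreoffice to become a PDF first.
    await run("libreoffice", ["--headless", "--convert-to", "pdf", "--outdir", outDir, inputPath]);
    pagePath = join(outDir, basename(inputPath, ext) + ".pdf");
  }
  // graphicsmagick then rasterizes each page into its own PNG.
  await run("gm", ["convert", "-density", "300", pagePath, "+adjoin", join(outDir, "page_%d.png")]);
}
```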

@@ -373,7 +383,7 @@

## Credits

- [Litellm](https://github.com/BerriAI/litellm): <https://github.com/BerriAI/litellm> | This powers our Python SDK, supporting all popular vision models from different providers.

### License
