OCR Extractor

This OCR module is intended to retrieve the texts, tables from the images or pdf files using Deep Learning Models developed by Paddle OCR (https://github.com/PaddlePaddle/PaddleOCR). This library employs several models for detection, recognition, layout parsing, just to name a few. Based on this library, we wrote a wrapper to process the documents.

Installation

It can be installed using pip.

pip install [email protected]:the-deep-nlp/ocr-extractor.git

Usages

The OCRProcessor class can be used as follows:

ocr_processor = OCRProcessor(  
    lang: str  -> The language of the document (e.g. 'en')  
    precision: str  -> 'fp16' Use lower value to optimize the floating point used during processing  
    extraction_type: enum[int] (1 - 4)  
    show_log: boolean -> To show the log or not (default: False)  
    layout: boolean  -> Whether to processor the layout of the documents (default: True)  
    use_s3: boolean -> To use s3 for storage (default: False)  
    s3_bucket_name: str -> Name of the bucket in s3 if use_s3 is True (default: None)  
    s3_bucket_key: str -> Key name in s3 if use_s3 is True (default: None)  
    process_text: boolean -> To extract text contents from the images / scanned document (default: True)  
    process_table: boolean -> To extract the tables from the images / pdf / scanned document (default: True)  
    aws_region_name: str -> AWS region (default: 'us-east-1')  
    models_base_path: str -> local location of the models  (default: '/ocr/models/')
)

ocr_processor.load_file(
    file_path: str -> File path which needs to be processed,
    is_image: boolean -> Whether the used file is an image (True) or a pdf (False)
)

result = await ocr_processer.handler()

Since few methods used are co-routines in this library, you need to use async to execute the co-routine, otherwise co-routine objects are only returned without being processed.

The result object is a dictionary consisting of few key-value pairs:

{
    "text": list[objects],
    "image": list[objects],
    "table": list[object]
}

The text value is a list of objects whose format is:

{
    'page_number': int,
    'order': int,
    'content': str
}

The image value is a list of objects whose structure is:

{
    'page_number': int,
    'images': list[str]  # The strings are the urls
}

The table value is a list of objects whose structure is:

{
    'page_number': int,
    'order': int,
    'content_link': str (url),
    'image_link': str (url)
}

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
src/ocr_extractor		src/ocr_extractor
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR Extractor

Installation

Usages

About

Releases

Packages

Contributors 2

Languages

License

the-deep-nlp/ocr-extractor

Folders and files

Latest commit

History

Repository files navigation

OCR Extractor

Installation

Usages

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages