the-deep-nlp/ocr-extractor

OCR Extractor

This OCR module extracts text and tables from images and PDF files using deep learning models developed by PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR). PaddleOCR provides several models, including ones for text detection, text recognition, and layout parsing. This library is a wrapper around PaddleOCR for processing documents.

Installation

It can be installed using pip.

pip install [email protected]:the-deep-nlp/ocr-extractor.git

Usage

The OCRProcessor class can be used as follows:

ocr_processor = OCRProcessor(
    lang: str -> Language of the document (e.g. 'en')
    precision: str -> Floating-point precision used during processing, such as 'fp16'; use a lower precision to optimize processing
    extraction_type: enum[int] (1 - 4)
    show_log: boolean -> Whether to show the log (default: False)
    layout: boolean -> Whether to process the layout of the documents (default: True)
    use_s3: boolean -> Whether to use S3 for storage (default: False)
    s3_bucket_name: str -> Name of the S3 bucket if use_s3 is True (default: None)
    s3_bucket_key: str -> Key name in S3 if use_s3 is True (default: None)
    process_text: boolean -> Whether to extract text from the images / scanned documents (default: True)
    process_table: boolean -> Whether to extract tables from the images / PDFs / scanned documents (default: True)
    aws_region_name: str -> AWS region (default: 'us-east-1')
    models_base_path: str -> Local path of the models (default: '/ocr/models/')
)

ocr_processor.load_file(
    file_path: str -> Path of the file to process,
    is_image: boolean -> Whether the file is an image (True) or a PDF (False)
)

result = await ocr_processor.handler()

Some methods in this library are coroutines, so they must be awaited from inside an async function (or run through an event loop). Calling a coroutine method without await only returns a coroutine object; the processing itself never runs.
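The difference between calling and awaiting a coroutine can be sketched with the standard asyncio module. The StubOCRProcessor class below is a hypothetical stand-in written only for this example: it performs no OCR and is not part of the library. It only mirrors the load_file / handler call pattern shown above, and the file name sample.pdf is a placeholder.

```python
import asyncio
import inspect


class StubOCRProcessor:
    """Hypothetical stand-in for OCRProcessor; it performs no OCR."""

    def load_file(self, file_path: str, is_image: bool) -> None:
        # Remember which file was requested; nothing is read from disk.
        self.file_path = file_path
        self.is_image = is_image

    async def handler(self) -> dict:
        # Simulate asynchronous work, then return an empty result
        # with the three documented keys.
        await asyncio.sleep(0)
        return {"text": [], "image": [], "table": []}


async def main() -> dict:
    ocr_processor = StubOCRProcessor()
    ocr_processor.load_file(file_path="sample.pdf", is_image=False)

    # Calling handler() without await only creates a coroutine object.
    pending = ocr_processor.handler()
    assert inspect.iscoroutine(pending)

    # Awaiting the coroutine runs it and yields the result.
    return await pending


# asyncio.run() starts an event loop and drives main() to completion.
result = asyncio.run(main())
print(sorted(result))
```

In a script, asyncio.run() is the usual entry point. Inside an already running event loop (for example in a notebook or an async web handler), awaiting the coroutine directly is typically what is needed instead.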

The result object is a dictionary with the following key-value pairs:

{
    "text": list[object],
    "image": list[object],
    "table": list[object]
}

The text value is a list of objects whose format is:

{
    'page_number': int,
    'order': int,
    'content': str
}

The image value is a list of objects whose structure is:

{
    'page_number': int,
    'images': list[str]  # The strings are URLs
}

The table value is a list of objects whose structure is:

{
    'page_number': int,
    'order': int,
    'content_link': str (url),
    'image_link': str (url)
}
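As a rough illustration of consuming this structure, the sketch below groups the text entries by page and joins their content. The sample data is made up for the example, the text_by_page helper is hypothetical, and the sketch assumes that order gives the reading order of entries within a page, which the description above does not state explicitly.

```python
from collections import defaultdict

# Made-up sample data in the documented result shape (not real OCR output).
result = {
    "text": [
        {"page_number": 2, "order": 1, "content": "Second page."},
        {"page_number": 1, "order": 2, "content": "world"},
        {"page_number": 1, "order": 1, "content": "Hello"},
    ],
    "image": [{"page_number": 1, "images": ["https://example.com/img-1.png"]}],
    "table": [],
}


def text_by_page(result: dict) -> dict[int, str]:
    """Hypothetical helper: join text entries per page.

    Assumes 'order' is the reading order within a page.
    """
    pages = defaultdict(list)
    # Sort by page first, then by position within the page.
    for entry in sorted(result["text"], key=lambda e: (e["page_number"], e["order"])):
        pages[entry["page_number"]].append(entry["content"])
    # One newline-separated string per page.
    return {page: "\n".join(chunks) for page, chunks in pages.items()}


pages = text_by_page(result)
for page, text in pages.items():
    print(f"--- page {page} ---")
    print(text)
```

The table entries could be handled the same way, except that their content is referenced through content_link and image_link URLs rather than embedded in the result.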
