This is a work in progress. Contributions of any kind are appreciated.
This repo aims to provide the largest filterable PDF corpus. In addition to the raw PDF files, we aim to provide extracted text, language information, and spam-filtering information for each file. Current efforts are focused on replicating the CC-PDF[^1] pipeline.
- Get prediction on file
- OCR engines
- Tesseract
- Vision
- Azure
- Textract (may be skipped due to limited language support)
- Language Detection
- Langdetect
- lingua-py
- spacy
- gcld
- Language reported by commercial OCR engines (Azure, Vision, etc.)
- Replicate the Born Digital detector of CC-PDF for accurate PDF parsing
- DjVu
- pdfminer.six
- pdfplumber
- OCR engines
- Get statistics from crawl
- Figure out a safe way to download and parse PDFs
- Language detection from URL
- Spam filtering stats from URL
- Parsing CC dumps
- Figure out items here
- Replicate Section 4 (Exploration of PDFs) of CC-PDF
- Figure out appropriate license
The aim is to release this work and dataset under as permissive a license as possible. We still need to work out the details of the MIT, Apache-2.0, Creative Commons, and similar licenses.
- GitHub issues are currently used to track work-in-progress features
- We will be dealing with terabytes of data, so contributions of compute credits are welcome as well
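
As a starting point for the "Language detection from URL" item above, here is a minimal sketch of a URL-only language hint. It is not the CC-PDF method and not code from this repo: the function name `guess_language_from_url`, the lookup tables `TLD_HINTS` and `KNOWN_LANG_CODES`, and the order in which the signals are checked are all assumptions made for illustration. It uses only the Python standard library.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny lookup tables; real crawl data needs far more coverage.
TLD_HINTS = {"de": "de", "fr": "fr", "pl": "pl", "jp": "ja", "es": "es"}
KNOWN_LANG_CODES = {"en", "de", "fr", "pl", "ja", "es"}


def guess_language_from_url(url):
    """Return a guessed ISO 639-1 code for the URL, or None if no hint is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    labels = host.split(".") if host else []

    # 1. A short path segment such as /de/ or /en-us/ is treated as the strongest signal.
    for segment in parsed.path.lower().split("/"):
        code = segment.split("-")[0].split("_")[0]
        if segment and len(segment) <= 5 and code in KNOWN_LANG_CODES:
            return code

    # 2. A language subdomain such as fr.example.org.
    if len(labels) > 2 and labels[0] in KNOWN_LANG_CODES:
        return labels[0]

    # 3. Fall back to the country-code top-level domain.
    if labels and labels[-1] in TLD_HINTS:
        return TLD_HINTS[labels[-1]]
    return None


if __name__ == "__main__":
    for url in (
        "https://example.de/docs/report.pdf",
        "https://fr.example.org/files/a.pdf",
        "https://example.com/pl/raport.pdf",
        "https://example.com/file.pdf",
    ):
        print(url, guess_language_from_url(url))
```

A hint like this is cheap enough to compute for every crawled URL before any download, but it is only a prior: it should be compared against the language detected from the extracted text rather than trusted on its own.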
[^1]: CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
@misc{turski2023ccpdf,
title={CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data},
author={Michał Turski and Tomasz Stanisławek and Karol Kaczmarek and Paweł Dyda and Filip Graliński},
year={2023},
eprint={2304.14953},
archivePrefix={arXiv},
primaryClass={cs.CL}
}