GitHub - SushantDaga/ThePDFCorpus: (WIP) Aiming to be The PDF corpus you will need

Work In Progress (WIP)

This is a work in progress. Any contribution is appreciated.

Goal

This repo aims to provide the largest filterable PDF corpus. In addition to the raw PDF files, we aim to provide text, language information, and spam-filtering information for each file. Current efforts are focused on replicating the CC-PDF[^1] pipeline.
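To make "filterable" concrete, here is a minimal sketch of how a per-document record and a filter over it might look. Nothing here reflects an actual schema from this repo: the `PdfRecord` class, its field names, and the `filter_records` helper are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class PdfRecord:
    """Hypothetical per-document metadata; field names are assumptions."""
    url: str        # where the raw PDF was found
    text: str       # text associated with the PDF
    language: str   # language label, e.g. "en"
    is_spam: bool   # result of some spam filter


def filter_records(
    records: Iterable[PdfRecord],
    language: Optional[str] = None,
    exclude_spam: bool = True,
) -> Iterator[PdfRecord]:
    """Yield records matching the requested language and spam policy."""
    for record in records:
        if language is not None and record.language != language:
            continue
        if exclude_spam and record.is_spam:
            continue
        yield record


# Made-up sample data, for illustration only.
records = [
    PdfRecord("https://example.com/a.pdf", "Annual report", "en", False),
    PdfRecord("https://example.com/b.pdf", "Rapport annuel", "fr", False),
    PdfRecord("https://example.com/c.pdf", "Buy now!!!", "en", True),
]

english_urls = [r.url for r in filter_records(records, language="en")]
```

With the sample data above, `english_urls` contains only the first URL, since the second record is French and the third is flagged as spam.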

TODO:

CC-PDF pipeline

License

The aim is to release this work and dataset under as permissive a license as possible. The details of choosing between MIT, Apache-2.0, Creative Commons, and similar licenses still need to be worked out.

Contribution

We currently use GitHub issues to track work-in-progress features.
We will be dealing with terabytes of data, so contributions of compute credits are also welcome.

[^1]: CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

@misc{turski2023ccpdf,
      title={CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data}, 
      author={Michał Turski and Tomasz Stanisławek and Karol Kaczmarek and Paweł Dyda and Filip Graliński},
      year={2023},
      eprint={2304.14953},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
