Extracting Header and footers #437

TheEyesChico · 2021-05-25T06:48:46Z

I have multiple documents for a project that need to be processed for an NLP task. I want to remove the headers and footers from each file, but I can't use traditional methods like text mining because the headers and footers are dynamic: in each file they may contain high-level information such as the section number, the division of the page, project details, etc.

I tried your .within_bbox method, but the bounding box changes for every file. Is there any object or function in your package that identifies headers and footers accurately? Also, there is always a horizontal line after the header ends and before the footer starts. Is there a function that could find those lines on the page and return the coordinates of the body?

jsvine (Maintainer) · 2021-05-26T03:03:17Z

Unfortunately, I think adding such functionality would be out of scope for this library, since there are no universal or widespread criteria by which to identify headers and footers; they vary widely. But hopefully the more generic pdfplumber methods can help you construct a method that's tailored to your particular set of PDFs.
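
For the layout described in the question, where a horizontal line separates the header and the footer from the body, one way to build such a tailored method is sketched below. It is only an illustration, not a built-in pdfplumber feature: the helper names body_bbox and extract_body_text and the min_width_ratio and tol thresholds are hypothetical. It assumes the separators are drawn as line objects (so they appear in page.lines), that they span at least half the page width, and that the topmost and bottommost such lines mark the end of the header and the start of the footer.

```python
def body_bbox(lines, page_bbox, min_width_ratio=0.5, tol=1.0):
    """Return (x0, top, x1, bottom) of the area between the topmost and
    bottommost wide horizontal lines, or None if fewer than two are found.

    `lines` is a list of dicts shaped like pdfplumber's `page.lines`
    entries (keys: x0, x1, top, bottom). The thresholds are assumptions
    that will likely need tuning for a given set of PDFs.
    """
    x0, _, x1, _ = page_bbox
    page_width = x1 - x0
    # Keep lines that are (nearly) flat and wide enough to be a separator.
    rules = sorted(
        ln["top"]
        for ln in lines
        if abs(ln["bottom"] - ln["top"]) <= tol
        and (ln["x1"] - ln["x0"]) >= min_width_ratio * page_width
    )
    if len(rules) < 2 or rules[-1] <= rules[0]:
        return None
    return (x0, rules[0], x1, rules[-1])


def extract_body_text(path):
    """Extract the text between the separator lines on every page."""
    import pdfplumber  # third-party: pip install pdfplumber

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            bbox = body_bbox(page.lines, page.bbox)
            # Fall back to the whole page when no separators are detected.
            region = page.within_bbox(bbox) if bbox else page
            texts.append(region.extract_text() or "")
    return texts
```

If the separators turn out to be drawn as very thin rectangles rather than lines, they would show up in page.rects instead, and the same filter could be applied to that list.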
