Extracting headers and footers #437
TheEyesChico
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
Unfortunately, I think adding such functionality would be out of scope for this library, since there's no universal/widespread criteria by which to identify headers and footers — they vary widely. But hopefully the more generic |
-
I have multiple documents for a project that need to be processed for an NLP task. I want to remove the headers and footers from each file, but I can't use traditional methods like text mining because the headers and footers are dynamic (for each file, they might contain high-level information like the section number, division of the page, project details, etc.).
I used your `.within_bbox` method, but the bounding box changes for every file. Is there any object or function in your package that identifies headers and footers accurately? Also, there is always a horizontal line after the header ends and before the footer starts. Is there a function that could find those lines on the page and return the coordinates of the body?
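For what it's worth, the horizontal-line idea can be sketched with the generic objects pdfplumber already exposes (`page.lines`, `page.rects`, `page.bbox`, `.within_bbox`). This is only a rough sketch, not a built-in feature: the helper names `body_bbox` and `extract_body_text`, the `min_width_ratio` and `max_thickness` thresholds, and the assumption that the top-most and bottom-most long rules are the header and footer separators are all hypothetical and would need tuning per document set.

```python
def body_bbox(objs, page_bbox, min_width_ratio=0.5, max_thickness=1.0):
    """Return (x0, top, x1, bottom) of the area between the top-most and
    bottom-most long horizontal rules, or None if fewer than two are found.

    `objs` is a list of dicts with "x0", "x1", "top", "bottom" keys, the
    shape pdfplumber uses for entries in page.lines and page.rects.
    The thresholds are illustrative guesses, not library defaults.
    """
    px0, _ptop, px1, _pbottom = page_bbox
    page_width = px1 - px0

    # Keep only thin, wide objects: these are candidate separator rules.
    rules = [
        o for o in objs
        if abs(o["bottom"] - o["top"]) <= max_thickness
        and (o["x1"] - o["x0"]) >= min_width_ratio * page_width
    ]
    if len(rules) < 2:
        return None

    header_rule = min(rules, key=lambda o: o["top"])
    footer_rule = max(rules, key=lambda o: o["top"])

    top, bottom = header_rule["bottom"], footer_rule["top"]
    if top >= bottom:
        return None  # both "rules" sit at the same height
    return (px0, top, px1, bottom)


def extract_body_text(path):
    """Extract text per page, cropped to the body when two rules are found."""
    import pdfplumber  # third-party; imported here so body_bbox stays standalone

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Rules may be drawn as lines or as very thin rectangles.
            bbox = body_bbox(page.lines + page.rects, page.bbox)
            region = page.within_bbox(bbox) if bbox else page
            texts.append(region.extract_text() or "")
    return texts
```

Pages where the two separator rules are not detected fall back to the full page, so it is worth spot-checking a few files before trusting the output.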