Text not extracted properly in 4.30.1, but works with 4.30.0 #336
Comments
I ran the following commands on our current pypdfium2 source tree:
# assuming an editable install (python3 -m pip install -e .)
./run emplace auto:6462
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy range > pdfium_6462_range.txt
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy bounded > pdfium_6462_bounded.txt
./run emplace auto:6899
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy range > pdfium_6899_range.txt
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy bounded > pdfium_6899_bounded.txt
diff pdfium_6462_range.txt pdfium_6899_range.txt
85,90c85,90
< Under 18 years. . . . . . . . . . . . . . . 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
< 18 to 44 years. . . . . . . . . . . . . . . . 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
< 45 to 64 years. . . . . . . . . . . . . . . . 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
< 65 years and over. . . . . . . . . . . . 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
< 85 years and over. . . . . . . . . . . . 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
< 100 years and over. . . . . . . . . . . 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
---
> Under 18 years 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
> 18 to 44 years 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
> 45 to 64 years 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
> 65 years and over 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
> 85 years and over 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
> 100 years and over 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
diff pdfium_6462_bounded.txt pdfium_6899_bounded.txt
84,90c84,90
< Total population . . . . . . 323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.2
< Under 18 years. . . . . . . . . . . . . . . 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
< 18 to 44 years. . . . . . . . . . . . . . . . 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
< 45 to 64 years. . . . . . . . . . . . . . . . 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
< 65 years and over. . . . . . . . . . . . 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
< 85 years and over. . . . . . . . . . . . 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
< 100 years and over. . . . . . . . . . . 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
---
> Total population . . . . . . 323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.
> Under 18 years. . . . . . . . . . . . . . . 73.6 74
> 18 to 44 years. . . . . . . . . . . . . . . . 116.0 119.2 1
> 45 to 64 years. . . . . . . . . . . . . . . . 84.3 83
> 65 years and over. . . . . . . . . . . . 49.2 56.1 73.1
> 85 years and over. . . . . . . . . . . . 6.4 6.7 9.1 1
> 100 years and over. . . . . . . . . . . 0.1 0.1 0.1
That does look like a regression in pdfium concerning the GetTextBounded extraction strategy. FWIW, the python helpers have barely changed between 4.30.0 and 4.30.1 (see diff) – this was mostly setup bug fixes and a pdfium update.
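In the bounded diff above, each 6899 line looks like a cut-off prefix of the corresponding 6462 line. A minimal sketch for checking that mechanically on two extracted texts follows. The function name `truncated_lines` is made up for illustration, and the sample strings are copied from the diff above.

```python
def truncated_lines(good_text, bad_text):
    """Return (line_no, good, bad) for each line pair where the bad line
    is a strict prefix of the good line, i.e. trailing text went missing."""
    hits = []
    pairs = zip(good_text.splitlines(), bad_text.splitlines())
    for line_no, (good, bad) in enumerate(pairs, start=1):
        if good != bad and good.startswith(bad):
            hits.append((line_no, good, bad))
    return hits


# Two line pairs taken from the bounded diff (6462 vs. 6899).
good = "\n".join([
    "Total population . . . . . . 323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.2",
    "65 years and over. . . . . . . . . . . . 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3",
])
bad = "\n".join([
    "Total population . . . . . . 323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.",
    "65 years and over. . . . . . . . . . . . 49.2 56.1 73.1",
])

for line_no, g, b in truncated_lines(good, bad):
    print(f"line {line_no}: {len(g) - len(b)} trailing characters missing")
```

In practice the two arguments would be the contents of pdfium_6462_bounded.txt and pdfium_6899_bounded.txt. The check only pairs lines by position, so it assumes both files have the same number of lines in the same order.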
Might be interesting to bisect this to narrow down the commit range...
OK, I could narrow this down to 6721 (good) - 6844 (bad) [commit log]
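For reference, the bisection idea can be sketched as a plain binary search over pdfium build numbers. This is only an illustration under assumptions: `is_bad` is a hypothetical predicate (in practice it would mean emplacing a build and re-running the bounded extraction), the behaviour is assumed to flip exactly once inside the range, and binaries may not be available for every build number.

```python
def bisect_builds(good, bad, is_bad):
    """Return the first bad build number.

    Assumes `good` is known to work, `bad` is known to be broken, and the
    behaviour changes exactly once somewhere in between.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad


# Stand-in predicate for illustration only. The threshold below is a
# made-up value, NOT a finding from this issue.
ASSUMED_FIRST_BAD = 6800
tested = []

def is_bad(build):
    tested.append(build)
    return build >= ASSUMED_FIRST_BAD

first_bad = bisect_builds(6721, 6844, is_bad)
print(f"first bad build: {first_bad} (after testing {len(tested)} builds)")
```

With a range of 123 builds, at most 7 builds need testing, which is why bisecting is cheaper than walking the range linearly.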
I went ahead and filed https://issues.chromium.org/issues/387277993 for this now.
Checklist
pypdfium2 from PyPI or GitHub/pypdfium2-team.
Description
Text was extracted properly from the table on page 4 of this pdf with pypdfium2==4.30.0, but not with 4.30.1, where only part of the table's characters are extracted.
I'm not sure whether this is an issue with the Python bindings or with base pdfium. I'm happy to close this and open an issue there if it is the latter.
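To try this from Python instead of the CLI, here is a minimal sketch. It assumes the pypdfium2 4.x helper API (`PdfDocument`, `get_textpage()`, `get_text_range()`, `get_text_bounded()`) and the file name used in the commands in this thread. `extract_both` and `compare_strategies` are hypothetical helper names, not part of pypdfium2.

```python
def extract_both(path, page_index):
    """Extract one page with both strategies.

    Returns (range_text, bounded_text). pypdfium2 is imported lazily so the
    rest of this sketch can be used without it being installed.
    """
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(path)
    textpage = pdf[page_index].get_textpage()  # page_index is 0-based
    return textpage.get_text_range(), textpage.get_text_bounded()


def compare_strategies(range_text, bounded_text):
    """Summarise how the two extraction results differ in size."""
    return {
        "range_chars": len(range_text),
        "bounded_chars": len(bounded_text),
        "identical": range_text == bounded_text,
    }


# Example use (page 4 of the PDF is index 3):
#   range_text, bounded_text = extract_both("p25-1144.pdf", 3)
#   print(compare_strategies(range_text, bounded_text))
```

Running this once under 4.30.0 and once under 4.30.1 and comparing the character counts would show whether the bounded result shrinks between the two versions.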
Install Info