Text not extracted properly in 4.30.1, but works with 4.30.0 #336

Open
1 task done
VikParuchuri opened this issue Dec 30, 2024 · 5 comments

@VikParuchuri

Checklist

  • I confirm to be using an official package of pypdfium2 from PyPI or GitHub/pypdfium2-team.

Description

Text from the table on page 4 of this PDF was extracted properly with pypdfium2==4.30.0, but extraction no longer works on 4.30.1: only a subset of the characters is pulled from the table.

I'm not sure whether this is an issue with the Python bindings or with base pdfium. Happy to close this and open an issue there if it is.
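
For anyone who wants to reproduce this from Python rather than the CLI, a minimal sketch follows. It assumes the PDF is saved locally (the path passed to `main` is whatever you named the file) and that page 4 corresponds to zero-based index 3. `extract_page_text`, `summarize` and `main` are illustrative helper names, not part of pypdfium2; the only library calls used are `PdfDocument`, `get_textpage()`, `get_text_range()` and `get_text_bounded()`.

```python
from importlib.metadata import version


def extract_page_text(pdf_path, page_index):
    """Return the page text from both pypdfium2 text extraction helpers."""
    # Imported lazily so that summarize() works without pypdfium2 installed.
    import pypdfium2 as pdfium

    pdf = pdfium.PdfDocument(pdf_path)
    page = pdf[page_index]
    textpage = page.get_textpage()
    try:
        return {
            "range": textpage.get_text_range(),
            "bounded": textpage.get_text_bounded(),
        }
    finally:
        # Close child objects before the document that owns them.
        textpage.close()
        page.close()
        pdf.close()


def summarize(texts):
    """Map each strategy name to (line count, character count)."""
    return {name: (len(t.splitlines()), len(t)) for name, t in texts.items()}


def main(pdf_path, page_index=3):
    # page_index=3 is an assumption: page 4 in one-based numbering.
    print("pypdfium2", version("pypdfium2"))
    for name, (lines, chars) in summarize(extract_page_text(pdf_path, page_index)).items():
        print(f"{name}: {lines} lines, {chars} chars")
```

Running `main(...)` once under each pypdfium2 version and comparing the counts should show whether one strategy returns noticeably less text than the other.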

Install Info

Python 3.11.3 (main, May 23 2023, 07:06:17) [Clang 14.0.3 (clang-1403.0.22.14.1)]

macOS-15.0.1-arm64-arm-64bit

pypdfium2 4.30.0

pdfium 126.0.6462.0 at pypdfium2_raw/libpdfium.dylib

Name: pypdfium2
Version: 4.30.0
Summary: Python bindings to PDFium
Home-page: https://github.com/pypdfium2-team/pypdfium2
Author: pypdfium2-team
Author-email: [email protected]
License: (Apache-2.0 OR BSD-3-Clause) AND LicenseRef-PdfiumThirdParty
Location: lib/python3.11/site-packages
Requires:
@mara004
Member

mara004 commented Dec 30, 2024

I ran the following commands on our current pypdfium2 source tree:

# assuming an editable install (python3 -m pip install -e .)
./run emplace auto:6462
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy range > pdfium_6462_range.txt
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy bounded > pdfium_6462_bounded.txt
./run emplace auto:6899
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy range > pdfium_6899_range.txt
pypdfium2 extract-text "p25-1144.pdf" --pages 4 --strategy bounded > pdfium_6899_bounded.txt
diff pdfium_6462_range.txt pdfium_6899_range.txt
85,90c85,90
< Under 18 years. . . . . . . . . . . . . . . 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
< 18 to 44 years. . . . . . . . . . . . . . . . 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
< 45 to 64 years. . . . . . . . . . . . . . . . 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
< 65 years and over. . . . . . . . . . . . 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
< 85 years and over. . . . . . . . . . . . 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
< 100 years and over. . . . . . . . . . . 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
---
> Under 18 years 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
> 18 to 44 years 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
> 45 to 64 years 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
> 65 years and over 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
> 85 years and over 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
> 100 years and over 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
diff pdfium_6462_bounded.txt pdfium_6899_bounded.txt
84,90c84,90
<    Total population . . . . . . 323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.2
< Under 18 years. . . . . . . . . . . . . . . 73.6 74.0 75.7 77.1 78.2 80.1 6.5 8.8
< 18 to 44 years. . . . . . . . . . . . . . . . 116.0 119.2 125.0 126.4 129.6 132.7 16.7 14.4
< 45 to 64 years. . . . . . . . . . . . . . . . 84.3 83.4 81.3 89.1 95.4 97.0 12.7 15.1
< 65 years and over. . . . . . . . . . . . 49.2 56.1 73.1 80.8 85.7 94.7 45.4 92.3
< 85 years and over. . . . . . . . . . . . 6.4 6.7 9.1 14.4 18.6 19.0 12.6 198.1
< 100 years and over. . . . . . . . . . . 0.1 0.1 0.1 0.2 0.4 0.6 0.5 618.3
---
>    Total population . . . . . .  323.1 332.6 355.1 373.5 388.9 404.5 81.4 25.
> Under 18 years. . . . . . . . . . . . . . .  73.6 74
> 18 to 44 years. . . . . . . . . . . . . . . .  116.0 119.2 1
> 45 to 64 years. . . . . . . . . . . . . . . .  84.3 83
> 65 years and over. . . . . . . . . . . .  49.2 56.1 73.1 
> 85 years and over. . . . . . . . . . . .  6.4 6.7 9.1 1
> 100 years and over. . . . . . . . . . .  0.1 0.1 0.1

That does look like a regression in pdfium concerning the bounded text extraction strategy (FPDFText_GetBoundedText).
pypdfium2 only forwards what pdfium returns, so yes, please file an issue about this upstream. Thanks!

FWIW, the python helpers have barely changed between 4.30.0 and 4.30.1 (see diff) – this was mostly setup bug fixes and a pdfium update.
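
To flag the cut-off lines automatically instead of reading the diff by eye, something like the sketch below could be used. It is only an illustration: `normalize` and `find_truncated_lines` are made-up helper names, and the rule "the new line is a proper prefix of the old line after collapsing whitespace" is an assumption based on how the bounded output above looks.

```python
def normalize(line):
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(line.split())


def find_truncated_lines(old_text, new_text):
    """Return (line_number, old_line, new_line) for lines that got cut off.

    Lines are paired by position, so both texts are assumed to have the
    same line layout. An empty new line paired with a non-empty old line
    is also reported, since all of its text was lost.
    """
    hits = []
    pairs = zip(old_text.splitlines(), new_text.splitlines())
    for number, (old, new) in enumerate(pairs, start=1):
        old_n, new_n = normalize(old), normalize(new)
        if new_n != old_n and old_n.startswith(new_n):
            hits.append((number, old_n, new_n))
    return hits
```

Feeding it the contents of `pdfium_6462_bounded.txt` and `pdfium_6899_bounded.txt` should list the affected table rows. The range outputs would not be reported by this check, because there the leader dots disappear from the middle of the line rather than text being cut off at the end.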


@mara004
Member

mara004 commented Jan 3, 2025

Might be interesting to bisect this to narrow down the commit range...

@mara004
Member

mara004 commented Jan 3, 2025

OK, I could narrow this down to 6721 (good) - 6844 (bad) [commit log]
Unfortunately, there are no pdfium-binaries builds in between, so this is still a rather large span.
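
For reference, the narrowing step is a plain binary search over whichever builds are available. A rough sketch, assuming a sorted list of build numbers and a caller-supplied `is_good` check (both hypothetical here; in practice the check would emplace the build and compare the bounded output against a known-good file):

```python
def bisect_builds(builds, is_good):
    """Return the adjacent (last_good, first_bad) pair in a sorted build list.

    Assumes the first build is good, the last build is bad, and there is
    a single good-to-bad transition somewhere in between.
    """
    lo, hi = 0, len(builds) - 1
    if not is_good(builds[lo]) or is_good(builds[hi]):
        raise ValueError("need a good first build and a bad last build")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(builds[mid]):
            lo = mid  # regression was introduced after this build
        else:
            hi = mid  # regression is already present in this build
    return builds[lo], builds[hi]


# Only the build numbers mentioned in this thread; the lambda is a
# stand-in for actually running the extraction on each build.
known_builds = [6462, 6721, 6844, 6899]
result = bisect_builds(known_builds, lambda build: build <= 6721)
```

With the stand-in check, `result` is `(6721, 6844)`, matching the span above. Getting any closer would need builds in between, which pdfium-binaries does not provide.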

@mara004
Member

mara004 commented Jan 3, 2025

I went ahead and filed https://issues.chromium.org/issues/387277993 for this now.
Feel free to comment on the thread if you have anything to add. Thanks!
