Fatal R error when attempting to extract text from a PDF that includes a particular mathematical symbol #166

Open
tomsutch opened this issue Aug 6, 2024 · 10 comments
tomsutch commented Aug 6, 2024

Description

Fatal R error when attempting to use extract_text on a PDF that includes $\bar{x}$. There's no error message, R just terminates.

Reproducible example

I have constructed a simple example PDF, attached xbar.pdf, that gives the error. (I made this using Microsoft Word, inserting the $x$ and $\bar{x}$ using the equation editor, then saving to PDF.)

As this crashes R I can't use the reprex package for this, as far as I know...

library(tabulapdf)

# First try getting the text up to but not including the x-bar
out1 <- extract_text("xbar.pdf", area = list(c(0,0,200,193)))
# This works

# Get the whole text
out2 <- extract_text("xbar.pdf")
# This gives a fatal error

# Get the text for just the x-bar area
out3 <- extract_text("xbar.pdf", area = list(c(0,193,200,210)))
# This gives a fatal error

Note that if I call the tabula.jar bundled with the R package directly from the command line like this

java -jar C:\Users\<username>\AppData\Local\R\win-library\4.4\tabulapdf\java\tabula.jar xbar.pdf

I get the following output (which is fine for my purposes - I am not particularly concerned about the $\bar{x}$ rendering properly, I just don't want the R session to crash):

Aug 06, 2024 10:03:59 AM org.apache.fontbox.ttf.CmapSubtable processSubtype14
WARNING: Format 14 cmap table is not supported and will be ignored
The mean of x  is denoted ???
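
Until the crash is fixed, a possible workaround is to run the same command from R in a separate process, so that the R session itself never loads the problematic text through rJava. The following is only a sketch: `tabula_cli_args()` and `extract_text_cli()` are hypothetical helpers, not part of tabulapdf, and it assumes `java` is on the PATH and that the jar sits at `java/tabula.jar` inside the installed package, as in the path above.

```r
# Hypothetical helper (not part of tabulapdf): build the arguments for
# `java -jar <tabula.jar> <pdf>`, mirroring the command line above.
tabula_cli_args <- function(pdf,
                            jar = system.file("java", "tabula.jar",
                                              package = "tabulapdf")) {
  # system.file() returns "" when the file or package is not found.
  if (!nzchar(jar)) stop("tabula.jar not found; is tabulapdf installed?")
  c("-jar", shQuote(jar), shQuote(pdf))
}

# Hypothetical wrapper: run the jar in a child process so that a failure
# there cannot take down the calling R session.
extract_text_cli <- function(pdf, ...) {
  # stdout = TRUE captures the output as a character vector, one element
  # per line; stderr = FALSE discards anything Java writes to stderr.
  system2("java", tabula_cli_args(pdf, ...), stdout = TRUE, stderr = FALSE)
}
```

Calling `extract_text_cli("xbar.pdf")` should then return whatever the command line prints, as a character vector.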

Expected result

No fatal error: I would expect any issues with reading/rendering the $\bar{x}$ to result in a fallback like putting in '??' or similar.

Session info

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tabulapdf_1.0.5-3

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        R6_2.5.1          tzdb_0.4.0        magrittr_2.0.3    glue_1.7.0        tibble_3.2.1     
 [7] pkgconfig_2.0.3   png_0.1-8         rJava_1.0-11      lifecycle_1.0.4   readr_2.1.5       cli_3.6.2        
[13] fansi_1.0.6       vctrs_0.6.5       compiler_4.4.0    rstudioapi_0.16.0 tools_4.4.0       hms_1.1.3        
[19] pillar_1.9.0      rlang_1.1.3      
@pachadotdev
Contributor

@tomsutch thx for reporting this
I can fix it next week

@pachadotdev
Contributor

@tomsutch it took me longer than expected but I think I was able to solve it

@pachadotdev
Contributor

hi @tomsutch
just following up
did the last commit solve the issue?

@tomsutch
Author

Hi, thanks for looking into this! I can't see a new commit here - please could you point me to it?

@pachadotdev
Contributor

> Hi, thanks for looking into this! I can't see a new commit here - please could you point me to it?

Sorry, I realize I never pushed the commit.

I have pushed it now, in dev/.

However, it fails on Ubuntu, although it worked on Windows when I set UTF-8.

@pachadotdev
Contributor

hi @jazzido

@tomsutch found this very interesting case that I can't solve "universally"

do you have any clues?

I added my test to reproduce the error here https://github.com/ropensci/tabulapdf/blob/main/dev/test-special_characters.R

and the file here https://github.com/ropensci/tabulapdf/blob/main/inst/examples/xbar.pdf

@pachadotdev
Contributor

@tomsutch @jazzido

I proposed a fix here pachadotdev/tabula-java@7bcb49c

but when I build the jar locally, the resulting jar no longer works with R

this:

load_doc <- function(file, password = NULL, copy = FALSE) {
  localfile <- localize_file(path = file, copy = copy)
  pdfDocument <- new(J("org.apache.pdfbox.pdmodel.PDDocument"))
  fileInputStream <- new(J("java.io.FileInputStream"), localfile)
  if (is.null(password)) {
    message("HERE")
    doc <- pdfDocument$load(input = fileInputStream)
  } else {
    doc <- pdfDocument$load(input = fileInputStream, password = password)
  }
  pdfDocument$close()
  doc
}

fails with:

HERE
Error in pdfDocument$load : 
  no field, method or inner class called 'load' 
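
One possible explanation, offered as an assumption rather than a diagnosis: PDFBox 3.x removed the `PDDocument.load()` methods in favour of `org.apache.pdfbox.Loader.loadPDF()`, so a jar built against PDFBox 3.x would give this kind of rJava error. If that is what happened here, the loader could be sketched roughly as below. `load_doc_pdfbox3` is a hypothetical name, the `localize_file()` step is left out for brevity, and the sketch assumes rJava is already initialised with the rebuilt jar on the class path.

```r
# Sketch only: assumes the rebuilt jar bundles PDFBox 3.x.
load_doc_pdfbox3 <- function(file, password = NULL) {
  # Loader.loadPDF() in PDFBox 3.x takes a java.io.File rather than an
  # InputStream, so wrap the path in a File object first.
  pdf_file <- rJava::.jnew("java/io/File", normalizePath(file))
  loader <- rJava::J("org.apache.pdfbox.Loader")
  if (is.null(password)) {
    loader$loadPDF(pdf_file)
  } else {
    loader$loadPDF(pdf_file, password)
  }
}
```

The returned object is a `PDDocument` reference, so the caller remains responsible for closing it once extraction is done.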

@JiaZhang42

Hi
Is there any update on this one? I encountered another fatal error that aborts the R session when $\hat{\beta}$ is in the pdf
example2.pdf

tabulapdf::extract_text('example2.pdf', pages = 1, area = list(c(333.9459, 655.1610, 352.8368, 686.0823)))

@pachadotdev
Contributor

> Hi
> Is there any update on this one? I encountered another fatal error that aborts the R session when $\hat{\beta}$ is in the pdf
> example2.pdf
>
> tabulapdf::extract_text('example2.pdf', pages = 1, area = list(c(333.9459, 655.1610, 352.8368, 686.0823)))

I proposed a fix to the Java code, but the produced jar is not working for me
I pinged @jazzido about the build process

@pachadotdev
Contributor

I updated to tabula 1.0.6, but because I do not know Java, I cannot fix the issue coming from there

see https://github.com/ropensci/tabulapdf/tree/166

The fix would be for Java to return "The mean of x is denoted ?" instead of "The mean of x is denoted ?̅?"
