Text block extracted from a PDF varies depending on the operating system #840

Open
fpisarello-dawa opened this issue May 23, 2024 · 6 comments

Comments

@fpisarello-dawa

fpisarello-dawa commented May 23, 2024

Hi, I have a problem when extracting the text inside a region of a page with the code below.

When I run this script on Windows 10 (LINQPad):

// Needs the UglyToad.PdfPig, UglyToad.PdfPig.Core, UglyToad.PdfPig.Geometry
// and UglyToad.PdfPig.Util namespaces.
using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    // Pages are 1-indexed; this visits every fourth page.
    for (var i = 1; i <= document.NumberOfPages; i += 4)
    {
        // For PDF coordinates the y-axis runs from the bottom of the page up
        var bottomLeft = new PdfPoint(479, 149);
        var topRight = new PdfPoint(559, 159);
        var square = new PdfRectangle(bottomLeft, topRight);
        var page = document.GetPage(i);

        // Keep every letter whose glyph bounding box touches the region
        var letters = page.Letters.Where(x => square.IntersectsWith(x.GlyphRectangle)).ToList();

        var wordsInRegion = DefaultWordExtractor.Instance.GetWords(letters);
        var textInRegion = string.Join(" ", wordsInRegion.Select(x => x.Text).ToList());

        // Dump() is LINQPad-specific; use Console.WriteLine elsewhere
        textInRegion.Dump();
    }
}

Result:
72515176735833

But the same script on Linux (Ubuntu 20.04, dotnet-script) gives:
N°: 72515176735833

Why do Windows and Linux show different results?

I have attached the PDF file for more detail.
FACTURA_AFIP.pdf

@BobLd
Collaborator

BobLd commented May 23, 2024

@fpisarello-dawa I'm guessing this comes from different default fonts being used on different operating systems. I'd expect the fonts in your documents are not embedded, and PdfPig uses the OS ones to get the bounding boxes. These will differ by OS.
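One way to check this guess is to print, on each machine, the font name and glyph bounding box of every letter in and around the region, then compare the two outputs. The sketch below is only a diagnostic under stated assumptions: it reuses the file name and region from the original post, the `margin` value and the `FormatRect` helper are hypothetical, and it assumes the PdfPig package is referenced by the script.

```csharp
using System;
using System.Globalization;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Geometry;

// Region from the original report.
var region = new PdfRectangle(new PdfPoint(479, 149), new PdfPoint(559, 159));

// Hypothetical horizontal padding, so letters just left or right of the
// region are listed too.
const double margin = 40;
var padded = new PdfRectangle(
    new PdfPoint(region.Left - margin, region.Bottom),
    new PdfPoint(region.Right + margin, region.Top));

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    var page = document.GetPage(1);

    foreach (var letter in page.Letters.Where(l => padded.IntersectsWith(l.GlyphRectangle)))
    {
        // True when the original filter would have kept this letter.
        var inRegion = region.IntersectsWith(letter.GlyphRectangle);

        Console.WriteLine(
            $"'{letter.Value}' font={letter.FontName} " +
            $"box={FormatRect(letter.GlyphRectangle)} inRegion={inRegion}");
    }
}

// Formats a rectangle as [left, bottom, right, top] with a fixed culture,
// so the output can be diffed between machines.
static string FormatRect(PdfRectangle r) =>
    string.Format(
        CultureInfo.InvariantCulture,
        "[{0:F2}, {1:F2}, {2:F2}, {3:F2}]",
        r.Left, r.Bottom, r.Right, r.Top);
```

If the boxes for the same letters differ between Windows and Linux, that would be consistent with OS-dependent font metrics; if they are identical, the cause is probably elsewhere.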

@EliotJones this is not the first time we have had this kind of question. I think we should try to ship default fonts like other PDF readers do, so that PdfPig always uses the same ones.

Doing so would also make it easier to write unit tests that run on different operating systems, since people expect consistent results across them. Let me know what you think.

Also see https://askubuntu.com/questions/599915/what-is-the-closest-font-to-helvetica-available-on-ubuntu

And https://stackoverflow.com/questions/6383511/font-metrics-for-the-base-14-fonts-in-the-pdf-specification#6506818

@EliotJones
Member

@BobLd it's a reasonable suggestion, I'm just not sure what the licensing situation for that looks like. I'd expect you need some kind of payment to redistribute most fonts from foundries.

@BobLd
Collaborator

BobLd commented May 26, 2024

@EliotJones you nailed the main issue with fonts... I'll come back with fonts whose license is compatible with the project. Let's see then what's doable.

@BobLd
Collaborator

BobLd commented May 26, 2024

Looking at the table of metric-compatible fonts at https://wiki.archlinux.org/title/Metric-compatible_fonts, we have open source equivalents.
[Image: table of metric-compatible fonts, not reproduced here]

Liberation fonts are available under the SIL Open Font License, Version 1.1, which, from what I understand, is as open source as you can get for a font. See https://github.com/liberationfonts/liberation-fonts/tree/main/src

Using Liberation fonts, we cover 12 out of the 14 Base fonts (we are missing Symbol and ZapfDingbats) - I'll look into the rest (also, they are already referenced in the SystemFontFinder class)

@BobLd
Collaborator

BobLd commented May 26, 2024

Symbol font: https://github.com/powerline/fonts/tree/master/SymbolNeu (Apache License, Version 2.0)

@fpisarello-dawa
Author

@BobLd thanks for the response. I installed the Helvetica font on the Linux server and got the same behavior. Do I need to install another font on the server to get the same result as on Windows?
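Whichever fonts are installed, the filter itself can be made less sensitive to small bounding-box differences. The sketch below keeps a letter only when the centre of its glyph box lies inside the region, instead of keeping any letter whose box touches it. It is an untested workaround, not something confirmed in this thread to fix the issue; `CentreInside` is a hypothetical helper, and the file name and region are taken from the original post.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Core;
using UglyToad.PdfPig.Util;

// Region from the original report.
var region = new PdfRectangle(new PdfPoint(479, 149), new PdfPoint(559, 159));

using (var document = PdfDocument.Open("FACTURA_AFIP.pdf"))
{
    var page = document.GetPage(1);

    // Stricter than IntersectsWith: a box that only grazes the edge of the
    // region is dropped, because its centre lies outside.
    var letters = page.Letters
        .Where(l => CentreInside(l.GlyphRectangle, region))
        .ToList();

    var words = DefaultWordExtractor.Instance.GetWords(letters);
    Console.WriteLine(string.Join(" ", words.Select(w => w.Text)));
}

// Hypothetical helper: true when the centre of `glyph` lies within `area`
// (borders included). Assumes axis-aligned rectangles.
static bool CentreInside(PdfRectangle glyph, PdfRectangle area)
{
    var x = (glyph.Left + glyph.Right) / 2;
    var y = (glyph.Bottom + glyph.Top) / 2;
    return x >= area.Left && x <= area.Right
        && y >= area.Bottom && y <= area.Top;
}
```

This only helps if the extra "N°:" letters overlap the region marginally. If their boxes sit well inside the region on Linux, the region or the fonts would need to change instead.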
