Question regarding the Font, Widths and FontDescriptor #7
Hello @caspervanpomeren,
For characters that fall outside the bounds of the Widths array, as given by the FirstChar and LastChar entries, the value of the MissingWidth key in the font descriptor is used as the character's width. In this example the space character is outside those bounds, so it gets the '0' value from MissingWidth. That is why there is no visible spacing between the two words "Hello" and "World" when viewing the PDF file. To give the space character an actual width, change the FirstChar value to '32' and add the desired width (likely '278', based on the metrics for the font) as the first entry of the Widths array.
This font is a Type 1 font with no specified encoding, and its Flags value has the 6th bit set (with a value of '32', only the 6th bit is set in this binary bit flag). Its encoding is therefore the StandardEncoding detailed in Annex D, for which the character encoding is given in Table D.2.
A Type 1 font is a single-byte font and therefore cannot encode more than 255 characters [note: do not encode character 0 in a font, it will confuse many processors]. Character encodings are how a writer describes which content stream data corresponds to which font characters. To use characters in Helvetica that are not contained in StandardEncoding (or in one of the other predefined encodings, which you could select by supplying its name as the Encoding value in the font dictionary), you need to make an encoding dictionary as described in section 9.6.5 and add the characters you'd like to encode. You'd also want to adjust the Widths array to match.
Note: these examples use unembedded Standard 14 fonts primarily for the sake of compactness. In most cases, I highly recommend embedding (and subsetting if desired) fonts that are used in PDF files.
Hope this helps!
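
For future readers, the width lookup described above can be sketched in a few lines of JavaScript. This is only an illustration: `glyphWidth` and the two font objects are hypothetical names, and the Widths arrays are shortened to a handful of illustrative entries rather than copied from the example file.

```javascript
// Hypothetical helper: resolve the width of a character code for a simple font.
// Codes inside [firstChar, lastChar] are looked up in widths;
// everything else falls back to missingWidth (which defaults to 0).
function glyphWidth(font, code) {
  const { firstChar, lastChar, widths, missingWidth = 0 } = font;
  if (code >= firstChar && code <= lastChar) {
    return widths[code - firstChar];
  }
  return missingWidth;
}

// FirstChar 33: the space (code 32) is out of range and gets MissingWidth 0.
const before = { firstChar: 33, lastChar: 35, widths: [278, 355, 556], missingWidth: 0 };

// FirstChar lowered to 32, with 278 added as the first Widths entry.
const after = { firstChar: 32, lastChar: 35, widths: [278, 278, 355, 556], missingWidth: 0 };

console.log(glyphWidth(before, 32)); // 0
console.log(glyphWidth(after, 32));  // 278
```

Note that the Widths array must always contain exactly `LastChar - FirstChar + 1` entries, which is why lowering FirstChar by one requires adding one width at the front.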
Hi @pdfa-mattk, thanks for the quick and detailed response. It took a while to digest all the information and respond (I had lots of reading and testing to do and was a bit sick), but here it is.
I understand, MissingWidth is basically the fallback if you didn't specify a width for a character code.
I finally understand the logic behind this! See the following JavaScript code:
And the other example I gave that I didn't understand (for future readers):
And if I want to set a certain flag, for example 2, 6 and 19 I can use this logic to get the correct value:
It took me a while to understand the whole bits concept, but it finally clicked.
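
As an illustration of that bit logic (a minimal sketch, not the snippet from the original post; `flagsFromBits` is a hypothetical name), a Flags value can be built from 1-based bit positions like this:

```javascript
// Hypothetical helper: combine 1-based bit positions into a Flags integer.
// Bit position n contributes 2^(n-1), so bit 1 is the lowest-order bit.
function flagsFromBits(bitPositions) {
  let value = 0;
  for (const bit of bitPositions) {
    value |= 1 << (bit - 1);
  }
  // >>> 0 keeps the result unsigned even if bit 32 is set.
  return value >>> 0;
}

console.log(flagsFromBits([6]));        // 32
console.log(flagsFromBits([2, 6, 19])); // 262178  (2 + 32 + 262144)
```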
So if I understand correctly, the Helvetica font has 315 characters but I can only encode 255 characters when using a Type 1 font? This means I will not be able to encode all the 315 characters of the Helvetica font?
So I tried playing around with this and basically tried four scenarios:
Here are my findings/questions based on these scenarios:
The problem here is that the code returns the following text: "Hel´¡lo 32000-2world". The text I was expecting was: "Hel¡lo 32000-2world". What causes this extra ´? This happens with all kinds of characters. For example, if I try to insert "HelØlo 32000-2world", I get "Helˆ�lo 32000-2world". What am I doing wrong? I think I filled in the Widths array correctly, and these characters are included in the StandardEncoding if I look at Annex D. The only thing I could think of was that I padded the Widths array with widths of zero to cover the holes between the character codes that are available in the encoding, but this was the only logical approach that actually made the Widths array work for me.
This returns: "Hel€lo 32000-2world" as expected. The only thing I noticed was I can't use character names like sterling in the differences array since it already exists as a character code in the StandardEncoding. Is it not allowed to have two character codes for the same thing or am I missing something?
The problem here is, the code returns the following text: "Hel¡lo 32000-2 world". The text I was expecting was: "Hel¡lo 32000-2world". What causes this extra Â? This is basically the same problem as in scenario 1. Another problem, when I try using this text: "Hel€lo 32000-2world", I get this: "Hel€lo 32000w-2orld". Even though WinAnsiEncoding does include the character Euro.
I understand, but I am currently trying to understand the entire spec and make examples that explain everything from the spec. In these examples I also want to benefit from the compactness of unembedded standard fonts. I plan on making a pull request with all my examples, and you can decide if you want them added. When I put my knowledge of the spec to use, I will certainly embed and subset fonts. Do you know if there are any simple examples of embedding/subsetting fonts? This is basically the next step I am going to work on.
To conclude, a more generic question: how do people generally learn to write PDF by hand? Do they just read the spec and go from there? Are there recommended resources, or communities on IRC/Discord etc.? While I have found some tutorials and knowledge online, it isn't a huge amount. It's especially hard because a lot of the information isn't based on the latest spec, and the term "pdf" is used so much in relation to other things that search engines don't really return what I am looking for.
Some extra background information: my end goal is to create a JavaScript library that can automatically write PDF that complies with the latest PDF spec and supports almost every aspect of it. So I am starting by doing everything by hand and understanding how everything works, and then I am going to translate that knowledge into JavaScript code.
Thanks again for the help. I also understand that this is quite some text, so please take your time, and even if you can only answer one thing I would really appreciate it. If I need to clarify anything, please let me know.
Casper
Hi @caspervanpomeren, let me see if I can answer some of your questions here:
As a Type1 font, this is correct. This is a general characteristic of Type1 fonts as defined in PDF. Because they use single byte values in the content stream to reference characters in their encodings, and a single byte can only hold up to 256 different values, this sets the limit on the number of different characters that can be encoded in a given instance of a Type1 font.
For Type1 fonts, strings in content streams are read as individual bytes. I suspect that the text you're putting into the content stream might be encoded in UTF-8 - this would cause the ¡ (inverted exclamation mark, U+00A1) to be expressed as two bytes: 0xC2 0xA1. The "extra" character is likely that extra 0xC2 that I suspect you're putting into the content stream. The same concept looks like it explains the other odd and extra characters you're seeing when trying to put other characters in.
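
The byte-level difference can be checked with a short JavaScript sketch. It is only an illustration under stated assumptions: `toPdfLiteral` and `STANDARD_ENCODING_EXCERPT` are hypothetical names, and the table holds just two codes taken from the StandardEncoding column of Table D.2 (exclamdown at octal 241, Oslash at octal 351). Writing non-ASCII characters as octal escapes keeps each one to a single byte regardless of how the file itself is saved.

```javascript
// UTF-8 turns "¡" (U+00A1) into two bytes, which a Type 1 font
// reads as two separate character codes.
const utf8Bytes = Array.from(new TextEncoder().encode("¡"));
console.log(utf8Bytes.map((b) => b.toString(16))); // [ 'c2', 'a1' ]

// Hypothetical excerpt: character -> single-byte code in StandardEncoding.
const STANDARD_ENCODING_EXCERPT = new Map([
  ["¡", 0xa1], // exclamdown, octal 241
  ["Ø", 0xe9], // Oslash, octal 351
]);

// Hypothetical helper: build a PDF literal string with one byte per character.
// Plain ASCII passes through (a simplification); mapped characters become
// \ddd octal escapes; anything else is rejected instead of being guessed.
function toPdfLiteral(text, table) {
  let out = "";
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (table.has(ch)) {
      out += "\\" + table.get(ch).toString(8).padStart(3, "0");
    } else if (ch === "(" || ch === ")" || ch === "\\") {
      out += "\\" + ch; // delimiters and backslash must be escaped
    } else if (code >= 0x20 && code < 0x7f) {
      out += ch;
    } else {
      throw new RangeError(`No single-byte code known for ${JSON.stringify(ch)}`);
    }
  }
  return `(${out})`;
}

console.log(toPdfLiteral("Hel¡lo", STANDARD_ENCODING_EXCERPT)); // (Hel\241lo)
```

In StandardEncoding, code 0xC2 is the acute accent and 0xA1 is the inverted exclamation mark, which matches the "´¡" seen in the output.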
I don't know of any restriction on doing this. You should be able to use any name in any position in the Differences array, and as long as the font used for display has that character it should work. Could you tell me what error or behavior you were seeing when you tried this?
Most everything in PDF 2.0 is shared with earlier versions of PDF, so your best resources are mostly going to be written for earlier versions of PDF. Many people learn by reading the spec, examining PDFs that are generated from other libraries or programs, and experimenting.
I am (still) working through the PDF spec (ISO 32000-2), and these examples are among the few I could find that actually try to explain and showcase the new PDF 2.0 spec. So once again, thank you very much for these examples.
The topic I am currently looking into is fonts, and I noticed some things in the pdf20examples. For example, if we look at "Simple PDF 2.0 file.pdf", it contains the following code:
My questions:
Which clearly starts at character 32, so why was character 33 used as FirstChar?
Similar to the first question, why is /LastChar 126 and not 251 (or -1)? I tried -1 myself and filled the widths array, but it doesn't work and I found no way to give all the -1 characters a width and as a result all these characters don't show up correctly (for example € doesn't show correctly). How would I give all 315 character codes the correct width?
In the FontDescriptor I see:
/Flags 32
What is the logic behind the value 32? The spec doesn't help me either, because it gives examples like:
/Flags 262178 %Bits 2, 6, and 19
How do you get to these values? I would expect a value like 2619 or something, but the values just don't make any sense. Same thing with the 32 I would have expected 23 as a combination of flag 2 and 3, but apparently it's bit position 32 and that means high-order? Even though this bit position doesn't even exist in the table of Font flags in the PDF spec. I also tried every unsigned 32-bit integer calculator to make sense of these numbers, but nothing worked.
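
For future readers: the Flags value is a single integer in which bit position n (counting from 1 at the low-order end) contributes 2^(n-1), so the positions are not written out as decimal digits. A minimal sketch that decodes a value back into its bit positions follows; `bitsFromFlags` and `FLAG_NAMES` are hypothetical names, and the name table lists the font flags as I understand them from the spec's font flags table.

```javascript
// Names of the defined font flag bit positions (assumed from the spec table).
const FLAG_NAMES = {
  1: "FixedPitch", 2: "Serif", 3: "Symbolic", 4: "Script",
  6: "Nonsymbolic", 7: "Italic", 17: "AllCap", 18: "SmallCap", 19: "ForceBold",
};

// Hypothetical helper: list the 1-based bit positions set in a Flags value.
function bitsFromFlags(flags) {
  const bits = [];
  for (let bit = 1; bit <= 32; bit++) {
    // Divide instead of shifting so values up to 2^32 - 1 stay unsigned.
    if (Math.floor(flags / 2 ** (bit - 1)) % 2 === 1) {
      bits.push(bit);
    }
  }
  return bits;
}

console.log(bitsFromFlags(32));     // [ 6 ]
console.log(bitsFromFlags(262178)); // [ 2, 6, 19 ]
console.log(bitsFromFlags(262178).map((b) => FLAG_NAMES[b]));
// [ 'Serif', 'Nonsymbolic', 'ForceBold' ]
```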
In the FontDescriptor I also see:
/MissingWidth 0
Shouldn't this be used since we don't supply the width for all the character codes? Strangely enough it does get used in the "PDF 2.0 with page level output intent.pdf" example, see:
/MissingWidth 278
And then I am wondering, how do you get that value of 278, is that random or can I actually find that in the font information somewhere?
Hopefully someone can help me with these questions or point me in the right direction.
Thanks in advance