This repository contains supplementary material for my five-part blog series on understanding the representation of text in PDF:
- Text in PDF: Introduction
- Text in PDF: Basic Operators
- Text in PDF: Unicode
- Text in PDF: Fonts and Spacing
- Text in PDF: Non-Latin Alphabets
These posts use sample PDF files. For each one, this repository provides the actual PDF file along with a text file that shows the PDF file's code as it would appear in a text editor. If you download the original PDF and open it in a text editor that handles binary data gracefully (like emacs), you can ignore the txt version. GitHub does not display binary files, so the text files are copies of the PDF files with binary streams redacted and other characters encoded in UTF-8 for display. The posts link directly to lines in the txt file; for every fragment referenced in the blog, the line numbers in the txt file match those in the original PDF.
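To make the idea of a "text copy with binary streams redacted" concrete, here is a minimal Python sketch of one way such a file could be produced. It is not the tool used to create the txt files in this repository. The function names, the placeholder text, and the rule for deciding that a stream is binary are all assumptions made for illustration, and a conversion like this does not by itself guarantee that line numbers stay aligned with the original PDF.

```python
import re

# Matches a stream body: the "stream" keyword and its end-of-line, the data,
# then "endstream". The lookbehind keeps "endstream" from being read as "stream".
STREAM_RE = re.compile(rb"((?<!end)stream\r?\n)(.*?)(\r?\n?endstream)", re.DOTALL)

# Hypothetical placeholder text; any marker would do.
PLACEHOLDER = b"[binary stream redacted]"


def is_binary(data: bytes) -> bool:
    """Assumed heuristic: any byte outside printable ASCII, tab, LF, or CR."""
    return any(b > 126 or (b < 32 and b not in (9, 10, 13)) for b in data)


def redact_binary_streams(pdf_bytes: bytes) -> str:
    """Return the PDF's code as text, with binary stream bodies replaced."""

    def replace(match: re.Match) -> bytes:
        head, body, tail = match.groups()
        if is_binary(body):
            body = PLACEHOLDER
        return head + body + tail

    redacted = STREAM_RE.sub(replace, pdf_bytes)
    # Latin-1 maps each byte to exactly one code point, so decoding cannot fail.
    return redacted.decode("latin-1")


def convert(src_path: str, dst_path: str) -> None:
    """Read a PDF and write the redacted text version encoded as UTF-8."""
    with open(src_path, "rb") as src:
        text = redact_binary_streams(src.read())
    # newline="" writes line endings exactly as they appear in the PDF.
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Calling `convert("basic.pdf", "out.txt")` would write a viewable text version of the file (`out.txt` is an arbitrary name). Uncompressed content streams, which are plain ASCII, pass through unchanged, so the text operators discussed in the posts remain readable.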
For Part 2:
- basic.pdf -- A PDF file with simple text using built-in fonts
- basic.pdf.txt -- A viewable text version
For Parts 3 through 5:
- advanced.pdf -- A PDF file with non-Latin characters, emoji, and other features
- advanced.pdf.txt -- A viewable text version
If you want to follow along with the blog posts or see PDF fragments in context, you can either download the PDF and open it in a text/binary editor, or follow along with the text file right from GitHub. The blog posts are self-contained and include the referenced fragments as embedded GitHub gists.
Related resources:
- Blog post: The Structure of a PDF File
- My earlier post: Examining a PDF File with qpdf
- From the PDF Association: PDF Operators Cheat Sheet
- The Wikipedia article on Unicode