jberkenbilt/pdf-text-blog

PDF-Text Blog: Supplementary Material

This repository contains supplementary material for my five-part blog series on understanding the representation of text in PDF:

  1. Text in PDF: Introduction
  2. Text in PDF: Basic Operators
  3. Text in PDF: Unicode
  4. Text in PDF: Fonts and Spacing
  5. Text in PDF: Non-Latin Alphabets

What's Here

These posts use sample PDF files. For each one, you can find the actual PDF file here along with a text file that, when opened in a text editor, looks like the PDF file's code. GitHub refuses to display a binary file, so the text files are copies of the PDF files with binary streams redacted and other characters encoded in UTF-8 for display. If you download the original PDF and open it in a text editor that gracefully handles binary data (like emacs), you can ignore the txt version. The posts link directly to lines in the txt file, and in all cases within the blog, the line numbers match up.
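As a rough illustration of the kind of transformation described above, here is a minimal Python sketch. It is not necessarily how the text files in this repository were produced. It assumes that a stream counts as "binary" if it contains any byte outside printable ASCII, tab, or newline characters, and that Latin-1 is an acceptable way to map the remaining non-ASCII bytes to characters. The function name `pdf_to_display_text` and the placeholder text are hypothetical. The sketch makes no attempt to preserve line numbers.

```python
import re

# A stream body sits between the "stream" keyword (followed by an end-of-line
# marker) and the "endstream" keyword. The lookbehind keeps the pattern from
# starting a match inside "endstream".
STREAM_RE = re.compile(rb"(?<!end)(stream\r?\n)(.*?)(\r?\n?endstream)", re.DOTALL)

# Assumption: bytes treated as displayable text are printable ASCII plus
# tab, line feed, and carriage return.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}

# Hypothetical marker text; any recognizable placeholder would do.
PLACEHOLDER = b"[binary stream data redacted]"


def pdf_to_display_text(data):
    """Return a displayable copy of raw PDF bytes (illustrative only)."""

    def redact(match):
        body = match.group(2)
        if all(byte in TEXT_BYTES for byte in body):
            return match.group(0)  # plain-text stream: keep it unchanged
        # Binary stream: keep the keywords, replace the body with the marker.
        return match.group(1) + PLACEHOLDER + b"\nendstream"

    redacted = STREAM_RE.sub(redact, data)
    # Latin-1 maps every byte to exactly one character, so decoding never
    # fails. Writing the resulting str with encoding="utf-8" then produces
    # a UTF-8 text file.
    return redacted.decode("latin-1")
```

To try it, read a PDF with `pathlib.Path(...).read_bytes()`, pass the bytes to `pdf_to_display_text`, and write the result with `write_text(..., encoding="utf-8")`.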

For Part 2:

For Parts 3 through 5:

If you want to follow along with the blog posts or see PDF fragments in context, you can either download the PDF and open it in a text/binary editor, or follow along with the text file right from GitHub. The blog posts are self-contained and include the referenced fragments embedded as GitHub gists.

External References
