I'm talking about PDFs at Data Wranglers DC.


Much of the world’s data are stored in Portable Document Format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. When I read a PDF file, I ask these questions.

  • Do we need to read the file contents at all?
  • Do we only need to extract the text and/or images?
  • Do we care about the layout of the file?

I take different approaches to parsing depending on the answers to these questions. In the talk, I’ll show a few approaches to parsing and analyzing PDF files, and I'll discuss which ones make sense in which situations.
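
To make the three questions above concrete, here is a minimal sketch of how the answers might map to a tool. It is only an illustration, not necessarily what the talk uses: the function name `choose_command` is hypothetical, and it assumes the Poppler command-line tools (`pdfinfo` and `pdftotext`) are installed. It only builds the command; it does not run it.

```python
def choose_command(pdf_path, need_contents, need_layout):
    """Return an argument list for a Poppler tool (assumed to be installed).

    Hypothetical helper: maps answers to the questions above onto one
    possible command. Other tools would work just as well.
    """
    if not need_contents:
        # Metadata (page count, producer, and so on) may be enough.
        return ["pdfinfo", pdf_path]
    if need_layout:
        # -layout asks pdftotext to keep the physical layout of the page;
        # the trailing "-" sends the text to standard output.
        return ["pdftotext", "-layout", pdf_path, "-"]
    # Plain text extraction in reading order, written to standard output.
    return ["pdftotext", pdf_path, "-"]
```

The returned list can be handed to `subprocess.run(..., capture_output=True, text=True)` to get the output as a string. Extracting images would need a different tool and is left out of this sketch.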


We'll use this page as the permanent URL for the talk materials. I'm putting everything in the following git repository.

git clone git@github.com:tlevine/data-wranglers-dc-pdfs

You can also view the files in a web browser here.

Take notes on my wiki, and other people will be able to see them!