I’m talking about PDFs at Data Wranglers DC.

Abstract

Much of the world’s data are stored in Portable Document Format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. Before I parse a PDF file, I ask these questions:

  • Do we need to read the file contents at all?
  • Do we only need to extract the text and/or images?
  • Do we care about the layout of the file?

I take different approaches to parsing depending on the answers to these questions. In the talk, I’ll show a few different approaches to parsing and analyzing PDF files, and I’ll discuss which approaches make sense in which situations.
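As a rough illustration of how those three questions can steer the choice, here is a minimal Python sketch. The names `Approach`, `choose_approach`, and `file_facts` are hypothetical, invented for this example, and the three categories are my own simplification of the questions above rather than the exact breakdown used in the talk.

```python
import os
from enum import Enum


class Approach(Enum):
    """Hypothetical labels for three broad ways of handling a PDF."""
    METADATA_ONLY = "look at file-level facts only; never open the contents"
    TEXT_AND_IMAGES = "extract raw text and/or images, ignoring position"
    LAYOUT_AWARE = "parse with positions preserved (tables, columns, forms)"


def choose_approach(need_contents: bool, need_layout: bool) -> Approach:
    """Map answers to the questions above onto one of the approaches."""
    if not need_contents:
        # Question 1: we may not need to read the file contents at all.
        return Approach.METADATA_ONLY
    if need_layout:
        # Question 3: the layout matters, so plain extraction is not enough.
        return Approach.LAYOUT_AWARE
    # Question 2: text and/or images alone will do.
    return Approach.TEXT_AND_IMAGES


def file_facts(path: str) -> dict:
    """Collect facts that are available without parsing the PDF itself."""
    return {"name": os.path.basename(path), "bytes": os.path.getsize(path)}


if __name__ == "__main__":
    print(choose_approach(need_contents=True, need_layout=False).value)
```

In practice each branch would hand off to a different tool. For example, the `pdftotext` program from Poppler extracts plain text by default and tries to preserve the physical layout when given `-layout`; whether that is the right tool for a given file is a judgment call, not something this sketch decides.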

Materials

We’ll use this page as the permanent URL for the talk materials. I’m putting everything in the following git repository.

git clone git@github.com:tlevine/data-wranglers-dc-pdfs

You can also view the files in a web browser here.

Take notes on my wiki, and other people will be able to see them!