I’m talking about PDFs at Data Wranglers DC.
Much of the world’s data are stored in portable document format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. When I’m reading PDF files, I ask these questions:
- Do we need to read the file contents at all?
- Do we only need to extract the text and/or images?
- Do we care about the layout of the file?
I take different approaches to parsing depending on the answers to these questions. In the talk, I’ll show a few different approaches to parsing and analyzing PDF files, and I’ll discuss which approaches make sense in which situations.
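To make the first question concrete: when the answer is "no", a few file-level facts may be all you need. Here is a minimal sketch of that idea in Python, using only the standard library. The function name `file_level_summary` and the choice of fields are my assumptions for illustration; they are not taken from the talk materials.

```python
import os
import re


def file_level_summary(path):
    """Describe a PDF using only file-level facts, without parsing its contents.

    Hypothetical helper for illustration: it looks at the file size and the
    first few bytes of the file, nothing else.
    """
    size = os.path.getsize(path)

    # A PDF normally begins with a header such as "%PDF-1.4".
    with open(path, "rb") as f:
        header = f.read(16)
    match = re.match(rb"%PDF-(\d+\.\d+)", header)
    version = match.group(1).decode("ascii") if match else None

    return {"path": path, "bytes": size, "pdf_version": version}
```

Running this over a directory of files gives a small table of sizes and versions that can already be sorted or plotted. If the answer to the second or third question is "yes", a text-extraction or layout-aware tool would be needed instead, and this sketch does not cover those cases.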
This page will serve as the permanent URL for the talk materials. I’m putting everything in the following git repository.
git clone firstname.lastname@example.org:tlevine/data-wranglers-dc-pdfs
You can also view the files in a web browser here.
Take notes on my wiki, and other people will be able to see them!