I'm talking about PDFs at Data Wranglers DC.
Much of the world’s data are stored in portable document format (PDF) files. This is not my preferred storage or presentation format, so I often convert such files into databases, graphs, or spreadsheets. When I'm reading PDF files, I ask these questions:
- Do we need to read the file contents at all?
- Do we only need to extract the text and/or images?
- Do we care about the layout of the file?
I take different approaches to parsing depending on the answers to these questions. In the talk, I’ll show a few different approaches to parsing and analyzing PDF files, and I'll discuss which approaches make sense in which situations.
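One way to picture how the answers steer the approach is a small Python sketch. The function name, the idea of falling back to file metadata, and the choice of poppler's `pdftotext` and `pdftohtml` commands are illustrative assumptions on my part here, not a definitive list of what the talk covers. The function only builds a command; it doesn't run anything.

```python
from typing import List, Optional


def choose_pdf_command(
    path: str, need_contents: bool, need_layout: bool
) -> Optional[List[str]]:
    """Pick a command for a PDF based on the questions above.

    Returns an argument list suitable for subprocess.run, or None when
    we don't need to read the file contents at all.
    """
    if not need_contents:
        # Assumption: file metadata (name, size, modification time)
        # is enough, so there is nothing to parse.
        return None
    if need_layout:
        # Assumption: poppler's pdftohtml in XML mode, which keeps
        # the position of each piece of text on the page.
        return ["pdftohtml", "-xml", path]
    # Assumption: plain text is enough, so pdftotext writing to
    # standard output ("-") will do.
    return ["pdftotext", path, "-"]
```

For example, `choose_pdf_command("report.pdf", need_contents=True, need_layout=False)` gives a `pdftotext` command, where `report.pdf` is a hypothetical file name.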
This page will serve as the permanent URL for the talk materials. I'm putting everything in the following git repository.
git clone email@example.com:tlevine/data-wranglers-dc-pdfs
You can also view the files in a web browser here.
Take notes on my wiki, and other people will be able to see them!