Websites to data tables, in practice

Once you've read the other article on converting websites to data tables, you might want some more specifics. Here are some details of how I apply the theory from that article when I'm using Python.

Download functions

Inspect the website's network requests, and figure out which HTTP requests you need to reproduce. There are tons of programs you could use for this; I use Firefox.

Inspecting HTTP requests in the New York City crime map

Write downloader functions that return Response objects from the requests package, and decorate them with vlermv.

import requests
from vlermv import cache

@cache('~/.acpr')
def search(pg:int):
    '''
    pg is a natural number.
    '''
    url = 'https://www.regafi.fr/spip.php'
    params = {
        'page': 'results',
        'type': 'advanced',
        'id_secteur': '3',
        'lang': 'en',
        'denomination': '',
        'siren': '',
        'cib': '',
        'bic': '',
        'nom': '',
        'siren_agent': '',
        'num': '',
        'cat': '0',
        'retrait': '0',
        'pg': pg,
    }
    return requests.get(url, params = params)

(This is from download.py in acpr, except that I switched the obsolete picklecache for vlermv.)

A more complicated request

Often you have to change more than just one part of the query string; here's an example of that.

A more complicated request for http://vocopvarenden.nationaalarchief.nl/Search.aspx

The "Copy as cURL" feature can be helpful at times like this; here's the output of "Copy as cURL".

curl 'http://vocopvarenden.nationaalarchief.nl/exportCSV.aspx' -H 'Host: vocopvarenden.nationaalarchief.nl' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer: http://vocopvarenden.nationaalarchief.nl/list.aspx' -H 'Cookie: ASP.NET_SessionId=jchmjc450i5hfgavxigjymzi' -H 'Connection: keep-alive'

And here are the correspondingly complicated download functions for the request above. They're slightly complicated because they pass along some less obvious information from previous requests.

import requests
from vlermv import cache

from . import (
    list as _list, exportcsv as _exportcsv,
    parse,
)

DIR = '~/.maluku/vocopvarenden'

@cache(DIR, 'search')
def search(_):
    url = 'http://vocopvarenden.nationaalarchief.nl/Search.aspx'
    return requests.get(url)

@cache(DIR, 'list')
def list(year, search_response = None):
    if not search_response:
        raise TypeError('You must pass a search_response.')
    viewstate, viewstategenerator = parse.viewstate(search_response)
    return requests.post(_list.url, headers = _list.headers,
               cookies = search_response.cookies,
               data = _list.data(viewstate, viewstategenerator, year))

@cache(DIR, 'exportcsv')
def exportcsv(year, search_response = None):
    if not search_response:
        raise TypeError('You must pass a search_response.')
    return requests.get(_exportcsv.url, headers = _exportcsv.headers,
                        cookies = search_response.cookies)

This is download.py from vocopvarenden.

Parse functions

Write parser functions that accept Response objects as input and emit ordinary data structures. You'll probably run this

import lxml.html

html = lxml.html.fromstring(response.text)

and something like one of these.

html.xpath('id("blah")/div[@class="blah"]/a[@href="blah"]')
html.cssselect('#blah > div.blah > a[href="blah"]')

Here's an example from the parse.py file in acpr.

import re
from collections import OrderedDict

from lxml.html import fromstring

def search(response):
    html = fromstring(response.text)
    table = html.xpath('//table[@summary="Search results"]')[0]
    keys = [re.sub(r'[^a-z]+', '.', str(th.text_content().strip()), flags = re.IGNORECASE)
            for th in table.xpath('tr[position()=1]/th')]
    for tr in table.xpath('tr[td]'):
        values = (td.text_content() for td in tr.xpath('td'))
        yield OrderedDict(zip(keys, values))

Make only one HTTP request inside each of the innermost download functions, and write a parser function that corresponds to each download function. If a function would otherwise make more than one HTTP request, split it into separate functions.
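
For example, here's a minimal sketch of that rule, with hypothetical URLs and function names; each of these would then get its own parser function, say parse.listing and parse.detail.

import requests
from vlermv import cache

@cache('~/.example', 'listing')
def listing(pg):
    # One function, one HTTP request: the listing page.
    return requests.get('http://example.com/listing', params = {'pg': pg})

@cache('~/.example', 'detail')
def detail(item_id):
    # The second request gets its own function instead of being
    # tacked onto listing().
    return requests.get('http://example.com/detail', params = {'id': item_id})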

Generation

Wrap everything up as a generator. I didn't do that for ACPR, so here's a different example, from main.py in wbcontractawards; "contracts" is the generator.

import sys
import csv
import itertools

import wbcontractawards.download as d
import wbcontractawards.parse as p

def contracts():
    for os in itertools.count(0, 10):
        response = d.search(os)
        contract_urls = p.search(response)
        if [] == contract_urls:
            break
        for url in contract_urls:
            response = d.get(url)
            try:
                yield p.contract(response)
            except:
                sys.stderr.write('Error at %s\n' % url)
                raise

Storing data

Iterate through the pages and send the results to whatever data store you like. Or not.

It's fine to leave everything as pickles!

People often want to save things in a fancy database or CSV file or something, but you might not need to. Maybe you just want to download the web pages before they go away and then figure out what to do with them later. Or maybe you know what you're going to do with them later and you know that you are going to do more work on them in Python.

If you've followed the directions above, you'll have a generator that emits your data as ordinary Python objects, and this can be way more convenient than having your data in a fancy database.
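
Because the download functions above are wrapped in vlermv's cache, calling them again later just loads the saved pickles instead of re-downloading. So "figuring out what to do with them later" can be as simple as this sketch, reusing the acpr modules from earlier on an arbitrary handful of pages.

import acpr.download as d
import acpr.parse as p

# The first run downloaded and pickled each page into ~/.acpr;
# running this again reads the pickles without touching the network.
for pg in range(1, 5):
    for row in p.search(d.search(pg)):
        print(row)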

CSV

I often write CSV to STDOUT, because CSV is the platonic form of data.

As an example, here's the rest of the above main.py from wbcontractawards.

def cli():
    writer = csv.writer(sys.stdout)
    writer.writerow(['contract','bidder','status','amount','currency'])
    for contract in contracts():
        if contract != None:
            for bid in contract['bids']:
                row = [
                    contract['url'],
                    bid.get('bidder.name'),
                    bid.get('status'),
                    bid.get('opening.price.amount'),
                    bid.get('opening.price.currency'),
                ]
                writer.writerow(row)

JSON lines

In acpr's main.py I write JSON lines to STDOUT.

import itertools
import json
import sys

import acpr.download as d
import acpr.parse as p

def main():
    for page in itertools.count(1,1):
        response = d.search(page)
        if page > 1 and p.is_page_one(response):
            break
        else:
            for result in p.search(response):
                result['url'] = response.url
                sys.stdout.write(json.dumps(result) + '\n')

Relational database

A relational database is sometimes nice too. I suggest that you use dataset to save things to relational databases. (In case I've told you about dumptruck, you should still use dataset; it has everything I wanted in dumptruck and more.) Here's part of main.py from scarsdale-property-inquiry, wherein I save some data to a relational database.

import os
import functools

from jumble import jumble
import dataset
from pickle_warehouse import Warehouse # obsolete; use vlermv instead

import scarsdale_property_inquiry.download as dl
import scarsdale_property_inquiry.read as read
import scarsdale_property_inquiry.schema as schema

# Lots of stuff omitted...

def main():
    root_dir, html_dir, warehouse = get_fs()
    url = getparser(root_dir).parse_args().database

    db = dataset.connect(url)
    db.query(schema.properties)

    session, street_ids = dl.home(warehouse)
    street = functools.partial(dl.street, warehouse, session)
    for future in jumble(street, street_ids):
        session, house_ids = future.result()
        house = functools.partial(dl.house, warehouse, session)
        for future in jumble(lambda house_id: (house_id, house(house_id)), house_ids):
            house_id, text = future.result()
            with open(os.path.join(html_dir, house_id + '.html'), 'w') as fp:
                fp.write(text)
            bumpy_row = read.info(text)
            if bumpy_row != None:
                excemptions = bumpy_row.get('assessment_information', {}).get('excemptions', [])
                if excemptions != []:
                    for excemption in excemptions:
                        excemption['property_number'] = bumpy_row['property_information']['Property Number']
                        db['excemptions'].upsert(excemption, ['property_number'])
                flat_row = read.flatten(bumpy_row)
                if flat_row != None and 'property_number' in flat_row:
                    db['properties'].upsert(flat_row, ['property_number'])

Sessions

Don't use requests.Session or any other library's magic session management; make all of the parameters to your HTTP requests explicit. For example, here are some relevant parts of delaware.

def headers(user_agent, cookie, referer):
    if user_agent == None:
        raise ValueError('User agent may not be None.')
    h = {
        'User-Agent': user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded',
    }
    if cookie != None:
        h['Cookie'] = cookie
        if referer != None:
            h['Referer'] = referer
        else:
            raise ValueError('A referer must be provided if a cookie has been provided.')
    elif referer != None:
        raise ValueError('A cookie must be provided if a referer has been provided.')
    return h

@cache(os.path.join(os.path.expanduser('~'), '.delaware', week, 'result'))
def _result(firm_file_number, user_agent = None, cookie = None):
    h = headers(user_agent, cookie, referers['result'])
    return requests.post(urls['result'], headers = h,
        data = data['result'] % firm_file_number, allow_redirects = False)

A cookie is simply a header, and you can pass it around to all of your functions rather than storing the session more globally.
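
Here's a minimal sketch of that style, with hypothetical URLs: the cookies and referer from the first response travel to the second request as ordinary function arguments.

import requests

def home():
    # The response whose Set-Cookie header establishes the "session".
    return requests.get('http://example.com/')

def result(cookies, referer):
    # The cookies and referer arrive as explicit arguments,
    # not as hidden state inside a Session object.
    return requests.get('http://example.com/result',
                        cookies = cookies,
                        headers = {'Referer': referer})

front = home()
later = result(front.cookies, front.url)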

Functional wonderfulness

If you follow the above directions, your parser functions will have no side-effects. Your downloader functions will be tiny, and their only side-effects will be HTTP requests. You'll wrap all of these up in a generator that doesn't introduce new side-effects, and then you'll call this generator from another thing that has side-effects but isn't that hard to test.

This arrangement makes much more sense to me and allows me not to think very hard.

Tips for testing

My tests are probably way simpler than most tests because I pretty much never use the class keyword in Python. My programs are broken into lots of mostly-pure functions, so my tests mostly just call the functions and check the outputs.

Fixtures

When something breaks, find the appropriate file in the vlermv directory, copy it to a fixtures directory, and load it in your tests like this.

import os
import pickle

with open(os.path.join('package_name', 'test', 'fixtures', 'web-page'), 'rb') as fp:
    error, response = pickle.load(fp)

As convenient as that is, you usually won't even need or want the full response. You can mock responses with collections.namedtuple objects.

import collections
MockResponse = collections.namedtuple('Response', ['ok','text'])
response = MockResponse(ok = True, text = '<html>This is a web page.</html>')
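
If a parser only ever touches response.text, the mock is a drop-in replacement. Continuing with the MockResponse above, here's a tiny sketch with a hypothetical title parser.

import lxml.html

def title(response):
    # A parser that only needs the text of the response.
    return lxml.html.fromstring(response.text).xpath('string(//title)')

print(title(MockResponse(ok = True, text = '<html><title>Hi</title></html>')))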

Testing downloader functions

I usually make the downloader function really really tiny and hope that it works. I move everything else to pure functions that I do test. See, for example, params.py from scarsdale-property-inquiry.

def url():
    return 'http://www.scarsdale.com/Home/Departments/InformationTechnology/PropertyInquiry.aspx'

def headers(user_agent):
    return {
        'User-Agent': user_agent,
        'Referer': url(),
    }

def data(publickeytoken, viewstate, eventvalidation, eventtarget, value):
    return {
        'StylesheetManager_TSSM': '',
        'ScriptManager_TSM': ';;System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=' + publickeytoken,
      # '__EVENTTARGET': 'dnn$ctr1381$ViewPIRPS$lstboxAddresses',
        '__EVENTTARGET': eventtarget,
        '__EVENTARGUMENT': '',
        '__LASTFOCUS': '',
        '__VIEWSTATE': viewstate,
        '__EVENTVALIDATION': eventvalidation,
        'dnn$SEARCH1$Search': 'SiteRadioButton',
        'dnn$SEARCH1$txtSearch': '',
       #'dnn$ctr1381$ViewPIRPS$lst...': '19.03.287',
       #'dnn$ctr1381$ViewPIRPS$lstboxAddresses': '19.03.287',
        eventtarget: value,
        'dnn$dnnSEARCH$Search': 'SiteRadioButton',
        'dnn$dnnSEARCH$txtSearch': '',
        'ScrollTop': 228, # how far the page is scrolled
        '__dnnVariable': '{"__scdoff":"1"}',
    }

These are all pure functions that I can test without knowing anything about HTTP. The downloader functions are impure, and I don't even try testing them, but they pretty much just call these functions, so I'm fine with that.
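
A test for one of them can be as plain as this sketch; I'm assuming the module imports as scarsdale_property_inquiry.params.

import nose.tools as n

import scarsdale_property_inquiry.params as params

def test_data_uses_eventtarget():
    # The pure function should echo the event target into both the
    # '__EVENTTARGET' field and the field named after the target itself.
    d = params.data('token', 'viewstate', 'validation',
                    'dnn$ctr1381$ViewPIRPS$lstboxAddresses', '19.03.287')
    n.assert_equal(d['__EVENTTARGET'], 'dnn$ctr1381$ViewPIRPS$lstboxAddresses')
    n.assert_equal(d['dnn$ctr1381$ViewPIRPS$lstboxAddresses'], '19.03.287')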

I wrote scarsdale-property-inquiry before I wrote vlermv, so its downloader functions include the cache and are thus a bit big. But you can see them here.

Generating tests

I tend to run my tests with nose. One nice thing about nose is that it's easy to create many tests with the same structure but different data. For example, see craigsgenerator/test/test_parse/test_next_search_url.py from craigsgenerator.

import os

import nose.tools as n
import lxml.html

import craigsgenerator.parse as parse

def check_next_search_url(fn, domain, url):
    with open(os.path.join('craigsgenerator','test','fixtures',fn)) as fp:
        html = lxml.html.fromstring(fp.read())
    html.make_links_absolute(url)
    observed = parse.next_search_url('https', domain, 'sub', html)
    n.assert_equal(observed, url)

def test_next_search_url():
    testcases = [
        ('austin-sub.html', 'austin.craigslist.org', 'https://austin.craigslist.org/sub/index100.html'),
        ('chicago-sub.html', 'chicago.craigslist.org', 'https://chicago.craigslist.org/sub/index100.html'),
    ]
    for fn, domain, url in testcases:
        yield check_next_search_url, fn, domain, url

Testing generators

You can test the generator that wraps everything up by substituting simple functions for your real functions. Let's look at craigsgenerator/test/test_craigsgenerator.py from craigsgenerator.

import nose.tools as n

from craigsgenerator.craigsgenerator import craigsgenerator

def test_not_superthreaded():
    cg = craigsgenerator(sites = ['foo'], sections = ['bar'], listings = lambda *args, **kwargs: ['baz'], superthreaded = False)
    n.assert_equal(next(cg), 'baz')
    with n.assert_raises(StopIteration):
        next(cg)

def test_superthreaded():
    cg = craigsgenerator(sites = ['foo'], sections = ['bar'], listings = lambda *args, **kwargs: ['baz'], superthreaded = True)
    n.assert_equal(next(cg), 'baz')
    with n.assert_raises(StopIteration):
        next(cg)

Instead of passing the real listings function, I pass this silly function that always returns ['baz'], and I make sure that the output is what I expect.

Testing that data are sent to the data store

Pass the destination of your data as an argument to the function that is connecting everything. For example, here's a function that writes to stdout and stderr.

def main(stdout, stderr, dataset):
    client = pymongo.MongoClient()
    db = client[today.isoformat()]

    try:
        tomongo(db, dataset)
    except Exception as e:
        stderr.write('Error at %s: %s\n' % (dataset['url'], e))
    else:
        stdout.write(do_stuff_with(dataset))

Rather than using sys.stdout and sys.stderr directly, I passed them as arguments so that I can test the output. That said, while I write the function this way, I usually only write the test if something is broken.
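
The idea in a sketch, with a hypothetical write_rows function: because the stream is an argument, the test can hand it an io.StringIO and inspect what was written.

import io

import nose.tools as n

def write_rows(stdout, rows):
    # A writer that takes its output stream as an argument.
    for row in rows:
        stdout.write('%s\n' % row)

def test_write_rows():
    out = io.StringIO()
    write_rows(out, ['a', 'b'])
    n.assert_equal(out.getvalue(), 'a\nb\n')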

Similarly, you can test that Mongo is getting called correctly. tomongo calls a function called save, and it inserts some information about the dataset into the database. The test for save looks kind of like this.

from collections import namedtuple

import nose.tools as n

def test_save():
    MockCollection = namedtuple('MockCollection', ['insert'])
    documents = []

    fp = open('blah blah')
    url = 'blah blah blah'
    db = {url: MockCollection(lambda document: documents.append(document))}

    save(db, fp, url)

    expected_documents = [
        # ...
    ]
    n.assert_list_equal(documents, expected_documents)

Command-line interface

Write some sort of command-line interface. I tend to write an argparse.ArgumentParser and then add the script's file to the scripts argument in setup.py. Here's the parser I defined in main.py of wbcontractawards.

parser = argparse.ArgumentParser(description = 'Get data about contracts for projects funded by the World Bank.')
parser.add_argument(dest = 'unit', metavar = '[unit]', choices = ['bids', 'contracts'])

And here's the function that wraps up everything as a command-line interface.

def cli():
    emit(sys.stdout, parser.parse_args().unit)

bin/wbcontractawards just calls that function.

#!/usr/bin/env python3
import wbcontractawards.main
wbcontractawards.main.cli()

And here's the corresponding setup.py.

from distutils.core import setup

setup(name='wbcontractawards',
      author='Thomas Levine',
      author_email='_@thomaslevine.com',
      description='Get information about World Bank contract awards',
      url='https://github.com/tlevine/wbcontractawards',
      packages=['wbcontractawards'],
      install_requires = ['lxml','picklecache','requests'],
      tests_require = ['nose'],
      scripts = ['bin/wbcontractawards'],
      version='0.0.4',
      license='AGPL',
)

Note the scripts field.

Python packages

Package your stuff as standard packages of some sort, and put them on a repository that you don't have to worry about. I tend to write Python packages and put them on PyPI. This way, I don't have to remember where I put things or how they work; I just look them up on PyPI.

Here's an easy way to make a Python package.

# Make a new directory for your package.
# Inside that directory, make another directory of the same name.
# Inside that directory, make a directory called "test".
mkdir -p mypackage/mypackage/test

# Also make a "bin" directory for your scripts.
mkdir -p mypackage/bin

# Create __init__.py files in the package directories.
touch mypackage/mypackage/__init__.py
touch mypackage/mypackage/test/__init__.py

# In the outer directory, copy a setup.py file from one of my packages
cp ~/vlermv/setup.py mypackage/setup.py

# Write your computer code in the inner directory.
vim mypackage/mypackage/download.py

# Write your tests in the test directory.
vim mypackage/mypackage/test/test_download.py

Before you install the package, you can check that the executable works like so.

cd mypackage # outer directory
PYTHONPATH=. ./bin/the-executable

Then try installing the package.

cd mypackage # outer directory
pip install .

Once everything seems to be working, upload it to PyPI.

python setup.py register sdist upload

Why not other things

I've done things lots of other ways and have settled on this one. Why?

Other languages

Python has some nice things going for it, but the choice of language is rather arbitrary. Look for these things in whatever language you use.

  • An HTTP library
  • An HTML parser with a decent query language (XPath or CSS, for example)
  • A way of saving HTTP responses
  • Libraries for whatever data store you are using, especially CSV
  • A package manager (pip, npm, pacman, apt-get, &c.)

I'm always nervous about Python's impurity and dynamic typing; I see myself switching to Haskell reasonably soon.

Other XML Parsers

The only difference I ever notice between XML parsers is whether I can run XPath or CSS queries on their results.

You might notice that I tend to use lxml in Python, even though everybody apparently uses BeautifulSoup. I use lxml because I never learned BeautifulSoup and because I like XPath. I have heard that BeautifulSoup is better for really, really bad HTML, but I haven't had trouble with lxml.

You might choose to use BeautifulSoup because it is easier for you to install. I usually install lxml from pacman or apt-get with no trouble, and I once installed beautifulsoup4 from pip with no trouble. As you see, I hardly ever have trouble installing things, but you might, and it's perfectly reasonable to use things that are easier to install.

I didn't realize for the longest time that the built-in xml library constructs an XML tree and thus allows XPath queries, or at least a useful subset of them; I had thought it had far fewer features. You're probably just fine with that.
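
For example, here's a quick sketch with the standard library's ElementTree, which handles that XPath subset on well-formed XML.

import xml.etree.ElementTree as etree

tree = etree.fromstring('<table><tr><td>a</td></tr><tr><td>b</td></tr></table>')
# ElementTree's findall accepts a limited XPath dialect.
print([td.text for td in tree.findall('.//td')])   # ['a', 'b']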

Frameworks

I dislike Python mechanize because it combines downloading and parsing. These two things are wildly different and thus don't need to be connected; separating them makes lots of things easier. In particular, you have to do really weird things in mechanize if the standard parser doesn't do exactly what you want.

Amusingly enough, I'm still on the mechanize email list. Most of the posts to the list are questions about how to scrape particularly annoying websites; my response is pretty much always that you should not use mechanize for websites like these and that you should separate the downloader parts of your program from the parsing parts.

Lots of people seem to like Scrapy. I looked at it and was intrigued at first. I actually wrote something similar a few years ago, but I suppose there was a reason I stopped working on it. Scrapy adds frameworky layers that I didn't feel like thinking about, and it involves writing the class keyword, which I'm usually not in the mood to do. And I already knew how my stuff worked.

Real browsers

I fall back on something like Selenium or PhantomJS when a website is more complicated than I feel like figuring out. These introduce all kinds of state and complexity, though, so I'd rather not use them.
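
When I do fall back, the core of it is only a few lines; here's a sketch assuming Firefox and a matching WebDriver are installed, with a hypothetical URL.

from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://example.com/complicated-page')
html = driver.page_source   # hand this to the same parse functions as before
driver.quit()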

Update (2017)

Since writing this article, I have changed some details of my methods.

  • I run tests with py.test instead of with nose. They're both good, but py.test is sometimes a bit nicer.
  • I make user interfaces with horetu rather than with argparse.

The important stuff has stayed the same.

Other resources