It has been four years since I last wrote substantially on open data. I am pleased to see that people continue to reference my writings and to run my software, which suggests that the work remains relevant.

Ever since my writings on open data tapered off, I have been meaning to summarize my overall conclusions and recommendations in one article. It is finally ready.

1. Early conclusions: Problems with open data portals

One of my main reasons for starting this project was that I wanted to know what was on open data portal websites. I mostly found that their contents were much less interesting than I had naïvely believed.

I also incorrectly concluded that the data were out of date, and people cited that finding a lot even though there was a note at the top explaining that the conclusions were wrong.

2. Later conclusions: Better software for cataloging datasets

Having looked at how the publishing of government data works, I came up with some ideas as to how it could be done better. And then I started specifying and prototyping relevant software tools.

Just as we have free-text search tools that make it easy to search all kinds of documents, we can develop tools to search tabular datasets. Simply indexing tabular datasets with conventional information retrieval software can get you quite far.
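To make this concrete, here is a toy sketch of the idea: treat each whole table (exported as text) as one document in an inverted index, and search by keyword. The table names and contents below are hypothetical, and a real system would use proper information retrieval software rather than this minimal index.

```python
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on commas/whitespace; good enough for CSV text.
    return [t for t in text.lower().replace(",", " ").split() if t]

def build_index(documents):
    """Map each token to the set of table names containing it."""
    index = defaultdict(set)
    for name, content in documents.items():
        for token in tokenize(content):
            index[token].add(name)
    return index

def search(index, query):
    """Return the tables containing every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

# Two tiny hypothetical "datasets" exported as CSV text.
tables = {
    "air_quality.csv": "date,hour,no2,pm10\n2013-02-11,19:00,64,26\n",
    "budget.csv": "year,department,amount\n2013,parks,497000\n",
}
index = build_index(tables)
print(search(index, "no2 pm10"))  # finds air_quality.csv
```

Even this crude approach finds a dataset by the names of its columns or the values in its cells, with no special handling of tabular structure at all.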

If you want to do this with your own data, I recommend using Recoll. It should already support most file-based data formats, and you can write a custom filter for other formats, such as live database servers. In most cases I think it is fine to treat your entire data table as a single document, so the native text filter should work well for most datasets if you can export them to a text format like an SQL dump or CSV. Here are some alternatives, in case you do not want to use Recoll.
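Since file indexers like Recoll operate on files, the simplest bridge from a live database server is to export each table to a text file in a directory the indexer watches. Here is a minimal sketch using Python's standard library against an SQLite database; the database path and output directory are hypothetical, and other database engines would need their own client library.

```python
import csv
import sqlite3

def dump_tables(db_path, out_dir):
    """Export every table in an SQLite database to one CSV file each,
    so a file-based indexer can treat each table as one document."""
    con = sqlite3.connect(db_path)
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cur = con.execute(f'SELECT * FROM "{table}"')
        with open(f"{out_dir}/{table}.csv", "w", newline="") as fp:
            writer = csv.writer(fp)
            # Header row from the cursor metadata, then the data rows.
            writer.writerow(col[0] for col in cur.description)
            writer.writerows(cur)
    con.close()
    return tables
```

Running something like this on a schedule keeps the indexed text roughly in sync with the live database, without the indexer needing to know anything about databases.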

As far as I can tell, this has still not caught on in large-scale search engines. The popular web search engines usually give me results in HTML or PDF format rather than, for example, Excel, CSV, JSON, or SQLite.

Going further, I think that we should be able to enhance this search strategy by considering the tabular structure in our search tools. I illustrate this with the following truncated table of air quality measurements from a particular measurement station.

2013-02-11 19:00 26 64 40 497 16 45
2013-02-11 21:00 12 41 19 479 16 42
2013-02-11 23:00 9 36 26 461 16 41
2013-02-12 09:00 73 50 39 513 14 45
2013-02-12 13:00 56 49 40 527 14 42
2013-02-12 21:00 12 47 52 517 15 34
2013-02-12 23:00 13 45 63 488 14 34
2013-02-13 04:00 3 37 32 466 14 33
2013-02-13 06:00 29 57 47 486 13 33
2013-02-13 10:00 78 68 52 504 13 32
2013-02-13 14:00 96 81 64 533 13 31
2013-02-13 23:00 20 48 109 479 13 30

We can guess, without any other documentation, that the statistical unit in this dataset is the date-hour: this combination of columns has unique values across all rows, and it is also the left-most such combination of columns.
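This guess can be automated. The sketch below is a simplified version that only checks prefixes of leading columns (rather than all column combinations): it finds the smallest number of left-most columns whose values are unique across all rows. The rows are a subset transcribed from the table above.

```python
rows = [
    ("2013-02-11", "19:00", 26, 64, 40, 497, 16, 45),
    ("2013-02-11", "21:00", 12, 41, 19, 479, 16, 42),
    ("2013-02-12", "09:00", 73, 50, 39, 513, 14, 45),
    ("2013-02-13", "23:00", 20, 48, 109, 479, 13, 30),
]

def leftmost_unique_prefix(rows):
    """Return the smallest k such that the first k columns form a
    unique key for every row, or None if no prefix is unique."""
    width = len(rows[0])
    for k in range(1, width + 1):
        prefixes = [row[:k] for row in rows]
        if len(set(prefixes)) == len(rows):
            return k
    return None

print(leftmost_unique_prefix(rows))  # 2: date plus hour identify a row
```

The first column alone is not unique (two rows share the date 2013-02-11), but the first two columns together are, which matches the guess that the statistical unit is the date-hour.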

We can also compare the values in each column to the values in columns of other datasets in order to figure out which datasets might share columns. Then, given another dataset with hour listed in a column called "HOUR" or "VAR02", rather than "HEURE", we might be able to determine that it is related to this particular dataset.
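One simple way to sketch this comparison is value overlap between columns, for example with the Jaccard similarity of their value sets: two columns that share most of their values probably describe the same thing, whatever their headers say. The column names and values below are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Our dataset's hour column, and two columns from another dataset.
ours = {"HEURE": ["19:00", "21:00", "23:00", "09:00"]}
other = {
    "VAR02": ["09:00", "19:00", "21:00", "23:00"],
    "VAR03": ["26", "64", "40", "497"],
}

for name, values in other.items():
    print(name, jaccard(ours["HEURE"], values))
# VAR02 shares all of its values with HEURE; VAR03 shares none.
```

A high overlap between "HEURE" and the opaquely named "VAR02" suggests the two datasets are related, even though a text search on the column names alone would never connect them.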

This is just one example of the many ways we could consider the tabular structure of a dataset when we search it.

I find this very interesting in theory, but I could not find enough diversity among open data datasets to make searching like this worthwhile. Most datasets that had at least a few rows and columns shared very similar structures with each other, so comparing their structures would not have discriminated much; for example, many datasets were aggregated so as to tell us how many things happened in each of several years or geographic regions. If all of the datasets have the same tabular structure, searching by tabular structure is not very helpful.

3. Future study

I think that the people who implement open data portal software understand well enough how it could be made better. Open data portals are only as garbage as any other commercial software, so I think they're pretty okay as-is. But in case someone really does want to make it easy to find datasets, there are a lot of people who understand very well how to do this without inventing very many new things. I think this line of study has run its course, though people who want to promote open data portal software will continue to argue that it is good, and people who want to demote it will continue to argue that it is bad.

On the other hand, tools for searching tabular data could be developed much further. If I really wanted to pursue this, I would crawl on the order of billions of web pages to find lots of spreadsheets, and then I would try searching those. I stopped looking at this because obtaining all the spreadsheets seemed expensive and like a lot of work. I have heard of several other groups doing things like this, and I think they all use more focused strategies for finding spreadsheets.