I've heard of several other people trying to contact the owners of datasets on government websites but never getting a response. I figured that this data owner contact thingy must do something, so the idea that nobody would ever get a response seemed strange. In the present article, I try to figure out was going on.
I used Socrata's Contact Dataset Owner feature a lot, and I now have a much better idea as to how it works. In particular, I surveyed dataset owners through this feature in order to estimate the chance that you'll get a response if you submit a message through the dataset owner contact thingy.
I'll start with a discussion how the feature works and how I surveyed data owners. I learned a lot more about the feature as I ran the survey, and I'll tell you about that. I'll present the results from the survey, comment on some potential mechanisms that would explain the results, and discuss the practical relevance of these findings.
The Contact Dataset Owner feature
If you go to a dataset page on a Socrata Open Data Portal website, you'll see something like this.
There's an "About" button that opens this screen.
Scroll to the bottom, and you'll see "Contact Dataset Owner".
Then you can fill out the contact form,
which includes a CAPTCHA.
If you submit the form and all goes well, you'll see something like this.
I have no idea about how the message is handled technically after you submit the form.
This is the feature that I mentioned in the beginning, and that's all I knew about it before beginning the present study. Let's see what I learned!
I mainly wanted to answer this.
If I contact the owner of a dataset on a Socrata website, what is the chance that I will receive a response at all?
Secondarily, I am curious as to whether that "owner" is the right person.
If I contact the owner of a dataset on a Socrata website, what is the chance that the person will be the appropriate contact person?
I sent emails to a sample of the data owners asking whether they are still appropriate contacts for their datasets. They could respond with "yes" or "no". They could also not respond. I checked which ones responded, and I came up with some figures related to the above questions.
I downloaded metadata on all public datasets from all the Socrata websites I knew about and then looked at the owners of the different datasets. I treated this as a population of owners of public datasets. I drew a sample from this population, and I tried to send questionnaires to every owner in the sample.
My sampling script is parametrized by a salt.
This produces a file called "messages.csv" containing a record for each of the sampled users. Each record contained the URL of a dataset owned by the user and the message that I should send to the user.
Messages looked like this.
Dear [fullname], My name is Thomas Levine, and I have been studying how governments publish data. I am contacting you because you are listed as the "dataset owner" for the following datasets. https://data.somecity.gov/d/abcd-efgh https://data.somecity.gov/d/ijkl-mnop ... I would like to know whether you are the person I should contact about these datasets. If you are still the contact, please go to this web page. http://dataowners.thomaslevine.com/astoehusahul982h3cuhose/yes If you are no longer the contact or never were the contact, please go to this web page. http://dataowners.thomaslevine.com/astoehusahul982h3cuhose/no And feel free to email me if you have any questions or comments. Thanks
The message contained a unique identifiers the user that had been chosen as a hash of the salt and the Socrata identifier for the owner. I did this so that a hypothetical troublemaker would have difficulty messing with my data. I kept the salt value secret, and I can reproduce the sampling based on the salt.
Configuring the server for receiving responses
I pointed the domain "dataowners.thomaslevine.com" to my virtual server at the Numb Server Association. On this computer, I configured an an Nginx server with these properties.
- Listen only on "dataowners.thomaslevine.com".
- Serve the same exit page regardless of the HTTP request.
- Store logs to files specific to the "dataowners.thomaslevine.com" server, as to not confuse it with other domains on the same website.
I backed up the log files periodically and saved the final versions after a few weeks.
I wrote a little thingy that made it easier for me to contact the owners. This program guided me through the whole process of contacting the owners, and it remembered all of the annoying things for me.
It checks which owners have been contacted and selects one of the owners that hasn't been contacted. It tells me which URL to go to and gives me the text to put in the message to the dataset owner.
I ran my messages program thingy and went to the URLs that it told me to go to. I filled in the following information
- "Other" (the default)
- "A visitor has sent you a message about your ... dataset" (the default)
- Copy this from the output of the message-sending guide program
- Email address
and submitted the form.
I recorded in my little program any unexpected results of the form submission (mostly error messages).
Reading server logs
Recall that the server logs record clicks on the "yes" or "no" links that I included in the messages to dataset owners.
I wound up with a data table where each row was a dataset owner and where the two columns corresponded to the owner hash (identifier) and the response (yes or no).
Here are the details of how I converted the log files into a CSV file of two columns.
Three people responded by email, and I combined their responses with the data table produced above.
Oklahoma's Office of Management and Enterprise Services sent me an email saying that my message had been marked as spam, so I counted them as not having responded to my contact.
Things I learned about Socrata Open Data Portal
Socrata's dataset owner contact feature doesn't work quite as I had expected.
Sending the messages turned out to be more difficult than I had expected. Whenever an error occurred, I recorded that the error happened and then either tried again or moved on. I tried again more at the beginning; by the end, I had gotten pretty good at predicting whether an error would go away if I tried again.
In some cases, the page didn't load.
In other cases, the dataset had apparently been deleted between the time when I assembled my list of all datasets and when I finally sent the messages.
Errors in sending
Most of the errors were things that I categorized as errors in sending the message.
In two cases, the CAPTCHA didn't show up.
I submitted the form anyway, but I was told that I hadn't filled out the CAPTCHA properly (because I hadn't).
In most of the other cases, the problem was less clear. I got messages like this one.
If I looked more closely at what was going on in the webpage, I saw things like this.
That is, the XML HTTP request received a response with a status of 500, indicating that something had gone wrong in the server.
The body of the response was some HTML that says that the page is "unavailable".
Possible reason for the majority of errors
Errors in sending were happening a lot, and I was getting the feeling that long messages were more likely to produce errors. After I had finished attempting
Here's my guess as to what is going on. The messages are compressed and then stored in a relational database in a VARCHAR column. This column's maximum length is low enough to that longer messages won't be allowed in the database. Recall that you are supposed to enter a "Short Message" into the box. Maybe this is why!
I'm almost at the point where I discuss what I learned from the dataset owners once I contacted them. I'll comment on the sampling bias that the came out of the errors in sending the messages, and then I'll provide my answers to the research questions.
I'll go into much more detail in the following sections about these two concerns, but the following plot says almost everything that the longer sections below say.
In this plot, each data owner is represented as a bar, and each dataset is represented as a section of the bar; bars are longer for data owners who own more datasets. The color of the bar indicates whether the particular owner responded to my message with a "yes", (indicating that she was the appropriate contact person), responded with a "no (indicating that she was not the appropriate person), or did not respond.
We can assess whether the response is associated with the number of datasets by looking at whether bars change color as we move up and down the plot. If response is associated with the number of datasets, we should see more red bars in one section of the plot than in another section of the plot. If response isn't associated, the colors of the bars should alternate with no apparent trend. I see more red bars in the middle of the plot than on the top and bottom, so this association might exist. I'm not really sure though.
More interestingly, we can assess the likelihood of getting a particular response to your message. The likelihood of getting any response at all is the proportion of the colored area that is red ("yes") or green ("no"). Most of the colored area is blue, however, and this suggests that you probably won't get a response if you try to contact the owner of a particular dataset. Furthermore, there is much more red than green, and this suggests that the person who responds, if you do get a response, will be an appropriate person for asking questions about the dataset.
As I said, the above plot presents the results pretty well, but I'm going a bit more in depth in the following two sections.
Sampling bias introduced by the errors in sending
As we saw above, I had much more difficulty sending messages to people with more datasets. As such, my sample was no longer representative (mathematically) of my original population. In the present section, I consider the implications of this sampling bias.
It is possible that the owners that own the most datasets are qualitatively different from other owners in their manner of responding. For example, it could be that the accounts that own many datasets belong to departments rather than individual people and that they thus respond to messages in a different way.
So I checked whether there was a systematic difference in the sort of response associated with the number of datasets that an owner owns. (And I did this only with the owners to which I managed to send messages, as I didn't have relevant data about the owners to which I didn't manage to send messages.)
I'm going to look at it a few different ways.
In this first plot, each dot is a dataset owner. Its horizontal position is randomly chosen, its vertical position corresponds to the number of datasets the owner owned, and its color corresponds to whether the owner responded to my contact. The squiggly shapes in the back are violin plots, which are like smoothed versions of the points.
The shapes of the two sides of the plots are sort of similar but also sort of different. Based on the plot, I'd guess that there is at least a small difference in whether an owner responds depending on whether the owner has many or few datasets. But I'm not really sure, based on the plot, about what the difference is.
We can get a fancier result with a logistic regression of whether the owner responded (She either did or didn't.) on the number of datasets the owner owned.
glm(formula = response ~ n.datasets, family = "binomial", data = sent) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.543654 0.264629 2.054 0.0399 n.datasets 0.006386 0.007459 0.856 0.3919
Given how many owners I contacted, the term for the number of datasets
n.datasets) isn't large enough that we should guess that there truly is some
difference in response rates based on the number of datasets an owner owns.
Said more fancy-like, the p-value for the slope term (0.3919) is much larger than our conventional α of 0.05, so we do not reject the null hypothesis that the number of datasets an owner owns has nothing to do with whether the owner responds.
I didn't try very hard at checking the fit statistics; it is possible that a logistic regression is totally unreasonable for these data.
Based on this regression, responsiveness doesn't seem to relate to the number of datasets that an owner owns.
The most compelling way I found to display the data was in this last plot.
Focus on the bottom plot (the "no response" group). Aside from the left-most bar, the bars stay at about the same height as they move from left (few datasets) to right (many datasets). Compare this to the top plot (the "yes" group); excluding the left-most bar, the bars to the left are much taller than the bars to the right. Also, hardly anyone wound up in the middle plot (the "no" group).
Now let's put those three observations together: If an owner responded, the owner probably responded with "yes". And the rate of response might be a bit higher for owners who owned few datasets. Based on this plot, it looks like responsiveness is related to the number of datasets that an owner owns.
Conclusion about sampling bias
I don't have a very clear answer either way, but it seems sort of like people with more datasets respond differently from people with fewer datasets. Because of this, we should keep in mind that the thing I'm about to tell you about the chance of response to a message to an owner might only apply to people with few datasets.
In the next section, I am going to report figures about the chance that you will receive a response if you use the data owner contact feature. Because I didn't manage to sample the owners with especially many datasets, the following results might not apply very well to them.
How/whether owners respond
Recall that there were three possible outcomes for each owner that I successfully contacted.
- The owner did not respond. Let's abbreviate this as "no-response".
- The owner said that she or he was the appropriate contact person. Let's abbreviate this as "yes".
- The owner said that she or he was not the appropriate contact person. Let's abbreviate this as "no".
In order to answer the two research questions, I imagined that I was going to send a message to all of the owners of all datasets on Socrata Open Data Portal sites (not just 124 of them), and I tried to predict how many would respond in a particular way.
I made two estimates, in fact, for each of the two research questions
(four estimates in total). I came up with two estimates by modeling
the sampling strategy differently in the
survey R package.
The first of the strategies is less meaningful and more easy to explain. It assumes a simple random sample from an infinite population of data owners.
d.unweighted <- svydesign(~1, data = sent, weight = ~1)
Then I ran the following command to come up with a confidence interval for
the estimated proportion, where
x was the sample data.
svyciprop(x, d.unweighted, method = 'mean', level = 0.95)
This first method is equivalent to a one-sample t-test.
The second of the strategies is more meaningful because it considers that my I weighted my sample based on the number of datasets that owners owned (because I was really interested in the datasets rather than in the owners).
d.weighted <- svydesign(~1, data = sent, weight = sent$n.datasets)
The other difference in the second strategy; I used the "logit" method for computing the confidence intervals.
svyciprop(x, d.weighted, level = 0.95, method = 'logit')
According to the
The "logit" method fits a logistic regression model and computes a Wald-type interval on the log-odds scale, which is then transformed to the probability scale.
I don't fully know what this means; I used it because it was the default and because it gave a smaller confidence interval than the "mean" method above.
How many respond at all
This was the first research question.
If I contact a dataset owner on a Socrata website, what is the chance that I will receive a response at all?
For this question, I estimated the proportion of owners that would fall into the "yes" or "no" group, rather than the "no-response" group.
How many say yes
This was the second research question.
If I contact a dataset owner on a Socrata website, what is the chance that the person will be the appropriate contact person?
For the first question, I estimated the proportion of owners that would fall into the "yes" or "no" group, rather than the "no-response" group. In this second question, I instead estimate the proportion of owners that fall into the "yes" group only, rather than owners who fall into either of the "no" or "no response" groups. This result was thus lower than the result for the first question.
I sought to estimate the chance that asking questions about datasets with Socrata Open Data Portal's owner contact thingy would lead to responses. I came up with that estimate and learned a bunch of other things along the way.
Chance of getting a response
If I had to guess, I'd say the following.
- There is about a one-in-three chance of getting some sort of response.
- If you get a response, your message has probably been sent to an appropriate person.
- Different sorts of people might own different numbers of datasets, and the chance that you'll get a response might be related to the number of datasets that an owner owns.
This figure of one response in three messages seems a bit low. Also, I had quite a lot of difficulty in sending the messages. Why is it so inconvenient?
Intent of the owner contact thingy
I think that this owner contact thingy was never intended to be used to ask questions about a dataset and that people are using it for that purpose anyway.
When I sent the messages to the sample of owners, I specified that the reason for my contact was "Other". Let's look at the other reasons that are available.
- Copyright Violation
- Offensive Content
- Spam or Junk
- Personal Information
Aside from "Other", these seem to be about corrections to the dataset. There might be a very clear process as to how to handle all of these; you might evaluate whether the concern is valid and then update or remove the dataset. The process of answering a question about a dataset might be very different.
Here's a relevant comment from Chris Metcalf, who works as a Director of Developer Platform at Socrata (December 2014) and was one of their early employees.
Chris indicates that you can use the dataset owner contact feature "if there's an issue with a dataset" and doesn't say anything about other contexts in which you can use this feature. On the other hand, this is a response to a question about bug-tracking systems, so it makes sense that he would focus on that aspect even if there are wider uses.
It could be that the feature was designed for a common process of correcting the data and that the people who were making these corrections had different intuitions about how the system would work and thus wouldn't have the difficulties that I had seen.
Turning off the owner contact thingy
It might also be that some institutions use this feature more than others and that it is hard to turn this feature off even if you don't plan on using it.
I have this issue with most of the messaging features on Facebook, Twitter, LinkedIn, GitHub, and so on. The email notifications from these services aren't particularly helpful, so I mostly ignore them. (It's way more complicated than that; I might discuss it some other time, but not now.) Also, I can't simply turn off the message, pull request, &c. features, so people assume that I pay attention to them. I notice them every few months by accident and thus don't respond very quickly.
Reasoning behind the owner contact thingy
I discussed the above findings with a couple of people who work at Socrata (I spoke with them separately.) and then asked them if they knew of the original reason for including the feature. They both had the same answer for me: Socrata wants it to be clear that Socrata's client (the city, country, &c.) owns and manages the data, not Socrata, and the owner contact thingy was designed as it was to communicate this ownership; contrary to my suspicions, no particular client had specifically requested the feature.
Are you looking for information about a dataset? On one hand, you might be disappointed to learn that you probably won't get a response if you use the dataset owner contact thingy. On the other hand, it's not like this feature is broken; people sometimes do respond, so it's worth a shot.
I suspect that there's misunderstanding by the public as to how this feature should be used. If you are publishing data but do not want people to ask questions about it, I suggest against including a feature like this. If you want this feature for your own internal use, consider enabling it only for your own staff or at least noting that it's only for internal use.