Building journalistic data sets

This was published more than 6 years ago

Attitudes and opinions change and evolve.

You know that feeling you get when you stumble upon something you wrote in the distant past and it's terrible? There are many cringe-worthy bits of immature writing on this website (and I'm probably still creating more).

As well as the merely cringe-worthy, there are also opinions and attitudes expressed that I no longer hold and am, frankly, embarrassed by. (Please don't go looking, they're deliberately hard to find, but left for the sake of posterity.)

I hope if you've stumbled across some here, you'll give me the benefit of the doubt.

One of the great resources for data journalists is the NICAR mailing list. If data journalism or computer-assisted journalism is your game, you’re a fool not to be subscribed.

A few weeks ago, Jennifer Lu popped up looking for input.

I’m doing a project on best practices for building datasets from scratch (i.e. when an agency isn’t collecting the information you want, so you decide to go ahead and collect it yourself.) If you’ve done this before, I’d love to interview you about the process.

As it happens, the team I work in at ABC News is just this moment embarking on such a project, so it was of obvious interest and my hopes were high. As usual, the replies were full of good tips.

This tip from Tim Henderson is one I’d wholeheartedly agree with.

Start by having people gather a few hundred records in Excel — that will help them wrap their heads around how much they need as data, and how much they need as freeform notes on individual cases, which is a different matter.

It’s such a difficult task to draw a boundary around what it is you’re collecting—you’re bound to get it wrong—so having a small pilot data set like this is a great way to get started.

Dan Nguyen did an awesome job of practicing what he preaches and created a data set of stories which involved the creation of data sets.

Right now my definition is: any enterprise story/project in which some poor sucker(s) had to unilaterally decide what to track, how to categorize it, and then hand-enter it into a spreadsheet (or some other form of systematic structured record). I’ve also included projects in which the data was a single-source stack of paper that the reporter had to single-handedly transcribe to computer.

He shared a very long list (which I won’t re-type here), but I live in hope that his promise to “maybe later build a showcase site that runs off of that spreadsheet” will someday soon be fulfilled.

A final list of useful tips came from David Akin.

  1. Document everything — the source for your data; detailed explanations of each field; explanations of any calculations or manipulations of data.
  2. Decide whether your database will be a “snapshot-in-time” database or whether you’ll curate it and continue to add new data to it.
  3. Consider a strategy to help you verify your data.

On the first point he notes that the purpose here is to make the data set easy for a colleague to use, maintain or verify—all incredibly important tasks that can easily be made more difficult than necessary.
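One lightweight way to act on that first tip is to keep the field documentation as a data dictionary right next to the data, so it can double as a sanity check. Here's a minimal sketch in Python (the field names and descriptions are hypothetical, just for illustration):

```python
# A hypothetical data dictionary: one entry per field, each with a
# plain-language explanation of its source and meaning.
DATA_DICTIONARY = {
    "incident_date": "Date the incident occurred (YYYY-MM-DD), from the court record",
    "state": "State or territory abbreviation, as reported",
    "outcome": "One of: 'charged', 'dismissed', 'pending'",
    "notes": "Freeform notes on the individual case",
}

def validate(record):
    """Return fields the dictionary requires but the record lacks,
    and fields the record has but the dictionary never documents."""
    missing = [f for f in DATA_DICTIONARY if f not in record]
    undocumented = [f for f in record if f not in DATA_DICTIONARY]
    return missing, undocumented

missing, undocumented = validate(
    {"incident_date": "2016-03-01", "state": "NSW", "source_url": "..."}
)
print(missing)        # documented fields absent from the record
print(undocumented)   # record fields nobody has documented yet
```

Nothing about this is specific to Python; the same idea works as a second sheet in the Excel pilot file. The point is that the explanations live with the data, so a colleague can verify or extend the set without guessing what each column means.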

Personally, I’ve neglected making the decision in point two on previous projects. It can be a hard decision to make, but it helps so much to make it up front and not look back.

I wish there were a public, searchable archive of the NICAR list, but there isn’t one as far as I’m aware, so you’ll just have to sign up now so you get all the goods.

This is a cross-post from Medium because I wanted to give it a go.