How should I manage my data library?

My personal and professional data management practices could charitably be described as ‘ad-hoc’, which is less than ideal for someone who spends so much time collecting, working with and thinking about data.

[Image: Un compositeur à sa casse, 1875 — a compositor at his type case, composing stick in hand]

These are some initial exploratory notes on how it might be done better. I’d love to hear about your system.

Prior art

A bunch of data management systems (DMSs) exist, and other people have already spent a lot of time solving this problem. While I definitely don’t want to re-invent the wheel, I do need to find a system that works for me and nothing I’ve found so far is an obvious fit.

The challenge will be finding something that has the right level of complexity for my particular use-case.

Most systems are designed for large teams or whole organisations, which typically means much more complexity than I want or need.

CKAN is one of the better-known systems for maintaining and publishing a data library. But it's definitely not intended to be a personal system, or even one for a small team. Anything that requires at least three different kinds of database under the hood is overkill for my purposes, though it does have the advantage of being a mature, well-maintained FOSS project.

Once upon a time, there was a project funded by the Knight Foundation that aimed to develop a newsroom data library. This is a lot closer to the kind of thing I think will work for me (and maybe a small number of my colleagues). It's called PANDA, but it is, sadly, long abandoned.

People who work primarily in machine learning environments have a big incentive to manage their data well. There are a few projects like lakeFS and DVC that bring Git-like version control (or leverage Git) to managing datasets. These look promising for my use-case, but need more investigation.
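
As a rough illustration of what that Git-like workflow might buy me, here's a minimal sketch using DVC's Python API to read a specific revision of a dataset. The repository URL, file path and tag are all hypothetical, and lakeFS offers a broadly similar model; this is just one way it could look.

```python
# Sketch: read one specific version of a dataset tracked with DVC.
# Assumes the file lives in a DVC-tracked Git repo, and that a tag
# ("2024-q3" here, purely illustrative) marks the version we want.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/cpi.csv",                                # path inside the repo (hypothetical)
    repo="git@example.com:me/data-library.git",    # hypothetical repo URL
    rev="2024-q3",                                 # any Git revision: tag, branch or commit
) as f:
    cpi = pd.read_csv(f)
```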

What kinds of data?

One of the challenges of working as a data journalist is that we rarely work with the same types of data twice, and the data we do work with is often poorly described and/or structured. For one story you might be working with official statistics like CPI or census data, the next might involve a manually compiled dataset or something esoteric supplied by a source, and yet another could involve geographic data in formats like GeoJSON or GeoPackage. The list of possibilities goes on and on.

The scale of the data varies a lot — sometimes it’s gigabytes and hundreds of files and sometimes it’s a single 20kB CSV file — and the format varies too.

To add another complication beyond a wide variety of data types, scales, and formats, there is variability in how we need to handle the data. Some datasets are highly sensitive and need to remain private, some are sourced from already public data, while others are built in-house with the intent to share the data with both audiences and other journalists who might make use of it.

Requirements

That makes defining the requirements for a personal professional DMS quite difficult. But here’s an attempt:

  • Data should be version controlled
  • A record of provenance — data source, retrieval date, record of modifications — is necessary (see the sketch after this list)
  • Datasets should be private by default, but convenient to publish and/or share
  • It should be easy to use a single dataset in multiple projects
  • Datasets should be easy enough to access programmatically that there’s no hesitation to put them in the DMS rather than baking them into a project’s codebase
  • Datasets should be discoverable — specifically some kind of central index should exist so available data is easily found
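
To make the provenance requirement a little more concrete, here's one possible shape for the record each dataset would carry, sketched as a plain Python structure. The field names and the example values are my own invention, not taken from any existing standard or tool.

```python
# Sketch of a per-dataset provenance record. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Provenance:
    source: str                  # where the data came from: URL, FOI request, a source, etc.
    retrieved: date              # when it was retrieved
    modifications: list[str] = field(default_factory=list)  # what was changed, in order
    public: bool = False         # private by default

cpi_provenance = Provenance(
    source="https://example.org/official-statistics/cpi",   # hypothetical URL
    retrieved=date(2024, 7, 1),
    modifications=["dropped footnote rows", "normalised column names"],
)
```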

The DMS doesn’t necessarily need to be a piece of software; it could simply be a well-defined process and folder/file structure, though several of the requirements (version control especially) might be better solved with software.
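
Even a pure folder-and-files approach could cover the programmatic-access and central-index requirements with a tiny helper along these lines. The layout it assumes — a datasets/ directory containing an index.json that maps dataset names to file paths — is hypothetical, just a sketch of the idea.

```python
# Sketch: resolve a dataset by name from a central index and load it.
# Assumes a layout like:
#   datasets/index.json            -> {"cpi": "cpi/2024-07/cpi.csv", ...}
#   datasets/cpi/2024-07/cpi.csv
import json
from pathlib import Path
import pandas as pd

LIBRARY = Path.home() / "datasets"   # hypothetical library root

def load_dataset(name: str) -> pd.DataFrame:
    """Look up a dataset in the central index and load it as a DataFrame."""
    index = json.loads((LIBRARY / "index.json").read_text())
    return pd.read_csv(LIBRARY / index[name])

cpi = load_dataset("cpi")
```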

I’m also drawn to the possibility that it could become a resource for other members of my team or people I work with. That opens up a lot of possibilities, but also a lot of new requirements around authentication and authorisation.

Tell me, what does your personal / professional data management system look like?

datajournalism