Future Fridays: web scraping

Last week I had the pleasure of presenting the first Future Fridays event in Brisbane: a workshop on web scraping.

The tool of choice for our workshop was import.io, but we aimed to cover scraping a little more broadly too.

We had three key aims for the workshop:

  1. Understand the basics of scraping
    • What it is
    • Why it's useful
    • What it can be used for
  2. Start scraping some basic data using import.io
  3. Know how to extend your skills

A definition: Converting non-tabular or poorly structured data into a usable, structured format, usually a spreadsheet.
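To make that definition concrete, here is a minimal sketch in Python using only the standard library. The HTML fragment, the class names (`name`, `price`) and the prices are all made up for illustration; a real page will need its own rules. The idea is simply: data goes in as markup, and comes out as rows you could open in a spreadsheet.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment: the data is there, but not in a usable table.
HTML = """
<ul>
  <li><span class="name">Blue Train</span> <span class="price">$25</span></li>
  <li><span class="name">Kind of Blue</span> <span class="price">$30</span></li>
</ul>
"""


class ItemParser(HTMLParser):
    """Turn each <li> into one row, keyed by the class of its <span> elements."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, one dict per <li>
        self._row = None    # row being built, or None when outside an <li>
        self._field = None  # class of the <span> we are inside, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._row = {}
        elif tag == "span" and self._row is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        # Only keep text that sits inside a classed <span> within an <li>.
        if self._row is not None and self._field:
            self._row[self._field] = self._row.get(self._field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_to_csv(html):
    """Parse the fragment and return the rows as CSV text."""
    parser = ItemParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return out.getvalue()


print(scrape_to_csv(HTML))
```

Point-and-click tools do essentially this for you, without the typing.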

Humans are very bad at repetitive work. We're prone to errors and to getting bored, so let's make a computer do the work whenever we can.

Scraping is incredibly useful for acquiring data that has been made inaccessible by the way it’s presented online.

I am convinced that scraping is the original data journalism skill. —@jeremybowers

But as well as acquiring data, scraping can be really useful for other things like tracking changes and archiving. In short, it's a skill worth learning.
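As a rough illustration of the change-tracking idea, the sketch below (Python, standard library only) stores a fingerprint of a page's text and compares it with a later scrape. The function names and the two page snapshots are hypothetical; this is one simple way to do it, not the only one.

```python
import hashlib


def fingerprint(page_text):
    """Return a stable fingerprint of a page's text."""
    # Collapse whitespace so trivial formatting changes are not flagged.
    normalised = " ".join(page_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, page_text):
    """True when the page no longer matches the fingerprint saved earlier."""
    return fingerprint(page_text) != previous_fingerprint


# Hypothetical snapshots of the same page, scraped on two different days.
monday = "<p>Minister's diary: 3 meetings</p>"
tuesday = "<p>Minister's diary: 4 meetings</p>"

saved = fingerprint(monday)
print(has_changed(saved, monday))   # False
print(has_changed(saved, tuesday))  # True
```

Saving the page text alongside each fingerprint gives you a basic archive as well.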

Before you get started

As useful as scraping is, it's important not to get carried away. Remember, there might be better options. Choose the right (i.e. easiest) tool for the job.

  1. Always check that you can't just copy and paste into Excel or Google Sheets.
  2. Asking politely is a good place to start.
  3. Look for existing structured data (oh look, there's an open data set for that).
  4. Freedom of information requests (be specific about the format you want data in). I recommend Right to Know for getting started.
  5. Sometimes the best (only) option will be lots of typing.

Import.io

Import.io has recently had a big update and has changed from a downloadable program to an entirely web-based tool.

Their YouTube channel is chock full of tutorials, and it used to be very good. Most of the videos are currently out of date (they focus on the old tool), but I'd expect some new tutorials are coming.

Their help centre does have some tutorials using the new version. This walkthrough for extracting product details from a record store covers most of what we went through in the workshop.

Extending your skills

The next step past the 'point and click' style scraping that import.io does so well might involve a little bit of programming.

Here are some other tools you could try:

Remember, extending your skills in this area is relatively easy. In my experience you need just three things: motivation, basic vocabulary, and a community.

Motivation

Learning something new always takes a bit of time because you need to experiment. So motivation, meaning a willingness to carve out some time to experiment, is key. If work doesn't afford the opportunity to learn on the job, find a personal project that interests you.

Vocabulary

You need some basic vocabulary so you can ask for help. When you're asking other humans, you can usually work around a lack of vocabulary, but knowing basic terms helps immeasurably when searching the web.

Knowing terms like crawling, extracting, HTML, structured data and database (among others) will help.
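To put one of those terms in context: crawling is the part where a scraper discovers which pages to visit, usually by collecting links, while extracting is pulling the data out of each page. Here is a minimal sketch of the link-collecting half in Python, using only the standard library. The page fragment and the example.com address are placeholders I've made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect every link on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # address of the page being read
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page fragment with one site-relative and one page-relative link.
PAGE = '<a href="/records/1">One</a> <a href="page2.html">Next</a>'

collector = LinkCollector("https://example.com/records/")
collector.feed(PAGE)
print(collector.links)
```

A crawler would then visit each collected address in turn and run its extraction step on every page.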

Some places you can look for help:

Community

Having a go-to community of people to ask when you run into trouble is really helpful. Here are some options:

Good luck!