Web Scraping, part 1: files and APIs

Web scraping is extracting information from webpages, usually (but not always) as tables of data that you can save to csv files, json/xml files or databases.

Design it first, then scrape it

When you start on any piece of code, try asking yourself some design questions first; definitely do this if you’re thinking about something as potentially complex as web scraping code.   So you’ve seen a dataset hiding in a website – it might be a table of data that you need, or lists of data, or data spread across multiple pages on the site. Here are some questions for you:

1. Do you need to write scraper code at all?

    • Is the dataset very small – if you’re talking about 10 values that don’t get updated, writing a scraper will take longer than just typing all the numbers into a spreadsheet.
    • Has the site owner made this data available via an API (search for this on the site) or downloadable datafiles, either on this site or connected sites?  APIs are much much faster (and less messy) to use than writing scraper code.
    • Did someone already scrape this dataset and put it online somewhere else?  Check sites like scraperwiki.com and datahub.io.

2. What do you need to scrape?

    • What format is the data currently in?  Is it on html webpages, or in documents attached to the site?  Is it in excel or pdf files?
    • Does the data on this site get updated?  e.g. do you need to scrape this data regularly, or just once?

3. What’s your maintenance plan?

    • Where are you planning to put the data when you’ve scraped it?  How are you going to make it available to other people (this being the good and polite thing to do when you’ve scraped out some data)? How will you tell people that the data is one-off or regularly updated, and who’s responsible for doing that?
    • Who’s going to maintain your scraper code?  Website owners change their sites all the time, often breaking scraper code – who’s going to be around in a year, two years etc to fix this?

Reading files

Okay. Questions over. Let’s get down to business, working from easy to less-easy, starting with “you got to the website, there’s a file waiting for you to download and you only need to do this once”.

You’ve got the data, and it’s a CSV file

Lucky you: pretty much any visualization package and language out there can read CSV files.  You’ll still have to check the data (e.g. look for things like messed-up text, and be suspicious if all the biggest files are the same size) and clean it (e.g. check that your date columns all contain formatted dates, and that you have the right number of codes for gender – and no, it’s not always two), but as far as scraping goes, you’re done here.
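Those checks only take a few lines of Python. Here’s a minimal sketch – the column names (“date”, “gender”, “value”) and the tiny inline CSV are made up for illustration; point it at your own file instead:

```python
import csv
import io
from datetime import datetime

# A tiny made-up CSV standing in for your downloaded file; for a real file,
# use open("yourfile.csv") instead of io.StringIO.
csv_text = "date,gender,value\n2014-12-05,f,10\n2014-12-06,m,12\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Check that the date column really contains formatted dates:
# strptime raises ValueError on anything that doesn't parse.
for row in rows:
    datetime.strptime(row["date"], "%Y-%m-%d")

# ...and look at the set of gender codes actually present in the data.
gender_codes = {row["gender"] for row in rows}
print(len(rows), sorted(gender_codes))
```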

You’ve got the data, and it’s a file with loads of brackets in it

Here, the file extension (the part after the last “.” in a filename) is probably “json”.  This is a JSON file – not all data packages will read this format, so you might have to convert it to CSV (and it might not quite fit the rows-by-columns format, so you’ll have to do some work there too), but again, no scraping needed.
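When the JSON file is a flat list of records, converting it to CSV is short. A sketch, with invented field names and data – real files are usually messier (nested objects, missing keys) and need more work:

```python
import csv
import io
import json

# Invented JSON dataset: a flat list of records.
json_text = ('[{"country": "PH", "year": 2014, "value": 55.1},'
             ' {"country": "VN", "year": 2014, "value": 66.9}]')
records = json.loads(json_text)

# Write the records out as CSV; for a real file, swap io.StringIO()
# for open("out.csv", "w", newline="").
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["country", "year", "value"])
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```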

You’ve got the data, and it’s a file with loads of <>s in it

Either you’ve got an HTML file (look for obvious things like HTML tags – <html>, <head>, <p>, <h1>, etc – and text outside the opening <name> and closing </name> brackets) or you’ve got an XML file.  Another big hint is if the file extension is “.xml”.  Like JSON, XML is read by many but not all data visualization packages, and might need converting to CSV; a few quirks make this a little harder than converting JSON, but there’s a lot of help out there online.
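For XML, Python’s standard library can pull the rows out before you write them to CSV. Another sketch, with an invented element structure – real feeds have their own element names (and often namespaces, which is one of those quirks):

```python
import xml.etree.ElementTree as ET

# Invented XML standing in for your downloaded file; for a real file,
# use ET.parse("yourfile.xml").getroot() instead of fromstring.
xml_text = """<data>
  <record><country>PH</country><value>55.1</value></record>
  <record><country>VN</country><value>66.9</value></record>
</data>"""

root = ET.fromstring(xml_text)

# Flatten each <record> element into a dict, ready for csv.DictWriter.
rows = [{"country": r.findtext("country"), "value": r.findtext("value")}
        for r in root.iter("record")]
print(rows)
```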

You’ve got the data and it’s a PDF file

Ah, the joys of scraping PDF files. PDF files are difficult because even though they *look* like text files on your screen, they’re not nearly as tidy as that behind the scenes.  You need a PDF converter: these take PDF files and (usually) convert them into machine-readable formats (CSV, Excel etc). This means that the 800-page PDF of data tables someone sent you isn’t necessarily the end of your plan to use that data.

First, check that your PDF can be scraped.  Open it, and try to select some of the text in it (as though you were about to cut and paste).  If you can select the text, that’s a good sign – you can probably scrape this pdf file. Try one of these:

  • If you’ve got a small, one-off PDF table, either type it in yourself or use Acrobat’s “save as tables” function.
  • If you’ve got just one large PDF to scrape, try a tool like PDFTables or Cometdocs.
  • If you want to use open-source code – and especially if you want to contribute to an open-source project for scraping PDF data – use Tabula.

If you can’t select the text, that’s bad: the PDF has probably been created as an image of the text – your best hope at this point is using OCR software to try to read it in.

Using APIs

An Application Programming Interface (API) is a piece of software that allows websites and programs to communicate with each other.

So how do you check if a site or group has an API?  Usually a Google search for their name plus “API” will do, but you might also want to try  http://api.theirsitename.com/ and http://theirsitename.com/api (APIs are often in these places on a website).  If you still can’t find an API, try contacting the group and asking them if they have one.

Using APIs without coding

APIs are often used to output datasets requested via a RESTful interface, where the dataset request is contained in the address (URL) used to ask for it.   For example, http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015 is a RESTful call to the World Bank API that gives you the rural population (as a percentage) of all countries for the years 2000 to 2015 (try it!).  If you’ve entered the RESTful URL and you’ve got a page with all the data that you need, you don’t have to code: just save the page to a file and use that. Note that you could also get the information on rural populations from http://data.worldbank.org/indicator/SP.RUR.TOTL.ZS, but you won’t be able to use that directly in your program (although most good data pages like this will also have a “download” button somewhere too).

APIs with code

Using code to access an API means you can read data straight into a program and either process it there or combine and use it with other datasets (as a “mashup”). I’ll use Python here, but most other modern languages (PHP, Ruby etc) have functions that do the same things.  So, in Python, we’re going to use the requests library (here’s why).

import requests
worldbank_url = "http://api.worldbank.org/countries/all/indicators/SP.RUR.TOTL.ZS?date=2000:2015"
r = requests.get(worldbank_url)

Erm. That was it.  The variable r contains several things: information about whether your request to the API was successful or not, error codes if it wasn’t, and the data you requested if it was. You can quickly check for success by seeing if r.ok is equal to True; if it is, your data is in r.text; if it isn’t, then look at r.status_code and r.text, take a big deep breath and start working out why (you’ll probably see 200, 400 and 500 status codes: here’s a list to get you started).
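As a rough guide to those status codes, here’s a small helper – my own, not part of the requests library – that turns the common ranges into hints:

```python
def describe_status(code):
    """Turn an HTTP status code into a rough hint about what happened."""
    if 200 <= code < 300:
        return "success: your data should be in r.text"
    if 400 <= code < 500:
        return "client error: check your URL, parameters and authentication"
    if 500 <= code < 600:
        return "server error: the API itself is having trouble - try again later"
    return "something unusual: look the code up in a status code list"

print(describe_status(200))
print(describe_status(404))
```

After r = requests.get(...), something like print(describe_status(r.status_code)) saves a trip to the status-code list for the common cases.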

Many APIs will offer a choice of formats for their output data. The World Bank API outputs its data in XML format unless you ask for either json (add “&format=json” to the end of worldbank_url) or CSV (add “&format=csv”), and it’s always worth checking for these if you don’t want to handle a site’s default format.
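If you do ask the World Bank API for json, the response comes back (as I remember it – check a real call before relying on this) as a two-element list: metadata first, then the list of observations. Parsing that shape looks something like this, with a hand-written, abbreviated response standing in for a live one:

```python
import json

# Hand-written, abbreviated stand-in for a World Bank json response; the
# field names here are from memory, so verify them against a real call.
response_text = """[
  {"page": 1, "pages": 1, "total": 2},
  [
    {"country": {"id": "PH", "value": "Philippines"}, "date": "2014", "value": "55.1"},
    {"country": {"id": "PH", "value": "Philippines"}, "date": "2013", "value": "55.4"}
  ]
]"""

# Unpack the two-element list, then index the observations by year.
metadata, observations = json.loads(response_text)
rural_by_year = {obs["date"]: obs["value"] for obs in observations}
print(rural_by_year)
```

With a live call you’d use json.loads(r.text) (or r.json()) in place of the hand-written string.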

Sometimes you’ll need to give a website a password before you can get data from its API.  Here, you add a second parameter to requests.get, “authenticating” that it’s you using the API:

import requests
r = requests.get('https://api.github.com/user', auth=('yourgithubname', 'yourgithubpassword'))
dataset = r.text

That’s enough to get you started. The second half of this post is on scraping websites when you don’t have an API. In the meantime, please use the above to get yourself some more data.  Places you could start at include HDX and datahub.io (these are both CKAN APIs).

Ruby Day 8: Next

Ruby has gone now – “goodbye Ruby, Tuesday” is apparently becoming a popular song here.  But the cleanup work is only just starting.  Celina spends a lot of the day trying to get UAV stuff sorted out; we get word that the team is getting imagery in the worst affected areas, and she works on getting that data stored and back to the mapping groups that need it.

Response teams are moving into the field – many by boat because they can’t get flights. Requests still come in – one for an assessment of damage to communications and media stations (I suggest that Internews might have a list for this, then find that Agos is tracking communications outages). I stop with the lunchtime work and get back to the day job.

ACAPS puts out an overview map but still needs a rainfall map for it. Maning from the Philippines observatory just happens to have one (he’s been working on this for a while).

6pm: A government sit rep has data in it that needs scraping: I get out the pdf tools. Cometdocs dies on it, so I move over to PDFTables. We will slowly teach governments to release tables as datasets not pdfs. We will.  The ACAPS HIRA work starts. First, I have some pdfs to scrape. Except the sitreps from http://www.ndrrmc.gov.ph are PDFs with images of excel spreadsheets cut-and-pasted in. Start reaching out, trying to find out who has the original datafiles, but rein in when gently reminded that these things are political.

Ruby Day 7: Back to Work

7am. Today is a work day, and my deal with work was that I’d be working-from-Philippines, not bunking-off-to-do-disasters, so I’ll only be popping in and out of the chats and here from now.

Jean Cailton from VISOV (French VOST) has popped up overnight: many VOSTies are online working under the flags of other groups, which is kinda normal for mappers. Someone asks for rainfall data: the image a couple of days ago was based on TRMM data, so I wonder if NASA has daily updates to share (I already have scraper code for this from another project, and I remember that Lea Shanley is working on communities there now): Maning points out that the data is updated 3-hourly; is good. SBTF data is going to be posted on the Rappler map. And ACAPS puts out their first briefing note: volunteers are working with them, gathering secondary data (news, data etc) for the notes.  Teams have all settled down into a rhythm, so I don’t feel too bad about having to drop out of helping for the day.

11am. Eat breakfast, with triple espressos all round (twice). Stock up on buko juice (coconut water) and head back to work.  Typhoon is winding down:  it’s now rated as a severe tropical storm (yawn-worthy in this land of 20 typhoons a year), Borongan airport is open to receive disaster response teams and relief; Guiuan airport is being cleared, and “national roads affected by #RubyPH are now passable” (DPWH). Rappler people are reporting from some of the typhoon-hit areas (Dagami Leyte, Calabarzon, Camarines Sur) and connected to the SBTF team.  Investigate flights to Dumaguete tomorrow, to catch up with some old friends (and work from a local coffeeshop).

Do dayjob work today. Spend lunchtime being massaged: hunch over laptop 20/7 for a week has left me with two shoulder knots so big that I’ve given them names.  Try to book flight to Dumaguete: tomorrow’s flights get switched from airline to an online reseller whilst I’m trying to book, and the reseller’s site is down. *sigh*. Book flight for Wednesday – I haven’t come all this way *not* to check in on friends here.  It rains – drizzles – all day; at 5pm the sky is dark enough to need lighting on indoors, but there’s nothing more than a little light rain and wind going on outside. Skypechats are now silent, save for the occasional person coming in to start volunteering. Turning up 3 days later isn’t cool dudes: the volunteer help is usually needed as a disaster hits, not afterwards.

9pm.  Ruby isn’t quite done with me yet: find myself in another Skypechat, this time the one for helping ACAPS gather the background data needed for future event reports. Will be doing a bit each day on that for a couple of weeks, hopefully alongside lots of filipino volunteers too. It’s still raining a bit here in Manila, but if we didn’t know a typhoon (sorry, tropical storm) was passing overhead, we really wouldn’t notice it.

Ruby Day 6: Clicking starts


7am. Wake at 6:30 am – check self for hangover, then check overnight Skype / email traffic.  Andrej from UNOCHA has posted a whole pile of links to existing UNOCHA datastores (outside HDX), and a link to 2013 population estimates (up to now, we’ve only found 2010 figures).  Realise that there isn’t a long-list of datastores for the digital humanitarians (I’ve been sending links out, but they’re behind a group’s firewall).  And the HDX team now has a Ruby page at https://data.hdx.rwlabs.org/ruby.

The micromappers deployments have started (if you’re reading this and want something to do – that!).  I see a message that http://ourairports.com/countries/PH/ needs updating – Andrej has a link for that too.  Am asked to lead one of the remote mapping teams… point out that I might be a little short on internet soon.  One of the local mappers that the remote team has been worrying about has made it to a cellphone signal – he’s not only okay, he’s also volunteering in the regional disaster response office… yay, go local mappers etc!

This morning I’m planning to do a little micromapping (categorising incoming tweets) and scrape all the lat/long geolocation data from typhoon Pablo/Yolanda Ushahidi sites (I have code for this, will leave copy in github).  And drink lots of coconut water.

I do a little micromapping, classifying tweets.  This is still done by hand, even though the auto-classification algorithms are getting better with every disaster. The categories we’re sorting tweets into are:

  • Requests for Help / Needs – Something (e.g. food, water, shelter) or someone (e.g. volunteers, doctors) is needed
  • Infrastructure Damage – Houses, buildings or roads damaged, or utilities such as water and electricity interrupted.
  • Humanitarian Aid Provided – Affected populations receiving food, water, shelter, medication, etc. from humanitarian/emergency response organizations.
  • Other Relevant Information – Informative for emergency/humanitarian response, but in none of the above categories, including weather/evacuations/etc.
  • Not Informative – Not related to the Typhoon, or not relevant for emergency/humanitarian response.
  • N/A: does not apply, cannot judge – Not in English, not readable, or in none of the above categories.

I’m tempted to ask the micromapping team for a filter that removes any tweet with the word “pray” or “god” in it. Every time, this traffic happens: filipinos and americans start to pray; brits start with the dark humour.  At least I haven’t seen any spam on the hashtags yet – selling handbags and Ugg boots after a disaster seems to be a very popular online activity.

Start seeing requests from people who want to see what’s going on. I move list of websites about Ruby off a group site and into a HDX page. Do much annoying link-checking because urls don’t copy over.  Rose, who always has useful stuff to add, posts a list of typhoon landfall times (6 in total) in  PHT, EAT, GMT and EST times.  Timezones are both the digital humanitarians’ strength (always someone awake) and weakness (all those time conversions).

I hunker down with some Ushahidi API code.  I find a bug in my Ushahidi python code that’s been hanging around for a while: thank you Ruby, for supporting my dayjob. The code works (and is at https://github.com/bodacea/icanhaz/blob/master/rubytyphoon.py if you want to use it too) – the results are in HDX at https://data.hdx.rwlabs.org/dataset/geolocations-from-digital-humanitarian-deployments.

See lots of notes about missing aerial imagery; ask if anyone knows where SkyEye is operating today. Told to ask Celina (who’s asleep upstairs at the moment, exhausted).

10am. I learn stuff every time with these guys. Today it’s how easy reverse image searches are (photos from Haiyan are cropping up in the social media streams): use Google image search or tineye.com, or right-click in Chrome and use Search Google for this image.  My dayjob colleague Michelle checks in – she’s the only person there who’s asked how I’m doing, and none of my Ruby-related emails seem to have got through.  I guess they think I can look after myself.  Am doing what I can today because I’m back to the dayjob tomorrow – if you need anything from here, today is a good time to ask!

12pm. Brunch at Legaspi market. Hear people talking about the typhoon – calm, sounds more like gossip than panic.  Bump into a bunch of INGO folks who’ve just come out of their coordination meeting.  Eat great hot pad thai with homemade ginger beer; buy some christmas presents. Bump into agency person who needs some of the datasets that I’ve been helping with, and others from Andrej’s list this morning.

2pm. See yet more datasets hiding behind firewalls, some of them the data needed above.  Will keep trying to persuade people to share through HDX-linked googledocs, but watching replication between groups is annoying.  See that the HOT OSM instructions need a bit of gardening: try to create an account (I always forget my name and password there) but get “Account creation error. Your IP address is listed as an open proxy in the DNSBL used by OpenStreetMap Wiki. You cannot create an account”.   I try to remember my account name by hitting “forgot password”, but get “Your IP address is blocked from editing, and so is not allowed to use the password recovery function to prevent abuse”.  Wonder how often that happens to people in developing countries. The HOT OSM team are wonderful about this, with suggestions of how best to deal with it.

New link comes in: Philippines government’s Ruby page. Google crisis map on it includes all the shelters across the islands (from DSWD data) – is beautiful, but can’t get KML download to work to cross-check list against ones acquired earlier (and take down old datasets as needed).  Someone asks about pdf to spreadsheet conversion software (please stop with the pdfs guys – please!); we recommend tabula for the easy and cometdocs for the difficult ones. I run the doc through cometdocs and upload to HDX just in case anyone else needs it (poverty figures for the Philippines).

4pm. Link field team with GPS locations up with OpenStreetMappers – sit back and enjoy watching them coordinate together. Check work chats – find some sweet comments from teammates on my note about Ruby, which is much heartening: perhaps I’m not forgotten after all (and to be fair, had forgotten about an offer of help there a day or two ago).  Download evacuation centre list from Google crisis map, then convert it from kml to csv (the command to use is "ogr2ogr -f CSV evacuation_centres.csv Typhoon\ RubyHagupit\ Evacuation\ Center\ Map.kml -lco GEOMETRY=AS_XY"; there’s a great ogr2ogr cheat sheet at https://github.com/dwtkns/gdal-cheat-sheet). Try to go for a nap.

Darn. Message from Redcross: Google map is missing evacuation centres on a couple of islands. Look over the 2 lists: there are other centres on NGO lists that aren’t on map either. Am now very sleepy: post notes in chat, add new shelter names to googledoc, drop note in chat for Google team and ask Jus to help fix it. Celina wakes up and starts a bridge between the areas that OSM is missing satellite data for and the local UAV team.  Finally go for nap.

8pm. Wake up to see the audit trail behind the DSWD shelters map in one of the chats: everyone’s contributed a part to it, and mysteries are solved. See a request for a list of towns/villages that the storm has passed through; wonder if it’s possible to estimate that from a track+width (I know the storm walls are destructive, but not sure how far the damage area would extend). Also see a request for road blockage data.  Maning points to photos on a local news site (ABS-CBN); looks like home did after Sandy.

Darn again: Philippines government has UAV regulations and not all local companies have got the licenses from CAAP to fly post-typhoon (Andrej points at the UAV links on UNOCHA’s response page). Don’t tell me we have to break out the balloons and kites again? We check with the UAV teams: they’re cleared.

10pm. Pierre from OSM has the typhoon track on a map… off to turn that into a list of towns in a 50-mile swath. Am exhausted and eyes are closing reading the Overpass API docs (although the test window for this is cool): post task in SBTF’s geolocation window for another mapper to pick up. Celina is talking to the UAV guys and mapping teams, getting lists of places that imagery is needed for (damaged areas, yes, but also areas that were covered by cloud in the satellite images).  Go off to bed after midnight, leaving Celina still hunched over her laptop.

Ruby Day 5: data day


6am Everyone was exhausted last night – crashed out, went to bed early.  Woke at 5am today to a note about sushi and coconut juice in the fridge. Ate supperbreakfast/breakupper/supfast/whatever and checked in on the Skypechats.  Lots of messages overnight: OSM team is organised and churning through mapping tasks; DHN is oriented, the SBTF is gearing up ready, some lovely new maps on Reliefweb and a few more links to add to Joyce’s online list but they’re slowing down in frequency now.

I map some residential roads and rivers on http://tasks.hotosm.org/. It’s time to start thinking about datasets. Maning is already working on shelter geolocation, someone’s repurposed the Yolanda geonode site, and I need to start checking through HDX for anything useful/ needed for Ruby.

Dayjob meets weekend: am asked about the OSM humanitarian style on Ushahidi V2. Start trying to remember what the workaround for the Ebola maps was. It was Rob Baker writing a plugin: https://github.com/rrbaker/osm-humanitarian-tiles.  DSWD (the Philippines agency that coordinates shelters etc) is working on Crowdmap; I check the plugins list there to see if Rob’s one has made it in. It hasn’t yet.

Josh from google person finder is looking for people in-country to test their SMS tool… yay! Am in country!  Does anyone else need something testing out here?

8am. Looks like a nice day out there – warm, slightly cloudy. Not a clue in the sky that there’s anything approaching.  Matchmake some Sahana-based design work.  Disasters aren’t the best time to try something new, but they can focus people’s minds beautifully on what’s needed for the next one.

11am. Starting to get little gusts of wind here. A local mapping group (Eric) makes contact via UNOCHA, but the remote mappers they need to talk to are all asleep. We have to get better at hyperlocal mapping – getting people the tools and techniques they need then getting the heck out of the way unless they need extra bodies or skills.  We all start micromapping the damage to trees from the last storm here (“come look at coconut trees in the Philippines” – thank you, but I already saw a load from the car yesterday…) using imagery gathered by SkyEye, a great local UAV company. It’s frustrating, knowing that I could drive out and get corroborating imagery with my phone.  Results are being used as a training/test set for image classification algorithms at Simon Fraser Uni – it’s so good to see the bullshit “disaster application because it looks good in the publicity” work that used to happen in academia as cover for algorithm exploration or military work being replaced with genuine connections to people who actually need the help.

Dan checks in… he’s starting to worry about whether I’m safe; I say that right now it’s a lovely day and we’re all going to be fine.

4pm Normal life. Lunch at a hipster cafe that seems to have dropped intact out of midtown Manhattan.  Shopping at the market, visiting wine expert (margaux, yum), napping, spa.  Check in on Hagupit facebook page. (“We’re building a community of #RubyPH #Hagupit weather watchers and on-the-ground citizen journalists. Join up to share the latest info and meet people who want to make sure we’re all prepared!”).  Storm and power outage reports starting to come in (Tacloban, Dolores, Sulat, Calbayog).  Look through the SBTF micromappers deployment site.  More locals start connecting into the grid of international folks watching Ruby, including EngageSpark, who’ve been pushing out SMS messages.  I make a list of local crowd coordinators, but I’m not totally sure who to give it to.  There’s a new coordination skypegroup setting up – perhaps they all belong in there?

8pm. At dinner… everyone keeps popping away from the table to answer phones/emails and coordinate work.  See two datasets that should be out there for the digital humanitarians: evacuation centre list (messy, various forms, pdf) and expected storm surge heights in towns/cities.  Convert both into CSVs – team posts them up on HDX.  One of the dinner guests has a tech problem – field people using ODK on tablets who can’t get the data from them uploaded because they only have SMS available.  Ping the tech volunteer groups to see if anyone has an answer for them… Kate from OSM responds. Discover person with shelters list for 2 more areas – get promise of pdfs from them.

Ruby Day 3: OpenStreetMap’s birthday party


I couldn’t persuade any taxis to come out here for an early-morning pickup, so up at 5:30am to catch the 6am bus to Manila.  Well, buses: the first one breaks down, and I’m given enough money back from my ticket to find another bus to Manila. The other passengers wave one down, and I put myself back in the hands of the transport gods. Bus 3 is from Manila’s main bus terminal, a rundown old mall on the waterside. Most transport comes through there, and it already looks packed to capacity. 4 hours after I start, I get to the event: the last bus drops me in the middle of traffic (literally: 3 lanes to cross to get to safety) but the first car I see is an empty taxi – the transport gods have been good to me today.

Celina rocks the crowd at the OSM event (nobody leaves, which in Manila is its own form of applause); she also does the first hakuna-matata-style waving of a BRCK in the air to happen in the Philippines (she is beyond excited at having one; I’m beyond patience at trying to get a local data sim card to work for it). I talk about Ushahidi – I’ve carefully pitched my slides away from disaster response (and more towards civic community), but again everyone here is obsessed with disaster (and, to be fair, traffic).  I get some quality geek time with the SkyEye (local UAV company) team, and look over their fixed-wing foam UAV (1-1.5 hours flight time at 3 m/s, good range, and flies in decent winds). They’re great guys: we talk about drones and camera types (they’re working on a low-cost multispectral camera, which is a seriously big deal for checking the environment etc) and Public Laboratory’s work on sensors.

Ruby has become real: she’s now a cat-5 typhoon (real bad) heading straight for the islands.  Lots of people are fitting the camp in-between the many meetings that come with disaster preparation, mostly concentrating on pre-positioning resources. There’s a 2-day event on whether agencies have learnt the right lessons from Typhoon Yolanda. Somehow that seems a little moot now: there’s about to be a practical exam, and it’s a pretty high bar on pass/fail.  I put a note in the company chat that I might have a small issue with connectivity with a large typhoon coming and all but will try to stay safe; nobody responds.  I drop notes in the disaster data chats that I’m in the area if needed – friends respond and add me to groups.

I check the HOT tasks list for the Philippines.  There are still tasks on there (recent) mapping damage and infrastructure (roads etc) from the last storm.  At least one of them has been updated in the last few hours, but if the last storm isn’t finished yet, how are we going to prepare for the next one?

The local disaster website, PAGASA, is down – overloaded with people looking for information. I wonder how many were local, and whether it would be good to have a “locals-only” site somewhere to keep down the traffic.

We travel back to the Ecamp in time for Celina to pitch her disaster community ideas to the judges.  The competition is about education and disaster, and many of the pitches have woven both together. 

We show the Ushahidi/ MAVC team http://weather.com.ph/announcements/super-typhoon-hagupit-ruby-update-number-004  and make sure they’re booked into safe accommodation away from the storm.  It’s killing their plan to go walking on the coast at the weekend, but that isn’t the best idea right now (getting the hell out to a safe building is). We also reassure Matt, their colleague in Nairobi, that his team-mates are going to be safe.

As we leave for the night, there’s a crowd of people around a television, looking at a picture of the storm track.  We’re booked on the 5:30pm transport out tomorrow, but by that time the roads will be full of people heading from here to safety in Manila. Nobody here seems to be panicking about this yet.

We go out to dinner and Celina scores: after hours of looking for a taxi to get us off the coast early, the hotel has a car service that will take us (and up to 6 friends) away.

Ruby Day 2: E-Camp

This day is hectic. We start with an early-morning breakfast and presentation run-through, then head out to the event.  There are people here from Mongolia, Cambodia, Vietnam, the Philippines, India, Sri Lanka, Japan, Indonesia and the USA (erm… just me, aka the token white dude), all working together to improve education through communities, with tech as an enabler (not, note, as the end-point of each idea). There’s lots of work on citizen participation, and I have a great time hanging out with a mix of public-school teachers, government officials and designers. CheckMySchool is the working school-reporting system (SMS, parents, children, ownership et al) that I’ve seen other countries try to build, Bantay.ph is doing similar for government services, and one of the local telecoms companies (Smart) has done interesting work on tablet-based classrooms. The thing they all have in common is that they’re designed as systems, not technologies, and designed to be sustainable through creative use of communities and student grades (for instance, bantay.ph is part of the political science curriculum).  One comment that sticks is that agencies are easy to deal with but mayors don’t care about negative feedback – this devolved power could be an interesting issue for any other schemes rolling out across the country.

I help facilitate the Making All Voices Count brainstorming session on tech in education (I also talk about Ushahidi tools: of the 48 platform instances I’ve found in the Philippines so far, most are about disasters or traffic).  Corruption, bad governance and high workloads are big topics here.

I check in on Ruby. She still doesn’t seem so bad.

Ruby Day 1: To Tagaytay

I’m geeking out about disaster preparation and response today (I’m also doing my day job work, as part of a promise to work-from-Philippines for the next 2 weeks).  We talk about Ebola (I’ve been quietly doing bits where I can on the Ebola data response, and my friend is worried that with filipinos coming back from West Africa there’ll be an outbreak here too).  As I came in through the airport, there were Ebola screeners, but the early-morning flight from Tokyo was apparently considered low risk. Tracking an outbreak response across hundreds of islands would be a little different to, say, Sierra Leone – hospitals here are mostly private and unmapped, and transport estimates would be much more complex than the road time mapping that the OpenStreetMap crowd have been doing on the Ebola response recently.  A new event’s shown up on the radar for the weekend, using local drone images to map fallen coconut trees after Yolanda (and use this as a training set for algorithms).  I’m asked for a short talk at the OSM event.  Ruby isn’t really even a topic yet.

I have a little confusion about transport – the pickup in Manila was miles away (which in Manila traffic really is a lifetime) and earlier than I’d assumed (things can be organized but not communicated so much here), so after much back and forth, I’m booked on an afternoon transport from the airport. I take a taxi there… and get stuck in the traffic snarl-ups around the airport: it’s Christmas (a big deal, when lots of people come home from abroad) and the new airport skyway construction has reduced the lanes available, making any travel there slow and miserable. We’re 7km away with half an hour to go and no traffic movement, so we divert and drive out to Tagaytay. The toll roads provide relief; the local roads to the sides of us are moving more freely but are still packed. It’s not dirt roads, but the motorbikes and plywood-built roadside stalls remind me strongly of Africa (“Africa the country”, as we’ve been teasing the MAVC team).  There are slums at the roadside here – brick-built small houses jumbled up in a sea of tin roofs – they’re not going to be a good place to be in a strong storm.

Typhoon Ruby


When I came to the Philippines, my sister begged me to write a diary like the Tanzania one – a log of what I was doing and seeing that she could compare her own experiences in the area with.  But the past few days I’ve stayed with friends and worked with colleagues, and it somehow seemed wrong and less interesting to focus a diary on that.

But all the talks and sessions (and talk and session preparation) are over, and it occurred to me that people might just be interested in a diary about what it’s like to wait for, and then be here after, a supertyphoon (I know. D’oh).

When I left for the Philippines, I checked the weather forecasts.  It’s the anniversary of super typhoon Yolanda, but there weren’t any big warnings out about typhoon season, and it all felt pretty quiet.  I googled “typhoon” in the news, and saw a small piece about the Yolanda anniversary with a little note at the bottom that there was a storm tracking in that probably wouldn’t amount to much but would become Typhoon Ruby when it entered the Philippine Area of Responsibility (this seems to be a uniquely Filipino thing, this renaming of storms when they cross the border – my friends joked last night that it’s because they have so many storms they’ve run out of names).  I liked the name, so I put a note in my work’s chat area, joking about the irony (I’ve just finished teaching a Ruby on Rails class) and that I must have missed typhoons Php and Python already.

Then I forgot about it and did the 30-hour trip from New York to Manila via a night out in Tokyo and some yummy food that I will never be able to identify.

It’s Friday now, and I’ve been here since Monday. In that time, I’ve geeked out with the crisis-mapping friend I’d promised to come and visit here (I gave in after she kept posting me flight prices, and she scheduled an OpenStreetMap event at the same time as an Ushahidi-related one). Been to 2 massage spas ($10 for an hour of back-pounding that I still managed to fall asleep through). Eaten piles of Filipino food (Filipinos eat… and eat… but never seem to get fat). Spent lots of time in Manila’s crazy traffic jams. Taken an Uber car out to Tagaytay on the coast, for the eCamp education communities unconference. Clung onto my wheelie suitcase as it tried to roll out of the tricycle (motorcycle sidecar taxi) I was in. Promised to fly out and visit some crisis-mapping friends on other islands (airfares are really cheap). Mistaken the volcanic lake we’re staying next to (volcano inside volcano inside volcano: active) for a sea.

And kept quietly checking in on Ruby, possibly hoping that I might see a big storm over here. She looked like she was dying out for a while – moderating.  One article I looked at had 6-7 different predicted tracks for the storm, which looked odd until I spoke to people about it.  Apparently, that’s the way this storm is: unpredictable, moving, might track in here, might veer away at the last minute (although people are expecting this less, hour by hour).  It looked like storm Hagupit (“whip”) might never become “Ruby”.  And then she started to pick up strength.  Now she’s a cat 5 – the same strength as Yolanda, the typhoon that devastated huge areas of the Philippines last year.  And she’s heading straight for our hotel.

Whither crisismapping?

[Cross-posted from OpenCrisis.org]

Crisismapping has never been just about Twitter feeds; it’s always been about data.  But what data, and how do we know what’s useful?  I’ve been looking back over 4 years of archived data to start answering that one. 

In truth, I’ve been having a bit of an identity crisis.  I see all the “big data” work on social media feeds, and although I can swing an AWS instance and the NLTK toolkit like a data nerd, for me personally, that’s not where the value of crisismapping has been.

It’s been about the useful, actionable data, and about connecting the people who have it with the people who need it. And whilst some of that data lies hidden in Twitter streams and Facebook requests, most of it is already on people’s servers and hard drives, often in formats that can’t be combined or understood easily.

So, some first things that make a difference every time:

  • Rolodexes: knowing which response groups to follow, and who’s likely to bring what, helps.  3Ws are part of this – but before the 3W (who’s doing what where) comes the “who’s”.
  • GIS data: knowing where medical facilities, schools, roads and bridges are makes a difference.   Knowing what communications are available is important, so knowing where cell towers are helps too, but that can be too coarse-grained: using signal maps to find out which areas have cell coverage is often more useful.  For me, mapping cell towers is problematic for the same reason that mapping military bases is problematic: they’re both potential sources of help in a crisis, but they’re both critical infrastructure whose locations are potentially sensitive information.  But many maps include them (e.g. OpenSignal’s coverage maps).
  • Demographics: very useful data, but finding even population counts at sub-country levels can be difficult.  They’re usually there (except perhaps in countries like DR Congo, where surveying is difficult), but finding the “there” can be hard.   I’d add technology and social media use to demographics, because there’s no point sniffing Twitter if only 0.5% of the country (and mainly expats) use it – there used to be sites that listed, e.g., Facebook and Twitter usage percentages in each country, but they all seem to be behind paywalls now.

After that, it’s the emerging data: the 3Ws, the situation reports (both official, via news sources and on social media), the field notes about what’s happening.
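As an aside, the 3W itself is structurally trivial – a flat list of who/what/where records – and the hard part is collecting and reconciling it, not storing it. A minimal sketch in Python (the field names and the organisation and place names here are my own illustration, not an official humanitarian schema):

```python
# Hypothetical 3W ("who's doing what where") records, as they might be
# collated during a response.  All names and fields are illustrative only.
records = [
    {"who": "Org 1", "what": "medical supplies", "where": "Town A"},
    {"who": "Org 2", "what": "shelter kits",     "where": "Town A"},
    {"who": "Org 1", "what": "water trucking",   "where": "Town B"},
]

# A common first question: which organisations are active in each place?
by_location = {}
for r in records:
    by_location.setdefault(r["where"], set()).add(r["who"])

for place, orgs in sorted(by_location.items()):
    print(place, "->", ", ".join(sorted(orgs)))
```

The same flat structure pivots just as easily the other way (what is each organisation doing, and where?), which is why a shared, clean 3W is so much more useful than the scattered per-organisation lists it usually starts life as.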

We also now have 4 years of historical crisis data collected and collated by volunteers, often in areas prone to repeated crises, on top of the data already available through organisations and groups that existed before crisismapping was a “thing”.  I’m not entirely sure what the value of that data is to the next crisis (like wars, every crisis is subtly different), but it’s certainly worth working that out.