FIXIT: Ushahidi location lists

Today, a conversation amongst crisismappers went something like: “when are we going to migrate to IRC?” (this in response to yet another Skype idiosyncrasy getting in the way of the team), then “not until we have a decent interface for it on all platforms” and “we asked the weekend hackathons for this, but it’s not sexy: you can’t tell people you’re saving the world, or helping starving children with it”.  That’s a whole pile of cynicism and frustration, but behind it are three things: 1) mappers still don’t have all the tools they need, and are relying on people-as-processors to get round that, 2) mappers don’t know how to ask for tools in ways that get them what they need, and 3) hackathons may not be the best place to get those non-sexy tools built.

So. Options. Best is to find a good piece of existing open-source kit that almost meets mappers’ needs, and extend it with the things needed.  Less-good is to build all-new kit – mappers aren’t hackers, and new kit inevitably breaks and needs maintaining and training.  In-between is adding things onto proprietary kit using APIs (if they don’t get removed – yes, I’m looking at you, Skype team), and adapting existing open-source kit that doesn’t meet needs, but could be adapted (with the programming equivalent of a carefully-swung sledgehammer).  But that’s just options from the mapping teams’ perspective: another option is to team up with a coding company that wants to build tools for a similar (or adjacent) market.

I’m just as guilty of not documenting the things that I think need improving.  I’m used to writing “FIXIT” in my code when I know there’s a potential problem in it – so here’s the start of a set of posts about things that could be upgraded.  Some of them, I’ll start trying to fix – and document that in updates to each FIXIT blogpost.

 

ushahidi_find_location

There are lots of little things that bug me about the technologies I use in mapping.  One of them is repeating geolocation – specifically, having to find the same addresses multiple times in Ushahidi because I can’t upload a new list of addresses to the “find location” box (see above).  Now, I’m not going to go very far with this thought because it’s quite possible that someone in the Ushahidi community has already fixed this, but here’s the how-and-where (for Ushahidi version 2.7, which will be superseded sometime next year by 3.0):

  • When you press the “find location” button, Ushahidi calls (eventually) a function called geocode($address) in file application/helpers/map.php
  • This calls the Google Maps API: for instance, for the address “Mount Arlington”, it calls http://maps.google.com/maps/api/geocode/json?sensor=false&address=Mount%20Arlington (don’t worry about the %20: it’s the internet’s way of saying “space”).  The API call produces JSON data (go ahead and look at the page – it’s all there), which Ushahidi then pulls the country, country id, location name, latitude and longitude from.
  • Erm. That’s it.
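To make the parsing step concrete: Ushahidi’s geocode() is PHP, but here’s a Python sketch of the same idea, pulling the country and lat/lng out of a trimmed-down example of the JSON that endpoint returns (the sample values are illustrative, not real API output).

```python
import json

# A trimmed example of the shape of a Google geocode reply
# (the numbers here are illustrative stand-ins).
sample_response = json.dumps({
    "status": "OK",
    "results": [{
        "formatted_address": "Mount Arlington, NJ, USA",
        "geometry": {"location": {"lat": 40.9259, "lng": -74.6349}},
        "address_components": [
            {"long_name": "United States", "short_name": "US",
             "types": ["country", "political"]},
        ],
    }],
})

def parse_geocode(raw):
    """Pull location name, country and lat/lng out of a geocode reply."""
    data = json.loads(raw)
    if data["status"] != "OK" or not data["results"]:
        return None
    result = data["results"][0]
    # The country hides in address_components, tagged with type "country"
    country = next((c["long_name"]
                    for c in result["address_components"]
                    if "country" in c["types"]), None)
    loc = result["geometry"]["location"]
    return {
        "location_name": result["formatted_address"],
        "country": country,
        "latitude": loc["lat"],
        "longitude": loc["lng"],
    }

print(parse_geocode(sample_response))
```

That really is all geocode() does: one HTTP call, one JSON parse, a handful of fields extracted.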

A geolocation team on a deployment is responsible for finding lat/longs for location names.  They usually keep a list of these lat/longs and location names, and that’s usually in a Google spreadsheet.   So what we need here is:

  • An Ushahidi variable holding the address of the geolocators’ spreadsheet.
  • That spreadsheet to be in a recognisable form – e.g. it has a column each for latitude, longitude and placename.
  • A piece of code inserted into function geocode($address), that when the Google map API comes up blank, checks the Google spreadsheet for that location instead (or maybe checks the spreadsheet first: that depends on how the geolocation teams usually work). That piece of code will need to use the Google Docs API, which is possibly the hardest part of this plan.
  • Maybe even (shock, horror), check the *other* map APIs (openstreetmap etc) too.
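The fallback logic itself is small.  Here’s a hypothetical Python sketch of it: the “spreadsheet” is just a CSV string standing in for the geolocators’ sheet (the real version would fetch it via the Google Docs API), and api_lookup stands in for the existing map-API call.

```python
import csv
import io

# Stand-in for the geolocators' Google spreadsheet: one row per
# placename, with latitude and longitude columns (all values invented).
SPREADSHEET = """placename,latitude,longitude
Camp Alpha,18.5392,-72.3364
Old Ferry Crossing,18.5501,-72.2912
"""

def lookup_spreadsheet(address, sheet_csv=SPREADSHEET):
    """Case-insensitive placename lookup in the team's sheet."""
    for row in csv.DictReader(io.StringIO(sheet_csv)):
        if row["placename"].lower() == address.lower():
            return float(row["latitude"]), float(row["longitude"])
    return None

def geocode_with_fallback(address, api_lookup):
    """Try the map API first; fall back to the spreadsheet when it
    comes up blank.  api_lookup returns (lat, lng) or None."""
    return api_lookup(address) or lookup_spreadsheet(address)

# The API doesn't know "Camp Alpha", but the geolocators' sheet does:
print(geocode_with_fallback("Camp Alpha", lambda a: None))
```

Swapping the order (spreadsheet first, API second) is a one-line change, which is why that decision can be left to how each geolocation team works.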

None of this is horrendous, but it does take time to do.  Perhaps someone somewhere will think that it’s worth it.

Knowing Ourselves

Excuse me while I geek out on some data.  I’ve been wondering for a while about retention rates in volunteer groups.  And I just happen to have a lot of data about signups for one of them.  So I thought I’d start asking some questions.  The types of question that I want to start asking are:

  • What are the basic demographics for this group (ages etc)?
  • How many of these people were active this year?
  • What’s the geographical distribution… which countries are light on mappers, which timezones are light on mappers?
  • What’s the geographical distribution of active people?
  • How long do people stay before they drop out?

First, 80% of all data science (at the moment) is data cleaning, so I had to do a few things to make this possible.

Clean all the location information – there were two sets of location fields in the original data, and issues included:

  • not listing which countries they were in (generally USA people assuming we all knew that Ohio was in the US),
  • listing multiple countries (which is fair – development people often move between 2 or 3 ‘home’ sites).
  • America being a continent playing at being a country – it makes more sense to break US data into states, so we can see where on the continent people are distributed instead. That meant looking up state abbreviations (for Minnesota and friends) so the state column was consistent.
  • People living in a dependency (e.g. an island like Madeira) of a country (e.g. Portugal).
  • People who only gave their timezones as an address (also fair – it’s a way round declaring that you’re in a country hostile to mappers; also this only happened with US addresses).
  • People also got confused about US timezones (and I had to look them up too): there are 4 timezones in the contiguous United States – Pacific, Mountain, Central and Eastern (PST, MST, CST and EST).
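The cleaning steps above amount to a handful of normalisation rules.  Here’s a Python sketch of the kind of thing involved – the lookup tables are tiny illustrative stand-ins for the real ones:

```python
# Tiny stand-ins for the real lookup tables used in cleaning.
STATE_ABBREV = {"minnesota": "MN", "ohio": "OH", "new york": "NY"}
US_TIMEZONES = {"PST", "MST", "CST", "EST"}

def clean_location(raw):
    """Return (country, state) for a messy free-text location field."""
    text = raw.strip()
    # People who only gave a US timezone as an address
    if text.upper() in US_TIMEZONES:
        return ("USA", None)
    # Multiple 'home' countries: keep the first one listed
    first = text.split("/")[0].strip()
    # Bare state names imply the USA
    if first.lower() in STATE_ABBREV:
        return ("USA", STATE_ABBREV[first.lower()])
    return (first, None)

print(clean_location("Ohio"))        # bare state, country implied
print(clean_location("Kenya / UK"))  # multiple home countries
print(clean_location("EST"))         # timezone-only address
```

The real cleanup was done by hand in a spreadsheet, but each of the bullet points above is essentially one rule like these.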

US-Timezones.svg

European timezones are also less confusing than they look (unless you’re working out whether and when summertime occurs):

(blues = GMT; pinks = CET; yellow = EET; orange = FET!; green = Moscow time)

So I now have an anonymised file (a lot of the work above was to get the address fields to a state where they don’t give anything away), and can start feeding it into Tableau Public (which is free, so you can follow along if you want…).

  • First, I drop the “country” dimension into the middle of the Tableau view – this automatically gives me a map of the world with a dot on every country mentioned.
  • Then I select the “pie” mark types, and drop the “last visit” dimension onto “color” in the “marks” box. This turns the dots on the map into little pie charts, coloured for each year of “last visit”.
  • Then I play with the “size” button a little to get good-sized marks, fiddle with the colours a bit so they don’t show “never visited” as green, and produce this:
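Under the hood, those pies are just a per-country tally broken down by last-visit year.  Here’s a Python sketch of the same aggregation, with a few toy records standing in for the anonymised file:

```python
from collections import Counter

# Toy records standing in for the anonymised signup file.
records = [
    {"country": "USA", "last_visit": "2012"},
    {"country": "USA", "last_visit": "2011"},
    {"country": "USA", "last_visit": "2012"},
    {"country": "Kenya", "last_visit": "never"},
    {"country": "Kenya", "last_visit": "2012"},
]

# Per-country counts by year of last visit: the pie slices.
by_country = {}
for r in records:
    by_country.setdefault(r["country"], Counter())[r["last_visit"]] += 1

for country, counts in sorted(by_country.items()):
    total = sum(counts.values())  # this total drives the "size" mark
    print(country, total, dict(counts))
```

The slice proportions are what the first map shows; the totals are what the “number of records” size encoding adds in the next step.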

ning_visits_dated

But that’s just telling me the percentage dropout rates per country. What about absolute rates? So I drop “number of records” onto the “size” box, and get

ning_visits_sized

Okay. That’s a lot of Americans. And a lot of countries with very few mappers in them.  But maybe it’s a lot of Americans because it’s a big place… a continent labelled as a country. So before I stop looking at this data, I have a look at the numbers by US state… I click on country, then filter, and select only the USA, then I drop the “state” dimension onto the map, and exclude Hawaii (sorry guys: I know there are two of you over there, but it was messing up the map) to give:

ning_visits_sized_USA

 

Yay! Go New Yorkers!  Or put less emphatically – there appear to be clusters of mappers, who might be local mapping groups.   Looking at those neat pie charts, I start wondering what the growth rate is like in each country – i.e. how many people joined the group when.  And about how long they stayed active after they joined. But first, a question about retention: plotting year and quarter of the mappers’ first visit to the site against their last visit looks like this:

ning_join_vs_leave

A few things to say about this.  First, the list of people who’ve never visited the community site just stops after 2011 – probably because a site visit is part of the joining process.  The expected batch is there too: people who look at the site in the quarter they join, then ignore it after that (the diagonal line of bigger dots). But this just tells me who dropped out when… what I really want to see is a simpler graph of how long people have stayed, and whether the date they joined is related to this in any way.  I can’t quite get Tableau to do that yet,  but I’m working on it…
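The quantity I’m after is easy enough to define, even if Tableau won’t plot it for me yet: the number of quarters between a mapper’s first and last visit.  A minimal sketch, assuming quarters are already parsed into (year, quarter) pairs:

```python
def quarters_between(first, last):
    """Quarters from first visit to last visit; 0 means they
    joined and left in the same quarter (the diagonal line)."""
    (fy, fq), (ly, lq) = first, last
    return (ly - fy) * 4 + (lq - fq)

# Joined 2010 Q3, last seen 2011 Q2:
print(quarters_between((2010, 3), (2011, 2)))
```

Histogram that per joining cohort and you get the “how long do people stay, and does join date matter” graph directly.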