Indices and headings in data.un.org

Data.un.org is the UN Statistics Division’s Internet-accessible repository for data. It’s potentially incredibly useful to the world, but is lagging behind sites like data.worldbank.org and data.gov because it doesn’t have a machine API, it isn’t normalized (the datasets aren’t in a form that makes them easy to use with each other) and there are spelling errors and inconsistencies in some of the headers and indices, notably in the names used for geographical locations (e.g. “USA” and “United States” are both used).

I’ve written already about how to access the data.un.org datasets from an external application and about the types of headers and indices within it. Now it’s time to make the geographical references more usable.

I looked at 5,577 csv files in the data.un.org dataset, automatically excluding the footnotes at the end of each dataset. Some data.un.org data was excluded from the investigation: the UN interface limits downloads to 50,000 rows of data, so 159 files in the set are incomplete, and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. That leaves 21,195,188 rows of data in the remaining dataset, so much of the following work had to be automated.

I collected the indices and headers from all the datasets into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections mapping the data.un.org geographical indices onto both the ISO3166 standard (country names) and UNSTATS’ “Country and Region Codes for Statistical Use” list of the region, country and economic group names used on data.un.org.
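
For concreteness, the collection step boils down to scanning every csv file, keeping the header row and the values in the geography columns. A minimal Python sketch, in which the download directory, the set of geography headers and the footnote cut-off rule are all my assumptions rather than the actual data.un.org layout:

    import csv
    import glob

    GEO_HEADERS = {"Country or Area", "Reference Area", "City"}   # assumed subset of headers

    headers = set()            # every column name seen across the files
    geo_index_values = set()   # every value seen in a geographical column

    for path in glob.glob("undata/*.csv"):   # assumed local download directory
        with open(path, newline="", encoding="utf-8", errors="replace") as f:
            reader = csv.reader(f)
            header = next(reader, None)
            if not header:
                continue                     # skip empty files
            headers.update(header)
            geo_cols = [i for i, h in enumerate(header) if h in GEO_HEADERS]
            for row in reader:
                if not row or not row[0]:    # crude footnote/blank-line cut-off (assumption)
                    break
                for i in geo_cols:
                    if i < len(row):
                        geo_index_values.add(row[i])

    print(len(headers), "distinct headers;", len(geo_index_values), "distinct geographical index values")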

Geographical references in the headers are:

  • country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area.
  • OID.
  • WMO station number, station name, national station id number.
  • City.
  • Area, residence area, city type.

The GIS naming standards found in the dataset were:

  • Country names: Most of the indices and country references are a close match to the UNSTATS standard or ISO3166 (as used by the World Bank etc), although country names in particular are very inconsistent in this dataset. The indices discussion below gives instructions on how to correct all of the country and region names to either of these standards. This applies to the headers country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin and reference area.
  • OID: International Monetary Fund’s internal GIS standard.
  • WMO station number, Station Name, National Station Id Number: World Meteorological Organisation references to meteorological stations.
  • City: UNSD Demographic Statistics (code: POP) includes city names. No standard has yet been identified for these names.
  • Area: UNSD Demographic Statistics (code: POP) and World Health Organisation (code: WHO) use the headings Area and Residence Area to classify the geographic extent of coverage (e.g. “Total”, “Urban”, “Rural”). UNSD Demographic Statistics also uses the heading “City type” to subclassify cities too (e.g. “City proper”, “Urban agglomeration”).

Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). Looking at the placenames in the indices, we see that they are a mix of country, region and economic group names, with different spellings and formats for similar names (e.g. “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of” etc).

Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org.

  • ISO3166 is a widely-used standard, but contains codes for countries and their subdivisions only (i.e. it has no official lists of larger regions or economic areas). It is published as tables online and is available (although without the list of withdrawn codes) in the Python library pycountry (see the lookup sketch after this list).
  • The UNSTATS list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an HTML table at http://unstats.un.org/unsd/methods/m49/m49regin.htm. The countries list is available as a ScraperWiki dataset which needs some editing to make it usable. The regions list has been scraped by hand for now. There are two main lists in it: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities. For instance, the Netherlands Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states. Luckily it has a country code, so it’s been included in the list of countries.
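
To illustrate the pycountry route mentioned in the first bullet, here is a minimal lookup sketch. It relies only on the .name and (where present) .official_name attributes, since other attribute names vary between pycountry releases, and it deliberately returns None for regions and withdrawn countries, which ISO3166 doesn’t cover:

    import pycountry

    # Build a lookup from lower-cased ISO3166 names to the canonical short name.
    iso_names = {}
    for country in pycountry.countries:
        iso_names[country.name.lower()] = country.name
        official = getattr(country, "official_name", None)
        if official:
            iso_names[official.lower()] = country.name

    def to_iso_name(raw):
        """Return the ISO3166 short name for a data.un.org index value, or None."""
        cleaned = raw.strip().rstrip("+").strip().lower()   # '+' marks regions in some files
        return iso_names.get(cleaned)

    print(to_iso_name("YEMEN"))          # -> 'Yemen'
    print(to_iso_name("Netherlands"))    # -> 'Netherlands'
    print(to_iso_name("West Africa"))    # -> None: regions aren't in ISO3166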

The indices were checked against both these standards. Suggested improvements to the indices and standards include:

  • Make the regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
  • Make the economic status list available as a csv file online.
  • Lobby ISO to create a region (Africa, West Africa, North America etc.) code standard, if it doesn’t already exist.
  • Lobby ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
  • Make a definitive statement about which GIS naming standard (ISO, UNSTATS etc.) UN online data should attempt to adhere to.
  • Change all the data.un.org datafiles to meet this standard.

Against the ISO3166 standard, the data.un.org csv index errors were:

  • Withdrawn countries with no ISO3166 code: “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People’s Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
  • Abbreviation, e.g. “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
  • Added markers, e.g. “+” added to the end of region names, to differentiate them from countrynames.
  • Capitalisation, e.g. “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the” and “the” for “The”.
  • Brackets: UNICEF in particular uses brackets “()” instead of commas in placenames.
  • Standards confusion: the ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
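
Most of those error classes are mechanical enough to correct in code before attempting a lookup. Here is a rough Python sketch of the kind of normalisation involved; the substitution table is a small illustrative subset, not the full correction list used on the real dataset:

    import re

    # Illustrative substitutions drawn from the error classes above; the real
    # correction table is much longer.
    ABBREVIATIONS = [
        (r"\bRep\.", "Republic"),
        (r"\bSt\.", "Saint"),
        (r"\bIsds?\b", "Islands"),
        (r"\bIs\.", "Island"),
        (r"\s*&\s*", " and "),
    ]

    def normalise_placename(name):
        name = name.strip().rstrip("+")                  # '+' marks regions in some files
        name = re.sub(r"\s*\((.+)\)", r", \1", name)     # "Iran (Islamic Republic of)" -> "Iran, Islamic Republic of"
        for pattern, replacement in ABBREVIATIONS:
            name = re.sub(pattern, replacement, name)
        if name.isupper():                               # "YEMEN" -> "Yemen"
            name = name.title()
        return re.sub(r"\s+", " ", name).strip()

    print(normalise_placename("YEMEN"))                 # Yemen
    print(normalise_placename("St. Helena"))            # Saint Helena
    print(normalise_placename("Turks & Caicos Isds"))   # Turks and Caicos Islands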

Tables of other misspellings against both standards are given below. Some of these errors are the use of familiar names (e.g. Brunei, Ivory Coast, China) or issues with character translation (e.g. Cote d’Ivoire). Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether “St. Helena” refers to just the island of Saint Helena or to “Saint Helena, Ascension and Tristan da Cunha”, and whether “Palestine” and “Palestinian Territories” refer to “Palestinian Territory, Occupied”.

Big data – wozzat?

So what is this big data thingy?

Big data has become a hot topic lately. The people who deal with it (“data scientists”) have become much in demand by companies wanting to find important business insight in amongst their sales data, twitter mentions and blogposts.

Which confuses three different concepts.

  • Big Data is defined as the processing of data that’s larger than your computer system can store and process at once. It doesn’t matter so much where the data is from – it’s more important that it’s too huge to handle with ‘normal’ processing methods.
  • Social media mining looks for patterns in the posts, tweets, feeds, questions, comments and everything else that people leave all over the Internet. It’s a logical consequence of Web 2.0, the idea that we could not only read what people put on their websites, but contribute our thoughts to it too.
  • Data analysis looks for patterns in any data. It doesn’t have to be big data (though it might be), and it doesn’t have to come from Internet use (although it might be that too). It’s just data, and the tools that have been used to understand meaning and find insights in data still apply. The data scientists have a saying “everything old is new again”, and it’s lovely to see a whole new generation discover Bayesian analysis and graphs as though they were shiny new super-exciting concepts.

So what are the data scientists trying to do here, and why is it special?

Well first, a lot of data scientists are working on Internet data. Which is why big data and Internet data often get confused: a collection of blogs, tweets etc can be seriously big – especially if you’re trying to collect and analyse all of them.

And they’re analyzing the data and using the results to help drive better business decisions. Yes, some people do this for fun or to help save the world, but mainly it’s popular because better decisions are worth money and those analysis results are big business differentiators.

Which is great until you realize that up ‘til now most of those decisions were made on data from inside the company. Nice, structured, controlled, and often quite clean (as in not too many mistakes) data, often stored in nice structured databases. Which is not how the Internet rolls. What data scientists often end up with is a mix of conventional structured data and data with high structural variance: data that looks kinda structured from a distance (time, tweeter, tweet for example) but has all sorts of unstructured stuff going on inside it. Sent from a mixture of conventional systems and devices. That companies often ask to be analysed in the same way they’re already analysing their structured data.

So, alongside the usual corporate data, we now have 3 new types of data that we can access and process: structured data stored in warehouses, unstructured internet-style data (blogs, tweets, sms) and streams of information.

Let’s back up just a little. To do analysis, you need a question, some data and some tools (or techniques, if you will). It also helps to have someone who cares about the results, and it’s even better if they care enough to explain what’s important to them and why.

The Question

First, the question. Asking the right question is difficult, and often an art. Sometimes it’ll be obvious, sometimes it’ll come from staring at a subset of the data, sometimes the question will be given to you and you’ll have to hunt for the data to match. We’ll talk about the question later.

Handling the Data

So we have a question and some data. And if the data is big, this is where the Big Data part of the story comes in. If you suddenly find yourself with data that you can’t analyse (or possibly even read in) using the computing resources you have, then it’s big data and your choices (thank you Joseph Adler) are:

  • Use less data. Do you really need all the data points that you have to make those business decisions (try reducing to a statistically significant sample, or to just the points that are important to you)? Do you really need all the variables you’ve collected (do a sensitivity analysis)? Are there repeats (e.g. Twitter retweets) in your dataset (tidy it up)?
  • Use a bigger computer. You’ll need to both store and process the data. “The cloud” is a generic term for storage that’s outside your home or office but that you can still access from wherever you want (e.g. over the internet). Amazon Web Services is a prime example of this; other cloud storage includes Microsoft Azure (SQL datastore), Cassandra (BigTable-style datastore), Buzz Data, Pachube (primarily storage for sensor outputs, a.k.a. the Internet of Things), Hive (data warehouse for Hadoop) and sharded databases.
  • Use parallel processing across multiple computers. A popular pattern for this is map/reduce, which splits data into chunks that are each processed by a different machine before the results are combined (see the toy sketch after this list). Places where map/reduce is available include Hadoop, which also has a higher-level language, Pig, that compiles down to map/reduce instructions.
  • Get smart. Get lateral about the problem that you’re trying to solve (see any good statistics textbook for ideas).
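
As a toy illustration of the map/reduce shape (split the data into chunks, process each chunk independently, then merge the results), here is a word-count sketch using Python’s multiprocessing. A real big-data job would run on something like Hadoop rather than a single machine, but the shape is the same:

    from collections import Counter
    from multiprocessing import Pool

    def map_chunk(lines):
        """Map step: count words in one chunk of text."""
        counts = Counter()
        for line in lines:
            counts.update(line.lower().split())
        return counts

    def reduce_counts(partials):
        """Reduce step: merge the per-chunk counts into one result."""
        total = Counter()
        for partial in partials:
            total.update(partial)
        return total

    if __name__ == "__main__":
        # Pretend each chunk came from a different shard of a much bigger dataset.
        chunks = [
            ["food prices rising in yemen", "fuel prices stable"],
            ["food riots reported", "finance ministers meet on food prices"],
        ]
        with Pool() as pool:
            partials = pool.map(map_chunk, chunks)   # each chunk processed in parallel
        print(reduce_counts(partials).most_common(3))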

The processing

And then we have the techniques part of the equation (sorry – couldn’t resist the pun). Again a post for later – there are many tools, packages and add-ons out there that make this part of the process easier.

Explaining the results

If you’re doing big data analysis, you’re doing it for a reason (or you really like fiddly complex tasks). And the reason is often to increase the knowledge or insight available to an end user. For this, we often use visualisations. Which is another post for later.

Creating humanitarian big data units

Global Pulse has done a fine job of making humanitarian big data visible both within and outside the UN. But it’s a big job, and they won’t be able to do it on their own. So. What, IMHO, would another humanitarian big data team need to be and do? What’s the landscape they’re moving into?

Why should we care about humanitarian big data?

First, there’s a growing body of evidence that data science can change the way that international organisations work, the speed at which they can respond to issues and the degree of insight that they can bring to bear on them.

And NGOs are changing. NGOs have to change. We are no longer organizations working in isolation in places that the world only sees through press releases. The Internet has changed that. We’re now in a connected world, where I work daily with people in Ghana, Colombia, England and Kazakhstan. Where a citizen in Liberia can apply community and data techniques from around the world, to improve the environment and infrastructure in their own cities and country.

We have to work with people who used to be outsiders: the people who used to receive aid (but are now working with NGOs to make their communities more resilient to crisis), and we have to work with data that used to be outside too: the tweets, blogposts, websites, news articles and privately-held data like mobile money logs and phone top-up rates that can help us to understand what is happening, when, where and to whom.

UN Global Pulse was formed to work out how to do that. Specifically, it was set up to help developing-world governments use new data sources to provide earlier warnings of growing development crises. And when we say earlier, we mean that in 2008 the world had a problem. Three crises (food, fuel and finance) happened at once and interacted with each other. And the first indicator that the G20 had was food riots. The G20 went to the UN looking for up-to-date information on who needed help, where and how. And the UN’s monitoring data was roughly 2 years out of date.

What have we done so far?

So what are the NGOs and IAs doing so far? The UN has started down the route to fix this with a bunch of data programs including Global Pulse and FEWSnet. Oxfam connected up to hackathons last month; the Red Cross has been there for a while. The World Economic Forum has open data people, as does the World Bank. And other groups as different as the Fed and IARPA are investigating risk reduction (which is the real bottom line here) through big data techniques.

What should we be doing?

But what do the NGOs need to do as a group? What will it take to make big data, social data, private data, open data and data-driven communities useful for risk-managing crises?

1. First, ask the right questions.

When you design technology, the first question should be “what is the problem we’re trying to solve here?” Understand and ask the questions that NGOs do and could ask, and how new data could help with them. There is data exhaust, the data that people leave behind as they go about their lives: focus on the weak signals that occur in it as crises develop. Reach out to people across NGOs to work out what those questions could be.

2. Find data sources.

We cannot use new data if we don’t have new data.

Data Philanthropy was an idea from GFDI to create partnerships between NGOs, private data owners like the mobile industry association GSMA and other data-owning organisations like the World Economic Forum. Data Commons was a similar idea to make data (or the results of searches on data – we want to map trends, not individuals) available via trusted third parties like the UN. It’s gone a long way politically, but there’s still a lot of work to be done on access agreements, privacy frameworks and data licensing.

Keep encouraging the crisismapping and open data communities to improve the person-generated data available to crisis responders, to improve the access of people in cities and countries to data about their local infrastructure and services, and to voice their everyday concerns to decision makers (e.g. via Open311). Encourage the open data and hacker movements to continue creating user-input datasets like Pachube, Buzz and CKAN. All this is useful if you want to understand what is going wrong.

3. Find partners who understand data.

Link NGOs to private organisations, universities and communities who both collect and process new types of data. Five of these partners recently demonstrated Global Pulse-led projects to the General Assembly:

    • Jana’s mobile phone coverage allowed us to send a global survey to their population of 2.1 billion users in over 70 countries. There are issues with moving away from household surveys that need to be discussed, but it allowed us to collect a statistically significant sample of wellbeing and opinion faster and more often than current NGO systems (an authoritative survey I read recently had 3,500 data points from a 5,000,000-person population. Statistical significance: discuss).
    • Pricestats used data from markets across Latin America to track the price of bread daily rather than monthly. Not so exciting in ‘normal’ mode or in countries where prices are regularly tracked. Incredibly useful during recovery or for places where there is no other price data gathered.
    • The Complex Systems Institute from Paris tracked topics emerging in food-security-related news since 2004. This showed topic shifts from humanitarian issues to food price volatility (with children’s vulnerability always being somewhere in the news). More of a strategic/opinion indicator, but potentially incredibly useful when applied to social media.
    • SAS found new indicators related to unemployment from mood changes in online conversations – several of which spiked months before and after the unemployment rate (in Ireland and the USA) changed. This gave new indicators of both upcoming events and country-specific coping strategies.
    • Crimson Hexagon looked at the correlation between Indonesian tweets about food and real food-related events. The correlations exist, and mirrored official food inflation statistics. Again, useful if gathered data isn’t there.

And reach out to the communities that are forming around the world to process generated data, from the volunteer data scientists at Data Without Borders to the interns at Code for America and the GIS analysis experts connected to the Crisismappers Network.

4. Collect new data techniques and teach NGOs about them.

There is a whole science emerging around the vast ocean of data that we now find ourselves swimming in. It has many names, Big Data and Data Science being just two of them, but it’s basically statistical analysis of unstructured data from new sources including the Internet, where that data is often very large. Learn about these techniques, play with them (yes, play!), and teach people in NGOs how to use them. The list of things you probably need to know includes data harvesting, data cleaning (80% of the work), text analysis, learning algorithms, network analysis, Bayesian statistics, argumentation and visualization.

And build a managed toolkit of open-source tools that NGOs and analysts in developing countries can use. For free. With support. Which doesn’t mean “don’t use proprietary tools” – these have a major part to play too. It just means that we should make sure that everyone can help protect people, whatever funds they have available.

5. Design and build the technologies that are missing.

Like Hunchworks. Hunchworks is a social network-based hypothesis management system that is designed to connect together experts who each have part of the evidence needed to spot a developing crisis, but don’t individually have enough to raise it publicly. It’s a safe space to share related evidence, and give access to the data and tools (including intelligent agents automatically searching data for related evidence) needed to collect more. It’s still in alpha, but it could potentially help break one of the largest problems in development analysis: namely, the silos that form between people working on the same issues and the people that need to see their results.

6. Localize.

Build labs in developing countries. Build analysis capacity amongst communities in developing countries. People respond differently to economic stress, and environments, data sources and language needs are different in different countries. The labs are there to localize tools, techniques and analysis, and to act as hubs, collectors and sharing environments for the types of minds needed to make this work a reality. No one NGO can afford to do this in all countries, so connections between differently-labelled labs will become vital to sharing best practice around the world.

7. Publicise and listen.

Be there at meetups and technology sessions, at hackathons and in Internet groups, listening and learning to do things better. And never ever forget that this isn’t just an exercise. It’s about working better, not building cool toys – if the answer to a problem is simple and low-tech, then swallow your pride and do it – if the answer is to share effort with others to get this thing working faster to protect people around the world, then do that too. We do not have the luxury of excessive time or meeting-fuelled inaction before the next big crisis strikes.

Project idea: mining crisis googlegroups

The Problem

Let’s start with “what is the problem I’m trying to solve here”.

I’m responsible for designing Hunchworks. I’m also responsible for making it what the users need, and for understanding and fitting into the systems that they already use to do hunch-style collaborative inference. Now, we know a bit about business analysis, so we’ve spent a while looking for these existing systems, and found them mainly in two places: closed discussion groups (e.g. Googlegroups) and Skype group chats.

I’ve traced (and anonymised) a couple of threads by hand – the main point is the type of information that people post rather than who posted what – but whilst I was doing it I realised there were mining tools that could help with this, and with my usual problem of not being able to find information on X in a Googlegroup after it’s posted (no, the search doesn’t work for this). Also, any visualisation or compression could be extremely useful to people trying to navigate the thousands of emails and hundreds of topics that thread their way through the crisismapping googlegroups.

Getting the Data

So to work. The first thing I needed was to get Googlegroups data onto my machine. There’s an RSS feed for group NameHere at http://groups.google.com/group/namehere/feed/rss_v2_0_topics.xml but that appears to cover only new posts. A Google search on “mining google groups -site:groups.google.com” didn’t show anything, but I did get a useful journalists’ guide to Google’s search operators as a consolation prize. And then paydirt: a PHP Googlegroup scraper that produces xml, which I’ve modified to dump the xml into a file.

The data is about a group, so it has people (in username and cc fields), topics (in subject field), areas of interest (in the subject and text, so I’ll have to do some text analysis to get these) and times/dates.
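
As a sketch of the next step, this is roughly how I’d pull those fields out of the xml dump with Python’s ElementTree. The element and attribute names here (post, author, subject, date) are placeholders, since the scraper’s real output format may well differ:

    import xml.etree.ElementTree as ET

    # Placeholder structure: <posts><post author="..." subject="..." date="...">text</post></posts>
    tree = ET.parse("group_dump.xml")    # file produced by the modified scraper
    people, topics, posts = set(), set(), []

    for post in tree.getroot().findall("post"):
        author = post.get("author")
        subject = post.get("subject")
        people.add(author)
        topics.add(subject)
        posts.append({
            "author": author,
            "subject": subject,
            "date": post.get("date"),
            "text": (post.text or "").strip(),
        })

    print(len(people), "people,", len(topics), "topics,", len(posts), "posts")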

As an aside, it would be really cool if Google allowed tagging on these posts. That way, people could come in later and add tags to each post, to make searching by area or subjects much easier.

Knowing which questions to ask

What I’m trying to understand at first is how people communicate data around an emerging crisis. Which people get involved, do they bring each other into the group to help, what types of evidence do they provide and which external links do they point at and when.

Another question is “are there cliques in this group” – not, “are there groups that don’t talk to each other”, but “are there natural groups of people who cluster around specific topic types or areas”.  And “who are the natural leaders – who starts the conversation on each topic, who keeps the conversations going”.

More subtle things to find are ‘hubs’ (people who connect other people together), attachments/links (so we can see what’s being linked to outside the group pages) and information flows between people and affiliations.

So first, what else can I do with the raw data?  Track threads?  Track related subjects?  Pick out geographical terms, for instance?

Let’s start with the people, and how they connect together. Let’s assume that all people on the same thread are linked, or possibly that people who reply to each other are linked. A Google search on “mining relationships in email” proves useful here… it turns out there’s something called “relationship mining” in data warehousing already. And the simplest thing to do with this is a histogram of people who posted, or who sent the first post in a thread.

But… but… most of the literature on mining emails assumes that there are 1:1 links between people, i.e. each email has a sender and one or more recipients. What we’re looking at here has a sender, the whole group as recipients, and maybe a couple of other outside-the-group people added as recipients too. So we don’t have direct links between people: we have the group as an intermediary. So how do we know when two people are engaged with each other via the group? One answer is “reply-to” – i.e. when one user replies to another user’s posting. This will take a little work in the scraper, but could be a useful way of establishing chains of people through the group.
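
To make that concrete, here is the simplest version in Python: a histogram of posters and thread starters, plus the reply-to links, assuming the scraped posts have been reduced to (thread_id, author, reply_to_author) tuples. The field names and sample data are mine, not the scraper’s:

    from collections import Counter

    # Hypothetical rows from the scraped group: (thread_id, author, reply_to_author)
    posts = [
        ("t1", "alice", None),        # alice starts thread t1
        ("t1", "bob", "alice"),
        ("t1", "carol", "bob"),
        ("t2", "bob", None),          # bob starts thread t2
        ("t2", "alice", "bob"),
    ]

    post_counts = Counter(author for _, author, _ in posts)
    thread_starters = Counter(author for _, author, reply in posts if reply is None)

    # Edges of the reply-to graph: who engages with whom via the group.
    reply_edges = Counter((author, reply) for _, author, reply in posts if reply is not None)

    print("posts per person:", post_counts.most_common())
    print("threads started: ", thread_starters.most_common())
    print("reply-to links:  ", reply_edges.most_common())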

This is just the start of a chain of thought that is currently moving at the speed of code. There will be more, just not for a little while…

Project idea: visualising crisismapping categories

I’m a crisismapper. I’ve seen or worked on most crisis Ushahidi maps since January 2010, and I’ve watched the categories used on them split and evolve over time (from Haiti’s emergencies and public health to Snowmageddon’s problems / solutions and beyond).

Cat from Humanity Road has kept a screenshot of the categories on each of the main deployments since then – when I saw it today, it reminded me immediately of work that Global Pulse did on visualising category evolution across news articles, with articles as nodes connected by subject, and coloured by main category, as discovered by gisting the articles.

Except this time, we don’t have to guess the categories (although doing that later by mining text in the reports in each category could be fun).  Each Ushahidi map comes with a set of categories – each report (piece of geolocated information) is tagged by the categories it belongs to.

So. The simple version of the visualisation goes like this (and I’m thinking that Processing should be able to handle much of it). The y-axis is a timeline. Each deployment is a column of categories, represented as graph nodes: we then draw lines from deployment(n) category(x) to deployment(n+1) category(y) if they’re related.

Working out relations between categories isn’t totally trivial. Language differences (e.g. Arabic to English), generalisation/specialisation and synonyms all happen, so we might need to build a category ontology to handle this. Line weights could be used to show the strength of connection between pairs of categories. Ushahidi also uses sub-categories on some deployments, so we’ll have to work out how to group these – perhaps by colour, perhaps by putting a box around them.
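
Here is a data-structure sketch of that deployment/category graph in Python, before worrying about layout. The deployment names and categories are made up for illustration, and difflib string similarity stands in for the real relatedness measure (which is where the category ontology above would come in):

    import difflib
    import networkx as nx

    # Illustrative category lists per deployment, in time order.
    deployments = [
        ("Haiti 2010", ["Emergencies", "Public Health", "Security Threats"]),
        ("Snowmageddon 2010", ["Problems", "Solutions", "Public Health"]),
    ]

    G = nx.DiGraph()
    for name, categories in deployments:
        for category in categories:
            G.add_node((name, category))

    # Link categories in consecutive deployments, weighted by a crude similarity score.
    for (name_a, cats_a), (name_b, cats_b) in zip(deployments, deployments[1:]):
        for a in cats_a:
            for b in cats_b:
                score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
                if score > 0.6:                  # arbitrary relatedness threshold
                    G.add_edge((name_a, a), (name_b, b), weight=score)

    for u, v, data in G.edges(data=True):
        print(u, "->", v, round(data["weight"], 2))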

Making it pretty: we can use some standard tools (like matrix normalisation) to minimise the number of crossed lines between deployments.

So you’d have a meta-graph where the nodes are deployments, and inside each of those you’d have supercategories and categories.   One variant of this is to visualise all the deployments as metanodes in free space, i.e. scattered all over the place, and place the strongest-related deployments (by categories) closest to each other. I don’t know if there are packages that allow drilling into metanodes, i.e. to click on a metanode and view all the links from its categories out to other metanodes, but that could be fun too.

Once we have the connectors (e.g. into Ushahidi’s publicly-accessible reports on CrowdMap), all sorts of other cool visualisations suddenly become possible too. But finally I need to ask the zeroth question about a project, i.e. “what is the problem we’re trying to solve here” (first question is “is there any value in it”).  I think what I’m trying to do is make an easy way for crisismapping historians to see how deployments have evolved, and give crisismapping leads an easily-visualised example set of categories that worked for different deployments over time.  The value might be new category sets, or an awareness of arcs that the category types are currently on.  I don’t know.  But I’m sure that I know several thousand people out there with opinions about it…