Why am I teaching my cat data science?

[Cross-posted from ICanHazDataScience]

I’m a data scientist. Not a ‘real’ one, as in someone who can write code that pulls lots of interesting comments out of Twitter and turns them into really pretty pictures without having to look up lots of things about algorithms and code. But I do spend a lot of time getting my hands on data and making sense of it, and I do know a bit about the work that data science is based on.  And I’m still learning about data science, so I’m writing this blog to store and hopefully pass on some useful tips and ideas.

Apparently, the best way to learn something is to try to teach it too, but I don’t have any students here. So this is Emily. She’s a cat: she likes sleeping, chasing small furry things (mice, possums, toes, that sort of thing), Meow Mix and hanging out next to my laptop, which (see above) is why I don’t type things with ‘q’ in them very often. She’s a pretty good listener, but will run off if you start to get boring, repetitive (or tickle her ears ‘wrong’).  She’s also prepared to try almost anything that will help to improve her world, although not for very long.  In short, she’s a pretty representative student (apologies to all my former students out there: you were much much better than this).

Onwards… with the help of the LolCat translator

Wut Iz Data Science?

[Cross-posted from ICanHazDataScience]

Hello. You’re probably here because you’re curious about data science. Or how and why it’s relevant to human development – i.e. not software development, but work that “gives everyone the chance to lead full lives”.  Which is a very very broad concept, that includes the work done by big agencies like the UN, down to communities and individuals just trying to make a difference, or journalists writing stories about where the gaps and problems are. You might be here because you’re one of these people, and want to know how this new “data science” and “big data” will change your work; you might be a volunteer coder building tools to help them, or someone I haven’t anticipated yet.  Whoever you are – welcome, and I hope this can be at least a little bit useful.

So.  What is data science, and what’s special about it when it’s done to help human development?

If you look up “data science”, you’ll get several good papers on it:

And the distinct impression (if you don’t read the articles above) that a) it’s all about statistics and coding, and b) there are a *lot* of pretty pictures involved.

Let’s stop for a moment.  Yes, there are lots of pictures. And yes, there’s also statistics and coding.  But that’s not the heart of data science. I’ve also been a systems engineer for a long time now, and from what I can tell form data scientists’ work and my own forays into it, is that a data scientist is one of those rare people who works at the boundary between engineering and business.  What they effectively are, are translators – between the world of number and text crunching and the insights that drive business, progress, politics, stories.  That’s why there are so many pictures: it’s not just because pictures are pretty, contain lots of information and draw in more people than numbers or words do (just look at any introductory course on giving a good presentation for some of the theory behind this), but it’s because what’s being created here is insight. Insight based not only on the numbers that a data scientist is given to work with, but all the other information available to them in the universe (internet, internal data stores and others) or obtained as part of their work (through FOIAs, direct investigation or discussion).

There have already been studies into the types of personality who make good data scientists (especially by people who, with the current shortage, are keen to develop their own), but we could also define this in terms of the company that they typically keep.  The translator-personality goes some way to explaining the types of people that data scientists seem to hang out with – data geeks, architects, systems designers et al.  The data gathering urge explains some of their connections to open data types, makers, and their recent interest in areas like the internet of things.

GIS references in data.un.org

I’ve been playing with most of the data.un.org dataset: all 5577 csv files and 21195188 rows of it. And it’s a fascinating dataset when you see it all together. I’ve already written about how to access the data.un.org datasets from an external application: now it’s time to look at the headings and indices (first row and first column) in them all.

Details: I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Data.un.org data that was excluded from the investigation were as follows: the UN interface limits downloads to 50000 rows of data, so 159 files in the set are incomplete; and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. In all, there are 21195188 rows of data in the remaining dataset, so much of the following work had to be automated. Every cell in the first row of each of these files was added to file “headings.txt”; every cell in the first column of each of these files (excluding the first row) was added to file “indices.txt”.

Data.un.org files excluded (because they were in Excel format) were:

  • Human Development Indices: A statistical update 2011
  • Indicators on Women and Men
  • OECD Data
  • World Tourism Data
  • World Fertility Data
  • World Marriage Data
  • World Contraceptive Use
  • Key Indicators of the Labour Market, 7th Edition
  • WTI Data

My goal here was to survey the types of headers used, so I could create lists of errors, inconsistencies etc. for known index types (e.g. country names), and start the work of cross-matching indices and headers to each other.

So. The results are: the indices contain these types:

  • Most files: Country, region and economic group names, with a variety of spelling errors.
  • International Monetary Fund files: a code called OID. The second column (country name) from these files has also been added to the list of indices.

Headings are more varied:

  • Dates/times (Jan, Feb, year, number of years etc)
  • Age-related (e.g. age, age groups)
  • Geographical (e.g. country, OID, national station id, station name, residence area, WMO station number, city, country of birth, country of citizenship, country or territory, country or area, reference area, area etc)
  • City type
  • Birth/death details (cause of death, birth weight etc)
  • Number of people (e.g. number of refugees, children etc)
  • Personal (gender, sex, religion, marital status)
  • Nationality/language (citizenship, ethnic group, native or foreign born, language)
  • Household (household size, type of household etc)
  • Financial value (USD, wealth quintile, currency, SNA)
  • Employment/education (occupation, industry, education, literacy)
  • Trade (Commodity/ISIC rev 3, flow=import or export, etc)
  • General measure words (observation value, quantity, quantity name, weight, measure etc)
  • Classification words (subgroup, item, type etc)
  • Metadata (Description/source/series, survey coverage, activity status, record type, reliability, variant, footnotes ““ these headings are used to refer to footnotes at the end of the file)

More detailed data should be attached to this post.

Indices and headings in data.un.org

‘Data.un.org is the UN Statistics Division’s Internet-accessible repository for data. It’s potentially incredibly useful to the world, but is lagging behind sites like data.worldbank.org and data.gov because it doesn’t have a machine API, it isn’t normalized (the datasets aren’t in a form that makes them easy to use with each other) and there are spelling errors in some of the headers and indices, notably in the names used for geographical locations (e.g. “USA” and “United States” are both used).

I’ve written already about how to access the data.un.org datasets from an external application and about the types of headers and indices within it. Now it’s time to make the geographical references more usable.

I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Data.un.org data that was excluded from the investigation were as follows: the UN interface limits downloads to 50000 rows of data, so 159 files in the set are incomplete; and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. In all, there are 21195188 rows of data in the remaining dataset, so much of the following work had to be automated.

I collected the indices and headers from all the datasets into lists: the headers list was searched for geographical references, and the indices list was used to produce a list of corrections from the data.un.org geographical indices into both ISO standard 3166 (countrynames) and UNSTATS’ list “Country and Region Codes for Statistical Use” of region, country and economic group names on data.un.org.

Geographical references in the headers are:

  • country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin, reference area.
  • OID.
  • WMO station number, station name, national station id number.
  • City.
  • Area, residence area, city type.

The GIS naming standards found in the dataset were:

Country names: Most of the indices and country references are a close match to the UNSTATS standard or ISO3166 (as used by the World Bank etc), although country names in particular are very inconsistent in this dataset. The indices discussion below gives instructions on

  • how to correct all of the country and region names to either of these standards. This applies to headers country of birth, country of citizenship, country or area, country or territory, country or territory of asylum or residence, country or territory of origin and reference area. OID: International Monetary Fund’s internal GIS standard.
  • WMO station number, Station Name, National Station Id Number: World Meteorological Organisation references to meteorological stations.
  • City: UNSD Demographic Statistics (code: POP) includes city names. No standard has yet been identified for these names.
  • Area: UNSD Demographic Statistics (code: POP) and World Health Organisation (code: WHO) use the headings Area and Residence Area to classify the geographic extent of coverage (e.g. “Total”, “Urban”, “Rural”). UNSD Demographic Statistics also uses the heading “City type” to subclassify cities too (e.g. “City proper”, “Urban agglomeration”).

Most of the data.un.org datasets contain information that is listed by country (e.g. Yemen), region (e.g. West Africa) or economic group (e.g. Developing Regions). Looking at the placenames in the indices, we see that they are a mix of country, region and economic group names, with different spellings and formats for similar names (e.g. “Yemen”, “YEMEN”, “Yemen,Rep.”, “Yemen, Republic of” etc).

Two standards are similar to the placenames used in these files: ISO3166 and the “composition of regions” list published by data.un.org.

  • ISO3166 is a widely-used standard, but contains code for countries and their subregions only (e.g. has no official lists of larger regions or economic areas) and is published as tables online and available (although without the list of withdrawn codes) in the Python library pycountry.
  • The Unstats list (which ISO3166 is partially based on) contains countries, regions and economic areas, but is available only as an html table at http://unstats.un.org/unsd/methods/m49/m49regin.htm. The countries list is available as a ScraperWiki dataset which needs some editing to make it usable. The regions list has been scraped by hand for now. There are two main lists in it: the regions, subregions and countries by physical location, and the economic status (e.g. “Developing regions”, “Least developed countries”) of each country and region. These are mostly consistent, with a couple of oddities. For instance, Netherland Antilles doesn’t appear on the list of countries, but does appear on the list of small island developing states. Luckily it has a country code, so it’s been included in the list of countries.

The indices were checked against both these standards. Suggested improvements to the indices and standards include:

  • Make the regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
  • Make the economic status list available as a csv file online.
  • Lobby ISO to create a region (Africa, West Africa, North America etc.) code standard, if it doesn’t already exist.
  • Lobby ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
  • Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online data should attempt to adhere to.
  • Change all the data.un.org datafiles to meet this standard.

Against the ISO3166 standard, the data.un.org csv index errors were:

  • Withdrawn countries with no ISO3166 code: “East Timor\”, \”Czechoslovakia, Czechoslovak Socialist Republic”, \”USSR, Union of Soviet Socialist Republics\”, “Yemen, Yemen Arab Republic\”, “Yemen, Democratic, People’s Democratic Republic of\”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands\”, “Wake Island\”, “Serbia and Montenegro\”.
  • Abbreviation, e.g. “Rep.” for “Republic”, “St.” for “Saint”, “Is.” For “Island”, “Isds” for “Islands”, “&” for “and”.
  • Added markers, e.g. “+” added to the end of region names, to differentiate them from countrynames.
  • Capitalisation, e.g. “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the” and the” for “The”.
  • Brackets: UNICEF in particular uses brackets “()” instead of commas in placenames
  • Standards confusion: the ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).

Tables of other misspellings against both standards are given below. Some of these errors are the use of familiar names (e.g. Brunei, Ivory Coast, China) or issues with character translation (e.g. Cote d’Ivoire). Some names could not be resolved: remaining queries include the code for French Polynesia, whether “Christmas Is.(Aust)” is Christmas Island, whether St. Helena refers to just the island of Saint Helena, or “Saint Helena, Ascension and Tristan da Cunha” and whether Palestine and Palestinian Territories refer to “Palestinian Territory, Occupied”

A quick-and-dirty API for data.un.org

An API is a piece of software on a website that allows programs on other websites and machines to access data held on that website.  This means that coders can create programs that use third-party data (from e.g. Facebook, LinkedIn, Foursquare, the World Bank etc) as part of their applications.  An example API is data.gov’s broadband map API – an example call to which looks like “http://www.broadbandmap.gov/broadbandmap/provider?format=json”.  This provides data about broadband providers in a standard data format (json) to any program or application that needs it.

Without an API, datasite users are forced to access data through a series of mouseclicks leading eventually to either a webpage displaying the data, or to a file containing data that they can download.

Data.un.org does not have an API.  Yet.  But it would be very very easy to provide on their website.  A trace of the calls made during a data.un.org file retrieval shows that the standard format for most of the datasets on data.un.org (the exception is datasets whose martids are digits) is as follows:

  • Getting the ids needed to access a datafile:
    • The list of marts (agencies that provided data to data.un.org) is in a json file(http://data.un.org/Handlers/ExplorerHandler.ashx?t=marts) whose nodes all have the form martName, martId, childNodes; for example martName=”Commodity Trade Statistics Database”, martId=”ComTrade”, childNodes=””.
    • The list of datafiles for each mart is in a json file (http://data.un.org/Handlers/ExplorerHandler.ashx?m=martId) where martId is the martId given above, e.g. http://data.un.org/Handlers/ExplorerHandler.ashx?m=ComTrade The important nodes in this file are: label, childNodes, martId, dataFilter, where dataFilter is the id for the datafile connected to that label.  An example is label=”Trade of good, US$, HS 1992, All commodities”, martId=”ComTrade”, dataFilter=”_I1Code%3a1” ( these are really not intuitive names).
    • For almost all marts, the martId that you need to access a datafile is the same as the martId used to obtain the list above.  The exception is martId=”KI”.  This mart doesn’t correspond to one agency, but is an aggregate of datasets provided by different agencies. The dataFilter for a KI file looks like this: ”dataSetID%3aPopDiv%3bvariableId%3a12”, where PopDiv is the true martId and variableId%3a12 is the true filterId. These “true” values can be used to access the datafile as described below.
  • Accessing  a datafile:

In summary, I’ve described how to access a data.un.org datafile directly, using only the calls already provided by this website.  Propagating this information and potentially also using this information to add an API line to each description file in data.un.org is a trivial exercise that will make a large difference to coders’ ability to include UN data in their applications and websites.

I’ve attached two files that will help anyone who can’t do the “getting the ids” part of these instructions to just go straight to the “accessing a datafile” ones.

This is part of a 3-note set: the api instructions, correcting GIS references in the data.un.org dataset and accessing the UN’s other online data.  This should be enough to get people using static UN data (as in not datastreams like Twitter feeds, map reports etc) more widely.

Data.un.org: number of rows in each dataset

Addendum: as I visualized GIS data for the data.un.org dataset, some of the counts I was getting for countries seemed a bit off. For instance, countries (Australia, Canada) with names at the start of the alphabet had much larger counts than countries (e.g. Yemen) at the end of the alphabet.  I at first put this down to collection bias (e.g. data on developed countries was much easier to obtain than data on developing ones), but decided to investigate why Australia had 50000 references in the data.  The answer is that we need to do a bit more work on the API (not impossible: there’s a filter value that can be used to select blocks of data, and the limit is also variable).  UNStats has limited the downloads from its site to no more than 50000 rows of each dataset, but there are several datamarts with datafiles that exceed these limits (as in each datafile contains 1000s of values per country), with the Commodity Trade Statistics Database, from UNSTATS (code: COMTrade) being a consistent offender.

Basically, if your data comes back with approximately 50000 rows in it, then you probably don’t have the whole dataset, and anything over 40000 rows is suspect.  I tried looking at the filesize as a quick rule of thumb, i.e. filesizes between 5Mb and 6Mb might be affected, filesizes above 6Mb are definitely affected, but when I compared filesize against number of rows, there were smaller files (around 3Mb) that had the same problem.  If you’re downloading UNstats data, you need to be aware of this issue, but you probably won’t see it very often: the number of rows per dataset is a long-tail distribution (see the figure above), and only 159 files out of 7000+ have more than 40000 lines in them.