Data Science

Why am I teaching my cat data science?

[Cross-posted from ICanHazDataScience] I’m a data scientist. Not a ‘real’ one, as in someone who can write code that pulls lots of interesting comments out of Twitter and turns them into really pretty pictures without having to look up lots of things about algorithms and code. But I do spend a lot of time getting my hands on data and making sense of it, and I do know a bit about the work that data science is based on.  And I’m still learning about data science, so I’m writing this blog to store and hopefully pass on some useful tips and ideas. Apparently, the best way to learn something is to try to teach it too, but I don’t have any students here. So this is Emily. She’s a cat: she likes sleeping, chasing small furry things (mice, possums, toes, that sort of thing), Meow Mix and hanging out next…

Data Science

Wut Iz Data Science?

[Cross-posted from ICanHazDataScience] Hello. You’re probably here because you’re curious about data science, or about how and why it’s relevant to human development – i.e. not software development, but work that “gives everyone the chance to lead full lives”. That’s a very broad concept: it includes the work done by big agencies like the UN, down to communities and individuals just trying to make a difference, or journalists writing stories about where the gaps and problems are. You might be here because you’re one of these people, and want to know how this new “data science” and “big data” will change your work; you might be a volunteer coder building tools to help them, or someone I haven’t anticipated yet. Whoever you are – welcome, and I hope this can be at least a little bit useful. So. What is data science, and what’s special about it when it’s done to…

Data Science

GIS references in data.un.org

I’ve been playing with most of the data.un.org dataset: all 5577 csv files and 21195188 rows of it. And it’s a fascinating dataset when you see it all together. I’ve already written about how to access the data.un.org datasets from an external application: now it’s time to look at the headings and indices (first row and first column) in them all. Details: I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Some data.un.org data was excluded from the investigation: the UN interface limits downloads to 50000 rows of data, so 159 files in the set are incomplete; and 25 files were excluded because they’re in a format (multi-sheet Excel files) that needs further work to separate comments from data. In all, there are 21195188 rows of data in the remaining dataset, so much of the…
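As a rough illustration of what “looking at the headings and indices” can mean in code, here is a minimal Python sketch that pulls the first row and first column out of one csv file and stops when it reaches the footnotes. The function name `headings_and_indices`, the sample data and the `footnoteSeqID` marker are all assumptions made for this example, not details taken from the original investigation.

```python
import csv
import io
from collections import Counter


def headings_and_indices(csv_text, footnote_marker="footnoteSeqID"):
    """Return (headings, index_values) for the text of one csv file.

    headings     -- the cells of the first row
    index_values -- the first cell of every data row
    The footnote_marker is an assumption: rows from the first one whose
    leading cell equals it are treated as footnotes and ignored.
    """
    reader = csv.reader(io.StringIO(csv_text))
    headings = next(reader, [])
    indices = []
    for row in reader:
        if not row or not any(cell.strip() for cell in row):
            continue  # skip blank lines between data and footnotes
        if row[0].strip() == footnote_marker:
            break  # assumed start of the footnote section
        indices.append(row[0].strip())
    return headings, indices


# Hypothetical sample file, invented for illustration only.
sample = (
    "Country or Area,Year,Value\n"
    "USA,2010,1\n"
    "United States,2011,2\n"
    "\n"
    "footnoteSeqID,Footnote\n"
    "1,Estimate\n"
)

headings, indices = headings_and_indices(sample)
index_counts = Counter(indices)  # how often each index value appears
```

Running the same function over every file and merging the counters would give a picture of which headings and index values are shared across datasets.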

Data Science

Indices and headings in data.un.org

Data.un.org is the UN Statistics Division’s Internet-accessible repository for data. It’s potentially incredibly useful to the world, but is lagging behind sites like data.worldbank.org and data.gov because it doesn’t have a machine API, it isn’t normalized (the datasets aren’t in a form that makes them easy to use with each other) and there are spelling errors in some of the headers and indices, notably in the names used for geographical locations (e.g. “USA” and “United States” are both used). I’ve written already about how to access the data.un.org datasets from an external application and about the types of headers and indices within it. Now it’s time to make the geographical references more usable. I looked at 5577 csv files in the data.un.org dataset, and automatically excluded the footnotes at the end of each dataset. Some data.un.org data was excluded from the investigation: the UN interface limits downloads…
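One simple way to deal with names like “USA” and “United States” both appearing is a lookup table that maps every known variant to a single preferred name. The sketch below is only an illustration of that idea: the `ALIASES` table and the `normalize_place` function are hypothetical, and the choice of “United States” as the preferred form is an assumption rather than anything data.un.org specifies.

```python
# Hypothetical alias table: lower-cased variant -> preferred name.
ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalize_place(name, aliases=ALIASES):
    """Map a place name to its preferred form, if a variant is known.

    Matching ignores case and extra whitespace. Names that are not in
    the table are returned unchanged (apart from trimmed whitespace).
    """
    key = " ".join(name.strip().lower().split())
    return aliases.get(key, name.strip())


normalized = [normalize_place(n) for n in ["USA", " united  states ", "Kenya"]]
```

A table like this only catches variants somebody has already spotted; misspellings would need something extra, such as fuzzy matching, plus a human check of the results.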

Software

A quick-and-dirty API for data.un.org

An API is a piece of software on a website that allows programs on other websites and machines to access data held on that website. This means that coders can create programs that use third-party data (e.g. from Facebook, LinkedIn, Foursquare or the World Bank) as part of their applications. An example is data.gov’s broadband map API, a call to which looks like “http://www.broadbandmap.gov/broadbandmap/provider?format=json”. This provides data about broadband providers in a standard data format (JSON) to any program or application that needs it. Without an API, users of a data site are forced to access data through a series of mouse clicks leading eventually to either a webpage displaying the data, or to a file containing data that they can download. Data.un.org does not have an API. Yet. But it would be very easy to provide on their website. A trace of the calls made during a data.un.org file…
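To make the idea of “a program calling an API” concrete, here is a minimal Python sketch using only the standard library. It builds the broadband map URL quoted above from a base address and a `format` parameter, and defines a helper that would fetch and decode the JSON reply. The helper names `build_api_url` and `fetch_json` are made up for this example, and the fetch is deliberately not run here, since the endpoint may no longer respond.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_api_url(base_url, **params):
    """Attach query parameters (e.g. format=json) to a base API address."""
    return f"{base_url}?{urlencode(params)}"


def fetch_json(url, timeout=10):
    """Request a URL and decode the reply as JSON.

    Raises an error if the server is unreachable or the reply is not
    valid JSON, so callers should be ready to handle failures.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Rebuild the example call quoted in the text.
url = build_api_url(
    "http://www.broadbandmap.gov/broadbandmap/provider", format="json"
)
# data = fetch_json(url)  # uncomment to try a live request
```

The point is that the whole exchange is a URL in and structured data out, with no mouse clicks involved.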