Data science is not about data. Data science is about insight – the knowledge and suggestions that you can glean by inspecting and using data. And that insight usually starts with a set of questions. Here are some examples that will hopefully make you think a bit more about your own questions (which in Emily’s case is the correlation between cuteness, cuddles and the amount of Meow Mix in her dish).
You don’t always know what the good questions are, but you usually know (or pick) the framework that you’re asking them in. This is how I usually approach this:
- Look at context – ask question (or get question from user)
- Get data
- Phrase question in way that data can answer
- Write down issues with data
- Clean data
- Investigate question
- Check conclusions and possible issues with conclusions
- Describe possible further investigations / data gathering (which might mean improving on the data that you obtained this time)
Here are two examples, to help you think about your own questions. One example analyses text, the other numbers; both are simple but raise many difficult questions.
Example 1: Starting with a question
Look at context – ask question
I’ve somehow been spending a lot of time lately thinking about poo… erm… sanitation, open defecation and farm slurry. Some of this stemmed from a question I asked about a UN ‘fact’ that was quoted without a data provenance – that more people have access to a mobile phone than to a toilet. My question was simple: “is this true?”.
Get data
Now at this point, I had no data. So I looked at the resources I had available (me and an internet full of open data) and the value of the result (me satisfying my curiosity), and scoped out the size of the project: I’d look for open data (i.e. not ping any of my contacts for data, set up surveys or anything that involved other people’s goodwill – that’s a valuable resource), and use that to determine whether the question could be answered. I’m spoiling the surprise, but this is something that happens a lot with development data: you start out with a clear question, find that the data isn’t there to answer it, then adapt either the data (by gathering more) or the question (by reducing its scope, or changing it to a set of also-valuable questions that the data can help you with).
So, data. I searched all the usual suspects (see opencrisis.com for a list), but couldn’t find any dataset of surveys that included both access to toilets and mobile phones. One or more of these surveys has probably been done, and could probably be dug up with a lot of phone calls, but they weren’t easily visible online. The datasets that I did find were one on sanitation from WSSinfo and another on mobile phone densities from ITU. And these have issues:
- The datasets were hard to find.
- I looked at the last 5 years (anything older than that in development isn’t that useful), but there was no data after 2010 in these datasets.
- The datasets were unrelated.
- The dataset formats were hard to machine-read (they included merged cells, explanations etc).
- It was difficult to track provenance – e.g. what decisions did the people creating these datasets make? What assumptions?
- There were data issues: numbers were rounded up, data was at country level, country names didn’t match between the two datasets, and there were multiple character sets in the files (e.g. Å, A, Ԇ).
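The country-name and character-set mismatches in that last bullet are the kind of thing a small normalisation step can reduce before joining two datasets. This is only a minimal sketch, not the cleaning that was actually done for this post: the function names and the alias table are made up for illustration, and real files would need a much longer alias list built by hand.

```python
import unicodedata

# Hypothetical alias table: alternative spellings mapped to one canonical key.
ALIASES = {
    "vietnam": "viet nam",
}

def normalise_country(name: str) -> str:
    """Return a lowercase, accent-free, whitespace-tidied key for a country name."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    key = " ".join(no_marks.lower().split())
    return ALIASES.get(key, key)

def unmatched(names_a, names_b):
    """List the normalised names that appear in only one of the two datasets."""
    keys_a = {normalise_country(n) for n in names_a}
    keys_b = {normalise_country(n) for n in names_b}
    return sorted(keys_a ^ keys_b)

# Made-up name lists standing in for the two datasets.
print(unmatched(["Vietnam", "Benin"], ["Viet Nam", "Bénin", "Niger"]))
```

Anything that `unmatched` still reports needs a human to look at it: it may be a spelling variant that belongs in the alias table, or a country that really is missing from one dataset.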
Phrase question in way the data can answer
So onto the question. Taking the claim “more people have access to mobile phones than toilets” as a starting point, we can rephrase it as:
number of people with mobiles > number of people with toilets
or (mobile% - toilet%) * population > 0
or (mobile% - (100 - opendef%)) > 0
Where mobile% is the percentage of people with mobile phones, toilet% is the percentage of people with access to a toilet (not, note, owning a toilet – or I’d be looking through the sanitaryware import and latrine digging figures for each country), and opendef% is the percentage of people open defecating (pooing outside). And we can answer this question using the datasets.
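As a rough sketch, the rephrased question turns into a couple of lines of arithmetic per country. The function names here are invented for illustration, and capping phone density at 100 is the (uneasy) truncation choice described further down this post, not a statistically justified step.

```python
def phones_minus_loos(phones_per_100: float, opendef_pct: float) -> float:
    """Percentage-point gap between phone access and toilet access."""
    # Cap at 100: densities above 100 mean some people have several phones.
    mobile_pct = min(phones_per_100, 100.0)
    # Everyone who is not open defecating is counted as having toilet access.
    toilet_pct = 100.0 - opendef_pct
    return mobile_pct - toilet_pct

def people_affected(population: float, phones_per_100: float, opendef_pct: float) -> float:
    """Estimated number of people with a phone but no toilet (0 if the gap is negative)."""
    gap = phones_minus_loos(phones_per_100, opendef_pct)
    return max(gap, 0.0) / 100.0 * population

# India's 2010 figures as used in this post: 61.4226 phones per 100 people,
# 51.09471% open defecation.
print(round(phones_minus_loos(61.4226, 51.09471), 5))
```

A positive result means the claim holds for that country under these (big) assumptions; it says nothing about whether the people with phones are the same people who lack toilets.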
Write down issues and clean data
And even once the numbers for open defecation (a polite phrase for “has no toilet and has to poo outside”) and telephones were compared, that comparison only created a bunch more questions. Most of these questions exist because of the idea of statistical independence – if you gather two datasets independently of each other, it’s only possible to compare them under some really tight statistical conditions. Some of these questions were:
- Is there actually a correlation between the two datasets? Phone densities are quoted as the number of phones per hundred people, and are often over 100 (I think I have 4 phones at home, but I’ve lost count now). Most of the countries with phones > toilets are in the developing world: don’t some people in the developing world have more than one phone? In some cities I’ve visited (e.g. Benin City), phone signal availability is so variable that people have up to 5 SIM cards each, on different carriers. And were the results uniform within each country? The datasets were listed by country, so what if the cities have lots of phones and toilets, and the rural areas don’t? What does that do to the numbers?
- And how do you count up people without toilets? Are these percentages estimates or survey results? If they’re surveys, how big were the surveys, and were they demographically and geographically representative (e.g. were city and country people surveyed proportionately, and how was this done – on paper or by phone?). We’re talking about people here: how likely were they to be truthful about toilets? Having to poo outside could be deeply embarrassing, and perhaps hard to admit.
- Where does my composting toilet fit in this? If I have an ‘unusual’ outdoor toilet, does that count as a toilet or open defecation?
- What do we do with a zero value in the datasets? What do we do with values over 100 per 100 people? (I truncated these to 100, so extra phones had less of an effect, but I felt uneasy doing that.)
- Did we just list the people who, with the right tools, can campaign for more toilets?
- Etc…
Investigate question, check conclusions, describe possible future investigations
So, having run the question against the data, here are the numbers for 2010:
| country | population | open defecation (%) | not open defecation (%) | phones (per 100 people, capped at 100) | phones minus loos (% points) | people affected |
| --- | --- | --- | --- | --- | --- | --- |
| India | 1.22E+09 | 51.09471 | 48.90529 | 61.4226 | 12.51732 | 153288799 |
| Indonesia | 2.4E+08 | 26.25828 | 73.74172 | 88.08497 | 14.34325 | 34405290 |
| Brazil | 1.95E+08 | 3.694356 | 96.30564 | 100 | 3.694356 | 7202000 |
| Morocco | 31951000 | 15.86805 | 84.13195 | 100 | 15.86805 | 5070000 |
| South Africa | 50133000 | 7.745397 | 92.2546 | 100 | 7.745397 | 3882999 |
| Viet Nam | 87848000 | 4.177671 | 95.82233 | 100 | 4.177671 | 3669999 |
| Benin | 8850000 | 56.39548 | 43.60452 | 79.94351 | 36.33899 | 3216000 |
| Cambodia | 14138000 | 60.53897 | 39.46103 | 57.65042 | 18.1894 | 2571616 |
| Peru | 29077000 | 7.232521 | 92.76748 | 100 | 7.232521 | 2102999 |
| Colombia | 46295000 | 6.486662 | 93.51334 | 96.07475 | 2.561412 | 1185805 |
| Mauritania | 3460000 | 53.64162 | 46.35838 | 80.23792 | 33.87954 | 1172232 |
| Guatemala | 14389000 | 6.046285 | 93.95371 | 100 | 6.046285 | 870000 |
| Namibia | 2283000 | 51.86159 | 48.13841 | 85.50451 | 37.36609 | 853067 |
| Ecuador | 14465000 | 4.638783 | 95.36122 | 100 | 4.638783 | 670999 |
| Honduras | 7601000 | 8.748849 | 91.25115 | 100 | 8.748849 | 664999 |
| Niger | 15512000 | 78.85508 | 21.14492 | 24.53329 | 3.388367 | 525603 |
| El Salvador | 6193000 | 5.926046 | 94.07395 | 100 | 5.926046 | 367000 |
| Botswana | 2007000 | 15.39611 | 84.60389 | 100 | 15.39611 | 309000 |
| Mongolia | 2756000 | 11.71988 | 88.28012 | 91.09104 | 2.810925 | 77469 |
| Suriname | 525000 | 6.095238 | 93.90476 | 100 | 6.095238 | 32000 |
Reading the whole table, the bottom line is that 200 million or so people have phones but not toilets, if you use the ITU and WSSinfo data, and ignore statistical independence (that’s an enormous ignore). That’s out of 7 billion people worldwide. So yes, it’s potentially an issue, but it’s more interesting to think about where, and what that means. For instance, there are 200 million people with phones who, if they get the right SMS apps or information, can lobby for governments and NGOs to build toilets in their areas, or for the plans, materials, money or labour to do this for themselves. If anyone wants to start a “givemealoo” site with an SMS connection and publicity through SMS and local radio, they now know where to start…
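The “200 million or so” figure can be sanity-checked by totalling the “people affected” column. A minimal sketch, with the values copied from the 2010 table in this post (the variable names are just for illustration):

```python
# "People affected" per country, copied from the 2010 table.
PEOPLE_AFFECTED = {
    "India": 153288799, "Indonesia": 34405290, "Brazil": 7202000,
    "Morocco": 5070000, "South Africa": 3882999, "Viet Nam": 3669999,
    "Benin": 3216000, "Cambodia": 2571616, "Peru": 2102999,
    "Colombia": 1185805, "Mauritania": 1172232, "Guatemala": 870000,
    "Namibia": 853067, "Ecuador": 670999, "Honduras": 664999,
    "Niger": 525603, "El Salvador": 367000, "Botswana": 309000,
    "Mongolia": 77469, "Suriname": 32000,
}

total = sum(PEOPLE_AFFECTED.values())
print(f"total: {total / 1e6:.0f} million")

# Share of the total contributed by each of the three largest countries.
for country, people in sorted(PEOPLE_AFFECTED.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{country}: {people / total:.0%}")
```

The total comes to roughly 222 million, and most of it comes from India alone, which is part of why the “where” matters more than the headline number.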
Example 2: Starting with a dataset
Sometimes you start with a dataset, and the question “what can you glean from this?”. For instance, my partner, Dan, had a set of job descriptions that he liked, and wanted to find more like them. The long answer would be to do some supervised learning with these and other descriptions, and build a jobsite scraper that classified each description into “interesting” or “not interesting”. The short answer was to look for patterns, features and possibly clusters in the dataset.
The data was from a mix of different websites, all with a different structure (and different headings for ‘experience’ etc.), so I treated each page as unstructured text (e.g. I ignored labels and punctuation and treated each page as a huge collection of words). I started by building a histogram of the words used: a list of the top 30 words I found across all the documents, with how many times each one appeared. This list contained a lot of stopwords – common words that don’t add anything useful to the histogram, like “and”, “the”, “of”, “to” and “in”, that I then removed from the list, to give a list of terms that might be useful to Dan.
Removing stopwords is a common thing in text processing – normally I’d use a standard list of stopwords (e.g. Porter) for this, but I didn’t want to miss any industry-specific terms that might be on those lists, so I built my own stopword list. For development data, you’ll probably do this a lot too, e.g. “crisis” isn’t a very useful term to find when you’re working on crisis information. So I built a histogram (minus stopwords), and the top 10 words in it were: estate (26), real (26), development (12), design (11), manage (11), planning (11), sales (11), investment (10), senior (8), portfolio (8).
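For anyone who wants to try this on their own pages, a word histogram with a hand-built stopword list takes only a few lines. This is a sketch rather than the code used for the numbers above: the stopword set, the tokenising rule (lowercase runs of letters) and the sample pages are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hand-built stopword list (illustrative; extend it for your own domain).
STOPWORDS = {"and", "the", "of", "to", "in", "a", "for"}

def word_histogram(pages, stopwords=STOPWORDS, top=10):
    """Count words across all pages, ignoring punctuation, case and stopwords."""
    counts = Counter()
    for page in pages:
        # Treat each page as unstructured text: keep only runs of letters.
        words = re.findall(r"[a-z]+", page.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(top)

# Made-up snippets standing in for scraped job descriptions.
pages = [
    "Senior real estate strategist for the real estate portfolio",
    "Design and planning of real estate investment",
]
print(word_histogram(pages, top=2))
```

Running it once with an empty stopword set is a quick way to see which words need adding to the list for a particular domain.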
I showed this to Dan and he said “great – but what about pairs of words?”, something that might have been triggered by the top 2 words on that list (“real” and “estate”). So I modified the code to produce a histogram of adjacent words, and got: real estate (26), new york (5), trade marketing (4), job description (4), estate portfolio (4), senior strategist (4), city area (3), estate investment (3), funding approvals (3), area job (3).
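Counting adjacent word pairs is a small change from counting single words. Again, this is a sketch under assumptions: it drops stopwords before pairing words up, which may or may not match how the counts above were produced, and the sample pages and names are invented.

```python
import re
from collections import Counter

def pair_histogram(pages, stopwords, top=10):
    """Count adjacent word pairs within each page, after removing stopwords."""
    counts = Counter()
    for page in pages:
        words = [w for w in re.findall(r"[a-z]+", page.lower())
                 if w not in stopwords]
        # zip(words, words[1:]) yields each word together with its right-hand neighbour.
        counts.update(zip(words, words[1:]))
    return [(" ".join(pair), n) for pair, n in counts.most_common(top)]

# Made-up snippets standing in for scraped job descriptions.
pages = [
    "Senior real estate strategist for the real estate portfolio",
    "Design and planning of real estate investment",
]
print(pair_histogram(pages, {"for", "the", "and", "of"}, top=1))
```

Pairs are only formed within a page, so the last word of one description is never joined to the first word of the next.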
I could have continued this – looking for chains of words, e.g. “real estate” linked to “estate portfolio” etc., and linked it to a jobsite scraper to automatically alert Dan to jobs that were similar to his “interesting” ones (you’ve probably worked out by now that he’s a real estate architectural designer), but the lists were enough for him: he got search terms that he hadn’t thought of, and is happily sifting through sites with them. Which is another lesson to learn: sometimes a seemingly simple thing will have enough of an effect to make a user happy, without needing complex analysis. Unless you’re playing with a dataset out of curiosity, that’s often a good place to stop.