Basic Materials: Werdz an Regulah Expreshuns


[Cross-posted from ICanHazDataScience]

Okay, that last post was a bit long for Emily… she fell asleep on my desk long before I’d finished typing.  So today we’re back to short and practical.

Data is not just numbers.  Numbers are only one of the basic types of data that appear again and again in data science.  Two other types are words (as in written text, like this blogpost) and networks (as in objects connected with links – like a diagram of your Twitter friends and your friends’ friends, etc.).  Today we’re looking at words.

In the last post, I was looking at a set of online job descriptions.  We’ll leave the basics of webpage scraping till later (but if you’re curious, ScraperWiki’s notes are good) and assume that what we have is a set of text files, fed through the loop from the “Processing all teh Files in a Directory” post with the commands

fin = open(infile_fullname, "rb")
bigstring += " " + fin.read()

to add the text from each file into one big string, which we’ve (rather imaginatively) called bigstring.
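If you haven’t read that post yet, here’s a minimal sketch of the whole loop (the directory name “jobfiles” is made up for this example):

import os

indir = "jobfiles"   #made-up directory holding the scraped job description files
bigstring = ""
for infile_name in os.listdir(indir):
    infile_fullname = os.path.join(indir, infile_name)
    fin = open(infile_fullname, "rb")
    bigstring += " " + fin.read()
    fin.close()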

For Dan’s jobsite data, the first 200 characters of bigstring look like this to a human:

Title

Senior Director of Development

Department/Mall

SPG/Premium Outlets – NJ

Location

Roseland, NJ

 

Description

 

PRIMARY PURPOSE:

 

• Responsible for overseeing domestic

And like this to a Python program:

'Title\r\n    Senior Director of Development\r\nDepartment/Mall\r\n    SPG/Premium Outlets – NJ\r\nLocation\r\n    Roseland, NJ\r\n\r\nDescription\r\n\r\n    PRIMARY PURPOSE: \r\n\r\n    \x95 Responsible for overseeing domestic '

That’s what Python sees: a long sequence of characters, some of which are letters (both uppercase and lowercase), some punctuation (spaces, commas, dashes, full stops etc), character sequences (“\r\n”) to show line endings and special codes (“\x95”) for other characters like “•”.
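By the way, if you want to see this raw view of your own data, one line is all it takes – repr() hands you Python’s internal representation of a string, escape codes and all:

print(repr(bigstring[:200]))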

Your program needs to split that sequence of characters into words.  We could do this the hard way – look at each character in bigstring, adding it to a word if it’s alphabetical, and creating a new word if it’s not alphabetical etc., but Python (and many other languages) has a really great shortcut for text processing, known as regular expressions.

Regular expressions – available in Python as the “re” library – are a fast way of searching for text patterns in strings (including very, very large strings).  I’m not going to pretend that regular expressions are easy, but I am going to insist that you’ll find them very useful: they’re worth the pain of learning because they let you do much more powerful things to your text.  For today, I’m going to show you the regular expression that I used to convert bigstring into a list of all the words in it (within limits: for this application, I ignored things like hyphenated words and words containing numbers or other non-alphabetical characters).

The regular expression I used for the jobsites was:

import re

words = re.sub('[\W_]+', ' ', bigstring.lower()).split()

This combined the re library function “sub” with some standard Python string functions (“lower” and “split”).  First, I wanted the words returned to be all-lowercase (there’s nothing more annoying than getting separate frequencies for “Follow”, “FOLLOW” and “follow” in your results).  For this, I used the string expression “bigstring.lower()”.  You can do this to any string in Python, and it will lowercase all your text.  The result was

lowerstring = 'title\r\n    senior director of development\r\ndepartment/mall\r\n    spg/premium outlets – nj\r\nlocation\r\n    roseland, nj\r\n\r\ndescription\r\n\r\n    primary purpose: \r\n\r\n    \x95 responsible for overseeing domestic'

(I’ve called this lowerstring so you can see what happens next). Next, I used re.sub('[\W_]+', ' ', lowerstring) to convert every run of characters that ISN’T alphanumeric (that’s the \W pattern, plus the underscore, which \W treats as a word character) into a single space.  The result of this is

cleantext = 'title senior director of development department mall spg premium outlets nj location roseland nj description primary purpose responsible for overseeing domestic '

Which just leaves the final step of using the spaces to split the whole text into words.  This is what the “split()” function does: if you give it characters as a parameter, e.g. split(“,”), it converts the text string into a list of all the text between each occurrence of those characters, but if you leave the parameter blank, e.g. split(), it splits on any run of whitespace (spaces, tabs and newlines). For the jobsite example, that list starts like this:

>>> words[:200]
['title', 'senior', 'director', 'of', 'development', 'department', 'mall', 'spg', 'premium', 'outlets', 'nj', 'location', 'roseland', 'nj', 'description', 'primary', 'purpose', 'responsible', 'for', 'overseeing', 'domestic' …]

Next post, we’re going to look at the code needed to do useful things with this simple list of words, and at some of the issues (like wordstems) that simple lists of words can have.

Wut 2 Do Wif Data?


Data science is not about data. Data science is about insight – the knowledge and suggestions that you can glean by inspecting and using data. And that insight usually starts with a set of questions.  Here are some examples, hopefully making you think a bit more about your own questions (which in Emily’s case concern the correlation between cuteness, cuddles and the amount of Meow Mix in her dish).

You don’t always know what the good questions are, but you usually know (or pick) the framework that you’re asking them in.  This is how I usually approach this:

  • Look at context – ask question (or get question from user)
  • Get data
  • Phrase question in way that data can answer
  • Write down issues with data
  • Clean data
  • Investigate question
  • Check conclusions and possible issues with conclusions
  • Describe possible further investigations / data gathering
  • Which might mean improving on the data that you obtained this time

Here are two examples, to help you think about your own questions. One example analyses text, the other numbers; both are simple but raise many difficult questions.

Example 1: Starting with a question

Look at context – ask question

I’ve somehow been spending a lot of time lately thinking about poo… erm… sanitation, open defecation and farm slurry.  Some of this stemmed from a question I asked about a UN ‘fact’ that was quoted without a data provenance – that more people have access to a mobile phone than to a toilet. My question was simple: “is this true?”.

Get data

Now at this point, I had no data.  So I looked at the resources I had available (me and an internet full of open data) and the value of the result (me satisfying my curiosity), and scoped out the size of the project: I’d look for open data (i.e. not ping any of my contacts for data, set up surveys or anything that involved other peoples’ goodwill – that’s a valuable resource), and use that to determine whether the question could be answered.  I’m spoiling the surprise, but this is something that happens a lot with development data: you start out with a clear question, find that the data isn’t there to answer it, then adapt either the data (by gathering more) or the question (by reducing its scope, or changing it to a set of also-valuable questions that the data can help you with).

So data. I searched all the usual suspects (see opencrisis.com for a list), but couldn’t find any dataset of surveys that included both access to toilets and mobile phones.  One or more of these has probably been done, and could probably be dug up with a lot of phone calls, but none were easily visible online.  The datasets that I did find were one on sanitation from WSSinfo and another on mobile phone densities from ITU. And these have issues:

  • The datasets were hard to find.
  • I looked at the last 5 years (anything older than that in development isn’t that useful), but there was no data after 2010 in these datasets.
  • The datasets were unrelated.
  • The dataset formats were hard to machine-read (they included merged cells, explanations etc.).
  • It was difficult to track provenance – e.g. what decisions did the people creating these datasets make? What assumptions?
  • There were data issues: numbers were rounded up, data was at country level, country names didn’t match between the two datasets, and there were multiple character sets in the files (e.g. Å, A, Ԇ).

Phrase question in way the data can answer

So onto the question.  Taking the statement “more people have access to mobile phones than toilets” as a starting point, we can rephrase it as: number of people with mobiles > number of people with toilets

or (mobile% – toilet%) * population > 0

or (mobile% – (100 – opendef%)) * population > 0

where mobile% is the percentage of people with mobile phones, toilet% is the percentage of people with access to a toilet (not, note, owning a toilet – or I’d be looking through the sanitaryware import and latrine digging figures for each country), and opendef% is the percentage of people open defecating (pooing outside).  And we can answer this question using the datasets.
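As a sanity check, here’s a minimal sketch of that calculation for a single country – the India numbers come from the results table below, and the variable names are mine:

#India, 2010 – numbers from the results table below
population = 1.22e9
opendef_pct = 51.09471             #% of people open defecating
mobile_pct = min(61.4226, 100.0)   #phones per 100 people, truncated at 100

toilet_pct = 100.0 - opendef_pct   #% of people with access to a toilet
gap_pct = mobile_pct - toilet_pct  #phones minus loos, in percentage points

people_affected = gap_pct * population / 100.0
print(int(people_affected))        #roughly the 153 million in the table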

Write down issues and clean data

And even once the numbers for open defecation (a polite phrase for “has no toilet and has to poo outside”) and telephones were compared, that comparison only created a bunch more questions.  Most of these questions exist because of the idea of statistical independence – if you gather two datasets independently of each other, it’s only possible to compare them under some really tight statistical conditions.  Some of these questions were:

  • Is there actually a correlation between the two datasets?  Phone densities are quoted as the number of phones per hundred people, and are often over 100 (I think I have 4 phones at home, but I’ve lost count now).  Most of the countries with phones > toilets are in the developing world: don’t some people in the developing world have more than one phone? In some cities I’ve visited (e.g. Benin City), phone signal availability is so variable that people have up to 5 simcards each, on different carriers. And were the results uniform within each country? The datasets were listed by country – what if the cities have lots of phones and toilets, and the rural areas don’t? What does that do to the numbers?
  • And how do you count up people without toilets? Are these percentages estimates or survey results?  If they’re surveys, how big were the surveys, and were they demographically and geographically representative (e.g. were city and country people surveyed proportionately, and how was this done – on paper or by phone)?  We’re talking about people here – how likely were they to be truthful about toilets? Having to poo outside could be deeply embarrassing, and perhaps hard to admit.
  • Where does my composting toilet fit in this?  If I have an ‘unusual’ outdoor toilet, does that count as a toilet or open defecation?
  • What do we do with a zero value in the datasets? What do we do with values over 100 per 100 people? (I truncated these to 100, so extra phones had less of an effect, but I felt uneasy doing that.)
  • Did we just list the people who, with the right tools, can campaign for more toilets?
  • Etc…

Investigate question, check conclusions, describe possible future investigations

So, having run the question against the data, here are the numbers for 2010:

Country | Population | Open defecation (%) | No open defecation (%) | Phones (per 100 people, capped at 100) | Phones minus loos (% points) | People affected
India | 1.22E+09 | 51.09471 | 48.90529 | 61.4226 | 12.51732 | 153288799
Indonesia | 2.4E+08 | 26.25828 | 73.74172 | 88.08497 | 14.34325 | 34405290
Brazil | 1.95E+08 | 3.694356 | 96.30564 | 100 | 3.694356 | 7202000
Morocco | 31951000 | 15.86805 | 84.13195 | 100 | 15.86805 | 5070000
South Africa | 50133000 | 7.745397 | 92.2546 | 100 | 7.745397 | 3882999
Viet Nam | 87848000 | 4.177671 | 95.82233 | 100 | 4.177671 | 3669999
Benin | 8850000 | 56.39548 | 43.60452 | 79.94351 | 36.33899 | 3216000
Cambodia | 14138000 | 60.53897 | 39.46103 | 57.65042 | 18.1894 | 2571616
Peru | 29077000 | 7.232521 | 92.76748 | 100 | 7.232521 | 2102999
Colombia | 46295000 | 6.486662 | 93.51334 | 96.07475 | 2.561412 | 1185805
Mauritania | 3460000 | 53.64162 | 46.35838 | 80.23792 | 33.87954 | 1172232
Guatemala | 14389000 | 6.046285 | 93.95371 | 100 | 6.046285 | 870000
Namibia | 2283000 | 51.86159 | 48.13841 | 85.50451 | 37.36609 | 853067
Ecuador | 14465000 | 4.638783 | 95.36122 | 100 | 4.638783 | 670999
Honduras | 7601000 | 8.748849 | 91.25115 | 100 | 8.748849 | 664999
Niger | 15512000 | 78.85508 | 21.14492 | 24.53329 | 3.388367 | 525603
El Salvador | 6193000 | 5.926046 | 94.07395 | 100 | 5.926046 | 367000
Botswana | 2007000 | 15.39611 | 84.60389 | 100 | 15.39611 | 309000
Mongolia | 2756000 | 11.71988 | 88.28012 | 91.09104 | 2.810925 | 77469
Suriname | 525000 | 6.095238 | 93.90476 | 100 | 6.095238 | 32000

Reading the whole table, the bottom line is that 200 million or so people have phones but not toilets, if you use the ITU and WSSinfo data and ignore statistical independence (that’s an enormous ignore). That’s out of 7 billion people worldwide.  So yes, it’s potentially an issue, but it’s more interesting to think about where, and what that means.  For instance, there are 200 million people with phones who, if they get the right SMS apps or information, can lobby governments and NGOs to build toilets in their areas, or for the plans, materials, money or labour to do this for themselves. If anyone wants to start a “givemealoo” site with an SMS connection and publicity through SMS and local radio, they now know where to start…

Example 2: Starting with a dataset

Sometimes you start with a dataset, and the question “what can you glean from this?”.  For instance, my partner had a set of job descriptions that he liked, and wanted to find more like them.  The long answer would be to do some supervised learning with these and other descriptions, and build a jobsite scraper that classified each description as “interesting” or “not interesting”.  The short answer was to look for patterns, features and possibly clusters in the dataset.

The data was from a mix of different websites, all with a different structure (and different headings for ‘experience’ etc.), so I treated each page as unstructured text (e.g. I ignored labels and punctuation and treated each page as a huge collection of words).  I started by building a histogram of the words used: a list of the top 30 words found across all the documents, with how many times each one appeared.  This list contained a lot of stopwords – common words that don’t add anything useful to the histogram, like “and”, “the”, “of”, “to” and “in” – which I then removed, to give a list of terms that might be useful to Dan.

Removing stopwords is a common thing in text processing – normally I’d use a standard list of stopwords (e.g. Porter) for this, but I didn’t want to miss any industry-specific terms that might be on those lists, so I built my own stopword list.  For development data, you’ll probably do this a lot too, e.g. “crisis” isn’t a really useful term to find when you’re working on crisis information.  So I built a histogram (minus stopwords): the top 10 words in it were:  estate (26), real (26), development (12), design (11), manage (11), planning (11), sales (11), investment (10), senior (8), portfolio (8).

I showed this to Dan and he said “great – but what about pairs of words?”… something that might have been triggered by the top 2 words on that list (“real” and “estate”).  So I modified the code to produce a histogram of adjacent words, and got: real estate (26), new york (5), trade marketing (4), job description (4), estate portfolio (4), senior strategist (4), city area (3), estate investment (3), funding approvals (3), area job (3).
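For the curious, here’s a minimal sketch of both histograms using Python’s collections.Counter, starting from the words list we built in the regular expressions post (the stopword list here is a tiny stand-in for the hand-built one):

from collections import Counter

stopwords = set(["and", "the", "of", "to", "in"])      #stand-in: build your own
keptwords = [w for w in words if w not in stopwords]

unigrams = Counter(keptwords)                          #single-word histogram
bigrams = Counter(zip(keptwords[:-1], keptwords[1:]))  #adjacent-pair histogram

print(unigrams.most_common(10))
print(bigrams.most_common(10))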

I could have continued this – looking for chains of words, e.g. “real estate” linked to “estate portfolio” etc., and linking it to a jobsite scraper to automatically alert Dan to jobs that were similar to his “interesting” ones (you’ve probably worked out by now that he’s a real estate architectural designer), but the lists were enough for him: he got search terms that he hadn’t thought of, and is happily sifting through sites with them.  Which is another lesson to learn: sometimes a seemingly simple thing will have enough of an effect to make a user happy, without needing complex analysis.  Unless you’re playing with a dataset out of curiosity, that’s often a good place to stop.

2 Mutch Geek


[Cross-posted from ICanHazDataScience]

I’ve been descending into a lot of Python hints lately, and not so much on the data analysis side of things.  Whilst I finish writing up an example analysis involving poo (yes, really!), here’s something I wrote a while ago to a young aspiring data scientist…

What it takes to be a data scientist, most of all, is curiosity, persistence, intelligence and the ability to tell a story: that drive to understand what’s wrapped up inside the data, to learn whichever tools it takes to get that understanding, and the skills to explain it to others.

It also helps to have a strong technical background, because we haven’t yet developed the user-friendly tools and training that non-technical data scientists need (we’re working on it!), so some knowledge of Bayesian statistics and things like text processing and – unfortunately – the ability to program in a language like R or Python helps a lot.  The good news is that some excellent training materials are now available through things like python.org and Coursera.com (for the advanced user) and meetups – check meetup.com for ideas (I’m here if you want to look at the groups I belong to).  You can also learn a lot – one heck of a lot – from going to some of the data science hackathons like the ones run by DataKind, helping out with the drudge work like data cleaning for the teams, and listening to the techs talking about what they’re doing (being the person who writes up the team webpage text and presentation is a very good way to learn and get geek gratitude points too).

I suspect you were probably looking for me to list the skills that a data scientist needs to learn, but it’s not like that: data science is about storytelling from data, and right now your personality counts for much more than the things that you learnt at school (you will still have to learn things, but you need to start from the right place). Oh, and you also need a very low boredom threshold and a lateral-thinking mind – much of this work is about obtaining, cleaning (removing errors and getting into ‘standard’ formats) and wrestling with data in different (and sometimes unusual) ways, and like all new exploration, it can take an awfully long time, with much pain along the way, but with equal feelings of triumph at the end.

Learnins Python


[Cross-posted from ICanHazDataScience]

Bad news.  You’re probably going to have to learn to code.   Whilst you can go a very long way with the tools available online, at some point you’re going to have that “if I could just reformat that column and extract this information out of it” moment.  Which, generally, either means coding or finding a coder happy to help you with the task (hackathons like RHOK are good places, and always looking for good problem statements; there are also many coding-for-good groups around that might help too).

Not so bad news if you’re up for writing your own code. There is *lots* of help available online.  The language you choose is up to you… many social-good systems are written in PHP, for example, many open data systems are in Python (and there are a lot of good data-wrangling libraries available in Python and many data science courses use it as their default language), and R (free) and Matlab (not so free) are good for handling large arrays of data too.

I personally write most of my code in Python. This might not be your choice once you look at the other languages available, but it works for me, so that’s what I’m going to write about (interspersed with a little PHP and R where it’s appropriate).

So how do you start?  When someone sends you a file with a name like “thingy.py”, how do you run it?

You have options here, depending on what you want to do (run a file or code your own?), how much time you want to put in (two hours, a week, two years), and what your learning style is (reading text, watching video, doing tests, having a tutor).  Most of these options are currently available free. Here are some of them:

Python on Windows. Kthxbye!


[Cross-posted from ICanHazDataScience]

At one point, I had Windows, Mac and Linux laptops, an Android tablet, an Android phone and an iPhone… but now I’m just down to the things that aren’t forbidden fruit.  I used to write all my code on the Linux machine, and use the Windows one for writing (it has a bigger screen that lets me view two documents side-by-side).  But now I *like* coding on my Windows 7 machine. Here are some of the things that I’ve learnt doing it.

  • Find some friendly instructions to help you with things like setting environment variables (i.e. the thing that means you can just type “python” in your terminal window instead of a really long address to the python executable file).
  • Install 32-bit Python instead of 64-bit Python.  Yes, it sounds weird on a 64-bit machine, but trust me, some libraries will break horribly (and some will be just plain unavailable) if you try using the 64-bit version.  The 32-bit version works fine, honest!
  • “Real” coders will laugh at you in Hackathons.  Ignore them… or just blow them away with your awesome webpage scrapers.
  • Python 3.x is the latest version, but Python 2.x has better support. I’m currently on Python 2.7.3.  It works.
  • If you’re adding Python libraries (aka “packages”) to your machine (and you will), pip wins out over easy_install because it comes with things like uninstall (easy_install doesn’t).  Install Distribute, then install Pip… here are some geek notes on why and how.
  • Some packages need slightly different code if you’re running on a Windows machine.  For instance, connecting to Skype using Skype4Py has these variations between machines:
import os
import sys

import Skype4Py

if os.name == "nt":
    print("Windows")
    #32 bit and 64 bit Windows behave differently with Skype
    if 'PROGRAMFILES(X86)' in os.environ:
        sconn = Skype4Py.Skype()    #64-bit
    else:
        sconn = Skype4Py.Skype()    #32-bit
elif sys.platform == "darwin":
    #os.name is "posix" on Macs, so check sys.platform instead
    print("Mac")
    sconn = Skype4Py.Skype()
else:
    print("Linux machine or similar")
    sconn = Skype4Py.Skype(Transport='x11')

(That was a very hard-won piece of code. It won’t happen to you often, but occasionally you need to ask whether it’s your machine type that’s the problem).

  • Virtual environments rock. Mostly.  But in Windows, you have to connect them to your PostgreSQL databases the hard way.  VirtualEnv is the package that you use to create virtual environments on your machine, which you can use to mimic the environment that you’ll eventually distribute your code in online (e.g. Heroku or Amazon).  You don’t need it yet, but it saves a lot of time when your code’s ready to show the whole world.

 Well, that’s just a few things I’ve learnt.  I know I should be a hard-core coder who defends Linux to the hilt – and I like Linux a lot (heck, I can remember when Vi was young and Emacs was a hot new thing), but I’m also very aware that a lot of the people who are managing, analysing and creating data are using Windows machines to do it, and I don’t want to be another person getting in the way of them starting to code on said machines.

Onwards!

Readin and Writin Excel Files


[Cross-posted from ICanHazDataScience]

And so to Excel.  Many people think Excel files must be difficult to read and write from a program.  Nah!  Again, there are libraries to help you: this time, it’s xlrd and xlwt (for reading and writing Excel, respectively).

There’s a nice tutorial on using Python on Excel files here, so I’ll just give some example code in this post. Here’s the file I’m running it on:

Example Excel file

And here’s the code:

import xlrd
import xlwt

#set up
wbkin = xlrd.open_workbook("infile.xls")
wbkout = xlwt.Workbook()

#Read in data from Excel file
numsheets = wbkin.nsheets
sh = wbkin.sheet_by_index(0)
print(sh.nrows)
print(sh.ncols)
print(sh.cell_value(0,1))
merges = sh.merged_cells

#Print out contents of first worksheet
for r in range(0, sh.nrows):
    for c in range(0, sh.ncols):
        print("(" + str(r) + "," + str(c) + ") " + str(sh.cell_value(r,c)))

#Write data out to Excel file
sheet = wbkout.add_sheet("sheetname", cell_overwrite_ok=True)
sheet.write(0,0, "cell contents")
sheet.write(0,1, "more in the first row")
sheet.write(1,0, "contents in the second row")
wbkout.save("outfile.xls")

So what’s going on in here?   First, the setup.

  • “wbkin = xlrd.open_workbook(“infile.xls”)” creates a readable workbook object (the thing you use to access the input file) linked to the file “infile.xls”.  Note that this is a “.xls” Excel file: xlrd can handle Excel 2010 files (“.xlsx”) but so far, they’ve given me errors (this might be something to do with me working on a 64-bit Windows machine though).
  • “wbkout = xlwt.Workbook()” creates a writable workbook object.  Note that this isn’t attached to any file at the moment – and that if you don’t use the command “wbkout.save(“outfile.xls”)” after you’ve written data into the workbook, it won’t ever make it into a file.
  • xlrd has other cool commands like “sh.col(1)”, which puts an entire column of your workbook into an array for you.

And the read:

  • “numsheets = wbkin.nsheets” gives you the number of worksheets in the input file.  In this case, there are 3 (“sheet 1”, “sheet 2” and “sheet 3”).
  • “sh = wbkin.sheet_by_index(0)” gives you a pointer to the worksheet numbered 0 (in this case “sheet 1” – note that Python counts from 0 rather than 1).  If you know what your worksheets are called, then you could use sh = wbkin.sheet_by_name(“Sheet1”) instead.
  • “sh.nrows” and “sh.ncols” give you the number of rows and columns in the current worksheet; in this case, 5 and 3 respectively.
  • “sh.cell_value(row,col)” gives you the contents of a worksheet cell. “sh.cell_value(0,1)” gives you the contents of the first row, and the second column in that row (remember that Python counts from 0); in this case, “Title”.
  • “merges = sh.merged_cells” should give a list of the merged cells in the worksheet – which is really useful because these appear a lot in development data and are a pain to unmerge.  But this command didn’t work for me at first: “sh.merged_cells” gives me an empty set for the Excel file above (see the sketch below for a likely fix).
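One likely fix for that empty set, from later digging: xlrd only seems to fill in merged_cells if you open the workbook with formatting_info=True.  Here’s a minimal sketch of using it to find the merged ranges (treat the formatting_info behaviour as an assumption to test against your own files):

import xlrd

#merged_cells appears to stay empty unless formatting_info=True is set
wbk = xlrd.open_workbook("infile.xls", formatting_info=True)
sh = wbk.sheet_by_index(0)

#each entry is (rlo, rhi, clo, chi): rows rlo to rhi-1, columns clo to chi-1
for rlo, rhi, clo, chi in sh.merged_cells:
    topleft = sh.cell_value(rlo, clo)   #merged contents sit in the top-left cell
    print("rows " + str(rlo) + "-" + str(rhi - 1) + ", cols " + str(clo) + "-" + str(chi - 1) + ": " + str(topleft))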

Print out contents of first worksheet:

(0,0) Id
(0,1) Title
(0,2) Notes
(1,0) 1.0
(1,1) title1
(1,2) This is a note.
(2,0) 2.0
(2,1) title 2
(2,2) This is a merged cell
(3,0) 4.0
(3,1) Fourth title
(3,2)
(4,0) 5.0
(4,1) Title 5
(4,2) This is a note on line 5.

  • This is what Python thinks the contents of this worksheet are.  Note the defensive piece of programming in the code: “str(sh.cell_value(r,c))” – this takes the value in the cell and converts it from whatever format xlrd thinks it is (notice that all the Ids have been turned into floating-point numbers, for instance) into a string, which stops the print() line from breaking when its string concatenation is fed a number.
  • Note that xlrd happily read in all the rows and columns, but didn’t notice that one of the cells was merged.  So far, the contents of every merged cell that I’ve read in have been assigned to the row and column of its top left-hand corner (this might be helpful to you later).

And write:

  • “sheet = wbkout.add_sheet(“sheetname”, cell_overwrite_ok=True)” adds a sheet named “sheetname” to the output workbook.  “cell_overwrite_ok=True” is a useful option because it lets you add cells to the workbook in a nonlinear way (e.g. you don’t have to write [0,0] then [0,1] then [0,2] then [1,0] etc. etc.).
  • “sheet.write(0,1, “more in the first row”)” writes the text “more in the first row” to the cell in the first row and second column of the worksheet. 
  • “wbkout.save(“outfile.xls”)” saves the workbook to file “outfile.xls”.  Without this command, you’ll have a workbook object but it won’t be written to any files. 

That’s it for now.  As usual, if you get stuck, try searching StackOverflow and the rest of the internet for the error message you got – almost always, someone’s had the same problem before.  Onwards!

Why Cant I Redd Arabic in Mah Files?


[Cross-posted from ICanHazDataScience]

I work on development data.  Sometimes on datafiles, sites or streams that cover a large part of the world.  Which means that, sooner or later, I’m going to get an error that looks like this when I’m reading something into Python:

"UnicodeEncodeError: 'ascii' codec can't encode character u'\u015e'"

At this point, you need Unicode.   Unicode is like the babelfish of written text: it contains characters for most human languages, including Arabic, so for instance it can deal with reading data from websites where multiple human languages are used (e.g. at least 10 on one of the sites that I maintain).  Most people’s files contain just one character set (things like ASCII and Latin-1) and don’t ever see the problem above – we development nerds are likely to see it a lot!  For example, this line (placed at the top of your code file) can save you pain when you’re writing non-ascii characters in your code (e.g. in comments or text strings):

# -*- coding: utf-8 -*-

And this line can save you a lot of pain when you’re looking at your files:

print(data.encode('utf-8'))

But they didn’t say “unicode”, Emily says (before yawning and settling back down to sleep).  That’s because you probably won’t see the word “unicode” in Python code… you’re more likely to see words like “utf-8” and “utf-16”.  These are both implementations of unicode – different ways of representing unicode characters as sequences of one or more bytes.  For Arabic and other common languages, utf-8 is more than sufficient (and very few people need or use utf-16).  I could rattle on here (and if you need it, can supply the magic words for e.g. reading an Excel file in UTF-8) but this blogpost about Arabic in Python should cover all (or at least most) of what you need to get started.
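To make that concrete, here’s a minimal sketch of reading and writing a UTF-8 file in Python 2 using the codecs library (the filenames are made up):

# -*- coding: utf-8 -*-
import codecs

#codecs.open decodes the file for you: you get unicode objects, not raw bytes
fin = codecs.open("arabic_data.txt", "r", encoding="utf-8")
text = fin.read()
fin.close()

#encode back into utf-8 bytes when you print
print(text.encode('utf-8'))

#codecs handles the encoding again on the way out
fout = codecs.open("outfile.txt", "w", encoding="utf-8")
fout.write(text)
fout.close()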

Play, experiment, and remember to search StackOverflow if you get stuck (chances are that whatever Python throws at you already has a question and answer on StackOverflow). More geeky notes on unicode include Green Notes’ blogpost and Evan Jones’ note.

Readin and Writin CSV files


[Cross-posted from ICanHazDataScience]

Google should add Lolcatz to translate.google.com…

But seriously, if you want to do data science, you’re going to need data.  Which means being able to access data: data in streams (like Twitter), data online (like websites) and data in files.  We’ll start with the files.

Development agencies, bless them, have a major love affair going on with Excel files.  Open any development data site (or crawl it for datafiles) and you’ll see overwhelming numbers of reports (we’ll get to those later), .xls and .csv files.

Let’s start with the easier filetype: CSV.  CSV (Comma-Separated Values) is as it sounds: a file containing readable text, usually separated with commas, but sometimes instead with semicolons “;”, tabs, colons “:”, pipes “|” or anything else that the person writing the file thought was appropriate (data.un.org, for example, gives you the choice of comma, semicolon or pipe).  Here’s an example, seen from Microsoft Excel:

CSV file as seen in Microsoft Excel

If you open this file in a text editor (I’m using Notepad++ here – which is heartily recommended for looking at webpage files too), it looks like this:

CSV file as seen in Notepad++  

There are two things to note here: first, that the columns in the Excel view correspond to the commas in the Notepad++ one.  And second, that the comma in Excel (field C3) doesn’t create a new column in the CSV file, because Excel wraps the whole field in quotation marks (“This is a comment, that includes a comma.”).

Yes, it’s that simple a file format.  And there’s a friendly Python library to make reading and writing CSV files even easier.  It’s called “csv”.  Here are some simple bits of code that use it to read a CSV file, print out its contents, and write them out to another CSV file.

import csv

#Set up
fin = open("infile.csv", "rb")
fout = open("outfile.csv", "wb")
csvin = csv.reader(fin)
csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)

#Read in header row
headers = csvin.next()
for header in headers:
    print(header)

#Read in non-header rows
for row in csvin:
    for col in range(0, len(row)):
        print(row[col])

#Write out
csvout.writerow(["header1", "header2", "header3"])
fin.seek(0)   #rewind the input file: the loop above read it to the end
csvin.next()  #skip over the header row again
for row in csvin:
    csvout.writerow(row)

#tidy up
fin.close()
fout.close()

Okay, a few things happened here.  First, the setup:

  • Before you can use a library (e.g. csv), you have to tell your code about it.  That’s what the “import csv” line is about.
  • Second, you have to open a file before you can use the CSV library on it.  The line “fin = open(“infile.csv”, “rb”)” opens a file called “infile.csv” for reading in binary mode (that’s the “rb” part – the “b” avoids line-ending surprises on my Windows system), and points to the open file with a variable called “fin”.  The line “fout = open(“outfile.csv”, “wb”)” does a similar thing, opening a file called “outfile.csv” for writing (that’s the “wb” part).
  • csv.reader and csv.writer are where the CSV library gets involved.  “csvin = csv.reader(fin)” gives you an object, csvin, with lots of handy ways to read in CSV files.  “csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)” gives you an object, csvout, with lots of handy ways to write to CSV files.  That “quoting=csv.QUOTE_NONNUMERIC” is important, by the way – it wraps your text fields in quotes (“) so a comma inside the text never gets read back in as a new column, and quoted fields always come back in as text strings (this is important because you don’t have to write extra code that worries about which type each input is).

Then the read:

  • CSV files are organised in rows, so it’s not too surprising that CSV readers do the same thing.  “headers = csvin.next()” picks the next row out of the CSV file, and puts it into an array variable called “headers”.  Arrays are lists of things that look like this: [“Id”, “Title”, “Comments”, “Date”] in Python – the next lines, “for header in headers:” and “print(header)”, select each item in the array headers and print it to the screen (this is where QUOTE_NONNUMERIC wins: it keeps your fields coming in as text strings, so you don’t need defensive conversions like print(str(header)) scattered through your code).
  • “for row in csvin:” is a loop – it picks the next row from the CSV file, and applies the following code to it (note the lack of brackets here: the only way you know what a loop applies to is by looking at the indents in the line), then repeats this with the next row in the CSV file until it gets to the end of the file (at which point it stops and your program goes on to the next piece of code). Note that “for header in headers:” is also a loop, but it’s working on a different type of object (an array rather than a file), so it works on each item in the array instead.
  • row is an array, e.g. [“1”, “Title 1”, “This is a comment on title 1. ”, “3/10/2013”], which means that you can access any column in that row by typing “row[col]”, where col is any number from 0 to the number of columns in the file, minus 1.  I’ve used this to print out the row contents in order, but you could use it to pick out any data field that you want.

And the write:

  •  Writing is pretty easy: once you’ve created a csv.writer object, e.g. csvout, you can use “csvout.writerow(array)” to write a row of data to the file (e.g. “outfile.csv”) associated with it.

And tidy up:

  • Closing your files is good practice – which is why “fin.close()” and “fout.close()” are here.   You’ll sometimes see CSV readers and writers created in code using lines like “csvin = csv.reader(open(“infile.csv”, “rb”))” – this makes it difficult to close the files when you finish, and could get messy over time.
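If you’d rather have the tidying up handled for you, here’s an equivalent sketch using Python’s “with” statement (Python 2.7 lets you open both files in one line), which closes both files automatically when the block ends – even if something goes wrong halfway:

import csv

with open("infile.csv", "rb") as fin, open("outfile.csv", "wb") as fout:
    csvin = csv.reader(fin)
    csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
    csvout.writerow(csvin.next())   #copy the header row across
    for row in csvin:
        csvout.writerow(row)        #copy the remaining rows
#both files are closed by the time you get here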

Well, that’s a short look at reading and writing CSV files.  Nothing much happened in the example code, but use a little imagination and you could be reading the files into your processing code, swapping around file columns, reformatting column contents and cross-matching them with the contents of other interesting files.  That part… is up to you.

Leaflet – maps in html…


[Cross-posted from ICanHazDataScience]

Maps are a very fast way to display data that’s spatial – i.e. has a physical position associated with it.  But doing this your own way in a webpage without using Ushahidi, Google, ESRI or a Python link to QGIS etc. means finding some open-source mapping code. I’m hearing a lot about the Leaflet mapping tool, so I thought I’d have a play with it.

First, to the Leaflet quick start guide, where you create a .html file, and get a map when you click on the html file.  The usual problem here: the instructions don’t quite work and aren’t detailed enough for newbies… I found it useful to click on their example map and “view source” (right-click on the page, then select “view source” in most browsers) to see what I might be missing.

One part of which is the API key for Cloudmade.  It’s also sensible to cut and paste the example code from Leaflet, because they’ve hidden a lot of text in there (like the OpenStreetMap license text).

Back to the Cloudmade API key.  You need to register for a Cloudmade developer key to obtain this: I just clicked “get api key”, then “register” without filling in the name, URL or description fields. Put the API key into the Leaflet code snippet (instead of “API-key”) and you’re away.

Of course, if you mistype the “http://cdn… ” address for leaflet.js, then you’ll spend a little while working out why your map doesn’t appear.  If this happens, then deep breath, then go back and check everything that you’ve typed into your file.

Why am I teaching my cat data science?


[Cross-posted from ICanHazDataScience]

I’m a data scientist. Not a ‘real’ one, as in someone who can write code that pulls lots of interesting comments out of Twitter and turns them into really pretty pictures without having to look up lots of things about algorithms and code. But I do spend a lot of time getting my hands on data and making sense of it, and I do know a bit about the work that data science is based on.  And I’m still learning about data science, so I’m writing this blog to store and hopefully pass on some useful tips and ideas.

Apparently, the best way to learn something is to try to teach it too, but I don’t have any students here. So this is Emily. She’s a cat: she likes sleeping, chasing small furry things (mice, possums, toes, that sort of thing), Meow Mix and hanging out next to my laptop, which (see above) is why I don’t type things with ‘q’ in them very often. She’s a pretty good listener, but will run off if you start to get boring or repetitive (or if you tickle her ears ‘wrong’).  She’s also prepared to try almost anything that will help to improve her world, although not for very long.  In short, she’s a pretty representative student (apologies to all my former students out there: you were much much better than this).

Onwards… with the help of the LolCat translator