Learnins Python

[Cross-posted from ICanHazDataScience]

Bad news.  You’re probably going to have to learn to code.  Whilst you can go a very long way with the tools available online, at some point you’re going to have that “if I could just reformat that column and extract this information out of it” moment.  Which, generally, either means coding or finding a coder happy to help you with the task (hackathons like RHOK are good places, and are always looking for good problem statements; there are also many coding-for-good groups around that might help too).

Not so bad news if you’re up for writing your own code. There is *lots* of help available online.  The language you choose is up to you… many social-good systems are written in PHP, for example, many open data systems are in Python (and there are a lot of good data-wrangling libraries available in Python and many data science courses use it as their default language), and R (free) and Matlab (not so free) are good for handling large arrays of data too.

I personally write most of my code in Python. This might not be your choice once you look at the other languages available, but it works for me, so that’s what I’m going to write about (interspersed with a little PHP and R where it’s appropriate).

So how do you start?  When someone sends you a file with a name like “thingy.py”, how do you run it?

You have options here, depending on what you want to do (run a file or code your own?), how much time you want to put into it (two hours, a week, two years), and what your learning style is (reading text, watching video, doing tests, having a tutor).  Most of these options are currently available free. Here are some of them:

Python on Windows. Kthxbye!

[Cross-posted from ICanHazDataScience]

At one point, I had Windows, Mac and Linux laptops, an android tablet, android phone and iphone… but now I’m just down to the things that aren’t forbidden fruit.  I used to write all my code on the Linux machine, and use the Windows one for writing (it has a bigger screen that lets me view two documents side-by-side).  But now I *like* coding on my Windows 7 machine. Here are some of the things that I’ve learnt doing it.

  • Find some friendly instructions to help you with things like setting environment variables (i.e. the thing that means you can just type “python” in your terminal window instead of a really long address to the python executable file).
  • Install 32-bit Python instead of 64-bit Python.  Yes, it sounds weird on a 64-bit machine, but trust me, some libraries will break horribly (and some will be just plain unavailable) if you try using the 64-bit version.  The 32-bit version works fine, honest!
  • “Real” coders will laugh at you in Hackathons.  Ignore them… or just blow them away with your awesome webpage scrapers.
  • Python 3.x is the latest version, but Python 2.x has better support. I’m currently on Python 2.7.3.  It works.
  • If you’re adding Python libraries (aka “packages”) to your machine (and you will), pip wins out over easy_install because it comes with things like uninstall (easy_install doesn’t).  Install Distribute, then install Pip… here are some geek notes on why and how.
  • Some packages need slightly different code if you’re running on a Windows machine.  For instance, connecting to Skype using Skype4Py has these variations between machines:
import os
import sys

import Skype4Py

if os.name == "nt":
    print("Windows")
    # 32-bit and 64-bit Windows behave differently with Skype
    if 'PROGRAMFILES(X86)' in os.environ:
        sconn = Skype4Py.Skype()  # 64-bit
    else:
        sconn = Skype4Py.Skype()  # 32-bit
elif sys.platform == "darwin":
    # os.name is "posix" on both Mac and Linux, so check sys.platform for a Mac
    print("Mac")
    sconn = Skype4Py.Skype()
else:
    print("Linux machine or similar")
    sconn = Skype4Py.Skype(Transport='x11')

(That was a very hard-won piece of code. It won’t happen to you often, but occasionally you need to ask whether it’s your machine type that’s the problem).

  • Virtual environments rock. Mostly.  But in Windows, you have to connect them to your PostgreSQL databases the hard way.  VirtualEnv is the package that you use to create virtual environments on your machine – environments that mimic the one you’ll eventually distribute your code into online (e.g. Heroku or Amazon).  You don’t need it yet, but it saves a lot of time when your code’s ready to show the whole world.

 Well, that’s just a few things I’ve learnt.  I know I should be a hard-core coder who defends Linux to the hilt – and I like Linux a lot (heck, I can remember when Vi was young and Emacs was a hot new thing), but I’m also very aware that a lot of the people who are managing, analysing and creating data are using Windows machines to do it, and I don’t want to be another person getting in the way of them starting to code on said machines.

Onwards!

Readin and Writin Excel Files

[Cross-posted from ICanHazDataScience]

And so to Excel.  Many people think Excel files must be difficult to read into, and write out of, a program.  Nah!  Again, there are libraries to help you: this time, it’s xlrd and xlwt (for reading Excel and writing Excel, respectively).

There’s a nice tutorial on using Python on Excel files here, so I’ll just give some example code in this post. Here’s the file I’m running it on:

Example Excel file

And here’s the code:

import xlrd
import xlwt

# Set up
wbkin = xlrd.open_workbook("infile.xls")
wbkout = xlwt.Workbook()

# Read in data from the Excel file
numsheets = wbkin.nsheets
sh = wbkin.sheet_by_index(0)
print(sh.nrows)
print(sh.ncols)
print(sh.cell_value(0, 1))
merges = sh.merged_cells  # empty unless the workbook is opened with formatting_info=True (see below)

# Print out the contents of the first worksheet
for r in range(0, sh.nrows):
    for c in range(0, sh.ncols):
        print("(" + str(r) + "," + str(c) + ") " + str(sh.cell_value(r, c)))

# Write data out to an Excel file
sheet = wbkout.add_sheet("sheetname", cell_overwrite_ok=True)
sheet.write(0, 0, "cell contents")
sheet.write(0, 1, "more in the first row")
sheet.write(1, 0, "contents in the second row")
wbkout.save("outfile.xls")

So what’s going on in here?   First, the setup.

  • wbkin = xlrd.open_workbook(“infile.xls”) creates a readable workbook object (the thing you use to access the input file) linked to the file “infile.xls”.  Note that this is a “.xls” Excel file: xlrd can handle Excel 2010 files (“.xlsx”) but so far, they’ve given me errors (this might be something to do with me working on a 64-bit Windows machine though).
  • “wbkout = xlwt.Workbook()” creates a writable workbook object.  Note that this isn’t attached to any file at the moment – and that if you don’t use the command “wbkout.save(“outfile.xls”)” after you’ve written data into the workbook, it won’t ever make it into a file.
  • xlrd has other cool commands like “sh.col(1)”, which puts an entire column of your workbook into an array for you – there’s a small sketch of this just below.
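Here’s a minimal sketch of that, using the same “infile.xls” example file as above – col() gives you xlrd Cell objects, and col_values() gives you just the values:

import xlrd

# A minimal sketch: pull a whole column out of the first worksheet.
wbkin = xlrd.open_workbook("infile.xls")
sh = wbkin.sheet_by_index(0)

title_cells = sh.col(1)    # the second column, as xlrd Cell objects
titles = sh.col_values(1)  # the same column, as plain values
print(titles)              # e.g. [u'Title', u'title1', u'title 2', ...]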

And the read:

  • “numsheets = wbkin.nsheets” gives you the number of worksheets in the input file.  In this case, there are 3 (“sheet 1”, “sheet 2” and “sheet 3”).
  • “sh = wbkin.sheet_by_index(0)” gives you a pointer to the worksheet numbered 0 (in this case “sheet 1” – note that Python counts from 0 rather than 1).  If you know what your worksheets are called, then you could use sh = wbkin.sheet_by_name(“Sheet1”) instead (there’s a sketch that loops over every worksheet just after this list).
  • “sh.nrows” and “sh.ncols” gives you the number of rows and columns in the current worksheet; in this case, 5 and 3 respectively.
  • “sh.cell_value(row,col)” gives you the contents of a worksheet cell. “sh.cell_value(0,1)” gives you the contents of the first row, and the second column in that row (remember that Python counts from 0); in this case, “Title”.
  • “merges = sh.merged_cells” should give a list of the merged cells in the worksheet – which is really useful, because these appear a lot in development data and are a pain to unmerge.  But this command doesn’t work for me as written: “sh.merged_cells” gives me an empty list for the Excel file above.  As far as I can tell, xlrd only fills it in if you open the workbook with formatting_info=True – the merged-cells sketch after the worksheet printout below uses that.
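And here’s the worksheet-looping sketch I mentioned above – it just uses wbkin.sheet_names() to visit every worksheet in the input file by name:

import xlrd

# A minimal sketch: visit every worksheet in the workbook, by name.
wbkin = xlrd.open_workbook("infile.xls")
for name in wbkin.sheet_names():
    sh = wbkin.sheet_by_name(name)
    print(name + ": " + str(sh.nrows) + " rows, " + str(sh.ncols) + " cols")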

Print out contents of first worksheet:

(0,0) Id
(0,1) Title
(0,2) Notes
(1,0) 1.0
(1,1) title1
(1,2) This is a note.
(2,0) 2.0
(2,1) title 2
(2,2) This is a merged cell
(3,0) 4.0
(3,1) Fourth title
(3,2)
(4,0) 5.0
(4,1) Title 5
(4,2) This is a note on line 5.

  • This is what Python thinks the contents of this worksheet are. Note the defensive piece of programming in the code: “str(sh.cell_value(r,c))” – this takes the value in the cell and converts it from whatever format xlrd thinks it is (notice that all the Ids have been turned into floating-point numbers, for instance) into a string. Which stops the string concatenation inside the print() statement from breaking when it’s fed a number.
  • Note that xlrd happily read in all the rows and columns, but didn’t notice that one of the cells was merged.  So far, the contents of every merged cell that I’ve read in have been assigned to the row and column of its top left-hand corner (this might be helpful to you later – the sketch just below uses it).
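If you do need the merged ranges, here’s that merged-cells sketch.  As far as I can tell, xlrd only fills in merged_cells when you open the workbook with formatting_info=True (and only for “.xls” files, not “.xlsx”); each entry is a (row_lo, row_hi, col_lo, col_hi) tuple, and the merged value lives in the top left-hand cell:

import xlrd

# A minimal sketch: list the merged ranges and spread each merged value
# across every cell it covers (a crude "unmerge").
wbkin = xlrd.open_workbook("infile.xls", formatting_info=True)
sh = wbkin.sheet_by_index(0)

for (rlo, rhi, clo, chi) in sh.merged_cells:
    value = sh.cell_value(rlo, clo)   # the merged value sits in the top-left cell
    for r in range(rlo, rhi):         # rhi and chi are exclusive bounds
        for c in range(clo, chi):
            print("(" + str(r) + "," + str(c) + ") " + str(value))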

And write:

  • “sheet = wbkout.add_sheet(“sheetname”, cell_overwrite_ok=True)” adds a sheet named “sheetname” to the output workbook.  “cell_overwrite_ok=True” is a useful option because it lets you add cells to the workbook in a nonlinear way (e.g. you don’t have to write [0,0] then [0,1] then [0,2] then [1,0] etc.).
  • “sheet.write(0,1, “more in the first row”)” writes the text “more in the first row” to the cell in the first row and second column of the worksheet. 
  • “wbkout.save(“outfile.xls”)” saves the workbook to the file “outfile.xls”.  Without this command, you’ll have a workbook object but it won’t be written to any file (there’s a sketch that ties the read and write halves together just after this list).
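Putting the two halves together, here’s the read-and-write sketch mentioned above: it copies the whole of the first worksheet of “infile.xls” into a new file (“outfile2.xls” is just a name I’ve made up so it doesn’t overwrite the earlier output file).  Nothing in it is new – it’s just the commands from this post strung together:

import xlrd
import xlwt

# A minimal sketch: copy the first worksheet of infile.xls into outfile2.xls.
wbkin = xlrd.open_workbook("infile.xls")
shin = wbkin.sheet_by_index(0)

wbkout = xlwt.Workbook()
shout = wbkout.add_sheet(shin.name, cell_overwrite_ok=True)

for r in range(0, shin.nrows):
    for c in range(0, shin.ncols):
        shout.write(r, c, shin.cell_value(r, c))

wbkout.save("outfile2.xls")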

That’s it for now.  As usual, if you get stuck, try searching StackOverflow and the rest of the internet for the error message you got – almost always, someone’s had the same problem before.  Onwards!

Why Cant I Redd Arabic in Mah Files?

[Cross-posted from ICanHazDataScience]

I work on development data.  Sometimes on datafiles, sites or streams that cover a large part of the world.  Which means that, sooner or later, I’m going to get an error that looks like this when I’m reading something into Python:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u015e'

At this point, you need Unicode.  Unicode is like the babelfish of written text: it contains characters for most human languages, including Arabic, so it can deal with reading data from websites where multiple human languages are used (e.g. at least 10 on one of the sites that I maintain).  Most people’s files contain just one character set (things like ASCII and Latin-1), so they don’t ever see the problem above – we development nerds are likely to see it a lot!  For example, this line (placed at the top of your code file) can save you pain when you’re writing non-ascii characters in your code (e.g. in comments or text strings):

# -*- coding: utf-8 -*-

And this line can save you a lot of pain when you’re looking at your files:

print(data.encode('utf-8'))

But they didn’t say “unicode”, Emily says (before yawning and settling back down to sleep).  That’s because you probably won’t see the word “unicode” in Python code… you’re more likely to see words like “utf-8” and “utf-16”.  These are both implementations of unicode – different ways of representing unicode characters as one or more bytes.  For Arabic and other common languages, utf-8 is more than sufficient (and very few people need or use utf-16).  I could rattle on here (and if you need it, can supply the magic words for e.g. reading an Excel file in UTF-8), but this blogpost about Arabic in Python should cover all (or at least most) of what you need to get started.
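If you want something concrete to play with, here’s a minimal sketch of those magic words for a plain text file (the filename “arabic.txt” is made up – any UTF-8 text file will do).  It uses the standard codecs library to read the file as UTF-8, then encodes each line back to UTF-8 bytes before printing, which is what keeps the ascii codec error above away:

# -*- coding: utf-8 -*-
import codecs

# A minimal sketch: read a UTF-8 text file and print it without
# tripping the "ascii codec can't encode character" error.
fin = codecs.open("arabic.txt", "r", encoding="utf-8")
for line in fin:
    # each line is now a unicode string; encode it back to utf-8 bytes
    # before printing, otherwise Python 2 may try to encode it as ascii
    print(line.encode("utf-8"))
fin.close()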

Play, experiment, and remember to search StackOverflow if you get stuck (chances are, that whatever Python throws at you already has a question and answer on StackOverflow). More geeky notes on unicode include Green Notes’ blogpost and Evan Jones’ note.

Readin and Writin CSV files

[Cross-posted from ICanHazDataScience]

Google should add Lolcatz to translate.google.com…

But seriously, if you want to do data science, you’re going to need data.  Which means being able to access data: data in streams (like Twitter), data online (like websites) and data in files.  We’ll start with the files.

Development agencies, bless them, have a major love affair going on with Excel files.  Open any development data site (or crawl it for datafiles) and you’ll see overwhelming numbers of reports (we’ll get to those later), .xls and .csv files.

Let’s start with the easier filetype: CSV.  CSV (Comma-Separated Values) is as it sounds: a file containing readable text, usually separated with commas, but sometimes instead with semicolons (“;”), tabs, colons (“:”), pipes (“|”) or anything else that the person writing the file thought was appropriate (data.un.org, for example, gives you the choice of comma, semicolon or pipe).  Here’s an example, seen from Microsoft Excel:

CSV file as seen in Microsoft Excel

If you open this file in a text editor (I’m using Notepad++ here – which is heartily recommended for looking at webpage files too), it looks like this:

CSV file as seen in Notepad++  

There are two things to note here: first, that the columns in the Excel view correspond to the commas in the Notepad++ one.  And second, that the comma in Excel (field C3) doesn’t create a new column in the CSV file, because Excel wraps the whole field in quotation marks (“This is a comment, that includes a comma.”).

Yes, it’s that simple a file format.  And there’s a friendly Python library to make reading and writing CSV files even easier.  It’s called “csv”.  Here are some simple bits of code that use it to read a CSV file, print out its contents, and write them out to another CSV file.

import csv

# Set up
fin = open("infile.csv", "rb")
fout = open("outfile.csv", "wb")
csvin = csv.reader(fin)
csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)

# Read in the header row
headers = csvin.next()
for header in headers:
    print(header)

# Read in the non-header rows
for row in csvin:
    for col in range(0, len(row)):
        print(row[col])

# Write out
csvout.writerow(["header1", "header2", "header3"])
fin.seek(0)              # the loops above read to the end of the input file, so rewind it
csvin = csv.reader(fin)  # and start a fresh reader on it
csvin.next()             # skip past the header row (we've already written new headers)
for row in csvin:
    csvout.writerow(row)

# Tidy up
fin.close()
fout.close()

Okay, a few things happened here.  First, the setup:

  • Before you can use a library (e.g. csv), you have to tell your code about it.  That’s what the “import csv” line is about.
  • Second, you have to open a file before you can use the csv library on it.  The line “fin = open(“infile.csv”, “rb”)” opens a file called “infile.csv” for reading on my Windows system (that’s the “rb” part), and points to the open file with a variable called “fin”.  The line “fout = open(“outfile.csv”, “wb”)” does a similar thing, opening a file called “outfile.csv” for writing (that’s the “wb” part).
  • csv.reader and csv.writer are where the csv library gets involved.  “csvin = csv.reader(fin)” gives you an object, csvin, with lots of handy ways to read in CSV files.  “csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)” gives you an object, csvout, with lots of handy ways to write to CSV files.  That “quoting=csv.QUOTE_NONNUMERIC” is important, by the way – it wraps your text fields in quotes (“) when writing, so a comma inside the text never gets read back in as a new column, and the default CSV reader *always* reads each field in as a text string (this is important because you don’t have to write extra code that worries about which type each input is).  There’s a short sketch of these reader and writer options just after this list.
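Here’s that sketch of the reader and writer options.  The filenames are made up; the point is the delimiter option (for files that use semicolons, tabs or pipes instead of commas) and what QUOTE_NONNUMERIC does when you write:

import csv

# A minimal sketch of the csv.reader / csv.writer options mentioned above.
# "data_un.csv" is a made-up name for a semicolon-separated file.
fin = open("data_un.csv", "rb")
csvin = csv.reader(fin, delimiter=";")   # use "\t" for tabs or "|" for pipes
for row in csvin:
    print(row)                           # each row comes back as a list of strings

fout = open("quoted.csv", "wb")
csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
# Text fields get wrapped in quotes, numbers don't:
csvout.writerow(["Title, with a comma", 42])   # -> "Title, with a comma",42

fin.close()
fout.close()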

Then the read:

  • CSV files are organised in rows, so it’s not too surprising that CSV readers do the same thing.  “headers = csvin.next()” picks the next row out of the CSV file, and puts it into an array variable called “headers”.  Arrays are lists of things that look like this in Python: [“Id”, “Title”, “Comments”, “Date”] – the next lines, “for header in headers:” and “print(header)”, select each item in the array headers and print it to the screen (this is where reading everything in as text strings wins: if any of the headers came in as numbers instead, they could trip this piece of code up – unless you used print(str(header)) here).
  • “for row in csvin:” is a loop – it picks the next row from the CSV file, and applies the following code to it (note the lack of brackets here: the only way you know what a loop applies to is by looking at the indentation of the lines below it), then repeats this with the next row in the CSV file until it gets to the end of the file (at which point it stops and your program goes on to the next piece of code). Note that “for header in headers:” is also a loop, but it’s working on a different type of object (an array rather than a file), so it works on each item in the array instead.
  • row is an array, e.g. [“1”, “Title 1”, “This is a comment on title 1.”, “3/10/2013”], which means that you can access any column in that row by typing “row[col]”, where col is any number from 0 to the number of columns in the file minus 1.  I’ve used this to print out the row contents in order, but you could use it to pick out any data field that you want (there’s a sketch of that just after this list).
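For example, here’s a minimal sketch that picks out just one column by its header name (I’m assuming the file has a column headed “Title”, like the example file above):

import csv

# A minimal sketch: print just the "Title" column from the CSV file.
fin = open("infile.csv", "rb")
csvin = csv.reader(fin)

headers = csvin.next()
titlecol = headers.index("Title")   # which column number "Title" is in

for row in csvin:
    print(row[titlecol])

fin.close()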

And the write:

  •  Writing is pretty easy: once you’ve created a csv.writer object, e.g. csvout, you can use “csvout.writerow(array)” to write a row of data to the file (e.g. “outfile.csv”) associated with it.

And tidy up:

  • Closing your files is good practice – which is why “fin.close()” and “fout.close()” are here.  You’ll sometimes see CSV readers and writers created in code using lines like “csvin = csv.reader(open(“infile.csv”, “rb”))” – this makes it difficult to close the files when you finish, and could get messy over time (the “with” statement sketch just below is one tidier way around this).
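If you’d rather not have to remember the close() calls, Python’s “with” statement closes the file for you when the indented block finishes (even if something goes wrong inside it).  A minimal sketch:

import csv

# A minimal sketch: "with" closes the file automatically when the block ends.
with open("infile.csv", "rb") as fin:
    csvin = csv.reader(fin)
    for row in csvin:
        print(row)
# fin is already closed by the time we get here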

Well, that’s a short look at reading and writing CSV files.  Nothing much happened in the example code, but use a little imagination and you could be reading the files into your processing code, swapping around file columns, reformatting column contents and cross-matching them with the contents of other interesting files.  That part… is up to you.

Future cities

Cities are apparently the future. All the predictions I’ve seen for the next few decades show the world’s population concentrating in cities, but our development indicators and policies are still listed by nation state. Perhaps they should be wider, for instance by including developing cities on the lists.

I said “developing” there – which begs the question “how are these cities developing?”.  This isn’t just a Las Vegas-style spreading of suburbia across the desert: many of the cities I’ve visited in the past year have shanty towns, and these appear, at least from outside, to be where a lot of the city development is happening (btw, I wanted to use a less emotive word than ‘slum’ here: although it’s what Slum Dwellers International uses, there’s still a lot of negative feeling about it).  From Lagos to Guatemala to Haiti, I’ve seen dozens of homes and businesses under tin roofs looking across at smaller numbers of tower blocks, and wondered “how do these economies fit together”, and “where does it go from here”.

Good old BBC gave me a few more answers… and a few more questions (like how does the nation-based world fit with people this adaptable and informal), and case studies (Medellín and London) of both positive and negative ways that the shanty and non-shanty worlds can start to fit together.

Perhaps it’s the way you look at it.  If you look at the Wikipedia links above, you’ll see shanty towns and slums described in very negative terms… impoverished, illegal, lack of services.  If you hang out with people who live or work in shanty towns, they’re communities and neighbours and businesses and services – and quite possibly the adaptable, informal, majority economic future of the cities they’re part of.  Whichever way you look at it, there are a lot of people in shanty towns, and how (and what) they develop is important.