Data Science

Learnins Python

[Cross-posted from ICanHazDataScience] Bad news. You’re probably going to have to learn to code. Whilst you can go a very long way with the tools available online, at some point you’re going to have that “if I could just reformat that column and extract this information out of it” moment. Which, generally, means either coding or finding a coder happy to help you with the task (hackathons like RHOK are good places, and are always looking for good problem statements; there are also many coding-for-good groups around that might help too). Not such bad news if you’re up for writing your own code. There is *lots* of help available online. The language you choose is up to you… many social-good systems are written in PHP, for example, and many open data systems are in Python (and there are a lot of good data-wrangling libraries available in Python and many data science courses use it…

Data Science

Python on Windows. Kthxbye!

[Cross-posted from ICanHazDataScience] At one point, I had Windows, Mac and Linux laptops, an Android tablet, Android phone and iPhone… but now I’m just down to the things that aren’t forbidden fruit. I used to write all my code on the Linux machine, and use the Windows one for writing (it has a bigger screen that lets me view two documents side-by-side). But now I *like* coding on my Windows 7 machine. Here are some of the things that I’ve learnt doing it. Find some friendly instructions to help you with things like setting environment variables (i.e. the thing that means you can just type “python” in your terminal window instead of a really long address to the python executable file). Install 32-bit Python instead of 64-bit Python. Yes, it sounds weird on a 64-bit machine, but trust me, some libraries will break horribly (and some will be just plain unavailable) if…
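If you’re not sure which Python you ended up with, or whether the “python” you type in the terminal is the one you think it is, a quick check along these lines may help. This is only a sketch: the function names `python_bitness` and `describe_interpreter` are made up for illustration, and it uses nothing beyond the standard library.

```python
import struct
import sys


def python_bitness():
    """Return 32 or 64, based on the pointer size of the running interpreter."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


def describe_interpreter():
    """Summarise which Python executable is actually running."""
    return "{}-bit Python {}.{} at {}".format(
        python_bitness(),
        sys.version_info[0],
        sys.version_info[1],
        sys.executable,
    )


print(describe_interpreter())
```

Run it from the same terminal window where you type “python”: the path it prints is the executable that your environment variables are pointing at.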

Data Science

Readin and Writin Excel Files

[Cross-posted from ICanHazDataScience] And so to Excel. Many people think Excel must be difficult to read into and write from a program. Nah! Again, there are libraries to help you: this time, it’s xlrd and xlwt (for reading Excel and writing Excel, respectively). There’s a nice tutorial on using Python on Excel files here, so I’ll just give some example code in this post.

Here’s the file I’m running it on: Example Excel file

And here’s the code:

    import xlrd
    import xlwt

    # Set up
    wbkin = xlrd.open_workbook("infile.xls")
    wbkout = xlwt.Workbook()

    # Read in data from Excel file
    numsheets = wbkin.nsheets
    sh = wbkin.sheet_by_index(0)
    print(sh.nrows)
    print(sh.ncols)
    print(sh.cell_value(0, 1))
    # merged_cells is an attribute, not a method; xlrd only fills it in
    # when the workbook is opened with formatting_info=True
    merges = sh.merged_cells

    # Print out contents of first worksheet
    for r in range(0, sh.nrows):
        for c in range(0, sh.ncols):
            print("(" + str(r) + "," + str(c) + ")" + str(sh.cell_value(r, c)))

    # Write data out to Excel file
    sheet = wbkout.add_sheet("sheetname", cell_overwrite_ok=True)
    sheet.write(0, 0, "cell contents")
    sheet.write(0, 1, "more in the first row")
    sheet.write(1, 0, "contents in the second row")
    wbkout.save("outfile.xls")

So what’s…

Data Science

Why Cant I Redd Arabic in Mah Files?

[Cross-posted from ICanHazDataScience] I work on development data. Sometimes on datafiles, sites or streams that cover a large part of the world. Which means that, sooner or later, I’m going to get an error that looks like this when I’m reading something into Python: “UnicodeEncodeError: 'ascii' codec can't encode character u'\u015e'”. At this point, you need Unicode. Unicode is like the babelfish of written text: it contains characters for most human languages, including Arabic, so for instance it can deal with reading data from websites where multiple human languages are used (e.g. at least 10 on one of the sites that I maintain). Most people’s files use just one character set (things like ASCII or Latin-1), so they never see the problem above – we development nerds are likely to see it a lot! For example, this line (placed at the top of your code file) can save you pain when…
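To see where that error comes from, here is a small sketch (not taken from the original post) that triggers the same failure on purpose and then avoids it by naming an encoding explicitly. The sample text, the file name `place_names.txt` and the helper names `try_ascii` and `round_trip_utf8` are all illustrative assumptions; `io.open` is used because it accepts an `encoding` argument in both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Illustrative text only: a place name starting with the character
# from the error message (u'\u015e', a capital S with cedilla).
text = u"\u015eanl\u0131urfa"


def try_ascii(s):
    """Return True if s can be encoded as ASCII, False otherwise."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        # This is the error quoted above: ASCII has no such character.
        return False


def round_trip_utf8(s, path):
    """Write s to path as UTF-8, then read it back and return it."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(s)
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), "place_names.txt")
print(try_ascii(text))
print(round_trip_utf8(text, path) == text)
```

The point of the sketch is that the text itself is fine; the trouble starts when something tries to squeeze it into ASCII. Saying which encoding you want every time you read or write is one way to keep that from happening by accident.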

Data Science

Readin and Writin CSV files

[Cross-posted from ICanHazDataScience] Google should add Lolcatz to translate.google.com… But seriously, if you want to do data science, you’re going to need data. Which means being able to access data: data in streams (like Twitter), data online (like websites) and data in files. We’ll start with the files. Development agencies, bless them, have a major love affair going on with Excel files. Open any development data site (or crawl it for datafiles) and you’ll see overwhelming numbers of reports (we’ll get to those later), .xls and .csv files. Let’s start with the easier filetype: CSV. CSV (Comma-Separated Values) is as it sounds: a file containing readable text, usually separated with commas, but sometimes instead with semicolons “;”, tabs, colons “:”, pipes “|” or anything else that the person writing the file thought was appropriate (data.un.org, for example, gives you the choice of comma, semicolon or pipe). Here’s an example, seen from Microsoft…
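The “commas, or whatever the author felt like” point can be sketched with Python’s built-in csv module, which lets you say which separator a file uses. This is a minimal Python 3 sketch rather than the post’s own code: the helper name `read_delimited` and the sample rows are invented for illustration, and the text is read from a string instead of a real file to keep the example self-contained.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of rows (each row a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Invented sample data: the same table written with two different separators.
comma_text = "country,year,value\nKenya,2010,1.5\n"
pipe_text = "country|year|value\nKenya|2010|1.5\n"

rows_comma = read_delimited(comma_text)
rows_pipe = read_delimited(pipe_text, delimiter="|")

print(rows_comma)
print(rows_pipe == rows_comma)
```

For a real file you would pass an open file object to `csv.reader` instead of the `io.StringIO` wrapper. Note that every value comes back as a string, so numbers need converting before you do any arithmetic with them.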

Humanitarianism

Future cities

Cities are apparently the future. All the predictions I’ve seen for the next few decades show the world’s population concentrating in cities, but our development indicators and policies are still listed by nation state. Perhaps they should be wider, for instance by including developing cities on the lists. I said “developing” there – which raises the question “how are these cities developing?”. This isn’t just a Las Vegas-style spreading of suburbia across the desert: many of the cities I’ve visited in the past year have shanty towns, and these appear, at least from outside, to be where a lot of the city development is happening (btw, I wanted to use a less emotive word than ‘slum’ here: although it’s what Slum Dwellers International uses, there’s still a lot of negative feeling about it). From Lagos to Guatemala to Haiti, I’ve seen dozens of homes and businesses under tin roofs looking…