Data Science

Why spreadsheets are hard

I’ve been thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem. Data nerds think of data in its back-end sense: “what form of data do I need to be able to analyse/visualise this?” We normalise, we worry about consistency, we clean out the formatting. People used to creating spreadsheets for other people think of it more in the front-end sense: “how do I make this data easily comprehensible to someone looking at this?” Each has merits and demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines), and part of our work as data nerds is working out how to bridge that divide. That is going to take work in both directions, but it’s necessary and important work to do.
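To make the two views concrete, here is a minimal sketch using pandas. The table, its column names and its numbers are all made up for illustration; the point is only that the same data can sit in a human-friendly "wide" layout or a normalised "long" one, and that converting between them is a routine step.

```python
import pandas as pd

# Hypothetical "front-end" table: one row per region, one column per year.
# Easy for a person to scan, awkward for most analysis code.
wide = pd.DataFrame(
    {
        "region": ["North", "South"],
        "2014": [10, 7],
        "2015": [12, 9],
    }
)

# "Back-end" (normalised) form: one row per observation.
long = wide.melt(id_vars="region", var_name="year", value_name="cases")
long["year"] = long["year"].astype(int)

# And back again, for a human reader.
readable = long.pivot(index="region", columns="year", values="cases")

print(long)
print(readable)
```

Real human-made spreadsheets usually need more than this (merged cells, totals rows, notes in the margins), so treat the sketch as the easy case.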

Software

WriteSpeakCode/PyLadies joint meetup 2015-10-22: Tales of Open Source: rough notes

PyLadies: international mentorship program for female Python coders. meetup.com, NYC PyLadies. Lisa moderating.

Panelists: Maia McCormick, Anna Herlihy, Julian Berman, Ben Darnell, David Turner.

Intros:
Maia: worked on Outreachy (formerly OPW), which gives stipends to women and minorities to work on OS code; currently at Spring.
Anna: works at MongoDB, does a lot of Mongo OS work.
Julian: works at Magnetic (ad company); worked on Twisted, started OS project (schema for validating JSON projects).
Ben: Tornado maintainer, working on an OS distributed database in Go.
David: ex-FSF, OpenPlans, now at Twitter, “making git faster”.

Q: how to find OS projects, how to get started?
D: started contributing to Xchat… someone said “wish chat had the following feature”… silence… recently, whatever the company is working on. Advice: find the right project, see if they’re interested, then write the feature.
B: started on Python interpreter, was using game library, needed bindings for…

Data Science

Looking at data with Python: Matplotlib and Pandas

I like Python. R and Excel have their uses too for data analysis, but I just keep coming back to Python. One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper is to have a good look at what’s in it. Two Python libraries are useful here: Pandas and Matplotlib. Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis. Matplotlib is John D Hunter’s library for Matlab-style plots of data.

Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window. It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn….
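As a rough sketch of that first "good look" at a dataset: the snippet below reads a small CSV, prints a summary, and draws a line plot. The data and the column names ("month", "visits") are invented so the example runs without any file on disk; with a real file, pd.read_csv takes the file path instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display; drop in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for a real datafile.
csv_text = """month,visits
1,120
2,135
3,160
4,150
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first few rows
print(df.describe())  # count, mean, std, min, quartiles and max per numeric column

# Simple line plot of one column against another.
fig, ax = plt.subplots()
ax.plot(df["month"], df["visits"], marker="o")
ax.set_xlabel("month")
ax.set_ylabel("visits")
# In an interactive session, plt.show() would display the figure here.
```

head() and describe() are usually enough to spot obviously wrong values or columns read in as the wrong type before going any further.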