Why spreadsheets are hard

Thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem.

  • Data nerds think of data in its back-end sense, of “what form of data do I need to be able to analyse/ visualise this”.  We normalise, we worry about consistency, we clean out the formatting.
  • People used to creating spreadsheets for other people think of it more in the front-end sense, of “how do I make this data easily comprehensible to someone looking at this”.

Each has merits/demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines) and part of our work as data nerds is working out how to bridge that divide.  Which is going to take work in both directions, but it’s necessary and important work to do.

WriteSpeakCode/ PyLadies joint meetup 2015-10-22: Tales of Open Source: rough notes

Pyladies: international mentorship program for female python coders

  • meetup,com, NYC Pyladies
  • Lisa moderating, Panelists: Maia McCormick, Anna Herlihy, Julian Berman, Ben Darnell, David Turner
  • Intros:
    • Maia: worked on Outreachy (formerly OPW) – gives stipends to women and minorities to work on OS code; currently at Spring
    • Anna: works at MongoDb, does a lot of Mongo OS work.
    • Julian: works at Magnetic (ad company); worked on Twisted, started OS project (schema for validating Json projects)
    • Ben: Tornado maintainer, working on OS distributed database on Go.
    • David: ex FSF, OpenPlans, now at Twitter, “making git faster”.
  • Q: how to find OS projects, how to get started?
    • D: started contributing to Xchat… someone said “wish chat had the following feature”… silence… recently, whatever the company is working on. Advice: find the right project, see if they’re interested, then write the feature.
    • B: started on python interpreter, was using game library, needed bindings for library
    • J: looked at OpenHatch OS projects.  Found Twisted – told that if want to get code in there, there’s a review process. Found feature/bug, wrote patch, waited for response – that got him in… vehicle for other people to read and respond to code.
    • A: first OS commit was to Mongodb – interned there after college. Couldn’t get feature to work on her mac, fixed it ’til it ran, then someone asked “are you going to put in a core request”…was first experience of request politics.  Hard to find projects that both need help, and want help. Best to contact first, e.g. “are you interested in a fix for OSX”. Most people’s experience of OS has been rejection or a negatively tinged experience.
    • M: first pull request got landed… top-down approach, “how do I get work experience on a big codebase – obvious answer is OS… applied to outreachy, who have a list of orgs who want donations”… found Gnome music on the list… iTunes for Gnome… looked at list of beginner-friendly bugs, built that (“approx a million years”) on own machine.  Gnome are particularly newbie-friendly.
    • Outreach deadline is Nov 2nd.
  • Q: how do you find a project that wants your contribution? (or tips for what to avoid)
    • D: avoid people who are loudly mean (e.g. Linux kernalists).  Responsiveness beyond everything… e.. friendly community who took a month to fix their instructions… sat on patch for 1-2 months.  Good: active community, can see closed pull requests (but linux/git have mailing lists, but that’s active)
    • B: has a list of newbie-friendly bugs.
    • J: gauge on whether want to use that software or not.
    • A: bugs are best place to start. Filing a bug report tells you a lot about the maintainers, e.g. on it immediately, starting a conversation about it, you can follow the progress of the bug – see the conversations between the contributors, reminds you that there are humans behind it… “any kind of form of life”.
    • M: probably would have started on bpython (shinier ipython), because peer was really excited about it… peer recommendation, people excited about a project = project probably doesn’t suck.
  • Q: suggestions for good places to find lists of welcoming OS projects
    • OpenHatch
    • Hacktoberfest (organised by digital ocean) – everyone submitting 4 projects from the list gets a free t-shirt
    • Look at the projects that OS projects include… those tools are also interesting projects.
    • Go to your bosses and ask if you can release the company software as OS.
  • Q: about your projects, features, bugs – something you’d like to share
    • M: dev environment – hard to build these. Long slog through virtual machine (e.g. fedora 2.1 was still in alpha)… lots of patience, and a new computer. Taking notes – wrote everything down, error messages etc so can do on next install, take to project maintainer as suggestions for things to go into instructions.
    • A: pymongo sometimes gets a bug that spirals out of control, and ends up being a python bug (that’s already been reported)… e.g. multiprocessing bug that took time to figure out. Getting a copy of the project is a big step towards actually contributing.
    • J: like perfect storm types of bugs, e.g. json schema had a bug… likes semantic versioning, maintaining backwards compatibility… a release was broken and put out, got bug report 6 hours after release from people in big orgs (e.g. openstack, mediawiki)… tiny detail – pip environment markers – broke the release; lots of people; doesn’t like fixing bugs until have regression test in place = pressure is on… did in 24 hours…
    • B: asynch in Tornado Async is interator returning awaitable objects, python library asynchio had different interpretation- trying to mix them, got stack overflows endlessly trying to convert objects. Still an open issue- did a workaround, but other code will have similar problems with it.
    • D: rewrote hash table function in git, git merge started crashing… because of the fix… git index is also called cache and staging area, depending of which part of code you’re in… created nightmares on macs… weirdass pointer being pulled out from under code whilst still in use – only happened on a mac on certain large merges… but patch not accepted the first way was written, so rewrote a different way
  • Q: anything your OS projects want help with now?
    • M: has a bug list – look for Gnome Music getting started page. “Gnome Love Bugs” https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=gnome-music;dist=unstable
    • A: lot of mongodb driver things to work on… any time release, looking for people to test for bugs – finding one = starting a conversation. Drivers have various levels of accessible bugs… mongodb is too big a place to start in.  And Mongodb is hiring.
    • J: Twisted has tedious but beginner-friendly work; J has proof of concept projects that wrote parts he needed (e.g. docker python bindings are literal translations of command line commands –  can jump in and extend out that library), etc. code lives on github https://github.com/Julian… not heavily organised.
    • B: cockroachdb has well-organised bug list. https://github.com/cockroachdb/cockroach – can talk to B about stuff that’s not well-organised.
    • D: git doesn’t have a public bug list, but can look at unit tests and see known failures… need to ask git if they’re things that people care about.  Also e.g. “git rm” removes entire account? (is not filed yet). (all panelists are hiring!)
  • Audience questions:
  • Q: Dropbox might be a good starter project.
  • Q: Setting aside time to work on OS? A: motivated by other people – find someone interested in working on a project. Take advantage of frustration – immediately after frustration, try to work something out.
  • Q: How do you deal with ownership in companies based on OS? Ordinary employee = work for hire. Contract employee = 20-point test, but can override that in the contract. Ownership matters if you want to enforce the license – need copyrights to do this.
  • Q: licenses? Apache vs MIT vs GPL? Prefer for most things copyleft (e.g. GPL), otherwise adoption. More permissive, e.g. Apache, MIT. But use FSF-approved license, e.g. Apache, MIT or GPL.

Looking at data with Python: Matplotlib and Pandas

I like python. R and Excel have their uses too for data analysis, but I just keep coming back to Python.

One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper, is to have a good look at what’s in it.  Two python libraries are useful here: Pandas and Matplotlib.

  • Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis.
  • Matplotlib is John D Hunter’s library for Matlab-style plots of data.

Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window.   It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn. I’ll meander through a few of them here.  I’ll be using Nepal medical shipments data as an example.

Reading in data

Pandas makes this easy. Reading in a CSV file is as simple as:

df = pd.read_csv(csvfilename) # Comma-separated file
df = pd.read_csv(csvfilename, sep='\t') # Tab-separated file

There’s also pd.read_json, pd.read_html, pd.read_sas, pd.read_stata and pd.read_sql_table to read in other data formats.  Be careful with read_html though: it only reads in html tables, and you’ll need lxml, beautifulsoup or suchlike if you want to read tables straight from a webpage.

First-look at the dataframe

I like to know what I’m dealing with before starting analysis.  I usually use Tableau or R for this, but that’s not always possible, and Pandas is a good alternative.

df.columns # List all the column headings
df.head(4) # The first 4 rows of data, same as df.head(n=4)
df[['column1','column2', 'column3']].head(10) # Just some of the columns
df.describe() # Basic statistics for every numerical column

That tells you what your columns are, what your first few rows look like (df.tail(4) will give the last rows) and some basic statistics for numerical columns, but you’re probably more curious than that.

df['columnname1'].value_counts()

Value_counts will tell you what’s in a single column.  If you want to know what’s in a pair or combination of columns, you’ll need to start using pivot tables or group_by.

You might know pivot tables from Excel.  They’re ways of creating a new datatable whose rows, columns and content are defined by you.  This function, for example, gives you a new table whose rows are column x values, columns are column y values, and contents are the number of rows that contained those combinations of x and y values.

x_by_y = df.pivot_table(index='columnx',columns='columny', values='columnz', aggfunc='count', fill_value=0)

Column z gets involved here because if you don’t nominate a column for the values, Pandas will return an array with the counts for every combination of columns. I’ve included fill_value=0 because I’m counting, and Pandas would otherwise include NaN (not a number) in its counts.

x_by_y is a data frame. You can plot this, for example:

x_by_y.head(10).plot(kind='bar', stacked=True)
plt.show()

You’re now using Matplotlib.  And that was quite a complex plot: a stacked bar chart, with a legend.   Note that .plot creates a plot object: if you want to *see* your plot, you need to type “plt.show()”.  This will put up a plot window and stop your code until you close the window again.

Basic data manipulation

I’ve had a look at the dataset, got some ideas for more things to look at in it, and some of them need calculations. Pandas handles this too.  More soon. Meanwhile, here’s some stuff I did with the Nepal dataset.

pivot1 = df.pivot_table(
 index='Material Hierarchy Family',
 columns='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count').plot(kind='barh', stacked=True)

recipientsize = df.groupby('Final Recipient Name').size()

pivot2 = df.pivot_table(
 columns='Material Hierarchy Family',
 index='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count')

pivot2.plot(kind='bar', stacked=True, legend=False)
plt.show()

Links