ICanHazDatascience

Data Science Tools, or what’s in my e-backpack?

One of the infuriating (and at the same time, strangely cool) things about development data science is that you quite often find yourself in the middle of nowhere with a job to do, and no access to the Internet (although this doesn’t happen as often as many Westerners think: there is real internet in most of the world’s cities, honest!). Which means you get to do your job with exactly what you remembered to pack on your laptop: tools, code, help files, datasets and academic papers. This is where we talk about the tools.

The DS4B toolset

This is the tools list for the DS4B course (if you’re following the course, don’t panic: install notes are here).

Offline toolset:
- Already on the machine: Terminal window, Calculator
- Anaconda: Jupyter notebooks, Python (version 3)
- R (need to add this to Anaconda)
- RStudio
- OpenRefine
- D3 libraries
- Tabula
- Excel or LibreOffice (open-source equivalent)
- QGIS (Mac users: note the separate instructions on this!)
- GDAL…
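Not part of the original list, but if you want to sanity-check the Python side of that e-backpack before you lose connectivity, a minimal sketch like this will do it (the package names are just illustrative candidates from the list above):

import importlib.util

# Report which Python-side packages are already installed in this environment
# (names are illustrative; swap in whatever your own toolset needs)
for pkg in ["pandas", "matplotlib", "jupyter", "notebook"]:
    spec = importlib.util.find_spec(pkg)
    print(f"{pkg}: {'installed' if spec else 'missing'}")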

ICanHazDatascience

Data Science Ethics [DS4B Session 1e]

This is what I usually refer to as the “Fear of God” section of the course…

Ethics

Most university research projects involving people (aka “human subjects”) have to write an ethics statement and adhere to an overarching ethics framework, e.g. “The University has an ethical commitment to minimize the risks to research subjects and to ensure that individuals who participate in research projects conducted under its auspices… do so voluntarily and with an informed understanding of what their involvement will mean”. Development data scientists are not generally subject to ethics reviews, but that doesn’t mean we shouldn’t also ask ourselves the hard questions about what we’re doing with our work, and the people it might affect. At a minimum, if you make data public, you have a responsibility, to the best of your knowledge, skills and advice, to do no harm to the people connected to that data. Data…

ICanHazDatascience

Writing a problem statement [DS4B Session 1d]

Data work can sometimes seem meaningless. You go through all the training on cool machine learning techniques, find some cool datasets to play with, run a couple of algorithms on them, and then. Nothing. That sinking feeling of “well, that was useless”. I’m not discouraging play. Play is how we learn how to do things, where we can find ideas and connect more deeply to our data. It can be a really useful part of the “explore the data” stage of data science, and there are many useful playful design activities that can help with “ask an interesting question”. But data preparation and analysis take time and are full of rabbit holes: interesting but time-consuming things that aren’t linked to a positive action or change in the world. One thing that helps a lot is to have a rough plan: something that can guide you as you work through a data…

ICanHazDatascience

Data Science is a Process [DS4B Session 1c]

People often ask me how they can become a data scientist. To which my answers are usually ‘why’, ‘what do you want to do with it’ and ‘let’s talk about what it really is’. So let’s talk about what it really is. There are many definitions of data science, e.g.:

- “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”
- “The analysis of data using the scientific method”
- “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

We can spend hours debating which definition is ‘right’, or we could spend those hours looking at what data scientists do in practice, getting some tools and techniques under our belts and finding a definition that works for each one of us personally….

ICanHazDatascience

About these sessions [DS4B Session 1b]

First, there are no prerequisites for these sessions except a curiosity to learn about data science and a willingness to play with data. Second, the labs are designed to cover most of what you need to talk to data scientists, to specify your project, and to start exploring data science on your own. They’re specifically aimed at the types of issues that you’ll see in the messy, strange world of development, humanitarian and social data, and use tools that are free, well-supported and available offline. This may not be what you need, and that’s fine. If you want to learn about machine learning in depth, or create beautiful Python code, there are many courses for that (and the Data Science Communities and Courses page has a list of places you might want to start). But if you’re looking at social data, and to quote one of my students “can see the data…

ICanHazDatascience

DS4B: The Course [DS4B Session 1a]

[Cross-posted from LinkedIn] I’ve designed and taught 4 university courses in the past 3 years: ICT for development and social change (DSC, with Eric from BlueRidge), coding for DSC, data science coding for DSC and data science for DSC (with Stefan from Sumall). The overarching theme of each of them was better access to, and understanding of, technical tools for people who work in areas where internet is unavailable or unreliable, and where proprietary tools are prohibitively expensive. The tagline for the non-coding courses was the “anti-bullsh*t courses”: a place where students could learn what words like ‘Drupal’ and ‘classifier’ mean before they have to assess a technical proposal with those words in it, but also a place to learn how to interact with data scientists and development teams, to learn their processes, language and needs, and to ask the ‘dumb questions’ (even though there are no dumb questions) in a safe space. Throw a rock on the…

ICanHazDatascience

Releasing my course materials [DS4B]

[Cross-posted from LinkedIn] I teach data science literacy to different groups, and I’ve been struggling with a personal dilemma: I believe strongly in open access (data, code, materials), but each hour of lecture material takes about 15-20 hours of my time to prepare. Should I make everything openly available, or try to respect that preparation time? Which is basically the dilemma that anyone open-sourcing their work goes through. And the answer is the same: release the materials. Made stronger by the realisation that I teach how to do field data science (data science in places without internet connectivity), and people in places without internet connectivity are unlikely to be dropping into New York for in-person sessions. Starting this week, the materials are going online on GitHub; I’ll be backing this up by restarting the I Can Haz Datascience posts (posts on development data science, with Emily the cat) to cover the topics in them…

Data Science

Why spreadsheets are hard

Thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem. Data nerds think of data in its back-end sense, of “what form of data do I need to be able to analyse/visualise this?”. We normalise, we worry about consistency, we clean out the formatting. People used to creating spreadsheets for other people think of data more in the front-end sense, of “how do I make this data easily comprehensible to someone looking at it?”. Each approach has merits and demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines), and part of our work as data nerds is working out how to bridge that divide. Which is going to take work in both directions, but it’s necessary and important work to do.
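As an illustration of what that bridging can look like in practice (not from the original post; the filename and column names here are made up), a small pandas sketch that takes a human-formatted sheet and reshapes it into the normalised form a machine wants:

import pandas as pd

# Hypothetical human-formatted workbook: a couple of title rows at the top,
# then one column per year (reading .xlsx files needs openpyxl installed)
df = pd.read_excel("budget.xlsx", skiprows=2)

# Normalise into tidy long format: one row per (district, year, amount)
tidy = df.melt(id_vars=["district"], var_name="year", value_name="amount")
print(tidy.head())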

Software

WriteSpeakCode/PyLadies joint meetup 2015-10-22: Tales of Open Source: rough notes

PyLadies: international mentorship program for female Python coders (meetup.com, NYC PyLadies). Lisa moderating. Panelists: Maia McCormick, Anna Herlihy, Julian Berman, Ben Darnell, David Turner.

Intros:
- Maia: worked on Outreachy (formerly OPW), which gives stipends to women and minorities to work on open-source code; currently at Spring.
- Anna: works at MongoDB, does a lot of Mongo open-source work.
- Julian: works at Magnetic (ad company); worked on Twisted; started an open-source project (schema for validating JSON projects).
- Ben: Tornado maintainer, working on an open-source distributed database in Go.
- David: ex-FSF, OpenPlans, now at Twitter, “making git faster”.

Q: how to find open-source projects, how to get started?
- D: started contributing to XChat… someone said “wish chat had the following feature”… silence… recently, whatever the company is working on. Advice: find the right project, see if they’re interested, then write the feature.
- B: started on the Python interpreter, was using a game library, needed bindings for…

Data Science, ICanHazDatascience, Software

Looking at data with Python: Matplotlib and Pandas

I like Python. R and Excel have their uses for data analysis too, but I just keep coming back to Python. One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper is to have a good look at what’s in it. Two Python libraries are useful here: Pandas and Matplotlib. Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis. Matplotlib is John D. Hunter’s library for MATLAB-style plots of data. Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window. It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn….
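To make that first look concrete, here is a minimal sketch of the kind of exploration described above; the filename (“survey.csv”) and the “value” column are made up for illustration:

import pandas as pd
import matplotlib.pyplot as plt

# Read a CSV file into a dataframe (hypothetical filename)
df = pd.read_csv("survey.csv")

# Quick first look: first few rows, column types, summary statistics
print(df.head())
print(df.dtypes)
print(df.describe())

# Simple plot: histogram of a (hypothetical) numeric column
df["value"].hist(bins=20)
plt.xlabel("value")
plt.ylabel("count")
plt.title("Distribution of value")
plt.show()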