ICanHazDatascience

Data Science Tools, or what’s in my e-backpack?

One of the infuriating (and, at the same time, strangely cool) things about development data science is that you quite often find yourself in the middle of nowhere with a job to do and no access to the Internet (although this doesn’t happen as often as many Westerners think: there is real internet in most of the world’s cities, honest!). Which means you get to do your job with exactly what you remembered to pack on your laptop: tools, code, help files, datasets and academic papers. This is where we talk about the tools.

The DS4B toolset

This is the tools list for the DS4B course (if you’re following the course, don’t panic: install notes are here). Offline toolset:

- Already on the machine: terminal window, calculator
- Anaconda: Jupyter notebooks, Python (version 3), R (you’ll need to add this to Anaconda), RStudio
- OpenRefine
- D3 libraries
- Tabula
- Excel or LibreOffice (its open-source equivalent)
- QGIS (Mac users: note the separate instructions on this!)
- GDAL…

ICanHazDatascience

Data Science Ethics [DS4B Session 1e]

This is what I usually refer to as the “Fear of God” section of the course…

Ethics

Most university research projects involving people (aka “human subjects”) have to write an ethics statement and adhere to an overarching ethics framework, e.g. “The University has an ethical commitment to minimize the risks to research subjects and to ensure that individuals who participate in research projects conducted under its auspices… do so voluntarily and with an informed understanding of what their involvement will mean”. Development data scientists are not generally subject to ethics reviews, but that doesn’t mean we shouldn’t also ask ourselves the hard questions about what we’re doing with our work, and about the people it might affect. At a minimum, if you make data public, you have a responsibility, to the best of your knowledge, skills and advice, to do no harm to the people connected to that data. Data…

ICanHazDatascience

Writing a problem statement [DS4B Session 1d]

Data work can sometimes seem meaningless. You go through all the training on cool machine learning techniques, find some cool datasets to play with, run a couple of algorithms on them, and then… nothing. That sinking feeling of “well, that was useless”. I’m not discouraging play: play is how we learn how to do things, where we find ideas and connect more deeply to our data. It can be a really useful part of the “explore the data” part of data science, and there are many useful playful design activities that can help with “ask an interesting question”. But data preparation and analysis take time and are full of rabbitholes: interesting but time-consuming explorations that aren’t linked to a positive action or change in the world. One thing that helps a lot is to have a rough plan: something that can guide you as you work through a data…

ICanHazDatascience

Data Science is a Process [DS4B Session 1c]

People often ask me how they can become a data scientist. To which my answers are usually ‘why?’, ‘what do you want to do with it?’ and ‘let’s talk about what it really is’. So let’s talk about what it really is. There are many definitions of data science, e.g.:

- “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”
- “The analysis of data using the scientific method.”
- “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

We can spend hours debating which definition is ‘right’, or we could spend those hours looking at what data scientists do in practice, getting some tools and techniques under our belts, and finding a definition that works for each one of us personally…

ICanHazDatascience

About these sessions [DS4B Session 1b]

First, there are no prerequisites for these sessions except curiosity about data science and a willingness to play with data. Second, the labs are designed to cover most of what you need to talk to data scientists, to specify your own project, and to start exploring data science on your own. They’re specifically aimed at the types of issues you’ll see in the messy, strange world of development, humanitarian and social data, and use tools that are free, well-supported and available offline. This may not be what you need, and that’s fine. If you want to learn about machine learning in depth, or to create beautiful Python code, there are many courses for that (and the Data Science Communities and Courses page lists places you might want to start). But if you’re looking at social data, and, to quote one of my students, “can see the data…

ICanHazDatascience

DS4B: The Course [DS4B Session 1a]

[Cross-posted from LinkedIn] I’ve designed and taught 4 university courses in the past 3 years: ICT for development and social change (DSC, with Eric from BlueRidge), coding for DSC, data science coding for DSC, and data science for DSC (with Stefan from Sumall). The overarching theme of each was better access to, and understanding of, technical tools for people who work in areas where internet is unavailable or unreliable, and where proprietary tools are prohibitively expensive. The tagline for the non-coding courses was the “anti-bullsh*t courses”: a place where students could learn what words like ‘Drupal’ and ‘classifier’ mean before they find themselves assessing a technical proposal with those words in it, but also a place to learn how to interact with data scientists and development teams, to learn their processes, language and needs, and to ask the ‘dumb questions’ (even though there are no dumb questions) in a safe space. Throw a rock on the…

ICanHazDatascience

Releasing my course materials [DS4B]

[cross-post from LinkedIn] I teach data science literacy to different groups, and I’ve been struggling with a personal dilemma: I believe strongly in open access (data, code, materials), but each hour of lecture material takes about 15-20 hours of my time to prepare. Should I make everything openly available, or try to respect that preparation time? Which is basically the dilemma that anyone open-sourcing their work goes through. And the answer is the same: release the materials. Made stronger by the realisation that I teach field data science (data science in places without internet connectivity), and people in places without internet connectivity are unlikely to be dropping into New York for in-person sessions. Starting this week, the materials are going online on GitHub; I’ll back this up by restarting the I Can Haz Datascience posts (posts on development data science, with Emily the cat) to cover the topics in them…

Data Science

Looking at data with Python: Matplotlib and Pandas

I like Python. R and Excel have their uses for data analysis too, but I just keep coming back to Python. One of the first things I want to do, once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper, is to have a good look at what’s in it. Two Python libraries are useful here: Pandas and Matplotlib. Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis. Matplotlib is John D. Hunter’s library for Matlab-style plots of data. Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window. It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn…
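To make the “have a good look at what’s in it” step concrete, here’s a minimal sketch. The dataframe is invented so the example runs on its own; with a real dataset you’d get it from pd.read_csv on your own file instead:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A tiny made-up dataframe so the sketch is self-contained;
# with real data you'd use df = pd.read_csv("yourdata.csv") instead.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [5.2, 11.7, 0.9, 34.1],
})

print(df.head())       # the first few rows, to eyeball the data
print(df.describe())   # summary statistics for the numeric columns

# A first Matlab-style plot: population per country as a bar chart
ax = df.plot(kind="bar", x="country", y="population", legend=False)
ax.set_title("A first look at the data")
# plt.show()  # uncomment to display the chart interactively
```

head and describe are usually my first two calls on any new dataframe: between them you see the column names, the value ranges, and anything obviously broken, before you spend time on plots.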

Data Science

Processing all teh Files in Directory

[Cross-posted from ICanHazDataScience] Okay. So we’ve talked a bit about getting set up in Python, and about how to read in different types of file (cool visualisation tools, streams, PDFs, APIs and webpages next, promise!). But what if you’ve got a whole directory of files and no handy way to read them all into your program? Well, actually, you do. It’s this:

import glob
import os

datadir = "dir1/dir2"
csvfiles = glob.glob(os.path.join(datadir, '*.csv'))

for infile_fullname in csvfiles:
    filename = infile_fullname[len(datadir)+1:]
    print(filename)

That’s it. “os.path.join” sticks your directory name (“dir1/dir2”) to the filetype you’re looking for (“*.csv” here, to pull in all the CSV files in the directory, but you could ask for anything: “a*.*” for files starting with the letter “a”, “*.xls” for Excel files, etc.). “glob.glob(filepath)” uses the glob library to get the names of all the files in the directory that match that pattern. And “infile_fullname[len(datadir)+1:]” gives you just the names…
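As an aside: the len(datadir)+1 slice assumes one particular separator sits between the directory and the filename; os.path.basename does the same job whatever the path separator is. Here’s a self-contained sketch of that variation (it builds its own throwaway directory so there’s something to find, rather than assuming “dir1/dir2” exists on your machine):

```python
import glob
import os
import tempfile

# Build a throwaway directory with two CSV files so the sketch runs anywhere;
# with real data you'd point datadir at your own folder instead.
datadir = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    open(os.path.join(datadir, name), "w").close()

# os.path.basename strips the directory part of each path, making it a
# sturdier alternative to slicing with len(datadir)+1.
filenames = sorted(os.path.basename(f)
                   for f in glob.glob(os.path.join(datadir, "*.csv")))
print(filenames)  # → ['a.csv', 'b.csv']
```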

Data Science

Wut 2 Do Wif Werdz

[Cross-post from IcanHazDataScience] One of the first cool things beginning data scientists do with a bunch of words is create a word cloud. There are several ways to do this, but it’s important to know about tools like D3 (one of the most powerful software libraries for data visualisation), so we’ll use that. Before you run off screaming because it’s a javascript library, it’s okay: there are a whole bunch of example pages on the D3 website that you can play with using your own data. One of these is the word cloud generator. Go there and play with it now… here are some clouds for both crisismapping and my own sites, with some comments… blog.overcognition.com – my work blog. I seem to have a thing about countries, Wamp, Ushahidi and Data (which really reflects the last 3 posts that I wrote). Now, because I used Jason’s example code, it’s using every word…
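Under the hood, a word cloud is just word frequencies made visual: the generator sizes each word by how often it appears. Here’s that counting step sketched in Python, on an invented stand-in snippet (not the real blog posts), with a small stopword list to show why example code that uses every word ends up dominated by the boring ones:

```python
from collections import Counter
import re

# A stand-in snippet of text, invented for illustration.
text = "data science is about data, and data needs questions"
words = re.findall(r"[a-z']+", text.lower())

# Drop very common 'stopwords' so the counts (and hence the cloud)
# aren't dominated by them.
stopwords = {"is", "and", "about"}
counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(3))
```

The most_common output is exactly what a word cloud draws: the word with the biggest count gets the biggest font, and so on down.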