Data Science Tools, or what’s in my e-backpack?

One of the infuriating (and at the same time, strangely cool) things about development data science is that you quite often find yourself in the middle of nowhere with a job to do, and no access to the Internet (although this doesn’t happen as often as many Westerners think: there is real internet in most of the world’s cities, honest!).

Which means you get to do your job with exactly what you remembered to pack on your laptop: tools, code, help files, datasets and academic papers.  This is where we talk about the tools.

The Ds4B toolset

This is the tools list for the DS4B course (if you’re following the course, don’t panic: install notes are here).

Offline toolset:

Online toolset (i.e. useful, but not available when you’re offline):

Python and R both come with a bunch of useful libraries in Anaconda (here are the Python and R lists).  The (Anaconda 4.0) pre-installed libraries used in DS4B are:

  • Python: Basemap, BeautifulSoup, Dask, Matplotlib, NetworkX, NLTK, Numba, 
    NumPy, Pandas, Requests, scikit-image, scikit-learn, Seaborn, Shapely, sqlite3
  • R: ggplot2, frame, lm

The course also uses some Python libraries that don’t come with the Anaconda standard install. These are:

  • csv, datetime, facepy, fiona, gdal, gdalconst, geopy, googlemaps, 
    json, ogr2ogr, osgeo, pyspark, re, twitter

How to get each of these is listed in the DS4B course instructions, but it’s usually no more onerous than typing “pip install libraryname” in the terminal window.  Fiona might give you some trouble (“cannot find the gdal.h library”): if this happens, there are notes in the install instructions.

Choosing the course toolset

For the course, we deliberately biased towards tools that were:

  • Free (because they need to be accessible to everyone, not just people who can pay),
  • Open source with good communities (because you can ask for help, go in and see how and why things work when you need to, and things get fixed a lot faster when you’re not limited by a company’s resources/will),
  • Accessible on most platforms (e.g. Windows, Mac and Linux, across varying operating system versions),
  • Easy to install. Because nothing puts you off a tool quite as much as watching a build fail again and again, and having to learn all about the technologies it’s built on and their variants before you can do even the simplest thing with it </rant>.
  • Stable. It’s cruel to tell people without tech support to install a tool that regularly crashes on them. Even if you’re a techie, it’s still very annoying…

I do a lot of basic numbers work in Microsoft Excel (adding things up, sorting and filtering data, clicking on numerical columns to see averages etc. at the bottom of the screen) and the calculator.  Some data scientists (although a dwindling number of them) only use Excel to process data, but in development data science we’re often handling very unstructured or messy data (e.g. ‘who knows how many columns this thing really has?’ type data) and need to produce results that we can both repeat and trace back step-by-step to the raw data.  And sometimes we get lucky and have a dataset bigger than Excel’s limits of 1,048,576 rows by 16,384 columns (NB Excel will silently fail and only give you the first rows/columns if you try to load one).

That means using a coding language (yes, yes, SPSS, SAS, Matlab, Pentaho etc, but a) those aren’t free, and b) did I mention the really messy inputs?).

R and Python are coding languages that turn up a lot in data science (SQL is used a lot too, for database data).   R is a beautiful language for doing statistical things: it was built by statisticians, has packages that aren’t in other languages yet, and has some lovely, lovely visualisation libraries.  But we needed to pick one language for the course, to minimise the amount of code needed to illustrate each step in data science without causing cognitive dissonance in students’ heads (the previous version of the course taught both Python and R, but was much more code-focussed), and we chose Python.

Why Python? Basically four things:

  • its ability to deal with really nastily messy data,
  • its great libraries for things outside statistical analysis (e.g. natural language processing, web scraping, content management systems etc)
  • not having to rewrite code when you start working with developers building applications and websites (although it’s possible to call both Python and R from each other using e.g. the rpy2 package, adding extra languages makes the code that much harder to maintain).
  • it’s much easier to learn (and teach) than people think

There have been many debates on R vs Python for data science. “R vs Python” isn’t really the right question to be asking; the question is “what tools will work for me”, and the answer for many data scientists is “both R and Python, and sometimes neither”: you use what works for you, and you use what’s appropriate for the task that you have at hand.

The other tools included are less controversial (!), and more specialised.

  • OpenRefine is a great tool for summarising and cleaning unruly row-column data without coding (and has provenance tracing built in)
  • Tabula is the most popular open-source PDF processing tool (CometDocs works on messier documents, but is online-only)
  • QGIS is a popular open-source GIS (maps etc) visualisation tool; CartoDb (now called Carto) is a beautiful online GIS visualisation tool
  • The GDAL toolkit is great for command-line processing of GIS data
  • Python includes visualisation libraries, but if you really want something special, D3 is a good offline tool (Tableau Public is good online, but as the name implies, your visualisations will be public).
  • Pen and paper are useful for doing mockups and calculations whilst saving your machine’s battery, and are flexible and portable to boot.

That should be enough to get most people started, and has already been field-tested by both myself and various ex-students (guys: I love it when you send me notes about what you’re doing with data, from the field!).

What’s in my own backpack?

[Screenshot: my first Launchpad screen]

It’s only fair for me to open up my Mac and show you what I’ve got hiding on my own machine.  That’s my first Launchpad screen above. It’s the usual suspects: Excel, R, OpenRefine, Tableau, Calculator, with a few other things:

  • SQL tools: MySQL workbench, Postgres.
  • Readers for some common proprietary tool formats: SPSS Smartreader, Pentaho Kettle (“Data Integration”).
  • More data tools: Weka, Gephi, Dato, Trifacta. Weka is the granddaddy of data mining tools, and is still worth having in your ebackpack. Gephi is great for visualising and playing around with graph data.
  • Disk cleaners. OMG can I frag a disk on deployment, which is why I have CCleaner and Disk Inventory X installed. Because sometimes you just need that extra 2GB of space.
  • Tools for keeping my code contained: GitHub Desktop, VirtualBox.
  • A Java development environment (IntelliJ) because, contrary to popular opinion, I don’t just write code in Python.

The things you can’t see here include:

  • Mapping tools: OpenStreetMap, LeafletJS, MapBox
  • Visualisation tools: Highcharts
  • Machine learning tools: specialist tools, as and when I need them
  • Ideation tools: FreeMind
  • Text tools: Acrobat Reader, Sublime Text
  • Losing the minimum amount of stuff if my laptop dies in the humidity tools: Dropbox, Evernote, Google Drive, biggest portable drive I can find (2 of)
  • Giving me textbooks to read on the road tools: iBooks, Safari Books

That’s my ebackpack. I’d be interested to see what other people pack, and if there’s anything useful that I’ve missed from the lists above.

Data Science Ethics [DS4B Session 1e]

This is what I usually refer to as the “Fear of God” section of the course…

Ethics

Most university research projects involving people (aka “human subjects”) have to write an ethics statement and adhere to an overarching ethics framework, e.g. “The University has an ethical commitment to minimize the risks to research subjects and to ensure that individuals who participate in research projects conducted under its auspices… do so voluntarily and with an informed understanding of what their involvement will mean”.  Development data scientists are not generally subject to ethics reviews, but that doesn’t mean we shouldn’t also ask ourselves hard questions about what we’re doing with our work, and about the people it might affect.

At a minimum, if you make data public, you have a responsibility, to the best of your knowledge, skills, and advice, to do no harm to the people connected to that data.  Data science projects can be very powerful (especially if we’ve designed them well), development data science can affect many people, and we need to be mindful of who we’re affecting and the risks we might unintentionally cause them with our work.  With that power comes responsibility, and a sometimes-difficult balance between making data available to people who can do good with it, and protecting that data’s subjects, sources, and managers.

I start by asking these two questions:

  • Could this work increase risk to anyone?
  • How will I respect privacy and security?

Risk

Risk is defined as “The probability of something happening multiplied by the resulting cost or benefit if it does”.  There are three parts to this: cost/benefit (what might happen), probability (how likely that is to happen) and subject (who the risk is to). For example:

  • Risk of: physical, legal, reputational, privacy harm
  • Likelihood (e.g. low, medium, high)
  • Risk to: data subjects, collectors, processors, releasers, users

So, basically, make a list of what bad things might happen, who to, how likely these bad things are, and what you should be doing to prevent or mitigate them, up to and including stopping the project or making sure that anyone at risk is aware of that risk and can consent to be subject to it.  It doesn’t have to be a down-to-the-tiniest-thing detailed list, but you do have to think about who could be harmed by your work and how.
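To make that concrete, here’s a minimal sketch (in Python, since that’s the course language) of what such a risk list might look like in code. The risks, scales and scores here are entirely made up for illustration; a real register gets built with the people at risk, not just at a keyboard.

LIKELIHOOD = {'low': 1, 'medium': 2, 'high': 3}
IMPACT = {'minor': 1, 'serious': 2, 'severe': 3}

# Hypothetical risks: what might happen, to whom, how likely, how bad
risks = [
    {'what': 'data subjects identified from released data', 'who': 'data subjects',
     'likelihood': 'medium', 'impact': 'severe'},
    {'what': 'field collectors associated with a sensitive topic', 'who': 'data collectors',
     'likelihood': 'low', 'impact': 'serious'},
]

# Risk score = probability x cost, as in the definition above
for r in risks:
    r['score'] = LIKELIHOOD[r['likelihood']] * IMPACT[r['impact']]

# Look at the highest-scoring risks first, and decide on mitigations (or on stopping)
for r in sorted(risks, key=lambda r: r['score'], reverse=True):
    print(r['score'], r['what'], '->', r['who'])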

And this isn’t a one-time thing.  New risks can occur as new data becomes available, or the original environment around your project changes: you need to be thinking about potential risks right from the planning phase of your project through to when you’re sharing data or insights (and beyond, if stale data or old results in changed contexts could also cause harm).

Risks can be obvious (e.g. try not to get your sources or subjects targeted by making them easy to find), but they can also be subtle.  Some of the more subtle ones include:

You don’t have to make this list on your own. Groups including the Responsible Data Forum have been doing great work on data risk (and I’ve been lucky to be part of that hivemind too): try reading those groups’ articles to kickstart your thinking about your project’s potential risks.

Privacy and Security

That last risk on the list (accidentally making data subjects easy to identify) is important.  Privacy and security are complex topics, but at a minimum you should be thinking about PII (personally identifiable information): “any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”

When you think about PII risk, think beyond name, address, phone number, social security number. Think about things like these (which can all accidentally release PII):

  • Unique identifiers: Names, addresses, phone numbers etc
  • Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
  • Members of small populations (e.g. there may be only a small number of people fitting a given profile in a given geographical area)
  • Untranslated text (e.g. text that your team can’t read and understand)
  • Codes (e.g. “41”, especially if you don’t have a codebook telling you what they mean)
  • Slang terms
  • Data that can be combined with other datasets to produce PII (aka reidentification)

There is much literature on how easy reidentification can be from very small amounts of data, so you will have to think of it in terms of risk too: how likely is it that someone will want to reidentify people, how easy would it be for them to do so, and what can you do to make it harder?
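As a concrete way of spotting one of these risks, here’s a minimal pandas sketch (the columns and data are hypothetical) that counts how many people share each combination of quasi-identifiers; combinations shared by only one or two people are easy reidentification targets.

import pandas as pd

# Hypothetical survey data: district, age band and occupation are quasi-identifiers
df = pd.DataFrame({
    'district':   ['A', 'A', 'B', 'B', 'B'],
    'age_band':   ['20-29', '20-29', '30-39', '30-39', '60+'],
    'occupation': ['teacher', 'teacher', 'farmer', 'farmer', 'nurse'],
})

# How many rows share each combination of quasi-identifiers?
group_sizes = df.groupby(['district', 'age_band', 'occupation']).size()

# Combinations shared by fewer than 3 people are easy reidentification targets
print(group_sizes[group_sizes < 3])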

But Don’t Panic

Don’t panic. When you first start thinking about risk in data science, the usual reaction is an “OMG I can’t do anything without causing harm” one.  That’s not the right answer either.  Be aware of the risks.  Take a deep breath, and remember that risk is cost times probability, and that if you give them the chance, people will often make a surprising but informed decision about their own risks.

Sometimes, the only answer is to walk away from the project. If it isn’t, and risk is high, consider mitigations like releasing data and results to a smaller group of people (e.g. academics, direct responders, people in your organisation, data subjects) with the caveat that once you release, you no longer have control of that data and have implicitly trusted other people to do the right thing too.  Consider releasing data at a different granularity, e.g. use town/district instead of street, and/or a subset or sample of the data ‘rows’ and ‘columns’. Look in places like the RDF website to see what other people have done as mitigations.
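Here’s a minimal pandas sketch of those last two mitigations on a made-up incidents table (the column names are hypothetical): aggregating to a coarser location, and releasing only a sample of the rows.

import pandas as pd

# Hypothetical incident data, located down to street level
df = pd.DataFrame({
    'street':   ['Main St', 'Main St', 'High St', 'Mill Rd'],
    'district': ['North', 'North', 'North', 'South'],
    'cases':    [1, 2, 1, 4],
})

# Mitigation 1: release at a coarser granularity (district instead of street)
district_totals = df.groupby('district', as_index=False)['cases'].sum()

# Mitigation 2: release only a subset/sample of the rows
sample = df.sample(frac=0.5, random_state=0)

print(district_totals)
print(sample)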

Be aware of ethics. Be as honest and cynical about data and results as you can. And don’t start a development data science project without doing a basic risks check.

[Image by Steve Calcott, licensed under cc-by-nc 2.0]

Writing a problem statement [DS4B Session 1d]

Data work can sometimes seem meaningless.  You go through all the training on cool machine learning techniques, find some cool datasets to play with, run a couple of algorithms on them, and then.  Nothing. That sinking feeling of “well, that was useless”.

I’m not discouraging play. Play is how we learn how to do things, where we can find ideas and connect more deeply to our data.  It can be a really useful part of the “explore the data” part of data science, and there are many useful playful design activities that can help with “ask an interesting question”.   But data preparation and analysis takes time and is full of rabbitholes:  interesting but time-consuming things that aren’t linked to a positive action or change in the world.

One thing that helps a lot is to have a rough plan: something that can guide you as you work through a data science project.  Making a plan for data science has much in common with making plans for Lean and design thinking: you’re putting in effort, so you need to be:

  • focused on the change you want to make in the world, e.g. solve a problem or change people’s minds (there’s no point doing analysis if you don’t do anything with it),
  • pragmatic about the work you need to do (sometimes the answer is a piece of paper, e.g. much simpler than the beautiful concept you had in your mind, but much more likely to be used in the current context) and
  • realistic about the problem you’re trying to solve and the resources you have around to do that (e.g. use what works for you, not an unachievable ‘best’ technology).

I like Max Shron’s CoNVO for planning small data science projects. In my dayjob, I work a lot with Lean Enterprise and Design Thinking techniques to achieve similar results at scale, but at the least you should have an A4 piece of paper somewhere with this on it, to refer back to:

  • Context: who needs this work, and what are they doing it for?
  • Needs: what are you trying to fix?
  • Vision: what do you expect your final result to look like?
  • Outcome: how do you get your results to the people who need them? What happens next?

The other question you’re going to want to ask is “is it worth me doing this work”.  It’s okay to say “yes, because I learn from it”, or “it’s fun”, but data science is work, and you don’t want to feel like you put in a ton of effort and late nights only to realise that effort was wasted.  I like the DrivenData competition guidelines for helping with this thinking:

  • Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…”
  • Challenge: “… challenging enough for a rich competition…”
  • Feasibility: “….the right kind of data to answer the question at hand… does it have enough signal to be useful?…”
  • Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”

I haven’t yet mentioned the thing that your work will focus on: the “interesting question(s)” that you’re trying to answer.  There are several contexts you might find yourself in here, from a business team bringing you a well-defined question (at which point you start with the “what is the question you’re really trying to answer here” discussion), to having complete freedom over the questions you’re asking and the ways they could be turned into action.

One way to get better at asking good questions is to see what other people ask.  Look at your subject area: find other projects and questions in it, and see how they’re asked (and answered).  Look at existing data science projects for inspiration, e.g. Kaggle (and their UseCase list), DrivenData, DataKind and the projects listed in the course reading list, then design your questions.  Asking questions about your questions can help here, e.g.:

  • Is the question concrete enough? Is it solving a real problem, or just a symptom of that problem? (e.g. “what are the barriers to people engaging with us” vs “how can we get more people to call”)
  • Can you translate the question into an experiment? E.g. can you ask something like “I believe people have more phones than toilets” and start proving (or, more generally, disproving) that? There’s a small sketch of this after the list.
  • Is it actionable? And what actions will be taken given the answer?
  • What data is needed to do the analysis? At this point, datasets could be anything – tables, images, maps, sensor feeds; anything. Be aware that although data access can limit what you can do, data is just a support here, and focusing on the question can help you think about other ways that you might be able to answer it.
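Here’s the sketch mentioned above: a minimal, made-up example of turning the “more phones than toilets” belief into something you can start checking (the survey data and column names are invented for illustration).

import pandas as pd

# Invented household survey: 1 = has the thing, 0 = doesn't
survey = pd.DataFrame({
    'has_phone':  [1, 1, 0, 1, 1, 0, 1, 1],
    'has_toilet': [0, 1, 0, 1, 0, 0, 1, 0],
})

phone_rate = survey['has_phone'].mean()
toilet_rate = survey['has_toilet'].mean()

# A tiny gap (or a tiny sample) doesn't show much yet: that's where a proper
# significance test, and more data, would come in
print('phones: {:.0%}, toilets: {:.0%}'.format(phone_rate, toilet_rate))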

As you do this, you’ll find yourself questioning the meaning of many of the components of your original question: I have a longer blogpost on that, but this is where the plan becomes useful: instead of focusing on “what actually counts as a toilet” (yes, that really is a difficult thing to define), go back to your notes about who this is important to, why, and what they could do about it.  You’ll also find that a seemingly-simple initial question will generate a whole bunch of other questions you’ll need to answer too (questions are like bunnies: they breed).  Again, use your plan as a guide, and accept that there will usually be several parts to each project.  You’ll also find that several of these questions could be answered without using available data (e.g. you might be able to get a strong enough ‘signal’ from surveys that an action is worth further investigation): that too is a useful thing to know.

Plans rarely survive contact with your datasets, users etc.  They’re not about forcing you to produce things a specific way: they’re there to make you think about what you’re doing, and stop you from making newbie mistakes like fitting the questions to the data you have available or falling into data rabbitholes.  You might want to go down at least one of those rabbitholes: a fascinating piece of data that you want to explore for fun, or a bunch of other questions that look like really fun things to answer; they’re not necessarily bad things, and can be really valuable in themselves, but you do need to be aware that you’ve done this, and of any impacts it might have on your original goals.  Planning might seem a distraction from getting on with the data analysis, but it does help to have a guidestar, something to go back to and think “is this valuable to the people that I’m trying to help here?”.

Some short exercises

We ran 3 small exercises in class, to get people thinking about project design. Each of these was time-limited to 3 minutes, to make people concentrate hard on what might be needed, and to hit issues quickly so they could be discussed in class before students tried this at home.

Exercise 1: Ask some interesting questions. Either your own questions, or pick an existing question and think about how it might have been formed.

  • Questions that data might help with
  • Stories you want to tell with data
  • Datasets you’d like to explore (where ‘datasets’ could be anything – tables, images, maps, sensor feeds, etc)
  • Competition questions: Kaggle, DrivenData
  • A data science project that interested you

Exercise 2: Get the data.  Pick one of your questions:

  • List the ideal data you need to answer it
  • List the data that’s (probably) available

Think about what you’ll do if the data you need isn’t available:

  • What compromises could you make
  • Where would you look for more data
  • Are there proxies (other datasets that tell you something about your question)
  • Are there ways to get more data (surveys, crowdsourcing etc)

Exercise 3: Design your communications. List the types of people you’d want to show your results to.

  • How do you want them to change the world? Can they take actions, can they change opinions etc
  • Describe the types of outputs that might be persuasive to them – visuals, text, numbers, stories, art… be as wild with this as you want

Data Science is a Process [DS4B Session 1c]

People often ask me how they can become a data scientist. To which my answers are usually ‘why’, ‘what do you want to do with it’ and ‘let’s talk about what it really is’.  So let’s talk about what it really is.  There are many definitions of data science, e.g.:

  • “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”
  • “The analysis of data using the scientific method”
  • “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

We can spend hours debating which definition is ‘right’, or we could spend those hours looking at what data scientists do in practice, getting some tools and techniques under our belts and finding a definition that works for each one of us personally.

My own working definition of data science is “a process that helps people gain understanding through using data”.  So let’s look at some of that process.  The scientific method, mentioned above, is a process (O’Neill & Schutt, “Doing data science”):

  • Ask a question
  • Do background research
  • Construct a hypothesis
  • Test your hypothesis by doing an experiment
  • Analyse your data and draw a conclusion
  • Communicate your results

Most of science works like this: it’s all about creating explanations that fit our knowledge of the world, then testing those explanations with experiments.  I like the cynicism embedded in this, the acknowledgement that everything is a working hypothesis that might turn out to be false, or false in different circumstances (see under Newton/Einstein), and those are all good things, but they don’t quite cover what data scientists do all day.

One of the process descriptions that data scientists use for themselves is the OSEMN process (Obtain-Scrub-Explore-Model-Interpret, pronounced ‘awesome’: your data is safe with data scientists, but not your acronyms…):

  • Obtain datasets
  • Clean, combine, transform data
  • Explore the data
  • Try models (classification, machine learning etc)
  • Interpret and communicate your results

This is less about experiments, and more about the things that you need to do with data, but it loses what, to me, is the most important part: asking an interesting question.  Data science isn’t about data – it’s about people, their problems and questions, and informing, persuading or entertaining them with your results (I’m with Sarah Cohen when she says “every good story starts with an idea, a question or an observation”, and with anyone who says that a visualization isn’t always the answer). So the process for these sessions, the thing we’ll be working through slowly, is:

  • Ask an interesting question
  • Get the data
  • Explore the data
  • Model the data
  • Communicate and visualize your results

with a healthy dose of cynicism, e.g. sanity-checking your results in the context they’re relevant to.  This is especially important in a development data context, where data is hard to come by and may be erroneous, miss geographical areas or demographics, and be older than it looks.

Note that none of the quotes say “enormous amounts of data”.  We’ll touch on big data in a later session (session 10), but most development data scientists work with small datasets, and that’s nothing to be ashamed of: I’d rather have relevant, information-rich datasets than huge amounts of data that tell me almost nothing.

That process again

  • Ask an interesting question. Write hypotheses that can be explored (Do people have more phones than toilets?, How is Ebola spreading? Is using wood fires sustainable in rural Tanzania? Can we feed 9 billion people?). Make them simple, actionable, and incremental (e.g. you can test different parts of the question separately).
  • Get the data. There are many different data sources (e.g. datafiles, databases, APIs, text, maps, images, social media, people). Some of them are harder to get information out of than others, but they all contain data. Which means you’ll often be extracting datasets from those sources (e.g. 80-page PDFs) and cleaning them (there’s a small cleaning sketch after this list).  By cleaning, I mean getting the data into a shape that can be used by algorithms: dealing with file formats (PDFs!), badly-specified locations, human errors and differences between standards (e.g. “Tanzania” vs “Republic of Tanzania”).  Although cleaning takes a lot of time, it’s also time spent getting to know your datasets: what’s in them, what’s missing, what’s strange, what potentially got lost in translation. Which leads us to:
  • Explore the data. Once you’ve got the dataset in machine-readable form, you can start looking for more issues to deal with (e.g. these issues with different placename standards in Tanzania) and, eventually, for potentially interesting patterns. Eyeballing (looking at) your data is usually a good place to start; often you have to take a subset of the whole dataset to do this, but it’s usually worth it to get a better feel for what’s contained in each dataset.  Doing quick but ugly visualisations of your data is also a good thing to do.
  • Model the data. Modelling is where we look for patterns and insights hidden in the data.  It’s where machine learning comes in. We’ll look later at how to learn relationships between numbers, categories and graphs.
  • Communicate and visualize your results. We want to get this effect: “I already knew that increased incarceration didn’t lower crime, but I wasn’t sure of the statistics. To see it on the graphs is really eye opening.” (Pandey et al, The Persuasive Power of Data Visualisation), using whatever’s appropriate (which might or might not be visualisations).
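Here’s the small cleaning sketch mentioned in the “get the data” step above: a made-up table where the same country appears under two names, mapped onto one canonical name and then eyeballed with a quick, ugly plot.

import pandas as pd
import matplotlib.pyplot as plt

# Made-up shipments table: the same country appears under two different names
df = pd.DataFrame({
    'country': ['Tanzania', 'Republic of Tanzania', 'Kenya', 'Tanzania'],
    'value':   [10, 5, 7, 3],
})

# Map known variants onto one canonical name before counting anything
df['country'] = df['country'].replace({'Republic of Tanzania': 'Tanzania'})

# Quick, ugly visualisation: just enough to see what's in the data
df.groupby('country')['value'].sum().plot(kind='bar')
plt.show()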

What we’re aiming at is simple: “ask good questions, tell good stories” – if you can do this with data, you’ve won. 

Data Scientists

Data scientists are the mysterious beasts who do data science.  The data science Venn diagram basically says that to be a data scientist, you need to know statistics, know your business area and be able to code.  But it’s not quite like that.  Although it’s heresy to say this, many good data scientists don’t code at all, and you can be useful on a data science team without knowing everything, e.g. you can build insightful visualisations without using statistics, or specify a data science problem well enough for hardcore machine learning specialists to develop good algorithms for it.

That said, it does help a development data scientist to have expertise in development and statistics (these sessions were originally designed for people who had these skills: a statistics session is already in the pipeline…), and being familiar with data science skills and having the coding skills to get, clean and explore data will help you even if you never want to do anything more than create and be the ‘client’ for a data science problem specification.

At this point, you might be asking two questions:

  • How do you become a data scientist?, and
  • Should you become a data scientist?

You become a data scientist through learning and practice (that never stops: I’m still working on it myself).  Yes, you need to learn a bunch of theory, but there’s nothing like learning data science by doing it: you’ll handle issues you didn’t know existed, and learn many details about techniques by using them on non-sanitised (uncleaned) data.  Good places to practice exploring and modelling data include:

  • Kaggle – online data science competitions
  • Driven Data – social good data science competitions
  • Innocentive – some data science challenges
  • CrowdAnalytix – business data science competitions
  • TunedIt – scientific/industrial data science challenges

Good places to practice asking good questions, getting data, communication and visualizing results include:

  • Your own projects
  • Data science for good groups (e.g. DataKind)

The answer to “should you become a data scientist” is “not necessarily”.  There are lots of data science students desperate for good problems to work on, so you might instead want to become someone who can work with data scientists, which means learning how to specify data problems well.   One place to see the work of these people who can specify problems (“problem owners”) is the competition sites listed above.  The problem owner doesn’t have to do the data modelling or machine learning themselves, but they do need to be able to specify a problem well, find and clean data related to that problem so that competitors can access it easily (and all have the same starting dataset), and specify how the competition results will be marked (e.g. by accuracy on an unseen ‘test’ dataset).   Go look at some of the problems listed on these sites, and think about how you would have done this yourself.
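To make that marking step concrete, here’s a minimal sketch (using scikit-learn, which is in the course toolset; the data here is synthetic) of holding back an unseen test set and scoring a baseline model on it by accuracy.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a competition dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Competitors only ever see the training split; the organiser keeps the test split back
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Score a (deliberately dumb) baseline by accuracy on the unseen test set
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('baseline accuracy:', accuracy_score(y_test, baseline.predict(X_test)))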

(session 1 slideset is here; cover image is from the Pump it Up challenge)

About these sessions [DS4B Session 1b]

First, there are no prerequisites for these sessions except a curiosity to learn about data science and willingness to play with data. Second, the labs are designed to cover most of what you need to talk to data scientists, to specify your project, and to start exploring data science on your own.  They’re specifically aimed at the types of issues that you’ll see in the messy strange world of development, humanitarian and social data, and use tools that are free, well-supported and available offline.

This may not be what you need, and that’s fine. If you want to learn about machine learning in depth, or create beautiful Python code, there are many courses for that (and the Data Science Communities and Courses page has a list of places you might want to start at).  But if you’re looking at social data, and to quote one of my students “can see the data and know what I want to do with it, but I don’t know how to do that”, these sessions might help.

There are a bunch of resources that go with these sessions: they’re all available on the sessions’ github page, and slideshare pages.

There are 10 sessions in all, grouped into themes: people (designing a project, communicating results), tools, getting data, special data types (text, GIS, big) and learning from data (there are also bonus sessions in the works, as the chats with Quito continue); the session order is designed to get people doing useful things as quickly as possible, and give breathing space between difficult topics.

Each session concentrates on 5 to 7 concepts within a single topic (e.g. machine learning), and plays with apps or code related to that topic.  Install instructions for each tool used in a session are in the tool install instructions; pre-session required reading and post-session reading for fun are in the course reading list; and the Python code used in or related to each session is in the ipython notebooks.

There’ll be several blogposts here for each of the sessions, loosely arranged around those 5-7 concepts and homework.

DS4B: The Course [DS4B Session 1a]

[Cross-posted from LinkedIn]

I’ve designed and taught 4 university courses in the past 3 years: ICT for development and social change (DSC, with Eric from BlueRidge), coding for DSC, data science coding for DSC and data science for DSC (with Stefan from Sumall).  The overarching theme of each of them was better access and understanding of technical tools for people who work in areas where internet is unavailable or unreliable, and where proprietary tools are prohibitively expensive.

The tagline for the non-coding courses was the “anti-bullsh*t courses”: a place where students could learn what words like ‘Drupal’ and ‘classifier’ mean before they’re trying to assess a technical proposal with those words in, but also a place to learn how to interact with data scientists and development teams, to learn their processes, language and needs, and ask the ‘dumb questions’ (even though there are no dumb questions) in a safe space.

Throw a rock on the internet and you’ll hit a data science course. There are many of these, many covered by the open data science masters, and many places to learn the details of machine learning, R, business data science et al.  What there isn’t so much of on the internet is courses on how to specify a data science problem, how to interact with the people coming off those machine learning courses, or how to do data science work when you’re days from the nearest stable internet connection.  The latest course, Data Science for Development and Social Change, was designed for that.  It changed a bit from its inception: we found ourselves heavily oversubscribed with students from not just International Development, but also Journalism, Medicine and other departments, and adjusted from a 12-person hands-on intensive lab format (e.g. the coding course included D3 down to the CSS, HTML, SVG and Javascript level) to a 30+ person lab lecture with ipython notebooks.

The slideset shared here is from the latest iteration of the lab part of that course (the non-lab half showed social data science in different contexts): a set of sessions that started as a weekly one-hour chat between myself and the Quito ThoughtWorks team, and grew into a multi-city weekly data science session.

The way to approach these slides is to read the slide notes in each session’s powerpoint file (you’ll probably have to download the files to see this: I’ll slowly upload the notes if needed), then try the exercises in the ipython notebooks for that session.

The first session is an introduction, a ‘why are we doing this’, with a bunch of downloads and setups (student- and developer- tested!) as an exercise.  I’ll write about that soon.  The downloads and setups are integral to these sessions (to leave someone with enough tools, notes and information on their machines to be able to do basic data science, and understand what a data scientist does all day, even where there’s no hope of stable Internet (yet)), but though the learning is much better if you try the examples, the slides can be mostly followed without them.

Releasing my course materials [DS4B]

[cross-post from LinkedIn]

I teach data science literacy to different groups, and I’ve been struggling with a personal dilemma: I believe strongly in open access (data, code, materials), but each hour of lecture materials takes about 15-20 hours of my time to prepare: should I make everything openly available, or try to respect that preparation time?

Which is basically the dilemma that anyone open-sourcing goes through. And the answer is the same: release the materials.  Made stronger by the realisation that I teach how to do field data science (data science in places without internet connectivity) and people in places without internet connectivity are unlikely to be dropping into New York for in-person sessions.

Starting this week, the materials are going up on GitHub; I’ll be backing this up by restarting the I Can Haz Datascience posts (posts on development data science, with Emily the cat) to cover the topics in them (designing and scoping a DS project, Python basics, acquiring data, handling text, geospatial and relationship data etc.).  Hopefully they’ll be of some use to people.

Why spreadsheets are hard

Thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem.

  • Data nerds think of data in its back-end sense, of “what form of data do I need to be able to analyse/ visualise this”.  We normalise, we worry about consistency, we clean out the formatting.
  • People used to creating spreadsheets for other people think of it more in the front-end sense, of “how do I make this data easily comprehensible to someone looking at this”.

Each has merits/demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines) and part of our work as data nerds is working out how to bridge that divide.  Which is going to take work in both directions, but it’s necessary and important work to do.
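As a small illustration of that bridging work, here’s a minimal pandas sketch (with made-up names and numbers) that takes a human-friendly layout, one column per year, and melts it into the long, normalised form that analysis tools expect; a real sheet would usually arrive via pd.read_excel, often with skiprows to jump over title and notes rows.

import pandas as pd

# A human-friendly layout: one row per district, one column per year
wide = pd.DataFrame({
    'district': ['North', 'South'],
    '2014': [12, 7],
    '2015': [15, 9],
})

# The normalised, analysis-friendly version: one row per district-year observation
tidy = wide.melt(id_vars=['district'], var_name='year', value_name='cases')
print(tidy)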

Looking at data with Python: Matplotlib and Pandas

I like python. R and Excel have their uses too for data analysis, but I just keep coming back to Python.

One of the first things I want to do once I’ve finally wrangled a dataset out of various APIs, websites and pieces of paper, is to have a good look at what’s in it.  Two python libraries are useful here: Pandas and Matplotlib.

  • Pandas is Wes McKinney’s library for R-style dataframe (data in rows and columns) manipulation, summary and analysis.
  • Matplotlib is John D Hunter’s library for Matlab-style plots of data.

Before you start, you’ll need to type “pip install pandas” and “pip install matplotlib” in the terminal window.   It’s also convention to load the libraries into your code with these two lines:

import pandas as pd
import matplotlib.pyplot as plt

Some things in Pandas (like reading in datafiles) are wonderfully easy; others take a little longer to learn. I’ll meander through a few of them here.  I’ll be using Nepal medical shipments data as an example.

Reading in data

Pandas makes this easy. Reading in a CSV file is as simple as:

df = pd.read_csv(csvfilename) # Comma-separated file
df = pd.read_csv(csvfilename, sep='\t') # Tab-separated file

There’s also pd.read_json, pd.read_html, pd.read_sas, pd.read_stata and pd.read_sql_table to read in other data formats.  Be careful with read_html though: it only reads in html tables, and you’ll need lxml, beautifulsoup or suchlike if you want to read tables straight from a webpage.
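For example (a minimal sketch on a tiny inline table, so it runs without a live webpage): read_html returns a list of DataFrames, one per table it finds, and needs lxml (or beautifulsoup4 plus html5lib) installed.

from io import StringIO
import pandas as pd

# A tiny stand-in for a webpage containing one table
html = StringIO('<table><tr><th>country</th><th>value</th></tr>'
                '<tr><td>Nepal</td><td>10</td></tr></table>')

tables = pd.read_html(html)   # a list of DataFrames, one per table found
print(tables[0])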

First-look at the dataframe

I like to know what I’m dealing with before starting analysis.  I usually use Tableau or R for this, but that’s not always possible, and Pandas is a good alternative.

df.columns # List all the column headings
df.head(4) # The first 4 rows of data, same as df.head(n=4)
df[['column1','column2', 'column3']].head(10) # Just some of the columns
df.describe() # Basic statistics for every numerical column

That tells you what your columns are, what your first few rows look like (df.tail(4) will give the last rows) and some basic statistics for numerical columns, but you’re probably more curious than that.

df['columnname1'].value_counts()

value_counts will tell you what’s in a single column.  If you want to know what’s in a pair or combination of columns, you’ll need to start using pivot tables or groupby.
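The groupby route looks something like this (a minimal sketch, reusing the df loaded earlier and the same placeholder column names as the pivot table example below):

counts = df.groupby(['columnx', 'columny']).size()   # rows per combination of two columns
print(counts.sort_values(ascending=False).head(10))  # the most common combinations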

You might know pivot tables from Excel.  They’re ways of creating a new datatable whose rows, columns and content are defined by you.  This function, for example, gives you a new table whose rows are column x values, columns are column y values, and contents are the number of rows that contained those combinations of x and y values.

x_by_y = df.pivot_table(index='columnx',columns='columny', values='columnz', aggfunc='count', fill_value=0)

Column z gets involved here because pivot_table needs a column to count: if you don’t nominate one for the values, Pandas will aggregate every remaining column and hand you back counts for all of them.  I’ve included fill_value=0 because I’m counting, and Pandas would otherwise show NaN (not a number) for combinations that don’t appear in the data.

x_by_y is a data frame. You can plot this, for example:

x_by_y.head(10).plot(kind='bar', stacked=True)
plt.show()

You’re now using Matplotlib.  And that was quite a complex plot: a stacked bar chart, with a legend.   Note that .plot creates a plot object: if you want to *see* your plot, you need to type “plt.show()”.  This will put up a plot window and stop your code until you close the window again.

Basic data manipulation

I’ve had a look at the dataset, got some ideas for more things to look at in it, and some of them need calculations. Pandas handles this too.  More soon. Meanwhile, here’s some stuff I did with the Nepal dataset.

# Shipments per recipient within each material family, as a stacked horizontal bar chart
# (note: .plot() is chained on, so pivot1 ends up holding the plot axes, not the table)
pivot1 = df.pivot_table(
 index='Material Hierarchy Family',
 columns='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count').plot(kind='barh', stacked=True)

# Number of shipment rows per recipient
recipientsize = df.groupby('Final Recipient Name').size()

# The same counts the other way around: recipients as rows, material families as columns
pivot2 = df.pivot_table(
 columns='Material Hierarchy Family',
 index='Final Recipient Name',
 values='Dollar Value',
 aggfunc='count')

pivot2.plot(kind='bar', stacked=True, legend=False)
plt.show()

Links

The coolest thing about data analysis…

…is that it makes you think very hard about the world.  I’m having a short break between jobs. After 2 weeks of cycling, kayaking, camping, tidying, diy, meditation, niggling chores and piano playing, I’m finally ready to start thinking about data again.

Kaggle is a lovely thing. Not only does it encourage learning and cooperation between data scientists and create better algorithms for worthy causes, it also provides nice clean data to play with (believe me, raw data is way messier than this).  I’m currently geeking out on the San Francisco crime dataset – basically a huge CSV containing the date, category (and details), location (with police district) and outcomes of SF crimes from 2003 to 2015.

I could just throw a bunch of learning algorithms at this dataset (and, in truth, am, because that’s what Kaggle’s about) but I’m finding it interesting to think first about what I know about crime patterns from my time in the UK, whether those transfer to the US, and what else might be interesting to look for if I know more about the area.  For instance, are car crimes grouped near easy escape routes (thank you SFdata.gov for the major roads map); are there more thefts on area ‘boundaries’ (a UK study a while back showed more crimes near areas that perpetrators felt comfortable in); are there more juvenile and petty crimes near schools (also in SFdata.gov)?  I know that this is consciously testing out my own biases, but I’m on holiday and it’s an interesting thought experiment.
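A first look at a dataset like that is the usual Pandas routine; here’s a minimal sketch (‘train.csv’ is the Kaggle download name, and column names such as ‘Category’, ‘PdDistrict’ and ‘Dates’ come from that dataset, so adjust if yours differ).

import pandas as pd

df = pd.read_csv('train.csv', parse_dates=['Dates'])

print(df['Category'].value_counts().head(10))         # the most common crime categories
print(df.groupby('PdDistrict').size().sort_values())  # how busy each police district is

# A rough cut at the location hunches: how categories vary by district
by_district = df.pivot_table(index='PdDistrict', columns='Category',
                             values='Dates', aggfunc='count', fill_value=0)
print(by_district.head())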

There’s a similar style of dataset (Taarifa’s water point challenge) on DrivenData. Some similar thinking might be interesting there too.