Software

Notes from John Sarapata’s talk on online responses to organised adversaries

John Sarapata (@JohnSarapata) = head of engineering at Jigsaw (= new name for Google Ideas). Jigsaw = “the group at Google that tries to help users facing organized violence and oppression”. A common thread in their work is that they’re dealing with the outputs from organized adversaries, e.g. governments, online mobs, extremist groups like ISIS. One example project is redirectmethod.org, which looks for people who are searching for extremist connections (e.g. ISIS) and shows them content from a different point of view, e.g. a user searching for travel to Aleppo might be shown realistic video of conditions there. [IMHO this is a useful application of social engineering in a clear-cut situation; threats and responses in other situations may be more subtle than this (e.g. what does ‘realistic’ mean in a political context?).] The Jigsaw team is looking at threats and counters at 3 levels of the tech stack: device/user: activities are consuming and creating content; threats include attacks by governments, phishing, surveillance,…

Software

Why am I writing about belief?

[Cross-post from LinkedIn] I’ve been meaning to write a set of sessions on computational belief for a while now, based on the work I’ve done over the years on belief, reasoning, artificial intelligence and community beliefs. With all that’s happening in our world now, both online and in the “real world”, I believe that the time has come to do this. We could start with truth. We often talk about ‘true’ and ‘false’ as though they’re immovable things: that every statement should be able to be assigned one of these values. But it’s a little more complicated than that. What we see as ‘true’ is often the result of a judgement we made, given our perception and experience of the world, that a belief is close enough to certain to be ‘true’. But what if there are no objective truths? In robotics, we talk about “ground truth” and the “god’s…

ICanHazDatascience

Data Science Tools, or what’s in my e-backpack?

One of the infuriating (and at the same time, strangely cool) things about development data science is that you quite often find yourself in the middle of nowhere with a job to do, and no access to the Internet (although this doesn’t happen as often as many Westerners think: there is real internet in most of the world’s cities, honest!). Which means you get to do your job with exactly what you remembered to pack on your laptop: tools, code, help files, datasets and academic papers. This is where we talk about the tools.

The DS4B toolset

This is the tools list for the DS4B course (if you’re following the course, don’t panic: install notes are here). Offline toolset. Already on the machine: Terminal window, Calculator. Anaconda: Jupyter notebooks, Python (version 3), R (need to add this to Anaconda). Rstudio, OpenRefine, D3 libraries, Tabula, Excel or LibreOffice (opensource equivalent), QGIS (Mac users: note the separate instructions on this!), GDAL…

ICanHazDatascience

Data Science Ethics [DS4B Session 1e]

This is what I usually refer to as the “Fear of God” section of the course…

Ethics

Most university research projects involving people (aka “human subjects”) have to write and adhere to an ethics statement, and follow an overarching ethics framework, e.g. “The University has an ethical commitment to minimize the risks to research subjects and to ensure that individuals who participate in research projects conducted under its auspices… do so voluntarily and with an informed understanding of what their involvement will mean”. Development data scientists are not generally subject to ethics reviews, but that doesn’t mean we shouldn’t also ask ourselves the hard questions about what we’re doing with our work, and the people that it might affect. At a minimum, if you make data public, you have a responsibility, to the best of your knowledge, skills, and advice, to do no harm to the people connected to that data. Data…

ICanHazDatascience

Writing a problem statement [DS4B Session 1d]

Data work can sometimes seem meaningless.  You go through all the training on cool machine learning techniques, find some cool datasets to play with, run a couple of algorithms on them, and then.  Nothing. That sinking feeling of “well, that was useless”. I’m not discouraging play. Play is how we learn how to do things, where we can find ideas and connect more deeply to our data.  It can be a really useful part of the “explore the data” part of data science, and there are many useful playful design activities that can help with “ask an interesting question”.   But data preparation and analysis takes time and is full of rabbitholes:  interesting but time-consuming things that aren’t linked to a positive action or change in the world. One thing that helps a lot is to have a rough plan: something that can guide you as you work through a data…

ICanHazDatascience

Data Science is a Process [DS4B Session 1c]

People often ask me how they can become a data scientist. To which my answers are usually ‘why’, ‘what do you want to do with it’ and ‘let’s talk about what it really is’. So let’s talk about what it really is. There are many definitions of data science, e.g.:

“A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”

“The analysis of data using the scientific method”

“A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

We can spend hours debating which definition is ‘right’, or we could spend those hours looking at what data scientists do in practice, getting some tools and techniques under our belts and finding a definition that works for each one of us personally….

ICanHazDatascience

About these sessions [DS4B Session 1b]

First, there are no prerequisites for these sessions except a curiosity to learn about data science and willingness to play with data. Second, the labs are designed to cover most of what you need to talk to data scientists, to specify your project, and to start exploring data science on your own.  They’re specifically aimed at the types of issues that you’ll see in the messy strange world of development, humanitarian and social data, and use tools that are free, well-supported and available offline. This may not be what you need, and that’s fine. If you want to learn about machine learning in depth, or create beautiful Python code, there are many courses for that (and the Data Science Communities and Courses page has a list of places you might want to start at).  But if you’re looking at social data, and to quote one of my students “can see the data…

ICanHazDatascience

DS4B: The Course [DS4B Session 1a]

[Cross-posted from LinkedIn] I’ve designed and taught 4 university courses in the past 3 years: ICT for development and social change (DSC, with Eric from BlueRidge), coding for DSC, data science coding for DSC and data science for DSC (with Stefan from Sumall).  The overarching theme of each of them was better access and understanding of technical tools for people who work in areas where internet is unavailable or unreliable, and where proprietary tools are prohibitively expensive. The tagline for the non-coding courses was the “anti-bullsh*t courses”: a place where students could learn what words like ‘Drupal’ and ‘classifier’ mean before they’re trying to assess a technical proposal with those words in, but also a place to learn how to interact with data scientists and development teams, to learn their processes, language and needs, and ask the ‘dumb questions’ (even though there are no dumb questions) in a safe space. Throw a rock on the…

ICanHazDatascience

Releasing my course materials [DS4B]

[Cross-post from LinkedIn] I teach data science literacy to different groups, and I’ve been struggling with a personal dilemma: I believe strongly in open access (data, code, materials), but each hour of lecture materials takes about 15-20 hours of my time to prepare: should I make everything openly available, or try to respect that preparation time? Which is basically the dilemma that anyone open-sourcing goes through. And the answer is the same: release the materials. Made stronger by the realisation that I teach how to do field data science (data science in places without internet connectivity) and people in places without internet connectivity are unlikely to be dropping into New York for in-person sessions. Starting this week, the materials are going online on GitHub; I’ll be backing this up by restarting the I Can Haz Datascience posts (posts on development data science, with Emily the cat) to cover the topics in them…

Data Science

Why spreadsheets are hard

Thinking about machine-reading human-generated spreadsheets today, and I think I’ve got a handle on why this is a problem. Data nerds think of data in its back-end sense, of “what form of data do I need to be able to analyse/visualise this”. We normalise, we worry about consistency, we clean out the formatting. People used to creating spreadsheets for other people think of data more in the front-end sense, of “how do I make this data easily comprehensible to someone looking at this”. Each has merits/demerits (e.g. reading normalised data and seeing patterns in it can be hard for a human; reading human-formatted data is hard for machines) and part of our work as data nerds is working out how to bridge that divide. Which is going to take work in both directions, but it’s necessary and important work to do.
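
To make the front-end/back-end gap concrete, here is a minimal sketch of one small bridging step: turning a human-friendly grid (one row per item, one column per year, blank spacer rows for readability) into normalised long-form records. The function name `normalise`, the field names `column` and `value`, and the sample sheet are all invented for illustration; real spreadsheets usually need much more than this (merged cells, notes in margins, multiple header rows).

```python
def normalise(rows):
    """Convert a human-formatted grid into long-form records.

    Assumes (hypothetically) that the first row is a header whose first
    cell names the row identifier and whose other cells label the columns.
    """
    header, *body = rows
    id_label, *column_labels = header
    records = []
    for row in body:
        # Skip blank spacer rows that people add for readability.
        if not any(str(cell).strip() for cell in row):
            continue
        key, *values = row
        for label, value in zip(column_labels, values):
            # Skip empty cells rather than inventing a value for them.
            if str(value).strip() == "":
                continue
            records.append({id_label: key, "column": label, "value": value})
    return records


# Made-up example sheet: readable for a person, awkward for a machine.
sheet = [
    ["District", "2015", "2016"],
    ["North", "120", "135"],
    ["", "", ""],
    ["South", "80", ""],
]

long_form = normalise(sheet)
for record in long_form:
    print(record)
```

The long form is harder for a person to scan, which is the trade-off described above: each layout serves a different reader, and the conversion is where the bridging work happens.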