Fake News Isn’t About Truth, It’s About Gaming Belief Systems


[cross-post from Medium]

Thinking about #fakenews. Starting with “what is it”.

* We’re not dealing with truth here: we’re dealing with gaming belief systems. That’s what fake news does (well, one of the things; another thing it does is make money from people reading it), and just correcting fake news is aiming at the wrong thing. Because…
* Information leaves traces in our heads, even when we know what’s going on. If I jokingly tell you that I’ve crashed your car, then go ‘ha ha’, you know that I didn’t crash your car, but I’ve left a trace in your head that I’m an unsafe driver. The bigger the surprise of the thing you initially believe, the bigger the trace it leaves (this is why I never make jokes like that).
* That’s important because #fakenews isn’t about the thing that’s being said. It’s about the things that are being implied. Always look for the thing being implied. That’s what you have to counter.
* Some of those things are, e.g. “Liberals are unpatriotic”. “Terrorists are a real and present threat *to you*”. Work out counters for these, and mechanisms for those counters. F’example: wearing US flags at protests and being loudly patriotic whilst standing up for basic rights is a good idea.
* Yes, straighten the record, but you’re not aiming at the person (or site) spouting fake news. What you *are* trying to change is their readers’ belief in whether something is true.
* America is a big country. Not everyone can go and see what’s true or not. Which means they have to trust someone else to go look for them. The Internet is even bigger. Some of the things on it (e.g. beliefs about other people’s beliefs) don’t have physical touchpoints and are impossible to confirm or deny as ‘truth’.
* Which means you’re trying to change the beliefs of large groups of people, who have a whole bunch of trust issues (both overtrust for in-group, and serious distrust of out-group people) and no direct proof.
* You know who else hacks trust and beliefs in large groups? Salesmen and advertisers. Learn from them (oh, and propagandists, but you might want to be careful what you learn there).
* People often hold conflicting beliefs in their heads (unless they’re Aspie: Aspies have a hard time doing this). Niggling doubts are levers, even when people are still being defensive and doubling-down on their stated beliefs. Look for the traces of these.
* But go gentle. Create too much cognitive dissonance, and people will shut down. Learn from the salesmen on this.
* People are more likely to trust people they know. Get to know the people whose beliefs you want to change (even if it means hanging out in conservative chat channels). Also know that your attention is a resource: learn to distinguish between people who are engaged and might listen (hint: they’re often the ones shouting at you), people who won’t, and sock puppets.
* More advertising tricks: look for influencers (not just on Twitter ‘cos it’s easy goddammit; check in the real world too). There’s only one you: use that you wisely.

Some reading:
* A field guide to earthlings (the Aspie reference)
* Social psychology: a very short introduction

How to culture jam a populist

The Internet is made of beliefs


[cross-post from Medium]

“Most people don’t have the time or headspace to handle IW: we’re going to need to tool up. Is not much, but I’m talking next month on belief, and how some of the pre-big-data AI tools and verification methods we used in mapping could be useful in this new (for many) IW world… am hoping it sparks a few people to build stuff.” — me, whilst thoroughly lost somewhere in Harlem.

Dammit. I’ve started talking about belief and information warfare, and my thoughts looked half-baked and now I’m going to have to follow through. I said we’d need to tool up to deal with the non-truths being presented, but that’s only a small part of the thought. So here are some other thoughts.

1) The internet is also made of beliefs. The internet is made of many things: pages and comment boxes and ports and protocols and tubes (for a given value of ‘tubes’). But it’s also made of belief: it’s a virtual space that’s only tangentially anchored in reality, and to navigate that virtual space, we all build mental models of who is out there, where they’re coming from, who or what to trust, how to verify that they are who they say they are, and whether what they’re saying is true (or untrue but entertaining, or fantasy, or… you get the picture).

2) This isn’t new, but it is bigger and faster. The US is a big country; news here has always been either hyperlocal or spread through travelers and media (newspapers, radio, telegrams, messages on ponies). These were made of belief too. Lying isn’t new; double-talk isn’t new; what’s new here is the scale, speed and number of people that it can reach.

3) Don’t let the other guys frame your reality. We’re entering a time where misinformation and double-talk are likely to dominate our feeds, and even people we trust are panic-sharing false information. It’s not enough to pick a media outlet or news site or friend to trust, because they’ve been fooled recently too; we’re going to have to work out together how best to keep a handle on the truth. As a first step, we should separate out our belief in a source from our belief in a piece of information from them, and factor in our knowledge about their potential motivations in that.

4) Verification means going there. For most of us, verification is something we might do up front, but rarely do as a continuing practice. Which, apart from making people easy to phish, also makes us vulnerable to deliberate misinformation. We want to believe stuff? We need to do the leg-work of cross-checking that the source is real (crisismappers had a bunch of techniques for this, including checking how someone’s social media profile had grown, looking at activity patterns), finding alternate sources, getting someone to physically go look at something and send photos (groups like findyr still do this). We want to do this without so much work every time? We need to share that load; help each other out with #icheckedthis tags, pause and think before we hit the “share” button.
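I’m not going to reproduce the crisismappers’ toolkit here, but as a toy flavour of the “activity patterns” check, here’s a sketch that looks at how old an account is and how mechanically it posts. The timestamps are invented, and the thresholds are left to the reader.

```python
# A toy account-activity check; timestamps are invented for illustration.
from datetime import datetime, timedelta

post_times = [datetime(2017, 1, 20, 14, 0) + timedelta(minutes=2 * i) for i in range(40)]
account_created = datetime(2017, 1, 19)

age_days = (post_times[-1] - account_created).days
gaps = sorted((b - a).total_seconds() for a, b in zip(post_times, post_times[1:]))
print("account age in days:", age_days)
print("median seconds between posts:", gaps[len(gaps) // 2])
# very young account + near-constant rapid posting = look harder before trusting
```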

5) Actions really do speak louder than words. There will most likely be a blizzard of information heading our way; we will need to learn how to find the things that are important in it. One of the best pieces of information I’ve ever received (originally, it was about men) applies here: “ignore everything they say, and watch everything they do”. Be aware of what people are saying, but also watch their actions. Follow the money, and follow the data; everything leaves a trace somewhere if you know how to look for it (again, something that perhaps is best done as a group).

6) Truth is a fragile concept; aim for strong, well-grounded beliefs instead. Philosophy warning: we will probably never totally know our objective truths. We’re probably not in the matrix, but we humans are all systems whose beliefs in the world are completely shaped by our physical senses, and those senses are imperfect. We’ll rarely have complete information either (e.g. there are always outside influences that we can’t see), so what we really have are very strong to much weaker beliefs. There are some beliefs that we accept as truths (e.g. I have a bruise on my leg because I walked into a table today), but mostly we’re basing what we believe on a combination of evidence and personal viewpoint (e.g. “it’s not okay to let people die because they don’t have healthcare”). Try to make both of those as strong as you can.

I haven’t talked at all about tools yet. That’s for another day. One of the things I’ve been building into my data science practice is the idea of thinking through problems as a human first, before automating them, so perhaps I’ll roll these thoughts around a bit first. I’ve been thinking about things like perception, e.g. a camera’s perception of a car color changes when it moves from daylight to sodium lights, and adaptation (e.g. using other knowledge like position, shape and plates) and actions (clicking the key) and when beliefs do and don’t matter (e.g. they’re usually part of an action cycle, but some action cycles are continuous and adaptive, not one-shot things), how much of data work is based on chasing beliefs and what we can learn from people with different ways of processing information (hello, Aspies!), but human first here.

The Ethics of Algorithms


I opened a discussion on the ethics of algorithms recently, with a small thing about what algorithms are, what can be unethical about them and how we might start mitigating that. I kinda sorta promised a blogpost off that, so here it is.

Algorithm? Wassat?


Al-Khwarizmi (wikipedia image)

Let’s start by demystifying this ‘algorithm’ thing. An algorithm (from Al-Khwārizmī, a 9th-century Persian mathematician, above) is a sequence of steps to solve a problem. Like the algorithm to drink coffee is to get a mug, add coffee to the mug, put the mug to your mouth, and repeat. An algorithm doesn’t have to be run on a computer: it might be the processes that you use to run a business, or the set of steps used to catch a train.

But the algorithms that the discussion organizers were concerned about aren’t the ones used to not spill coffee all over my face. They were worried about the algorithms used in computer-assisted decision making in things like criminal sentencing, humanitarian aid, search results (e.g. which information is shown to which people) and the decisions made by autonomous vehicles; the algorithms used by or instead of human decision-makers to affect the lives of other human beings. And many of these algorithms get grouped into the bucket labelled “machine learning”. There are many variants of machine learning algorithms, but generally what they do is find and generalize patterns in data: either in a supervised way (“if you see these inputs, expect these outputs; now tell me what you expect for an input you’ve never seen before, but is similar to something you have”), a reinforcement-learning way (“for these inputs, your response is/isn’t good”) or an unsupervised way (“here’s some data; tell me about the structures you see in it”). Which is great if you’re classifying cats vs dogs or flower types from petal and leaf measurements, but potentially disastrous if you’re deciding who to sentence and for how long.
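To make the supervised case concrete, here’s a minimal sketch of exactly that flower-type example: learn from petal and sepal measurements paired with known labels, then predict labels for measurements the model hasn’t seen. It assumes scikit-learn is installed; the dataset and model choice are purely illustrative.

```python
# A minimal sketch of supervised learning, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()  # flower measurements (inputs) and species labels (outputs)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)  # find and generalise patterns
print(model.score(X_test, y_test))  # how well it labels inputs it has never seen
```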


Simple neural network (wikipedia image)

Let’s anthropomorphise that. A child absorbs what’s around them: algorithms do the same. One use is to train the child/machines to reason or react, by connecting what they see in the world (inputs) with what they believe the state of the world to be (outputs) and the actions they take using those beliefs. And just like children, the types of input/output pairs (or reinforcements) we feed to a machine-learning based system affects the connections and decisions that it makes. Also like children, different algorithms have different abilities to explain why they made specific connections or responded in specific ways, ranging from clear explanations of reasoning (e.g. decision trees, which make a set of decisions based on each input) to something that can be mathematically but not cogently expressed (e.g. neural networks and other ‘deep’ learning algorithms, which adjust ‘weights’ between inputs, outputs and ‘hidden’ representations, mimicking the ways that neurons connect to each other in human brains).
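To illustrate that explainability range, here’s a rough sketch contrasting a decision tree, which can print its reasoning as readable rules, with a small neural network, whose “reasoning” lives in arrays of learned weights. It reuses the illustrative iris data from the sketch above and assumes scikit-learn.

```python
# A sketch of the explainability contrast, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))  # readable decision rules

net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(iris.data, iris.target)
print([w.shape for w in net.coefs_])  # weight matrices: mathematically expressible, not cogent
```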

Algorithms can be Assholes

Algorithms are behind a lot of our world now. e.g. Google (which results should you be shown), Facebook (which feeds you should see), medical systems detecting if you might have cancer or not. And sometimes those algorithms can be assholes.


Headlines (screenshots)

Here are two examples. The first is a Chinese program that takes facial images of ‘criminals’ and maps those images to a set of ‘criminal’ facial features that the designers claim have nearly 90% accuracy in determining whether someone is a criminal, from just their photo. Their discussion of “the normality of faces of non-criminals” aside, this has echoes of phrenology, and should raise all sorts of alarms about imitating human bias. The second example is a chatbot that was trained on Twitter data; the headline here should not be too surprising to anyone who’s recently read any unfiltered social media.

We make lots of design decisions when we create an algorithm. One decision is which dataset to use. We train algorithms on data. That data is often generated by humans, and by human decisions (e.g. “do we jail this person”), many of which are imperfect and biased (e.g. thinking that people whose eyes are close together are untrustworthy). This can be a problem if we use those results blindly, and we should always be asking about the biases that we might consciously or unconsciously be including in our data. But that’s not the only thing we can do: instead of just dismissing algorithm results as biased, we can also use them constructively, to hold a mirror up to ourselves and our societies, to show us things that we otherwise conveniently ignore, and perhaps should be thinking about addressing in ourselves.

In short, it’s easy to build biased algorithms with biased data, so we should strive to train algorithms on ‘fair’ data where we can. Where we can’t, we need other strategies for our models of the world: we can either talk about the terror of biased algorithms being used to judge us, or we can think about what they’re showing us about ourselves and our society’s decision-making, and where we might improve both.

What goes wrong?

If we want to fix our ‘asshole’ algorithms and algorithm-generated models, we need to think about the things that go wrong.  There are many of these:

  • On the input side, we have things like biased inputs or biased connections between cause and effect creating biased classifications (see the note on input data bias above), bad design decisions about unclean data (e.g. keeping in those 200-year-old people; see the sketch after this list), and missing whole demographics because we didn’t think hard about who the input data covered (e.g. women are often missing in developing world datasets, mobile phone totals are often interpreted as 1 phone per person etc).
  • On the algorithm design side, we can have bad models: lazy assumptions about what input variables actually mean (just think for a moment of the last survey you filled out, the interpretations you made of the questions, and how as a researcher you might think differently about those values), lazy interpretations of connections between variables and proxies (e.g. clicks == interest), algorithms that don’t explain or model the data they’re given well, algorithms fed junk inputs (there’s always junk in data), and models that are trained once on one dataset but used in an ever-changing world.
  • On the output side, there’s also overtrust and overinterpretation of outputs. And overlaid on that are the willful abuses, like gaming an algorithm with ‘wrong’ or biased data (e.g. propaganda, but also why I always use “shark” as my first search of the day), and inappropriate reuse of data without the ethics, caveats and metadata that came with the original (e.g. using school registration data to target ‘foreigners’).
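Picking up the “200-year-old people” example from the input-side bullet above, here’s a minimal sketch of the kind of sanity checks that catch some of these problems early. It assumes pandas and a hypothetical dataset with ‘age’ and ‘gender’ columns; the checks are illustrative, not exhaustive.

```python
# Minimal input-data sanity checks; the dataframe and columns are invented.
import pandas as pd

df = pd.DataFrame({"age": [34, 221, 45, 19], "gender": ["m", "m", None, "f"]})

print(df[(df["age"] < 0) | (df["age"] > 120)])   # implausible ages to inspect, not silently keep
print(df["gender"].value_counts(dropna=False))   # who is (and isn't) represented in the data
```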

But that’s not quite all. As with the outputs of humans, the outputs of algorithms can be very context-dependent, and we often make different design choices depending on that context. For instance, last week I found myself dealing with a spammer trying to use our site at the same time as helping our business team stop their emails going into customers’ spam filters. The same algorithms, different viewpoints, different needs, different experiences: algorithm designers have a lot to weigh up every time.

Things to fight

Accidentally creating a deviant algorithm is one thing; deliberately using algorithms (including well-meant algorithms) for harm is another, and of interest in the current US context. There are good detailed texts about this, including Cathy O’Neil’s work, and Latzer, who categorised abuses as:

  • Manipulation
  • Bias
  • Censorship
  • Privacy violations
  • Social discrimination
  • Property right violations
  • Market power abuses
  • Cognitive effects (e.g. loss of human skills)
  • Heteronomy (individuals no longer have agency over the algorithms influencing them)

I’ll just note that these things need to be resisted, especially by those of us in a position to influence their propagation and use.

How did we get here?

Part of the issue above is in how we humans interface with and trust algorithm results (and there are many of these, e.g. search, news feed generators, recommendations, recidivism predictions etc), so let’s step back and look at how we got to this point.

And we’ve got here over a very long time: at least a century or two, back to when humans started using machines that they couldn’t easily explain because those machines could do tasks that had become too big for the humans. We automate because humans can’t handle the data loads coming in (e.g. in legal discovery, where a team often has a few days to sift through millions of emails and other organizational data); we also automate because we hope that machines will be smarter than us at spotting subtle patterns. We can’t not automate discovery, but we also have to be aware of the ethical risks in doing it. But humans working with algorithms (or any other automation) tend to go through cycles: we’re cynical and undertrust a system until it’s “proved”, then tend to overtrust its results (these are both part of automation trust). In human terms, we’re balancing these things:

More human:

  • Overload
  • Incomplete coverage
  • Missed patterns and overlooked details
  • Stress
More automation:

  • Overtrust
  • Situation awareness loss (losing awareness because algorithms are doing processing for us, creating e.g. echo chambers)
  • Passive decision making
  • Discrimination, power dynamics etc

And there’s an easy reframing here: instead of replacing human decisions with automated ones, let’s concentrate more on sharing, and frame this as humans plus algorithms (not humans or algorithms), sharing responsibility, control and communication, between the extremes of under- and over-trust (NB that’s not a new idea: it’s a common one in robotics).

Things I try to do

Having talked about what algorithms are, what we do as algorithm designers, and the things that can and do go wrong with that, I’ll end with some of the things I try to do myself.  Which basically comes down to consider ecosystems, and measure properly.

Considering ecosystems means looking beyond the data, and at the human and technical context an algorithm is being designed in.  It’s making sure we verify sources, challenge both the datasets we obtain and the algorithm designers we work with, and have a healthy sense of potential risks and their management (e.g. practice both data and algorithm governance), and reduce bias risk by having as diverse (in thought and experience) a design team as we can, and access to people who know the domain we’re working in.

Measuring properly means using metrics that aren’t just about how accurately the models we create fit the datasets we have (this is too often the only goal of a novice algorithm designer, expressed as precision = how many of the things you labelled as X are actually X; and recall = how many of the things that are really X did you label as X?), but also metrics like “can we explain this” and “is this fair”.  It’s not easy but the alternative is a long way from pretty.
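As a rough illustration of that point, here’s a small sketch that computes precision and recall and then asks one crude “is this fair?” question by comparing positive-prediction rates across a hypothetical demographic group. The labels, predictions and group names are all made up, and real fairness auditing goes much further than this.

```python
# Accuracy-style metrics plus one crude fairness check; data is invented.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]  # hypothetical demographic attribute

print(precision_score(y_true, y_pred))  # of the things labelled X, how many are X?
print(recall_score(y_true, y_pred))     # of the real Xs, how many did we label?

# compare positive-prediction rates per group
for g in sorted(set(group)):
    preds = [p for p, gg in zip(y_pred, group) if gg == g]
    print(g, sum(preds) / len(preds))
```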

I once headed an organisation whose only rule was “Don’t be a Jerk”.  We need to think about how to apply “Don’t be a jerk” to algorithms too.

Infosec, meet data science


I know you’ve been friends for a while, but I hear you’re starting to get closer, and maybe there are some things you need to know about each other. And since part of my job is using my data skills to help secure information assets, it’s time that I put some thoughts down on paper… er… pixels.

Infosec and data science have a lot in common: they’re both about really really understanding systems, and they’re both about really understanding people and their behaviors, and acting on that information to protect or exploit those systems.  It’s no secret that military infosec and counterint people have been working with machine learning and other AI algorithms for years (I think I have a couple of old papers on that myself), or that data scientists and engineers are including practical security and risk in their data governance measures, but I’m starting to see more profound crossovers between the two.

Take data risk, for instance. I’ve spent the past few years as part of the conversation on the new risks from both doing data science and its components like data visualization (the responsible data forum is a good place to look for this): that there is risk to everyone involved in the data chain, from subject and collectors through to processors and end-product users, and that what we need to secure goes way beyond atomic information like EINs and SSNs, to the products and actions that could be generated by combining data points. That in itself is going to make infosec harder: there will be incursions (or, if data’s heading out from inside, excursions), and tracing what was collected, when and why is becoming a lot more subtle. Data scientists are also good at subtle: some of the Bayes-based pattern-of-life tools and time-series anomaly algorithms are well-bounded things of beauty. But it’s not all about DS; also in those conversations have been infosec people who understand how to model threats and risks, and help secure those data chains from harm (I think I have some old talks on that too somewhere, from back in my crisismapping days).
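This isn’t one of those pattern-of-life tools, but here’s a toy sketch of the time-series-anomaly flavour: flag the days whose event counts sit far outside the recent norm, and hand those to a human. It assumes pandas; the ‘logins’ series and the threshold are invented.

```python
# A toy rolling z-score anomaly check; the login counts are invented.
import pandas as pd

logins = pd.Series([51, 48, 53, 50, 49, 52, 240, 47],
                   index=pd.date_range("2017-01-01", periods=8))

baseline_mean = logins.shift(1).rolling(5, min_periods=3).mean()  # stats from prior days only
baseline_std = logins.shift(1).rolling(5, min_periods=3).std()
zscore = (logins - baseline_mean) / baseline_std
print(logins[zscore.abs() > 3])  # candidate anomalies for a human to look at
```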

There are also differences.  As a data scientist, I often have the luxury of time: I can think about a system, find datasets, make and test hypotheses and consider the veracity and inherent risks in what I’m doing over days, weeks or sometimes months.  As someone responding to incursion attempts (and yes, it’s already happening, it’s always already happening), it’s often in the moment or shortly after, and the days, weeks or months are taken in preparation and precautions.  Data scientists often play 3d postal chess; infosec can be more like Union-rules rugby, including the part where you’re all muddy and not totally sure who’s on your side any more.

Which isn’t to say that data science doesn’t get real-time and reactive: we’re often the first people to spot that something’s wrong in big streaming data, and the pattern skills we have can both search for and trace unusual events, but much of our craft to date has been more one-shot and deliberate (“help us understand and optimise this system”). Sometimes we realize a long time later that we were reactive (like realizing recently that mappers have been tracking and rejecting information injection attempts back to at least 2010 – yay for decent verification processes!). But even in real-time we have strengths: a lot of data engineering is about scaling data science processes in both volume and time, and work on finding patterns and reducing reaction times in areas ranging from legal discovery (large-scale text analysis) to manufacturing and infrastructure (e.g. not-easy-to-predict power flows) can also be applied to security.

Both infosec and data scientists have to think dangerously: what’s wrong with this data, these algorithms, this system (how is it biased, what is it missing, how is it wrong); how do I attack this system, how can I game these people; how do I make this bad thing least-worst given the information and resources I have available, and that can get us both into difficult ethical territory.  A combination of modern data science and infosec skills means I could gather data on and profile all the people I work with, and know things like their patterns of life and potential vulnerabilities to e.g. phishing attempts, but the ethics of that is very murky: there’s a very very fine line between protection and being seriously creepy (yep, another thing I work on sometimes).  Equally, knowing those patterns of life could help a lot in spotting non-normal behaviours on the inside of our systems (because infosec has gone way beyond just securing the boundaries now), and some of our data summary and anonymisation techniques could be helpful here too.  Luckily much of what I deal with is less ethically murky: system and data access logs, with known vulnerabilities, known data and motivations, and I work with a most wonderfully evil and detailed security nerd.  But we still have a lot to learn from each other.  Back in the Cold War days (the original Cold War, not the one that seems to be restarting now), every time we designed a system, we also designed countermeasures to it, often drawing on disciplines far outside the original system’s scope.  That seems to be core to the infosec art, and data science would seem to be one of those disciplines that could help.

Notes from John Sarapata’s talk on online responses to organised adversaries

John Sarapata (@JohnSarapata) = head of engineering at Jigsaw  (= new name for Google Ideas).  Jigsaw = “the group at Google that tries to help users facing organized violence and oppression”.  A common thread in their work is that they’re dealing with the outputs from organized adversaries, e.g. governments, online mobs, extremist groups like ISIS.
One example project is redirectmethod.org, which looks for people who are searching for extremist connections (e.g. ISIS) and shows them content from a different point of view, e.g. a user searching for travel to Aleppo might be shown realistic video of conditions there. [IMHO this is a useful application of social engineering in a clear-cut situation; threats and responses in other situations may be more subtle than this (e.g. what does ‘realistic’ mean in a political context?).]
The Jigsaw team is looking at threats and counters at 3 levels of the tech stack:
  • device/user: activities are consume and create content; threats include attacks by governments, phishing, surveillance, brigading, intimidation
  • wire: activities are find then transfer; threats include DNS hijacking, TOR bridge probes
  • server: activities are hosting; threats include DDOS
[They also appear to be looking at threats and counters on a meta level (e.g. the social hack above).]
Examples of emergent countermeasures outside the team include people bypassing censorship in Turkey by using Google’s public DNS 8.8.8.8, and people in China after the 2008 Szechwan earthquake posting images of school collapses and investigating links between these (ultimately leading to finding links between collapses, school contractors using substandard concrete and officials being bribed to ignore this) despite government denial of issues.  These are about both reading and generating content, both of which need to be protected.
There are still unsolved problems, for example communications inside a government firewall.  Firewalls (e.g. China’s Great Firewall) generally have slow external pipes with internal alternatives (e.g. Sina Weibo), so people tend to consume information from inside. Communication of external information inside a firewall isn’t solved yet, e.g. mesh networks aren’t great; the use of thumb drives to share information in Cuba was one way around this, but there’s still more to do.  [This comment interested me because that’s exactly the situation we’ve been dealing with in crises over the past few years: using sneakernet/mopeds, point-to-point, meshes etc., and there may be things to learn in both directions.]
Example Jigsaw projects and apps include:
  • Unfiltered.news, still in beta: creates a knowledge graph when Google scans news stories (this is language independent). One of the cooler uses of this is being able to find things that are reported on in every country except yours (e.g. Russia, China not showing articles on the Panama Papers).
  • Anti-phishing: team used stuff from Google’s security team for this, e.g. using Password Alert (alerts when user e.g. puts their company password into a non-company site) on Google accounts.
  • Government Attack Warning. Google can see attacks on gmail, google drive etc accounts: when a user logs in, Google displays a message to them about a detected attack, including what they could do.
  • Conversation AI. Internet discussions aren’t always civil, e.g. 20-25 governments including China and Russia have troll armies now, amplified by bots (brigading); conversation AI is machine classification/detection of abuse/harassment in text; the Jigsaw team is working on machine learning approaches together with the youtube comment cleanup team.  The team’s considered the tension that exists between free speech and reducing threats: their response is that detection apps must lay out values, and Jigsaw values include that conversation algorithms are community specific, e.g. each community decides its limits on swearing etc.; a good example of this is Riot Games. [This mirrors a lot of the community-specific work by community of community groups like the Community Leadership Forum].  Three examples of communities using Conversation AI: a Youtube feature that flags potential abuse to channel owners (launching Nov 2016). Wikipedia: flagging personal attacks (e.g. you are full of shit) in talk pages (Wikipedia has a problem with falling numbers of editors, partly because of this). New York Times: scaling existing human moderation of website comments (NYT currently turns off comments on 90% of pages because they don’t have enough human moderators). “NYT has lots of data, good results”.  Team got interesting data on how abuse spreads after releasing a photo of women talking to their team about #gamergate, then watching attackers discuss online (4chan etc) who of those women to attack and how, and the subsequent attacks.
  • Firehook: censorship circumvention. Jigsaw has the Uproxy plugin for peer-to-peer information sharing across censorship boundaries (article), but needs to do more, e.g. look at the whole ecosystem.  Most people use proxy servers (e.g. VPNs), but a government could disallow VPNs: we need many different proxies and ways to hide them.  Currently using WebRTC for peer-to-peer proxies (e.g. Germany to Turkey using e.g. NAT hole punching), collateral freedom and domain fronting, e.g. GreatFire routing New York Times articles through Amazon and GitHub.  Domain fronting (David Fifield article) uses the fact that e.g. CloudFlare hosts many sites: the user connects to an allowed host, https encrypts it, then uses the encrypted header to go to a blocked site on the same host.  There are still counters to this; China first switched off GitHub access (then had to restore it), and used the Great Cannon to counter GreatFire, e.g. every 100th load of Baidu Analytics injects malware into external machines and creates a DDOS botnet. NB the firewall here was on path, not in path: a machine off to one side listens for banned words and breaks connections, but Great Cannon is inside the connection; and with current access across the great firewall, people see different pages based on who’s browsing.
  • DDOS: http://www.digitalattackmap.com/ (with Arbor Networks), shows DDOS in real time. Digital attacks are now mirroring physical ones, and during recent attacks, e.g. Hong Kong’s Umbrella Revolution, Jigsaw protected sites on both sides (using Project Shield, below) because Google thinks some things, like DDOS, are unfair.  Interesting point: trolling as the human equivalent of DDOS [how far can this comparison go in designing potential counters?].
  • Project Shield: reused Google’s PSS (Page Speed Service) to protect news sites, human rights organization etc from DDOS attacks. Sites are on Google cloud: can scale up number of VMs used and nginx allows clever uses with reverse proxies, cookie challenges etc.  Example site: Krebs on Security was being DDOSed (nb a DDOS attack on a site costs about $50 online), moved from host Akamai to Google with Project Shield. Team is tracking Twitter user bragging about this and other attacks (TL;DR: IoT attack, e.g. baby monitors; big botnet, Mirai botnet source code now released, brought down Twitter, Snapchat).  Krebs currently getting about 5 attacks a day, e.g. brute-force, Slowloris, Hulk (bandwidth, syn flood, post flood, cache busting, WordPress pingback etc), and Jigsaw gets the world’s best DDOSers hacking and testing their services.
Audience questions:
  • Qs: protecting democracy in US, e.g. botnets, online harassment etc.? A: don’t serve specific countries but worldwide.
  • Q: google autofill encouraging hatespeech? A: google reflects the world as it is; google search reflects what people do, holds a mirror back up to you. Researching machine learning bias on google suggest results and bias in training data.  Don’t want to censor, but don’t want to propagate bad things.
  • Q: not censor but inform, can you e.g. tell a user “your baby monitor is hacked”? A: privacy issue, e.g. connecting kit and emails to IP addresses.
  • Q: people visiting google site to attack… spamming google auto complete, algorithms? A: if google detects people gaming them, will come down hard.
  • Q: standard for “abusive”? how to compare with human? A: it’s training data, Wikipedia is saying what’s abusive. It’s all people in the end.
  • Q: how deal with e.g. misinformation? A: unsolved problem, politically sensitive, e.g. who gets to decide what’s fake and true? censorship and harassment work will take time.
  • Q: why Twitter not on list of content? A: Twitter might not want this, team is resource constrained, e.g. NYT models are useless on youtube because NYT folks use proper grammar and spelling.
  • Q: how to decide what to intervene in? A: e.g. Google takes sides against ISIS, who are off the charts on e.g. genocide.
  • Q: AWS Shield; does Google want to commercialise their stuff? A: no. NB CloudFlare Galileo is also response to google work.
  • Q: biggest emerging tech threat online? Brigading, e.g. groups of people and bots. This breaches physical and online, includes eg physical threats and violence, and is hard to detect and attribute.

Why am I writing about belief?


[Cross-post from LinkedIn]

I’ve been meaning to write a set of sessions on computational belief for a while now, based on the work I’ve done over the years on belief, reasoning, artificial intelligence and community beliefs. With all that’s happening in our world now, both online and in the “real world”, I believe that the time has come to do this.

We could start with truth. We often talk about ‘true’ and ‘false’ as though they’re immovable things: that every statement should be able to be assigned one of these values. But it’s a little more complicated than that. What we see as ‘true’ is often the result of a judgement we made, given our perception and experience of the world, that a belief is close enough to certain to be ‘true’.

But what if there are no objective truths? In robotics, we talk about “ground truth” and the “god’s eye view” of the world: the knowledge of the world that our robots (or computer vision or reasoning systems) would have if they had perfect information about the world. We talk about things like the “frame problem”, where a system’s ability to reason and act is limited by the “frame” that it has around the world, and the “naughty baby problem” of outside influences that it has no awareness of and cannot plan for. We accept that a robot’s version of “truth” is limited to what it can perceive. But human beings are also limited by their perceptions of the world, by the amount of information available to them. Without going all “Matrix” on you, is it possible that we too are wrong about our “truths”, and we’re not truly objective in reasoning about them because there is no “God’s Eye View” that we can access?

For now, let’s put aside Gödel’s theorem and the ‘undecidable’ sentences like “this sentence is false” that can’t be assigned a true or false value, and think about what happens in a world where we all have only perception and consensus agreements on ‘reality’, and nobody has perfect information. One of the things that happens is that we stop talking about “true” and “false”, and start talking about perception: what we can reasonably believe to be true or false (or undecidable), the uncertainty we have about those beliefs, influence, and what it might take in terms of evidence or new information to change them. In psychology, that gets us into theories of mind and reasoning like cognitive psychology and studies of people like Asperger’s individuals who process ‘facts’ differently; in design, into things like social engineering and the theory of change. In maths and AI, that gets us into territories like multi-state logics (which beliefs are possible, necessary etc) and both frequentist (“what happened”) and Bayesian (“what if”) statistics. We might also shift our focus, and talk not of beliefs, but of what we are trying to achieve with them, getting us into theories of actions, influence and decisions (hello, robotics and operational research). There are many theories of uncertainty, but for now probability theory is dominant, so it’s good to spend time with it and understand how it works under the hood.
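As a hand-wavy sketch of that Bayesian (“what if”) flavour, here’s what belief-as-probability updating looks like in a few lines. The prior and likelihoods are invented numbers, and the second update assumes the sources are independent, which real sources rarely are.

```python
# A toy Bayesian belief update; all numbers are invented for illustration.
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: P(claim is true | we saw this evidence)."""
    numerator = p_evidence_if_true * prior
    return numerator / (numerator + p_evidence_if_false * (1 - prior))

belief = 0.5                       # start undecided about a claim
belief = update(belief, 0.8, 0.3)  # a usually-reliable source repeats it
belief = update(belief, 0.9, 0.6)  # a second source agrees (independence assumed)
print(belief)                      # a stronger belief, but still not "truth"
```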

Although this may all seem abstract hand-wavey, late-night-discussiony “are we living in the Matrix already”, theories of belief have many practical applications. They’re used heavily in data science, and underpin decisions on things like technology designs (via e.g. AB testing), political information release and propaganda (which are usually not the same thing). We need to talk about that too, because the tools and the terrain available (e.g. the internet) are already more powerful than their current uses. One thing we’re becoming painfully aware of now is how belief functions in groups; the use of influence, multiple separate sources of information and repetition to spread beliefs that compete with each other, and the role of things like desire in those beliefs. We’re also learning that systems we build based on human outputs (e.g. Internet-based AI) have the same biases in belief as the humans; an obvious-but-not-obvious thing that we need to recognise and handle. There are useful theories for this too, ranging from the maths of multiple viewpoints to techniques used to both create and make sense of competing views (ACH, information incest detection, phemes). There are also theories of how humans think in groups, and how they can be persuaded to or do change their minds, both rapidly and slowly over time (e.g. game theory, creativity theory and the study of both human and scientific revolutions).

I’ve spent a lot of my life thinking about and applying the theories above, but I’ve never really got round (apart from the odd note on intelligence, data science or the risks inherent in processing data about people) to writing it all down. The session notes are, I hope, a start on this, and even if nobody else reads them, it’ll be fun to do some targeted thinking around them.

Data Science Tools, or what’s in my e-backpack?


One of the infuriating (and at the same time, strangely cool) things about development data science is that you quite often find yourself in the middle of nowhere with a job to do, and no access to the Internet (although this doesn’t happen as often as many Westerners think: there is real internet in most of the world’s cities, honest!).

Which means you get to do your job with exactly what you remembered to pack on your laptop: tools, code, help files, datasets and academic papers.  This is where we talk about the tools.

The DS4B toolset

This is the tools list for the DS4B course (if you’re following the course, don’t panic: install notes are here).

Offline toolset:

Online toolset (i.e. useful, but not available when you’re offline):

Python and R both come with a bunch of useful libraries in Anaconda (here’s the Python and R lists).  The (Anaconda 4.0) pre-installed libraries used in DS4B are:

  • Python: Basemap, BeautifulSoup, Dask, Matplotlib, NetworkX, Nltk, Numba, 
    Numpy, Pandas, Requests, Scikit-image, Scikit-learn, SeaBorn, Shapely, Sqlite3
  • R: ggplot2, frame, lm

The course also uses some Python libraries that don’t come with the Anaconda standard install. These are:

  • Csv, DateTime, Facepy, Fiona, Gdal, Gdalconst, Geopy, Googlemaps, 
    Json, Ogr2ogr, Osgeo, Pyspark, Re, Twitter

How to get each of these is listed in the DS4B course instructions, but usually isn’t more onerous than typing “pip install libraryname” in the terminal window.  Fiona might give you some trouble (“cannot find the gdal.h library”): if this happens, there are notes in the install instructions.

Choosing the course toolset

For the course, we deliberately biased towards tools that were:

  • Free (because they need to be accessible for everyone, not just people who can pay),
  • Open source with good communities (because you can ask for help, go in and see how and why things work when you need to, and things get fixed a lot faster when you’re not limited by a company’s resources/ will).
  • Accessible on most platforms (e.g. Windows, Mac and Linux of varying versions of operating system)
  • Easy to install. Because nothing puts you off a tool quite as much as watching a build fail again and again, and having to learn all about the technologies it’s built on and their variants before you can do even the simplest thing with it </rant>.
  • Stable. It’s cruel to tell people without tech support to install a tool that regularly crashes on them. Even if you’re a techie, it’s still very annoying…

I do a lot of basic numbers work in Microsoft Excel (add things up, do sorts and filters of data, click on numerical columns to see averages etc at the bottom of the screen) and the calculator.  Some data scientists (although a dwindling number of them) only use Excel to process data, but in development data science we’re often handling very unstructured or messy data (e.g. ‘who knows how many columns this thing really has?’ type data) and need to produce results that we can both repeat, and trace back step-by-step to the raw data (and sometimes we get lucky and have a dataset bigger than Excel’s limits of 1,048,576 rows by 16,384 columns: NB Excel will silently fail and only give you the first rows/columns if you do this).

That means using a coding language (yes, yes, SPSS, SAS, Matlab, Pentaho etc, but a) those aren’t free, and b) did I mention the really messy inputs?).

R and Python are coding languages that turn up a lot in data science (sql is used a lot too, for database data).   R is a beautiful language for doing statistical things: it was built by statisticians, has packages that aren’t in other languages yet, and has some lovely lovely visualisation libraries.  But we needed to pick one language for the course to minimise the amount of code needed to illustrate each step in data science without causing cognitive dissonance in students’ heads (the previous version of the course taught both Python and R, but was much more code-focussed), and chose Python.

Why Python? Basically four things:

  • its ability to deal with really nastily messy data,
  • its great libraries for things outside statistical analysis (e.g. natural language processing, web scraping, content management systems etc)
  • not having to rewrite code when you start working with developers building applications and websites (although it’s possible to call both Python and R from each other using e.g. the rpy2 package, adding extra languages makes the code that much harder to maintain).
  • it’s much easier to learn (and teach) than people think

There have been many debates on R vs Python for data science. “R vs Python” isn’t really the right question to be asking; the question is “what tools will work for me”, and the answer for many data scientists is “both R and Python, and sometimes neither”: you use what works for you, and you use what’s appropriate for the task that you have at hand.

The other tools included are less controversial (!), and more specialised.

  • OpenRefine is a great tool for summarising and cleaning unruly row-column data without coding (and has provenance tracing built in)
  • Tabula is the most popular open-source pdf processing tool (CometDocs works on messier documents, but is online-only)
  • QGIS is a popular open-source GIS (maps etc) visualisation tool; CartoDb (now called Carto) is a beautiful online GIS visualisation tool
  • The GDAL toolkit is great for command-line processing of GIS data
  • Python includes visualisation libraries, but if you really want something special, D3 is a good offline tool (Tableau Public is good online, but as the name implies, your visualisations will be public).
  • Pen and paper are useful for doing mockups and calculations whilst saving your machine’s battery, and are flexible and portable to boot.

That should be enough to get most people started, and has already been field-tested by both myself and various ex-students (guys: I love it when you send me notes about what you’re doing with data, from the field!).

What’s in my own backpack?

(Screenshot: my first Launchpad screen)

It’s only fair for me to open up my Mac and show you what I’ve got hiding on my own machine.  That’s my first Launchpad screen above. It’s the usual suspects: Excel, R, OpenRefine, Tableau, Calculator, with a few other things:

  • SQL tools: MySQL workbench, Postgres.
  • Readers for some common proprietary tool formats: SPSS Smartreader, Pentaho Kettle (“Data Integration”).
  • More data tools: Weka, Gephi, Dato, Trifecta. Weka is the granddaddy of data mining tools, and is still worth having in your ebackpack. Gephi is great for visualising and playing around with graph data.
  • Disk cleaners. OMG can I frag a disk on deployment, which is why I have Ccleaner and Disk Inventory X installed. Because sometimes you just need that extra 2Gb of space.
  • Tools for keeping my code contained: Github desktop, VirtualBox.
  • A Java development environment (IntelliJ) because, contrary to popular opinion, I don’t just write code in Python.

The things you can’t see here include:

  • Mapping tools: OpenStreetMap, LeafletJS, MapBox
  • Visualisation tools: Highcharts
  • Machine learning tools: specialist tools, as and when I need them
  • Ideation tools: freemind
  • Text tools: Acrobat reader, sublime text
  • Losing the minimum amount of stuff if my laptop dies in the humidity tools: Dropbox, Evernote, Google Drive, biggest portable drive I can find (2 of)
  • Giving me textbooks to read on the road tools: iBooks, Safari Books

That’s my ebackpack. I’d be interested to see what other people pack, and if there’s anything useful that I’ve missed from the lists above.

Data Science Ethics [DS4B Session 1e]


This is what I usually refer to as the “Fear of God” section of the course…

Ethics

Most university research projects involving people (aka “human subjects”) have to write and adhere to an ethics statement, within an overarching ethics framework, e.g. “The University has an ethical commitment to minimize the risks to research subjects and to ensure that individuals who participate in research projects conducted under its auspices… do so voluntarily and with an informed understanding of what their involvement will mean”.  Development data scientists are not generally subject to ethics reviews, but that doesn’t mean we shouldn’t also ask ourselves the hard questions about what we’re doing with our work, and the people that it might affect.

At a minimum, if you make data public, you have a responsibility, to the best of your knowledge, skills, and advice, to do no harm to the people connected to that data.  Data science projects can be very powerful (especially if we’ve designed them well), development data science can affect many people, and we need to be mindful of who we’re affecting and the risks we might unintentionally cause them with our work.  With that power comes responsibility, and a sometimes-difficult balance between making data available to people who can do good with it, and protecting that data’s subjects, sources, and managers.

I start by asking these two questions:

  • Could this work increase risk to anyone?
  • How will I respect privacy and security?

Risk

Risk is defined as “The probability of something happening multiplied by the resulting cost or benefit if it does”.  There are three parts to this: cost/benefit (what might happen), probability (how likely that is to happen) and subject (who the risk is to). For example:

  • Risk of: physical, legal, reputational, privacy harm
  • Likelihood (e.g. low, medium, high)
  • Risk to: data subjects, collectors, processors, releasers, users

So, basically, make a list of what bad things might happen, who to, how likely these bad things are, and what you should be doing to prevent or mitigate them, up to and including stopping the project or making sure that anyone at risk is aware of that risk and can consent to be subject to it.  It doesn’t have to be a down-to-the-tiniest-thing detailed list, but you do have to think about who could be harmed by your work and how.
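As a minimal sketch of that list, here’s one way to keep a risk register you can sort and revisit as the project changes; the entries, costs and probabilities are hypothetical.

```python
# A toy risk register; entries and numbers are hypothetical.
risks = [
    {"what": "sources identifiable from released data", "to": "data subjects",
     "cost": 5, "probability": 0.3},
    {"what": "dataset reused without its caveats and metadata", "to": "data users",
     "cost": 3, "probability": 0.5},
]

# highest expected-impact risks first, so they get mitigation attention first
for risk in sorted(risks, key=lambda r: r["cost"] * r["probability"], reverse=True):
    print(risk["cost"] * risk["probability"], risk["what"], "->", risk["to"])
```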

And this isn’t a one-time thing.  New risks can occur as new data becomes available, or the original environment around your project changes: you need to be thinking about potential risks right from the planning phase of your project through to when you’re sharing data or insights (and beyond, if stale data or old results in changed contexts could also cause harm).

Risks can be obvious (e.g. try not to get your sources or subjects targeted by making them easy to find), but they can also be subtle.  Some of the more subtle ones include:

You don’t have to make this list on your own. Groups including the Responsible Data Forum have been doing great work on data risk (and I’ve been lucky to be part of that hivemind too): try reading those groups’ articles to kickstart your thinking about your project’s potential risks.

Privacy and Security

That last risk on the list (accidentally making data subjects easy to identify) is important.  Privacy and security are a complex topic, but at a minimum you should be thinking about PII (personally identifiable information). PII is “any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII.”

When you think about PII risk, think beyond name, address, phone number, social security number. Think about things like these (which can all accidentally release PII):

  • Unique identifiers: Names, addresses, phone numbers etc
  • Locations: lat/long, GIS traces, locality (e.g. home + work as an identifier)
  • Members of small populations (e.g. there may be only a small number people fitting this profile in the given geographical area)
  • Untranslated text (e.g. text that your team can’t read and understand)
  • Codes (e.g. “41”, especially if you don’t have a codebook telling you what they mean)
  • Slang terms
  • Data that can be combined with other datasets to produce PII (aka reidentification)

There is much literature on how easy reidentification can be from very small amounts of data, so you will have to think of it also in terms of risk: how likely is it that someone will want to reidentify, how easy is it for them to do this, and what can you do to make it harder.
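One simple way to start looking for the “members of small populations” problem above is to count how many rows share each combination of quasi-identifiers and worry about the small groups. Here’s a rough sketch assuming pandas, with invented columns and values.

```python
# A rough small-population check; columns and values are invented.
import pandas as pd

df = pd.DataFrame({"district": ["A", "A", "B", "B", "B"],
                   "occupation": ["nurse", "nurse", "nurse", "teacher", "teacher"]})

group_sizes = df.groupby(["district", "occupation"]).size()
print(group_sizes[group_sizes < 5])  # combinations that might single individuals out
```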

But Don’t Panic

Don’t panic. When you first start thinking about risk in data science, the usual reaction is an “OMG I can’t do anything without causing harm” one.  That’s not the right answer either.  Be aware of the risks.  Take a deep breath, and remember that risk is cost times probability, and that if you give them the chance, people will often make a surprising but informed decision about their own risks.

Sometimes, the only answer is to walk away from the project. If it isn’t, and risk is high, consider mitigations like releasing data and results to a smaller group of people (e.g. academics, direct responders, people in your organisation, data subjects) with the caveat that once you release, you no longer have control of that data and have implicitly trusted other people to do the right thing too.  Consider releasing data at a different granularity, e.g. use town/district instead of street, and/or a subset or sample of the data ‘rows’ and ‘columns’. Look in places like the RDF website to see what other people have done as mitigations.

Be aware of ethics. Be as honest and cynical about data and results as you can. And don’t start a development data science project without doing a basic risks check.

[Image by Steve Calcott, licensed under cc-by-nc 2.0]

Writing a problem statement [DS4B Session 1d]


Data work can sometimes seem meaningless.  You go through all the training on cool machine learning techniques, find some cool datasets to play with, run a couple of algorithms on them, and then.  Nothing. That sinking feeling of “well, that was useless”.

I’m not discouraging play. Play is how we learn how to do things, where we can find ideas and connect more deeply to our data.  It can be a really useful part of the “explore the data” part of data science, and there are many useful playful design activities that can help with “ask an interesting question”.   But data preparation and analysis takes time and is full of rabbitholes:  interesting but time-consuming things that aren’t linked to a positive action or change in the world.

One thing that helps a lot is to have a rough plan: something that can guide you as you work through a data science project.  Making a plan for data science has much in common with making plans for Lean and design thinking: you’re putting in effort, so you need to be:

  • focused on the change you want to make in the world, e.g. solve a problem or change people’s minds (there’s no point doing analysis if you don’t do anything with it),
  • pragmatic about the work you need to do (sometimes the answer is a piece of paper, e.g. much simpler than the beautiful concept you had in your mind, but much more likely to be used in the current context) and
  • realistic about the problem you’re trying to solve and the resources you have around to do that (e.g. use what works for you, not an unachievable ‘best’ technology).

I like Max Shron’s CoNVO for planning small data science projects. In my dayjob, I work a lot with Lean Enterprise and Design Thinking techniques to achieve similar results at scale, but at the least you should have an A4 piece of paper somewhere with this on it, to refer back to:

  • Context: who needs this work, and what are they doing it for?
  • Needs: what are you trying to fix?
  • Vision: what do you expect your final result to look like?
  • Outcome: how do you get your results to the people who need them? What happens next?

The other question you’re going to want to ask is “is it worth me doing this work”.  It’s okay to say “yes, because I learn from it”, or “it’s fun”, but data science is work, and you don’t want to feel like you put in a ton of effort and late nights only to realise that effort was wasted.  I like the DrivenData competition guidelines for helping with this thinking:

  • Impact: “… clear win for the organisation in terms of effective planning, resources saved or people served… good story around how they generate social impact…”
  • Challenge: “… challenging enough for a rich competition…”
  • Feasibility: “….the right kind of data to answer the question at hand… does it have enough signal to be useful?…”
  • Privacy: “… can answer this question while protecting the privacy of individuals in the dataset and the operational privacy of an organisation…”

I haven’t yet mentioned the thing that your work will focus on: the “interesting question(s)” that you’re trying to answer.  There are several contexts you might find yourself in here, from a business team bringing you a well-defined question (at which point you start with the “what is the question you’re really trying to answer here” discussion), to having complete freedom over the questions you’re asking and the ways they could be turned into action.

One way to get better at asking good questions is to see what other people ask. Look at your subject area: find other projects and questions in it, and see how they’re asked (and answered). Look at existing data science projects for inspiration, e.g. Kaggle (and their UseCase list), DrivenData, DataKind and the projects listed in the course reading list, then design your questions. Asking questions about your questions can help here, e.g.:

  • Is the question concrete enough? Is it solving a real problem, or just a symptom of that problem? (e.g. “what are the barriers to people engaging with us” vs “how can we get more people to call”)
  • Can you translate the question into an experiment? E.g. can you ask something like “I believe people have more phones than toilets” and start proving (or, more generally, disproving) that? (See the sketch after this list.)
  • Is it actionable? And what actions will be taken given the answer?
  • What data is needed to do the analysis? At this point, datasets could be anything – tables, images, maps, sensor feeds. Be aware that although data access can limit what you can do, data is just a support here, and focusing on the question can help you think about other ways you might be able to answer it.
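As an example of turning a belief into an experiment, here’s a rough sketch of how “I believe people have more phones than toilets” becomes something you can start checking against data. The survey numbers are made up purely for illustration.

```python
# A rough sketch of turning "I believe people have more phones than toilets"
# into something checkable. The survey numbers below are made up for illustration;
# in practice they'd come from a real survey or an existing dataset.
import pandas as pd

survey = pd.DataFrame({
    "household":  list(range(1, 11)),
    "has_phone":  [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    "has_toilet": [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
})

phone_rate = survey["has_phone"].mean()
toilet_rate = survey["has_toilet"].mean()
print(f"Phones: {phone_rate:.0%}, toilets: {toilet_rate:.0%}")

# With real data you'd also want a significance test (e.g. a paired proportion
# test) before claiming the difference is more than noise.
```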

As you do this, you’ll find yourself questioning the meaning of many of the components of your original question (I have a longer blogpost on that). This is where the plan becomes useful: instead of focusing on “what actually counts as a toilet” (yes, that really is a difficult thing to define), go back to your notes about who this is important to, why, and what they could do about it. You’ll also find that a seemingly-simple initial question will generate a whole bunch of other questions you’ll need to answer too (questions are like bunnies: they breed). Again, use your plan as a guide, and accept that there will usually be several parts to each project. You’ll also find that several of these questions could be answered without using available data (e.g. you might be able to get a strong enough ‘signal’ from surveys that an action is worth further investigation): that too is a useful thing to know.

Plans rarely survive contact with your datasets, users etc. They’re not about forcing you to produce things a specific way: they’re there to make you think about what you’re doing, and to stop you from making newbie mistakes like fitting the questions to the data you have available, or falling into data rabbitholes. You might want to go down at least one of those rabbitholes: a fascinating piece of data that you want to explore for fun, or a bunch of other questions that look like really fun things to answer. They’re not necessarily bad things, and can be really valuable in themselves, but you do need to be aware that you’ve done this, and of any impacts it might have on your original goals. Planning might seem a distraction from getting on with the data analysis, but it does help to have a guide star: something to go back to and ask “is this valuable to the people that I’m trying to help here?”.

Some short exercises

We ran 3 small exercises in class, to get people thinking about project design. Each of these was time-limited to 3 minutes, to make people concentrate hard on what might be needed, and to hit issues quickly so they could be discussed in class before students tried this at home.

Exercise 1: Ask some interesting questions. Either your own questions, or pick an existing question and think about how it might have been formed.

  • Questions that data might help with
  • Stories you want to tell with data
  • Datasets you’d like to explore (where ‘datasets’ could be anything – tables, images, maps, sensor feeds, etc)
  • Competition questions: Kaggle, DrivenData
  • A data science project that interested you

Exercise 2: Get the data.  Pick one of your questions:

  • List the ideal data you need to answer it
  • List the data that’s (probably) available

Think about what you’ll do if the data you need isn’t available:

  • What compromises could you make
  • Where would you look for more data
  • Are there proxies (other datasets that tell you something about your question)
  • Are there ways to get more data (surveys, crowdsourcing etc)

Exercise 3: Design your communications. List the types of people you’d want to show your results to.

  • How do you want them to change the world? Can they take actions, can they change opinions, etc.?
  • Describe the types of outputs that might be persuasive to them – visuals, text, numbers, stories, art… be as wild with this as you want

Data Science is a Process [DS4B Session 1c]

Download PDF

People often ask me how they can become a data scientist. To which my answers are usually ‘why’, ‘what do you want to do with it’ and ‘let’s talk about what it really is’.  So let’s talk about what it really is.  There are many definitions of data science, e.g.:

  • “A data scientist… excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge.”
  • “The analysis of data using the scientific method”
  • “A data scientist is an individual, organization or application that performs statistical analysis, data mining and retrieval processes on a large amount of data to identify trends, figures and other relevant information.”

We can spend hours debating which definition is ‘right’, or we could spend those hours looking at what data scientists do in practice, getting some tools and techniques under our belts and finding a definition that works for each one of us personally.

My own working definition of data science is “a process that helps people gain understanding through using data”. So let’s look at some of that process. The scientific method, mentioned above, is a process (O’Neil & Schutt, “Doing Data Science”):

  • Ask a question
  • Do background research
  • Construct a hypothesis
  • Test your hypothesis by doing an experiment
  • Analyse your data and draw a conclusion
  • Communicate your results

Most of science works like this: it’s all about creating explanations that fit our knowledge of the world, then testing those explanations with experiments. I like the cynicism embedded in this: the acknowledgement that everything is a working hypothesis that might turn out to be false, or false in different circumstances (see under Newton/Einstein). Those are all good things, but they don’t quite cover what data scientists do all day.

One of the process descriptions that data scientists use for themselves is the OSEMN process (Obtain-Scrub-Explore-Model-Interpret, pronounced ‘awesome’: your data is safe with data scientists, but not your acronyms…); there’s a skeleton sketch of it in code after this list:

  • Obtain datasets
  • Clean, combine, transform data
  • Explore the data
  • Try models (classification, machine learning etc)
  • Interpret and communicate your results
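As a very rough illustration, the OSEMN steps map naturally onto a small pipeline. This is a skeleton only: the file name, columns and cleaning choices are placeholders, and what actually goes into each step depends entirely on your question and your data.

```python
# The OSEMN steps as a bare skeleton pipeline. The file name and cleaning
# choices are placeholders, not a prescription.
import pandas as pd

def obtain() -> pd.DataFrame:
    return pd.read_csv("some_dataset.csv")        # hypothetical data source

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna().drop_duplicates()          # stand-in for real cleaning

def explore(df: pd.DataFrame) -> None:
    print(df.describe())                          # quick look at what's there

def model(df: pd.DataFrame):
    ...                                           # classification, regression, clustering etc.

def interpret(results) -> None:
    ...                                           # turn results into something your audience can use

if __name__ == "__main__":
    data = scrub(obtain())
    explore(data)
    interpret(model(data))
```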

The OSEMN process is less about experiments, and more about the things that you need to do with data, but it loses what, to me, is the most important part: asking an interesting question. Data science isn’t about data – it’s about people, their problems and questions, and informing, persuading or entertaining them with your results (I’m with Sarah Cohen when she says “every good story starts with an idea, a question or an observation”, and with anyone who says that a visualization isn’t always the answer). So the process for these sessions, the thing we’ll be working through slowly, is:

  • Ask an interesting question
  • Get the data
  • Explore the data
  • Model the data
  • Communicate and visualize your results

with a healthy dose of cynicism, e.g. sanity-checking your results in the context they’re relevant to. That’s especially important in a development data context, where data is hard to come by, may be erroneous, may miss geographical areas or demographics, and may be older than it looks.

Note that none of the quotes say “enormous amounts of data”.  We’ll touch on big data in a later session (session 10), but most development data scientists work with small datasets, and that’s nothing to be ashamed of: I’d rather have relevant, information-rich datasets than huge amounts of data that tells me almost nothing.

That process again

  • Ask an interesting question. Write hypotheses that can be explored (Do people have more phones than toilets? How is Ebola spreading? Is using wood fires sustainable in rural Tanzania? Can we feed 9 billion people?). Make them simple, actionable and incremental (e.g. so you can test different parts of the question separately).
  • Get the data. There are many different data sources (e.g. datafiles, databases, APIs, text, maps, images, social media, people). Some of them are harder to get information out of than others, but they all contain data. Which means you’ll often be extracting datasets from those sources (e.g. 80-page PDFs), and cleaning them. By cleaning, I mean getting the data into a shape that can be used by algorithms: dealing with file formats (PDFs!), badly-specified locations, human errors and differences between standards (e.g. “Tanzania” vs “Republic of Tanzania”); there’s a small cleaning-and-eyeballing sketch after this list. Although cleaning takes a lot of time, it’s also time spent getting to know your datasets: what’s in them, what’s missing, what’s strange, what potentially got lost in translation. Which leads us to:
  • Explore the data. Once you’ve got the dataset in machine-readable form, you can start looking for more issues to deal with (e.g. these issues with different placename standards in Tanzania) and, eventually, for potentially interesting patterns. Eyeballing (looking at) your data is usually a good place to start; often you have to take a subset of the whole dataset to do this, but it’s usually worth it to get a better feel for what’s contained in each dataset.  Doing quick but ugly visualisations of your data is also a good thing to do.
  • Model the data. Modelling is where we look for patterns and insights hidden in the data.  It’s where machine learning comes in. We’ll look later at how to learn relationships between numbers, categories and graphs.
  • Communicate and visualize your results. We want to get this effect: “I already knew that increased incarceration didn’t lower crime, but I wasn’t sure of the statistics. To see it on the graphs is really eye opening.” (Pandey et al, The Persuasive Power of Data Visualisation), using whatever’s appropriate (which might or might not be visualisations).
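Here’s a small sketch of what the “get the data” and “explore the data” steps can look like in practice, using pandas. The file and column names are invented, but the name-standardising, eyeballing and quick-but-ugly plotting are exactly the kind of things I mean above.

```python
# A small sketch of the "get" and "explore" steps in practice: standardising
# country names, eyeballing the data, and a quick-but-ugly plot. The file and
# column names here are invented.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("phone_ownership.csv")

# Cleaning: different datasets name the same country in different ways
df["country"] = df["country"].replace({
    "Republic of Tanzania": "Tanzania",
    "Tanzania, United Republic of": "Tanzania",
})

# Eyeballing: look at a slice of the data and at what's missing
print(df.head(10))
print(df.isna().sum())

# Quick-but-ugly visualisation, just to get a feel for the data
df.groupby("country")["phones_per_household"].mean().plot(kind="bar")
plt.show()
```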

What we’re aiming at is simple: “ask good questions, tell good stories” – if you can do this with data, you’ve won. 

Data Scientists

Data scientists are the mysterious beasts who do data science. The data science Venn diagram basically says that to be a data scientist, you need to know statistics, know your business area and be able to code. But it’s not quite like that. Although it’s heresy to say this, many good data scientists don’t code at all, and you can be useful on a data science team without knowing everything, e.g. by building insightful visualisations without using statistics, or by specifying a data science problem well enough for hardcore machine learning specialists to develop good algorithms for it.

That said, it does help a development data scientist to have expertise in development and statistics (these sessions were originally designed for people who had those skills: a statistics session is already in the pipeline…). Being familiar with data science techniques, and having the coding skills to get, clean and explore data, will help you even if you never want to do anything more than create, and be the ‘client’ for, a data science problem specification.

At this point, you might be asking two questions:

  • How do you become a data scientist?, and
  • Should you become a data scientist?

You become a data scientist through learning and practice (that never stops: I’m still working on it myself).  Yes, you need to learn a bunch of theory, but there’s nothing like learning data science by doing it: you’ll handle issues you didn’t know existed, and learn many details about techniques by using them on non-sanitised (uncleaned) data.  Good places to practice exploring and modelling data include:

  • Kaggle – online data science competitions
  • Driven Data – social good data science competitions
  • Innocentive – some data science challenges
  • CrowdAnalytix – business data science competitions
  • TunedIt – scientific/industrial data science challenges

Good places to practice asking good questions, getting data, communication and visualizing results include:

  • Your own projects
  • Data science for good groups (e.g. DataKind)

The answer to “should you become a data scientist” is “not necessarily”. There are lots of data science students desperate for good problems to work on, so you might want to become someone who can work with data scientists, which means learning how to specify data problems well. One of the places to see the work of people who can specify problems (“problem owners”) is the competition sites listed above. The problem owner doesn’t have to do the data modelling or machine learning themselves, but they do need to be able to specify a problem well, find and clean data related to that problem so that competitors can access it easily (and all have the same starting dataset), and specify how the competition results will be marked (e.g. by accuracy on an unseen ‘test’ dataset). Go look at some of the problems listed on these sites, and think about how you would have done this yourself.
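To make the problem-owner role concrete, here’s a sketch of one way a competition might be set up and marked: hold back an unseen test set, release the rest to competitors, and score submitted predictions against the held-back labels. The file and column names are hypothetical.

```python
# A sketch of one way a problem owner might set up and mark a competition:
# hold back an unseen test set, give competitors the rest, and score submitted
# predictions against the held-back labels. File and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

labelled = pd.read_csv("labelled_waterpoints.csv")             # hypothetical labelled data
train, test = train_test_split(labelled, test_size=0.3, random_state=42)

train.to_csv("competition_train.csv", index=False)             # released to competitors
test.drop(columns=["status"]).to_csv("competition_test.csv", index=False)  # labels withheld

def score_submission(submission_csv: str) -> float:
    """Accuracy of a competitor's predicted 'status' column against the held-back labels."""
    predicted = pd.read_csv(submission_csv)["status"]
    return accuracy_score(test["status"].reset_index(drop=True), predicted)
```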

(session 1 slideset is here; cover image is from the Pump it Up challenge)