Practical influence operations


Sofwerx are the US Special Operations Command’s open source innovations unit – it’s where scarily fit (but nice) military people, contractors, academics and people with green hair, piercings, dodgy pasts and fiercely bright creative minds come together to help solve wicked problems that need extremely unconventional but still disciplined thinking (don’t ever make the mistake of thinking hackers are just chaotic; given a good problem, they are very very focussed and determined). This week, Sofwerx gathered some experts together to tell other parts of the US government about Weaponized Information.

I was one of the experts. I’m low on purple hair and piercings, but I do have about 30 years’ practice in the defense innovation community, am lucky enough to speak at Black Hat and Defcon about things like how algorithms control humans and the ethics of adversarial machine learning, and I really get interested in things.  Like how autonomous vehicles and people team together. Or how agencies like FEMA and the UN use data in disasters. Or, for the past couple of years, how to counter attacks on human belief systems.  Here are some notes from my talk whilst I wait for the amazing media team to make me look elegant in the video (please please!).

I spoke at the end of the day, so my talk is rightly one of a pair of talks, with David Perlman and myself bookending the day. After a military introduction (waves to Pablo), David opened the day by describing the landscape of misinformation and other large-scale, internet-enabled influence operations; expert talks during the day built out from that, explaining lessons we can learn from earlier operations against Jihadis (Scot Terban), deep dives into specific technologies of interest (Matthew Sorrell and Irene Amerini on countering deepfakes and other multimedia forensics), then me pulling those back together with a talk setting out a framework from which we (and by we I meant the people in front of us in the room plus a discipline created from the skills of a very specific set of experts) could start to respond to the problems, before passing it back to the military in the form of Keith Dear from the RAF. 

So. Lots of people talk about the problem of misinformation, election hacking, influence operations, Russia, the Internet Research Agency blah blah blah.  Very few of them talk about potential solutions, including difficult or unconventional solutions.  The analogy I used at the start was from my days playing rugby union for my university.  Sometimes we would play another side, and the two sides would be completely matched in power and ballskill and speed, but one of them just didn’t understand the tactics of the game. And the other side would score again and again to the point of embarrassment, because they knew the field and had gameplay and their opponents didn’t.  And that is what recent history has felt like to me.  If you’re going to talk about solutions, and about handling misinformation as a “someone has to do this, and someone is going to have to do this forever, because this is just like spam and isn’t going to go away” thing, you’re going to need processes, you’re going to need your own gameplay, and you’re going to need to understand the other side’s gameplay so you can get inside and disrupt it. Play the game. Don’t just stand on the field.


So first, offense. Usually I talk a lot about this, because I’ve spent a lot of time in the past two years raising the alarm with different communities, and asking them to frame the problem in terms of actors with intents, targets, artefacts and potential vulnerabilities.  This time I skimmed over it – mentioned the Internet Research Agency in Russia as the obvious biggest player in this game, but that despite their size they were playing a relatively unsubtle, unsophisticated game, and that more interesting to me were the more subtle tests and attacks that might also be happening whilst we were watching them. I defined misinformation as deliberately false information with an objective that’s often money or geopolitical gain, and that ranges from the strange (“Putin with aliens!”) to the individually dangerous (“muslim rape gangs in Toronto”).  I also pushed back on the idea that influence operations aren’t the same as social engineering; to me, influence operations are social engineering at scale, and if we use the SE definition of “psychological manipulation of people into performing actions or divulging confidential information”, we are still talking about action, but those actions are often either in aggregate or at a remove from the original target (e.g. a population is targeted to force a politician to take action), with the scale being sometimes millions of people (Russian-owned Facebook groups in the 2016 Congress investigation had shares and interactions in the tens of millions, although we do have to allow for botnet activities there).

Scale is important when we talk about impacts: these can range from individual (people caught up in opposing-group demonstrations deliberately created at the same place and time), to community (disaster responses and resources being diverted around imaginary roadblocks, e.g. fake “bridge out” messaging), to nationstate (the “meme war” organizing pages that we saw with QAnon and related groups’ branding for the US, Canada and other nations in the past year).

Targeting is scaled too: every speaker mentioned human cognitive biases; although I have my favorite biases like familiarity backfire (if you repeat a message with a negative in it, humans remember the message but not the negative) there are hundreds of other biases that can be used as a human attack surface online (the cognitive bias codex lists about 180 of them). There’s sideways scale: many efforts focus on single platforms, but misinformation is now everywhere there’s user-generated content: social media sites like facebook, twitter, reddit, eventbrite, but also comment streams, payment sites, event sites: anywhere you can leave a message, comment, image, video, content that another human can sense.  Influence operations aren’t new, but social media buys reach and scale: you can buy 1000 easy-to-find bots for a few dollars or 100 very hard to detect Twitter or Facebook ‘aged users’ for $150; less if you know where to look.  There are plenty of botnet setup guides online; a large cheap set can do a lot of damage very quickly, and you can play a longer, more subtle online game by adding a little pattern matching or AI to a smaller aged set.

Actors and motivations pretty much divide into: state/nonstate actors who are doing this for geopolitical gain (creating discord or swaying opinion on a specific topic), entrepreneurs doing it for money (usually driving people to view their websites and making money from advertising on them), grassroots groups doing it for fun (e.g. to create chaos as a new form of vandalism) and private influencers for either attention (the sharks on the subways) or, sometimes, money.  This isn’t always a clean-cut landscape: American individual influencers have been known to create content that is cut-and-pasted onto entrepreneurs’ websites (most, but increasingly not all, entrepreneurs don’t have English as their first language, and the US is a large market); that messaging is often also useful to the state actors (especially if their goal is in-country division) and attractive to grassroots groups.  This is a huge snurfball, and people like Ben Nimmo do great work unravelling some of its linkages.


One of the most insightful comments I got at a talk was “isn’t this just like spam? Won’t they just make it go away the same way?”.  I didn’t appreciate it at the time, and my first thought was “but we’re the ‘they’, dammit”, but IMHO there are some good correlates here, and that one question got me thinking about whether we could treat misinformation the same way we treat other unwanted internet content like spam and ddos attacks.

I’ve looked at a lot of disciplines, architectures and frameworks (“lean/agile misinformation”, anyone?) and the ones that look closest to what we need come from information security.  One of these is the Gartner cycle: deceptively simple with its prevent-detect-respond-predict.  The good news is that we can take these existing frameworks and fit our problem to them, to see if there are areas that we’ve missed or need to strengthen in some way.   The other good news is that this approach works well.  The bad news is that if you fit existing misinformation defense work to the Gartner cycle, we’ve got quite a lot of detect work going on, a small bit of prevent, almost no effective respond, and nothing of note on predict except some special exceptions (chapeau! again to Macron’s election team for the wonderful con-job you pulled on your attackers).

Looking at “detect”: one big weakness of an influence operation is that the influencer has to be visible in some way (although the smart ones find ways to pop up and remove messages quickly, and target small enough to be difficult to detect) – they leave “artefacts”, traces of their activity.  There are groups and sites dedicated to detecting and tracking online botnets, which are useful places to look up any ‘user’ behaving suspiciously.  The artefacts they use tend to split into content and context artefacts.  Content artefacts are things within a message or profile: known hashtags (e.g. #qanon), text that correlates with known bots, image artefacts in deepfake videos, known fake news URLs, known fake stories.  Stories are interesting because sites like Snopes already exist to track at the story level, and groups like Buzzfeed and FEMA have started listing known fake stories during special events like natural disasters.  But determining whether something is misinformation from content alone can be difficult – the Credibility Coalition and W3C credibility standards I’ve been helping with also include context-based artefacts: whether users are connected to known botnets, trolls or previous rumors (akin to the intelligence system of rating both the content and the carrier), their follower and retweet/likes patterns, and metadata like advertising tags and DNS.  One promising avenue, as always, is to follow the money, in this case advertising dollars; this is promising both in tracking misinformation and also in its potential to disrupt it.
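As a toy illustration of content-artefact detection (the hashtag and URL lists below are entirely invented placeholders; real systems would pull these from shared indicator feeds and story trackers like the ones above), the simplest possible matcher is just set membership:

```python
# Toy sketch of content-artefact matching. The indicator lists here are
# made-up placeholders, not real feeds.
KNOWN_HASHTAGS = {'#examplehoax', '#fakebridgeout'}
KNOWN_FAKE_URLS = {'totally-real-news.example.com'}

def content_flags(post_text):
    """Return the known content artefacts that appear in a post."""
    text = post_text.lower()
    flags = [w for w in text.split() if w in KNOWN_HASHTAGS]
    flags += [u for u in KNOWN_FAKE_URLS if u in text]
    return flags
```

Context artefacts (follower patterns, botnet connections, ad tags) need far more data than this, which is part of why content-only classification is the easy half of the problem.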

There are different levels of “respond”, ranging from individual actions to community, platform and nationstates.  Individuals can report user behaviors to social media platforms; this has been problematic so far, for reasons discussed in earlier talks (basically platform hesitation at accidentally removing user accounts).  Individuals can also report brands advertising on “fake news” sites to advertisers through pressure groups like Sleeping Giants, who have been effective in communicating the risk from this to the brands.  Individuals have tools that they can use to collaboratively block specific account types (e.g. new accounts, accounts with few followers): all of these individual behaviors could be scaled.  Platforms have options: they do remove non-human traffic (the polite term for “botnets and other creepy online things”) and make trolls less visible to other users; ad exchanges do remove non-human traffic (because of a related problem, click fraud – bots don’t buy from brands) and problematic pages from their listings.

Some communities actively respond.  One of my favorites is the Lithuanian ‘Elves’: an anonymous online group who fight Russian misinformation online, apparently successfully, with a combination of humor and facts.  This has also been promising in small-scale trials in Panama and the US during disasters (full disclosure: I ran one of those tests).  One of the geopolitical aims of influence operations that was mentioned by several other speakers was to widen political divides in a country.  A community that’s been very active in countering that is the peace technology community, and specifically the Commons Project, which used techniques developed across divides including Israel-Palestine and Cyprus with a combination of bots and humans to rebuild human connections across damaged political divides.

On a smaller scale, things that have been tried in the past years include parody-based counter-campaigns, SEO hacks to place disambiguation sites above misinformation sites in search results, overwhelming (“dogpiling onto”) misinformation hashtags with unrelated content, diverting misinformation hashtag followers with spoof messages, misspelt addresses and users names (‘typosquatting’), and identifying and engaging with affected individuals.  I remain awed by my co-conspirator Tim who is a master at this. 

All the above has been tactical because that’s where we are right now, but there are also strategic things going on. Initiatives to inoculate and educate people about misinformation exist, and the long work of bringing it into the light continues in many places.


I covered offense and defense, but that’s never the whole of a game: for instance, in yet another of my interests, MLsec (the application of machine learning to information security), the community divides its work into using machine learning to attack, using it to defend, and attacking the machine learning itself.

Right now the game is changing, and this is why I’m emphasizing frameworks.  This time also feels to me like the moment Cliff Stoll writes about in The Cuckoo’s Egg, when one man investigating an information security incursion, a “hack”, happening through his computers slowly finds other people across the government who are recognizing the problem too, before that small group grows into the huge industry we see today.

We need frameworks because the attacks are adapting quickly, and it’s going to get worse because of advances in areas like MLsec: we’re creating adaptive, machine-learning-driven attacks that learn to evade machine-learning-driven detectors, and rapidly heading from artefact-based to behavior-based to intent-based discussions.  Things already happening or likely to happen next include hybrid attacks where attackers combine algorithms and humans to evade and attack a combination of algorithms (e.g. detectors, popularity metrics etc) and humans; a current shift from obvious trolls and botnets to infiltrating and weaponizing existing human communities (mass-scale “useful idiots”); and attacks across multiple channels at the same time, masked with techniques like pop-up and low-and-slow messaging.  This is where we are: this is becoming an established part of hybrid warfare that needs to be considered not as war, but certainly on a similar level to, say, turning up in part of Colombia with some money and a gunboat pointed at the railway station and accidentally creating a new country from a territory you’d quite like to build a canal in (Panama).  Also of note is what happens if the countries currently attacking the US make the geopolitical and personal gains they require, stop their current campaigns and leave several hundred highly-trained influence operators without a salary.  Generally what happens in those situations is that an industry forms around commercial targets: some of this has already happened, but those numbers could be interesting, and not in a good way.

One framework isn’t enough to cover this. The SANS sliding scale of security describes, from left to right, the work needed to secure a system: from architecting that system to be secure, through passively defending it against threats, actively responding to attacks and producing intelligence, all the way to “legal countermeasures and self-defense against an adversary”.  We have some of the architecture work done.  Some of the passive defence. Lots of intelligence.  There’s potential for defense here.  There’s going to need to be strategic and tactical collaboration, and by that I mean practical things like nobody quite knows what to call the state we’re in: it’s not war but it is a form of landgrab (later in the day I whispered “are we the Indians?” to a co-speaker, meaning this must have been what it felt like to be a powerful leader watching the settlers say “nice country, we’ll take it”), possibly politics with the addition of other means, and without that definition it’s really hard to regulate what is and isn’t allowed to happen (also perhaps important: it seems that only the military have limits on themselves in this space).  With cross-platform subtle attacks, collaboration and information sharing will be crucial, so trusted third-party exchanges matter.  Sharing of offensive techniques, tactics and processes matters too, so a misinformation version of the ATT&CK framework for now (I tried fitting it to the end of the existing framework and it just doesn’t fit – the shape is good but there are adjustments needed), with a SANS Top 20 later (because we’re already seeing the same attack vectors repeating, misinformation versions of script kiddies etc etc).  There’s a defense correlate to the algorithms + humans comment on offense above: we will most likely need a hybrid response of algorithms plus humans countering attacks by algorithms plus humans.  We will need to think the unthinkable, even if we immediately reject it (“Great Wall Of America”, nah).
And we really need to talk about what offense would look like: and I don’t mean that in a kinetic sense, I mean what are valid self-defense actions.

I ended my presentation with a brief glimpse at what I’m working on right now, and a plea for the audience.  I’m working half my time helping to build the Global Disinformation Index, an independent disinformation rating system, and the rest researching areas that interest me, which right now is that misinformation equivalent to the ATT&CK techniques, tactics and procedures framework. My plea for the audience was to please not fight the last war here.

Bodacea Light Industries LLC

I have a consulting company now.  It’s not something I meant to do, and I’ve learned something important from it: creating a company for its own sake is a lot less likely to succeed than creating a company because it helps you do something else that you really wanted to do.

In my case, that’s to work full-time on countering automated and semi-automated influence operations, whether that’s through a 20-hour-a-week contract helping to create the Global Disinformation Index as part of the forever pushback on misinformation-based fraud (“fake news” sites etc), or working on practical solutions, e.g. writing a how-to book and working on infosec-style architectural frameworks for misinformation responses, so, as I put it in a talk yesterday, we can “actually play the game, instead of standing on the field wondering what’s going on whilst the other team is running round us with gameplays and rulebooks”.

I still have much paperwork and website filling to go before I ‘hard launch’ BLightI, as I’ve started affectionately calling the company (and buying the namespace for, before you get any ideas, fellow hackers…).  I also have quite a lot of work to do (more soon). In the meantime, if you’re interested in what it and I are doing, watch this space and @bodaceacat for updates.

Security frameworks for misinformation

Someone over in the AI Village (one of the MLsec communities – check them out) asked about frameworks for testing misinformation attacks.  Whilst the original question was perhaps about how to simulate and test attacks – and more on how we did that later – one thing I’ve thought about a lot over the past year is how misinformation could fit into a ‘standard’ infosec response framework (this comes from a different thought, namely who the heck is going to do all the work of defending against the misinformation waves that are now with us forever).

I digress. I’m using some of my new free time to read up on security frameworks, and I’ve been enjoying Travis Smith’s Leveraging MITRE ATT&CK video.  Let’s see how some of that maps to misinfo.

First, the Gartner adaptive security architecture.

The Four Stages of an Adaptive Security Architecture

It’s a cycle (OODA! intelligence cycle! etc!) but the main point of the Gartner article is that security is now continuous rather than incident-based. That matches well with what I’m seeing in the MLsec community (that attackers are automating, and those automations are adaptive, e.g. capable of change in real time) and with what I believe will happen next in misinformation: a shift from crude human-created, templated incidents to machine-generated, adaptive, continuous attacks.

The four stages in this cycle seem sound for misinformation attacks,  but we would need to change the objects under consideration (e.g. systems might need to change to communities) and the details of actions under headings like “contain incidents”.  9/10 I’d post-it this one.

Then the SANS sliding scale of cyber security


AKA “Yes, you want to hit them, but you’ll get a better return from sorting out your basics”. This feels related to the Gartner cycle in that the left side is very much about prevention, and the right about response and prediction. As with most infosec, a lot of this is about what we’re protecting from what, when, why and how.

With my misinformation hat on, architecture seems to be key here. I know that we have to do the hard work of protecting our base systems: educating and inoculating populations (basically patching the humans), designing platforms to be harder to infect with bots and/or botlike behaviours. SANS talks about compliance, but for misinformation there’s nothing (yet) to be compliant against. I think we need to fix that. Non-human traffic pushing misinformation is an easy win here: nobody loves an undeclared bot, especially if it’s trying to scalp your grandmother.

For passive defence, we need to keep up with non-human traffic evading our detectors, and have work still to do on semi-automated misinformation classifiers and propagation detection. Misinformation firewalls are drastic but interesting: I could argue that the Great Firewall (of China) is a misinformation firewall, and perhaps that becomes an architectural decision for some subsystems too.

Active defence is where the humans come in, working on the edge cases that the classifiers can’t determine (there will most likely always be a human in the system somewhere, on each side), and hunting for subtly crafted attacks. I also like the idea of misinformation canaries (we had some of these from the attacker side, to show which types of unit were being shut down, but they could be useful on the defence side too).

Intelligence is where most misinformation researchers (on the blue team side) have been this past year: reverse-engineering attacks, looking at tactics, techniques etc.  Here’s where we need repositories, sharing and exchanges of information like misinfocon and credco.

And offense is the push back – everything from taking down fake news sites and removing related advertisers to the types of creative manoeuvres exemplified by the Macron election team.

To me, Gartner is tactical, SANS is more strategic.  Which is another random thought I’d been kicking around recently: that if we look at actors and intent, looking at the strategic, tactical and execution levels of misinformation attacks can also give us a useful way to bucket them together.  But I digress: let’s get back to following Travis, who looks next at what needs to be secured.

The CIS Controls V7 matrix

For infosec, there’s the SANS Top 20, aka CIS controls (above). This interests me because it’s designed at an execution level for systems with boundaries that can be defended, and part of our misinformation problem is that we haven’t really thought hard about what our boundaries are – and if we have thought about them, have trouble deciding where they should be and how they could be implemented. It’s a useful exercise to find misinformation equivalents though, because you can’t defend what you can’t define (“you know it when you see it” isn’t a useful trust&safety guideline).  More soon on this.

Compliance frameworks aren’t likely to help us here: whilst HIPAA is useful to infosec, it’s not really forcing anyone to tell the truth online. Although I like the idea of writing hardening guides for adtech exchanges, search, cloud providers and other (unwitting) enablers of “fake news” sites and large-scale misinformation attacks, this isn’t a hardware problem (unless someone gets really creative repurposing unused bitcoin miners).

So there are still a couple of frameworks to go (and that’s not including the 3-dimensional framework I saw and liked at Hack Manhattan), which I’ll skim through for completeness, more to see if they spark any “ah, we would have missed that” thoughts.


Lockheed-Martin cyber kill chain: yeah, this basically just says “you’ve got a problem”. Nothing in here for us; moving on.


And finally, MITRE’s ATT&CK framework, or Adversarial Tactics, Techniques & Common Knowledge. To quote Travis, it’s a list of the 11 most common tactics used against systems, and 100s of techniques used at each phase of their kill chains. Misinformation science is moving quickly, but not that quickly, and I’ve watched the same types of attacks (indeed the same text even) be reused against different communities at different times for different ends.  We have lists from watching since 2010, and pre-internet related work from before that: either extending ATT&CK (I started on that a while ago, but had issues making it fit) or a separate repository of attack types is starting to make sense.

A final note on how we characterise attacks. I think the infosec lists on characterising attackers make sense for misinformation: persistence, privilege escalation, defence evasion, credential access, discovery, lateral movement, execution, collection, exfiltration, command and control are all just as valid if we’re talking about communities instead of hardware and systems. Travis’s notes characterising response also make sense for us too: we also need to gather information each time on what is affected, what to gather, how the technique was used by an adversary, how to prevent a technique from being exploited, and how to detect the attack.


Not just America

I’ve been reading lately – my current book is Benkler et al’s Network Propaganda, and in its Origins chapter, I was reading about 19th century vs 20th century American political styles and how they apply today, and caught myself thinking “am I studying America too much? Misinformation is hitting most countries around the world now – am I biased because I’m here?”.

I think that’s a valid question to ask.  Despite there being good work in many places around the world (looking at you, Lithuania and France!), much has been made of the 2016 US elections and beyond; many of the studies I’ve seen and work I’ve been involved in have a US bias, funding or teams.  And are we making our responses vulnerable because of that?

I think not.  And I think not because we’ve all been touched by American culture.  When we talk about globalisation, we’re usually talking about American companies, American movies and styles and brands, and much Internet culture has followed that too (which country doesn’t bother with a country suffix?) And just as we see Snickers and Disney all over the world, so too have our cultures been changed by things like game shows.

making a simple map (15 minute project)

It’s Portland’s tech crawl tomorrow, where we all visit each others’ offices, admire the toys (reminder to self: must put some more air in the 7′ lobster), drink each others’ beers and try to persuade each others’ techs to work for somewhere with cooler toys.  My office is on the crawl, but rather than be the token female data scientist in there, I’m going to go out visiting.
I started by grabbing and formatting some data.
There’s no map of the crawl (yet), and I needed to know how far I’d be walking (I’m still recovering from my time in a wheelchair). There is a list of companies on the route, so I cut and pasted that list into a text file that I imaginatively named techcrawl_raw.txt.
Then I applied a little Python:
import pandas as pd

fin = open('techcrawl_raw.txt', 'r')
txt = fin.read()
df = pd.DataFrame([x.strip()[:-1].split(' (') for x in list(set(txt.split('\r')) - set(['']))],
                  columns=['name', 'url'])
df['address'] = df['name'] + ', portland, oregon'
df.to_csv('tech_crawl.csv', index=False)
This takes the contents of the text file (one company name per line, with its URL in brackets after it), creates an array of [name, url] pairs (NB the order changed because I used set() to remove any duplicate entries), converts that to a pandas dataframe, adds a new column with the company name plus “, portland, oregon”, then dumps the dataframe to a csv file.
The code that creates the array is fragile (just one added space could break it), but it works for a one-shot thing.
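A slightly sturdier version of that parse would use a regex instead of split(), so stray spaces don’t break it (the sample lines below are invented, but follow the same “Company Name (url)” format as the real list):

```python
import re

# Invented sample lines in the same "Company Name (url)" format as techcrawl_raw.txt
raw = "Acme Analytics (acme.example.com)\nBig Lobster Labs (lobster.example.com)\n"

# Tolerates extra whitespace around the name, brackets and URL,
# unlike the strip()/split() one-liner above
pattern = re.compile(r'^(?P<name>.+?)\s*\(\s*(?P<url>[^)]+?)\s*\)\s*$')

rows = [m.groupdict()
        for m in (pattern.match(line.strip()) for line in raw.splitlines())
        if m]
# rows is now a list of {'name': ..., 'url': ...} dicts, ready for pd.DataFrame(rows)
```

Lines that don’t match the pattern are silently skipped rather than crashing the run, which is usually what you want for cut-and-pasted data.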
And then I made a map.
Go to Google Maps, click on the hamburger menu then “your places”, “maps”, “create map” (that’s at the bottom of the page) then “import”.   Select the csv (e.g. tech_crawl.csv), click on “address” when it says “choose columns to position your placemarks” and “name” for “column to title your markers”.
You’ve now got a map, because Google used its gazetteer to look up all the addresses you gave it.  It won’t be perfect: I got “2 rows couldn’t be shown on the map” – you can click on “open data table” and go edit the “address” field til Google finds where they are.
The finished map is a little fancier than that basic map.  To fancy it up, I clicked on “untitled map” and gave the map a name; I hovered over “all items” until a little paintpot appeared, clicked on that and chose a colour (green) and icon (wineglass).  I also clicked on “add layer” and added a layer called “important things”, used the search bar to find the start and afterparty locations, then clicked on the icons that appeared on the map, then “add to map”, and used the paintpot to customise those too.  And that was it.  One simple map, about 15 minutes, most of which was spent creating the CSV file.  And a drinking, er, visiting route that I can walk without exhausting myself.
Making your own maps for fun is fun, but there are other more serious maps you can help with too, like the Humanitarian OpenStreetMap maps of disaster areas – if you’re interested, the current task list is at

Squirrel! Or, learning to love nonlinearity

I write a lot.  I write posts, and notes, comments on other people’s work and long emails to friends about things that interest me (and hopefully them too).  But I don’t write enough here.  And part of that is the perception of writing as a perfect thing, as a contiguous thread of thought, of a blog as a “themed” thing written for an audience.

So I stopped writing here. I do that sometimes. Because the things that interest me vary, and aren’t always serious, or aren’t part of the current ‘theme’ (which is currently misinformation and how people think).  Or I don’t have enough time, and leave half-written notes in my ‘drafts’ folder waiting to be turned into ‘good’ content.

But that seems a little grandiose. I’m assuming that people read this blog, that they’re looking for a specific thing, for something polished, a product.  And that it’s my job to provide that.  And that leads to the above, to stasis, to me not publishing anything.

So for now, I’m just going to write about what interests me, about the projects I’m working on, the thoughts that I spin out into larger things.  The serious stuff will be on my medium,

Cleaning a borked Jupyter notebook

It’s a small simple thing, but this might save me (or you) an hour some day. One of my Jupyter notebooks got corrupted – it has some less-than-friendly tweet data in it that not only stopped the notebook from loading, but also crashed my Jupyter instance when I tried.  I would normally just fix this in the terminal window, but thought it might be nice to share.  If you’re already familiar with the backend of Jupyter and json file formats, feel free to skip to the next post.  And if you’ve borked a Jupyter file but can still open it, then “clear all” might be a better solution.  Otherwise…

Jupyter notebooks are pretty things, but behind the scenes they’re just a JSON data object.   Here’s how to look at that, using Python (in another Jupyter notebook: yay, recursion!):

import json
raw = open('examine_cutdown_tweets.ipynb', 'r').read()
raw[:200]  # peek at the start of the file

That just read the Jupyter notebook file in as text.  See those curly brackets: it’s formatted as JSON.   If we want to clean the notebook up, we need to read this data in as JSON. Here’s how (with a quick look at its metadata):

jin = json.loads(raw)
(jin['metadata'], jin['nbformat'], jin['nbformat_minor'])

Each section of the notebook is in one of the “cells”. Let’s have a look at what’s in the first one:

jin['cells'][0]

The output from the cell is in “outputs”. This is where the pesky killing-my-notebook output is hiding. The first of these looks like:

jin['cells'][0]['outputs'][0]

At which point, if I delete all the outputs (or the one that I *know* is causing problems), I should be able to read in my notebook and look at the code and comments in it again.

for i in range(len(jin['cells'])):
    jin['cells'][i]['outputs'] = []

And write out the cleaned-up notebook:

jout = json.dumps(jin)
open('tmp_notebook.ipynb', 'w').write(jout)

Finally, the code that does the cleaning (and just that code) is:

import json
raw = open('mynotebook.ipynb', 'r').read()
jin = json.loads(raw)
for i in range(len(jin['cells'])):
    jin['cells'][i]['outputs'] = []
open('tmp_mynotebook.ipynb', 'w').write(json.dumps(jin))
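For reuse, the same steps wrap neatly into a small function. This is a sketch rather than anything official: the filenames are placeholders, and resetting `execution_count` is my own addition (it just keeps the cleaned notebook tidy):

```python
import json

def strip_outputs(in_path, out_path):
    """Read a notebook file as JSON, blank every code cell's outputs,
    and write the cleaned copy to a new file."""
    with open(in_path, 'r') as f:
        jin = json.load(f)
    for cell in jin['cells']:
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None  # optional tidy-up, my addition
    with open(out_path, 'w') as f:
        json.dump(jin, f)
```

Writing to a new file rather than overwriting means you still have the original if something goes wrong.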

Good luck!

The ‘citizens’ have power. They can help.

[cross-post from medium]

This is a confusing time in a confusing place. I’ve struggled with concepts like allyship from within, of whether I sit in a country on the verge of collapse or renewal, on how I might make some small positive difference in a place that could take the world down with it. And, y’know, also getting on with life, because even in the midst of chaos, we still eat and sleep and continue to do all the human things. But I’m starting to understand things again.

I often say that the units of this country are companies, not people: that democratic power here rests a lot in the hands of organizations (and probably will until Citizens United is overturned, people find their collective strength and politics comes back from the pay-to-play that it’s become in the last decade or two). But since we’re here, I’ve started thinking about how that might itself become an advantage. We’ve already seen a hint of it with StormFront being turned away by the big hosts. What else could be done here? Where else are the levers?

One thing is that a relatively small number of people are in a position to define what’s socially acceptable: either by removing the unacceptable (e.g. StormFront) or making it harder to find or easier to counter. And for me, this is not about ensuring a small group of nazis don’t coordinate and share stories: that’s going to happen anyway, whether we like it or not. It’s more about reducing their access to, and effect on, our grandmothers: on people who might see a well-produced article or presidential statement and not realize that it doesn’t reflect reality (we can talk all we want about ‘truth’, and I have been known to do that, but some subjective assessments are just way past the bounds of uncertainty).

Removing the unacceptable is a pretty nuclear option, and one that is hard to do cleanly (although chapeau to the folks who say ‘no’). Making things harder to find and easier to counter — that should be doable. Like, tell me again how search and ranking works? Or rather, don’t (yes, I get eigenvector centrality and support, and enjoy Query Understanding) — it’s more important that the people working on search and ranking are connected to the people working on things like algorithm ethics and classifying misinformation; people who are already wrestling with how algorithm design and page content adversely affect human populations, and with the ethics of countering those effects. There’s already a lot of good work starting on countering the unacceptable (e.g. interesting emergent work like counter-memes, annotation, credibility standards, removing advertising money and open source information warfare tools).

Defining and countering “unacceptable” is a subject for another note. IMHO, there’s a rot spreading in our second (online) world: it’s a system built originally on trust and a belief in its ability to connect, inform, entertain, and these good features are increasingly being used to bully, target, coerce and reflect the worst of our humanities. One solution would be to let it collapse on itself, to be an un-countered carrier for the machine and belief hacks that I wrote about earlier. Another is to draw our lines, to work out what we as technologists can and should do now. Some of those things (like adapt search) are thematically simple but potentially technically complex, but that’s never scared us before (and techies like a good chewy challenge). And personally, if things do start going Himmler, I’d rather be with Bletchley than IBM.

Boosting the data startup

[cross-post from Medium]

Some thoughts from working in data-driven startups.

Data scientists, like consultants, aren’t needed all the time. You’re very useful as the data starts to flow and the business questions start to gel, but before that there are other skills more needed (like coding); and afterwards there’s a lull before your specialist skills are a value-add again.

There is no data fiefdom. Everyone in the organization handles data: your job is to help them do that more efficiently and effectively, supplementing with your specialist skills when needed, and getting out of the way once people have the skills, experience and knowledge to fly on their own. That knowledge should include knowing when to call in expert help, but often the call will come in the form of happening to walk past their desk at the right time.

Knowing when to get involved is a delicate dance. Right at the start, a startup is (or should be) still working out exactly what it’s going to do, feeling its way around the market for the place where its skills fit; even under Lean, this can resemble a snurfball of oscillating ideas out of which a direction will eventually emerge, and it’s too easy to constrain that oscillation with the assumptions and concrete representations that you need to make to do data science. You can (and should) do data science in an early startup, but it’s more a delicate series of nudges and estimates than hardcore data work. The point when the startup gels around its core ideas and starts writing code is a good time to join as a data scientist, because you can start setting good behavior patterns for both data management and the business questions that rely on it. But beware: it isn’t an easy path. You’ll have a lot of rough data to clean and work with, and you’ll spend a lot of your life justifying your existence on the team (it helps a lot if you have a second role that you can fall back on at this stage). Later on, there will be plenty of data and people will know that you’re needed, but you’ll have lost those early chances to influence how and what data is stored, represented, moved, valued and appraised, and how that work links back (as it always should) to the startup’s core business.

There are typically five main areas that the first data nerd in a company will affect and be affected by (h/t Carl Anderson from Warby Parker): data engineering, supporting analysts, supporting metrics, data science and data strategy. These are all big, 50-person-company areas that will grow out of small initial seeds.

  • Data engineering is the business of making data available and accessible. That starts with the dev team doing their thing with the datastores, pipelines, APIs etc needed to make the core system run. It’ll probably be some time before you can start the conversation about a second storage system for analysis use (because nobody wants the analysts slowing down their production system) so chances are you’ll start by coding in whatever language they’re using, grabbing snapshots of data to work on ’til that happens.
  • To start with, there will also be a bunch of data outside the core system, that parts of the company will run on ’til the system matures and includes them; much of it will be in spreadsheets, lists and secondary systems (like CRMs). You’ll need patience, charm, and a lot of applied cleaning skills to extract these from busy people and make the data in them more useful to them. Depending on the industry you’re in, you may already have analysts (or people, often system designers, doing analysis) doing this work. Your job isn’t to replace them at this; it’s to find ways to make their jobs less burdensome, usually through a combination of applied skills (e.g. writing code snippets to find and clean dirty datapoints), training (e.g. basic coding skills, data science project design etc), algorithm advice and help with more data access and scaling.
  • Every company has performance metrics. Part of your job is to help them be more than window-dressing, by helping them link back to company goals, actions and each other (understanding the data parts of lean enterprise/ lean startup helps a lot here, even if you don’t call it that); it’s also to help find ways to measure non-obvious metrics (mixtures, proxies etc; just because you can measure it easily doesn’t make it a good metric; just because it’s a good metric doesn’t make it easy to measure).
  • Data science is what your boss probably thought you were there to do, and one of the few things that you’ll ‘own’ at first. If you’re a data scientist, you’ll do this as naturally as breathing: as you talk to people in the company, you’ll start getting a sense of the questions that are important to them; as you find and clean data, you’ll have questions and curiosities of your own to satisfy, often finding interesting things whilst you do. Running an experiment log will help here: for each experiment, what was the business need, the dataset, what did you do, and most importantly, what did you learn from it. That not only frames and leaves a trail for someone understanding your early decisions; it will also leave a blueprint for other people you’re training in the company on how data scientists think and work (because really you want everyone in a data-driven company to have a good feel for their data, and there will be a *lot* of data). Some experiments will be for quick insights (e.g. into how and why a dataset is dirty); others will be longer and create prototypes that will eventually become either internal tools or part of the company’s main systems; being able to reproduce and explain your code will help here (e.g. Jupyter notebooks FTW).
  • With data comes data problems. One of these is security; others are privacy, regulatory compliance, data access and user responsibilities. These all fall generally under ‘data governance’. You’re going to be part of this conversation, and will definitely have opinions on things like data anonymisation, but early on, much of it will be covered by the dev team / infosec person, and you’ll have to wait for the right moment to start conversations about things like potential privacy violations from combining datasets, data ecosystem control and analysis service guarantees. You’ll probably do less of this part at first than you thought you would.
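The experiment log mentioned in the data science bullet doesn’t need heavy tooling; a few structured records go a long way. A minimal sketch (the field names and the example entry are my own invention, not a standard):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Experiment:
    business_need: str   # why we ran it
    dataset: str         # what data we used
    method: str          # what we did
    learned: str = ""    # the important part: what we learned
    when: date = field(default_factory=date.today)

# The log itself is just a list of records that anyone can read.
log = []
log.append(Experiment(
    business_need="Why are signups dirty?",
    dataset="signups_2017.csv",
    method="Profiled null rates per column",
    learned="80% of nulls come from one mobile form",
))
```

The point isn’t the data structure; it’s that every experiment leaves a trail someone else can follow.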

Depending on where in the lifecycle you’ve joined, you’re either trying to make yourself redundant, or working out how you need to build a team to handle all the data tasks as the company grows. Making yourself redundant means giving the company the best data training, tools, and structural and scientific head start you can; leaving enough people, process and tech behind to know that you’ve made a positive difference there. Building the team means starting by covering all the roles (it helps if you’re either a ‘unicorn’ — a data scientist who can fill all the roles above — or pretty close to being one; a ‘pretty white pony’ perhaps) and gradually moving them either fully or partially over to other people you’ve either brought in or trained (or both). Neither of these things is easy; both are tremendously rewarding and will grow your skills a lot.

What could help fix belief problems?

[cross-post from Medium]

Last part (of 4) from a talk I gave about hacking belief systems.

Who can help?

My favourite quote is “ignore everything they say, watch everything they do”. It was originally advice to girls about men, but it works equally well with both humanity and machines.

If you want to understand what is happening behind a belief-based system, you’re going to need a contact point with reality. Be aware of what people are saying, but also watch their actions; follow the money, and follow the data, because everything leaves a trace somewhere if you know how to look for it (looking may be best done as a group, not just to split the load, but also because diversity of viewpoint and opinion will be helpful).

Verification and validation (V&V) become important. Verification means going there. For most of us, verification is something we might do up front, but rarely do as a continuing practice. Which, apart from making people easy to phish, also makes us vulnerable to deliberate misinformation. Want to believe stuff? You need to do the leg-work of cross-checking that the source is real, finding alternate sources, or getting someone to physically go look at something and send photos (groups like findyr still do this). Validation is asking whether this is a true representation of what’s happening. Both are important; both can be hard.

One way to start V&V is to find people who are already doing this, and learn from them. Crisismappers used a bunch of V&V techniques, including checking how someone’s social media profile had grown, looking at activity patterns, tracking and contacting sources, and not releasing a datapoint until at least 3 messages had come in about it. We also used data proxies and crowdsourcing from things like satellite images (spending one painful Christmas counting and marking all the buildings in Somalia’s Afgooye region), automating that crowdsourcing where we could.
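That “at least 3 messages” rule is easy to mechanise. A toy sketch (the report format and threshold are illustrative; a real deployment would also check that the reports come from independent sources, which this doesn’t):

```python
from collections import Counter

def releasable(reports, threshold=3):
    """Return the datapoints corroborated by at least `threshold` reports.
    Note: counts raw reports, not distinct independent sources."""
    counts = Counter(r['datapoint'] for r in reports)
    return {dp for dp, n in counts.items() if n >= threshold}

reports = [
    {'datapoint': 'bridge_down', 'source': 'sms_1'},
    {'datapoint': 'bridge_down', 'source': 'tweet_9'},
    {'datapoint': 'bridge_down', 'source': 'radio_2'},
    {'datapoint': 'clinic_open', 'source': 'sms_3'},
]
releasable(reports)  # → {'bridge_down'}
```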

Another group that’s used to V&V is the intelligence community, e.g. intelligence analysts (IAs). As Heuer puts it in the Psychology of Intelligence Analysis (see the CIA’s online library), “Conflicting information of uncertain reliability is endemic to intelligence analysis, as is the need to make rapid judgments on current events even before all the evidence is in”. Both of these groups (IAs and crisismappers) have been making rapid judgements about what to believe, how to handle deliberate misinformation, what to share and how to present uncertainty in it for years now. And they both have toolsets.

Is there a toolset?

Graphs from a session on statistical significance, A/B testing and friends. We can ‘clearly’ see when two distributions are separate or nearly the same, but need tools to work out where the boundary between those two states is.

The granddaddy of belief tools is statistics. Statistics helps you deal with and code for uncertainty; it also makes me uncomfortable, and makes a lot of other people uncomfortable. My personal theory is that this is because statistics tries to describe uncertainty, and there’s always that niggling feeling that there’s something else we haven’t quite considered when we do this.
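As a toy version of that boundary question from the graphs: Welch’s t statistic puts a number on how separated two samples are, using only the standard library (the data here is invented for illustration):

```python
from statistics import mean, variance
from math import sqrt

def welch_t(a, b):
    """Welch's two-sample t statistic: a big |t| suggests the
    samples come from genuinely different distributions."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

same  = welch_t([5.1, 4.9, 5.0, 5.2], [5.0, 5.1, 4.8, 5.1])  # small |t|
apart = welch_t([5.1, 4.9, 5.0, 5.2], [9.0, 9.2, 8.9, 9.1])  # large |t|
```

Where exactly the cut-off for “different” sits is precisely the kind of judgement the statistics toolkit exists to formalise.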

Statistics is nice once you’ve pinned down your problem and got it into a nicely quantified form. But most data on belief isn’t like that. And that’s where some of the more interesting qualitative tools come in, including structured analytics (many of which are described in Heuer’s book “Structured Analytic Techniques”). I’ll talk more about these in a later post, but for now I’ll share my favourite technique: Analysis of Competing Hypotheses (ACH).

ACH is about comparing multiple hypotheses; these might be problem hypotheses or solution hypotheses at any level of the data science ladder (e.g. from explanation to prediction and learning). The genius of ACH isn’t that it encourages people to come up with multiple competing explanations for something; it’s in the way that a ‘winning’ explanation is selected, i.e. by collecting evidence *against* each hypothesis and choosing the one with the least disproof, rather than trying to find evidence to support the “strongest” one. The basic steps (Heuer) are:

  • Hypothesis. Create a set of potential hypotheses. This is similar to the hypothesis generation used in hypothesis-driven development (HDD), but generally has graver potential consequences than how a system’s users are likely to respond to a new website, and is generally encouraged to be wilder and more creative.
  • Evidence. List evidence and arguments for each hypothesis.
  • Diagnostics. List evidence against each hypothesis; use that evidence to compare the relative likelihood of different hypotheses.
  • Refinement. Review findings so far, find gaps in knowledge, collect more evidence (especially evidence that can remove hypotheses).
  • Inconsistency. Estimate the relative likelihood of each hypothesis; remove the weakest hypotheses.
  • Sensitivity. Run sensitivity analysis on evidence and arguments.
  • Conclusions and evidence. Present most likely explanation, and reasons why other explanations were rejected.
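The steps above can be sketched in a few lines. This is a toy version of the scoring step only (the hypotheses and evidence markings are invented); Heuer’s ‘C’/‘I’/‘N’ consistency scheme and the least-disproof selection rule are the parts to notice:

```python
# Each hypothesis maps to how each piece of evidence scores against it:
# 'C' consistent, 'I' inconsistent, 'N' neutral (Heuer's basic scheme).
matrix = {
    'H1: outage was an accident':   ['C', 'I', 'I'],
    'H2: outage was an attack':     ['C', 'C', 'N'],
    'H3: outage was maintenance':   ['I', 'I', 'I'],
}

def least_disproved(matrix):
    """ACH picks the hypothesis with the FEWEST inconsistencies,
    not the one with the most supporting evidence."""
    return min(matrix, key=lambda h: matrix[h].count('I'))

least_disproved(matrix)  # → 'H2: outage was an attack'
```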

This is a beautiful thing, and one that I suspect several of my uncertainty in AI colleagues were busy working on back in the 80s. It could be extended (and has been); for example, DHCP creativity theory might come in useful in hypothesis generation.

What can help us change beliefs?

Boyd’s original OODA loop diagram

Which brings us back to one of the original questions: what could help with changing beliefs? My short answer is “other people”: the beliefs of both humans and human networks are adjustable. The long answer is that we’re all individuals (it helps if you say this in a Monty Python voice), but if we view humans as systems, then we have a number of pressure points, places where we can be disrupted, some of which are nicely illustrated in Boyd’s original OODA loop diagram (but do this gently: go too quickly with belief change, and you get shut down by cognitive dissonance). We can also be disrupted as a network, borrowing ideas from biostatistics and creating idea ‘infections’ across groups. Some of the meta-level things that we could do are:

  • Teaching people about disinformation and how to question (this works best on the young; we old ‘uns can get a bit set in our ways)
  • Making belief differences visible
  • Breaking down belief ‘silos’
  • Building credibility standards
  • Creating new belief carriers (e.g. meme wars)
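The idea-infection framing can be sketched as a toy SI-style contagion (infection without recovery) on a contact network; the network, adoption probability and seeds here are all invented for illustration:

```python
import random

def spread(network, seeds, p_adopt=0.5, steps=10, rng=None):
    """Toy SI-style spread of an idea: each step, every node that
    holds the idea tries to pass it to each susceptible neighbour."""
    rng = rng or random.Random(42)  # fixed seed: reproducible toy runs
    infected = set(seeds)
    for _ in range(steps):
        newly = set()
        for node in infected:
            for neighbour in network.get(node, []):
                if neighbour not in infected and rng.random() < p_adopt:
                    newly.add(neighbour)
        infected |= newly
    return infected

network = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a'],
           'd': ['b', 'e'], 'e': ['d']}
reached = spread(network, seeds={'a'})
```

Even in this crude form, the levers are visible: adoption probability, network shape and seed placement all change how far an idea travels, which is exactly where the meta-level interventions above apply.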