… if you’re looking for my most recent work, check https://medium.com/@sarajayneterp
[Short notes will also turn up on Twitter https://twitter.com/bodaceacat or LinkedIn https://www.linkedin.com/in/sarajayneterp]
Dear friends back home. You’ve stopped calling me when disasters happen here, or when your news from America is full of angry people and frightening police, or more guns on the street than the average military armory. You’ve even stopped strongly suggesting that I should come home to be ‘safer’ from it all. Even my mother’s stopped doing it – reassured by a visit that I’m tucked away in a sleepy town in a quiet corner of the country that’s so close to Canada it’s caught politeness and salmon (although no signs of ice hockey yet).
I came here a decade ago with a job to do, helping to protect some of the most vulnerable people and places on earth. I found a country of contradictions: of massive opportunities and crushing, life-minimizing poverties; of freedom to do anything if you looked and sounded right, or had the right type of name; of a wide choice of labels in the stores on a very small number of products; of both massive diversity and unity as “xxxx-Americans” where xxxx was every country or culture in the world (except, strangely, British). I fell in love with that pluralism: that people could be from anywhere, everywhere, and yet still be at home here.
Four years ago, when the regime changed, several of my friends did ‘go home’ – back to Europe or their other countries of origin, knowing things would get less safe here. I stayed because I go where the need is, and when I looked around the world, I frankly couldn’t see a country more heading into trouble, more able to damage the rest of the world once it got into that trouble, and more in need of people who could work through a slow-running crisis response and recovery to help get it out again.
Which sounds a bit grandiose, so let me put it differently. That America I fell in love with – the diverse one, where everyone was American first, and it didn’t matter where their accent was from – that’s just one version of America, and there are others. There’s the Daughters of the Revolution version of America (I married into that one), where America is all the white people from the 1950s posters and movies, with everyone else playing maids and other bit parts; the America of the children-of-children-of people who were systematically impoverished because of the interaction of the colour of their skin and the unfinished business of America having been founded pretty solidly on the idea that darker-skinned people were much less important; and the America of the children-of-children-of people who used to own other people like cattle and thought that was okay. There’s been a lot wrong for a very long time, including police forces not dealing with racism and violence in their ranks (and historical things like some police forces having been created to enforce racism, which hasn’t quite got out of their bloodstream yet) – summed up by the answer I got in my first year here when I asked about black people being killed in the South of the USA: “yeah, it happens there. Nothing much we can do about it”.
Two major things changed whilst I’ve been here: we all got more connected to each other through the internet, and we all got more direct evidence of things happening around us, as people started carrying phones with cameras everywhere. This has been terrible for bigfoot spotters, but it’s made two other things possible: creating and manipulating the emotions and views of large groups of people online, including pitting all those Americas against each other, and making the things that “just happened over there” visible to everyone. You’re seeing both of those things right now. You’ve seen a video of a man being killed by police – something that’s happened too often to black men here, but usually without record. And you’re now seeing news coverage of the protests against that being met with a much more violent police response than the mostly-white protests against local stay-at-home health orders received, where the response was non-violent and sometimes absent altogether (anti-lockdown protesters took over government buildings whilst armed, without being pushed back, and threatened state governors’ lives without any arrests).
Those are the realities. But there’s also image, and nowhere is more image than online. People have manipulated each other since they started communicating; countries have been manipulating each other’s populations with mass propaganda for decades, and all humans, all journalism, have biases in how they report things. But for the past decade that’s been happening online too. And those multiple Americas have been strengthened in people’s minds, formed into in-groups and out-groups, “us” and “them”, by clever manipulators using social media tools and what basically amounts to a dark version of marketing, to form a fighting, biting mess of humanity online. This is the thing that I stayed for. I’ve spent years working on human belief manipulation, and nowhere else in the world is as vulnerable to it as America is today. In America, it seems, image is everything.
So next, inevitably, will come those forces using these events to create more divisions. There will be the usual signals, e.g. “American” meaning the 1950s-poster version only, and the use of emotions including fear to enforce it (American news is excessively emotional, but that’s a discussion for another day). There will be disinformation aimed at all the usual groups by probably all of the usual actors (e.g. the Russian IRA has probably been busy already with its fake Black American accounts and groups). You’ll see this from where you are too. It will look crazy. It is, to be fair, actually pretty crazy, and has sucked normal, good, moderate Americans into the crazy too, but I don’t think this will be forever. There are good people working on all the things above, despite the difficult political climate, and they haven’t stopped working on them in the last difficult years. I personally think there’s hope that America will eventually start dealing with the rots of racism, inequality and unfinished war that were built into its founding, and before that will build resilience against these disinformation storms. It’s why I’ve stayed.
Warning: this is a set of rough notes, for other geeks to read. There’s an ungeeked post about this for people who don’t want to wade through code.
2019 was mostly about building infrastructure and communities, but every so often I did a little “data safari” on a piece of misinformation that interested me.
Data safaris are small looks into an area of interest: they’re not big enough to be expeditions, but they’re not standing prodding the bushes a couple of times then walking away again either. How this (typically) works is: Something interests me. I see an article, or a hashtag, or an oddity that I’d like to have a better look at, so I take a cursory look at it, write a rough plan of how I want to further investigate it, then start at one corner of the world, and traverse out from it, making notes on things of interest, or things I want or need to remember as I go. Whilst I’m traversing, I’m also filling in a spreadsheet or doc with the end data I want, and collecting/storing other datasets (twitter, facebook, text from sites etc) as needed. On the way, I’ll find other things I want to explore, but won’t want to interrupt my flow through whichever thing I’m on at the time – for these, I leave myself an “action” note, which I overwrite with “done” or “ignored” when I’ve either gone back to that branch and followed it, or decided not to chase it down.
This one was pink slime: Hundreds of ‘pink slime’ local news outlets are distributing algorithmic stories and conservative talking points
This article is from the Tow Center – the data journalism unit at Columbia that has Jonathan Albright (d1gi) as one of its peeps working on misinformation data – so I had a quick look at his recent datasets https://data.world/d1gi (nothing on data.world in the past 9 months), medium, and twitter https://twitter.com/d1gi – looks like it isn’t him. Back to the article.
“Pink slime” = “low cost automated story generation”. This stuff is going to get worse, as text generation improves (and remember that many of these sites are heavy on the image:text ratio anyways).
No link to data given in the article. Article writer is Priyanjana Bengani – talks about “we”, writes about datasets. Checking their accounts in case they released a dataset somewhere (unlikely given the subject, and what happened when Barrett Golding released a misinformation sites dataset, but worth checking). Nothing on https://twitter.com/acookiecrumbles. No sign of a dataset being released, so will have to start chasing it up by hand. Which, admittedly, will be fun. Can I find more than 450? [edit: yes, but it’s going to take an age to go through them all]
Okay. Quick-and-dirty plan:
Todo list:
Clues from cjr article:
Lansing State Journal article:
Looked for origin of story – Matt Grossmann
Checking Michigan Daily
Now the Guardian and NYT. These are both heavy hitters, data journalism-wise, and can call on a wide network of expert researchers (I know; I fed into some of their stuff). Maybe we’re seeing a pipeline here too – from a local researcher to a local news organisation, then nationals, then journalism school, then wider again?
Twitter search on NYT article
Oh the glamour. 2am and still going through the basic datanerding. Well, on to the adding lots of sites to the list part (current haul is the 23 sites mentioned in articles, but we have the Franklin Archer list, and the state publications list at the bottom of each site to add in, as easy wins)
Looking for “pink slime” articles
Look for misinformation-related terms plus actors/objects above:
Grab site lists from bottom of each known site
# First attempt: a plain GET on one of the known sites
import requests

puburl = 'https://annarbortimes.com/'
response = requests.request('GET', puburl)
response.text
# Retry with a browser user-agent, then pull the related-sites list from the page footer
import requests
from bs4 import BeautifulSoup
import pandas as pd

pubtitle = 'Ann Arbor Times'
puburl = 'https://annarbortimes.com/'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.request('GET', puburl, headers=headers)

data = BeautifulSoup(response.text, 'html.parser')
site_uls = data.find_all('ul', attrs={'class': 'footer__list'})
sudata = site_uls[0]

# Collect (name, url) pairs for every linked site, starting with this one
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]

df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle + '.csv')
df
Automating the google search
Text searches (by hand first)
from googlesearch import search
import pandas as pd
import re

# Search for the LGIS boilerplate string on about-us pages
query = '"is a product of LGIS - Local Government Information Services" site:.com/about-us'
js = [j for j in search(query, tld="com", num=100, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]
js

# Strip each result down to its domain, save, and diff against the known-sites list
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_test.csv', index=False)
df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
set(newsites) - set(fromsites)
So now I have a bunch of sites, several of which appear to be related (in subject, look and feel etc), but will take further investigation to see whether they’re part of something underhand, or just normal business. I suspect we’ll see a lot of this in future: companies that do legit business online (e.g. directory services), also providing (wittingly or unwittingly) disinformation carriers.
Networks of related sites found so far:
Leftover action: not sure how many search results come back from google search api. Recheck code https://github.com/anthonyhseb/googlesearch/blob/master/googlesearch/googlesearch.py
We have the original set of local news sites. Going to add all the other sites above to the main list, and make that a list of connected sites, before checking them all for disinformation behaviours (e.g. look at their non-automated stories). Action: check all sites on main list for disinformation behaviours.
Current counts (sites found):
Open:
So we’ve worked outwards from the original set of articles that flagged these networks, scraped sites’ lists of related sites, and used Google’s API to search for sites containing specific phrases. Time to look for the other things that connect sites (and use them to find more).
Quick and dirty tag comparisons: using BuiltWith tags and IP relationships, e.g. https://builtwith.com/relationships/montgomerymdnews.com
Now this is where we start putting the “science” into our “data science”. There are two versions of BuiltWith: public and pay-for. The pay-for is amazing, with all sorts of useful data in it, but the public (free) version isn’t bad either if you’re looking for related sites. And BuiltWith has an API. Here’s where we start cooking on gas.
To do next:
Coded up builtwith API calls. First 10 API calls on an account are free, then need to pay $100 for each 2000 calls or so (students/ academics might be able to get free access?). Code is:
import requests
import json
import pandas as pd

bwkey = '<put your own key here>'
bwdom = 'propertyinsurancewire.com'
bwapi = 'rv1'  # 'free1' is the free api
bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
bwresp = requests.get(bwurl)

# Keep the raw json, in case we want to re-parse it later
with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
    json.dump(bwresp.json(), outfile)

# Flatten the relationships into two tables: identifiers, and domain matches
# (DataFrame.append was current pandas at the time; use pd.concat on newer pandas)
matches = pd.DataFrame([])
identifiers = pd.DataFrame([])
rs = bwresp.json()['Relationships']
for thisr in rs:
    fromdomain = thisr['Domain']
    rsi = thisr['Identifiers']
    ids = pd.DataFrame(rsi).drop('Matches', axis=1)
    ids['FromDomain'] = fromdomain
    identifiers = identifiers.append(ids)
    for rsix in rsi:
        rsimatches = pd.DataFrame(rsix['Matches'])
        rsimatches['Type'] = rsix['Type']
        rsimatches['Value'] = rsix['Value']
        rsimatches['FromDomain'] = fromdomain
        if len(rsimatches) > 0:
            matches = matches.append(rsimatches)

matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
matches
Which dumps out json and csv files for a single API call, but in a format where a whole collection of call outputs could be stuck together. Tested on montgomerymdnews.com, which gave a few new sites – and on propertyinsurancewire.com, which gave a list of 300 to check, including ones on the same NewRelic id.
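Since the outputs share a format, sticking a collection of call outputs together is a one-liner per file type. A minimal sketch, assuming the builtwith/ directory layout from the code above:

import glob
import pandas as pd

# glue all the per-domain matches files into one table
frames = [pd.read_csv(f) for f in glob.glob('builtwith/*_matches.csv')]
allmatches = pd.concat(frames, ignore_index=True)
allmatches.to_csv('builtwith/allmatches.csv', index=False)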
Alrighty. Modified code to loop round all the urls I have. It’s fast. But it’s also spitting out some empty results – all the “1 byte” csvs below. Not sure why; will check when the run finishes.
Action: check the “1 byte” builtwith API returns.
Run finished. Got 46434 matches back from BuiltWith, containing 1120 unique domains. 946 of those domains aren’t on the seed list. These are a mix: some look like news domain addresses; others don’t.
At this point, a requests.get call on /about-us for these sites should give a decent clue to whether they’re linked to the above sites (although africanmangoscam.net is definitely getting a visit). Looking at them, by first filtering on newsy words like “news”, “county” etc:
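Something like this (the word list is illustrative, and 'newsites' here is the list of BuiltWith domains that aren’t on the seed list):

# split the new domains into newsy-looking and everything else
newsy_words = ['news', 'times', 'county', 'gazette', 'herald', 'today']
newsy_sites = [d for d in newsites if any(w in d.lower() for w in newsy_words)]
other_sites = sorted(set(newsites) - set(newsy_sites))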
I ran out of holiday time (to be honest, I diverted into hanging out with my parents, which was IMHO a very good use of time). I enjoyed the exercise very much, and I have even more respect now for the data journalists who do this work all the time.
''' Scrape pink slime '''
import requests
from bs4 import BeautifulSoup
import pandas as pd

# pull page data from site
pubtitle = 'NE Kentucky News'
puburl = 'https://nekentuckynews.com/'
stype = 'local'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.request('GET', puburl, headers=headers)

# Grab the raw list of sites: the footer layout varies by site family
data = BeautifulSoup(response.text, 'html.parser')
if stype == 'florida':
    sudata = data.find_all('nav', attrs={'class': 'foot-nav'})[0]
elif stype == 'bd':
    fdata = data.find_all('div', attrs={'class': 'footer'})[0]
    row1 = fdata.find_all('div', attrs={'class': 'row'})[0]
    sudata = row1.find_all('div', attrs={'class': 'row'})[0]
else:
    sudata = data.find_all('ul', attrs={'class': 'footer__list'})[0]

# Convert to CSV
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]
df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle + '.csv', index=False)
''' googlesearch_for_terms '''
from googlesearch import search
import pandas as pd
import re
from datetime import datetime

qterm = 'access to and use of Locality Labs'
query = '"{}" site:.com/terms'.format(qterm)
js = [j for j in search(query, tld="com", num=1000, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]

# Save, and compare against existing site list
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_{}.csv'.format(datetime.now().strftime('%Y-%m-%d-%H-%M')),
          index=False)
df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
df3 = pd.DataFrame(list(set(newsites) - set(fromsites)), columns=['url'])
df3.to_csv('temp.csv', index=False)
''' use_builtwith '''
import pandas as pd
import requests
import json

# Load the seed list, and strip each url down to its domain
df = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df['URL'].to_list()
fromsites = [x[x.strip('/').rfind('/')+1:].strip('/') for x in fromsites]

bwkey = '<get_a_key>'
bwapi = 'rv1'  # 'free1' is the free api

allmatches = pd.DataFrame([])
allidentifiers = pd.DataFrame([])
for bwdom in fromsites:
    print(bwdom)
    try:
        bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
        bwresp = requests.get(bwurl)
        with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
            json.dump(bwresp.json(), outfile)
        matches = pd.DataFrame([])
        identifiers = pd.DataFrame([])
        rs = bwresp.json()['Relationships']
        for thisr in rs:
            fromdomain = thisr['Domain']
            rsi = thisr['Identifiers']
            ids = pd.DataFrame(rsi).drop('Matches', axis=1)
            ids['FromDomain'] = fromdomain
            identifiers = identifiers.append(ids)
            for rsix in rsi:
                rsimatches = pd.DataFrame(rsix['Matches'])
                rsimatches['Type'] = rsix['Type']
                rsimatches['Value'] = rsix['Value']
                rsimatches['FromDomain'] = fromdomain
                if len(rsimatches) > 0:
                    matches = matches.append(rsimatches)
        matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
        identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
        if len(matches) > 0:
            allmatches = allmatches.append(matches)
        if len(identifiers) > 0:
            allidentifiers = allidentifiers.append(identifiers)
    except Exception:
        # e.g. empty or malformed api returns: skip this domain and keep going
        continue

allmatches.to_csv('builtwith/allmatches.csv', index=False)
allidentifiers.to_csv('builtwith/allidentifiers.csv', index=False)
newsites = pd.DataFrame(set(allmatches['Domain'].to_list()) - set(fromsites), columns=['URL'])
newsites.to_csv('newsites_tmp.csv', index=False)
[Cross-post from Medium https://medium.com/misinfosec/disinformation-datasets-8c678b8203ba]
I’m often asked for disinformation datasets — other data scientists wanting training data, mathematician friends working on things like how communities separate and rejoin, infosec friends curious about how cognitive security hacks work. I usually point them at the datasets section on my Awesome Misinformation repo, which currently contains these lists:
That’s just the data that can be downloaded. There’s a lot of implicit disinformation data out there too. For example, groups like EUvsDisinfo, NATO Stratcom, and OII Comprop all have structured data on their websites.
That’s a place to start, but there’s a lot more to know about disinformation data sources. One thing, as pointed out by Lee Foster at Cyberwarcon last month, is that these datasets are rarely a complete picture of the disinformation around an event. Lee’s work is interesting: he did what many of us do – as soon as an event kicked off, his team started collecting social media data around it. What they did next was to compare that dataset against the data output officially by the social media companies (Twitter, Facebook). There were gaps — big gaps — in the officially released data; understandable in a world where attribution is hard and campaigns work hard to include non-trollbots (aka ordinary people) in the spread of disinformation.
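As a minimal sketch of that comparison (the file and column names here are assumptions, for illustration only):

import pandas as pd

collected = pd.read_csv('my_event_collection.csv')       # data scraped as the event ran
official = pd.read_csv('platform_takedown_release.csv')  # the platform's released dataset

# accounts we saw pushing the campaign that never appeared in the official release
gap = set(collected['userid']) - set(official['userid'])
print(len(gap), 'accounts in our collection but not in the official data')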
Some people can get better data than others. For instance, Twitter’s IRA dataset has obfuscated user ids, but academics can ask for unobfuscated data. It’s worth asking, but also worth asking yourself about the limitations placed on you by things like non-disclosure agreements.
So what happens is that people who are serious about this subject collect their own data. And lots of them collect data at the same time, which then sits on their personal drives (or somewhere online) whilst other researchers are scrabbling round for datasets on events that have passed. I’ve seen this before. Everything old is new again — and in this case, it’s just what we saw with crisismapping data. There were people all over the world — locals, humanitarian workers, data enthusiasts etc — who had data that was useful in each disaster that hit, which meant that a large part of my work as a crisis data nerd was quietly, gently extracting that data from people and getting it to a place online where it could be used, in a form where it could be found. We volunteers built an online repository, the Humanitarian Data Project, which informed the build of the UN’s repository, the Humanitarian Data Exchange — I also worked with the Humanitarian eXchange Language (HXL) team on ways to meta-tag datasets so the data needed was easier to find. There’s a lot of learning in there to be transferred.
Disinformation is being created at scale — beyond the ability of human tagging teams to keep up. That means we’re going to need automation (or rather augmentation — automating some of the tasks so the humans can still do the ‘hard’ parts of finding, labelling and managing disinformation campaigns, their narratives and artefacts). And to do that, we generally need datasets that are labelled in some way, so the machines can ‘learn’ from the humans (there are other ways to learn, like reinforcement learning, but I’ll talk about them another time). Unsurprisingly, there is very little in the way of labelled data in this world. The Pheme project labelled data; I helped with the Jigsaw project on a labelled dataset that was due for open release; I’ve also helped create labelling schemes for data at GDI, and am watching conversations about starting labelling projects at places like the Credibility Coalition.
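To make ‘labelled’ concrete, here’s a minimal sketch of the kind of table a labelling project produces (the schema and label names are illustrative, not any particular project’s scheme):

import pandas as pd

labelled = pd.DataFrame([
    {'text': 'BREAKING: the election was...', 'source': 'twitter', 'label': 'misleading_context'},
    {'text': 'Local council votes on...',     'source': 'web',     'label': 'not_misinformation'},
], columns=['text', 'source', 'label'])
# a classifier then learns the text -> label mapping from thousands of rows like these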
That’s it — that’s a start on datasets for disinformation research. This is a living post, so if there are more places to look please tell me and I’ll update this and other notes.
[Cross-post from Medium https://medium.com/misinfosec/disinformation-as-a-security-problem-why-now-and-how-might-it-play-out-3f44ea6cda95]
When I talk about security going back to thinking about the combination of physical, cyber and cognitive, people sometimes ask me: why now? Why, apart from the obvious weekly flurries of misinformation incidents, are we talking about cognitive security now?
I usually answer with the three Vs of big data: volume, velocity, variety (the fourth V, veracity, is kinda the point of disinformation, so we’re leaving it out of this discussion).
NB The internet isn’t the only system carrying these things: we still have traditional media like radio, television and newspapers, but they’re each increasingly part of these larger connected systems.
Another question I get a lot is “so what happens next”. Usually I answer that one by pointing people at two books: The Cuckoo’s Egg and Walking Wounded — both excellent books about the evolution of the cybersecurity industry (and not just because great friends feature in them), and say we’re at the start of The Cuckoo’s Egg, where Stoll starts noticing there’s a problem in the systems and tracking the hackers through them.
I think we’re getting a bit further through that book now. I live in America. Someone sees a threat here, someone else makes a market out of it. Cuddle-an-alligator — tick. Scorpion lollipops in the supermarket — yep. Disinformation as a service / disinformation response as a service — also in the works, as predicted for a few years now. Disinformation response is a market, but it’s one with several layers to it, just as the existing cybersecurity market has specialists and sizes and layers.
Frank is a very wise, very experienced friend (see books above), who calls our work on AMITT “botany” — building catalogs of techniques and counters slower than the badguys can maraud across our networks, when we really should be out there chasing them. He’s right. Kinda.
I read Adam Shostack’s slides on threat modelling in 2019 today. He talks about the difference between “waterfall” (STRIDE, kill chain etc) and “agile” threat modelling. I’ve worked on both: on big critical systems that used waterfall/“V” methods because you don’t really get to continuously rebuild an aircraft or ship design, and on agile systems that we trialled with and adapted to end-user needs. (I’ve also worked on lean production, where, classically speaking, agile is where you know the problem space and are iterating over solutions, and lean is iterating over both the problem and solution spaces. This will become important later.) This is one of the splits: we’ll still need the slower, deliberative work that gives labels and lists defences and counters for common threats (the “phishing” etc equivalents of cognitive security), but we also need the rapid response to things previously unseen that keeps white-hat hackers glued to their screens for hours — and there’s a growing market in tools to support them. (As an aside, I’m part of a new company, and this agile/waterfall split finally gives me a word to describe “that stuff over there that we do on the fly”.)
Also, because I’m old, I can remember when universities had no clue where to put their computer science group — it was sometimes in the physics department, sometimes engineering, or maths, or somewhere weirder still; later on, nobody quite knew where to put data scientists as they cut across disciplines and used techniques from wherever made sense, from art to hardcore stats. This market will shake out that way too. Some of the tools, uses and companies will end up as part of day-to-day infosec. Others will be market-specific (media and adtech are already heading that way); others again will meet specific needs on the “influence chain”, like educational tools and narrative trackers. Perhaps a good next post would be an emerging-market analysis?
[Cross-post from Medium https://medium.com/misinfosec/at-truth-trust-online-someone-asked-me-about-the-overlaps-between-misinformation-research-and-iot-a69772aba963]
At Truth&Trust Online, someone asked me about the overlaps between misinformation research and IoT security. There’s more than you’d think, and not just in the overlaps between people like Chris Blask who are working on both problem sets.
I stopped for a second, then went “Oh. I recognize this problem. It’s exactly what we did with data and information fusion (and knowledge fusion too, but you know a lot of that now as just normal data science and AI).” Basically it’s about what happens when you’re building situation pictures (mental models of what is happening in the world) based on data that’s come from people (the misinformation, or information fusion part) and things (the IoT, or data fusion part). And what we basically did last time was run both disciplines separately – the text analysis and reasoning in a different silo to the sensor-based analysis and reasoning – until it made sense to start combining them (which is basically what became information fusion). That’s how we got the *last* pyramid diagram – the DIKW model of data under information under knowledge under wisdom (sorry: it’s been changed to insight now) – and similar ideas of transformations and transitions in information (in the Shannon information theory sense of the word) between layers.
We’ll probably do a similar thing now. Both disciplines feed into situation pictures; both can be used to support (or refute) each other. Both contain all the classics, like protecting information CIA: confidentiality, integrity, availability. I tasked two people at the conference to start delving into this area further (and connect to Chris) – will see where this goes.
[Cross-post from Medium https://medium.com/misinfosec/short-thought-the-unit-should-be-person-not-account-81c48002aaa]
I’ve been thinking, inspired partly by Amy Zhang’s paper on mailing lists vs social media use https://twitter.com/amyxzh/status/1173812276211662848?s=20
We have a bunch of issues with online identity. Like, I have at least 20 different ways to contact some of my friends, spend half my life trying to separate out people being themselves from massive coordinated cross-platform campaigns, and have dozens of issues with privacy and openness (like do we throw a message into the infinite beerhall that’s twitter, or deliberately email just a few chosen peeps). How much of this has happened because our base unit of contact has changed from an individual human to an online account?
I’m wondering if there’s a way to switch that back again. Zeynep Tufekci said that people stayed on Facebook despite its shortcomings because that’s where the school emergency alerts, group organisation etc were. What if we could make those things platform-independent again? I mean, we have APIs, yes? They’re generally broadcast, or broadcast-and-feedback, yes?
I guess this is two ideas. One is to challenge the idea that everything has to be instant-to-instant. Yeah, sure, we want to chat with our friends. But do we really need instant chat on everything? If we drop that, can we build healthier models?
The second idea is to challenge the account-as-user idea. Remember addressbooks? Like those real physical paper books that you listed your friends’ and family’s names, addresses, phone numbers, emails etc in? What if we had a system that went back to that, and when you sent a message to someone it went to their system of choice, in your style of choice (dm, group, public etc)? I get that you’re all unique etc, and I’m still cool with some of you having multiple personalities, but this 20-ways-to-contact-a-person thing — that’s got old, and fast.
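A hypothetical sketch of that routing idea (every name, platform and address here is made up for illustration):

# one addressbook entry per human, with a preferred channel per message style
contacts = {
    'alice': {'dm': ('signal', '+15550100'), 'public': ('twitter', '@alice')},
    'mum':   {'dm': ('email', 'mum@example.com')},
}

def send(person, style, message):
    platform, address = contacts[person][style]
    # a real system would call each platform's API here; this just shows the routing
    print('routing {} message to {} via {} ({}): {}'.format(style, person, platform, address, message))

send('alice', 'dm', 'pub later?')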
The third (because who doesn’t like a fourth book in a trilogy) is to give people introvert time. Instead of having control over our electronic lives by putting down the electronics, have a master switch for “only my mother can contact me right now”.
[Cross-post from Medium https://medium.com/misinfosec/writing-about-countermeasures-1671d231e8a2]
The AMITT framework so far is a beautiful thing — we’ve used it to decompose different misinformation incidents into stages and techniques, so we can start looking for weak points in the ways that incidents are run, and in the ways that their component parts are created, used and put together. But right now, it’s still part of the “admiring the problem” collection of misinformation tools. To be truly useful, AMITT needs to contain not just the breakdown of what the blue team thinks the red team is doing, but also what the blue team might be able to do about it. Colloquially speaking, we’re talking about countermeasures here.
Now there are several ways to go about finding countermeasures to any action:
So right. We get some counters. But hasn’t this all been done before? Like, if we’re following the infosec playbook to do all this faster this time around (we really don’t have 20 years to get decent defences in place — we barely have 2…), then shouldn’t we look at things like courses of action matrices? Yes. Yes we should…
Courses of Action Matrix [1]
So this thing goes with the Cyber Kill Chain — the thing that we matched AMITT to. Down the left side we have the stages: 7 in this case, 12 in AMITT. Along the top we have six things we can do to disrupt each stage. And in each grid square, we have a suggestion of an action (I suspect there’s more than one of these for each square) that we could take to cause that type of disruption at that stage. That’s cool. We can do this.
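A minimal sketch, assuming we hold the matrix as data — the stage names and cell contents here are illustrative placeholders, not published AMITT counters:

# one suggested action per (stage, effect) cell of the matrix
effects = ['detect', 'deny', 'disrupt', 'degrade', 'deceive', 'destroy']
coa = {
    ('TA06 Develop Content', 'detect'): 'flag known synthetic-text fingerprints',
    ('TA09 Exposure', 'disrupt'): 'downrank the amplification accounts',
}
# lookup: what could we do to disrupt the exposure stage?
print(coa.get(('TA09 Exposure', 'disrupt')))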
The other place we can look is at our other parent models, like the ATT&CK framework, the psyops model, marketing models etc, and see how they modelled and described counters too — for example, the mitigations for ATT&CK T1193 Spearphishing.
Checking parent models is also useful because it gives us formats for our counter objects — which is basically that these are of type “mitigation”, and contain a title, id, brief description, and list of techniques that they address. Looking at the STIX format for course-of-action gives us a similarly simple format for each counter against tactics — a name, description, and list of things it mitigates against.
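A minimal sketch of that counter object format (the id scheme and example values are assumptions):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Counter:
    id: str                # e.g. 'C00001' - the id scheme is an assumption
    name: str
    description: str
    mitigates: List[str] = field(default_factory=list)  # technique ids it addresses

c = Counter('C00001', 'Rate-limit amplification',
            'Throttle accounts showing coordinated posting patterns',
            mitigates=['T0049'])  # technique id shown is illustrative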
We want to be more descriptive whilst we find and refine our list of counters, so we can trace our decisions and where they came from. A more thorough list of features for a counter list would probably include:
And be generated from a cross-table of counters within incidents, which looks similar to the above, but also contains the who/where/when etc:
At this stage, older infosec people are probably shaking their heads and muttering something about stamp collecting and bingo cards. We get that. We know that defending against a truly agile adversary isn’t a game of lookup, and as fast as we design and build counters, our counterparts will build counters to the counters, new techniques, new adaptations of existing techniques etc.
But that’s only part of the game. Most of the time people get lazy, or get into a rut — they reuse techniques and tools, or it’s too expensive to keep moving. It makes sense to build descriptions like this that we can adapt over time. It also helps us spot when we’re outside the frame.
Right. Time to get back to those counters.
References:
[1] Hutchins, E. M., Cloppert, M. J., and Amin, R. M., “Intelligence-Driven Computer Network Defense Informed by Analysis of Adversary Campaigns and Intrusion Kill Chains”, 2011
[Cross-post from Medium https://medium.com/misinfosec/responses-to-misinformation-885b9d82947e]
There is no one, magic, response to misinformation. Misinformation mitigation, like disease control, is a whole-system response.
MisinfosecWG has been working on infosec responses to misinformation. Part of this work has been creating the AMITT framework, to provide a way for people from different fields to talk about misinformation incidents without confusion. We’re now starting to map out misinformation responses, e.g.
Today I’m sat in Las Vegas, watching the Rootzbook misinformation challenge take shape. I’m impressed at what the team has done in a short period of time (and has planned for later). It also has a place on the framework — specifically at the far right of it, in TA09 Exposure. Other education responses we’ve seen so far include:
Education is an important counter, but won’t be enough on its own. Other counters that are likely to be trialled with it include:
Jonathan Stray’s paper “Institutional Counter-disinformation Strategies in a Networked Democracy” is a good primer on counters available on a national level.
I’m one of the DEFCON AI Village core team, and there’s quite a bit of disinformation activity in the Village this year, including:
Why talk about disinformation* at a hacking event? I mean, shouldn’t it be in the fluffy social science candle-waving events instead? What’s it doing in the AI Village? Isn’t it all a bit, kinda, off-topic?
Nope. It’s in exactly the right place. Misinformation, or more correctly its uglier cousin, disinformation, is a hack. Disinformation takes an existing system (communication between very large numbers of people) apart and adapts it to fit new intentions – whether that’s temporarily destroying the system’s ability to function (the “Division” attacks that we see on people’s trust in each other and in the democratic systems that they live within), changing system outputs (influence operations to dissuade opposition voters or change marginal election results) or making the system easy to access and weaken from outside (antivax and other convenient conspiracy theories). And a lot of this is done at scale, at speed, and across many different platforms and media – which if you remember your data science history is the three Vs: volume, variety and velocity (there was a fourth v: veracity, but erm misinformation guys!)
And the AI part? Disinformation is also called Computational Propaganda for a reason. So far, we’ve been relatively lucky: the algorithms used by disinformation’s current ruling masters, Russia, Iran et al, have been fairly dumb (but still useful). We had bots (scripts pretending to be social media users, usually used to amplify a message, theme or hashtag until algorithms fed it to real users) so simple you could probably spot them from space – like, seriously, sending the same message 100s of times a day, at a rate even Win (who’s running the R00tz bots exercise at AI Village) can’t type at – backed up by trolls – humans (the most famous of which were in the Russian Internet Research Agency) spreading more targeted messages and chaos – with online advertising (and its ever so handy demographic targeting) for more personalised message delivery. That luck isn’t going to last. Isn’t lasting. Bots are changing. The way they’re used is changing. The way we find disinformation is changing (once, sigh, it was easy enough to look for #qanon on twitter to find a whole treasure trove of crazy).
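A sketch of that ‘spot them from space’ heuristic — flag accounts posting the same text at inhuman rates (the file and column names are assumptions):

import pandas as pd

tweets = pd.read_csv('collected_tweets.csv')  # hypothetical collection with userid, text, date columns
tweets['day'] = pd.to_datetime(tweets['date']).dt.date
counts = tweets.groupby(['userid', 'text', 'day']).size().reset_index(name='times_posted')
suspects = counts[counts['times_posted'] > 100]  # nobody hand-types the same message 100+ times a day
print(suspects[['userid', 'times_posted']].drop_duplicates())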
The disinformation itself is starting to change: goodbye straight-up “fake news” and hunting for high-frequency messages, hello more nuanced incidents that mean anomaly detection and pattern-finding across large volumes of disparate data and its connections. And as a person who’s been part of both MLsec (the intersection of machine learning/AI and information security) and misinfosec (the intersection of misinformation and information security), I *know* that looks just like a ‘standard’ (because hell, there are no standards, but we’ll pretend for a second there are) MLsec problem. And that’s why there’s disinformation in the AI Village.
If you get more curious about this, there’s a whole separate community, Misinfosec http://misinfosec.org, working on the application of information security principles to misinformation. Come check us out too.
* “Is there a widely accepted definition of mis vs disinformation?” Well, not really, not yet (there’s lots of discussion about it in places like the Credibility Coalition’s Terminology group, reading papers like Fallis’s “What is Disinformation?”). Claire Wardle’s definitions of dis-, mis-, and malinformation are used a lot. But most active groups pick a definition and get on with the work – for instance, this is MisinfosecWG’s working definition: “We use misinformation attack (and misinformation campaign) to refer to the deliberate promotion of false, misleading or mis-attributed information. Whilst these attacks occur in many venues (print, radio, etc), we focus on the creation, propagation and consumption of misinformation online. We are especially interested in misinformation designed to change beliefs in a large number of people.” My personal one is that we’re heading towards disinformation as the mass manipulation of beliefs — not necessarily with fake content (text, images, videos etc), but usually with fake context (misattribution of source, location, date, context etc) and the use of real content to manipulate emotion in specific directions. Honestly, it’s like trying to define pornography – finding the right definitions is important, but can get in the way of the work of keeping it out of the mainstream, and if it’s obvious, it’s obvious. We’ll get there, but in the meantime, there’s work to do.