Data Safari rough notes: “pink slime” network

Warning: this is a set of rough notes, for other geeks to read. There’s an ungeeked post about this for people who don’t want to wade through code.

2019 was mostly about building infrastructure and communities, but every so often I did a little “data safari” on a piece of misinformation that interested me. 

Data safaris are small looks into an area of interest: they’re not big enough to be expeditions, but they’re not standing prodding the bushes a couple of times and then walking away either. How this (typically) works is: something interests me. I see an article, or a hashtag, or an oddity that I’d like a better look at, so I take a cursory look, write a rough plan of how I want to investigate it further, then start at one corner of that world and traverse out from it, making notes on things of interest, or things I want or need to remember, as I go. Whilst I’m traversing, I’m also filling in a spreadsheet or doc with the end data I want, and collecting/storing other datasets (twitter, facebook, text from sites etc.) as needed. On the way, I’ll find other things I want to explore but won’t want to interrupt my flow through whichever thing I’m on at the time – for these, I leave myself an “action” note, which I overwrite with “done” or “ignored” once I’ve either gone back to that branch and followed it, or decided not to chase it down.

This one was pink slime: Hundreds of ‘pink slime’ local news outlets are distributing algorithmic stories and conservative talking points

This article is from the Tow Center – the data journalism unit at Columbia that has Jonathan Albright (d1gi) as one of its people working on misinformation data – so I had a quick look at his recent datasets https://data.world/d1gi (nothing on data.world in the past 9 months), medium and twitter https://twitter.com/d1gi – looks like this piece isn’t his. Back to the article.

“Pink slime” = “low cost automated story generation”.  This stuff is going to get worse, as text generation improves (and remember that many of these sites are heavy on the image:text ratio anyways). 

No link to data given in the article.  Article writer is Priyanjana Bengani – talks about “we”, writes about datasets.  Checking their accounts in case they released a dataset somewhere (unlikely given the subject, and what happened when Barrett Golding released a misinformation sites dataset, but worth checking).  Nothing on https://twitter.com/acookiecrumbles.  No sign of a dataset being released, so will have to start chasing it up by hand.  Which, admittedly, will be fun. Can I find more than 450?  [edit: yes, but it’s going to take an age to go through them all]

Okay. Quick-and-dirty plan: 

  • Quick reconstruction: grab the sites listed in the article, in the articles it mentions and any other local news outlets that pointed at “pink slime” or the players listed in the past month.  That’s the low-low-hanging fruits.
  • Twin efforts: do what I would normally do; rebuild from the methods they’ve listed (which have lots of overlaps, because there are only so many links across sites, although…)
  • Clean up the lists, rerun to see if anything got missed.
  • Look for other sites using less-obvious markers like social behaviours and links.   

1. Quick reconstruction

1.1 check articles

Todo list:

  • Done: check CJR article
  • Done: look for local “pink slime” articles
  • Done enough for now: look for local articles about actors/entities in this story

Clues from cjr article

  • Lansing State Journal broke news on Oct 20 2019
  • Done: check Lansing State Journal
  • Further reporting by the Michigan Daily, the Guardian and the New York Times got to about 200 sites
  • Done: check Michigan Daily
  • Done: Check Guardian
  • Done: Check New York Times
  • Columbia analysis got to at least 450 sites
  • 189 of these were set up as local news networks across 10 states in the last 12 months by Metric Media
  • Metric Media is just one component in network

Lansing State Journal article:

  • Carol Thompson article.  (517) 377-1018, ckthompson@lsj.com, @thompsoncarolk.
  • “Nearly 40 new sites”
  • First found by Matt Grossmann, director of Michigan State University’s Institute for Public Policy and Social Research (he leads this;  517 355-6672, grossm63@msu.edu, @mattgrossmann, www.mattg.org).
  • Site: micapitolnews.com – interesting url; action: try other state +capitolnews.com combinations?
  • The “About us” pages of the MI sites say they are published by Metric Media LLC, to fill the “growing void in local and community news after years of steady disinvestment in local reporting by legacy media.”
  • Bradley Cameron is CEO of Metric Media and Situation Management Group (per his online biography page)
  • Privacy pages say the sites are operated by Locality Labs LLC, a Delaware company that is similarly affiliated with a network of local sites in Illinois and Maryland, and business sites in nearly every U.S. state.
  • Done: West Cook News’ about page says “West Cook News is a product of LGIS – Local Government Information Services” – chase this up too?
  • Done: West Cook News has a list of “other publications” on the bottom. 
  • Locality Labs CEO is Brian Timpone
  • At this point, I’m looking at these publications and wondering if they’re really misinformation sites.  Right now, they’re looking more like the clipper magazine “Our Times” that we got in New Jersey – which was foaming at the mouth right-wing and filled with hateful rhetoric between the adverts and dad joke cartoons, but a) would take work to establish just how much disinformation was flowing through them and b) being on a political ‘side’ should never be a reason to call ‘misinformation’ – that’s why Tim and I set up a completely new set of labels for domains.  Ah good – this gets addressed in the article, which talks about the difference between “fake news” and “information with a perspective” that’s been dressed up to look like objective local news. This is the sort of thing we saw with the Jenna Abrams troll: mostly ‘useful’ information, with occasional forays into political strong opinions. Action: check percentage of political articles on some of these sites. 
  • Side note: Lansingstatejournal.com says I have 4 free articles left.  If every news outlet I need to look at does this, or worse paywalls me, this is going to be a long night of research. 
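The micapitolnews.com naming pattern above suggests a quick brute-force: generate state-abbreviation + capitolnews.com candidates and see which resolve. A minimal sketch – the abbreviation list is truncated for illustration, and the live check is left as a by-hand step:

```python
import requests

# Truncated abbreviation list for illustration; use all 50 states in practice
STATE_ABBREVS = ['al', 'az', 'ca', 'fl', 'ia', 'il',
                 'md', 'mi', 'mn', 'mt', 'nc', 'nm']

def candidate_capitol_urls(abbrevs):
    """Build candidate URLs following the micapitolnews.com pattern."""
    return ['https://%scapitolnews.com/' % a for a in abbrevs]

def is_live(url, timeout=10):
    """True if the candidate domain answers with a 200 (network call)."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

candidates = candidate_capitol_urls(STATE_ABBREVS)
# live = [u for u in candidates if is_live(u)]  # run by hand; hits the network
```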

Looked for origin of story – Matt Grossmann

  • Nothing on www.mattg.org
  • Twitter search on “@mattgrossmann until:2019-10-22” (original article came out on 20th October) found reference to original article and lots of tweets about it (hello again, @emptywheel). 
  • Good seeing people noodling about ways to counter: find host, see if can take down because violating ToS (yep, got that one); feed list to Google etc for downranking (yep, got that); Maggie Haberman’s Index Of Approved Sources Of Information  (not heard of that at all). 
  • Hmm: “Metric Media LLC maintains a licensing agreement with the Metric Media Foundation, a Missouri 501(c)(3) non-profit news organization.”
  • @bywillpollock shows links to “The Ohio Star” articles on facebook groups Teachers for Trump, Blacks for Trump, Students for Trump Pence (following facebook group links has been a really rich source of URLs for me in the past – it’s how I found the northern european antivax sites)
  • Action: search facebook for site names/urls; spider out: check other URLs on pages that mention them, check “related pages” if they’re for the sites themselves etc. 
  • @aphexmandelbrot searches for first sentence in ToS “access to and use of Locality Labs” (yep – I also find interesting phrases on a site and search for them; site creators are lazy), gets “at least 400” results.
  • Action: yeah, try the ToS sentence (just automate the google part)

Checking Michigan Daily

  • Y’know, everyone thought data scientists had a glamorous job ‘til they tried it themselves and discovered it’s mostly doing lookups, cleaning datasets and discovering that the less-sexy algorithms are more stable today.
  • Ohmy – followed links to Ann Arbor Times, Grand Rapids Reporter – their pages were at first glance identical (which will make some of the GDI featureset work very nicely, thank you).  As was https://nemontananews.com/ – the same format, the same gun image in the top LHS article.  Hang on – the Grand Rapids Reporter article is “from Great Lakes Wire” which is on the naughty list, and the Montana article claims to be “from big sky times” – so there’s another way to find linked sites.
  • Ignored: check sites referenced by the repeated gun article (ignored because the related sites are listed at the bottom of each site, and Big Sky Times is there already).
  • So the articles with politically skewed content are written by humans and have bylines.  The rest of the articles are generated (by Local Labs News Service), with no byline. Useful to know. 
  • Action: “fill the void in local communities” is a useful search phrase
  • New networks in Montana and Iowa
  • Action: grab list of sites from bottom of page in every state
  • Locality labs is https://locallabs.com/; Metric Media Foundation is http://metricmedia.org/
  • Locality labs operates networks in Maryland and Florida.
  • Locality Labs… emerged out of Journatic, LLC and BlockShopper, LLC (note use of SEC archive to find this)
  • Aside: this article has lots of interesting information in it.  Interesting (but not to the data search) includes: “Tribune Media Company, a media conglomerate that once owned major outlets including the Chicago Tribune and the Los Angeles Times, invested in Journatic as a service to provide hyperlocal news coverage. Journatic reportedly distributed fabricated and plagiarized content and used workers in the Philippines writing under pseudonyms to remotely produce stories. Due to these scandals, many outlets suspended their use of the service and stopped publishing Journatic articles. Tribune Media has since been absorbed by Nexstar Media Group, Inc., the largest local television and media company in the U.S., but still has not fully divested from the venture, which has since reorganized as Locality Labs. The Daily requested comment from Gary Weitman, chief communications officer at Nexstar, regarding investment in Locality Labs, and was told Locality Labs was not a subsidiary. While acknowledging Nexstar does partially own Locality Labs, Weitman downplayed the influence of Nexstar’s investment.”   I remember these stories back from – 2012? Some of this stuff has long roots, and I’m wondering now if some of the emerging disinformation industry in the Philippines can be traced back to these original content generating factories, or if the links to them have been completely severed, or might be stood up quickly again for 2020? 
  • “Timpone is also the co-founder of Local Government Information Services, a network of more than 30 Illinois print and web publications that have been considered to propagate conservative news and hold an identical layout to Metric Media’s websites.” – West Cook News listed itself as being a LGIS site. So this is an issue.  West Cook News is definitely mentioned in the Lansing State Journal article, but the Michigan Daily article seems to be hedging about it and associated sites.  Is it or isn’t it a disinformation site – that’s a question I just spent a year of my life on, and in the end it’s the wrong question. The question isn’t “how do we tell everyone about this huge set of fake news sites we’ve found”, but more “how do we make sure people are aware that they’re reading a syndicated publication which isn’t clear about its connections and has a likelihood of disinforming – regardless of its political slant – and how do we make this situation better?  Better in this case can range from working with the sites’ owners (if possible) to improve the way people understand them, to producing better local news feeds with less bias, to using whatever levers are needed to deal with genuine disinformation campaigns. Each of these comes with a burden of empathy, proof and genuine interest in the connection from grassroots local activity and information upwards. 
  • Okay. Back to the data. I’m going to include those sites for now, but make sure they’re tagged carefully as LGIS.  The article overlap between them and the Metric Media sites should be a good indicator of where we need to look across them.
  • Another possible network branch: “Timpone is associated with Franklin Archer, a publishing organization operated from Chicago. Franklin Archer hosts a similar network which consists of a set of nationwide business journals. Earlier this year, Franklin Archer published the Hinsdale School News — a publication that infringed upon the name and logo trademarks of Hinsdale High School District 86 in Illinois and potentially violated election law by attempting to influence the vote on a $140 million school district referendum.”
  • Aside: mentions the IFFY quotient for URL mentions on social media sites, giving roughly how many of their mentions are okay, unknown and known to be dodgy https://csmr.umich.edu/platform-health-metrics/  Action – look into this more (but later)
  • Aside: this end section seems to be key to why peeps should care about this: “The two of those collectively lead to a situation where it’s fairly easy to distribute extremely partisan, low quality or complete misinformation in a way designed to influence voters,” Pasek said. “It appears that these news sites purporting to be Michigan-based news outlets are attempting to do this, to some extent. They’re targeting Michigan in part because Michigan is viewed as a critical state for this upcoming election, with the goal of providing a presentation that implies that the local news story is indeed one that’s more favorable to the president, and less favorable to his potential opponents, whoever they may be.” Pasek said while it’s normal for outlets to have different biases, these sites are disregarding journalistic standards. “The question is not about bias — it’s about journalistic standards and how journalists are misunderstood,” Pasek said. “It’s okay to have outlets that have varying different views out there, but there’s a certain point at which the attempt to be an outlet with a particular angle oversteps how journalism is supposed to operate. And once that occurs, now there becomes a substantive question as to whether what you’re observing is in fact news, or is instead a disinformation campaign.”
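The byline observation further up (generated stories credited to Local Labs News Service, politically skewed stories carrying a named human byline) can be turned into a rough triage filter. A hedged sketch – the 'byline' class name is an assumption, not a verified selector on these sites:

```python
from bs4 import BeautifulSoup

def looks_generated(article_html):
    """Rough triage from the notes: generated stories carry a
    'Local Labs News Service' credit and no named byline.
    NB: the 'byline' class name is a guess, not a verified selector."""
    soup = BeautifulSoup(article_html, 'html.parser')
    if 'Local Labs News Service' in soup.get_text(' '):
        return True
    byline = soup.find(class_='byline')
    # No byline element, or an empty one, counts as "probably generated"
    return byline is None or not byline.get_text().strip()
```

Run over a sample of article pages per site, this would give the "percentage of political articles" figure flagged as an action earlier.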

Now the Guardian and NYT.  These are both heavy hitters, data journalism-wise, and can call on a wide network of expert researchers (I know; I fed into some of their stuff).  Maybe we’re seeing a pipeline here too – from a local researcher to a local news organisation, then nationals, then journalism school, then wider again? 

The Guardian

  • Leads with the Hinsdale School News story, about using a local news site to swing a local vote.  “This was purposely done to mislead people into thinking that was a publication from the district.”
  • Franklin Archer has a list of publications – operated by Locality Labs https://franklinarcher.com/our_publications
  • Jeanne Ives’ payment followed by praise in papers is interesting. Can we track these articles / did she pay any other networks of interest?  Action: check Ives FEC filings for more media payments around the $2k range
  • Action: cross-map locations of these papers to a) the news deserts in University of North Carolina study and b) battleground states. Carto is your friend here…  maybe also map against Sinclair TV coverage

New York Times

  • Author: Dan Levin @globaldan
  • Nice use of 2×2 grid to show how similar the sites look
  • Minimal advertising on sites, plus promotional push on facebook
  • Action: check out facebook links – if they’re promoting, it should be visible
  • “Many if not all of the sites were registered on June 30 and updated on the same day in August, according to online domain records.” Hmm. This happens a lot, and is likely to happen a lot in 2020. Could the registries alert when large batches of newsy URLs get registered together? Yes, there are easy counters to this, but unless it gets automated, generally people are lazy. 

Twitter search on NYT article

  • Twitter search “@globaldan until:2019-10-22”
  • Article about student papers filling in news deserts https://t.co/Tb8iMaiF60?amp=1
  • Yeah. That’s it. 

Oh the glamour. 2am and still going through the basic datanerding. Well, on to the adding lots of sites to the list part (current haul is the 23 sites mentioned in articles, but we have the Franklin Archer list, and the state publications list at the bottom of each site to add in, as easy wins)

Looking for “pink slime” articles

  • Google ‘“pink slime” misinformation’ – I learn this is a term for mechanically separated meat, which has its own misinformation subgenre
  • Googled ‘”pink slime” misinformation -beef -meat -food’
    • Mediawell has a link to Columbia article – nothing new added
    • Ooch, I should listen to the CarolC podcast – Action: do later
    • Yeah… the CJR article seems to be the only misinformation piece talking about “pink slime” (I haven’t heard the term before, but I don’t keep up with all the journos covering misinformation and could have missed it) – count as a dead end.

Look for misinformation-related terms plus actors/objects above:

  • Google ‘“Lansing state journal” misinformation’ – lots of unrelated articles
  • Google ‘“Lansing state journal” “Metric media”’ – looking for articles spawned by the first one
    • Detroit metro times – added some more sites (was the list of 40 circulated, or did they look at the page ends for these hyperlocal sites?)
    • Index journal – says lansing state journal found 47 sites.   
      • SC based. Written by Matthew Hensley 864-943-2529 mhensley@indexjournal.com, @IJMattHensley
      • “Metric Media is a division of Situation Management Group Inc., a firm that specializes in crisis response but provides a number of other marketing services.”  SMG CEO Brad Cameron says in his profile that Metric Media “operates more than 1,100 community-based news sites”  
      • “Headlines on the bulk of these stories contain the name of a small, South Carolina community, including such towns as Arial, Branchville, Fairplay, Fingerville, Jacksonboro and Mountville. Hundreds of others included ZIP codes. These prominent local references appear to be a tactic for search engine optimization that seems aimed at getting eyeballs from across the Palmetto State.” – yeah, see lots of this SEO in con-artist posts.  
      • “Under the umbrella of the Metro Business Network, one such site exists for each state and for Washington, D.C.” Action: find these sites
      • “the South Carolina one has caught fire on social media, with nearly 4,000 combined followers between Facebook and Twitter.” Action: find facebook, twitter for each of these publications
      • Sensible counter: suggestions alternative local news sources on both left and right of political spectrum
    • Nieman labs – talks about local news having high levels of trust vs low profitability. I think this point is important here – this is the vulnerability. Links current work back to Laura McGann 2010 work.
      • Lot of history in here: 2016 fake sites like @ElPasoTopNews, @MilwaukeeVoice, @CamdenCityNews, @Seattle_Post, Denver Guardian. 
      • Mentions democrat plans to launch outlets too. Action: check link to democrat network.  
      • Has lots of examples of other current news sites set up with political bias/ by political operators. Action: re-read this article after safari, and do same exercise with other links in it.
      • I think this mixing of intentions and outputs is a key part of the current internet.  That you can be both a political party operation and a news outlet at the same time is easy when the endpoint is a website, not e.g. a newspaper.  “In all of these cases, the issue is less about politicians promoting their points of view than hiding their affiliation with the content — making it hard for a reader who would naturally bring more skepticism to a campaign ad than they would a local news story.” – so this is about those levels of trust, skepticism, the way that people read differently sourced material differently.  It’s basically the Admiralty Scale without the “from” part of the scale – so people apply their own “from” rating to each article, and we’re terrible at making that snap judgement. Action: important point- write note on this.
    • The Washington Tribune Company has “as many as 1,200 locally-focused URLs from AlaskaTribune.com to WichitaLedger.com”
    • Publishing insider – nothing much new; 1 new site named
    • Holland Sentinel – looks at hollandreporter.com
      • Arpan Lobo alobo@hollandsentinel.com @arpanlobo
      • Is there a master list of local news outlets in the US?  Does it include originators/ owners? I think I can remember one being created when the first local-is-dying stories started to break.  Action: look for list of US local news outlets. Map density against new sites? 
      • Nice piece of further digging
      • “Other Metric Media sites include Michigan Business Daily. There is a “Business Daily” website for all 50 states, part of the Metro Business Network. The websites have a similar layout to those in the Michigan Network.”
      • Action: find all the “xx business daily” sites
    • Livingston Post – story from PoV of local news outlet
      • Looks at Livingston Today – “if you look at the site today, there isn’t a single local news story on it, unless you want to know the price of gas in Pinckney, or read a few calendar items scraped from the web.”
    • Action: continue google search for local outlets reporting on this story (specifically, looking for local reporters who’ve found new parts of the network)

1.2 scrape from pages

Grab site lists from bottom of each known site

  • Look at annarbortimes.com.  We want the list at the bottom of the page:
  • Inspect element shows this is in <ul class="footer__list">. We’re going to do this for up to 50 states, so write a small scraper
  • Try something simple: 
import requests
from bs4 import BeautifulSoup
import pandas as pd

pubtitle = 'Ann Arbor Times'
puburl = 'https://annarbortimes.com/'
# browser User-Agent, in case the site blocks the default python-requests one
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.get(puburl, headers=headers)
response.text
  • Look for the UL with the site names in: 
data = BeautifulSoup(response.text, 'html.parser')
site_uls = data.find_all('ul', attrs={'class': 'footer__list'})
site_uls[0]
  • Yep, that’s all of them – in three divs, then lis.  We can use beautifulsoup (or scraper of choice) to iterate through these, pull out the names/urls and stuff them in a csv.
sudata = site_uls[0]
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]
df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle + '.csv')
df
  • Scraped MI (via annarbortimes.com)
  • Trying collegeparktoday.com (Florida) – this has a different html format to the MI sites
    • Done: adjust scraper for Florida sites
  • Scraped IL (via grundyreporter.com)
  • Scraped MD (via mdstatewire.com)
  • Scraped IA (via iowacitytoday.com)
  • Scraped MT (via nemontananews.com)
  • Scraped AZ (via grandcanyontimes.com)
  • Scraped NC (via hickorysun.com)
  • Look at michigandaily.com – it isn’t on the sites list, has a different format, looks like a student paper?  Action: check out michigandaily.com – should it be in here?
  • Palmetto Business Daily is part of a network of state-specific sites.  Has a different div html, but otherwise scraper should work. Html is <div.col-xl-6.col-lg-5.col-md-5.col-sm-12.col-12> (and “copy selector” in inspect element is very useful).  Action: adjust scraper, scrape site list from Palmetto Business Daily
  • Thinking that we could do with a list of states for the google search, so we don’t get *all* sites back the first time, but go state by state through them. 
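Rather than hard-coding footer__list, the per-state scraper can take the CSS selector copied from inspect element as a parameter, which covers the Florida and Palmetto Business Daily layout differences in one function. A sketch – the Palmetto selector is the one noted above and is likely to change:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'}

def scrape_site_list(pubtitle, puburl, css_selector, html=None):
    """Pull the linked-publications list from a page, given the container's
    CSS selector (via 'copy selector' in inspect element).
    Pass html directly to skip the network fetch."""
    if html is None:
        html = requests.get(puburl, headers=HEADERS).text
    soup = BeautifulSoup(html, 'html.parser')
    rows = [[pubtitle, puburl]]
    for container in soup.select(css_selector):
        for a in container.find_all('a'):
            rows.append([a.text, a['href']])
    return pd.DataFrame(rows, columns=['Site', 'Url'])

# Selector noted above for Palmetto Business Daily:
# scrape_site_list('Palmetto Business Daily', 'https://palmettobusinessdaily.com/',
#                  'div.col-xl-6.col-lg-5.col-md-5.col-sm-12.col-12')
```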

Automating the google search

  • https://www.geeksforgeeks.org/performing-google-search-using-python-code/ is good step-by-step guide
  • Thinking that we could either spend our time hunting for these local news aggregator sites, or we could create aggregator sites without the politics, that local people might use.  Action: write up this thought somewhere, thinking about effort and reward. 
  • Thinking that what we’re seeing here is someone who really thought about the internet, what it does, what types of sites are around, and then spent time thinking hard (with post-its?) about how to adapt it as a political weapon.  Action: write up this thought too. Think about what else we haven’t seen yet, that we probably will if this planning session(s) had taken place. 
  • Thinking about finding more outside this network. What if we generated a bunch of likely titles from the patterns above, and went looking for them? Action: write small generator for news site names
  • One way to look for local news covering these sites is to google search for the site URLs – find who’s pointing to them and isn’t already on the naughty list. Action: google search for site URLs, looking for links to them that aren’t already on the naughty list. 
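For the site-name generator action above, a minimal sketch that crosses place names with the suffixes seen across the network (times, reporter, sun, wire, today…). The outputs are guesses to feed into search or DNS checks, not confirmed sites:

```python
from itertools import product

# Suffixes observed across the known sites in this network
SUFFIXES = ['times', 'reporter', 'sun', 'news', 'wire',
            'today', 'standard', 'journal', 'record', 'ledger']

def candidate_names(places, suffixes=SUFFIXES):
    """Cross place names with observed suffixes to get candidate domains.
    These are guesses to search for, not confirmed network members."""
    return ['%s%s.com' % (p.replace(' ', ''), s)
            for p, s in product(places, suffixes)]
```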

Text searches (by hand first)

from googlesearch import search 
query = '"is a product of LGIS - Local Government Information Services" site:.com/about-us'
js = [j for j in search(query, tld="com", num=100, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]
js
  • The extra_params={'filter': '0'} gives us the extra search results we wanted. Found this using help(search) in the notebook. Currently on a “429: Too Many Requests” timeout.
    • Have to sleep now – Action: in the morning, write code that searches for the url roots from the google search in the pink_slime_sites master list, and spits out anything new. 
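Since the 429 timeouts are going to keep happening, a small retry-with-backoff wrapper helps; this is a generic sketch, not specific to the googlesearch library:

```python
import time

def search_with_backoff(do_search, max_tries=5, base_wait=60):
    """Retry a zero-arg search callable when rate-limited (HTTP 429),
    doubling the wait between attempts; re-raise anything else."""
    wait = base_wait
    for attempt in range(max_tries):
        try:
            return do_search()
        except Exception as e:
            if '429' not in str(e) or attempt == max_tries - 1:
                raise
            time.sleep(wait)
            wait *= 2

# e.g. search_with_backoff(lambda: list(search(query, pause=10.0)))
```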
  • 2019-12-24
  • Checking what came back from google search (unholy bad code):
import re
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_test.csv', index=False)
df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
set(newsites) - set(fromsites)
  • No new sites found using the string above.  Let’s see what *didn’t* have it in… 
  • Looking at bigskytimes.com – has a link to facebook.  We could pull the facebook pages for all these sites, see what’s listed as “related sites”
  • Looking at Northern California Record – the list of other sites is across many states
  • About_us says “OUR GOAL at the Norcal Record is to cover Northern  legal system in a way that enables you, our readers, to make the public business your business.” … “The Record is owned by the U.S. Chamber Institute for Legal Reform.”
    • Action: look further into Northern California Record – is this another loop?
  • I’m thinking about the cathedral and the bazaar now.  This looks like a lot of sites have been created – we started with ‘local’ news sites, but this is a subject-specific site.  Is this like having a market with stalls, where stall keepers are shouting their wares, and someone comes in with a team and megaphones and drowns out all the other voices?  We want voices online- we want everyone to be able to speak – but this somehow feels unfair. In the market, there would be an ombudsman – someone who could be asked to make this right. Who are the ombudsmen of the internet? 
  • Trying a new search term: “Metric Media LLC began to fill the void in local and community news”
    • 7 new sites found: [‘eastnewmexiconews.com’,  ‘enchantmentstatenews.com’,  ‘nenewmexiconews.com’, ‘santafestandard.com’,  ‘scminnesotanews.com’, ‘seminnesotanews.com’, ‘swnewmexiconews.com’]
  • Scraped NM
  • Scraped MN
  • Trying a google search for “access to and use of Locality Labs” in /terms
    • Search in /about-us gave lots of results too
    • Lots of results, not all local news
    • Includes manilabusinessdaily.com.  Action: look at manilabusinessdaily – is this model going overseas too?
    • Found staging sites? http://louisianarecord.pli-records-staging.locallabs.com/ – worth checking the main site for these?  Action: general search on site pli-records-staging.locallabs.com
    • Grimesjournal.com says it’s a set of local business listings. It has story links to Iowa Business Daily and Indiana Business Daily, but no list of linked sites.   (We’ve already noted these business daily sites above). Is Grimes a hyperlocal spin-off of the business dailies (Grimes is a city in Iowa)?  Action: look for texts in Grimes, or links to the business dailies, to see if there are other hyperlocal sites like this one.  
    • https://lynwoodtimes.com/ is a similar site with local business listings. Feels like someone is astroturfing all the “useful” but hard to monetise sites – this is Our Town all over again. Looking at their facebook site, get an unavailable notice.
  • More local business listing sites: urbandaletimes.com, https://lansingreporter.com/, glendalesun.com
    • Action: lansingreporter.com wasn’t flagged by original article – is this a new site? Check dates on it. 
    • Action: add all the clipper sites. 
    • Action: look for more clipper sites.
    • Action: add business daily sites from from_articles list.
    • Statesman.com looks very different to the other news sites; has login, more production values?  “Austin American-Statesman is owned by Gannett Media Corp”…. Action: check austin american statesman.  Action: check Gannett media corp
    • https://torontobusinessdaily.com/ – no “other sites” listed, but check Canada too? Action: check torontobusinessdaily.com and look for other sites in Canada
    • https://gulfnewsjournal.com/ – looks like the businessdaily sites.  Action: check gulfnewsjournal.com and for related non-US sites too?
    • FDAreporter.com – says it’s “FDA Reporter is a trade journal for the U.S Food and Drug Administration’s employees and contractors, covering personnel moves, budgets, acquisitions, contracting and hiring. This includes the Office of the Commissioner, Office of Operations, the Offices of Regulatory Affairs and Global Regulatory Affairs, and Office of International Programs.”.  Is this real, or a site aimed at a vertical? Action: check FDA Reporter provenance.
    • Tobacconewswire.com – “Tobacco News Wire covers federal and state regulation and taxation of tobacco, including the U.S. Food and Drug Administration’s Center for Tobacco Products. Topics include the rise of e-cigarettes, the process through which tobacco companies get products approved by the FDA, and how these things affect retailers and manufacturers.”.  Action: check provenance of tobacconewswire.com
    • Montgomerymdnews.com – caught my eye.  Don’t we have this in MD already? Nope, turns out we have montgomerynews.com because methinks the person who set up the site list at the end got the name wrong (montgomerynews.com is a very different site, different owners). Action: flag montgomerynews vs montgomerymdnews confusion to other list owners?
    • Wealthmanagementwire.com looks different to other trade sites; about page different too, mentions funding “Wealth Management Wire is supported – in part – through sponsorships with brands interested in reaching our audience of personal wealth management, banking and life insurance professionals , advisors, brokers and America’s C-Suite. Interested in partnering? Email us at partners@wealthmanagementwire.com.”  Hrdailywire.com is same format as wealthmanagementwire.com 
    • This creating a site for each vertical looks to me like the way meetup.com was created (by speculatively creating and populating pages for groups that might be of interest to people, rather than having people create all the groups entirely from scratch).  Is it worth looking at other interesting origin stories and seeing if those could be adapted this way too? Action – write note about adapting company origin stories for astroturfing etc.
    • FDAhealthnews.com: “FDA Health News covers the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research and its Center for Devices and Radiological Health. It reports on the FDA’s process of testing and approving new drugs, its interactions with pharmaceutical companies, and any other news related to the FDA.”
    • https://hansondirectory.com/about-us is interesting – is a directory service.  Might be able to spider out from its facebook page https://www.facebook.com/HansonDirectory  Action: spider out from HansonDirectory facebook page – look for sites.
    • Surprisejournal.com gave a 503 error – ran it from console, *lots* of back end calls… looks like another hyperlocal site.
    • Epnewswire.com “EPNewswire is a business journal focused on the intersection of science-based regulation and basic industries, including food, energy and chemical production.”
    • Is it me, or are there a bunch of trade journals here, all reposting newswire content, but all focussed on divisive subjects?
    • Cistranfinance.com “The independent CISTRAN Finance news service serves as a global hub for business and banking news for approximately 70 percent of the world’s population in Eurasia and Afro-Asia.” – this is an odd one. Wide area, not understanding the point of doing this, who/why.
  • So. Have about 8(?) states so far, need to go looking for more.  Try some of the headlines? 
    • Found several local (clipper) sites. https://mahaskaguide.com/about-us includes “We have published directories for independent telephone companies since our inception in 1973. Current production includes 108 telephone directories for more than 130 U.S. telephone companies spread across 23 states.”
    • NB DuckDuckGo has results on news phrases from sites, even when Google produces no results. 

So now I have a bunch of sites, several of which appear to be related (in subject, look and feel etc), but will take further investigation to see whether they’re part of something underhand, or just normal business.  I suspect we’ll see a lot of this in future: companies that do legit business online (e.g. directory services), also providing (wittingly or unwittingly) disinformation carriers. 

Networks of related sites found so far:

  • Statewide “news” sites (e.g. LansingSun.com – NB Florida is different)
  • State-by-state “business daily” sites (e.g. mdbusinessdaily.com)
  • Non-US “business daily” sites (e.g. torontobusinessdaily.com)
  • Clipper sites (e.g. lynwoodtimes.com)
  • Meta-level sites (hansondirectory.com, metrobusinessnetwork.com)
  • Unrelated? Sites (e.g. statesman.com, michigandaily.com)
  • “Record” sites (e.g. norcalrecord.com) and staging sites (e.g. wvrecord.pli-records-staging.locallabs.com)
  • Trade sites (e.g. farminsurancenews.com)
  • Issue sites (e.g. epnewswire.com)
  • Odd sites (e.g. newsroom.westandforprogress.com)

Leftover action: not sure how many search results come back from google search api. Recheck code https://github.com/anthonyhseb/googlesearch/blob/master/googlesearch/googlesearch.py

We have the original set of local news sites.  Going to add all the other sites above to the main list, and make that a list of connected sites, before checking them all for disinformation behaviours (e.g. look at their non-automated stories). Action: check all sites on main list for disinformation behaviours.

  • Florida localnews: scraped.
  • Palmetto business daily: scraped. Looked at Alabama Business Daily, got a popup with no ‘close’ button (“Pulse…” not optimised something or other).
  • Non-US business dailies: copied list (4 sites) into master list,  but will need a different search type to find others. Action: search for non-US business dailies.

Current counts (sites found):

  • localnews                230
  • businessdaily             50
  • clipper                   12
  • trade site                10
  • businessdaily – nonUS      5
  • pressure group             3
  • staging                    3
  • metasite                   2
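These counts are hand-tallied as I go; with the master list exported to a spreadsheet, the same tally could be sketched in pandas (the column names and example rows here are assumptions — match them to whatever the actual export uses):

```python
import pandas as pd

# Hypothetical slice of the master site list: one row per site,
# with a 'type' column holding the category.
sites = pd.DataFrame({
    'url': ['lansingsun.com', 'mdbusinessdaily.com', 'torontobusinessdaily.com',
            'lynwoodtimes.com', 'farminsurancenews.com', 'flbusinessdaily.com'],
    'type': ['localnews', 'businessdaily', 'businessdaily - nonUS',
             'clipper', 'trade site', 'businessdaily'],
})

# Count sites per category
counts = sites['type'].value_counts()
print(counts)
```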

Open: 

1.3 Use Metadata

So we’ve worked outwards from the original set of articles that flagged these networks, scraped sites’ lists of related sites, and used Google’s API to search for sites containing specific phrases.  Time to look for the other things that connect sites (and use them to find more). 

Quick and dirty tag comparisons: using BuiltWith tags and IP relationships, e.g. https://builtwith.com/relationships/montgomerymdnews.com

Now this is where we start putting the “science” into our “data science”. There are two versions of BuiltWith: public and pay-for.  The pay-for is amazing, with all sorts of useful data in it, but the public (free) version isn’t bad either if you’re looking for related sites.  And BuiltWith has an API. Here’s where we start cooking on gas. 

To do next:

  • Done. Compare the “from_articles” and “from_sites” url lists. Add anything new to the master list.  
  • Start using the tags (google, ad etc) on each site to find more sites related to the seed list (the “master” list). 
  • Head into social media and other more subtle links

Coded up builtwith API calls. First 10 API calls on an account are free, then need to pay $100 for each 2000 calls or so (students/ academics might be able to get free access?).  Code is:

import requests
import json
import pandas as pd

bwkey = '<put your own key here>'
bwdom = 'propertyinsurancewire.com'
bwapi = 'rv1' # 'free1' is the free api
bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
bwresp = requests.get(bwurl)

with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
    json.dump(bwresp.json(), outfile)
    
matches = pd.DataFrame([])
identifiers = pd.DataFrame([])

rs = bwresp.json()['Relationships']
for thisr in rs:
    fromdomain = thisr['Domain']
    rsi = thisr['Identifiers']

    ids = pd.DataFrame(rsi).drop('Matches', axis=1)
    ids['FromDomain'] = fromdomain
    identifiers = identifiers.append(ids)

    for rsix in rsi:
        rsimatches = pd.DataFrame(rsix['Matches'])
        rsimatches['Type'] = rsix['Type']
        rsimatches['Value'] = rsix['Value']
        rsimatches['FromDomain'] = fromdomain
        if len(rsimatches) > 0:
            matches = matches.append(rsimatches)
matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
matches

Which dumps out json and csv files for a single API call, but in a format where a whole collection of call outputs could be stuck together.   Tested on montgomerymdnews.com, which gave a few new sites – and on propertyinsurancewire.com, which gave a list of 300 to check, including ones on the same NewRelic id. 

Alrighty. Modified code to loop round all the urls I have.  It’s fast. But it’s also spitting out some empty results – all the “1 byte” csvs below.  Not sure why; will check when the run finishes. 

Action: check the “1 byte” builtwith API returns.
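A quick way to round up those empty returns before rechecking them — a sketch, assuming the per-domain CSVs all land under builtwith/ with the _matches.csv naming used above (the 10-byte threshold is a guess at “effectively empty”):

```python
from pathlib import Path

def tiny_csvs(dirpath, max_bytes=10):
    """Return the names of per-domain CSVs small enough to be effectively
    empty -- candidates for a re-run against the BuiltWith API."""
    return sorted(p.name for p in Path(dirpath).glob('*_matches.csv')
                  if p.stat().st_size <= max_bytes)
```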

Run finished.  Got 46,434 results (1,120 unique domains) back from BuiltWith.  946 of those domains aren’t on the seed list. These are a mix: some look like news domain addresses; others don’t. 

At this point, a requests.get call on /about-us for these sites should give a decent clue to whether they’re linked to the above sites (although africanmangoscam.net is definitely getting a visit). Looking at them, by first filtering on newsy words like “news”, “county” etc:

  • Lots of clipper sites. Found many of these looking for common ‘news’ terms: news, county, today, times, reporter, wire, sun, record
  • Some parked (godaddy) and down
  • Gold = nekentuckynews.com is another localnews site (scraped, added)
  • Some of these names aren’t easy searches (e.g. Gray Guide), and they don’t have local content. Is this a new form of astroturfing? Action: think about how networks of sites could affect e.g. Google search results. 
  • Not sure about some sites, e.g. powernewswire.com – looked at /about-us, /privacy (https://powernewswire.com/privacy) and /terms (privacy and terms weren’t linked from the front of the site, and the about page was vague); found Locality Labs text on /privacy. 
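The checking loop can be sketched roughly like this: a cheap newsy-words filter over the domain names, then a by-hand /about-us fetch looking for known network text. The marker strings and word list are my assumptions from the notes above, and a real run wants politeness delays between fetches:

```python
import requests

# Common 'news' terms seen in the network's domain names (from the notes above)
NEWSY = ['news', 'county', 'today', 'times', 'reporter', 'wire', 'sun', 'record']
# Marker text seen on network sites so far (assumption: extend as found)
MARKERS = ['Locality Labs', 'LocalLabs']

def looks_newsy(domain):
    """Cheap first-pass filter: does the domain contain a common 'news' term?"""
    return any(w in domain.lower() for w in NEWSY)

def check_about_page(domain, timeout=10):
    """Fetch /about-us and return any known marker strings found.
    Network call -- run by hand on the filtered shortlist."""
    try:
        resp = requests.get('https://{}/about-us'.format(domain), timeout=timeout)
    except requests.RequestException:
        return None  # parked, down, or 503ing like surprisejournal.com
    return [m for m in MARKERS if m in resp.text]
```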

I ran out of holiday time (to be honest, I diverted into hanging out with my parents, which was IMHO a very good use of time).   I enjoyed the exercise very much, and I have even more respect now for the data journalists who do this work all the time.

Code

''' Scrape pink slime '''
import requests
from bs4 import BeautifulSoup
import pandas as pd

# pull page data from site
pubtitle = 'NE Kentucky News'
puburl = 'https://nekentuckynews.com/'
stype = 'local'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.get(puburl, headers=headers)

# Grab the raw list of sites
data = BeautifulSoup(response.text, 'html.parser')
if stype == 'florida':
    sudata = data.find_all('nav', attrs={'class': 'foot-nav'})[0]
elif stype == 'bd':
    fdata = data.find_all('div', attrs={'class': 'footer'})[0]
    row1 = fdata.find_all('div', attrs={'class': 'row'})[0]
    sudata = row1.find_all('div', attrs={'class': 'row'})[0]
else:
    sudata = data.find_all('ul', attrs={'class': 'footer__list'})[0]

# Convert to CSV
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]
df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle+'.csv', index=False)
''' googlesearch_for_terms '''
from googlesearch import search 
import pandas as pd
import re
from datetime import datetime

qterm = 'access to and use of Locality Labs'
query = '"{}" site:.com/terms'.format(qterm)
  
js = [j for j in search(query, tld="com", num=1000, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]

# Save, and compare against existing site list
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_{}.csv'.format(datetime.now().strftime('%Y-%m-%d-%H-%M')),
          index=False)

df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
df3 = pd.DataFrame(list(set(newsites) - set(fromsites)), columns=['url'])
df3.to_csv('temp.csv', index=False)
''' use_builtwith '''

import pandas as pd
import requests
import json 
import re

df = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df['URL'].to_list()
# strip the scheme and any trailing slash, leaving the bare domain
fromsites = [x[x.strip('/').rfind('/')+1:].strip('/') for x in fromsites]

bwkey = '<get_a_key>'
bwapi = 'rv1' # 'free1' is the free api

allmatches = pd.DataFrame([])
allidentifiers = pd.DataFrame([])

for bwdom in fromsites:
    print(bwdom)
    try:
        bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
        bwresp = requests.get(bwurl)

        with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
            json.dump(bwresp.json(), outfile)

        matches = pd.DataFrame([])
        identifiers = pd.DataFrame([])

        rs = bwresp.json()['Relationships']
        for thisr in rs:
            fromdomain = thisr['Domain']
            rsi = thisr['Identifiers']

            ids = pd.DataFrame(rsi).drop('Matches', axis=1)
            ids['FromDomain'] = fromdomain
            identifiers = identifiers.append(ids)

            for rsix in rsi:
                rsimatches = pd.DataFrame(rsix['Matches'])
                rsimatches['Type'] = rsix['Type']
                rsimatches['Value'] = rsix['Value']
                rsimatches['FromDomain'] = fromdomain
                if len(rsimatches) > 0:
                    matches = matches.append(rsimatches)
        matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
        identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
        if len(matches) > 0:
            allmatches = allmatches.append(matches)
        if len(identifiers) > 0:
            allidentifiers = allidentifiers.append(identifiers)
    except Exception:
        # some lookups fail or return nothing; skip and move on
        continue

allmatches.to_csv('builtwith/allmatches.csv', index=False)
allidentifiers.to_csv('builtwith/allidentifiers.csv', index=False)

newsites = pd.DataFrame(set(allmatches['Domain'].to_list()) - set(fromsites), columns=['URL'])
newsites.to_csv('newsites_tmp.csv', index=False)

Disinformation Datasets

[Cross-post from Medium https://medium.com/misinfosec/disinformation-datasets-8c678b8203ba]

Top of Facebook’s Blacktivists dataset

“Genius is Knowing Where To Look” (Einstein)

I’m often asked for disinformation datasets — other data scientists wanting training data, mathematician friends working on things like how communities separate and rejoin, infosec friends curious about how cognitive security hacks work. I usually point them at the datasets section on my Awesome Misinformation repo, which currently contains these lists:

That’s just the data that can be downloaded. There’s a lot of implicit disinformation data out there. For example, groups like EUvsDisinfo, NATO Stratcom, and OII Comprop all have structured data on their websites.

You’re not going to get it all

That’s a place to start, but there’s a lot more to know about disinformation data sources. One of them, as pointed out by Lee Foster at Cyberwarcon last month, is that these datasets are rarely a complete picture of disinformation around an event. Lee’s work is interesting. He did what many of us do: as soon as an event kicked off, his team started collecting social media data around it. What they did next was to compare that dataset against the data output officially by social media companies (Twitter, Facebook). There were gaps — big gaps — in the officially released data; understandable in a world where attribution is hard and campaigns work hard to include non-trollbots (aka ordinary people) in the spread of disinformation.

Some people can get better data than others. For instance, Twitter’s IRA dataset has obfuscated user ids, but academics can ask for unobfuscated data. It’s worth asking, but also worth asking yourself about the limitations placed on you by things like non-disclosure agreements.

I’ve seen this before

So what happens is that people who are serious about this subject collect their own data. And lots of them collect data at the same time. Which sits on their personal drives (or somewhere online) whilst other researchers are scrabbling round for datasets on events that have passed. I’ve seen this before. Everything old is new again — and in this case, it’s just what we saw with crisismapping data. There were people all over the world — locals, humanitarian workers, data enthusiasts etc — who had data that was useful in each disaster that hit, which meant that a large part of my work as a crisis data nerd was quietly, gently extracting that data from people and getting it to a place online where it could be used, in a form in which it could be found. We volunteers built an online repository, the Humanitarian Data Project, which informed the build of the UN’s repository, the Humanitarian Data Exchange — I also worked with the Humanitarian Data Language team on ways to meta-tag datasets so the data needed was easier to find. There’s a lot of learning in there to be transferred.

And labelled data is precious, so very precious

Disinformation is being created at scale, and at a scale beyond the ability of human tagging teams. That means we’re going to need automation (or rather augmentation — automating some of the tasks so the humans can still do the ‘hard’ parts of finding, labelling and managing disinformation campaigns, their narratives and artefacts). And to do that, we generally need datasets that are labelled in some way, so the machines can ‘learn’ from the humans (there are other ways to learn, like reinforcement learning, but I’ll talk about them another time). Unsurprisingly, there is very little in the way of labelled data in this world. The Pheme project labelled data; I helped with the Jigsaw project on a labelled dataset that was due for open release; I’ve also helped create labelling schemes for data at GDI, and am watching conversations about starting labelling projects at places like the Credibility Coalition.

That’s it — that’s a start on datasets for disinformation research. This is a living post, so if there are more places to look please tell me and I’ll update this and other notes.

Disinformation as a security problem: why now, and how might it play out?

[Cross-post from Medium https://medium.com/misinfosec/disinformation-as-a-security-problem-why-now-and-how-might-it-play-out-3f44ea6cda95]

When I talk about security going back to thinking about the combination of physical, cyber and cognitive, people sometimes ask me why now? Why, apart from the obvious weekly flurries of misinformation incidents, are we talking about cognitive security now?

Big, Fast, Weird

I usually answer with the three Vs of big data: volume, velocity, variety (the fourth V, veracity, is kinda the point of disinformation, so we’re leaving it out of this discussion).

  • The internet has a lot of text data floating around it, but its variety isn’t just in all the different platforms and data formats needed to scrape or inject into it — it’s also in the types of information being carried. We’re way past the Internet 1.0 days of someone posting the sports scores online and a bunch of hackers lurking on bulletin boards: now everyone and their grandmother is here, and the (sniffable, actionable and adjustable) data flows include emotions, relationships, group sentiment (anyone thinking about market sentiment should be at least a little worried by now) and group cohesion markers.
  • There’s a lot of it — volumes are high enough that brands and data scientists can spend their days doing social media analysis, looking at cliques, message spread, adaption and reach.
  • And it’s coming in fast: so fast that an incident manager can do AB-testing on humans in real time, adapting messages and other parts of each incident to fit the environment and head towards incident goals faster, more efficiently etc. Ideally that adaptation is much faster than any response, which fits the classic definition of “getting inside the other guy’s OODA loop”.

NB The internet isn’t the only system carrying these things: we still have traditional media like radio, television and newspapers, but they’re each increasingly part of these larger connected systems.

So what next?

Another question I get a lot is “so what happens next”. Usually I answer that one by pointing people at two books: The Cuckoo’s Egg and Walking Wounded — both excellent books about the evolution of the cybersecurity industry (and not just because great friends feature in them), and say we’re at the start of The Cuckoo’s Egg, where Stoll starts noticing there’s a problem in the systems and tracking the hackers through them.

I think we’re getting a bit further through that book now. I live in America. Someone sees a threat here, someone else makes a market out of it. Cuddle-an-alligator — tick. Scorpion lollipops in the supermarket — yep. Disinformation as a service / disinformation response as a service — also in the works, as predicted for a few years now. Disinformation response is a market, but it’s one with several layers to it, just as the existing cybersecurity market has specialists and sizes and layers.

Markets: sometimes botany, sometimes agile

Frank is a very wise, very experienced friend (see books above), who calls our work on AMITT “botany” — building catalogs of techniques and counters slower than the badguys can maraud across our networks, when we really should be out there chasing them. He’s right. Kinda.

I read Adam Shostack’s slides on threat modelling in 2019 today. He talks about the difference between “waterfall” (STRIDE, kill chain etc) and “agile” threat modelling. I’ve worked on both: on big critical systems that used waterfall/“V” methods because you don’t really get to continuously rebuild an aircraft or ship design, and on agile systems that we trialled with and adapted to end-user needs. (I’ve also worked on lean production, where classically speaking, agile is where you know the problemspace and are iterating over solutions, and lean is iterations on both the problem and solution spaces. This will become important later). This is one of the splits: we’ll still need the slower, deliberative work that gives labels and lists defences and counters for common threats (the “phishing” etc equivalents of cognitive security), but we also need that rapid response to things previously unseen that keeps white-hat hackers glued to their screens for hours, and there’s a growing market too in tools to support them. (as an aside, I’m part of a new company, and this agile/ waterfall split finally gives me a word to describe “that stuff over there that we do on the fly”).

Also because I’m old, I can remember when universities had no clue where to put their computer science group — it was sometimes in the physics department, sometimes engineering, or maths, or somewhere weirder still; later on, nobody quite knew where to put data scientists as they cut across disciplines and used techniques from wherever made sense, from art to hardcore stats. This market will shake out that way too. Some of the tools, uses and companies will end up as part of day-to-day infosec. Others will be market-specific (media and adtech are already heading that way); others again will meet specific needs on the “influence chain”, like educational tools and narrative trackers. Perhaps a good next post would be an emerging-market analysis?

overlaps between misinformation research and IoT security

[Cross-post from Medium https://medium.com/misinfosec/at-truth-trust-online-someone-asked-me-about-the-overlaps-between-misinformation-research-and-iot-a69772aba963]

At Truth&Trust Online, someone asked me about the overlaps between misinformation research and IoT security. There’s more than you’d think, and not just in the overlaps between people like Chris Blask who are working on both problem sets.

I stopped for a second, then went “Oh. I recognize this problem. It’s exactly what we did with data and information fusion (and knowledge fusion too, but you know a lot of that now as just normal data science and AI). Basically it’s about what happens when you’re building situation pictures (mental models of what is happening in the world) based on data that’s come from people (the misinformation, or information fusion part) and things (the IoT, or data fusion part). And what we basically did last time was run both disciplines separately – the text analysis and reasoning in a different silo to the sensor-based analysis and reasoning – til it made sense to start combining them (which is basically what became information fusion). That’s how we got the *last* pyramid diagram – the DIKW model of data under information under knowledge under wisdom (sorry: it’s been changed to insight now), and similar ideas of transformations and transitions in information (in the Shannon information theory sense of the word) between layers.

We’ll probably do a similar thing now. Both disciplines feed into situation pictures; both can be used to support (or refute) each other. Both contain all the classics like protecting information CIA: confidentiality, integrity, availability. I tasked two people at the conference to start delving into this area further (and connect to Chris) – will see where this goes.

Short thought: The unit should be person, not account

[Cross-post from Medium https://medium.com/misinfosec/short-thought-the-unit-should-be-person-not-account-81c48002aaa]

I’ve been thinking, inspired partly by Amy Zhang’s paper on mailing lists vs social media use https://twitter.com/amyxzh/status/1173812276211662848?s=20

We have a bunch of issues with online identity. Like, I have at least 20 different ways to contact some of my friends, spend half my life trying to separate out people being themselves from massive coordinated cross-platform campaigns, and have dozens of issues with privacy and openness (like do we throw a message into the infinite beerhall that’s twitter, or deliberately email just a few chosen peeps). How much of this has happened because our base unit of contact has changed from an individual human to an online account?

I’m wondering if there’s a way to switch that back again. Zeynep Tufekci said that people stayed on Facebook despite its shortcomings because that’s where the school emergency alerts, group organisation etc were. What if we could make those things platform-independent again? I mean we have APIs, yes? They’re generally broadcast, or broadcast-and-feedback, yes?

I guess this is two ideas. One is to challenge the idea that everything has to be instant-to-instant. Yeah, sure, we want to chat with our friends. But do we really need instant chat on everything? If we drop that, can we build healthier models?

The second idea is to challenge the account-as-user idea. Remember addressbooks? Like those real physical paper books that you listed your friends, family etc names, addresses, phone numbers, emails etc in? What if we had a system that went back to that, and when you sent a message to someone it went to their system of choice in your style of choice (dm, group, public etc). I get that you’re all unique etc, and I’m still cool with some of you having multiple personalities, but this 20 ways to contact a person — that’s got old, and fast.
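Half-seriously, that addressbook idea is just a data structure: a person-level contact record with a preferred-channel router on top. A minimal sketch (every name and channel here is made up):

```python
# Hypothetical person-centric addressbook: the unit is the person, and each
# person declares where each style of message (dm, group, public) should land.
addressbook = {
    'alice': {'dm': 'signal:+1555',
              'group': 'email:alice@example.com',
              'public': 'twitter:@alice'},
}

def route(person, style, book):
    """Return the person's chosen endpoint for a message style, or None
    if we don't know them (or they haven't declared that style)."""
    return book.get(person, {}).get(style)
```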

The third (because who doesn’t like a fourth book in a trilogy) is to give people introvert time. Instead of having control over our electronic lives by putting down the electronics, have a master switch for “only my mother can contact me right now”.

Writing about Countermeasures

[Cross-post from Medium https://medium.com/misinfosec/writing-about-countermeasures-1671d231e8a2]

The AMITT framework so far is a beautiful thing — we’ve used it to decompose different misinformation incidents into stages and techniques, so we can start looking for weak points in the ways that incidents are run, and in the ways that their component parts are created, used and put together. But right now, it’s still part of the “admiring the problem” collection of misinformation tools. To be truly useful, AMITT needs to contain not just the breakdown of what the blue team thinks the red team is doing, but also what the blue team might be able to do about it. Colloquially speaking, we’re talking about countermeasures here.

Go get some counters

Now there are several ways to go about finding countermeasures to any action:

  • Go look at ones that already exist. We’ve logged a few already in the AMITT repo, against specific techniques — for example, we listed a set of counters from the Macron election team as part of incident I00022.
  • Pick a specific tactic, technique or procedure and brainstorm how to counter it — the MisinfosecWG did this as part of their Atlanta retreat, describing potential new counters for two of the techniques on the AMITT framework.
  • Wargame red v blue in a ‘safe’ environment, and capture the counters that people start using. The Rootzbook exercise that Win and Aaron ran at Defcon AI Village was a good start on this, and holds promise as a training and learning environment.
  • Run a machine learning algorithm to generate random countermeasures until one starts looking more sensible/effective than the others. Well, perhaps not, but there’s likely to be some measure of automation in counters eventually…

Learn from the experts

So right. We get some counters. But hasn’t this all been done before? Like, if we’re following the infosec playbook to do all this faster this time around (we really don’t have 20 years to get decent defences in place — we barely have 2…) then shouldn’t we look at things like courses of action matrices? Yes. Yes we should…

Courses of Action Matrix [1]

So this thing goes with the Cyber Killchain — the thing that we matched AMITT to. Down the left side we have the stages: 7 in this case, 12 in AMITT. Along the top we have six things we can do to disrupt each stage. And in each grid square, we have a suggestion of an action (I suspect there’s more than one of these for each square) that we could take to cause that type of disruption at that stage. That’s cool. We can do this.
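As a data structure, the matrix is just stages × disruption actions, with suggested counters in the cells. A sketch, using the six courses of action from the Hutchins et al paper; the stage names and cell contents here are illustrative placeholders, not the real AMITT stage list or agreed counters:

```python
# The six courses of action from the kill-chain paper [1].
ACTIONS = ['detect', 'deny', 'disrupt', 'degrade', 'deceive', 'destroy']

# Placeholder stage names -- AMITT has 12 stages; fill in the real list.
STAGES = ['planning', 'content creation', 'exposure']

# One empty cell per (stage, action) pair; each cell holds a list because
# there's likely more than one suggested action per square.
matrix = {stage: {action: [] for action in ACTIONS} for stage in STAGES}

# Example (illustrative) entry:
matrix['exposure']['detect'].append('track narrative spread across platforms')
```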

The other place we can look is at our other parent models, like the ATT&CK framework, the psyops model, marketing models etc, and see how they modelled and described counters too — for example, the mitigations for ATT&CK T1193 Spearphishing.

Make it easy to share

Checking parent models is also useful because it gives us formats for our counter objects — which is basically that these are of type “mitigation”, and contain a title, id, brief description and list of techniques that they address. Looking at the STIX format for course-of-action gives us a similarly simple format for each counter against tactics — a name, description and list of things it mitigates against.

We want to be more descriptive whilst we find and refine our list of counters, so we can trace our decisions and where they came from. A more thorough list of features for a counter list would probably include:

  • id
  • name
  • brief description
  • list of tactics it can be used on
  • list of techniques it can be used on
  • expected action (detect, deny etc)
  • who could take this action (this isn’t in the infosec lists, but we have many actors on the defence side with different types of power, so this might need to be a thing)
  • anticipated effects (both positive and negative — also not in the infosec lists)
  • anticipated effort (not sure how to quantify this — people? money? hours? but part of the overarching issue is that attacks are much cheaper than defences, so defence cost needs to be taken into account)

And be generated from a cross-table of counters within incidents, which looks similar to the above, but also contains the who/where/when etc:

  • id
  • brief description
  • list of tactics it was used on
  • list of techniques it was used on
  • action (detect, deny etc)
  • who took this action
  • effects seen (positive and negative)
  • resources used
  • incident id (if known)
  • date (if known)
  • counters-to-the-counter seen
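Those two record types can be sketched as Python dataclasses — field names follow the two lists above, and most fields default to empty because real incident records will be patchy:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Counter:
    """A countermeasure in the catalog (the descriptive list above)."""
    id: str
    name: str
    description: str = ''
    tactics: List[str] = field(default_factory=list)
    techniques: List[str] = field(default_factory=list)
    action: str = ''             # expected action: detect, deny etc
    actors: List[str] = field(default_factory=list)   # who could take it
    effects: List[str] = field(default_factory=list)  # positive and negative
    effort: str = ''             # people? money? hours? still unquantified

@dataclass
class CounterUse:
    """A counter as used within a specific incident (the cross-table above)."""
    id: str
    description: str = ''
    tactics: List[str] = field(default_factory=list)
    techniques: List[str] = field(default_factory=list)
    action: str = ''
    actor: str = ''              # who took this action
    effects: List[str] = field(default_factory=list)
    resources: str = ''
    incident_id: Optional[str] = None
    date: Optional[str] = None
    counters_to_counter: List[str] = field(default_factory=list)
```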

“Boundaries for Fools, Guidelines for the Wise…”

At this stage, older infosec people are probably shaking their heads and muttering something about stamp collecting and bingo cards. We get that. We know that defending against a truly agile adversary isn’t a game of lookup, and as fast as we design and build counters, our counterparts will build counters to the counters, new techniques, new adaptations of existing techniques etc.

But that’s only part of the game. Most of the time people get lazy, or get into a rut — they reuse techniques and tools, or it’s too expensive to keep moving. It makes sense to build descriptions like this that we can adapt over time. It also helps us spot when we’re outside the frame.

Right. Time to get back to those counters.

References:

[1] Hutchins et al “Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains”, 2011

Responses to misinformation

[Cross-post from Medium https://medium.com/misinfosec/responses-to-misinformation-885b9d82947e]

There is no one, magic, response to misinformation. Misinformation mitigation, like disease control, is a whole-system response.

MisinfosecWG has been working on infosec responses to misinformation. Part of this work has been creating the AMITT framework, to provide a way for people from different fields to talk about misinformation incidents without confusion. We're now starting to map out misinformation responses.

Today I’m sat in Las Vegas, watching the Rootzbook misinformation challenge take shape. I’m impressed at what the team has done in a short period of time (and has planned for later). It also has a place on the framework — specifically at the far-right of it, in TA09 Exposure. Other education responses we’ve seen so far include:

Education is an important counter, but won’t be enough on its own. Other counters that are likely to be trialled with it include:

  • Tracking data provenance to protect against context attacks (digitally sign media and metadata so that the media carries the original URL where it was published, signed with the private key of the original author/publisher)
  • Forcing products altered by AI/ML to notify their users (e.g. there was an effort to force Google’s very believable AI voice assistant to announce it was an AI before it could talk to customers)
  • Requiring legitimate news media to label editorials as such
  • Participating in the Cognitive Security Information Sharing and Analysis Organization (ISAO)
  • Forcing paid political ads on the Internet to follow the same rules as paid political advertisements on television
  • Baltic community models, e.g. Baltic “Elves” teamed with local media etc
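The provenance idea in the list above amounts to binding media bytes, original URL and author together under a publisher key. A minimal sketch: a real system would sign with the publisher's asymmetric private key (e.g. Ed25519), but HMAC with a stand-in shared secret is used here so the sketch runs on the standard library alone; all names are hypothetical.

```python
import hashlib
import hmac
import json

PUBLISHER_KEY = b"publisher-secret"  # stand-in key material, not a real design

def sign_media(media: bytes, original_url: str, author: str) -> dict:
    """Bind media hash, original URL and author into one signed record."""
    payload = {
        "media_sha256": hashlib.sha256(media).hexdigest(),
        "original_url": original_url,
        "author": author,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    payload["signature"] = hmac.new(PUBLISHER_KEY, blob, hashlib.sha256).hexdigest()
    return payload

def verify_media(media: bytes, record: dict) -> bool:
    """Check both the media bytes and the signed context around them."""
    claim = {k: v for k, v in record.items() if k != "signature"}
    if claim["media_sha256"] != hashlib.sha256(media).hexdigest():
        return False  # media bytes were altered
    blob = json.dumps(claim, sort_keys=True).encode()
    expected = hmac.new(PUBLISHER_KEY, blob, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

record = sign_media(b"cat.jpg bytes", "https://example.org/story", "A. Reporter")
```

The point of signing the URL and author alongside the content hash is that a context attack (real image, fake attribution) fails verification even though the media itself is untouched.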

Jonathan Stray’s paper “Institutional Counter-disinformation Strategies in a Networked Democracy” is a good primer on counters available on a national level.

Why talk about disinformation at a hacking event?

I’m one of the DEFCON AI Village core team, and there’s quite a bit of disinformation activity in the Village this year, including:

Why talk about disinformation* at a hacking event?  I mean, shouldn’t it be in the fluffy social science candle-waving events instead? What’s it doing in the AI Village?  Isn’t it all a bit, kinda, off-topic? 

Nope. It’s in exactly the right place. Misinformation, or more correctly its uglier cousin, disinformation, is a hack. Disinformation takes an existing system (communication between very large numbers of people) apart and adapts it to fit new intentions – whether that’s temporarily destroying the system’s ability to function (the “Division” attacks that we see on people’s trust in each other and in the democratic systems they live within), changing system outputs (influence operations to dissuade opposition voters or change marginal election results), or making the system easy to access and weaken from outside (antivax and other convenient conspiracy theories). And a lot of this is done at scale, at speed, and across many different platforms and media – which, if you remember your data science history, is the three Vs: volume, variety and velocity (there was a fourth V, veracity, but, erm, misinformation, guys!)

And the AI part? Disinformation is also called Computational Propaganda for a reason. So far, we’ve been relatively lucky: the algorithms used by disinformation’s current ruling masters, Russia, Iran et al, have been fairly dumb (but still useful). We had bots (scripts pretending to be social media users, usually used to amplify a message, theme or hashtag until algorithms fed it to real users) so simple you could probably spot them from space – like, seriously, sending the same message 100s of times a day at a rate even Win (who’s running the R00tz bots exercise at AI Village) can’t type at – backed up by trolls – humans (the most famous of which were in the Russian Internet Research Agency) spreading more targeted messages and chaos, with online advertising (and its ever so handy demographic targeting) for more personalised message delivery. That luck isn’t going to last. Isn’t lasting. Bots are changing. The way they’re used is changing. The way we find disinformation is changing (once, sigh, it was easy enough to look for #qanon on twitter to find a whole treasure trove of crazy).
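Those "same message hundreds of times a day" bots are simple enough to catch with a rate heuristic. A toy sketch; the threshold and account names are invented for illustration:

```python
from collections import Counter

# Flag accounts whose duplicate-message count in one day is implausible
# for a human typist. The cutoff is an illustrative assumption, not a
# researched number.
MAX_HUMAN_DUPLICATES_PER_DAY = 50

def flag_probable_bots(posts):
    """posts: iterable of (account, message) pairs from one day of data."""
    dupes = Counter((account, message) for account, message in posts)
    return sorted({acct for (acct, _msg), n in dupes.items()
                   if n > MAX_HUMAN_DUPLICATES_PER_DAY})

posts = [("amplifier_01", "#hashtag vote NOW")] * 300 + \
        [("human_user", "nice weather today")] * 3
flagged = flag_probable_bots(posts)
```

This is exactly the kind of lookup that stops working as bots get smarter, which is the point of the paragraph above: simple frequency heuristics give way to anomaly detection across disparate data.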

The disinformation itself is starting to change: goodbye straight-up “fake news” and hunting for high-frequency messages, hello more nuanced incidents that mean anomaly detection and pattern-finding across large volumes of disparate data and its connections. And as a person who’s been part of both MLsec (the intersection of machine learning/AI and information security) and misinfosec (the intersection of misinformation and information security), I *know* that looks just like a ‘standard’ (because hell, there are no standards, but we’ll pretend for a second there are) MLsec problem. And that’s why there’s disinformation in the AI Village.

If you get more curious about this, there’s a whole separate community, Misinfosec http://misinfosec.org, working on the application of information security principles to misinformation.  Come check us out too.

* “Is there a widely accepted definition of mis vs disinformation?” Well, not really, not yet (there’s lots of discussion about it in places like the Credibility Coalition‘s Terminology group, reading papers like Fallis’ “What is Disinformation?“). Claire Wardle’s definitions of dis-, mis-, and mal-information are used a lot. But most active groups pick a definition and get on with the work – for instance, this is MisinfosecWG’s working definition: “We use misinformation attack (and misinformation campaign) to refer to the deliberate promotion of false, misleading or mis-attributed information. Whilst these attacks occur in many venues (print, radio, etc), we focus on the creation, propagation and consumption of misinformation online. We are especially interested in misinformation designed to change beliefs in a large number of people.” And my personal one is that we’re heading towards disinformation as the mass manipulation of beliefs that doesn’t necessarily use fake content (text, images, videos etc), but usually includes fake context (misattribution of source, location, date, context etc) and the use of real content to manipulate emotion in specific directions. Honestly, it’s like trying to define pornography – finding the right definitions is important, but can get in the way of the work of keeping it out of the mainstream, and if it’s obvious, it’s obvious. We’ll get there, but in the meantime, there’s work to do.

Misinformation has stages

[Cross-posted from Medium https://misinfocon.com/misinformation-has-stages-7e00bd917108]

Now we just need to work out what the stages should be…

Misinformation Pyramid

The Credibility Coalition’s Misinfosec Working Group (“MisinfosecWG”) maps information security (infosec) principles onto misinformation. Our current work is to develop a tactics, techniques and procedures (TTP) based framework that gives misinformation researchers and responders a common language to discuss and disrupt misinformation incidents.

We researched several existing models from different fields, looking for one that was well-supported and familiar to people, and well suited to the variety of global misinformation incidents we were tracking. We fixed on stage-based models, which divide an incident into a sequence of stages, e.g. “recon” or “exfiltration”, and started mapping known misinformation incidents to the ATT&CK framework, which the infosec community uses to share information about infosec incidents. Here’s the ATT&CK framework, aligned with its parent model, the Cyber Killchain:

Cyber Killchain stages (top), ATT&CK framework stages (bottom)

The ATT&CK framework adds more detail to the last three stages of the Cyber Killchain. These stages are known as “right-of-boom,” as opposed to the four “left-of-boom” Cyber Killchain stages, which happen before bad actors gain control of a network and start damaging it.

Concentrating on the ATT&CK model made sense when we started doing this work. It was detailed, well-supported, and had useful concepts, like being able to group related techniques together under each stage. The table below is the Version 1.0 strawman framework that we created; an initial hypothesis about the stages with example techniques that a misinformation campaign might use.

Table 1: Early strawman version of the ATT&CK framework for misinformation [1]

This framework isn’t perfect. It was never designed to be perfect. We recognized that we are dealing with many different types of incidents, each with potentially very different stages, routes through them, feedback loops and dependencies (see the Mudge quote below), so we created this strawman to start a conversation about what more is needed. Behind that, we started working in two complementary directions: bottom-up from the incident data, and top-down from other frameworks that are used to plan similar activities to misinformation campaigns, like psyops and advertising.
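The strawman's shape (a sequence of stages, each grouping example techniques) can itself be sketched as a small data model that maps an incident's observed techniques back to stages. The stage and technique names below are illustrative placeholders, not the actual Table 1 contents:

```python
# Illustrative strawman: ordered stages, each with example techniques.
# Dicts preserve insertion order in Python 3.7+, which keeps the
# stage sequence intact.
STRAWMAN = {
    "Recon": ["map target audiences", "identify wedge issues"],
    "Build": ["create fake accounts", "generate content"],
    "Amplify": ["bot retweet rings", "hashtag flooding"],
    "Exposure": ["mainstream media pickup"],
}

def stages_for(observed_techniques):
    """Return the framework stages implied by observed techniques, in order."""
    observed = set(observed_techniques)
    return [stage for stage, techs in STRAWMAN.items()
            if observed & set(techs)]

incident = ["hashtag flooding", "create fake accounts"]
```

Tagging incidents this way is the bottom-up half of the work described above: each catalogued incident either fits the stage list or exposes a gap in it.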

ATT&CK may be missing a dimension…

The ATT&CK framework has missing dimensions, which is why we introduced the misinformation pyramid. A misinformation campaign is a longer-scale activity (usually months, sometimes years), composed of multiple connected incidents — one example is the IRA campaign that focussed on the 2016 US elections. The attackers designing and running a campaign see the entire campaign terrain: they know the who, what, when, why, how, the incidents in that campaign, the narratives (stories and memes) they’re deploying, and the artifacts (users, hashtags, messages, images, etc.) that support those narrative frames.

Defenders generally see just the artifacts, and are left guessing about the rest of the pyramid. Misinformation artifacts are right-of-boom: the themes seemingly coming out of nowhere, the ‘users’ attached to conversations, etc. This is what misinformation researchers and responders have typically concentrated on. This is what the ATT&CK framework is good at, and why we have invested effort in it, cataloguing and breaking campaigns and incidents down into techniques, actors and action flows.

misinformation pyramid

But this only covers part of each misinformation attack. There are stages “left-of-boom” too, and although they’re difficult to identify, there are key artifacts in this campaign phase as well. This is the other part of our work. We’re working from the attacker’s point of view, listing and comparing the stages we’d expect them to work through, based on what we know about marketing/advertising, psyops and other analyses. We’ve compared a key set of stage-based models from these disciplines to the Cyber Killchain, as seen in the table below.

Table 2: Comparison between cyber killchain, marketing, psyops and other models

This is a big beast, so let’s look at its components.

First, the marketing funnels. These are about the journey of the end consumer of a marketing campaign — the person who watches an inline video, sees a marketing image online, and so on, and is ideally persuaded to change their view, or buy something related to a brand. This is a key consideration when listing stages: whose point of view is this? Do we understand an incident from the point of view of the people targeted by it (which is what marketing funnels do), the point of view of the people delivering it (most cyber frameworks), or the people defending against it? We suggest that the correct point of view for misinformation is that of the creator/attacker, because attackers go through a set of stages, all of which are essentially invisible to a defender, yet each of these stages can potentially be disrupted.

Marketing funnels from Moz and Singlegrain

Marketing funnels, meanwhile, are “right-of-boom.” They begin at the point in time where the audience is exposed to an idea or narrative and becomes aware of it. This is described as the “customer journey,” which is a changing mental state, from seeing something to taking an interest in it, to building a relationship with a brand/idea/ideology, and subsequently advocating it to others.

This same dynamic plays out in online misinformation and radicalisation (e.g. QAnon effects), with different hierarchies of effects that might still contain the attraction, trust and advocacy phases. Should we reflect these in our misinformation stage list? We can borrow from the marketing funnel and map these stages across to the Cyber Killchain (above); by adding in stages for marketing planning and production (market research, campaign design, content production, etc.), and noting how similar they are to an attacker’s game plan, we can begin planning how to disrupt and deny these left-of-boom activities.

When considering the advocacy phase, in relation to other misinformation models, we see this fitting the ‘amplification’ and ‘useful idiot’ stages (as noted above in Table 2). This is new thinking, and modeling how an ‘infected’ node in the system isn’t just repeating a message, but might be or become a command node too, is something to consider.

Developing the misinformation framework also means acknowledging the role of psyops, as its point of view is clear: it’s all about the campaign producer, who controls every stage through a step-by-step list of things to do, from the start of an operation through to its completion, including hierarchy-aware things like getting sign-offs and permissions.

Left-of-boom, psyops maps closely to the marketing funnel, with the addition of a “planning” stage, while right-of-boom it glosses over all the end-consumer-specific considerations, in a process flow defined by “production, distribution, dissemination.” It does, however, add a potentially useful evaluation stage. One of the strengths of working at scale online is the ability to hypothesis-test (e.g. A/B test) and adapt quickly at all stages of a campaign. Additionally, when running a set of incidents, after-action reviews can be invaluable for learning and adjusting higher-level tactics, such as changing the list of stages or the target platforms, or determining the most effective narrative styles and assets.
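The at-scale hypothesis testing mentioned above is, at its simplest, a two-sample proportion test on engagement rates for two message variants. A sketch with invented numbers:

```python
import math

# Two-sample z-test on engagement proportions for message variants A and B.
# All counts are invented for illustration.
def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z-statistic for the difference between two proportions (pooled SE)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Variant B gets 180/5000 engagements vs A's 120/5000.
z = two_proportion_z(successes_a=120, n_a=5000, successes_b=180, n_b=5000)
# |z| > 1.96 would count as significant at the 5% level for this toy comparison.
```

The asymmetry this creates is worth noting: an attacker running thousands of impressions per variant gets statistically solid feedback within hours, while defenders usually can't even observe which variants were tried.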

Psyops stages (https://2009-2017.state.gov/documents/organization/148419.pdf)

As we develop misinformation-specific stage-based models and see more of them appear (maybe it’s something to do with all the talks our misinfosec family have given?), things like Tactics, Techniques and Procedures (“TTPs”) and Information Sharing and Analysis Centers (“ISACs”) are showing up in misinformation presentations and articles. Two noteworthy models are the Department of Justice (DOJ) model and one recently outlined by Bruce Schneier. First the DOJ model, which is a thing of beauty:

page 26 of https://www.justice.gov/ag/page/file/1076696/download

This clearly presents what each stage looks like from both the attacker (‘adversary’) and defender points of view (the end consumer isn’t of much interest here). It’s a solid description of early IRA incidents, yet is arguably too passive for some of the later ones. This is where we start inserting our incident descriptions and mapping them to stages. This is where we start asking how our adversaries are exercising things like command and control. When we say “passive”, we mean this model works for “create and amplify a narrative”, but struggles to fit something like “create a set of fake groups and make them fight each other”, which has a more active, more command-and-control-like presence. This is a great example of how we can create models that work well for some, but not all, of the misinformation incidents that we’ve seen or expect to see.

We have some answers. More importantly, we have a starting point. We are now taking these stage-based models and extracting the best assets, methods, and practices (what looks most useful to us today), such as testing various points of view, creating feedback loops, monitoring activity, documenting advocacy, and so on. Our overarching goal is to create a comprehensive misinformation framework that covers as much of the incident space as possible, without becoming a big mess of edge cases. We use our incident analyses to cross-check and refine this. And we accept that we might — might — just have more than one model that’s appropriate for this set of problems.

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” ― Richard P. Feynman

Addendum: yet more models…

Ben Decker’s models look at the groups involved in different stages of misinformation, and the activities of each of those groups. They treat misinformation campaigns as a series of handoffs between groups: originators create content; command and control signals go out via Gab, Telegram, etc.; signal receivers post that content to social media platforms and amplify its messages until they’re eventually picked up by professional media. This has too many groups to fit neatly onto a marketing model, and appears to be on a different axis to the psyops and DOJ models, but still seems important.

Ben Decker misinformation propagation models

As a further axis: the stage models we’ve discussed above are all tactical — the steps that an attacker would typically go through in a misinformation incident. There are also strategies to consider, including Ben Nimmo’s “four Ds” (Distort, Distract, Dismay, Dismiss — commonly-used IRA strategies), echoed in Clint Watts’s online manipulation generations. In infosec modelling, this would get us into a Courses of Action matrix. We need to get on with creating the list of stages, so we’ll leave that part until next time.
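A Courses of Action matrix in the Hutchins et al sense is just a table of attack stages against defensive actions, with each cell holding a candidate counter. A minimal sketch; the stage names, actions and counters here are all illustrative placeholders:

```python
# Courses of Action matrix sketch: rows are attack stages, columns are
# defensive action types, cells are candidate counters. All entries are
# invented for illustration.
ACTIONS = ["detect", "deny", "disrupt", "degrade", "deceive"]

coa = {
    "Amplify": {
        "detect": "bot-rate anomaly monitoring",
        "deny": "suspend amplification accounts",
        "degrade": "downrank flagged hashtags",
    },
    "Exposure": {
        "detect": "media-literacy tip lines",
    },
}

def counters_for(stage, action):
    """Look up the candidate counter for a stage/action cell, if any."""
    return coa.get(stage, {}).get(action)
```

Empty cells are as informative as full ones: a stage with no counter under any action is exactly the kind of gap that building the stage list is meant to expose.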

Clint Watts matrix, and 5Ds, with common tactics (from Boucher) mapped to them.

References:

  1. Walker et al, Misinfosec: applying information security paradigms to misinformation campaigns, WWW’19 workshop