Data Safari rough notes: “pink slime” network

Warning: this is a set of rough notes, for other geeks to read. There’s an ungeeked post about this for people who don’t want to wade through code.

2019 was mostly about building infrastructure and communities, but every so often I did a little “data safari” on a piece of misinformation that interested me. 

Data safaris are small looks into an area of interest: they’re not big enough to be expeditions, but they’re not standing prodding the bushes a couple of times then walking away either. How this (typically) works is: something interests me – an article, a hashtag, or an oddity that I’d like a better look at – so I take a cursory look at it, write a rough plan of how I want to investigate it further, then start at one corner of the world and traverse out from it, making notes on things of interest, or things I want or need to remember as I go. Whilst I’m traversing, I’m also filling in a spreadsheet or doc with the end data I want, and collecting/storing other datasets (twitter, facebook, text from sites etc) as needed. On the way, I’ll find other things I want to explore, but won’t want to interrupt my flow through whichever thing I’m on at the time – for these, I leave myself an “action” note, which I overwrite with “done” or “ignored” when I’ve either gone back to that branch and followed it, or decided not to chase it down.

This one was pink slime: Hundreds of ‘pink slime’ local news outlets are distributing algorithmic stories and conservative talking points

The article is from the Tow Center – the data journalism unit at Columbia that has Jonathan Albright (d1gi) as one of its peeps working on misinformation data – so I had a quick look at his recent datasets https://data.world/d1gi (nothing on data.world in the past 9 months), medium, and twitter https://twitter.com/d1gi – looks like this piece isn’t him.  Back to the article.

“Pink slime” = “low cost automated story generation”.  This stuff is going to get worse, as text generation improves (and remember that many of these sites are heavy on the image:text ratio anyways). 

No link to data given in the article.  Article writer is Priyanjana Bengani – talks about “we”, writes about datasets.  Checking their accounts in case they released a dataset somewhere (unlikely given the subject, and what happened when Barrett Golding released a misinformation sites dataset, but worth checking).  Nothing on https://twitter.com/acookiecrumbles.  No sign of a dataset being released, so will have to start chasing it up by hand.  Which, admittedly, will be fun. Can I find more than 450?  [edit: yes, but it’s going to take an age to go through them all]

Okay. Quick-and-dirty plan: 

  • Quick reconstruction: grab the sites listed in the article, in the articles it mentions and any other local news outlets that pointed at “pink slime” or the players listed in the past month.  That’s the low-low-hanging fruits.
  • Twin efforts: do what I would normally do; rebuild from the methods they’ve listed (which have lots of overlaps, because there are only so many links across sites, although…)
  • Clean up the lists, rerun to see if anything got missed.
  • Look for other sites using less-obvious markers like social behaviours and links.   

1. Quick reconstruction

1.1 check articles

Todo list:

  • Done: check CJR article
  • Done: look for local “pink slime” articles
  • Done enough for now: look for local articles about actors/entities in this story

Clues from cjr article

  • Lansing State Journal broke news on Oct 20 2019
  • Done: check Lansing State Journal
  • Further reporting by the Michigan Daily, the Guardian and the New York Times  got to about 200 sites
  • Done: check Michigan Daily
  • Done: Check Guardian
  • Done: Check New York Times
  • Columbia analysis got to at least 450 sites
  • 189 of those were set up as local news networks across 10 states in the last 12 months by Metric Media
  • Metric Media is just one component in the network

Lansing State Journal article:

  • Carol Thompson article.  (517) 377-1018, ckthompson@lsj.com, @thompsoncarolk.
  • “Nearly 40 new sites”
  • First found by Matt Grossmann, director of Michigan State University’s Institute for Public Policy and Social Research (he leads this;  517 355-6672, grossm63@msu.edu, @mattgrossmann, www.mattg.org).
  • Site: micapitolnews.com – interesting url; action: try other state +capitolnews.com combinations?
  • The “About us” pages of the MI sites say they are published by Metric Media LLC – to fill the “growing void in local and community news after years of steady disinvestment in local reporting by legacy media.”
  • Bradley Cameron is CEO of Metric Media and Situation Management Group (per his online biography page)
  • Privacy pages say the sites are operated by Locality Labs LLC, a Delaware company that is similarly affiliated with a network of local sites in Illinois and Maryland, and business sites in nearly every U.S. state.
  • Done: West Cook News’ about page says “West Cook News is a product of LGIS – Local Government Information Services” – chase this up too?
  • Done: West Cook News has a list of “other publications” on the bottom. 
  • Locality Labs CEO is Brian Timpone
  • At this point, I’m looking at these publications and wondering if they’re really misinformation sites.  Right now, they’re looking more like the clipper magazine “Our Times” that we got in New Jersey – which was foaming at the mouth right-wing and filled with hateful rhetoric between the adverts and dad joke cartoons, but a) would take work to establish just how much disinformation was flowing through them and b) being on a political ‘side’ should never be a reason to call ‘misinformation’ – that’s why Tim and I set up a completely new set of labels for domains.  Ah good – this gets addressed in the article, which talks about the difference between “fake news” and “information with a perspective” that’s been dressed up to look like objective local news. This is the sort of thing we saw with the Jenna Abrams troll: mostly ‘useful’ information, with occasional forays into political strong opinions. Action: check percentage of political articles on some of these sites. 
  • Side note: Lansingstatejournal.com says I have 4 free articles left.  If every news outlet I need to look at does this, or worse paywalls me, this is going to be a long night of research. 

Looked for origin of story – Matt Grossmann

  • Nothing on www.mattg.org
  • Twitter search on “@mattgrossmann until:2019-10-22” (original article came out on 20th October) found reference to original article and lots of tweets about it (hello again, @emptywheel). 
  • Good seeing people noodling about ways to counter: find host, see if can take down because violating ToS (yep, got that one); feed list to Google etc for downranking (yep, got that); Maggie Haberman’s Index Of Approved Sources Of Information  (not heard of that at all). 
  • Hmm: “Metric Media LLC maintains a licensing agreement with the Metric Media Foundation, a Missouri 501(c)(3) non-profit news organization.”
  • @bywillpollock shows links to “The Ohio Star” articles on facebook groups Teachers for Trump, Blacks for Trump, Students for Trump Pence (following facebook group links has been a really rich source of URLs for me in the past – it’s how I found the northern european antivax sites)
  • Action: search facebook for site names/urls; spider out: check other URLs on pages that mention them, check “related pages” if they’re for the sites themselves etc. 
  • @aphexmandelbrot searches for first sentence in ToS “access to and use of Locality Labs” (yep – I also find interesting phrases on a site and search for them; site creators are lazy), gets “at least 400” results.
  • Action: yeah, try the ToS sentence (just automate the google part)

Checking Michigan Daily

  • Y’know, everyone thought data scientists had a glamorous job ‘til they tried it themselves and discovered it’s mostly doing lookups, cleaning datasets and discovering that the less-sexy algorithms are more stable today.
  • Oh my – followed links to Ann Arbor Times, Grand Rapids Reporter – their pages were at first glance identical (which will make some of the GDI featureset work very nicely, thank you).  As was https://nemontananews.com/ – the same format, the same gun image in the top LHS article.  Hang on – the Grand Rapids Reporter article is “from Great Lakes Wire”, which is on the naughty list, and the Montana article claims to be “from Big Sky Times” – so there’s another way to find linked sites.
  • Ignored: check sites referenced by the repeated gun article (ignored because the related sites are listed at the bottom of each site, and Big Sky Times is there already).
  • So the articles with politically skewed content are written by humans and have bylines.  The rest of the articles are generated (by Local Labs News Service), with no byline. Useful to know. 
  • Action: “fill the void in local communities” is a useful search phrase
  • New networks in Montana and Iowa
  • Action: grab list of sites from bottom of page in every state
  • Locality labs is https://locallabs.com/; Metric Media Foundation is http://metricmedia.org/
  • Locality labs operates networks in Maryland and Florida.
  • Locality Labs… emerged out of Journatic, LLC and BlockShopper, LLC (note use of SEC archive to find this)
  • Aside: this article has lots of interesting information in it.  Interesting (but not to the data search) includes: “Tribune Media Company, a media conglomerate that once owned major outlets including the Chicago Tribune and the Los Angeles Times, invested in Journatic as a service to provide hyperlocal news coverage. Journatic reportedly distributed fabricated and plagiarized content and used workers in the Philippines writing under pseudonyms to remotely produce stories. Due to these scandals, many outlets suspended their use of the service and stopped publishing Journatic articles. Tribune Media has since been absorbed by Nexstar Media Group, Inc., the largest local television and media company in the U.S., but still has not fully divested from the venture, which has since reorganized as Locality Labs. The Daily requested comment from Gary Weitman, chief communications officer at Nexstar, regarding investment in Locality Labs, and was told Locality Labs was not a subsidiary. While acknowledging Nexstar does partially own Locality Labs, Weitman downplayed the influence of Nexstar’s investment.”   I remember these stories back from – 2012? Some of this stuff has long roots, and I’m wondering now if some of the emerging disinformation industry in the Philippines can be traced back to these original content generating factories, or if the links to them have been completely severed, or might be stood up quickly again for 2020? 
  • “Timpone is also the co-founder of Local Government Information Services, a network of more than 30 Illinois print and web publications that have been considered to propagate conservative news and hold an identical layout to Metric Media’s websites.” – West Cook News listed itself as being a LGIS site. So this is an issue.  West Cook News is definitely mentioned in the Lansing State Journal article, but the Michigan Daily article seems to be hedging about it and associated sites.  Is it or isn’t it a disinformation site – that’s a question I just spent a year of my life on, and in the end it’s the wrong question. The question isn’t “how do we tell everyone about this huge set of fake news sites we’ve found”, but more “how do we make sure people are aware that they’re reading a syndicated publication which isn’t clear about its connections and has a likelihood of disinforming – regardless of its political slant – and how do we make this situation better?  Better in this case can range from working with the sites’ owners (if possible) to improve the way people understand them, to producing better local news feeds with less bias, to using whatever levers are needed to deal with genuine disinformation campaigns. Each of these comes with a burden of empathy, proof and genuine interest in the connection from grassroots local activity and information upwards. 
  • Okay. Back to the data. I’m going to include those sites for now, but make sure they’re tagged carefully as LGIS.  The article overlap between them and the Metric Media sites should be a good indicator of where we need to look across them.
  • Another possible network branch: “Timpone is associated with Franklin Archer, a publishing organization operated from Chicago. Franklin Archer hosts a similar network which consists of a set of nationwide business journals. Earlier this year, Franklin Archer published the Hinsdale School News — a publication that infringed upon the name and logo trademarks of Hinsdale High School District 86 in Illinois and potentially violated election law by attempting to influence the vote on a $140 million school district referendum.”
  • Aside: mentions the IFFY quotient for URL mentions on social media sites, giving roughly how many of their mentions are okay, unknown and known to be dodgy https://csmr.umich.edu/platform-health-metrics/  Action – look into this more (but later)
  • Aside: this end section seems to be key to why peeps should care about this ““The two of those collectively lead to a situation where it’s fairly easy to distribute extremely partisan, low quality or complete misinformation in a way designed to influence voters,” Pasek said. “It appears that these news sites purporting to be Michigan-based news outlets are attempting to do this, to some extent. They’re targeting Michigan in part because Michigan is viewed as a critical state for this upcoming election, with the goal of providing a presentation that implies that the local news story is indeed one that’s more favorable to the president, and less favorable to his potential opponents, whoever they may be.” Pasek said while it’s normal for outlets to have different biases, these sites are disregarding journalistic standards. “The question is not about bias — it’s about journalistic standards and how journalists are misunderstood.” Pasek said. “It’s okay to have outlets that have varying different views out there, but there’s a certain point at which the attempt to be an outlet with a particular angle oversteps how journalism is supposed to operate. And once that occurs, now there becomes a substantive question as to whether what you’re observing is in fact news, or is instead a disinformation campaign.””

Now the Guardian and NYT.  These are both heavy hitters, data journalism-wise, and can call on a wide network of expert researchers (I know; I fed into some of their stuff).  Maybe we’re seeing a pipeline here too – from a local researcher to a local news organisation, then nationals, then journalism school, then wider again? 

The Guardian

  • Leads with the Hinsdale School News story, about using a local news site to swing a local vote.  “This was purposely done to mislead people into thinking that was a publication from the district.”
  • Franklin Archer has a list of publications – operated by Locality Labs https://franklinarcher.com/our_publications
  • Jeanne Ives’ payment followed by praise in papers is interesting. Can we track these articles / did she pay any other networks of interest?  Action: check Ives FEC filings for more media payments around the $2k range (a sketch of this is after this list)
  • Action: cross-map locations of these papers to a) the news deserts in University of North Carolina study and b) battleground states. Carto is your friend here…  maybe also map against Sinclair TV coverage
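For that Ives FEC action, a minimal sketch against the openFEC API (api.open.fec.gov). The committee id is a placeholder to be looked up first via the /committees/ search, DEMO_KEY is the rate-limited public key, the cycle value is my guess, and the result field names are worth double-checking against the API docs – these are all assumptions, not verified values:

import requests

FEC_API = 'https://api.open.fec.gov/v1'
params = {
    'api_key': 'DEMO_KEY',           # swap for a real api.data.gov key
    'committee_id': 'C00XXXXXXX',    # placeholder: look up the Ives committee id via /committees/ first
    'two_year_transaction_period': 2018,  # assumption: adjust to the cycle of interest
    'per_page': 100,
}
# schedule_b = itemised disbursements (payments made by the committee)
resp = requests.get(FEC_API + '/schedules/schedule_b/', params=params)
for row in resp.json().get('results', []):
    amount = row.get('disbursement_amount') or 0
    if 1500 <= amount <= 2500:       # the “around $2k” range noted above
        print(row.get('disbursement_date'), row.get('recipient_name'), amount)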

New York Times

  • Author: Dan Levin @globaldan
  • Nice use of 2×2 grid to show how similar the sites look
  • Minimal advertising on sites, plus promotional push on facebook
  • Action: check out facebook links – if they’re promoting, it should be visible
  • “Many if not all of the sites were registered on June 30 and updated on the same day in August, according to online domain records.” Hmm. This happens a lot, and is likely to happen a lot in 2020. Could the registries alert when large batches of newsy URLs get registered together? Yes, there are easy counters to this, but unless it gets automated, generally people are lazy. 
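That batch-registration pattern is easy to check once we have a site list. A minimal sketch, assuming the python-whois package (some registrars block lookups, and creation_date sometimes comes back as a list):

import whois
from collections import Counter

# a few domains from the notes above; swap in the full master list
domains = ['annarbortimes.com', 'nemontananews.com', 'grundyreporter.com',
           'mdstatewire.com', 'iowacitytoday.com']

created = Counter()
for dom in domains:
    try:
        w = whois.whois(dom)
    except Exception:
        continue
    cd = w.creation_date
    if isinstance(cd, list):        # python-whois sometimes returns several dates
        cd = cd[0]
    if cd:
        created[str(cd)[:10]] += 1  # count registrations per YYYY-MM-DD

# a big spike on a single day suggests batch registration, as the NYT noted
print(created.most_common(5))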

Twitter search on NYT article

  • Twitter search “@globaldan until:2019-10-22”
  • Article about student papers filling in news deserts https://t.co/Tb8iMaiF60?amp=1
  • Yeah. That’s it. 

Oh the glamour. 2am and still going through the basic datanerding. Well, on to the adding lots of sites to the list part (current haul is the 23 sites mentioned in articles, but we have the Franklin Archer list, and the state publications list at the bottom of each site to add in, as easy wins)

Looking for “pink slime” articles

  • Google ‘“pink slime” misinformation’ – I learn this is a term for mechanically separated meat, which has its own misinformation subgenre
  • Googled ‘”pink slime” misinformation -beef -meat -food’
    • Mediawell has a link to Columbia article – nothing new added
    • Ooch, I should listen to the CarolC podcast – Action: do later
    • Yeah… the CJR article seems to be the only misinformation piece talking about “pink slime” (I haven’t heard the term before, but I don’t keep up with all the journos covering misinformation and could have missed it) – count as a dead end.

Look for misinformation-related terms plus actors/objects above:

  • Google ‘“Lansing state journal” misinformation’ – lots of unrelated articles
  • Google ‘“Lansing state journal” “Metric media”’ – looking for articles spawned by the first one
    • Detroit metro times – added some more sites (was the list of 40 circulated, or did they look at the page ends for these hyperlocal sites?)
    • Index journal – says lansing state journal found 47 sites.   
      • SC based. Written by Matthew Hensley 864-943-2529 mhensley@indexjournal.com, @IJMattHensley
      • “Metric Media is a division of Situation Management Group Inc., a firm that specializes in crisis response but provides a number of other marketing services.”  SMG CEO Brad Cameron says in his profile that Metric Media “operates more than 1,100 community-based news sites”  
      • “Headlines on the bulk of these stories contain the name of a small, South Carolina community, including such towns as Arial, Branchville, Fairplay, Fingerville, Jacksonboro and Mountville. Hundreds of others included ZIP codes. These prominent local references appear to be a tactic for search engine optimization that seems aimed at getting eyeballs from across the Palmetto State.” – yeah, see lots of this SEO in con-artist posts.  
      • “Under the umbrella of the Metro Business Network, one such site exists for each state and for Washington, D.C.” Action: find these sites
      • “the South Carolina one has caught fire on social media, with nearly 4,000 combined followers between Facebook and Twitter.” Action: find facebook, twitter for each of these publications
      • Sensible counter: suggestions alternative local news sources on both left and right of political spectrum
    • Nieman labs – talks about local news having high levels of trust vs low profitability.  I think this point is important here – it’s the vulnerability. Links current work back to Laura McGann’s 2010 work. 
      • Lot of history in here: 2016 fake sites like @ElPasoTopNews, @MilwaukeeVoice, @CamdenCityNews, @Seattle_Post, Denver Guardian. 
      • Mentions democrat plans to launch outlets too. Action: check link to democrat network.  
      • Has lots of examples of other current news sites set up with political bias/ by political operators. Action: re-read this article after safari, and do same exercise with other links in it.
      • I think this mixing of intentions and outputs is a key part of the current internet.  That you can be both a political party operation and a news outlet at the same time is easy when the endpoint is a website, not e.g. a newspaper.  “In all of these cases, the issue is less about politicians promoting their points of view than hiding their affiliation with the content — making it hard for a reader who would naturally bring more skepticism to a campaign ad than they would a local news story.” – so this is about those levels of trust, skepticism, the way that people read differently sourced material differently.  It’s basically the Admiralty Scale without the “from” part of the scale – so people apply their own “from” rating to each article, and we’re terrible at making that snap judgement. Action: important point- write note on this.
    • The Washington Tribune Company has “as many as 1,200 locally-focused URLs from AlaskaTribune.com to WichitaLedger.com”
    • Publishing insider – nothing much new; 1 new site named
    • Holland Sentinel – looks at hollandreporter.com
      • Arpan Lobo alobo@hollandsentinel.com @arpanlobo
      • Is there a master list of local news outlets in the US?  Does it include originators/ owners? I think I can remember one being created when the first local-is-dying stories started to break.  Action: look for list of US local news outlets. Map density against new sites? 
      • Nice piece of further digging
      • “Other Metric Media sites include Michigan Business Daily. There is a “Business Daily” website for all 50 states, part of the Metro Business Network. The websites have a similar layout to those in the Michigan Network.”
      • Action: find all the “xx business daily” sites
    • Livingston Post – story from PoV of local news outlet
      • Looks at Livingston Today – “if you look at the site today, there isn’t a single local news story on it, unless you want to know the price of gas in Pinckney, or read a few calendar items scraped from the web.”
    • Action: continue google search for local outlets reporting on this story (specifically, looking for local reporters who’ve found new parts of the network)

1.2 scrape from pages

Grab site lists from bottom of each known site

  • Look at annarbortimes.com.  We want the list at the bottom of the page:
  • Inspect element shows this is in <ul class="footer__list">. We’re going to do this for up to 50 states, so write a small scraper
  • Try something simple: 
import requests
from bs4 import BeautifulSoup
import pandas as pd

# first attempt: plain GET request
puburl = 'https://annarbortimes.com/'
response = requests.request('GET', puburl)
response.text

# same request again, this time sending a browser User-Agent header
pubtitle = 'Ann Arbor Times'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.request('GET', puburl, headers=headers)
response.text
  • Look for the UL with the site names in: 
# find the footer list(s) holding the related-site links
data = BeautifulSoup(response.text, 'html.parser')
site_uls = data.find_all('ul', attrs={'class': 'footer__list'})
site_uls[0]
  • Yep, that’s all of them – in three divs, then lis.  We can use beautifulsoup (or scraper of choice) to iterate through these, pull out the names/urls and stuff them in a csv.
# pull each <a> out of the footer list, and save the site name/url pairs as CSV
sudata = site_uls[0]
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]
df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle+'.csv')
df
  • Scraped MI (via annarbortimes.com)
  • Trying collegeparktoday.com (Florida) – this has a different html format to the MI sites
    • Done: adjust scraper for Florida sites
  • Scraped IL (via grundyreporter.com)
  • Scraped MD (via mdstatewire.com)
  • Scraped IA (via iowacitytoday.com)
  • Scraped MT (via nemontananews.com)
  • Scraped AZ (via grandcanyontimes.com)
  • Scraped NC (via hickorysun.com)
  • Look at michigandaily.com – it isn’t on the sites list, has a different format, looks like a student paper?  Action: check out michigandaily.com – should it be in here?
  • Palmetto Business Daily is part of a network of state-specific sites.  Has a different div html, but otherwise scraper should work. Html is <div.col-xl-6.col-lg-5.col-md-5.col-sm-12.col-12> (and “copy selector” in inspect element is very useful).  Action: adjust scraper, scrape site list from Palmetto Business Daily
  • Thinking that we could do with a list of states for the google search, so we don’t get *all* sites back the first time, but go state by state through them. 
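A tiny sketch of that state-by-state idea – just building one query per state from a hand-typed list (shortened here), using the ToS phrase noted earlier; the actual search call comes in the next section:

# shortened; extend to all 50 states plus DC
us_states = ['Michigan', 'Illinois', 'Maryland', 'Iowa', 'Montana', 'Arizona', 'North Carolina']

tos_phrase = 'access to and use of Locality Labs'
state_queries = ['"{}" "{}"'.format(tos_phrase, state) for state in us_states]
state_queries[:3]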

Automating the google search

  • https://www.geeksforgeeks.org/performing-google-search-using-python-code/ is good step-by-step guide
  • Thinking that we could either spend our time hunting for these local news aggregator sites, or we could create aggregator sites without the politics, that local people might use.  Action: write up this thought somewhere, thinking about effort and reward. 
  • Thinking that what we’re seeing here is someone who really thought about the internet, what it does, what types of sites are around, and then spent time thinking hard (with post-its?) about how to adapt it as a political weapon.  Action: write up this thought too. Think about what else we haven’t seen yet, that we probably will if this planning session(s) had taken place. 
  • Thinking about finding more outside this network. What if we generated a bunch of likely titles from the patterns above, and went looking for them? Action: write small generator for news site names (a first stab is sketched after this list)
  • One way to look for local news covering these sites is to google search for the site URLs – find who’s pointing to them and isn’t already on the naughty list. Action: google search for site URLs, looking for links to them that aren’t already on the naughty list. 
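A first stab at that generator – a minimal sketch that combines place names with the site-name suffixes seen so far and checks which candidates respond at all. The places and suffixes lists here are illustrative, and a 200 response only means *something* is at that name, not that it’s part of the network (it still needs the footer / about-us checks):

import itertools
import requests

# patterns seen so far in the network: <place><suffix>.com
places = ['annarbor', 'grandrapids', 'lansing', 'iowacity', 'hickory', 'michigan', 'montana']
suffixes = ['times', 'news', 'reporter', 'sun', 'today', 'wire', 'record', 'capitolnews', 'businessdaily']

headers = {'User-Agent': 'Mozilla/5.0'}
candidates = ['https://{}{}.com/'.format(p, s) for p, s in itertools.product(places, suffixes)]

live = []
for url in candidates:
    try:
        r = requests.get(url, headers=headers, timeout=5)
        if r.status_code == 200:
            live.append(url)   # candidate exists; still needs checking against the network markers
    except requests.RequestException:
        pass
print(live)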

Text searches (by hand first)

from googlesearch import search 
query = '"is a product of LGIS - Local Government Information Services" site:.com/about-us'
js = [j for j in search(query, tld="com", num=100, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]
js
  • The extra_params={'filter': '0'} gives us the extra search results we wanted.  Found this using help(search) in the notebook. Currently on a “429: Too Many Requests” timeout.
    • Have to sleep now – Action: in the morning, write code that searches for the url roots from the google search in the pink_slime_sites master list, and spits out anything new. 
  • 2019-12-24
  • Checking what came back from google search (unholy bad code):
import re

# strip each search result down to its domain, then diff against the master site list
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_test.csv', index=False)
df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
set(newsites) - set(fromsites)
  • No new sites found using the string above.  Let’s see what *didn’t* have it in… 
  • Looking at bigskytimes.com – has a link to facebook.  We could pull the facebook pages for all these sites, see what’s listed as “related sites”
  • Looking at Northern California Record – the list of other sites is across many states
  • About_us says “OUR GOAL at the Norcal Record is to cover Northern  legal system in a way that enables you, our readers, to make the public business your business.” … “The Record is owned by the U.S. Chamber Institute for Legal Reform.”
    • Action: look further into Northern California Record – is this another loop?
  • I’m thinking about the cathedral and the bazaar now.  This looks like a lot of sites have been created – we started with ‘local’ news sites, but this is a subject-specific site.  Is this like having a market with stalls, where stall keepers are shouting their wares, and someone comes in with a team and megaphones and drowns out all the other voices?  We want voices online- we want everyone to be able to speak – but this somehow feels unfair. In the market, there would be an ombudsman – someone who could be asked to make this right. Who are the ombudsmen of the internet? 
  • Trying a new search term: “Metric Media LLC began to fill the void in local and community news”
    • 7 new sites found: [‘eastnewmexiconews.com’,  ‘enchantmentstatenews.com’,  ‘nenewmexiconews.com’, ‘santafestandard.com’,  ‘scminnesotanews.com’, ‘seminnesotanews.com’, ‘swnewmexiconews.com’]
  • Scraped NM
  • Scraped MN
  • Trying google search  “’access to and use of Locality Labs’” in /terms
    • Search in /about-us gave lots of results too
    • Lots of results, not all local news
    • Includes manilabusinessdaily.com.  Action: look at manilabusinessdaily – is this model going overseas too?
    • Found staging sites? http://louisianarecord.pli-records-staging.locallabs.com/ – worth checking the main site for these?  Action: general search on site pli-records-staging.locallabs.com
    • Grimesjournal.com says it’s a set of local business listings. It has story links to Iowa Business Daily and Indiana Business Daily, but no list of linked sites.   (We’ve already noted these business daily sites above). Is Grimes a hyperlocal spin-off of the business dailies (Grimes is a city in Iowa)?  Action: look for texts in Grimes, or links to the business dailies, to see if there are other hyperlocal sites like this one.  
    • https://lynwoodtimes.com/ is a similar site with local business listings.  Feels like someone is astroturfing all the “useful” but hard to monetise sites – this is Our Town all over again.  Looking at their facebook site, get unavailable notice:
  • More local business listing sites: urbandaletimes.com, https://lansingreporter.com/, glendalesun.com, 
    • Action: lansingreporter.com wasn’t flagged by original article – is this a new site? Check dates on it. 
    • Action: add all the clipper sites. 
    • Action: look for more clipper sites.
    • Action: add business daily sites from from_articles list.
    • Statesman.com looks very different to the other news sites; has login, more production values?  “Austin American-Statesman is owned by Gannett Media Corp”…. Action: check austin american statesman.  Action: check Gannett media corp
    • https://torontobusinessdaily.com/ – no “other sites” listed, but check Canada too? Action: check torontobusinessdaily.com and look for other sites in Canada
    • https://gulfnewsjournal.com/ – looks like the businessdaily sites.  Action: check gulfnewsjournal.com and for related non-US sites too?
    • FDAreporter.com – says it’s “FDA Reporter is a trade journal for the U.S Food and Drug Administration’s employees and contractors, covering personnel moves, budgets, acquisitions, contracting and hiring. This includes the Office of the Commissioner, Office of Operations, the Offices of Regulatory Affairs and Global Regulatory Affairs, and Office of International Programs.”.  Is this real, or a site aimed at a vertical? Action: check FDA Reporter provenance.
    • Tobacconewswire.com – “Tobacco News Wire covers federal and state regulation and taxation of tobacco, including the U.S. Food and Drug Administration’s Center for Tobacco Products. Topics include the rise of e-cigarettes, the process through which tobacco companies get products approved by the FDA, and how these things affect retailers and manufacturers.”.  Action: check provenance of tobacconewswire.com
    • Montgomerymdnews.com – caught my eye.  Don’t we have this in MD already? Nope, turns out we have montgomerynews.com because methinks the person who set up the site list at the end got the name wrong (montgomerynews.com is a very different site, different owners). Action: flag montgomerynews vs montgomerymdnews confusion to other list owners?
    • Wealthmanagementwire.com looks different to other trade sites; about page different too, mentions funding “Wealth Management Wire is supported – in part – through sponsorships with brands interested in reaching our audience of personal wealth management, banking and life insurance professionals , advisors, brokers and America’s C-Suite. Interested in partnering? Email us at partners@wealthmanagementwire.com.”  Hrdailywire.com is same format as wealthmanagementwire.com 
    • This creating a site for each vertical looks to me like the way meetup.com was created (by speculatively creating and populating pages for groups that might be of interest to people, rather than having people create all the groups entirely from scratch).  Is it worth looking at other interesting origin stories and seeing if those could be adapted this way too? Action – write note about adapting company origin stories for astroturfing etc.
    • FDAhealthnews.com: “FDA Health News covers the U.S. Food and Drug Administration’s Center for Drug Evaluation and Research and its Center for Devices and Radiological Health. It reports on the FDA’s process of testing and approving new drugs, its interactions with pharmaceutical companies, and any other news related to the FDA.”
    • https://hansondirectory.com/about-us is interesting – is a directory service.  Might be able to spider out from its facebook page https://www.facebook.com/HansonDirectory  Action: spider out from HansonDirectory facebook page – look for sites.
    • Surprisejournal.com gave a 503 error – ran it from console, *lots* of back end calls… looks like another hyperlocal site.
    • Epnewswire.com “EPNewswire is a business journal focused on the intersection of science-based regulation and basic industries, including food, energy and chemical production.”
    • Is it me, or are there a bunch of trade journals here, all reposting newswire content, but all focussed on divisive subjects?
    • Cistranfinance.com “The independent CISTRAN Finance news service serves as a global hub for business and banking news for approximately 70 percent of the world’s population in Eurasia and Afro-Asia.” – this is an odd one. Wide area, not understanding the point of doing this, who/why.
  • So. Have about 8(?) states so far, need to go looking for more.  Try some of the headlines? 
    • Found several local (clipper) sites. https://mahaskaguide.com/about-us includes “We have published directories for independent telephone companies since our inception in 1973. Current production includes 108 telephone directories for more than 130 U.S. telephone companies spread across 23 states.”
    • NB DuckDuckGo has results on news phrases from sites, even when Google produces no results. 

So now I have a bunch of sites, several of which appear to be related (in subject, look and feel etc), but will take further investigation to see whether they’re part of something underhand, or just normal business.  I suspect we’ll see a lot of this in future: companies that do legit business online (e.g. directory services), also providing (wittingly or unwittingly) disinformation carriers. 

Networks of related sites found so far:

  • Statewide “news” sites (e.g. LansingSun.com – NB Florida is different)
  • State-by-state “business daily” sites (e.g. mdbusinessdaily.com)
  • Non-US “business daily” sites (e.g. torontobusinessdaily.com)
  • Clipper sites (e.g. lynwoodtimes.com)
  • Meta-level sites (hansondirectory.com, metrobusinessnetwork.com)
  • Unrelated? Sites (e.g. statesman.com, michigandaily.com)
  • “Record” sites (e.g. norcalrecord.com) and staging sites (e.g. wvrecord.pli-records-staging.locallabs.com)
  • Trade sites (e.g. farminsurancenews.com)
  • Issue sites (e.g. epnewswire.com)
  • Odd sites (e.g. newsroom.westandforprogress.com)

Leftover action: not sure how many search results come back from google search api. Recheck code https://github.com/anthonyhseb/googlesearch/blob/master/googlesearch/googlesearch.py

We have the original set of local news sites.  Going to add all the other sites above to the main list, and make that a list of connected sites, before checking them all for disinformation behaviours (e.g. look at their non-automated stories). Action: check all sites on main list for disinformation behaviours.
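One cheap first pass on that last action, using the earlier observation that the politically skewed stories carry human bylines while the generated ones don’t: count bylined vs non-bylined stories on each front page. A rough sketch only – the assumption that stories sit in <article> elements with bylines in an “author”-ish class is a guess that needs checking against the actual templates:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

def byline_ratio(puburl):
    # crude proxy: how many front-page stories carry a human byline?
    # NB the <article> and 'author' class selectors are assumptions, not checked per template
    r = requests.get(puburl, headers=headers, timeout=10)
    soup = BeautifulSoup(r.text, 'html.parser')
    stories = soup.find_all('article')
    bylined = [s for s in stories if s.find(class_=lambda c: c and 'author' in c)]
    return len(bylined), len(stories)

print(byline_ratio('https://annarbortimes.com/'))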

  • Florida localnews: scraped.
  • Palmetto business daily: scraped. Looked at Alabama Business Daily, got a popup with no ‘close’ button (“Pulse…” not optimised something or other).
  • Non-US business dailies: copied list (4 sites) into master list,  but will need a different search type to find others. Action: search for non-US business dailies.

Current counts (sites found):

  • localnews                230
  • businessdaily             50
  • clipper                   12
  • trade site                10
  • businessdaily – nonUS      5
  • pressure group             3
  • staging                    3
  • metasite                   2

Open: 

1.3 Use Metadata

So we’ve worked outwards from the original set of articles that flagged these networks, scraped sites’ lists of related sites, and used Google’s API to search for sites containing specific phrases.  Time to look for the other things that connect sites (and use them to find more). 

Quick and dirty tag comparisons: using BuiltWith tags and IP relationships, e.g. https://builtwith.com/relationships/montgomerymdnews.com

Now this is where we start putting the “science” into our “data science”. There are two versions of BuiltWith: public and pay-for.  The pay-for is amazing, with all sorts of useful data in it, but the public (free) version isn’t bad either if you’re looking for related sites.  And BuiltWith has an API. Here’s where we start cooking on gas. 

To do next:

  • Done. Compare the “from_articles” and “from_sites” url lists. Add anything new to the master list.  
  • Start using the tags (google, ad etc) on each site to find more sites related to the seed list (the “master” list). 
  • Head into social media and other more subtle links

Coded up builtwith API calls. First 10 API calls on an account are free, then need to pay $100 for each 2000 calls or so (students/ academics might be able to get free access?).  Code is:

import requests
import json
import pandas as pd
bwkey = '<put your own key here>'
bwdom = 'propertyinsurancewire.com'
bwapi = 'rv1' # 'free1' is the free api
bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
bwresp = requests.get(bwurl)

with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
    json.dump(bwresp.json(), outfile)
    
matches = pd.DataFrame([])
identifiers = pd.DataFrame([])

# walk the Relationships returned for this domain: each identifier (ad ids,
# analytics tags etc) comes with the other domains that share it
rs = bwresp.json()['Relationships']
for thisr in rs:
    fromdomain = thisr['Domain']
    rsi = thisr['Identifiers']

    ids = pd.DataFrame(rsi).drop('Matches', axis=1)
    ids['FromDomain'] = fromdomain
    identifiers = identifiers.append(ids)

    for rsix in rsi:
        rsimatches = pd.DataFrame(rsix['Matches'])
        rsimatches['Type'] = rsix['Type']
        rsimatches['Value'] = rsix['Value']
        rsimatches['FromDomain'] = fromdomain
        if len(rsimatches) > 0:
            matches = matches.append(rsimatches)
matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
matches

Which dumps out json and csv files for a single API call, but in a format where a whole collection of call outputs could be stuck together.   Tested on montgomerymdnews.com, which gave a few new sites – and on propertyinsurancewire.com, which gave a list of 300 to check, including ones on the same NewRelic id. 

Alrighty. Modified code to loop round all the urls I have.  It’s fast. But it’s also spitting out some empty results – all the “1 byte” csvs below.  Not sure why; will check when the run finishes. 

Action: check the “1 byte” builtwith API returns.

Run finished.  Got 46434 domains back from BuiltWith, 1120 of them unique.  946 of those domains aren’t on the seed list. These are a mix: some look like news domain addresses; others don’t. 

At this point, a requests.get call on /about-us for these sites should give a decent clue to whether they’re linked to the above sites (although africanmangoscam.net is definitely getting a visit) – there’s a sketch of that check after the list below. Looking at them, by first filtering on newsy words like “news”, “county” etc:

  • Lots of clipper sites. Found many of these looking for common ‘news’ terms: news, county, today, times, reporter, wire, sun, record
  • Some parked (godaddy) and down
  • Gold = nekentuckynews.com is another localnews site (scraped, added)
  • Some of these names aren’t easy searches (e.g. Gray Guide), and they don’t have local content. Is this a new form of astroturfing? Action: think about how networks of sites could affect e.g. Google search results. 
  • Not sure about some sites, e.g. powernewswire.com – looked at /about-us, https://powernewswire.com/privacy and /terms (privacy and terms weren’t linked from the front of the site, and the about page was vague); found Locality Labs text on /privacy. 
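The sketch mentioned above: pull /about-us (plus /terms and /privacy) for each candidate domain and look for the phrases that have identified network sites so far. It assumes the candidates are in newsites_tmp.csv, as written out by the BuiltWith script at the end:

import pandas as pd
import requests

# phrases that have identified network sites so far
markers = ['Locality Labs',
           'Metric Media LLC began to fill the void in local and community news',
           'is a product of LGIS - Local Government Information Services']

headers = {'User-Agent': 'Mozilla/5.0'}
candidates = pd.read_csv('newsites_tmp.csv')['URL'].to_list()

flagged = []
for dom in candidates:
    for page in ['about-us', 'terms', 'privacy']:
        try:
            r = requests.get('https://{}/{}'.format(dom, page), headers=headers, timeout=10)
        except requests.RequestException:
            continue
        if any(m.lower() in r.text.lower() for m in markers):
            flagged.append(dom)
            break
pd.DataFrame(flagged, columns=['URL']).to_csv('flagged_about_us.csv', index=False)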

I ran out of holiday time (to be honest, I diverted into hanging out with my parents, which was IMHO a very good use of time).   I enjoyed the exercise very much, and I have even more respect now for the data journalists who do this work all the time.

Code

''' Scrape pink slime '''
import requests
from bs4 import BeautifulSoup
import pandas as pd

# pull page data from site
pubtitle = 'NE Kentucky News'
puburl = 'https://nekentuckynews.com/'
stype = 'local'
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}
response = requests.request('GET', puburl, headers=headers)

# Grab the raw list of sites
data = BeautifulSoup(response.text, 'html.parser')
if stype == 'florida':
    sudata = data.find_all('nav', attrs={'class': 'foot-nav'})[0]
elif stype == 'bd':
    fdata = data.find_all('div', attrs={'class': 'footer'})[0]
    row1 = fdata.find_all('div', attrs={'class': 'row'})[0]
    sudata = row1.find_all('div', attrs={'class': 'row'})[0]
else:
    sudata = data.find_all('ul', attrs={'class': 'footer__list'})[0]

# Convert to CSV
allas = sudata.find_all('a')
rows = [[pubtitle, puburl]]
for thisa in allas:
    rows += [[thisa.text, thisa['href']]]
df = pd.DataFrame(rows, columns=['Site', 'Url'])
df.to_csv(pubtitle+'.csv', index=False)
''' googlesearch_for_terms '''
from googlesearch import search 
import pandas as pd
import re
from datetime import datetime

qterm = 'access to and use of Locality Labs'
query = '"{}" site:.com/terms'.format(qterm)
  
js = [j for j in search(query, tld="com", num=1000, stop=None, pause=2.0,
                        extra_params={'filter': '0'})]

# Save, and compare against existing site list
newsites = [re.findall('//(.*)/', x)[0] for x in js]
df = pd.DataFrame(newsites, columns=['url']).sort_values('url')
df.to_csv('googlesearch_{}.csv'.format(datetime.now().strftime('%Y-%m-%d-%H-%M')),
          index=False)

df2 = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df2['URL'].to_list()
fromsites = [re.findall('//(.*)', x)[0].strip('/') for x in fromsites]
df3 = pd.DataFrame(list(set(newsites) - set(fromsites)), columns=['url'])
df3.to_csv('temp.csv', index=False)
''' use_builtwith '''

import pandas as pd
import requests
import json 
import re

df = pd.read_csv('pink_slime_sites - from_sites.csv')
fromsites = df['URL'].to_list()
fromsites = [x[x.strip('/').rfind('/')+1:].strip('/') for x in fromsites]

bwkey = '<get_a_key>'
bwapi = 'rv1' # 'free1' is the free api

allmatches = pd.DataFrame([])
allidentifiers = pd.DataFrame([])

for bwdom in fromsites:
    print(bwdom)
    try:
        bwurl = 'https://api.builtwith.com/{}/api.json?KEY={}&LOOKUP={}'.format(bwapi, bwkey, bwdom)
        bwresp = requests.get(bwurl)

        with open('builtwith/{}.json'.format(bwdom), 'w') as outfile:
            json.dump(bwresp.json(), outfile)

        matches = pd.DataFrame([])
        identifiers = pd.DataFrame([])

        rs = bwresp.json()['Relationships']
        for thisr in rs:
            fromdomain = thisr['Domain']
            rsi = thisr['Identifiers']

            ids = pd.DataFrame(rsi).drop('Matches', axis=1)
            ids['FromDomain'] = fromdomain
            identifiers = identifiers.append(ids)

            for rsix in rsi:
                rsimatches = pd.DataFrame(rsix['Matches'])
                rsimatches['Type'] = rsix['Type']
                rsimatches['Value'] = rsix['Value']
                rsimatches['FromDomain'] = fromdomain
                if len(rsimatches) > 0:
                    matches = matches.append(rsimatches)
        matches.to_csv('builtwith/{}_matches.csv'.format(bwdom), index=False)
        identifiers.to_csv('builtwith/{}_identifiers.csv'.format(bwdom), index=False)
        if len(matches) > 0:
            allmatches = allmatches.append(matches)
        if len(identifiers) > 0:
            allidentifiers = allidentifiers.append(identifiers)
    except:
        continue

allmatches.to_csv('builtwith/allmatches.csv', index=False)
allidentifiers.to_csv('builtwith/allidentifiers.csv', index=False)

newsites = pd.DataFrame(set(allmatches['Domain'].to_list()) - set(fromsites), columns=['URL'])
newsites.to_csv('newsites_tmp.csv', index=False)