[Cross-post from Medium https://medium.com/misinfosec/disinformation-datasets-8c678b8203ba]
“Genius is Knowing Where To Look” (Einstein)
I’m often asked for disinformation datasets — other data scientists wanting training data, mathematician friends working on things like how communities separate and rejoin, infosec friends curious about how cognitive security hacks work. I usually point them at the datasets section on my Awesome Misinformation repo, which currently contains these lists:
- Nation-state-level social media messages: Twitter Election Integrity (state-backed infoops archives: regularly updated), Graphika Information Operations Archive (2018 Twitter and Reddit IRA datasets), 538 list of IRA tweets (2012–2018), Ushadrons (to 2018), Hamilton 2.0 (continuously updated), Authoritarian Interference Tracker (regularly updated)
- Online advertisements: House Intelligence Facebook ads sample and House Intelligence Committee Facebook ads (static collections)
- Bot/Cyborg/Troll accounts: botsentinel, probabot (rates accounts from bot to not) (continually updated)
- Websites and articles: fake news challenge (articles: static collection), false, misleading, clickbaity and satirical ‘news’ sources (the original ‘Melissa’ list, 2016: static list), GDI dataset (domains, articles: regularly updated)
- Pre-2016 misinformation datasets: Pheme 8.2 annotated news corpus, Jonathan Albright’s datasets of bot posts (to 2016) (static collections)
That’s just the data that can be downloaded. There’s a lot of implicit disinformation data out there. For example, groups like EUvsDisinfo, NATO Stratcom and OII Comprop all have structured data on their websites.
You’re not going to get it all
That’s a place to start, but there’s a lot more to know about disinformation data sources. One thing to know, as Lee Foster pointed out at Cyberwarcon last month, is that these datasets are rarely a complete picture of the disinformation around an event. Lee’s work is interesting. He did what many of us do: as soon as an event kicked off, his team started collecting social media data around it. What they did next was compare that dataset against the data officially released by the social media companies (Twitter, Facebook). There were gaps, big gaps, in the officially released data. That’s understandable in a world where attribution is hard and campaigns work hard to include non-trollbots (aka ordinary people) in the spread of disinformation.
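The comparison described above can be sketched in a few lines. This is a minimal illustration, not Lee’s team’s actual method: it assumes both your own collection and the official release carry a shared identifier (a tweet ID, say), and the function name `coverage_gap` and the toy IDs are invented for the example.

```python
def coverage_gap(collected_ids, official_ids):
    """Compare IDs we collected ourselves against an officially released set.

    Both arguments are iterables of identifiers (assumed here to be a
    shared key such as a tweet ID; that is an assumption, not a given).
    """
    collected = set(collected_ids)
    official = set(official_ids)
    overlap = collected & official
    return {
        # Seen in our own collection but absent from the official release
        "only_collected": collected - official,
        # In the official release but missed by our own collection
        "only_official": official - collected,
        "overlap": overlap,
        # Fraction of our collection that also appears in the release
        "share_released": len(overlap) / len(collected) if collected else 0.0,
    }


# Toy identifiers, invented for illustration only
mine = ["t1", "t2", "t3", "t4"]
released = ["t2", "t4", "t9"]

gap = coverage_gap(mine, released)
print(sorted(gap["only_collected"]))  # items to look at more closely
print(gap["share_released"])
```

Anything that lands in `only_collected` is not automatically disinformation. It is just material the official release doesn’t cover, which is where the human analysis starts.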
Some people can get better data than others. For instance, Twitter’s IRA dataset has obfuscated user ids, but academics can ask for unobfuscated data. It’s worth asking, but also worth asking yourself about the limitations placed on you by things like non-disclosure agreements.
I’ve seen this before
So what happens is that people who are serious about this subject collect their own data, and lots of them collect data at the same time. That data sits on their personal drives (or somewhere online) whilst other researchers are scrabbling round for datasets on events that have passed. I’ve seen this before. Everything old is new again, and in this case it’s just what we saw with crisismapping data. There were people all over the world (locals, humanitarian workers, data enthusiasts etc.) who had data that was useful in each disaster that hit, which meant that a large part of my work as a crisis data nerd was quietly, gently extracting that data from people and getting it to a place online where it could be used, in a form in which it could be found. We volunteers built an online repository, the Humanitarian Data Project, which informed the build of the UN’s repository, the Humanitarian Data Exchange. I also worked with the Humanitarian Data Language team on ways to meta-tag datasets so that the data people needed was easier to find. There’s a lot of learning in there to be transferred.
And labelled data is precious, so very precious
Disinformation is being created at scale, and at a scale beyond the ability of human tagging teams. That means we’re going to need automation (or rather augmentation — automating some of the tasks so the humans can still do the ‘hard’ parts of finding, labelling and managing disinformation campaigns, their narratives and artefacts). And to do that, we generally need datasets that are labelled in some way, so the machines can ‘learn’ from the humans (there are other ways to learn, like reinforcement learning, but I’ll talk about them another time). Unsurprisingly, there is very little in the way of labelled data in this world. The Pheme project labelled data; I helped with the Jigsaw project on a labelled dataset that was due for open release; I’ve also helped create labelling schemes for data at GDI, and am watching conversations about starting labelling projects at places like the Credibility Coalition.
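To make “the machines learn from the humans” concrete, here is a toy sketch of supervised learning from labelled text. It is a tiny word-count naive Bayes classifier written with the standard library only. The example texts, the label names (`review`, `ok`) and the function names are all invented for illustration, and nothing here reflects how Pheme, Jigsaw or GDI label or model their data.

```python
import math
from collections import Counter, defaultdict


def train(labelled):
    """Build word counts per label from (text, label) pairs tagged by humans."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labelled:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest naive Bayes log-score for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: how common this label is in the labelled data
        score = math.log(label_counts[label] / total)
        # Laplace (add-one) smoothing so unseen words don't zero the score
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical human-labelled examples: "review" means send to a human analyst
labelled = [
    ("miracle cure they hide from you", "review"),
    ("secret cure doctors hide", "review"),
    ("city council meeting tonight", "ok"),
    ("council publishes meeting minutes", "ok"),
]

model = train(labelled)
print(predict(model, "secret miracle cure"))
print(predict(model, "council meeting"))
```

Four examples is nowhere near enough to learn anything real, which is rather the point: the quality of anything built this way is bounded by how much carefully labelled data the humans have produced.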
That’s it — that’s a start on datasets for disinformation research. This is a living post, so if there are more places to look please tell me and I’ll update this and other notes.