Project idea: mining crisis googlegroups

The Problem

Let’s start with “what is the problem I’m trying to solve here”.

I’m responsible for designing Hunchworks.  I’m also responsible for making it what the users need, and for understanding and fitting into the systems that they already use to do hunch-style collaborative inference. Now we know a bit about business analysis, so we’ve spent a while looking for these existing systems, and found them mainly in two places: closed discussion groups (e.g. Googlegroups), and Skype group chats.

I’ve traced (and anonymised) a couple of threads by hand – the main point is the type of information people post, rather than who posted it or exactly what it says – but whilst I was doing it I realised there were mining tools that could help both with this and with my usual problem of not being able to find information on X in a Googlegroup after it’s posted (no, the search doesn’t work for this).  Also, any visualisation or compression could be extremely useful to people trying to navigate the thousands of emails and hundreds of topics that thread their way through the crisismapping googlegroups.

Getting the Data

So to work. The first thing I needed was to get Googlegroups data onto my machine. There’s an RSS feed for group NameHere at  http://groups.google.com/group/namehere/feed/rss_v2_0_topics.xml  but that appears to cover only new posts. A Google search on “mining google groups -site:groups.google.com” didn’t show anything, but I did get a useful journalists’ guide to Google’s search operators as a consolation prize. And then paydirt: a PHP Googlegroup scraper that produces XML, which I’ve modified to dump the XML into a file.
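For what it’s worth, pulling that feed down and listing its items is only a few lines of Python. This is a minimal sketch, assuming the group really is called “namehere” as in the URL above, and (as noted) it only covers new topics, not the archive – so it’s a sanity check rather than a replacement for the scraper.

```python
# Minimal sketch: fetch the (new-topics-only) RSS feed for a group and list
# its items. The group name and the exact fields present in the feed are
# assumptions -- check against a real feed before relying on them.
import urllib.request
import xml.etree.ElementTree as ET

FEED = "http://groups.google.com/group/namehere/feed/rss_v2_0_topics.xml"

with urllib.request.urlopen(FEED) as resp:
    tree = ET.parse(resp)

for item in tree.getroot().iter("item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    date = item.findtext("pubDate", default="")
    print(date, "|", title, "|", link)
```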

The data is about a group, so it has people (in the username and cc fields), topics (in the subject field), areas of interest (in the subject and text, so I’ll have to do some text analysis to get these) and times/dates.
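Before doing anything clever, it helps to pin down what a single post record looks like. This is just my working sketch of the fields; the names are assumptions of mine, to be mapped onto whatever tags the scraper’s XML actually uses.

```python
# Working sketch of the per-post record I want out of the scraper's XML dump.
# Field names are my own assumptions, not the scraper's schema.
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Post:
    author: str                        # username field
    cc: List[str]                      # other addresses copied in
    subject: str                       # topic / thread subject
    body: str                          # message text (for later text analysis)
    sent: datetime                     # time/date of the post
    in_reply_to: Optional[str] = None  # author of the post being replied to,
                                       # once the scraper can extract reply-to
```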

As an aside, it would be really cool if Google allowed tagging on these posts. That way, people could come in later and add tags to each post, to make searching by area or subjects much easier.

Knowing which questions to ask

What I’m trying to understand at first is how people communicate data around an emerging crisis. Which people get involved, do they bring each other into the group to help, what types of evidence do they provide and which external links do they point at and when.

Another question is “are there cliques in this group” – not, “are there groups that don’t talk to each other”, but “are there natural groups of people who cluster around specific topic types or areas”.  And “who are the natural leaders – who starts the conversation on each topic, who keeps the conversations going”.

More subtle things to find are ‘hubs’ (people who connect other people together), attachments/links (so we can see what’s being linked to outside the group pages) and information flows between people and affiliations.
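To make the clique and hub questions concrete, here’s a rough sketch using networkx, under the assumption (discussed further below) that two people are linked if they’ve posted on the same thread. `posts` is a list of the Post records sketched earlier; nothing here is tuned, it’s just the shape of the analysis.

```python
# Sketch, not a finished analysis: treat "posted in the same thread" as a link,
# then look for natural groups (communities) and hub people (betweenness).
from itertools import combinations
from collections import defaultdict

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_graph(posts):
    by_thread = defaultdict(set)
    for p in posts:
        by_thread[p.subject].add(p.author)

    g = nx.Graph()
    for authors in by_thread.values():
        for a, b in combinations(sorted(authors), 2):
            # edge weight = number of threads the two people have shared
            w = g[a][b]["weight"] + 1 if g.has_edge(a, b) else 1
            g.add_edge(a, b, weight=w)
    return g

g = build_graph(posts)

# "Are there natural groups of people?" -- modularity-based communities
communities = greedy_modularity_communities(g, weight="weight")

# "Who are the hubs?" -- people who sit on paths between other people
hubs = sorted(nx.betweenness_centrality(g).items(), key=lambda kv: -kv[1])[:10]
```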

So first, what else can I do with the raw data?  Track threads?  Track related subjects?  Pick out geographical terms, for instance?

Let’s start with the people, and how they connect together.  Let’s assume that all people on the same thread are linked, or possibly that people who reply to each other are linked.  A Google search on “mining relationships in email” proves useful here…  turns out there’s something called “relationship mining” in data warehousing already.  And the simplest thing to do with this is a histogram of people who posted, or who sent the first post in a thread.  But… but… most of the literature on mining emails assumes that there are 1:1 links between people, i.e. each email has a sender and one or more recipients.  What we’re looking at here has a sender, the whole group as recipients, and maybe a couple of other outside-the-group people added as recipients too.  So we don’t have direct links between people: we have the group as an intermediary.   So how do we know when two people are engaged with each other via the group?   One answer is “reply-to” – i.e. when one user replies to another user’s posting.  This will take a little work in the scraper, but could be a useful way of establishing chains of people through the group.
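Here’s that “simplest thing” as code: post counts, thread-starter counts, and person-to-person reply-to edges, again over the assumed Post records. The reply-to edges are only as good as the reply-to extraction in the scraper, which doesn’t exist yet.

```python
# Simple first counts: who posts, who starts threads, who replies to whom.
# Assumes `posts` is a list of the Post records sketched earlier.
from collections import Counter

posters = Counter(p.author for p in posts)               # who posts at all

first_author = {}
for p in sorted(posts, key=lambda p: p.sent):
    first_author.setdefault(p.subject, p.author)          # first post per thread
starters = Counter(first_author.values())

# person-to-person links through the group: A replied to B's post
reply_edges = Counter((p.author, p.in_reply_to) for p in posts if p.in_reply_to)

print(posters.most_common(10))
print(starters.most_common(10))
print(reply_edges.most_common(10))
```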

This is just the start of a chain of thought that is currently moving at the speed of code.  There will be more, just not for a little while…

Project idea: visualising crisismapping categories

I’m a crisismapper. I’ve seen or worked on most crisis Ushahidi maps since January 2010, and I’ve watched the categories used on them split and evolve over time (from Haiti’s emergencies and public health to Snowmageddon’s problems / solutions and beyond).

Cat from Humanity Road has kept a screenshot of the categories on each of the main deployments since then – when I saw it today, it reminded me immediately of work that Global Pulse did on visualising category evolution across news articles, with articles as nodes connected by subject, and coloured by main category, as discovered by gisting the articles.

Except this time, we don’t have to guess the categories (although doing that later by mining text in the reports in each category could be fun).  Each Ushahidi map comes with a set of categories – each report (piece of geolocated information) is tagged by the categories it belongs to.

So, the simple version of the visualisation (and I’m thinking that Processing should be able to handle much of this): the y-axis is a timeline, and each deployment is a column of categories, represented as graph nodes. We then draw lines from deployment(n) category(x) to deployment(n+1) category(y) if they’re related.
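To pin the geometry down, here’s a toy version in Python/matplotlib rather than Processing (the idea transfers, and the axes can be transposed to match the timeline description above). Both the deployment data and the related() test are placeholders; the matching question is taken up next.

```python
# Rough layout sketch: deployments in time order, each drawn as a column of
# category nodes, with a line wherever a category in deployment n maps onto
# a category in deployment n+1. Data and related() are placeholders.
import matplotlib.pyplot as plt

deployments = [
    ("Haiti",        ["emergency", "public health", "logistics"]),
    ("Snowmageddon", ["problems", "solutions"]),
]

def related(cat_a, cat_b):
    # placeholder: exact-name match; see the ontology/synonym discussion below
    return cat_a == cat_b

fig, ax = plt.subplots()
positions = {}
for x, (name, cats) in enumerate(deployments):
    for y, cat in enumerate(cats):
        positions[(x, cat)] = (x, y)
        ax.annotate(cat, (x, y), ha="center")
    ax.annotate(name, (x, -1), ha="center", weight="bold")

for x in range(len(deployments) - 1):
    for a in deployments[x][1]:
        for b in deployments[x + 1][1]:
            if related(a, b):
                (xa, ya), (xb, yb) = positions[(x, a)], positions[(x + 1, b)]
                ax.plot([xa, xb], [ya, yb], color="grey")

ax.set_xlim(-0.5, len(deployments) - 0.5)
ax.set_ylim(-1.5, max(len(c) for _, c in deployments))
ax.axis("off")
plt.show()
```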

Working out relations between categories isn’t totally trivial. Language differences (e.g. Arabic to English), generalisation/specialisation and synonyms all happen, so we might need to build a category ontology to handle this. Line weights could be used to show the strength of connection between pairs of categories. Ushahidi also uses sub-categories on some deployments, so we’ll have to work out how to group these – perhaps by colour, perhaps by putting a box around them.
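As a crude stand-in for that ontology, a normalise-then-lookup function gives you something to iterate on. The synonym table below is entirely invented for illustration; real translation and generalisation handling would replace it.

```python
# Cheap first pass at "are these two categories related, and how strongly":
# normalise the names, then fall back on a hand-maintained synonym table.
# Everything in SYNONYMS is an invented placeholder, not a real mapping.
SYNONYMS = {
    "medical": "public health",
    "sante": "public health",       # e.g. a French-language deployment
}

def canonical(cat: str) -> str:
    c = cat.strip().lower()
    return SYNONYMS.get(c, c)

def relation_weight(cat_a: str, cat_b: str) -> float:
    a, b = canonical(cat_a), canonical(cat_b)
    if a == b:
        return 1.0                  # same category -> heavy line
    if a in b or b in a:
        return 0.5                  # generalisation/specialisation guess
    return 0.0                      # unrelated, pending a real ontology
```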

Making it pretty: we can use some standard tools (like matrix normalisation) to minimise the number of crossed lines between deployments.
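One concrete way to do that step – my suggestion, not necessarily what “matrix normalisation” means above – is the barycenter heuristic from layered graph drawing: order each deployment’s categories by the average position of the categories they link back to. A single left-to-right pass looks roughly like this; repeated sweeps in both directions usually reduce crossings further.

```python
# Barycenter heuristic sketch: reorder deployment n+1's categories by the
# mean position of the deployment-n categories they connect to.
def barycenter_order(prev_order, cats, weight):
    """prev_order: deployment n's categories in display order;
    cats: deployment n+1's categories; weight(a, b): link strength."""
    pos = {c: i for i, c in enumerate(prev_order)}

    def barycenter(cat):
        linked = [(pos[p], weight(p, cat)) for p in prev_order if weight(p, cat) > 0]
        if not linked:
            return len(prev_order)   # unconnected categories sink to the bottom
        return sum(i * w for i, w in linked) / sum(w for _, w in linked)

    return sorted(cats, key=barycenter)
```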

So you’d have a meta-graph where the nodes are deployments, and inside each of those you’d have supercategories and categories.   One variant of this is to visualise all the deployments as metanodes in free space, i.e. scattered all over the place, and place the strongest-related deployments (by categories) closest to each other. I don’t know if there are packages that allow drilling into metanodes, i.e. to click on a metanode and view all the links from its categories out to other metanodes, but that could be fun too.
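A quick way to prototype the free-space variant is to weight deployment-to-deployment edges by how much their category sets overlap, and let a force-directed layout do the placing. This sketch uses networkx and Jaccard overlap; the deployment data is a placeholder.

```python
# Free-space variant: one node per deployment, edge weight = category-set
# similarity, then a force-directed layout pulls similar deployments together.
import networkx as nx

def category_similarity(cats_a, cats_b):
    a, b = set(map(str.lower, cats_a)), set(map(str.lower, cats_b))
    return len(a & b) / len(a | b) if a | b else 0.0    # Jaccard overlap

deployments = {
    "Haiti":        ["emergency", "public health", "logistics"],
    "Snowmageddon": ["problems", "solutions"],
    "DeploymentX":  ["public health", "logistics"],      # placeholder
}

g = nx.Graph()
for a in deployments:
    for b in deployments:
        if a < b:
            w = category_similarity(deployments[a], deployments[b])
            if w > 0:
                g.add_edge(a, b, weight=w)

pos = nx.spring_layout(g, weight="weight", seed=1)       # stronger links pull closer
```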

Once we have the connectors (e.g. into Ushahidi’s publicly accessible reports on CrowdMap), all sorts of other cool visualisations suddenly become possible too. But finally I need to ask the zeroth question about a project, i.e. “what is the problem we’re trying to solve here” (the first question is “is there any value in it”).  I think what I’m trying to do is make an easy way for crisismapping historians to see how deployments have evolved, and to give crisismapping leads an easily-visualised example set of categories that worked for different deployments over time.  The value might be new category sets, or an awareness of the arcs that the category types are currently on.  I don’t know.  But I’m sure that I know several thousand people out there with opinions about it…