Don’t Use the “I” Word

I’ve been told for the past year not to use the “I” word, “intelligence”. Oh f’gods sake. IMHO NGOs haven’t collected, aren’t collecting, and certainly aren’t intending to collect any form of intelligence about people, nations etc. But they are starting to build assisted intelligence systems, as in computer systems that help human beings make decisions by improving their access to information.

And some of the techniques used in intelligence systems also turn out to be very useful for working out what’s happening in a crisis, i.e. who needs help where and when.

So let’s put on the big-girl pants and look at what these are, and how they might be used to help people in crises (as opposed to, say, make them nervous enough to stick tinfoil to their ceilings).

Development Intelligence

I noticed the term “development intelligence” being used more often this year (and no, not by us). Mostly, it seems to cover information about NGOs (also here) rather than information about the people they’re trying to help. It’s also a term used for software development intelligence, but that’s not really related.

But people are starting to use it for the process of gathering and processing information to help people in crisis, and if this term sticks we need to think about what it is, and how it differs in spirit from the military intelligence that we’re terrified of being confused with. A few suggestions about this are:

  • First, there is no enemy. Military intelligence presupposes an enemy: NGOs are not fighting someone, and are not taking sides in a conflict. The only opposing forces at work against them are usually mother nature (cyclones, floods, droughts etc) and economics (economic depression, unemployment, economic migration to marginal land etc).
  • Deception happens, but is rare. Yes, fraud happens and mistakes happen, but there isn’t the all-out watchfulness for counterintelligence and deliberately false information that military intelligence always has to maintain.
  • Threats happen, but aren’t usually propellant-based: disease, further natural disasters (e.g. mudslides) and secondary effects (e.g. nuclear reactors breaking) are much more likely events, and aren’t likely to be ameliorated by keeping the other guy out of your area.
  • The end goal is not for one side to ‘win’ whilst someone else ‘loses’ (my apologies to those military people who are simply trying to understand what is going on in their worlds). The end goal is to avoid crises, or if that can’t be done, to mitigate their effects. Nobody ‘wins’: we’re trying to help people not lose as much as they might have done.

Situation Awareness

Intelligence analysts are experts at turning raw data into actionable knowledge, i.e. information that people can use to help them make decisions. In our case, those decisions are things like “where do we put the food supplies” and “which hospitals do we send our earthquake casualties to”.

To make these types of decision, you need to know what’s happening, where and to whom. Things like where the hungry people are, which hospitals are still operational (and where all the field hospitals with equipment x are). The NGO community has made a start on this with its 3W (who, what, where) methods – basically listing which NGO agencies have placed which resources where on a map – but this is a very early form of something that intelligence analysts know as situation awareness (or sometimes situational awareness – the difference between them is so fine that the terms are often confused). There’s a lot of good literature on situation awareness (and quite a few systems too), and it’s time we started reading some of it.
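
To make the 3W idea a little more concrete, here’s a minimal sketch in Python of what a queryable 3W record might look like. The agencies, activities and field names are invented for illustration, not any real NGO schema:

```python
from collections import namedtuple

# A hypothetical 3W record: which organisation is doing what, where.
# Field names are illustrative only, not a real NGO data standard.
ThreeW = namedtuple("ThreeW", ["who", "what", "where"])

records = [
    ThreeW("Agency A", "field hospital (surgical)", "Port-au-Prince"),
    ThreeW("Agency B", "food distribution", "Léogâne"),
    ThreeW("Agency A", "water purification", "Léogâne"),
]

# The kind of question situation awareness is meant to answer:
# "which hospitals with capability x are where?"
surgical = [r for r in records if "surgical" in r.what]
for r in surgical:
    print(r.who, "->", r.where)
```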

Sources

Intelligence analysts know a lot about data too: where to look for it, how to get more, how to talk about how true they believe data and the sources that produced it are, how to make sure those beliefs are preserved so we don’t get statements like “Iraq can attack us in 45 minutes” getting through without checking (ah whoops).

And one of the things they’ve done is classify the types of data that are available. I’m not advocating that NGOs start using terms like HUMINT and SIGINT, but it won’t hurt for us to think hard about the types of data source that we have available to us, the types of data that we can legitimately be interested in (and conversely which sources and types of data we should avoid using) and the issues that are important to each of them. Some of this has already happened in journalism, where ideas like stringers (trusted sources who don’t work for a specific news organization) and careful data provenance recording already have a long history.
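
As a small, hedged illustration of what “careful data provenance recording” could look like in code, here’s a sketch of a report record that carries its source and a made-up reliability rating with it. None of this is a real standard, just the shape of the idea:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Report:
    """A single piece of reporting plus the provenance we want to preserve."""
    text: str
    source: str              # e.g. a stringer, an agency feed, a tweet
    source_reliability: str  # invented scale: "high", "medium", "low", "unknown"
    collected_at: datetime
    collection_method: str   # e.g. "interview", "scraped", "forwarded email"

r = Report(
    text="Road to the northern district reported impassable",
    source="local stringer",
    source_reliability="medium",
    collected_at=datetime.now(timezone.utc),
    collection_method="phone interview",
)

# Downstream summaries can then carry the caveat with the claim, instead of
# letting "medium-reliability, single source" quietly become fact.
print(f"{r.text} [{r.source}, reliability={r.source_reliability}]")
```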

Intelligence Cycles

Analysts also know a lot about the process of collecting data. Many people are now aware of Boyd’s OODA loop (observe-orient-decide-act), originally used to describe the cycle of thought that pilots went through in a dogfight. Fewer people are aware of and use the Intelligence Cycle and its variants. This is a great shame, since it describes the steps that analysts take to create actionable data (note that it doesn’t include action – there are separate models for that). Again, there is a lot of literature on this, and again these would be useful things for people dealing with crisis-related data to read.

Assisted Arguments

Sometimes data is just data: it won’t magically tell you that something is wrong, or that there’s something you need to be aware of, no matter how much you collect and however many processes you apply.

But you can start asking questions with data, and one of the most common ones is “what if”. An analyst faced with a situation will often start forming hypotheses about what might be happening in the world, about why a situation has developed in a certain way etc. This hypothesis formation is part of what we built Hunchworks around. And when there’s more than one hypothesis about an event or situation, there’s an argument. Analysts have created a tool to decide between them called Analysis of Competing Hypotheses. There are also other forms of assisted (and augmented) argument system being used by intelligence analysts, but ACH is one of the most common ones. And it’s also worth a look for analyzing some of the “what ifs” being faced daily by NGOs.
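
For a feel of how ACH works mechanically, here’s a toy sketch: hypotheses scored against evidence for consistency, with the ranking driven by how much evidence contradicts each hypothesis rather than how much supports it. The hypotheses, evidence and scores below are invented:

```python
# Toy Analysis of Competing Hypotheses: score each piece of evidence as
# consistent (+1), inconsistent (-1) or neutral (0) with each hypothesis,
# then rank hypotheses by how much evidence contradicts them.
hypotheses = ["food price spike", "local conflict", "reporting error"]
evidence = {
    "market prices up 40% in two regions": [+1, 0, -1],
    "no unusual activity in security reports": [0, -1, 0],
    "prices normal in neighbouring country": [0, 0, +1],
}

inconsistency = {h: 0 for h in hypotheses}
for scores in evidence.values():
    for h, s in zip(hypotheses, scores):
        if s < 0:
            inconsistency[h] += 1

# ACH's heuristic: the strongest hypothesis is the one with the *least*
# evidence against it, not the most evidence for it.
for h in sorted(hypotheses, key=lambda h: inconsistency[h]):
    print(h, "- pieces of contradicting evidence:", inconsistency[h])
```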

Systems

And last, systems. This is not a definitive list of systems that we can learn from (my memory is still suffering from Christmas), but I will add to it as I remember places, companies and people who can help.
  • IBM Big Sheets, which is part of Big Insights
  • Xerox PARC’s ACH (open source)
  • IBM Watson
  • Savanna
  • Palantir

Etc

There are also discussions to be had about other techniques like imagery analysis, uncertainty handling, cognitive and cultural bias, but many of these areas are already covered by emerging (and existing) work on crisis data handling.

Big data – wozzat?

So what is this big data thingy?

Big data has become a hot topic lately. The people who deal with it (“data scientists”) have become much in demand by companies wanting to find important business insight in amongst their sales data, twitter mentions and blogposts.

Which confuses three different concepts.

    • Big Data is defined as the processing of data that’s larger than your computer system can store and process at once. It doesn’t matter so much where the data is from – it’s more important that it’s too huge to handle with ‘normal’ processing methods.
  • Social media mining looks for patterns in the posts, tweets, feeds, questions, comments et al that people leave all over the Internet. It’s a logical consequence of Web 2.0, that idea that we could not only read what people put on their websites, but contribute our thoughts etc to it too.
  • Data analysis looks for patterns in any data. It doesn’t have to be big data (though it might be), and it doesn’t have to come from Internet use (although it might be that too). It’s just data, and the tools that have been used to understand meaning and find insights in data still apply. The data scientists have a saying “everything old is new again”, and it’s lovely to see a whole new generation discover Bayesian analysis and graphs as though they were shiny new super-exciting concepts.

So what are the data scientists trying to do here, and why is it special?

Well first, a lot of data scientists are working on Internet data. Which is why big data and Internet data often get confused: a collection of blogs, tweets etc can be seriously big – especially if you’re trying to collect and analyse all of them.

And they’re analyzing the data and using the results to help drive better business decisions. Yes, some people do this for fun or to help save the world, but mainly it’s popular because better decisions are worth money and those analysis results are big business differentiators.

Which is great until you realize that up ‘til now most of those decisions were made on data from inside the company. Nice, structured, controlled, and often quite clean (as in not too many mistakes) data, often stored in nice structured databases. Which is not how the Internet rolls. What data scientists often end up with is a mix of conventional structured data and data with high structural variance: data that looks kinda structured from a distance (time, tweeter, tweet for example) but has all sorts of unstructured stuff going on inside it. Sent from a mixture of conventional systems and devices. That companies often ask to be analysed in the same way they’re already analysing their structured data.

So, alongside the usual corporate data, we now have 3 new types of data that we can access and process: structured data stored in warehouses, unstructured internet-style data (blogs, tweets, sms) and streams of information.
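
As a tiny illustration of that “structured from a distance” point, here’s a made-up tweet-style record: the envelope fields behave like database columns, but most of the analytical value is buried in the messy free-text payload:

```python
import re

# A made-up record: tidy envelope (time, tweeter), messy payload (tweet).
record = {
    "time": "2011-12-28T09:14:00Z",
    "tweeter": "@example_user",
    "tweet": "bread 450 now at central mkt?!? was 300 last wk #prices #sahel",
}

# The structured fields query like database columns...
print(record["time"], record["tweeter"])

# ...but the interesting content needs text processing: hashtags, numbers,
# abbreviations, typos and all.
hashtags = re.findall(r"#\w+", record["tweet"])
numbers = [int(n) for n in re.findall(r"\d+", record["tweet"])]
print("hashtags:", hashtags, "numbers mentioned:", numbers)
```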

Let’s back up just a little. To do analysis, you need a question, some data and some tools (or techniques, if you will). It also helps to have someone who cares about the results, and it’s even better if they care enough to explain what’s important to them and why.

The Question

First, the question. Asking the right question is difficult, and often an art. Sometimes it’ll be obvious, sometimes it’ll come from staring at a subset of the data, sometimes the question will be given to you and you’ll have to hunt for the data to match. We’ll talk about the question later.

Handling the Data

So we have a question and some data. And if the data is big, this is where the Big Data part of the story comes in. If you suddenly find yourself with data that you can’t analyse (or possibly even read in) using the computing resources you have, then it’s big data and your choices (thank you Joseph Adler) are:

  • Use less data. Do you really need all the data points that you have to make those business decisions (try sampling down to a statistically significant number of points, or keeping just the points that are important to you)? Do you really need all the variables you’ve collected (do a sensitivity analysis)? Are there repeats (e.g. twitter retweets) in your dataset (tidy it up)?
  • Use a bigger computer. You’ll need to both store and process the data. “The cloud” is a generic term for storage that’s outside your home or office that you can still access from wherever you want (e.g. over the internet). Amazon Web Services is a prime example of this; other cloud storage includes Microsoft Azure (sql datastore), Cassandra (bigtable datastore), Buzz Data, Pachube (primarily storage for sensor outputs, a.k.a. the Internet of Things), Hive (data warehouse for Hadoop) and sharded databases.
  • Use parallel processing across multiple computers. A popular process for this is map/reduce, which splits data into chunks that are each processed by a different machine, then merges the results. Places where map/reduce is available include Hadoop, which also has a higher-level language, Pig, that compiles down to map/reduce instructions. (A toy version of the map/reduce pattern is sketched after this list.)
  • Get smart. Get lateral about the problem that you’re trying to solve (see any good statistics textbook for ideas).
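
Here’s that toy map/reduce sketch, using Python’s multiprocessing pool as a stand-in for a cluster. The data and chunk sizes are obviously made up, but the shape (map over chunks in parallel, then reduce the partial results) is the same one Hadoop and Pig scale out across many machines:

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: count terms in one chunk of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts."""
    total = Counter()
    for c in partials:
        total += c
    return total

if __name__ == "__main__":
    # Stand-in for data too big to process in one go: split it into chunks.
    data = ["food prices rising in the north",
            "no rain again this month",
            "food aid convoy delayed",
            "prices stable in the capital"]
    chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)      # map, in parallel
    print(reduce_counts(partials).most_common(5))   # reduce
```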

The processing

And then we have the techniques part of the equation (sorry – couldn’t resist the pun). Again a post for later – there are many tools, packages and add-ons out there that make this part of the process easier.

Explaining the results

If you’re doing big data analysis, you’re doing it for a reason (or you really like fiddly complex tasks). And the reason is often to increase the knowledge or insight available to an end user. For this, we often use visualisations. Which is another post for later.

Creating humanitarian big data units

Global Pulse has done a fine job of making humanitarian big data visible both within and outside the UN. But it’s a big job, and they won’t be able to do it on their own. So. What, IMHO, would another humanitarian big data team need to be and do? What’s the landscape they’re moving into?

Why should we care about humanitarian big data?

First, there’s a growing body of evidence that data science can change the way that international organisations work, the speed at which they can respond to issues and the degree of insight that they can bring to bear on them.

And NGOs are changing. NGOs have to change. We are no longer organizations working in isolation in places that the world only sees through press releases. The Internet has changed that. We’re now in a connected world, where I work daily with people in Ghana, Colombia, England and Kazakhstan. Where a citizen in Liberia can apply community and data techniques from around the world, to improve the environment and infrastructure in their own cities and country.

We have to work with people who used to be outsiders: the people who used to receive aid (but are now working with NGOs to make their communities more resilient to crisis), and we have to work with data that used to be outside too: the tweets, blogposts, websites, news articles and privately-held data like mobile money logs and phone top-up rates that can help us to understand what is happening, when, where and to whom.

UN Global Pulse was formed to work out how to do that. Specifically, it was set up to help developing-world governments use new data sources to provide earlier warnings of growing development crises. And when we say earlier, we mean that in 2008 the world had a problem. Three crises (food, fuel and finance) happened at once and interacted with each other. And the first indicator that the G20 had was food riots. The G20 went to the UN looking for up-to-date information on who needed help, where and how. And the UN’s monitoring data was roughly 2 years out of date.

What have we done so far?

So what are the NGOs and IAs doing so far? The UN has started down the route to fix this with a bunch of data programs including Global Pulse and FEWSnet. Oxfam connected up to hackathons last month; the Red Cross has been there for a while. The World Economic Forum has open data people, as does the World Bank. And other groups as different as the Fed and IARPA are investigating risk reduction (which is the real bottom line here) through big data techniques.

What should we be doing?

But what do the NGOs need to do as a group? What will it take to make big data, social data, private data, open data and data-driven communities useful for risk-managing crises?

1. First, ask the right questions.

When you design technology, the first question should be “what is the problem we’re trying to solve here?” Understand and ask the questions that NGOs do and could ask, and how new data could help with them. There is data exhaust, the data that people leave behind as they go about their lives: focus on the weak signals that occur in it as crises develop. Reach out to people across NGOs to work out what those questions could be.

2. Find data sources.

We cannot use new data if we don’t have new data.

Data Philanthropy was an idea from GFDI to create partnerships between NGOs, private data owners like the GSMA mobile phone authority and other data-owning organisations like the World Economic Forum. Data Commons was a similar idea to make data (or the results of searches on data – we want to map trends, not individuals) available via trusted third parties like the UN. It’s gone a long way politically but still has a lot of work to be done on access agreements, privacy frameworks and data licensing.

Keep encouraging the crisismapping and open data communities to improve the person-generated data available to crisis responders, to improve the access of people in cities and countries to data about their local infrastructure and services, and to voice their everyday concerns to decision makers (e.g. via Open311). Encourage the open data and hacker movements to continue creating user-input datasets like Pachube, Buzz and CKAN. All this is useful if you want to understand what is going wrong.

3. Find partners who understand data.

Link NGOs to private organisations, universities and communities who both collect and process new types of data. Five of these recently demonstrated Global Pulse-led projects to the General Assembly:

    • Jana’s mobile phone coverage allowed us to send a global survey to their population of 2.1 billion users in over 70 countries. There are issues with moving from household surveys that need to be discussed, but it allowed us to collect a statistically significant sample of wellbeing and opinion faster and more often than current NGO systems (an authoritative survey I read recently had 3500 data points from a 5,000,000 person population. Statistical significance: discuss).
    • Pricestats used data from markets across Latin America to track the price of bread daily rather than monthly. Not so exciting in ‘normal’ mode or in countries where prices are regularly tracked. Incredibly useful during recovery or for places where there is no other price data gathered.
    • The Complex Systems Institute from Paris tracked topics emerging in food security related news since 2004. This showed topic shifts from humanitarian issues to food price volatility (with children’s vulnerability always being somewhere in the news). More of a strategic/ opinion indicator, but potentially incredibly useful when applied to social media.
    • SAS found new indicators related to unemployment from mood changes in online conversations – several of which spiked months before and after the unemployment rate (in Ireland and the USA) changed. This gave new indicators of both upcoming events and country-specific coping strategies.
    • Crimson Hexagon looked at the correlation between Indonesian tweets about food and real food-related events. The correlations exist, and mirrored official food inflation statistics. Again, useful if gathered data isn’t there.

And reach out to the communities that are forming around the world to process generated data, from the volunteer data scientists at Data Without Borders to the interns at Code for America and the GIS analysis experts connected to the Crisismappers Network.

4. Collect new data techniques and teach NGOs about them.

There is a whole science emerging around the vast ocean of data that we now find ourselves swimming in. It has many names, Big Data and Data Science being just two of them, but it’s basically statistical analysis of unstructured data from new sources including the Internet, where that data is often very large. Learn about them, play with them (yes, play!), and teach people in NGOs about how to use them. The list of things you probably need to know includes data harvesting, data cleaning (80% of the work), text analysis, learning algorithms, network analysis, Bayesian statistics, argumentation and visualization.
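
As a deliberately trivial taste of what that 80% of cleaning work looks like next to the 20% of analysis, here’s a sketch; the input lines are invented:

```python
import re
from collections import Counter

# Raw, messy inputs: duplicates, odd casing, retweet noise, empty strings.
raw = [
    "RT @someone: Food prices UP again in the market",
    "food prices up again in the market",
    "",
    "No rain for three weeks   #drought",
    "no rain for three weeks #drought",
]

def clean(text):
    """The unglamorous 80%: strip retweet markers, normalise case and spaces."""
    text = re.sub(r"^RT @\w+:\s*", "", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text

cleaned = {clean(t) for t in raw if clean(t)}   # de-duplicates as a side effect

# The remaining 20%: the actual (here, trivially simple) analysis.
terms = Counter(word for t in cleaned for word in t.split())
print(terms.most_common(3))
```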

And build a managed toolkit of open-source tools that NGOs and analysts in developing countries can use. For free. With support. Which doesn’t mean “don’t use proprietary tools” – these have a major part to play too. It just means that we should make sure that everyone can help protect people, whatever funds they have available.

5. Design and build the technologies that are missing.

Like Hunchworks. Hunchworks is a social network-based hypothesis management system that is designed to connect together experts who each have part of the evidence needed to spot a developing crisis, but don’t individually have enough to raise it publicly. It’s a safe space to share related evidence, and give access to the data and tools (including intelligent agents automatically searching data for related evidence) needed to collect more. It’s still in alpha, but it could potentially help break down one of the largest problems in development analysis: the silos that form between people working on the same issues and the people who need to see their results.

6. Localize.

Build labs in developing countries. Build analysis capacity amongst communities in developing countries. People respond differently to economic stress, and environments, data sources and language needs are different in different countries. The labs are there to localize tools, techniques and analysis, and to act as hubs, collectors and sharing environments for the types of minds needed to make this work a reality. No one NGO can afford to do this in all countries, so connections between differently-labelled labs will become vital to sharing best practice around the world.

7. Publicise and listen.

Be there at meetups and technology sessions, at hackathons and in Internet groups, listening and learning to do things better. And never ever forget that this isn’t just an exercise. It’s about working better, not building cool toys – if the answer to a problem is simple and low-tech, then swallow your pride and do it – if the answer is to share effort with others to get this thing working faster to protect people around the world, then do that too. We do not have the luxury of excessive time or meeting-fuelled inaction before the next big crisis strikes.

What is a hackathon?

I accidentally ended up organising a hackathon recently. RHOK NYC could have been a tragedy. I was too overloaded to help organize it, the local Crisiscommons lead was too busy, and the young man who stepped in to lead was inexperienced with hackathons and unsupported, but he managed to pull things together well until the fortnight before, when RHOK NYC lost its venue. Which is a big deal in New York – they’re not easy to come by for a weekend event with a sleep-over (or rather a crash on the floor for a couple of hours in between coding). Oh, and the young man was unexpectedly out of the country at another event.

So we cancelled, got talked back into trying again, and put out the call to the local volunteer technical community. The community answered us, in spades. Within a week, Phil from Open Plans found us a venue via Naomi and Beth at New York Law School; Josh from the UN Mission to UNICEF offered us space in their building, and Phil’s friends Danielle and Beth offered to take the whole RHOK NYC contingent wholesale into their Open Data Day hackathon on the upcoming US Farm Bill and its effects on agriculture. NY Hackers offered support, and JonMark from the StartupBus offered sandwich money and advice from his new position as Twilio’s developer evangelist.

We then had a week to go and no main organizer, so we took the hackathon-sharing option (where hardcore hackers met gourmet macrobiotic pizza, but that’s a story for another day). There’s a whole post of thankyous to be had (it’s on the RHOK website), but in short it worked out really really well. And the most unexpected benefit came from mixing the Farm Bill people, who’d never run or been to a hackathon before, with our little band of hard-core techs and hackathon veterans (‘0’ in the previous-hackathons box met ‘over 10’).

And I met Sarah from Oxfam who said “I had no idea what a hackathon was”.

Wow. After years of evangelizing, it’s easy to forget that for many people outside the tech community, this is utterly completely new (and perhaps a little terrifying too – props to Sarah for coming along anyway).

The short answer is that a hackathon strengthens your volunteer technical community and teaches you more about your specialist subject – whether that be how to build better iphone apps or distribute food aid around the world. And each others’ subjects too. My favourite moment was a Farm Bill lady who’d come expecting hand-designed graphics walking up to a RHOK hacker and saying “could you just do a scraper for (some random farming subject) for me please?”, to be met with “sure, it’ll take about 10 minutes”. Transfer of skills and information – it’s beautiful to watch.

The long answer is that, eventually, hackathons run like this. A lot goes on before a hackathon: finding sponsors, finding a venue, finding subject matter experts, advertising the event to potential attendees, organizing catering, sorting security, planning. But on the day (or weekend), typically:

  • Everyone turns up, drinks coffee and chats with other attendees (don’t worry – hackers don’t bite, and we generally have a good sense of humour. Even at 1am when our computer crashes and loses our code). Then the hacking starts.
  • Subject matter experts (people who understand the problems you’re trying to address with the hackathon) come up one by one to describe a problem that they want to work on, or a system they want to build, or even their area in general (not all great hacks are pre-determined).
  • People gravitate to the problems that interest them to form teams.
  • Then spend the day/weekend working on the problem: building designs, code, visualisations using whatever skills, data, code and knowledge they can glean (problem providers: please have someone on standby for the hackathon, even if they’re at home – it’s frustrating to make design decisions without a user around, and it’s rude not to help the hackers once you’ve asked the community for help yourself).
  • And at the end of the day/weekend, each team stands up and presents their work. There might be prizes, but usually the biggest prize is to go home knowing that you’ve touched the world in some way. Oh, and know (and have been through a hard day/night with) a lot more interesting people.

Your first hackathon may be different. In your first hackathon, you generally turn up and learn what a hackathon is, by working with the other participants. That learning takes time, so don’t beat yourself up if you’re not instantly brilliant at the first one you go to.

I was privileged to have been there when a whole subject area – food – went through their first hackathon together. They got it, they did brilliantly, they had some great ideas I hadn’t seen before (like micro-lectures in a room off to the side) and Sarah not only learnt about hackathons and how to create problem statements for them, she also (with her team) won a prize. I’m glad I was there.

Where next for Hunchworks?

Hunchworks is just one of the technologies that we designed this year.  It’s time to talk about why, and what some of the others are.

Why build Hunchworks?

Imagine the scene today. I’m an analyst at the UN, and most mornings I get to my desk and think “what’s going on in my area” or “something’s not right here”. What do I do about that today, and what would be useful to me next year if I want to do it better?

Today, I rely mostly on personal connections, colleagues, skype chats, newsgroups and websites. If I want to know what’s going on, I log into accounts in several different systems (my UN and personal emails, a group of Skype chats, some irc channels and a set of newsfeed websites) and scan the entries in them for things (specific event types, geographical areas etc) that are urgent, or interesting, or that I might be able to help with. If there’s something important, I then go to a new set of sites and start digging around in data and messaging experts I trust to verify what I’ve seen and what I think might be happening, and then start thinking about what I can do to help with the situation and who I need to start telling about it (and how).

This is wildly inefficient. It contains time lags at every step, it has me searching through time-ordered items and repeating thoughts and work that other people are probably doing at their own desks too. And I miss things: I miss emails in the mass of group mailings, I miss announcements and important data hidden in the mass of writing in every channel when an event’s going on, and half the time if I don’t deliberately save something when I see it go past I’m unlikely to be able to find it again.

Next year could be different. I would like to get to my desk and log onto one system to see my alerts overnight (yes, I’ll still have to look at the others, but I would like to have one place to start). I’d like to see a dashboard of new hunches that something might be happening (that have been created either by humans with that “something’s not right” feeling or data sniffers checking through the webosphere for first signals) in areas that interest me, a list of things that I and people connected to me are already working on. If I have that “something’s not right” feeling myself, I’d like access to newsfeeds and data that I can look through for clues to why I’m having that thought, or support for a hypothesis I’m forming about it.

And so on. The bottom line is that I need access to data, tools, actions, people and their existing hypotheses about the world. And some of that’s going to be specialist. Which is why we’ve been building Hunchworks.

Where to go next

We’ve built the basic system (and you can help too), but there’s a lot of development still to do. Here’s the Hunchworks feature pick-list for next year; the one that we’ll show users and ask ‘which of these will help you most’. How much can be developed relies on Global Pulse funding, and also on how far other organisations get with similar work in the open source domain. But here’s the first list. We’re looking for external partners to help with these projects:

  • Complementarity search. Allow users to search for users with skills and interests that complement their skillset or the skills needed for their hunches (a toy version of this search is sketched after this list).
  • Improved trust. New trust metrics, and improved algorithms for rating combination and interest ranking.
  • Better links to data sources and conversations about them, including twitter, open-source data, city data and satellite data.
  • Hunch splitting, merging and clustering algorithms, based on the Paris text-analysis algorithms developed for the proof-of-concept study.
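
Here’s the toy complementarity search promised above. The users, skills and scoring are invented, but the idea is to rank people by the gaps they fill rather than the skills they share:

```python
# Toy complementarity search: rank users by how many of a hunch's *missing*
# skills they would add. All names, skills and hunches are invented.
users = {
    "amara": {"gis", "french", "survey design"},
    "ben":   {"python", "text analysis"},
    "chen":  {"epidemiology", "statistics", "python"},
}

hunch_needs = {"gis", "statistics", "hausa"}   # skills the hunch needs
already_covered = {"gis"}                      # skills the current team has
gaps = hunch_needs - already_covered

def gap_fill(gaps, user_skills):
    """Score a user by the number of missing skills they would cover."""
    return len(gaps & user_skills)

for user in sorted(users, key=lambda u: gap_fill(gaps, users[u]), reverse=True):
    covered = gaps & users[user]
    print(user, "adds", ", ".join(sorted(covered)) or "nothing new")
```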

And we need to do this work either in-house at the UN or with close support from an in-house team:

  • Hunchworks plugin for UNDP Teamworks. This makes Hunchworks available to the whole UN Teamworks community. Although since Teamworks is a Drupal site, it might be better for us to aim at building a custom Drupal module to handle this connection instead.
  • Hunchworks LinkedIn application. Allows a user to form a hunch, then invite their LinkedIn contacts to comment on it.
  • Report generation plugin. Many UN departments still have their own reporting formats and procedures. Yes, we could lobby to change that, but for the moment it’s easier to map what they are and make sure reports in the right formats go to the right people in each organization, and that Hunchworks users know what information they need to be gathering for this.
  • Connection to a data science toolbag.

Other technologies are needed to support this vision. In the next few years, our analysts will need:

  • Security. Reassurance that their data will not be accessible (or that it will be very difficult and take a lot of time and resources to access) by people they don’t want to access it. Also reassurance that if an incursion is attempted, it will be quickly detected and anything connected to it isolated and checked.
  • Localisation. The ability to access Hunchworks and other tools in their local language.
  • Federation. The ability to access Hunchworks from local nodes in bandwidth environments ranging from completely offline to intermittent to low-bandwidth and the speeds we sometimes see in Manhattan.
  • Access from multiple devices. Because when you’re out in the middle of nowhere, the only sure thing is that your Internet connection will fail. But your phone (and specifically the SMS on it) might still be okay.
  • Intelligent agents. We’ve started building simple agents that monitor real-time feeds, but we need to build this concept out so the agents are actively helping the human Hunchworks users.

Work that will have to happen, will affect Global Pulse capabilities and timelines, but is outside the scope of Global Pulse includes:

  • Data standards definitions.
  • Data science toolbag (containing useful data gathering, cleaning, analysis and visualization tools). We’ve been talking to Data without Borders about hosting this for the world.
  • Data science tools (excluding Hunchworks).

We’ll be helping with these too where we can.

Lessons from mapping Sahel

We needed an example problem set for our current version of Hunchworks (note that this is a very early, i.e. pre-alpha version of the code and a lot of the cool Hunchworks features aren’t in it yet). The UN’s main use for Hunchworks is to gather up the weak signals that people put out about emerging development crises – those small hints that something isn’t right that appear all over the world before they coalesce into ‘obvious’.

Awareness of development crises can happen very quickly. One minute there are whispers of a potential problem – a chat here, an email or text asking for a bit of data there. And then a tipping point appears and there’s suddenly data everywhere. And we have a great example of this happening just at the time that we’re demonstrating Hunchworks to the UN General Assembly.

We had one of these serendipitous test sets before: we tracked the Horn of Africa crisis emerging across newsgroups as one of our early will-this-work paper exercises (this btw is also why we’re suddenly interested in data mining googlegroups).  But the Horn of Africa crisis is well established in the public eye now, and there is both too much online data on it to pick out the early weak signals, and many of the early traces (e.g. anecdotes and messages) have been lost in both human and machine memories (yes folks, not everything on the Internet is logged).  But there’s a new thing starting to happen (which is potentially very very bad for the world and certainly for the people caught up in it) – over the last month or so there have been mysterious messages here and there about something starting to happen in Chad and Niger, across the African region known as Sahel.

So this week we started collecting information about Sahel and turning it into hunches and evidence in Hunchworks. First we had an email from a trusted colleague containing potential places to look. Then an Internet search for news and background information, followed by more digging across the Internet (including recent reports from the UN and other NGOs), a colleague searching in the food-related UN agencies and a Twitter search for hashtags and interested parties.

We could have got a lot more information faster and with more insightful comments into it if we’d crowdsourced its collection, but the EOSG couldn’t be seen getting involved prematurely (i.e. before the dedicated HLWG team) in this crisis. We could have also involved more people in a non-public search if we’d had Hunchworks at its Beta testing stage, but doing the first search by hand is a sane early stage test that exposes bugs early to a small number of people before a larger group (e.g. the Alpha and Beta testers) get annoyed by them.

So what did we learn about ourselves, our information and Hunchworks from this exercise?

We ran the exercise using a spreadsheet (no, we can’t just do this for hunches because it will quickly become overwhelmed – see below). Its first worksheet was a list of hunches: this was quickly populated with a mix of hunches and evidence for hunches that took some time to separate out. We also discovered that the evidence that we gathered often contained new places to look for evidence, suggested new problems that should be proposed as hunches and spawned a whole pile of other evidence-gathering activities.

  • Lesson 1: people confuse hunches and evidence. Sometimes evidence is posted on the hunches list; other times hunches are posited as evidence to another hunch.
  • Lesson 2: evidence generates hunches. For example, we realized that a hunch about a famine in Sahel also contained hunches about famines in Mali and Chad that the country teams there needed to investigate.
  • Lesson 3: evidence generates evidence-gathering activities. We ended up with a to-do list linked to each hunch (a minimal sketch of how these three lessons translate into a data structure follows below).
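
Here’s that minimal sketch of how lessons 1 to 3 might translate into a data structure. The field names and example hunch are illustrative, not the real Hunchworks model:

```python
from dataclasses import dataclass, field

# Lessons 1-3 structurally: hunches hold evidence, evidence can spawn new
# hunches, and every hunch carries its own to-do list of gathering actions.
@dataclass
class Evidence:
    text: str
    source: str
    suggests_new_hunches: list = field(default_factory=list)

@dataclass
class Hunch:
    title: str
    evidence: list = field(default_factory=list)
    todo: list = field(default_factory=list)        # evidence-gathering actions
    sub_hunches: list = field(default_factory=list)

sahel = Hunch("Food crisis emerging in the Sahel")
report = Evidence(
    text="Unusual livestock sales reported in two markets",
    source="colleague email",
    suggests_new_hunches=["Pastoralist distress in Chad"],
)
sahel.evidence.append(report)
sahel.todo.append("Search recent UN and NGO reports for market data")
for title in report.suggests_new_hunches:
    sahel.sub_hunches.append(Hunch(title))

print(sahel.title, "-", len(sahel.evidence), "evidence,",
      len(sahel.todo), "actions,", len(sahel.sub_hunches), "sub-hunches")
```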

Some things were confirmed for us. We suspected that we could map the connections between hunches as though the hunches were propositions in a reasoning system. We also suspected that there were a set of basic search actions that we would do at the start of most hunches, that some of them would be ongoing (i.e. to catch new information being added to the Internet) and that we could automate many of these. Yes.

  • Lesson 4: when we draw graphs using our hunches as nodes, the links between nodes look suspiciously like the links in semantic networks. This should come as no surprise to anyone working on linked data.
  • Lesson 5: Google searches, news searches, twitter searches, UN report searches and emailing around likely suspects are obvious first things to do on any new hunch.
  • Lesson 6: We can automate some of the above searches, especially if we have search terms (e.g. Sahel) and tags to start from. Our hitlist for Sahel was: twitter stream, food price, migration from/to Sahel and news monitoring agents.

We’ve tried to build a system that doesn’t need much management or moderation. We might need to revise that: in the initial excitement of chasing up leads and links from the original hunch, it was difficult to maintain momentum (e.g. amount of evidence added) and completeness at the same time. I had to do a lot of reading and editing – both to disambiguate hunches and evidence as discussed above, but also in generating tags, thinking about the links between hunches and managing the list of actions that happened most times we added any evidence. Some more lessons from this:

  • Lesson 7: Hunches have information-gathering actions attached to them.
  • Lesson 8: Once we get textual evidence, it’s pretty easy to create tags from it (the sketch below gives the flavour).
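
The sketch below shows the flavour of that tag generation: strip stopwords, keep the most frequent remaining terms. A real tagger would want stemming, entity recognition and multilingual support; the stopword list and example text here are invented:

```python
import re
from collections import Counter

# Crude tag generation from textual evidence: drop stopwords, keep the most
# frequent remaining terms. Just the flavour, not a production tagger.
STOPWORDS = {"the", "a", "in", "of", "and", "to", "is", "are", "on", "for", "with"}

def tags(text, n=5):
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(n)]

evidence_text = ("Reports of rising millet prices in markets across Chad and "
                 "Niger, with migration from northern Mali also reported.")
print(tags(evidence_text))   # the five most frequent non-stopword terms
```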

And then we’ve got some very specific lessons about the system.

  • Lesson 9: if two hunches are related, they probably need the same people involved in them. Can we start the “involve x” list of one from the other?
  • Lesson 10: Some places had 2 locations, e.g. migration from Libya into Chad.
  • Lesson 11: We have the same problem as crisismappers with location accuracy, e.g. sometimes we want to mark a region rather than a single point on the map.
  • Lesson 12: Using tags brings a set of questions about how we find related things again. This is the same issue we’ve seen in crisismapping and Twitter feeds, and we have tools that can help with this.

There are more lessons learnt, but we’re somewhat busy today. More soon.

Strata talk on hunchworks technology

I try not to put too much dayjob stuff here, but sometimes I need to leave less-tidy breadcrumbs for myself.  Here’s the 10-minute (ish) talk I gave at Strata New York this year.

Intro

I’m Sara Farmer, and I’m responsible for technology at Global Pulse. This brings its own special issues.  One thing we’re passionate about is making systems available that help the world. And one thing we’ve learnt is that we can’t do that without cooperation both within and outside the UN.  We’re here to make sure analysts and field staff get the tools that they need, and that’s a) a lot of system that’s needed, and b) something that organisations across the whole humanitarian, development and open data space need too.

<Slide 1: picture of codejammers>

We’re talking to those other organisations about what’s needed, and how we can best pool our resources to build the things that we all need.  And the easiest way for us to do that is to release all our work as open-source software, and act as facilitators and connectors for communities rather than as big-system customers.

<Slide 2: open-source issues>

Open source isn’t a new idea in the UN – UNDP and UNICEF, amongst others, have trailblazed for us, and we’re grateful to have learnt from their experience in codejams, hackathons and running git repositories and communities.  And it’s a community of people like Adaptive Path and open source coders who make HunchWorks happen, and I’d like to publicly acknowledge their dedication and the dedication of our small tech team.  We have more work to do, not least in building a proper open innovations culture across the UN (we’ve codenamed this “Blue Hacks”) and selecting licensing models that allow us to share data and code whilst meeting the UN’s core mandate, but in this pilot project it’s working well.

<slide 3: bubble diagram with other UN systems we’re connecting to>

We’re already building out the human parts of the system (trust, groups etc) but Hunchworks doesn’t work in isolation: it needs to be integrated into a wider ecosystem of both existing and new processes, users and technologies.  The humanitarian technology world is changing rapidly, and we’ve spent a lot of time thinking about what it’s likely to look like both soon and in the very near future.

So. For hunchworks to succeed, it must connect to four other system types.

We need users, and we need to build up our user population quickly. We’re talking about lab managers, development specialists, mappers, policy coordinators, local media and analysts. Those users will also need specialist help from other users in areas like maps, crowdsourcing, data and tools curation.

  • UN Teamworks – this is a Drupal-based content management system that a UNDP New York team has created to connect groups of UN people with each other, and with experts beyond the UN.
  • Professional networking systems like LinkedIn that many crisis professionals are using to create networks.
  • Note that some of our users will be bots – more on this in a minute.

Users will need access to data, or (in the case of data philanthropy), to search results.

  • UN SDI – a project creating standards and gazetteers for geospatial data across the UN.
  • CKAN  – data repository nodes, both for us and from open data initiatives.
  • Geonode, JEarth et al – because a lot of our data is geospatial.

They’ll need tools to help them make sense of that data. And bots to do some of that automatically.

  • We need toolboxes – ways to search through tools in the same way that we do already with data.  We’re talking to people like Civic Commons about the best ways to build these.
  • We’re building apps and plugins where we have to, but we’re talking about organisations putting in nodes around the world, so we’re hunting down open source and openly available tools wherever we can. We’re waiting for our first research projects to finish before we finalise our initial list, but we’re going to at least need data preparation, pattern recognition, text analysis, signal processing, graphs, stats, modelling and visualisation tools.
  • Because we want to send hunchworks instances to the back of beyond, we’re also including tools that could be useful in a disaster – like Ushahidi, Sahana, OpenStreetMap and Google tools.
  • And there are commercial tools and systems that we’re going to need to interface with too. We’re talking about systems like Hunch and a bunch of other suppliers that we’ll be talking to once we get the panic of our first code sprints out of the way.

And they need a ‘next’, a way to spur action, to go with the knowledge that Hunchworks creates.

We’re adding tools for this too, and also connecting to UN project mapping systems:

  • UN CRMAT – risk mapping and project coordination during regional development
  • UN CIMS – project coordination during humanitarian crises, an extension of the 3W (who, what, where) idea.

Which is a big vision to have and a lot to do after our first releases next spring. And yet another reason why we’re going to need to do all the partnering and facilitation that we can.

<slide 4: algorithms list>

So. You’ve seen how we’ve designed Hunchworks to help its users work together on hunches. But Hunchworks is more than just a social system, and there are a lot of algorithms needed to make that difference.  We have to design and implement the algorithms that make Hunchworks smart enough to show its users the information that is relevant to them when they need it (also known as looking for all the boxes marked “and then a miracle happens”).

And the first algorithm needs are here:

  • Similarity and complementarity metrics.  We need to work on both of these.  Now there’s a lot of work out there on how things are similar, but there’s not so much around about how people and their skills can complement each other.  We’ve been looking at things like robot team theories, autonomy and human-generated team templates as baselines for this.
  • Relevance. And for that, read “need some interesting search algorithms”. We’re looking into search, but we’re also looking at user profiling and focus of attention theories, including how to direct users’ peripheral attention onto things that are related to a hunch that they’re viewing.
  • Credibility. We’d like to combine all the information we have about hunches and evidence (including user support) into estimates of belief for each hunch, that we can use as ratings for hunches, people and evidence sources. There’s work in uncertain reasoning, knowledge fusion and gamification that could be helpful here, and there are some excellent examples already out there on the internet. As part of this, we’re also looking at how Hunchworks can be mapped onto a reasoning system, with hunches as propositions in that system. Under “everything old is new again”, we’re interested in how that correlates to 1980s reasoning systems too.
  • Hunch splitting, merging and clustering. We need to know when hunches are similar enough to suggest merging or clustering them.  We also would like to highlight when a hunch description and the evidence attached to it deviate far enough from its original description to consider splitting it into a group of related hunches. Luckily, one of our research projects has addressed exactly this problem – this is an example of how our internal algorithm needs are often the same as the users’ tool needs – and we’re looking into how to adapt it (a toy similarity check is sketched after this list).
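
Here’s that toy similarity check: Jaccard similarity over hunch descriptions as a crude merge trigger. The hunches and the threshold are invented, and the real system would use the proper text-analysis algorithms, but the shape is the same:

```python
import re
from itertools import combinations

# Toy merge suggestion: Jaccard similarity between hunch descriptions.
# Real Hunchworks would use proper text analysis; this is only the shape.
def terms(text):
    return set(re.findall(r"[a-z]+", text.lower()))

hunches = {
    "H1": "Rising food prices in Chad and Niger",
    "H2": "Food prices rising across Niger markets",
    "H3": "Unusual migration from northern Mali",
}

def jaccard(a, b):
    a, b = terms(a), terms(b)
    return len(a & b) / len(a | b)

for (k1, t1), (k2, t2) in combinations(hunches.items(), 2):
    score = jaccard(t1, t2)
    if score > 0.2:   # invented threshold
        print(f"consider merging {k1} and {k2} (similarity {score:.2f})")
```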

Mixing human insight with big data results. One of the things that makes Hunchworks more than just a social system is the way that we want to handle big data feeds. We don’t think it’s enough to give analysts RSS feeds or access to tools, and we’re often working in environments where time is our most valuable commodity.   The big question is how we can best combine human knowledge and insight with automated searches and analysis.

Let’s go back to Global Pulse’s mission.  We need to detect crises as they unfold in real time, and alert people who can investigate them further and take action to mitigate their effects.  It’s smart for us to use big data tools to detect ‘data exhaust’ from crises.  It’s smart for us to add human expertise to hunches that something might be happening.  But it’s smarter still for us to combine these two information and knowledge sources into something much more powerful.

We’ve argued a lot about how to do this, but the arguments all seem to boil down to one question: “do we treat big data tools as evidence or users in Hunchworks”?  If we treat big data tools as evidence, we have a relatively easy life – we can rely on users to use the tools to generate data that they attach to hunches, or can set up tools to add evidence to hunches based on hunch keywords etc.  But the more we talked about what we wanted to do with the tools, from being able to create hunches automatically from search results to rating each tool’s effectiveness on a given type of hunch, the more they started sounding like users.

So we’ve decided to use bots. Agents. Intelligent agents. Whatever you personally call a piece of code that’s wrapped in something that observes its environment and acts based on those observations, we’re treating them as a special type of Hunchworks user.  And by doing that, we’ve given bots the right to post hunches when they spot interesting patterns; the ability to be rated on their results, and the ability to be useful members of human teams.
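
A caricature of what “bots as a special type of user” means in practice; none of this is the real Hunchworks API, and the feed, threshold and method names are invented:

```python
# Caricature of a bot-as-user: it observes a feed, and when a pattern crosses
# a threshold it posts a hunch just as a human user would. Invented names only.
class FoodPriceBot:
    def __init__(self, post_hunch, threshold=0.3):
        self.post_hunch = post_hunch    # the same "post a hunch" action a human has
        self.threshold = threshold
        self.reputation = 0.5           # bots get rated on their results too

    def observe(self, region, weekly_prices):
        """Act on observations: big week-on-week jumps become hunches."""
        for last, current in zip(weekly_prices, weekly_prices[1:]):
            change = (current - last) / last
            if change > self.threshold:
                self.post_hunch(f"Possible food price spike in {region} "
                                f"(+{change:.0%} week on week)", author=self)

def post_hunch(title, author):
    print("NEW HUNCH:", title, f"[posted by bot, reputation {author.reputation}]")

bot = FoodPriceBot(post_hunch)
bot.observe("Niger", [100, 104, 102, 140, 150])   # made-up price series
```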

<Slide 5: System issues>

And now I’ll start talking about the things that are difficult for us. You’ve already seen that trust is incredibly important in the Hunchworks design. Whilst we have to build the system to enhance trust between users and users, users and bots, users and the system etc, we also have to build to deal with what happens when that trust is broken. Yes, we need security.

We need security to ensure that hidden hunches are properly distanced from the rest of the system.  I’ve worked responses where people died because they were blogging information, and we need to minimise that risk where we can.

We also need ways to spot sock puppet attacks and trace their effects through the system when they happen. This is on our roadmap for next year.

And then we have localisation. The UN has 6 official languages (Arabic, Chinese, English, French, Russian and Spanish), but we’re going to be putting labs into countries all over the world, each with their own languages, data sources and cultural styles.  We can’t afford to lose people because we didn’t listen to what they needed, and this is a lot of what the techs embedded in Pulse Labs will be doing. We’ll need help with that too.

<Slide 6: federation diagram>

Also on the ‘later’ roadmap is dealing with federation.  We’re starting with a single instance of Hunchworks so we can get the user management, integration and algorithm connections right.  But we’re part of an organisation that works across the world, including places where bandwidth is very limited and sometimes non-existent, and mobile phones are more ubiquitous than computers.  We’re also working out ways, as part of Data Philanthropy, to connect to appliances embedded within organisations that can’t, or shouldn’t, share raw data with us. Which means a federated system of platforms, with all the synchronisation, timing and interface issues that entails.

We aren’t addressing federation yet, but we are learning a lot in the meantime about its issues and potential solutions from systems like Git and from our interactions with the crisismapper communities and other organisations with similar operating structures and connectivity problems. Crisismapping, for example, is teaching us a lot about how people handle information across cultures and damaged connections whilst under time and resource stress.

Okay, I’ve geeked out enough on you. Back to Chris for the last part of this talk.

Slideset

Focussed on the technical challenges of building Hunchworks… slide titles are:

  • Open source and the UN – pic of codejam – licensing models, working models for UN technology creation (h/t unicef and undp, big h/t adaptive path and the codejammers)
  • Integration – bubble diagram with the other UN systems that we’re connecting to (CRMA and CIMS for actions, TeamWorks for the user base, non-UN systems for tools – CKAN for data, CivicCommons for toolbag etc)
  • Algorithms – list of algorithms we’re designing, discussion of uncertainty handling, risk management, handling mixed human-bot teams, similarity and complementarity metrics (including using hand-built and learnt team templates) etc
  • Security – how to handle hidden hunches, incursions and infiltrations. Diagram of spreading infiltration tracked across users, hunches and evidences.
  • Localisation – wordcloud of world languages; discuss languages and the use of pulse labs
  • Federation – clean version of my federation diagram – describe first builds as non-federated, but final builds being targeted at a federated system-of-systems, with some nodes having little or no bandwidth and the synchronisation and interface needs that creates

What’s my real job again?

Where does Global Pulse start and end technically? It’s a question I often have to answer, mostly because of my very ambiguous role leading work both inside (at Global Pulse), outside (with the crisismappers, open data peeps, volunteer hackers etc) and across (UNGIWG, UNSDI) the UN.

The UN, both traditionally and increasingly, has the role of coordinator between both its internal agencies and external NGOs, government agencies and affected communities in the geographical areas in which development and humanitarian efforts take place. On top of that, it’s still doing the political and physical (food aid, heritage sites etc) legwork that helps to keep the world safer and more stable. That’s a lot to ask of any organization, and it’s a huge amount to ask of one whose systems until very recently were based on a slower-moving, less-connected social world.

There’s a lot to be done, data-wise, to support all of that. We need to improve the information to people making decisions across the whole of the UN: give them information that’s timely, covers their areas of interest, and covers the geographies they’re interested in. There are good people all over the UN and outside who are working on this, and I’m working with many of them in my non-GlobalPulse role, watching them make things work better from areas as far apart as mapping which agencies are helping in a natural disaster to satellite-based estimates of refugee settlement sizes and better conditions for New-York-based technologists.

Global Pulse’s remit is helping governments to improve their knowledge about possible development crises in their areas. That doesn’t include any of the following things I do: data and coordination in natural disasters, improving community knowledge, improving technology available to the UN, connecting related UN agency staff to each other, raising awareness of technology possibilities across the UN, mapping political changes, improving the maps available to the UN and its partners, running humanitarian technology events and projects, leading and advising mapping teams, raising awareness in the UN about how to work with technologists, linking UN staff to outside agencies, companies, academics and technologists who could help them with their systems, raising awareness of the UN in local and global technology groups, working on disaster information responses, clarifying data licenses for data going into and out of the UN, helping the UN to work with open source and open data communities, guiding crisismapping work and connecting humanitarian technologists with each other. Or any of the work that I do to help create the environment in which Global Pulse and its technologies can thrive. And some other stuff, but I forget what that is at the moment.

It does include: design the systems and survey the technologies needed to improve (better data and faster, better insights from that data) government analysts’ situation awareness of development and potential development problems in their countries. Highlight and design processes and technologies that are not available (and preferably available open-source) to those analysts. Find ways to get those technologies built, tested and fielded – preferably using global pulse staff if they’re available, but also finding synergies between what the analysts need and what people in related fields and agencies need too. Which also means spending time creating relationships with agencies, companies, universities, groups and individuals who can help with design, build, test, field, and keeping an eye on developments in all the groups, communities and areas that can help with this.

My apologies for the non-techie post. I needed to get some things about my dayjob straight, as much for myself as for anyone else. And I’ve spent a year paying for myself to go to events as an individual only to find that everyone’s assumed that it’s part of my work. Which has admittedly been very useful sometimes, and made several things more possible than they were. But sometimes one has to draw a line somewhere.

Lessons from the (latest) internet revolution

Once more with publishing some notes that I’d left lying around.  I was thinking about the new Internet – the one that’s emerged over the last year and is becoming more real-time and reactive, and uses more data analysis, maps and geography.  And I was wondering what lessons we could learn from this shift.

1. Stop observing your users and start interacting with them

Huge just published a book on the shift from customers to users, i.e. from people who make one-off purchases from a firm to people who are engaged in a conversation with that firm, whether they’re purchasers, potential purchasers, bloggers, employees or fans.

That resonates.  It’s a shift already happening in the commercial world, and it’s a shift going on in the development world too.  I’ve blogged before about the shift in organisational focus from vulnerability to resilience and the work of people like CDAC and the World Bank in this. In its purest form, this is just another way to say “start listening to and working with the people you’re trying to help”.  It’s not rocket science, but it does come with all sorts of culture changes, some of which (shock, horror) might just be hastened by tech.

2. Keep it clean

No, no, I’m not talking about the rise and fall of pornography markets here – I’m talking about data and the information it contains.  If you ask many seasoned development professionals about their data collection methods, you might get quite a cynical response (see un blog).   Okay, so Amazon has had a couple of high-profile data fails, but on the whole commercial data is accurate and consistent because it has to be.

3. Make real-time internet easier

When you buy a book or a dress online, you don’t have to wait for days to know what the price is or whether you’ve bought it (okay, with some sites you still do, but that’s getting rarer now) – it comes up, the stock comes up, you buy and the thing is delivered to your house / office / significant other because you forgot their birthday.

Real-time development data (by which I mean data that arrives in months, days or hours rather than once per year) is still, largely, mandraulic.  Teams of people are out there entering data, formatting it, cleaning it and transferring it from place to place because we don’t have the systems in place to make this easy yet.  Other teams are working on those systems piecemeal (and a big hand to people like okfn, data without borders and crisismappers for making this possible), but there still isn’t (to my knowledge – please correct me if I’m wrong) an easy template or data standard to allow development data to be discovered, uploaded, cleaned and transferred between systems without pain.  And when we’re talking about common human systems like market prices, water systems, sewage disposal, traffic (yes, yes, I know there are standards for that), aid (yes, yes, IATI), that’s a need that’s calling out for coordination.  When I can dial up and compare the water point systems for Uganda and Brasil, I’ll be a lot happier bunny.

4. Keep it simple. But not too simple.

Websites and mobile apps rule today.  In New York, Ruby coders are at a premium over C++ (remember that?) because the driver now is user interaction and simplicity rather than algorithms and depth.  Which is great if you’re building a consumer site, but not so great if you want to fairly handle complex uncertain data.

Well, that’s as far as I got with that thought.  More soon.