The Challenge of Verifying Crowdsourced Information

A better way to sift through a river of data

Shortly after a devastating earthquake struck Haiti in January, a small team of workers with Ushahidi, a project that enables people to crowdsource and map crisis information, started sifting through information online and mapping reports of damage, security threats, people in need of assistance, and other data.

“From the very initial hours after the earthquake, what we did was deploy the Ushahidi platform and started monitoring any available sources of information that were out there,” Jaroslav Valuch, the project manager for Ushahidi Haiti, told me.

We spoke when he was in Montreal earlier this week to take part in a panel, “Citizen use of new media for the defense of human rights,” at the Citizen Media Rendez-Vous conference. I was on a different panel and managed to grab a few minutes of his time to chat about the challenge of verifying crowdsourced information. We also discussed how this relates to the upcoming beta launch of SwiftRiver, a software project that grew out of Ushahidi and calls itself a “free and open source software platform that uses algorithms and crowdsourced interaction to validate and filter news.” (Mathew Ingram at GigaOm previously wrote a post with some good background on the project.) The SwiftRiver beta launches on Monday.

During the first hours after the earthquake, the Ushahidi Haiti team consisted of just a few people. Valuch focused on analyzing international and Haitian media, Twitter, and Facebook to look for reports of people trapped, violence, collapsed buildings, those in need of medical attention, or other pieces of information that could explain what was happening on the ground. The Ushahidi Haiti map and information ultimately helped Marines and other responders figure out where to go to provide help, especially outside of the capital. To make this happen, the team had to sift through the river of reports, tweets, and information and figure out which items deserved to be added to the map, and which should be discarded. They decided to err on the side of inclusion, and to use tags to highlight the level (or lack) of trust they had in a given piece of information.

“Even though the information from Twitter is not particularly reliable—and things are being retweeted so it’s kind of messy—the basic idea is if you crowdsource the information and put it on one map you can really see the clusters of incidents,” Valuch said. “So even though one particular tweet is not that important, if you have similar reports from the media … you can see where the incidents are clustering.”

The Ushahidi platform enables users to tag reports as “not verified” if they didn’t come from a reliable source. The Ushahidi Haiti team discovered that by mapping the unverified reports, they were able to see if different sources were reporting similar things in similar areas. It was verification by aggregation. They would also attempt to verify tweets by seeing if they were retweeted by trusted sources, checking if the originating Twitter account was followed by people in Haiti, and looking to see if the user had enabled location data in their tweets.
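The "verification by aggregation" idea described above can be sketched in a few lines of code. This is only an illustration, not Ushahidi's actual implementation: the report fields (`lat`, `lon`, `category`, `source`), the grid-cell size, and the two-source threshold are all assumptions made for the example. It groups unverified reports by rough location and type, then keeps only the clusters that several independent sources agree on.

```python
import math
from collections import defaultdict

def cluster_reports(reports, cell_size=0.01):
    """Group reports by (grid cell, category) and collect the distinct sources.

    cell_size is in degrees; 0.01 is roughly one kilometre (an assumed value).
    """
    clusters = defaultdict(set)
    for r in reports:
        cell = (math.floor(r["lat"] / cell_size), math.floor(r["lon"] / cell_size))
        # A set means a retweet from the same account does not count twice.
        clusters[(cell, r["category"])].add(r["source"])
    return clusters

def corroborated(reports, min_sources=2, cell_size=0.01):
    """Return only the clusters reported by at least min_sources distinct sources."""
    clusters = cluster_reports(reports, cell_size)
    return {
        key: sorted(sources)
        for key, sources in clusters.items()
        if len(sources) >= min_sources
    }

# Hypothetical sample data, invented for illustration only.
sample_reports = [
    {"lat": 18.543, "lon": -72.338, "category": "collapsed building", "source": "twitter:@a"},
    {"lat": 18.547, "lon": -72.332, "category": "collapsed building", "source": "radio"},
    {"lat": 18.547, "lon": -72.332, "category": "collapsed building", "source": "radio"},
    {"lat": 18.235, "lon": -72.535, "category": "medical need", "source": "twitter:@b"},
]

for (cell, category), sources in corroborated(sample_reports).items():
    print(category, cell, sources)
```

A fixed grid is the simplest possible stand-in for "similar areas"; reports that fall just across a cell boundary would be missed, so a real system would likely use a distance-based clustering method instead.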

The team focused less on monitoring media once they had a short code that anyone in Haiti could use to submit information by cell phone. (To verify those reports, they often called the number back to try to speak with the person who sent the report.) In the end, over 2,000 reports submitted by cell phone were added to the map.

Valuch admitted the process wasn’t perfect, but it showcases some of the techniques that can be used in crowdsourced verification. It’s interesting to note how the team used a mass of unverified reports in order to achieve accuracy. Ushahidi is a map-driven project, so it chose to cluster the unverified reports in order to look for patterns, but there are other ways of collecting, analyzing, and presenting this information. The challenge is to find a way to quickly and accurately sort and evaluate a mass of incoming reports according to your preferences. This is a core element of distributed verification, which I called “the best way to engineer trust in today’s information environment” in a previous column about WikiLeaks’ Afghanistan documents.

This is where SwiftRiver comes in. I got in touch with Jon Gosier, a co-founder of SwiftRiver and the CEO of African software consultancy Appfrica, to talk about the project.

“The big motivation behind SwiftRiver, to be quite frank, was to solve two problems Ushahidi was having,” he told me by e-mail. “One, how to verify crowd sourced information, and two, how to filter realtime streams of data when it became overwhelming, without sacrificing the integrity of the stream. In other words, how can you speed up the process of vetting information from Twitter, RSS feeds, SMS and email.”

When put in those terms, it’s clear that SwiftRiver has uses beyond the crisis and incident mapping pursued by Ushahidi. Gosier said the goal “is to use algorithms to make humans more efficient at sifting through data. This means using semantic technology to summarize content, the social graph to measure reputation, interaction with content to calculate exactly what type of content the user wants to see more of.”

From his comments and the Haiti example above, you can begin to see the different elements that can aid in verifying data: location (is the report coming from the right place?); reputation (is the source trusted by me or by people who themselves are trusted?); content comparison/aggregation (via clustering or other methods to discover patterns); and timing (is the report coming at the right time?).
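To make those four elements concrete, here is a rough sketch of how they might be combined into a single triage score for sorting incoming reports. It is not how SwiftRiver works; the field names, the weights, the cap of three corroborating sources, and the 72-hour window are all assumptions chosen for illustration, and a person would still review whatever rises to the top.

```python
from dataclasses import dataclass

@dataclass
class Signals:
    in_affected_area: bool       # location: is the report from the right place?
    source_trust: float          # reputation: 0.0 (unknown) to 1.0 (trusted)
    corroborating_sources: int   # aggregation: other sources saying the same thing
    minutes_after_event: float   # timing: when the report arrived

def triage_score(s, weights=(0.25, 0.35, 0.25, 0.15), window_minutes=72 * 60):
    """Combine the four signals into a score between 0 and 1.

    The weights and the time window are illustrative assumptions,
    not values taken from Ushahidi or SwiftRiver.
    """
    location = 1.0 if s.in_affected_area else 0.0
    reputation = min(max(s.source_trust, 0.0), 1.0)
    # Cap corroboration so one heavily repeated rumor cannot dominate the score.
    corroboration = min(max(s.corroborating_sources, 0), 3) / 3
    timing = 1.0 if 0 <= s.minutes_after_event <= window_minutes else 0.0
    w_loc, w_rep, w_cor, w_time = weights
    return w_loc * location + w_rep * reputation + w_cor * corroboration + w_time * timing

# Hypothetical reports, invented for illustration only.
incoming = [
    Signals(in_affected_area=False, source_trust=0.1, corroborating_sources=0, minutes_after_event=-30),
    Signals(in_affected_area=True, source_trust=0.9, corroborating_sources=3, minutes_after_event=120),
]

# Highest-scoring reports go to the front of the queue for a human to check.
queue = sorted(incoming, key=triage_score, reverse=True)
```

A weighted sum is the bluntest possible way to blend these signals. Its only virtue here is that each weight states, in plain view, how much a given newsroom has decided to care about location versus reputation.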

The complicated nature of sifting and verifying a river of information, especially under time constraints, means that total automation is unlikely, or perhaps impossible. “With Swift our goal isn’t to completely automate verification … but Swift tries to help the user deal with preferred content first, and everything else after,” Gosier said.

As for the human element, “We’re betting the [farm] on hybrids. Using algorithms to optimize human interaction. We don’t feel humans can be removed from the process.”

Gosier said “a few” newsrooms are testing out the software. I asked him to explain how a news organization might make use of SwiftRiver. Here’s what he sent back:

In the case of the newsroom, a group of reporters can aggregate as much realtime info as they want and trust that the sources the group finds to be most accurate will be the sources that are prioritized. If a newsroom were to run a campaign where they crowd source, like CNN does with iReport, they can then find those citizen journalists in the crowd who actually add value.

At the core of SwiftRiver is an acknowledgement that accuracy can be a matter of perception, or situation. The tool is meant to enable people to define what accuracy means to them, and then filter based on those parameters.

“If you are a user researching a specific subject or event, certain sources of information and certain types of information are going to be more relevant to you,” Gosier said. “Swift learns from what you prefer, in a given context, and helps you curate information based on what it learns. If you were to use SwiftRiver to curate information in a different context, the results would be different.”

Or, as Valuch put it, “These days, forget about having 100 percent verified information—but you can have trusted sources or things with high probability.”

Correction of the Week

The Beliefs column on Saturday, about Buddhist leaders who addressed a sex scandal, referred incorrectly to a 1990 article by Katy Butler, a journalist, titled “Encountering the Shadow in Buddhist America.” Ms. Butler did not describe Richard Baker, the abbot of the San Francisco Zen Center during the 1970s and 1980s, as an alcoholic. She was comparing patterns of behavior by his followers to patterns of enabling behavior of relatives of alcoholics. – The New York Times

Craig Silverman is the editor of and the author of Regret The Error: How Media Mistakes Pollute the Press and Imperil Free Speech. He is also the editorial director of and a columnist for the Toronto Star.