Anyone who thinks Big Data projects are easy or inexpensive would do well to read Giannina Segnini’s piece on the painstaking efforts by staffers at La Nación Costa Rica to clean and meld data that arrived in numerous formats and tables, with misspellings, abbreviations, weird punctuation, etc., into something resembling usable form. Just eliminating duplication proved a huge challenge, requiring, among other things, the use of a library developed by MIT and named Vicino that performs “nearest neighbor searching and clustering” and an algorithm, called SIMIL, that looks for similar strings of data. One important consideration, for instance, was making sure that people with similar-sounding names were in fact the same person.
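To give a rough sense of what that kind of deduplication involves, here is a minimal Python sketch. It is not Vicino or SIMIL, and nothing in it comes from La Nación's code: it substitutes the standard library's `difflib.SequenceMatcher` for the string comparison and a simple greedy grouping for the clustering. The function names and the 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0.0 and 1.0 for two names after normalizing them."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster_names(names: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedy grouping: each name joins the first cluster whose first
    member is at least `threshold` similar; otherwise it starts a new one.
    The threshold is a hypothetical value, not one taken from the project."""
    clusters: list[list[str]] = []
    for name in names:
        for cluster in clusters:
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


if __name__ == "__main__":
    records = ["John Smith", "JOHN  SMITH.", "Jon Smith", "Alice Wong"]
    for group in cluster_names(records):
        print(group)
```

A grouping like this only proposes candidates. As the point about similar-sounding names suggests, someone still has to confirm that the records in a cluster refer to the same person before they are merged.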
—Then there’s the business model: ICIJ is an offshoot of the Center for Public Integrity, and is philanthropically funded. Recent ICIJ funders include the Open Society Foundations, the David and Lucile Packard Foundation, Pew Charitable Trusts, and the like. The collaboration with commercial news organizations makes it something of a hybrid, a model that has been put to good use elsewhere and makes all sorts of sense. Whether nonprofits can ever make up for what’s been lost in the news business is an open question. But this arrangement is on a scale that’s vastly larger than those tried so far. Coordinating among so many organizations is a job unto itself. And given the expense and risk of such grand investigative projects, the more resources available the better.
For a few reasons, then, this type of project is worth watching as a kind of ad-hoc model for the Great Stories, the longform, labor-intensive projects that, once again, prove indispensable.