The story was already great, even before Daniel Gilbert opened his first spreadsheet. Thousands of citizens in the southern Virginia area Gilbert covered for the Bristol Herald Courier (daily circulation: 30,000) had leased their mineral rights to oil and gas companies in exchange for royalties. Twenty years later, they alleged, the companies had not paid, leaving potentially millions of dollars owed. As Gilbert learned, the complaint was complicated. It involved esoteric oil and gas practices and regulations, a virtually unknown state oversight agency, the rules of escrow accounts—and finally, some very angry people and a handful of very big companies. With these facts alone, he could have written a stellar story giving voice to citizens’ complaints and shining a light on a little-known regulatory agency. That, in many newsrooms, would have been plenty.
But Gilbert, who officially covered the courts for the paper, wasn’t satisfied simply to raise the specter of noncompliance. Whenever a well produced natural gas, the energy company was supposed to make a monthly payment into a corresponding escrow account. These payment schedules were public. So were the production records. All Gilbert had to do was match the production records with the payment schedules to see who had—and had not—been paid.
Easier said than done. Gilbert requested the information he needed and received spreadsheets with thousands of rows of information. In Excel, a typical computer monitor displays fewer than a hundred rows and ten wide columns. Gilbert’s data was much too massive to cram into this relatively modest template. So he started with one month’s worth of information, using the program’s “find” function to match wells and their corresponding accounts. One by one. Control-f, control-f, control-f. It was tedious and time-consuming. There was a story there, he was certain. But control-f would not find it.
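For readers curious what the alternative to control-f looks like, here is a minimal sketch in Python of the matching task described above: list every well and month that shows gas production but no escrow payment. It is an illustration, not Gilbert's actual method. The column names (well_id, month, gas_mcf, amount), the well numbers, and the figures are all invented for the example; real state records would use their own layouts and would need cleaning first.

```python
import csv
import io


def find_unpaid(production_rows, payment_rows):
    """Return sorted (well_id, month) pairs with production but no payment.

    Both arguments are lists of dicts, one dict per spreadsheet row.
    The field names are hypothetical, chosen only for this sketch.
    """
    # Collect every well/month for which a nonzero payment was recorded.
    paid = {
        (row["well_id"], row["month"])
        for row in payment_rows
        if float(row["amount"]) > 0
    }
    unpaid = []
    for row in production_rows:
        key = (row["well_id"], row["month"])
        # A well that produced gas should have a matching payment.
        if float(row["gas_mcf"]) > 0 and key not in paid:
            unpaid.append(key)
    return sorted(unpaid)


# Invented sample data standing in for the two public spreadsheets.
PRODUCTION_CSV = """well_id,month,gas_mcf
W-001,2009-01,1200
W-002,2009-01,0
W-003,2009-01,845
W-001,2009-02,1100
"""

PAYMENTS_CSV = """well_id,month,amount
W-001,2009-01,310.50
W-001,2009-02,0
"""

production = list(csv.DictReader(io.StringIO(PRODUCTION_CSV)))
payments = list(csv.DictReader(io.StringIO(PAYMENTS_CSV)))

unpaid = find_unpaid(production, payments)
for well_id, month in unpaid:
    print(well_id, month)
```

The point of the sketch is that the program checks each well against a lookup set of paid well/month pairs in a single pass, so a file with thousands of rows takes about as much effort as one with four.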
What would you do? Could you navigate, process, and make sense of thousands of rows of data? If you have not yet had to ask yourself this question, there is no time like the present.
Most journalists are just like Gilbert, with daily computer skills that include Internet searches, word processing, and maybe some basic calculations in Excel, none of which enables journalists to truly mine large collections of data. Meanwhile, the amount of raw data available to journalists has mushroomed. At the federal level, the Obama administration’s “open government” initiative has given rise to new sources like Data.gov, a website devoted to the aggregation and easy dissemination of national data sets. State and local governments have followed suit, making much of the data they collect available online. More elusive tranches of data have been pried loose by nonprofit organizations courtesy of the Freedom of Information Act; an inquisitive journalist can download them in minutes. “I’m constantly amazed and surprised about what’s out there,” said Thomas Hargrove, a national correspondent for Scripps-Howard News Service who often leads data-based research projects for the chain’s fourteen newspapers and nine television stations.
Against this backdrop, the ability to find, manipulate, and analyze data has become increasingly important, not only for teams of investigative journalists, but for beat reporters. It is hard to conceive of a beat that doesn’t generate data—even arts reporters evaluate budgets and have access to nonprofit organizations’ tax returns. What’s more, because the universe of data is vast and growing, and the stories that use it are rare, data-based journalism has become a powerful way to stand out in the crowded news cycle. “When you acquire a certain level of data skills and literacy, you can punch way above your weight,” says Derek Willis, a web developer at The New York Times and author of the computer-assisted reporting blog, The Scoop. “Simply put, you can do things others can’t.”
And last but certainly not least, readers like data. They like charts and interactive graphics and searchable databases. At The Texas Tribune, which has published more than three dozen interactive databases and adds or updates about one a week, the data sets account for 75 percent of the site’s overall traffic.