Everywhere we go, everything we do, we send signals. Simple acts create streams of data, whether it is crossing the road, making a speech, running 100 meters, phoning your mother, or shooting a gun. Until now, the data generated by such activities has been difficult to capture, collect, and sort into patterns from which stories can be spun. But this is changing by the day. The streaming, structuring, and storing of this information in reusable formats (what we think of as “big data”) is increasingly the raw material of journalism.
Like every other part of the process of disseminating news, this activity is being redefined by mechanization. One of the most important questions for journalism’s sustainability will be how individuals and organizations respond to this availability of data.
In the cool 13th Street office space of Betaworks, in the heart of New York’s Silicon Alley, John Borthwick, the company’s chief executive, demonstrates one of the apps shown to him by a developer. It can trace the activity of people you might know in your immediate vicinity: who is around the corner, who is meeting whom, where they are going next. All of this is represented on an iPhone screen by pulling together signals from the “social graph” of activity we can choose to make public when we log into services like Foursquare, Facebook, or Twitter.
The total transparency offered is awesome, in the true sense of the word: It sends a shiver of wonder and apprehension. Borthwick, who runs an enterprise that he characterizes as part investment company, part studio, is one of a number of people either creating or incubating businesses that begin to mine and exploit this world of information: Many newsrooms, including Gawker, Forbes, and The New York Times, already use the real-time data analytics of Chartbeat; most journalists with a Twitter account will at some point have shortened their links through bit.ly; and the data company SocialFlow explains how stories become “viral” with more speed, clarity, and depth than any circulation or marketing department can provide. All of these startups received investment from Betaworks, and they represent what Borthwick believes is the future of information dissemination and, by default, journalism—understanding information “out of the container,” as he puts it.
At a recent technology-and-journalism breakfast hosted by Columbia Journalism School, Borthwick elaborated: “This data layer is a shadow,” he said. “It’s part of how we live. It is always there but seldom observed.” Observing, reporting, and making sense of that data is, he thinks, a place where journalism can forge a role for itself.
Borthwick is not alone in the belief that a world that is increasingly quantified will create opportunities. Until now, the journalism organizations that have been actively engaged in understanding the possibilities of large data sets have been largely confined to those that make money from specialist financial information. Reuters, Bloomberg, and Dow Jones all burnish their brands with high-quality reporting and analysis, but in each case, the core of the enterprise remains a real-time automated information business pointed at the financial-services market.
Five years ago, data journalism was a very niche activity, conducted in just a handful of newsrooms. Even now, to be a journalist who handles data can mean a variety of things, from being a statistics number cruncher or creative interaction designer to being a reporter who uses data skills—extracting the story and/or explaining the bias in it—as part of his or her beat. The roles are still emerging, but very rarely are there teams of information scientists and mathematicians (such as those employed by bit.ly, Chartbeat, and SocialFlow) sitting inside news organizations, working out how to use these new resources for best effect.
Just as computer scientists figured out search algorithms that sorted information, taking away a part of journalism’s role, others are now writing algorithms that assemble data into stories. The highest-profile exponent of this practice is Narrative Science, a collaboration between computer scientists and the Medill School of Journalism at Northwestern University. Kristian Hammond, the data scientist leading the company, envisions a world in which everything from your cholesterol level to the state of your garbage bin creates continual streams of information that can be reassembled in story form. Narrative Science uses algorithms to produce basic, bread-and-butter stories that don’t require much flair in the writing: high-school-sports reports, local-government-meeting recaps, company financial results. Since these sorts of stories can be produced using unprecedented levels of automation, they offer a realistic chance of cutting newsroom costs. And although Hammond has a vested interest in predicting that vast amounts of data will be turned into personal, local, national, and international stories, his vision is also a logical extension of current trends.