Journalism by numbers

Everywhere we go, everything we do, we send signals. Simple acts create streams of data, whether it is crossing the road, making a speech, running 100 meters, phoning your mother, or shooting a gun. Up till now, the data generated by such activities has been difficult to capture, collect, and sort into patterns from which stories can be spun. But this is changing by the day. The streaming, structuring, and storing of this information in reusable formats—what we think of as “big data”—is increasingly the raw material of journalism.

Like every other part of the process of disseminating news, this activity is being redefined by mechanization. One of the most important questions for journalism’s sustainability will be how individuals and organizations respond to this availability of data.

In the cool 13th Street office space of Betaworks, in the heart of New York’s Silicon Alley, John Borthwick, the company’s chief executive, demonstrates one of the apps shown to him by a developer. It can trace the activity of people you might know in your immediate vicinity: who is around the corner, who is meeting whom, where they are going next. All of this is represented on an iPhone screen by pulling together signals from the “social graph” of activity we can choose to make public when we log into services like FourSquare, or Facebook, or Twitter.

The total transparency offered is awesome, in the true sense of the word: It sends a shiver of wonder and apprehension. Borthwick, who runs an enterprise that he characterizes as part investment company, part studio, is one of a number of people either creating or incubating businesses that begin to mine and exploit this world of information: Many newsrooms, including Gawker, Forbes, and The New York Times, already use the real-time data analytics of Chartbeat; most journalists with a Twitter account will at some point have shortened their links through bit.ly; and the data company SocialFlow explains how stories become “viral” with more speed, clarity, and depth than any circulation or marketing department can provide. All of these startups received investment from Betaworks, and they represent what Borthwick believes is the future of information dissemination and, by default, journalism—understanding information “out of the container,” as he puts it.

At a recent technology-and-journalism breakfast hosted by Columbia Journalism School, Borthwick elaborated: “This data layer is a shadow,” he said. “It’s part of how we live. It is always there but seldom observed.” Observing, reporting, and making sense of that data is, he thinks, a place where journalism can forge a role for itself.

Borthwick is not alone in the belief that a world that is increasingly quantified will create opportunities. Up to now, the journalism organizations that have been actively engaged in understanding the possibilities of large data sets have been largely confined to those that make money from specialist financial information. Reuters, Bloomberg, and Dow Jones all burnish their brands with high-quality reporting and analysis, but in each case, the core of the enterprise remains a real-time automated information business pointed at the financial-services market.

Five years ago, data journalism was a very niche activity, conducted in just a handful of newsrooms. Even now, to be a journalist who handles data can mean a variety of things, from being a statistics number cruncher or creative interaction designer to being a reporter who uses data skills—extracting the story and/or explaining the bias in it—as part of his or her beat. The roles are still emerging, but very rarely are there teams of information scientists and mathematicians (such as those employed by bit.ly, Chartbeat, and SocialFlow) sitting inside news organizations, working out how to use these new resources for best effect.

Just as computer scientists figured out search algorithms that sorted information, taking away a part of journalism’s role, others are now writing algorithms that assemble data into stories. The most high-profile exponent of this practice is Narrative Science, a collaboration between computer scientists and the Medill Journalism School at Northwestern University. Kristian Hammond, the data scientist leading the company, envisions a world in which everything from your cholesterol level to the state of your garbage bin creates continual streams of information that can be reassembled in story form. Narrative Science uses algorithms to produce basic, bread-and-butter stories that don’t require much flair in the writing—high-school-sports reports, local-government-meeting recaps, company financial results. Since these sorts of stories can be produced using unprecedented levels of automation, they offer a realistic chance of cutting newsroom costs. And although Hammond has a vested interest in predicting that vast amounts of data will be turned into personal, local, national, and international stories, his vision is also a logical extension of current trends.

Javaun Moradi, a digital strategist and product developer for NPR, is one of a new breed of digital journalists who are working to weave the use of algorithms and new kinds of data into the arsenal of skills in the newsroom. In particular, he sees sensor networks—low-cost devices that civic-interest groups use to monitor things like air quality—as a potential data source. “It’s coming at us whether we like it or not,” he says. “A lot of inexpensive devices will start sending us a great deal more information.” Moradi can easily imagine journalists building and maintaining their own networks of information. “Up until now,” he notes, “journalists have had really very little data, and mostly other people’s data, acquired from elsewhere.” At the same time, Moradi points out, there are bound to be new dilemmas and challenges around the ownership and control of information.

Alex Howard, who writes about data journalism, government, and the open-data movement for O’Reilly Media, also flags the ownership and control of data as a key issue. “For lots of types of data—finance, for instance—there are laws that say who can obtain it and who can use it,” Howard notes. “But new kinds of information don’t necessarily have legal and regulatory frameworks.” How newsrooms obtain and handle information—what their standards and practices are—is likely to become an important part of differentiating news brands.

Journalism by numbers does not mean ceding human process to the bots. Every algorithm, however it is written, contains human, and therefore editorial, judgments. The decisions made about what data to include and exclude add a layer of perspective to the information provided. There must be transparency and a set of editorial standards underpinning the data collection.

The truth is, those streams of numbers are going to be as big a transformation for journalism as the rise of the social Web. Newsrooms will rise and fall on the documentation of real-time information and the ability to gather and share it. Yet while social media demands skills of conversation and dissemination familiar to most journalists, the ability to work with data is a much less central skill in most newsrooms, and still completely absent in many. Automation of stories and ownership of newly collected data could both reduce production costs and create new revenue sources, so they ought to be at the heart of exploration and experimentation for newsrooms. But news executives have missed the cues before. The industry shot itself in the foot 15 years ago by failing to recognize that search and information filtering would be a core challenge and opportunity for journalism; this time, there is an awareness that data will be similarly significant, but once again the major innovations appear destined to come from outside the field.

To solve journalism’s existential problems, the field needs to forge a close relationship with information science. At Columbia Journalism School, Medill, Missouri, and elsewhere, bridges between computer science and journalism are being hastily constructed. Every week sees new collaborative computer science and journalism meetups or hackathons. Enlightened news organizations already have APIs (application programming interfaces) so that outsiders can access elements of their data. But much of the activity remains marginal rather than core to business planning and development.

“Data are everywhere all the time,” notes Mark Hansen, director of Columbia University’s new Brown Institute for Media Innovation. “They have something to say about us and how we live. But they aren’t neutral, and neither are the algorithms we rely on to interpret them. The stories they tell are often incomplete, uncertain, and open-ended. Without journalists thinking in data, who will help us distinguish between good stories and bad? We need journalists to create entirely new kinds of stories, new hybrid forms that engage with the essential stuff of data—the digital shadows of who we are, now, collectively.”

In the remaking of the field, the shadow of information is something journalism should no longer be afraid of.
