A reflective piece in The New York Times’s business pages points to a critical future role for science reporters—guarding against a “Big Data bubble.”
The article, by Steve Lohr, described an MIT conference that explored the considerable promise that big data—a catchall label that describes the new way of understanding the world through the analysis of vast amounts of data—will improve corporate decision-making and the efficiency of company management.
Yet Lohr, the paper’s reporter on the intersection of technology, business and finance, sounded a cautionary note about the data revolution that is sweeping fields from marketing, to sports, to politics, to health. He quoted Claudia Perlich, chief scientist at the online advertising company m6d, saying, “You can fool yourself with data like you can’t with anything else. I fear a Big Data bubble.” She worries that an influx of self-titled “data scientists” who do poor work could damage the emerging field’s reputation.
Technology reporters like Lohr, who has been looking at this issue for a long time and whose February 2012 article, “The Age of Big Data,” is an essential primer, are well positioned to question and weed out those poor practitioners, and to provide a needed check on the current exuberance surrounding big data. So are science reporters, who excel at scrutinizing big-data techniques such as statistical modeling.
Lohr quoted a series of questions that Thomas H. Davenport, a visiting professor at Harvard Business School, urged managers to ask about big data projects. “How do you define the problem? What data do you need? Where does it come from? What are the assumptions behind the model that the data is fed into? How is the model different from reality?”
Science reporters ask these questions every day. Modeling voting behavior or economic opportunities, two of big-data scientists’ favorite subjects, is very different from modeling, say, climate change, and reporters must take into account each kind of model’s strengths and weaknesses (the laws of physics tend to be a lot more reliable than human behavior, for instance). But the basic method of interrogation, from sampling procedure to data analysis, is the same.
Take, for example, science writer Sharon Begley’s Reuters article about the flu epidemic sweeping across the US, in which she paid careful attention to the difficulty of developing annual vaccines. Public health officials correctly forecast the strains that would emerge this year, yet they still developed a vaccine that epidemiologist Dr. Arnold Monto called “good but not great.”
They don’t always do it perfectly, but it is science reporters’ job to contextualize new studies and examine the social and ethical impacts of new developments. Big data needs the same treatment. After all, as Lohr noted in his article for the Times, some of its methods, such as predictive calculations and mathematical modeling, were first widely applied by the same Wall Street bankers who brought the country to the brink of economic ruin.
The Financial Times journalist Gillian Tett outlined part of that story in her outstanding book about the origins of the crisis, Fool’s Gold, chronicling the evolution of new credit derivatives created by J.P. Morgan bankers in the mid-1990s. New financial products ignited a banking revolution in which reckless speculation was based in part on ever-more complex data models. So impenetrable did these models become that, a decade after they had created the idea, the same group of J.P. Morgan bankers could no longer completely fathom how new forms of derivatives actually worked. Fool’s Gold can be read by science reporters as a warning from recent history about the dangers of unexamined faith in data models.
Again, financial models are very different from models of the natural world, but they all need the same type of testing. Because the statistical concepts at play are so abstract, however, the media have covered the tension in the new world of data science largely by focusing on controversial characters in the field. This was most evident in the massive amount of attention devoted during the 2012 presidential election campaign to statistician and New York Times blogger Nate Silver, who used sophisticated modeling of polling data to predict Obama’s victory and the electoral college outcome in all states.
To some extent, Silver became a proxy for discussions about the possibilities and pitfalls in big data. Those who agreed with Silver’s approach said his work demonstrated irrefutably that opinion-based punditry was dead, unable to compete with the claimed objectivity of statistics. Those who challenged Silver pointed out that a range of models co-exist and not all can be correct.
For his part, Silver told The Observer, “I’d be the first to say you want diversity of opinion. You don’t want to treat any one person as oracular.”