“Everything clicks for a different reason”: Why journalism analytics are so hard to interpret

In 2003, journalist Michael Lewis published Moneyball, a bestselling book about how baseball general manager Billy Beane was able to turn around the flailing and cash-poor Oakland Athletics with a clever use of statistics, culminating in a record-breaking 20-game winning streak. With its cast of beleaguered-underdogs-turned-triumphant-visionaries, Moneyball makes for a compelling story. It also turned out to be a tale perfectly suited to the dawning of the so-called “Big Data” era, in which new sources of analytics were supposed to provide strategic advantages for any organization savvy enough to leverage them.

It was, then, perhaps inevitable that the Moneyball mindset would be applied to the journalism field, especially as audience metrics became more granular and accessible. The parallels were clear enough: it was easy to paint anti-metrics diehards, who claimed that they had an ineffable sense of “news judgment” that allowed them to divine what would resonate with readers better than data ever could, as the newsroom equivalent of the out-of-touch baseball scouts who tried to stop Beane from implementing his revolutionary data-driven approach. To publishers facing intensifying revenue pressure, there was also powerful appeal in the notion that metrics would allow news organizations to tap into hidden pockets of value that would enable them to stay afloat—or even thrive. If data could help the Oakland A’s pull off a dramatic and unexpected winning streak, why couldn’t it help a news website struggling to bring in readers and revenue?

But applying the Moneyball mindset to the actual work of journalism presents particular challenges. This is not only because some journalists remain wary of analytics as a potential force of displacement and usurpation. It is also because the news industry differs in significant ways from professional baseball. A baseball team has a single, definable goal: to maximize wins. Furthermore, it is possible to know definitively if a team is making progress toward achieving that goal by looking at its record compared to those of other teams. Finally, and perhaps most significantly, all participants in the field agree on whether the goal has or has not been attained.

Journalism does not possess these traits. Commercial news organizations in democratic societies have multiple goals that are often in tension and difficult, if not impossible, to commensurate. Just as artists must contend with what media scholar Mark Banks calls the “art-commerce relation,” journalists continually navigate the “democracy-commerce relation”—that is, the long-standing tension between journalism’s civic and commercial aims. As an “axis point in the political struggle,” as Banks calls it, between democracy and commerce, it’s no surprise that journalists are often embroiled in heated debates about what—and whom—journalism is for. Journalism’s conflicting mandates complicate the process of interpreting traffic metrics.

As a sociologist interested in how digital data is reshaping knowledge work, I wanted to understand this process better. From 2011 to 2015, I conducted a mix of in-depth interviews and ethnographic observation at three sites: Chartbeat, a startup that creates real-time Web analytics tools for editorial use; the New York Times, at a time when the organization was trying to reconcile its storied print past with the rhythms, technologies, and economic constraints of digital journalism; and Gawker Media, which at the time of my research was still an online-only independent media company that owned a network of popular blogs. I sought to uncover how newsroom metrics are changing the way journalists practice journalism.


One of the first things that became clear was that, contrary to the idea that data provides clarity, traffic metrics can be difficult to make sense of. I discovered three distinct types of interpretive ambiguity that accompany news analytics. First, there is meaning uncertainty: journalists did not know what metrics meant and were unsure how to make sense of them. The second type of interpretive ambiguity is causal uncertainty. When a piece disappointed or exceeded traffic expectations, journalists were often unsure why it had done so. This causal uncertainty, in turn, led to a third type of interpretive ambiguity, action uncertainty. Since it was often unclear why one article had attracted high traffic while another had not, journalists were often flummoxed about how, if at all, to adjust their editorial practices.

Meaning Uncertainty


The first type of interpretive ambiguity was uncertainty about how to interpret metrics at a basic level. 

It is one thing to see a number rising or falling on an analytics dashboard, and quite another to understand what that information means. Parker, a long-time New York Times staffer who worked in newsroom operations and had extensive access to and experience with analytics, and whose name, like the rest, has been changed as a condition of my fieldwork, told me that the most common question reporters asked him upon seeing metrics for a particular story was, “Is this good?” Betsy, a Times editor who was known around the newsroom for her digital savvy, lamented the fact that most journalists at the paper lacked a “baseline” of average performance for a story: “They don’t know if 100,000 pageviews is good or bad.” Felix, a writer at Gawker, made a similar point: “We’re looking at these numbers all the time, but thinking about them completely irrationally.” When I asked him to elaborate, Felix responded, “Like, looking at Chartbeat, what the fuck does that mean?”

Meaning uncertainty often stemmed from the substantial qualitative differences between articles. Though most anything published on a Gawker site was considered a “post” (or, at the Times, a “story”) and its traffic ranked against others thus categorized, particular posts and stories differed from each other in ways that staffers saw as substantial and important to account for. Articles tackled vastly different types of subject matter, were authored by different writers, were published on different days of the week and at different times of day, and received different placement on the home page (and, in the case of the Times, in the print edition and the mobile app) and different levels of visibility via promotional actions like mobile push alerts and social media posts. Further complicating matters, each story, once published, entered into a broader information environment that was ever-changing, unique, and largely outside the newsroom’s control. Another news organization might publish a high-profile scoop that monopolized the spotlight; Facebook might suddenly change its algorithm to disfavor a particular type of news content or publication; an unanticipated world event might occur and require wall-to-wall coverage. (Linda, a Times editor, told me ruefully about a devastating exposé on the tilapia industry that had the misfortune of being published the same day as the raid that killed Osama bin Laden.)

At both the Times and Gawker, journalists attempted to compensate for the lack of an apples-to-apples comparison by adjusting their traffic expectations for stories in a way that took some of these mitigating factors into account. Yet doing so proved difficult in practice. It was intuitively obvious to many journalists I interviewed that, say, a 3,000-word feature story about the civil war in Syria published on a Friday afternoon would have fewer views than a recap of the Super Bowl halftime show published the morning after the game. But how many fewer? How should the stark qualitative differences between these two stories—to say nothing of various contextual factors—be accounted for numerically? As Ben explained, “We don’t have a great sense of context for whether something is more or less than we would expect. When you’re looking at a raw number, it’s hard to know how that fits into what you would expect.” Chartbeat was not especially helpful at providing the sense of context that Ben was hoping for. With its ever-shifting list of “top stories” displayed prominently at the center of the dashboard, Chartbeat announced itself as fundamentally a form of commensuration. And, as Wendy Espeland and Michael Sauder put it, commensuration is “notable for how rigorously it simplifies information and for how thoroughly it decontextualizes knowledge.”

In the absence of a ready-made interpretive schema with which to understand analytics, staffers relied on quantitative heuristics that were intuitively familiar. James Robinson, the erstwhile director of newsroom analytics at the Times, described reporters’ search for an intuitive framework with which to understand metrics:

They often don’t know how to interpret [traffic numbers]. . . . They search for things to compare that number to . . . Our numbers line up so it’s almost like a salary scale. So you can say, “Your story got 20,000 visits” and they’re like “Oh, 20,000, I couldn’t live on 20,000 a year.” But then [if] you say, like, 150,000, they’re like, “Oh, that’s pretty comfortable, that’s good.” And if you say a million visits, it’s like, “Oh, I’m a millionaire!”

Sports analogies were also common—perhaps unsurprisingly, given the game-like user experience of analytics dashboards. Betsy used baseball metaphors to help reporters understand metrics, referring to stories’ traffic performance as a “solid single,” a double, a grand slam—and, for exceptional successes, winning the World Series “because we only have one of those a year.”

Metrics that measure the performance of news articles—and the performance of the journalists who produce them—present interpretive challenges that have no clear or easy answers. These interpretive challenges also have high stakes. It is not lost on journalists that the frames used to interpret traffic—that is, to decide if a particular story’s pageview count is “good” or “bad,” a “home run” or a “double”—are highly influential in shaping managerial perceptions of how effective they are at their jobs. When the management of the Washington Post and the Washington-Baltimore Newspaper Guild, which represents Post reporters, reached an agreement over a two-year contract in June 2015, one of the union’s key demands was that reporters would gain greater access to metrics. Why? According to the Guild, “Metrics are already showing up in performance reviews,” despite the fact that, as the Guild co-chair Fredrick Kunkle explained, it was “not clear how to fairly gauge people’s performance relative to others when their missions are so variable. That is, how do you judge traffic performance between someone covering Hillary Clinton and someone covering Scott Walker—let alone someone covering county government in Washington’s suburbs? And yet it seems likely that these sorts of evaluations will appear in performance reviews soon.”


Kunkle’s comments illustrate how the metrics cart was put before the interpretive horse. In other words, the ubiquity and influence of metrics have outstripped the development of shared cognitive frameworks with which to make sense of them. The result is that the ability to interpret metrics—and have one’s interpretations be considered dependable and authoritative—has become a coveted and contested currency of power in contemporary newsrooms.

Causal Uncertainty

Even in extreme cases where meaning uncertainty was negligible—when, for instance, a story’s traffic was so exceptionally high as to be unanimously considered a success—interpretive ambiguity sometimes persisted. In these instances, it usually took the form of causal uncertainty: it was unclear why a specific story was performing the way it was on the dashboard. Many writers I interviewed recounted instances where they had been baffled by a story’s anomalously high or low traffic. Josh, a Times business reporter, told me about one such case that continued to mystify both him and the Web producer who showed him the traffic data: “There’s this one article I wrote [several months ago]. I wrote it on the subway. It took six minutes to write. It is about a [personnel change at a high-profile company]. It still gets tens of thousands of views a month.” While Josh was happy to have such unusually high sustained traffic to a months-old piece, he was unsure why this particular article was getting so much attention, when he had written so many other seemingly similar pieces (many of which he had put far more work into).

The lack of causal explanation in analytics tools could be maddeningly frustrating for users, like Times reporter Josh, who wanted to develop an understanding of what drove traffic to news content. Chartbeat’s tool tended to be strategically agnostic about causal factors, often opting instead for more mystical language—for example, “there’s magic happening here”—in the pop-up “tool tips” that accompanied exceptionally high-performing stories in the dashboard.

Yet even if Chartbeat had wanted to provide causal explanations as to why a particular story had performed “well” or “poorly,” it would have proven difficult to do so, because of the sheer number of factors—many of them entirely outside a newsroom’s control—that could conceivably affect traffic performance. As Eddie, a Gawker writer, put it, “Everything clicks for a different reason.” A/B testing tools (including one debuted by Chartbeat after the conclusion of my fieldwork) allow journalists to perform a trial run of multiple headlines for a single story simultaneously and identify the one attracting the most traffic. But such tests can be cumbersome to run for every story, are less reliable for sites with smaller audiences, and only allow newsrooms to isolate the effect of headlines and images. Other potentially significant factors influencing a story’s traffic, such as its subject matter and the “news mix” at the time of its publication, are difficult to measure. And as news distribution has become increasingly reliant on the inscrutable and frequently changing algorithms of large technology platforms, journalists contend with ever more of what sociologist Angèle Christin has called “radical uncertainty about the determinants of online popularity.”

Because there were no generalizable causal laws that could explain a story’s traffic performance, journalists at the Times and Gawker developed “folk theories” to help them fill the interpretive gap. A folk theory is a collectively shared, non-expert explanation of how something works. Journalists in both newsrooms, but especially Gawker, formulated such theories to try to explain why some stories were traffic “hits” while others were flops. When a story underperformed relative to expectations (which were often themselves a product of folk theorizing), Gawker editors and writers performed a sort of informal postmortem in an effort to determine what had caused the low traffic. The headline was often singled out as a likely culprit. As Felix put it, “bad traffic is almost always a bad headline”—and, inversely, the headline is “usually a big part” of high traffic.

Writers also considered some topics to be of inherently greater interest to audiences than others (though as we saw earlier, the question of how much greater was continually up for debate). In the instances where Felix didn’t think low traffic could be attributed to a bad headline, he felt it was usually because the story was “just lame”—that is, about a topic that readers did not find particularly compelling or worthy of their attention. Similarly, Eddie told me that “people click what they click. And people are gonna click on a story about sex or weed more than they’ll click on a story about, I don’t know, drones or 3-D printing, any day of the week.” Lisa, an editor for a different Gawker site, rattled off a list of topics that she felt could reliably produce high traffic: “People love unhinged letters. . . . Unhinged sorority girl! Unhinged bride! [Or], ‘Look at what this douchebag wrote me.’ . . . And people like cute things that kids did. People like heartwarming videos with interspecies friendships.”

Such lists were not based on any systematic review of traffic data but rather on a general and intuitive feel for traffic patterns that writers had gleaned in part from observing metrics over time and in part from long-held collective notions about what kinds of content interested news audiences (such as the age-old journalism adage, “if it bleeds, it leads”). According to Felix, while a select few writers at Gawker took a more methodical approach to devising causal models to explain traffic patterns—people who, as he put it, “spreadsheet their shit”—these writers were the exception. Rather, he explained, metrics were “just kinda something you absorb over time so you kind of have this sense for how data works, you know?” For all the comparisons to Moneyball we have become accustomed to seeing in discussions of data analytics, writers’ use of metrics arguably had more in common with the method of the traditional baseball scouts, who drew on their years of experience to make gut-level assessments of players, than it did with the data-driven sabermetrics method employed by Billy Beane and Paul DePodesta.

Folk theories are often conceptualized as a way people alleviate their anxiety about new, intimidating, or complex technologies by demystifying what is unfamiliar. Indeed, we’ve seen that Gawker journalists crafted folk theories to explain traffic patterns that might otherwise seem mysterious and inscrutable. Interestingly, however, at times Gawker writers and editors explicitly declined to folk-theorize about metrics, instead embracing the unknowability of the causal mechanisms that were driving traffic to their sites and stories. As Felix put it: “Sometimes you just gotta let it go… sometimes shit hits and you have no idea why.” Alison, a Gawker site lead, was extremely hard on herself when her site’s traffic was below where she believed it should be. Alison explained that it “feels horrible” when a story gets significantly worse traffic than she had predicted, a testament to the emotional power and influence of real-time newsroom analytics for the journalists who encounter them in the workplace every day. However, Alison also told me that, while she sometimes wondered if an underperforming post could have had a better headline, at times she also stopped herself from proceeding with that line of thinking: “Sometimes I’m just like, you know what? The internet wasn’t in the mood for this today. I think that’s totally a thing.”

Alison’s characterization of “the internet” as an autonomous, cohesive entity whose motivations and tastes were shifting and mysterious is an example of what media researcher Taina Bucher has called the algorithmic imaginary: “the way in which people imagine, perceive and experience algorithms and what these imaginations make possible.” Alison was often quite unforgiving toward herself about her site’s traffic. But occasionally setting limits on causal speculation, as she did here, functioned as a form of psychological self-care that helped her cope with the intense time and traffic pressures of the job. Telling herself that “the internet just wasn’t in the mood” for a particular post was Alison’s way of absolving herself from personal responsibility for low traffic, just as Felix did in moments when he decided to “just let it go.” To say “the internet wasn’t in the mood for this today” is, somewhat paradoxically, both a type of causal explanation and an acknowledgment that there can never be a causal explanation—or at least not a knowable, mechanistic one.


Thinking of “the internet” as a singular, omnipotent being with unpredictable mood swings, as Alison did, may seem bizarre or even silly at first blush. Yet Alison’s formulation wasn’t actually that far off. As noted above, digital news sites—especially those, like Gawker, that are free to access and do not have paying subscribers—rely on a handful of large technology platforms such as Facebook, Google, and Twitter to refer much of their traffic. These platforms, in turn, use proprietary and dynamic algorithms to make some content more visible to users than other content. Given the circumstances, it is understandable that journalists like Alison would become resigned to the lack of control that accompanies such a dramatic imbalance of power between platforms and publishers.

By eschewing causal speculation about how algorithms work, writers like Alison and Felix mirrored the rhetoric of big data proponents who have disdained the importance of causal explanation in the digital age. Yet there is a key difference. Big data enthusiasts dismiss the need for causal explanation on the grounds that if a data set is sufficiently large, then one can take strategic action based solely on correlation. For instance, when Walmart discovered that strawberry Pop-Tarts sold at seven times the usual rate in the days leading up to a hurricane, the chain began stocking extra Pop-Tarts and positioning them prominently in its stores whenever the forecast predicted one. Walmart executives needn’t know why shoppers craved that particular toaster pastry when faced with turbulent weather; the company could take goal-directed action based on the correlative finding alone.

This kind of action based on strong correlation is difficult to replicate in journalism for two reasons. First, as we have seen, there is no single goal in journalism but rather a multiplicity of aims that often seem incommensurable and, at times, mutually exclusive. Second, any patterns of correlation journalists may observe between a particular aspect of a story and traffic are subject to change based on shifts in platform algorithms that journalists are powerless to control or even fully grasp.

Thus, when journalists reject mechanistic explanations of traffic, they do so not because an understanding of causal mechanisms is unnecessary for taking action, but rather as a way to give themselves permission not to take incessant action. After all, if the internet simply isn’t in the mood for a certain story, the journalist is powerless to influence the traffic outcome: there is no sense in continuing to tweak the headline, or promoting the post on a new social platform, or switching out the main image to try to boost the numbers. In sum, the periodic refusal to engage in mechanistic causal speculation allowed Gawker journalists to take a break from the otherwise relentless hunt for higher traffic.

Such a respite was only temporary, however. Generally speaking, Gawker journalists did feel pressure to take editorial action in response to traffic data—but they were often unsure of precisely what kind of action to take.

 

Action Uncertainty

A key aspect of the Moneyball mindset is that analytics data is not merely interesting but also useful. News analytics companies frame their products in a way that emphasizes their usefulness. The “about” page on Chartbeat’s website described the company as a “content intelligence platform for publishers” that “empowers media companies to build loyal audiences with real-time and historical editorial analytics.” A company blog post announcing a new feature promised insights into subscribers that were “robust and actionable for publishers.” Parse.ly, one of Chartbeat’s main competitors, employed similar language on its website: “Parse.ly empowers companies to understand, own and improve digital audience engagement through data, so they can ensure the work they do makes the impact it deserves.”

For both companies, the usefulness of analytics was central to the sales pitch. And yet, in large part because of the meaning uncertainty and causal uncertainty that often accompany metrics, journalists were often unsure precisely what, if any, action to take in response to what they saw on the dashboard. If it is unclear whether a story is succeeding or failing (because it is unclear what the expectations are or should be for each story), then it is also unclear how to capitalize on traffic success or buffer against failure. Even in cases where a story’s success was clear but the reason for the success was not, there were still no easy answers about what one should do next. Josh, the Times reporter who was befuddled by continually high traffic to the story he had written in six minutes, explained that he didn’t know how to put the knowledge to use. Paraphrasing conversations he’d had with a Times Web producer who showed him metrics for his six-minute story, Josh said: “We talk about, ‘well, what do we do with it?’ And we haven’t ever figured it out. We don’t know what the value of that [information] is, you know? And we don’t know what we should do differently as a result.”

Josh’s sense of paralysis in the face of metrics was striking because it stood in such stark contrast to the prevailing cultural narrative about big data: that it not only is actionable but also will uncover hidden pockets of value to anyone smart enough to leverage it. Josh explained that even if he put aside thorny normative questions about how much metrics should guide his editorial agenda, practical challenges would remain. Without knowing precisely what it was about his six-minute story that was generating so much traffic, there was no surefire way to reproduce the success in his future work.

Even at Gawker, which had a strong reputation for taking editorial action based on traffic numbers (or, in the parlance of the field, producing “click-bait”), staffers were often unsure of which actions to take. For instance, while Eddie confidently asserted that posts about “sex or weed” would always outperform posts about “drones or 3-D printing,” he also chafed at the notion that there was a particular editorial strategy that would always boost traffic. Before coming to Gawker, Eddie had worked at another digital publication where, he told me, his editor’s bosses “would always tell him to bring them more viral hits. When we’d get [one of our articles picked up] on [the] Drudge [Report] or on Reddit that would go wild [i.e., drive a large amount of referral traffic to the site], they’d be like, ‘oh, do this more often!’ And we’d be like, ‘you guys don’t get it, this isn’t a switch, there’s not a formula.’”

Alison also had been frustrated by failed attempts to replicate a particular post’s surprisingly high traffic. She related an instance in which, after being shocked by the high traffic for a short post about the upcoming series finale of a popular television show, she had instructed the writer to produce a follow-up post immediately after the final episode aired. “I said, ‘I need you to cover this first thing in the morning. I don’t care what you write, but you need to cover it.’” But the second post hadn’t performed nearly as well as the first, for reasons that remained unclear to Alison. Felix struck a similar note in our interview, explaining that while he consulted metrics constantly, he wasn’t interpreting them “in any kind of rational way.” “I have this number,” he told me, “but I don’t really think about how I could get to whatever other number I want to get to.”

Alison, Eddie, and Felix had found through experience that the mystery of the hunt for traffic could be mitigated somewhat with vigilance and harder work, but it could not be eliminated. There was always an element of the game that resisted rationalization and systematization.

The Moneyball mindset was a preposterously poor fit for journalism. Metrics in journalism are characterized by interpretive ambiguity in a way that baseball statistics are not. Because journalism does not have a sole outcome that can be easily prioritized and optimized for, interpretive labor is required to make sense of analytics. In the absence of clear, profession-wide normative standards around handling metrics, journalists drew their own symbolic moral boundaries between acceptable and unacceptable uses of the data. These interpretive strategies are consequential because of the influence that metrics now wield in many contemporary news organizations, despite the fact that their meaning is so often unclear. In other words, the interpretive ambiguity of news metrics was not merely frustrating for journalists—it also had direct implications for their working conditions. An article whose traffic is interpreted as disappointing will be placed and promoted differently than an article that is seen as a traffic hit; a journalist who is seen as drawing reliably high traffic might have a different career trajectory (e.g., in terms of pay, promotion, and job security) than one who is seen as struggling to attract a sufficient audience. Given these high stakes, the question of who gets to make sense of metrics—and have their interpretations be considered authoritative in a news organization—is hotly contested.

 

This article is an excerpt from a chapter in Petre’s book All the News That’s Fit to Click: How Metrics Are Transforming the Work of Journalism. It has been edited to add context and fit CJR’s editorial style.


Caitlin Petre is an Assistant Professor in the Department of Journalism and Media Studies at Rutgers University. Her work examines the social processes behind digital datasets and algorithms.