BuzzFeed’s pro tennis investigation displays ethical dilemmas of data journalism

Illustration: Charis Tsevis. Art Direction: ruiz+company, Barcellona. Image via flickr.

BuzzFeed News in January turned its attention to the issue of fraud in professional tennis, publishing an investigation called “The Tennis Racket.” The piece featured an innovative use of statistical analysis to identify professional players who may have thrown matches. By analyzing win-loss records and betting odds at both the beginning and end of a match, BuzzFeed identified cases where there was an unusually large swing (e.g. greater than a 10 percent difference). Players who accumulated enough of these matches fell under suspicion.
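The heuristic described above can be sketched in a few lines. This is an illustrative reconstruction, not BuzzFeed’s actual analysis: the field names and the flat 10 percent threshold are assumptions, and the real methodology (published alongside the story) also used simulation to assess how unlikely each player’s record was.

```python
# Hypothetical sketch of the odds-swing heuristic, for illustration only.
# Field names and the threshold are assumptions; BuzzFeed's published
# methodology is more elaborate (including simulation-based testing).

def implied_prob(decimal_odds):
    """Convert decimal betting odds to an implied win probability."""
    return 1.0 / decimal_odds

def flag_suspicious(matches, threshold=0.10):
    """Count, per player, matches whose odds swung sharply against them."""
    counts = {}
    for m in matches:
        # Drop in the player's implied win probability between the
        # opening and closing of betting on the match.
        swing = implied_prob(m["opening_odds"]) - implied_prob(m["closing_odds"])
        if swing > threshold:
            counts[m["player"]] = counts.get(m["player"], 0) + 1
    return counts

matches = [
    {"player": "A", "opening_odds": 1.25, "closing_odds": 2.0},  # 0.80 -> 0.50
    {"player": "B", "opening_odds": 1.5, "closing_odds": 1.6},   # 0.67 -> 0.63
]
print(flag_suspicious(matches))  # {'A': 1}
```

A player flagged once proves nothing; the suspicion in the original analysis came from the accumulation of many such matches across a career.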

But BuzzFeed didn’t publish the names of suspicious players, either in the article or in the anonymized data and code releases that accompanied it. In the article, BuzzFeed states that the statistical evidence presented is not definitive proof of match fixing. That didn’t stop others from quickly de-anonymizing the players pinpointed by the statistical analysis: a group of undergraduate students from Stanford University was able to infer and make public the names BuzzFeed had kept hidden.

Did BuzzFeed News make it too easy to de-anonymize the players’ identities? In other words, were they too transparent? Not necessarily, but the incident does raise interesting questions about where to draw the line in enabling reproducibility of journalistic investigations, especially those that generate statistical indictments of individuals. As newsrooms adopt statistical and algorithmic techniques, new questions of media accountability and ethics are emerging.

The news industry is rapidly adopting algorithmic approaches to production: automatically monitoring, alerting, curating, disseminating, predicting, and even writing news. This year alone The Washington Post began experimenting with automation and artificial intelligence in producing its Olympics and elections coverage, The New York Times published an anxiety-provoking real-time prediction of the 2016 presidential election results, the Associated Press is designing machine learning that can translate print stories for broadcast, researchers in Sweden demonstrated that statistical techniques can be harnessed to draw journalists’ attention to potentially newsworthy patterns in data, and Reuters is developing techniques to automatically identify event witnesses from social media.


While such technologies enable an ostensibly objective and factual approach to editorial decision-making, they also harbor biases that shape how information is included, excluded, highlighted, or made salient to users. At least some of the pushback against Facebook in the wake of Trump’s victory has been about the role of News Feed in algorithmically boosting fake news on the platform. Such concerns are only exacerbated by the lack of transparency of many automated decision-making technologies.

Several prominent ethics codes such as those of SPJ, RTDNA, and NPR now emphasize transparency as a guiding norm. Transparency is not a silver bullet for media ethics, but with so much machinery now being used in the journalistic sausage making, transparency is a pragmatic approach that facilitates the evaluation of the interpretations (algorithmic or otherwise) that underlie newswork. 


One struggle news organizations face is finding a clear path for how, and how much, to be transparent about the algorithms they use. To that end, I worked with the Tow Center for Digital Journalism to convene a workshop on the topic of Algorithmic Transparency in the News Media. The goal was to bring together industry and academia to work through what information might be disclosed about algorithms in use in the news media. (The full academic write-up of those results is now available online.)

The results of the study emphasize that algorithms are not socially inert; the deep entanglements of people within algorithmic systems are essential context in an algorithmic transparency disclosure. With that in mind, we identified four layers at which information disclosure could be beneficial:

  • the data, especially if it’s pumped through a machine-learning process;
  • the model used to process that data;
  • the inferences, including errors and uncertainty around outcomes like classifications or predictions;
  • and the interface, how transparency disclosures are integrated into the user experience.

In “The Tennis Racket,” BuzzFeed decided to provide varying levels of transparency that would appeal to different levels of reader expertise. Linked from the original story is an article that can only be described as “BuzzFeed-y”—comical animated .gifs, crystallized sub-heads, and short accessible paragraphs. From there, the curious reader can click deeper on the “detailed methodology”—a Github page that describes the data acquisition and preparation as well as other calculations. This detailed methodology then links to the final and most detailed disclosure: the annotated source code used to run the analysis.

Each level of disclosure adds additional nuance, so different stakeholders can access the granularity of information most relevant to their interests. A dedicated journalist or a professional tennis investigator might use the source code to reproduce the analysis or to try it out with a different set of data. Someone who is merely curious about where the data came from for the project could look at the methodology without needing to understand the source code. This pyramid structure mitigates one of the primary concerns around implementing algorithmic transparency: that it is too difficult for readers to make sense of transparency disclosures around algorithms.


But the flip side of transparency is that, in BuzzFeed’s case, providing the source code and a sufficiently detailed methodology allowed students to de-anonymize the results relatively quickly and easily. The students re-scraped the data from the online source with identities preserved (though there was some uncertainty in identifying the exact sample used in the original story), and then cross-referenced it with the anonymized BuzzFeed data based on the other data fields available. This allowed them to associate a name with each of the 15 players identified in the original analysis.
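The cross-referencing step is a simple form of record linkage. A minimal sketch, with entirely hypothetical field names and values: any combination of non-identifying columns shared by both datasets can act as a fingerprint that links an anonymized row back to a named one.

```python
# Illustrative sketch of record-linkage de-anonymization; the data,
# field names, and values here are invented, not the students' actual code.

anonymized = [
    {"anon_id": "player_07", "n_matches": 142, "n_flagged": 11},
]
rescraped = [
    {"name": "J. Doe", "n_matches": 142, "n_flagged": 11},
    {"name": "R. Roe", "n_matches": 98, "n_flagged": 4},
]

def link_records(anon_rows, named_rows, keys=("n_matches", "n_flagged")):
    """Match each anonymized row to a named row sharing the same quasi-identifiers."""
    linked = {}
    for a in anon_rows:
        fingerprint = tuple(a[k] for k in keys)
        candidates = [r["name"] for r in named_rows
                      if tuple(r[k] for k in keys) == fingerprint]
        if len(candidates) == 1:  # unique fingerprint -> identity recovered
            linked[a["anon_id"]] = candidates[0]
    return linked

print(link_records(anonymized, rescraped))  # {'player_07': 'J. Doe'}
```

The lesson generalizes: stripping names from a dataset offers little protection when the remaining fields, taken together, are unique enough to single individuals out.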

At first glance, this re-appropriation of transparency information seems to undermine the journalists’ decision not to name names when the statistical evidence was not airtight. By stopping short of publishing names, BuzzFeed News avoided the risk of potential libel lawsuits while providing enough detail for professional investigators to follow in its footsteps. Setting aside the possibility of actual harm from the published information, and assuming legal liability was its chief concern, we can also see this case in a more constructive light. Heidi Blake, the first author of the main article, works as the UK investigations editor for BuzzFeed News and is based in London. Since British law makes it considerably easier to sue for libel than US law does, one might speculate that BuzzFeed News judged the statistical evidence not strong enough to stave off libel lawsuits in that jurisdiction. Because BuzzFeed put the methodology out there so that others could replicate the work with non-anonymized data, a less legally exposed outlet (students blogging about it) was able to publish the names. In other words, transparency can be read as a mechanism for letting others assume the legal liability, perhaps in less risky or more sympathetic jurisdictions.

Not all algorithms need to attain the same level of transparency; there’s a spectrum of information that can be disclosed, and competing interests may not always leave transparency as the top priority. For many in the industry building computational products, there are still concerns over the proprietary nature of algorithmic media production, the vulnerability of systems to manipulation and gaming, and the costs of producing transparency information. There is still much work to do in studying algorithmic transparency, but it will be a worthwhile investment. We need a more accountable media system in which these black boxes are rendered more explainable and trustworthy.


Nicholas Diakopoulos is an assistant professor at the University of Maryland and a fellow at the Tow Center for Digital Journalism.