Tow Center

Toward AI-Powered Source Audits

Building a quote-detection tool for accountability in journalism.


This is our second blog post about Unheard, a new project designed to help news organizations uncover potentially overlooked narratives by using AI to audit who is quoted in their articles.

Thoughtful sourcing is the backbone of every good story, the raw material from which narratives are built. Decisions about whom to quote are delicate: through them, the journalist shapes not only the perceived balance of a story but also its framing, its stance, the broader image of society it constructs, and the audiences it serves.

Without thoughtful sourcing, news organizations risk overlooking important narratives and injecting bias into their reporting. That’s why we’re building an AI-powered tool to conduct source audits on large corpora of news articles. The goal of the project is to equip newsrooms with data that helps them identify when coverage leans too heavily on a single type of source. Thanks to recent breakthroughs in generative AI, highly accurate quote-detection technology is now within reach.

In an experiment, we tested seven different generative AI models on their ability to extract quotes from articles published by the New York Times and the Associated Press. We found that the models performed surprisingly well, with some achieving accuracy scores in the mid-to-high nineties.

We are working to develop an AI tool that comes as close as possible to perfectly extracting quotes and identifying their associated speakers. While these results are promising, they rely on commercial models; we hope to build our tool by fine-tuning open-source models. In the hands of newsrooms, such a tool can help journalists reflect on whom they choose to include in their stories, a practice that, as journalist Bette Dam argues in her PhD research, is urgently needed.

In our first newsletter, Dam—who lived in and reported from Kabul for fifteen years—outlined her concerns that the voices shaping global conflict reporting are often narrow and predictable. While covering the so-called “war on terror,” Dam observed that government officials, particularly from the Pentagon and related security agencies, were privileged by journalists. In her PhD research, she found that countervailing sources were often absent. Among Afghan voices, the allied government elite dominated. Taken together, in both the New York Times and the AP, nearly 70 percent of sources came from elite (mostly government) circles. Civilian perspectives made up only a small share, usually limited to witnesses of the violence being reported.

In the first weeks of her PhD research, Dam had already found older examples of this predominance of elite-official sourcing. For example, political scientist W. Lance Bennett found that in almost four years of the New York Times’ coverage of the CIA’s covert operations in Nicaragua in the mid-1980s, there were almost nine hundred “voiced opinions in the news.” More than six hundred of these came from “officers, offices, or committees of US governmental institutions.” About 15 percent came from nongovernmental sources, and reporters referenced Nicaraguan sovereignty or concern for another Vietnam-like situation only three times. (The sample consisted of all news articles and editorials indexed under “Nicaragua” in the New York Times Index between January 1, 1983, and October 15, 1986.)

Long before the current war in Gaza, which human rights organizations like Amnesty International have called a genocide, Dam learned from other academic research that US newspapers have, over the past few decades, often prioritized official Israeli voices while limiting Palestinian perspectives or framing them negatively. This was shown quite recently by scholar Holly Jackson, who analyzed stories in the Times. During both Intifadas, Israeli voices dominated: in the First Intifada (1987–93), Israelis appeared in 93 percent of articles, compared with Palestinians in only 40 percent (sample: sixteen thousand articles); in the Second Intifada (2000–2005), the share of articles featuring Palestinian voices rose to just 49 percent (sample: seventeen thousand articles). American officials also featured prominently, reinforcing a shared Israeli-US narrative as the dominant frame.

Over the past few months, we’ve taken the first steps toward building Unheard, a collaborative “post-public-editor” system that uses AI-powered source audits to help journalists ask critical questions about their sourcing patterns and ensure more accurate narratives. But AI-powered tools are notoriously unreliable: they “hallucinate,” fabricating information out of thin air. That’s why we’ve built computational tools to test the accuracy of our systems as we work on improving them.

As part of her PhD research, Dam and her team manually tagged every quote in a sample of fifteen hundred Times and AP articles about the war in Afghanistan. Building on Dam’s field experience, the team categorized sources as civilian or military, distinguishing between those speaking from within Afghanistan and those commenting from the United States. They also distinguished spokespersons from high-level officials, revealing the dominance of “über-officialdom” and a structural reliance on propaganda-like sourcing. In the Afghan context, the government was by far the most frequently cited source, again with a strong reliance on top-level officials. By contrast, Afghan and American civilians (noninstitutional or “civil” sources) were far less prominent in the coverage.

This is what researchers in the machine-learning world call a “golden set”: a painstakingly hand-annotated and (ideally) 100 percent accurate dataset against which to test a machine’s ability to complete a task. We’re currently using this golden set as a benchmark to evaluate the performance of our quote-extraction tool, with the goal of eventually achieving a nearly perfect match.
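To make the idea concrete, a single golden-set entry can be thought of as structured data along the following lines. This Python sketch is purely illustrative: the field names and example values are hypothetical, and the actual schema of Dam’s annotations is not reproduced in this post.

from dataclasses import dataclass, field

@dataclass
class AnnotatedQuote:
    # Hypothetical fields; the real annotation schema is not published here.
    speaker: str          # who is quoted, e.g. a named official or institution
    text: str             # the quoted or paraphrased passage as it appears in the article
    is_paraphrased: bool  # direct quotation versus paraphrase

@dataclass
class GoldenArticle:
    article_id: str
    outlet: str                                   # e.g. "NYT" or "AP"
    quotes: list[AnnotatedQuote] = field(default_factory=list)

# An invented example of one hand-labeled article in the golden set.
example = GoldenArticle(
    article_id="example-001",
    outlet="AP",
    quotes=[
        AnnotatedQuote(
            speaker="US military spokesperson",
            text='"Operations are proceeding as planned," the spokesperson said.',
            is_paraphrased=False,
        ),
    ],
)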

In a preliminary test published earlier this year, Unheard evaluated how well ChatGPT could identify quotes that had been manually labeled by Dam as part of her PhD research. We randomly selected twenty articles from the golden set, all related to the US war in Afghanistan. After incrementally improving various prompts to extract both the quotes and their associated speakers, ChatGPT correctly identified 230 of the 251 quotes Dam had labeled in our sample. The results were promising, but we wanted to extend the experiment to other models using a standardized computational pipeline that can automatically score the accuracy of the quote extraction.
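As a rough illustration of what such an extraction call can look like, the sketch below sends an article to a chat-completion API and parses the returned JSON. The prompt wording, model name, and settings are placeholders rather than the exact ones used in our experiment, and the code assumes the OpenAI Python SDK with an API key configured in the environment.

import json
from openai import OpenAI

client = OpenAI()  # reads the API key from the environment

# Placeholder prompt; the prompts used in the actual experiment went through many iterations.
EXTRACTION_PROMPT = (
    "You are given a news article. Extract every quote, direct or paraphrased, "
    "along with the person or institution it is attributed to. Respond only with a JSON "
    'array of objects with the keys "source", "quote", and "is_paraphrased".'
)

def extract_quotes(article_text: str, model: str = "gpt-4o") -> list[dict]:
    """Ask the model for quotes and speakers and return them as parsed JSON (illustrative only)."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keeps output as repeatable as possible for evaluation
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": article_text},
        ],
    )
    # In practice the output needs validation: models sometimes wrap JSON in extra text.
    return json.loads(response.choices[0].message.content)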

We have now completed building this pipeline, which brings us to our second experiment. Using an updated prompt, we tested seven models (Claude Opus 4.1, Claude Sonnet 4.0, Gemini 1.5 Pro, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o, and GPT-5 mini) on their ability to extract quotes from our sample of twenty articles. To evaluate the models, we measured the share of actual quotes each model identified (known as “recall”) and the share of the quotes it identified that were correct (known as “precision”). We found that most of the models were highly accurate, with GPT-5 mini and Gemini 2.5 Pro achieving F1 scores (the harmonic mean of precision and recall) of 94 percent.
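For readers curious about the arithmetic, here is a minimal sketch of how precision, recall, and F1 can be computed for one article. It assumes a simple normalized exact-text match between extracted and golden quotes; the matching rules in our actual pipeline may be more forgiving.

def normalize(text: str) -> str:
    # Crude normalization so punctuation and casing differences don't block a match
    # (an assumption, not necessarily the rule used in the real pipeline).
    return " ".join(text.lower().replace('"', "").replace("\u201c", "").replace("\u201d", "").split())

def score(golden_quotes: list[str], extracted_quotes: list[str]) -> dict[str, float]:
    golden = {normalize(q) for q in golden_quotes}
    extracted = {normalize(q) for q in extracted_quotes}
    true_positives = len(golden & extracted)

    recall = true_positives / len(golden) if golden else 0.0            # share of actual quotes found
    precision = true_positives / len(extracted) if extracted else 0.0   # share of extractions that are correct
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)              # harmonic mean of the two
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: 4 of 5 golden quotes found plus 1 spurious extraction
# gives recall 0.8, precision 0.8, and an F1 score of 0.8.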

We went through a series of prompt iterations to achieve our results. What proved to be most effective was a prompting technique known as few-shot prompting, in which example inputs and outputs are provided to guide the language model’s responses.

For instance, in the prompt segment below, we included illustrative cases to demonstrate the difference between a direct quote and a paraphrased quote, so the model learns how to distinguish between the two.

Here are some more specific instructions with accompanying examples, in the format T: (input text) followed by R: (expected response):

In this context, a quote is defined as both a direct and paraphrased instance of a source conveying some information. A direct quote is one that is lifted verbatim from its source, surrounded by double quotation marks. A paraphrased quote is not lifted entirely from the source, but is information synthesized from the source. A quote is still considered paraphrased if it includes fragments of a direct quote within it.


T: "I really like this weather," said Adam B. He later added that he felt the wind was "pleasant".

R: [
    {
        "source": "Adam B",
        "quote": "\"I really like this weather,\" said Adam B.",
        "is_paraphrased": false
    },
    {
        "source": "Adam B",
        "quote": "He later added that he felt the wind was \"pleasant\".",
        "is_paraphrased": true
    }
]
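In our prompts, the worked example is embedded directly in the instruction text, as shown above. Another common way to structure few-shot prompting, sketched here purely for illustration, is to pass each example input and output as a prior exchange in the chat history, so the model sees a completed demonstration before receiving the real article:

few_shot_messages = [
    {"role": "system", "content": (
        "Extract every direct and paraphrased quote as a JSON array of objects "
        'with the keys "source", "quote", and "is_paraphrased".'
    )},
    # The worked example from the prompt segment above, supplied as a prior exchange.
    {"role": "user", "content": (
        '"I really like this weather," said Adam B. '
        'He later added that he felt the wind was "pleasant".'
    )},
    {"role": "assistant", "content": (
        '[{"source": "Adam B", "quote": "\\"I really like this weather,\\" said Adam B.", "is_paraphrased": false}, '
        '{"source": "Adam B", "quote": "He later added that he felt the wind was \\"pleasant\\".", "is_paraphrased": true}]'
    )},
    # The real article to analyze would follow here as the final user message.
]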

 

Our initial tests are promising, especially when compared with previous attempts at quote extraction. Most of the models performed well, with some achieving recall or precision scores in the mid-to-high nineties. By comparison, the BBC open-sourced its quote-detection system, Citron, in 2023, but it reached an overall success rate (F1 score) of only about 60 percent.

However, since Citron was applied to a different dataset, we can’t draw definitive conclusions about which models perform best. To make that comparison, we plan to test Citron and our models on the same dataset—a task for our next blog post.

Once we reach a performance level we’re satisfied with, we plan to run the model on Professor Dam’s full golden set of fifteen hundred articles. After that, we aim to fine-tune a model capable of extracting quotes from the Times and AP articles about Afghanistan that fall outside the scope of Dam’s original dataset.

Our goal is to bridge the gap between technology and journalistic ethics, and to ensure that the voices shaping our understanding of the world are as diverse and accurate as the realities they aim to represent.

 

This work is being done with assistance and funding from the Pulitzer Center.

