A recent paper from OpenAI researchers sheds new light on why large language models (LLMs) are prone to “hallucination,” or fabricating information. According to the paper, the evaluation methods major AI companies use encourage overconfidence. Performance tests often take the form of multiple-choice questions with explicit correct answers, a format that unintentionally rewards models for guessing rather than declining to answer when they aren’t certain. By optimizing their systems to score highly on these evaluations, AI companies end up training their models to be good test-takers rather than actually improving their accuracy.
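A toy calculation, in the spirit of the paper’s argument rather than drawn from it, shows why: under binary grading, a model that guesses on a question it is only 25 percent sure about still earns a positive expected score, while answering “I don’t know” earns nothing. Only a scheme that docks points for confident errors makes abstention the better choice.

```python
# Toy illustration (not from the OpenAI paper): expected score on a
# four-option multiple-choice question the model is only 25% sure about.

def expected_score(p_correct, reward_right=1.0, penalty_wrong=0.0):
    """Expected points from answering, given the chance of being right."""
    return p_correct * reward_right - (1 - p_correct) * penalty_wrong

p = 0.25  # probability that a blind guess is correct

# Binary grading, as on many current benchmarks: guessing beats abstaining.
print(expected_score(p))                     # 0.25, versus 0.0 for "I don't know"

# Grading that penalizes wrong answers: abstaining now wins.
print(expected_score(p, penalty_wrong=1.0))  # -0.5, versus 0.0 for "I don't know"
```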
There’s a growing recognition among researchers that popular benchmark tests used to evaluate how well models perform at skills such as general reasoning, math, or coding often fail to capture the models’ real-world capabilities. Companies are under competitive pressure to demonstrate constant progress, so they optimize their models to perform well and rank high on benchmark leaderboards. A study of Chatbot Arena, a widely used benchmark platform, found that major companies like OpenAI, Meta, and Google test many variants of their models privately and publish scores only for the best-performing versions, withholding the poor results. The study authors argue that this practice misrepresents the actual capabilities of the models.
“Most people who use AI for science seem content to allow the developers of AI tools to evaluate their usefulness using their own criteria,” AI researcher Nick McGreivy told Nature last month. “That’s like letting pharmaceutical companies decide whether their drug should go to market.”
To address this problem, researchers are pushing for a fundamental rethinking of how LLMs are evaluated. Instead of broad benchmarks that claim to measure “general intelligence” and are administered by the same companies that build the models, teams at Stanford and DeepMind advocate smaller, task-based evaluations grounded in social-science methods. They recommend prioritizing adaptability, transparency, and practicality, and propose focusing on the “highest-risk deployment contexts,” such as applications in medicine, law, education, and finance.
Experts in some of those fields, like medicine and law, have already begun building domain-specific benchmarks, and there are some efforts to do the same for journalism. Most AI tools aren’t designed with journalists or news audiences in mind, and benchmarks used by AI companies rarely measure what matters in the newsroom. As a result, reporters, editors, and fact-checkers lack visibility into whether the ever-evolving models are suited to their needs, or how their outputs stack up against journalistic values like accuracy, transparency, accountability, and objectivity. As Charlotte Li, a computational-journalism PhD student at Northwestern University, puts it, “Do any of these scores tell us which models we should use for journalism and when?”
Recognizing this knowledge gap, the Generative AI in the Newsroom project, led by Nicholas Diakopoulos, the director of the Computational Journalism Lab at Northwestern University, is pushing for the development of benchmarks tailored to journalism. This summer, the team convened a workshop with twenty-three journalists to imagine what a “news benchmark” might look like. Participants identified six core use cases: information extraction, semantic search, summarization, content transformation, background research, and fact-checking. They also found that generalizing newsroom tasks into a benchmark poses significant challenges, given the wide variation in editorial contexts, and that building open datasets raises questions about confidentiality and resources. Still, because robust evaluation is highly resource-intensive, it is critical that the industry come together to share infrastructure, develop standards, and test these tools independently.
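As a rough sketch of where such an effort could start, and not the workshop’s actual design, a shared benchmark item might pair each of those six use cases with source material, a task instruction, and an editor-approved reference answer, along with a flag for material too sensitive to publish openly.

```python
# Hypothetical sketch, not the workshop's design: one way to structure items
# in a shared news benchmark around the six use cases participants identified.

from dataclasses import dataclass

USE_CASES = {
    "information_extraction", "semantic_search", "summarization",
    "content_transformation", "background_research", "fact_checking",
}

@dataclass
class NewsBenchmarkItem:
    use_case: str              # one of USE_CASES
    source_material: str       # the article, transcript, or document in question
    instruction: str           # the newsroom task, e.g. "list every named official"
    reference_answer: str      # what an editor would accept as correct
    shareable: bool = True     # False for items with confidentiality concerns
    editorial_notes: str = ""  # context that affects how answers should be judged

    def __post_init__(self) -> None:
        if self.use_case not in USE_CASES:
            raise ValueError(f"Unknown use case: {self.use_case}")
```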
Individual newsrooms should also try to evaluate AI tools directly on the tasks they care about most, designing “fail tests” that reflect their editorial priorities. Generalized benchmarks, after all, do not always predict how well a chatbot will perform in a specific scenario. When OpenAI released GPT-5, for example, it emphasized the model’s improved visual reasoning abilities. In our tests, the model did show a slight improvement at identifying the provenance of photos: determining when, where, and by whom they were taken. But Bellingcat’s experiments found that for geolocation tasks, GPT-5, even in its “Thinking” and “Pro” modes, was a significant downgrade from OpenAI’s o4-mini-high. Provenance and geolocation both draw on the same broad skill of visual reasoning, yet the two tests produced divergent results, showing how even small differences in how a task is framed can lead to very different outcomes.
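A newsroom fail test can be very small. The sketch below is a hypothetical harness, not an existing tool: ask_model stands in for whatever model or API the newsroom actually uses, and the cases are illustrative. It runs a handful of in-house prompts with known acceptable answers and reports a pass rate that can be re-run whenever a new model ships.

```python
# Minimal sketch of a newsroom "fail test." `ask_model` is a stand-in for
# whatever chatbot or API the newsroom uses; the cases are illustrative only.

CASES = [
    # (prompt, substrings an acceptable answer must contain)
    ("When and where was the attached photo most likely taken?", ["2023", "Kyiv"]),
    ("Summarize this council transcript in three sentences.", ["budget"]),
]

def ask_model(prompt: str) -> str:
    # Replace this stub with a real call to the model under evaluation.
    return "stub answer"

def run_fail_test(cases=CASES) -> float:
    """Return the fraction of cases whose answers contain every required term."""
    passed = 0
    for prompt, required in cases:
        answer = ask_model(prompt).lower()
        if all(term.lower() in answer for term in required):
            passed += 1
    return passed / len(cases)

if __name__ == "__main__":
    print(f"Pass rate: {run_fail_test():.0%}")  # re-run whenever a new model ships
```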
There is also a need for industry-wide scrutiny of how chatbots present news content to users. A recent Muck Rack study found that 27 percent of all links cited by major models were journalistic, and in time-sensitive queries nearly half the citations pointed to news publishers. Despite the frequency with which journalistic content is surfaced to LLM users, little is known about whether chatbots accurately represent reporting, cite sources correctly, or provide sufficient context when repackaging articles. When the BBC conducted its own tests earlier this year, it found that AI tools often distorted the content of articles.
Third-party evaluation of AI models is not just a technical matter; it’s a matter of accountability. Without independent, transparent assessment, news organizations risk adopting tools whose strengths and weaknesses remain obscured by corporate claims. Establishing clear standards could help to ensure that journalistic uses of AI are more responsible and trustworthy.