Researchers routinely rely on websites like data.census.gov, but this month, the top of the site displays a banner that reads: "Due to the lapse of federal funding, this website is not being updated." Dozens of reports, from critical spending data to disease surveillance, have either stopped updating or gone completely dark due to the ongoing government shutdown. Earlier this year, President Trump and Elon Musk's funding cuts also led to the removal of publicly available datasets. Some of the datasets were restored after legal challenges, but others were not, leaving some researchers wondering what to do. For some, the answer lies in turning to news.
At the Tow Center, we regularly rely on news data to understand how "pink slime" networks function or how local news ecosystems evolve over time. Last week, at the New(s) Knowledge Symposium, a gathering of researchers, journalists, and technologists hosted by the Media Ecosystems Analysis Group, I found that many other researchers have been monitoring the news for data, often in unexpected ways.
Maia Majumder, a computational epidemiologist at Boston Children's Hospital, said that aggregated local news reports were critical for supplementing and fact-checking official sources during disease outbreaks. Because news is hyperlocal, Majumder told me in an interview, it is often more useful for examining vaccination rates or the spread of disease in a community than official sources, which typically report only at the state or national level. And if official data goes offline, news becomes even more critical. "News media help fill that gap by acting as our eyes and ears in the community," Majumder said. "This is a role the media plays even when the government isn't shut down, because an open government doesn't necessarily mean a transparent one." She previously worked at HealthMap, a platform that relies partly on news to track disease outbreaks.
Bia Carneiro, a research team leader at the Alliance of Bioversity International and CIAT, gave a talk at the conference on using news to monitor food scarcity and famine in the Global South. She described the work as "nowcasting," or monitoring for early warning signals of various events. In a 2024 paper, Carneiro, along with other researchers, searched for signs of food insecurity using news as a complement to other data, like social media and Google Trends. She now applies that work to a famine early warning network called FEWS NET and uses it in other research on migration and climate.
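Carneiro's actual pipeline is more involved than anything shown here, but the basic shape of a news-based nowcasting signal is simple: count, week by week, how often a watch list of terms shows up in coverage. The sketch below illustrates that idea; the keywords, dates, and input format are illustrative assumptions, not her method.

```python
# Minimal sketch of a "nowcasting" signal built from news headlines.
# Keywords, dates, and the input format are illustrative assumptions.
from collections import Counter
from datetime import date

# Each record: (publication date, headline) from some news archive.
articles = [
    (date(2024, 3, 4), "Drought deepens as maize prices climb"),
    (date(2024, 3, 6), "Local market reports food shortage in region"),
    (date(2024, 3, 14), "New school opens downtown"),
]

FOOD_INSECURITY_TERMS = {"drought", "famine", "shortage", "crop failure", "hunger"}

def weekly_signal(records):
    """Count articles per ISO week that mention any watch-list term."""
    counts = Counter()
    for published, headline in records:
        text = headline.lower()
        if any(term in text for term in FOOD_INSECURITY_TERMS):
            year, week, _ = published.isocalendar()
            counts[(year, week)] += 1
    return dict(sorted(counts.items()))

print(weekly_signal(articles))  # {(2024, 10): 2}
```

A spike in that weekly count, compared against a baseline, is the kind of early warning signal that can then be checked against social media, satellite, or survey data.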
When I spoke to Carneiro after the conference, she said news data provides a helpful signal. "Social media is really noisy. With news, it's more topical, and it's a bit cleaner," Carneiro said. "It's easier to pull out the relevant information that we want." In particular, local news from various countries gave her an advantage over larger, more Eurocentric publications.
This type of work depends on news aggregators like Media Cloud, an open-source platform maintained by the Media Ecosystems Analysis Group. Media Cloud crawls thousands of news websites, identifies whether they have RSS feeds, and stores links to the sites' articles, allowing researchers like Majumder and Carneiro to query the data and find trends.
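At its core, that kind of collection looks something like the sketch below, which pulls article links and metadata from a list of RSS feeds using the open-source feedparser library. It is a simplified illustration of the workflow, not Media Cloud's actual code, and the feed URLs are placeholders.

```python
# Simplified sketch of RSS-based collection of the kind aggregators
# like Media Cloud perform at scale; not Media Cloud's actual code.
# Requires: pip install feedparser
import feedparser

# Placeholder feed URLs; real crawlers discover these from site HTML.
FEED_URLS = [
    "https://example-local-paper.com/rss",
    "https://example-regional-news.org/feed",
]

def collect_article_links(feed_urls):
    """Fetch each feed and store basic metadata for every article link."""
    stories = []
    for url in feed_urls:
        feed = feedparser.parse(url)
        if feed.bozo and not feed.entries:
            continue  # skip sites with no usable feed
        for entry in feed.entries:
            stories.append({
                "source": url,
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
            })
    return stories

if __name__ == "__main__":
    for story in collect_article_links(FEED_URLS):
        print(story["published"], story["title"], story["link"])
```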
Processing news articles into data isn't simple. Some publishers, local news outlets in particular, don't provide RSS feeds. At Tow, we're developing a tool that we are calling "Scraper Factories." It's a Python package that supplements platforms like Media Cloud and leverages large language models to quickly write web scrapers. We plan to present it this December at the Computation + Journalism Symposium in Miami and hope it can help researchers in cases where a website may not have an RSS feed.
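The package hasn't been released yet, so the sketch below only illustrates the general approach of LLM-assisted scraper generation: show a model a page's HTML, ask it for a CSS selector that finds article links, and apply that selector. The prompt, the ask_llm() stub, and the function names are hypothetical placeholders, not the Tow Center's actual interface.

```python
# Hypothetical sketch of LLM-assisted scraping for a site with no RSS
# feed. The prompt, ask_llm() stub, and selector flow are assumptions
# for illustration, not the Tow Center's actual package.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM API; should return a CSS selector."""
    raise NotImplementedError("wire up your preferred model here")

def build_selector(homepage_html: str) -> str:
    # Ask the model to read a sample of the page and propose a selector
    # that matches links to individual articles.
    prompt = (
        "Given this news homepage HTML, return only a CSS selector that "
        "matches links to individual articles:\n\n" + homepage_html[:5000]
    )
    return ask_llm(prompt).strip()

def scrape_article_links(homepage_url: str) -> list[str]:
    html = requests.get(homepage_url, timeout=30).text
    selector = build_selector(html)
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select(selector) if a.has_attr("href")]
```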
There are also commercial players in this space, like NewsCatcher and NewsData.io, that aggregate news feeds and charge for access to the data.
It's getting increasingly hard to access data from news sites. Publishers are locking down their content, largely to prevent bigger AI companies from using their data for training models. The Tow Center has been studying how publishers fight to monetize their content by engaging with large tech companies, whether through partnerships or lawsuits. But one unintended consequence of that conflict may be to further restrict the universe of data that researchers like Majumder and Carneiro (and research centers like Tow) have access to.
"Some worry that academic research, security scans and other types of benign web crawling will get elbowed out of websites as barriers are built around more sites," the Wall Street Journal said in an article earlier this year.
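One concrete form those barriers take is a site's robots.txt file, the plain-text rules that tell crawlers what they may fetch; rules written broadly enough to block AI crawlers can shut out research crawlers too. The snippet below, using Python's standard urllib.robotparser, shows how a crawler checks whether it is allowed in; the site URL and user-agent names are illustrative.

```python
# Check whether a site's robots.txt permits different crawlers.
# The URL and user-agent strings here are illustrative only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example-news-site.com/robots.txt")
rp.read()

for agent in ["ResearchBot", "GPTBot", "*"]:
    allowed = rp.can_fetch(agent, "https://example-news-site.com/2025/some-article")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```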
News data has limitations for researchers and isn't a substitute for robust government datasets. But in the face of increasing threats to publicly available information, Carneiro said, combining it with other reliable sources remains essential. She stressed that, although she had only recently learned of the threats to American data, she believes researchers in this field can help by increasing transparency and collaboration. "Once we start realizing that people are losing access to key datasets, people will start reevaluating how they treat our data and how they make it available to other researchers," she said.