Scraper Factories

It can feel more difficult than ever for journalists to find data: FOIA requests go unanswered, government dashboards are taken down, and public databases get quietly archived. In some cases, no datasets exist at all. Increasingly, reporters are left to collect and create their own data.

Systematically collecting data from across the web, a process known as scraping, is a common solution. Writing a scraper or two for a story is (usually) a fairly straightforward task for a data journalist who knows a bit of code. But writing dozens or even hundreds of scrapers to collect data from multiple websites in myriad formats can be prohibitively time-consuming.

At the Tow Center, we have been able to use AI to generate a first-draft fleet of Python scrapers in just minutes. Now we’re sharing that code for you to use, too.

Scraper Factory is an AI-powered code generator for creating scrapers. Users enter a list of URLs, and Scraper Factory generates Python code to collect content like press releases, school board meeting minutes, or news articles. The result is a checkable artifact in Python, which we review and fix for data and code accuracy before deploying. That code runs on a regular schedule without AI assistance and notifies us if there are errors, so that data is collected consistently over time. In our tests, more than three in four scripts that we generated using AI worked successfully without any human intervention.

This code generator has already enabled us to do more. The recent renewed focus on campus investigations has created an unprecedented moment to study how institutions communicate under federal pressure. To capture this, we built a University Communications Database to track public announcements from more than 100 universities—many of them currently under federal investigation.

This project presented exactly the kind of messy, real-world challenge that newsrooms frequently need to handle. During the fall 2025 semester, a group of students from the Data Journalism MS program at Columbia Journalism School gathered to use Scraper Factory to create scrapers, test them, and check the resulting Python code for inaccuracies. Those scrapers continue to run today, collecting new communications from universities daily for a single database, which we will publish soon.

The scripts that Scraper Factory outputs are designed to be modular and reusable. Each contains the same functions, so users can run them at their own cadence. And unlike other AI-powered scraping solutions, running the scrapers requires no continued reliance on LLMs; creation is the only time AI is needed.
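To make the idea of "the same functions in every script" concrete, here is a minimal sketch of what one generated scraper module might look like. The function names (fetch, parse, run), the record fields, and the example.edu URL are assumptions for illustration, not Scraper Factory's actual output.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Hypothetical listing page; a real script would target one specific site.
SOURCE_URL = "https://example.edu/news/"


class _LinkParser(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url=SOURCE_URL):
    """Download the page and return its HTML as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(html, base_url=SOURCE_URL):
    """Turn raw HTML into a list of records with absolute URLs."""
    parser = _LinkParser()
    parser.feed(html)
    return [
        {"title": text, "url": urljoin(base_url, href)}
        for href, text in parser.links
        if text  # skip links with no visible text
    ]


def run():
    """Entry point that a scheduler can call for any scraper in the fleet."""
    return parse(fetch())
```

Because every script would expose the same entry point, a scheduler can loop over a whole fleet and call run() on each one without knowing anything about the individual sites. No LLM is involved at this stage.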

Scraper Factory keeps its communication with an LLM cost-effective by passing along only the information about a webpage that the model needs. The underlying code behind webpages can be long and unwieldy, making it difficult for AI to parse in full. The framework coordinates with the LLM to parse only the necessary information, sometimes over multiple calls.
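One way to picture this step is a condenser that strips a page down to a structural skeleton and then splits it into pieces small enough for separate model calls. The sketch below is an assumption about how such trimming could work, not the framework's own code; the names condense and chunk, the tag and attribute lists, and the 4,000-character budget are all hypothetical.

```python
from html.parser import HTMLParser

# Assumed choices: which tags carry no useful structure, and which
# attributes are worth keeping for writing selectors.
SKIP_TAGS = {"script", "style", "svg", "noscript"}
KEEP_ATTRS = {"href", "class", "id", "datetime"}


class _Skeleton(HTMLParser):
    """Rebuild the page as a list of tags and text, minus the noise."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(text)


def condense(html):
    """Return the page as a list of compact tag and text fragments."""
    parser = _Skeleton()
    parser.feed(html)
    parser.close()
    return parser.parts


def chunk(parts, max_chars=4000):
    """Group fragments into strings of roughly max_chars, one per LLM call."""
    chunks, current, size = [], [], 0
    for part in parts:
        if current and size + len(part) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(part)
        size += len(part)
    if current:
        chunks.append("".join(current))
    return chunks
```

Dropping scripts, styles, and most attributes usually removes the bulk of a page while keeping the class names and link targets a model would need to write selectors. A single fragment longer than the budget still becomes its own chunk in this sketch, so the limit is approximate.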

We also ensure each scraper returns consistent data. Sites can change formatting or be taken down, so Scraper Factory notifies us when a scraper fails and we can fix it. We also plan to use AI tools to automate fixing some of the broken scrapers.
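A check like this can be sketched as a thin wrapper around each scraper run. The required fields, the function names, and the notify callback below are hypothetical; in practice notify might post to a chat channel or send an email.

```python
# Assumed schema: every record should carry these non-empty fields.
REQUIRED_FIELDS = ("title", "url", "date")


def check_records(records):
    """Return a list of human-readable problems; empty means the data looks fine."""
    problems = []
    if not records:
        problems.append("scraper returned no records")
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append(f"record {index} is missing: {', '.join(missing)}")
    return problems


def run_with_alert(name, scrape, notify):
    """Run one scraper; call notify(message) if it crashes or returns bad data."""
    try:
        records = scrape()
    except Exception as exc:  # any failure should alert, not stop the fleet
        notify(f"{name}: scraper raised {type(exc).__name__}: {exc}")
        return []
    problems = check_records(records)
    if problems:
        notify(f"{name}: " + "; ".join(problems))
        return []
    return records
```

Treating an empty result as a failure matters here: when a site is redesigned, a scraper often keeps running without errors and simply stops finding anything.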

By default, Scraper Factory respects robots.txt: if a website has expressed a desire not to be scraped, Scraper Factory will honor that unless otherwise specified. Multiple court cases have affirmed the legal right to scrape the public web, especially for journalists working in the public interest, but we made this choice so that journalists can be aware of the stated preferences of websites before they scrape, especially if they intend to use other third-party APIs to parse or analyze the data they collect.
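For readers who want to see what a robots.txt check involves, Python's standard library includes a parser for it. The sketch below is an illustration under assumptions, not Scraper Factory's implementation: the user agent string, the function names, and the respect_robots flag are hypothetical.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical user agent; a real scraper should identify its newsroom.
USER_AGENT = "example-newsroom-scraper"


def allowed_by_rules(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_scrape(url, user_agent=USER_AGENT, respect_robots=True):
    """Fetch the site's robots.txt and report whether the URL may be scraped.

    Passing respect_robots=False models the "unless otherwise specified"
    override. Network errors while fetching robots.txt are not handled here.
    """
    if not respect_robots:
        return True
    parser = RobotFileParser(urljoin(url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)
```

For example, with rules that disallow /private/ for all user agents, allowed_by_rules would return False for a URL under that path and True for a URL under /news/.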

Script creation with the models we used for testing averaged $0.08 per scraper, enabling smaller newsrooms to deploy large fleets of scrapers at a relatively low cost. We are also aiming to make Scraper Factory model-agnostic, so as open-source and local models get better, users can substitute their preference into the code.
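As a rough worked example of that figure, the one-time generation cost for a fleet scales linearly with the number of scrapers. The helper below is a hypothetical back-of-the-envelope calculation that assumes each scraper costs the reported average and ignores any retries or manual fixes.

```python
AVERAGE_COST_PER_SCRAPER = 0.08  # dollars, the average reported above


def fleet_cost(scraper_count, cost_per_scraper=AVERAGE_COST_PER_SCRAPER):
    """Estimate the one-time LLM cost, in dollars, of generating a fleet."""
    return round(scraper_count * cost_per_scraper, 2)


# A fleet of 100 scrapers, roughly the size of the university project:
print(fleet_cost(100))  # 8.0
```

Since the scrapers run without an LLM after creation, this is a one-time cost, not a recurring one.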

We demonstrated Scraper Factory at the Computation + Journalism conference in Miami in December 2025, and taught a session on how to use it at NICAR 2026 in Indianapolis in March 2026. We have also made a component of the project available on GitHub, and are planning to make the full set of scraper fleet management tools public soon.

Scraper Factory is a work in progress. We welcome contributors, requests, and questions.
