A new tool allows journalists to quickly sort through FOIA data dumps

March 2, 2022
AP Photo/Damian Dovarganes

In the 2020 fiscal year alone, federal agencies received nearly 800,000 requests under freedom of information laws. The process is notoriously frustrating, marked by delays, denials, and appeals before documents are turned over (if they ever are). Even success can be exasperating: documents arrive in the form of large dumps, without any meaningful organization. Sorting through them is time- and labor-intensive; for smaller newsrooms with fewer financial resources and less manpower, it may feel prohibitive. A recent FOIA workshop held by the Chicago Headline Club included a session called “More data, more problems,” aimed at finding new approaches to reporting with massive data dumps.

“I file a lot of FOIA requests, and I often get back hundreds and hundreds of emails, documents, and a ton of text files,” Hilke Schellmann, a journalism professor at New York University, says. “I don’t necessarily know what or where the smoking gun will be, but I know I don’t need to read hundreds of emails about someone’s lunch schedule to find it.”

Schellmann, along with senior research scientist Dr. Mona Sloane and computer science professor Julia Stoyanovich, led a team of graduate students at NYU’s Center for Data Science to develop Gumshoe, an artificial-intelligence tool that uses natural language processing to sort through large caches of text documents and categorize them by relevance to the journalist’s main topic of investigation, reducing the time needed to sift through everything. 

For instance, a target text may be Joe Biden’s inauguration speech and the label category may be “politics.” The model then computes the probability that the text is about politics and filters out the pieces it judges irrelevant in comparison with the rest; in the case of the inauguration speech, that might mean marking sentences with “good morning” or “thank you” as not relevant to the address itself. Because a journalist might not even know what they are looking for within a cache of documents, it’s not necessary to input specific keywords or labels at the start (although you can if you know exactly what you want). As a journalist continues to use the tool, the model trains itself on which documents, or portions of each document, are more likely to be relevant to the query. At the moment, the tool is optimized to go through big sets of emails, a common result of a FOIA request for communications between any two parties, but it can also be used on any large body of text.
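
To make the idea of scoring text by relevance concrete, here is a minimal sketch in Python. It is not Gumshoe’s model, whose internals the team has not described here; it swaps in a small naive Bayes classifier, and the class name, method names, and example sentences are all invented for illustration. It shows the two behaviors described above: estimating the probability that a passage is relevant, and improving as a reporter marks more passages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class RelevanceScorer:
    """Hypothetical two-class naive Bayes scorer (illustration only)."""

    def __init__(self):
        # Word and passage counts, kept separately for relevant (True)
        # and irrelevant (False) examples.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def add_example(self, text, relevant):
        """Record one passage the reporter has marked relevant or not."""
        self.word_counts[relevant].update(tokenize(text))
        self.doc_counts[relevant] += 1

    def probability_relevant(self, text):
        """Estimate the probability that a passage is relevant."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            # Class prior with add-one smoothing.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no signal
                count = self.word_counts[label][word]
                log_p += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = log_p
        # Convert the two log scores into a normalized probability.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights[True] / (weights[True] + weights[False])


# Made-up feedback from a reporter interested in "politics".
scorer = RelevanceScorer()
scorer.add_example("We will defend democracy and the constitution", True)
scorer.add_example("Congress must pass the voting rights bill", True)
scorer.add_example("Good morning and thank you all for coming", False)
scorer.add_example("Lunch is at noon, thank you", False)

sentences = [
    "Thank you, good morning",
    "Democracy depends on the constitution and Congress",
]
# Most relevant passages first.
ranked = sorted(sentences, key=scorer.probability_relevant, reverse=True)
for sentence in ranked:
    print(round(scorer.probability_relevant(sentence), 2), sentence)
```

In this toy version, each call to `add_example` stands in for a reporter’s feedback, so the ranking shifts as more passages are marked. A production tool would likely rely on a far larger language model rather than word counts.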

MuckRock, a nonprofit news site devoted to record requests, plans to integrate Gumshoe into its DocumentCloud platform, which is used by journalists for posting and reviewing public records. Journalists using documents “need better tools, better resources, and better support,” Michael Morisy, a MuckRock cofounder, says. “This is a tool that uses machine learning to solve an actual real problem that journalists had.” 

The Gumshoe team developed the tool with an initial grant from the Center for Digital Humanities at NYU. A subsequent $200,000 grant, awarded last month by the Patrick J. McGovern Foundation, will enable the team to build out Gumshoe’s user interface and distribute the product widely. At the moment, the team is inviting journalists and newsrooms to test the tool and help review and improve it.

“You don’t really need something to help you analyze 10,000 emails until the moment you do—and then you need it right away,” Derek Kravitz, data and investigations editor at MuckRock, says. “Not many newsrooms can support that infrastructure in an ongoing way, to keep it on hand in case they get a massive leak—like the Facebook Files, for instance. So having this accessible when it’s needed might be the difference between some really important stories getting told and some stories never even being looked at.” 

Paroma Soni was a CJR fellow.