You probably haven’t heard of “Operation Boulder,” a Nixon-era program that scrutinized the activities of Arab Americans and profiled visa applicants with Arab-sounding names. You arguably should know about it, though: it’s one of the clearest precedents for the sort of policies the US government pursued after September 11, when it started building anti-terrorism tools, like the no-fly list, around questionable metrics. But the State Department is pretty interested in keeping the program secret.

Matthew Connelly, a professor of history at Columbia University, came across Operation Boulder after a tool he and his colleagues built pointed right to it. They had just started up the Declassification Engine, a project led by Connelly and Columbia statistics professor David Madigan, with the idea that they would develop analytic tools that could coax new information out of secret or declassified documents. One of their inaugural research efforts began with a group of 250,000 diplomatic cables from the State Department. The cables were classified, but their metadata (bits of information about the embassy where a cable originated, the topic, the sender) was not. By analyzing the metadata, Connelly says, he and his colleagues found they could reveal which topics the State Department has chosen to hold closest to its chest. And the word that stood out most prominently was “Boulder.”
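To make the idea concrete, here is a minimal sketch of that kind of metadata analysis in Python. It is not the Declassification Engine's actual method, which the article does not describe. The record layout (`terms`, `classified`), the function name `withheld_rates`, and the toy records are all invented for illustration. The sketch simply ranks each metadata term by the share of cables carrying it that are still withheld.

```python
from collections import Counter

def withheld_rates(cables, min_count=2):
    """Rank metadata terms by the share of cables tagged with them that are classified.

    `cables` is a list of dicts with hypothetical keys:
      "terms"      -> list of subject terms taken from the cable's metadata
      "classified" -> True if the cable body is still withheld
    Terms seen on fewer than `min_count` cables are skipped as too rare to judge.
    """
    total = Counter()
    withheld = Counter()
    for cable in cables:
        for term in set(cable["terms"]):  # count each term once per cable
            total[term] += 1
            if cable["classified"]:
                withheld[term] += 1
    rates = {t: withheld[t] / n for t, n in total.items() if n >= min_count}
    # Highest withheld share first; ties broken alphabetically.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up records, for illustration only.
cables = [
    {"terms": ["BOULDER", "VISAS"], "classified": True},
    {"terms": ["BOULDER"], "classified": True},
    {"terms": ["VISAS", "TRADE"], "classified": False},
    {"terms": ["TRADE"], "classified": False},
    {"terms": ["VISAS"], "classified": True},
]
ranking = withheld_rates(cables)
for term, rate in ranking:
    print(f"{term}: {rate:.2f}")
```

In this toy data the term attached only to withheld cables rises to the top of the ranking. A real analysis would need far more records and some statistical test to separate genuine patterns from noise.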

This analysis is just one of the widgets that make up the Declassification Engine, which uses statistics and machine learning to better understand government secrecy. The project aims to aggregate archives of once-secret material and help historians, journalists, and other snoops understand it in ways they couldn’t have in the past.

Connelly and his collaborators have already developed a few tools (like the one that sniffed out the screen over the Operation Boulder cables), taught Columbia students about “Hacking the Archive,” and organized a conference about official secrecy. They’re trying to raise money through Indiegogo to keep students working on the project through the summer and just received a $150,000 “magic grant” from the Brown Institute for Media Innovation to continue work in the fall. 

What ties all these efforts together is the sense that, as Connelly puts it, “secrecy is out of control.” The federal government is both withholding more information than it once did and throwing away reams of documents that it deems useless but that could be valuable when subjected to powerful new forms of analysis. By looking at government secrecy in the past, the Declassification Engine could start answering questions about what’s being withheld, how much, why, and whether the government is making justifiable choices about what it keeps from the public. “As much as it can feel overwhelming to get on top of what the government is releasing, there’s even more that they don’t tell us about,” says Connelly. “If we can’t even find out what they were doing 30, 40 years ago, then there’s no accountability at all.”

One of the most ambitious and legally tenuous ideas that Connelly and his collaborators have talked about is predicting the content of redacted text. That’s what any good sleuth would do when handed a blacked-out document, he says: try to guess what you’re not seeing. “If you’re going to do this at scale and use technologies like machine learning, it’s not just a quantitative difference. It’s potentially a qualitative difference,” he says.

For now, though, their engine won’t be spitting out guesses at redacted text. Instead, they’re looking one level up, at patterns in the type of text that tends to be redacted and withheld. One of the other tools they’ve already developed compares different versions of declassified documents in order to identify blocks of text that were once redacted and later revealed. By analyzing just those bits of documents, they can get a better sense of what types of stories make the government queasy. (A few decades back, for instance, it was information about Mohammad Mosaddegh, the democratically elected leader of Iran whom the CIA helped overthrow, that tended to get blacked out.)
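The version-comparison idea can be sketched in a few lines of Python using the standard library's `difflib`. This is an assumption about how such a comparison might work, not a description of the project's own tool. The `[REDACTED]` placeholder, the function name `revealed_passages`, and both sample sentences are invented for illustration.

```python
import difflib

def revealed_passages(redacted, released, marker="[REDACTED]"):
    """Return text present in the later release where the earlier one had a redaction.

    Both arguments are plain-text versions of the same document. `marker` is a
    hypothetical placeholder standing in for a blacked-out block.
    """
    old_words = redacted.split()
    new_words = released.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    revealed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" span that contained the marker is text that was once hidden.
        if tag == "replace" and marker in old_words[i1:i2]:
            revealed.append(" ".join(new_words[j1:j2]))
    return revealed

# Invented sentences, for illustration only.
old = "The ambassador reported that [REDACTED] attended Tuesday's meeting."
new = "The ambassador reported that the finance minister attended Tuesday's meeting."
print(revealed_passages(old, new))
```

Collecting the revealed passages across many documents would give the kind of corpus the researchers describe: only the text that officials once chose to hide.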

One of the simplest ambitions of the Declassification Engine is also one of the toughest ones to pull off: show that, in the digital era, the government should keep more documents than it does already. The tools that the Declassification Engine is developing could, in theory, be useful not just to those of us on the receiving end of declassified documents but also to government workers trying to figure out what should go out to the public and what should stay secret. Connelly gets a little bit upset when he talks about how State Department archivists routinely trash documents after a cursory statistical analysis of their usefulness. The discarded documents have included migration records, like applications for passports and visas.

“At no point did it seem that they had any sense of the possibility of data mining,” he says. “You can learn things from data mining, even from seemingly mundane materials. You can look at patterns, in visas and passports, about how people move around the world, that you might not see looking at individual records.”

Declassification Engine researchers have already started talking to government archivists to better understand the work they do and to start making the case that their tools could be useful to the government, too. “Especially now that we’re dealing with electronic records and the cost of storage is trivial, at least save it,” says Connelly. “Don’t destroy it. Just wait until we find ways of managing it.”

Disclosure: CJR has received funding from the Motion Picture Association of America (MPAA) to cover intellectual-property issues, but the organization has no influence on the content.


Sarah Laskow is a writer and editor in New York City. Her work has appeared in print and online in Grist, Good, The American Prospect, Salon, The New Republic, and other publications.