Tow Center

Digital journalism’s disappearing public record, and what to do about it

The average news consumer is awash in data. This abundance presents challenges not just to the daily act of staying informed, but to the work of keeping content publicly available. At an April 13 conference, “Public Record Under Threat: News and the Archive in the Age of Digital Distribution,” scholars and specialists from organizations including the Internet Archive, Wikipedia, and the Center for Investigative Reporting discussed problems of preservation ranging from the technological to the financial.

Institutions from banks to hospitals struggle with digital record-keeping now that much of the information that used to live on paper exists primarily in digital form. News organizations find it particularly challenging to keep the proverbial first draft of history from vanishing, and to figure out how to keep that draft continuously available to readers. It’s an area in which tech companies and newsrooms need to find common ground, and soon. “Publishers aren’t doing archiving,” says Mark Graham, director of the Wayback Machine, a tool developed by the Internet Archive for preserving Web content. “Archiving is simply not a business priority.”

Wikipedia is grappling with these issues as well, says Jake Orlowitz, head of the Wikipedia Library. Orlowitz claims the open-source encyclopedia’s team of volunteer editors—200,000 monthly—relies on vetted, accurate sources of information, but he admits that the cohort, which is largely white and male, has a “certain implicit bias.”

Wikipedia tries to combat that bias, and other problems endemic to a labor corps of volunteer workers, by making changes to the site public. News organizations, which don’t always offer the same level of granular transparency, are instead trading on the reputations of their mastheads, Orlowitz says. He suggests a “news slider” feature that would let audiences see different versions of the same story over time, including updates.


BUT NEWS ORGANIZATIONS ARE LIMITED by topic and by the interests of their audiences, rather than by the mandate to pull together as much knowledge as possible. Reporters have to make thoughtful decisions about what stays and what gets removed from a story, says Victoria Baranetsky, general counsel at The Center for Investigative Reporting.

The Internet Archive makes no such distinctions, earning the organization the moniker “history grabber machine” from North Carolina Senator Thom Tillis during the recent congressional testimony of Facebook Co-Founder and Chief Executive Officer Mark Zuckerberg. Graham says the Wayback Machine doesn’t discriminate (“I just try to grab everything,” he says) but admits that, given the speed of the internet and the limits of technology and human ability, “everything” is probably not on the menu. Thus, the question, according to Regina Lee Roberts, a Stanford University librarian, is what goes missing from the public record, such as paywalled articles that may be too expensive for an institution to reliably access. “What is it that we don’t see?” Roberts asks.


THEN THERE’S THE STUFF we simply can’t see: algorithms that produce personalized online experiences, sites that are migrated to new content systems, and rotten links. If that weren’t enough of an uphill battle, the rapid pace of technological change is going to result in plenty of material that, though it may be archived, doesn’t function. What happens to Flash-based apps, data visualizations, and virtual-reality projects?

There’s also the matter of material that gets destroyed or interrupted by cataclysm. Even as it took its nightmarish toll on human lives, the 1994 Rwandan genocide interrupted the collection of meteorological data for more than half a decade, hampering efforts to assess climate change, according to Francesco Fiondella of the International Research Institute for Climate and Society at Columbia University. Fiondella is working on a tool to address the data loss, which further troubled an already-devastated country that relies on agriculture.

These challenges may seem like technical puzzles, but they are also a matter of priority. Stephen Abrams of the California Digital Library, which houses the Online Archive of California among other collections, says some problems are easy to solve but are either too low on the to-do list or too expensive. Budgets to pay for preservation of records rise with volume. “The cost of storage has not gone to zero,” he says. To muster the necessary funding, organizations have to first see preservation as a priority. This requires advocacy, strategic coalition building, and patient diplomacy.


BEYOND THESE TECHNICAL and financial obstacles, there is also the question of who bears the burden of keeping records when institutions use social-media platforms as an ad hoc archive. Platforms are motivated to collect data because of the potential for profit, rather than as a public service, and relying on them as an archive can conflict with the legal “right to be forgotten,” which the European Union enforces but which remains contentious in the United States. These tensions were most recently on display during congressional hearings over the data policies of Facebook, whose campus is just five miles away from Stanford, where the conference on preserving the public record took place.

These tensions are not easy to reconcile. Google Product Manager Geoff Samek says YouTube, which his company owns, feels responsible for preserving content and making it available as “part of their DNA.” That means relying on machine-driven curation, which has the potential to distort the public record. The task can’t be left to machines alone, Samek says; there have to be living humans involved somewhere. “Otherwise,” he says, “it’s a problem.”

The authors would like to thank Priyanjana Bengani and George Tsiveriotis for their notes from the conference.

Sharon Ringel and Angela Woodall are research fellows at the Tow Center for Digital Journalism. Sharon Ringel is a postdoctoral researcher at the Columbia University Graduate School of Journalism. Angela Woodall is a Communications Ph.D. candidate at the Columbia University Graduate School of Journalism.