Minus proper archives, news outlets risk losing years of backstories forever

Experts say journalism outfits are naive about the safety of digital content and underestimate the value of archives
July 21, 2014

Print stories can be lost, but digital stories last forever, captured for eternity in some nebulous internet ether or on a hard drive in a desk drawer. At least, that’s the vague theory assumed by many producers and consumers of digital news. Once something is posted or backed up, it never really disappears–and if that’s true, archiving digital work seems less urgent. That line of thinking is exactly why so many news organizations risk losing years’ worth of stories.

As we move deeper into the digital era, we’ve recognized the need to preserve and digitize print content, but we’re still in the early stages of understanding how we safely archive our digital news. A survey released last week by The Missouri School of Journalism’s Donald W. Reynolds Institute shows how much outlets are losing when they don’t effectively archive their work, which many do not.

Among the 476 digital and hybrid news organizations that participated in the survey, 27 percent of hybrid news organizations and 17 percent of online-only enterprises said they’ve experienced a significant loss of news content due to technical failure. To Edward McCain, the digital curator of journalism at the institute, these numbers confirm a very basic but largely overlooked fact of digital media enterprises: Digital content is fragile and easily lost.

Take The Columbia Missourian, run out of The Missouri School of Journalism, which lost 15 years’ worth of stories and seven years’ worth of images in a single server crash in 2002. Although a backup did exist, the system that was holding the material had become obsolete, rendering the information irretrievable. It was actually this experience that inspired the school to become invested in digital preservation, and it has since launched several projects through its Journalism Digital News Archive initiative, including last week’s survey.

The Missourian crash is “a textbook example of what can and does happen,” says McCain. When the preservation of digital news is under-prioritized, what’s at risk is not only an individual journalist’s work, or the news enterprise’s backstory and legacy, but also our cultural heritage, he adds.

“There’s a growing level of awareness, but overall, from talking to people and from the survey, I think we have a very long way to go,” McCain explains. “I don’t think there’s a significant number of people who understand the difference between a backup and a preservation system.”

The difference is this: A backup system is a short-term strategy for retrieval of information within a period of days, weeks, or a few years. It is a short-term solution because the technology can become obsolete or storage devices can be damaged. Far too many news organizations rely on single backups for storage of their digital content, says McCain: “We’re still kind of thinking that [storage devices] are like paper, that you can put a hard drive on a shelf and come back in one year and find it in the same condition.”

For the individual journalist, a good backup system is probably the best he or she can realistically do to protect a portfolio, but news institutions with a backstory of thousands of articles, photos, and videos can take much greater measures.

An archive, then, is a long-term preservation system that ensures both the survival of content and easy access to it through descriptions and cataloging. The data is monitored for sudden changes that might mean a loss of content and is re-formatted when it migrates to new, updated systems. The information is organized and search terms are applied so old articles can easily be found and used for reference, or even be re-published, a strategy used by several media organizations.
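The monitoring step described above is often done by recording a checksum for every file and re-checking it on a schedule, a practice archivists sometimes call a fixity check. The following is a minimal sketch of that idea, not a description of any system used by the outlets or institutions in this story. The function names (`sha256_of`, `build_manifest`, `audit`) and the folder layout are hypothetical, chosen for illustration only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its checksum."""
    return {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file()
    }


def audit(archive_dir: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the archive's current state against a saved manifest."""
    current = build_manifest(archive_dir)
    return {
        # Files recorded earlier that can no longer be found.
        "missing": sorted(set(manifest) - set(current)),
        # Files whose contents no longer match the recorded checksum.
        "changed": sorted(
            name for name in manifest
            if name in current and current[name] != manifest[name]
        ),
        # Files added since the manifest was built.
        "new": sorted(set(current) - set(manifest)),
    }
```

In this sketch, a newsroom would run `build_manifest` once when stories are ingested, store the result somewhere separate from the archive itself, and run `audit` periodically. Any entry under "missing" or "changed" signals possible content loss while a good copy may still exist elsewhere.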

In short, an archive is a comprehensive system that needs to be developed and monitored by a professional–meaning, it isn’t cheap. That’s exactly why digital preservation isn’t a priority to most outlets and why some are even getting rid of archives that are too expensive to maintain.

“The economics just aren’t there,” says Victoria McCargar, an archivist, lecturer, and consultant on digital management with a background in journalism. She explains that some news outlets will drop old material rather than spend the money to incorporate it into an archive.

US News, for example, deleted its pre-2007 archives of digitized and native digital content in February, leaving the stories with LexisNexis and EBSCO.

While the costs of building an archive deter many news outlets, these same outlets miss out on the chance to earn revenue from their archives, either through paywalls or by reusing and repurposing old content.

Smaller and online-only news organizations lag furthest behind because of the costs and effort involved, says McCargar, while some of the bigger news organizations do have real archives in place.

There has been a lot of focus on digitizing and preserving old print newspapers in recent years, but the preservation of digital news has largely been overlooked. “What’s being produced [in the newsroom] tonight is maybe even more at risk than old newsprint,” says McCargar, who has worked with news organizations like the Associated Press to create archives of digital content but generally does not see many news organizations that are aware of the issue.

“Old print can hang on another 50 years if it has to, but a lot of people are really naïve about digital media,” McCargar says, explaining that even good backup systems start to malfunction after about five years, “so we don’t have a lot of time to go after digital media.”

The Donald W. Reynolds Institute’s survey shows that news organizations do find archives valuable–93 percent of online news organizations and 88 percent of hybrid organizations agree with the statement that archives are valuable or very valuable to their operations–but that does not mean the development of archives is a priority. If news organizations don’t take responsibility for archiving their digital content themselves, no one else will.

Since the late 1990s, several large-scale projects such as the Internet Archive and the Library of Congress’ digital preservation team have been creating vast archives of digital content, but their scope is limited. Due to copyright restrictions, archivists often need to obtain approval from the sites they copy–which can bring the process to a halt–while paywalls and password-protected content are an obstacle for the crawlers that harvest digital content for archiving.

Abigail Grotke leads the web archiving team at the Library of Congress and says the amount of digital news content they are able to archive is limited. “We haven’t done a whole lot over the years but we’re trying to do more news,” she says, “especially online-only publications and more local, regional papers.”

McCain is organizing an upcoming forum on the preservation of native digital news and hopes it will start a conversation that can push for policy focused on protecting digital news as part of our cultural heritage. But first, journalists and editors must be convinced that safeguarding their own work should be a priority. “There has to be a change of culture,” McCain says. “At some point, they’re gonna have to take an interest.”

The piece has been updated to delete mention of a Los Angeles Times infographics archive. The piece said the archive had become inaccessible; however, the material is still accessible through a new archiving platform.

Lene Bech Sillesen is a CJR Delacorte Fellow. Follow her on Twitter at @LeneBechS.