When History Is Lost in the Ether

Digital archiving is shoddy and incomplete, and it will hamper the ability of future generations to understand the current era.

By Christian Schneider

Apr 6, 2022

“Then where does the past exist, if at all?”

“In records. It is written down.” —George Orwell, 1984

In late February, Trisha Hope settled into her booth at the Conservative Political Action Conference in Orlando, hoping to sell printed copies of her five-volume collection of former President Donald Trump’s tweets.

Hope’s goal was, of course, to unload a stack of her $39 linen-bound, self-published books, but she later told me the books had another important purpose.

“I was told by many that they wanted to share our book with future generations and felt they could not rely on other media or even historians to do so in a truthful manner,” Hope told me. “President Trump’s words to the people on a daily basis will always be the most valuable tool in sharing the history of this time.”

When Twitter famously banned Trump from its platform in early 2021, it had the effect of erasing his tweets from the public record. Numerous other social media sites like Facebook, Instagram, and YouTube followed with their own bans.

Hope said when Trump was kicked off social media, sales of her books increased more than 1,000 percent.

“President Trump’s words are his; they reflect the day-to-day operations of the Trump administration in a way no other resource can provide,” she said.

On this, Hope is entirely correct. Reading Trump’s tweets between 2016 and 2020 will be a crucial way for future historians to understand American politics of the early 21st century. Telling the story of the Trump years without being able to cite his tweets is akin to writing about the life of Benjamin Franklin without ever having seen a copy of Poor Richard’s Almanack.

And yet on Twitter, the original documents are now memory-holed, likely never to return.

Sure, there are several third-party internet sites that have archived Trump’s tweets, but these platforms exist only at the whim of their administrators. When they disappear, so might all their data.

The overwhelming majority of what we know about the past comes from people writing things down on media that lasted centuries, from cave walls to wood carvings to sheepskin. And a lot of people took the time to write things down. Read William Manchester’s volumes on the life of Winston Churchill, and you’ll see that much of what we know about the 20th century’s greatest leader is derived from journals kept by the mistresses of men in his government. (One cheer for infidelity!)

But like information shared on Twitter, the overwhelming majority of news in modern times is made and reported in digital formats. This presents a host of potentially dangerous problems for future Americans trying to understand how we lived in 2022.

Among the digital-only outlets we currently peruse, which will keep their data archived? Will that data be accessible in spite of future technological advances? Will the text remain untouched by future editors who may want to erase past content? How will we know the full context of what we are seeing?

When the present becomes the past, the past exists only in the manner in which it is preserved. Right now, the present is being saved only in shoddy and haphazard ways, as publishers try to adjust to ever-changing technologies. Further, that same technology allows previously published content to be changed in ways that reflect current trends and sensibilities. Combined, these problems spell doom for future generations.

What Are We Saving?

You may have had a friend or mentor warn you that “the internet is forever.” And it is true that unfortunate opinions you held in your teens or pictures of catastrophic haircuts you got as an adult may haunt you in perpetuity.

(Full disclosure: I run a Twitter account that exhumes unfavorable historical reviews of classic movies, books, albums, and television shows, many of which are retroactively embarrassing to the reviewer.)

But in terms of news content, the internet is hardly forever. Online-only news outlets drop out of sight all the time, taking their archives with them. Most digital articles you read could accurately carry the traditional “this story will self-destruct” spy disclaimer, even if it takes a few years rather than a few seconds.

This is especially true of the early days of the internet, when editors often viewed online-only content as disposable. In one 1996 interview, the manager of the Minneapolis Star Tribune website explained the lack of an archive of old online stories, saying there was no need because the paper’s online presence was “a service, not a publication.”

But when a non-archived website goes belly-up, it is akin to a fire at a small-town newspaper that destroys its entire physical archive.

Even news organizations that survive are falling behind. In 2021, the Donald W. Reynolds Journalism Institute at the University of Missouri released an extensive report documenting how newsrooms are lagging in their efforts to preserve “born digital” content.

Of the 24 news organizations studied by the Missouri researchers, every one said it was archiving some content. But 17 of the news outlets weren’t storing all of their output.

Edward McCain, the digital curator of journalism at the Reynolds Institute, told me he was heartened that most news organizations are saving “at least some” of their content.

“We realize that it’s not realistic to think that we are going to save 100 percent of the digital first draft of history,” said McCain. But he added that while they may not be saving it at full resolution or in preferred archival formats, publications are making an effort to keep it one way or another.

“That indicates to me that there is an understanding in newsrooms that their archives are valuable assets,” said McCain.

Yet the Missouri report details some of the obstacles newsrooms are facing in archiving material, specifically that “nearly insurmountable financial challenges” have forced newspapers to cut positions dedicated to preserving material.

The report credits news organizations like the Associated Press, the BBC, and NPR for their comprehensive efforts to save material, but notes they are the exception. The report contends most news outlets are stuck in a pre-digital mindset of preservation, saying “few have been able to update their processes to establish the deliberate activities needed to properly preserve digital content for the long term.”

“Chances are, in fact, that unless you do something specific and intentional to preserve it,” the study warns, “some or all of your born-digital content will be gone in a few years. It will no longer be accessible, readable, searchable or recoverable unless you take deliberate steps to ensure it is.”

(Ironically, as if to prove its own point, the Missouri Journalism Digital News Archive is filled with broken links.)

But even news articles that stay online can become more difficult to find and access, as content management systems utilized by news organizations update and become more complex.

Kathleen Hansen, a former professor of journalism at the University of Minnesota, told me that while the perception is that online content will exist forever, digital text is not constant.

“No one is dealing with backward compatibility, no one is dealing with bit degradation,” she said. “And the biggest issue for a lot of legitimate news organizations is that most of their content now is on websites that still have no way to archive properly.”

In their 2017 book Future-Proofing the News: Preserving the First Draft of History, Hansen and her former University of Minnesota colleague Nora Paul describe the challenges in archiving digital content, warning that we may soon have better access to news produced in 1817 than in 2017.

Paul said that even if someone figures out how to archive current digital content, that doesn’t mean it will be accessible decades from now.

“The change in technology is going to continue, so whatever gets properly archived, in 50 years there won’t be a machine or a method to be able to read what gets archived now,” she said.

As an example, during their research for the book, the authors found a number of television stations that keep their archives on two-inch magnetic tape. The CBS affiliate in New York, for instance, has only one machine on the premises that can read the old tape. If it breaks, the archive is useless.

They noted that in 1960, the U.S. census was compiled digitally for the first time. In 2022, there are only two machines in the world that can read the original data tapes; one is in the Smithsonian Institution and one is in Japan.

If news organizations aren’t able to archive their data in an accessible way, the job falls to third-party sites like the Internet Archive, which attempts to “scrape” the internet and save as much digital content as possible.

According to the Internet Archive’s own statistics, its database contains 669 billion web pages, compiled through the nonprofit group’s 20-year-old Wayback Machine. With help from a “crawler,” or software that helps identify links, the site grabs news articles automatically. It further relies on an army of users to submit suggested pages to save.

Wayback Machine director Mark Graham told me the Internet Archive’s humble goal is “universal access to all knowledge.” Currently, the archive works with more than 1,000 libraries, universities, and other historical groups to provide a “time horizon” for the internet—a snapshot of content that will live into the future.

Graham said he was inspired by the work of Ferdinand Columbus (yes, Christopher’s son), who spent the 1520s trying to compile a library of every book ever printed. Columbus succeeded in collecting 15,000 books, of which 7,000 still exist.

Graham said he found the collection of knowledge to be a noble goal to replicate. “What can we learn from the people who have come before us who have tried to build repositories of knowledge for the ages?” he asked.

The Wayback Machine does face daunting technological challenges: Imagine archiving the entire back-end Amazon.com catalog database, for instance. Instead, Graham said he was “going to focus more on the press releases, the news articles, the tweets, the YouTube videos. The stuff that makes up our lives, our digital world.”

But with an expanding online world, even while archiving 170,000 new pages per day, the Internet Archive can’t save everything.

It’s not for lack of trying. In the next year alone, the Wayback Machine is expected to increase its total holdings by 25 percent. But Graham said he doesn’t know what percentage of the web the Wayback Machine is currently collecting, because there is no way to calculate how much information is out there to grab.

He noted that he recently archived a wildly popular 25-year-old recipe site called Chowhound, which went out of business in late March. A crawl of Chowhound netted the Wayback Machine millions of new cooking-related URLs. But he has no idea how many total URLs were embedded in the website, so he doesn’t know whether he got it all.

“Take that same problem and expand it to the web,” he said. “That’s hundreds of millions of sites.”

(Graham noted the Wayback Machine is also one of the sites archiving Donald Trump’s tweets, but he’s missing “around 10.”)

McCain called the Internet Archive the “oldest and most established organization devoted to the purpose of archiving the web,” but said “even it struggles to capture news websites.”

For one thing, more news outlets are protecting their content through the use of paywalls.

“Those paywalls are apparently necessary in order to help the news organizations generate much-needed revenue,” said McCain, “but they do present a formidable barrier to the kind of web archiving that the Internet Archive and other such services provide.”

For that reason, McCain thinks the responsibility for preserving content lies with the outlets creating it.

“Since there is so much content that is owned and held by the news organizations themselves, we believe that a significant part of the solution for preserving digital news, including online news and whatever formats may come next, could and should reside within the news organizations and the technologies they employ,” he said.

And, of course, the ever-changing face of technology makes it difficult to freeze a website in time as content management systems change.

“The Internet Archive scrapes, but you get two clicks in, and you get dead links,” said Hansen. “No one has figured out how to properly archive a complete website.” She called it a “snapshot, not an archive.”

Other third-party entities are still trying, however. Libraries and historical societies attempt to archive news, but rarely have the funding or the technological know-how to do so. In 2010, the Library of Congress began archiving all of Twitter, but dumped the project in 2017, deciding only to preserve a small selection of tweets. The collection is still not accessible to the public.

Paul said the incentives for archiving digital content simply aren’t there.

“People that are going to be making the decisions about what gets preserved are not basing their decisions on future accessibility or the value in understanding what life was like in the early 2020s,” she said.

With news organizations being bought up by private equity companies, Paul said the decision to archive old material is going to be based more on whether the material can be monetized through clicks.

“Just the intrinsic value of a record of what life was like in a past time is not going to be based on that value. It’s going to be based on whether we can sell ads on it,” she said.

For their book, Hansen and Paul conducted a study of archival practices at 10 legacy news organizations, eight of which were traditional news outlets that had produced a printed newspaper for decades but now also published online.

Not one of the 10 organizations had a complete archive of its website.

Why Does It Matter?

In 2021, the Rolling Stones decided to pull one of their most popular songs, “Brown Sugar,” from their live performances. (It is unclear what part of that sentence would most confuse a Rolling Stones fan from 1971, the year “Brown Sugar” was released: that the song, which details the rape of black women during slavery, would one day be deemed offensive, or that the Stones were still playing live shows in 2021.)

So far, “Brown Sugar” has remained on the major digital streaming platforms. But it hardly seems possible that it will survive forever. Of course, it will be available on vinyl and CD for those who own turntables and CD players. Similarly, it seems likely that the only people with access to the original language of Mark Twain’s “Adventures of Huckleberry Finn” will be those who own the physical copies: new versions scrubbing the book of its 1884 racial nomenclature hit bookshelves a decade ago.

But digital news is more analogous to streaming music or TV shows. There often is no print version or hard copy that can be preserved. And once everything is digital, a whole new problem crops up: Content can easily be erased or retroactively edited to reflect modern sensibilities. And once that happens, history becomes negotiable. It will be impossible to access an accurate record of what people in 2022 thought or felt.

Without a doubt, there are things newspapers printed in the past that they would certainly want to bowdlerize in modern times. In the 1950s, for instance, the U.S. government ran an anti-immigration policy called—and this is the official government name for it—“Operation Wetback.” Newspapers all over America cited the program using this offensive and derogatory term, and we know that because it still exists on unedited, unalterable microfilm.

Is there any question the New York Times regrets downplaying the threat of Adolf Hitler’s antisemitism? Or its use of the “n-word” hundreds of times in the 20th century?

We don’t even have to look to the future to speculate about legacy media’s desire to scrub its archives. In 2020, the New York Times made a number of stealth edits to its 1619 Project, a controversial work that claimed the arrival of slaves in America was the country’s “true founding.” After a number of notable historians from all over the political spectrum pointed out the questionable facts used in the project, the Times removed the “true founding” language, the entire thesis of the project, without editorial note or comment. (It was only after criticism about the secret edits that the Times copped to the erasure, in a separate New York Times Magazine article that blamed the readers for misunderstanding what project leader Nikole Hannah-Jones had written. “The online language risked being read literally,” the paper said.)

So it is not out of the question that if left to archive their own digital content, news organizations will keep old content contemporary by continuing to “correct” outdated terms and concepts. And in doing so, they could be destroying the snapshot of present life that had been preserved like Han Solo in carbonite.

Beyond the risk that digital content will be lost or retroactively changed, electronic news also deprives future researchers of the clues historians use to put that news in context. Microfilm of old newspapers, for instance, gives us what newspaper people call “placement and play” information: What stories warranted being above the fold on the front page? What stories were buried deep in the classified section? What type of stories got large headlines? What stories surrounded the story on each page that might provide context as to what else was going on that day?

When modern digital stories are saved as individual pages, they are robbed of this type of surrounding context.

“Both the placement and play of the story was an indication of the news judgment – who thought what was important,” said Hansen. “Even now, it’s hard to do that on websites. They’re constantly changing the headlines and the placement based on clicks.”

According to Hansen, the whole notion of news judgment, along with placement and play, is now gone. “Going forward, there’s not going to be any way to understand that,” she said.

Paul noted that for this reason, historians who rely on Twitter could find two completely different sets of facts based on whose feed they decide to exhume.

For one thing, tweets carry very little information, given that they are strings of a few short words with very little context around them.

“How do you understand whether a tweet was intended as snark, or a hard-felt fact, or what? The whole way that communication is happening now will make historical work very difficult,” said Paul.

Hansen likened the search for history in tweets to ancient Rome, where the waterways were lined with lead. As the Roman Empire was crumbling, the people who were in charge were crazy because of lead poisoning.

“What I say now is that we are living in a toxic information environment. It’s not toxic water, it’s toxins in the information stream. We are suffering the consequences of that,” she said.

Of course, the news available to future historians will depend on who decides to invest the funds to preserve their copy. This will mean the enduring stories told about modern times will come from those who can afford to mold the narrative into the future. (This is not particularly new; newspapers have historically been owned by wealthy individuals with the resources to preserve their version of the present.)

But this will likely create a patchwork of news sources whose veracity will be difficult to ascertain in hindsight.

If a historian in 2071 wants to tell the story of the January 6, 2021, U.S. Capitol insurrection, how do they navigate the melange of news outlets? In the early 2020s, was the Washington Post more reliable than, say, The Federalist?

Graham told me the Wayback Machine is working diligently to make sure its archive will provide context to future users.

“We care a lot about context and we work very hard to maintain the provenance of our archives, ensure the integrity of our archives, and we are working on a number of projects to add context to our archives,” he said. He further noted the Wayback Machine would soon be launching a browser add-on that would cross-reference URLs visited by a user with fact-checking sites that will measure the veracity of what the user is reading.

Graham also defended the Internet Archive from those who say it isn’t doing enough to preserve digital data.

“I read a lot of things that a lot of people say who don’t know what they’re talking about,” he said. “Even people who you think would know what they’re talking about. Because they’re not practitioners. They think about this in theory. We are practitioners. We are engineers that have been doing this for 25 years, we get our hands dirty from the code, I use the system every single day. And one of our philosophies is that an archive that is used is a healthy archive.”

But the Wayback Machine can’t do everything. When history becomes a liquid property, with incomplete facts constantly shifting, it distorts our understanding of what humans did and why. As historian Abby Smith Rumsey said in a 2016 talk at the Google campus, “When you control someone’s understanding of the past, you control their sense of who they are and also their sense of what they can imagine becoming.”

The people of the future will control the past, which is currently the present in which we live. And that could have lasting effects on our day-to-day lives. (For evidence of this, just witness the tumult caused by The 1619 Project.)

Graham agreed that when history becomes “mutable,” it could have hazardous consequences.

“It is vitally important that humans have an ability to reflect on our past and to have a common frame of reference,” he said. “It is vitally important that we are able to cite facts, cite information, cite sources, so that we can reference them and compare and contrast, and we can have conversations.”

“If we can’t even have a common framework of understanding of what is true, then anything can be true and anyone can put forth any idea whatsoever.”

McCain said he feared for future Americans’ understanding of their government.

“I would say that the greatest danger from loss of digital news is likely its effect on our democracy,” McCain told me.

“In order for our governance systems to function effectively, our society needs rapid access to accurate information, from the present and the past,” he said. “Much like the brain needs memory cells and a nervous system in order to function properly – to react to pain or receive rewards – an electorate and its leadership need a free and functioning media to properly address current dangers and to anticipate future threats or opportunities.”

Without a reliable historical record, future citizens will live in a world of vanishing knowledge, where facts can’t be verified or refuted. Throughout history, the human knowledge base has built upon itself—if that knowledge can be retroactively memory-holed, all the pillars of truth crumble. We will be living in a world without precedent, where every generation has to relearn basic facts by experiencing them all over again.

Hansen, borrowing a term from “father of the internet” Vint Cerf, said the answer to many of our problems is to figure out a technological way to develop “digital vellum” to preserve knowledge. For one, companies could form partnerships with the Internet Archive to preserve their material. Further, news organizations should begin indexing their highest-value content to provide future viewers a clue as to what material was most influential in 2022.

It’s not microfilm, but in the absence of a solution better than a database of PDFs or printed hard copies, it’s the best we have right now.

And as for Trisha Hope’s book of Trump tweets?

“I’ll buy one,” said the Wayback Machine’s Graham.