If the first casualty of war is the truth, its first fatality may soon be the internet.
A frantic international effort is underway to preserve Ukraine’s digital history and Russia’s media archive. The stakes, say internet archivists, include how the war and contemporary Ukraine are remembered.
A team of over 1,300 volunteers at a newly launched global initiative called Saving Ukrainian Cultural Heritage Online is racing to preserve hundreds of thousands of websites. Archivists from the National Electronic Archive of Ukraine, the Internet Archive and the Library of Congress are trying to save copies of news publications, digital archives of museums, local government pages, exhibitions and more.
Archivists save copies of websites by capturing a site’s code with a number of so-called “crawling” tools.
The Internet Archive
Since 1996 the Internet Archive has maintained a digital library where anyone can upload and download content to its online collection, called the Wayback Machine. It has saved over 666 billion web pages.
In 2001 the archived content became available to the general public.
“Their whole philosophy is that the web is one document, there’s no use trying to draw edges,” said Christopher Lee, a professor at the University of North Carolina. Archivists at the Wayback Machine work to capture as much of the internet as they possibly can and allow users to join these efforts.
Lee says that one of the potential threats to our digital record is political: “Ultimately there is a machine somewhere and if that’s in a jurisdiction where someone can go in and seize that physical item, which is the server, then that cloud storage doesn’t exist anymore.”
In 2016, the Archive’s founder Brewster Kahle announced that the organization, based in San Francisco, would be opening a second location in Canada following the election of Donald Trump. “The history of libraries is one of loss,” wrote Kahle. “On November 9th in America, we woke up to a new administration promising radical change. It was a firm reminder that institutions like ours, built for the long-term, need to design for change.”
“As far as we can tell, no one has done web archiving at this scale in a war before,” says Quinn Dombrowski, a project administrator at Saving Ukrainian Cultural Heritage Online and a technology specialist at Stanford University’s Library. “Our goal is not to create an archive that people will study somewhere safely in the West. Our goal is to repatriate this data back to the Ukrainians.”
The war between Russia and Ukraine has opened the door to the wide-scale destruction of their internets, for very different reasons. Ukraine’s digital record faces annihilation from military invasion; Russia’s internet destruction has been ordered from within. But people in both countries are now grappling with a shocking reality: their online records can disappear, and their nation’s internet is fragile and impermanent.
Websites are going offline in Ukraine for a number of reasons, from power outages, to local servers being destroyed by shelling, to hosting bills going unpaid. “When you get right down to it, it’s cables, it’s hardware, it’s things that exist in the physical world even though we think of the internet as a different sphere,” said Dombrowski.
Dombrowski’s archiving initiative has set up a list of websites for volunteers to archive, prioritizing websites for organizations located in cities under siege or with active air raid warnings.
Their efforts point to one of the internet’s best-kept secrets: its own fragility.
Christopher Lee, a professor at the University of North Carolina’s School of Information and Library Science, says he often encounters misconceptions about the internet’s durability. “What happens is people get shocked on both ends of the spectrum. Things that you thought would be persistent, go away. And things that feel like they should be ephemeral, stick around. Both of those things are true.” What lives on is determined by “power and resources,” he says, with a lot of what we think of as “junk” data, such as our browsing history and other user behavior information, actively maintained by governments or companies using it for revenue.
The archiving of websites and databases for the most part has not been incorporated into disaster or military preparedness. In the U.S., emergency digital archiving initiatives have sprung up after events like Ferguson in 2014, Hurricane Maria in 2017, and the election of President Donald Trump. After these events, volunteers captured websites, social media posts and federal databases before they were lost or taken down by government officials.
There is a growing awareness among the public of the importance of web archiving, according to Abigail Grotke, assistant head of the digital content management section at the Library of Congress in Washington, D.C.
Grotke joined the Library when it first began web archiving in the early 2000s. Today, the Library’s digital archive is one of the largest of any government body. By the end of 2021, over 100 web archive specialists had captured 21.7 billion digital documents, or 2.827 petabytes.
The Library of Congress is in the process of switching its operations to a digital-first approach. “In the past if something was available in both print and digital, we would prefer the print. But we’re switching focus now where digital is preferred,” said Angela Cannon, a reference specialist at the Library. Last summer the Library had to freeze its social media archiving work due to persistent barriers enacted by tech companies and the technological challenges in archiving content from private profiles and accounts.
“It’s been frustrating but it’s not just our problem,” said Grotke, pointing out that it’s something archivists around the world are trying to solve.
Cannon, the reference specialist at the Library, says this becomes especially important in regions around the world where public figures and politicians almost exclusively use social media for messaging.
“Increasingly politicians are not bothering with websites,” Cannon said. “If you’re not talking to traditional newspapers, and you’re not going on television, it is most definitely going to matter in the future. That’s a gap in our collecting, so how do we document that for our researchers?”
Crawling
Web crawlers are automated software programs that visit websites and capture the data and media on the site by making a copy which can be saved in a variety of formats. This process is also called “harvesting” web content and crawlers themselves are sometimes called spiders or robots.
Crawlers begin with a seed URL and travel to the links on that page, then to the links on those pages, and so on. Crawls are usually scheduled to repeat daily, weekly, quarterly or at other intervals.
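The link-following process described above can be illustrated with a short sketch. This is a simplified illustration, not the software any of the archives named here actually run: the `crawl` function, the `LinkExtractor` class and the `museum.example` pages are all hypothetical, and an in-memory dictionary stands in for real web requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch is a hypothetical callable that maps a URL to its HTML,
    or returns None when the page is unavailable.
    Returns a dict mapping each captured URL to its HTML.
    """
    captured = {}
    queue = deque([seed_url])
    seen = {seed_url}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page is gone or unreachable; move on
        captured[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


# A tiny in-memory "website" standing in for real HTTP requests.
FAKE_SITE = {
    "https://museum.example/": '<a href="/about">About</a> <a href="/gallery">Gallery</a>',
    "https://museum.example/about": '<a href="/">Home</a>',
    "https://museum.example/gallery": '<a href="/gallery/1">Item</a> <a href="/missing">Gone</a>',
    "https://museum.example/gallery/1": "<p>No links here.</p>",
}

snapshot = crawl("https://museum.example/", FAKE_SITE.get)
print(sorted(snapshot))
```

Real archiving crawlers also save images, stylesheets and scripts, and typically write their captures to dedicated archive formats; the sketch only shows how one seed URL fans out into a full copy of a site.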
Site administrators can add a robots exclusion file (robots.txt) to their website, which tells crawlers not to archive it. In many cases, this means archival collections like the Internet Archive can no longer make the content available to users in their collection.
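As a rough illustration of how a well-behaved crawler honors such an exclusion, the sketch below uses the robots.txt parser in Python's standard library. The crawler names `example-archiver` and `other-crawler`, the `may_capture` helper and the rules themselves are invented for the example and do not describe any real archive's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is shut out entirely,
# every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def may_capture(user_agent, url):
    """Return True if the parsed rules let this crawler fetch the URL."""
    return rules.can_fetch(user_agent, url)


print(may_capture("example-archiver", "https://museum.example/gallery"))
print(may_capture("other-crawler", "https://museum.example/gallery"))
print(may_capture("other-crawler", "https://museum.example/private/notes"))
```

The file is a request rather than a technical barrier: it only has an effect because the crawler checks it before fetching a page.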
When Russia’s full scale invasion of Ukraine began, Grotke and Cannon say their team’s first step was to increase the frequency of “crawls” for Ukrainian government sites and selected Russian sites.
While many governments, including Ukraine’s, have dedicated national digital archives, Russia is one of the few that does not. Instead, Russia has Ivan Begtin, a transparency advocate in Moscow who for over a decade has led a small team of archivists. Their work has new urgency as the Kremlin erases swaths of the Russian-language web.
“In the next month and a half many publications and cultural websites can disappear entirely,” said Begtin.
Since the war began, dozens of independent Russian media sites have been blocked by the Kremlin for violating censorship laws banning the use of the word “war” in coverage of Ukraine. More quietly, hundreds of smaller publications, Russian cultural websites and online projects have gone offline as many Western hosting services stopped accepting payments from Russia.
The list of websites archived by Begtin and his team of three in the past few weeks gives a snapshot of the current shattered state of freedom of expression in Russia. He prioritizes content that “forms our contemporary history, what people are going to use to write books and textbooks one day.”
The National Digital Archives, Begtin’s self-funded project, has captured content from independent news publications such as the Insider, Colta, Tjournal and Paper, all now blocked in Russia, along with websites like the “Forum of Kostroma Jedi,” a chapter of Star Wars enthusiasts that was recently listed as an “undesirable” organization by federal authorities, and dozens of historical memory projects from Memorial, Russia’s oldest civil rights group, shuttered in December 2021 by court order.
The work is grueling and dangerous. “How much longer I can keep this up for, I’m not so sure,” said Begtin.
For years Russia’s internet existed as an unregulated bastion of free speech, pirated films, music and software. Its transformation into one of the most censored corners of the internet in the world has gone hand in hand with the transformation of Russian politics, said Begtin.
“This is a story of the erosion of your sense of freedom, the erosion of democracy, the erosion of people’s faith in themselves. Because many thinking people in Russia today speak in these words: there is nothing I can do to change what is happening.”
Begtin’s one-man crawl of the Russian web is a far cry from the wave of initiatives backing up Ukraine’s digital records. Dombrowski, the technology specialist at Stanford, says there is a much broader international conversation that has to happen around digital archiving, cultural heritage, and conflict.
“It’s inspiring when everyone comes together to do something to support this effort,” she said. “On the other hand, it represents a fundamental failure of infrastructure. It should never come to random people archiving Ukrainian websites on their laptops.”
Of all the work Dombrowski, who studied medieval East Slavic languages, managed to preserve in the past few weeks, one site was especially vivid: the website of a small museum in the Ukrainian city of Novhorod-Siverskyi dedicated entirely to the medieval epic poem The Tale of Igor’s Campaign, famously translated into English by Vladimir Nabokov. The poem’s original manuscript was destroyed in 1812 when Moscow burned to the ground during the Napoleonic Wars.
“I studied the poem in grad school and when I saw what this museum was about, my heart just stopped,” Dombrowski remembers. “Our automatic processes had failed so I manually went through each page on their website. There were 83 pages. I clicked on every image. I downloaded everything I could and saved the file. The thought of this beautiful museum being under attack, I immediately burst into tears.”