The web has a problem: huge chunks of it keep going offline. The web isn’t static; parts of it sometimes just… vanish. But it’s not all grim. The Internet Archive has a massive mission to identify and back up our online world into a vast digital library. In 2001, it launched the Wayback Machine, an interface that lets anyone call up snapshots of sites and look at how they used to be and what they used to say at a given moment in time. Mark Graham, director of the Wayback Machine, joins Decoder this week to explain both why and how the organization tries to keep the web from disappearing.
Links:
When Online Content Disappears | Pew Research
Game Informer is shutting down | The Verge
When Media Outlets Shutter, Why Are the Websites Wiped, Too? | Slate
MTV News lives on in the Internet Archive | The Verge
The video game industry is mourning the loss of Game Informer | The Verge
Guest host Hank Green makes Nilay Patel explain why websites have a future | Decoder
How The Onion is saving itself from the digital media death spiral | Decoder
The Internet Archive is defending its digital library in court today | The Verge
The Internet Archive has lost its first fight to scan and lend ebooks | The Verge
The Internet Archive just lost its appeal over ebook lending | The Verge
Credits:
Decoder is a production of The Verge and part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt. Our editor is Callie Wright. Our supervising producer is Liam James. The Decoder music is by Breakmaster Cylinder.
Hello and welcome to Decoder. I'm Nilay Patel, editor-in-chief of The Verge, and Decoder is my show about big ideas and other problems. We've been talking a lot about the future of the web on Decoder and across The Verge lately. And one big problem keeps coming up. Huge chunks of it keep going offline. In a lot of meaningful ways, large portions of the web are just dying.
Servers go offline, software upgrades break links and pages, companies go out of business. The web isn't static, and that means sometimes parts of it simply vanish. And it's not just the really old internet from the 90s or early 2000s that's at risk. A recent study from Pew found that 38% of all links from 2013 are no longer accessible.
That's more than a third of the collected media, knowledge, and online culture from just a decade ago gone. Pew calls it digital decay, but for decades, many of us have simply called this phenomenon link rot. And lately, link rot has meant a bunch of really meaningful journalism has gone away as well, as various news outlets have failed to make it through the platform era.
The list is virtually endless. Sites like MTV News, Gawker, Vice, Protocol, The Messenger, and most recently Game Informer are all just gone. Some of these were short-lived, but some were outlets that were live for literal decades, and their entire archives vanished overnight. But it's not all grim.
For nearly as long as we've had a consumer internet, we've had the Internet Archive, a massive mission to identify and back up our online world into a vast digital library. It was founded in 1996, and in 2001 it launched the Wayback Machine, an interface that lets anyone call up snapshots of sites and look at how they used to be and what they used to say at a given moment in time.
I wanted to know more about how this all works, so I asked Mark Graham, director of the Wayback Machine, to join me on the show this week to explain both how and why the organization tries to keep the web from disappearing. The answers are fascinating. You'll hear Mark explain how many hard drives the Internet Archive adds to its system every single day.
And then there are the choices that go into preservation. Not necessarily everything on the internet merits preserving, and not everything is technically accessible, especially now as more of the online world moves to private platforms. Making those choices, not just preserving the internet, but curating it, is a complicated proposition that hits on basically every Decoder theme there is.
The idea of running a library that stores the internet's history? That's a puzzle worth solving. One quick note before we start. The Internet Archive just lost an appeal on a lawsuit over a short-lived book lending initiative it launched at the start of the pandemic.
We don't get into the details of that in this episode because we recorded before the court issued its decision, but I did want to mention the news. We'll link to a couple of Verge stories about it in the show notes. Okay, the Wayback Machine and internet preservation. Here we go. Mark Graham, you are the director of the Wayback Machine at the Internet Archive. Welcome to Decoder.
Really glad to be here today.
Quickly for the audience, explain what the Wayback Machine is and how it fits into the Internet Archive.
The Wayback Machine is a service of the Internet Archive that is used to provide a time machine to the web. We have been archiving much of the public web for nearly three decades now, and we make those archives available through the Wayback Machine.
The Internet Archive is the organization, the Wayback Machine is the service. How do those two things relate? What are the other things the Internet Archive does?
The Internet Archive is a nonprofit organization with a mission of universal access to all knowledge. We pursue that mission in a variety of ways, including archiving, as I said, much of the public web. We work toward acquiring and digitizing and preserving and organizing and making available a whole range of material that is kind of grouped into media types.
So one might be books, for example, and we digitize more than 4,000 books every day. Or television news. We archive television news, both from the United States and from other countries around the world. Or journal articles. We have a collection of more than 30 million publicly accessible journal articles available from scholar.archive.org.
78s, those old things on shellac, we've got hundreds of thousands of those that we have digitized. Those were donated to us by the Boston Public Library. So I could go on and on. We identify media, recording media that people have been publishing in for some period of time.
If it's digital, like born digital, then that makes life easier because we're able to then capture that material in some fashion on our hard drives and preserve it. But maybe it's analog, maybe it's paper or microfiche or microfilm or vinyl or shellac, as I said. In that case, we have to first digitize the material, in some cases using the bespoke hardware and software setups that we have developed.
And once we've digitized it, then we can preserve it and organize it and make it available. At the end of the day, this is what this is about. This is about the voices of humanity expressed in a variety of media that in many cases are being stored and made available on a series of platforms that are inherently ephemeral, that have a history of disappearing. One of the terms is link rot.
That's talking about material that may have been available at a given URL, a given address on the web, at a given point in time and is no longer there. You go to that URL and one of two things is going to be true. Well, three things, I guess. The first thing is that what you're looking for is there: success.
But the second is that you get a page not found or some other error message, a 500 error message, maybe something like that on the server end. So you just can't get the material. It's just no longer there at that URL. Now, that material might be available via another URL. It may have been moved somewhere, but you may not necessarily know that if there's no redirect in place.
But the other thing that can happen is that at that same URL, there may be different material. That's referred to as content drift. Same URL, different material. Well, how would you even know what the prior material was or that there even was prior material at that URL? You wouldn't. Why? Because there's no version control system for the web.
I go to a URL, I may get something, and then five minutes later, I go to the same URL, I may get that same thing, or I may get nothing, or I may get something different. And it just is. It is what it is at any given moment. That's what the web primarily is. There are exceptions to this, of course. There are applications on the web, like Wikipedia, for example, which is fundamentally based on a version control system. And you can go back and you can see all the various representations of what was available from a given URL. But for the web overall, it's not like that. And so that's where the Wayback Machine steps in. That's where we provide a time-based view for URLs that we have been able to access and that we've been able to archive and then organize and make available to our patrons.
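As an editorial aside, the three outcomes described above (the page is still there, the page is gone, or the URL now serves different material) can be told apart mechanically once an earlier copy exists. The following is a minimal sketch, not the Wayback Machine's actual method: the function names, the use of SHA-256 fingerprints, and the rule that any status of 400 or above counts as link rot are all assumptions made for illustration.

```python
import hashlib
from typing import Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest used to compare two copies of a page."""
    return hashlib.sha256(body).hexdigest()


def classify(archived_body: bytes, live_status: int, live_body: Optional[bytes]) -> str:
    """Compare an archived copy of a URL with what the live web returns now.

    Hypothetical rule set for illustration:
    - an error status (>= 400) or no body at all is treated as link rot
    - a successful response whose bytes differ is treated as content drift
    - otherwise the page is considered unchanged
    """
    if live_status >= 400 or live_body is None:
        return "link rot"
    if fingerprint(live_body) != fingerprint(archived_body):
        return "content drift"
    return "unchanged"


if __name__ == "__main__":
    old_copy = b"<html>Original article</html>"
    print(classify(old_copy, 404, None))
    print(classify(old_copy, 200, b"<html>Something else</html>"))
    print(classify(old_copy, 200, old_copy))
```

Without the archived copy there is nothing to compare against, which is the point made above about the web lacking a version control system. A real comparison would also need to ignore incidental differences such as timestamps or rotating ads, which a byte-level hash does not do.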
You're talking a lot about URLs, which is inherently web-focused. I think a lot about the web. I run a web-based business. I've been watching the web change, especially with things like Google search changing and AI changing the web in different ways. You obviously have the longest view, right? You have the widest view of the web as it's changed. Do you see an acceleration of the web's decline?
Do you see the web changing in any significant way that other people might be missing? What do you think is happening right now?
Where to start? I mean, those are some big questions. A very general statement that I can make is that about a third of the old web measured in, say, 10 or 15 years or something like that is gone. So about a third. In some cases, it's less, and in some cases, it's more. And certainly for an individual website that may have had millions of pages, like GeoCities, for example, it's 100% gone, right?
So it's just not there on the live web. But it turns out that in more than two-thirds of the cases that we've looked at where a given URL is no longer available, it is available through the Wayback Machine. So one way of looking at that is saying that instead of saying that maybe a third of the old web is gone, maybe a ninth of the old web is gone.
And once again, these are very broad generalizations because much of that material was backed up and can be accessed through web.archive.org from the Wayback Machine. But you asked a different question.
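The "a third becomes a ninth" estimate above is a short piece of arithmetic, sketched here with exact fractions. The variable names are illustrative, and the inputs are the round figures quoted in the conversation (about a third gone from the live web, more than two-thirds of that recoverable), so the result is a rough upper bound rather than a measurement.

```python
from fractions import Fraction

# Round figures quoted in the conversation (broad generalizations).
gone_from_live_web = Fraction(1, 3)   # share of old URLs no longer on the live web
recoverable_share = Fraction(2, 3)    # share of those still available via the archive

# What is gone from the live web AND not recoverable from the archive.
truly_gone = gone_from_live_web * (1 - recoverable_share)

print(truly_gone)  # 1/9
```

Because the recoverable share is described as "more than two-thirds," the truly lost portion under these assumptions would be somewhat less than a ninth.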
Well, I'm hoping that you're going to say things are getting better. I'm worried you're going to- No, they're getting worse.
I don't know. They're getting different, right? So things are changing.
The most optimistic take of all is they're getting different. Yeah.
So first of all, let's look a little bit at why things go away. There are very benign reasons why things go away. Maybe a company has simply gone out of business or a government has changed. And so there's a new administration. And so you would expect if a company goes out of business, what entity would want to keep that company's website alive, for example, or a publication, right?
Thousands of local news organizations have shut down in the United States over the last 10 or 15 years, for example. News organizations, media organizations are shut down by governments when they go out of favor. When the failed coup happened in Turkey a few years ago, Wikipedia has documented that about 150 media organizations were shut down.
We have a collection of four news sites from Hong Kong, for example, that were shut down for political reasons. Apple Daily was one. In all of those cases, we have really good archives of that material. We have, for example, a full-text searchable index of about a million pages from Gawker.
And for those four news organizations from Hong Kong that I mentioned, we have built a full-text index of the articles from those news sites. But there are many, many other reasons why a given site may go away. Maybe the hard drives that the website was running on crashed. Or maybe there was just a change in the content management system.
And when the upgrade was done, the people doing the engineering behind it didn't put in the redirects. And so all those old parts of the site are no longer available. I used to work for NBC News. And I mean, we had more than 100 websites that we were running at one point. And when we were doing upgrades, the last thing we'd be thinking about is the old stuff.
It'd all be like, well, how do we meet the deadline to get the new stuff out?
I feel like every person who's ever worked in product at a media company is experiencing second order body horror right now because of what you're describing.
Many of those conditions are still with us. They're not fundamentally changing. For those reasons, stuff still is going to atrophy. Also, as the web gets older, the older stuff gets older too. People die. The legacy often of an individual's efforts then falls on the heirs or their friends. I can't tell you.
Literally every day here at the Internet Archive, we get communications, principally emails or DMs or things like that, from people saying, "Hey, my husband or this organization I worked with, the person has passed away and we're going to shut down the website. We want to make sure that it's preserved." Often we will have already done that. Here's a recent case.
MTV News was shut down and people said, oh, you know, what did you do? Did you have to jump into action and archive it? It's like, no, no. Our work was done. We had been archiving. I mean, if that was what we had to do, then we would have failed because it's too late, right? Our work had been done over the decades.
We've spoken about why internet preservation is necessary. We have to take a quick break, but when we come back, Mark's going to get into how the Wayback Machine works. We'll be back in just a minute.
Welcome back. I'm talking with the Internet Archive's Mark Graham, director of the Wayback Machine, about the actual structure of it all. Inside of the Internet Archive, how is the Wayback Machine structured? Is that just the front-facing service? Is that also the digitization of the Internet? How does that work?
We call it the Wayback Machine as if it's like a computer that's sitting on somebody's desk. It's actually a whole network of literally hundreds of nodes as part of the Internet Archive's overall infrastructure of thousands of nodes, with more than 100 petabytes of material growing at the rate of more than 60 terabytes a day.
It's a combination of applications that do what's referred to as crawling, which is a process of looking at a URL, looking at a webpage, and then looking at all of the other links, all of the other URLs on that page, and then going to them and then looking at them and then going on and on and on, crawling the web like a spider, metaphorically.
So it's a combination of this crawling and archiving process, as well as the aggregation of all of those archived resources with indexes that make those discoverable. And then they can be recompiled into web pages. And then patrons, millions of patrons a day, come to our sites and they request resources that we have.
Maybe it's a digitized version of a book from archive.org, or maybe it's an archived web page from the Wayback Machine. And then we will present that to them in their browser.
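The crawling process described above (visit a page, collect its links, visit those, and so on) amounts to a graph traversal. The sketch below is a toy illustration only: it walks an in-memory dictionary standing in for the web rather than fetching anything, and the `crawl` function, the `TOY_WEB` data, and the `max_pages` limit are hypothetical names chosen for this example, not the Internet Archive's crawler.

```python
from collections import deque


def crawl(seed, links, max_pages=100):
    """Breadth-first walk over a link graph, starting from one seed URL.

    `links` maps each URL to the URLs found on that page. A real crawler
    would fetch and parse pages here; this toy version only follows the map.
    """
    seen = {seed}            # URLs already queued, so nothing is visited twice
    queue = deque([seed])    # frontier of URLs still to visit
    order = []               # URLs in the order they were "archived"
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


# Hypothetical four-page web used only for this illustration.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}

if __name__ == "__main__":
    print(crawl("a.example/", TOY_WEB))
```

The `seen` set is what keeps the walk from looping forever when pages link back to each other, as the home page and the about page do here.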
You said 60 terabytes a day?
More than that, yeah. Actually, it's something like more than a billion URLs every single day, and that can get pretty quick. It could be like 20,000 URLs a second can be coming into our server. So think of a database that you're writing to 20,000 times a second and you're reading from 5,000 times a second. That's one view into what the Wayback Machine is.
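To put the quoted figures side by side, a quick back-of-the-envelope conversion is sketched below. It assumes exactly one billion URLs a day (the conversation says "more than") and treats the 20,000-per-second figure as a peak rate, which is an interpretation rather than something stated outright.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

urls_per_day = 1_000_000_000    # "more than a billion URLs every single day"
peak_per_second = 20_000        # "20,000 URLs a second can be coming into our server"

average_per_second = urls_per_day / SECONDS_PER_DAY

print(round(average_per_second))                        # 11574
print(round(peak_per_second / average_per_second, 2))   # 1.73
```

Under these assumptions the daily total averages out to roughly 11,600 URLs a second, so a 20,000-per-second burst is a bit under twice the average load.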
That's just a lot of storage and a lot of ongoing storage because you're not just taking the changes, right? You're storing the history. I actually have gone to go look at our old designs on the Verge on the Wayback Machine because it's the easiest way for me to just go remember what the site looked like 10 years ago. So you've got the long history. So you're adding storage every day.
Do you just buy hard drives every day? Are you on Newegg?
Yes, the heading purchase is always with Seagate and others. We buy a lot of hard drives.
Are you buying platters? Are you buying SSDs?
The primary storage medium is spinning disk. I think today we're using 20-terabyte drives. When we started, they were much smaller, of course. Actually, the very, very, very first version of the Wayback Machine, going back almost like 24, 25 years ago, I think we used a tape machine for a little while. But very quickly, our founder, Brewster Kahle, decided that he really wanted the material that we have to be as accessible as possible to people, so that when people wanted something, it wasn't like, oh, we have to go back to the stacks and then find it and then get it. He wanted things to be as immediately available as possible. So spinning disk has been the primary format.
And of course, yes, we use a lot of SSDs and a lot of NVMes and other kinds of memory devices, primarily for indexes and caches and things like that.
So 60 terabytes a day, let's say, 20-terabyte spinning disk hard drives, that's three a day, if my math is correct.
Oh, it's more than that. Yeah, it's more than that because-
I'm just envisioning somebody going to plug in between three and five hard drives a day and then-
More than that because we at least double everything up because-
Sure.
So first of all, we own and operate our own data centers. They are physically distributed. So when we write something, we're actually writing it to more than one location for physical reliability. It's north of six hard drives a day.
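The drive math in this exchange can be written out as a small calculation. This is a rough sketch using the figures mentioned in the conversation (60 terabytes a day, 20-terabyte drives, at least two copies); the function name and its parameters are illustrative, and it ignores formatting overhead, spares, and failures.

```python
import math


def drives_per_day(ingest_tb: float, drive_tb: float, copies: int) -> int:
    """Whole drives needed per day to hold the daily ingest `copies` times over."""
    return math.ceil(ingest_tb * copies / drive_tb)


print(drives_per_day(60, 20, 1))  # 3 (a single copy)
print(drives_per_day(60, 20, 2))  # 6 (everything written to two locations)
```

Since the ingest is "more than" 60 terabytes and the copies are "at least" doubled, six is a floor, which matches the "north of six hard drives a day" figure.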
So I have a SimCity map in my head where you're just an ever-expanding physical footprint. Is there an outer limit? Are you going to take over a city? Is there a desert mountain cave? How does this work?
I doubt that. Caves have their own challenges. We're looking at some interesting things. Some of it is in an abandoned coal mine in Norway. We participated with GitHub a few years ago in something called the Arctic GitHub Repository. And we are looking at some more exotic recording formats for some special-purpose applications.
But frankly, we think that hard drives are going to be the primary medium that we use for some time into the future. We're constantly evaluating options, but it's a kind of a tried and true and reliable format and process. We know how to handle them. We put them into machines that we rack ourselves and they've been serving us well.
We're talking about preserving a very digital, somewhat ephemeral medium on the internet. The actual process of it is extraordinarily physical. You just have to take up space, run wires, and have power and all that.
Electricity and heat and all the rest of that. Well, I should say, if you come and visit our operation in San Francisco, which you should do sometime if you haven't, we have several physical locations. We have physical archives in different locations in the United States and also in Canada. But our headquarters building is an old church, a former church of Christian Scientists, and now it's a temple for knowledge. When you come into our building, you'll see how frugal we are. We've kind of left it the way it was when it was a regular kind of church. We don't have air conditioning or backup generators or anything like that, but we have a lot of hard drives in racks, and we do have some fans.
When it's a hot day here in San Francisco, we open up the windows and ventilate that. Also, people who use the service may know that sometimes we'll go down if the power goes out. We'll be down for a little while, but we're a library. It's okay. We'll be back. The material itself is stored in multiple locations, so it's safe.
How is this all funded? How much does it cost to run and where does the money come from?
Last year, I think we spent probably about $28 million, and I think I'd divide that into three buckets. The first bucket would be earned income, program-related business activity, as they say in the nonprofit world. This is work that we do on behalf of museums and governments and libraries and the like, when they pay us primarily to do web archiving on their behalf or do book digitization.
Another third comes from a very loyal collection of more than 150,000 people who donate money to us every year. A growing number of them are monthly donors, so we're very appreciative of the folks who give us $10, $20, $30 a month. And then the final third comes from a combination of high wealth individuals and foundations.
And is that mix changing over time? As I think about the broader piece of link rot and the ever-expanding nature of the problem, it seems like that funding mix might have to change over time.
It's diversifying. We're certainly looking at ways to continue to diversify it. The monthly donor program is certainly an area that is growing for us. And just, you know, as more and more people use our service and depend on it, frankly, and see the value of it, then more and more of them support us every year so that the number of unique annual donors has been increasing on a fairly consistent basis, and we very much appreciate that. It allows us to do what we do. It is only through the support that we get from our patrons that we're able to continue to work diligently and creatively to try to preserve our world cultural heritage.
We haven't built a full-text index on the entire holdings of the Wayback Machine, maybe someday, but for now we kind of do it on a case-by-case basis.
So it seems like money is not the biggest challenge with Wayback Machine, and that's a good place to be. But then what are the challenges?
But there are other dimensions, though, of this evolving digital world that we live in that are presenting new challenges and new opportunities. Issues like hyper-personalization. The web you experience is different than the web I experience. Even down to a given web page, what you see and what I see may be different because of geography or browser type or what that website knows about us as individuals, our age or our preferences, et cetera. And I'm not just talking about the ads either. You know, those are elements of it. So hyper-personalization is one thing.
The splinterization of the net, often around geopolitical boundaries, where large parts of the internet are just not accessible to other parts of the internet. Certainly we all know about the Great Firewall of China, but there are many, many other examples of that. When Russia invaded Ukraine, many thousands of websites that had traditionally been available from Russia in the West were no longer available.
And then there's the evolution of what we think of as the internet into the web. And now it's this kind of mobile-first environment with apps, and apps are their own kind of special hell of walled-garden content. In a variety of ways, it's very bound technically and often administratively with IDs and passwords and paywalls and all the rest of that. So getting material out of these containers that we think of as apps that live on our phones is challenging.
We've been talking about it with the web, right? The Wayback Machine is centered on the web. There's reasons that websites have gone out of favor. MTV News is a great example. They just couldn't make money running MTV News on the web. It just wasn't happening for them. They shut it down. That's more or less the case for media on the web, probably.
That's why so many news websites are going out of business. That's why local news on the web is going out of business. That's not the case for video platforms.
Right? If you're an independent creator and you're on YouTube, maybe you're making a lot of money. Maybe you're a TikToker making a lot of money. You're inside of that ecosystem, and that's where the money and that's where the advertising is going. None of that has the same ideals or norms of the web, right? Which is that it is available, which is what so much of the internet has been built on: the norms and ideals of the web, that availability is the key.
Three or four million videos are uploaded to YouTube every day. I'm assuming TikTok and the others all have similar amounts. It's a massive amount of information. Are you collecting that as well?
No. We have some archives from some YouTube videos, but you just threw out a number like three or four million a day, and it's nothing near that. This goes to also, why do we archive what we archive and how do we make choices? And the answer in short is there are more than 10,000 different reasons why a given URL may be archived by the Wayback Machine on any given day.
And they are in part selected by the more than 1,000 partners that we have that are primarily librarians that do curation of material that they think should be archived. So we have partnerships with them. We have partnerships with Cloudflare, an infrastructure provider, with WordPress, with Wikipedia. We also offer a service called Save Page Now.
What I'm getting at is this is all pretty based in the web, right? If you capture a web page and it has a YouTube video on it, maybe you'll capture the YouTube video too. But there's a growing body of information that lives on more closed platforms, even if they are exposed to the web. Like Instagram is exposed to the web, but it's not the web.
I would say Instagram is the web, but I would say it's not necessarily the public web because generally speaking, material from Instagram, Facebook, and Threads, basically the Meta properties, is not very accessible unless you have an ID and a password on those services. Even the so-called public pages have limitations for how one can access them. So there are special cases.
We work hard to archive things that people think they want to preserve in some fashion. And so a lot of material on some of these social platforms is archived by patrons who enter URLs into the Wayback Machine.
Does the shift to people doing more and more of their publishing on closed platforms threaten the nature of what you're doing? If all the information is going from the open web to Discord channels, I'm guessing you're not able to archive all that. And that seems like a big problem for the information landscape.
It absolutely is making some of the work that we do more challenging. But I actually think there are larger implications here. It's hurting our democracy. It's hurting our culture. It's hurting our ability to have shared conversations and shared understandings of the world that we live in. But this isn't necessarily a technical thing either, because we can make choices.
I can watch one TV channel and you can watch another, and we get radically different worldviews. But in those cases, we have choices at least, and we can flip between one and the other. If the switching cost is higher, where it's a paywall, for example, and where the switching cost is an actual dollar-sign cost, then can I afford to pay for the 30 or 40 different news sources that I would like to have access to as an informed citizen? There is a real cost associated with that, or with using this app or that app and not being able to bring this material together and aggregate it. So the issue that you're addressing, I think, is one of the critical issues of our times.
And yes, certainly it affects the work that we do here as archivists, but I think it has much broader and profound social implications. There's a lot of material that's publicly available. I keep using this phrase public web, and I'm making a distinction here. Things you can get to without an ID and a password.
We have to take another quick break. We'll be right back.
Welcome back. I'm talking with Mark Graham, director of the Wayback Machine, about the challenges of preserving the internet when everything is not only ephemeral, but also more and more closed off. Mark just mentioned the concept of the public web, meaning anything you can get to without an ID and a password. And that brings us to a new challenge for preservation.
Up until a year or so ago, maybe two, the idea that the Wayback Machine would just cycle through the internet to read and preserve websites was more or less seen as a universal good. But now there's a new crop of players scraping websites, and it's a lot more contentious. All the generative AI companies are scraping the entire web and using it to train their LLMs.
And that has made a lot of people very upset and very litigious. We've had some of them on the show. The New York Times and a bunch of artists and organizations have filed plenty of lawsuits over this practice.
That's made a lot of people suddenly aware of something called robots.txt, the file which dictates which web pages third-party crawlers and other automated tools are allowed to visit on a website. Lots of websites are now making changes to block these scrapers, and it's called into question one of the oldest and most widely used practices on the open web, one that's vital for preservation.
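For readers who have not looked at one, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The file contents, the bot name `ExampleAIBot`, and the `example.com` URLs are made up for illustration and do not come from any real site. robots.txt is a voluntary convention, so this only shows what a well-behaved crawler would check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for url in ("https://example.com/articles/1",
                    "https://example.com/private/notes"):
            print(agent, url, allowed(agent, url))
```

A site that wants to shut out one scraper while staying open to others would add a block like the first one above for each crawler name it wants to exclude.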
Has that affected your work at the Internet Archive?
This is an evolving landscape that we live in of people's perceptions and realities. With the advent of the AI companies and large amounts of material that they've gathered from the public web and then used in new and different ways, there have been changes by some of the folks who are making that material available. Many organizations are kind of battening down the hatches.
So far, we've been doing okay. We've actually been working cooperatively with many different platforms for a long time. And we also take measures to respect the intellectual property and the rights of content creators. The material from the Wayback Machine is generally only available as a playback of an individual URL. We don't support the bulk downloading of the material, in general terms.
There are exceptions to that. There's a project we do, for example, with the Library of Congress and the National Archives, where we archive material from U.S. government websites. By making the material available within a specific controlled environment, we've been able to have good relationships with most folks out there.
For example, Reddit recently put out an announcement where they said, you know, we're locking things down, but we have an agreement with the Internet Archive. Reddit considers the work that they do with us to be a legitimate and beneficial service to the patrons of Reddit.
What's interesting about that is Reddit's kind of an old company. It's like an old web company, and there's a bunch of web people there who understand what the Internet Archive is and why it's valuable, and they probably use it. And then you've got a bunch of new companies who might have new leaders who don't understand the ideals of the web.
And then you've got the AI companies, who I think a lot of people woke up last year and said, there's something called robots.txt, and it should maybe pay us money. And now everyone's confused, right? Has that meaningfully changed what you do, the idea that this should be a set of business agreements or a set of legal agreements? Or do you get to just run around saying you're a library?
Well, we are a library. I'm not a lawyer. I don't have those kind of conversations. I get up every day and I ask myself the question, how can we do a better job archiving more of the public web in a way that is respectful, in a way that is useful, in a way that is helping to preserve the cultural heritage of our times?
I guess, much more directly, my question is, a bunch of companies took advantage of the open web to build AI models, and now the rest of the open web might get ornerier or more closed down even. Is that making your job harder?
We're doing fine. You know, so there's challenges every day. But honestly, that's not one of the ones that's keeping me up at night. No.
Let's talk about some solutions for all of these changes kind of broadly. I'm thinking about just the amount of culture that is uploaded to TikTok every day. It is where the culture is happening right now. That is the most ephemeral of all. It doesn't even feel searchable in a real way. It comes, it goes. Obviously, the algorithm creates an infinite array of filter bubbles for people.
Is it even possible to capture all of that or organize it or make it understandable? Because I'm thinking about historians 20 years from now trying to understand a meme today, and I have no idea how they're going to do it.
Some of it is, yeah. Actually, TikTok is one of those platforms that we're doing a fair amount of archiving on. So I would say yes. And in some of these cases, like say TikTok or Telegram or Rumble or let's say Truth Social or some of these other social platforms, we're not trying to get everything. That's far too much, but we are trying to get a fair amount.
In some cases, we're working with domain experts, subject matter experts, et cetera, who are helping to guide us and to get things that may be culturally or historically more significant than others. You mentioned memes, for example. And so you can take a meme as a meme and as a vector into, okay, let's try to collect material related to this meme.
So there are any one of a number of methods that we might incorporate to try to help prioritize material that we would get from some of these platforms.
When you think about all of those opportunities, you're going to have to prioritize somehow, right? Six hard drives a day, or you can go to 12 hard drives a day. How do you make those kinds of prioritization decisions?
First of all, there's a lot of people that work here. More than 100 people work at the Internet Archive. We do a variety of different things from an engineering perspective or a program perspective. And yeah, there are choices that are made. But I would say we're mostly constrained by just our own creativity and our own imagination.
We have a fair amount of latitude as we work here to explore our interests as individuals and as an organization. But with a really strong focus on just trying to do a really good job of the things that we set out to do. And admittedly, the North Star, universal access to all knowledge, is a high bar.
But the luxury of being able to pursue that with a lot of resources is something that I have a great deal of gratitude for. And I know that the people that I work with do as well. And I think that the millions of patrons that use our service every day do also.
Preservation is a high, noble goal. I work in the media. It's fine for you to preserve everything that we make. Some people want stuff deleted. How do you balance preservation and privacy?
We are respectful of rights holders. And one of the ways that we are respectful is that we do respond to requests to have things excluded from the Wayback Machine, from rights holders that make legitimate requests. We actually have human beings that check these things out. We don't just say, oh, so-and-so said, just take that out. But we do consider the request.
Sometimes if the person is a public official, then we will have to weigh off their request with maybe a broader public right to know. But we work those things out on a case-by-case basis.
You've made some content moderation decisions along the way as well. Two years ago, you removed Kiwi Farms, sort of a notorious forum for people who don't behave very well. How do you make that kind of decision?
We weigh off the evidence, the information that's available to us. That particular case, for example, would live in a category, I would say, where there are times when we learn about situations where there is what may be considered a high probability of real-world harm, and then we have to make a decision.
The fact of the matter is there's material that is made available on the web that does cause real human suffering. And there are cases in which we have a duty to care. Another one maybe that is with child sexual exploitation material, for example. I don't think people are really questioning that too much, right? You say, oh, they took that down or something. Well, yeah, of course.
And first of all, that's the law. But there's other cases, doxing, for example, or harassment, or where people's personal safety or other risks have to be taken into consideration. So these are not decisions that are made lightly. We have policy that helps guide us, but very carefully and diligently. And we reconsider, too. That's another thing, too. It's not just like, oh, that was done.
And that's never to be looked at again. No. Over time, situations change, and the context of material in a new light of a new day may lead to different kinds of decisions.
Obviously, there's a lot of systems at play here. You sometimes partner with organizations. Entire websites also come and go with the whim of corporations beyond most people's control. But there is a personal element. How should individuals think about all of this?
If you create something and you want it to stick around for a while, then take care. If you see something, save something. The Internet Archive is a free resource. It's available to anyone with a browser and a connection to the Internet. Just go to web.archive.org. If you'd like to preserve a URL, put it into the Save Page Now feature on the bottom right.
Write to us at info@archive.org. If you've got a website that you think may be at risk, send us a note, and we'll make sure that we do a really good job of preserving it.
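As a rough illustration of the "save something" advice, the sketch below builds two Wayback Machine URLs for a page: one that requests a capture through the Save Page Now path, and one that queries the public availability endpoint for an existing snapshot. The URL patterns are the commonly used public ones and the helper names are hypothetical; nothing here was described in the conversation, and the code only constructs strings without making any network request.

```python
from typing import Optional
from urllib.parse import urlencode


def save_page_now_url(target: str) -> str:
    """URL that asks the Wayback Machine to capture `target` when opened in a browser."""
    return "https://web.archive.org/save/" + target


def availability_query_url(target: str, timestamp: Optional[str] = None) -> str:
    """URL for checking whether a snapshot of `target` exists.

    `timestamp` is an optional YYYYMMDD string asking for the closest capture.
    """
    params = {"url": target}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


if __name__ == "__main__":
    page = "https://example.com/my-article"  # placeholder page, not a real target
    print(save_page_now_url(page))
    print(availability_query_url(page, "20240101"))
```

Opening the first URL in a browser is equivalent to pasting the page address into the Save Page Now box on web.archive.org.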
Sounds good. Thank you so much, Mark. I really appreciate it.
You're welcome.
Thanks again to Mark Graham for joining me on the show, and thanks again to the Internet Archive and the Wayback Machine. We depend on their work all the time here at The Verge. If you have thoughts about this episode or what you'd like to hear more of, you can email us at decoder at theverge.com. We really do read all the emails. Or you can hit me up directly on Threads. I'm at reckless1280.
We also have a TikTok, which you should check out. While there's a TikTok, it's at decoderpod. It's a lot of fun. If you like Decoder, please share it with your friends and subscribe wherever you get your podcasts. If you really like the show, hit us with that five-star review. Decoder is a production of The Verge and part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt.
Our editor is Callie Wright. Our supervising producer is Liam James. The Decoder music is by Breakmaster Cylinder. We'll see you next time.