Mark Graham

Appearances

Decoder with Nilay Patel

How the Wayback Machine is fighting linkrot

1023.98

We call it the Wayback Machine as if it's like a computer that's sitting on somebody's desk. It's actually a whole network of literally hundreds of nodes, part of the Internet Archive's overall infrastructure of thousands of nodes: more than 100 petabytes of material, growing at the rate of more than 60 terabytes a day.

1044.429

It's a combination of applications that do what's referred to as crawling, which is a process of looking at a URL, looking at a webpage, and then looking at all of the other links, all of the other URLs on that page, and then going to them and then looking at them and then going on and on and on, crawling the web like a spider, metaphorically.
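As a rough illustration of the crawling loop described here, the following is a minimal Python sketch. It is not the Internet Archive's crawler. The `crawl` function, the `LinkExtractor` class, the `max_pages` limit, and the `fetch` callable (URL in, HTML string or `None` out) are all hypothetical names chosen for the example, and a real crawler would also need politeness delays, robots handling, and error recovery.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting at seed_url.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns the URLs in the order they were visited.
    """
    queue = deque([seed_url])
    seen = {seed_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page and drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing `fetch` in as a parameter keeps the sketch testable without a network: a dictionary's `.get` method can stand in for the web.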

1065.383

So it's a combination of this crawling and archiving process, as well as the aggregation of all of those archived resources with indexes that make those discoverable. And then they can be recompiled into web pages. And then patrons, millions of patrons a day, come to our sites and they request resources that we have.

1090.093

Maybe it's a digitized version of a book from archive.org, or maybe it's an archived web page from the Wayback Machine. And then we will present that to them in their browser.

1102.833

More than that, yeah. Actually, it's something like more than a billion URLs every single day, and that can get pretty quick. It could be like 20,000 URLs a second coming into our server. So think of a database that you're writing to 20,000 times a second and you're reading from 5,000 times a second. That's one view into what the Wayback Machine is.
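To put the two figures side by side, a quick back-of-the-envelope calculation helps. This sketch only converts the quoted daily total into an average per-second rate. The function name is made up for the example, and reading the 20,000-per-second figure as a peak rather than an average is an assumption, not something stated in the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(events_per_day):
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


avg = average_per_second(1_000_000_000)
print(f"{avg:,.0f} URLs/second on average")  # 11,574 URLs/second on average
```

So a billion URLs a day works out to roughly 11,600 a second on average, which is consistent with bursts of around 20,000 a second.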

1145.098

Yes, the heading purchase is always with Seagate and others. We buy a lot of hard drives.

1152.821

The primary storage medium is spinning disk. I think today we're using 20 terabyte drives. When we started, they were much smaller, of course. Actually, the very, very, very first version of the Wayback Machine, going back almost like 24, 25 years ago, I think we used a tape machine for a little while. But very quickly, our founder, Brewster Kahle, decided that he really wanted

1175.823

the material that we have to be as accessible as possible to people, so that when people wanted something, it wasn't like, oh, we have to go back to the stacks and then find it and then get it. He wanted things to be as immediately available as possible. So spinning disk has been the primary format.

1193.611

And of course, yes, we use a lot of SSDs and a lot of NVMe drives and other kinds of memory devices, primarily for indexes and caches and things like that.

1218.633

So first of all, we own and operate our own data centers. They are physically distributed. So when we write something, we're actually writing it to more than one location for physical reliability. It's north of six hard drives a day.
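The "six hard drives a day" figure lines up with other numbers quoted in this interview: growth of more than 60 terabytes a day and 20 terabyte drives. The sketch below shows the arithmetic under stated assumptions. It assumes exactly two copies of everything (the interview only says "more than one location"), treats the 60 terabytes as new data before replication, and ignores filesystem and redundancy overhead. The function name is hypothetical.

```python
import math


def drives_per_day(growth_tb_per_day, drive_tb, copies):
    """Whole drives needed per day to hold `copies` replicas of the new data."""
    return math.ceil(growth_tb_per_day * copies / drive_tb)


print(drives_per_day(60, 20, copies=2))  # 6
```

Because both the growth rate and the copy count are quoted as lower bounds, the real number would be somewhat higher, which fits "north of six".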

1243.827

I doubt that. Caves have their own challenges. We're looking at some interesting things. Some of our material is in an abandoned coal mine in Norway. We participated with GitHub a few years ago in something called the Arctic GitHub Repository. And we are looking at some more exotic recording formats for some special-purpose applications.

1267.303

But frankly, we think that hard drives are going to be the primary medium that we use for some time into the future. We're constantly evaluating options, but it's a kind of a tried and true and reliable format and process. We know how to handle them. We put them into machines that we rack ourselves and they've been serving us well.

1301.468

Electricity and heat and all the rest of that. Well, I should say, if you come and visit our operation in San Francisco, which you should do sometime if you haven't, we have several physical locations. We have physical archives in different locations in the United States and also in Canada. But our headquarters building is an old church, a former church of Christian Scientists.

1320.655

and now it's a temple for knowledge. When you come into our building, you'll see how frugal we are. We've kind of left it the way it was when it was a regular kind of church. We don't have air conditioning or backup generators or anything like that, but we have a lot of hard drives in racks, and we do have some fans.

1339.131

When it's a hot day here in San Francisco, we open up the windows and ventilate that. Also, people who use the service may know that sometimes we'll go down if the power goes out. We'll be down for a little while, but we're a library. It's okay. We'll be back. The material itself is stored in multiple locations, so it's safe.

1361.266

Last year, I think we spent probably about $28 million, and I think I'd divide that into three buckets. The first bucket would be earned income, program-related business activity, they say, in the nonprofit world. This is work that we do on behalf of museums and governments and libraries and the like, when they pay us primarily to do web archiving on their behalf or do book digitization.

1386.029

Another third comes from a very loyal collection of more than 150,000 people who donate money to us every year. A growing number of them are monthly donors, so we're very appreciative of the folks who give us $10, $20, $30 a month. And then the final third comes from a combination of high wealth individuals and foundations.

1423.007

It's diversifying. We're certainly looking at ways to continue to diversify it. The monthly donor program is certainly an area that is growing for us. And just, you know, as more and more people use our service and depend on it, frankly, and see the value of it, then more and more of them support us every year so that the number of unique annual donors has been increasing.

1444.281

on a fairly consistent basis, and we very much appreciate that. It allows us to do what we do. It is only through the support that we get from our patrons that we're able to continue to work diligently and creatively to try to preserve our world cultural heritage.

1462.353

We haven't built a full-text index on the entire holdings of the Wayback Machine, maybe someday, but for now we kind of do it on a case-by-case basis.

1478

But there's other dimensions, though, of this evolving digital world that we live in that are representing new challenges and new opportunities. Issues like hyper-personalization. The web you experience is different than the web I experience. Even down to a given web page, what you see and what I see may be different because of

1498.275

geography or browser type or what that website knows about us as individuals, our age or our preferences, et cetera. And I'm not just talking about the ads either. You know, this is elements of it. So hyper-personalization is one thing. The splinterization of the net often around geopolitical boundaries where large parts of the internet are just not accessible to other parts of the internet.

1524.072

Certainly we all know about the Great Firewall of China. But there are many, many other examples of that. When Russia invaded Ukraine, many thousands of websites that had traditionally been available from Russia in the West are no longer available. And then there's the evolution of what we think of as the internet into the web.

1545.677

And now it's this kind of like mobile-first kind of environment with apps, and apps are their own kind of special hell of walled-garden content. In a variety of ways, it's very bound technically and often administratively, with IDs and passwords and paywalls and all the rest of that. So getting material out of these containers that we think of as apps that live on our phones is challenging.

1629.447

No. We have some archives from some YouTube videos, but you just threw out a number like three or four million a day, like nothing near that. This goes to also like, why do we archive what we archive and how do we make choices? And the answer in short is there are more than 10,000 different reasons why a given URL may be archived by the Wayback Machine on any given day.

1652.12

And they are in part selected by the more than 1,000 partners that we have that are primarily librarians that do curation of material that they think should be archived. So we have partnerships with them. We have partnerships with Cloudflare, an infrastructure provider, with WordPress, with Wikipedia. We also offer a service called Save Page Now.

1693.951

I would say Instagram is the web, but I would say it's not necessarily the public web, because generally speaking, material from Instagram, Facebook, and Threads (basically the Meta properties) is not very accessible unless you have an ID and a password on those services. Even the so-called public pages have limitations for how one can access them. So there are special cases.

1715.798

We work hard to archive things that people think they want to preserve in some fashion. And so a lot of material on some of these social platforms is archived by patrons who enter URLs into the Wayback Machine.

1746.747

It absolutely is making some of the work that we do more challenging. But I actually think there's larger implications here. It's hurting our democracy. It's hurting our culture. It's hurting our ability to have shared conversations and shared understandings of the world that we live in. But this isn't necessarily a technical thing too, because we can make choices.

1769.687

I can watch one TV channel and you can watch another, and we get radically different worldviews. But in those cases, we have choices at least, and we can flip between one and the other. If the switching cost is higher, where it's a paywall, for example, and where the switching cost is an actual dollar sign cost, then

1788.194

Can I afford to pay for the 30 or 40 different news sources that I would like to have access to as an informed citizen? There's a real cost associated with that, or the cost of using this app or that app and not being able to bring this material together and aggregate it. So the issue that you're addressing, I think, is one of the critical issues of our times.

1811.847

And yes, certainly it affects the work that we do here as archivists, but I think it has much broader and profound social implications. There's a lot of material that's publicly available. I keep using this phrase public web, and I'm making a distinction here. Things you can get to without an ID and a password.

2041.038

This is an evolving landscape that we live in of people's perceptions and realities. With the advent of the AI companies and large amounts of material that they've gathered from the public web and then used in new and different ways, there have been changes by some of the folks who are making that material available. Many organizations are kind of closing down the hatches.

2065.04

So far, we've been doing okay. We've actually been working cooperatively with many different platforms for a long time. And we also take measures to respect the intellectual property and the rights of content creators. The material from the Wayback Machine is... generally only available as a playback of an individual URL. We don't support the bulk downloading of the material.

2091.72

In general terms, there are exceptions to that. There's a project we do, for example, with the Library of Congress and the National Archives, where we archive material from U.S. government websites. Making the material available within a specific controlled environment, we've been able to have good relationships with most folks out there.

2109.63

For example, Reddit recently put out an announcement where they said, you know, we're locking things down, but we have an agreement with the Internet Archive. Reddit considers the work that they do with us to be a legitimate and beneficial service to the patrons of Reddit.

2161.206

Well, we are a library. I'm not a lawyer. I don't have those kind of conversations. I get up every day and I ask myself the question, how can we do a better job archiving more of the public web in a way that is respectful, in a way that is useful, in a way that is helping to preserve the cultural heritage of our times? I guess...

2197.83

We're doing fine. You know, so there's challenges every day. But honestly, that's not one of the ones that's keeping me up at night. No.

2238.241

Some of it is, yeah. Actually, TikTok is one of those platforms that we're doing a fair amount of archiving on. So I would say yes. And in some of these cases, like say TikTok or Telegram or Rumble or let's say Truth Social or some of these other social platforms, we're not trying to get everything. That's far too much, but we are trying to get a fair amount.

2257.105

In some cases, we're working with domain experts, subject matter experts, et cetera, who are helping to guide us and to get things that may be culturally or historically more significant than others. You mentioned memes, for example. And so if you take a meme as a meme and as a vector into, okay, let's try to collect material related to this meme.

2277.651

So there are any one of a number of methods that we might incorporate to try to help prioritize material that we would get from some of these platforms.

2297.323

First of all, there's a lot of people that work here. More than 100 people work at the Internet Archive. We do a variety of different things from an engineering perspective or a program perspective. And yeah, there are choices that are made. But I would say we're mostly constrained by just our own creativity and our own imagination.

2315.526

We have a fair amount of latitude as we work here to explore our interests as individuals and as an organization. But with a really strong focus on just trying to do a really good job of the things that we set out to do. And admittedly, the North Star, universal access to all knowledge, is a high bar. But we have the...

2337.256

The luxury of being able to pursue that with a lot of resources is something that I have a great deal of gratitude for. And I know that the people that I work with do as well. And I think that millions of patrons that use our service every day also.

2361.491

And we are respectful of rights holders. And one of the ways that we are respectful is that we do respond to requests to have things excluded from the Wayback Machine. So rights holders that make legitimate requests. We actually have human beings that check these things out. We just don't say, oh, so-and-so said, just take that out. But we do consider the request.

2382.861

Sometimes if the person is a public official, then we will have to weigh off their request with maybe a broader public right to know. But we work those things out on a case-by-case basis.

2405.399

We weigh off the evidence, the information that's available to us. That particular case, for example, would live in a category, I would say, where there are times when we learn about situations where there is what may be considered a high probability of real-world harm, and then have to make a decision.

2426.491

The fact of the matter is there's material that is made available on the web that does cause real human suffering. And there are cases in which we have a duty to care. Another one maybe that is with child sexual exploitation material, for example. I don't think people are really questioning that too much, right? You say, oh, they took that down or something. Well, yeah, of course.

2448.769

And first of all, that's the law. But there's other cases, doxing, for example, or harassment, or where people's personal safety or other risks have to be taken into consideration. So these are not decisions that are made lightly. We have policy that helps guide us, but very carefully and diligently. And we reconsider, too. That's another thing, too. It's not just like, oh, that was done.

2470.956

And that's never to be looked at again. No. Over time, situations change, and the context of material in a new light of a new day may lead to different kinds of decisions.

2497.714

If you create something and you want it to stick around for a while, then take care. If you see something, save something. The Internet Archive is a free resource. It's available to anyone with a browser and a connection to the Internet. Just go to web.archive.org. If you'd like to preserve a URL, put it into the Save Page Now feature on the bottom right. Write to us.

2517.663

Write to us at info@archive.org. If you've got a website that you think may be at risk, send us a note, and we'll make sure that we do a really good job of preserving it.
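For readers who would rather script this than use the browser, here is a small Python sketch of the URL patterns involved. It relies on my understanding of the publicly documented Wayback availability endpoint and of the `web.archive.org/save/` address used by Save Page Now. The endpoint, its parameters, and the JSON shape should be checked against the current Internet Archive documentation before relying on them. The helper names are invented for this example, and the sketch deliberately makes no network calls.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; verify against current Internet Archive documentation.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build a query URL asking whether `url` has been archived.

    `timestamp` is an optional YYYYMMDDhhmmss prefix such as "2015" or "20150101".
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the playback URL of the closest capture, or None if there is none."""
    payload = json.loads(response_text)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def save_page_now_url(url):
    """Address that asks Save Page Now to capture `url` when it is visited."""
    return f"https://web.archive.org/save/{url}"
```

Keeping URL construction and response parsing separate from the HTTP call means the logic can be exercised with a canned response, and any HTTP client can be plugged in afterwards.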

252.807

The Wayback Machine is a service of the Internet Archive that is used to provide a time machine to the web. We have been archiving much of the public web for nearly three decades now, and we make those archives available through the Wayback Machine.

2529.996

You're welcome.

278.374

The Internet Archive is a nonprofit organization with a mission of universal access to all knowledge. We pursue that mission in a variety of ways, including archiving, as I said, much of the public web. We work toward acquiring and digitizing and preserving and organizing and making available a whole range of material that is kind of grouped into media types.

305.684

So one might be books, for example, and we digitize more than 4,000 books every day. Or television news. We archive television news, both from the United States and from other countries around the world. Or journal articles. We have a collection of more than 30 million publicly accessible journal articles available from scholar.archive.org.

328.629

78s, those old things on shellac, we've got hundreds of thousands of those that we have digitized. Those were donated to us by the Boston Public Library. So I could go on and on. We identify media, recording media that people have been publishing in for some period of time.

345.595

If it's digital, like born digital, then that makes life easier because we're able to then capture that material in some fashion on our hard drives and preserve it. But maybe it's analog, maybe it's paper or microfiche or microfilm or vinyl or shellac, as I said. In that case, we have to first digitize the material, in some cases using the bespoke hardware and software setups that we have developed.

370.651

And once we've digitized it, then we can preserve it and organize it and make it available. At the end of the day, this is what this is about. This is about the voices of humanity expressed in a variety of media that in many cases are being stored and made available on a series of platforms that are inherently ephemeral, that have a history of disappearing. One of the terms is link rot.

396.634

That's talking about material that may have been available at a given URL, a given address on the web, at a given point in time, and is no longer there. You go to that URL and one of two things is going to be true. Well, three things, I guess. The first thing is that what you're looking for is there. Success.

415.649

But the second is that you get a page not found or some other error message, a 500 error message, maybe something like that on the server end. So you just can't get the material. It's just no longer there at that URL. Now, that material might be available via another URL. It may have been moved somewhere, but you may not necessarily know that if there's no redirect in place.

438.374

But the other thing that can happen is that at that same URL, there may be different material. That's referred to as content drift. Same URL, different material. Well, how would you even know what the prior material was or that there even was prior material at that URL? You wouldn't. Why? Because there's no version control system for the web.
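The three outcomes described here (the material is still there, the URL returns an error, or the URL now serves something different) can be sketched as a small classifier. This is an illustration only. The `LinkState` names, the `classify` function, and the use of a SHA-256 digest as a stand-in for "same material" are assumptions made for the example. Real pages often change in trivial ways, such as ads or timestamps, so an exact hash comparison would report far more drift than a human would.

```python
import hashlib
from enum import Enum


class LinkState(Enum):
    OK = "same material is still there"
    LINK_ROT = "material is gone from this URL"
    CONTENT_DRIFT = "same URL, different material"


def fingerprint(body: bytes) -> str:
    """Stable digest of a page body, used to compare two fetches."""
    return hashlib.sha256(body).hexdigest()


def classify(status_code, body, earlier_fingerprint):
    """Classify one fetch of a URL against a fingerprint recorded earlier."""
    if status_code >= 400:
        # Covers "page not found" (404) as well as server errors (5xx).
        return LinkState.LINK_ROT
    if fingerprint(body) != earlier_fingerprint:
        return LinkState.CONTENT_DRIFT
    return LinkState.OK
```

The key point the sketch makes concrete is that detecting drift requires an earlier record to compare against, which is exactly what the live web does not keep.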

460.689

I go to a URL, I may get something, and then five minutes later, I go to the same URL, I may get that same thing, or I may get nothing, or I may get something different. And it just is. It is what it is at any given moment. That's what the web primarily is. There are exceptions to this, of course. There are applications on the web, like Wikipedia, for example.

482.629

which is fundamentally based on a version control system. And you can go back and you can see all the various representations of what was available from a given URL. But for the web overall, it's not like that. And so that's where the Wayback Machine steps in. That's where we provide a time-based view for...

501.889

for URLs that we have been able to access and that we've been able to archive and then organize and make available to our patrons.
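One way to picture a "time-based view" of a URL is an index that maps each URL to a sorted list of capture times and answers the question "what did we have as of this moment?". The toy class below is a sketch of that idea, not a description of how the Wayback Machine's indexes are actually built. `CaptureIndex`, `add`, and `as_of` are hypothetical names, and timestamps are assumed to be fixed-width `YYYYMMDDhhmmss` strings so that string order matches chronological order.

```python
from bisect import bisect_right


class CaptureIndex:
    """Toy index: URL -> sorted list of capture timestamps (YYYYMMDDhhmmss strings)."""

    def __init__(self):
        self._captures = {}

    def add(self, url, timestamp):
        """Record a capture, keeping each URL's timestamps in sorted order."""
        stamps = self._captures.setdefault(url, [])
        stamps.insert(bisect_right(stamps, timestamp), timestamp)

    def as_of(self, url, timestamp):
        """Latest capture of `url` at or before `timestamp`, or None."""
        stamps = self._captures.get(url, [])
        position = bisect_right(stamps, timestamp)
        return stamps[position - 1] if position else None
```

Binary search keeps each lookup logarithmic in the number of captures of a URL, which matters when a popular page has been captured many thousands of times.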

540.885

Where to start? I mean, those are some big questions. A very general statement that I can make is that about a third of the old web measured in, say, 10 or 15 years or something like that is gone. So about a third. In some cases, it's less, and in some cases, it's more. And certainly for an individual website that may have had millions of pages, like GeoCities, for example, it's 100% gone, right?

566.748

So it's just not there on the live web. But it turns out that in more than two-thirds of the cases that we've looked at where a given URL is no longer available, it is available through the Wayback Machine. So one way of looking at that is saying that instead of saying that maybe a third of the old web is gone, maybe a ninth of the old web is gone.
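The step from "a third" to "a ninth" is a single multiplication, shown here with exact fractions. The variable names are made up for the example, and the inputs are the rough figures quoted above (about a third gone from the live web, more than two-thirds of those cases archived), so the result is an upper-bound estimate rather than a measurement.

```python
from fractions import Fraction

gone_from_live_web = Fraction(1, 3)  # "about a third of the old web ... is gone"
recovered_share = Fraction(2, 3)     # archived in "more than two-thirds" of those cases

# Share of the old web that is gone from the live web and also not in the archive.
unrecoverable = gone_from_live_web * (1 - recovered_share)
print(unrecoverable)  # 1/9
```

Since the recovered share is quoted as "more than" two-thirds, the truly lost share would be somewhat below one ninth under these figures.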

590.967

And once again, these are very broad generalizations because much of that material was backed up and can be accessed through web.archive.org from the Wayback Machine. But you asked a different question.

609.001

I don't know. They're getting different, right? So things are changing.

618.135

So first of all, let's look a little bit at why things go away. There are very benign reasons why things go away. Maybe a company has simply gone out of business or a government has changed. And so there's a new administration. And so you would expect if a company goes out of business, what entity would want to keep that company's website alive, for example, or a publication, right?

640.869

Thousands of local news organizations have shut down in the United States over the last 10 or 15 years, for example. News organizations, media organizations are shut down by governments when they go out of favor. When the failed coup happened in Turkey a few years ago, Wikipedia has documented that about 150 media organizations were shut down.

663.855

We have a collection of four websites, four news sites from Hong Kong, for example (Apple Daily was one), that were shut down for political reasons. In all of those cases, we have really good archives of that material. We have, for example, a full-text searchable index of about a million pages from Gawker.

686.423

And for those four news organizations from Hong Kong that I mentioned, we have built a full-text index of the articles from those news sites. But there are many, many other reasons why a given site may go away. Maybe the hard drives that the website was running on crashed. Or maybe there was just a change in the content management system.

706.6

And when the upgrade was done, the people doing the engineering behind it didn't put in the redirects. And so all those old parts of the site are no longer available. I used to work for NBC News. And I mean, we had more than 100 websites that we were running at one point. And when we were doing upgrades, the last thing we'd be thinking about is the old stuff.

725.149

It'd all be like, well, how do we meet the deadline to get the new stuff out?

735.94

Many of those conditions are still with us. They're not fundamentally changing. For those reasons, stuff still is going to atrophy. Also, as the web gets older, the older stuff gets older too. People die. The legacy often of an individual's efforts then falls on the heirs or their friends. I can't tell you.

759.363

Literally every day here at the Internet Archive, we get communications, principally emails or DMs or things like that, from people saying, "Hey, my husband or this organization I worked with, the person has passed away and we're going to shut down the website. We want to make sure that it's preserved." Often we will have already done that. Here's a recent case.

783.044

MTV News was shut down and people said, oh, you know, what did you do? Did you have to jump into action and archive it? It's like, no, no. Our work was done. We had been archiving. I mean, if that was what we had to do, then we would have failed because it's too late, right? Our work had been done over the decades.