Decoder with Nilay Patel

How the Wayback Machine is fighting linkrot

Thu, 05 Sep 2024

Description

The web has a problem: huge chunks of it keep going offline. The web isn’t static, parts of it sometimes just… vanish. But it’s not all grim. The Internet Archive has a massive mission to identify and back up our online world into a vast digital library. In 2001, it launched the Wayback Machine, an interface that lets anyone call up snapshots of sites and look at how they used to be and what they used to say at a given moment in time. Mark Graham, director of the Wayback Machine, joins Decoder this week to explain both why and how the organization tries to keep the web from disappearing.

Links:
When Online Content Disappears | Pew Research
Game Informer is shutting down | The Verge
When Media Outlets Shutter, Why Are the Websites Wiped, Too? | Slate
MTV News lives on in the Internet Archive | The Verge
The video game industry is mourning the loss of Game Informer | The Verge
Guest host Hank Green makes Nilay Patel explain why websites have a future | Decoder
How The Onion is saving itself from the digital media death spiral | Decoder
The Internet Archive is defending its digital library in court today | The Verge
The Internet Archive has lost its first fight to scan and lend ebooks | The Verge
The Internet Archive just lost its appeal over ebook lending | The Verge

Credits:
Decoder is a production of The Verge and part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt. Our editor is Callie Wright. Our supervising producer is Liam James. The Decoder music is by Breakmaster Cylinder.
Learn more about your ad choices. Visit podcastchoices.com/adchoices

Transcription

63.103 - 82.66 Nilay Patel

Hello and welcome to Decoder. I'm Nilay Patel, editor-in-chief of The Verge, and Decoder is my show about big ideas and other problems. We've been talking a lot about the future of the web on Decoder and across The Verge lately. And one big problem keeps coming up. Huge chunks of it keep going offline. In a lot of meaningful ways, large portions of the web are just dying.

83.34 - 103.248 Nilay Patel

Servers go offline, software upgrades break links and pages, companies go out of business. The web isn't static, and that means sometimes parts of it simply vanish. And it's not just the really old internet from the 90s or early 2000s that's at risk. A recent study from Pew found that 38% of all links from 2013 are no longer accessible.

103.968 - 125.344 Nilay Patel

That's more than a third of the collected media, knowledge, and online culture from just a decade ago gone. Pew calls it digital decay, but for decades, many of us have simply called this phenomenon link rot. And lately, link rot has meant a bunch of really meaningful journalism has gone away as well, as various news outlets have failed to make it through the platform era.

126.045 - 144.035 Nilay Patel

The list is virtually endless. Sites like MTV News, Gawker, Vice, Protocol, The Messenger, and most recently Game Informer are all just gone. Some of these were short-lived, but some were outlets that were live for literal decades, and their entire archives vanished overnight. But it's not all grim.

144.455 - 164.779 Nilay Patel

For nearly as long as we've had a consumer internet, we've had the Internet Archive, a massive mission to identify and back up our online world into a vast digital library. It was founded in 1996, and in 2001 it launched the Wayback Machine, an interface that lets anyone call up snapshots of sites and look at how they used to be and what they used to say at a given moment in time.

165.399 - 182.786 Nilay Patel

I wanted to know more about how this all works, so I asked Mark Graham, director of The Wayback Machine, to join me on the show this week to explain both how and why the organization tries to keep the web from disappearing. The answers are fascinating. You'll hear Mark explain how many hard drives the Internet Archive adds to its system every single day.

183.186 - 202.936 Nilay Patel

And then there's the choices that go into preservation. Not necessarily everything on the Internet merits preserving, and not everything is technically accessible, especially now as more of the online world moves to private platforms. Making those choices, not just preserving the internet, but curating it, is a complicated proposition that hits on basically every Decoder theme there is.

203.576 - 215.984 Nilay Patel

The idea of running a library that stores the internet's history? That's a puzzle worth solving. One quick note before we start. The Internet Archive just lost an appeal on a lawsuit over a short-lived book lending initiative it launched at the start of the pandemic.

216.424 - 245.425 Nilay Patel

We don't get into the details of that in this episode because we recorded before the court issued its decision, but I did want to mention the news. We'll link to a couple of Verge stories about it in the show notes. Okay, the Wayback Machine and internet preservation. Here we go. Mark Graham, you are the director of the Wayback Machine at the Internet Archive. Welcome to Decoder.

245.985 - 252.147 Nilay Patel

Really glad to be here today. Quickly for the audience, explain what the Wayback Machine is and how it fits into the Internet Archive.

252.807 - 270.591 Mark Graham

The Wayback Machine is a service of the Internet Archive that is used to provide a time machine to the web. We have been archiving much of the public web for nearly three decades now, and we make those archives available through the Wayback Machine.

271.266 - 277.873 Nilay Patel

The Internet Archive is the organization, the Wayback Machine is the service. How do those two things relate? What are the other things the Internet Archive does?

278.374 - 305.624 Mark Graham

The Internet Archive is a nonprofit organization with a mission of universal access to all knowledge. We pursue that mission in a variety of ways, including archiving, as I said, much of the public web. We work toward acquiring and digitizing and preserving and organizing and making available a whole range of material that is kind of grouped into media types.

305.684 - 328.609 Mark Graham

So one might be books, for example, and we digitize more than 4,000 books every day. Or television news. We archive television news, both from the United States and from other countries around the world. Journal articles: we have a collection of more than 30 million publicly accessible journal articles available from scholar.archive.org.

328.629 - 344.955 Mark Graham

78s, those old things on shellac, we've got hundreds of thousands of those that we have digitized. Those were donated to us by the Boston Public Library. So I could go on and on. We identify media, recording media that people have been publishing in for some period of time.

345.595 - 370.351 Mark Graham

If it's digital, like born digital, then that makes life easier because we're able to then capture that material in some fashion on our hard drives and preserve it. But maybe it's analog, maybe it's paper or microfiche or microfilm or vinyl or shellac, as I said. In that case, we have to first digitize the material, in some cases using the bespoke hardware and software setups that we have developed.

370.651 - 396.314 Mark Graham

And once we've digitized it, then we can preserve it and organize it and make it available. At the end of the day, this is what this is about. This is about the voices of humanity expressed in a variety of media that in many cases are being stored and made available on a series of platforms that are inherently ephemeral, that have a history of disappearing. One of the terms is link rot.

396.634 - 415.029 Mark Graham

That's talking about material that may have been available at a given URL, a given address on the web, at a given point in time, and is no longer there. You go to that URL and one of two things is going to be true. Well, three things, I guess. The first thing is that what you're looking for is there. Success.

415.649 - 438.094 Mark Graham

But the second is that you get a page not found or some other error message, a 500 error message, maybe something like that on the server end. So you just can't get the material. It's just no longer there at that URL. Now, that material might be available via another URL. It may have been moved somewhere, but you may not necessarily know that if there's no redirect in place.

438.374 - 459.908 Mark Graham

But the other thing that can happen is that at that same URL, there may be different material. That's referred to as content drift. Same URL, different material. Well, how would you even know what the prior material was or that there even was prior material at that URL? You wouldn't. Why? Because there's no version control system for the web.
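
The outcomes Mark lists (the page is there, the page returns an error, the page has moved, or the same URL now serves different material) can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not anything the Internet Archive is known to run: the function names and labels are hypothetical, and detecting content drift assumes you kept a fingerprint of an earlier copy, which is exactly what the live web does not do for you.

```python
import hashlib
from typing import Optional


def fingerprint(body: str) -> str:
    """Hash a page body so two fetches can be compared cheaply."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def classify_fetch(status: int, body: str,
                   earlier_fingerprint: Optional[str] = None) -> str:
    """Label one fetch of a URL (hypothetical labels, for illustration)."""
    if 300 <= status < 400:
        return "moved"          # a redirect is in place
    if status >= 400:
        return "link rot"       # e.g. 404 page not found, 500 server error
    if earlier_fingerprint is not None and fingerprint(body) != earlier_fingerprint:
        return "content drift"  # same URL, different material
    return "ok"


old = fingerprint("<h1>Original article</h1>")
print(classify_fetch(200, "<h1>Original article</h1>", old))  # ok
print(classify_fetch(404, ""))                                # link rot
print(classify_fetch(200, "<h1>Something else</h1>", old))    # content drift
```

Without the earlier fingerprint, the third case is indistinguishable from the first, which is the gap a time-based archive fills.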

460.689 - 482.049 Mark Graham

I go to a URL, I may get something, and then five minutes later, I go to the same URL, I may get that same thing, or I may get nothing, or I may get something different. And it just is. It is what it is at any given moment. That's what the web primarily is. There are exceptions to this, of course. There are applications on the web, like Wikipedia, for example,

482.629 - 501.108 Mark Graham

which is fundamentally based on a version control system. And you can go back and you can see all the various representations of what was available from a given URL. But for the web overall, it's not like that. And so that's where the Wayback Machine steps in. That's where we provide a time-based view for...

501.889 - 512.237 Mark Graham

URLs that we have been able to access and that we've been able to archive and then organize and make available to our patrons.
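
A time-based view of a URL can be illustrated with the Wayback Machine's public addressing scheme, in which a capture is addressed as `web.archive.org/web/<timestamp>/<url>`, and with its availability lookup at `archive.org/wayback/available`. The sketch below only builds the addresses and reads a response; it makes no network calls, the helper names are hypothetical, and the sample response is made-up data shaped like the service's JSON.

```python
from urllib.parse import urlencode

WAYBACK = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def snapshot_url(url: str, timestamp: str) -> str:
    """Address of the capture nearest `timestamp` (YYYYMMDDhhmmss, may be shortened)."""
    return f"{WAYBACK}/{timestamp}/{url}"


def availability_query(url: str, timestamp: str) -> str:
    """Lookup address asking which capture is closest to `timestamp`."""
    return f"{AVAILABILITY_API}?{urlencode({'url': url, 'timestamp': timestamp})}"


def closest_snapshot(response: dict):
    """Pull the closest capture's address out of a lookup response, if any."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


print(snapshot_url("https://www.theverge.com/", "20140905"))
print(availability_query("theverge.com", "20140905"))

# Made-up response for illustration only; real values will differ.
sample = {"archived_snapshots": {"closest": {
    "available": True,
    "status": "200",
    "timestamp": "20140905000000",
    "url": "http://web.archive.org/web/20140905000000/http://www.theverge.com/",
}}}
print(closest_snapshot(sample))
```

The timestamp in the address is the version-control handle the live web lacks: the same URL can be requested as of different moments.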

513.138 - 534.678 Nilay Patel

You're talking a lot about URLs. That is inherently web-focused. I think a lot about the web. I run a web-based business. Watching the web change, especially with things like Google search changing, AI changing the web in different ways. You obviously have the longest view, right? You have the widest view of the web as it's changed. Do you see an acceleration of the web's decline?

534.698 - 540.342 Nilay Patel

Do you see the web changing in any significant way that other people might be missing? What do you think is happening right now?

540.885 - 566.728 Mark Graham

Where to start? I mean, those are some big questions. A very general statement that I can make is that about a third of the old web measured in, say, 10 or 15 years or something like that is gone. So about a third. In some cases, it's less, and in some cases, it's more. And certainly for an individual website that may have had millions of pages, like GeoCities, for example, it's 100% gone, right?

566.748 - 590.907 Mark Graham

So it's just not there on the live web. But it turns out that in more than two-thirds of the cases that we've looked at where a given URL is no longer available, it is available through the Wayback Machine. So one way of looking at that is saying that instead of saying that maybe a third of the old web is gone, maybe a ninth of the old web is gone.

590.967 - 603.936 Mark Graham

And once again, these are very broad generalizations because much of that material was backed up and can be accessed through web.archive.org from the Wayback Machine. But you asked a different question.
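
The "a third becomes a ninth" arithmetic can be checked directly. The sketch treats "about a third" and "more than two-thirds" as exactly one third and two thirds, which is an assumption made only to keep the numbers round.

```python
from fractions import Fraction

gone_from_live_web = Fraction(1, 3)   # "about a third of the old web ... is gone"
recovered_share = Fraction(2, 3)      # "more than two-thirds", taken as exactly 2/3

# Share of the old web that is gone from the live web AND not in the archive.
truly_gone = gone_from_live_web * (1 - recovered_share)
print(truly_gone)  # 1/9
```

Because the recovered share is quoted as "more than" two-thirds, one ninth is an upper bound under these assumptions.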

604.116 - 608.661 Nilay Patel

Well, I'm hoping that you're going to say things are getting better. I'm worried you're going to say, no, they're getting worse.

609.001 - 612.706 Mark Graham

I don't know. They're getting different, right? So things are changing.

615.169 - 617.692 Nilay Patel

The most optimistic take of all is they're getting different. Yeah.

618.135 - 640.289 Mark Graham

So first of all, let's look a little bit at why things go away. There are very benign reasons why things go away. Maybe a company has simply gone out of business or a government has changed. And so there's a new administration. And so you would expect if a company goes out of business, what entity would want to keep that company's website alive, for example, or a publication, right?

640.869 - 663.235 Mark Graham

Thousands of local news organizations have shut down in the United States over the last 10 or 15 years, for example. News organizations, media organizations are shut down by governments when they go out of favor. When the failed coup happened in Turkey a few years ago, Wikipedia has documented that about 150 media organizations were shut down.

663.855 - 685.563 Mark Graham

We have a collection of four websites, four news sites from Hong Kong, for example (Apple Daily was one), that were shut down for political reasons. In all of those cases, we have really good archives of that material. We have, for example, a full-text searchable index of about a million pages from Gawker.

686.423 - 706.42 Mark Graham

And for those four news organizations from Hong Kong that I mentioned, we have built a full-text index of the articles from those news sites. But there are many, many other reasons why a given site may go away. Maybe the hard drives that the website was running on crashed. Or maybe there was just a change in the content management system.

706.6 - 725.089 Mark Graham

And when the upgrade was done, the people doing the engineering behind it didn't put in the redirects. And so all those old parts of the site are no longer available. I used to work for NBC News. And I mean, we had more than 100 websites that we were running at one point. And when we were doing upgrades, the last thing we'd be thinking about is the old stuff.

725.149 - 727.69 Mark Graham

It'd all be like, well, how do we meet the deadline to get the new stuff out?

728.01 - 735.48 Nilay Patel

I feel like every person who's ever worked in product at a media company is experiencing second order body horror right now because of what you're describing.

735.94 - 758.963 Mark Graham

Many of those conditions are still with us. They're not fundamentally changing. For those reasons, stuff still is going to atrophy. Also, as the web gets older, the older stuff gets older too. People die. The legacy often of an individual's efforts then falls on the heirs or their friends. I can't tell you.

759.363 - 782.724 Mark Graham

Literally every day here at the Internet Archive, we get communications, principally in emails or DMs or things like that, from people saying, hey, my husband, or this organization I worked with, the person has passed away and we're going to shut down the website. We want to make sure that it's preserved. Often we will have already done that. Here's a recent case.

783.044 - 802.076 Mark Graham

MTV News was shut down and people said, oh, you know, what did you do? Did you have to jump into action and archive it? It's like, no, no. Our work was done. We had been archiving. I mean, if that was what we had to do, then we would have failed because it's too late, right? Our work had been done over the decades.

804.297 - 812.46 Nilay Patel

We've spoken about why internet preservation is necessary. We have to take a quick break, but when we come back, Mark's going to get into how the Wayback Machine works. We'll be back in just a minute.

1007.915 - 1023.34 Nilay Patel

Welcome back. I'm talking with the Internet Archive's Mark Graham, director of the Wayback Machine, about the actual structure of it all. Inside of the Internet Archive, how is the Wayback Machine structured? Is that just the front-facing service? Is that also the digitization of the Internet? How does that work?

1023.98 - 1043.929 Mark Graham

We call it the Wayback Machine as if it's like a computer that's sitting on somebody's desk. It's actually a whole network of literally hundreds of nodes as part of our overall infrastructure of the Internet Archive of thousands of nodes, with more than 100 petabytes of material, growing at the rate of more than 60 terabytes a day.

1044.429 - 1064.883 Mark Graham

It's a combination of applications that do what's referred to as crawling, which is a process of looking at a URL, looking at a webpage, and then looking at all of the other links, all of the other URLs on that page, and then going to them and then looking at them and then going on and on and on, crawling the web like a spider, metaphorically.

1065.383 - 1089.993 Mark Graham

So it's a combination of this crawling and archiving process, as well as the aggregation of all of those archived resources with indexes that make those discoverable. And then they can be recompiled into web pages. And then patrons, millions of patrons a day come to our sites and they request resources that we have.

1090.093 - 1100.316 Mark Graham

Maybe it's a digitized version of a book from archive.org, or maybe it's an archived web page from the Wayback Machine. And then we will present that to them in their browser.
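
The crawl loop Mark describes (look at a page, collect its links, visit those, repeat) can be sketched as a breadth-first traversal. This is a toy illustration, not the Internet Archive's crawler: the `fetch` callable, the in-memory `TOY_WEB`, and the `example.test` URLs are all assumptions made so the sketch runs without a network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, limit=100):
    """Breadth-first crawl: archive a page, then queue every link found on it."""
    archive = {}            # url -> captured HTML (stand-in for real storage)
    queue = deque([seed])
    seen = {seed}
    while queue and len(archive) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:    # page could not be retrieved
            continue
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return archive


# A three-page toy "web" held in memory (hypothetical URLs).
TOY_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}
captured = crawl("http://example.test/", TOY_WEB.get)
print(sorted(captured))
```

The `seen` set keeps the crawler from looping when pages link back to each other, and the dead `/missing` link is simply skipped. A production crawler would also need politeness rules, deduplication, and an index over the captures, none of which is shown here.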

1100.616 - 1102.197 Nilay Patel

You said 60 terabytes a day?

1102.833 - 1125.222 Mark Graham

More than that, yeah. Actually, it's something like more than a billion URLs every single day, and that can get pretty quick. It could be like 20,000 URLs a second can be coming into our server. So think of a database that you're writing to 20,000 times a second and you're reading from 5,000 times a second. That's one view into what the Wayback Machine is.
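
The two figures quoted here, a billion URLs a day and bursts of 20,000 a second, fit together as an average and a peak. A quick check, taking the billion as an exact round number for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60          # 86,400

urls_per_day = 1_000_000_000            # "more than a billion URLs every single day"
average_per_second = urls_per_day / SECONDS_PER_DAY
print(round(average_per_second))        # 11574

peak_per_second = 20_000                # the burst rate quoted above
print(round(peak_per_second / average_per_second, 2))  # 1.73
```

So the quoted peak is a little under twice the daily average rate, under the assumption that the daily total is exactly one billion.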

1126.514 - 1142.777 Nilay Patel

That's just a lot of storage and a lot of ongoing storage because you're not just taking the changes, right? You're storing the history. I actually have gone to go look at our old designs on The Verge on the Wayback Machine because it's the easiest way for me to just go remember what the site looked like 10 years ago. So you've got the long history. So you're adding storage every day.

1143.138 - 1145.058 Nilay Patel

Do you just buy hard drives every day? Are you on Newegg?

1145.098 - 1149.359 Mark Graham

Yes, the heading purchase is always with Seagate and others. We buy a lot of hard drives.

1150.299 - 1152.381 Nilay Patel

Are you buying platters? Are you buying SSDs?

1152.821 - 1175.303 Mark Graham

The primary storage medium is spinning disk. I think today we're using 20 terabyte drives. When we started, they were much smaller, of course. Actually, the very, very, very first version of the Wayback Machine, going back almost like 24, 25 years ago, I think we used a tape machine for a little while. But very quickly, our founder, Brewster Kahle, decided that he really wanted

1175.823 - 1193.271 Mark Graham

the material that we have to be as accessible as possible to people, so that when people wanted something, it wasn't like, oh, we have to go back to the stacks and then find it and then get it. He wanted things to be as immediately available as possible. So spinning disk has been the primary format.

1193.611 - 1202.535 Mark Graham

And of course, yes, we use a lot of SSDs and a lot of NVMes and other kinds of memory devices, primarily for indexes and caches and things like that.

1203.226 - 1218.613 Nilay Patel

So 60 terabytes a day, let's say, 20 terabytes spinning disk hard drives, that's three a day, if my math is correct. Oh, it's more than that. Yeah, it's more than that because- I'm just envisioning somebody going to plug in between three and five hard drives a day and then- More than that because we at least double everything up because- Sure.

1218.633 - 1230.398 Mark Graham

So first of all, we own and operate our own data centers. They are physically distributed. So when we write something, we're actually writing it to more than one location for physical reliability. It's north of six hard drives a day.
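
The drive count follows from the figures given in the conversation. This sketch assumes exactly 60 terabytes a day, 20-terabyte drives, and two copies, and ignores filesystem overhead and spares:

```python
import math

daily_growth_tb = 60    # "more than 60 terabyte a day"
drive_size_tb = 20      # "I think today we're using 20 terabyte drives"
copies = 2              # "we at least double everything up"

drives_for_one_copy = math.ceil(daily_growth_tb / drive_size_tb)
drives_per_day = drives_for_one_copy * copies
print(drives_for_one_copy, drives_per_day)  # 3 6
```

Since the growth figure and the copy count are both stated as lower bounds, six is a floor, which matches "north of six hard drives a day."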

1231.138 - 1243.686 Nilay Patel

So I have a SimCity map in my head where you're just an ever-expanding physical footprint. Is there an outer limit? Are you going to take over a city? Is there a desert mountain cave? How does this work?

1243.827 - 1266.902 Mark Graham

I doubt that. Caves have their own challenges. We're looking at some interesting things. Some of our material is in an abandoned coal mine in Norway. We participated with GitHub a few years ago in something called the Arctic GitHub Repository. And we are looking at some more exotic recording formats from some special purpose applications.

1267.303 - 1287.478 Mark Graham

But frankly, we think that hard drives are going to be the primary medium that we use for some time into the future. We're constantly evaluating options, but it's a kind of a tried and true and reliable format and process. We know how to handle them. We put them into machines that we rack ourselves and they've been serving us well.

1288.038 - 1301.448 Nilay Patel

We're talking about preserving a very digital, somewhat ephemeral medium on the internet. The actual process of it is extraordinarily physical. You just have to take up space, run wires, and have power and all that.

1301.468 - 1319.915 Mark Graham

Electricity and heat and all the rest of that. Well, I should say, if you come and visit our operation in San Francisco, which you should do sometime if you haven't, we have several physical locations. We have physical archives in different locations in the United States and also in Canada. But our headquarters building is an old church, a former church of Christian Scientists,

1320.655 - 1338.971 Mark Graham

and now it's a temple for knowledge. When you come into our building, you'll see how frugal we are. We've kind of left it the way it was when it was a regular kind of church. We don't have air conditioning or backup generators or anything like that, but we have a lot of hard drives in racks, and we do have some fans.

1339.131 - 1356.731 Mark Graham

When it's a hot day here in San Francisco, we open up the windows and ventilate that. Also, people who use the service may know that sometimes we'll go down if the power goes out. We'll be down for a little while, but we're a library. It's okay. We'll be back. The material itself is stored in multiple locations, so it's safe.

1357.352 - 1360.696 Nilay Patel

How is this all funded? How much does it cost to run and where does the money come from?

1361.266 - 1385.509 Mark Graham

Last year, I think we spent probably about $28 million, and I think I'd divide that into three buckets. The first bucket would be earned income, program-related business activity, they say, in the nonprofit world. This is work that we do on behalf of museums and governments and libraries and the like, when they pay us primarily to do web archiving on their behalf or do book digitization.

1386.029 - 1408.818 Mark Graham

Another third comes from a very loyal collection of more than 150,000 people who donate money to us every year. A growing number of them are monthly donors, so we're very appreciative of the folks who give us $10, $20, $30 a month. And then the final third comes from a combination of high wealth individuals and foundations.

1409.502 - 1422.439 Nilay Patel

And is that mix changing over time? As I think about the broader piece of link rot and the ever-expanding nature of the problem, it seems like that funding mix might have to change over time.

1423.007 - 1443.921 Mark Graham

It's diversifying. We're certainly looking at ways to continue to diversify it. The monthly donor program is certainly an area that is growing for us. And just, you know, as more and more people use our service and depend on it, frankly, and see the value of it, then more and more of them support us every year so that the number of unique annual donors has been increasing

1444.281 - 1461.713 Mark Graham

on a fairly consistent basis, and we very much appreciate that. It allows us to do what we do. It is only through the support that we get from our patrons that we're able to continue to work diligently and creatively to try to preserve our world cultural heritage.

1462.353 - 1470.619 Mark Graham

We haven't built a full-text index on the entire holdings of the Wayback Machine, maybe someday, but for now we kind of do it on a case-by-case basis.

1471.055 - 1477.32 Nilay Patel

So it seems like money is not the biggest challenge with Wayback Machine, and that's a good place to be. But then what are the challenges?

1478 - 1497.855 Mark Graham

But there's other dimensions, though, of this evolving digital world that we live in that are representing new challenges and new opportunities. Issues like hyper-personalization. The web you experience is different than the web I experience. Even down to a given web page, what you see and what I see may be different because of

1498.275 - 1523.692 Mark Graham

geography or browser type or what that website knows about us as individuals, our age or our preferences, et cetera. And I'm not just talking about the ads either. You know, this is elements of it. So hyper-personalization is one thing. The splinterization of the net often around geopolitical boundaries where large parts of the internet are just not accessible to other parts of the internet.

1524.072 - 1545.597 Mark Graham

Certainly we all know about the Great Firewall of China. But there are many, many other examples of that. When Russia invaded Ukraine, many thousands of websites that had traditionally been available from Russia in the West are no longer available. And then there's the evolution of what we think of as the internet into the web.

1545.677 - 1570.718 Mark Graham

And now it's this kind of mobile-first environment with apps, and apps are their own kind of special hell of walled-garden content. In a variety of ways, it's very bound technically and often administratively with IDs and passwords and paywalls and all the rest of that. So getting material out of these containers that we think of as apps that live on our phones is challenging.

1571.238 - 1588.55 Nilay Patel

We've been talking about it with the web, right? The Wayback Machine is centered on the web. There's reasons that websites have gone out of favor. MTV News is a great example. They just couldn't make money running MTV News on the web. It just wasn't happening for them. They shut it down. That's more or less the case for media on the web, probably.

1588.59 - 1596.496 Nilay Patel

That's why so many news websites are going out of business. That's why local news on the web is going out of business. That's not the case for video platforms.

1597.316 - 1617.717 Nilay Patel

Right? If you're an independent creator and you're on YouTube, maybe you're making a lot of money. Maybe you're a TikToker making a lot of money. You're inside of that ecosystem, and that's where the money and that's where the advertising is going. None of that has the same ideals or norms of the web, right? Which is that it is available. So much of the internet has been built on the norms and ideals of the web, that availability is the key.

1618.317 - 1629.086 Nilay Patel

Three, four million videos are uploaded to YouTube every day. I'm assuming TikTok and the others all have similar amounts. It's a massive amount of information. Are you collecting that as well?

1629.447 - 1651.54 Mark Graham

No. We have some archives from some YouTube videos, but you just threw out a number like three or four million a day, like nothing near that. This goes to also like, why do we archive what we archive and how do we make choices? And the answer in short is there are more than 10,000 different reasons why a given URL may be archived by the Wayback Machine on any given day.

1652.12 - 1674.507 Mark Graham

And they are in part selected by the more than 1,000 partners that we have that are primarily librarians that do curation of material that they think should be archived. So we have partnerships with them. We have partnerships with Cloudflare, an infrastructure provider, with WordPress, with Wikipedia. We also offer a service called Save Page Now.

1674.967 - 1693.611 Nilay Patel

What I'm getting at is this is all pretty based in the web, right? If you capture a web page and it has a YouTube video on it, maybe you'll capture the YouTube video too. But there's a growing body of information that lives on more closed platforms, even if they are exposed to the web. Like Instagram is exposed to the web, but it's not the web.

1693.951 - 1715.518 Mark Graham

I would say that Instagram is the web, but I would say it's not necessarily the public web, because generally speaking, material from Instagram, Facebook, and Threads, basically the Meta properties, is not very accessible unless you have an ID and a password on those services. Even the so-called public pages have limitations for how one can access them. So there are special cases.

1715.798 - 1731.141 Mark Graham

We work hard to archive things that people think they want to preserve in some fashion. And so a lot of material on some of these social platforms is archived by patrons who enter URLs into the Wayback Machine.

1731.581 - 1746.364 Nilay Patel

Does the shift to people doing more and more of their publishing on closed platforms threaten the nature of what you're doing? If all the information is going from the open web to Discord channels, I'm guessing you're not able to archive all that. And that seems like a big problem for the information landscape.

1746.747 - 1769.487 Mark Graham

It absolutely is making some of the work that we do more challenging. But I actually think there's larger implications here. It's hurting our democracy. It's hurting our culture. It's hurting our ability to have shared conversations and shared understandings of the world that we live in. But this isn't necessarily a technical thing, too, because we can make choices.

1769.687 - 1787.554 Mark Graham

I can watch one TV channel and you can watch another, and we get radically different worldviews. But in those cases, we have choices at least, and we can flip between one and the other. If the switching cost is higher, where it's a paywall, for example, and where the switching cost is an actual dollar sign cost, then

1788.194 - 1810.947 Mark Graham

Can I afford to pay for the 30 or 40 different news sources that I would like to have access to as an informed citizen? There's a real cost associated with that, or the cost of using this app or that app and not being able to bring this material together and aggregate it. So the issue that you're addressing, I think, is one of the critical issues of our times.

1811.847 - 1828.933 Mark Graham

And yes, certainly it affects the work that we do here as archivists, but I think it has much broader and profound social implications. There's a lot of material that's publicly available. I keep using this phrase public web, and I'm making a distinction here. Things you can get to without an ID and a password.

1831.174 - 1832.775 Nilay Patel

We have to take another quick break. We'll be right back.

1841.886 - 1860.348 Domino Advertisement

Support for this show comes from the refinery at Domino. Look, location and atmosphere are key when deciding on a home for your business, and the refinery can be that home. If you're a business leader, specifically one in New York, the refinery at Domino is an opportunity to claim a defining part of the New York City skyline.

1860.788 - 1880.93 Domino Advertisement

The refinery at Domino is located in Williamsburg, Brooklyn, and it offers all the perks and amenities of a brand new building while being a landmark address that dates back to the mid-19th century. It's 15 floors of Class A modern office environment housed within the original urban artifact, making it a unique experience for inhabitants as well as the wider community.

1881.531 - 1900.939 Domino Advertisement

The building is outfitted with immersive interior gardens, a glass-domed penthouse lounge, and a world-class event space. The building is also home to a state-of-the-art Equinox with a pool and spa, world-renowned restaurants, and exceptional retail. As New Yorkers return to the office, the refinery at Domino can be more than a place to work.

1901.339 - 1908.562 Domino Advertisement

It can be a magnetic hub fit to inspire your team's best ideas. Visit therefinery.nyc for a tour.

1911.044 - 1930.741 Shopify

This episode is brought to you by Shopify. Forget the frustration of picking commerce platforms when you switch your business to Shopify, the global commerce platform that supercharges your selling wherever you sell. With Shopify, you'll harness the same intuitive features, trusted apps, and powerful analytics used by the world's leading brands.

1931.081 - 1939.308 Shopify

Sign up today for your $1 per month trial period at Shopify.com slash tech, all lowercase. That's Shopify.com slash tech.

1943.013 - 1966.913 Jira Advertisement

So if you're a team of developers, Jira better connects you with teams like marketing and design so you have all the information you need in one place. Plus, their AI helps you knock out the small stuff so you can focus on delivering your best work. Get started on your next big idea today in Jira.

1974.13 - 1990.134 Nilay Patel

Welcome back. I'm talking with Mark Graham, director of the Wayback Machine, about the challenges of preserving the internet when everything is not only ephemeral, but also more and more closed off. Mark just mentioned the concept of the public web, meaning anything you can get to without an ID and a password. And that brings us to a new challenge for preservation.

1990.655 - 2007.861 Nilay Patel

Up until a year or so ago, maybe two, the idea that the Wayback Machine would just cycle through the internet to read and preserve websites was more or less seen as a universal good. But now there's a new crop of players scraping websites, and it's a lot more contentious. All the generative AI companies are scraping the entire web and using it to train their LLMs.

2008.261 - 2017.806 Nilay Patel

And that has made a lot of people very upset and very litigious. We've had some of them on the show. The New York Times and a bunch of artists and organizations have filed plenty of lawsuits over this practice.

2018.286 - 2038.418 Nilay Patel

That's made a lot of people suddenly aware of something called robots.txt, the file which dictates which web pages third-party crawlers and other automated tools are allowed to visit on a website. Lots of websites are now making changes to block these scrapers, and it's called into question one of the oldest and most widely used practices on the open web, one that's vital for preservation.
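
For readers unfamiliar with robots.txt, the check a well-behaved crawler performs before visiting a page can be sketched in Python with the standard library's `urllib.robotparser`. The rules below and the crawler names `ExampleAIBot` and `ExampleArchiveBot` are hypothetical, made up for illustration; this is a minimal sketch of the general mechanism, not a description of how any particular crawler or the Wayback Machine works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules allow user_agent to visit url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "ExampleArchiveBot"):
        for page in ("https://example.com/articles/1",
                     "https://example.com/private/notes"):
            print(agent, page, can_fetch(ROBOTS_TXT, agent, page))
```

Note that robots.txt only states the site's wishes. Nothing in the file enforces them, so it works only when the crawler chooses to run a check like this and honor the answer.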

2039.058 - 2040.539 Nilay Patel

Has that affected your work at the Internet Archive?

2041.038 - 2064.419 Mark Graham

This is an evolving landscape that we live in of people's perceptions and realities. With the advent of the AI companies and large amounts of material that they've gathered from the public web and then used in new and different ways, there have been changes by some of the folks who are making that material available. Many organizations are kind of closing down the hatches.

2065.04 - 2091.3 Mark Graham

So far, we've been doing okay. We've actually been working cooperatively with many different platforms for a long time. And we also take measures to respect the intellectual property and the rights of content creators. The material from the Wayback Machine is... generally only available as a playback of an individual URL. We don't support the bulk downloading of the material.

2091.72 - 2109.21 Mark Graham

In general terms, there are exceptions to that. There's a project we do, for example, with the Library of Congress and the National Archives, where we archive material from U.S. government websites. Making the material available within a specific controlled environment, we've been able to have good relationships with most folks out there.

2109.63 - 2125.18 Mark Graham

For example, Reddit recently put out an announcement where they said, you know, we're locking things down, but we have an agreement with the Internet Archive. Reddit considers the work that they do with us to be a legitimate and beneficial service to the patrons of Reddit.

2125.919 - 2140.853 Nilay Patel

What's interesting about that is Reddit's kind of an old company. It's like an old web company, and there's a bunch of web people there who understand what the Internet Archive is and why it's valuable, and they probably use it. And then you've got a bunch of new companies who might have new leaders who don't understand the ideals of the web.

2141.473 - 2160.308 Nilay Patel

And then you've got the AI companies who I think a lot of people woke up last year and said, there's something called robots.txt, and it should maybe pay us money. And now everyone's confused, right? Has that meaningfully changed what you do, the idea that this should be a set of business agreements or a set of legal agreements? But do you get to just run around saying you're a library?

2161.206 - 2182.258 Mark Graham

Well, we are a library. I'm not a lawyer. I don't have those kind of conversations. I get up every day and I ask myself the question, how can we do a better job archiving more of the public web in a way that is respectful, in a way that is useful, in a way that is helping to preserve the cultural heritage of our times? I guess...

2183.208 - 2196.958 Nilay Patel

Much more directly, my question is, a bunch of companies took advantage of the open web to build AI models, and now the rest of the open web might get ornerier or more closed down even. Is that making your job harder?

2197.83 - 2204.334 Mark Graham

We're doing fine. You know, so there's challenges every day. But honestly, that's not one of the ones that's keeping me up at night. No.

2204.855 - 2226.251 Nilay Patel

Let's talk about some solutions for all of these changes kind of broadly. I'm thinking about just the amount of culture that is uploaded to TikTok every day. It is where the culture is happening right now. That is the most ephemeral of all. It doesn't even feel searchable in a real way. It comes, it goes. Obviously, the algorithm creates an infinite array of filter bubbles for people.

2227.012 - 2237.681 Nilay Patel

Is it even possible to capture all of that or organize it or make it understandable? Because I'm thinking about historians 20 years from now trying to understand a meme today, and I have no idea how they're going to do it.

2238.241 - 2256.745 Mark Graham

Some of it is, yeah. Actually, TikTok is one of those platforms that we're doing a fair amount of archiving on. So I would say yes. And in some of these cases, like say TikTok or Telegram or Rumble or let's say Truth Social or some of these other social platforms, we're not trying to get everything. That's far too much, but we are trying to get a fair amount.

2257.105 - 2277.511 Mark Graham

In some cases, we're working with domain experts, subject matter experts, et cetera, who are helping to guide us and to get things that may be culturally or historically more significant than others. You mentioned memes, for example. And so if you take a meme as a meme and as a vector into, okay, let's try to collect material related to this meme.

2277.651 - 2285.434 Mark Graham

So there are any one of a number of methods that we might incorporate to try to help prioritize material that we would get from some of these platforms.

2286.021 - 2296.683 Nilay Patel

When you think about all of those opportunities, you're going to have to prioritize somehow, right? Six hard drives a day, or you can go to 12 hard drives a day. How do you make those kinds of prioritization decisions?

2297.323 - 2314.966 Mark Graham

First of all, there's a lot of people that work here. More than 100 people work at the Internet Archive. We do a variety of different things from an engineering perspective or a program perspective. And yeah, there are choices that are made. But I would say we're mostly constrained by just our own creativity and our own imagination.

2315.526 - 2336.656 Mark Graham

We have a fair amount of latitude as we work here to explore our interests as individuals and as an organization. But with a really strong focus on just trying to do a really good job of the things that we set out to do. And admittedly, the North Star, universal access to all knowledge, is a high bar. But we have the...

2337.256 - 2351.113 Mark Graham

The luxury of being able to pursue that with a lot of resources is something that I have a great deal of gratitude for. And I know that the people that I work with do as well. And I think that the millions of patrons that use our service every day do also.

2352.349 - 2361.471 Nilay Patel

Preservation is a high, noble goal. I work in the media. It's fine for you to preserve everything that we make. Some people want stuff deleted. How do you balance preservation and privacy?

2361.491 - 2382.801 Mark Graham

And we are respectful of rights holders. And one of the ways that we are respectful is that we do respond to requests to have things excluded from the Wayback Machine. So rights holders that make legitimate requests. We actually have human beings that check these things out. We just don't say, oh, so-and-so said, just take that out. But we do consider the request.

2382.861 - 2394.472 Mark Graham

Sometimes if the person is a public official, then we will have to weigh off their request with maybe a broader public right to know. But we work those things out on a case-by-case basis.

2395.154 - 2404.399 Nilay Patel

You've made some content moderation decisions along the way as well. Two years ago, you removed Kiwi Farms, sort of a notorious forum for people who don't behave very well. How do you make that kind of decision?

2405.399 - 2426.091 Mark Graham

We weigh off the evidence, the information that's available to us. That particular case, for example, would live in a category, I would say, where there are times when we learn about situations where there is what may be considered a high probability of real-world harm. and then have to make a decision.

2426.491 - 2448.689 Mark Graham

The fact of the matter is there's material that is made available on the web that does cause real human suffering. And there are cases in which we have a duty to care. Another one maybe that is with child sexual exploitation material, for example. I don't think people are really questioning that too much, right? You say, oh, they took that down or something. Well, yeah, of course.

2448.769 - 2470.196 Mark Graham

And first of all, that's the law. But there's other cases, doxing, for example, or harassment, or where people's personal safety or other risks have to be taken into consideration. So these are not decisions that are made lightly. We have policy that helps guide us, but very carefully and diligently. And we reconsider, too. That's another thing, too. It's not just like, oh, that was done.

2470.956 - 2483.46 Mark Graham

And that's never to be looked at again. No. Over time, situations change, and the context of material in a new light of a new day may lead to different kinds of decisions.

2484.46 - 2496.924 Nilay Patel

Obviously, there's a lot of systems at play here. You sometimes partner with organizations. Entire websites also come and go with the whim of corporations beyond most people's control. But there is a personal element. How should individuals think about all of this?

2497.714 - 2517.603 Mark Graham

If you create something and you want it to stick around for a while, then take care. If you see something, save something. The Internet Archive is a free resource. It's available to anyone with a browser and a connection to the Internet. Just go to web.archive.org. If you'd like to preserve a URL, put it into the Save Page Now feature on the bottom right. Write to us.

2517.663 - 2527.288 Mark Graham

Write to us at info@archive.org. If you've got a website that you think may be at risk, send us a note, and we'll make sure that we do a really good job of preserving it.
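
For anyone who wants to script the "see something, save something" advice rather than use the form on web.archive.org, the addresses involved can be sketched in Python. The two helper functions are hypothetical names introduced here, and the URL shapes are the publicly visible `web.archive.org/save/` and `web.archive.org/web/<timestamp>/` patterns as best understood, not an official client; the Internet Archive's own documentation is the authority on current behavior and rate limits.

```python
def save_page_now_url(target: str) -> str:
    """Address that asks Save Page Now to capture `target` (assumed pattern)."""
    return "https://web.archive.org/save/" + target


def playback_url(target: str, timestamp: str) -> str:
    """Address that plays back the capture of `target` nearest `timestamp`.

    `timestamp` is digits in YYYYMMDDhhmmss order; a prefix such as
    "20240905" is assumed to be accepted as well.
    """
    return f"https://web.archive.org/web/{timestamp}/{target}"


if __name__ == "__main__":
    page = "https://example.com/post"  # placeholder page, not a real target
    print(save_page_now_url(page))
    print(playback_url(page, "20240905"))
```

The sketch only builds the addresses and sends nothing. Opening the first one in a browser should have the same effect as typing the page's URL into the Save Page Now box.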

2527.668 - 2529.414 Nilay Patel

Sounds good. Thank you so much, Mark. I really appreciate it.

2529.996 - 2530.377 Mark Graham

You're welcome.

2533.8 - 2548.303 Nilay Patel

Thanks again to Mark Graham for joining me on the show, and thanks again to the Internet Archive and the Wayback Machine. We depend on their work all the time here at The Verge. If you have thoughts about this episode or what you'd like to hear more of, you can email us at decoder@theverge.com. We really do read all the emails. Or you can hit me up directly on Threads. I'm @reckless1280.

2548.363 - 2563.731 Nilay Patel

We also have a TikTok, which you should check out while there's a TikTok. It's @decoderpod. It's a lot of fun. If you like Decoder, please share it with your friends and subscribe wherever you get your podcasts. If you really like the show, hit us with that five-star review. Decoder is a production of The Verge and part of the Vox Media Podcast Network. Our producers are Kate Cox and Nick Statt.

2563.931 - 2569.618 Nilay Patel

Our editor is Callie Wright. Our supervising producer is Liam James. The Decoder music is by Breakmaster Cylinder. We'll see you next time.
