Archiving the Internet
“A lot of people do believe that if it’s on the Web it will stay on the Web” captures a widely held assumption, but it does not match how ephemeral content on the internet really is.
One of my friends Googled an obituary of a favorite writer of hers for a research paper, but when she clicked the link that was supposed to take her to the piece, she was redirected to a web page about a big win for a random baseball team. This is an example of the “content drift” that the New Yorker piece talks about. The only way to find the original web page was to call up the journalist whose name appeared next to the article’s link; he ended up emailing her a PDF of the article. The whole process felt outdated and cumbersome compared to the speed at which my friend expected to be able to access a simple article on the web.
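Had that obituary ever been captured by a web archive, a quick lookup would likely have been faster than tracking down the journalist. Below is a minimal sketch, assuming Python and the Internet Archive’s public Wayback Machine “availability” endpoint, of how one might check whether an archived snapshot of a drifted link exists; the obituary URL in the example is hypothetical.

```python
import json
import urllib.parse
import urllib.request


def find_archived_copy(url, timeout=10):
    """Ask the Wayback Machine whether a snapshot of `url` exists.

    Returns the snapshot URL if the archive reports one, else None.
    """
    api = ("https://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api, timeout=timeout) as resp:
        data = json.load(resp)

    # The API returns an "archived_snapshots" object; "closest" is the
    # nearest capture it knows about, if any.
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical drifted link, for illustration only.
print(find_archived_copy("https://example.com/obituary-of-a-favorite-writer"))
```

Of course, this only helps if someone archived the page before it drifted, which is exactly the point of the archiving efforts discussed below.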
Not being able to access one writer’s obituary may feel like a nonissue since it’s such a specific use case, but what if all of the files we uploaded to Google Drive or Dropbox disappeared?
The solution of archiving the internet seems like a no-brainer, but it also seems impossible to compile and index every single piece of content on the internet. In 2013, “Google performed 2 million searches each minute and 72 hours worth of video was uploaded to YouTube within the space of 60 seconds.” And in 2012, over 140,000 websites were created each day, a number that has probably grown dramatically since then. Just keeping up with that volume of existing content is daunting before even figuring out which content is valuable enough to archive. Then there’s the question of indexing the content that is kept so it can actually be found.