Originally published Monday, April 7, 2008 at 12:00 AM
Trying to preserve today's Web for future generations
Newhouse News Service
The Internet is the world's living, ever-evolving database, certainly the largest depository of information ever assembled.
But as it is assembled and reassembled, changing by the second, researchers are concerned that too often, backing up is hard to do.
To some, nothing is older than a minutes-old Web page (to update the old saw about a day-old newspaper). For others, though, there is value in the Web sites of yester-minute, and archivists have been wrestling with ways to preserve today's Internet for tomorrow's scholars and researchers.
Without a working archive, these experts fear, future-generation Web surfers might never know who Client 9 was or what topics were generating the most interest on Digg.com this week.
Impossible task
By virtue of the Internet's sheer enormousness and its warp-speed evolution, the task of archiving its content in its entirety is impossible, like trying to catalog every grain of sand on the world's beaches.
But just as it is easy to take a photograph of a beach, it is also possible to grab snapshots of the Internet, or specific portions of it, to preserve for future generations.
And that's exactly what researchers at the Internet Archive, the Library of Congress, the National Archives and libraries worldwide are working on.
There have been some remarkable strides already, starting with the Mountain View, Calif.-based Internet Archive and its Wayback Machine. Its creator hopes to build a sort of second coming of the Library of Alexandria, the long-ago destroyed institution that housed much of the ancient world's recorded knowledge.
The project has archived some 85 billion Web pages on computers whose storage is measured in petabytes.
100 days
"The average life span of a Web site is about 100 days, so you have to be proactive about getting and saving them," said Brewster Kahle, who founded the nonprofit Internet Archive and began sculpting his vision of a working Internet library in 1996.
"We knew that this was coming. You could tell that there was going to be an online digital world, and we wanted to make sure there was a library built," said Kahle, who is as dedicated to his gargantuan task as he is unassuming in discussing it.
So, on a regular basis, the Internet Archive releases a robot program called Heritrix, which conducts Web crawls, bounding about the Internet and collecting Web sites by the millions.
Each crawl collects about 4 billion sites, which are saved in the Wayback Machine. Anyone can access the collection at www.archive.org, type in a site name and view archived past versions of it.
Initially funded by Kahle, the project has since received money from dozens of individuals and institutions — including the Mellon Foundation — and works worldwide with government agencies and libraries.
The robot crawlers collect only open public sites; those who don't want their sites archived can add a bit of code to block the bot.
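The article does not say what that "bit of code" is. A common way to turn crawlers away is a robots.txt file at the root of a site, and the sketch below assumes that mechanism. The file contents, the user-agent token "ia_archiver", the example URLs and the helper name may_crawl are all illustrative assumptions, not details from the article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler entirely (the token
# "ia_archiver" is an assumption) and keep all others out of /private/.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The blocked crawler is refused; an unrelated crawler may fetch public pages.
archiver_allowed = may_crawl(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html")
other_allowed = may_crawl(ROBOTS_TXT, "SomeOtherBot", "http://example.com/index.html")
other_private = may_crawl(ROBOTS_TXT, "SomeOtherBot", "http://example.com/private/a.html")
```

A well-behaved crawler performs a check like this before each fetch; the file is a request, not an enforcement mechanism.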
Kahle, a middle-aged Internet pioneer who sold startup companies to Amazon.com and AOL, said the Internet Archive has no endgame and its Web crawlers will continue collecting petabytes of data indefinitely as the Internet continues to expand.
It can be difficult to wrap your brain around the sheer size of a petabyte, which is about 1,000,000,000,000,000 bytes.
Science Grid This Week, a publication of Fermilab, sums it up like this: If a byte is a single character on a keyboard and you typed one character per second, it would take more than 30 million years to create a petabyte-length document.
Another example: Say you had a fleet of personal computers and each one had a 50-gigabyte hard drive. You would need 20,000 of those PCs to hold a petabyte of data.
So when the Internet Archive says it has 2 petabytes' worth of data stored, that's one supersized library.
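The petabyte comparisons above can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 petabyte = 10^15 bytes, 1 gigabyte = 10^9 bytes) and a 365.25-day year; the function names are made up for illustration.

```python
PETABYTE = 10**15                         # bytes, decimal definition (assumed)
GIGABYTE = 10**9                          # bytes, decimal definition (assumed)
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year length (assumed)


def years_to_type(num_bytes: int, chars_per_second: float = 1.0) -> float:
    """Years needed to type num_bytes characters at a steady rate."""
    return num_bytes / chars_per_second / SECONDS_PER_YEAR


def drives_needed(num_bytes: int, drive_bytes: int) -> int:
    """Number of drives of size drive_bytes needed to hold num_bytes (rounded up)."""
    return -(-num_bytes // drive_bytes)


years = years_to_type(PETABYTE)                   # one character per second
pcs = drives_needed(PETABYTE, 50 * GIGABYTE)      # 50-gigabyte drives
pcs_for_archive = drives_needed(2 * PETABYTE, 50 * GIGABYTE)
```

Under these assumptions the typing time comes out a little under 32 million years, in line with the "more than 30 million" figure, and the drive count matches the 20,000 PCs quoted. The 2 petabytes attributed to the Internet Archive would fill 40,000 such drives.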
Only a fraction
Still, it's just a fraction of the information stored on millions of Internet servers around the world. And what doesn't get archived can end up disappearing forever into the digital ether.
Gregory S. Hunter, a professor at the Palmer School of Library and Information Science at Long Island University and one of the nation's leading experts on electronic archiving, agreed that some Web sites are precious commodities that must be preserved.
In contrast to Kahle's all-inclusive approach, most other archives are consumed with the often vexing task of determining what data is worth saving and what belongs on the digital scrap heap.
Do we really want to preserve every teenager's MySpace page? Well, Hunter says, we may want to save some of them so future researchers can understand the phenomenon of social networking.
"It's very important that we preserve some Web sites as evidence of what has been created, or what was happening at a given point in the past. Newspaper Web sites are a good example. It's a cultural question. We want to preserve things that reflect society in all its beauty and ugliness for future generations," Hunter said.
Federal project
Hunter is the principal archivist for a project to build the federal government's Electronic Records Archive (ERA), which would preserve or "appropriately dispose" of any government electronic record.
The ERA, a project of the National Archives, passed a milestone in December with the successful test of its software system developed by Lockheed Martin.
Now comes the hard part.
"As archivists, we think that by making appropriate judgments we can help sort out the wheat from the chaff," Hunter said.
"If we save every bit of information, what good would it do us? If we keep nothing, that would do us no good either. Archivists are trying to find that middle point."
Copyright © 2008 The Seattle Times Company