Originally published Monday, April 7, 2008 at 12:00 AM
Trying to preserve today's Web for future generations
Newhouse News Service
The Internet is the world's living, ever-evolving database, certainly the largest depository of information ever assembled.
But as it is assembled and reassembled, changing by the second, researchers are concerned that too often, backing up is hard to do.
To some, nothing is older than a minutes-old Web page (to update the old saw about a day-old newspaper). But others see value in the Web sites of yester-minute, and archivists have been wrestling with ways of preserving today's Internet for tomorrow's scholars and researchers.
Without a working archive, these experts fear, future-generation Web surfers might never know who Client 9 was or what topics were generating the most interest on Digg.com this week.
Impossible task
Because of the Internet's sheer size and its warp-speed evolution, archiving its content in its entirety is impossible, like trying to catalog every grain of sand on the world's beaches.
But just as it is easy to take a photograph of a beach, it is also possible to grab snapshots of the Internet, or specific portions of it, to preserve for future generations.
And that's exactly what researchers at the Internet Archive, the Library of Congress, the National Archives and libraries worldwide are working on.
There have been some remarkable strides already, starting with the Mountain View, Calif.-based Internet Archive and its Wayback Machine, where its creator hopes to build a sort of second coming of the Library of Alexandria, the long-ago destroyed institution that housed much of the ancient world's recorded knowledge.
The project has archived some 85 billion Web pages, stored on computers whose capacity is measured in very large units called petabytes.
100 days
"The average life span of a Web site is about 100 days, so you have to be proactive about getting and saving them," said Brewster Kahle, who founded the nonprofit Internet Archive and began sculpting his vision of a working Internet library in 1996.
"We knew that this was coming. You could tell that there was going to be an online digital world, and we wanted to make sure there was a library built," said Kahle, who is as dedicated to his gargantuan task as he is unassuming in discussing it.
So, on a regular basis, the Internet Archive releases a robot program called Heritrix, which conducts Web crawls, bounding about the Internet and collecting Web sites by the millions.
Each crawl collects about 4 billion sites, which are saved in the Wayback Machine. Anyone can access the collection at www.archive.org, type in a site name and view archived past versions of it.
Initially funded by Kahle, the project has since received money from dozens of individuals and institutions — including the Mellon Foundation — and works worldwide with government agencies and libraries.
The robot crawlers collect only open public sites; those who don't want their sites archived can add a bit of code to block the bot.
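The article does not say what that "bit of code" looks like. One common mechanism is a robots.txt file, which well-behaved crawlers consult before fetching pages. The sketch below is only an illustration under that assumption: the robots.txt contents, the "ia_archiver" user-agent name, the example.com address and the is_allowed helper are all hypothetical and not taken from the article. It uses Python's standard urllib.robotparser module to show how a crawler might decide whether a page is off limits.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. The "ia_archiver"
# user-agent name is an assumption; the article does not name the bot's
# user agent or the exact blocking mechanism.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named bot is shut out of the whole site; a bot that the file does
# not mention is still allowed, because no rule applies to it.
blocked = is_allowed(ROBOTS_TXT, "ia_archiver", "http://example.com/index.html")
other = is_allowed(ROBOTS_TXT, "some-other-bot", "http://example.com/index.html")
print(blocked, other)
```

A rule like this is a request, not a lock: it works only because the crawler chooses to check the file before collecting a page.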
Kahle, a middle-aged Internet pioneer who sold startup companies to Amazon.com and AOL, said the Internet Archive has no endgame and its Web crawlers will continue collecting petabytes of data indefinitely as the Internet continues to expand.
It can be difficult to wrap your brain around the sheer size of a petabyte, which is about 1,000,000,000,000,000 bytes.
Science Grid This Week, a publication of Fermilab, sums it up like this: If a byte is a single character on a keyboard and you typed one character per second, it would take more than 30 million years to create a petabyte-length document.
Another example: Say you had a fleet of personal computers and each one had a 50 gigabyte hard drive. You would need 20,000 of those PCs to hold a petabyte of data.
So when the Internet Archive says it has 2 petabytes' worth of data stored, that's one supersized library.
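The petabyte comparisons above can be checked with a few lines of arithmetic. This is a rough sketch, assuming decimal units (a petabyte as 10^15 bytes and a gigabyte as 10^9 bytes) and a 365.25-day year; the variable names are illustrative only.

```python
# Assumption: decimal units, matching the article's round figures.
PETABYTE = 10**15           # bytes in one petabyte
GIGABYTE = 10**9            # bytes in one gigabyte

# Typing one character (one byte) per second, nonstop.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
years_to_type = PETABYTE / SECONDS_PER_YEAR

# A fleet of PCs, each with a 50-gigabyte hard drive.
DRIVE_BYTES = 50 * GIGABYTE
pcs_needed = PETABYTE // DRIVE_BYTES

print(f"Years to type one petabyte: about {years_to_type / 1e6:.1f} million")
print(f"50 GB PCs needed for one petabyte: {pcs_needed:,}")
```

Under these assumptions the typing figure comes to a little under 32 million years, consistent with "more than 30 million," and the PC count is exactly 20,000.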
Only a fraction
Still, it's just a fraction of the information stored on millions of Internet servers around the world. And what doesn't get archived can end up disappearing forever into the digital ether.
Gregory S. Hunter, a professor at the Palmer School of Library and Information Science at Long Island University and one of the nation's leading experts on electronic archiving, agreed that some Web sites are precious commodities that must be preserved.
Unlike Kahle's all-inclusive approach, most other archives are consumed with the often vexing task of determining what data is worth saving and what belongs in the digital scrap heap.
Do we really want to preserve every teenager's MySpace page? Well, Hunter says, we may want to save some of them so future researchers can understand the phenomenon of social networking.
"It's very important that we preserve some Web sites as evidence of what has been created, or what was happening at a given point in the past. Newspaper Web sites are a good example. It's a cultural question. We want to preserve things that reflect society in all its beauty and ugliness for future generations," Hunter said.
Federal project
Hunter is the principal archivist for a project to build the federal government's Electronic Records Archive (ERA), which would preserve or "appropriately dispose" of any government electronic record.
The ERA, a project of the National Archives, passed a milestone in December with the successful test of its software system developed by Lockheed Martin.
Now comes the hard part.
"As archivists, we think that by making appropriate judgments we can help sort out the wheat from the chaff," Hunter said.
"If we save every bit of information, what good would it do us? If we keep nothing, that would do us no good either. Archivists are trying to find that middle point."
Copyright © 2008 The Seattle Times Company