Originally published Monday, April 7, 2008 at 12:00 AM
Trying to preserve today's Web for future generations
Newhouse News Service
The Internet is the world's living, ever-evolving database, certainly the largest depository of information ever assembled.
But as it is assembled and reassembled, changing by the second, researchers are concerned that too often, backing up is hard to do.
To some, nothing is older than a minutes-old Web page (to update the old saw about a day-old newspaper). But for others there is value in the Web sites of yester-minute, and archivists have been wrestling with ways to preserve today's Internet for tomorrow's scholars and researchers.
Without a working archive, these experts fear, future-generation Web surfers might never know who Client 9 was or what topics were generating the most interest on Digg.com this week.
Impossible task
By virtue of the Internet's sheer enormousness and its warp-speed evolution, the task of archiving its content in its entirety is impossible, like trying to catalog every grain of sand on the world's beaches.
But just as it is easy to take a photograph of a beach, it is also possible to grab snapshots of the Internet, or specific portions of it, to preserve for future generations.
And that's exactly what researchers at the Internet Archive, the Library of Congress, the National Archives and libraries worldwide are working on.
There have been some remarkable strides already, starting with the Mountain View, Calif.-based Internet Archive and its Wayback Machine, where its creator hopes to build a sort of second coming of the Library of Alexandria, the long-ago destroyed institution that housed much of the ancient world's recorded knowledge.
The project has archived some 85 billion Web pages on computers whose storage is measured in vast units called petabytes.
100 days
"The average life span of a Web site is about 100 days, so you have to be proactive about getting and saving them," said Brewster Kahle, who founded the nonprofit Internet Archive and began sculpting his vision of a working Internet library in 1996.
"We knew that this was coming. You could tell that there was going to be an online digital world, and we wanted to make sure there was a library built," said Kahle, who is as dedicated to his gargantuan task as he is unassuming in discussing it.
So, on a regular basis, the Internet Archive releases a robot program called Heritrix, which conducts Web crawls, bounding about the Internet and collecting Web sites by the millions.
Each crawl collects about 4 billion sites, which are saved in the Wayback Machine. Anyone can access the collection at www.archive.org, type in a site name and view archived past versions of it.
Initially funded by Kahle, the project has since received money from dozens of individuals and institutions — including the Mellon Foundation — and works worldwide with government agencies and libraries.
The robot crawlers collect only open public sites; those who don't want their sites archived can add a bit of code to block the bot.
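That "bit of code" is typically a robots.txt file placed at the root of a site. The sketch below is only an illustration, assuming a crawler that honors the Robots Exclusion Protocol; the user-agent name example-archive-bot and the example.com address are made up for the example and are not taken from the Internet Archive. It uses Python's standard urllib.robotparser module to show how such a file shuts out one bot while leaving the site open to others.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The bot name is invented for illustration;
# a site owner would use whatever name the real crawler announces.
ROBOTS_TXT = """\
User-agent: example-archive-bot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

page = "https://example.com/page.html"
archive_allowed = parser.can_fetch("example-archive-bot", page)
other_allowed = parser.can_fetch("some-other-bot", page)

print(archive_allowed)  # False: the archive bot is blocked from the whole site
print(other_allowed)    # True: every other bot may still fetch the page
```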
Kahle, a middle-aged Internet pioneer who sold startup companies to Amazon.com and AOL, said the Internet Archive has no endgame and its Web crawlers will continue indefinitely collecting petabytes of data as the Internet continues to expand.
It can be difficult to wrap your brain around the sheer size of a petabyte, which is about 1,000,000,000,000,000 bytes.
Science Grid This Week, a Fermilab publication, sums it up like this: If a byte is a single character on a keyboard and you typed one character per second, it would take more than 30 million years to create a petabyte-length document.
Another example: Say you had a fleet of personal computers and each one had a 50-gigabyte hard drive. You would need 20,000 of those PCs to hold a petabyte of data.
So when the Internet Archive says it has 2 petabytes' worth of data stored, that's one supersized library.
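Both comparisons can be checked with a few lines of arithmetic. The sketch below assumes the decimal definitions (a petabyte as 10^15 bytes, a gigabyte as 10^9 bytes) and a 365.25-day year; those choices are assumptions made here, not figures stated by the sources quoted above.

```python
# Back-of-the-envelope check of the two petabyte comparisons.
PETABYTE = 10**15   # bytes (decimal definition, assumed)
GIGABYTE = 10**9    # bytes (decimal definition, assumed)

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumes a 365.25-day year

# Typing one byte (one character) per second:
years_to_type = PETABYTE / SECONDS_PER_YEAR

# Number of PCs with 50-gigabyte drives needed to hold one petabyte:
pcs_needed = PETABYTE // (50 * GIGABYTE)

print(f"{years_to_type / 1e6:.1f} million years")  # 31.7 million years
print(f"{pcs_needed:,} PCs")                       # 20,000 PCs
```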
Only a fraction
Still, it's just a fraction of the information stored on millions of Internet servers around the world. And what doesn't get archived can end up disappearing forever into the digital ether.
Gregory S. Hunter, a professor at the Palmer School of Library and Information Science at Long Island University and one of the nation's leading experts on electronic archiving, agreed that some Web sites are precious commodities that must be preserved.
In contrast to Kahle's all-inclusive approach, most other archives are consumed with the often vexing task of determining what data is worth saving and what belongs in the digital scrap heap.
Do we really want to preserve every teenager's MySpace page? Well, Hunter says, we may want to save some of them so future researchers can understand the phenomenon of social networking.
"It's very important that we preserve some Web sites as evidence of what has been created, or what was happening at a given point in the past. Newspaper Web sites are a good example. It's a cultural question. We want to preserve things that reflect society in all its beauty and ugliness for future generations," Hunter said.
Federal project
Hunter is the principal archivist for a project to build the federal government's Electronic Records Archive (ERA), which would preserve or "appropriately dispose" of any government electronic record.
The ERA, a project of the National Archives, passed a milestone in December with the successful test of its software system developed by Lockheed Martin.
Now comes the hard part.
"As archivists, we think that by making appropriate judgments we can help sort out the wheat from the chaff," Hunter said.
"If we save every bit of information, what good would it do us? If we keep nothing, that would do us no good either. Archivists are trying to find that middle point."
Copyright © 2008 The Seattle Times Company