Considerations of Web Archiving

This document was prepared as part of the decision-making process for a project with the ELCA. It describes some of the things to consider when planning a web archive. This background can help clarify the tradeoffs involved in planning such a project.

Goals

There are a number of goals an organization may have for archiving old websites. So much of our organizational history is bound up in the web today that one goal is simply keeping a record of organizational activities and practices. What used to be in files and folders is today presented on the web. As we migrate from one site to another, our connection to past practices can be lost. This goal can be inward-facing, a preservation of self-awareness, and it can be outward-facing, a way to hold ourselves accountable to the past by sharing it with future partners.

The web also serves as our presence to much of the world: people see us by what we share on the web. Another goal of web archiving may be to preserve that presence over time. When we migrate to new websites it is very difficult to take along every single piece of content from the old sites. It is even more difficult, and probably inadvisable, to maintain the URL structure of our old sites. Yet our partners have likely embedded links to our old site in all kinds of documents, from other web pages, to presentations, to tweets, to Facebook status updates. There is no way to eliminate all these references to our old site, and simply migrating to a new website will break all these links, making our organization seem less present in the world. Web archiving can be used to preserve access to our old content.

Finally, an organization may require an archive of the actual internal structure that drove a particular website, the database and contributions to the site. This may be needed for legal reasons or because the site is so complex in its internal relationships that it cannot be adequately represented by a static HTML rendering.

To some extent, the first goal can be met by allowing our sites to be archived by the Internet Archive (IA). Just explore their Wayback Machine to see how much of your organization’s web presence has been archived there. This may be enough for many organizations; just make sure you are being crawled by IA. However, this cannot address the goal of preserving direct access through your old URLs, and it certainly does not leave your organization with its own copy of its history.

Technical Approach

Today’s websites are rarely made up of simple HTML documents in a filesystem. More likely they are generated by content management systems of various flavors that render HTML as a last step on the way to serving a reader’s request. The technical preservation of a website can take a number of forms: the actual preservation of the database and software that runs a site, the preservation of the exact HTML that was rendered to present a site, or the preservation of a modified version of that HTML designed to render a facsimile of the original site without the presence of the original content management system.

These three approaches are quite different in the technology they demand to create the archive and the technology later required to use that archive. For example, capturing the actual underlying software, database, and file structure of a website is often quite simple (this is commonly done as part of the “backup” procedures for most sites already). But rendering that raw data into a website one can browse requires the whole complex “stack” of technology the original site required (database servers, programming languages on the server side, web servers with special awareness of the content, and so on). A very simple archive leaves you with a very difficult retrieval task.

Alternatively, consider the simple crawling approach. It is not very difficult to use off-the-shelf tools to “crawl” across a website and save everything found there. This does not save the raw source material sitting on the server, but rather saves the “rendered” HTML that the server delivers to each reader’s web browser. This is often called a “flattened” or “static” or “freeze-dried” website in that it captures the state of a site at a moment in time. However, to serve the site to a future reader this captured HTML must be modified or “rewritten” so that things like self-referential internal links point back to the archive instead of to the original URLs. This capture and rewriting leaves you with a representation of the site that is quite different from the original.
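To make the “rewriting” step concrete, here is a minimal sketch in Python of how a single self-referential link might be pointed back at the archive. The host name, the archive location, and the function name are all hypothetical, chosen only for illustration; a real rewriting script would apply something like this to every link found in every captured page.

```python
from urllib.parse import urljoin, urlsplit

# Both values are assumptions made up for this illustration.
OLD_HOST = "www.example.org"
ARCHIVE_BASE = "https://archive.example.org/2015/"


def rewrite_link(link, page_url):
    """Return the archive URL for a self-referential link, else the link unchanged."""
    # Resolve relative links against the URL of the page they were found on.
    absolute = urljoin(page_url, link)
    parts = urlsplit(absolute)

    # Leave external links, mailto: addresses, and the like untouched.
    if parts.scheme not in ("http", "https") or parts.hostname != OLD_HOST:
        return link

    rewritten = ARCHIVE_BASE + parts.path.lstrip("/")
    if parts.query:
        rewritten += "?" + parts.query
    if parts.fragment:
        rewritten += "#" + parts.fragment
    return rewritten
```

Links that leave the old site are deliberately passed through as they were found, since only the self-referential links need to land in the archive.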

The “backup” strategy lies beyond the scope of this project because, while the capture itself is simple to execute, the technology needed to present the site later is quite complex and unpredictable. Instead, this project will focus on the “crawling” approach, capturing a snapshot of the website in time and making it possible to serve that snapshot up to future readers with as little infrastructure as possible. This is a tradeoff, but by keeping the infrastructure required for serving up the archived site minimal, we increase our chances of being able to serve up the site over the long haul.

Discovery

Creating the archive is one challenge. Helping people find the archive is another. To be “present” on the web requires that this discovery task be as graceful as possible. The Wayback Machine may have a copy of most of your web content, but a reader won’t discover that without actually visiting the Wayback Machine and searching for your old URL there themselves. Your site won’t be very present to that user unless you make the discovery of the archive simpler.

Simplifying this process requires taking advantage of “redirection.” All web servers are capable of redirecting users from one page they were looking for to another page that actually exists. All server-side programming languages (like PHP) can do the same thing, redirect the user. Accomplishing this redirection means that your organization must maintain the old host names (the “.org” or “.com” names you registered for your site) and redirect requests for old pages to the archives.

If the host name you used for the old site is the same as the name you are using for the new site (almost always the case when launching a new website design, for example), then this can be accomplished with a “failover” of the new site to the archive. Typically, if a user requests a page that does not exist on a site, the server will respond with a “404 not found” error of some sort. Instead, the failover requires that the server send all “not found” URLs to a script running on the server that redirects those URLs to the archive. If the archive finds an old page that matches the redirected URL, it serves that page up. If it does not find an old page, the archive serves up the “404 not found” message itself.

This approach is vital to making sure that URLs stored by users of your old site around the world will still work to pull up the archival resources.
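The failover described above can be sketched in a few lines of Python. This is only an illustration of the logic, not a drop-in handler: the archive host name and both function names are assumptions, and how the “not found” request reaches such a script depends entirely on your web server and your new site. The choice of a temporary (302) redirect is also just an assumption; your project may prefer a permanent one.

```python
# Hypothetical archive location, for illustration only.
ARCHIVE_BASE = "https://archive.example.org"


def failover_redirect(request_path, query_string=""):
    """On the NEW site: turn a 'not found' request into a redirect to the archive."""
    location = ARCHIVE_BASE + request_path
    if query_string:
        location += "?" + query_string
    # Return a status line and headers, the way a simple handler might.
    return "302 Found", [("Location", location)]


def archive_status(request_path, archived_paths):
    """On the ARCHIVE: serve the old page if it was captured, else give the 404."""
    if request_path in archived_paths:
        return "200 OK"
    return "404 Not Found"
```

Note that the new site never answers “not found” itself in this arrangement. That decision is handed to the archive, which is the only system that knows which old pages exist.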

The Consolidated Method

The intensely archival approach prioritizes maintaining the HTML and files served up by the old web server in as close to an original form as possible. This may seem odd, since we have already admitted that we are “flattening” the site in question, so the rendered HTML is not the actual data that lives behind the scenes; it is just a moment-in-time snapshot. Still, for some future researchers it may be valuable to know that the HTML in the archive is exactly the HTML the living server provided.

This approach was developed as I worked on the task of archiving the Internet2 collection at the University of Minnesota. We called this “Documenting Internet2” or DI2. Ironically, the technical discussion of this process is no longer available at the university, but luckily I kept a copy of these pages and they are available from my DI2 snapshot. It starts with a Heritrix crawl which produces a set of “.arc” files. These files are like “.zip” files, a consolidation of everything Heritrix found as it was crawling. This material is stored bit-for-bit as it was found, and the fact that these archive files are never altered provides the truly archival attribute of this method.

Of course, to make the site usable by readers, some kind of front-end must be developed to take incoming URLs, look up their content in the archival files, extract that content, modify it so that it works properly in the current environment, and present it to the user. The need for this “front-end” is one of the drawbacks of this method. It requires some care and feeding or nothing will be available to users.

The advantage is that whatever choices were made in developing the front-end can always be revisited. Since the archival files are as they were found, any compromises made when rendering them for users can be reconsidered as the organization’s resources change. Any changes to the front-end will automatically have an impact on the whole archive. And the archive itself, consisting of the large “.arc” repositories of content rather than the tens of thousands of individual files, is relatively easy to manage.
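The lookup at the heart of such a front-end can be illustrated with a deliberately simplified Python sketch. It does not read the real “.arc” format and is not the code used in the DI2 project. It only shows the general idea, under the assumption that an index records where each captured URL sits inside one large consolidated file; every name in it is hypothetical.

```python
import io


def append_record(container, index, url, body):
    """Add captured bytes to the end of the consolidated file and index them."""
    offset = container.seek(0, io.SEEK_END)  # position where this record starts
    container.write(body)
    index[url] = (offset, len(body))


def fetch(container, index, url):
    """Look up a URL in the index and read its bytes, untouched, from the file."""
    entry = index.get(url)
    if entry is None:
        return None  # the front-end would answer "404 not found" here
    offset, length = entry
    container.seek(offset)
    return container.read(length)


# Usage: an in-memory stand-in for one large archive file.
container = io.BytesIO()
index = {}
append_record(container, index, "http://www.example.org/", b"<html>home</html>")
append_record(container, index, "http://www.example.org/about", b"<html>about</html>")
```

Any rewriting for the reader would happen on the bytes returned by the lookup, at the moment of serving. The stored bytes are never changed, which is why front-end decisions can be revisited later.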

It may be instructive to note that I have only implemented this process once, and that the archive we produced is no longer in service or accessible. On the other hand, the Wayback Machine operates this way year after year and the Internet Archive makes the core of this functionality available as a hosted service via Archive-It.

Advantages of the Consolidated Method:

purity of captured files
consolidation of captured files into large “.arc” files
consistency of treatment of files is guaranteed
ease of future revisions in the way files are served to readers

Disadvantages of the Consolidated Method:

Heritrix has a significant learning curve and requires a Java-capable host
infrastructure to serve up results requires ongoing PHP and MySQL support
future maintainers can’t just see what they have in the filesystem

Note that this list of disadvantages can be traded for another disadvantage by subscribing to the Archive-It service. In this case the disadvantage would be that the archival files are no longer held on your organization’s servers. Oh, wait... you may consider that an advantage!

The Discrete Method

Many organizations cannot commit to the long-term maintenance of the kind of front-end demanded by the archival approach, yet they still would like to preserve access to older resources while retiring older content management systems and servers. Simply capturing a “freeze-dried” static version of their older sites can be a viable alternative for these organizations. I have implemented such simple archives for the Digital Library Federation, the Coalition for Networked Information, and the Religious Education Association.

This method still requires a crawl of the site to be archived, but this crawl can be accomplished with a tool that is considerably less complex than Heritrix. I have used a simple application called SiteSucker to capture sites, for example, though many other web crawling tools could be equally suitable.

Of course, to serve up the results of such a crawl requires that the HTML pages, at least, be rewritten to point their self-referential links back to the new archive location. This can be accomplished by creating scripts that rewrite all the HTML for the site. While we are at it, we also insert a bit of JavaScript to facilitate the “wrapping” of each page in some sort of frame that warns the reader they are seeing an archive. By using JavaScript for this wrapping, we give ourselves the flexibility to change the language of that wrapper whenever we want without going back and rewriting the individual pages again.
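As a rough sketch of the “wrapping” step, the Python below inserts one script reference into a captured page. The script path and the function name are hypothetical. The point is that each page receives only a pointer to a shared script, so the wording of the warning lives in one place and can change later without touching the pages again.

```python
import re

# Hypothetical location of the shared wrapper script.
BANNER_TAG = '<script src="/archive/banner.js"></script>'

CLOSE_BODY = re.compile(r"</body\s*>", re.IGNORECASE)


def add_banner(html):
    """Insert the wrapper script just before the last closing body tag."""
    if BANNER_TAG in html:
        return html  # already processed; a second run changes nothing

    last = None
    for last in CLOSE_BODY.finditer(html):
        pass  # keep only the final match

    if last is None:
        return html + BANNER_TAG  # no closing body tag found; append at the end
    return html[:last.start()] + BANNER_TAG + html[last.start():]
```

Making the function safe to run twice is a small design choice that matters in practice, because rewriting scripts tend to be re-run several times before a crawl is judged clean.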

This rewriting process, even though it is undertaken carefully, can also damage some files. Even if it does not damage them, it clearly alters them, making this a less-pure archive than that captured by the Heritrix crawl. Still, the effect is quite similar: an archived site that more or less looks like a static version of the old site.

Note that the files captured by this method are also not consolidated. This means the result is often a massive set of directories with tens of thousands of discrete files that each live individually on whatever system hosts the archive.

Advantages of the Discrete Method:

crawl is relatively easy to accomplish
archive site can be served up with only a web server, no server-side language or database required

Disadvantages of the Discrete Method:

crawl is often a bit dirtier (error-prone) and needs a few repeated tries to get right
rewriting files actually modifies them, which can lead to sticky situations if mistakes are made
managing thousands of files from old sites can be more difficult if (when) server recovery is required

Expectations

Websites are ever more complex beasts. These methods are reasonable for capturing sites that render into relatively plain HTML. However, more and more sites today include interactive, JavaScript-driven and dependent effects that may well not survive the “freeze-drying” process intact. For example, if your site uses scripts to generate links to other portions of your site, these methods will likely not recognize or rewrite those scripts and those links will end up pointing to destinations other than the archive, stranding the reader. If your site gathers data from other sources, those sources may refuse to deliver that data to the archive or you may capture just the moment-in-time data that was present at the time of capture. If your site delivers user-entered data to a back end in order to render a new result, that won’t work in the archive (for example, many site-search systems break because the search engine is no longer actually behind the site).
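One modest way to find stranded links after rewriting is to scan the archived files for any remaining mention of the old host. The Python sketch below assumes a hypothetical host name and function name, and it only catches full URLs written out in the text. A link that a script assembles from pieces at run time will not show up, so treat a clean result as encouraging rather than conclusive.

```python
import re


def stale_references(text, old_host):
    """List full URLs in the text that still point at the old host."""
    pattern = re.compile(
        r"https?://" + re.escape(old_host) + r"[^\s\"'<>)]*",
        re.IGNORECASE,
    )
    return pattern.findall(text)


# Usage: a rewritten page whose script still builds links to the old site.
sample = (
    '<script>var u = "http://www.example.org/events/" + id;</script>'
    '<a href="/archive/about.html">About</a>'
)
leftovers = stale_references(sample, "www.example.org")
```

Here `leftovers` holds the one URL inside the script, which is exactly the kind of reference an HTML-only rewrite would miss.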

These methods will do a good job of capturing pages, not whole dynamic systems. If your site consists mostly of pages that the reader navigates by clicking on links, then the archive will probably be pretty functional. The further from this simplistic model your old site strays, the less representative the archive will be of the old site.

In most cases, a crawl can produce a good landing page for those URLs readers are likely to have embedded in systems around the web or in documentation of their own. However, don’t forget that creating that smooth “failover” experience from your new site to the archival site is a separate task that depends heavily on the architecture of your new site.

Finally, capturing just portions of an existing site or, alternatively, capturing many separate existing sites into a single archive are more complex tasks. These methods are best at capturing everything within a single “domain.” The nature of web crawlers makes it more complicated to set boundaries within a domain or to cross over a certain set of domains. Not impossible, just more complicated. If your needs require this sort of bounding, be prepared to spend extra time configuring whatever web crawler you use.
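The kind of boundary a crawler has to enforce can be sketched as a simple test in Python. This is not the configuration syntax of Heritrix, SiteSucker, or any other tool; the function name and the example hosts are hypothetical and only show the decision each tool must be configured to make for every link it finds.

```python
from urllib.parse import urlsplit


def in_scope(url, allowed_hosts, path_prefix="/"):
    """Decide whether a discovered URL belongs in this crawl."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False  # skip mailto:, javascript:, and similar
    if parts.hostname not in allowed_hosts:
        return False  # outside the set of hosts being captured
    path = parts.path or "/"
    # Use a trailing slash in the prefix ("/library/") so that
    # "/library-old/" is not swept in by accident.
    return path.startswith(path_prefix)
```

Capturing a portion of one site means narrowing `path_prefix`, while capturing several sites together means widening `allowed_hosts`. Both cases add rules that have to be checked against the real site before trusting the crawl.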

Choices

Your project will require tradeoffs. Consider carefully what your organization’s goals are for the archive you are planning. Consider your own organizational strengths and weaknesses. If you have a strong IT staff, for example, then the demands of the Consolidated Method may well justify the investment in infrastructure and be rewarded by the ease of management of archival files over time. If your technical shop is more constrained, the Discrete Method may actually provide more functionality and be simpler for less technical staff to understand and manage. Is it important to you to maintain old URLs so that users don’t hit 404 errors when following old links? Then make sure to create a failover mechanism (and maintain it).

I can help you work through these choices. Feel free to contact me to talk them through.