ELCA Web Archiving Proposal

This DRAFT (120501) proposal responds to the following request from the ELCA.

Request

The pressing concern are sites related to previous ELCA churchwide assemblies, and two domains for programs that are no longer being supported by the ELCA. For historic reasons, I would like to see the sites preserved online, but they need to clearly be marked as archival, with links back to ELCA.org pages for the latest related information (e.g. a link to the page for the 2013 churchwide assembly). We use ISAPI Rewriter to manage URL shortcuts, so over time we could redirect visitors later without having to modify the frozen archive further.

Could you estimate the cost to provide archived versions of the following web properties, along with the specs and instructions for mounting them on a basic web server in the Lutheran Center?
1. Davey and Goliath, http://www.daveyandgoliath.org
  - 1,475 files, 634MB
2. Grace Matters, http://www.gracematters.org
  - 3,139 files, 1.6GB
  - lots of audio files
3. 2001 Assembly site, http://www2.elca.org/assembly/01/
  - 490 files, 18.2MB
4. 2003 Assembly site, http://www2.elca.org/assembly/03/
  - 631 files, 150MB
5. 2005 Assembly site, http://www2.elca.org/assembly/05/
  - 496 files, 573MB
6. 2007 Assembly site, http://www2.elca.org/assembly/07/
  - a little confusion here …
  - the bulk of files for the 2007 assembly seem to be in the /assembly folder …
  - it’s probably similar to 2005
7. 2009 Assembly site, http://www2.elca.org/assembly99/
  - 505 files, 9.6MB also a little suspicious
  - bulky files may be stored outside the directories
  - Will your process “gather” them all in?
Would you be willing to train ELCA IT staff to make regular archives of ELCA.org? As part of that, freezing and saving a copy of the entire site just before the transition to a redeveloped version, early next year, would be helpful for historic reasons, but also to help people transition to a new ELCA.org site, which will certainly be smaller than the current one.

Assumptions

In order to respond to this request I am going to make a number of assumptions. I will list these here so that they will be easy to correct should I be assuming something false.

Funds for this web archiving project are actually quite limited, this is not the top priority for ELCA.
The “basic web server” referred to may or may not be running a database, but in any case the web archiving project should not presume access to a database. It should stand alone.
The process should be simple enough for IT staff to engage in themselves, without further support from me.
The ELCA web servers are IIS and the ELCA is a Windows shop.
Perl or PHP is available for scripting on the server side.

Considerations

My response assumes you have reviewed the considerations for web archiving and have at least a basic understanding of the trade-offs inherent in different methods of web archiving. If you have not already done so, I highly recommend that you review those considerations before reading this proposal.

I believe the ELCA would be best served by pursuing the Discrete Method of web archiving. This method allows for more flexibility in crawler choice, leaves files in places that are more visible to staff who perhaps did not participate in the original crawl, and requires less infrastructure to run day-to-day. Once a few basic skills of rewriting HTML and using JavaScript are mastered, this method can be managed without consultant help.

I believe the Consolidated Method would represent too high an overhead for ELCA. It would require more infrastructure to run day-to-day and certainly more expertise to execute independently. If the Consolidated Method is attractive to the ELCA, I would suggest that I act as a consultant in connecting them with the Archive-It service and training them to use it. This options is not really discussed below, but should be kept in mind.

However, the ELCA sites described in the request do present some challenges for either method. The size of the sites is not a problem, though the ELCA will need to provide disk space to house them on whatever server it wants to be the host. The complexity of some of the sites may be more of an issue, though.

Many of the assembly sites are not really separate sites at all, but simply directories of the core ELCA site. It is rare for such sub sites to depend purely on resources within that subdirectory, so crawling often has to be allowed to extend into the parent site as well. However, if that is allowed, then it is likely that each “child site” archived will pull in duplicating “parent site” resources. This can quickly get unwieldy. Another option would be to strictly limit the crawls to the subdirectories in question, but this would potentially leave holes (whether missing pages, illustrations, or style sheets) in the site.

I am not quite certain whether the whole elca.org site is to be considered for archiving. If this will be needed in the near future, we might just want to start there. This would bring along the assembly sites “for free” and assure you that the archiving of the whole elca.org site could be accomplished later again. The Discrete Method of archiving sites does not really allow for the incremental addition of new material to the archive, so in the future the site would just be crawled again and the new archive would replace the old.

Some of the sites make reference to media files stored at media.elca.org, a server that does not seem to be responding when I gave these sites a cursory review. This means that these media objects may not be retrieved at all, or if they are retrieved they may end up causing duplication as noted for the assembly sites.

Since so many domains and subdomains are involved, serving these up efficiently from an archive will require careful analysis and planning. This planning will have to take place before URLs are rewritten, since the rewriting may need to be adjusted based on what we find. In fact, this looks to be complex enough that I’d want to try something altogether new, and manage the URL rewriting from JavaScript, making it more universal across the archives and flexible for future adjustments. This would make the archive much less useful to anyone without JavaScript running on their browser, but that is a smaller and smaller portion of the browsing public these days.

One question in the request is whether this process would “gather in” files from such disparate servers so that they would form a single archive. The answer is a bit complex, but essentially “yes.” The archive will be able to run from a single server, but that single server would then require all the space needed to house these currently dispersed files. Furthermore, each server contributing files to the archive will have to be crawled, either by being the target of a crawl separately, or by being included in a crawl that originates from another target. Only files that are included in a crawl one way or another will be “gathered in” by this process.

Web archiving necessarily means moving a lot of data over the internet. This can be slow, especially if that data is moving to or from my own (home) office. If the ELCA can provide an account on a local machine with sufficient storage space for the archives and remote access, that might speed things up. Also, given that I am a thoroughly Mac and Unix operative and ELCA appears to be a Windows shop, this might serve to help me create tools you can actually use locally.

I make the assumption that ELCA is a Windows shop based on the mention of ISAPI Rewrite in the request. This appears to be an IIS implementation of Apache’s mod_rewrite. I am totally unfamiliar with Windows and IIS and will require considerable input from your tech team to make sure that what I develop will be workable in your environment.

Proposal

I propose an iterative process that can uncover the complexities of the ELCA sites and offer an opportunity to “train-in” ELCA staff to produce future archives.

I would start with one site, preferably one of the assembly sites since these may present issues that are common in this set of sites, though the ELCA could choose one of the others if another is a higher priority. I would work through the whole archiving process with that site, developing the required Perl, PHP, and JavaScript along the way. I would work with ELCA staff to install this archived site on ELCA servers for testing.

Deliverable 1: crawl and analysis of first site and detailed proposal for archiving to share with ELCA

Cost: $1,500

Deliverable 2: development of server-side and client-side scripts to manage rewriting and wrapping

Cost: $1,200

Deliverable 3: installation and testing of archived site on an ELCA server

Cost: $800

Once we are confident it will serve ELCA purposes, I will document the process I used and work through a second site revising the documentation.

Deliverable 4: develop documentation of the archiving procedures

Cost: $500

Deliverable 5: crawl and analysis of second site and detailed proposal for archiving to share with ELCA

Cost: $800

Deliverable 6: installation and testing of archived site

Cost: $500

After two sites are complete, I will train ELCA staff to follow the documentation and archive a third site. This could be done on site or remotely. If remotely, we would likely do most of this by sharing a screen while talking over Skype or the phone.

Deliverable 7: revised documentation shared with ELCA staff

Cost: $1,000

Deliverable 8: close supervision of crawl and analysis of third site by ELCA staff

Cost: $1,000

Deliverable 9: close supervision of installation and testing of third archived site by ELCA staff

Cost: $500

The ELCA can then decide whether it wants to retain me in a support role while its staff continue archiving the sites in this request, or would rather just let its staff independently complete the task.

Deliverable 10: support during each additional crawl and archive installation

Cost: $800

To accomplish a smooth “failover” from a new website to an archive requires some attention to the mechanism for this handoff. This proposal does not include any development of such a failover because (A) it is not clear that it would be necessary before the transition to a new ELCA site and (B) such a mechanism is highly dependent on the tools with which the new site is to be built. I would be happy to consult on this as well, but left it out of the scope of this proposal.

Timeline

I would be available to complete the first site by the end of May 2012 and the second by the middle of June 2012. I will not be available from 20 June - 20 July, but can then help out with further sites later in July. Once ELCA decides to begin the project and identifies the first site to be archived, I estimate two weeks to completion of that archive. Each further archiving should not take more than one week, though specific times will depend on availability on those weeks.

Putting it all together

As a consultant, I do not charge by the hour. The costs above will be charged as flat rates. These costs total $7,800 plus $800 per site beyond the first three. I would submit invoices at the end of each month. I would be happy to travel to your offices to conduct training in person for up to two days, but would expect travel expenses plus a $40 per diem to be covered in that case.

I realize that this cost may be more than ELCA can afford. If that is the case, please let me know what your budget is for this task and I will consider whether it can be restructured in some way and/or whether I can afford to contribute the remainder of the consulting in-kind.

Eric Celeste

Eric brings over 20 years of library and 30 years of technology experience to his consulting. At MIT Eric shepherded the creation of DSpace, open source digital repository management software developed with HP and now deployed at hundreds of institutions worldwide. At the University of Minnesota Libraries he encouraged the development of the UThink blog service, a wiki-based staff intranet, LibData, and the University Digital Conservancy. As a consultant he has worked with the Minnesota Digital Library, the Minnesota Historical Society, the Digital Library Federation, the Coalition for Networked Information, OCLC, and many others. He works with non-profit institutions on appropriate uses of technology for informing, communicating, and collaborating with their constituencies.