OLIVER 2.0 Proposal for HMML

This proposal has not been accepted by HMML. Please contact Eric with any questions.

On 19 November 2014 I was contacted by William Straub, Systems Librarian and Web Developer for the Hill Museum & Manuscript Library (HMML) at Saint John’s University (SJU). He requested a price quote for participating in a grant-funded project to rebuild the HMML manuscript catalog database (called OLIVER) using new tools and in such a way that the data could be used as part of the emerging linked data ecosystem on the web.

This database, currently managed in Microsoft Access and delivered on the web via Microsoft SQL Server and a number of other tools, has become unwieldy to update and keep in sync. The dataset does not adhere to any widely held standard such as MARC; it is a home-grown solution to the rather special circumstance of cataloging rare and unique manuscripts. The data currently describes roughly 105,000 manuscripts with 106,000 parts representing over 275,000 works, along with about 47,000 individual images, for a total of just over 500,000 records spread across a variety of MS Access tables.

Given the resources available at many of the sites where this metadata is gathered, and the training already invested in staff at those sites, it would be best if data gathering could continue with as little disruption as possible. Currently HMML provides spreadsheet templates that are filled out on site and returned to HMML by any convenient means. The sites often lack consistent internet connectivity, making transfer of the data by thumb drive or email necessary. I believe a renovation of the database at HMML could be accomplished without significant changes to this relatively low-tech workflow.

Overview

I propose that the data be moved from MS Access to JSON files stored in a plain computer filesystem, and that those files be indexed by ElasticSearch to facilitate both staff maintenance of the JSON data and public access to the resulting catalog.

JSON (json.org), short for JavaScript Object Notation, emerged in the early 2000s as a subset of the JavaScript language standard designed for the simple serialization of JavaScript data objects. JSON has since grown into a major interchange format in its own right, now rivaling XML in its adoption across the web. Much like XML, JSON offers a well-understood way to encode data as plain text. However, JSON is much “lighter weight” than XML, using a simpler syntax with far less fuss about namespaces and the other details that encumber XML. As a result, JSON is more easily readable by humans, usually consumes less space than XML, and has been adopted by a wide range of tools well outside the realm of formal libraries, archives, and markup environments. The simplicity of JSON has attracted a wide community of users and encouraged a phenomenal range of uses.

I suggest JSON for OLIVER 2.0 because of this simplicity. Representing current HMML cataloging practice in JSON will be quite natural, and the resulting records will be easy for both catalogers and technology staff to understand. More importantly, changes in practice over time will be easier to express in JSON than in the more formal XML standard. By expressing the catalog records as JSON documents, HMML would also open those documents to manipulation by a wide variety of tools and programming languages that understand JSON either natively or via well-developed and well-supported libraries.
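To make this concrete, a single catalog record might be encoded along the following lines. This is only a sketch: the field names and values are hypothetical placeholders standing in for HMML’s actual cataloging fields, and the real structure would be worked out with HMML staff.

    {
      "id": "example-0001",
      "repository": "Example Monastery Library",
      "shelfmark": "MS 42",
      "language": "Syriac",
      "date": "15th century",
      "parts": [
        {
          "part": 1,
          "works": [
            { "title": "Example homily", "author": "Unknown" }
          ]
        }
      ],
      "images": ["example-0001-0001.jpg"]
    }

Each record would live as one such file on disk, mirroring the manuscript, part, work, and image structure of the current database.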

ElasticSearch (elasticsearch.org) is one such tool, providing a powerful, flexible, and incredibly fast way to index, search, and retrieve JSON data. First released in 2010, ElasticSearch is based on the Lucene search engine that is also at the heart of Apache Solr, which in turn is used in a number of widely used library systems (from open source projects like Hydra to commercial products such as Primo from Ex Libris). In other words, ElasticSearch has a very strong pedigree. It can certainly handle the complexity and scale of the HMML manuscript data with grace, and is in fact designed to scale into the millions of records.
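ElasticSearch is driven entirely through an HTTP interface that accepts and returns JSON, so any scripting language with an HTTP library can talk to it. The following is a minimal sketch in PHP, assuming a default ElasticSearch installation on localhost and a hypothetical index named “oliver”; it indexes the example record above and then searches it.

    <?php
    // Minimal sketch: talk to ElasticSearch's REST API over HTTP.
    // Assumes ElasticSearch on localhost:9200 and a hypothetical "oliver" index.
    function es_request($method, $path, $body = null) {
        $ch = curl_init('http://localhost:9200' . $path);
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        if ($body !== null) {
            curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($body));
        }
        $response = curl_exec($ch);
        curl_close($ch);
        return json_decode($response, true);
    }

    // Index one catalog record, read from its JSON file on disk (hypothetical path).
    $record = json_decode(file_get_contents('records/example-0001.json'), true);
    es_request('PUT', '/oliver/manuscript/' . $record['id'], $record);

    // Search the catalog for Syriac manuscripts.
    $results = es_request('POST', '/oliver/manuscript/_search', array(
        'query' => array('match' => array('language' => 'Syriac'))
    ));
    echo $results['hits']['total'] . " records found\n";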

A system based on ElasticSearch over JSON data would also require a number of custom scripts to provide the management tools HMML needs. These would include scripts to ingest the spreadsheets received from remote sites, to facilitate the creation of new records, to edit existing records, and to gather analytics about the collection. It is also worth noting that the scripts driving a web discovery service built upon ElasticSearch could serve as the basis of the Reading Room referred to in HMML plans.
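As a rough illustration of the ingest side, the sketch below assumes a site’s spreadsheet has been exported to CSV; the column headings and field names are stand-ins, not HMML’s actual template columns.

    <?php
    // Rough ingest sketch: CSV rows from a site's spreadsheet become JSON records.
    // Column headings and field names below are hypothetical placeholders.
    $handle  = fopen('incoming/site-template.csv', 'r');
    $columns = fgetcsv($handle);                  // first row holds column headings

    while (($row = fgetcsv($handle)) !== false) {
        $data = array_combine($columns, $row);    // pair each value with its heading

        // Map spreadsheet columns to catalog fields.
        $record = array(
            'id'         => $data['HMML number'],
            'repository' => $data['Repository'],
            'shelfmark'  => $data['Shelfmark'],
            'language'   => $data['Language'],
        );

        // Write one JSON file per record; these files remain the system of record.
        file_put_contents('records/' . $record['id'] . '.json', json_encode($record));

        // The same record would then be (re)indexed in ElasticSearch,
        // for example with the es_request() helper sketched above.
    }
    fclose($handle);

Editing and analytics scripts would follow the same pattern: read the JSON files, make or report changes, and refresh the ElasticSearch index.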

JSON is an open standard and ElasticSearch is open source software (though ElasticSearch is managed by a commercial entity). I propose that any scripts developed for HMML also be released as open source software, both to ensure HMML has a full license to this work and to allow others in the archive community to leverage and contribute to it.

Process

Becoming more familiar with the HMML data ahead of the software development phase of the project would be very helpful. I would like to have access to a snapshot of the current HMML dataset in early 2015. This snapshot would be used to get to know the dataset and to experiment with various encoding strategies, including the creation of an initial JSON dataset and ElasticSearch index.

If this proof-of-concept performs well, we would then plan the data and workflow changes HMML staff would like to see in OLIVER 2.0. This would include defining which fields need to be added to the dataset and which need to be normalized, deciding which elements should be available as linked data, and understanding the changes in forms and interactions that HMML staff would like for themselves and for partners doing data entry around the world.

This plan would be the basis for developing the editing and import scripts for OLIVER 2.0. The proof-of-concept would be re-implemented, adding elements to expose OLIVER records as linked data along with an initial faceted catalog that could serve as the basis of the Reading Room. By the end of summer 2015 we would aim to provide working versions of both the back-end and front-end systems for staff review and feedback.
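One straightforward way to expose the JSON records as linked data would be JSON-LD, the W3C recommendation that adds a small “@context” block mapping plain JSON field names onto shared vocabularies. The sketch below is only illustrative: the field names, the example.org identifiers, and the choice of Dublin Core terms are assumptions, not decisions. It shows how the record sketched earlier might be published.

    {
      "@context": {
        "title":     "http://purl.org/dc/terms/title",
        "language":  "http://purl.org/dc/terms/language",
        "shelfmark": "http://example.org/vocab/shelfmark"
      },
      "@id": "http://example.org/oliver/example-0001",
      "shelfmark": "MS 42",
      "language": "Syriac",
      "title": "Example homily"
    }

The appeal of this approach is that the linked data view can be generated from the same JSON records the catalogers maintain, largely by publishing an appropriate context alongside them.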

I would plan to work with one other developer in-house (my son, an SJU graduate). We would manage the development work on GitHub so that anyone on the SJU staff would be able to monitor progress, provide feedback, and participate in the development process. We would expect to have at least bi-weekly contact (a scheduled call, perhaps) with a project manager on the HMML staff to ensure the work we are doing meets your needs.

While ElasticSearch is built on a Java framework, we do not intend to do our development in Java. We are comfortable with PHP as a development language and would also consider Ruby and Python if there are reasons to favor those in the SJU environment. Our goal would be to write our code in such a way that IT Services staff might feel comfortable maintaining it over time and other institutions might want to consider it as a framework for interacting with ElasticSearch.

I would expect to spend Fall 2015 implementing revisions suggested by staff after a review of the summer’s work. During October and November 2015 we would focus on migrating the work to whatever server CSB/SJU Information Technology Services would like to use for hosting and maintaining the service.

Deliverables

Deliverable 1: Proof-of-concept.

Timeframe: February - March 2015.

Cost: $10,000

Acquire a copy of HMML’s OLIVER dataset and build a JSON / ElasticSearch version. Use only text editors to modify the data (no scripts yet). Demonstrate that the basic architecture is sound and up to the job.

Deliverable 2: Initial Data and Workflow Plan.

Timeframe: April - May 2015.

Cost: $10,000

Plan data revisions and workflow with HMML staff. Agree on new fields, linked data elements, and required forms. Do this work “on paper” (no working code yet). Develop a mapping from the old data to the new. Identify changes to the existing off-line data gathering forms. These plans can continue to iterate over the next phase of the project, but we need to document initial expectations in order to proceed with development.

Deliverable 3: Initial Implementation.

Timeframe: July - August 2015.

Cost: $30,000

Re-implement the JSON / ElasticSearch system using the specifications and mapping developed earlier. Release “early and often,” giving staff access to a working system as quickly as possible, identifying bi-weekly goals, and iterating. Continue to work new ideas for fields, mappings, and forms into this iterative process.

Deliverable 4: End-of-summer status review.

Timeframe: September 2015.

Cost: $5,000

Facilitate a meeting with HMML and IT Services staff to gather feedback on the initial implementation. Get input on both functionality and maintenance concerns. Present findings in a brief summary report to HMML.

Deliverable 5: Final Revision & Documentation.

Timeframe: October - November 2015.

Cost: $10,000

Make necessary changes based on status review feedback. Migrate implementation to HMML and IT Services hardware. Prepare documentation for HMML and IT Services staff.

Deliverable 6: Handover to HMML and IT Services.

Timeframe: late November 2015.

Cost: $7,000

Turn over continuing operation of the new OLIVER to HMML and IT Services staff. Train staff to manage the system.

Deliverable 7: Maintenance.

Timeframe: ongoing.

Cost: approximately $500 per month, to be negotiated separately

If desired, we could be available for some degree of ongoing support and troubleshooting of the new system. This is an optional service and the cost of ongoing maintenance help is not included in the cost estimate total below.

Requirements

While ElasticSearch is very capable, it does best when installed on a reasonably powerful server. This would be a relatively modest ElasticSearch implementation, but it would still benefit from 16 to 32GB of RAM and 0.5TB or more of storage, preferably SSD.

An iMac meeting these specifications could be purchased for about $2,500, though we could also work with IT Services staff to determine a suitable Linux alternative. Note that in addition to the main server, arrangements would have to be made for suitable backup of the data and configuration. The process and deliverables outlined above do not require this hardware to be in place until Fall 2015; however, if it is in place by Summer 2015, the initial implementation (Deliverable 3) could be undertaken on SJU hardware.

If complete isolation of the main dataset from the searchable public data is desired, the JSON data store could be housed at a separate location on a more modest server.

These hardware costs are not included below.

Ongoing maintenance would require some familiarity with JSON and with the system’s basic architecture for troubleshooting purposes. Maintenance of the scripts developed for HMML would require familiarity with the programming language used. We could provide this ongoing maintenance at an additional cost (Deliverable 7), or it could be arranged with IT Services.

Scope

This project would include the development of a rudimentary search engine for the OLIVER data, but it would not include development of the full Reading Room vision. Specifically, this proposal does not include making the search interface “pretty,” only making it functional. While I would be happy to be engaged for the effort of bringing the Reading Room vision to life, that would be a separate project.

Cost

As a consultant, I do not charge by the hour. The services above (Deliverables 1 through 6) total $72,000, payable in installments upon submission of each deliverable on the schedule outlined above. I understand that the grant HMML is seeking may provide only limited funding, so if this cost proves too high, I am willing to discuss narrowing the scope of the proposal or discounting the services to accommodate your needs.

Eric Celeste

Eric brings over 20 years of library and 35 years of technology experience to his consulting. At MIT Eric shepherded the creation of DSpace, open source digital repository management software developed with HP and now deployed at hundreds of institutions worldwide. At the University of Minnesota Libraries he encouraged the development of the UThink blog service, a wiki-based staff intranet, LibData, and the University Digital Conservancy. Currently he is working with ARL as the Technical Director of the SHARE project to create technologies and workflows that will track and integrate the progress of research across its lifecycle. He works with non-profit institutions on appropriate uses of technology for informing, communicating, and collaborating with their constituencies.