MDL Wide Area Search Prototype

NOTE: WIP, this report is far from complete just yet.

The MDL wanted to explore “universal access” during the current grant period. In MDL-speak “universal access” means a way to search a wide range of digital collections: MDL’s own as well as digital object collections from other historical societies and campuses. I will call this endeavor “wide area search”. The MDL asked me to put together a prototype system that could search across MDL photo collections and a few collections from non-MDL institutions by the middle of June.

When I began this work in late April 2008 I decided to take a quick first pass at it using the Google Custom Search Engine (CSE) facility. That quickly proved unworkable, but it revealed enough of the Google way that we realized we might be able to create a fully functional prototype using their Google Mini search appliance. That worked, more or less, and the result is available to see at http://search.mndigital.org.

This document describes what was done and invites comments about the process, the results, or alternatives we should consider. In some classic sense the “right” way to approach this would have been using Heritrix, Lucene, and Solr to build a custom system meeting our particular specifications. But we didn’t have time to create specifications, much less learn how to code Lucene/Solr. I’d be interested in your reactions to the path we took.

Ground rules

We set a few basic ground rules when embarking on this effort. The most important was that the institutions we asked to participate should have to do as little as possible to work with us. Another was that I wanted to be sure that whatever work they did do would benefit the institution even if our wide area search fizzled.

It turns out that many historical societies and other cultural memory organizations don’t even have their collections exposed to internet search engines. These materials lie in the “deep web,” invisible to the crawlers that feed those engines. It seemed that the most beneficial lesson we could teach these organizations would be how to expose their collections to spiders. That included some political work: convincing them that they should open their collections to spiders at all. If we could make their collections indexable, then even if our search engine never got off the ground, these institutions would benefit from the added exposure.

We decided not to require anything beyond this basic exposure from potential partners. No structured metadata, no special code. That almost worked. For the most part we didn’t need anything special, but in some cases getting up to this very basic bar was difficult for institutions.

Given that ground rule, our solution space became quite constrained. We would have no structured metadata to speak of for this effort. Essentially we would be doing web crawls (or using someone else’s crawls) to grab data from pages, and those pages would become the source of our index. As a result, nothing beyond free-text search over whatever those pages contain would be possible; no fielded or faceted searching.

We worked with our own MDL Reflections collection and the collections of two partners: the Pennington County Historical Society (PCHS) and the Minnesota Historical Society (MHS). Two other partners are working on making their collections accessible, but that work was not completed at the time this report was drafted.

The CSE Attempt

The Google Custom Search Engine bills itself as a way to “harness the power of Google to create a search engine tailored to your needs” for searching “a website or collection of sites”. The deep flaw in the CSE is that it appears to severely restrict the quantity of results returned when more than one website is included in the scope of the CSE. That restriction became a show-stopper when, on April 24, a member of Google’s enterprise search team notified me that even the fee-based enterprise version of the CSE would suffer the same limitation.

Still, the work on the CSE revealed a way to manipulate Google search results that may be useful to others, and it certainly pointed the way toward the later search appliance solution, so I want to document that work here.

One of the limitations of using Google’s CSE is that it only returns textual results. We were indexing image collections, and getting a text result is pretty unsatisfying when searching for images. To compound the problem, Google stresses the page title in the search result (making it bold and colorful), but many image management systems (CONTENTdm in particular, which is at the heart of MDL Reflections) provide only the most generic page titles. This had to be fixed.

My plan was to use JavaScript to rewrite the results so that the uninformative headlines were replaced by thumbnail images. The URL of the images could be derived from the URL of the indexed page (though that later proved an insufficient strategy for the prototype, it is as far as I went while investigating the CSE). The problem was that the CSE presented its results in an “iframe” served from Google’s servers, and the browser’s same-origin policy prevents JavaScript from touching the contents of a frame supplied by a server in a domain other than that of the page supplying the script. In other words, I could not touch the Google-supplied results from my separately housed CSE.
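Purely to illustrate that derivation idea (this is a sketch, not the script that was deployed; the CONTENTdm paths and parameter names are assumptions for the example):

    // Sketch: derive a thumbnail URL from a CONTENTdm-style item record URL.
    // Assumes an item URL of the form
    //   http://host/cdm4/item_viewer.php?CISOROOT=/collection&CISOPTR=123
    // and a thumbnail service at /cgi-bin/thumbnail.exe -- both are
    // illustrative assumptions, not documented fixtures of Reflections.
    function deriveThumbnailUrl(itemUrl) {
      var match = itemUrl.match(/^(https?:\/\/[^\/]+).*[?&]CISOROOT=([^&]+).*[?&]CISOPTR=([^&]+)/);
      if (!match) {
        return null; // not a URL we know how to derive a thumbnail from
      }
      return match[1] + '/cgi-bin/thumbnail.exe?CISOROOT=' + match[2] +
             '&CISOPTR=' + match[3];
    }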

The (very kludgy) solution was to write a PHP shim that grabbed results from Google and presented them to the CSE page. As long as that shim ran on a server in the same domain as the CSE page, JavaScript was happy to manipulate the results. The product of this effort is still available at http://mdlwas.clst.org/search.html. The JavaScript can be seen in the source of that page; the PHP is available here.

This proved to us that rewriting the Google results was feasible; unfortunately, the quality of the results supplied by the CSE was still unacceptable.

The Search Appliance Attempt

The Google Mini search appliance is a fairly low-cost blue box that provides the tools to index a set of web pages and customize the results of that indexing. The Mini, like all Google Search Appliance (GSA) solutions, is aimed primarily at the business community for searching corporate web sites. However, it turns out that nothing prevents a GSA from searching any set of web sites in the world; the only limit is the number of pages the GSA will index, and that is determined by the license you purchase from Google.

The challenges in configuring a GSA for our wide area search task were threefold: (1) determining proper indexing instructions for the GSA, (2) modifying the XSLT stylesheet to accommodate a more image-friendly search result page, and (3) developing JavaScript to post-process the results, inserting pictures and the like.

Indexing with the GSA

The Google Mini provides only three basic controls for defining the scope of its crawl: where crawls start, what crawls follow, and what crawls avoid. In all cases, the GSA will obey any robots.txt or other robots directives that it finds present.

Crawls start from a list of URLs configured on the Mini. These URLs must contain links that, when followed, lead to every image we want to index. In the case of the MDL Reflections system, the page we used as a starting point was our “About” page (http://reflections.mndigital.org/cdm4/about.php), since it contains links to the browse view of every collection in the system. In the case of Pennington County, a special page was developed that simply listed the item records of each photograph in the collection (http://pchs.org/photos). This was similar to the approach taken for the MDL Social Side commentary pages (http://views.mndigital.org/Main/Comments). For the Minnesota Historical Society we tried an XML sitemap and learned that though the GSA can create a sitemap, it does not know how to follow one. That surprised us, and we had to create a set of sitemaps in HTML format instead (http://collections.mnhs.org/visualresources/sitemap1.htm).

We also found that some of the collections were being protected by robot exclusions, both in robots.txt files and in metadata headers. These had to be eliminated or, in the case of MHS, an exception had to be made for the MDL crawler. Opening collections in this way was a critical component of developing this search facility.
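As a sketch of what such an exception can look like (the crawler name and path here are illustrative assumptions, not MHS’s actual file; the Mini identifies itself with a configurable user-agent, commonly “gsa-crawler”):

    # Illustrative robots.txt: admit the search appliance's crawler while
    # keeping the existing exclusion in place for everyone else.
    User-agent: gsa-crawler
    Disallow:

    User-agent: *
    Disallow: /visualresources/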

Of course, as a crawler steps through the files “hanging off” those starting points, it will encounter many pages that are not “item records” for photographs. These were screened out in two ways. First, the GSA allowed us to define a list of URL patterns specifying the only URLs that should be crawled and indexed. For example, though the MHS crawl begins with the “sitemap1.htm” page, we also want it to include “details.cfm” pages. But an inclusion list alone is not enough to limit the scope of the crawl: in many cases there are pages within those bounds that should not be indexed or followed. So, second, we used exclusion patterns to screen those out. For example, our MDL Social Side site has many administrative pages that should not be indexed, so we included the pattern “contains:action=” to screen out any page with “action=” in the URL.

Scoping the crawl for CONTENTdm proved especially difficult. First, every image has countless variations in size and orientation referenced from the item page, so we had to screen out all pages with “&DM” in the URL. Then it turned out that the links that trigger column re-sorting were being followed, so we eliminated URLs containing “CISOSORT”. But even with that screening, most item records were indexed multiple times by the GSA, since they were found in different positions on various result set lists. In other words, if an image was second on a results page when the GSA first encountered it, but fourth on another results page, the GSA would capture and index both versions of the item even though they were identical in content. Unfortunately CONTENTdm records that result set position in the URL of the item record, and there was no way to screen out the duplicates. As a result, we index more than twice as many “pages” for our CONTENTdm system as we really should. This has a significant impact on the licensing of the Google Mini, since the cost of the license is determined on a pages-indexed basis.
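Putting those pieces together, the crawl configuration on the Mini ends up looking roughly like this (the lists below are illustrative, not a dump of our actual settings, and the exact field labels in the admin console may differ slightly):

    Start crawling from the following URLs:
      http://reflections.mndigital.org/cdm4/about.php
      http://pchs.org/photos
      http://collections.mnhs.org/visualresources/sitemap1.htm
      http://views.mndigital.org/Main/Comments

    Follow and crawl only URLs with the following patterns:
      reflections.mndigital.org/
      pchs.org/photos
      collections.mnhs.org/visualresources/
      views.mndigital.org/

    Do not crawl URLs with the following patterns:
      contains:action=
      contains:CISOSORT
      contains:&DM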

Defining Page Structure with GSA’s XSLT

The Google Mini returns search results in an XML form that is transformed to HTML via an XSLT stylesheet, part of what Google terms a “front end” for the GSA. Deciphering this stylesheet takes considerable attention, since it is over 3,000 lines long. While about a fifth of this code “can be customized”, the vast majority of it lies behind Google’s statement that “we do not recommend changes to the following code.” Accomplishing the changes we needed in the look of the search results definitely required editing that code, as well as creating a CSS file to support it.

In addition to obvious layout changes (a two-column layout with images next to the text about each hit), certain information had to be tagged with class or name attributes so that it could later be found by the JavaScript.

One other very important change was exposing the URL of the cached copy that the GSA maintains for every page indexed. More about that later.
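For orientation, the HTML the modified front end emits for a single hit ends up shaped roughly like this (a simplified illustration, not the literal stylesheet output; only the “mdl-image” and “mdl-institution” names are referenced later):

    <div class="mdl-result">
      <span name="mdl-image"><a href="...item record URL...">generic page title</a></span>
      <span class="mdl-snippet">...Google-generated snippet...</span>
      <span name="mdl-institution">...displayed URL...</span>
      <a class="mdl-cache" href="...URL of the GSA's cached copy...">cached</a>
    </div>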

[WIP, not finished!] [link to XML and XSLT]

Post-processing Results on the GSA with JavaScript

After massaging the results with XSLT the page was structurally in good shape, but some important details were still unsatisfactory. For one thing, there were still no images on the results page. For another, the lengthy URLs that Google displayed were not of much use to the typical patron, yet they contained hints that were very important.

To insert the image, the JavaScript looks for the elements named “mdl-image” that the XSLT emits, uses the URL in each of those elements to derive the URL of a thumbnail image, and then inserts that image into the element’s “innerHTML”. This effectively replaces the text header of each result with an image.
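A minimal sketch of that substitution, reusing the deriveThumbnailUrl() helper sketched in the CSE section and the markup shape shown above (again, illustrative rather than the deployed script):

    // Replace each result's generic text headline with a thumbnail image.
    // Assumes the stylesheet wraps each result link in an element with
    // name="mdl-image", as sketched above.
    window.onload = function () {
      var holders = document.getElementsByName('mdl-image');
      for (var i = 0; i < holders.length; i++) {
        var link = holders[i].getElementsByTagName('a')[0];
        if (!link) { continue; }
        var thumbUrl = deriveThumbnailUrl(link.href);
        if (thumbUrl) {
          holders[i].innerHTML = '<a href="' + link.href + '"><img src="' +
                                 thumbUrl + '" alt=""></a>';
        }
      }
    };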

Similarly, to make the displayed URL more helpful, the JavaScript looks for the “mdl-institution” tag surrounding this URL and then parses the URL itself to figure out which institution is responsible for the image. It then substitutes links to that institution for the URL.
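The mapping itself is simple, something along these lines (the host-to-institution table is abbreviated and the link targets are illustrative):

    // Map the host portion of a result URL to the owning institution so the
    // raw URL can be replaced by a friendlier attribution link.
    function institutionFor(itemUrl) {
      var host = itemUrl.replace(/^https?:\/\//, '').split('/')[0];
      if (host.indexOf('mndigital.org') !== -1) {
        return { name: 'Minnesota Digital Library', url: 'http://www.mndigital.org/' };
      }
      if (host.indexOf('mnhs.org') !== -1) {
        return { name: 'Minnesota Historical Society', url: 'http://www.mnhs.org/' };
      }
      if (host.indexOf('pchs.org') !== -1) {
        return { name: 'Pennington County Historical Society', url: 'http://pchs.org/' };
      }
      return null; // unknown host: leave the displayed URL alone
    }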

[WIP, not finished!] [link to JavaScript]

Bringing it All Together on the GSA

A few other features of the GSA can come in handy for a statewide search of this sort. Google allows the creation of “keyphrases” that will be recognized in the search query. In response to these phrases, an “ad” of sorts can be displayed above the results. This is a nice way to call attention to certain partners. For example, any search that includes “streetcar” might pull up a link for the “Minnesota Streetcar Museum”. The GSA can also suggest alternative terms, for example a search for “tram” will suggest using the term “streetcar.”
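On the appliance both features boil down to small lookup tables maintained through the admin console. The entries below are illustrative placeholders only (the museum link is a stand-in, and the exact column layout is from memory and may not match the console):

    Keyphrase "ad" (term, match type, link, displayed title):
      streetcar, KeywordMatch, http://example.org/streetcar-museum, Minnesota Streetcar Museum

    Related query (term, suggestion):
      tram, streetcar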

The end result provides a very quick search of a wide variety of collections, presenting thumbnails of the images found and links to the host institutions.

[WIP, not finished!]

Experiences with Partners

In this brief prototype effort we aimed to include only a few partners: two or three collections beyond the two MDL systems (Reflections and the Social Side commentary). The two partners we succeeded in incorporating into the search taught us quite a bit about how much (or little) we could expect. Two other partners are actively interested and may come on board before the end of the summer.

Our aim was to keep things simple for partners. They needed to make their item records indexable by Google and provide a way to associate item records to thumbnail images. As we learned, both of these very basic requirements can be troublesome.

Pennington County Historical Society was building a new website, and the programmer developing that site was able to accommodate our crawler’s needs in a matter of days. Item records on the new system used a URL that could, with some parsing sugar, be transformed into the URL of the thumbnails. He was able to provide a single starting page that referenced every photo in the collection. All was good.

A few weeks later we realized that while the parsing was usually sufficient to transform the item record URL into the thumbnail URL, that was not always the case. There were some special cases (where one item was represented by multiple images, like an object photographed from various angles) in which the derived thumbnail URL didn’t reference an existing image.

Then we found that the item records provided for indexing by the Minnesota Historical Society were built by a weekly process that associated given item records with images that had unrelated filenames. In this case there was no way to derive the thumbnail URL from the indexed item URL.

Without some way to make this association we would not be able to present thumbnails in the result sets. We needed a way to reliably refer to the thumbnail given nothing more than the URL of the indexed item record, since that URL was the only systematic metadata the GSA could provide as the result set was being rendered.

The solution lay in the fact that the GSA also keeps a cached version of each file it indexes. While we did not want to make that cached version of the page available to users, I realized that we could parse that cached file to discover the URL of the associated thumbnail image from the item record page’s HTML itself. Doing so required a PHP script that took the cached page as an input and provided the actual image as an output. In other words, we could treat this PHP script as the URL of the thumbnail image itself. This method works so well that I rewrote both the Pennington County and MHS glue to take advantage of the GSA cache.
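The deployed glue is the PHP script referenced below; purely to illustrate the parsing step, here is the same idea as a standalone JavaScript function (which image on the page counts as “the thumbnail” is an assumption, here simply the first one):

    // Given the HTML of the GSA's cached copy of an item record, find the
    // URL of the thumbnail image it references.
    function thumbnailUrlFromCachedPage(cachedHtml, pageUrl) {
      var match = cachedHtml.match(/<img[^>]+src=["']([^"']+)["']/i);
      if (!match) {
        return null; // no image found on the cached item record
      }
      var src = match[1];
      if (/^https?:\/\//.test(src)) {
        return src; // already an absolute URL
      }
      if (src.charAt(0) === '/') {
        // Root-relative: prepend the item record's scheme and host.
        return pageUrl.match(/^https?:\/\/[^\/]+/)[0] + src;
      }
      // Otherwise resolve relative to the item record's own directory.
      return pageUrl.replace(/[^\/]*$/, '') + src;
    }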

Since the GSA provides no way to upload and run PHP scripts, we host this script on another MDL machine in the same data center as the Google Mini.

We also learned that not all systems in use by our potential partners even implement the concept of an “item record”. One potential partner is still working to develop an “item record” that combines a thumbnail image with a summary of the metadata available for that image. Creating such a structure will be a great benefit to the accessibility of this site to all search engines, not just ours, so we hope the work involved feels justified to the institution.

Another potential partner was willing to take part in our search engine, but had robots.txt directives present that told all crawlers to stay away. Teaching this institution how to write a more discriminating robots.txt file will serve them, again, beyond the bounds of this particular project.

Our goal is to only make our partners do work that will serve them beyond the scope of our MDL search system.

[WIP, not finished!] [of caches and PHP] [links to PHP]

Next Steps

The Google Mini based system works far better than we expected. At this stage what we most need to do is throw more data at it: dig up more institutions willing to let us index their collections into this statewide search engine. Finding these partners and helping them meet the basic requirements of the system is the most critical task.

Some further cleanup of the system itself is also called for. It would be wonderful to resolve the duplicate indexing of CONTENTdm material. Our design also presents only a single column of results when viewed with Internet Explorer, instead of the two columns we see with other browsers. It would be good to get IE on the same page as the other web browsers.

Finally, we would be interested in any feedback from users or others in the library community. I am sure this approach is boneheaded in many ways, but your input could help me understand what’s most goofy and could use fixing.

[WIP, not finished!] [better CONTENTdm filter] [single column IE]