originally: http://www.oclc.org/research/projects/schematrans/seel/tutorial/section5.htm
We have now walked through the process of creating some real and fairly sophisticated Seel maps, showing how they can be assembled into a translation. In this section, we will step back from the details and sketch the big picture.
It looks something like this:
Seel is a component in a custom application that is dedicated to the problem of metadata translation. It is designed to facilitate the management of metadata by making the best use of rare subject-matter expertise and maximizing the potential for change management.
Ideally, our system would permit someone with an expert's knowledge in the metadata standards used in libraries, the publishing industry, education, or other contexts we have not yet studied to develop executable, production-quality translations by creating something very close to the familiar-looking crosswalk. Relatively little programming would be required because the process of converting the crosswalk to a Seel script would be trivial and perhaps easily automated. Behind the scenes, the application would fill in all of the messy details required to execute a literal translation between standards, which may have more than one structural realization or exacting requirements for accuracy and validity. To carry out this task, the application would consult a resource of previously constructed maps and translations that drastically reduces the duplication of effort.
Once the system is set up and seeded with Seel scripts for the important crosswalks (as well as the supporting Morfrom readers and writers), it should run with only minimal human intervention. Only minor tweaks would be required to add new standards or accommodate changes in versions or translation needs.
So, how far have we come toward realizing this vision?
Since the crosswalk is the key component in our system, we can start by returning to a claim we made in Section 1: a Seel script is an executable version of a crosswalk. Consider Figure 5.1, which shows a portion of a crosswalk between ONIX and MARC. The first three columns list the ONIX source, the MARC target, and some special conditions, which map directly to the Seel <source>, <target>, and <context> elements; the fourth column cites the corresponding Seel map. With some trivial special processing (that would, for example, join two rows that mention the same MARC field but with different subfields), a set of Seel maps could be generated directly from the crosswalk.
ONIX source | MARC target | Special conditions | Seel map |
---|---|---|---|
B037 | 100 $a | | 3.1 |
B043 | 100 $c | | 3.1 |
Othertext/d104 | 520 $a | i2=1; Othertext/d102=32 | 3.4 |
Othertext/d107 | 520 $r | | 3.4 |
Othertext/d018 | 520 $t | | 3.4 |
A196 | 040 $a $c | i1=#; i2=# | 3.5 |
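The generation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not actual Seel tooling: the `CrosswalkRow` class, the `rows_to_maps` function, the generated `id` values, and the dictionary layout of each map are hypothetical stand-ins for real Seel syntax. The rows are transcribed from Figure 5.1, and rows that mention the same MARC field are joined into a single map.

```python
from dataclasses import dataclass


@dataclass
class CrosswalkRow:
    source: str            # ONIX element, e.g. "B037"
    marc_field: str        # MARC field, e.g. "100"
    subfields: tuple       # MARC subfields, e.g. ("a",)
    conditions: str = ""   # special conditions, copied verbatim


# Rows transcribed from Figure 5.1.
ROWS = [
    CrosswalkRow("B037", "100", ("a",)),
    CrosswalkRow("B043", "100", ("c",)),
    CrosswalkRow("Othertext/d104", "520", ("a",), "i2=1; Othertext/d102=32"),
    CrosswalkRow("Othertext/d107", "520", ("r",)),
    CrosswalkRow("Othertext/d018", "520", ("t",)),
    CrosswalkRow("A196", "040", ("a", "c"), "i1=#; i2=#"),
]


def rows_to_maps(rows):
    """Join rows that share a MARC field into one map description each."""
    maps = {}  # dicts keep insertion order, so maps follow crosswalk order
    for row in rows:
        entry = maps.setdefault(
            row.marc_field,
            # Hypothetical layout: keys mirror <source>, <target>, <context>.
            {"id": f"marc-{row.marc_field}", "source": [], "target": [], "context": []},
        )
        entry["source"].append(row.source)
        entry["target"].extend(f"{row.marc_field} ${s}" for s in row.subfields)
        if row.conditions:
            entry["context"].append(row.conditions)
    return list(maps.values())


MAPS = rows_to_maps(ROWS)
for m in MAPS:
    print(m["id"], m["source"], "->", m["target"], m["context"])
```

Each resulting dictionary carries the same three pieces of information as a crosswalk row, so serializing it to `<map>` elements would be a separate, mechanical step.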
To be honest, we developed the ONIX-to-MARC Seel script the old-fashioned way: by consulting standards documents, reviewing previous versions of executable translations, and crafting Seel code by hand. Since a Seel script is built up one map at a time (just as a crosswalk usually grows by adding one row at a time), we ended up with a substantial translation that replicates most of the instructions in the indispensable Record Builder document and could easily be finished when a real-world need arises. Here is a small set of ONIX records and their MARC translations that were generated from this script.
But in an earlier study of the Dublin Core to MARC translation, we tried harder to automate the process of creating the Seel script. We extended and modified an existing, in-house crosswalk contained in an Excel spreadsheet to include some features that Seel allowed. The Seel translation was then generated from this spreadsheet using a Perl script that coded an understanding of a small number of conventions for expressing the special conditions. Note that this crosswalk represents Qualified Dublin Core (not Simple DC, which isn't detailed enough for many applications) -- a crosswalk for which no publicly accessible executable form exists. It was used in a production stream at OCLC to translate Dublin Core records like these to MARC records like these.
The flow from crosswalk to Seel script to a set of translated records is straightforward because a Seel map encodes an important formal property of crosswalks: it is modular and self-contained, like a row in a crosswalk.
But modularity isn't the only formal property that yields a computational payoff. Like a row in a crosswalk, a Seel map is also symmetric. In other words, if ONIX <b037> PersonNameInverted maps to MARC <100><a> Personal Name, then the converse is also true. Only one crosswalk entry, and only one Seel map, is required to express this relationship. So if the translation from a-to-b is coded in Seel, the translation from b-to-a is "free," unlike the equivalent formulation in most procedural programming languages. In the default mode, the Seel interpreter translates the <source> element to the <target> element. But when it is invoked with the "reverse" option, the sources and targets are effectively swapped, and the whole translation is reversed. Click here to see a demo of a translation and its reversal.
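The idea of a reversible map can be illustrated with a small Python sketch. It does not reproduce the Seel interpreter: the `SeelMap` tuple, the `translate` function, its `reverse` parameter, and the sample value are all assumptions made for illustration, and records are simplified to flat dictionaries.

```python
from typing import NamedTuple


class SeelMap(NamedTuple):
    source: str
    target: str


# One entry per relationship; nothing is written twice for the two directions.
MAPS = [
    SeelMap("b037", "100 $a"),
    SeelMap("b043", "100 $c"),
]


def translate(record, maps, reverse=False):
    """Copy each mapped value from its source key to its target key.

    With reverse=True the roles of source and target are swapped,
    so the same maps drive the translation in the opposite direction.
    """
    out = {}
    for m in maps:
        src, tgt = (m.target, m.source) if reverse else (m.source, m.target)
        if src in record:
            out[tgt] = record[src]
    return out


onix = {"b037": "Melville, Herman"}  # invented sample value
marc = translate(onix, MAPS)
back = translate(marc, MAPS, reverse=True)
print(marc)
print(back)
```

The round trip recovers the original record here because every map in the sketch is one-to-one; maps that merge or drop information would not reverse this cleanly.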
There is one more important respect in which a Seel translation resembles a crosswalk: it's abstract. Most of the structural details of the standards involved in the translation are hidden. To interpret a crosswalk, the reader has to fill in the missing information, usually by consulting documents that describe the specifications of the standards. The corresponding Seel translation operates in the context of a model that has a syntactic preprocessing layer, which hides the messy details about how to take apart and reassemble the records in the syntactic layer.
Our model is illustrated in Figure 5.2. The native record is normalized using a Morfrom reader at Step 2. The Seel script is executed by the Seel interpreter, whose output is a Morfrom record in the target format at Step 3. A Morfrom writer reassembles the target into native syntax at Step 4.
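The three processing steps can be pictured as a chain of functions. The sketch below is a toy under loud assumptions: the `tag=value` line syntax stands in for a native format, a flat dictionary stands in for a Morfrom record, and `read_native`, `interpret`, `write_native`, and `run_pipeline` are hypothetical names rather than parts of the actual software.

```python
def read_native(text):
    """Step 2 stand-in: parse 'tag=value' lines into a normalized dict."""
    record = {}
    for line in text.splitlines():
        if "=" in line:
            tag, value = line.split("=", 1)
            record[tag.strip()] = value.strip()
    return record


def interpret(record, maps):
    """Step 3 stand-in: apply source-to-target maps to the normalized record."""
    return {maps[tag]: value for tag, value in record.items() if tag in maps}


def write_native(record):
    """Step 4 stand-in: reassemble the target record as 'tag=value' lines."""
    return "\n".join(f"{tag}={value}" for tag, value in record.items())


def run_pipeline(text, maps):
    """Reader, then interpreter, then writer."""
    return write_native(interpret(read_native(text), maps))


MAPS = {"b037": "100 $a"}  # a single illustrative map
result = run_pipeline("b037 = Melville, Herman\nb999 = unmapped", MAPS)
print(result)
```

Only `interpret` knows anything about the mapping; the reader and writer deal purely with how records are taken apart and put back together.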
One more point about Step 3. In this tutorial, we have described a process flow that uses a Seel script to translate a source to a target, a relatively simple process because the structure of the input and output is the same. But our tools are agnostic about what happens here. More complex processes could occur.
For example, Step 3 could be designed as a core or universal format into which standards are mapped before they are translated to the target. What would that format look like? Perhaps a lot like Dublin Core, the Learning Object Metadata Standard (LOM), MARC XML, or MODS -- all of which have been proposed as universal standards to promote interoperability, at least among grossly similar records. If the Seel process model is used to convert to a common standard, two Seel scripts would be needed at Step 3 instead of one. The first would convert from the source to the core and the second would convert from the core to the target. The benefits of this slightly more complex design are described in one of our recent papers.
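A two-script Step 3 amounts to composing two translations. The following sketch assumes flat dictionary records and a hypothetical core element named `creator`; the map tables and the `via_core` function are illustrative only and are not taken from any published crosswalk.

```python
def apply_maps(record, maps):
    """Apply one script: rename each mapped key, drop unmapped ones."""
    return {maps[key]: value for key, value in record.items() if key in maps}


# Hypothetical scripts: source-to-core and core-to-target.
ONIX_TO_CORE = {"b037": "creator"}
CORE_TO_MARC = {"creator": "100 $a"}


def via_core(record, to_core, from_core):
    """Translate through the core format: first script, then second."""
    core_record = apply_maps(record, to_core)
    return apply_maps(core_record, from_core)


onix = {"b037": "Melville, Herman"}  # invented sample value
marc = via_core(onix, ONIX_TO_CORE, CORE_TO_MARC)
print(marc)
```

In this arrangement, adding a new standard means writing scripts to and from the core only, rather than one script for every other standard.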
The design of the translation model maximizes the potential for reuse--reuse of code, reuse of analysis, and reuse of the expertise that is required to align two standards that participate in a translation.
For example, the Morfrom readers and writers encapsulate structural details that are unlikely to change frequently, such as hierarchical structure, element order, and stylistic conventions that Morfrom <value> elements must obey to be in compliance with a published standard. As a result, the readers and writers can be reused whenever a new translation is developed that involves a source or a target for which complete translations have already been written. For example, when we created the ONIX-to-MARC translation described in Sections 2, 3, and 4 of this tutorial, we needed to write only an ONIX reader. The MARC writer that was developed for our earlier study of the Dublin Core-to-MARC translation was reused.
The Morfrom readers and writers can also be reused when multiple crosswalks are required for the same pair of standards. Suppose an application required two kinds of translations: one for real-time query processing and another for offline database preparation. In the first case, only a lightweight crosswalk is required, one that might provide rough translations of a few key elements as a query is standardized across databases containing different kinds of records. But the second translation probably requires a comprehensive crosswalk involving all mappable elements, plus callouts to processes that validate or enhance the record. In this scenario, depicted in Figure 5.3, more than one Seel script would be needed. But these scripts could operate on data processed with the same Morfrom readers and writers.
The converse relationship is also possible: the same Seel script can be reused with different Morfrom readers and writers. For example, in the MARC to Dublin Core translation, MARC 245 $a maps to Title and MARC 856 $u maps to Identifier regardless of whether the original records are encoded in MARC XML, MARC 2709, RDF, BER, ASCII, or some idiosyncratic local format. In this scenario, shown in Figure 5.4, a pair of standards has multiple syntactic realizations, each requiring a dedicated Morfrom reader or writer. But once the native formats undergo the syntactic normalization step and are encoded in Morfrom, they can be translated with the same Seel script.
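The sketch below shows one set of maps serving two different input syntaxes. Both syntaxes are toys (a JSON object and tab-separated lines), not real MARC encodings, and the reader functions, the `READERS` registry, and the sample record are assumptions for illustration. Only the two mappings, 245 $a to Title and 856 $u to Identifier, come from the text.

```python
import json


def read_json(text):
    """Toy reader: a JSON object whose keys are 'field $subfield' labels."""
    return dict(json.loads(text))


def read_tagged_lines(text):
    """Toy reader: one 'label<TAB>value' pair per line."""
    record = {}
    for line in text.splitlines():
        label, sep, value = line.partition("\t")
        if sep:
            record[label] = value
    return record


# One reader per syntactic realization of the same standard.
READERS = {"json": read_json, "lines": read_tagged_lines}

# A single shared script, independent of the input syntax.
MARC_TO_DC = {"245 $a": "Title", "856 $u": "Identifier"}


def translate(text, syntax):
    """Normalize with the reader for `syntax`, then apply the shared maps."""
    record = READERS[syntax](text)
    return {MARC_TO_DC[k]: v for k, v in record.items() if k in MARC_TO_DC}


as_json = '{"245 $a": "Moby Dick", "856 $u": "http://example.org/moby"}'
as_lines = "245 $a\tMoby Dick\n856 $u\thttp://example.org/moby"

print(translate(as_json, "json"))
print(translate(as_lines, "lines"))
```

Supporting another syntax would mean registering one more reader; `MARC_TO_DC` would stay untouched.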
The Seel and Morfrom components can be manipulated separately because they contain qualitatively different information: structure and meaning, syntax and semantics. Organized as a two-step model, the Seel and Morfrom modules can also form the skeleton of a system that could provide a comprehensive set of services for metadata management. Records that must be processed for persistent access usually require more than translation. Most of the time, they must also be cleaned up, normalized, validated, and enhanced. If we maintain the distinction between syntax and semantics as we think about these ancillary services, we can design them as modules that best apply at certain locations in our model. Some operations--such as character set fuzzy conversions, the normalization of spaces, and the elimination of empty tags--manipulate structure and don't require any human judgment, so they are probably best called from the Morfrom readers and writers. Other operations do require this kind of judgment, such as validation, enhancement, and vocabulary mapping, so they would apply at the Seel layer. This model for a collection of enhanced metadata management services is illustrated schematically in Figure 5.5.
Eventually, Seel maps could evolve into a persistent resource that would provide even more opportunities for reuse. Since a Seel map is abstract, modular, and coded in a descriptive language that explicitly labels the critical components of the translation, it could easily serve as an entry in a database of crosswalk records. Metadata management is currently labor-intensive and error-prone, and much of that effort is wasted because many mappings are actually quite stable, especially those involving common fields such as titles and creators, or easily coded fields such as URLs. What happens now is that, whenever a standard changes or a new translation is required, a program or script that executes a complete translation must be written or modified.
But with a database of Seel maps, the need for rewriting is minimized. A translation could be assembled dynamically through a series of searches for sources and targets in the appropriate formats. New maps would be added to the database only as the need arises. For example, a user might need a translation that is essentially standard except for a customized map of one pair of elements - say, a Dublin Core to MARC translation that mentions a local subject heading scheme instead of the Library of Congress Subject Headings. The database software would use the "id" attribute on the <map> element to identify the corresponding stored map so it could be overridden in this special case. The rest of the translation would be used as originally written. A database with an override facility could also accommodate new versions of standards. New elements might be added or redefined, but the elements that are unaffected by the revision could simply be retrieved in a search and reused.
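The override mechanism can be sketched as a lookup keyed on the map `id`. No such database exists yet, so everything here is hypothetical: the stored maps, their `id` values, the choice of MARC fields, and the `assemble` function are illustrative assumptions, with plain dictionaries standing in for stored `<map>` elements.

```python
# Stored maps for a standard Dublin Core to MARC translation (illustrative).
STANDARD_MAPS = [
    {"id": "title", "source": "title", "target": "245 $a"},
    {"id": "subject", "source": "subject", "target": "650 $a"},
    {"id": "identifier", "source": "identifier", "target": "856 $u"},
]

# A local customization: same id, different target for subject headings.
LOCAL_OVERRIDES = [
    {"id": "subject", "source": "subject", "target": "690 $a"},
]


def assemble(stored, overrides):
    """Build a translation, replacing any stored map whose id is overridden."""
    by_id = {m["id"]: m for m in overrides}
    return [by_id.get(m["id"], m) for m in stored]


translation = assemble(STANDARD_MAPS, LOCAL_OVERRIDES)
for m in translation:
    print(m["id"], m["source"], "->", m["target"])
```

Only the `subject` map changes; the `title` and `identifier` maps are the very same stored entries, which is the reuse the database is meant to provide.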
And the user would have the flexibility to create different kinds of translations. In this tutorial, we have described translations between standards: from MARC to Dublin Core, from GEM or ONIX to MARC, and so on. But some translations involve mixing and matching elements from different standards - a so-called application profile, such as the Library Application Profile, which consists of elements from Qualified Dublin Core and MODS. Creating a translation from an application profile would just be another database search. As far as Seel is concerned, application profiles are logically no different from conventional standard-to-standard translations.
The creation of a database resource involving Seel elements requires a development effort with relatively few unknowns. We're not there yet, but it is a natural outgrowth of our existing work and the next phase of our metadata translation project.