originally: http://www.oclc.org/research/projects/schematrans/seel/tutorial/section4.htm

Section 4: Working with Morfrom Values

Sections 2 and 3 described the Seel code that processes the intermediate portions of a Morfrom record. Here we complete the technical walkthrough by describing the operations that can be performed on the lowest node, the <value> element, where the unique content resides. In the Morfrom notation, a MARC field such as <100><a>Franz Kafka would be rendered as <field name="100"><field name="a"><value>Franz Kafka</value></field></field>. Since this code is a little verbose, and precise descriptions of it are even more so, we will adopt the convention of referring to strings like "Franz Kafka" more loosely as a field's value--or, more simply, a value. Seel maps involve values in three ways: they can be queried as a condition for a translation, they can be modified, or they can be created.
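To make the "value" convention concrete, here is a minimal sketch in Python (not part of Seel) that walks the Morfrom fragment quoted above and reports each value together with the path of field names leading to it. The helper name values is hypothetical, and the sketch assumes only the <field> and <value> nesting shown in this tutorial.

```python
import xml.etree.ElementTree as ET

# The Morfrom rendering of <100><a>Franz Kafka quoted in the text.
record = (
    '<field name="100"><field name="a">'
    "<value>Franz Kafka</value>"
    "</field></field>"
)


def values(field, path=()):
    """Yield (path, text) for every <value> below a Morfrom <field>.

    Hypothetical helper: it assumes fields nest only as <field>/<value>.
    """
    path = path + (field.get("name"),)
    for child in field:
        if child.tag == "value":
            yield path, child.text
        elif child.tag == "field":
            yield from values(child, path)


result = list(values(ET.fromstring(record)))
print(result)  # [(('100', 'a'), 'Franz Kafka')]
```

In this picture, "the value of <100><a>" is simply the string paired with the path ('100', 'a').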

We'll begin the discussion of values with Map 4.1, a map from a translation we developed for a GEM to MARC crosswalk. The Gateway to Educational Materials is a premier collection of Web-accessible e-learning objects for primary and secondary school students. GEM records are coded in a standard that is specialized for the needs of the education community and that has several elements for describing the intended audience of the learning object. Two of these elements are involved in Map 4.1: <audience><toolfor> and <audience><beneficiary>. The MARC element with the closest meaning is <521><a>, but it's not a perfect match because the GEM elements have a narrower meaning. So both GEM elements map to the same MARC destination, resulting in a translation with a potential loss of information. To preserve the distinction for applications that might need to reconstruct the original GEM record, Map 4.1 prepends a label naming the GEM source element to the MARC data, producing <field name="521"><field name="a"><value>Tool for: …</value></field></field>.

The <micro> element

As Map 4.1 shows, Morfrom values are modified with the <micro> element, which performs "micro-"scopic string manipulations on values. In Map 4.1, fixed strings are added to the start of values using the <add> element, a child of <micro>. The <add> element has two attributes: "side," which names the start or end of the value; and "data," the string to be added. The rest of the logic in Map 4.1 should look familiar from our previous discussion of contexts and paths in Sections 2 and 3. As in those earlier examples, the path to the element of interest is located one or more <step> elements away from the translation root node named in the <mainpath> element.

Map 4.1

<map id="map:05">
  <source>
    <mainpath>
      <step name="audience"/>
      <path pid="1"><step name="toolfor"/></path>
      <path pid="2"><step name="beneficiary"/></path>
    </mainpath>
  </source>
  <target>
    <mainpath>
      <step name="521"/>
      <path pid="1"><step name="a"/></path>
      <path pid="2"><step name="a"/></path>
    </mainpath>
    <micro pid="1">
      <add side="start" data="Tool for: "/>
    </micro>
    <micro pid="2">
      <add side="start" data="Beneficiary: "/>
    </micro>
  </target>
</map>

Given the following input:

<field name="audience">
  <field name="toolfor">
    <value>Teachers</value>
  </field>
  <field name="beneficiary">
    <value>Students</value>
  </field>
</field>

Map 4.1 produces this output:

<field name="521">
  <field name="a">
    <value>Beneficiary: Students</value>
  </field>
  <field name="a">
    <value>Tool for: Teachers</value>
  </field>
</field>

Data can be removed with the <remove> element, which is the converse of <add> and has the same attributes. The following example would remove a single semicolon from the end of a Morfrom <value> element.

<micro>
  <remove side="end" data=";"/>
</micro>
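The behavior of <add> and <remove> can be sketched in a few lines of Python. This is an illustration of the semantics described above, not the Seel interpreter: the function names are hypothetical, the sample string "Prague;" is made up, and the sketch assumes that <remove> leaves a value unchanged when the named string is not present on that side.

```python
def add(value: str, side: str, data: str) -> str:
    """Mimic <add side=... data=...>: attach data to one side of value."""
    if side == "start":
        return data + value
    if side == "end":
        return value + data
    raise ValueError(f"side must be 'start' or 'end', got {side!r}")


def remove(value: str, side: str, data: str) -> str:
    """Mimic <remove side=... data=...>: strip one copy of data from one side.

    Assumption: the value is returned unchanged if data is not found there.
    """
    if side == "start":
        return value[len(data):] if data and value.startswith(data) else value
    if side == "end":
        return value[:-len(data)] if data and value.endswith(data) else value
    raise ValueError(f"side must be 'start' or 'end', got {side!r}")


print(add("Teachers", "start", "Tool for: "))  # Tool for: Teachers
print(remove("Prague;", "end", ";"))           # Prague
```

Only a single copy of the string is removed, matching the "single semicolon" behavior described for the example above.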

The <micro> element can have two other children, <trim> and <sub>. The <trim> element has no attributes and is used simply to trim leading and trailing whitespace on the string in the Morfrom <value> element.

The <sub> element is the most powerful of the <micro> elements because it uses regular expression matching to make substitutions in a Morfrom <value> string. It has two required attributes: "regexp" and "data". The "regexp" attribute contains a regular expression which "selects" substrings of the value to be replaced. The "data" attribute contains a string which replaces those value substrings.

Without further modification, the <sub> element performs a substitution only on the first matching substring it encounters in the relevant context. But <sub> also has an optional "global" attribute, which, if set to "true," performs the same substitution on every matching substring in the same string.

In the example below, whitespace will be trimmed and every underscore will be replaced with a space when the data in the source record is transferred to the target record.

<micro>
  <trim/>
  <sub global="true" regexp="_" data=" "/>
</micro>

The <sub> element can also be used to remove data from the value. If the "data" attribute is left blank, then anything matching the regular expression will be removed. This example will simply remove every underscore in a given context:

<micro>
  <sub global="true" regexp="_" data=""/>
</micro>
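The difference between a first-match and a global substitution is easy to see in a small Python sketch. It is an approximation under stated assumptions: Python's re module stands in for whatever regular expression dialect Seel uses, the replacement is treated as a literal string, and the function names and the sample value are hypothetical.

```python
import re


def trim(value: str) -> str:
    """Mimic <trim/>: drop leading and trailing whitespace."""
    return value.strip()


def sub(value: str, regexp: str, data: str, global_: bool = False) -> str:
    """Mimic <sub regexp=... data=... global=...>.

    count=1 replaces only the first match; count=0 replaces every match.
    The lambda keeps `data` literal (no backslash or group expansion),
    which is an assumption about how Seel treats the attribute.
    """
    count = 0 if global_ else 1
    return re.sub(regexp, lambda match: data, value, count=count)


raw = "  annual_report_2004 "
print(sub(trim(raw), "_", " "))                # annual report_2004
print(sub(trim(raw), "_", " ", global_=True))  # annual report 2004
print(sub(trim(raw), "_", "", global_=True))   # annualreport2004
```

The last call corresponds to leaving the "data" attribute blank, which deletes every match instead of replacing it.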

Defining a context for the <micro> element

In the two previous examples, the contents of each <micro> element would be applied to all Morfrom <value> elements under the translation root node. But as we saw in Map 4.1, the <micro> scope can also be restricted to a single path using a "pid" attribute. In other words, the scope of the <micro> element is defined in exactly the same way as it is for the <context> element, as discussed in Sections 2 and 3.

To underscore this point, let's consider Map 4.2, which is exactly like Map 3.3, except that the target has a <micro> element. As written, the map will produce a translation that does a verbatim transfer of the data in the list of MARC elements to the corresponding LOM elements, except that the whitespace on the <856><u> element will be trimmed.

But if the "pid" is omitted from the <micro> statement, <trim> applies to the whole map. That's because the target path is no longer <general><identifier><entity>, but just <general> -- the translation root node. Since Map 4.2 specifies that all MARC fields listed in the <source> end up as children of the <general> node, they would all get trimmed.

Map 4.2

<map id="map:08">
  <source>
    <mainpath>
      <path pid="1"><step name="856"/><step name="u"/></path>
      <path pid="2"><step name="245"/><step name="a"/></path>
      <path pid="3"><step name="245"/><step name="b"/></path>
      <path pid="4"><step name="246"/><step name="a"/></path>
      <path pid="5"><step name="520"/><step name="a"/></path>
      <path pid="6"><step name_re="6.."/><step name="a"/></path>
    </mainpath>
  </source>
  <target>
    <mainpath>
      <step name="general"/>
      <path pid="1"><step name="identifier"/><step name="entity"/></path>
      <path pid="2"><step name="title"/></path>
      <path pid="3"><step name="title"/></path>
      <path pid="4"><step name="title"/></path>
      <path pid="5"><step name="description"/></path>
      <path pid="6"><step name="keyword"/></path>
    </mainpath>
    <micro pid="1">
      <trim/>
    </micro>
  </target>
</map>
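The effect of the "pid" attribute on <micro> scope can be pictured with a short Python sketch. It is only a model of the scoping rule described above: the function name, the dictionary of values keyed by pid, and the sample strings are all hypothetical.

```python
from typing import Dict, Optional


def apply_trim(values_by_pid: Dict[int, str],
               pid: Optional[int] = None) -> Dict[int, str]:
    """Trim the value on one path (pid given) or on every path (pid omitted)."""
    return {
        p: value.strip() if pid is None or p == pid else value
        for p, value in values_by_pid.items()
    }


# Made-up values for two of the target paths in Map 4.2.
values = {1: " http://example.org/item ", 2: " A Title "}

print(apply_trim(values, pid=1))  # only the value on path 1 is trimmed
print(apply_trim(values))         # every value under the root is trimmed
```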

References to data in the source and target

So far, the discussion has shown how the <micro> elements can be used to perform operations on data in a record to be translated. But these elements introduce a theoretical problem because they seem to violate the symmetry of a Seel map. Since the goal of a data manipulation is to create a data field in the translated record with just the right characteristics, the <micro> elements appear only in the target. The source record is left alone.

But Seel's symmetrical design actually is preserved because <micro> elements can appear in the source, too. They're just not processed in a forward translation. Yet that doesn't mean they're undefined or superfluous. Explaining why requires getting a little ahead of ourselves, but the detour is worth it in this one case.

In all of the maps we have discussed, a source is translated to a target. But Seel can also be invoked with the effect of reversing the translation. The target then becomes the source and vice-versa. In a reverse translation, the <micro> elements in the source are executed, while those in the target are ignored.

With this concept, we can modify Map 4.1 to create a genuine roundtrip translation. In the default case, a GEM record is translated to a MARC record with some additional text indicating where the GEM elements came from. If the <source> is modified as shown in Map 4.3, the superfluous GEM text is stripped out when the translation is reversed, reproducing the original GEM record. As a result, one fairly transparent Seel map performs a bidirectional translation between metadata formats that differ in granularity, an operation that would involve some messy special processing in other scripting or programming languages.

Map 4.3

<source>
  <mainpath>
    <step name="audience"/>
    <path pid="1"><step name="toolfor"/></path>
    <path pid="2"><step name="beneficiary"/></path>
  </mainpath>
  <micro pid="1">
    <remove side="start" data="Tool for: "/>
  </micro>
  <micro pid="2">
    <remove side="start" data="Beneficiary: "/>
  </micro>
</source>
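The roundtrip idea can be checked with a small Python sketch: the forward direction adds the label, as the target <micro> elements of Map 4.1 do, and the reverse direction strips it, as the source <micro> elements of Map 4.3 do. The function names are hypothetical, and this models only the string handling, not the Seel interpreter.

```python
# Labels taken from the "data" attributes in Maps 4.1 and 4.3.
LABELS = {"toolfor": "Tool for: ", "beneficiary": "Beneficiary: "}


def gem_to_marc(element: str, value: str) -> str:
    """Forward translation: prefix the value with its GEM label."""
    return LABELS[element] + value


def marc_to_gem(element: str, value: str) -> str:
    """Reverse translation: strip the label if it is present at the start."""
    label = LABELS[element]
    return value[len(label):] if value.startswith(label) else value


for element, original in [("toolfor", "Teachers"), ("beneficiary", "Students")]:
    marc_value = gem_to_marc(element, original)
    print(marc_value, "->", marc_to_gem(element, marc_value))
```

Because each <remove> names exactly the string its matching <add> introduced, the reverse pass restores the original GEM values.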

Section 5 has more discussion of reverse translations, as well as a pointer to a simple demo.

Conditional matching

Just as Seel permits us to manipulate Morfrom values, it also permits the conceptually simpler operation of referring to them. We saw one example in Map 2.2, which used <context> elements to construct a field in the target record based on values found in the source record. Here we introduce the <vmatch>, or "value-match" element, which does the same thing in an idiom that is both more compact and more powerful. Since <vmatch> refers to values rather than constructing them, this element appears most naturally in the source block of a translation map.

Map 4.4 shows a rewrite of Map 2.2, using the <vmatch> element to perform special processing on the ONIX path <contributor><b035>ContributorRole. Recall that if the data at that location is "A01", the result is a MARC record with "author" at <100><e> and if it's "A12", the result is "illustrator."

The relevant context in Map 4.4 is at pid=3. The translation root node is <contributor> and the reference field is <b035>. This is further modified by the <vmatch> element contained in the <value> element, which indicates to the Seel interpreter that a field's value will also be used to determine if the translation should be executed. The Seel <vmatch> element contains a "data" attribute, which specifies a regular expression to be applied to the string in a Morfrom value. In Map 4.4, the string in the value must be either "A01" or "A12". If a record contains the path <field name="contributor"><field name="b035"> with either <value>A01</value> or <value>A12</value>, the translation is executed as in Map 2.2. In the target, a second regular expression in the <sub> element ensures that the ONIX codes are translated to the correct MARC keywords.

Map 4.4

<map id="map:02">
  <source>
    <mainpath>
      <step name="contributor" position="1"/>
      <path pid="1"><step name="b037"/></path>
      <path pid="2"><step name="b043"/></path>
      <path pid="3">
        <step name="b035"><value><vmatch data="^(A01|A12)$"/></value></step>
      </path>
    </mainpath>
  </source>
  <target>
    <mainpath>
      <step name="100"/>
      <path pid="1"><step name="a"/></path>
      <path pid="2"><step name="c"/></path>
      <path pid="3"><step name="e"/></path>
    </mainpath>
    <micro pid="3">
      <sub regexp="A01" data="author"/>
      <sub regexp="A12" data="illustrator"/>
    </micro>
  </target>
</map>
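The two-stage logic of Map 4.4, a <vmatch> test in the source followed by <sub> rewrites in the target, can be sketched in Python as follows. The function name is hypothetical, Python's re module is assumed to be a fair stand-in for Seel's regular expressions, and returning None is just this sketch's way of saying "the translation is not executed."

```python
import re
from typing import Optional

# Pattern from the <vmatch data=...> attribute in Map 4.4.
VMATCH = re.compile(r"^(A01|A12)$")

# (regexp, data) pairs from the two <sub> elements in the target.
SUBS = [("A01", "author"), ("A12", "illustrator")]


def translate_role(code: str) -> Optional[str]:
    """Return the MARC <100><e> keyword, or None if <vmatch> fails."""
    if not VMATCH.search(code):
        return None
    for regexp, data in SUBS:
        code = re.sub(regexp, lambda match, d=data: d, code, count=1)
    return code


print(translate_role("A01"))  # author
print(translate_role("A12"))  # illustrator
print(translate_role("B06"))  # None
```

Each <sub> fires only when its own code is present, so the two substitutions never interfere with one another.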

In Maps 4.4 and 2.2, <vmatch> and the corresponding <context> elements are equally expressive. But <vmatch> is more powerful because it permits arbitrarily complex regular expressions. For example, the "data" attribute might contain this pattern:

<value>
  <vmatch data="^\d{4}-\d{2}-\d{2}$"/>
</value>

A translation would then be executed if the Morfrom value in a specified path consisted of four digits, a dash, two digits, a dash, and two digits, with nothing else on the flanks. Of course, regular expression matches represent a compromise between compactness and readability, so you are advised to find your own comfort level.
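To see what this pattern accepts and rejects, it can be tried against a few made-up strings in Python, assuming Python's re module is close enough to Seel's regular expression dialect for this pattern.

```python
import re

# The pattern from the <vmatch> example above.
DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Made-up sample values.
samples = ["1924-06-03", "1924-6-3", "c1924-06-03", "9999-99-99"]

results = {value: bool(DATE_SHAPE.search(value)) for value in samples}
for value, matched in results.items():
    print(value, matched)
```

The first and last samples match; the other two do not. Note that the pattern checks only the shape of the string, so "9999-99-99" passes even though it is not a real date.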

One final comment about <vmatch>. All of the examples we have discussed so far contain the "data" attribute. This is the default. But the DTDs for Morfrom and Seel show that the Seel <vmatch> element and the Morfrom <value> element both permit a small (and identical) set of other attributes, including "scheme" and "language," which permit special interpretations of the data.

When the <vmatch> attribute is "data," the Seel interpreter applies the regular expression pattern to any Morfrom value in the current path. If the attribute is something else, the application of the regular expression is restricted to a Morfrom value with the same attribute. For example, a data stream might consist of Morfrom records containing references to different kinds of URLs, URIs and URNs, coded with statements such as:

<field name="identifier"><value scheme="url">ID 1</value></field>
<field name="identifier"><value scheme="uri">ID 2</value></field>
<field name="identifier"><value scheme="urn">ID 3</value></field>

A Seel script might contain the following <vmatch> element:

<value><vmatch scheme="(?i)url"/></value>

This would have the effect of singling out records with references to URLs for special handling and would process only the first record shown above.
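The selection can be sketched in Python under one reading of the example, namely that the pattern in the "scheme" attribute of <vmatch> is tested against the "scheme" attribute of each Morfrom <value>. That reading, and the use of Python's re module, are assumptions of this sketch.

```python
import re

# (scheme, value) pairs from the three Morfrom statements above.
records = [("url", "ID 1"), ("uri", "ID 2"), ("urn", "ID 3")]

# Pattern from <vmatch scheme="(?i)url"/>; (?i) makes it case-insensitive.
SCHEME = re.compile(r"(?i)url")

selected = [value for scheme, value in records if SCHEME.search(scheme)]
print(selected)  # ['ID 1']
```

Because of the (?i) flag, a value coded with scheme="URL" would be selected as well.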

Summary

This section has shown a variety of methods for referring to and manipulating Morfrom <value> data in a Seel translation. The elements <add>, <sub>, <remove>, and <trim> are simple and relatively fixed, but they are sufficient for handling most data-manipulation needs in the metadata translation tasks we have encountered. When more flexibility is needed, regular expressions can be embedded in the <vmatch> or <sub> statement. Together with the elements for specifying contexts and paths described in the previous sections, these elements allow Seel to process records of arbitrary complexity, as long as they are expressed in the Morfrom structure.