originally: http://www.oclc.org/research/rac/minutes/2002-03.htm
Reported by Robert C. Bolander
Communications & Programs Manager
OCLC Office of Research
Presentation | Presenter |
---|---|
Introduction | |
Project Management | L. Dempsey |
Metadata Switch | L. Dempsey |
Terminology Resources/Knowledge Organization | D. Vizine-Goetz |
DCMI | S. Weibel |
ACE | T. Hickey |
Special Collections | L. Normore |
Electronic Theses & Dissertations and the Open Archives Initiative's Protocol for Metadata Harvesting | T. Hickey |
Z39.50 Interest Group, Search & Retrieve on the Web, and Search and Retrieve with URLs | R. LeVan |
Corporate Marketing Update | C. De Rosa |
Digital Preservation Infrastructure | B. Lavoie |
Public Libraries' Use of the Web | C. Prabha |
RDF Topicmaps | J. Godby |
Learning Systems & Interoperability | N. McLean |
FRBR Intro | E. O'Neill |
FRBR Overview | E. O'Neill |
INDECS | J. Godby |
The FRBRization of Humphry Clinker | E. O'Neill |
FRBRization Algorithms | T. Hickey |
Closing | |
The OCLC Research Advisory Committee (RAC) convened its first meeting of the 2002 year on March 21-22, 2002 at the OCLC campus in Dublin, Ohio. Four RAC members were present:
Lorcan Dempsey, Vice President, OCLC Research, opened the two-day RAC meeting with a brief overview of project management activities of the Office of Research (OR).
Mr. Dempsey also presented a proposed Metadata Switch project, which would investigate the development of a set of services to add value to metadata by leveraging OCLC expertise and positioning within the community.
The Metadata Switch proposal recognizes that major digital initiatives are underway as part of learning, research, and cultural engagement. These represent the management and disclosure of institutional assets in a context in which unique, non-published materials are growing in importance.
The project will explore the infrastructure necessary for OCLC to leverage its expertise and position to add value to metadata created elsewhere by aggregating such metadata for reuse, leveraging knowledge structures, and developing services on enhanced metadata.
Components of the global system include:
Diane Vizine-Goetz, Consulting Research Scientist, reported on internal and external projects using the Dewey Decimal Classification (DDC). She also reported on a new project to provide terminology services in the form of Web services for a range of terminology resources.
OCLC researchers and Dewey editorial staff have recently collaborated to provide high-level mappings between the outline to the Library of Congress (LC) Classification scheme and the DDC. These mappings are intended for use in QuestionPoint, a collaborative reference service that is being developed cooperatively by OCLC and the Library of Congress. The mappings will be used by participating Dewey libraries to profile their subject strengths.
Dr. Vizine-Goetz also reported on how the DDC is being used in research projects conducted by external partners. In the Renardus project, a collaborative effort involving several European subject gateways, the DDC is being used to provide a common browsing structure and switching language for the different subject vocabularies used by the project partners. The DDC is also to be used in the e-Prints UK project. In this project, OCLC researchers will develop Web services for enhancing metadata with DDC categories. Project partner UKOLN will harvest metadata from e-print repositories at UK educational institutions. The metadata records will then be transferred to the DDC service hosted at OCLC for enhancement. OCLC researchers are also prototyping Web services for use in other projects. These services will use the DDC as well as other terminology resources available at OCLC. Dr. Vizine-Goetz and her team are researching new technologies and resources for extending the range of OCLC knowledge organization resources and services.
Stuart Weibel, Consulting Research Scientist and Executive Director of the Dublin Core Metadata Initiative (DCMI), briefed the RAC on DCMI activities over the past year. Among the highlights:
Dr. Weibel demonstrated RDF (Resource Description Framework) interoperability as part of his report on DCMI and the World-Wide Web Consortium's Semantic Web Activity, and described DCMI's Open Metadata Registry project.
Interested readers will find more information on the DCMI Web site and in the article, "Dublin Core Metadata Initiative Progress Report and Workplan for 2002," in D-Lib Magazine (Vol. 8, no. 2), by Makx Dekkers and Stuart L. Weibel.
Thom Hickey, Chief Scientist, reported on the Advanced Collections Environment (ACE) project, which investigates centralized solutions to personal collection management and uses the Application Service Provider (ASP) model for managing collections.
Dr. Hickey explained that a project starting with personal collections is simpler than working immediately with institutional collections. It allows more experimentation, but many findings should apply to the library setting.
ACE is intended to be a complete service for the serious collector. Its records are based on Dublin Core, and it emphasizes management rather than commerce.
ACE is a browser-based service, implemented with Zope, an open-source web-application server, and the Python programming language. It uses RDF descriptions of collections and records.
Specific challenges in the project include developing adequate searching facilities (given that ACE is a collection of diverse collections), privacy, authority control, report generation, and user-interface issues. Possibilities for future exploration include connecting to a simplified Z39.50 server (SRW; see Ralph LeVan's report below) and investigating the use of ACE with special collections.
Lorraine Normore, Consulting Research Scientist, reported on her investigations of special collections in institutional settings. This work:
Dr. Normore uses a methodology known as Contextual Design, developed by Karen Holtzblatt and Hugh Beyer, which:
The fieldwork and initial organization of data have been completed in this project, but analysis is not yet finished. So far the following areas have been identified as distinguishing special collections from libraries:
The internal roles most common in the settings studied include:
Other internal roles also occur, as do external ones (e.g., donors, users, appraisers, funders).
Significant entities within these environments are:
It is clear at this stage that the twenty collections studied are different from libraries (even though some of them occur in libraries), and that they have special needs for metadata and outreach that may differ from those of libraries.
Thom Hickey also reported on the Electronic Theses & Dissertations (ETDs) project, focusing on thesis metadata via the Open Archives Initiative's Protocol for Metadata Harvesting (OAI-PMH). This is a light-weight protocol for moving or sharing metadata that allows synchronization of loosely coupled databases and mandates XML Dublin Core as the default metadata format.
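To illustrate the kind of exchange OAI-PMH involves, the following is a minimal sketch of a harvester's request and response handling, not the project's Java code. It builds a `ListRecords` request for Dublin Core records and pulls titles out of a response. The base URL and set name in the usage lines are hypothetical, and the verb and argument names are taken from the published protocol rather than from these minutes.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc",
                     set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository.

    When a resumption token is supplied it is sent on its own, since the
    protocol treats it as an exclusive argument for continuing a harvest.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def titles_from_response(xml_text):
    """Return the text of every Dublin Core title element in a response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("{%s}title" % DC_NS)]


# Hypothetical repository address and set name, for illustration only.
print(list_records_url("http://example.org/oai", set_spec="theses"))
```

A harvester would fetch this URL, process the records, and repeat the request with the returned resumption token until the repository stops issuing one.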
The goals of Dr. Hickey's project are to:
So far the project team has written a harvester and server in Java, and brought up a database of 4.3 million thesis and dissertation records from WorldCat. Names in the database are linked to the Library of Congress Name Authority file, and searching is provided via OCLC's SiteSearch software.
Dr. Hickey reported that OAI-PMH is expected to be the major method of moving large amounts of metadata between systems. It can be useful for interoperability beyond simple metadata, and useful for repositories even without harvesting. Current plans for the project include making a searchable version public via SRW (see below), making sets harvestable via OAI-PMH, bringing up a 2.0 server, merging in other sets of theses, and working on harvesting other OAI servers.
Consulting Research Scientist Ralph LeVan discussed his work with the Z39.50 Implementors Group (ZIG) and two of its initiatives.
The Search and Retrieve on the Web (SRW) project investigates providing Z39.50 as a Web Service. Mr. LeVan explained that classic Z39.50 has not been popular with the Web community because it:
On the other hand, Z39.50 allows for result sets (statefulness) and abstraction (abstract access points/attribute sets, abstract record schemas).
SRW uses the Simple Object Access Protocol, known as SOAP, as the information-exchange mechanism and the Web Services Description Language, or WSDL, for record description. In contrast to the eighteen native and extended services supported under classic Z39.50, SRW supports only one (SearchAndRetrieve). It is semantically equivalent to classic Z39.50, which makes gateways trivial and preserves the experience of the Z39.50 community without the overhead of the standard.
Mr. LeVan outlined aspects of SRW requests and responses, its common query language, and the fact that SRW supports the explain service, which had never been practical in classic Z39.50.
The ZIG also is exploring SRU, or Search and Retrieve with URLs, which Mr. LeVan described as SRW without the SOAP wrapper. SRU adds a ResponseSchema parameter and is intended for thin clients, where the browser is the application. The market for SRU currently is underdeveloped, and it may be a while before the library community adopts it. It could show up in other communities first, and may be seen as a competitor with XML Query.
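As a rough illustration of what "the browser is the application" means for SRU, the sketch below builds a search request as a plain URL. The parameter names (`operation`, `version`, `query`, `maximumRecords`, `recordSchema`) are an assumption based on the later published SRU 1.1 specification; the draft discussed at this meeting, including its ResponseSchema parameter, may have differed. The server address is hypothetical.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=10, record_schema=None):
    """Build an SRU searchRetrieve URL from a CQL query string.

    Parameter names follow SRU 1.1 (an assumption, see the note above).
    """
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    if record_schema:
        params["recordSchema"] = record_schema
    return base_url + "?" + urlencode(params)


# Hypothetical server; the query is written in the common query language.
print(sru_search_url("http://example.org/sru", 'dc.title = "humphry clinker"', 5))
```

Because the whole request is a URL, it can be typed into a browser or embedded in a link, with no SOAP envelope to construct.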
More information is available.
Cathy De Rosa, OCLC Vice President, Corporate Marketing, provided an update on corporate marketing plans and activities. She described the goal of the Corporate Marketing Division as delivering tools and methods that connect the value of OCLC with the people we serve, in a way that can be understood, absorbed, valued, and promoted.
Brian Lavoie, Research Scientist, reported on the OCLC/RLG collaboration on the development of digital preservation infrastructure. The plan is to facilitate consensus-building among stakeholders and to identify and support standards and best practices. Two working groups have been established (described below). Mr. Lavoie explained that preservation metadata generally is created, maintained, and utilized in an archival setting. It informs and documents actions taken to preserve a digital object's bit stream over the long term, supports the rendering and understanding of a preserved digital object's content, and represents the information infrastructure necessary to support preservation of, and ongoing access to, digital objects.
The Preservation Metadata Working Group (PMWG), which is led by OCLC, brings together digital preservation experts from diverse institutional and geographic backgrounds within the library/cultural-institution community. Its objectives are to develop a comprehensive preservation metadata framework, applicable to a broad range of digital preservation activities, and to examine issues surrounding implementation and practical use of metadata in support of digital preservation. Milestones achieved to date include a white paper on the "state of the art" in preservation metadata and a two-part report on preservation metadata and the Open Archival Information System (OAIS) information model. Both reports are available on the PMWG Web site.
The goals of the RLG-led Working Group on Attributes of a Trusted Digital Repository are to specify characteristics of a sustainable digital archive for large-scale heterogeneous collections and to provide a foundation for the establishment of certification programs for digital archives (as recommended by the 1996 Task Force on Archiving of Digital Information report). Its work includes consideration of administrative responsibility, technological suitability, organizational viability, system security, economic sustainability, and procedural accountability. The group has published a report, Attributes of a Trusted Digital Repository: Meeting the Needs of Research Resources, which currently is available for public comment on the Working Group Web site.
Mr. Lavoie reported widespread recognition that collaboration is the most likely vehicle for advancing the research agenda for digital preservation, given the shared challenges facing stakeholders, a desire to avoid duplicating effort, and the benefits of working out issues of standardization and interoperability. The scope of collaboration potentially includes working across units within OCLC as well as with external organizations within the library/cultural-institution community or beyond.
Chandra Prabha, Senior Research Scientist, presented an overview of her work assessing public libraries' use of the Web. The objectives of her study were to assess how visible public libraries are on the Web and what public libraries are doing to provide access to Web resources.
Dr. Prabha took a random sample of 200 public libraries and sought to identify whether each library in the sample maintained a presence on the World-Wide Web. She did this twice for each library, once in winter 2001 and again in winter 2002. Dr. Prabha also e-mailed a survey to the libraries in the sample in order to determine who hosts public library Web sites, who designs them, who maintains and updates them, and how often. This phase of the project is still in progress.
Dr. Prabha concluded that both public libraries and the Web are reaching the general public. Both public libraries and the Web provide access to everyday information (e.g., ready reference & its variations), community information, and current information. Some public libraries use the Web effectively to meet the everyday information needs of the general public.
Dr. Prabha summarized her work by saying that most public libraries have a door on the Web and that there has been a 15% increase in public libraries' presence on the Web from 2001 to 2002. Nonetheless, librarians fear loss of support for public libraries, and user surveys indicate a preference for the Web. She concluded that public libraries clearly are trying to weave themselves into the Web, and asked how the OCLC Cooperative can more effectively weave the Web into public libraries.
Consulting Research Scientist Jean Godby demonstrated subject navigation of Web sites using RDF topic maps, which are created by:
Some topics generally are found to be unique to specific sites, while others are common to multiple sites. Dr. Godby gathered subject/topic metadata from Web sites by examining:
Some of the term relationships found on the sites were identified as:
The subject/topic extraction software is embedded in a library of Open Source code that:
Open issues include:
Dr. Godby concludes from her project results thus far that the enterprise succeeds or fails on the strength of the knowledge ontology. Sophisticated user interface design is required to exploit all of the encoded information. Demo and Open Source code are accessible at: http://topicmaps.oclc.org:5000
Neil McLean spoke on interoperability issues between Learning and Information Spaces.
Online learning environments:
Information spaces are dominated by:
Ideally, these domains would be readily accessible to the teacher or learner at any point in the teaching/learning cycle. Enterprise systems would provide an institutional management context.
Thus, key technical goals for the realization of a truly integrated learning information environment would be:
The technical issues that must be addressed to meet these goals include:
In particular, Mr. McLean envisions the strategic challenges for OCLC as:
With that in mind, he proposed a project to develop searchable open archives repositories for learning resources, arguing that such activity would:
Ed O'Neill, Thom Hickey, and Jean Godby made a coordinated series of four presentations on topics related to the International Federation of Library Associations and Institutions (IFLA) project known as Functional Requirements for Bibliographic Records, or FRBR:
Ed O'Neill reported that an IFLA study group issued the FRBR recommendations in 1998. If fully implemented, FRBR would produce the biggest change cataloging has seen in the last century. FRBR is an entity-relationship model of metadata for information objects, instead of the single flat record conceptualized by AACR and MARC.
The goals of the OCLC project are to:
FRBR conceptualizes three groups of entities:
The internal subdivision of Group One entities is important as well. FRBR specifies that intellectual or artistic products include the following types of entities:
Furthermore, FRBR specifies particular relationships between classes of Group One entities:
Dr. O'Neill explained that we currently describe a bibliographic unit out of context. With FRBR the items must be described in context in a manner sufficient to relate the item to the other items comprising the work. AACR2 is focused on the physical manifestation while FRBR uses the four-level bibliographic structure outlined above.
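The difference between a flat record and a four-level structure can be sketched as nested data types. The entity names below (Work, Expression, Manifestation, Item) come from the published FRBR report rather than from these minutes, and the attribute names and sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single copy, for example one held by a particular library."""
    holding_library: str


@dataclass
class Manifestation:
    """A physical embodiment (an edition or printing) of an expression."""
    label: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A particular realization of a work, such as a revised text."""
    label: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract intellectual or artistic creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count every item under a work by walking all four levels."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


# Hypothetical sample data, for illustration only.
work = Work("Example Work", [
    Expression("original text", [
        Manifestation("first printing", [Item("Library A"), Item("Library B")]),
    ]),
    Expression("revised text", [
        Manifestation("reprint", [Item("Library C")]),
    ]),
])
print(count_items(work))
```

In this arrangement each item is reachable only through its manifestation, expression, and work, which is one way of describing an item "in context".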
Jean Godby spoke to the committee and audience about FRBR and INDECS, both of which are entity-relationship models for information objects.
FRBR goals:
INDECS goal:
Although INDECS is conceptually similar to FRBR, it also is more generic. It proposes a data dictionary of metadata objects and encodes a detailed proposal for intellectual property rights.
The INDECS model includes three basic entities:
It also specifies four relations among them:
Furthermore, the model specifies three primary types of creation:
and four relations among them:
INDECS treats metadata as events, not resources, an approach that has some advantages:
INDECS events include:
INDECS transactions describe rights ownership and actions that can be taken with creations. Rights transactions depend on a chain of transfers of rights and permissions. The INDECS term "agreement" captures the notion of accord between two parties. INDECS recognizes the transaction relations of Work - Agreement - {Agent (consenters) | Output (permission, requirement, prohibition) | Context (time, place)}. It also recognizes the transfer of intellectual property rights through the term iprTransfer and the relations Work - iprTransfer - {Agent (granter, grantee) | Input (transferred right, controlled creation) | Context }.
Dr. Godby considered whether INDECS might be a proper standard for OCLC, considering its interest in intellectual property, interoperability, and preservation, but she also noted the existence of problematic issues related to FRBR and implementation of the model.
Ed O'Neill described the FRBRization of Humphry Clinker, a project undertaken because studying a single work in depth can provide a level of understanding not possible with a larger group. The eighteenth-century Clinker is an epistolary novel, presenting a series of letters from members of a particular family as they travel about Britain. It has been previously studied by the Office of Research, and is considered to be of mid-level complexity and not atypical of works in the WorldCat database. Furthermore, it is widely held, with 184 records in WorldCat and over 5,000 holdings. The sense was that if serious difficulties were encountered in the process of FRBRizing Clinker, then such difficulties would be likely for many other works as well.
The goals of studying this work were:
The objective of the Humphry Clinker analysis was to organize the bibliographic objects represented by bibliographic records, not simply to organize the records. To determine whether two records were for the same expression, the question was whether the objects they represented had identical content, not whether the records themselves were similar.
In order to collect the bibliographic records, WorldCat was searched for all possible Humphry Clinker records. A total of 179 records were found. Thirty-eight actual books were examined, and 600 digital photographs were taken of key pages. Researchers identified a number of types of revisions to the original text, including the correction of errors, replacing of archaic character forms with modern letters, repositioning dates on the letters that convey the story, and adding chapter titles. Numerous other augmentations were identified as well.
A set of elements of the bibliographic record was identified as crucial for the determination of different expressions:
Another set was identified as key for the determination of different manifestations:
At this stage of the project the following FRBR components have been identified for Humphry Clinker:
Dr. O'Neill's conclusions:
Thom Hickey made the final presentation in the FRBR series, describing algorithms and tools used in the analysis of Humphry Clinker. Researchers made a copy of WorldCat that included holdings data and LC authorities, and created a personal author file. The structure allowed them to process WorldCat as a single file.
Records were structured in MARC Communications format (LC's MARCMaker format) and processed in Unicode. The first step was to divide the database into works and expressions and tabulate statistics (including holdings).
The basic work algorithm involved optionally processing Library of Congress records first. The most common works got priority. The first work with an 'exact' match on author and title was taken as the base record; otherwise the first partial match was used. If there was no partial match, the record was assumed to represent a new work. Strings were extensively normalized regarding punctuation, diacritics, and capitalization. Leading articles and common abbreviations were dropped from titles.
When it was determined that a new record represented a previously identified work, all the titles found in the new record were added to the list of acceptable titles for the work. Similarly, all authors were added to the work's list of acceptable authors. Furthermore, all titles and authors in the record were added to the list of acceptable titles.
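The work-matching steps described above can be sketched roughly as follows. This is an illustrative reconstruction, not the code Dr. Hickey presented. The names `normalize`, `Work`, and `assign_to_work` are hypothetical, the article list is limited to English, abbreviation handling is omitted, and the rule used for a "partial" match (a shared author plus one title being a prefix of the other) is an assumption, since the minutes do not define it.

```python
import re
import unicodedata

LEADING_ARTICLES = ("the ", "a ", "an ")  # assumption: English articles only


def normalize(text):
    """Strip diacritics, punctuation, case, and a leading article."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    cleaned = " ".join(cleaned.split())
    for article in LEADING_ARTICLES:
        if cleaned.startswith(article):
            return cleaned[len(article):]
    return cleaned


class Work:
    """A work plus the authors and titles accepted as matching it."""

    def __init__(self, authors, titles):
        self.authors = set(authors)
        self.titles = set(titles)


def is_partial_match(work, authors, titles):
    # Assumed rule: a shared author and one title that prefixes the other.
    if not (work.authors & authors):
        return False
    return any(t.startswith(w) or w.startswith(t)
               for t in titles for w in work.titles)


def assign_to_work(works, record_authors, record_titles):
    """Place a record in the first exact match, else the first partial
    match, else a new work; grow the matched work's acceptable lists."""
    authors = {normalize(a) for a in record_authors}
    titles = {normalize(t) for t in record_titles}
    match = next((w for w in works
                  if w.authors & authors and w.titles & titles), None)
    if match is None:
        match = next((w for w in works
                      if is_partial_match(w, authors, titles)), None)
    if match is None:
        match = Work(authors, titles)
        works.append(match)
    else:
        match.authors |= authors
        match.titles |= titles
    return match
```

Because each matched record widens the work's lists of acceptable titles and authors, the order in which records are processed affects the grouping, which is consistent with the priority given to Library of Congress records and common works.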
The basic expression algorithm stipulated no partial matching or order for expressions. Records with added entries were processed first. Records were put in the first expression that matched; otherwise a new expression was created within the work.
For a record to match an expression, titles had to match 'exactly,' i.e., all surnames found in the record had to match those in the expression. If there was only a partial match on surnames, the record was still considered part of the same expression if the pagination (both decimal and Roman) matched.
When adding a record to an existing expression, names and title to match against were taken from the record that created the expression.
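One possible reading of the expression rules above is sketched below. It assumes titles are compared exactly, that a full surname match is sufficient on its own, and that a partial surname match counts only when pagination also agrees. The `Record` fields and function names are hypothetical, and titles and surnames are assumed to be normalized already.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class Record:
    title: str                  # assumed already normalized
    surnames: FrozenSet[str]    # surnames found in the record
    pagination: str             # Roman and decimal paging, e.g. "xii, 300"
    has_added_entries: bool = False


@dataclass
class Expression:
    base: Record                # the record that created the expression
    members: List[Record] = field(default_factory=list)


def same_expression(base: Record, record: Record) -> bool:
    """Compare a record against the record that created the expression."""
    if record.title != base.title:
        return False
    if record.surnames == base.surnames:
        return True
    partial = bool(record.surnames & base.surnames)
    return partial and record.pagination == base.pagination


def group_expressions(records: List[Record]) -> List[Expression]:
    """Group one work's records into expressions, first match wins."""
    expressions: List[Expression] = []
    # Records with added entries are processed first; sorted() is stable,
    # so the remaining order is left as given.
    for record in sorted(records, key=lambda r: not r.has_added_entries):
        for expr in expressions:
            if same_expression(expr.base, record):
                expr.members.append(record)
                break
        else:
            expressions.append(Expression(base=record, members=[record]))
    return expressions
```

Matching against the creating record only, rather than against every member, keeps an expression from drifting as records with slightly different names are added.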
Dr. Hickey concluded his presentation by demonstrating a browser interface that had been developed to view records in the FRBR database.
Lorcan Dempsey invited participation from members in a strategic outlook discussion, in order to consider the current status of work in the OR, as well as what might be done in the future. The meeting ended following a subsequent executive session between RAC members and officers of OCLC. The next meeting of the Research Advisory Committee is scheduled for August 1-2.