Thursday, April 12, 2007

pf-dspace: repository research on an open-source repository platform

In this entry we describe ongoing work within Hewlett-Packard Labs to extend the open source DSpace platform to address the problem of digital object repository federation in innovative ways. With the extensions introduced by pf-dspace, repository administrators will be able to manage and replicate information and media across many cooperating institutions in a peer-to-peer fashion using essentially “out-of-the-box” features of the DSpace platform. We also describe our next steps with DSpace@HPLabs, including advancements in policy-based federation management and our plan to contribute artifacts of this work to the institutional repository community through the active DSpace open source network.

1. Why Repository Federation is Interesting

A repository federation is the interconnection of a set of autonomously-managed digital object repositories into one or more larger-scale, distributed collections based on an expressed set of rules or policies that codify the federation's purpose, and the collection properties that its members agree to achieve. Given this definition it's clear that many federation architectures are possible, based on goals as varied as:

  • long-term persistence of information and preservation of digital assets
  • access to information over wide-ranging geographies and difficult conditions
  • construction and management of large-scale collections
  • managing collections with broad topical scope, from widely diverse communities
  • leveraging skills and technical capabilities contributed by diverse organizations
  • distributed, semi-autonomous collection management

Our interest at HPLabs in repository federation was sparked by the China Digital Museum Project (CDMP), an ongoing collaboration involving the Chinese Ministry of Education, HPLabs and several Chinese universities. CDMP provides a large-scale infrastructure based on the DSpace platform upon which a federation of university-based museums store, manage, preserve and disseminate the digitized versions of university museum artifacts. In the final phase of CDMP it is expected that this federation will interconnect more than 100 university museums, each with an estimated 2TB of digital artifacts stored in local DSpace installations.

CDMP has created a replicated collection architecture in which items of interest from the individual remote collections are harvested or “pulled” to complete the particular local collection, according to that collection's defining rules. A given replica might include only the item's metadata, or it might also include the item's composite files. In the case of CDMP, two modified DSpace instances (known as DM-DSpace) are designated as replicating repositories or data centers, and the remaining repositories hold individual collections of local interest.
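The “pull” style of harvesting described above can be illustrated with a minimal sketch. It assumes the harvest is an ordinary OAI-PMH ListRecords request; the base URL, the sample response and the function names are hypothetical and are not taken from the DM-DSpace code. Requesting the oai_dc metadata prefix returns metadata only, so a full replica would need a further step to fetch the item's files, which is not shown here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build a ListRecords request URL for a pull-style harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # only records changed on or after this date
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the identifier of every record header in a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]


# Hypothetical response from a remote museum repository (trimmed to headers only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:museum-a.example.org:123/1</identifier>
      <datestamp>2007-04-01</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:museum-a.example.org:123/2</identifier>
      <datestamp>2007-04-02</datestamp>
    </header></record>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would call list_records_url with the date of its last successful harvest, fetch the result, and pass the response body to record_identifiers to decide which items to replicate locally.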

2. Peer Federation and pf-dspace

Our pf-dspace code generalizes the approach to DSpace federation introduced with DM-DSpace, allowing the platform to implement a wider variety of federation topologies. pf-dspace also makes federation administration more accessible from the administrator's user interface, allowing both simple and complex topologies to be constructed without extra software or setup. pf-dspace has eliminated the need for a separate, centralized node registry, which is a common feature of many previous repository federation implementations. The key to achieving this decentralized node management was the adoption of a distributed “friends” list, in which each repository shares with other nodes basic information about its known peers, known as its “friends,” using standard features of the OAI-PMH protocol. Such decentralized node management is just one of the features of peer federation made possible by pf-dspace. Our extensions build on the way DM-DSpace applies standard protocols and introduce some important new management capabilities:

  • pf-dspace uses OAI-SQ/OAI-SQ-F to provide selective (query-based) harvesting, in which metadata from other repositories is retrieved based on keywords and metadata fields. DM-DSpace limited its harvesting to “new” items.
  • pf-dspace introduces improvements in how interactions with nodes known to an individual repository are managed. In particular, node administrators now have the ability to control whether each of a node's “friends” is published to other peers (i.e., is made “public”) or is suppressed, as well as whether that node is harvested (i.e., is “active”). In addition, the pf-dspace code tracks the “live” state of each “friend,” i.e., whether or not a network connection can be established.
  • pf-dspace provides the ability to do metadata-only harvesting, which is useful for constructing “virtual” (non-replicating) repositories. The DM-DSpace platform only supports full replication of items.
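The “friends” bookkeeping described in this list can be sketched as follows. This is a minimal illustration and not the pf-dspace implementation: the Friend class, its field names and the merge policy are assumptions. The XML element names follow the optional “friends” description container from the OAI-PMH implementation guidelines, which we assume is the standard feature referred to above.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH "friends" description container.
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"


@dataclass
class Friend:
    base_url: str
    public: bool = True   # advertised to other peers
    active: bool = True   # harvested by this node
    live: bool = False    # last connection attempt succeeded


def published_friends_xml(friends):
    """Serialize only the friends marked public, for sharing with other peers."""
    ET.register_namespace("", FRIENDS_NS)
    root = ET.Element("{%s}friends" % FRIENDS_NS)
    for friend in friends:
        if friend.public:
            ET.SubElement(root, "{%s}baseURL" % FRIENDS_NS).text = friend.base_url
    return ET.tostring(root, encoding="unicode")


def parse_friends_xml(xml_text):
    """Read the base URLs advertised in a peer's friends container."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall("{%s}baseURL" % FRIENDS_NS)]


def harvest_targets(friends):
    """Only friends that are both active and reachable are harvested."""
    return [f.base_url for f in friends if f.active and f.live]


def merge_friends(known, advertised_urls):
    """Add newly learned peers without touching existing entries.

    Assumed policy: new peers start unpublished and inactive until an
    administrator reviews them.
    """
    merged = list(known)
    seen = {f.base_url for f in known}
    for url in advertised_urls:
        if url not in seen:
            merged.append(Friend(url, public=False, active=False))
            seen.add(url)
    return merged
```

Because every node both publishes its public friends and merges the lists it receives, knowledge of peers can spread through the federation without a central registry.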

3. What's next for pf-dspace?

An important contribution of pf-dspace has been our practical implementation of the OAI-SQ/OAI-SQ-F extensions to the OAI-PMH protocol, giving repositories the ability to make more selective metadata queries against remote collections. This capability allows individual repositories to accumulate items based on their attributes, which is fundamental to federated collection management. At this writing pf-dspace successfully performs selective harvests and stores them in a physical directory, but it doesn't (yet) map these replicated items onto the appropriate logical collections in the repository. An important next step will be to integrate the AutoMapper plugin to map retrieved items to logical collections, which will involve being able to refine the selective queries and associate them with one or more mappings.
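To make the idea of associating selective queries with mappings concrete, here is a small sketch. It does not use the AutoMapper plugin's interface or the OAI-SQ wire format, neither of which is described here; the MappingRule class, the metadata layout and the collection names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MappingRule:
    field_name: str      # metadata field to inspect, e.g. "subject"
    keywords: tuple      # the rule matches if any keyword appears in the field
    collections: tuple   # logical collections the matching item is mapped to


def matches(rule, metadata):
    """Case-insensitive keyword match against every value of one metadata field."""
    text = " ".join(metadata.get(rule.field_name, [])).lower()
    return any(keyword.lower() in text for keyword in rule.keywords)


def map_item(metadata, rules):
    """Return the target collections for one harvested item, in rule order, without duplicates."""
    targets = []
    for rule in rules:
        if matches(rule, metadata):
            for collection in rule.collections:
                if collection not in targets:
                    targets.append(collection)
    return targets


# Hypothetical rules and one harvested item described by simple Dublin Core-like fields.
RULES = [
    MappingRule("subject", ("bronze",), ("Metalwork", "Archaeology")),
    MappingRule("title", ("vessel",), ("Archaeology", "Ceramics")),
]
ITEM = {"title": ["Bronze ritual vessel"], "subject": ["Bronze Age", "metalwork"]}
```

With these rules, map_item(ITEM, RULES) places the item in the Metalwork, Archaeology and Ceramics collections, while an item that matches no rule is mapped nowhere and would remain unassigned.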

An exciting area of experimentation for pf-dspace will be harvesting objects from a variety of heterogeneous repository-like sources. We believe there is an important opportunity to demonstrate the utility and value of harvesting selected items from “ephemeral” sources and bringing them under the management of institutional repositories. Examples of these sorts of sources include: wikis and blogs; “social networking” sites; decentralized, departmental wikis; social tagging and bookmarking services; mailing list archives; and anything else to which we can attach a harvesting interface.

4. What's next for DSpace@HPLabs?

HPLabs remains active in the DSpace community at both the advisory and development levels. We are finding that the DSpace platform is an ideal vehicle for certain kinds of repository research, and we look forward to releasing back to the open source community those DSpace code patches from our ongoing research that may be of benefit. Ongoing DSpace-based research at HPLabs currently falls into two categories: the clustering of DSpace instances using open-source tools to achieve robust, large-scale digital repositories; and policy-based management and automation of repository federations. Specific topics that we are exploring include:

  • federating repositories to accomplish goals such as replication, subject-based collections, distributed format migration, etc.
  • automated, event-driven repository management, locally and across federations
  • active integrity assurance of managed items and metadata
  • information-based access control that remains valid over time, despite item transformations
  • continual expansion of the facets of information that are extracted from managed items
  • providing access to new and different consumers of that information

We anticipate that much of our future DSpace federation work will be in policy-based federation management, building on the basic peer federation capability provided by the current pf-dspace extensions. Some next steps include using a distributed, event-condition-action rules approach to marshal sets of autonomously-managed peers into federations that are defined by express sets of collection management policies (implemented as reactive rules) that participants in the federation agree to share. To this end, a promising rules language that we are now experimenting with is XChange from the Institut für Informatik at Ludwig-Maximilians-Universität München (University of Munich).
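The event-condition-action pattern mentioned above can be sketched in a few lines. This is plain Python rather than XChange, and the event names, payload fields and the two example policies are hypothetical; a real federation would also need to distribute events and rules between peers, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    event_type: str                     # the event this rule reacts to
    condition: Callable[[dict], bool]   # checked against the event payload
    action: Callable[[dict], str]       # returns a description of the action to take


def dispatch(rules, event_type, payload):
    """Return the actions of every rule whose event type matches and whose condition holds."""
    return [
        rule.action(payload)
        for rule in rules
        if rule.event_type == event_type and rule.condition(payload)
    ]


# Hypothetical shared policy: every item should exist on at least two peers,
# and every modification should trigger an integrity check.
MIN_REPLICAS = 2
RULES = [
    Rule(
        "replicate-under-protected-items",
        "item_added",
        lambda event: event["replicas"] < MIN_REPLICAS,
        lambda event: "replicate " + event["item"],
    ),
    Rule(
        "verify-after-change",
        "item_modified",
        lambda event: True,
        lambda event: "verify-checksum " + event["item"],
    ),
]
```

Keeping the policies as data (the RULES list) rather than as hard-coded behavior is what would allow the participants in a federation to share, inspect and revise them.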

“Policy”-driven federations exist today, but almost always the “policies” have been hard-coded, are not flexible and are not themselves under some kind of lifecycle management. Still, existing platforms such as LOCKSS have been rigorously studied and teach us much; LOCKSS is of particular interest because it is a proven platform that has implemented examples of the kinds of “policies” that DSpace-based federations must also implement. The NARA-funded PLEDGE project (MIT & SDSC) is another example; in that case researchers are examining how to implement preservation policies within institutional repositories, including how to cast expressed policies in machine-interpretable, actionable and verifiable ways. A more recent, related bit of work in this area is the PHAROAH project, a follow-on to LOCKSS that our HPLabs colleagues are involved with.

From a DSpace perspective, our continuing focus will be on providing and maintaining visibility for the repository administrator throughout the policy and object "lifecycles." This includes visibility of the policies, visibility of the assets, visibility over all actions performed on the assets, etc. Achieving this visibility has been at the heart of our approach to pf-dspace, especially as we put control of the federation directly into the hands of the repository administrator and deal with federation management in ways that are directly analogous to collection management itself.

Wednesday, April 11, 2007

About the Bloggers (11 April 2007)

John Erickson has spent many years studying the unique social, legal, and technical problems that arise when managing and disseminating information in the digital environment. At HP Labs John has focused on the policy-based management of distributed, heterogeneous digital object repositories and content processing architectures. He has been an active participant in a number of international metadata and rights management standards efforts and currently serves on the OAI Object Reuse and Exchange (OAI-ORE) advisory committee, the DSpace Architectural Review committee, the Handle System Technical Review committee and the Global Handle System Advisory Committee.

Jim Rutherford is the lead DSpace developer for HP Labs and HP's primary contributor to the DSpace open source community. Jim joined HP Labs in 2006 to work on digital repository research using DSpace, in particular working closely with the China Digital Museum Project (DM-DSpace) team and on the problem of repository federation more generally. His work on generalising and extending key elements of the DM-DSpace codebase has led to several recent presentations; these and other contributions to the DSpace community led recently to his elevation to Committer status within the DSpace open source project.

Welcome to the pf-dspace blog

Jim Rutherford and John Erickson of HPLabs (Bristol, UK and Norwich, VT USA) will use this blog to keep the DSpace and repository research communities up-to-date on progress using the DSpace open repository platform as a basis for repository and digital preservation research.

Their current work is focussed on a standards-based extension to the DSpace platform they call pf-dspace (peer-federation DSpace).