NHSE Repository Interoperation

The National HPCC Software Exchange provides a uniform interface to a distributed set of discipline-oriented HPCC repositories. As such, the NHSE is a virtual repository, in that it catalogs and points to software maintained elsewhere, except for archive and mirror copies stored on NHSE machines. A virtual repository is a type of interoperation that involves a hierarchical relationship. The NHSE virtual repository architecture is shown in Figure 1.

In many cases, a discipline-oriented repository will wish to provide its own specialized interface to its software collection. The repository may use classification schemes and search tools tuned to its particular discipline. For example, the Netlib and GAMS mathematical software repositories use the GAMS classification scheme and are developing expert search subsystems for specific GAMS classes. Discipline-oriented repositories will also be in the best position to review and evaluate software within their own domains. In addition to providing access to its own software, a repository may wish to import software descriptions from other repositories and make this software available from its own interface. For example, a computational chemistry repository may wish to provide access to mathematical software and to parallel processing tools in a manner tuned to the computational chemistry discipline. A repository interoperation architecture is shown in Figure 2.

Interoperability raises several issues, each of which is discussed in its own section below.

Exchange and interpretation of catalog information

Different repositories cannot be expected to adopt and use internally a universal data model for catalog records, nor would such uniformity necessarily be desirable. The reasons uniformity cannot and should not be achieved are two-fold:
  1. Existing repositories have long-standing practices for cataloging their software which work well for their purposes. If interoperability requirements were to encroach on the autonomy of individual repositories by requiring them to adopt a universal cataloging format internally, few repositories would be willing to interoperate.
  2. A "one-size-fits-all" data model is undesirable because software from different disciplines will have specialized properties that will require special data elements to describe. Different disciplines will also develop their own classification schemes and controlled vocabularies.
However, although there are differences between the data models most appropriate for cataloging software in different disciplines, enough commonality exists to reach agreement on a core set of data elements. Such agreement has been reached in the form of the Reuse Library Interoperability Group's Basic Interoperability Data Model (BIDM), IEEE Std 1420.1. Repositories need not adopt the BIDM for internal use, although they may certainly do so if desired. Rather, they should be able to export their metadata to other libraries using the BIDM and import and interpret BIDM metadata from other libraries.

In addition to exporting common data elements, it is desirable for repositories to be able to export additional meta-information in a manner that may be interpreted by other repositories. The RIG Technical Committee on Model Extensions is working on a meta-model that will enable formal description and exchange of extensions to the BIDM.

The RIG is currently involved in a Web-based Interoperability Experiment that is also the basis for current efforts at NHSE repository interoperation. The experiment, which is being conducted by the Technical Committee on Web Bindings, consists of specifying, implementing, and testing a small number of bindings of the RIG BIDM. A binding of the RIG BIDM is a mapping from the abstract data model to a concrete syntax that can be used for interchange. The binding currently being used for NHSE interoperation is an HTML binding that maps BIDM data elements to META tags in the headers of HTML documents. The following repositories are currently participating in the NHSE interoperation effort, along with a few individual software providers:

In addition to the BIDM fields, the NHSE data model includes a few additional fields that are desirable for NHSE interoperation. The relevant data model for a field is currently specified by prefixing the field name with the data model name in the name attribute of the META tag. In the future, NHSE extensions to the BIDM will be described using the RIG meta-model which is currently under development.
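
As an illustration of how an importing repository might read such a binding, the following Python sketch collects META tags from an HTML header and groups the fields by their data model prefix. The dot separator between the data model name and the field name, the sample field values, and the helper names are assumptions made for this example; the text above says only that the field name is prefixed with the data model name.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="Model.Field" content="..."> tags by data model."""

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(dict)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        # Skip META tags that carry no model prefix or no content.
        if not name or content is None or "." not in name:
            return
        # Assumed convention: "<data model>.<field name>".
        model, field = name.split(".", 1)
        self.fields[model][field] = content


def parse_catalog_record(html_text):
    """Return {data model name: {field name: value}} for one HTML document."""
    collector = MetaCollector()
    collector.feed(html_text)
    return {model: dict(fields) for model, fields in collector.fields.items()}


# Hypothetical catalog page; names and values are invented for the example.
SAMPLE = """<html><head>
<title>Example asset</title>
<meta name="BIDM.Name" content="Example Solver">
<meta name="BIDM.UniqueID" content="urn:example:solver">
<meta name="NHSE.Fingerprint" content="MD5:0123">
</head><body></body></html>"""

record = parse_catalog_record(SAMPLE)
```

Grouping by prefix lets an importer interpret the BIDM fields it understands while carrying along, or ignoring, fields from data models it does not know.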

Name assignment and resolution

Almost every software repository assigns some sort of unique identifier to each of its holdings. The format of these identifiers varies, however, and uniqueness is guaranteed only within a particular repository. With interoperation, the need arises for globally-unique, location-independent identifiers. A user in possession of such an identifier should be able to retrieve either associated metadata or the named resource itself, subject to access restrictions. The RIG has recognized the need for such identifiers by specifying a UniqueID field for asset metadata, but the mechanisms for assigning and resolving such UniqueIDs have yet to be determined.

As a virtual repository, the NHSE sees a need for a globally unique identifier that unambiguously identifies a particular version of a software asset. Such unambiguous identification is necessary for a number of reasons, including the following:

  1. version tracking
  2. associating testing and review metadata with the exact version that was reviewed
  3. reporting and reproducing scientific results
However, the NHSE also sees a need for a stable name for a resource that does not change every time there is a minor bug fix or revision.

The NHSE is currently experimenting with using both URLs and URNs in the metadata that is exchanged using the HTML binding of the RIG BIDM discussed above. The NHSE data model includes an additional fingerprint field for identifying the exact version of a file. The fingerprint scheme currently used by the NHSE is MD5. The NHSE is considering adopting the RCDS system (see the following section for more information about RCDS), which would resolve a URN to an unambiguous identifier called a LIFN, and would resolve a LIFN to a set of URLs.
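
The distinction between a stable name and an exact-version fingerprint can be sketched in Python as follows. The URN, the record layout, and the function names are hypothetical; only the use of MD5 as the fingerprint scheme comes from the text above.

```python
import hashlib


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of a file's bytes as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def matches_fingerprint(data: bytes, fingerprint: str) -> bool:
    """True if these bytes are exactly the version the fingerprint names."""
    return md5_fingerprint(data) == fingerprint


# Hypothetical asset: the stable name survives a bug fix, the fingerprint does not.
original = b"subroutine solve\nend\n"
bug_fix = b"subroutine solve\n! fixed\nend\n"

catalog_entry = {
    "name": "urn:example:solver",             # stable across revisions
    "fingerprint": md5_fingerprint(original),  # identifies one exact version
}

same_version = matches_fingerprint(original, catalog_entry["fingerprint"])
after_fix = matches_fingerprint(bug_fix, catalog_entry["fingerprint"])
```

A review record that stores the fingerprint alongside the stable name can thus state exactly which bytes were reviewed, even after the named resource has been revised.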

Performance and reliability of access to resources

Distributed maintenance of resources is desirable because it keeps information close to its source, which allows local control and helps keep the information up to date. It raises performance and reliability problems, however, for access by remote users. These problems can be addressed by replication and caching, but replication and caching in turn raise consistency and intellectual property rights issues. Intellectual property rights issues are discussed further below.

Some caching/replication schemes, such as the Domain Name System and the Harvest Cache, use Time-To-Live (TTL) based consistency. The Netlib mirroring scheme uses a master-slave update protocol that runs nightly. The Andrew File System uses a hierarchical invalidation scheme. Research has shown, however, that the overhead of invalidation can outweigh the efficiency advantages of caching.
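
TTL-based consistency can be illustrated with a minimal Python sketch. It is a generic toy cache, not the implementation used by the Domain Name System or the Harvest Cache; the class name, the injected clock, and the sample data are assumptions made for the example.

```python
import time


class TTLCache:
    """Serve a cached copy until its time-to-live expires, then refetch."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # callable that retrieves from the origin
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the demo can control time
        self._entries = {}           # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]          # still within TTL: no origin contact
        value = self._fetch(key)
        self._entries[key] = (value, now + self._ttl)
        return value


# Demo with a fake clock and a hypothetical origin repository.
origin = {"readme": "v1"}
fake_now = [0.0]
cache = TTLCache(origin.__getitem__, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get("readme")   # fetched from the origin
origin["readme"] = "v2"       # the origin is updated
stale = cache.get("readme")   # within TTL: the old copy is still served
fake_now[0] = 61.0
fresh = cache.get("readme")   # TTL expired: the new copy is fetched
```

The demo shows the trade-off: the cache never contacts the origin within the TTL, so it adds no invalidation traffic, but it can serve an out-of-date copy for up to one TTL period.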

The Resource Cataloging and Distribution System (RCDS) under development at the University of Tennessee uses a consistency model based on Location Independent File Names (LIFNs). Once assigned, a LIFN is immutably bound to a particular sequence of bytes. After updating a file, a publisher assigns it a new LIFN, registers the new URN-to-LIFN binding with an RCDS catalog server, and notifies authorized file servers, which can then acquire the new file and notify a location server of the new LIFN-to-URL binding. Thus, the RCDS scheme combines TTL-based "pull" consistency, in which file servers pull updates at their convenience, with invalidation-based "push" updating, in which meta-information updates are propagated efficiently among catalog servers.
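
The publish and resolve steps described above can be sketched as a small in-memory Python model. This is not RCDS code: the class and function names, the LIFN format, the example URLs, and the choice to derive a LIFN from a content hash (so that one LIFN can never name two byte sequences) are all assumptions made for illustration.

```python
import hashlib


class CatalogServer:
    """Holds the URN -> current LIFN binding."""
    def __init__(self):
        self.urn_to_lifn = {}


class LocationServer:
    """Holds the LIFN -> set of URLs binding."""
    def __init__(self):
        self.lifn_to_urls = {}


class FileServer:
    """A mirror that is notified of new LIFNs and pulls them when convenient."""
    def __init__(self, base_url, locations):
        self.base_url = base_url
        self.locations = locations
        self.files = {}
        self.pending = []

    def notify(self, lifn, publisher_store):
        self.pending.append((lifn, publisher_store))

    def pull_pending(self):
        while self.pending:
            lifn, store = self.pending.pop(0)
            self.files[lifn] = store[lifn]
            url = f"{self.base_url}/{lifn}"
            # Tell the location server where this LIFN can now be found.
            self.locations.lifn_to_urls.setdefault(lifn, set()).add(url)


def make_lifn(data):
    # Hypothetical LIFN format, derived from the content (an assumption).
    return "lifn:example:" + hashlib.sha256(data).hexdigest()[:16]


def publish(urn, data, store, catalog, file_servers):
    lifn = make_lifn(data)
    store[lifn] = data
    catalog.urn_to_lifn[urn] = lifn       # register the URN-to-LIFN binding
    for server in file_servers:
        server.notify(lifn, store)        # servers pull later
    return lifn


def resolve(urn, catalog, locations):
    lifn = catalog.urn_to_lifn[urn]
    return lifn, sorted(locations.lifn_to_urls.get(lifn, ()))


catalog, locations, store = CatalogServer(), LocationServer(), {}
mirror_a = FileServer("http://mirror-a.example", locations)
mirror_b = FileServer("http://mirror-b.example", locations)

old_lifn = publish("urn:example:solver", b"version 1", store, catalog, [mirror_a, mirror_b])
mirror_a.pull_pending()
mirror_b.pull_pending()

new_lifn = publish("urn:example:solver", b"version 2", store, catalog, [mirror_a, mirror_b])
mirror_a.pull_pending()                   # mirror_b has not pulled yet

current_lifn, urls = resolve("urn:example:solver", catalog, locations)
```

In the demo, the URN resolves to the new LIFN as soon as the catalog is updated, but only to the mirror that has already pulled the new file. The old LIFN remains bound to the old bytes on both mirrors, so a client never receives the wrong content under a given LIFN.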

The NHSE is planning to mirror authorized copies of software from the various HPCC repositories and individual software providers on NHSE file servers. The NHSE is also planning to run experimental RCDS catalog and location servers on the distributed set of NHSE servers. Experiments will be carried out to compare the performance and efficiency of the RCDS file replication approach with other proposed replication and caching schemes.

Authenticity and integrity

Verifying the authenticity of a file means verifying that its purported author is the true author. Verifying the integrity of a file means verifying that its contents have not been changed, with respect to the purported version and publication date, either maliciously or accidentally. Such guarantees are especially important for software due to the danger of Trojan horses and viruses. Verifiable digital signatures of catalog records that contain file fingerprints would achieve both the authenticity and the integrity objectives. The public key infrastructure that would enable widespread use of digital signatures is currently lacking, however.
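
The idea of signing a catalog record that contains a file fingerprint can be sketched in Python. Because the Python standard library has no public-key signature primitive, this sketch substitutes an HMAC with a shared secret key for the digital signature; a real deployment would use a public-key signature so that anyone could verify without being able to sign. The record fields, key, and function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(record):
    """Serialize a record deterministically so signer and verifier agree."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record, key):
    # Stand-in for a digital signature: HMAC-SHA256 over the catalog record.
    return hmac.new(key, canonical_bytes(record), hashlib.sha256).hexdigest()


def verify(record, signature, key, file_bytes):
    # Authenticity: the record was produced by a holder of the key.
    if not hmac.compare_digest(sign_record(record, key), signature):
        return False
    # Integrity: the file matches the fingerprint inside the signed record.
    return hashlib.md5(file_bytes).hexdigest() == record["fingerprint"]


file_bytes = b"program text"
record = {
    "name": "urn:example:solver",
    "author": "Example Author",
    "fingerprint": hashlib.md5(file_bytes).hexdigest(),
}
key = b"demo-shared-secret"   # placeholder key for the example only
signature = sign_record(record, key)

genuine = verify(record, signature, key, file_bytes)
tampered_file = verify(record, signature, key, file_bytes + b"!")
forged_author = verify(dict(record, author="Someone Else"), signature, key, file_bytes)
```

Because the fingerprint is inside the signed record, one check covers both objectives: a changed file fails the fingerprint comparison, and a changed author or fingerprint field fails the signature comparison.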

Evaluation and review

Many software repositories have their own certification, evaluation, and/or review procedures. The RIG Technical Committee on Asset Evaluation and Certification (TC4) has determined that all certification procedures used by or known to RIG members use levels, but that the levels have different meanings and associated certification activities. TC4 has therefore developed the RIG Asset Certification Framework, which defines a consistent structure for describing a reuse library's asset certification policy, and which is currently in the IEEE balloting process. Using this framework, a reuse library can exchange a description of its certification policy along with the certification metadata itself.

The NHSE has designed a software review policy that will enable easy access by users to information about software quality, but which is flexible enough to be used across and specialized to different disciplines. The NHSE recognizes three review levels: Unreviewed, Partially reviewed, and Reviewed.

The Unreviewed designation means only that the software has been accepted into the owning repository and is thus within the scope of HPCC and of the discipline of that repository. The Partially reviewed designation means that the software has been checked by a librarian for properties that may be verified by inspection, including completeness, adequate documentation, and good software construction. The Reviewed designation means that the software has been reviewed in a review article in the electronic journal NHSE Review by an expert in the appropriate field. The NHSE also provides for soliciting and publishing author claims and user comments about software quality. All software exported to the NHSE by its owning repository or by an individual contributor is to be tagged with its current review level and with a pointer to a review abstract which describes the software's current review status and includes pointers to supporting material.
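
A possible representation of this tagging, sketched in Python, is shown below. The class and field names, the example URLs, and the treatment of the three levels as an ordered scale are assumptions made for the example; the text specifies only that each exported asset carries its review level and a pointer to a review abstract.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewLevel(Enum):
    # Declared in ascending order of review effort (an assumed ordering).
    UNREVIEWED = "Unreviewed"
    PARTIALLY_REVIEWED = "Partially reviewed"
    REVIEWED = "Reviewed"


@dataclass(frozen=True)
class ExportedAsset:
    name: str
    review_level: ReviewLevel
    review_abstract_url: str   # pointer to the review abstract


def assets_at_least(assets, minimum):
    """Return the assets whose review level is at or above `minimum`."""
    order = list(ReviewLevel)  # Enum iteration follows declaration order
    return [a for a in assets
            if order.index(a.review_level) >= order.index(minimum)]


# Hypothetical exported assets.
assets = [
    ExportedAsset("solver", ReviewLevel.REVIEWED, "http://repo.example/solver/review"),
    ExportedAsset("mesher", ReviewLevel.PARTIALLY_REVIEWED, "http://repo.example/mesher/review"),
    ExportedAsset("plotter", ReviewLevel.UNREVIEWED, "http://repo.example/plotter/review"),
]

checked = [a.name for a in assets_at_least(assets, ReviewLevel.PARTIALLY_REVIEWED)]
```

A user interface could use such a filter to let users restrict a search to software that has at least been checked by a librarian.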

Protection of intellectual property rights

Software may be protected by copyright and/or patent law. Software is also subject to U.S. export law, which requires a license for export of any "technical information", which includes software, although in some cases the license is "automatic", meaning that no paperwork is required. It is desirable for users and repository maintainers to be able to quickly access and understand meta-information about legal restrictions on copying and using software so as to not unknowingly infringe upon them. The RIG Technical Committee on Intellectual Property Rights is working on standards that will enable interoperable exchange of such information.

Protection of intellectual property rights should not unduly impede or slow access to software. The NHSE is faced with the task of distributing and providing efficient access to HPCC software, some of which has security classifications and/or access restrictions. The NHSE is currently undertaking a study of how efficient access can be provided while meeting legal restrictions and security objectives, and without exposing third parties, such as NHSE online service providers, to legal liability for rights infringement or violation of U.S. export law.