NHSE Repository Interoperation
The National HPCC Software
Exchange provides a uniform interface to a distributed set of
discipline-oriented HPCC repositories. As such, the NHSE is a
virtual repository, in that it catalogs and points to
software maintained elsewhere, except for archive and mirror copies
stored on NHSE machines. A virtual repository is a type of
interoperation that involves a hierarchical relationship.
The NHSE virtual repository architecture is shown in
Figure 1.
In many cases, a discipline-oriented repository will wish to provide
its own specialized interface to its software collection.
The repository may use classification schemes and search tools
tuned to its particular discipline. For example, the
Netlib and
GAMS mathematical software repositories
use the GAMS classification scheme and are developing expert search
subsystems for specific GAMS classes.
Discipline-oriented repositories will also be in the best position
to review and evaluate software within their own domains.
In addition to providing access to its own software, a repository
may wish to import software descriptions from other repositories
and make this software available from its own interface.
For example, a computational chemistry repository may wish to provide
access to mathematical software and to parallel processing tools
in a manner tuned to the computational chemistry discipline.
A repository interoperation architecture is shown in
Figure 2.
Interoperability raises the following issues, which are discussed
further below:
Different repositories cannot be expected to adopt and use internally
a universal data model for catalog records, nor would such uniformity
necessarily be desirable. The reasons uniformity cannot and should
not be achieved are two-fold:
- Existing repositories have long-standing practices for cataloging
their software which work well for their purposes.
If interoperability requirements were to encroach on the autonomy
of individual repositories by requiring them to adopt a universal
cataloging format internally, few repositories would be willing to
interoperate.
- A "one-size-fits-all" data model is undesirable because software
from different disciplines will have specialized properties that will
require special data elements to describe. Different disciplines
will also develop their own classification schemes and controlled
vocabularies.
However, although there are differences between the data models
most appropriate for cataloging software in different disciplines,
enough commonality exists to reach agreement on a core set of
data elements. Such agreement has been reached in the
form of the Reuse Library Interoperability
Group's Basic Interoperability Data Model (BIDM), IEEE Std 1420.1.
Repositories need not adopt the BIDM for internal use,
although they may certainly do so if desired. Rather, they should be able to
export their metadata to other libraries using the BIDM and import and
interpret BIDM metadata from other libraries.
In addition to exporting common data elements, it is desirable for
repositories to be able to export additional meta-information in a manner
that may be interpreted by other repositories. The
RIG Technical Committee on Model
Extensions is working on a meta-model that will enable formal
description and exchange of extensions to the BIDM.
The RIG is currently involved in a
Web-based Interoperability
Experiment that is also the basis for current efforts at
NHSE repository interoperation. The experiment, which is being
conducted by the Technical
Committee on Web Bindings, consists of specifying, implementing,
and testing a small number of bindings of the RIG BIDM.
A binding of the RIG BIDM is a mapping from the abstract
data model to a concrete syntax that can be used for interchange.
The binding currently being used for NHSE interoperation is an
HTML binding
that maps BIDM data elements to META tags in the headers of HTML
documents. The following repositories are currently participating
in the NHSE interoperation effort, along with a few individual software
providers:
In addition to the BIDM fields, the NHSE data model includes a few
additional fields that are desirable for NHSE interoperation. The
relevant data model for a field is currently specified by prefixing
the field name with the data model name in the name attribute of the
META tag. In the future, NHSE extensions to the BIDM will be described
using the RIG meta-model, which is currently under development.
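The prefixing convention described above can be sketched as follows.
This is a minimal illustration, not the actual binding: the "." separator
between data model name and field name, the field names other than
UniqueID, and the sample values are all assumptions made for the example.

```python
from html import escape


def bidm_meta_tags(fields):
    """Render (data_model, field, value) triples as HTML META tags.

    The field name is prefixed with its data model name in the NAME
    attribute. The "." separator is an assumption for illustration.
    """
    lines = []
    for model, field, value in fields:
        name = escape(f"{model}.{field}", quote=True)
        content = escape(value, quote=True)
        lines.append(f'<META NAME="{name}" CONTENT="{content}">')
    return "\n".join(lines)


# Hypothetical catalog record mixing BIDM fields with an NHSE extension.
record = [
    ("BIDM", "Name", "Example Solver"),
    ("BIDM", "UniqueID", "urn:example:solver"),
    ("NHSE", "Fingerprint", "d41d8cd98f00b204e9800998ecf8427e"),
]

if __name__ == "__main__":
    print(bidm_meta_tags(record))
```

Keeping the data model name in the tag name lets an importing repository
ignore fields from models it does not understand while still reading the
core BIDM fields.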
Almost every software repository assigns some sort of unique identifier
to each of its holdings. The format of these identifiers varies, however,
and uniqueness is guaranteed only within a particular repository.
With interoperation, the need arises for globally unique,
location-independent identifiers. A user in possession of such an
identifier should be able to retrieve either associated metadata or
the named resource itself, subject to access restrictions.
The RIG has recognized the need for such identifiers by specifying
a UniqueID field for asset metadata, but the mechanisms for assigning
and resolving such UniqueIDs have yet to be determined.
As a virtual repository, the NHSE sees a need for a globally unique
identifier that unambiguously identifies a particular version of a
software asset. Such unambiguous identification is necessary for
a number of reasons, including the following:
- version tracking
- associating testing and review metadata with the exact version
that was reviewed
- reporting and reproducing scientific results
However, the NHSE also sees a need for a stable name for a resource
that does not change every time there is a minor bug fix or revision.
The NHSE is currently experimenting with using both URLs and URNs
in the metadata that is exchanged using the HTML binding of the RIG
BIDM discussed above. The NHSE data model includes an additional
fingerprint field for identifying the exact version of a file.
The fingerprint scheme currently used by the NHSE is MD5.
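As a rough sketch of how an MD5 fingerprint can pin down an exact file
version, the following uses Python's standard hashlib module. The function
names are hypothetical, and the way the fingerprint is encoded in the NHSE
metadata is not shown here.

```python
import hashlib


def md5_fingerprint(data: bytes) -> str:
    """Return the MD5 digest of the file contents as a hex string."""
    return hashlib.md5(data).hexdigest()


def matches_fingerprint(data: bytes, expected_hex: str) -> bool:
    """Check retrieved contents against a fingerprint from a catalog record."""
    return md5_fingerprint(data) == expected_hex.strip().lower()


if __name__ == "__main__":
    original = b"abc"
    fingerprint = md5_fingerprint(original)
    print(fingerprint)
    # A one-byte change yields a different fingerprint.
    print(matches_fingerprint(b"abd", fingerprint))
```

Because any change to the file changes the fingerprint, two copies with the
same fingerprint can be treated as the same version for tracking and review
purposes.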
The NHSE is considering adopting the RCDS system
(see the following section for more information about RCDS), which would resolve
a URN to an unambiguous identifier called a LIFN, and would resolve
a LIFN to a set of URLs.
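The two-step resolution just described can be pictured with two lookup
tables. This is only a sketch: the dictionaries stand in for RCDS catalog
and location servers, and the URN, LIFN, and URL strings are invented for
the example.

```python
def resolve(urn, urn_to_lifn, lifn_to_urls):
    """Resolve a URN to (lifn, urls).

    Step 1 maps the stable name to the identifier of one exact version.
    Step 2 maps that identifier to every location known to hold it.
    Raises KeyError if the URN is unknown.
    """
    lifn = urn_to_lifn[urn]
    return lifn, list(lifn_to_urls.get(lifn, []))


# Hypothetical tables; real identifiers would come from the servers.
urn_to_lifn = {"urn:example:solver": "lifn-2"}
lifn_to_urls = {
    "lifn-1": ["http://origin.example/solver-1.0.tar"],
    "lifn-2": [
        "http://origin.example/solver-1.1.tar",
        "http://mirror.example/solver-1.1.tar",
    ],
}

if __name__ == "__main__":
    print(resolve("urn:example:solver", urn_to_lifn, lifn_to_urls))
```

The URN plays the role of the stable name, while the LIFN identifies the
exact version, so both needs stated above are served by one lookup chain.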
Distributed maintenance of resources is desirable because it keeps
information close to its source, which allows local control and helps
keep the information up to date. It raises performance and reliability
problems for access by remote users, however. Performance and reliability problems can be
solved by replication and caching. However, replication and caching
raise consistency and intellectual property rights issues.
Intellectual property rights issues are discussed further below.
Some caching/replication schemes, such as the Domain Name System and the
Harvest Cache,
use Time-To-Live (TTL) based consistency.
The Netlib mirroring scheme uses a master-slave update protocol
that runs nightly. The Andrew File System uses a hierarchical
invalidation scheme. Research has shown, however, that the overhead
of invalidation can outweigh the efficiency advantages of caching.
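To make the TTL idea concrete, here is a minimal cache sketch. It is a
generic illustration of TTL-based consistency, not the mechanism of any of
the systems named above. The class name and the injectable clock are
assumptions chosen to keep the example testable.

```python
import time


class TTLCache:
    """Cache whose entries are trusted until a fixed lifetime has elapsed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key, fetch):
        """Return the cached value, calling fetch(key) only when it is stale."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no contact with the origin
        value = fetch(key)   # expired or missing: go back to the origin
        self._entries[key] = (value, now + self.ttl)
        return value
```

A cached copy may be out of date for up to one TTL interval. That is the
price of avoiding invalidation messages from the origin.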
The
Resource Cataloging and Distribution System (RCDS) under development
at the University of Tennessee uses a consistency model based on
Location Independent File Names (LIFNs).
Once assigned, a LIFN is immutably bound to a particular sequence
of bytes. After updating a file, a publisher assigns it a new LIFN,
registers the new URN-to-LIFN binding with an RCDS catalog server,
and notifies authorized file servers, which can then acquire the new
file and notify a location server of the new LIFN-to-URL binding.
Thus, the RCDS scheme is a combination of TTL-based "pull" consistency,
with file servers pulling updates at their convenience, and
invalidation-based "push" updating by efficient propagation of
meta-information updates among catalog servers.
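The update sequence described above can be sketched with a few in-memory
stand-ins. All class and method names here are hypothetical, the
counter-based LIFN strings are purely illustrative, and nothing below
reflects the real RCDS protocol or wire format.

```python
class CatalogServer:
    """Hypothetical stand-in: records the current LIFN for each URN."""

    def __init__(self):
        self.urn_to_lifn = {}


class LocationServer:
    """Hypothetical stand-in: records which URLs serve each LIFN."""

    def __init__(self):
        self.lifn_to_urls = {}

    def register(self, lifn, url):
        self.lifn_to_urls.setdefault(lifn, []).append(url)


class FileServer:
    """Mirror that is told about new LIFNs and fetches them when it chooses."""

    def __init__(self, base_url, locations):
        self.base_url = base_url
        self.locations = locations
        self.files = {}    # LIFN -> bytes held by this mirror
        self.pending = []  # LIFNs announced but not yet fetched

    def notify(self, lifn):
        self.pending.append(lifn)

    def pull(self, publisher):
        """Fetch announced files, then register the new LIFN-to-URL bindings."""
        for lifn in self.pending:
            self.files[lifn] = publisher.contents[lifn]
            self.locations.register(lifn, f"{self.base_url}/{lifn}")
        self.pending = []


class Publisher:
    def __init__(self, catalog, file_servers):
        self.catalog = catalog
        self.file_servers = file_servers
        self.contents = {}  # LIFN -> bytes; an existing entry is never rewritten
        self._next = 1

    def publish(self, urn, data):
        """Give the new bytes a fresh LIFN and announce the update."""
        lifn = f"lifn-{self._next}"  # illustrative naming only
        self._next += 1
        self.contents[lifn] = data
        self.catalog.urn_to_lifn[urn] = lifn  # pushed right away
        for server in self.file_servers:
            server.notify(lifn)               # file itself is pulled later
        return lifn
```

In this sketch the catalog learns of a new version immediately, while each
mirror keeps serving the old LIFN until it decides to call pull.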
The NHSE is planning to mirror authorized copies of software from the various
HPCC repositories and individual software providers on NHSE file
servers. The NHSE is also planning to run experimental RCDS catalog
and location servers on the distributed set of NHSE servers.
Experiments will be carried out to compare the performance and efficiency
of the RCDS file replication approach with other proposed replication
and caching schemes.
Verifying the authenticity of a file means verifying that
its purported author is the true author. Verifying the integrity
of a file means verifying that its contents have not been changed,
with respect to the purported version and publication date, either
maliciously or accidentally. Such guarantees are especially important
for software due to the danger of Trojan horses and viruses.
Verifiable digital signatures of catalog records that contain file fingerprints
would achieve both the authenticity and the integrity objectives.
The public key infrastructure that would enable widespread use of digital
signatures is currently lacking, however.
Many software repositories have their own certification, evaluation,
and/or review procedures. The
RIG Technical Committee on Asset Evaluation and Certification (TC4)
has determined that all certification procedures used by or known to
RIG members use levels, but that the levels have different meanings
and associated certification activities. TC4 has therefore developed
the RIG Asset Certification Framework, which defines a consistent
structure for describing a reuse library's asset certification policy,
and which is currently in the IEEE balloting process. Using this
framework, a reuse library can exchange a description of its certification
policy along with the certification metadata itself.
The NHSE has designed a
software review policy that will give users easy access to
information about software quality, while remaining flexible enough to
be used across, and specialized to, different disciplines.
The three review levels recognized by the NHSE are the following:
- Unreviewed
- Partially reviewed
- Reviewed
The Unreviewed designation means only that the software
has been accepted into the owning repository and is thus within the scope
of HPCC and of the discipline of that repository. The Partially
reviewed designation means that the software has been checked by a
librarian for properties that may be verified by inspection, including
completeness, adequate documentation, and good software construction.
The Reviewed designation means that the software has
been reviewed in a review article in the electronic journal
NHSE Review by an expert in the appropriate field.
The NHSE also provides for soliciting and publishing author claims and
user comments about software quality. All software exported to the NHSE
by its owning repository or by an individual contributor is to be tagged
with its current review level and with a pointer to a review abstract
which describes the software's current review status and includes
pointers to supporting material.
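One way to picture the tagging requirement is as a small record type. This
is only a sketch: the type and attribute names are hypothetical, and the
NHSE metadata field names for these values are not given here.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewLevel(Enum):
    """The three review levels recognized by the NHSE."""

    UNREVIEWED = "Unreviewed"
    PARTIALLY_REVIEWED = "Partially reviewed"
    REVIEWED = "Reviewed"


@dataclass(frozen=True)
class ReviewTag:
    """Review metadata attached to an exported software description."""

    level: ReviewLevel
    abstract_url: str  # pointer to the review abstract (hypothetical name)

    def __post_init__(self):
        # Every exported item carries a pointer, whatever its level.
        if not self.abstract_url:
            raise ValueError("a review abstract pointer is required")


if __name__ == "__main__":
    tag = ReviewTag(ReviewLevel.PARTIALLY_REVIEWED,
                    "http://repository.example/reviews/solver.html")
    print(tag.level.value, tag.abstract_url)
```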
Software may be protected by copyright and/or patent law.
Software is also subject to U.S. export law, which requires a license
for export of any "technical information", which includes software,
although in some cases the license is "automatic", meaning that
no paperwork is required.
It is desirable for users and repository maintainers to be able to
quickly access and understand meta-information about legal restrictions
on copying and using software, so that they do not unknowingly
infringe upon them. The
RIG Technical Committee on Intellectual Property Rights is working
on standards that will enable interoperable exchange of such information.
Protection of intellectual property rights should not unduly
impede or slow access to software. The NHSE is faced with the task of
distributing and providing efficient access to HPCC software, some
of which has security classifications and/or access restrictions.
The NHSE is currently undertaking a study of how efficient access
can be provided while meeting legal restrictions and security objectives,
and without exposing third parties, such as NHSE online service
providers, to legal liability for rights infringement or violation of
U.S. export law.
browne@cs.utk.edu