Evolving Software Repositories

http://www.netlib.org/utk/projects/esr/

Jack Dongarra
University of Tennessee

Eric Grosse
AT&T Bell Laboratories

Ron Boisvert
National Institute of Standards and Technology

Software repositories have traditionally provided access to software resources for particular communities of users within specific domains. For example, our Netlib and GAMS repositories provide access to collections of mathematical software, while our National HPCC Software Exchange (NHSE) provides access to high-performance computing resources. The growth of the World Wide Web has created new opportunities for expanding the scope of discipline-oriented repositories, for reaching a wider community of users, and for expanding the types of services offered. With these opportunities have come challenges, however, such as the shift from centralized to decentralized management, interoperation between different repositories, and increased security risks. Reaching a wider community of users has also created a need for more automated assistance in locating appropriate resources and in understanding and making use of these resources. We are tackling these challenges with a number of efforts, ranging from system-level infrastructure for resource management to application-level and content-oriented tools.

Our research has the following five focus areas:

Resource Cataloging and Distribution System (RCDS)
Application-level and content-oriented tools
Safe execution environments for mobile code
Repository interoperability
Distributed, semantic-based searching

Resource Cataloging and Distribution System (RCDS)

In the area of resource management infrastructure, we are developing the Resource Cataloging and Distribution System (RCDS). RCDS has the goals of

facilitating the scalable distribution of resources,
achieving fault tolerance, high availability, and good response time,
responding quickly to changes in resources, and
assuring integrity, authenticity, and consistency of resources and metadata.

RCDS consists of the following components:

File servers, which provide access to the files themselves. These can be ordinary http, ftp, etc. servers.
Catalog info servers, which maintain authenticated information about the characteristics of network-accessible resources and accept queries about the characteristics of such resources from clients.
Location servers, which maintain information about the locations of network-accessible resources and accept queries for location data from clients.
Collection managers, which are responsible for acquiring and deleting files on a file server and for informing location servers about file availability.
Publication tools, which accept new files and descriptions from content providers and inject them into the system.
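
The division of labor among these components can be illustrated with a small in-memory sketch. It is not the RCDS design or protocol: every class, method, and URL below is hypothetical, authentication of catalog data is left out, and a dictionary stands in for each file server.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogServer:
    """Stores descriptions of resources (authentication is omitted in this sketch)."""
    records: dict = field(default_factory=dict)

    def describe(self, name, attributes):
        self.records[name] = dict(attributes)

    def query(self, name):
        return self.records.get(name)


@dataclass
class LocationServer:
    """Maps a resource name to the set of URLs where it can currently be fetched."""
    locations: dict = field(default_factory=dict)

    def register(self, name, url):
        self.locations.setdefault(name, set()).add(url)

    def unregister(self, name, url):
        self.locations.get(name, set()).discard(url)

    def lookup(self, name):
        return sorted(self.locations.get(name, set()))


@dataclass
class CollectionManager:
    """Adds and removes files on one file server and keeps the location server informed."""
    base_url: str
    location_server: LocationServer
    files: dict = field(default_factory=dict)  # stands in for the file server's storage

    def acquire(self, name, content):
        self.files[name] = content
        self.location_server.register(name, f"{self.base_url}/{name}")

    def delete(self, name):
        if self.files.pop(name, None) is not None:
            self.location_server.unregister(name, f"{self.base_url}/{name}")


def publish(name, content, attributes, catalog, managers):
    """Publication tool: record the description, then hand the file to each manager."""
    catalog.describe(name, attributes)
    for manager in managers:
        manager.acquire(name, content)


# Usage with two hypothetical mirrors.
catalog = CatalogServer()
locations = LocationServer()
mirror_a = CollectionManager("http://mirror-a.example", locations)
mirror_b = CollectionManager("http://mirror-b.example", locations)

publish("example-lib.tar", b"file contents", {"title": "Example library"},
        catalog, [mirror_a, mirror_b])
before = locations.lookup("example-lib.tar")   # both mirrors are listed
mirror_a.delete("example-lib.tar")
after = locations.lookup("example-lib.tar")    # only mirror-b remains
```

Keeping descriptions and locations in separate services means a mirror can appear or disappear without touching the catalog record, which is one plausible way to read the stated goal of responding quickly to changes in resources.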

RCDS also needs a public key infrastructure, consisting of key servers for certifying and revoking public keys. This is not part of the current RCDS design, but we are considering adding it. Search servers are likewise not part of RCDS. Rather than attempting to design a resource discovery system that would work well for all existing subject areas, we have chosen to design a cataloging and distribution system that will form a common substrate for present and future resource discovery tools. RCDS does not explicitly support protection of intellectual property rights. However, it is possible to include pricing information and usage restrictions in the description of a resource.

Software repositories will find the RCDS infrastructure useful for supporting decentralized management of resources and for providing users with reliable, efficient access to those resources. Through the use of digital signatures and cryptographically signed certificates, RCDS will also provide integrity and authentication guarantees that, in addition to protecting against malicious modification or accidental corruption of source code, will enable safe use of agent and applet technologies for adding interactive content to repositories. A repository may participate in RCDS as a resource contributor, by providing descriptions of resources that it holds, or as a third party that adds value to resources contributed by others by cataloging and classifying them and/or providing a search service.
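
The integrity half of this guarantee can be sketched as a digest comparison. The sketch assumes that a catalog record carries a content digest and that the record itself has already been authenticated by a digital signature, a step not shown here. The record layout, the function names, and the choice of SHA-256 are illustrative assumptions, not part of the RCDS design.

```python
import hashlib
import hmac


def make_record(name, content):
    """Build a hypothetical catalog record holding a digest of the file contents."""
    return {"name": name, "sha256": hashlib.sha256(content).hexdigest()}


def verify_content(record, content):
    """Return True only if the downloaded bytes match the digest in the record."""
    actual = hashlib.sha256(content).hexdigest()
    return hmac.compare_digest(actual, record["sha256"])


original = b"      SUBROUTINE EXAMPLE\n      END\n"
record = make_record("example.f", original)

ok = verify_content(record, original)                    # unmodified download
tampered = verify_content(record, original + b"C extra")  # altered download
```

Because the digest comes from the catalog record rather than from the file server, a client could fetch the file from any mirror and still detect corruption or tampering, provided the record itself is trusted.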

Application-Level and Content-Oriented Tools

In the area of application-level and content-oriented tools, we are developing applet and agent programs for assisting users in finding and using software resources.

The Numerical Navigator, for which we have developed prototype versions in Java and Tcl/Tk, allows the user to visualize the contents of a software collection on a single screen. The user manipulates the display using buttons, sliders, and pull-down menus. Pointing and clicking in the display area reveals more detailed information, including links for immediate downloading of selected software.
ApproxWizard is an applet, being developed in both Java and Limbo versions, that helps users select an approximation code. The applet interacts with the user by doing calculations, either on the client or remotely on servers, on sample user data sets that reside on the client disk.
We have developed domain-specific expert extensions to the GAMS problem classification scheme. An advisory system for a given problem class helps the user discriminate between problem-solving software modules for that class. The existing prototype user interface was programmed as an X-windows client, but Java versions are planned.
A project in the planning stage is a Program Builder that fetches the appropriate versions of source code, subroutines, and libraries for a user's platform from different repositories and compiles and links them.

Safe Execution Environments for Mobile Code

We are working with other researchers in the repository and agent technology communities to define requirements for safe execution environments for agent and applet programs. An execution environment provides program interpretation and run-time support as well as relocation and communication services. However, the execution environment must also be secure to ensure that code from untrusted sources does not harm the host system, gain unauthorized access to files, or usurp resources. After determining the requirements for such an environment, we plan to implement a facility for remote execution of user code in the NetSolve system.

Repository Interoperability

Although there are a number of software repositories in existence or under development, these repositories generally have their own interfaces and require the user to connect to each one separately to search or browse for software or other resources. We are involved in efforts to promote sharing of asset metadata and, where possible, of assets themselves between software repositories. We are working with other members of the Reuse Library Interoperability Group to add structure to WWW-based interoperation and to define labeling standards for asset certification and intellectual property rights.

To facilitate maximum interoperability, we are developing a toolkit called Repository in a Box for use by repository managers. This toolkit will include a publishing tool for creating and maintaining software catalog records and for exporting these records to other repositories and search services.

Distributed, Semantic-based Searching

We are combining our experience using the Harvest System with our research on Latent Semantic Indexing (LSI) to produce a semantic-based distributed search system. LSI uses the singular value decomposition of the term-document matrix to produce a low-rank approximation to this matrix that can be used for semantic retrieval based on statistical word co-occurrence. The resulting concept space provides better retrieval performance than lexical keyword matching and allows for easy relevance feedback. Our plans are to interface LSI to the Gatherer and Broker components of Harvest. The interface to the Gatherer will consist of an interactive tool for use by an expert in guiding the Gatherer to collect relevant information. The Broker interface will allow for searching and doing relevance feedback across multiple distributed LSI indexes.
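
The low-rank retrieval step described above can be sketched with NumPy. This is a toy illustration rather than the project's search system: the four-document corpus, the rank k = 2, raw term counts as weights, and the helper names are all assumptions made for the example, and the Harvest Gatherer and Broker interfaces are not modeled.

```python
import numpy as np

# Hypothetical toy catalog: each string stands in for one software description.
docs = [
    "sparse linear system solver",
    "iterative solver for sparse matrix",
    "spline curve fitting",
    "least squares curve approximation",
]

vocab = sorted({word for doc in docs for word in doc.split()})
row_of = {word: i for i, word in enumerate(vocab)}


def term_vector(text):
    """Count occurrences of known vocabulary terms in text."""
    v = np.zeros(len(vocab))
    for word in text.split():
        if word in row_of:
            v[row_of[word]] += 1.0
    return v


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(doc) for doc in docs])

# Truncated SVD: keep the k largest singular triplets (k = 2 is an arbitrary choice).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]

# Coordinates of every document in the k-dimensional concept space.
doc_coords = (Uk.T @ A).T


def lsi_rank(query):
    """Return (document index, cosine similarity) pairs, best match first."""
    q = Uk.T @ term_vector(query)
    norms = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q)
    scores = doc_coords @ q / np.where(norms == 0.0, 1.0, norms)
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]


ranking = lsi_rank("matrix solver")
```

In this toy corpus the query "matrix solver" ranks both solver descriptions ahead of the curve-fitting ones, including the first document, which never contains the word "matrix". That is the co-occurrence effect that distinguishes LSI from lexical keyword matching.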

dongarra@cs.utk.edu