Indexing and Searching




Searching on the World-Wide Web is currently very ad hoc, and searches typically have poor precision and recall. Furthermore, the user has no way to evaluate the quality of his/her search: if no items are found, it is impossible to tell whether no relevant information exists or the search simply needs to be modified or expanded, and there are no reliable methods for restricting or expanding a search. There is also a multitude of different search services, with a great deal of overlap among them and differing protocols, a situation that makes searching redundant and expensive.

The NSE needs a structure and guidelines for describing and classifying objects, especially software. Descriptions should be in the form of attribute/value templates, with standard definitions for the data elements. The IAFA software and document templates are a start, but rigorous data definitions for these templates are lacking. The Netlib index file format is another example of a software template, but again rigorous definitions of the data elements are lacking. A candidate for a standard format that is both rigorously defined and extensible is the RIG Proposed Standard RPS-0002 (1994), A Uniform Data Model for Reuse Libraries (UDM), available from the Reuse Library Interoperability Group via AdaNET at 800-444-1458. The IETF URI Working Group has discussed the Uniform Resource Citation (URC) as a way of encapsulating meta-information about objects, but so far there has been no agreement on the format of URCs or on how they will be deployed and used.
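As a concrete illustration, an attribute/value template of the kind described above can be handled with very little code. The sketch below parses an IAFA-style record; the field names and the RFC 822-style continuation-line convention are illustrative assumptions, not the rigorous data definitions the text argues are still needed:

```python
# Minimal sketch of parsing an IAFA-style attribute/value template.
# Field names here are illustrative, not standardized definitions.

def parse_template(text: str) -> dict:
    """Parse 'Attribute: value' lines into a dictionary.

    Lines beginning with whitespace are treated as continuations of
    the previous attribute's value, as in RFC 822-style headers.
    """
    record = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and current is not None:
            record[current] += " " + line.strip()
        else:
            key, _, value = line.partition(":")
            current = key.strip()
            record[current] = value.strip()
    return record

example = """\
Template-Type: SOFTWARE
Title:         LAPACK
Description:   Routines for solving systems of linear equations,
               least-squares problems, and eigenvalue problems.
Keywords:      linear algebra, eigenvalues
"""

rec = parse_template(example)
```

The value of a standard, of course, lies not in the parsing (which is trivial) but in agreed-upon definitions of what each attribute means.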

Research in information science has shown that the quality of searching is vastly improved by the use of comprehensive, well-defined classification schemes. The Guide to Available Math Software (GAMS) classification scheme has demonstrated the effectiveness of classification for retrieving mathematical software [3]. However, an extension to GAMS, or an additional classification scheme, is needed for non-mathematical software that does not fit into any of the current GAMS classes. It would also be useful to have mappings between GAMS and existing document classification schemes such as the American Mathematical Society and Computing Reviews categories.
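The retrieval benefit of a hierarchical scheme like GAMS is that a query against a class node automatically retrieves everything filed under its subclasses. The following sketch shows that idea with prefix-coded class labels; the class codes and item names are illustrative stand-ins, loosely modeled on the GAMS tree rather than taken from it:

```python
# Sketch of retrieval by hierarchical classification, in the spirit
# of the GAMS tree.  Class codes and item names are illustrative.

CLASSES = {
    "D":  "Linear Algebra",
    "D1": "Elementary vector and matrix operations",
    "D2": "Solution of systems of linear equations",
}

# Each catalogued item carries a class code.
ITEMS = {
    "lapack/dgesv": "D2",
    "blas/daxpy":   "D1",
}

def items_under(prefix: str) -> list:
    """Return items whose class code falls at or under the given node.

    Because subclass codes extend their parent's code, a search on
    'D' retrieves everything classified anywhere in Linear Algebra.
    """
    return sorted(name for name, code in ITEMS.items()
                  if code.startswith(prefix))
```

A search on class "D" returns both items, while "D2" narrows to the linear-equation solver alone.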

Because it is probably not feasible or even desirable to have a centralized authority for describing and classifying software and documents contributed to the NSE, guidelines need to be established to assist publishers in accurately describing and classifying contributed materials. This classification activity could be facilitated by a hypertext help system, with pointers to data definition schema and classification examples. A similar hypertext system could assist users in formulating good queries and in using classification schemes.

Current indexers typically build an inverted index that includes every occurrence of every word (possibly excluding some common words) in the indexed material. The main drawback of inverted indices is their space requirement, which typically ranges from 50 to 300 percent of the size of the original text. Inverted indices also require exact spelling, so a misspelled word can cause a piece of information to be lost. At the cost of somewhat slower searching, the Glimpse indexing tool builds a much smaller index (typically 2 to 4 percent of the original text) and supports arbitrary approximate matching [9]. The material indexed by an indexing tool might be the title or filename only (e.g., Archie), the title plus keywords plus abstract (e.g., ALIWEB), or the full text of documents (e.g., WAIS). Indexers and search engines are needed that support attribute-specific indexing and searching. The NSE prototype uses the freeWAIS code from CNIDR as the basis of most of its searchable indices. The current version of freeWAIS supports only free-format indexing, but an alternative version (freeWAIS-sf) that supports attributes by building a separate inverted index for each attribute is under development, as is a version of Glimpse that supports attributes. There are also a number of commercial products available that support text searching or combined relational database and text searching.
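The structure of a word-level inverted index, and its exact-spelling weakness, can be shown in a few lines. This is a toy sketch of the general technique, not the data structure used by WAIS or Glimpse; real indexers add stop-word lists, stemming, position information, and index compression:

```python
# Toy word-level inverted index: maps each word to the set of
# documents containing it.  Illustrative only; production indexers
# add stemming, stop words, positions, and compression.
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Build {word -> set of document names} from {name -> text}."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:")].add(name)
    return dict(index)

docs = {
    "doc1": "Sparse matrix factorization routines",
    "doc2": "Dense matrix eigenvalue solvers",
}
index = build_index(docs)
# Lookup is a single dictionary access, but spelling must be exact:
# index.get("matix") finds nothing, illustrating the lost-information
# problem the text describes.
```

Approximate-matching tools such as Glimpse trade this constant-time exact lookup for a smaller index and tolerance of misspellings.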

Z39.50 (Version 2, 1992) is a US ANSI standard protocol for information retrieval [1]. Z39.50 was designed with the intent of supporting bibliographic database applications. The format of the data retrieved is not constrained by the protocol but is agreed upon by the client (origin in Z39.50 terminology) and server (target in Z39.50 terminology). The MARC format and a search attribute set suitable for bibliographic data are registered within the current version of the standard. It is expected that as the protocol comes to be used by other communities and for other types of data, other attribute sets and record syntaxes will be developed; there is a mechanism that allows new record syntaxes to be registered and then referred to by well-known identifiers. A number of library automation vendors are developing Z39.50 support for their products, and a number of sites on the Internet have put up Z39.50 servers (a list of these sites is at http://ds.internic.net/z3950/dblist.txt). Thus, support for Z39.50 by the NSE information infrastructure will allow interoperability with the growing number of organizations using this standard. The NSE may want to take the lead in defining and registering an attribute set and record syntax especially suited for searching software repositories.
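A Z39.50 query pairs a search term with a list of typed attributes drawn from a registered attribute set; the sketch below models that shape as a plain data structure. The Use-attribute value 4 (Title) comes from the registered bib-1 attribute set; the data-structure encoding itself is an assumption for illustration, since the protocol actually transmits queries in ASN.1/BER wire format:

```python
# Sketch of the attributes-plus-term structure of a Z39.50 Type-1
# query.  Shown as plain Python data for illustration; the protocol
# itself uses an ASN.1/BER encoding on the wire.
from dataclasses import dataclass, field

@dataclass
class AttributeElement:
    attribute_type: int    # in bib-1: 1 = Use, 2 = Relation, ...
    attribute_value: int   # e.g., Use value 4 = Title in bib-1

@dataclass
class AttributesPlusTerm:
    attributes: list = field(default_factory=list)
    term: str = ""

# A title search expressed with the bib-1 Use attribute for Title.
title_search = AttributesPlusTerm(
    attributes=[AttributeElement(attribute_type=1, attribute_value=4)],
    term="sparse matrix",
)
```

A software-repository attribute set of the kind the text proposes would register analogous Use values for fields such as language, precision, or GAMS class.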

A central searchable database for the NSE will not scale to many thousands of users, nor will replication of a single monolithic database. The alternative is to partition search records into subject-area databases, and the search problem then becomes one of locating the appropriate search server. The concept of centroids has been proposed and discussed by the IETF URI Working Group [11]. The centroid idea involves a hierarchical organization of search servers, in which each server propagates upward the set of keywords it has indexed. Without further analysis, however, it is by no means certain that the centroid idea will scale.
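The upward propagation and downward routing at the heart of the centroid idea can be sketched as follows. The class names and the example servers are hypothetical; in particular, a real implementation would cache centroids rather than recompute them on every query, which is one of the scaling questions the text raises:

```python
# Sketch of centroid-based routing in a hierarchy of search servers:
# each server's centroid is the union of the keywords indexed by it
# and its descendants, and a query is routed down only into subtrees
# whose centroid contains the query keyword.  Hypothetical example;
# a real system would cache centroids instead of recomputing them.

class Server:
    def __init__(self, name, keywords=()):
        self.name = name
        self.keywords = set(keywords)
        self.children = []

    def add_child(self, child):
        self.children.append(child)

    def centroid(self):
        """Union of this server's keywords and all descendants'."""
        words = set(self.keywords)
        for child in self.children:
            words |= child.centroid()
        return words

    def route(self, keyword):
        """Names of leaf servers whose index contains the keyword."""
        if not self.children:
            return [self.name] if keyword in self.keywords else []
        hits = []
        for child in self.children:
            if keyword in child.centroid():
                hits += child.route(keyword)
        return hits

root = Server("root")
root.add_child(Server("math", {"eigenvalue", "quadrature"}))
root.add_child(Server("bio", {"genome", "sequence"}))
```

A query for "genome" is forwarded only to the bio subtree, so subtrees whose centroids miss the keyword are never contacted.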

The Harvest system provides a set of customizable tools for gathering information from diverse repositories, building topic-specific content indices, and widely replicating the indices [5]. A Gatherer collects indexing information from a service provider, while a Broker provides an indexed query interface to the gathered information. Harvest reduces server load and network traffic by using provider-site-resident indexing, content summaries instead of full text, and sharing of information between interconnected Gatherers and Brokers. It is proposed that a number of topic-specific Brokers be constructed. A distinguished Broker, called the Harvest Server Registry (HSR), which may be replicated, maintains information about all Harvest Gatherers and Brokers. It is suggested that the HSR be consulted when looking for an appropriate Broker to search. The Harvest system appears to offer a good framework for developing a search interface for the NSE. The Netlib Development Team is currently experimenting with Harvest, and a Harvest Broker that indexes a large number of NSE pages is accessible from the NSE home page.
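Harvest's traffic reduction comes from shipping small content summaries from Gatherer to Broker instead of full text. The sketch below produces such a summary, loosely following the shape of Harvest's SOIF records (a typed, braced block of attribute/value pairs keyed by URL); the field names, keyword heuristic, and URL are illustrative assumptions, not Harvest's actual summarizer:

```python
# Sketch of a Gatherer-style content summary, loosely modeled on the
# shape of Harvest's SOIF records.  Field names, the keyword
# heuristic, and the example URL are illustrative assumptions.

def make_summary(url: str, text: str, n_keywords: int = 5) -> str:
    """Summarize a document as a small attribute/value record so the
    Broker indexes a few lines instead of the full text."""
    words = sorted({w.lower().strip(".,;:") for w in text.split()
                    if len(w) > 4})          # crude keyword heuristic
    keywords = ",".join(words[:n_keywords])
    body = "Keywords: %s\nBytes: %d" % (keywords, len(text))
    return "@FILE { %s\n%s\n}" % (url, body)

summary = make_summary(
    "http://netlib.org/lapack/index",   # illustrative URL
    "Sparse matrix factorization routines for large problems",
)
```

Because each summary is a few lines regardless of document size, replicating a Broker's index among cooperating sites is far cheaper than replicating the documents themselves.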






Jack Dongarra
Sun Dec 18 14:22:28 EST 1994