NHSE Progress Report, 3/13/95 - 5/9/95

1. Distributed search based on Harvest

The newest version of Harvest is 1.2. The default for 1.2 is to use the sgmls SGML parser to parse and index HTML documents according to the HTML 2.0 DTD (Document Type Definition). Harvest 1.2 has been installed but is not yet in operational use at UT and Argonne. Harvest 1.1 has been installed at Syracuse. To use 1.2 with the default parser, HTML files must be valid HTML. Otherwise, you will need to substitute a more forgiving parser -- e.g., HTML.sum from Harvest 1.1.

Ideally all the major NHSE sites should install and run Harvest 1.2 gatherers, both the CRPC sites and others that contribute large amounts of material. For now, we could run one main NHSE broker that maintains a central index and provides a single search interface, replicating it if necessary. Eventually, we may want to have sub-domains of the NHSE run their own brokers, which would be linked together.

An alternative to using Harvest would be to start the Argonne Web Robot from the NHSE home page and use it to provide the search interface. We could do both and compare the two. We should compare the human effort (approximate person-hours) needed to construct the indexes, the bandwidth required to construct them, and search effectiveness.

2. Software submission

The software submission forms and scripts are ready to go at http://www.netlib.org/nse/software_submit/software_submit.html. Please send any final comments to browne@cs.utk.edu. We plan to start a trial run by the end of this week, contacting authors of selected software from the current NHSE software catalog and asking them to formally submit their software.

3. Semantic indexing

We are experimenting with using LSI (Latent Semantic Indexing) to provide an enhanced search interface to NHSE material. LSI uses the singular value decomposition (SVD) of the term-document matrix to construct a lower-rank approximation, or "concept space", in which the terms and documents are vectors.
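
For readers unfamiliar with LSI, the lower-rank approximation can be written out as follows. This is the usual textbook formulation, not necessarily the exact scaling used in our index; the symbols A, U, Sigma, V, k, q, and d_j are notation assumed here for illustration.

```latex
% Assumed notation: A is the m-by-n term-document matrix (m terms, n documents).
A = U \Sigma V^{T}
\qquad\Longrightarrow\qquad
A_k = U_k \Sigma_k V_k^{T}, \quad k \ll \min(m, n)

% Fold a query q (an m-vector of term weights) into the k-dimensional space:
\hat{q} = q^{T} U_k \Sigma_k^{-1}

% Rank document j by the cosine between \hat{q} and d_j, the j-th row of V_k:
\cos(\hat{q}, d_j) = \frac{\hat{q} \cdot d_j}{\lVert \hat{q} \rVert \, \lVert d_j \rVert}
```

Folding column j of A into the space with the same formula gives back row j of V_k, so queries and documents are compared in the same coordinates.
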
Queries are processed by finding nearby terms or documents. As a first step, we have constructed an LSI index to the HTML version of Parallel Computing Works. If you want to try it out, open http://rasp.cs.utk.edu and log in as

    Username: browne
    Password: shirley

(I'll probably be hung by my toenails for giving you that login). There is still some work to be done -- e.g., fixing the bug that causes a query containing parentheses to return nothing, and filtering out HTML tags and other junk so they don't appear as terms. Eventually we plan to integrate the LSI indexes for different sets of NHSE material with the manually constructed roadmap. Mike Berry and Todd Letsche at UT are helping with this.

4. Publicity and interaction with the digital library and software reuse communities

We have had the following papers related to the NHSE accepted:

"Location Independent Naming for Virtual Distributed Repositories", by Browne, Dongarra, Green, Moore, Pepin, Rowan, Wade, and Grosse. ACM-SIGSOFT Symposium on Software Reusability, Seattle, Apr 28-30, 1995.
http://www.netlib.org/srwn/srwn07.ps

"Digital Software and Data Repositories for Support of Scientific Computing", by Boisvert, Browne, Dongarra, and Grosse. Digital Libraries Forum, McLean, Va, May 16-17, 1995.
http://www.netlib.org/srwn/srwn09.ps

"Management of the National HPCC Software Exchange -- A Virtual Distributed Digital Library", by Browne, Dongarra, Kennedy, and Rowan. Digital Libraries 95, Austin, Texas, June 11-13, 1995.
http://www.netlib.org/srwn/srwn11.ps (currently being revised; a new version will be available May 11)

The following papers have been submitted:

"Distributed Information Management in the National HPCC Software Exchange", by Browne, Dongarra, Fox, Hawick, Kennedy, Stevens, Olson, and Rowan. Supercomputing 95.
http://www.netlib.org/srwn/srwn10.html

"Software Reuse in High Performance Computing", by Browne, Dongarra, Fox, Hawick, and Rowan. 7th Workshop on Institutionalizing Software Reuse (WISR 7), St.
Charles, Illinois, Aug 28-30, 1995.

Netlib officially became a member of the Reuse Library Interoperability Group (RIG) in March, and is representing the NHSE's interests as well. The RIG has been chartered by IEEE to develop standards for reuse library interoperability. The RIG meets every other month. Shirley Browne represented Netlib at the January and March meetings. The May RIG meeting is being held May 11-12 in Knoxville and is being hosted by Netlib.

Shirley Browne attended the ACM-SIGSOFT Symposium on Software Reusability in Seattle, April 28-30, 1995. She presented the paper listed above and also represented the RIG on a panel on software component description languages. The paper presentation was attended by about 60-80 people (it was in one of two parallel sessions that split the total SSR attendance of 160). Much of the audience was from industry. The NHSE was presented as the context and testbed for the location independent naming system being developed at UT. There was much interest in the NHSE, with several people asking more about it afterwards and at lunch. Shirley met informally with Martin Griss and Mark Simos. Simos asked about the possibility of using the NHSE as a testbed for his ODA (Open Domain Analysis) domain analysis system. Griss pointed out that the problem classification by Fox et al. is a start at domain analysis for high performance computing.

CNRI is leading an ARPA-funded project to develop the network infrastructure for a distributed digital library system (see http://www.cnri.reston.va.us). Barry Leiner of ARPA has expressed an interest in seeing software repositories merged with the digital libraries initiative. Robert Kahn and Bill Arms of CNRI are interested in seeing the naming system being developed at UT merged with CNRI's handle management system.

5. Equipment purchase

From Reed Wade, May 5:

    5 of the following, already ordered:
        SPARC 20, 176 MB memory, 6 GB of disk, floppy, CD drive

    Approximately 20 GB more disk and a 10-tape stacker to come soon.

6. Proposed NHSE framework

We expect to have two kinds of sites participating in the NHSE:

    1) contributor sites that make resources available on file servers,
    2) index sites that manage, catalog, and provide a search interface to the distributed collection.

Initially the CRPC sites will probably be the index sites. Only index sites would run the software submission server programs. Contributor sites would interact as clients via a WWW browser forms interface. Both contributor and index sites will need to have PGP installed to use for digital signatures and authentication.

We are undecided at this point as to the best way to organize HPCC tech reports. It might be best to have each sub-community within the NHSE set up its own tech report service, and to have these services interoperate. For this we could use Dienst, WATERS, UCSTRI, or Harvest software, but we should standardize on something.

    CSTR/Dienst   http://cs-tr.cs.cornell.edu/
    WATERS        http://www.cs.odu.edu/WATERS/WATERS-GS.html
    Harvest       http://harvest.cs.colorado.edu/brokers/cstech/query.html
    UCSTRI        http://www.cs.indiana.edu/cstr/search

The Dienst and WATERS approaches would require each contributing site to install and run the server software. The Harvest and UCSTRI approaches would only require contributing sites to register URLs with index sites.

Our plan right now for indexing informational HTML pages is to have each contributing site run a Harvest gatherer that collects and summarizes its own pages on a regular basis. A Harvest broker running at an indexing site (e.g., CalTech for computational chemistry) would then collect information from the individual gatherers and provide a search interface. The individual brokers would also be linked together to provide an overall search interface for the NHSE.

Because we want to collect usage statistics for the NHSE, each contributor site would be responsible for running a logging program and for reporting statistics to an index site, which would gather, summarize, and display the usage statistics.

So, to summarize, the following components would make up "repository-in-a-box" for the two kinds of sites:

    contributor site
    ----------------
    http or ftp server
    Harvest gatherer
    log/statistics reporter
    tech report server?
    PGP

    index site
    ----------
    Harvest broker with search interface
    indexing engine (Harvest works with Glimpse, freeWAIS, and commercial WAIS)
    software submission programs, software catalog
    usage statistics gathering/summarizing/display
    tech report server?
    PGP

[Note: where the tech report server goes depends on what software we choose to use, as described above]

In the future the following may be added to the index sites:

    index site
    ----------
    URC server (returns URC for URNs and LIFNs)
    location server (returns URLs for URNs and LIFNs)
    SGML parser/validator
    natural language or semantic indexer
    tools for classification and thesaurus development and use

7. NHSE reorganization

An alternative organization of the NHSE home page is being constructed that reorganizes the information (e.g., places all individual software packages under the NHSE Software Catalog) and tries to give the user more cues as to what we're about and where to find things. The new format will also have a What's New page to keep users coming back. It's currently under construction, but should be ready in another week or so, at which time we'll ask for comments.