NHSE Progress Report, 3/13/95 - 5/9/95

1.  Distributed search based on Harvest

    The newest version of Harvest is 1.2.  By default, 1.2 uses
    the sgmls SGML parser to parse and index HTML documents according
    to the HTML 2.0 DTD (Document Type Definition).
    Harvest 1.2 has been installed but is not yet in operational
    use at UT and Argonne.  Harvest 1.1 has been installed at Syracuse.
    To use 1.2 with the default parser, HTML files must be valid HTML.
    Otherwise, you will need to substitute a more forgiving parser --
    e.g., HTML.sum from Harvest 1.1.
    
    Ideally all the major NHSE sites, meaning the CRPC sites plus
    others that contribute large amounts of material, should install
    and run Harvest 1.2 gatherers.  For now, we could run one main NHSE
    broker that maintains a central index and provides a single
    search interface, replicating it if necessary.  Eventually,
    we may want to have sub-domains of the NHSE run their own
    brokers which would be linked together.

    An alternative to using Harvest would be to start the Argonne
    Web Robot from the NHSE home page and use it to provide the
    search interface.  We could do both, and compare the two.
    We should compare the human effort (approximate man-hours) needed
    to construct the indexes, the bandwidth required to construct them,
    and search effectiveness.


2.  Software submission

    The software submission forms and scripts are ready to go
    at http://www.netlib.org/nse/software_submit/software_submit.html.
    Please send any final comments to browne@cs.utk.edu.
    We plan to start a trial run, by contacting authors of selected
    software from the current NHSE software catalog and asking them
    to formally submit their software, by the end of this week.


3.  Semantic indexing

    We are experimenting with using LSI (Latent Semantic Indexing) to
    provide an enhanced search interface to NHSE material.  LSI uses
    the singular value decomposition (SVD) of the term-document matrix
    to construct a lower-rank approximation, or "concept space",
    in which the terms and documents are vectors.
    A query is projected into the same space and answered by finding
    nearby terms or documents.
    As a first step, we have constructed an LSI index to the
    HTML version of Parallel Computing Works.
    If you want to try it out, open http://rasp.cs.utk.edu
    and log in as
    
      Username: browne
      Password: shirley

    (I'll probably be hung by my toenails for giving you that login).

    There is still some work to be done -- e.g., fixing a bug that
    causes a query containing parentheses to return nothing, and
    filtering out HTML tags and other junk so they don't appear as terms.
    Eventually we plan to integrate the LSI indexes for different
    sets of NHSE material with the manually constructed roadmap.
    Mike Berry and Todd Letsche at UT are helping with this.
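
    The LSI approach described above can be sketched in a few lines.
    This is only an illustration, not the code behind the index at UT:
    it assumes Python with NumPy, raw term counts with no weighting,
    and cosine similarity for "nearby"; the function names (tokenize,
    build_lsi, query) are hypothetical.  It also strips HTML tags
    before counting terms, one simple way to keep tags out of the
    vocabulary.

```python
import re
from html.parser import HTMLParser

import numpy as np


class _TextOnly(HTMLParser):
    """Collects character data and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def tokenize(html):
    """Strip HTML tags, lowercase, and split into alphabetic terms."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.chunks).lower())


def build_lsi(docs, rank):
    """Return (vocab, U_k, s_k, doc_vectors) for a rank-k index."""
    tokens = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    row = {term: i for i, term in enumerate(vocab)}
    # Term-document matrix of raw counts: rows are terms, columns are documents.
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokens):
        for t in doc:
            A[row[t], j] += 1.0
    # Truncating the SVD to the top `rank` singular values gives the concept space.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :rank], s[:rank], Vt[:rank, :].T


def query(index, text, top=3):
    """Rank documents by cosine similarity to the query in concept space."""
    vocab, U_k, s_k, doc_vectors = index
    row = {term: i for i, term in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for t in tokenize(text):
        if t in row:
            q[row[t]] += 1.0
    # Fold the query into the concept space: q_hat = q^T U_k S_k^{-1}.
    # Assumes `rank` does not exceed the matrix rank, so s_k has no zeros.
    q_hat = (q @ U_k) / s_k
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_hat)
    scores = (doc_vectors @ q_hat) / np.where(norms == 0, 1.0, norms)
    order = np.argsort(-scores)[:top]
    return [(int(j), float(scores[j])) for j in order]
```

    A call such as query(build_lsi(docs, rank=2), "matrix") returns
    (document number, score) pairs, best match first.  Because the
    tokenizer keeps only alphabetic runs, punctuation such as
    parentheses in a query is ignored in this sketch.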
  
 
4.  Publicity and interaction with the digital library and
    software reuse communities

    We have had the following papers related to the NHSE accepted:

    "Location Independent Naming for Virtual Distributed Repositories",
    by Browne, Dongarra, Green, Moore, Pepin, Rowan, Wade, and Grosse,
    ACM-SIGSOFT Symposium on Software Reusability, Seattle, Apr 28-30,
    1995.  http://www.netlib.org/srwn/srwn07.ps.

    "Digital Software and Data Repositories for Support of Scientific
    Computing", by Boisvert, Browne, Dongarra, and Grosse.
    Digital Libraries Forum, McLean, Va, May 16-17, 1995.
    http://www.netlib.org/srwn/srwn09.ps.

    "Management of the National HPCC Software Exchange -- A Virtual
    Distributed Digital Library", by Browne, Dongarra, Kennedy, and Rowan.
    Digital Libraries 95, Austin, Texas, June 11-13, 1995.
    http://www.netlib.org/srwn/srwn11.ps (currently being revised;
    a new version will be available May 11).

    
    The following papers have been submitted:

    "Distributed Information Management in the National HPCC Software
    Exchange", by Browne, Dongarra, Fox, Hawick, Kennedy, Stevens, Olson,
    and Rowan, Supercomputing 95.  http://www.netlib.org/srwn/srwn10.html.

    "Software Reuse in High Performance Computing", by Browne, Dongarra,
    Fox, Hawick, and Rowan.  7th Workshop on Institutionalizing Software
    Reuse (WISR 7), St. Charles, Illinois, Aug 28-30, 1995.

    
    Netlib officially became a member of the Reuse Library Interoperability
    Group (RIG) in March, and is representing the NHSE's interests as well.
    The RIG has been chartered by IEEE to develop standards for reuse
    library interoperability.  The RIG meets every other month.
    Shirley Browne represented Netlib at the January and March
    meetings.  The May RIG meeting is being held May 11-12 in Knoxville
    and is being hosted by Netlib.

    Shirley Browne attended the ACM-SIGSOFT Symposium on Software
    Reusability in Seattle, April 28-30, 1995.  She presented the
    paper listed above and also represented the RIG on a panel on
    software component description languages.
    The paper presentation was attended by about 60-80 people
    (it was in one of two parallel sessions that split total SSR
    attendance of 160).  Much of the audience was from industry.
    The NHSE was presented as the context and testbed for the location
    independent naming system being developed at UT.
    There was much interest in the NHSE, with several people asking
    more about it afterwards and at lunch.
    Shirley met informally with Martin Griss and Mark Simos.  Simos
    asked about the possibility of using the NHSE as a testbed
    for his ODA (Open Domain Analysis) domain analysis system.
    Griss pointed out that Fox et al.'s problem classification
    is a start at domain analysis for high performance computing.

    CNRI is leading an ARPA-funded project to develop the network
    infrastructure for a distributed digital library system
    (see http://www.cnri.reston.va.us).
    Barry Leiner of ARPA has expressed an interest in seeing software
    repositories merged with the digital libraries initiative.
    Robert Kahn and Bill Arms of CNRI are interested in seeing
    the naming system being developed at UT merged with CNRI's
    handle management system.


5.  Equipment purchase

    From Reed Wade, May 5:

    5 of below--

    already ordered: sparc 20, 176Meg mem, 6 Gigs of disk, floppy, CD drive

    approximately 20 more gigs of disk and a 10 tape stacker
    to come soon


6.  Proposed NHSE framework

    We expect to have two kinds of sites participating in the NHSE:
    1) contributor sites that make resources available on file servers,
    2) index sites that manage, catalog, and provide a search interface
    to the distributed collection.  Initially the CRPC sites will
    probably be the index sites.

    Only index sites would run the software submission server programs.
    Contributor sites would interact as clients via a WWW browser
    forms interface.
    Both contributor and index sites will need to have PGP installed
    to use for digital signatures and authentication.

    We are undecided at this point as to
    the best way to organize HPCC tech reports.
    It might be best to have each
    sub-community within the NHSE set up its own tech report service, and
    to have these services interoperate.
    For this we could use Dienst, WATERS, UCSTRI, or Harvest software, 
    but we should standardize on something.

    CSTR/Dienst  http://cs-tr.cs.cornell.edu/
    WATERS       http://www.cs.odu.edu/WATERS/WATERS-GS.html
    Harvest      http://harvest.cs.colorado.edu/brokers/cstech/query.html
    UCSTRI       http://www.cs.indiana.edu/cstr/search

    The Dienst and WATERS approaches would require each contributing
    site to install and run the server software.  The Harvest and UCSTRI
    approaches would only require contributing sites to register
    URLs with index sites.

    Our plan right now for indexing informational HTML pages is to
    have each contributing site run a Harvest gatherer that collects
    and summarizes its own pages on a regular basis.  A Harvest
    broker running at an index site (e.g., Caltech for computational
    chemistry) would then collect information from the individual
    gatherers and provide a search interface.  The individual
    brokers would also be linked together to provide an overall
    search interface for the NHSE.
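
    The division of labor between gatherers and brokers can be
    pictured with a toy model.  The sketch below is plain Python and
    does not use Harvest's actual interfaces or summary format; the
    gather function, the Broker class, and the example.org host names
    are all hypothetical.  It shows each site summarizing its own
    pages, a sub-domain broker indexing those summaries, and a
    top-level broker built by collecting from the sub-domain brokers.

```python
from collections import defaultdict


def gather(site, pages):
    """Summarize one site's pages as {URL: set of lowercase words}."""
    return {
        f"http://{site}{path}": set(text.lower().split())
        for path, text in pages.items()
    }


class Broker:
    """Merges summaries from gatherers (or other brokers) into one index."""

    def __init__(self):
        self.index = defaultdict(set)  # word -> set of URLs

    def collect(self, summaries):
        """Add a batch of {URL: words} summaries to the inverted index."""
        for url, words in summaries.items():
            for word in words:
                self.index[word].add(url)

    def export(self):
        """Re-expose the index as summaries so another broker can collect it."""
        out = defaultdict(set)
        for word, urls in self.index.items():
            for url in urls:
                out[url].add(word)
        return dict(out)

    def search(self, word):
        """Return the sorted URLs whose summaries contain the word."""
        return sorted(self.index.get(word.lower(), ()))
```

    In this model a sub-domain broker calls collect(gather(...)) for
    each contributor site, and the overall NHSE broker calls
    collect(b.export()) for each sub-domain broker b, so a single
    search covers every linked sub-domain.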

    Because we want to collect usage statistics for the NHSE, each
    contributor site would be responsible for running a logging program
    and for reporting statistics to an index site, which would
    gather, summarize, and display them.
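
    As one possible shape for the contributor-side logging program,
    the sketch below summarizes an HTTP server access log.  It assumes
    the server writes Common Log Format lines; the summarize function
    and the fields it reports are hypothetical, and how the summary is
    sent to the index site is left open.

```python
import re
from collections import Counter

# Assumed line layout:
# host ident user [timestamp] "METHOD path protocol" status bytes
CLF = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)$'
)


def summarize(lines):
    """Count successful retrievals per path, distinct hosts, and bytes sent."""
    hits = Counter()
    hosts = set()
    total_bytes = 0
    for line in lines:
        m = CLF.match(line.strip())
        if not m:
            continue  # skip lines that are not in the assumed format
        host, _when, _method, path, status, size = m.groups()
        if status != "200":
            continue  # count only successful retrievals
        hits[path] += 1
        hosts.add(host)
        total_bytes += int(size) if size.isdigit() else 0
    return {
        "hits": dict(hits),
        "distinct_hosts": len(hosts),
        "bytes": total_bytes,
    }
```

    A site could run this over each day's log and report the resulting
    dictionary; the index site would then only need to add up the
    per-site summaries.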

    So, to summarize, the following components would make up a
    "repository-in-a-box" for the two kinds of sites:

    contributor site
    ----------------
    http or ftp server
    Harvest gatherer
    log/statistics reporter
    tech report server?
    PGP

    index site
    ----------
    Harvest broker with search interface
    indexing engine (Harvest works with glimpse, freeWAIS, and commercial WAIS)
    software submission programs, software catalog
    usage statistics gathering/summarizing/display
    tech report server?
    PGP

    [Note: where the tech report server goes depends on what software we
    choose to use, as described above]

    In the future the following may be added to the index sites:

    index site
    ----------
    URC server (returns URC for URNs and LIFNs)
    location server (returns URLs for URNs and LIFNs)
    SGML parser/validator
    natural language or semantic indexer
    tools for classification and thesaurus development and use


7.  NHSE reorganization

    An alternative NHSE home page is being constructed.  It
    reorganizes the information (e.g., places
    all individual software packages under the NHSE Software
    Catalog) and tries to give
    the user more cues as to what we're about and where to find things.
    The new format will also have a What's New page to keep
    users coming back.
    It's currently under construction, but should be ready
    in another week or so, at which time we'll ask for comments.