Network-enabled Solvers and the NetSolve Project
by H. Casanova and J. J. Dongarra
Scientific computing has been a major part of research centers and industry for many years. The demands of complex scientific problem-solving have led to the development of numerous and diverse software tools.
Some numerical tools, such as MATLAB or Mathematica, have enjoyed great success. They generally provide an interactive interface as well as the possibility of writing scripts to perform computation.
Another class of tools falls under the category of numerical libraries. Numerical libraries are less convenient than interactive tools because the user is generally required to write a C or Fortran program. Nevertheless, they offer several advantages; a large number of such libraries exists, and they cover diverse fields of computational science. Moreover, unlike interactive tools, numerical libraries are often freely available and can be downloaded directly from the World Wide Web.
A third class of tools comprises runtime packages whose goal is to help the user perform some specific type of built-in scientific computation. Like numerical libraries, these packages are usually freely available. One example is the NEOS project, which is focused on linear programming and optimization. However, such tools are not yet well established, and the user often needs numerical facilities that are outside the scope of the tools.
Users wanting to solve a numerical problem are thus confronted with a dilemma. They can purchase an expensive commercial product and take the risk that it might not be suitable for future use on different kinds of problems, or they can try to locate and download free libraries and write programs in terms of specific functions or subroutines. In this article, we discuss the latter situation and describe how the NetSolve system helps researchers use such free libraries easily and effectively.
What is NetSolve?
NetSolve is a client-server, agent-based application designed to solve computational science problems over a network. A number of different interfaces have been incorporated within the NetSolve software so that users of C, Fortran, MATLAB, or the Web can easily use the NetSolve system. The underlying computational software can be any scientific package, thereby ensuring good performance results and great flexibility and extensibility. Moreover, NetSolve uses a load-balancing strategy to improve the use of the computational resources available.
How does NetSolve help in using free libraries?
The user who decides to use free libraries must first look for the appropriate library or set of libraries needed for the specific computational problem. Usually, such libraries can be found in software repositories. A well-known repository, for example, is Netlib, which is maintained through the collaborative effort of several institutions and universities. Software repositories present some intrinsic difficulties for the inexperienced user: they are generally very large, and they contain very different types of libraries.
Once the appropriate library has been located, it must be downloaded and installed. Depending on the nature of the software, this step might be nontrivial, especially for a user not used to this kind of task.
The biggest steps still remain: learning how to use the library itself and learning how to write a program in terms of its components. These tasks can be formidable and time-consuming (without even mentioning the debugging phase).
Such considerations motivated the establishment of the NetSolve project.
Figure 1: NetSolve's organization
Is NetSolve complicated to use?
NetSolve is a client-server network-based system. We can distinguish three main paradigms for such systems: proxy computing, code shipping, and remote computing. These paradigms differ in the way they handle the user's data and the program that operates on this data. In proxy computing, the data and the program reside on the user's machine and are both sent to a server that runs the code on the data and returns the result. In code shipping, the program resides on the server and is downloaded to the user's machine, where it operates on the data and generates the result on that machine. This is the paradigm used by Java applets within Web browsers, for example. In the third paradigm, remote computing, the program resides on the server. The user's data is sent to the server, where the programs or numerical libraries operate on it; the result is then sent back to the user's machine. NetSolve uses the third paradigm.
NetSolve provides the user with pools of computational resources. These resources are, in fact, computational servers that provide run-time access to arbitrary numerical libraries. The NetSolve computational servers have the following abilities:
To make the implementation of such a computational server model possible, we have designed a machine-independent, general way of describing a numerical computation, as well as a set of tools to generate new computational modules as easily as possible. The main component of this framework is a descriptive language that is used to describe each separate numerical functionality of a computational server. Files written in this language can be compiled by NetSolve into actual computational modules executable on any UNIX platform. NetSolve also includes a Java applet to easily generate description files. The Java applet can be used by anyone on the Internet to create new computational resources. This framework also allows increased collaboration between research teams. Indeed, description files need to be generated only once and can be reused in a machine-independent manner to set up new computational resources anywhere on the Net.
The user can use one of the different NetSolve client interfaces to send requests to the NetSolve computational servers. The user requests are not sent directly to the computational resources, however; instead, they are processed by another component of the system, a NetSolve agent. The agent decides which computational server will be assigned the user requests. Thus, the agent is really the mastermind behind the whole NetSolve strategy, and the efficiency of the system depends entirely on its decisions. Figure 1 shows this organization.
One of the roles of the agent in the NetSolve system is to perform load balancing among the different computational resources. NetSolve is inherently a multirequest system. Several users can compete for the resources by contacting the same agent or different agents managing the same pool of resources. Alternatively, a single user can send multiple asynchronous requests at once (as we will see in the description of the user interfaces). For each incoming request, the NetSolve agent chooses a computational server where the numerical computation will be performed. For each server, the agent can use information contained in the user request (e.g., type of computation, size of the problem), static information about the server (e.g., speed of the host, numerical server available), predictions about the workload of the server's host, and the distance to the server's host over the network. These different pieces of information are then combined to obtain an estimate of the time required to process the user request on each computational server (including network time and CPU time). For each request, the NetSolve agent sorts the appropriate computational servers according to these estimated times and processes the request accordingly.
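As a rough illustration of the ranking step described above, the following Python sketch combines request information, static server information, a workload prediction, and network distance into one time estimate per server, then sorts the candidates. It is not NetSolve's actual cost model: the Server fields, the function names, and the assumption that a loaded host delivers a (1 - load) share of its peak speed are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # All fields are illustrative assumptions, not NetSolve data structures.
    name: str
    mflops: float          # static: peak speed of the host, in Mflop/s
    load: float            # predicted workload, 0.0 (idle) to just under 1.0
    latency_s: float       # network latency to the host, in seconds
    bandwidth_mb_s: float  # network bandwidth to the host, in MB/s
    problems: frozenset    # numerical problems this server can solve


def estimate_seconds(server, data_mb, mflop):
    """Estimated network time plus CPU time for one request."""
    network = server.latency_s + data_mb / server.bandwidth_mb_s
    # Assumption: a loaded host delivers only a (1 - load) share of its peak.
    cpu = mflop / (server.mflops * (1.0 - server.load))
    return network + cpu


def rank_servers(servers, problem, data_mb, mflop):
    """Return the servers able to solve `problem`, fastest estimate first."""
    able = [s for s in servers if problem in s.problems]
    return sorted(able, key=lambda s: estimate_seconds(s, data_mb, mflop))
```

Under this toy model, a fast but distant host wins for compute-heavy requests with little data, while a slower nearby host wins when transfer time dominates.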
Where can NetSolve be used?
The different hosts that participate in the NetSolve protocol can be anywhere on the Internet. In fact, they can be administered by different institutions. NetSolve does not assume any centralized control over the different hosts in the system. On the contrary, each process (computational server or agent) is an independent entity: it can be stopped and restarted safely at any time, without jeopardizing the integrity of the system. The flexibility of this approach does, however, require that NetSolve implement some kind of fault-tolerance mechanism. Indeed, any resource can become unreachable at any moment, perhaps because of a network failure, a host failure, or simply a system administrator rebooting a host.
NetSolve can also be used on an intranet, inside a research department or a university, without participating in any Internet computation. Even though such a setting is more stable than an Internet-based NetSolve configuration, fault tolerance is still required.
Currently, NetSolve uses the following strategy for fault tolerance. The NetSolve system ensures that a user request will be completed unless every single resource has failed. When a client sends a request to a NetSolve agent, it receives a sorted list of computational servers to try. When one of these servers has been successfully contacted, the numerical computation starts. If the contacted server fails during the computation, another server is contacted, and the computation is restarted. This whole process is transparent to the user. If all the servers fail, the user is notified that the computation cannot be performed at this time. This simple fault-tolerant approach will be improved in a future version of the NetSolve software.
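The retry strategy just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: ServerFailure, solve_with_failover, and the run_on_server callback are hypothetical names, not part of the NetSolve client library.

```python
class ServerFailure(Exception):
    """Hypothetical error: a server is unreachable or died mid-computation."""


def solve_with_failover(ranked_servers, run_on_server, request):
    """Try each server in the agent's order, restarting from scratch on failure."""
    for server in ranked_servers:
        try:
            return run_on_server(server, request)
        except ServerFailure:
            continue  # hidden from the user: move on to the next server
    # Only reached when every server in the list has failed.
    raise RuntimeError("the computation cannot be performed at this time")
```

Restarting the whole computation on the next server keeps the client simple, at the cost of losing any partial work done by the failed server.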
What interfaces does NetSolve provide?
A major concern in designing NetSolve was to provide several interfaces in order to target a wide range of users. Currently, NetSolve provides Application Program Interfaces (APIs) as well as higher-level interfaces: C, Fortran, and Java APIs are already available, as well as a MATLAB interface and a graphical Java interface. Another concern was keeping the interfaces as simple as possible. For example, the MATLAB interface contains only two functions that allow users to submit problems to the NetSolve system. Every interface provides asynchronous calls to NetSolve in addition to traditional synchronous calls. When several asynchronous requests are sent to a NetSolve agent, they are dispatched among the available computational resources according to the load-balancing schemes implemented by the agent. Hence, the user can achieve coarse-grained parallelism with virtually no effort, either from a program or from interaction with a high-level interface. The interfaces are described in detail in the "NetSolve's Client User's Guide," and a brief example is given in Table 1.
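To make the asynchronous pattern concrete, here is a hedged Python sketch that uses the standard concurrent.futures module in place of a NetSolve client. The function solve_2x2 is a local stand-in for a remote solve of Ax = b, and submit_all is a hypothetical helper; neither is a NetSolve call.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_2x2(a, b):
    """Stand-in for a remote solve of Ax = b, computed here by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    x1 = (b[0] * a22 - a12 * b[1]) / det
    x2 = (a11 * b[1] - a21 * b[0]) / det
    return (x1, x2)


def submit_all(systems, max_workers=4):
    """Send every request without blocking, then collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = [pool.submit(solve_2x2, a, b) for a, b in systems]
        # submit() returns at once; result() blocks until that request is done.
        return [f.result() for f in pending]
```

The caller issues all requests first and waits afterwards, which is what lets independent problems run side by side on different resources.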
How can users obtain NetSolve?
Figure 2: NetSolve-enabling global collaboration
To allow users to try out NetSolve as soon as they download the client distribution, we maintain a pool of computational servers at the University of Tennessee (as well as in some other places). These servers can solve various numerical problems in several fields, including linear algebra, fast Fourier transform, optimization, and curve fitting. The numerical functionalities are constantly increasing, and new servers are being started as the demand increases.
Currently, we have released version 1.0 of NetSolve (both clients and servers). The NetSolve homepage, located at http://www.cs.utk.edu/netsolve, contains detailed information and source code.
NetSolve is a continuing project, and several research issues are under investigation for the next release. One of the main improvements that we would like to make to the paradigm is to allow dynamic software-hardware computational resource binding. In the present version of the software, computational servers are given access to computational software and started on a host. New numerical functionalities can be added to the server, but this decision has to be made by the NetSolve administrator. We envision a system where a server (or agent), upon receiving a request for an unknown numerical computation, could contact a well-established software repository and download the appropriate code to perform the computation. Software resources, hardware resources, and data resources could be dynamically bound yet transparent to the user. The Netlib repository seems to be the natural choice for this revolutionary system. This new paradigm will require addressing several issues, such as security, software caching mechanisms, and software authentication. Success in this endeavor would, however, represent a breakthrough in global metacomputing and collaboration.
Table 1: Solving a Linear System Ax=b with NetSolve