Fault tolerance is an important issue in any loosely connected distributed system such as NetSolve. The failure of one or more components should not cause a catastrophic failure of the whole system. Moreover, such a failure should generate as few side effects as possible, and the resulting drop in performance should be kept to a minimum. We tried to make NetSolve as fault tolerant as possible.
A first aspect of fault tolerance in NetSolve lies at the server level. A NetSolve server (resource or instance of the agent) can be stopped at any time and safely restarted at any time. Every NetSolve server is in fact an independent entity. This ensures that the NetSolve system remains coherent after any kind of network or machine problem. In the installation of NetSolve at the University of Tennessee, the whole system is managed by a 'cron' job, and servers are restarted automatically after machines go down, for instance for backups.
Another aspect of fault tolerance is minimizing the side effects of failures. To this end, we designed the client-server protocol as follows. When the NetSolve agent receives a request for a problem to be solved, it sends back a list of computational servers sorted from the most to the least suitable. The client tries the servers in sequence until one accepts the problem. This strategy spares the client from sending multiple requests to the agent for the same problem when some of the computational servers are stopped. If the client reaches the end of the list and no server has answered, it asks the agent for another list. Since the client has reported all the failures it encountered, it receives a different list.
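The client-side behavior described above can be sketched as follows. This is a minimal illustration, not the NetSolve implementation: the names `submit_with_failover`, `ask_agent`, and `try_server` are hypothetical, servers are represented as plain strings, and the `max_rounds` bound on the number of lists requested is an assumption added so that the loop always terminates.

```python
from typing import Callable, List, Set


class NoServerAvailable(Exception):
    """Raised when no computational server accepts the problem."""


def submit_with_failover(
    ask_agent: Callable[[Set[str]], List[str]],
    try_server: Callable[[str], bool],
    max_rounds: int = 3,  # assumption: the text does not state a limit
) -> str:
    """Return the name of the first server that accepts the problem.

    ask_agent(failed) stands in for the agent: it receives the failures
    reported so far and returns servers sorted from most to least suitable.
    try_server(name) stands in for contacting one server: it returns True
    if that server accepts the problem.
    """
    failed: Set[str] = set()
    for _ in range(max_rounds):
        ranked = ask_agent(failed)
        if not ranked:
            break  # the agent has nothing left to propose
        for server in ranked:
            if try_server(server):
                return server
            # Record the failure so the next list from the agent differs.
            failed.add(server)
    raise NoServerAvailable(
        f"no server accepted the problem; failed: {sorted(failed)}"
    )
```

In this sketch the agent is contacted once per list rather than once per failed server, which is the point of returning a sorted list in the first place.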
Once the connection has been established with a computational server, there is still no guarantee that the problem will be solved. The computational process on the remote host may die for some reason. In that case, the client detects the failure and sends the problem to another available computational server. This process is transparent to the user but, of course, lengthens the execution time. The problem is migrated among the possible computational servers until it is solved or no server remains.
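The migration loop can be illustrated with a short sketch. It assumes that a failed remote computation surfaces on the client as an exception; the names `ServerFailure`, `solve_with_migration`, and `run_on` are hypothetical and do not come from NetSolve itself.

```python
from typing import Any, Callable, List, Sequence, Tuple


class ServerFailure(Exception):
    """Hypothetical signal that the remote computation died."""


def solve_with_migration(
    problem: Any,
    servers: Sequence[str],
    run_on: Callable[[str, Any], Any],
) -> Tuple[Any, List[str]]:
    """Run the problem on each server in turn until one returns a result.

    Returns the result together with the list of servers that were tried,
    so the cost of migration (the extra attempts) is visible to the caller.
    """
    attempted: List[str] = []
    for server in servers:
        attempted.append(server)
        try:
            return run_on(server, problem), attempted
        except ServerFailure:
            # The failure is handled here, so the caller never sees it;
            # the only visible effect is the longer execution time.
            continue
    raise RuntimeError(f"no server remains; tried {attempted}")
```

The caller sees either a result or a single final error, which mirrors the transparency described above.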