
Failure Robustness

Another aspect of fault tolerance is minimizing the side effects of failures. To this end, we designed the client-server protocol as follows. When the NetSolve agent receives a request for a problem to be solved, it sends back a list of computational servers sorted from the most to the least suitable. The client tries the servers in sequence until one accepts the problem. This strategy spares the client from sending multiple requests to the agent for the same problem when some of the computational servers are stopped. If the client reaches the end of the list and no server has answered, it asks the agent for another list. Since the client has reported all these failures, it will receive a different list.
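
The server-selection loop described above can be sketched as follows. This is only an illustration in Python, not NetSolve's actual client code: the names `find_accepting_server`, `request_list` and `try_server` are hypothetical, and the `max_rounds` bound is an assumption added so that the sketch always terminates.

```python
from typing import Callable, List, Optional, Set


def find_accepting_server(
    request_list: Callable[[Set[str]], List[str]],
    try_server: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first server that accepts the problem, or None.

    request_list stands in for the agent: given the failures reported so
    far, it returns candidate servers sorted from most to least suitable.
    try_server stands in for the connection attempt to one server.
    """
    failed: Set[str] = set()  # failures the client has reported to the agent
    for _ in range(max_rounds):  # max_rounds is an assumption of this sketch
        candidates = request_list(failed)
        if not candidates:
            return None  # the agent has nothing left to offer
        for server in candidates:  # try the servers in the agent's order
            if try_server(server):
                return server
            failed.add(server)  # remember the failure for the next request
    return None
```

The agent is contacted once per list rather than once per failed server, which is the point of returning a sorted list in the first place.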

Once the connection has been established with a computational server, there is still no guarantee that the problem will be solved. The computational process on the remote host can die for some reason. In that case, the client detects the failure and sends the problem to another available computational server. This process is transparent to the user but, of course, lengthens the execution time. The problem is migrated between the possible computational servers until it is solved or no server remains.
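
The migration of a problem between servers can be sketched in the same spirit. Again, this is a minimal Python illustration under stated assumptions, not the NetSolve implementation: `ServerFailure`, `solve_with_migration` and `run_on` are hypothetical names, and how the client actually detects a dead computational process is not modelled here.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ServerFailure(Exception):
    """Hypothetical signal that the remote computational process died."""


def solve_with_migration(
    problem: Any,
    servers: Iterable[str],
    run_on: Callable[[str, Any], Any],
) -> Tuple[Any, List[str]]:
    """Run the problem on each server in turn until one returns a result.

    Returns the result together with the list of servers that were tried.
    Raises RuntimeError when no server remains.
    """
    tried: List[str] = []
    for server in servers:
        tried.append(server)
        try:
            return run_on(server, problem), tried
        except ServerFailure:
            continue  # migrate the problem to the next available server
    raise RuntimeError("no server remains")
```

The caller sees only the final result or the final error; each failed attempt costs time but is otherwise invisible, which matches the transparency described above.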



Joint Institute for Computational Science
Mon Apr 29 13:00:40 EDT 1996