Taking Failures into Account

When a failure occurs, the instances of the agent update their view of the NetSolve system. Each instance keeps track of the status of the remote hosts (reachable or unreachable) and of the NetSolve servers on these hosts (running, stopped, or failed). When a host has been unreachable, or a NetSolve server has been stopped, for more than 24 hours, the agent instances erase the corresponding entry from their view of the NetSolve system.
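The bookkeeping described above can be sketched as follows. This is only an illustration in Python, not NetSolve's actual code: the names AgentView, ServerEntry, HostStatus, ServerStatus, report and prune are hypothetical, and timestamps are passed in explicitly (in seconds) so that the 24-hour rule is easy to follow. The sketch assumes the 24-hour clock restarts once a host becomes reachable and its server is running again, which the text does not state.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

EXPIRY_SECONDS = 24 * 60 * 60  # the 24-hour limit mentioned in the text


class HostStatus(Enum):
    REACHABLE = "reachable"
    UNREACHABLE = "unreachable"


class ServerStatus(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"


@dataclass
class ServerEntry:
    """Hypothetical record for one host and its NetSolve server."""
    host_status: HostStatus = HostStatus.REACHABLE
    server_status: ServerStatus = ServerStatus.RUNNING
    # Time at which the host became unreachable or the server stopped.
    bad_since: Optional[float] = None


class AgentView:
    """One agent instance's view of the system (illustrative only)."""

    def __init__(self):
        self.entries = {}

    def report(self, host, host_status, server_status, now):
        """Record the latest observed status for a host."""
        entry = self.entries.setdefault(host, ServerEntry())
        entry.host_status = host_status
        entry.server_status = server_status
        expiring = (host_status is HostStatus.UNREACHABLE
                    or server_status is ServerStatus.STOPPED)
        if not expiring:
            entry.bad_since = None   # assumption: recovery resets the clock
        elif entry.bad_since is None:
            entry.bad_since = now    # start the 24-hour clock

    def prune(self, now):
        """Erase entries that have been unreachable or stopped for over 24 hours."""
        expired = [host for host, entry in self.entries.items()
                   if entry.bad_since is not None
                   and now - entry.bad_since > EXPIRY_SECONDS]
        for host in expired:
            del self.entries[host]
        return expired
```

In this sketch an agent instance would call report whenever it learns something about a host and call prune periodically. An entry whose host recovers before the 24 hours elapse is kept.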

The agent also keeps track of the number of failures encountered when using a computational server. Once this number reaches a set limit, the agent removes the corresponding entry. As a result, a poorly implemented computational server, for instance one that calls a library incorrectly, will eventually disappear from the system.
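The failure-count rule can be illustrated with a short Python sketch. The class FailureTracker, its method names, and the default limit of 3 are assumptions made for this example; the text does not give the actual limit value or describe how NetSolve stores the counters.

```python
class FailureTracker:
    """Counts failures per computational server (illustrative only)."""

    def __init__(self, limit=3):
        # The real limit value is not given in the text; 3 is a placeholder.
        self.limit = limit
        self.failures = {}

    def register(self, server):
        """Add a server to the view with a failure count of zero."""
        self.failures.setdefault(server, 0)

    def record_failure(self, server):
        """Count one failure; return True if the entry was removed."""
        if server not in self.failures:
            return False  # unknown or already removed
        self.failures[server] += 1
        if self.failures[server] >= self.limit:
            del self.failures[server]  # the server disappears from the view
            return True
        return False
```

With the placeholder limit of 3, a registered server stays in the view after its first two failures and is removed on the third.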



Joint Institute for Computational Science
Mon Apr 29 13:00:40 EDT 1996