References


1: J. N. C. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and P. Stephan. DOME: Parallel programming in a distributed computing environment. April 1996.
2: D. E. Bakken and R. D. Schlichting. Supporting fault-tolerant parallel programming in Linda. IEEE Transactions on Parallel and Distributed Systems, 6(3):287-302, March 1995.
3: M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures. pages 245-254, April 1994.
4: A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, Feb 1989.
5: J. Casas, D. Clark, R. Konuru, S. Otto, R. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171-216, Spring 1995.
6: J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5:173-184, 1996.
7: F. Cristian and F. Jahanian. A timestamp-based checkpointing protocol for long-lived distributed computations. In 10th Symposium on Reliable Distributed Systems, pages 12-20, October 1991.
8: D. Cummings and L. Alkalaj. Checkpoint/rollback in a distributed system using coarse-grained dataflow. In 24th International Symposium on Fault-Tolerant Computing, pages 424-433, June 1994.
9: E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In 11th Symposium on Reliable Distributed Systems, pages 39-47, October 1992.
10: A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994.
11: D. Gelernter and D. Kaminsky. Supercomputing out of recycled garbage. pages 417-427, June 1992.
12: G. A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, University of California, Berkeley, CA, December 1990.
13: G. A. Gibson, L. Hellerstein, R. M. Karp, and D. A. Patterson. Failure correction techniques for large disk arrays. pages 123-132, April 1989.
14: K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33(6):518-528, June 1984.
15: D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. Journal of Algorithms, 11(3):462-491, September 1990.
16: Y. Kim. Fault Tolerant Matrix Operations for Parallel and Distributed Systems. PhD thesis, The University of Tennessee, Knoxville, TN, August 1996.
17: Y. Kim, J. S. Plank, and J. J. Dongarra. Fault tolerant matrix operations using checksum and reverse computation. In The 6th Symposium of The Frontiers of Massively Parallel Computation, pages 70-77, Annapolis, MD, October 1996.
18: T. H. Lai and T. H. Yang. On distributed snapshots. Information Processing Letters, 25:153-158, May 1987.
19: K. Li, J. F. Naughton, and J. S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874-879, August 1994.
20: W. Peterson and E. J. Weldon. Error-Correcting Codes. MIT Press, Cambridge, MA, second edition, 1972.
21: J. S. Plank, J. Friedman, and K. Li. A failure correction technique for parallel storage devices with minimal device overhead. Technical Report CS-94-243, University of Tennessee, August 1994.
22: J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In The 25th International Symposium on Fault-Tolerant Computing, pages 351-360, Pasadena, CA, June 1995.
23: J. S. Plank, Y. Kim, and J. J. Dongarra. Fault tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing (To appear), June 1997.
24: J. S. Plank and K. Li. Ickp: A consistent checkpointer for multicomputers. IEEE Parallel & Distributed Technology, 2(2):62-67, Summer 1994.
25: J. Pruyne and M. Livny. Parallel processing on dynamic resources with CARMI. April 1995.
26: S. Roman. Coding and Information Theory. Springer-Verlag, 1992.
27: A. Roy-Chowdhury and P. Banerjee. Algorithm-based fault location and recovery for matrix computations. In 24th International Symposium on Fault-Tolerant Computing, pages 38-47, Austin, TX, June 1994.
28: L. M. Silva, J. G. Silva, S. Chapple, and L. Clarke. Portable Checkpointing and Recovery. 1995.
29: M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, MA, 1996.
30: G. Stellner. CoCheck: Checkpointing and process migration for MPI. April 1996.
31: R. E. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3):204-226, August 1985.

Jack Dongarra
Thu Feb 20 21:38:16 EST 1997