
References



1
J. N. C. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and P. Stephan. DOME: Parallel programming in a distributed computing environment. April 1996.

2
D. E. Bakken and R. D. Schlichting. Supporting fault-tolerant parallel programming in Linda. IEEE Transactions on Parallel and Distributed Systems, 6(3):287-302, March 1995.

3
M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: An optimal scheme for tolerating double disk failures in RAID architectures. pages 245-254, April 1994.

4
A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle. Fault tolerance under UNIX. ACM Transactions on Computer Systems, 7(1):1-24, Feb 1989.

5
J. Casas, D. Clark, R. Konuru, S. Otto, R. Prouty, and J. Walpole. MPVM: A migration transparent version of PVM. Computing Systems, 8(2):171-216, Spring 1995.

6
J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C. Whaley. The design and implementation of the ScaLAPACK LU, QR, and Cholesky factorization routines. Scientific Programming, 5:173-184, 1996.

7
F. Cristian and F. Jahanian. A timestamp-based checkpointing protocol for long-lived distributed computations. In 10th Symposium on Reliable Distributed Systems, pages 12-20, October 1991.

8
D. Cummings and L. Alkalaj. Checkpoint/rollback in a distributed system using coarse-grained dataflow. In 24th International Symposium on Fault-Tolerant Computing, pages 424-433, June 1994.

9
E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In 11th Symposium on Reliable Distributed Systems, pages 39-47, October 1992.

10
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA, 1994.

11
D. Gelernter and D. Kaminsky. Supercomputing out of recycled garbage. pages 417-427, June 1992.

12
G. A. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, University of California, Berkeley, CA, December 1990.

13
G. A. Gibson, L. Hellerstein, R. M. Karp, and D. A. Patterson. Failure correction techniques for large disk arrays. pages 123-132, April 1989.

14
K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33(6):518-528, June 1984.

15
D. B. Johnson and W. Zwaenepoel. Recovery in distributed systems using optimistic message logging and checkpointing. Journal of Algorithms, 11(3):462-491, September 1990.

16
Y. Kim. Fault Tolerant Matrix Operations for Parallel and Distributed Systems. PhD thesis, The University of Tennessee, Knoxville, TN, August 1996.

17
Y. Kim, J. S. Plank, and J. J. Dongarra. Fault tolerant matrix operations using checksum and reverse computation. In The 6th Symposium on the Frontiers of Massively Parallel Computation, pages 70-77, Annapolis, MD, October 1996.

18
T. H. Lai and T. H. Yang. On distributed snapshots. Information Processing Letters, 25:153-158, May 1987.

19
K. Li, J. F. Naughton, and J. S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874-879, August 1994.

20
W. Peterson and E. J. Weldon. Error-Correcting Codes. MIT Press, Cambridge, MA, second edition, 1972.

21
J. S. Plank, J. Friedman, and K. Li. A failure correction technique for parallel storage devices with minimal device overhead. Technical Report CS-94-243, University of Tennessee, August 1994.

22
J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In The 25th International Symposium on Fault-Tolerant Computing, pages 351-360, Pasadena, CA, June 1995.

23
J. S. Plank, Y. Kim, and J. J. Dongarra. Fault tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing (To appear), June 1997.

24
J. S. Plank and K. Li. Ickp: A consistent checkpointer for multicomputers. IEEE Parallel & Distributed Technology, 2(2):62-67, Summer 1994.

25
J. Pruyne and M. Livny. Parallel processing on dynamic resources with CARMI. April 1995.

26
S. Roman. Coding and Information Theory. Springer-Verlag, 1992.

27
A. Roy-Chowdhury and P. Banerjee. Algorithm-based fault location and recovery for matrix computations. In 24th International Symposium on Fault-Tolerant Computing, pages 38-47, Austin, TX, June 1994.

28
L. M. Silva, J. G. Silva, S. Chapple, and L. Clarke. Portable Checkpointing and Recovery. 1995.

29
M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. J. Dongarra. MPI: The Complete Reference. MIT Press, Cambridge, MA, 1996.

30
G. Stellner. CoCheck: Checkpointing and process migration for MPI. April 1996.

31
R. E. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3):204-226, August 1985.


Jack Dongarra
Thu Feb 20 21:38:16 EST 1997