References
- 1
-
J. N. C. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey, and
P. Stephan.
DOME: Parallel programming in a distributed computing environment.
April 1996.
- 2
-
D. E. Bakken and R. D. Schlichting.
Supporting fault-tolerant parallel programming in Linda.
IEEE Transactions on Parallel and Distributed Systems, 6(3):287-302, March 1995.
- 3
-
M. Blaum, J. Brady, J. Bruck, and J. Menon.
EVENODD: An optimal scheme for tolerating double disk failures in
RAID architectures.
pages 245-254, April 1994.
- 4
-
A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle.
Fault tolerance under UNIX.
ACM Transactions on Computer Systems, 7(1):1-24, Feb 1989.
- 5
-
J. Casas, D. Clark, R. Konuru, S. Otto, R. Prouty, and J. Walpole.
MPVM: A migration transparent version of PVM.
Computing Systems, 8(2):171-216, Spring 1995.
- 6
-
J. Choi, J. J. Dongarra, S. Ostrouchov, A. P. Petitet, D. W. Walker, and R. C.
Whaley.
The design and implementation of the ScaLAPACK LU, QR, and
Cholesky factorization routines.
Scientific Programming, Vol. 5, pages 173-184, 1996.
- 7
-
F. Cristian and F. Jahanian.
A timestamp-based checkpointing protocol for long-lived distributed
computations.
In 10th Symposium on Reliable Distributed Systems, pages
12-20, October 1991.
- 8
-
D. Cummings and L. Alkalaj.
Checkpoint/rollback in a distributed system using coarse-grained
dataflow.
In 24th International Symposium on Fault-Tolerant Computing,
pages 424-433, June 1994.
- 9
-
E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel.
The performance of consistent checkpointing.
In 11th Symposium on Reliable Distributed Systems, pages
39-47, October 1992.
- 10
-
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam.
PVM: Parallel Virtual Machine - A User's Guide and Tutorial for
Networked Parallel Computing.
MIT Press, Cambridge, MA, 1994.
- 11
-
D. Gelernter and D. Kaminsky.
Supercomputing out of recycled garbage.
pages 417-427, June 1992.
- 12
-
G. A. Gibson.
Redundant Disk Arrays: Reliable, Parallel Secondary Storage.
PhD thesis, University of California, Berkeley, CA,
December 1990.
- 13
-
G. A. Gibson, L. Hellerstein, R. M. Karp, and D. A. Patterson.
Failure correction techniques for large disk arrays.
pages 123-132, April 1989.
- 14
-
K.-H. Huang and J. A. Abraham.
Algorithm-based fault tolerance for matrix operations.
IEEE Transactions on Computers, C-33(6):518-528, June 1984.
- 15
-
D. B. Johnson and W. Zwaenepoel.
Recovery in distributed systems using optimistic message logging and
checkpointing.
Journal of Algorithms, 11(3):462-491, September 1990.
- 16
-
Y. Kim.
Fault Tolerant Matrix Operations for Parallel and Distributed
Systems.
PhD thesis, The University of Tennessee, Knoxville TN,
August 1996.
- 17
-
Y. Kim, J. S. Plank, and J. J. Dongarra.
Fault tolerant matrix operations using checksum and reverse
computation.
In The 6th Symposium on the Frontiers of Massively Parallel
Computation, pages 70-77, Annapolis MD, October 1996.
- 18
-
T. H. Lai and T. H. Yang.
On distributed snapshots.
Information Processing Letters, 25:153-158, May 1987.
- 19
-
K. Li, J. F. Naughton, and J. S. Plank.
Low-latency, concurrent checkpointing for parallel programs.
IEEE Transactions on Parallel and Distributed Systems,
5(8):874-879, August 1994.
- 20
-
W. Peterson and E. J. Weldon.
Error-Correcting Codes.
MIT Press, Cambridge MA, second edition, 1972.
- 21
-
J. S. Plank, J. Friedman, and K. Li.
A failure correction technique for parallel storage devices with
minimal device overhead.
Technical Report CS-94-243, University of Tennessee, August 1994.
- 22
-
J. S. Plank, Y. Kim, and J. J. Dongarra.
Algorithm-based diskless checkpointing for fault tolerant matrix
operations.
In The 25th International Symposium on Fault-Tolerant
Computing, pages 351-360, Pasadena, CA, June 1995.
- 23
-
J. S. Plank, Y. Kim, and J. J. Dongarra.
Fault tolerant matrix operations for networks of workstations using
diskless checkpointing.
Journal of Parallel and Distributed Computing (To appear), June 1997.
- 24
-
J. S. Plank and K. Li.
Ickp -- a consistent checkpointer for multicomputers.
IEEE Parallel & Distributed Technology, 2(2):62-67, Summer
1994.
- 25
-
J. Pruyne and M. Livny.
Parallel processing on dynamic resources with CARMI.
April 1995.
- 26
-
S. Roman.
Coding and Information Theory.
Springer-Verlag, 1992.
- 27
-
A. Roy-Chowdhury and P. Banerjee.
Algorithm-based fault location and recovery for matrix computations.
In 24th International Symposium on Fault-Tolerant Computing,
pages 38-47, Austin, TX, June 1994.
- 28
-
L. M. Silva, J. G. Silva, S. Chapple, and L. Clarke.
Portable Checkpointing and Recovery.
1995.
- 29
-
M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. J. Dongarra.
MPI: The Complete Reference.
MIT Press, Cambridge, MA, 1996.
- 30
-
G. Stellner.
CoCheck: Checkpointing and process migration for MPI.
April 1996.
- 31
-
R. E. Strom and S. Yemini.
Optimistic recovery in distributed systems.
ACM Transactions on Computer Systems, 3(3):204-226, August
1985.
Jack Dongarra
Thu Feb 20 21:38:16 EST 1997