Our discussion of parallel I/O for dense matrices assumes that in-core matrices are distributed over processes using a block-cyclic data distribution, as in ScaLAPACK [4, 2]. Processes are viewed as being laid out in a two-dimensional logical topology, forming a process mesh. Our approach to parallel I/O for dense matrices hinges on the number of file pointers, and on which processes have access to those file pointers. We divide parallel I/O modes into two broad classes.
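As a concrete illustration (not taken from the text), the following C sketch shows the ownership map implied by a two-dimensional block-cyclic distribution: global element (i, j) is assigned to the process at grid coordinates ((i / mb) mod Pr, (j / nb) mod Pc), where mb-by-nb is the distribution block size and Pr-by-Pc is the process grid. The matrix size, block sizes, and grid shape used below are illustrative assumptions, not values from the paper.

#include <stdio.h>

typedef struct { int pr, pc; } ProcCoord;

/* Process-grid coordinates owning global element (i, j) under a
 * two-dimensional block-cyclic distribution with mb-by-nb blocks
 * on a Pr-by-Pc process grid (zero-based indices throughout). */
static ProcCoord owner(int i, int j, int mb, int nb, int Pr, int Pc)
{
    ProcCoord p;
    p.pr = (i / mb) % Pr;   /* row blocks dealt cyclically over process rows    */
    p.pc = (j / nb) % Pc;   /* column blocks dealt cyclically over process cols */
    return p;
}

int main(void)
{
    /* Illustrative parameters: 8-by-8 matrix, 2-by-2 blocks, 2-by-2 process grid. */
    int mb = 2, nb = 2, Pr = 2, Pc = 2;
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            ProcCoord p = owner(i, j, mb, nb, Pr, Pc);
            printf("(%d,%d) ", p.pr, p.pc);
        }
        printf("\n");
    }
    return 0;
}

Printing the ownership map in this way makes the cyclic pattern visible: each 2-by-2 block of the matrix maps to one process, and the blocks are dealt out round-robin in both dimensions.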
Modes 1(a) and 1(b) correspond to the case in which there is no parallel I/O system, and all I/O is necessarily sequential. Modes 1(c), 2(a), and 2(b) correspond to different ways of doing parallel I/O. The shared file mode is the most general, since a shared file can be written using one particular process grid and block size and read later using a different process grid and block size. A distributed file can only be read using the same process grid and block size with which it was written. However, a major drawback of a shared file is that, in general, each process can read or write only a small number of contiguous elements at a time. This results in very poor performance unless the block sizes are very large, or unless the process grid is chosen to be 1 × P (for Fortran codes) so that each column of the matrix lies in one process. The potential for poor performance arises because most I/O systems work best when reading large blocks. Furthermore, if only a small amount of data is written at a time, systems such as the Intel Paragon will not stripe the data across disks, so I/O is essentially serialized.
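To make the contiguity drawback concrete, the sketch below (an illustration under assumed parameters, not code from the paper) enumerates the contiguous runs that one process row owns within a single column of a column-major shared file. With Pr > 1 process rows, each run is only mb elements long, so a process must issue many small, strided transfers; with a 1 × P grid (Pr = 1), the whole column is a single contiguous run.

#include <stdio.h>
#include <stddef.h>

/* Contiguous element runs owned by process row my_pr within one column of
 * an m-row matrix stored column-major in a single shared file, under a
 * block-cyclic distribution with row block size mb over Pr process rows. */
static void column_runs(int m, int mb, int Pr, int my_pr)
{
    int nruns = 0, run_start = -1;
    for (int i0 = 0; i0 < m; i0 += mb) {
        int mine = ((i0 / mb) % Pr == my_pr);
        if (mine && run_start < 0)
            run_start = i0;                        /* a new contiguous run begins  */
        if (!mine && run_start >= 0) {             /* run ends at this block start */
            printf("  rows [%4d, %4d): %zu contiguous bytes\n",
                   run_start, i0, (size_t)(i0 - run_start) * sizeof(double));
            nruns++;
            run_start = -1;
        }
    }
    if (run_start >= 0) {                          /* run reaching the column end  */
        printf("  rows [%4d, %4d): %zu contiguous bytes\n",
               run_start, m, (size_t)(m - run_start) * sizeof(double));
        nruns++;
    }
    printf("  -> %d contiguous run(s) in this column\n", nruns);
}

int main(void)
{
    int m = 1000, mb = 32;           /* illustrative column length and block size */

    printf("Pr = 4 (2D grid): many short transfers per column\n");
    column_runs(m, mb, 4, 0);

    printf("Pr = 1 (1 x P grid): one long transfer per column\n");
    column_runs(m, mb, 1, 0);
    return 0;
}

Under these assumed parameters, the Pr = 4 case produces eight separate 256-byte transfers per column, while the Pr = 1 case produces a single 8000-byte transfer, which is why a 1 × P grid (or a very large block size) is needed for good shared-file performance.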