The BLAPIOS outlined above have been implemented on the Intel Paragon using Intel's Parallel File System (PFS). In these PFS-BLAPIOS a distributed file is implemented by having each process access its own distinct file, although it could equally have been implemented by partitioning a single file into contiguous chunks and assigning one chunk to each process. The M_ASYNC I/O mode of PFS is used for both shared and distributed modes.
Although one might expect the best performance on a particular platform to come from implementing the BLAPIOS directly on top of the native parallel I/O system, there are also distinct advantages to implementing them on top of a portable parallel I/O system. Parallel I/O is an area of much active research (see, for example, [12] and the parallel I/O archive at http://www.cs.dartmouth.edu/pario.html for more information). Although there is currently no generally accepted parallel I/O standard, MPI-IO, the proposed extensions to MPI [14] for performing parallel I/O, is a strong contender [5]. We shall, therefore, briefly consider how the BLAPIOS might be implemented on top of MPI-IO.
MPI-IO contains routines for collective and independent I/O operations. All the I/O operations in the BLAPIOS are independent. MPI-IO partitions a file using filetypes, which are an extension of MPI datatypes. Each process in a given group (specified by an MPI communicator) creates a filetype that picks out just the data assigned to it. MPI-IO provides a routine for creating a filetype for block-cyclically distributed matrices. This filetype, together with MPI-IO's absolute offset mode, can be used to create and access the equivalent of a BLAPIOS shared file. A BLAPIOS distributed file can be handled by creating a datatype that divides the file into contiguous segments, with one segment assigned to each process. In this case MPI-IO's relative offset mode would be used to access data.
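The two partitionings described above can be illustrated with a minimal sketch. It is not MPI-IO code: it only computes which global column indices a process would see under each scheme, simplified to a one-dimensional distribution over columns. The function names and the parameters `nprocs`, `ncols` and `nb` (column block size) are hypothetical and chosen for illustration.

```python
def block_cyclic_columns(rank, nprocs, ncols, nb):
    """Global column indices owned by `rank` in a 1-D block-cyclic layout.

    Blocks of `nb` columns are dealt out to processes 0..nprocs-1 in turn,
    which is the pattern a shared-file filetype would have to pick out.
    """
    owned = []
    for start in range(rank * nb, ncols, nprocs * nb):
        owned.extend(range(start, min(start + nb, ncols)))
    return owned


def contiguous_segment(rank, nprocs, ncols):
    """Half-open column range [lo, hi) owned by `rank` when the file is
    divided into contiguous segments, one per process (distributed file)."""
    base, extra = divmod(ncols, nprocs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi
```

With 10 columns, 2 processes and a block size of 2, for example, process 0 owns columns 0, 1, 4, 5, 8 and 9 under the block-cyclic scheme, whereas under the contiguous scheme it would own columns 0 to 4.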
In MPI-IO the filetype and communicator are specified as input arguments when a file is opened. This is somewhat more restrictive than access to a shared file using the BLAPIOS, in which the partitioning is determined dynamically by the distribution of the in-core matrix being read from or written to. The usefulness of dynamic partitioning (or alignment) is apparent when performing the LU factorization of A, an M×N matrix with N>M. In this case there are two phases to the computation: first the LU factorization of the first M columns is found (call this matrix B), and then the transformations are applied to the remaining N-M columns (call this matrix C). It is natural, and convenient, in performing the second phase of the algorithm to treat matrices B and C as unrelated matrices with independent partitionings. However, complications can arise if the number of columns spanning the process grid does not exactly divide M, so that C begins in the middle of a block. If we are dealing with a shared file, the BLAPIOS routine P_ASEEK can be used to dynamically partition C so that it starts at the beginning of a block. For a distributed file, which has a fixed partitioning, we have to offset the in-core matrix involved in I/O operations so that it is aligned with the partitioning. To make the BLAPIOS compatible with MPI-IO we need either to permit multiple alignments for a file in MPI-IO, or to permit only fixed alignments for shared files in the BLAPIOS.
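The alignment problem for C can be made concrete with a small sketch. It assumes that the number of columns spanning the process grid is the product of a column block size `nb` and a number of process columns `q`. Both names, and the function itself, are hypothetical and are not part of the BLAPIOS or MPI-IO interfaces.

```python
def alignment_of_c(m, nb, q):
    """Describe where column m (the first column of C) falls in a
    block-cyclic layout of the full matrix A.

    nb is the column block size and q the number of process columns; both
    are assumed parameters, so one sweep of the process grid covers
    nb * q columns.
    """
    span = nb * q
    return {
        # True when C starts exactly where a new sweep of the grid starts
        "aligned_with_grid": m % span == 0,
        # 0 means C starts on a block boundary; otherwise it starts mid-block
        "offset_in_block": m % nb,
        # process column that holds the first column of C
        "owner_process_column": (m // nb) % q,
    }
```

For instance, with M = 10, a block size of 4 and 2 process columns, the first column of C lies 2 columns into a block, so for a file with a fixed partitioning the in-core matrix would have to be offset by that amount to stay aligned.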