The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies and attempts to facilitate transferring these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology.
The series will help scientists and engineers to understand the current world of advanced computation and to anticipate future developments that will impact their computing environments and open up new capabilities and modes of computation.
This book is about the Message Passing Interface (MPI), an important and increasingly popular standardized and portable message-passing system that brings us closer to the potential development of practical and cost-effective large-scale parallel applications. It gives a complete specification of the MPI standard and provides illustrative programming examples. This advanced-level book supplements the companion, introductory volume in the Series by William Gropp, Ewing Lusk and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface.
Janusz S. Kowalik
Preface
MPI, the Message Passing Interface, is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran 77 or C. Several well-tested and efficient implementations of MPI already exist, including some that are free and in the public domain. These are beginning to foster the development of a parallel software industry, and there is excitement among computing researchers and vendors that the development of portable and scalable, large-scale parallel applications is now feasible.
The MPI standardization effort involved over 80 people from 40 organizations, mainly from the United States and Europe. Most of the major vendors of concurrent computers at the time were involved in MPI, along with researchers from universities, government laboratories, and industry. The standardization process began with the Workshop on Standards for Message Passing in a Distributed Memory Environment, sponsored by the Center for Research on Parallel Computation, held April 29-30, 1992, in Williamsburg, Virginia [29]. A preliminary draft proposal, known as MPI1, was put forward by Dongarra, Hempel, Hey, and Walker in November 1992, and a revised version was completed in February 1993 [11].
In November 1992, a meeting of the MPI working group was held in Minneapolis, at which it was decided to place the standardization process on a more formal footing. The MPI working group met every 6 weeks throughout the first 9 months of 1993. The draft MPI standard was presented at the Supercomputing '93 conference in November 1993. After a period of public comments, which resulted in some changes in MPI, version 1.0 of MPI was released in June 1994.
These meetings and the email discussion together constituted the MPI Forum, membership of which has been open to all members of the high performance computing community.
This book serves as an annotated reference manual for MPI, and a complete specification of the standard is presented. We repeat the material already published in the MPI specification document [15], though we have attempted to clarify it. The annotations mainly take the form of explaining why certain design choices were made, how users are meant to use the interface, and how MPI implementors should construct a version of MPI. Many detailed, illustrative programming examples are also given, with an eye toward illuminating the more advanced or subtle features of MPI.
The complete interface is presented in this book, and we are not hesitant to explain even the most esoteric features or consequences of the standard. As such, this volume is not a gentle introduction to MPI, nor a tutorial. For such purposes, we recommend the companion volume in this series by William Gropp, Ewing Lusk, and Anthony Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. The parallel application developer will want to have copies of both books handy.
For a first reading, and as a good introduction to MPI, the reader should first read: Chapter 1, through Section ; the material on point-to-point communications covered in Sections through and Section ; the simpler forms of collective communications explained in Sections through ; and the basic introduction to communicators given in Sections through . This will give a fair understanding of MPI, and will allow the construction of parallel applications of moderate complexity.
This book is based on the hard work of many people in the MPI Forum. The authors gratefully recognize the members of the forum, especially the contributions made by members who served in positions of responsibility: Lyndon Clarke, James Cownie, Al Geist, William Gropp, Rolf Hempel, Robert Knighten, Richard Littlefield, Ewing Lusk, Paul Pierce, and Anthony Skjellum. Other contributors were: Ed Anderson, Robert Babb, Joe Baron, Eric Barszcz, Scott Berryman, Rob Bjornson, Nathan Doss, Anne Elster, Jim Feeney, Vince Fernando, Sam Fineberg, Jon Flower, Daniel Frye, Ian Glendinning, Adam Greenberg, Robert Harrison, Leslie Hart, Tom Haupt, Don Heller, Tom Henderson, Anthony Hey, Alex Ho, C.T. Howard Ho, Gary Howell, John Kapenga, James Kohl, Susan Krauss, Bob Leary, Arthur Maccabe, Peter Madams, Alan Mainwaring, Oliver McBryan, Phil McKinley, Charles Mosher, Dan Nessett, Peter Pacheco, Howard Palmer, Sanjay Ranka, Peter Rigsbee, Arch Robison, Erich Schikuta, Mark Sears, Ambuj Singh, Alan Sussman, Robert Tomlinson, Robert G. Voigt, Dennis Weeks, Stephen Wheat, and Steven Zenith. We especially thank William Gropp and Ewing Lusk for help in formatting this volume.
Support for MPI meetings came in part from ARPA and NSF under grant ASC-9310330, NSF Science and Technology Center Cooperative agreement No. CCR-8809615, and the Commission of the European Community through Esprit Project P6643. The University of Tennessee also made financial contributions to the MPI Forum.
MPI_Gatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT, COMM, IERROR
MPI_GATHERV extends the functionality of MPI_GATHER by allowing a varying count of data from each process, since recvcounts is now an array. It also allows more flexibility as to where the data is placed on the root, by providing the new argument, displs.
The outcome is as if each process, including the root process, sends a message to the root, MPI_Send(sendbuf, sendcount, sendtype, root, ...), and the root executes n receives, MPI_Recv(recvbuf + displs[i]·extent(recvtype), recvcounts[i], recvtype, i, ...).
The data sent from process j is placed in the jth portion of the receive buffer recvbuf on process root. The jth portion of recvbuf begins at offset displs[j] elements (in terms of recvtype) into recvbuf.
The receive buffer is ignored for all non-root processes.
The type signature implied by sendcount and sendtype on process i must be equal to the type signature implied by recvcounts[i] and recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed, as illustrated in Example .
All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, and comm are significant. The argument root must have identical values on all processes, and comm must represent the same intragroup communication domain.
The specification of counts, types, and displacements should not cause any location on the root to be written more than once. Such a call is erroneous. On the other hand, the successive displacements in the array displs need not be a monotonic sequence.
Figure: The root process gathers 100 ints from each process in the group; each set is placed stride ints apart.
Figure: The root process gathers column 0 of a 100 × 150 C array, and each set is placed stride ints apart.
Figure: The root process gathers 100-i ints from column i of a 100 × 150 C array, and each set is placed stride ints apart.
Figure: The root process gathers 100-i ints from column i of a 100 × 150 C array, and each set is placed stride[i] ints apart (a varying stride).
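As an illustration of the first of these patterns, the following C sketch gathers 100 ints from every process and places the contributions stride ints apart at the root. The variable names, the data, and the value of stride (assumed to be at least 100 so that no location is written twice) are illustrative.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int gsize, i, root = 0, stride = 105;   /* stride >= 100 is assumed */
    int sendarray[100], *rbuf, *rcounts, *displs;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &gsize);
    for (i = 0; i < 100; ++i) sendarray[i] = i;    /* arbitrary local data */

    /* The root needs room for gsize blocks placed stride ints apart. */
    rbuf    = (int *) malloc(gsize * stride * sizeof(int));
    rcounts = (int *) malloc(gsize * sizeof(int));
    displs  = (int *) malloc(gsize * sizeof(int));
    for (i = 0; i < gsize; ++i) {
        rcounts[i] = 100;          /* every process contributes 100 ints           */
        displs[i]  = i * stride;   /* offset of the ith block, in units of MPI_INT */
    }

    MPI_Gatherv(sendarray, 100, MPI_INT, rbuf, rcounts, displs, MPI_INT, root, comm);

    free(rbuf); free(rcounts); free(displs);
    MPI_Finalize();
    return 0;
}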
MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
MPI_SCATTER is the inverse operation to MPI_GATHER.
The outcome is as if the root executed n send operations, MPI_Send(sendbuf + i·sendcount·extent(sendtype), sendcount, sendtype, i, ...), i = 0, ..., n-1, and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, root, ...).
An alternative description is that the root sends a message with MPI_Send(sendbuf, sendcount·n, sendtype, ...). This message is split into n equal segments, the ith segment is sent to the ith process in the group, and each process receives this message as above.
The type signature associated with sendcount and sendtype at the root must be equal to the type signature associated with recvcount and recvtype at all processes. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root, while on other processes, only arguments recvbuf, recvcount, recvtype, root, comm are significant. The argument root must have identical values on all processes and comm must represent the same intragroup communication domain. The send buffer is ignored for all non-root processes.
The specification of counts and types should not cause any location on the root to be read more than once.
Figure: The root process scatters sets of 100 ints to each process in the group.
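A minimal C sketch of this operation follows; the buffer contents and variable names are illustrative.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int gsize, rank, root = 0, i;
    int *sendbuf = NULL, recvbuf[100];
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &rank);

    if (rank == root) {    /* the send buffer is significant only at the root */
        sendbuf = (int *) malloc(gsize * 100 * sizeof(int));
        for (i = 0; i < gsize * 100; ++i) sendbuf[i] = i;
    }

    /* Process i receives elements 100*i .. 100*i+99 of sendbuf into recvbuf. */
    MPI_Scatter(sendbuf, 100, MPI_INT, recvbuf, 100, MPI_INT, root, comm);

    if (rank == root) free(sendbuf);
    MPI_Finalize();
    return 0;
}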
MPI_Scatterv(void* sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
MPI_SCATTERV is the inverse operation to MPI_GATHERV.
MPI_SCATTERV extends the functionality of MPI_SCATTER by allowing a varying count of data to be sent to each process, since sendcounts is now an array. It also allows more flexibility as to where the data is taken from on the root, by providing the new argument, displs.
The outcome is as if the root executed n send operations, MPI_Send(sendbuf + displs[i]·extent(sendtype), sendcounts[i], sendtype, i, ...), i = 0, ..., n-1, and each process executed a receive, MPI_Recv(recvbuf, recvcount, recvtype, root, ...).
The type signature implied by sendcounts[i] and sendtype at the root must be equal to the type signature implied by recvcount and recvtype at process i. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root, while on other processes, only arguments recvbuf, recvcount, recvtype, root, and comm are significant. The argument root must have identical values on all processes, and comm must represent the same intragroup communication domain. The send buffer is ignored for all non-root processes.
The specification of counts, types, and displacements should not cause any location on the root to be read more than once.
Figure: The root process scatters sets of 100 ints, moving by stride ints from send to send in the scatter.
Figure: The root scatters blocks of 100-i ints into column i of a 100 × 150 C array. At the sending side, the blocks are stride[i] ints apart.
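The first of these patterns might be coded as in the following sketch, where the root sends 100 ints to each process but reads the blocks stride ints apart on the send side. The value of stride (assumed to be at least 100) and the names are illustrative.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int gsize, i, root = 0, stride = 105;   /* stride >= 100 is assumed */
    int *sendbuf, *sendcounts, *displs, recvbuf[100];
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &gsize);

    /* Send-side layout: block i starts i*stride ints into sendbuf. */
    sendbuf    = (int *) malloc(gsize * stride * sizeof(int));
    sendcounts = (int *) malloc(gsize * sizeof(int));
    displs     = (int *) malloc(gsize * sizeof(int));
    for (i = 0; i < gsize * stride; ++i) sendbuf[i] = i;   /* arbitrary data */
    for (i = 0; i < gsize; ++i) {
        sendcounts[i] = 100;
        displs[i]     = i * stride;
    }

    MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT, recvbuf, 100, MPI_INT, root, comm);

    free(sendbuf); free(sendcounts); free(displs);
    MPI_Finalize();
    return 0;
}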
MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
MPI_ALLGATHER can be thought of as MPI_GATHER, except all processes receive the result, instead of just the root. The block of data sent from the jth process is received by every process and placed in the jth block of the buffer recvbuf.
The type signature associated with sendcount and sendtype at a process must be equal to the type signature associated with recvcount and recvtype at any other process.
The outcome of a call to MPI_ALLGATHER(...) is as if all processes executed n calls to MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm), for root = 0 , ..., n-1. The rules for correct usage of MPI_ALLGATHER are easily found from the corresponding rules for MPI_GATHER.
MPI_Allgatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM, IERROR
MPI_ALLGATHERV can be thought of as MPI_GATHERV, except all processes receive the result, instead of just the root. The block of data sent from the jth process is received by every process and placed in the jth block of the buffer recvbuf. These blocks need not all be the same size.
The type signature associated with sendcount and sendtype at process j must be equal to the type signature associated with recvcounts[j] and recvtype at any other process.
The outcome is as if all processes executed calls to MPI_GATHERV(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm), for root = 0, ..., n-1. The rules for correct usage of MPI_ALLGATHERV are easily found from the corresponding rules for MPI_GATHERV.
MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
MPI_ALLTOALL is an extension of MPI_ALLGATHER to the case where each process sends distinct data to each of the receivers. The jth block sent from process i is received by process j and is placed in the ith block of recvbuf.
The type signature associated with sendcount and sendtype at a process must be equal to the type signature associated with recvcount and recvtype at any other process. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. As usual, however, the type maps may be different.
The outcome is as if each process executed a send to each process (itself included) with a call to MPI_Send(sendbuf + i·sendcount·extent(sendtype), sendcount, sendtype, i, ...), and a receive from every other process with a call to MPI_Recv(recvbuf + i·recvcount·extent(recvtype), recvcount, recvtype, i, ...), where i = 0, ..., n-1.
All arguments on all processes are significant. The argument comm must represent the same intragroup communication domain on all processes.
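The following sketch sends one int from every process to every other process; the particular values and names are illustrative.

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int gsize, rank, i, *sendbuf, *recvbuf;
    MPI_Comm comm = MPI_COMM_WORLD;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &rank);

    sendbuf = (int *) malloc(gsize * sizeof(int));
    recvbuf = (int *) malloc(gsize * sizeof(int));
    for (i = 0; i < gsize; ++i)
        sendbuf[i] = 100 * rank + i;      /* block i is destined for process i */

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, comm);
    /* Now recvbuf[i] == 100*i + rank: the value process i prepared for this process. */

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}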
MPI procedures are specified using a language-independent notation. The arguments of procedure calls are marked as IN, OUT, or INOUT. The meanings of these are: the call uses but does not update an argument marked IN; the call may update an argument marked OUT; the call both uses and updates an argument marked INOUT.
There is one special case: if an argument is a handle to an opaque object (defined in Section ), and the object is updated by the procedure call, then the argument is marked INOUT. It is marked this way even though the handle itself is not modified; we use the INOUT attribute to denote that what the handle references is updated.
The definition of MPI tries to avoid, to the largest possible extent, the use of INOUT arguments, because such use is error-prone, especially for scalar arguments.
A common occurrence for MPI functions is an argument that is used as IN by some processes and OUT by other processes. Such an argument is, syntactically, an INOUT argument and is marked as such, although, semantically, it is not used in one call both for input and for output.
Another frequent situation arises when an argument value is needed only by a subset of the processes. When an argument is not significant at a process then an arbitrary value can be passed as the argument.
Unless specified otherwise, an argument of type OUT or type INOUT cannot be aliased with any other argument passed to an MPI procedure. An example of argument aliasing in C appears below. If we define a C procedure like this,
void copyIntBuffer( int *pin, int *pout, int len )
{   int i;
    for (i=0; i<len; ++i) *pout++ = *pin++;
}
then a call to it in the following code fragment has aliased arguments.
int a[10];
copyIntBuffer( a, a+3, 7);
Although the C language allows this, such usage of MPI procedures is forbidden unless otherwise specified. Note that Fortran prohibits aliasing of arguments.
All MPI functions are first specified in the language-independent notation. Immediately below this, the ANSI C version of the function is shown, and below this, a version of the same function in Fortran 77.
MPI_Alltoallv(void* sendbuf, int *sendcounts, int *sdispls, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS, RDISPLS, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*), RECVTYPE, COMM, IERROR
MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that the location of data for the send is specified by sdispls and the location of the placement of the data on the receive side is specified by rdispls.
The jth block sent from process i is received by process j and is placed in the ith block of recvbuf. These blocks need not all have the same size.
The type signature associated with sendcounts[j] and sendtype at process i must be equal to the type signature associated with recvcounts[i] and recvtype at process j. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of processes. Distinct type maps between sender and receiver are still allowed.
The outcome is as if each process sent a message to every other process i with MPI_Send(sendbuf + sdispls[i]·extent(sendtype), sendcounts[i], sendtype, i, ...), and received a message from every other process i with a call to MPI_Recv(recvbuf + rdispls[i]·extent(recvtype), recvcounts[i], recvtype, i, ...), where i = 0, ..., n-1.
All arguments on all processes are significant. The argument comm must specify the same intragroup communication domain on all processes.
The functions in this section perform a global reduce operation (such as sum, max, logical AND, etc.) across all the members of a group. The reduction operation can be either one of a predefined list of operations, or a user-defined operation. The global reduction functions come in several flavors: a reduce that returns the result of the reduction at one node, an all-reduce that returns this result at all nodes, and a scan (parallel prefix) operation. In addition, a reduce-scatter operation combines the functionality of a reduce and of a scatter operation. In order to improve performance, the functions can be passed an array of values; one call will perform a sequence of element-wise reductions on the arrays of values. Figure gives a pictorial representation of these operations.
Figure: Reduce functions illustrated for a group of three processes. In each case, each row of boxes represents data items in one process. Thus, in the reduce, initially each process has three items; after the reduce the root process has three sums.
MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR
MPI_REDUCE combines the elements provided in the input buffer of each process in the group, using the operation op, and returns the combined value in the output buffer of the process with rank root. The input buffer is defined by the arguments sendbuf, count and datatype; the output buffer is defined by the arguments recvbuf, count and datatype; both have the same number of elements, with the same type. The arguments count, op and root must have identical values at all processes, the datatype arguments should match, and comm should represent the same intragroup communication domain. Thus, all processes provide input buffers and output buffers of the same length, with elements of the same type. Each process can provide one element, or a sequence of elements, in which case the combine operation is executed element-wise on each entry of the sequence. For example, if the operation is MPI_MAX and the send buffer contains two elements that are floating point numbers (count = 2 and datatype = MPI_FLOAT), then recvbuf[0] at the root receives the global maximum of the first elements of all the send buffers, and recvbuf[1] receives the global maximum of the second elements.
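The two-element case just described might be coded as in the following sketch; the local values are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, root = 0;
    float sendbuf[2], recvbuf[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sendbuf[0] = (float) rank;            /* first element, reduced with MPI_MAX   */
    sendbuf[1] = (float) (100 - rank);    /* second element, reduced independently */

    /* At the root, recvbuf[i] becomes the global maximum of the sendbuf[i] values. */
    MPI_Reduce(sendbuf, recvbuf, 2, MPI_FLOAT, MPI_MAX, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("maxima: %f %f\n", recvbuf[0], recvbuf[1]);

    MPI_Finalize();
    return 0;
}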
Section lists the set of predefined operations provided by MPI. That section also enumerates the allowed datatypes for each operation. In addition, users may define their own operations that can be overloaded to operate on several datatypes, either basic or derived. This is further explained in Section .
The operation op is always assumed to be associative. All predefined operations are also commutative. Users may define operations that are assumed to be associative, but not commutative. The ``canonical'' evaluation order of a reduction is determined by the ranks of the processes in the group. However, the implementation can take advantage of associativity, or associativity and commutativity in order to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating point addition.
The datatype argument of MPI_REDUCE must be compatible with op. Predefined operators work only with the MPI types listed in Section and Section . User-defined operators may operate on general, derived datatypes. In this case, each argument that the reduce operation is applied to is one element described by such a datatype, which may contain several basic values. This is further explained in Section .
The following predefined operations are supplied for MPI_REDUCE and related functions MPI_ALLREDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN. These operations are invoked by placing the following in op.
MPI_MAX       maximum
MPI_MIN       minimum
MPI_SUM       sum
MPI_PROD      product
MPI_LAND      logical and
MPI_BAND      bit-wise and
MPI_LOR       logical or
MPI_BOR       bit-wise or
MPI_LXOR      logical xor
MPI_BXOR      bit-wise xor
MPI_MAXLOC    max value and location
MPI_MINLOC    min value and location
The two operations MPI_MINLOC and MPI_MAXLOC are discussed separately in Section . For the other predefined operations, we enumerate below the allowed combinations of op and datatype arguments. First, define groups of MPI basic datatypes in the following way.

C integer:         MPI_INT, MPI_LONG, MPI_SHORT, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNSIGNED_LONG
Fortran integer:   MPI_INTEGER
Floating point:    MPI_FLOAT, MPI_DOUBLE, MPI_REAL, MPI_DOUBLE_PRECISION, MPI_LONG_DOUBLE
Logical:           MPI_LOGICAL
Complex:           MPI_COMPLEX
Byte:              MPI_BYTE

Now, the valid datatypes for each operation are specified below.

MPI_MAX, MPI_MIN:              C integer, Fortran integer, Floating point
MPI_SUM, MPI_PROD:             C integer, Fortran integer, Floating point, Complex
MPI_LAND, MPI_LOR, MPI_LXOR:   C integer, Logical
MPI_BAND, MPI_BOR, MPI_BXOR:   C integer, Fortran integer, Byte
Figure: Vector-matrix product. Vector a and matrix b are distributed in one dimension. The distribution is illustrated for four processes. The slices need not all be of the same size: each process may have a different value for m.
The operator MPI_MINLOC is used to compute a global minimum and also an index attached to the minimum value. MPI_MAXLOC similarly computes a global maximum and index. One application of these is to compute a global minimum (maximum) and the rank of the process containing this value.
The operation that defines MPI_MAXLOC is
(u, i) ∘ (v, j) = (w, k),
where w = max(u, v), and k = i if u > v, k = min(i, j) if u = v, and k = j if u < v.
MPI_MINLOC is defined similarly:
(u, i) ∘ (v, j) = (w, k),
where w = min(u, v), and k = i if u < v, k = min(i, j) if u = v, and k = j if u > v.
Both operations are associative and commutative. Note that if MPI_MAXLOC is applied to reduce a sequence of pairs (u(0), 0), (u(1), 1), ..., (u(n-1), n-1), then the value returned is (u, r), where u is the global maximum of the u(i) and r is the index of the first global maximum in the sequence. Thus, if each process supplies a value and its rank within the group, then a reduce operation with op = MPI_MAXLOC will return the maximum value and the rank of the first process with that value. Similarly, MPI_MINLOC can be used to return a minimum and its index. More generally, MPI_MINLOC computes a lexicographic minimum, where elements are ordered according to the first component of each pair, and ties are resolved according to the second component.
The reduce operation is defined to operate on arguments that consist of a pair: value and index. In order to use MPI_MINLOC and MPI_MAXLOC in a MPI_MINLOC MPI_MAXLOC reduce operation, one must provide a datatype argument that represents a pair (value and index). MPI provides nine such predefined datatypes. In C, the index is an int and the value can be a short or long int, a float, or a double. The potentially mixed-type nature of such arguments is a problem in Fortran. The problem is circumvented, for Fortran, by having the MPI-provided type consist of a pair of the same type as value, and coercing the index to this type also.
The operations MPI_MAXLOC and MPI_MINLOC can be used with each of the following datatypes.
In Fortran:
MPI_2REAL               pair of REALs
MPI_2DOUBLE_PRECISION   pair of DOUBLE PRECISION variables
MPI_2INTEGER            pair of INTEGERs
In C:
MPI_FLOAT_INT           float and int
MPI_DOUBLE_INT          double and int
MPI_LONG_INT            long and int
MPI_2INT                pair of int
MPI_SHORT_INT           short and int
MPI_LONG_DOUBLE_INT     long double and int
The datatype MPI_2REAL is as if defined by the following (see Section ).
MPI_TYPE_CONTIGUOUS(2, MPI_REAL, MPI_2REAL)
Similar statements apply for MPI_2INTEGER, MPI_2DOUBLE_PRECISION, and MPI_2INT.
The datatype MPI_FLOAT_INT is as if defined by the following sequence of instructions.
type[0] = MPI_FLOAT
type[1] = MPI_INT
disp[0] = 0
disp[1] = sizeof(float)
block[0] = 1
block[1] = 1
MPI_TYPE_STRUCT(2, block, disp, type, MPI_FLOAT_INT)
Similar statements apply for the other mixed types in C.
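As an illustration, the following sketch uses MPI_MAXLOC with MPI_DOUBLE_INT to find a global maximum together with the rank of the first process holding it; the local values are illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, root = 0;
    struct { double value; int rank; } in, out;   /* layout matching MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = 1.0 / (rank + 1.0);   /* illustrative local value  */
    in.rank  = rank;                 /* index component: own rank */

    /* At the root, out.value is the global maximum and out.rank is the
       rank of the first process holding that value.                   */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("maximum %f found on process %d\n", out.value, out.rank);

    MPI_Finalize();
    return 0;
}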
MPI includes variants of each of the reduce operations where the result is returned to all processes in the group. MPI requires that all processes participating in these operations receive identical results.
MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER COUNT, DATATYPE, OP, COMM, IERROR
Same as MPI_REDUCE except that the result appears in the receive buffer of all the group members.
MPI includes variants of each of the reduce operations where the result is scattered to all processes in the group on return.
MPI_Reduce_scatter(void* sendbuf, void* recvbuf, int *recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_REDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, IERROR
MPI_REDUCE_SCATTER acts as if it first does an element-wise reduction on a vector of count = recvcounts[0] + ... + recvcounts[n-1] elements in the send buffer defined by sendbuf, count and datatype. Next, the resulting vector of results is split into n disjoint segments, where n is the number of processes in the group of comm. Segment i contains recvcounts[i] elements. The ith segment is sent to process i and stored in the receive buffer defined by recvbuf, recvcounts[i] and datatype.
Figure: Vector-matrix product. All vectors and matrices are distributed. The distribution is illustrated for four processes. Each process may have a different value for m and k.
MPI_Scan(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
INTEGER COUNT, DATATYPE, OP, COMM, IERROR
MPI_SCAN is used to perform a prefix reduction on data distributed across the group. The operation returns, in the receive buffer of the process with rank i, the reduction of the values in the send buffers of processes with ranks 0,...,i (inclusive). The type of operations supported, their semantics, and the constraints on send and receive buffers are as for MPI_REDUCE.
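For example, the following sketch computes, at each process, the inclusive prefix sum of the ranks; the choice of MPI_SUM over a single int is illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Process i receives 0 + 1 + ... + i, the inclusive prefix sum of the ranks. */
    MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("process %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}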
MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)
MPI_OP_CREATE( FUNCTION, COMMUTE, OP, IERROR) EXTERNAL FUNCTION
LOGICAL COMMUTE
INTEGER OP, IERROR
MPI_OP_CREATE binds a user-defined global operation to an op handle that can subsequently be used in MPI_REDUCE, MPI_ALLREDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN. The user-defined operation is assumed to be associative. If commute = true, then the operation should be both commutative and associative. If commute = false, then the order of operations is fixed and is defined to be in ascending, process rank order, beginning with process zero. The order of evaluation can be changed, taking advantage of the associativity of the operation. If commute = true then the order of evaluation can be changed, taking advantage of commutativity and associativity.
function is the user-defined function, which must have the following four arguments: invec, inoutvec, len and datatype.
The ANSI-C prototype for the function is the following.
typedef void MPI_User_function( void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
The Fortran declaration of the user-defined function appears below.
FUNCTION USER_FUNCTION( INVEC(*), INOUTVEC(*), LEN, TYPE)
<type> INVEC(LEN), INOUTVEC(LEN)
INTEGER LEN, TYPE
The datatype argument is a handle to the data type that was passed into the call to MPI_REDUCE. The user reduce function should be written such that the following holds: Let u[0], ... , u[len-1] be the len elements in the communication buffer described by the arguments invec, len and datatype when the function is invoked; let v[0], ... , v[len-1] be len elements in the communication buffer described by the arguments inoutvec, len and datatype when the function is invoked; let w[0], ... , w[len-1] be len elements in the communication buffer described by the arguments inoutvec, len and datatype when the function returns; then w[i] = u[i] ∘ v[i], for i = 0, ... , len-1, where ∘ is the reduce operation that the function computes.
Informally, we can think of invec and inoutvec as arrays of len elements that function is combining. The result of the reduction over-writes values in inoutvec, hence the name. Each invocation of the function results in the pointwise evaluation of the reduce operator on len elements: i.e., the function returns in inoutvec[i] the value invec[i] ∘ inoutvec[i], for i = 0, ... , len-1, where ∘ is the combining operation computed by the function.
General datatypes may be passed to the user function. However, use of datatypes that are not contiguous is likely to lead to inefficiencies.
No MPI communication function may be called inside the user function. MPI_ABORT may be called inside the function in case of an error.
MPI_Op_free(MPI_Op *op)
MPI_OP_FREE( OP, IERROR) INTEGER OP, IERROR
Marks a user-defined reduction operation for deallocation and sets op to MPI_OP_NULL. MPI_OP_NULL
The following example illustrates the use of user-defined reduction.
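The sketch below defines an element-wise complex product as a user-defined operation; the type Complex, the function myProd, and the data are illustrative, and the datatype is built with MPI_TYPE_CONTIGUOUS.

#include <mpi.h>

typedef struct { double real, imag; } Complex;

/* User-defined reduction function: element-wise complex product.
   inoutvec[i] is overwritten with invec[i] * inoutvec[i].        */
void myProd(void *inp, void *inoutp, int *len, MPI_Datatype *dptr)
{
    Complex *in = (Complex *) inp, *inout = (Complex *) inoutp;
    int i;
    for (i = 0; i < *len; ++i) {
        Complex c;
        c.real = inout->real * in->real - inout->imag * in->imag;
        c.imag = inout->real * in->imag + inout->imag * in->real;
        *inout = c;
        ++in; ++inout;
    }
}

int main(int argc, char *argv[])
{
    Complex a[100], answer[100];
    MPI_Op myOp;
    MPI_Datatype ctype;
    int i, root = 0;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 100; ++i) { a[i].real = 1.0; a[i].imag = 0.0; }  /* illustrative data */

    /* Describe a Complex as two contiguous doubles. */
    MPI_Type_contiguous(2, MPI_DOUBLE, &ctype);
    MPI_Type_commit(&ctype);

    /* The complex product is commutative and associative, so commute = 1 (true). */
    MPI_Op_create(myProd, 1, &myOp);

    /* Element-wise product of the 100 complex values across all processes. */
    MPI_Reduce(a, answer, 100, ctype, myOp, root, MPI_COMM_WORLD);

    MPI_Op_free(&myOp);
    MPI_Type_free(&ctype);
    MPI_Finalize();
    return 0;
}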
A correct, portable program must invoke collective communications so that deadlock will not occur, whether collective communications are synchronizing or not. The following examples illustrate dangerous use of collective routines.
Figure: A race condition causes non-deterministic matching of sends and receives. One cannot rely on synchronization from a broadcast to make the program deterministic.
Finally, in multithreaded implementations, one can have more than one concurrently executing collective communication call at a process. In these situations, it is the user's responsibility to ensure that the same communicator is not used concurrently by two different collective communication calls at the same process.
This section describes semantic terms used in this book.
It was the intent of the creators of the MPI standard to address several issues that augment the power and usefulness of point-to-point and collective communications. These issues are mainly concerned with the creation of portable, efficient and safe libraries and codes with MPI, and will be discussed in this chapter. This effort was driven by the need to overcome several limitations in many message passing systems. The next few sections describe these limitations.
In some applications it is desirable to divide up the processes to allow different groups of processes to perform independent work. For example, we might want an application to utilize some of its processes to predict the weather based on data already processed, while the remaining processes initially process new data. This would allow the application to regularly complete a weather forecast. However, if no new data is available for processing we might want the same application to use all of its processes to make a weather forecast.
Being able to do this efficiently and easily requires the application to be able to logically divide the processes into independent subsets. It is important that these subsets are logically the same as the initial set of processes. For example, the module to predict the weather might use process 0 as the master process to dole out work. If subsets of processes are not numbered in a consistent manner with the initial set of processes, then there may be no process 0 in one of the two subsets. This would cause the weather prediction model to fail.
Applications also need to have collective operations work on a subset of processes. If collective operations only work on the initial set of processes then it is impossible to create independent subsets that perform collective operations. Even if the application does not need independent subsets, having collective operations work on subsets is desirable. Since the time to complete most collective operations increases with the number of processes, limiting a collective operation to only the processes that need to be involved yields much better scaling behavior. For example, if a matrix computation needs to broadcast information along the diagonal of a matrix, only the processes containing diagonal elements should be involved.
Library routines have historically had difficulty in isolating their own message passing calls from those in other libraries or in the user's code. For example, suppose the user's code posts a non-blocking receive with both tag and source wildcarded before it enters a library routine. The first send in the library may be received by the user's posted receive instead of the one posted by the library. This will undoubtedly cause the library to fail.
The solution to this difficulty is to allow a module to isolate its message passing calls from the other modules. Some applications may only determine at run time which modules will run so it can be impossible to statically isolate all modules in advance. This necessitates a run time callable system to perform this function.
Writers of libraries often want to expand the functionality of the message passing system. For example, the library may want to create its own special and unique collective operation. Such a collective operation may be called many times if the library is called repetitively or if multiple libraries use the same collective routine. To perform the collective operation efficiently may require a moderately expensive calculation up front, such as determining the best communication pattern. It is most efficient to reuse the up-front calculations if the same set of processes is involved. This is most easily done by attaching the results of the up-front calculation to the set of processes involved. These types of optimization are routinely done internally in message passing systems. The desire is to allow others to perform similar optimizations in the same way.
There are two philosophies used to provide mechanisms for creating subgroups, isolating messages, etc. One point of view is to allow the user total control over the process. This allows maximum flexibility to the user and can, in some cases, lead to fast implementations. The other point of view is to have the message passing system control these functions. This adds a degree of safety while limiting the mechanisms to those provided by the system. MPI chose to use the latter approach. The added safety was deemed to be very important for writing portable message passing codes. Since the MPI system controls these functions, modules that are written independently can safely perform these operations without worrying about conflicts. As in other areas, MPI also decided to provide a rich set of functions so that users would have the functionality they are likely to need.
The above features and several more are provided in MPI through communicators. The concepts behind communicators encompass several central and fundamental ideas in MPI. The importance of communicators can be seen by the fact that they are present in most calls in MPI. There are several reasons that these features are encapsulated into a single MPI object. One reason is that it simplifies calls to MPI functions. Grouping logically related items into communicators substantially reduces the number of calling arguments. A second reason is it allows for easier extensibility. Both the MPI system and the user can add information onto communicators that will be passed in calls without changing the calling arguments. This is consistent with the use of opaque objects throughout MPI.
A group is an ordered set of process identifiers (henceforth processes); processes are implementation-dependent objects. Each process in a group is associated with an integer rank. Ranks are contiguous and start from zero. Groups are represented by opaque group objects, and hence cannot be directly transferred from one process to another.
There is a special pre-defined group: MPI_GROUP_EMPTY, which is a group with no members. The predefined constant MPI_GROUP_NULL is the value used for invalid group handles. MPI_GROUP_EMPTY, which is a valid handle to an empty group, should not be confused with MPI_GROUP_NULL, which is an invalid handle. The former may be used as an argument to group operations; the latter, which is returned when a group is freed, is not a valid argument.
Group operations are discussed in Section .
A communicator is an opaque object with a number of attributes, together with simple rules that govern its creation, use and destruction. The communicator specifies a communication domain which can be used for point-to-point communications. An intracommunicator is used for communicating within a single group of processes; we call such communication intra-group communication. An intracommunicator has two fixed attributes: the process group and the topology describing the logical layout of the processes in the group. Process topologies are the subject of Chapter . Intracommunicators are also used for collective operations within a group of processes.
An intercommunicator is used for point-to-point communication between two disjoint groups of processes. We call such communication inter-group communication. The fixed attributes of an intercommunicator are the two groups. No topology is associated with an intercommunicator. In addition to fixed attributes, a communicator may also have user-defined attributes which are associated with the communicator using MPI's caching mechanism, as described in Section . The table below summarizes the differences between intracommunicators and intercommunicators.

Functionality             Intracommunicator    Intercommunicator
Number of groups          1                    2
Communication safety      yes                  yes
Collective operations     yes                  no
Topologies                yes                  no
Caching                   yes                  yes
Intracommunicator operations are described in Section , and intercommunicator operations are discussed in Section .
Any point-to-point or collective communication occurs in MPI within a communication domain. Such a communication domain is represented by a set of communicators with consistent values, one at each of the participating processes; each communicator is the local representation of the global communication domain. If this domain is for intra-group communication then all the communicators are intracommunicators, and all have the same group attribute. Each communicator identifies all the other corresponding communicators.
One can think of a communicator as an array of links to other communicators. An intra-group communication domain is specified by a set of communicators such that their links form a complete graph (each communicator is linked to all other communicators in the set) and links have consistent indices: at each communicator, the ith link points to the communicator of the process with rank i.
Figure: Distributed data structure for intra-communication domain.
We discuss inter-group communication domains in Section .
In point-to-point communication, matching send and receive calls should have communicator arguments that represent the same communication domains. The rank of the processes is interpreted relative to the group, or groups, associated with the communicator. Thus, in an intra-group communication domain, process ranks are relative to the group associated with the communicator.
Similarly, a collective communication call involves all processes in the group of an intra-group communication domain, and all processes should use a communicator argument that represents this domain. Intercommunicators may not be used in collective communication operations.
We shall sometimes say, for simplicity, that two communicators are the same, if they represent the same communication domain. One should not be misled by this abuse of language: Each communicator is really a distinct object, local to a process. Furthermore, communicators that represent the same communication domain may have different attribute values attached to them at different processes.
MPI is designed to ensure that communicator constructors always generate consistent communicators that are a valid representation of the newly created communication domain. This is done by requiring that a new intracommunicator be constructed out of an existing parent communicator, and that this be a collective operation over all processes in the group associated with the parent communicator. The group associated with a new intracommunicator must be a subgroup of that associated with the parent intracommunicator. Thus, all the intracommunicator constructor routines described in Section have an existing communicator as an input argument, and the newly created intracommunicator as an output argument. This leads to a chicken-and-egg situation since we must have an existing communicator to create a new communicator. This problem is solved by the provision of a predefined intracommunicator, MPI_COMM_WORLD, which is available for use once MPI_COMM_WORLD the routine MPI_INIT has been called. MPI_COMM_WORLD, which has as its group attribute all processes with which the local process can communicate, can be used as the parent communicator in constructing new communicators. A second pre-defined communicator, MPI_COMM_SELF, is also MPI_COMM_SELF available for use after calling MPI_INIT and has as its associated group just the process itself. MPI_COMM_SELF is provided as a convenience since it could easily be created out of MPI_COMM_WORLD.
An MPI program consists of autonomous processes, executing their own (C or Fortran) code, in an MIMD style. The codes executed by each process need not be identical. processes The processes communicate via calls to MPI communication primitives. Typically, each process executes in its own address space, although shared-memory implementations of MPI are possible. This document specifies the behavior of a parallel program assuming that only MPI calls are used for communication. The interaction of an MPI program with other possible means of communication (e.g., shared memory) is not specified.
MPI does not specify the execution model for each process. threads A process can be sequential, or can be multi-threaded, with threads possibly executing concurrently. Care has been taken to make MPI ``thread-safe,'' by avoiding the use of implicit state. The desired interaction of MPI with threads is that concurrent threads be all allowed to execute MPI calls, and calls be reentrant; a blocking MPI call blocks only the invoking thread, allowing the scheduling of another thread.
MPI does not provide mechanisms to specify the initial allocation of processes to an MPI computation and their binding to physical processors. process allocation It is expected that vendors will provide mechanisms to do so either at load time or at run time. Such mechanisms will allow the specification of the initial number of required processes, the code to be executed by each initial process, and the allocation of processes to processors. Also, the current standard does not provide for dynamic creation or deletion of processes during program execution (the total number of processes is fixed); however, MPI design is consistent with such extensions, which are now under consideration (see Section ). Finally, MPI always identifies processes according to their relative rank in a group, that is, consecutive integers in the range 0..groupsize-1.
The current practice in many codes is that there is a unique, predefined communication universe that includes all processes available when the parallel program is initiated; the processes are assigned consecutive ranks. Participants in a point-to-point communication are identified by their rank; a collective communication (such as broadcast) always involves all processes. As such, most current message passing libraries have no equivalent argument to the communicator. It is implicitly all the processes as ranked by the system.
This practice can be followed in MPI by using the predefined communicator MPI_COMM_WORLD wherever a communicator argument is required. Thus, using current practice in MPI is very easy. Users that are content with it can ignore most of the information in this chapter. However, everyone should seriously consider understanding the potential risks in using MPI_COMM_WORLD to avoid unexpected behavior of their programs.
This section describes the manipulation of process groups in MPI. These operations are local and their execution does not require interprocess communication. MPI allows manipulation of groups outside of communicators, but groups can only be used for message passing inside of a communicator.
MPI_Group_size(MPI_Group group, int *size)
MPI_GROUP_SIZE(GROUP, SIZE, IERROR)
INTEGER GROUP, SIZE, IERROR
MPI_GROUP_SIZE returns the number of processes in the group. Thus, if group = MPI_GROUP_EMPTY then the call will return size = 0. (On the other hand, a call with group = MPI_GROUP_NULL is erroneous.)
MPI_Group_rank(MPI_Group group, int *rank)
MPI_GROUP_RANK(GROUP, RANK, IERROR)
INTEGER GROUP, RANK, IERROR
MPI_GROUP_RANK returns the rank of the calling process in group. If the process is not a member of group then MPI_UNDEFINED is returned.
MPI_Group_translate_ranks (MPI_Group group1, int n, int *ranks1, MPI_Group group2, int *ranks2)
MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)
INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR
MPI_GROUP_TRANSLATE_RANKS maps the ranks of a set of processes in group1 to their ranks in group2. Upon return, the array ranks2 contains the ranks in group2 for the processes in group1 with ranks listed in ranks1. If a process in group1 found in ranks1 does not belong to group2 then MPI_UNDEFINED is returned in ranks2.
This function is important for determining the relative numbering of the same processes in two different groups. For instance, if one knows the ranks of certain processes in the group of MPI_COMM_WORLD, one might want to know their ranks in a subset of that group.
MPI_Group_compare(MPI_Group group1,MPI_Group group2, int *result)
MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERROR)
INTEGER GROUP1, GROUP2, RESULT, IERROR
MPI_GROUP_COMPARE returns the relationship between two groups. MPI_IDENT results if the group members and group order are exactly the same in both groups. This happens, for instance, if group1 and group2 are handles to the same object. MPI_SIMILAR results if the group members are the same but the order is different. MPI_UNEQUAL results otherwise.
Group constructors are used to construct new groups from existing groups, using various set operations. These are local operations, and distinct groups may be defined on different processes; a process may also define a group that does not include itself. Consistent definitions are required when groups are used as arguments in communicator-building functions. MPI does not provide a mechanism to build a group from scratch, but only from other, previously defined groups. The base group, upon which all other groups are defined, is the group associated with the initial communicator MPI_COMM_WORLD (accessible through the function MPI_COMM_GROUP).
Local group creation functions are useful since some applications have the needed information distributed on all nodes. Thus, new groups can be created locally without communication. This can significantly reduce the necessary communication in creating a new communicator to use this group.
In Section , communicator creation functions are described which also create new groups. These are more general group creation functions where the information does not have to be local to each node. They are part of communicator creation since they will normally require communication for group creation. Since communicator creation may also require communication, it is logical to group these two functions together for this case.
MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_GROUP(COMM, GROUP, IERROR)
INTEGER COMM, GROUP, IERROR
MPI_COMM_GROUP returns in group a handle to the group of comm.
The following three functions do standard set type operations. The only difference is that ordering is important so that ranks are consistently defined.
MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERROR)
INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
MPI_Group_intersection(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERROR)
INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
MPI_Group_difference(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERROR)
INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
The operations are defined as follows:
union: all elements of the first group (group1), followed by all elements of the second group (group2) that are not in the first;
intersection: all elements of the first group that are also in the second group, ordered as in the first group;
difference: all elements of the first group that are not in the second group, ordered as in the first group.
The new group can be empty, that is, equal to MPI_GROUP_EMPTY. MPI_GROUP_EMPTY
MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
MPI_GROUP_INCL(GROUP, N, RANKS, NEWGROUP, IERROR)
INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR
The function MPI_GROUP_INCL creates a group newgroup that consists of the n processes in group with ranks ranks[0], ..., ranks[n-1]; the process with rank i in newgroup is the process with rank ranks[i] in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct, or else the call is erroneous. If n = 0, then newgroup is MPI_GROUP_EMPTY. This function can, for instance, be used to reorder the elements of a group.
Assume that newgroup was created by a call to MPI_GROUP_INCL(group, n, ranks, newgroup). Then, a subsequent call to MPI_GROUP_TRANSLATE_RANKS(group, n, ranks, newgroup, newranks) will return newranks[i] = i, i = 0, ..., n-1 (in C) or newranks(i) = i-1, i = 1, ..., n (in Fortran).
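The following sketch illustrates this relationship, assuming the program is run with at least three processes; the choice of ranks 2, 0, 1 is illustrative.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])      /* assumes at least three processes */
{
    MPI_Group world_group, newgroup;
    int ranks[3] = {2, 0, 1}, newranks[3], i;

    MPI_Init(&argc, &argv);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* newgroup contains the processes with world ranks 2, 0, 1, in that order. */
    MPI_Group_incl(world_group, 3, ranks, &newgroup);

    /* Translating the same ranks back yields newranks[i] = i. */
    MPI_Group_translate_ranks(world_group, 3, ranks, newgroup, newranks);
    for (i = 0; i < 3; ++i)
        printf("world rank %d has rank %d in newgroup\n", ranks[i], newranks[i]);

    MPI_Group_free(&newgroup);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}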
MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
MPI_GROUP_EXCL(GROUP, N, RANKS, NEWGROUP, IERROR)
INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR
The function MPI_GROUP_EXCL creates a group of processes newgroup that is obtained by deleting from group those processes with ranks ranks[0],..., ranks[n-1] in C or ranks[1],..., ranks[n] in Fortran. The ordering of processes in newgroup is identical to the ordering in group. Each of the n elements of ranks must be a valid rank in group and all elements must be distinct; otherwise, the call is erroneous. If n = 0, then newgroup is identical to group.
Suppose one calls MPI_GROUP_INCL(group, n, ranks, newgroupi) and MPI_GROUP_EXCL(group, n, ranks, newgroupe). The call MPI_GROUP_UNION(newgroupi, newgroupe, newgroup) would return in newgroup a group with the same members as group but possibly in a different order. The call MPI_GROUP_INTERSECTION(newgroupi, newgroupe, newgroup) would return MPI_GROUP_EMPTY.
MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_INCL(GROUP, N, RANGES, NEWGROUP, IERROR)
INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR
Each triplet in ranges specifies a sequence of ranks for processes to be included in the newly created group. The newly created group contains the processes specified by the first triplet, followed by the processes specified by the second triplet, etc.
Generally, if ranges consist of the triplets
(first(1), last(1), stride(1)), ..., (first(n), last(n), stride(n)),
then newgroup consists of the sequence of processes in group with ranks
first(1), first(1) + stride(1), ..., first(1) + floor((last(1) - first(1))/stride(1))·stride(1), ...,
first(n), first(n) + stride(n), ..., first(n) + floor((last(n) - first(n))/stride(n))·stride(n).
Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the call is erroneous. Note that a call may have first(i) > last(i), and stride(i) may be negative, but cannot be zero.
The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the included ranks and passing the resulting array of ranks and other arguments to MPI_GROUP_INCL. A call to MPI_GROUP_INCL is equivalent to a call to MPI_GROUP_RANGE_INCL with each rank i in ranks replaced by the triplet (i,i,1) in the argument ranges.
MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_EXCL(GROUP, N, RANGES, NEWGROUP, IERROR)
INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR
Each triplet in ranges specifies a sequence of ranks for processes to be excluded from the newly created group. The newly created group contains the remaining processes, ordered as in group.
Each computed rank must be a valid rank in group and all computed ranks must be distinct, or else the call is erroneous.
The functionality of this routine is specified to be equivalent to expanding the array of ranges to an array of the excluded ranks and passing the resulting array of ranks and other arguments to MPI_GROUP_EXCL. A call to MPI_GROUP_EXCL is equivalent to a call to MPI_GROUP_RANGE_EXCL with each rank i in ranks replaced by the triplet (i,i,1) in the argument ranges.
MPI_Group_free(MPI_Group *group)
MPI_GROUP_FREE(GROUP, IERROR)
INTEGER GROUP, IERROR
This operation marks a group object for deallocation. The handle group is set to MPI_GROUP_NULL by the call. MPI_GROUP_NULL Any ongoing operation using this group will complete normally.
This section describes the manipulation of communicators in MPI. Operations that access communicators are local and their execution does not require interprocess communication. Operations that create communicators are collective and may require interprocess communication. We describe the behavior of these functions, assuming that their comm argument is an intracommunicator; we describe later in Section their semantics for intercommunicator arguments.
The following are all local operations.
MPI_Comm_size(MPI_Comm comm, int *size)
MPI_COMM_SIZE(COMM, SIZE, IERROR)
INTEGER COMM, SIZE, IERROR
MPI_COMM_SIZE returns the size of the group associated with comm.
This function indicates the number of processes involved in an intracommunicator. For MPI_COMM_WORLD, it indicates the total number of processes available at initialization time. (For this version of MPI, this is also the total number of processes available throughout the computation.)
MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_COMM_RANK(COMM, RANK, IERROR)
INTEGER COMM, RANK, IERROR
MPI_COMM_RANK indicates the rank of the process that calls it, in the range from 0 to size-1, where size is the return value of MPI_COMM_SIZE. This rank is relative to the group associated with the intracommunicator comm. Thus, MPI_COMM_RANK(MPI_COMM_WORLD, rank) returns in rank the ``absolute'' rank of the calling process in the global communication group of MPI_COMM_WORLD; MPI_COMM_RANK(MPI_COMM_SELF, rank) returns rank = 0.
MPI_Comm_compare(MPI_Comm comm1,MPI_Comm comm2, int *result)
MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERROR)
INTEGER COMM1, COMM2, RESULT, IERROR
MPI_COMM_COMPARE is used to find the relationship between two intracommunicators. MPI_IDENT results if and only if comm1 and comm2 are handles for the same object (representing the same communication domain). MPI_CONGRUENT results if the underlying groups are identical in constituents and rank order (the communicators represent two distinct communication domains with the same group attribute). MPI_SIMILAR results if the group members of both communicators are the same but the rank order differs. MPI_UNEQUAL results otherwise. The groups associated with two different communicators could be gotten via MPI_COMM_GROUP and then used in a call to MPI_GROUP_COMPARE. If MPI_COMM_COMPARE gives MPI_CONGRUENT then MPI_GROUP_COMPARE will give MPI_IDENT. If MPI_COMM_COMPARE gives MPI_SIMILAR then MPI_GROUP_COMPARE will give MPI_SIMILAR.
The following are collective functions that are invoked by all processes in the group associated with comm.
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
MPI_COMM_DUP(COMM, NEWCOMM, IERROR)
INTEGER COMM, NEWCOMM, IERROR
MPI_COMM_DUP creates a new intracommunicator, newcomm, with the same fixed attributes (group, or groups, and topology) as the input intracommunicator, comm. The newly created communicators at the processes in the group of comm define a new, distinct communication domain, with the same group as the old communicators. The function can also be used to replicate intercommunicators.
The association of user-defined (or cached) attributes with newcomm is controlled by the copy callback function specified when the attribute was attached to comm. For each key value, the respective copy callback function determines the attribute value associated with this key in the new communicator. User-defined attributes are discussed in Section .
MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)
INTEGER COMM, GROUP, NEWCOMM, IERROR
This function creates a new intracommunicator newcomm with communication group defined by group. No attributes propagate from comm to newcomm. The function returns MPI_COMM_NULL to processes that are not in group. The communicators returned at the processes in group define a new intra-group communication domain.
The call is erroneous if not all group arguments have the same value on different processes, or if group is not a subset of the group associated with comm (but it does not have to be a proper subset). Note that the call is to be executed by all processes in comm, even if they do not belong to the new group.
MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)
INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR
This function partitions the group associated with comm into disjoint subgroups, one for each value of color. Each subgroup contains all processes of the same color. Within each subgroup, the processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old group. A new communication domain is created for each subgroup and a handle to the representative communicator is returned in newcomm. A process may supply the color value MPI_UNDEFINED to not be a member of any new group, in which case newcomm returns MPI_COMM_NULL. This is a collective call, but each process is permitted to provide different values for color and key. The value of color must be nonnegative.
A call to MPI_COMM_CREATE(comm, group, newcomm) is equivalent to a call to MPI_COMM_SPLIT(comm, color, key, newcomm), where all members of group provide color = 0 and key = rank in group, and all processes that are not members of group provide color = MPI_UNDEFINED. The function MPI_COMM_SPLIT allows more general partitioning of a group into one or more subgroups with optional reordering.
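For instance, a minimal sketch (assuming a communicator comm and a caller-chosen column count ncols, neither of which appears in the original text) splits comm into one subcommunicator per row of a conceptual process grid:

int rank;
MPI_Comm rowcomm;

MPI_Comm_rank(comm, &rank);
MPI_Comm_split(comm,
               rank / ncols,    /* color: row index             */
               rank % ncols,    /* key: ordering within the row */
               &rowcomm);
/* processes with equal color share the same rowcomm, ranked by key */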
MPI_Comm_free(MPI_Comm *comm)
MPI_COMM_FREE(COMM, IERROR)
INTEGER COMM, IERROR
This collective operation marks the communication object for deallocation. The handle is set to MPI_COMM_NULL. Any pending operations that use this communicator will complete normally; the object is actually deallocated only if there are no other active references to it. This call applies to intra- and intercommunicators. The delete callback functions for all cached attributes (see Section ) are called in arbitrary order. It is erroneous to attempt to free MPI_COMM_NULL.
This section illustrates the design of parallel libraries, and the use of communicators to ensure the safety of internal library communications.
Assume that a new parallel library function is needed that is similar to the MPI broadcast function, except that it is not required that all processes provide the rank of the root process. Instead of the root argument of MPI_BCAST, the function takes a Boolean flag input that is true if the calling process is the root, and false otherwise. To simplify the example we make another assumption: namely, that the datatype of the send buffer is identical to the datatype of the receive buffer, so that only one datatype argument is needed. A possible code for such a modified broadcast function is shown below.
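Since the original example code is not reproduced here, the following is only a minimal C sketch of such a function, written under the assumption that the message is forwarded around a ring; because non-root processes do not know the root's rank, they receive with MPI_ANY_SOURCE (a ``dontcare''), which is what exposes the function to the races discussed below.

#include "mpi.h"

void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
{
    int rank, size;
    MPI_Status status;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;

    if (isroot) {
        /* start the ring ... */
        MPI_Send(buf, count, type, (rank + 1) % size, 0, comm);
        /* ... and absorb the message when it returns */
        MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, comm, &status);
    } else {
        /* the root's rank is unknown, so receive from any source */
        MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, comm, &status);
        /* forward to the next process in the ring */
        MPI_Send(buf, count, type, (rank + 1) % size, 0, comm);
    }
}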
Consider a collective invocation to the broadcast function just defined, in the context of the program segment shown in the example below, for a group of three processes.
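The original program segment is not reproduced here; a caller consistent with the discussion (buf, msg, their counts and types, rank, status, and comm are assumed to be declared elsewhere) could look like this:

switch (rank) {
case 0:
    mcast(buf, count, type, 1, comm);    /* root     */
    break;
case 1:
    MPI_Recv(msg, mcount, mtype, MPI_ANY_SOURCE, 0, comm, &status);
    mcast(buf, count, type, 0, comm);    /* not root */
    break;
case 2:
    MPI_Send(msg, mcount, mtype, 1, 0, comm);
    mcast(buf, count, type, 0, comm);    /* not root */
    break;
}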
A (correct) execution of this code is illustrated in Figure , with arrows used to indicate communications.
Figure: Correct invocation of mcast
Since the invocation of mcast at the three processes is not simultaneous, it may actually happen that mcast is invoked at process 0 before process 1 executed the receive in the caller code. This receive, rather than being matched by the caller code send at process 2, might be matched by the first send of process 0 within mcast. The erroneous execution illustrated in Figure results.
Figure: Erroneous invocation of mcast
How can such erroneous execution be prevented? One option is to enforce synchronization at the entry to mcast, and, for symmetric reasons, at the exit from mcast. E.g., the first and last executable statements within the code of mcast would be a call to MPI_Barrier(comm). This, however, introduces two superfluous synchronizations that will slow down execution. Furthermore, this synchronization works only if the caller code obeys the convention that messages sent before a collective invocation should also be received at their destination before the matching invocation. Consider an invocation to mcast() in a context that does not obey this restriction, as shown in the example below.
The desired execution of the code in this example is illustrated in Figure .
Figure: Correct invocation of mcast
However, a more likely matching of sends with receives will lead to the erroneous execution illustrated in Figure .
Figure: Erroneous invocation of mcast
Erroneous results may also occur if a process that is not in the group of comm and does not participate in the collective invocation of mcast sends a message to processes one or two in the group of comm.
A more robust solution to this problem is to use a distinct communication domain for communication within the library, which is not used by the caller code. This will ensure that messages sent by the library are not received outside the library, and vice-versa. The modified code of the function mcast is shown below.
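A sketch of such a communicator-safe version, under the same assumptions as the earlier sketch, is:

void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
{
    int rank, size;
    MPI_Status status;
    MPI_Comm pcomm;                      /* private communication domain */

    MPI_Comm_dup(comm, &pcomm);
    MPI_Comm_rank(pcomm, &rank);
    MPI_Comm_size(pcomm, &size);
    if (size > 1) {
        if (isroot) {
            MPI_Send(buf, count, type, (rank + 1) % size, 0, pcomm);
            MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, pcomm, &status);
        } else {
            MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, pcomm, &status);
            MPI_Send(buf, count, type, (rank + 1) % size, 0, pcomm);
        }
    }
    MPI_Comm_free(&pcomm);
}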
This code suffers the penalty of one communicator allocation and deallocation at each invocation. We show in the next section, in Example , how to avoid this overhead, by using a preallocated communicator.
As the previous examples showed, a communicator provides a ``scope'' for collective invocations. The communicator, which is passed as a parameter to the call, specifies the group of processes that participate in the call and provides a private communication domain for communications within the callee body. In addition, it may carry information about the logical topology of the executing processes. It is often useful to attach additional persistent values to this scope, e.g., initialization parameters for a library, or additional communicators to provide a separate, private communication domain.
MPI provides a caching facility that allows an application to attach arbitrary pieces of information, called attributes, to both intra- and intercommunicators. More precisely, the caching facility allows a portable library to do the following:
Each attribute is associated with a key. To provide safety, MPI internally generates key values. MPI functions are provided which allow the user to allocate and deallocate key values (MPI_KEYVAL_CREATE and MPI_KEYVAL_FREE). Once a key is allocated by a process, it can be used to attach one attribute to any communicator defined at that process. Thus, the allocation of a key can be thought of as creating an empty box at each current or future communicator object at that process; this box has a lock that matches the allocated key. (The box is ``virtual'': one need not allocate any actual space before an attempt is made to store something in the box.)
Once the key is allocated, the user can set or access attributes associated with this key. The MPI call MPI_ATTR_PUT can be used to set an attribute. This call stores an attribute, or replaces an attribute, in one box: the box attached to the specified communicator whose lock matches the specified key.
The call MPI_ATTR_GET can be used to access the attribute value associated with a given key and communicator. I.e., it allows one to access the content of the box attached to the specified communicator whose lock matches the specified key. This call is valid even if the box is empty, e.g., if the attribute was never set. In such a case, a special ``empty'' value is returned.
Finally, the call MPI_ATTR_DELETE allows one to delete an attribute. I.e., it allows one to empty the box attached to the specified communicator whose lock matches the specified key.
To be general, the attribute mechanism must be able to store arbitrary user information. On the other hand, attributes must be of a fixed, predefined type, both in Fortran and C - the type specified by the MPI functions that access or update attributes. Attributes are defined in C to be of type void *. Generally, such an attribute will be a pointer to a user-defined data structure or a handle to an MPI opaque object. In Fortran, attributes are of type INTEGER. These can be handles to opaque MPI objects or indices to user-defined tables.
An attribute, from the MPI viewpoint, is a pointer or an integer. An attribute, from the application viewpoint, may contain arbitrary information that is attached to the ``MPI attribute''. User-defined attributes are ``copied'' when a new communicator is created by a call to MPI_COMM_DUP; they are ``deleted'' when a communicator is deallocated by a call to MPI_COMM_FREE. Because of the arbitrary nature of the information that is copied or deleted, the user has to specify the semantics of attribute copying or deletion. The user does so by providing copy and delete callback functions when the attribute key is allocated (by a call to MPI_KEYVAL_CREATE). Predefined, default copy and delete callback functions are available.
All attribute manipulation functions are local and require no communication. Two communicator objects at two different processes that represent the same communication domain may have a different set of attribute keys and different attribute values associated with them.
MPI reserves a set of predefined key values in order to associate with MPI_COMM_WORLD information about the execution environment, at MPI initialization time. These attribute keys are discussed in Chapter . These keys cannot be deallocated and the associated attributes cannot be updated by the user. Otherwise, they behave like user-defined attributes.
MPI provides the following services related to caching. They are all process local.
MPI_Keyval_create(MPI_Copy_function *copy_fn, MPI_Delete_function *delete_fn, int *keyval, void* extra_state)
MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE, IERROR)
EXTERNAL COPY_FN, DELETE_FN
INTEGER KEYVAL, EXTRA_STATE, IERROR
MPI_KEYVAL_CREATE allocates a new attribute key value. Key values are unique in a process. Once allocated, the key value can be used to associate attributes and access them on any locally defined communicator. The special key value MPI_KEYVAL_INVALID is never returned by MPI_KEYVAL_CREATE. Therefore, it can be used for static initialization of key variables, to indicate an ``unallocated'' key.
The copy_fn function is invoked when a communicator is duplicated by MPI_COMM_DUP. copy_fn should be of type MPI_Copy_function, which is defined as follows:
typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval, void *extra_state, void *attribute_val_in, void *attribute_val_out, int *flag)
A Fortran declaration for such a function is as follows:
SUBROUTINE COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERR)
INTEGER OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, IERR
LOGICAL FLAG
Whenever a communicator is replicated using the function MPI_COMM_DUP, all callback copy functions for attributes that are currently set are invoked (in arbitrary order). Each call to the copy callback is passed as input parameters the old communicator oldcomm, the key value keyval, the additional state extra_state that was provided to MPI_KEYVAL_CREATE when the key value was created, and the current attribute value attribute_val_in. If it returns flag = false, then the attribute is deleted in the duplicated communicator. Otherwise, when flag = true, the new attribute value is set to the value returned in attribute_val_out. The function returns MPI_SUCCESS on success and an error code on failure (in which case MPI_COMM_DUP will fail).
copy_fn may be specified as MPI_NULL_COPY_FN or MPI_DUP_FN from either C or Fortran. MPI_NULL_COPY_FN is a function that does nothing other than returning flag = 0 and MPI_SUCCESS; i.e., the attribute is not copied. MPI_DUP_FN sets flag = 1, returns the value of attribute_val_in in attribute_val_out, and returns MPI_SUCCESS; i.e., the attribute value is copied, with no side-effects.
Analogous to copy_fn is a callback deletion function, defined as follows. The delete_fn function is invoked when a communicator is deleted by MPI_COMM_FREE or by a call to MPI_ATTR_DELETE or MPI_ATTR_PUT. delete_fn should be of type MPI_Delete_function, which is defined as follows:
typedef int MPI_Delete_function(MPI_Comm comm, int keyval, void *attribute_val, void *extra_state);
A Fortran declaration for such a function is as follows:
SUBROUTINE DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE, IERR)
INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE, IERR
Whenever a communicator is deleted using the function MPI_COMM_FREE, all callback delete functions for attributes that are currently set are invoked (in arbitrary order). In addition the callback delete function for the deleted attribute is invoked by MPI_ATTR_DELETE and MPI_ATTR_PUT. The function is passed as input parameters the communicator comm, the key value keyval, the current attribute value attribute_val, and the additional state extra_state that was passed to MPI_KEYVAL_CREATE when the key value was allocated. The function returns MPI_SUCCESS on success and an error code on failure (in which case MPI_COMM_FREE will fail).
delete_fn may be specified as MPI_NULL_DELETE_FN from either C or Fortran; MPI_NULL_DELETE_FN is a function that does nothing, other than returning MPI_SUCCESS.
MPI_Keyval_free(int *keyval)
MPI_KEYVAL_FREE(KEYVAL, IERROR)
INTEGER KEYVAL, IERROR
MPI_KEYVAL_FREE deallocates an attribute key value. This function sets the value of keyval to MPI_KEYVAL_INVALID. Note that it is not erroneous to free an attribute key that is in use (i.e., has attached values for some communicators); the key value is not actually deallocated until after no attribute values are locally attached to this key. All such attribute values need to be explicitly deallocated by the program, either via calls to MPI_ATTR_DELETE that free one attribute instance, or by calls to MPI_COMM_FREE that free all attribute instances associated with the freed communicator.
MPI_Attr_put(MPI_Comm comm, int keyval, void* attribute_val)
MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE_VAL, IERROR)
INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
MPI_ATTR_PUT associates the value attribute_val with the key keyval on communicator comm. If a value is already associated with this key on the communicator, then the outcome is as if MPI_ATTR_DELETE was first called to delete the previous value (and the callback function delete_fn was executed), and a new value was next stored. The call is erroneous if there is no key with value keyval; in particular MPI_KEYVAL_INVALID is an erroneous value for keyval.
MPI_Attr_get(MPI_Comm comm, int keyval, void *attribute_val, int *flag)
MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR)
INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
LOGICAL FLAG
MPI_ATTR_GET retrieves an attribute value by key. The call is erroneous if there is no key with value keyval. In particular MPI_KEYVAL_INVALID is an erroneous value for keyval. On the other hand, the call is correct if the key value exists, but no attribute is attached on comm for that key; in such a case, the call returns flag = false. If an attribute is attached on comm to keyval, then the call returns flag = true, and returns the attribute value in attribute_val.
MPI_Attr_delete(MPI_Comm comm, int keyval)
MPI_ATTR_DELETE(COMM, KEYVAL, IERROR)
INTEGER COMM, KEYVAL, IERROR
MPI_ATTR_DELETE deletes the attribute attached to key keyval on comm. This function invokes the attribute delete function delete_fn specified when the keyval was created. The call will fail if there is no key with value keyval or if the delete_fn function returns an error code other than MPI_SUCCESS. On the other hand, the call is correct even if no attribute is currently attached to keyval on comm.
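The original code referred to below is not reproduced; the following sketch, only one possible realization using the caching functions just described, keeps a private duplicate of comm as a cached attribute so that the duplication cost is paid once per communicator rather than once per invocation:

#include <stdlib.h>
#include "mpi.h"

static int mcast_key = MPI_KEYVAL_INVALID;    /* statically "unallocated" key */

void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
{
    int flag;
    MPI_Comm *pcomm;                          /* cached private communicator */

    if (mcast_key == MPI_KEYVAL_INVALID)      /* first invocation in this process */
        MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                          &mcast_key, NULL);

    MPI_Attr_get(comm, mcast_key, &pcomm, &flag);
    if (!flag) {                              /* first invocation on this comm */
        pcomm = (MPI_Comm *)malloc(sizeof(MPI_Comm));
        MPI_Comm_dup(comm, pcomm);
        MPI_Attr_put(comm, mcast_key, pcomm);
    }
    /* ... broadcast over *pcomm exactly as in the earlier ring sketch ... */
}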
The code above dedicates a statically allocated private communicator for the use of mcast. This segregates communication within the library from communication outside the library. However, this approach does not provide separation of communications belonging to distinct invocations of the same library function, since they all use the same communication domain. Consider two successive collective invocations of mcast by four processes, where process 0 is the broadcast root in the first one, and process 3 is the root in the second one. The intended execution and communication flow for these two invocations is illustrated in Figure .
Figure: Correct execution of two successive invocations of mcast
However, there is a race between messages sent by the first invocation of mcast, from process 0 to process 1, and messages sent by the second invocation of mcast, from process 3 to process 1. The erroneous execution illustrated in Figure may occur, where messages sent by the second invocation overtake messages from the first invocation. This phenomenon is known as backmasking.
Figure: Erroneous execution of two successive invocations of mcast
How can we avoid backmasking? One option is to revert to the approach in Example , where a separate communication domain is generated for each invocation. Another option is to add a barrier synchronization, either at the entry or at the exit from the library call. Yet another option is to rewrite the library code, so as to prevent the nondeterministic race. The race occurs because receives with dontcares are used. It is often possible to avoid the use of such constructs. Unfortunately, avoiding dontcares leads to a less efficient implementation of mcast. A possible alternative is to use increasing tag numbers to disambiguate successive invocations of mcast. An ``invocation count'' can be cached with each communicator, as an additional library attribute. The resulting code is shown below.
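A sketch of this alternative (again not the original code; for brevity it sends over comm directly, whereas a real library would combine it with the cached private communicator of the previous sketch) caches an invocation counter with the communicator and uses it as the message tag:

#include <stdlib.h>
#include "mpi.h"

static int count_key = MPI_KEYVAL_INVALID;

void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
{
    int rank, size, flag, tag;
    int *counter;
    MPI_Status status;

    if (count_key == MPI_KEYVAL_INVALID)
        MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                          &count_key, NULL);
    MPI_Attr_get(comm, count_key, &counter, &flag);
    if (!flag) {
        counter = (int *)malloc(sizeof(int));
        *counter = 0;
        MPI_Attr_put(comm, count_key, counter);
    }
    tag = (*counter)++;      /* each collective invocation gets a distinct tag;
                                a real library would wrap it below MPI_TAG_UB */

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (size == 1) return;
    if (isroot) {
        MPI_Send(buf, count, type, (rank + 1) % size, tag, comm);
        MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, comm, &status);
    } else {
        MPI_Recv(buf, count, type, MPI_ANY_SOURCE, tag, comm, &status);
        MPI_Send(buf, count, type, (rank + 1) % size, tag, comm);
    }
}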
This section introduces the concept of inter-communication and describes the portions of MPI that support it.
All point-to-point communication described thus far has involved communication between processes that are members of the same group. In modular and multi-disciplinary applications, different process groups execute distinct modules and processes within different modules communicate with one another in a pipeline or a more general module graph. In these applications, the most natural way for a process to specify a peer process is by the rank of the peer process within the peer group. In applications that contain internal user-level servers, each server may be a process group that provides services to one or more clients, and each client may be a process group that uses the services of one or more servers. It is again most natural to specify the peer process by rank within the peer group in these applications.
An inter-group communication domain is specified by a set of intercommunicators that all have the same pair of disjoint groups (A,B) as their attribute.
This distributed data structure is illustrated in Figure , for the case of a pair of groups (A,B), with two (upper box) and three (lower box) processes, respectively.
Figure: Distributed data structure for inter-communication domain.
The communicator structure distinguishes between a local group, namely the group containing the process where the structure resides, and a remote group, namely the other group. The structure is symmetric: for processes in group A, A is the local group and B is the remote group, whereas for processes in group B, B is the local group and A is the remote group.
An inter-group communication will involve a process in one group executing a send call and another process, in the other group, executing a matching receive call. As in intra-group communication, the matching process (destination of send or source of receive) is specified using a (communicator, rank) pair. Unlike intra-group communication, the rank is relative to the second, remote group. Thus, in the communication domain illustrated in Figure , process 1 in group A sends a message to process 2 in group B with a call MPI_SEND(..., 2, tag, comm); process 2 in group B receives this message with a call MPI_RECV(..., 1, tag, comm). Conversely, process 2 in group B sends a message to process 1 in group A with a call to MPI_SEND(..., 1, tag, comm), and the message is received by a call to MPI_RECV(..., 2, tag, comm); a remote process is identified in the same way for the purposes of sending or receiving. All point-to-point communication functions can be used with intercommunicators for inter-group communication.
Here is a summary of the properties of inter-group communication and intercommunicators:
The routine MPI_COMM_TEST_INTER may be used to determine if a communicator is an inter- or intracommunicator. Intercommunicators can be used as arguments to some of the other communicator access routines. Intercommunicators cannot be used as input to some of the constructor routines for intracommunicators (for instance, MPI_COMM_CREATE).
It is often convenient to generate an inter-group communication domain by joining together two intra-group communication domains, i.e., building the pair of communicating groups from the individual groups. This requires that there exists one process in each group that can communicate with the other through a communication domain that serves as a bridge between the two groups. For example, suppose that comm1 has 3 processes and comm2 has 4 processes (see Figure ). In terms of MPI_COMM_WORLD, the processes in comm1 are 0, 1, and 2, and those in comm2 are 3, 4, 5, and 6. Let local process 0 in each intracommunicator form the bridge. They can communicate via MPI_COMM_WORLD, where process 0 in comm1 has rank 0 and process 0 in comm2 has rank 3. Once the intercommunicator is formed, the original group for each intracommunicator is the local group in the intercommunicator and the group from the other intracommunicator becomes the remote group. For communication with this intercommunicator, the rank in the remote group is used. For example, if a process in comm1 wants to send to process 2 of comm2 (MPI_COMM_WORLD rank 5) then it uses 2 as the rank in the send.
Figure: Example of two intracommunicators merging to become one intercommunicator.
Intercommunicators are created in this fashion by the call MPI_INTERCOMM_CREATE. The two joined groups are required to be disjoint. The converse function of building an intracommunicator from an intercommunicator is provided by the call MPI_INTERCOMM_MERGE. This call generates a communication domain with a group which is the union of the two groups of the inter-group communication domain. Both calls are blocking. Both will generally require collective communication within each of the involved groups, as well as communication across the groups.
MPI_Comm_test_inter(MPI_Comm comm, int *flag)
MPI_COMM_TEST_INTER(COMM, FLAG, IERROR)
INTEGER COMM, IERROR
LOGICAL FLAG
MPI_COMM_TEST_INTER is a local routine that allows the calling process to determine if a communicator is an intercommunicator or an intracommunicator. It returns true if it is an intercommunicator, otherwise false.
When an intercommunicator is used as an input argument to the communicator accessors described in Section , the following table describes the behavior.
Furthermore, the operation MPI_COMM_COMPARE is valid for intercommunicators. Both communicators must be either intra- or intercommunicators, or else MPI_UNEQUAL results. Both corresponding local and remote groups must compare correctly to get the results MPI_CONGRUENT and MPI_SIMILAR. In particular, it is possible for MPI_SIMILAR to result because either the local or remote groups were similar but not identical.
The following accessors provide consistent access to the remote group of an intercommunicator; they are all local operations.
MPI_Comm_remote_size(MPI_Comm comm, int *size)
MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERROR)
INTEGER COMM, SIZE, IERROR
MPI_COMM_REMOTE_SIZE returns the size of the remote group in the intercommunicator. Note that the size of the local group is given by MPI_COMM_SIZE.
MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERROR)
INTEGER COMM, GROUP, IERROR
MPI_COMM_REMOTE_GROUP returns the remote group in the intercommunicator. Note that the local group is given by MPI_COMM_GROUP.
An intercommunicator can be created by a call to MPI_COMM_DUP, see Section . As for intracommunicators, this call generates a new inter-group communication domain with the same groups as the old one, and also replicates user-defined attributes. An intercommunicator is deallocated by a call to MPI_COMM_FREE. The other intracommunicator constructor functions of Section do not apply to intercommunicators. Two new functions are specific to intercommunicators.
MPI_Intercomm_create(MPI_Comm local_comm, int local_leader, MPI_Comm bridge_comm, int remote_leader, int tag, MPI_Comm *newintercomm)
MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG, NEWINTERCOMM, IERROR)
INTEGER LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG, NEWINTERCOMM, IERROR
MPI_INTERCOMM_CREATE creates an intercommunicator. The call is collective over the union of the two groups. Processes should provide matching local_comm and identical local_leader arguments within each of the two groups. The two leaders specify matching bridge_comm arguments, and each provides in remote_leader the rank of the other leader within the domain of bridge_comm. Both provide identical tag values.
Wildcards are not permitted for remote_leader, local_leader, or tag.
This call uses point-to-point communication with communicator bridge_comm, and with tag tag between the leaders. Thus, care must be taken that there be no pending communication on bridge_comm that could interfere with this communication.
MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)
MPI_INTERCOMM_MERGE(INTERCOMM, HIGH, NEWINTRACOMM, IERROR)
INTEGER INTERCOMM, NEWINTRACOMM, IERROR
LOGICAL HIGH
MPI_INTERCOMM_MERGE creates an intracommunicator from the union of the two groups that are associated with intercomm. All processes should provide the same high value within each of the two groups. If processes in one group provided the value high = false and processes in the other group provided the value high = true then the union orders the ``low'' group before the ``high'' group. If all processes provided the same high argument then the order of the union is arbitrary. This call is blocking and collective within the union of the two groups.
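As an illustration of these two calls, the following hedged sketch implements the comm1/comm2 scenario described earlier (three processes in comm1, four in comm2, MPI_COMM_WORLD as the bridge); the flag member_of_comm1 and the tag value 99 are assumptions of this sketch, not part of the original text:

MPI_Comm intercomm, newintracomm;

if (member_of_comm1) {
    /* remote leader: world rank 3 (local rank 0 of comm2) */
    MPI_Intercomm_create(comm1, 0, MPI_COMM_WORLD, 3, 99, &intercomm);
} else {
    /* remote leader: world rank 0 (local rank 0 of comm1) */
    MPI_Intercomm_create(comm2, 0, MPI_COMM_WORLD, 0, 99, &intercomm);
}

/* optionally collapse both groups into one intracommunicator,
   placing the comm1 group first */
MPI_Intercomm_merge(intercomm, member_of_comm1 ? 0 : 1, &newintracomm);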
Figure: Three-group pipeline. The figure shows the local rank and (within brackets) the global rank of each process.
This chapter discusses the MPI topology mechanism. A topology is an extra, optional attribute that one can give to an intracommunicator; topologies cannot be added to intercommunicators. A topology can provide a convenient naming mechanism for the processes of a group (within a communicator), and additionally, may assist the runtime system in mapping the processes onto hardware.
As stated in Chapter , a process group in MPI is a collection of n processes. Each process in the group is assigned a rank between 0 and n-1. In many parallel applications a linear ranking of processes does not adequately reflect the logical communication pattern of the processes (which is usually determined by the underlying problem geometry and the numerical algorithm used). Often the processes are arranged in topological patterns such as two- or three-dimensional grids. More generally, the logical process arrangement is described by a graph. In this chapter we will refer to this logical process arrangement as the ``virtual topology.''
A clear distinction must be made between the virtual process topology and the topology of the underlying, physical hardware. The virtual topology can be exploited by the system in the assignment of processes to physical processors, if this helps to improve the communication performance on a given machine. How this mapping is done, however, is outside the scope of MPI. The description of the virtual topology, on the other hand, depends only on the application, and is machine-independent. The functions in this chapter deal only with machine-independent mapping.
The communication pattern of a set of processes can be represented by a graph. The nodes stand for the processes, and the edges connect processes that communicate with each other. Since communication is most often symmetric, communication graphs are assumed to be symmetric: if an edge connects node a to node b, then an edge connects node b to node a.
MPI provides message-passing between any pair of processes in a group. There is no requirement for opening a channel explicitly. Therefore, a ``missing link'' in the user-defined process graph does not prevent the corresponding processes from exchanging messages. It means, rather, that this connection is neglected in the virtual topology. This strategy implies that the topology gives no convenient way of naming this pathway of communication. Another possible consequence is that an automatic mapping tool (if one exists for the runtime environment) will not take account of this edge when mapping, and communication on the ``missing'' link will be less efficient.
Specifying the virtual topology in terms of a graph is sufficient for all applications. However, in many applications the graph structure is regular, and the detailed set-up of the graph would be inconvenient for the user and might be less efficient at run time. A large fraction of all parallel applications use process topologies like rings, two- or higher-dimensional grids, or tori. These structures are completely defined by the number of dimensions and the numbers of processes in each coordinate direction. Also, the mapping of grids and tori is generally an easier problem than general graphs. Thus, it is desirable to address these cases explicitly.
Process coordinates in a Cartesian structure begin their numbering at 0. Row-major numbering is always used for the processes in a Cartesian structure. This means that, for example, the relation between group rank and coordinates for twelve processes in a 3x4 grid is as shown in Figure .
Figure: Relationship between ranks and Cartesian coordinates for a 3x4 2D topology. The upper number in each box is the rank of the process and the lower value is the (row, column) coordinates.
MPI manages system memory that is used for buffering messages and for storing internal representations of various MPI objects such as groups, communicators, datatypes, etc. This memory is not directly accessible to the user, and objects stored there are opaque: their size and shape is not visible to the user. Opaque objects are accessed via handles, which exist in opaque objects handles user space. MPI procedures that operate on opaque objects are passed handle arguments to access these objects. In addition to their use by MPI calls for object access, handles can participate in assignments and comparisons.
In Fortran, all handles have type INTEGER. In C, a different handle type is defined for each category of objects. Implementations should use types that support assignment and equality operators.
In Fortran, the handle can be an index in a table of opaque objects, while in C it can be such an index or a pointer to the object. More bizarre possibilities exist.
Opaque objects are allocated and deallocated by calls that are specific to each object type. These are listed in the sections where the objects are described. The calls accept a handle argument of matching type. In an allocate call this is an argument that returns a valid reference to the object. In a call to deallocate this is an argument which returns with a ``null handle'' value. MPI provides a ``null handle'' constant for each object type. Comparisons to this constant are used to test for validity of the handle. handle, null MPI calls do not change the value of handles, with the exception of calls that allocate and deallocate objects, and of the call MPI_TYPE_COMMIT, defined in Section .
A null handle argument is an erroneous argument in MPI calls, unless an exception is explicitly stated in the text that defines the function. Such exceptions are allowed for handles to request objects in Wait and Test calls (Section ). Otherwise, a null handle can only be passed to a function that allocates a new object and returns a reference to it in the handle.
A call to deallocate invalidates the handle and marks the object for deallocation. The object is not accessible to the user after the call. However, MPI need not deallocate the object immediately. Any operation pending (at the time of the deallocate) that involves this object will complete normally; the object will be deallocated afterwards.
An opaque object and its handle are significant only at the process where the object was created, and cannot be transferred to another process.
MPI provides certain predefined opaque objects and predefined, static handles to these objects. Such objects may not be destroyed.
In some applications, it is desirable to use different Cartesian topologies at different stages in the computation. For example, in a QR factorization, the transformation at a given stage is determined by the data below the diagonal in the corresponding column of the matrix. It is often easiest to think of the upper right hand corner of the 2D topology as starting on the process holding the diagonal element of the matrix for that stage of the computation. Since the original matrix was laid out in the original 2D topology, it is necessary to maintain a relationship between it and the 2D topology shifted for that stage. For example, the processes forming a row or column in the original 2D topology must also form a row or column in the shifted 2D topology for that stage. As stated in Section and shown in Figure , there is a clear correspondence between the rank of a process and its coordinates in a Cartesian topology. This relationship can be used to create multiple Cartesian topologies with the desired relationship. Figure shows the relationship of two 2D Cartesian topologies where the second one is shifted by two rows and two columns.
Figure: The relationship between two overlaid topologies on a torus. The upper values in each process are the rank / (row,col) in the original 2D topology and the lower values are the same for the shifted 2D topology. Note that rows and columns of processes remain intact.
The support for virtual topologies as defined in this chapter is consistent with other parts of MPI, and, whenever possible, makes use of functions that are defined elsewhere. Topology information is associated with communicators. It can be implemented using the caching mechanism described in Chapter .
This section describes the MPI functions for creating Cartesian topologies.
MPI_CART_CREATE can be used to describe Cartesian structures of arbitrary dimension. For each coordinate direction one specifies whether the process structure is periodic or not. For a 1D topology, it is linear if it is not periodic and a ring if it is periodic. For a 2D topology, it is a rectangle, cylinder, or torus as it goes from non-periodic to periodic in one dimension to fully periodic. Note that an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for hypercube structures is not necessary.
MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR)
INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
LOGICAL PERIODS(*), REORDER
MPI_CART_CREATE returns a handle to a new communicator to which the Cartesian topology information is attached. In analogy to the function MPI_COMM_CREATE, no cached information propagates to the new communicator. Also, this function is collective. As with other collective calls, the program must be written to work correctly, whether the call synchronizes or not.
If reorder = false then the rank of each process in the new group is identical to its rank in the old group. Otherwise, the function may reorder the processes (possibly so as to choose a good embedding of the virtual topology onto the physical machine). If the total size of the Cartesian grid is smaller than the size of the group of comm_old, then some processes are returned MPI_COMM_NULL, in analogy to MPI_COMM_SPLIT. The call is erroneous if it specifies a grid that is larger than the group size.
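For example, a sketch creating the 3x4 topology used in the figures of this chapter (assuming MPI_COMM_WORLD holds at least 12 processes) is:

int dims[2]    = {3, 4};    /* 3 rows, 4 columns               */
int periods[2] = {0, 0};    /* non-periodic in both dimensions */
MPI_Comm comm_cart;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                1 /* allow reordering */, &comm_cart);
/* any processes beyond the first 12 receive MPI_COMM_NULL */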
For Cartesian topologies, the function MPI_DIMS_CREATE helps the user select a balanced distribution of processes per coordinate direction, depending on the number of processes in the group to be balanced and optional constraints that can be specified by the user. One possible use of this function is to partition all the processes (the size of MPI_COMM_WORLD's group) into an n-dimensional topology.
MPI_Dims_create(int nnodes, int ndims, int *dims)
MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
INTEGER NNODES, NDIMS, DIMS(*), IERROR
The entries in the array dims are set to describe a Cartesian grid with ndims dimensions and a total of nnodes nodes. The dimensions are set to be as close to each other as possible, using an appropriate divisibility algorithm. The caller may further constrain the operation of this routine by specifying elements of array dims. If dims[i] is set to a positive number, the routine will not modify the number of nodes in dimension i; only those entries where dims[i] = 0 are modified by the call.
Negative input values of dims[i] are erroneous. An error will occur if nnodes is not a multiple of the product of the entries of dims that the caller has set to positive values.
For dims[i] set by the call, dims[i] will be ordered in monotonically decreasing order. Array dims is suitable for use as input to routine MPI_CART_CREATE. MPI_DIMS_CREATE is local. Several sample calls are shown in Example .
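Since the original sample calls are not reproduced here, the following hedged C fragment illustrates the rule just stated (dimensions as balanced as possible, in decreasing order, respecting entries fixed by the caller):

int dims[3];

dims[0] = 0; dims[1] = 0;
MPI_Dims_create(6, 2, dims);      /* dims becomes (3,2) */

dims[0] = 0; dims[1] = 0;
MPI_Dims_create(7, 2, dims);      /* dims becomes (7,1) */

dims[0] = 0; dims[1] = 3; dims[2] = 0;
MPI_Dims_create(6, 3, dims);      /* dims becomes (2,3,1) */

/* MPI_Dims_create(7, 3, dims) with dims = (0,3,0) would be erroneous:
   7 is not a multiple of the fixed entry 3 */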
Once a Cartesian topology is set up, it may be necessary to inquire about the topology. These functions are given below and are all local calls.
MPI_Cartdim_get(MPI_Comm comm, int *ndims)
MPI_CARTDIM_GET(COMM, NDIMS, IERROR)
INTEGER COMM, NDIMS, IERROR
MPI_CARTDIM_GET returns the number of dimensions of the Cartesian structure associated with comm. This can be used to provide the other Cartesian inquiry functions with the correct size of arrays. The communicator with the topology in Figure would return ndims = 2.
MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods, int *coords)
MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR)
INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR
LOGICAL PERIODS(*)
MPI_CART_GET returns information on the Cartesian topology associated with comm. maxdims must be at least ndims as returned by MPI_CARTDIM_GET. For the example in Figure , dims = (3,4). The coords are as given for the rank of the calling process as shown, e.g., process 6 returns coords = (1,2).
The functions in this section translate to/from the rank and the Cartesian topology coordinates. These calls are local.
MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)
MPI_CART_RANK(COMM, COORDS, RANK, IERROR)
INTEGER COMM, COORDS(*), RANK, IERROR
For a process group with Cartesian structure, the function MPI_CART_RANK translates the logical process coordinates to process ranks as they are used by the point-to-point routines. coords is an array of size ndims as returned by MPI_CARTDIM_GET. For the example in Figure , coords = (1,2) would return rank = 6.
For dimension i with periods(i) = true, if the coordinate, coords(i), is out of range, that is, coords(i) < 0 or coords(i) >= dims(i), it is shifted back to the interval 0 <= coords(i) < dims(i) automatically. If the topology in Figure is periodic in both dimensions (torus), then the out-of-range coordinates (4,6), which wrap to (1,2), would also return rank = 6. Out-of-range coordinates are erroneous for non-periodic dimensions.
MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)
INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR
MPI_CART_COORDS is the rank-to-coordinates translator. It is the inverse mapping of MPI_CART_RANK. maxdims is at least as big as ndims as returned by MPI_CARTDIM_GET. For the example in Figure , rank = 6 would return coords = (1,2).
If the process topology is a Cartesian structure, an MPI_SENDRECV operation is likely to be used along a coordinate direction to perform a shift of data. As input, MPI_SENDRECV takes the rank of a source process for the receive, and the rank of a destination process for the send. A Cartesian shift operation is specified by the coordinate of the shift and by the size of the shift step (positive or negative). The function MPI_CART_SHIFT takes such a specification as input and returns the information needed to call MPI_SENDRECV. The function MPI_CART_SHIFT is local.
MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR
The direction argument indicates the dimension of the shift, i.e., the coordinate whose value is modified by the shift. The coordinates are numbered from 0 to ndims-1, where ndims is the number of dimensions.
Depending on the periodicity of the Cartesian group in the specified coordinate direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the case of an end-off shift, the value MPI_PROC_NULL may be returned in MPI_PROC_NULL rank_source and/or rank_dest, indicating that the source and/or the destination for the shift is out of range. This is a valid input to the sendrecv functions.
Neither MPI_CART_SHIFT, nor MPI_SENDRECV are collective functions. It is not required that all processes in the grid call MPI_CART_SHIFT with the same direction and disp arguments, but only that sends match receives in the subsequent calls to MPI_SENDRECV. Example shows such use of MPI_CART_SHIFT, where each column of a 2D grid is shifted by a different amount. Figures and show the result on 12 processors.
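A sketch of the skewing operation shown in the figures (comm_cart is the 3x4 communicator and b an integer value local to each process; both names are assumptions of this sketch):

int rank, coords[2], source, dest;
MPI_Status status;

MPI_Comm_rank(comm_cart, &rank);
MPI_Cart_coords(comm_cart, rank, 2, coords);
/* shift along dimension 0 (rows) by an amount equal to the column index */
MPI_Cart_shift(comm_cart, 0, coords[1], &source, &dest);
MPI_Sendrecv_replace(&b, 1, MPI_INT, dest, 0, source, 0,
                     comm_cart, &status);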
Figure: Outcome of Example when the 2D topology is periodic (a torus) on 12 processes. In the boxes on the left, the upper number in each box represents the process rank, the middle values are the (row, column) coordinates, and the lower values are the source/dest for the sendrecv operation. The values in the boxes on the right are the results in b after the sendrecv has completed.
Figure: Similar to Figure except the 2D Cartesian topology is not periodic (a rectangle). This results when the values of periods(1) and periods(2) are made .FALSE. A ``-'' in a source or dest value indicates that MPI_CART_SHIFT returns MPI_PROC_NULL.
MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm)
MPI_CART_SUB(COMM, REMAIN_DIMS, NEWCOMM, IERROR)
INTEGER COMM, NEWCOMM, IERROR
LOGICAL REMAIN_DIMS(*)
If a Cartesian topology has been created with MPI_CART_CREATE, the function MPI_CART_SUB can be used to partition the communicator group into subgroups that form lower-dimensional Cartesian subgrids, and to build for each subgroup a communicator with the associated subgrid Cartesian topology. This call is collective.
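For example, starting from the 3x4 communicator comm_cart assumed above, one subcommunicator per row can be obtained by keeping only the column dimension:

int remain_dims[2] = {0, 1};    /* drop dimension 0, keep dimension 1 */
MPI_Comm rowcomm;

MPI_Cart_sub(comm_cart, remain_dims, &rowcomm);
/* each process now also belongs to a 1D communicator of size 4
   containing exactly the processes of its row */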
Typically, the functions already presented are used to create and use Cartesian topologies. However, some applications may want more control over the process. MPI_CART_MAP returns the Cartesian map recommended by the MPI system, in order to map the virtual communication graph of the application well onto the physical machine topology. This call is collective.
MPI_Cart_map(MPI_Comm comm, int ndims, int *dims, int *periods, int *newrank)
MPI_CART_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)
INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR
LOGICAL PERIODS(*)
MPI_CART_MAP computes an ``optimal'' placement for the calling process on the physical machine.
MPI procedures sometimes assign a special meaning to a special value of an argument. For example, tag is an integer-valued argument of point-to-point communication operations, that can take a special wild-card value, MPI_ANY_TAG. Such arguments will have a range of regular values, which is a proper subrange of the range of values of the corresponding type of the variable. Special values (such as MPI_ANY_TAG) will be outside the regular range. The range of regular values can be queried using environmental inquiry functions (Chapter ).
MPI also provides predefined named constant handles, such as MPI_COMM_WORLD, which is a handle to an object that represents all processes available at start-up time and allowed to communicate with any of them.
All named constants, with the exception of MPI_BOTTOM in MPI_BOTTOM Fortran, can be used in initialization expressions or assignments. These constants do not change values during execution. Opaque objects accessed by constant handles are defined and do not change value between MPI initialization (MPI_INIT() call) and MPI completion (MPI_FINALIZE() call).
This section describes the MPI functions for creating graph topologies.
MPI_Graph_create(MPI_Comm comm_old, int nnodes, int *index, int *edges, int reorder, MPI_Comm *comm_graph)
MPI_GRAPH_CREATE(COMM_OLD, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH, IERROR)
INTEGER COMM_OLD, NNODES, INDEX(*), EDGES(*), COMM_GRAPH, IERROR
LOGICAL REORDER
MPI_GRAPH_CREATE returns a new communicator to which the graph topology information is attached. If reorder = false then the rank of each process in the new group is identical to its rank in the old group. Otherwise, the function may reorder the processes. If the size, nnodes, of the graph is smaller than the size of the group of comm_old, then some processes are returned MPI_COMM_NULL, in analogy to MPI_COMM_SPLIT. The call is erroneous if it specifies a graph that is larger than the group size of the input communicator. In analogy to the function MPI_COMM_CREATE, no cached information propagates to the new communicator. Also, this function is collective. As with other collective calls, the program must be written to work correctly, whether the call synchronizes or not.
The three parameters nnodes, index and edges define the graph structure. nnodes is the number of nodes of the graph. The nodes are numbered from 0 to nnodes-1. The ith entry of array index stores the total number of neighbors of the first i graph nodes. The lists of neighbors of nodes 0, 1, ..., nnodes-1 are stored in consecutive locations in array edges. The array edges is a flattened representation of the edge lists. The total number of entries in index is nnodes and the total number of entries in edges is equal to the number of graph edges.
The definitions of the arguments nnodes, index, and edges are illustrated in Example .
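The original example is not reproduced here; the following hypothetical four-process graph (adjacency 0-1, 0-3, 2-3, stored symmetrically) shows how the arrays are built:

int nnodes   = 4;
int index[4] = {2, 3, 4, 6};              /* cumulative neighbor counts   */
int edges[6] = {1, 3,  0,  3,  0, 2};     /* neighbor lists of nodes 0..3 */
MPI_Comm comm_graph;

MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges,
                 1 /* allow reordering */, &comm_graph);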
Once a graph topology is set up, it may be necessary to inquire about the topology. These functions are given below and are all local calls.
MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges)
MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERROR)
INTEGER COMM, NNODES, NEDGES, IERROR
MPI_GRAPHDIMS_GET returns the number of nodes and the number of edges in the graph. The number of nodes is identical to the size of the group associated with comm. nnodes and nedges can be used to supply arrays of correct size for index and edges, respectively, in MPI_GRAPH_GET. MPI_GRAPHDIMS_GET would return and for Example .
MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index, int *edges)
MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR)
INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR
MPI_GRAPH_GET returns index and edges as was supplied to MPI_GRAPH_CREATE. maxindex and maxedges are at least as big as nnodes and nedges, respectively, as returned by MPI_GRAPHDIMS_GET above. Using the comm created in Example would return the index and edges given in the example.
The functions in this section provide information about the structure of the graph topology. All calls are local.
MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors)
MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERROR)
INTEGER COMM, RANK, NNEIGHBORS, IERROR
MPI_GRAPH_NEIGHBORS_COUNT returns the number of neighbors for the process signified by rank. It can be used by MPI_GRAPH_NEIGHBORS to give an array of correct size for neighbors. Using Example with would give .
MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, int *neighbors)
MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR)
INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR
MPI_GRAPH_NEIGHBORS returns the part of the edges array associated with process rank. Using Example , would return . Another use is given in Example .
The low-level function for general graph topologies, analogous to the Cartesian one given in Section , is as follows. This call is collective.
MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges, int *newrank)
MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)
INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR
A routine may receive a communicator without knowing what type of topology is associated with it. MPI_TOPO_TEST allows one to answer this question. It is a local call.
MPI_Topo_test(MPI_Comm comm, int *status)
MPI_TOPO_TEST(COMM, STATUS, IERROR)
INTEGER COMM, STATUS, IERROR
The function MPI_TOPO_TEST returns the type of topology that is assigned to a communicator.
The output value status is one of the following:
MPI_GRAPH (graph topology), MPI_CART (Cartesian topology), or MPI_UNDEFINED (no topology).
Figure: Data partition in 2D parallel matrix product algorithm.
Figure: Phases in 2D parallel matrix product algorithm.
Figure: Data partition in 3D parallel matrix product algorithm.
Figure: Phases in 3D parallel matrix product algorithm.
This chapter discusses routines for getting and, where appropriate, setting various parameters that relate to the MPI implementation and the execution environment. It discusses error handling in MPI and the procedures available for controlling MPI error handling. The procedures for entering and leaving the MPI execution environment are also described here. Finally, the chapter discusses the interaction between MPI and the general execution environment.
A set of attributes that describe the execution environment are attached to the communicator MPI_COMM_WORLD when MPI is initialized. MPI_COMM_WORLD The value of these attributes can be inquired by using the function MPI_ATTR_GET described in Chapter . It is erroneous to delete these attributes, free their keys, or change their values.
The list of predefined attribute keys includes MPI_TAG_UB, MPI_HOST, MPI_IO, and MPI_WTIME_IS_GLOBAL.
Vendors may add implementation specific parameters (such as node number, real memory size, virtual memory size, etc.)
These predefined attributes do not change value between MPI initialization (MPI_INIT) and MPI completion (MPI_FINALIZE).
The required parameter values are discussed in more detail below:
MPI functions sometimes use arguments with a choice (or union) data type. Distinct calls to the same routine may pass by reference actual arguments of different types. The mechanism for providing such arguments will differ from language to language. For Fortran, we use <type> to represent a choice variable, for C, we use (void *). choice
The Fortran 77 standard specifies that the type of actual arguments needs to agree with the type of dummy arguments; no construct equivalent to C void pointers is available. Thus, it would seem that there is no standard-conforming mechanism to support choice arguments. However, most Fortran compilers either don't check type consistency of calls to external routines, or support a special mechanism to link foreign (e.g., C) routines. We accept this non-conformity with the Fortran 77 standard. I.e., we accept that the same routine may be passed an actual argument of a different type at distinct calls.
Generic routines can be used in Fortran 90 to provide a standard conforming solution. This solution will be consistent with our nonstandard conforming Fortran 77 solution.
Tag values range from 0 to the value returned for MPI_TAG_UB, inclusive. These values are guaranteed to be unchanging during the execution of an MPI program. In addition, the tag upper bound value must be at least 32767. An MPI implementation is free to make the value of MPI_TAG_UB larger than this; for example, the value 2^30 - 1 is also a legal value for MPI_TAG_UB (on a system where this value is a legal int or INTEGER value).
The attribute MPI_TAG_UB has the same value on all processes in the group of MPI_COMM_WORLD.
The value returned for MPI_HOST is the rank of the HOST process in the group associated with communicator MPI_COMM_WORLD, if there is such a process. MPI_PROC_NULL is returned if there is no host. This attribute can be used on systems that have a distinguished host processor, in order to identify the process running on this processor. However, MPI does not specify what it means for a process to be a HOST, nor does it require that a HOST exists.
The attribute MPI_HOST has the same value on all processes in the group of MPI_COMM_WORLD.
The value returned for MPI_IO is the rank of a processor that can provide language-standard I/O facilities. For Fortran, this means that all of the Fortran I/O operations are supported (e.g., OPEN, REWIND, WRITE). For C, this means that all of the ANSI-C I/O operations are supported (e.g., fopen, fprintf, lseek).
If every process can provide language-standard I/O, then the value MPI_ANY_SOURCE will be returned. Otherwise, if the calling process can provide language-standard I/O, then its rank will be returned. Otherwise, if some process can provide language-standard I/O, then the rank of one such process will be returned. The same value need not be returned by all processes. If no process can provide language-standard I/O, then the value MPI_PROC_NULL will be returned.
The value returned for MPI_WTIME_IS_GLOBAL is 1 if clocks at all processes in MPI_COMM_WORLD are synchronized, 0 otherwise. A collection of clocks is considered synchronized if explicit effort has been taken to synchronize them. The expectation is that the variation in time, as measured by calls to MPI_WTIME, will be less than one half the round-trip time for an MPI message of length zero. If time is measured at a process just before a send and at another process just after a matching receive, the second time should always be higher than the first one.
The attribute MPI_WTIME_IS_GLOBAL need not be present when the clocks are not synchronized (however, the attribute key MPI_WTIME_IS_GLOBAL is always valid). This attribute may be associated with communicators other than MPI_COMM_WORLD.
The attribute MPI_WTIME_IS_GLOBAL has the same value on all processes in the group of MPI_COMM_WORLD.
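As an illustration, the following sketch queries these predefined attributes through the C binding MPI_Attr_get; it assumes execution between MPI_INIT and MPI_FINALIZE, and relies on the fact that, in C, the value returned for a predefined attribute is a pointer to an int holding the value.

int *tag_ub, *io_rank, *wtime_global;
int flag;

/* largest usable tag value */
MPI_Attr_get(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
if (flag) printf("tag upper bound: %d\n", *tag_ub);

/* a rank at which language-standard I/O is available */
MPI_Attr_get(MPI_COMM_WORLD, MPI_IO, &io_rank, &flag);
if (flag) printf("I/O available at rank: %d\n", *io_rank);

/* whether clocks are synchronized (this key may be absent) */
MPI_Attr_get(MPI_COMM_WORLD, MPI_WTIME_IS_GLOBAL, &wtime_global, &flag);
if (flag) printf("clocks synchronized: %d\n", *wtime_global);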
MPI_Get_processor_name(char *name, int *resultlen)
MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERROR)
    CHARACTER*(*) NAME
    INTEGER RESULTLEN, IERROR
This routine returns the name of the processor on which it was called at the moment of the call. The name is a character string for maximum flexibility. From this value it must be possible to identify a specific piece of hardware; possible values include ``processor 9 in rack 4 of mpp.cs.org'' and ``231'' (where 231 is the actual processor number in the running homogeneous system). The argument name must represent storage that is at least MPI_MAX_PROCESSOR_NAME characters long. MPI_GET_PROCESSOR_NAME may write up to this many characters into name.
The number of characters actually written is returned in the output argument, resultlen.
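For example, a process might identify the processor it is running on as follows (a sketch only; printf is assumed to be available at the executing node):

char name[MPI_MAX_PROCESSOR_NAME];
int  resultlen;

MPI_Get_processor_name(name, &resultlen);
printf("process running on %s (%d characters)\n", name, resultlen);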
The constant MPI_BSEND_OVERHEAD provides an upper bound on the fixed overhead per message buffered by a call to MPI_BSEND.
MPI defines a timer. A timer is specified even though it is not ``message-passing,'' because timing parallel programs is important in ``performance debugging'' and because existing timers (both in POSIX 1003.1-1988 and 1003.4D 14.1 and in Fortran 90) are either inconvenient or do not provide adequate access to high-resolution timers.
double MPI_Wtime(void)

DOUBLE PRECISION MPI_WTIME()

MPI_WTIME returns a floating-point number of seconds, representing elapsed wall-clock time since some time in the past.
The ``time in the past'' is guaranteed not to change during the life of the process. The user is responsible for converting large numbers of seconds to other units if they are preferred.
This function is portable (it returns seconds, not ``ticks''), it allows high resolution, and it carries no unnecessary baggage. One would use it like this:
{
    double starttime, endtime;
    starttime = MPI_Wtime();
    ....  stuff to be timed  ...
    endtime = MPI_Wtime();
    printf("That took %f seconds\n", endtime - starttime);
}
The times returned are local to the node that called them. There is no requirement that different nodes return ``the same time.'' (But see also the discussion of MPI_WTIME_IS_GLOBAL in Section .)
double MPI_Wtick(void)

DOUBLE PRECISION MPI_WTICK()

MPI_WTICK returns the resolution of MPI_WTIME in seconds. That is, it returns, as a double precision value, the number of seconds between successive clock ticks. For example, if the clock is implemented by the hardware as a counter that is incremented every millisecond, the value returned by MPI_WTICK should be 10^-3.
One goal of MPI is to achieve source code portability. By this we mean that a program written using MPI and complying with the relevant language standards is portable as written, and must not require any source code changes when moved from one system to another. This explicitly does not say anything about how an MPI program is started or launched from the command line, nor what the user must do to set up the environment in which an MPI program will run. However, an implementation may require some setup to be performed before other MPI routines may be called. To provide for this, MPI includes an initialization routine MPI_INIT.
MPI_Init(int *argc, char ***argv)
MPI_INIT(IERROR)
    INTEGER IERROR
This routine must be called before any other MPI routine. It must be called at most once; subsequent calls are erroneous (see MPI_INITIALIZED).
All MPI programs must contain a call to MPI_INIT; this routine must be called before any other MPI routine (apart from MPI_INITIALIZED) is called. The version for ANSI C accepts the argc and argv that are provided by the arguments to main:
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* parse arguments */
    /* main program    */
    MPI_Finalize();     /* see below */
}
The Fortran version takes only IERROR.
An MPI implementation is free to require that the arguments in the C binding must be the arguments to main.
MPI_Finalize(void)
MPI_FINALIZE(IERROR)
    INTEGER IERROR
This routine cleans up all MPI state. Once this routine is called, no MPI routine (even MPI_INIT) may be called. The user must ensure that all pending communications involving a process complete before the process calls MPI_FINALIZE.
MPI_Initialized(int *flag)
MPI_INITIALIZED(FLAG, IERROR)
    LOGICAL FLAG
    INTEGER IERROR
This routine may be used to determine whether MPI_INIT has been called. It is the only routine that may be called before MPI_INIT is called.
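A typical use is in a library that must work whether or not the application has already initialized MPI; a sketch (assuming argc and argv are available at this point) is:

int flag;

MPI_Initialized(&flag);
if (!flag)
    MPI_Init(&argc, &argv);   /* initialize only if not already done */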
MPI_Abort(MPI_Comm comm, int errorcode)
MPI_ABORT(COMM, ERRORCODE, IERROR)
    INTEGER COMM, ERRORCODE, IERROR
This routine makes a ``best attempt'' to abort all tasks in the group of comm. This function does not require that the invoking environment take any action with the error code. However, a Unix or POSIX environment should handle this as a return errorcode from the main program or an abort(errorcode).
MPI implementations are required to define the behavior of MPI_ABORT at least for a comm of MPI_COMM_WORLD. MPI implementations may ignore the comm argument and act as if comm were MPI_COMM_WORLD.
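For example, a process that detects an unrecoverable condition (here a hypothetical failed file open) might abort the whole computation:

if (input_file == NULL) {              /* hypothetical failure test */
    fprintf(stderr, "fatal: cannot open input file\n");
    MPI_Abort(MPI_COMM_WORLD, 1);      /* 1 is passed to the environment */
}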
MPI provides the user with reliable message transmission. A message sent is always received correctly, and the user does not need to check for transmission errors, time-outs, or other error conditions. In other words, MPI does not provide mechanisms for dealing with failures in the communication system. If the MPI implementation is built on an unreliable underlying mechanism, then it is the job of the implementor of the MPI subsystem to insulate the user from this unreliability, or to reflect unrecoverable errors as exceptions.
Of course, errors can occur during MPI calls for a variety of reasons. A program error can occur when an MPI routine is called with an incorrect argument (non-existing destination in a send operation, buffer too small in a receive operation, etc.). This type of error would occur in any implementation. In addition, a resource error may occur when a program exceeds the amount of available system resources (number of pending messages, system buffers, etc.). The occurrence of this type of error depends on the amount of available resources in the system and the resource allocation mechanism used; this may differ from system to system. A high-quality implementation will provide generous limits on the important resources so as to alleviate the portability problem this represents.
An MPI implementation may be unable, or may choose not, to handle some errors that occur during MPI calls. These can include errors that generate exceptions or traps, such as floating point errors or access violations; errors that are too expensive to detect in normal execution mode; or ``catastrophic'' errors which may prevent MPI from returning control to the caller in a consistent state.
Another subtle issue arises because of the nature of asynchronous communications. MPI can only handle errors that can be attached to a specific MPI call. MPI calls (both blocking and nonblocking) may initiate operations that continue asynchronously after the call returned. Thus, the call may complete successfully, yet the operation may later cause an error. If there is a subsequent call that relates to the same operation (e.g., a wait or test call that completes a nonblocking call, or a receive that completes a communication initiated by a blocking send) then the error can be associated with this call. In some cases, the error may occur after all calls that relate to the operation have completed. (Consider the case of a blocking ready mode send operation, where the outgoing message is buffered, and it is subsequently found that no matching receive is posted.) Such errors will not be handled by MPI.
The set of errors in MPI calls that are handled by MPI is implementation-dependent. Each such error generates an MPI exception. A good quality implementation will attempt to handle as many errors as possible as MPI exceptions. Errors that are not handled by MPI will be handled by the error handling mechanisms of the language run-time or the operating system. Typically, errors that are not handled by MPI will cause the parallel program to abort.
The occurrence of an MPI exception has two effects: the error handler associated with the relevant communicator is invoked, and an error code describing the exception is returned by the MPI call.
Some MPI calls may cause more than one MPI exception (see Section ). In such a case, the MPI error handler will be invoked once for each exception, and multiple error codes will be returned.
After an error is detected, the state of MPI is undefined. That is, the state of the computation after the error handler has executed does not necessarily allow the user to continue to use MPI. The purpose of these error handlers is to allow a user to issue user-defined error messages and to take actions unrelated to MPI (such as flushing I/O buffers) before a program exits. An MPI implementation is free to allow MPI to continue after an error but is not required to do so.
A user can associate an error handler with a communicator. The specified error handling routine will be used for any MPI exception that occurs during a call to MPI for a communication with this communicator. MPI calls that are not related to any communicator are considered to be attached to the communicator MPI_COMM_WORLD. The attachment of error handlers to communicators is purely local: different processes may attach different error handlers to communicators for the same communication domain.
A newly created communicator inherits the error handler that is associated with the ``parent'' communicator. In particular, the user can specify a ``global'' error handler for all communicators by associating this handler with the communicator MPI_COMM_WORLD immediately after initialization.
Several predefined error handlers are available in MPI: MPI_ERRORS_ARE_FATAL, which causes the program to abort on any error handled by MPI, and MPI_ERRORS_RETURN, which returns the error code to the user without taking any further action.
Implementations may provide additional predefined error handlers and programmers can code their own error handlers.
The error handler MPI_ERRORS_ARE_FATAL is associated by default with MPI_COMM_WORLD after initialization. Thus, if the user chooses not to control error handling, every error that MPI handles is treated as fatal. Since (almost) all MPI calls return an error code, a user may choose to handle errors in his or her main code, by testing the return code of MPI calls and executing a suitable recovery code when the call was not successful. In this case, the error handler MPI_ERRORS_RETURN will be used. Usually it is more convenient and more efficient not to test for errors after each MPI call, and to have such errors handled by a non-trivial MPI error handler.
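The following sketch shows the return-code style of error handling just described, using MPI_ERRHANDLER_SET (described below) to install the predefined handler; buf, count, dest, and tag are assumed to be set up by the caller.

int rc;

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

rc = MPI_Send(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);
if (rc != MPI_SUCCESS) {
    /* user-defined recovery: report the error, free resources, etc. */
}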
An MPI error handler is an opaque object, which is accessed by a handle. MPI calls are provided to create new error handlers, to associate error handlers with communicators, and to test which error handler is associated with a communicator.
MPI_Errhandler_create(MPI_Handler_function *function, MPI_Errhandler *errhandler)
MPI_ERRHANDLER_CREATE(FUNCTION, ERRHANDLER, IERROR)
    EXTERNAL FUNCTION
    INTEGER ERRHANDLER, IERROR
Register the user routine function for use as an MPI exception handler. Returns in errhandler a handle to the registered exception handler.
In the C language, the user routine should be a C function of type MPI_Handler_function, which is defined as:
typedef void (MPI_Handler_function)(MPI_Comm *, int *, ...);

The first argument is the communicator in use. The second is the error code to be returned by the MPI routine that raised the error. If the routine would have returned multiple error codes (see Section ), it is the error code returned in the status for the request that caused the error handler to be invoked. The remaining arguments are ``stdargs'' arguments whose number and meaning is implementation-dependent. An implementation should clearly document these arguments. Addresses are used so that the handler may be written in Fortran.
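As an illustration, a user-defined handler that merely prints the error string, and its registration, might look like the following sketch (the stdargs arguments are ignored, and stdio is assumed to be available); the resulting handle can then be attached to a communicator with MPI_ERRHANDLER_SET, described next.

void my_handler(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int  len;

    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI exception: %s\n", msg);
}

...
MPI_Errhandler my_errhandler;
MPI_Errhandler_create(my_handler, &my_errhandler);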
MPI_Errhandler_set(MPI_Comm comm, MPI_Errhandler errhandler)
MPI_ERRHANDLER_SET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR
Associates the new error handler errhandler with communicator comm at the calling process. Note that an error handler is always associated with the communicator.
MPI_Errhandler_get(MPI_Comm comm, MPI_Errhandler *errhandler)
MPI_ERRHANDLER_GET(COMM, ERRHANDLER, IERROR)
    INTEGER COMM, ERRHANDLER, IERROR
Returns in errhandler (a handle to) the error handler that is currently associated with communicator comm.
Example: A library function may register at its entry point the current error handler for a communicator, set its own private error handler for this communicator, and restore the previous error handler before exiting.
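A sketch of this pattern follows; library_errhandler is a hypothetical handler created elsewhere by the library.

void library_entry(MPI_Comm comm)
{
    MPI_Errhandler old_handler;

    MPI_Errhandler_get(comm, &old_handler);        /* save the caller's handler  */
    MPI_Errhandler_set(comm, library_errhandler);  /* install the private handler */

    /* ... body of the library routine ... */

    MPI_Errhandler_set(comm, old_handler);         /* restore on exit */
}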
MPI_Errhandler_free(MPI_Errhandler *errhandler)
MPI_ERRHANDLER_FREE(ERRHANDLER, IERROR)
    INTEGER ERRHANDLER, IERROR

Marks the error handler associated with errhandler for deallocation and sets errhandler to MPI_ERRHANDLER_NULL. The error handler will be deallocated after all communicators associated with it have been deallocated.
Most MPI functions return an error code indicating successful execution (MPI_SUCCESS), or providing information on the type of MPI exception that occurred. In certain circumstances, when the MPI function may complete several distinct operations, and therefore may generate several independent errors, the MPI function may return multiple error codes. This may occur with some of the calls described in Section that complete multiple nonblocking communications. As described in that section, the call may return the code MPI_ERR_IN_STATUS, in which case a detailed error code is returned with the status of each communication.
The error codes returned by MPI are left entirely to the implementation (with the exception of MPI_SUCCESS, MPI_ERR_IN_STATUS and MPI_ERR_PENDING). This is done to allow an implementation to provide as much information as possible in the error code. Error codes can be translated into meaningful messages using the function below.
MPI_Error_string(int errorcode, char *string, int *resultlen)
MPI_ERROR_STRING(ERRORCODE, STRING, RESULTLEN, IERROR)
    INTEGER ERRORCODE, RESULTLEN, IERROR
    CHARACTER*(*) STRING
Returns the error string associated with an error code or class. The argument string must represent storage that is at least MPI_MAX_ERROR_STRING characters long.
The number of characters actually written is returned in the output argument, resultlen.
The use of implementation-dependent error codes allows implementers to provide more information, but prevents one from writing portable error-handling code. To solve this problem, MPI provides a standard set of specified error values, called error classes, and a function that maps each error code into a suitable error class.
Valid error classes are MPI_SUCCESS, MPI_ERR_BUFFER, MPI_ERR_COUNT, MPI_ERR_TYPE, MPI_ERR_TAG, MPI_ERR_COMM, MPI_ERR_RANK, MPI_ERR_REQUEST, MPI_ERR_ROOT, MPI_ERR_GROUP, MPI_ERR_OP, MPI_ERR_TOPOLOGY, MPI_ERR_DIMS, MPI_ERR_ARG, MPI_ERR_UNKNOWN, MPI_ERR_TRUNCATE, MPI_ERR_OTHER, MPI_ERR_INTERN, MPI_ERR_IN_STATUS, MPI_ERR_PENDING, and MPI_ERR_LASTCODE.
Most of these classes are self explanatory. The use of MPI_ERR_IN_STATUS and MPI_ERR_PENDING is explained in Section . The list of standard classes may be extended in the future.
The function MPI_ERROR_STRING can be used to compute the error string associated with an error class.
The error codes satisfy 0 = MPI_SUCCESS < MPI_ERR_... <= MPI_ERR_LASTCODE.
MPI_Error_class(int errorcode, int *errorclass)
MPI_ERROR_CLASS(ERRORCODE, ERRORCLASS, IERROR)
    INTEGER ERRORCODE, ERRORCLASS, IERROR
The function MPI_ERROR_CLASS maps each error code into a standard error code (error class). It maps each standard error code onto itself.
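For example, portable error-reporting code can first map a returned code onto its class and then print the implementation's message for it; in this sketch, errcode is assumed to come from a failed MPI call executed under MPI_ERRORS_RETURN.

int  errclass, len;
char msg[MPI_MAX_ERROR_STRING];

MPI_Error_class(errcode, &errclass);   /* portable classification */
MPI_Error_string(errcode, msg, &len);  /* implementation-specific message */
fprintf(stderr, "error class %d: %s\n", errclass, msg);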
There are a number of areas where an MPI implementation may interact with the operating environment and system. While MPI does not mandate that any services (such as I/O or signal handling) be provided, it does strongly suggest the behavior to be provided if those services are available. This is an important point in achieving portability across platforms that provide the same set of services.
This section defines the rules for MPI language binding in Fortran 77 and ANSI C. Defined here are various object representations, as well as the naming conventions used for expressing this standard.
It is expected that any Fortran 90 and C++ implementations use the Fortran 77 and ANSI C bindings, respectively. Although we consider it premature to define other bindings to Fortran 90 and C++, the current bindings are designed to encourage, rather than discourage, experimentation with better bindings that might be adopted later.
Since the word PARAMETER is a keyword in the Fortran language, we use the word ``argument'' to denote the arguments to a subroutine. These are normally referred to as parameters in C; however, we expect that C programmers will understand the word ``argument'' (which has no specific meaning in C), thus allowing us to avoid unnecessary confusion for Fortran programmers.
There are several important language binding issues not addressed by this standard. This standard does not discuss the interoperability of message passing between languages. It is fully expected that good quality implementations will provide such interoperability. interoperability, language
MPI requires that library routines that are part of the basic language environment (such as date and write in Fortran, and printf and malloc in ANSI C) and that are executed after MPI_INIT and before MPI_FINALIZE operate independently, and that their completion is independent of the action of other processes in an MPI program.
Note that this in no way prevents the creation of library routines that provide parallel services whose operation is collective. However, the following program is expected to complete in an ANSI C environment regardless of the size of MPI_COMM_WORLD (assuming that I/O is available at the executing nodes).
../codes/terms-1.c
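(The example file is not reproduced here; the following sketch shows the kind of program meant, in which each process performs language-standard I/O independently of the others.)

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    printf("Starting program\n");
    /* ... */
    printf("Ending program\n");
    MPI_Finalize();
    return 0;
}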
The corresponding Fortran 77 program is also expected to complete.
An example of what is not required is any particular ordering of the action of these routines when called by several tasks. For example, MPI makes neither requirements nor recommendations for the output from the following program (again assuming that I/O is available at the executing nodes).
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
printf( "Output from task rank %d\n", rank );
In addition, calls that fail because of resource exhaustion or other error are not considered a violation of the requirements here (however, they are required to complete, just not to complete successfully).
MPI does not specify either the interaction of processes with signals, in a UNIX environment, or with other events that do not relate to MPI communication. That is, signals are not significant from the viewpoint of MPI, and implementors should attempt to implement MPI so that signals are transparent: an MPI call suspended by a signal should resume and complete after the signal is handled. Generally, the state of a computation that is visible or significant from the viewpoint of MPI should only be affected by MPI calls.
The intent of MPI to be thread and signal safe has a number of subtle effects. For example, on Unix systems, a catchable signal such as SIGALRM (an alarm signal) must not cause an MPI routine to behave differently than it would have in the absence of the signal. Of course, if the signal handler issues MPI calls or changes the environment in which the MPI routine is operating (for example, consuming all available memory space), the MPI routine should behave as appropriate for that situation (in particular, in this case, the behavior should be the same as for a multithreaded MPI implementation).
A second effect is that a signal handler that performs MPI calls must not interfere with the operation of MPI. For example, an MPI receive of any type that occurs within a signal handler must not cause erroneous behavior by the MPI implementation. Note that an implementation is permitted to prohibit the use of MPI calls from within a signal handler, and is not required to detect such use.
It is highly desirable that MPI not use SIGALRM, SIGFPE, or SIGIO. An implementation is required to clearly document all of the signals that the MPI implementation uses; a good place for this information is a Unix `man' page on MPI.
To satisfy the requirements of the MPI profiling interface, an implementation of the MPI functions must, among other things, provide name-shifted (PMPI_-prefixed) entry points for all MPI functions, allow individual MPI functions to be incorporated into an executable one at a time, keep any layered language-binding wrappers separable from the rest of the library, and supply MPI_PCONTROL as a no-op routine in the standard MPI library. These requirements are discussed in more detail below.
The objective of the MPI profiling interface is to ensure that it is relatively easy for authors of profiling (and other similar) tools to interface their codes to MPI implementations on different machines.
Since MPI is a machine independent standard with many different implementations, it is unreasonable to expect that the authors of profiling tools for MPI will have access to the source code which implements MPI on any particular machine. It is therefore necessary to provide a mechanism by which the implementors of such tools can collect whatever performance information they wish without access to the underlying implementation.
The MPI Forum believed that having such an interface is important if MPI is to be attractive to end users, since the availability of many different tools will be a significant factor in attracting users to the MPI standard.
The profiling interface is just that, an interface. It says nothing about the way in which it is used. Therefore, there is no attempt to lay down what information is collected through the interface, or how the collected information is saved, filtered, or displayed.
While the initial impetus for the development of this interface arose from the desire to permit the implementation of profiling tools, it is clear that an interface like that specified may also prove useful for other purposes, such as ``internetworking'' multiple MPI implementations. Since all that is defined is an interface, there is no impediment to it being used wherever it is useful.
As the issues being addressed here are intimately tied up with the way in which executable images are built, which may differ greatly on different machines, the examples given below should be treated solely as one way of implementing the MPI profiling interface. The actual requirements made of an implementation are those detailed in Section ; the whole of the rest of this chapter is present only as justification and discussion of the logic for those requirements.
The examples below show one way in which an implementation could be constructed to meet the requirements on a Unix system (there are doubtless others which would be equally valid).
Provided that an MPI implementation meets the requirements listed in Section , it is possible for the implementor of the profiling system to intercept all of the MPI calls which are made by the user program. Whatever information is required can then be collected before calling the underlying MPI implementation (through its name shifted entry points) to achieve the desired effects.
There is a clear requirement for the user code to be able to control the profiler dynamically at run time. This is normally used for (at least) the purposes of enabling and disabling profiling depending on the state of the calculation, flushing trace buffers at non-critical points in the calculation, and adding user events to a trace file.
These requirements are met by use of MPI_PCONTROL.
MPI_Pcontrol(const int level, ...)

MPI_PCONTROL(LEVEL)
    INTEGER LEVEL
MPI libraries themselves make no use of this routine, and simply return immediately to the user code. However, the presence of calls to this routine allows a profiling package to be explicitly called by the user.
Since MPI has no control of the implementation of the profiling code, the MPI Forum was unable to specify precisely the semantics which will be provided by calls to MPI_PCONTROL. This vagueness extends to the number of arguments to the function, and their datatypes.
However, to provide some level of portability of user codes to different profiling libraries, the MPI Forum requested the following meanings for certain values of level: with level equal to 0, profiling is disabled; with level equal to 1, profiling is enabled at a normal default level of detail; with level equal to 2, profile buffers are flushed (which may be a no-op in some profilers); all other values of level have profile-library-defined effects and additional arguments.
The MPI Forum also requested that the default state after MPI_INIT has been called is for profiling to be enabled at the normal default level (i.e., as if MPI_PCONTROL had just been called with the argument 1). This allows users to link with a profiling library and obtain profile output without having to modify their source code at all.
The provision of MPI_PCONTROL as a no-op in the standard MPI library allows users to modify their source code to obtain more detailed profiling information, but still be able to link exactly the same code against the standard MPI library.
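Under the level conventions listed above, a user might bracket the part of the computation of interest as follows; this is a sketch only, and whether anything actually happens depends entirely on the profiling library that is linked in.

MPI_Pcontrol(0);   /* disable profiling during setup */
/* ... setup communication, not of interest ... */

MPI_Pcontrol(1);   /* re-enable at the normal default level */
/* ... communication to be profiled ... */

MPI_Pcontrol(2);   /* flush trace buffers, if supported */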
Suppose that the profiler wishes to accumulate the total amount of data sent by the MPI_Send() function, along with the total elapsed time spent in the function. This could trivially be achieved thus
../codes/prof-1.c
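(The example file is not reproduced here; a sketch of such a wrapper, accumulating the byte count and elapsed time before passing the call on to the name-shifted entry point, might look like this.)

static double total_time  = 0.0;
static long   total_bytes = 0;

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double t = MPI_Wtime();
    int    rc, size;

    MPI_Type_size(datatype, &size);        /* bytes per element */
    total_bytes += (long)count * size;

    rc = PMPI_Send(buf, count, datatype, dest, tag, comm);

    total_time += MPI_Wtime() - t;         /* time spent inside MPI_Send */
    return rc;
}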
On a Unix system in which the MPI library is implemented in C, there are various possible options, of which two of the most obvious are presented here. Which is better depends on whether the linker and compiler support weak symbols.
All MPI names have an MPI_ prefix, and all characters are upper case. Programs should not declare variables or functions with names with the prefix MPI_ or PMPI_, to avoid possible name collisions.
All MPI Fortran subroutines have a return code in the last argument. A few MPI operations are functions, which do not have the return code argument. The return code value for successful completion is MPI_SUCCESS. Other error codes are implementation dependent; see Chapter .
Handles are represented in Fortran as INTEGERs. Binary-valued variables are of type LOGICAL.
Array arguments are indexed from one.
Unless explicitly stated, the MPI F77 binding is consistent with ANSI standard Fortran 77. There are several points where the MPI standard diverges from the ANSI Fortran 77 standard. These exceptions are consistent with common practice in the Fortran community. In particular:
Figure: An example of calling a routine with mismatched formal and actual arguments.
All MPI named constants can be used wherever an entity declared with the PARAMETER attribute can be used in Fortran. There is one exception to this rule: the MPI constant MPI_BOTTOM (Section ) can only be used as a buffer argument.
If the compiler and linker support weak external symbols (e.g., Solaris 2.x and other System V.4 machines), then only a single library is required through the use of #pragma weak, thus:
#pragma weak MPI_Send = PMPI_Send

int PMPI_Send(/* appropriate args */)
{
    /* Useful content */
}
The effect of this #pragma is to define the external symbol MPI_Send as a weak definition. This means that the linker will not complain if there is another definition of the symbol (for instance, in the profiling library); however, if no other definition exists, then the linker will use the weak definition. This type of situation is illustrated in Fig. , in which a profiling library has been written that profiles calls to MPI_Send() but not calls to MPI_Bcast(). On systems with weak links the link step for an application would be something like
% cc ... -lprof -lmpi
References to MPI_Send() are resolved in the profiling library, where the routine then calls PMPI_Send() which is resolved in the MPI library. In this case the weak link to PMPI_Send() is ignored. However, since MPI_Bcast() is not included in the profiling library, references to it are resolved via a weak link to PMPI_Bcast() in the MPI library.
Figure: Resolution of MPI calls on systems with weak links.
In the absence of weak symbols, one possible solution would be to use the C macro preprocessor, thus:
#ifdef PROFILELIB
#    ifdef __STDC__
#        define FUNCTION(name) P##name
#    else
#        define FUNCTION(name) P/**/name
#    endif
#else
#    define FUNCTION(name) name
#endif
Each of the user visible functions in the library would then be declared thus
int FUNCTION(MPI_Send)(/* appropriate args */)
{
    /* Useful content */
}
The same source file can then be compiled to produce the MPI and the PMPI versions of the library, depending on the state of the PROFILELIB macro symbol.
It is required that the standard MPI library be built in such a way that the inclusion of MPI functions can be achieved one at a time. This is a somewhat unpleasant requirement, since it may mean that each external function has to be compiled from a separate file. However this is necessary so that the author of the profiling library need only define those MPI functions that are to be intercepted, references to any others being fulfilled by the normal MPI library. Therefore the link step can look something like this
% cc ... -lprof -lpmpi -lmpi
Here libprof.a contains the profiler functions which intercept some of the MPI functions. libpmpi.a contains the ``name shifted'' MPI functions, and libmpi.a contains the normal definitions of the MPI functions. Thus, on systems without weak links the example shown in Fig. would be resolved as shown in Fig.
Figure: Resolution of MPI calls on systems without weak links.
Since parts of the MPI library may themselves be implemented using more basic MPI functions (e.g., a portable implementation of the collective operations implemented using point to point communications), there is potential for profiling functions to be called from within an MPI function which was called from a profiling function. This could lead to ``double counting'' of the time spent in the inner routine. Since this effect could actually be useful under some circumstances (e.g., it might allow one to answer the question ``How much time is spent in the point to point routines when they're called from collective functions?''), the MPI Forum decided not to enforce any restrictions on the author of the MPI library which would overcome this. Therefore, the author of the profiling library should be aware of this problem, and guard against it. In a single threaded world this is easily achieved through use of a static variable in the profiling code which remembers whether you are already inside a profiling routine. It becomes more complex in a multi-threaded environment (as does the meaning of the times recorded!).
The Unix linker traditionally operates in one pass. The effect of this is that functions from libraries are only included in the image if they are needed at the time the library is scanned. When combined with weak symbols, or multiple definitions of the same function, this can cause odd (and unexpected) effects.
Consider, for instance, an implementation of MPI in which the Fortran binding is achieved by using wrapper functions on top of the C implementation. The author of the profile library then assumes that it is reasonable to provide profile functions only for the C binding, since Fortran will eventually call these, and the cost of the wrappers is assumed to be small. However, if the wrapper functions are not in the profiling library, then none of the profiled entry points will be undefined when the profiling library is called. Therefore none of the profiling code will be included in the image. When the standard MPI library is scanned, the Fortran wrappers will be resolved, and will also pull in the base versions of the MPI functions. The overall effect is that the code will link successfully, but will not be profiled.
To overcome this we must ensure that the Fortran wrapper functions are included in the profiling version of the library. We ensure that this is possible by requiring that these be separable from the rest of the base MPI library. This allows them to be extracted out of the base library and placed into the profiling library using the Unix ar command.
The scheme given here does not directly support the nesting of profiling functions, since it provides only a single alternative name for each MPI function. The MPI Forum gave consideration to an implementation which would allow multiple levels of call interception; however, it was unable to construct an implementation of this which did not have the disadvantages of assuming a particular implementation language, or of imposing a run time cost even when no profiling was taking place.
Since one of the objectives of MPI is to permit efficient, low latency implementations, and it is not the business of a standard to require a particular implementation language, the MPI Forum decided to accept the scheme outlined above.
Note, however, that it is possible to use the scheme above to implement a multi-level system, since the function called by the user may call many different profiling functions before calling the underlying MPI function.
Unfortunately such an implementation may require more cooperation between the different profiling libraries than is required for the single level implementation detailed above.
This book has attempted to give a complete description of the MPI specification, and includes code examples to illustrate aspects of the use of MPI. After reading the preceding chapters programmers should feel comfortable using MPI to develop message-passing applications. This final chapter addresses some important topics that either do not easily fit into the other chapters, or which are best dealt with after a good overall understanding of MPI has been gained. These topics are concerned more with the interpretation of the MPI specification, and the rationale behind some aspects of its design, rather than with semantics and syntax. Future extensions to MPI and the current status of MPI implementations will also be discussed.
One aspect of concern, particularly to novices, is the large number of routines comprising the MPI specification. In all there are 128 MPI routines, and further extensions (see Section ) will probably increase their number. There are two fundamental reasons for the size of MPI. The first reason is that MPI was designed to be rich in functionality. This is reflected in MPI's support for derived datatypes, modular communication via the communicator abstraction, caching, application topologies, and the fully-featured set of collective communication routines. The second reason for the size of MPI reflects the diversity and complexity of today's high performance computers. This is particularly true with respect to the point-to-point communication routines, where the different communication modes (see Sections and ) arise mainly as a means of providing a set of the most widely-used communication protocols. For example, the synchronous communication mode corresponds closely to a protocol that minimizes the copying and buffering of data through a rendezvous mechanism. A protocol that attempts to initiate delivery of messages as soon as possible would provide buffering for messages, and this corresponds closely to the buffered communication mode (or the standard mode if this is implemented with sufficient buffering). One could decrease the number of functions by increasing the number of parameters in each call. But such an approach would increase the call overhead and would make the use of the most prevalent calls more complex. The availability of a large number of calls to deal with more esoteric features of MPI allows one to provide a simpler interface to the more frequently used functions.
There are two potential reasons why we might be concerned about the size of MPI. The first is that potential users might equate size with complexity and decide that MPI is too complicated to bother learning. The second is that vendors might decide that MPI is too difficult to implement. The design of MPI addresses the first of these concerns by adopting a layered approach. For example, novices can avoid having to worry about groups and communicators by performing all communication in the pre-defined communicator MPI_COMM_WORLD. In fact, most existing message-passing applications can be ported to MPI simply by converting the communication routines on a one-for-one basis (although the resulting MPI application may not be optimally efficient). To allay the concerns of potential implementors the MPI Forum at one stage considered defining a core subset of MPI known as the MPI subset that would be substantially smaller than MPI and include just the point-to-point communication routines and a few of the more commonly-used collective communication routines. However, early work by Lusk, Gropp, Skjellum, Doss, Franke and others on early implementations of MPI showed that it could be fully implemented without a prohibitively large effort [16] [12]. Thus, the rationale for the MPI subset was lost, and this idea was dropped.
Message passing is a programming paradigm used widely on parallel computers, especially Scalable Parallel Computers (SPCs) with distributed memory, and on Networks of Workstations (NOWs). Although there are many variations, the basic concept of processes communicating through messages is well understood. Over the last ten years, substantial progress has been made in casting significant applications into this paradigm. Each vendor has implemented its own variant. More recently, several public-domain systems have demonstrated that a message-passing system can be efficiently and portably implemented. It is thus an appropriate time to define both the syntax and semantics of a standard core of library routines that will be useful to a wide range of users and efficiently implementable on a wide range of computers. This effort has been undertaken over the last three years by the Message Passing Interface (MPI) Forum, a group of more than 80 people from 40 organizations, representing vendors of parallel systems, industrial users, industrial and national research laboratories, and universities.
The designers of MPI sought to make use of the most attractive features of a number of existing message-passing systems, rather than selecting one of them and adopting it as the standard. Thus, MPI has been strongly influenced by work at the IBM T. J. Watson Research Center [2] [1], Intel's NX/2 [24], Express [23], nCUBE's Vertex [22], p4 [5] [6], and PARMACS [7] [3]. Other important contributions have come from Zipcode [26] [25], Chimp [14] [13], PVM [27] [17], Chameleon [19], and PICL [18]. The MPI Forum identified some critical shortcomings of existing message-passing systems, in areas such as complex data layouts or support for modularity and safe communication. This led to the introduction of new features in MPI.
The MPI standard defines the user interface and functionality for a wide range of message-passing capabilities. Since its completion in June of 1994, MPI has become widely accepted and used. Implementations are available on a range of machines from SPCs to NOWs. A growing number of SPCs have an MPI supplied and supported by the vendor. Because of this, MPI has achieved one of its goals - adding credibility to parallel computing. Third party vendors, researchers, and others now have a reliable and portable way to express message-passing, parallel programs.
The major goal of MPI, as with most standards, is a degree of portability across different machines. The expectation is for a degree of portability comparable to that given by programming languages such as Fortran. This means that the same message-passing source code can be executed on a variety of machines as long as the MPI library is available, while some tuning might be needed to take best advantage of the features of each system. Though message passing is often thought of in the context of distributed-memory parallel computers, the same code can run well on a shared-memory parallel computer. It can run on a network of workstations, or, indeed, as a set of processes running on a single workstation. Knowing that efficient MPI implementations exist across a wide variety of computers gives a high degree of flexibility in code development, debugging, and in choosing a platform for production runs.
Another type of compatibility offered by MPI is the ability to run transparently on heterogeneous systems, that is, collections of processors with distinct architectures. It is possible for an MPI implementation to span such a heterogeneous collection, yet provide a virtual computing model that hides many architectural differences. The user need not worry whether the code is sending messages between processors of like or unlike architecture. The MPI implementation will automatically do any necessary data conversion and utilize the correct communications protocol. However, MPI does not prohibit implementations that are targeted to a single, homogeneous system, and does not mandate that distinct implementations be interoperable. Users that wish to run on a heterogeneous system must use an MPI implementation designed to support heterogeneity.
Portability is central, but the standard would not gain wide usage if portability were achieved at the expense of performance. For example, Fortran is commonly used over assembly languages because compilers are almost always available that yield acceptable performance compared to the non-portable alternative of assembly languages. A crucial point is that MPI was carefully designed so as to allow efficient implementations. The design choices seem to have been made correctly, since MPI implementations over a wide range of platforms are achieving high performance, comparable to that of less portable, vendor-specific systems.
An important design goal of MPI was to allow efficient implementations across machines of differing characteristics. For example, MPI carefully avoids specifying how operations will take place. It only specifies what an operation does logically. As a result, MPI can be easily implemented on systems that buffer messages at the sender, receiver, or do no buffering at all. Implementations can take advantage of specific features of the communication subsystem of various machines. On machines with intelligent communication coprocessors, much of the message passing protocol can be offloaded to this coprocessor. On other systems, most of the communication code is executed by the main processor. Another example is the use of opaque objects in MPI. By hiding the details of how MPI-specific objects are represented, each implementation is free to do whatever is best under the circumstances.
Another design choice leading to efficiency is the avoidance of unnecessary work. MPI was carefully designed so as to avoid a requirement for large amounts of extra information with each message, or the need for complex encoding or decoding of message headers. MPI also avoids extra computation or tests in critical routines since this can degrade performance. Another way of minimizing work is to encourage the reuse of previous computations. MPI provides this capability through constructs such as persistent communication requests and caching of attributes on communicators. The design of MPI avoids the need for extra copying and buffering of data: in many cases, data can be moved from the user memory directly to the wire, and be received directly from the wire to the receiver memory.
MPI was designed to encourage overlap of communication and computation, so as to take advantage of intelligent communication agents, and to hide communication latencies. This is achieved by the use of nonblocking communication calls, which separate the initiation of a communication from its completion.
Scalability is an important goal of parallel processing. MPI supports scalability through several of its design features. For example, an application can create subgroups of processes, which, in turn, allows collective communication operations to limit their scope to the processes involved. Another technique used is to provide functionality without a computation that scales as the number of processes. For example, a two-dimensional Cartesian topology can be subdivided into its one-dimensional rows or columns without explicitly enumerating the processes.
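For instance, the row subdivision just mentioned can be expressed with a single call; in this sketch, comm2d is assumed to be a two-dimensional Cartesian communicator created earlier with MPI_Cart_create.

MPI_Comm rowcomm;
int remain_dims[2] = {0, 1};   /* drop the first dimension, keep the second */

MPI_Cart_sub(comm2d, remain_dims, &rowcomm);   /* one communicator per row */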
Finally, MPI, like all good standards, is valuable in that it defines a known, minimum behavior of message-passing implementations. This relieves the programmer from having to worry about certain problems that can arise. One example is that MPI guarantees that the underlying transmission of messages is reliable. The user need not check if a message is received correctly.
We use the ANSI C declaration format. All MPI names have an MPI_ prefix, defined constants are in all capital letters, and defined types and functions have one capital letter after the prefix. Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_. This is mandated to avoid possible name collisions.
The definition of named constants, function prototypes, and type definitions must be supplied in an include file mpi.h.
Almost all C functions return an error code. The successful return code will be MPI_SUCCESS, but failure return codes are implementation dependent. A few C functions do not return error codes, so that they can be implemented as macros.
Type declarations are provided for handles to each category of opaque objects. Either a pointer or an integer type is used.
Array arguments are indexed from zero.
Logical flags are integers with value 0 meaning ``false'' and a non-zero value meaning ``true.''
Choice arguments are pointers of type void*.
Address arguments are of MPI defined type MPI_Aint. This is defined to be an int of the size needed to hold any valid address on the target architecture.
All named MPI constants can be used in initialization expressions or assignments like C constants.
MPI does not guarantee to buffer arbitrary messages because memory is a finite resource on all computers. Thus, all computers will fail under sufficiently adverse communication loads. Different computers at different times are capable of providing differing amounts of buffering, so if a program relies on buffering it may fail under certain conditions, but work correctly under other conditions. This is clearly undesirable.
Given that no message passing system can guarantee that messages will be buffered as required under all circumstances, it might be asked why MPI does not guarantee a minimum amount of memory available for buffering. One major problem is that it is not obvious how to specify the amount of buffer space that is available, nor is it easy to estimate how much buffer space is consumed by a particular program.
Different buffering policies make sense in different environments. Messages can be buffered at the sending node or at the receiving node, or both. In the former case,
The choice of the right policy is strongly dependent on the hardware and software environment. For instance, in a dedicated environment, a processor with a process blocked on a send is idle and so computing resources are not wasted if this processor copies the outgoing message to a buffer. In a time shared environment, the computing resources may be used by another process. In a system where buffer space can be in paged memory, such space can be allocated from heap. If the buffer space cannot be paged, or has to be in kernel space, then a separate buffer is needed. Flow control may require that some amount of buffer space be dedicated to each pair of communicating processes.
The optimal strategy strongly depends on various performance parameters of the system: the bandwidth, the communication start-up time, scheduling and context switching overheads, the amount of potential overlap between communication and computation, etc. The choice of a buffering and scheduling policy may not be entirely under the control of the MPI implementor, as it is partially determined by the properties of the underlying communication layer. Also, experience in this arena is quite limited, and underlying technology can be expected to change rapidly: fast, user-space interprocessor communication mechanisms are an active research area [20] [28].
Attempts by the MPI Forum to design mechanisms for querying or setting the amount of buffer space available to standard communication led to the conclusion that such mechanisms will either restrict allowed implementations unacceptably, or provide bounds that will be extremely pessimistic on most implementations in most cases. Another problem is that parameters such as buffer sizes work against portability. Rather than restricting the implementation strategies for standard communication, the choice was taken to provide additional communication modes for those users that do not want to trust the implementation to make the right choice for them.
The MPI specification was designed to make it possible to write portable message passing programs while avoiding unacceptable performance degradation. Within the context of MPI, ``portable'' is synonymous with ``safe.'' Unsafe programs may exhibit a different behavior on different systems because they are non-deterministic: several outcomes are consistent with the MPI specification, and the actual outcome to occur depends on the precise timing of events. Unsafe programs may require resources that are not always guaranteed by MPI, in order to complete successfully. On systems where such resources are unavailable, the program will encounter a resource error. Such an error will manifest itself as an actual program error, or will result in deadlock.
There are three main issues relating to the portability of MPI programs (and, indeed, message passing programs in general): the dependence of a program on the buffering of messages, assumptions about whether collective communication operations synchronize the participating processes, and the ambiguity of communication when the message traffic of libraries and of the rest of the application can be confused.
If proper attention is not paid to these factors a message passing code may fail intermittently on a given computer, or may work correctly on one machine but not on another. Clearly such a program is not portable. We shall now consider each of the above factors in more detail.
A message passing program is dependent on the buffering of messages if its communication graph has a cycle. The communication graph is a directed graph in which the nodes represent MPI communication calls and the edges represent dependencies between these calls: a directed edge uv indicates that operation v might not be able to complete before operation u is started. Calls may be dependent because they have to be executed in succession by the same process, or because they are matching send and receive calls.
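The code referred to here is the familiar cyclic shift, in which every process executes a blocking standard-mode send to its neighbor in one direction followed by a blocking receive from the neighbor in the other direction. A sketch of the pattern (with clock and anticlock the ranks of the two ring neighbors, assumed to have been computed already) is:

MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);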
The execution of the code results in the dependency graph illustrated in Figure , for the case of a three process group.
Figure: Cycle in communication graph for cyclic shift.
The arrow from each send to the following receive executed by the same process reflects the program dependency within each process: the receive call cannot be executed until the previous send call has completed. The double arrow between each send and the matching receive reflects their mutual dependency: Obviously, the receive cannot complete unless the matching send was invoked. Conversely, since a standard mode send is used, it may be the case that the send blocks until a matching receive occurs.
The dependency graph has a cycle. This code will only work if the system provides sufficient buffering, in which case the send operation will complete locally, the call to MPI_Send() will return, and the matching call to MPI_Recv() will be performed. In the absence of sufficient buffering MPI does not specify an outcome, but for most implementations deadlock will occur, i.e., the call to MPI_Send() will never return: each process will wait for the next process on the ring to execute a matching receive. Thus, the behavior of this code will differ from system to system, or on the same system, when message size (count) is changed.
There are a number of ways in which a shift operation can be performed portably using MPI. These are: ordering the sends and receives so that some processes receive before sending; using buffered sends; using nonblocking communication; and using the combined send-receive routine MPI_Sendrecv().
If at least one process in a shift operation calls the receive routine before the send routine, and at least one process calls the send routine before the receive routine, then at least one communication can proceed, and, eventually, the shift will complete successfully. One of the most efficient ways of doing this is to alternate the send and receive calls so that all processes with even rank send first and then receive, and all processes with odd rank receive first and then send. Thus, the following code is portable provided there is more than one process, i.e., clock and anticlock are different:
if (rank%2) {
    MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
    MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
}
else {
    MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
    MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
}

The resulting communication graph is illustrated in Figure . This graph is acyclic.
Figure: Cycle in communication graph is broken by reordering send and receive.
If there is only one process then clearly blocking send and receive routines cannot be used since the send must be called before the receive, and so cannot complete in the absence of buffering.
We now consider methods for performing shift operations that work even if there is only one process involved. A blocking send in buffered mode can be used to perform a shift operation. In this case the application program passes a buffer to the MPI communication system, and MPI can use this to buffer messages. If the buffer provided is large enough, then the shift will complete successfully. The following code shows how to use buffered mode to create a portable shift operation.
...
MPI_Pack_size (count, MPI_INT, comm, &buffsize);
buffsize += MPI_BSEND_OVERHEAD;
userbuf = malloc (buffsize);
MPI_Buffer_attach (userbuf, buffsize);
MPI_Bsend (buf1, count, MPI_INT, clock, tag, comm);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
MPI guarantees that the buffer supplied by a call to MPI_Buffer_attach() will be used if it is needed to buffer the message. (In an implementation of MPI that provides sufficient buffering, the user-supplied buffer may be ignored.) Each buffered send operation can complete locally, so that a deadlock will not occur. The acyclic communication graph for this modified code is shown in Figure . Each receive depends on the matching send, but the send no longer depends on the matching receive.
Figure: Cycle in communication graph is broken by using buffered sends.
Another approach is to use nonblocking communication. One can either use a nonblocking send, a nonblocking receive, or both. If a nonblocking send is used, the call to MPI_Isend() initiates the send operation and then returns. The call to MPI_Recv() can then be made, and the communication completes successfully. After the call to MPI_Isend(), the data in buf1 must not be changed until one is certain that the data have been sent or copied by the system. MPI provides the routines MPI_Wait() and MPI_Test() to check on this. Thus, the following code is portable,
...
MPI_Isend (buf1, count, MPI_INT, clock, tag, comm, &request);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
MPI_Wait (&request, &status);
The corresponding acyclic communication graph is shown in Figure .
Figure: Cycle in communication graph is broken by using
nonblocking sends.
Each receive operation depends on the matching send, and each wait depends on the matching communication; the send does not depend on the matching receive, as a nonblocking send call will return even if no matching receive is posted.
(Posted nonblocking communications do consume resources: MPI has to keep track of such posted communications. But the amount of resources consumed is proportional to the number of posted communications, not to the total size of the pending messages. Good MPI implementations will support a large number of pending nonblocking communications, so that this will not cause portability problems.)
An alternative approach is to perform a nonblocking receive first to initiate (or ``post'') the receive, and then to perform a blocking send in standard mode.
...
MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
MPI_Wait (&request, &status);
The call to MPI_Irecv() indicates to MPI that incoming data should be stored in buf2; thus, no buffering is required. The call to MPI_Wait() is needed to block until the data has actually been received into buf2. This alternative code will often result in improved performance, since sends complete faster in many implementations when the matching receive is already posted.
Finally, a portable shift operation can be implemented using the routine MPI_Sendrecv(), which was explicitly designed to send to one process while receiving from another in a safe and portable way. In this case only a single call is required;
... MPI_Sendrecv (buf1, count, MPI_INT, clock, tag, buf2, count, MPI_INT, anticlock, tag, comm, &status);
The MPI specification purposefully does not mandate whether or not collective communication operations have the side effect of synchronizing the processes over which they operate. Thus, in one valid implementation collective communication operations may synchronize processes, while in another equally valid implementation they do not. Portable MPI programs, therefore, must not rely on whether or not collective communication operations synchronize processes. Thus, assumptions such as the one illustrated by the following code must be avoided.
MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Bcast (buf3, 1, MPI_CHAR, 0, comm);
MPI_Rsend (buf1, count, MPI_INT, clock, tag, comm);
MPI_Wait (&request, &status);
Here, if we want to perform the send in ready mode, we must be certain that the receive has already been initiated at the destination. The above code is nonportable because, if the broadcast does not act as a barrier synchronization, we cannot be sure this is the case.
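One portable way to satisfy the ready-mode requirement is to post the receive and then synchronize explicitly with a barrier before issuing the ready send. The fragment below is a minimal sketch (not taken from the original example), assuming buf1, buf2, clock, anticlock, count, tag, and comm are as in the earlier shift examples.

...
MPI_Request request;
MPI_Status status;

MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Barrier (comm);                                  /* all receives are now posted */
MPI_Rsend (buf1, count, MPI_INT, clock, tag, comm);  /* ready send is now legal */
MPI_Wait (&request, &status);                        /* complete the receive */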
MPI employs the communicator abstraction to promote software modularity by allowing the construction of independent communication streams between processes, thereby ensuring that messages sent in one phase of an application are not incorrectly intercepted by another phase. Communicators are particularly important in allowing libraries that make message passing calls to be used safely within an application. The point here is that the application developer has no way of knowing if the tag, group, and rank completely disambiguate the message traffic of different libraries and the rest of the application. Communicators, in effect, provide an additional criterion for message selection, and hence permit the construction of independent tag spaces.
We discussed in Section possible hazards when a library uses the same communicator as the calling code. The incorrect matching of sends executed by the caller code with receives executed by the library occurred because the library code used wildcarded receives. Conversely, incorrect matches may occur when the caller code uses wildcarded receives, even if the library code by itself is deterministic.
Consider the example in Figure . If the program behaves correctly, processes 0 and 1 each receive a message from process 2, using a wildcarded selection criterion to indicate that they are prepared to receive a message from any process. The three processes then pass data around in a ring within the library routine. If separate communicators are not used for the communication inside and outside of the library routine, this program may fail intermittently. Suppose we delay the sending of the second message sent by process 2, for example, by inserting some computation, as shown in Figure . In this case the wildcarded receive in process 0 is satisfied by a message sent from process 1, rather than from process 2, and deadlock results.
Figure: Use of communicators. Numbers
in parentheses indicate the process to which data are being sent or received.
The gray shaded area represents the library routine call. In this case
the program behaves as intended. Note that the second message sent by process
2 is received by process 0, and that the message sent by process 0 is
received by process 2.
Figure: Unintended behavior of program. In this case the message from process 2
to process 0 is never received, and deadlock results.
Even if neither caller nor callee use wildcarded receives, incorrect matches may still occur if a send initiated before the collective library invocation is to be matched by a receive posted after the invocation (Ex. , page ). By using a different communicator in the library routine we can ensure that the program is executed correctly, regardless of when the processes enter the library routine.
Heterogeneous computing uses different computers connected by a network to solve a problem in parallel. With heterogeneous computing a number of issues arise that are not applicable when using a homogeneous parallel computer. For example, the computers may be of differing computational power, so care must be taken to distribute the work between them to avoid load imbalance. Other problems may arise because of the different behavior of floating point arithmetic on different machines. However, the two most fundamental issues that must be faced in heterogeneous computing are incompatible data representations and interoperability of different implementations of the message passing layer; we consider each in turn.
Incompatible data representations arise when computers use different binary representations for the same number. In MPI all communication routines have a datatype argument so implementations can use this information to perform the appropriate representation conversion when communicating data between computers.
Interoperability refers to the ability of different implementations of a given piece of software to work together as if they were a single homogeneous implementation. A prerequisite of interoperability for MPI would be the standardization of MPI's internal data structures, of the communication protocols, of the initialization, termination and error handling procedures, of the implementation of collective operations, and so on. Since this has not been done, there is no support for interoperability in MPI. In general, hardware-specific implementations of MPI will not be interoperable. However, it is still possible for different architectures to work together if they both use the same portable MPI implementation.
At the time of writing, several portable implementations of MPI exist.
In addition, hardware-specific MPI implementations exist for the Cray T3D, the IBM SP-2, the NEC Cenju, and the Fujitsu AP1000.
Information on MPI implementations and other useful information on MPI can be found on the MPI web pages at Argonne National Laboratory (http://www.mcs.anl.gov/mpi) and at Mississippi State University (http://www.erc.msstate.edu/mpi). Additional information can be found on the MPI newsgroup comp.parallel.mpi and on netlib.
When the MPI Forum reconvened in March 1995, the main reason was to produce a new version of MPI that would have significant new features. The original MPI is being referred to as MPI-1 and the new effort is being called MPI-2. The need and desire to extend MPI-1 arose from several factors. One consideration was that the MPI-1 effort had a constrained scope. This was done to avoid introducing a standard that was seen as too large and burdensome for implementors. It was also done to complete MPI-1 in the Forum-imposed deadline of one year. A second consideration for limiting MPI-1 was the feeling by many Forum members that some proposed areas were still under investigation. As a result, the MPI Forum decided not to propose a standard in these areas for fear of discouraging useful investigations into improved methods.
The MPI Forum is now actively meeting and discussing extensions to MPI-1 that will become MPI-2. The areas currently under discussion are:
External Interfaces: This will define interfaces to allow easy extension of MPI with libraries, and facilitate the implementation of packages such as debuggers and profilers. Among the issues considered are mechanisms for defining new nonblocking operations and mechanisms for accessing internal MPI information.
One-Sided Communications: This will extend MPI to allow communication that does not require execution of matching calls at both communicating processes. Examples of such operations are put/get operations that allow a process to access data in another process' memory, messages with interrupts (e.g., active messages), and Read-Modify-Write operations (e.g., fetch and add).
Dynamic Resource Management: This will extend MPI to allow the acquisition of computational resources and the spawning and destruction of processes after MPI_INIT.
Extended Collective: This will extend the collective calls to be non-blocking and apply to inter-communicators.
Bindings: This will produce bindings for Fortran 90 and C++.
Real Time: This will provide some support for real time processing.

Since the MPI-2 effort is ongoing, the topics and areas covered are still subject to change.
The MPI Forum set a timetable at its first meeting in March 1995. The goal is release of a preliminary version of certain parts of MPI-2 in December 1995 at Supercomputing '95. This is to include dynamic processes. The goal of this early release is to allow testing of the ideas and to receive extended public comments. The complete version of MPI-2 will be released at Supercomputing '96 for final public comment. The final version of MPI-2 is scheduled for release in the spring of 1997.
The basic communication mechanism of MPI is the transmittal of data between a pair of processes, one side sending, the other receiving. We call this ``point to point communication.'' Almost all the constructs of MPI are built around the point to point operations, and so this chapter is fundamental. It is also quite a long chapter, since there are many variants to the point to point operations, there is much to say about their semantics, and related topics, such as probing for messages, are explained here because they are used in conjunction with the point to point operations.
MPI provides a set of send and receive functions that allow the communication of typed data with an associated tag. Typing of the message contents is necessary for heterogeneous support: the type information is needed so that correct data representation conversions can be performed as data is sent from one architecture to another. The tag allows selectivity of messages at the receiving end: one can receive on a particular tag, or one can wildcard this quantity, allowing reception of messages with any tag. Message selectivity on the source process of the message is also provided.
A fragment of C code appears in Example for the example of process 0 sending a message to process 1. The code executes on both process 0 and process 1. Process 0 sends a character string using MPI_Send(). The first three parameters of the send call specify the data to be sent: the outgoing data is to be taken from msg; it consists of strlen(msg)+1 entries, each of type MPI_CHAR. (The string "Hello there" contains strlen(msg)=11 significant characters. In addition, we are also sending the '\0' string terminator character.) The fourth parameter specifies the message destination, which is process 1. The fifth parameter specifies the message tag. Finally, the last parameter is a communicator that specifies a communication domain for this communication. Among other things, a communicator serves to define a set of processes that can be contacted. Each such process is labeled by a process rank. Process ranks are integers and are discovered by inquiry to a communicator (see the call to MPI_Comm_rank()). MPI_COMM_WORLD is a default communicator provided upon start-up that defines an initial communication domain for all the processes that participate in the computation. Much more will be said about communicators in Chapter .
The receiving process specified that the incoming data was to be placed in msg and that it had a maximum size of 20 entries, of type MPI_CHAR. The variable status, set by MPI_Recv(), gives information on the source and tag of the message and how many elements were actually received. For example, the receiver can examine this variable to find out the actual length of the character string received. Datatype matching (between sender and receiver) and data conversion on heterogeneous systems are discussed in more detail in Section .
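For reference, the following is a minimal sketch of the program described above; the tag value 99 and the printf call are illustrative choices and are not taken from the original example.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char msg[20];
    int myrank, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        /* process 0 sends the string, including the '\0' terminator */
        strcpy(msg, "Hello there");
        MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        /* process 1 receives into a buffer of at most 20 characters */
        MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("%s\n", msg);
    }
    MPI_Finalize();
    return 0;
}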
The Fortran version of this code is shown in Example . In order to make our Fortran examples more readable, we use Fortran 90 syntax, here and in many other places in this book. The examples can be easily rewritten in standard Fortran 77. The Fortran code is essentially identical to the C code. All MPI calls are procedures, and an additional parameter is used to return the value returned by the corresponding C function. Note that Fortran strings have fixed size and are not null-terminated. The receive operation stores "Hello there" in the first 11 positions of msg.
These examples employed blocking send and receive functions. The send call blocks until the send buffer can be reclaimed (i.e., after the send, process 0 can safely overwrite the contents of msg). Similarly, the receive function blocks until the receive buffer actually contains the contents of the message. MPI also provides nonblocking send and receive functions that allow the possible overlap of message transmittal with computation, or the overlap of multiple message transmittals with one another. Nonblocking functions always come in two parts: the posting functions, which begin the requested operation; and the test-for-completion functions, which allow the application program to discover whether the requested operation has completed. Our chapter begins by explaining blocking functions in detail, in Sections - , while nonblocking functions are covered later, in Sections - .
We have already said rather a lot about a simple transmittal of data from one process to another, but there is even more. To understand why, we examine two aspects of the communication: the semantics of the communication primitives, and the underlying protocols that implement them. Consider the previous example, on process 0, after the blocking send has completed. The question arises: if the send has completed, does this tell us anything about the receiving process? Can we know that the receive has finished, or even, that it has begun?
Such questions of semantics are related to the nature of the underlying protocol implementing the operations. If one wishes to implement a protocol minimizing the copying and buffering of data, the most natural semantics might be the ``rendezvous'' version, where completion of the send implies the receive has been initiated (at least). On the other hand, a protocol that attempts to block processes for the minimal amount of time will necessarily end up doing more buffering and copying of data and will have ``buffering'' semantics.
The trouble is, one choice of semantics is not best for all applications, nor is it best for all architectures. Because the primary goal of MPI is to standardize the operations, yet not sacrifice performance, the decision was made to include all the major choices for point to point semantics in the standard.
The above complexities are manifested in MPI by the existence of modes for point to point communication. Both blocking and nonblocking communications have modes. The mode allows one to choose the semantics of the send operation and, in effect, to influence the underlying protocol of the transfer of data.
In standard mode the completion of the send does not necessarily mean that the matching receive has started, and no assumption should be made in the application program about whether the outgoing data is buffered by MPI. In buffered mode the user can guarantee that a certain amount of buffering space is available. The catch is that the space must be explicitly provided by the application program. In synchronous mode a rendezvous semantics between sender and receiver is used. Finally, there is ready mode. This allows the user to exploit extra knowledge to simplify the protocol and potentially achieve higher performance. In a ready-mode send, the user asserts that the matching receive already has been posted. Modes are covered in Section .
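As a quick summary, the four modes correspond to four send functions with identical argument lists; the sketch below assumes buf, count, dest, tag, and comm have been set up as in the earlier examples.

MPI_Send  (buf, count, MPI_INT, dest, tag, comm);  /* standard: buffering decided by MPI */
MPI_Bsend (buf, count, MPI_INT, dest, tag, comm);  /* buffered: uses user-attached buffer space */
MPI_Ssend (buf, count, MPI_INT, dest, tag, comm);  /* synchronous: completes only after the
                                                      matching receive has started */
MPI_Rsend (buf, count, MPI_INT, dest, tag, comm);  /* ready: matching receive must already
                                                      be posted */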
This section describes standard-mode, blocking sends and receives.
MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
MPI_SEND performs a standard-mode, blocking send. The semantics of this function are described in Section . The arguments to MPI_SEND are described in the following subsections.
The send buffer specified by MPI_SEND consists of count successive entries of the type indicated by datatype, starting with the entry at address buf. Note that we specify the message length in terms of number of entries, not number of bytes. The former is machine independent and facilitates portable programming. The count may be zero, in which case the data part of the message is empty. The basic datatypes correspond to the basic datatypes of the host language. Possible values of this argument for Fortran and the corresponding Fortran types are listed below.
MPI_INTEGER (INTEGER), MPI_REAL (REAL), MPI_DOUBLE_PRECISION (DOUBLE PRECISION), MPI_COMPLEX (COMPLEX), MPI_LOGICAL (LOGICAL), MPI_CHARACTER (CHARACTER(1)), MPI_BYTE, and MPI_PACKED.

Possible values for this argument for C and the corresponding C types are listed below.

MPI_CHAR (signed char), MPI_SHORT (signed short int), MPI_INT (signed int), MPI_LONG (signed long int), MPI_UNSIGNED_CHAR (unsigned char), MPI_UNSIGNED_SHORT (unsigned short int), MPI_UNSIGNED (unsigned int), MPI_UNSIGNED_LONG (unsigned long int), MPI_FLOAT (float), MPI_DOUBLE (double), MPI_LONG_DOUBLE (long double), MPI_BYTE, and MPI_PACKED.
The datatypes MPI_BYTE and MPI_PACKED do not correspond to a Fortran or C datatype. A value of type MPI_BYTE consists of a byte (8 binary digits). A byte is uninterpreted and is different from a character. Different machines may have different representations for characters, or may use more than one byte to represent characters. On the other hand, a byte has the same binary value on all machines. The use of MPI_PACKED is explained in Section .
MPI requires support of the datatypes listed above, which match the basic datatypes of Fortran 77 and ANSI C. Additional MPI datatypes should be provided if the host language has additional data types. Some examples are: MPI_LONG_LONG, for C integers declared to be of type long long; MPI_DOUBLE_COMPLEX, for double precision complex in Fortran declared to be of type DOUBLE COMPLEX; MPI_REAL2, MPI_REAL4 and MPI_REAL8 for Fortran reals, declared to be of type REAL*2, REAL*4 and REAL*8, respectively; and MPI_INTEGER1, MPI_INTEGER2 and MPI_INTEGER4 for Fortran integers, declared to be of type INTEGER*1, INTEGER*2 and INTEGER*4, respectively. In addition, MPI provides a mechanism for users to define new, derived, datatypes. This is explained in Chapter .
In addition to data, messages carry information that is used to distinguish and selectively receive them. This information consists of a fixed number of fields, which we collectively call the message envelope. These fields are source, destination, tag, and communicator.
The message source is implicitly determined by the identity of the message sender. The other fields are specified by arguments in the send operation.
The comm argument specifies the communicator used for the send operation. The communicator is a local object that represents a communication domain. A communication domain is a global, distributed structure that allows processes in a group to communicate with each other, or to communicate with processes in another group. A communication domain of the first type (communication within a group) is represented by an intracommunicator, whereas a communication domain of the second type (communication between groups) is represented by an intercommunicator. Processes in a group are ordered, and are identified by their integer rank. Processes may participate in several communication domains; distinct communication domains may have partially or even completely overlapping groups of processes. Each communication domain supports a disjoint stream of communications. Thus, a process may be able to communicate with another process via two distinct communication domains, using two distinct communicators. The same process may be identified by a different rank in the two domains; and communications in the two domains do not interfere. MPI applications begin with a default communication domain that includes all processes (of this parallel job); the default communicator MPI_COMM_WORLD represents this communication domain. Communicators are explained further in Chapter .
The message destination is specified by the dest argument. The range of valid values for dest is 0,...,n-1, where n is the number of processes in the group. This range includes the rank of the sender: if comm is an intracommunicator, then a process may send a message to itself. If the communicator is an intercommunicator, then destinations are identified by their rank in the remote group.
The integer-valued message tag is specified by the tag argument. This integer can be used by the application to distinguish messages. The range of valid tag values is 0,...,UB, where the value of UB is implementation dependent. It is found by querying the value of the attribute MPI_TAG_UB, as described in Chapter . MPI requires that UB be no less than 32767.
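For example, the tag upper bound can be queried as sketched below; in MPI-1 the attribute value is returned as a pointer to an int, and the flag indicates whether the attribute is set.

int *tag_ub_ptr, flag;

MPI_Attr_get (MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
if (flag)
    printf ("largest valid tag: %d\n", *tag_ub_ptr);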
MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_RECV performs a standard-mode, blocking receive. The semantics of this function are described in Section . The arguments to MPI_RECV are described in the following subsections.
The receive buffer consists of storage sufficient to contain count consecutive entries of the type specified by datatype, starting at address buf. The length of the received message must be less than or equal to the length of the receive buffer. An overflow error occurs if all incoming data does not fit, without truncation, into the receive buffer. We explain in Chapter how to check for errors. If a message that is shorter than the receive buffer arrives, then the incoming message is stored in the initial locations of the receive buffer, and the remaining locations are not modified.
The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs. As such the interface should establish a practical, portable, efficient, and flexible standard for message passing.
A list of the goals of MPI appears below.
The selection of a message by a receive operation is governed by the value of its message envelope. A message can be received if its envelope matches the source, tag and comm values specified by the receive operation. The receiver may specify a wildcard value for source (MPI_ANY_SOURCE), and/or a wildcard value for tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable. One cannot specify a wildcard value for comm.
The argument source, if different from MPI_ANY_SOURCE, is specified as a rank within the process group associated with the communicator (remote process group, for intercommunicators). The range of valid values for the source argument is {0,...,n-1} ∪ {MPI_ANY_SOURCE}, where n is the number of processes in this group. This range includes the receiver's rank: if comm is an intracommunicator, then a process may receive a message from itself. The range of valid values for the tag argument is {0,...,UB} ∪ {MPI_ANY_TAG}.
The receive call does not specify the size of an incoming message, but only an upper bound. The source or tag of a received message may not be known if wildcard values were used in a receive operation. Also, if multiple requests are completed by a single MPI function (see Section ), a distinct error code may be returned for each request. (Usually, the error code is returned as the value of the function in C, and as the value of the IERROR argument in Fortran.)
This information is returned by the status argument of MPI_RECV. The type of status is defined by MPI. Status variables need to be explicitly allocated by the user, that is, they are not system objects.
In C, status is a structure of type MPI_Status that contains three fields named MPI_SOURCE, MPI_TAG, and MPI_ERROR; the structure may contain additional fields. Thus, status.MPI_SOURCE, status.MPI_TAG and status.MPI_ERROR contain the source, tag and error code, respectively, of the received message.
In Fortran, status is an array of INTEGERs of length MPI_STATUS_SIZE. The three constants MPI_SOURCE, MPI_TAG and MPI_ERROR are the indices of the entries that store the source, tag and error fields. Thus status(MPI_SOURCE), status(MPI_TAG) and status(MPI_ERROR) contain, respectively, the source, the tag and the error code of the received message.
The status argument also returns information on the length of the message received. However, this information is not directly available as a field of the status variable and a call to MPI_GET_COUNT is required to ``decode'' this information.
MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR
MPI_GET_COUNT takes as input the status set by MPI_RECV and computes the number of entries received. The number of entries is returned in count. The datatype argument should match the argument provided to the receive call that set status. (Section explains that MPI_GET_COUNT may return, in certain situations, the value MPI_UNDEFINED.)
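The sketch below, assuming an integer receive buffer of MAXLEN entries, shows how the status fields and MPI_Get_count() are typically used together after a wildcarded receive.

int rbuf[MAXLEN], nreceived;
MPI_Status status;

MPI_Recv (rbuf, MAXLEN, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
MPI_Get_count (&status, MPI_INT, &nreceived);   /* actual number of entries received */
printf ("got %d entries from process %d with tag %d\n",
        nreceived, status.MPI_SOURCE, status.MPI_TAG);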
Note the asymmetry between send and receive operations. A receive operation may accept messages from an arbitrary sender, but a send operation must specify a unique receiver. This matches a ``push'' communication mechanism, where data transfer is effected by the sender, rather than a ``pull'' mechanism, where data transfer is effected by the receiver.
Source equal to destination is allowed, that is, a process can send a message to itself. However, for such a communication to succeed, it is required that the message be buffered by the system between the completion of the send call and the start of the receive call. The amount of buffer space available and the buffer allocation policy are implementation dependent. Therefore, it is unsafe and non-portable to send self-messages with the standard-mode, blocking send and receive operations described so far, since this may lead to deadlock. More discussion of this appears in Section .
One can think of message transfer as consisting of three phases: data is pulled out of the send buffer and a message is assembled; the message is transferred from sender to receiver; and data is pulled from the incoming message and disassembled into the receive buffer.
Type matching must be observed at each of these phases. The type of each variable in the sender buffer must match the type specified for that entry by the send operation. The type specified by the send operation must match the type specified by the receive operation. Finally, the type of each variable in the receive buffer must match the type specified for that entry by the receive operation. A program that fails to observe these rules is erroneous.
To define type matching precisely, we need to deal with two issues: matching of types of variables of the host language with types specified in communication operations, and matching of types between sender and receiver.
The types between a send and receive match if both operations specify identical type names. That is, MPI_INTEGER matches MPI_INTEGER, MPI_REAL matches MPI_REAL, and so on. The one exception to this rule is that the type MPI_PACKED can match any other type (Section ).
The type of a variable matches the type specified in the communication operation if the datatype name used by that operation corresponds to the basic type of the host program variable. For example, an entry with type name MPI_INTEGER matches a Fortran variable of type INTEGER. Tables showing this correspondence for Fortran and C appear in Section . There are two exceptions to this rule: an entry with type name MPI_BYTE or MPI_PACKED can be used to match any byte of storage (on a byte-addressable machine), irrespective of the datatype of the variable that contains this byte. The type MPI_BYTE allows one to transfer the binary value of a byte in memory unchanged. The type MPI_PACKED is used to send data that has been explicitly packed with calls to MPI_PACK, or receive data that will be explicitly unpacked with calls to MPI_UNPACK (Section ).
The following examples illustrate type matching.
The type MPI_CHARACTER matches one character of a Fortran variable of type CHARACTER, rather than the entire character string stored in the variable. Fortran variables of type CHARACTER or substrings are transferred as if they were arrays of characters. This is illustrated in the example below.
One of the goals of MPI is to support parallel computations across heterogeneous environments. Communication in a heterogeneous environment may require data conversions. We use the following terminology: type conversion changes the datatype of a value (for example, by rounding a REAL to an INTEGER), whereas representation conversion changes the binary representation of a value (for example, from one floating point format to another).
The type matching rules imply that MPI communications never do type conversion. On the other hand, MPI requires that a representation conversion be performed when a typed value is transferred across environments that use different representations for such a value. MPI does not specify the detailed rules for representation conversion. Such a conversion is expected to preserve integer, logical or character values, and to convert a floating point value to the nearest value that can be represented on the target system.
Overflow and underflow exceptions may occur during floating point conversions. Conversion of integers or characters may also lead to exceptions when a value that can be represented in one system cannot be represented in the other system. An exception occurring during representation conversion results in a failure of the communication. An error occurs either in the send operation, or the receive operation, or both.
If a value sent in a message is untyped (i.e., of type MPI_BYTE), then the binary representation of the byte stored at the receiver is identical to the binary representation of the byte loaded at the sender. This holds true whether sender and receiver run in the same or in distinct environments. No representation conversion is done. Note that representation conversion may occur when values of type MPI_CHARACTER or MPI_CHAR are transferred, for example, from an EBCDIC encoding to an ASCII encoding.
No representation conversion need occur when an MPI program executes in a homogeneous system, where all processes run in the same environment.
Consider the three examples, - . The first program is correct, assuming that a and b are REAL arrays of size . If the sender and receiver execute in different environments, then the ten real values that are fetched from the send buffer will be converted to the representation for reals on the receiver site before they are stored in the receive buffer. While the number of real elements fetched from the send buffer equals the number of real elements stored in the receive buffer, the number of bytes stored need not equal the number of bytes loaded. For example, the sender may use a four byte representation and the receiver an eight byte representation for reals.
The second program is erroneous, and its behavior is undefined.
The third program is correct. The exact same sequence of forty bytes that were loaded from the send buffer will be stored in the receive buffer, even if sender and receiver run in a different environment. The message sent has exactly the same length (in bytes) and the same binary representation as the message received. If a and b are of different types, or if they are of the same type but different data representations are used, then the bits stored in the receive buffer may encode values that are different from the values they encoded in the send buffer.
Representation conversion also applies to the envelope of a message. The source, destination and tag are all integers that may need to be converted.
MPI does not require support for inter-language communication. The behavior of a program is undefined if messages are sent by a C process and received by a Fortran process, or vice-versa.
This section describes the main properties of the send and receive calls introduced in Section . Interested readers can find a more formal treatment of the issues in this section in [10].
The receive described in Section can be started whether or not a matching send has been posted. That version of receive is blocking. It returns only after the receive buffer contains the newly received message. A receive could complete before the matching send has completed (of course, it can complete only after the matching send has started).
The send operation described in Section can be started whether or not a matching receive has been posted. That version of send is blocking. It does not return until the message data and envelope have been safely stored away so that the sender is free to access and overwrite the send buffer. The send call is also potentially non-local. The message might be copied directly into the matching receive buffer, or it might be copied into a temporary system buffer. In the first case, the send call will not complete until a matching receive call occurs, and so, if the sending process is single-threaded, then it will be blocked until this time. In the second case, the send call may return ahead of the matching receive call, allowing a single-threaded process to continue with its computation. The MPI implementation may make either of these choices. It might block the sender or it might buffer the data.
Message buffering decouples the send and receive operations. A blocking send might complete as soon as the message was buffered, even if no matching receive has been executed by the receiver. On the other hand, message buffering can be expensive, as it entails additional memory-to-memory copying, and it requires the allocation of memory for buffering. The choice of the right amount of buffer space to allocate for communication and of the buffering policy to use is application and implementation dependent. Therefore, MPI offers the choice of several communication modes that allow one to control the choice of the communication protocol. Modes are described in Section . The choice of a buffering policy for the standard mode send described in Section is left to the implementation. In any case, lack of buffer space will not cause a standard send call to fail, but will merely cause it to block. In well-constructed programs, this results in a useful throttle effect. Consider a situation where a producer repeatedly produces new values and sends them to a consumer. Assume that the producer produces new values faster than the consumer can consume them. If standard sends are used, then the producer will be automatically throttled, as its send operations will block when buffer space is unavailable.
In ill-constructed programs, blocking may lead to a deadlock situation, where all processes are blocked and no progress occurs. Such programs may complete when sufficient buffer space is available, but will fail on systems that do less buffering, or when data sets (and message sizes) are increased. Since any system will run out of buffer resources as message sizes are increased, and some implementations may want to provide little buffering, MPI takes the position that safe programs do not rely on system buffering, and will complete correctly irrespective of the buffer allocation policy used by MPI. Buffering may change the performance of a safe program, but it doesn't affect the result of the program.
MPI does not enforce a safe programming style. Users are free to take advantage of knowledge of the buffering policy of an implementation in order to relax the safety requirements, though doing so will lessen the portability of the program.
The following examples illustrate safe programming issues.
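The original numbered examples are not reproduced here; the following sketch, assuming two processes that exchange count integers, illustrates the kind of code they discuss. Both processes issue a standard-mode blocking send before the matching receive, so completion depends entirely on how much buffering the implementation provides.

if (rank == 0) {
    MPI_Send (sbuf, count, MPI_INT, 1, tag, comm);
    MPI_Recv (rbuf, count, MPI_INT, 1, tag, comm, &status);
}
else if (rank == 1) {
    MPI_Send (sbuf, count, MPI_INT, 0, tag, comm);
    MPI_Recv (rbuf, count, MPI_INT, 0, tag, comm, &status);
}

A safe version reverses the send/receive order on one of the two processes, as in the shift examples discussed earlier.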
The MPI standard is intended for use by all those who want to write portable message-passing programs in Fortran 77 and C. This includes individual application programmers, developers of software designed to run on parallel machines, and creators of environments and tools. In order to be attractive to this wide audience, the standard must provide a simple, easy-to-use interface for the basic user while not semantically precluding the high-performance message-passing operations available on advanced machines.
MPI does not specify the interaction of blocking communication calls with the thread scheduler in a multi-threaded implementation of MPI. The desired behavior is that a blocking communication call blocks only the issuing thread, allowing another thread to be scheduled. The blocked thread will be rescheduled when the blocked call is satisfied. That is, when data has been copied out of the send buffer, for a send operation, or copied into the receive buffer, for a receive operation. When a thread executes concurrently with a blocked communication operation, it is the user's responsibility not to access or modify a communication buffer until the communication completes. Otherwise, the outcome of the computation is undefined.
Messages are non-overtaking. Conceptually, one may think of successive messages sent by a process to another process as ordered in a sequence. Receive operations posted by a process are also ordered in a sequence. Each incoming message matches the first matching receive in the sequence. This is illustrated in Figure . Process zero sends two messages to process one and process two sends three messages to process one. Process one posts five receives. All communications occur in the same communication domain. The first message sent by process zero and the first message sent by process two can be received in either order, since the first two posted receives match either. The second message of process two will be received before the third message, even though the third and fourth receives match either.
Figure: Messages are matched in order.
Thus, if a sender sends two messages in succession to the same destination, and both match the same receive, then the receive cannot get the second message if the first message is still pending. If a receiver posts two receives in succession, and both match the same message, then the second receive operation cannot be satisfied by this message, if the first receive is still pending.
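A minimal sketch of the non-overtaking rule, assuming two processes and integer variables a, b, x, and y:

if (rank == 0) {
    MPI_Send (&a, 1, MPI_INT, 1, 1, comm);   /* first message, tag 1 */
    MPI_Send (&b, 1, MPI_INT, 1, 2, comm);   /* second message, tag 2 */
}
else if (rank == 1) {
    /* both receives match either message, but non-overtaking guarantees
       that x receives a (tag 1) and y receives b (tag 2) */
    MPI_Recv (&x, 1, MPI_INT, 0, MPI_ANY_TAG, comm, &status);
    MPI_Recv (&y, 1, MPI_INT, 0, MPI_ANY_TAG, comm, &status);
}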
These requirements further define message matching. They guarantee that message-passing code is deterministic, if processes are single-threaded and the wildcard MPI_ANY_SOURCE is not used in receives. Some other MPI functions, such as MPI_CANCEL or MPI_WAITANY, are additional sources of nondeterminism.
In a single-threaded process all communication operations are ordered according to program execution order. The situation is different when processes are multi-threaded. The semantics of thread execution may not define a relative order between two communication operations executed by two distinct threads. The operations are logically concurrent, even if one physically precedes the other. In this case, no order constraints apply. Two messages sent by concurrent threads can be received in any order. Similarly, if two receive operations that are logically concurrent receive two successively sent messages, then the two messages can match the receives in either order.
It is important to understand what is guaranteed by the ordering property and what is not. Between any pair of communicating processes, messages flow in order. This does not imply a consistent, total order on communication events in the system. Consider the following example.
Figure: Order preserving is not transitive.
If a pair of matching send and receives have been initiated on two processes, then at least one of these two operations will complete, independently of other actions in the system. The send operation will complete, unless the receive is satisfied by another message. The receive operation will complete, unless the message sent is consumed by another matching receive posted at the same destination process.
MPI makes no guarantee of fairness in the handling of communication. Suppose that a send is posted. Then it is possible that the destination process repeatedly posts a receive that matches this send, yet the message is never received, because it is repeatedly overtaken by other messages, sent from other sources. This scenario requires that the receive use the wildcard MPI_ANY_SOURCE as its source argument.
Similarly, suppose that a receive is posted by a multi-threaded process. Then it is possible that messages that match this receive are repeatedly consumed, yet the receive is never satisfied, because it is overtaken by other receives posted at this node by other threads. It is the programmer's responsibility to prevent starvation in such situations.
We shall use the following example to illustrate the material introduced so far, and to motivate new functions.
Since this code has a simple structure, a data-parallel approach can be used to derive an equivalent parallel code. The array is distributed across processes, and each process is assigned the task of updating the entries on the part of the array it owns.
A parallel algorithm is derived from a choice of data distribution. The distribution should be balanced, allocating (roughly) the same number of entries to each processor, and it should minimize communication. Figure illustrates two possible distributions: a 1D (block) distribution, where the matrix is partitioned in one dimension, and a 2D (block,block) distribution, where the matrix is partitioned in two dimensions.
Figure: Block partitioning of a matrix.
Since the communication occurs at block boundaries, communication volume is minimized by the 2D partition, which has a better area-to-perimeter ratio. However, in this partition, each processor communicates with four neighbors, rather than two neighbors in the 1D partition. When the ratio n/P (where P is the number of processors) is small, communication time will be dominated by the fixed overhead per message, and the first partition will lead to better performance. When the ratio is large, the second partition will result in better performance. In order to keep the example simple, we shall use the first partition; a realistic code would use a ``polyalgorithm'' that selects one of the two partitions, according to problem size, number of processors, and communication performance parameters.
The value of each point in the array B is computed from the values of the four neighbors in array A. Communications are needed at block boundaries in order to receive values of neighboring points that are owned by another processor. Communications are simplified if an overlap area is allocated at each processor for storing the values to be received from the neighbor processor. Essentially, storage is allocated for each entry both at the producer and at the consumer of that entry. If an entry is produced by one processor and consumed by another, then storage is allocated for this entry at both processors. With such a scheme there is no need for dynamic allocation of communication buffers, and the location of each variable is fixed. Such a scheme works whenever the data dependencies in the computation are fixed and simple. In our case, they are described by a four point stencil. Therefore, a one-column overlap is needed for a 1D partition.
We shall partition array A with one column overlap. No such overlap is required for array B. Figure shows the extra columns in A and how data is transferred for each iteration.
We shall use an algorithm where all values needed from a neighbor are brought in one message. Coalescing of communications in this manner reduces the number of messages and generally improves performance.
Figure: 1D block partitioning with overlap and communication
pattern for Jacobi iteration.
The resulting parallel algorithm is shown below.
One way to get a safe version of this code is to alternate the order of sends and receives: odd rank processes will first send, next receive, and even rank processes will first receive, next send. Thus, one achieves the communication pattern of Example .
The modified main loop is shown below. We shall later see simpler ways of dealing with this problem.
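The original loop body is not reproduced here; the sketch below shows only the boundary exchange of one iteration, under the assumptions that each local block stores its first and last owned columns in firstcol and lastcol (n entries each), that the overlap columns are leftghost and rightghost, and that left and right are the neighbor ranks (boundary processes are assumed to skip the missing neighbor, or to use MPI_PROC_NULL as introduced below).

if (rank % 2) {     /* odd ranks send first, then receive */
    MPI_Send (lastcol,    n, MPI_DOUBLE, right, tag, comm);
    MPI_Recv (leftghost,  n, MPI_DOUBLE, left,  tag, comm, &status);
    MPI_Send (firstcol,   n, MPI_DOUBLE, left,  tag, comm);
    MPI_Recv (rightghost, n, MPI_DOUBLE, right, tag, comm, &status);
}
else {              /* even ranks receive first, then send */
    MPI_Recv (leftghost,  n, MPI_DOUBLE, left,  tag, comm, &status);
    MPI_Send (lastcol,    n, MPI_DOUBLE, right, tag, comm);
    MPI_Recv (rightghost, n, MPI_DOUBLE, right, tag, comm, &status);
    MPI_Send (firstcol,   n, MPI_DOUBLE, left,  tag, comm);
}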
The exchange communication pattern exhibited by the last example is sufficiently frequent to justify special support. The send-receive operation combines, in one call, the sending of one message to a destination and the receiving of another message from a source. The source and destination are possibly the same. Send-receive is useful for communication patterns where each node both sends and receives messages. One example is an exchange of data between two processes. Another example is a shift operation across a chain of processes. A safe program that implements such a shift will need to use an odd/even ordering of communications, similar to the one used in Example . When send-receive is used, data flows simultaneously in both directions (logically, at least) and cycles in the communication pattern do not lead to deadlock.
Send-receive can be used in conjunction with the functions described in Chapter to perform shifts on logical topologies. Also, send-receive can be used for implementing remote procedure calls: one blocking send-receive call can be used for sending the input parameters to the callee and receiving back the output parameters.
There is compatibility between send-receive and normal sends and receives. A message sent by a send-receive can be received by a regular receive or probed by a regular probe, and a send-receive can receive a message sent by a regular send.
MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
<type> SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_SENDRECV executes a blocking send and receive operation. Both the send and receive use the same communicator, but have distinct tag arguments. The send and receive buffers must be disjoint, and may have different lengths and datatypes. The next function handles the case where the buffers are not disjoint.
The semantics of a send-receive operation is what would be obtained if the caller forked two concurrent threads, one to execute the send, and one to execute the receive, followed by a join of these two threads.
MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM, STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_SENDRECV_REPLACE executes a blocking send and receive. The same buffer is used both for the send and for the receive, so that the message sent is replaced by the message received.
The example below shows the main loop of the parallel Jacobi code, reimplemented using send-receive.
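The original example is not reproduced here; a minimal sketch of the boundary exchange using MPI_Sendrecv(), under the same assumptions about firstcol, lastcol, leftghost, rightghost, left, and right as in the earlier sketch, is:

/* shift to the right: send my last column, receive my left neighbor's */
MPI_Sendrecv (lastcol,   n, MPI_DOUBLE, right, tag,
              leftghost, n, MPI_DOUBLE, left,  tag, comm, &status);

/* shift to the left: send my first column, receive my right neighbor's */
MPI_Sendrecv (firstcol,   n, MPI_DOUBLE, left,  tag,
              rightghost, n, MPI_DOUBLE, right, tag, comm, &status);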
In many instances, it is convenient to specify a ``dummy'' source or destination for communication.
In the Jacobi example, this will avoid special handling of boundary processes. This also simplifies handling of boundaries in the case of a non-circular shift, when used in conjunction with the functions described in Chapter .
The special value MPI_PROC_NULL can be used instead of a rank wherever a source or a destination argument is required in a communication function. A communication with process MPI_PROC_NULL has no effect. A send to MPI_PROC_NULL succeeds and returns as soon as possible. A receive from MPI_PROC_NULL succeeds and returns as soon as possible with no modifications to the receive buffer. When a receive with source = MPI_PROC_NULL is executed then the status object returns source = MPI_PROC_NULL, tag = MPI_ANY_TAG and count = 0.
We take advantage of null processes to further simplify the parallel Jacobi code.
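A minimal sketch, assuming nprocs processes and a non-circular 1D partition: boundary processes simply use MPI_PROC_NULL as the missing neighbor, so the exchange code above can be executed unconditionally on every process.

int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;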
One can improve performance on many systems by overlapping communication and computation. This is especially true on systems where communication can be executed autonomously by an intelligent communication controller. Multi-threading is one mechanism for achieving such overlap. While one thread is blocked, waiting for a communication to complete, another thread may execute on the same processor. This mechanism is efficient if the system supports light-weight threads that are integrated with the communication subsystem. An alternative mechanism that often gives better performance is to use nonblocking communication. A nonblocking post-send initiates a send operation, but does not complete it. The post-send will return before the message is copied out of the send buffer. A separate complete-send call is needed to complete the communication, that is, to verify that the data has been copied out of the send buffer. With suitable hardware, the transfer of data out of the sender memory may proceed concurrently with computations done at the sender after the send was initiated and before it completed. Similarly, a nonblocking post-receive initiates a receive operation, but does not complete it. The call will return before a message is stored into the receive buffer. A separate complete-receive is needed to complete the receive operation and verify that the data has been received into the receive buffer.
A nonblocking send can be posted whether a matching receive has been posted or not. The post-send call has local completion semantics: it returns immediately, irrespective of the status of other processes. If the call causes some system resource to be exhausted, then it will fail and return an error code. Quality implementations of MPI should ensure that this happens only in ``pathological'' cases. That is, an MPI implementation should be able to support a large number of pending nonblocking operations.
The complete-send returns when data has been copied out of the send buffer. The complete-send has non-local completion semantics. The call may return before a matching receive is posted, if the message is buffered. On the other hand, the complete-send may not return until a matching receive is posted.
There is compatibility between blocking and nonblocking communication functions. Nonblocking sends can be matched with blocking receives, and vice-versa.
Nonblocking communications use request objects to identify communication operations and link the posting operation with the completion operation. Request objects are allocated by MPI and reside in MPI ``system'' memory. The request object is opaque in the sense that the type and structure of the object is not visible to users. The application program can only manipulate handles to request objects, not the objects themselves. The system may use the request object to identify various properties of a communication operation, such as the communication buffer that is associated with it, or to store information about the status of the pending communication operation. The user may access request objects through various MPI calls to inquire about the status of pending communication operations.
The special value MPI_REQUEST_NULL is used to indicate an invalid request handle. Operations that deallocate request objects set the request handle to this value.
Calls that post send or receive operations have the same names as the corresponding blocking calls, except that an additional prefix of I (for immediate) indicates that the call is nonblocking.
MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_ISEND posts a standard-mode, nonblocking send.
MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR
MPI_IRECV posts a nonblocking receive.
These calls allocate a request object and return a handle to it in request. The request is used to query the status of the communication or wait for its completion.
A nonblocking post-send call indicates that the system may start copying data out of the send buffer. The sender must not access any part of the send buffer (neither for loads nor for stores) after a nonblocking send operation is posted, until the complete-send returns.
A nonblocking post-receive indicates that the system may start writing data into the receive buffer. The receiver must not access any part of the receive buffer after a nonblocking receive operation is posted, until the complete-receive returns.
The attractiveness of the message-passing paradigm at least partially stems from its wide portability. Programs expressed this way may run on distributed-memory multicomputers, shared-memory multiprocessors, networks of workstations, and combinations of all of these. The paradigm will not be made obsolete by architectures combining the shared- and distributed-memory views, or by increases in network speeds. Thus, it should be both possible and useful to implement this standard on a great variety of machines, including those ``machines'' consisting of collections of other machines, parallel or not, connected by a communication network.
The interface is suitable for use by fully general Multiple Instruction, Multiple Data (MIMD) programs, or Multiple Program, Multiple Data (MPMD) programs, where each process follows a distinct execution path through the same code, or even executes a different code. It is also suitable for those written in the more restricted style of Single Program, Multiple Data (SPMD), where all processes follow the same execution path through the same program. Although no explicit support for threads is provided, the interface has been designed so as not to prejudice their use. With this version of MPI no support is provided for dynamic spawning of tasks; such support is expected in future versions of MPI; see Section .
MPI provides many features intended to improve performance on scalable parallel computers with specialized interprocessor communication hardware. Thus, we expect that native, high-performance implementations of MPI will be provided on such machines. At the same time, implementations of MPI on top of standard Unix interprocessor communication protocols will provide portability to workstation clusters and heterogeneous networks of workstations. Several proprietary, native implementations of MPI, and public domain, portable implementations of MPI, are now available. See Section for more information about MPI implementations.
The functions MPI_WAIT and MPI_TEST are used to complete nonblocking sends and receives. The completion of a send indicates that the sender is now free to access the send buffer. The completion of a receive indicates that the receive buffer contains the message, the receiver is free to access it, and that the status object is set.
MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST, STATUS, IERROR)INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
A call to MPI_WAIT returns when the operation identified by request is complete. If the system object pointed to by request was originally created by a nonblocking send or receive, then the object is deallocated by MPI_WAIT and request is set to MPI_REQUEST_NULL. The status object is set to contain information on the completed operation. MPI_WAIT has non-local completion semantics.
MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST, FLAG, STATUS, IERROR)LOGICAL FLAG
INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
A call to MPI_TEST returns flag = true if the operation identified by request is complete. In this case, the status object is set to contain information on the completed operation. If the system object pointed to by request was originally created by a nonblocking send or receive, then the object is deallocated by MPI_TEST and request is set to MPI_REQUEST_NULL. Otherwise, the call returns flag = false. In this case, the value of the status object is undefined. MPI_TEST has local completion semantics.
For both MPI_WAIT and MPI_TEST, information on the completed operation is returned in status. The contents of the status object for a receive operation are accessed as described in Section . The contents of a status object for a send operation are undefined, except that the query function MPI_TEST_CANCELLED (Section ) can be applied to it.
We illustrate the use of nonblocking communication for the same Jacobi computation used in previous examples (Example - ). To achieve maximum overlap between computation and communication, communications should be started as soon as possible and completed as late as possible. That is, sends should be posted as soon as the data to be sent is available; receives should be posted as soon as the receive buffer can be reused; sends should be completed just before the send buffer is to be reused; and receives should be completed just before the data in the receive buffer is to be used. Sometimes, the overlap can be increased by reordering computations.
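The Jacobi code itself is not reproduced here; the following fragment is only a sketch of the posting-early, completing-late pattern for a one-dimensional decomposition by rows. The array sizes, the neighbor ranks left and right (MPI_PROC_NULL at the ends), the boundary-row copies, and the five-point update are illustrative assumptions, not the book's example.

#include <mpi.h>

#define M  64               /* row length (assumed)                        */
#define NL 16               /* local rows, plus ghost rows 0 and NL+1      */

/* One relaxation sweep with communication/computation overlap. */
void sweep(double a[NL + 2][M], double b[NL + 2][M],
           int left, int right, MPI_Comm comm)
{
    double      sbuf[2][M];          /* copies of the boundary rows to send */
    MPI_Request req[4];
    MPI_Status  stat[4];
    int i, j;

    /* post receives into the ghost rows as early as possible */
    MPI_Irecv(a[0],      M, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(a[NL + 1], M, MPI_DOUBLE, right, 0, comm, &req[1]);

    /* copy the boundary rows and post the sends as soon as the data is
       available; the copies keep the pending send buffers untouched */
    for (j = 0; j < M; j++) { sbuf[0][j] = a[1][j]; sbuf[1][j] = a[NL][j]; }
    MPI_Isend(sbuf[0], M, MPI_DOUBLE, left,  0, comm, &req[2]);
    MPI_Isend(sbuf[1], M, MPI_DOUBLE, right, 0, comm, &req[3]);

    /* interior rows need no ghost data: compute while messages move */
    for (i = 2; i <= NL - 1; i++)
        for (j = 1; j < M - 1; j++)
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);

    /* complete the communication just before the ghost rows are used */
    MPI_Waitall(4, req, stat);

    for (i = 1; i <= NL; i += NL - 1)       /* boundary rows 1 and NL */
        for (j = 1; j < M - 1; j++)
            b[i][j] = 0.25 * (a[i-1][j] + a[i+1][j] + a[i][j-1] + a[i][j+1]);
}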
The next example shows a multiple-producer, single-consumer code. The last process in the group consumes messages sent by the other processes.
The example imposes a strict round-robin discipline, since the consumer receives one message from each producer, in turn. In some cases it is preferable to use a ``first-come-first-served'' discipline. This is achieved by using MPI_TEST, rather than MPI_WAIT, as shown below. Note that MPI can only offer an approximation to first-come-first-served, since messages do not necessarily arrive in the order they were sent.
A request object is deallocated automatically by a successful call to MPI_WAIT or MPI_TEST. In addition, a request object can be explicitly deallocated by using the following operation.
MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)INTEGER REQUEST, IERROR
MPI_REQUEST_FREE marks the request object for deallocation and sets request to MPI_REQUEST_NULL. An ongoing communication associated with the request will be allowed to complete. The request becomes unavailable after it is deallocated, as the handle is reset to MPI_REQUEST_NULL. However, the request object itself need not be deallocated immediately. If the communication associated with this object is still ongoing, and the object is required for its correct completion, then MPI will not deallocate the object until after its completion.
MPI_REQUEST_FREE cannot be used for cancelling an ongoing communication. For that purpose, one should use MPI_CANCEL, described in Section . One should use MPI_REQUEST_FREE when the logic of the program is such that a nonblocking communication is known to have terminated and, therefore, a call to MPI_WAIT or MPI_TEST is superfluous. For example, the program could be such that a send command generates a reply from the receiver. If the reply has been successfully received, then the send is known to be complete.
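As a sketch of this idiom (the server rank, the tag values, and the communicator are assumptions), a client might free the request of a nonblocking send and rely on the arrival of the server's reply to conclude that the send has completed:

#include <mpi.h>

/* The server is assumed to answer each query on tag 1. */
int ask(int server, MPI_Comm comm)
{
    int         query = 42, answer;
    MPI_Request sreq;
    MPI_Status  status;

    MPI_Isend(&query, 1, MPI_INT, server, 0, comm, &sreq);
    MPI_Request_free(&sreq);     /* no MPI_Wait will be issued for the send */

    /* the arrival of the reply implies that the send above has completed */
    MPI_Recv(&answer, 1, MPI_INT, server, 1, comm, &status);
    return answer;
}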
The semantics of nonblocking communication is defined by suitably extending the definitions in Section .
Nonblocking communication operations are ordered according to the execution order of the posting calls. The non-overtaking requirement of Section is extended to nonblocking communication.
The order requirement specifies how post-send calls are matched to post-receive calls. There are no restrictions on the order in which operations complete. Consider the code in Example .
Since the completion of a receive can take an arbitrary amount of time, there is no way to infer that the receive operation completed, short of executing a complete-receive call. On the other hand, the completion of a send operation can be inferred indirectly from the completion of a matching receive.
A communication is enabled once a send and a matching receive have been posted by two processes. The progress rule requires that once a communication is enabled, then either the send or the receive will proceed to completion (they might not both complete, as the send might be matched by another receive or the receive might be matched by another send). Thus, a call to MPI_WAIT that completes a receive will eventually return if a matching send has been started, unless the send is satisfied by another receive. In particular, if the matching send is nonblocking, then the receive completes even if no complete-send call is made on the sender side.
Similarly, a call to MPI_WAIT that completes a send eventually returns if a matching receive has been started, unless the receive is satisfied by another send, and even if no complete-receive call is made on the receiving side.
If a call to MPI_TEST that completes a receive is repeatedly made with the same arguments, and a matching send has been started, then the call will eventually return flag = true, unless the send is satisfied by another receive. If a call to MPI_TEST that completes a send is repeatedly made with the same arguments, and a matching receive has been started, then the call will eventually return flag = true, unless the receive is satisfied by another send.
The statement made in Section concerning fairness applies to nonblocking communications. Namely, MPI does not guarantee fairness.
The use of nonblocking communication alleviates the need for buffering, since a sending process may progress after it has posted a send. Therefore, the constraints of safe programming can be relaxed. However, some amount of storage is consumed by a pending communication. At a minimum, the communication subsystem needs to copy the parameters of a posted send or receive before the call returns. If this storage is exhausted, then a call that posts a new communication will fail, since post-send or post-receive calls are not allowed to block. A high quality implementation will consume only a fixed amount of storage per posted, nonblocking communication, thus supporting a large number of pending communications. The failure of a parallel program that exceeds the bounds on the number of pending nonblocking communications, like the failure of a sequential program that exceeds the bound on stack size, should be seen as a pathological case, due either to a pathological program or a pathological MPI implementation.
The approach illustrated in the last two examples can be used, in general, to transform unsafe programs into safe ones. Assume that the program consists of successive communication phases, where processes exchange data, followed by computation phases. The communication phase should be rewritten as two sub-phases, the first where each process posts all its communication, and the second where the process waits for the completion of all its communications. The order in which the communications are posted is not important, as long as the total number of messages sent or received at any node is moderate. This is further discussed in Section .
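A sketch of such a rewritten communication phase, assuming a fixed set of neighbors and illustrative buffer names, might look as follows: all receives and sends are posted first, and a single MPI_WAITALL completes them.

#include <mpi.h>

#define NNBR 4
#define LEN  128

void exchange(double sendbuf[NNBR][LEN], double recvbuf[NNBR][LEN],
              int nbr[NNBR], MPI_Comm comm)
{
    MPI_Request req[2 * NNBR];
    MPI_Status  stat[2 * NNBR];
    int k;

    /* first sub-phase: post every communication */
    for (k = 0; k < NNBR; k++)
        MPI_Irecv(recvbuf[k], LEN, MPI_DOUBLE, nbr[k], 0, comm, &req[k]);
    for (k = 0; k < NNBR; k++)
        MPI_Isend(sendbuf[k], LEN, MPI_DOUBLE, nbr[k], 0, comm, &req[NNBR + k]);

    /* second sub-phase: complete everything; no process can block another */
    MPI_Waitall(2 * NNBR, req, stat);
}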
It is convenient and efficient to complete in one call a list of multiple pending communication operations, rather than completing only one. MPI_WAITANY or MPI_TESTANY are used to complete one out of several operations. MPI_WAITALL or MPI_TESTALL are used to complete all operations in a list. MPI_WAITSOME or MPI_TESTSOME are used to complete all enabled operations in a list. The behavior of these functions is described in this section and in Section .
MPI_Waitany(int count, MPI_Request *array_of_requests, int *index, MPI_Status *status)
MPI_WAITANY(COUNT, ARRAY_OF_REQUESTS, INDEX, STATUS, IERROR)INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR
MPI_WAITANY blocks until one of the communication operations associated with requests in the array has completed. If more than one operation can be completed, MPI_WAITANY arbitrarily picks one and completes it. MPI_WAITANY returns in index the array location of the completed request and returns in status the status of the completed communication. The request object is deallocated and the request handle is set to MPI_REQUEST_NULL. MPI_WAITANY has non-local completion semantics.
MPI_Testany(int count, MPI_Request *array_of_requests, int *index, int *flag, MPI_Status *status)
MPI_TESTANY(COUNT, ARRAY_OF_REQUESTS, INDEX, FLAG, STATUS, IERROR)LOGICAL FLAG
INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR
MPI_TESTANY tests for completion of the communication operations associated with requests in the array. MPI_TESTANY has local completion semantics.
If an operation has completed, it returns flag = true, returns in index the array location of the completed request, and returns in status the status of the completed communication. The request is deallocated and the handle is set to MPI_REQUEST_NULL.
If no operation has completed, it returns flag = false, returns MPI_UNDEFINED in index, and status is undefined.
The execution of MPI_Testany(count, array_of_requests, &index, &flag, &status) has the same effect as the execution of MPI_Test(&array_of_requests[i], &flag, &status), for i=0, 1, ..., count-1, in some arbitrary order, until one call returns flag = true, or all fail. In the former case, index is set to the last value of i, and in the latter case, it is set to MPI_UNDEFINED.
MPI_Waitall(int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses)
MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)INTEGER COUNT, ARRAY_OF_REQUESTS(*)
INTEGER ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_WAITALL blocks until all communications, associated with requests in the array, complete. The i-th entry in array_of_statuses is set to the return status of the i-th operation. All request objects are deallocated and the corresponding handles in the array are set to MPI_REQUEST_NULL. MPI_WAITALL has non-local completion semantics.
The execution of MPI_Waitall(count, array_of_requests, array_of_statuses) has the same effect as the execution of MPI_Wait(&array_of_requests[i], &array_of_statuses[i]), for i=0, ..., count-1, in some arbitrary order.
When one or more of the communications completed by a call to MPI_WAITALL fail, MPI_WAITALL will return the error code MPI_ERR_IN_STATUS and will set the error field of each status to a specific error code. This code will be MPI_SUCCESS if the specific communication completed; it will be another specific error code if it failed; or it will be MPI_PENDING if it has neither failed nor completed. The function MPI_WAITALL will return MPI_SUCCESS if it completed successfully, or will return another error code if it failed for other reasons (such as invalid arguments). MPI_WAITALL updates the error fields of the status objects only when it returns MPI_ERR_IN_STATUS.
MPI_Testall(int count, MPI_Request *array_of_requests, int *flag, MPI_Status *array_of_statuses)
MPI_TESTALL(COUNT, ARRAY_OF_REQUESTS, FLAG, ARRAY_OF_STATUSES, IERROR)LOGICAL FLAG
INTEGER COUNT, ARRAY_OF_REQUESTS(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_TESTALL tests for completion of all communications associated with requests in the array. MPI_TESTALL has local completion semantics.
If all operations have completed, it returns flag = true, sets the corresponding entries in status, deallocates all requests and sets all request handles to MPI_REQUEST_NULL.
If not all operations have completed, flag = false is returned, no request is modified, and the values of the status entries are undefined.
Errors that occur during the execution of MPI_TESTALL are handled in the same way as errors in MPI_WAITALL.
MPI_Waitsome(int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)
MPI_WAITSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_WAITSOME waits until at least one of the communications, associated with requests in the array, completes. MPI_WAITSOME returns in outcount the number of completed requests. The first outcount locations of the array array_of_indices are set to the indices of these operations. The first outcount locations of the array array_of_statuses are set to the status for these completed operations. Each request that completed is deallocated, and the associated handle is set to MPI_REQUEST_NULL. MPI_WAITSOME has non-local completion semantics.
If one or more of the communications completed by MPI_WAITSOME fail, then the arguments outcount, array_of_indices and array_of_statuses will be adjusted to indicate completion of all communications that have succeeded or failed. The call will return the error code MPI_ERR_IN_STATUS and the error field of each status returned will be set to indicate success or to indicate the specific error that occurred. The call will return MPI_SUCCESS if it succeeded, and will return another error code if it failed for other reasons (such as invalid arguments). MPI_WAITSOME updates the status fields of the request objects only when it returns MPI_ERR_IN_STATUS.
MPI_Testsome(int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)
MPI_TESTSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_TESTSOME behaves like MPI_WAITSOME, except that it returns immediately. If no operation has completed it returns outcount = 0. MPI_TESTSOME has local completion semantics.
Errors that occur during the execution of MPI_TESTSOME are handled as for MPI_WAITSOME.
Both MPI_WAITSOME and MPI_TESTSOME fulfill a fairness requirement: if a request for a receive repeatedly appears in a list of requests passed to MPI_WAITSOME or MPI_TESTSOME, and a matching send has been posted, then the receive will eventually complete, unless the send is satisfied by another receive. A similar fairness requirement holds for send requests.
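As an illustration of MPI_WAITSOME (with illustrative names for the buffers, the number of producers, and the consuming routine), a consumer can keep one receive posted per producer and service completions in whatever order they occur; once every request has been completed and set to MPI_REQUEST_NULL, MPI_WAITSOME returns outcount = MPI_UNDEFINED and the loop ends.

#include <mpi.h>

#define MAXPROD 16
#define LEN     128

/* consume() is an assumed handler for one arrived message */
void consume(double *msg, MPI_Status *st);

void serve(int nprod, MPI_Comm comm)
{
    static double buf[MAXPROD][LEN];
    MPI_Request req[MAXPROD];
    MPI_Status  stat[MAXPROD];
    int idx[MAXPROD], i, k, ndone;

    for (i = 0; i < nprod; i++)    /* one message expected from each producer */
        MPI_Irecv(buf[i], LEN, MPI_DOUBLE, i, 0, comm, &req[i]);

    for (;;) {
        MPI_Waitsome(nprod, req, &ndone, idx, stat);
        if (ndone == MPI_UNDEFINED)        /* every request has become null */
            break;
        for (k = 0; k < ndone; k++)
            consume(buf[idx[k]], &stat[k]); /* handled in completion order */
    }
}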
The standard includes:
MPI_PROBE and MPI_IPROBE allow polling of incoming messages without actually receiving them. The application can then decide how to receive them, based on the information returned by the probe (in a status variable). For example, the application might allocate memory for the receive buffer according to the length of the probed message.
MPI_CANCEL allows pending communications to be canceled. This is required for cleanup in some situations. Suppose an application has posted nonblocking sends or receives and then determines that these operations will not complete. Posting a send or a receive ties up application resources (send or receive buffers), and a cancel allows these resources to be freed.
MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERROR)LOGICAL FLAG
INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_IPROBE is a nonblocking operation that returns flag = true if there is a message that can be received and that matches the message envelope specified by source, tag, and comm. The call matches the same message that would have been received by a call to MPI_RECV (with these arguments) executed at the same point in the program, and returns in status the same value. Otherwise, the call returns flag = false, and leaves status undefined. MPI_IPROBE has local completion semantics.
If MPI_IPROBE(source, tag, comm, flag, status) returns flag = true, then the first, subsequent receive executed with the communicator comm, and with the source and tag returned in status, will receive the message that was matched by the probe.
The argument source can be MPI_ANY_SOURCE, and tag can be MPI_ANY_TAG, so that one can probe for messages from an arbitrary source and/or with an arbitrary tag. However, a specific communicator must be provided in comm.
It is not necessary to receive a message immediately after it has been probed for, and the same message may be probed for several times before it is received.
MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_PROBE behaves like MPI_IPROBE except that it blocks and returns only after a matching message has been found. MPI_PROBE has non-local completion semantics.
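For instance, a receiver that does not know the incoming message length in advance can size its buffer from the probed status; the following is only a sketch, assuming an existing communicator comm.

#include <mpi.h>
#include <stdlib.h>

void recv_unknown_length(MPI_Comm comm)
{
    MPI_Status status;
    int        count;
    double    *buf;

    /* block until some message is available, without receiving it */
    MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);
    buf = (double *) malloc(count * sizeof(double));

    /* the source and tag from the probe select that same message */
    MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);
    /* ... use buf ... */
    free(buf);
}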
The semantics of MPI_PROBE and MPI_IPROBE guarantee progress, in the same way as a corresponding receive executed at the same point in the program. If a call to MPI_PROBE has been issued by a process, and a send that matches the probe has been initiated by some process, then the call to MPI_PROBE will return, unless the message is received by another, concurrent receive operation, irrespective of other activities in the system. Similarly, if a process busy waits with MPI_IPROBE and a matching message has been issued, then the call to MPI_IPROBE will eventually return flag = true, unless the message is received by another concurrent receive operation, irrespective of other activities in the system.
MPI_Cancel(MPI_Request *request)
MPI_CANCEL(REQUEST, IERROR)INTEGER REQUEST, IERROR
MPI_CANCEL marks for cancelation a pending, nonblocking communication operation (send or receive). MPI_CANCEL has local completion semantics. It returns immediately, possibly before the communication is actually canceled. After this, it is still necessary to complete a communication that has been marked for cancelation, using a call to MPI_REQUEST_FREE, MPI_WAIT, MPI_TEST or one of the functions in Section . If the communication was not cancelled (that is, if the communication happened to start before the cancelation could take effect), then the completion call will complete the communication, as usual. If the communication was successfully cancelled, then the completion call will deallocate the request object and will return in status the information that the communication was canceled. The application should then call MPI_TEST_CANCELLED, using status as input, to test whether the communication was actually canceled.
Either the cancelation succeeds, and no communication occurs, or the communication completes, and the cancelation fails. If a send is marked for cancelation, then it must be the case that either the send completes normally, and the message sent is received at the destination process, or that the send is successfully canceled, and no part of the message is received at the destination. If a receive is marked for cancelation, then it must be the case that either the receive completes normally, or that the receive is successfully canceled, and no part of the receive buffer is altered.
If a communication is marked for cancelation, then a completion call for that communication is guaranteed to return, irrespective of the activities of other processes. In this case, MPI_WAIT behaves as a local function. Similarly, if MPI_TEST is repeatedly called in a busy wait loop for a canceled communication, then MPI_TEST will eventually succeed.
MPI_Test_cancelled(MPI_Status *status, int *flag)
MPI_TEST_CANCELLED(STATUS, FLAG, IERROR)LOGICAL FLAG
INTEGER STATUS(MPI_STATUS_SIZE), IERROR
MPI_TEST_CANCELLED is used to test whether the communication operation was actually canceled by MPI_CANCEL. It returns flag = true if the communication associated with the status object was canceled successfully. In this case, all other fields of status are undefined. It returns flag = false, otherwise.
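A short sketch of the cancel protocol (the communicator is an assumption): post a receive, cancel it when it is no longer wanted, complete it, and then check whether the cancelation actually took effect.

#include <mpi.h>

void drop_pending_receive(MPI_Comm comm)
{
    int         data, cancelled;
    MPI_Request req;
    MPI_Status  status;

    MPI_Irecv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &req);
    /* ... the program decides the message is no longer needed ... */
    MPI_Cancel(&req);
    MPI_Wait(&req, &status);                /* guaranteed to return */
    MPI_Test_cancelled(&status, &cancelled);
    if (!cancelled) {
        /* a matching send arrived first: data holds a received message */
    }
}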
Often a communication with the same argument list is repeatedly executed within the inner loop of a parallel computation. In such a situation, it may be possible to optimize the communication by binding the list of communication arguments to a persistent communication request once and then, repeatedly, using the request to initiate and complete messages. A persistent request can be thought of as a communication port or a ``half-channel.'' It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port. This construct allows reduction of the overhead for communication between the process and communication controller, but not of the overhead for communication between one communication controller and another.
It is not necessary that messages sent with a persistent request be received by a receive operation using a persistent request, or vice-versa. Persistent communication requests are associated with nonblocking send and receive operations.
A persistent communication request is created using the following functions. They involve no communication and thus have local completion semantics.
MPI_Send_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_SEND_INIT creates a persistent communication request for a standard-mode, nonblocking send operation, and binds to it all the arguments of a send operation.
MPI_Recv_init(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_RECV_INIT(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR
MPI_RECV_INIT creates a persistent communication request for a nonblocking receive operation. The argument buf is marked as OUT because the application gives permission to write on the receive buffer.
Persistent communication requests are created by the preceding functions, but they are, so far, inactive. They are activated, and the associated communication operations started, by MPI_START or MPI_STARTALL.
MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)INTEGER REQUEST, IERROR
MPI_START activates request and initiates the associated communication. Since all persistent requests are associated with nonblocking communications, MPI_START has local completion semantics. The semantics of communications done with persistent requests are identical to the corresponding operations without persistent requests. That is, a call to MPI_START with a request created by MPI_SEND_INIT starts a communication in the same manner as a call to MPI_ISEND; a call to MPI_START with a request created by MPI_RECV_INIT starts a communication in the same manner as a call to MPI_IRECV.
A send operation initiated with MPI_START can be matched with any receive operation (including MPI_PROBE) and a receive operation initiated with MPI_START can receive messages generated by any send operation.
MPI_Startall(int count, MPI_Request *array_of_requests)
MPI_STARTALL(COUNT, ARRAY_OF_REQUESTS, IERROR)INTEGER COUNT, ARRAY_OF_REQUESTS(*), IERROR
MPI_STARTALL starts all communications associated with persistent requests in array_of_requests. A call to MPI_STARTALL(count, array_of_requests) has the same effect as calls to MPI_START(array_of_requests[i]), executed for i=0 ,..., count-1, in some arbitrary order.
A communication started with a call to MPI_START or MPI_STARTALL is completed by a call to MPI_WAIT, MPI_TEST, or one of the other completion functions described in Section . The persistent request becomes inactive after the completion of such a call, but it is not deallocated and it can be re-activated by another MPI_START or MPI_STARTALL.
Persistent requests are explicitly deallocated by a call to MPI_REQUEST_FREE (Section ). The call to MPI_REQUEST_FREE can occur at any point in the program after the persistent request was created. However, the request will be deallocated only after it becomes inactive. Active receive requests should not be freed. Otherwise, it will not be possible to check that the receive has completed. It is preferable to free requests when they are inactive. If this rule is followed, then the functions described in this section will be invoked in a sequence of the form
Create (Start Complete)* Free,
where * indicates zero or more repetitions. If the same communication request is used in several concurrent threads, it is the user's responsibility to coordinate calls so that the correct sequence is obeyed.
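The sequence above might be rendered as follows for a pair of persistent requests; the buffer names, the partner rank, and the iteration count are assumptions used only to illustrate the Create (Start Complete)* Free pattern.

#include <mpi.h>

#define LEN    128
#define NSTEPS 100

void timestep_loop(double sendbuf[LEN], double recvbuf[LEN],
                   int other, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Status  stat[2];
    int step;

    /* Create: bind the argument lists once */
    MPI_Send_init(sendbuf, LEN, MPI_DOUBLE, other, 0, comm, &req[0]);
    MPI_Recv_init(recvbuf, LEN, MPI_DOUBLE, other, 0, comm, &req[1]);

    for (step = 0; step < NSTEPS; step++) {
        /* ... refill sendbuf ... */
        MPI_Startall(2, req);        /* Start: both requests become active    */
        MPI_Waitall(2, req, stat);   /* Complete: they become inactive again  */
        /* ... consume recvbuf ... */
    }

    /* Free: the requests are inactive here */
    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
}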
MPI_CANCEL can be used to cancel a communication that uses a persistent request, in the same way it is used for nonpersistent requests. A successful cancelation cancels the active communication, but does not deallocate the request. After the call to MPI_CANCEL and the subsequent call to MPI_WAIT or MPI_TEST (or other completion function), the request becomes inactive and can be activated for a new communication.
Normally, an invalid handle to an MPI object is not a valid argument for a call that expects an object. There is one exception to this rule: communication-complete calls can be passed request handles with value MPI_REQUEST_NULL. A communication-complete call with such an argument is a ``no-op'': the null handles are ignored. The same rule applies to persistent handles that are not associated with an active communication operation.
We shall use the following terminology. A null request handle is a handle with value MPI_REQUEST_NULL. A handle to a persistent request is inactive if the request is not currently associated with an ongoing communication. A handle is active if it is neither null nor inactive. An empty status is a status that is set to tag = MPI_ANY_TAG and source = MPI_ANY_SOURCE, and is also internally configured so that calls to MPI_GET_COUNT and MPI_GET_ELEMENTS return count = 0. We set a status variable to empty in cases when the value returned is not significant. Status is set this way to prevent errors due to access of stale information.
A call to MPI_WAIT with a null or inactive request argument returns immediately with an empty status.
A call to MPI_TEST with a null or inactive request argument returns immediately with flag = true and an empty status.
The list of requests passed to MPI_WAITANY may contain null or inactive requests. If some of the requests are active, then the call returns when an active request has completed. If all the requests in the list are null or inactive then the call returns immediately, with index = MPI_UNDEFINED and an empty status.
The list of requests passed to MPI_TESTANY may contain null or inactive requests. The call returns flag = false if there are active requests in the list, and none have completed. It returns flag = true if an active request has completed, or if all the requests in the list are null or inactive. In the latter case, it returns index = MPI_UNDEFINED and an empty status.
The list of requests passed to MPI_WAITALL may contain null or inactive requests. The call returns as soon as all active requests have completed. The call sets to empty each status associated with a null or inactive request.
The list of requests passed to MPI_TESTALL may contain null or inactive requests. The call returns flag = true if all active requests have completed. In this case, the call sets to empty each status associated with a null or inactive request. Otherwise, the call returns flag = false.
The list of requests passed to MPI_WAITSOME may contain null or inactive requests. If the list contains active requests, then the call returns when some of the active requests have completed. If all requests were null or inactive, then the call returns immediately, with outcount = MPI_UNDEFINED.
The list of requests passed to MPI_TESTSOME may contain null or inactive requests. If the list contains active requests and some have completed, then the call returns in outcount the number of completed requests. If it contains active requests, and none have completed, then it returns outcount = 0. If the list contains no active requests, then it returns outcount = MPI_UNDEFINED.
In all these cases, null or inactive request handles are not modified by the call.
The send call described in Section used the standard communication mode. In this mode, it is up to MPI to decide whether outgoing messages will be buffered. MPI may buffer outgoing messages. In such a case, the send call may complete before a matching receive is invoked. On the other hand, buffer space may be unavailable, or MPI may choose not to buffer outgoing messages, for performance reasons. In this case, the send call will not complete until a matching receive has been posted, and the data has been moved to the receiver. (A blocking send completes when the call returns; a nonblocking send completes when the matching Wait or Test call returns successfully.)
Thus, a send in standard mode can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. The standard-mode send has non-local completion semantics, since successful completion of the send operation may depend on the occurrence of a matching receive.
A buffered-mode send operation can be started whether or not a matching receive has been posted. It may complete before a matching receive is posted. Buffered-mode send has local completion semantics: its completion does not depend on the occurrence of a matching receive. In order to complete the operation, it may be necessary to buffer the outgoing message locally. For that purpose, buffer space is provided by the application (Section ). An error will occur if a buffered-mode send is called and there is insufficient buffer space. The buffer space occupied by the message is freed when the message is transferred to its destination or when the buffered send is cancelled.
A synchronous-mode send can be started whether or not a matching receive was posted. However, the send will complete successfully only if a matching receive is posted, and the receive operation has started to receive the message sent by the synchronous send. Thus, the completion of a synchronous send not only indicates that the send buffer can be reused, but also indicates that the receiver has reached a certain point in its execution, namely that it has started executing the matching receive. Synchronous mode provides synchronous communication semantics: a communication does not complete at either end before both processes rendezvous at the communication. A synchronous-mode send has non-local completion semantics.
A ready-mode send may be started only if the matching receive has already been posted. Otherwise, the operation is erroneous and its outcome is undefined. On some systems, this allows the removal of a hand-shake operation and results in improved performance. A ready-mode send has the same semantics as a standard-mode send. In a correct program, therefore, a ready-mode send could be replaced by a standard-mode send with no effect on the behavior of the program other than performance.
Three additional send functions are provided for the three additional communication modes. The communication mode is indicated by a one letter prefix: B for buffered, S for synchronous, and R for ready. There is only one receive mode and it matches any of the send modes.
All send and receive operations use the buf, count, datatype, source, dest, tag, comm, status and request arguments in the same way as the standard-mode send and receive operations.
MPI_Bsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
MPI_BSEND performs a buffered-mode, blocking send.
MPI_Ssend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
MPI_SSEND performs a synchronous-mode, blocking send.
MPI_Rsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
MPI_RSEND performs a ready-mode, blocking send.
We use the same naming conventions as for blocking communication: a prefix of B, S, or R is used for buffered, synchronous or ready mode. In addition, a prefix of I (for immediate) indicates that the call is nonblocking. There is only one nonblocking receive call, MPI_IRECV. Nonblocking send operations are completed with the same Wait and Test calls as for standard-mode send.
MPI_Ibsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_IBSEND posts a buffered-mode, nonblocking send.
MPI_Issend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_ISSEND posts a synchronous-mode, nonblocking send.
MPI_Irsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_IRSEND posts a ready-mode, nonblocking send.
MPI_Bsend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_BSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_BSEND_INIT creates a persistent communication request for a buffered-mode, nonblocking send, and binds to it all the arguments of a send operation.
MPI_Ssend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_SSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_SSEND_INIT creates a persistent communication object for a synchronous-mode, nonblocking send, and binds to it all the arguments of a send operation.
MPI_Rsend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_RSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
MPI_RSEND_INIT creates a persistent communication object for a ready-mode, nonblocking send, and binds to it all the arguments of a send operation.
An application must specify a buffer to be used for buffering messages sent in buffered mode. Buffering is done by the sender.
MPI_Buffer_attach( void* buffer, int size)
MPI_BUFFER_ATTACH( BUFFER, SIZE, IERROR)<type> BUFFER(*)
INTEGER SIZE, IERROR
MPI_BUFFER_ATTACH provides to MPI a buffer in the application's memory to be used for buffering outgoing messages. The buffer is used only by messages sent in buffered mode. Only one buffer can be attached at a time (per process).
MPI_Buffer_detach( void* buffer, int* size)
MPI_BUFFER_DETACH( BUFFER, SIZE, IERROR)<type> BUFFER(*)
INTEGER SIZE, IERROR
MPI_BUFFER_DETACH detaches the buffer currently associated with MPI. The call returns the address and the size of the detached buffer. This operation will block until all messages currently in the buffer have been transmitted. Upon return of this function, the user may reuse or deallocate the space taken by the buffer.
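A sketch of typical buffered-mode usage follows; it sizes the buffer with MPI_PACK_SIZE plus the constant MPI_BSEND_OVERHEAD (which accounts for the bookkeeping entry kept in the buffer for each pending message), and assumes an existing destination rank and communicator.

#include <mpi.h>
#include <stdlib.h>

void send_buffered(double msg[100], int dest, MPI_Comm comm)
{
    int   bufsize, packsize;
    char *buffer;

    /* room for one packed message plus the per-message bookkeeping entry */
    MPI_Pack_size(100, MPI_DOUBLE, comm, &packsize);
    bufsize = packsize + MPI_BSEND_OVERHEAD;
    buffer  = (char *) malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);

    MPI_Bsend(msg, 100, MPI_DOUBLE, dest, 0, comm);   /* completes locally */

    MPI_Buffer_detach(&buffer, &bufsize);  /* blocks until the message is gone */
    free(buffer);
}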
Now the question arises: how is the attached buffer to be used? The answer is that MPI must behave as if buffer policy outgoing message data were buffered by the sending process, in the specified buffer space, using a circular, contiguous-space allocation policy. We outline below a model implementation that defines this policy. MPI may provide more buffering, and may use a better buffer allocation algorithm than described below. On the other hand, MPI may signal an error whenever the simple buffering allocator described below would run out of space.
The model implementation uses the packing and unpacking functions described in Section and the nonblocking communication functions described in Section .
We assume that a circular queue of pending message entries (PME) is maintained. Each entry contains a communication request that identifies a pending nonblocking send, a pointer to the next entry and the packed message data. The entries are stored in successive locations in the buffer. Free space is available between the queue tail and the queue head.
A buffered send call results in the execution of the following algorithm.
MPI does not specify:
There are many features that were considered and not included in MPI. This happened for a number of reasons: the time constraint that was self-imposed by the MPI Forum in finishing the standard; the feeling that not enough experience was available on some of these topics; and the concern that additional features would delay the appearance of implementations.
Features that are not included can always be offered as extensions by specific implementations. Future versions of MPI will address some of these issues (see Section ).
The MPI communication mechanisms introduced in the previous chapter allow one to send or receive a sequence of identical elements that are contiguous in memory. It is often desirable to send data that is not homogeneous, such as a structure, or that is not contiguous in memory, such as an array section. This allows one to amortize the fixed overhead of sending and receiving a message over the transmittal of many elements, even in these more general circumstances. MPI provides two mechanisms to achieve this.
The construction and use of derived datatypes is described in Section - . The use of Pack and Unpack functions is described in Section . It is often possible to achieve the same data transfer using either mechanism. We discuss the pros and cons of each approach at the end of this chapter.
All MPI communication functions take a datatype argument. In the simplest case this will be a primitive type, such as an integer or floating-point number. An important and powerful generalization results by allowing user-defined (or ``derived'') types wherever the primitive types can occur. These are not ``types'' as far as the programming language is concerned. They are only ``types'' in that MPI is made aware of them through the use of type-constructor functions, and they describe the layout, in memory, of sets of primitive types. Through user-defined types, MPI supports the communication of complex data structures such as array sections and structures containing combinations of primitive datatypes. Example shows how a user-defined datatype is used to send the upper-triangular part of a matrix, and Figure diagrams the memory layout represented by the user-defined datatype.
Figure: A diagram of the memory cells represented by the user-defined datatype upper. The shaded cells are the locations of the array that will be sent.
Derived datatypes are constructed from basic datatypes using the constructors described in Section . The constructors can be applied recursively.
A derived datatype is an opaque object that specifies two things: a sequence of primitive datatypes, and a sequence of integer (byte) displacements.
The displacements are not required to be positive, distinct, or in increasing order. Therefore, the order of items need not coincide with their order in memory, and an item may appear more than once. We call such a pair of sequences (or sequence of pairs) a type map. The sequence of primitive datatypes (displacements ignored) is the type signature of the datatype.
Let
$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$
be such a type map, where the $type_i$ are primitive types, and the $disp_i$ are displacements. Let
$Typesig = \{type_0, \ldots, type_{n-1}\}$
be the associated type signature. This type map, together with a base address buf, specifies a communication buffer: the communication buffer that consists of $n$ entries, where the $i$-th entry is at address $buf + disp_i$ and has type $type_i$. A message assembled from a single type of this sort will consist of $n$ values, of the types defined by $Typesig$.
A handle to a derived datatype can appear as an argument in a send or receive operation, instead of a primitive datatype argument. The operation MPI_SEND(buf, 1, datatype,...) will use the send buffer defined by the base address buf and the derived datatype associated with datatype. It will generate a message with the type signature determined by the datatype argument. MPI_RECV(buf, 1, datatype,...) will use the receive buffer defined by the base address buf and the derived datatype associated with datatype.
Derived datatypes can be used in all send and receive operations, including collective. We discuss, in Section , the case where the second argument count has a value greater than one.
The primitive datatypes presented in Section are special cases of a derived datatype, and are predefined. Thus, MPI_INT is a predefined handle to a datatype with type map $\{(int, 0)\}$, with one entry of type int and displacement zero. The other primitive datatypes are similar.
The extent of a datatype is defined to be the span from the first byte to the last byte occupied by entries in this datatype, rounded up to satisfy alignment requirements. That is, if
$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
then
$lb(Typemap) = \min_j disp_j,$
$ub(Typemap) = \max_j (disp_j + sizeof(type_j)) + \epsilon,$ and
$extent(Typemap) = ub(Typemap) - lb(Typemap),$
where $lb(Typemap)$ is the lower bound and $ub(Typemap)$ is the upper bound of the datatype. If $type_i$ requires alignment to a byte address that is a multiple of $k_i$, then $\epsilon$ is the least nonnegative increment needed to round $extent(Typemap)$ to the next multiple of $\max_i k_i$. (The definition of extent is expanded in Section .)
The following functions return information on datatypes.
MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)
MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERROR)INTEGER DATATYPE, EXTENT, IERROR
MPI_TYPE_EXTENT returns the extent of a datatype. In addition to its use with derived datatypes, it can be used to inquire about the extent of primitive datatypes. For example, MPI_TYPE_EXTENT(MPI_INT, extent) will return in extent the size, in bytes, of an int - the same value that would be returned by the C call sizeof(int).
MPI_Type_size(MPI_Datatype datatype, int *size)
MPI_TYPE_SIZE(DATATYPE, SIZE, IERROR)INTEGER DATATYPE, SIZE, IERROR
MPI_TYPE_SIZE returns the total size, in bytes, of the entries in the type signature associated with datatype; that is, the total size of the data in a message that would be created with this datatype. Entries that occur multiple times in the datatype are counted with their multiplicity. For primitive datatypes, this function returns the same information as MPI_TYPE_EXTENT.
This section presents the MPI functions for constructing derived datatypes. The functions are presented in an order from simplest to most complex.
MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_CONTIGUOUS is the simplest datatype constructor. It constructs a typemap consisting of the replication of a datatype into contiguous locations. The argument newtype is the datatype obtained by concatenating count copies of oldtype. Concatenation is defined using extent(oldtype) as the size of the concatenated copies. The action of the Contiguous constructor is represented schematically in Figure .
Figure: Effect of datatype constructor MPI_TYPE_CONTIGUOUS.
In general, assume that the type map of oldtype is
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
with extent $ex$. Then newtype has a type map with $count \cdot n$ entries: the entry $(type_k, disp_k + i \cdot ex)$, for $i = 0, \ldots, count-1$ and $k = 0, \ldots, n-1$.
MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_VECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_VECTOR is a constructor that allows replication of a datatype into locations that consist of equally spaced blocks. Each block is obtained by concatenating the same number of copies of the old datatype. The spacing between blocks is a multiple of the extent of the old datatype. The action of the Vector constructor is represented schematically in Figure .
Figure: Datatype constructor MPI_TYPE_VECTOR.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
with extent $ex$. Let $bl$ be the blocklength. The new datatype has a type map with $count \cdot bl \cdot n$ entries: the entry $(type_k, disp_k + (i \cdot stride + j) \cdot ex)$, for $i = 0, \ldots, count-1$, $j = 0, \ldots, bl-1$, and $k = 0, \ldots, n-1$.
A call to MPI_TYPE_CONTIGUOUS(count, oldtype, newtype) is equivalent to a call to MPI_TYPE_VECTOR(count, 1, 1, oldtype, newtype), or to a call to MPI_TYPE_VECTOR(1, count, num, oldtype, newtype), with num arbitrary.
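As a concrete illustration (an assumption, not one of the book's numbered examples), the Vector constructor can describe one column of a row-major C matrix; the byte counts in the comments assume an 8-byte double.

#include <mpi.h>

#define NROWS 4
#define NCOLS 6

void send_column(double a[NROWS][NCOLS], int dest, MPI_Comm comm)
{
    MPI_Datatype coltype;
    MPI_Aint     extent;
    int          size;

    /* NROWS blocks of 1 element each, successive blocks NCOLS elements apart */
    MPI_Type_vector(NROWS, 1, NCOLS, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    MPI_Type_size(coltype, &size);      /* 4 * 8 = 32 bytes of data        */
    MPI_Type_extent(coltype, &extent);  /* (3*6 + 1) * 8 = 152 bytes span  */

    /* send column 2 of a: the base address selects the column */
    MPI_Send(&a[0][2], 1, coltype, dest, 0, comm);

    MPI_Type_free(&coltype);
}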
The Vector type constructor assumes that the stride between successive blocks is a multiple of the oldtype extent. This avoids, most of the time, the need for computing stride in bytes. Sometimes it is useful to relax this assumption and allow a stride which consists of an arbitrary number of bytes. The Hvector type constructor below achieves this purpose. The usage of both Vector and Hvector is illustrated in Examples - .
MPI_Type_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_HVECTOR is identical to MPI_TYPE_VECTOR, except that stride is given in bytes, rather than in elements. (H stands for ``heterogeneous''). The action of the Hvector constructor is represented schematically in Figure .
Figure: Datatype constructor MPI_TYPE_HVECTOR.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
with extent $ex$. Let $bl$ be the blocklength. The new datatype has a type map with $count \cdot bl \cdot n$ entries: the entry $(type_k, disp_k + i \cdot stride + j \cdot ex)$, for $i = 0, \ldots, count-1$, $j = 0, \ldots, bl-1$, and $k = 0, \ldots, n-1$ (here stride is already expressed in bytes).
Figure: Memory layout of 2D array section for Example . The shaded blocks are sent.
The Indexed constructor allows one to specify a noncontiguous data layout where displacements between successive blocks need not be equal. This allows one to gather arbitrary entries from an array and send them in one message, or receive one message and scatter the received entries into arbitrary locations in an array.
MPI_Type_indexed(int count, int *array_of_blocklengths, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_INDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_INDEXED allows replication of an old datatype into a sequence of blocks (each block is a concatenation of the old datatype), where each block can contain a different number of copies of oldtype and have a different displacement. All block displacements are measured in units of the oldtype extent. The action of the Indexed constructor is represented schematically in Figure .
Figure: Datatype constructor MPI_TYPE_INDEXED.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
with extent $ex$. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. The new datatype has a type map with $n \cdot \sum_{i=0}^{count-1} B[i]$ entries: the entry $(type_k, disp_k + (D[i] + j) \cdot ex)$, for $i = 0, \ldots, count-1$, $j = 0, \ldots, B[i]-1$, and $k = 0, \ldots, n-1$.
A call to MPI_TYPE_VECTOR(count, blocklength, stride, oldtype, newtype) is equivalent to a call to MPI_TYPE_INDEXED(count, B, D, oldtype, newtype) where $D[j] = j \cdot stride$ and $B[j] = blocklength$, for $j = 0, \ldots, count-1$.
The use of the MPI_TYPE_INDEXED function was illustrated in Example , on page ; the function was used to transfer the upper triangular part of a square matrix.
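The book's example is not reproduced here, but the essential construction can be sketched as follows for a row-major N x N matrix of doubles; the destination rank and communicator are assumptions.

#include <mpi.h>

#define N 100

void send_upper(double a[N][N], int dest, MPI_Comm comm)
{
    int blocklen[N], disp[N], i;
    MPI_Datatype upper;

    for (i = 0; i < N; i++) {
        blocklen[i] = N - i;        /* elements on and above the diagonal   */
        disp[i]     = i * N + i;    /* displacement in units of the oldtype */
    }
    MPI_Type_indexed(N, blocklen, disp, MPI_DOUBLE, &upper);
    MPI_Type_commit(&upper);
    MPI_Send(a, 1, upper, dest, 0, comm);
    MPI_Type_free(&upper);
}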
As with the Vector and Hvector constructors, it is usually convenient to measure displacements in multiples of the extent of the oldtype, but sometimes necessary to allow for arbitrary displacements. The Hindexed constructor satisfies the latter need.
MPI_Type_hindexed(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_HINDEXED is identical to MPI_TYPE_INDEXED, except that block displacements in array_of_displacements are specified in bytes, rather than in multiples of the oldtype extent. The action of the Hindexed constructor is represented schematically in Figure .
Figure: Datatype constructor MPI_TYPE_HINDEXED.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
with extent $ex$. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. The new datatype has a type map with $n \cdot \sum_{i=0}^{count-1} B[i]$ entries: the entry $(type_k, disp_k + D[i] + j \cdot ex)$, for $i = 0, \ldots, count-1$, $j = 0, \ldots, B[i]-1$, and $k = 0, \ldots, n-1$ (here the displacements D[i] are already expressed in bytes).
MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
MPI_TYPE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, ARRAY_OF_TYPES, NEWTYPE, IERROR)INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), ARRAY_OF_TYPES(*), NEWTYPE, IERROR
MPI_TYPE_STRUCT is the most general type constructor. It further generalizes MPI_TYPE_HINDEXED in that it allows each block to consist of replications of different datatypes. The intent is to allow descriptions of arrays of structures, as a single datatype. The action of the Struct constructor is represented schematically in Figure .
Figure: Datatype constructor MPI_TYPE_STRUCT.
In general, let T be the array_of_types argument, where T[i] is a handle to a datatype with type map
$\{(type_0^i, disp_0^i), \ldots, (type_{n_i-1}^i, disp_{n_i-1}^i)\},$
with extent $ex_i$. Let B be the array_of_blocklengths argument and D be the array_of_displacements argument. Let c be the count argument. Then the new datatype has a type map with $\sum_{i=0}^{c-1} B[i] \cdot n_i$ entries: the entry $(type_k^i, disp_k^i + D[i] + j \cdot ex_i)$, for $i = 0, \ldots, c-1$, $j = 0, \ldots, B[i]-1$, and $k = 0, \ldots, n_i-1$.
A call to MPI_TYPE_HINDEXED(count, B, D, oldtype, newtype) is equivalent to a call to MPI_TYPE_STRUCT(count, B, D, T, newtype), where each entry of T is equal to oldtype.
The original MPI standard was created by the Message Passing Interface Forum (MPIF). The public release of version 1.0 of MPI was made in June 1994. The MPIF began meeting again in March 1995. One of the first tasks undertaken was to make clarifications and corrections to the MPI standard. The changes from version 1.0 to version 1.1 of the MPI standard were limited to ``corrections'' that were deemed urgent and necessary. This work was completed in June 1995 and version 1.1 of the standard was released. This book reflects the updated version 1.1 of the MPI standard.
A derived datatype must be committed before it can be used in a communication. A committed datatype can continue to be used as an input argument in datatype constructors (so that other datatypes can be derived from the committed datatype). There is no need to commit primitive datatypes. derived datatype, commitcommit
MPI_Type_commit(MPI_Datatype *datatype)
MPI_TYPE_COMMIT(DATATYPE, IERROR)INTEGER DATATYPE, IERROR
MPI_TYPE_COMMIT commits the datatype. Commit should be thought of as a possible ``flattening'' or ``compilation'' of the formal description of a type map into an efficient representation. Commit does not imply that the datatype is bound to the current content of a communication buffer. After a datatype has been committed, it can be repeatedly reused to communicate different data.
A datatype object is deallocated by a call to MPI_TYPE_FREE.
MPI_Type_free(MPI_Datatype *datatype)
MPI_TYPE_FREE(DATATYPE, IERROR)INTEGER DATATYPE, IERROR
MPI_TYPE_FREE marks the datatype object associated with datatype for deallocation and sets datatype to MPI_DATATYPE_NULL. MPI_DATATYPE_NULL Any communication that is currently using this datatype will complete normally. Derived datatypes that were defined from the freed datatype are not affected. derived datatype, destructor
A call of the form MPI_SEND(buf, count, datatype, ...), where count > 1, is interpreted as if the call was passed a new datatype which is the concatenation of count copies of datatype. Thus, MPI_SEND(buf, count, datatype, dest, tag, comm) is equivalent to,
MPI_TYPE_CONTIGUOUS(count, datatype, newtype)
MPI_TYPE_COMMIT(newtype)
MPI_SEND(buf, 1, newtype, dest, tag, comm).
Similar statements apply to all other communication functions that have a count and datatype argument.
Suppose that a send operation MPI_SEND(buf, count, datatype, dest, tag, comm) is executed, where datatype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$
and extent $ex$. The send operation sends $n \cdot count$ entries, where entry $i \cdot n + j$ is at location $addr_{i,j} = buf + i \cdot ex + disp_j$ and has type $type_j$, for $i = 0, \ldots, count-1$ and $j = 0, \ldots, n-1$. The variable stored at address $addr_{i,j}$ in the calling program should be of a type that matches $type_j$, where type matching is defined as in Section .
Similarly, suppose that a receive operation MPI_RECV(buf, count, datatype, source, tag, comm, status) is executed. The receive operation receives up to $n \cdot count$ entries, where entry $i \cdot n + j$ is at location $buf + i \cdot ex + disp_j$ and has type $type_j$. Type matching is defined according to the type signature of the corresponding datatypes, that is, the sequence of primitive type components. Type matching does not depend on other aspects of the datatype definition, such as the displacements (layout in memory) or the intermediate types used to define the datatypes.
For sends, a datatype may specify overlapping entries. This is not true for receives. If the datatype used in a receive operation specifies overlapping entries then the call is erroneous.
If a message was received using a user-defined datatype, then a subsequent call to MPI_GET_COUNT(status, datatype, count) (Section ) will return the number of ``copies'' of datatype received (count). That is, if the receive operation was MPI_RECV(buff, count, datatype, ...) then MPI_GET_COUNT may return any integer value $k$, where $0 \le k \le count$. If MPI_GET_COUNT returns $k$, then the number of primitive elements received is $n \cdot k$, where $n$ is the number of primitive elements in the type map of datatype. The received message need not fill an integral number of ``copies'' of datatype. If the number of primitive elements received is not a multiple of $n$, that is, if the receive operation has not received an integral number of datatype ``copies,'' then MPI_GET_COUNT returns the value MPI_UNDEFINED.
The function MPI_GET_ELEMENTS below can be used to determine the number of primitive elements received.
int MPI_Get_elements(MPI_Status *status, MPI_Datatype datatype, int *count)

MPI_GET_ELEMENTS(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR
The function MPI_GET_ELEMENTS can also be used after a probe to find the number of primitive datatype elements in the probed message. Note that the two functions MPI_GET_COUNT and MPI_GET_ELEMENTS return the same values when they are used with primitive datatypes.
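As a sketch of the difference (the ranks, communicator, and buffer sizes here are hypothetical), suppose a receiver posts a receive for two copies of a two-double datatype but only three doubles arrive; MPI_GET_COUNT then returns MPI_UNDEFINED, while MPI_GET_ELEMENTS returns 3.

#include <mpi.h>
#include <stdio.h>

void count_vs_elements(int rank, MPI_Comm comm)
{
    MPI_Datatype pair;
    MPI_Status   status;
    double       buf[4] = {0.0, 0.0, 0.0, 0.0};
    int          count, elements;

    MPI_Type_contiguous(2, MPI_DOUBLE, &pair);   /* one "copy" = 2 doubles */
    MPI_Type_commit(&pair);

    if (rank == 0) {
        MPI_Send(buf, 3, MPI_DOUBLE, 1, 0, comm);    /* 1.5 copies of pair */
    } else if (rank == 1) {
        MPI_Recv(buf, 2, pair, 0, 0, comm, &status);
        MPI_Get_count(&status, pair, &count);        /* MPI_UNDEFINED */
        MPI_Get_elements(&status, pair, &elements);  /* 3 primitive doubles */
        printf("count = %d, elements = %d\n", count, elements);
    }
    MPI_Type_free(&pair);
}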
As shown in Example , page , one sometimes needs to be able to find the displacement, in bytes, of a structure component relative to the structure start. In C, one can use the sizeof operator to find the size of C objects, and one will be tempted to use the & operator to compute addresses and then displacements. However, the C standard does not require that (int)&v be the byte address of variable v: the mapping of pointers to integers is implementation dependent. Some systems may have ``word'' pointers and ``byte'' pointers; other systems may have a segmented, noncontiguous address space. Therefore, a portable mechanism has to be provided by MPI to compute the ``address'' of a variable. Such a mechanism is certainly needed in Fortran, which has no dereferencing operator.
int MPI_Address(void* location, MPI_Aint *address)

MPI_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
    INTEGER ADDRESS, IERROR
MPI_ADDRESS is used to find the address of a location in memory. It returns the byte address of location.
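For instance, here is a sketch of computing field displacements for the Partstruct structure assumed in the surrounding examples; the helper function and variable names are ours, not the book's.

#include <mpi.h>

struct Partstruct {
    char   class;   /* particle class          */
    double d[6];    /* particle coordinates    */
    char   b[7];    /* additional information  */
};

/* Compute byte displacements of the three fields relative to the
   start of the structure, using MPI_ADDRESS for portability. */
void field_displacements(struct Partstruct *p, MPI_Aint disp[3])
{
    MPI_Aint base;

    MPI_Address(p,         &base);
    MPI_Address(&p->class, &disp[0]);
    MPI_Address(p->d,      &disp[1]);
    MPI_Address(p->b,      &disp[2]);

    disp[0] -= base;
    disp[1] -= base;
    disp[2] -= base;
}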
Sometimes it is necessary to override the definition of extent given in Section . Consider, for example, the code in Example in the previous section. Assume that a double occupies 8 bytes and must be double-word aligned. There will be 7 bytes of padding after the first field and one byte of padding after the last field of the structure Partstruct, and the structure will occupy 64 bytes. If, on the other hand, a double can be word aligned only, then there will be only 3 bytes of padding after the first field, and Partstruct will occupy 60 bytes. The MPI library will follow the alignment rules used on the target systems so that the extent of datatype Particletype equals the amount of storage occupied by Partstruct. The catch is that different alignment rules may be specified, on the same system, using different compiler options. An even more difficult problem is that some compilers allow the use of pragmas in order to specify different alignment rules for different structures within the same program. (Many architectures can correctly handle misaligned values, but with lower performance; different alignment rules trade speed of access for storage density.) The MPI library will assume the default alignment rules. However, the user should be able to overrule this assumption if structures are packed otherwise.
To allow this capability, MPI has two additional ``pseudo-datatypes,'' MPI_LB and MPI_UB, that can be used, respectively, to mark the lower bound or the upper bound of a datatype. These pseudo-datatypes occupy no space ($extent({\tt MPI\_LB}) = extent({\tt MPI\_UB}) = 0$). They do not affect the size or count of a datatype, and do not affect the content of a message created with this datatype. However, they do change the extent of a datatype and, therefore, affect the outcome of a replication of this datatype by a datatype constructor.
In general, if
$$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\},$$
then the lower bound of $Typemap$ is defined to be
$$lb(Typemap) = \begin{cases} \min_j disp_j & \mbox{if no entry has basic type {\tt lb}} \\ \min_j \{disp_j \mbox{ such that } type_j = {\tt lb}\} & \mbox{otherwise} \end{cases}$$
Similarly, the upper bound of $Typemap$ is defined to be
$$ub(Typemap) = \begin{cases} \max_j (disp_j + sizeof(type_j)) + \epsilon & \mbox{if no entry has basic type {\tt ub}} \\ \max_j \{disp_j \mbox{ such that } type_j = {\tt ub}\} & \mbox{otherwise} \end{cases}$$
And
$$extent(Typemap) = ub(Typemap) - lb(Typemap).$$
If $type_i$ requires alignment to a byte address that is a multiple of $k_i$, then $\epsilon$ is the least nonnegative increment needed to round $extent(Typemap)$ to the next multiple of $\max_i k_i$. The formal definitions given for the various datatype constructors continue to apply, with the amended definition of extent. Also, MPI_TYPE_EXTENT returns this extent as its value.
The two functions below can be used for finding the lower bound and the upper bound of a datatype.
int MPI_Type_lb(MPI_Datatype datatype, MPI_Aint* displacement)

MPI_TYPE_LB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR
MPI_TYPE_LB returns the lower bound of a datatype, in bytes, relative to the datatype origin.
int MPI_Type_ub(MPI_Datatype datatype, MPI_Aint* displacement)

MPI_TYPE_UB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR
MPI_TYPE_UB returns the upper bound of a datatype, in bytes, relative to the datatype origin.
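Here is a sketch of how MPI_UB might be used (the structure and helper names are illustrative) to force the extent of a structure datatype to equal sizeof(struct Partstruct), so that an array of structures can be sent with a count argument regardless of the compiler's padding rules.

#include <mpi.h>

struct Partstruct {
    char   class;
    double d[6];
    char   b[7];
};

void build_particletype(struct Partstruct *p, MPI_Datatype *particletype)
{
    MPI_Datatype types[4]    = { MPI_CHAR, MPI_DOUBLE, MPI_CHAR, MPI_UB };
    int          blocklen[4] = { 1, 6, 7, 1 };
    MPI_Aint     disp[4], base;

    MPI_Address(p,         &base);
    MPI_Address(&p->class, &disp[0]);
    MPI_Address(p->d,      &disp[1]);
    MPI_Address(p->b,      &disp[2]);
    disp[0] -= base;
    disp[1] -= base;
    disp[2] -= base;
    disp[3] = sizeof(struct Partstruct);   /* MPI_UB marker sets the extent */

    MPI_Type_struct(4, blocklen, disp, types, particletype);
    MPI_Type_commit(particletype);
}

With this definition, a send with a count argument steps through the array of structures with the same stride that the compiler uses.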
Consider Example on page . One computes the ``absolute address'' of the structure components, using calls to MPI_ADDRESS, then subtracts the starting address of the array to compute relative displacements. When the send operation is executed, the starting address of the array is added back in order to compute the send buffer location. This superfluous arithmetic could be avoided if ``absolute'' addresses were used in the derived datatype and ``address zero'' were passed as the buffer argument in the send call.
MPI supports the use of such ``absolute'' addresses in derived datatypes. The displacement arguments used in datatype constructors can be ``absolute addresses,'' i.e., addresses returned by calls to MPI_ADDRESS. Address zero is indicated to communication functions by passing the constant MPI_BOTTOM as the buffer argument. Unlike derived datatypes with relative displacements, a datatype that uses ``absolute'' addresses can be used only with the specific structure for which it was created.
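A sketch of this usage follows (the variable and routine names are assumptions): the displacements stored in the datatype are the absolute addresses returned by MPI_ADDRESS, and MPI_BOTTOM is passed as the buffer argument.

#include <mpi.h>

struct { int n; double x[10]; } record;    /* data to be sent */

void send_record(int dest, MPI_Comm comm)
{
    MPI_Datatype rectype;
    MPI_Datatype types[2]    = { MPI_INT, MPI_DOUBLE };
    int          blocklen[2] = { 1, 10 };
    MPI_Aint     disp[2];

    /* absolute addresses, not displacements relative to a buffer start */
    MPI_Address(&record.n, &disp[0]);
    MPI_Address(record.x,  &disp[1]);

    MPI_Type_struct(2, blocklen, disp, types, &rectype);
    MPI_Type_commit(&rectype);

    /* "address zero" is passed as the buffer argument */
    MPI_Send(MPI_BOTTOM, 1, rectype, dest, 0, comm);

    MPI_Type_free(&rectype);
}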
The use of addresses and displacements in MPI is best understood in the context of a flat address space. Then, the ``address'' of a location, as computed by calls to MPI_ADDRESS, can be the regular address of that location (or a shift of it), and integer arithmetic on MPI ``addresses'' yields the expected result. However, the use of a flat address space is not mandated by C or Fortran. Another potential source of problems is that Fortran INTEGERs may be too short to store full addresses.
Variables belong to the same sequential storage if they belong to the same array, to the same COMMON block in Fortran, or to the same structure in C. Implementations may restrict the use of addresses so that arithmetic on addresses is confined within sequential storage. Namely, in a communication call, either the communication buffer specified by the buf, count, and datatype arguments should lie entirely within the same sequential storage, or the buffer argument should be MPI_BOTTOM, with a datatype whose (absolute) displacements all refer to locations within the same sequential storage.
Some existing communication libraries, such as PVM and Parmacs, provide pack and unpack functions for sending noncontiguous data. In these, the application explicitly packs data into a contiguous buffer before sending it, and unpacks it from a contiguous buffer after receiving it. Derived datatypes, described in the previous sections of this chapter, allow one, in most cases, to avoid explicit packing and unpacking. The application specifies the layout of the data to be sent or received, and MPI directly accesses a noncontiguous buffer when derived datatypes are used. The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part. Another use is that the availability of pack and unpack operations facilitates the development of additional communication libraries layered on top of MPI.
int MPI_Pack(void* inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)

MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
MPI_PACK packs a message specified by inbuf, incount, datatype, comm into the buffer space specified by outbuf and outsize. The input buffer can be any communication buffer allowed in MPI_SEND. The output buffer is a contiguous storage area containing outsize bytes, starting at the address outbuf.
The input value of position is the first position in the output buffer to be used for packing. The argument position is incremented by the size of the packed message so that it can be used as input to a subsequent call to MPI_PACK. The comm argument is the communicator that will be subsequently used for sending the packed message.
int MPI_Unpack(void* inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)

MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
    INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
MPI_UNPACK unpacks a message into the receive buffer specified by outbuf, outcount, datatype from the buffer space specified by inbuf and insize. The output buffer can be any communication buffer allowed in MPI_RECV. The input buffer is a contiguous storage area containing insize bytes, starting at address inbuf. The input value of position is the position in the input buffer where one wishes the unpacking to begin. The output value of position is incremented by the size of the packed message, so that it can be used as input to a subsequent call to MPI_UNPACK. The argument comm was the communicator used to receive the packed message.
The MPI_PACK/MPI_UNPACK calls relate to message passing as the sprintf/sscanf calls in C relate to file I/O, or internal Fortran files relate to external units. Basically, the MPI_PACK function allows one to ``send'' a message into a memory buffer; the MPI_UNPACK function allows one to ``receive'' a message from a memory buffer.
Several communication buffers can be successively packed into one packing unit. Such a packing unit is created by several successive, related calls to MPI_PACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, along with the same values for outbuf, outsize, and comm. The packing unit then contains the equivalent information that would have been stored in a message by one send call with a send buffer that is the ``concatenation'' of the individual send buffers.
A packing unit must be sent using type MPI_PACKED. Any point-to-point or collective communication function can be used. The message sent is identical to the message that would be sent by a send operation with a datatype argument describing the concatenation of the send buffer(s) used in the Pack calls. The message can be received with any datatype that matches this send datatype.
Any message can be received in a point-to-point or collective communication using the type MPI_PACKED. Such a message can then be unpacked by calls to MPI_UNPACK. The message can be unpacked by several, successive calls to MPI_UNPACK, where the first call provides position = 0, and each successive call inputs the value of position that was output by the previous call, and the same values for inbuf, insize, and comm.
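A minimal pack/send/receive/unpack sketch follows, assuming two processes of comm and a message that carries an int followed by a double; the buffer size and tag are arbitrary.

#include <mpi.h>

void pack_example(int rank, MPI_Comm comm)
{
    char       buffer[100];
    int        position, i = 0;
    double     x = 0.0;
    MPI_Status status;

    if (rank == 0) {
        i = 7;  x = 3.14;
        position = 0;
        MPI_Pack(&i, 1, MPI_INT,    buffer, 100, &position, comm);
        MPI_Pack(&x, 1, MPI_DOUBLE, buffer, 100, &position, comm);
        /* the packing unit is sent with type MPI_PACKED */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, comm, &status);
        position = 0;
        MPI_Unpack(buffer, 100, &position, &i, 1, MPI_INT,    comm);
        MPI_Unpack(buffer, 100, &position, &x, 1, MPI_DOUBLE, comm);
    }
}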
int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm, int *size)

MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, COMM, SIZE, IERROR
MPI_PACK_SIZE allows the application to find out how much space is needed to pack a message and, thus, manage space allocation for buffers. The function returns, in size, an upper bound on the increment in position that would occur in a call to MPI_PACK with the same values for incount, datatype, and comm.
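For example, here is a sketch (the function name and counts are ours) of sizing and allocating a pack buffer for one int followed by ten doubles.

#include <mpi.h>
#include <stdlib.h>

char *alloc_pack_buffer(MPI_Comm comm, int *bufsize)
{
    int size_int, size_dbl;

    /* upper bounds on the space needed to pack 1 int and 10 doubles */
    MPI_Pack_size(1,  MPI_INT,    comm, &size_int);
    MPI_Pack_size(10, MPI_DOUBLE, comm, &size_dbl);

    *bufsize = size_int + size_dbl;
    return (char *) malloc(*bufsize);
}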
A comparison between Example on page and Example in the previous section is instructive.
First, programming convenience. It is somewhat less tedious to pack the class zero particles in the loop that locates them, rather than defining in this loop the datatype that will later collect them. On the other hand, it would be very tedious (and inefficient) to pack separately the components of each structure entry in the array. Defining a datatype is more convenient when this definition depends only on declarations; packing may be more convenient when the communication buffer layout is data dependent.
Second, storage use. The packing code uses at least 56,000 bytes for the pack buffer, i.e., up to 1,000 copies of the structure (1 char, 6 doubles, and 7 chars occupy 56 bytes). The derived datatype code uses 12,000 bytes for the three 1,000-entry integer arrays used to define the derived datatype. It also probably uses a similar amount of storage for the internal datatype representation. The difference is likely to be larger in realistic codes. The use of packing requires additional storage for a copy of the data, whereas the use of derived datatypes requires additional storage for a description of the data layout.
Finally, compute time. The packing code executes a function call for each packed item, whereas the derived datatype code executes only a fixed number of function calls. The packing code is likely to require one additional memory-to-memory copy of the data, as compared to the derived-datatype code. One may expect, on most implementations, to achieve better performance with the derived datatype code.
Both codes send the same size message, so that there is no difference in communication time. However, if the buffer described by the derived datatype is not contiguous in memory, it may take longer to access.
Example above illustrates another advantage of pack/unpack; namely, the receiving process may use information in part of an incoming message in order to decide how to handle subsequent data in the message. In order to achieve the same outcome without pack/unpack, one would have to send two messages: the first with the list of indices, to be used to construct a derived datatype that is then used to receive the particle entries sent in a second message.
The use of derived datatypes will often lead to improved performance: data copying can be avoided, and information on data layout can be reused, when the same communication buffer is reused. On the other hand, the definition of derived datatypes for complex layouts can be more tedious than explicit packing. Derived datatypes should be used whenever data layout is defined by program declarations (e.g., structures), or is regular (e.g., array sections). Packing might be considered for complex, dynamic, data-dependent layouts. Packing may result in more efficient code in situations where the sender has to communicate to the receiver information that affects the layout of the receive buffer.
Collective communications transmit data among all processes in a group specified by an intracommunicator object. One function, the barrier, serves to synchronize processes without passing data. MPI provides the following collective communication functions.
Figure: Collective move functions illustrated for a group of six processes. In each case, each row of boxes represents data locations in one process. Thus, in the broadcast, initially just the first process contains the item $A_0$, but after the broadcast all processes contain it.
Figure gives a pictorial representation of the global communication functions. All these functions (broadcast excepted) come in two variants: the simple variant, where all communicated items are messages of the same size, and the ``vector'' variant, where each item can be of a different size. In addition, in the simple variant, multiple items originating from the same process or received at the same process are contiguous in memory; the vector variant allows one to pick distinct items from non-contiguous locations.
Some of these functions, such as broadcast or gather, have a single origin or a single receiving process. Such a process is called the root. Global communication functions basically come in three patterns: the root sends data to all processes, itself included (broadcast and scatter); the root receives data from all processes, itself included (gather); or each process communicates with each process, itself included (allgather and alltoall).
The syntax and semantics of the MPI collective functions were designed to be consistent with point-to-point communications. However, to keep the number of functions and their argument lists to a reasonable level of complexity, the MPI committee made collective functions more restrictive than the point-to-point functions in several ways. One restriction is that, in contrast to point-to-point communication, the amount of data sent must exactly match the amount of data specified by the receiver.
A major simplification is that collective functions come in blocking versions only. Though a standing joke at committee meetings concerned the ``non-blocking barrier,'' such functions can be quite useful and may be included in a future version of MPI.
Collective functions do not use a tag argument. Thus, within each intragroup communication domain, collective calls are matched strictly according to the order of execution.
A final simplification of collective functions concerns modes. Collective functions come in only one mode, and this mode may be regarded as analogous to the standard mode of point-to-point. Specifically, the semantics are as follows. A collective function (on a given process) can return as soon as its participation in the overall communication is complete. As usual, the completion indicates that the caller is now free to access and modify locations in the communication buffer(s). It does not indicate that other processes have completed, or even started, the operation. Thus, a collective communication may, or may not, have the effect of synchronizing all calling processes. The barrier, of course, is the exception to this statement.
This choice of semantics was made so as to allow a variety of implementations.
The user of MPI must keep these issues in mind. For example, even though a particular implementation of MPI may provide a broadcast with the side-effect of synchronization (the standard allows this), the standard does not require this, and hence, any program that relies on the synchronization will be non-portable. On the other hand, a correct and portable program must allow a collective function to be synchronizing. Though one should not rely on synchronization side-effects, one must program so as to allow for it.
Though these issues and statements may seem unusually obscure, they are merely a consequence of the desire of MPI to:
A collective operation is executed by having all processes in the group call the communication routine, with matching arguments. The syntax and semantics of the collective operations are defined to be consistent with the syntax and semantics of the point-to-point operations. Thus, user-defined datatypes are allowed and must match between sending and receiving processes as specified in Chapter . One of the key arguments is an intracommunicator that defines the group of participating processes and provides a communication domain for the operation. In calls where a root process is defined, some arguments are specified as ``significant only at root,'' and are ignored for all participants except the root. The reader is referred to Chapter for information concerning communication buffers and type matching rules, to Chapter for user-defined datatypes, and to Chapter for information on how to define groups and create communicators.
The type-matching conditions for the collective operations are more strict than the corresponding conditions between sender and receiver in point-to-point. Namely, for collective operations, the amount of data sent must exactly match the amount of data specified by the receiver. Distinct type maps (the layout in memory, see Section ) between sender and receiver are still allowed.
Collective communication calls may use the same communicators as point-to-point communication; MPI guarantees that messages generated on behalf of collective communication calls will not be confused with messages generated by point-to-point communication. A more detailed discussion of correct use of collective routines is found in Section .
The key concept of the collective functions is to have a ``group'' of participating processes. The routines do not have a group identifier as an explicit argument. Instead, there is a communicator argument. For the purposes of this chapter, a communicator can be thought of as a group identifier linked with a communication domain. An intercommunicator, that is, a communicator that spans two groups, is not allowed as an argument to a collective function.
int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR
MPI_BARRIER blocks the caller until all group members have called it. The call returns at any process only after all group members have entered the call.
int MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
    INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
MPI_BCAST broadcasts a message from the process with rank root to all processes of the group. The argument root must have identical values on all processes, and comm must represent the same intragroup communication domain. On return, the contents of root's communication buffer have been copied to all processes.
General, derived datatypes are allowed for datatype. The type signature of count and datatype on any process must be equal to the type signature of count and datatype at the root. This implies that the amount of data sent must be equal to the amount received, pairwise between each process and the root. MPI_BCAST and all other data-movement collective routines make this restriction. Distinct type maps between sender and receiver are still allowed.
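A sketch of a broadcast of 100 ints from process 0 (the array name is illustrative); note that every process, root included, issues the same call.

#include <mpi.h>

void broadcast_table(MPI_Comm comm)
{
    int table[100];
    /* the root's table values are assumed to have been filled in;
       on return every process has a copy */
    MPI_Bcast(table, 100, MPI_INT, 0, comm);
}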
int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
Each process (root process included) sends the contents of its send buffer to the root process. The root process receives the messages and stores them in rank order. The outcome is as if each of the n processes in the group (including the root process) had executed a call to MPI_Send(sendbuf, sendcount, sendtype, root, ...), and the root had executed n calls to MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...), where extent(recvtype) is the type extent obtained from a call to MPI_Type_extent().
An alternative description is that the n messages sent by the processes in the group are concatenated in rank order, and the resulting message is received by the root as if by a call to MPI_RECV(recvbuf, recvcount*n, recvtype, ...).
The receive buffer is ignored for all non-root processes.
General, derived datatypes are allowed for both sendtype and recvtype. The type signature of sendcount and sendtype on process i must be equal to the type signature of recvcount and recvtype at the root. This implies that the amount of data sent must be equal to the amount of data received, pairwise between each process and the root. Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root, while on other processes, only arguments sendbuf, sendcount, sendtype, root, and comm are significant. The argument root must have identical values on all processes and comm must represent the same intragroup communication domain.
The specification of counts and types should not cause any location on the root to be written more than once. Such a call is erroneous.
Note that the recvcount argument at the root indicates the number of items it receives from each process, not the total number of items it receives.
Figure: The root process gathers 100 ints from each process in the group.
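A sketch matching the figure (the allocation scheme is illustrative): each process contributes 100 ints, and the root provides room for gsize*100 ints.

#include <mpi.h>
#include <stdlib.h>

void gather_ints(MPI_Comm comm, int root)
{
    int  gsize, rank;
    int  sendarray[100];
    int *rbuf = NULL;

    MPI_Comm_size(comm, &gsize);
    MPI_Comm_rank(comm, &rank);

    /* the receive buffer is significant only at the root */
    if (rank == root)
        rbuf = (int *) malloc(gsize * 100 * sizeof(int));

    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);

    if (rank == root)
        free(rbuf);
}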