30 September 1994
This work is dedicated to Jim Wilkinson whose ideas and spirit have given us inspiration and influenced the project at every turn.
The printed version of LAPACK Users' Guide, Second Edition will be available from SIAM in February 1995. The list price is $28.50 and the SIAM Member Price is $22.80. Contact SIAM for additional information.
The royalties from the sales of this book are being placed in a fund
to help students attend SIAM meetings and other SIAM related activities.
This fund is administered by SIAM and qualified individuals are encouraged to
write directly to SIAM for guidelines.
LAPACK has been designed to supersede LINPACK [26] and EISPACK [44] [70], principally by restructuring the software to achieve much greater efficiency, where possible, on modern high-performance computers; also by adding extra functionality, by using some new or improved algorithms, and by integrating the two sets of algorithms into a unified package.
Appendix D lists the LAPACK counterparts of LINPACK and EISPACK routines. Not all the facilities of LINPACK and EISPACK are covered by Release 2.0 of LAPACK.
The argument lists of all LAPACK routines conform to a single set of conventions for their design and documentation.
Specifications of all LAPACK driver and computational routines are given in Part 2. These are derived from the specifications given in the leading comments in the code, but in Part 2 the specifications for real and complex versions of each routine have been merged, in order to save space.
The documentation of each LAPACK routine includes:
Arguments of an LAPACK routine appear in the following order:
The style of the argument descriptions is illustrated by the following example:
The description of each argument gives:
Arguments specifying options are usually of type CHARACTER*1. The meaning of each valid value is given, as in this example:
The corresponding lower-case characters may be supplied (with the same meaning), but any other value is illegal (see subsection 5.1.8).
A longer character string can be passed as the actual argument, making the calling program more readable, but only the first character is significant; this is a standard feature of Fortran 77. For example:
CALL SPOTRS('upper', . . . )
It is permissible for the problem dimensions to be passed as zero, in which case the computation (or part of it) is skipped. Negative dimensions are regarded as erroneous.
Each two-dimensional array argument is immediately followed in the argument list by its leading dimension, whose name has the form LD<array-name>. For example:
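A representative declaration and call (the names and dimensions here are illustrative only, not taken from an actual LAPACK specification) might look like:

      REAL A( LDA, N ), B( LDB, NRHS )
      ...
      CALL SPOTRS( 'Upper', N, NRHS, A, LDA, B, LDB, INFO )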
It should be assumed, unless stated otherwise, that vectors and matrices are stored in one- and two-dimensional arrays in the conventional manner. That is, if an array X of dimension (N) holds a vector x, then X(i) holds x_i for i = 1,..., n. If a two-dimensional array A of dimension (LDA,N) holds an m-by-n matrix A, then A(i,j) holds a_ij for i = 1,..., m and j = 1,..., n (LDA must be at least m). See Section 5.3 for more about storage of matrices.
Note that array arguments are usually declared in the software as assumed-size arrays (last dimension *), for example:
REAL A( LDA, * )

although the documentation gives the dimensions as (LDA,N). The latter form is more informative since it specifies the required minimum value of the last dimension. However, an assumed-size array declaration has been used in the software, in order to overcome some limitations in the Fortran 77 standard. In particular it allows the routine to be called when the relevant dimension (N, in this case) is zero. However, actual array dimensions in the calling program must be at least 1 (LDA in this example).
Many LAPACK routines require one or more work arrays to be passed as arguments. The name of a work array is usually WORK - sometimes IWORK, RWORK or BWORK to distinguish work arrays of integer, real or logical (Boolean) type.
Occasionally the first element of a work array is used to return some useful information: in such cases, the argument is described as (workspace/output) instead of simply (workspace).
A number of routines implementing block algorithms require workspace sufficient to hold one block of rows or columns of the matrix, for example, workspace of size n-by-nb, where nb is the block size. In such cases, the actual declared length of the work array must be passed as a separate argument LWORK, which immediately follows WORK in the argument list.
See Section 5.2 for further explanation.
All documented routines have a diagnostic argument INFO that indicates the success or failure of the computation, as follows:
All driver and auxiliary routines check that input arguments such as N or LDA or option arguments of type character have permitted values. If an illegal value of the i-th argument is detected, the routine sets INFO = -i, and then calls an error-handling routine XERBLA.
The standard version of XERBLA issues an error message and halts execution, so that no LAPACK routine would ever return to the calling program with INFO < 0. However, this might occur if a non-standard version of XERBLA is used.
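As a sketch of the recommended usage (all declarations are assumed to be in place), a call to the simple driver SGESV might be followed by a check of INFO:

      CALL SGESV( N, NRHS, A, LDA, IPIV, B, LDB, INFO )
      IF( INFO.LT.0 ) THEN
*        the (-INFO)-th argument had an illegal value
      ELSE IF( INFO.GT.0 ) THEN
*        the computation failed; see the routine's documentation
      END IF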
LAPACK routines that implement block algorithms need to determine what block size to use. The intention behind the design of LAPACK is that the choice of block size should be hidden from users as much as possible, but at the same time easily accessible to installers of the package when tuning LAPACK for a particular machine.
LAPACK routines call an auxiliary enquiry function ILAENV , which returns the optimal block size to be used, as well as other parameters. The version of ILAENV supplied with the package contains default values that led to good behavior over a reasonable number of our test machines, but to achieve optimal performance, it may be beneficial to tune ILAENV for your particular machine environment. Ideally a distinct implementation of ILAENV is needed for each machine environment (see also Chapter 6). The optimal block size may also depend on the routine, the combination of option arguments (if any), and the problem dimensions.
If ILAENV returns a block size of 1, then the routine performs the unblocked algorithm, calling Level 2 BLAS, and makes no calls to Level 3 BLAS.
Some LAPACK routines require a work array whose size is proportional to the block size (see subsection 5.1.7). The actual length of the work array is supplied as an argument LWORK. The description of the arguments WORK and LWORK typically goes as follows:
The routine determines the block size to be used by the following steps:
The minimum value of LWORK that would be needed to use the optimal block size is returned in WORK(1).
Thus, the routine uses the largest block size allowed by the amount of workspace supplied, as long as this is likely to give better performance than the unblocked algorithm. The value returned in WORK(1) is not always given by a simple formula in terms of N and NB.
The specification of LWORK gives the minimum value for the routine to return correct results. If the supplied value is less than the minimum - indicating that there is insufficient workspace to perform the unblocked algorithm - the value of LWORK is regarded as an illegal value, and is treated like any other illegal argument value (see subsection 5.1.8).
If in doubt about how much workspace to supply, users should supply a generous amount (assume a block size of 64, say), and then examine the value of WORK(1) on exit.
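For example (a sketch only, with NMAX and a block size of 64 as assumed values), workspace for SGEQRF could be supplied and examined like this:

      INTEGER            NMAX, LWORK
      PARAMETER          ( NMAX = 1000, LWORK = 64*NMAX )
      REAL               WORK( LWORK )
      ...
      CALL SGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
*     WORK( 1 ) now holds the minimum LWORK for the optimal block size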
LAPACK routines are written so that as much as possible of the computation is performed by calls to the Basic Linear Algebra Subprograms (BLAS) [28] [30] [58] . Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable LAPACK routines to achieve high performance with portable code. The methodology for constructing LAPACK routines in terms of calls to the BLAS is described in Chapter 3.
The BLAS are not strictly speaking part of LAPACK, but Fortran 77 code for the BLAS is distributed with LAPACK, or can be obtained separately from netlib (see below). This code constitutes the ``model implementation'' [27] [29].
The model implementation is not expected to perform as well as a specially tuned implementation on most high-performance computers - on some machines it may give much worse performance - but it allows users to run LAPACK codes on machines that do not offer any other implementation of the BLAS.
LAPACK allows the following different storage schemes for matrices:
These storage schemes are compatible with those used in LINPACK and the BLAS, but EISPACK uses incompatible schemes for band and tridiagonal matrices.
In the examples below, * indicates an array element that need not be set and is not referenced by LAPACK routines. Elements that ``need not be set'' are never read, written to, or otherwise accessed by the LAPACK routines. The examples illustrate only the relevant part of the arrays; array arguments may of course have additional rows or columns, according to the usual rules for passing array arguments in Fortran 77.
The default scheme for storing matrices is the obvious one described in subsection 5.1.6: a matrix A is stored in a two-dimensional array A, with matrix element a_ij stored in array element A(i,j).
If a matrix is triangular (upper or lower, as specified by the argument UPLO), only the elements of the relevant triangle are accessed. The remaining elements of the array need not be set. Such elements are indicated by * in the examples below. For example, when n = 4:
Similarly, if the matrix is upper Hessenberg, elements below the first subdiagonal need not be set.
Routines that handle symmetric or Hermitian matrices allow for either the upper or lower triangle of the matrix (as specified by UPLO) to be stored in the corresponding elements of the array; the remaining elements of the array need not be set. For example, when n = 4:
Symmetric, Hermitian or triangular matrices may be stored more compactly, if the relevant triangle (again as specified by UPLO) is packed by columns in a one-dimensional array. In LAPACK, arrays that hold matrices in packed storage have names ending in `P'. So:
For example:
Note that for real or complex symmetric matrices, packing the upper triangle by columns is equivalent to packing the lower triangle by rows; packing the lower triangle by columns is equivalent to packing the upper triangle by rows. For complex Hermitian matrices, packing the upper triangle by columns is equivalent to packing the conjugate of the lower triangle by rows; packing the lower triangle by columns is equivalent to packing the conjugate of the upper triangle by rows.
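In terms of array indices (a minimal sketch, with A assumed to hold the full matrix and AP the packed form): the upper triangle packed by columns places a_ij, for i <= j, in AP(i + (j-1)*j/2), so the packing loop can be written as

      DO 20 J = 1, N
         DO 10 I = 1, J
            AP( I + (J-1)*J/2 ) = A( I, J )
   10    CONTINUE
   20 CONTINUE

For the lower triangle packed by columns, the corresponding position is AP(i + (j-1)*(2*n-j)/2) for j <= i <= n.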
An m-by-n band matrix with kl subdiagonals and ku superdiagonals may be stored compactly in a two-dimensional array with kl+ku+1 rows and n columns. Columns of the matrix are stored in corresponding columns of the array, and diagonals of the matrix are stored in rows of the array. This storage scheme should be used in practice only if kl, ku << min(m,n), although LAPACK routines work correctly for all values of kl and ku. In LAPACK, arrays that hold matrices in band storage have names ending in `B'.
To be precise, a_ij is stored in AB(ku+1+i-j, j) for max(1, j-ku) <= i <= min(m, j+kl). For example, when m = n = 5, kl = 2 and ku = 1:
The elements marked * in the upper left and lower right corners of the array AB need not be set, and are not referenced by LAPACK routines.
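As an illustration of the indexing (a sketch only; the full matrix A and the values M, N, KL and KU are assumed to be available), a general matrix can be copied into the band storage array AB as follows:

      DO 20 J = 1, N
         DO 10 I = MAX( 1, J-KU ), MIN( M, J+KL )
            AB( KU+1+I-J, J ) = A( I, J )
   10    CONTINUE
   20 CONTINUE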
Note: when a band matrix is supplied for LU factorization, space must be allowed to store kl additional superdiagonals, generated by fill-in as a result of row interchanges. This means that the matrix is stored according to the above scheme, but with kl + ku superdiagonals.
Triangular band matrices are stored in the same format, with either kl = 0 if upper triangular, or ku = 0 if lower triangular.
For symmetric or Hermitian band matrices with kd subdiagonals or superdiagonals, only the upper or lower triangle (as specified by UPLO) need be stored:
For example, when n = 5 and kd = 2:
EISPACK routines use a different storage scheme for band matrices, in which rows of the matrix are stored in corresponding rows of the array, and diagonals of the matrix are stored in columns of the array (see Appendix D).
An unsymmetric tridiagonal matrix of order n is stored in three one-dimensional arrays, one of length n containing the diagonal elements, and two of length n - 1 containing the subdiagonal and superdiagonal elements in elements 1 : n - 1.
A symmetric tridiagonal or bidiagonal matrix is stored in two one-dimensional arrays, one of length n containing the diagonal elements, and one of length n containing the off-diagonal elements. (EISPACK routines store the off-diagonal elements in elements 2 : n of a vector of length n.)
Some LAPACK routines have an option to handle unit triangular matrices (that is, triangular matrices with diagonal elements = 1). This option is specified by an argument DIAG. If DIAG = 'U' (Unit triangular), the diagonal elements of the matrix need not be stored, and the corresponding array elements are not referenced by the LAPACK routines. The storage scheme for the rest of the matrix (whether conventional, packed or band) remains unchanged, as described in subsections 5.3.1, 5.3.2 and 5.3.3.
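For example (a sketch only; the array A is assumed to hold an upper triangular matrix whose unit diagonal is not stored), such a matrix can be used to solve triangular systems with:

      CALL STRTRS( 'Upper', 'No transpose', 'Unit', N, NRHS, A, LDA,
     $             B, LDB, INFO )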
Complex Hermitian matrices have diagonal elements that are by definition purely real. In addition, some complex triangular matrices computed by LAPACK routines are defined by the algorithm to have real diagonal elements - in Cholesky or QR factorization, for example.
If such matrices are supplied as input to LAPACK routines, the imaginary parts of the diagonal elements are not referenced, but are assumed to be zero. If such matrices are returned as output by LAPACK routines, the computed imaginary parts are explicitly set to zero.
A real orthogonal or complex unitary matrix (usually denoted Q) is often represented in LAPACK as a product of elementary reflectors - also referred to as elementary Householder matrices (usually denoted H_i). For example, Q = H_1 H_2 ... H_k.
Most users need not be aware of the details, because LAPACK routines are provided to work with this representation:
The following further details may occasionally be useful.
An elementary reflector (or elementary Householder matrix) H of order n is a unitary matrix of the form

H = I - tau*v*v^H                                              (5.1)

where tau is a scalar, and v is an n-vector, with |tau|^2 * ||v||_2^2 = 2*Re(tau); v is often referred to as the Householder vector. Often v has several leading or trailing zero elements, but for the purpose of this discussion assume that H has no such special structure.
There is some redundancy in the representation (5.1), which can be removed in various ways. The representation used in LAPACK (which differs from those used in LINPACK or EISPACK) sets v(1) = 1; hence v(1) need not be stored. In real arithmetic, 1 <= tau <= 2, except that tau = 0 implies H = I.
In complex arithmetic, tau may be complex, and satisfies 1 <= Re(tau) <= 2 and |tau - 1| <= 1. Thus a complex H is not Hermitian (as it is in other representations), but it is unitary, which is the important property. The advantage of allowing tau to be complex is that, given an arbitrary complex vector x, H can be computed so that

H^H * x = beta * (1, 0, ..., 0)^T

with real beta. This is useful, for example, when reducing a complex Hermitian matrix to real symmetric tridiagonal form, or a complex rectangular matrix to real bidiagonal form.
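To make the representation concrete, the following sketch (assuming real data, with the Householder vector held in V, V(1) = 1, and the scalar in TAU) applies H = I - TAU*v*v^T to a vector held in X, using Level 1 BLAS:

      REAL               SDOT, TEMP
      EXTERNAL           SDOT
      ...
      TEMP = SDOT( N, V, 1, X, 1 )
      CALL SAXPY( N, -TAU*TEMP, V, 1, X, 1 )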
For further details, see Lehoucq [59].
For anyone who obtains the complete LAPACK package from netlib or NAG (see Chapter 1), a comprehensive installation guide is provided. We recommend installation of the complete package as the most convenient and reliable way to make LAPACK available.
People who obtain copies of a few LAPACK routines from netlib need to be aware of the following points:
Some compilers provide DOUBLE COMPLEX as an alternative to COMPLEX*16, and an intrinsic function DREAL instead of DBLE to return the real part of a COMPLEX*16 argument. If the compiler does not accept the constructs used in LAPACK, the installer will have to modify the code: for example, globally change COMPLEX*16 to DOUBLE COMPLEX, or selectively change DBLE to DREAL.
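As a sketch of the kind of change involved (ZDUM and T are hypothetical names), the modification might look like this:

*     as distributed:
      COMPLEX*16         ZDUM
      T = DBLE( ZDUM )

*     after conversion for a compiler that lacks these constructs:
      DOUBLE COMPLEX     ZDUM
      T = DREAL( ZDUM )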
This Users' Guide gives an informal introduction to the design of the package, and a detailed description of its contents. Chapter 5 explains the conventions used in the software and documentation. Part 2 contains complete specifications of all the driver routines and computational routines. These specifications have been derived from the leading comments in the source text.
On-line manpages (troff files) for LAPACK routines, as well as for most of the BLAS routines, are available on netlib. These files are automatically generated at the time of each release. For more information, see the manpages.tar.z entry on the lapack index on netlib.
Machine-dependent parameters such as the block size are set by calls to an inquiry function, which may be set up to return different values on each machine. The declaration of the environment inquiry function is
INTEGER FUNCTION ILAENV( ISPEC, NAME, OPTS, N1, N2, N3, N4 )

where ISPEC, N1, N2, N3, and N4 are integer variables and NAME and OPTS are CHARACTER*(*). NAME specifies the subroutine name; OPTS is a character string of options to the subroutine; and N1-N4 are the problem dimensions. ISPEC specifies the parameter to be returned; the following values are currently used in LAPACK:
ISPEC = 1: NB, optimal block size
      = 2: NBMIN, minimum block size for the block routine to be used
      = 3: NX, crossover point (in a block routine, for N < NX, an unblocked routine should be used)
      = 4: NS, number of shifts
      = 6: NXSVD, the threshold point for which the QR factorization is performed prior to reduction to bidiagonal form; if M > NXSVD * N, then a QR factorization is performed
      = 8: MAXB, crossover point for block multishift QR
The three block size parameters, NB, NBMIN, and NX, are used in many different subroutines (see Table 6.1). NS and MAXB are used in the block multishift QR algorithm, xHSEQR. NXSVD is used in the driver routines xGELSS and xGESVD.
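For example (a sketch of a typical call; unused problem dimensions are passed as -1), the optimal block size for SGETRF can be obtained by:

      INTEGER            ILAENV, NB
      EXTERNAL           ILAENV
      ...
      NB = ILAENV( 1, 'SGETRF', ' ', M, N, -1, -1 )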
Table 6.1: Use of the block parameters NB, NBMIN, and NX in LAPACK
The LAPACK testing and timing programs use a special version of ILAENV where the parameters are set via a COMMON block interface. This is convenient for experimenting with different values of, say, the block size in order to exercise different parts of the code and to compare the relative performance of different parameter values.
The LAPACK timing programs were designed to collect data for all the routines in Table 6.1. The range of problem sizes needed to determine the optimal block size or crossover point is machine-dependent, but the input files provided with the LAPACK test and timing package can be used as a starting point. For subroutines that require a crossover point, it is best to start by finding the best block size with the crossover point set to 0, and then to locate the point at which the performance of the unblocked algorithm is beaten by the block algorithm. The best crossover point will be somewhat smaller than the point where the curves for the unblocked and blocked methods cross.
For example, for SGEQRF on a single processor of a CRAY-2, NB = 32 was observed to be a good block size, and the performance of the block algorithm with this block size surpasses the unblocked algorithm for square matrices between N = 176 and N = 192. Experiments with crossover points from 64 to 192 found that NX = 128 was a good choice, although the results for NX from 3*NB to 5*NB are broadly similar. This means that matrices with N <= 128 should use the unblocked algorithm, and for N > 128 block updates should be used until the remaining submatrix has order less than 128. The performance of the unblocked (NB = 1) and blocked (NB = 32) algorithms for SGEQRF and for the blocked algorithm with a crossover point of 128 are compared in Figure 6.1.
Figure 6.1: QR factorization on CRAY-2 (1 processor)
By experimenting with small values of the block size, it should be straightforward to choose NBMIN, the smallest block size that gives a performance improvement over the unblocked algorithm. Note that on some machines, the optimal block size may be 1 (the unblocked algorithm gives the best performance); in this case, the choice of NBMIN is arbitrary. The prototype version of ILAENV sets NBMIN to 2, so that blocking is always done, even though this could lead to poor performance from a block routine if insufficient workspace is supplied (see chapter 7).
Complicating the determination of optimal parameters is the fact that
the orthogonal factorization routines and SGEBRD
accept non-square
matrices as input.
The LAPACK timing program allows M and N to be varied independently.
We have found the optimal block size to be
generally insensitive to the shape of the matrix,
but the crossover point is more dependent on the matrix shape.
For example, if
M >> N in the QR factorization, block updates
may always be faster than unblocked updates on the remaining submatrix,
so one might set NX = NB if M >= 2N.
Parameter values for the number of shifts, etc. used to tune the block multishift QR algorithm can be varied from the input files to the eigenvalue timing program. The performance of xHSEQR, in particular, is sensitive to the correct choice of block parameters. Setting NS = 2 will give essentially the same performance as EISPACK. Interested users should consult [3] for a description of the timing program input files.
For the benefit of less experienced programmers, we give here a list of common programming errors in calling an LAPACK routine. These errors may cause the LAPACK routine to report a failure, as described in Section 7.2 ; they may cause an error to be reported by the system; or they may lead to wrong results - see also Section 7.3.
Some modern compilation systems, as well as software tools such as the portability checker in Toolpack [66], can check that arguments agree in number and type; and many compilation systems offer run-time detection of errors such as an array element out-of-bounds or use of an unassigned variable.
There are two ways in which an LAPACK routine may report a failure to complete a computation successfully.
If an illegal value is supplied for one of the input arguments to an LAPACK routine, it will call the error handler XERBLA to write a message to the standard output unit of the form:
** On entry to SGESV parameter number 4 had an illegal value

This particular message would be caused by passing to SGESV a value of LDA which was less than the value of the argument N. The documentation for SGESV in Part 2 states the set of acceptable input values: ``LDA >= max(1,N).'' This is required in order that the array A with leading dimension LDA can store an n-by-n matrix. The arguments are checked in order, beginning with the first. In the above example, it may - from the user's point of view - be the value of N which is in fact wrong. Invalid arguments are often caused by the kind of error listed in Section 7.1.
In the model implementation of XERBLA which is supplied with LAPACK, execution stops after the message; but the call to XERBLA is followed by a RETURN statement in the LAPACK routine, so that if the installer removes the STOP statement in XERBLA, the result will be an immediate exit from the LAPACK routine with a negative value of INFO. It is good practice always to check for a non-zero value of INFO on return from an LAPACK routine. (We recommend however that XERBLA should not be modified to return control to the calling routine, unless absolutely necessary, since this would remove one of the built-in safety-features of LAPACK.)
A positive value of INFO on return from an LAPACK routine indicates a failure in the course of the algorithm. Common causes are:
When a failure with INFO > 0 occurs, control is always returned to the calling program; XERBLA is not called, and no error message is written. It is worth repeating that it is good practice always to check for a non-zero value of INFO on return from an LAPACK routine.
A failure with INFO > 0 may indicate any of the following:
Wrong results from LAPACK routines are most often caused by incorrect usage.
It is also possible that wrong results are caused by a bug outside of LAPACK, in the compiler or in one of the library routines, such as the BLAS, that are linked with LAPACK. Test procedures are available for both LAPACK and the BLAS, and the LAPACK installation guide [3] should be consulted for descriptions of the tests and for advice on resolving problems.
A list of known problems, compiler errors, and bugs in LAPACK routines is maintained on netlib; see Chapter 1.
Users who suspect they have found a new bug in an LAPACK routine are encouraged to report it promptly to the developers as directed in Chapter 1. The bug report should include a test case, a description of the problem and expected results, and the actions, if any, that the user has already taken to fix the bug.
We have tried to make the performance of LAPACK ``transportable'' by performing most of the computation within the Level 1, 2, and 3 BLAS, and by isolating all of the machine-dependent tuning parameters in a single integer function ILAENV. To avoid poor performance from LAPACK routines, note the following recommendations:
XXXXXX = SLAMCH( 'P' )

or in double precision:

XXXXXX = DLAMCH( 'P' )

A cleaner but less portable solution is for the installer to save the values computed by xLAMCH for a specific machine and create a new version of xLAMCH with these constants set in DATA statements, taking care that no accuracy is lost in the translation.
The complete LAPACK package or individual routines from LAPACK are most easily obtained through netlib [32]. At the time of this writing, the e-mail addresses for netlib are
netlib@ornl.gov
netlib@research.att.com

Both repositories provide electronic mail and anonymous ftp service (the netlib@ornl.gov site is available via anonymous ftp to netlib2.cs.utk.edu), and the netlib@ornl.gov site additionally provides xnetlib. Xnetlib uses an X Windows graphical user interface and a socket-based connection between the user's machine and the xnetlib server machine to process software requests. For more information on xnetlib, echo ``send index from xnetlib'' | mail netlib@ornl.gov.
General information about LAPACK can be obtained by sending mail to one of the above addresses with the message
send index from lapack
The package is also available on the World Wide Web. It can be accessed through the URL address:
http://www.netlib.org/lapack/index.html
The complete package, including test code and timing programs in four different Fortran data types, constitutes some 735,000 lines of Fortran source and comments.
Alternatively, if a user does not have internet access, the complete package can be obtained on magnetic media from NAG for a cost-covering handling charge.
For further details contact NAG at one of the following addresses:
NAG Inc.
1400 Opus Place, Suite 200
Downers Grove, IL 60515-5702
USA
Tel: +1 708 971 2337
Fax: +1 708 971 2706

NAG Ltd.
Wilkinson House
Jordan Hill Road
Oxford OX2 8DR
England
Tel: +44 865 511245
Fax: +44 865 310139

NAG GmbH
Schleissheimerstrasse 5
W-8046 Garching bei Munchen
Germany
Tel: +49 89 3207395
Fax: +49 89 3207396
Level 1 BLAS
SUBROUTINE _ROTG ( A, B, C, S )                              S, D
SUBROUTINE _ROTMG( D1, D2, A, B, PARAM )                     S, D
SUBROUTINE _ROT  ( N, X, INCX, Y, INCY, C, S )               S, D
SUBROUTINE _ROTM ( N, X, INCX, Y, INCY, PARAM )              S, D
SUBROUTINE _SWAP ( N, X, INCX, Y, INCY )                     S, D, C, Z
SUBROUTINE _SCAL ( N, ALPHA, X, INCX )                       S, D, C, Z, CS, ZD
SUBROUTINE _COPY ( N, X, INCX, Y, INCY )                     S, D, C, Z
SUBROUTINE _AXPY ( N, ALPHA, X, INCX, Y, INCY )              S, D, C, Z
FUNCTION   _DOT  ( N, X, INCX, Y, INCY )                     S, D, DS
FUNCTION   _DOTU ( N, X, INCX, Y, INCY )                     C, Z
FUNCTION   _DOTC ( N, X, INCX, Y, INCY )                     C, Z
FUNCTION   __DOT ( N, ALPHA, X, INCX, Y, INCY )              SDS
FUNCTION   _NRM2 ( N, X, INCX )                              S, D, SC, DZ
FUNCTION   _ASUM ( N, X, INCX )                              S, D, SC, DZ
FUNCTION   I_AMAX( N, X, INCX )                              S, D, C, Z
Level 2 BLAS
_GEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )          S, D, C, Z
_GBMV ( TRANS, M, N, KL, KU, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )  S, D, C, Z
_HEMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )              C, Z
_HBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )           C, Z
_HPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )                  C, Z
_SYMV ( UPLO, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )              S, D
_SBMV ( UPLO, N, K, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )           S, D
_SPMV ( UPLO, N, ALPHA, AP, X, INCX, BETA, Y, INCY )                  S, D
_TRMV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )                       S, D, C, Z
_TBMV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )                    S, D, C, Z
_TPMV ( UPLO, TRANS, DIAG, N, AP, X, INCX )                           S, D, C, Z
_TRSV ( UPLO, TRANS, DIAG, N, A, LDA, X, INCX )                       S, D, C, Z
_TBSV ( UPLO, TRANS, DIAG, N, K, A, LDA, X, INCX )                    S, D, C, Z
_TPSV ( UPLO, TRANS, DIAG, N, AP, X, INCX )                           S, D, C, Z

_GER  ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       S, D
_GERU ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       C, Z
_GERC ( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )                       C, Z
_HER  ( UPLO, N, ALPHA, X, INCX, A, LDA )                             C, Z
_HPR  ( UPLO, N, ALPHA, X, INCX, AP )                                 C, Z
_HER2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )                    C, Z
_HPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )                        C, Z
_SYR  ( UPLO, N, ALPHA, X, INCX, A, LDA )                             S, D
_SPR  ( UPLO, N, ALPHA, X, INCX, AP )                                 S, D
_SYR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, A, LDA )                    S, D
_SPR2 ( UPLO, N, ALPHA, X, INCX, Y, INCY, AP )                        S, D
Level 3 BLAS
_GEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )  S, D, C, Z
_SYMM ( SIDE, UPLO, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC )         S, D, C, Z
_HEMM ( SIDE, UPLO, M, N, ALPHA, A, LDA, B, LDB, BETA, C, LDC )         C, Z
_SYRK ( UPLO, TRANS, N, K, ALPHA, A, LDA, BETA, C, LDC )                S, D, C, Z
_HERK ( UPLO, TRANS, N, K, ALPHA, A, LDA, BETA, C, LDC )                C, Z
_SYR2K( UPLO, TRANS, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )        S, D, C, Z
_HER2K( UPLO, TRANS, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )        C, Z
_TRMM ( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA, B, LDB )         S, D, C, Z
_TRSM ( SIDE, UPLO, TRANSA, DIAG, M, N, ALPHA, A, LDA, B, LDB )         S, D, C, Z
Notes
Meaning of prefixes
S - REAL
D - DOUBLE PRECISION
C - COMPLEX
Z - COMPLEX*16 (this may not be supported by all machines)
For the Level 2 BLAS a set of extended-precision routines with the prefixes ES, ED, EC, EZ may also be available.
Level 1 BLAS
In addition to the listed routines there are two further extended-precision dot product routines DQDOTI and DQDOTA.
Level 2 and Level 3 BLAS
Matrix types
GE - GEneral           GB - General Band
SY - SYmmetric         SB - Symmetric Band      SP - Symmetric Packed
HE - HErmitian         HB - Hermitian Band      HP - Hermitian Packed
TR - TRiangular        TB - Triangular Band     TP - Triangular Packed
Options
Arguments describing options are declared as CHARACTER*1 and may be passed as character strings.
TRANS = 'No transpose', 'Transpose', 'Conjugate transpose' (X, X^T, X^C)
UPLO  = 'Upper triangular', 'Lower triangular'
DIAG  = 'Non-unit triangular', 'Unit triangular'
SIDE  = 'Left', 'Right' (A or op(A) on the left, or A or op(A) on the right)
For real matrices, TRANS = `T' and TRANS = `C' have the same meaning.
For Hermitian matrices, TRANS = `T' is not allowed.
For complex symmetric matrices, TRANS = `H' is not allowed.
This appendix is designed to assist people in converting programs that currently call LINPACK or EISPACK routines to call LAPACK routines instead.
LAPACK equivalents of LINPACK routines for real matrices

LINPACK  LAPACK             Function of LINPACK routine
SCHDC                       Cholesky factorization with diagonal pivoting option
SCHDD                       rank-1 downdate of a Cholesky factorization or the
                            triangular factor of a QR factorization
SCHEX                       rank-1 update of a Cholesky factorization or the
                            triangular factor of a QR factorization
SCHUD                       modifies a Cholesky factorization under permutations
                            of the original matrix
SGBCO    SLANGB, SGBTRF,    LU factorization and condition estimation of a
         SGBCON             general band matrix
SGBDI                       determinant of a general band matrix, after
                            factorization by SGBCO or SGBFA
SGBFA    SGBTRF             LU factorization of a general band matrix
SGBSL    SGBTRS             solves a general band system of linear equations,
                            after factorization by SGBCO or SGBFA
SGECO    SLANGE, SGETRF,    LU factorization and condition estimation of a
         SGECON             general matrix
SGEDI    SGETRI             determinant and inverse of a general matrix,
                            after factorization by SGECO or SGEFA
SGEFA    SGETRF             LU factorization of a general matrix
SGESL    SGETRS             solves a general system of linear equations,
                            after factorization by SGECO or SGEFA
SGTSL    SGTSV              solves a general tridiagonal system of linear equations
SPBCO    SLANSB, SPBTRF,    Cholesky factorization and condition estimation of a
         SPBCON             symmetric positive definite band matrix
SPBDI                       determinant of a symmetric positive definite band matrix,
                            after factorization by SPBCO or SPBFA
SPBFA    SPBTRF             Cholesky factorization of a symmetric positive definite
                            band matrix
SPBSL    SPBTRS             solves a symmetric positive definite band system of linear
                            equations, after factorization by SPBCO or SPBFA
SPOCO    SLANSY, SPOTRF,    Cholesky factorization and condition estimation of a
         SPOCON             symmetric positive definite matrix
SPODI    SPOTRI             determinant and inverse of a symmetric positive definite
                            matrix, after factorization by SPOCO or SPOFA
SPOFA    SPOTRF             Cholesky factorization of a symmetric positive definite
                            matrix
SPOSL    SPOTRS             solves a symmetric positive definite system of linear
                            equations, after factorization by SPOCO or SPOFA
SPPCO    SLANSY, SPPTRF,    Cholesky factorization and condition estimation of a
         SPPCON             symmetric positive definite matrix (packed storage)
SPPDI    SPPTRI             determinant and inverse of a symmetric positive definite
                            matrix, after factorization by SPPCO or SPPFA (packed storage)
SPPFA    SPPTRF             Cholesky factorization of a symmetric positive definite
                            matrix (packed storage)
SPPSL    SPPTRS             solves a symmetric positive definite system of linear
                            equations, after factorization by SPPCO or SPPFA (packed storage)
SPTSL    SPTSV              solves a symmetric positive definite tridiagonal system
                            of linear equations
SQRDC    SGEQPF or SGEQRF   QR factorization with optional column pivoting
SQRSL    SORMQR, STRSV      solves linear least squares problems after factorization
                            by SQRDC
SSICO    SLANSY, SSYTRF,    symmetric indefinite factorization and condition
         SSYCON             estimation of a symmetric indefinite matrix
SSIDI    SSYTRI             determinant, inertia and inverse of a symmetric indefinite
                            matrix, after factorization by SSICO or SSIFA
SSIFA    SSYTRF             symmetric indefinite factorization of a symmetric
                            indefinite matrix
SSISL    SSYTRS             solves a symmetric indefinite system of linear equations,
                            after factorization by SSICO or SSIFA
SSPCO    SLANSP, SSPTRF,    symmetric indefinite factorization and condition estimation
         SSPCON             of a symmetric indefinite matrix (packed storage)
SSPDI    SSPTRI             determinant, inertia and inverse of a symmetric indefinite
                            matrix, after factorization by SSPCO or SSPFA (packed storage)
SSPFA    SSPTRF             symmetric indefinite factorization of a symmetric indefinite
                            matrix (packed storage)
SSPSL    SSPTRS             solves a symmetric indefinite system of linear equations,
                            after factorization by SSPCO or SSPFA (packed storage)
SSVDC    SGESVD             all or part of the singular value decomposition of a
                            general matrix
STRCO    STRCON             condition estimation of a triangular matrix
STRDI    STRTRI             determinant and inverse of a triangular matrix
STRSL    STRTRS             solves a triangular system of linear equations
Most of these working notes are available from netlib, where they can only be obtained in postscript form. To receive a list of available postscript reports, send email to netlib@ornl.gov of the form: send index from lapack/lawns
A Quick Installation Guide (LAPACK Working Note 81) [35] is distributed with the complete package. This Quick Installation Guide provides installation instructions for Unix Systems. A comprehensive Installation Guide [3] (LAPACK Working Note 41), which contains descriptions of the testing and timings programs, as well as detailed non-Unix installation instructions, is also available. See also Chapter 6.
LAPACK Users' Guide
Release 2.0
LAPACK has been thoroughly tested before release, on many different types of computers. The LAPACK project supports the package in the sense that reports of errors or poor performance will gain immediate attention from the developers. Such reports - and also descriptions of interesting applications and other comments - should be sent to:
LAPACK Project
c/o J. J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, TN 37996-1301
USA
Email: lapack@cs.utk.edu
A list of known problems, bugs, and compiler errors for LAPACK, as well as an errata list for this guide, is maintained on netlib. For a copy of this report, send email to netlib of the form:
send release_notes from lapack
As previously mentioned in the Preface, many LAPACK-related software projects are currently available on netlib. In the context of this users' guide, several of these projects require further discussion - LAPACK++, CLAPACK, ScaLAPACK, and LAPACK routines exploiting IEEE arithmetic.
LAPACK++ is an object-oriented C++ extension to the LAPACK library. Traditionally, linear algebra libraries have been available only in Fortran. However, with an increasing number of programmers using C and C++ for scientific software development, there is a need for high-quality numerical libraries to support these platforms as well. LAPACK++ provides speed and efficiency competitive with native Fortran codes, while allowing programmers to capitalize on the software engineering benefits of object-oriented programming.
LAPACK++ supports various matrix classes for vectors, non-symmetric matrices, symmetric positive definite matrices, symmetric matrices, and banded, triangular, and tridiagonal matrices; however, the current version does not include all of the capabilities of the original Fortran 77 LAPACK. Emphasis is given to routines for solving linear systems with nonsymmetric matrices, solving symmetric positive definite systems, and solving linear least squares problems. Future versions of LAPACK++ will support eigenvalue problems and singular value decompositions, as well as distributed matrix classes for parallel computer architectures. For a more detailed description of the design of LAPACK++, please see [36]. This paper, as well as an installation manual and users' guide, is available on netlib. To obtain this software or documentation send a message to netlib@ornl.gov of the form:
send index from c++/lapack++

Questions and comments about LAPACK++ can be directed to lapackpp@cs.utk.edu.
The CLAPACK library was built using a Fortran to C conversion utility called f2c [40]. The entire Fortran 77 LAPACK library is run through f2c to obtain C code, and then modified to improve readability. CLAPACK's goal is to provide LAPACK for someone who does not have access to a Fortran compiler.
However, f2c is designed to create C code that is still callable from Fortran, so all arguments must be passed using Fortran calling conventions and data structures. This requirement has several repercussions. The first is that since many compilers require distinct Fortran and C routine namespaces, an underscore (_) is appended to C routine names which will be called from Fortran. Therefore, f2c has added this underscore to all the names in CLAPACK. So, a call that in Fortran would look like:
call dgetrf(...)

becomes in C:
dgetrf_(...);

Second, the user must pass ALL arguments by reference, i.e. as pointers, since this is how Fortran works. This includes all scalar arguments like M and N. This restriction means that you cannot make a call with numbers directly in the parameter sequence. For example, consider the LU factorization of a 5-by-5 matrix. If the matrix to be factored is called A, the Fortran call
call dgetrf(5, 5, A, 5, ipiv, info)

becomes in C:
M = N = LDA = 5; dgetrf_(&M, &N, A, &LDA, ipiv, &info);
Some LAPACK routines take character string arguments. In all but the testing and timing code, only the first character of the string is significant. Therefore, the CLAPACK driver, computational, and auxiliary routines expect only single character arguments. For example, the Fortran call
call dpotrf( 'Upper', n, a, lda, info )

becomes in C:
char s = 'U'; dpotrf_(&s, &n, a, &lda, &info);
In a future release we hope to provide ``wrapper'' routines that will remove the need for these unnecessary pointers, and automatically allocate (``malloc'') any workspace that is required.
As a final point, we must stress that there is a difference in the definition of a two-dimensional array in Fortran and C. A two-dimensional Fortran array declared as
DOUBLE PRECISION A(LDA, N)

is a contiguous piece of LDA x N double-words of memory, stored in column-major order: elements in a column are contiguous, and elements within a row are separated by a stride of LDA double-words.
In C, however, a two-dimensional array is in row-major order. Further, the rows of a two-dimensional C array need not be contiguous. The array
double A[LDA][N];

is stored contiguously in row-major order: elements within a row are adjacent in memory, and elements within a column are separated by a stride of N double-words. Passing such a two-dimensional C array to a CLAPACK routine, which expects Fortran column-major storage, will almost surely give erroneous results.
Instead, you must use a one-dimensional C array of size LDA x N double-words (or else malloc the same amount of space). We recommend using the following code to get the array CLAPACK will be expecting:
double *A;
A = malloc( LDA*N*sizeof(double) );

Note that for best memory utilization, you would set LDA = M, the actual number of rows of A. If you now wish to operate on the matrix A, remember that A is in column-major order. As an example of accessing Fortran-style arrays in C, the following code fragments show how to initialize the array A declared above so that all of column j has the value j:
double *ptr;
ptr = A;
for (j = 0; j < N; j++) {
   for (i = 0; i < M; i++)
      *ptr++ = j;
   ptr += (LDA - M);
}

or, you can use:
for (j = 0; j < N; j++) {
   for (i = 0; i < M; i++)
      A[j*LDA+i] = j;
}

Note that the loop over the row index i is the inner loop, since column entries are contiguous.
The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. It assumes matrices are laid out in a two-dimensional block cyclic decomposition. The goal is to have ScaLAPACK routines resemble their LAPACK equivalents as much as possible. Just as LAPACK is built on top of the BLAS, ScaLAPACK relies on the PBLAS (Parallel Basic Linear Algebra Subprograms) and the BLACS (Basic Linear Algebra Communication Subprograms). The PBLAS perform computations analogous to the BLAS but on matrices distributed across multiple processors. The PBLAS rely on the communication protocols of the BLACS. The BLACS are designed for linear algebra applications and provide portable communication across a wide variety of distributed-memory architectures. At the present time, they are available for the Intel Gamma, Delta, and Paragon, Thinking Machines CM-5, IBM SPs, and PVM. They will soon be available for the CRAY T3D. For more information:
echo ``send index from scalapack'' | mail netlib@ornl.gov

All questions/comments can be directed to scalapack@cs.utk.edu.
We have also explored the advantages of IEEE arithmetic in implementing linear algebra routines. For example, the accurate rounding properties of IEEE arithmetic permit high precision arithmetic to be simulated economically in short stretches of code, thereby replacing possibly much more complicated low precision algorithms. Second, the ``friendly'' exception handling capabilities of IEEE arithmetic, such as being able to continue computing past an overflow and to ask later whether an overflow occurred, permit us to use simple, fast algorithms which work almost all the time, and revert to slower, safer algorithms only if the fast algorithm fails. See [23] for more details.
However, the continuing importance of machines implementing Cray arithmetic, the existence of some machines that only implement full IEEE exception handling by slowing down all floating point operations significantly, and the lack of portable ways to refer to exceptions in Fortran or C have led us not to include these improved algorithms in this release of LAPACK. Since Cray has announced plans to convert to IEEE arithmetic, and some progress is being made on standardizing exception handling [65], we do expect to make these routines available in a future release.
The subroutines in LAPACK are classified as follows:
Both driver routines and computational routines are fully described in this Users' Guide, but not the auxiliary routines. A list of the auxiliary routines, with brief descriptions of their functions, is given in Appendix B.
LAPACK provides the same range of functionality for real and complex data.
For most computations there are matching routines, one for real and one for complex data, but there are a few exceptions. For example, corresponding to the routines for real symmetric indefinite systems of linear equations, there are routines for complex Hermitian and complex symmetric systems, because both types of complex systems occur in practical applications. However, there is no complex analogue of the routine for finding selected eigenvalues of a real symmetric tridiagonal matrix, because a complex Hermitian matrix can always be reduced to a real symmetric tridiagonal matrix.
Matching routines for real and complex data have been coded to maintain a close correspondence between the two, wherever possible. However, in some areas (especially the nonsymmetric eigenproblem) the correspondence is necessarily weaker.
All routines in LAPACK are provided in both single and double precision versions. The double precision versions have been generated automatically, using Toolpack/1 [66].
Double precision routines for complex matrices require the non-standard Fortran data type COMPLEX*16, which is available on most machines where double precision computation is usual.
The name of each LAPACK routine is a coded specification of its function (within the very tight limits of standard Fortran 77 6-character names).
All driver and computational routines have names of the form XYYZZZ, where for some driver routines the 6th character is blank.
The first letter, X, indicates the data type as follows:
S   REAL
D   DOUBLE PRECISION
C   COMPLEX
Z   COMPLEX*16 or DOUBLE COMPLEX
When we wish to refer to an LAPACK routine generically, regardless of data type, we replace the first letter by ``x''. Thus xGESV refers to any or all of the routines SGESV, CGESV, DGESV and ZGESV.
The next two letters, YY, indicate the type of matrix (or of the most significant matrix). Most of these two-letter codes apply to both real and complex matrices; a few apply specifically to one or the other, as indicated in Table 2.1.
BD  bidiagonal
GB  general band
GE  general (i.e., unsymmetric, in some cases rectangular)
GG  general matrices, generalized problem (i.e., a pair of general matrices) (not used in Release 1.0)
GT  general tridiagonal
HB  (complex) Hermitian band
HE  (complex) Hermitian
HG  upper Hessenberg matrix, generalized problem (i.e., a Hessenberg and a triangular matrix) (not used in Release 1.0)
HP  (complex) Hermitian, packed storage
HS  upper Hessenberg
OP  (real) orthogonal, packed storage
OR  (real) orthogonal
PB  symmetric or Hermitian positive definite band
PO  symmetric or Hermitian positive definite
PP  symmetric or Hermitian positive definite, packed storage
PT  symmetric or Hermitian positive definite tridiagonal
SB  (real) symmetric band
SP  symmetric, packed storage
ST  (real) symmetric tridiagonal
SY  symmetric
TB  triangular band
TG  triangular matrices, generalized problem (i.e., a pair of triangular matrices) (not used in Release 1.0)
TP  triangular, packed storage
TR  triangular (or in some cases quasi-triangular)
TZ  trapezoidal
UN  (complex) unitary
UP  (complex) unitary, packed storage
When we wish to refer to a class of routines that performs the same function on different types of matrices, we replace the first three letters by ``xyy''. Thus xyySVX refers to all the expert driver routines for systems of linear equations that are listed in Table 2.2.
The last three letters ZZZ indicate the computation performed. Their meanings will be explained in Section 2.3. For example, SGEBRD is a single precision routine that performs a bidiagonal reduction (BRD) of a real general matrix.
The names of auxiliary routines follow a similar scheme except that the 2nd and 3rd characters YY are usually LA (for example, SLASCL or CLARFG). There are two kinds of exception. Auxiliary routines that implement an unblocked version of a block algorithm have similar names to the routines that perform the block algorithm, with the sixth character being ``2'' (for example, SGETF2 is the unblocked version of SGETRF). A few routines that may be regarded as extensions to the BLAS are named according to the BLAS naming schemes (for example, CROT, CSYR).
This section describes the driver routines in LAPACK. Further details on the terminology and the numerical operations they perform are given in Section 2.3, which describes the computational routines.
Two types of driver routines are provided for solving systems of linear equations:
The expert driver requires roughly twice as much storage as the simple driver in order to perform these extra functions.
Both types of driver routines can handle multiple right hand sides (the columns of B).
Different driver routines are provided to take advantage of special properties or storage schemes of the matrix A, as shown in Table 2.2.
These driver routines cover all the functionality of the computational routines for linear systems, except matrix inversion. It is seldom necessary to compute the inverse of a matrix explicitly, and it is certainly not recommended as a means of solving linear systems.
Type of matrix                   Operation        Single precision       Double precision
and storage scheme                                real      complex      real      complex
general                          simple driver    SGESV     CGESV        DGESV     ZGESV
                                 expert driver    SGESVX    CGESVX       DGESVX    ZGESVX
general band                     simple driver    SGBSV     CGBSV        DGBSV     ZGBSV
                                 expert driver    SGBSVX    CGBSVX       DGBSVX    ZGBSVX
general tridiagonal              simple driver    SGTSV     CGTSV        DGTSV     ZGTSV
                                 expert driver    SGTSVX    CGTSVX       DGTSVX    ZGTSVX
symmetric/Hermitian              simple driver    SPOSV     CPOSV        DPOSV     ZPOSV
positive definite                expert driver    SPOSVX    CPOSVX       DPOSVX    ZPOSVX
symmetric/Hermitian positive     simple driver    SPPSV     CPPSV        DPPSV     ZPPSV
definite (packed storage)        expert driver    SPPSVX    CPPSVX       DPPSVX    ZPPSVX
symmetric/Hermitian              simple driver    SPBSV     CPBSV        DPBSV     ZPBSV
positive definite band           expert driver    SPBSVX    CPBSVX       DPBSVX    ZPBSVX
symmetric/Hermitian positive     simple driver    SPTSV     CPTSV        DPTSV     ZPTSV
definite tridiagonal             expert driver    SPTSVX    CPTSVX       DPTSVX    ZPTSVX
symmetric/Hermitian              simple driver    SSYSV     CHESV        DSYSV     ZHESV
indefinite                       expert driver    SSYSVX    CHESVX       DSYSVX    ZHESVX
complex symmetric                simple driver              CSYSV                  ZSYSV
                                 expert driver              CSYSVX                 ZSYSVX
symmetric/Hermitian indefinite   simple driver    SSPSV     CHPSV        DSPSV     ZHPSV
(packed storage)                 expert driver    SSPSVX    CHPSVX       DSPSVX    ZHPSVX
complex symmetric                simple driver              CSPSV                  ZSPSV
(packed storage)                 expert driver              CSPSVX                 ZSPSVX

Table 2.2: Driver routines for linear equations
The linear least squares problem is:

    minimize (over x)   || b - Ax ||_2                          (2.1)

where A is an m-by-n matrix, b is a given m-element vector and x is the n-element solution vector.
In the most usual case m >= n and rank(A) = n, and in this case the solution to problem (2.1) is unique, and the problem is also referred to as finding a least squares solution to an overdetermined system of linear equations.
When m < n and rank(A) = m, there are an infinite number of solutions x which exactly satisfy b - Ax = 0. In this case it is often useful to find the unique solution x which minimizes || x ||_2, and the problem is referred to as finding a minimum norm solution to an underdetermined system of linear equations.
The driver routine xGELS solves problem (2.1) on the assumption that rank(A) = min(m,n) - in other words, A has full rank - finding a least squares solution of an overdetermined system when m > n, and a minimum norm solution of an underdetermined system when m < n. xGELS uses a QR or LQ factorization of A, and also allows A to be replaced by A^T in the statement of the problem (or by A^H if A is complex).
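A minimal sketch of its use for an overdetermined system (declarations, a sufficiently large workspace WORK of length LWORK, and m >= n are assumed; see subsection 5.1.7 for the workspace conventions) is:

      CALL SGELS( 'No transpose', M, N, NRHS, A, LDA, B, LDB, WORK,
     $            LWORK, INFO )
*     on exit, rows 1 to N of B hold the least squares solution(s)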
In the general case when we may have rank(A) < min(m,n) - in other words, A may be rank-deficient - we seek the minimum norm least squares solution x which minimizes both || x ||_2 and || b - Ax ||_2.
The driver routines xGELSX and xGELSS solve this general formulation of problem 2.1, allowing for the possibility that A is rank-deficient; xGELSX uses a complete orthogonal factorization of A, while xGELSS uses the singular value decomposition of A.
The LLS driver routines are listed in Table 2.3.
All three routines allow several right hand side vectors b and corresponding solutions x to be handled in a single call, storing these vectors as columns of matrices B and X, respectively. Note however that problem (2.1) is solved for each right hand side vector independently; this is not the same as finding a matrix X which minimizes || B - AX ||.
-------------------------------------------------------------------
                              Single precision      Double precision
Operation                     real      complex     real      complex
-------------------------------------------------------------------
solve LLS using QR or         SGELS     CGELS       DGELS     ZGELS
LQ factorization
solve LLS using complete      SGELSX    CGELSX      DGELSX    ZGELSX
orthogonal factorization
solve LLS using SVD           SGELSS    CGELSS      DGELSS    ZGELSS
-------------------------------------------------------------------
Table 2.3: Driver routines for linear least squares problems
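As a sketch (not taken from the LAPACK distribution), an overdetermined full-rank problem can be solved with xGELS as follows; the data, workspace size and program name are invented for the example, and the complete specification of the arguments is given in Part 2.
      PROGRAM EXLLS
*     Least squares solution of an overdetermined system using the
*     driver DGELS (QR factorization; A is assumed to have full rank).
      INTEGER            M, N, NRHS, LDA, LDB, LWORK
      PARAMETER          ( M = 5, N = 3, NRHS = 1, LDA = M, LDB = M,
     $                     LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), B( LDB, NRHS ), WORK( LWORK )
*     Arbitrary test data: A(i,j) = 1/(i+j), b(i) = 1.
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = 1.0D0 / DBLE( I + J )
   10    CONTINUE
   20 CONTINUE
      DO 30 I = 1, M
         B( I, 1 ) = 1.0D0
   30 CONTINUE
*     Minimize || b - A*x ||_2; on exit the first N rows of B hold x.
      CALL DGELS( 'No transpose', M, N, NRHS, A, LDA, B, LDB, WORK,
     $            LWORK, INFO )
      WRITE( *, * ) 'INFO =', INFO, '  x =', ( B( I, 1 ), I = 1, N )
      END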
Since its initial public release in February 1992, LAPACK has expanded in both depth and breadth. LAPACK is now available in both Fortran and C. The publication of this second edition of the Users' Guide coincides with the release of version 2.0 of the LAPACK software.
This release of LAPACK introduces new routines and extends the functionality of existing routines. Prominent among the new routines are driver and computational routines for the generalized nonsymmetric eigenproblem, generalized linear least squares problems, the generalized singular value decomposition, a generalized banded symmetric-definite eigenproblem, and divide-and-conquer methods for symmetric eigenproblems. Additional computational routines include the generalized QR and RQ factorizations and reduction of a band matrix to bidiagonal form.
Added functionality has been incorporated into the expert driver routines that involve equilibration (xGESVX, xGBSVX, xPOSVX, xPPSVX, and xPBSVX). The option FACT = 'F' now permits the user to input a prefactored, pre-equilibrated matrix. The expert drivers xGESVX and xGBSVX now return the reciprocal of the pivot growth from Gaussian elimination. xBDSQR has been modified to compute singular values of bidiagonal matrices much more quickly than before, provided singular vectors are not also wanted. The least squares driver routines xGELS, xGELSS, and xGELSX now make available the residual root-sum-squares for each right hand side.
All LAPACK routines reflect the current version number with the date on the routine indicating when it was last modified. For more information on revisions to the LAPACK software or this Users' Guide please refer to the LAPACK release_notes file on netlib. Instructions for obtaining this file can be found in Chapter 1.
On-line manpages (troff files) for LAPACK routines, as well as for most of the BLAS routines, are available on netlib. Refer to Section 1.6 for further details.
We hope that future releases of LAPACK will include routines for reordering eigenvalues in the generalized Schur factorization; solving the generalized Sylvester equation; computing condition numbers for the generalized eigenproblem (for eigenvalues, eigenvectors, clusters of eigenvalues, and deflating subspaces); fast algorithms for the singular value decomposition based on divide and conquer; high accuracy methods for symmetric eigenproblems and the SVD based on Jacobi's algorithm; updating and/or downdating for linear least squares problems; computing singular values by bidiagonal bisection; and computing singular vectors by bidiagonal inverse iteration.
The following additions/modifications have been made to this second edition of the Users' Guide:
Chapter 1 (Essentials) now includes information on accessing LAPACK via the World Wide Web.
Chapter 2 (Contents of LAPACK) has been expanded to discuss new routines.
Chapter 3 (Performance of LAPACK) has been updated with performance results from version 2.0 of LAPACK. In addition, a new section entitled ``LAPACK Benchmark'' has been introduced to present timings for several driver routines.
Chapter 4 (Accuracy and Stability) has been simplified and rewritten. Much of the theory and other details have been separated into ``Further Details'' sections. Example Fortran code segments are included to demonstrate the calculation of error bounds using LAPACK.
Appendices A, B, and D have been expanded to cover the new routines.
Appendix E (LAPACK Working Notes) lists a number of new Working Notes, written during the LAPACK 2 and ScaLAPACK projects (see below) and published by the University of Tennessee. The Bibliography has been updated to give the most recent published references.
The Specifications of Routines have been extended and updated to cover the new routines and revisions to existing routines.
The Bibliography and Index have been moved to the end of the book. The Index has been expanded into two indexes: Index by Keyword and Index by Routine Name. Occurrences of LAPACK, LINPACK, and EISPACK routine names have been cited in the latter index.
The original LAPACK project was funded by the NSF. Since its completion, two follow-up projects, LAPACK 2 and ScaLAPACK, have been funded in the U.S. by the NSF and ARPA in 1990-1994 and 1991-1995, respectively. In addition to making possible the additions and extensions in this release, these grants have supported the following closely related activities.
A major effort is underway to implement LAPACK-type algorithms for distributed memory machines. As a result of these efforts, several new software items are now available on netlib. The new items that have been introduced are distributed memory versions of the core routines from LAPACK; a fully parallel package to solve a symmetric positive definite sparse linear system on a message passing multiprocessor using Cholesky factorization; a package based on Arnoldi's method for solving large-scale nonsymmetric, symmetric, and generalized algebraic eigenvalue problems; and templates for sparse iterative methods for solving Ax = b. For more information on the availability of each of these packages, consult the scalapack and linalg indexes on netlib via netlib@ornl.gov.
We have also explored the advantages of IEEE floating point arithmetic [4] in implementing linear algebra routines. The accurate rounding properties and ``friendly'' exception handling capabilities of IEEE arithmetic permit us to write faster, more robust versions of several algorithms in LAPACK. Since not all machines yet implement IEEE arithmetic, these algorithms are not currently part of the library [23], although we expect them to be in the future. For more information, please refer to Section 1.11.
LAPACK has been translated from Fortran into C and, in addition, a subset of the LAPACK routines has been implemented in C++ . For more information on obtaining the C or C++ versions of LAPACK, consult Section 1.11 or the clapack or c++ indexes on netlib via netlib@ornl.gov.
We deeply appreciate the careful scrutiny of those individuals who reported mistakes, typographical errors, or shortcomings in the first edition.
We acknowledge with gratitude the support which we have received from the following organizations and the help of individual members of their staff: Cray Research Inc.; NAG Ltd.
We would additionally like to thank the following people, who were not acknowledged in the first edition, for their contributions:
Françoise Chatelin, Inderjit Dhillon, Stan Eisenstat, Vince Fernando, Ming Gu, Rencang Li, Xiaoye Li, George Ostrouchov, Antoine Petitet, Chris Puscasiu, Huan Ren, Jeff Rutter, Ken Stanley, Steve Timson, and Clint Whaley.
Driver routines are provided for two types of generalized linear least squares problems.
The first is
    minimize (over x)  || c - Ax ||_2   subject to   Bx = d             (2.2)
where A is an m-by-n matrix and B is a p-by-n matrix, c is a given m-vector, and d is a given p-vector, with p <= n <= m + p. This is called a linear equality-constrained least squares problem (LSE). The routine xGGLSE solves this problem using the generalized RQ (GRQ) factorization, on the assumptions that B has full row rank p and the matrix formed by stacking A on top of B has full column rank n. Under these assumptions, the problem LSE has a unique solution.
The second generalized linear least squares problem is
    minimize (over x, y)  || y ||_2   subject to   d = Ax + By          (2.3)
where A is an n-by-m matrix, B is an n-by-p matrix, and d is a given n-vector, with m <= n <= m + p. This is sometimes called a general (Gauss-Markov) linear model problem (GLM). When B = I, the problem reduces to an ordinary linear least squares problem. When B is square and nonsingular, the GLM problem is equivalent to the weighted linear least squares problem:
    minimize (over x)  || B^{-1} (d - Ax) ||_2
The routine xGGGLM solves this problem using the generalized QR (GQR) factorization, on the assumptions that A has full column rank m, and the matrix (A , B) has full row rank n. Under these assumptions, the problem is always consistent, and there are unique solutions x and y. The driver routines for generalized linear least squares problems are listed in Table 2.4.
------------------------------------------------------------------
                               Single precision    Double precision
Operation                      real     complex    real     complex
------------------------------------------------------------------
solve LSE problem using GRQ    SGGLSE   CGGLSE     DGGLSE   ZGGLSE
solve GLM problem using GQR    SGGGLM   CGGGLM     DGGGLM   ZGGGLM
------------------------------------------------------------------
Table 2.4: Driver routines for generalized linear least squares problems
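The following sketch (not taken from the LAPACK distribution) shows a call to xGGLSE for a tiny LSE problem with m = 3, n = 2, p = 1 and the single constraint x(1) + x(2) = 1; the data, workspace size and program name are invented for the example, and the full argument specification is in Part 2.
      PROGRAM EXLSE
*     Equality-constrained least squares: minimize || c - A*x ||_2
*     subject to B*x = d, using the driver DGGLSE.
      INTEGER            M, N, P, LDA, LDB, LWORK
      PARAMETER          ( M = 3, N = 2, P = 1, LDA = M, LDB = P,
     $                     LWORK = 64 )
      INTEGER            INFO
      DOUBLE PRECISION   A( LDA, N ), B( LDB, N ), C( M ), D( P ),
     $                   X( N ), WORK( LWORK )
*     A is stored by columns; the constraint is x(1) + x(2) = 1.
      DATA               A / 1.0D0, 3.0D0, 5.0D0, 2.0D0, 4.0D0, 6.0D0 /
      DATA               B / 1.0D0, 1.0D0 /
      DATA               C / 1.0D0, 1.0D0, 1.0D0 /
      DATA               D / 1.0D0 /
      CALL DGGLSE( M, N, P, A, LDA, B, LDB, C, D, X, WORK, LWORK,
     $             INFO )
      WRITE( *, * ) 'INFO =', INFO, '  X =', X
      END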
The symmetric eigenvalue problem is to find the eigenvalues, λ, and corresponding eigenvectors, z, such that A z = λ z, where A = A^T (A is real symmetric).
For the Hermitian eigenvalue problem we have A z = λ z, where A = A^H (A is complex Hermitian).
For both problems the eigenvalues are real.
When all eigenvalues and eigenvectors have been computed, we write: A = Z Λ Z^T (A = Z Λ Z^H in the Hermitian case),
where Λ is a diagonal matrix whose diagonal elements are the eigenvalues, and Z is an orthogonal (or unitary) matrix whose columns are the eigenvectors. This is the classical spectral factorization of A.
Three types of driver routines are provided for symmetric or Hermitian eigenproblems:
Different driver routines are provided to take advantage of special structure or storage of the matrix A, as shown in Table 2.5.
In the future LAPACK will include routines based on the Jacobi algorithm [76] [69] [24], which are slower than the above routines but can be significantly more accurate.
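To make the driver interface concrete, here is a sketch (not taken from the LAPACK distribution) of a call to the simple driver xSYEV listed in Table 2.5; the test matrix, workspace size and program name are invented for the example.
      PROGRAM EXSEP
*     All eigenvalues and eigenvectors of a real symmetric matrix
*     using the simple driver DSYEV.  Illustrative data only.
      INTEGER            N, LDA
      PARAMETER          ( N = 3, LDA = N )
      INTEGER            LWORK
      PARAMETER          ( LWORK = 3*N - 1 )
      INTEGER            INFO, I
      DOUBLE PRECISION   A( LDA, N ), W( N ), WORK( LWORK )
*     Only the upper triangle is referenced when UPLO = 'Upper'.
      DATA               A / 2.0D0, 1.0D0, 0.0D0,
     $                       1.0D0, 2.0D0, 1.0D0,
     $                       0.0D0, 1.0D0, 2.0D0 /
      CALL DSYEV( 'Vectors', 'Upper', N, A, LDA, W, WORK, LWORK, INFO )
*     On exit with JOBZ = 'Vectors', the columns of A hold Z.
      WRITE( *, * ) 'eigenvalues:', ( W( I ), I = 1, N )
      END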
The nonsymmetric eigenvalue problem is to find the eigenvalues, λ, and corresponding eigenvectors, v, such that A v = λ v.
A real matrix A may have complex eigenvalues, occurring as complex conjugate pairs. More precisely, the vector v is called a right eigenvector of A, and a vector u satisfying u^H A = λ u^H
is called a left eigenvector of A.
This problem can be solved via the Schur factorization of A, defined in the real case as A = Z T Z^T,
where Z is an orthogonal matrix and T is an upper quasi-triangular matrix with 1-by-1 and 2-by-2 diagonal blocks, the 2-by-2 blocks corresponding to complex conjugate pairs of eigenvalues of A. In the complex case the Schur factorization is A = Z T Z^H,
where Z is unitary and T is a complex upper triangular matrix.
The columns of Z are called the Schur vectors. For each k (1 <= k <= n), the first k columns of Z form an orthonormal basis for the invariant subspace corresponding to the first k eigenvalues on the diagonal of T. Because this basis is orthonormal, it is preferable in many applications to compute Schur vectors rather than eigenvectors. It is possible to order the Schur factorization so that any desired set of k eigenvalues occupy the k leading positions on the diagonal of T.
Two pairs of drivers are provided, one pair focusing on the Schur factorization, and the other pair on the eigenvalues and eigenvectors, as shown in Table 2.5.
The singular value decomposition of an m-by-n matrix A is given by A = U Σ V^T (A = U Σ V^H in the complex case),
where U and V are orthogonal (unitary) and Σ is an m-by-n diagonal matrix with real diagonal elements, σ_i, such that σ_1 >= σ_2 >= ... >= σ_min(m,n) >= 0.
The σ_i are the singular values of A and the first min(m, n) columns of U and V are the left and right singular vectors of A.
The singular values and singular vectors satisfy: A v_i = σ_i u_i and A^T u_i = σ_i v_i (A^H u_i = σ_i v_i in the complex case),
where u_i and v_i are the i-th columns of U and V respectively.
A single driver routine xGESVD computes all or part of the singular value decomposition of a general nonsymmetric matrix (see Table 2.5). A future version of LAPACK will include a driver based on divide and conquer, as in section 2.2.4.1.
--------------------------------------------------------------------------
Type of                                   Single precision   Double precision
problem   Function and storage scheme     real     complex   real     complex
--------------------------------------------------------------------------
SEP       simple driver                   SSYEV    CHEEV     DSYEV    ZHEEV
          expert driver                   SSYEVX   CHEEVX    DSYEVX   ZHEEVX
          simple driver (packed storage)  SSPEV    CHPEV     DSPEV    ZHPEV
          expert driver (packed storage)  SSPEVX   CHPEVX    DSPEVX   ZHPEVX
          simple driver (band matrix)     SSBEV    CHBEV     DSBEV    ZHBEV
          expert driver (band matrix)     SSBEVX   CHBEVX    DSBEVX   ZHBEVX
          simple driver (tridiagonal      SSTEV              DSTEV
          matrix)
          expert driver (tridiagonal      SSTEVX             DSTEVX
          matrix)
--------------------------------------------------------------------------
NEP       simple driver for               SGEES    CGEES     DGEES    ZGEES
          Schur factorization
          expert driver for               SGEESX   CGEESX    DGEESX   ZGEESX
          Schur factorization
          simple driver for               SGEEV    CGEEV     DGEEV    ZGEEV
          eigenvalues/vectors
          expert driver for               SGEEVX   CGEEVX    DGEEVX   ZGEEVX
          eigenvalues/vectors
--------------------------------------------------------------------------
SVD       singular values/vectors         SGESVD   CGESVD    DGESVD   ZGESVD
--------------------------------------------------------------------------
Table 2.5: Driver routines for standard eigenvalue and singular value problems
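The SVD driver listed in the last row of Table 2.5 can be called as in the following sketch (not taken from the LAPACK distribution); the data, workspace size and program name are invented for the example, and the full argument list is specified in Part 2.
      PROGRAM EXSVD
*     Singular values and singular vectors of a general matrix using
*     the driver DGESVD.  Illustrative data only.
      INTEGER            M, N
      PARAMETER          ( M = 4, N = 3 )
      INTEGER            LDA, LDU, LDVT, LWORK
      PARAMETER          ( LDA = M, LDU = M, LDVT = N, LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), S( N ), U( LDU, M ),
     $                   VT( LDVT, N ), WORK( LWORK )
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = DBLE( I + J )
   10    CONTINUE
   20 CONTINUE
*     'All' requests all M columns of U and all N rows of V**T.
      CALL DGESVD( 'All', 'All', M, N, A, LDA, S, U, LDU, VT, LDVT,
     $             WORK, LWORK, INFO )
      WRITE( *, * ) 'singular values:', ( S( I ), I = 1, N )
      END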
Simple drivers are provided to compute all the eigenvalues and (optionally) the eigenvectors of the following types of problems:
1. A z = λ B z
2. A B z = λ z
3. B A z = λ z
where A and B are symmetric or Hermitian and B is positive definite. For all these problems the eigenvalues λ are real. The matrices Z of computed eigenvectors satisfy Z^T A Z = Λ (problem types 1 and 3) or Z^{-1} A Z^{-T} = Λ (problem type 2), where Λ is a diagonal matrix with the eigenvalues on the diagonal. Z also satisfies Z^T B Z = I (problem types 1 and 2) or Z^T B^{-1} Z = I (problem type 3). In the complex Hermitian case, transposes are replaced by conjugate-transposes.
The routines are listed in Table 2.6.
Given two square matrices A and B, the generalized nonsymmetric eigenvalue problem is to find the eigenvalues λ and corresponding eigenvectors x ≠ 0 such that A x = λ B x,
or to find the eigenvalues μ and corresponding eigenvectors y ≠ 0 such that μ A y = B y.
Note that these problems are equivalent with μ = 1/λ and x = y if neither λ nor μ is zero. In order to deal with the case that λ or μ is zero, or nearly so, the LAPACK routines return two values, α and β, for each eigenvalue, such that λ = α / β and μ = β / α.
More precisely, x and y are called right eigenvectors. Vectors u or w satisfying u^H A = λ u^H B or μ w^H A = w^H B are called left eigenvectors.
If the determinant of A - λ B is zero for all values of λ, the eigenvalue problem is called singular, and is signaled by some α = β = 0 (in the presence of roundoff, α and β may be very small). In this case the eigenvalue problem is very ill-conditioned, and in fact some of the other nonzero values of α and β may be indeterminate [21] [80] [71].
The generalized nonsymmetric eigenvalue problem can be solved via the generalized Schur factorization of the pair (A, B), defined in the real case as A = Q S Z^T, B = Q P Z^T,
where Q and Z are orthogonal matrices, P is upper triangular, and S is an upper quasi-triangular matrix with 1-by-1 and 2-by-2 diagonal blocks, the 2-by-2 blocks corresponding to complex conjugate pairs of eigenvalues of (A, B). In the complex case the generalized Schur factorization is A = Q S Z^H, B = Q P Z^H,
where Q and Z are unitary and S and P are both upper triangular.
The columns of Q and Z are called generalized Schur vectors and span pairs of deflating subspaces of A and B [72]. Deflating subspaces are a generalization of invariant subspaces: For each k (1 <= k <= n), the first k columns of Z span a right deflating subspace mapped by both A and B into a left deflating subspace spanned by the first k columns of Q.
Two simple drivers are provided for the generalized nonsymmetric problem: xGEGS computes the generalized Schur factorization of the pair (A, B), and xGEGV computes the generalized eigenvalues and (optionally) the left and/or right generalized eigenvectors (see Table 2.6).
The generalized (or quotient) singular value decomposition of an m-by-n matrix A and a p-by-n matrix B is given by the pair of factorizations A = U Σ1 ( 0  R ) Q^T and B = V Σ2 ( 0  R ) Q^T (with Q^H in the complex case).
The matrices in these factorizations have the following properties: U is m-by-m, V is p-by-p and Q is n-by-n, and all three are orthogonal (unitary if A and B are complex); R is r-by-r, upper triangular and nonsingular, and ( 0  R ) is r-by-n, where r is the rank of the stacked matrix [ A ; B ] (A stacked on top of B), so that r <= n; Σ1 is m-by-r and Σ2 is p-by-r, both are real, non-negative and diagonal, and they satisfy Σ1^T Σ1 + Σ2^T Σ2 = I.
Σ1 and Σ2 have the following detailed structures, depending on whether m - r >= 0 or m - r < 0. In the first case, m - r >= 0, Σ1 has a k-by-k identity block and an l-by-l diagonal block C on its diagonal (with zeros elsewhere), and Σ2 contains an l-by-l diagonal block S in the columns corresponding to C (with zeros elsewhere).
Here l is the rank of B, k = r - l, C and S are diagonal matrices satisfying C^2 + S^2 = I, and S is nonsingular. We may also identify α_i = 1, β_i = 0 for i = 1, ..., k, and α_{k+i} = c_i, β_{k+i} = s_i for i = 1, ..., l, where c_i and s_i are the diagonal elements of C and S. Thus, the first k generalized singular values α_i / β_i are infinite, and the remaining l generalized singular values are finite.
In the second case, when m - r < 0, Σ1 has a k-by-k identity block followed by an (m - k)-by-(m - k) diagonal block C on its diagonal (with zeros elsewhere), and Σ2 contains an (m - k)-by-(m - k) diagonal block S in the columns corresponding to C together with an (r - m)-by-(r - m) identity block in its trailing columns (with zeros elsewhere).
Again, l is the rank of B, k = r - l, C and S are diagonal matrices satisfying C^2 + S^2 = I, S is nonsingular, and we may identify α_i = 1, β_i = 0 for i = 1, ..., k; α_i = c_{i-k}, β_i = s_{i-k} for i = k + 1, ..., m; and α_i = 0, β_i = 1 for i = m + 1, ..., r. Thus, the first k generalized singular values are infinite, and the remaining r - k generalized singular values are finite.
Here are some important special cases of the generalized singular value decomposition. First, if B is square and nonsingular, then r = n and the generalized singular value decomposition of A and B is equivalent to the singular value decomposition of A B^{-1}, where the singular values of A B^{-1} are equal to the generalized singular values of the pair (A, B): A B^{-1} = (U Σ1 R Q^T) (V Σ2 R Q^T)^{-1} = U (Σ1 Σ2^{-1}) V^T.
Second, if the columns of the stacked matrix [ A ; B ] are orthonormal, then r = n, R = I and the generalized singular value decomposition of A and B is equivalent to the CS (Cosine-Sine) decomposition of [ A ; B ] [45]: [ A ; B ] = [ U 0 ; 0 V ] [ Σ1 ; Σ2 ] Q^T.
Third, the generalized eigenvalues and eigenvectors of the pencil A^T A - λ B^T B can be expressed in terms of the generalized singular value decomposition: Let X = Q [ I 0 ; 0 R^{-1} ].
Then
Therefore, the columns of X are the eigenvectors of A^T A - λ B^T B, and the ``nontrivial'' eigenvalues are the squares of the generalized singular values (see also section 2.2.5.1). ``Trivial'' eigenvalues are those corresponding to the leading n - r columns of X, which span the common null space of A^T A and B^T B. The ``trivial eigenvalues'' are not well defined.
A single driver routine xGGSVD computes the generalized singular value decomposition of A and B (see Table 2.6). It is based on the method described in [12] [10] [62].
--------------------------------------------------------------------
Type of   Function and              Single precision   Double precision
problem   storage scheme            real     complex   real     complex
--------------------------------------------------------------------
GSEP      simple driver             SSYGV    CHEGV     DSYGV    ZHEGV
          simple driver             SSPGV    CHPGV     DSPGV    ZHPGV
          (packed storage)
          simple driver             SSBGV    CHBGV     DSBGV    ZHBGV
          (band matrices)
--------------------------------------------------------------------
GNEP      simple driver for         SGEGS    CGEGS     DGEGS    ZGEGS
          Schur factorization
          simple driver for         SGEGV    CGEGV     DGEGV    ZGEGV
          eigenvalues/vectors
--------------------------------------------------------------------
GSVD      singular values/          SGGSVD   CGGSVD    DGGSVD   ZGGSVD
          vectors
--------------------------------------------------------------------
Table 2.6: Driver routines for generalized eigenvalue and singular value problems
The development of LAPACK was a natural step after specifications of the Level 2 and 3 BLAS were drawn up in 1984-86 and 1987-88. Research on block algorithms had been ongoing for several years, but agreement on the BLAS made it possible to construct a new software package to take the place of LINPACK and EISPACK, which would achieve much greater efficiency on modern high-performance computers. This also seemed to be a good time to implement a number of algorithmic advances that had been made since LINPACK and EISPACK were written in the 1970's. The proposal for LAPACK was submitted while the Level 3 BLAS were still being developed and funding was obtained from the National Science Foundation (NSF) beginning in 1987.
LAPACK is more than just a more efficient update of its popular predecessors. It extends the functionality of LINPACK and EISPACK by including: driver routines for linear systems; equilibration, iterative refinement and error bounds for linear systems; routines for computing and re-ordering the Schur factorization; and condition estimation routines for eigenvalue problems. LAPACK improves on the accuracy of the standard algorithms in EISPACK by including high accuracy algorithms for finding singular values and eigenvalues of bidiagonal and tridiagonal matrices, respectively, that arise in SVD and symmetric eigenvalue problems.
We have tried to be consistent with our documentation and coding style throughout LAPACK in the hope that LAPACK will serve as a model for other software development efforts. In particular, we hope that LAPACK and this guide will be of value in the classroom. But above all, LAPACK has been designed to be used for serious computation, especially as a source of building blocks for larger applications.
The LAPACK project has been a research project on achieving good performance in a portable way over a large class of modern computers. This goal has been achieved, subject to the following qualifications. For optimal performance, it is necessary, first, that the BLAS are implemented efficiently on the target machine, and second, that a small number of tuning parameters (such as the block size) have been set to suitable values (reasonable default values are provided). Most of the LAPACK code is written in standard Fortran 77, but the double precision complex data type is not part of the standard, so we have had to make some assumptions about the names of intrinsic functions that do not hold on all machines (see section 6.1). Finally, our rigorous testing suite included test problems scaled at the extremes of the arithmetic range, which can vary greatly from machine to machine. On some machines, we have had to restrict the range more than on others.
Since most of the performance improvements in LAPACK come from restructuring the algorithms to use the Level 2 and 3 BLAS, we benefited greatly by having access from the early stages of the project to a complete set of BLAS developed for the CRAY machines by Cray Research. Later, the BLAS library developed by IBM for the IBM RISC/6000 was very helpful in proving the worth of block algorithms and LAPACK on ``super-scalar'' workstations. Many of our test sites, both computer vendors and research institutions, also worked on optimizing the BLAS and thus helped to get good performance from LAPACK. We are very pleased at the extent to which the user community has embraced the BLAS, not only for performance reasons, but also because we feel developing software around a core set of common routines like the BLAS is good software engineering practice.
A number of technical reports were written during the development of LAPACK and published as LAPACK Working Notes, initially by Argonne National Laboratory and later by the University of Tennessee. Many of these reports later appeared as journal articles. Appendix E lists the LAPACK Working Notes, and the Bibliography gives the most recent published reference.
A follow-on project, LAPACK 2, has been funded in the U.S. by the NSF and DARPA. One of its aims will be to add a modest amount of additional functionality to the current LAPACK package - for example, routines for the generalized SVD and additional routines for generalized eigenproblems. These routines will be included in a future release of LAPACK when they are available. LAPACK 2 will also produce routines which implement LAPACK-type algorithms for distributed memory machines, routines which take special advantage of IEEE arithmetic, and versions of parts of LAPACK in C and Fortran 90. The precise form of these other software packages which will result from LAPACK 2 has not yet been decided.
As the successor to LINPACK and EISPACK, LAPACK has drawn heavily on both the software and documentation from those collections. The test and timing software for the Level 2 and 3 BLAS was used as a model for the LAPACK test and timing software, and in fact the LAPACK timing software includes the BLAS timing software as a subset. Formatting of the software and conversion from single to double precision was done using Toolpack/1 [66], which was indispensable to the project. We owe a great debt to our colleagues who have helped create the infrastructure of scientific computing on which LAPACK has been built.
The development of LAPACK was primarily supported by NSF grant ASC-8715728. Zhaojun Bai had partial support from DARPA grant F49620-87-C0065; Christian Bischof was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract W-31-109-Eng-38; James Demmel had partial support from NSF grant DCR-8552474; and Jack Dongarra had partial support from the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under Contract DE-AC05-84OR21400.
The cover was designed by Alan Edelman at UC Berkeley who discovered the matrix by performing Gaussian elimination on a certain 20-by-20 Hadamard matrix.
We acknowledge with gratitude the support which we have received from the following organizations, and the help of individual members of their staff: Cornell Theory Center; Cray Research Inc.; IBM ECSEC Rome; IBM Scientific Center, Bergen; NAG Ltd.
We also thank many, many people who have contributed code, criticism, ideas and encouragement. We wish especially to acknowledge the contributions of: Mario Arioli, Mir Assadullah, Jesse Barlow, Mel Ciment, Percy Deift, Augustin Dubrulle, Iain Duff, Alan Edelman, Victor Eijkhout, Sam Figueroa, Pat Gaffney, Nick Higham, Liz Jessup, Bo Kågström, Velvel Kahan, Linda Kaufman, L.-C. Li, Bob Manchek, Peter Mayes, Cleve Moler, Beresford Parlett, Mick Pont, Giuseppe Radicati, Tom Rowan, Pete Stewart, Peter Tang, Carlos Tomei, Charlie Van Loan, Kresimir Veselic, Phuong Vu, and Reed Wade.
Finally we thank all the test sites who received three preliminary distributions of LAPACK software and who ran an extensive series of test programs and timing programs for us; their efforts have influenced the final version of the package in numerous ways.
We use the standard notation for a system of simultaneous linear equations:
    Ax = b                                                              (2.4)
where A is the coefficient matrix, b is the right hand side, and x is the solution. In (2.4) A is assumed to be a square matrix of order n, but some of the individual routines allow A to be rectangular. If there are several right hand sides, we write
    AX = B
where the columns of B are the individual right hand sides, and the columns of X are the corresponding solutions. The basic task is to compute X, given A and B.
If A is upper or lower triangular, (2.4) can be solved by a straightforward process of backward or forward substitution. Otherwise, the solution is obtained after first factorizing A as a product of triangular matrices (and possibly also a diagonal matrix or permutation matrix).
The form of the factorization depends on the properties of the matrix A. LAPACK provides routines for the following types of matrices, based on the stated factorizations:
general matrices (LU factorization with partial pivoting):
A = PLU
where P is a permutation matrix, L is lower triangular with unit diagonal elements (lower trapezoidal if m > n), and U is upper triangular (upper trapezoidal if m < n);
general band matrices (LU factorization with partial pivoting):
A = LU
where L is a product of permutation and unit lower triangular matrices with kl subdiagonals, and U is upper triangular with kl + ku superdiagonals;
symmetric and Hermitian positive definite matrices (Cholesky factorization):
A = U^T U  or  A = L L^T  (A = U^H U or A = L L^H in the complex case)
where U is an upper triangular matrix and L is lower triangular;
symmetric and Hermitian positive definite tridiagonal matrices (LDL^T factorization):
A = U^T D U  or  A = L D L^T  (with conjugate transposes in the complex case)
where U is a unit upper bidiagonal matrix, L is unit lower bidiagonal, and D is diagonal;
symmetric and Hermitian indefinite matrices (symmetric indefinite factorization):
A = U D U^T  or  A = L D L^T  (A = U D U^H or A = L D L^H in the complex case)
where U (or L) is a product of permutation and unit upper (lower) triangular matrices, and D is symmetric and block diagonal with diagonal blocks of order 1 or 2.
The factorization for a general tridiagonal matrix is like that for a general band matrix with kl = 1 and ku = 1. The factorization for a symmetric positive definite band matrix with k superdiagonals (or subdiagonals) has the same form as for a symmetric positive definite matrix, but the factor U (or L) is a band matrix with k superdiagonals (subdiagonals). Band matrices use a compact band storage scheme described in section 5.3.3. LAPACK routines are also provided for symmetric matrices (whether positive definite or indefinite) using packed storage, as described in section 5.3.2.
While the primary use of a matrix factorization is to solve a system of equations, other related tasks are provided as well. Wherever possible, LAPACK provides routines to perform each of these tasks for each type of matrix and storage scheme (see Tables 2.7 and 2.8). The following list relates the tasks to the last 3 characters of the name of the corresponding computational routine:
TRF: factorize;
TRS: use the factorization to solve AX = B by forward or backward substitution;
CON: estimate the reciprocal of the condition number;
RFS: compute bounds on the error in the computed solution, and refine the solution;
TRI: use the factorization to compute the inverse of A;
EQU: compute scaling factors to equilibrate A.
Note that some of the above routines depend on the output of others: in particular, the TRS, CON, RFS and TRI routines all require the factorization previously computed by the corresponding TRF routine.
The RFS (``refine solution'') routines perform iterative refinement and compute backward and forward error bounds for the solution. Iterative refinement is done in the same precision as the input data. In particular, the residual is not computed with extra precision, as has been traditionally done. The benefit of this procedure is discussed in Section 4.4.
--------------------------------------------------------------------------------
Type of matrix                                   Single precision    Double precision
and storage scheme  Operation                    real      complex   real      complex
--------------------------------------------------------------------------------
general             factorize                    SGETRF    CGETRF    DGETRF    ZGETRF
                    solve using factorization    SGETRS    CGETRS    DGETRS    ZGETRS
                    estimate condition number    SGECON    CGECON    DGECON    ZGECON
                    error bounds for solution    SGERFS    CGERFS    DGERFS    ZGERFS
                    invert using factorization   SGETRI    CGETRI    DGETRI    ZGETRI
                    equilibrate                  SGEEQU    CGEEQU    DGEEQU    ZGEEQU
--------------------------------------------------------------------------------
general             factorize                    SGBTRF    CGBTRF    DGBTRF    ZGBTRF
band                solve using factorization    SGBTRS    CGBTRS    DGBTRS    ZGBTRS
                    estimate condition number    SGBCON    CGBCON    DGBCON    ZGBCON
                    error bounds for solution    SGBRFS    CGBRFS    DGBRFS    ZGBRFS
                    equilibrate                  SGBEQU    CGBEQU    DGBEQU    ZGBEQU
--------------------------------------------------------------------------------
general             factorize                    SGTTRF    CGTTRF    DGTTRF    ZGTTRF
tridiagonal         solve using factorization    SGTTRS    CGTTRS    DGTTRS    ZGTTRS
                    estimate condition number    SGTCON    CGTCON    DGTCON    ZGTCON
                    error bounds for solution    SGTRFS    CGTRFS    DGTRFS    ZGTRFS
--------------------------------------------------------------------------------
symmetric/          factorize                    SPOTRF    CPOTRF    DPOTRF    ZPOTRF
Hermitian           solve using factorization    SPOTRS    CPOTRS    DPOTRS    ZPOTRS
positive            estimate condition number    SPOCON    CPOCON    DPOCON    ZPOCON
definite            error bounds for solution    SPORFS    CPORFS    DPORFS    ZPORFS
                    invert using factorization   SPOTRI    CPOTRI    DPOTRI    ZPOTRI
                    equilibrate                  SPOEQU    CPOEQU    DPOEQU    ZPOEQU
--------------------------------------------------------------------------------
symmetric/          factorize                    SPPTRF    CPPTRF    DPPTRF    ZPPTRF
Hermitian           solve using factorization    SPPTRS    CPPTRS    DPPTRS    ZPPTRS
positive definite   estimate condition number    SPPCON    CPPCON    DPPCON    ZPPCON
(packed storage)    error bounds for solution    SPPRFS    CPPRFS    DPPRFS    ZPPRFS
                    invert using factorization   SPPTRI    CPPTRI    DPPTRI    ZPPTRI
                    equilibrate                  SPPEQU    CPPEQU    DPPEQU    ZPPEQU
--------------------------------------------------------------------------------
symmetric/          factorize                    SPBTRF    CPBTRF    DPBTRF    ZPBTRF
Hermitian           solve using factorization    SPBTRS    CPBTRS    DPBTRS    ZPBTRS
positive definite   estimate condition number    SPBCON    CPBCON    DPBCON    ZPBCON
band                error bounds for solution    SPBRFS    CPBRFS    DPBRFS    ZPBRFS
                    equilibrate                  SPBEQU    CPBEQU    DPBEQU    ZPBEQU
--------------------------------------------------------------------------------
symmetric/          factorize                    SPTTRF    CPTTRF    DPTTRF    ZPTTRF
Hermitian           solve using factorization    SPTTRS    CPTTRS    DPTTRS    ZPTTRS
positive definite   estimate condition number    SPTCON    CPTCON    DPTCON    ZPTCON
tridiagonal         error bounds for solution    SPTRFS    CPTRFS    DPTRFS    ZPTRFS
--------------------------------------------------------------------------------
Table 2.7: Computational routines for linear equations
Table 2.8: Computational routines for linear equations (continued)
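The following sketch (not taken from the LAPACK distribution) strings together the TRF, TRS and CON tasks of Table 2.7 for a general matrix. The 1-norm of A is computed with the auxiliary function DLANGE (not listed in the table) before the factorization overwrites A; the data, workspace sizes and program name are invented for the example.
      PROGRAM EXGE
*     Factorize, solve and estimate the condition number of a small
*     general system using DGETRF, DGETRS and DGECON.
      INTEGER            N, NRHS, LDA, LDB
      PARAMETER          ( N = 3, NRHS = 1, LDA = N, LDB = N )
      INTEGER            IPIV( N ), IWORK( N ), INFO, I
      DOUBLE PRECISION   A( LDA, N ), B( LDB, NRHS ), WORK( 4*N )
      DOUBLE PRECISION   ANORM, RCOND, DLANGE
      EXTERNAL           DLANGE
      DATA               A / 4.0D0, 1.0D0, 0.0D0,
     $                       1.0D0, 4.0D0, 1.0D0,
     $                       0.0D0, 1.0D0, 4.0D0 /
      DATA               B / 1.0D0, 2.0D0, 3.0D0 /
*     1-norm of A, needed later by DGECON.
      ANORM = DLANGE( '1', N, N, A, LDA, WORK )
*     LU factorization with partial pivoting, then solve.
      CALL DGETRF( N, N, A, LDA, IPIV, INFO )
      CALL DGETRS( 'No transpose', N, NRHS, A, LDA, IPIV, B, LDB,
     $             INFO )
*     Estimate the reciprocal condition number from the factors.
      CALL DGECON( '1', N, A, LDA, ANORM, RCOND, WORK, IWORK, INFO )
      WRITE( *, * ) 'x =', ( B( I, 1 ), I = 1, N ), '  RCOND =', RCOND
      END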
LAPACK provides a number of routines for factorizing a general rectangular m-by-n matrix A, as the product of an orthogonal matrix (unitary if complex) and a triangular (or possibly trapezoidal) matrix.
A real matrix Q is orthogonal if Q^T Q = I; a complex matrix Q is unitary if Q^H Q = I. Orthogonal or unitary matrices have the important property that they leave the two-norm of a vector invariant: || Q v ||_2 = || v ||_2.
As a result, they help to maintain numerical stability because they do not amplify rounding errors.
Orthogonal factorizations are used in the solution of linear least squares problems . They may also be used to perform preliminary steps in the solution of eigenvalue or singular value problems.
The most common, and best known, of the factorizations is the QR factorization given by A = Q [ R ; 0 ] if m >= n, where [ R ; 0 ] denotes R stacked above an (m - n)-by-n block of zeros,
where R is an n-by-n upper triangular matrix and Q is an m-by-m orthogonal (or unitary) matrix. If A is of full rank n, then R is non-singular. It is sometimes convenient to write the factorization as A = ( Q1  Q2 ) [ R ; 0 ],
which reduces to A = Q1 R,
where Q1 consists of the first n columns of Q, and Q2 the remaining m - n columns.
If m < n, R is trapezoidal, and the factorization can be written A = Q ( R1  R2 ),
where R1 is upper triangular and R2 is rectangular.
The routine xGEQRF computes the QR factorization. The matrix Q is not formed explicitly, but is represented as a product of elementary reflectors, as described in section 5.4. Users need not be aware of the details of this representation, because associated routines are provided to work with Q: xORGQR (or xUNGQR in the complex case) can generate all or part of Q, while xORMQR (or xUNMQR) can pre- or post-multiply a given matrix by Q or Q^T (Q^H if complex).
The QR factorization can be used to solve the linear least squares problem (2.1) when m >= n and A is of full rank, since || b - Ax ||_2 = || [ c1 - Rx ; c2 ] ||_2, where c = Q^T b is partitioned as c = [ c1 ; c2 ].
c can be computed by xORMQR (or xUNMQR), and c1 consists of its first n elements. Then x is the solution of the upper triangular system R x = c1,
which can be computed by xTRTRS. The residual vector r is given by r = b - Ax = Q [ 0 ; c2 ],
and may be computed using xORMQR (or xUNMQR). The residual sum of squares may be computed without forming r explicitly, since || r ||_2 = || c2 ||_2.
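The steps just described can be assembled as in the following sketch (not taken from the LAPACK distribution): factor A = QR with xGEQRF, form c = Q^T b with xORMQR, then solve R x = c1 with xTRTRS. The data, workspace size and program name are invented for the example.
      PROGRAM EXQRLS
*     Full-rank least squares via DGEQRF, DORMQR and DTRTRS.
      INTEGER            M, N, LDA, LWORK
      PARAMETER          ( M = 5, N = 3, LDA = M, LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), B( M ), TAU( N ), WORK( LWORK )
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = 1.0D0 / DBLE( I + J - 1 )
   10    CONTINUE
   20 CONTINUE
      DO 30 I = 1, M
         B( I ) = 1.0D0
   30 CONTINUE
*     QR factorization of A; R overwrites the upper triangle of A.
      CALL DGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
*     c = Q**T * b, overwriting b.
      CALL DORMQR( 'Left', 'Transpose', M, 1, N, A, LDA, TAU, B, M,
     $             WORK, LWORK, INFO )
*     Solve R*x = c1 using the leading n-by-n triangle of A.
      CALL DTRTRS( 'Upper', 'No transpose', 'Non-unit', N, 1, A, LDA,
     $             B, M, INFO )
      WRITE( *, * ) 'x =', ( B( I ), I = 1, N )
      END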
The LQ factorization is given by A = ( L  0 ) Q = L Q1 if m <= n,
where L is m-by-m lower triangular, Q is n-by-n orthogonal (or unitary), Q1 consists of the first m rows of Q, and Q2 the remaining n - m rows.
This factorization is computed by the routine xGELQF, and again Q is represented as a product of elementary reflectors; xORGLQ (or xUNGLQ in the complex case) can generate all or part of Q, and xORMLQ (or xUNMLQ) can pre- or post-multiply a given matrix by Q or Q^T (Q^H if Q is complex).
The LQ factorization of A is essentially the same as the QR factorization of A^T (A^H if A is complex), since A = ( L  0 ) Q implies A^T = Q^T [ L^T ; 0 ].
The LQ factorization may be used to find a minimum norm solution of an underdetermined system of linear equations Ax = b where A is m-by-n with m < n and has rank m. The solution is given by x = Q1^T L^{-1} b,
and may be computed by calls to xTRTRS and xORMLQ.
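A sketch of this computation (not taken from the LAPACK distribution) is shown below: after the LQ factorization, L y = b is solved with xTRTRS and the padded vector ( y over 0 ) is multiplied by Q^T with xORMLQ. The data, workspace size and program name are invented for the example.
      PROGRAM EXMNRM
*     Minimum norm solution x = Q1**T * inv(L) * b of an
*     underdetermined system, via DGELQF, DTRTRS and DORMLQ.
      INTEGER            M, N, LDA, LWORK
      PARAMETER          ( M = 2, N = 4, LDA = M, LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), TAU( M ), X( N ), WORK( LWORK )
      DATA               X / 4*0.0D0 /
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = DBLE( I + J )
   10    CONTINUE
   20 CONTINUE
*     Right hand side b, stored in the first m elements of x.
      X( 1 ) = 1.0D0
      X( 2 ) = 2.0D0
*     LQ factorization A = ( L 0 ) * Q.
      CALL DGELQF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
*     Solve L*y = b; y overwrites x(1:m), and x(m+1:n) stays zero.
      CALL DTRTRS( 'Lower', 'No transpose', 'Non-unit', M, 1, A, LDA,
     $             X, N, INFO )
*     x = Q**T * ( y over 0 ).
      CALL DORMLQ( 'Left', 'Transpose', N, 1, M, A, LDA, TAU, X, N,
     $             WORK, LWORK, INFO )
      WRITE( *, * ) 'x =', ( X( I ), I = 1, N )
      END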
To solve a linear least squares problem ( 2.1) when A is not of full rank, or the rank of A is in doubt, we can perform either a QR factorization with column pivoting or a singular value decomposition (see subsection 2.3.6).
The QR factorization with column pivoting is given by A P = Q [ R ; 0 ] (m >= n),
where Q and R are as before and P is a permutation matrix, chosen (in general) so that | r_11 | >= | r_22 | >= ... >= | r_nn |
and moreover, for each k, | r_kk | is at least as large as the 2-norm of the vector of elements in rows k through j of column j of R, for each j > k.
In exact arithmetic, if rank(A) = k, then the whole of the submatrix R22 in rows and columns k + 1 to n would be zero. In numerical computation, the aim must be to determine an index k, such that the leading submatrix R11 in the first k rows and columns is well-conditioned, and R22 is negligible.
Then k is the effective rank of A. See Golub and Van Loan [45] for a further discussion of numerical rank determination.
The so-called basic solution to the linear least squares problem (2.1) can be obtained from this factorization as x = P [ R11^{-1} c1 ; 0 ],
where c1 consists of just the first k elements of c = Q^T b.
The routine xGEQPF computes the QR factorization with column pivoting, but does not attempt to determine the rank of A. The matrix Q is represented in exactly the same way as after a call of xGEQRF , and so the routines xORGQR and xORMQR can be used to work with Q (xUNGQR and xUNMQR if Q is complex).
The QR factorization with column pivoting does not enable us to compute a minimum norm solution to a rank-deficient linear least squares problem, unless R12 = 0. However, by applying further orthogonal (or unitary) transformations from the right to the upper trapezoidal matrix ( R11  R12 ), using the routine xTZRQF, R12 can be eliminated: ( R11  R12 ) Z = ( T11  0 ).
This gives the complete orthogonal factorization A P = Q T Z, where T is zero except for its leading k-by-k upper triangular block T11,
from which the minimum norm solution can be obtained as x = P Z^T [ T11^{-1} c1 ; 0 ].
The QL and RQ factorizations are given by A = Q [ 0 ; L ] (m >= n),
and A = ( 0  R ) Q (m <= n).
These factorizations are computed by xGEQLF and xGERQF, respectively; they are less commonly used than either the QR or LQ factorizations described above, but have applications in, for example, the computation of generalized QR factorizations [2].
All the factorization routines discussed here (except xTZRQF) allow arbitrary m and n, so that in some cases the matrices R or L are trapezoidal rather than triangular. A routine that performs pivoting is provided only for the QR factorization.
---------------------------------------------------------------------------
Type of factorization                       Single precision    Double precision
and matrix       Operation                  real      complex   real      complex
---------------------------------------------------------------------------
QR, general      factorize with pivoting    SGEQPF    CGEQPF    DGEQPF    ZGEQPF
                 factorize, no pivoting     SGEQRF    CGEQRF    DGEQRF    ZGEQRF
                 generate Q                 SORGQR    CUNGQR    DORGQR    ZUNGQR
                 multiply matrix by Q       SORMQR    CUNMQR    DORMQR    ZUNMQR
---------------------------------------------------------------------------
LQ, general      factorize, no pivoting     SGELQF    CGELQF    DGELQF    ZGELQF
                 generate Q                 SORGLQ    CUNGLQ    DORGLQ    ZUNGLQ
                 multiply matrix by Q       SORMLQ    CUNMLQ    DORMLQ    ZUNMLQ
---------------------------------------------------------------------------
QL, general      factorize, no pivoting     SGEQLF    CGEQLF    DGEQLF    ZGEQLF
                 generate Q                 SORGQL    CUNGQL    DORGQL    ZUNGQL
                 multiply matrix by Q       SORMQL    CUNMQL    DORMQL    ZUNMQL
---------------------------------------------------------------------------
RQ, general      factorize, no pivoting     SGERQF    CGERQF    DGERQF    ZGERQF
                 generate Q                 SORGRQ    CUNGRQ    DORGRQ    ZUNGRQ
                 multiply matrix by Q       SORMRQ    CUNMRQ    DORMRQ    ZUNMRQ
---------------------------------------------------------------------------
RQ, trapezoidal  factorize, no pivoting     STZRQF    CTZRQF    DTZRQF    ZTZRQF
---------------------------------------------------------------------------
Table 2.9: Computational routines for orthogonal factorizations
The generalized QR (GQR) factorization of an n-by-m matrix A and an n-by-p matrix B is given by the pair of factorizations
A = QR and B = QTZ
where Q and Z are respectively n-by-n and p-by-p orthogonal matrices (or unitary matrices if A and B are complex). R has the form R = [ R11 ; 0 ] if n >= m, or R = ( R11  R12 ) if n < m, where R11 is upper triangular. T has the form T = ( 0  T12 ) if n <= p, or T = [ T11 ; T21 ] if n > p, where T12 or T21 is upper triangular.
Note that if B is square and nonsingular, the GQR factorization of A and B implicitly gives the QR factorization of the matrix B^{-1} A: B^{-1} A = Z^T (T^{-1} R) (Z^H in the complex case),
without explicitly computing the matrix inverse B^{-1} or the product B^{-1} A.
The routine xGGQRF computes the GQR factorization by first computing the QR factorization of A and then the RQ factorization of Q^T B (Q^H B in the complex case). The orthogonal (or unitary) matrices Q and Z can either be formed explicitly or just used to multiply another given matrix in the same way as the orthogonal (or unitary) matrix in the QR factorization (see section 2.3.2).
The GQR factorization was introduced in [63] [49]. The implementation of the GQR factorization here follows [2]. Further generalizations of the GQR factorization can be found in [25].
The GQR factorization can be used to solve the general (Gauss-Markov) linear model problem (GLM) (see (2.3) and [60] [45, page 252]). Using the GQR factorization of A and B, we rewrite the equation d = Ax + By from (2.3) as Q^T d = Rx + TZy.
We partition this equation conformally with the block structures of R and T; the vector Q^T d appearing on the left hand side can be computed by xORMQR (or xUNMQR).
The GLM problem is solved by setting
from which we obtain the desired solutions
which can be computed by xTRSV, xGEMV and xORMRQ (or xUNMRQ).
The generalized RQ (GRQ) factorization of an m-by-n matrix A and a p-by-n matrix B is given by the pair of factorizations
A = RQ and B = ZTQ
where Q and Z are respectively n-by-n and p-by-p orthogonal matrices (or unitary matrices if A and B are complex). R has the form R = ( 0  R12 ) if m <= n, or R = [ R11 ; R21 ] if m > n, where R12 or R21 is upper triangular. T has the form T = [ T11 ; 0 ] if p >= n, or T = ( T11  T12 ) if p < n, where T11 is upper triangular.
Note that if B is square and nonsingular, the GRQ factorization of A and B implicitly gives the RQ factorization of the matrix A B^{-1}: A B^{-1} = (R T^{-1}) Z^T (Z^H in the complex case),
without explicitly computing the matrix inverse B^{-1} or the product A B^{-1}.
The routine xGGRQF computes the GRQ factorization by first computing the RQ factorization of A and then the QR factorization of B Q^T (B Q^H in the complex case). The orthogonal (or unitary) matrices Q and Z can either be formed explicitly or just used to multiply another given matrix in the same way as the orthogonal (or unitary) matrix in the RQ factorization (see section 2.3.2).
The GRQ factorization can be used to solve the linear equality-constrained least squares problem (LSE) (see (2.2) and [45, page 567]). We use the GRQ factorization of B and A (note that B and A have swapped roles), written as
B = TQ and A = ZRQ
We write the linear equality constraints Bx = d as:
TQx = d
which we partition as:
Therefore is the solution of the upper triangular system
Furthermore,
We partition this expression conformally; the vector Z^T c appearing in it can be computed by xORMQR (or xUNMQR).
To solve the LSE problem, we set
which gives as the solution of the upper triangular system
Finally, the desired solution is given by
which can be computed by xORMRQ (or xUNMRQ).
Let A be a real symmetric or complex Hermitian n-by-n matrix. A scalar λ is called an eigenvalue and a nonzero column vector z the corresponding eigenvector if A z = λ z. λ is always real when A is real symmetric or complex Hermitian.
The basic task of the symmetric eigenproblem routines is to compute values of λ and, optionally, corresponding vectors z for a given matrix A.
This computation proceeds in the following stages:
1. The real symmetric or complex Hermitian matrix A is reduced to real tridiagonal form T. If A is real symmetric this decomposition is A = Q T Q^T with Q orthogonal and T symmetric tridiagonal. If A is complex Hermitian, the decomposition is A = Q T Q^H with Q unitary and T, as before, real symmetric tridiagonal.
2. Eigenvalues and eigenvectors of the real symmetric tridiagonal matrix T are computed. If all eigenvalues and eigenvectors are computed, this is equivalent to factorizing T as T = S Λ S^T, where S is orthogonal and Λ is diagonal. The diagonal entries of Λ are the eigenvalues of A, and the columns of Z = QS are its eigenvectors.
In the real case, the decomposition is computed by one of the routines xSYTRD, xSPTRD, or xSBTRD, depending on how the matrix is stored (see Table 2.10). The complex analogues of these routines are called xHETRD, xHPTRD, and xHBTRD. The routine xSYTRD (or xHETRD) represents the matrix Q as a product of elementary reflectors, as described in section 5.4. The routine xORGTR (or in the complex case xUNGTR) is provided to form Q explicitly; this is needed in particular before calling xSTEQR to compute all the eigenvectors of A by the QR algorithm. The routine xORMTR (or in the complex case xUNMTR) is provided to multiply another matrix by Q without forming Q explicitly; this can be used to transform eigenvectors of T computed by xSTEIN back to eigenvectors of A.
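As a sketch of this two-stage computation (not taken from the LAPACK distribution), the following program reduces a dense symmetric matrix with xSYTRD, forms Q with xORGTR, and then obtains eigenvalues and eigenvectors with xSTEQR; the data, workspace size and program name are invented for the example.
      PROGRAM EXTRD
*     Eigenvalues and eigenvectors of a real symmetric matrix via
*     DSYTRD, DORGTR and DSTEQR.
      INTEGER            N, LDA, LWORK
      PARAMETER          ( N = 4, LDA = N, LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), D( N ), E( N-1 ), TAU( N-1 ),
     $                   WORK( LWORK )
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 1.0D0 / DBLE( I + J - 1 )
   10    CONTINUE
   20 CONTINUE
*     Reduce A to tridiagonal form T (diagonal D, off-diagonal E).
      CALL DSYTRD( 'Upper', N, A, LDA, D, E, TAU, WORK, LWORK, INFO )
*     Overwrite A with the orthogonal matrix Q of the reduction.
      CALL DORGTR( 'Upper', N, A, LDA, TAU, WORK, LWORK, INFO )
*     Eigenvalues overwrite D; since COMPZ = 'V' and A holds Q on
*     entry, the eigenvectors of the original matrix overwrite A.
      CALL DSTEQR( 'V', N, D, E, A, LDA, WORK, INFO )
      WRITE( *, * ) 'eigenvalues:', ( D( I ), I = 1, N )
      END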
When packed storage is used, the corresponding routines for forming Q or multiplying another matrix by Q are xOPGTR and xOPMTR (in the complex case, xUPGTR and xUPMTR).
When A is banded and xSBTRD (or xHBTRD) is used to reduce it to tridiagonal form, Q is determined as a product of Givens rotations, not as a product of elementary reflectors; if Q is required, it must be formed explicitly by the reduction routine. xSBTRD is based on the vectorizable algorithm due to Kaufman [57].
There are several routines for computing eigenvalues and eigenvectors of T, to cover the cases of computing some or all of the eigenvalues, and some or all of the eigenvectors. In addition, some routines run faster in some computing environments or for some matrices than for others. Also, some routines are more accurate than other routines.
See Table 2.10.
------------------------------------------------------------------------------
Type of matrix                                  Single precision   Double precision
and storage scheme   Operation                  real     complex   real     complex
------------------------------------------------------------------------------
dense symmetric      tridiagonal reduction      SSYTRD   CHETRD    DSYTRD   ZHETRD
(or Hermitian)
------------------------------------------------------------------------------
packed symmetric     tridiagonal reduction      SSPTRD   CHPTRD    DSPTRD   ZHPTRD
(or Hermitian)
------------------------------------------------------------------------------
band symmetric       tridiagonal reduction      SSBTRD   CHBTRD    DSBTRD   ZHBTRD
(or Hermitian)
------------------------------------------------------------------------------
orthogonal/unitary   generate matrix after      SORGTR   CUNGTR    DORGTR   ZUNGTR
                     reduction by xSYTRD
                     multiply matrix after      SORMTR   CUNMTR    DORMTR   ZUNMTR
                     reduction by xSYTRD
------------------------------------------------------------------------------
orthogonal/unitary   generate matrix after      SOPGTR   CUPGTR    DOPGTR   ZUPGTR
(packed storage)     reduction by xSPTRD
                     multiply matrix after      SOPMTR   CUPMTR    DOPMTR   ZUPMTR
                     reduction by xSPTRD
------------------------------------------------------------------------------
symmetric            eigenvalues/eigenvectors   SSTEQR   CSTEQR    DSTEQR   ZSTEQR
tridiagonal          via QR
                     eigenvalues only           SSTERF             DSTERF
                     via root-free QR
                     eigenvalues only           SSTEBZ             DSTEBZ
                     via bisection
                     eigenvectors by            SSTEIN   CSTEIN    DSTEIN   ZSTEIN
                     inverse iteration
------------------------------------------------------------------------------
symmetric            eigenvalues/eigenvectors   SPTEQR   CPTEQR    DPTEQR   ZPTEQR
tridiagonal
positive definite
------------------------------------------------------------------------------
Table 2.10: Computational routines for the symmetric eigenproblem
Let A be a square n-by-n matrix. A scalar λ is called an eigenvalue and a non-zero column vector v the corresponding right eigenvector if A v = λ v. A nonzero column vector u satisfying u^H A = λ u^H is called the left eigenvector. The first basic task of the routines described in this section is to compute, for a given matrix A, all n values of λ and, if desired, their associated right eigenvectors v and/or left eigenvectors u.
A second basic task is to compute the Schur factorization of a matrix A. If A is complex, then its Schur factorization is A = Z T Z^H, where Z is unitary and T is upper triangular. If A is real, its Schur factorization is A = Z T Z^T, where Z is orthogonal and T is upper quasi-triangular (1-by-1 and 2-by-2 blocks on its diagonal). The columns of Z are called the Schur vectors of A. The eigenvalues of A appear on the diagonal of T; complex conjugate eigenvalues of a real A correspond to 2-by-2 blocks on the diagonal of T.
These two basic tasks can be performed in the following stages:
1. The general matrix A is reduced to upper Hessenberg form H, which is zero below the first subdiagonal: A = Q H Q^T with Q orthogonal if A is real, or A = Q H Q^H with Q unitary if A is complex. The reduction is performed by xGEHRD.
2. The upper Hessenberg matrix H is reduced to Schur form T by xHSEQR, giving the Schur factorization of H; the eigenvalues are obtained from the diagonal of T, and the Schur vectors of A, if required, are the columns of Z = QS, where S contains the Schur vectors of H.
3. Given the eigenvalues, eigenvectors may be computed in two ways: by inverse iteration on H (xHSEIN), or from the triangular (or quasi-triangular) factor T (xTREVC), in either case followed by back-transformation to eigenvectors of A.
Other subsidiary tasks may be performed before or after those just described.
The routine xGEBAL may be used to balance the matrix A prior to reduction to Hessenberg form . Balancing involves two steps, either of which is optional:
The first step is to attempt to permute A to block upper triangular form by a similarity transformation, P A P^T = A', where P is a permutation matrix and the leading and trailing diagonal blocks of A' are upper triangular. Thus the matrix is already in Schur form outside the central diagonal block in rows and columns ILO to IHI. Subsequent operations by xGEBAL, xGEHRD or xHSEQR need only be applied to these rows and columns; therefore ILO and IHI are passed as arguments to xGEHRD and xHSEQR. This can save a significant amount of work if ILO > 1 or IHI < n. If no suitable permutation can be found (as is very often the case), xGEBAL sets ILO = 1 and IHI = n, and the central block is the whole of A.
The second step is to apply a diagonal similarity transformation to the central block, to make its rows and columns as close in norm as possible. This scaling can improve the accuracy of later processing in some cases; see subsection 4.8.1.2.
If A was balanced by xGEBAL, then eigenvectors computed by subsequent operations are eigenvectors of the balanced matrix A'; xGEBAK must then be called to transform them back to eigenvectors of the original matrix A.
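The following sketch (not taken from the LAPACK distribution) shows the balancing and reduction steps in sequence, computing the eigenvalues only, so that no Schur vectors are accumulated and xGEBAK is not needed; the data, workspace size and program name are invented for the example.
      PROGRAM EXNEP
*     Eigenvalues of a general matrix via DGEBAL, DGEHRD and DHSEQR.
      INTEGER            N, LDA, LWORK
      PARAMETER          ( N = 4, LDA = N, LWORK = 64 )
      INTEGER            ILO, IHI, INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), SCALE( N ), TAU( N-1 ),
     $                   WR( N ), WI( N ), Z( 1, 1 ), WORK( LWORK )
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = DBLE( I - J ) + 1.0D0 / DBLE( I + J )
   10    CONTINUE
   20 CONTINUE
*     Balance A (permutation and scaling); returns ILO and IHI.
      CALL DGEBAL( 'Both', N, A, LDA, ILO, IHI, SCALE, INFO )
*     Reduce rows and columns ILO to IHI to upper Hessenberg form.
      CALL DGEHRD( N, ILO, IHI, A, LDA, TAU, WORK, LWORK, INFO )
*     Eigenvalues only ('E'); no Schur vectors ('N'), Z unreferenced.
      CALL DHSEQR( 'E', 'N', N, ILO, IHI, A, LDA, WR, WI, Z, 1,
     $             WORK, LWORK, INFO )
      WRITE( *, * ) ( WR( I ), WI( I ), I = 1, N )
      END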
The Schur form depends on the order of the eigenvalues on the diagonal of T, and this may optionally be chosen by the user. Suppose the user chooses that λ_1, ..., λ_j, 1 <= j <= n, appear in the upper left corner of T. Then the first j columns of Z span the right invariant subspace of A corresponding to λ_1, ..., λ_j.
Routines are provided to perform this re-ordering of the Schur factorization (xTREXC and xTRSEN), to compute condition numbers for individual eigenvalues and eigenvectors (xTRSNA) and for eigenvalue clusters and invariant subspaces (xTRSEN), and to solve the Sylvester matrix equation (xTRSYL).
See Table 2.11 for a complete list of the routines.
-----------------------------------------------------------------------------
Type of matrix                                Single precision   Double precision
and storage scheme  Operation                 real     complex   real     complex
-----------------------------------------------------------------------------
general             Hessenberg reduction      SGEHRD   CGEHRD    DGEHRD   ZGEHRD
                    balancing                 SGEBAL   CGEBAL    DGEBAL   ZGEBAL
                    backtransforming          SGEBAK   CGEBAK    DGEBAK   ZGEBAK
-----------------------------------------------------------------------------
orthogonal/unitary  generate matrix after     SORGHR   CUNGHR    DORGHR   ZUNGHR
                    Hessenberg reduction
                    multiply matrix after     SORMHR   CUNMHR    DORMHR   ZUNMHR
                    Hessenberg reduction
-----------------------------------------------------------------------------
Hessenberg          Schur factorization       SHSEQR   CHSEQR    DHSEQR   ZHSEQR
                    eigenvectors by           SHSEIN   CHSEIN    DHSEIN   ZHSEIN
                    inverse iteration
-----------------------------------------------------------------------------
(quasi)triangular   eigenvectors              STREVC   CTREVC    DTREVC   ZTREVC
                    reordering Schur          STREXC   CTREXC    DTREXC   ZTREXC
                    factorization
                    Sylvester equation        STRSYL   CTRSYL    DTRSYL   ZTRSYL
                    condition numbers of      STRSNA   CTRSNA    DTRSNA   ZTRSNA
                    eigenvalues/vectors
                    condition numbers of      STRSEN   CTRSEN    DTRSEN   ZTRSEN
                    eigenvalue cluster/
                    invariant subspace
-----------------------------------------------------------------------------
Table 2.11: Computational routines for the nonsymmetric eigenproblem
Let A be a general real m-by-n matrix. The singular value decomposition (SVD) of A is the factorization A = U Σ V^T, where U and V are orthogonal, and Σ is an m-by-n diagonal matrix with diagonal elements σ_i, i = 1, ..., r, r = min(m, n), with σ_1 >= σ_2 >= ... >= σ_r >= 0. If A is complex, then its SVD is A = U Σ V^H, where U and V are unitary, and Σ is as before with real diagonal elements. The σ_i are called the singular values, the first r columns of V the right singular vectors and the first r columns of U the left singular vectors.
The routines described in this section, and listed in Table 2.12, are used to compute this decomposition. The computation proceeds in the following stages:
1. The matrix A is reduced to bidiagonal form: A = U1 B V1^T if A is real (A = U1 B V1^H if A is complex), where U1 and V1 are orthogonal (unitary if A is complex), and B is real and upper bidiagonal when m >= n and lower bidiagonal when m < n.
2. The SVD of the bidiagonal matrix B is computed: B = U2 Σ V2^T, where U2 and V2 are orthogonal and Σ is diagonal as described above. The singular vectors of A are then U = U1 U2 and V = V1 V2.
The reduction to bidiagonal form is performed by the subroutine xGEBRD, or by xGBBRD for a band matrix.
The routine xGEBRD represents U1 and V1 in factored form as products of elementary reflectors, as described in section 5.4. If A is real, the matrices U1 and V1 may be computed explicitly using routine xORGBR, or multiplied by other matrices without forming U1 and V1 using routine xORMBR. If A is complex, one instead uses xUNGBR and xUNMBR, respectively.
If A is banded and xGBBRD is used to reduce it to bidiagonal form, U1 and V1 are determined as products of Givens rotations, rather than as products of elementary reflectors. If U1 or V1 is required, it must be formed explicitly by xGBBRD. xGBBRD uses a vectorizable algorithm, similar to that used by xSBTRD (see Kaufman [57]). xGBBRD may be much faster than xGEBRD when the bandwidth is narrow.
The SVD of the bidiagonal matrix is computed by the subroutine xBDSQR. xBDSQR is more accurate than its counterparts in LINPACK and EISPACK: barring underflow and overflow, it computes all the singular values of A to nearly full relative precision, independent of their magnitudes. It also computes the singular vectors much more accurately. See section 4.9 and [41] [16] [22] for details.
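The two stages can be combined as in the following sketch (not taken from the LAPACK distribution), which computes singular values only, so that no singular vectors are accumulated; the data, workspace size and program name are invented for the example.
      PROGRAM EXBRD
*     Singular values of a general matrix via bidiagonal reduction
*     (DGEBRD) followed by DBDSQR, with no singular vectors requested.
      INTEGER            M, N, LDA, LWORK
      PARAMETER          ( M = 5, N = 3, LDA = M, LWORK = 64 )
      INTEGER            INFO, I, J
      DOUBLE PRECISION   A( LDA, N ), D( N ), E( N-1 ), TAUQ( N ),
     $                   TAUP( N ), WORK( LWORK ), DUM( 1, 1 )
      DO 20 J = 1, N
         DO 10 I = 1, M
            A( I, J ) = 1.0D0 / DBLE( I + J )
   10    CONTINUE
   20 CONTINUE
*     Reduce A to upper bidiagonal form B (diagonal D, superdiag E).
      CALL DGEBRD( M, N, A, LDA, D, E, TAUQ, TAUP, WORK, LWORK, INFO )
*     Singular values of B (= singular values of A); the vector
*     arguments are not referenced when no vectors are requested.
      CALL DBDSQR( 'Upper', N, 0, 0, 0, D, E, DUM, 1, DUM, 1, DUM, 1,
     $             WORK, INFO )
      WRITE( *, * ) 'singular values:', ( D( I ), I = 1, N )
      END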
If m >> n, it may be more efficient to first perform a QR factorization of A, using the routine xGEQRF, and then to compute the SVD of the n-by-n matrix R, since if A = QR and R = W Σ V^T, then the SVD of A is given by A = (QW) Σ V^T. Similarly, if m << n, it may be more efficient to first perform an LQ factorization of A, using xGELQF. These preliminary QR and LQ factorizations are performed by the driver xGESVD.
The SVD may be used to find a minimum norm solution to a (possibly) rank-deficient linear least squares problem (2.1). The effective rank, k, of A can be determined as the number of singular values which exceed a suitable threshold. Let Σ_k be the leading k-by-k submatrix of Σ, and V_k be the matrix consisting of the first k columns of V. Then the solution is given by: x = V_k Σ_k^{-1} c_k,
where c_k consists of the first k elements of c = U^T b = U2^T U1^T b. The product U1^T b can be computed using xORMBR, and xBDSQR has an option to multiply a vector by U2^T.
-----------------------------------------------------------------------------
Type of matrix                               Single precision   Double precision
and storage scheme  Operation                real     complex   real     complex
-----------------------------------------------------------------------------
general             bidiagonal reduction     SGEBRD   CGEBRD    DGEBRD   ZGEBRD
-----------------------------------------------------------------------------
general band        bidiagonal reduction     SGBBRD   CGBBRD    DGBBRD   ZGBBRD
-----------------------------------------------------------------------------
orthogonal/unitary  generate matrix after    SORGBR   CUNGBR    DORGBR   ZUNGBR
                    bidiagonal reduction
                    multiply matrix after    SORMBR   CUNMBR    DORMBR   ZUNMBR
                    bidiagonal reduction
-----------------------------------------------------------------------------
bidiagonal          singular values/         SBDSQR   CBDSQR    DBDSQR   ZBDSQR
                    singular vectors
-----------------------------------------------------------------------------
Table 2.12: Computational routines for the singular value decomposition
This section is concerned with the solution of the generalized eigenvalue problems A z = λ B z, A B z = λ z, and B A z = λ z, where A and B are real symmetric or complex Hermitian and B is positive definite. Each of these problems can be reduced to a standard symmetric eigenvalue problem, using a Cholesky factorization of B as either B = L L^T or B = U^T U (B = L L^H or B = U^H U in the Hermitian case). In the case A z = λ B z, if A and B are banded then this may also be exploited to get a faster algorithm.
With B = L L^T, we have A z = λ B z, which becomes (L^{-1} A L^{-T}) (L^T z) = λ (L^T z).
Hence the eigenvalues of A z = λ B z are those of C y = λ y, where C is the symmetric matrix C = L^{-1} A L^{-T} and y = L^T z. In the complex case C is Hermitian with C = L^{-1} A L^{-H} and y = L^H z.
Table 2.13 summarizes how each of the three types of problem may be reduced to standard form , and how the eigenvectors z of the original problem may be recovered from the eigenvectors y of the reduced problem. The table applies to real problems; for complex problems, transposed matrices must be replaced by conjugate-transposes.
Table 2.13: Reduction of generalized symmetric definite eigenproblems to standard problems
Given A and a Cholesky factorization of B, the routines xyyGST overwrite A with the matrix C of the corresponding standard problem (see Table 2.14). This may then be solved using the routines described in subsection 2.3.4. No special routines are needed to recover the eigenvectors z of the generalized problem from the eigenvectors y of the standard problem, because these computations are simple applications of Level 2 or Level 3 BLAS.
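For problem type 1, A z = λ B z, the whole sequence looks like the following sketch (not taken from the LAPACK distribution): Cholesky factorize B, reduce to standard form with xSYGST, solve the standard problem with xSYEV, and recover the eigenvectors with a Level 3 BLAS triangular solve (xTRSM). The data, workspace size and program name are invented for the example.
      PROGRAM EXGSEP
*     Generalized symmetric definite eigenproblem A*z = lambda*B*z
*     via DPOTRF, DSYGST, DSYEV and DTRSM.
      INTEGER            N, LDA, LDB, LWORK
      PARAMETER          ( N = 3, LDA = N, LDB = N, LWORK = 64 )
      INTEGER            INFO, I
      DOUBLE PRECISION   A( LDA, N ), B( LDB, N ), W( N ),
     $                   WORK( LWORK )
      DATA               A / 2.0D0, 1.0D0, 0.0D0,
     $                       1.0D0, 2.0D0, 1.0D0,
     $                       0.0D0, 1.0D0, 2.0D0 /
      DATA               B / 4.0D0, 1.0D0, 0.0D0,
     $                       1.0D0, 4.0D0, 1.0D0,
     $                       0.0D0, 1.0D0, 4.0D0 /
*     B = U**T * U (Cholesky); U overwrites the upper triangle of B.
      CALL DPOTRF( 'Upper', N, B, LDB, INFO )
*     A := C = inv(U**T) * A * inv(U)  (standard form, type 1).
      CALL DSYGST( 1, 'Upper', N, A, LDA, B, LDB, INFO )
*     Eigenvalues W and eigenvectors Y of C (Y overwrites A).
      CALL DSYEV( 'Vectors', 'Upper', N, A, LDA, W, WORK, LWORK,
     $            INFO )
*     Recover z = inv(U) * y, i.e. solve U*Z = Y for each column.
      CALL DTRSM( 'Left', 'Upper', 'No transpose', 'Non-unit', N, N,
     $            1.0D0, B, LDB, A, LDA )
      WRITE( *, * ) 'eigenvalues:', ( W( I ), I = 1, N )
      END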
If the problem is A z = λ B z and the matrices A and B are banded, the matrix C as defined above is, in general, full. We can reduce the problem to a banded standard problem by modifying the definition of C thus: C = X^T A X with X = U^{-1} Q or L^{-T} Q,
where Q is an orthogonal matrix chosen to ensure that C has bandwidth no greater than that of A. Q is determined as a product of Givens rotations. This is known as Crawford's algorithm (see Crawford [14]). If X is required, it must be formed explicitly by the reduction routine.
A further refinement is possible when A and B are banded, which halves the amount of work required to form C (see Wilkinson [79]). Instead of the standard Cholesky factorization of B as U^T U or L L^T, we use a ``split Cholesky'' factorization B = S^T S (B = S^H S if B is complex),
where the leading block of S is upper triangular and the trailing block is lower triangular, each of order approximately n/2; S has the same bandwidth as B. After B has been factorized in this way by the routine xPBSTF, the reduction of the banded generalized problem to a banded standard problem is performed by the routine xSBGST (or xHBGST for complex matrices). This routine implements a vectorizable form of the algorithm, suggested by Kaufman [57].
--------------------------------------------------------------------
Type of matrix                          Single precision   Double precision
and storage scheme   Operation          real     complex   real     complex
--------------------------------------------------------------------
symmetric/Hermitian  reduction          SSYGST   CHEGST    DSYGST   ZHEGST
--------------------------------------------------------------------
symmetric/Hermitian  reduction          SSPGST   CHPGST    DSPGST   ZHPGST
(packed storage)
--------------------------------------------------------------------
symmetric/Hermitian  split Cholesky     SPBSTF   CPBSTF    DPBSTF   ZPBSTF
banded               factorization
                     reduction          SSBGST   CHBGST    DSBGST   ZHBGST
--------------------------------------------------------------------
Table 2.14: Computational routines for the generalized symmetric definite eigenproblem
Let A and B be n-by-n matrices. A scalar λ is called a generalized eigenvalue and a non-zero column vector x the corresponding right generalized eigenvector if A x = λ B x. A non-zero column vector y satisfying y^H A = λ y^H B (where the superscript H denotes conjugate-transpose) is called the left generalized eigenvector corresponding to λ. (For simplicity, we will usually omit the word ``generalized'' when no confusion is likely to arise.) If B is singular, we can have the infinite eigenvalue λ = ∞, by which we mean B x = 0. Note that if A is non-singular, then the equivalent problem μ A x = B x with μ = 1/λ is perfectly well-behaved, and the infinite eigenvalue corresponds to μ = 0. To deal with infinite eigenvalues, the LAPACK routines return two values, α and β, for each eigenvalue λ, such that λ = α / β. The first basic task of these routines is to compute all n pairs (α, β) and, if desired, the corresponding eigenvectors x and/or y for a given pair of matrices (A, B).
If the determinant of A - λB is zero for all values of λ, the eigenvalue problem is called singular; this is signaled by some α = β = 0 (in the presence of roundoff, α and β may be very small). In this case the eigenvalue problem is very ill-conditioned, and in fact some of the other nonzero values of α and β may be indeterminate [43] [21] [80] [71].
The other basic task is to compute the generalized Schur decomposition of the pair (A, B). If A and B are complex, then their generalized Schur decomposition is A = QSZ^H, B = QPZ^H, where Q and Z are unitary and S and P are upper triangular. The LAPACK routines normalize P to have non-negative diagonal entries. Note that in this form, the eigenvalues can be easily computed from the diagonals: λ_j = S_jj / P_jj, and so the LAPACK routines return α_j = S_jj and β_j = P_jj. The generalized Schur form depends on the order in which the eigenvalues appear on the diagonal. In a future version of LAPACK, we will supply routines to allow the user to choose this order.
If A and B are real, then their generalized Schur decomposition is A = QSZ^T, B = QPZ^T, where Q and Z are orthogonal, P is upper triangular, and S is quasi-upper triangular with 1-by-1 and 2-by-2 blocks on the diagonal. The 1-by-1 blocks correspond to real generalized eigenvalues, while the 2-by-2 blocks correspond to complex conjugate pairs of generalized eigenvalues. In this case, P is normalized so that diagonal entries of P corresponding to 1-by-1 blocks of S are non-negative, while the (upper triangular) diagonal blocks of P corresponding to 2-by-2 blocks of S are made diagonal. The α and β values corresponding to real eigenvalues may be easily computed from the diagonals of S and P, just as in the complex case. The α and β values corresponding to complex eigenvalues are computed by determining the diagonal values that would result if the corresponding 2-by-2 diagonal blocks of (S, P) were upper triangularized using unitary transformations, and forming α and β from them.
The columns of Q and Z are called generalized Schur vectors and span pairs of deflating subspaces of A and B [72]. Deflating subspaces are a generalization of invariant subspaces: the first k columns of Z span a right deflating subspace mapped by both A and B into a left deflating subspace spanned by the first k columns of Q. This pair of deflating subspaces corresponds to the first k eigenvalues appearing at the top of S and P.
The computations proceed in the following stages:
In addition, the routines xGGBAL and xGGBAK may be used to balance the pair A,B prior to reduction to generalized Hessenberg form. Balancing involves premultiplying A and B by one permutation and postmultiplying them by another, to try to make A,B as nearly triangular as possible, and then ``scaling'' the matrices by premultiplying A and B by one diagonal matrix and postmultiplying by another in order to make the rows and columns of A and B as close in norm to 1 as possible. These transformations can improve speed and accuracy of later processing in some cases; however, the scaling step can sometimes make things worse. Moreover, the scaling step will significantly change the generalized Schur form that results. xGGBAL performs the balancing, and xGGBAK back transforms the eigenvectors of the balanced matrix pair.
---------------------------------------------------------------------------
Type of matrix                                 Single precision  Double precision
and storage scheme     Operation               real     complex  real     complex
---------------------------------------------------------------------------
general                Hessenberg reduction    SGGHRD   CGGHRD   DGGHRD   ZGGHRD
                       balancing               SGGBAL   CGGBAL   DGGBAL   ZGGBAL
                       back transforming       SGGBAK   CGGBAK   DGGBAK   ZGGBAK
---------------------------------------------------------------------------
Hessenberg             Schur factorization     SHGEQZ   CHGEQZ   DHGEQZ   ZHGEQZ
---------------------------------------------------------------------------
(quasi)triangular      eigenvectors            STGEVC   CTGEVC   DTGEVC   ZTGEVC
---------------------------------------------------------------------------
Table 2.15: Computational routines for the generalized nonsymmetric eigenproblem
A future release of LAPACK will include the routines xTGEXC, xTGSYL, xTGSNA and xTGSEN, which are analogous to the routines xTREXC, xTRSYL, xTRSNA and xTRSEN. They will reorder eigenvalues in generalized Schur form, solve the generalized Sylvester equation, compute condition numbers of generalized eigenvalues and eigenvectors, and compute condition numbers of average eigenvalues and deflating subspaces.
The generalized (or quotient) singular value decomposition of an m-by-n matrix A and a p-by-n matrix B is described in section 2.2.5. The routines described in this section are used to compute the decomposition. The computation proceeds in the following two stages:
where and are nonsingular upper triangular, and is upper triangular. If m - k - l < 0, the bottom zero block does not appear, and the corresponding block is upper trapezoidal. U, V and Q are orthogonal matrices (or unitary matrices if A and B are complex). l is the rank of B, and k + l is the rank of the (m+p)-by-n matrix formed by stacking A on top of B.
Here U, V and Q are orthogonal (or unitary) matrices, C and S are both real nonnegative diagonal matrices satisfying C^2 + S^2 = I, S is nonsingular, and R is upper triangular and nonsingular.
--------------------------------------------------------------
                                Single precision  Double precision
Operation                       real     complex  real     complex
--------------------------------------------------------------
triangular reduction            SGGSVP   CGGSVP   DGGSVP   ZGGSVP
of A and B
--------------------------------------------------------------
GSVD of a pair of               STGSJA   CTGSJA   DTGSJA   ZTGSJA
triangular matrices
--------------------------------------------------------------
Table 2.16: Computational routines for the generalized singular value decomposition
The reduction to triangular form, performed by xGGSVP, uses QR decomposition with column pivoting for numerical rank determination. See [12] for details.
The generalized singular value decomposition of two triangular matrices, performed by xTGSJA, is done using a Jacobi-like method as described in [10] [62].
Note: this chapter presents some performance figures for LAPACK routines. The figures are provided for illustration only, and should not be regarded as a definitive up-to-date statement of performance. They have been selected from performance figures obtained in 1994 during the development of version 2.0 of LAPACK. All reported timings were obtained using the optimized version of the BLAS available on each machine. For the IBM computers, the ESSL BLAS were used. Performance is affected by many factors that may change from time to time, such as details of hardware (cycle time, cache size), compiler, and BLAS. To obtain up-to-date performance figures, use the timing programs provided with LAPACK.
Can we provide portable software for computations in dense linear algebra that is efficient on a wide range of modern high-performance computers? If so, how? Answering these questions - and providing the desired software - has been the goal of the LAPACK project.
LINPACK [26] and EISPACK [44] [70] have for many years provided high-quality portable software for linear algebra; but on modern high-performance computers they often achieve only a small fraction of the peak performance of the machines. Therefore, LAPACK has been designed to supersede LINPACK and EISPACK, principally by achieving much greater efficiency - but at the same time also adding extra functionality, using some new or improved algorithms, and integrating the two sets of algorithms into a single package.
LAPACK was originally targeted to achieve good performance on single-processor vector machines and on shared memory multiprocessor machines with a modest number of powerful processors. Since the start of the project, another class of machines has emerged for which LAPACK software is equally well-suited - the high-performance ``super-scalar'' workstations. (LAPACK is intended to be used across the whole spectrum of modern computers, but when considering performance, the emphasis is on machines at the more powerful end of the spectrum.)
Here we discuss the main factors that affect the performance of linear algebra software on these classes of machines.
Designing vectorizable algorithms in linear algebra is usually straightforward. Indeed, for many computations there are several variants, all vectorizable, but with different characteristics in performance (see, for example, [33]). Linear algebra algorithms can come close to the peak performance of many machines - principally because peak performance depends on some form of chaining of vector addition and multiplication operations, and this is just what the algorithms require.
However, when the algorithms are realized in straightforward Fortran 77 code, the performance may fall well short of the expected level, usually because vectorizing Fortran compilers fail to minimize the number of memory references - that is, the number of vector load and store operations. This brings us to the next factor.
What often limits the actual performance of a vector or scalar floating-point unit is the rate of transfer of data between different levels of memory in the machine. Examples include: the transfer of vector operands in and out of vector registers, the transfer of scalar operands in and out of a high-speed scalar processor, the movement of data between main memory and a high-speed cache or local memory, and paging between actual memory and disk storage in a virtual memory system.
It is desirable to maximize the ratio of floating-point operations to memory references, and to re-use data as much as possible while it is stored in the higher levels of the memory hierarchy (for example, vector registers or high-speed cache).
A Fortran programmer has no explicit control over these types of data movement, although one can often influence them by imposing a suitable structure on an algorithm.
The nested loop structure of most linear algebra algorithms offers considerable scope for loop-based parallelism on shared memory machines. This is the principal type of parallelism that LAPACK at present aims to exploit. It can sometimes be generated automatically by a compiler, but often requires the insertion of compiler directives .
How then can we hope to be able to achieve sufficient control over vectorization, data movement, and parallelism in portable Fortran code, to obtain the levels of performance that machines can offer?
The LAPACK strategy for combining efficiency with portability is to construct the software as much as possible out of calls to the BLAS (Basic Linear Algebra Subprograms); the BLAS are used as building blocks.
The efficiency of LAPACK software depends on efficient implementations of the BLAS being provided by computer vendors (or others) for their machines. Thus the BLAS form a low-level interface between LAPACK software and different machine architectures. Above this level, almost all of the LAPACK software is truly portable.
There are now three levels of BLAS:

   Level 1 BLAS: for vector operations, such as y := αx + y
   Level 2 BLAS: for matrix-vector operations, such as y := αAx + βy
   Level 3 BLAS: for matrix-matrix operations, such as C := αAB + βC

Here, A, B and C are matrices, x and y are vectors, and α and β are scalars.
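For example, the Level 3 BLAS routine SGEMM computes C := αAB + βC. The following small, self-contained program (not taken from LAPACK; the matrix contents are arbitrary) shows a typical call:

      PROGRAM GEMMEX
*     Small example: C = A*B using the Level 3 BLAS routine SGEMM
      INTEGER            N
      PARAMETER          ( N = 4 )
      REAL               A( N, N ), B( N, N ), C( N, N )
      INTEGER            I, J
*     Set A to the identity and B(i,j) = i + j
      DO 20 J = 1, N
         DO 10 I = 1, N
            A( I, J ) = 0.0E0
            B( I, J ) = REAL( I + J )
   10    CONTINUE
         A( J, J ) = 1.0E0
   20 CONTINUE
*     C := 1.0*A*B + 0.0*C
      CALL SGEMM( 'No transpose', 'No transpose', N, N, N, 1.0E0,
     $            A, N, B, N, 0.0E0, C, N )
      PRINT *, 'C(1,1) = ', C( 1, 1 )
      END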
The Level 1 BLAS are used in LAPACK, but for convenience rather than for performance: they perform an insignificant fraction of the computation, and they cannot achieve high efficiency on most modern supercomputers.
The Level 2 BLAS can achieve near-peak performance on many vector processors, such as a single processor of a CRAY Y-MP, CRAY C90, or CONVEX C4 machine. However on other vector processors, such as a CRAY 2, or a RISC workstation, their performance is limited by the rate of data movement between different levels of memory.
This limitation is overcome by the Level 3 BLAS, which perform O(n^3) floating-point operations on O(n^2) data, whereas the Level 2 BLAS perform only O(n^2) operations on O(n^2) data.
The BLAS also allow us to exploit parallelism in a way that is transparent to the software that calls them. Even the Level 2 BLAS offer some scope for exploiting parallelism, but greater scope is provided by the Level 3 BLAS, as Table 3.1 illustrates.
Table 3.1: Speed in megaflops of Level 2 and Level 3 BLAS operations on a
CRAY C90
It is comparatively straightforward to recode many of the algorithms in LINPACK and EISPACK so that they call Level 2 BLAS. Indeed, in the simplest cases the same floating-point operations are performed, possibly even in the same order: it is just a matter of reorganizing the software. To illustrate this point we derive the Cholesky factorization algorithm that is used in the LINPACK routine SPOFA, which factorizes a symmetric positive definite matrix as A = U^T U. Writing these equations in partitioned form, with U11 denoting the leading (j-1)-by-(j-1) block of U, and a_j, u_j, a_jj and u_jj denoting the parts of the j-th columns of A and U above and on the diagonal, and equating coefficients of the j-th column, we obtain:

   a_j  = U11^T u_j
   a_jj = u_j^T u_j + u_jj^2

Hence, if U11 has already been computed, we can compute u_j and u_jj from the equations:

   U11^T u_j = a_j
   u_jj^2    = a_jj - u_j^T u_j

Here is the body of the code of the LINPACK routine SPOFA, which implements the above method:
      DO 30 J = 1, N
         INFO = J
         S = 0.0E0
         JM1 = J - 1
         IF (JM1 .LT. 1) GO TO 20
         DO 10 K = 1, JM1
            T = A(K,J) - SDOT(K-1,A(1,K),1,A(1,J),1)
            T = T/A(K,K)
            A(K,J) = T
            S = S + T*T
   10    CONTINUE
   20    CONTINUE
         S = A(J,J) - S
C     ......EXIT
         IF (S .LE. 0.0E0) GO TO 40
         A(J,J) = SQRT(S)
   30 CONTINUE
And here is the same computation recoded in ``LAPACK-style'' to use the Level 2 BLAS routine STRSV (which solves a triangular system of equations). The call to STRSV has replaced the loop over K which made several calls to the Level 1 BLAS routine SDOT. (For reasons given below, this is not the actual code used in LAPACK - hence the term ``LAPACK-style''.)
      DO 10 J = 1, N
         CALL STRSV( 'Upper', 'Transpose', 'Non-unit', J-1, A, LDA,
     $               A(1,J), 1 )
         S = A(J,J) - SDOT( J-1, A(1,J), 1, A(1,J), 1 )
         IF( S.LE.ZERO ) GO TO 20
         A(J,J) = SQRT( S )
   10 CONTINUE
This change by itself is sufficient to make big gains in performance on a number of machines.
For example, on an IBM RISC Sys/6000-550 (using double precision) there is virtually no difference in performance between the LINPACK-style and the LAPACK-style code. Both styles run at a megaflop rate far below the machine's peak rate for matrix-matrix multiplication. To exploit the faster speed of Level 3 BLAS, the algorithms must undergo a deeper level of restructuring, and be re-cast as a block algorithm - that is, an algorithm that operates on blocks or submatrices of the original matrix.
To derive a block form of Cholesky factorization, we write the defining equation A = U^T U in partitioned form thus:

   ( A11  A12  A13 )   ( U11^T               ) ( U11  U12  U13 )
   (  .   A22  A23 ) = ( U12^T  U22^T        ) (      U22  U23 )
   (  .    .   A33 )   ( U13^T  U23^T  U33^T ) (           U33 )

Equating submatrices in the second block of columns, we obtain:

   A12 = U11^T U12
   A22 = U12^T U12 + U22^T U22

Hence, if U11 has already been computed, we can compute U12 as the solution to the equation

   U11^T U12 = A12

by a call to the Level 3 BLAS routine STRSM; and then we can compute U22 from

   U22^T U22 = A22 - U12^T U12

This involves first updating the symmetric submatrix A22
by a call to the Level 3 BLAS routine SSYRK, and then computing its Cholesky factorization. Since Fortran does not allow recursion, a separate routine must be called (using Level 2 BLAS rather than Level 3), named SPOTF2 in the code below. In this way successive blocks of columns of U are computed. Here is LAPACK-style code for the block algorithm. In this code-fragment NB denotes the width of the blocks.
      DO 10 J = 1, N, NB
         JB = MIN( NB, N-J+1 )
         CALL STRSM( 'Left', 'Upper', 'Transpose', 'Non-unit', J-1, JB,
     $               ONE, A, LDA, A( 1, J ), LDA )
         CALL SSYRK( 'Upper', 'Transpose', JB, J-1, -ONE, A( 1, J ),
     $               LDA, ONE, A( J, J ), LDA )
         CALL SPOTF2( 'Upper', JB, A( J, J ), LDA, INFO )
         IF( INFO.NE.0 ) GO TO 20
   10 CONTINUE
But that is not the end of the story, and the code given above is not the code that is actually used in the LAPACK routine SPOTRF . We mentioned in subsection 3.1.1 that for many linear algebra computations there are several vectorizable variants, often referred to as i-, j- and k-variants, according to a convention introduced in [33] and used in [45]. The same is true of the corresponding block algorithms.
It turns out that the j-variant that was chosen for LINPACK, and used in the above examples, is not the fastest on many machines, because it is based on solving triangular systems of equations, which can be significantly slower than matrix-matrix multiplication. The variant actually used in LAPACK is the i-variant, which does rely on matrix-matrix multiplication.
Having discussed in detail the derivation of one particular block algorithm, we now describe examples of the performance that has been achieved with a variety of block algorithms. The clock speeds for the computers involved in the timings are listed in Table 3.2.
---------------------------------------------------
                              Clock Speed
---------------------------------------------------
CONVEX C-4640                 135 MHz     7.41  ns
CRAY C90                      240 MHz     4.167 ns
DEC 3000-500X Alpha           200 MHz     5.0   ns
IBM POWER2 model 590           66 MHz    15.15  ns
IBM RISC Sys/6000-550          42 MHz    23.81  ns
SGI POWER CHALLENGE            75 MHz    13.33  ns
---------------------------------------------------
See Gallivan et al. [42] and Dongarra et al. [31] for an alternative survey of algorithms for dense linear algebra on high-performance computers.
The well-known LU and Cholesky factorizations are the simplest block algorithms to derive. Neither extra floating-point operations nor extra working storage are required.
Table 3.3 illustrates the speed of the LAPACK routine for LU factorization of a real matrix, SGETRF in single precision on CRAY machines, and DGETRF in double precision on all other machines. This corresponds to 64-bit floating-point arithmetic on all machines tested. A block size of 1 means that the unblocked algorithm is used, since it is faster than - or at least as fast as - a blocked algorithm.
------------------------------------------------------------
                        No. of      Block     Values of n
                        processors  size      100      1000
------------------------------------------------------------
CONVEX C-4640               1         64      274       711
CONVEX C-4640               4         64      379      2588
CRAY C90                    1        128      375       863
CRAY C90                   16        128      386      7412
DEC 3000-500X Alpha         1         32       53        91
IBM POWER2 model 590        1         32      110       168
IBM RISC Sys/6000-550       1         32       33        56
SGI POWER CHALLENGE         1         64       81       201
SGI POWER CHALLENGE         4         64       79       353
------------------------------------------------------------
Table 3.3: Speed in megaflops of SGETRF/DGETRF for square matrices of order n
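In a typical application, the factorization computed by SGETRF/DGETRF is followed by calls to SGETRS/DGETRS to solve systems of equations (the simple driver SGESV combines the two steps); a minimal sketch:

*     Factor the N-by-N matrix A; L and U overwrite A, pivots go to IPIV
      CALL SGETRF( N, N, A, LDA, IPIV, INFO )
      IF( INFO.EQ.0 ) THEN
*        Solve A*X = B for NRHS right-hand sides; the solution overwrites B
         CALL SGETRS( 'No transpose', N, NRHS, A, LDA, IPIV, B, LDB,
     $                INFO )
      END IF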
Table 3.4 gives similar results for Cholesky factorization .
------------------------------------------------------------
                        No. of      Block     Values of n
                        processors  size      100      1000
------------------------------------------------------------
CONVEX C-4640               1         64      120       546
CONVEX C-4640               4         64      150      1521
CRAY C90                    1        128      324       859
CRAY C90                   16        128      453      9902
DEC 3000-500X Alpha         1         32       37        83
IBM POWER2 model 590        1         32      102       247
IBM RISC Sys/6000-550       1         32       40        72
SGI POWER CHALLENGE         1         64       74       199
SGI POWER CHALLENGE         4         64       69       424
------------------------------------------------------------
Table 3.4: Speed in megaflops of SPOTRF/DPOTRF for matrices of order n with UPLO = `U'
LAPACK, like LINPACK, provides a factorization for symmetric indefinite matrices, so that A is factorized as , where P is a permutation matrix, and D is block diagonal with blocks of order 1 or 2. A block form of this algorithm has been derived, and is implemented in the LAPACK routine SSYTRF /DSYTRF . It has to duplicate a little of the computation in order to ``look ahead'' to determine the necessary row and column interchanges, but the extra work can be more than compensated for by the greater speed of updating the matrix by blocks as is illustrated in Table 3.5 .
----------------------------
Block        Values of n
size         100      1000
----------------------------
  1           62        86
 64           68       165
----------------------------
Table 3.5: Speed in megaflops of DSYTRF for matrices of order n with UPLO = `U' on an IBM POWER2 model 590
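The corresponding calling sequence for a symmetric indefinite system is sketched below (the workspace length LWORK is an assumption):

*     Factor the symmetric indefinite matrix A as U*D*U' (upper triangle)
      CALL SSYTRF( 'Upper', N, A, LDA, IPIV, WORK, LWORK, INFO )
      IF( INFO.EQ.0 ) THEN
*        Solve A*X = B using the factorization; the solution overwrites B
         CALL SSYTRS( 'Upper', N, NRHS, A, LDA, IPIV, B, LDB, INFO )
      END IF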
LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though that has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [39]. This becomes worthwhile for matrices of large order and semi-bandwidth greater than 100 or so.
The traditional algorithm for QR factorization is based on the use of elementary Householder matrices of the general form

   H = I - τ v v^T

where v is a column vector and τ is a scalar. This leads to an algorithm with very good vector performance, especially if coded to use Level 2 BLAS.
The key to developing a block form of this algorithm is to represent a product of b elementary Householder matrices of order n as a block form of a Householder matrix. This can be done in various ways. LAPACK uses the following form [68]:

   H_1 H_2 ... H_b = I - V T V^T

where V is an n-by-b matrix whose columns are the individual vectors v_1, v_2, ..., v_b associated with the Householder matrices H_1, H_2, ..., H_b, and T is an upper triangular matrix of order b. Extra work is required to compute the elements of T, but once again this is compensated for by the greater speed of applying the block form. Table 3.6 summarizes results obtained with the LAPACK routine SGEQRF/DGEQRF.
------------------------------------------------------------
                        No. of      Block     Values of n
                        processors  size      100      1000
------------------------------------------------------------
CONVEX C-4640               1         64       81       521
CONVEX C-4640               4         64       94      1204
CRAY C90                    1        128      384       859
CRAY C90                   16        128      390      7641
DEC 3000-500X Alpha         1         32       50        86
IBM POWER2 model 590        1         32      108       208
IBM RISC Sys/6000-550       1         32       30        61
SGI POWER CHALLENGE         1         64       61       190
SGI POWER CHALLENGE         4         64       39       342
------------------------------------------------------------
Table 3.6: Speed in megaflops of SGEQRF/DGEQRF for square matrices of order n
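To illustrate how the factorization computed by SGEQRF is typically used, the following sketch solves an overdetermined least squares problem min ||Ax - b|| with M >= N (the driver SGELS packages these steps; the workspace lengths are assumptions):

*     QR factorization of the M-by-N matrix A; R and the Householder
*     vectors overwrite A, the scalar factors are returned in TAU
      CALL SGEQRF( M, N, A, LDA, TAU, WORK, LWORK, INFO )
*     Apply Q' to the right-hand sides B (M-by-NRHS)
      CALL SORMQR( 'Left', 'Transpose', M, NRHS, N, A, LDA, TAU,
     $             B, LDB, WORK, LWORK, INFO )
*     Solve R(1:N,1:N)*X = B(1:N,1:NRHS) with the Level 3 BLAS routine STRSM
      CALL STRSM( 'Left', 'Upper', 'No transpose', 'Non-unit', N, NRHS,
     $            1.0E0, A, LDA, B, LDB )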
Eigenvalue problems have until recently provided a less fertile ground for the development of block algorithms than the factorizations so far described. Version 2.0 of LAPACK includes new block algorithms for the symmetric eigenvalue problem, and future releases will include analogous algorithms for the singular value decomposition.
The first step in solving many types of eigenvalue problems is to reduce the original matrix to a ``condensed form'' by orthogonal transformations .
In the reduction to condensed forms, the unblocked algorithms all use elementary Householder matrices and have good vector performance. Block forms of these algorithms have been developed [34], but all require additional operations, and a significant proportion of the work must still be performed by Level 2 BLAS, so there is less possibility of compensating for the extra operations.
The algorithms concerned are:
reduction of a symmetric matrix to tridiagonal form by xSYTRD: the block update of the form A := A - U X^T - X U^T is applied using the Level 3 BLAS routine SSYR2K; Level 3 BLAS account for at most half the work.
reduction of a general rectangular matrix to bidiagonal form by xGEBRD: the block update of the form A := A - U X^T - Y V^T is applied using two calls to the Level 3 BLAS routine SGEMM; Level 3 BLAS account for at most half the work.
reduction of a general matrix to Hessenberg form by xGEHRD: Level 3 BLAS account for at most three-quarters of the work.
Note that only in the reduction to Hessenberg form is it possible to use the block Householder representation described in subsection 3.4.2. Extra work must be performed to compute the n-by-b matrices X and Y that are required for the block updates (b is the block size) - and extra workspace is needed to store them.
Nevertheless, the performance gains can be worthwhile on some machines, for example, on an IBM POWER2 model 590, as shown in Table 3.7.
(all matrices are square of order n)
------------------------------------
             Block     Values of n
             size      100      1000
------------------------------------
DSYTRD         1       137       159
              16        82       169
------------------------------------
DGEBRD         1        90       110
              16        90       136
------------------------------------
DGEHRD         1       111       113
              16       125       187
------------------------------------
Table 3.7: Speed in megaflops of reductions to condensed forms on an IBM POWER2 model 590
Following the reduction of a dense (or band) symmetric matrix to tridiagonal form T, we must compute the eigenvalues and (optionally) eigenvectors of T. Computing the eigenvalues of T alone (using LAPACK routine SSTERF) requires O(n^2) flops, whereas the reduction routine SSYTRD does (4/3)n^3 + O(n^2) flops. So eventually the cost of finding eigenvalues alone becomes small compared to the cost of reduction. However, SSTERF does only scalar floating point operations, without scope for the BLAS, so n may have to be large before SSYTRD is slower than SSTERF.
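For eigenvalues only, the calling sequence is roughly as follows (a sketch; E has length N-1 and the workspace length LWORK is an assumption to be taken from the routine documentation):

*     Reduce the symmetric matrix A to tridiagonal form; the diagonal is
*     returned in D and the off-diagonal in E
      CALL SSYTRD( 'Upper', N, A, LDA, D, E, TAU, WORK, LWORK, INFO )
*     Compute all eigenvalues of the tridiagonal matrix, without eigenvectors
      CALL SSTERF( N, D, E, INFO )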
Version 2.0 of LAPACK includes a new algorithm, SSTEDC, for finding all eigenvalues and eigenvectors of T. The new algorithm can exploit Level 2 and 3 BLAS, whereas the previous algorithm, SSTEQR, could not. Furthermore, SSTEDC usually does many fewer flops than SSTEQR, so the speedup is compounded. Briefly, SSTEDC works as follows (for details, see [67] [47]). The tridiagonal matrix T is written as the direct sum of two smaller tridiagonal matrices T1 and T2, plus a very simple rank-one matrix H. Then the eigenvalues and eigenvectors of T1 and T2 are found by applying the algorithm recursively; this yields T1 = Q1 D1 Q1^T and T2 = Q2 D2 Q2^T, where each Di is a diagonal matrix of eigenvalues, and the columns of each Qi are orthonormal eigenvectors. Thus

   T = diag(Q1, Q2) ( diag(D1, D2) + H' ) diag(Q1, Q2)^T

where H' is again a simple rank-one matrix. The eigenvalues and eigenvectors of diag(D1, D2) + H' may be found using scalar operations, yielding diag(D1, D2) + H' = Q' D Q'^T. Substituting this into the last displayed expression yields

   T = ( diag(Q1, Q2) Q' ) D ( diag(Q1, Q2) Q' )^T

where the diagonal entries of D are the desired eigenvalues of T, and the columns of diag(Q1, Q2) Q' are the eigenvectors. Almost all the work is done in the two matrix multiplies of Q1 and Q2 times Q', which are done using the Level 3 BLAS.
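A hedged sketch of a call to the divide-and-conquer routine for a tridiagonal matrix whose diagonal is in D and off-diagonal in E; the workspace lengths LWORK and LIWORK are assumptions and should be taken from the routine's documentation:

*     Compute all eigenvalues and eigenvectors of the tridiagonal matrix
*     by divide and conquer; the eigenvectors are returned in Z
      CALL SSTEDC( 'I', N, D, E, Z, LDZ, WORK, LWORK, IWORK, LIWORK,
     $             INFO )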
The same recursive algorithm can be developed for the singular value decomposition of the bidiagonal matrix resulting from reducing a dense matrix with SGEBRD. This software will be completed for a future release of LAPACK. The current LAPACK algorithm for the bidiagonal singular values decomposition, SBDSQR , does not use the Level 2 or Level 3 BLAS.
For computing the eigenvalues and eigenvectors of a Hessenberg matrix-or rather for computing its Schur factorization- yet another flavour of block algorithm has been developed: a multishift QR iteration [8]. Whereas the traditional EISPACK routine HQR uses a double shift (and the corresponding complex routine COMQR uses a single shift), the multishift algorithm uses block shifts of higher order. It has been found that often the total number of operations decreases as the order of shift is increased until a minimum is reached typically between 4 and 8; for higher orders the number of operations increases quite rapidly. On many machines the speed of applying the shift increases steadily with the order, and the optimum order of shift is typically in the range 8-16. Note however that the performance can be very sensitive to the choice of the order of shift; it also depends on the numerical properties of the matrix. Dubrulle [37] has studied the practical performance of the algorithm, while Watkins and Elsner [77] discuss its theoretical asymptotic convergence rate.
Finally, we note that research into block algorithms for symmetric and nonsymmetric eigenproblems continues [55] [9], and future versions of LAPACK will be updated to contain the best algorithms available.
LAPACK is a library of Fortran 77 subroutines for solving the most commonly occurring problems in numerical linear algebra. It has been designed to be efficient on a wide range of modern high-performance computers. The name LAPACK is an acronym for Linear Algebra PACKage.
This section contains performance numbers for selected LAPACK driver routines. These routines provide complete solutions for the most common problems of numerical linear algebra, and are the routines users are most likely to call:
Data is provided for a variety of vector computers, shared memory parallel computers, and high performance workstations. All timings were obtained by using the machine-specific optimized BLAS available on each machine. For the IBM RISC Sys/6000-550 and IBM POWER2 model 590, the ESSL BLAS were used. In all cases the data consisted of 64-bit floating point numbers (single precision on the CRAY C90 and double precision on the other machines). For each machine and each driver, a small problem (N = 100 with LDA = 101) and a large problem (N = 1000 with LDA = 1001) were run. Block sizes NB = 1, 16, 32 and 64 were tried, with data only for the fastest run reported in the tables below. Similarly, UPLO = 'L' and UPLO = 'U' were timed for SSYEVD/DSYEVD, but only times for UPLO = 'U' were reported. For SGEEV/DGEEV, ILO = 1 and IHI = N. The test matrices were generated with randomly distributed entries. All run times are reported in seconds, and block size is denoted by nb. The value of nb was chosen to make N = 1000 optimal. It is not necessarily the best choice for N = 100. See Section 6.2 for details.
The performance data is reported using three or four statistics. First, the run-time in seconds is given. The second statistic measures how well our performance compares to the speed of the BLAS, specifically SGEMM/DGEMM. This ``equivalent matrix multiplies'' statistic is calculated as the run-time of the driver divided by the time required to multiply two n-by-n matrices using SGEMM/DGEMM, and is labeled accordingly in the tables.
The performance information for the BLAS routines
SGEMV/DGEMV (TRANS='N')
and SGEMM/DGEMM (TRANSA='N', TRANSB='N') is provided in Table
3.8,
along with the clock speed for each machine in Table
3.2.
The third statistic is the true megaflop rating. For
the eigenvalue and singular value drivers, a fourth ``synthetic megaflop''
statistic is also presented. We provide this statistic because the number
of floating point operations needed to find eigenvalues and singular values
depends on the input data, unlike linear equation solving or linear least
squares solving with SGELS/DGELS. The synthetic megaflop rating is defined
to be the ``standard'' number of flops required to solve the problem, divided
by the run-time in microseconds. This ``standard'' number of flops is taken
to be the average for a standard algorithm over a variety of problems, as
given in Table
3.9
(we ignore lower-order terms)
[45].
Table 3.8: Execution time and Megaflop rates for SGEMV/DGEMV and
SGEMM/DGEMM
Note that the synthetic megaflop rating is much higher than the true megaflop
rating for
SSYEVD/DSYEVD in Table
3.15; this is because SSYEVD/DSYEVD
performs many fewer floating point operations than the standard algorithm, SSYEV/DSYEV.
Table 3.9: ``Standard'' floating point operation counts for LAPACK drivers
for n-by-n matrices
Table 3.10: Performance of SGESV/DGESV for n-by-n matrices
Table 3.11: Performance of SGELS/DGELS for n-by-n matrices
Table 3.12: Performance of SGEEV/DGEEV, eigenvalues only
Table 3.13: Performance of SGEEV/DGEEV, eigenvalues and right eigenvectors
Table 3.14: Performance of SSYEVD/DSYEVD, eigenvalues only, UPLO='U'
Table 3.15: Performance of SSYEVD/DSYEVD, eigenvalues and eigenvectors, UPLO='U'
Table 3.16: Performance of SGESVD/DGESVD, singular values only
Table 3.17: Performance of SGESVD/DGESVD, singular values and left and right singular vectors
In addition to providing faster routines than previously available, LAPACK provides more comprehensive and better error bounds . Our ultimate goal is to provide error bounds for all quantities computed by LAPACK.
In this chapter we explain our overall approach to obtaining error bounds, and provide enough information to use the software. The comments at the beginning of the individual routines should be consulted for more details. It is beyond the scope of this chapter to justify all the bounds we present. Instead, we give references to the literature. For example, standard material on error analysis can be found in [45].
In order to make this chapter easy to read, we have labeled sections not essential for a first reading as Further Details. The sections not labeled as Further Details should provide all the information needed to understand and use the main error bounds computed by LAPACK. The Further Details sections provide mathematical background, references, and tighter but more expensive error bounds, and may be read later.
In section 4.1 we discuss the sources of numerical error, in particular roundoff error. Section 4.2 discusses how to measure errors, as well as some standard notation. Section 4.3 discusses further details of how error bounds are derived. Sections 4.4 through 4.12 present error bounds for linear equations, linear least squares problems, generalized linear least squares problems, the symmetric eigenproblem, the nonsymmetric eigenproblem, the singular value decomposition, the generalized symmetric definite eigenproblem, the generalized nonsymmetric eigenproblem and the generalized (or quotient) singular value decomposition respectively. Section 4.13 discusses the impact of fast Level 3 BLAS on the accuracy of LAPACK routines.
The sections on generalized linear least squares problems and the generalized nonsymmetric eigenproblem are ``placeholders'' to be completed in the next versions of the library and manual. The next versions will also include error bounds for new high accuracy routines for the symmetric eigenvalue problem and singular value decomposition.
There are two sources of error whose effects can be measured by the bounds in this chapter: roundoff error and input error. Roundoff error arises from rounding results of floating-point operations during the algorithm. Input error is error in the input to the algorithm from prior calculations or measurements. We describe roundoff error first, and then input error.
Almost all the error bounds LAPACK provides are multiples of machine epsilon, which we abbreviate by ε. Machine epsilon bounds the roundoff in individual floating-point operations. It may be loosely defined as the largest relative error in any floating-point operation that neither overflows nor underflows. (Overflow means the result is too large to represent accurately, and underflow means the result is too small to represent accurately.) Machine epsilon is available either by the function call SLAMCH('Epsilon') (or simply SLAMCH('E')) in single precision, or by the function call DLAMCH('Epsilon') (or DLAMCH('E')) in double precision. See section 4.1.1 and Table 4.1 for a discussion of common values of machine epsilon.
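For example, these machine parameters can be obtained at run time as follows:

      REAL               EPSMCH, SFMIN, OVFL
      REAL               SLAMCH
      EXTERNAL           SLAMCH
*     Relative machine precision (machine epsilon)
      EPSMCH = SLAMCH( 'Epsilon' )
*     Safe minimum, such that 1/SFMIN does not overflow
      SFMIN  = SLAMCH( 'Safe minimum' )
*     Overflow threshold
      OVFL   = SLAMCH( 'Overflow' )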
Since overflow generally causes an error message, and underflow is almost always less significant than roundoff, we will not consider overflow and underflow further (see section 4.1.1).
Bounds on input errors, or errors in the input parameters inherited from prior computations or measurements, may be easily incorporated into most LAPACK error bounds. Suppose the input data is accurate to, say, 5 decimal digits (we discuss exactly what this means in section 4.2). Then one simply replaces by in the error bounds.
Roundoff error is bounded in terms of the machine precision ε, which is the smallest value satisfying

   | fl(a op b) - (a op b) |  <=  ε * | a op b |

where a and b are floating-point numbers, op is any one of the four operations +, -, * and /, and fl(a op b) is the floating-point result of a op b. Machine epsilon, ε, is the smallest value for which this inequality is true for all op, and for all a and b such that a op b is neither too large (magnitude exceeds the overflow threshold) nor too small (is nonzero with magnitude less than the underflow threshold) to be represented accurately in the machine. We also assume ε bounds the relative error in unary operations like square root:

   | fl(sqrt(a)) - sqrt(a) |  <=  ε * sqrt(a)

A precise characterization of ε depends on the details of the machine arithmetic and sometimes even of the compiler. For example, if addition and subtraction are implemented without a guard digit we must redefine ε to be the smallest number such that

   | fl(a ± b) - (a ± b) |  <=  ε * ( |a| + |b| )
In order to assure portability , machine parameters such as machine epsilon, the overflow threshold and underflow threshold are computed at runtime by the auxiliary routine xLAMCH . The alternative, keeping a fixed table of machine parameter values, would degrade portability because the table would have to be changed when moving from one machine, or even one compiler, to another.
Actually, most machines, but not yet all, do have the same machine parameters because they implement IEEE Standard Floating Point Arithmetic [5] [4], which exactly specifies floating-point number representations and operations. For these machines, including all modern workstations and PCs , the values of these parameters are given in Table 4.1.
Table 4.1: Values of Machine Parameters in IEEE Floating Point Arithmetic
As stated above, we will ignore overflow and underflow in discussing error bounds. Reference [18] discusses extending error bounds to include underflow, and shows that for many common computations, when underflow occurs it is less significant than roundoff. Overflow generally causes an error message and stops execution, so the error bounds do not apply .
Therefore, most of our error bounds will simply be proportional to machine epsilon. This means, for example, that if the same problem is solved in double precision and single precision, the error bound in double precision will be smaller than the error bound in single precision by a factor equal to the ratio of the two machine epsilons. In IEEE arithmetic, this ratio is about 10^-9, meaning that one expects the double precision answer to have approximately nine more decimal digits correct than the single precision answer.
LAPACK routines are generally insensitive to the details of rounding, like their counterparts in LINPACK and EISPACK. One newer algorithm (xLASV2) can return significantly more accurate results if addition and subtraction have a guard digit (see the end of section 4.9).
LAPACK routines return four types of floating-point output arguments:
First consider scalars. Suppose a computed scalar approximates a true answer α. We can measure the difference between them either by the absolute error (the magnitude of their difference) or, if α is nonzero, by the relative error (the absolute error divided by |α|). Alternatively, it is sometimes more convenient to divide by the magnitude of the computed value instead of |α| (see section 4.2.1). If the relative error is, say, 10^-5, then we say that the computed value is accurate to 5 decimal digits.
In order to measure the error in vectors, we need to measure the size or norm of a vector x. A popular norm is the magnitude of the largest component, max over i of |x_i|, which we denote ||x||_inf. This is read the infinity norm of x. See Table 4.2 for a summary of norms.
Table 4.2: Vector and matrix norms
If is an approximation to the exact vector x, we will refer to as the absolute error in (where p is one of the values in Table 4.2), and refer to as the relative error in (assuming ). As with scalars, we will sometimes use for the relative error. As above, if the relative error of is, say , then we say that is accurate to 5 decimal digits. The following example illustrates these ideas:
Thus, we would say that approximates x to 2 decimal digits.
Errors in matrices may also be measured with norms . The most obvious generalization of to matrices would appear to be , but this does not have certain important mathematical properties that make deriving error bounds convenient (see section 4.2.1). Instead, we will use , where A is an m-by-n matrix, or ; see Table 4.2 for other matrix norms. As before is the absolute error in , is the relative error in , and a relative error in of means is accurate to 5 decimal digits. The following example illustrates these ideas:
so is accurate to 1 decimal digit.
Here is some related notation we will use in our error bounds. The condition number of a matrix A is defined as kappa_p(A) = ||A||_p * ||A^-1||_p, where A is square and invertible, and p is inf or one of the other possibilities in Table 4.2. The condition number measures how sensitive A^-1 is to changes in A; the larger the condition number, the more sensitive is A^-1. For example, for the same A as in the last example,
LAPACK error estimation routines typically compute a variable called RCOND, which is the reciprocal of the condition number (or an approximation of the reciprocal). The reciprocal of the condition number is used instead of the condition number itself in order to avoid the possibility of overflow when the condition number is very large. Also, some of our error bounds will use the vector of absolute values of x, |x| (whose i-th component is |x_i|), or similarly the matrix of absolute values |A| (whose (i,j) entry is |a_ij|).
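For example, for a general matrix A the reciprocal condition number in the infinity norm can be estimated as follows (a sketch; WORK must have length at least 4*N and IWORK length N for SGECON):

*     Infinity-norm of A, computed before A is overwritten by its factors
      ANORM = SLANGE( 'I', N, N, A, LDA, WORK )
*     LU factorization of A
      CALL SGETRF( N, N, A, LDA, IPIV, INFO )
*     Estimate RCOND = 1 / ( norm(A) * norm(inv(A)) )
      CALL SGECON( 'I', N, A, LDA, ANORM, RCOND, WORK, IWORK, INFO )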
Now we consider errors in subspaces. Subspaces are the
outputs of routines that compute eigenvectors and invariant
subspaces of matrices. We need a careful definition
of error in these cases for the following reason. The nonzero vector x is called a
(right) eigenvector of the matrix A with eigenvalue
if
. From this definition, we see that
-x, 2x, or any other nonzero multiple
of x is also an
eigenvector. In other words, eigenvectors are not unique. This
means we cannot measure the difference between two supposed eigenvectors
simply by computing the norm of their difference, because this norm may
be large while the difference between one of them and some scalar multiple
of the other is small or even zero. This is true
even if we normalize the eigenvectors to have unit norm, since both
x and -x can be normalized simultaneously. So in order to define
error in a useful way, we need to instead consider the set S of
all scalar multiples
of
x. The set S is
called the subspace spanned by x, and is uniquely determined
by any nonzero member of S. We will measure the difference
between two such sets by the acute angle between them.
Suppose
is spanned by
and
S is spanned by {x}. Then the acute angle between
and S is defined as
One can show that
does not change when either
or x is multiplied by any nonzero scalar. For example, if
as above, then
for any
nonzero scalars
and
.
Here is another way to interpret the angle
between
and
S.
Suppose
is a unit vector (
).
Then there is a scalar
such that
The approximation
holds when
is much less than 1
(less than .1 will do nicely). If
is an approximate
eigenvector with error bound
,
where x is a true eigenvector, there is another true eigenvector
satisfying
.
For example, if
then
for
.
Some LAPACK routines also return subspaces spanned by more than one
vector, such as the invariant subspaces of matrices returned by xGEESX.
The notion of angle between subspaces also applies here;
see section
4.2.1 for details.
Finally, many of our error bounds will contain a factor p(n) (or p(m , n)),
which grows as a function of matrix dimension n (or dimensions m and n).
It represents a potentially different function for each problem.
In practice, the true errors usually grow just linearly; using
p(n) = 10n in the error bound formulas will often give a reasonable bound.
Therefore, we will refer to p(n) as a ``modestly growing'' function of n.
However it can occasionally be much larger, see
section
4.2.1.
For simplicity, the error bounds computed by the code fragments
in the following sections will use p(n) = 1.
This means these computed error bounds may occasionally
slightly underestimate the true error. For this reason we refer
to these computed error bounds as ``approximate error bounds''.
The relative error
in the approximation
of the true solution
has a drawback: it often cannot
be computed directly, because it depends on the unknown quantity
. However, we can often instead estimate
, since
is
known (it is the output of our algorithm). Fortunately, these two
quantities are necessarily close together, provided either one is small,
which is the only time they provide a useful bound anyway. For example,
implies
so they can be used interchangeably.
Table
4.2 contains a variety of norms we will use to
measure errors.
These norms have the properties that
, and
, where p is one of
1, 2,
, and F. These properties are useful for deriving
error bounds.
An error bound that uses a given norm may be changed into an error bound
that uses another norm. This is accomplished by multiplying the first
error bound by an appropriate function of the problem dimension.
Table
4.3 gives the
factors
such that
, where
n is the dimension of x.
Table
4.4 gives the
factors
such that
, where
A is m-by-n.
The two-norm of A,
, is also called the spectral
norm of A, and is equal to the largest singular value
of A.
We shall also need to refer to the smallest singular value
of A; its value can be defined in a similar way to
the definition of the two-norm in Table
4.2, namely as
when A
has at least as many rows as columns, and defined as
when A has more
columns than rows. The two-norm,
Frobenius norm
,
and singular values of a matrix do not change
if the matrix is multiplied by a real orthogonal (or complex unitary) matrix.
Now we define subspaces spanned by more than one vector,
and angles between subspaces.
Given a set of k
n-dimensional vectors
, they determine
a subspace S consisting of all their possible linear combinations
,
scalars
. We also
say that
spans S.
The difficulty in measuring the difference between subspaces is that
the sets of vectors spanning them are not unique.
For example, {x}, {-x} and {2x} all determine the
same subspace.
This means we cannot simply compare the subspaces spanned by
and
by
comparing each
to
. Instead, we will measure the angle
between the subspaces, which is independent of the spanning set
of vectors. Suppose subspace
is spanned by
and that subspace S
is spanned by
. If k = 1, we instead write more
simply
and {x}.
When k = 1, we defined
the angle
between
and S as the acute angle
between
and
.
When k > 1, we define the acute angle between
and
S as the largest acute angle between any vector
in
, and the closest vector x in S to
:
LAPACK routines which compute subspaces return
vectors
spanning a subspace
which are orthonormal. This means the
n-by-k matrix
satisfies
. Suppose also that
the vectors
spanning S
are orthonormal, so
also
satisfies
.
Then there is a simple expression for the angle between
and S:
For example, if
then
.
As stated above, all our bounds will contain a factor
p(n) (or p(m,n)), which measure how roundoff errors can grow
as a function of matrix dimension n (or m and n).
In practice, the true error usually grows just linearly with n,
but we can generally only prove much weaker bounds of the form
.
This is because we can not rule out the extremely unlikely possibility of rounding
errors all adding together instead of canceling on average. Using
would give very pessimistic and unrealistic bounds, especially
for large n, so we content ourselves with describing p(n) as a
``modestly growing'' polynomial function of n. Using p(n) = 10n in
the error bound formulas will often give a reasonable bound.
For detailed derivations of various
p(n), see
[78]
[45].
There is also one situation where p(n) can grow as large as
:
Gaussian elimination. This typically occurs only on specially constructed
matrices presented in numerical analysis courses (see Wilkinson [79, p. 212]).
However, the expert drivers for solving linear systems, xGESVX and xGBSVX,
provide error bounds incorporating p(n), and so this rare possibility
can be detected.
We illustrate standard error analysis with the simple example of
evaluating the scalar function y = f(z). Let the output of the
subroutine which implements f(z) be denoted alg(z); this includes
the effects of roundoff. If
where
is small,
then we say alg is a backward stable
algorithm for f,
or that the backward error
is small.
In other words, alg(z) is the
exact value of f at a slightly perturbed input
.
Suppose now that f is a smooth function, so that
we may approximate it near z by a straight line:
.
Then we have the simple error estimate
Thus, if
is small, and the derivative
is
moderate, the error alg(z) - f(z) will be small
.
This is often written
in the similar form
This approximately bounds the relative error
by the product of
the condition number of
f at z,
, and the
relative backward error
.
Thus we get an error bound by multiplying a
condition
number and
a backward error (or bounds for these quantities). We call a problem
ill-conditioned
if its condition number is large,
and ill-posed
if its condition number is infinite (or does not exist)
.
If f and z are vector quantities, then
is a matrix
(the Jacobian). So instead of using absolute values as before,
we now measure
by a vector norm
and
by a matrix norm
. The conventional (and coarsest) error analysis
uses a norm such as the infinity norm. We therefore call
this normwise backward stability.
For example, a normwise stable
method for solving a system of linear equations Ax = b will
produce a solution
satisfying
where
and
are both small (close to machine epsilon).
In this case the condition number is
(see section
4.4 below).
Almost all of the algorithms in LAPACK (as well as LINPACK and EISPACK)
are stable in the sense just described
:
when applied to a matrix A
they produce the exact result for a slightly different matrix A + E,
where
is of order
.
Condition numbers may be expensive to compute
exactly.
For example, it costs about (2/3)n^3 operations to solve Ax = b
for a general matrix A, and computing the condition number
exactly costs an additional (4/3)n^3 operations, or twice as much.
But the condition number can be estimated in only O(n^2) operations beyond those
necessary for solution,
a tiny extra cost. Therefore, most of LAPACK's condition numbers
and error bounds are based on estimated condition
numbers
, using the method
of
[52]
[51]
[48].
The price one pays for using an estimated rather than an
exact condition number is occasional
(but very rare) underestimates of the true error; years of experience
attest to the reliability of our estimators, although examples
where they badly underestimate the error can be constructed
[53].
Note that once a condition estimate is large enough,
(usually
), it confirms that the computed
answer may be completely inaccurate, and so the exact magnitude
of the condition estimate conveys little information.
The standard error analysis just outlined has a drawback: by using the
infinity norm
to measure the backward error,
entries of equal magnitude in
contribute equally to the final
error bound
. This means that
if z is sparse or has some very tiny entries, a normwise backward
stable algorithm may make very large changes in these entries
compared to their original values. If these tiny values are known accurately
by the user, these errors may be unacceptable,
or the error bounds may be unacceptably large.
For example, consider solving a diagonal system of linear equations Ax = b.
Each component of the solution is computed accurately by
Gaussian elimination: x_i = b_i / a_ii.
The usual error bound is approximately proportional to ε times the
condition number of A times the size of the largest solution component,
which can arbitrarily overestimate the true error in the individual
components if at least one diagonal entry a_ii
is tiny and another one is large.
LAPACK addresses this inadequacy by providing some algorithms
whose backward error is a tiny relative change in
each component of z: each component z_i is perturbed by a relative
amount bounded by a modest multiple of ε.
This backward error retains both the sparsity structure of z as
well as the information in tiny entries. These algorithms are therefore
called componentwise relatively backward stable.
Furthermore, computed error bounds reflect this stronger form of backward
error
.
If the input data has independent uncertainty in each component,
each component must have at least a small relative uncertainty,
since each is a floating-point number.
In this case, the extra uncertainty contributed by the algorithm is not much
worse than the uncertainty in the input data, so
one could say the answer provided by a componentwise
relatively backward stable algorithm is as accurate as the data
warrants
[1].
When solving Ax = b using expert driver xyySVX or computational routine xyyRFS,
for example, we almost always
compute
satisfying
, where
is a small relative change in
and
is a small relative change in
. In particular, if A is diagonal,
the corresponding error bound is always tiny, as one would
expect (see the next section).
LAPACK can achieve this accuracy
for linear equation solving,
the bidiagonal singular value decomposition, and
the symmetric tridiagonal eigenproblem,
and provides facilities for achieving this accuracy for least squares problems.
Future versions of LAPACK will also achieve this
accuracy for other linear algebra problems, as discussed below.
Let Ax = b be the system to be solved, and
the computed
solution. Let n be the dimension of A.
An approximate error bound
for
may be obtained in one of the following two ways,
depending on whether the solution is computed by a simple driver or
an expert driver:
can be computed by the following code fragment.
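A minimal sketch of such a fragment for the simple driver SGESV, using SLAMCH, SLANGE and SGECON (the variable names EPSMCH and ERRBD are illustrative):

      EPSMCH = SLAMCH( 'E' )
*     Infinity-norm of A, computed before A is overwritten by its factors
      ANORM = SLANGE( 'I', N, N, A, LDA, WORK )
*     Solve A*x = b; the solution overwrites B
      CALL SGESV( N, 1, A, LDA, IPIV, B, LDB, INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *, 'Matrix is singular to working precision'
      ELSE IF( N.GT.0 ) THEN
*        Estimate the reciprocal condition number RCOND of A
         CALL SGECON( 'I', N, A, LDA, ANORM, RCOND, WORK, IWORK, INFO )
         RCOND = MAX( RCOND, EPSMCH )
*        Approximate error bound: machine epsilon times the condition number
         ERRBD = EPSMCH / RCOND
      END IF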
,
Then (to 4 decimal places)
,
,
the true reciprocal condition number
,
, and the true error
.
For example, the following code fragment solves
Ax = b and computes an approximate error bound FERR:
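A minimal sketch of such a call, using the expert driver SGESVX (array dimensions and variable names are illustrative; FERR and BERR are arrays of length equal to the number of right-hand sides):

*     Solve A*x = b with the expert driver, requesting equilibration if
*     necessary; the solution is returned in X, together with the
*     reciprocal condition estimate RCOND and the bounds FERR and BERR
      CALL SGESVX( 'Equilibrate', 'No transpose', N, 1, A, LDA, AF,
     $             LDAF, IPIV, EQUED, R, C, B, LDB, X, LDX, RCOND,
     $             FERR, BERR, WORK, IWORK, INFO )
      IF( INFO.GT.0 ) THEN
         PRINT *, 'Matrix is (nearly) singular'
      END IF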
For the same A and b as above,
,
,
and the actual error is
.
This example illustrates
that the expert driver provides an error bound with less programming
effort than the simple driver, and also that it may produce a significantly
more accurate answer.
Similar code fragments, with obvious adaptations,
may be used with all the driver routines for linear
equations listed in Table
2.2.
For example, if a symmetric system is solved using the simple driver xSYSV,
then xLANSY must be used to compute ANORM, and xSYCON must be used
to compute RCOND.
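A hedged sketch for a symmetric system solved with the simple driver SSYSV (variable names are illustrative; workspace sizes should be taken from the routine documentation):

      EPSMCH = SLAMCH( 'E' )
*     Infinity-norm of the symmetric matrix A (upper triangle stored)
      ANORM = SLANSY( 'I', 'Upper', N, A, LDA, WORK )
*     Solve A*x = b; the factorization overwrites A, the solution overwrites B
      CALL SSYSV( 'Upper', N, 1, A, LDA, IPIV, B, LDB, WORK, LWORK,
     $            INFO )
*     Estimate the reciprocal condition number of A from its factorization
      CALL SSYCON( 'Upper', N, A, LDA, IPIV, ANORM, RCOND, WORK, IWORK,
     $             INFO )
      RCOND = MAX( RCOND, EPSMCH )
      ERRBD = EPSMCH / RCOND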
LAPACK can solve systems of linear equations, linear least squares
problems, eigenvalue problems and singular value problems.
LAPACK can also handle many
associated computations such as matrix factorizations
or estimating condition numbers.
LAPACK contains driver routines for solving standard types of
problems,
computational routines to perform a distinct
computational task, and auxiliary routines to perform a certain
subtask or common low-level computation. Each driver routine
typically calls a sequence of
computational routines. Taken as a whole, the computational routines
can perform a wider range of tasks than are covered by the driver
routines.
Many of the auxiliary routines may be of use to numerical analysts
or software developers, so we have documented the Fortran source for
these routines with the same level of detail used for the LAPACK
routines and driver routines.
Dense and band matrices are provided for,
but not general sparse matrices. In all areas, similar functionality
is provided for real and complex matrices.
See Chapter
2 for a complete summary of the
contents.
The conventional error analysis of linear
equation
solving goes as follows.
Let Ax = b be the system to be solved. Let
be the solution
computed by LAPACK (or LINPACK) using any of their linear equation solvers.
Let r be
the residual
. In the absence of rounding error r
would be zero and
would equal x; with rounding error one can
only say the following:
subject to the constraint
.
The minimal value of
is given by
One can show that the computed solution
satisfies
,
where p(n) is a modestly growing function of n.
The corresponding condition number is
.
The error
is bounded by
In the first code fragment in the last section,
,
which is
in the numerical example,
is approximated by
.
Approximations
of
- or, strictly speaking, its reciprocal RCOND -
are returned by computational routines
xyyCON (subsection
2.3.1) or driver routines
xyySVX (subsection
2.2.1). The code fragment
makes sure RCOND is at least
EPSMCH to
avoid overflow in computing
ERRBD.
This limits
ERRBD to a maximum of 1, which is no loss of generality since
a relative error of 1 or more indicates the same thing:
a complete loss of accuracy.
Note that the
value of RCOND returned by xyySVX may apply to a linear
system obtained from Ax = b by equilibration, i.e.
scaling the rows and columns of A in order to make the
condition number smaller. This is the case in the second
code fragment in the last section, where the program
chose to scale the rows by the factors returned in
and scale the columns by the factors returned in
,
resulting in
.
As stated in section
4.3.2,
this approach does not respect the presence
of zero or tiny entries in A. In contrast,
the LAPACK computational routines
xyyRFS (subsection
2.3.1) or driver routines xyySVX
(subsection
2.2.1) will (except in rare cases)
compute a solution
with the following properties:
(where we interpret 0 / 0 as 0)
subject to the constraint
.
The minimal value of
is given by
One can show that for most problems the
computed by xyySVX
satisfies
,
where p(n) is a modestly growing function of n.
In other words,
is the exact solution of the
perturbed problem
where E and f are small relative perturbations in each entry of A and
b, respectively.
The corresponding condition number is
.
The error
is bounded by
The routines xyyRFS and xyySVX return
, which is called BERR
(for Backward ERRor),
, and a bound on the actual error
, called FERR
(for Forward ERRor), as
in the second code fragment in the last section.
FERR is actually calculated by the following formula, which can
be smaller than the bound
given above:
Here,
is the computed value of the residual
, and
the norm in the numerator is estimated using the same estimation
subroutine used for RCOND.
The value of
BERR for the example in the last section is
.
Even in the rare cases where xyyRFS fails to make
BERR close to its minimum
, the error bound FERR
may remain small. See
[6]
for details.
The linear least squares problem is to find x that minimizes
.
We discuss error bounds for the most common case where A is m-by-n
with m > n, and A has full rank
;
this is called an overdetermined least squares problem
(the following code fragments deal with m = n as well).
Let
be the solution computed by one of the driver routines
xGELS, xGELSX or xGELSS (see section
2.2.2).
An approximate error
bound
may be computed in one of the following ways, depending on which type
of driver routine is used:
For example,
if
,
then, to 4 decimal places,
,
,
,
, and the true error
is
.
and the call to STRCON must be replaced by:
Applied to the same A and b as above, the computed
is
nearly the same,
,
, and the true error is
.
The conventional error analysis of linear least squares problems goes
as follows
.
As above, let
be the solution to minimizing
computed by
LAPACK using one of the least squares drivers xGELS, xGELSS or xGELSX
(see subsection
2.2.2).
We discuss the most common case, where A is
overdetermined
(i.e., has more rows than columns) and has full rank
[45]:
and p(n) is a modestly growing function of n. We take p(n) = 1 in
the code fragments above.
Let
(approximated by
1/RCOND in the above code fragments),
(= RNORM above), and
(SINT = RNORM / BNORM above). Here,
is the acute angle between
the vectors
and
.
Then when
is small, the error
is bounded by
where
= COST and
= TANT in the code fragments
above.
We avoid overflow by making sure RCOND and COST are both at least
EPSMCH, and by handling the case of a zero B matrix
separately (BNORM = 0).
may be computed directly
from the singular values of A returned by xGELSS (as in the code fragment) or
by xGESVD. It may also be approximated by using xTRCON following calls to
xGELS or xGELSX. xTRCON estimates
or
instead
of
, but these can differ from
by at most a factor of n.
If A is rank-deficient, xGELSS and xGELSX can be used to regularize the
problem
by treating all singular values
less than a user-specified threshold
(
) as
exactly zero. The number of singular values treated as nonzero is returned
in RANK. See
[45] for error bounds in this case, as well as
[45]
[19] for the
underdetermined
case.
The solution of the overdetermined,
full-rank problem may also be
characterized as the solution of the linear system of equations
By solving this linear system using xyyRFS or xyySVX (see section
4.4) componentwise error bounds can also be obtained
[7].
There are two kinds of generalized least squares problems that are discussed in
section
2.2.3: the linear equality-constrained least squares
problem, and the general linear model problem. Error bounds for
these problems will be included in a future version of this manual.
The eigendecomposition
of
an n-by-n real symmetric matrix is the
factorization
(
in the complex Hermitian
case), where Z is orthogonal (unitary) and
is real and diagonal,
with
.
The
are the eigenvalues
of A and the columns
of
Z are the eigenvectors
. This is also often written
. The eigendecomposition of a symmetric matrix is
computed
by the driver routines xSYEV, xSYEVX, xSYEVD, xSBEV, xSBEVX, xSBEVD,
xSPEV, xSPEVX, xSPEVD, xSTEV, xSTEVX and xSTEVD.
The complex counterparts of these routines, which compute the
eigendecomposition of complex Hermitian matrices, are
the driver routines xHEEV, xHEEVX, xHEEVD, xHBEV, xHBEVX, xHBEVD,
xHPEV, xHPEVX, and xHPEVD (see subsection
2.2.4).
The approximate error
bounds
for the computed eigenvalues
are
The approximate error bounds for the computed eigenvectors
, which bound the acute angles
between the computed eigenvectors and true
eigenvectors
, are:
These bounds can be computed by the following code fragment:
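A sketch of such a fragment for a dense single precision symmetric matrix follows; the printed Guide's fragment may differ in detail, and the array names are illustrative.
*     Assumed: REAL A(LDA,N), W(N), RCONDZ(N), ZERRBD(N), WORK(LWORK)
      EPSMCH = SLAMCH( 'E' )
*     Eigenvalues (ascending, in W) and eigenvectors (in A)
      CALL SSYEV( 'V', 'U', N, A, LDA, W, WORK, LWORK, INFO )
      IF( INFO.EQ.0 ) THEN
*        The 2-norm of A is its largest eigenvalue in absolute value
         ANORM = MAX( ABS( W( 1 ) ), ABS( W( N ) ) )
*        Bound on the error in each computed eigenvalue
         EERRBD = EPSMCH*ANORM
*        Gaps between eigenvalues, guaranteed safe as divisors
         CALL SDISNA( 'Eigenvectors', N, N, W, RCONDZ, INFO )
         DO 10 I = 1, N
*           Bound on the angular error in the i-th eigenvector
            ZERRBD( I ) = EPSMCH*( ANORM / RCONDZ( I ) )
   10    CONTINUE
      END IF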
For example,
if
and
then the eigenvalues, approximate error bounds, and true errors are
The usual error analysis of the
symmetric
eigenproblem (using any LAPACK
routine in subsection
2.2.4
or any EISPACK routine) is as follows
[64]:
Thus large eigenvalues (those near
)
are computed to high relative accuracy
and small ones may not be.
The angular difference between the computed unit eigenvector
and a true unit eigenvector
satisfies the approximate bound
if
is small enough.
Here
is the
absolute gap
between
and the nearest other eigenvalue. Thus, if
is close to other eigenvalues, its corresponding eigenvector
may be inaccurate.
The gaps may be easily computed from the array of computed eigenvalues
using subroutine SDISNA
.
The gaps computed by SDISNA are ensured not to be so small as
to cause overflow when used as divisors.
Let
be the invariant subspace spanned by a collection of eigenvectors
, where
is a subset of the
integers from 1 to n. Let S be the corresponding true subspace. Then
is the absolute gap between the eigenvalues in
and the nearest
other eigenvalue. Thus, a cluster
of
close eigenvalues which is
far away from any other eigenvalue may have a well determined
invariant subspace
even if its individual eigenvectors are
ill-conditioned
.
In the special case of a real symmetric tridiagonal matrix T, the eigenvalues
and eigenvectors can be computed much more accurately. xSYEV (and the other
symmetric eigenproblem drivers) computes the eigenvalues and eigenvectors of
a dense symmetric matrix by first reducing it to tridiagonal form
T, and then
finding the eigenvalues and eigenvectors of T.
Reduction of a dense matrix to tridiagonal form
T can introduce
additional errors, so the following bounds for the tridiagonal case do not
apply to the dense case.
where p(n) is a modestly growing function of n.
Thus if
is moderate, each eigenvalue will be computed
to high relative accuracy,
no matter how tiny it is.
The eigenvectors
computed by xPTEQR
can differ from true eigenvectors
by
at most about
if
is small enough, where
is the relative gap between
and the nearest other eigenvalue.
Since the relative gap may be much larger than the absolute gap, this
error bound may be much smaller than the previous one.
could be computed by applying
xPTCON (subsection
2.3.1) to H.
The relative gaps are easily computed from the
array of computed eigenvalues.
Jacobi's method
[69]
[76]
[24] is another
algorithm for finding eigenvalues and eigenvectors of symmetric matrices. It is
slower than the algorithms based on first tridiagonalizing the matrix, but is
capable of computing more accurate answers in several important cases. Routines
implementing Jacobi's method and corresponding error bounds will be available
in a future LAPACK release.
The nonsymmetric eigenvalue
problem
is more
complicated than the
symmetric eigenvalue problem. In this subsection,
we state the simplest bounds and leave the more complicated ones to
subsequent subsections.
Let A be an n-by-n nonsymmetric matrix, with eigenvalues
. Let
be a right eigenvector
corresponding to
:
.
Let
and
be the corresponding
computed eigenvalues and eigenvectors, computed by expert driver routine
xGEEVX (see subsection
2.2.4).
The approximate error bounds
for the computed eigenvalues are
The approximate error
bounds
for the computed eigenvectors
,
which bound the acute angles between the computed eigenvectors and true
eigenvectors
, are
These bounds can be computed by the following code fragment:
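A sketch of such a fragment using the expert driver SGEEVX (balancing by permutation only, as in the numerical example that follows) is given below; array names are illustrative.
*     Assumed: REAL A(LDA,N), WR(N), WI(N), VL(LDVL,N), VR(LDVR,N),
*     SCALE(N), RCONDE(N), RCONDV(N), EERRBD(N), VERRBD(N), WORK(LWORK)
      EPSMCH = SLAMCH( 'E' )
      CALL SGEEVX( 'P', 'V', 'V', 'B', N, A, LDA, WR, WI, VL, LDVL,
     $             VR, LDVR, ILO, IHI, SCALE, ABNRM, RCONDE, RCONDV,
     $             WORK, LWORK, IWORK, INFO )
      IF( INFO.EQ.0 ) THEN
         DO 10 I = 1, N
*           Approximate error bound for the i-th eigenvalue
            EERRBD( I ) = EPSMCH*ABNRM / RCONDE( I )
*           Approximate angular error bound for the i-th eigenvector
            VERRBD( I ) = EPSMCH*ABNRM / RCONDV( I )
   10    CONTINUE
      END IF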
For example, if
and
then true eigenvalues, approximate eigenvalues, approximate error bounds,
and true errors are
In this subsection, we will summarize all the available error bounds.
Later subsections will provide further details. The reader may also
refer to
[11].
Bounds for individual eigenvalues and eigenvectors are provided by
driver xGEEVX (subsection
2.2.4) or computational
routine xTRSNA (subsection
2.3.5).
Bounds for
clusters
of eigenvalues
and their associated invariant subspace are
provided by driver xGEESX (subsection
2.2.4) or
computational routine xTRSEN (subsection
2.3.5).
We let
be the i-th computed eigenvalue and
an i-th true eigenvalue.
Let
be the
corresponding computed right eigenvector, and
a true right
eigenvector (so
).
If
is a subset of the
integers from 1 to n, we let
denote the average of
the selected eigenvalues:
,
and similarly for
. We also let
denote the subspace spanned by
; it is
called a right invariant subspace because if v is any vector in
then
Av is also in
.
is the corresponding computed subspace.
The algorithms for the nonsymmetric eigenproblem are normwise backward stable:
they compute the exact eigenvalues, eigenvectors and invariant subspaces
of slightly perturbed matrices A + E, where
.
Some of the bounds are stated in terms of
and others in
terms of
; one may use
to approximate
either quantity.
The code fragment in the previous subsection approximates
by
, where
is returned by xGEEVX.
xGEEVX (or xTRSNA) returns two quantities for each
,
pair:
and
.
xGEESX (or xTRSEN) returns two quantities for a selected subset
of eigenvalues:
and
.
(or
) is a reciprocal condition number for the
computed eigenvalue
(or
),
and is referred to as RCONDE by xGEEVX (or xGEESX).
(or
) is a reciprocal condition number for
the right eigenvector
(or
), and
is referred to as RCONDV by xGEEVX (or xGEESX).
The approximate error bounds for eigenvalues, averages of eigenvalues,
eigenvectors, and invariant subspaces
provided in Table
4.5 are
true for sufficiently small ||E||, which is why they are called asymptotic.
If the problem is ill-conditioned, the asymptotic bounds may only hold
for extremely small ||E||. Therefore, in Table
4.6
we also provide global bounds
which are guaranteed to hold for all
.
We also have the following bound, which is true for all E:
all the
lie in the union of n disks,
where the i-th disk is centered at
and has
radius
. If k of these disks overlap,
so that any two points inside the k disks can be connected
by a continuous curve lying entirely inside the k disks,
and if no larger set of k + 1 disks has this property,
then exactly k of the
lie inside the
union of these k disks. Figure
4.1 illustrates
this for a 10-by-10 matrix, with 4 such overlapping unions
of disks, two containing 1 eigenvalue each, one containing 2
eigenvalues, and one containing 6 eigenvalues.
Finally, the quantities s and sep tell us how we can best
(block) diagonalize a matrix A by a similarity,
, where each diagonal block
has a selected subset of the eigenvalues of A. Such a decomposition
may be useful in computing functions of matrices, for example.
The goal is to choose a V with a nearly minimum condition number
which performs this decomposition, since this generally minimizes the error
in the decomposition.
This may be done as follows. Let
be
-by-
. Then columns
through
of V span the invariant
subspace
of A corresponding
to the eigenvalues of
; these columns should be chosen to be any
orthonormal basis of this space (as computed by xGEESX, for example).
Let
be the value corresponding to the
cluster of
eigenvalues of
, as computed by xGEESX or xTRSEN. Then
, and no other choice of V can make
its condition number smaller than
[17].
Thus choosing orthonormal
subblocks of V gets
to within a factor b of its minimum
value.
In the case of a real symmetric (or complex Hermitian) matrix,
s = 1 and sep is the absolute gap, as defined in subsection
4.7.
The bounds in Table
4.5 then reduce to the
bounds in subsection
4.7.
There are two preprocessing
steps
one may perform
on a matrix A in order
to make its eigenproblem easier. The first is permutation, or
reordering the rows and columns to make A more nearly upper triangular
(closer to Schur form):
, where P is a permutation matrix.
If
is permutable to upper triangular form (or close to it), then
no floating-point operations (or very few) are needed to reduce it to
Schur form.
The second is scaling
by a diagonal matrix D to make the rows and
columns of
more nearly equal in norm:
. Scaling
can make the matrix norm smaller with respect to the eigenvalues, and so
possibly reduce the inaccuracy contributed by roundoff
(see Wilkinson and Reinsch [Chap. II/11]). We refer to these two operations as balancing.
Balancing is performed by driver xGEEVX, which calls
computational routine xGEBAL. The user may tell xGEEVX to optionally
permute, scale, do both, or do neither; this is specified by input
parameter BALANC. Permuting has no effect on
the condition numbers
or their interpretation as described in previous
subsections. Scaling, however, does change their interpretation,
as we now describe.
The output parameters of xGEEVX - SCALE (real array of length N),
ILO (integer), IHI (integer) and ABNRM (real) - describe
the result of
balancing a matrix A into
, where N is the dimension of A.
The matrix
is block upper triangular, with at most three blocks:
from 1 to ILO - 1, from ILO to IHI, and from IHI + 1 to N.
The first and last blocks are upper triangular, and so already in Schur
form. These are not scaled; only the block from ILO to IHI is scaled.
Details of the scaling and permutation are described in SCALE (see the
specification of xGEEVX or xGEBAL for details)
. The one-norm of
is returned in ABNRM.
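If the balancing is performed explicitly, the condition number of the diagonal scaling D used in the adjustment below can be estimated from SCALE. The following is a rough illustration of that estimate, not an LAPACK fragment.
*     Balance A (permute and scale); SCALE(ILO:IHI) then holds the
*     diagonal scale factors of D, while the remaining entries of
*     SCALE describe the permutation
      CALL SGEBAL( 'B', N, A, LDA, ILO, IHI, SCALE, INFO )
      DMAX = SCALE( ILO )
      DMIN = SCALE( ILO )
      DO 10 I = ILO, IHI
         DMAX = MAX( DMAX, SCALE( I ) )
         DMIN = MIN( DMIN, SCALE( I ) )
   10 CONTINUE
*     KAPPAD is a simple estimate of the condition number of D
      KAPPAD = DMAX / DMIN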
The condition numbers
described in earlier subsections are computed for
the balanced matrix
, and so some interpretation is needed to
apply them to the eigenvalues and eigenvectors of the original matrix A.
To use the bounds for eigenvalues in Tables
4.5 and
4.6,
we must replace
and
by
. To use the
bounds for eigenvectors, we also need to take into account that bounds
on rotations of eigenvectors are for the eigenvectors
of
, which are related to the eigenvectors x of A by
, or
. One coarse but simple way to do this is
as follows: let
be the bound on rotations of
from
Table
4.5 or Table
4.6
and let
be the desired bound on rotation of x. Let
be the condition number of D.
Then
The numerical example in subsection
4.8 does no scaling,
just permutation.
LAPACK is designed to give high efficiency
on vector processors,
high-performance ``super-scalar'' workstations, and
shared memory multiprocessors.
LAPACK in its present form
is less likely to give good performance on other types of
parallel architectures (for example,
massively parallel SIMD machines, or distributed memory machines),
but work has begun to try to adapt LAPACK to these new
architectures.
LAPACK can also be used satisfactorily on all types of scalar machines
(PC's, workstations, mainframes).
See Chapter
3 for some examples of the
performance achieved by LAPACK routines.
To explain s and sep
, we need to
introduce
the spectral projector P
[56]
[72], and the
separation of two matrices
A and B, sep(A , B)
[75]
[72].
We may assume the matrix A is in Schur form, because reducing it
to this form does not change the values of s and sep.
Consider a cluster of m >= 1 eigenvalues, counting multiplicities.
Further assume the n-by-n matrix A is
where the eigenvalues of the m-by-m matrix
are exactly those in which we are
interested. In practice, if the eigenvalues on the diagonal of A
are in the wrong order, routine xTREXC
can be used to put the desired ones in the upper left corner
as shown.
We define the spectral projector, or simply projector P belonging
to the eigenvalues of
as
where R satisfies the system of linear equations
Equation (
4.3) is called a Sylvester equation
.
Given the Schur form (
4.1), we solve equation
(
4.3) for R using the subroutine xTRSYL.
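As an illustration, if A is in Schur form with the selected m-by-m block A11 in its upper left corner, the call might look as follows. This is a sketch that assumes the Sylvester equation is written as A11*R - R*A22 = A12; the sign convention in (4.3) may differ, and R, LDR and SCL are illustrative names.
*     Copy the off-diagonal block A12 into R, then solve for R in place
      CALL SLACPY( 'Full', M, N-M, A( 1, M+1 ), LDA, R, LDR )
      CALL STRSYL( 'N', 'N', -1, M, N-M, A, LDA, A( M+1, M+1 ), LDA,
     $             R, LDR, SCL, INFO )
*     On exit R holds the solution scaled by SCL (SCL <= 1 is chosen
*     by STRSYL to avoid overflow)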
We can now define s for the eigenvalues of
:
In practice we do not use this expression since
is hard to
compute. Instead we use the more easily computed underestimate
which can underestimate the true value of s by no more than a factor
.
This underestimation makes our error bounds more conservative.
This approximation of s is called RCONDE in xGEEVX and xGEESX.
The separation
of the matrices
and
is defined as the smallest singular value of the linear
map in (
4.3) which takes X to
, i.e.,
This formulation lets us estimate
using the condition estimator
xLACON
[52]
[51]
[48], which estimates the norm of
a linear operator
given the ability to compute T and
quickly for arbitrary x.
In our case, multiplying an
arbitrary vector by T
means solving the Sylvester equation (
4.3)
with an arbitrary right hand side using xTRSYL, and multiplying by
means solving the same equation with
replaced by
and
replaced by
. Solving either equation
costs at most
operations, or as few as
if m << n.
Since the true value of sep is
but we use
,
our estimate of sep may differ from the true value by as much as
. This approximation to sep is called
RCONDV by xGEEVX and xGEESX.
Another formulation which in principle permits an exact evaluation of
is
where
is the Kronecker product of X and Y.
This method is
generally impractical, however, because the matrix whose smallest singular
value we need is m(n - m) dimensional, which can be as large as
. Thus we would require as much as
extra workspace and
operations, much more than the estimation method of the last
paragraph.
The expression
measures the ``separation'' of
the spectra
of
and
in the following sense. It is zero if and only if
and
have a common eigenvalue, and small if there is a small
perturbation of either one that makes them have a common eigenvalue. If
and
are both Hermitian matrices, then
is just the gap, or minimum distance between an eigenvalue of
and an
eigenvalue of
. On the other hand, if
and
are
non-Hermitian,
may be much smaller than
this gap.
The singular
value decomposition (SVD) of a
real m-by-n matrix A is defined as follows. Let r = min(m , n).
Then the SVD of A is
(
in the complex case),
where
U and V are orthogonal (unitary) matrices and
is diagonal,
with
.
The
are the singular values of A and the leading
r columns
of U and
of V the
left and right singular vectors, respectively.
The SVD of a general matrix is computed by xGESVD
(see subsection
2.2.4).
The approximate error
bounds
for the computed singular values
are
The approximate error bounds for the computed singular vectors
and
,
which bound the acute angles between the computed singular vectors and true
singular vectors
and
, are
These bounds can be computed by the following code fragment.
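A sketch of such a fragment, assuming a single precision matrix with m >= n and illustrative array names, follows; the printed Guide's fragment may differ in detail.
*     Assumed: REAL A(LDA,N), S(N), U(LDU,N), VT(LDVT,N),
*     RCONDU(N), RCONDV(N), UERRBD(N), VERRBD(N), WORK(LWORK)
      EPSMCH = SLAMCH( 'E' )
      CALL SGESVD( 'S', 'S', M, N, A, LDA, S, U, LDU, VT, LDVT,
     $             WORK, LWORK, INFO )
      IF( INFO.EQ.0 ) THEN
*        Bound on the error in each computed singular value
         SERRBD = EPSMCH*S( 1 )
*        Gaps between singular values, for the singular vector bounds
         CALL SDISNA( 'Left', M, N, S, RCONDU, INFO )
         CALL SDISNA( 'Right', M, N, S, RCONDV, INFO )
         DO 10 I = 1, N
            UERRBD( I ) = EPSMCH*S( 1 ) / RCONDU( I )
            VERRBD( I ) = EPSMCH*S( 1 ) / RCONDV( I )
   10    CONTINUE
      END IF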
For example,
if
and
then the singular values, approximate error bounds, and true errors are given below.
The usual error analysis of the SVD algorithm
xGESVD in LAPACK (see
subsection
2.2.4) or the
routines in LINPACK and EISPACK is as follows
[45]:
(we take p(m , n) = 1 in the code fragment).
Thus large singular values (those near
) are computed to
high relative accuracy
and small ones may not be.
The angular difference between the computed left singular vector
and a true
satisfies the approximate bound
where
is the
absolute gap
between
and the nearest other singular value.
We take p(m , n) = 1 in the code fragment.
Thus, if
is close to other singular values, its corresponding singular vector
may be inaccurate. When n > m, then
must be redefined
as
.
The gaps may be easily computed from the array of computed singular values
using function
SDISNA.
The gaps computed by SDISNA are ensured not to be so small as
to cause overflow when used as divisors.
The same bound applies to the computed right singular
vector
and a true vector
.
Let
be the space spanned by a collection of computed left singular
vectors
, where
is a subset
of the integers from 1 to n. Let S be the corresponding true space.
Then
where
is the absolute gap between the singular values in
and the nearest
other singular value. Thus, a cluster
of close singular values which is
far away from any other singular value may have a well determined
space
even if its individual singular vectors are ill-conditioned.
The same bound applies to a set of right singular vectors
.
In the special case of bidiagonal matrices, the singular values and
singular vectors may be computed much more accurately. A bidiagonal
matrix B has nonzero entries only on the main diagonal and the diagonal
immediately
above it (or immediately below it). xGESVD computes the SVD of a general
matrix by first reducing it to bidiagonal form B, and then calling xBDSQR
(subsection
2.3.6)
to compute the SVD of B.
Reduction of a dense matrix to bidiagonal form B can introduce
additional errors, so the following bounds for the bidiagonal case
do not apply to the dense case.
The computed left singular vector
has an angular error
at most about
where
is the relative gap between
and the nearest other singular
value. The same bound applies to the right singular vector
and
.
Since the relative gap
may be much larger than
the absolute gap
,
this error bound may be much smaller than the previous one. The relative gaps
may be easily computed from the array of computed singular values.
In the very special case of 2-by-2 bidiagonal matrices, xBDSQR
calls auxiliary routine xLASV2 to compute the SVD. xLASV2 will
actually compute nearly correctly rounded singular vectors independent of
the relative gap, but this requires accurate computer arithmetic:
if leading digits cancel during floating-point subtraction, the resulting
difference must be exact.
On machines without guard digits one has the slightly weaker result that the
algorithm is componentwise relatively backward stable, and therefore the
accuracy
of the singular vectors depends on the relative gap as described
above.
Jacobi's method
[69]
[76]
[24] is another
algorithm for finding singular values and singular vectors of matrices.
It is slower than the algorithms based on first bidiagonalizing the matrix,
but is capable of computing more accurate answers in several important cases.
Routines implementing Jacobi's method and corresponding error bounds will be
available in a future LAPACK release.
There are three types of problems to consider.
In all cases A and B
are real symmetric (or complex Hermitian) and B is positive definite.
These decompositions are computed for real symmetric matrices
by the driver routines
xSYGV, xSPGV and (for type 1 only) xSBGV (see subsection
2.2.5.1).
These decompositions are computed for complex Hermitian matrices
by the driver routines
xHEGV, xHPGV and (for type 1 only) xHBGV (see subsection
2.2.5.1).
In each of the following three decompositions,
is real and diagonal with diagonal entries
, and
the columns
of Z are linearly independent vectors.
The
are called
eigenvalues and the
are
eigenvectors.
The approximate error bounds
for the computed eigenvalues
are
The approximate error
bounds
for the computed eigenvectors
,
which bound the acute angles between the computed eigenvectors and true
eigenvectors
, are
These bounds are computed differently, depending on which of the above three
problems are to be solved. The following code fragments show how.
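For the first problem type solved with xSYGV, a sketch of the ingredients ANORM, BNORM and RCOND (the quantities quoted in the example below) follows; names and workspace sizes are illustrative, not the Guide's exact fragment.
*     Assumed: REAL A(LDA,N), B(LDB,N), W(N), WORK(LWORK);
*     INTEGER IWORK(N)
      EPSMCH = SLAMCH( 'E' )
*     Norms of A and B, computed before the matrices are overwritten
      ANORM = SLANSY( '1', 'U', N, A, LDA, WORK )
      BNORM = SLANSY( '1', 'U', N, B, LDB, WORK )
*     Solve A*z = lambda*B*z (problem type 1)
      CALL SSYGV( 1, 'V', 'U', N, A, LDA, B, LDB, W, WORK, LWORK,
     $            INFO )
      IF( INFO.EQ.0 ) THEN
*        On exit the upper triangle of B holds the Cholesky factor of B
         CALL STRCON( '1', 'U', 'N', N, B, LDB, RCOND, WORK, IWORK,
     $                INFO )
*        RCOND, ANORM and BNORM feed the error bounds stated above
         RCOND = MAX( RCOND, EPSMCH )
      END IF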
For example, if
,
then ANORM = 120231,
BNORM = 120, and
RCOND = .8326, and
the approximate eigenvalues, approximate error bounds,
and true errors are
This code fragment cannot be adapted to use xSBGV (or xHBGV),
because xSBGV does not return a conventional Cholesky factor in B,
but rather a ``split'' Cholesky factorization (performed by xPBSTF).
A future LAPACK release will include error bounds for xSBGV.
For the same A and B as above, the approximate eigenvalues,
approximate error bounds, and true errors are
The error analysis of the driver routine xSYGV, or xHEGV in the complex case
(see subsection
2.2.5.1),
goes as follows.
In all cases
is
the absolute gap
between
and the nearest other eigenvalue.
The angular difference between the computed eigenvector
and a true eigenvector
is
The angular difference between the computed eigenvector
and a true eigenvector
is
The code fragments above replace p(n) by 1, and make sure
neither RCONDB nor RCONDZ is so small as to cause
overflow when used as divisors in the expressions for error bounds.
These error bounds are large when B is ill-conditioned with respect to
inversion (
is large). It is often the case that the eigenvalues
and eigenvectors are much better conditioned than indicated here.
We mention three ways to get tighter bounds.
The first way is effective when the diagonal entries of B differ
widely in magnitude
:
The second way to get tighter bounds does not actually supply guaranteed
bounds, but its estimates are often better in practice.
It is not guaranteed because it assumes the algorithm is backward stable,
which is not necessarily true when B is ill-conditioned.
It estimates the chordal distance between a
true eigenvalue
and a computed eigenvalue
:
To interpret this measure we write
and
. Then
.
In other words, if
represents the one-dimensional subspace
consisting of the line through the origin with slope
,
and
represents the analogous subspace S, then
is the sine of the acute angle
between these
subspaces.
Thus X is bounded by one, and is small when both arguments are
large
.
It applies only to the first problem,
:
The third way applies only to the first problem
, and only
when A is positive definite. We use a different algorithm:
Other yet more refined algorithms and error bounds are discussed in
[78]
[73]
[13], and will be available in
future releases.
The generalized nonsymmetric eigenproblem is discussed in
section
2.2.5.2,
and has error bounds which are analogous to those
for the nonsymmetric eigenvalue problem presented in section
4.8. These
bounds will be computed by future LAPACK routines xGGEVX and xGGESX, and discussed in
a future version of this manual.
The generalized (or quotient) singular value decomposition
of an m-by-n matrix
A and a p-by-n matrix B is the pair of factorizations
where U, V, Q, R,
and
are defined
as follows.
The generalized singular value decomposition is
computed by driver routine xGGSVD (see section
2.2.5.3).
We will give error bounds for the generalized
singular values in the
common case where
has full
rank r = n.
Let
and
be the values of
and
, respectively,
computed by xGGSVD.
The approximate error
bound
for these values is
Note that if
is close to zero, then a true
generalized singular value
can differ greatly in magnitude from
the computed generalized singular value
, even if SERRBD is
close to its minimum
.
Here is another way to interpret SERRBD:
if we think of
and
as representing the subspace S
consisting of the straight line through the origin with slope
, and similarly
and
representing the subspace
,
then SERRBD bounds the acute angle between
S and
.
Note that any two
lines through the origin with nearly vertical slopes
(very large
) are close together in angle.
(This is related to the chordal distance in
section
4.10.1.)
SERRBD can be computed by the following code fragment,
which for simplicity assumes m >= n.
(The assumption r = n implies only that p + m >= n.
Error bounds can also be computed when p + m >= n > m,
with slightly more complicated code.)
For example, if
,
then, to 4 decimal places,
, and the true errors
are
,
and
.
The GSVD algorithm used in LAPACK (
[12]
[10]
[62])
is backward stable:
there exist small
,
, and
such that
,
, and
are exactly orthogonal (or unitary):
and
is the exact GSVD of A + E and B + F. Here p(n) is a modestly growing function of n, and
we take p(n) = 1 in the above code fragment.
Let
and
be the square roots of the diagonal entries of the exact
and
,
and let
and
the square roots of the diagonal entries
of the computed
and
.
Let
Then provided G and
have full rank n, one can show
[61]
[74] that
In the code fragment we approximate the numerator of the last expression by
and approximate the denominator by
in order to compute SERRBD;
STRCON returns an approximation RCOND to
.
We assume that the rank r of G equals n, because otherwise the
s and
s are not well determined. For example, if
then A and B have
and
, whereas
and
have
and
, which
are completely different, even though
and
. In this case,
,
so G is nearly rank-deficient.
The reason the code fragment assumes m > = n is that in this case
is
stored overwritten on A, and can be passed to STRCON in order to compute
RCOND. If m < n, then the
first m rows of
are
stored in A, and the last n - m rows of
are stored in B. This
complicates the computation of RCOND: either
must be copied to
a single array before calling STRCON, or else the lower level subroutine SLACON
must be used with code capable of solving linear equations with
and
as coefficient matrices.
The Level 3 BLAS specifications
[28] specify the input, output
and calling sequence for each routine, but allow freedom of
implementation, subject to the requirement that the routines be
numerically stable
.
Level 3 BLAS implementations can therefore be
built using matrix multiplication algorithms that achieve a more
favorable operation count (for suitable dimensions) than the standard
multiplication technique, provided that these ``fast'' algorithms are
numerically stable. The simplest fast matrix multiplication
technique is Strassen's
method
, which can
multiply two n-by-n matrices in O(n^(log2 7)) = O(n^2.81) operations.
The effect on the results in this chapter of using a fast Level 3 BLAS
implementation can be explained as follows. In general, reasonably
implemented fast Level 3 BLAS preserve all the bounds presented here
(except those at the end of subsection
4.10), but the constant
p(n) may increase somewhat. Also, the iterative refinement
routine
xyyRFS may take more steps to converge.
This is what we mean by reasonably implemented fast Level 3 BLAS.
Here,
denotes a constant depending on the specified matrix dimensions.
(1) If A is m-by-n, B is n-by-p and
is the computed
approximation to C = AB, then
(2)
The computed solution
to the triangular systems TX = B,
where T is m-by-m and B is m-by-p, satisfies
For conventional Level 3 BLAS implementations these conditions
hold with
and
.
Strassen's method
satisfies these
bounds for slightly larger
and
.
For further details, and references to fast multiplication techniques,
see
[20].
An m-by-n band matrix with kl subdiagonals and ku superdiagonals may be
stored compactly in a two-dimensional array with kl+ku+1 rows and n columns.
Columns of the matrix are stored in corresponding columns of the
array, and diagonals of the matrix are stored in rows of the array.
This storage scheme should be used in practice only if kl, ku << min(m, n),
although LAPACK routines work correctly for all values of kl and ku.
In LAPACK, arrays that hold matrices in band storage have names
ending in `B'.
To be precise, a(i,j) is stored in AB(ku+1+i-j, j) for
max(1, j-ku) <= i <= min(m, j+kl).
For example, when
,
and
:
The elements marked
in the upper left and lower right
corners of the array AB need not be set, and are not referenced by
LAPACK routines.
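As an illustration of this scheme (this loop is not an LAPACK routine), the following fragment packs a general n-by-n matrix A with kl subdiagonals and ku superdiagonals into the band array AB:
*     Assumed: REAL A(LDA,N), AB(KL+KU+1,N)
      DO 20 J = 1, N
         DO 10 I = MAX( 1, J-KU ), MIN( N, J+KL )
            AB( KU+1+I-J, J ) = A( I, J )
   10    CONTINUE
   20 CONTINUE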
Note: when a band matrix is supplied for LU factorization, space
must be allowed to store an additional kl superdiagonals, generated by
fill-in as a result of row interchanges. This means that the matrix is
stored according to the above scheme, but with kl + ku superdiagonals.
Triangular band matrices are stored in the same format, with either
kl = 0 if upper triangular, or ku = 0 if lower triangular.
For symmetric or Hermitian band matrices with
subdiagonals or
superdiagonals, only the upper or lower triangle (as specified by
UPLO) need be stored:
For example, when
and
:
EISPACK
routines use a different storage scheme for band matrices,
in which rows of the matrix are stored in corresponding rows of the
array, and diagonals of the matrix are stored in columns of the array
(see Appendix
D).
An unsymmetric tridiagonal matrix of order n is stored in three
one-dimensional arrays, one of length n containing the diagonal elements,
and two of length n-1 containing the subdiagonal and superdiagonal
elements in elements 1 to n-1.
A symmetric tridiagonal or bidiagonal matrix is stored in two
one-dimensional arrays, one of length n containing the diagonal elements,
and one of length n-1 containing the off-diagonal elements. (EISPACK
routines store the off-diagonal elements in elements 2 to n of a vector
of length n.)
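For example, a symmetric tridiagonal matrix T can be placed in the arrays D and E and passed directly to the driver SSTEV; the following is a sketch with illustrative names.
*     Assumed: REAL T(LDT,N), D(N), E(N-1), Z(LDZ,N), WORK(2*N-2)
      DO 10 I = 1, N
         D( I ) = T( I, I )
   10 CONTINUE
      DO 20 I = 1, N - 1
         E( I ) = T( I+1, I )
   20 CONTINUE
*     Eigenvalues returned in D (ascending), eigenvectors in Z
      CALL SSTEV( 'V', N, D, E, Z, LDZ, WORK, INFO )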
The generalized
(GRQ) factorization of an
-by-
matrix
and
a
-by-
matrix
is given by the pair of factorizations
where
and
are respectively
-by-
and
-by-
orthogonal
matrices (or unitary matrices if
and
are complex).
has the form
or
where
or
is upper triangular.
has the form
or
where
is upper triangular.
Note that if
is square and nonsingular, the GRQ factorization of
and
implicitly gives the
factorization of the matrix
:
without explicitly computing the matrix inverse
or the product
.
The routine xGGRQF computes the GRQ factorization
by first computing the
factorization of
and then
the
factorization of
.
The orthogonal (or unitary) matrices
and
can either be formed explicitly or
just used to multiply another given matrix in the same way as the
orthogonal (or unitary) matrix
in the
factorization
(see section
2.3.2).
The GRQ factorization can be used to solve the linear
equality-constrained least squares problem (LSE) (see (
2.2) and
Golub and Van Loan [page 567]).
We use the GRQ factorization of
and
(note that
and
have
swapped roles), written as
We write the linear equality constraints
as:
which we partition as:
Therefore
is the solution of the upper triangular system
Furthermore,
We partition this expression as:
where
, which
can be computed by xORMQR (or xUNMQR).
To solve the LSE problem, we set
which gives
as the solution of the upper triangular system
Finally, the desired solution is given by
which can be computed
by xORMRQ (or xUNMRQ).
The world of modern computing potentially offers many helpful methods
and tools to scientists and engineers, but the fast pace of change in
computer hardware, software, and algorithms often makes practical use of
the newest computing technology difficult. The Scientific and
Engineering Computation series focuses on rapid advances in computing
technologies and attempts to facilitate transferring these technologies
to applications in science and engineering. It will include books on
theories, methods, and original applications in such areas as
parallelism, large-scale simulations, time-critical computing,
computer-aided design and engineering, use of computers in
manufacturing, visualization of scientific data, and human-machine
interface technology.
The series will help scientists and engineers to understand the current
world of advanced computation and to anticipate future developments that
will impact their computing environments and open up new capabilities
and modes of computation.
This book is about the Message Passing Interface (MPI),
an important and increasingly popular standardized and portable
message passing system that brings us closer to the potential
development of practical and cost-effective large-scale parallel applications.
It gives a complete specification of the MPI standard and
provides illustrative programming examples.
This advanced level book supplements the companion, introductory volume
in the Series by William Gropp, Ewing Lusk and Anthony Skjellum,
Using MPI: Portable Parallel Programming with the Message-Passing
Interface.
Janusz S. Kowalik
Preface
MPI, the Message Passing Interface, is a standardized and portable
message-passing system designed by a group of researchers from
academia and industry to function on a wide variety of parallel
computers.
The standard defines the syntax and semantics of a core of
library routines useful to a wide range of users writing
portable message-passing programs in Fortran 77 or C.
Several well-tested and efficient implementations
of MPI already exist, including some that are free and in
the public domain.
These are beginning to foster the development of a parallel
software industry, and there is excitement among computing
researchers and vendors that the development of portable and
scalable, large-scale parallel applications is now feasible.
The MPI standardization effort involved over 80 people
from 40 organizations,
mainly from the United States and Europe. Most of the major vendors of
concurrent computers at the time were involved in MPI,
along with researchers from
universities, government laboratories, and industry.
The standardization
process began with the Workshop on
Standards for Message Passing in a Distributed Memory Environment,
sponsored by the Center for Research on Parallel Computing,
held April 29-30,
1992, in Williamsburg, Virginia
[29].
A preliminary draft proposal, known as MPI1,
was put forward by Dongarra,
Hempel, Hey, and Walker in November 1992,
and a revised version was completed
in February 1993
[11].
In November 1992, a
meeting of the MPI working group was held in Minneapolis,
at which it was
decided to place the standardization process on a more formal
footing.
The MPI working group met every
6 weeks throughout the first 9 months of 1993. The
draft MPI standard was presented at the
Supercomputing '93 conference in November 1993.
After a period of public comments, which resulted in some
changes in MPI, version 1.0 of MPI was released in June
1994.
These meetings and the email
discussion together constituted the MPI Forum, membership
of which has been open to all members of the
high performance computing community.
This book serves as an annotated reference manual for MPI, and
a complete specification of the standard is presented. We repeat
the material already published in the MPI specification
document
[15], though an attempt to clarify
has been made. The annotations mainly take the form of explaining
why certain design choices were made, how users are meant to use the
interface, and how MPI implementors should construct a version
of MPI. Many detailed, illustrative programming examples are
also given, with an eye toward illuminating the more advanced or
subtle features of MPI.
The complete interface is presented in this book, and we are not
hesitant to explain even the most esoteric features or consequences
of the standard. As such, this volume does not work as a gentle
introduction to MPI, nor as a tutorial. For such purposes, we recommend the
companion volume in this series by William Gropp, Ewing Lusk, and Anthony
Skjellum, Using MPI: Portable Parallel Programming with the
Message-Passing Interface. The parallel application developer will
want to have copies of both books handy.
For a first reading, and as a good introduction to MPI, the reader
should first read: Chapter 1, through
Section
; the material on
point to point communications covered in
Sections
through
and Section
;
the simpler forms of collective communications explained in Sections
through
; and the basic
introduction to communicators given in
Sections
through
.
This will give a fair understanding
of MPI, and will allow the construction of parallel applications
of moderate complexity.
This book is based on the
hard work of many people in the MPI Forum. The
authors gratefully recognize the members of the forum,
especially the contributions made by members who served
in positions of responsibility: Lyndon Clarke, James Cownie,
Al Geist, William Gropp, Rolf Hempel, Robert Knighten, Richard Littlefield,
Ewing Lusk, Paul Pierce, and Anthony Skjellum. Other contributors were:
Ed Anderson, Robert Babb, Joe Baron, Eric Barszcz, Scott Berryman,
Rob Bjornson, Nathan Doss, Anne Elster, Jim Feeney, Vince Fernando,
Sam Fineberg, Jon Flower, Daniel Frye, Ian Glendinning, Adam Greenberg,
Robert Harrison, Leslie Hart, Tom Haupt, Don Heller, Tom Henderson, Anthony Hey,
Alex Ho, C.T. Howard Ho, Gary Howell, John Kapenga, James Kohl, Susan Krauss,
Bob Leary, Arthur Maccabe, Peter Madams, Alan Mainwaring, Oliver McBryan,
Phil McKinley, Charles Mosher, Dan Nessett, Peter Pacheco, Howard Palmer,
Sanjay Ranka, Peter Rigsbee, Arch Robison, Erich Schikuta, Mark Sears,
Ambuj Singh, Alan Sussman, Robert Tomlinson, Robert G. Voigt, Dennis Weeks,
Stephen Wheat, and Steven Zenith.
We especially thank William Gropp and Ewing Lusk for help in formatting
this volume.
Support for MPI meetings came in part from ARPA and NSF
under grant ASC-9310330, NSF Science and Technology Center
Cooperative agreement No. CCR-8809615, and the Commission of the
European Community through Esprit Project P6643. The University of
Tennessee also made financial contributions to the MPI Forum.
MPI_Gatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_GATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_GATHERV extends the functionality of MPI_GATHER
by allowing a varying count of data from each process, since
recvcounts
is now an array. It also allows more flexibility as to where the data
is placed on the root, by providing the new argument, displs.
The outcome is as if each process, including the root process,
sends a message to the root,
MPI_Send(sendbuf, sendcount, sendtype, root, ...)
and the root executes n receives,
MPI_Recv(recvbuf + displs[i]*extent(recvtype), recvcounts[i], recvtype, i, ...).
The data sent from process j is
placed in the jth portion of the receive buffer recvbuf
on process root. The jth portion of recvbuf
begins at offset displs[j] elements (in terms of
recvtype) into recvbuf.
The receive buffer is ignored for all non-root processes.
The type signature implied by sendcount and sendtype on process i
must be equal to the type signature implied by recvcounts[i]
and recvtype
at the root.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between each process and the root.
Distinct type maps between sender and receiver are still allowed,
as illustrated in Example
.
All arguments to the function are significant on process root,
while on other processes, only arguments sendbuf, sendcount,
sendtype, root, and comm are significant.
The argument root
must have identical values on all processes,
and comm must represent the same intragroup communication
domain.
The specification of counts, types, and displacements
should not cause any location on the root to be written more than
once. Such a call is erroneous. On the other hand, the successive
displacements in the array displs need not be a monotonic sequence.
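A small Fortran sketch of a gather with varying counts, in which process i contributes i+1 integers (names and array sizes are illustrative, not part of the standard):
      INCLUDE 'mpif.h'
      INTEGER MAXP
      PARAMETER ( MAXP = 64 )
      INTEGER SENDBUF( MAXP ), RECVBUF( MAXP*MAXP )
      INTEGER RCOUNTS( MAXP ), DISPLS( MAXP )
      INTEGER MYRANK, NPROCS, SENDCT, I, IERR
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYRANK, IERR )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, NPROCS, IERR )
*     Process MYRANK sends MYRANK+1 copies of its rank
      SENDCT = MYRANK + 1
      DO 5 I = 1, SENDCT
         SENDBUF( I ) = MYRANK
    5 CONTINUE
*     Counts and displacements expected by the root
      RCOUNTS( 1 ) = 1
      DISPLS( 1 ) = 0
      DO 10 I = 2, NPROCS
         RCOUNTS( I ) = I
         DISPLS( I ) = DISPLS( I-1 ) + RCOUNTS( I-1 )
   10 CONTINUE
      CALL MPI_GATHERV( SENDBUF, SENDCT, MPI_INTEGER, RECVBUF,
     $                  RCOUNTS, DISPLS, MPI_INTEGER, 0,
     $                  MPI_COMM_WORLD, IERR )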
MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_SCATTER is the inverse operation to MPI_GATHER.
The outcome is as if the root executed n send operations,
MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...), i = 0 to n - 1.
and each process executed a receive,
MPI_Recv(recvbuf, recvcount, recvtype, root,...).
An alternative description is that the root sends a message with
MPI_Send(sendbuf, sendcount*n, sendtype, ...). This
message is split into n equal segments, the ith segment is sent to the ith process in the group, and each process receives
this message as above.
The type signature associated with sendcount and sendtype at the root
must be equal to the type signature associated with
recvcount and recvtype at all
processes.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between each process and the root.
Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process
root, while on other processes, only arguments
recvbuf, recvcount, recvtype, root, comm are significant.
The argument root
must have identical
values on all processes and comm must represent the same
intragroup communication domain.
The send buffer is ignored for all non-root
processes.
The specification of counts and types
should not cause any location on the root to be read more than
once.
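A Fortran sketch in which the root scatters 100 reals to every process (sizes are illustrative; the root would fill SENDBUF before the call):
      INCLUDE 'mpif.h'
      INTEGER MAXP
      PARAMETER ( MAXP = 64 )
      REAL SENDBUF( 100*MAXP ), RECVBUF( 100 )
      INTEGER IERR
*     Only the root's SENDBUF is read; every process receives 100 reals
      CALL MPI_SCATTER( SENDBUF, 100, MPI_REAL, RECVBUF, 100,
     $                  MPI_REAL, 0, MPI_COMM_WORLD, IERR )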
MPI_Scatterv(void* sendbuf, int *sendcounts, int *displs, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_SCATTERV(SENDBUF, SENDCOUNTS, DISPLS, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_SCATTERV is the inverse operation to MPI_GATHERV.
MPI_SCATTERV extends the functionality of MPI_SCATTER
by allowing a varying count of data to be sent to each process,
since sendcounts is now an array.
It also allows more flexibility as to where the data
is taken from on the root, by providing the new argument, displs.
The outcome is as if the root executed n send operations,
MPI_Send(sendbuf + displs[i]*extent(sendtype), sendcounts[i], sendtype, i, ...), i = 0 to n - 1,
and each process executed a receive,
MPI_Recv(recvbuf, recvcount, recvtype, root,...).
The type signature implied by sendcount[i] and sendtype at the root
must be equal to the type signature implied by
recvcount and recvtype at process
i.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between each process and the root.
Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root,
while on other processes, only arguments recvbuf, recvcount,
recvtype, root, comm are significant.
The arguments root
must have identical values on all processes, and comm must
represent the same intragroup communication domain.
The send buffer is ignored for all non-root processes.
The specification of counts, types, and displacements
should not cause any location on the root to be read more than
once.
MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLGATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_ALLGATHER can be thought of as MPI_GATHER, except
all processes receive the result, instead of just the root.
The jth block of data sent from each process is received
by every process and placed in the jth block of the
buffer recvbuf.
The type signature associated with sendcount and sendtype
at a process must be equal to the type signature associated with
recvcount and recvtype at any other process.
The outcome of a call to MPI_ALLGATHER(...) is as if
all processes executed n calls to
MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount,
recvtype, root, comm),
for root = 0 , ..., n-1. The rules for correct usage of
MPI_ALLGATHER are easily found from the corresponding rules
for MPI_GATHER.
MPI_Allgatherv(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *displs, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLGATHERV(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNTS, DISPLS, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_ALLGATHERV can be thought of as MPI_GATHERV, except
all processes receive the result, instead of just the root.
The jth block of data sent from each process is received
by every process and placed in the jth block of the
buffer recvbuf. These blocks need not all be the same size.
The type signature associated with sendcount and sendtype
at process j must be equal to the type signature associated with
recvcounts[j] and recvtype at any other process.
The outcome is as if all processes executed calls to
MPI_GATHERV( sendbuf, sendcount, sendtype,recvbuf,recvcounts,displs,
recvtype,root,comm),
for root = 0 , ..., n-1. The rules for correct usage of
MPI_ALLGATHERV are easily found from the corresponding rules
for MPI_GATHERV.
MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLTOALL(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_ALLTOALL is an extension of MPI_ALLGATHER to the case
where each process sends distinct data to each of the receivers.
The jth block sent from process i is received by process j
and is placed in the ith block of recvbuf.
The type signature associated with sendcount and sendtype
at a process must be equal to the type signature associated with
recvcount and recvtype at any other process.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between every pair of processes.
As usual, however, the type maps may be different.
The outcome is as if each process executed a send to each
process (itself included)
with a call to,
MPI_Send(sendbuf + i*sendcount*extent(sendtype), sendcount, sendtype, i, ...),
and a receive from every other process
with a call to,
MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, i, ...),
where i = 0, ..., n - 1.
All arguments
on all processes are significant. The argument comm
must represent the same intragroup communication domain
on all processes.
MPI procedures are specified using a language
independent notation.
The arguments of procedure calls are marked as IN, OUT, or INOUT. The meanings of these are:
IN: the call uses but does not update an argument marked IN,
OUT: the call may update an argument marked OUT,
INOUT: the call both uses and updates an argument marked INOUT.
There is one special case - if an argument is a handle to
an opaque object (defined in Section
), and the
object is updated by the procedure call, then the argument is marked
INOUT. It is marked this way even though the handle itself is not
modified - we use the INOUT attribute to denote that what the
handle references is updated.
The definition of MPI tries to avoid, to the largest possible extent,
the use of INOUT arguments, because such use is error-prone,
especially for scalar arguments.
A common occurrence for MPI functions is
an argument that is used as IN by some processes and OUT by other
processes. Such an argument is, syntactically, an INOUT argument and
is marked as such,
although, semantically, it is
not used in one call both for input and for output.
Another frequent situation arises when an argument value is needed only by
a subset of the processes. When an argument is not significant at a
process then an arbitrary value can be passed as the argument.
Unless
specified otherwise, an argument of type OUT or type INOUT
cannot be aliased with any other argument passed to an MPI procedure.
An example of argument aliasing in C appears below. If we define a
C procedure like this,
All MPI functions are first specified in the language-independent notation.
Immediately below this, the ANSI C version of the function is shown, and
below this, a version of the same function in Fortran 77.
MPI_Alltoallv(void* sendbuf, int *sendcounts, int *sdispls, MPI_Datatype sendtype, void* recvbuf, int *recvcounts, int *rdispls, MPI_Datatype recvtype, MPI_Comm comm)
MPI_ALLTOALLV(SENDBUF, SENDCOUNTS, SDISPLS, SENDTYPE, RECVBUF, RECVCOUNTS, RDISPLS, RECVTYPE, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_ALLTOALLV adds flexibility to MPI_ALLTOALL in that
the location of data for the send is specified by sdispls
and the location of the placement of the data on the receive side
is specified by rdispls.
The jth block sent from process i is received by process j
and is placed in the ith block of recvbuf. These blocks
need not all have the same size.
The type signature associated with
sendcount[j] and sendtype at process i must be equal
to the type signature
associated with recvcount[i] and recvtype at process j.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between every pair of processes.
Distinct type maps between sender and receiver are still allowed.
The outcome is as if each process sent a message to process i
with
MPI_Send(sendbuf + displs[i]*extent(sendtype), sendcounts[i], sendtype, i, ...),
and received a message from process i with
a call to
MPI_Recv(recvbuf + displs[i]*extent(recvtype), recvcounts[i], recvtype, i, ...), where i = 0, ..., n - 1.
All arguments
on all processes are significant. The argument comm
must specify the same intragroup communication domain
on all processes.
The functions in this section perform a global reduce operation (such
as sum, max, logical AND, etc.) across all the members of a group.
The reduction operation can be either one of a predefined list of
operations, or a user-defined operation.
The global reduction functions come in several flavors: a reduce that
returns the result of the reduction at one node, an all-reduce that
returns this result at all nodes, and a scan (parallel prefix)
operation. In addition, a reduce-scatter operation combines the
functionality of a reduce and of a scatter operation. In order to
improve performance, the functions can be passed an array of
values; one call will perform a sequence of element-wise
reductions on the arrays of values.
Figure
gives a pictorial representation of these
operations.
MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_REDUCE combines the elements provided
in the input buffer of each process in the
group, using the operation op, and returns the combined value in
the output buffer of the process with rank root.
The input buffer is defined by the arguments sendbuf,
count and datatype; the output buffer is defined by
the arguments recvbuf, count and datatype;
both have the same number of elements, with the same type. The arguments
count, op and root must have identical
values at all processes, the datatype arguments should match,
and comm should represent the same
intragroup communication domain.
Thus, all processes provide input buffers and output buffers of the
same length, with elements of the same type.
Each process can provide one element, or a sequence of elements,
in which case the
combine operation is executed element-wise on each entry of the sequence.
For example, if the operation is MPI_MAX and the send buffer
contains two elements that are floating point numbers (count = 2 and
datatype = MPI_FLOAT), then
and
.
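The Fortran analogue of this example, with MPI_REAL playing the role of MPI_FLOAT, might look as follows (the values supplied are arbitrary illustrations):
      INCLUDE 'mpif.h'
      REAL IN( 2 ), OUT( 2 )
      INTEGER MYRANK, IERR
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYRANK, IERR )
      IN( 1 ) = REAL( MYRANK )
      IN( 2 ) = REAL( 2*MYRANK )
*     On the root, OUT(1) and OUT(2) become the element-wise maxima
      CALL MPI_REDUCE( IN, OUT, 2, MPI_REAL, MPI_MAX, 0,
     $                 MPI_COMM_WORLD, IERR )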
Section
lists the set of predefined operations
provided by MPI. That section also enumerates
the allowed datatypes for each operation.
In addition, users may define their own operations that can be
overloaded to operate on several datatypes, either basic or derived.
This is further explained in Section
.
The operation op is always assumed to be
associative. All predefined operations are also
commutative. Users may define operations that are assumed to be
associative, but not commutative. The ``canonical'' evaluation order
of a reduction is determined by the ranks of the processes in the
group. However, the implementation can take
advantage of associativity, or associativity and commutativity
in order to change the order of evaluation.
This may change the result of the reduction for operations that are not
strictly associative and commutative, such as floating point addition.
The datatype argument of MPI_REDUCE must be
compatible with op. Predefined operators work only with
the MPI types listed in Section
and
Section
. User-defined operators may operate
on general, derived datatypes. In this case, each argument that
the reduce operation is applied to is one element described by such a datatype,
which may contain several basic values.
This is further explained in Section
.
The following predefined operations are supplied for MPI_REDUCE
and related functions MPI_ALLREDUCE, MPI_REDUCE_SCATTER,
and MPI_SCAN.
These operations are invoked by placing the following in op.
MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical and
MPI_BAND     bit-wise and
MPI_LOR      logical or
MPI_BOR      bit-wise or
MPI_BXOR     bit-wise xor
MPI_MAXLOC   max value and location
MPI_MINLOC   min value and location
MPI_LXOR     logical xor
The two operations MPI_MINLOC and MPI_MAXLOC are
discussed separately in Section
.
For the other predefined operations,
we enumerate below the allowed combinations of op and
datatype arguments.
First, define groups of MPI basic datatypes
in the following way.
Now, the valid datatypes for each option is specified below.
The operator MPI_MINLOC is used to compute
a global minimum and also
an index attached to the minimum value.
MPI_MAXLOC similarly computes a global maximum and index.
One application of these is to compute a global minimum (maximum) and the
rank of the process containing this value.
The operation that defines MPI_MAXLOC is:
where
and
MPI_MINLOC is defined similarly:
where
and
Both operations are associative and commutative.
Note that if MPI_MAXLOC
is applied to reduce a sequence of pairs
, then the value
returned is
, where
and
is the index of
the first global maximum in the sequence. Thus, if each process
supplies a value and its rank within the group, then a reduce
operation with op = MPI_MAXLOC will return the
maximum value and the rank of the first process with that value.
Similarly, MPI_MINLOC can be used to return a minimum and its
index.
More generally, MPI_MINLOC computes a lexicographic
minimum, where elements are ordered according to the first component
of each pair, and ties are resolved according to the second component.
The reduce operation is defined to operate on arguments that
consist of a pair: value and index.
In order to use MPI_MINLOC and MPI_MAXLOC in a
reduce operation, one must provide a datatype argument
that represents a pair (value and index). MPI provides nine such
predefined datatypes.
In C,
the index is an int and the value can be a short or long
int, a float, or a double.
The potentially mixed-type nature of such arguments
is a problem in Fortran. The problem is circumvented, for Fortran, by
having the MPI-provided type consist of a pair of the same type as
value, and coercing the index to this type also.
The operations MPI_MAXLOC and
MPI_MINLOC can be used with each of the following datatypes.
MPI_2REAL                pair of REALs
MPI_2DOUBLE_PRECISION    pair of DOUBLE PRECISION variables
MPI_2INTEGER             pair of INTEGERs
MPI_FLOAT_INT            float and int
MPI_DOUBLE_INT           double and int
MPI_LONG_INT             long and int
MPI_2INT                 pair of int
MPI_SHORT_INT            short and int
MPI_LONG_DOUBLE_INT      long double and int
The datatype MPI_2REAL is as if defined by the following
(see Section
).
Similar statements apply for MPI_2INTEGER,
MPI_2DOUBLE_PRECISION, and MPI_2INT.
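For example, the following Fortran sketch returns on the root the global maximum of each process's local value VAL together with the rank of a process holding it, using MPI_2REAL (the rank is coerced to REAL, as described above; the value of VAL is an arbitrary illustration):
      INCLUDE 'mpif.h'
      REAL VAL, IN( 2 ), OUT( 2 )
      INTEGER MYRANK, IERR
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, MYRANK, IERR )
*     Each process's local value (illustrative)
      VAL = REAL( MYRANK + 1 )
      IN( 1 ) = VAL
      IN( 2 ) = REAL( MYRANK )
      CALL MPI_REDUCE( IN, OUT, 1, MPI_2REAL, MPI_MAXLOC, 0,
     $                 MPI_COMM_WORLD, IERR )
*     On the root: OUT(1) is the maximum value, OUT(2) the owning rank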
The datatype MPI_FLOAT_INT is as if defined by the
following sequence of instructions.
MPI includes variants of each of the reduce operations
where the result is returned to all processes in the group.
MPI requires that all processes participating in these
operations receive identical results.
MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_ALLREDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
Same as MPI_REDUCE except that the result
appears in the receive buffer of all the group members.
MPI includes variants of each of the reduce operations
where the result is scattered to all processes in the group on return.
MPI_Reduce_scatter(void* sendbuf, void* recvbuf, int *recvcounts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
MPI_REDUCE_SCATTER(SENDBUF, RECVBUF, RECVCOUNTS, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_REDUCE_SCATTER acts as if it first does an element-wise reduction
on a vector of count = recvcounts[0] + ... + recvcounts[n-1] elements
in the send buffer defined by sendbuf, count and
datatype.
Next, the resulting vector of results is split into n disjoint
segments, where n is the number of processes in the group of
comm.
Segment i contains recvcounts[i] elements.
The ith segment is sent to process i and stored in the
receive buffer defined by recvbuf, recvcounts[i] and
datatype.
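For instance, in the following sketch (assuming comm is set up, nprocs is the number of processes in it, and the usual headers are included) every process contributes a vector of nprocs integers and receives one element of the element-wise sum:

   int *sendvec    = (int *) malloc(nprocs * sizeof(int));
   int *recvcounts = (int *) malloc(nprocs * sizeof(int));
   int  result, i;

   for (i = 0; i < nprocs; i++) {
       sendvec[i]    = i;    /* this process's contribution to element i (illustrative) */
       recvcounts[i] = 1;    /* every process receives one element of the result */
   }
   MPI_Reduce_scatter(sendvec, &result, recvcounts, MPI_INT, MPI_SUM, comm);
   /* process i now holds the sum, over all processes, of element i */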
MPI_Scan(void* sendbuf, void* recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )
MPI_SCAN(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, COMM, IERROR) <type> SENDBUF(*), RECVBUF(*)
MPI_SCAN is used to perform a prefix reduction
on data distributed across the group.
The operation returns, in the receive buffer of the process with rank
i, the
reduction of the values in the send buffers of processes with ranks 0,...,i (inclusive). The type of operations supported,
their semantics, and the
constraints on send and receive buffers are as for MPI_REDUCE.
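As a small example (a sketch assuming comm is set up), the following computes prefix sums of the process ranks, so that the process with rank i receives 0 + 1 + ... + i:

   int rank, prefix;

   MPI_Comm_rank(comm, &rank);
   MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, comm);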
user-defined operations
reduce, user-defined
scan, user-defined
MPI_Op_create(MPI_User_function *function, int commute, MPI_Op *op)
MPI_OP_CREATE( FUNCTION, COMMUTE, OP, IERROR) EXTERNAL FUNCTION
MPI_OP_CREATE binds a user-defined global operation
to an op handle that can subsequently be used in
MPI_REDUCE, MPI_ALLREDUCE,
MPI_REDUCE_SCATTER, and MPI_SCAN.
The user-defined operation is assumed to be associative.
If commute = true, then the operation should be both
commutative and associative. If commute = false,
then the order of operations is fixed and is defined to be
in ascending, process rank order, beginning with process zero. The
order of evaluation can be changed, taking advantage of the
associativity of the operation. If commute = true
then the order of evaluation can be changed, taking advantage of
commutativity and associativity.
function is the user-defined function, which must have the
following four arguments: invec, inoutvec, len and datatype.
The ANSI-C prototype for the function is the following.
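That is, the function must have the type

   typedef void MPI_User_function(void *invec, void *inoutvec,
                                  int *len, MPI_Datatype *datatype);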
The Fortran declaration of the user-defined function appears below.
The datatype argument
is a handle to the data type that was passed into the call
to MPI_REDUCE.
The user reduce function should be written such that the following
holds:
Let u[0], ... , u[len-1] be the len elements in the
communication buffer described by the arguments invec, len
and datatype when the function is invoked;
let v[0], ... , v[len-1] be len elements in the
communication buffer described by the arguments inoutvec, len
and datatype when the function is invoked;
let w[0], ... , w[len-1] be len elements in the
communication buffer described by the arguments inoutvec, len
and datatype when the function returns;
then w[i] = u[i] o v[i], for i = 0, ..., len-1,
where o is the reduce operation that the function computes.
Informally, we can think of
invec and inoutvec as arrays of len elements that
function
is combining. The result of the reduction over-writes values in
inoutvec, hence the name. Each invocation of the function results in
the pointwise evaluation of the reduce operator on len
elements: i.e., the function returns in inoutvec[i] the value
invec[i] o inoutvec[i], for i = 0, ..., len-1,
where o is the combining operation computed by the function.
General datatypes may be passed to the user function.
However, use of datatypes that are not contiguous is likely to lead to
inefficiencies.
No MPI communication function may be called inside the user function.
MPI_ABORT may be called inside the
function in case of an error.
MPI_Op_free(MPI_Op *op)
MPI_OP_FREE( OP, IERROR) INTEGER OP, IERROR
Marks a user-defined reduction operation for deallocation and sets
op to MPI_OP_NULL.
The following example illustrates the use of user-defined
reduction.
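A sketch along these lines (an element-wise product of complex numbers; the type and function names are illustrative, and a, answer and comm are assumed to be set up by the caller):

   typedef struct {
       double re, im;
   } Complex;

   /* user-defined reduction function: inoutvec[i] = invec[i] * inoutvec[i] */
   void complex_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
   {
       Complex *in    = (Complex *) invec;
       Complex *inout = (Complex *) inoutvec;
       int i;

       for (i = 0; i < *len; i++) {
           Complex c;
           c.re = in[i].re * inout[i].re - in[i].im * inout[i].im;
           c.im = in[i].re * inout[i].im + in[i].im * inout[i].re;
           inout[i] = c;
       }
   }

   /* in the calling code, with Complex a[100], answer[100] already filled in */
   MPI_Datatype ctype;
   MPI_Op       myop;

   MPI_Type_contiguous(2, MPI_DOUBLE, &ctype);
   MPI_Type_commit(&ctype);
   MPI_Op_create(complex_prod, 1 /* commutative */, &myop);
   MPI_Reduce(a, answer, 100, ctype, myop, 0 /* root */, comm);
   MPI_Op_free(&myop);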
A correct, portable program must invoke collective communications so
that deadlock will
not occur, whether collective communications are synchronizing or not.
The following example illustrates a dangerous use of collective routines.
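For instance, the following fragment (a sketch for two processes in comm; buffers, count, type, tag and status are assumed to be set up) may deadlock if the broadcast is synchronizing: process 0 cannot complete the broadcast until process 1 enters it, and process 1 cannot enter the broadcast until its receive is satisfied by the send that process 0 has not yet reached.

   /* process 0 */
   MPI_Bcast(buf1, count, type, 0, comm);
   MPI_Send(buf2, count, type, 1, tag, comm);

   /* process 1 */
   MPI_Recv(buf2, count, type, 0, tag, comm, &status);
   MPI_Bcast(buf1, count, type, 0, comm);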
Finally, in multithreaded implementations, there can be more than one
concurrently executing collective communication call at a process. In these
situations, it is the user's responsibility to ensure that
the same communicator is not used concurrently by two different
collective communication calls at the same process.
It was the intent of the creators of the MPI standard to address
several issues that augment the power and usefulness of
point-to-point and collective
communications. These issues are mainly concerned with the
the creation of portable, efficient and safe libraries and codes with
MPI, and will be discussed in this chapter.
This effort was driven by the need to overcome several
limitations in many message passing systems. The
next few sections describe these limitations.
In some applications it is desirable to divide up the processes
to allow different groups of processes to perform independent
work. For example, we might want an application to utilize some of its
processes to predict the weather based on data already processed, while
the remaining processes initially process new data.
This would allow the application to
regularly complete a weather forecast. However, if no new data is
available for processing we might
want the same application to use all of its processes to make a
weather forecast.
Being able to do this efficiently and easily requires the application
to be able to logically divide the processes into independent subsets.
It is important that these subsets are logically the same as the
initial set of processes. For example, the module to predict the
weather might use process 0 as the master process to dole out work.
If subsets of processes are not numbered in a consistent manner with
the initial set of processes, then there may be no process 0 in one
of the two subsets. This would cause the weather prediction model to
fail.
Applications also need to have collective operations work on a
subset of processes. If collective operations only work on the
initial set of processes then it is impossible to create independent
subsets that perform collective operations. Even if the application
does not need independent subsets, having collective operations work
on subsets is desirable. Since the time to complete most collective
operations increases with the number of processes, limiting a collective
operation to only the processes that need to be
involved yields much better scaling
behavior. For example, if a matrix computation needs to broadcast
information along the diagonal of a matrix, only the processes
containing diagonal elements should be involved.
Library routines have historically had difficulty in isolating their own
message passing calls from those in other libraries or in the user's
code. For example, suppose the user's code posts a non-blocking
receive with both tag and source wildcarded
before it enters a library routine.
The first send in the library may be received by the user's posted
receive instead of the one posted by the library. This will
undoubtedly cause the library to fail.
The solution to this difficulty is to allow a module to
isolate its message passing calls from the other modules. Some
applications may determine only at run time which modules will run, so it
can be impossible to isolate all modules statically in advance. This
necessitates a system that can be invoked at run time to perform this function.
Writers of libraries often want to expand the functionality of the
message passing system. For example, the library may want to create
its own special and unique collective operation. Such a collective
operation may be called many times if the library is called
repetitively or if multiple libraries use the same collective routine.
To perform the collective operation efficiently may require a
moderately expensive calculation up front such as determining the best
communication pattern. It is most efficient to reuse the up front
calculations if the same set of processes is involved. This is most
easily done by attaching the results of the up front calculation to
the set of processes involved. These types of optimization are routinely
done internally in message passing systems. The desire is to allow
others to perform similar optimizations in the same way.
There are two philosophies used to provide mechanisms for creating
subgroups, isolating messages, etc. One point of view is to allow the
user total control over the process. This allows maximum flexibility
to the user and can, in some cases, lead to fast implementations. The
other point of view is to have the message passing system control
these functions. This adds a degree of safety while limiting the
mechanisms to those provided by the system. MPI chose to use the
latter approach. The added safety was deemed to be very important for
writing portable message passing codes. Since the MPI system controls
these functions, modules that are written independently can safely
perform these operations without worrying about conflicts. As in
other areas, MPI also decided to provide a rich set of functions so
that users would have the functionality they are likely to need.
The above features and several more are provided in MPI through
communicators. The concepts behind communicators encompass several
central and fundamental ideas in MPI. The importance of
communicators can be seen by the fact that they are present in most
calls in MPI. There are several reasons that these features are
encapsulated into a single MPI object. One reason is that it
simplifies calls to MPI functions. Grouping logically related items
into communicators substantially reduces the number of calling
arguments. A second reason is it allows for easier extensibility.
Both the MPI system and the user can add information onto
communicators that will be passed in calls without changing the
calling arguments. This is consistent with the use of opaque objects
throughout MPI.
A group is an ordered set of process identifiers (henceforth
processes); processes are implementation-dependent objects. Each
process in a group is associated with an integer rank. Ranks are
contiguous and start from zero.
Groups are represented by opaque group objects, and hence cannot
be directly transferred from one process to another.
There is a special pre-defined group: MPI_GROUP_EMPTY, which is
a group with no members.
The predefined constant MPI_GROUP_NULL is the value used for invalid group
handles. MPI_GROUP_EMPTY, which is a valid handle to an empty group,
should not be confused with MPI_GROUP_NULL, which is an invalid handle.
The former may be used as an argument to group operations; the latter,
which is returned when a group is freed, is not a valid argument.
Group operations are discussed in Section
.
A communicator is an opaque object with a number of attributes,
together with simple rules that govern its creation, use and
destruction. The communicator specifies a communication domain
which can be used for point-to-point communications.
An intracommunicator is used for communicating
within a single group of processes; we call such communication intra-group communication. An intracommunicator has two fixed
attributes.
These are the process group and the topology describing the logical layout
of the processes in the group. Process topologies are the
subject of chapter
. Intracommunicators are also
used for collective operations within a group of processes.
An intercommunicator is used for point-to-point
communication between two
disjoint groups of processes. We call such communication inter-group
communication.
The fixed attributes of an
intercommunicator are the two groups. No topology is associated
with an intercommunicator.
In addition to fixed attributes a
communicator may also have user-defined attributes which are associated
with the communicator using MPI's caching mechanism, as described in
Section
. The table below summarizes the differences
between intracommunicators and intercommunicators.

   Functionality            Intracommunicator    Intercommunicator
   number of groups         1                    2
   communication safety     yes                  yes
   collective operations    yes                  no
   topologies               yes                  no
   caching                  yes                  yes
Intracommunicator operations are described in
Section
, and intercommunicator operations are
discussed in Section
.
Any point-to-point or collective communication occurs in MPI within
a communication domain.
Such a communication domain is represented by a set of
communicators with consistent values, one at each of the participating
processes;
each communicator is the local representation of the global communication
domain.
If this domain is for intra-group communication then all the
communicators are intracommunicators, and all have the same group attribute.
Each communicator identifies all the other corresponding
communicators.
One can think of a communicator as an array of links to other
communicators. An intra-group communication domain is specified
by a set of communicators such that their links form a complete graph
(each communicator is linked to all other communicators in the set)
and the links have consistent indices (at each communicator, the i-th
link points to the communicator of the process with rank i).
We discuss inter-group communication domains in
Section
.
In point-to-point communication, matching send and receive calls should
have communicator arguments that represent the same communication domains.
The rank of the processes is interpreted relative to the group, or
groups, associated with the communicator. Thus, in an intra-group
communication domain, process ranks are relative to the
group associated with the communicator.
Similarly, a collective communication call involves all processes in the group
of an intra-group communication domain, and all processes should use
a communicator argument that represents this domain.
Intercommunicators may
not be used in collective communication operations.
We shall sometimes say, for simplicity, that two communicators are the
same, if they represent the same communication domain.
One should not be misled by this abuse of
language:
Each communicator is really a distinct object, local to a process.
Furthermore, communicators that represent the same communication domain may
have different attribute values attached to them at different processes.
MPI is designed to ensure that communicator constructors always
generate consistent communicators that are a valid representation of
the newly created communication domain.
This is done by requiring that a new
intracommunicator be
constructed out of an existing parent communicator, and that
this be a collective operation over all processes in the group associated with
the parent communicator. The group associated with a new intracommunicator
must be a subgroup of that associated with the parent
intracommunicator.
Thus, all
the intracommunicator constructor routines described
in Section
have an existing
communicator as an input argument, and the newly created intracommunicator
as an output argument. This leads to a chicken-and-egg situation since
we must have an existing communicator to create a new communicator.
This problem is solved by the provision of a predefined intracommunicator,
MPI_COMM_WORLD, which is available for use once
the routine MPI_INIT has been called.
MPI_COMM_WORLD, which has
as its group attribute all processes with which the local process can
communicate, can be used as the parent communicator in
constructing new communicators.
A second pre-defined communicator, MPI_COMM_SELF, is also
available for use after calling MPI_INIT and
has as its associated group just the process itself.
MPI_COMM_SELF is provided as a convenience since it could
easily be created out of MPI_COMM_WORLD.
An MPI program consists of autonomous processes, executing
their own (C or Fortran) code, in an MIMD style.
The codes executed by each process need not be
identical.
The processes communicate via calls to MPI communication primitives.
Typically, each process executes in its own address space, although
shared-memory implementations of MPI are possible.
This document specifies the behavior
of a parallel program assuming that only
MPI calls are used for communication.
The interaction of an MPI program with
other possible means of communication
(e.g., shared memory) is not specified.
MPI does not specify the execution model for each process.
A process can be
sequential, or can be multi-threaded, with threads possibly executing
concurrently. Care has been taken to make MPI ``thread-safe,'' by
avoiding
the use of implicit state. The desired interaction of MPI with threads
is that concurrent threads be all allowed to execute
MPI calls, and calls be reentrant;
a blocking MPI call blocks only the invoking thread,
allowing the scheduling of another thread.
MPI does not provide mechanisms to specify the
initial allocation of processes to an MPI computation and their
binding to physical processors.
It is expected that vendors will provide mechanisms to do so
either at load time or at run time. Such mechanisms will allow the
specification of the initial number of required processes, the
code to be executed by each initial process,
and the allocation of processes to
processors.
Also, the current standard does not provide for dynamic creation or
deletion of processes during program execution
(the total number of processes
is fixed); however, MPI design is consistent with such
extensions, which are now under consideration
(see Section
).
Finally, MPI always identifies processes
according to their relative rank in a group, that is,
consecutive integers in the range 0..groupsize-1.
The current practice in many codes is that there is a unique,
predefined communication universe that includes all processes
available when the parallel program is initiated; the processes are
assigned consecutive ranks. Participants in a point-to-point
communication are identified by their rank; a collective communication
(such as broadcast) always involves all processes. As such, most
current message passing libraries have no equivalent argument to the
communicator. It is implicitly all the processes as ranked by the
system.
This practice can be followed in MPI by using the predefined
communicator
MPI_COMM_WORLD
wherever a communicator argument
is required. Thus, using current practice in MPI is very easy. Users that
are content with it can ignore most of the
information in this chapter. However, everyone should seriously
consider understanding the potential risks in using
MPI_COMM_WORLD to avoid unexpected behavior of their
programs.
This section describes the manipulation of process groups in MPI. These
operations are local and their execution does not require interprocess
communication. MPI allows manipulation of groups outside of
communicators but groups can only be used for message passing inside
of a communicator.
MPI_Group_size(MPI_Group group, int *size)
MPI_GROUP_SIZE(GROUP, SIZE, IERROR)INTEGER GROUP, SIZE, IERROR
MPI_GROUP_SIZE returns the number of processes in the group.
Thus, if group = MPI_GROUP_EMPTY then the call will return
size = 0. (On the other hand, a call with group =
MPI_GROUP_NULL is erroneous.)
MPI_Group_rank(MPI_Group group, int *rank)
MPI_GROUP_RANK(GROUP, RANK, IERROR)INTEGER GROUP, RANK, IERROR
MPI_GROUP_RANK returns the rank of the calling process in
group. If the process is not a member of group then
MPI_UNDEFINED is returned.
MPI_Group_translate_ranks (MPI_Group group1, int n, int *ranks1, MPI_Group group2, int *ranks2)
MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR
MPI_GROUP_TRANSLATE_RANKS maps the ranks of a set of
processes in group1 to their ranks in group2.
Upon return, the array
ranks2 contains the ranks in group2 for
the processes in group1 with ranks
listed in ranks1.
If a process in group1 found in ranks1
does not belong to group2 then
MPI_UNDEFINED is returned in ranks2.
This function is important for determining the relative numbering of
the same processes in two different groups. For instance, if one
knows the ranks of certain processes in the group of
MPI_COMM_WORLD, one might want to know their ranks in a
subset of that group.
MPI_Group_compare(MPI_Group group1,MPI_Group group2, int *result)
MPI_GROUP_COMPARE(GROUP1, GROUP2, RESULT, IERROR)INTEGER GROUP1, GROUP2, RESULT, IERROR
MPI_GROUP_COMPARE returns the relationship between two groups.
MPI_IDENT results if the group members and group order are exactly the
same in both groups. This happens, for instance, if
group1 and group2 are handles to the same object.
MPI_SIMILAR results if the group members are the same but the order is
different. MPI_UNEQUAL results otherwise.
Group constructors are used to construct new groups from existing
groups, using various set operations.
These are local operations, and distinct groups may be defined on
different processes; a process may also define a group that does not
include itself. Consistent definitions are required when groups are
used as arguments in communicator-building functions. MPI does not
provide a mechanism to build a group from scratch, but only from
other, previously defined groups. The base group, upon which all
other groups are defined, is the group associated with the initial
communicator MPI_COMM_WORLD (accessible through
the function MPI_COMM_GROUP).
Local group creation functions are useful since some applications have
the needed information distributed on all nodes. Thus, new groups can
be created locally without communication. This can significantly
reduce the necessary communication in creating a new communicator to
use this group.
In Section
, communicator creation
functions are described which also create new groups. These are more
general group creation functions where the information does not have
to be local to each node. They are part of communicator creation
since they will normally require communication for group creation.
Since communicator creation may also require communication, it is
logical to group these two functions together for this case.
MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_GROUP(COMM, GROUP, IERROR)INTEGER COMM, GROUP, IERROR
MPI_COMM_GROUP returns in group a handle to the
group of comm.
The following three functions do standard set type operations. The
only difference is that ordering is important so that ranks are
consistently defined.
MPI_Group_union(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_UNION(GROUP1, GROUP2, NEWGROUP, IERROR)INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
MPI_Group_intersection(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_INTERSECTION(GROUP1, GROUP2, NEWGROUP, IERROR)INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
MPI_Group_difference(MPI_Group group1, MPI_Group group2, MPI_Group *newgroup)
MPI_GROUP_DIFFERENCE(GROUP1, GROUP2, NEWGROUP, IERROR)INTEGER GROUP1, GROUP2, NEWGROUP, IERROR
The operations are defined as follows:

   union          all elements of the first group (group1), followed by all
                  elements of the second group (group2) not in the first.
   intersection   all elements of the first group that are also in the second
                  group, ordered as in the first group.
   difference     all elements of the first group that are not in the second
                  group, ordered as in the first group.

Note that for these operations the order of processes in the output group
is determined primarily by order in the first group and then, if necessary,
by order in the second group. The new group can be empty, that is, equal to
MPI_GROUP_EMPTY.
MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
MPI_GROUP_INCL(GROUP, N, RANKS, NEWGROUP, IERROR)INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR
The function MPI_GROUP_INCL creates a group
newgroup that consists of the n processes in
group with ranks rank[0],..., rank[n-1];
the process with rank i in
newgroup is the process with rank ranks[i] in
group. Each of the n elements of ranks must be a
valid rank in group and all elements must be distinct, or else the
call is erroneous. If n = 0, then newgroup is
MPI_GROUP_EMPTY.
This function can, for instance, be used to reorder the
elements of a group.
Assume that newgroup was created by a call to
MPI_GROUP_INCL(group, n, ranks, newgroup). Then, a subsequent call
to MPI_GROUP_TRANSLATE_RANKS(group, n, ranks, newgroup,
newranks) will return newranks[i] = i, for i = 0, ..., n-1 (in C),
or newranks(i) = i-1, for i = 1, ..., n (in Fortran).
MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup)
MPI_GROUP_EXCL(GROUP, N, RANKS, NEWGROUP, IERROR)INTEGER GROUP, N, RANKS(*), NEWGROUP, IERROR
The function MPI_GROUP_EXCL creates a group of processes
newgroup that is obtained by deleting from group
those processes with ranks
ranks[0],..., ranks[n-1] in C or
ranks[1],..., ranks[n] in Fortran.
The ordering of processes in newgroup is identical to the ordering
in group.
Each of the n elements of ranks must be a valid
rank in group and all elements must be distinct; otherwise, the
call is erroneous.
If n = 0, then newgroup is identical to
group.
Suppose one calls MPI_GROUP_INCL(group, n, ranks, newgroupi)
and MPI_GROUP_EXCL(group, n, ranks, newgroupe). The call
MPI_GROUP_UNION(newgroupi, newgroupe, newgroup) would return
in newgroup a group
with the same members as group but possibly in a different
order. The call
MPI_GROUP_INTERSECTION(newgroupi, newgroupe, newgroup) would return
MPI_GROUP_EMPTY.
MPI_Group_range_incl(MPI_Group group, int n, int ranges[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_INCL(GROUP, N, RANGES, NEWGROUP, IERROR)INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR
Each triplet in ranges specifies a sequence of ranks for
processes to be included in the newly created group. The newly
created group contains the processes specified by the first triplet,
followed by the processes specified by the second triplet, etc.
Generally, if ranges consist of the triplets

   (first(1), last(1), stride(1)), ..., (first(n), last(n), stride(n))

then newgroup consists of the sequence of processes in group with ranks

   first(1), first(1) + stride(1), ...,
   first(1) + floor((last(1) - first(1)) / stride(1)) * stride(1), ...,
   first(n), first(n) + stride(n), ...,
   first(n) + floor((last(n) - first(n)) / stride(n)) * stride(n).

Each computed rank must be a valid rank in group and all
computed ranks must be distinct, or else the call is erroneous.
Note that a call may have first(i) > last(i), and stride(i) may be
negative, but cannot be zero.
The functionality of this routine is specified to be equivalent to
expanding the array of ranges to an array of the included ranks and
passing the resulting array of ranks and other arguments to
MPI_GROUP_INCL. A call to MPI_GROUP_INCL is
equivalent to a call to
MPI_GROUP_RANGE_INCL with each rank i
in ranks replaced by the triplet (i,i,1) in the argument ranges.
MPI_Group_range_excl(MPI_Group group, int n, int ranges[][3], MPI_Group *newgroup)
MPI_GROUP_RANGE_EXCL(GROUP, N, RANGES, NEWGROUP, IERROR)INTEGER GROUP, N, RANGES(3,*), NEWGROUP, IERROR
Each triplet in ranges specifies a sequence of ranks for
processes to be excluded from the newly created group. The newly
created group contains the remaining processes, ordered as in
group.
Each computed rank must be a valid
rank in group and all computed ranks must be distinct, or else the
call is erroneous.
The functionality of this routine is specified to be equivalent to
expanding the array of ranges to an array of the excluded ranks and
passing the resulting array of ranks and other arguments to
MPI_GROUP_EXCL. A call to MPI_GROUP_EXCL is
equivalent to a call to MPI_GROUP_RANGE_EXCL with each rank
i in ranks replaced by the triplet (i,i,1)
in the argument ranges.
MPI_Group_free(MPI_Group *group)
MPI_GROUP_FREE(GROUP, IERROR)INTEGER GROUP, IERROR
This operation marks a group object for deallocation. The handle
group is set to MPI_GROUP_NULL by the call.
Any ongoing operation using this group will complete normally.
This section describes the manipulation of communicators in MPI.
Operations that access communicators are local and their execution
does not require interprocess communication. Operations that create
communicators are collective and may require interprocess
communication. We describe the behavior of these functions, assuming
that their comm argument is an intracommunicator; we
describe later in Section
their semantics for
intercommunicator arguments.
The following are all local operations.
MPI_Comm_size(MPI_Comm comm, int *size)
MPI_COMM_SIZE(COMM, SIZE, IERROR)INTEGER COMM, SIZE, IERROR
MPI_COMM_SIZE returns the size of the group associated with
comm.
This function indicates the number of processes involved in an
intracommunicator.
For MPI_COMM_WORLD, it indicates the total number of processes
available at initialization time. (For this version of MPI,
this is also the total number of processes available throughout the
computation).
MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_COMM_RANK(COMM, RANK, IERROR)INTEGER COMM, RANK, IERROR
MPI_COMM_RANK indicates the rank of the process
that calls it, in the range from 0 to size-1, where
size is the return value of MPI_COMM_SIZE.
This rank is relative to the group associated with the intracommunicator
comm. Thus,
MPI_COMM_RANK(MPI_COMM_WORLD,
rank) returns in rank the ``absolute'' rank of the calling
process in the global communication group of
MPI_COMM_WORLD;
MPI_COMM_RANK( MPI_COMM_SELF,
rank) returns rank = 0.
MPI_Comm_compare(MPI_Comm comm1,MPI_Comm comm2, int *result)
MPI_COMM_COMPARE(COMM1, COMM2, RESULT, IERROR)INTEGER COMM1, COMM2, RESULT, IERROR
MPI_COMM_COMPARE is used to find the relationship between
two intra-communicators. MPI_IDENT results if and only if
comm1 and comm2 are handles for the same object
(representing the same communication domain).
MPI_CONGRUENT results if the underlying groups are identical
in constituents and rank order (the communicators
represent two distinct communication domains with the same group attribute).
MPI_SIMILAR results if the group members of both
communicators are the same but the rank order
differs. MPI_UNEQUAL results otherwise. The groups
associated with two different communicators can be obtained via
MPI_COMM_GROUP and then used in a call to
MPI_GROUP_COMPARE. If MPI_COMM_COMPARE gives
MPI_CONGRUENT then MPI_GROUP_COMPARE will give
MPI_IDENT. If MPI_COMM_COMPARE gives
MPI_SIMILAR then MPI_GROUP_COMPARE will give
MPI_SIMILAR.
The following are collective functions that are invoked by all processes in the
group associated with comm.
MPI_Comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
MPI_COMM_DUP(COMM, NEWCOMM, IERROR)INTEGER COMM, NEWCOMM, IERROR
MPI_COMM_DUP creates a new intracommunicator, newcomm,
with the same fixed attributes (group, or groups, and topology) as the input
intracommunicator, comm.
The newly created communicators at the processes in the group of
comm define a new, distinct communication domain, with the
same group as the old communicators. The function can also be used to
replicate intercommunicators.
The association of user-defined (or cached) attributes with
newcomm is controlled by the copy callback function
specified when the attribute was attached
to comm.
For each key value, the respective
copy callback function determines the attribute value associated with
this key in the new communicator. User-defined attributes are
discussed in Section
.
MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
MPI_COMM_CREATE(COMM, GROUP, NEWCOMM, IERROR)INTEGER COMM, GROUP, NEWCOMM, IERROR
This function creates a new
intracommunicator newcomm with
communication group defined by group. No attributes
propagate from comm to newcomm. The function
returns MPI_COMM_NULL to processes that are not in group.
The communicators returned at the processes in group define a
new intra-group communication domain.
The call is erroneous if not all group arguments have the
same value on different processes,
or if group is not a subset of the group associated with
comm (but it does not have to be a proper subset). Note that the call is to be executed by all processes in
comm, even if they do not belong to the new group.
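For example, the following sketch (assuming comm is an existing intracommunicator and the usual headers are included) builds a communicator containing only the even-ranked processes of comm:

   MPI_Group world_group, even_group;
   MPI_Comm  even_comm;
   int       nprocs, n, i, *ranks;

   MPI_Comm_size(comm, &nprocs);
   MPI_Comm_group(comm, &world_group);
   ranks = (int *) malloc(((nprocs + 1) / 2) * sizeof(int));
   for (i = 0, n = 0; i < nprocs; i += 2)
       ranks[n++] = i;                              /* the even ranks */
   MPI_Group_incl(world_group, n, ranks, &even_group);
   MPI_Comm_create(comm, even_group, &even_comm);   /* collective over comm */
   /* even_comm is MPI_COMM_NULL at the odd-ranked processes */
   free(ranks);
   MPI_Group_free(&world_group);
   MPI_Group_free(&even_group);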
MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERROR)INTEGER COMM, COLOR, KEY, NEWCOMM, IERROR
This function partitions the group associated with comm
into disjoint subgroups, one for each value of color. Each
subgroup contains all processes of the same color. Within each
subgroup, the processes are ranked in the order defined by the value
of the argument key, with ties broken according to their rank
in the old group. A new communication domain is created for each subgroup and
a handle to the representative communicator is
returned in newcomm. A process may supply the color value
MPI_UNDEFINED to not be a member of any new group, in which case
newcomm returns MPI_COMM_NULL. This is a
collective call, but each process is permitted to provide different
values for color and key. The value of color must
be nonnegative.
A call to MPI_COMM_CREATE(comm, group, newcomm) is equivalent to
a call to MPI_COMM_SPLIT(comm, color, key, newcomm), where all
members of group provide color = 0 and key = rank in
group, and all processes that are not members of
group provide color = MPI_UNDEFINED.
The function MPI_COMM_SPLIT allows
more general partitioning of a group
into one or more subgroups with optional reordering.
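As an illustration (a sketch assuming comm is set up and its processes are viewed as a grid that is ncols columns wide), the following splits comm into one communicator per row, with processes ordered within each row by column:

   int      rank, ncols = 4;          /* ncols: assumed grid width */
   MPI_Comm row_comm;

   MPI_Comm_rank(comm, &rank);
   MPI_Comm_split(comm, rank / ncols /* color = row index */,
                        rank % ncols /* key = column */, &row_comm);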
MPI_Comm_free(MPI_Comm *comm)
MPI_COMM_FREE(COMM, IERROR)INTEGER COMM, IERROR
This collective operation marks the communication object for
deallocation. The handle is set to MPI_COMM_NULL.
Any pending operations that use this communicator will complete normally;
the object is actually
deallocated only if there are no other active references to it.
This call applies to intra- and intercommunicators. The delete callback
functions for all cached attributes (see Section
) are
called in arbitrary order.
callback function, delete
It is erroneous to attempt to free MPI_COMM_NULL.
This section illustrates the design of parallel libraries, and the use of
communicators to ensure the safety of internal library communications.
Assume that a new parallel library function is needed that is similar to
the MPI broadcast function, except that
it is not required that all processes
provide the rank of the root process. Instead of the root argument of
MPI_BCAST, the function takes a Boolean flag input that is
true if the calling process is the root and false otherwise.
To simplify the example we make another assumption: namely that the datatype of
the send buffer is identical to the datatype of the receive buffer,
so that only one datatype argument is needed.
A possible code for such a modified broadcast function is shown below.
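One plausible version of such a function, sketched in C (the details are illustrative), has the root send to every other process, while the non-root processes, which do not know the root's rank, receive with MPI_ANY_SOURCE:

   void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
   {
       int        rank, size, i;
       MPI_Status status;

       MPI_Comm_rank(comm, &rank);
       MPI_Comm_size(comm, &size);
       if (isroot) {
           for (i = 0; i < size; i++)
               if (i != rank)
                   MPI_Send(buf, count, type, i, 0, comm);
       } else {
           MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, comm, &status);
       }
   }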
Consider a collective invocation to the broadcast function just defined, in the
context of the program segment shown in the example below, for a group of three
processes.
A (correct) execution of this code is illustrated in
Figure
, with arrows used to indicate communications.
Since the invocation of mcast at the three processes is not
simultaneous, it may actually happen that mcast is invoked at process
0 before process 1 executed the receive in the caller code.
This receive, rather than being matched by the caller code send at
process 2, might be matched by
the first send of process 0 within mcast. The erroneous execution
illustrated in Figure
results.
How can such erroneous execution be prevented? One option is to enforce
synchronization at the entry to mcast, and, for symmetric reasons, at
the exit from mcast. E.g., the
first and last executable statements within the code of mcast
would be
a call to MPI_Barrier(comm). This, however, introduces two
superfluous synchronizations that will slow down execution. Furthermore, this
synchronization works only if the caller code obeys the convention that messages
sent before a collective invocation should also be received at their
destination before the matching invocation. Consider an invocation to
mcast() in a context that does not obey this restriction, as shown in
the example below.
The desired execution of the code in this example is illustrated
in Figure
.
However, a more likely matching of sends with receives will lead to the
erroneous execution illustrated in Figure
.
Erroneous results may also occur if a process that is not in the group
of comm and does not participate in the collective invocation
of mcast sends a message to processes one or two in the group of
comm.
A more robust solution to this problem is to use a distinct communication
domain for
communication within the library, which is not used by the caller code. This
will ensure that messages sent by the library are not received outside the
library, and vice-versa. The modified code of the function mcast is
shown below.
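Continuing the sketch above (again illustrative), all library communication now takes place on a private duplicate of comm:

   void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
   {
       int        rank, size, i;
       MPI_Comm   pcomm;                 /* private communication domain */
       MPI_Status status;

       MPI_Comm_dup(comm, &pcomm);
       MPI_Comm_rank(pcomm, &rank);
       MPI_Comm_size(pcomm, &size);
       if (isroot) {
           for (i = 0; i < size; i++)
               if (i != rank)
                   MPI_Send(buf, count, type, i, 0, pcomm);
       } else {
           MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, pcomm, &status);
       }
       MPI_Comm_free(&pcomm);
   }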
This code suffers the penalty of one communicator allocation and deallocation at
each invocation. We show in the next section, in
Example
,
how to avoid this overhead, by using a preallocated communicator.
As the previous examples showed, a communicator provides a ``scope'' for
collective invocations. The communicator, which is passed as parameter
to the call, specifies the group of processes that participate in the call and
provides a private communication domain for communications within the callee
body. In addition, it may carry information about the logical topology of the
executing processes. It is often useful to attach additional persistent values
to this scope; e.g., initialization parameters for a library, or additional
communicators to provide a separate, private communication domain.
MPI provides a caching facility that allows an application to
attach arbitrary pieces of information, called attributes, to
both intra- and intercommunicators. More precisely, the caching
facility allows a portable library to do the following:
Each attribute is associated with a key.
To provide safety, MPI internally generates key values.
MPI functions are provided which allow the user to allocate and
deallocate
key values (MPI_KEYVAL_CREATE and MPI_KEYVAL_FREE).
Once a key is allocated by a
process, it can be used to attach one attribute to any communicator
defined
at that process. Thus, the allocation of a key can be thought of as creating an
empty box at each current or future communicator object at that process; this
box has a lock that matches the allocated key. (The box is ``virtual'': one
need not allocate any actual space before an attempt is made to store something
in the box.)
Once the key is allocated, the user can set or access attributes
associated with this key.
The MPI call MPI_ATTR_PUT can be used to set an
attribute. This call
stores an attribute, or replaces an attribute in one box: the box attached
with the specified communicator with a lock that matches the specified key.
The
call MPI_ATTR_GET can be used to access the attribute value
associated with a given key and communicator. I.e., it allows one to access the
content of the box attached with the specified communicator, that has a lock
that matches the specified key. This call is valid even if the box is
empty, e.g., if the attribute was never set. In such case, a special
``empty'' value is returned.
Finally, the call MPI_ATTR_DELETE allows one to delete an
attribute. I.e., it allows one to empty the box attached with the
specified communicator with a lock that matches the specified key.
To be general, the
attribute mechanism must be able to store arbitrary user information.
On the other hand, attributes must be of a fixed, predefined type, both in
Fortran and C - the type specified by the MPI functions that access or
update attributes. Attributes are defined in C to be of type void *.
Generally, such an attribute will be a pointer to a user-defined data structure or
a handle to an MPI opaque object. In Fortran, attributes are of type
INTEGER. These can be handles to opaque MPI objects or indices to
user-defined tables.
An attribute, from the MPI viewpoint, is a pointer or an integer. An attribute,
from the application viewpoint, may contain arbitrary information that
is attached to
the ``MPI attribute''.
User-defined attributes are ``copied'' when a new communicator is created by
a call to MPI_COMM_DUP; they are ``deleted'' when a communicator
is deallocated by a call to MPI_COMM_FREE.
Because of the arbitrary nature of the information that is copied or
deleted, the user has to specify the semantics of
attribute copying or deletion.
The user does so
by providing copy
and delete callback functions when the attribute key is allocated (by a call to
MPI_KEYVAL_CREATE). Predefined, default copy and delete callback
functions are available.
All attribute manipulation functions are local and require no
communication. Two communicator objects at two different processes that
represent the same communication domain may have a different set of attribute
keys and different attribute values associated with them.
MPI reserves a set of predefined key values in order to associate
with MPI_COMM_WORLD information about the execution environment, at
MPI initialization time. These attribute keys are discussed in
Chapter
. These keys cannot be deallocated and the
associated attributes cannot be updated by the user. Otherwise, they behave
like user-defined attributes.
MPI provides the following services related to caching.
They are all process local.
MPI_Keyval_create(MPI_Copy_function *copy_fn, MPI_Delete_function *delete_fn, int *keyval, void* extra_state)
MPI_KEYVAL_CREATE(COPY_FN, DELETE_FN, KEYVAL, EXTRA_STATE, IERROR)EXTERNAL COPY_FN, DELETE_FN
MPI_KEYVAL_CREATE allocates a new attribute key value. Key values are
unique in a process.
Once allocated, the key value can be used to
associate attributes and access them on any locally defined
communicator. The special key value MPI_KEYVAL_INVALID is
never returned by MPI_KEYVAL_CREATE. Therefore, it can be
used for static initialization of key variables, to indicate an
``unallocated'' key.
The copy_fn function is invoked when a communicator is
duplicated by MPI_COMM_DUP. copy_fn should be
of type MPI_Copy_function, which is defined as follows:
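That is,

   typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval, void *extra_state,
                                 void *attribute_val_in, void *attribute_val_out,
                                 int *flag);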
A Fortran declaration for such a function is as follows:
SUBROUTINE COPY_FUNCTION(OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, FLAG, IERR)INTEGER OLDCOMM, KEYVAL, EXTRA_STATE, ATTRIBUTE_VAL_IN, ATTRIBUTE_VAL_OUT, IERR
Whenever a communicator is replicated using the function
MPI_COMM_DUP, all callback copy functions for attributes
that are currently set are invoked (in arbitrary order). Each call to
the copy callback is passed as input parameters the old communicator
oldcomm, the key value keyval, the additional state
extra_state that was provided to MPI_KEYVAL_CREATE
when the key value was created, and the current attribute value
attribute_val_in.
If it returns flag = false, then the attribute is
deleted in the duplicated communicator. Otherwise, when flag = true,
the new attribute value is set to the value returned in
attribute_val_out. The function returns MPI_SUCCESS on
success and an error code on failure (in which case
MPI_COMM_DUP will fail).
copy_fn may be specified as
MPI_NULL_COPY_FN or MPI_DUP_FN
from either C or FORTRAN. MPI_NULL_COPY_FN
is a function that does nothing other than returning flag = 0
and MPI_SUCCESS; I.e., the attribute is not copied.
MPI_DUP_FN sets flag = 1,
returns the value of
attribute_val_in in attribute_val_out and
returns MPI_SUCCESS. I.e., the attribute value is copied, with no
side-effects.
Analogous to copy_fn is a callback deletion function, defined
as follows. The delete_fn function is invoked when a communicator is
deleted by MPI_COMM_FREE or by a call
to MPI_ATTR_DELETE or MPI_ATTR_PUT. delete_fn should be
of type MPI_Delete_function, which is defined as follows:
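That is,

   typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
                                   void *attribute_val, void *extra_state);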
A Fortran declaration for such a function is as follows:
SUBROUTINE DELETE_FUNCTION(COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE, IERR)INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, EXTRA_STATE, IERR
Whenever a communicator is deleted using the function
MPI_COMM_FREE, all callback delete functions for attributes
that are currently set are invoked (in arbitrary order).
In addition the callback delete function for the
deleted attribute is invoked by MPI_ATTR_DELETE
and MPI_ATTR_PUT. The function is passed as input parameters the
communicator comm, the key value keyval, the current attribute
value attribute_val, and the additional state
extra_state that was passed to MPI_KEYVAL_CREATE when the
key value was allocated.
The function returns
MPI_SUCCESS on success and an error code on failure (in which case
MPI_COMM_FREE will fail).
delete_fn may be specified as
MPI_NULL_DELETE_FN from either C or FORTRAN;
MPI_NULL_DELETE_FN is a function that does nothing, other
than returning MPI_SUCCESS.
MPI_Keyval_free(int *keyval)
MPI_KEYVAL_FREE(KEYVAL, IERROR)INTEGER KEYVAL, IERROR
MPI_KEYVAL_FREE deallocates an attribute key value. This function sets
the value of keyval to MPI_KEYVAL_INVALID. Note
that it is not erroneous to free an attribute key that is in use (i.e., has
attached values for some communicators); the key value is not actually
deallocated until after no attribute values are locally attached to this key.
All such attribute values need to be explicitly deallocated by the
program, either
via calls to MPI_ATTR_DELETE that free one attribute instance,
or by calls to MPI_COMM_FREE that free all attribute
instances associated with the freed communicator.
MPI_Attr_put(MPI_Comm comm, int keyval, void* attribute_val)
MPI_ATTR_PUT(COMM, KEYVAL, ATTRIBUTE_VAL, IERROR)INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
MPI_ATTR_PUT associates the value
attribute_val with the key keyval on communicator
comm.
If a value is already associated with this key on the communicator,
then the outcome is as if MPI_ATTR_DELETE was first called to
delete the previous value (and the callback function
delete_fn was executed), and a new value was next stored.
The call is erroneous if there is no key with value
keyval; in particular
MPI_KEYVAL_INVALID is an erroneous value for keyval.
MPI_Attr_get(MPI_Comm comm, int keyval, void *attribute_val, int *flag)
MPI_ATTR_GET(COMM, KEYVAL, ATTRIBUTE_VAL, FLAG, IERROR)INTEGER COMM, KEYVAL, ATTRIBUTE_VAL, IERROR
MPI_ATTR_GET retrieves an attribute value by key. The call is
erroneous if there is no key with value
keyval. In particular
MPI_KEYVAL_INVALID is an erroneous value for keyval.
On the other hand, the call is correct if the key value
exists, but no attribute is attached on comm for that key; in
such a case,
the call returns flag = false. If an attribute is attached on
comm to keyval, then the call returns
flag = true, and returns the attribute value in attribute_val.
MPI_Attr_delete(MPI_Comm comm, int keyval)
MPI_ATTR_DELETE(COMM, KEYVAL, IERROR)INTEGER COMM, KEYVAL, IERROR
MPI_ATTR_DELETE deletes the attribute attached to key keyval on
comm. This function invokes the attribute delete function
delete_fn specified when the keyval was created.
The call will fail if there is no key with value keyval or if the
delete_fn
function returns an error code other than MPI_SUCCESS.
On the other hand, the call is correct even if no attribute is currently
attached to keyval on comm.
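Returning to the mcast example, a version along the following lines (a sketch using the caching functions just described; the key variable and helper details are illustrative, and the usual headers are assumed) caches a private duplicate communicator on comm the first time mcast is invoked with that communicator, and reuses it afterwards:

   static int mcast_key = MPI_KEYVAL_INVALID;   /* statically allocated key */

   void mcast(void *buf, int count, MPI_Datatype type, int isroot, MPI_Comm comm)
   {
       int        rank, size, i, flag;
       MPI_Comm   *pcomm_ptr, pcomm;
       MPI_Status status;

       if (mcast_key == MPI_KEYVAL_INVALID)     /* allocate the key once */
           MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN,
                             &mcast_key, NULL);

       MPI_Attr_get(comm, mcast_key, &pcomm_ptr, &flag);
       if (flag) {                              /* private communicator is cached */
           pcomm = *pcomm_ptr;
       } else {                                 /* first invocation on this comm */
           pcomm_ptr = (MPI_Comm *) malloc(sizeof(MPI_Comm));
           MPI_Comm_dup(comm, pcomm_ptr);
           MPI_Attr_put(comm, mcast_key, pcomm_ptr);
           pcomm = *pcomm_ptr;
       }

       MPI_Comm_rank(pcomm, &rank);
       MPI_Comm_size(pcomm, &size);
       if (isroot) {
           for (i = 0; i < size; i++)
               if (i != rank)
                   MPI_Send(buf, count, type, i, 0, pcomm);
       } else {
           MPI_Recv(buf, count, type, MPI_ANY_SOURCE, 0, pcomm, &status);
       }
   }

A more careful version would register a delete callback that frees the cached communicator when comm is freed.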
The code above dedicates a statically allocated private communicator for the use
of mcast. This segregates communication within the library from
communication outside the library. However, this approach does not provide
separation of communications belonging to distinct invocations of the same
library function, since they all use the same communication domain. Consider
two successive collective invocations of mcast by four
processes, where process 0 is the
broadcast root in the first one, and process 3 is the root in the second one.
The intended execution and communication flow for these two invocations is
illustrated in Figure
.
However, there is a race
between messages sent by
the first invocation of mcast, from process 0 to process 1,
and messages sent by the second invocation of mcast,
from process 3 to process 1. The erroneous execution illustrated in
Figure
may occur, where messages sent by second
invocation overtake messages from the first invocation.
This phenomenon is known as backmasking.
How can we avoid backmasking? One option is to revert to the approach in
Example
, where a separate communication domain is generated
for each invocation. Another option is to add a barrier
synchronization, either
at the entry or at the exit from the library call. Yet another option is to
rewrite the library code, so as to prevent the nondeterministic race. The
race occurs because
receives with dontcare's are used. It is often possible to
avoid the use of such constructs. Unfortunately, avoiding dontcares leads to a
less efficient implementation
of mcast. A possible alternative is to use increasing tag numbers to
disambiguate successive invocations of mcast.
An ``invocation count''
can be cached with each communicator, as an additional library attribute.
The resulting code is shown below.
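A sketch of this approach (illustrative; it builds on the cached-communicator version above) additionally caches an invocation counter on comm and uses its value as the message tag, so that messages from successive invocations cannot be confused:

   static int count_key = MPI_KEYVAL_INVALID;   /* key for the invocation count */

   /* inside mcast, after the private communicator pcomm has been obtained: */
   int *count_ptr, flag, tag;

   if (count_key == MPI_KEYVAL_INVALID)
       MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &count_key, NULL);
   MPI_Attr_get(comm, count_key, &count_ptr, &flag);
   if (!flag) {                                 /* first invocation on this comm */
       count_ptr  = (int *) malloc(sizeof(int));
       *count_ptr = 0;
       MPI_Attr_put(comm, count_key, count_ptr);
   }
   tag = (*count_ptr)++;                        /* a distinct tag per invocation */
   /* the sends and the receive then use this tag in place of tag 0 */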
This section introduces the concept of inter-communication and
describes the portions of MPI that support it.
All point-to-point communication described thus far has involved
communication between processes that are members of the same group.
In modular and multi-disciplinary applications, different process groups
execute distinct modules and processes within different modules communicate
with one another in a pipeline or a more general module graph. In these
applications, the most natural way for a process to specify a peer process
is by the rank of the peer process within the peer group. In applications
that contain internal user-level servers, each server may be a process group
that provides services to one or more clients, and each client may be a
process group that uses the services of one or more servers. It is again most
natural to specify the peer process by rank within the peer group in these
applications.
An inter-group communication domain is specified by a set of
intercommunicators with the pair of disjoint
groups (A,B) as their attribute, such
that
This distributed data structure is
illustrated
in Figure
, for the case of a pair of groups
(A,B), with two (upper box) and three (lower box)
processes, respectively.
The communicator structure distinguishes between a local group, namely the
group containing the process where the structure reside, and a remote
group, namely the other group.
The structure is symmetric:
for processes in group A, A is the local group and B is
the remote group, whereas for processes in group B,
B is the local group and A is the remote group.
An inter-group communication will involve a process in one group
executing a send call and another process, in the other group, executing a
matching receive call.
As in
intra-group communication, the matching process (destination of send
or source of receive) is specified using
a (communicator, rank) pair. Unlike intra-group communication,
the rank is relative to the second, remote group.
Thus, in the communication domain illustrated in
Figure
, process 1 in group A sends a message to
process 2 in group B with a call MPI_SEND(..., 2, tag, comm);
process 2 in group B receives this message with a call
MPI_RECV(..., 1, tag, comm).
Conversely, process 2 in group B sends a message to process 1 in group
A with a call to MPI_SEND(..., 1, tag, comm), and the message
is received by a call to MPI_RECV(..., 2, tag, comm); a
remote process is identified in the same way for the purposes of
sending or
receiving.
All point-to-point communication functions can be used with
intercommunicators for inter-group communication.
Here is a summary of the properties of inter-group communication and
intercommunicators: the syntax of point-to-point communication is the same
for inter- and intra-communication, and the same intercommunicator can be
used for both send and receive operations; a target process is addressed by
its rank in the remote group, both for sends and for receives; communications
using an intercommunicator are guaranteed not to conflict with any
communications that use a different communicator; and an intercommunicator
cannot be used for collective communication.
The routine MPI_COMM_TEST_INTER may be used to determine if
a communicator is an inter- or intracommunicator. Intercommunicators can be
used as arguments to some of the other communicator access routines.
Intercommunicators cannot be used as input to some of the constructor routines
for intracommunicators (for instance, MPI_COMM_CREATE).
It is often convenient to generate an inter-group communication domain by
joining together two intra-group communication domains, i.e., building the pair
of communicating groups from the individual groups.
This requires that there exists
one process in each group that can communicate with each other
through a communication domain that serves as a bridge between the two groups.
For example, suppose that comm1 has 3 processes and
comm2 has 4 processes (see Figure
).
In terms of the
MPI_COMM_WORLD, the processes in comm1 are 0, 1 and 2 and
in comm2
are 3, 4, 5 and 6. Let local process 0 in each intracommunicator form
the bridge. They can communicate via MPI_COMM_WORLD where process 0
in comm1 has rank 0 and process 0 in comm2
has rank 3. Once the
intercommunicator is formed, the original group for each
intracommunicator is the local group in the intercommunicator and the
group from the other intracommunicator becomes the remote group. For
communication with this intercommunicator, the rank in the remote group is
used. For example, if a process in comm1 wants
to send to process 2 of
comm2 (MPI_COMM_WORLD rank 5) then it uses 2 as the rank in the
send.
Intercommunicators are created in this fashion by the call
MPI_INTERCOMM_CREATE.
The two joined groups are required to be disjoint.
The converse function of building an intracommunicator from an
intercommunicator is provided by the call MPI_INTERCOMM_MERGE.
This call generates a communication domain with a group which is the
union of the two groups of the inter-group communication domain.
Both calls are blocking. Both will generally require
collective communication
within each of the involved groups, as well as communication across the
groups.
MPI_Comm_test_inter(MPI_Comm comm, int *flag)
MPI_COMM_TEST_INTER(COMM, FLAG, IERROR)INTEGER COMM, IERROR
MPI_COMM_TEST_INTER is a local routine that allows the calling process
to determine if a communicator is an intercommunicator or an
intracommunicator. It returns
true if it is an intercommunicator, otherwise false.
When an intercommunicator is used as an input argument to the
communicator accessors described in
Section
,
the behavior is described by the following table.

   MPI_COMM_SIZE    returns the size of the local group
   MPI_COMM_GROUP   returns the local group
   MPI_COMM_RANK    returns the rank of the calling process in the local group
Furthermore, the operation MPI_COMM_COMPARE is valid
for intercommunicators. Both communicators must be either intra- or
intercommunicators, or else MPI_UNEQUAL results. Both corresponding
local and remote groups must compare correctly to get the results
MPI_CONGRUENT and MPI_SIMILAR. In particular, it is
possible for MPI_SIMILAR to result because either the local or remote
groups were similar but not identical.
The following accessors provide consistent access to the remote group of
an intercommunicator; they are all local operations.
MPI_Comm_remote_size(MPI_Comm comm, int *size)
MPI_COMM_REMOTE_SIZE(COMM, SIZE, IERROR)INTEGER
COMM, SIZE, IERROR
MPI_COMM_REMOTE_SIZE returns the size of the remote group in the
intercommunicator. Note that the size of the local group is given by
MPI_COMM_SIZE.
MPI_Comm_remote_group(MPI_Comm comm, MPI_Group *group)
MPI_COMM_REMOTE_GROUP(COMM, GROUP, IERROR)INTEGER
COMM, GROUP, IERROR
MPI_COMM_REMOTE_GROUP returns the remote group in the
intercommunicator. Note that the local group is given by
MPI_COMM_GROUP.
An intercommunicator can be created by a call to MPI_COMM_DUP,
see Section
. As for intracommunicators, this
call generates a new inter-group communication domain with the same groups as
the old one, and also replicates user-defined attributes. An intercommunicator
is deallocated by a call to MPI_COMM_FREE. The other
intracommunicator constructor functions
of Section
do not apply to intercommunicators.
Two new functions are specific to intercommunicators.
MPI_Intercomm_create(MPI_Comm local_comm, int local_leader, MPI_Comm bridge_comm, int remote_leader, int tag, MPI_Comm *newintercomm)
MPI_INTERCOMM_CREATE(LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG, NEWINTERCOMM, IERROR)
INTEGER LOCAL_COMM, LOCAL_LEADER, PEER_COMM, REMOTE_LEADER, TAG, NEWINTERCOMM, IERROR
MPI_INTERCOMM_CREATE creates an intercommunicator. The call is collective
over the union
of the two groups. Processes should provide matching local_comm and
identical local_leader arguments within each of the two groups.
The two leaders specify matching bridge_comm arguments, and each provides in remote_leader the rank of the other leader within the domain of bridge_comm. Both provide identical tag values.
Wildcards are not permitted for remote_leader, local_leader, or tag.
This call uses point-to-point communication with communicator
bridge_comm, and with tag tag between the leaders.
Thus, care must be taken that there be no pending communication on
bridge_comm that could interfere with this communication.
MPI_Intercomm_merge(MPI_Comm intercomm, int high, MPI_Comm *newintracomm)
MPI_INTERCOMM_MERGE(INTERCOMM, HIGH, NEWINTRACOMM, IERROR)
INTEGER INTERCOMM, NEWINTRACOMM, IERROR
LOGICAL HIGH
MPI_INTERCOMM_MERGE creates an intracommunicator from the union of
the two groups that are associated with intercomm. All
processes should provide the same
high value within each of the two groups. If processes in one group
provided the value high = false and processes in the other group
provided the value high = true then the union orders the ``low'' group
before the ``high'' group. If all processes provided the same high
argument then the order of the union is arbitrary.
This call is blocking and collective within the union of
the two groups.
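The following sketch (not an elided example from the text) illustrates both calls; it assumes at least four processes in MPI_COMM_WORLD and uses an arbitrary tag value of 99. MPI_COMM_WORLD is split into a group of three processes and a group containing the rest, the two local leaders build an intercommunicator over MPI_COMM_WORLD, and the result is then merged back into a single intracommunicator.

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm local_comm, inter_comm, merged_comm;
    int world_rank, color, remote_leader;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    color = (world_rank < 3) ? 0 : 1;        /* first group: world ranks 0, 1, 2 */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &local_comm);

    remote_leader = (color == 0) ? 3 : 0;    /* leader of the other group in MPI_COMM_WORLD */
    MPI_Intercomm_create(local_comm, 0, MPI_COMM_WORLD, remote_leader,
                         99, &inter_comm);

    /* high = color, so the first group is ordered before the second */
    MPI_Intercomm_merge(inter_comm, color, &merged_comm);

    MPI_Comm_free(&merged_comm);
    MPI_Comm_free(&inter_comm);
    MPI_Comm_free(&local_comm);
    MPI_Finalize();
    return 0;
}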
This chapter discusses the MPI topology mechanism. A topology is an extra,
optional attribute that one can give to an intra-communicator; topologies
cannot be added to inter-communicators. A topology can provide a convenient
naming mechanism for the processes of a group (within a communicator), and
additionally, may assist the runtime system in mapping the processes onto
hardware.
As stated in Chapter
,
a process group in MPI is a collection of n processes. Each process in
the group is assigned a rank between 0 and n-1. In many parallel
applications a linear ranking of processes does not adequately reflect the logical
communication pattern of the processes (which is usually determined by the
underlying problem geometry and the numerical algorithm used). Often the
processes are arranged in topological patterns such as two- or
three-dimensional grids. More generally, the logical process arrangement is
described by a graph. In this chapter we will refer to this logical process
arrangement as the ``virtual topology.''
A clear distinction must be made between the virtual process topology
and the topology of the underlying, physical hardware. The virtual
topology can be exploited by the system in the assignment of processes
to physical processors, if this helps to improve the communication
performance on a given machine. How this mapping is done, however, is
outside the scope of MPI. The description of the virtual topology,
on the other hand, depends only on the application, and is
machine-independent. The functions in this chapter deal only with
machine-independent mapping.
The communication pattern of a set of processes can be represented by a
graph. The nodes stand for the processes, and the edges connect processes that
communicate with each other. Since communication is most often
symmetric, communication graphs are assumed to be symmetric: if an edge connects node i to node j, then an edge connects node j to node i.
MPI provides message-passing between any pair
of processes in a group. There is no requirement for opening a channel
explicitly. Therefore, a ``missing link'' in the user-defined process graph
does not prevent the corresponding processes from exchanging messages. It
means, rather, that this connection is neglected in the virtual topology. This
strategy implies
that the topology gives no convenient way of naming this pathway of
communication. Another possible consequence is that an automatic mapping tool
(if one exists for the runtime environment) will not take account of this edge
when mapping, and communication on the ``missing'' link will be
less efficient.
Specifying the virtual
topology in terms of a graph is sufficient for all applications. However, in
many applications the graph structure is regular, and the detailed set-up
of the graph would be inconvenient for the user and might be less
efficient at
run time. A large fraction of all parallel applications use process topologies
like rings, two- or higher-dimensional grids, or tori. These structures are
completely defined by the number of dimensions and the numbers of processes in
each coordinate direction. Also, the mapping of grids and tori is generally
an easier problem than general graphs. Thus, it is desirable to
address these cases explicitly.
Process coordinates in a Cartesian structure begin their numbering at 0.
Row-major numbering is always used for the processes in a
Cartesian structure. This means that, for example, the relation
between group rank and coordinates for twelve processes in
a
grid is as shown in Figure
.
MPI manages system memory that is used for buffering
messages and for storing internal representations of various MPI objects
such as groups, communicators, datatypes, etc.
This memory is not directly accessible to the user, and objects stored
there are opaque: their size and shape is not visible to the
user. Opaque objects are accessed via handles, which exist in
user space. MPI procedures that operate on opaque objects are
passed handle arguments to access these objects.
In addition to their use by MPI calls for object access, handles can
participate in assignments and comparisons.
In Fortran, all handles have type INTEGER.
In C, a different handle type is defined for each category of objects.
Implementations
should use types that support assignment and equality operators.
In Fortran, the handle can be an index in a table of opaque objects,
while in C it can be such an index or a pointer to the object.
More bizarre possibilities exist.
Opaque objects are allocated and deallocated
by calls that are specific to each object type.
These are listed in the sections where the objects are described.
The calls accept a handle argument of matching type.
In an allocate call this is an argument that
returns a valid reference to the object.
In a call to deallocate this is an argument which returns
with a ``null handle'' value.
MPI provides a ``null handle'' constant
for each object type. Comparisons to this constant are used to test for
validity of the handle.
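As an illustrative fragment (assuming MPI has already been initialized), a group handle obtained from an allocation call can be freed and then tested against its null constant:

MPI_Group group;

MPI_Comm_group(MPI_COMM_WORLD, &group);   /* allocation: group is now a valid handle */
MPI_Group_free(&group);                   /* deallocation: group is set to MPI_GROUP_NULL */

if (group == MPI_GROUP_NULL) {
    /* the handle no longer refers to a valid object */
}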
MPI calls do not change the value of handles, with the exception of
calls that allocate and deallocate objects, and of the call
MPI_TYPE_COMMIT, defined in Section
.
A null handle argument is an erroneous argument in
MPI calls, unless an exception is explicitly stated in the text that
defines the function. Such exceptions are allowed for handles to
request objects in Wait and Test calls
(Section
).
Otherwise, a null handle can only be passed to a function that
allocates a new object and returns a reference to it in the handle.
A call to deallocate invalidates the handle and marks the object for
deallocation. The object is not accessible to the user after the
call.
However, MPI need not deallocate the object immediately. Any
operation pending
(at the time of the deallocate)
that involves this object will complete normally; the object will be
deallocated afterwards.
An opaque object and its
handle are significant only at the process where the object
was created, and cannot be transferred to another process.
MPI provides certain predefined opaque objects and
predefined, static handles to
these objects. Such objects may not be destroyed.
In some applications, it is desirable to use different Cartesian topologies at
different stages in the computation. For example, in a QR factorization, the kth transformation is determined by the data below the diagonal in the kth column of the matrix. It is often easiest to think of the upper right hand corner of the 2D topology as starting on the process with the kth diagonal element of the matrix for the kth stage of the computation. Since the original matrix was laid out in the original 2D topology, it is necessary to maintain a relationship between it and the shifted 2D topology in the kth stage. For example, the processes forming a row or column in the original 2D topology must also form a row or column in the shifted 2D topology in the kth stage.
As stated in Section
and shown in
Figure
, there is a clear correspondence between
the rank of a process and its coordinates in a Cartesian topology.
This relationship can be used to create multiple Cartesian topologies
with the desired relationship. Figure
shows
the relationship of two 2D Cartesian topologies where the second
one is shifted by two rows and two columns.
The support for virtual topologies as defined in this chapter is
consistent with other parts of MPI, and, whenever possible,
makes use of functions that are defined elsewhere.
Topology information is associated with communicators. It can be implemented
using the caching mechanism described in
Chapter
.
This section describes the MPI functions for creating Cartesian topologies.
MPI_CART_CREATE can be used to describe Cartesian structures of
arbitrary dimension. For each coordinate direction one specifies
whether the process structure is periodic or not. For a 1D topology,
it is linear if it is not periodic and a ring if it is periodic. For
a 2D topology, it is a rectangle, cylinder, or torus as it goes from
non-periodic to periodic in one dimension to fully periodic. Note that
an n-dimensional hypercube is an n-dimensional torus with 2 processes per coordinate direction. Thus, special support for
hypercube structures is not necessary.
MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
MPI_CART_CREATE(COMM_OLD, NDIMS, DIMS, PERIODS, REORDER, COMM_CART, IERROR)
INTEGER COMM_OLD, NDIMS, DIMS(*), COMM_CART, IERROR
LOGICAL PERIODS(*), REORDER
MPI_CART_CREATE returns a handle to a new communicator to which the
Cartesian topology information is attached. In analogy to the function
MPI_COMM_CREATE, no cached information propagates
to the new communicator. Also, this function is collective. As with
other collective calls, the program must be written to work correctly,
whether the call synchronizes or not.
If reorder = false then the rank of each process in the new
group is identical to its rank in the old group. Otherwise, the
function may reorder the processes (possibly so as to choose a good
embedding of the virtual topology onto the physical machine). If the
total size of the Cartesian grid is smaller than the size of the group
of comm_old, then some processes are returned
MPI_COMM_NULL, in analogy to MPI_COMM_SPLIT.
The call is erroneous if it specifies a grid that is larger than the
group size.
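A sketch of a typical call (the values are illustrative; the fragment assumes at least 12 processes and is executed after MPI_INIT) creates a 3 x 4 grid that is periodic in the second dimension only, allowing MPI to reorder ranks:

MPI_Comm comm_cart;
int dims[2]    = {3, 4};
int periods[2] = {0, 1};    /* non-periodic, periodic */

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);
if (comm_cart != MPI_COMM_NULL) {
    /* this process is part of the grid */
}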
For Cartesian topologies, the function MPI_DIMS_CREATE helps
the user select a balanced distribution of processes per coordinate
direction, depending on the number of processes in the group to be
balanced and optional constraints that can be specified by the user.
One possible use of this function is to partition all the processes
(the size of
MPI_COMM_WORLD's group) into an n-dimensional topology.
MPI_Dims_create(int nnodes, int ndims, int *dims)
MPI_DIMS_CREATE(NNODES, NDIMS, DIMS, IERROR)
INTEGER NNODES, NDIMS, DIMS(*), IERROR
The entries in the array dims are set to describe a Cartesian
grid with ndims dimensions and a total of nnodes
nodes. The dimensions are set to be as close to each other as
possible, using an appropriate divisibility algorithm. The caller may
further constrain the operation of this routine by specifying elements
of array dims. If dims[i] is set to a positive number, the
routine will not modify the number of nodes in dimension i; only
those entries where dims[i] = 0 are modified by the call.
Negative input values of dims[i] are erroneous.
An error will occur if nnodes is not a multiple of the product of the nonzero entries of dims.
For dims[i] set by the call, dims[i] will be ordered in
monotonically decreasing order. Array dims is suitable for use
as input to routine MPI_CART_CREATE.
MPI_DIMS_CREATE is local. Several sample calls are shown
in Example
.
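Since the Example is not reproduced here, the following fragment sketches some typical calls; the results indicated in the comments are the balanced factorizations that the rules above imply.

int dims_a[2] = {0, 0};
MPI_Dims_create(6, 2, dims_a);    /* dims_a becomes (3,2) */

int dims_b[2] = {0, 0};
MPI_Dims_create(7, 2, dims_b);    /* dims_b becomes (7,1) */

int dims_c[3] = {0, 3, 0};
MPI_Dims_create(6, 3, dims_c);    /* dims_c becomes (2,3,1) */

/* erroneous: 7 is not a multiple of 3 */
/* MPI_Dims_create(7, 3, dims_c); */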
Once a Cartesian topology is set up, it may be necessary to inquire
about the topology. These functions are given below and are all local
calls.
MPI_Cartdim_get(MPI_Comm comm, int *ndims)
MPI_CARTDIM_GET(COMM, NDIMS, IERROR)
INTEGER COMM, NDIMS, IERROR
MPI_CARTDIM_GET returns the number of dimensions of the
Cartesian structure associated with comm. This can be used to provide
the other Cartesian inquiry functions with the correct size of arrays.
The communicator with the topology in Figure
would return ndims = 2.
MPI_Cart_get(MPI_Comm comm, int maxdims, int *dims, int *periods, int *coords)
MPI_CART_GET(COMM, MAXDIMS, DIMS, PERIODS, COORDS, IERROR)
INTEGER COMM, MAXDIMS, DIMS(*), COORDS(*), IERROR
LOGICAL PERIODS(*)
MPI_CART_GET returns information on the Cartesian topology
associated with comm. maxdims must be at least
ndims as returned by
MPI_CARTDIM_GET. For the example in
Figure
,
. The
coords are as given for the rank of the calling process as
shown, e.g., process 6 returns
.
The functions in this section translate to/from the rank and the
Cartesian topology coordinates. These calls are local.
MPI_Cart_rank(MPI_Comm comm, int *coords, int *rank)
MPI_CART_RANK(COMM, COORDS, RANK, IERROR)
INTEGER COMM, COORDS(*), RANK, IERROR
For a process group with Cartesian structure, the function
MPI_CART_RANK translates the logical process coordinates to
process ranks as they are used by the point-to-point routines.
coords is an array of size ndims as returned by
MPI_CARTDIM_GET.
For the example in Figure
,
would return
.
For dimension i with periods(i) = true, if the coordinate, coords(i), is out of range, that is, coords(i) < 0 or coords(i) >= dims(i), it is shifted back to the interval 0 <= coords(i) < dims(i) automatically.
If the topology in Figure
is periodic in both
dimensions (torus), then
would also return
. Out-of-range
coordinates are erroneous for non-periodic dimensions.
MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
MPI_CART_COORDS(COMM, RANK, MAXDIMS, COORDS, IERROR)
INTEGER COMM, RANK, MAXDIMS, COORDS(*), IERROR
MPI_CART_COORDS is the rank-to-coordinates translator. It
is the inverse mapping of MPI_CART_RANK. maxdims
is at least as big as ndims as returned by
MPI_CARTDIM_GET. For the example in
Figure
,
would return
.
If the process topology is a Cartesian structure, a
MPI_SENDRECV operation is likely to be used along a coordinate
direction to perform a shift of data. As input, MPI_SENDRECV
takes the rank of a source process for the receive, and the rank of a
destination process for the send.
A Cartesian shift operation is specified by the coordinate of the
shift and by the size of the shift step (positive or negative). The
function MPI_CART_SHIFT takes such a specification as input and returns the information needed to call MPI_SENDRECV.
The function MPI_CART_SHIFT is local.
MPI_Cart_shift(MPI_Comm comm, int direction, int disp, int *rank_source, int *rank_dest)
MPI_CART_SHIFT(COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR)
INTEGER COMM, DIRECTION, DISP, RANK_SOURCE, RANK_DEST, IERROR
The direction argument indicates the dimension of the shift,
i.e., the coordinate whose value is modified by the shift. The
coordinates are numbered from 0 to ndims-1, where
ndims is the number of dimensions.
Depending on the periodicity of the Cartesian group in the specified
coordinate direction, MPI_CART_SHIFT provides the identifiers
for a circular or an end-off shift. In the case of an end-off shift,
the value MPI_PROC_NULL may be returned in
rank_source and/or
rank_dest, indicating that the source and/or the destination
for the shift is out of range. This is a valid input to the sendrecv
functions.
Neither MPI_CART_SHIFT, nor MPI_SENDRECV are
collective functions. It is not required that all processes in the
grid call MPI_CART_SHIFT with the same direction
and disp arguments, but only that sends match receives in the
subsequent calls to MPI_SENDRECV.
Example
shows such use of MPI_CART_SHIFT,
where each column of a 2D grid is shifted by a different amount.
Figures
and
show the
result on 12 processors.
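The Example and Figures are not reproduced here; the fragment below sketches the idea. It assumes comm_cart carries a 2D Cartesian topology and that a holds one local value. Each column of the grid is shifted along dimension 0 by an amount equal to its own column coordinate, and MPI_SENDRECV_REPLACE is used to move the data in place.

int rank, source, dest, coords[2];
double a = 0.0;               /* local data to be shifted */
MPI_Status status;

MPI_Comm_rank(comm_cart, &rank);
MPI_Cart_coords(comm_cart, rank, 2, coords);
MPI_Cart_shift(comm_cart, 0, coords[1], &source, &dest);
MPI_Sendrecv_replace(&a, 1, MPI_DOUBLE, dest, 0, source, 0, comm_cart, &status);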
MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *newcomm)
MPI_CART_SUB(COMM, REMAIN_DIMS, NEWCOMM, IERROR)
INTEGER COMM, NEWCOMM, IERROR
LOGICAL REMAIN_DIMS(*)
If a Cartesian topology has been created with MPI_CART_CREATE, the
function MPI_CART_SUB can be used to partition the
communicator group into subgroups that form lower-dimensional Cartesian
subgrids, and to build for each subgroup a communicator with the associated
subgrid Cartesian topology. This call is collective.
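A sketch of a typical use (assuming comm_cart carries a 2D topology such as the 3 x 4 grid above): keeping only the second dimension gives each row of the grid its own one-dimensional communicator.

MPI_Comm row_comm;
int remain_dims[2] = {0, 1};   /* drop dimension 0, keep dimension 1 */

MPI_Cart_sub(comm_cart, remain_dims, &row_comm);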
Typically, the functions already presented are used to create and use
Cartesian topologies. However, some applications may want more
control over the process. MPI_CART_MAP returns the
Cartesian map recommended by the MPI system, in order to map well
the virtual communication graph of the application on the physical
machine topology.
This call is collective.
MPI_Cart_map(MPI_Comm comm, int ndims, int *dims, int *periods, int *newrank)
MPI_CART_MAP(COMM, NDIMS, DIMS, PERIODS, NEWRANK, IERROR)
INTEGER COMM, NDIMS, DIMS(*), NEWRANK, IERROR
LOGICAL PERIODS(*)
MPI_CART_MAP
computes an ``optimal'' placement for the calling process on the
physical machine.
MPI procedures sometimes assign a special meaning to a special value
of an
argument. For example, tag is an integer-valued argument of
point-to-point communication operations, that can take a special wild-card
value, MPI_ANY_TAG.
Such arguments will have a range of regular values, which is a proper
subrange
of the range of values of the corresponding type of the variable.
Special values (such as MPI_ANY_TAG)
will be outside the regular range. The range of regular values can
be queried using environmental inquiry
functions (Chapter
).
MPI also provides predefined named constant handles, such as
MPI_COMM_WORLD, which is a handle to an object that represents all
processes available at start-up time and allowed to communicate with
any of them.
All named constants, with the exception of MPI_BOTTOM in
Fortran, can be used in initialization expressions or assignments.
These constants do not change values during execution. Opaque objects
accessed by constant handles are defined and do not change value
between MPI initialization (MPI_INIT() call) and MPI
completion (MPI_FINALIZE() call).
This section describes the MPI functions for creating graph topologies.
MPI_Graph_create(MPI_Comm comm_old, int nnodes, int *index, int *edges, int reorder, MPI_Comm *comm_graph)
MPI_GRAPH_CREATE(COMM_OLD, NNODES, INDEX, EDGES, REORDER, COMM_GRAPH, IERROR)
INTEGER COMM_OLD, NNODES, INDEX(*), EDGES(*), COMM_GRAPH, IERROR
LOGICAL REORDER
MPI_GRAPH_CREATE returns a new communicator to which the
graph topology information is attached. If reorder = false
then the rank of each process in the new group is identical to its
rank in the old group. Otherwise, the function may reorder the
processes. If the size,
nnodes, of the graph is smaller than the size of the group of
comm_old,
then some processes are returned MPI_COMM_NULL, in
analogy to MPI_COMM_SPLIT. The call is erroneous if it
specifies a graph that is larger than the group size of the input
communicator. In analogy to the function
MPI_COMM_CREATE, no cached information propagates
to the new communicator. Also, this function is collective. As with
other collective calls, the program must be written to work correctly,
whether the call synchronizes or not.
The three parameters nnodes, index and edges define the graph
structure.
nnodes is the number of nodes of the graph. The nodes are numbered
from 0 to nnodes-1.
The ith entry of array index stores the total number of
neighbors of the first i graph nodes. The lists of neighbors of
nodes 0, 1, ..., nnodes-1 are stored in consecutive locations in array
edges. The array edges is a flattened representation
of the edge lists.
The total number of entries in index is nnodes and
the total number of entries in edges is equal to the number of
graph edges.
The definitions of the arguments nnodes, index, and
edges are illustrated in Example
.
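Since the Example is not reproduced here, the fragment below shows an illustrative four-node graph (not necessarily the one in the Example): node 0 is connected to nodes 1 and 3, node 1 to node 0, node 2 to node 3, and node 3 to nodes 0 and 2.

MPI_Comm comm_graph;
int nnodes   = 4;
int index[4] = {2, 3, 4, 6};         /* cumulative neighbour counts */
int edges[6] = {1, 3, 0, 3, 0, 2};   /* neighbour lists, concatenated */

MPI_Graph_create(MPI_COMM_WORLD, nnodes, index, edges, 0, &comm_graph);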
Once a graph topology is set up, it may be necessary to inquire
about the topology. These functions are given below and are all local
calls.
MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges)
MPI_GRAPHDIMS_GET(COMM, NNODES, NEDGES, IERROR)
INTEGER COMM, NNODES, NEDGES, IERROR
MPI_GRAPHDIMS_GET returns the number of nodes
and the number of edges in the graph. The number of nodes is
identical to the size of the group associated with comm.
nnodes and nedges can be used to supply arrays of
correct size for index and edges, respectively, in
MPI_GRAPH_GET. MPI_GRAPHDIMS_GET would return
and
for
Example
.
MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index, int *edges)
MPI_GRAPH_GET(COMM, MAXINDEX, MAXEDGES, INDEX, EDGES, IERROR)
INTEGER COMM, MAXINDEX, MAXEDGES, INDEX(*), EDGES(*), IERROR
MPI_GRAPH_GET returns index and edges as
was supplied to MPI_GRAPH_CREATE. maxindex and
maxedges are at least as big as nnodes and
nedges, respectively, as returned by
MPI_GRAPHDIMS_GET above. Using the comm created
in Example
would return the index and
edges given in the example.
The functions in this section provide information about the structure
of the graph topology. All calls are local.
MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors)
MPI_GRAPH_NEIGHBORS_COUNT(COMM, RANK, NNEIGHBORS, IERROR)
INTEGER COMM, RANK, NNEIGHBORS, IERROR
MPI_GRAPH_NEIGHBORS_COUNT returns the number of neighbors
for the process signified by rank. It can be used by
MPI_GRAPH_NEIGHBORS to give an array of correct size for
neighbors. Using Example
with
would give
.
MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, int *neighbors)
MPI_GRAPH_NEIGHBORS(COMM, RANK, MAXNEIGHBORS, NEIGHBORS, IERROR)
INTEGER COMM, RANK, MAXNEIGHBORS, NEIGHBORS(*), IERROR
MPI_GRAPH_NEIGHBORS returns the part of the edges
array associated with process rank. Using
Example
,
would return
. Another use is given in
Example
.
The low-level function for general graph topologies, analogous to the function for Cartesian topologies given in Section
, is as follows. This call is collective.
MPI_Graph_map(MPI_Comm comm, int nnodes, int *index, int *edges, int *newrank)
MPI_GRAPH_MAP(COMM, NNODES, INDEX, EDGES, NEWRANK, IERROR)
INTEGER COMM, NNODES, INDEX(*), EDGES(*), NEWRANK, IERROR
A routine may receive a communicator for which it is unknown what type of
topology is associated with it. MPI_TOPO_TEST allows one
to answer this question. This is a local call.
MPI_Topo_test(MPI_Comm comm, int *status)
MPI_TOPO_TEST(COMM, STATUS, IERROR)
INTEGER COMM, STATUS, IERROR
The function MPI_TOPO_TEST returns the type of topology that
is assigned to a communicator.
The output value status is one of the following:
MPI_GRAPH      graph topology
MPI_CART       Cartesian topology
MPI_UNDEFINED  no topology
This chapter discusses routines for getting and, where appropriate, setting
various parameters that relate to the MPI
implementation and the execution environment.
It discusses error handling in MPI and the procedures available for
controlling MPI error handling.
The procedures for entering and leaving the
MPI execution environment are also
described here.
Finally, the chapter discusses the interaction between MPI and the
general execution environment.
A set of attributes that describe the execution environment are attached to
the communicator MPI_COMM_WORLD when MPI is initialized.
The value of these attributes can be inquired by using the function
MPI_ATTR_GET described in Chapter
.
It is erroneous to delete these attributes, free their keys, or
change their values.
The list of predefined attribute keys includes MPI_TAG_UB, MPI_HOST, MPI_IO, and MPI_WTIME_IS_GLOBAL.
Vendors may add implementation-specific parameters (such as node number, real memory size, virtual memory size, etc.).
These predefined attributes do not change value between MPI
initialization (MPI_INIT) and MPI completion
(MPI_FINALIZE).
The required parameter values are discussed in more detail below:
MPI functions sometimes use arguments with a choice (or union) data
type. Distinct calls to the same routine may pass by reference actual
arguments of different types. The mechanism for providing such
arguments will differ from language to language.
For Fortran, we use <type> to represent a choice
variable, for C, we use (void *).
The Fortran 77 standard specifies that the types of actual arguments need to agree with the types of dummy arguments; no construct equivalent to C
void pointers is
available. Thus, it would seem that there is no standard conforming mechanism
to support choice arguments.
However, most Fortran compilers either don't check type
consistency of calls to external routines, or support a special mechanism to
link foreign (e.g., C) routines. We accept this non-conformity
with the Fortran 77 standard.
I.e., we accept that the same routine may be passed
an actual argument of a different type at distinct calls.
Generic routines can be used in Fortran 90 to provide a standard
conforming solution. This solution will be consistent with our nonstandard
conforming Fortran 77 solution.
Tag values range from 0 to the value returned for MPI_TAG_UB,
inclusive.
These values are guaranteed to be unchanging during the execution of an MPI
program.
In addition, the tag upper bound value must be at least 32767.
An MPI implementation is free to make the value of MPI_TAG_UB larger
than this;
for example, the value 2^30 - 1 is also a legal value for
MPI_TAG_UB (on a system where this value is a legal int or
INTEGER value).
The attribute MPI_TAG_UB has the same value on all
processes in the group of MPI_COMM_WORLD.
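A fragment sketching the inquiry (in C, the attribute value for a predefined attribute is returned as a pointer to the integer value):

int *tag_ub_ptr, flag;

MPI_Attr_get(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
if (flag)
    printf("largest legal tag value: %d\n", *tag_ub_ptr);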
The value returned for MPI_HOST is the rank of the HOST process in the group associated with communicator MPI_COMM_WORLD, if there is such a process.
MPI_PROC_NULL is returned if there is no host.
This attribute can be used on systems that have a distinguished host processor, in order to identify the process running on this
processor. However,
MPI does not specify what it means for a process to be a HOST, nor does it require that a HOST exists.
The attribute MPI_HOST has the same value on all
processes in the group of MPI_COMM_WORLD.
The value returned for MPI_IO is the rank of a processor that can
provide language-standard I/O facilities. For Fortran, this means that all of
the Fortran I/O operations are supported (e.g., OPEN, REWIND, WRITE). For C, this means that all of the ANSI-C I/O operations are
supported (e.g., fopen, fprintf, lseek).
If every process can provide language-standard I/O, then the value
MPI_ANY_SOURCE will be returned. Otherwise, if the calling
process can provide language-standard I/O, then its rank will be
returned. Otherwise, if some process can provide language-standard
I/O then the rank of one such process will be returned. The same value
need not be returned by all processes.
If no process can provide
language-standard I/O, then the value
MPI_PROC_NULL will be
returned.
The value returned for MPI_WTIME_IS_GLOBAL is 1 if clocks
at all processes in MPI_COMM_WORLD are synchronized, 0
otherwise. A collection of clocks is considered synchronized if
explicit effort has been taken to synchronize them. The
expectation is that the variation in time, as measured by calls
to MPI_WTIME, will be less than one half the round-trip time for an MPI message of length zero. If time is measured at a process just before a send and at another process just after a matching receive, the second time should always be higher than the first.
The attribute MPI_WTIME_IS_GLOBAL need not be present when
the clocks are not synchronized (however, the attribute key
MPI_WTIME_IS_GLOBAL is always valid).
This attribute
may be associated with communicators other than MPI_COMM_WORLD.
The attribute MPI_WTIME_IS_GLOBAL has the same value on all
processes in the group of MPI_COMM_WORLD.
MPI_Get_processor_name(char *name, int *resultlen)
MPI_GET_PROCESSOR_NAME(NAME, RESULTLEN, IERROR)
CHARACTER*(*) NAME
INTEGER RESULTLEN, IERROR
This routine returns the name of the processor on which it was called at the
moment of the call.
The name is a character string for maximum flexibility. From
this value it must be possible to identify a specific piece of hardware;
possible values include ``processor 9 in rack 4 of mpp.cs.org'' and ``231''
(where 231 is the actual processor number in the running homogeneous system).
The argument name must represent storage that is at least
MPI_MAX_PROCESSOR_NAME characters long.
MPI_GET_PROCESSOR_NAME may write up to this many characters into
name.
The number of characters actually written
is returned in the output argument, resultlen.
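A fragment sketching its use:

char name[MPI_MAX_PROCESSOR_NAME];
int resultlen;

MPI_Get_processor_name(name, &resultlen);
printf("process running on %s (%d characters)\n", name, resultlen);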
The constant MPI_BSEND_OVERHEAD provides an upper bound on
the fixed overhead per message buffered by a call to
MPI_BSEND.
MPI defines a timer. A timer is specified even though it is not
``message-passing,'' because timing parallel programs is important in
``performance debugging'' and because existing timers (both in POSIX
1003.1-1988 and 1003.4D 14.1 and in Fortran 90) are either inconvenient or do
not provide adequate access to high-resolution timers.
double MPI_Wtime(void)
DOUBLE PRECISION MPI_WTIME()
MPI_WTIME returns a floating-point number of seconds,
representing elapsed wall-clock time since some time in
the past.
The ``time in the past'' is guaranteed not to change during the life of the
process. The user is responsible for converting large numbers of seconds to
other units if they are preferred.
This function is portable (it returns seconds, not ``ticks''), it allows
high-resolution, and carries no unnecessary baggage. One would use it like
this:
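(The original code is not reproduced; the following sketch shows the intended pattern.)

double starttime, endtime;

starttime = MPI_Wtime();
/* ... computation to be timed ... */
endtime = MPI_Wtime();
printf("that took %f seconds\n", endtime - starttime);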
The times returned are local to the node on which the call was made.
There is no requirement that different nodes return ``the same time.''
(But see also the discussion of MPI_WTIME_IS_GLOBAL in
Section
).
double MPI_Wtick(void)
DOUBLE PRECISION MPI_WTICK()
MPI_WTICK returns the resolution of
MPI_WTIME in seconds. That is, it
returns, as a double precision value, the number of seconds between successive
clock ticks.
For example, if the clock is implemented by the hardware as a counter that is
incremented every millisecond, the value returned by MPI_WTICK
should be 10^-3.
One goal of MPI is to achieve source code portability. By this we mean
that a program written using MPI and complying with the relevant language
standards is portable as written, and must not require any source code changes
when moved from one system to another. This explicitly does not say
anything about how an MPI program is started or launched from the command
line, nor what the user must do to set up the environment in which an MPI
program will run. However, an implementation may require some setup to be
performed before other MPI routines may be called. To provide for this, MPI
includes an initialization routine MPI_INIT.
MPI_Init(int *argc, char ***argv)
MPI_INIT(IERROR)
INTEGER IERROR
This routine must be called before any other MPI routine. It must be called
at most once; subsequent calls are erroneous (see MPI_INITIALIZED).
All MPI programs must contain a call to MPI_INIT; this routine must be
called before any other MPI routine (apart from
MPI_INITIALIZED) is called.
The version for ANSI C accepts the argc and argv that are
provided by the arguments to main:
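(A sketch of the intended usage.)

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    /* ... */
    MPI_Finalize();
    return 0;
}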
The Fortran version takes only IERROR.
An MPI implementation is free to require that the arguments in the C binding
must be the arguments to main.
MPI_Finalize(void)
MPI_FINALIZE(IERROR)
INTEGER IERROR
This routine cleans up all MPI state. Once this routine is called, no MPI
routine (even MPI_INIT) may be called.
The user must ensure that all pending communications
involving a process complete before the process calls MPI_FINALIZE.
MPI_Initialized(int *flag)
MPI_INITIALIZED(FLAG, IERROR)
LOGICAL FLAG
INTEGER IERROR
This routine may be used to determine whether MPI_INIT has been
called. It is the only routine that may be called before
MPI_INIT is called.
MPI_Abort(MPI_Comm comm, int errorcode)
MPI_ABORT(COMM, ERRORCODE, IERROR)
INTEGER COMM, ERRORCODE, IERROR
This routine makes a ``best attempt'' to abort
all tasks in the group of comm.
This function does not require that the invoking environment take any action
with the error code. However, a Unix or POSIX environment should handle this
as a return errorcode from the main program or an
abort(errorcode).
MPI implementations are required to define the behavior of
MPI_ABORT at least for
a comm of MPI_COMM_WORLD. MPI implementations may
ignore the comm argument and act as if the comm was
MPI_COMM_WORLD.
MPI provides the user with reliable message transmission.
A message sent is always received
correctly, and the user does not need to check for transmission errors,
time-outs, or other error conditions. In
other words, MPI does not provide mechanisms for
dealing with failures in the communication system.
If the MPI implementation is built on an unreliable underlying
mechanism, then it is the job of the implementor of the MPI subsystem
to insulate the user from this unreliability, or to reflect unrecoverable
errors as exceptions.
Of course, errors can occur during MPI calls for a variety of reasons.
A program error can
occur when an MPI routine is called
with an incorrect argument (non-existing
destination in a send operation,
buffer too small in a receive operation, etc.).
This type of error would occur in any implementation.
In addition, a resource error may occur when a program
exceeds the amount
of available system resources (number of pending messages, system buffers,
etc.). The occurrence of this type of error depends on the amount of
available resources in the system and the
resource allocation mechanism used;
this may differ from system to system. A high-quality
implementation will provide generous limits on the important
resources so as to alleviate the portability problem this
represents.
An MPI implementation may be unable, or may choose not, to handle some errors that occur during MPI calls.
These can include errors that generate
exceptions or traps, such as floating point errors or access
violations; errors that are too
expensive to detect in normal execution mode; or
``catastrophic'' errors which may prevent MPI from returning
control to the caller in a consistent state.
Another subtle issue arises because of the nature of asynchronous
communications. MPI can only handle errors that can be attached to a
specific MPI call.
MPI calls (both blocking and nonblocking) may initiate operations
that continue asynchronously
after the call returned. Thus, the call may complete
successfully, yet the operation may later cause an error.
If there is a subsequent call that relates to the same
operation (e.g., a wait or test call that completes a nonblocking call,
or a receive that completes a communication initiated by a blocking send)
then the error can be
associated with this call.
In some cases, the error may occur after all calls that
relate to the operation have completed.
(Consider the case of a blocking ready mode send operation,
where the outgoing message is
buffered, and it is subsequently found that no matching receive is
posted.) Such errors will not be handled by MPI.
The set of errors in MPI calls that are handled by MPI is
implementation-dependent.
Each such error generates an MPI exception.
A good quality implementation will attempt to handle as many errors as possible
as MPI exceptions.
Errors that are not handled by MPI will be handled by the error
handling mechanisms of the language run-time or the operating system.
Typically, errors that are not handled by MPI will cause the parallel
program to abort.
The occurrence of an MPI exception has two effects: the error handler associated with the relevant communicator is invoked, and an error code describing the exception is generated (and, if the handler allows the call to return, returned by the call).
Some MPI calls may cause more than one MPI exception
(see Section
). In such a case, the
MPI error handler will be invoked once for each exception,
and multiple error codes will be returned.
After an error is detected, the state of MPI is undefined. That is, the
state of the computation after the error-handler executed
does not
necessarily
allow the user to continue to use MPI. The purpose
of these error handlers is to allow a user
to issue user-defined error messages
and to take actions unrelated to MPI
(such as flushing I/O buffers) before a
program exits.
An MPI implementation is free to allow MPI to continue after
an error but is not required to do so.
A user can associate an error handler with a communicator. The
specified error handling routine will be used for any MPI exception
that occurs during a call to MPI for a communication with this communicator.
MPI calls that are not related to any communicator are considered to
be attached to the communicator MPI_COMM_WORLD.
The attachment of error handlers to communicators is purely local:
different processes may attach different error handlers
to communicators for the same communication domain.
A newly created communicator inherits the error
handler that is associated with the ``parent'' communicator.
In particular, the user can specify a ``global'' error handler for
all communicators by
associating this handler with the communicator MPI_COMM_WORLD
immediately after initialization.
Several predefined error handlers are available in MPI: MPI_ERRORS_ARE_FATAL, which aborts the program when an error occurs, and MPI_ERRORS_RETURN, which returns the error code to the user.
Implementations may provide additional predefined error handlers and
programmers can code their own error handlers.
The error handler
MPI_ERRORS_ARE_FATAL is associated by default
with MPI_COMM_WORLD
after initialization. Thus, if the user chooses not to control error handling,
every error that MPI handles is treated as fatal.
Since (almost) all MPI calls return an error code, a user may choose to handle
errors in his or her main code, by testing the return code of MPI calls and
executing a
suitable recovery code when the call was not successful. In this case, the
error handler MPI_ERRORS_RETURN will be used. Usually it is more
convenient and more efficient not to test for errors after each MPI call, and
have such an error handled by a non-trivial MPI error handler.
An MPI error handler is an opaque object, which is accessed by a handle.
MPI calls are provided to create new error handlers, to associate error
handlers with communicators, and to test which error handler is associated with
a communicator.
MPI_Errhandler_create(MPI_Handler_function *function, MPI_Errhandler *errhandler)
MPI_ERRHANDLER_CREATE(FUNCTION, HANDLER, IERROR)
EXTERNAL FUNCTION
INTEGER HANDLER, IERROR
In the C language,
the user routine should be a C function of type MPI_Handler_function,
which is defined as:
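The definition was not reproduced above; as given in MPI-1 it is:

typedef void (MPI_Handler_function)(MPI_Comm *, int *, ...);

/* the first argument is the communicator in use, the second is the error
   code to be returned; the remaining arguments are implementation dependent */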
MPI_Errhandler_set(MPI_Comm comm, MPI_Errhandler errhandler)
MPI_ERRHANDLER_SET(COMM, ERRHANDLER, IERROR)
INTEGER COMM, ERRHANDLER, IERROR
Associates the new error handler errhandler
with communicator comm at the calling process. Note that an
error handler is always associated with the communicator.
MPI_Errhandler_get(MPI_Comm comm, MPI_Errhandler *errhandler)
MPI_ERRHANDLER_GET(COMM, ERRHANDLER, IERROR)
INTEGER COMM, ERRHANDLER, IERROR
Returns in errhandler (a handle to) the error handler that is
currently
associated with communicator comm.
Example:
A library function may register at its entry point the current error handler for a communicator, set its own private error handler for this communicator, and restore the previous error handler before exiting.
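A sketch of this discipline (the routine and handler names here are illustrative, not part of MPI):

void library_entry(MPI_Comm comm, MPI_Errhandler library_handler,
                   MPI_Errhandler *saved_handler)
{
    MPI_Errhandler_get(comm, saved_handler);     /* remember the caller's handler */
    MPI_Errhandler_set(comm, library_handler);   /* install the library's handler */
}

void library_exit(MPI_Comm comm, MPI_Errhandler *saved_handler)
{
    MPI_Errhandler_set(comm, *saved_handler);    /* restore the caller's handler */
}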
MPI_Errhandler_free(MPI_Errhandler *errhandler)
MPI_ERRHANDLER_FREE(ERRHANDLER, IERROR)
INTEGER ERRHANDLER, IERROR
Marks the error handler associated with errhandler for
deallocation and sets errhandler to
MPI_ERRHANDLER_NULL.
The error handler will be deallocated after
all communicators associated with it have been deallocated.
Most MPI functions return an error code indicating successful
execution (MPI_SUCCESS), or providing information on the type
of MPI exception that occurred.
In certain circumstances, when the MPI
function may complete several distinct
operations, and therefore may generate
several independent errors, the MPI
function may return multiple error codes. This may occur with some of
the calls described in Section
that complete
multiple nonblocking communications. As described in that section,
the call may return the
code MPI_ERR_IN_STATUS, in which case
a detailed error code is returned
with the status of each communication.
The error codes returned by MPI are left entirely to the implementation (with the
exception of MPI_SUCCESS, MPI_ERR_IN_STATUS and
MPI_ERR_PENDING).
This is done to allow an implementation to
provide as much information as possible in the error code.
Error codes can be translated into meaningful messages using the function
below.
MPI_Error_string(int errorcode, char *string, int *resultlen)
MPI_ERROR_STRING(ERRORCODE, STRING, RESULTLEN, IERROR)
INTEGER ERRORCODE, RESULTLEN, IERROR
CHARACTER*(*) STRING
Returns the error string associated with an error code or class.
The argument string must represent storage that is at least
MPI_MAX_ERROR_STRING characters long.
The number of characters actually written
is returned in the output argument, resultlen.
The use of implementation-dependent error codes allows implementers to
provide more information, but prevents one from writing portable
error-handling code. To solve this problem, MPI provides a standard
set of specified error values, called error classes, and a function that
maps each error code into a suitable error class.
Valid error classes are:
MPI_SUCCESS        no error
MPI_ERR_BUFFER     invalid buffer pointer
MPI_ERR_COUNT      invalid count argument
MPI_ERR_TYPE       invalid datatype argument
MPI_ERR_TAG        invalid tag argument
MPI_ERR_COMM       invalid communicator
MPI_ERR_RANK       invalid rank
MPI_ERR_REQUEST    invalid request (handle)
MPI_ERR_ROOT       invalid root
MPI_ERR_GROUP      invalid group
MPI_ERR_OP         invalid operation
MPI_ERR_TOPOLOGY   invalid topology
MPI_ERR_DIMS       invalid dimension argument
MPI_ERR_ARG        invalid argument of some other kind
MPI_ERR_UNKNOWN    unknown error
MPI_ERR_TRUNCATE   message truncated on receive
MPI_ERR_OTHER      known error not in this list
MPI_ERR_INTERN     internal MPI (implementation) error
MPI_ERR_IN_STATUS  error code is in status
MPI_ERR_PENDING    pending request
MPI_ERR_LASTCODE   last error code
Most of these classes are self explanatory. The use of
MPI_ERR_IN_STATUS and MPI_ERR_PENDING is explained
in Section
. The list of standard classes may
be extended in the future.
The function
MPI_ERROR_STRING can be used to compute the error string
associated with an error class.
The error codes satisfy 0 = MPI_SUCCESS < MPI_ERR_... <= MPI_ERR_LASTCODE.
MPI_Error_class(int errorcode, int *errorclass)
MPI_ERROR_CLASS(ERRORCODE, ERRORCLASS, IERROR)
INTEGER ERRORCODE, ERRORCLASS, IERROR
The function MPI_ERROR_CLASS maps each
error code into a standard error code (error
class). It maps each standard error code onto itself.
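A fragment sketching typical use of the two functions together (the erroneous send is deliberate, and MPI_ERRORS_RETURN must be installed so that the call returns an error code instead of aborting):

int size, errcode, errclass, resultlen;
int buf = 1;
char message[MPI_MAX_ERROR_STRING];

MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
MPI_Comm_size(MPI_COMM_WORLD, &size);
errcode = MPI_Send(&buf, 1, MPI_INT, size, 0, MPI_COMM_WORLD);  /* rank out of range */
if (errcode != MPI_SUCCESS) {
    MPI_Error_class(errcode, &errclass);
    MPI_Error_string(errcode, message, &resultlen);
    fprintf(stderr, "error class %d: %s\n", errclass, message);
}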
There are a number of areas where an MPI implementation may interact
with the operating environment and system. While MPI does not mandate
that any services (such as I/O or signal handling) be provided, it does
strongly suggest the behavior to be provided if those services are
available. This is an important point in achieving portability across
platforms that provide the same set of services.
This section defines the rules for MPI language binding in
Fortran 77 and ANSI C.
Defined here are various object representations,
as well as the naming conventions used for expressing this
standard.
It is expected that any Fortran 90 and C++ implementations
use the Fortran 77 and ANSI C bindings, respectively.
Although we consider it premature to define other bindings to
Fortran 90 and C++, the current bindings are designed to encourage,
rather than discourage, experimentation with better
bindings that might be adopted later.
Since the word PARAMETER is a keyword in the Fortran language,
we use the word ``argument'' to denote the arguments to a
subroutine. These are normally referred to
as parameters in C; however, we expect that C programmers will
understand the word
``argument'' (which has no specific meaning in C), thus allowing us to avoid
unnecessary confusion for Fortran programmers.
There are several important language binding
issues not addressed by this standard.
This standard does not discuss the interoperability
of message passing between languages.
It is fully expected that good quality implementations will provide such
interoperability.
MPI programs require that library routines that are part of the
basic language environment (such as date
and write in Fortran and printf and malloc in ANSI
C) and are executed after MPI_INIT and before MPI_FINALIZE
operate independently and that their completion is
independent of the action of other processes in an MPI program.
Note that this in no way prevents the creation of library routines that
provide parallel services whose operation is collective. However, the
following program is expected to complete in an ANSI C environment
regardless of the size of MPI_COMM_WORLD (assuming that
I/O is available at the executing nodes).
../codes/terms-1.c
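The file itself is not reproduced here; a sketch of the kind of program meant (every process prints a line between MPI_INIT and MPI_FINALIZE) is:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Starting program\n");
    MPI_Finalize();
    return 0;
}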
The corresponding Fortran 77 program is also expected to complete.
An example of what is not required is any particular ordering
of the action of these routines when called by several tasks. For
example, MPI makes neither requirements nor recommendations for the
output from the following program (again assuming that
I/O is available at the executing nodes).
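A sketch of such a program (each process prints its own rank; the interleaving of the output lines from different processes is unspecified):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Output from task rank %d\n", rank);
    MPI_Finalize();
    return 0;
}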
In addition, calls that fail because of resource exhaustion or other
error are not considered a violation of the requirements here (however,
they are required to complete, just not to complete successfully).
MPI does not specify either the interaction of processes with
signals, in a UNIX
environment, or with other events that do not relate to MPI communication.
That is, signals are not significant from the view point of MPI, and
implementors should attempt to implement MPI so that signals are
transparent: an
MPI call suspended by a signal should resume and complete after the signal is
handled. Generally, the state of a computation that is visible or
significant from the view-point of MPI should only be affected by
MPI calls.
The intent of MPI to be thread and signal safe has a number of
subtle effects. For example, on Unix systems, a catchable signal such
as SIGALRM (an alarm signal) must not cause an MPI routine to behave
differently than it would have in the absence of the signal. Of course,
if the signal handler issues MPI calls or changes the environment in
which the MPI routine is operating (for example, consuming all available
memory space), the MPI routine should behave as appropriate for that
situation (in particular, in this case, the behavior should be the same
as for a multithreaded MPI implementation).
A second effect is that a signal handler that performs MPI calls must
not interfere with the operation of MPI. For example, an MPI receive of
any type that occurs within a signal handler must not cause erroneous
behavior by the MPI implementation. Note that an implementation is
permitted to prohibit the use of MPI calls from within a signal handler, and
is not required to detect such use.
It is highly desirable that MPI not use SIGALRM, SIGFPE,
or SIGIO. An implementation is required to
clearly document all of the signals that the MPI implementation uses;
a good place for this information is a Unix `man' page on MPI.
To satisfy the requirements of
the MPI profiling interface, an implementation of the MPI
functions must
The objective of the MPI profiling interface is to ensure that it is
relatively easy for authors of profiling (and other similar) tools to
interface their codes to MPI implementations on different machines.
Since MPI is a machine independent standard with many different
implementations, it is unreasonable to expect that the authors of
profiling tools for MPI will have access to the source code which
implements MPI on any particular machine. It is therefore necessary to
provide a mechanism by which the implementors of such tools can
collect whatever performance information they wish without
access to the underlying implementation.
The MPI Forum believed that having such an interface is important if
MPI is to be attractive to end users, since the availability of many
different tools will be a significant factor in attracting users to
the MPI standard.
The profiling interface is just that, an interface. It says nothing about the way in which it is used. Therefore, there is no
attempt to lay down what information is collected through the
interface, or how the collected information is saved, filtered, or
displayed.
While the initial impetus for the development of this interface arose
from the desire to permit the implementation of profiling tools, it is
clear that an interface like that specified may also prove useful for
other purposes, such as ``internetworking'' multiple MPI
implementations. Since all that is defined is an interface, there is
no impediment to it being used wherever it is useful.
As the issues being addressed here are intimately tied up with the way
in which executable images are built, which may differ greatly on
different machines, the examples given below should be treated solely
as one way of implementing the MPI profiling
interface. The actual requirements made of an implementation are those
detailed in Section
; the whole of the rest of this chapter is present only as justification and discussion of the logic behind those requirements.
The examples below show one way in which an implementation could be
constructed to meet the requirements on a Unix system (there are
doubtless others which would be equally valid).
Provided that an MPI implementation meets the requirements listed
in Section
, it
is possible for the implementor of the profiling system to intercept
all of the MPI calls which are made by the user program. Whatever
information is required can then be collected before calling the
underlying MPI implementation (through its name shifted entry points)
to achieve the desired effects.
There is a clear requirement for the user code to be able to control
the profiler dynamically at run time. This is normally used for (at
least) the purposes of
These requirements are met by use of MPI_PCONTROL.
MPI_Pcontrol(const int level, ...)
MPI_PCONTROL(LEVEL)
INTEGER LEVEL
MPI libraries themselves make no use of this routine, and simply
return immediately to the user code. However the presence of calls to
this routine allows a profiling package to be explicitly called by the
user.
Since MPI has no control of the implementation of the profiling code,
the MPI Forum was unable to specify precisely the semantics which will be
provided by calls to MPI_PCONTROL. This vagueness extends to the
number of arguments to the function, and their datatypes.
However, to provide some level of portability of user codes to different profiling libraries, the MPI Forum requested the following meanings for certain values of level: with level = 0 profiling is disabled, with level = 1 profiling is enabled at a normal default level of detail, and with level = 2 profile buffers are flushed (which may be a no-op in some profilers); all other values of level have profile-library-defined effects and additional arguments.
The MPI Forum also requested that the default state after MPI_INIT has been
called is for profiling to be enabled at the normal default level (i.e., as if MPI_PCONTROL had just been called with the argument 1). This allows users to link with a profiling library and
obtain profile output without having to modify their source code at
all.
The provision of MPI_PCONTROL as a no-op in the standard MPI
library allows users to modify their source code to obtain more
detailed profiling information, but still be able to link exactly the
same code against the standard MPI library.
Suppose that the profiler wishes to accumulate the total amount of
data sent by the MPI_Send() function, along with the total elapsed time
spent in the function. This could trivially be achieved thus
../codes/prof-1.c
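The file itself is not reproduced here; a sketch of such a wrapper, layered on the name-shifted entry point PMPI_Send, is:

#include <mpi.h>

static int    totalBytes = 0;
static double totalTime  = 0.0;

int MPI_Send(void *buffer, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int    size, result;
    double tstart = MPI_Wtime();

    result = PMPI_Send(buffer, count, datatype, dest, tag, comm);

    MPI_Type_size(datatype, &size);         /* bytes per element */
    totalBytes += count * size;
    totalTime  += MPI_Wtime() - tstart;     /* elapsed time in the call */

    return result;
}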
On a Unix system, in which the MPI library is implemented in C, then
there are various possible options, of which two of the most obvious
are presented here. Which is better depends on whether the linker and
compiler support weak symbols.
All MPI names have an MPI_ prefix, and all characters are upper case.
Programs should not declare variables or functions with names beginning with the prefix MPI_ or PMPI_, to avoid possible name collisions.
All MPI Fortran subroutines have a
return code in the last argument. A few
MPI operations are functions, which do not have the return code argument.
The return code value for successful completion is MPI_SUCCESS.
Other error
codes are implementation dependent; see
Chapter
.
Handles are represented in Fortran as INTEGERs. Binary-valued
variables are of type LOGICAL.
Array arguments are indexed from one.
Unless explicitly stated, the MPI F77
binding is consistent with ANSI standard Fortran 77.
There are several points where the MPI standard diverges from the
ANSI Fortran 77 standard.
These exceptions are consistent with common practice in the Fortran
community. In particular:
All MPI named constants can be used
wherever an entity declared with the PARAMETER attribute can be
used in Fortran.
There is one exception to this rule: the MPI constant
MPI_BOTTOM (section
) can only be
used as a buffer argument.
If the compiler and linker support weak external symbols (e.g., Solaris 2.x, other System V.4 machines), then only a single library is required, through the use of #pragma weak, thus:
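A sketch along the lines of the MPI standard's example, using MPI_Send as the illustration:

#pragma weak MPI_Send = PMPI_Send

int PMPI_Send(/* appropriate args */)
{
    /* useful content */
}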
The effect of this #pragma is to define the external symbol MPI_Send as a weak definition. This means that the linker will
not complain if there is another definition of the symbol (for
instance in the profiling library), however if no other definition
exists, then the linker will use the weak definition.
This type of situation is illustrated in Fig.
, in
which a profiling library has been written that profiles calls to
MPI_Send() but not calls to MPI_Bcast(). On systems with
weak links the link step for an application would be something like
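A plausible form of such a link line (the library names are illustrative) is:

% cc ... -lprof -lmpi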
References to MPI_Send() are resolved in the profiling
library, where the routine then calls PMPI_Send() which is
resolved in the MPI library. In this case the weak link to
PMPI_Send() is ignored. However, since MPI_Bcast() is not
included in the profiling library, references to it are
resolved via a weak link to PMPI_Bcast() in the MPI library.
In the absence of weak symbols, one possible solution would be to use the C macro pre-processor, thus:
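A sketch of such a macro, following the MPI standard's example (the PROFILELIB symbol matches the description below):

#ifdef PROFILELIB
#   ifdef __STDC__
#       define FUNCTION(name) P##name
#   else
#       define FUNCTION(name) P/**/name
#   endif
#else
#   define FUNCTION(name) name
#endif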
Each of the user-visible functions in the library would then be declared thus:
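For example (again with MPI_Send as the illustration):

int FUNCTION(MPI_Send)(/* appropriate args */)
{
    /* useful content */
}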
The same source file can then be compiled to produce the MPI and the
PMPI versions of
the library, depending on the state of the PROFILELIB macro
symbol.
It is required that the standard MPI library be built in such a way
that the inclusion of MPI functions can be achieved one at a time.
This is a somewhat unpleasant requirement, since it may mean that
each external function has to be compiled from a separate file.
However this is necessary so that the author of the profiling library
need only define those MPI functions that are to be intercepted,
references to any others being fulfilled by the normal MPI library.
Therefore the link step can look something like this
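A plausible form, using the library names described below, is:

% cc ... -lprof -lpmpi -lmpi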
Here libprof.a contains the profiler functions which intercept
some of the MPI functions. libpmpi.a contains the ``name
shifted'' MPI functions, and libmpi.a contains the normal
definitions of the MPI functions.
Thus, on systems without weak links the example shown in
Fig.
would be resolved as shown in
Fig.
Since parts of the MPI library may themselves be implemented using
more basic MPI functions (e.g. a portable implementation of the
collective operations implemented using point to point communications),
there is potential for profiling functions to be called from within an
MPI function which was called from a profiling function. This could
lead to ``double counting'' of the time spent in the inner routine.
Since this effect could actually be useful under some circumstances
(e.g. it might allow one to answer the question ``How much time is
spent in the point-to-point routines when they're called from collective functions?''), the MPI Forum decided not to enforce any
restrictions on the author of the MPI library which would overcome
this. Therefore, the author of the profiling library should be aware of
this problem, and guard against it. In a single threaded
world this is easily achieved through use of a static variable in the
profiling code which remembers if you are already inside a profiling
routine. It becomes more complex in a multi-threaded environment (as
does the meaning of the times recorded!)
The Unix linker traditionally operates in one pass. The effect of this
is that functions from libraries are only included in the image if
they are needed at the time the library is scanned. When combined with
weak symbols, or multiple definitions of the same function, this can
cause odd (and unexpected) effects.
Consider, for instance, an implementation of MPI in which the Fortran
binding is achieved by using wrapper functions on top of the C
implementation. The author of the profile library then assumes that it
is reasonable to provide profile functions only for the C binding,
since Fortran will eventually call these, and the cost of the wrappers
is assumed to be small. However, if the wrapper functions are not in
the profiling library, then none of the profiled entry points will be
undefined when the profiling library is scanned. Therefore none of the
profiling code will be included in the image. When the standard MPI
library is scanned, the Fortran wrappers will be resolved, and will
also pull in the base versions of the MPI functions. The overall
effect is that the code will link successfully, but will not be
profiled.
To overcome this we must ensure that the Fortran wrapper functions are
included in the profiling version of the library. We ensure that this
is possible by requiring that these be separable from the rest of the
base MPI library. This allows them to be extracted out of the base
library and placed into the profiling library using the Unix ar
command.
The scheme given here does not directly support the nesting of
profiling functions, since it provides only a single alternative name
for each MPI function. The MPI Forum gave consideration to an implementation
which would allow multiple levels of call interception; however, it was
unable to construct an implementation of this that did not
have the following disadvantages: it would assume a particular
implementation language, and it would impose a run-time cost even when
no profiling was taking place.
Since one of the objectives of MPI is to permit efficient, low latency
implementations, and it is not the business of a standard to require a
particular implementation language, the MPI Forum decided to accept the scheme
outlined above.
Note, however, that it is possible to use the scheme above to
implement a multi-level system, since the function called by the user
may call many different profiling functions before calling the
underlying MPI function.
Unfortunately such an implementation may require more cooperation
between the different profiling libraries than is required for the
single level implementation detailed above.
This book has attempted to give a complete description of the MPI
specification, and includes code examples to illustrate aspects
of the use of MPI. After reading the preceding chapters programmers should
feel comfortable using MPI to develop message-passing applications.
This final chapter addresses some important topics
that either do not easily fit into the other chapters, or which are best
dealt with after a good overall understanding of MPI has been gained.
These topics are concerned more with the interpretation of the MPI
specification, and the rationale behind some aspects of its design, rather
than with semantics and syntax. Future extensions to MPI and the current
status of MPI implementations will also be discussed.
One aspect of concern,
particularly to novices, is the large number of routines
comprising the MPI
specification. In all there are 128 MPI routines,
and further extensions
(see Section
) will probably
increase their number.
There are two fundamental reasons for the size of MPI.
The first reason is that MPI was designed to be rich in
functionality.
This is reflected in MPI's support for derived datatypes,
modular communication via the communicator abstraction, caching,
application topologies,
and the fully-featured set of collective communication routines.
The second reason for the size of MPI
reflects the diversity and complexity of today's high
performance computers. This is particularly true with respect to the
point-to-point communication
routines where the different communication modes
(see Sections
and
) arise mainly
as a means of providing a set of the most
widely-used communication protocols. For example, the synchronous
communication mode
corresponds closely to a protocol that minimizes the
copying and buffering of
data through a rendezvous mechanism. A protocol
that attempts to initiate delivery of messages as soon as possible
would provide buffering for messages, and this corresponds closely
to the buffered communication mode (or the standard mode if this is
implemented with sufficient buffering).
One could decrease the number of functions by increasing the
number of parameters in each call. But such an approach would
number of parameters in each call. But such approach would
increase the call overhead and would make the use of the most
prevalent calls more complex. The availability of a large
number of calls to deal with more esoteric features of MPI
allows one to provide a simpler interface to the more frequently
used functions.
There are two potential reasons
why we might be concerned about the size of MPI.
The first is that potential
users might equate size with complexity and decide that MPI is too
complicated to bother learning.
The second is that vendors might decide that
MPI is too difficult to implement.
The design of MPI addresses the first of
these concerns by adopting a layered approach.
For example, novices can
avoid having to worry about groups and
communicators by performing all
communication in the pre-defined communicator MPI_COMM_WORLD.
In fact, most
existing message-passing applications
can be ported to MPI simply by
converting the communication routines
on a one-for-one basis (although the
resulting
MPI application may not be optimally
efficient). To allay the concerns
of potential implementors the MPI Forum at one stage considered
defining a core subset of MPI known as the MPI subset that would
be substantially smaller than MPI and include just the
point-to-point communication routines and a few of the more
commonly-used collective communication routines. However, work
by Lusk, Gropp, Skjellum, Doss, Franke and others on early
implementations of MPI showed that it
could be fully implemented without a prohibitively
large effort
[16]
[12]. Thus,
the rationale for the MPI subset was lost, and this
idea was dropped.
Message passing is a programming paradigm used
widely on parallel computers, especially Scalable Parallel
Computers (SPCs) with
distributed memory, and on Networks of Workstations (NOWs).
Although there are many variations, the basic
concept of processes communicating through messages is well understood.
Over the last ten years, substantial progress has been made in casting
significant applications into this paradigm. Each vendor
has implemented its own variant.
More recently, several public-domain systems have demonstrated
that a message-passing system can be efficiently and
portably implemented.
It is thus an appropriate time to define both the syntax
and semantics of a standard
core of library routines that will be useful to a wide
range of users and efficiently implementable on a wide range of computers.
This effort has been undertaken over the last three years by the
Message Passing Interface (MPI) Forum, a group of more than 80
people from 40 organizations, representing vendors of parallel
systems, industrial users, industrial and national research laboratories,
and universities.
The designers of MPI sought to make use of the most attractive features
of a number of existing message-passing systems, rather than selecting one of
them and adopting it as the standard.
Thus, MPI has been strongly influenced
by work at the IBM T. J. Watson Research Center
[2]
[1], Intel's NX/2
[24], Express
[23], nCUBE's Vertex
[22], p4
[5]
[6], and
PARMACS
[7]
[3]. Other important contributions have come
from Zipcode
[26]
[25], Chimp
[14]
[13], PVM
[27]
[17], Chameleon
[19],
and PICL
[18]. The MPI Forum
identified some critical shortcomings of existing message-passing
systems, in areas such as complex data layouts or
support for modularity and safe communication.
This led to the introduction of new features in MPI.
The MPI standard defines the user interface and
functionality for a wide range of message-passing capabilities. Since
its completion in June of 1994, MPI has become widely
accepted and used. Implementations are available on a range of
machines from SPCs to NOWs. A growing number of SPCs have an MPI
supplied and supported by the vendor. Because of this, MPI
has achieved one of its goals: adding credibility to parallel
computing. Third party vendors, researchers, and others now have a
reliable and portable way to express message-passing, parallel programs.
The major goal of MPI, as with most standards, is a degree of
portability across different machines.
The expectation is for a degree of portability comparable to that given
by programming languages such as Fortran.
This means that the same message-passing source code
can be executed on a variety of machines as long as the MPI library is
available, while some tuning might be needed to take best advantage of
the features of each system.
Though
message passing is often thought of in the context of
distributed-memory parallel computers, the same code can run well
on a shared-memory parallel computer. It can run on a network of
workstations, or, indeed, as a set of
processes running on a single workstation.
Knowing that efficient MPI
implementations exist across a wide variety of computers gives a
high degree of flexibility in code development, debugging, and
in choosing a platform for production runs.
Another type of compatibility
offered by MPI is the ability to run transparently on heterogeneous
systems, that is, collections of processors with distinct architectures.
It is possible for an MPI implementation to span such
a heterogeneous collection, yet provide a virtual computing model
that hides many architectural differences.
The user need not worry whether the code is sending
messages between processors of like or unlike architecture.
The MPI
implementation will automatically do any necessary data conversion and
utilize the correct communications protocol. However, MPI does
not prohibit implementations that are targeted to a single,
homogeneous system, and does not mandate that distinct
implementations be interoperable.
Users who wish to run on a heterogeneous
system must use an MPI implementation designed to
support heterogeneity.
Portability is central, but the standard
would not gain wide usage if portability were achieved at the expense of
performance. For example, Fortran is commonly used over assembly
languages because compilers are almost always available that yield
acceptable performance compared to the non-portable alternative of assembly
languages. A crucial point is that MPI was carefully designed
so as to allow efficient implementations. The design
choices seem to have been made correctly,
since MPI implementations over a wide range of
platforms are achieving high performance, comparable to that
of less portable, vendor-specific systems.
An important design goal of MPI was to allow efficient
implementations across machines of differing characteristics.
For example, MPI carefully avoids specifying how operations will
take place. It only specifies what an operation does logically. As a
result, MPI can be easily implemented on systems that buffer messages at
the sender, at the receiver, or that do no buffering at all.
Implementations can take advantage of specific features of the
communication subsystem of various machines. On machines with
intelligent communication coprocessors, much of the message passing
protocol can be offloaded to this coprocessor. On other systems,
most of the communication code is executed by the main processor.
Another example is
the use of opaque objects in MPI. By hiding the details of how
MPI-specific objects are represented,
each implementation is free to do whatever is
best under the circumstances.
Another design choice leading to efficiency is the avoidance of
unnecessary work.
MPI was carefully designed so as to avoid a requirement for
large amounts of extra information with each message, or the
need for complex encoding or decoding of message headers.
MPI also avoids extra computation or tests in critical
routines since this can degrade performance. Another way of
minimizing work is to encourage the reuse of previous computations.
MPI provides this capability through constructs such as persistent
communication requests and caching of attributes on communicators.
The design of MPI avoids the need for extra copying and
buffering of data: in many cases, data can be moved from the user
memory directly to the wire, and be received directly from the wire
to the receiver memory.
MPI was designed to encourage overlap of communication and
computation, so as to take advantage of intelligent
communication agents, and to hide communication latencies.
This is achieved by the use of nonblocking
communication calls, which separate the initiation of a
communication from its completion.
Scalability is an important goal of parallel processing.
MPI allows or supports scalability through several of its
design features. For example, an
application can create subgroups of processes that, in turn, allows
collective communication operations to limit their scope to the
processes involved. Another technique used is to
provide functionality without a computation that
scales as the number of processes. For example, a two-dimensional
Cartesian topology can be subdivided into its one-dimensional rows or columns
without explicitly enumerating the processes.
Finally, MPI, as all good standards, is valuable in that it defines
a known, minimum behavior of message-passing implementations. This
relieves the programmer from having to worry about certain problems
that can arise. One example is
that MPI guarantees that the underlying transmission of messages is
reliable. The user need not check if a message is received
correctly.
We use the ANSI C declaration format. All MPI names have an MPI_
prefix, defined constants are in all capital letters, and defined types and
functions have one capital letter after the prefix.
Programs must not declare variables or functions with names beginning with
the prefix MPI_ or PMPI_.
This is mandated to avoid possible name collisions.
The definition of named constants, function prototypes, and type
definitions must be supplied in an include file mpi.h.
Almost all C functions return an error code.
The successful return code will
be MPI_SUCCESS, but
failure return codes are implementation dependent.
A few C functions do not return error codes,
so that they can be implemented as
macros.
Type declarations are provided for handles to each category of opaque
objects. Either a pointer or an integer type is used.
Array arguments are indexed from zero.
Logical flags are integers with value 0 meaning ``false'' and a non-zero
value meaning ``true.''
Choice arguments are pointers of type void*.
Address arguments are of MPI defined type
MPI_Aint. This is defined to be an int of the size needed to
hold any valid address on the target architecture.
All named MPI constants can be used in initialization expressions or
assignments like C constants.
MPI does not guarantee to buffer arbitrary messages because memory is
a finite resource on all computers. Thus, all computers will fail under
sufficiently adverse communication loads. Different computers at different
times are capable of providing differing amounts of buffering, so if
a program relies on buffering it may fail under certain conditions, but
work correctly under other conditions. This is clearly undesirable.
Given that no message passing system can guarantee that messages will
be buffered as required under all circumstances, it might be asked why
MPI does not guarantee a minimum amount of memory available for
buffering. One major problem is that it is not obvious how to
specify the amount of buffer space that is available, nor is it easy
to estimate how much buffer space is consumed by a particular
program.
Different buffering policies make sense in
different environments. Messages can be buffered at the sending
node, at the receiving node, or both.
The choice of the right policy is strongly dependent on the hardware and
software environment.
For instance, in a dedicated environment, a processor
with a process blocked on a send is idle and so
computing resources are not wasted if this
processor copies the outgoing message to a buffer. In a time shared
environment, the computing resources may be used by another process.
In a system where buffer space can be in paged memory, such space can
be allocated from heap. If the buffer space cannot be paged,
or has to be in kernel space, then a separate buffer
is needed. Flow control may require that
some amount of buffer space be dedicated to each pair of communicating
processes.
The optimal strategy strongly depends on various
performance parameters of the system: the bandwidth, the communication
start-up time, scheduling and context switching overheads, the amount
of potential overlap between communication and computation, etc.
The choice of a buffering and scheduling policy
may not be entirely under the control of the MPI
implementor, as it is partially determined by the properties of the
underlying communication layer.
Also,
experience in this arena is quite limited, and underlying technology
can be expected to change rapidly: fast, user-space
interprocessor communication mechanisms are an active research area
[20]
[28].
Attempts by the MPI Forum to design mechanisms for querying or
setting the amount of buffer space available to standard communication
led to the conclusion that such mechanisms will either restrict
allowed implementations unacceptably, or provide
bounds that will be extremely pessimistic on most implementations in
most cases. Another problem is that parameters such as buffer
sizes work against portability.
Rather than restricting the
implementation strategies for standard communication, the choice was
made to provide additional communication modes for those users that
do not want to trust the implementation to make the right choice for
them.
The MPI specification
was designed to make it possible to write portable
message passing
programs while avoiding unacceptable performance degradation.
Within the context of
MPI, ``portable'' is synonymous with ``safe.''
Unsafe programs may exhibit a different behavior on different
systems because they are non-deterministic: Several outcomes
are consistent with the MPI specification, and the actual
outcome to occur depends on the precise timing of events.
Unsafe programs may require resources that are not always
guaranteed by MPI,
in order to complete successfully. On systems where such
resources are unavailable, the program will encounter a
resource error. Such an error will manifest itself as an actual
program error, or will result in deadlock.
There are
three main issues relating to the portability of MPI programs (and, indeed,
message passing programs in general): reliance on the buffering of
messages, assumptions about whether collective operations synchronize
the participating processes, and ambiguity in the matching of
communications when libraries and application code share a communicator.
If proper attention is not paid to
these factors a message passing code
may fail intermittently on a given computer,
or may work correctly on one
machine but not on another. Clearly such a program is not portable.
We shall now consider each of the above factors in more detail.
A message passing program is
dependent on the buffering of messages if
its communication graph has a cycle.
The communication graph is a directed graph
in which the nodes represent MPI communication calls
and the edges represent
dependencies between these calls: a directed edge from u to v indicates
that operation v might not be able to complete before operation u has started.
Calls may be dependent because they have to be executed in
succession by the same process, or because they are matching
send and receive calls.
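The code in question is a cyclic shift in which every process sends to its
clockwise neighbour in standard mode and then receives from its
anticlockwise neighbour. A minimal sketch (assuming an intracommunicator
comm, and that buf1, buf2, count, tag and the other variables are suitably
declared; MPI_INT is our choice of datatype for illustration):

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    clock     = (rank + 1) % size;          /* clockwise neighbour     */
    anticlock = (rank + size - 1) % size;   /* anticlockwise neighbour */

    MPI_Send(buf1, count, MPI_INT, clock, tag, comm);
    MPI_Recv(buf2, count, MPI_INT, anticlock, tag, comm, &status);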
The execution of the code results in the
dependency graph illustrated in Figure
,
for the case of a three
process group.
The arrow from each send to the following receive
executed by the same process reflects the program dependency
within each process: the receive call cannot be executed until
the previous send call has completed. The double arrow between
each send and the matching receive reflects their mutual
dependency: Obviously, the receive cannot
complete unless the matching send was invoked. Conversely,
since a standard mode send is used, it may be the case that
the send blocks until a matching receive occurs.
The dependency graph has a cycle.
This code will only work if the system provides sufficient
buffering, in
which case the send operation will complete locally, the call to
MPI_Send() will return,
and the matching call to MPI_Recv()
will be
performed. In the absence of sufficient
buffering MPI does not specify an
outcome, but for most implementations deadlock will occur, i.e., the
call to MPI_Send() will never return: each process will
wait for the next process on the ring to execute a matching
receive.
Thus, the behavior of this code will differ from system to
system, or on the same system, when message size
(count) is changed.
There are a number of ways in which a shift operation can be performed
portably using MPI: by ordering the sends and receives so that some
can always complete, by using buffered sends, by using nonblocking
communication, or by using MPI_Sendrecv().
If at least one process in a shift
operation calls the receive routine before
the send routine, and at least one process calls the send routine
before the receive routine, then at least one communication can proceed,
and, eventually,
the shift will complete successfully.
One of the most efficient ways of doing this is to alternate the send and
receive calls so that all processes
with even rank send first and then
receive, and all processes with odd rank receive first and then send.
Thus, the following code is portable provided there is more than one
process, i.e., clock and anticlock are different:
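A sketch of this ordering, under the same assumptions as the shift code
above:

    if (rank % 2 == 0) {
        /* even ranks send first, then receive */
        MPI_Send(buf1, count, MPI_INT, clock, tag, comm);
        MPI_Recv(buf2, count, MPI_INT, anticlock, tag, comm, &status);
    } else {
        /* odd ranks receive first, then send */
        MPI_Recv(buf2, count, MPI_INT, anticlock, tag, comm, &status);
        MPI_Send(buf1, count, MPI_INT, clock, tag, comm);
    }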
If there is only one process then
clearly blocking send and receive routines
cannot be used since the send must be
called before the receive, and so
cannot complete in the absence of buffering.
We now consider methods for
performing shift operations that work even if
there is only one process involved.
A blocking send in buffered mode
can be used to perform a shift operation.
In this case the application program passes a buffer to the MPI
communication system, and
MPI can use this to buffer messages. If the buffer
provided is large enough, then the shift will complete successfully.
The following
code shows how to use
buffered mode to create a portable shift operation.
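A sketch of this approach, under the same assumptions as before (the
buffer is sized with MPI_Pack_size plus MPI_BSEND_OVERHEAD; error checking
is omitted):

    int   buffsize;
    void *userbuf;

    MPI_Pack_size(count, MPI_INT, comm, &buffsize);
    buffsize += MPI_BSEND_OVERHEAD;
    userbuf = malloc(buffsize);

    MPI_Buffer_attach(userbuf, buffsize);
    MPI_Bsend(buf1, count, MPI_INT, clock, tag, comm);
    MPI_Recv(buf2, count, MPI_INT, anticlock, tag, comm, &status);
    MPI_Buffer_detach(&userbuf, &buffsize);
    free(userbuf);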
MPI guarantees that the buffer supplied by a call to
MPI_Buffer_attach()
will be used if it is needed to buffer the message.
(In an implementation
of MPI that provides sufficient buffering,
the user-supplied buffer may be
ignored.)
Each buffered send operation can complete locally, so
that a deadlock will not occur. The acyclic communication graph for
this modified code is shown in Figure
.
Each receive depends on the matching send, but the
send does not depend anymore on the matching receive.
Another approach is to use nonblocking communication.
One can either use a
nonblocking send, a nonblocking receive, or both.
If a nonblocking send is
used, the call to MPI_Isend() initiates the send
operation and then returns.
The call to MPI_Recv() can then be made,
and the communication completes
successfully.
After the call to MPI_Isend(), the data in buf1 must
not be changed until one is
certain that the data have been sent or copied
by the system. MPI provides the routines MPI_Wait() and
MPI_Test()
to check on this. Thus, the following code is portable.
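Again under the same assumptions, a sketch of this version is:

    MPI_Request request;
    MPI_Status  send_status;

    MPI_Isend(buf1, count, MPI_INT, clock, tag, comm, &request);
    MPI_Recv(buf2, count, MPI_INT, anticlock, tag, comm, &status);
    MPI_Wait(&request, &send_status);   /* buf1 may be reused only after this */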
The corresponding acyclic communication graph is shown in
Figure
.
Each receive operation depends on the matching send,
and each wait depends on the matching communication; the send
does not depend on the matching receive, as a nonblocking send
call will return even if no matching receive is posted.
(Posted nonblocking communications do consume resources: MPI
has to keep track of such posted communications. But the
amount of resources consumed is proportional to the number of
posted communications, not to the total size of the pending messages. Good MPI implementations will support a large number of
pending nonblocking communications, so that this will not cause
portability problems.)
An alternative approach is to perform a nonblocking receive first to
initiate (or ``post'') the receive,
and then to perform a blocking send in
standard mode.
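A sketch of this variant, under the same assumptions:

    MPI_Request request;

    MPI_Irecv(buf2, count, MPI_INT, anticlock, tag, comm, &request);
    MPI_Send(buf1, count, MPI_INT, clock, tag, comm);
    MPI_Wait(&request, &status);   /* block until the data is in buf2 */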
The call to MPI_Irecv()
indicates to MPI that incoming data should be
stored in buf2; thus, no buffering is required. The call to
MPI_Wait()
is needed to block until the data has actually been received
into buf2. This alternative code will often result in
improved performance,
since sends complete faster in many implementations
when the matching receive is already posted.
Finally, a portable shift
operation can be implemented using the routine
MPI_Sendrecv(),
which was explicitly designed to send to one process
while receiving from another in a safe and portable way. In this
case only a single call is required.
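Under the same assumptions as the earlier sketches, such a call might read:

    MPI_Sendrecv(buf1, count, MPI_INT, clock, tag,
                 buf2, count, MPI_INT, anticlock, tag,
                 comm, &status);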
The MPI specification purposefully does not mandate whether or not
collective communication operations have the side effect of synchronizing
the processes over which they operate. Thus, in one valid implementation
collective communication operations may synchronize processes, while in another
equally valid implementation they do not. Portable MPI programs, therefore,
must not rely on whether or not collective communication operations
synchronize processes. Thus, assumptions such as the following must be avoided.
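As an illustration, consider the following sketch (this is a
reconstruction, not the book's own example; data, rbuf and sbuf are
assumed to be suitably declared buffers, and the intracommunicator comm
contains at least two processes):

    if (rank == 0) {
        MPI_Irecv(rbuf, count, MPI_INT, 1, tag, comm, &request);
        MPI_Bcast(data, 1, MPI_INT, 0, comm);
        MPI_Wait(&request, &status);
    } else if (rank == 1) {
        MPI_Bcast(data, 1, MPI_INT, 0, comm);
        /* NOT portable: this assumes the broadcast has synchronized the
           processes, so that the receive on process 0 is already posted */
        MPI_Rsend(sbuf, count, MPI_INT, 0, tag, comm);
    }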
Here if we want to perform the send in ready mode we must be certain
that the receive has already been initiated at the destination. The above
code is nonportable because if the broadcast does not act as a barrier
synchronization we cannot be sure this is the case.
MPI employs the communicator abstraction
to promote software modularity by allowing the
construction of independent communication streams between processes,
thereby ensuring that messages sent in one phase of an application
are not incorrectly intercepted by another phase. Communicators
are particularly important in allowing libraries that make message passing
calls to be used safely within an application.
The point here is that
the application developer has no way of knowing if the tag, group, and rank
completely disambiguate the message traffic of different libraries and
the rest of the application. Communicators, in effect, provide an
additional criterion for message selection, and hence permit the construction
of independent tag spaces.
We discussed in
Section
possible hazards when a library uses
the same communicator as the calling code. The incorrect matching of
sends executed by the caller code with receives executed by the
library occurred because the library code used wildcarded receives.
Conversely, incorrect matches may occur when the caller code uses
wildcarded receives, even if the library code by itself is deterministic.
Consider the
example in Figure
.
If the program behaves correctly
processes 0 and 1 each
receive a message from process 2,
using a wildcarded selection criterion to
indicate that they are prepared to receive a message from any process. The
three processes then pass data around in a ring within the library routine.
If separate communicators are not used for the communication inside and
outside of the library routine
this program may intermittently fail. Suppose we delay the sending
of the second message sent by process 2, for example, by inserting some
computation, as shown in Figure
.
In this case the wildcarded
receive in process 0 is satisfied by a message sent from process 1, rather
than from process 2, and deadlock results.
Even if neither caller nor callee use wildcarded receives, incorrect
matches may still occur if a send initiated before the collective
library invocation is to be matched by a receive posted after the
invocation (Ex.
,
page
).
By using a different
communicator in the library routine we can ensure that the program
is executed correctly, regardless of when the processes enter the library
routine.
Heterogeneous computing uses different computers connected by a
network to solve a problem in parallel. With heterogeneous computing
a number of issues arise that are not applicable when using
a homogeneous parallel computer. For example, the computers may
be of differing computational power, so care must be taken to
distribute the work between them to avoid load imbalance. Other
problems may arise because of the different behavior of floating
point arithmetic on different machines. However,
the two most fundamental issues that must be faced in
heterogeneous computing are incompatible data representations and
the interoperability of different MPI implementations.
Incompatible data representations arise when computers use different
binary representations for the same number. In MPI all communication
routines have a datatype argument so implementations can use this
information to perform the appropriate representation conversion
when communicating data between computers.
Interoperability refers to the ability of different implementations
of a given piece of software to work together as if they were a single
homogeneous implementation. A
prerequisite of interoperability for MPI would be the
standardization of MPI's internal data structures,
of the communication protocols, of the initialization,
termination and error handling procedures, of the implementation of
collective operations, and so on.
Since this has not been done, there is no support for interoperability
in MPI. In general, hardware-specific implementations of MPI will
not be interoperable. However it is still possible for different
architectures to work together if they both use the same portable MPI
implementation.
At the time of writing several portable implementations
of MPI exist.
In addition, hardware-specific MPI implementations exist for the
Cray T3D, the IBM SP-2, the NEC Cenju, and the Fujitsu AP1000.
Information on MPI implementations and other useful
information on MPI can be found on the MPI web pages
at Argonne National Laboratory (http://www.mcs.anl.gov/mpi),
and at Mississippi State Univ (http://www.erc.msstate.edu/mpi). Additional information can be found on the
MPI
newsgroup comp.parallel.mpi and on netlib.
When the MPI Forum reconvened in March 1995, the main reason was to
produce a new version of MPI that would have significant new
features. The original MPI is being referred to as MPI-1 and the
new effort is being called MPI-2. The need and desire to extend
MPI-1 arose from several factors. One consideration was that the
MPI-1 effort had a constrained scope. This was done to avoid
introducing a standard that was seen as too large and burdensome for
implementors. It was also done to complete
MPI-1 in the Forum-imposed deadline of one year. A second
consideration for limiting MPI-1 was the feeling by many Forum
members that some proposed areas were still under investigation. As a
result, the MPI Forum decided not to propose a standard in these areas for
fear of discouraging useful investigations into improved methods.
The MPI Forum is now actively meeting and discussing extensions to MPI-1
that will become MPI-2. The areas that are currently under
discussion are:
External Interfaces: This will define interfaces to
allow easy extension of MPI with libraries, and facilitate
the implementation of
packages such as debuggers and profilers.
Among the issues considered are mechanisms for defining new
nonblocking operations and mechanisms for accessing internal
MPI information.
One-Sided Communications: This will extend MPI to allow
communication that does not require execution of matching calls
at both communicating processes. Examples of such operations are
put/get operations that allow a process to access data in
another process' memory, messages with interrupts (e.g., active
messages), and Read-Modify-Write operations (e.g., fetch and add).
Dynamic Resource Management:
This will extend MPI to allow the acquisition of computational
resources and the
spawning and destruction of processes after MPI_INIT.
Extended Collective: This will extend the collective calls to be
non-blocking and apply to inter-communicators.
Bindings: This will produce bindings for Fortran 90 and C++.
Real Time: This will provide some support for real time processing.
Since the MPI-2 effort is ongoing, the topics and areas covered are
still subject to change.
The MPI Forum set a timetable at its first meeting in March 1995. The goal
is release of a preliminary version of certain parts of MPI-2 in
December 1995 at Supercomputing '95. This is to include dynamic
processes. The goal of this early
release is to allow testing of the ideas and to receive extended
public comments. The complete version of MPI-2 will be released at
Supercomputing '96 for final public comment. The final version of
MPI-2 is scheduled for release in the spring of 1997.
The basic communication mechanism of MPI is the transmittal of data
between a pair of processes, one side sending, the other, receiving.
We call this ``point to point communication.'' Almost all the
constructs of MPI are built around the point to point operations
and so this chapter is fundamental. It is also quite a long chapter since:
there are many variants to the point to point operations; there
is much to say in terms of the semantics of the operations; and related
topics, such as probing for messages, are explained here because
they are used in conjunction with the point to point operations.
MPI provides a set of send and receive functions that allow
the communication of typed data with an associated tag.
Typing of the message contents is necessary for heterogeneous support -
the type information is needed so that correct data representation
conversions can be performed as data is sent from one architecture to
another. The tag allows selectivity of messages at the receiving end:
one can receive on a particular tag, or one can wild-card this
quantity, allowing reception of messages with any tag. Message
selectivity on the
source process of the message is also provided.
A fragment of C code
appears in Example
for the example of
process 0 sending a message to process 1.
The code executes on both
process 0 and process 1.
Process 0 sends a character string using MPI_Send(). The first three
parameters of the send call
specify the data to be sent: the outgoing data is to
be taken from msg; it consists of strlen(msg)+1 entries,
each of
type MPI_CHAR. (The string "Hello there" contains strlen(msg)=11
significant characters; in addition, we are also sending the
terminating null character, which is why strlen(msg)+1 entries are sent.)
The receiving process specifies
that the incoming data is to be placed in msg and that it has a maximum
size of 20 entries, of type MPI_CHAR.
The variable status, set by MPI_Recv(), gives information
on the source and tag of the message and how many elements were
actually received.
For example, the receiver can examine this variable to find out the
actual length of the character string received.
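A minimal sketch of such an exchange (the tag value 99 and the
20-character buffer are our choices for illustration; the necessary
include files and the calls to MPI_Init and MPI_Finalize are omitted):

    char msg[20];
    int myrank, tag = 99;
    MPI_Status status;

    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {                 /* process 0 sends the string */
        strcpy(msg, "Hello there");
        MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (myrank == 1) {          /* process 1 receives it */
        MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }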
Datatype matching
(between sender and receiver) and data
conversion
on heterogeneous systems
are discussed in more detail in Section
.
The Fortran version of this code is shown in
Example
. In order to make our
Fortran examples more readable, we use Fortran 90
syntax, here and in
many other places in this book. The examples can be
easily rewritten in standard Fortran 77. The Fortran code is
essentially identical to the C code. All MPI calls are
procedures, and an additional parameter is used to return
the value returned by the corresponding C
function. Note that Fortran
strings have fixed size and are not
null-terminated.
The receive operation stores "Hello there" in the first
11 positions of msg.
These examples employed blocking
send and receive functions. The send call
blocks until the send buffer can be reclaimed (i.e., after the send, process 0
can safely over-write the contents of msg). Similarly, the
receive function blocks until the receive buffer actually contains the contents
of the message. MPI also provides nonblocking send and receive
functions that allow the possible overlap of message transmittal with
computation, or the overlap of multiple message transmittals with one another.
Non-blocking functions always come in two parts: the posting functions,
which begin the requested operation; and the test-for-completion
functions, which allow the application program to discover whether
the requested operation has completed.
Our chapter begins by explaining blocking functions in detail, in
Section
-
,
while nonblocking functions are covered
later, in Sections
-
.
We have already said rather a lot about a simple transmittal of data
from one process to another, but there is even more. To
understand why, we examine two aspects of the communication: the semantics
of the communication primitives, and the underlying protocols that
implement them. Consider the previous example, on process 0, after
the blocking send has completed. The question arises: if the send has
completed, does this tell us anything about the receiving process? Can we
know that the receive has finished, or even, that it has begun?
Such questions of semantics are related to the nature of the underlying
protocol implementing the operations. If one wishes to implement
a protocol minimizing the copying and buffering of data, the most natural
semantics might be the ``rendezvous''
version, where completion of the send
implies the receive has been initiated (at least). On the other
hand, a protocol that
attempts to block processes for the minimal amount of time will
necessarily end up doing more buffering and copying of data and will
have ``buffering'' semantics.
The trouble is, one choice of
semantics is not best for all applications, nor is it best for all
architectures. Because the primary goal of MPI is to standardize the
operations, yet not sacrifice performance, the decision was made to
include all the major choices for point to point semantics in the
standard.
The above complexities are manifested in MPI by the existence of
modes for point to point communication. Both blocking and
nonblocking communications have modes. The mode allows one to choose the
semantics of the send operation and, in effect, to influence the
underlying protocol of the transfer of data.
In standard mode the completion of the send does not
necessarily mean that the matching receive has started, and no
assumption should be made in the application program about whether the
out-going data is buffered by MPI. In buffered mode the user can
guarantee that a certain amount of buffering space is available. The
catch is that the space must be explicitly provided by the application
program. In synchronous mode a rendezvous semantics between
sender and receiver is used. Finally, there is ready mode. This
allows the user to exploit extra knowledge to simplify the protocol and
potentially achieve higher performance. In a ready-mode send, the user
asserts that the matching receive already has been posted.
Modes are covered in Section
.
This section describes standard-mode, blocking sends and receives.
MPI_Send(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
MPI_SEND performs a standard-mode, blocking send.
The semantics of this function are described in
Section
.
The arguments to MPI_SEND are described in
the following subsections.
The send buffer
specified by MPI_SEND consists of
count successive entries of the type indicated by
datatype, starting
with the entry at address buf. Note that we specify the
message length
in terms of number of entries, not number of bytes. The former is
machine independent and facilitates portable programming.
The count may be zero,
in which case the data part of the message is empty.
The basic datatypes correspond to
the basic datatypes of the host language.
Possible values of this argument for Fortran and the
corresponding Fortran types are listed below.
    MPI datatype            Fortran datatype
    MPI_INTEGER             INTEGER
    MPI_REAL                REAL
    MPI_DOUBLE_PRECISION    DOUBLE PRECISION
    MPI_COMPLEX             COMPLEX
    MPI_LOGICAL             LOGICAL
    MPI_CHARACTER           CHARACTER(1)
    MPI_BYTE                (no corresponding Fortran datatype)
    MPI_PACKED              (no corresponding Fortran datatype)
Possible values for this argument for C and the corresponding C
types are listed below.
    MPI datatype        C datatype
    MPI_CHAR            signed char
    MPI_SHORT           signed short int
    MPI_INT             signed int
    MPI_LONG            signed long int
    MPI_UNSIGNED_CHAR   unsigned char
    MPI_UNSIGNED_SHORT  unsigned short int
    MPI_UNSIGNED        unsigned int
    MPI_UNSIGNED_LONG   unsigned long int
    MPI_FLOAT           float
    MPI_DOUBLE          double
    MPI_LONG_DOUBLE     long double
    MPI_BYTE            (no corresponding C datatype)
    MPI_PACKED          (no corresponding C datatype)
The datatypes MPI_BYTE and MPI_PACKED do not correspond to a
Fortran or C datatype. A value of type MPI_BYTE consists of a byte
(8 binary digits). A
byte is uninterpreted and is different from a character.
Different machines may have
different representations for characters, or may use more than one
byte to represent characters. On the other hand, a byte has the same
binary value on all machines.
The use of MPI_PACKED is explained in
Section
.
MPI requires support of the datatypes listed above, which match the basic
datatypes of Fortran 77 and ANSI C.
Additional MPI datatypes should be provided if the host language has
additional data types. Some examples are:
MPI_LONG_LONG, for C integers declared to be of type long long;
MPI_DOUBLE_COMPLEX, for double precision complex in
Fortran declared to be of type DOUBLE COMPLEX;
MPI_REAL2, MPI_REAL4 and MPI_REAL8 for Fortran reals, declared to be of
type REAL*2, REAL*4 and REAL*8, respectively;
MPI_INTEGER1, MPI_INTEGER2 and
MPI_INTEGER4 for Fortran integers, declared to be of type
INTEGER*1, INTEGER*2 and INTEGER*4, respectively.
In addition, MPI provides a mechanism for users to define new, derived,
datatypes. This is explained in Chapter
.
In addition to data, messages carry information that is used to
distinguish and selectively receive them. This information consists
of a fixed number of fields, which we collectively call
the message envelope. These fields are
source, destination, tag, and communicator.
The message source is implicitly determined by the identity of the
message sender. The other fields are specified by arguments in the send
operation.
The comm argument specifies the communicator used for
the send operation. The communicator is a local object that
represents a communication domain. A communication domain is a
global, distributed structure that allows processes in a group
to communicate with each other, or to communicate with processes in another
group. A communication domain of the first type (communication within a
group) is represented by
an intracommunicator, whereas a communication domain of the second type
(communication between groups) is represented by an intercommunicator.
Processes
in a group are ordered, and are identified by their integer rank.
Processes may participate in several communication domains; distinct
communication
domains may have partially or even completely overlapping groups of processes.
Each communication domain supports a disjoint stream of communications.
Thus, a process may be able to communicate with another process via two distinct
communication domains, using two distinct communicators. The same
process may be identified by a different rank in the two domains;
and communications in the two domains do not interfere.
MPI applications begin with a default communication domain that includes
all processes (of this parallel job); the default communicator
MPI_COMM_WORLD represents this communication domain.
Communicators are explained further in Chapter
.
The message destination is specified by the dest
argument.
The range of valid values for dest is
0,...,n-1, where n is the number of
processes in the group.
This range includes the rank of the
sender: if comm is an intracommunicator, then a process may send a
message to itself. If the communicator is an intercommunicator,
then destinations are identified by their rank in the remote group.
The integer-valued message tag is specified by the tag argument.
This integer can be used by the application to distinguish messages.
The range of valid tag values is 0,...,UB, where the value of UB is
implementation dependent. It is found by querying the
value of the attribute MPI_TAG_UB, as
described in Chapter
.
MPI requires that UB be no less than 32767.
MPI_Recv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
<type> BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_RECV performs a standard-mode, blocking receive.
The semantics of this function are described in
Section
.
The arguments to MPI_RECV are described in
the following subsections.
The receive buffer consists of
storage sufficient to contain count consecutive entries of the type
specified by datatype, starting at address buf.
The length of the received message must be less than or equal to the length of
the receive buffer. An overflow error occurs if all incoming data does
not fit, without truncation, into the receive buffer.
We explain in Chapter
how to check for errors.
If a
message that is shorter than the receive buffer arrives, then the incoming
message is stored in the initial locations of the receive buffer,
and the remaining locations are not modified.
The selection of a message by a receive operation is governed by
the value of its message envelope.
A message can be received
if its envelope matches the source, tag and
comm values specified by the
receive operation. The receiver may specify a wildcard
value for source (MPI_ANY_SOURCE), and/or
a wildcard value for tag (MPI_ANY_TAG),
indicating that any source and/or tag are acceptable. One cannot specify a
wildcard value for comm.
The argument source, if different from MPI_ANY_SOURCE,
is specified as a rank within the process group associated
with the communicator (remote process group, for intercommunicators).
The range of valid values for the
source argument is
{0,...,n-1} ∪ {MPI_ANY_SOURCE}, where
n is the number of processes in this group.
This range includes the receiver's rank: if comm is an
intracommunicator, then a process may receive a
message from itself. The range of valid values for the tag argument is
{0,...,UB} ∪ {MPI_ANY_TAG}.
The receive call does not specify the size of an
incoming message, but only an upper bound. The
source or tag of a received message may not be known if
wildcard values were used in a receive operation.
Also, if multiple requests
are completed by a single MPI function (see
Section
), a distinct error code may be
returned for each request. (Usually, the error code is returned as
the value of the function in C, and as the value of the
IERROR argument in Fortran.)
This information is returned by the status argument of
MPI_RECV.
The type of status is defined by MPI.
Status variables need to be explicitly
allocated by the user, that is, they are not system objects.
In C, status is a structure of type MPI_Status
that contains three fields named
MPI_SOURCE, MPI_TAG, and MPI_ERROR;
the structure may contain
additional fields. Thus, status.MPI_SOURCE,
status.MPI_TAG and status.MPI_ERROR
contain the source, tag and error code, respectively, of
the received message.
In Fortran, status is an array of INTEGERs of length
MPI_STATUS_SIZE. The three constants MPI_SOURCE,
MPI_TAG and MPI_ERROR
are the indices of the entries that store the
source, tag and error fields. Thus status(MPI_SOURCE),
status(MPI_TAG) and status(MPI_ERROR)
contain,
respectively, the source, the
tag and the error code of the received message.
The status argument also returns information on the length of the message
received. However, this information is not directly available as a field
of the status variable and a call to MPI_GET_COUNT is required
to ``decode'' this information.
MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
MPI_GET_COUNT(STATUS, DATATYPE, COUNT, IERROR)
INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR
MPI_GET_COUNT takes as input the status set
by MPI_RECV and computes
the number of entries received. The number of entries
is returned in count.
The datatype argument should match the argument
provided to the receive call that set status.
(Section
explains that
MPI_GET_COUNT may return, in certain situations, the value
MPI_UNDEFINED.)
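For example, continuing the character-string exchange shown earlier, a
sketch of its use might be:

    int count;

    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    MPI_Get_count(&status, MPI_CHAR, &count);  /* entries actually received */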
Note the asymmetry between send and receive operations. A receive
operation may
accept messages from an arbitrary sender, but
a send operation
must specify a unique receiver. This matches a ``push'' communication
mechanism, where data transfer is effected by the sender, rather than a
``pull'' mechanism, where data transfer is effected by the receiver.
Source equal to destination is allowed, that is, a process can send a
message to itself.
However, for such a communication to succeed, it is required that the message be
buffered by the system between the completion of the send call and the
start of the receive call. The amount of buffer space available and the buffer
allocation policy are implementation dependent.
Therefore, it is unsafe
and non-portable to send
self-messages with the standard-mode, blocking send and receive
operations described so far, since this may lead to deadlock.
More discussions of this appear in Section
.
One can think of message transfer as consisting of the following three
phases: data is pulled out of the send buffer and a message is assembled;
the message is transferred from sender to receiver; and data is pulled
from the incoming message and disassembled into the receive buffer.
Type matching must be observed at each of these phases. The type
of each variable in the sender buffer must match
the type specified for that entry by the send operation.
The type specified by the send operation must match the type specified by
the receive operation. Finally,
the type of each
variable in the receive buffer must match the type specified for
that entry by the receive operation.
A program that fails to observe these
rules is erroneous.
To define type matching precisely, we need to deal with two issues:
matching of types of variables of the host language with types specified in
communication operations,
and matching of types between sender and receiver.
The types between a send and receive match if
both operations specify identical type names. That is, MPI_INTEGER
matches MPI_INTEGER, MPI_REAL matches MPI_REAL,
and so on. The one exception to this rule is that
the type MPI_PACKED can match
any other type (Section
).
The type of a variable matches the type specified in the
communication operation
if the datatype name used by that operation corresponds
to the basic type of the host program variable. For example, an entry with type
name MPI_INTEGER matches a Fortran variable of type INTEGER.
Tables showing this correspondence for Fortran and C appear in
Section
.
There are two exceptions to this rule: an entry with type name
MPI_BYTE or MPI_PACKED can be used to match
any byte of storage (on a byte-addressable machine),
irrespective of the datatype of the variable that contains this byte.
The type MPI_BYTE allows one to transfer the binary value of a byte in
memory unchanged.
The type MPI_PACKED is used to send data that has been
explicitly packed with calls to MPI_PACK, or receive data that will
be explicitly unpacked with calls to
MPI_UNPACK (Section
).
The following examples illustrate type matching.
The
type MPI_CHARACTER matches one character of a Fortran variable of
type CHARACTER, rather than the entire character string stored in the
variable. Fortran variables of type CHARACTER
or substrings are transferred as if they were arrays of characters.
This is illustrated in the example below.
One of the goals of MPI is to support parallel computations across
heterogeneous environments. Communication in a heterogeneous
environment may require data conversions.
We use the following terminology: type conversion changes the datatype of
a value (for example, by rounding a REAL to an INTEGER), whereas
representation conversion changes the binary representation of a value
(for example, from one floating point format to another, or from one
character encoding to another).
The type matching rules imply that MPI communications never do type
conversion. On the other hand, MPI requires that a representation
conversion be
performed when a typed value is transferred across environments that use
different representations for such a value.
MPI does not specify the detailed rules for representation
conversion. Such a conversion is
expected to preserve integer, logical or character values, and to
convert a floating point value to the nearest value that can be
represented on the target system.
Overflow and underflow exceptions may occur during floating point conversions.
Conversion of integers or characters may also lead to exceptions when a value
that can be represented in one system cannot be represented in the other
system. An exception
occurring during representation conversion results in a failure of the
communication. An error occurs either in the send operation, or the receive
operation, or both.
If a value sent in a message is untyped (i.e., of type MPI_BYTE),
then the binary representation of the byte stored at the receiver is
identical to the binary representation of the byte loaded at the sender. This
holds true, whether sender and receiver run in the same or in distinct
environments. No representation conversion is done.
Note that representation conversion may
occur when values of type MPI_CHARACTER or
MPI_CHAR are transferred, for example, from an
EBCDIC encoding to an ASCII encoding.
No representation conversion need occur when an MPI program executes in
a homogeneous system, where all processes run in the same environment.
Consider the three examples,
-
.
The first program is correct, assuming that a and b are
REAL arrays of size 10.
If the sender and receiver execute in different environments,
then the ten real values that are fetched from the send buffer will
be converted to the representation for reals on the receiver site
before they are stored in the receive buffer. While the
number of real elements fetched from the send buffer equals the
number of real elements stored in the receive buffer, the number of
bytes stored need not equal the number of bytes loaded. For example, the
sender may use a four byte representation and the receiver an eight
byte representation for reals.
The second program is erroneous, and its behavior is undefined.
The third program is correct. The exact same
sequence of forty bytes that were loaded from the send buffer will be
stored in the receive buffer, even
if sender and receiver run in a different environment. The message
sent has exactly the same length (in bytes) and the same binary
representation as the message received. If a and b are
of different types,
or if they are of the same type but different data representations are used,
then the bits stored in the receive buffer may encode values that are
different from the values they encoded in the send buffer.
Representation conversion also applies to the envelope of a message.
The source, destination and tag are all integers that may need to be converted.
MPI does not require support for inter-language
communication. The behavior of a program is undefined if messages are sent
by a C process and received by a Fortran process, or vice-versa.
This section describes the main properties of the send and receive calls
introduced in Section
.
Interested readers can find a more formal treatment of the issues
in this section in
[10].
The receive described in Section
can be started
whether or not a matching send has been posted.
That version of receive is blocking.
It returns only after the receive buffer contains the newly received
message. A receive could complete before the matching send
has completed (of course, it can complete only after the matching send
has started).
The send operation described in Section
can be started whether or not a
matching receive has been posted. That version of send
is blocking.
It does not return until the message data
and envelope have been safely stored away so that the sender is
free to access and overwrite the send buffer.
The send call is also potentially non-local.
The message might be copied directly into the matching receive buffer,
or it might be copied into a temporary system buffer. In the first case, the
send call will not complete until a matching receive call occurs, and so,
if the
sending process is single-threaded, then it will be blocked until this time.
In the
second case, the send call may return ahead of the matching receive call,
allowing a single-threaded process to continue with its computation.
The MPI implementation may make either of these
choices. It might
block the sender or it might buffer the data.
Message buffering
decouples the send and receive operations. A blocking send might
complete as soon
as the message was buffered, even if no matching receive has been executed
by the receiver. On the other hand, message buffering can be expensive,
as it entails additional memory-to-memory copying, and it requires the
allocation of memory for buffering. The choice of the right amount of buffer
space to allocate for communication and of the buffering policy to use is
application and implementation dependent. Therefore, MPI
offers the choice of
several communication modes that allow one to control the choice of the
communication protocol. Modes are described in
Section
. The choice of a buffering policy for the
standard mode send described in
Section
is left to the implementation.
In any case, lack of buffer space will not cause a standard send
call to fail, but will merely cause it to block. In well-constructed
programs, this results in a useful throttle effect.
Consider a situation where a producer repeatedly produces
new values and sends them to a consumer. Assume that the producer produces
new values faster than the consumer can consume them.
If standard sends are used, then the producer will be automatically throttled,
as its send operations will block when buffer space is unavailable.
In ill-constructed programs, blocking may
lead to a deadlock situation, where all processes are
blocked, and no progress occurs. Such programs may complete
when sufficient buffer space is available, but will fail on systems
that do less buffering, or when data sets (and message sizes) are increased.
Since any
system will run out of buffer resources as message sizes are increased,
and some implementations may want to provide little buffering, MPI
takes the position that safe programs
do not rely on system buffering, and will complete correctly irrespective of
the buffer allocation policy used by MPI. Buffering may change the
performance of a safe program, but it does not affect the
result of the program.
MPI does not enforce a safe programming style.
Users are free to take advantage of knowledge of the buffering policy of an
implementation in order to relax the safety requirements, though
doing so will lessen the portability of the program.
The following examples illustrate safe programming issues.
The MPI standard is intended for use by all those who want to write portable
message-passing programs in Fortran 77 and C.
This includes individual application programmers,
developers of software designed to run on
parallel machines, and creators of
environments and tools. In order to be
attractive to this wide audience, the standard must provide a simple, easy-to-use
interface for the basic user while not semantically precluding the
high-performance message-passing operations available on advanced machines.
MPI does not specify the interaction of blocking communication calls with the
thread scheduler in a multi-threaded implementation of MPI. The desired
behavior is that a blocking communication call blocks only the issuing thread,
allowing another thread to be scheduled. The blocked thread will be
rescheduled
when the blocked call is satisfied. That is,
when data has been copied out of the
send buffer, for a send operation, or copied into the receive buffer, for a
receive operation. When a thread executes concurrently with a blocked
communication operation,
it is the user's responsibility not to access or modify a
communication buffer until the communication completes.
Otherwise, the outcome of the computation is undefined.
Messages are non-overtaking.
Conceptually, one may think of successive
messages sent by a process to another process as ordered in a sequence.
Receive operations posted by a process are also ordered in a sequence. Each
incoming message matches the first matching receive in the sequence. This is
illustrated in Figure
.
Process zero sends two messages to process one and process two sends three
messages to process one.
Process one posts five receives. All communications occur in
the same communication domain. The first message sent by process zero
and the first message sent by process two can be received in either
order, since the first two posted receives match either. The second
message of process two will be received before the third message, even
though the third and fourth receives match either.
Thus, if a sender sends two messages in succession to the same destination,
and both match the same receive, then the receive cannot get the
second message if the first message is still pending.
If a receiver posts two receives in succession, and both match the same
message,
then the second receive operation cannot be satisfied by this message, if the
first receive is still pending.
These requirements further define message matching.
They guarantee that message-passing code is deterministic, if processes are
single-threaded and the wildcard MPI_ANY_SOURCE is
not used in receives.
Some other MPI functions, such as MPI_CANCEL or
MPI_WAITANY, are additional sources of nondeterminism.
In a single-threaded process all communication operations are ordered
according to program execution order.
The situation is different when processes are multi-threaded.
The semantics of thread execution may not define
a relative order between two communication operations executed by two
distinct threads. The operations are logically concurrent, even if one
physically precedes the other. In this case, no order constraints
apply. Two messages sent by concurrent threads can be
received in any order. Similarly, if two receive operations that are logically
concurrent receive two successively sent messages, then the two messages can
match the receives in either order.
It is important to understand what is guaranteed by the ordering
property and what is not. Between any pair of communicating
processes, messages flow in order. This does not imply a consistent, total
order on communication events in the system. Consider the following example.
If a pair of
matching send and receive operations has been initiated on two processes, then at
least one of these two operations will complete, independently of
other actions in the system. The send operation will
complete, unless the receive is satisfied by another message.
The receive operation will complete, unless the message sent
is consumed by another
matching receive posted at the same destination process.
MPI makes no guarantee of fairness in the handling of
communication. Suppose that a send is posted. Then it is possible
that the destination process repeatedly posts a receive that matches this
send, yet the message is never received, because it is repeatedly overtaken by
other messages, sent from other sources. This scenario requires that the
receive use the wildcard MPI_ANY_SOURCE as its source argument.
Similarly, suppose that a
receive is posted by a multi-threaded process. Then it is possible that
messages that
match this receive are repeatedly consumed, yet the receive is never satisfied,
because it is overtaken by other receives posted at this node by
other threads. It is the programmer's responsibility to prevent
starvation in such situations.
We shall use the following example to illustrate the material
introduced so far, and to motivate new functions.
Since this code has a simple structure, a data-parallel approach can be
used to derive an equivalent parallel code. The array is distributed
across processes, and each process is assigned the task of updating
the entries on the part of the array it owns.
A parallel
algorithm is derived from a choice of data distribution. The
distribution should be balanced, allocating (roughly) the same number
of entries to each processor; and it should minimize communication.
Figure
illustrates two possible distributions: a
1D (block) distribution, where the matrix is partitioned in one dimension,
and a 2D (block,block) distribution, where the matrix is partitioned
in two dimensions.
Since the communication occurs at block
boundaries, communication volume is minimized by the 2D partition which
has a better area to perimeter ratio. However, in this partition, each
processor communicates with four neighbors, rather than two neighbors
in the 1D partition. When the ratio n/P (where P is the number of
processors) is small, communication time will be dominated by the
fixed overhead per message, and the first partition will lead to
better performance. When the ratio is large, the second partition
will result in better performance. In order to keep the example
simple, we shall use the first partition; a realistic code would use a
``polyalgorithm'' that selects one of the two partitions, according to
problem size, number of processors, and communication performance
parameters.
The value of each point in the array B is computed from the
value of the four neighbors in array A. Communications are
needed at block boundaries in order to receive values of neighbor
points which are owned by another processor.
Communications are simplified if
an overlap area is allocated at each processor
for storing the values to be received
from the neighbor processor. Essentially, storage is allocated for
each entry
both at the producer and at the consumer of that entry. If an entry
is produced by one processor and consumed by another, then storage
is allocated for this entry at both processors.
With such a scheme there is no need for dynamic allocation of communication
buffers, and the location of each variable is fixed.
Such a scheme works whenever the data dependencies in the
computation are fixed and simple. In our case, they are described
by a four point stencil. Therefore, a one-column overlap is
needed, for a 1D partition.
We shall partition array A with one column
overlap. No such overlap is required for array B.
Figure
shows the extra columns in A and
how data is transferred for each iteration.
We shall use an algorithm where all values needed
from a neighbor are brought in one message.
Coalescing of communications in this manner
reduces the number of messages and generally improves performance.
The resulting parallel algorithm is shown below.
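A minimal, hypothetical C sketch of such a main loop follows (the names A, B, nlocal, N, p, rank and niter are assumptions, not part of the original example; the partition is drawn row-wise here, A holding nlocal+2 local rows of N values each, with rows 0 and nlocal+1 serving as the overlap area). Note that every process first sends and then receives.

    /* Hypothetical sketch: 1D (block) partition of the grid by rows.        */
    for (iter = 0; iter < niter; iter++) {
        /* every process first sends its boundary rows ...                   */
        if (rank > 0)
            MPI_Send(&A[1][0],      N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD);
        if (rank < p-1)
            MPI_Send(&A[nlocal][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD);
        /* ... then receives the neighbors' boundary rows into the overlap   */
        if (rank > 0)
            MPI_Recv(&A[0][0],        N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status);
        if (rank < p-1)
            MPI_Recv(&A[nlocal+1][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &status);
        /* local update: each point of B from the four neighbors in A        */
        for (i = 1; i <= nlocal; i++)
            for (j = 1; j < N-1; j++)
                B[i][j] = 0.25*(A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
        /* copy B back into A for the next iteration; physical boundary
           conditions are omitted in this sketch                             */
    }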
One way to get a safe version of this code is to
alternate
the order of sends and receives: odd-rank processes first send, then
receive, while even-rank processes first receive, then send. Thus,
one achieves the communication pattern of Example
.
The
modified main loop is shown below. We shall later see
simpler ways of dealing with this problem.
Jacobi, safe version
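A hypothetical sketch of the reordered exchange only (same assumed names as in the previous sketch); the rest of the loop is unchanged.

    if (rank % 2 == 1) {                /* odd ranks: send, then receive     */
        if (rank > 0)
            MPI_Send(&A[1][0],      N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD);
        if (rank < p-1)
            MPI_Send(&A[nlocal][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD);
        if (rank > 0)
            MPI_Recv(&A[0][0],        N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status);
        if (rank < p-1)
            MPI_Recv(&A[nlocal+1][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &status);
    } else {                            /* even ranks: receive, then send    */
        if (rank > 0)
            MPI_Recv(&A[0][0],        N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status);
        if (rank < p-1)
            MPI_Recv(&A[nlocal+1][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD, &status);
        if (rank > 0)
            MPI_Send(&A[1][0],      N, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD);
        if (rank < p-1)
            MPI_Send(&A[nlocal][0], N, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD);
    }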
The exchange communication pattern exhibited by the last example is
sufficiently frequent to justify special support.
The send-receive operation combines, in one call, the sending of one
message to a destination and the receiving of another message from
a source. The source and destination are possibly the same.
Send-receive is
useful for communications patterns where each node both sends and
receives messages. One example is an exchange of data between two processes.
Another example is a shift operation across a chain of
processes. A safe program that implements such a shift will need to use an
odd/even ordering of communications, similar to the one used in
Example
.
When send-receive is used, data flows simultaneously in both directions
(logically, at least) and cycles in the communication pattern do not lead
to deadlock.
Send-receive can be
used in conjunction with the functions described in Chapter
to
perform shifts on logical topologies.
Also, send-receive can be used for implementing
remote procedure calls:
one blocking send-receive call can be used for
sending the input parameters to the callee and receiving back the
output parameters.
There is compatibility between send-receive and normal sends and receives.
A message sent by a
send-receive can be received by a regular receive
or probed by a regular probe, and a send-receive can
receive a message sent by a regular send.
MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV(SENDBUF, SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVBUF, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
MPI_SENDRECV
executes a blocking send and receive operation. Both the send and receive
use the same communicator, but have
distinct tag arguments. The send buffer and receive buffers must be
disjoint, and may have different lengths and datatypes.
The next function handles the case where the buffers are not disjoint.
The semantics of a send-receive operation is what would be obtained
if the caller forked two concurrent threads, one to execute the send,
and one to execute the receive, followed by a join of these two
threads.
MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
MPI_SENDRECV_REPLACE(BUF, COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM, STATUS, IERROR)
    <type> BUF(*)
MPI_SENDRECV_REPLACE
executes a blocking send and receive. The same buffer is used both for
the send and for the receive, so that the message sent is replaced by
the message received.
The example below shows the main loop of the
parallel Jacobi code, reimplemented using send-receive.
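A minimal, hypothetical sketch of the exchange with MPI_SENDRECV (same assumed names as before; left and right denote the neighbor ranks, and boundary processes would still need guards here, which the null process described next removes).

    MPI_Status status;
    /* send my first owned row to the left neighbor; receive the right
       neighbor's boundary row into the bottom overlap row                   */
    MPI_Sendrecv(&A[1][0],        N, MPI_DOUBLE, left,  0,
                 &A[nlocal+1][0], N, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, &status);
    /* send my last owned row to the right neighbor; receive the left
       neighbor's boundary row into the top overlap row                      */
    MPI_Sendrecv(&A[nlocal][0],   N, MPI_DOUBLE, right, 0,
                 &A[0][0],        N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);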
In many instances, it is convenient to specify a ``dummy'' source or
destination for communication.
In the Jacobi example, this will avoid special handling of boundary processes.
This also simplifies handling of
boundaries in the case of a non-circular shift, when
used in conjunction with the functions described in
Chapter
.
The special value MPI_PROC_NULL can be used
instead of a rank wherever a
source or a destination argument is required in a communication function.
A communication
with process MPI_PROC_NULL has no effect.
A send to MPI_PROC_NULL
succeeds and returns as soon as possible.
A receive from MPI_PROC_NULL succeeds and
returns as soon as possible with no modifications to the receive buffer.
When a receive with source = MPI_PROC_NULL is executed then
the status object returns source =
MPI_PROC_NULL, tag = MPI_ANY_TAG and
count = 0.
We take advantage of null processes to further simplify the parallel Jacobi
code.
Jacobi, with null processes
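A hypothetical sketch of the neighbor setup with null processes (assumed names as before).

    /* with null processes, boundary ranks need no special handling: a send
       to or receive from MPI_PROC_NULL simply has no effect                 */
    int left  = (rank == 0)     ? MPI_PROC_NULL : rank - 1;
    int right = (rank == p - 1) ? MPI_PROC_NULL : rank + 1;
    /* the two MPI_Sendrecv calls of the previous sketch can now be executed
       unconditionally on every process                                      */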
One can improve performance on many systems by overlapping
communication and computation. This is especially true on systems
where communication can be executed autonomously by an intelligent
communication controller. Multi-threading is one mechanism for
achieving such overlap. While one thread is blocked, waiting for a communication
to complete, another thread may execute on the same processor. This mechanism
is efficient if the system supports light-weight threads that are integrated
with the communication subsystem. An alternative mechanism that
often gives better performance is to use nonblocking communication. A
nonblocking post-send initiates a send operation, but does not
complete it. The post-send will return
before the message is copied out of the send buffer.
A separate complete-send
call is needed to complete the communication, that is, to verify that the
data has been copied out of the send buffer. With
suitable hardware, the transfer of data out of the sender memory
may proceed concurrently with computations done at the sender after
the send was initiated and before it completed.
Similarly, a nonblocking post-receive initiates a receive
operation, but does not complete it. The call will return before
a message is stored into the receive buffer. A separate complete-receive
is needed to complete the receive operation and verify that the data has
been received into the receive buffer.
A nonblocking send can be posted whether a matching
receive has been posted or not.
The post-send call
has local completion semantics: it returns immediately, irrespective of the
status of other processes.
If the call causes some system resource to be exhausted, then it will
fail and return an error code. Quality
implementations of MPI should ensure that this happens only
in ``pathological'' cases. That is, an MPI implementation
should be able to support a
large number of pending nonblocking operations.
The complete-send returns when data has been copied out of the
send buffer.
The complete-send has non-local completion semantics.
The call may return before a
matching receive is posted, if the message is buffered. On
the other hand, the
complete-send may not return until a matching
receive is posted.
There is compatibility between blocking and nonblocking
communication functions.
Nonblocking sends can be matched with blocking receives, and
vice-versa.
Nonblocking communications use request objects to
identify communication operations and link the posting
operation with the completion operation.
Request objects are allocated by MPI and reside in MPI ``system'' memory.
The request object is opaque in the sense that the type and structure
of the object is not visible to users.
The application program can only manipulate handles to request objects, not the
objects themselves.
The system may use the request object to identify various properties of a
communication operation, such as the communication
buffer that is associated with it, or to store
information about the status of the pending communication operation.
The user may access request objects through various MPI calls to
inquire about the status of pending communication operations.
The special value MPI_REQUEST_NULL is used to indicate an invalid
request handle. Operations that deallocate request objects set the request
handle to this value.
Calls that post send or receive operations have the same names as the
corresponding blocking calls, except that an additional
prefix of I (for immediate) indicates that the call is nonblocking.
MPI_Isend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_ISEND posts a standard-mode, nonblocking send.
MPI_Irecv(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_IRECV posts a nonblocking receive.
These calls allocate a
request object and return a handle to it in
request.
The request is used to
query the status of the communication or wait for its completion.
A nonblocking post-send call indicates that the
system may start copying data out of the send buffer.
The sender must not access any part of the send buffer
(neither for loads nor for
stores) after a nonblocking send operation is posted, until the
complete-send returns.
A nonblocking post-receive indicates that the system may start
writing data into the receive buffer. The receiver must not access
any part of the
receive buffer after a nonblocking receive operation is posted,
until the complete-receive returns.
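A minimal, hypothetical sketch of the post/complete pattern, using the completion call MPI_WAIT described further below (inbuf, outbuf, count, src, dest, tag and comm are assumptions).

    MPI_Request rreq, sreq;
    MPI_Status  rstat, sstat;
    MPI_Irecv(inbuf,  count, MPI_DOUBLE, src,  tag, comm, &rreq);
    MPI_Isend(outbuf, count, MPI_DOUBLE, dest, tag, comm, &sreq);
    /* ... computation that touches neither inbuf nor outbuf ...             */
    MPI_Wait(&sreq, &sstat);   /* complete-send: outbuf may now be reused    */
    MPI_Wait(&rreq, &rstat);   /* complete-receive: inbuf holds the message  */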
The attractiveness of the message-passing paradigm at least partially
stems from its wide portability. Programs expressed this way may run
on distributed-memory multicomputers, shared-memory
multiprocessors, networks of workstations,
and combinations of all of these.
The paradigm will not be made obsolete by
architectures combining the shared-
and distributed-memory views, or by increases in network speeds. Thus, it
should be both possible and useful to implement this standard on a great
variety of machines,
including those ``machines'' consisting of collections of
other machines, parallel or not, connected by a communication network.
The interface is suitable for use by fully general Multiple
Instruction, Multiple Data
(MIMD) programs, or Multiple Program, Multiple Data (MPMD)
programs, where each process follows a distinct execution path through
the same code, or even executes a different code.
It is also suitable for
those written in the more restricted style of Single Program,
Multiple Data (SPMD), where all processes follow the same execution
path through the same program.
Although no explicit
support for threads is provided,
the interface has been designed so as not to
prejudice their use.
With this version of MPI no support is provided for dynamic spawning
of tasks; such support is expected in future versions of MPI; see
Section
.
MPI provides many features intended to improve performance on
scalable parallel computers with
specialized interprocessor communication
hardware. Thus, we expect that native, high-performance
implementations of MPI will be provided on such machines. At the
same time, implementations of MPI on top of standard Unix
interprocessor communication protocols will provide portability to
workstation clusters and heterogeneous networks of workstations.
Several proprietary, native implementations of MPI, and public
domain, portable implementations of MPI are now available. See
Section
for more information
about MPI implementations.
The functions MPI_WAIT and MPI_TEST are used to complete
nonblocking sends and receives. The completion of a send
indicates that the sender is now free to access the send buffer.
The completion of a receive indicates that the receive buffer
contains the message, the receiver is free to access it,
and that the status object is set.
MPI_Wait(MPI_Request *request, MPI_Status *status)
MPI_WAIT(REQUEST, STATUS, IERROR)
    INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
A call to MPI_WAIT returns when the operation
identified by request is complete. If the system object
pointed to by request
was originally created by a nonblocking send or
receive, then the object is deallocated by MPI_WAIT
and request is set to MPI_REQUEST_NULL.
The status object is set to contain information on the completed operation.
MPI_WAIT has non-local completion semantics.
MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
    LOGICAL FLAG
A call to MPI_TEST returns flag = true if the
operation identified by request is complete. In this case, the
status object is set to contain information on the completed
operation. If the system object pointed to by request
was originally created by a nonblocking send or
receive, then the object is deallocated by MPI_TEST
and request is set to MPI_REQUEST_NULL.
The call returns flag = false, otherwise. In this case, the value
of the status object is undefined.
MPI_TEST has local completion semantics.
For both MPI_WAIT and MPI_TEST,
information on the completed operation is returned in status.
The content of the status object for a receive
operation is accessed as
described in Section
.
The contents of a status object for a send operation are undefined,
except that
the query function MPI_TEST_CANCELLED
(Section
) can be applied to it.
We illustrate the use of nonblocking communication for the same
Jacobi computation used in previous examples
(Example
-
).
To achieve maximum overlap between
computation and communication, communications should be started as soon as
possible and completed as late as possible. That is, sends should be posted as
soon as the data to be sent is available; receives should be posted as soon as
the receive buffer can be reused; sends should be completed
just before the send
buffer is to be reused; and receives should be completed just before
the data in
the receive buffer is to be used. Sometimes, the overlap can be increased by
reordering computations.
Jacobi, using nonblocking
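A hypothetical C sketch of such a loop (same assumed names as in the earlier Jacobi sketches). The boundary rows are copied into separate send buffers so that A may be read freely while the sends are pending, in keeping with the access rule stated above.

    MPI_Request reqs[4];
    MPI_Status  stats[4];
    double sendl[N], sendr[N];
    for (iter = 0; iter < niter; iter++) {
        for (j = 0; j < N; j++) {          /* copy the rows to be sent, so   */
            sendl[j] = A[1][j];            /* that A itself is not a pending */
            sendr[j] = A[nlocal][j];       /* send buffer                    */
        }
        MPI_Irecv(&A[0][0],        N, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(&A[nlocal+1][0], N, MPI_DOUBLE, right, 0, comm, &reqs[1]);
        MPI_Isend(sendl, N, MPI_DOUBLE, left,  0, comm, &reqs[2]);
        MPI_Isend(sendr, N, MPI_DOUBLE, right, 0, comm, &reqs[3]);
        for (i = 2; i <= nlocal-1; i++)    /* interior rows need no overlap  */
            for (j = 1; j < N-1; j++)
                B[i][j] = 0.25*(A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
        for (k = 0; k < 4; k++)            /* complete the four operations   */
            MPI_Wait(&reqs[k], &stats[k]);
        for (j = 1; j < N-1; j++) {        /* boundary rows need the halos   */
            B[1][j]      = 0.25*(A[0][j] + A[2][j] + A[1][j-1] + A[1][j+1]);
            B[nlocal][j] = 0.25*(A[nlocal-1][j] + A[nlocal+1][j]
                                 + A[nlocal][j-1] + A[nlocal][j+1]);
        }
        /* copy B back into A (omitted) */
    }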
The next example shows a multiple-producer, single-consumer code. The
last process in the group consumes messages sent by the other
processes.
The example imposes a strict round-robin discipline, since
the consumer receives one message from each producer, in turn.
In some cases it is preferable to use a
``first-come-first-served'' discipline.
This is
achieved by using MPI_TEST, rather than MPI_WAIT,
as shown below.
Note that MPI can only offer an
approximation to first-come-first-served, since messages do not necessarily
arrive in the order they were sent.
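A hypothetical sketch of the consumer side of such a first-come-first-served approximation (buf, reqs, MAXPROD, MAXLEN, tag, comm, p and the termination flag done are assumptions).

    int i, flag, done = 0;
    MPI_Status  status;
    MPI_Request reqs[MAXPROD];
    for (i = 0; i < p-1; i++)              /* one pending receive per producer */
        MPI_Irecv(buf[i], MAXLEN, MPI_DOUBLE, i, tag, comm, &reqs[i]);
    while (!done) {
        for (i = 0; i < p-1; i++) {
            MPI_Test(&reqs[i], &flag, &status);
            if (flag) {
                /* consume buf[i] (may set done), then repost the receive    */
                MPI_Irecv(buf[i], MAXLEN, MPI_DOUBLE, i, tag, comm, &reqs[i]);
            }
        }
    }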
A request object is deallocated automatically by a successful call
to MPI_WAIT or
MPI_TEST. In addition, a request object can be explicitly
deallocated by using the following operation.
MPI_Request_free(MPI_Request *request)
MPI_REQUEST_FREE(REQUEST, IERROR)
    INTEGER REQUEST, IERROR
MPI_REQUEST_FREE marks the request object for deallocation
and sets request to MPI_REQUEST_NULL.
An ongoing communication associated with the request will be allowed
to complete. The request becomes unavailable after it is deallocated, as
the handle is reset to MPI_REQUEST_NULL. However, the request
object itself need not be deallocated immediately. If the communication
associated with this object is still ongoing, and the object is required
for its correct completion, then MPI will not deallocate the object
until after its completion.
MPI_REQUEST_FREE cannot be used for cancelling an ongoing
communication. For that purpose,
one should use MPI_CANCEL,
described in Section
.
One should use MPI_REQUEST_FREE
when the logic of the program is such that a nonblocking communication is
known to have terminated and, therefore, a call to MPI_WAIT or
MPI_TEST is superfluous.
For example, the program could
be such that a send command generates a reply from the receiver. If the reply
has been successfully received, then the send is known to be complete.
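A minimal, hypothetical sketch of that usage (query, reply, server, qtag, rtag, comm and the counts n and m are assumptions).

    MPI_Request req;
    MPI_Status  status;
    MPI_Isend(query, n, MPI_INT, server, qtag, comm, &req);
    MPI_Request_free(&req);        /* req is set to MPI_REQUEST_NULL         */
    /* query must not be reused yet: the send may still be in progress       */
    MPI_Recv(reply, m, MPI_INT, server, rtag, comm, &status);
    /* receipt of the reply implies the send has completed, so query may
       now be reused                                                         */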
The semantics of nonblocking communication is defined by suitably extending the
definitions in Section
.
Nonblocking communication operations are ordered according to the execution
order of the posting calls. The non-overtaking
requirement of Section
is extended to
nonblocking communication.
The order requirement specifies how post-send calls are matched to
post-receive calls.
There are no restrictions
on the order in which operations complete. Consider the code in Example
.
Since the completion of a receive can take an arbitrary amount of time,
there is no way to infer that the receive operation completed, short of
executing a complete-receive call. On the other hand, the completion of a
send operation can be inferred indirectly from the completion of a matching
receive.
A communication is enabled once a send and a matching receive have been
posted by two processes. The progress rule requires that once a communication
is enabled, then either the send or the receive will proceed to completion (they
might not both complete as the send might be matched by another receive or the
receive might be matched by another send).
Thus, a call to MPI_WAIT that completes a receive
will eventually return if a matching send has been started, unless
the send is satisfied by another receive. In particular, if the matching send
is nonblocking,
then the receive completes even if no complete-send call is made on the
sender side.
Similarly, a call to MPI_WAIT that
completes a send eventually
returns if a matching receive has been started, unless
the receive is satisfied by another send, and even if no complete-receive
call is made on the receiving side.
If a call to MPI_TEST that completes a receive is repeatedly made
with the same arguments, and
a matching send has been started, then the call will eventually
return flag = true, unless the send is satisfied by another
receive.
If a call to MPI_TEST that completes
a send is repeatedly made with the same arguments, and
a matching receive has been started, then the call will eventually
return flag = true, unless the receive is satisfied by another
send.
The statement made in Section
concerning fairness
applies to nonblocking communications. Namely, MPI does not guarantee
fairness.
The use of nonblocking communication alleviates the need for buffering,
since a sending process may progress after it has posted a send. Therefore,
the constraints of safe programming can be relaxed. However,
some amount of storage is consumed by a pending communication.
At a minimum, the
communication subsystem needs to copy the parameters of a posted send or
receive before the call returns. If this storage
is exhausted, then a call that posts a new communication will fail, since
post-send or post-receive calls are not allowed to block.
A high quality implementation will consume
only a fixed amount of storage per posted, nonblocking communication, thus
supporting a large number of pending communications. The failure of a parallel
program that exceeds the bounds on the number of pending nonblocking
communications, like the failure of a sequential
program that exceeds the bound on stack
size, should be seen as a pathological case, due either to a pathological
program or a pathological MPI implementation.
The approach illustrated in the last two examples can be used, in general, to
transform unsafe programs into safe ones. Assume that the program consists of
successive communication phases, where processes exchange data, followed by
computation phases. The communication phase should be rewritten as two
sub-phases, the first where each process posts all its communication, and the
second where the process waits for the completion of all its communications.
The order in which the communications are posted is not important, as long as
the total number of messages sent or received at any node is moderate.
This is further discussed in Section
.
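A hypothetical sketch of one such rewritten communication phase (nbr, sbuf, rbuf, len, tag, comm, MAXNBR and nneighbors are assumptions); the multiple-completion calls introduced next can replace the completion loop.

    MPI_Request reqs[2*MAXNBR];
    MPI_Status  status;
    int k, nreq = 0;
    /* posting sub-phase: no process can block another here                  */
    for (k = 0; k < nneighbors; k++) {
        MPI_Irecv(rbuf[k], len, MPI_DOUBLE, nbr[k], tag, comm, &reqs[nreq++]);
        MPI_Isend(sbuf[k], len, MPI_DOUBLE, nbr[k], tag, comm, &reqs[nreq++]);
    }
    /* completion sub-phase                                                  */
    for (k = 0; k < nreq; k++)
        MPI_Wait(&reqs[k], &status);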
It is convenient and efficient to complete in one call a list of multiple
pending communication operations, rather than
completing only one.
MPI_WAITANY or MPI_TESTANY are used to
complete one out of several operations.
MPI_WAITALL or MPI_TESTALL are
used to complete all operations in a list.
MPI_WAITSOME or MPI_TESTSOME are used to complete all
enabled operations in a list. The behavior of these functions is
described in this section and in Section
.
MPI_Waitany(int count, MPI_Request *array_of_requests, int *index, MPI_Status *status)
MPI_WAITANY(COUNT, ARRAY_OF_REQUESTS, INDEX, STATUS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR
MPI_WAITANY blocks until one of the communication operations associated
with requests in the array has completed.
If more than one operation can be completed, MPI_WAITANY
arbitrarily picks one and completes it.
MPI_WAITANY returns in index the array location
of the completed request and returns in status the status of the
completed communication.
The request object
is deallocated and the request handle is set to MPI_REQUEST_NULL.
MPI_WAITANY has non-local completion semantics.
MPI_Testany(int count, MPI_Request *array_of_requests, int *index, int *flag, MPI_Status *status)
MPI_TESTANY(COUNT, ARRAY_OF_REQUESTS, INDEX, FLAG, STATUS, IERROR)
    LOGICAL FLAG
MPI_TESTANY tests for completion of
the communication operations associated with requests in the array.
MPI_TESTANY has local completion semantics.
If an operation has completed, it returns flag = true,
returns in index the array location of the completed request,
and returns in status the status of the completed communication.
The request
is deallocated and the handle is set to MPI_REQUEST_NULL.
If no operation has completed, it returns flag = false, returns
MPI_UNDEFINED in index and status is
undefined.
The execution of MPI_Testany(count, array_of_requests,
&index, &flag, &status) has the same effect as the execution of
MPI_Test(&array_of_requests[i], &flag, &status),
for i=0, 1 ,..., count-1,
in some arbitrary order, until one call returns flag = true, or
all fail. In the former case, index is set to the last value of i,
and in the latter case, it is set to MPI_UNDEFINED.
MPI_Waitall(int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses)
MPI_WAITALL(COUNT, ARRAY_OF_REQUESTS, ARRAY_OF_STATUSES, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*)
MPI_WAITALL blocks until all communications, associated with
requests in the array, complete.
The i-th entry in
array_of_statuses is set to the return status of the
i-th operation.
All request objects are
deallocated and the corresponding handles in the array are set to
MPI_REQUEST_NULL.
MPI_WAITALL has non-local completion semantics.
The execution of MPI_Waitall(count, array_of_requests,
array_of_statuses) has the same effect as the execution of
MPI_Wait(&array_of_requests[i], &array_of_statuses[i]),
for i=0 ,..., count-1, in some arbitrary order.
When one or more of the communications completed by a
call to MPI_WAITALL fail,
MPI_WAITALL will return
the error code MPI_ERR_IN_STATUS and will set the
error field of each status to a specific error code. This code will be
MPI_SUCCESS, if the specific communication completed; it will
be another specific error code, if it failed;
or it will be MPI_PENDING if it has neither failed nor completed.
The function MPI_WAITALL will return MPI_SUCCESS if it
completed successfully, or will return another error code if it failed
for other reasons (such as invalid arguments).
MPI_WAITALL updates the error fields of the status objects
only when it returns MPI_ERR_IN_STATUS.
MPI_Testall(int count, MPI_Request *array_of_requests, int *flag, MPI_Status *array_of_statuses)
MPI_TESTALL(COUNT, ARRAY_OF_REQUESTS, FLAG, ARRAY_OF_STATUSES, IERROR)
    LOGICAL FLAG
MPI_TESTALL tests for completion of all
communications associated with requests in the array.
MPI_TESTALL has local completion semantics.
If all operations have completed, it returns flag = true,
sets the corresponding entries in status, deallocates
all requests and sets all request handles to MPI_REQUEST_NULL.
If not all operations have completed,
flag = false is returned, no request is modified
and the values of the status entries are undefined.
Errors that occurred during the execution of MPI_TESTALL
are handled in the same way as errors in MPI_WAITALL.
MPI_Waitsome(int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)
MPI_WAITSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_WAITSOME waits until at least one of the
communications, associated with
requests in the array, completes.
MPI_WAITSOME returns in outcount the
number of completed requests. The first
outcount locations of the array array_of_indices
are set to the indices of these operations.
The first outcount
locations of the array array_of_statuses
are set to the status for these completed operations. Each request
that completed is deallocated, and the
associated handle is set to MPI_REQUEST_NULL.
MPI_WAITSOME has non-local completion semantics.
If one or more of the communications completed by
MPI_WAITSOME fail then the arguments outcount,
array_of_indices and array_of_statuses will be
adjusted to indicate completion of all communications that have
succeeded or failed. The call will return the error code
MPI_ERR_IN_STATUS and the error field of each status
returned will be set to indicate success or to indicate the specific error
that occurred. The call will return MPI_SUCCESS if it
succeeded, and will return another error code if it failed
for other reasons (such as invalid arguments).
MPI_WAITSOME updates the error fields of the status
objects only when it returns MPI_ERR_IN_STATUS.
MPI_Testsome(int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses)
MPI_TESTSOME(INCOUNT, ARRAY_OF_REQUESTS, OUTCOUNT, ARRAY_OF_INDICES, ARRAY_OF_STATUSES, IERROR)
    INTEGER INCOUNT, ARRAY_OF_REQUESTS(*), OUTCOUNT, ARRAY_OF_INDICES(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
MPI_TESTSOME
behaves like MPI_WAITSOME, except that it returns
immediately. If no operation has completed it
returns outcount = 0.
MPI_TESTSOME has local completion semantics.
Errors that occur during the execution of MPI_TESTSOME are
handled as for MPI_WAITSOME.
Both MPI_WAITSOME and MPI_TESTSOME fulfill a
fairness requirement: if a request for a receive repeatedly
appears in a list of requests passed to MPI_WAITSOME or
MPI_TESTSOME, and a matching send has been posted, then the receive
will eventually complete, unless the send is satisfied by another receive.
A similar fairness requirement holds for send requests.
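A hypothetical sketch of a server loop built on MPI_WAITSOME (client, buf, reqs, indices, statuses, MAXCLIENTS, MAXLEN, nclients, tag and comm are assumptions); the fairness property above guarantees that no client is starved.

    int i, k, ndone;
    int indices[MAXCLIENTS];
    MPI_Status  statuses[MAXCLIENTS];
    MPI_Request reqs[MAXCLIENTS];
    for (i = 0; i < nclients; i++)         /* one pending receive per client */
        MPI_Irecv(buf[i], MAXLEN, MPI_INT, client[i], tag, comm, &reqs[i]);
    for (;;) {
        MPI_Waitsome(nclients, reqs, &ndone, indices, statuses);
        for (k = 0; k < ndone; k++) {
            i = indices[k];
            /* service the request held in buf[i], then repost the receive   */
            MPI_Irecv(buf[i], MAXLEN, MPI_INT, client[i], tag, comm, &reqs[i]);
        }
    }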
The standard includes the following probe and cancel operations.
MPI_PROBE and MPI_IPROBE
allow polling of incoming messages
without
actually receiving them. The application
can then decide how to receive them, based
on the information returned by the probe (in a
status variable). For example, the application might allocate memory for
the receive buffer according to the length of the probed message.
MPI_CANCEL allows pending communications to be canceled.
This is required for cleanup in some situations.
Suppose an application has posted nonblocking sends or receives and
then determines that these operations will not complete.
Posting a send or a receive ties up application resources (send or receive
buffers),
and a cancel allows these resources to be freed.
MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status)
MPI_IPROBE(SOURCE, TAG, COMM, FLAG, STATUS, IERROR)
    LOGICAL FLAG
MPI_IPROBE is a nonblocking operation that
returns flag = true
if there is a message that can be received
and that matches the message envelope specified by
source, tag, and comm.
The call matches the same message
that would have been received by a call to MPI_RECV
(with these arguments)
executed at the same point in the program, and returns in
status the same
value.
Otherwise, the call returns flag = false, and leaves status
undefined.
MPI_IPROBE has local completion semantics.
If MPI_IPROBE(source, tag, comm, flag, status) returns flag = true,
then the first, subsequent receive executed with the
communicator comm, and with the source and tag returned in
status,
will receive the message
that was matched by the probe.
The argument source can be
MPI_ANY_SOURCE, and tag can be
MPI_ANY_TAG, so that one can probe for messages from an arbitrary
source and/or with
an arbitrary tag. However, a specific communicator
must be provided in comm.
It is not necessary to receive a message immediately after it has been
probed for, and the
same message may be probed for several times before it is received.
MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_PROBE(SOURCE, TAG, COMM, STATUS, IERROR)
    INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
MPI_PROBE behaves like MPI_IPROBE except that it blocks
and returns only after a matching message has been found.
MPI_PROBE has non-local completion semantics.
The semantics of MPI_PROBE and MPI_IPROBE
guarantee progress, in the same way as a corresponding receive executed at the
same point in the program.
If a call to MPI_PROBE has been issued by a process, and a send that
matches the probe has been initiated by some process, then the call to
MPI_PROBE will return, unless the message is received by another,
concurrent receive operation, irrespective of other activities in the system.
Similarly, if a process busy waits with
MPI_IPROBE and a matching message has been issued,
then the call to
MPI_IPROBE will eventually return flag = true
unless the message is received by another concurrent receive
operation, irrespective of other activities in the system.
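A minimal, hypothetical sketch of the buffer-allocation use of probe mentioned above (tag and comm are assumptions); the message matched by the probe is the one received by the subsequent receive.

    MPI_Status status;
    int count;
    double *buf;
    MPI_Probe(MPI_ANY_SOURCE, tag, comm, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);   /* length of probed message */
    buf = (double *) malloc(count * sizeof(double));
    MPI_Recv(buf, count, MPI_DOUBLE, status.MPI_SOURCE, status.MPI_TAG,
             comm, &status);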
MPI_Cancel(MPI_Request *request)
MPI_CANCEL(REQUEST, IERROR)
    INTEGER REQUEST, IERROR
MPI_CANCEL marks for cancelation a pending,
nonblocking communication operation (send or receive). MPI_CANCEL
has local completion semantics.
It returns immediately, possibly before the
communication is actually canceled.
After this, it is still necessary to complete a
communication that has been marked
for cancelation, using a call to MPI_REQUEST_FREE,
MPI_WAIT, MPI_TEST or one of the functions in
Section
. If the communication was not cancelled
(that is, if the communication happened to start before the
cancelation could take effect), then
the completion call will complete the communication, as
usual. If the communication was
successfully cancelled,
then the completion call will deallocate the request object
and will return in status the information that the
communication was canceled.
The application should then call MPI_TEST_CANCELLED,
using status as input, to test
whether the communication was actually canceled.
Either the cancelation succeeds, and no communication occurs, or the
communication completes, and the cancelation fails.
If a send is marked for cancelation, then it must be the case that
either the send completes normally, and the
message sent is received at the destination process, or that the send is
successfully
canceled, and no part of the message is received at the
destination.
If a receive is marked for cancelation, then it must be the case that
either the receive completes normally, or that the receive is
successfully canceled, and no part of the receive buffer
is altered.
If a communication is marked for cancelation, then a completion
call for that communication is guaranteed to return, irrespective of
the activities of other processes. In this case, MPI_WAIT behaves as a
local function. Similarly, if MPI_TEST is
repeatedly called in a busy wait loop for a canceled communication,
then MPI_TEST will eventually succeed.
MPI_Test_cancelled(MPI_Status *status, int *flag)
MPI_TEST_CANCELLED(STATUS, FLAG, IERROR)
    LOGICAL FLAG
MPI_TEST_CANCELLED is used to test whether the communication
operation was actually canceled by MPI_CANCEL.
It returns flag = true if the communication associated with the
status object was canceled successfully. In this case, all
other fields of status are
undefined. It returns flag = false, otherwise.
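A minimal, hypothetical sketch of cancelling a pending receive (buf, count, src, tag and comm are assumptions).

    MPI_Request req;
    MPI_Status  status;
    int flag;
    MPI_Irecv(buf, count, MPI_INT, src, tag, comm, &req);
    /* ... it is later decided that this message is no longer wanted ...     */
    MPI_Cancel(&req);
    MPI_Wait(&req, &status);            /* completes the marked request      */
    MPI_Test_cancelled(&status, &flag);
    if (flag) {
        /* the receive was canceled and buf was not modified                 */
    } else {
        /* the message arrived before the cancelation could take effect      */
    }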
Often a communication with the same argument list is repeatedly
executed within the inner loop of a parallel computation. In such a
situation, it may be possible to optimize the communication by
binding the list of communication arguments to a persistent communication
request once and then, repeatedly, using
the request to initiate and complete messages. A
persistent request can be thought of as a
communication port or a ``half-channel.''
It does not provide the full functionality of a conventional channel,
since there is no binding of the send port to the receive port. This
construct allows reduction of the overhead for communication
between the process and communication controller, but not of the overhead for
communication between one communication controller and another.
It is not necessary that messages sent with a persistent request be received
by a receive operation using a persistent request, or vice-versa.
Persistent communication requests are associated with nonblocking
send and receive operations.
A persistent communication request is created using the following
functions. They involve no communication and thus have local
completion semantics.
MPI_Send_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_SEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_SEND_INIT
creates a persistent communication request
for a standard-mode, nonblocking send operation, and binds to it all the
arguments of a send operation.
MPI_Recv_init(void* buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
MPI_RECV_INIT(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_RECV_INIT
creates a persistent communication request
for a nonblocking receive operation. The
argument buf is marked as OUT
because the application gives permission to write on the receive buffer.
Persistent communication requests are created by the preceding functions,
but they are, so far, inactive. They are activated, and the associated
communication operations started, by MPI_START
or MPI_STARTALL.
MPI_Start(MPI_Request *request)
MPI_START(REQUEST, IERROR)
    INTEGER REQUEST, IERROR
MPI_START activates request and initiates the
associated communication.
Since all persistent requests are associated with nonblocking
communications, MPI_START has local completion semantics.
The semantics of communications done
with persistent requests are identical to the corresponding
operations without persistent requests.
That is,
a call to MPI_START with a
request created by MPI_SEND_INIT
starts a
communication in the same manner as a call to MPI_ISEND;
a call to MPI_START with a request created by
MPI_RECV_INIT starts a communication in the same manner as
a call to MPI_IRECV.
A send operation initiated with MPI_START can be matched with
any receive operation (including MPI_PROBE)
and a receive operation initiated
with MPI_START can receive messages generated by any send
operation.
MPI_Startall(int count, MPI_Request *array_of_requests)
MPI_STARTALL(COUNT, ARRAY_OF_REQUESTS, IERROR)
    INTEGER COUNT, ARRAY_OF_REQUESTS(*), IERROR
MPI_STARTALL
starts all communications associated with persistent requests in
array_of_requests. A call to
MPI_STARTALL(count, array_of_requests) has the
same effect as calls to
MPI_START(array_of_requests[i]),
executed for i=0 ,..., count-1, in some arbitrary order.
A communication started with a call to MPI_START or
MPI_STARTALL is
completed by a call to MPI_WAIT, MPI_TEST, or
one of the other completion functions described in
Section
. The persistent request becomes inactive after
the completion of such a call, but it is not deallocated
and it can be re-activated by another MPI_START or
MPI_STARTALL.
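A hypothetical sketch of the Jacobi halo exchange recast with persistent requests (same assumed names as in the earlier sketches); it follows the create, repeatedly start and complete, then free sequence discussed below.

    MPI_Request reqs[4];
    MPI_Status  stats[4];
    MPI_Recv_init(&A[0][0],        N, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Recv_init(&A[nlocal+1][0], N, MPI_DOUBLE, right, 0, comm, &reqs[1]);
    MPI_Send_init(&A[1][0],        N, MPI_DOUBLE, left,  0, comm, &reqs[2]);
    MPI_Send_init(&A[nlocal][0],   N, MPI_DOUBLE, right, 0, comm, &reqs[3]);
    for (iter = 0; iter < niter; iter++) {
        MPI_Startall(4, reqs);
        /* computation that does not touch the communication buffers         */
        MPI_Waitall(4, reqs, stats);    /* requests become inactive here     */
        /* update using the freshly received overlap rows                    */
    }
    for (k = 0; k < 4; k++)
        MPI_Request_free(&reqs[k]);     /* free the (inactive) requests      */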
Persistent requests are explicitly deallocated by a call to
MPI_REQUEST_FREE (Section
).
The call to MPI_REQUEST_FREE can occur at any point in the program
after the persistent request was created. However, the request will be
deallocated only after it becomes inactive.
Active receive requests should not be freed. Otherwise, it will not be
possible to check that the receive has completed.
It is preferable to free requests when they are inactive. If this
rule is followed, then the functions
described in this section will be invoked
in a sequence of the form
Create (Start Complete)* Free,
where * indicates zero or more repetitions.
If the same communication request is used in several concurrent
threads, it is the user's responsibility to coordinate calls so that the
correct sequence is obeyed.
MPI_CANCEL can be used to cancel a communication that uses
a persistent request, in
the same way it is used for nonpersistent requests.
A successful cancelation cancels
the active communication, but does not deallocate the request. After the
call to MPI_CANCEL and the subsequent call to MPI_WAIT or
MPI_TEST (or other completion function), the
request becomes inactive and
can be activated for a new communication.
Normally, an invalid handle to an MPI object is not a valid argument for
a call that expects an object. There is one exception to this rule:
communication-complete calls can be passed request handles with value
MPI_REQUEST_NULL.
A communication complete call with such an argument is a
``no-op'': the null handles are ignored. The same rule applies to
persistent handles that are not associated with an active
communication operation.
We shall use the following terminology. A null request handle is
a handle with value MPI_REQUEST_NULL. A handle to a
persistent request is inactive if the request is not currently
associated with an ongoing communication. A handle is active,
if it is neither null nor inactive.
An empty status is a status
that is set to tag = MPI_ANY_TAG, source =
MPI_ANY_SOURCE, and is also internally configured so that calls to
MPI_GET_COUNT and MPI_GET_ELEMENTS return
count = 0. We set a status variable to empty in cases
when the value returned is not significant. Status is set this way to
prevent errors due to access of stale information.
A call to MPI_WAIT with a null or inactive request
argument returns immediately with an empty status.
A call to MPI_TEST with a null or inactive request
argument returns immediately with flag = true and an empty status.
The list of requests passed to MPI_WAITANY may contain null or
inactive requests. If some of the requests are active, then the call returns
when an active request has completed. If all the requests in the list
are null or inactive then the call returns immediately, with index =
MPI_UNDEFINED and an empty status.
The list of requests passed to MPI_TESTANY may contain null or
inactive requests. The call returns flag = false if there are
active requests in the list, and none have completed. It returns
flag = true if an active request has completed, or if all the requests
in the list are null or inactive. In the latter case, it returns
index = MPI_UNDEFINED and an empty status.
The list of requests passed to MPI_WAITALL may contain null or
inactive requests. The call returns as soon as all active requests
have completed. The call sets to empty each status associated with a
null or inactive request.
The list of requests passed to MPI_TESTALL may contain null or
inactive requests. The call returns flag = true if
all active requests have completed. In this case, the call sets to
empty each status associated with a null or inactive request.
Otherwise, the call returns flag = false.
The list of requests passed to MPI_WAITSOME may contain null or
inactive requests. If the list contains active requests, then the call
returns when some of the active requests have completed. If all requests were
null or inactive,
then the call returns immediately, with outcount = MPI_UNDEFINED.
The list of requests passed to MPI_TESTSOME may contain null or
inactive requests. If the list contains active requests and some have
completed, then the call returns in outcount the number of completed
requests. If it contains active requests, and none have completed, then it
returns outcount = 0. If the list contains no active requests, then it
returns outcount = MPI_UNDEFINED.
In all these cases, null or inactive request handles are not modified
by the call.
The send call described in Section
used the standard communication mode. In this mode,
it is up to MPI to decide whether outgoing
messages will be buffered. MPI may
buffer outgoing messages. In such a case, the send call may complete
before a matching receive is invoked. On the other hand, buffer space may be
unavailable, or MPI may choose not to buffer
outgoing messages, for performance reasons. In this case,
the send call will not complete until a matching receive has been posted, and
the data has been moved to the receiver. (A blocking send completes when the
call returns; a nonblocking send completes when the matching Wait or Test call
returns successfully.)
Thus, a send in standard mode can be started whether or not a
matching receive has been posted. It may complete before a matching receive
is posted. The
standard-mode send has non-local completion semantics, since successful
completion of the send
operation may depend on the occurrence of a matching receive.
A buffered-mode send operation can be started whether or not a
matching receive has been posted.
It may complete before a matching receive is posted.
Buffered-mode send has local completion semantics: its
completion does not depend on the occurrence of a matching receive.
In order to complete the operation, it may be necessary to buffer the
outgoing message locally. For that purpose, buffer
space is provided by the application (Section
).
An error will occur if a buffered-mode send is called and
there is insufficient buffer space.
The buffer space occupied by the message is freed when the message is
transferred to its destination or when the buffered send is cancelled.
A synchronous-mode send can be started whether or
not a matching receive was posted. However, the send will complete
successfully only if a matching receive is posted, and the
receive operation has started to receive the message sent by the
synchronous send.
Thus, the completion of a synchronous send not only indicates that the
send buffer can be reused, but also indicates that the receiver has
reached a certain point in its execution, namely that it has started
executing the matching receive. Synchronous mode provides
synchronous communication semantics: a communication does not complete
at either end before both processes rendezvous at the
communication. A synchronous-mode send has non-local completion semantics.
A ready-mode send
may be started only if the matching receive has already been posted.
Otherwise, the operation is erroneous and its outcome is undefined.
On some systems, this allows the removal of a hand-shake
operation and results in improved
performance.
A ready-mode send has the same semantics as a standard-mode send.
In a correct program, therefore, a
ready-mode send could be replaced by a standard-mode send with no effect on the
behavior of the program other than performance.
Three additional send functions are provided for the three additional
communication
modes. The communication mode is indicated by a one letter prefix:
B for buffered,
S for synchronous, and
R for ready.
There is only one receive mode and it matches any of the send modes.
All send and receive operations use the buf, count,
datatype, source, dest, tag,
comm, status and request arguments
in the same way as
the standard-mode send and receive operations.
MPI_Bsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_BSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
MPI_BSEND performs a buffered-mode, blocking send.
MPI_Ssend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
MPI_SSEND performs a synchronous-mode, blocking send.
MPI_Rsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI_RSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
    <type> BUF(*)
MPI_RSEND performs a ready-mode, blocking send.
We use the same naming conventions as for blocking communication: a
prefix of B, S, or R is used for buffered,
synchronous or ready mode. In addition, a prefix of I (for
immediate) indicates that the call is nonblocking.
There is only one nonblocking receive call, MPI_IRECV.
Nonblocking send operations are completed with the same Wait and Test
calls as for standard-mode send.
MPI_Ibsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IBSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_IBSEND posts a buffered-mode, nonblocking send.
MPI_Issend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_ISSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_ISSEND posts a synchronous-mode, nonblocking send.
MPI_Irsend(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_IRSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_IRSEND posts a ready-mode, nonblocking send.
MPI_Bsend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_BSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_BSEND_INIT
creates a persistent communication request for a buffered-mode,
nonblocking send,
and binds to it all the
arguments of a send operation.
MPI_Ssend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_SSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_SSEND_INIT
creates a persistent communication object
for a synchronous-mode, nonblocking send,
and binds to it all the
arguments of a send operation.
MPI_Rsend_init(void* buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
MPI_RSEND_INIT(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
    <type> BUF(*)
MPI_RSEND_INIT
creates a persistent communication object
for a ready-mode, nonblocking send,
and binds to it all the
arguments of a send operation.
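A persistent request is typically created once and then started and completed repeatedly. The sketch below (illustrative names; not code from the standard) uses a persistent synchronous-mode send inside a loop and frees the request when it is no longer needed.

#include <mpi.h>

/* Sketch: create the persistent request once, start and complete it in each
   iteration, then free it. */
void repeated_send(double *buf, int n, int dest, MPI_Comm comm, int iters)
{
    MPI_Request req;
    MPI_Status status;
    int i;

    MPI_Ssend_init(buf, n, MPI_DOUBLE, dest, 0, comm, &req);
    for (i = 0; i < iters; i++) {
        /* ... refill buf ... */
        MPI_Start(&req);
        MPI_Wait(&req, &status);
    }
    MPI_Request_free(&req);
}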
An application must specify a buffer to be used for
buffering messages sent in buffered
mode. Buffering is done by the sender.
MPI_Buffer_attach( void* buffer, int size)
MPI_BUFFER_ATTACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
MPI_BUFFER_ATTACH
provides to MPI a buffer in the application's
memory to be used for buffering outgoing
messages. The buffer is used only by messages sent in buffered mode.
Only one buffer can be attached at a time (per process).
MPI_Buffer_detach( void* buffer, int* size)
MPI_BUFFER_DETACH(BUFFER, SIZE, IERROR)
    <type> BUFFER(*)
MPI_BUFFER_DETACH detaches the buffer currently associated
with MPI. The call returns the
address and the size of the detached buffer. This operation
will block until all messages currently in the buffer have been transmitted.
Upon return of
this function, the user may reuse or deallocate the space taken by the buffer.
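The following sketch (illustrative names; not code from the standard) shows the usual attach/send/detach sequence; the constant MPI_BSEND_OVERHEAD accounts for the bookkeeping space MPI may need per buffered message.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: attach a user-supplied buffer, send in buffered mode, then detach.
   The detach call blocks until the buffered message has been transmitted. */
void buffered_send(int *data, int n, int dest, MPI_Comm comm)
{
    int packsize, bufsize;
    void *buf, *oldbuf;

    MPI_Pack_size(n, MPI_INT, comm, &packsize);
    bufsize = packsize + MPI_BSEND_OVERHEAD;
    buf = malloc(bufsize);

    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(data, n, MPI_INT, dest, 0, comm);
    MPI_Buffer_detach(&oldbuf, &bufsize);
    free(oldbuf);
}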
Now the question arises: how is the attached buffer to be
used? The answer is that MPI must behave
as if
outgoing message
data were buffered by the sending process, in the specified buffer space,
using a circular, contiguous-space allocation policy.
We outline below a model implementation that defines this policy.
MPI may provide more buffering, and may use a better buffer allocation
algorithm
than described below. On the other hand, MPI may signal an error
whenever the
simple buffering allocator described below would run out of space.
The model implementation uses the packing and unpacking functions described in
Section
and the nonblocking communication functions
described in Section
.
We assume that a circular queue
of pending message entries (PME) is maintained. Each
entry contains a communication request
that identifies a pending nonblocking
send, a pointer to the next entry and the packed message data. The
entries are stored in successive locations in the buffer. Free space is
available between the queue tail and the queue head.
A buffered send call results in the execution of the following algorithm.
MPI does not specify:
There are many features that were considered and
not included in MPI. This happened for
a number of reasons: the time constraint
that was self-imposed by the MPI Forum in finishing the standard;
the feeling that not enough experience was available on some of these
topics; and the concern that additional features would delay the appearance of
implementations.
Features that are not included can always be offered as extensions
by specific
implementations.
Future versions of MPI will address some of these issues (see
Section
).
The MPI communication mechanisms introduced in the previous chapter
allow one to send or receive a sequence of identical elements that
are contiguous in memory. It is often desirable to send
data that is not homogeneous, such as a structure, or that is
not contiguous in memory, such as an array section. This allows one
to amortize the fixed overhead of sending and receiving a
message over the transmittal of many elements, even in these
more general circumstances. MPI provides two mechanisms to
achieve this.
The construction and use of derived datatypes is described in
Section
-
.
The use of Pack and Unpack functions is described in
Section
. It is often possible to achieve the same
data transfer using either mechanism. We discuss the pros and cons
of each approach at the end of this chapter.
All MPI communication functions take a
datatype argument. In the
simplest case this will be a primitive type, such as an integer or
floating-point number. An important and powerful generalization results
by allowing user-defined (or ``derived'') types wherever the primitive
types can occur. These are not ``types'' as far as the programming
language is concerned. They are only ``types'' in that MPI is made
aware of them through the use of type-constructor functions, and they
describe the layout, in memory, of sets of primitive types. Through
user-defined types, MPI supports the communication of
complex data structures such as array sections and structures containing
combinations of primitive datatypes. Example
shows how
a user-defined datatype is used to send the upper-triangular part
of a matrix, and Figure
diagrams the memory layout
represented by the user-defined datatype.
Derived datatypes
are constructed from basic datatypes using the constructors described
in Section
. The constructors can
be applied recursively.
A derived datatype is an opaque object that specifies two
things: a sequence of primitive datatypes, and a sequence of integer
(byte) displacements, one for each datatype.
The displacements are not required to be positive, distinct, or
in increasing order. Therefore, the order of items need not
coincide with their order in memory, and an item may appear more than
once.
We call such a pair of sequences (or sequence of pairs) a type map.
The sequence of primitive datatypes (displacements ignored) is the type signature of the datatype.
Let
$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$
be such a type map, where the $type_i$ are primitive types and the
$disp_i$ are displacements. Let
$Typesig = \{type_0, \ldots, type_{n-1}\}$
be the associated type signature.
This type map, together with a base address buf,
specifies a communication buffer: the communication buffer that consists of
$n$ entries, where the
$i$-th entry is at address $buf + disp_i$
and has type $type_i$.
A message assembled from a single type of this sort
will consist of $n$ values, of the types defined
by $Typesig$.
A handle to a derived datatype can appear as an argument in a send or
receive operation, instead of a primitive datatype argument. The
operation
MPI_SEND(buf, 1, datatype,...) will use the send buffer
defined by the base address buf and the derived datatype
associated with datatype. It will generate a message with the type
signature determined by the datatype argument.
MPI_RECV(buf, 1, datatype,...) will use the receive buffer
defined by the base address buf and the derived datatype
associated with datatype.
Derived datatypes can be used in all send and receive operations
including collective operations. We discuss, in
Section
, the case where the second argument
count has a value greater than one.
The primitive datatypes presented in
Section
are special cases of a derived datatype, and are predefined.
Thus, MPI_INT is a predefined handle to a datatype with type
map $\{(int, 0)\}$, with one entry of type int and
displacement zero. The other primitive datatypes are similar.
The extent of a datatype is defined to
be the span from the first byte to the last byte occupied by entries in this
datatype, rounded up to satisfy alignment requirements.
That is, if
$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
then
$lb(Typemap) = \min_j disp_j$ and
$ub(Typemap) = \max_j (disp_j + sizeof(type_j)) + \epsilon$,
where $lb(Typemap)$ is the lower bound, $ub(Typemap)$ is
the upper bound of the datatype, and
$extent(Typemap) = ub(Typemap) - lb(Typemap)$.
If $type_i$ requires alignment to a byte address that is a multiple
of $k_i$,
then $\epsilon$ is the least nonnegative increment needed to round
$extent(Typemap)$ to the next multiple of
$\max_i k_i$.
(The definition of extent is expanded in Section
.)
The following functions return information on datatypes.
MPI_Type_extent(MPI_Datatype datatype, MPI_Aint *extent)
MPI_TYPE_EXTENT(DATATYPE, EXTENT, IERROR)
    INTEGER DATATYPE, EXTENT, IERROR
MPI_TYPE_EXTENT
returns the extent of a datatype. In addition to its use with
derived datatypes, it can be used to inquire about the extent
of primitive datatypes. For example, MPI_TYPE_EXTENT(MPI_INT, extent)
will return in extent the size, in bytes, of an int - the same
value that would be returned by the C call sizeof(int).
MPI_Type_size(MPI_Datatype datatype, int *size)
MPI_TYPE_SIZE(DATATYPE, SIZE, IERROR)
    INTEGER DATATYPE, SIZE, IERROR
MPI_TYPE_SIZE returns the total size, in bytes, of the entries in
the type signature
associated with datatype; that is, the total size of
the data in a message that would be created with this datatype. Entries that
occur multiple times in the datatype are counted with their multiplicity.
For primitive datatypes, this function returns the same information as
MPI_TYPE_EXTENT.
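The difference between extent and size is easiest to see on a strided type. In the sketch below (illustrative; a 4-byte int is assumed), a vector of 3 single-int blocks with stride 4 has size 3*sizeof(int) = 12 bytes but extent 9*sizeof(int) = 36 bytes, since the extent spans from the first to the last int.

#include <mpi.h>
#include <stdio.h>

/* Sketch: compare the extent and the size of a strided vector type. */
void show_extent_and_size(void)
{
    MPI_Datatype vec;
    MPI_Aint extent;
    int size;

    MPI_Type_vector(3, 1, 4, MPI_INT, &vec);
    MPI_Type_extent(vec, &extent);      /* spans first to last entry */
    MPI_Type_size(vec, &size);          /* bytes of actual data      */
    printf("extent = %ld, size = %d\n", (long)extent, size);
    MPI_Type_free(&vec);
}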
This section presents the MPI functions for constructing derived
datatypes. The functions are presented in an order from
simplest to most complex.
MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_CONTIGUOUS(COUNT, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_CONTIGUOUS is the simplest datatype constructor.
It constructs a typemap consisting of the replication of a
datatype into contiguous locations.
The argument newtype is the datatype obtained by concatenating
count copies of
oldtype. Concatenation is defined using extent(oldtype)
as the size of the concatenated copies.
The action of the Contiguous
constructor is represented schematically in Figure
.
In general,
assume that the type map of oldtype is
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
with extent $ex$.
Then newtype has a type map with $count \cdot n$ entries, defined by:
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1}),
(type_0, disp_0 + ex), \ldots, (type_{n-1}, disp_{n-1} + ex),
\ldots,
(type_0, disp_0 + ex \cdot (count-1)), \ldots,
(type_{n-1}, disp_{n-1} + ex \cdot (count-1))\}$.
MPI_Type_vector(int count, int blocklength, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_VECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_VECTOR is a constructor that
allows replication of a datatype
into locations that consist of equally spaced blocks. Each block
is obtained by concatenating the same number of copies of the old datatype.
The spacing between blocks is a multiple of the extent of the old datatype.
The action of the Vector
constructor is represented schematically in
Figure
.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
with extent $ex$. Let bl be the blocklength.
The new datatype has a type map with $count \cdot bl \cdot n$ entries:
for each block index $i = 0, \ldots, count-1$ and each copy
$j = 0, \ldots, bl-1$ within the block, it contains the entries
$(type_k, disp_k + (i \cdot stride + j) \cdot ex)$, $k = 0, \ldots, n-1$.
A call to MPI_TYPE_CONTIGUOUS(count, oldtype, newtype) is
equivalent to a call to
MPI_TYPE_VECTOR(count, 1, 1, oldtype, newtype), or to a call to
MPI_TYPE_VECTOR(1, count, num, oldtype, newtype),
with num arbitrary.
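As a sketch of a typical use (illustrative names; a 100-by-100 row-major C matrix is assumed), the Vector constructor can describe one column of the matrix: 100 blocks of one double each, spaced 100 elements apart.

#include <mpi.h>

/* Sketch: send column `col' of a row-major 100x100 matrix as one message. */
void send_column(double a[100][100], int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;

    MPI_Type_vector(100, 1, 100, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);
    MPI_Send(&a[0][col], 1, column, dest, 0, comm);
    MPI_Type_free(&column);
}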
The Vector type constructor assumes that the stride
between successive blocks is a multiple of the oldtype extent. This
avoids, most of the time, the need for computing stride in bytes.
Sometimes it is useful to relax this assumption and allow a stride which
consists of an arbitrary number of bytes.
The Hvector type constructor below achieves this purpose. The usage of both
Vector and Hvector is illustrated in
Examples
-
.
MPI_Type_hvector(int count, int blocklength, MPI_Aint stride, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HVECTOR(COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_HVECTOR is identical to
MPI_TYPE_VECTOR, except that stride is given in bytes,
rather than in elements.
(H stands for ``heterogeneous'').
The action of the Hvector
constructor is represented schematically in
Figure
.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
with extent $ex$. Let bl be the blocklength.
The new datatype has a type map with $count \cdot bl \cdot n$ entries:
for each block index $i = 0, \ldots, count-1$ and each copy
$j = 0, \ldots, bl-1$ within the block, it contains the entries
$(type_k, disp_k + i \cdot stride + j \cdot ex)$, $k = 0, \ldots, n-1$,
where stride is now measured in bytes.
The Indexed constructor allows one to specify a noncontiguous data layout where
displacements between successive blocks need not be equal. This
allows one to
gather arbitrary entries from an array and send them in one message, or
receive one message and scatter the received entries into arbitrary
locations in an array.
MPI_Type_indexed(int count, int *array_of_blocklengths, int *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_INDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_INDEXED allows
replication of an old datatype into a sequence of blocks (each block is
a concatenation of the old datatype), where
each block can contain a different number of copies of oldtype
and have a different
displacement. All block displacements are measured in units of the
oldtype extent.
The action of the Indexed
constructor is represented schematically in
Figure
.
In general,
assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
with extent ex.
Let B be the array_of_blocklengths argument and
D be the
array_of_displacements argument. The new datatype
has a type map with $n \cdot (B[0] + \cdots + B[count-1])$ entries:
for each block $i = 0, \ldots, count-1$ and each copy
$j = 0, \ldots, B[i]-1$ within the block, it contains the entries
$(type_k, disp_k + (D[i] + j) \cdot ex)$, $k = 0, \ldots, n-1$.
A call to MPI_TYPE_VECTOR(count, blocklength, stride, oldtype,
newtype) is equivalent to a call to
MPI_TYPE_INDEXED(count, B, D, oldtype, newtype) where
$D[j] = j \cdot stride$, $j = 0, \ldots, count-1$, and
$B[j] = blocklength$, $j = 0, \ldots, count-1$.
The use of the MPI_TYPE_INDEXED function was illustrated in
Example
, on page
; the function was used
to transfer the upper triangular part of a square matrix.
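A sketch of that idea follows (it is not the book's example code; the names and the bound n <= 100 are illustrative): row i of a row-major n-by-n matrix contributes a block of n-i elements starting at offset i*n+i, measured in units of the oldtype extent.

#include <mpi.h>

/* Sketch: send the upper-triangular part of a row-major n x n matrix. */
void send_upper_triangle(double *a, int n, int dest, MPI_Comm comm)
{
    int blocklen[100], disp[100], i;
    MPI_Datatype upper;

    for (i = 0; i < n; i++) {
        blocklen[i] = n - i;
        disp[i] = i * n + i;            /* in units of the double extent */
    }
    MPI_Type_indexed(n, blocklen, disp, MPI_DOUBLE, &upper);
    MPI_Type_commit(&upper);
    MPI_Send(a, 1, upper, dest, 0, comm);
    MPI_Type_free(&upper);
}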
As with the Vector and Hvector constructors, it is usually convenient to measure
displacements in multiples of the extent of the oldtype, but sometimes necessary
to allow for arbitrary displacements. The Hindexed constructor satisfies the
latter need.
MPI_Type_hindexed(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype oldtype, MPI_Datatype *newtype)
MPI_TYPE_HINDEXED(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, OLDTYPE, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), OLDTYPE, NEWTYPE, IERROR
MPI_TYPE_HINDEXED is identical to
MPI_TYPE_INDEXED, except that block displacements in
array_of_displacements are specified in
bytes, rather than in multiples of the oldtype extent.
The action of the Hindexed
constructor is represented schematically in
Figure
.
In general, assume that oldtype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
with extent ex.
Let B be the array_of_blocklengths argument and
D be the
array_of_displacements argument. The new datatype
has a type map with $n \cdot (B[0] + \cdots + B[count-1])$ entries:
for each block $i = 0, \ldots, count-1$ and each copy
$j = 0, \ldots, B[i]-1$ within the block, it contains the entries
$(type_k, disp_k + D[i] + j \cdot ex)$, $k = 0, \ldots, n-1$,
where the displacements D[i] are now measured in bytes.
MPI_Type_struct(int count, int *array_of_blocklengths, MPI_Aint *array_of_displacements, MPI_Datatype *array_of_types, MPI_Datatype *newtype)
MPI_TYPE_STRUCT(COUNT, ARRAY_OF_BLOCKLENGTHS, ARRAY_OF_DISPLACEMENTS, ARRAY_OF_TYPES, NEWTYPE, IERROR)
    INTEGER COUNT, ARRAY_OF_BLOCKLENGTHS(*), ARRAY_OF_DISPLACEMENTS(*), ARRAY_OF_TYPES(*), NEWTYPE, IERROR
MPI_TYPE_STRUCT is the most general type constructor.
It further generalizes MPI_TYPE_HINDEXED
in that it allows each block to consist of replications of
different datatypes.
The intent is to allow descriptions of arrays of structures, as a single
datatype.
The action of the Struct
constructor is represented schematically in
Figure
.
In general,
let T be the array_of_types argument, where T[i]
is a handle to a datatype with type map $Typemap_i$ and extent $ex_i$.
Let
B be the array_of_blocklengths argument and D be
the array_of_displacements argument.
Let c be the
count argument.
Then the new datatype has a type map with
$\sum_{i=0}^{c-1} B[i] \cdot n_i$ entries, where $n_i$ is the number of
entries of $Typemap_i$: for each block $i = 0, \ldots, c-1$ and each copy
$j = 0, \ldots, B[i]-1$ within the block, it contains the entries of
$Typemap_i$, with displacements shifted by $D[i] + j \cdot ex_i$.
A call to MPI_TYPE_HINDEXED(count, B, D, oldtype, newtype) is
equivalent to a call to
MPI_TYPE_STRUCT(count, B, D, T, newtype), where each entry
of T is equal to oldtype.
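The sketch below (illustrative; the structure is a stand-in for the particle structure, Partstruct, referred to later in this chapter) builds a Struct datatype whose displacements are computed with MPI_ADDRESS and made relative to the start of the structure.

#include <mpi.h>

struct Partstruct { char cls; double d[6]; char b[7]; };

/* Sketch: describe one Partstruct with the Struct constructor. The resulting
   type must still be committed (see the commit discussion below) before use. */
void build_particle_type(struct Partstruct *p, MPI_Datatype *newtype)
{
    int blocklen[3] = { 1, 6, 7 };
    MPI_Datatype types[3] = { MPI_CHAR, MPI_DOUBLE, MPI_CHAR };
    MPI_Aint disp[3], base;

    MPI_Address(&p->cls, &disp[0]);
    MPI_Address(p->d, &disp[1]);
    MPI_Address(p->b, &disp[2]);
    MPI_Address(p, &base);
    disp[0] -= base;  disp[1] -= base;  disp[2] -= base;

    MPI_Type_struct(3, blocklen, disp, types, newtype);
}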
The original MPI standard was created by the Message Passing
Interface Forum (MPIF). The public release of version 1.0 of
MPI was made in June 1994. The MPIF began meeting again in March
1995. One of the first tasks undertaken was to make clarifications
and corrections to the MPI standard. The changes from version 1.0
to version 1.1 of the MPI standard were limited to ``corrections''
that were deemed urgent and necessary. This work was completed in
June 1995 and version 1.1 of the standard was released. This book
reflects the updated version 1.1 of the MPI standard.
A derived datatype must be committed before it can be used in a
communication. A committed datatype can continue to be used as an input argument in
datatype constructors (so that other datatypes can be derived from the
committed datatype).
There is no need to commit primitive datatypes.
MPI_Type_commit(MPI_Datatype *datatype)
MPI_TYPE_COMMIT(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR
MPI_TYPE_COMMIT
commits the datatype.
Commit should be thought of as a possible ``flattening'' or ``compilation''
of the formal description of a type map into an efficient representation.
Commit does not imply that the datatype is bound to the
current content of a communication buffer.
After a datatype
has been committed, it can be repeatedly reused to communicate
different data.
A datatype object is deallocated by a call to MPI_TYPE_FREE.
MPI_Type_free(MPI_Datatype *datatype)
MPI_TYPE_FREE(DATATYPE, IERROR)
    INTEGER DATATYPE, IERROR
MPI_TYPE_FREE
marks the datatype object associated with datatype for
deallocation and sets datatype to MPI_DATATYPE_NULL.
Any communication that is currently using this datatype will complete normally.
Derived datatypes that were defined from the freed datatype are not
affected.
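The life cycle is thus: construct, commit, use any number of times, free. A minimal sketch (illustrative names):

#include <mpi.h>

/* Sketch: a datatype is committed once, reused for communication, and freed. */
void send_five_ints(int *buf, int dest, MPI_Comm comm)
{
    MPI_Datatype five;

    MPI_Type_contiguous(5, MPI_INT, &five);
    MPI_Type_commit(&five);               /* required before first use        */
    MPI_Send(buf, 1, five, dest, 0, comm);
    MPI_Type_free(&five);                 /* handle becomes MPI_DATATYPE_NULL */
}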
A call of the form MPI_SEND(buf, count, datatype, ...), where
count > 1, is interpreted as if the call was passed a new datatype
which is the
concatenation of count copies of datatype.
Thus,
MPI_SEND(buf, count, datatype, dest, tag, comm) is equivalent to
defining a contiguous datatype newtype by
MPI_TYPE_CONTIGUOUS(count, datatype, newtype), committing it, and then
calling MPI_SEND(buf, 1, newtype, dest, tag, comm).
Similar statements apply to all other communication functions that have a
count and datatype argument.
Suppose that a send operation MPI_SEND(buf, count,
datatype, dest, tag, comm) is executed, where
datatype has type map
$\{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$
and extent $e$.
The send
operation sends $n \cdot count$ entries, where entry $i \cdot n + j$
is at location $addr_{i,j} = buf + e \cdot i + disp_j$
and has type $type_j$,
for $i = 0, \ldots, count-1$ and $j = 0, \ldots, n-1$.
The variable stored at address $addr_{i,j}$
in the calling program
should be of a type that matches $type_j$, where
type matching is defined as in Section
.
Similarly, suppose that a receive operation
MPI_RECV(buf, count, datatype, source, tag, comm, status) is
executed.
The receive operation receives up to $n \cdot count$
entries, where entry $i \cdot n + j$
is at location $buf + e \cdot i + disp_j$
and has type $type_j$.
Type matching is defined according to the type signature of
the corresponding datatypes, that is, the sequence of primitive type
components. Type matching does not depend on other aspects of the
datatype definition, such as the displacements (layout in memory) or the
intermediate types used to define the datatypes.
For sends,
a datatype may specify overlapping entries.
This is not true for receives. If the datatype used in a receive
operation specifies overlapping entries then
the call is erroneous.
If a message was received using a user-defined datatype, then a
subsequent call
to MPI_GET_COUNT(status, datatype, count)
(Section
) will
return the number of ``copies'' of
datatype received (count).
That is, if the receive operation was MPI_RECV(buff,
count, datatype,...) then MPI_GET_COUNT
may return any integer value
$k$, where $0 \le k \le count$.
If MPI_GET_COUNT returns $k$, then the number of primitive
elements received
is $n \cdot k$, where $n$
is the number of primitive elements in the type
map of datatype.
The received message need not fill an integral number of ``copies'' of
datatype.
If the number of primitive elements received is not a
multiple of $n$, that is, if the receive operation has not received an
integral number of datatype ``copies,'' then
MPI_GET_COUNT returns the value MPI_UNDEFINED.
The function MPI_GET_ELEMENTS below can be used to determine
the number of primitive elements received.
MPI_Get_elements(MPI_Status *status, MPI_Datatype datatype, int *count)
MPI_GET_ELEMENTS(STATUS, DATATYPE, COUNT, IERROR)
    INTEGER STATUS(MPI_STATUS_SIZE), DATATYPE, COUNT, IERROR
The function MPI_GET_ELEMENTS can also be used after a probe
to find the number of primitive datatype elements in the probed message.
Note that the two functions MPI_GET_COUNT and
MPI_GET_ELEMENTS return the same values when they are used
with primitive datatypes.
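The distinction is illustrated by the following sketch, modeled on the situation described above (names are illustrative): the receiver uses a datatype made of two floats, but only three floats arrive, so MPI_GET_COUNT returns MPI_UNDEFINED while MPI_GET_ELEMENTS returns 3.

#include <mpi.h>

/* Sketch: MPI_Get_count vs. MPI_Get_elements for a partially filled datatype. */
void count_vs_elements(int rank, MPI_Comm comm)
{
    float buf[10];
    MPI_Datatype pair;
    MPI_Status status;
    int count, elements;

    MPI_Type_contiguous(2, MPI_FLOAT, &pair);
    MPI_Type_commit(&pair);

    if (rank == 0) {
        MPI_Send(buf, 3, MPI_FLOAT, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(buf, 5, pair, 0, 0, comm, &status);
        MPI_Get_count(&status, pair, &count);        /* MPI_UNDEFINED */
        MPI_Get_elements(&status, pair, &elements);  /* 3             */
    }
    MPI_Type_free(&pair);
}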
As shown in Example
,
page
, one sometimes needs to be able to
find the displacement, in bytes, of a structure component relative to the
structure start. In C, one can use the sizeof operator to
find the size of C objects; and one will be tempted to use the
& operator to compute addresses and then displacements.
However, the C standard does not require that (int)&v be the
byte address of variable v: the mapping of pointers to
integers is implementation dependent. Some systems may have ``word''
pointers and ``byte'' pointers; other systems may have a segmented,
noncontiguous address space. Therefore, a portable mechanism has to be
provided by MPI to compute the ``address'' of a variable. Such a
mechanism is certainly needed in Fortran, which has no dereferencing
operator.
MPI_Address(void* location, MPI_Aint *address)
MPI_ADDRESS(LOCATION, ADDRESS, IERROR)
    <type> LOCATION(*)
MPI_ADDRESS is used to find
the address of a location in memory.
It returns the byte address of location.
Sometimes it is necessary to override the definition of
extent given in Section
.
Consider, for example, the code in Example
in
the previous section.
Assume that
a double occupies 8 bytes
and must be double-word aligned.
There will be 7 bytes of padding after the first field and one byte of
padding after the last field of the structure Partstruct, and
the structure will occupy 64 bytes.
If, on the other hand, a double can be word aligned
only, then there will be only 3 bytes of padding after the first
field, and Partstruct will occupy 60 bytes.
The MPI library will follow
the alignment rules used on the target systems so that the extent of datatype
Particletype equals the amount of storage occupied by
Partstruct. The catch is that different
alignment rules may be specified, on the same system,
using different compiler options. An even more difficult problem is that
some compilers allow the use of pragmas in order to specify different alignment
rules for different structures within the same program. (Many
architectures can correctly handle misaligned values, but with lower
performance; different alignment rules trade speed of access for storage
density.) The MPI library will assume the default alignment rules. However,
the user should be able to overrule this assumption if structures are packed
otherwise.
To allow this capability, MPI has
two additional ``pseudo-datatypes,'' MPI_LB and MPI_UB,
that can be used, respectively, to mark the lower bound or the upper
bound of a datatype. These pseudo-datatypes occupy no space
(the extent of MPI_LB and of MPI_UB is zero). They do not
affect the size or count of a datatype, and do not affect
the content of a message created with this datatype. However, they do
change the extent of a datatype and, therefore, affect the outcome
of a replication of this datatype by a datatype constructor.
In general, if
$Typemap = \{(type_0, disp_0), \ldots, (type_{n-1}, disp_{n-1})\}$,
then the lower bound of $Typemap$
is defined to be $\min_j disp_j$ if no entry has basic type MPI_LB, and
the minimum of the displacements of the MPI_LB entries otherwise.
Similarly,
the upper bound of $Typemap$
is defined to be $\max_j (disp_j + sizeof(type_j)) + \epsilon$ if no
entry has basic type MPI_UB, and the maximum of the displacements of the
MPI_UB entries otherwise.
And
$extent(Typemap) = ub(Typemap) - lb(Typemap)$.
If $type_i$
requires alignment to a byte address that is a multiple of $k_i$,
then $\epsilon$
is the least nonnegative increment needed to round
$extent(Typemap)$ to the next multiple of $\max_i k_i$.
The formal definitions given for the various datatype constructors
continue to apply, with the amended definition of extent.
Also, MPI_TYPE_EXTENT returns the above as its value for extent.
The two functions below can be used for finding the lower bound and
the upper bound of a datatype.
MPI_Type_lb(MPI_Datatype datatype, MPI_Aint* displacement)
MPI_TYPE_LB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR
MPI_TYPE_LB returns the lower bound of a datatype, in bytes,
relative to the datatype origin.
MPI_Type_ub(MPI_Datatype datatype, MPI_Aint* displacement)
MPI_TYPE_UB(DATATYPE, DISPLACEMENT, IERROR)
    INTEGER DATATYPE, DISPLACEMENT, IERROR
MPI_TYPE_UB returns the upper bound of a datatype, in bytes,
relative to the datatype origin.
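A common use of MPI_UB is to stretch the extent of a datatype so that successive copies are laid out with a chosen stride. The sketch below (illustrative; not code from the standard) sends every other element of an array by pairing one double at displacement 0 with an MPI_UB marker at displacement 2*sizeof(double).

#include <mpi.h>

/* Sketch: a type whose extent is two doubles but which contains one double,
   so that n/2 copies of it pick up a[0], a[2], a[4], ... */
void send_every_other(double *a, int n, int dest, MPI_Comm comm)
{
    int blocklen[2] = { 1, 1 };
    MPI_Aint disp[2] = { 0, 2 * sizeof(double) };
    MPI_Datatype types[2] = { MPI_DOUBLE, MPI_UB };
    MPI_Datatype strided;

    MPI_Type_struct(2, blocklen, disp, types, &strided);
    MPI_Type_commit(&strided);
    MPI_Send(a, n / 2, strided, dest, 0, comm);
    MPI_Type_free(&strided);
}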
Consider Example
on page
.
One computes the ``absolute address'' of the structure components, using calls
to MPI_ADDRESS, then subtracts the starting address of the array to
compute relative displacements. When the send operation is executed, the
starting address of the array is added back, in order to compute the send buffer
location. This superfluous arithmetic could be avoided if ``absolute''
addresses were used in the derived datatype, and ``address zero'' was
passed as the buffer argument in the send call.
MPI supports the use of such ``absolute'' addresses in derived datatypes.
The displacement arguments used in datatype constructors can be ``absolute
addresses'', i.e., addresses returned by calls to MPI_ADDRESS.
Address zero is indicated to communication functions by passing the constant
MPI_BOTTOM as the buffer argument. Unlike derived datatypes
with relative displacements, the use of ``absolute'' addresses
restricts the use to the specific structure for which it was created.
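A sketch of this usage follows (illustrative names; the portability caveats discussed next still apply): the displacements stored in the datatype are the addresses returned by MPI_ADDRESS themselves, and the send is issued from MPI_BOTTOM.

#include <mpi.h>

/* Sketch: a datatype built from absolute addresses, sent from MPI_BOTTOM. */
void send_two_vars(int *i, double *x, int dest, MPI_Comm comm)
{
    int blocklen[2] = { 1, 1 };
    MPI_Aint disp[2];
    MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
    MPI_Datatype absolute;

    MPI_Address(i, &disp[0]);      /* absolute addresses as displacements */
    MPI_Address(x, &disp[1]);
    MPI_Type_struct(2, blocklen, disp, types, &absolute);
    MPI_Type_commit(&absolute);
    MPI_Send(MPI_BOTTOM, 1, absolute, dest, 0, comm);
    MPI_Type_free(&absolute);
}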
The use of addresses and displacements in MPI is best understood in
the context of a flat address space. Then, the ``address'' of a location, as
computed by calls to MPI_ADDRESS, can be
the regular address of that location (or a shift of it), and integer
arithmetic
on MPI ``addresses'' yields the expected result. However, the use of a
flat address space is not mandated by C or Fortran. Another potential
source of problems is that Fortran INTEGERs may be too short
to store full addresses.
Variables belong
to the same sequential storage if they belong to the same
array,
to the same COMMON block in Fortran, or to the same structure in C.
Implementations may restrict the
use of addresses so that arithmetic on addresses is confined within
sequential storage.
Namely, in a communication call, either
Some existing communication libraries, such as PVM and Parmacs,
provide pack and unpack functions for sending noncontiguous data. In
these, the application explicitly packs data into a contiguous buffer
before sending it, and unpacks it from a contiguous buffer after
receiving it. Derived datatypes, described in the previous sections
of this chapter, allow one, in most cases, to avoid explicit packing
and unpacking. The application specifies the layout of the data to be
sent or received, and MPI directly accesses a noncontiguous buffer
when derived datatypes are used. The pack/unpack routines are
provided for compatibility with previous libraries. Also, they
provide some functionality that is not otherwise available in MPI.
For instance, a message can be received in several parts, where the
receive operation done on a later part may depend on the content of a
former part. Another use is that the availability of pack and unpack
operations facilitates the development of additional communication
libraries layered on top of MPI.
MPI_Pack(void* inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)
MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
MPI_PACK
packs a message specified by inbuf, incount, datatype, comm
into the buffer
space specified by outbuf and outsize. The
input buffer can
be any communication buffer allowed in MPI_SEND. The output buffer
is a contiguous storage area containing outsize bytes, starting at
the address outbuf.
The input value of position is the first
position in the output buffer to be used for packing. The argument
position is
incremented by the size of the packed message so that it can be used
as input to a subsequent call to MPI_PACK.
The comm argument
is the communicator that will be subsequently used for sending the packed
message.
MPI_Unpack(void* inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)
MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
    <type> INBUF(*), OUTBUF(*)
MPI_UNPACK
unpacks a message into the receive buffer specified by
outbuf, outcount, datatype from the buffer
space specified by inbuf and insize. The output buffer can
be any communication buffer allowed in MPI_RECV. The input
buffer is a contiguous storage area containing insize bytes,
starting at address inbuf.
The input value of position is the position in
the input buffer where one wishes the unpacking to begin.
The output value of position is incremented
by the size of the packed message, so that
it can be used as input to a subsequent call
to MPI_UNPACK.
The argument
comm was the communicator used to receive the packed message.
The MPI_PACK/MPI_UNPACK calls relate to message passing as
the sprintf/sscanf calls in C relate to file I/O, or internal
Fortran files relate to external units. Basically, the
MPI_PACK function allows one to ``send'' a message into a
memory buffer; the MPI_UNPACK function allows one to
``receive'' a message from a memory buffer.
Several communication buffers can be successively packed into one
packing unit. This
is effected by several successive, related calls to MPI_PACK,
where the first
call provides position = 0, and each successive call inputs the value
of position that was output by the previous call, and the same values
for outbuf, outsize and comm. This packing unit
now contains
the equivalent information that would have been stored in a message by one send
call with a send buffer that is the ``concatenation'' of the individual send
buffers.
A packing unit must
be sent using type MPI_PACKED. Any point-to-point
or collective communication function can be used.
The message sent is identical to the message
that would be sent by a send operation with a
datatype argument describing the concatenation of the send
buffer(s) used in the Pack calls. The message
can be received with any datatype that matches this send datatype.
Any message
can be received in a point-to-point or collective communication
using the type MPI_PACKED. Such a message can then be
unpacked by calls to MPI_UNPACK.
The message
can be unpacked by several, successive calls to
MPI_UNPACK, where the first
call provides position = 0, and each successive call inputs the value
of position that was output by the previous call, and the same values
for inbuf, insize and comm.
MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm, int *size)
MPI_PACK_SIZE(INCOUNT, DATATYPE, COMM, SIZE, IERROR)
    INTEGER INCOUNT, DATATYPE, COMM, SIZE, IERROR
MPI_PACK_SIZE
allows the application to find out how much space is
needed to pack a message and, thus, manage space allocation for
buffers. The function
returns, in size, an upper bound on the increment in position
that would occur in a call to MPI_PACK with the same values
for incount, datatype, and comm.
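The following sketch (illustrative names and sizes; not code from the standard) packs an integer count followed by that many doubles into one packing unit sized with MPI_PACK_SIZE, sends it as MPI_PACKED, and unpacks it on the receiving side, where the count unpacked first determines how much data to unpack next.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: pack a count and then that many doubles into one packing unit. */
void pack_and_send(int n, double *vals, int dest, MPI_Comm comm)
{
    int s1, s2, bufsize, position = 0;
    char *buf;

    MPI_Pack_size(1, MPI_INT, comm, &s1);
    MPI_Pack_size(n, MPI_DOUBLE, comm, &s2);
    bufsize = s1 + s2;
    buf = malloc(bufsize);

    MPI_Pack(&n, 1, MPI_INT, buf, bufsize, &position, comm);
    MPI_Pack(vals, n, MPI_DOUBLE, buf, bufsize, &position, comm);
    MPI_Send(buf, position, MPI_PACKED, dest, 0, comm);
    free(buf);
}

/* Sketch: receive the packing unit and unpack it; the count unpacked first
   controls the second unpack. The fixed buffer is assumed large enough. */
void recv_and_unpack(double *vals, int src, MPI_Comm comm)
{
    char buf[4096];
    int n, position = 0;
    MPI_Status status;

    MPI_Recv(buf, sizeof(buf), MPI_PACKED, src, 0, comm, &status);
    MPI_Unpack(buf, sizeof(buf), &position, &n, 1, MPI_INT, comm);
    MPI_Unpack(buf, sizeof(buf), &position, vals, n, MPI_DOUBLE, comm);
}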
A comparison between Example
on page
and
Example
in the previous section is instructive.
First, programming
convenience. It is somewhat less tedious to pack the class zero particles
in the loop that locates them, rather than defining in this loop the
datatype that will later collect them. On the other hand, it would be
very tedious (and inefficient) to pack separately the components of
each structure entry in the array. Defining a datatype is more convenient
when this definition depends only on declarations; packing may be more
convenient when the communication buffer layout is data dependent.
Second, storage use. The packing code uses at least 56,000 bytes for the
pack buffer, i.e., up to 1000 copies of the structure (1 char, 6
doubles, and 7 chars is 56
bytes).
The derived datatype code uses 12,000
bytes for the three 1,000-entry integer
arrays used to define the derived datatype.
It also probably uses a similar amount of
storage for the internal datatype representation. The difference is
likely to be larger in realistic codes. The use of packing
requires additional storage for a copy
of the data, whereas the use of derived
datatypes requires additional storage for a description of the
data layout.
Finally, compute time. The packing code executes a function call for
each packed item whereas the derived datatype code executes only a fixed
number of function calls. The packing code is likely to require one
additional memory-to-memory copy of the data, as compared to the
derived-datatype code. One may expect, on most implementations, to
achieve better performance with the derived datatype code.
Both codes send the same size message, so that there is no
difference in
communication time.
However, if the buffer
described by the derived datatype is
not contiguous
in memory, it may take longer to access.
Example
above illustrates another advantage of
pack/unpack; namely the receiving process may use information in part
of an incoming message in order to decide how to handle subsequent
data in the message. In order to achieve the same outcome without
pack/unpack, one would have to send two messages: the first with the
list of indices, to be used to construct a derived datatype that is
then used to receive the particle entries sent in a second message.
The use of derived datatypes will often lead to improved performance:
data copying can be avoided, and information on data layout can be
reused, when the same communication buffer is reused. On the other
hand, the definition of derived datatypes for complex layouts can be
more tedious than explicit packing.
Derived datatypes should be used whenever data layout is defined by program
declarations (e.g., structures), or is regular (e.g., array sections).
Packing might be considered for complex, dynamic, data-dependent
layouts. Packing may result in more efficient code in situations
where the sender has to communicate to the receiver
information that affects the layout of the receive buffer.
Collective communications transmit data among all processes
in a group specified by an intracommunicator object.
One function, the barrier,
serves to synchronize processes without passing data.
MPI provides the following collective communication functions.
Figure
gives a pictorial representation of the
global communication functions. All these functions (broadcast
excepted) come in two variants: the simple variant, where all
communicated items are messages of the same size, and the ``vector''
variant, where each item can be of a different size. In addition, in
the simple variant, multiple items originating from the same process
or received at the same process are contiguous in memory; the vector
variant allows one to pick the distinct items from non-contiguous
locations.
Some of these functions, such as broadcast or gather, have a single
origin or a single receiving process. Such a process is called the
root.
Global communication functions basically come in
three patterns:
The syntax and semantics of the MPI collective functions were
designed to be consistent with point-to-point communications.
However, to keep the number of functions and their argument lists
to a reasonable level of complexity, the MPI committee made
collective functions more
restrictive than the point-to-point functions, in several ways. One
restriction is that, in contrast to point-to-point communication,
the amount of data
sent must exactly match the amount of data specified by the receiver.
A major simplification is that collective functions come in blocking
versions only. Though a standing joke at committee meetings concerned
the ``non-blocking barrier,'' such functions can be quite useful
and may be included in a future version of MPI.
Collective functions do not use a tag argument. Thus, within
each intragroup communication domain, collective calls are matched
strictly according to the order of execution.
A final simplification of collective functions concerns modes. Collective
functions come in only one mode, and this mode may be regarded as
analogous to the standard mode of point-to-point. Specifically, the
semantics are as follows. A collective function (on a given process)
can return as soon as its participation in the overall communication
is complete. As usual, the completion indicates that the caller is now
free to access and modify locations in the communication buffer(s).
It does not indicate that other processes have completed, or even started,
the operation. Thus, a collective communication may, or may not, have
the effect of synchronizing all calling processes. The barrier, of course,
is the exception to this statement.
This choice of semantics was made so as to allow a variety of implementations.
The user of MPI must keep these issues in mind. For example, even
though a particular implementation of MPI may provide a broadcast with the
side-effect of synchronization (the standard allows this), the standard
does not require this, and hence, any program that relies on the
synchronization will be non-portable. On the other hand, a correct and
portable program must allow a collective function to be synchronizing.
Though one should not rely on synchronization side-effects, one must
program so as to allow for it.
Though these issues and statements may seem unusually obscure, they are
merely a consequence of the desire of MPI to:
A collective operation is executed by having all processes in the group
call the communication routine, with matching arguments.
The syntax and semantics of the collective operations are
defined to be consistent with the syntax and semantics of the
point-to-point operations. Thus, user-defined datatypes are allowed
and must match between sending and receiving processes as specified
in Chapter
.
One of the key arguments is an intracommunicator that defines the group
of participating processes and provides
a communication domain for the operation.
In calls where a root process is defined,
some arguments are specified as
``significant only at root,'' and are ignored for all
participants except the root.
The reader is referred to Chapter
for information concerning communication buffers and type matching rules,
to Chapter
for
user-defined datatypes, and to
Chapter
for information on how to define groups and
create communicators.
The type-matching conditions for the collective operations are more
strict than the corresponding conditions between sender and receiver
in point-to-point. Namely, for collective operations,
the amount of data sent must exactly
match the amount of data specified by the receiver.
Distinct type maps (the layout in memory, see Section
)
between sender and receiver are still allowed.
Collective communication calls may use the same
communicators as point-to-point communication; MPI guarantees that
messages generated on behalf of collective communication calls will not
be confused with messages generated by point-to-point communication.
A more detailed discussion of correct use of collective
routines is found in Section
.
The key concept of the collective functions is to have a ``group''
of participating processes. The routines do not have a group
identifier as an explicit argument. Instead, there is a communicator
argument. For the purposes of this chapter, a communicator can be
thought of as a group identifier linked with a communication
domain. An intercommunicator, that is, a communicator that spans two
groups, is not allowed as an argument to a collective function.
MPI_Barrier(MPI_Comm comm)
MPI_BARRIER(COMM, IERROR)
    INTEGER COMM, IERROR
MPI_BARRIER blocks the caller until all group members have called
it. The call returns at any process only after all group members have
entered the call.
MPI_Bcast(void* buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
    <type> BUFFER(*)
MPI_BCAST broadcasts a message from
the process with rank root to all processes
of the group.
The argument root must have identical values on all
processes, and comm must represent the same intragroup
communication domain.
On return, the contents of the root's communication buffer
have been copied to all processes.
General, derived datatypes are allowed for datatype.
The type signature of count and datatype on any process must
be equal to the type signature of count and datatype at the root.
This implies that the amount of data sent must be equal to the amount received,
pairwise between each process and the root.
MPI_BCAST and all other data-movement collective routines
make this restriction.
Distinct type maps between sender and receiver are still allowed.
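A minimal sketch of a broadcast (illustrative names): every process, the root included, makes the same call with the same count, datatype, and root.

#include <mpi.h>

/* Sketch: after the call, every process holds the root's copy of params. */
void broadcast_params(double *params, int n, MPI_Comm comm)
{
    int root = 0;               /* must be the same value on all processes */
    MPI_Bcast(params, n, MPI_DOUBLE, root, comm);
}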
MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
    <type> SENDBUF(*), RECVBUF(*)
Each process (root process included) sends the contents of its send
buffer to the root process. The root process receives the messages and
stores them in rank order.
The outcome is as if each of the n processes in the group
(including the root process) had executed a call to
MPI_Send(sendbuf, sendcount, sendtype, root, ...),
and the
root had executed n calls to
MPI_Recv(recvbuf + i*recvcount*extent(recvtype), recvcount, recvtype, i, ...),
where extent(recvtype) is the type extent obtained from a call to
MPI_Type_extent().
An alternative description is that the n messages sent by the
processes in the group are concatenated in rank order, and the
resulting message is received by the root as if by a call to
MPI_RECV(recvbuf, recvcount*n, recvtype, ...).
The receive buffer is ignored for all non-root processes.
General, derived datatypes are allowed for both sendtype
and recvtype.
The type signature of sendcount and sendtype on process i
must be equal to the type signature of
recvcount and recvtype at the root.
This implies that the amount of data sent must be equal to the
amount of data received, pairwise between each process and the root.
Distinct type maps between sender and receiver are still allowed.
All arguments to the function are significant on process root,
while on other processes, only arguments sendbuf, sendcount,
sendtype, root, and comm are significant.
The argument root
must have identical values on all processes and comm must
represent the same intragroup communication domain.
The specification of counts and types
should not cause any location on the root to be written more than
once. Such a call is erroneous.
Note that the recvcount argument at the root indicates
the number of items it receives from each process, not the total number
of items it receives.
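The sketch below (illustrative names and sizes) gathers 100 ints from every process; note that recvcount is 100, the per-process count, and that the receive buffer matters only at the root.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: each process contributes 100 ints; the root stores them in rank order. */
void gather_blocks(int *sendbuf, int root, MPI_Comm comm)
{
    int rank, gsize;
    int *recvbuf = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &gsize);
    if (rank == root)
        recvbuf = malloc(gsize * 100 * sizeof(int));

    MPI_Gather(sendbuf, 100, MPI_INT, recvbuf, 100, MPI_INT, root, comm);

    if (rank == root)
        free(recvbuf);
}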
Scientific and Engineering ComputationJanusz Kowalik, Editor
Data-Parallel Programming on MIMD Computersby Philip J. Hatcher and Michael J. Quinn, 1991
Unstructured Scientific Computation on Scalable Multiprocessorsedited by Piyush Mehrotra, Joel Saltz, and Robert Voigt, 1991
Parallel Computational Fluid Dynamics: Implementations and Resultsedited by Horst D. Simon, 1992
Enterprise Integration Modeling: Proceedings of the First International
Conferenceedited by Charles J. Petrie, Jr., 1992
The High Performance Fortran Handbookby Charles H. Koelbel,
David B. Loveman,
Robert S. Schreiber,
Guy L. Steele Jr. and
Mary E. Zosel, 1993
Using MPI: Portable Parallel Programming with the Message-Passing
Interfaceby William Gropp, Ewing Lusk, and Anthony Skjellum, 1994
PVM: Parallel Virtual Machine-A User's Guide and Tutorial for
Network Parallel Computingby Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Bob Manchek,
and Vaidy Sunderam, 1994
Enabling Technologies for Petaflops Computingby Thomas Sterling, Paul Messina, and Paul H. Smith
An Introduction to High-Performance Scientific Computingby Lloyd D. Fosdick, Elizabeth R. Jessup, Carolyn J.C. Schauble,
and Gitta Domik
Practical Parallel Programmingby Gregory V. Wilson
MPI: The Complete Referenceby Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and
Jack Dongarra
1996 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced
in any form by any electronic or mechanical means (including
photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
Parts of this book came from, ``MPI: A Message-Passing Interface
Standard'' by the Message Passing Interface Forum. That document is
copyrighted by the University of Tennessee. These sections were
copied by permission of the University of Tennessee.
This book was set in LaTeX by the authors and was
printed and bound in the United States of America.
Library of Congress Cataloging-in-Publication Data
This book is also available in postscript and html forms over the Internet.
To retrieve the postscript file you can use one of the following methods:
To view the html file use the URL:
The world of modern computing potentially offers many helpful methods
and tools to scientists and engineers, but the fast pace of change in
computer hardware, software, and algorithms often makes practical use of
the newest computing technology difficult. The Scientific and
Engineering Computation series focuses on rapid advances in computing
technologies and attempts to facilitate transferring these technologies
to applications in science and engineering. It will include books on
theories, methods, and original applications in such areas as
parallelism, large-scale simulations, time-critical computing,
computer-aided design and engineering, use of computers in
manufacturing, visualization of scientific data, and human-machine
interface technology.
The series will help scientists and engineers to understand the current
world of advanced computation and to anticipate future developments that
will impact their computing environments and open up new capabilities
and modes of computation.
This volume presents a software package for developing
parallel programs executable on networked Unix computers.
The tool called Parallel Virtual Machine (PVM) allows a
heterogeneous collection of workstations and
supercomputers to function as a single high-performance
parallel machine. PVM is portable and runs on a wide
variety of modern platforms.
It has been well accepted by the global computing
community and used successfully for solving
large-scale problems in science, industry, and business.
Janusz S. Kowalik
Preface
In this book we describe the
Parallel Virtual Machine
(PVM) system and how to develop programs
using PVM.
PVM is a software system that permits a heterogeneous collection
of Unix computers networked together to be viewed
by a user's program as a single
parallel computer.
PVM is the mainstay of the Heterogeneous Network Computing
research project, a collaborative venture between
Oak Ridge National Laboratory,
the University of Tennessee,
Emory University,
and
Carnegie Mellon University.
The PVM system has evolved in the past several
years into a viable technology for distributed and
parallel processing in a variety of disciplines. PVM supports a
straightforward but functionally complete message-passing model.
PVM is designed to link computing resources and provide users
with a parallel platform for running their computer applications,
irrespective of the number of different computers
they use and where the computers are located.
When
PVM is correctly installed, it is capable of harnessing
the combined resources of typically
heterogeneous networked computing platforms to deliver high levels
of performance and functionality.
In this book, we describe the
architecture of the PVM system and discuss its computing model;
the programming interface it supports;
auxiliary facilities for process groups;
the use of PVM on highly parallel systems
such as the Intel Paragon, Cray T3D, and Thinking Machines CM-5;
and some of the internal
implementation techniques employed. Performance
issues, dealing primarily with communication overheads, are
analyzed, and recent findings as well as enhancements
are presented.
To demonstrate the viability of PVM for large-scale scientific
supercomputing, we also provide some example programs.
This book is not a textbook; rather, it is meant to provide a fast entrance
to the world of heterogeneous network computing.
We intend this book to be used by two groups of readers:
students and researchers working with networks of computers.
As such, we hope this book can serve both as a reference
and as a supplement to a teaching text on aspects of network computing.
This guide will familiarize readers with the basics of PVM and
the concepts used in programming on a network.
The information provided here will help with the following PVM tasks:
Stand-alone workstations delivering several tens of millions of operations
per second are commonplace, and continuing increases in power
are predicted.
When these computer systems are interconnected by an appropriate
high-speed network, their combined computational power
can be applied to solve a variety of
computationally intensive applications.
Indeed,
network computing may even provide supercomputer-level computational power.
Further,
under the right circumstances, the network-based approach can be effective
in coupling several similar multiprocessors, resulting in a configuration
that might be economically and technically difficult to achieve with
supercomputer hardware.
To be effective, distributed computing requires high communication speeds.
In the past fifteen years or so, network speeds have increased by several orders
of magnitude (see Figure
).
Among the most notable
advances in computer networking technology
are the following:
ATM
- Asynchronous Transfer Mode. ATM is the technique for transport,
multiplexing, and switching that provides a high
degree of flexibility required by B-ISDN.
ATM is a connection-oriented protocol
employing fixed-size packets
with a 5-byte header and 48 bytes of information.
These advances in high-speed networking
promise high throughput with low latency and make it possible
to utilize distributed computing for years to come.
Consequently,
increasing numbers of universities, government and industrial
laboratories, and financial firms are turning to
distributed computing to solve their computational problems.
The objective of PVM is to enable these institutions
to use distributed computing efficiently.
Four functions
handle all packet traffic into and out of libpvm.
mroute()
is called by higher-level functions
such as pvm_send() and pvm_recv()
to copy messages into and out of the task.
It establishes any necessary routes before calling mxfer().
mxfer()
polls for messages,
optionally blocking until one is received
or until a specified timeout.
It calls mxinput() to copy
fragments into the task and reassemble messages.
In the generic version of PVM,
mxfer()
uses select() to poll all routes (sockets) in order to find
those ready for input or output.
pvmmctl()
is called by mxinput()
when a control message (Section
)
is received.
Direct routing allows one task to send messages to another
through a TCP link,
avoiding the overhead of forwarding through the pvmds.
It is implemented entirely in libpvm,
using the notify and control message facilities.
By default,
a task routes messages to its pvmd,
which forwards them on.
If direct routing is enabled
(PvmRouteDirect)
when a message (addressed to a task)
is passed to mroute(),
it attempts to create a direct route if one
doesn't already exist.
The route may be granted or refused by the destination task,
or fail (if the task doesn't exist).
The message is then passed to mxfer().
Libpvm maintains a protocol control block (struct ttpcb)
for each active or denied connection,
in list ttlist.
The state diagram for a ttpcb is shown in
Figure
.
To request a connection,
mroute()
makes a ttpcb and socket,
then
sends a
TC_CONREQ
control message to the destination via the default route.
At the same time,
it sends a TM_NOTIFY message to the pvmd,
to be notified if the destination task exits,
with closure (message tag)
TC_TASKEXIT.
Then it
puts the ttpcb in
state TTCONWAIT,
and calls
mxfer() in blocking mode repeatedly
until the state changes.
When the
destination task
enters
mxfer()
(for example, to receive a message),
it receives the TC_CONREQ message.
The request is granted
if its routing policy (pvmrouteopt != PvmDontRoute)
and implementation
allow a direct connection,
it has resources available,
and the protocol version (TDPROTOCOL) in the request
matches its own.
It makes a ttpcb with state TTGRNWAIT,
creates and listens on a socket,
and
then replies with a TC_CONACK message.
If the destination denies the connection,
it nacks, also with a TC_CONACK message.
The originator receives the TC_CONACK
message,
and either opens the connection (state = TTOPEN) or marks the route denied
(state = TTDENY).
Then, mroute() passes the original message to mxfer(),
which sends it.
Denied connections are cached in order to prevent repeated
negotiation.
If the destination doesn't exist,
the TC_CONACK message never arrives because the TC_CONREQ message is
silently dropped.
However,
the TC_TASKEXIT message generated by the notify system
arrives in its place,
and the ttpcb state is set to TTDENY.
This connect scheme also works if both ends try to
establish a connection at the same time.
They both enter
TTCONWAIT, and
when they receive each other's TC_CONREQ messages,
they go directly to the TTOPEN state.
The libpvm function
pvm_mcast()
sends a message to multiple destinations simultaneously.
The current implementation
only routes multicast messages through the pvmds.
It
uses a 1:N fanout
to ensure that failure of a host doesn't
cause the loss of any messages (other than ones to
that host).
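A minimal sketch of the calling side (the executable name worker and the tag are hypothetical; pvm_mcast() takes the array of destination TIDs, their count, and a message tag):

#include "pvm3.h"

#define NWORK 4
#define TAG   3

int main(void)
{
    int tids[NWORK];
    int data[16] = { 0 };

    /* "worker" is a hypothetical executable name */
    pvm_spawn("worker", (char **)0, PvmTaskDefault, "", NWORK, tids);

    pvm_initsend(PvmDataDefault);
    pvm_pkint(data, 16, 1);
    pvm_mcast(tids, NWORK, TAG);    /* one TM_MCA exchange with the pvmd, then the data */

    pvm_exit();
    return 0;
}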
The packet routing layer of the pvmd cooperates with the
libpvm to multicast a message.
To form a multicast address TID (GID)
,
the G bit is set
(refer to Figure
).
The L field is assigned by a counter that is incremented for
each multicast,
so
a new multicast address is used for each message,
then recycled.
To initiate a multicast,
the task sends a TM_MCA message to its pvmd,
containing a list of recipient TIDs.
The pvmd creates a multicast descriptor (struct mca) and GID.
It
sorts the addresses,
removes bogus ones, and duplicates and
caches them in the mca.
To
each destination pvmd
(ones with destination tasks),
it sends a
DM_MCA message with the GID and
destinations on that host.
The GID is sent back to the task in the TM_MCA reply message.
The task sends the multicast message
to the pvmd,
addressed to the GID.
As each packet arrives,
the routing layer
copies it
to each local task
and foreign pvmd.
When a multicast packet arrives at a destination pvmd,
it is copied to each destination task.
Packet order
is preserved,
so
the multicast address and data packets
arrive in order at each destination.
As it forwards multicast
packets,
each pvmd eavesdrops on the header flags.
When it sees a packet with EOM flag set,
it
flushes the mca.
Experience seems to indicate
that inherited
environment (Unix environ)
is useful to an application.
For example,
environment variables can be used to
distinguish a group of related tasks
or to set debugging variables.
PVM makes increasing use of environment,
and may eventually support it
even on machines where the concept
is not native.
For now,
it allows a task to export any part of environ
to tasks spawned by it.
Setting variable PVM_EXPORT to the names of other variables
causes them to be exported through spawn.
For example, setting PVM_EXPORT to DISPLAY causes the DISPLAY variable to be exported to tasks spawned by this one.
The following environment variables are used by PVM.
The user may set these:
The following variables are set by PVM and should not be modified:
Each task spawned through PVM
has /dev/null opened for stdin.
From its parent,
it inherits a stdout sink,
which is a (TID, code) pair.
Output on stdout or stderr is
read by the pvmd through a pipe,
packed into PVM messages and
sent to the TID,
with message tag equal to the code.
If the output TID is set to zero
(the default for a task with no parent),
the messages go to the master pvmd,
where they are written on its error log.
Children spawned by a task inherit its stdout
sink.
Before the spawn,
the parent can use pvm_setopt() to
alter the output TID or code.
This doesn't
affect where the output of the parent task itself goes.
A task may set output TID to one of three settings:
the value inherited from its parent,
its own TID,
or zero.
It can set output code only if output TID is set to its own TID.
This means that output can't be assigned to an arbitrary task.
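As a hedged sketch (PvmOutputTid and PvmOutputCode are the pvm_setopt() options for the inherited sink; the executable name worker and the code value are made up for this example), a parent can direct its children's output to itself before spawning:

#include "pvm3.h"

#define OUT_CODE 55                    /* arbitrary output code for this sketch */

int main(void)
{
    int child;

    /* Tasks spawned from now on inherit this (TID, code) pair as their stdout sink. */
    pvm_setopt(PvmOutputTid, pvm_mytid());
    pvm_setopt(PvmOutputCode, OUT_CODE);

    pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &child);

    /* Spawn, Begin, Output, and End messages now arrive with tag OUT_CODE. */
    pvm_recv(-1, OUT_CODE);

    pvm_exit();
    return 0;
}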
Four types of messages are sent to an stdout sink.
The message body formats for each type are as follows:
The first two items in the message body
are always the task id and output count,
which
allow the receiver to
distinguish between different tasks and the four message types.
For each task,
one message each
of types Spawn, Begin, and End is sent,
along with zero or more messages of class Output (count > 0).
Classes Begin, Output and End will be received
in order,
as they originate from the same source (the pvmd of the
target task).
Class Spawn originates at the (possibly different) pvmd
of the parent task,
so it can be received in any order relative to
the others.
The output sink
is expected to understand the different types of messages
and use them to know when to stop
listening for output from a task (EOF) or group of tasks (global EOF).
The messages are designed so as to prevent race conditions
when a task spawns another task,
then immediately exits.
The
output sink might
get the End
message from the parent task
and decide the group is finished,
only to receive more output later from the child task.
According to these rules, the Spawn
message for the second task
must
arrive before
the End message from the first task.
The Begin message itself is necessary because the Spawn
message for a task may arrive after the End message
for the same task.
The state transitions of a task as observed by the receiver of
the output messages
are shown in
Figure
.
The libpvm function pvm_catchout() uses this output collection
feature to put the output from children of a task into a file
(for example, its own stdout).
It sets output TID to its own task id,
and the output code to control message TC_OUTPUT.
Output from children and grandchildren tasks is collected by the
pvmds and sent to the task,
where it is received by pvmmctl() and printed by pvmclaimo().
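A minimal usage sketch (the executable name worker is hypothetical):

#include <stdio.h>
#include "pvm3.h"

int main(void)
{
    int child;

    pvm_catchout(stdout);    /* print output of children on our own stdout */
    pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &child);

    /* ... exchange messages with the child, do work ... */

    pvm_exit();              /* leave PVM */
    return 0;
}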
The libpvm library
has a tracing system
that can record the parameters and results of all calls to interface
functions.
Trace data is sent as messages to a trace sink task
just as output is sent to an stdout sink (Section
).
If the trace output TID is set to zero (the default),
tracing is disabled.
Besides the trace sink,
tasks also inherit a trace mask,
used to enable tracing function-by-function.
The mask is passed as a (printable) string in environment variable
PVMTMASK.
A task can manipulate its own trace mask or the one to be inherited
from it.
A task's trace mask can also be set asynchronously with a TC_SETTMASK
control message.
Constants related to trace messages are defined in
public header file pvmtev.h.
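As a hedged sketch only (the pvm_setopt() option names and the PvmTaskTrace spawn flag should be checked against pvm3.h on your installation; the executable name and tag are made up), a task can become the trace sink for tasks it spawns:

#include "pvm3.h"

#define TRACE_TAG 42                   /* arbitrary trace code for this sketch */

int main(void)
{
    int child;

    /* Children spawned after these calls send their trace events here. */
    pvm_setopt(PvmTraceTid, pvm_mytid());
    pvm_setopt(PvmTraceCode, TRACE_TAG);

    pvm_spawn("worker", (char **)0, PvmTaskTrace, "", 1, &child);

    /* Trace events arrive as ordinary messages with tag TRACE_TAG. */
    pvm_recv(-1, TRACE_TAG);

    pvm_exit();
    return 0;
}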
Trace data from a task is collected in a manner similar to
the output redirection discussed above.
Just as the Spawn, Begin, and End messages bracket output from a task,
TEV_SPNTASK, TEV_NEWTASK, and TEV_ENDTASK
trace messages are generated by the pvmds to bracket the trace messages from a task.
The tracing system was introduced in version 3.3
and is still expected to change somewhat.
PVM provides a simple but extensible debugging facility.
Tasks started by hand could just as easily be run under a debugger,
but this procedure is cumbersome for those spawned by an application,
since it requires the user to comment out the calls to
pvm_spawn() and start tasks manually.
If PvmTaskDebug is added to the flags passed to
pvm_spawn(),
the task is started through a debugger script (a normal shell script),
$PVM_ROOT/lib/debugger.
The pvmd passes the name and parameters of the task to the debugger
script,
which is free to start any sort of debugger.
The script provided is very simple.
In an xterm window,
it runs the correct debugger according to the architecture type
of the host.
The script can be customized or replaced by the user.
The pvmd can be made to execute a different debugger via the bx=
host file option
or the PVM_DEBUGGER environment variable.
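For example (the executable name worker is hypothetical), a task is started under the debugger script simply by adding PvmTaskDebug to the spawn flags:

#include "pvm3.h"

int main(void)
{
    int tid;

    /* One copy of "worker" is started through $PVM_ROOT/lib/debugger. */
    pvm_spawn("worker", (char **)0, PvmTaskDebug, "", 1, &tid);

    pvm_exit();
    return 0;
}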
The PVM console is used to manage the virtual machine: to reconfigure it or to start and stop processes.
In addition,
it's an example program that makes use of most of the libpvm functions.
pvm_getfds() and select() are used to check for
input from the keyboard and messages from the pvmd simultaneously.
Keyboard input is passed to the command interpreter,
while messages
contain notification (for example, HostAdd) or output from a task.
The console can collect output or trace messages from spawned tasks,
using the redirection mechanisms described
in Section
and Section
,
and
write them to the screen or a file.
It uses the begin and end messages
from child tasks to maintain groups of tasks (or jobs),
related by common ancestors.
Using the PvmHostAdd notify event,
it informs the user when the virtual machine is reconfigured.
Resource limits imposed by the operating system and available
hardware are in turn passed to PVM applications.
Whenever possible,
PVM avoids setting explicit limits;
instead, it returns an error when resources are exhausted.
Competition between users on the same host or network
affects some limits dynamically.
The PVM software provides a unified framework within which
parallel programs can be developed in an
efficient and straightforward manner using existing hardware.
PVM enables a collection of heterogeneous computer
systems to be viewed as a single parallel virtual machine.
PVM transparently handles all message routing, data conversion,
and task scheduling across a network of incompatible computer
architectures.
The PVM computing model is simple yet very general, and accommodates
a wide variety of application program structures. The programming interface
is deliberately straightforward, thus permitting simple
program structures to be implemented in an intuitive manner.
The user writes his application as a collection of cooperating tasks.
Tasks access PVM resources through a library of standard interface
routines. These routines allow the initiation and termination of tasks
across the network as well as communication and synchronization between tasks.
The PVM message-passing
primitives are oriented towards heterogeneous operation, involving
strongly typed constructs for buffering and transmission.
Communication constructs include those for sending and receiving
data structures as well as high-level primitives such as broadcast,
barrier synchronization, and global sum.
PVM tasks may possess arbitrary control and dependency
structures. In other words, at any point in the execution of a
concurrent application, any task in existence may
start or stop other tasks or add or delete computers from the virtual machine.
Any process may communicate and/or synchronize with any other.
Any specific control and dependency
structure may be implemented under the PVM system by appropriate use of
PVM constructs and host language control-flow statements.
Owing to its ubiquitous nature (specifically, the virtual machine concept)
and also because of its simple but complete
programming interface, the PVM system has gained widespread acceptance
in the high-performance scientific computing community.
How many tasks each pvmd can manage is limited by two factors:
the number of
processes allowed a user by the operating system,
and the number of file descriptors available to the pvmd.
The limit on processes
is generally not an issue,
since it doesn't make sense to have a huge number of tasks running
on a uniprocessor machine.
Each task
consumes one file descriptor
in the pvmd,
for the pvmd-task
TCP stream.
Each spawned task (not ones connected anonymously)
consumes an extra descriptor,
since its output is read through a pipe by the pvmd
(closing stdout and stderr in the task would reclaim this slot).
A few more file descriptors are always in use by the pvmd
for the local and network sockets
and error log file.
For example, with a limit of 64 open files,
a user should be able to have up to 30 tasks running per host.
The pvmd may become a bottleneck
if all these tasks try to talk
to one another through it.
The pvmd
uses dynamically allocated memory
to store message packets en route
between tasks.
Until the receiving task accepts the packets,
they accumulate in the pvmd in FIFO order.
No flow control is imposed by the pvmd: it will happily store all the packets given to it, until
it can't get any more memory.
If an application is designed so that tasks can keep sending
even when the receiving end is off doing something else
and not receiving,
the system will eventually run out of memory.
As with the pvmd,
a task may have a limit on the number of others it can connect
to directly.
Each direct route to a task
has a separate TCP connection (which is bidirectional),
and so consumes a file descriptor.
Thus, with a limit of 64 open files,
a task can establish direct routes to about 60
other tasks.
Note that this limit is in effect only when using
task-task
direct routing.
Messages routed via the pvmds use only the default pvmd-task
connection.
The maximum size of a PVM message
is
limited by the amount of memory available to the task.
Because
messages are generally packed using data existing elsewhere in
memory,
and they must reside in memory between being packed and
sent,
the largest possible message a task can send should be somewhat
less than half the available memory.
Note that as a message is sent,
memory for packet buffers is allocated by the pvmd,
aggravating the situation.
In-place message
encoding alleviates this problem somewhat,
because the data is not copied into message buffers in the
sender.
However,
on the receiving end,
the entire message
is downloaded into the task before the receive call accepts it,
possibly leaving no room to unpack it.
In a similar vein,
if many tasks send to a single destination all at once,
the destination task or pvmd may be overloaded as it tries
to store the messages.
Keeping messages from being freed when new ones are received
by using pvm_setrbuf() also uses up memory.
These problems can sometimes be avoided by
rearranging the application code, for example, to
use
smaller messages,
eliminate bottlenecks,
and process messages in the order in which they are generated.
Developed initially as a parallel programming environment
for Unix workstations, PVM has gained
wide acceptance and become a de facto standard for message-passing
programming.
Users want the same programming environment on multiprocessor computers
so they can move
their applications onto these systems.
A common interface would also allow users to write
vendor-independent programs for parallel computers
and to do part or most of the development work on workstations,
freeing up the multiprocessor supercomputers for production runs.
With PVM, multiprocessor systems can be included in the same configuration
with workstations. For example, a PVM task running
on a graphics workstation can display the results of computations
carried out on a massively parallel processing supercomputer.
Shared-memory computers with a small number of processors can be
linked to deliver supercomputer performance.
The virtual machine hides the configuration details from the
programmer.
The physical processors can be a network of
workstations, or they can be the nodes of a multicomputer.
The programmer doesn't have to know how the tasks are created or
where they are running;
it is the responsibility of PVM to schedule the user's tasks
onto individual processors.
The user can, however, tune the
program for a specific configuration to achieve maximum performance,
at the expense of its portability.
Multiprocessor systems can be divided into two main categories:
message passing and shared memory. In the first category, PVM
is now supported on the Intel iPSC/860 and Paragon, as well as the Thinking Machines CM-5.
Porting PVM to these platforms is
straightforward, because the message-passing functions in PVM
map quite naturally onto the native system calls. The difficult
part is the loading and management of tasks. In the second
category, message passing can be done by placing the message buffers
in shared memory. Access to these buffers must be synchronized
with mutual exclusion locks.
PVM 3.3 shared-memory ports include SGI multiprocessor machines running IRIX 5.x and Sun Microsystems, Inc., multiprocessor machines running Solaris 2.3 (this port also runs on the Cray Research, Inc., CS6400).
In addition, CRAY and DEC have created PVM ports
for their T3D and DEC 2100 shared memory multiprocessors, respectively.
A typical MPP system has one or more service nodes for user logins
and a large number of compute nodes for number crunching.
The PVM daemon
runs on one of the service nodes
and serves as the gateway to the outside world.
A task can be started on any one of the service nodes as a Unix
process and enrolls in PVM by establishing a TCP socket connection
to the daemon. The only way to start PVM tasks on the compute nodes
is via pvm_spawn(). When the daemon receives a request to spawn new
tasks, it will allocate a set of nodes if necessary, and load the
executable onto the specified number of nodes.
The way PVM allocates nodes
is system dependent. On the CM-5, the entire partition is allocated
to the user. On the iPSC/860, PVM will get a subcube
big enough to accommodate all the tasks to be spawned. Tasks created
with two separate calls to pvm_spawn() will
reside in different subcubes, although they can exchange messages
directly by using the physical node address. The NX operating system
limits the number of active subcubes system-wide to 10. Pvm_spawn
will fail when this limit is reached or when there are not enough nodes
available.
In the case of the
Paragon,
PVM uses the default partition unless a different one is
specified when pvmd is invoked. Pvmd and the spawned tasks form one
giant parallel application. The user can set the appropriate NX
environment variables such as NX_DFLT_SIZE before starting PVM, or
he can specify the equivalent command-line
arguments to pvmd (i.e., pvmd -sz 32).
PVM message-passing functions are implemented in terms of
the native send and receive system calls.
The ``address'' of a task is encoded in the task id, as illustrated
in Figure
.
This enables the messages to be sent directly to the target task,
without any help from the daemon. The node number is normally the
logical node number, but the physical address
is used on the iPSC/860
to allow for direct intercube communication.
The instance number is used to distinguish tasks running on the
same node.
PVM normally uses asynchronous send primitives to send
messages.
The operating system can run out of
message handles very quickly if a lot of small messages or several
large messages are sent at once.
PVM will be forced to switch to synchronous send when there are no more
message handles left or when the system buffer gets filled up.
To improve performance, a task
should call pvm_send() as soon as the data becomes available,
so (one hopes) when the other task calls pvm_recv(), the message will
already be in its buffer. PVM buffers one incoming packet between
calls to pvm_send()/pvm_recv(). A large message,
however, is broken up into
many fixed-size fragments during packing, and each piece is sent
separately.
Buffering one of these fragments
is not sufficient unless pvm_send() and pvm_recv() are synchronized.
Figures
and
illustrate this process.
The front end of an MPP
system is treated as a regular workstation.
Programs to be run there should be linked with the regular PVM library,
which relies on Unix sockets to transmit messages. Normally one should
avoid running processes on the front end, because communication between
those processes and the node processes must go through the PVM daemon
and a TCP socket link. Most of the computation and communication should
take place on the compute nodes in order to take advantage of the processing
power of these nodes and the fast interconnects between them.
Since the PVM library for the front end is different from the one for
the nodes, the executable for the front end must be different from
the one compiled for the nodes. An SPMD program, for example, has only
one source file, but the object code must be linked with the front end
and node PVM libraries separately to produce two executables if it is
to be started from the front end. An alternative would be a ``hostless''
SPMD program
, which could be spawned from the PVM console.
Table
shows the native system calls used by the corresponding
PVM functions on various platforms.
The CM-5 is somewhat different from the Intel systems because it
requires a special host process for each group of tasks spawned.
This process enrolls in PVM and relays messages between pvmd
and the node programs. This, needless to say, adds even more overhead
to daemon-task communications.
Another restrictive feature of the CM-5 is that all nodes in the same
partition are scheduled as a single unit. The partitions are
normally configured by the
system manager and each partition must contain at least 16 processors.
User programs are run on the entire partition by default. Although it is
possible to idle some of the processors in a partition, as PVM does
when fewer nodes are called for, there is no easy way to harness the
power of the idle processors. Thus, if PVM spawns two groups of tasks,
they will time-share the partition, and any intergroup traffic must
go through pvmd.
Additionally, CMMD has no support for multicasting. Thus, pvm_mcast() is implemented
with a loop of CMMD_async_send().
The shared-memory architecture provides a very efficient medium for
processes to exchange data.
In our implementation, each task owns a shared buffer created
with the shmget() system call. The task id is used as the ``key'' to
the shared segment. If the key is being used by another user, PVM will
assign a different id to the task.
A task communicates with other tasks
by mapping their message buffers into its own memory space.
To enroll in PVM, the task first writes its Unix process id into
pvmd's incoming box. It then looks for the assigned task id in
pvmd's pid
TID table.
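The following is a schematic sketch of the system calls involved, not the PVM source itself; the buffer size and function name are invented for illustration:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define BUFSIZE (1 << 20)              /* hypothetical buffer size */

/* Create the shared buffer keyed by the task id and map it into memory. */
char *attach_buffer(int mytid)
{
    int shmid = shmget((key_t)mytid, BUFSIZE, IPC_CREAT | 0600);
    void *p;

    if (shmid == -1)
        return 0;
    p = shmat(shmid, (void *)0, 0);
    return (p == (void *)-1) ? 0 : (char *)p;
}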
The message buffer is divided into pages, each of which holds one fragment
(Figure
).
PVM's page size can be a multiple of the system page size.
Each page has a header, which contains the lock and
the reference count.
The first few pages are used as the incoming box, while the rest of the pages
hold outgoing fragments (Figure
). To send a message,
the task first packs the
message body into its buffer, then delivers the message header (which
contains the sender's TID and the location of the data) to the incoming
box of the intended recipient. When pvm_recv() is called, PVM checks
the incoming box, locates and unpacks the messages (if any), and
decreases the reference count so the space can be reused. If a task
is not able to deliver the header directly because the receiving box
is full, it will block until the other task is ready.
Inevitably some overhead will be incurred when a message is packed
into and unpacked from the buffer, as is the case with all other PVM
implementations. If the buffer is full, then the data must first be
copied into a temporary buffer in the process's private space and
later transferred to the shared buffer.
Memory contention is usually not a problem. Each process has its own
buffer, and each page of the buffer has its own lock. Only the page
being written to is locked, and no process should be trying to read
from this page because the header has not been sent out. Different
processes can read from the same page without interfering with each
other, so multicasting will be efficient (they do have to decrease
the counter afterwards, resulting in some contention). The only time
contention occurs is when two or more processes try to deliver the message
header to the same process at the same time. But since the header
is very short (16 bytes), such contention should not cause any
significant delay.
To minimize the possibility of page faults, PVM attempts to use
only a small number of pages in the message buffer and recycle
them as soon as they have been read by all intended recipients.
Once a task's buffer has been mapped, it will not be unmapped
unless the system limits the number of mapped segments.
This strategy saves time for any subsequent message exchanges with the
same process.
In the original implementation, all user messages are buffered
by PVM. The user must pack the data into a PVM buffer before sending
it, and unpack the data after it has been received into an internal
buffer. This approach works well on systems with relatively high
communication latency, such as the Ethernet. On MPP systems the
packing and unpacking introduce substantial overhead. To solve this
problem we added two new PVM functions, namely pvm_psend() and
pvm_precv(). These functions combine packing/unpacking and
sending/receiving into one single step. They could
be mapped directly into the native message passing primitives
available on the system, doing
away with internal buffers altogether. On the Paragon these new
functions give almost the same performance as the native ones.
Although the user can use both pvm_psend() and pvm_send()
in the same program,
on MPP the pvm_psend() must be matched with pvm_precv(),
and pvm_send() with pvm_recv().
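A short sketch of a matched pair (peer TIDs and the tag are placeholders):

#include "pvm3.h"

#define TAG 9

/* Sender: pack-and-send in one step, no intermediate PVM buffer on MPP ports. */
void send_side(int peer)
{
    double a[100] = { 0.0 };
    pvm_psend(peer, TAG, (void *)a, 100, PVM_DOUBLE);
}

/* Receiver: must use pvm_precv() to match the pvm_psend() above. */
void recv_side(int peer)
{
    double b[100];
    int rtid, rtag, rlen;
    pvm_precv(peer, TAG, (void *)b, 100, PVM_DOUBLE, &rtid, &rtag, &rlen);
}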
Several research groups have developed software packages
that, like PVM, assist programmers in using distributed computing.
Among the most well known efforts are
P4
[1],
Express
[],
MPI
[],
and Linda
[].
Various other systems
with similar capabilities are also in existence; a reasonably
comprehensive listing may be found in
[13].
It is often useful and always reassuring to be able to see
the present configuration of the virtual machine and the
status of the hosts. It would be even more useful if
the user could also see what his program is doing: what tasks are running, where messages are being sent, etc.
The PVM GUI
called XPVM was developed to display this
information, and more.
XPVM combines the capabilities of the PVM console, a performance monitor,
and a call-level debugger into a single, easy-to-use X-Windows interface.
XPVM is available from netlib
in the directory pvm3/xpvm.
It is distributed as precompiled, ready-to-run executables for
SUN4, RS6K, ALPHA, SUN4SOL2, HPPA, and SGI5.
The XPVM source is also available for compiling on other machines.
XPVM is written entirely in C using the TCL/TK
[8] toolkit
and runs just like another PVM task.
If a user wishes to build XPVM from the source, he must first
obtain and install the TCL/TK software on his system.
TCL and TK were developed by John Ousterhout
at Berkeley
and can be obtained by anonymous ftp to sprite.berkeley.edu.
The TCL and XPVM source distributions each contain a README file
that describes the most up-to-date installation procedure for that package.
Figure
shows a snapshot of XPVM in use.
- figure not available -
Like the PVM console,
XPVM will start PVM if PVM is not already running,
or will attach to the local pvmd if it is.
The console can take an optional hostfile argument
whereas XPVM always reads $HOME/.xpvm_hosts
as its hostfile. If this file does not exist, then
XPVM just starts PVM on the local host (or attaches to the existing PVM).
In typical use, the hostfile .xpvm_hosts
contains a list of hosts prepended with an &.
These hostnames then get added to the Hosts menu
for addition and deletion from the virtual machine by clicking on them.
The top row of buttons performs console-like functions.
The Hosts button displays a menu of hosts. Clicking on a host
toggles whether it is added or deleted from the virtual machine.
At the bottom of the menu is an option for adding a host not listed.
The Tasks button brings up a menu whose most-used selection is
spawn. Selecting spawn brings up a window where one can set the executable name,
spawn flags, start position, number of copies to start, etc.
By default, XPVM turns on tracing in all tasks (and their children)
started inside XPVM. Clicking on Start in the spawn window
starts the task, which will then appear in the space-time view.
The Reset button has a menu for resetting PVM (i.e., kill all PVM tasks)
or resetting different parts of XPVM.
The Quit button exits XPVM while leaving PVM running.
If XPVM is being used to collect trace information, the information
will not be collected if XPVM is stopped.
The Halt button is used when one is through with PVM.
Clicking on this button kills all running PVM tasks, shuts down PVM cleanly,
and exits the XPVM interface.
The Help button brings up a menu of topics the user can get help about.
During startup, XPVM joins a group called xpvm.
The intention is that tasks started outside the XPVM interface
can get the TID of XPVM by calling tid = pvm_gettid( "xpvm", 0 ).
This TID would be needed if the user wanted to manually turn on
tracing inside such a task and pass the events back to XPVM for display.
The expected TraceCode for these events is 666.
While an application is running, XPVM
collects and displays the information in real time.
Although XPVM updates the views as fast as it can,
there are cases when XPVM cannot keep up with the events
and it falls behind the actual run time.
In the middle of the XPVM interface are tracefile controls.
It is here that the user can specify a tracefile; a default tracefile in /tmp is initially displayed.
There are buttons to specify whether the specified tracefile
is to be played back or overwritten by a new run.
XPVM saves trace events in a file using the ``self defining data format''
(SDDF)
described in Dan Reed's
Pablo
[11]
trace playing package.
The analysis of PVM traces can be carried out on any of a number
of systems such as Pablo.
XPVM can play back its own SDDF files. The tape-player-like buttons
allow the user to rewind the tracefile, stop the display at any point,
and step through the execution.
A time display specifies the number of seconds
from when the trace display began.
The Views button allows the user to open or close any of several
views presently supplied with XPVM. These views are described below.
The Network view displays the present virtual machine configuration
and the activity of the hosts. Each host is represented by an icon
that includes the PVM_ARCH and host name inside the icon.
In the initial release of XPVM, the icons are arranged arbitrarily
on both sides of a bus network. In future releases the view will
be extended to visualize network activity as well. At that time
the user will be able to specify the network topology to display.
These icons are illuminated in different colors to indicate their status
in executing PVM tasks. Green implies that at least one task on
that host is busy executing useful work. Yellow indicates that no tasks
are executing user computation, but at least one task is busy
executing PVM system routines. When there are no tasks on a given host,
its icon is left uncolored or white. The specific colors used in each
case are user customizable.
The user can tell at a glance how well the virtual machine is
being utilized by his PVM application. If all the hosts are green
most of the time, then machine utilization is good.
The Network view does not display activity from other users' PVM jobs
or other processes that may be running on the hosts.
In future releases the view will allow the user to click on
a multiprocessor icon and get information about the number of
processors, number of PVM tasks, etc., that are running on the host.
The Space-Time view displays the activities of individual PVM tasks
that are running on the virtual machine.
Listed on the left-hand side of the view are the executable names of
the tasks, preceded by the host they are running on.
The task list is sorted by host so that it is easy to see whether
tasks are being clumped on one host.
This list also shows the task-to-host mappings (which are not available
in the Network view).
The Space-Time view combines three different displays.
The first is like a Gantt chart
.
Beside each listed task is a
horizontal bar stretching out in the ``time'' direction.
The color of this bar at any time indicates the state of the task.
Green indicates that user computations are being executed.
Yellow marks the times when the task is executing PVM routines.
White indicates when a task is waiting for messages.
The bar begins at the time when the task starts executing and ends
when the task exits normally.
The specific colors used in each case are user customizable.
The second display overlays the first display with
the communication activity among tasks.
When a message is sent between two tasks, a red line is drawn
starting at the sending task's bar at the time the message is sent
and ending at the receiving task's bar when the message is received.
Note that this is not necessarily the time the message arrived, but rather the
time the task returns from pvm_recv().
Visually, the patterns and slopes of the red lines combined with
white ``waiting'' regions reveal a lot about the communication
efficiency of an application.
The third display appears only when a user clicks on interesting features
of the Space-Time view with the left mouse button.
A small ``pop-up'' window appears giving detailed
information regarding specific task states or messages.
If a task bar is clicked on, the state begin and end times are displayed,
along with the last PVM system call information.
If a message line is clicked on, the window displays the send and receive
time as well as the number of bytes in the message and the message tag.
When the mouse is moved inside the Space-Time view, a blue vertical line
tracks the cursor and the time corresponding to this vertical line
is displayed as Query time at the bottom of the display.
This vertical line also appears in the other ``something vs. time'' views
so the user can correlate a feature in one view with information
given in another view.
The user can zoom into any area of the Space-Time view by dragging
the vertical line with the middle mouse button. The view will
unzoom back one level when the right mouse button is clicked.
It is often the case that very fine communication or waiting states
are only visible when the view is magnified with the zoom feature.
As with the Query time, the other views also zoom along with the
Space-Time view.
XPVM is designed to be extensible. New views can be created and added
to the Views menu.
At present, there are three other views: task utilization vs. time view,
call trace view, and task output view.
Unlike the Network and Space-Time views, these views are closed
by default. XPVM attempts to draw the views in real time;
hence, the fewer open views, the faster XPVM can draw.
The Utilization view shows the number of tasks computing, in overhead,
or waiting for each instant. It is a summary of the Space-Time view
for each instant. Since the number of tasks in a PVM application
can change dynamically, the scale on the Utilization view will
change dynamically when tasks are added, but not when they exit.
When the number of tasks changes, the displayed portion of the Utilization view
is completely redrawn to the new scale.
The Call Trace view provides a textual record of the last PVM call made
in each task. The list of tasks is the same as in the Space-Time view.
As an application runs, the text changes to reflect the most recent
activity in each task. This view is useful as a call level debugger
to identify where a PVM program's execution hangs.
Unlike the PVM console, XPVM has no natural place for task output to be
printed. Nor is there a flag in XPVM to tell tasks to redirect their
standard output back to XPVM. This flag is turned on automatically
in all tasks spawned by XPVM after the Task Output view is opened.
This view gives the user the option to also redirect the output into a file.
If the user types a file name in the ``Task Output'' box,
then the output is printed in the window and into the file.
As with the trace events, a task started outside XPVM can be
programmed to send standard output to XPVM for display by
using the options in pvm_setopt().
XPVM expects the OutputCode to be set to 667.
PVM has been ported to three distinct classes of architecture:
Porting PVM to non-Unix
operating systems can be very difficult.
Nonetheless, groups outside the PVM team have developed PVM ports for
DEC's VMS
and IBM's OS/2
operating systems.
Such ports can require extensive rewriting of the source and are not
described here.
PVM is supported on most Unix platforms. If an architecture
is not listed in the file $PVM_ROOT/docs/arches, the following
description should help you to create a new PVM port.
Anything from a small amount of tweaking to major surgery may
be required,
depending on how accommodating your version of Unix is.
The PVM source directories are organized in the following manner:
Files in src form the core for PVM (pvmd and libpvm);
files in console are for
the PVM console, which is just a special task; source for the FORTRAN interface and the group functions is in the libfpvm and pvmgs directories, respectively.
In each of the source directories,
the file Makefile.aimk is the generic makefile for all
uniprocessor platforms.
System-specific definitions are kept in the conf directory under
$(PVM_ARCH).def.
The script lib/aimk, invoked by the top-level makefile,
determines the value of PVM_ARCH,
then chooses the appropriate makefile for a particular architecture.
It first looks in the PVM_ARCH
subdirectory for a makefile; if none is found, the generic one is used.
The custom information stored in the conf directory is
prepended to the head of the chosen makefile, and the build begins.
The generic makefiles for MPP and shared-memory systems are
Makefile.mimd and Makefile.shmem, respectively. System-specific rules
are kept in the makefile under the PVM_ARCH subdirectory.
The steps to create a new architecture (for example ARCH) are:
Compiler macros imported from conf/ARCH.def
are listed at the top of the file named src/Makefile.aimk.
They enable options that are common to several machines and so are generally useful.
New ones are added occasionally.
The macro IMA_ARCH can be used to enable code that only
applies to a single architecture.
The ones most commonly used are:
ARCH.m4 is a file of commands for the m4 macro processor that edits the libfpvm C source code to conform to the FORTRAN calling conventions, which vary from machine to machine.
The two main things you must determine about your FORTRAN are:
1.
How FORTRAN subroutine names are converted to linker symbols.
Some systems append an underscore to the name;
others convert to all capital letters.
2.
How strings are passed in FORTRAN. One common method is to pass the address in a char* and to pass the corresponding lengths after all other parameters.
The easiest way to discover the correct choices
may be to try every common case (approximately three) for each.
First,
get the function names right,
then make sure you can pass string data to FORTRAN tasks.
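The following fragment is purely illustrative (the names pvmfoo and pvmbar are invented); it shows the kind of C definitions the m4 file must select between:

/* A FORTRAN statement such as  CALL PVMFOO(N)  must resolve to one of: */
void pvmfoo_(int *n) { }     /* lowercase name with a trailing underscore */
void PVMFOO(int *n)  { }     /* all capital letters, no underscore */

/* A string argument is commonly passed as a char* plus a length appended
   after the other parameters, so  CALL PVMBAR('abc', N)  might map to: */
void pvmbar_(char *s, int *n, int slen) { }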
Porting to MPP systems is more difficult because most of them
do not offer a standard Unix environment on the nodes.
We discuss some of these limitations below.
Processes running on the nodes of an Intel iPSC/860
have no Unix
process id's and they cannot receive Unix signals. There is a similar
problem for the Thinking Machine's CM-5
.
If a node process forks, the behavior of the new process is
machine dependent. In any event it would not be allowed to become
a new PVM task. In general, processes on the nodes are not
allowed to enroll unless they were spawned by PVM.
By default, pvm_spawn() starts tasks on the (compute) nodes.
To spawn multiple copies of the same executable, the programmer
should call pvm_spawn() once and specify the number of copies.
On some machines (e.g., iPSC/860), only one process is allowed on
each node, so the total number of PVM tasks on these machines
cannot exceed the number of nodes available.
Several functions serve as the multiprocessor ``interface'' for PVM.
They are called by pvmd to spawn new tasks and to communicate with
them. The implementation
of these functions is system dependent; the source code is
kept in src/PVM_ARCH/pvmdmimd.c (message passing)
or src/PVM_ARCH/pvmdshmem.c (shared memory).
We give a brief description of each of these functions below.
Note that pvmdmimd.c can be found in the subdirectory PVM_ARCH because
MPP platforms are very different from one another, even those from
the same vendor.
In addition to these functions, the message exchange routine in libpvm,
mroute(), must also be implemented using the most efficient native
message-passing primitives. The following macros are defined in
src/pvmmimd.h:
These functions are used by mroute() on MPP systems.
The source code for mroute() on multiprocessors is in src/lpvmmimd.c or src/lpvmshmem.c, depending on the class.
For shared-memory implementations, the following macros are defined in
the file
This chapter attempts to answer some of the most common questions
encountered by users when installing PVM and running PVM programs.
It also covers debugging the system itself,
which is sometimes necessary when doing new ports or trying to determine
whether an application or PVM is at fault.
The material here is mainly taken from other sections of the book,
and rearranged to make answers easier to find.
As always, RTFM pages first.
Printed material always lags behind reality,
while the online documentation is kept up-to-date with each release.
The newsgroup comp.parallel.pvm is available to post questions and
discussions.
If you find a problem with PVM,
please tell us about it.
A bug report form is included with the distribution
in $PVM_ROOT/doc/bugreport.
Please use this form or include equivalent information.
Some of the information in this chapter applies only to the generic
Unix implementation of PVM,
or describes features more volatile than the standard documented ones.
It is presented here to aid with debugging,
and is tagged to warn you of its nature.
Examples of shell scripts are for either C-shell (csh, tcsh) or Bourne
shell (sh, ksh).
If you use some other shell,
you may need to modify them somewhat, or use csh while troubleshooting.
You can get a copy of PVM for your own use or share an already-installed
copy with other users.
The installation process is more or less the same in either case.
Make certain you have environment variable PVM_ROOT set
(and exported, if applicable) to the directory where PVM is installed
before you do anything else.
This directory is where the system executables and libraries reside.
Your application executables go in a private directory,
by default $HOME/pvm3/bin/$PVM_ARCH.
If PVM is already installed at your site you can share it by
setting PVM_ROOT to that path,
for example /usr/local/pvm3.
If you have your own copy,
you could install it in $HOME/pvm3.
If you normally
use csh,
add a line like this to your .cshrc file:
setenv PVM_ROOT $HOME/pvm3
If you normally use sh,
add these lines to your .profile:
PVM_ROOT=$HOME/pvm3
PVM_DPATH=$HOME/pvm3/lib/pvmd
export PVM_ROOT PVM_DPATH
Make sure these are set in your current session too.
Older versions of PVM assumed an installation path of $HOME/pvm3.
Versions 3.3 and later require that
the PVM_ROOT variable always be set.
Note:
For compatibility with older versions of PVM and some command
shells that don't execute a startup file,
newer versions guess $HOME/pvm3 if it's not set,
but you shouldn't depend on that.
On-line manual pages compatible with most Unix machines are shipped
with the source distribution.
These reside in $PVM_ROOT/man and can be copied to some other place (for example, /usr/local/man) or used in place.
If the man program on your machine uses the MANPATH
environment variable,
try adding something like the following near the end of your .cshrc
or .login file:
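One hedged possibility, in the same csh style as the PVM_ROOT example elsewhere in this chapter (the /usr/man component is only a guess at your system's default path), is:

setenv MANPATH /usr/man:$PVM_ROOT/man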
Then you should be able to read both normal system man pages and PVM
man pages by simply typing man subject.
The following commands download, unpack,
build and install a release:
The compiler may print a few warning messages; we suggest you ignore
these unless the build doesn't complete or until you have some other
reason to think there is a problem.
If you can't build the unmodified distribution ``out of the box''
on a supported architecture, let us know.
P4
[1]
is a library of macros and subroutines developed
at Argonne National
Laboratory for programming a variety of parallel machines.
The p4 system supports both the shared-memory model (based
on monitors) and the distributed-memory model (using message-passing).
For the
shared-memory model of parallel computation, p4 provides a set
of useful monitors as well as a
set of primitives
from which monitors can be constructed.
For the distributed-memory model, p4 provides typed send and receive
operations and creation of processes according to a text file describing
group and process structure.
Process management in the p4 system is based on a configuration
file that specifies the host pool, the object file to be executed on
each machine, the number of processes to be started on each host
(intended primarily for multiprocessor systems), and other auxiliary
information. An example of a configuration file is
Two issues are noteworthy in regard to the process management mechanism
in p4. First, there is the notion of a ``master'' process and ``slave''
processes, and multilevel hierarchies may be formed to implement
what is termed a cluster model of computation. Second, the
primary mode of process creation is static, via the configuration
file; dynamic process creation is possible only by a statically
created process that must invoke a special p4 function that spawns
a new process on the local machine. Despite these restrictions,
a variety of application paradigms may be implemented in the p4 system
in a fairly straightforward manner.
Message passing in the p4 system is achieved through the
use of traditional send and recv primitives, parameterized
almost exactly as in other message-passing systems. Several variants
are provided for semantics, such as heterogeneous exchange and
blocking or nonblocking transfer. A significant proportion of the
burden of buffer allocation and management, however, is left to the user.
Apart from basic message passing, p4 also offers a variety of global
operations, including broadcast, global maxima and minima, and
barrier synchronization.
The protocols used in building PVM are evolving,
with the result that newer releases are not compatible with older
ones.
Compatibility is determined by the pvmd-task and task-task protocol
revision numbers.
These are compared when two PVM entities connect;
they will refuse to interoperate if the numbers don't match.
The protocol numbers are defined in src/ddpro.h and src/tdpro.h
(DDPROTOCOL, TDPROTOCOL).
As a general rule,
PVM releases with the same second digit in their version numbers
(for example 3.2.0 and 3.2.6) will interoperate.
Changes that result in incompatibility are held until a major version
change (for example, from 3.2 to 3.3).
To get PVM running,
you must start either a pvmd or the PVM console by hand.
The executables are named pvmd3 and pvm,
respectively,
and reside in the directory $PVM_ROOT/lib/$PVM_ARCH.
We suggest using the pvmd or pvm script
in $PVM_ROOT/lib instead,
as this simplifies setting your shell path.
These scripts determine the host architecture and
run the correct executable, passing on their command line arguments.
Problems when starting PVM can be caused by system or network trouble,
running out of resources (such as disk space),
incorrect installation or a bug in the PVM code.
The pvmd writes errors on both its standard error stream
(only until it is fully started)
and a log file,
named
/tmp/pvml.uid.
uid is your numeric user id
(generally the number in the third colon-separated field
of your passwd entry).
If PVM was built with the SHAREDTMP option
(used when a cluster of machines shares a /tmp directory),
the log file will instead be named /tmp/pvml.uid.hostname.
If you have trouble getting PVM started,
always check the log file for hints about what went wrong.
If more than one host is involved,
check the log file on each host.
For example, when adding a new host to a virtual machine,
check the log files on the master host and the new host.
Try the following command to get your uid:
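One simple possibility is the standard id command, which prints your numeric uid along with other information:

id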
The pvmd publishes the address of the socket to which local tasks connect
in a file named
/tmp/pvmd.uid.
uid is your numeric user id
(generally in the third field of your passwd entry).
If PVM was built with the SHAREDTMP option
(used when a cluster of machines shares a /tmp directory),
the file will be named /tmp/pvmd.uid.hostname.
See §
for more information on how
this file is used.
The pvmd creates the socket address file while starting up,
and removes it while shutting down.
If while starting up, it finds the file already exists,
it prints an error message and exits.
If the pvmd can't create the file because the permissions of /tmp
are set incorrectly or the filesystem is full,
it won't be able to start up.
If the pvmd is killed with an uncatchable signal
or other catastrophic event such as a (Unix) machine crash,
you must remove the socket address file
before another pvmd will start on that host.
Note that if the pvmd is compiled with option OVERLOADHOST,
it will start up even if the address
file already exists (creating it if it doesn't).
It doesn't consider the existence of the address file an error.
This allows disjoint virtual machines owned by the
same user
to use overlapping sets of hosts.
Tasks not spawned by PVM can only connect to the first pvmd running on
an overloaded host, however,
unless they can somehow guess the correct socket address
of one of the other pvmds.
PVM is normally started by invoking the console program,
which starts a pvmd if one is not already running and connects to it.
The syntax for starting a PVM console is:
If the console can't start the pvmd for some reason,
you may see one of the following error messages.
Check the pvmd log file for error messages.
The most common ones are described below.
Can't start pvmd -
This message means that the console
either can't find the pvmd executable or the pvmd is having
trouble starting up.
If the pvmd complains that it can't bind a socket,
perhaps the host name set for the machine does not resolve to
an IP address of one of its interfaces,
or that interface is down.
The console/pvmd option -nname can be used to change
the default.
Can't contact local daemon -
If a previously running pvmd crashed, leaving behind
its socket address file,
the console may print
this message.
The pvmd will log error message pvmd already running?.
Find and delete the address file.
Version mismatch -
The console (libpvm)
and pvmd protocol revision numbers don't match.
The protocol has a revision number so that incompatible versions
won't attempt to interoperate.
Note that having different protocol revisions doesn't necessarily
cause this message to be printed;
instead
the connecting side may simply hang.
It is necessary to start the master pvmd by hand if you will use
the so=pw or so=ms options in the host file or
when adding hosts.
These options require direct interaction with the pvmd when
adding a host.
If the pvmd is started by the console, or otherwise backgrounded,
it will not be able to read passwords from a TTY.
The syntax to start the master pvmd by hand is:
If you start a PVM console or application,
use another window.
When the pvmd finishes starting up,
it prints out a line like either:
80a95ee4:0a9a
or
/tmp/aaa026175.
If it can't start up, you may not see an error message,
depending on whether the problem occurs before or after the pvmd
stops logging to its standard error output.
Check the pvmd log file for a complete record.
This section also applies to hosts started via a host file,
because the same mechanism is used in both cases.
The master pvmd starts up,
reads the host file,
then sends itself a request to add more hosts.
The PVM console (or an application) can return an error
when adding hosts to the virtual machine.
Check the pvmd log file on the master host and the failing
host for additional clues to what went wrong.
No such host -
The master pvmd couldn't resolve the host name (or the name given
in ip= option) to an IP address.
Make sure you have the correct host name.
Can't start pvmd -
This message means that the master pvmd
failed to start the slave pvmd process.
This can be caused by incorrect installation, network or permission problems.
The master pvmd must be able to resolve the host name (get its IP address)
and route packets to it.
The pvmd executable and shell script to start it must be installed in the
correct location.
You must avoid printing anything in your .cshrc (or equivalent)
script,
because it will confuse the pvmd communication.
If you must print something,
either move it to your .login file or enclose it in a conditional:
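One common csh idiom (shown as a hedged example, not necessarily the exact test used in the original text) prints only when the shell is interactive:

if ( $?prompt ) then
    echo "terminal type is $TERM"
endif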
To test all the above, try running the following command by hand on
the master host:
rsh host $PVM_ROOT/lib/pvmd -s
where host is the name of the slave host you want to test.
You should see a message similar to the following from
the slave pvmd and nothing else:
Version mismatch -
This message indicates that the protocol revisions
of the master and slave pvmd are incompatible.
You must install the same (or compatible) versions everywhere.
Duplicate host -
This message means that PVM thinks there is another pvmd (owned by
the same user)
already running on the host.
If you're not already using the host in the
current virtual machine or a different one,
the socket address file (§
) must be
left over from a previous run.
Find and delete it.
A host file may be supplied to the pvmd (or console, which
passes it to the pvmd) as a command-line parameter.
Each line of the file contains a host name followed by option parameters.
Hosts not preceded by '&' are started automatically as soon as
the master pvmd is ready.
The syntax:
The preferred way to shut down a virtual machine is to type halt
at the PVM console,
or to call libpvm function pvm_halt().
When shutting PVM down from the console,
you may see an error message such as EOF on pvmd sock.
This is normal and can be ignored.
You can instead kill the pvmd process;
it will shut down, killing any local tasks with SIGTERM.
If you kill a slave pvmd,
it will be deleted from the virtual machine.
If you kill the master pvmd,
the slaves will all exit too.
Always kill the pvmd with a catchable signal,
for example SIGTERM.
If you kill it with SIGKILL,
it won't be able to clean up after itself,
and you'll have to do that by hand.
In contrast to the other parallel processing systems described
in this section, the Express [] toolkit is a collection of tools that
individually address various aspects of concurrent computation.
The toolkit is developed and marketed commercially by ParaSoft
Corporation, a company that was started by some members of the
Caltech concurrent computation project.
The philosophy behind computing with Express is based on beginning
with a sequential version of an application and following a
recommended development life cycle culminating in a parallel
version that is tuned for optimality. Typical development cycles begin
with the use of VTOOL, a graphical program that
allows the progress of sequential algorithms to be
displayed in a dynamic manner. Updates and references to
individual data structures can be displayed to
explicitly demonstrate algorithm structure and provide the
detailed knowledge necessary for parallelization.
Related to this program is FTOOL, which provides in-depth analysis
of a program including variable use analysis, flow structure,
and feedback regarding potential parallelization.
FTOOL operates on both sequential and parallel versions of an application.
A third tool called ASPAR is then used; this is
an automated parallelizer that converts sequential C and Fortran
programs for parallel or distributed execution using the Express
programming models.
The core of the Express system is a set of libraries for
communication, I/O, and parallel graphics. The communication
primitives are akin to those found in other message-passing systems
and include
a variety of global operations and data distribution primitives.
Extended I/O routines enable parallel input and output, and a similar
set of routines are provided for graphical displays from multiple
concurrent processes.
Express also contains the NDB tool, a parallel debugger
that uses commands based on the popular ``dbx'' interface.
PVM applications written in C should include header file pvm3.h,
as follows:
#include <pvm3.h>
Programs using the trace functions should additionally include pvmtev.h,
and resource manager programs should include pvmsdpro.h.
You may need to specify the PVM
include directory in the compiler flags as follows:
cc ... -I$PVM_ROOT/include ...
A header file for Fortran (fpvm3.h) is also supplied.
Syntax for including files in Fortran is variable;
the header file may need to be pasted into your source.
A statement commonly used is:
INCLUDE '/usr/local/pvm/include/fpvm3.h'
PVM applications written in C
must be linked with at least the base PVM library, libpvm3.
Fortran applications must be linked with both libfpvm3 and libpvm3.
Programs that use group functions
must also be linked with libgpvm3.
On some operating systems,
PVM programs must be linked with
still other libraries (for the socket or XDR functions).
Note that the order of libraries in the link command is important;
Unix machines generally process the list from left to right,
searching each library once.
You may also need to specify the PVM library directory
in the link command.
A correct order is shown below
(your compiler may be called something other than cc or f77).
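For illustration, link lines of roughly the following form should work; the
library directory, the use of $ARCHLIB, and the program name myprog are
assumptions about a typical installation, and -lgpvm3 should be added before
-lpvm3 if group functions are used:
cc -o myprog myprog.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3 $ARCHLIB
f77 -o myprog myprog.f -L$PVM_ROOT/lib/$PVM_ARCH -lfpvm3 -lpvm3 $ARCHLIB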
The aimk program supplied with PVM automatically sets
environment variable PVM_ARCH to the PVM architecture name and
ARCHLIB to the necessary system libraries.
Before running aimk,
you must have PVM_ROOT set to the path where PVM is installed.
You can use these variables to write a portable,
shared makefile (Makefile.aimk).
No such file -
This error code is returned instead of a task id
when the pvmd fails to find a program executable during spawn.
Remember that task placement decisions are made before checking the existence
of executables.
If an executable is not installed on the selected host,
PVM returns an error instead of trying another host.
For example, if you have installed myprog on 4 hosts of
a 7 host virtual machine,
and spawn 7 instances of myprog with default placement,
only 4 will succeed.
Make sure executables are built for each architecture you're
using,
and installed in the correct directory.
By default, PVM searches first in
pvm3/bin/$PVM_ARCH
(the pvmd default working directory is $HOME)
and then in
$PVM_ROOT/bin/$PVM_ARCH.
This path list can
be changed with host file option ep=.
If your programs aren't on a filesystem shared between the hosts,
you must copy them to each host manually.
failed to start group server -
This message means that a function in the group library (libgpvm3.a)
could not spawn a group server task
to manage group membership lists.
Tasks using group library functions must be able to communicate with this
server.
It is started automatically if one is not already running.
The group server executable (pvmgs)
normally resides in
$PVM_ROOT/bin/$PVM_ARCH,
which
must be in the pvmd search path.
If you change the path using the host file ep= option,
make sure this directory is still included.
The group server may be spawned on any host,
so be sure one is installed and your path is set correctly everywhere.
Tasks and pvmds allocate some memory (using malloc()) as they run.
Malloc never gives memory back to the system,
so the data size of each process only increases
over its lifetime.
Message and packet buffers (the main users of dynamic memory in PVM)
are recycled, however.
The things that most commonly cause PVM to use a large amount of memory
are passing huge messages,
certain communication patterns and memory leaks.
A task sending a PVM message doesn't necessarily block until the
corresponding receive is executed.
Messages are stored at the destination until claimed,
allowing some leeway when programming in PVM.
The programmer should be careful to
limit the number of outstanding messages.
Having too many causes the receiving task (and its pvmd
if the task is busy) to accumulate a lot of dynamic memory
to hold all the messages.
There is nothing to stop a task from sending a message
that is never claimed (because no matching or wildcard receive
is ever posted).
Such a message is held in memory at the destination until that task exits.
Make sure you're not accumulating old message buffers by moving them aside.
The pvm_initsend() and receive
functions automatically free the
current buffer,
but
if you use the pvm_set[sr]buf() routines, then the associated buffers
may not be freed.
For example,
the following code fragment allocates message buffers
until the system runs out of memory:
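The original fragment is not reproduced here; the following minimal sketch
shows the kind of leak meant, assuming only the documented behavior of
pvm_initsend() and pvm_setsbuf():

#include "pvm3.h"

int main()
{
    /* Each pass creates a buffer with pvm_initsend() and then sets it
       aside with pvm_setsbuf(0).  Because the buffer is no longer the
       current send buffer, the next pvm_initsend() does not free it,
       so buffers accumulate until memory is exhausted. */
    for (;;) {
        pvm_initsend(PvmDataDefault);
        pvm_setsbuf(0);
    }
    return 0;
}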
As a quick check,
look at the message handles returned by initsend or receive functions.
Message ids are taken from a pool,
which is extended as the number of message buffers in use increases.
If there is a buffer leak,
message ids will start out small and increase steadily.
Two undocumented functions in libpvm dump information about message buffers;
one of them, umbuf_dump(int mid, int level),
dumps a message buffer by id (mid).
Parameter level is one of:
Each task spawned through PVM has its stdout and stderr
files connected to a pipe that is read by the pvmd managing the task.
Anything printed by the task is packed into a PVM message by the
pvmd and sent to the task's stdout sink.
The implementation of this mechanism is described in § .
Each spawned task has /dev/null opened as stdin.
Output from a task running on any host in a virtual machine
(unless redirected by the console, or a parent task)
is written in the log file of the master pvmd by default.
You can use the console spawn command with flag -> to collect output
from an application (the spawned tasks and any others they in turn
spawn).
Use function pvm_catchout() to collect output within an application.
The C stdio library (fgets(), printf(), etc.)
buffers input and output
whenever possible, to reduce the
frequency of actual read() or write() system calls. It decides
whether to buffer by looking at the underlying file descriptor of a
stream. If the file is a tty, it buffers only a line at
a time, that is, the buffer is flushed whenever the newline character
is encountered. If the descriptor is a file, pipe, or socket,
however, stdio buffers up much more, typically 4k bytes.
A task spawned by PVM writes output through a pipe back to its pvmd,
so the stdout buffer isn't flushed after every line (stderr probably is).
The pvm_exit() function closes the stdio streams,
causing them to be flushed so you should eventually see all your
output.
You can flush stdout by calling fflush(stdout) anywhere in
your program.
You can change the buffering mode of stdout to line-oriented
for the entire program by calling
setlinebuf(stdout)
near the top of the program.
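As a sketch (the calls shown are standard C library and libpvm routines; the
program body is illustrative):

#include <stdio.h>
#include "pvm3.h"

int main()
{
    setlinebuf(stdout);      /* flush stdout at every newline, even though
                                it is a pipe back to the pvmd */
    printf("task t%x starting\n", pvm_mytid());

    /* ... the rest of the task ... */

    fflush(stdout);          /* make sure nothing is left buffered */
    pvm_exit();
    return 0;
}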
Fortran systems handle output buffering in many different ways.
Sometimes there is a FLUSH subroutine,
sometimes not.
In a PVM task,
you can open a file to read or write,
but remember that spawned components inherit the working directory
(by default $HOME) from the pvmd
so the file path you open must
be relative to your home directory (or an absolute path).
You can change the pvmd (and therefore task)
working directory (per-host)
by using the host file option wd=.
PVM doesn't have a built-in facility for running programs at different
priorities (as with nice),
but you can do it yourself.
You can call setpriority() (or perhaps nice()) in your code or
replace your program with a shell script wrapper as follows:
When prog is spawned, the shell script execs prog-
at a new priority level.
You could be even more creative and pass an environment variable through
PVM to the shell script,
to allow varying the priority without editing the script.
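A minimal sketch of the in-code alternative mentioned above, assuming a
hypothetical environment variable NICENESS carries the desired priority (it
could be forwarded to spawned tasks with PVM_EXPORT):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/resource.h>
#include "pvm3.h"

int main()
{
    char *p = getenv("NICENESS");        /* NICENESS is a made-up name */
    int nicelevel = p ? atoi(p) : 10;

    /* Lower this task's scheduling priority; error handling omitted. */
    setpriority(PRIO_PROCESS, 0, nicelevel);

    /* ... normal PVM work ... */
    pvm_exit();
    return 0;
}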
If you want to have real fun,
hack the tasker example to do the work,
then you won't have to replace all the programs with wrappers.
One reason for changing the
scheduling priority of a task is to allow it to run on a workstation
without impacting the performance of the machine
for someone sitting at the console.
Longer
response time seems to feel worse than lower throughput.
Response time is affected most by tasks that
use a lot of memory, stealing all the physical pages from other
programs.
When interactive input arrives,
it takes the system time to reclaim all the pages.
Decreasing the priority of such a task may not help much,
because if it's
allowed to run for a few seconds,
it accumulates pages again.
In contrast,
cpu bound jobs with small working set sizes
may hardly affect the response time at all,
unless you have many of them running.
Available memory limits the maximum size and number of outstanding
messages the system can handle.
The number of file descriptors (I/O channels) available to a process
limits the number of direct route connections a task can establish
to other tasks,
and the number of tasks a single pvmd can manage.
The number of processes allowed to a user limits the number of
tasks that can run on a single host,
and so on.
An important thing to know is that you may not see a message when you
reach a resource limit.
PVM tries to return an error code to the offending task
and continue operation,
but can't recover from certain events (running out of memory
is the worst).
See § for more information on how resource limits affect PVM.
First,
the bad news.
Adding printf() calls to your code is still a state-of-the-art
methodology.
PVM tasks can be started in a debugger on systems that support X-Windows.
If PvmTaskDebug is specified
in pvm_spawn(), PVM
runs $PVM_ROOT/lib/debugger,
which opens an xterm in which it
runs the task in a debugger defined in pvm3/lib/debugger2.
The PvmTaskDebug flag is not inherited,
so you must modify each call to spawn.
The DISPLAY environment variable can be exported to a remote host so
the xterm will always be displayed on the local screen.
Use the following command before running the application:
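With csh, and assuming PVM's PVM_EXPORT variable is used to forward
environment variables to spawned tasks, the command would be:
setenv PVM_EXPORT DISPLAY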
Make sure DISPLAY is set to the name of your host
(not unix:0) and the host name is fully qualified
if your virtual machine includes hosts at more than one administrative site.
To spawn a task in a debugger from the console, use the command:
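A plausible form is shown below; -? is the console spawn flag that requests
PvmTaskDebug, and myprog stands for your executable:
pvm> spawn -? myprog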
You may be able to use the libpvm trace facility to isolate problems,
such as hung processes.
A task has a trace mask,
which allows each function in libpvm to be selectively traced,
and a trace sink,
which is another task to which trace data is sent (as messages).
A task's trace mask and sink are inherited by any tasks spawned
by it.
The console can spawn a task with tracing enabled
(using the spawn command's -@ flag),
collect the trace data, and print it out.
In this way,
a whole job (group of tasks related by parentage) can be traced.
The console has a trace command to edit the mask passed
to tasks it spawns.
Or, XPVM can be used to collect and display trace data graphically.
It is difficult to start an application by hand and trace it,
though.
Tasks with no parent (anonymous tasks)
have a default trace mask and sink of NULL.
Not only must the
first task call
pvm_setopt() and pvm_settmask() to initialize
the tracing parameters,
but it must collect and interpret the trace data.
If you must start a traced application from a TTY,
we suggest spawning an xterm from the console:
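For example (the absolute path to xterm is an assumption about your
installation; -@ turns on tracing as described above):
pvm> spawn -@ /usr/bin/X11/xterm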
The task context held open by the xterm has tracing enabled.
If you now run a PVM program in the xterm,
it will reconnect to the task context
and trace data will be sent back to the PVM console.
Once the PVM program exits,
you must spawn a new xterm to run again,
since the task context will be closed.
Because the libpvm library is linked with your program,
it can't be trusted when debugging.
If you overwrite part of its memory
(for example by overstepping the bounds of an array)
it may start to behave erratically,
making the fault hard to isolate.
The pvmds are somewhat more robust and attempt to sanity-check messages
from tasks,
but can still be killed by errant programs.
The pvm_setopt() function can be used to set the debug mask for PVM
message-passing functions, as described in § .
Setting this mask to 3, for example, causes PVM to log,
for every message sent or received by that task,
information such as the source, destination, and length of the message.
You can use this information to trace lost or stray messages.
The Message Passing Interface (MPI)
[]
standard, whose specification was completed in April 1994,
is the outcome of a community effort to try to define both the syntax
and semantics of a core of message-passing
library routines that would be useful to a wide
range of users and efficiently implementable on a wide range of MPPs.
The main advantage of establishing a message-passing standard is portability.
One of the goals of developing MPI is to provide
MPP vendors with a clearly defined base set of
routines that they can implement efficiently or, in some cases, provide
hardware support for, thereby enhancing scalability.
MPI is not intended to be
a complete and self-contained software infrastructure
that can be used for distributed computing.
MPI does not include necessities such as process management
(the ability to start tasks), (virtual) machine configuration,
and support for input and output. As a result, it is anticipated that
MPI will be realized as a communications interface layer that will be built
upon native facilities of the underlying hardware platform, with the exception
of certain data transfer operations that might be implemented at a level
close to hardware. This scenario would permit PVM to be ported
on top of MPI, exploiting whatever communication performance
a vendor supplies.
You may need to debug the PVM system when
porting it to a new architecture,
or because an application is not running correctly.
If you've thoroughly checked your application and can't find a problem,
then it may lie in the system itself.
This section describes a few tricks and undocumented features
of PVM to help you find out what's going on.
The pvmd and libpvm
each have a debugging mask
that can be set to enable logging of various
information.
Logging information is divided into classes,
each enabled separately by a bit in the debug mask.
The pvmd and console have a command line option
(-d) to set the debug mask of the pvmd to the
(hexadecimal) value specified;
the default is zero.
Slave pvmds inherit the debug mask of the master as
they are started.
The debug mask of a pvmd can be set at any time using the
console tickle
command on that host.
The debug mask in libpvm can be set in the task with pvm_setopt().
The pvmd debug mask bits are defined in ddpro.h,
and the libpvm bits in lpvm.c.
The meanings of the bits are not well defined,
since they're only intended to be used when fixing
or modifying the pvmd or libpvm.
At present, the bits in the debug mask are as follows:
The tickle function is a simple, extensible
interface that allows a task to poke at its local pvmd as it runs.
It is not formally specified,
but has proven to be very useful in debugging the system.
Tickle is accessible from the console (tickle command)
or libpvm.
Function pvm_tickle() sends a TM_TICKLE message to
the pvmd containing a short (maximum of ten) array of
integers and receives an array in reply.
The first element of the array is a subcommand,
and the remaining elements are parameters.
The commands currently defined are:
New tickle commands are generally added to the end of the list.
If the pvmd breaks,
you may need to start it under a debugger.
The master pvmd can be started by hand under a debugger,
and the PVM console started on another terminal.
To start a slave pvmd
under a debugger,
use the manual startup (so=ms) host file option
so the master pvmd will allow you to start the slave by hand.
Or,
use the dx= host file option to execute a script similar
to lib/debugger,
and run the pvmd in a debugger in an xterm window.
To help catch memory allocation errors in the system code,
the pvmd and libpvm use a sanity-checking library called imalloc.
Imalloc functions are wrappers for the regular
libc functions
malloc(), realloc(), and free().
Upon detecting an error,
the imalloc functions abort the program
so the fault can be traced.
The following checks and functions are performed by imalloc:
Since the overhead of this checking is quite severe,
it is disabled at compile time by default.
Defining USE_PVM_ALLOC in the source makefile(s) switches it on.
The pvmd includes several registers and counters to sample certain
events,
such as the number of calls made to select() or
the number of packets refragmented
by the network code.
These values can be computed from a debug log,
but the counters have less adverse impact on
the performance of the pvmd than would generating a huge log file.
The counters can be dumped or reset using the pvm_tickle()
function or the console tickle command.
The code to gather statistics
is normally switched out at compile time.
To enable it,
edit the makefile and add -DSTATISTICS to the compile options.
This appendix contains a list of all the versions of PVM that
have been released from the first one in February 1991 through August 1994.
Along with each version we include a brief synopsis of the improvements
made in this version. Although not listed here, new ports were being
added to PVM with each release. PVM continues to evolve driven by
new technology and user feedback. Newer versions of PVM beyond those
listed here may exist at the time of reading. The latest version
can always be found on netlib.
Linda
[] is a concurrent programming model that has evolved
from a Yale University research project. The primary concept in
Linda is that of a ``tuple-space'', an abstraction via which
cooperating processes communicate. This central theme of Linda
has been proposed as an alternative paradigm to the two traditional
methods of parallel processing: that based on shared memory,
and that based on message passing. The tuple-space concept is essentially
an abstraction of distributed shared memory, with one important
difference (tuple-spaces are associative), and several minor
distinctions (destructive and nondestructive reads and
different coherency semantics are possible). Applications
use the Linda model by embedding explicitly, within cooperating sequential
programs, constructs that manipulate (insert/retrieve tuples)
the tuple space.
From the application point of view
Linda is a set of programming
language extensions for facilitating parallel programming.
It provides a shared-memory abstraction for process communication
without requiring the underlying hardware to
physically share memory.
The Linda system usually refers to a specific
implementation of software that supports the Linda programming
model. System software is provided that establishes and maintains
tuple spaces and is used in conjunction with libraries that
appropriately interpret and execute Linda primitives. Depending on the
environment (shared-memory multiprocessors, message-passing
parallel computers, networks of workstations, etc.), the tuple space
mechanism is implemented using different techniques and with
varying degrees of efficiency. Recently, a new system scheme
has been proposed, at least nominally related to the Linda
project. This scheme, termed ``Piranha''
[],
proposes a proactive approach
to concurrent computing: computational
resources (viewed as active agents) seize computational tasks
from a well-known location based on availability
and suitability. This scheme may be implemented on
multiple platforms and manifested as a ``Piranha system'' or
``Linda-Piranha system.''
PVM (Parallel Virtual Machine) is a byproduct of an ongoing
heterogeneous network computing research project
involving the authors and their institutions. The general goals
of this project are to investigate issues in, and develop solutions
for, heterogeneous concurrent computing.
PVM is an integrated set of software tools and libraries that emulates
a general-purpose, flexible, heterogeneous concurrent computing
framework on interconnected computers of varied architecture.
The overall objective of the PVM system is to enable such a collection
of computers to be used cooperatively for concurrent or parallel
computation. Detailed descriptions and discussions of the concepts,
logistics, and methodologies involved in this network-based computing
process are contained in the remainder of the book. Briefly, the
principles upon which PVM is based include the following:
The PVM system is composed of two parts.
The first part is a daemon, called pvmd3 and sometimes abbreviated pvmd,
that resides on all the computers making up the virtual machine.
(An example of a daemon program is the mail program that runs in the
background and handles all the incoming and outgoing electronic mail
on a computer.)
Pvmd3 is designed so any user with a valid login can install this
daemon on a machine. When a user wishes to run a PVM application,
he first creates a virtual machine by starting up PVM.
(Chapter 3 details how this is done.)
The PVM application can then be started from a Unix
prompt on any of the hosts.
Multiple users can configure overlapping virtual machines,
and each user can execute several PVM applications simultaneously.
The second part of the system is a library of PVM interface routines.
It contains a functionally complete repertoire of primitives that are
needed for cooperation between tasks of an application.
This library contains user-callable routines for message passing,
spawning processes, coordinating tasks, and modifying the virtual machine.
The PVM computing model
is based on the notion that an application
consists of several tasks.
Each task is responsible for a part of the application's computational workload.
Sometimes an application is parallelized along its functions;
that is, each task performs a different function, for example,
input, problem setup, solution, output, and display.
This process is often called functional parallelism.
A more common method of parallelizing an application is called data parallelism.
In this method all the tasks are the same,
but each one only knows and solves a small part of the data.
This is also referred to as the SPMD
(single-program multiple-data)
model of computing. PVM supports either or a mixture of these methods.
Depending on their functions, tasks may execute in parallel and may
need to synchronize or exchange data, although this is not always the case.
An exemplary diagram of the PVM computing model is shown in
Figure ,
and an architectural view of the PVM system, highlighting the
heterogeneity of the computing platforms supported by PVM, is
shown in Figure .
The PVM system currently supports C, C++, and Fortran languages.
This set of language interfaces has been included based on the
observation that the predominant majority of target applications
are written in C and Fortran, with an emerging trend in experimenting
with object-based languages and methodologies.
The C and C++ language bindings
for the PVM user interface library
are implemented as functions, following the general conventions
used by most C systems, including Unix-like operating systems. To elaborate,
function arguments are a combination of value parameters and pointers
as appropriate, and function result values indicate the outcome of
the call. In addition, macro definitions are used for system constants,
and global variables such as errno and pvm_errno
are the mechanism for
discriminating between multiple possible outcomes. Application programs
written in C and C++ access PVM library functions by linking
against an archival library (libpvm3.a) that is
part of the standard distribution.
Fortran language bindings
are implemented
as subroutines rather than as functions.
This approach was taken because some compilers on the supported architectures
would not reliably interface Fortran functions with C functions.
One immediate implication of this
is that an additional argument is introduced into each PVM library
call for status results to be returned to the invoking program.
Also, library routines for the placement and
retrieval of typed data in message buffers are unified, with
an additional parameter indicating the datatype. Apart from these
differences (and the standard naming prefixes - pvm_ for C,
and pvmf for Fortran), a one-to-one correspondence
exists between the two language bindings.
Fortran interfaces to PVM are implemented as library stubs that
in turn invoke the corresponding C routines, after casting
and/or dereferencing arguments as appropriate. Thus, Fortran applications
are required to link against the stubs library (libfpvm3.a) as
well as the C library.
All PVM tasks are identified by an integer task identifier (TID).
Messages are sent to and received from tids.
Since tids must be unique across the entire virtual machine,
they are supplied by the local pvmd and are not user chosen.
Although PVM encodes information into each TID (see Chapter 7 for details),
the user is expected to treat the tids as opaque integer identifiers.
PVM contains several routines that return TID values
so that the user application can identify other tasks in the system.
There are applications where it is natural to think of a group of tasks,
and there are cases where a user would like to identify his tasks
by the numbers 0 to p-1, where p is the number of tasks.
PVM includes the concept of user named groups.
When a task joins a group, it is assigned a unique ``instance'' number
in that group. Instance numbers start at 0 and count up.
In keeping with the PVM philosophy, the group functions are designed
to be very general and transparent to the user. For example,
any PVM task can join or leave any group at any time without having
to inform any other task in the affected groups.
Also, groups can overlap,
and tasks can broadcast messages to groups of which they are not a member.
Details of the available group functions are given in Chapter 5.
To use any of the group functions, a program must be linked with
libgpvm3.a.
The general paradigm for application programming with PVM is as follows.
A user writes one or more sequential programs in C, C++, or Fortran 77
that contain embedded calls to the PVM library.
Each program corresponds to a task making up the application.
These programs are compiled for each architecture
in the host pool, and the resulting object files are placed at a location
accessible from machines in the host pool. To execute an application, a
user typically starts one copy of one task
(usually the ``master'' or ``initiating'' task) by hand
from a machine within the host pool.
This process subsequently starts other PVM tasks,
eventually resulting in a collection of active tasks
that then compute locally and exchange messages with each other
to solve the problem.
Note that while the above is a typical scenario, as many tasks as
appropriate may be started manually. As mentioned earlier, tasks
interact through explicit message passing, identifying each other
with a system-assigned, opaque TID.
Shown in Figure
is the body of the PVM program hello,
a simple example that illustrates the basic concepts of PVM programming.
This program is intended to be invoked manually; after printing its
task id (obtained with pvm_mytid()), it initiates a copy of
another program called hello_other using the pvm_spawn()
function. A successful spawn causes the program to execute
a blocking receive using pvm_recv.
After receiving the message, the program prints the message sent by
its counterpart, as well as its task id; the buffer is extracted
from the message using pvm_upkstr.
The final pvm_exit call dissociates the program from the PVM system.
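The figure itself is not reproduced here; the following is a rough sketch of
such a hello program, reconstructed from the calls just described (the buffer
size and messages are illustrative):

#include <stdio.h>
#include "pvm3.h"

int main()
{
    int cc, tid;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    /* start one copy of hello_other somewhere in the virtual machine */
    cc = pvm_spawn("hello_other", (char **)0, 0, "", 1, &tid);

    if (cc == 1) {
        pvm_recv(tid, 1);      /* block until the reply (tag 1) arrives */
        pvm_upkstr(buf);       /* extract the string from the message   */
        printf("from t%x: %s\n", tid, buf);
    } else {
        printf("can't start hello_other\n");
    }

    pvm_exit();
    return 0;
}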
Figure
is a listing of the ``slave'' or spawned program; its
first PVM action is to obtain the task id of the ``master'' using
the pvm_parent call. This program then obtains its hostname
and transmits it to the master using the three-call sequence -
pvm_initsend to initialize the send buffer;
pvm_pkstr to place a string, in a strongly typed and
architecture-independent manner, into the send buffer; and pvm_send
to transmit it to the destination process specified by ptid,
``tagging'' the message with the number 1.
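Again as a sketch rather than the book's verbatim listing, the spawned
program described above might look like this:

#include <string.h>
#include <unistd.h>
#include "pvm3.h"

int main()
{
    int ptid;                      /* tid of the parent (master) task */
    char buf[100];

    ptid = pvm_parent();

    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);

    pvm_initsend(PvmDataDefault);  /* initialize the send buffer              */
    pvm_pkstr(buf);                /* pack the string, architecture-neutrally */
    pvm_send(ptid, 1);             /* send it to the master, tagged 1         */

    pvm_exit();
    return 0;
}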
This chapter describes how to set up the PVM software package,
how to configure a simple virtual machine,
and how to compile and run the example programs supplied with PVM.
The chapter is written as a tutorial, so the reader can follow along with
the book beside the terminal.
The first part of the chapter describes the straightforward use of PVM
and the most common errors and problems in set up and running.
The latter part of the chapter describes some of the more advanced options
available to customize the reader's PVM environment.
The latest version of the PVM source code and documentation
is always available through netlib.
Netlib is a software distribution service set up on the Internet
that contains a wide range of computer software.
Software can be retrieved from netlib by ftp, WWW, xnetlib, or email.
PVM files can be obtained by anonymous ftp to ftp.netlib.org.
Look in directory pvm3. The file index describes the files in
this directory and its subdirectories.
Using a World Wide Web tool such as Xmosaic, the PVM files can be
accessed at the address
http://www.netlib.org/pvm3/index.html.
Xnetlib is an X-Window interface that allows a user to browse
or query netlib for available software and to automatically
transfer the selected software to the user's computer.
To get xnetlib send email to netlib@netlib.org with the message
send xnetlib.shar from xnetlib
or anonymous ftp from ftp.netlib.org xnetlib/xnetlib.shar.
The PVM software can be requested by email.
To receive this software send email to netlib@netlib.org with the
message: send index from pvm3. An automatic mail handler will
return a list of available files and further instructions by email.
The advantage of this method is that anyone with email access
to Internet can obtain the software.
The PVM software is distributed as a uuencoded, compressed, tar file.
To unpack the distribution, the file must be uudecoded, uncompressed,
and extracted with tar xvf filename. This will create a directory called pvm3
wherever it is untarred. The PVM documentation is distributed as
PostScript files and includes a User's Guide, reference manual,
and quick reference card.
The PVM project began in the summer of 1989 at
Oak Ridge National Laboratory.
The prototype system, PVM 1.0, was constructed by Vaidy Sunderam
and Al Geist; this version of the system was used internally at
the Lab and was not released to the outside.
Version 2 of PVM was written
at the University of Tennessee and released in March 1991.
During the following year, PVM began to be used in
many scientific applications.
After user feedback and a number of changes (PVM 2.1 - 2.4),
a complete rewrite was undertaken, and version 3 was completed in
February 1993.
It is PVM version 3.3 that we describe in this book (and
refer to simply as PVM).
The PVM software has
been distributed freely
and is being used in computational applications around the world.
One
of the reasons for PVM's popularity is that it is simple to set up and use.
PVM does not require special privileges to be installed.
Anyone with a valid login on the hosts can do so.
In addition, only one person at an organization needs to get and install PVM
for everyone at that organization to use it.
PVM uses two environment variables when starting and running.
Each PVM user needs to set these two variables to use PVM.
The first variable is PVM_ROOT,
which is set to the location of the
installed pvm3 directory.
The second variable is PVM_ARCH,
which tells PVM the architecture of this host and thus what executables
to pick from the PVM_ROOT directory.
The easiest method is to set these two variables in your .cshrc file.
We assume you are using csh as you follow along this tutorial.
Here is an example for setting PVM_ROOT:
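Assuming PVM was unpacked in a pvm3 directory under your home directory, the
line in your .cshrc would be:
setenv PVM_ROOT $HOME/pvm3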
Table 1 lists the PVM_ARCH names and their corresponding
architecture types that are supported in PVM 3.3.
The PVM source comes with directories and makefiles for most architectures
you are likely to have.
Chapter 8 describes how to port the PVM source to an unsupported architecture.
Building for each architecture type is done automatically
by logging on to a host, going into the PVM_ROOT directory,
and typing make.
The makefile will automatically determine which architecture
it is being executed on, create appropriate subdirectories,
and build pvm, pvmd3, libpvm3.a, libfpvm3.a,
pvmgs, and libgpvm3.a.
It places all these files in $PVM_ROOT/lib/$PVM_ARCH, with the exception
of pvmgs, which is placed in $PVM_ROOT/bin/$PVM_ARCH.
Before
we go over the steps to compile and run parallel PVM programs,
you should be sure you can start up PVM and configure a virtual machine.
On any host on which PVM has been installed you can type
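For example, assuming the pvm console program is on your search path:
% pvm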
To see what the present virtual machine looks like, you can type
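the conf command at the console prompt, which lists the configured hosts:
pvm> conf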
You should practice starting and stopping and adding hosts to PVM
until you are comfortable with the PVM console.
A full description of the PVM console and its many command options
is given at the end of this chapter.
If you don't want to
type in a bunch of host names each time,
there is a hostfile option. You can list the hostnames in a file
one per line and then type
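where hostfile stands for whatever you named the file; for example:
% pvm hostfile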
There are other ways to start up PVM.
The functions of the console and a performance monitor
have been combined in a graphical user interface called XPVM,
which is available precompiled on netlib
(see Chapter 8 for XPVM details).
If XPVM has been installed at your site, then it can be used to start PVM.
To start PVM with this X window interface, type
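assuming the xpvm executable is installed and on your path:
% xpvm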
The quit and halt buttons work just like the PVM console.
If you quit XPVM and then restart it, XPVM will automatically display
what the running virtual machine looks like.
Practice starting and stopping and adding hosts with XPVM.
If there are errors, they should appear in the window where you started XPVM.
If
PVM has a problem starting up, it will print an error message
either to the screen or in the log file /tmp/pvml.<uid>.
This section describes the most common startup problems
and how to solve them.
Chapter 9 contains a more complete troubleshooting guide.
If the message says
Other reasons to get this message include not having PVM installed
on a host or not having PVM_ROOT set correctly on some host.
You can check these by typing
If PVM is manually killed, or stopped abnormally (e.g., by a system crash),
then check for the existence of the file /tmp/pvmd.<uid>.
This file is used for authentication and should exist only while PVM is running.
If this file is left behind, it prevents PVM from starting.
Simply delete this file.
If the message says
If you get any other strange messages, then check your .cshrc file.
It is important that you not have any I/O in the .cshrc file
because this will interfere with the startup of PVM.
If you wish to print out information
(such as who or uptime)
when you log in,
you should do it in your .login script,
not when you're running a csh command script.
In this section you'll learn how to compile and run PVM programs.
Later chapters of this book describe how to write parallel PVM programs.
In this section we will work with the example programs supplied with
the PVM software. These example programs make useful templates on which
to base your own PVM programs.
The first step is to copy the example programs into your own area:
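One plausible way to do this (the destination directory is only a convention;
any directory you can write to will do) is:
% cp -r $PVM_ROOT/examples $HOME/pvm3/examples
% cd $HOME/pvm3/examples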
The master/slave programming model is the most popular model used in
distributed computing. (In the general parallel programming arena,
the SPMD model is more popular.)
To compile the master/slave C example, type
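assuming the example sources are named master.c and slave.c, as in the
standard distribution:
% aimk master slave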
The makefile moves the executables to $HOME/pvm3/bin/$PVM_ARCH,
which is the default location where PVM will look for them on all hosts.
If your file system is not common across all your PVM hosts,
then you will have to build or copy (depending on the architectures)
these executables on all your PVM hosts.
Now, from one window, start PVM and configure some hosts.
These examples are designed to run on any number of hosts, including one.
In another window cd to $HOME/pvm3/bin/$PVM_ARCH and type
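assuming the master/slave example builds an executable named master:
% master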
The first example illustrates the ability to run a PVM program from
a Unix prompt on any host in the virtual machine. This is just like
the way you would run a serial a.out program on a workstation.
In the next example, which is also a master/slave model called hitc,
you will see how to spawn PVM jobs from the PVM console and also from XPVM.
hitc illustrates dynamic load balancing using the pool-of-tasks
paradigm. In the pool of tasks paradigm, the master program manages
a large queue of tasks, always sending idle slave programs more work
to do until the queue is empty. This paradigm is effective in
situations where the hosts have very different computational powers,
because the least loaded or more powerful hosts do more of the work
and all the hosts stay busy until the end of the problem.
To compile hitc, type
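for example (assuming the sources are named hitc.c and hitc_slave.c):
% aimk hitc hitc_slave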
Since hitc does not require any user input, it can be spawned directly
from the PVM console. Start up the PVM console and add a few hosts.
At the PVM console prompt type
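for example, using the -> flag so that task output comes back to the console:
pvm> spawn -> hitc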
hitc can be used to illustrate XPVM's real-time animation capabilities.
Start up XPVM and build a virtual machine with four hosts.
Click on the ``tasks" button and select ``spawn" from the menu.
Type ``hitc" where XPVM asks for the command, and click on ``start".
You will see the host icons light up as the machines become busy.
You will see the hitc_slave tasks get spawned and see all the messages
that travel between the tasks in the Space Time display.
Several other views are selectable from the XPVM ``views" menu.
The ``task output" view is equivalent to the ``->" option in the PVM console.
It causes the standard output from all tasks to appear in
the window that pops up.
There is one restriction on programs that are spawned from XPVM
(and the PVM console).
The programs must not contain any interactive input, such as asking
for how many slaves to start up or how big a problem to solve.
This type of information can be read from a file or put on the command line
as arguments, but there is nothing in place to get user input
from the keyboard to a potentially remote task.
The PVM console, called pvm,
is a stand-alone PVM task that allows the user to interactively
start, query, and modify the virtual machine.
The console may be started and stopped multiple times on any of the
hosts in the virtual machine without affecting PVM or any
applications that may be running.
When started, pvm determines whether PVM is already running;
if it is not, pvm automatically executes pvmd on this host,
passing pvmd the command line options and hostfile.
Thus PVM need not be running to start the console.
The -n option is useful for specifying an alternative name for the
master pvmd (in case hostname doesn't match the IP address you want).
Once PVM is started, the console prints the prompt
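which in PVM 3.3 looks like:
pvm>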
The console reads $HOME/.pvmrc before reading commands from the tty, so
you can do things like
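A .pvmrc might contain lines such as the following (alias and setenv are
console commands; the particular aliases are only examples):
alias ? help
alias j jobs
setenv PVM_EXPORT DISPLAY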
As we stated earlier,
only one person at a site needs to install PVM,
but each PVM user can have his own hostfile,
which describes his own personal virtual machine.
The hostfile
defines the initial configuration of hosts
that PVM combines into a virtual machine.
It also contains information about hosts that you
may wish to add to the configuration later.
The hostfile in its simplest form is just a list of hostnames one to a line.
Blank lines are ignored, and lines that begin with a # are comment lines.
This allows you to document the hostfile
and also provides a handy way to modify the initial configuration
by commenting out various hostnames (see Figure ).
Several options can be specified on each line after the hostname.
The options are separated by white space.
If you want to set any of the above options as defaults
for a series of hosts, you can place these options
on a single line with a * for the hostname field.
The defaults will be in effect for all the following hosts
until they are overridden by another set-defaults line.
Hosts that you don't want in the initial configuration
but may add later can be specified in the hostfile by beginning
those lines with an &.
An example hostfile displaying most of these options is shown in
Figure .
Developing applications for the PVM system, in a general sense at least, follows
the traditional paradigm for programming distributed-memory
multiprocessors such as the nCUBE or the Intel family of multiprocessors.
The basic techniques are similar both for
the logistical aspects of programming
and for algorithm development. Significant
differences exist, however, in terms of (a) task management, especially issues
concerning dynamic process creation, naming, and addressing; (b) initialization
phases prior to actual computation; (c) granularity choices; and
(d) heterogeneity. In this chapter, we discuss the
programming process for PVM and identify factors that may impact
functionality and performance.
Parallel computing using a system such as PVM may be approached
from three fundamental viewpoints, based on the organization of
the computing tasks. Within each, different
workload allocation strategies are possible and will be discussed
later in this chapter. The first and most common model for PVM
applications can be termed ``crowd'' computing: a collection
of closely related processes, typically executing the same code,
perform computations on different portions of the workload,
usually involving the periodic exchange of intermediate results.
This paradigm can be further subdivided into two categories:
The second model supported by PVM is termed a ``tree'' computation.
In this scenario, processes are spawned (usually dynamically
as the computation progresses) in a tree-like manner, thereby
establishing a tree-like, parent-child relationship (as opposed
to crowd computations where a star-like relationship exists). This
paradigm, although less commonly used, is an extremely natural
fit to applications where the total workload is not known
a priori, for example, in branch-and-bound algorithms, alpha-beta
search, and recursive ``divide-and-conquer'' algorithms.
The third model, which we term ``hybrid,'' can be thought of as
a combination of the tree model and crowd model. Essentially,
this paradigm possesses an arbitrary spawning structure: that is,
at any point during application execution, the process
relationship structure may resemble an arbitrary and changing graph.
We note that these three classifications are
made on the basis of process relationships, though they frequently
also correspond to communication
topologies.
Nevertheless, in all three, it is possible for any process to
interact and synchronize with any other. Further, as may be expected,
the choice of model is application dependent and should be selected
to best match the natural structure of the parallelized program.
Crowd computations typically involve three phases. The first is
the initialization of the process group; in the case of node-only
computations, dissemination of group information and
problem parameters, as well as workload allocation, is typically done within
this phase. The second phase is computation. The third phase is collection
of results and display of output;
during this phase, the process group is
disbanded or terminated.
The master-slave model is illustrated below, using the well-known
Mandelbrot set computation, which is representative of the class of
problems termed ``embarrassingly'' parallel. The computation
itself involves applying a recursive function to a collection of
points in the complex plane until the function values either
reach a specific value or begin to diverge. Depending upon
this condition, a graphical representation of each point in the plane
is constructed. Essentially, since the function outcome depends
only on the starting value of the point (and is independent of
other points), the problem
can be partitioned into
completely independent portions, the algorithm applied to each, and
partial results combined using simple combination schemes. However,
this model permits dynamic load balancing,
thereby allowing the processing elements to
share the workload unevenly. In this and subsequent examples within
this chapter, we only show a skeletal form of the algorithms, and
also take syntactic liberties with the PVM routines in the interest
of clarity. The control structure of the master-slave class of
applications is shown in Figure .
The master-slave example described above involves no communication
among the slaves. Most crowd computations of any complexity do need
to communicate among the computational processes; we illustrate the
structure of such applications using a node-only example for
matrix multiply using Cannon's algorithm
[2]
(programming details
for a similar algorithm are given in another chapter).
The matrix-multiply example, shown
pictorially in Figure ,
multiplies matrix subblocks locally and
uses row-wise multicast of matrix A subblocks in conjunction
with column-wise shifts of matrix B subblocks.
To successfully use this book, one should be experienced with common
programming techniques and understand some basic parallel processing
concepts.
In particular,
this guide assumes that the user knows how to write, execute,
and debug Fortran or C programs and is familiar
with Unix.
As mentioned earlier, tree computations
typically exhibit a tree-like
process control structure which also conforms to the communication pattern
in many instances. To illustrate this model, we consider a parallel sorting
algorithm that works as follows. One process (the manually started
process in PVM) possesses (inputs or generates) the list to be sorted.
It then spawns a second process and sends it half the list. At this
point, there are two processes each of which spawns a process and sends
them one-half of their already halved lists. This continues until
a tree of appropriate depth is constructed. Each process then independently
sorts its portion of the list, and a merge phase follows where sorted
sublists are transmitted upwards along the tree edges, with intermediate
merges being done at each node. This algorithm is illustrative of
a tree computation in which the workload is known in advance; a diagram
depicting the process is given in Figure ;
an algorithmic outline is given below.
In the preceding section, we discussed the common parallel programming paradigms
with respect to process structure, and we outlined representative examples
in the context of the PVM system. In this section we address the issue
of workload allocation, subsequent to establishing process structure,
and describe some common paradigms that are used in distributed-memory
parallel computing. Two general methodologies are commonly used. The first,
termed data decomposition or partitioning, assumes that the overall
problem involves applying computational operations or transformations on
one or more data structures and, further, that these data structures
may be divided and operated upon. The second, called function decomposition,
divides the work based on different operations or functions. In a sense,
the PVM computing model supports both function decomposition
(fundamentally different tasks perform different operations)
and data decomposition
(identical tasks operate on different portions of the data).
As a simple example of data decomposition, consider the addition of
two vectors, A[1..N] and B[1..N], to produce the result vector,
C[1..N]. If we assume that P processes are working on this problem, data partitioning
involves the allocation of N/P elements of each vector to each process,
which computes the corresponding N/P elements of the resulting vector.
This data partitioning may be done either ``statically,''
where each process knows a priori (at least in terms of
the variables N and P) its share of the workload,
or ``dynamically,'' where a control process (e.g., the master process)
allocates subunits of the workload to processes as and when they
become free. The principal difference between these two approaches
is ``scheduling.''
With static scheduling, individual process workloads are fixed;
with dynamic scheduling, they vary as the computation progresses. In
most multiprocessor environments, static scheduling is effective for
problems such as the vector addition example; however, in the
general PVM environment, static scheduling is not necessarily beneficial.
The reason is
that PVM environments based on networked clusters are susceptible to
external influences; therefore, a statically scheduled, data-partitioned
problem might encounter one or more processes that complete their portion
of the workload much faster or much slower than the others. This
situation could also arise when the machines in a PVM system are
heterogeneous, possessing varying CPU speeds and different memory
and other system attributes.
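To make the static variant concrete, here is a minimal sketch in which each
task derives its share of the vector addition from its group instance number;
the group name vadd, the value of nproc, and the vector length are
illustrative assumptions, and the program must be linked with libgpvm3:

#include "pvm3.h"

#define N 1000000                    /* global vector length (illustrative) */

static double a[N], b[N], c[N];      /* in a real code only the local slice
                                        would be stored on each task        */

int main()
{
    int me    = pvm_joingroup("vadd");   /* my instance number: 0 .. P-1   */
    int nproc = 4;                       /* P, assumed known to every task */
    int chunk = N / nproc;
    int lo    = me * chunk;
    int hi    = (me == nproc - 1) ? N : lo + chunk;
    int i;

    for (i = lo; i < hi; i++)            /* statically assigned share */
        c[i] = a[i] + b[i];

    pvm_barrier("vadd", nproc);          /* wait until the whole group is done */
    pvm_lvgroup("vadd");
    pvm_exit();
    return 0;
}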
In a real execution of even this trivial vector addition problem,
an issue that cannot be ignored is input and output. In other words,
how do the processes described above receive their workloads, and
what do they do with the result vectors? The answer to these questions
depends on the application and the circumstances of a particular
run, but in general:
The third method of allocating individual workloads is also consistent
with dynamic scheduling in applications where interprocess interactions
during computations are rare or nonexistent. However, nontrivial
algorithms generally require intermediate exchanges of data values,
and therefore only the initial assignment of data partitions
can be accomplished by these schemes. For example, consider the
data partitioning method depicted in Figure 4.2. In order to multiply
two matrices A and B, a group of processes is first spawned, using
the master-slave or node-only paradigm. This set of processes is
considered to form a mesh; the matrices to be multiplied are
divided into subblocks, also forming a mesh. Each subblock of the
A and B matrices is placed on the corresponding process, by utilizing
one of the data decomposition and workload allocation strategies listed
above. During computation, subblocks need to be forwarded or
exchanged between processes, thereby transforming the original
allocation map, as shown in the figure. At the end of the computation,
however, result matrix subblocks are situated on the individual
processes, in conformance with their respective positions on the
process grid, and consistent with a data partitioned map of the
resulting matrix C.
The foregoing discussion illustrates the basics
of data decomposition. In a later chapter, example programs highlighting
details of this approach will be presented.
Parallelism in distributed-memory environments such as PVM may also be
achieved by partitioning the overall workload in terms of different
operations. The most obvious example of this form of decomposition
is with respect to the three stages of typical program execution,
namely, input, processing, and result output. In function decomposition,
such an application may consist of three separate and distinct
programs, each one dedicated to one of the three phases.
Parallelism is obtained by concurrently executing the three programs
and by establishing a "pipeline" (continuous or quantized) between
them. Note, however, that in such a scenario, data parallelism may
also exist within each phase. An example is shown in Figure ,
where distinct functions are realized as PVM components, with multiple
instances within each component implementing portions of different
data partitioned algorithms.
Although the concept of function decomposition is illustrated by
the trivial example above, the term is generally used to signify
partitioning and workload allocation by function within
the computational phase. Typically, application computations
contain several different subalgorithms-sometimes on the
same data (the MPSD
or multiple-program single-data scenario),
sometimes in a pipelined sequence of transformations, and sometimes
exhibiting an unstructured pattern of exchanges. We illustrate
the general functional decomposition paradigm by considering the
hypothetical simulation of an aircraft consisting of multiple
interrelated and interacting, functionally decomposed subalgorithms.
A diagram providing an overview of this example is shown in
Figure
(and will also be used in a later chapter dealing
with graphical PVM programming).
In the figure, each node or circle in the "graph" represents a
functionally decomposed piece of the application. The input function
distributes the particular problem parameters to the different
functions 2 through 6, after spawning processes corresponding
to distinct programs implementing each of the application subalgorithms.
The same data may be sent to multiple functions (e.g., as in the
case of the two wing functions), or data appropriate for the
given function alone may be delivered. After performing some
amount of computations these functions deliver intermediate or
final results to functions 7, 8, and 9 that may have been spawned
at the beginning of the computation or as results become available.
The diagram indicates the primary concept of decomposing applications
by function, as well as control and data dependency relationships.
Parallelism is achieved in two respects-by the concurrent
and independent execution of modules as in functions 2 through 6,
and by the simultaneous, pipelined, execution of modules
in a dependency chain, as, for example,
in functions 1, 6, 8, and 9.
In order to utilize the PVM system, applications must evolve through
two stages. The first concerns development of the distributed-memory
parallel version of the application algorithm(s); this phase is
common to the PVM system as well as to other distributed-memory
multiprocessors. The actual parallelization decisions fall into two
major categories - those related to structure, and those related
to efficiency. For structural decisions in parallelizing applications,
the major decisions to be made include the choice of model to be used
(i.e., crowd computation vs. tree computation and data decomposition vs.
function decomposition). Decisions with respect to efficiency when
parallelizing for distributed-memory environments are generally oriented
toward minimizing the frequency and volume of communications. It is
typically in this latter respect that the parallelization process
differs for PVM and hardware multiprocessors; for PVM environments
based on networks, large granularity generally leads to better
performance. With this qualification, the parallelization process is
very similar for PVM and for other distributed-memory environments,
including hardware multiprocessors.
The parallelization of applications may be done ab initio,
from existing sequential versions, or from existing parallel
versions. In the first two cases, the stages involved are to select
an appropriate algorithm for each of the subtasks in the
application, usually from published descriptions or by
inventing a parallel algorithm,
and to then code
these algorithms
in the language of choice (C, C++, or Fortran 77 for PVM) and
interface them with each other as well as with process management
and other constructs. Parallelization from existing sequential
programs also follows certain general guidelines, primary among
which are to decompose loops, beginning with outermost loops and
working inward. In this process, the main concern is to detect
dependencies and to partition loops such that the dependencies are preserved
while allowing for concurrency. This parallelization process is
described in numerous textbooks and papers on parallel computing,
although few textbooks discuss the practical and specific aspects
of transforming a sequential program to a parallel one.
Existing parallel programs may be based upon either the shared-memory
or distributed-memory paradigms. Converting existing shared-memory
programs to PVM is similar to converting from sequential code,
when the shared-memory versions are based upon vector or loop-level
parallelism. In the case of explicit shared memory programs, the
primary task is to locate synchronization points and replace these
with message passing.
In order to convert existing distributed-memory parallel code to PVM, the main
task is to convert from one set of concurrency constructs to another.
Typically, existing distributed memory parallel programs are written
either for hardware multiprocessors or other networked environments
such as p4 or Express. In both cases, the major changes required
are with regard to process management. For example, in the Intel
family of DMMPs, it is common for processes to be started from
an interactive shell command line. Such a paradigm should be replaced
for PVM by either a master program or a node program that takes
responsibility for process spawning. With regard to interaction,
there is, fortunately, a great deal of commonality between the
message-passing calls in various programming environments. The major
differences between PVM and other systems in this context
are with regard to (a) process management and process addressing schemes;
(b) virtual machine configuration/reconfiguration and its impact
on executing applications; (c) heterogeneity in messages as well
as the aspect of heterogeneity that deals with different architectures
and data representations; and (d) certain unique and specialized
features such as signaling, and task scheduling methods.
In this chapter we give a brief description of
the routines in the PVM 3 user library.
This chapter is organized by the functions of the routines.
For example, in the section on Message Passing
is a discussion of all the routines for sending and receiving data
from one PVM task to another and a description of PVM's
message passing options.
The calling syntax of the C and Fortran PVM routines
are highlighted by boxes in each section.
An alphabetical listing of all the routines is given in Appendix B.
Appendix B contains a detailed description of each routine,
including a description of each argument in each routine, the possible
error codes a routine may return, and the possible reasons for the error.
Each listing also includes examples of both C and Fortran use.
In PVM 3 all PVM tasks are identified by an integer supplied by the
local pvmd. In the following descriptions this task identifier is
called TID. It is similar to the process ID (PID)
used in the Unix system and is assumed to be opaque,
in that the value of the TID has no special significance to the user.
In fact, PVM encodes information into the TID for its own internal use.
Details of this encoding can be found in Chapter 7.
All the PVM routines are written in C.
C++ applications can link to the PVM library.
Fortran applications can call these routines through a Fortran 77 interface
supplied with the PVM 3 source.
This interface translates arguments, which are passed by reference
in Fortran, to their values if needed by the underlying C routines.
The interface also takes into account Fortran character string
representations and the various naming
conventions that different Fortran compilers use to call C functions.
The PVM communication model
assumes that any task can send a message to any other
PVM task and that there is no limit to the size or number of
such messages. While all hosts have physical memory limitations
that limit potential buffer space, the communication model
does not restrict itself to a particular machine's limitations
and assumes sufficient memory is available.
The PVM communication model provides asynchronous blocking send,
asynchronous blocking receive, and nonblocking receive functions.
In our terminology, a blocking send returns as soon as the
send buffer is free for reuse, and an asynchronous send does not
depend on the receiver calling a matching receive before the send
can return. There are options in PVM 3 that request that data
be transferred directly from task to task. In this case, if the message
is large, the sender may block until the receiver has called a matching receive.
A nonblocking receive immediately returns with either the data or
a flag that the data has not arrived, while
a blocking receive returns only when the data is in the receive buffer.
In addition to these point-to-point communication functions, the model supports
multicast to a set of tasks and broadcast to a user-defined group of tasks.
There are also functions to perform global max, global sum, etc.,
across a user-defined group of tasks.
Wildcards can be specified in the receive for the source and label,
allowing either or both of these contexts to be ignored.
A routine can be called to return information about received messages.
The PVM model guarantees that message order is preserved.
If task 1 sends message A to task 2 and then sends message B to task 2,
message A will arrive at task 2 before message B.
Moreover, if both messages arrive before task 2 does a receive,
then a wildcard receive will always return message A.
Message buffers are allocated dynamically. Therefore, the maximum message size
that can be sent or received is limited only by the amount
of available memory on a given host.
There is only limited flow control built into PVM 3.3.
PVM may give the user a can't get memory error
when the sum of incoming messages exceeds the available memory,
but PVM does not tell other tasks to stop sending to this host.
The routine pvm_mytid()
returns the TID of this process and can be called multiple times.
It enrolls this process into PVM if this is the first PVM call.
Any PVM system call (not just pvm_mytid) will enroll a task in PVM
if the task is not enrolled before the call,
but it is common practice to call pvm_mytid first to perform the enrolling.
The routine pvm_exit() tells the local pvmd that this process is leaving PVM.
This routine does not kill the process, which can continue to
perform tasks just like any other UNIX process.
Users typically call pvm_exit right before exiting their C programs
and right before STOP in their Fortran programs.
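As an illustration only (a minimal sketch; error handling is abbreviated and the printed format is arbitrary), a C task typically enrolls in and leaves PVM as follows:

    #include <stdio.h>
    #include "pvm3.h"

    int main()
    {
        int mytid = pvm_mytid();      /* enroll in PVM; a negative value means error */
        if (mytid < 0) {
            fprintf(stderr, "could not enroll in PVM\n");
            return 1;
        }
        printf("enrolled with tid t%x\n", mytid);

        /* ... PVM work would go here ... */

        pvm_exit();                   /* tell the local pvmd we are leaving PVM */
        return 0;                     /* the process continues as a normal Unix process */
    }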
The routine pvm_spawn()
starts up ntask copies of an executable file task
on the virtual machine.
argv is a pointer to an array of arguments to task
with the end of the array specified by NULL.
If task takes no arguments, then argv is NULL.
The flag argument is used to specify options, and is a sum of:
These names are predefined in pvm3/include/pvm3.h.
In Fortran all the names are predefined in
parameter statements which can be found in the include file
pvm3/include/fpvm3.h.
PvmTaskTrace is a new feature in PVM 3.3.
It causes spawned tasks to generate trace events.
PvmTaskTrace is used by XPVM (see Chapter 8). Otherwise, the user must
specify where the trace events are sent in pvm_setopt().
On return, numt is set to the number of tasks successfully spawned
or an error code if no tasks could be started.
If tasks were started,
then pvm_spawn() returns a vector of the spawned tasks' tids;
and if some tasks could not be started, the corresponding error codes
are placed in the last ntask - numt positions of the vector.
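For example, a parent might start several copies of a worker executable as sketched below; the executable name "worker" and the task count are assumptions for illustration, and PvmTaskDefault lets PVM choose the hosts.

    #include <stdio.h>
    #include "pvm3.h"

    #define NTASK 4

    int start_workers(void)
    {
        int tids[NTASK];
        /* Let PVM choose the hosts (PvmTaskDefault); the where argument is
           then ignored.  argv is NULL because "worker" takes no arguments. */
        int numt = pvm_spawn("worker", (char **)0, PvmTaskDefault,
                             "", NTASK, tids);
        if (numt < NTASK)
            /* error codes for the failed starts are in tids[numt..NTASK-1] */
            fprintf(stderr, "only %d of %d tasks started\n", numt, NTASK);
        return numt;
    }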
The pvm_spawn() call can also start tasks on multiprocessors.
In the case of the Intel iPSC/860 the following restrictions apply.
Each spawn call gets a subcube of size ntask and loads
the program task on all of these nodes.
The iPSC/860 OS has an allocation limit of 10 subcubes across all users,
so it is better to start a block of tasks on an iPSC/860
with a single pvm_spawn() call rather than several calls.
Two different blocks of tasks spawned separately on the iPSC/860 can
still communicate with each other as well as any other PVM tasks
even though they are in separate subcubes.
The iPSC/860 OS has a restriction that messages going from the nodes
to the outside world be less than 256 Kbytes.
The routine pvm_kill() kills some other PVM task identified by TID.
This routine is not designed to kill the calling task, which should be
accomplished by calling pvm_exit() followed by exit().
The default is to have PVM write the stderr and stdout of spawned tasks to
the log file /tmp/pvml.<uid>.
The routine pvm_catchout
causes the calling task to catch output
from tasks subsequently spawned.
Characters printed on stdout
or stderr
in children tasks
are collected by the pvmds
and sent in control messages to the parent task,
which tags each line and appends it to the specified file (in C)
or standard output (in Fortran).
Each of the prints is prepended with information
about which task generated the print, and the end of the print is marked
to help separate outputs coming from several tasks at once.
If pvm_exit is called by the parent while output collection is in effect,
it will block until all tasks sending it output have exited,
in order to print all their output.
To avoid this,
one can turn off
the output collection by calling pvm_catchout(0)
before calling pvm_exit.
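A sketch of the usual calling pattern in the parent (ordering relative to spawning and exiting is the point of the example):

    #include <stdio.h>
    #include "pvm3.h"

    void run_with_output_collection(void)
    {
        pvm_catchout(stdout);   /* collect children's stdout/stderr from now on */

        /* ... pvm_spawn() child tasks and exchange messages here ... */

        pvm_catchout(0);        /* stop collection so pvm_exit() does not block
                                   waiting for the children to finish */
        pvm_exit();
    }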
New capabilities in PVM 3.3 include the ability to register special PVM tasks
to handle the jobs of adding new hosts, mapping tasks to hosts, and
starting new tasks. This creates an interface for advanced batch schedulers
(examples include Condor
[7], DQS
[6], and LSF
[4])
to plug into PVM and run PVM jobs in batch mode.
These register routines also create an interface for debugger writers to
develop sophisticated debuggers for PVM.
The routine names are pvm_reg_rm(), pvm_reg_hoster(), and
pvm_reg_tasker(). These are advanced functions not meant for the
average PVM user and thus are not presented in detail here.
Specifics can be found in Appendix B.
The routine pvm_parent()
returns the TID of the process that spawned this task
or the value of PvmNoParent if not created by pvm_spawn().
The routine pvm_tidtohost()
returns the TID dtid of the daemon running on the same host as TID.
This routine is useful for determining on which host a given task is running.
More general information about the entire virtual machine, including
the textual name of the configured hosts, can be obtained by using the
following functions:
The routine pvm_config()
returns information about the virtual machine including
the number of hosts, nhost,
and the number of different data formats, narch.
hostp is a pointer to a user-declared array
of pvmhostinfo structures.
The array should be of size at least nhost.
On return, each pvmhostinfo structure contains the pvmd TID,
host name, name of the architecture,
and relative CPU speed
for that host in the configuration.
The Fortran function returns information about one host
per call and cycles through all the hosts. Thus, if pvmfconfig
is called nhost times, the entire virtual machine will be represented.
The Fortran interface works by saving a copy of the hostp array
and returning one entry per call.
All the hosts must be cycled through before a new hostp array is obtained.
Thus, if the virtual machine is changing during these calls,
then the change will appear in the nhost and narch
parameters, but not in the host information.
Presently, there is no way to reset pvmfconfig() and force it
to restart the cycle when it is in the middle.
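A sketch of the C usage (the pvmhostinfo field names hi_name, hi_arch, hi_tid, and hi_speed are those commonly declared in pvm3.h; in PVM 3.3 the library itself supplies the hostp array, so only a pointer is declared here, but the local man page should be checked):

    #include <stdio.h>
    #include "pvm3.h"

    void print_configuration(void)
    {
        int nhost, narch, i;
        struct pvmhostinfo *hostp;

        if (pvm_config(&nhost, &narch, &hostp) < 0)
            return;
        printf("%d hosts, %d data formats\n", nhost, narch);
        for (i = 0; i < nhost; i++)
            printf("  %s (%s), pvmd tid t%x, speed %d\n",
                   hostp[i].hi_name, hostp[i].hi_arch,
                   hostp[i].hi_tid, hostp[i].hi_speed);
    }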
The routine pvm_tasks()
returns information about the PVM tasks running on the virtual machine.
The integer argument where specifies which tasks to return information about.
The present options are 0, which means all tasks;
a pvmd TID (dtid), which means tasks running on that host;
or a TID, which means just the given task.
The number of tasks is returned in ntask.
taskp is a pointer to an array of pvmtaskinfo structures.
The array is of size ntask.
Each pvmtaskinfo structure contains the TID, pvmd TID,
parent TID, a status flag, and the spawned file name.
(PVM doesn't know the file name of manually started tasks
and so leaves these blank.)
The Fortran function returns information about one task
per call and cycles through all the tasks. Thus, if where = 0, and
pvmftasks is called ntask times, all tasks will be represented.
The Fortran implementation assumes that the task pool is not changing
while it cycles through the tasks. If the pool changes, these
changes will not appear until the next cycle of ntask calls begins.
Examples of the use of pvm_config and pvm_tasks can be found in the
source to the PVM console, which is just a PVM task itself.
Examples of the use of the Fortran versions of these routines can be
found in the source pvm3/examples/testall.f.
The C routines add or delete a set of hosts in the virtual machine.
The Fortran routines add or delete a single host in the virtual machine.
In the Fortran routine info is returned as 1 or a status code.
In the C version info is returned as the number of
hosts successfully added.
The argument infos is an array of length nhost that
contains the status code for each individual host being added or deleted.
This allows the user to check whether only one of a set of hosts
caused a problem rather than trying to add or delete the entire
set of hosts again.
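A sketch of adding hosts from C (the host names are placeholders):

    #include <stdio.h>
    #include "pvm3.h"

    void grow_machine(void)
    {
        char *hosts[] = { "hostA", "hostB" };   /* placeholder host names */
        int infos[2];
        int info = pvm_addhosts(hosts, 2, infos);

        if (info < 2) {      /* info is the number of hosts successfully added */
            int i;
            for (i = 0; i < 2; i++)
                if (infos[i] < 0)
                    fprintf(stderr, "could not add %s (status %d)\n",
                            hosts[i], infos[i]);
        }
    }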
These routines are sometimes used to set up a virtual machine, but more
often they are used to increase the flexibility and fault tolerance
of a large application. These routines allow an application to increase
the available computing power (adding hosts) if it determines the
problem is getting harder to solve. One example of this would be
a CAD/CAM program where, during the computation, the finite-element grid
is refined, dramatically increasing the size of the problem.
Another use would be to increase the fault tolerance of an application
by having it detect the failure of a host and add a replacement host.
The routine pvm_sendsig() sends a signal signum to another PVM task
identified by TID.
The routine pvm_notify requests PVM to notify the caller on detecting
certain events.
The present options are as follows:
In response to a notify request, some number of messages (see Appendix B)
are sent by PVM back to the calling task.
The messages are tagged with the user supplied msgtag.
The tids array specifies who to monitor when using TaskExit or HostDelete.
The array contains nothing when using HostAdd.
If required, the routines
pvm_config and pvm_tasks can be used to obtain task and pvmd tids.
If the host on which task A is running fails, and task B
has asked to be notified if task A exits,
then task B will be notified even though the exit was caused indirectly
by the host failure.
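For example, a task might ask to be told when any of its workers exits (a sketch; the message tag value is arbitrary, and PvmTaskExit is the C constant corresponding to the TaskExit event):

    #include "pvm3.h"

    #define TAG_EXITED 99           /* arbitrary user-chosen message tag */

    void watch_workers(int *tids, int ntask)
    {
        /* Ask PVM to send us a TAG_EXITED message when any listed task exits. */
        pvm_notify(PvmTaskExit, TAG_EXITED, ntask, tids);

        /* Later, a pvm_recv(-1, TAG_EXITED) will return a message containing
           the tid of a task that has exited (or whose host has failed). */
    }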
The routine pvm_setopt
is a general-purpose function that allows the user to set or get options in the PVM
system. In PVM 3, pvm_setopt can be used to set several options, including
automatic error message printing, debugging level, and
communication routing method for all subsequent PVM calls.
pvm_setopt returns the previous value of the option in oldval.
In PVM 3.3 the argument what can have the following values:
The most popular use of pvm_setopt is to enable direct-route
communication between PVM tasks. As a general rule of thumb,
PVM communication bandwidth over a network roughly doubles when direct routing is enabled, as sketched below.
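A minimal sketch of that call (PvmRoute and PvmRouteDirect are the option constants defined in pvm3.h):

    #include "pvm3.h"

    void enable_direct_routing(void)
    {
        /* Ask for direct task-to-task TCP connections for subsequent sends.
           PVM falls back to routing through the pvmds whenever a direct
           connection cannot be established. */
        pvm_setopt(PvmRoute, PvmRouteDirect);
    }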
Sending a message comprises three steps in PVM.
First, a send buffer must be initialized by a call to pvm_initsend()
or pvm_mkbuf().
Second, the message must be ``packed'' into this buffer using
any number and combination of pvm_pk*() routines.
(In Fortran all message packing is done with the pvmfpack() subroutine.)
Third, the completed message is sent to another process by
calling the pvm_send() routine or multicast with the pvm_mcast() routine.
A message is received by calling either a blocking or nonblocking
receive routine and then ``unpacking'' each of the packed items from
the receive buffer. The receive routines can be set to
accept any message, or any message from a specified source, or
any message with a specified message tag,
or only messages with a given message tag from a given source.
There is also a probe function that returns whether a message has
arrived, but does not actually receive it.
If required, other receive contexts can be handled by PVM 3.
The routine pvm_recvf() allows users to define their own
receive contexts that will be used by the subsequent PVM receive routines.
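The three sending steps and a matching receive might look like the following sketch in C (the destination tid, the message tag, and the array size are placeholders; in practice the two halves would live in different tasks):

    #include "pvm3.h"

    void send_and_receive(int desttid)
    {
        int i, n = 10, data[10], reply[10];
        for (i = 0; i < 10; i++) data[i] = i;

        /* sender side: initialize buffer, pack, send */
        pvm_initsend(PvmDataDefault);     /* XDR encoding works across hosts */
        pvm_pkint(&n, 1, 1);              /* pack the count ...               */
        pvm_pkint(data, n, 1);            /* ... then the array itself        */
        pvm_send(desttid, 1);             /* message tag 1 (arbitrary)        */

        /* receiver side: blocking receive, then unpack in the same order */
        pvm_recv(-1, 1);                  /* any source, tag 1 */
        pvm_upkint(&n, 1, 1);
        pvm_upkint(reply, n, 1);
    }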
If the user is using only a single
send buffer (and this is the typical case)
then pvm_initsend() is the only required buffer routine.
It is called before
packing a new message into the buffer.
The routine pvm_initsend clears the send buffer
and creates a new one for packing a new message. The encoding scheme
used for this packing is set by encoding.
The new buffer identifier is returned in bufid.
The encoding options are as follows:
The following message buffer routines are required only if the
user wishes to manage multiple message buffers inside an application.
Multiple message buffers are not required for most message passing
between processes.
In PVM 3 there is one active send buffer and one active
receive buffer per process at any given moment. The developer may
create any number of message buffers and switch between them
for the packing and sending of data. The packing, sending, receiving,
and unpacking routines affect only the active buffers.
The routine pvm_mkbuf creates a new empty send buffer
and specifies the encoding method used for packing messages.
It returns a buffer identifier bufid.
The routine pvm_freebuf() disposes of the buffer with identifier bufid.
This should be done after a message has been sent and is no longer needed.
Call pvm_mkbuf() to create a buffer for a new message if required.
Neither of these calls is required when using pvm_initsend(),
which performs these functions for the user.
pvm_getsbuf() returns the active send buffer identifier.
pvm_getrbuf() returns the active receive buffer identifier.
These routines set the active send (or receive) buffer to bufid,
save the state of the previous buffer,
and return the previous active buffer identifier oldbuf.
If bufid is set to 0 in pvm_setsbuf() or pvm_setrbuf(),
then the present buffer is saved and there is no active buffer.
This feature can be used to save the present state of an application's
messages so that a math library or graphical interface which also
uses PVM messages will not interfere with the state of the application's
buffers. After they complete, the application's buffers can be reset
to active.
It is possible to forward messages without repacking them by using
the message buffer routines. This is illustrated by the following fragment.
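A hedged sketch of such a forwarding fragment (source, destination, and tag are placeholders):

    #include "pvm3.h"

    void forward(int src, int dst, int tag)
    {
        int bufid, oldid;

        bufid = pvm_recv(src, tag);   /* receive the message                        */
        oldid = pvm_setsbuf(bufid);   /* make the received buffer the active
                                         send buffer                                */
        pvm_send(dst, tag);           /* send it on without repacking               */
        pvm_freebuf(oldid);           /* dispose of the previous send buffer        */
    }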
Each of the following C routines packs an array of the given data type
into the active send buffer.
They can be called multiple times to pack data into a single message.
Thus, a message can contain several arrays each with a different data type.
C structures must be passed by packing their individual elements.
There is no limit to the complexity of the packed messages, but
an application should unpack the messages exactly as they were packed.
Although this is not strictly required, it is a safe programming practice.
The arguments for each of the routines are a pointer to the first item to
be packed, nitem which is the total number of items to pack from
this array, and stride which is the stride to use when packing.
A stride of 1 means a contiguous vector is packed,
a stride of 2 means every other item is packed, and so on.
An exception is pvm_pkstr() which by definition packs a NULL terminated
character string and thus does not need nitem or stride arguments.
PVM also supplies a packing routine that uses a printf-like format expression
to specify what data to pack and how to pack it into the send buffer.
All variables are passed as addresses if count and stride are specified;
otherwise, variables are assumed to be values.
A description of the format syntax is given in Appendix B.
A single Fortran subroutine handles all the packing functions
of the above C routines.
The argument xp is the first item of the array to be packed.
Note that in Fortran the number of characters in a string
to be packed must be specified in nitem.
The integer what specifies the type of data to be packed.
The supported options are as follows:
These names have been predefined in parameter statements in
the include file
pvm3/include/fpvm3.h.
Some vendors may extend this list to include 64-bit architectures
in their PVM implementations. We will be adding INTEGER8, REAL16, etc.,
as soon as XDR
support for these data types is available.
The routine pvm_send() labels the message
with an integer identifier msgtag
and sends it immediately to the process TID.
The routine pvm_mcast() labels the message
with an integer identifier msgtag
and broadcasts the message to all tasks specified in the
integer array tids (except itself).
The tids array is of length ntask.
The routine pvm_psend() packs and sends an array of the specified datatype
to the task identified by TID.
The defined datatypes for Fortran are the same as for pvmfpack().
In C the type argument can be any of the following:
PVM contains several methods of receiving messages at a task.
There is no function matching in PVM; for example, a pvm_psend
does not have to be matched with a pvm_precv. Any of the following routines
can be called for any incoming message no matter how it was sent
(or multicast).
This blocking receive routine will wait
until a message with label msgtag has arrived from TID.
A value of -1 in msgtag or TID matches anything (wildcard).
It then places the message in a new active receive buffer that is created.
The previous active receive buffer
is cleared unless it has been saved with a pvm_setrbuf() call.
If the requested message has not arrived,
then the nonblocking receive pvm_nrecv() returns bufid = 0.
This routine can be called multiple times for the same message
to check whether it has arrived, while performing useful work between calls.
When no more useful work can be performed, the blocking receive pvm_recv()
can be called for the same message.
If a message with label msgtag has arrived from TID,
pvm_nrecv() places this message in a new active receive buffer
(which it creates) and returns the ID of this buffer.
The previous active receive buffer
is cleared unless it has been saved with a pvm_setrbuf() call.
A value of -1 in msgtag or TID matches anything (wildcard).
If the requested message has not arrived,
then pvm_probe() returns bufid = 0.
Otherwise, it returns a bufid for the message, but does not ``receive'' it.
This routine can be called multiple times for the same message
to check whether it has arrived, while performing useful work between calls.
In addition, pvm_bufinfo() can be called with the returned bufid
to determine information about the message before receiving it.
PVM also supplies a timeout version of receive. Consider the case
where a message is never going to arrive (because of error or failure);
the routine pvm_recv would block forever.
To avoid such situations,
the user may wish to give up after waiting for a
fixed amount of time. The routine pvm_trecv() allows the user
to specify a timeout period. If the timeout period is set very large,
then pvm_trecv acts like pvm_recv. If the timeout period is set to zero,
then pvm_trecv acts like pvm_nrecv. Thus, pvm_trecv fills the gap
between the blocking and nonblocking receive functions.
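For example (a sketch; the five-second timeout is arbitrary):

    #include <sys/time.h>
    #include "pvm3.h"

    int receive_with_timeout(int tid, int msgtag)
    {
        struct timeval tmout;
        tmout.tv_sec  = 5;               /* give up after 5 seconds */
        tmout.tv_usec = 0;

        /* Returns a positive bufid if a message arrived, 0 on timeout,
           and a negative value on error. */
        return pvm_trecv(tid, msgtag, &tmout);
    }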
The routine pvm_bufinfo() returns
msgtag, source TID, and length in bytes of the message
identified by bufid.
It can be used to determine the label and source of
a message that was received with wildcards specified.
The routine pvm_precv() combines the functions of a blocking receive and
unpacking the received buffer. It does not return a bufid.
Instead, it returns the actual values of TID, msgtag, and cnt.
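A sketch of the pack-and-send shortcut and its receiving counterpart (the peer tid, tag, and array size are placeholders; PVM_DOUBLE is one of the datatype constants in pvm3.h):

    #include "pvm3.h"

    void psend_precv_example(int peer)
    {
        int i, rtid, rtag, rcnt;
        double out[100], in[100];

        for (i = 0; i < 100; i++) out[i] = (double)i;

        /* pack and send 100 doubles in one call */
        pvm_psend(peer, 7, out, 100, PVM_DOUBLE);

        /* blocking receive and unpack; the actual source, tag, and count
           are returned in rtid, rtag, and rcnt */
        pvm_precv(-1, 7, in, 100, PVM_DOUBLE, &rtid, &rtag, &rcnt);
    }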
The routine pvm_recvf() modifies the receive context used by the
receive functions and can be used to extend PVM.
The default receive context is to match on source and message tag.
This can be modified to any user-defined comparison function.
(See Appendix B for an example of creating a probe function
with pvm_recvf().)
There is no Fortran interface routine for pvm_recvf().
The following C routines unpack (multiple) data types from
the active receive buffer.
In an application they should match their corresponding pack routines
in type, number of items, and stride.
nitem is the number of items of the given type to unpack, and
stride is the stride.
The routine pvm_unpackf() uses a printf-like format expression
to specify what data to unpack and how to unpack it from the receive buffer.
A single Fortran subroutine handles all the unpacking functions
of the above C routines.
The argument xp is the array to be unpacked into.
The integer argument what specifies the type of data to be unpacked.
(Same what options as for pvmfpack()).
The dynamic process group functions are built on top of the core PVM routines.
A separate library libgpvm3.a must be linked
with user programs that make use of any of the group functions.
The pvmd does not perform the group functions.
This task is handled by a group server that is automatically started
when the first group function is invoked.
There is some debate about how groups should be handled in a
message-passing interface. The issues include efficiency and reliability,
and there are tradeoffs between static versus dynamic groups.
Some people argue that only tasks in a group should be allowed to call group functions.
In keeping with the PVM philosophy, the group functions are designed
to be very general and transparent to the user, at some cost in efficiency.
Any PVM task can join or leave any group at any time without having
to inform any other task in the affected groups. Tasks can broadcast
messages to groups of which they are not a member.
In general, any PVM task may call any of the following group functions
at any time.
The exceptions are pvm_lvgroup(), pvm_barrier(), and pvm_reduce(),
which by their nature require the calling task to be a member
of the specified group.
These routines allow a task to join or leave a user named group.
The first call to pvm_joingroup() creates a group with name group
and puts the calling task in this group.
pvm_joingroup()
returns the instance number (inum) of the process in this group.
Instance numbers run from 0 to the number of group members minus 1.
In PVM 3, a task can join multiple groups.
If a process leaves a group and then rejoins it, that process may receive
a different instance number.
Instance numbers are recycled so a task joining a group will get
the lowest available instance number. But if multiple tasks are
joining a group, there is no guarantee that a task will be assigned
its previous instance number.
To assist the user in maintaining a continuous set of
instance numbers despite joining and leaving, the pvm_lvgroup()
function does not return until the task is confirmed to have left.
A pvm_joingroup() called after this return will assign the vacant
instance number to the new task.
It is the user's responsibility to maintain a contiguous set of
instance numbers if the algorithm requires it. If several tasks
leave a group and no tasks join, then there will be gaps in the
instance numbers.
The routine pvm_gettid() returns the TID of the process with a
given group name and instance number.
pvm_gettid() allows two tasks with no knowledge of each other
to get each other's TID simply by joining a common group.
The routine pvm_getinst()
returns the instance number of TID in the specified group.
The routine pvm_gsize()
returns the number of members in the specified group.
On calling pvm_barrier() the
process blocks until count members of a group have called pvm_barrier.
In general count should be the total number of members of the group.
A count is required because with dynamic process groups
PVM cannot know how many members are in a group at a given instant.
It is an error for a process to call pvm_barrier with a group of which it is
not a member. It is also an error if the count arguments across a given
barrier call do not match.
For example it is an error if one member of a group calls pvm_barrier()
with a count of 4, and another member calls pvm_barrier() with a count
of 5.
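A sketch of how a set of tasks might coordinate through a named group (the group name and size are placeholders; programs using these calls must be linked with libgpvm3.a):

    #include <stdio.h>
    #include "pvm3.h"

    #define NPROC 4                       /* assumed number of cooperating tasks */

    void synchronize(void)
    {
        int inum, master;

        inum = pvm_joingroup("worker");   /* instance numbers start at 0 */

        /* block until all NPROC members have called pvm_barrier */
        pvm_barrier("worker", NPROC);

        /* translate (group name, instance number) into a tid */
        master = pvm_gettid("worker", 0);
        printf("instance %d sees instance 0 as t%x\n", inum, master);

        pvm_lvgroup("worker");            /* leave the group when finished */
    }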
pvm_bcast() labels the message with an integer identifier msgtag
and broadcasts the message to all tasks in the specified group
except itself (if it is a member of the group).
For pvm_bcast() ``all tasks'' is defined to be those tasks
the group server thinks are in the group when the routine is called.
If tasks join the group during a broadcast, they may not receive
the message. If tasks leave the group during a broadcast, a copy of the
message will still be sent to them.
pvm_reduce() performs a global arithmetic operation across the group,
for example, global sum or global max
.
The result of the reduction operation
appears on root.
PVM supplies four predefined functions that the user can place in func.
These are
In addition users can define their own global operation function to place in
func. See Appendix B for details. An example is given in the source
code for PVM.
For more information see PVM_ROOT/examples/gexamples.
Note: pvm_reduce() does not block. If a task calls pvm_reduce and then
leaves the group before the root has called pvm_reduce, an error may occur.
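For instance, a global sum over a group might be written as sketched below (PvmSum is one of the predefined functions; the tag, group name, and root instance are placeholders):

    #include "pvm3.h"

    void global_sum(double *partial, int n)
    {
        /* Every group member calls pvm_reduce with the same arguments;
           the summed result replaces partial[] on the root (instance 0). */
        pvm_reduce(PvmSum, partial, n, PVM_DOUBLE, 11, "worker", 0);
    }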
In this chapter we discuss several complete PVM programs in
detail. The first example, forkjoin.c, shows how to spawn off
processes and synchronize with them. The second example
discusses a Fortran dot product program, PSDOT.F. The third example, failure.c,
demonstrates how the user can use PVM's notification facilities to find out
when tasks exit or fail.
Our first example demonstrates how to spawn off PVM tasks and synchronize
with them. The program spawns several tasks, three by default. The
children then synchronize by sending a message to their parent task.
The parent receives a message from each of the spawned tasks and prints
out information about the message from the child tasks.
The fork-join program contains the code for both the parent and the child
tasks. Let's examine it in more detail. The very first thing the
program does is call
Assuming we obtained a valid result for mytid, we now call
Let's examine the code executed by the parent task. The number of
tasks is taken from the command line as argv[1]. If the number of
tasks is not legal, then we exit the program, calling
The
For each child task, the parent receives a message and prints out
information about that message. The
The last segment of code in forkjoin will be executed by the child
tasks. Before placing data in a message buffer, the buffer must be
initialized by calling
Figure
shows the output of running forkjoin. Notice that
the order in which the messages were received is nondeterministic. Since the
main loop of the parent processes messages on a first-come, first-served
basis, the order of the prints is simply determined by the time it takes
messages to travel from the child tasks to the parent.
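A condensed sketch in the spirit of forkjoin.c follows (the executable name "forkjoin" and the join tag are assumptions, and error checking is omitted):

    #include <stdio.h>
    #include "pvm3.h"

    #define NCHILD 3
    #define JOINTAG 11

    int main()
    {
        int mytid = pvm_mytid();
        int tids[NCHILD];

        if (pvm_parent() == PvmNoParent) {          /* we are the parent */
            int i, numt;
            numt = pvm_spawn("forkjoin", (char **)0, PvmTaskDefault,
                             "", NCHILD, tids);
            for (i = 0; i < numt; i++) {            /* join: one message per child */
                int who;
                pvm_recv(-1, JOINTAG);
                pvm_upkint(&who, 1, 1);
                printf("received join from t%x\n", who);
            }
        } else {                                    /* we are a spawned child */
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&mytid, 1, 1);
            pvm_send(pvm_parent(), JOINTAG);
        }
        pvm_exit();
        return 0;
    }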
Here we show a simple Fortran program, PSDOT, for computing a dot
product. The program computes the dot product of arrays, X and Y.
First PSDOT calls PVMFMYTID() and PVMFPARENT(). The PVMFPARENT call
will return PVMNOPARENT if the task wasn't spawned by another PVM
task. If this is the case, then PSDOT is the master and must spawn
the other worker copies of PSDOT. PSDOT then asks the user for the
number of processes to use and the length of vectors to compute. Each
spawned process will receive n/nproc elements of X and Y, where n is
the length of the vectors and nproc is the number of processes being
used in the computation. If nproc does not divide n evenly, then
the master will compute the dot product on the extra elements. The
subroutine SGENMAT randomly generates values for X and Y. PSDOT then
spawns nproc - 1 copies of itself and sends each new task a part of
the X and Y arrays. The message contains the length of the subarrays
in the message and the subarrays themselves. After the master spawns
the worker processes and sends out the subvectors, the master then
computes the dot product on its portion of X and Y. The master process
then receives the other local dot products from the worker processes.
Notice that the PVMFRECV call uses a wildcard (-1) for the task id
parameter. This indicates that a message from any task will
satisfy the receive. Using the wildcard in this manner results in a
race condition. In this case the race condition does not cause a
problem since addition is commutative. In other words, it doesn't
matter in which order we add the partial sums from the workers.
Unless one is certain that the race will not have an adverse effect on
the program, race conditions should be avoided.
Once the master receives all the local dot products and sums them into
a global dot product, it then calculates the entire dot product locally.
These two results are then subtracted, and the difference between
the two values is printed.
A small difference can be expected because of the variation
in floating-point roundoff errors.
If the PSDOT program is a worker then it receives a message from
the master process containing subarrays of X and Y. It calculates
the dot product of these subarrays and sends the result back to the
master process. In the interests of brevity we do not include the
SGENMAT and SDOT subroutines.
The failure example demonstrates how one can
kill tasks and how one can find out when tasks exit or fail. For this
example we spawn several tasks, just as we did in the previous
examples. One of these unlucky tasks gets killed by the parent. Since
we are interested in finding out when a task fails, we call pvm_notify() to ask to be notified when tasks exit.
After requesting notification, the parent task then kills one of the
children; in this case, one of the middle children is killed. The call
to pvm_kill() terminates the chosen task.
In our next example we program a matrix-multiply algorithm described by Fox
et al. in
[5]. The mmult program can be found at the end of this
section.
The mmult program will calculate C = AB, where
C, A, and B are all square matrices. For simplicity we
assume that m x m tasks will be used to calculate the solution.
Each task will calculate a subblock of the resulting matrix C. The
block size and the value of m is given as a command line argument to
the program. The matrices A and B are also stored as blocks distributed
over the
tasks. Before delving into the details of the program,
let us first describe the algorithm at a high level.
Assume we have an m x m grid of tasks. Each task tij
(where 0 <= i,j < m) initially contains blocks Cij, Aij, and Bij.
In the first step of the algorithm the tasks on the diagonal
(tij where i = j) send their block Aii to all the other tasks
in row i. After the transmission of Aii, all tasks calculate Aii x Bij
and add the result into Cij. In the next
step, the column blocks of B are rotated. That is, tij sends
its block of B to t(i-1)j. (Task t0j sends its B block
to t(m-1)j.) The tasks now return to the first step;
the next diagonal block of A is multicast to all other tasks in row i, and the
algorithm continues. After m iterations the C matrix contains A x B, and the B matrix has been rotated back into place.
Let's now go over the matrix multiply as it is programmed in PVM. In
PVM there is no restriction on which tasks may communicate with which
other tasks. However, for this program we would like to think of the
tasks as a two-dimensional conceptual torus. In order to enumerate the
tasks, each task joins the group mmult. Group ids are used to
map tasks to our torus. The first task to join a group is given the
group id of zero. In the mmult program, the task with group id zero
spawns the other tasks and sends the parameters for the matrix multiply
to those tasks. The parameters are m and bklsize: the square root of
the number of blocks and the size of a block, respectively. After all the
tasks have been spawned and the parameters transmitted, pvm_barrier() is called to make sure all the tasks have joined the group.
After the barrier, we store the task ids for the other tasks in our
``row'' in the array myrow. This is done by calculating the
group ids for all the tasks in the row and asking PVM for the task
id for the corresponding group id. Next we allocate the blocks for the
matrices using malloc(). In an actual application program we would
expect that the matrices would already be allocated. Next the program
calculates the row and column of the block of C it will be computing.
This is based on the value of the group id. The group ids range from
0 to m - 1 inclusive. Thus the integer division of (mygid/m) will
give the task's row and (mygid mod m) will give the column, if we assume
a row major mapping of group ids to tasks. Using a similar mapping, we
calculate the group id of the task directly above and below
in the torus and store their task ids in up and down,
respectively.
Next the blocks are initialized by calling InitBlock(). This function
simply initializes A to random values, B to the identity matrix, and C
to zeros. This will allow us to verify the computation at the end of the
program by checking that A = C.
Finally we enter the main loop to calculate the matrix multiply. First
the tasks on the diagonal multicast their block of A to the other tasks
in their row. Note that the array myrow actually contains the
task id of the task doing the multicast. Recall that
After the subblocks have been multiplied and added into the C block, we
now shift the B blocks vertically. Specifically, we pack the block
of B into a message, send it to the up task id, and then
receive a new B block from the down task id.
Note that we use different message tags for sending the A blocks and the
B blocks as well as for different iterations of the loop. We also fully
specify the task ids when doing a pvm_recv().
Once the computation is complete, we check to see that A = C, just to verify
that the matrix multiply correctly calculated the values of C. This check would
not be done in a matrix multiply library routine, for example.
It is not necessary to call pvm_lvgroup() before exiting, since PVM removes
a task from its groups when the task exits.
Our final example calculates heat diffusion through a wire. The problem is
governed by the one-dimensional heat equation du/dt = d^2u/dx^2
and a discretization of the form
(u(x, t+dt) - u(x, t)) / dt = (u(x+dx, t) - 2 u(x, t) + u(x-dx, t)) / dx^2,
giving the explicit formula
u(x, t+dt) = u(x, t) + (dt/dx^2) (u(x+dx, t) - 2 u(x, t) + u(x-dx, t)),
with initial and boundary conditions that hold the temperature at the ends of the wire at zero.
The pseudo code for this computation is as follows:
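As a rough C sketch of the explicit update that each slave applies to its subsection of the wire (array layout and names are illustrative, not taken from heatslv.c):

    /* One time step of the explicit scheme on a subsection of the wire.
       u_old and u_new hold nx interior points plus one boundary value
       at each end (indices 0 and nx+1). */
    void step(double *u_old, double *u_new, int nx, double dt, double dx)
    {
        int i;
        double r = dt / (dx * dx);

        for (i = 1; i <= nx; i++)
            u_new[i] = u_old[i] + r * (u_old[i+1] - 2.0 * u_old[i] + u_old[i-1]);

        /* boundary values u_new[0] and u_new[nx+1] come from neighbor tasks
           (or are held at zero at the ends of the wire) */
    }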
For this example we will use a master-slave programming model. The master,
heat.c, spawns five copies of the program heatslv.
The slaves compute the heat diffusion for subsections of the wire in parallel.
At each time step the slaves exchange boundary information, in this case
the temperature of the wire at the boundaries between processors.
Let's take a closer look at the code. In heat.c the
array solution will hold
the solution for the heat diffusion equation at each time step. This
array will be output at the end of the program in xgraph format. (xgraph
is a program for plotting data.)
First the heatslv tasks are spawned.
Next, the initial data set is computed. Notice that the ends of the wires are
given initial temperature values of zero.
The main part of the program is then executed four times, each with
a different value for the time step. A timer is used to compute the
elapsed time of each compute phase. The initial data sets are sent to the
heatslv tasks. The left and right neighbor task ids are sent along with
the initial data set. The heatslv tasks use these to communicate boundary
information. (Alternatively, we could have used the PVM group calls to
map tasks to segments of the wire. By using the group calls we would
have avoided explicitly communicating the task ids to the slave processes.)
After sending the initial data, the master process simply waits for results.
When the results arrive, they are integrated into the solution matrix,
the elapsed time is calculated, and the solution is written out to the
xgraph file.
Once the data for all four phases has been computed and stored, the
master program prints out the elapsed times and kills the slave processes.
The heatslv programs do the actual computation of the heat diffusion through
the wire. The slave program consists of an infinite loop that receives
an initial data set, iteratively computes a solution based on this data set
(exchanging boundary information with neighbors on each iteration),
and sends the resulting partial solution back to the master process.
Rather than using an infinite loop in the slave tasks, we could send
a special message to the slave ordering it to exit. To avoid complicating
the message passing, however, we simply use the infinite loop in the slave tasks
and kill them off from the master program. A third option would be
to have the slaves execute only once, exiting after processing a single
data set from the master. This would require placing the master's spawn
call inside the main for loop of heat.c. While this option would work,
it would needlessly add overhead to the overall computation.
For each time step and before each compute phase, the boundary values
of the temperature matrix are exchanged. The left-hand boundary elements
are first sent to the left neighbor task and received from the right
neighbor task. Symmetrically, the right-hand boundary elements
are sent to the right neighbor and then received from the left neighbor.
The task ids for the neighbors are checked to make sure no attempt is
made to send or receive messages to nonexistent tasks.
In this chapter we have given a variety of example programs written in
Fortran and C. These examples demonstrate various ways of writing
PVM programs. Some break the code into two separate programs, while
others use a single program with conditionals to handle spawning and
computing phases. These examples show different styles of
communication, both among worker tasks and between worker and master
tasks. In some cases messages are used for synchronization; in
others the master process simply kills off the workers when they are no
longer needed. We hope that these examples can be used as a basis for
better understanding how to write PVM programs and
for appreciating
the design tradeoffs
involved.
PVM is an ongoing research project. As such,
we provide limited
support.
We welcome
feedback on this book and other aspects of the system to help in enhancing PVM.
Please send comments and questions
to
In this chapter we describe the implementation of the PVM software
and the reasons behind the basic design decisions.
The most important goals for PVM 3
are fault tolerance,
scalability,
heterogeneity,
and portability.
PVM is able to withstand host and network failures.
It doesn't automatically recover an application after a crash,
but it does provide polling and notification primitives
to allow fault-tolerant applications to be built.
The virtual machine is dynamically reconfigurable.
This property goes hand in hand with fault tolerance: an application may need
to acquire more resources in order to continue running once a host
has failed.
Management is as decentralized and localized as possible,
so
virtual machines should be able to scale to hundreds of
hosts and run thousands of tasks.
PVM can connect computers of different types
in a single session.
It runs with minimal modification on any
flavor of Unix
or an operating system with comparable facilities
(multitasking, networkable).
The programming interface is simple but complete,
and
any user can install the package without special
privileges.
To allow PVM to be
highly portable,
we
avoid the use of
operating system and language features that would be
hard to retrofit if unavailable,
such as multithreaded processes and
asynchronous I/O.
These exist in many
versions of Unix,
but they vary enough from product to product
that different versions of PVM might need to be maintained.
The generic port is kept as simple as possible,
though
PVM can always be
optimized for any particular machine.
We
assume that sockets are used for interprocess communication
and that each host in a virtual machine group can connect directly
to every other host via
TCP
[9] and UDP
[10]
protocols.
The requirement of full IP connectivity could be removed
by specifying message routes and using the pvmds to forward
messages.
Some multiprocessor machines
don't make sockets available on the
processing nodes,
but do have them on the front-end (where the pvmd runs).
PVM uses a task identifier (TID) to address pvmds,
tasks, and groups of tasks within a virtual machine.
The TID contains four fields, as shown in Figure
.
Since the TID is used so heavily,
it is made to fit into the largest integer data type (32 bits) available
on a wide range of machines.
The fields S, G, and H have global meaning: each
pvmd of a virtual machine interprets them in the same way.
The H field contains a host number relative to the virtual machine.
As it starts up,
each pvmd is configured with a unique host number
and therefore ``owns''
part of the TID address space.
The maximum number of hosts in a virtual machine is limited to
2^12 - 1 (4095).
The mapping between host numbers and hosts is known to each pvmd,
synchronized by a global host table.
Host number zero is used,
depending on context,
to refer to the local pvmd
or a shadow pvmd, called pvmd' (Section
).
The S bit is used to address pvmds,
with the H field set to the host number
and the L field cleared.
This bit is a historical leftover
and causes slightly schizoid naming; sometimes
pvmds are addressed with the S bit cleared.
It should someday be reclaimed to make the H or L space
larger.
Each pvmd is allowed to assign private meaning to the L field
(with the H field set to its own host number),
except that ``all bits cleared''
is reserved to mean the pvmd itself.
The L field is 18 bits wide,
so up to 2^18 - 1 tasks can exist concurrently on each host.
In the generic Unix port,
L values are assigned by a counter,
and
the pvmd maintains a map between L values and Unix process id's.
Use of the L field in multiprocessor ports is described in Section
.
The G bit is set to form multicast addresses (GIDs),
which refer to groups of tasks.
Multicasting is described in Section
.
The design of the TID enables the implementation to meet the design
goals.
Tasks can be assigned TIDs by their local pvmds
without off-host communication.
Messages can be routed from anywhere in a virtual machine to anywhere else,
by hierarchical naming.
Portability is enhanced because the L field can be redefined.
Finally, space is reserved for error codes.
When a function can return a vector of TIDs mixed with error codes, it
is useful if the error codes don't correspond to legal TIDs.
The TID space is divided up as follows:
Naturally, TIDs are
intended to be opaque to the application,
and the programmer should not attempt to predict their values or modify
them without using functions supplied in the programming library.
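Purely as an illustration of the layout described above, and not something application code should do, the fields could be extracted with masks derived from the stated widths (these macros are illustrative, not taken from the PVM source):

    /* 32-bit TID layout per the text: S bit, G bit, 12-bit H field, 18-bit L field */
    #define TID_L(tid)   ((tid) & 0x3ffff)          /* low 18 bits: local part   */
    #define TID_H(tid)   (((tid) >> 18) & 0xfff)    /* next 12 bits: host number */
    #define TID_G(tid)   (((tid) >> 30) & 1)        /* multicast (group) bit     */
    #define TID_S(tid)   (((tid) >> 31) & 1)        /* pvmd-address bit          */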
More symbolic naming
can be obtained by using a name server library layered on top
of the raw PVM calls,
if the convenience is deemed worth the
cost of name lookup.
PVM assigns an architecture name to each kind of machine
on which it runs,
to
distinguish between machines that
run different executables,
because of hardware
or operating system
differences.
Many standard names are defined,
and others can be added.
Sometimes machines with incompatible executables use the same binary
data representation.
PVM takes advantage of this to avoid data conversion.
Architecture names are mapped to data encoding numbers,
and the
encoding numbers are used to determine when it is necessary to convert.
PVM daemons and tasks can compose and send
messages of arbitrary lengths
containing typed data.
The data can be converted
using
XDR
[12]
when passing between hosts
with incompatible data formats.
Messages are tagged at send time with a user-defined integer code
and can be selected for receipt by source address or tag.
The sender of a message does not wait for an acknowledgment from the receiver,
but continues as soon as the message has been handed to the network
and
the message buffer can be
safely deleted or reused.
Messages are buffered at the receiving end until received.
PVM reliably delivers messages,
provided the destination exists.
Message order from each sender to each receiver
in the system is preserved;
if one entity sends several messages to another,
they will be received in the same order.
Both blocking and nonblocking receive primitives are provided,
so a task can wait for a message without (necessarily)
consuming processor time by polling for it.
Or,
it can poll for a message without hanging.
A receive with timeout is also provided,
which returns after a specified time if no message has arrived.
No acknowledgments are used between sender and receiver.
Messages are reliably delivered and buffered by the system.
If we ignore fault recovery,
then either an application will run to completion or,
if some component goes down, it won't.
In order to provide fault recovery,
a task A must be prepared for another task B
(from which it wants a message) to crash,
and must be able to take corrective action.
For example,
it might reschedule its request to a different server,
or even start a new server.
From the viewpoint of A,
it doesn't matter specifically when B crashes
relative to messages sent from A.
While waiting for B,
task A will receive either a message from B
or notification that B has crashed.
For the purposes of flow control,
a fully blocking send can easily be built using the semi-synchronous
send primitive.
PVM provides notification messages as a means to implement
fault recovery in an application.
A task can request that the system send a message on
one of the following three events:
Notify requests are stored in the pvmds,
attached to objects they monitor.
Requests for remote events (occurring on a different host than
the requester)
are kept on both hosts.
The remote pvmd sends the message if the event occurs,
while the local one
sends the message if the remote host goes down.
The assumption is that a local pvmd can be trusted;
if it goes down,
tasks running under it won't be able to do anything,
so they don't need to be notified.
One pvmd runs on each host of a virtual machine.
Pvmds owned by (running as) one user do not interact with those
owned by others,
in order to reduce security risk,
and minimize the impact of one PVM user on another.
The pvmd
serves as a message
router and controller.
It provides
a point of contact,
authentication,
process control, and
fault detection.
An idle pvmd occasionally checks that its peers are still running.
Even if application programs
crash,
pvmds continue to run,
to aid in debugging.
The first pvmd (started by hand) is designated
the master,
while the others (started by the master) are called slaves.
During normal operation,
all are considered equal.
But only the master can start new slaves
and add them to the configuration.
Reconfiguration requests originating on a slave host
are forwarded to the master.
Likewise, only the master can forcibly delete hosts from the machine.
The libpvm library
allows a task to interface with the pvmd and other tasks.
It contains functions for
packing (composing) and unpacking messages,
and functions to
perform PVM syscalls by
using the message functions to send
service requests to the pvmd.
It is made as small and simple as possible.
Since it shares an address space with unknown, possibly buggy,
code, it can be broken or subverted.
Minimal sanity-checking of parameters is performed,
leaving further authentication to the pvmd.
The top level of the libpvm library,
including most of the programming interface functions,
is written in a machine-independent style.
The bottom level is kept separate and can be modified or
replaced with a new machine-specific file when porting PVM to a
new environment.
We gratefully acknowledge the valuable assistance of many
people who have contributed to the PVM project.
In particular, we thank
Peter Rigsbee and
Neil Lincoln
for their help and insightful comments.
We thank the PVM group at the
University of Tennessee and Oak Ridge National Laboratory-Carolyn
Aebischer,
Martin Do,
June Donato,
Jim Kohl,
Keith Moore,
Phil Papadopoulos,
and
Honbo Zhou-for
their assistance with the development of various pieces and components of PVM.
In addition we express appreciation to all those who
helped in the preparation of this work, in particular to
Clint Whaley and Robert Seccomb for help on the examples,
Ken Hawick for contributions to the glossary, and Gail Pieper for
helping with the task of editing the manuscript.
A number of computer vendors have encouraged and provided valuable
suggestions during the development of PVM.
We thank
Cray Research Inc.,
IBM,
Convex Computer,
Silicon Graphics,
Sequent Computer,
and Sun Microsystems
for their assistance in porting the software to their platforms.
This work would not have been possible without the support of
the Office of Scientific Computing,
U.S. Department of Energy, under Contract
DE-AC05-84OR21400;
the National Science
Foundation Science and Technology Center Cooperative
Agreement No. CCR-8809615;
and
the Science Alliance, a state-supported program
at the University of Tennessee.
The pvmd and libpvm manage message buffers,
which potentially hold large amounts of dynamic data.
Buffers need to be shared efficiently,
for example, to attach a multicast message to several send queues
(see Section
).
To avoid copying,
all pointers are to a single instance of the data (a databuf),
which is
refcounted
by
allocating a few extra bytes for
an integer at the head of the data.
A pointer to the data itself is passed around,
and routines subtract
from it to access the refcount or free the block.
When the refcount of a databuf decrements to zero,
it is freed.
PVM messages are composed
without declaring a maximum length ahead of time.
The pack functions allocate memory in steps,
using databufs to store the data, and frag descriptors to
chain the databufs together.
A frag descriptor struct frag
holds a pointer (fr_dat) to a block of data
and its length (fr_len).
It also keeps a pointer (fr_buf) to the databuf
and its total length (fr_max);
these reserve space to prepend or append data.
Frags can also reference static (non-databuf) data.
A frag has link pointers so it
can be chained into a list.
Each frag keeps a count of references to it;
when the refcount decrements to zero,
the frag is freed
and the underlying databuf refcount is decremented.
In the case where a frag descriptor is the head of a list,
its refcount applies to the entire list.
When it reaches zero,
every frag in the list is freed.
Figure
shows a list of fragments storing a message.
Libpvm provides functions
to pack all of the
primitive data types into a message,
in one of several encoding formats.
There are five sets of encoders and decoders.
Each message buffer has a set associated with it.
When creating a new message,
the encoder set is determined by the format parameter
to pvm_mkbuf().
When receiving a message,
the decoders are determined by the encoding field of the message header.
The two most commonly used
ones pack data in raw (host native)
and default (XDR) formats.
Inplace encoders pack descriptors of the data
(the frags point to static data),
so the message is sent without copying the data to a buffer.
There are no inplace decoders.
Foo encoders
use a machine-independent format that is
simpler than XDR;
these encoders are used when communicating with the pvmd.
Alien decoders
are installed when a received message can't be
unpacked
because its encoding doesn't match the data format of the host.
A message in an alien data format can
be held or forwarded,
but any attempt to read data from it results in an error.
Figure
shows libpvm message management.
To allow the PVM programmer to handle
message buffers,
they are labeled with integer message id's (MIDs)
,
which are simply indices into the message heap.
When a message buffer is freed,
its MID is recycled.
The heap
starts out small and
is extended if it becomes full.
Generally,
only a few messages exist at any time,
unless an application explicitly stores them.
A vector of functions for encoding/decoding
primitive types (struct encvec) is
initialized when a message buffer is created.
To pack
a long integer,
the generic pack function pvm_pklong() calls
(message_heap[mid].ub_codef->enc_long)() of the buffer.
Encoder vectors
were used for speed (as opposed to having a case switch in
each pack function).
One drawback is that
every encoder for every format is touched (by naming it in the code),
so
the linker must include all the functions
in every executable,
even when they're not used.
By comparison with libpvm,
message packing in the pvmd
is very simple.
Messages are handled using struct mesg
(shown in Figure
).
There are
encoders for signed and unsigned integers and strings,
which use the libpvm foo format.
Integers occupy four bytes each, with
bytes in network order (bits 31..24 followed by bits 23..16, ...).
Byte strings are packed as an integer length
(including the terminating null for ASCII strings),
followed by the data
and zero to three null bytes to round the total length to a multiple of four.
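As an illustration of this layout (a sketch only, not the pvmd's actual code):

    #include <string.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* htonl() */

    /* Append a string to buf in the packing format described above: a 4-byte
       length (including the terminating null) in network byte order, the bytes
       themselves, then zero padding to a multiple of four.  Returns the number
       of bytes written. */
    int pack_string(unsigned char *buf, const char *s)
    {
        uint32_t len = (uint32_t)strlen(s) + 1;    /* include the terminating null */
        uint32_t nlen = htonl(len);
        uint32_t padded = (len + 3u) & ~3u;        /* round up to a multiple of 4 */

        memcpy(buf, &nlen, 4);
        memcpy(buf + 4, s, len);
        memset(buf + 4 + len, 0, padded - len);
        return (int)(4 + padded);
    }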
Messages for the pvmd
are reassembled from packets in loclinpkt()
if from a local task,
or in netinpkt() if from another pvmd or foreign task.
Reassembled messages are passed to one of three
entry points:
If the message tag and contents are valid,
a new thread of action is started to handle the request.
Invalid messages are discarded.
Control messages are sent to a task like regular messages,
but have tags in a reserved space
(between TC_FIRST and TC_LAST).
Normally,
when a task downloads a message,
it queues it for receipt by the program.
Control messages are instead passed
to
pvmmctl()
and then discarded.
Like the entry points in the pvmd,
pvmmctl() is an entry point in the task,
causing it to take some asynchronous action.
The main difference is that
control messages
can't be used to get the task's attention,
since it must be in mxfer(),
sending or receiving,
in order
to get them.
The following control message tags are defined.
The first three are used by the direct routing mechanism
(discussed in Section
).
TC_OUTPUT is used to implement
pvm_catchout() (Section
).
User-definable control messages may
be added in the future as a way of implementing
PVM signal handlers
.
At startup,
a pvmd
configures itself as a master or slave,
depending on its command line arguments.
It creates and binds sockets to talk to
tasks and other pvmds, and it
opens an error log file /tmp/pvml.uid.
A master pvmd
reads the host file
if supplied;
otherwise it uses default parameters.
A slave pvmd gets its parameters from the master pvmd
via the command line and configuration messages.
After configuration,
the pvmd enters a loop in function work().
At the core of the work loop is a call to select()
that probes all sources of input for the pvmd (local
tasks and the network).
Packets are
received and routed to send queues.
Messages to the pvmd are reassembled and passed to
the entry points.
A pvmd shuts down when it is deleted from the virtual machine,
killed (signaled),
loses contact with the master pvmd,
or breaks (e.g., with a bus error).
When a pvmd shuts down,
it takes two final actions.
First, it kills any tasks running under it,
with signal SIGTERM.
Second,
it sends a final shutdown message (Section
)
to every other pvmd in its
host table.
The other pvmds would eventually discover the missing one
by timing out trying to communicate with it,
but the shutdown message speeds the process.
A host table describes the configuration of a virtual machine.
It lists
the name,
address
and
communication state
for each host.
Figure
shows how a host table is built from
struct htab and struct hostd structures.
Host tables are issued by the master pvmd
and kept synchronized across the virtual machine.
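A hypothetical skeleton of these structures, reduced to the fields named in this section (the real struct hostd carries many more fields, such as queues, timers, and the MTU):

    struct hostd {
        int   hd_ref;            /* reference count, so the descriptor can be
                                    shared by several host tables (see below) */
        char *hd_name;           /* host name */
        long  hd_addr;           /* IP address */
        int   hd_state;          /* communication state */
    };

    struct htab {
        int   ht_serial;         /* which configuration this table describes */
        int   ht_count;          /* number of hosts */
        struct hostd **ht_hosts; /* one descriptor per host, possibly shared */
    };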
The delete operation is simple:
On receiving a
DM_HTDEL message from the master,
a pvmd calls
hostfailentry()
for each host listed in the message,
as though the deleted pvmds crashed.
Each pvmd can autonomously delete hosts
from its own table on finding them
unreachable (by timing out during communication).
The add operation is done
with a three-phase commit,
in order to guarantee global availability of new hosts
synchronously with completion of the add-host request.
This is described in
Section
.
Each host descriptor has a refcount
so it can be
shared by multiple host tables.
As the configuration of the machine changes,
the host descriptors (except those added and deleted, of course)
propagate from one host table to the next.
This propagation is necessary because they hold various state information.
Host tables also serve other uses:
They allow the pvmd to manipulate host sets,
for example, when picking candidate hosts on which to spawn a task.
Also,
the advisory host file supplied to the master pvmd is
parsed and stored in a host table.
If the master pvmd is started with a host file,
it parses the file into a host table, filehosts.
If some hosts in the file are to be started automatically,
the master
sends a
DM_ADD message to itself.
The slave hosts are started
just as though they had been added dynamically
(Section
).
Parallel processing,
the method of having many small tasks solve one large problem,
has emerged as a key enabling technology in modern computing.
The past several years have witnessed ever-increasing acceptance
and adoption of parallel processing, both for high-performance
scientific computing and for more ``general-purpose'' applications,
as a result of the demand for
higher performance, lower cost, and sustained productivity.
The acceptance has been facilitated by two
major developments: massively parallel processors
(MPPs)
and the widespread use of distributed computing.
MPPs are now
the most powerful computers in the world.
These machines combine a few hundred to a few thousand CPUs
in a single large cabinet connected to hundreds of gigabytes of memory.
MPPs offer enormous computational power and are used to
solve computational Grand Challenge problems
such as global climate modeling and drug design.
As simulations become more realistic, the computational power
required to produce them grows rapidly.
Thus, researchers on the cutting edge
turn to MPPs and parallel processing
in order to get the most computational power possible.
The second major development affecting scientific problem solving is
distributed computing.
Distributed computing is a process whereby a set of
computers connected by a network are used collectively
to solve a single large problem.
As more and more organizations have high-speed local area networks
interconnecting many general-purpose workstations,
the combined computational resources
may exceed the power of a single high-performance computer.
In some cases, several MPPs have been combined
using distributed computing to produce unequaled computational power.
The most important factor in distributed computing is cost.
Large MPPs typically cost more than $10 million.
In contrast,
users see very little cost in running their problems
on a local set of existing computers.
It is uncommon for distributed-computing users to realize
the raw computational power of a large MPP, but they are
able to solve problems several times larger than they could
using one of their local computers.
Common to both distributed computing and MPPs is the notion
of message passing. In all parallel processing, data must
be exchanged between cooperating tasks.
Several paradigms have been tried including
shared memory, parallelizing compilers, and message passing.
The message-passing model has become
the paradigm of choice, from the
perspective of the number and variety of multiprocessors that support it,
as well as in terms of applications,
languages, and software systems that use it.
The Parallel Virtual Machine (PVM) system described in this book uses the message-passing
model to allow programmers to exploit distributed computing
across a wide variety of computer types, including MPPs.
A key concept in PVM is that it makes a collection of computers
appear as one large virtual machine, hence its name.
Each pvmd
maintains a list of all tasks under its management
(Figure
).
Every task, regardless of state, is a member of a
threaded list,
sorted by task id.
Most tasks are also in a second list,
sorted by process id.
The head of both lists is
locltasks.
PVM provides a simple debugging system
described in Section
.
More complex debuggers can be built by using
a special type of task called a tasker, introduced in version 3.3.
A tasker starts (execs, and is the parent of)
other tasks.
In general,
a debugger is a process that
controls the execution of other processes - can read and write
their memories and start and stop instruction counters.
On many species of Unix,
a debugger must be the direct parent of any processes it controls.
This is becoming less common with growing availability of the
attachable ptrace interface.
The function of the tasker interface overlaps with the simple
debugger starter,
but is fundamentally different for two reasons:
First,
all tasks running under a pvmd (during the life of the tasker)
may be children of a single tasker process.
With PvmTaskDebug,
a new debugger is necessarily started for each task.
Second,
the tasker cannot be enabled or disabled by spawn flags,
so it is always in control,
though
this is not an important difference.
If a tasker is registered (using pvm_reg_tasker())
with a pvmd when a DM_EXEC
message is received to start new tasks,
the pvmd sends a SM_STTASK message to the tasker instead
of calling execv().
No SM_STTASKACK message is required;
closure comes from the task reconnecting to the pvmd as usual.
The pvmd doesn't get SIGCHLD signals when a tasker
is in use,
because it's not the parent process of tasks,
so the tasker must send notification of exited tasks
to the pvmd in a SM_TASKX message.
The pvmd uses a wait context (waitc) to hold state when
a thread of operation must be interrupted.
The pvmd is not truly multithreaded
but performs operations concurrently.
For example,
when a pvmd gets a syscall from a task
and
must interact with another pvmd,
it doesn't block while waiting for the other pvmd
to respond.
It
saves state in a
waitc
and returns immediately to the work() loop.
When the reply arrives,
the pvmd uses the information stashed
in the waitc
to complete the syscall and reply to the task.
Waitcs are serial numbered, and the number is sent in the
message header along with the request and
returned with the reply.
For many operations,
the TIDs and kind of wait are the
only information saved.
The struct waitc includes a few extra fields
to handle most of the remaining cases,
and a pointer,
wa_spec,
to a block of extra data for special cases-the
spawn and host startup operations,
which need to save
struct waitc_spawn and struct waitc_add.
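An illustrative wait-context record, with field names modeled on the text (wa_tid, wa_spec, the peer links described next) rather than on the actual declaration:

    struct waitc {
        int  wa_serial;          /* wait id, echoed in message headers            */
        int  wa_kind;            /* what kind of operation is waiting             */
        int  wa_tid;             /* task or pvmd the operation is blocked on      */
        struct waitc *wa_peer;   /* circular list of peers; points to itself
                                    when the waitc has no peers                   */
        void *wa_spec;           /* extra data for special cases (spawn, add)     */
    };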
Sometimes
more than one phase of waiting is necessary-in
series, parallel,
or nested.
In the parallel case, a separate waitc is created for each foreign host.
The waitcs are
peered (linked in a list)
together to indicate they pertain to the same operation.
If a waitc has no peers,
its peer links point to itself.
Usually,
peered waitcs share data,
for example, wa_spec.
All existing parallel operations
are conjunctions;
a peer group is finished when every waitc
in the group is finished.
As replies arrive,
finished waitcs are collapsed out of the list and deleted.
When the
finished waitc is the only one left,
the operation is complete.
Figure
shows single and peered waitcs
stored in
waitlist (the list
of all active waitcs).
When a host fails or a task exits,
the pvmd
searches waitlist
for any blocked on this TID
and terminates those operations.
Waitcs from the dead host or task blocked on something else
are not deleted;
instead, their wa_tid fields are zeroed.
This approach prevents the wait id's from being recycled while
replies are still pending.
Once the defunct waitcs are satisfied,
they are silently discarded.
Fault detection
originates in the pvmd-pvmd
protocol
(Section
).
When the pvmd times out while communicating with another,
it calls hostfailentry(),
which scans waitlist and terminates any operations waiting
on the down host.
A pvmd can recover from the loss of any foreign pvmd
except the master.
If a slave loses the master,
the slave shuts itself down.
This algorithm ensures that the virtual machine doesn't become partitioned
and run as two partial machines.
It does, however, decrease fault tolerance
of the virtual machine
because the master must never crash.
There is currently no way for the master to hand off its status
to another pvmd,
so it always remains part of the configuration.
(This is an improvement over
PVM 2,
in which the failure of any pvmd would shut down the entire system.)
The shadow pvmd
(pvmd') runs on the master host
and is used by the master to start new slave pvmds.
Any of several steps in the startup process (for example,
starting a shell on the remote machine)
can block for seconds or minutes (or hang),
and the master pvmd must be able to
respond to other messages during this time.
It's messy to save all the state involved,
so a completely separate process is used.
The pvmd' has host number 0
and communicates with the master through the
normal pvmd-pvmd interface,
though
it never talks to tasks or other pvmds.
The normal host failure detection mechanism is used to
recover
in the event the pvmd' fails.
The startup operation has a wait context
in the master pvmd.
If the pvmd' breaks,
the master catches a SIGCHLD from it and
calls hostfailentry(),
which cleans up.
Getting a slave pvmd
started is a messy task
with no good solution.
The goal is to
get a process running on the new host,
with enough
identity
to let it be fully configured and added
as a peer.
Ideally,
the mechanism used
should be
widely available, secure, and fast,
while
leaving the system easy to install.
We'd like to avoid having to type passwords all the time,
but don't want to put them in a file from which they could be stolen.
No one system meets all of these criteria.
Using
inetd
or
connecting to an already-running pvmd or pvmd server at a
reserved port
would allow fast,
reliable startup,
but would require that a system
administrator install PVM on each host.
Starting the pvmd
via
rlogin
or telnet
with a chat
script
would allow access even to
IP-connected hosts behind firewall machines
and
would require no special privilege
to install;
the main drawbacks are speed
and the
effort needed to get the chat program working
reliably.
Two widely available systems are
rsh
and rexec()
;
we
use both to cover the cases where a password
does and does not
need to be typed.
A manual startup
option allows the user to take the place of a
chat program,
starting the pvmd by hand and typing in the configuration.
rsh is a privileged program that can be
used to run commands on another host without a password,
provided the destination host
can be made to trust the source host.
This can be done either
by making it equivalent (requires a system administrator)
or by creating a .rhosts file on the destination host
(this isn't a great idea).
The alternative,
rexec(), is a function compiled into the pvmd.
Unlike rsh,
which doesn't take a password,
rexec() requires the user to supply one at run time,
either by typing it in
or by placing it in a .netrc file (this is a really bad idea).
Figure
shows
a host being added to the machine.
A task calls pvm_addhosts()
to
send a request to its pvmd,
which in turn sends a DM_ADD message to the master
(possibly itself).
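For reference, the user-level side of this operation is a single call to pvm_addhosts(); a minimal example (the host names here are invented) is:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        char *hosts[] = { "node1.example.edu", "node2.example.edu" };
        int infos[2];
        int added = pvm_addhosts(hosts, 2, infos);   /* returns number added */

        if (added < 2) {                             /* per-host status is in infos[] */
            int i;
            for (i = 0; i < 2; i++)
                if (infos[i] < 0)
                    fprintf(stderr, "could not add %s (error %d)\n",
                            hosts[i], infos[i]);
        }
        pvm_exit();
        return 0;
    }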
The master pvmd
creates a new host table entry for each
host
requested,
looks up the IP addresses,
and sets the options
from host file entries
or defaults.
The host descriptors are kept in a waitc_add structure
(attached to a wait context)
and not yet added to the host table.
The master
forks the pvmd'
to do the dirty work,
passing it a list of hosts and commands to execute
(an SM_STHOST message).
The pvmd' uses rsh, rexec() or manual startup
to start each pvmd,
pass it parameters,
and
get a line of configuration data back.
The configuration dialog between pvmd'
and a new slave is as follows:
The
addresses of the master and slave pvmds
are passed on the command line.
The slave writes its configuration on standard output,
then waits for
an EOF from the pvmd'
and disconnects.
It runs in
probationary status
(runstate = PVMDSTARTUP)
until it
receives the rest of its configuration
from the master pvmd.
If it isn't configured within five minutes
(parameter DDBAILTIME),
it assumes there is some problem with the master
and quits.
The protocol revision
(DDPROTOCOL)
of the slave pvmd must match that of the master.
This number is incremented whenever a change in the protocol
makes it incompatible with the previous
version.
When several hosts are added at once,
startup is done in parallel.
The pvmd' sends the data (or errors)
in a DM_STARTACK message
to the
master pvmd,
which completes the host descriptors
held in the wait context.
If a special task
called a hoster
is registered with the master pvmd when it receives
the DM_ADD message,
the pvmd' is not used.
Instead,
the SM_STHOST message
is sent to the hoster,
which
starts the remote processes as described above
using any mechanism it wants,
then
sends a SM_STHOSTACK message (same format as DM_STARTACK)
back to the master pvmd.
Thus,
the method of starting slave pvmds is dynamically replaceable,
but the hoster does not have to understand the configuration
protocol.
If the hoster task fails during an add operation,
the pvmd uses the wait context to recover.
It assumes none of the slaves were started
and sends a DM_ADDACK message indicating a system error.
After the slaves are started,
the master
sends each a DM_SLCONF message
to set parameters not included in the startup protocol.
It then
broadcasts a DM_HTUPD message
to all new and existing slaves.
Upon receiving this message,
each slave knows the configuration of
the new virtual machine.
The master waits for an acknowledging DM_HTUPDACK message
from every slave,
then broadcasts
an HT_COMMIT message,
shifting all to the new host table.
Two phases are needed so that new hosts are not advertised
(e.g., by pvm_config()) until all pvmds know the new
configuration.
Finally,
the master
sends a DM_ADDACK reply to the original request,
giving the new host id's.
Note:
Recent experience suggests it would be cleaner to
manage the pvmd' through
the task interface
instead of the host interface.
This approach would allow
multiple starters to run at once
(parallel startup is implemented explicitly
in a single pvmd' process).
A resource manager (RM) is a PVM task
responsible for making task and host
scheduling (placement) decisions.
The resource manager interface was introduced in version 3.3.
The simple schedulers embedded in the pvmd
handle many common conditions,
but require the user to explicitly place program components in
order to get the maximum efficiency.
Using knowledge not available to the pvmds, such as host load averages,
a RM can make more informed decisions automatically.
For example, when spawning a task,
it could pick the host in order to balance the computing load.
Or, when reconfiguring the virtual machine,
the RM could interact with an external queuing
system to allocate a new host.
The number of RMs registered can vary
from one for an entire virtual machine to one per pvmd.
The RM running on the master host (where the master pvmd runs)
manages any slave pvmds that don't have their own RMs.
A task connecting anonymously to a virtual machine is assigned
the default RM of the pvmd to which it connects.
A task spawned from within the system inherits the RM
of its parent task.
If a task has a RM assigned to it,
service requests from the task to its pvmd are routed to the
RM instead.
Messages from the following libpvm functions are intercepted:
Queries also go to the RM,
since it presumably knows more about the state of the virtual machine:
The call to register a task as a RM (pvm_reg_rm())
is also redirected if RM is already running.
In this way the existing RM learns of the new one,
and can grant or refuse the request to register.
Using messages SM_EXEC and SM_ADD,
the RM can directly command the pvmds to start tasks or reconfigure
the virtual machine.
On receiving acknowledgement for the commands,
it replies to the client task.
The RM is free to interpret service request parameters
in any way it wishes.
For example,
the architecture class given to pvm_spawn()
could be used to distinguish hosts by memory size or CPU speed.
Libpvm is
written in C and
directly supports C and C++ applications.
The Fortran
library,
libfpvm3.a
(also written in C),
is a set of wrapper functions
that conform to the Fortran calling conventions.
The Fortran/C linking requirements are portably met by preprocessing the
C source code for the
Fortran library with m4
before compilation.
On the first call to a libpvm function,
pvm_beatask() is called to
initialize the library state and
connect the task to its pvmd.
Connecting (for anonymous tasks)
is slightly different
from
reconnecting (for spawned tasks).
The pvmd publishes the address of the socket on which it listens
in /tmp/pvmd.uid,
where uid is the numeric user id under which the pvmd runs.
This file contains a line of the form
7f000001:06f7 or /tmp/aaa014138
This is the IP address and port number (in hexadecimal) of the socket,
or the path if a Unix-domain socket.
To avoid the need to read the address file,
the same information is passed to spawned tasks in
environment variable PVMSOCK.
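A small sketch of parsing the two address forms quoted above, written here as a standalone helper rather than as the libpvm code:

    #include <string.h>
    #include <stdlib.h>

    struct pvmd_addr {
        int is_unix;               /* nonzero: Unix-domain socket path */
        char path[108];            /* socket path, if is_unix          */
        unsigned long ip;          /* IPv4 address in host byte order  */
        unsigned short port;       /* port in host byte order          */
    };

    int parse_pvmd_addr(const char *line, struct pvmd_addr *a)
    {
        memset(a, 0, sizeof *a);
        if (line[0] == '/') {                        /* Unix-domain form */
            a->is_unix = 1;
            strncpy(a->path, line, sizeof a->path - 1);
            return 0;                                /* a real parser would strip a newline */
        }
        /* "7f000001:06f7" form: hex IP, ':', hex port */
        {
            char *colon;
            a->ip = strtoul(line, &colon, 16);
            if (*colon != ':')
                return -1;
            a->port = (unsigned short)strtoul(colon + 1, NULL, 16);
        }
        return 0;
    }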
To reconnect,
a spawned task also needs
its expected process id.
When a task is spawned by the pvmd,
a task descriptor
is created for it during the exec phase.
The descriptor must
exist
so it can stash any messages that arrive for the task
before it reconnects and can receive them.
During reconnection,
the task identifies itself to the pvmd by its PID.
If the task is always the child of the pvmd
(i.e., the exact process exec'd by it),
then it could use the value returned by getpid().
To allow for intervening processes,
such as debuggers,
the pvmd passes the expected PID in environment variable
PVMEPID,
and the task
uses that value in preference to its real PID.
The task also passes its real PID so it can be controlled normally
by the pvmd.
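The PID a task reports during reconnection could be obtained roughly as follows (a sketch, not libpvm's code):

    #include <stdlib.h>
    #include <unistd.h>

    /* Prefer the pvmd-supplied expected pid (PVMEPID) over the real pid,
       to allow for an intervening process such as a debugger. */
    static int expected_pid(void)
    {
        char *ep = getenv("PVMEPID");
        return ep ? atoi(ep) : (int)getpid();
    }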
pvm_beatask() creates a TCP socket and does a proper connection
dance with the pvmd.
Each must prove its identity to the other,
to prevent a different user from spoofing the system.
It does this by creating a file in /tmp writable only by the owner,
and challenging the other to write in the file.
If successful,
the
identity of the other is proven.
Note that this authentication is only as strong as the filesystem
and the authority of root on each machine.
A protocol serial number
(TDPROTOCOL)
is compared whenever a task connects to a pvmd or another task.
This number is incremented whenever a change in the protocol
makes it incompatible with the previous
version.
Disconnecting is much simpler.
It can be done forcibly by a close from either end,
for example, by exiting the task process.
The function pvm_exit() performs a clean shutdown,
such that the process can be connected again later
(it would get a different TID).
PVM communication is based on TCP, UDP, and Unix-domain sockets.
While more appropriate
protocols exist,
they aren't as generally available.
VMTP
[3]
is one example of a protocol built for this purpose.
Although intended for RPC-style interaction
(request-response),
it could support PVM messages.
It is packet oriented
and efficiently sends short blocks
of data (such as most pvmd-pvmd management messages)
but also handles streaming (necessary for task-task communication).
It supports multicasting
and priority data (something PVM doesn't need yet).
Connections don't need to be established before use;
the first communication initializes the protocol drivers
at each end.
VMTP was rejected, however, because
it is not widely available
(using it requires modifying the kernel).
This section explains the PVM protocols.
There are three connections to consider:
Between pvmds,
between pvmd and task,
and between tasks.
In an MPP, every processor is exactly like every other
in capability, resources, software, and communication speed.
Not so on a network.
The computers available on a network
may be
made by different vendors or have different compilers.
Indeed, when a programmer wishes to exploit a
collection of networked computers, he may have to contend
with several different types of heterogeneity
:
The set of computers available can include a wide range of architecture types
such as 386/486 PC class machines, high-performance workstations,
shared-memory multiprocessors, vector supercomputers, and even
large MPPs. Each architecture type has its own optimal programming method.
In addition, a user can be faced with a hierarchy of programming
decisions. The parallel virtual machine may itself be composed
of parallel computers.
Even when the architectures are only serial workstations,
there is still the problem of incompatible binary formats
and the need to compile a parallel task on each different
machine.
Data formats on different computers are often incompatible.
This incompatibility is an important point in distributed computing because
data sent from one computer may be unreadable on the receiving computer.
Message-passing packages developed for heterogeneous environments
must make sure all the computers understand the exchanged data.
Unfortunately,
the early message-passing systems developed for specific MPPs
are not amenable to distributed computing because they
do not include enough information in the message to
encode or decode it for any other computer.
Even if all the computers are workstations with the same data format,
there is still heterogeneity due to different computational speeds.
As a simple example, consider the problem of running
parallel tasks on a virtual machine that is composed of
one supercomputer and one workstation. The programmer must be careful
that the supercomputer doesn't sit idle waiting for the next data
from the workstation before continuing.
The problem of computational speeds can be very subtle.
The virtual machine can be composed of a set of identical workstations.
But since networked computers can have several other users on them
running a variety of jobs, the machine load can vary dramatically.
The result is that the effective computational power across
identical workstations can vary by an order of magnitude.
Like machine load, the time it takes to send a message over the network
can vary depending on the network load imposed by all the other network
users, who may not even be using any of the computers in the virtual
machine. This sending time becomes important when a task is sitting
idle waiting for a message, and it is even more important when
the parallel algorithm is sensitive to message arrival time.
Thus, in distributed computing, heterogeneity can appear dynamically
in even simple setups.
Despite these numerous difficulties caused by heterogeneity,
distributed computing offers
many advantages:
The pvmd and libpvm use the same message header,
shown in Figure
.
Code
contains an integer tag (message type).
Libpvm uses
Encoding
to pass the encoding
style of the message, as it can pack in different
formats.
The pvmd always sets Encoding (and requires that it be set)
to 1 (foo).
Pvmds use
the
Wait Context
field
to pass the wait id's (if any, zero if none)
of the waitc
associated with the message.
Certain tasks (resource manager, tasker, hoster)
also use wait id's.
The
Checksum
field is reserved for future use.
Messages are sent in one or more fragments,
each with its own fragment header (described below).
The message header is at the beginning of the first fragment.
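An illustrative declaration of this header, with descriptive field names chosen here (they are not the actual libpvm identifiers); the four integer fields are sent in network byte order.

    struct msg_header {
        int code;       /* message tag (type)                                   */
        int encoding;   /* encoding of the body; the pvmd always uses 1 (foo)   */
        int wait_id;    /* wait context id carried with the message, 0 if none  */
        int checksum;   /* reserved for future use                              */
    };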
PVM daemons communicate with one another through UDP sockets.
UDP is an unreliable
delivery service which can lose,
duplicate or reorder packets,
so
an acknowledgment and retry mechanism is used.
UDP also limits packet length,
so
PVM fragments long messages.
We
considered TCP,
but three factors make it inappropriate.
First is scalability.
In a virtual machine of N hosts,
each pvmd must have connections to the other N - 1.
Each open TCP connection consumes
a file descriptor in the pvmd,
and some operating systems limit the number of open files to as few as 32,
whereas
a single UDP socket can communicate with any number of remote
UDP sockets.
Second is overhead.
N pvmds
need
N(N - 1)/2
TCP connections,
which would be expensive to set up.
The PVM/UDP protocol is initialized with no communication.
Third is fault tolerance.
The communication system must detect when foreign pvmds
have crashed
or the network has gone down,
so
we need to be able to set timeouts in the protocol layer.
The TCP keepalive option might work,
but
it's not always possible to get
adequate control over the
parameters.
The packet header
is shown in Figure
.
Multibyte values are sent in (Internet) network byte order
(most significant byte first).
The source and destination fields hold
the TIDs of the true source and final destination of the packet,
regardless of the route it takes.
Sequence and acknowledgment numbers start at 1 and increment to 65535,
then wrap to zero.
SOM (EOM) - Set for the first (last) fragment of a message.
Intervening fragments have both bits cleared.
They are used by tasks and pvmds to delimit message boundaries.
DAT - If set, data is contained in the packet, and the sequence
number is valid.
The packet, even if zero length, must be delivered.
ACK - If set, the acknowledgment number field is valid.
This bit may be combined with the DAT bit to piggyback an acknowledgment
on a data packet.
FIN - The pvmd is closing down the connection.
A packet with FIN bit set (and DAT cleared)
begins an orderly shutdown.
When an acknowledgement arrives (ACK bit set and ack number
matching the sequence number from the FIN packet),
a final packet is sent with both FIN and ACK set.
If the pvmd panics (for example, on a trapped segment violation),
it tries to send a packet with FIN and ACK set to
every peer before it exits.
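An illustrative declaration matching this description; the exact field widths and the flag bit values below are placeholders, not the actual constants.

    struct pkt_header {
        int src_tid;             /* true source of the packet                    */
        int dst_tid;             /* final destination                            */
        unsigned short seq;      /* sequence number, 1..65535 then wraps to zero */
        unsigned short ack;      /* acknowledgment number                        */
        unsigned char flags;     /* SOM | EOM | DAT | ACK | FIN                  */
    };

    #define FL_SOM 0x01          /* first fragment of a message       */
    #define FL_EOM 0x02          /* last fragment of a message        */
    #define FL_DAT 0x04          /* packet carries data; seq is valid */
    #define FL_ACK 0x08          /* ack number is valid               */
    #define FL_FIN 0x10          /* orderly connection shutdown       */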
The state of a connection to another pvmd is kept in its host table
entry.
The protocol driver uses the following fields of
struct hostd:
Figure
shows
the host send and outstanding-packet queues.
Packets waiting to be sent to a host are queued in FIFO
hd_txq.
Packets
are appended to this queue by the routing code,
described in
Section
.
No receive queues are used;
incoming packets are passed immediately through
to other send queues or reassembled into messages (or discarded).
Incoming messages are delivered to a pvmd
entry point as described in
Section
.
The protocol allows multiple outstanding packets
to improve performance over high-latency networks,
so two more queues are required.
hd_opq
holds a per-host list of unacknowledged packets,
and global opq
lists all unacknowledged packets,
ordered by time to retransmit.
hd_rxq holds packets received out of sequence until they
can be accepted.
The difference in time between sending a packet
and getting the acknowledgement is
used to estimate the round-trip time to the foreign host.
Each new measurement is filtered into the estimate with a weighted moving
average, so the estimate adapts smoothly to changing network conditions.
When the acknowledgment for a packet arrives,
the packet
is removed from hd_opq and opq and discarded.
Each packet has a retry timer and count,
and each is resent until acknowledged by the foreign pvmd.
The timer starts at
3 * hd_rtt,
and doubles for each retry up to 18 seconds.
hd_rtt
is limited to nine seconds, and backoff is
bounded in order to allow at least 10 packets to be sent to a host
before giving up.
After three minutes of resending with no acknowledgment,
a packet expires.
If a packet expires as a result of timeout,
the foreign pvmd is assumed to be down or unreachable,
and the local pvmd gives up on it,
calling hostfailentry().
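A small sketch of this retry schedule (the real pvmd keeps the timer and retry count in the packet descriptor):

    /* Timer starts at 3 * hd_rtt and doubles on each retry, capped at
       18 seconds; a packet unacknowledged for about three minutes expires. */
    double retry_timeout(double hd_rtt, int nretry)
    {
        double t = 3.0 * hd_rtt;
        int i;

        for (i = 0; i < nretry; i++) {
            t *= 2.0;
            if (t > 18.0) {              /* backoff is bounded */
                t = 18.0;
                break;
            }
        }
        return t;
    }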
A task talks to its pvmd and other tasks through TCP sockets.
TCP is used
because it delivers data reliably.
UDP can lose packets
even within a host.
Unreliable delivery requires retry
(with timers)
at both ends:
since tasks can't be
interrupted while computing to perform I/O,
we can't use UDP.
Implementing a
packet service over TCP is
simple
because of its reliable delivery.
The packet header is shown in
Figure
.
No sequence numbers are needed,
and only flags SOM and EOM
(these have the same meaning as in Section
).
Since TCP
provides no record marks to
distinguish back-to-back packets from one another,
the length is sent in the header.
Each side maintains a
FIFO of packets to send,
and switches between reading
the socket when data is available
and writing when there is space.
The main drawback to TCP (as opposed to UDP)
is that more system
calls are needed to transfer each packet.
With UDP,
a single
sendto()
and
single
recvfrom()
are required.
With TCP,
a packet can be sent by a single
write() call,
but
must be received by two
read() calls,
the first to get the header and the second to get the data.
When traffic on the connection is heavy,
a simple optimization reduces the average number of reads back
to about one per packet.
If,
when reading the packet body,
the requested length is increased by the size of a
packet header,
the read may succeed in getting both the packet body
and header of the next packet at once.
We have the header for the next packet for free
and can repeat this process.
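A sketch of this read-ahead trick, with error handling and short-read loops omitted and a placeholder header size:

    #include <sys/types.h>
    #include <unistd.h>

    #define HDRSIZE 8   /* placeholder header size */

    /* Read a packet body of bodylen bytes from fd into a buffer that has room
       for bodylen + HDRSIZE bytes.  Returns how many bytes of the following
       packet's header arrived in the same read (0..HDRSIZE); those bytes sit
       at buf + bodylen. */
    ssize_t read_body_plus_header(int fd, char *buf, size_t bodylen)
    {
        ssize_t got = read(fd, buf, bodylen + HDRSIZE);  /* ask for one extra header */

        if (got < 0 || (size_t)got < bodylen)
            return -1;                   /* a real implementation would loop here */
        return got - bodylen;            /* header bytes prefetched, if any */
    }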
Version 3.3 introduced the use of Unix-domain stream sockets as an
alternative to TCP for local communication,
to improve latency and transfer rate
(typically by a factor of two).
If enabled (the system is built without the NOUNIXDOM
option),
stream sockets are used between the pvmd and tasks
as well as between tasks on the same host.
Packet descriptors (struct pkt)
track message fragments
through
the pvmd.
Fields
pk_buf, pk_max, pk_dat and pk_len
are used in the same ways as
similarly named fields of a frag,
described in
Section
.
Besides data,
pkts contain state to operate
the pvmd-pvmd protocol.
Messages are sent by calling sendmessage(),
which routes by destination address.
Messages for other pvmds or tasks
are linked to packet descriptors and attached
to a send queue.
If the pvmd addresses a message to itself,
sendmessage()
passes the whole message descriptor
to netentry(),
avoiding the packet layer entirely.
This loopback interface is used often by the pvmd.
During a complex operation,
netentry() may be reentered several times
as the pvmd sends itself messages.
Messages to the pvmd are reassembled from packets
in message reassembly buffers,
one for each local task and remote pvmd.
Completed messages are passed to entry points (Section
).
A graph of packet and message routing inside the pvmd is shown in
Figure
.
Packets are received from the network
by
netinput()
directly into buffers
long enough to hold the largest packet
the pvmd will receive (its MTU in the host table).
Packets from local tasks
are read by loclinput(),
which creates a buffer large enough for each packet
after it reads the header.
To route a packet,
the pvmd chains it onto
the queue for its destination.
If a packet is multicast
(see Section
),
the descriptor is replicated,
counting extra
references on the underlying databuf.
One copy is placed in each send queue.
After the last copy of the packet is sent,
the databuf is freed.
Messages are generally built
with fragment length
equal to the MTU of the host's pvmd,
allowing them to be forwarded without refragmentation.
In some cases,
the pvmd can receive a packet (from a task) too long
to be sent to another pvmd.
The pvmd refragments
the packet by replicating its descriptor
as many times as necessary.
A single databuf is shared between the descriptors.
The pk_dat and pk_len fields of the
descriptors
cover successive chunks of the original
packet,
each chunk small enough to send.
The SOM and EOM flags are adjusted
(if the original packet is the start or end of a message).
At send time, netoutput()
saves the data under where it
writes the packet header,
sends the packet,
and
then restores the data.
In our next example we program a matrix-multiply algorithm described by Fox
et al. in
[5]. The mmult program can be found at the end of this
section.
The mmult program will calculate C = AB, where C, A, and B are all
square matrices. For simplicity we assume that m*m tasks will be used
to calculate the solution. Each task will calculate a subblock of the
resulting matrix C. The block size and the value of m are given as
command line arguments to the program. The matrices A and B are also
stored as blocks distributed over the m*m tasks.
Before delving into the details of the program,
let us first describe the algorithm at a high level.
Assume we have an m x m grid of tasks. Each task (t_ij, where
0 <= i, j < m) initially contains blocks C_ij, A_ij, and B_ij.
In the first step of the algorithm the tasks on the diagonal
(t_ij where i = j) send their block A_ii to all the other tasks
in row i. After the transmission of A_ii, all tasks calculate
A_ii * B_ij and add the result into C_ij. In the next
step, the column blocks of B are rotated. That is, t_ij sends
its block of B to t_(i-1)j. (Task t_0j sends its
B block to t_(m-1)j.) The tasks now return to the first step;
A_i(i+1) is multicast to all other tasks in row i, and the
algorithm continues. After m iterations the C matrix contains
C = AB, and the B matrix has been rotated back into place.
Let's now go over the matrix multiply as it is programmed in PVM. In
PVM there is no restriction on which tasks may communicate with which
other tasks. However, for this program we would like to think of the
tasks as a two-dimensional conceptual torus. In order to enumerate the
tasks, each task joins the group mmult. Group ids are used to
map tasks to our torus. The first task to join a group is given the
group id of zero. In the mmult program, the task with group id zero
spawns the other tasks and sends the parameters for the matrix multiply
to those tasks. The parameters are m and the block size: the square root of
the number of blocks and the size of a block, respectively. After all the
tasks have been spawned and the parameters transmitted, the tasks
synchronize on a barrier (pvm_barrier()).
After the barrier, we store the task ids for the other tasks in our
``row'' in the array myrow. This is done by calculating the
group ids for all the tasks in the row and asking PVM for the task
id for the corresponding group id. Next we allocate the blocks for the
matrices using malloc(). In an actual application program we would
expect that the matrices would already be allocated. Next the program
calculates the row and column of the block of C it will be computing.
This is based on the value of the group id. The group ids range from
0 to m*m - 1 inclusive. Thus the integer division of the group id by m will
give the task's row, and the group id modulo m will give the column, if we assume
a row major mapping of group ids to tasks. Using a similar mapping, we
calculate the group id of the task directly above and below
in the torus and store their task ids in up and down,
respectively.
Next the blocks are initialized by calling InitBlock(). This function
simply initializes A to random values, B to the identity matrix, and C
to zeros. This will allow us to verify the computation at the end of the
program by checking that C = A.
Finally we enter the main loop to calculate the matrix multiply. First
the tasks on the diagonal multicast their block of A to the other tasks
in their row. Note that the array myrow actually contains the
task id of the task doing the multicast. Recall that pvm_mcast() does not
send the message to the caller, even if the caller's task id appears in the
tid array.
After the subblocks have been multiplied and added into the C block, we
now shift the B blocks vertically. Specifically, we pack the block
of B into a message, send it to the up task id, and then
receive a new B block from the down task id.
Note that we use different message tags for sending the A blocks and the B
blocks as well as for different iterations of the loop. We also fully
specify the task ids when doing a pvm_recv(), so that a block cannot be
mistaken for one from another task or iteration.
Once the computation is complete, we check to see that C = A, just to verify
that the matrix multiply correctly calculated the values of C. This check would
not be done in a matrix multiply library routine, for example.
It is not necessary to call pvm_lvgroup() explicitly, since PVM removes a
task from any groups it has joined when the task exits.
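To make the communication pattern concrete, here is a condensed sketch of one iteration of the torus algorithm described above; the variable names, tags, and helper routine are illustrative and the real mmult program differs in detail.

    #include "pvm3.h"

    static void matmul_add(double *C, const double *A, const double *B, int n)
    {
        int i, j, k;                        /* C += A*B for n-by-n blocks */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* One step ks of the algorithm for the task with group id mygid.
       myrow[] holds the task ids of the tasks in this row, indexed by column. */
    void iterate(int ks, int mygid, int m, int blk, int *myrow,
                 int up, int down, double *A, double *B, double *C, double *tmpA)
    {
        int row = mygid / m, col = mygid % m;
        int sendercol = (row + ks) % m;

        if (col == sendercol) {             /* this step's "diagonal" task */
            pvm_initsend(PvmDataDefault);
            pvm_pkdouble(A, blk * blk, 1);
            pvm_mcast(myrow, m, 100 + ks);  /* pvm_mcast skips the sender */
            matmul_add(C, A, B, blk);       /* use own A block directly */
        } else {
            pvm_recv(myrow[sendercol], 100 + ks);
            pvm_upkdouble(tmpA, blk * blk, 1);
            matmul_add(C, tmpA, B, blk);
        }

        /* rotate B blocks vertically: send up, then receive from below */
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(B, blk * blk, 1);
        pvm_send(up, 200 + ks);
        pvm_recv(down, 200 + ks);
        pvm_upkdouble(B, blk * blk, 1);
    }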
Pvmds usually don't communicate with foreign tasks
(those on other hosts).
The pvmd has message reassembly buffers for each foreign pvmd
and each task it manages.
What it doesn't want is to have reassembly buffers
for foreign tasks.
To free up the reassembly buffer for a foreign task
(if the task dies),
the pvmd
would have to request notification from the task's pvmd,
causing extra communication.
For the sake of simplicity
the pvmd local to the sending task serves as a message
repeater.
The message is reassembled
by the task's local pvmd as if it were the receiver,
then forwarded all at once to the
destination pvmd,
which reassembles the message again.
The source address is preserved,
so the sender can be identified.
Libpvm maintains dynamic reassembly buffers,
so
messages from pvmd to task do not cause a problem.
Experience seems to indicate
that inherited
environment (Unix environ)
is useful to an application.
For example,
environment variables can be used to
distinguish a group of related tasks
or to set debugging variables.
PVM makes increasing use of environment,
and may eventually support it
even on machines where the concept
is not native.
For now,
it allows a task to export any part of environ
to tasks spawned by it.
Setting variable PVM_EXPORT to the names of other variables
causes them to be exported through spawn.
For example, setting PVM_EXPORT to DISPLAY:SHELL exports both DISPLAY and
SHELL to tasks spawned by this one.
The following environment variables are used by PVM.
The user may set these:
The following variables are set by PVM and should not be modified:
Each task spawned through PVM
has /dev/null opened for stdin.
From its parent,
it inherits a stdout sink,
which is a (TID, code) pair.
Output on stdout or stderr is
read by the pvmd through a pipe,
packed into PVM messages and
sent to the TID,
with message tag equal to the code.
If the output TID is set to zero
(the default for a task with no parent),
the messages go to the master pvmd,
where they are written on its error log.
Children spawned by a task inherit its stdout
sink.
Before the spawn,
the parent can use pvm_setopt() to
alter the output TID or code.
This doesn't
affect where the output of the parent task itself goes.
A task may set output TID to one of three settings:
the value inherited from its parent,
its own TID,
or zero.
It can set output code only if output TID is set to its own TID.
This means that output can't be assigned to an arbitrary task.
Four types of messages are sent to an stdout sink.
The message body formats for each type are as follows:
The first two items in the message body
are always the task id and output count,
which
allow the receiver to
distinguish between different tasks and the four message types.
For each task,
one message each
of types Spawn, Begin, and End is sent,
along with zero or more messages of class Output (count > 0).
Classes Begin, Output and End will be received
in order,
as they originate from the same source (the pvmd of the
target task).
Class Spawn originates at the (possibly different) pvmd
of the parent task,
so it can be received in any order relative to
the others.
The output sink
is expected to understand the different types of messages
and use them to know when to stop
listening for output from a task (EOF) or group of tasks (global EOF).
The messages are designed so as to prevent race conditions
when a task spawns another task,
then immediately exits.
The
output sink might
get the End
message from the parent task
and decide the group is finished,
only to receive more output later from the child task.
According to these rules, the Spawn
message for the second task
must
arrive before
the End message from the first task.
The Begin message itself is necessary because the Spawn
message for a task may arrive after the End message
for the same task.
The state transitions of a task as observed by the receiver of
the output messages
are shown in
Figure
.
The libpvm function pvm_catchout() uses this output collection
feature to put the output from children of a task into a file
(for example, its own stdout).
It sets output TID to its own task id,
and the output code to control message TC_OUTPUT.
Output from children and grandchildren tasks is collected by the
pvmds and sent to the task,
where it is received by pvmmctl() and printed by pvmclaimo().
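A minimal example of using this feature from a parent task (the spawned executable name and log file name are invented):

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        FILE *log = fopen("children.out", "w");
        int tids[4];

        pvm_catchout(log);      /* collect children's output via TC_OUTPUT messages */
        pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);

        /* ... exchange messages with the workers ... */

        pvm_exit();             /* waits for remaining child output before disconnecting */
        fclose(log);
        return 0;
    }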
We have divided this book into five main chapters.
Chapter
gives the motivation for this book
and the use of templates.
Chapter
describes
stationary and nonstationary iterative methods. In
this chapter we present both historical development and
state-of-the-art methods for solving some of the most challenging
computational problems facing researchers.
Chapter
focuses on preconditioners. Many
iterative methods depend in part on preconditioners to improve
performance and ensure fast convergence.
Chapter
provides a glimpse of issues related
to the use of iterative methods.
This chapter, like the preceding, is
especially recommended for the experienced user who wishes to have
further guidelines for tailoring a specific code to a particular
machine.
It includes information on complex systems, stopping criteria,
data storage formats, and parallelism.
Chapter
includes overviews of related
topics such as
the close connection between the Lanczos algorithm and the Conjugate
Gradient algorithm, block iterative methods,
red/black orderings,
domain decomposition methods,
multigrid-like methods, and
row-projection schemes.
The Appendices contain information on
how the templates and BLAS software can be
obtained.
A glossary of important terms used in the book is also provided.
The field of iterative methods for solving systems of linear equations
is in constant flux, with new methods and approaches continually being
created, modified, tuned, and some eventually discarded. We expect
the material in this book to undergo changes from time to time as some
of these new approaches mature and become the state-of-the-art.
Therefore, we plan to update the material included in this book
periodically for future editions. We welcome your comments and
criticisms of this work to help us in that updating process. Please
send your comments and questions by email to templates@cs.utk.edu.
Below are short descriptions of each of the methods to be discussed,
along with brief notes on the classification of the methods in terms
of the class of matrices for which they are most appropriate. In
later sections of this chapter more detailed descriptions of these
methods are given.
The Jacobi method is based on solving for every variable locally with
respect to the other variables; one iteration of the method
corresponds to solving for every variable once. The resulting method
is easy to understand and implement, but convergence is slow.
The Gauss-Seidel method is like the Jacobi method, except that it uses
updated values as soon as they are available.
In general, if the Jacobi method converges, the Gauss-Seidel method
will converge faster than the Jacobi method, though still relatively
slowly.
Successive Overrelaxation (SOR) can be derived from the Gauss-Seidel
method by introducing an extrapolation parameter ω. For the
optimal choice of ω, SOR may converge faster than Gauss-Seidel by
an order of magnitude.
Symmetric Successive Overrelaxation (SSOR) has no advantage over SOR
as a stand-alone iterative method; however, it is useful as a
preconditioner for nonstationary methods.
The conjugate gradient method derives its name from the fact that it
generates a sequence of conjugate (or orthogonal) vectors. These
vectors are the residuals of the iterates. They are also the
gradients of a quadratic functional, the minimization of which is
equivalent to solving the linear system. CG is an extremely effective
method when the coefficient matrix is symmetric positive definite,
since storage for only a limited number of vectors is required.
These methods are computational alternatives for CG for coefficient
matrices that are symmetric but possibly indefinite. SYMMLQ will
generate the same solution iterates as CG if the coefficient matrix is
symmetric positive definite.
These methods are based on the application of the CG method to one of
two forms of the normal equations for the system Ax = b. CGNE solves the
system (A A^T) y = b for y and then computes the solution x = A^T y.
CGNR solves (A^T A) x = A^T b for the solution vector x. When the
coefficient matrix A is nonsymmetric and nonsingular, the normal
equations matrices A A^T and A^T A will be symmetric and positive
definite, and hence CG can be applied. The convergence may be slow,
since the spectrum of the normal equations matrices will be less
favorable than the spectrum of A.
The Generalized Minimal Residual method computes a sequence of
orthogonal vectors (like MINRES), and combines these through a
least-squares solve and update. However, unlike MINRES (and CG) it
requires storing the whole sequence, so that a large amount of storage
is needed. For this reason, restarted versions of this method are
used. In restarted versions, computation and storage costs are
limited by specifying a fixed number of vectors to be generated. This
method is useful for general nonsymmetric matrices.
The Biconjugate Gradient method generates two CG-like sequences of
vectors, one based on a system with the original coefficient
matrix A, and one on A^T. Instead of orthogonalizing each
sequence, they are made mutually orthogonal, or
``bi-orthogonal''. This method, like CG, uses
limited storage. It is useful when the matrix is nonsymmetric and
nonsingular; however, convergence may be irregular, and there is a
possibility that the method will break down. BiCG requires a
multiplication with the coefficient matrix and with its transpose at
each iteration.
The Quasi-Minimal Residual method applies a least-squares solve and
update to the BiCG residuals, thereby smoothing out the irregular
convergence behavior of BiCG,
which may lead to more reliable approximations.
In full glory, it has a look ahead strategy built in that
avoids the BiCG breakdown.
Even without look ahead,
QMR largely avoids the breakdown that can occur in BiCG.
On the other hand, it does not effect a true minimization of either
the error or the residual, and while it converges smoothly, it often does
not improve on the BiCG in terms of the number of iteration
steps.
The Conjugate Gradient Squared method is a variant of BiCG that
applies the updating operations for the A-sequence and the
A^T-sequence both to the same vectors. Ideally, this would double
the convergence rate, but in practice convergence may be much more
irregular than for BiCG,
which may sometimes lead to unreliable results. A practical
advantage is that the method does not need the multiplications with
the transpose of the coefficient matrix.
The Biconjugate Gradient Stabilized method is a variant of BiCG, like
CGS, but using different updates for the A^T-sequence in order to
obtain smoother convergence than CGS.
The Chebyshev Iteration recursively determines polynomials with
coefficients chosen to minimize the norm of the residual in a min-max
sense. The coefficient matrix must be positive definite and knowledge
of the extremal eigenvalues is required. This method has the
advantage of requiring no inner products.
Efficient preconditioners for iterative methods can be found by
performing an incomplete factorization of the coefficient matrix. In
this section, we discuss the incomplete factorization of an n-by-n
matrix A stored in the CRS format,
and routines to solve a system with such a factorization. At first we
only consider a factorization of the D-ILU type, that is,
the simplest type of factorization in which no ``fill'' is
allowed, even if the matrix has a nonzero in the fill position (see
section
). Later we will consider factorizations that
allow higher levels of fill. Such factorizations are considerably more
complicated to code, but they are essential for complicated
differential equations. The solution routines are applicable in
both cases.
For iterative methods that involve a transpose matrix-vector product,
we need to consider solving a system with the transpose
of the factorization as well.
In this subsection we will consider
a matrix split as A = D_A + L_A + U_A into its diagonal, lower
triangular, and upper triangular parts, and an incomplete factorization
preconditioner of the form (D + L_A) D^{-1} (D + U_A). In this way, we
only need to store a diagonal matrix D containing the pivots of the
factorization.
Hence, it suffices to allocate for the preconditioner only
a pivot array of length n (pivots(1:n)).
In fact, we will store the inverses of the pivots
rather than the pivots themselves. This implies that during
the system solution no divisions have to be performed.
Additionally, we assume that an extra integer array
diag_ptr(1:n)
has been allocated that contains the column (or row) indices of the
diagonal elements in each row, that is, the entry of row i pointed to by
diag_ptr(i) is the diagonal element a_ii.
The factorization begins by copying the matrix diagonal into the pivot array.
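A sketch of the construction in C, assuming 0-based CRS arrays val, col_ind, row_ptr and the diag_ptr and pivots arrays introduced above (the array names and 0-based indexing are choices made here; the text uses Fortran-style (1:n) notation):

    /* Find the position of entry (i,j) in CRS row i, or -1 if it is zero. */
    static int find(int i, int j, const int *row_ptr, const int *col_ind)
    {
        int k;
        for (k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            if (col_ind[k] == j)
                return k;
        return -1;
    }

    void dilu_factor(int n, const double *val, const int *col_ind,
                     const int *row_ptr, const int *diag_ptr, double *pivots)
    {
        int i, k;

        for (i = 0; i < n; i++)                 /* copy the matrix diagonal */
            pivots[i] = val[diag_ptr[i]];

        for (i = 0; i < n; i++) {
            pivots[i] = 1.0 / pivots[i];        /* store the inverse of the pivot */
            for (k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
                int j = col_ind[k];
                if (j > i) {
                    int kji = find(j, i, row_ptr, col_ind);
                    if (kji >= 0)               /* update only where a_ji is nonzero */
                        pivots[j] -= val[kji] * pivots[i] * val[k];
                }
            }
        }
    }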
The system M x = y can be solved in the usual manner by introducing a
temporary vector z: first solve (I + L_A D^{-1}) z = y, then solve
(D + U_A) x = z.
We have a choice between several equivalent ways of writing M when
solving the system:
    M = (D + L_A) D^{-1} (D + U_A)
      = (D + L_A) (I + D^{-1} U_A)
      = (I + L_A D^{-1}) (D + U_A)
      = (I + L_A D^{-1}) D (I + D^{-1} U_A).
The first and fourth formulae are not suitable since they require
both multiplication and division with D; the difference between the
second and third is only one of ease of coding. In this section we use
the third formula; in the next section we will use the
second for the transpose system solution.
Both halves of the solution have largely the same structure as the
matrix vector multiplication.
Solving the transpose system M^T x = y is slightly more involved. In the usual
formulation we traverse rows when solving a factored system, but here
we can only access columns of the matrices (D + L_A)^T and (D + U_A)^T (at less
than prohibitive cost). The key idea is to distribute
each newly computed component of a triangular solve immediately over
the remaining right-hand side.
For instance, if we write a lower triangular matrix by columns as
L = (l_1, l_2, ..., l_n), then the system L x = y can be written as
x_1 l_1 + x_2 l_2 + ... + x_n l_n = y. Hence, after computing x_1
we modify y <- y - x_1 l_1, and so on. Upper triangular systems are
treated in a similar manner.
With this algorithm we only access columns of the triangular systems.
Solving a transpose system with a matrix stored in CRS format
essentially means that we access rows of L_A and U_A.
The algorithm now becomes
Incomplete factorizations with several levels of fill allowed are more
accurate than the D-ILU factorization described above. On the
other hand, they require more storage, and are considerably harder to
implement (much of this section is based on algorithms for a full
factorization of a sparse matrix as found in Duff, Erisman and
Reid
[80]).
As a preliminary, we need an algorithm for adding two vectors x and y,
both stored in sparse storage. Let lx be the number
of nonzero components in x, let the values be stored in x, and let
xind be an integer array such that xind(j) gives the index of the
j-th stored component.
Similarly, y is stored as ly, y, yind.
We now add x <- x + y by first copying y into
a full vector w then adding w to x. The total number
of operations will be O(lx + ly):
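A sketch of this addition in C, with 0-based indices and the assumption that the arrays holding x have room for the new nonzeros; the work vector w must enter (and leaves) zeroed.

    void sparse_add(int *lx, double *x, int *xind,
                    int ly, const double *y, const int *yind,
                    double *w /* full-length work vector, all zeros */)
    {
        int j, n_new = *lx;

        for (j = 0; j < ly; j++)              /* scatter y into w */
            w[yind[j]] = y[j];

        for (j = 0; j < *lx; j++) {           /* add into existing components of x */
            if (w[xind[j]] != 0.0) {
                x[j] += w[xind[j]];
                w[xind[j]] = 0.0;             /* consumed; also resets w */
            }
        }
        for (j = 0; j < ly; j++) {            /* components of y not in x create fill */
            if (w[yind[j]] != 0.0) {
                x[n_new] = w[yind[j]];
                xind[n_new] = yind[j];
                n_new++;
                w[yind[j]] = 0.0;             /* leave w clean for the next call */
            }
        }
        *lx = n_new;
    }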
For a slight refinement of the above algorithm,
we will add levels to the nonzero components:
we assume integer vectors xlev and ylev of length lx
and ly respectively, and a full length level vector wlev
corresponding to w. The addition algorithm then becomes:
We can now describe the level-based incomplete factorization. The algorithm starts
out with the matrix A, and gradually builds up
an incomplete factorization M, whose lower triangular, diagonal, and upper
triangular parts are stored in the lower triangle, diagonal and
upper triangle of the array M respectively. The particular form
of the factorization is chosen to minimize the number of times that
the full vector w is copied back to sparse form.
Specifically, we use a sparse form of the following factorization
scheme:
We will describe an incomplete factorization that controls fill-in
through levels (see equation (
)). Alternatively we
could use a drop tolerance (section
), but this is less
attractive from the point of view of implementation. With fill levels we can
perform the factorization symbolically at first, determining storage
demands and reusing this information through a number of linear
systems of the same sparsity structure. Such preprocessing and reuse
of information is not possible with fill controlled by a drop
tolerance criterion.
The matrix
arrays A and M are assumed to be in compressed row
storage, with no particular ordering of the elements inside each row,
but arrays adiag and mdiag point to the locations of the
diagonal elements.
The structure of a particular sparse matrix is likely to apply to a
sequence of problems, for instance on different time-steps, or during
a Newton iteration. Thus it may pay off to perform the above
incomplete factorization first symbolically to determine the amount
and location of fill-in and use this structure for the numerically
different but structurally identical matrices. In this case, the
array for the numerical values can be used to store the levels during
the symbolic factorization phase.
Pipelining: See: Vector computer.
Vector computer: Computer that is able to process
consecutive identical operations (typically additions or multiplications)
several times faster than intermixed operations of different types.
Processing identical operations this way is called `pipelining'
the operations.
Shared memory: See: Parallel computer.
Distributed memory: See: Parallel computer.
Message passing: See: Parallel computer.
Parallel computer: Computer with multiple independent
processing units. If the processors have immediate access to the
same memory, the memory is said to be shared; if processors have
private memory that is not immediately visible to other processors,
the memory is said to be distributed. In that case, processors
communicate by message passing.
In this section we discuss aspects of parallelism in the
iterative methods discussed in this book.
Since the iterative methods share most of their computational kernels
we will discuss these independent of the method.
The basic time-consuming kernels of iterative schemes are: inner products, vector updates, matrix-vector products, and preconditioner solves.
We will examine each of these in turn. We will conclude this section
by discussing two particular issues, namely computational wavefronts
in the SOR method,
and block operations in the GMRES method.
The computation of an inner product of two vectors
can be easily parallelized; each processor computes the
inner product of corresponding segments of each vector
(local inner products or LIPs).
On distributed-memory machines the LIPs then
have to be sent to other processors
to be combined for the global inner product. This can be done either
with an all-to-all send where every processor performs the summation
of the LIPs, or by a global accumulation in one processor, followed by
a broadcast of the final result.
Clearly, this step requires communication.
For shared-memory machines, the accumulation of LIPs can be
implemented as a critical section where all processors add their local
result in turn to the global result, or as a piece of serial
code, where one processor performs the summations.
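For the distributed-memory case, a sketch of such an inner product is given below; the function name PARDOT is ours, the local segments are assumed to have length NLOC on each processor, and MPI is used only as one example of a message passing library (the all-reduce plays the role of the all-to-all send described above).

      REAL FUNCTION PARDOT( NLOC, X, Y, COMM )
*     Local inner product (LIP) over this processor's segment,
*     followed by a global sum of the LIPs over all processors.
      INCLUDE            'mpif.h'
      INTEGER            NLOC, COMM, IERR
      REAL               X( * ), Y( * ), LIP, GIP
      REAL               SDOT
      EXTERNAL           SDOT
      LIP = SDOT( NLOC, X, 1, Y, 1 )
      CALL MPI_ALLREDUCE( LIP, GIP, 1, MPI_REAL, MPI_SUM,
     $                    COMM, IERR )
      PARDOT = GIP
      RETURN
      END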
Clearly, in the usual formulation of conjugate gradient-type methods
the inner products induce a synchronization of the
processors, since they cannot progress until the final result has been
computed: updating $x^{(i)}$ and $r^{(i)}$ can only begin after
completing the inner product for $\alpha_i$. Since on a
distributed-memory machine communication is needed for the inner product, we
cannot overlap this communication with useful computation.
The same observation applies to updating $p^{(i)}$, which can only begin
after completing the inner product for $\beta_{i-1}$.
Figure
shows a variant of CG, in which all
communication time may be overlapped with useful computations. This
is just a reorganized version of the original CG scheme, and is
therefore precisely as stable. Another advantage over other
approaches (see below) is that no additional operations are required.
This rearrangement is based on two tricks. The first is that updating
the iterate is delayed to mask the communication stage of the
${p^{(i)}}^Tq^{(i)}$ inner product. The second trick relies on splitting the
(symmetric) preconditioner as $M = LL^T$, so one first computes
$L^{-1}r^{(i)}$, after which the inner product ${r^{(i)}}^TM^{-1}r^{(i)}$
can be computed as $s^Ts$ where $s = L^{-1}r^{(i)}$. The
computation of $L^{-T}s$ will then mask the communication stage of the
inner product.
Under the assumptions that we have made, CG can be efficiently parallelized
as follows:
Several authors have found ways to eliminate some of the
synchronization points induced by the
inner products in methods such as CG. One strategy has been to
replace one of the two inner products typically present in conjugate
gradient-like methods by one or two others in such a way that all
inner products can be performed simultaneously. The global
communication can then be packaged. A first such method was proposed
by Saad
[182] with a modification to improve its
stability suggested by Meurant
[156]. Recently, related
methods have been
proposed by Chronopoulos and Gear
[55], D'Azevedo and
Romine
[62], and Eijkhout
[88].
These schemes can also be applied to
nonsymmetric methods such as BiCG. The stability of such methods is
discussed by D'Azevedo, Eijkhout and Romine
[61].
Another approach is to generate a number of
successive Krylov vectors (see §
) and
orthogonalize these as a block (see
Van Rosendale
[210], and Chronopoulos and
Gear
[55]).
Vector updates are trivially parallelizable: each processor updates its
own segment.
Iterative methods that can be expressed in the simple form
\[ x^{(k)} = Bx^{(k-1)} + c \]
(where neither $B$ nor $c$ depends upon the iteration count $k$) are
called stationary iterative methods.
In this section, we present the four main stationary iterative
methods: the Jacobi
method, the Gauss-Seidel
method, the Successive
Overrelaxation
(SOR) method and
the Symmetric Successive Overrelaxation
(SSOR) method.
In each case,
we summarize their convergence behavior and their effectiveness, and
discuss how and when they should be used. Finally,
in §
, we give some historical background and
further notes and references.
The matrix-vector products are often easily parallelized on shared-memory
machines by splitting the matrix in strips corresponding to the vector
segments. Each processor then computes the matrix-vector product of one
strip.
For distributed-memory machines, there may be a problem if each processor
has only a segment of the vector in its memory. Depending on the bandwidth
of the matrix, we may need communication for other elements of the vector,
which may lead to communication bottlenecks. However, many sparse
matrix problems arise from a network in which only nearby nodes are
connected. For example, matrices stemming
from finite difference or finite element problems typically involve
only local connections: matrix element $a_{i,j}$ is nonzero
only if variables $i$ and $j$ are physically close.
In such a case, it seems natural to subdivide the network, or
grid, into suitable blocks and to distribute them over the processors.
When computing $Ax$, each processor requires the values of $x$ at
some nodes in neighboring blocks. If the number of connections to these
neighboring blocks is small compared to the number of internal nodes,
then the communication time can be overlapped with computational work.
For more detailed discussions on implementation aspects for distributed
memory systems, see De Sturler
[63] and
Pommerell
[175].
Preconditioning is often the most problematic part of parallelizing
an iterative method.
We will mention a number of approaches to obtaining parallelism in
preconditioning.
Certain preconditioners were not developed with parallelism in mind,
but they can be executed in parallel. Some examples are domain
decomposition methods (see §
), which
provide a high degree of coarse grained parallelism,
and polynomial preconditioners
(see §
), which have the same parallelism as
the matrix-vector product.
Incomplete factorization preconditioners are usually much harder to
parallelize: using wavefronts of independent computations (see for
instance Paolini and Radicati di Brozolo
[170]) a
modest amount of parallelism
can be attained, but the implementation is complicated. For instance,
a central difference discretization on regular grids gives wavefronts
that are hyperplanes
(see Van der Vorst
[205]
[203]).
Variants of existing sequential incomplete factorization
preconditioners with a higher degree of parallelism have been devised,
though they are perhaps less efficient in purely scalar terms than
their ancestors. Some examples are: reorderings of the
variables (see Duff and Meurant
[79] and
Eijkhout
[85]), expansion of the
factors in a truncated Neumann series (see
Van der Vorst
[201]),
various block factorization methods (see
Axelsson and Eijkhout
[15]
and
Axelsson and Polman
[21]),
and multicolor preconditioners.
Multicolor preconditioners have optimal parallelism among incomplete
factorization methods, since the minimal number of sequential steps
equals the color number of the matrix graphs. For theory and
applications to parallelism
see Jones and Plassman
[128]
[127].
If all processors execute their part of the preconditioner solve
without further communication, the overall method is technically a
block Jacobi preconditioner (see §
).
While their parallel execution is very efficient, they
may not be as effective as more complicated, less parallel
preconditioners, since improvement in the number of iterations
may be only modest.
To get a bigger improvement while retaining the efficient parallel
execution,
Radicati di Brozolo and Robert
[178] suggest that one construct
incomplete decompositions on slightly overlapping domains. This requires
communication similar to that for matrix-vector products.
At first sight, the Gauss-Seidel method (and the SOR method which has
the same basic structure) seems to be a fully sequential method.
A more careful analysis, however, reveals a high degree of parallelism
if the method is applied to sparse matrices such as those arising from
discretized partial differential equations.
We start by partitioning the unknowns in wavefronts. The first
wavefront contains those unknowns that (in the directed graph of
$D - L$) have no predecessor; subsequent wavefronts are then sets (this
definition is not necessarily unique) of successors of elements of the
previous wavefront(s), such that no successor/predecessor relations hold
among the elements of this set. It is clear that all elements of a
wavefront can be processed simultaneously, so the sequential time of
solving a system with $D - L$ can be reduced to the number of
wavefronts.
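As an illustration, the wavefront (level) numbers can be computed in a single pass over the matrix structure; the sketch below (routine name ours) assumes CRS index arrays and considers only the strictly lower triangular couplings, i.e. the predecessors under the natural ordering.

      SUBROUTINE WAVFRT( N, COLIND, ROWPTR, LEVEL, NLEV )
*     LEVEL(I) = 1 + maximum level of the predecessors of unknown I;
*     all unknowns with the same level form one wavefront, and NLEV
*     returns the number of wavefronts (sequential steps).
      INTEGER            N, NLEV
      INTEGER            COLIND( * ), ROWPTR( * ), LEVEL( * )
      INTEGER            I, J, K, LMAX
      NLEV = 0
      DO 20 I = 1, N
         LMAX = 0
         DO 10 J = ROWPTR( I ), ROWPTR( I+1 ) - 1
            K = COLIND( J )
            IF( K.LT.I ) LMAX = MAX( LMAX, LEVEL( K ) )
   10    CONTINUE
         LEVEL( I ) = LMAX + 1
         NLEV = MAX( NLEV, LEVEL( I ) )
   20 CONTINUE
      RETURN
      END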
Next, we observe that the unknowns in a wavefront can be computed as
soon as all wavefronts containing its predecessors have been computed.
Thus we can, in the absence of tests for convergence, have components
from several iterations being computed simultaneously.
Adams and Jordan
[2] observe that in this way
the natural ordering of unknowns gives an iterative method that is
mathematically equivalent to a multi-color ordering.
In the multi-color ordering, all wavefronts of the same color are
processed simultaneously. This reduces the number of sequential steps
for solving the Gauss-Seidel matrix to the number of colors, which is
the smallest number
such that wavefront
contains no
elements that are a predecessor of an element in wavefront
.
As demonstrated by O'Leary
[164], SOR theory still holds
in an approximate sense for multi-colored matrices. The above
observation that the Gauss-Seidel method with the natural ordering is
equivalent to a multicoloring cannot be extended to the SSOR method or
wavefront-based incomplete factorization preconditioners for the
Conjugate Gradient method. In fact, tests by Duff and
Meurant
[79] and
an analysis by Eijkhout
[85] show that multicolor incomplete
factorization preconditioners in general may take a considerably
larger number of iterations to converge than preconditioners based on
the natural ordering. Whether this is offset by the increased
parallelism depends on the application and the computer architecture.
In addition to the usual matrix-vector product, inner products and
vector updates, the preconditioned GMRES method
(see §
) has a kernel where one new vector, $Av^{(j)}$, is orthogonalized against the previously built
orthogonal set $\{v^{(1)}, v^{(2)}, \ldots, v^{(j)}\}$.
In our version, this is
done using Level 1 BLAS, which may be quite inefficient. To
incorporate Level 2 BLAS we can apply either Householder
orthogonalization or classical Gram-Schmidt twice (which mitigates
classical Gram-Schmidt's potential instability; see
Saad
[185]). Both
approaches significantly increase the computational work, but using
classical Gram-Schmidt has the advantage that all inner products can
be performed simultaneously; that is, their communication can be
packaged. This may increase the efficiency of the computation
significantly.
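As a sketch of the second alternative, the following routine (name and argument conventions ours) orthogonalizes a new vector W against the orthonormal columns of V with classical Gram-Schmidt applied twice; the two calls to the Level 2 BLAS routine SGEMV per pass compute all inner products simultaneously, which is what allows the communication to be packaged.

      SUBROUTINE CGS2( N, J, V, LDV, W, H, T )
*     Orthogonalize W against V(:,1:J) by classical Gram-Schmidt
*     applied twice; H(1:J) returns the accumulated coefficients and
*     T(1:J) is scratch.
      INTEGER            N, J, LDV
      REAL               V( LDV, * ), W( * ), H( * ), T( * )
      INTEGER            K, PASS
      DO 10 K = 1, J
         H( K ) = 0.0E0
   10 CONTINUE
      DO 30 PASS = 1, 2
*        all inner products at once: t = V'*w
         CALL SGEMV( 'T', N, J, 1.0E0, V, LDV, W, 1, 0.0E0, T, 1 )
*        subtract the projections: w = w - V*t
         CALL SGEMV( 'N', N, J, -1.0E0, V, LDV, T, 1, 1.0E0, W, 1 )
         DO 20 K = 1, J
            H( K ) = H( K ) + T( K )
   20    CONTINUE
   30 CONTINUE
      RETURN
      END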
Another way to obtain more parallelism and
data locality is to generate a basis
$\{v^{(1)}, Av^{(1)}, \ldots, A^{m-1}v^{(1)}\}$ for the Krylov subspace first,
and to orthogonalize this set afterwards; this is called
$m$-step GMRES($m$) (see Kim and Chronopoulos
[139]).
(Compare this to the GMRES method in §
, where each
new vector is immediately orthogonalized to all previous vectors.)
This approach does not
increase the computational work and, in contrast to CG, the numerical
instability due to generating a possibly near-dependent set is not
necessarily a drawback.
As discussed by Paige and Saunders in
[168] and by
Golub and Van Loan in
[109], it is straightforward to
derive the conjugate gradient method for solving symmetric positive
definite linear systems from the Lanczos algorithm for solving
symmetric eigensystems and vice versa. As an example, let us consider
how one can derive the Lanczos process for symmetric eigensystems from
the (unpreconditioned) conjugate gradient method.
Suppose we define the $n\times k$ matrix $R_k$ by
$R_k = [r^{(0)}, r^{(1)}, \ldots, r^{(k-1)}]$, and the $k\times k$
upper bidiagonal matrix $B_k$ by
where the sequences
and
are defined by the standard
conjugate gradient algorithm discussed in §
.
From the equations
and
, we have
, where
Assuming the elements of the sequence
are
-conjugate,
it follows that
is a tridiagonal matrix since
Since span{
} =
span{
} and since the elements of
are mutually orthogonal, it can be shown that the columns of
matrix
form an orthonormal basis
for the subspace
, where
is a diagonal matrix whose
th diagonal element is
. The columns of the matrix
are the Lanczos vectors (see
Parlett
[171]) whose associated projection of
is
the tridiagonal matrix
The extremal eigenvalues of the tridiagonal matrix $T_k$ approximate those of the
matrix $A$. Hence,
the diagonal and subdiagonal elements of $T_k$ in
(
), which are readily available during iterations of the
conjugate gradient algorithm (§
),
can be used to construct $T_k$ after $k$ CG iterations. This
allows us to obtain good approximations to the extremal eigenvalues
(and hence the condition number) of the matrix $A$ while we are generating
approximations, $x^{(i)}$, to the solution of the linear system $Ax = b$.
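As a sketch of how this construction is commonly written down (our notation; $\alpha_j$ and $\beta_j$ denote the scalars of the unpreconditioned CG iteration), the nonzero entries of the tridiagonal matrix can be accumulated as
\[
(T_k)_{1,1} = \frac{1}{\alpha_1}, \qquad
(T_k)_{j,j} = \frac{1}{\alpha_j} + \frac{\beta_{j-1}}{\alpha_{j-1}} \;(j>1), \qquad
(T_k)_{j+1,j} = (T_k)_{j,j+1} = \frac{\sqrt{\beta_j}}{\alpha_j},
\]
after which the extremal eigenvalues of $T_k$ can be obtained with any symmetric tridiagonal eigensolver.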
For a nonsymmetric matrix $A$, an equivalent nonsymmetric Lanczos
algorithm (see Lanczos
[142]) would produce a
nonsymmetric tridiagonal matrix in (
) whose extremal eigenvalues
(which may include complex-conjugate pairs) approximate those of $A$.
The nonsymmetric Lanczos method is equivalent to the BiCG method
discussed in §
.
The methods discussed so far are all subspace methods, that is, in
every iteration they extend the dimension of the subspace generated. In
fact, they generate an orthogonal basis for this subspace, by
orthogonalizing the newly generated vector with respect to the
previous basis vectors.
However, in the case of nonsymmetric coefficient matrices the newly
generated vector may be almost linearly dependent on the existing
basis. To prevent break-down or severe numerical error in such
instances, methods have been proposed that perform a look-ahead step
(see Freund, Gutknecht and
Nachtigal
[101], Parlett, Taylor and
Liu
[172], and Freund and
Nachtigal
[102]).
Several new, unorthogonalized, basis
vectors are generated and are then orthogonalized with
respect to the subspace already generated. Instead of generating a
basis, such a method generates a series of low-dimensional orthogonal
subspaces.
The
-step iterative methods of Chronopoulos and
Gear
[55] use this strategy of generating
unorthogonalized vectors and processing them as a block to reduce
computational overhead and improve processor cache behavior.
If conjugate gradient methods are considered to generate a
factorization of a tridiagonal reduction of the original matrix, then
look-ahead methods generate a block factorization of a block
tridiagonal reduction of the matrix.
A block tridiagonal reduction is also effected by the
Block Lanczos algorithm and the Block Conjugate Gradient
method
(see O'Leary
[163]).
Such methods operate on multiple linear systems with the same
coefficient matrix simultaneously, for instance with multiple right hand
sides, or the same right hand side but with different initial guesses.
Since these block methods use multiple search directions in each step,
their convergence behavior is better than for ordinary methods. In fact,
one can show that the spectrum of the matrix is effectively
reduced by the
smallest eigenvalues, where
is the block
size.
The Jacobi method is easily derived by examining each of
the $n$ equations in the linear system $Ax = b$ in isolation. If in
the $i$th equation
\[ \sum_{j=1}^{n} a_{i,j}x_j = b_i, \]
we solve for the value of $x_i$ while assuming the other entries
of $x$ remain fixed, we obtain
\[ x_i = \Bigl(b_i - \sum_{j\ne i} a_{i,j}x_j\Bigr)/a_{i,i}. \]
This suggests an iterative method defined by
\[ x^{(k)}_i = \Bigl(b_i - \sum_{j\ne i} a_{i,j}x^{(k-1)}_j\Bigr)/a_{i,i}, \]
which is the Jacobi method. Note that the order in which the
equations are examined is irrelevant, since the Jacobi method treats
them independently. For this reason, the Jacobi method is also known
as the method of simultaneous displacements, since the updates
could in principle be done simultaneously.
Simultaneous displacements, method of: Jacobi method.
In matrix terms, the definition of the Jacobi method
in (
) can be expressed as
\[ x^{(k)} = D^{-1}(L+U)x^{(k-1)} + D^{-1}b, \]
where the matrices $D$, $-L$ and $-U$ represent the diagonal, the
strictly lower-triangular, and the strictly upper-triangular parts of $A$,
respectively.
The pseudocode for the Jacobi method is given in Figure
.
Note that an auxiliary storage vector, $\bar{x}$, is used in the
algorithm. It is not possible to update the vector $x$ in place,
since values from $x^{(k-1)}$ are needed throughout the
computation of $x^{(k)}$.
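A minimal Fortran sketch of one Jacobi sweep for a dense system, using an auxiliary vector exactly as described above, is given below (the routine and argument names are ours, not those of the accompanying templates). The caller copies XNEW back into X between sweeps and tests for convergence.

      SUBROUTINE JACSWP( N, A, LDA, B, X, XNEW )
*     One Jacobi sweep: for each i,
*        xnew(i) = ( b(i) - sum_{j<>i} a(i,j)*x(j) ) / a(i,i)
      INTEGER            N, LDA
      REAL               A( LDA, * ), B( * ), X( * ), XNEW( * )
      INTEGER            I, J
      REAL               SUM
      DO 20 I = 1, N
         SUM = B( I )
         DO 10 J = 1, N
            IF( J.NE.I ) SUM = SUM - A( I, J )*X( J )
   10    CONTINUE
         XNEW( I ) = SUM / A( I, I )
   20 CONTINUE
      RETURN
      END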
Reduced system: Linear system obtained by eliminating
certain variables from another linear system.
Although the number of variables is smaller
than for the original system, the matrix of a reduced system generally
has more nonzero entries. If the original matrix was symmetric and positive
definite, then the reduced system has a smaller condition number.
As we have seen earlier, a suitable preconditioner for CG is a
matrix $M$ such that the system
\[ M^{-1}Ax = M^{-1}b \]
requires fewer iterations to solve than $Ax = b$ does, and for which
systems $Mz = r$ can be solved efficiently. The first property is
independent of the machine used, while the second is highly machine
dependent. Choosing the best preconditioner involves balancing those
two criteria in a way that minimizes the overall computation time.
One balancing approach used for matrices $A$ arising from $5$-point
finite difference discretization of second order elliptic partial
differential equations (PDEs) with Dirichlet boundary conditions
involves solving a reduced system. Specifically, for an $n\times n$
grid, we can use a point red-black ordering of the nodes to
get
where
and
are diagonal, and
is a well-structured
sparse matrix with
nonzero diagonals if
is even and
nonzero diagonals if
is odd. Applying one step
of block Gaussian elimination (or computing the
Schur complement; see Golub and Van Loan
[109]) we have
which reduces to
With proper scaling (left and right multiplication by
),
we obtain from the second block equation the reduced system
where
,
, and
. The linear system (
) is of
order
for even
and of order
for odd
. Once
is determined, the solution
is easily retrieved from
. The
values on the black points are those that would be obtained from a
red/black ordered SSOR preconditioner on the full system, so we expect
faster convergence.
The number of nonzero coefficients is reduced, although the
coefficient matrix in (
) has nine nonzero diagonals.
Therefore it has higher density and offers more data locality.
Meier and Sameh
[150] demonstrate that the reduced system
approach on hierarchical memory
machines such as the Alliant FX/8 is over
times faster than unpreconditioned CG for Poisson's equation on
grids with
.
For
-dimensional elliptic PDEs, the reduced system approach yields
a block tridiagonal matrix
in (
) having diagonal
blocks of the structure of
from the
-dimensional case and
off-diagonal blocks that are diagonal matrices. Computing the reduced
system explicitly leads to an unreasonable increase in the
computational complexity of solving
. The matrix products
required to solve (
) would therefore be performed implicitly
which could significantly decrease performance. However, as Meier and
Sameh show
[150], the reduced system approach can still be about
-
times as fast as the conjugate gradient method with Jacobi
preconditioning for
-dimensional problems.
Domain decomposition method: Solution method for
linear systems based on a partitioning of the physical domain
of the differential equation. Domain decomposition methods typically
involve (repeated) independent system solution on the subdomains,
and some way of combining data from the subdomains on the separator
part of the domain.
In recent years, much attention has been given to domain decomposition
methods for linear elliptic problems that are based on a partitioning
of the domain of the physical problem. Since the subdomains can be
handled independently, such methods are very attractive for
coarse-grain parallel computers. On the other hand, it should be
stressed that they can be very effective even on sequential computers.
In this brief survey, we shall restrict ourselves to the standard
second order self-adjoint scalar elliptic problems in two dimensions
of the form:
where $a(x,y)$ is a positive function on the domain $\Omega$, on whose
boundary the value of $u$ is prescribed (the Dirichlet problem). For
more general problems, and a good set of references, the reader is
referred to the series of
proceedings
[177]
[135]
[107]
[49]
[48]
[47]
and the surveys
[196]
[51].
For the discretization of (
), we shall assume for
simplicity that $\Omega$ is triangulated by a set $T_H$ of nonoverlapping
coarse triangles (subdomains) $\Omega_i$, $i = 1, \ldots, p$, with $n_H$ internal
vertices. The $\Omega_i$'s are in turn
further refined into a set of smaller triangles $T_h$ with $n_h$
internal vertices in total.
Here $H$ and $h$ denote the coarse and fine mesh size respectively.
By a standard Ritz-Galerkin method using piecewise linear triangular
basis elements on (
), we obtain an $n_h \times n_h$
symmetric positive definite linear system $Au = f$.
Generally, there are two kinds of approaches depending on whether
the subdomains overlap with one another
(Schwarz methods
) or are separated from
one another by interfaces (Schur Complement methods
,
iterative substructuring).
We shall present domain decomposition methods as preconditioners
for the linear system $Au = f$ or to
a reduced (Schur Complement) system $S_Bu_B = g_B$
defined on the interfaces in the non-overlapping formulation.
When used with the standard Krylov subspace methods discussed
elsewhere in this book, the user has to supply a procedure
for computing $Av$ or $Sv_B$ given $v$ or $v_B$, and the algorithms to be described
herein compute $M^{-1}v$.
The computation of $Av$ is a simple sparse matrix-vector
multiply, but $Sv_B$ may require subdomain solves, as will be described later.
In this approach, each substructure $\Omega_i$ is extended to a
larger substructure $\Omega_i'$ containing $n_i'$ internal vertices and all the
triangles of $T_h$ within a distance $\delta$ from $\Omega_i$, where $\delta$
refers to the amount of overlap.
Let $A_i'$ and $A_H$ denote the discretizations
of (
) on the subdomain
triangulation $\Omega_i'$ and the coarse triangulation $T_H$
respectively.
Let $R_i^T$ denote the extension operator which extends (by zero) a
function on $\Omega_i'$ to $\Omega$, and $R_i$
the corresponding pointwise restriction operator.
Similarly, let $R_H^T$ denote the interpolation operator
which maps a function on the coarse grid $T_H$ onto the fine
grid $T_h$ by piecewise linear interpolation,
and $R_H$ the corresponding weighted restriction operator.
With these notations, the Additive Schwarz Preconditioner $M_{as}$ for
the system $Au = f$ can be compactly described as:
\[ M_{as}^{-1}v = R_H^TA_H^{-1}R_Hv + \sum_i R_i^T(A_i')^{-1}R_iv. \]
Note that the right hand side can be computed using $p$ subdomain
solves using the $A_i'$'s, plus a coarse grid solve using $A_H$,
all of which can be computed in parallel.
Each term $R_i^T(A_i')^{-1}R_iv$ should be evaluated in three steps:
(1) Restriction: $v_i = R_iv$,
(2) Subdomain solves for $w_i$: $A_i'w_i = v_i$,
(3) Interpolation: $y_i = R_i^Tw_i$.
The coarse grid solve is handled in the same manner.
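In an implementation, each term $R_i^T(A_i')^{-1}R_iv$ amounts to a gather, a subdomain solve, and a scatter-add. A Fortran sketch is given below; the routine name ASTERM, the index array IDX describing the extended subdomain, and the subdomain solver SOLVEI (standing for whatever factorization or iterative solver has been prepared for $A_i'$) are all ours.

      SUBROUTINE ASTERM( NI, IDX, V, VI, WI, Y, SOLVEI )
*     One additive Schwarz term: y <- y + trans(R_i)*inv(A_i')*R_i*v.
*     IDX(1:NI) lists the fine grid indices of the extended subdomain,
*     VI and WI are work vectors of length NI, and SOLVEI(NI,VI,WI)
*     solves A_i' w_i = v_i.  Y is accumulated into, so the caller
*     initializes it (for instance with the coarse grid term).
      INTEGER            NI, IDX( * )
      REAL               V( * ), VI( * ), WI( * ), Y( * )
      EXTERNAL           SOLVEI
      INTEGER            K
*     (1) restriction: v_i = R_i v
      DO 10 K = 1, NI
         VI( K ) = V( IDX( K ) )
   10 CONTINUE
*     (2) subdomain solve: A_i' w_i = v_i
      CALL SOLVEI( NI, VI, WI )
*     (3) interpolation (extension by zero): y = y + trans(R_i) w_i
      DO 20 K = 1, NI
         Y( IDX( K ) ) = Y( IDX( K ) ) + WI( K )
   20 CONTINUE
      RETURN
      END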
The theory of Dryja and Widlund
[76] shows that
the condition number of $M_{as}^{-1}A$ is bounded independently
of both the coarse grid size $H$ and the fine grid size $h$,
provided there is ``sufficient'' overlap between $\Omega_i$ and $\Omega_i'$
(essentially it means that the ratio $\delta/H$ of
the distance $\delta$ of the boundary $\partial\Omega_i'$ to $\Omega_i$
should be uniformly bounded from below as $h \rightarrow 0$).
If the coarse grid solve term is left out, then the
condition number grows as $O(1/H^2)$, reflecting the lack
of global coupling provided by the coarse grid.
For the purpose of implementations, it is often useful to interpret
the definition of
in matrix notation.
Thus the decomposition of
into
's corresponds to partitioning
of the components of the vector
into
overlapping groups of
index sets
's, each with
components.
The
matrix
is simply a principal submatrix of
corresponding to the index set
.
is a
matrix
defined by its action on
a vector
defined on
as:
if
but is zero otherwise.
Similarly, the action of its transpose
forms an
-vector by picking
off the components of
corresponding to
.
Analogously,
is an
matrix
with entries corresponding to piecewise linear interpolation
and its transpose can be interpreted as a weighted restriction matrix.
If
is obtained from
by nested refinement,
the action of
can be efficiently computed
as in a standard multigrid algorithm.
Note that the matrices
are defined
by their actions and need not be stored explicitly.
We also note that in this algebraic formulation, the preconditioner
can be extended to any matrix
,
not necessarily one arising from a discretization of an elliptic problem.
Once we have the partitioning
index sets
's, the matrices
are defined.
Furthermore, if
is positive definite, then
is guaranteed
to be nonsingular. The difficulty is in defining the ``coarse grid''
matrices
, which inherently depends on knowledge of the
grid structure.
This part of the preconditioner can be left out, at the expense
of a deteriorating convergence rate as
increases.
Radicati and Robert
[178]
have experimented with such an algebraic overlapping block Jacobi
preconditioner.
The easiest way to describe this approach is through
matrix notation.
The set of vertices of
can be divided into two groups.
The set of interior vertices
of
all
and the set of vertices
which lies on the boundaries
of the coarse triangles
in
.
We shall re-order
and
as
and
corresponding to this partition.
In this ordering, equation (
) can be written
as follows:
We note that since the subdomains are uncoupled by the boundary vertices,
$A_{II}$ is block-diagonal with each block $A_i$ being the stiffness matrix corresponding
to the unknowns belonging to the interior
vertices of subdomain $\Omega_i$.
By a block LU-factorization of
, the system in (
)
can be written as:
where
is the Schur complement of
in
.
By eliminating
in (
), we arrive at
the following equation for
:
We note the following properties of this Schur Complement system:
inherits the symmetric positive definiteness of
.
is dense in general and computing it explicitly
requires as many solves on each subdomain as there are
points on each of its edges.
is
, an improvement
over the
growth for
.
defined on the boundary vertices
of
,
the matrix-vector product
can be computed according to
where
involves
independent subdomain solves using
.
can also be computed using
independent subdomain solves.
We shall first describe
the Bramble-Pasciak-Schatz preconditioner
[36].
For this, we need to further decompose
into two non-overlapping
index sets:
where
denote the set of
nodes
corresponding to the vertices
's of
, and
denote the set of
nodes
on the edges
's
of the coarse triangles in
, excluding
the vertices
belonging to
.
In addition to the coarse grid interpolation and restriction
operators
defined before,
we shall also need a new set of interpolation and restriction
operators for the edges
's.
Let
be the pointwise restriction operator
(an
matrix, where
is the number
of vertices on the edge
)
onto the edge
defined by its action
if
but is zero otherwise.
The action of its transpose extends by zero a function
defined on
to one defined on
.
Corresponding to this partition of
,
can be written in
the block form:
The block
can again be block partitioned, with most of
the subblocks being zero. The diagonal blocks
of
are the
principal submatrices of
corresponding to
.
Each
represents the coupling of nodes on interface
separating two neighboring subdomains.
In defining the preconditioner, the action of
is
needed. However, as noted before, in general
is a dense
matrix which is also expensive to compute, and even if we had it, it
would be expensive to compute its action (we would need to compute its
inverse or a Cholesky factorization). Fortunately, many efficiently
invertible approximations to
have been proposed in the
literature (see Keyes and Gropp
[136]) and we shall use these
so-called interface preconditioners for
instead.
We mention one specific preconditioner:
where
is an
one dimensional
Laplacian matrix, namely a tridiagonal matrix with
's down
the main diagonal and
's down the two off-diagonals,
and
is taken to be some average of the coefficient
.
We note that since the eigen-decomposition of
is known
and computable by the Fast Sine Transform, the action of
can be efficiently computed.
With these notations, the Bramble-Pasciak-Schatz preconditioner
is defined by its action on a vector
defined on
as follows:
Analogous to the additive Schwarz preconditioner,
the computation of each term consists of the three steps
of restriction-inversion-interpolation and
is independent of the others and thus can be carried out in parallel.
Bramble, Pasciak and Schatz
[36] prove that the condition
number of
is bounded by
. In
practice, there is a slight growth in the number of iterations as
becomes small (i.e., as the fine grid is refined) or as
becomes large (i.e., as the coarse grid becomes coarser).
The
growth is due to the coupling of the unknowns on the
edges incident on a common vertex
, which is not accounted for in
. Smith
[191] proposed a vertex space
modification to
which explicitly accounts for this coupling
and therefore eliminates the dependence on
and
. The idea is
to introduce further subsets of
called vertex spaces
with
consisting of a small set of vertices on
the edges incident on the vertex
and adjacent to it. Note that
overlaps with
and
. Let
be the principal
submatrix of
corresponding to
, and
be
the corresponding restriction (pointwise) and extension (by zero)
matrices. As before,
is dense and expensive to compute and
factor/solve but efficiently invertible approximations (some using
variants of the
operator defined before) have been developed
in the literature (see Chan, Mathew and
Shao
[52]). We shall let
be such a
preconditioner for
. Then Smith's Vertex Space
preconditioner is defined by:
Smith
[191] proved that the condition number
of
is bounded independent of
and
, provided
there is sufficient overlap of
with
As mentioned before,
the Additive Schwarz preconditioner can be
viewed as an overlapping block Jacobi preconditioner.
Analogously, one can define a multiplicative Schwarz
preconditioner which corresponds to a symmetric block Gauss-Seidel
version. That is, the solves on each subdomain are performed
sequentially, using the most current iterates as boundary conditions
from neighboring subdomains. Even without conjugate gradient
acceleration, the multiplicative
method can take many fewer iterations than the additive version.
However, the multiplicative version is not as parallelizable,
although the degree of parallelism can be increased
by trading off the convergence rate through
multi-coloring the subdomains.
The theory can be found in Bramble, et al.
[37].
The exact solves involving
and
in
can be replaced by inexact solves
and
,
which can be standard elliptic preconditioners themselves
(e.g. multigrid, ILU, SSOR, etc.).
For the Schwarz methods, the modification is straightforward
and the Inexact Solve Additive Schwarz Preconditioner
is simply:
The Schur Complement methods require more changes to accommodate
inexact solves.
By replacing
by
in
the definitions of
and
, we can easily obtain
inexact preconditioners
and
for
.
The main difficulty is, however, that the evaluation of the product
requires exact subdomain solves in
.
One way to get around this
is to use an inner iteration using
as a preconditioner for
in order to compute the action
of
.
An alternative is to perform the iteration on the larger system
(
) and construct a preconditioner from the
factorization in (
) by replacing the terms
by
respectively,
where
can be either
or
.
Care must be taken to scale
and
so that they are as close to
and
as possible respectively -
it is not sufficient that the condition number of
and
be close to unity, because
the scaling of the coupling matrix
may be wrong.
The preconditioners given above extend naturally to nonsymmetric
's (e.g., convection-diffusion problems), at least when the
nonsymmetric part is not too large. The nice theoretical convergence
rates can be retained provided that the coarse grid size
is chosen
small enough (depending on the size of the nonsymmetric part of
)
(see Cai and Widlund
[43]).
Practical implementations (especially for parallelism) of nonsymmetric
domain decomposition methods are discussed
in
[138]
[137].
Given
, it has been observed empirically (see Gropp and
Keyes
[111]) that there often exists an optimal value of
which minimizes the total computational time for solving the problem.
A small
provides a better, but more expensive, coarse grid
approximation, and requires solving more, but smaller, subdomain
solves. A large
has the opposite effect. For model problems, the
optimal
can be determined for both sequential and parallel
implementations (see Chan and Shao
[53]). In
practice, it may pay to determine a near optimal value of
empirically if the preconditioner is to be re-used many times.
However, there
may also be geometric constraints on the range of values that
can
take.
Multigrid method: Solution method for linear systems
based on restricting and extrapolating solutions between a series
of nested grids.
Simple iterative methods (such as the Jacobi
method) tend to damp out high frequency components of the error
fastest (see §
). This has led people to
develop methods based on the following heuristic:
The method outlined above is said to be a ``V-cycle'' method, since it
descends through a sequence of subsequently coarser grids, and then
ascends this sequence in reverse order. A ``W-cycle'' method results
from visiting the coarse grid twice, with possibly some
smoothing steps in between.
An analysis of multigrid methods is relatively straightforward in the
case of simple differential operators such as the Poisson operator on
tensor product grids. In that case, each next coarse grid is taken to
have the double grid spacing of the previous grid. In two dimensions,
a coarse grid will have one quarter of the number of points of the
corresponding fine grid. Since the coarse grid is again a tensor
product grid, a Fourier analysis (see for instance
Briggs
[42]) can be used. For the more general case
of self-adjoint elliptic operators on arbitrary domains a more
sophisticated analysis is
needed (see Hackbusch
[117],
McCormick
[148]). Many
multigrid methods can be shown to have an (almost) optimal number of
operations, that is, the work involved is proportional to the number
of variables.
From the above description it is clear that iterative methods play a
role in multigrid theory as smoothers (see
Kettler
[133]). Conversely, multigrid-like
methods can be used as preconditioners in iterative methods. The basic
idea here is to partition the matrix on a given grid to a
structure
with the variables in the second block row corresponding to the coarse
grid nodes. The matrix on the next grid is then an incomplete
version of the Schur complement
The coarse grid is typically formed based on a red-black or
cyclic reduction ordering; see for
instance Rodrigue and Wolitzer
[180], and
Elman
[93].
Some multigrid preconditioners try to obtain optimality results
similar to those for the full multigrid method. Here we will merely
supply some pointers to the literature:
Axelsson and Eijkhout
[16], Axelsson and
Vassilevski
[22]
[23],
Braess
[35], Maitre and Musy
[145],
McCormick and Thomas
[149], Yserentant
[218]
and Wesseling
[215].
Iterative methods are often used for solving discretized partial
differential equations. In that context a rigorous analysis of the
convergence of simple methods such as the Jacobi method can be given.
As an example, consider the boundary value problem
\[ -u_{xx} = f \quad \hbox{on } (0,1), \qquad u(0) = u_0, \quad u(1) = u_1, \]
discretized by
\[ -u_{i-1} + 2u_i - u_{i+1} = h^2f_i, \qquad h = 1/N, \quad i = 1, \ldots, N-1. \]
The eigenfunctions of the continuous and of the discretized operator are the same:
for $n = 1, \ldots, N-1$ the function $u_n(x) = \sin n\pi x$ is an
eigenfunction, corresponding for the discretized operator to the eigenvalue
$\lambda_n = 4\sin^2(n\pi h/2)$. The
eigenvalues of the Jacobi iteration matrix are then
$\mu_n = 1 - {\textstyle\frac{1}{2}}\lambda_n = 1 - 2\sin^2(n\pi h/2)$.
From this it is easy to see that the high frequency modes (i.e.,
eigenfunctions $u_n$ with $n$ large) are damped quickly, whereas the
damping factor for modes with $n$ small is close to $1$. The spectral
radius of the Jacobi iteration matrix is $\approx 1 - \pi^2h^2/2$, and it is
attained for the eigenfunction $u_1(x) = \sin\pi x$.
Spectral radius: The spectral radius of a matrix $M$ is
$\rho(M) \equiv \max\{\,|\lambda| : \lambda \hbox{ is an eigenvalue of } M\,\}$.
Spectrum: The set of all eigenvalues of a matrix.
The type of analysis applied to this example can be generalized to
higher dimensions and other stationary iterative methods. For both the
Jacobi and Gauss-Seidel method
(below) the spectral radius is found to be $1 - O(h^2)$,
where $h$ is the
discretization mesh width, i.e., $h = N^{-1/d}$
where $N$ is the
number of variables and $d$ is the number of space dimensions.
Most iterative methods depend on spectral properties of the
coefficient matrix, for instance some require the eigenvalues to be in
the right half plane. A class of methods without this limitation is
that of row projection methods (see Björck and
Elfving
[34], and Bramley and Sameh
[38]).
They are based on a block row partitioning of the coefficient matrix
and iterative application of orthogonal projectors
These methods have good parallel properties and seem to be robust in
handling nonsymmetric and indefinite problems.
Row projection methods can be used as
preconditioners in the conjugate gradient method. In that case, there
is a theoretical connection with the conjugate gradient method on the
normal equations (see §
).
A large body of numerical software is freely available 24 hours a day
via an electronic service called Netlib. In addition to the
template material, there are dozens of other libraries, technical
reports on various parallel computers and software, test data,
facilities to automatically translate FORTRAN programs to C, bibliographies, names and addresses of scientists and
mathematicians, and so on. One can communicate with Netlib in one of
a number of ways: by email, through anonymous ftp (netlib2.cs.utk.edu)
or
(much more easily) via the World Wide Web
through some web browser like Netscape or Mosaic.
The url for the Templates work is:
http://www.netlib.org/templates/ .
The html version of this book can be found in:
http://www.netlib.org/templates/Templates.html .
To get started using netlib, one sends a message of the form send
index to netlib@ornl.gov. A description of the entire
library should be sent to you within minutes (providing all the
intervening networks as well as the netlib server are up).
FORTRAN and C versions of the templates for each method
described in this book are available from Netlib. A user sends a
request by electronic mail as follows:
Save the mail message to a file called templates.shar. Edit the
mail header from this file and delete the lines down to and including
<< Cut Here >>. In the directory containing the shar file, type sh templates.shar .
Note that the matrix-vector operations are accomplished using the BLAS
[144] (many manufacturers have assembly coded these
kernels for maximum performance), although a mask file is provided to
link to user defined routines.
The README file gives more details, along with instructions for
a test routine.
The BLAS give us a standardized
set of basic codes for performing operations on vectors and matrices.
BLAS take advantage of the Fortran storage structure and the structure
of the mathematical system wherever possible. Additionally, many
computers have the BLAS library optimized to their system. Here we use
five routines:
The prefix ``S'' denotes single precision. This prefix may be
changed to ``D'', ``C'', or ``Z'', giving
the routine double, complex,
or double complex precision. (Of course, the declarations would also
have to be changed.) It is important to note that putting double precision
into single variables works, but single into double will cause errors.
If we define $a_{i,j} =$ a(i,j) and
$x_i$ = x(i), we can see what the
code is doing:
The corresponding Fortran segment is
The corresponding Fortran segment is
The corresponding Fortran segment:
This illustrates a feature of the BLAS that often requires close
attention. For example, we will use this routine to compute the residual
vector $b - A\hat{x}$, where $\hat{x}$ is our current approximation to the
solution $x$ (merely change the fourth argument to -1.0E0). Vector $b$
will be overwritten with the residual vector; thus, if we need it later, we
will first copy it to temporary storage.
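The routine in question is presumably the Level 2 BLAS matrix-vector product SGEMV, whose fourth argument is the scalar multiplying $Ax$; a minimal usage sketch, copying $b$ to a scratch vector R first so that the right-hand side is preserved, is:

      CALL SCOPY( N, B, 1, R, 1 )
*     r := -1.0*A*x + 1.0*r, i.e. the residual b - A*x
      CALL SGEMV( 'N', N, N, -1.0E0, A, LDA, X, 1, 1.0E0, R, 1 )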
The corresponding Fortran segment is
Note that the parameters in single quotes are for descriptions
such as
The variable
where
and
denote the largest and
smallest eigenvalues, respectively. For linear systems derived from
partial differential equations in 2D, the condition number is
proportional to the number of unknowns.
In this section, we present some of the notation we use
throughout the book.
We have tried to use standard notation that would be found
in any current publication on the subjects covered.
Throughout, we follow several conventions:
We define matrix
of dimension
and block dimension
as follows:
We define vector
of dimension
as follows:
Other notation is as follows:
Consider again the linear equations in (
). If
we proceed as with the Jacobi method, but now assume
that the equations are examined one at a time in sequence, and that
previously computed results are used as soon as they are available, we
obtain the Gauss-Seidel method:
Two important facts about the Gauss-Seidel method should be noted.
First, the computations in (
) appear
to be serial. Since each component of the new iterate depends upon
all previously computed components, the updates cannot be done
simultaneously as in the Jacobi method. Second, the new iterate
depends upon the order in which the equations are examined.
The Gauss-Seidel method is sometimes called the method of
successive displacements to indicate the dependence of the
iterates on the ordering. If this ordering is changed, the components of the new iterate (and not just their order) will
also change.
Successive displacements, method of: Gauss-Seidel method.
These two points are important because if $A$ is sparse, the
dependency of each component of the new iterate on previous components
is not absolute. The presence of zeros in the matrix may remove the
influence of some of the previous components. Using a judicious
ordering of the equations, it may be possible to reduce such
dependence, thus restoring the ability to make updates to groups of
components in parallel. However, reordering the equations can affect
the rate at which the Gauss-Seidel
method
converges. A poor choice of ordering can degrade the rate of
convergence; a good choice can enhance the rate of convergence. For a
practical discussion of this tradeoff (parallelism versus convergence
rate) and some standard reorderings, the reader is referred to
Chapter
and §
.
In matrix terms, the definition of the
Gauss-Seidel method
in (
) can be expressed as
\[ x^{(k)} = (D - L)^{-1}\bigl(Ux^{(k-1)} + b\bigr). \]
As before, $D$, $-L$ and $-U$ represent the diagonal, lower-triangular,
and upper-triangular parts of $A$, respectively.
The pseudocode for the Gauss-Seidel algorithm is given in Figure
.
The Successive Overrelaxation Method, or SOR, is devised
by applying extrapolation to the Gauss-Seidel method. This
extrapolation takes the form of a weighted average between the
previous iterate and the computed Gauss-Seidel iterate successively
for each component:
\[ x^{(k)}_i = \omega\bar{x}^{(k)}_i + (1-\omega)x^{(k-1)}_i \]
(where $\bar{x}$ denotes a Gauss-Seidel iterate, and $\omega$ is the
extrapolation factor). The
idea is to choose a value for $\omega$ that will accelerate the rate
of convergence of the iterates to the solution.
In matrix terms, the SOR algorithm can be written as
follows:
\[ x^{(k)} = (D - \omega L)^{-1}\bigl(\omega U + (1-\omega)D\bigr)x^{(k-1)}
           + \omega(D - \omega L)^{-1}b. \]
The pseudocode for the SOR algorithm is given in Figure
.
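A minimal Fortran sketch of one SOR sweep for a dense system is given below (routine and argument names ours); OMEGA = 1.0 reproduces a Gauss-Seidel sweep, and the iterate is updated in place, so previously computed components are used as soon as they are available.

      SUBROUTINE SORSWP( N, A, LDA, B, X, OMEGA )
*     One SOR sweep: each x(i) is replaced by the weighted average of
*     its old value and the Gauss-Seidel value computed from the most
*     recent components.
      INTEGER            N, LDA
      REAL               A( LDA, * ), B( * ), X( * ), OMEGA
      INTEGER            I, J
      REAL               SIGMA
      DO 20 I = 1, N
         SIGMA = B( I )
         DO 10 J = 1, N
            IF( J.NE.I ) SIGMA = SIGMA - A( I, J )*X( J )
   10    CONTINUE
         SIGMA = SIGMA / A( I, I )
         X( I ) = X( I ) + OMEGA*( SIGMA - X( I ) )
   20 CONTINUE
      RETURN
      END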
If $\omega = 1$, the SOR method simplifies
to the Gauss-Seidel method. A theorem due to
Kahan
[130] shows that SOR fails to
converge if $\omega$ is outside the interval $(0,2)$. Though
technically the term underrelaxation
should be used when $0 < \omega < 1$, for convenience the term
overrelaxation
is now used for any value of $\omega \in (0,2)$.
In general, it is not possible to compute in advance the value
of $\omega$ that is optimal with respect to the rate of convergence of
SOR. Even when it is possible to compute the optimal value
for $\omega$, the expense of such computation is usually prohibitive.
Frequently, some heuristic estimate is used, such as
$\omega = 2 - O(h)$,
where $h$ is the mesh spacing of the discretization of the
underlying physical domain.
If the coefficient matrix $A$ is symmetric and positive definite, the
SOR iteration is guaranteed to converge for any
value of $\omega$ between 0 and 2, though the choice of $\omega$ can
significantly affect the rate at which the SOR iteration
converges. Sophisticated implementations of the SOR
algorithm (such as that found in ITPACK
[140]) employ adaptive
parameter estimation schemes to try to home in on the appropriate
value of $\omega$ by estimating the rate at which the iteration is
converging.
Adaptive methods: Iterative methods that collect information
about the coefficient matrix during the iteration process, and use
this to speed up convergence.
Symmetric matrix: See: Matrix properties.
Diagonally dominant matrix: See: Matrix properties
-Matrix: See: Matrix properties.
Positive definite matrix: See: Matrix properties.
Matrix properties: We call a square matrix
For coefficient matrices of a special class called consistently
ordered with property A (see
Young
[217]), which includes certain orderings of matrices
arising from the discretization of elliptic PDEs, there is a direct
relationship between the spectra of the Jacobi
and SOR iteration matrices. In principle, given the
spectral radius $\rho$ of the
Jacobi iteration matrix, one can determine a priori the theoretically optimal value of $\omega$
for SOR:
\[ \omega_{\rm opt} = \frac{2}{1 + \sqrt{1 - \rho^2}}. \]
This is seldom done, since calculating the spectral radius of the
Jacobi matrix requires an impractical amount of computation. However,
relatively inexpensive rough estimates of $\rho$ (for example, from
the power method; see Golub and Van Loan [109, p. 351])
can yield reasonable estimates for the optimal value of $\omega$.
If we assume that the coefficient matrix $A$ is symmetric, then the
Symmetric Successive Overrelaxation method, or SSOR, combines two SOR
sweeps together in such a way that the resulting iteration matrix is
similar to a symmetric matrix. Specifically, the
first SOR sweep is carried out as
in (
), but in the second sweep the unknowns are
updated in the reverse order. That is, SSOR is a forward
SOR sweep followed by a
backward SOR sweep. The
similarity of the SSOR iteration matrix to a symmetric
matrix permits the application of SSOR as a preconditioner
for other iterative schemes for symmetric matrices. Indeed, this is
the primary motivation for SSOR, since its convergence
rate, with an optimal value of $\omega$, is
usually slower than the convergence rate of SOR with
optimal $\omega$
(see Young [217, page 462]). For details on
using SSOR as a preconditioner, see
Chapter
.
In matrix terms, the SSOR iteration can be expressed as
follows:
\[ x^{(k)} = B_1B_2\,x^{(k-1)}
           + \omega(2-\omega)(D - \omega U)^{-1}D(D - \omega L)^{-1}b, \]
where
\[ B_1 = (D - \omega U)^{-1}\bigl(\omega L + (1-\omega)D\bigr)
   \quad\hbox{and}\quad
   B_2 = (D - \omega L)^{-1}\bigl(\omega U + (1-\omega)D\bigr). \]
Note that $B_2$ is simply the iteration matrix for SOR
from (
), and that $B_1$ is the same, but with the
roles of $L$ and $U$ reversed.
The pseudocode for the SSOR algorithm is given in
Figure
.
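A minimal Fortran sketch of one SSOR sweep for a dense system (routine and argument names ours) simply performs a forward SOR sweep followed by a backward sweep in which the unknowns are visited in reverse order:

      SUBROUTINE SSORSW( N, A, LDA, B, X, OMEGA )
*     One SSOR sweep: forward SOR sweep, then backward SOR sweep.
      INTEGER            N, LDA
      REAL               A( LDA, * ), B( * ), X( * ), OMEGA
      INTEGER            I, J
      REAL               SIGMA
*     forward SOR sweep
      DO 20 I = 1, N
         SIGMA = B( I )
         DO 10 J = 1, N
            IF( J.NE.I ) SIGMA = SIGMA - A( I, J )*X( J )
   10    CONTINUE
         X( I ) = X( I ) + OMEGA*( SIGMA/A( I, I ) - X( I ) )
   20 CONTINUE
*     backward SOR sweep (unknowns updated in reverse order)
      DO 40 I = N, 1, -1
         SIGMA = B( I )
         DO 30 J = 1, N
            IF( J.NE.I ) SIGMA = SIGMA - A( I, J )*X( J )
   30    CONTINUE
         X( I ) = X( I ) + OMEGA*( SIGMA/A( I, I ) - X( I ) )
   40 CONTINUE
      RETURN
      END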
The modern treatment of iterative methods dates back to the
relaxation
method of Southwell
[193].
This was the precursor to the SOR method, though the order in which
approximations to the unknowns were relaxed varied during the
computation. Specifically, the next unknown was chosen based upon
estimates of the location of the largest error in the current
approximation. Because of this, Southwell's relaxation method was
considered impractical for automated computing. It is interesting to
note that the introduction of multiple-instruction, multiple
data-stream (MIMD) parallel computers has rekindled an interest in
so-called asynchronous
, or chaotic
iterative methods (see Chazan and
Miranker
[54], Baudet
[30], and
Elkin
[92]), which are closely related to Southwell's
original relaxation method. In chaotic methods, the order of
relaxation is unconstrained, thereby eliminating costly
synchronization of the processors, though the effect on convergence is
difficult to predict.
The notion of accelerating the convergence of an iterative method by
extrapolation predates the development of SOR. Indeed, Southwell used
overrelaxation to accelerate the convergence of
his original relaxation method
. More
recently, the ad hoc SOR
method, in
which a different relaxation factor
is used for updating each
variable, has given impressive results for some classes of
problems (see Ehrlich
[83]).
The three main references for the theory of stationary iterative
methods are Varga
[211], Young
[217] and Hageman and
Young
[120]. The reader is referred to these books (and
the references therein) for further details concerning the methods
described in this section.
Nonstationary methods differ from stationary methods in that the
computations involve information that changes at each iteration.
Typically, constants are computed by taking inner products of
residuals or other vectors arising from the iterative method.
The Conjugate Gradient method is
an effective method for symmetric positive definite systems. It is
the oldest and best known of the nonstationary methods discussed
here. The method proceeds by generating vector sequences of iterates
(i.e., successive approximations to the solution), residuals
corresponding to the iterates, and search directions
used in updating
the iterates and residuals. Although the length of these sequences can
become large, only a small number of vectors needs to be kept in
memory. In every iteration of the method, two inner products are
performed in order to compute update scalars that are defined to
make the sequences satisfy certain orthogonality conditions. On a
symmetric positive definite linear system these conditions imply that
the distance to the true solution is minimized in some norm.
The iterates $x^{(i)}$ are updated in each iteration by a multiple $\alpha_i$ of the
search direction vector $p^{(i)}$:
\[ x^{(i)} = x^{(i-1)} + \alpha_ip^{(i)}. \]
Correspondingly the residuals $r^{(i)} = b - Ax^{(i)}$ are updated as
\[ r^{(i)} = r^{(i-1)} - \alpha_iq^{(i)}, \qquad\hbox{where } q^{(i)} = Ap^{(i)}. \]
The choice $\alpha_i = {r^{(i-1)}}^Tr^{(i-1)}/{p^{(i)}}^TAp^{(i)}$ minimizes
${r^{(i)}}^TA^{-1}r^{(i)}$ over all possible choices for $\alpha$ in
equation (
).
The search directions are updated using the residuals
\[ p^{(i)} = r^{(i-1)} + \beta_{i-1}p^{(i-1)}, \]
where the choice $\beta_i = {r^{(i)}}^Tr^{(i)}/{r^{(i-1)}}^Tr^{(i-1)}$ ensures
that $p^{(i)}$ and $Ap^{(i-1)}$ - or equivalently, $r^{(i)}$ and $r^{(i-1)}$ - are orthogonal
. In fact, one can
show that this choice of $\beta_i$ makes $p^{(i)}$ and $r^{(i)}$ orthogonal to
all previous $Ap^{(j)}$ and $r^{(j)}$ respectively.
The pseudocode for the Preconditioned Conjugate Gradient Method is
given in Figure
. It uses a preconditioner $M$;
for $M = I$
one obtains the unpreconditioned version of the Conjugate Gradient
Algorithm. In that case the algorithm may be further simplified by skipping
the ``solve'' line, and replacing $z^{(i-1)}$ by $r^{(i-1)}$ (and $z^{(0)}$ by $r^{(0)}$).
The unpreconditioned conjugate gradient method constructs the $i$th
iterate $x^{(i)}$ as an element of
$x^{(0)} + {\rm span}\{r^{(0)}, Ar^{(0)}, \ldots, A^{i-1}r^{(0)}\}$
so that $(x^{(i)} - \hat{x})^TA(x^{(i)} - \hat{x})$ is minimized, where $\hat{x}$
is the exact solution of $Ax = b$. This minimum is guaranteed
to exist in general only if $A$ is symmetric positive definite. The
preconditioned version of the method uses a different subspace for
constructing the iterates, but it satisfies the same minimization
property, although over this different subspace. It requires in
addition that the preconditioner $M$ is symmetric and positive
definite.
The above minimization of the error is equivalent to the residuals
$r^{(i)} = b - Ax^{(i)}$ being $M^{-1}$-orthogonal (that is,
${r^{(i)}}^TM^{-1}r^{(j)} = 0$ if $i \ne j$). Since for symmetric $A$
an orthogonal basis for the Krylov subspace
${\rm span}\{r^{(0)}, \ldots, A^{i-1}r^{(0)}\}$ can be
constructed with only three-term recurrences
, such a recurrence also
suffices for generating the residuals. In the Conjugate
Gradient method two coupled two-term recurrences are used; one that
updates residuals using a search direction vector, and one updating
the search direction
with a newly computed residual.
This makes the
Conjugate Gradient Method quite attractive computationally.
Krylov sequence: For a given matrix $A$ and vector $x$, the
sequence of vectors $\{A^ix\}_{i\ge 0}$, or a finite initial part
of this sequence.
Krylov subspace: The subspace spanned by a Krylov sequence.
There is a close relationship between the Conjugate Gradient method
and the Lanczos method
for determining eigensystems, since both are
based on the construction of an orthogonal basis for the Krylov
subspace, and a similarity transformation of the coefficient matrix to
tridiagonal form. The coefficients computed during
the CG iteration then arise from the $LDL^T$
factorization of this
tridiagonal matrix.
From the CG iteration one can reconstruct the Lanczos process, and vice versa;
see Paige and Saunders
[168]
and Golub and Van Loan [109].
This relationship can be exploited to obtain relevant information
about the eigensystem of the (preconditioned) matrix $A$;
see §
.
Accurate predictions of the convergence of iterative methods are
difficult to make, but useful bounds can often be obtained. For the
Conjugate Gradient method, the error can be
bounded in terms of the spectral condition number $\kappa$ of the
matrix $M^{-1}A$. (Recall that if $\lambda_{\max}$ and $\lambda_{\min}$
are the largest and smallest eigenvalues of a
symmetric positive definite matrix $B$, then the spectral condition
number of $B$ is $\kappa(B) = \lambda_{\max}(B)/\lambda_{\min}(B)$.) If
$\hat{x}$ is the exact solution of the linear system $Ax = b$,
with symmetric positive definite matrix $A$, then for CG
with
symmetric positive definite preconditioner $M$, it can be shown that
\[ \|x^{(i)} - \hat{x}\|_A \le 2\alpha^i\,\|x^{(0)} - \hat{x}\|_A, \]
where $\alpha = (\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$
(see Golub and Van Loan
[109], and
Kaniel
[131]), and $\|y\|_A^2 = y^TAy$. From this
relation we see that the number of iterations to reach a relative
reduction of $\epsilon$ in the error is proportional
to $\sqrt{\kappa}$.
In some cases, practical application of the above error bound is
straightforward. For example, elliptic second order partial
differential equations typically give rise to coefficient matrices
with $\kappa = O(h^{-2})$ (where $h$ is the discretization mesh
width), independent of the order of the finite elements or differences
used, and of the number of space dimensions of the problem (see for
instance Axelsson and Barker [14]). Thus,
without preconditioning, we expect a number of iterations proportional
to $h^{-1}$ for the Conjugate Gradient method.
Other results concerning the behavior of the Conjugate Gradient
algorithm have been obtained. If the extremal eigenvalues of the
matrix $M^{-1}A$
are well separated, then one often observes so-called
superlinear convergence (see Concus, Golub and
O'Leary
[58]); that is, convergence at a
rate that increases per iteration.
This phenomenon is explained by
the fact that CG tends to eliminate components of the error in the
direction of eigenvectors associated with extremal eigenvalues first.
After these have been eliminated, the method proceeds as if these
eigenvalues did not exist in the given system, i.e., the
convergence rate depends on a reduced system with a (much) smaller
condition number (for an analysis of this, see Van der Sluis and
Van der Vorst
[199]). The effectiveness of
the preconditioner in reducing the condition number and in separating
extremal eigenvalues can be deduced by studying the approximated
eigenvalues of the related Lanczos process.
The Conjugate Gradient method involves one matrix-vector product, three
vector updates, and two inner products per iteration. Some slight
computational variants exist that have the same
structure (see Reid
[179]). Variants that cluster the inner products
, a favorable property on
parallel machines, are discussed in §
.
For a discussion of the Conjugate Gradient method
on vector and shared
memory computers, see Dongarra, et
al.
[166]
[71]. For discussions
of the method for more general parallel architectures
see Demmel, Heath and Van der Vorst
[67] and
Ortega
[166], and the references therein.
A more formal presentation of CG, as well as many theoretical
properties, can be found in the textbook by Hackbusch
[118].
Shorter presentations are given in Axelsson and Barker
[14]
and Golub and Van Loan
[109]. An overview of papers
published in the first 25 years of existence of the method is given
in Golub and O'Leary
[108].
The Conjugate Gradient method can be viewed as a special variant of
the Lanczos method
(see §
) for positive definite
symmetric systems. The MINRES
and SYMMLQ
methods are variants that can
be applied to symmetric indefinite systems.
The vector sequences in the Conjugate Gradient method
correspond to a
factorization of a tridiagonal matrix similar to the coefficient
matrix. Therefore, a breakdown
of the algorithm can occur
corresponding to a zero pivot if the matrix is indefinite.
Furthermore, for indefinite matrices the minimization property of the
Conjugate Gradient method
is no longer well-defined. The
MINRES
and SYMMLQ
methods are variants of the CG method that avoid the LU-factorization and do not suffer from breakdown. MINRES minimizes the residual in the 2-norm. SYMMLQ solves the projected system, but does not minimize anything (it keeps the residual orthogonal to all previous residuals). The convergence
behavior of Conjugate Gradients and MINRES
for indefinite systems was
analyzed by Paige, Parlett, and Van der Vorst
[167].
Breakdown: The occurrence of a zero coefficient that is
to be used as a divisor in an iterative method.
When $A$ is not positive definite, but symmetric, we can still construct an orthogonal basis for the Krylov subspace by three-term recurrence relations. Eliminating the search directions from the coupled CG recurrences gives a recurrence
$$A r^{(i)} = r^{(i+1)} t_{i+1,i} + r^{(i)} t_{i,i} + r^{(i-1)} t_{i-1,i},$$
which can be written in matrix form as
$$A R_i = R_{i+1} \bar{T}_i,$$
where $\bar{T}_i$ is an $(i+1) \times i$ tridiagonal matrix.
In this case we have the problem that $(\cdot,\cdot)_A$ no longer defines an inner product. However, we can still try to minimize the residual in the 2-norm by obtaining
$$x^{(i)} \in \operatorname{span}\{r^{(0)}, A r^{(0)}, \dots, A^{i-1} r^{(0)}\}, \qquad x^{(i)} = R_i \bar{y},$$
that minimizes
$$\|A x^{(i)} - b\|_2 = \|A R_i \bar{y} - b\|_2 = \|R_{i+1} \bar{T}_i \bar{y} - b\|_2.$$
Now we exploit the fact that if $D_{i+1} \equiv \operatorname{diag}(\|r^{(0)}\|_2, \|r^{(1)}\|_2, \dots, \|r^{(i)}\|_2)$, then $R_{i+1} D_{i+1}^{-1}$ is an orthonormal transformation with respect to the current Krylov subspace:
$$\|A x^{(i)} - b\|_2 = \|D_{i+1} \bar{T}_i \bar{y} - \|r^{(0)}\|_2 e^{(1)}\|_2,$$
and this final expression can simply be seen as a minimum norm least squares problem.
The element in the $(i+1,i)$ position of $\bar{T}_i$ can be annihilated by a simple Givens rotation and the resulting upper bidiagonal system (the other subdiagonal elements having been removed in previous iteration steps) can simply be solved, which leads to the MINRES method (see Paige and Saunders [168]).
Another possibility is to solve the system $T_i y = \|r^{(0)}\|_2 e^{(1)}$, as in the CG method ($T_i$ is the upper $i \times i$ part of $\bar{T}_i$). Unlike in CG, we cannot rely on the existence of a Cholesky decomposition (since $A$ is not positive definite). An alternative is then to decompose $T_i$ by an LQ-decomposition. This again leads to simple recurrences and the resulting method is known as SYMMLQ (see Paige and Saunders [168]).
The CGNE
and CGNR methods
are the simplest methods for nonsymmetric or
indefinite systems. Since other methods for such systems are in
general rather more complicated than the Conjugate Gradient method,
transforming the system to a symmetric definite one and then applying
the Conjugate Gradient method is attractive for its coding simplicity.
If a system of linear equations $Ax = b$ has a nonsymmetric, possibly indefinite (but nonsingular), coefficient matrix, one obvious attempt at a solution is to apply the Conjugate Gradient method to a related symmetric positive definite system, such as the normal equations $A^T A x = A^T b$. While this approach is easy to understand and code, the convergence speed of the Conjugate Gradient method now depends on the square of the condition number of the original coefficient matrix. Thus the rate of convergence of the CG procedure on the normal equations may be slow.
Several proposals have been made to improve the numerical stability of
this method. The best known is by Paige and Saunders
[169]
and is based upon applying the Lanczos method
to the auxiliary system
$$\begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix}\begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}.$$
A clever execution of this scheme delivers the factors $L$ and $U$ of the $LU$-decomposition of the tridiagonal matrix that would have been computed by carrying out the Lanczos procedure with $A^T A$.
Another means for improving the numerical stability of this normal equations approach is suggested by Björck and Elfving in [34]. The observation that the matrix $A^T A$ is used in the construction of the iteration coefficients through an inner product like $(p, A^T A p)$ leads to the suggestion that such an inner product be replaced by $(Ap, Ap)$.
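A sketch of this idea in Python/NumPy (our own illustration, with hypothetical names): CG is applied to the normal equations, but $A^TA$ is never formed and the inner product $(p, A^TAp)$ is evaluated as $(Ap, Ap)$.

    import numpy as np

    def cgnr(A, b, tol=1e-10, maxiter=500):
        """CG on the normal equations A^T A x = A^T b (illustrative sketch).

        The product A^T A is never formed; (p, A^T A p) is computed as (Ap, Ap).
        """
        x = np.zeros(A.shape[1])
        r = b - A @ x              # residual of the original system
        s = A.T @ r                # residual of the normal equations
        p = s.copy()
        ss = s @ s
        for _ in range(maxiter):
            Ap = A @ p
            alpha = ss / (Ap @ Ap)         # the replaced inner product
            x += alpha * p
            r -= alpha * Ap
            s = A.T @ r
            ss_new = s @ s
            if np.sqrt(ss_new) <= tol:
                break
            p = s + (ss_new / ss) * p
            ss = ss_new
        return x

    # Small nonsymmetric (but nonsingular) example.
    A = np.array([[3.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 2.0])
    x = cgnr(A, b)
    print(x, np.linalg.norm(A @ x - b))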
The Generalized Minimal Residual method
[189]
is an extension of MINRES
(which is only applicable to symmetric systems) to unsymmetric
systems. Like MINRES, it generates a sequence of orthogonal vectors,
but in the absence of symmetry this can no longer be done with short
recurrences; instead, all previously computed vectors in the
orthogonal sequence have to be retained. For this reason, ``restarted'' versions of the method are used.
In the Conjugate Gradient method, the residuals form an orthogonal basis for the space $\operatorname{span}\{r^{(0)}, Ar^{(0)}, A^2 r^{(0)}, \dots\}$. In GMRES, this basis is formed explicitly: starting from $w^{(i)} = A v^{(i)}$, one subtracts in turn the components $(w^{(i)}, v^{(k)})\, v^{(k)}$ for $k = 1, \dots, i$, and finally sets $v^{(i+1)} = w^{(i)} / \|w^{(i)}\|_2$.
The reader may recognize this as a modified Gram-Schmidt orthogonalization. Applied to the Krylov sequence $\{A^k r^{(0)}\}$ this orthogonalization is called the ``Arnoldi method'' [6]. The inner product coefficients $(w^{(i)}, v^{(k)})$ and $\|w^{(i)}\|$ are stored in an upper Hessenberg matrix.
The GMRES iterates are constructed as
$$x^{(i)} = x^{(0)} + y_1 v^{(1)} + \cdots + y_i v^{(i)},$$
where the coefficients $y_k$ have been chosen to minimize the residual norm $\|b - A x^{(i)}\|$. The GMRES algorithm has the property that this
residual norm can be computed without the iterate having been formed.
Thus, the expensive action of forming the iterate can be postponed
until the residual norm is deemed small enough.
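The orthogonalization step described above can be written compactly in Python/NumPy; the following is our own sketch of one modified Gram-Schmidt (Arnoldi) step, with hypothetical names, building the basis vectors and the upper Hessenberg matrix.

    import numpy as np

    def arnoldi_mgs_step(apply_A, V, H):
        """One Arnoldi step with modified Gram-Schmidt.

        V : list of orthonormal basis vectors v^(1), ..., v^(i)
        H : (m+1) x m array; column i-1 receives the new coefficients
        Returns v^(i+1), or None if the Krylov subspace is invariant.
        """
        i = len(V)
        w = apply_A(V[-1])                        # w = A v^(i)
        for k in range(i):
            H[k, i - 1] = w @ V[k]                # inner product coefficient
            w = w - H[k, i - 1] * V[k]            # subtract the projection
        H[i, i - 1] = np.linalg.norm(w)
        if H[i, i - 1] == 0.0:
            return None
        return w / H[i, i - 1]

    # Build a 3-dimensional Krylov basis for a small matrix.
    A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
    b = np.array([1.0, 0.0, 0.0])
    m = 3
    H = np.zeros((m + 1, m))
    V = [b / np.linalg.norm(b)]
    for _ in range(m):
        v_next = arnoldi_mgs_step(lambda v: A @ v, V, H)
        if v_next is None:
            break
        V.append(v_next)
    print(np.round(H, 3))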
The pseudocode for the restarted GMRES($m$) algorithm with preconditioner $M$ is given in Figure .
The authors gratefully acknowledge the valuable assistance of many
people who commented on preliminary drafts of this book. In
particular, we thank Loyce Adams, Bill Coughran,
Matthew Fishler, Peter Forsyth,
Roland Freund, Gene
Golub, Eric Grosse, Mark Jones, David Kincaid, Steve Lee, Tarek
Mathew, Noël Nachtigal, Jim Ortega, and David Young for their
insightful comments. We also thank Geoffrey Fox for initial
discussions on the concept of templates, and Karin Remington for
designing the front cover.
This work was supported in part by DARPA and ARO under contract number
DAAL03-91-C-0047, the National Science Foundation Science and
Technology Center Cooperative Agreement No. CCR-8809615, the Applied
Mathematical Sciences subprogram of the Office of Energy Research,
U.S. Department of Energy, under Contract DE-AC05-84OR21400, and the
Stichting Nationale Computer Faciliteit (NCF) by Grant CRG 92.03.
The Generalized Minimum
Residual (GMRES) method
is designed to solve nonsymmetric linear
systems (see Saad and Schultz
[189]). The most popular
form of GMRES is based on the modified Gram-Schmidt procedure, and
uses restarts
to control storage requirements.
If no restarts are used, GMRES (like any orthogonalizing Krylov subspace method) will converge in no more than $n$ steps (assuming exact arithmetic). Of course this is of no practical value when $n$ is large; moreover, the storage and computational requirements in the absence of restarts are prohibitive. Indeed, the crucial element for successful application of GMRES($m$) revolves around the decision of when to restart; that is, the choice of $m$. Unfortunately, there exist examples for which the method stagnates and convergence takes place only at the $n$th step. For such systems, any choice of $m$ less than $n$ fails to converge.
Saad and Schultz
[189] have proven several useful results.
In particular, they show that if the coefficient matrix $A$ is real and nearly positive definite, then a ``reasonable'' value for $m$ may be selected. Implications of the choice of $m$ are discussed below.
A common implementation of GMRES
is suggested by Saad and Schultz in
[189] and
relies on using modified Gram-Schmidt orthogonalization. Householder
transformations, which are relatively costly but stable, have also
been proposed. The Householder approach results in a three-fold
increase in work associated with inner products and
vector updates (not with matrix vector products);
however, convergence may be better, especially for
ill-conditioned systems
(see Walker
[214]). From
the point of view of parallelism
, Gram-Schmidt orthogonalization may be
preferred, giving up some stability for better parallelization
properties (see Demmel, Heath and Van der Vorst
[67]).
Here we adopt the Modified Gram-Schmidt approach.
The major drawback to GMRES is that the amount of work and storage
required per iteration rises linearly with the iteration count.
Unless one is fortunate enough to obtain extremely fast convergence,
the cost will rapidly become prohibitive. The usual way to overcome
this limitation is by restarting
the iteration. After a chosen number $m$ of iterations, the accumulated data are cleared and the intermediate results are used as the initial data for the next $m$ iterations. This procedure is repeated until convergence is achieved. The difficulty is in choosing an appropriate value for $m$. If $m$ is ``too small'', GMRES($m$) may be slow to converge, or fail to converge entirely. A value of $m$ that is larger than necessary involves excessive work (and uses more storage). Unfortunately, there are no definite rules governing the choice of $m$: choosing when to restart is a matter of experience.
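In practice one frequently calls an existing implementation and only has to pick $m$. As a hedged usage sketch (assuming SciPy is available; keyword names may differ slightly between SciPy versions), restarted GMRES with an incomplete-factorization preconditioner might look as follows.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import gmres, spilu, LinearOperator

    # A small nonsymmetric test matrix and right-hand side.
    n = 200
    A = diags([-1.0, 2.0, -0.5], [-1, 0, 1], shape=(n, n), format="csc")
    b = np.ones(n)

    # Optional preconditioner: an incomplete LU factorization of A.
    ilu = spilu(A)
    M = LinearOperator((n, n), matvec=ilu.solve)

    # 'restart' plays the role of m in GMRES(m): too small may stall,
    # too large costs storage and work per iteration.
    x, info = gmres(A, b, restart=20, maxiter=1000, M=M)
    print(info, np.linalg.norm(A @ x - b))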
For a discussion of GMRES for vector and shared memory computers see
Dongarra et al. [71]; for more general architectures, see Demmel, Heath and Van der Vorst [67].
The Conjugate Gradient method is not suitable for nonsymmetric systems
because the residual vectors cannot be made orthogonal with short
recurrences (for proof of this see Voevodin
[213] or Faber and
Manteuffel
[96]). The GMRES method retains
orthogonality of the residuals by using long recurrences, at the cost
of a larger storage demand. The BiConjugate Gradient method takes
another approach, replacing the orthogonal sequence of residuals by
two mutually orthogonal sequences, at the price of no longer providing
a minimization.
The update relations for residuals in the Conjugate Gradient method are augmented in the BiConjugate Gradient method by relations that are similar but based on $A^T$ instead of $A$. Thus we update two sequences of residuals
$$r^{(i)} = r^{(i-1)} - \alpha_i A p^{(i)}, \qquad \tilde{r}^{(i)} = \tilde{r}^{(i-1)} - \alpha_i A^T \tilde{p}^{(i)},$$
and two sequences of search directions
$$p^{(i)} = r^{(i-1)} + \beta_{i-1} p^{(i-1)}, \qquad \tilde{p}^{(i)} = \tilde{r}^{(i-1)} + \beta_{i-1} \tilde{p}^{(i-1)}.$$
The choices
$$\alpha_i = \frac{(\tilde{r}^{(i-1)}, r^{(i-1)})}{(\tilde{p}^{(i)}, A p^{(i)})}, \qquad \beta_i = \frac{(\tilde{r}^{(i)}, r^{(i)})}{(\tilde{r}^{(i-1)}, r^{(i-1)})}$$
ensure the bi-orthogonality relations
$$(\tilde{r}^{(i)}, r^{(j)}) = (\tilde{p}^{(i)}, A p^{(j)}) = 0 \qquad \text{if } i \ne j.$$
The pseudocode for the Preconditioned BiConjugate Gradient Method with preconditioner $M$ is given in Figure .
Few theoretical results are known about the convergence of BiCG. For
symmetric positive definite systems the method delivers the same
results as CG, but at twice the cost per iteration. For nonsymmetric
matrices it has been shown that in phases of the process where there
is significant reduction of the norm of the residual, the method is
more or less comparable to full GMRES
(in terms of numbers of
iterations) (see Freund and Nachtigal
[102]). In practice
this is often confirmed, but
it is also observed that the convergence behavior may be quite irregular, and the method may even break down. The breakdown situation due to the possible event that $(\tilde{r}^{(i-1)}, r^{(i-1)}) \approx 0$ can be circumvented by so-called look-ahead strategies (see Parlett, Taylor and Liu [172]). This leads to complicated codes and is beyond the scope of this book. The other breakdown situation, $(\tilde{p}^{(i)}, A p^{(i)}) \approx 0$, occurs when the LU-decomposition fails (see the theory subsection of §), and can be repaired by using another decomposition. This is done in the version of QMR that we adopted (see §).
Sometimes, breakdown
or near-breakdown situations can be
satisfactorily avoided by a restart
at the iteration step immediately
before the (near-) breakdown step. Another possibility is to switch to
a more robust (but possibly more expensive) method, like GMRES.
BiCG requires computing a matrix-vector product $A p^{(i)}$ and a transpose product $A^T \tilde{p}^{(i)}$. In some applications the latter product may be impossible to perform, for instance if the matrix is not formed explicitly and the regular product is only given in operation form, for instance as a function call evaluation.
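When the matrix is available only as a routine, the transpose product has to be supplied separately if BiCG (or QMR) is to be used at all. A hedged SciPy sketch (the operator below is merely an example of ours) wraps both products in a LinearOperator.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, bicg

    n = 100

    def matvec(x):
        # y = A x for a simple nonsymmetric tridiagonal operator,
        # given as a routine rather than a stored matrix.
        y = 2.0 * x
        y[:-1] -= 0.5 * x[1:]
        y[1:] -= 1.0 * x[:-1]
        return y

    def rmatvec(x):
        # y = A^T x; BiCG cannot be applied if this routine is unavailable.
        y = 2.0 * x
        y[:-1] -= 1.0 * x[1:]
        y[1:] -= 0.5 * x[:-1]
        return y

    A = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)
    b = np.ones(n)
    x, info = bicg(A, b)
    print(info, np.linalg.norm(matvec(x) - b))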
In a parallel environment, the two matrix-vector products can theoretically be performed simultaneously; however, in a distributed-memory environment, there will be extra communication costs associated with one of the two matrix-vector products, depending upon the storage scheme for $A$.
A duplicate copy of the
matrix will alleviate this problem, at the cost of doubling the
storage requirements for the matrix.
Care must also be exercised in choosing the preconditioner, since
similar problems arise during the two solves involving the
preconditioning matrix.
It is difficult to make a fair comparison between GMRES
and BiCG.
GMRES really minimizes a residual, but at the cost of increasing work
for keeping all residuals orthogonal and increasing demands for memory
space. BiCG does not minimize a residual, but often its
accuracy is comparable to GMRES, at the cost of twice the amount of matrix
vector products per iteration step. However, the generation of the
basis vectors is relatively cheap and the memory requirements are
modest. Several variants of BiCG have been proposed that increase
the effectiveness of this class of methods in certain circumstances. These
variants (CGS and Bi-CGSTAB) will be discussed in coming subsections.
The BiConjugate Gradient method often displays rather irregular
convergence
behavior. Moreover, the implicit LU decomposition of the reduced tridiagonal system may not exist, resulting in breakdown
of the algorithm. A related algorithm, the
Quasi-Minimal Residual method of Freund and
Nachtigal
[102],
[103]
attempts to overcome
these problems. The main idea behind this algorithm is to solve the
reduced tridiagonal system in a least squares sense, similar to the
approach followed in GMRES.
Since the constructed basis for the
Krylov subspace
is bi-orthogonal
, rather than orthogonal as in GMRES,
the obtained solution is viewed as a quasi-minimal residual solution,
which explains the name. Additionally, QMR uses look-ahead techniques
to avoid breakdowns in the underlying Lanczos process, which makes it
more robust than BiCG.
The convergence behavior of QMR
is typically much smoother than for BiCG.
Freund and Nachtigal
[102] present quite general error
bounds which show that
QMR may be expected to converge about as fast as GMRES. From a relation
between the residuals in BiCG and QMR
(Freund and Nachtigal [102, relation (5.10)]) one may deduce that at phases in the iteration process where BiCG makes significant progress, QMR has arrived at about the same approximation for $\hat{x}$. On the other hand, when BiCG makes no progress at all, QMR may still show slow convergence.
The look-ahead steps in the version of the QMR method discussed in [102] prevent breakdown in all cases except the so-called ``incurable breakdown'', where no practical number of look-ahead steps would yield a next iterate.
The pseudocode for the Preconditioned Quasi Minimal Residual Method with preconditioner $M$ is given in Figure .
This algorithm follows the two term recurrence
version
without look-ahead, presented by Freund and
Nachtigal
[103] as Algorithm 7.1.
This version of QMR is simpler to implement than the full QMR method
with look-ahead, but it is susceptible to breakdown of the underlying
Lanczos process. (Other implementational variations are whether to
scale Lanczos vectors or not, or to use three-term recurrences instead
of coupled two-term recurrences. Such decisions usually have implications
for the stability and the efficiency of the algorithm.)
A professional implementation of QMR
with look-ahead is given in Freund and Nachtigal's QMRPACK, which
is available through netlib; see Appendix
.
We have modified Algorithm 7.1 in
[103] to include a
relatively inexpensive recurrence relation for the computation
of the residual vector. This requires a few extra vectors of storage
and vector update operations per iteration, but it avoids expending
a matrix-vector product on the residual calculation.
Also, the algorithm has been modified so that only two
full preconditioning steps are required instead of three.
Computation of the residual is done for the convergence test.
If one uses right (or post) preconditioning, that is $M_1 = I$, then a cheap upper bound for $\|r^{(i)}\|$ can be computed in each iteration, avoiding the recursions for $r^{(i)}$. For details, see Freund and Nachtigal [102, Proposition 4.1]. This upper bound may be pessimistic by a factor of at most $\sqrt{i+1}$.
QMR has roughly the same problems with respect to vector and parallel
implementation as BiCG. The scalar overhead per iteration is slightly
more than for BiCG. In all cases where the slightly cheaper BiCG method
converges irregularly
(but fast enough),
QMR may be preferred for stability reasons.
In BiCG, the residual vector $r^{(i)}$ can be regarded as the product of $r^{(0)}$ and an $i$th degree polynomial in $A$, that is,
$$r^{(i)} = P_i(A) r^{(0)}.$$
This same polynomial satisfies $\tilde{r}^{(i)} = P_i(A^T)\tilde{r}^{(0)}$, so that
$$\rho_i = (\tilde{r}^{(i)}, r^{(i)}) = (P_i(A^T)\tilde{r}^{(0)}, P_i(A) r^{(0)}) = (\tilde{r}^{(0)}, P_i^2(A) r^{(0)}).$$
This suggests that if $P_i(A)$ reduces $r^{(0)}$ to a smaller vector $r^{(i)}$, then it might be advantageous to apply this ``contraction'' operator twice, and compute $P_i^2(A) r^{(0)}$. The relation above shows that the iteration coefficients can still be recovered from these vectors, and it turns out to be easy to find the corresponding approximations for $x$. This approach leads to the Conjugate Gradient Squared method (see Sonneveld [192]).
Often one observes a speed of convergence for CGS that is about twice
as fast as for BiCG, which is in agreement with the observation that
the same ``contraction'' operator is applied twice. However, there is
no reason that the ``contraction'' operator, even if it really reduces the initial residual $r^{(0)}$, should also reduce the once reduced vector $r^{(i)} = P_i(A) r^{(0)}$. This is evidenced by the often
highly irregular convergence behavior of CGS
. One should be aware of
the fact that local corrections to the current solution may be so
large that cancelation effects occur. This may lead to a less
accurate solution than suggested by the updated residual
(see Van der Vorst
[207]).
The method tends to diverge if the starting guess is close to the solution.
CGS requires about the same number of
operations per iteration as
BiCG, but does not involve computations with $A^T$. Hence, in circumstances where computation with $A^T$ is impractical, CGS may be attractive.
The pseudocode for the Preconditioned Conjugate Gradient Squared Method with preconditioner $M$ is given in Figure .
The BiConjugate Gradient Stabilized method (Bi-CGSTAB) was developed to
solve nonsymmetric linear systems while avoiding the often irregular
convergence
patterns of the Conjugate
Gradient Squared
method (see Van der Vorst
[207]). Instead of computing the CGS sequence $P_i^2(A) r^{(0)}$, Bi-CGSTAB computes $Q_i(A) P_i(A) r^{(0)}$, where $Q_i$ is an $i$th degree polynomial describing a steepest descent update.
Bi-CGSTAB often converges about as fast as CGS,
sometimes faster and
sometimes not. CGS can be viewed as a method in which the BiCG
``contraction'' operator is applied twice. Bi-CGSTAB can be
interpreted as the product of BiCG and repeatedly applied GMRES(1). At
least locally, a residual vector is minimized
, which leads to a
considerably smoother
convergence behavior. On the
other hand, if the
local GMRES(1) step stagnates, then the Krylov subspace
is not
expanded, and Bi-CGSTAB will break down
. This is a breakdown situation
that can occur in addition to the other breakdown possibilities in the
underlying BiCG algorithm. This type of breakdown may be avoided by
combining BiCG with other methods, i.e., by selecting other values for $\omega_i$ (see the algorithm). One such alternative is
Bi-CGSTAB2
(see Gutknecht
[115]); more general
approaches are suggested by Sleijpen and Fokkema in
[190].
Note that Bi-CGSTAB has two stopping tests: if the method has already converged at the first test on the norm of $s$, the subsequent update would be numerically questionable. Additionally, stopping on the first
test saves a few unnecessary operations, but this is of minor importance.
Bi-CGSTAB requires two matrix-vector products
and four inner products, i.e., two inner products more than BiCG
and CGS.
The pseudocode for the Preconditioned BiConjugate Gradient Stabilized Method with preconditioner $M$ is given in Figure .
Chebyshev Iteration is another method for solving
nonsymmetric
problems (see Golub and Van Loan [109] and Varga [Chapter 5]).
Chebyshev Iteration avoids the computation of inner products
as is
necessary for the other nonstationary methods.
For some distributed memory architectures
these inner products are a bottleneck
with respect to efficiency. The
price one pays for avoiding inner products is that the method requires enough knowledge about the spectrum of the coefficient matrix $A$ that an ellipse enveloping the spectrum can be identified; however this
difficulty can be overcome via an adaptive construction
developed by
Manteuffel
[146], and implemented by Ashby
[7].
Chebyshev iteration is suitable for any nonsymmetric linear system for
which the enveloping ellipse does not include the origin.
Comparing the pseudocode for Chebyshev Iteration with the pseudocode
for the Conjugate Gradient method shows a high degree of similarity,
except that no inner products are computed in Chebyshev Iteration
.
Scalars $c$ and $d$ must be selected so that they define a family of ellipses with common center $d > 0$ and foci $d + c$ and $d - c$ which contain the ellipse that encloses the spectrum (or more generally, the field of values) of $A$ and for which the rate $r$ of convergence is minimal:
$$r = \frac{a + \sqrt{a^2 - c^2}}{d + \sqrt{d^2 - c^2}},$$
where $a$ is the length of the $x$-axis of the ellipse.
We provide code in which it is assumed that the ellipse degenerates to the interval $[\lambda_{\min}, \lambda_{\max}]$, that is, all eigenvalues are real. For code including the adaptive determination of the iteration parameters $c$ and $d$ the reader is referred to Ashby [7].
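The following Python sketch (our own transcription of the classical Chebyshev recurrence, without preconditioning and without the adaptive parameter estimation just mentioned) illustrates the structure of the iteration for a real spectrum contained in an interval; note that no inner products with the iteration vectors are needed except in the stopping test.

    import numpy as np

    def chebyshev(apply_A, b, lmin, lmax, tol=1e-8, maxiter=1000):
        """Chebyshev Iteration for eigenvalues in [lmin, lmax], 0 < lmin."""
        d = 0.5 * (lmax + lmin)            # center of the interval
        c = 0.5 * (lmax - lmin)            # half its length
        x = np.zeros_like(b)
        r = b.copy()                       # residual for x = 0
        sigma = d / c
        rho = 1.0 / sigma
        step = r / d                       # first correction
        bnorm = np.linalg.norm(b)
        for i in range(maxiter):
            x = x + step
            r = r - apply_A(step)
            if np.linalg.norm(r) <= tol * bnorm:   # only the stopping test
                break
            rho_new = 1.0 / (2.0 * sigma - rho)
            step = rho_new * rho * step + (2.0 * rho_new / c) * r
            rho = rho_new
        return x, i + 1

    # SPD tridiagonal example with known extremal eigenvalues.
    n = 50
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lmin = 2.0 - 2.0 * np.cos(np.pi / (n + 1))
    lmax = 2.0 - 2.0 * np.cos(n * np.pi / (n + 1))
    x, its = chebyshev(lambda v: A @ v, np.ones(n), lmin, lmax)
    print(its, np.linalg.norm(A @ x - np.ones(n)))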
The Chebyshev
method has the advantage over GMRES that only short recurrences are
used. On the other hand, GMRES is guaranteed to generate the smallest
residual over the current search space. The BiCG methods, which also
use short recurrences, do not minimize the residual in a suitable
norm; however, unlike Chebyshev iteration, they do not require
estimation of parameters (the spectrum of $A$). Finally, GMRES and
BiCG may be more effective in practice, because of superlinear
convergence behavior
, which cannot be expected for
Chebyshev.
For symmetric positive definite systems the ``ellipse'' enveloping the spectrum degenerates to the interval $[\lambda_{\min}, \lambda_{\max}]$ on the positive $x$-axis, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of $M^{-1}A$. In circumstances where the computation of inner products
is a bottleneck
, it may be advantageous to start
with CG, compute
estimates of the extremal eigenvalues from the CG coefficients, and
then after sufficient convergence of these approximations switch to
Chebyshev Iteration
. A similar strategy
may be adopted for a switch from GMRES, or BiCG-type methods, to
Chebyshev Iteration.
In the symmetric case (where $A$ and the preconditioner $M$ are both symmetric) for the Chebyshev Iteration we have the same upper bound as for the Conjugate Gradient method, provided $c$ and $d$ are computed from $\lambda_{\min}$ and $\lambda_{\max}$ (the extremal eigenvalues of the preconditioned matrix $M^{-1}A$).
There is a severe penalty for overestimating
or underestimating the field of values. For example, if in the
symmetric case $\lambda_{\max}$ is underestimated, then the method may diverge; if it is overestimated then the result may be very slow convergence. Similar statements can be made for the nonsymmetric case. This implies that one needs fairly accurate bounds on the spectrum of $M^{-1}A$ for the method to be effective (in comparison
with CG or GMRES).
In Chebyshev Iteration the iteration parameters
are known as soon
as one knows the ellipse containing the eigenvalues (or rather, the
field of values) of the operator. Therefore the computation of
inner products, as is necessary in methods like GMRES or CG,
is avoided
.
This avoids the synchronization points required of CG-type methods, so
machines with hierarchical or distributed memory may achieve higher
performance (it also suggests strong parallelization properties
; for a
discussion of this see Saad
[185], and Dongarra, et
al.
[71]).
Specifically, as soon as some segment of
is computed, we may begin
computing, in sequence, corresponding segments of
,
, and
.
The pseudocode for the Preconditioned Chebyshev Method with preconditioner $M$ is given in Figure . It handles the case of a symmetric positive definite coefficient matrix $A$. The eigenvalues of $M^{-1}A$ are assumed to be all real and contained in an interval $[\lambda_{\min}, \lambda_{\max}]$ which does not include zero.
Efficient solution of a linear system is largely a function of the
proper choice of iterative method. However,
to obtain good performance, consideration must also be given to the
computational kernels of the method and how efficiently they can be
executed on the target architecture. This point is of particular
importance on parallel architectures; see §
.
Iterative methods are very different from direct methods in this
respect. The performance of direct methods, both for dense and sparse
systems, is largely that of the factorization of the matrix. This
operation is absent in iterative methods (although preconditioners may
require a setup phase), and with it, iterative methods lack dense
matrix suboperations. Since such operations can be executed at very
high efficiency on most current computer architectures, we expect a
lower flop rate for iterative than for direct methods.
(Dongarra and Van der Vorst
[74] give some
experimental results about this, and provide a benchmark code
for iterative solvers.)
Furthermore,
the basic operations in iterative methods often use indirect
addressing, depending on the data structure. Such operations also have
a relatively low efficiency of execution.
However, this lower efficiency of execution does not imply anything
about the total solution time for a given system. Furthermore,
iterative methods are usually simpler to implement than direct
methods, and since no full factorization has to be stored, they can
handle much larger systems than direct methods.
In this section we summarize for each method the storage required and the type of operations performed per iteration.
Table  lists the storage required for each method (without preconditioning). Note that we are not including the storage for the original system $Ax = b$, and we ignore scalar storage.
Selecting the ``best'' method for a given class of problems is largely
a matter of trial and error. It also depends on how much storage one
has available (GMRES), on the availability of $A^T$ (BiCG and QMR), and on how expensive the matrix vector products (and Solve steps with $M$) are in comparison to SAXPYs and inner products. If these matrix
vector products are relatively expensive, and if sufficient storage is
available then it may be attractive to use GMRES and delay restarting
as much as possible.
Table
shows the type of operations performed per
iteration. Based on the particular problem or data structure, the
user may observe that a particular operation could be performed more
efficiently.
Methods based on orthogonalization were developed by a number of
authors in the early '50s. Lanczos'
method
[142] was based on two mutually orthogonal
vector sequences, and his motivation came from eigenvalue problems. In
that context, the most prominent feature of the method is that it
reduces the original matrix to tridiagonal form. Lanczos
later applied his method to solving linear systems, in particular
symmetric ones
[143]. An important
property for proving convergence of the method when solving linear
systems is that the iterates are related to the initial residual by
multiplication with a polynomial in the coefficient matrix.
The joint paper by Hestenes
and Stiefel
[122], after their independent discovery
of the same method, is the classical description of the conjugate
gradient method for solving linear systems. Although error-reduction
properties are proved, and experiments showing premature convergence
are reported, the conjugate gradient method is presented there as a direct method, rather than an iterative method.
This Hestenes/Stiefel method is closely related to a reduction of the
Lanczos method to symmetric matrices, reducing the two mutually
orthogonal sequences to one orthogonal sequence, but there is an
important algorithmic difference. Whereas Lanczos used three-term
recurrences, the method by Hestenes and Stiefel uses coupled two-term
recurrences. By combining the two two-term recurrences (eliminating
the ``search directions'') the Lanczos method is obtained.
A paper by
Arnoldi
[6] further discusses the Lanczos
biorthogonalization method, but it also presents a new method,
combining features of the Lanczos and Hestenes/Stiefel methods.
Like the Lanczos method it is applied to nonsymmetric systems, and
it does not use search directions. Like the Hestenes/Stiefel method,
it generates only one, self-orthogonal sequence. This last fact,
combined with the asymmetry of the coefficient matrix means that the
method no longer effects a reduction to tridiagonal form, but instead
one to upper Hessenberg form.
Presented as ``minimized iterations in the Galerkin method'' this
algorithm has become known as the Arnoldi algorithm.
The conjugate gradient method received little attention as a practical
method for some time, partly because of a misperceived importance of
the finite termination property. Reid
[179] pointed out that
the most important application area lay in sparse definite systems,
and this renewed the interest in the method.
Several methods have been developed in later years that employ, most
often implicitly, the upper Hessenberg matrix of the Arnoldi method.
For an overview and characterization of these orthogonal projection
methods for nonsymmetric systems
see Ashby, Manteuffel and Saylor
[10],
Saad and Schultz
[188], and Jea and
Young
[125].
Fletcher
[98] proposed an implementation of the
Lanczos method, similar to the Conjugate Gradient method, with two
coupled two-term recurrences, which he named the bi-conjugate
gradient method (BiCG).
Research into the design of Krylov subspace methods for solving
nonsymmetric linear systems is an active field of research and new
methods are still emerging. In this book, we have included only the
best known and most popular methods, and in particular those for which
extensive computational experience has been gathered. In this
section, we shall briefly highlight some of the recent developments
and other methods not treated here. A survey of methods up to about
1991 can be found in Freund, Golub and
Nachtigal
[106]. Two more recent reports by
Meier-Yang
[151] and Tong
[197] have extensive
numerical comparisons among various methods, including several more
recent ones that have not been discussed in detail in this book.
Several suggestions have been made to reduce the increase in memory
and computational costs in GMRES. An obvious one is to restart (this one is included
in §
): GMRES($m$). Another approach is to restrict
the GMRES search to a suitable subspace of some higher-dimensional
Krylov subspace. Methods based on this idea can be viewed as
preconditioned GMRES methods. The simplest ones exploit a fixed
polynomial preconditioner (see Johnson, Micchelli and
Paul
[126],
Saad
[183], and Nachtigal,
Reichel and Trefethen
[159]).
In more sophisticated approaches, the
polynomial preconditioner is adapted to the iterations
(Saad
[187]), or the preconditioner may even be some other
(iterative) method of choice (Van der Vorst and
Vuik
[209], Axelsson and
Vassilevski
[24]). Stagnation is prevented in the GMRESR
method (Van der Vorst and Vuik
[209]) by including
LSQR steps in some phases of the process. In De Sturler and
Fokkema
[64], part of the optimality of GMRES is
maintained in the hybrid method GCRO, in which the iterations of the
preconditioning method are kept orthogonal to the iterations of the
underlying GCR method. All these approaches have advantages for some
problems, but it is far from clear a priori which strategy is
preferable in any given case.
Recent work has focused on endowing
the BiCG method with several desirable properties:
(1) avoiding breakdown;
(2) avoiding use of the transpose;
(3) efficient use of matrix-vector products;
(4) smooth convergence; and
(5) exploiting the work expended in forming the Krylov space with $A^T$ for further reduction of the residual.
As discussed before, the BiCG method can have two kinds
of breakdown: Lanczos breakdown (the underlying Lanczos
process breaks down), and pivot breakdown (the
tridiagonal matrix
implicitly generated in the
underlying Lanczos process encounters a zero pivot when
Gaussian elimination without pivoting is used to factor it).
Although such exact breakdowns are very rare in practice, near
breakdowns can cause severe numerical stability problems.
The pivot breakdown is the easier one to overcome and there have been
several approaches proposed in the literature. It should be noted
that for symmetric matrices, Lanczos breakdown cannot occur and the
only possible breakdown is pivot breakdown. The SYMMLQ and QMR
methods discussed in this book circumvent pivot breakdown by solving
least squares systems. Other methods tackling this problem can be
found
in Fletcher
[98], Saad
[181],
Gutknecht
[113], and Bank and
Chan
[29]
[28].
Lanczos breakdown is much more difficult to eliminate. Recently,
considerable attention has been given to analyzing the nature of the
Lanczos breakdown (see Parlett
[172], and
Gutknecht
[114]
[116]), as well as various look-ahead
techniques for remedying it (see Brezinski and
Sadok
[39], Brezinski, Zaglia and
Sadok
[41]
[40],
Freund and Nachtigal
[102], Parlett
[172],
Nachtigal
[160], Freund, Gutknecht and
Nachtigal
[101],
Joubert
[129], Freund, Golub
and Nachtigal
[106], and
Gutknecht
[114]
[116]). However, the resulting
algorithms are usually too complicated to give in template form (some
codes of Freund and Nachtigal are available on netlib.)
Moreover, it is still not possible to eliminate breakdowns that
require look-ahead steps of arbitrary size (incurable breakdowns). So
far, these methods have not yet received much practical use but some
form of look-ahead may prove to be a crucial component in future
methods.
In the BiCG method, the need for matrix-vector multiplies with $A^T$ can be inconvenient as well as doubling the number of matrix-vector multiplies compared with CG for each increase in the degree of the underlying Krylov subspace.
Several recent methods have been proposed to overcome this drawback.
The most notable of these is the ingenious CGS method
by Sonneveld
[192]
discussed earlier, which computes the square of the BiCG polynomial without requiring $A^T$, thus obviating the need for the transpose product.
When BiCG converges, CGS is often an attractive,
faster converging alternative.
However, CGS also inherits (and often magnifies) the breakdown
conditions and the irregular convergence of
BiCG (see Van der Vorst
[207]).
CGS also generated interest in the possibility of product
methods, which generate iterates corresponding to a product of the
BiCG polynomial with another polynomial of the same degree,
chosen to have certain desirable properties but computable without recourse to $A^T$. The
Bi-CGSTAB method of Van der Vorst
[207] is such an
example, in which the auxiliary polynomial is defined by a local
minimization chosen to smooth the convergence behavior.
Gutknecht
[115] noted that Bi-CGSTAB could be viewed as a
product of BiCG and GMRES(1), and he suggested combining BiCG with
GMRES(2) for the even numbered iteration steps. This was anticipated
to lead to better convergence for the case where the eigenvalues of $A$ are complex. A more efficient and more robust variant of this
approach has been suggested by Sleijpen and Fokkema
in
[190], where they describe how to easily combine
BiCG with any GMRES($m$), for modest $m$.
Many other basic methods can also be squared. For example, by
squaring the Lanczos procedure, Chan, de Pillis and
Van der Vorst
[45] obtained
transpose-free implementations of BiCG and QMR. By squaring the QMR
method, Freund and Szeto
[104] derived a transpose-free
QMR squared method which is quite competitive with CGS but with much
smoother convergence. Unfortunately, these methods
require an
extra matrix-vector product per step (three instead of two) which
makes them less efficient.
In addition to Bi-CGSTAB, several recent product methods have been
designed to smooth the convergence of CGS. One idea is to use the
quasi-minimal residual (QMR) principle to obtain smoothed iterates
from the Krylov subspace generated by other product methods.
Freund
[105] proposed such a QMR version of CGS, which
he called TFQMR. Numerical experiments show that TFQMR in most
cases retains the desirable convergence features of CGS while
correcting its erratic behavior. The transpose free nature of TFQMR,
its low computational cost and its smooth convergence behavior make it
an attractive alternative to CGS. On the other hand, since the BiCG
polynomial is still used, TFQMR breaks down whenever CGS does. One
possible remedy would be to combine TFQMR with a look-ahead Lanczos
technique but this appears to be quite complicated and no methods of
this kind have yet appeared in the literature. Recently, Chan et al.
[46] derived a similar QMR version of
Van der Vorst's Bi-CGSTAB method, which is called QMRCGSTAB. These
methods offer smoother convergence over CGS and Bi-CGSTAB with little
additional cost.
There is no clear best Krylov subspace method at this time, and there
will never be a best overall Krylov subspace method. Each of
the methods is a winner in a specific problem class, and the main
problem is to identify these classes and to construct new methods for
uncovered classes. The paper by Nachtigal, Reddy and
Trefethen
[158] shows that for any of a group of
methods (CG, BiCG, GMRES, CGNE, and CGS), there is a class of problems
for which a given method is the winner and another one is the loser.
This shows clearly that there will be no ultimate method. The best we
can hope for is some expert system that guides the user in his/her
choice. Hence, iterative methods will never reach the robustness of
direct methods, nor will they beat direct methods for all problems.
For some problems, iterative schemes will be most attractive,
and for others, direct methods (or
multigrid). We hope to find suitable methods
(and preconditioners) for classes of very large problems that we are
yet unable to solve by any known method, because of CPU-restrictions,
memory, convergence problems, ill-conditioning, et cetera.
The convergence rate of iterative methods
depends on spectral properties of the coefficient matrix. Hence one
may attempt to transform the linear system into one that is equivalent
in the sense that it has the same solution, but that has more
favorable spectral properties. A preconditioner is a matrix
that effects such a transformation.
For instance, if a matrix $M$ approximates the coefficient matrix $A$ in some way, the transformed system
$$M^{-1} A x = M^{-1} b$$
has the same solution as the original system $Ax = b$, but the spectral properties of its coefficient matrix $M^{-1}A$ may be more favorable.
In devising a preconditioner, we are faced with a choice between finding a matrix $M$ that approximates $A$, and for which solving a system is easier than solving one with $A$, or finding a matrix $M$ that approximates $A^{-1}$, so that only multiplication by $M$ is needed. The majority of preconditioners falls in the first category; a notable example of the second category will be discussed in §.
Since using a preconditioner in an iterative method
incurs some extra cost, both initially
for the setup, and per iteration for applying it,
there is a trade-off between the cost of
constructing and applying the preconditioner, and the gain in
convergence speed.
Certain preconditioners need little or no construction phase
at all (for instance the SSOR preconditioner), but for others, such as
incomplete factorizations, there can be substantial work involved.
Although the work in scalar terms may be comparable to a single
iteration, the construction of the preconditioner may not be
vectorizable/parallelizable even if application of the preconditioner
is. In that case, the initial cost has to be amortized over the
iterations, or over repeated use of the same preconditioner in
multiple linear systems.
Most preconditioners take in their application an amount of work
proportional to the number of variables. This implies that they
multiply the work per iteration by a constant factor. On the other
hand, the number of iterations as a function of the matrix size
is usually only improved by a constant. Certain preconditioners are
able to improve on this situation, most notably the modified
incomplete factorizations and preconditioners based on multigrid
techniques.
On parallel machines there is a further trade-off between the efficacy
of a preconditioner in the classical sense, and its parallel
efficiency. Many of the traditional preconditioners have a large
sequential component.
The above transformation of the linear system is often not what is used in practice. For instance, the matrix $M^{-1}A$ is not symmetric, so, even if $A$ and $M$ are, the conjugate gradients method is not immediately applicable to this system. The method as described in figure  remedies this by employing the $M^{-1}$-inner product for orthogonalization of the residuals. The theory of the cg method is then applicable again.
All cg-type methods in this book, with the exception of QMR, have been derived with such a combination of preconditioned iteration matrix and correspondingly changed inner product.
Another way of deriving the preconditioned conjugate gradients method would be to split the preconditioner as $M = M_1 M_2$ and to transform the system as
$$M_1^{-1} A M_2^{-1} (M_2 x) = M_1^{-1} b.$$
If $M$ is symmetric and $M_1 = M_2^T$, it is obvious that we now have a method with a symmetric iteration matrix, hence the conjugate gradients method can be applied.
Remarkably, the splitting of $M$ is in practice not needed. By rewriting the steps of the method (see for instance Axelsson and Barker [14, pp. 16, 29] or Golub and Van Loan [109]) it is usually possible to reintroduce a computational step
$$\text{solve } w \text{ from } M w = r,$$
that is, a step that applies the preconditioner in its entirety.
There is a different approach to preconditioning, which is much easier to derive. Consider again the system
$$M_1^{-1} A M_2^{-1} (M_2 x) = M_1^{-1} b.$$
The matrices $M_1$ and $M_2$ are called the left- and right preconditioners, respectively, and we can simply apply an unpreconditioned iterative method to this system. Only two additional actions are necessary: transforming the right-hand side before the iterative process, and recovering $x$ from $M_2 x$ after it. Thus we arrive at the following schematic for deriving a left/right preconditioned iterative method from any of the symmetrically preconditioned methods in this book, where $x$ is the final calculated solution.
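As a concrete (and entirely illustrative) example of this schematic in Python/SciPy, with a hypothetical splitting $M = M_1 M_2$ given by the square root of the diagonal: the unpreconditioned solver is applied to $M_1^{-1} A M_2^{-1}$, the right-hand side is transformed before the iteration, and the solution is recovered afterwards.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import LinearOperator, gmres

    n = 100
    A = diags([-1.0, 3.0, -1.5], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    # Hypothetical symmetric splitting M = M1 M2 with M1 = M2 = sqrt(D),
    # D the diagonal of A; only the actions of M1^{-1} and M2^{-1} are needed.
    sqrt_d = np.sqrt(A.diagonal())
    apply_M1inv = lambda v: v / sqrt_d
    apply_M2inv = lambda v: v / sqrt_d

    # Unpreconditioned method applied to M1^{-1} A M2^{-1} y = M1^{-1} b.
    op = LinearOperator((n, n),
                        matvec=lambda v: apply_M1inv(A @ apply_M2inv(v)))
    y, info = gmres(op, apply_M1inv(b))   # action before the iteration
    x = apply_M2inv(y)                    # action after the iteration
    print(info, np.linalg.norm(A @ x - b))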
The simplest preconditioner consists of just the diagonal of the matrix:
$$m_{i,j} = \begin{cases} a_{i,i} & \text{if } i = j, \\ 0 & \text{otherwise}. \end{cases}$$
This is known as the (point) Jacobi preconditioner.
It is possible to use this preconditioner without using any extra
storage beyond that of the matrix itself. However, division operations
are usually quite costly, so in practice storage is allocated
for the reciprocals of the matrix diagonal. This strategy applies to
many preconditioners below.
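A short Python/SciPy sketch of the point Jacobi preconditioner along these lines (the names are our own): the reciprocals of the diagonal are stored once, and the preconditioner solve is a componentwise multiplication.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import LinearOperator, cg

    n = 1000
    main = 2.0 + np.linspace(0.0, 8.0, n)          # strongly varying diagonal
    A = diags([-1.0, main, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    inv_diag = 1.0 / A.diagonal()                  # store reciprocals, not D
    M = LinearOperator((n, n), matvec=lambda r: inv_diag * r)

    x, info = cg(A, b, M=M)
    print(info, np.linalg.norm(A @ x - b))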
Block versions of the Jacobi preconditioner can be derived by a
partitioning of the variables. If the index set $S = \{1, \dots, n\}$ is partitioned as $S = \bigcup_i S_i$ with the sets $S_i$ mutually disjoint, then
$$m_{i,j} = \begin{cases} a_{i,j} & \text{if } i \text{ and } j \text{ are in the same index subset}, \\ 0 & \text{otherwise}. \end{cases}$$
The preconditioner is now a block-diagonal matrix.
Often, natural choices for the partitioning suggest themselves:
Jacobi preconditioners need very little storage, even in the block
case, and they are easy to implement. Additionally, on parallel
computers they don't present any particular problems.
On the other hand, more sophisticated preconditioners usually yield
a larger improvement.
The SSOR preconditioner, like the Jacobi preconditioner, can be derived from the coefficient matrix without any work.
If the original, symmetric, matrix is decomposed as
$$A = D + L + L^T$$
in its diagonal, lower, and upper triangular part, the SSOR matrix is defined as
$$M = (D + L) D^{-1} (D + L)^T,$$
or, parameterized by $\omega$,
$$M(\omega) = \frac{1}{2 - \omega}\left(\frac{1}{\omega} D + L\right)\left(\frac{1}{\omega} D\right)^{-1}\left(\frac{1}{\omega} D + L\right)^T.$$
The optimal value of the $\omega$ parameter, like the parameter in the SOR method, will reduce the number of iterations to a lower order. Specifically, for second order elliptic problems a spectral condition number $\kappa(M_{\omega}^{-1}A) = O(\sqrt{\kappa(A)})$ is attainable (see Axelsson and Barker [14]). In practice, however, the spectral information needed to calculate the optimal $\omega$ is prohibitively expensive to compute.
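As an illustration, a Python/SciPy sketch (our own, and not tuned for efficiency) that builds the unparameterized SSOR preconditioner in the factored form above and applies it by two triangular solves:

    import numpy as np
    from scipy.sparse import diags, tril
    from scipy.sparse.linalg import LinearOperator, cg, spsolve_triangular

    n = 200
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    D = A.diagonal()                          # diagonal of A
    DL = tril(A, k=0, format="csr")           # D + L (lower triangle incl. D)
    DLT = DL.T.tocsr()                        # (D + L)^T

    def ssor_solve(r):
        # Apply M^{-1} with M = (D + L) D^{-1} (D + L)^T.
        y = spsolve_triangular(DL, r, lower=True)        # (D + L) y = r
        y = D * y                                        # undo the D^{-1} factor
        return spsolve_triangular(DLT, y, lower=False)   # (D + L)^T x = y

    M = LinearOperator((n, n), matvec=ssor_solve)
    x, info = cg(A, b, M=M)
    print(info, np.linalg.norm(A @ x - b))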
The SSOR matrix is given in factored form, so this preconditioner
shares many properties of other factorization-based methods (see
below). For instance, its suitability for vector processors or
parallel architectures
depends strongly on the
ordering of the variables. On the other hand, since this factorization
is given a priori, there is no possibility of breakdown as in
the construction phase of incomplete factorization methods.
A broad class of preconditioners is based on incomplete factorizations
of the coefficient matrix. We call a factorization incomplete if
during the factorization process certain fill elements,
nonzero elements in the factorization in positions where the original
matrix had a zero, have been
ignored. Such a preconditioner is then given in factored form $M = LU$ with $L$ lower and $U$ upper triangular. The efficacy of the preconditioner depends on how well $M^{-1}$ approximates $A^{-1}$.
Traditionally, users have asked for and been provided with black box
software in the form of mathematical libraries such as LAPACK
,
LINPACK
, NAG
, and
IMSL
. More recently, the
high-performance community has discovered that they must write custom
software for their problem. Their reasons include inadequate
functionality of existing software libraries, data structures that are
not natural or convenient for a particular problem, and overly general
software that sacrifices too much performance when applied to a
special case of interest.
Can we meet the needs of both groups of users? We believe we can.
Accordingly, in this book, we introduce the use of templates
Template: Description of an
algorithm, abstracting away from implementational details.
A template is a description of a general algorithm rather than the
executable object code or the source code more commonly found in a
conventional software library. Nevertheless, although templates are
general descriptions of key algorithms, they offer whatever degree of
customization the user may desire. For example, they can be
configured for the specific data structure of a problem or for the
specific computing system on which the problem is to run.
We focus on the use of iterative methods for
solving large sparse systems of linear equations. Iterative
method: An algorithm that produces a sequence of approximations to
the solution of a linear system of equations; the length of the
sequence is not given a priori by the size of the system. Usually,
the longer one iterates,
the closer one is able to get to the true solution. See: Direct method.Direct method:
An algorithm that produces the solution to a system of linear equations
in a number of operations that is determined a priori by the size
of the system. In exact arithmetic, a direct method yields the true
solution to the system. See: Iterative method.
Many methods exist for solving such
problems. The trick is to find the most effective method for the
problem at hand. Unfortunately, a method that works well for one
problem type may not work as well for another. Indeed, it may not work
at all.
Thus, besides providing templates, we suggest how to choose and
implement an effective method, and how to specialize a method to
specific matrix types. We restrict ourselves to iterative
methods, which work by repeatedly improving an approximate solution
until it is accurate enough. These methods access the coefficient matrix $A$ of the linear system only via the matrix-vector product $y = Ax$ (and perhaps $y = A^T x$). Thus the user need only supply a subroutine for computing $y = Ax$ (and perhaps $y = A^T x$) given $x$, which permits full exploitation of the sparsity or other special structure of $A$.
We believe that after reading this book, applications developers will
be able to use templates to get their program running on a parallel
machine quickly. Nonspecialists will know how to choose and implement
an approach to solve a particular problem. Specialists will be able
to assemble and modify their codes, without having to make the huge
investment that has, up to now, been required to tune large-scale
applications for each particular machine. Finally, we hope that all
users will gain a better understanding of the algorithms employed.
While education has not been one of the traditional goals of
mathematical software, we believe that our approach will go a long way
in providing such a valuable service.
Incomplete factorizations are the first preconditioners we have
encountered so far for which there is a non-trivial creation stage.
Incomplete factorizations may break down (attempted
division by zero pivot) or result in indefinite matrices (negative
pivots) even if the full factorization of the same matrix is
guaranteed to exist and yield a positive definite matrix.
An incomplete factorization is guaranteed to exist for many
factorization strategies if the original matrix is an $M$-matrix.
This was originally proved by Meijerink and
Van der Vorst
[152]; see
further Beauwens and Quenon
[33],
Manteuffel
[147], and
Van der Vorst
[200].
In cases where pivots are zero or negative, strategies have been
proposed such as substituting an arbitrary positive
number (see Kershaw
[132]), or restarting the factorization
on $A + \alpha I$ for some positive value of $\alpha$ (see Manteuffel [147]).
An important consideration for incomplete
factorization preconditioners is the cost of the factorization process.
Even if the incomplete factorization exists,
the number of operations involved in creating it
is at least as much as for
solving a system with such a coefficient matrix, so the cost may equal
that of one
or more iterations of the iterative method. On parallel computers this
problem is aggravated by the generally poor parallel efficiency of the
factorization.
Such factorization costs can be amortized if the iterative method
takes many iterations, or if the same preconditioner will be used
for several linear systems, for instance in successive time steps or
Newton iterations.
Incomplete factorizations can be given in various forms. If $M = LU$ (with $L$ and $U$ nonsingular triangular matrices), solving a system proceeds in the usual way (figure ), but often incomplete factorizations are given as $M = (D + L) D^{-1} (D + U)$ (with $D$ diagonal, and $L$ and $U$ now strictly triangular matrices, determined through the factorization process).
In that case, one could use either of the following equivalent formulations for $M$:
$$M = (D + L)(I + D^{-1}U)$$
or
$$M = (I + LD^{-1})(D + U).$$
In either case, the diagonal elements are used twice (not three times as the formula for $M$ would lead one to expect), and since only divisions with $D$ are performed, storing $D^{-1}$ explicitly is the practical thing to do. At the cost of some extra storage, one could store $LD^{-1}$ or $D^{-1}U$, thereby saving some computation.
Solving a system using the first formulation is
outlined in figure
. The second formulation is
slightly harder to implement.
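In code, applying the preconditioner in the first formulation is one forward and one backward triangular solve. The following Python/SciPy sketch is our own; for the sake of a self-contained example the factors are simply taken from the triangles of a model matrix, which only illustrates the solve pattern, not a genuine incomplete factorization.

    import numpy as np
    from scipy.sparse import diags, eye, tril, triu
    from scipy.sparse.linalg import spsolve_triangular

    n = 100
    A = diags([-1.0, 4.0, -2.0], [-1, 0, 1], shape=(n, n), format="csr")
    D = A.diagonal()

    # First formulation M = (D + L)(I + D^{-1} U), built once.
    DL = (tril(A, k=-1) + diags(D)).tocsr()                  # D + L
    IDU = (eye(n) + diags(1.0 / D) @ triu(A, k=1)).tocsr()   # I + D^{-1} U

    def apply_precond(y):
        # Solve (D + L) z = y by forward substitution, then
        # (I + D^{-1} U) x = z by backward substitution.
        z = spsolve_triangular(DL, y, lower=True)
        return spsolve_triangular(IDU, z, lower=False)

    w = apply_precond(np.ones(n))
    print(np.linalg.norm(w))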
The most common type of incomplete factorization is based on taking a set $S$ of matrix positions, and keeping all positions outside this set equal to zero during the factorization. The resulting factorization is incomplete in the sense that fill is suppressed.
The set $S$ is usually chosen to encompass all positions $(i,j)$ for which $a_{i,j} \ne 0$. A position that is zero in $A$ but not so in an exact factorization is called a fill position, and if it is outside $S$, the fill there is said to be ``discarded''. Often, $S$ is chosen to coincide with the set of nonzero positions in $A$, discarding all fill. This factorization type is called the $ILU(0)$ factorization: the Incomplete $LU$ factorization of level zero.
We can describe an incomplete factorization formally as follows: in step $k$ of the elimination, for $i, j > k$,
$$a_{i,j} \leftarrow \begin{cases} a_{i,j} - a_{i,k} a_{k,k}^{-1} a_{k,j} & \text{if } (i,j) \in S, \\ a_{i,j} & \text{otherwise}. \end{cases}$$
Meijerink and Van der Vorst [152] proved that, if $A$ is an $M$-matrix, such a factorization exists for any choice of $S$, and gives a symmetric positive definite matrix if $A$ is symmetric positive definite. Guidelines for allowing levels of fill were given by Meijerink and Van der Vorst in [153].
There are two major strategies for accepting or discarding fill-in,
one structural, and one numerical. The structural strategy is that of
accepting fill-in only to a certain level. As was already pointed out above,
any zero location $(i,j)$ in $A$ filling in (say in step $k$) is assigned a fill level value. If $a_{i,j}$ was already nonzero, the level value is not changed.
The numerical fill strategy is that of `drop tolerances':
fill is ignored if it is too small, for a suitable definition of
`small'. Although this definition makes more sense mathematically, it
is harder to implement in practice, since the amount of storage needed
for the factorization is not easy to predict.
See
[157]
[20] for discussions of
preconditioners using drop tolerances.
For the $ILU(0)$ method, the incomplete factorization produces no nonzero elements beyond the sparsity structure of the original matrix, so that the preconditioner at worst takes exactly as much space to store as the original matrix. In a simplified version of $ILU(0)$, called $D$-$ILU$ (Pommerell [174]), even less is needed. If we not only prohibit fill-in elements, but also alter only the diagonal elements (that is, any alterations of off-diagonal elements are ignored), we have the following situation.
Splitting the coefficient matrix into its diagonal, lower triangular, and upper triangular parts as $A = D_A + L_A + U_A$, the preconditioner can be written as
$$M = (D + L_A) D^{-1} (D + U_A),$$
where $D$ is the diagonal matrix containing the pivots generated. Generating this preconditioner is described in figure . Since we use the upper and lower triangle of the matrix unchanged, only storage space for $D$ is needed. In fact, in order to avoid division operations during the preconditioner solve stage we store $D^{-1}$ rather than $D$.
Remark: the resulting lower and upper factors of the preconditioner have only nonzero elements in the set $S$, but this fact is in general not true for the preconditioner $M$ itself.
The fact that the $D$-$ILU$ preconditioner contains the off-diagonal parts of the original matrix was used by Eisenstat [91] to derive a more efficient implementation of preconditioned CG. This new implementation merges the application of the triangular factors of the matrix and the preconditioner, thereby saving a substantial number of operations per iteration.
We will now consider the special case of a matrix derived from central differences on a Cartesian product grid. In this case the $ILU(0)$ and $D$-$ILU$ factorizations coincide, and, as remarked above, we only have to calculate the pivots of the factorization; other elements in the triangular factors are equal to off-diagonal elements of $A$.
In the following we will assume a natural, line-by-line, ordering of the grid points.
Letting $i$, $j$ be coordinates in a regular 2D grid, it is easy to see that the pivot on grid point $(i,j)$ is only determined by pivots on points $(i-1,j)$ and $(i,j-1)$. If there are $n$ points on each of $m$ grid lines, we obtain generating relations that express each pivot in terms of the pivots at $(i-1,j)$ and $(i,j-1)$ and the corresponding off-diagonal entries of $A$; the factorization can equivalently be described algorithmically.
In the above we have assumed that the variables in the problem
are ordered according to the so-called ``natural ordering'': a
sequential numbering of the grid lines and the points within each grid
line. Below we will encounter different orderings of the variables.
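A sketch of the pivot computation in Python (entirely our own naming: a holds the diagonal coefficients, and west/south/east/north the off-diagonal coupling coefficients of the five-point stencil, stored per grid point and unused at the boundaries) may help to see the structure of the recurrence.

    import numpy as np

    def dilu_pivots(a, west, south, east, north):
        """D-ILU pivots for a five-point stencil in the natural ordering."""
        nx, ny = a.shape
        d = np.zeros_like(a)
        for j in range(ny):                 # sweep the grid line by line
            for i in range(nx):
                piv = a[i, j]
                if i > 0:                   # through the west neighbour's pivot
                    piv -= west[i, j] * east[i - 1, j] / d[i - 1, j]
                if j > 0:                   # through the south neighbour's pivot
                    piv -= south[i, j] * north[i, j - 1] / d[i, j - 1]
                d[i, j] = piv
        return d

    # Five-point Laplacian on a 6 x 6 grid: diagonal 4, neighbour couplings -1.
    nx = ny = 6
    a = 4.0 * np.ones((nx, ny))
    off = -1.0 * np.ones((nx, ny))
    print(np.round(dilu_pivots(a, off, off, off, off), 3))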
One modification to the basic idea of incomplete factorizations is as
follows:
If the product $a_{i,k} a_{k,k}^{-1} a_{k,j}$ is nonzero, and fill is not allowed in position $(i,j)$, then instead of simply discarding this fill quantity, subtract it from the diagonal element $a_{i,i}$.
Such a factorization scheme is usually called a ``modified incomplete
factorization''.
Mathematically this corresponds to forcing the preconditioner to have
the same rowsums as the original matrix.
One reason for considering modified incomplete factorizations is the
behavior of the spectral condition number of the preconditioned
system. It was mentioned above that for second order elliptic equations the condition number of the coefficient matrix is $O(h^{-2})$ as a function of the discretization mesh width. This order of magnitude is preserved by simple incomplete factorizations, although usually a reduction by a large constant factor is obtained.
Modified factorizations are of interest because, in combination with small perturbations, the spectral condition number of the preconditioned system can be of a lower order. It was first proved by Dupont, Kendall and Rachford [81] that a modified incomplete factorization of $A + O(h^2) D_A$ gives $\kappa(M^{-1}A) = O(h^{-1})$ for the central difference case. More
general proofs are given by Gustafsson
[112],
Axelsson and Barker [14], and
Beauwens
[32]
[31].
Instead of keeping row sums constant, one can also keep column
sums constant.
In computational fluid mechanics this idea is justified with the argument
that the material balance stays constant over all iterates.
(Equivalently, one wishes to avoid `artificial
diffusion'.)
Appleyard and Cheshire [4] observed that if $A$ and $M$ have the same column sums, the choice $x^{(0)} = M^{-1} b$ guarantees that the sum of the elements in $r^{(0)}$ (the material balance error) is zero, and that all further $r^{(i)}$ have elements summing to zero.
Modified incomplete factorizations can break down,
especially when the variables are numbered other than in the natural
row-by-row ordering. This was noted by Chan and
Kuo
[50], and a full analysis was given by
Eijkhout
[86] and Notay
[161].
A slight variant of modified incomplete factorizations consists of the
class of ``relaxed incomplete factorizations''. Here the fill is
multiplied by a parameter
before it is subtracted from
the diagonal; see
Ashcraft and Grimes
[11],
Axelsson and Lindskog
[19]
[18],
Chan
[44],
Eijkhout
[86],
Notay
[162],
Stone
[194], and
Van der Vorst
[204].
For the dangers of MILU in the presence of rounding error, see
Van der Vorst
[206].
At first it may appear that the sequential time of solving a
factorization is of the order of the number of variables, but things
are not quite that bad. Consider the special case of central
differences on a regular domain of
points. The variables on any diagonal in the domain, that is, in locations (i,j) with i+j = k, depend only on those on the previous diagonal, that is, with i+j = k-1.
Therefore it is possible to process the operations on such a diagonal,
or `wavefront'
,
in parallel (see figure
),
or have a vector computer pipeline them;
see Van der Vorst
[205]
[203].
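A C sketch of such a wavefront sweep for the forward solve with a unit lower triangular factor from a five-point stencil follows; the coupling arrays lw and ls are hypothetical names, and the inner loop over a wavefront is the part that can be executed in parallel or pipelined.

/* Solve (I + L) y = x on an nx-by-ny grid, natural ordering, where L
 * couples each point to its west (lw) and south (ls) neighbours.
 * Points with i+j = w depend only on points with i+j = w-1, so the
 * inner loop is free of recurrences. */
void wavefront_forward_solve(int nx, int ny,
                             const double *lw, const double *ls,
                             const double *x, double *y)
{
    for (int w = 0; w <= nx + ny - 2; w++) {   /* wavefront number   */
        int ilo = w < ny ? 0 : w - ny + 1;
        int ihi = w < nx ? w : nx - 1;
        for (int i = ilo; i <= ihi; i++) {     /* independent points */
            int j = w - i;
            int k = j*nx + i;
            double s = x[k];
            if (i > 0) s -= lw[k] * y[k - 1];
            if (j > 0) s -= ls[k] * y[k - nx];
            y[k] = s;
        }
    }
}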
Another way of vectorizing the solution of the triangular factors is
to use some form of expansion of the inverses of the factors.
Consider for a moment a lower triangular matrix, normalized to the form I - L, where L is strictly lower triangular. Its inverse can be given as either of the following two series:
   (I - L)^{-1} = I + L + L^2 + L^3 + ...
               = (I + L)(I + L^2)(I + L^4) ...
(The first series is called a ``Neumann expansion'', the second an ``Euler expansion''. Both series are finite, but their length prohibits practical use of this fact.)
Parallel or vectorizable preconditioners can be derived from an
incomplete factorization by taking a small number of terms in either
series.
Experiments indicate that a small number of terms, while
giving high execution rates, yields almost the full precision of the
more recursive triangular solution (see
Axelsson and Eijkhout
[15] and
Van der Vorst
[201]).
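A C sketch of applying a truncated Neumann expansion, assuming the strictly lower triangular factor is available only through a user-supplied matrix-vector product, might look as follows; the type and function names are illustrative.

#include <stdlib.h>
#include <string.h>

typedef void (*matvec_fn)(int n, const double *x, double *y);

/* Approximate y = (I - L)^{-1} x by  x + L x + ... + L^m x,
 * where Lmatvec(n, x, y) computes y = L x. */
void neumann_apply(int n, int m, matvec_fn Lmatvec,
                   const double *x, double *y)
{
    double *t = malloc(n * sizeof *t);   /* current term L^k x */
    double *s = malloc(n * sizeof *s);   /* scratch for L * t  */
    memcpy(t, x, n * sizeof *t);
    memcpy(y, x, n * sizeof *y);         /* k = 0 term         */
    for (int k = 1; k <= m; k++) {
        Lmatvec(n, t, s);                /* s = L^k x          */
        for (int i = 0; i < n; i++) {
            y[i] += s[i];
            t[i] = s[i];
        }
    }
    free(t); free(s);
}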
There are some practical considerations in implementing these
expansion algorithms. For instance, because of the normalization the
in equation (
) is not
. Rather, if we
have a preconditioner (as described in section
)
described by
then we write
Now we can choose whether or not to store the product
.
Doing so doubles the storage requirements for the matrix, not doing so
means that separate multiplications by
and
have to be
performed in the expansion.
Suppose then that the products
and
have been
stored. We then define
by
and replace solving a system
for
by computing
.
This algorithm is given in figure
.
The algorithms for vectorization outlined above can be used on
parallel computers. For instance, variables on a wavefront
can be
processed in parallel, by dividing the wavefront over processors.
More radical approaches for increasing the parallelism in incomplete
factorizations are based on a renumbering of the problem variables.
For instance, on rectangular domains one could start numbering the
variables from all four corners simultaneously, thereby creating
four simultaneous wavefronts, and therefore four-fold parallelism
(see Dongarra, et al.
[71],
Van der Vorst
[204]
[202])
.
The most extreme case is the red/black ordering
(or for more general matrices the multi-color ordering) which gives
the absolute minimum number of sequential steps.
Multi-coloring is also an attractive method for vector computers.
Since points of one color are uncoupled, they can be processed as one
vector; see Doi
[68],
Melhem
[154], and Poole and
Ortega
[176].
However, for such ordering strategies there is usually a trade-off
between the degree of parallelism and the resulting number of
iterations. The reason for this is that a different ordering may give
rise to a different error matrix, in particular the norm of the error
matrix may vary considerably between orderings.
See experimental results by Duff and Meurant
[79] and a
partial explanation of them by Eijkhout
[85].
We can also consider block variants of preconditioners for accelerated
methods. Block methods are normally feasible if the problem domain is
a Cartesian product grid; in that case a natural division in lines (or
planes in the 3-dimensional case) can be used for blocking, though
incomplete factorizations are not as effective in the 3-dimensional
case; see for instance Kettler
[134].
In such a blocking scheme for Cartesian product grids,
both the size and number of the
blocks increases with the overall problem size.
Templates offer three significant advantages. First, templates are
general and reusable. Thus, they can simplify ports to diverse
machines. This feature is important given the diversity
of parallel architectures.
Second, templates exploit the expertise of two distinct groups. The
expert numerical analyst creates a template reflecting in-depth
knowledge of a specific numerical technique. The computational scientist
then provides ``value-added'' capability to the general template
description, customizing it for specific contexts or applications
needs.
And third, templates are not language specific. Rather, they are
displayed in an Algol-like structure, which is readily translatable
into a target language such as FORTRAN (with the use of the
Basic Linear Algebra Subprograms, or BLAS
,
whenever possible) and C. By using these familiar styles, we
believe that the users will have trust in the algorithms. We also
hope that users will gain a better understanding of numerical
techniques and parallel programming.
For each template, we provide some or all of the following:
For each of the templates, the following can be obtained via electronic mail.
The starting point for an incomplete block factorization is a
partitioning of the matrix, as mentioned in §
.
Then an incomplete factorization is performed using the matrix blocks
as basic entities (see Axelsson
[12] and Concus, Golub
and Meurant
[57] as
basic references).
The most important difference with point methods arises in the
inversion of the pivot blocks. Whereas inverting a scalar is easily
done, in the block case two problems arise. First, inverting
the pivot block is likely to be a costly operation. Second, initially the
diagonal blocks of the matrix are likely to be sparse
and we would like to maintain this type of
structure throughout the factorization.
Hence the need for approximations of inverses arises.
In addition to this, often fill-in in off-diagonal blocks is discarded
altogether. Figure
describes an incomplete block
factorization that is analogous to the D-ILU factorization
(section
) in that it only updates the diagonal blocks.
As in the case of incomplete point factorizations, the existence of
incomplete block methods is guaranteed if the coefficient
matrix is an M-matrix. For a general proof, see
Axelsson
[13].
In block factorizations a pivot block is generally forced to be sparse, typically of banded form, and we need an approximation to its inverse that has a similar structure. Furthermore, this
approximation should be easily computable, so we rule out the option
of calculating the full inverse and taking a banded part of it.
The simplest approximation to
is the diagonal matrix
of
the reciprocals of the diagonal of
:
.
Other possibilities were considered by
Axelsson and Eijkhout
[15],
Axelsson and Polman
[21],
Concus, Golub and Meurant
[57],
Eijkhout and Vassilevski
[90],
Kolotilina and Yeremin
[141],
and Meurant
[155]. One particular
example is given in figure
. It has the attractive
theoretical property that, if the original matrix is symmetric
positive definite and a factorization with positive diagonal
can
be made, the approximation to the inverse is again symmetric positive
definite.
Banded approximations to the inverse of banded matrices have a
theoretical justification. In the context of partial differential
equations the diagonal blocks of the coefficient matrix are usually
strongly diagonally dominant. For such matrices, the elements of the
inverse have a size that is exponentially decreasing in their distance
from the main diagonal. See Demko, Moss and
Smith
[65] for a general
proof, and Eijkhout and Polman
[89] for a more detailed
analysis in the
M-matrix case.
In many applications, a block tridiagonal structure can be found in
the coefficient matrix. Examples are problems on a 2D regular grid if
the blocks correspond to lines of grid points, and problems on a
regular 3D grid, if the blocks correspond to planes of grid points.
Even if such a block tridiagonal structure does not arise naturally,
it can be imposed by renumbering the variables in a Cuthill-McKee
ordering
[60].
Such a matrix has incomplete block factorizations of a particularly
simple nature: since no fill can occur outside the diagonal
blocks
,
all properties follow from our treatment of the pivot blocks. The
generating recurrence for the pivot blocks also takes a simple form,
see figure
. After the factorization we are left
with a sequence of blocks forming the pivots, and a sequence of blocks approximating their inverses.
One reason that block methods are of interest is that they are
potentially more suitable for vector computers and parallel
architectures. Consider the block factorization
where
is the block diagonal matrix of pivot blocks,
and
,
are the block lower and upper triangle of the factorization;
they coincide with
,
in the case of a block tridiagonal
matrix.
We can turn this into an incomplete factorization by replacing the
block diagonal matrix of pivots
by the block diagonal matrix of
incomplete factorization pivots
, giving
For factorizations of this type (which covers all
methods in Concus, Golub and Meurant
[57] and
Kolotilina and Yeremin
[141]) solving
a linear system means solving
smaller systems with the
matrices.
Alternatively, we can replace the block diagonal pivot matrix by the inverse of the block diagonal matrix of approximations to the inverses of the pivots, giving
For this second type (which
was discussed by Meurant
[155], Axelsson and
Polman
[21] and Axelsson and
Eijkhout
[15]) solving
a system with
entails multiplying by the
blocks.
Therefore, the second type has a much higher potential for
vectorizability. Unfortunately, such a factorization is theoretically
more troublesome; see the above references or Eijkhout and
Vassilevski
[90].
If the physical problem has several variables per grid point, that is,
if there are several coupled partial differential equations, it is
possible to introduce blocking in a natural way.
Blocking of the equations (which gives a small number of very large
blocks) was used by Axelsson and Gustafsson
[17] for
the equations of linear
elasticity, and blocking of the
variables per node (which gives many very small blocks) was used
by Aarden and Karlsson
[1] for the
semiconductor
equations. A systematic comparison of the two approaches was made
by Bank, et al.
[26].
Saad
[184] proposes to construct an incomplete LQ factorization
of a general sparse matrix. The idea is to orthogonalize the rows
of the matrix by a Gram-Schmidt process (note that in sparse matrices,
most rows are typically orthogonal already, so that standard Gram-Schmidt may not be as bad as it is in general). Saad suggests dropping strategies for
the fill-in produced in the orthogonalization process. It turns out that
the resulting incomplete L factor can be viewed as the incomplete Cholesky
factor of the matrix
. Experiments show that using
in a CG
process for the normal equations:
is effective for
some relevant problems.
So far, we have described preconditioners in only one of two classes: those that approximate the coefficient matrix, and for which solving linear systems with the preconditioner as coefficient matrix is easier than solving the original system.
Polynomial preconditioners can be considered as members of the
second class of preconditioners:
direct approximations of the inverse of the coefficient matrix.
Suppose that the coefficient matrix A of the linear system is normalized to the form A = I - B, and that the spectral radius of B is less than one. Using the Neumann series, we can write the inverse of A as
   A^{-1} = (I - B)^{-1} = I + B + B^2 + B^3 + ... ,
so an approximation may be derived by truncating this infinite series.
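For illustration, a polynomial preconditioner with given coefficients can be applied to a vector by Horner's scheme; the following C sketch assumes a user-supplied product with the coefficient matrix and makes no assumption about how the coefficients were chosen (truncated Neumann series, Chebyshev, least squares, ...).

#include <stdlib.h>

typedef void (*amatvec_fn)(int n, const double *x, double *y);

/* Compute y = p(A) v with p(t) = c[0] + c[1] t + ... + c[m] t^m,
 * where Amatvec(n, x, y) computes y = A x. */
void poly_precond_apply(int n, int m, const double *c,
                        amatvec_fn Amatvec, const double *v, double *y)
{
    double *t = malloc(n * sizeof *t);
    for (int i = 0; i < n; i++)
        y[i] = c[m] * v[i];              /* leading coefficient */
    for (int k = m - 1; k >= 0; k--) {
        Amatvec(n, y, t);                /* t = A y             */
        for (int i = 0; i < n; i++)
            y[i] = t[i] + c[k] * v[i];   /* y = A y + c[k] v    */
    }
    free(t);
}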
Since the iterative methods we are considering are already based on
the idea of applying polynomials in the coefficient matrix to the
initial residual, there are analytic connections between the basic
method and the polynomially accelerated one.
Dubois, Greenbaum and Rodrigue
[77]
investigated the relationship between a basic method
using a splitting
, and a polynomially preconditioned method
with
The basic result is that for classical methods,
steps of the polynomially preconditioned method are
exactly equivalent to
steps of the original method; for accelerated
methods, specifically the Chebyshev method, the preconditioned
iteration can improve the number of iterations by at most a factor
of
.
Although there is no gain in the number of times the coefficient
matrix is applied, polynomial preconditioning does eliminate a large
fraction of the inner
products and update operations, so there may be an overall increase in
efficiency.
Let us define a polynomial preconditioner more abstractly as any
polynomial
normalized to
. Now the choice of the
best polynomial preconditioner becomes that of choosing the best
polynomial that minimizes
. For the choice of the
infinity norm we thus obtain Chebyshev polynomials, and they require
estimates of both a lower and upper bound on the spectrum of
.
These estimates may be derived from the conjugate gradient iteration
itself; see §
.
Since an accurate lower bound on the spectrum of
may be hard to
obtain, Johnson, Micchelli and Paul
[126]
and Saad
[183] propose least
squares polynomials based on several weight functions.
These functions only require an upper bound and this is easily
computed, using for instance the ``Gerschgorin bound''; see Varga.
Experiments comparing Chebyshev and least squares polynomials can be
found in Ashby, Manteuffel and Otto
[8].
Application of polynomial preconditioning to symmetric indefinite
problems is described by Ashby, Manteuffel and
Saylor
[9]. There the
polynomial is chosen so that it transforms the system into a
definite one.
A number of preconditioners exist that derive their justification from
properties of the underlying partial differential equation. We will
cover some of them here (see also §
and §
). These preconditioners usually involve more
work than the types discussed above; however, they allow for specialized, faster solution methods.
In §
we pointed out that conjugate
gradient methods for non-selfadjoint systems require the storage of
previously calculated vectors. Therefore it is somewhat remarkable
that preconditioning by the symmetric part
of the coefficient matrix
leads to a method that does not need this extended storage.
Such a method was proposed by Concus and Golub
[56]
and Widlund
[216].
However, solving a system with the symmetric part of a matrix may be
no easier than solving a system with the full matrix. This problem may
be tackled by imposing a nested iterative method, where a
preconditioner based on the symmetric part is used.
Vassilevski
[212] proved that the efficiency of this
preconditioner for the symmetric part carries over to the outer
method.
In many applications, the coefficient matrix is symmetric and positive
definite. The reason for this is usually that the partial differential
operator from which it is derived is self-adjoint, coercive, and bounded
(see Axelsson and Barker).
It follows that for the coefficient matrix A the following relation holds for any matrix B arising from a similar differential equation:
   c_1 x^T A x <= x^T B x <= c_2 x^T A x   for all x,
where the constants c_1, c_2 do not depend on the matrix size. The importance of
this is that the use of
as a preconditioner gives an iterative
method with a number of iterations that does not depend on the matrix size.
Thus we can precondition our original matrix by one derived from a
different PDE, if one can be found that has attractive properties as
preconditioner.
The most common choice is to take a matrix from a separable PDE. A system involving such a matrix can be solved with
various so-called ``fast solvers'', such as FFT methods, cyclic
reduction, or the generalized marching algorithm
(see Dorr
[75],
Swarztrauber
[195], Bank
[25] and
Bank and Rose
[27]).
As a simplest example, any elliptic operator can be preconditioned
with the Poisson operator, giving the iterative method
In Concus and Golub
[59] a transformation of this
method is considered to speed up the convergence.
As another example, if the original matrix arises from
the preconditioner can be formed from
An extension to the non-selfadjoint case is considered
by Elman and Schultz
[94].
Fast solvers are attractive in that the number of operations they
require is (slightly higher than) of the order of the number of
variables. Coupled with the fact that the number of iterations in the
resulting preconditioned iterative methods is independent of the
matrix size, such methods are close to optimal. However, fast solvers
are usually only applicable if the physical domain is a rectangle or
other Cartesian product structure. (For a domain consisting of a
number of such pieces, domain decomposition methods can be used;
see §
).
Many iterative methods have been developed and it is impossible to
cover them all. We chose the methods below either because they
illustrate the historical development of iterative methods, or because
they represent the current state-of-the-art for solving large sparse
linear systems. The methods we discuss are:
We do not intend to write a ``cookbook'', and have deliberately
avoided the words ``numerical recipes'', because these phrases imply
that our algorithms can be used blindly without knowledge of the
system of equations. The state of the art in iterative methods does
not permit this: some knowledge about the linear system is needed to
guarantee convergence of these algorithms, and generally the more that
is known the more the algorithm can be tuned. Thus, we have chosen to
present an algorithmic outline, with guidelines for choosing a method
and implementing it on particular kinds of high-performance machines.
We also discuss the use of preconditioners and relevant data storage
issues.
The Poisson differential operator can be split in a natural way
as the sum of two operators:
Now let
,
be discretized representations of
,
. Based on the observation that
, iterative schemes
such as
with suitable choices of
and
have been proposed.
This alternating direction implicit, or ADI, method was first
proposed as a solution method for parabolic equations. The
are then approximations on subsequent time steps. However, it can also
be used for the steady state, that is, for solving elliptic equations.
In that case, the
become subsequent iterates;
see D'Yakonov
[82],
Fairweather, Gourlay and Mitchell
[97],
Hadjidimos
[119], and
Peaceman and Rachford
[173].
Generalization
of this scheme to variable coefficients or fourth order elliptic
problems is relatively straightforward.
The above method is implicit since it requires systems solutions, and it
alternates the x and y (and if necessary z) directions. It is
attractive from a practical point of view (although mostly on tensor
product grids), since solving a system with, for instance,
a matrix
entails only a number of uncoupled tridiagonal
solutions. These need very little storage over that needed for the
matrix, and they can be executed in parallel
, or one can vectorize
over them.
A theoretical reason that ADI preconditioners are of interest is that
they can be shown to be
spectrally equivalent to the original coefficient matrix. Hence the
number of iterations is bounded independent of the condition number.
However, there is a problem of data distribution. For vector
computers, either the system solution with
or with
will
involve very large strides: if columns of variables in the grid are
stored contiguously, only the solution with
will involve
contiguous data. For the
the stride equals the number of
variables in a column.
On parallel machines an efficient solution is possible if the
processors are arranged in a
grid. During, e.g., the
solve, every processor row then works independently of other
rows. Inside each row, the processors can work together, for instance
using a Schur complement method. With sufficient network bandwidth
this will essentially reduce the time
to that for solving any of the subdomain
systems plus the time for the interface system. Thus, this method will
be close to optimal.
Conjugate gradient methods for real symmetric systems can be applied
to complex Hermitian systems in a straightforward manner. For
non-Hermitian complex systems we distinguish two cases. In general,
for any coefficient matrix a CGNE method is possible,
that is, a conjugate gradients method on the normal equations
,
or one can split the system into real and complex parts and use
a method such as GMRES on the resulting real nonsymmetric system.
However, in certain practical situations the complex system is
non-Hermitian but symmetric.
Complex symmetric systems can be solved by a classical conjugate gradient or Lanczos method, that is, with short recurrences, if the Hermitian inner product x^H y is replaced by the bilinear form x^T y. Like the BiConjugate Gradient method, this method is susceptible to breakdown, that is, it can happen that x^T x = 0 for some x != 0.
A look-ahead strategy can remedy this in most
cases (see Freund
[100]
and Van der Vorst and Melissen
[208]).
Stopping criterion: Since an iterative method computes successive
approximations to the solution of a linear system, a practical test is needed
to determine when to stop the iteration. Ideally this test would measure
the distance of the last iterate to the true solution, but this is not
possible. Instead, various other metrics are used, typically involving
the residual.
Forward error: The difference between a computed iterate and
the true solution of a linear system, measured in some vector norm.
Backward error: The size of perturbations dA of the coefficient matrix and db of the right-hand side of a linear system, such that the computed iterate x^(i) is the exact solution of (A + dA) x^(i) = b + db.
An iterative method produces a sequence of vectors x^(i) converging to the vector x satisfying the system A x = b.
To be effective, a method must decide when to stop. A good stopping
criterion should
For the user wishing to read as little as possible,
the following simple stopping criterion will likely be adequate.
The user must supply the quantities
,
, stop_tol, and preferably also
:
Here is the algorithm:
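In outline: compute the residual r = b - A x^(i), measure its norm, and compare it against the user-supplied quantities. A C sketch, assuming the test is ||r|| <= stop_tol * (||A|| ||x|| + ||b||), with the generally stricter ||r|| <= stop_tol * ||b|| as a fallback when ||x|| is not available, and using the infinity norm for illustration:

#include <math.h>

static double norm_inf(int n, const double *v)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(v[i]) > m) m = fabs(v[i]);
    return m;
}

/* r is the current residual b - A x; norm_A is the user-supplied
 * estimate of ||A||.  Returns 1 if the iteration may stop. */
int converged(int n, const double *r, const double *x, const double *b,
              double norm_A, double stop_tol, int have_normx)
{
    double norm_r = norm_inf(n, r);
    double norm_b = norm_inf(n, b);
    if (have_normx)
        return norm_r <= stop_tol * (norm_A * norm_inf(n, x) + norm_b);
    return norm_r <= stop_tol * norm_b;   /* fallback without ||x|| */
}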
Note that if
does not change much from step to step, which occurs
near convergence, then
need not be recomputed.
If
is not available, the stopping criterion may be replaced with
the generally stricter criterion
In either case, the final error bound is
.
If an estimate of
is available, one may also use the stopping
criterion
which guarantees that the relative error
in the computed solution is bounded by stop_tol.
Ideally we would like to stop when the magnitudes of entries of the error
fall below a user-supplied threshold.
But
is hard to estimate directly, so
we use the residual
instead, which is
more readily computed. The rest of this section describes how to measure
the sizes of vectors
and
, and how to bound
in terms of
.
We will measure errors using vector and matrix norms.
The most common vector norms are:
For some algorithms we may also use the norm
, where
is a fixed nonsingular
matrix and
is one of
,
, or
.
Corresponding to these vector norms are three matrix norms:
as well as
.
We may also use the matrix norm
, where
denotes the largest eigenvalue.
Henceforth
and
will refer to any mutually
consistent pair of the above.
(
and
, as well as
and
, both form
mutually consistent pairs.)
All these norms satisfy the triangle inequality
and
, as well as
for mutually consistent pairs.
(For more details on the properties of norms, see Golub and
Van Loan
[109].)
One difference between these norms is their dependence on dimension.
A vector
of length
with entries uniformly distributed between
0 and 1 will satisfy
, but
will grow
like
and
will grow like
. Therefore a stopping
criterion based on
(or
) may have to be permitted to
grow proportional to
(or
) in order that it does not become
much harder to satisfy for large
.
There are two approaches to bounding the
inaccuracy of the computed solution to
.
Since
,
which we will call the forward error,
is hard to estimate directly,
we introduce
the backward error, which allows us to bound the forward error.
The normwise backward error is defined as
the smallest possible value of
where
is the exact solution of
(here
denotes a general matrix, not
times
; the
same goes for
).
The backward error may be easily computed from the residual
; we show how below.
Provided one has
some bound on the inverse of
,
one can bound the forward error in terms of the backward error via
the simple equality
which implies
.
Therefore, a stopping criterion of the form ``stop when
'' also yields an upper bound on the forward error
. (Sometimes we may prefer to
use the stricter but harder to estimate bound
; see §
. Here
is the matrix or vector of absolute values of components of
.)
The backward error also has a direct interpretation as a stopping
criterion, in addition to supplying a bound on the forward error.
Recall that the backward error is the smallest change
to the problem
that makes
an exact solution of
. If the original data
and
have
errors from previous computations or measurements,
then it is usually not worth iterating until
and
are even
smaller than these errors.
For example, if the machine precision is
, it is not
worth making
and
, because just rounding the entries
of
and
to fit in the machine creates errors this large.
Based on this discussion, we will now consider
some stopping criteria and their
properties. Above we already mentioned
The second stopping criterion we discussed, which does not require
,
may be much more stringent than Criterion 1:
This criterion yields the forward error bound
If an estimate of
is available, one can also just stop
when the upper bound on the error
falls below
a threshold. This yields the third stopping criterion:
permitting the user to specify the desired relative accuracy stop_tol
in the computed solution
.
One drawback to Criteria 1 and 2 is that they usually treat backward errors in
each component of
and
equally, since most
norms
and
measure each entry of
and
equally.
For example, if
is sparse and
is dense, this loss of
possibly important structure will not be reflected in
.
In contrast, the following stopping criterion gives one the option of scaling
each component
and
differently,
including the possibility of insisting that some entries be zero.
The cost is an extra matrix-vector multiply:
where
is the matrix of absolute values of entries of
.
Finally, we mention one more criterion, not because we recommend it,
but because it is widely used. We mention it in order to explain its
potential drawbacks:
It is possible to design an iterative algorithm for which
or
is not directly available,
although this is not the case for any algorithms in this book.
For completeness, however, we discuss stopping criteria in this case.
For example, if one ``splits''
to get the iterative
method
, then the
natural residual to compute is
.
In other words, the residual
is the same as the residual of the
preconditioned system
. In this case, it is
hard to interpret
as a backward error for the original system
, so we may instead derive a forward error bound
.
Using this as a stopping criterion requires an estimate of
.
In the case of methods based on
splitting
, we have
,
and
.
Another example is an implementation of the preconditioned conjugate
gradient algorithm which computes
instead of
(the
implementation in this book computes the latter). Such an
implementation could use the stopping criterion
as in Criterion 5. We
may also use it to get the forward error bound
, which could also
be used in a stopping criterion.
Bounds on the error
inevitably rely on bounds for
,
since
. There is a large number of problem dependent
ways to estimate
; we mention a few here.
When a splitting
is used to get an iteration
then the matrix whose
inverse norm we need is
. Often, we know how to estimate
if the splitting is a standard one such as Jacobi or SOR,
and the matrix
has special characteristics such as Property A.
Then we may estimate
.
When
is symmetric positive definite, and Chebyshev acceleration with
adaptation of parameters is being used, then at each step the algorithm
estimates the largest and smallest eigenvalues
and
of
anyway.
Since
is symmetric positive definite,
.
This adaptive estimation is often done using the Lanczos algorithm
(see section
),
which can usually provide good estimates of the
largest (rightmost) and smallest (leftmost) eigenvalues of a symmetric matrix
at the cost of a few matrix-vector multiplies.
For general nonsymmetric
, we may apply
the Lanczos method to
or
,
and use the fact that
.
It is also possible to estimate
provided one is willing
to solve a few systems of linear equations with
and
as coefficient
matrices. This is often done with dense linear system solvers, because the
extra cost of these systems is
, which is small compared to the cost
of the LU decomposition (see Hager
[121],
Higham
[124] and Anderson, et al.
[3]).
This is not the case for iterative solvers, where the cost of these
solves may well be several times as much as the original linear system.
Still, if many linear systems with the same coefficient matrix and
differing right-hand-sides are to be solved, it is a viable method.
The approach in the last paragraph also lets us estimate the alternate
error bound
.
This may be much smaller than the simpler
in the
case where the rows of
are badly scaled; consider the case of a
diagonal matrix
with widely varying diagonal entries. To
compute
, let
denote the diagonal
matrix with diagonal entries equal to the entries of
; then
(see Arioli, Demmel and Duff
[5]).
can be estimated using the
technique in the last paragraph since multiplying by
or
is no harder than multiplying
by
and
and also by
, a diagonal matrix.
In addition to limiting the total amount of work by limiting the
maximum number of iterations one is willing to do, it is also natural
to consider stopping when no apparent progress is being made.
Some methods, such as Jacobi and SOR, often exhibit nearly monotone
linear convergence, at least after some initial transients, so it
is easy to recognize when convergence degrades. Other methods, like
the conjugate gradient method, exhibit ``plateaus'' in their convergence,
with the residual norm stagnating at a constant value for many iterations
before decreasing again; in principle there can be many such plateaus
(see Greenbaum and Strakos
[110]) depending on the problem.
Still other methods, such as CGS, can appear
wildly nonconvergent for a large number of steps before the residual begins
to decrease; convergence may continue to be erratic from step to step.
In other words, while it is a good idea to have a criterion that stops
when progress towards a solution is no longer being made, the
form of such a criterion is both method and problem dependent.
The error bounds discussed in this section are subject to floating
point errors, most of which are innocuous, but which deserve some discussion.
The infinity norm
requires the fewest
floating point operations to compute, and cannot overflow or cause other
exceptions if the
are themselves finite
. On the other hand, computing
in the most straightforward manner
can easily overflow or lose accuracy to underflow even when the true result
is far from either the overflow or underflow thresholds. For this reason,
a careful implementation for computing
without this danger
is available (subroutine snrm2 in the BLAS
[72]
[144]),
but it is more expensive than computing
.
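The essence of such a careful implementation is to scale the entries before squaring them; the following C sketch illustrates the idea (the actual BLAS routine is more refined and should be preferred in practice).

#include <math.h>

/* 2-norm computed with a running scale factor so that intermediate
 * squares neither overflow nor underflow when the result itself is
 * representable. */
double scaled_nrm2(int n, const double *x)
{
    double scale = 0.0, ssq = 1.0;
    for (int i = 0; i < n; i++) {
        double a = fabs(x[i]);
        if (a == 0.0) continue;
        if (scale < a) {
            ssq = 1.0 + ssq * (scale / a) * (scale / a);
            scale = a;
        } else {
            ssq += (a / scale) * (a / scale);
        }
    }
    return scale * sqrt(ssq);
}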
Now consider computing the residual
by forming the
matrix-vector product
and then subtracting
, all in floating
point arithmetic with relative precision
. A standard error
analysis shows that the error
in the computed
is bounded by
, where
is typically bounded by
, and
usually closer to
. This is why one should not choose
in Criterion 1, and why Criterion 2 may not
be satisfied by any method.
This uncertainty in the value of
induces an uncertainty in the error
of
at most
.
A more refined bound is that the error
in the
th component of
is bounded by
times the
th component of
, or more tersely
.
This means the uncertainty in
is really bounded by
.
This last quantity can be estimated inexpensively provided solving systems
with
and
as coefficient matrices is inexpensive (see the last
paragraph of §
).
Both these bounds can be severe overestimates of the uncertainty in
,
but examples exist where they are attainable.
The efficiency of any of the iterative methods considered in previous
sections is determined primarily by the performance of the
matrix-vector product and the preconditioner solve, and therefore by the storage scheme used for the matrix and the preconditioner. Since
iterative methods are typically used on sparse matrices, we will
review here a number of sparse storage formats.
Often, the storage scheme used
arises naturally from the specific application problem.
Storage scheme: The way elements of a matrix are stored
in the memory of a computer. For dense matrices, this can be
the decision to store rows or columns consecutively.
For sparse matrices, common storage schemes avoid storing zero elements;
as a result they involve integer data describing where the stored elements
fit into the global matrix.
In this section we will review some of the more popular sparse matrix
formats that are used in numerical software packages such as ITPACK
[140] and NSPCG
[165].
After surveying the various formats, we demonstrate how the
matrix-vector product and an incomplete factorization solve are
formulated using two of the sparse matrix formats.
The term ``iterative method'' refers to a wide range of techniques
that use successive approximations to obtain more accurate solutions
to a linear system at each step. In this book we will cover two types
of iterative methods.
Stationary methods are older, simpler to understand
and implement, but usually not as effective.
Nonstationary methods are a relatively recent
development; their analysis is usually harder to understand, but they
can be highly effective. The nonstationary methods we present are
based on the idea of sequences of orthogonal vectors. (An exception
is the Chebyshev
iteration method, which is based on
orthogonal polynomials.)
Stationary iterative method: Iterative method that performs
in each iteration the same operations on the current iteration vectors.
Nonstationary iterative method: Iterative method that
has iteration-dependent coefficients.
Dense matrix: Matrix for which the number of zero elements
is too small to warrant specialized algorithms.
Sparse matrix: Matrix for which the number of zero elements
is large enough that algorithms avoiding operations on zero elements
pay off. Matrices derived from partial differential equations typically
have a number of nonzero elements that is proportional to the matrix size,
while the total number of matrix elements is the square of the matrix size.
The rate at which an iterative method converges depends greatly on the
spectrum of the coefficient matrix. Hence, iterative methods usually
involve a second matrix that transforms the coefficient matrix into
one with a more favorable spectrum. The transformation matrix is
called a preconditioner.
A good preconditioner improves the convergence of the iterative method,
sufficiently to overcome the extra cost of constructing and applying
the preconditioner. Indeed, without a preconditioner the iterative
method may even fail to converge.
If the coefficient matrix
is sparse, large-scale linear systems
of the form
can be most
efficiently solved if the
zero elements of
are not stored. Sparse storage schemes allocate
contiguous storage in memory for
the nonzero elements of the matrix, and perhaps a limited number of
zeros. This, of course, requires a scheme for knowing
where the elements fit into the full matrix.
There are many methods for
storing the data (see for instance Saad
[186]
and Eijkhout
[87]).
Here we will discuss Compressed Row and Column Storage, Block Compressed
Row Storage, Diagonal Storage, Jagged Diagonal Storage, and Skyline
Storage.
The Compressed Row and Column (in the next section) Storage formats
are the most general: they make absolutely no assumptions about the
sparsity structure of the matrix, and they don't store any unnecessary
elements. On the other hand, they are not very efficient, needing an
indirect addressing step for every single scalar operation in a matrix-vector
product or preconditioner solve.
The Compressed Row Storage (CRS) format puts the subsequent nonzeros of the
matrix rows in contiguous memory locations.
Assuming we have a nonsymmetric sparse matrix A, we create three vectors:
one for floating-point numbers (val), and the other two for
integers (col_ind, row_ptr). The val vector
stores the values of the nonzero elements of the
matrix
, as they are traversed in a row-wise fashion.
The col_ind vector stores
the column indexes of the elements in the val vector.
That is, if val(k) = a_{i,j}, then col_ind(k) = j. The row_ptr vector stores the locations in the val vector that start a row; that is, if val(k) = a_{i,j}, then row_ptr(i) <= k < row_ptr(i+1). By convention, we define row_ptr(n+1) = nnz + 1, where nnz is the number of nonzeros in the matrix A. The storage savings for this approach is significant. Instead of storing n^2 elements, we need only 2 nnz + n + 1 storage locations.
As an example, consider the nonsymmetric matrix
defined by
The CRS format for this matrix is then specified by the arrays
{val, col_ind, row_ptr} given below
.
If the matrix
is symmetric, we need only store the upper (or
lower) triangular portion of the matrix. The trade-off is
a more complicated algorithm with a somewhat different pattern of data access.
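As a small concrete illustration (with a hypothetical 4-by-4 matrix, not the example referred to above, and 0-based C indexing so that row_ptr[n] equals nnz rather than nnz+1), the three CRS arrays could be declared as:

/*      | 1 0 0 2 |
 *  A = | 0 3 4 0 |
 *      | 5 0 6 0 |
 *      | 0 7 0 8 |       */
#define N   4
#define NNZ 8

double val[NNZ]     = { 1, 2,   3, 4,   5, 6,   7, 8 };
int    col_ind[NNZ] = { 0, 3,   1, 2,   0, 2,   1, 3 };
int    row_ptr[N+1] = { 0, 2, 4, 6, 8 };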
Analogous to Compressed Row Storage there is Compressed Column
Storage (CCS), which is also called the Harwell-Boeing
sparse matrix format
[78]. The CCS format is identical to the
CRS format except that the columns of
are stored (traversed) instead
of the rows. In other words, the CCS format is the CRS format
for
.
The CCS format is specified by the
arrays
{val, row_ind, col_ptr}, where
row_ind stores the row indices of each nonzero, and col_ptr
stores the index of the elements in val which start a column of
.
The CCS format for the matrix
in (
) is
given by
.
If the sparse matrix
is comprised of square
dense blocks of nonzeros in some regular pattern, we can modify
the CRS (or CCS) format to exploit such block patterns. Block
matrices typically arise from the discretization of partial differential
equations in which there are several degrees of freedom associated
with a grid point. We then partition the matrix in small blocks with
a size equal to the number of degrees of freedom, and treat each
block as a dense matrix, even though it may have some zeros.
If
is the dimension of each block and
is the number of nonzero
blocks in the
matrix
, then the total storage
needed is
.
The block
dimension
of
is then defined by
.
Similar to the CRS format, we require
arrays for the BCRS format:
a rectangular array for floating-point numbers (
val(
,
,
)) which stores the nonzero blocks in
(block) row-wise fashion, an integer array (col_ind(
))
which stores the actual column indices in the original matrix
of
the (
) elements of the nonzero blocks, and a pointer
array (row_blk(
)) whose entries point to the beginning of
each block row in val(:,:,:) and col_ind(:). The
savings in storage locations and reduction in indirect addressing for
BCRS over CRS can be significant for matrices with a large
.
If the matrix
is banded with bandwidth that is fairly constant
from row to row,
then it is worthwhile to take advantage of this
structure in the storage scheme by storing subdiagonals of the matrix
in consecutive locations. Not only can we eliminate the vector
identifying the column and row, we can pack the nonzero elements in such a
way as to make the matrix-vector product more efficient.
This storage scheme is particularly useful if the matrix arises from
a finite element or finite difference discretization on a tensor
product grid.
We say that the matrix A is banded if there are nonnegative constants p and q, called the left and right halfbandwidth, such that a_{i,j} != 0 only if i - p <= j <= i + q. In this case, we can allocate for the matrix A an array val(1:n,-p:q).
The declaration with reversed dimensions (-p:q,n) corresponds to the
LINPACK band format
[73], which unlike CDS,
does not allow for an
efficiently vectorizable matrix-vector multiplication if
is small.
Usually, band formats involve storing some zeros. The CDS
format may even contain some array elements that do not
correspond to matrix elements at all.
Consider the nonsymmetric matrix
defined by
Using the CDS format, we
store this matrix
in an array of dimension (6,-1:1) using
the mapping
Hence, the rows of the val(:,:) array are
.
Notice the two zeros corresponding to non-existing matrix elements.
A generalization of the CDS format more suitable for manipulating
general sparse matrices on vector supercomputers is discussed by
Melhem in
[154]. This variant of CDS uses a stripe data structure to store the matrix
. This structure is
more efficient in storage in the case of varying bandwidth, but it
makes the matrix-vector product slightly more expensive, as it
involves a gather operation.
As defined in
[154],
a stripe in the
matrix
is a set of positions
, where
and
is a strictly increasing function.
Specifically, if
and
are in
,
then
When computing the
matrix-vector product
using stripes, each
element of
in stripe
is multiplied
with both
and
and these products are
accumulated in
and
, respectively. For
the nonsymmetric matrix
defined by
the
stripes of the matrix
stored
in the rows of the val(:,:) array would be
.
The Jagged Diagonal Storage format can be useful for the
implementation of iterative methods on parallel and vector processors
(see Saad
[185]). Like the Compressed Diagonal
format, it gives a vector length essentially of the size of the
matrix. It is more space-efficient than CDS at the cost of a
gather/scatter operation.
A simplified form of JDS, called ITPACK storage or Purdue
storage, can be described as follows. In the matrix
from (
) all elements are shifted left:
after which the columns are stored consecutively. All rows are padded
with zeros on the right to give them equal length. Corresponding to
the array of matrix elements val(:,:),
an array of column indices, col_ind(:,:) is also stored:
It is clear that the padding zeros in this structure may be a
disadvantage, especially if the bandwidth of the matrix varies strongly.
Therefore, starting from the CRS format, we reorder the rows of the matrix in order of decreasing numbers of nonzeros per row. The compressed and permuted diagonals
are then stored in a linear array. The new data structure is called
jagged diagonals.
The number of jagged diagonals is equal to
the number of nonzeros in the first row, i.e., the largest
number of nonzeros in any row of
. The data structure to
represent the
matrix
therefore consists of a permutation
array (perm(1:n)) which reorders the rows, a floating-point array
(jdiag(:)) containing the jagged diagonals in succession,
an integer array (col_ind(:)) containing the corresponding column
indices, and finally a pointer array (jd_ptr(:)) whose elements
point to the beginning of each jagged diagonal. The advantages
of JDS for matrix multiplications are discussed by Saad in
[185].
The JDS format for the above matrix, using the linear arrays {perm, jdiag, col_ind, jd_ptr}, is given below (jagged diagonals are separated by semicolons).
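A C sketch of a matrix-vector product in this format follows; it assumes 0-based indexing, that perm[i] is the original index of the i-th row after sorting, and that jd_ptr carries one extra entry marking the end of the last jagged diagonal.

/* y = A x with A stored as njd jagged diagonals.  Each pass over a
 * jagged diagonal is a long loop with one gather (x[col_ind[..]])
 * per element, which vectorizes well. */
void jds_matvec(int n, int njd, const int *perm, const double *jdiag,
                const int *col_ind, const int *jd_ptr,
                const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int d = 0; d < njd; d++) {
        int len = jd_ptr[d+1] - jd_ptr[d];    /* rows reached by diagonal d */
        for (int i = 0; i < len; i++) {
            int k = jd_ptr[d] + i;
            y[perm[i]] += jdiag[k] * x[col_ind[k]];
        }
    }
}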
The final storage scheme we consider is for skyline matrices, which
are also called variable band or profile matrices (see Duff, Erisman
and Reid
[80]). It is mostly of importance in direct solution
methods, but it can be used for handling the diagonal blocks in block
matrix factorization methods. A major advantage of solving linear
systems having skyline coefficient matrices is that when pivoting is
not necessary, the skyline structure is preserved during Gaussian
elimination. If the matrix is symmetric, we only store its lower
triangular part. A straightforward approach in storing the elements
of a skyline matrix is to place all the rows (in order) into a
floating-point array (val(:)), and then keep an integer array
(row_ptr(:)) whose elements point to the beginning of each row.
The column indices of the nonzeros stored in val(:) are easily
derived and are not stored.
For a nonsymmetric skyline matrix such as the one illustrated in
Figure
, we store the lower
triangular elements in SKS format, and store the upper triangular
elements in a column-oriented SKS format (transpose stored in row-wise
SKS format). These two separated substructures can be linked in
a variety of ways. One approach, discussed by Saad
in
[186], is to store each row of the lower triangular
part and each column of the upper triangular part contiguously into
the floating-point array (val(:)). An additional pointer is
then needed to determine where the diagonal elements, which separate
the lower triangular elements from the upper triangular elements, are
located.
In many of the iterative methods discussed earlier, both the product
of a matrix and that of its transpose times a vector are needed, that
is, given an input vector x we want to compute both A x and A^T x. We will present
these algorithms for two of the storage formats
from §
: CRS and CDS.
The matrix-vector product y = A x using the CRS format can be expressed in the usual way:
   y_i = sum_j a_{i,j} x_j ,
since this traverses the rows of the matrix A. For an n-by-n matrix A, the matrix-vector multiplication is given by
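a row-oriented loop over the stored nonzeros. In C, with the val, col_ind and row_ptr arrays described earlier (0-based indexing), a sketch is:

/* y = A x for A in CRS format; only nonzero entries are touched. */
void crs_matvec(int n, const double *val, const int *col_ind,
                const int *row_ptr, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            sum += val[k] * x[col_ind[k]];   /* indirect access to x */
        y[i] = sum;
    }
}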
Since this method only multiplies nonzero matrix entries, the operation count is twice the number of nonzero elements in A, a significant savings over the 2 n^2 operations required in the dense case.
For the transpose product y = A^T x we cannot use the expression
   y_i = sum_j a_{j,i} x_j ,
since this implies traversing columns of the matrix, an extremely inefficient operation for matrices stored in CRS format. Hence, we switch indices and accumulate
   y_j = y_j + a_{i,j} x_i
over all nonzero a_{i,j}. The matrix-vector multiplication involving A^T is then given by
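a loop that scatters each row's contributions into the result vector. A C sketch under the same CRS storage assumptions is:

/* y = A^T x for A in CRS format: traverse the rows of A and scatter. */
void crs_matvec_transpose(int n, const double *val, const int *col_ind,
                          const int *row_ptr, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < n; i++)
        for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
            y[col_ind[k]] += val[k] * x[i];  /* scatter into y */
}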
Both matrix-vector products above have largely the same structure, and
both use indirect addressing. Hence, their vectorizability properties
are the same on any given computer. However, the first product
(
) has a more favorable memory access pattern in that (per
iteration of the outer loop) it reads two vectors of data (a row of
matrix
and the input vector
) and writes one scalar. The
transpose product (
) on the other hand reads one element of
the input vector, one row of matrix
, and both reads and writes the
result vector
. Unless the machine on which these methods are
implemented has three separate memory paths (e.g., Cray Y-MP), the
memory traffic will then limit the performance. This is an
important consideration for RISC-based architectures.
If the
matrix
is
stored in CDS format, it is still possible to
perform a matrix-vector product
by either rows or columns, but this
does not take advantage of the CDS format. The idea
is to make a change in coordinates in the doubly-nested loop. Replacing j -> i+j we get
   y_i = y_i + a_{i,i+j} x_{i+j} .
With the index i in the inner loop we see that the expression a_{i,i+j} accesses the jth diagonal of the matrix (where the main diagonal has number 0). The algorithm will now have a doubly-nested loop with the outer loop enumerating the diagonals diag = -p,...,q, with p and q the (nonnegative) numbers of diagonals to the left and right of the main diagonal. The bounds for the inner loop follow from the requirement that
   1 <= i, i+j <= n.
The algorithm becomes
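a loop over the stored diagonals with a long, recurrence-free inner loop. In C, assuming the diagonals are packed row by row into an array val of size n*(p+q+1), with explicit zeros in positions falling outside the matrix, a sketch is:

/* y = A x for A in CDS format; val[i*(p+q+1) + (d+p)] holds a(i,i+d)
 * for d = -p,...,q (0-based indices). */
void cds_matvec(int n, int p, int q, const double *val,
                const double *x, double *y)
{
    int w = p + q + 1;                   /* number of stored diagonals */
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int d = -p; d <= q; d++) {
        int lo = d < 0 ? -d : 0;         /* keep 0 <= i and i+d < n    */
        int hi = d < 0 ? n : n - d;
        for (int i = lo; i < hi; i++)    /* vectorizable over i        */
            y[i] += val[i*w + (d + p)] * x[i + d];
    }
}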
The transpose matrix-vector product y = A^T x is a minor variation of the algorithm above. Using the update formula
   y_i = y_i + a_{i+j,i} x_{i+j}
we obtain
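the same loop nest with each stored entry a(i,i+diag) now contributing to component i+diag of the result. A C sketch under the same storage assumptions is:

/* y = A^T x for A in CDS format. */
void cds_matvec_transpose(int n, int p, int q, const double *val,
                          const double *x, double *y)
{
    int w = p + q + 1;
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int d = -p; d <= q; d++) {
        int lo = d < 0 ? -d : 0;
        int hi = d < 0 ? n : n - d;
        for (int i = lo; i < hi; i++)
            y[i + d] += val[i*w + (d + p)] * x[i];
    }
}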
This book is also available in PostScript over the Internet. The URL for this book is http://www.netlib.org/templates/Templates.html.
A bibtex reference for this book follows:
@BOOK{templates,
AUTHOR = {R. Barrett and M. Berry and T. F. Chan and J. Demmel and
J. Donato and J. Dongarra and V. Eijkhout and R. Pozo and
C. Romine and H. Van der Vorst},
TITLE = {Templates for the Solution of Linear Systems:
Building Blocks for Iterative Methods},
PUBLISHER = {SIAM},
YEAR = {1994},
ADDRESS = {Philadelphia, PA} }
This book describes a set of application and systems software
research projects undertaken by the Caltech Concurrent
Computation Program (C^3P) from 1983-1990. This parallel
computing activity is organized so that applications with similar
algorithmic and software challenges are grouped together. Thus,
one can not only learn that parallel computing is effective on a
broad range of problems but also why it works, what algorithms
are needed, and what features the software should support. The
description of the software has been updated through 1993 to
reflect the current interests of Geoffrey Fox, now at Syracuse
University but still working with many C^3P collaborators
through the auspices of the NSF Center for Research in Parallel
Computation (CRPC).
Many C^3P members wrote sections of this book. John
Apostolakis wrote Section 7.4; Clive Baillie, Sections 4.3, 4.4, 7.2
and 12.6; Vas Bala, Section 13.2; Ted Barnes, Section 7.3; Roberto
Battitti, Sections 6.5, 6.7, 6.8 and 9.9; Rob Clayton, Section 18.2;
Dave Curkendall, Section 18.3; Hong Ding, Sections 6.3 and 6.4;
David Edelsohn, Section 12.8; Jon Flower, Sections 5.2, 5.3, 5.4
and 13.5; Tom Gottschalk, Sections 9.8 and 18.4; Gary Gutt,
Section 4.5; Wojtek Furmanski, Chapter 17; Mark Johnson,
Section 14.2; Jeff Koller, Sections 13.4 and 15.2; Aron
Kuppermann, Section 8.2; Paulette Liewer, Section 9.3; Vince
McKoy, Section 8.3; Paul Messina, Chapter 2; Steve Otto, Sections
6.6, 11.4, 12.7, 13.6 and 14.3; Jean Patterson, Section 9.4; Francois
Pepin, Section 12.5; Peter Reiher, Section 15.3; John Salmon,
Section 12.4; Tony Skjellum, Sections 9.5, 9.6 and Chapter 16;
Michael Speight, Section 7.6; Eric Van de Velde, Section 9.7;
David Walker, Sections 6.2 and 8.1; Brad Werner, Section 9.2;
Roy Williams, Sections 11.1, 12.2, 12.3 and Chapter 10. Geoffrey
Fox wrote the remaining text. Appendix B describes many of the key
C^3P contributors, with brief biographies.
C^3P's research depended on the support of many sponsors;
central support for major projects was given by the Department
of Energy and the Electronic Systems Division of the USAF.
Other federal sponsors were the Joint Tactical Fusion office,
NASA, NSF, and the National Security Agency. C^3P's start-up
was only possible due to two private donations from the Parsons
and System Development Foundations. Generous corporate
support came from ALCOA, Digital Equipment, General
Dynamics, General Motors, Hitachi, Hughes, IBM, INTEL,
Lockheed, McDonnell Douglas, MOTOROLA, Nippon Steel,
nCUBE, Sandia National Laboratories, and Shell.
Production of this book would have been impossible without
the dedicated help of Richard Alonso, Lisa Deyo, Keri Arnold, Blaise
Canzian and especially Terri Canzian.
This book describes the activities of the Caltech Concurrent Computation
Program (C^3P). This was a seven-year project (1983-1990), focussed
on the question, ``Can parallel computers be used effectively for large
scale scientific computation?'' The title of the book, ``Parallel
Computing Works,'' reveals our belief that we answered the question in
the affirmative, by implementing numerous scientific applications on
real parallel computers and doing computations that produced new
scientific results. In the process of doing so, C^3P helped design
and build several new computers, designed and implemented basic system
software, developed algorithms for frequently used mathematical
computations on massively parallel machines, devised performance models
and measured the performance of many computers, and created a
high-performance computing facility based exclusively on parallel computers.
While the initial focus of C^3P was the hypercube architecture
developed by C. Seitz at Caltech, many of the methods developed and
lessons learned have been applied successfully on other massively
parallel architectures.
Of course, C^3P was only one of many projects contributing to this
field and so the contents of this book are only representative of the
important activities in parallel computing during the last ten years.
However, we believe that the project did address a wide range of issues
and application areas. Thus, a book focussed on C^3P has some general
interest. We do, of course, cite other activities but surely not
completely. Other general references which the reader will find
valuable are [
Almasi:89a
], [
Andrews:91a
], [
Arbib:90a
],
[
Blelloch:90a
], [
Brawer:89a
], [
Doyle:91a
], [
Duncan:90a
],
[
Fox:88a
], [
Golub:89a
], [
Hayes:89a
], [
Hennessy:91a
],
[
Hillis:85a
], [
Hockney:81a
], [
Hord:90a
], [
Hwang:89a
],
[
IEEE:91a
], [
Laksh:90a
], [
Lazou:87a
], [Messina:87a;91d],
[
Schneck:87a
], [
Skerrett:92a
], [
Stone:91a
], [
Trew:91a
],
[
Zima:91a
].
C^3P was both a technical and social experiment. It involved a wide
range of disciplines working together to understand the hardware,
software, and algorithmic (applications) issues in parallel computing.
Such multidisciplinary
activities are
generally considered of growing relevance to many new academic and
research activities-including the federal high-performance computing
and communication initiative. Many of the participants of C^3P are
no longer at Caltech, and this has positive and negative messages.
C^3P was not set up in a traditional academic fashion since its core
interdisciplinary field, computational science, is not well understood
or implemented either nationally or in specific universities. This is
explored further in Chapter 20. C^3P has led to flourishing
follow-on projects at Caltech, Syracuse University, and elsewhere.
These differ from C^3P just as parallel computing has changed from an
exploratory field to one that is in a transitional stage into
development, production, and exploitation.
The technological driving force behind parallel computing is
VLSI,
or very large scale integration-the same technology
that created the personal computer and workstation market over the last
decade. In 1980, the Intel 8086 used 50,000 transistors; in 1992,
the latest Digital alpha RISC chip contains 1,680,000 transistors-a
factor of 30 increase. The dramatic improvement in chip density comes
together with an increase in clock speed and improved design so that
the alpha performs better by a factor of over one thousand on
scientific problems than the 8086-8087 chip pair of the early 1980s.
The increasing density of transistors on a chip follows directly from a
decreasing feature size which is now
for the alpha. Feature
size will continue to decrease and by the year 2000, chips with
50 million transistors are expected to be available. What can we do
with all these transistors?
With around a million transistors on a chip, designers were able to move
full mainframe functionality to about
of a chip. This
enabled the personal computing and workstation revolutions. The next
factors of ten increase in transistor density must go into some form of
parallelism by replicating several CPUs on a single chip.
By the year 2000, parallelism is thus inevitable to all computers, from
your children's video game to personal computers, workstations, and
supercomputers. Today we see it in the larger machines as we replicate many
chips and printed circuit boards to build systems as arrays of
nodes, each unit of which is some variant of the microprocessor. This
is illustrated in Figure 1.1 (Color Plate), which shows an
nCUBE
parallel supercomputer with 64 identical nodes on
each board-each node is a single-chip CPU with additional memory
chips. To be useful, these nodes must be linked in some way and this
is still a matter of much research and experimentation. Further, we
can argue as to the most appropriate node to replicate; is it a
``small'' node as in the nCUBE of Figure 1.1 (Color Plate), or more
powerful ``fat'' nodes such as those offered in CM-5 and Intel
Touchstone shown in Figures 1.2 and 1.3 (Color Plates) where each node
is a sophisticated multichip printed circuit board? However, these
details should not obscure the basic point: Parallelism allows one to
build the world's fastest and most cost-effective supercomputers.
Parallelism may only be critical today for supercomputer vendors and
users. By the year 2000, all computers will have to address the
hardware, algorithmic, and software issues implied by parallelism. The
reward will be amazing performance and the opening up of new fields;
the price will be a major rethinking and reimplementation of software,
algorithms, and applications.
This vision and its consequent issues are now well understood and
generally agreed. They provided the motivation in 1981 when
C³P's first roots were formed. In those days, the vision was blurred
and controversial. Many believed that parallel computing would not
work.
President Bush instituted, in 1992, the five-year federal High
Performance Computing and Communications (HPCC) Program. This will spur the
development of the technology described above and is focussed on the
solution of grand challenges shown in Figure 1.4 (Color Plate). These
are fundamental problems in science and engineering, with broad economic
and scientific impact, whose solution could be advanced by applying
high-performance computing techniques and resources.
The activities of several federal agencies have been coordinated in this program. The Advanced Research Projects Agency (ARPA) is developing the basic technologies, which are applied to the grand challenges by the Department of Energy (DOE), the National Aeronautics and Space Administration (NASA), the National Science Foundation (NSF), the National Institutes of Health (NIH), the Environmental Protection Agency (EPA), and the National Oceanic and Atmospheric Administration (NOAA). Selected activities include the mapping of the human genome in DOE, climate modelling in DOE and NOAA, and coupled structural and airflow simulations of advanced powered-lift aircraft and a high-speed civil transport by NASA.
More generally, it is clear that parallel computing can only realize its
full potential and be commercially successful if it is accepted in the
real world of industry and government applications. The clear U.S. leadership over Europe and Japan in high-performance computing offers the rest of U.S. industry the opportunity to gain a global competitive advantage.
Some of these industrial opportunities are discussed in Chapter 19. Here we note some interesting possibilities.
C³P did not address such large-scale problems. Rather, we concentrated on major academic applications. This fit the experience of the Caltech faculty who led most of the C³P teams, and, further, academic applications are smaller and cleaner than large-scale industrial problems. One important large-scale C³P application was a military simulation described in Chapter 18 and produced by Caltech's Jet Propulsion Laboratory. C³P chose the correct and only computations on which to cut its parallel computing teeth. In spite of the focus on different applications, there are many similarities between the vision and structure of C³P and today's national effort. It may even be that today's grand challenge teams can learn from C³P's experience.
C³P's origins dated to an early collaboration between the physics and computer science departments at Caltech in bringing up UNIX on the physics department's VAX 11/780. As an aside, we note this was motivated by the development of the Symbolic Manipulation Program (SMP) by Wolfram and collaborators; this project has now grown into the well-known system Mathematica. Carver Mead from computer science urged physics to get back to them if we had insoluble large-scale computational needs. This comment was reinforced in May, 1981 when Mead gave a physics colloquium on VLSI, Very Large Scale Integration, and the opportunities it opened up. Fox, in the audience, realized that quantum chromodynamics (QCD, Section 4.3), now using up all free cycles on the VAX 11/780, was naturally parallelizable and could take advantage of the parallel machines promised by VLSI. Thus, a seemingly modest interdisciplinary interaction-a computer scientist lecturing to physicists-gave birth to a large interdisciplinary project, C³P.
Further, our interest in QCD stemmed from the installation of the VAX 11/780 to replace our previous batch computing using a remote CDC 7600. This more attractive computing environment, UNIX on a VAX 11/780, encouraged theoretical physics graduate students to explore computation.
During the summer of 1981, Fox's research group, especially Eugene Brooks and Steve Otto, showed that effective concurrent algorithms could be developed, and we presented our conclusion to the Caltech computer scientists. This presentation led to the plans, described in a national context in Chapter 2, to produce the first hypercube, with Chuck Seitz and his student Erik DeBenedictis developing the hardware and Fox's group the QCD applications and systems software. The physics group did not understand what a hypercube was at that stage, but agreed with the computer scientists because the planned six-dimensional hypercube was isomorphic to a three-dimensional mesh, a topology whose relevance a physicist could appreciate. With the generous help of the computer scientists, we gradually came to understand the hypercube topology with its general advantage (the maximum distance between nodes is $\log_2 N$) and its specific feature of including a rich variety of mesh topologies. Here N is the total number of nodes in the concurrent computer. We should emphasize that this understanding of the relevance of concurrency to QCD was not particularly novel; it followed from ideas already known from earlier concurrent machines such as the Illiac IV. We were, however, fortunate to investigate the issues at a time when microprocessor technology (in particular the Intel 8086/8087) allowed one to build large (in terms of number of nodes) cost-effective concurrent computers with interesting performance levels. The QCD problem was also important in helping ensure that the initial Cosmic Cube was built with sensible design choices; we were fortunate that in choosing parameters, such as memory size, appropriate for QCD, we also realized a machine of general capability.
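The hypercube property quoted above is easy to check directly. In a d-dimensional hypercube the $N = 2^d$ nodes can be labelled by d-bit integers; two nodes are connected when their labels differ in exactly one bit, and the number of hops between two nodes is the Hamming distance of their labels, which never exceeds $d = \log_2 N$. The following short C program (an illustration added here, not part of the C³P software) verifies this for the six-dimensional, 64-node case:
#include <stdio.h>

/* Hamming distance between hypercube node labels = number of hops */
static int hamming(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int d = 0;
    while (x) { d += x & 1u; x >>= 1; }
    return d;
}

int main(void)
{
    const int dim = 6;                 /* six-dimensional hypercube */
    const unsigned nnodes = 1u << dim; /* 64 nodes                  */
    unsigned i, j;
    int maxdist = 0;

    for (i = 0; i < nnodes; i++)
        for (j = 0; j < nnodes; j++)
            if (hamming(i, j) > maxdist) maxdist = hamming(i, j);

    printf("nodes = %u, maximum distance = %d (= log2 N)\n", nnodes, maxdist);
    return 0;
}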
While the 64-node Cosmic Cube was under construction, Fox wandered around Caltech and the associated Jet Propulsion Laboratory explaining parallel computing and, in particular, the Cosmic Cube to scientists in various disciplines who were using ``large'' (by the standards of 1981) scale computers. To his surprise, all the problems being tackled on conventional machines by these scientists seemed to be implementable on the Cosmic Cube. This was the origin of C³P, which identified the Caltech-JPL applications team, whose initial participants are noted in Table 4.2. Fox, Seitz, and these scientists prepared the initial proposals which established and funded C³P in the summer of 1983. Major support was obtained from the Department of Energy and the Parsons and System Development Foundation. Intel made key initial contributions of chips to the Cosmic Cube and follow-on machines. The Department of Energy remained the central funding support for C³P throughout its seven years, 1983 to 1990.
The initial C³P proposals were focussed on the question:
Our approach was simple: Build or acquire interesting hardware and provide the intellectual and physical infrastructure to allow leading application scientists and engineers to both develop parallel algorithms and codes, and use them to address important problems. Often we liked to say that C³P
Our project showed initial success, with the approximately ten applications of Table 4.2 developed in the first year. We both showed good performance on the hypercube and developed a performance model which is elaborated in Chapter 3. A major activity at this time was the design and development of the necessary support software, termed CrOS and later developed into the commercial software Express described in Chapter 5.
Not only was the initial hardware applicable to a wide range of problems, but our software model proved surprisingly useful. CrOS was originally designed by Brooks as the ``mailbox communication system'' and we initially targeted the regular synchronous problems typified in Chapter 4. Only later did we realize that it supported quite efficiently the irregular and non-local communication needs of more general problems. This generalization is represented as an evolutionary track of Express in Chapter 5, and by a new communication system, Zipcode, described in Section 16.1 and developed from scratch for general asynchronous irregular problems.
Although successful, we found many challenges and intriguing questions opened up by C³P's initial investigation into parallel computing. Further, around 1985, the DOE and later the NSF made substantial conventional supercomputer (Cray, Cyber, ETA) time available to applications scientists. Our Cosmic Cube and the follow-on Mark II machines were very competitive with the VAX 11/780, but not with the performance offered by the CRAY X-MP. Thus, internal curiosity and external pressures moved C³P in the direction of computer science: still developing real software but addressing new parallel algorithms and load-balancing techniques rather than a production application. This phase of C³P is summarized in [Angus:90a], [Fox:88a;88b].
Around 1988, we returned to our original goal with a focus on parallel supercomputers. We no longer concentrated on the hypercube, but rather asked such questions as [Fox:88v],
and as a crucial (and harder) question:
We evolved from the initial 8086/8087, 80286/80287 machines to the internal JPL Mark IIIfp and commercial nCUBE-1 and CM-2 described in Chapter 2. These were still ``rough, difficult to use machines'' but had performance capabilities competitive with conventional supercomputers.
This book largely describes work in the last three years of C³P when we developed a series of large-scale applications on these parallel supercomputers. Further, as described in Chapters 15, 16, and 17, we developed prototypes and ideas for higher level software environments which could accelerate and ease the production of parallel software. This period also saw an explosion of interest in the use of parallel computing outside Caltech. Much of this research used commercial hypercubes which were partly motivated by our initial discoveries and successes on the in-house machines at Caltech. This rapid technology transfer was in one sense gratifying, but it also put pressure on C³P, which was better placed to blaze new trails than to perform the more programmatic research which was now appropriate.
An important and unexpected discovery in C³P was in the education and the academic support for interdisciplinary research. Many of the researchers, especially graduate students in C³P, evolved to be ``computational scientists'': not traditional physicists, chemists, or computer scientists, but rather something in between. We believe that this interdisciplinary education and expertise was critical for C³P's success and, as discussed in Chapter 20, it should be encouraged in more universities [Fox:91f;92d].
Further information about C³P can be found in our annual reports and two reviews [Fox:84j;84k;85c;85e;85i;86f;87c;87d;88v;88oo;89i;89n;89cc;89dd;90o].
C³P's research showed that parallel computing works.
In Chapter 2, we provide the national overview of parallel computing activities during the last decade. Chapter 3 is somewhat speculative as it attempts to provide a framework to quantify the previous PCW statement. We will show that, more precisely, parallel computing only works in a ``scaling'' fashion in a special class of problems which we call synchronous and loosely synchronous. By scaling, we mean that the parallel implementation will efficiently extend to systems with large numbers of nodes without levelling off of the speedup obtained. These concepts are quantified in Chapter 3 with a simple performance model described in detail in [Fox:88a].
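As a rough guide (a sketch in the spirit of the model elaborated in Chapter 3 and [Fox:88a]; the symbols here are chosen for illustration rather than taken from that reference), the speedup on N nodes can be written
\[
S(N) = \frac{N}{1 + f_C}, \qquad
f_C \;\approx\; c\,\frac{t_{\mathrm{comm}}}{t_{\mathrm{calc}}}\,\frac{1}{n^{1/d}},
\]
where $f_C$ is the fractional overhead due to communication, $t_{\mathrm{comm}}/t_{\mathrm{calc}}$ is the machine's ratio of communication to calculation time, n is the grain size (the amount of the problem stored on each node), d is an effective dimension of the problem, and c is a geometry-dependent constant. If the grain size n is held fixed as N grows, $f_C$ stays bounded and the speedup grows in proportion to N; this is the ``scaling'' behaviour referred to above.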
The book is organized with applications and software issues growing in complexity in later chapters. Chapter 4 describes the cleanest regular synchronous applications which included many of our initial successes. However, we already see the essential points:
Chapters 6 through 9 confirm these lessons with an extension to more irregular problems. Loosely synchronous problem classes are harder to parallelize, but still use the basic principles DD and MP. Chapter 7 describes a special class, embarrassingly parallel, of applications where scaling parallelism is guaranteed by the independence of separate components of the problem.
Chapters 10 and 11 describe parallel computing tools developed within C³P. DIME supports parallel mesh generation and adaptation, and its use in general finite element codes. Initially, we thought load balancing would be a major stumbling block for parallel computing because formally it is an NP-complete (intractable) optimization problem. However, effective heuristic methods were developed which avoid the exponential time complexity of NP-complete problems by searching for good but not exact minima.
In Chapter 12, we describe the most complex irregular loosely synchronous problems which include some of the hardest problems tackled in C³P.
As described earlier, we implemented essentially all the applications described in the book using explicit user-generated message passing. In Chapter 13, we describe our initial efforts to produce a higher level data-parallel Fortran environment, which should be able to provide a more attractive software environment for the user. High Performance Fortran has been adopted as an informal industry standard for this language.
In Chapter 14, we describe the very difficult asynchronous problem class for which scaling parallel algorithms and the correct software model are less clear. Chapters 15, 16, and 17 describe four software models, Zipcode, MOOSE, Time Warp, and MOVIE, which tackle asynchronous problems and the mixture of asynchronous and loosely synchronous problems one finds in the complex system simulations and analysis typical of many real-world problems. Applications of this class are described in Chapter 18, with the application of Section 18.3 being an event-driven simulation-an important class of difficult-to-parallelize asynchronous applications.
In Chapter 19 we look to the future and describe some possibilities for the use of parallel computers in industry. Here we note that C³P, and much of the national enterprise, concentrated on scientific and engineering computations. The examples and ``proof'' that parallel computing works are focussed in this book on such problems. However, this will not be the dominant industrial use of parallel computers, where information processing is most important. This will be used for decision support in the military and large corporations, and to supply video, information, and simulation ``on demand'' for homes, schools, and other institutions. Such applications have recently been termed national challenges to distinguish them from the large-scale grand challenges, which underpinned the initial HPCC initiative [FCCSET:94a]. The lessons C³P and others have learnt from scientific computations will have general applicability across the wide range of industrial problems.
Chapter 20 includes a discussion of education in computational science-an unexpected byproduct of C³P-and other retrospective remarks. The appendices list the C³P reports including those not cited directly in this book. Some information is available electronically by mailing citlib@caltech.edu.
Further Details: How to Measure Errors
Table 4.3: Bounding One Vector Norm in Terms of Another
Table 4.4: Bounding One Matrix Norm in Terms of Another
Further Details: How Error Bounds Are Derived
Standard Error Analysis
Improved Error Bounds
Error Bounds for Linear Equation Solving
EPSMCH = SLAMCH( 'E' )
* Get infinity-norm of A
ANORM = SLANGE( 'I', N, N, A, LDA, WORK )
* Solve system; The solution X overwrites B
CALL SGESV( N, 1, A, LDA, IPIV, B, LDB, INFO )
IF( INFO.GT.0 ) THEN
PRINT *,'Singular Matrix'
ELSE IF (N .GT. 0) THEN
* Get reciprocal condition number RCOND of A
CALL SGECON( 'I', N, A, LDA, ANORM, RCOND,
$ WORK, IWORK, INFO )
RCOND = MAX( RCOND, EPSMCH )
ERRBD = EPSMCH / RCOND
END IF
CALL SGESVX( 'E', 'N', N, 1, A, LDA, AF, LDAF, IPIV,
$ EQUED, R, C, B, LDB, X, LDX, RCOND, FERR, BERR,
$ WORK, IWORK, INFO )
IF( INFO.GT.0 ) PRINT *,'(Nearly) Singular Matrix'
Problems that LAPACK can Solve
Further Details: Error Bounds for Linear Equation Solving
The normwise backward error of the computed solution $\hat{x}$, with respect to the infinity norm, is the pair (E, f) which minimizes $\max\left(\|E\|_\infty/\|A\|_\infty,\ \|f\|_\infty/\|b\|_\infty\right)$ subject to the constraint $(A+E)\hat{x} = b + f$.
The componentwise backward error of the computed solution $\hat{x}$ is the pair (E, f) which minimizes $\max_{i,j,k}\left(|E_{ij}|/|A_{ij}|,\ |f_k|/|b_k|\right)$ subject to the constraint $(A+E)\hat{x} = b + f$.
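As a brief reminder of how a small backward error yields a forward error bound (stated here only in outline; the precise constants appear in the detailed discussion for this section), a backward error of order the machine precision $\epsilon$ gives approximately
\[
\frac{\|\hat{x} - x\|_\infty}{\|x\|_\infty} \ \lesssim\ \kappa_\infty(A)\,\epsilon,
\qquad
\kappa_\infty(A) = \|A\|_\infty\,\|A^{-1}\|_\infty \approx \frac{1}{\mathrm{RCOND}},
\]
which is the quantity ERRBD = EPSMCH / RCOND formed after the call to SGECON in the earlier code fragment.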
Error Bounds for Linear Least Squares Problems
EPSMCH = SLAMCH( 'E' )
* Get the 2-norm of the right hand side B
BNORM = SNRM2( M, B, 1 )
* Solve the least squares problem; the solution X
* overwrites B
CALL SGELS( 'N', M, N, 1, A, LDA, B, LDB, WORK,
$ LWORK, INFO )
IF ( MIN(M,N) .GT. 0 ) THEN
* Get the 2-norm of the residual A*X-B
RNORM = SNRM2( M-N, B( N+1 ), 1 )
* Get the reciprocal condition number RCOND of A
CALL STRCON('I', 'U', 'N', N, A, LDA, RCOND,
$ WORK, IWORK, INFO)
RCOND = MAX( RCOND, EPSMCH )
IF ( BNORM .GT. 0.0 ) THEN
SINT = RNORM / BNORM
ELSE
SINT = 0.0
ENDIF
COST = MAX( SQRT( (1.0E0 - SINT)*(1.0E0 + SINT) ),
$ EPSMCH )
TANT = SINT / COST
ERRBD = EPSMCH*( 2.0E0/(RCOND*COST) +
$ TANT / RCOND**2 )
ENDIF
EPSMCH = SLAMCH( 'E' )
* Get the 2-norm of the right hand side B
BNORM = SNRM2( M, B, 1 )
* Solve the least squares problem; the solution X
* overwrites B
RCND = 0
CALL SGELSX( M, N, 1, A, LDA, B, LDB, JPVT, RCND,
$ RANK, WORK, INFO )
IF ( RANK.LT.N ) THEN
PRINT *,'Matrix less than full rank'
ELSE IF ( MIN( M,N ) .GT. 0 ) THEN
* Get the 2-norm of the residual A*X-B
RNORM = SNRM2( M-N, B( N+1 ), 1 )
* Get the reciprocal condition number RCOND of A
CALL STRCON('I', 'U', 'N', N, A, LDA, RCOND,
$ WORK, IWORK, INFO)
RCOND = MAX( RCOND, EPSMCH )
IF ( BNORM .GT. 0.0 ) THEN
SINT = RNORM / BNORM
ELSE
SINT = 0.0
ENDIF
COST = MAX( SQRT( (1.0E0 - SINT)*(1.0E0 + SINT) ),
$ EPSMCH )
TANT = SINT / COST
ERRBD = EPSMCH*( 2.0E0/(RCOND*COST) +
$ TANT / RCOND**2 )
END IF
The numerical results of this code fragment on the above A and b are
the same as for the first code fragment.
CALL SGELSS( M, N, 1, A, LDA, B, LDB, S, RCND, RANK,
$ WORK, LWORK, INFO )
RCOND = S( N ) / S( 1 )
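As a remark added here on the last line above: SGELSS returns the singular values of A in decreasing order in S, so when A has full rank n the statement RCOND = S( N ) / S( 1 ) forms the reciprocal of the 2-norm condition number,
\[
\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} = \frac{S(1)}{S(N)},
\qquad
\mathrm{RCOND} = \frac{1}{\kappa_2(A)} .
\]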
Further Details: Error Bounds for Linear Least Squares Problems
The computed solution $\hat{x}$ has a small normwise backward error. In other words, $\hat{x}$ minimizes $\|(A+E)\hat{x} - (b+f)\|_2$, where E and f satisfy $\max\left(\|E\|_2/\|A\|_2,\ \|f\|_2/\|b\|_2\right) \leq p(n)\,\epsilon$, with p(n) a modestly growing function of n.
Error Bounds for Generalized Least Squares Problems
Error Bounds for the Symmetric Eigenproblem
EPSMCH = SLAMCH( 'E' )
* Compute eigenvalues and eigenvectors of A
* The eigenvalues are returned in W
* The eigenvector matrix Z overwrites A
CALL SSYEV( 'V', UPLO, N, A, LDA, W, WORK, LWORK, INFO )
IF( INFO.GT.0 ) THEN
PRINT *,'SSYEV did not converge'
ELSE IF ( N.GT.0 ) THEN
* Compute the norm of A
ANORM = MAX( ABS( W(1) ), ABS( W(N) ) )
EERRBD = EPSMCH * ANORM
* Compute reciprocal condition numbers for eigenvectors
CALL SDISNA( 'Eigenvectors', N, N, W, RCONDZ, INFO )
DO 10 I = 1, N
ZERRBD( I ) = EPSMCH * ( ANORM / RCONDZ( I ) )
10 CONTINUE
ENDIF
Further Details: Error Bounds for the Symmetric Eigenproblem
The computed eigendecomposition $\hat{Z} \hat{\Lambda} \hat{Z}^T$ is nearly the exact eigendecomposition of A + E, i.e., $A + E = (\hat{Z}+\delta\hat{Z})\,\hat{\Lambda}\,(\hat{Z}+\delta\hat{Z})^T$ is a true eigendecomposition so that $\hat{Z}+\delta\hat{Z}$ is orthogonal, where $\|E\|_2/\|A\|_2 \leq p(n)\,\epsilon$ and $\|\delta\hat{Z}\|_2 \leq p(n)\,\epsilon$. Here p(n) is a modestly growing function of n. We take p(n) = 1 in the above code fragment.
Each computed eigenvalue $\hat\lambda_i$ differs from a true eigenvalue $\lambda_i$ by at most $|\hat\lambda_i - \lambda_i| \leq p(n)\,\epsilon\,\|A\|_2$.
The eigenvalues of T may be computed with small componentwise relative backward error ($O(\epsilon)$) by using subroutine xSTEBZ (subsection 2.3.4) or driver xSTEVX (subsection 2.2.4). If T is also positive definite, they may also be computed at least as accurately by xPTEQR (subsection 2.3.4).
To compute error bounds for the computed eigenvalues we must make some assumptions about T. The bounds discussed here are from [13]. Suppose T is positive definite, and write T = DHD where $D = \mathrm{diag}(t_{11}^{1/2}, \ldots, t_{nn}^{1/2})$ and H has unit diagonal. Then the computed eigenvalues $\hat\lambda_i$ can differ from the true eigenvalues $\lambda_i$ by $|\hat\lambda_i - \lambda_i| \leq p(n)\,\epsilon\,\kappa_2(H)\,\lambda_i$.
Error Bounds for the Nonsymmetric Eigenproblem
EPSMCH = SLAMCH( 'E' )
* Compute the eigenvalues and eigenvectors of A
* WR contains the real parts of the eigenvalues
* WI contains the imaginary parts of the eigenvalues
* VL contains the left eigenvectors
* VR contains the right eigenvectors
CALL SGEEVX( 'P', 'V', 'V', 'B', N, A, LDA, WR, WI,
$ VL, LDVL, VR, LDVR, ILO, IHI, SCALE, ABNRM,
$ RCONDE, RCONDV, WORK, LWORK, IWORK, INFO )
IF( INFO.GT.0 ) THEN
PRINT *,'SGEEVX did not converge'
ELSE IF ( N.GT.0 ) THEN
DO 10 I = 1, N
EERRBD(I) = EPSMCH*ABNRM/RCONDE(I)
VERRBD(I) = EPSMCH*ABNRM/RCONDV(I)
10 CONTINUE
ENDIF
Further Details: Error Bounds for the Nonsymmetric Eigenproblem
Overview
Table 4.5: Asymptotic error bounds for the nonsymmetric eigenproblem
Table 4.6: Global error bounds for the nonsymmetric eigenproblem
Figure 4.1: Bounding eigenvalues inside overlapping disks
Balancing and Conditioning
Computers for which LAPACK is Suitable
Computing s and sep
Error Bounds for the Singular Value Decomposition
EPSMCH = SLAMCH( 'E' )
* Compute singular value decomposition of A
* The singular values are returned in S
* The left singular vectors are returned in U
* The transposed right singular vectors are returned in VT
CALL SGESVD( 'S', 'S', M, N, A, LDA, S, U, LDU, VT, LDVT,
$ WORK, LWORK, INFO )
IF( INFO.GT.0 ) THEN
PRINT *,'SGESVD did not converge'
ELSE IF ( MIN(M,N) .GT. 0 ) THEN
SERRBD = EPSMCH * S(1)
* Compute reciprocal condition numbers for singular
* vectors
CALL SDISNA( 'Left', M, N, S, RCONDU, INFO )
CALL SDISNA( 'Right', M, N, S, RCONDV, INFO )
DO 10 I = 1, MIN(M,N)
VERRBD( I ) = EPSMCH*( S(1)/RCONDV( I ) )
UERRBD( I ) = EPSMCH*( S(1)/RCONDU( I ) )
10 CONTINUE
END IF
Further Details: Error Bounds for the Singular Value Decomposition
The SVD algorithm is backward stable. This means that the computed SVD, $\hat{U} \hat{\Sigma} \hat{V}^T$, is nearly the exact SVD of A + E where $\|E\|_2/\|A\|_2 \leq p(m,n)\,\epsilon$, and p(m,n) is a modestly growing function of m and n. This means $A + E = (\hat{U}+\delta\hat{U})\,\hat{\Sigma}\,(\hat{V}+\delta\hat{V})^T$ is the true SVD, so that $\hat{U}+\delta\hat{U}$ and $\hat{V}+\delta\hat{V}$ are both orthogonal, where $\|\delta\hat{U}\|_2 \leq p(m,n)\,\epsilon$ and $\|\delta\hat{V}\|_2 \leq p(m,n)\,\epsilon$.
Each computed singular value $\hat\sigma_i$ differs from a true singular value $\sigma_i$ by at most $|\hat\sigma_i - \sigma_i| \leq p(m,n)\,\epsilon\,\sigma_1$.
Each computed singular value of a bidiagonal matrix is accurate to nearly full relative accuracy, no matter how tiny it is: $|\hat\sigma_i - \sigma_i| \leq p(m,n)\,\epsilon\,\sigma_i$.
Error Bounds for the Generalized Symmetric Definite Eigenproblem
EPSMCH = SLAMCH( 'E' )
* Solve the eigenproblem A - lambda B (ITYPE = 1)
ITYPE = 1
* Compute the norms of A and B
ANORM = SLANSY( '1', UPLO, N, A, LDA, WORK )
BNORM = SLANSY( '1', UPLO, N, B, LDB, WORK )
* The eigenvalues are returned in W
* The eigenvectors are returned in A
CALL SSYGV( ITYPE, 'V', UPLO, N, A, LDA, B, LDB, W,
$ WORK, LWORK, INFO )
IF( INFO.GT.0 .AND. INFO.LE.N ) THEN
PRINT *,'SSYGV did not converge'
ELSE IF( INFO.GT.N ) THEN
PRINT *,'B not positive definite'
ELSE IF ( N.GT.0 ) THEN
* Get reciprocal condition number RCONDB of Cholesky
* factor of B
CALL STRCON( '1', UPLO, 'N', N, B, LDB, RCONDB,
$ WORK, IWORK, INFO )
RCONDB = MAX( RCONDB, EPSMCH )
CALL SDISNA( 'Eigenvectors', N, N, W, RCONDZ, INFO )
DO 10 I = 1, N
EERRBD( I ) = ( EPSMCH / RCONDB**2 ) *
$ ( ANORM / BNORM + ABS( W(I) ) )
ZERRBD( I ) = ( EPSMCH / RCONDB**3 ) *
$ ( ( ANORM / BNORM ) / RCONDZ(I) +
$ ( ABS( W(I) ) / RCONDZ(I) ) * RCONDB )
10 CONTINUE
END IF
EPSMCH = SLAMCH( 'E' )
* Solve the eigenproblem A*B - lambda I (ITYPE = 2)
ITYPE = 2
* Compute the norms of A and B
ANORM = SLANSY( '1', UPLO, N, A, LDA, WORK )
BNORM = SLANSY( '1', UPLO, N, B, LDB, WORK )
* The eigenvalues are returned in W
* The eigenvectors are returned in A
CALL SSYGV( ITYPE, 'V', UPLO, N, A, LDA, B, LDB, W,
$ WORK, LWORK, INFO )
IF( INFO.GT.0 .AND. INFO.LE.N ) THEN
PRINT *,'SSYGV did not converge'
ELSE IF( INFO.GT.N ) THEN
PRINT *,'B not positive definite'
ELSE IF ( N.GT.0 ) THEN
* Get reciprocal condition number RCONDB of Cholesky
* factor of B
CALL STRCON( '1', UPLO, 'N', N, B, LDB, RCONDB,
$ WORK, IWORK, INFO )
RCONDB = MAX( RCONDB, EPSMCH )
CALL SDISNA( 'Eigenvectors', N, N, W, RCONDZ, INFO )
DO 10 I = 1, N
EERRBD(I) = ( ANORM * BNORM ) * EPSMCH +
$ ( EPSMCH / RCONDB**2 ) * ABS( W(I) )
ZERRBD(I) = ( EPSMCH / RCONDB ) * ( ( ANORM *
$ BNORM ) / RCONDZ(I) + 1.0 / RCONDB )
10 CONTINUE
END IF
Further Details: Error Bounds for the Generalized Symmetric Definite Eigenproblem
Suppose a computed eigenvalue $\hat\lambda_i$ of $A - \lambda B$ is the exact eigenvalue of a perturbed problem $(A+E) - \lambda (B+F)$. Let $x_i$ be the unit eigenvector ($\|x_i\|_2 = 1$) for the exact eigenvalue $\lambda_i$. Then if $\|E\|$ is small compared to $\|A\|$, and if $\|F\|$ is small compared to $\|B\|$, we have
See sections 2.2.5.3 and 2.3.9 for a discussion of the generalized singular value decomposition, and section 4.12 for a discussion of the relevant error bound. This approach can give a tighter error bound than the above bounds when B is ill-conditioned but A + B is well-conditioned.
Error Bounds for the Generalized Nonsymmetric Eigenproblem
Error Bounds for the Generalized Singular Value Decomposition
EPSMCH = SLAMCH( 'E' )
* Compute generalized singular values of A and B
CALL SGGSVD( 'N', 'N', 'N', M, N, P, K, L, A, LDA, B,
$ LDB, ALPHA, BETA, U, LDU, V, LDV, Q, LDQ,
$ WORK, IWORK, INFO )
* Compute rank of [A',B']'
RANK = K+L
IF( INFO.GT.0 ) THEN
PRINT *,'SGGSVD did not converge'
ELSE IF( RANK.LT.N ) THEN
PRINT *,'[A**T,B**T]**T not full rank'
ELSE IF ( M .GE. N .AND. N .GT. 0 ) THEN
* Compute reciprocal condition number RCOND of R
CALL STRCON( 'I', 'U', 'N', N, A, LDA, RCOND, WORK,
$ IWORK, INFO )
RCOND = MAX( RCOND, EPSMCH )
SERRBD = EPSMCH / RCOND
END IF
Further Details: Error Bounds for the Generalized Singular Value Decomposition
Let the computed GSVD of A and B be $\hat{U}\,\hat{\Sigma}_1\,[0,\ \hat{R}]\,\hat{Q}^T$ and $\hat{V}\,\hat{\Sigma}_2\,[0,\ \hat{R}]\,\hat{Q}^T$. This is nearly the exact GSVD of A + E and B + F in the following sense. E and F are small:
Error Bounds for Fast Level 3 BLAS
Documentation and Software Conventions
Band Storage
Tridiagonal and Bidiagonal Matrices
Generalized RQ factorization
Series Foreword
Document Notation
Gather, Vector Variant
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, ROOT, COMM, IERROR
Examples Using MPI_GATHERV
Figure: The root process gathers 100 ints from each process in the group, and each set is placed stride ints apart.
Figure: The root process gathers column 0 of a 100 × 150 C array, and each set is placed stride ints apart.
Figure: The root process gathers 100-i ints from column i of a 100 × 150 C array, and each set is placed stride ints apart.
Figure: The root process gathers 100-i ints from column i of a 100 × 150 C array, and each set is placed stride[i] ints apart (a varying stride).
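As a concrete illustration of the first of these figures (a sketch written for this text along the lines of the examples in this section, not a verbatim excerpt), each process contributes 100 ints and the root places successive contributions stride ints apart by setting the displacement array accordingly:
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i, stride = 150;   /* assumes stride >= 100 */
    int sendbuf[100], *recvbuf, *recvcounts, *displs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < 100; i++) sendbuf[i] = rank;   /* data to contribute */

    recvbuf    = (int *) malloc(size * stride * sizeof(int));
    recvcounts = (int *) malloc(size * sizeof(int));
    displs     = (int *) malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {
        recvcounts[i] = 100;        /* 100 ints from each process */
        displs[i]     = i * stride; /* placed stride ints apart   */
    }

    MPI_Gatherv(sendbuf, 100, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT,
                0, MPI_COMM_WORLD);

    free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}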
Scatter
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
An Example Using MPI_SCATTER
Figure: The root process scatters sets of 100 ints to each process
in the group.
Scatter: Vector Variant
INTEGER SENDCOUNTS(*), DISPLS(*), SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
Examples Using MPI_SCATTERV
Figure: The root process scatters sets of 100 ints, moving by stride ints from send to send in the scatter.
Figure: The root scatters blocks of 100-i ints into column i of a 100 × 150 C array. At the sending side, the blocks are stride[i] ints apart.
Gather to All
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
An Example Using MPI_ALLGATHER
Gather to All: Vector Variant
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNTS(*), DISPLS(*), RECVTYPE, COMM, IERROR
All to All Scatter/Gather
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, COMM, IERROR
Procedure Specification
void copyIntBuffer( int *pin, int *pout, int len )
{ int i;
for (i=0; i<len; ++i) *pout++ = *pin++;
}
then a call to it in the following code fragment has aliased arguments.
int a[10];
copyIntBuffer( a, a+3, 7);
Although the C language allows this, such usage of MPI procedures is forbidden
unless otherwise specified. Note that Fortran prohibits aliasing of arguments.
All to All: Vector Variant
INTEGER SENDCOUNTS(*), SDISPLS(*), SENDTYPE, RECVCOUNTS(*), RDISPLS(*), RECVTYPE, COMM, IERROR
Global Reduction Operations
Figure: Reduce functions illustrated for a group of three
processes. In each case, each row of boxes represents data items in
one process. Thus, in the reduce, initially each process has three
items; after the reduce the root process has three sums.
Reduce
INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR
Predefined Reduce Operations
Figure: vector-matrix product. Vector a and matrix b are
distributed in one dimension. The distribution is illustrated for
four processes. The slices need not be all of the same
size: each process may have a different value for m.
MINLOC and MAXLOC
MPI_TYPE_CONTIGUOUS(2, MPI_REAL, MPI_2REAL)
type[0] = MPI_FLOAT
type[1] = MPI_INT
disp[0] = 0
disp[1] = sizeof(float)
block[0] = 1
block[1] = 1
MPI_TYPE_STRUCT(2, block, disp, type, MPI_FLOAT_INT)
Similar statements apply for the other mixed types in C.
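For illustration (a sketch written for this text using only standard MPI-1 calls; the variable names are chosen here), a value/rank pair laid out as the MPI_FLOAT_INT type above can be combined with MPI_MAXLOC to find the global maximum and the rank that owns it:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct { float value; int rank; } in, out;   /* matches MPI_FLOAT_INT layout */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = (float) (rank * rank);   /* some local value            */
    in.rank  = rank;                    /* index carried along with it */

    /* MPI_MAXLOC returns the maximum value and the rank holding it */
    MPI_Reduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %f found on process %d\n", out.value, out.rank);

    MPI_Finalize();
    return 0;
}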
All Reduce
INTEGER COUNT, DATATYPE, OP, COMM, IERROR
Reduce-Scatter
INTEGER RECVCOUNTS(*), DATATYPE, OP, COMM, IERROR
Figure: vector-matrix product. All vectors and matrices are
distributed. The distribution is illustrated for four processes.
Each process may have a different value for m and k.
Scan
INTEGER COUNT, DATATYPE, OP, COMM, IERROR
User-Defined Operations for Reduce and Scan
LOGICAL COMMUTE
INTEGER OP, IERROR
typedef void MPI_User_function( void *invec, void *inoutvec, int *len,
MPI_Datatype *datatype);
FUNCTION USER_FUNCTION( INVEC(*), INOUTVEC(*), LEN, TYPE)
<type> INVEC(LEN), INOUTVEC(LEN)
INTEGER LEN, TYPE
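To make the bindings above concrete, here is a small sketch (written for this text; the function and variable names are illustrative) that registers a commutative user-defined operation computing an elementwise product of integers and uses it in a reduction:
#include <mpi.h>
#include <stdio.h>

/* user function with the MPI_User_function prototype shown above */
void int_product(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    int *in = (int *) invec, *inout = (int *) inoutvec;
    for (i = 0; i < *len; i++)
        inout[i] = in[i] * inout[i];
}

int main(int argc, char **argv)
{
    int rank, val, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(int_product, 1, &op);   /* 1 = operation is commutative */

    val = rank + 1;
    MPI_Reduce(&val, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("product of 1..nprocs = %d\n", result);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}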
The Semantics of Collective Communications
Figure: A race condition causes non-deterministic matching of sends
and receives. One cannot rely on synchronization from a broadcast
to make the program deterministic.
Semantic Terms
Communicators
Introduction
Division of Processes
Avoiding Message Conflicts Between Modules
Extensibility by Users
Safety
Overview
Groups
Communicator
Communication Domains
This distributed data structure is illustrated in the figure below, for the case of a three-member group.
Figure: Distributed data structure for intra-communication domain.
Processes
Compatibility with Current Practice
Group Management
Group Accessors
Group Constructors
Note that for these operations the order of processes in the output group is determined primarily by the order in the first group (if possible) and then, if necessary, by the order in the second group. Neither union nor intersection is commutative, but both are associative.
Group Destructors
Communicator Management
Communicator Accessors
Communicator Constructors
Communicator Destructor
Safe Parallel Libraries
Figure: Correct invocation of mcast
Figure: Erroneous invocation of mcast
Figure: Correct invocation of mcast
Figure: Erroneous invocation of mcast
Types of MPI Calls
Caching
Introduction
Caching Functions
INTEGER KEYVAL, EXTRA_STATE, IERROR
typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval,
void *extra_state, void *attribute_val_in,
void *attribute_val_out, int *flag)
LOGICAL FLAG
typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
void *attribute_val, void *extra_state);
LOGICAL FLAG
Figure: Correct execution of two successive invocations of mcast
Figure: Erroneous execution of two successive invocations of mcast
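As an illustration of these caching functions (a sketch written for this text, using the MPI-1 names MPI_Keyval_create, MPI_Attr_put, and MPI_Attr_get; the cached value itself is a made-up example), a library can attach its own state to a communicator and retrieve it later:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int keyval, flag, *state, *found;

    MPI_Init(&argc, &argv);

    /* create a key; the predefined null copy/delete callbacks are used here */
    MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);

    /* cache a pointer to some library state on the communicator */
    state = (int *) malloc(sizeof(int));
    *state = 42;
    MPI_Attr_put(MPI_COMM_WORLD, keyval, state);

    /* later: retrieve the cached attribute */
    MPI_Attr_get(MPI_COMM_WORLD, keyval, &found, &flag);
    if (flag)
        printf("cached value = %d\n", *found);

    MPI_Attr_delete(MPI_COMM_WORLD, keyval);
    free(state);
    MPI_Keyval_free(&keyval);
    MPI_Finalize();
    return 0;
}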
Intercommunication
Introduction
Figure: Distributed data structure for inter-communication domain.
Figure: Example of two intracommunicators merging to become one
intercommunicator.
Intercommunicator Accessors
LOGICAL FLAG
Intercommunicator Constructors
LOGICAL HIGH
Figure: Three-group pipeline. The figure shows the local rank and
(within brackets) the global rank of each process.
Process Topologies
Introduction
Virtual Topologies
Figure: Relationship between ranks and Cartesian coordinates for a
3x4 2D topology. The upper number in each box is the rank of the process
and the lower value is the (row, column) coordinates.
Next:
Overlapping Topologies
Up:
Process Topologies
Previous:
Introduction
Jack Dongarra
Fri Sep 1 06:16:55 EDT 1995
Next:
Named Constants
Up:
Semantic Terms
Previous:
Types of MPI
Opaque Objects
Next:
Named Constants
Up:
Semantic Terms
Previous:
Types of MPI
Jack Dongarra
Fri Sep 1 06:16:55 EDT 1995
Next:
Embedding in MPI
Up:
Process Topologies
Previous:
Virtual Topologies
Overlapping Topologies
Figure: The relationship between two overlaid topologies on a
torus. The upper values in each process is the
rank / (row,col) in the original 2D topology and the lower values are
the same for the shifted 2D topology. Note that rows and columns of
processes remain intact.
Embedding in MPI
Cartesian Topology Functions
Cartesian Constructor Function
LOGICAL PERIODS(*), REORDER
Cartesian Convenience Function: MPI_DIMS_CREATE
Cartesian Inquiry Functions
LOGICAL PERIODS(*)
Cartesian Translator Functions
Cartesian Shift Function
Figure: Outcome of the example when the 2D topology is periodic (a torus) on 12 processes. In the boxes on the left, the upper number in each box represents the process rank, the middle values are the (row, column) coordinate, and the lower values are the source/dest for the sendrecv operation. The values in the boxes on the right are the results in b after the sendrecv has completed.
Figure: Similar to the previous figure, except the 2D Cartesian topology is not periodic (a rectangle). This results when the values of periods(1) and periods(2) are made .FALSE. A ``-'' in a source or dest value indicates MPI_CART_SHIFT returns MPI_PROC_NULL.
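The following short sketch (added here for illustration; it is not the book's numbered example) sets up the 3 x 4 periodic topology of these figures on 12 processes, shifts by one position along the second dimension, and exchanges data with MPI_Sendrecv:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, source, dest, a, b;
    int dims[2] = {3, 4}, periods[2] = {1, 1}, coords[2];
    MPI_Comm comm2d;
    MPI_Status status;

    MPI_Init(&argc, &argv);

    /* 3 x 4 periodic Cartesian topology (a torus); run on 12 processes */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm2d);
    if (comm2d == MPI_COMM_NULL) {      /* extra processes get no topology */
        MPI_Finalize();
        return 0;
    }
    MPI_Comm_rank(comm2d, &rank);
    MPI_Cart_coords(comm2d, rank, 2, coords);

    /* shift by one along dimension 1 (the "column" direction) */
    MPI_Cart_shift(comm2d, 1, 1, &source, &dest);

    a = rank;                 /* value to send */
    MPI_Sendrecv(&a, 1, MPI_INT, dest, 0,
                 &b, 1, MPI_INT, source, 0, comm2d, &status);

    printf("rank %d (%d,%d): received b = %d from source %d\n",
           rank, coords[0], coords[1], b, source);

    MPI_Comm_free(&comm2d);
    MPI_Finalize();
    return 0;
}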
Cartesian Partition Function
LOGICAL REMAIN_DIMS(*)
Cartesian Low-level Functions
LOGICAL PERIODS(*)
Named Constants
Graph Topology Functions
Graph Constructor Function
LOGICAL REORDER
Graph Inquiry Functions
Graph Information Functions
Low-level Graph Functions
Topology Inquiry Functions
An Application Example
Figure: Data partition in 2D parallel matrix product algorithm.
Figure: Phases in 2D parallel matrix product algorithm.
Figure: Data partition in 3D parallel matrix product algorithm.
Figure: Phases in 3D parallel matrix product algorithm.
Environmental Management
Implementation Information
Environmental Inquiries
Choice Arguments
Tag Values
Host Rank
I/O Rank
Clock Synchronization
INTEGER RESULTLEN,IERROR
Timers and Synchronization
{
double starttime, endtime;
starttime = MPI_Wtime();
.... stuff to be timed ...
endtime = MPI_Wtime();
printf("That took %f seconds\n",endtime-starttime);
}
Initialization and Exit
int main(argc, argv)
int argc;
char **argv;
{
MPI_Init(&argc, &argv);
/* parse arguments */
/* main program */
MPI_Finalize(); /* see below */
}
INTEGER IERROR
Error Handling
Error Handlers
INTEGER ERRHANDLER, IERROR
Register the user routine function for use as an MPI
exception handler. Returns in errhandler a handle to the registered
exception handler.
typedef void (MPI_Handler_function)(MPI_Comm *, int *, ...);
The first argument is the communicator in use.
The second is
the error code to be returned by the MPI routine that raised the error.
If the routine would have returned multiple error codes
(see Section
), it is
the error code returned in the status for the request that caused the
error handler to be invoked.
The remaining arguments are ``stdargs'' arguments
whose number and meaning is implementation-dependent. An implementation
should clearly document these arguments.
Addresses are used so that the handler may be written in Fortran.
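For example (a sketch using the MPI-1 routines MPI_Errhandler_create and MPI_Errhandler_set; the handler body is illustrative only), an application can install a handler that prints the error string and aborts:
#include <mpi.h>
#include <stdio.h>

void my_handler(MPI_Comm *comm, int *errcode, ...)
{
    char msg[MPI_MAX_ERROR_STRING];
    int len;

    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error caught: %s\n", msg);
    MPI_Abort(*comm, *errcode);
}

int main(int argc, char **argv)
{
    MPI_Errhandler errh;

    MPI_Init(&argc, &argv);

    /* register the handler and attach it to MPI_COMM_WORLD */
    MPI_Errhandler_create(my_handler, &errh);
    MPI_Errhandler_set(MPI_COMM_WORLD, errh);

    /* ... program; any MPI error on MPI_COMM_WORLD now calls my_handler ... */

    MPI_Errhandler_free(&errh);
    MPI_Finalize();
    return 0;
}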
Error Codes
CHARACTER*(*) STRING
Interaction with Executing Environment
Language Binding
Independence of Basic Runtime Routines
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
printf( "Output from task rank %d\n", rank );
Interaction with Signals in POSIX
The MPI Profiling Interface
Requirements
Discussion
Logic of the Design
Miscellaneous Control of Profiling
Examples
Profiler Implementation
MPI Library Implementation
Fortran 77 Binding Issues
Figure: An example of calling a routine with mismatched formal
and actual arguments.
Systems With Weak symbols
#pragma weak MPI_Send = PMPI_Send
int PMPI_Send(/* appropriate args */)
{
/* Useful content */
}
% cc ... -lprof -lmpi
Figure: Resolution of MPI calls on systems with weak links.
Systems without Weak Symbols
#ifdef PROFILELIB
# ifdef __STDC__
# define FUNCTION(name) P##name
# else
# define FUNCTION(name) P/**/name
# endif
#else
# define FUNCTION(name) name
#endif
int FUNCTION(MPI_Send)(/* appropriate args */)
{
/* Useful content */
}
% cc ... -lprof -lpmpi -lmpi
Figure: Resolution of MPI calls on systems without weak links.
Complications
Multiple Counting
Linker Oddities
Multiple Levels of Interception
Conclusions
Design Issues
Why is MPI so big?
Should we be concerned about the size of MPI?
Introduction
C Binding Issues
Why does MPI not guarantee buffering?
Similar choices occur if messages are buffered at
the destination. Communication buffers may be fixed in size, or they
may be allocated dynamically out of the heap, in competition with the
application. The buffer allocation policy may depend on the size
of the messages (preferably buffering short messages), and may depend
on communication history (preferably buffering on busy channels).
Portable Programming with MPI
Dependency on Buffering
Figure: Cycle in communication graph for cyclic shift.
if (rank%2) {
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
}
else {
MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
}
The resulting communication graph is illustrated in the figure below.
This graph is acyclic.
Figure: Cycle in communication graph is broken by reordering
send and receive.
...
MPI_Pack_size (buffsize, MPI_INT, comm, &buffsize);
buffsize += MPI_BSEND_OVERHEAD;
userbuf = malloc (buffsize);
MPI_Buffer_attach (userbuf, buffsize);
MPI_Bsend (buf1, count, MPI_INT, clock, tag, comm);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
Figure: Cycle in communication graph is broken by using
buffered sends.
...
MPI_Isend (buf1, count, MPI_INT, clock, tag, comm, &request);
MPI_Recv (buf2, count, MPI_INT, anticlock, tag, comm, &status);
MPI_Wait (&request, &status);
Figure: Cycle in communication graph is broken by using
nonblocking sends.
...
MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
MPI_Wait (&request, &status);
...
MPI_Sendrecv (buf1, count, MPI_INT, clock, tag,
buf2, count, MPI_INT, anticlock, tag, comm, &status);
Collective Communication and Synchronization
MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Bcast (buf3, 1, MPI_CHAR, 0, comm);
MPI_Rsend (buf1, count, MPI_INT, clock, tag, comm);
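The three calls above rely on MPI_Bcast synchronizing the processes, so that the ready send finds a matching receive already posted on the neighbor; the MPI standard does not guarantee that a broadcast synchronizes. A minimal safer sketch, assuming the same buffers, neighbors, communicator, and an MPI_Request/MPI_Status pair as in the earlier examples, replaces the ready send with a standard send and completes the receive explicitly:
MPI_Request request;
MPI_Status status;

MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &request);
MPI_Bcast (buf3, 1, MPI_CHAR, 0, comm);   /* no synchronization assumed */
MPI_Send (buf1, count, MPI_INT, clock, tag, comm);
MPI_Wait (&request, &status);             /* complete the nonblocking receive */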
Ambiguous Communications and Portability
Figure: Use of communicators. Numbers
in parentheses indicate the process to which data are being sent or received.
The gray shaded area represents the library routine call. In this case
the program behaves as intended. Note that the second message sent by process
2 is received by process 0, and that the message sent by process 0 is
received by process 2.
Figure: Unintended behavior of program. In this case the message from process 2
to process 0 is never received, and deadlock results.
Point-to-Point Communication
Introduction and Overview
... (including the "\0" string terminator character).
The fourth parameter specifies the message destination, which is process 1.
The fifth parameter specifies the message tag.
Finally, the last parameter is a communicator that specifies a communication
domain for this communication. Among other things, a communicator serves to
define a set of processes that can be contacted. Each such process is labeled
by a process rank. Process ranks are integers and are discovered by inquiry to
a communicator (see the call to MPI_Comm_rank()). MPI_COMM_WORLD is a default
communicator provided upon start-up that defines an initial communication
domain for all the processes that participate in the computation. Much more
will be said about communicators in a later chapter.
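To make the pieces described above concrete, here is a minimal sketch in the C binding. It is consistent with the surrounding description but is not taken verbatim from the original example; the message text, tag value, and buffer length are illustrative choices.
#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    char msg[20];
    int rank, tag = 99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank in the initial communication domain */
    if (rank == 0) {
        strcpy(msg, "Hello, there");
        /* send the string, including the '\0' terminator, to process 1 */
        MPI_Send(msg, strlen(msg) + 1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("received: %s\n", msg);
    }
    MPI_Finalize();
    return 0;
}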
Blocking Send and Receive Operations
Blocking Send
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
Blocking Receive
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
Order
Figure: Messages are matched in order.
Figure: Order preserving is not transitive.
Example - Jacobi iteration
Figure: Block partitioning of a matrix.
Figure: 1D block partitioning with overlap and communication
pattern for Jacobi iteration.
Send-Receive
INTEGER SENDCOUNT, SENDTYPE, DEST, SENDTAG, RECVCOUNT, RECVTYPE, SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
INTEGER COUNT, DATATYPE, DEST, SENDTAG, SOURCE, RECVTAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
Nonblocking Communication
Posting Operations
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR
Completion Operations
INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR
Multiple Completions
INTEGER COUNT, ARRAY_OF_REQUESTS(*), INDEX, STATUS(MPI_STATUS_SIZE), IERROR
INTEGER ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
INTEGER COUNT, ARRAY_OF_REQUESTS(*), ARRAY_OF_STATUSES(MPI_STATUS_SIZE,*), IERROR
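A brief sketch of waiting on several pending operations at once, reusing the buffer and neighbor names of the earlier ring examples (those names are carried-over assumptions, not new MPI entities):
MPI_Request requests[2];
MPI_Status statuses[2];

MPI_Irecv (buf2, count, MPI_INT, anticlock, tag, comm, &requests[0]);
MPI_Isend (buf1, count, MPI_INT, clock, tag, comm, &requests[1]);
MPI_Waitall (2, requests, statuses);   /* blocks until both operations complete */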
Probe and Cancel
INTEGER SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR
INTEGER STATUS(MPI_STATUS_SIZE), IERROR
Persistent Communication Requests
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR
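A minimal sketch of a persistent send driven inside a loop; the buffer, count, communicator, and tag are assumed to be set up as in the earlier examples, and dest and niter are placeholder names:
MPI_Request request;
MPI_Status status;
int i;

MPI_Send_init (buf1, count, MPI_INT, dest, tag, comm, &request);
for (i = 0; i < niter; i++) {
    /* refill buf1 here */
    MPI_Start (&request);            /* start the communication          */
    MPI_Wait (&request, &status);    /* complete it before reusing buf1  */
}
MPI_Request_free (&request);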
Communication Modes
Blocking Calls
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
Nonblocking Calls
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
Persistent Requests
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR
Buffer Allocation and Usage
INTEGER SIZE, IERROR
INTEGER SIZE, IERROR
User-Defined Datatypes and Packing
Introduction to User-Defined Datatypes
Figure: A diagram of the memory cells represented by the user-defined
datatype upper. The shaded cells are the locations of the array
that will be sent.
Datatype Constructors
Contiguous
Figure: Effect of datatype constructor MPI_TYPE_CONTIGUOUS.
Vector
Figure: Datatype constructor MPI_TYPE_VECTOR.
Hvector
Figure: Datatype constructor MPI_TYPE_HVECTOR.
Figure: Memory layout of the 2D array section for the accompanying example.
The shaded blocks are sent.
Indexed
Figure: Datatype constructor MPI_TYPE_INDEXED.
Hindexed
Figure: Datatype constructor MPI_TYPE_HINDEXED.
Struct
Figure: Datatype constructor MPI_TYPE_STRUCT.
Use of Derived Datatypes
Relation to count
MPI_TYPE_CONTIGUOUS(count, datatype, newtype)
MPI_TYPE_COMMIT(newtype)
MPI_SEND(buf, 1, newtype, dest, tag, comm).
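The same sequence in the C binding, as a short sketch (the base type, element count, destination, and tag are illustrative assumptions):
MPI_Datatype newtype;

MPI_Type_contiguous (count, MPI_INT, &newtype);   /* concatenate count ints        */
MPI_Type_commit (&newtype);                       /* must commit before use        */
MPI_Send (buf, 1, newtype, dest, tag, comm);      /* one item of the derived type  */
MPI_Type_free (&newtype);                         /* release it when done          */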
Address Function
INTEGER ADDRESS, IERROR
Pack and Unpack
INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
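A short sketch of the pack/send/unpack pattern in the C binding; the buffer size, values, destination, source, and tag are illustrative assumptions:
char packbuf[100];
int position = 0;
int i = 3;
double x = 2.5;

/* sender: pack an int and a double into one MPI_PACKED message */
MPI_Pack (&i, 1, MPI_INT, packbuf, 100, &position, comm);
MPI_Pack (&x, 1, MPI_DOUBLE, packbuf, 100, &position, comm);
MPI_Send (packbuf, position, MPI_PACKED, dest, tag, comm);

/* receiver: unpack in the same order the data were packed */
MPI_Recv (packbuf, 100, MPI_PACKED, source, tag, comm, &status);
position = 0;
MPI_Unpack (packbuf, 100, &position, &i, 1, MPI_INT, comm);
MPI_Unpack (packbuf, 100, &position, &x, 1, MPI_DOUBLE, comm);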
Collective Communications
Introduction and Overview
Figure: Collective move functions illustrated for a group of six processes.
In each case, each row of boxes represents data locations in one process.
Thus, in the broadcast, initially just the first process contains the data item,
but after the broadcast all processes contain it.
Broadcast
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR
Gather
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR
Examples Using MPI_GATHER
Figure: The root process gathers 100 ints from each process
in the group.
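A minimal sketch matching this figure, assuming each process holds its contribution in an array sendarray of 100 ints, that rank and comm are already known, and that the root allocates space for the whole group:
int gsize, root = 0;
int sendarray[100];
int *rbuf = NULL;

MPI_Comm_size (comm, &gsize);
if (rank == root)
    rbuf = (int *) malloc (gsize * 100 * sizeof(int));   /* significant only at the root */
MPI_Gather (sendarray, 100, MPI_INT, rbuf, 100, MPI_INT, root, comm);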
MPI: The Complete Reference
To receive the PostScript version of this book, send electronic mail to
netlib@netlib.org and include in the message the line:
send mpi-book.ps from utk/papers/mpi-book
8 x 9 * ??? pages * $??.?? * Original in Paperback ISBN 95-80471
For more information, contact Gita Manaktala, manak@mit.edu.
PVM: Parallel Virtual Machine
A Users' Guide and Tutorial for Networked Parallel Computing
Direct Message Routing
Figure: Task-task connection state diagram
Task Environment
Environment Variables
For example, setting
PVM_EXPORT=DISPLAY:SHELL
exports the variables DISPLAY and SHELL to child tasks (and PVM_EXPORT too).
-----------------------------------------------------------------------
PVM_ROOT Root installation directory
PVM_EXPORT Names of environment variables to inherit through spawn
PVM_DPATH Default slave pvmd install path
PVM_DEBUGGER Path of debugger script used by spawn
-----------------------------------------------------------------------
-------------------------------------------------------------------
PVM_ARCH PVM architecture name
PVMSOCK Address of the pvmd local socket; see Section 7.4.2
PVMEPID Expected PID of a spawned task
PVMTMASK Libpvm Trace mask
-------------------------------------------------------------------
Standard Input and Output
------------------------------------------------------------
Spawn: (code) { Task has been spawned
int tid, Task id
int -1, Signals spawn
int ptid TID of parent
}
Begin: (code) { First output from task
int tid, Task id
int -2, Signals task creation
int ptid TID of parent
}
Output: (code) { Output from a task
int tid, Task id
int count, Length of output fragment
data[count] Output fragment
}
End: (code) { Last output from a task
int tid, Task id
int 0 Signals EOF
}
------------------------------------------------------------
Figure: Output states of a task
Message-Passing Architectures
Figure: PVM daemon and tasks on MPP host
Figure: Packing: breaking data into fixed-size fragments
Figure: How TID is used to distinguish tasks on MPP
Figure: Buffering: buffering one fragment by receiving task until pvm_recv() is called
Table: Implementation of PVM system calls
Shared-Memory Architectures
Figure: Structure of a PVM page
Figure: Structures of shared message buffers
XPVM
Figure: XPVM interface - snapshot during use
Porting PVM to New Architectures
Each of these classes requires a different approach to make PVM
exploit the capabilities of the respective architecture.
The workstations use TCP/IP to move data between hosts,
the distributed-memory multiprocessors use the native
message-passing routines to move data between nodes,
and the shared-memory multiprocessors use shared memory to
move data between the processors.
The following sections describe the steps for porting the
PVM source to each of these classes.
Multiprocessors
void mpp_init(int argc, char **argv);
Initialization. Called once when PVM is started. Arguments argc and argv
are passed from pvmd main().
int mpp_load(int flags, char *name, char *argv, int count, int *tids, int ptid);
Create partition if necessary. Load executable onto nodes; create new
entries in task table, encode node number and process type into task IDs.
flags: exec options;
name: executable to be loaded;
argv: command line argument for executable;
count: number of tasks to be created;
tids: array to store new task IDs;
ptid: parent task ID.
void mpp_output(struct task *tp, struct pkt *pp);
Send all pending packets to nodes via native send. Node number and process
type are extracted from task ID.
tp: destination task;
pp: packet.
int mpp_mcast(struct pkt *pp, int *tids, int ntask);
Global send.
pp: packet;
tids: list of destination task IDs;
ntask: how many.
int mpp_probe();
Probe for pending packets from nodes (non-blocking). Returns 1 if packets
are found, otherwise 0.
void mpp_input();
Receive pending packets (from nodes) via native receive.
void mpp_free(int tid);
Remove node/process-type from active list.
tid: task ID.
ASYNCRECV(buf,len)
Non-blocking receive. Returns immediately with a message handle.
buf: (char *), buffer to place the data;
len: (int), size of buffer in bytes.
ASYNCSEND(tag,buf,len,dest,ptype)
Non-blocking send. Returns immediately with a message handle.
tag: (int), message tag;
buf: (char *), location of data;
len: (int), size of data in bytes;
dest: (long), address of destination node;
ptype: instance number of destination task.
ASYNCWAIT(mid)
Blocks until operation associated with mid has completed.
mid: message handle (its type is system-dependent).
ASYNCDONE(mid)
Returns 1 if operation associated with mid has completed, and 0 otherwise.
mid: message handle (its type is system-dependent).
MSGSIZE(mid)
Returns size of message most recently arrived.
mid: message handle (its type is system-dependent).
MSGSENDER(mid)
Returns node number of the sender of most recently received message.
mid: message handle (its type is system-dependent).
PVMCRECV(tag,buf,len)
Blocks until message has been received into buffer.
tag: (int), expected message tag;
buf: (char *), buffer to place the data;
len: (int), size of buffer in bytes;
PVMCSEND(tag,buf,len,dest,ptype)
Blocks until the send operation is complete and the buffer can be reused.
tag: (int), message tag;
buf: (char *), location of data;
len: (int), size of data in bytes;
dest: (long), address of destination node;
ptype: instance number of destination task.
src/pvmshmem.h:
PAGEINITLOCK(lp)
Initialize the lock pointed to by lp.
PAGELOCK(lp)
Locks the lock pointed to by lp.
PAGEUNLOCK(lp)
Unlocks the lock pointed to by lp.
In addition, the file pvmshmem.c contains routines used by both pvmd
and libpvm.
Troubleshooting
On-Line Manual Pages
if (! $?MANPATH) setenv MANPATH /usr/man:/usr/local/man
setenv MANPATH ${MANPATH}:$PVM_ROOT/man
The p4 System
# start one slave on each of sun2 and sun3
local 0
sun2 1 /home/mylogin/p4pgms/sr_test
sun3 1 /home/mylogin/p4pgms/sr_test
Getting PVM Running
Pvmd Log File
The following command prints your numeric user id, which appears in the name of
the pvmd log file (/tmp/pvml.uid):
(grep `whoami` /etc/passwd || ypmatch `whoami` passwd) \
| awk -F: '{print $3;exit}'
Starting PVM from the Console
pvm [-ddebugmask] [-nhostname] [hostfile]
Starting the Pvmd by Hand
$PVM_ROOT/lib/pvmd [-ddebugmask] [-nhostname] [hostfile]
Adding Hosts to the Virtual Machine
A common source of trouble when adding hosts is a .cshrc that produces output
for non-interactive shells; interactive-only commands can be wrapped in a test
such as
if ( { tty -s } && $?prompt ) then
    echo terminal type is $TERM
    stty erase '^?' kill '^u' intr '^c' echo
endif
Failures while adding a host show up as messages like
[pvmd pid12360] slave_config: bad args
[pvmd pid12360] pvmbailout(0)
PVM Host File
A host file line of the form
* option option ...
changes the default parameters for subsequent hosts (both those in the
host file and those added later).
Default statements are not cumulative;
each applies to the system defaults.
For example, after the following two host file entries:
* dx=pvm3/lib/pvmd
* ep=/bin:/usr/bin:pvm3/bin/$PVM_ARCH
only ep is changed from its system default
(dx is reset by the second line).
To set multiple defaults,
combine them into a single line.
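For example, the two defaults above stay in force together if they are written as one line (a sketch reusing the same values):
* dx=pvm3/lib/pvmd ep=/bin:/usr/bin:pvm3/bin/$PVM_ARCH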
Compiling Applications
Linking
cc/f77 [ compiler flags ] [ source files ] [ loader flags ]
-L$PVM_ROOT/lib/$PVM_ARCH -lfpvm3 -lgpvm3 -lpvm3
[ libraries needed by PVM ] [ other libraries ]
Running Applications
Memory Use
while (1) {
    pvm_initsend(PvmDataDefault);   /* make new buffer */
    pvm_setsbuf(0);
    /* now buffer won't be freed by next initsend */
}
The function umbuf_list(int level) calls umbuf_dump() for each message in the
message heap.
Scheduling Priority
cd ~/pvm3/bin/SUN4
mv prog prog-
echo 'P=$0"-"; shift; exec nice -10 $P $@' > prog
chmod 755 prog
Debugging and Tracing
setenv PVM_EXPORT DISPLAY
spawn -? [ rest of spawn command ]
spawn -@ /usr/local/X11R5/bin/xterm -n PVMTASK
History of PVM Versions
PVM 1.0 (never released)
any of the several initial experimental PVM versions
used to study heterogeneous distributed computing issues.
PVM 2.0 (Feb. 1991)
+ Complete rewrite of in-house experimental PVM software (v1.0),
+ cleaned up the specification and implementation
to improve robustness and portability.
PVM 2.1 (Mar. 1991)
+ process-process messages switched to XDR
to improve portability of source in heterogeneous environments.
+ Simple console interpreter added to master pvmd.
PVM 2.2 (April 1991)
+ pvmd-pvmd message format switched to XDR.
+ Get and put functions vectorized to improve performance.
+ broadcast function --> deprecated
PVM 2.3.2 (June 1991)
+ improved password-less startup via rsh/rcmd
+ added per-host options to hostfile format:
ask for password
specify alternate loginname
specify alternate pvmd executable location
+ pvmd-pvmd protocol version checked to prevent mixed versions
interoperating.
+ added support for short and long integers in messages.
+ added 'reset' pvmd command to reset the vm.
+ can specify "." as host to initiateM() to create on localhost
PVM 2.3.3 (July 1991)
+ added 'barr' command to check barrier/ready status
+ pstatus() libpvm call added to return size of virtual machine
PVM 2.3.4 (Oct. 1991)
+ pvmds negotiate maximum UDP message length at startup.
+ removed static limitation on number of hosts (used to be 40).
PVM 2.4.0 (Feb. 1992)
+ added direct-connect TCP message transfer available through
vsnd() and vrcv() to improve communication performance.
+ added option to specify user executable path on each host.
+ version check added between pvmd and libpvm to prevent running
incompatible versions.
+ libpvm automatically prints error messages.
+ libpvm error codes standardized and exported in "pvmuser.h".
+ includes instrumented heap to aid system debugging.
+ host file default parameters can be set with '*'.
+ libpvm returns error code instead of exiting in case
of fatal error.
PVM 2.4.1 (June 1992)
+ added new ports and bug fixes
PVM 2.4.2 (Dec. 1992)
+ pvmuser.h made compatible with C++.
+ can force messages to be packed in raw data format to avoid XDR.
+ rcv() will return BadMsg if message can't be decoded.
PVM 3.0 (Feb. 1993)
Complete redesign of the PVM software, both the user interface and
the implementation, in order to:
+ allow scalability to hundreds of hosts.
+ allow portability to multiprocessors and operating systems
other than Unix.
+ allow dynamic reconfiguration of the virtual machine.
+ allow fault tolerance.
+ allow asynchronous task notification - task exit,
machine reconfiguration.
+ include dynamic process groups.
+ add a separate PVM console task.
PVM 3.1 (April 1993)
+ added task-task direct routing via TCP
using normal send and receive calls.
PVM 3.1.1 (May 1993) Five bug fix patches released for PVM 3.1
PVM 3.1.2 (May 1993)
PVM 3.1.3 (June 1993)
PVM 3.1.4 (July 1993)
PVM 3.1.5 (Aug. 1993)
PVM 3.2 (Aug. 1993)
+ distributed memory ports merged with Unix port source.
Ports include I860, PGON, CM5.
+ conf/ARCH.def files created for per-machine configuration
to improve source portability and package size.
+ pvmd adds new slave hosts in parallel to improve performance.
+ stdout and stderr from tasks can be redirected to a task/console.
+ option OVERLOADHOST allows virtual machines running under the
same login to overlap, i.e., a user can have multiple overlapping virtual machines.
+ new printf-like pack and unpack routines pvm_packf() and
pvm_unpackf() available to C and C++ programmers.
+ added pack, unpack routines for unsigned integers.
+ environment passed through spawn(), controlled by
variable PVM_EXPORT.
+ many enhancements and features added to PVM console program.
+ pvmd and libpvm use PVM_ROOT and PVM_ARCH environment
variables if set.
PVM 3.2.1 (Sept. 1993) Six bug fix patches released for PVM 3.2
PVM 3.2.2 (Sept. 1993)
PVM 3.2.3 (Oct. 1993)
PVM 3.2.4 (Nov. 1993)
PVM 3.2.5 (Dec. 1993)
PVM 3.2.6 (Jan. 1994)
PVM 3.3.0 (June 1994)
+ PVM_ROOT environment variable now must be set.
$HOME/pvm3 is no longer assumed.
+ shared-memory ports merged with Unix and distributed memory ports.
Ports include SUNMP and SGIMP.
+ New functions pvm_psend() and pvm_precv() send and receive raw
data buffers, enabling more efficient implementation on machines
such as multiprocessors.
+ new function pvm_trecv() blocks until a message is received or a
specified timeout (in seconds and usec) expires, improving fault tolerance.
+ Inplace packing implemented for dense data reducing packing costs.
+ Resource Manager, Hoster and Tasker interfaces defined
to allow third party debuggers and resource managers to use PVM.
+ libpvm parameter/result tracing implemented to drive XPVM tool.
tasks inherit trace destination and per-call event mask.
+ XPVM, a graphical user interface for PVM, is released.
+ added collective communication routines to group library.
global reduce and scatter/gather
+ libpvm function pvm_catchout() collects output of children tasks.
output can be appended to any FILE* (e.g. stdout).
+ new hostfile option "wd=" sets the working directory of the pvmd.
+ environment variables expanded when setting ep= or
bp= in the hostfile.
PVM 3.3.1 (June 1994) bug fix patches for PVM 3.3
PVM 3.3.2 (July 1994)
PVM 3.3.3 (August 1994)
The PVM System
#include <stdio.h>
#include "pvm3.h"

main()
{
    int cc, tid, msgtag;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    cc = pvm_spawn("hello_other", (char**)0, 0, "", 1, &tid);

    if (cc == 1) {
        msgtag = 1;
        pvm_recv(tid, msgtag);
        pvm_upkstr(buf);
        printf("from t%x: %s\n", tid, buf);
    } else
        printf("can't start hello_other\n");

    pvm_exit();
}
Figure: PVM program hello.c
Figure: PVM program hello_other.c
#include "pvm3.h"
main()
{
int ptid, msgtag;
char buf[100];
ptid = pvm_parent();
strcpy(buf, "hello, world from ");
gethostname(buf + strlen(buf), 64);
msgtag = 1;
pvm_initsend(PvmDataDefault);
pvm_pkstr(buf);
pvm_send(ptid, msgtag);
pvm_exit();
}
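One way to build and try the pair by hand, assuming PVM is installed under $PVM_ROOT, the pvmd is already running, and the compile lines follow the linking convention shown later in this chapter; the install directory for hello_other is the default place pvm_spawn() searches:
% cc -o hello hello.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
% cc -o hello_other hello_other.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
% cp hello_other $HOME/pvm3/bin/$PVM_ARCH
% hello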
Using PVM
Setup to Use PVM
setenv PVM_ROOT $HOME/pvm3
It is recommended that the user set PVM_ARCH by appending the contents of the
file $PVM_ROOT/lib/cshrc.stub to the file .cshrc. The stub should be placed
after PATH and PVM_ROOT are defined. This stub automatically determines
PVM_ARCH for this host and is particularly useful when the user shares a
common file system (such as NFS) across several different architectures.
------------------------------------------------------------------------
PVM_ARCH Machine Notes
------------------------------------------------------------------------
AFX8 Alliant FX/8
ALPHA DEC Alpha DEC OSF-1
BAL Sequent Balance DYNIX
BFLY BBN Butterfly TC2000
BSD386 80386/486 PC running Unix BSDI, 386BSD, NetBSD
CM2 Thinking Machines CM2 Sun front-end
CM5 Thinking Machines CM5 Uses native messages
CNVX Convex C-series IEEE f.p.
CNVXN Convex C-series native f.p.
CRAY C-90, YMP, T3D port available UNICOS
CRAY2 Cray-2
CRAYSMP Cray S-MP
DGAV Data General Aviion
E88K Encore 88000
HP300 HP-9000 model 300 HPUX
HPPA HP-9000 PA-RISC
I860 Intel iPSC/860 Uses native messages
IPSC2 Intel iPSC/2 386 host SysV, Uses native messages
KSR1 Kendall Square KSR-1 OSF-1, uses shared memory
LINUX 80386/486 PC running Unix LINUX
MASPAR Maspar
MIPS MIPS 4680
NEXT NeXT
PGON Intel Paragon Uses native messages
PMAX DECstation 3100, 5100 Ultrix
RS6K IBM/RS6000 AIX 3.2
RT IBM RT
SGI Silicon Graphics IRIS IRIX 4.x
SGI5 Silicon Graphics IRIS IRIX 5.x
SGIMP SGI multiprocessor Uses shared memory
SUN3 Sun 3 SunOS 4.2
SUN4 Sun 4, SPARCstation SunOS 4.2
SUN4SOL2 Sun 4, SPARCstation Solaris 2.x
SUNMP SPARC multiprocessor Solaris 2.x, uses shared memory
SYMM Sequent Symmetry
TITN Stardent Titan
U370 IBM 370 AIX
UVAX DEC MicroVAX
------------------------------------------------------------------------
Table: PVM_ARCH names used in PVM 3
Starting PVM
To start PVM, type
% pvm
and you should get back a PVM console prompt signifying that PVM
is now running on this host.
You can add hosts to your virtual machine by typing at the console prompt
pvm> add hostname
And you can delete hosts (except the one you are on)
from your virtual machine by typing
pvm> delete hostname
If you get the message "Can't Start pvmd,"
then check the common startup problems section and try again.
To see the present configuration of the virtual machine, type
pvm> conf
To see what PVM tasks are running on the virtual machine, you type
pvm> ps -a
Of course you don't have any tasks running yet; that's in the next section.
If you type "quit" at the console prompt, the console will quit but
your virtual machine and tasks will continue to run.
At any Unix prompt on any host in the virtual machine, you can type
% pvm
and you will get the message "pvm already running" and the console prompt.
When you are finished with the virtual machine, you should type
pvm> halt
This command kills any PVM tasks, shuts down the virtual machine,
and exits the console. This is the recommended method to stop PVM
because it makes sure that the virtual machine shuts down cleanly.
You can also start PVM with a hostfile by typing
% pvm hostfile
PVM will then add all the listed hosts simultaneously before
the console prompt appears. Several options can be
specified on a per-host basis in the hostfile.
These are described
at the end of this chapter for the user who wishes to customize
his virtual machine for a particular application or environment.
PVM can also be started, and the virtual machine viewed graphically, through XPVM by typing
% xpvm
The menu button labeled "hosts" will pull down a list of hosts you can add.
If you click on a hostname, it is added and an icon of the machine appears in
an animation of the virtual machine. A host is deleted if you click
on a hostname that is already in the virtual machine (see
Figure 3.1).
On startup XPVM reads the file $HOME/.xpvm_hosts, which is a list
of hosts to display in this menu. Hosts without a leading "&" are
added all at once at startup.
Figure: XPVM system adding hosts
Common Startup Problems
If PVM has trouble starting up and you see a message like
[t80040000] Can't start pvmd
first check that your .rhosts file on the remote host
contains the name of the host from which you are starting PVM.
An external check that your .rhosts file is set correctly
is to type
% rsh remote_host ls
If your .rhosts is set up correctly, then you will see
a listing of your files on the remote host.
A related check is to see whether the remote host can find and start the pvmd:
% rsh remote_host $PVM_ROOT/lib/pvmd
Some Unix shells, for example ksh, do not set environment variables
on remote hosts when using rsh. In PVM 3.3 there are two workarounds for
such shells. First, if you set the environment variable PVM_DPATH on
the master host to pvm3/lib/pvmd, then this will override
the default dx path.
The second method is to tell PVM explicitly where to find the remote
pvmd executable by using the dx= option in the hostfile.
If you see the message
[t80040000] Login incorrect
it probably means that no account is on the remote
machine with your login name. If your login name is different on
the remote machine, then you must use the lo= option in the hostfile
(see Section 3.7).
Running PVM Programs
To run the example programs, first copy them to your home directory:
% cp -r $PVM_ROOT/examples $HOME/pvm3/examples
% cd $HOME/pvm3/examples
The examples directory contains a Makefile.aimk and Readme file
that describe how to build the examples.
PVM supplies an architecture-independent make, aimk,
that automatically determines PVM_ARCH and links any operating-system-specific
libraries to your application.
aimk was automatically added to your $PATH when you
placed the cshrc.stub in your .cshrc file.
Using aimk allows you to leave the source code and
makefile unchanged as you compile across different architectures.
Build the C master/slave example by typing
% aimk master slave
If you prefer to work with Fortran, compile the Fortran version with
% aimk fmaster fslave
Depending on the location of PVM_ROOT, the INCLUDE statement
at the top of the Fortran examples may need to be changed.
If PVM_ROOT is not $HOME/pvm3, then change the include to point
to $PVM_ROOT/include/fpvm3.h. Note that PVM_ROOT is not
expanded inside the Fortran, so you must insert the actual path.
Start the master program from a Unix prompt:
% master
The program will ask how many tasks. The number of tasks does not have
to match the number of hosts in these examples. Try several combinations.
Another example, hitc, can be built and then spawned from the PVM console:
% aimk hitc hitc_slave
pvm> spawn -> hitc
The "->" spawn option causes all the print statements in
hitc and in the
slaves to appear in the console window. This feature can be useful
when debugging your first few PVM programs.
You may wish to experiment with this option by placing print statements
in hitc.f and hitc_slave.f and recompiling.
PVM Console Details
The console is started with
pvm [-n<hostname>] [hostfile]
It prompts with
pvm>
and accepts commands from standard input. The console reads $HOME/.pvmrc
before reading commands from the tty, so startup commands can be placed
there, for example:
alias ? help
alias h help
alias j jobs
setenv PVM_EXPORT DISPLAY
# print my id
echo new pvm shell
id
PVM supports the use of multiple consoles.
It is possible to run a
console on any host in an existing virtual machine and even
multiple consoles on the same machine. It is also possible to start
up a console in the middle of a PVM application and check on its
progress.
Host File Options
# configuration used for my run
sparky
azure.epm.ornl.gov
thud.cs.utk.edu
sun4
Figure: Simple hostfile listing virtual machine configuration
Note: The environment variable PVM_DEBUGGER can also be set.
The default debugger is pvm3/lib/debugger.
[t80040000] ready Fri Aug 27 18:47:47 1993
*** Manual startup ***
Login to "honk" and type:
pvm3/lib/pvmd -S -d0 -nhonk 1 80a9ca95:0cb6 4096 2 80a95c43:0000
Type response:
On honk, after typing the given line, you should see
ddpro<2312> arch<ALPHA> ip<80a95c43:0a8e> mtu<4096>
which you should relay back to the master pvmd. At that point,
you will see
Thanks
and the two pvmds should be able to communicate.
# Comment lines start with a # (blank lines ignored)
gstws
ipsc dx=/usr/geist/pvm3/lib/I860/pvmd3
ibm1.scri.fsu.edu lo=gst so=pw
# set default options for following hosts with *
* ep=$sun/problem1:~/nla/mathlib
sparky
#azure.epm.ornl.gov
midnight.epm.ornl.gov
# replace default options with new values
* lo=gageist so=pw ep=problem1
thud.cs.utk.edu
speedy.cs.utk.edu
# machines for adding later are specified with &
# these only need listing if options are required
&sun4 ep=problem1
&castor dx=/usr/local/bin/pvmd3
&dasher.cs.utk.edu lo=gageist
&elvis dx=~/pvm3/lib/SUN4/pvmd3
Figure: PVM hostfile illustrating customizing options
Basic Programming Techniques
Common Parallel Programming Paradigms
Crowd Computations
{Master Mandelbrot algorithm.}
{Initial placement}
for i := 0 to NumWorkers - 1
pvm_spawn(<worker name>) {Start up worker i}
pvm_send(<worker tid>,999) {Send task to worker i}
endfor
{Receive-send}
while (WorkToDo)
pvm_recv(888) {Receive result}
pvm_send(<available worker tid>,999)
{Send next task to available worker}
display result
endwhile
{Gather remaining results.}
for i := 0 to NumWorkers - 1
pvm_recv(888) {Receive result}
pvm_kill(<worker tid i>) {Terminate worker i}
display result
endfor
{Worker Mandelbrot algorithm.}
while (true)
pvm_recv(999) {Receive task}
result := MandelbrotCalculations(task) {Compute result}
pvm_send(<master tid>,888) {Send result to master}
endwhile
Figure: General crowd computation
{Matrix Multiplication Using Pipe-Multiply-Roll Algorithm}
{Processor 0 starts up other processes}
if (<my processor number> = 0) then
for i := 1 to MeshDimension*MeshDimension
pvm_spawn(<component name>, ...)
endfor
endif
forall processors Pij, 0 <= i,j < MeshDimension
for k := 0 to MeshDimension-1
{Pipe.}
if myrow = (mycolumn+k) mod MeshDimension
{Send A to all Pxy, x = myrow, y <> mycolumn}
pvm_mcast((Pxy, x = myrow, y <> mycolumn),999)
else
pvm_recv(999) {Receive A}
endif
{Multiply. Running totals maintained in C.}
Multiply(A,B,C)
{Roll.}
{Send B to Pxy, x = myrow-1, y = mycolumn}
pvm_send((Pxy, x = myrow-1, y = mycolumn),888)
pvm_recv(888) {Receive B}
endfor
endfor
Tree Computations
Figure: Tree-computation example
{ Spawn and partition list based on a broadcast tree pattern. }
for i := 1 to N, such that 2^N = NumProcs
forall processors P such that P < 2^i
pvm_spawn(...) {process id P XOR 2^i}
if P < 2^(i-1) then
midpt := PartitionList(list);
{Send list[0..midpt] to P XOR 2^i}
pvm_send((P XOR 2^i),999)
list := list[midpt+1..MAXSIZE]
else
pvm_recv(999) {receive the list}
endif
endfor
endfor
{ Sort remaining list. }
Quicksort(list[midpt+1..MAXSIZE])
{ Gather/merge sorted sub-lists. }
for i := N downto 1, such that 2^N = NumProcs
forall processors P such that P < 2^i
if P > 2^(i-1) then
pvm_send((P XOR 2^i),888)
{Send list to P XOR 2^i}
else
pvm_recv(888) {receive temp list}
merge templist into list
endif
endfor
endfor
Function Decomposition
Figure: Function decomposition example
PVM User Interface
Process Control
int tid = pvm_mytid( void )
call pvmfmytid( tid )
int info = pvm_exit( void )
call pvmfexit( info )
int numt = pvm_spawn(char *task, char **argv, int flag,
char *where, int ntask, int *tids )
call pvmfspawn( task, flag, where, ntask, tids, numt )
Value Option Meaning
--------------------------------------------------------------------------
0 PvmTaskDefault PVM chooses where to spawn processes.
1 PvmTaskHost where argument is a particular host to spawn on.
2 PvmTaskArch where argument is a PVM_ARCH to spawn on.
4 PvmTaskDebug starts tasks under a debugger.
8 PvmTaskTrace trace data is generated.
16 PvmMppFront starts tasks on MPP front-end.
32 PvmHostCompl complements host set in where.
--------------------------------------------------------------------------
int info = pvm_kill( int tid )
call pvmfkill( tid, info )
int info = pvm_catchout( FILE *ff )
call pvmfcatchout( onoff )
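A small sketch tying these calls together; the task name, argument vector, and number of copies are illustrative assumptions:
int mytid, numt, tids[4];

mytid = pvm_mytid();                        /* enroll this process in PVM    */
numt = pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 4, tids);
if (numt < 4)
    printf("only %d worker tasks started\n", numt);
/* ... exchange messages with the tids[] tasks ... */
pvm_exit();                                 /* leave PVM before terminating  */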
Information
int tid = pvm_parent( void )
call pvmfparent( tid )
int dtid = pvm_tidtohost( int tid )
call pvmftidtohost( tid, dtid )
int info = pvm_config( int *nhost, int *narch,
struct pvmhostinfo **hostp )
call pvmfconfig( nhost, narch, dtid, name, arch, speed, info)
int info = pvm_tasks( int which, int *ntask,
struct pvmtaskinfo **taskp )
call pvmftasks( which, ntask, tid, ptid, dtid,
flag, aout, info )
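For example, a spawned task might find its parent and print the current virtual machine configuration. This is a sketch, and the pvmhostinfo fields used (hi_name, hi_arch, hi_speed) are the commonly documented ones:
int ptid, nhost, narch, i;
struct pvmhostinfo *hostp;

ptid = pvm_parent();                    /* tid of the task that spawned me  */
pvm_config(&nhost, &narch, &hostp);     /* snapshot of the virtual machine  */
for (i = 0; i < nhost; i++)
    printf("%s (%s) speed %d\n",
           hostp[i].hi_name, hostp[i].hi_arch, hostp[i].hi_speed);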
Dynamic Configuration
int info = pvm_addhosts( char **hosts, int nhost, int *infos)
int info = pvm_delhosts( char **hosts, int nhost, int *infos)
call pvmfaddhost( host, info )
call pvmfdelhost( host, info )
Signaling
int info = pvm_sendsig( int tid, int signum )
call pvmfsendsig( tid, signum, info )
int info = pvm_notify( int what, int msgtag, int cnt, int *tids )
call pvmfnotify( what, msgtag, cnt, tids, info )
Setting and Getting Options
int oldval = pvm_setopt( int what, int val )
int val = pvm_getopt( int what )
call pvmfsetopt( what, val, oldval )
call pvmfgetopt( what, val )
Option Value Meaning
------------------------------------------------------------------
PvmRoute 1 routing policy
PvmDebugMask 2 debugmask
PvmAutoErr 3 auto error reporting
PvmOutputTid 4 stdout destination for children
PvmOutputCode 5 output msgtag
PvmTraceTid 6 trace destination for children
PvmTraceCode 7 trace msgtag
PvmFragSize 8 message fragment size
PvmResvTids 9 allow messages to reserved tags and tids
PvmSelfOutputTid 10 stdout destination for self
PvmSelfOutputCode 11 output msgtag
PvmSelfTraceTid 12 trace destination for self
PvmSelfTraceCode 13 trace msgtag
------------------------------------------------------------------
See Appendix B for allowable values for these options.
Future expansions to this list are planned.
pvm_setopt( PvmRoute, PvmRouteDirect );
The drawback is that this faster communication method is not scalable under Unix;
hence, it may not work if the application involves over 60 tasks
that communicate randomly with each other. If it doesn't work, PVM
automatically switches back to the default communication method.
It can be called multiple times during an application
to selectively set up direct task-to-task communication links,
but typical use is to call it once after the initial call to pvm_mytid().
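A minimal sketch of that typical usage (error checking omitted):

#include <pvm3.h>

void enroll_with_direct_routing(void)
{
    int mytid = pvm_mytid();               /* enroll in PVM first */
    pvm_setopt(PvmRoute, PvmRouteDirect);  /* then request direct task-to-task links */
    /* ... the rest of the application ... */
}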
Message Passing
Message Buffers
int bufid = pvm_initsend( int encoding )
call pvmfinitsend( encoding, bufid )
int bufid = pvm_mkbuf( int encoding )
call pvmfmkbuf( encoding, bufid )
int info = pvm_freebuf( int bufid )
call pvmffreebuf( bufid, info )
int bufid = pvm_getsbuf( void )
call pvmfgetsbuf( bufid )
int bufid = pvm_getrbuf( void )
call pvmfgetrbuf( bufid )
int oldbuf = pvm_setsbuf( int bufid )
call pvmfsetsbuf( bufid, oldbuf )
int oldbuf = pvm_setrbuf( int bufid )
call pvmfsetrbuf( bufid, oldbuf )
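The fragment below, for example, forwards a received message without repacking it: the receive buffer is installed as the active send buffer, the message is sent on, and the previous send buffer is freed.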
bufid = pvm_recv( src, tag );
oldid = pvm_setsbuf( bufid );
info = pvm_send( dst, tag );
info = pvm_freebuf( oldid );
Packing Data
int info = pvm_pkbyte( char *cp, int nitem, int stride )
int info = pvm_pkcplx( float *xp, int nitem, int stride )
int info = pvm_pkdcplx( double *zp, int nitem, int stride )
int info = pvm_pkdouble( double *dp, int nitem, int stride )
int info = pvm_pkfloat( float *fp, int nitem, int stride )
int info = pvm_pkint( int *np, int nitem, int stride )
int info = pvm_pklong( long *np, int nitem, int stride )
int info = pvm_pkshort( short *np, int nitem, int stride )
int info = pvm_pkstr( char *cp )
int info = pvm_packf( const char *fmt, ... )
call pvmfpack( what, xp, nitem, stride, info )
STRING 0 REAL4 4
BYTE1 1 COMPLEX8 5
INTEGER2 2 REAL8 6
INTEGER4 3 COMPLEX16 7
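A hedged sketch of matched pack and unpack calls (the tag 42, the task ids, and the data are placeholders); the receiver must unpack items in exactly the order they were packed:

#include <pvm3.h>

/* sender: pack a count followed by a float array, then send */
void send_block(int dest_tid, float *vals, int n)
{
    pvm_initsend(PvmDataDefault);   /* fresh send buffer, default encoding */
    pvm_pkint(&n, 1, 1);            /* pack the count first */
    pvm_pkfloat(vals, n, 1);        /* then the data, stride 1 */
    pvm_send(dest_tid, 42);
}

/* receiver: unpack in the same order the items were packed */
void recv_block(int src_tid, float *vals, int *n)
{
    pvm_recv(src_tid, 42);
    pvm_upkint(n, 1, 1);
    pvm_upkfloat(vals, *n, 1);
}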
Sending and Receiving Data
int info = pvm_send( int tid, int msgtag )
call pvmfsend( tid, msgtag, info )
int info = pvm_mcast( int *tids, int ntask, int msgtag )
call pvmfmcast( ntask, tids, msgtag, info )
int info = pvm_psend( int tid, int msgtag,
void *vp, int cnt, int type )
call pvmfpsend( tid, msgtag, xp, cnt, type, info )
PVM_STR PVM_FLOAT
PVM_BYTE PVM_CPLX
PVM_SHORT PVM_DOUBLE
PVM_INT PVM_DCPLX
PVM_LONG PVM_UINT
PVM_USHORT PVM_ULONG
int bufid = pvm_recv( int tid, int msgtag )
call pvmfrecv( tid, msgtag, bufid )
int bufid = pvm_nrecv( int tid, int msgtag )
call pvmfnrecv( tid, msgtag, bufid )
int bufid = pvm_probe( int tid, int msgtag )
call pvmfprobe( tid, msgtag, bufid )
int bufid = pvm_trecv( int tid, int msgtag, struct timeval *tmout )
call pvmftrecv( tid, msgtag, sec, usec, bufid )
int info = pvm_bufinfo( int bufid, int *bytes, int *msgtag, int *tid )
call pvmfbufinfo( bufid, bytes, msgtag, tid, info )
int info = pvm_precv( int tid, int msgtag, void *vp, int cnt,
int type, int *rtid, int *rtag, int *rcnt )
call pvmfprecv( tid, msgtag, xp, cnt, type, rtid, rtag, rcnt, info )
int (*old)() = pvm_recvf(int (*new)(int buf, int tid, int tag))
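The pvm_psend()/pvm_precv() pair combines the pack/send and receive/unpack steps into single calls. A minimal sketch (the tag and buffer sizes are placeholders):

#include <pvm3.h>

#define DATATAG 77

/* send n doubles to dest in one call */
void psend_block(int dest, double *buf, int n)
{
    pvm_psend(dest, DATATAG, buf, n, PVM_DOUBLE);
}

/* receive up to max doubles from any task; report sender and count */
void precv_block(double *buf, int max)
{
    int rtid, rtag, rcnt;
    pvm_precv(-1, DATATAG, buf, max, PVM_DOUBLE, &rtid, &rtag, &rcnt);
    /* rcnt items from task rtid are now in buf */
}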
Unpacking Data
int info = pvm_upkbyte( char *cp, int nitem, int stride )
int info = pvm_upkcplx( float *xp, int nitem, int stride )
int info = pvm_upkdcplx( double *zp, int nitem, int stride )
int info = pvm_upkdouble( double *dp, int nitem, int stride )
int info = pvm_upkfloat( float *fp, int nitem, int stride )
int info = pvm_upkint( int *np, int nitem, int stride )
int info = pvm_upklong( long *np, int nitem, int stride )
int info = pvm_upkshort( short *np, int nitem, int stride )
int info = pvm_upkstr( char *cp )
int info = pvm_unpackf( const char *fmt, ... )
call pvmfunpack( what, xp, nitem, stride, info )
Dynamic Process Groups
int inum = pvm_joingroup( char *group )
int info = pvm_lvgroup( char *group )
call pvmfjoingroup( group, inum )
call pvmflvgroup( group, info )
int tid = pvm_gettid( char *group, int inum )
int inum = pvm_getinst( char *group, int tid )
int size = pvm_gsize( char *group )
call pvmfgettid( group, inum, tid )
call pvmfgetinst( group, tid, inum )
call pvmfgsize( group, size )
int info = pvm_barrier( char *group, int count )
call pvmfbarrier( group, count, info )
int info = pvm_bcast( char *group, int msgtag )
call pvmfbcast( group, msgtag, info )
int info = pvm_reduce( void (*func)(), void *data,
int nitem, int datatype,
int msgtag, char *group, int root )
call pvmfreduce( func, data, count, datatype,
msgtag, group, root, info )
PvmMax
PvmMin
PvmSum
PvmProduct
The reduction operation is performed element-wise on the input data. For example, if the data array contains two floating-point numbers and func is PvmMax, then the result contains two numbers: the global maximum of each group member's first number and the global maximum of each member's second number.
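A hedged sketch of a global element-wise sum over a group (the group name "worker", the member count, and the tag are placeholders; every member must make the same pvm_reduce() call, and the result is left on the root member):

#include <pvm3.h>

/* each member contributes n local values; the member with instance 0 gets the sums */
void global_sum(double *local, int n)
{
    int inum = pvm_joingroup("worker");
    pvm_barrier("worker", 4);      /* wait until all 4 members (say) have joined */
    pvm_reduce(PvmSum, local, n, PVM_DOUBLE, 99, "worker", 0);
    if (inum == 0) {
        /* local[] now holds the element-wise global sums */
    }
    pvm_lvgroup("worker");
}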
Program Examples
We also demonstrate the use of the pvm_notify() call to create fault-tolerant applications. We present an example that performs a matrix multiply. Lastly, we show how PVM can be used to compute heat diffusion through a wire.
Fork-Join
Each task first calls pvm_mytid(). This function must be called before any other PVM call can be made. The result of the pvm_mytid() call should always be a positive integer. If it is not, then something is seriously wrong. In fork-join we check the value of mytid; if it indicates an error, we call pvm_perror() and exit the program. The pvm_perror() call will print a message indicating what went wrong with the last PVM call. In our example the last call was pvm_mytid(), so pvm_perror() might print a message indicating that PVM hasn't been started on this machine. The argument to pvm_perror() is a string that will be prepended to any error message printed by pvm_perror(). In this case we pass argv[0], which is the name of the program as it was typed on the command line. The pvm_perror() function is modeled after the Unix perror() function.
Next the task calls pvm_parent(). The pvm_parent() function will return the TID of the task that spawned the calling task. Since we run the initial fork-join program from the Unix shell, this initial task will not have a parent; it will not have been spawned by some other PVM task but will have been started manually by the user. For the initial forkjoin task the result of pvm_parent() will not be any particular task id but an error code, PvmNoParent. Thus we can distinguish the parent forkjoin task from the children by checking whether the result of the pvm_parent() call is equal to PvmNoParent. If this task is the parent, then it must spawn the children. If it is not the parent, then it must send a message to the parent.
Errors are handled by calling pvm_exit() and then returning. The call to pvm_exit() is important because it tells PVM this program will no longer be using any of the PVM facilities. (In this case the task exits and PVM will deduce that the dead task no longer needs its services. Regardless, it is good style to exit cleanly.) Assuming the number of tasks is valid, forkjoin will then attempt to spawn the children.
The pvm_spawn() call tells PVM to start ntask tasks named argv[0]. The second parameter is the argument list given to the spawned tasks. In this case we don't care to give the children any particular command line arguments, so this value is null. The third parameter to spawn, PvmTaskDefault, is a flag telling PVM to spawn the tasks in the default location. Had we been interested in placing the children on a specific machine or a machine of a particular architecture, then we would have used PvmTaskHost or PvmTaskArch for this flag and specified the host or architecture as the fourth parameter. Since we don't care where the tasks execute, we use PvmTaskDefault for the flag and null for the fourth parameter. Finally, ntask tells spawn how many tasks to start; the integer array child will hold the task ids of the newly spawned children. The return value of pvm_spawn() indicates how many tasks were successfully spawned. If info is not equal to ntask, then some error occurred during the spawn. In case of an error, the error code is placed in the task id array, child, instead of the actual task id. The fork-join program loops over this array and prints the task ids or any error codes. If no tasks were successfully spawned, then the program exits.
The pvm_recv() call receives a message (with tag JOINTAG) from any task. The return value of pvm_recv() is an integer indicating a message buffer. This integer can be used to find out information about message buffers. The subsequent call to pvm_bufinfo() does just this; it gets the length, tag, and task id of the sending process for the message indicated by buf. In fork-join the messages sent by the children contain a single integer value, the task id of the child task. The pvm_upkint() call unpacks the integer from the message into the mydata variable. As a sanity check, forkjoin tests the value of mydata and the task id returned by pvm_bufinfo(). If the values differ, the program has a bug, and an error message is printed. Finally, the information about the message is printed, and the parent program exits.
Each child creates its message buffer with pvm_initsend(). The parameter PvmDataDefault indicates that PVM should do whatever data conversion is needed to ensure that the data arrives in the correct format on the destination processor. In some cases this may result in unnecessary data conversions. If the user is sure no data conversion will be needed since the destination machine uses the same data format, then he can use PvmDataRaw as a parameter to pvm_initsend(). The pvm_pkint() call places a single integer, mytid, into the message buffer. It is important to make sure the corresponding unpack call exactly matches the pack call. Packing an integer and unpacking it as a float will not work correctly. Similarly, if the user packs two integers with a single call, he cannot unpack those integers by calling pvm_upkint() twice, once for each integer. There must be a one-to-one correspondence between pack and unpack calls. Finally, the message is sent to the parent task using a message tag of JOINTAG.
Fork Join Example
/*
Fork Join Example
Demonstrates how to spawn processes and exchange messages
*/
/* defines and prototypes for the PVM library */
#include <pvm3.h>
#include <stdio.h>   /* printf, putchar */
#include <stdlib.h>  /* atoi */
/* Maximum number of children this program will spawn */
#define MAXNCHILD 20
/* Tag to use for the join message */
#define JOINTAG 11
int
main(int argc, char* argv[])
{
/* number of tasks to spawn, use 3 as the default */
int ntask = 3;
/* return code from pvm calls */
int info;
/* my task id */
int mytid;
/* my parents task id */
int myparent;
/* children task id array */
int child[MAXNCHILD];
int i, mydata, buf, len, tag, tid;
/* find out my task id number */
mytid = pvm_mytid();
/* check for error */
if (mytid < 0) {
/* print out the error */
pvm_perror(argv[0]);
/* exit the program */
return -1;
}
/* find my parent's task id number */
myparent = pvm_parent();
/* exit if there is some error other than PvmNoParent */
if ((myparent < 0) && (myparent != PvmNoParent)) {
pvm_perror(argv[0]);
pvm_exit();
return -1;
}
/* if i don't have a parent then i am the parent */
if (myparent == PvmNoParent) {
/* find out how many tasks to spawn */
if (argc == 2) ntask = atoi(argv[1]);
/* make sure ntask is legal */
if ((ntask < 1) || (ntask > MAXNCHILD)) { pvm_exit(); return 0; }
/* spawn the child tasks */
info = pvm_spawn(argv[0], (char**)0, PvmTaskDefault, (char*)0,
ntask, child);
/* print out the task ids */
for (i = 0; i < ntask; i++)
if (child[i] < 0) /* print the error code in decimal*/
printf(" %d", child[i]);
else /* print the task id in hex */
printf("t%x\t", child[i]);
putchar('\n');
/* make sure spawn succeeded */
if (info == 0) { pvm_exit(); return -1; }
/* only expect responses from those spawned correctly */
ntask = info;
for (i = 0; i < ntask; i++) {
/* recv a message from any child process */
buf = pvm_recv(-1, JOINTAG);
if (buf < 0) pvm_perror("calling recv");
info = pvm_bufinfo(buf, &len, &tag, &tid);
if (info < 0) pvm_perror("calling pvm_bufinfo");
info = pvm_upkint(&mydata, 1, 1);
if (info < 0) pvm_perror("calling pvm_upkint");
if (mydata != tid) printf("This should not happen!\n");
printf("Length %d, Tag %d, Tid t%x\n", len, tag, tid);
}
pvm_exit();
return 0;
}
/* i'm a child */
info = pvm_initsend(PvmDataDefault);
if (info < 0) {
pvm_perror("calling pvm_initsend"); pvm_exit(); return -1;
}
info = pvm_pkint(&mytid, 1, 1);
if (info < 0) {
pvm_perror("calling pvm_pkint"); pvm_exit(); return -1;
}
info = pvm_send(myparent, JOINTAG);
if (info < 0) {
pvm_perror("calling pvm_send"); pvm_exit(); return -1;
}
pvm_exit();
return 0;
}
% forkjoin
t10001c t40149 tc0037
Length 4, Tag 11, Tid t40149
Length 4, Tag 11, Tid tc0037
Length 4, Tag 11, Tid t10001c
% forkjoin 4
t10001e t10001d t4014b tc0038
Length 4, Tag 11, Tid t4014b
Length 4, Tag 11, Tid tc0038
Length 4, Tag 11, Tid t10001d
Length 4, Tag 11, Tid t10001e
Figure: Output of fork-join program
The Map
(ftp to netlib2.cs.utk.edu; cd pvm3/book; get refcard.ps.)
Dot Product
Example program: PSDOT.F
PROGRAM PSDOT
*
* PSDOT performs a parallel inner (or dot) product, where the vectors
* X and Y start out on a master node, which then sets up the virtual
* machine, farms out the data and work, and sums up the local pieces
* to get a global inner product.
*
* .. External Subroutines ..
EXTERNAL PVMFMYTID, PVMFPARENT, PVMFSPAWN, PVMFEXIT, PVMFINITSEND
EXTERNAL PVMFPACK, PVMFSEND, PVMFRECV, PVMFUNPACK, SGENMAT
*
* .. External Functions ..
INTEGER ISAMAX
REAL SDOT
EXTERNAL ISAMAX, SDOT
*
* .. Intrinsic Functions ..
INTRINSIC MOD
*
* .. Parameters ..
INTEGER MAXN
PARAMETER ( MAXN = 8000 )
INCLUDE 'fpvm3.h'
*
* .. Scalars ..
INTEGER N, LN, MYTID, NPROCS, IBUF, IERR
INTEGER I, J, K
REAL LDOT, GDOT
*
* .. Arrays ..
INTEGER TIDS(0:63)
REAL X(MAXN), Y(MAXN)
*
* Enroll in PVM and get my and the master process' task ID number
*
CALL PVMFMYTID( MYTID )
CALL PVMFPARENT( TIDS(0) )
*
* If I need to spawn other processes (I am master process)
*
IF ( TIDS(0) .EQ. PVMNOPARENT ) THEN
*
* Get starting information
*
WRITE(*,*) 'How many processes should participate (1-64)?'
READ(*,*) NPROCS
WRITE(*,2000) MAXN
READ(*,*) N
TIDS(0) = MYTID
IF ( N .GT. MAXN ) THEN
WRITE(*,*) 'N too large. Increase parameter MAXN to run '//
$ 'this case.'
STOP
END IF
*
* LN is the number of elements of the dot product to do
* locally. Everyone has the same number, with the master
* getting any left over elements. J stores the number of
* elements rest of procs do.
*
J = N / NPROCS
LN = J + MOD(N, NPROCS)
I = LN + 1
*
* Randomly generate X and Y
*
CALL SGENMAT( N, 1, X, N, MYTID, NPROCS, MAXN, J )
CALL SGENMAT( N, 1, Y, N, I, N, LN, NPROCS )
*
* Loop over all worker processes
*
DO 10 K = 1, NPROCS-1
*
* Spawn process and check for error
*
CALL PVMFSPAWN( 'psdot', 0, 'anywhere', 1, TIDS(K), IERR )
IF (IERR .NE. 1) THEN
WRITE(*,*) 'ERROR, could not spawn process #',K,
$ '. Dying . . .'
CALL PVMFEXIT( IERR )
STOP
END IF
*
* Send out startup info
*
CALL PVMFINITSEND( PVMDEFAULT, IBUF )
CALL PVMFPACK( INTEGER4, J, 1, 1, IERR )
CALL PVMFPACK( REAL4, X(I), J, 1, IERR )
CALL PVMFPACK( REAL4, Y(I), J, 1, IERR )
CALL PVMFSEND( TIDS(K), 0, IERR )
I = I + J
10 CONTINUE
*
* Figure master's part of dot product
*
GDOT = SDOT( LN, X, 1, Y, 1 )
*
* Receive the local dot products, and
* add to get the global dot product
*
DO 20 K = 1, NPROCS-1
CALL PVMFRECV( -1, 1, IBUF )
CALL PVMFUNPACK( REAL4, LDOT, 1, 1, IERR )
GDOT = GDOT + LDOT
20 CONTINUE
*
* Print out result
*
WRITE(*,*) ' '
WRITE(*,*) '<x,y> = ',GDOT
*
* Do sequential dot product and subtract from
* distributed dot product to get desired error estimate
*
LDOT = SDOT( N, X, 1, Y, 1 )
WRITE(*,*) '<x,y> : sequential dot product. <x,y>^ : '//
$ 'distributed dot product.'
WRITE(*,*) '| <x,y> - <x,y>^ | = ',ABS(GDOT - LDOT)
WRITE(*,*) 'Run completed.'
*
* If I am a worker process (i.e. spawned by master process)
*
ELSE
*
* Receive startup info
*
CALL PVMFRECV( TIDS(0), 0, IBUF )
CALL PVMFUNPACK( INTEGER4, LN, 1, 1, IERR )
CALL PVMFUNPACK( REAL4, X, LN, 1, IERR )
CALL PVMFUNPACK( REAL4, Y, LN, 1, IERR )
*
* Figure local dot product and send it in to master
*
LDOT = SDOT( LN, X, 1, Y, 1 )
CALL PVMFINITSEND( PVMDEFAULT, IBUF )
CALL PVMFPACK( REAL4, LDOT, 1, 1, IERR )
CALL PVMFSEND( TIDS(0), 1, IERR )
END IF
*
CALL PVMFEXIT( 0 )
*
1000 FORMAT(I10,' Successfully spawned process #',I2,', TID =',I10)
2000 FORMAT('Enter the length of vectors to multiply (1 -',I7,'):')
STOP
*
* End program PSDOT
*
END
Failure
The parent task calls pvm_notify() after spawning the tasks. The pvm_notify() call tells PVM to send the calling task a message when certain tasks exit. Here we are interested in all the children. Note that the task calling pvm_notify() will receive the notification, not the tasks given in the task id array. It wouldn't make much sense to send a notification message to a task that has exited. The notify call can also be used to notify a task when a new host has been added to or deleted from the virtual machine. This might be useful if a program wants to dynamically adapt to the currently available machines.

The pvm_kill() routine simply kills the task indicated by the task id parameter. After killing one of the spawned tasks, the parent waits on a pvm_recv(-1, TASKDIED) for the message notifying it that the task has died. The task id of the task that has exited is stored as a single integer in the notify message. The process unpacks the dead task's id and prints it out. For good measure it also prints out the task id of the task it killed. These ids should be the same. The child tasks simply wait for about a minute and then quietly exit.
Example program: failure.c
/*
Failure notification example
Demonstrates how to tell when a task exits
*/
/* defines and prototypes for the PVM library */
#include <pvm3.h>
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi */
#include <unistd.h>  /* sleep */
/* Maximum number of children this program will spawn */
#define MAXNCHILD 20
/* Tag to use for the task done message */
#define TASKDIED 11
int
main(int argc, char* argv[])
{
/* number of tasks to spawn, use 3 as the default */
int ntask = 3;
/* return code from pvm calls */
int info;
/* my task id */
int mytid;
/* my parents task id */
int myparent;
/* children task id array */
int child[MAXNCHILD];
int i, deadtid;
/* find out my task id number */
mytid = pvm_mytid();
/* check for error */
if (mytid < 0) {
/* print out the error */
pvm_perror(argv[0]);
/* exit the program */
return -1;
}
/* find my parent's task id number */
myparent = pvm_parent();
/* exit if there is some error other than PvmNoParent */
if ((myparent < 0) && (myparent != PvmNoParent)) {
pvm_perror(argv[0]);
pvm_exit();
return -1;
}
/* if i don't have a parent then i am the parent */
if (myparent == PvmNoParent) {
/* find out how many tasks to spawn */
if (argc == 2) ntask = atoi(argv[1]);
/* make sure ntask is legal */
if ((ntask < 1) || (ntask > MAXNCHILD)) { pvm_exit(); return 0; }
/* spawn the child tasks */
info = pvm_spawn(argv[0], (char**)0, PvmTaskDebug, (char*)0,
ntask, child);
/* make sure spawn succeeded */
if (info != ntask) { pvm_exit(); return -1; }
/* print the tids */
for (i = 0; i < ntask; i++) printf("t%x\t",child[i]); putchar('\n');
/* ask for notification when child exits */
info = pvm_notify(PvmTaskExit, TASKDIED, ntask, child);
if (info < 0) { pvm_perror("notify"); pvm_exit(); return -1; }
/* reap the middle child */
info = pvm_kill(child[ntask/2]);
if (info < 0) { pvm_perror("kill"); pvm_exit(); return -1; }
/* wait for the notification */
info = pvm_recv(-1, TASKDIED);
if (info < 0) { pvm_perror("recv"); pvm_exit(); return -1; }
info = pvm_upkint(&deadtid, 1, 1);
if (info < 0) pvm_perror("calling pvm_upkint");
/* should be the middle child */
printf("Task t%x has exited.\n", deadtid);
printf("Task t%x is middle child.\n", child[ntask/2]);
pvm_exit();
return 0;
}
/* i'm a child */
sleep(63);
pvm_exit();
return 0;
}
Matrix Multiply
pvm_barrier() is called to make sure all the tasks have joined the group. If the barrier is not performed, later calls to pvm_gettid() might fail, since a task may not yet have joined the group.

pvm_mcast() will send to all the tasks in the tasks array except the calling task. This procedure works well in the case of mmult, since we don't want to have to needlessly handle the extra message coming into the multicasting task with an extra pvm_recv(). Both the multicasting task and the tasks receiving the block calculate the product AB for the diagonal block and the block of B residing in the task.

The receives are fully specified pvm_recv() calls. It's tempting to use wildcards for the fields of pvm_recv(); however, such a practice can be dangerous. For instance, had we incorrectly calculated the value for up and used a wildcard for the pvm_recv() instead of down, we might have sent messages to the wrong tasks without knowing it. In this example we fully specify messages, thereby reducing the possibility of mistakes by receiving a message from the wrong task or the wrong phase of the algorithm.

It is not strictly necessary to call pvm_lvgroup(), since PVM will realize the task has exited and will remove it from the group. It is good form, however, to leave the group before calling pvm_exit(). The reset command from the PVM console will reset all the PVM groups. The pvm_gstat command will print the status of any groups that currently exist.
Example program: mmult.c
/*
Matrix Multiply
*/
/* defines and prototypes for the PVM library */
#include <pvm3.h>
#include <stdio.h>
#include <stdlib.h>  /* malloc, atoi, srand, rand */
/* Maximum number of children this program will spawn */
#define MAXNTIDS 100
#define MAXROW 10
/* Message tags */
#define ATAG 2
#define BTAG 3
#define DIMTAG 5
void
InitBlock(float *a, float *b, float *c, int blk, int row, int col)
{
int len, ind;
int i,j;
srand(pvm_mytid());
len = blk*blk;
for (ind = 0; ind < len; ind++)
{ a[ind] = (float)(rand()%1000)/100.0; c[ind] = 0.0; }
for (i = 0; i < blk; i++) {
for (j = 0; j < blk; j++) {
if (row == col)
b[j*blk+i] = (i==j)? 1.0 : 0.0;
else
b[j*blk+i] = 0.0;
}
}
}
void
BlockMult(float* c, float* a, float* b, int blk)
{
int i,j,k;
for (i = 0; i < blk; i++)
for (j = 0; j < blk; j ++)
for (k = 0; k < blk; k++)
c[i*blk+j] += (a[i*blk+k] * b[k*blk+j]);
}
int
main(int argc, char* argv[])
{
/* number of tasks to spawn, use 3 as the default */
int ntask = 2;
/* return code from pvm calls */
int info;
/* my task and group id */
int mytid, mygid;
/* children task id array */
int child[MAXNTIDS-1];
int i, m, blksize;
/* array of the tids in my row */
int myrow[MAXROW];
float *a, *b, *c, *atmp;
int row, col, up, down;
/* find out my task id number */
mytid = pvm_mytid();
pvm_advise(PvmRouteDirect);
/* check for error */
if (mytid < 0) {
/* print out the error */
pvm_perror(argv[0]);
/* exit the program */
return -1;
}
/* join the mmult group */
mygid = pvm_joingroup("mmult");
if (mygid < 0) {
pvm_perror(argv[0]); pvm_exit(); return -1;
}
/* if my group id is 0 then I must spawn the other tasks */
if (mygid == 0) {
/* find out how many tasks to spawn */
if (argc == 3) {
m = atoi(argv[1]);
blksize = atoi(argv[2]);
}
if (argc < 3) {
fprintf(stderr, "usage: mmult m blk\n");
pvm_lvgroup("mmult"); pvm_exit(); return -1;
}
/* make sure ntask is legal */
ntask = m*m;
if ((ntask < 1) || (ntask >= MAXNTIDS)) {
fprintf(stderr, "ntask = %d not valid.\n", ntask);
pvm_lvgroup("mmult"); pvm_exit(); return -1;
}
/* no need to spawn if there is only one task */
if (ntask == 1) goto barrier;
/* spawn the child tasks */
info = pvm_spawn("mmult", (char**)0, PvmTaskDefault, (char*)0,
ntask-1, child);
/* make sure spawn succeeded */
if (info != ntask-1) {
pvm_lvgroup("mmult"); pvm_exit(); return -1;
}
/* send the matrix dimension */
pvm_initsend(PvmDataDefault);
pvm_pkint(&m, 1, 1);
pvm_pkint(&blksize, 1, 1);
pvm_mcast(child, ntask-1, DIMTAG);
}
else {
/* recv the matrix dimension */
pvm_recv(pvm_gettid("mmult", 0), DIMTAG);
pvm_upkint(&m, 1, 1);
pvm_upkint(&blksize, 1, 1);
ntask = m*m;
}
/* make sure all tasks have joined the group */
barrier:
info = pvm_barrier("mmult",ntask);
if (info < 0) pvm_perror(argv[0]);
/* find the tids in my row */
for (i = 0; i < m; i++)
myrow[i] = pvm_gettid("mmult", (mygid/m)*m + i);
/* allocate the memory for the local blocks */
a = (float*)malloc(sizeof(float)*blksize*blksize);
b = (float*)malloc(sizeof(float)*blksize*blksize);
c = (float*)malloc(sizeof(float)*blksize*blksize);
atmp = (float*)malloc(sizeof(float)*blksize*blksize);
/* check for valid pointers */
if (!(a && b && c && atmp)) {
fprintf(stderr, "%s: out of memory!\n", argv[0]);
free(a); free(b); free(c); free(atmp);
pvm_lvgroup("mmult"); pvm_exit(); return -1;
}
/* find my block's row and column */
row = mygid/m; col = mygid % m;
/* calculate the neighbor's above and below */
up = pvm_gettid("mmult", ((row)?(row-1):(m-1))*m+col);
down = pvm_gettid("mmult", ((row == (m-1))?col:(row+1)*m+col));
/* initialize the blocks */
InitBlock(a, b, c, blksize, row, col);
/* do the matrix multiply */
for (i = 0; i < m; i++) {
/* mcast the block of matrix A */
if (col == (row + i)%m) {
pvm_initsend(PvmDataDefault);
pvm_pkfloat(a, blksize*blksize, 1);
pvm_mcast(myrow, m, (i+1)*ATAG);
BlockMult(c,a,b,blksize);
}
else {
pvm_recv(pvm_gettid("mmult", row*m + (row +i)%m), (i+1)*ATAG);
pvm_upkfloat(atmp, blksize*blksize, 1);
BlockMult(c,atmp,b,blksize);
}
/* rotate the columns of B */
pvm_initsend(PvmDataDefault);
pvm_pkfloat(b, blksize*blksize, 1);
pvm_send(up, (i+1)*BTAG);
pvm_recv(down, (i+1)*BTAG);
pvm_upkfloat(b, blksize*blksize, 1);
}
/* check it */
for (i = 0 ; i < blksize*blksize; i++)
if (a[i] != c[i])
printf("Error a[%d] (%g) != c[%d] (%g) \n", i, a[i],i,c[i]);
printf("Done.\n");
free(a); free(b); free(c); free(atmp);
pvm_lvgroup("mmult");
pvm_exit();
return 0;
}
One-Dimensional Heat Equation
Here we present a PVM program that calculates heat diffusion through
a substrate, in this case a wire. Consider the one-dimensional heat
equation on a thin wire:
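In standard (normalized) form this equation is

    \[ \frac{\partial A}{\partial t} = \frac{\partial^2 A}{\partial x^2}, \]

where A(x,t) is the temperature at position x along the wire at time t and the diffusion constant has been scaled to one. The codes below approximate it by an explicit finite-difference scheme, replacing the second derivative with (A(x+dx,t) - 2A(x,t) + A(x-dx,t))/dx^2 and stepping forward in time with increment dt.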
for i = 1:tsteps-1;
    t = t+dt;
    a(i+1,1) = 0;
    a(i+1,n+2) = 0;
    for j = 2:n+1;
        a(i+1,j) = a(i,j) + mu*(a(i,j+1) - 2*a(i,j) + a(i,j-1));
    end;
    t;
    a(i+1,1:n+2);
    plot(a(i,:))
end
Example program: heat.c
/*
heat.c
Use PVM to solve a simple heat diffusion differential equation,
using 1 master program and 5 slaves.
The master program sets up the data, communicates it to the slaves
and waits for the results to be sent from the slaves.
Produces xgraph ready files of the results.
*/
#include "pvm3.h"
#include <stdio.h>
#include <math.h>
#include <time.h>
#define SLAVENAME "heatslv"
#define NPROC 5
#define TIMESTEP 100
#define PLOTINC 10
#define SIZE 1000
int num_data = SIZE/NPROC;
int wh(int x, int y);    /* index helper, defined at the end of this file */
main()
{ int mytid, task_ids[NPROC], i, j;
int left, right, k, l;
int step = TIMESTEP;
int info;
double init[SIZE], solution[TIMESTEP][SIZE];
double result[TIMESTEP*SIZE/NPROC], deltax2;
FILE *filenum;
char *filename[4][7];
double deltat[4];
time_t t0;
int etime[4];
filename[0][0] = "graph1";
filename[1][0] = "graph2";
filename[2][0] = "graph3";
filename[3][0] = "graph4";
deltat[0] = 5.0e-1;
deltat[1] = 5.0e-3;
deltat[2] = 5.0e-6;
deltat[3] = 5.0e-9;
/* enroll in pvm */
mytid = pvm_mytid();
/* spawn the slave tasks */
info = pvm_spawn(SLAVENAME,(char **)0,PvmTaskDefault,"",
NPROC,task_ids);
/* create the initial data set */
for (i = 0; i < SIZE; i++)
init[i] = sin(M_PI * ( (double)i / (double)(SIZE-1) ));
init[0] = 0.0;
init[SIZE-1] = 0.0;
/* run the problem 4 times for different values of delta t */
for (l = 0; l < 4; l++) {
deltax2 = (deltat[l]/pow(1.0/(double)SIZE,2.0));
/* start timing for this run */
time(&t0);
etime[l] = t0;
/* send the initial data to the slaves. */
/* include neighbor info for exchanging boundary data */
for (i = 0; i < NPROC; i++) {
pvm_initsend(PvmDataDefault);
left = (i == 0) ? 0 : task_ids[i-1];
pvm_pkint(&left, 1, 1);
right = (i == (NPROC-1)) ? 0 : task_ids[i+1];
pvm_pkint(&right, 1, 1);
pvm_pkint(&step, 1, 1);
pvm_pkdouble(&deltax2, 1, 1);
pvm_pkint(&num_data, 1, 1);
pvm_pkdouble(&init[num_data*i], num_data, 1);
pvm_send(task_ids[i], 4);
}
/* wait for the results */
for (i = 0; i < NPROC; i++) {
pvm_recv(task_ids[i], 7);
pvm_upkdouble(&result[0], num_data*TIMESTEP, 1);
/* update the solution */
for (j = 0; j < TIMESTEP; j++)
for (k = 0; k < num_data; k++)
solution[j][num_data*i+k] = result[wh(j,k)];
}
/* stop timing */
time(&t0);
etime[l] = t0 - etime[l];
/* produce the output */
filenum = fopen(filename[l][0], "w");
fprintf(filenum,"TitleText: Wire Heat over Delta Time: %e\n",
deltat[l]);
fprintf(filenum,"XUnitText: Distance\nYUnitText: Heat\n");
for (i = 0; i < TIMESTEP; i = i + PLOTINC) {
fprintf(filenum,"\"Time index: %d\n",i);
for (j = 0; j < SIZE; j++)
fprintf(filenum,"%d %e\n",j, solution[i][j]);
fprintf(filenum,"\n");
}
fclose (filenum);
}
/* print the timing information */
printf("Problem size: %d\n",SIZE);
for (i = 0; i < 4; i++)
printf("Time for run %d: %d sec\n",i,etime[i]);
/* kill the slave processes */
for (i = 0; i < NPROC; i++) pvm_kill(task_ids[i]);
pvm_exit();
}
int wh(x, y)
int x, y;
{
return(x*num_data+y);
}
Example program: heatslv.c
/*
heatslv.c
The slaves receive the initial data from the host,
exchange boundary information with neighbors,
and calculate the heat change in the wire.
This is done for a number of iterations, sent by the master.
*/
#include "pvm3.h"
#include <stdio.h>
#include <stdlib.h>  /* malloc */
int num_data;
main()
{
int mytid, left, right, i, j, master;
int timestep;
double *init, *A;
double leftdata, rightdata, delta, leftside, rightside;
/* enroll in pvm */
mytid = pvm_mytid();
master = pvm_parent();
/* receive my data from the master program */
while(1) {
pvm_recv(master, 4);
pvm_upkint(&left, 1, 1);
pvm_upkint(&right, 1, 1);
pvm_upkint(&timestep, 1, 1);
pvm_upkdouble(&delta, 1, 1);
pvm_upkint(&num_data, 1, 1);
init = (double *) malloc(num_data*sizeof(double));
pvm_upkdouble(init, num_data, 1);
/* copy the initial data into my working array */
A = (double *) malloc(num_data * timestep * sizeof(double));
for (i = 0; i < num_data; i++) A[i] = init[i];
/* perform the calculation */
for (i = 0; i < timestep-1; i++) {
/* trade boundary info with my neighbors */
/* send left, receive right */
if (left != 0) {
pvm_initsend(PvmDataDefault);
pvm_pkdouble(&A[wh(i,0)],1,1);
pvm_send(left, 5);
}
if (right != 0) {
pvm_recv(right, 5);
pvm_upkdouble(&rightdata, 1, 1);
/* send right, receive left */
pvm_initsend(PvmDataDefault);
pvm_pkdouble(&A[wh(i,num_data-1)],1,1);
pvm_send(right, 6);
}
if (left != 0) {
pvm_recv(left, 6);
pvm_upkdouble(&leftdata,1,1);
}
/* do the calculations for this iteration */
for (j = 0; j < num_data; j++) {
leftside = (j == 0) ? leftdata : A[wh(i,j-1)];
rightside = (j == (num_data-1)) ? rightdata : A[wh(i,j+1)];
if ((j==0)&&(left==0))
A[wh(i+1,j)] = 0.0;
else if ((j==(num_data-1))&&(right==0))
A[wh(i+1,j)] = 0.0;
else
A[wh(i+1,j)]=
A[wh(i,j)]+delta*(rightside-2*A[wh(i,j)]+leftside);
}
}
/* send the results back to the master program */
pvm_initsend(PvmDataDefault);
pvm_pkdouble(&A[0],num_data*timestep,1);
pvm_send(master,7);
}
/* just for good measure */
pvm_exit();
}
int wh(x, y)
int x, y;
{
return(x*num_data+y);
}
Different Styles of Communication
Comments and Questions
Comments and questions about PVM can be sent to pvm@msr.epm.ornl.gov by e-mail. While we would like to respond to all the electronic mail received, this may not always be possible. We therefore recommend also posting messages to the newsgroup comp.parallel.pvm. This unmoderated newsgroup was established on the Internet in May 1993 to provide a forum for discussing issues related to the use of PVM. Questions (from beginner to the very experienced), advice, exchange of public-domain extensions to PVM, and bug reports can be posted to the newsgroup.
How PVM Works
Components
Task Identifiers
Architecture Classes
Message Model
Asynchronous Notification
Type Meaning
-----------------------------------------------
PvmTaskExit Task exits or crashes
PvmHostDelete Host is deleted or crashes
PvmHostAdd New hosts are added to the VM
-----------------------------------------------
PVM Daemon and Programming Library
PVM Daemon
Programming Library
Messages
Fragments and Databufs
Messages in Libpvm
Figure: Message storage in libpvm
Messages in the Pvmd
Figure: Message storage in pvmd
Pvmd Entry Points
Function Messages From
-----------------------------------------------------
loclentry() Local tasks
netentry() Remote pvmds
schentry() Local or remote special tasks
(Resource manager, Hoster, Tasker)
-----------------------------------------------------
Control Messages
Tag Meaning
----------------------------------------
TC_CONREQ Connection request
TC_CONACK Connection ack
TC_TASKEXIT Task exited/doesn't exist
TC_NOOP Do nothing
TC_OUTPUT Claim child stdout data
TC_SETTMASK Change task trace mask
----------------------------------------
PVM Daemon
Startup
Shutdown
Host Table and Machine Configuration
Host File
Tasks
Wait Contexts
Fault Detection and Recovery
Pvmd'
Starting Slave Pvmds
Figure: Timeline of addhost operation
---------------------------------------------------------------------------
pvmd' --> slave: (exec) $PVM_ROOT/lib/pvmd -s -d8 -nhonk 1 80 a9ca95:0f5a
4096 3 80a95c43:0000
slave --> pvmd': ddpro<2312> arch
Resource Manager
------------------------------------------------
Libpvm function    Default Message    RM Message
------------------------------------------------
pvm_addhost()      TM_ADDHOST         SM_ADDHOST
pvm_delhost()      TM_DELHOST         SM_DELHOST
pvm_spawn()        TM_SPAWN           SM_SPAWN
pvm_config()       TM_CONFIG          SM_CONFIG
pvm_notify()       TM_NOTIFY          SM_NOTIFY
pvm_tasks()        TM_TASK            SM_TASK
------------------------------------------------
Libpvm Library
Language Support
Connecting to the Pvmd
Protocols
Heterogeneous Network Computing
All these factors translate into reduced development and debugging time,
reduced contention for resources, reduced costs, and possibly more
effective implementations of an application.
It is these benefits that PVM seeks to exploit.
From the beginning,
the PVM software package
was designed to make programming for a
heterogeneous collection of machines straightforward.
Messages
Pvmd-Pvmd
Figure: Pvmd-pvmd packet header
Field Meaning
-----------------------------------------------------
hd_hostpart TID of pvmd
hd_mtu Max UDP packet length to host
hd_sad IP address and UDP port number
hd_rxseq Expected next packet number from host
hd_txseq Next packet number to send to host
hd_txq Queue of packets to send
hd_opq Queue of packets sent, awaiting ack
hd_nop Number of packets in hd_opq
hd_rxq List of out-of-order received packets
hd_rxm Buffer for message reassembly
hd_rtt Estimated smoothed round-trip time
-----------------------------------------------------
Figure: Host descriptors with send queues
Pvmd-Task and Task-Task
Figure: Pvmd-task packet header
Message Routing
Pvmd
Packet Buffers
Message Routing
Packet Routing
Figure: Packet and message routing in pvmd
Refragmentation
Pvmd and Foreign Tasks
Environment Variables
PVM_EXPORT=DISPLAY:SHELL
exports the variables DISPLAY and SHELL to
children tasks (and PVM_EXPORT too).
Standard Input and Output
Figure: Output states of a task
How to Use This Book
List of Symbols
Overview of the Methods
Sparse Incomplete Factorizations
Generating a CRS-based D-ILU Incomplete Factorization
for i = 1, n
    pivots(i) = val(diag_ptr(i))
end;

Each elimination step starts by inverting the pivot:

for i = 1, n
    pivots(i) = 1 / pivots(i)

For all nonzero elements a(i,j) with j > i, we next check whether a(j,i) is a nonzero matrix element, since this is the only element that can cause fill with a(j,j).

    for j = diag_ptr(i)+1, row_ptr(i+1)-1
        found = FALSE
        for k = row_ptr(col_ind(j)), diag_ptr(col_ind(j))-1
            if (col_ind(k) = i) then
                found = TRUE
                element = val(k)
            endif
        end;

If so, we update pivots(col_ind(j)).

        if (found = TRUE)
            pivots(col_ind(j)) = pivots(col_ind(j))
                                 - element * pivots(i) * val(j)
    end;
end;
CRS-based Factorization Solve
for i = 1, n
    sum = 0
    for j = row_ptr(i), diag_ptr(i)-1
        sum = sum + val(j) * z(col_ind(j))
    end;
    z(i) = pivots(i) * (x(i)-sum)
end;

for i = n, 1, (step -1)
    sum = 0
    for j = diag_ptr(i)+1, row_ptr(i+1)-1
        sum = sum + val(j) * y(col_ind(j))
    end;
    y(i) = z(i) - pivots(i) * sum
end;
The temporary vector z can be eliminated by reusing the space for y; algorithmically, z can even overwrite x, but overwriting input data is in general not recommended.
CRS-based Factorization Transpose Solve
for i = 1, n
    x_tmp(i) = x(i)
end;
for i = 1, n
    z(i) = x_tmp(i)
    tmp = pivots(i) * z(i)
    for j = diag_ptr(i)+1, row_ptr(i+1)-1
        x_tmp(col_ind(j)) = x_tmp(col_ind(j)) - tmp * val(j)
    end;
end;
for i = n, 1 (step -1)
    y(i) = pivots(i) * z(i)
    for j = row_ptr(i), diag_ptr(i)-1
        z(col_ind(j)) = z(col_ind(j)) - val(j) * y(i)
    end;
end;
The extra temporary x_tmp is used only for clarity, and can
be overlapped with z. Both x_tmp and z can be
considered to be equivalent to y. Overall, a CRS-based
preconditioner solve uses short vector lengths, indirect addressing,
and has essentially the same memory traffic patterns as that of
the matrix-vector product.
Generating a CRS-based ILU(k) Incomplete Factorization
% copy y into w
for i=1,ly
    w( yind(i) ) = y(i)
% add w to x wherever x is already nonzero
for i=1,lx
    if w( xind(i) ) <> 0
        x(i) = x(i) + w( xind(i) )
        w( xind(i) ) = 0
% add w to x by creating new components
% wherever x is still zero
for i=1,ly
    if w( yind(i) ) <> 0 then
        lx = lx+1
        xind(lx) = yind(i)
        x(lx) = w( yind(i) )
    endif
In order to add a sequence of vectors, we add the y vectors into w before executing the writes into x. A different implementation would be possible, where w is allocated as a sparse vector and its sparsity pattern is constructed during the additions. We will not discuss this possibility any further.
% copy y into w
for i=1,ly
    w( yind(i) ) = y(i)
    wlev( yind(i) ) = ylev(i)
% add w to x wherever x is already nonzero;
% don't change the levels
for i=1,lx
    if w( xind(i) ) <> 0
        x(i) = x(i) + w( xind(i) )
        w( xind(i) ) = 0
% add w to x by creating new components
% wherever x is still zero;
% carry over levels
for i=1,ly
    if w( yind(i) ) <> 0 then
        lx = lx+1
        x(lx) = w( yind(i) )
        xind(lx) = yind(i)
        xlev(lx) = wlev( yind(i) )
    endif
for k=1,n
    for j=1,k-1
        for i=j+1,n
            a(k,i) = a(k,i) - a(k,j)*a(j,i)
    for j=k+1,n
        a(k,j) = a(k,j)/a(k,k)
This is a row-oriented version of the traditional `left-looking'
factorization algorithm.
for row=1,n
    % go through elements A(row,col) with col<row
    COPY ROW row OF A() INTO DENSE VECTOR w
    for col=aptr(row),aptr(row+1)-1
        if aind(col) < row then
            acol = aind(col)
            MULTIPLY ROW acol OF M() BY A(col)
            SUBTRACT THE RESULT FROM w
            ALLOWING FILL-IN UP TO LEVEL k
        endif
    INSERT w IN ROW row OF M()
    % invert the pivot
    M(mdiag(row)) = 1/M(mdiag(row))
    % normalize the row of U
    for col=mptr(row),mptr(row+1)-1
        if mind(col) > row
            M(col) = M(col) * M(mdiag(row))
Parallelism
Inner products
Overlapping communication and computation
Figure: A rearrangement of Conjugate Gradient for parallelism
For a more detailed discussion see Demmel, Heath and Van der Vorst [67]. This algorithm can be extended trivially to preconditioners of LL^T form, and to nonsymmetric preconditioners in the Biconjugate Gradient Method.
Fewer synchronization points
Vector updates
Stationary Iterative Methods
Matrix-vector products
Preconditioning
Discovering parallelism in sequential preconditioners.
More parallel variants of sequential preconditioners.
Fully decoupled preconditioners.
Wavefronts in the Gauss-Seidel and Conjugate Gradient methods
Blocked operations in the GMRES method
Remaining topics
The Lanczos Connection
Block and s-step Iterative Methods
The Jacobi Method
Reduced System Preconditioning
Domain Decomposition Methods
Overlapping Subdomain Methods
Non-overlapping Subdomain Methods
These properties make it possible to apply a preconditioned iterative method to the Schur complement system, which is the basic method in the nonoverlapping substructuring approach. We will also need some good preconditioners to further improve the convergence of the Schur system.
Further Remarks
Multiplicative Schwarz Methods
Inexact Solves
Nonsymmetric Problems
Choice of Coarse Grid Size
Multigrid Methods
Steps 1 and 3 are called ``pre-smoothing'' and ``post-smoothing''
respectively; by applying this method recursively to step 2 it becomes
a true ``multigrid'' method. Usually the generation of subsequently
coarser grids is halted at a point where the number of variables
becomes small enough that direct solution of the linear system is feasible.
Convergence of the Jacobi method
Row Projection Methods
Obtaining the Software
mail netlib@ornl.gov
On the subject line or in the body, single or multiple requests (one per line)
may be made as follows:
send index from templates
send sftemplates.shar from templates
The first request results in a return e-mail message
containing the index from the library templates, along with brief
descriptions of its contents. The second request results in a return
e-mail message consisting of a shar file containing the single
precision FORTRAN routines and a README file. The
versions of the templates that are available are listed in the table
below:
To unpack the shar file, type:
sh templates.shar
No subdirectory will be created. The unpacked files will stay in the
current directory. Each method is written as a separate subroutine in
its own file, named after the method (e.g., CG.f, BiCGSTAB.f, GMRES.f). The argument parameters are the same
for each, with the exception of the required matrix-vector products
and preconditioner solvers (some require the transpose matrix). Also,
the amount of workspace needed varies. The details are documented in
each routine.
Overview of the BLAS
ALPHA = SDOT( N, X, 1, Y, 1 )
computes the inner product of two vectors x and y, putting the result in the scalar ALPHA; it is equivalent to the Fortran fragment
ALPHA = 0.0
DO I = 1, N
ALPHA = ALPHA + X(I)*Y(I)
ENDDO
CALL SAXPY( N, ALPHA, X, 1, Y, 1 )
multiplies the vector x of length N by the scalar ALPHA, then adds the result to the vector y, overwriting y; it is equivalent to
DO I = 1, N
Y(I) = ALPHA*X(I) + Y(I)
ENDDO
CALL SGEMV( 'N', M, N, ONE, A, LDA, X, 1, ONE, B, 1 )
computes the matrix-vector product Ax, adds it to the vector b, and returns the result in b; it is equivalent to
DO J = 1, N
DO I = 1, M
B(I) = A(I,J)*X(J) + B(I)
ENDDO
ENDDO
CALL STRMV( 'U', 'N', 'N', N, A, LDA, X, 1 )
computes the matrix-vector product Ax for an upper triangular matrix A, overwriting x with the result; it is equivalent to
DO J = 1, N
TEMP = X(J)
DO I = 1, J-1
X(I) = X(I) + TEMP*A(I,J)
ENDDO
X(J) = TEMP*A(J,J)
ENDDO
Here 'U' stands for `Upper triangular', 'N' for `No Transpose', and the final 'N' for a non-unit diagonal. This ability to exploit matrix structure will be used extensively, resulting in storage savings (among other advantages).
The argument LDA is critical for addressing the array correctly: LDA is the leading dimension of the two-dimensional array A, that is, the declared (or allocated) number of rows of the two-dimensional array A.
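To make the role of LDA concrete, here is a small illustration of column-major addressing (a sketch in Python with NumPy rather than Fortran; the sizes below are arbitrary): entry A(i,j) of an M-by-N matrix stored in an LDA-by-N array lives at flat offset (j-1)*LDA + (i-1), which is why LDA, and not the matrix dimension M, governs the addressing.

import numpy as np

M, N, LDA = 3, 4, 5                       # M-by-N matrix stored in an LDA-by-N array
store = np.zeros((LDA, N), order='F')     # column-major (Fortran-style) storage
A = store[:M, :]                          # the active M-by-N part, as a BLAS routine would see it
A[:, :] = np.arange(1, M*N + 1).reshape(M, N, order='F')

flat = store.reshape(-1, order='F')       # the underlying linear memory
i, j = 2, 3                               # 1-based indices, Fortran style
assert flat[(j - 1)*LDA + (i - 1)] == A[i - 1, j - 1]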
Glossary
The same properties hold for matrix norms. A matrix norm and a vector norm (both denoted ||.||) are called a mutually consistent pair if ||Ax|| <= ||A|| ||x|| for all matrices A and vectors x.
For matrices from problems on less regular domains, some common
orderings are:
Notation
References
The Gauss-Seidel Method
Figure: The Gauss-Seidel Method
The Successive Overrelaxation Method
Choosing the Value of ω
The Symmetric Successive Overrelaxation Method
Notes and References
Nonstationary Iterative Methods
Author's Affiliations
Los Alamos National Laboratory
University of Tennessee, Knoxville
University of California, Los Angeles
University of California, Berkeley
Oak Ridge National Laboratory
University of Tennessee, Knoxville
and Oak Ridge National Laboratory
University of California, Los Angeles
National Institute of Standards and Technology
Oak Ridge National Laboratory
Utrecht University, the Netherlands
Conjugate Gradient Method (CG)
Figure: The Preconditioned Conjugate Gradient Method
Theory
Convergence
Implementation
Further references
MINRES and SYMMLQ
Theory
CG on the Normal Equations, CGNE and CGNR
Theory
Generalized Minimal Residual (GMRES)
Figure: The Preconditioned GMRES Method
Acknowledgments
Theory
Implementation
BiConjugate Gradient (BiCG)
Figure: The Preconditioned BiConjugate Gradient Method
Convergence
Implementation
Quasi-Minimal Residual (QMR)
Figure: The Preconditioned Quasi-Minimal Residual Method without Look-ahead
Convergence
Implementation
Conjugate Gradient Squared Method (CGS)
Figure: The Preconditioned Conjugate Gradient Squared Method
Convergence
Implementation
BiConjugate Gradient Stabilized (Bi-CGSTAB)
Figure: The Preconditioned BiConjugate Gradient Stabilized Method
Convergence
Implementation
Chebyshev Iteration
Comparison with other methods
Convergence
Implementation
Figure: The Preconditioned Chebyshev Method
Computational Aspects of the Methods
Table: Summary of Operations for Iteration i. ``a/b'' means ``a'' multiplications with the matrix and ``b'' with its transpose.
Table: Storage Requirements for the Methods in Iteration i; n denotes the order of the matrix.
A short history of Krylov methods
Survey of recent Krylov methods
Preconditioners
The why and how
Cost trade-off
Left and right preconditioning
It should be noted that such methods cannot be made to reduce to the algorithms given earlier by particular choices of the preconditioning factors (for instance, by taking one of the factors to be the identity).
Jacobi Preconditioning
Block Jacobi Methods
Discussion
SSOR preconditioning
Incomplete Factorization Preconditioners
Introduction
It turns out both are true, for different groups of users.
Creating an incomplete factorization
Solving a system with an incomplete factorization preconditioner
Figure: Preconditioner solve of a system with an incomplete factorization preconditioner (two variants).
Point incomplete factorizations
Fill-in strategies
Simple cases: ILU(0) and D-ILU
Figure: Construction of a D-ILU incomplete factorization preconditioner, storing the inverses of the pivots
Special cases: central differences
Modified incomplete factorizations
Vectorization of the preconditioner solve
Figure: Wavefront solution of a triangular system from a central difference problem on a domain of grid points.
Figure: Preconditioning step algorithm for a Neumann expansion of an incomplete factorization.
Parallelizing the preconditioner solve
Block factorization methods
Why Use Templates?
The idea behind block factorizations
Figure: Block version of a D-ILU factorization
Approximate inverses
Figure: Algorithm for approximating the inverse of a banded matrix
The special case of block tridiagonality
Figure: Incomplete block factorization of a block tridiagonal matrix
Two types of incomplete block factorizations
Blocking over systems of partial differential equations
Incomplete LQ factorizations
Polynomial preconditioners
Preconditioners from properties of the differential equation
Preconditioning by the symmetric part
The use of fast solvers
What Methods Are Covered?
For each method we present a general description, including a
discussion of the history of the method and numerous references to the
literature. We also give the mathematical conditions for selecting a
given method.
Alternating Direction Implicit methods
Related Issues
Complex Systems
Stopping Criteria
More Details about Stopping Criteria
When the residual or its norm is not readily available
Estimating ||A^{-1}||
Stopping when progress is no longer being made
Accounting for floating point errors
Data Structures
Iterative Methods
Survey of Sparse Matrix Storage Formats
Compressed Row Storage (CRS)
Compressed Column Storage (CCS)
Block Compressed Row Storage (BCRS)
Compressed Diagonal Storage (CDS)
Jagged Diagonal Storage (JDS)
Skyline Storage (SKS)
Figure: Profile of a nonsymmetric skyline or variable-band matrix.
Matrix vector products
CRS Matrix-Vector Product
The matrix-vector product y = Ax using CRS format can be expressed in pseudocode as:
for i = 1, n
y(i) = 0
for j = row_ptr(i), row_ptr(i+1) - 1
y(i) = y(i) + val(j) * x(col_ind(j))
end;
end;
Since the transpose A^T is not readily available from the CRS data structure, the transpose product y = A^T x is formed by accumulating the contributions of each row of A column by column:
for i = 1, n
y(i) = 0
end;
for j = 1, n
for i = row_ptr(j), row_ptr(j+1)-1
y(col_ind(i)) = y(col_ind(i)) + val(i) * x(j)
end;
end;
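For comparison, a small executable version of the CRS product (written in Python/NumPy rather than Fortran, with 0-based indexing; the 3-by-3 example matrix is chosen purely for illustration):

import numpy as np

def crs_matvec(val, col_ind, row_ptr, x):
    # y = A*x with A held in Compressed Row Storage (0-based indices).
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]
    return y

# CRS representation of [[10, 0, 1], [0, 20, 0], [2, 0, 30]]:
val     = np.array([10.0, 1.0, 20.0, 2.0, 30.0])
col_ind = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(crs_matvec(val, col_ind, row_ptr, x))   # [11. 20. 32.]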
CDS Matrix-Vector Product
The matrix-vector product y = Ax using CDS format, where val(i,d) stores the matrix element A(i,i+d) and the nonzero diagonals run from -diag_left to diag_right, can be expressed as:
for i = 1, n
y(i) = 0
end;
for diag = -diag_left, diag_right
for loc = max(1,1-diag), min(n,n-diag)
y(loc) = y(loc) + val(loc,diag) * x(loc+diag)
end;
end;
and the transpose product y = A^T x as:
for i = 1, n
y(i) = 0
end;
for diag = -diag_right, diag_left
for loc = max(1,1-diag), min(n,n-diag)
y(loc) = y(loc) + val(loc+diag, -diag) * x(loc+diag)
end;
end;
The memory access in the CDS-based matrix-vector product (for Ax as well as A^T x) is three vectors per inner iteration. On the other hand, there is no indirect addressing, and the algorithm is vectorizable with vector lengths of essentially the matrix order n. Because of the regular data access, most machines can perform this algorithm efficiently by keeping three base registers and using simple offset addressing.
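An executable sketch of the CDS product, under the same conventions (Python/NumPy, 0-based indexing, with column diag_left + d of the storage array holding the diagonal at offset d; the tridiagonal example is an assumption for illustration only):

import numpy as np

def cds_matvec(val, diag_left, diag_right, x):
    # y = A*x with A in Compressed Diagonal Storage:
    # val[i, diag_left + d] holds A(i, i+d) for d = -diag_left, ..., diag_right.
    n = len(x)
    y = np.zeros(n)
    for d in range(-diag_left, diag_right + 1):
        for loc in range(max(0, -d), min(n, n - d)):
            y[loc] += val[loc, diag_left + d] * x[loc + d]
    return y

# Tridiagonal example (diag_left = diag_right = 1):
n = 4
A = np.diag([4.0]*n) + np.diag([-1.0]*(n-1), 1) + np.diag([-2.0]*(n-1), -1)
val = np.zeros((n, 3))
for d in (-1, 0, 1):
    for i in range(max(0, -d), min(n, n - d)):
        val[i, 1 + d] = A[i, i + d]
x = np.arange(1.0, n + 1)
print(cds_matvec(val, 1, 1, x))   # agrees with A @ x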
Templates for the Solution of Linear Systems:
Building Blocks for Iterative Methods
James Demmel, June M. Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk Van der Vorst
To retrieve the PostScript file you can use one of the following methods:
anonymous ftp to www.netlib.org:
cd templates
get templates.ps
quit
or rcp:
rcp anon@www.netlib.org:templates/templates.ps templates.ps
or an e-mail request to netlib@ornl.gov:
send templates.ps from templates
Foreword
1 Introduction
1.1 Introduction
1.2 The National Vision for Parallel Computation
Figure 1.1: The nCUBE-2 node and its integration into a board. Up to 128 of these boards can be combined into a single supercomputer.
Figure 1.2: The CM-5 produced by Thinking Machines.
Figure 1.3: The DELTA Touchstone parallel supercomputer produced by Intel and installed at Caltech.
Figure 1.4: Grand Challenge Applications. Some major applications which will be enabled by parallel supercomputers. The computer performance numbers are given in more detail in color figure 2.1.
1.3 Caltech Concurrent Computation Program
``Is the hypercube an effective computer for large-scale scientific
computing?''
``used real hardware and real software to solve real problems.''
``What are the cost, technology, and software trade-offs that will drive
the design of future parallel supercomputers?''
``What is the appropriate software environment for parallel machines and
how should one develop it?''
1.4 How Parallel Computing Works
The book quantifies and exemplifies this assertion. CrOS and its follow-on Express, described in Chapter 5, support this software paradigm. Explicit message passing is still an important software model and, in many cases, the only viable approach to high-performance parallel implementations on MIMD machines.
This chapter surveys activities related to parallel computing that took place around the time that C3P was an active project, primarily during the 1980s. The major areas covered are hardware, software, research projects, and production uses of parallel computers. In each case, there is no attempt to present a comprehensive list or survey of all the work done in that area; rather, the aim is to identify some of the major events of the period.
There are two major motivations for creating and using parallel computer architectures. The first is that, as surveyed in Section 1.2, parallelism is the only avenue to achieve vastly higher speeds than are possible now from a single processor. This was the primary motivation for initiating C3P. Table 2.1 demonstrates dramatically the rather slow increase in speed of single-processor systems for one particular brand of supercomputer, the CRAY, the most popular supercomputer in the world. Figure 2.1 (Color Plate) shows a more comprehensive sample of computer performance, measured in operations per second, from the 1940s extrapolated through the year 2000.
Figure 2.1: Historical trends of peak computer performance. In some cases, we have scaled up parallel performance to correspond to a configuration that would cost approximately $20 million.
A second motivation for the use of parallel architectures is that they should be considerably cheaper than sequential machines for systems of moderate speeds; that is, not necessarily supercomputers but instead minicomputers or mini-supercomputers would be cheaper to produce for a given performance level than the equivalent sequential system.
At the beginning of the 1980s, the goals of research in parallel computer architectures were to achieve much higher speeds than could be obtained from sequential architectures and to get much better price performance through the use of parallelism than would be possible from sequential machines.
There were parallel computers before 1980, but they did not have a widespread impact on scientific computing. The activities of the 1980s had a much more dramatic effect. Still, a few systems stand out as having made significant contributions that were taken advantage of in the 1980s. The first is the Illiac IV [ Hockney:81b ]. It did not seem like a significant advance to many people at the time, perhaps because its performance was only moderate, it was difficult to program, and had low reliability. The best performance achieved was two to six times the speed of a CDC 7600 . This was obtained on various computational fluid dynamics codes. For many other programs, however, the performance was lower than that of a CDC 7600, which was the supercomputer of choice during the early and mid-1970s. The Illiac was a research project, not a commercial product, and it was reputed to be so expensive that it was not realistic for others to replicate it. While the Illiac IV did not inspire the masses to become interested in parallel computing, hundreds of people were involved in its use and in projects related to providing better software tools and better programming languages for it. They first learned how to do parallel computing on the Illiac IV and many of them went on to make significant contributions to parallel computing in the 1980s.
The Illiac was an SIMD computer: a single-instruction, multiple-data architecture. It had 32 processing elements, each a processor with its own local memory; the processors were connected in a ring. High-level languages such as Glypnyr and Fortran were available for the Illiac IV. Glypnyr was reminiscent of Fortran and had extensions for parallel and array processing.
The ICL Distributed Array Processor (DAP) [ DAP:79a ] was a commercial product; a handful of machines were sold, mainly in England where it was designed and built. Its characteristics were that it had either 1K or 4K one-bit processors arranged in a square plane, each connected in rectangular fashion to its nearest neighbors. Like the Illiac IV, it was an SIMD system. It required an ICL mainframe as a front end. The ICL DAP was used for many university science applications. The University of Edinburgh, in particular, used it for a number of real computations in physics, chemistry, and other fields [Wallace:84a;87a]. The ICL DAP had a substantial impact on scientific computing culture, primarily in Europe. ICL did try to market it in the United States, but was never effective in doing so; the requirement for an expensive ICL mainframe as a host was a substantial negative factor.
A third important commercial parallel computer in the 1970s was the Goodyear Massively Parallel Processor (MPP) [ Batcher:85a ], [Karplus:87a, pp. 157-166]. Goodyear started building SIMD computers in 1969, but all except the MPP were sold to the military and to the Federal Aviation Administration for air traffic control. In the late 1970s, Goodyear produced the MPP which was installed at Goddard Space Flight Center, a NASA research center, and used for a variety of scientific applications. The MPP attracted attention because it did achieve high speeds on a few applications, speeds that, in the late 1970s and early 1980s, were remarkable-measured in the hundreds of MFLOPS in a few cases. The MPP had 16K one-bit processors, each with local memory, and was programmed in Pascal and Assembler.
In summary, the three significant scientific parallel computers of the 1970s were the Illiac IV, the ICL DAP, and the Goodyear MPP. All were SIMD computers. The DAP and the MPP were fine-grain systems based on single-bit processors, whereas the Illiac IV was a large-grain SIMD system. The other truly significant high-performance (but not parallel) computer of the 1970s was the CRAY 1, which was introduced in 1976. The CRAY 1 was a single-processor vector computer and as such it can also be classified as a special kind of SIMD computer because it had vector instructions. With a single vector instruction, one causes up to 64 data pairs to be operated on.
There were significant and seminal activities in parallel computing in the 1970s both from the standpoint of design and construction of systems and in the actual scientific use of the systems. However, the level of activity of parallel computing in the 1970s was modest compared to what followed in the 1980s.
In contrast to the 1970s, in the early 1980s it was MIMD (multiple instruction, multiple data) computers that dominated the activity in parallel computing. The first of these was the Denelcor Heterogeneous Element Processor (HEP). The HEP attracted widespread attention despite its terrible cost performance because of its many interesting hardware features that facilitated programming. The Denelcor HEP was acquired by several institutions, including Los Alamos, Argonne National Laboratory, Ballistic Research Laboratory, and Messerschmidt in Germany. Messerschmidt was the only installation that used it for real applications. The others, however, used it extensively for research on parallel algorithms. The HEP hardware supported both fine-grain and large-grain parallelism. Any one processor had an instruction pipeline that provided parallelism at the single instruction level. Instructions from separate processes (associated with separate user programs or tasks) were put into hardware queues and scheduled for execution once the required operands had been fetched from memory into registers, again under hardware control. Instructions from up to 128 processes could share the instruction execution pipeline. The latter had eight stages; all instructions except floating-point divide took eight machine cycles to execute. Up to 16 processors could be linked to perform large-grain MIMD computations. The HEP had an extremely efficient synchronization mechanism through a full-empty bit associated with every word of memory. The bit was automatically set to indicate whether the word had been rewritten since it had last been written into and could be set to indicate that the memory location had been read. The value of the full-empty bit could be checked in one machine cycle. Fortran, C, and Assembler could be used to program the HEP. It had a UNIX environment and was front-ended by a minicomputer. Because Los Alamos and Argonne made their HEPs available for research purposes to people who were interested in learning how to program parallel machines or who were involved in parallel algorithm research, hundreds of people became familiar with parallel computing through the Denelcor HEP [ Laksh:85a ].
A second computer that was important in the early 1980s, primarily because it exposed a large number of computational scientists to parallelism, was the CRAY X-MP/22, which was introduced in 1982. Certainly, it had limited parallelism, namely only two processors; still, it was a parallel computer. Since it was at the very high end of performance, it exposed the hardcore scientific users to parallelism, although initially mostly in a negative way. There was not enough payoff in speed or cost to compensate for the effort that was required to parallelize a program so that it would use both processors: the maximum speedup would, of course, only be two. Typically, it was less than two and the charging algorithms of most computer centers generated higher charges for a program when it used both processors than when it used only one. In a way, though, the CRAY X-MP multiprocessor legitimized parallel processing, although restricted to very large grain, very small numbers of processors. A few years later, the IBM 3090 series had the same effect; the 3090 can have up to six vector and scalar processors in one system. Memory is shared among all processors.
Another MIMD system that was influential during the early 1980s was the New York University Ultracomputer [ Gottlieb:86a ] and a related system, the IBM RP3 [ Brochard:92a ], [ Brochard:92b ], [ Darema:87a ], [ Pfister:85a ]. These systems were serious attempts to design and demonstrate a shared-memory architecture that was scalable to very large numbers of processors. They featured an interconnection network between processors and memories that would avoid hot spots and congestion. The fetch-and-add instruction that was invented by Jacob Schwartz [ Schwartz:80a ] would avoid some of the congestion problems in omega networks. Unfortunately, these systems took a great deal of time to construct and it was the late 1980s before the IBM RP3 existed in a usable fashion. At that time, it had 64 processors but each was so slow that it attracted comparatively little attention. The architecture is certainly still considered to be an interesting one, but far fewer users were exposed to these systems than to other designs that were constructed more quickly and put in places that allowed a large number of users to have at least limited access to the systems for experimentation. Thus, the importance of the Ultracomputer and RP3 projects lay mainly in the concepts.
Perhaps the most significant and influential parallel computer system of the early 1980s was the Caltech Cosmic Cube [ Seitz:85a ], developed by Charles Seitz and Geoffrey Fox. Since it was the inspiration for the C3P project, we describe it and its immediate successors in some detail [Fox:87d;88oo], [ Seitz:85a ].
The hypercube work at Caltech originated in May 1981 when, as described in Chapter 1 , Fox attended a seminar by Carver Mead on VLSI and its implications for concurrency. As described in more detail in Sections 4.1 and 4.3 , Fox realized that he could use parallel computers for the lattice gauge computations that were central to his research at the time and that his group was running on a VAX 11/780. During the summer of 1981, he and his students worked out an algorithm that he thought would be parallel and tried it out on his VAX (simulating parallelism). The natural architecture for the problems he wanted to compute was determined to be a three-dimensional grid, which happens to be 64 processors (Figure 4.3 ).
In the fall of 1981 Fox approached Chuck Seitz about building a suitable computer. After an exchange of seminars, Seitz showed great interest in doing so and had funds to build a hypercube. Given Fox's problem, a six-dimensional hypercube (64 processors) was set as the target. A memory size of 128K was chosen after some debate; applications people (chiefly Fox) wanted at least that much. A trade-off was made between the number of processors and memory size: a smaller cube would have been built if larger memory had been chosen.
From the outset a key goal was to produce an architecture with interprocessor communications that would scale well to a very large number of processors. The features that led to the choice of the hypercube topology specifically were the moderate growth in the number of channels required as the number of processors increases, and the good match between processor and memory speeds because memory is local.
The Intel 8086 was chosen because it was the only microprocessor available at the time with a floating-point co-processor, the 8087. First, a prototype 4-node system was built with wirewrap boards. It was designed, assembled, and tested during the winter of 1981-82. In the spring of 1982, message-passing software was designed and implemented on the 4-node. Eugene Brooks' proposal of simple send/receive routines was chosen and came to be known as the Crystalline Operating System (CrOS), although it was never really an operating system.
In autumn of 1982, simple lattice problems were implemented on the 4-node by Steve Otto and others. CrOS and the computational algorithm worked satisfactorily. By January 1983, Otto had the lattice gauge applications program running on the 4-node. Table 4.2 details the many projects and publications stemming from this pioneering work.
With the successful experience on the 4-node, Seitz proceeded to have printed circuit boards designed and built. The 64-node Cosmic Cube was assembled over the summer of 1983 and began operation in October 1983. It has been in use ever since, although currently it is lightly used.
The key characteristics of the Cosmic Cube are that it has 64 nodes, each with an 8086/8087 and 128K of memory, and communication channels with 2 Mbits/sec peak speed between nodes (about 0.8 Mbits/sec sustained in one direction). It is five feet long, six cubic feet in volume, and draws 700 watts.
The Cosmic Cube provided a dramatic demonstration that multicomputers could be built quickly, cheaply, and reliably. In terms of reliability, for example, there were two hard failures in the first 560,000 node hours of operation-that is, during the first year of operation. Its performance was low by today's standards, but it was still between five and ten times the performance of a DEC VAX 11/780, which was the system of choice for academic computer departments and research groups in that time period. The manufacturing cost of the system was $80,000, which at that time was about half the cost of a VAX with a modest configuration. Therefore, the price performance was on the order of 10 to 20 times better than a VAX 780. This estimate does not take into account design and software development costs; on the other hand, it was a one-of-a-kind system, so manufacturing costs were higher than for a commercial product. Furthermore, it was clearly a scalable architecture, and that is perhaps the most important feature of that particular project.
In the period from October 1983 to April 1984, a 2500-hour run of a QCD problem (Table 4.1) was completed; it achieved 95% efficiency and produced new scientific results. This demonstrated that hypercubes are well suited for QCD (as are other architectures).
As described in Section 1.3, during the fall of 1982 Fox surveyed many colleagues at Caltech to determine whether they needed large-scale computation in their research and began to examine those applications for suitability to run on parallel computers. Note that this was before the 64-node Cosmic Cube was finished, but after the 4-node gave evidence that the approach was sound. The Caltech Concurrent Computation Program (C3P) was formed in the autumn of 1982. A decision was made to develop big, fast hypercubes rather than rely on Crays. By the summer of 1984, the ten applications of Table 4.2 were running on the Cosmic Cube [ Fox:87d ].
Two key shortcomings that were soon noticed were that too much time was spent in communications and that high-speed external I/O was not available. The first was thought to be addressable with custom communication chips.
In the summer of 1983, Fox teamed with Caltech's Jet Propulsion Laboratory (JPL) to build bigger and better hypercubes. The first was the Mark II, still based on the 8086/8087 (no co-processor faster than the 8087 was yet available), but with more memory, faster communications, and twice as many nodes. The first 32 nodes began operating in September 1984. Four 32-node systems and one 128-node system were built; the latter was completed in June 1985 [ Fox:88oo ].
The Caltech project inspired several industrial companies to build commercial hypercubes. These included Intel, nCUBE [ Karplus:87a ], Ametek [ Seitz:88b ], and Floating Point Systems Corporation. Only two years after the 64-node Caltech Cosmic Cube was operational, there were commercial products on the market and installed at user sites.
With the next Caltech-JPL system, the Mark III, there was a switch to the Motorola family of microprocessors. On each node the Mark III had one Motorola 68020/68881 for computation and another 68020 for communications. The two processors shared the node memory. The first 32-node Mark III was operational in April 1986. The concept of dedicating a processor to communications has influenced commercial product design, including recently introduced systems.
In the spring of 1986, a decision was made to build a variant of the Mark III, the Mark IIIfp (originally dubbed the Mark IIIe). It was designed to compete head-on with ``real'' supercomputers. The Mark IIIfp has a daughter board at each node with the Weitek XL floating-point chip set running at , which gives a peak speed of . By January 1987, an 8-node Mark IIIfp was operational. A 128-node system was built and in the spring of 1989 achieved on two applications.
In summary, the hypercube family of computers enjoyed rapid development and was used for scientific applications from the beginning. In the period from 1982 to 1987, three generations of the family were designed, built, and put into use at Caltech. The third generation (the Mark III) even included a switch of microprocessor families. Within the same five years, four commercial vendors produced and delivered computers with hypercube architectures. By 1987, approximately 50 major applications had been completed on Caltech hypercubes. Such rapid development and adaption has few if any parallels. The remaining chapters of this book are largely devoted to lessons from these applications and their followers.
During this period, many new systems were launched by commercial companies, and several were quite successful in terms of sales. The two most successful were the Sequent and the Encore [Karplus:87a, pp. 111-126] products. Both were shared-memory, bus-connected multiprocessors of moderate parallelism. The maximum number of processors on the Encore product was 20; on the Sequent machine initially 16 and later 30. Both provided an extremely stable UNIX environment and were excellent for time-sharing. As such, they could be considered VAX killers since VAXes were the time-sharing system of choice in research groups in those days. The Sequent and the Encore provided perhaps a cost performance better by a factor of 10, as well as considerably higher total performance than could be obtained on a VAX at that time. These systems were particularly useful for smaller jobs, for time-sharing, and for learning to do parallel computing. Perhaps their most impressive aspect was the reliability of both hardware and software. They operated without interruption for months at a time, just as conventional mini-supercomputers did. Their UNIX operating system software was familiar to many users and, as mentioned before, very stable. Unlike most parallel computers whose system software requires years to mature, these systems had very stable and responsive system software from the beginning.
Another important system during this period was the Alliant [Karplus:87a, pp. 35-44]. The initial model featured up to eight vector processors, each of moderate performance. But when used simultaneously, they provided performance equivalent to a sizable fraction of a CRAY processor. A unique feature at the time was a Fortran compiler that was quite good at automatic vectorization and also reasonably good at parallelization. These compiler features, coupled with its shared memory, made this system relatively easy to use and to achieve reasonably good performance. The Alliant also supported the C language, although initially there was no vectorization or parallelization available in C. The operating system was UNIX-based. Because of its reasonably high floating-point performance and ease of use, the Alliant was one of the first parallel computers that was used for real applications. The Alliant was purchased by groups who wanted to do medium-sized computations and even computations they would normally do on CRAYs. This system was also used as a building block of the Cedar architecture project led by D. Kuck [ Kuck:86a ].
Advances in compiling technology made wide-instruction word machines an interesting and, for a few years, commercially viable architecture. The Multiflow and Cydrome systems both had compilers that effectively exploited very fine-grain parallelism and scheduling of floating-point pipelines within the processing units. Both these systems attempted to get parallelism at the instruction level from Fortran programs-the so-called dusty decks that might have convoluted logic and thus be very difficult to vectorize or parallelize in a large-grain sense. The price performance of these systems was their main attraction. On the other hand, because these systems did not scale to very high levels of performance, they were relegated to the super-minicomputer arena. An important contribution they made was to show dramatically how far compiler technology had come in certain areas.
As was mentioned earlier, hypercubes were produced by Intel, nCUBE, Ametek, and Floating Point Systems Corporation in the mid-1980s. Of these, the most significant product was the nCUBE with its high degree of integration and a configuration of up to 1024 nodes [ Palmer:86a ], [ nCUBE:87a ]. It was pivotal in demonstrating that massively parallel MIMD medium-grain computers are practical. The nCUBE featured a complete processor on a single chip, including all channels for connecting to the other nodes so that one chip plus six memory chips constituted an entire node. They were packaged on boards with 64 nodes so that the system was extremely compact, air-cooled, and reliable. Caltech had an early 512-node system, which was used in many C3P calculations, and soon afterwards Sandia National Laboratories installed the 1024-node system. A great deal of scientific work was done on those two machines and they are still in use. The 1024-node Sandia machine got the world's attention by demonstrating speedups of 1000 for several applications [ Gustafson:88a ]. This was particularly significant because it was done during a period of active debate as to whether MIMD systems could provide speedups of more than a hundred. Amdahl's law [ Amdahl:67a ] was cited as a reason why it would not be possible to get speedups greater than perhaps a hundred, even if one used 1000 processors.
Towards the end of the mid-1980s, transputer-based systems [ Barron:83a ], [ Hey:88a ], both large and small, began to proliferate, especially in Europe but also in the United States. The T800 transputer was like the nCUBE processor, a single-chip system with built-in communications channels, and it had respectable floating point performance-a peak speed of nearly and frequently achieved speeds of 1/2 to . They provided a convenient building block for parallel systems and were quite cost-effective. Their prevalent use at the time was in boards with four or eight transputers that were attached to IBM PCs, VAXes, or other workstations.
By the late 1980s, truly powerful parallel systems began to appear. The Meiko system at Edinburgh University is an example; by 1989, that computer had 400 T800s [ Wallace:88a ]. The system was being used for a number of traditional scientific computations in physics, chemistry, engineering, and other areas [ Wexler:89a ]. The system software for transputer-based systems had evolved to resemble the message-passing system software available on hypercubes. Although the transputer's two-dimensional mesh connection is in principle less efficient than hypercube connections, for systems of moderate size (only a few hundred processors), the difference is not significant for most applications. Further, any parallel architecture deficiencies were counterbalanced by the transputer's excellent communication channel performance.
Three new SIMD fine-grain systems were introduced in the late 1980s: the CM-2, the MasPar, and a new version of the DAP. The CM-2 is a version of the original Connection Machine [Hillis:85a;87a] that has been enhanced with Weitek floating-point units, one for each 32 single-bit processors, and optional large memory. In its largest configuration, such as is installed at Los Alamos National Laboratory, there are 64K single-bit processors, 2048 64-bit floating-point processors, and of memory. The CM-2 has been measured at running the unlimited Linpack benchmark solving a linear system of order 26,624 and even higher performance on some applications, e.g., seismic data processing [ Myczkowski:91a ] and QCD [ Brickner:91b ], [ Liu:91a ]. It has attracted widespread attention both because of its extremely high performance and its relative ease of use [ Boghosian:90a ], [Hillis:86a;87b]. For problems that are naturally data parallel, the CM Fortran language and compiler provide a relatively easy way to implement programs and get high performance.
The MasPar and the DAP are smaller systems that are aimed more at excellent price performance than at supercomputer levels of performance. The new DAP is front-ended by Sun workstations or VAXes. This makes it much more affordable and compatible with modern computing environments than when it required an ICL front end. DAPs have been built in ruggedized versions that can be put into vehicles, flown in airplanes, and used on ships, and have found many uses in signal processing and military applications. They are also used for general scientific work. The MasPar is the newest SIMD system. Its architecture constitutes an evolutionary approach of fine-grain SIMD combined with enhanced floating-point performance coming from the use of 4-bit (Maspar MP-1) or 32-bit (Maspar MP-2) basic SIMD units. Standard 64-bit floating-point algorithms implemented on a (SIMD) machine built around an l bit CPU take time of order machine instructions. The DAP and CM-1,2 used l=1 and here the CM-2 and later DAP models achieve floating-point performance with special extra hardware rather than by increasing l .
Two hypercubes became available just as the decade ended: the second generation nCUBE, popularly known as the nCUBE-2, and the Intel iPSC/860. The nCUBE-2 can be configured with up to 8K nodes; that configuration would have a peak speed of . Each processor is still on a single chip along with all the communications channels, but it is about eight times faster than its predecessor-a little over . Communication bandwidth is also a factor of eight higher. The result is a potentially very powerful system. The nCUBE-2 has a custom microprocessor that is instruction-compatible with the first-generation system. The largest system known to have been built to date is a 1024 system installed at Sandia National Laboratories. The unlimited size Linpack benchmark for this system yielded a performance of solving a linear system of order 21,376.
The second hypercube introduced in 1989 (and first shipped to a customer, Oak Ridge, in January 1990), the Intel iPSC/860, has a peak speed of over . While the communication speed between nodes is very low compared to the speed of the i860 processor, high speeds can be achieved for problems that do not require extensive communication or when the data movement is planned carefully. For example, the unlimited size Linpack benchmark on the largest configuration iPSC/860, 128 processors, ran at when solving a system of order 8,600.
The iPSC/860 uses the Intel i860 microprocessor, which has a peak speed of full precision and with 32-bit precision. In mid-1991, a follow-on to the Intel iPSC/860, the Intel Touchstone Delta System, reached a Linpack speed of for a system of order 25,000. This was done on 512 i860 nodes of the Delta System. This machine has a peak speed of and of memory and is a one-of-a-kind system built for a consortium of institutions and installed at California Institute of Technology. Although the C3P project is finished at Caltech, many C3P applications have very successfully used the Delta. The Delta uses a two-dimensional mesh connection scheme with mesh routing chips instead of a hypercube connection scheme. The Intel Paragon, a commercial product that is the successor to the iPSC/860 and the Touchstone Delta, became available in the fall of 1992. The Paragon has the same connection scheme as the Delta. Its maximum configuration is 4096 nodes. It uses a second generation version of the i860 microprocessor and has a peak speed of .
The BBN TC2000 is another important system introduced in the late 1980s. It provides a shared-memory programming environment supported by hardware. It uses a multistage switch based on crossbars that connect processor memory pairs to each other [Karplus:87a, pp. 137-146], [ BBN:87a ]. The BBN TC2000 uses Motorola 88000 Series processors. The ratio of speeds between access to data in cache, to data respectively in the memory local to a processor, and to data in some other processor's memory, is approximately one, three and seven. Therefore, there is a noticeable but not prohibitive penalty for using another processor's memory. The architecture is scalable to over 500 processors, although none was built of that size. Each processor can have a substantial amount of memory, and the operating system environment is considered attractive. This system is one of the few commercial shared-memory MIMD computers that can scale to large numbers of nodes. It is no longer in production; the BBN Corporation terminated its parallel computer activities in 1991.
By the late 1980s, several highly parallel systems were able to achieve high levels of performance-the Connection Machine Model CM-2, the Intel iPSC/860, the nCUBE-2, and, early in the decade of the '90s, the Intel Touchstone Delta System. The peak speeds of these systems are quite high and, at least for some applications, the speeds achieved are also high, exceeding those achieved on vector supercomputers. The fastest CRAY system until 1992 was a CRAY Y-MP with eight processors, a peak speed of , and a maximum speed observed for applications of . In contrast, the Connection Machine Model CM-2 and the Intel Delta have achieved over for some real applications [ Brickner:89b ], [ Messina:92a ], [Mihaly:92a;92b]. There are some new Japanese vector supercomputers with a small number of processors (but a large number of instruction pipelines) that have peak speeds of over .
Finally, the vector computers continued to become faster and to have more processors. For example, the CRAY Y-MP C-90 that was introduced in 1992 has sixteen processors and a peak speed of .
By 1992, parallel computers were substantially faster. As was noted above, the Intel Paragon has a peak speed of . The CM-5, an MIMD computer introduced by Thinking Machines Corporation in 1992, has a maximum configuration of 16K processors, each with a peak speed of . The largest system at this writing is a 1024-node configuration in use at Los Alamos National Laboratory.
New introductions continue: in the fall of 1992, Fujitsu (Japan) and Meiko (U.K.) introduced distributed-memory parallel machines whose high-performance nodes feature a vector unit, each using a different VLSI implementation of the node of Fujitsu's high-end vector supercomputer. In 1993, major Cray and Convex systems were built around Digital and HP RISC microprocessor nodes.
Recently, an interesting new class of architecture has appeared: a distributed-memory design supported by special hardware that presents the appearance of shared memory to the user. The goal is to combine the cost-effectiveness of distributed memory with the programmability of a shared-memory architecture. There are two major university projects: DASH at Stanford [ Hennessy:93a ], [ Lenoski:89a ] and Alewife [ Agarwal:91a ] at MIT. The first commercial machine, the Kendall Square KSR-1, was delivered to customers in the fall of 1991. A high-performance ring supports the apparent shared memory, which is essentially a distributed dynamic cache. The ring can be scaled up to 32 nodes that can be joined hierarchically to a full-size, 1024-node system that could have a performance of approximately . Burton Smith, the architect of the pioneering Denelcor HEP-1, has formed Tera Computer, whose machine has a virtual shared memory and other innovative features. The direction of parallel computing research could be profoundly affected if this architecture proves successful.
In summary, the 1980s saw an incredible level of activity in parallel computing, much greater than most people would have predicted. Even those projects that in a sense failed-that is, that were not commercially successful or, in the case of research projects, failed to produce an interesting prototype in a timely fashion-were nonetheless useful in that they exposed many people to parallel computing at universities, computer vendors, and (as outlined in Chapter 19 ) commercial companies such as Xerox, DuPont, General Motors, United Technologies, and aerospace and oil companies.
As a gross generalization of the situation in the 1980s, there was good software on low- and medium-performance systems such as the Alliant, Sequent, Encore, and Multiflow machines (uninteresting to those preoccupied with the highest performance levels), while the software on the highest-performance systems was of poor quality. In addition, there is little or no software aimed at managing the system and providing a service to a diverse user community. There is typically no software that provides information on who uses the system and how much, that is, accounting and reporting software. Batch schedulers are typically not available. Controls for limiting the amount of time interactive users can take on the system at any one time are also missing. Ways of managing the on-line disks are non-existent. In short, the system software provided with high-performance parallel computers is at best suitable for systems used by a single person or a small, tightly knit group of people.
In contrast, in the area of computer languages and compilers for those languages for parallel machines, there has been a significant amount of progress, especially in the late 1980s, for example [ AMT:87a ]. In February of 1984, the Argonne Workshop on Programming the Next Generation of Supercomputers was held in Albuquerque, New Mexico [ Smith:84a ]. It addressed topics such as:
Many people came to the workshop and showed high levels of interest, including leading computer vendors, but not very much happened in terms of real actions by compiler writers or standards-making groups. By the late 1980s, the situation had changed. Now the Parallel Computing Forum is healthy and well attended by vendor representatives. The Parallel Computing Forum was formed to develop a shared-memory multiprocessor model for parallel processing and to establish language standards for that model beginning with Fortran and C. In addition, the ANSI Standards Committee X3 formed a new technical committee, X3H5, named Parallel Processing Constructs for High Level Programming Languages. This technical committee will work on a model based upon standard practice in shared memory parallel processing. Extensions for message-passing-based parallel processing are outside the scope of the model under consideration at this time. The first meeting of X3H5 was held March 23, 1990.
Finally, there are efforts under way to standardize language issues for parallel computing, at least for certain programming models. In the meantime, there has been progress in compiler technology. The compilers provided with Alliant and Multiflow machines, before those companies went out of business, could be quite good at producing efficient code for each processor and relatively good at automatic parallelization. On the other hand, compilers for the processors used in multicomputers generally produce inefficient code for the floating-point hardware. Generally, these compilers do not perform even the standard optimizations that have nothing to do with fancy instruction scheduling, nor do they do any automatic parallelization for the distributed-memory computers. Automatic parallelization for distributed-memory, as well as shared-memory, systems is a difficult task, and it will clearly be a few more years before good compilers exist for it; still, it is a shame that so little effort is invested in producing efficient code for single processors. There are known compilation techniques that would deliver a much greater percentage of the peak speed on commonly used microprocessors than the existing compilers currently achieve.
As for languages, despite much work and interest in new languages, in most cases people still use Fortran or C with minor additions or calls to system subroutines. The language known as Connection Machine Fortran or CM-Fortran is, as discussed in Section 13.1 , an important exception. It is, of course, based largely on the array extensions of Fortran 90, but is not identical to that. One might note that CM-Fortran array extensions are also remarkably similar to those defined in the Department of Energy Language Working Group Fortran effort of the early 1980s [ Wetherell:82a ]. Fortran 90 itself was influenced by the LWG Fortran; in the early and mid-1980s, there were regular and frequent interactions between the DOE Language Working Group and the Fortran Standards Committee. A recent variant of Fortran 90 designed for distributed-memory systems is Fortran 90D [ Fox:91e ], which, as described in Chapter 13 , is the basis of an informal industry standard for data-parallel Fortran-High Performance Fortran (HPF) [ Kennedy:93a ]. HPF has attracted a great deal of attention from both users and computer vendors and it is likely to become a de facto standard in one or two years. The time for such a language must have finally come. The Fortran developments are mirrored by those in other languages, with C and, in particular, C++ receiving the most attention. Among many important projects, we select pC++ at Indiana University [ Bodin:91a ], which extends C++ so that it incorporates essentially the HPF parallel constructs. Further, C++ allows one to define more general data structures than the Fortran array; correspondingly pC++ supports general parallel collections.
Other languages that have seen some use include Linda [ Gelertner:89a ], [ Ahuja:86a ], and Strand [ Foster:90a ]. Linda has been particularly successful as a coordination language allowing one to link the many individual components of what we term metaproblems -a concept developed throughout this book and particularly in Chapters 3 and 18 . A more recent language effort is Program Composition Notation (PCN), which is being developed at the Center for Research on Parallel Computation (an NSF Science and Technology Center) [ Chandy:90a ]. PCN is a parallel programming language in its own right, but additionally has the feature that one can take existing Fortran and C functions and subprograms and use them through PCN as part of a PCN parallel program. PCN is in some ways similar to Strand, which is a dataflow-oriented logic language in the flavor of Prolog. PCN has been extended to CC++ [ Chandy:92a ] (Compositional C++), supporting general functional parallelism. Chandy reports that the novel syntax of PCN was uncomfortable for users familiar with existing languages. This motivated his group to embody the PCN ideas in widely used languages, with CC++ for C and C++ (sequential) users, and Fortran-M for Fortran users. The combination of CC++ and data-parallel pC++ is termed HPC++, and this is an attractive candidate for the software model that could support general metaproblems . The requirements and needs for such software models will become clear from the discussion in this book, and are summarized in Section 18.1.3 .
Substantial efforts have been put into developing tools that facilitate parallel programming, for both shared-memory and distributed-memory systems, e.g., [ Clarke:91a ], [ Sunderam:90a ], [ Whiteside:88a ]. For shared-memory systems, for example, there are SCHEDULE [ Hanson:90a ], MONMACS, and FORCE. MONMACS and FORCE both provide higher-level parallel programming constructs, such as barrier synchronization and DO ALL, that are useful for shared-memory environments. SCHEDULE provides a graphical interface for producing functionally decomposed programs for shared-memory systems. With SCHEDULE, one specifies a tree of calls to subroutines, and SCHEDULE facilitates and partially automates the creation of Fortran or C programs (augmented by appropriate system calls) that implement the call graphs. For distributed-memory environments, there are also several libraries or small operating systems that provide extensions to Fortran and C for programming on such architectures. A subset of MONMACS falls into that camp. More widely used systems in this area include the Cosmic Environment/Reactive Kernel [ Seitz:88a ] (see Chapter 16 ), Express [ ParaSoft:88a ] (discussed in detail in Chapter 5 ), and PICL [ Sunderam:90a ]. These systems provide message-passing routines, in some cases including routines that perform global operations on data, such as broadcast. They may also provide facilities for measuring performance or collecting data about message traffic, CPU utilization, and so on. Some debugging capabilities may also be provided. These are all general-purpose tools and programming environments, and they have been used for a wide variety of applications, chiefly scientific and engineering, but also non-numerical ones.
In addition, there are many tools that are domain-specific in some sense. Examples of these would be the Distributed Irregular Mesh Environment (DIME) by Roy Williams [Williams:88a;89b] (described in Chapter 10 ), and the parallel ELLPACK [ Houstis:90a ] partial differential equation solver and domain decomposer [Chrisochoides:91b:93a] developed by John Rice and his research group at Purdue. DIME is a programming environment for calculations with irregular meshes; it provides adaptive mesh refinement and dynamic load balancing. There are also some general purpose tools and programming systems, such as Sisal from Livermore, that provide a dataflow-oriented language capability; and Parti [Saltz:87a;91b], [ Berryman:91a ], which facilitates, for example, array mappings on distributed-memory machines. Load-balancing tools are described in Chapter 11 and, although they look very promising, they have yet to be packaged in a robust form for general users.
None of the general-purpose tools has emerged as a clear leader. Perhaps there is still a need for more research and experimentation with such systems.
There was remarkable progress during the 1980s in most areas related to high-performance computing in general and parallel computing in particular. There are now substantial numbers of people who use parallel computers to get real applications work done, in addition to many people who have developed and are developing new algorithms, new operating systems, new languages, and new programming paradigms and software tools for massively parallel and other high-performance computer systems. It was during this decade, especially in the last half, that there was a very quick transition towards identifying high-performance computing strongly with massively parallel computing. In the early part of the decade, only large, vector-oriented systems were used for high-performance computing. By the end of the decade, while most such work was still being done on vector systems, some of the leading-edge work was already being done on parallel systems. This includes work at universities and research laboratories, as well as in industrial applications. By the end of the decade, oil companies, brokerage companies on Wall Street, and database users were all taking advantage of parallelism in addition to the traditional scientific and engineering fields. The C³P efforts played an important role in advancing parallel hardware, software, and applications. As this chapter indicates, many other projects contributed to this advance as well.
Certain areas in the design of parallel computer systems are still frustratingly neglected, including the ratio of internal computational speed to input and output speed, and the speed of communication between the processors in distributed-memory systems. Latency for both I/O and communication is still very high. Compilers are often still crude. Operating systems still lack stability and even the most fundamental system management tools. Nevertheless, much progress was made.
By the end of the 1980s, higher speeds than on any sequential computer were indeed achieved on the parallel computer systems, and this was done for a few real applications. In a few cases, the parallel systems even proved to be cheaper, that is, more cost-effective, than sequential computers of equivalent power. This was achieved despite a truly dramatic increase in the performance of sequential microprocessors, especially floating-point units, in the late 1980s. So, both key objectives of parallel computing-the highest achievable speed and more cost-effective performance-were achieved and demonstrated in the 1980s.
Computing is a controversial field. In more traditional fields, such as mathematics and physics, there is usually general agreement on the key issues-which ideas and research projects are ``good,'' what are the critical questions for the future, and so on. There is no such agreement in computing on either the academic or industrial sides. One simple reason is that the field is young-roughly forty years old. However, another important aspect is the multidisciplinary nature of the field. Hardware, software, and applications involve practitioners from very different academic fields with different training, prejudices, and goals. Answering the question, ``Does and How Does Parallel Computing Work?''
requires ``Solving real problems with real software on real hardware''
and so certainly involves aspects of hardware, software, and applications. Thus, some sort of mix of disciplines seems essential in spite of the difficulties in combining several disciplines.
The Caltech Concurrent Computation Program attempted to cut through some of the controversy by adopting an interdisciplinary rather than multidisciplinary methodology. We can consider the latter as separate teams of experts, as shown in Figures 3.1 and 3.2 , each of which tackles one component of the total project. In C³P, we tried an integrated approach, illustrated in Figure 3.3 . This did not supplant the traditional fields but rather augmented them with a group of researchers with a broad range of skills that to a greater or lesser degree spanned those of the core areas underlying computing. We will return to these discipline issues in Chapter 20 , but note here that the current discussion is simplistic and just designed to give context to the following analysis. The assignment of hardware to electrical engineering and software to computer science (with an underlying discipline of mathematics) is particularly idealized. Indeed, in many schools, these components are integrated. However, it is perhaps fair to say that experts in computer hardware have significantly different training and background from experts in computer software.
Figure 3.1:
The Multi-Disciplinary (Three-Team) Approach to Computing
Figure 3.2:
An Alternative (Four-Team) Multi-Disciplinary Approach to Computing
We believe that much of the success (and perhaps also the failures) of C³P can be traced to its interdisciplinary nature. In this spirit, we will provide here a partial integration of the disparate contributions in this volume with a mathematical framework that links hardware, software, and applications. In this chapter, we will describe the principles behind this integration, which will then be amplified and exemplified in the following chapters. This integration is usually contained in the first introductory section of each chapter. In Section 3.2 , we define a general methodology for computation and propose that it be described as mappings between complex systems. The latter are only loosely defined, but several examples are given in Section 3.3 , while relevant properties of complex systems are given in Sections 3.4 through 3.6 . Section 3.7 describes mappings between different complex systems and how this allows one to classify software approaches. Section 3.8 uses this formalism to state our results and what we mean by ``parallel computing works.'' In particular, it offers the possibility of a quantitative approach to such questions as,
``Which parallel machine is suitable for which problems?'' and
``What software models are suitable for what problems on what machines?''
There is no agreed-upon process behind computation, that is, behind the use of a computer to solve a particular problem. We have tried to quantify this in Figures 3.1 (b), 3.2 (b), and 3.3 which show a problem being first numerically formulated and then mapped by software onto a computer.
Figure 3.3:
An Interdisciplinary Approach to Computing with Computational
Science Shown Shaded
Even if we could get agreement on such an ``underlying'' process, the definitions of the parts of the process are not precise and correspondingly the roles of the ``experts'' are poorly defined. This underlies much of the controversy and, in particular, explains why we cannot at present, and probably never will be able to, define ``The best software methodology for parallel computing.''
How can we agree on a solution (What is the appropriate software?) unless we can agree on the task it solves?
``What is computation and how can it be broken up into components?'' In other words, what is the underlying process?
In spite of our pessimism that there is any clean, precise answer to this question, progress can be made with an imperfect process defined for computation. In the earlier figures, it was stressed that software could be viewed as mapping problems onto computers. We can elaborate this as shown in Figure 3.4 , with the solution to a question pictured as a sequence of idealizations or simplifications which are finally mapped onto the computer. This process is spelled out for four examples in Figures 3.5 and 3.6 . In each case, we have tried to indicate possible labels for components of the process. However, this can only be approximate. We are not aware of an accepted definition for the theoretical or computational parts of the process. Again, which parts are modelling or simulation? Further, there is only an approximate division of responsibility among the various ``experts''; for example, between the theoretical physicist and the computational physicist, or among aerodynamics, applied mathematics, and computer science. We have also not illustrated that, in each case, the numerical algorithm is dependent on the final computer architecture targeted; in particular, the best parallel algorithm is often different from the best conventional sequential algorithm.
Figure 3.4:
An Idealized Process of Computation
Figure 3.5:
A Process for Computation in Two Examples in Basic Physics
Simulations
Figure 3.6:
A Process for Computation in Two Examples from Large Scale
System Simulations
We can abstract Figures 3.5 and 3.6 into a sequence of maps between complex systems (Equation 3.1).
We have anticipated Chapter 5 and broken the software into a high-level component (such as a compiler) and a lower-level one (such as an assembler) which maps a ``virtual computer'' onto the particular machine under consideration. In fact, the software could have more stages, but two is the most common case for simple (sequential) computers.
A complex system, as used here, is defined as a collection of fundamental entities whose static or dynamic properties define a connection scheme between the entities. Complex systems have a structure or architecture. For instance, a binary hypercube parallel computer of dimension d is a complex system with 2^d members connected in a hypercube topology. We can focus in on a node of the hypercube and expand the node, viewed itself as a complex system, into a collection of memory hierarchies, caches, registers, CPU, and communication channels. Even here, we find another ill-defined point, with the complex system representing the computer depending on the resolution or granularity with which you view the system. The importance of the architecture of a computer system has been recognized for many years. We suggest that the architecture or structure of the problem is comparably interesting. Later, we will find that the performance of a particular problem or machine can be studied in terms of the match (similarity) between the architectures of the problem and computer complex systems defined in Equation 3.1 . We will find that the structure of the appropriate parallel software will depend on the broad features of the (similar) architectures of the problem and the computer. This can be expected, as software maps the two complex systems into each other.
At times, we have talked in terms of problem architecture, but this is ambiguous since it could refer to any of the complex systems which can and usually do have different architectures. Consider the second example of Figure 3.5 with the computational fluid dynamics study of airflow. In the language of Equation 3.1 :
In the previous section, we showed how the process of computation could be viewed as mappings between complex systems. As the book progresses, we will quantify this by providing examples that cover a range of problem architectures. In the next three sections, we will set up the general framework and define terms which will be made clearer later on as we see the explicit problems with their different architectures. The concept of complex systems may have very general applicability to a wide range of fields but here we will focus solely on their application to computation. Thus, our discussion of their properties will only cover what we have found useful for the task at hand. These properties are surely more generally applicable, and one can expect that other ideas will be needed in a general discussion. Section 3.3 gives examples and a basic definition of a complex system and its associated space-time structure. Section 3.4 defines temporal properties and, finally, Section 3.5 spatial structures.
We wish to understand the interesting characteristics or structure of a complex system. We first introduce the concept of space-time into a general complex system. As shown in Figure 3.7 , we consider a general complex system as a space , or data domain, that evolves deterministically or probabilistically with time. Often, the space-time associated with a given complex system is identical with physical space-time but sometimes it is not. Let us give some examples.
Figure 3.7:
(a) Synchronous, Loosely Synchronous (Static), and (b)
Asynchronous (Dynamic) Complex Systems with their Space-Time Structure
Consider instead the solution of the elliptic partial differential equation
for the electrostatic potential in the presence of a charge density. A simple, albeit usually non-optimal, approach to solving Equation 3.4 is a Jacobi iteration, which in the special case of two dimensions and zero charge density (Laplace's equation) involves the iterative procedure
where we assume that integer values of the indices x and y label the two-dimensional grid on which Laplace's equation is to be solved. The complex system defined by Equation 3.5 has spatial domain defined by the grid and a temporal dimension defined by the iteration index n . Indeed, the Jacobi iteration is mathematically related to solving the parabolic partial differential equation
where one relates the discretized time t to the iteration index n . This equivalence between Equations 3.5 and 3.6 is qualitatively preserved when one compares the solution of Equations 3.3 and 3.5 . As long as one views iteration as a temporal structure, Equations 3.3 and 3.4 can be formulated numerically with isomorphic complex systems . This implies that parallelization issues, both hardware and software, are essentially identical for both equations.
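The iterative update formula itself (Equation 3.5) is not reproduced above, so as a concrete reminder of the structure being described, the following minimal C sketch implements the standard Jacobi update for Laplace's equation on a regular two-dimensional grid; the grid dimensions, boundary treatment, and sweep count are illustrative assumptions rather than values taken from the text.

#include <string.h>

#define NX 64            /* illustrative grid dimensions */
#define NY 64
#define NSWEEPS 1000     /* illustrative iteration count */

/* One "time step" of the Jacobi iteration: every interior grid point is
   replaced by the average of its four neighbours, using only values from
   sweep n to compute sweep n+1. */
static void jacobi_sweep(double phi[NX][NY], double phi_new[NX][NY])
{
    for (int x = 1; x < NX - 1; x++)
        for (int y = 1; y < NY - 1; y++)
            phi_new[x][y] = 0.25 * (phi[x + 1][y] + phi[x - 1][y] +
                                    phi[x][y + 1] + phi[x][y - 1]);
}

void solve_laplace(double phi[NX][NY])
{
    static double old[NX][NY];
    for (int n = 0; n < NSWEEPS; n++) {
        memcpy(old, phi, sizeof old);   /* keep boundary rows and columns fixed */
        jacobi_sweep(old, phi);         /* sweep n defines "time" n+1           */
    }
}

Every interior point is updated by the same rule using only values from the previous sweep, so the sweep index n plays exactly the role of the temporal dimension discussed above, while the grid itself is the spatial domain.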
The above example illustrates the most important form of parallelism-namely, data parallelism . This is produced by parallel execution of a computational (temporal) algorithm on each member of a space or data domain . Data parallelism is essentially synonymous with either massive parallelism or massive data parallelism . Spatial domains are usually very large, with from to members today; thus exploiting this data parallelism does lead to massive parallelism.
Parallelization of this is covered fully in [ Fox:88a ] and Chapter 8 of this book. Gaussian elimination (LU decomposition) for solving Equation 3.7 involves successive steps where in the simplest formulation without pivoting, at step k one ``eliminates'' a single variable where the index . At each step k , one modifies both and
and , are formed from ,
where one ensures (if no pivoting is employed) that when j>k . Consider the above procedure as a complex system. The spatial domain is formed by the matrix A with a two-dimensional array of values. The time domain is labelled by the index k and so is a discrete space with n (the number of rows or columns of A ) members. The spatial domain is also discrete, with n x n members.
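Since the elimination formulas referred to above are not reproduced here, the following minimal C sketch of LU decomposition without pivoting may help make the structure concrete: the step index k is the sequential ``time'' dimension, while the updates at each step range over the remaining submatrix, the spatial domain. The fixed matrix order N is an illustrative assumption.

#define N 8   /* illustrative matrix order */

/* In-place LU decomposition without pivoting: after the call, the strict
   lower triangle of a holds the multipliers (L) and the upper triangle
   holds U.  Step k eliminates column k below the diagonal; the k loop is
   the sequential "time" dimension, while the (i,j) updates at each step
   are data parallel over the trailing submatrix. */
void lu_decompose(double a[N][N])
{
    for (int k = 0; k < N; k++) {               /* "time" step k        */
        for (int i = k + 1; i < N; i++) {
            a[i][k] /= a[k][k];                 /* multiplier l(i,k)    */
            for (int j = k + 1; j < N; j++)     /* rank-1 update of the */
                a[i][j] -= a[i][k] * a[k][j];   /* trailing submatrix   */
        }
    }
}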
As shown in Equation 3.1 , we will use complex systems to unify a variety of different concepts, including nature and an underlying theory such as Quantum Chromodynamics; the numerical formulation of the theory; the result of expressing this with various software paradigms; and the final computer used in its simulation. Different disciplines have correctly been built up around these different complex systems. Correspondingly, different terminology is often used to describe related issues. This is certainly reasonable for both historical and technical reasons. However, we argue that understanding the process of computation and answering questions such as, ``Which parallel computers are good for which problems?''; ``What problems parallelize?''; and ``What are productive parallel software paradigms?'' is helped by a terminology which bridges the different complex systems. We can illustrate this with an anecdote. In a recent paper, an illustration of particles in the universe was augmented with a hierarchical set of clusters produced with the algorithm of Section 12.4 . These clusters are designed to accurately represent the different length scales and physical clustering of the clouds of particles. This picture was labelled ``data structure,'' but one computer science referee noted that this was not appropriate. Indeed, the referee was in one sense correct-we had not displayed a computer science data structure such as a Fortran array or C structure defining the linked list of particles. However, taking the point of view of the physicist, this picture was precisely showing the structure of the data, and so the caption was correct in one discipline (physics) and false in another (computer science)!
We will now define and discuss some general properties and parameters of complex systems which span the various disciplines involved.
We will first discuss possible temporal structures for a complex system. Here, we draw on a computer science classification of computer architecture. In this context, aspects such as internode topology refer to the spatial structure of the computer viewed as a complex system. The control structure of the computer refers to the temporal behavior of its complex system. In our review of parallel computer hardware, we have already introduced the concepts of SIMD and MIMD, two important temporal classes which carry over to general complex systems. Returning to Figures 3.7 (a) and 3.7 (b), we see complex systems which are MIMD (or asynchronous, as defined below) in Figure 3.7 (b) and either SIMD or a restricted form of MIMD in Figure 3.7 (a) (synchronous or loosely synchronous in the language below). In fact, when we consider the temporal structure of problems, software, and hardware (the three complex systems in Equation 3.1 ), we will need to extend this classification further. Here we will briefly define the concepts and give the section numbers where we discuss and illustrate them more fully.
Society shows many examples of loosely synchronous behavior. Vehicles proceed more or less independently on a city street between loose synchronization points defined by traffic lights. The reader's life is loosely synchronized by such events as meals and bedtime.
When we consider computer hardware and software systems, we will need to consider other temporal classes which can be thought of as further subdivisions of the asynchronous class.
In Figure 3.8 , we summarize these temporal classifications for complex systems, indicating a partial ordering with arrows pointing to more general architectures. This will become clearer in Section 3.5 when we discuss software and the relation between problem and computer. Note that although the language is drawn from the point of view of computer architecture, the classifications are important at the problem, software, and hardware level.
Figure 3.8:
Partial Ordering of Temporal (Control) Architectures for a Complex
System
The hardware (computer) architecture naturally divides into SIMD (synchronous), MIMD (asynchronous), and von Neumann classes. The problem structures are synchronous, loosely synchronous, or asynchronous. One can argue that the shared-memory asynchronous architecture is naturally suggested by software considerations and in particular by the goal of efficient parallel execution for sequential software models. For this reason it becomes an important computer architecture even though it is not a natural problem architecture.
Now we switch topics and consider the spatial properties of complex systems.
The size N of the complex system is obviously an important property. Note that we think of a complex system as a set of members with their spatial structure evolving with time. Sometimes, the time domain has a definite ``size,'' but often one can evolve the system indefinitely in time. However, most complex systems have a natural spatial size, with the spatial domain consisting of N members. In the examples of Section 3.3 , the seismic example had a definite spatial extent and unlimited time domain; on the other hand, Gaussian elimination had its spatial members evolving for a fixed number of n ``time'' steps. As usual, the value of the spatial size N will depend on the granularity or detail with which one looks at the complex system. One could consider a parallel computer as a complex system constructed as a collection of transistors, with a correspondingly very large value of N, but here we will view the processor node as the fundamental entity and define the spatial size of a parallel computer, viewed as a complex system, by the number of processing nodes.
Now is a natural time to define the von Neumann complex system spatial structure, which is relevant, of course, for computer architecture. We will formally define this to be a system with no spatial extent, i.e., size N = 1. Of course, a von Neumann node can have a sophisticated structure, with multiple functional units, if we look at fine enough resolution. More precisely, perhaps, we can generalize this complex system to one where N is small and will not scale up to large values.
Consider mapping a seismic simulation with grid points onto a parallel machine with processors. An important parameter is the grain size n of the resultant decomposition. We can introduce the problem grain size and the computer grain size as the memory contained in each node of the parallel computer. Clearly we must have,
if we measure memory size in units of seismic grid points. More interestingly, in Equation 3.10 we will relate the performance of the parallel implementation of the seismic simulation to the grain size and other problem and computer characteristics. We find that, in many cases, the parallel performance depends on the problem size and the number of processors only through their combination as the grain size, and so grain size is a critical parameter in determining the effectiveness of parallel computers for a particular application.
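As a small illustration (with made-up sizes, and with node memory measured in units of seismic grid points as in the text), the following C fragment computes the grain size of a decomposition and checks that it fits within the memory of a node.

#include <stdio.h>

int main(void)
{
    long grid_points   = 1L << 24;   /* illustrative problem size              */
    long nodes         = 1024;       /* illustrative number of processors      */
    long node_capacity = 1L << 16;   /* illustrative node memory, in units of
                                        seismic grid points                    */

    long grain = grid_points / nodes;   /* problem grain size per node */

    printf("grain size n = %ld grid points per node\n", grain);
    if (grain > node_capacity)
        printf("decomposition does not fit: need more nodes or memory\n");
    return 0;
}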
The next set of parameters describes the topology or structure of the spatial domain associated with the complex system. The simplest parameter of this type is the geometric dimension of the space. As reviewed in Chapter 2 , the original hardware and, in fact, software (see Chapter 5 ) exhibited a clear geometric structure. The binary hypercube of dimension d had this as its geometric dimension. This was an effective architecture because it was richer than the topologies of most problems. Thus, consider mapping a problem of dimension onto a computer of dimension . Suppose the software system preserves the spatial structure of the problem and that . Then, one can show that the parallel computing overhead f has a term due to internode communication that has the form,
with parallel speedup S given by
The communication overhead depends on the problem grain size and computer complex system . It also involves two parameters specifying the parallel hardware performance.
The definitions of and are imprecise above. In particular, depends on the nature of node and can take on very different values depending on the details of the implementation; floating-point operations are much faster when the operands are taken from registers than from slower parts of the memory hierarchy. On systems built from processors like the Intel i860 chip, these effects can be large; could be from registers (50 MFLOPS) and larger by a factor of ten when the variables a,b are fetched from dynamic RAM. Again, communication speed depends on internode message size (a software characteristic) and the latency (startup time) and bandwidth of the computer communication subsystem.
Returning to Equation 3.10 , we really only need to understand here that the term indicates that communication overhead depends on relative performance of the internode communication system and node (floating-point) processing unit. A real study of parallel computer performance would require a deeper discussion of the exact values of and . More interesting here is the dependence on the number of processors and problem grain size . As described above, grain size depends on both the problem and the computer. The values of and are given by
independent of computer parameters, while if
The results in Equation 3.13 quantify the penalty, in terms of an overhead that increases with the number of processors, for a computer architecture that is less rich than the problem architecture. An attractive feature of the hypercube architecture is that its dimension is large and one is essentially always in the regime governed by the top line in Equation 3.13 . Recently, there has been a trend away from rich topologies like the hypercube towards the view that the node interconnect should be considered as a routing network or switch to be implemented in the very best technology. The original MIMD machines from Intel, nCUBE, and AMETEK all used hypercube topologies, as did the SIMD Connection Machines CM-1 and CM-2. The nCUBE-2, introduced in 1990, still uses a hypercube topology, but both it and the second-generation Intel iPSC/2 used hardware routing that ``hides'' the hypercube connectivity. The latest Intel Paragon and Touchstone Delta, and the Symult (ex-AMETEK) 2010, use a two-dimensional mesh with wormhole routing. It is not clear how to incorporate these new node interconnects into the above picture, and further research is needed. Presumably, we would need to add new complex system properties and perhaps generalize the definition of dimension; as we will see below, such a generalization is in fact necessary for Equation 3.10 to be valid for problems whose structure is not geometrically based.
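Since Equations 3.10 through 3.13 are not reproduced above, the following C sketch uses the commonly quoted form of this performance model from [ Fox:88a ]: a communication overhead proportional to (t_comm/t_calc) divided by n^(1/d) for a d-dimensional problem whose topology is matched by the computer, with speedup N_proc/(1 + overhead). The proportionality constant, the hardware ratio, and the processor count below are illustrative assumptions, not the book's exact expressions.

#include <math.h>
#include <stdio.h>

/* A rough sketch of the grain-size dependence of communication overhead
   for a d-dimensional local problem whose topology is matched by the
   computer: f_C ~ (t_comm / t_calc) / n^(1/d), with speedup
   S = N_proc / (1 + f_C).  The constant factor is taken as 1 and all
   parameter values below are illustrative, not measured. */
static double overhead(double grain, double dim, double tcomm_over_tcalc)
{
    return tcomm_over_tcalc / pow(grain, 1.0 / dim);
}

int main(void)
{
    double ratio  = 2.0;      /* assumed t_comm / t_calc for the hardware */
    double nprocs = 256.0;    /* assumed number of nodes                  */

    for (double grain = 16.0; grain <= 16384.0; grain *= 16.0) {
        double f2 = overhead(grain, 2.0, ratio);   /* 2-D problem */
        double f3 = overhead(grain, 3.0, ratio);   /* 3-D problem */
        printf("n = %6.0f  speedup(d=2) = %6.1f  speedup(d=3) = %6.1f\n",
               grain, nprocs / (1.0 + f2), nprocs / (1.0 + f3));
    }
    return 0;
}

With these illustrative numbers, the speedup approaches the processor count as the grain size grows, and at a fixed grain size the overhead is larger for the three-dimensional problem than for the two-dimensional one, which is the qualitative point of the discussion above.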
Returning to Equations 3.10 through 3.12 , we note that we have not properly defined the correct dimension or to use. We have implicitly equated this with the natural geometric dimension but this is not always correct. This is illustrated by the complex system consisting of a set of particles in three dimensions interacting with a long range force such as gravity or electrostatic charge. The geometric structure is local with but the complex system structure is quite different; all particles are connected to all others. As described in Chapter 3 of [ Fox:88a ], this implies that whatever the underlying geometric structure. We define the system dimension for a general complex system to reflect the system connectivity. Consider Figure 3.9 which shows a general domain D in a complex system. We define the volume of this domain by the information in it. Mathematically, is the computational complexity needed to simulate D in isolation. In a geometric system
where L is a geometric length scale. The domain D is not in general isolated and is connected to the rest of the complex system. Information also flows into D and, again in a geometric system, this flow is a surface effect with
Figure 3.9:
The Information Density and Flow in a General Complex System
with Length Scale
L
If we view the complex system as a graph, the volume is related to the number of links of the graph inside D and the information flow is related to the number of links cut by the surface of D . Equations 3.14 and 3.15 are altered in cases like the long-range force problem, where the complex system connectivity is no longer geometric. We define the system dimension to preserve the surface versus volume interpretation of Equation 3.15 compared to Equation 3.14 . Thus, generally we define
With this definition of system dimension , we will find that Equations 3.10 through 3.12 essentially hold in general. In particular for the long range force problem, one finds independent of .
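Equation 3.16 itself is not reproduced above, but its intent, defining a system dimension d so that information flow scales as (volume) to the power (d-1)/d, can be checked numerically under that assumed form. The short C sketch below counts interior grid points (volume) and cut links (information flow) for square subdomains of a two-dimensional nearest-neighbour graph and inverts the scaling relation between two subdomain sizes; it recovers d = 2, as expected for a geometrically local two-dimensional system.

#include <math.h>
#include <stdio.h>

/* Illustrative estimate of the "system dimension" of a 2-D nearest-neighbour
   grid, assuming the surface/volume form  I(D) ~ V(D)^((d-1)/d).
   For an L-by-L subdomain we take the simulation volume V as the L*L grid
   points inside it and the information flow I as the 4*L links cut by its
   boundary.  Comparing two subdomain sizes eliminates the constant factors:
   (d-1)/d = ln(I2/I1) / ln(V2/V1). */
static double volume(long L) { return (double)L * L; }   /* interior work */
static double flow(long L)   { return 4.0 * L; }         /* cut links     */

int main(void)
{
    long L1 = 32, L2 = 256;                              /* illustrative sizes */
    double slope = log(flow(L2) / flow(L1)) / log(volume(L2) / volume(L1));
    printf("estimated system dimension d = %.3f\n", 1.0 / (1.0 - slope));
    return 0;
}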
A very important special type of spatial structure is the embarrassingly parallel or spatially disconnected complex system. Here there is no connection between the different members in the spatial domain. Applying this to parallel computing, we see that if the problem or its numerical formulation is spatially disconnected, then it can be parallelized very straightforwardly. In particular, any MIMD machine can be used whatever the temporal structure of the complex system. SIMD machines can only be used to simulate embarrassingly parallel complex systems which have spatially disconnected members with identical structure.
In Section 13.7 , we extend the analysis of this section to cover the performance of hierarchical memory machines. We find that one needs to replace subdomains in space with those in space-time.
In Chapter 11 , we describe other interesting spatial properties in terms of a particle analogy. We find system temperature and phase transitions as one heats and cools the complex system.
In Sections 3.4 and 3.5 , we discussed basic characteristics of complex systems. In fact, many ``real world examples'' are a mixture of these fundamental architectures. This is illustrated by Figure 3.10 , which shows a very conventional computer network with several different architectures. Not only can we regard each individual computer as a complex system; the whole network is, of course, itself a single complex system as we have defined it. We will term such complex systems compound . Note that one often puts together a network of resources to solve a ``single problem,'' and so an analysis of the structure of compound complex systems is not an academic issue, but of practical importance.
Figure 3.10:
A Heterogeneous Compound Complex System
Corresponding
to a Network of Computers of Disparate Architectures
Figure 3.10 shows a mixture of temporal (synchronous and asynchronous) and spatial (hypercube and von Neumann) architectures. We can look at the architecture of the individual network components or, taking a higher-level view, look at the network itself with member computers, such as the hypercube in Figure 3.10 , viewed as ``black boxes.'' In this example, the network is an asynchronous complex system, and this seems quite a common circumstance. Figure 3.11 shows two compound problems coming from the aerospace and battle management fields, respectively. In each case we asynchronously link problem modules which are themselves synchronous, loosely synchronous, or asynchronous. We have found this very common. In scientific and engineering computations, the basic modules are usually synchronous or loosely synchronous. These modules have large spatial size and naturally support ``massive data parallelism.'' Rarely do we find large asynchronous modules; this is fortunate, as such complex systems are difficult to parallelize. However, in many cases the synchronous or loosely synchronous program modules are hierarchically combined with an asynchronous architecture. This is an important way in which the asynchronous problem architecture is used in large scale computations. This is explored in detail in Chapter 18 . One has come to refer to systems such as those in Figure 3.10 as metacomputers . Correspondingly, we designate the applications in Figure 3.11 metaproblems .
Figure:
Two Heterogeneous Complex Systems
Corresponding
to: a) the Integrated Design of a New Aircraft, and b) the Integrated
Battle Management Problem Discussed in Chapter
18
If we combine Figures 3.10 and 3.11 , we can formulate the process of computation in its most general complicated fashion.
``Map a compound heterogeneous problem onto a compound heterogeneous computer.''
Equation 3.1 first stated our approach to computation as a map between different complex systems. We can quantify this by defining a partial order on complex systems written
Equation 3.17 states that a complex system A can be mapped to complex system B , that is, that B has a more general architecture than A . This was already seen in Figure 3.8 and is given in more detail in Figure 3.12 , where we have represented complex systems in a two-dimensional space labelled by their spatial and temporal properties. In this notation, we require:
Figure 3.12:
An Illustration of Problem or Computer Architecture Represented
in a Two-dimensional Space. The spatial structure only gives a few
examples.
The requirement that a particular problem parallelize is that
which is shown in Figure 3.13 . We have drawn our space, labelled by complex system properties, so that the partial ordering of Figures 3.8 and 3.12 ``flows'' towards the origin. Roughly, complex systems get more specialized as one moves either upwards or to the right. We view the three key complex systems, the problem, the software, and the computer, as points in the space represented in Figures 3.12 and 3.13 . Then Figure 3.13 shows that the computer complex system lies below and to the left of those representing the problem and the software.
Let us consider an example. Suppose the computer is a hypercube of dimension six (64 nodes) with an MIMD temporal structure. Synchronous, loosely synchronous, or asynchronous problems can be mapped onto this computer as long as the problem's spatial structure is contained in the hypercube topology. Thus, we will successfully map a two-dimensional mesh. But what about a 3 x 3 mesh or a large irregular lattice? The 3 x 3 mesh only has nine (spatial) components and insufficient parallelism to exploit the 64-node computer. The large irregular mesh can be efficiently mapped onto the hypercube, as shown in Chapter 12 . However, one could support this with a more general computer architecture where hardware or software routing essentially gives the general node-to-node communication shown in the bottom left corner of Figure 3.12 . The hypercube work at Caltech and elsewhere always used this strategy in mapping complex spatial topologies; the crystal-router mechanism in CrOS or Express was a powerful and efficient software strategy. Some of the early work using transputers found difficulties with some spatial structures, since the language Occam only directly supported process-to-process communication over the physical hardware channels. However, later general Occam subroutine libraries (communication ``harnesses'') callable from FORTRAN or C allowed the general point-to-point (process-to-process) communication model for transputer systems.
Figure:
Easy and Hard Mappings of Complex Systems
or
. We show
complex systems for problems and computers in a space labelled by
spatial and temporal complex system (computer architectures).
Figure
3.12
illustrates possible ordering of
these structures.
The complex system classification introduced in this chapter allows a precise formulation of the lessons of current research.
The majority of large scale scientific and engineering computations have synchronous or loosely synchronous character. Detailed statistics are given in Section 14.1 but we note that our survey suggests that at most 10% of applications are asynchronous. The microscopic or macroscopic temporal synchronization in the synchronous or loosely synchronous problems ensures natural parallelism without difficult computer hardware or software synchronization. Thus, we can boldly state that
for these problems. This quantifies our statement that ``Parallel Computing Works,''
where Equation 3.20 should be interpreted in the sense shown in Figure 3.13 (b).
Roughly, loosely synchronous problems are suitable for MIMD and synchronous problems for SIMD computers. We can expand Equation 3.20 and write
The results in Equation 3.21 are given with more details in Tables 14.1 and 14.2 . The text of this book is organized so that we begin by studying the simpler synchronous applications, then give examples first of loosely synchronous and finally asynchronous and compound problems.
The bold statements in Equations 3.20 and 3.21 become less clear when one considers software and the associated software complex system. The parallel software systems CrOS and its follow-on, Express, were used in nearly all our applications. These required explicit user insertion of message passing, which in many cases is tiresome and unfamiliar. One can argue that, as shown in Figure 3.14 , we supported a high-level software environment that reflected the target machine and so could be effectively mapped onto it. Thus, we show the software (CrOS) and the computer (MIMD) close together in Figure 3.14 . A more familiar and attractive environment for most users would be a traditional sequential language like Fortran77. Unfortunately,
and so, as shown in Figures 3.13 (a) and 3.14 , it is highly non-trivial to effectively map existing or new Fortran77 codes onto MIMD or SIMD parallel machines-at least those with distributed memory. We will touch on this issue in this book in Sections 13.1 and 13.2 , but it remains an active research area.
We also discuss data parallel languages, such as High Performance Fortran, in Chapter 13 [ Kennedy:93a ]. This is designed so that
Figure 3.14:
The Dusty Deck Issue in Terms of the Architectures of Problem,
Software, and Computer Complex Systems
We can show this point more graphically if we introduce a quantitative measure M of the difficulty of a mapping between complex systems. We represent M as the difference in heights h,
where we can only perform the map if M is positive
Negative values of M correspond to difficult cases such as Equation 3.22 while large positive values of M imply a possible but hard map. Figure 3.15 shows how one can now picture the process of computation as moving ``downhill'' in the complex system architecture space.
Figure 3.15:
Two Problems
and Five Computer Architectures
in the Space-Time Architecture Classification of Complex Systems. An arrow
represents a successful mapping and an ``X'' a mapping that will fail without a
sophisticated compiler.
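The height function h and Equations 3.23 and 3.24 are not reproduced above, so the sketch below simply encodes the qualitative rule of Figures 3.8 and 3.15 with illustrative integer heights for the temporal classes: a mapping succeeds only when M = h(problem) - h(computer) is non-negative, that is, when the computer's control structure is at least as general as the problem's. The numeric heights are assumptions made purely for illustration and ignore the spatial axis.

#include <stdio.h>

/* A minimal sketch of the mapping test M = h(problem) - h(computer) >= 0,
   using the temporal (control) classes of Figure 3.8 only.  The integer
   heights are illustrative: more general architectures sit lower, so a
   mapping succeeds only when it moves "downhill" (or stays level). */
enum temporal { SYNCHRONOUS = 3, LOOSELY_SYNCHRONOUS = 2, ASYNCHRONOUS = 1 };

static int can_map(enum temporal problem, enum temporal computer)
{
    int M = (int)problem - (int)computer;   /* difference in heights */
    return M >= 0;
}

int main(void)
{
    /* A loosely synchronous problem fits an MIMD (asynchronous) computer
       but not an SIMD (synchronous) one, as in Figure 3.15. */
    printf("loosely synchronous -> MIMD : %s\n",
           can_map(LOOSELY_SYNCHRONOUS, ASYNCHRONOUS) ? "maps" : "fails");
    printf("loosely synchronous -> SIMD : %s\n",
           can_map(LOOSELY_SYNCHRONOUS, SYNCHRONOUS) ? "maps" : "fails");
    return 0;
}

Running this reports that the loosely synchronous problem maps onto the MIMD computer but fails on the SIMD one, matching the ``X'' in Figure 3.15.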
This formal discussion is illustrated throughout the book by numerous examples, which show that a wide variety of applications parallelize. Most of the applications chapters start with a computational analysis that refers back to the general concepts developed. This is finally summarized in Chapter 14 , which exemplifies the asynchronous applications and starts with an overview of the different temporal problem classes. We build up to this by discussing synchronous problems in Chapters 4 and 6 ; embarrassingly parallel problems in Chapter 7 ; and loosely synchronous problems with increasing degrees of irregularity in Chapters 8 , 9 , and 12 . Compound problem classes-an asynchronous mixture of loosely synchronous components-are described in Chapters 17 and 18 . The large missile tracking and battle management simulation built at JPL (Figure 3.11 b) and described briefly in Chapter 18 was the major example of a compound problem class within C³P. Chapters 17 and 19 indicate that we believe that this class is extremely important in many ``real-world'' applications that integrate many diverse functions.
The Caltech Concurrent Computation Project started with QCD, or Quantum Chromodynamics, as its first application. QCD is discussed in more detail in Sections 4.2 and 4.3 , but here we will put it in historical perspective. This nostalgic approach is developed in [ Fox:87d ], [ Fox:88oo ] as well as in Chapters 1 and 2 of this book.
We show, in Table 4.1 , fourteen QCD simulations, labelled by representative physics publications, performed within C³P using parallel machines. This activity started in 1981 with simulations using the first four-node 8086-8087-based prototypes of the Cosmic Cube. These prototypes were quite competitive in performance with the VAX 11/780 on which we had started (in 1980) our computational physics program within high energy physics at Caltech. The 64-node Cosmic Cube was used more or less continuously from October, 1983 to mid-1984 on what was termed by Caltech a ``mammoth calculation'' in the press release shown in Figure 4.2 . This is the modest four-dimensional lattice calculation reported in line 3 of Table 4.1 . As trumpeted in Figures 4.1 and 4.2 , this was our first major use of parallel machines and a critical success on which we built our program.
Table 4.1:
Quantum Chromodynamic (QCD) Calculations Within C³P
Our 1983-1984 calculations totalled some 2,500 hours on the 64-node Cosmic Cube and successfully competed with 100-hour CDC Cyber 205 computations that were the state of the art at the time [ Barkai:84b ], [ Barkai:84c ], [ Bowler:85a ], [ DeForcrand:85a ]. We used a four-dimensional lattice with grid points, with eight gluon field values defined on each of the 110,592 links between grid points. The resultant 884,736 degrees of freedom seem modest today as QCD practitioners contemplate lattices of order simulated on machines of teraFLOP performance [ Aoki:91a ]. However, this lattice was comparable to those used on vector supercomputers at the time.
A hallmark of this work was the interdisciplinary team building hardware, software, and parallel application. Further, from the start we stressed large supercomputer-level simulations where parallelism would make the greatest initial impact. It was also worth noting that our use of comparatively high-level software paid off-Otto and Stack were able to code better algorithms [ Parisi:83a ] than the competing vector supercomputer teams. The hypercube could be programmed conveniently without use of microcode or other unproductive environments needed on some of the other high-performance machines of the time.
Our hypercube calculations used an early C plus message-passing programming approach which later evolved into the Express system described in the next chapter. Although not as elegant as data-parallel C and Fortran (discussed in Chapter 13 ), our approach was easier than hand-coded assembly, which was quite common for alternative high-performance systems of the time.
Figures 4.1 and 4.2 show extracts from Caltech and newspaper publicity of the time. We were essentially only a collection of 64 IBM PCs. Was that a good thing (as we thought) or an indication of our triviality (as a skeptical observer commenting in Figure 4.1 thought)? 1985 saw the start of a new phase as conventional supercomputers increased in power and availability, and NSF and DOE allocated many tens of thousands of hours on the CRAY X-MP (2, Y-MP) and ETA-10 to QCD simulations. Our final QCD hypercube calculations in 1989 within C³P used a 64-node JPL Mark IIIfp with approximately performance. Since this work, we have switched to using the Connection Machine CM-2, which by 1990 was the commercial standard in the field. C³P helped the Los Alamos group of Brickner and Gupta (one of our early graduates!) to develop the first CM-2 QCD codes, which in 1991 performed at on the full-size CM-2 [ Brickner:91a ], [ Liu:91a ].
Caltech Scientists Develop `Parallel' Computer Model By LEE DEMBART
Times Science Writer
Caltech scientists have developed a working prototype for a new super computer that can perform many tasks at once, making possible the solution of important science and engineering problems that have so far resisted attack. The machine is one of the first to make extensive use of parallel processing, which has been both the dream and the bane of computer designers for years.
Unlike conventional computers, which perform one step at a time while the rest of the machine lies idle, parallel computers can do many things at the same time, holding out the prospect of much greater computing speed than currently available-at much less cost.
If its designers are right, their experimental device, called the Cosmic Cube, will open the way for solving problems in meteorology, aerodynamics, high-energy physics, seismic analysis, astrophysics and oil exploration, to name a few. These problems have been intractable because even the fastest of today's computers are too slow to process the mountains of data in a reasonable amount of time.
One of today's fastest computers is the Cray 1, which can do 20 million to 80 million operations a second. But at $5 million, they are expensive and few scientists have the resources to tie one up for days or weeks to solve a problem.
``Science and engineering are held up by the lack of super computers,'' says one of the Caltech inventors, Geoffrey C. Fox, a theoretical physicist. ``They know how to solve problems that are larger than current computers allow.''
The experimental device, 5 feet long by 8 inches high by 14 inches deep, fits on a desk top in a basement laboratory, but it is already the most powerful computer at Caltech. It cost $80,000 and can do three million operations a second-about one-tenth the power of a Cray 1.
Fox and his colleague, Charles L. Seitz, a computer scientist, say they can expand their device in coming years so that it has 1,000 times the computing power of a Cray.
``Poor old Cray and Cyber (another super computer) don't have much of a chance of getting any significant increase in speed,'' Fox said. ``Our ultimate machines are expected to be at least 1,000 times faster than the current fastest computers.''
``We are getting to the point where we are not going to be talking about these things as fractions of a Cray but as multiples of them,'' Seitz said.
But not everyone in the field is as impressed with Caltech's Cosmic Cube as its inventors are. The machine is nothing more nor less than 64 standard, off-the-shelf microprocessors wired together, not much different than the innards of 64 IBM personal computers working as a unit.
``We are using the same technology used in PCs (personal computers) and Pacmans,'' Seitz said. The technology is an 8086 microprocessor capable of doing 1/20th of a million operations a second with 1/8th of a megabyte of primary storage. Sixty-four of them together will do 3 million operations a second with 8 megabytes of storage.
Currently under development is a single chip that will replace each of the 64 8-inch-by-14-inch boards. When the chip is ready, Seitz and Fox say they will be able to string together 10,000 or even 100,000 of them.
Computer scientists have known how to make such a computer for years but have thought it too pedestrian to bother with.
``It could have been done many years ago,'' said Jack B. Dennis, a computer scientist at the Massachusetts Institute of Technology who is working on a more radical and ambitious approach to parallel processing than Seitz and Fox. He thinks his approach, called ``dataflow,'' will both speed up computers and expand their horizons, particularly in the direction of artificial intelligence .
Computer scientists dream of getting parallel processors to mimic the human brain , which can also do things concurrently.
``There's nothing particularly difficult about putting together 64 of these processors,'' he said. ``But many people don't see that sort of machine as on the path to a profitable result.''
What's more, Dennis says, organizing these machines and writing programs for them have turned out to be sticky problems that have resisted solution and divided the experts.
``There is considerable debate as to exactly how these large parallel machines should be programmed,'' Dennis said by telephone from Cambridge, Mass. ``The 64-processor machine (at Caltech) is, in terms of cost-performance, far superior to what exists in a Cray 1 or a Cyber 205 or whatever. The problem is in the programming.''
Fox responds that he has ``an existence proof'' for his machine and its programs, which is more than Dennis and his colleagues have to show for their efforts.
The Caltech device is a real, working computer, up and running and chewing on a real problem in high-energy physics. The ideas on which it was built may have been around for a while, he agreed, but the Caltech experiment demonstrates that there is something to be gained by implementing them.
For all his hopes, Dennis and his colleagues have not yet built a machine to their specifications. Others who have built parallel computers have done so on a more modest scale than Caltech's 64 processors. A spokesman for IBM said that the giant computer company had built a 16-processor machine, and is continuing to explore parallel processing.
The key insight that made the development of the Caltech computer possible, Fox said, was that many problems in science are computationally difficult because they are big, not because they are necessarily complex.
Because these problems are so large, they can profitably be divided into 64 parts. Each of the processors in the Caltech machine works on 1/64th of the problem.
Scientists studying the evolution of the universe have to deal with 1 million galaxies. Scientists studying aerodynamics get information from thousands of data points in three dimensions.
To hunt for undersea oil, ships tow instruments through the oceans, gathering data in three dimensions that is then analyzed in two dimensions because of computer limitations. The Caltech computer would permit three-dimensional analysis.
``It has to be problems with a lot of concurrency in them,'' Seitz said. That is, the problem has to be split into parts, and all the parts have to be analyzed simultaneously.
So the applications of the Caltech computer for commercial uses such as an airline reservation system would be limited, its inventors agree.
Figure 4.1: Caltech Scientists Develop ``Parallel'' Computer Model [Dembart:84a]
CALTECH'S COSMIC CUBE
PERFORMING MAMMOTH CALCULATIONS
Large-scale calculations in basic physics have been successfully run on the Cosmic Cube, an experimental computer at Caltech that its developers and users see as the forerunner of supercomputers of the future. The calculations, whose results are now being published in articles in scientific journals, show that such computers can deliver useful computing power at a far lower cost than today's machines.
The first of the calculations was reported in two articles in the June 25 issue of . In addition, a second set of calculations related to the first has been submitted to for publication.
The June articles were:
-``Pure Gauge SU(3) Lattice Theory on an Array of Computers,'' by Eugene Brooks, Geoffrey Fox, Steve Otto, Paul Stolorz, William Athas, Erik DeBenedictis, Reese Faucette, and Charles Seitz, all of Caltech; and John Stack of the University of Illinois at Urbana-Champaign, and
-``The SU(3) Heavy Quark Potential with High Statistics,'' by Steve Otto and John Stack.
The Cosmic Cube consists of 64 computer elements, called nodes, that operate on parts of a problem concurrently. In contrast, most computers today are so-called von Neumann machines, consisting of a single processor that operates on a problem sequentially, making calculations serially.
The calculation reported in the June took 2,500 hours of the computation time on the Cosmic Cube. The calculation represents a contribution to the test of a set of theories called the Quantum Field Theories, which are mathematical attempts to explain the physical properties of subatomic particles known as hadrons, which include protons and neutrons.
These basic theories represent in a series of equations the behavior of quarks, the basic constituents of hadrons. Although theorists believe these equations to be valid, they have never been directly tested by comparing their predictions with the known properties of subatomic particles as observed in experiments with particle accelerators.
The calculations to be published in probe the properties, such as mass, of the glueballs that are predicted by theory.
``The calculations we are reporting are not earth-shaking,'' said Dr. Fox. ``While they are the best of their type yet done, they represent but a steppingstone to better calculations of this type.'' According to Dr. Fox, the scientists calculated the force that exists between two quarks. This force is carried by gluons, the particles that are theorized to carry the strong force between quarks. The aim of the calculation was to determine how the attractive force between quarks varies with distance. Their results showed that the potential depends linearly on distance.
``These results indicate that it would take an infinite amount of energy to separate two quarks, which shows why free quarks are not seen in nature,'' said Dr. Fox. ``These findings represent a verification of what most people expected.''
The Cosmic Cube has about one-tenth the power of the most widely used supercomputer, the Cray-1, but at one hundredth the cost, about $80,000. It has about eight times the computing power of the widely used minicomputer, the VAX 11/780. Physically, the machine occupies about six cubic feet, making it fit on the average desk, and uses 700 watts of power.
Each of the 64 nodes of the Cosmic Cube has approximately the same power as a typical microcomputer, consisting of 16-bit Intel 8086 and 8087 processors, with 136K bytes of memory storage. For comparison, the IBM Personal Computer uses the same family of chips and typically possesses a similar amount of memory. Each of the Cosmic Cube nodes executes programs concurrently, and each can send messages to six other nodes in a communication network based on a six-dimensional cube, or hypercube. The chips for the Cosmic Cube were donated by Intel Corporation, and Digital Equipment Corporation contributed supporting computer hardware. According to Dr. Fox, a full-scale extension of the Quantum Field Theories to yield the properties of hadrons would require a computer 1,000 times more powerful than the Cosmic Cube. Computer projects at Caltech are developing hardware and software for such advanced machines.
Figure 4.2: Caltech's Cosmic Cube Performing Mammoth Calculations [Meredith:84a]
It is not surprising that our first hypercube calculations in C³P did not need the full MIMD structure of the machine. This was also a characteristic of Sandia's pioneering use of the 1024-node nCUBE [Gustafson:88a]. Synchronous applications like QCD are computationally important and have a simplicity that made them the natural starting point for our project.
Table 4.2 indicates that 70 percent of our first set of applications were of the class we call synchronous. As remarked above, this could be expected in any such early work as these are the problems with the cleanest structure that are, in general, the simplest to code and, in particular, the simplest to parallelize. As already defined in Section 3.4 , synchronous applications are characterized by a basic algorithm that consists of a set of operations which are applied identically in every point in a data set. The structure of the problem is typically very clear in such applications, and so the parallel implementation is easier than for the irregular loosely synchronous and asynchronous cases. Nevertheless, as we will see, there are many interesting issues in these problems and they include many very important applications. This is especially true for academic computations that address fundamental science. These are often formulated as studies of fundamental microscopic quantities such as the quark and gluon fundamental particles seen in QCD of Section 4.3 . Fundamental microscopic entities naturally obey identical evolution laws and so lead to synchronous problem architectures. ``Real world problems''-perhaps most extremely represented by the battle simulations of Chapter 18 in this book-typically do not involve arrays of identical objects, but rather the irregular dynamics of several different entities. Thus, we will find more loosely synchronous and asynchronous problems as we turn from fundamental science to engineering and industrial or government applications. We will now discuss the structure of QCD in more detail to illustrate some general computational features of synchronous problems.
Table 4.2: The Ten Pioneer Hypercube Applications within C³P
The applications using the Cosmic Cube were well established by 1984 and Table 4.2 lists the ten projects which were completed in the first year after we started our interdisciplinary project in the summer of 1983. All but one of these projects are more or less described in this book, while the other will be found in [ Fox:88a ]. They covered a reasonable range of application areas and formed the base on which we first started to believe that parallel computing works! Figure 4.3 illustrates the regular lattice used in QCD and its decomposition onto 64 nodes. QCD is a four-dimensional theory and all four dimensions can be decomposed. In our initial 64-node Cosmic Cube calculations, we used the three-dimensional decompositions shown in Figure 4.3 with the fourth dimension, as shown in Figure 4.4 , and internal degrees of freedom stored sequentially in each node. Figure 4.3 also indicates one subtlety needed in the parallelization; namely, one needs a so-called red-black strategy with only half the lattice points updated in each of the two (``red'' and ``black'') phases. Synchronous applications are characterized by such a regular spatial domain as shown in Figure 4.3 and an identical update algorithm for each lattice point. The update makes use of a Monte Carlo procedure described in Section 4.3 and in more detail in Chapter 12 of [ Fox:88a ]. This procedure is not totally synchronous since the ``accept-reject'' mechanism used in the Monte Carlo procedure does not always terminate at the same step. This is no problem on an MIMD machine and even makes the problem ``slightly loosely synchronous.'' However, SIMD machines can also cope with this issue as all systems (DAP, CM-2, Maspar) have a feature that allows processors to either execute the common instruction or ignore it.
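To make the red-black strategy concrete, the following is a minimal sketch in C (not the actual C³P code) of a checkerboard sweep over a cubic site array; the parity of x+y+z selects the ``red'' or ``black'' sublattice, so no two neighboring sites are ever updated in the same phase. The lattice size L and the update_site routine are placeholders for whatever the application supplies.

/* Minimal sketch of a red-black (checkerboard) sweep over an L x L x L
 * site array.  update_site() is a placeholder for the application's
 * Monte Carlo update of one site; it is not part of any real QCD library. */
#define L 16

typedef double Site;                 /* simplified: one value per site */
static Site lattice[L][L][L];

static void update_site(Site *s, int x, int y, int z) {
    /* application-specific heat-bath or Metropolis update goes here */
    (void)s; (void)x; (void)y; (void)z;
}

/* parity = 0 updates the "red" sites, parity = 1 the "black" ones; because
 * neighboring sites always have opposite parity, all sites of one color
 * may be updated simultaneously on a parallel machine. */
void sweep(int parity) {
    for (int x = 0; x < L; x++)
        for (int y = 0; y < L; y++)
            for (int z = 0; z < L; z++)
                if (((x + y + z) & 1) == parity)
                    update_site(&lattice[x][y][z], x, y, z);
}

void full_sweep(void) {
    sweep(0);   /* red phase   */
    sweep(1);   /* black phase */
}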
Figure 4.3: A Problem Lattice Decomposed onto a 64-node Machine Arranged as a Machine Lattice. Points labeled X (``red'') or (``black'') can be updated at the same time.
Figure 4.4: The 16 time and eight internal gluon degrees of freedom stored at each point shown in Figure 4.3
Figure 4.5 illustrates the nearest-neighbor algorithm used in QCD and in very many problems described by local interactions. We see that some updates require communication and some don't. In the message-passing software model used in our hypercube work described in Chapter 5, the user is responsible for organizing this communication with an explicit subroutine call. Our later QCD calculations and the spin simulations of Section 4.4 use the data-parallel software model on SIMD machines, where a compiler can generate the messaging automatically. Chapter 13 will mention later projects aimed at producing a uniform data-parallel Fortran or C compiler which will generate the correct message structure for either SIMD or MIMD machines on such regular problems as QCD.
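As an illustration of what the user must write in the message-passing model, here is a sketch of one boundary (``halo'') exchange for a regularly decomposed field. The routines send_to_neighbour and recv_from_neighbour are placeholder stubs, not a real library API; in an actual code they would be the explicit hypercube communication calls of Chapter 5 or their equivalents in a portable message-passing library.

/* Sketch of the explicit boundary ("halo") exchange a user writes in the
 * message-passing model.  The communication routines below are stubs that
 * stand in for the real library calls. */
#include <string.h>

#define NLOC 8                            /* locally owned sites per dimension */

static double phi[NLOC + 2][NLOC][NLOC];  /* one ghost plane on each x-face    */

static void send_to_neighbour(int dim, int dir, const double *buf, int n) {
    (void)dim; (void)dir; (void)buf; (void)n;    /* real code: library call    */
}
static void recv_from_neighbour(int dim, int dir, double *buf, int n) {
    (void)dim; (void)dir; memset(buf, 0, n * sizeof(double));  /* placeholder  */
}

/* Exchange ghost planes in the x direction (dim 0), sending "up" and
 * receiving from "below"; a full update would do both directions of every
 * decomposed dimension before sweeping over the local sites. */
void exchange_x_up(void) {
    double sendbuf[NLOC * NLOC], recvbuf[NLOC * NLOC];

    for (int y = 0; y < NLOC; y++)                /* pack the last owned plane */
        for (int z = 0; z < NLOC; z++)
            sendbuf[y * NLOC + z] = phi[NLOC][y][z];

    send_to_neighbour(0, +1, sendbuf, NLOC * NLOC);
    recv_from_neighbour(0, -1, recvbuf, NLOC * NLOC);

    for (int y = 0; y < NLOC; y++)                /* unpack into ghost plane   */
        for (int z = 0; z < NLOC; z++)
            phi[0][y][z] = recvbuf[y * NLOC + z];
}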
Figure 4.5: Template of a Local Update Involving No Communication in a) and the Value to be Communicated in b).
The calculations in Sections 4.3 and 4.4 used a wide variety of machines, in-house and commercial multicomputers, as well as the SIMD DAP and CM-2. The spin calculations in Section 4.4 can have very simple degrees of freedom, including that of the binary ``spin'' of the Ising model. These are naturally suited to the single-bit arithmetic available on the AMT DAP and CM-2. Some of the latest Monte Carlo algorithms do not use the local algorithms of Figure 4.4 but exploit the irregular domain structure seen in materials near a critical point. These new algorithms are much more efficient but very much more difficult to parallelize-especially on SIMD machines. They are discussed in Section 12.8 . We also see a taste of the ``embarrassingly parallel'' problem structure of Chapter 7 in Section 4.4 . For the Potts simulation, we obtained parallelism not from the data domain (lattice of spins) but from different starting points for the evolution. This approach, described in more detail in Section 7.2 , would not be practical for QCD with many degrees of freedom as one must have enough memory to store the full lattice in each node of the multicomputer.
Table 4.2 lists the early seismic simulations of the group led by Clayton, whose C³P work is reviewed in Section 18.1. These solved the elastic wave equations using finite difference methods as discussed in Section 3.5. The equations are iterated with time steps replacing the Monte Carlo iterations used above. This work is described in Chapters 5 and 7 of [Fox:88a] and represents methods that can tackle quite practical problems, for example, predicting the response of complicated geological structures such as the Los Angeles basin. The two-dimensional hydrodynamics work of Meier [Meier:88a] is computationally similar, using the regular decomposition and local update of Figures 4.3 and 4.5. These techniques are now very familiar and may seem ``old-hat.'' However, it is worth noting that, as described in Chapter 13, we are only now in 1992 developing the compiler technology that will automate these methods developed ``by-hand'' by our early users. A much more sophisticated follow-on to these early seismic wave simulations is the ISIS interactive seismic imaging system described in Chapter 18.
Chapter 9 of [Fox:88a] explains the well-known synchronous implementation of long-range particle dynamics. This algorithm was not used directly in any large C³P application, as we implemented the much more efficient cluster algorithms described in Sections 12.4, 12.5, and 12.8. The initial investigation of the vortex method of Section 12.5 used this method [Harstad:87a]. We also showed, in a parallel database used in Kolawa's thesis, how a semi-analytic approach to QCD could be analyzed identically to the long-range force problem [Kolawa:86b;88a]. As explained in [Fox:88a], one can use the long-range force algorithm in any case where the calculation involves a set of N points with observables requiring functions of every pair, of which there are N(N-1)/2. In the language of Chapter 3, this problem has a system dimension of one, whatever its geometrical dimension. This is illustrated in Figures 4.6 and 4.7, which represent the communication overhead in the form of Equations 3.10 and 3.13. We find the corresponding overhead for the simple two-dimensional decompositions described for the Clayton and Meier applications of Table 4.2. In Figure 4.7(a),(b), we increase the range R of the ``interaction'' from small (nearest-neighbor) R up to the infinite-range limit of the long-range force. As shown in Figure 4.7(a),(b), the communication overhead decreases as R increases, and its limiting form becomes independent of the geometric dimension for large R. Noederlinger [Lorenz:87a] and Theiler [Theiler:87a;87b] used such a ``long-range'' algorithm for calculating the correlation dimension of a chaotic dynamical system. This measures the essential number of degrees of freedom of a complex system, which in this case was a time series measured from a plasma. The correlation function involved studying histograms of pairwise separations of the data points.
Figure 4.6: Some Examples of Communication Overhead as a Function of Increasing Range of Interaction R.
Figure 4.7: General Form of Communication Overhead for (a) Increasing and (b) Infinite Range R
Fucito and Solomon [Fucito:85b;85f] studied a simple Coulomb gas which naturally had a long-range force. However, this was a Monte Carlo calculation that was implemented efficiently by an ingenious algorithm that cannot directly use the analysis of the particle dynamics (time-stepped) case shown in Figure 4.7. Monte Carlo is typically harder to parallelize than time evolution, where all ``particles'' can be evolved in time together. However, Monte Carlo updates can only proceed simultaneously if they involve disjoint particle sets. This implies the red-black ordering of Figure 4.5 and requires a difficult asynchronous algorithm in the irregular melting problem of Section 14.2. Johnson's application was technically the hardest in our list of pioneers in Table 4.2.
Finally, Section 4.5 uses cellular automata ideas that lead to a synchronous architecture for grain dynamics, which, if implemented directly as in Section 9.2, would naturally be loosely synchronous. This illustrates that the problem architecture depends on the particular numerical approach.
Quantum chromodynamics (QCD) is the proposed theory of the so-called strong interactions that bind quarks and gluons together to form hadrons-the constituents of nuclear matter such as the proton and neutron. It also mediates the forces between hadrons and thus controls the formation of nuclei. The fundamental properties of QCD cannot be directly tested, but a wealth of indirect evidence supports this theory. The problem is that QCD is a nonlinear theory that is not analytically solvable. For the equivalent quantum field theories of weaker forces such as electromagnetism, approximations using perturbation expansions in the interaction strength give very accurate results. However, since the QCD interaction is so strong, perturbative approximations often fail. Consequently, few precise predictions can be made from the theory. This led to the introduction of a non-perturbative approximation based on discretizing four-dimensional space-time onto a lattice of points, giving a theory called lattice QCD, which can be simulated on a computer.
Most of the work on lattice QCD has been directed towards deriving the masses (and other properties) of the large number of hadrons that have been found in experiments using high energy particle accelerators. This would provide hard evidence for QCD as the theory of the strong force. Other calculations have also been performed; in particular, the properties of QCD at finite (i.e., non-zero) temperature and/or density have been determined. These calculations model the conditions of matter in the early stages of the evolution of the universe, just after the Big Bang. Lattice calculations of other quantum field theories, such as the theory of the weak nuclear force, have also been performed. For example, numerical calculations have given estimates for the mass of the Higgs boson, which is currently the Holy Grail of experimental high energy physics, and one of the motivating factors for the construction of the now-cancelled $10 billion Superconducting Supercollider.
One of the major problems in solving lattice QCD on a computer is that the simulation of the quark interactions requires the computation of a large, highly non-local matrix determinant, which is extremely compute-intensive. We will discuss methods for calculating this determinant later. For the moment, however, we note that, physically, the determinant arises from the dynamics of the quarks. The simplest way to proceed is thus to ignore the quark dynamics and work in the so-called quenched approximation, with only gluonic degrees of freedom. This should be a reasonable approximation, at least for heavy quarks. However, even solving this approximate theory requires enormous computing power. Current state-of-the-art quenched QCD calculations are performed on lattices of size , which involves the numerical solution of a 21,233,664 dimensional integral. The only way of solving such an integral is by Monte Carlo methods.
In order to explain the computations for QCD, we use the Feynman path integral formalism [Feynman:65a]. For any field theory described by a Lagrangian density $\mathcal{L}(\phi)$, the dynamics of the fields $\phi$ are determined through the action functional
$S[\phi] = \int d^4x \, \mathcal{L}(\phi)$.
In this language, the measurement of a physical observable represented by an operator $\Omega$ is given as the expectation value
$\langle \Omega \rangle = \frac{1}{Z} \int \mathcal{D}\phi \, \Omega(\phi) \, e^{-S[\phi]}$,
where the partition function Z is
$Z = \int \mathcal{D}\phi \, e^{-S[\phi]}$.
In these expressions, the integral $\int \mathcal{D}\phi$ indicates a sum over all possible configurations of the field $\phi$. A typical observable would be a product of fields such as $\phi(x)\phi(y)$, which says how the fluctuations in the field are correlated, and in turn, tells us something about the particles that can propagate from point x to point y. The appropriate correlation functions give us, for example, the masses of the various particles in the theory. Thus, to evaluate almost any quantity in field theories like QCD, one must simply evaluate the corresponding path integral. The catch is that the integrals range over an infinite-dimensional space.
To put the field theory onto a computer, we begin by discretizing space and time into a lattice of points. Then the functional integral is simply defined as the product of the integrals over the fields at every site x of the lattice:
$\int \mathcal{D}\phi = \prod_x \int d\phi(x)$.
Restricting space and time to a finite box, we end up with a finite (but very large) number of ordinary integrals, something we might imagine doing directly on a computer. However, the high dimensionality of these integrals renders conventional mesh techniques impractical. Fortunately, the presence of the exponential means that the integrand is sharply peaked in one region of configuration space. Hence, we resort to a statistical treatment and use Monte Carlo type algorithms to sample the important parts of the integration region [ Binder:86a ].
Monte Carlo algorithms typically begin with some initial configuration of fields, and then make pseudorandom changes on the fields such that the ultimate probability P of generating a particular field configuration is proportional to the Boltzmann factor,
$P(\phi) \propto e^{-S[\phi]}$,
where $S[\phi]$ is the action associated with the given configuration. There are several ways to implement such a scheme, but for many theories the simple Metropolis algorithm [Metropolis:53a] is effective. In this algorithm, a new configuration is generated by updating a single variable in the old configuration and calculating the change in action (or energy) $\Delta S$.
If $\Delta S \le 0$, the change is accepted; if $\Delta S > 0$, the change is accepted with probability $e^{-\Delta S}$. In practice, this is done by generating a pseudorandom number r in the interval [0,1] with uniform probability distribution, and accepting the change if $r < e^{-\Delta S}$. This is guaranteed to generate the correct (Boltzmann) distribution of configurations, provided ``detailed balance'' is satisfied. That condition means that the probability of proposing the change $\phi \rightarrow \phi'$ is the same as that of proposing the reverse process $\phi' \rightarrow \phi$. In practice, this is true if we never simultaneously update two fields which interact directly via the action. Note that this constraint has important ramifications for parallel computers, as we shall see below.
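A self-contained toy example of the Metropolis accept/reject step is sketched below in C, applied to a single variable with the simple action $S(x) = x^2/2$ rather than to a QCD link; the proposal step size and sweep count are arbitrary illustrative choices, but the accept/reject logic is exactly the one just described.

/* Toy Metropolis simulation of one variable with action S(x) = x*x/2.
 * In QCD the same accept/reject logic is applied to one link or site
 * variable at a time. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double action(double x)  { return 0.5 * x * x; }
static double uniform01(void)   { return rand() / (RAND_MAX + 1.0); }

int main(void) {
    double x = 0.0;                       /* current configuration */
    int accepted = 0, steps = 100000;

    for (int t = 0; t < steps; t++) {
        double trial = x + (2.0 * uniform01() - 1.0);   /* symmetric proposal */
        double dS = action(trial) - action(x);          /* change in action   */
        /* accept if dS <= 0, otherwise with probability exp(-dS) */
        if (dS <= 0.0 || uniform01() < exp(-dS)) { x = trial; accepted++; }
    }
    printf("acceptance rate = %.2f\n", (double)accepted / steps);
    return 0;
}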
Whichever method one chooses to generate field configurations, one updates the fields for some equilibration time of E steps, and then calculates the expectation value of $\Omega$ in Equation 4.3 from the next T configurations $\phi_t$ as
$\langle \Omega \rangle \approx \frac{1}{T} \sum_{t=1}^{T} \Omega(\phi_t)$.
The statistical error in Monte Carlo behaves as $1/\sqrt{N}$, where N is the number of statistically independent configurations, $N \sim T/\tau$, where $\tau$ is known as the autocorrelation time. This autocorrelation time can easily be large, and most of the computer time is spent in generating effectively independent configurations. The operator measurements then become a small overhead on the whole calculation.
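One standard way of estimating the true statistical error in the presence of such autocorrelations is to average the measurements into bins longer than the autocorrelation time and then apply the naive error formula to the (effectively independent) bin averages. A minimal sketch in C, assuming the measurements are already stored in an array:

/* Binned error estimate: bin averages over bins longer than the
 * autocorrelation time are treated as independent samples.
 * Assumes n is much larger than bin_size. */
#include <math.h>

double binned_error(const double *x, int n, int bin_size) {
    int nbins = n / bin_size;
    double mean = 0.0, var = 0.0;

    for (int b = 0; b < nbins; b++) {
        double s = 0.0;
        for (int i = 0; i < bin_size; i++)
            s += x[b * bin_size + i];
        double bavg = s / bin_size;       /* average of one bin */
        mean += bavg;
        var  += bavg * bavg;
    }
    mean /= nbins;
    var = var / nbins - mean * mean;      /* variance of bin averages */
    return sqrt(var / (nbins - 1));       /* error on the overall mean */
}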
QCD describes the interactions between quarks in high energy physics. Currently, we know of five types (referred to as ``flavors'') of quark: up, down, strange, charm, and bottom; and expect one more (top) to show up soon. In addition to having a ``flavor,'' quarks can carry one of three possible charges known as ``color'' (this has nothing to do with color in the macroscopic world!); hence, quantum chromo dynamics. The strong color force is mediated by particles called gluons, just as photons mediate the electromagnetic force. Unlike photons, though, gluons themselves carry a color charge and, therefore, interact with one another. This makes QCD a nonlinear theory, which is impossible to solve analytically. Therefore, we turn to the computer for numerical solutions.
QCD is an example of a ``gauge theory.'' These are quantum field theories that have a local symmetry described by a symmetry (or gauge) group. Gauge theories are ubiquitous in elementary particle physics: The electromagnetic interaction between electrons and photons is described by quantum electrodynamics (QED) based on the gauge group U(1); the strong force between quarks and gluons is believed to be explained by QCD based on the group SU(3); and there is a unified description of the weak and electromagnetic interactions in terms of the gauge group SU(2) x U(1). The strength of these interactions is measured by a coupling constant. This coupling constant is small for QED, so very precise analytical calculations can be performed using perturbation theory, and these agree extremely well with experiment. However, for QCD, the coupling appears to increase with distance (which is why we never see an isolated quark, since they are always bound together by the strength of the coupling between them). Perturbative calculations are therefore only possible at short distances (or large energies). In order to solve QCD at longer distances, Wilson [Wilson:74a] introduced lattice gauge theory, in which the space-time continuum is discretized and a discrete version of the gauge theory is derived which keeps the gauge symmetry intact. This discretization onto a lattice, which is typically hypercubic, gives a nonperturbative approximation to the theory that is successively improvable by increasing the lattice size and decreasing the lattice spacing, and provides a simple and natural way of regulating the divergences which plague perturbative approximations. It also makes the gauge theory amenable to numerical simulation by computer.
To put QCD on a computer, we proceed as follows [Wilson:74a], [Creutz:83a]. The four-dimensional space-time continuum is replaced by a four-dimensional hypercubic periodic lattice of size $N_s^3 \times N_t$, with the quarks living on the sites and the gluons living on the links of the lattice. $N_s$ is the spatial and $N_t$ is the temporal extent of the lattice. The lattice has a finite spacing a. The gluons are represented by $3 \times 3$ complex SU(3) matrices associated with each link in the lattice. The 3 in SU(3) reflects the fact that there are three colors of quarks, and SU means that the matrices are unitary with unit determinant (i.e., ``special unitary''). This link matrix describes how the color of a quark changes as it moves from one site to the next. For example, as a quark is transported along a link of the lattice it can change its color from, say, red to green; hence, a red quark at one end of the link can exchange colors with a green quark at the other end. The action functional for the purely gluonic part of QCD is
$S_g = \beta \sum_P \left( 1 - \tfrac{1}{3} \mathrm{Re} \, \mathrm{Tr} \, U_P \right)$,
where $\beta$ is a coupling constant and
$U_P = U_\mu(x) \, U_\nu(x+\hat{\mu}) \, U_\mu^\dagger(x+\hat{\nu}) \, U_\nu^\dagger(x)$
is the product of link matrices around an elementary square or plaquette on the lattice-see Figure 4.8. Essentially all of the time in QCD simulations of gluons is spent multiplying these SU(3) matrices together. The main component of this is the complex $3 \times 3$ matrix-multiply kernel, which most supercomputers can do very efficiently. As the action involves interactions around plaquettes, in order to satisfy detailed balance we can update only half the links in any one dimension simultaneously, as shown in Figure 4.9 (in two dimensions for simplicity). The partition function for full-lattice QCD including quarks is
$Z = \int \mathcal{D}U \, \mathcal{D}\bar{\psi} \, \mathcal{D}\psi \, e^{-S_g - \bar{\psi} M \psi}$,
where M is a large sparse matrix the size of the lattice squared. Unfortunately, since the quark or fermion variables $\psi$ are anticommuting Grassmann numbers, there is no simple representation for them on the computer. Instead, they must be integrated out, leaving a highly non-local fermion determinant:
$Z = \int \mathcal{D}U \, \det M \, e^{-S_g}$.
This is the basic integral one wants to evaluate numerically.
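The inner kernel referred to above is the multiplication of $3 \times 3$ complex (SU(3)) matrices. A minimal scalar version in C, purely for illustration (production codes unroll and vectorize this heavily, and often carry several such matrices per vector operation or message):

/* c = a * b for 3x3 complex (SU(3)-style) link matrices. */
#include <complex.h>

typedef double complex su3[3][3];

void su3_mult(su3 c, su3 a, su3 b) {
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double complex s = 0.0;
            for (int k = 0; k < 3; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}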
Figure 4.8: A Lattice Plaquette
Figure 4.9: Updating the Lattice
The biggest stumbling block preventing large QCD simulations with quarks is the presence of the determinant in the partition function. There have been many proposals for dealing with the determinant. The first algorithms tried to compute the change in the determinant when a single link variable was updated [ Weingarten:81a ]. This turned out to be prohibitively expensive. So instead, the approximate method of pseudo-fermions [ Fucito:81a ] was used. Today, however, the preferred approach is the so-called Hybrid Monte Carlo algorithm [ Duane:87a ], which is exact. The basic idea is to invent some dynamics for the variables in the system in order to evolve the whole system forward in (simulation) time, and then do a Metropolis accept/reject for the entire evolution on the basis of the total energy change. The great advantage is that the whole system is updated in one fell swoop. The disadvantage is that if the dynamics are not correct, the acceptance will be very small. Fortunately (and this is one of very few fortuitous happenings where fermions are concerned), good dynamics can be found: the Hybrid algorithm [ Duane:85a ]. This is a neat combination of the deterministic microcanonical method [ Callaway:83a ], [ Polonyi:83a ] and the stochastic Langevin method [ Parisi:81a ], [ Batrouni:85a ], which yields a quickly evolving, ergodic algorithm for both gauge fields and fermions. The computational kernel of this algorithm is the repeated solution of systems of equations of the form
$M^\dagger M \, x = \phi$,
where x and $\phi$ are vectors that live on the sites of the lattice. To solve these equations, one typically uses a conjugate gradient algorithm or one of its cousins, since the fermion matrix is sparse. For more details, see [Gupta:88a]. Such iterative matrix algorithms have as their basic component the sparse matrix-vector multiply kernel, so again computers which do this efficiently will run QCD well.
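For illustration, a generic conjugate gradient solver of the kind used here can be sketched as follows; the operator A is supplied as a callback and would play the role of $M^\dagger M$ in the QCD case. This is a textbook CG sketch under those assumptions, not the optimized QCD solver.

/* Conjugate gradient for A x = b with A symmetric positive definite,
 * supplied as a callback that applies A to a vector.  Work arrays r, p,
 * Ap of length n must be provided by the caller. */
#include <math.h>

typedef void (*matvec_fn)(double *out, const double *in, int n);

static double dot(const double *a, const double *b, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

void cg_solve(matvec_fn A, const double *b, double *x,
              double *r, double *p, double *Ap,
              int n, int max_iter, double tol) {
    for (int i = 0; i < n; i++) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
    double rr = dot(r, r, n);

    for (int it = 0; it < max_iter && sqrt(rr) > tol; it++) {
        A(Ap, p, n);                                 /* Ap = A p              */
        double alpha = rr / dot(p, Ap, n);
        for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];   /* new search direction */
        rr = rr_new;
    }
}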
Lattice QCD is truly a ``grand challenge'' computing problem. It has been estimated that it will take on the order of a TeraFLOP-year of dedicated computing to obtain believable results for the hadron mass spectrum in the quenched approximation, and adding dynamical fermions will involve many orders of magnitude more operations. Where is the computer power needed for QCD going to come from? Today, the biggest resources of computer time for research are the conventional supercomputers at the NSF and DOE centers. These centers are continually expanding their support for lattice gauge theory, but it may not be long before they are overtaken by several dedicated efforts involving concurrent computers. It is a revealing fact that the development of most high-performance parallel computers-the Caltech Cosmic Cube, the Columbia Machine, IBM's GF11, APE in Rome, the Fermilab Machine and the PAX machines in Japan-was actually motivated by the desire to simulate lattice QCD [ Christ:91a ], [ Weingarten:92a ].
As described already, Caltech built the first hypercube computer, the Cosmic Cube or Mark I, in 1983. It had 64 nodes, each of which was an Intel 8086/87 microprocessor with of memory, giving a total of about (measured for QCD). This was quickly upgraded to the Mark II hypercube with faster chips, twice the memory per node, and twice the number of nodes in 1984. Then, QCD was run on the last internal Caltech hypercube, the 128-node Mark IIIfp (built by JPL), at sustained [ Ding:90b ]. Each node of the Mark IIIfp hypercube contains two Motorola 68020 microprocessors, one for communication and the other for calculation, with the latter supplemented by one 68881 co-processor and a 32-bit Weitek floating point processor.
Norman Christ and Anthony Terrano built their first parallel computer for doing lattice QCD calculations at Columbia in 1984 [ Christ:84a ]. It had 16 nodes, each of which was an Intel 80286/87 microprocessor, plus a TRW 22-bit floating point processor with of memory, giving a total peak performance of . This was improved in 1987 using Weitek rather than TRW chips so that 64 nodes gave peak. In 1989, the Columbia group finished building their third machine: a 256-node, , lattice QCD computer [ Christ:90a ].
QCDPAX is the latest in the line of PAX (Parallel Array eXperiment) machines developed at the University of Tsukuba in Japan. The architecture is very similar to that of the Columbia machine. It is a MIMD machine configured as a two-dimensional periodic array of nodes, and each node includes a Motorola 68020 microprocessor and a 32-bit vector floating-point unit. Its peak performance is similar to that of the Columbia machine; however, it achieves only half the floating-point utilization for QCD code [ Iwasaki:91a ].
Don Weingarten initiated the GF11 project in 1984 at IBM. The GF11 is a SIMD machine comprising 576 Weitek floating point processors, each performing at to give the total peak implied by the name. Preliminary results for this project are given in [Weingarten:90a;92a].
The APE (Array Processor with Emulator) computer is basically a collection of 3081/E processors (which were developed by CERN and SLAC for use in high energy experimental physics) with Weitek floating-point processors attached. However, these floating-point processors are attached in a special way-each node has four multipliers and four adders, in order to optimize the calculations, which form the major component of all lattice QCD programs. This means that each node has a peak performance of . The first small machine-Apetto-was completed in 1986 and had four nodes yielding a peak performance of . Currently, they have a second generation of this machine with peak from 16 nodes. By 1993, the APE collaboration hopes to have completed the 2048-node ``Apecento,'' or APE-100, based on specialized VLSI chips that are software compatible with the original APE [ Avico:89a ], [ Battista:92a ]. The APE-100 is a SIMD machine with the architecture based on a three-dimensional cubic mesh of nodes. Currently, a 128-node machine is running with a peak performance of .
Table 4.3: Peak and Real Performances in MFLOPS of ``Homebrew'' QCD Machines
Not to be outdone, Fermilab has also used its high energy experimental physics emulators to construct a lattice QCD machine called ACPMAPS. This is a MIMD machine, using a Weitek floating-point chip set on each node. A 16-node machine, with a peak rate of , was finished in 1989. A 256-node machine, arranged as a hypercube of crates, with eight nodes communicating through a crossbar in each crate, was completed in 1991 [ Fischler:92a ]. It has a peak rate of , and a sustained rate of about for QCD. An upgrade of ACPMAPS is planned, with the number of nodes being increased and the present processors being replaced with two Intel i860 chips per node, giving a peak performance of per node. These performance figures are summarized in Table 4.3 . (The ``real'' performances are the actual performances obtained on QCD codes.)
Major calculations have also been performed on commercial SIMD machines, first on the ICL Distributed Array Processor (DAP) at Edinburgh University during the period from 1982 to 1987 [ Wallace:84a ], and now on the TMC Connection Machine (CM-2); and on commercial distributed memory MIMD machines like the nCUBE hypercube and Intel Touchstone Delta machines at Caltech. Currently, the Connection Machine is the most powerful commercial QCD machine available, running full QCD at a sustained rate of approximately on a CM-2 [ Baillie:89e ], [ Brickner:91b ]. However, simulations have recently been performed at a rate of on the experimental Intel Touchstone Delta at Caltech. This is a MIMD machine made up of 528 Intel i860 processors connected in a two-dimensional mesh, with a peak performance of for 32-bit arithmetic. These results compare favorably with performances on traditional (vector) supercomputers. Highly optimized QCD code runs at about per processor on a CRAY Y-MP, or on a fully configured eight-processor machine.
The latest generation of commercial parallel supercomputers, represented by the CM-5 and the Intel Paragon, has a still higher peak performance. There was a proposal for the development of a TeraFLOPS parallel supercomputer for QCD and other numerically intensive simulations [Christ:91a], [Aoki:91a]. The goal was to build a machine based on the CM-5 architecture in collaboration with Thinking Machines Corporation, which would be ready by 1995 at a cost of around $40 million.
It is interesting to note that when the various groups began building their ``home-brew'' QCD machines, it was clear they would outperform all commercial (traditional) supercomputers; however, now that commercial parallel supercomputers have come of age [ Fox:89n ], the situation is not so obvious. To emphasize this, we describe QCD calculations on both the home grown Caltech hypercube and on the commercially available Connection Machine.
Guy Robinson
To make good use of MIMD distributed-memory machines like hypercubes, one should employ domain decomposition. That is, the domain of the problem should be divided into subdomains of equal size, one for each processor in the hypercube; and communication routines should be written to take care of data transfer across the processor boundaries. Thus, for a lattice calculation, the N sites are distributed among the processors using a decomposition that ensures that processors assigned to adjacent subdomains are directly linked by a communication channel in the hypercube topology. Each processor then independently works through its subdomain of sites, updating each one in turn, and only communicating with neighboring processors when doing boundary sites. This communication enforces ``loose synchronization,'' which stops any one processor from racing ahead of the others. Load balancing is achieved with equal-size domains. If the nodes contain at least two sites of the lattice, all the nodes can update in parallel, satisfying detailed balance, since loose synchronicity guarantees that all nodes will be doing black, then red sites alternately.
The characteristic timescale of the communication, $t_{comm}$, corresponds roughly to the time taken to transfer a single SU(3) matrix from one node to its neighbor. Similarly, we can characterize the calculational part of the algorithm by a timescale, $t_{calc}$, which is roughly the time taken to multiply together two such matrices. For all hypercubes built without floating-point accelerator chips, $t_{calc} \gg t_{comm}$ and, hence, QCD simulations are extremely ``efficient,'' where efficiency (Equations 3.10 and 3.11) is defined by the relation
$\epsilon = \frac{T_1}{k \, T_k}$,
where $T_k$ is the time taken for k processors to perform the given calculation. Typically, such calculations have efficiencies close to one, which means they are ideally suited to this type of computation, since doubling the number of processors nearly halves the total computational time required for solution. However, as we shall see (for the Mark IIIfp hypercube, for example), the picture changes dramatically when fast floating-point chips are used; then $t_{calc}$ becomes comparable to $t_{comm}$ and one must take some care in coding to obtain maximum performance.
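As a worked example of this definition, using the Mark IIIfp figures quoted later in this section (a speedup of 100 over a single node on k = 128 nodes), the efficiency is $\epsilon = T_1/(128\,T_{128}) = 100/128 \approx 0.78$, the 78 percent quoted there for the most highly optimized code.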
Rather than describe every calculation done on the Caltech hypercubes, we shall concentrate on one calculation that has been done several times as the machine evolved-the heavy quark potential calculation (``heavy'' because the quenched approximation is used).
QCD provides an explanation of why quarks are confined inside hadrons, since lattice calculations reveal that the inter-quark potential rises linearly as the separation between the quarks increases. Thus, the attractive force (the derivative of the potential) is independent of the separation, unlike other forces, which usually decrease rapidly with distance. This force, called the ``string tension,'' is carried by the gluons, which form a kind of ``string'' between the quarks. On the other hand, at short distances, quarks and gluons are ``asymptotically free'' and behave like electrons and photons, interacting via a Coulomb-like force. Thus, the quark potential V is written as
$V(R) = -\frac{\alpha}{R} + \sigma R$,
where R is the separation of the quarks, $\alpha$ is the coefficient of the Coulombic potential and $\sigma$ is the string tension. In fitting experimental charmonium data to this Coulomb plus linear potential, Eichten et al. [Eichten:80a] estimated values for $\alpha$ and for the string tension, $\sigma = 0.18\,\mathrm{GeV}^2$. Thus, a goal of the lattice calculations is to reproduce these numbers. En route to this goal, it is necessary to show that the numbers from the lattice are ``scaling,'' that is, if one calculates a physical observable on lattices with different spacings, then one gets the same answer. This means that the artifacts due to the finiteness of the lattice spacing have disappeared and continuum physics can be extracted.
The first heavy quark potential calculation using a Caltech hypercube was performed on the 64-node Mark I in 1984 on a lattice with ranging from to [ Otto:84a ]. The value of was found to be and the string tension (converting to the dimensionless ratio) . The numbers are quite a bit off from the charmonium data but the string tension did appear to be scaling, albeit in the narrow window .
The next time around, in 1986, the 128-node Mark II hypercube was used on a lattice with [ Flower:86b ]. The dimensionless string tension decreased somewhat to 83, but clear violations of scaling were observed: The lattice was still too coarse to see continuum physics.
Therefore, the last (1989) calculation using the Caltech/JPL 32-node Mark IIIfp hypercube concentrated on one value of the coupling and investigated several different lattice sizes [Ding:90b]. Scaling was not investigated; however, the values of $\alpha$ and the string tension, $\sigma = 0.15\,\mathrm{GeV}^2$, compare favorably with the charmonium data. This work is based on about 1300 CPU hours on the 32-node Mark IIIfp hypercube, which has a performance of roughly twice a CRAY X-MP processor. The whole 128-node machine achieves a speedup of 100 over a single node, and hence, an efficiency of 78 percent. These figures are for the most highly optimized code. The original version of the code, written in C, ran on both the Motorola and the Weitek chips. The communication time, which is roughly the same for both, is less than a 2 percent overhead for the former but nearly 30 percent for the latter. When the computationally intensive parts of the calculation are written in assembly code for the Weitek, this overhead becomes almost 50 percent. This communication overhead, shown in lines two and three in Table 4.4, is dominated by the hardware/software message startup overhead (latency), because for the Mark IIIfp the node-to-node communication time, $t_{comm}$, is given by
$t_{comm} = t_{latency} + W \, t_{word}$,
where W is the number of words transmitted and $t_{latency}$ is the message startup time. To speed up the communication, we update all even (or odd) links (eight in our case) in each node, allowing us to transfer eight matrix products at a time, instead of just sending one in each message. This reduces the communication overhead substantially, because the startup cost is paid once per batch of eight matrices rather than once per matrix. On all hypercubes with fast floating-point chips-and on most hypercubes without these chips for less computationally intensive codes-such ``vectorization'' of communication is often important. In Figure 4.10, the speedups for many different lattice sizes are shown. For the largest lattice size, the speedup is 100 on the 128-node machine. The speedup is almost linear in the number of nodes. As the total lattice volume increases, the speedup increases, because the ratio of calculation to communication increases. For more information on this performance analysis, see [Ding:90c].
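To see where the gain comes from, take a linear cost model of this form (the specific Mark IIIfp constants are not reproduced here): sending the eight matrix products separately costs $8(t_{latency} + W\,t_{word})$, whereas one aggregated message costs $t_{latency} + 8W\,t_{word}$, so seven of the eight startup latencies are eliminated. When the startup term dominates, this is nearly a factor of eight in communication time.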
Table 4.4: Link Update Time (msec) on Mark IIIfp Node for Various Levels of Programming
Figure 4.10: Speedups for QCD on the Mark IIIfp
The Connection Machine Model CM-2 is also very well suited for large scale simulations of QCD. The CM-2 is a distributed-memory, single-instruction, multiple-data (SIMD), massively parallel processor comprising up to 65536 ($2^{16}$) processors [Hillis:85a;87a]. Each processor consists of an arithmetic-logic unit (ALU), local random-access memory (RAM), and a router interface to perform communications among the processors. There are 16 processors and a router per custom VLSI chip, with the chips interconnected as a 12-dimensional hypercube. Communications among processors within a chip work essentially like a crossbar interconnect. The router can do general communications, but only local communication is required for QCD, so we use the fast nearest-neighbor communication software called NEWS. The processors deal with one bit at a time. Therefore, the ALU can compute any two Boolean functions as output from three inputs, and all datapaths are one bit wide. In the current version of the Connection Machine (the CM-2), groups of 32 processors (two chips) share a 32-bit (or 64-bit) Weitek floating-point chip, and a transposer chip, which changes 32 bits stored bit-serially within 32 processors into 32 32-bit words for the Weitek, and vice versa.
The high-level languages on the CM, such as *Lisp and CM-Fortran, compile into an assembly language called Parallel Instruction Set (Paris). Paris regards the bit-serial processors as the fundamental units in the machine. However, floating-point computations are not very efficient in the Paris model. This is because in Paris, 32-bit floating-point numbers are stored ``fieldwise''; that is, successive bits of the word are stored at successive memory locations of each processor's memory. However, 32 processors share one Weitek chip, which deals with words stored ``slicewise''-that is, across the processors, one bit in each. Therefore, to do a floating-point operation, Paris loads in the fieldwise operands, transposes them slicewise for the Weitek (using the transposer chip), does the operation, and transposes the slicewise result back to fieldwise for memory storage. Moreover, every operation in Paris is an atomic process; that is, two operands are brought from memory and one result is stored back to memory, so no use is made of the Weitek registers for intermediate results. Hence, to improve the performance of the Weiteks, a new assembly language called CM Instruction Set (CMIS) has been written, which models the local architectural features much better. In fact, CMIS ignores the bit-serial processors and thinks of the machine in terms of the Weitek chips. Thus, data can be stored slicewise, eliminating all the transposing back and forth. CMIS allows effective use of the Weitek registers, creating a memory hierarchy, which, combined with the internal buses of the Weiteks, offers increased bandwidth for data motion.
When the arithmetic part of the program is rewritten in CMIS (just as on the Mark IIIfp when it was rewritten in assembly code), the communications become a bottleneck. Therefore, we need also to speed up the communication part of the code. On the CM-2, this is done using the ``bi-directional multi-wire NEWS'' system. As explained above, the CM chips (each containing 16 processors) are interconnected in a 12-dimensional hypercube. However, since there are two CM chips for each Weitek floating-point chip, the floating-point hardware is effectively wired together as an 11-dimensional hypercube, with two wires in each direction. This makes it feasible to do simultaneous communications in both directions of all four space-time directions in QCD-bidirectional multiwire NEWS-thereby reducing the communication time by a factor of eight. Moreover, the data rearrangement necessary to make use of this multiwire NEWS further speeds up the CMIS part of the code by a factor of two.
In 1990-1992, the Connection Machine was the most powerful commercial QCD machine available: the ``Los Alamos collaboration'' ran full QCD at a sustained rate of on a CM-2 [ Brickner:91a ]. As was the case for the Mark IIIfp hypercube, in order to obtain this performance, one must resort to writing assembly code for the Weitek chips and for the communication. Our original code, written entirely in the CM-2 version of *Lisp, achieved around [ Baillie:89e ]. As shown in Table 4.5 , this code spends 34 percent of its time doing communication. When we rewrote the most computationally intensive part in the assembly language CMIS, this rose to 54 percent. Then when we also made use of ``multi-wire NEWS'' (to reduce the communication time by a factor of eight), it fell to 30 percent. The Intel Delta and Paragon, as well as Thinking Machines CM-5, passed the CM-2 performance levels in 1993, but here optimization is not yet complete [ Gupta:93a ].
Table 4.5: Fermion Update Time (sec) on Connection Machine for Various Levels of Programming
The status of lattice QCD may be summed up as: under way. Already there have been some nice results in the context of the quenched approximation, but the lattices are still too coarse and too small to give definitive results. Results for full QCD are going to take orders of magnitude more computer time, but we now have an algorithm-Hybrid Monte Carlo-which puts real simulations within reach.
When will the computer power be sufficient? In Figure 4.11 , we plot the horsepower of various QCD machines as a function of the year they started to produce physics results. The performance plotted in this case is the real sustained rate on actual QCD codes. The surprising fact is that the rate of increase is very close to exponential, yielding a factor of 10 every two years! On the same plot, we show our estimate of the computer power needed to redo correct quenched calculations on a lattice. This estimate is also a function of time, due to algorithm improvements.
Figure 4.11: MFLOPS for QCD Calculations
Extrapolating these trends, we see the outlook for lattice QCD is rather bright. Reasonable results for the phenomenologically interesting physical observables should be available within the quenched approximation in the mid-1990s. With the same computer power, we will be able to redo today's quenched calculations using dynamic fermions (but still on today's size of lattice). This will tell us how reliable the quenched approximation is. Finally, results for the full theory with dynamical fermions on a lattice should follow early in the next century (!), when computers are two or three orders of magnitude more powerful.
Spin models are simple statistical models of real systems, such as magnets, which exhibit the same behavior and hence provide an understanding of the physical mechanisms involved. Despite their apparent simplicity, most of these models are not exactly soluble by present theoretical methods. Hence, computer simulation is used. Usually, one is interested in the behavior of the system at a phase transition; the computer simulation reveals where the phase boundaries are, what the phases on either side are, and how the properties of the system change across the phase transition. There are two varieties of spins: discrete or continuous valued. In both cases, the spin variables are put on the sites of the lattice and only interact with their nearest neighbors. The partition function for a spin model is
$Z = \sum_{\{s\}} e^{-S}$,
with the action being of the form
$S = -K \sum_{\langle ij \rangle} s_i s_j$,
where $\langle ij \rangle$ denotes nearest neighbors, $s_i$ is the spin at site i, and K is a coupling parameter which is proportional to the interaction strength and inversely proportional to the temperature. A great deal of work has been done over the years in finding good algorithms for computer simulations of spin models; recently some new, much better, algorithms have been discovered. These so-called cluster algorithms are described in detail in Section 12.6. Here, we shall describe results obtained from using them to perform large-scale Monte Carlo simulations of several spin models-both discrete and continuous.
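As a concrete illustration of this action, the sketch below evaluates $S = -K \sum_{\langle ij \rangle} s_i s_j$ for a two-dimensional array of $\pm 1$ spins with periodic boundaries; the lattice size is an arbitrary illustrative choice, and the three-dimensional case simply adds one more loop and one more forward neighbor.

/* Nearest-neighbor spin action S = -K * sum over <ij> of s_i s_j on a
 * 2-D periodic lattice; spins are stored as +1/-1 integers.  Each bond is
 * counted once by summing only the "forward" neighbors of every site. */
#define LS 32

static int spin[LS][LS];

double spin_action(double K) {
    double sum = 0.0;
    for (int x = 0; x < LS; x++)
        for (int y = 0; y < LS; y++) {
            int s = spin[x][y];
            sum += s * spin[(x + 1) % LS][y];   /* bond in +x direction */
            sum += s * spin[x][(y + 1) % LS];   /* bond in +y direction */
        }
    return -K * sum;
}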
The Ising model is the simplest model for ferromagnetism that predicts phase transitions and critical phenomena. The spins are discrete and have only two possible states. This model, introduced by Lenz in 1920 [Lenz:20a], was solved in one dimension by Ising in 1925 [Ising:25a], and in two dimensions by Onsager in 1944 [Onsager:44a]. However, it has not been solved analytically in three dimensions, so Monte Carlo computer simulation has been one of the methods used to obtain numerical solutions. One of the best available techniques for this is the Monte Carlo Renormalization Group (MCRG) method [Wilson:80a], [Swendsen:79a]. The Ising model exhibits a second-order phase transition in d=3 dimensions at a critical temperature $T_c$. As T approaches $T_c$, the correlation length $\xi$ diverges as a power law with critical exponent $\nu$:
$\xi \sim |T - T_c|^{-\nu}$,
and the pair correlation function at $T_c$ falls off to zero with distance r as a power law defining the critical exponent $\eta$:
$G(r) \sim r^{-(d-2+\eta)}$.
$T_c$, $\nu$, and $\eta$ determine the critical behavior of the 3-D Ising model, and it is their values we wish to determine using MCRG.
In 1984, this was done by Pawley, Swendsen, Wallace and Wilson [ Pawley:84a ] in Edinburgh on the ICL DAP computer with high statistics. They ran on four lattice sizes- , , and -measuring seven even and six odd spin operators. We are essentially repeating their calculation on the new AMT DAP computer. Why should we do this? First, to investigate finite size effects-we have run on the biggest lattice used by Edinburgh, , and on a bigger one, . Second, to investigate truncation effects-qualitatively the more operators we measure for MCRG, the better, so we have included 53 even and 46 odd operators. Third, we are making use of the new cluster-updating algorithm due to Swendsen and Wang [ Swendsen:87a ], implemented according to Wolff [ Wolff:89b ]. Fourth, we would like to try to measure another critical exponent more accurately-the correction-to-scaling exponent , which plays an important role in the analysis.
The idea behind MCRG is that the correlation length diverges at the critical point, so that certain quantities should be invariant under ``renormalization'', which here means a transformation of the length scale. On the lattice, we can double the lattice spacing by, for example, ``blocking'' the spin values on a square plaquette into a single spin value on a lattice with 1/4 the number of sites. For the Ising model, the blocked spin value is given the value taken by the majority of the 4 plaquette spins, with a random tie-breaker for the case where there are 2 spins in either state. Since quantities are only invariant under this MCRG procedure at the critical point, this provides a method for finding the critical point.
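A minimal sketch of this majority-rule blocking for the two-dimensional (2 x 2) case just described; spins are stored as +1/-1 integers, the fine lattice size is an arbitrary illustrative choice, and ties are broken at random.

/* Majority-rule block-spin transformation: each 2x2 block of +-1 spins
 * maps to one blocked spin, with a random tie-break when the block sum
 * is zero. */
#include <stdlib.h>

#define LB 32                         /* fine lattice is LB x LB, LB even */

static int fine[LB][LB];
static int coarse[LB / 2][LB / 2];

static int random_sign(void) { return (rand() & 1) ? +1 : -1; }

void block_spins(void) {
    for (int X = 0; X < LB / 2; X++)
        for (int Y = 0; Y < LB / 2; Y++) {
            int s = fine[2 * X][2 * Y]     + fine[2 * X + 1][2 * Y]
                  + fine[2 * X][2 * Y + 1] + fine[2 * X + 1][2 * Y + 1];
            coarse[X][Y] = (s > 0) ? +1 : (s < 0) ? -1 : random_sign();
        }
}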
In order to calculate the quantities of interest using MCRG, one must evaluate the spin operators. In [Pawley:84a], the calculation was restricted to seven even spin operators and six odd; we evaluated 53 and 46, respectively [Baillie:91d]. Specifically, we decided to evaluate the most important operators in a cube [Baillie:88h]. To determine the critical coupling (or inverse temperature), $K_c$, one performs independent Monte Carlo simulations on a large lattice L and on smaller lattices S, and compares the operators measured on the large lattice blocked m times more than the smaller lattices. $K = K_c$ when they are the same. Since the effective lattice sizes are the same, unknown finite size effects should cancel. The critical exponents are obtained directly from the eigenvalues $\lambda$ of the stability matrix, which measures changes between different blocking levels, according to $\lambda = 2^y$, where y is the corresponding exponent and 2 is the scale factor of each blocking. In particular, the leading even eigenvalue gives $\nu$ from $\lambda_{even} = 2^{1/\nu}$, and, similarly, $\eta$ is obtained from the leading odd eigenvalue.
The Distributed Array Processor (DAP) is a SIMD computer consisting of bit-serial processing elements (PEs) configured as a cyclic two-dimensional grid with nearest-neighbor connectivity. The Ising model computer simulation is well suited to such a machine since the spins can be represented as single-bit (logical) variables. In three dimensions, the system of spins is configured as a simple cubic lattice, which is ``crinkle mapped'' onto the DAP by storing a piece of each of M planes in each PE. Our Monte Carlo simulation uses a hybrid algorithm in which each sweep consists of 10 standard Metropolis [ Metropolis:53a ] spin updates followed by one cluster update using Wolff's single-cluster variant of the Swendsen and Wang algorithm. The autocorrelation time of the magnetization is reduced from sets of 100 sweeps for Metropolis alone to sets of 10 Metropolis plus one cluster update for the hybrid algorithm. In order to measure the spin operators, the DAP code simply histograms the spin configurations so that an analysis program can later pick out each particular spin operator using a look-up table. Currently, the code requires the same time to do one histogram measurement, one Wolff single-cluster update, or 100 Metropolis updates. Therefore, our hybrid of 10 Metropolis plus one cluster update takes about the same time as a measurement. On a DAP 510, this hybrid update takes on average 127 secs (13.5 secs) for the larger (smaller) lattices. We have performed simulations on both lattices at two values of the coupling, one of them Edinburgh's best estimate of the critical coupling. We accumulated measurements for each of the simulations, so that the total time used for this calculation is roughly 11,000 hours. For error analysis, the measurements are divided into bins.
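The cluster part of the hybrid sweep is Wolff's single-cluster update. The following is only a serial sketch for a two-dimensional periodic lattice, intended to show the algorithm; the production code is a data-parallel three-dimensional implementation on the DAP, and drand() is an assumed uniform random number generator.

/* Serial sketch of Wolff's single-cluster update for an Ising model with
   spins +1/-1 on an L x L periodic lattice. */
#include <math.h>
#include <stdlib.h>

static double drand(void) { return (rand() + 1.0) / ((double)RAND_MAX + 2.0); }

void wolff_update(int L, int spin[], double beta)
{
    int    n     = L * L;
    double p_add = 1.0 - exp(-2.0 * beta);   /* bond-activation probability */
    int   *stack = malloc(n * sizeof(int));
    int    top   = 0;

    int seed = rand() % n;                   /* random seed site            */
    int s0   = spin[seed];                   /* sign of the growing cluster */
    spin[seed] = -s0;                        /* flip spins as they join     */
    stack[top++] = seed;

    while (top > 0) {
        int site = stack[--top];
        int x = site % L, y = site / L;
        int nbr[4] = { y * L + (x + 1) % L,     y * L + (x + L - 1) % L,
                       ((y + 1) % L) * L + x,   ((y + L - 1) % L) * L + x };
        for (int k = 0; k < 4; k++) {
            int j = nbr[k];
            if (spin[j] == s0 && drand() < p_add) {   /* add neighbour to cluster */
                spin[j] = -s0;
                stack[top++] = j;
            }
        }
    }
    free(stack);
}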
In analyzing our results, the first thing we have to decide is the order in which to arrange our 53 even and 46 odd spin operators. Naively, they can be arranged in order of increasing total distance between the spins [ Baillie:88h ] (as was done in [ Pawley:84a ]). However, the ranking of a spin operator is determined physically by how much it contributes to the energy of the system. Thus, we did our analysis initially with the operators in the naive order to calculate their energies, then subsequently we used the ``physical'' order dictated by these energies. This physical order of the first 20 even operators is shown in Figure 4.12 with six of Edinburgh's operators indicated; the seventh Edinburgh operator (E-6) is our 21st. This order is important in assessing the systematic effects of truncation, as we are going to analyze our data as a function of the number of operators included. Specifically, we successively diagonalize the $1\times 1$, $2\times 2$, \ldots, $n\times n$ ($n = 53$ for even operators, 46 for odd) stability matrix to obtain its eigenvalues and, thus, the critical exponents.
Figure 4.12: Our Order for Even Spin Operators
We present our results in terms of the eigenvalues of the even and odd parts of the stability matrix. The leading even eigenvalue on the first four blocking levels starting from the smaller lattice is plotted against the number of operators included in the analysis in Figure 4.13 , and on the first five blocking levels starting from the larger lattice in Figure 4.14 . Similarly, the leading odd eigenvalues for the smaller and larger lattices are shown in Figures 4.15 and 4.16 , respectively. First of all, note that there are significant truncation effects: the values of the eigenvalues do not settle down until at least 30, and perhaps 40, operators are included. We note also that our value agrees with Edinburgh's when around seven operators are included; this is a significant verification that the two calculations are consistent. With most or all of the operators included, our values on the two different lattice sizes agree, and the agreement improves with increasing blocking level. Thus, we feel that we have overcome the finite-size effects, so that the smaller lattice is just large enough. However, the advantage in going to the larger lattice is obvious in Figures 4.14 and 4.16 : there, we can perform one more blocking, which reveals that the results on the fourth and fifth blocking levels are consistent. This means that we have eliminated most of the transient effects near the fixed point in the MCRG procedure. We also see that the main limitation of our calculation is statistics: the error bars are still rather large for the highest blocking level.
Now, in order to obtain values for $\nu$ and $\eta$, we must extrapolate our results from a finite number of blocking levels to an infinite number. This is done by fitting the corresponding eigenvalues $\lambda_e^{(n)}$ and $\lambda_o^{(n)}$ at blocking level $n$ according to
$$ \lambda^{(n)} = \lambda^{(\infty)} + c\,2^{-n\omega} , $$
where $\lambda^{(\infty)}$ is the extrapolated value and $\omega$ is the correction-to-scaling exponent. Therefore, we first need to calculate $\omega$, which comes directly from the second leading even eigenvalue, $\lambda_{e2} = 2^{-\omega}$. Our best estimate for $\omega$ lies in an interval whose upper end is 0.85, and we use the value 0.85 for the purpose of extrapolation, since it gives the best fits. For the final results for $\nu$ and $\eta$, the first errors quoted are statistical and the second are estimates of the systematic error coming from the uncertainty in $\omega$.
Figure 4.13: Leading Even Eigenvalue on the Smaller Lattice
Figure 4.14: Leading Even Eigenvalue on the Larger Lattice
Figure 4.15: Leading Odd Eigenvalue on the Smaller Lattice
Figure 4.16: Leading Odd Eigenvalue on the Larger Lattice
Finally, perhaps the most important number, because it can be determined the most accurately, is the critical coupling $K_c$. By comparing the fifth blocking level on the larger lattice to the fourth on the smaller lattice for both coupling values and taking a weighted mean, we obtain our estimate of $K_c$, where again the first error is statistical and the second is systematic.
Thus, MCRG calculations give us very accurate values for the three critical parameters $K_c$, $\nu$, and $\eta$, and give a reasonable estimate for $\omega$. Each parameter is obtained independently and directly from the data. We have shown that truncation and finite-size errors at all but the highest blocking level have been reduced to below the statistical errors. Future high-statistics simulations on larger lattices will significantly reduce the remaining errors and allow us to determine the exponents very accurately.
The q-state Potts model [ Potts:52a ] consists of a lattice of spins $\sigma_i$, each of which can take q different values, and whose Hamiltonian is
$$ H = -J \sum_{\langle i,j \rangle} \delta_{\sigma_i \sigma_j} , $$
where the sum runs over nearest-neighbor pairs.
For q=2 , this is equivalent to the Ising model. The Potts model is thus a simple extension of the Ising model; however, it has a much richer phase structure, which makes it an important testing ground for new theories and algorithms in the study of critical phenomena [ Wu:82a ].
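For concreteness, a small sketch of the energy implied by this Hamiltonian on a two-dimensional periodic lattice is given below; the array layout and function name are illustrative only.

/* Sketch of the Potts energy for an L x L periodic square lattice: each
   like-valued nearest-neighbour pair contributes -J.  Spins take values
   0..q-1 and are stored row by row in spin[]. */
double potts_energy(int L, const int spin[], double J)
{
    double E = 0.0;
    for (int y = 0; y < L; y++)
        for (int x = 0; x < L; x++) {
            int s = spin[y * L + x];
            if (s == spin[y * L + (x + 1) % L])   E -= J;  /* right neighbour */
            if (s == spin[((y + 1) % L) * L + x]) E -= J;  /* down neighbour  */
        }
    return E;
}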
Monte Carlo simulations of Potts models have traditionally used local algorithms such as that of Metropolis et al. [ Metropolis:53a ]. However, these algorithms have the major drawback that, near a phase transition, the autocorrelation time (the number of sweeps needed to generate a statistically independent configuration) increases approximately as $L^2$, where L is the linear size of the lattice. New algorithms have recently been developed that dramatically reduce this ``critical slowing down'' by updating clusters of spins at a time (these algorithms are described in Section 12.6 ). The original cluster algorithm of Swendsen and Wang (SW) was implemented for the Potts model [ Swendsen:87a ], and there is a lot of interest in how well cluster algorithms perform for this model. At present, very few theoretical results are known about cluster algorithms, and theoretical advances are most likely to come from first studying the simplest possible models.
We have made a high-statistics study of the SW algorithm and the single-cluster Wolff algorithm [ Wolff:89b ], as well as a number of variants of these algorithms, for the q=2 and q=3 Potts models in two dimensions [ Baillie:90n ]. We measured the autocorrelation time in the energy (a local operator) and the magnetization (a global one) over a range of lattice sizes. About 10 million sweeps were required for each lattice size in order to measure autocorrelation times to within about 1 percent. From these values, we can extract the dynamic critical exponent z , given by $\tau \propto L^z$, where $\tau$ is measured at the infinite-volume critical point (which is known exactly for the two-dimensional Potts model).
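The autocorrelation measurement itself is a standard time-series analysis. The sketch below computes an integrated autocorrelation time for a sequence of measurements using a simple self-consistent window; the actual analysis may differ in detail (binning, windowing convention, error estimates).

/* Sketch of an integrated autocorrelation time estimate for a time series
   x[0..n-1] (e.g., the energy measured each sweep).  The sum over the
   normalized autocorrelation function is cut off at t <= 6*tau, one common
   windowing convention. */
#include <stddef.h>

double tau_int(const double x[], size_t n)
{
    if (n < 2) return 0.5;
    double mean = 0.0, var = 0.0;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (double)n;
    if (var == 0.0) return 0.5;

    double tau = 0.5;                          /* tau_int = 1/2 + sum_t rho(t) */
    for (size_t t = 1; t < n / 2; t++) {
        double c = 0.0;
        for (size_t i = 0; i + t < n; i++)
            c += (x[i] - mean) * (x[i + t] - mean);
        c /= (double)(n - t) * var;            /* normalized rho(t)            */
        tau += c;
        if ((double)t >= 6.0 * tau) break;     /* self-consistent window       */
    }
    return tau;
}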
The simulations were performed on a number of different parallel computers. For the smaller lattice sizes, it is possible to run independent simulations on each processor of a parallel machine, enabling us to obtain 100 percent efficiency by running 10 or 20 runs for each lattice size in parallel, using different random number streams. These calculations were done using a 32-processor Meiko Computing Surface, a 20-processor Sequent Symmetry, a 20-processor Encore Multimax, and a 96-processor BBN GP1000 Butterfly, as well as a network of SUN workstations. The calculations took approximately 15,000 processor-hours. For the two largest lattice sizes, a parallel cluster algorithm was required, due to the large amount of calculation (and memory) needed. We used the self-labelling algorithm described in Section 12.6 , which gives fairly good efficiencies of about 70 percent on the machines we have used (an nCUBE-1 and a Symult S2010), by doing multiple runs of 32 nodes each for the smaller of these two lattices and 64 nodes for the larger. Since this problem does not vectorize, using all 512 nodes of the nCUBE gives a performance approximately five times that of a single-processor CRAY X-MP, while all 192 nodes of the Symult are equivalent to about six CRAYs. The calculations on these machines have so far taken about 1000 hours.
Results for the autocorrelation times of the energy for the Wolff and SW algorithms are shown in Figure 4.17 for q=2 and Figure 4.18 for q=3 . As can be seen, the Wolff algorithm has smaller autocorrelation times than SW. However, the dynamic critical exponents for the two algorithms appear to be identical for each q (shown as straight lines in Figures 4.17 and 4.18 ), compared to values of approximately 2 for the standard Metropolis algorithm.
Figure 4.17: Energy Autocorrelation Times, q=2
Figure 4.18: Energy Autocorrelation Times, q=3
Burkitt and Heermann [ Heermann:90a ] have suggested that, for the q=2 case (the Ising model), the increase in the autocorrelation time is logarithmic rather than a power law, that is, z = 0 . Fits to this form are shown as dotted lines in Figure 4.17 . These have smaller $\chi^2$ values than the power-law fits, favoring logarithmic behavior. However, it is very difficult to distinguish between a logarithm and a small power, even on our largest lattices. In any case, the performance of the cluster algorithms for the Potts model is quite extraordinary, with autocorrelation times on the largest lattice hundreds of times smaller than for the Metropolis algorithm. In the future, we hope to use the cluster algorithms to perform greatly improved Monte Carlo simulations of various Potts models, to study their critical behavior.
There is little theoretical understanding of why cluster algorithms work so well, and in particular there is no theory which predicts the dynamic critical exponents for a given model. These values can currently only be obtained from measurements using Monte Carlo simulation. Our results, which are the best currently available, are shown in Table 4.6 . We would like to know why, for example, critical slowing down is virtually eliminated for the two-dimensional 2-state Potts model, but z is nearly one for the 4-state model; and why the dynamic critical exponents for the SW and Wolff algorithms are approximately the same in two dimensions, but very different in higher dimensions.
Table 4.6: Measured Dynamic Critical Exponents for Potts Model Cluster Algorithms.
The only rigorous analytic result so far obtained for cluster algorithms was derived by Li and Sokal [ Li:89a ]. They showed that the autocorrelation time for the energy using the SW algorithm is bounded below (as a function of the lattice size) by the specific heat $C_H$, that is, $\tau_{SW} \geq \mathrm{const} \times C_H$, which implies that the corresponding dynamic critical exponent is bounded by $z_{SW} \geq \alpha/\nu$, where $\alpha$ and $\nu$ are the critical exponents measuring the divergence at the critical point of the specific heat and the correlation length, respectively. A similar bound has also been derived for the Metropolis algorithm, but with the susceptibility exponent $\gamma$ substituted for the specific heat exponent $\alpha$.
No such result is known for the Wolff algorithm, so we have attempted to check this result empirically using simulation [ Coddington:92a ]. We found that for the Ising model in two, three, and four dimensions, the above bound appears to be satisfied (at least to a very good approximation); that is, there are constants a and b such that $a\,C_H \leq \tau \leq b\,C_H$, and thus $z \approx \alpha/\nu$, for the Wolff algorithm.
This led us to investigate similar empirical relations between dynamic and static quantities for the SW algorithm. The power of cluster update algorithms comes from the fact that they flip large clusters of spins at a time. The average size of the largest SW cluster (scaled by the lattice volume), m , is an estimator of the magnetization for the Potts model, and the exponent $\beta/\nu$ characterizing the scaling of the magnetization has values which are similar to our measured values for the dynamic exponents of the SW algorithm. We therefore scaled the SW autocorrelations by m , and found that within the errors of the simulations, this gave either a constant (in three and four dimensions) or a logarithm (in two dimensions). This implies that the SW autocorrelations scale in the same way (up to logarithmic corrections) as the magnetization, that is, $z_{SW} \approx \beta/\nu$.
These simple empirical relations are very surprising, and if true, would be the first analytic results equating dynamic quantities, which are dependent on the Monte Carlo algorithm used, to static quantities, which depend only on the physical model. These relations could perhaps stem from the fact that the dynamics of cluster algorithms are closely linked to the physical properties of the system, since the Swendsen-Wang clusters are just the Coniglio-Klein-Fisher droplets which have been used to describe the critical behavior of these systems [ Fisher:67a ] [ Coniglio:80a ].
We are currently doing further simulations to check whether these relations hold up with larger lattices and better statistics, or whether they are just good approximations. We are also trying to determine whether similar results hold for the general q -state Potts model. However, we have thus far only been able to find simple relations for the q=2 (Ising) model. This work is being done using both parallel machines (the nCUBE-1, nCUBE-2, and Symult S2010) and networks of DEC, IBM, and Hewlett-Packard workstations. These high-performance RISC workstations were especially useful in obtaining good results for the Wolff algorithm, which does not vectorize or parallelize, apart from the trivial parallelism we used in running independent simulations on different processors.
The XY (or O(2)) model consists of a set of continuous-valued spins regularly arranged on a two-dimensional square lattice. Fifteen years ago, Kosterlitz and Thouless (KT) predicted that this system would undergo a phase transition from a low-temperature spin-wave phase to a high-temperature phase with unbound vortices. KT predicted an approximate transition temperature, $T_{KT}$, and the following unusual exponential singularity in the correlation length $\xi$ and magnetic susceptibility $\chi$:
$$ \xi \sim a_{\xi}\, e^{\,b_{\xi} t^{-\nu}} , \qquad \chi \sim a_{\chi}\, e^{\,b_{\chi} t^{-\nu}} , $$
with $\nu = 1/2$, where $t = (T - T_{KT})/T_{KT}$ and the correlation function exponent $\eta$ is defined by the relation $\chi \sim \xi^{\,2-\eta}$.
Our simulation [ Gupta:88a ] was done on the 128-node FPS (Floating Point Systems) T-Series hypercube at Los Alamos. FPS software allowed the use of C with a software model similar to that used on the hypercubes at Caltech (communication implemented by subroutine call). Each FPS node is built around Weitek floating-point units, and the total machine ran at about twice the performance of one processor of a CRAY X-MP for this application. We use a 1-D torus topology for communications, with each node processing a fraction of the rows. Each row is divided into red/black alternating sites of spins, and the vector loop is over a given color. This gives a natural data structure for each lattice size. The internode communications, in both lattice update and measurement of observables, can be done asynchronously and are a negligible overhead.
Figure 4.19: Autocorrelation Times for the XY Model
Previous numerical work was unable to confirm the KT theory, due to limited statistics and small lattices. Our high-statistics simulations are done on four lattice sizes using a combination of over-relaxed and Metropolis algorithms, which decorrelates as a much smaller power of the correlation length than the Metropolis algorithm alone (see below). Each configuration represents a number of over-relaxed sweeps through the lattice followed by some Metropolis sweeps. Measurement of observables is made on every configuration. The over-relaxed algorithm consists of reflecting the spin at a given site about the direction of the sum of its nearest-neighbor spins, that is,
$$ \vec{s}_i \rightarrow 2\,(\vec{s}_i \cdot \hat{n})\,\hat{n} - \vec{s}_i , \qquad \hat{n} = \frac{\sum_{j\,\mathrm{nn}\,i} \vec{s}_j}{\big|\sum_{j\,\mathrm{nn}\,i} \vec{s}_j\big|} . $$
This implementation [ Creutz:87a ], [ Brown:87a ] of the over-relaxed algorithm is microcanonical, and it reduces critical slowing down even though it is a local algorithm. The ``hit'' elements for the Metropolis algorithm are generated from uniform random numbers, with the hit size adjusted to give an acceptance rate of 50 to 60 percent. The Metropolis hits make the algorithm ergodic, but their effectiveness is limited to local changes in the energy. In Figure 4.19 , we show the autocorrelation time $\tau$ versus the correlation length $\xi$; the fits $\tau \propto \xi^z$ give a much smaller dynamic exponent z for the hybrid algorithm than for Metropolis alone.
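The over-relaxation step is easy to state in code. Below is a minimal sketch for a single XY spin stored as a unit two-vector; the production code is, of course, vectorized over the red/black sublattices described above, and the names here are illustrative.

/* Microcanonical over-relaxation of one XY spin: reflect the unit spin s
   about the direction of the sum of its nearest-neighbour spins,
   s' = 2(s.nhat)nhat - s, which leaves the local energy unchanged. */
#include <math.h>

typedef struct { double x, y; } Spin;     /* (cos theta, sin theta) */

void overrelax_site(Spin *s, Spin nbr_sum)
{
    double norm = sqrt(nbr_sum.x * nbr_sum.x + nbr_sum.y * nbr_sum.y);
    if (norm == 0.0) return;              /* no preferred direction */
    double nx = nbr_sum.x / norm, ny = nbr_sum.y / norm;
    double dot = s->x * nx + s->y * ny;
    s->x = 2.0 * dot * nx - s->x;         /* reflection about nhat  */
    s->y = 2.0 * dot * ny - s->y;
}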
Table 4.7: Results of the XY Model Fits: (a) the correlation length as a function of T, and (b) the susceptibility as a function of T, assuming the KT form. The fits KT1-3 are pseudominima while KT4 is the true minimum. All data points are included in the fits, and we give the $\chi^2$ for each fit and an estimate of the exponent $\nu$.
We ran at 14 temperatures near the phase transition and made unconstrained fits to all 14 data points (four-parameter fits according to Equation 4.24 ), for both the correlation length (Figure 4.20 ) and the susceptibility (Figure 4.21 ). The key to the interpretation of the data is the quality of the fits. We find that fitting programs (e.g., MINUIT, SLAC) move incredibly slowly towards the true minimum from certain points, which we label spurious minima and which, unfortunately, are the attractors for most starting points. We found three such spurious minima (KT1-3) and the true minimum KT4, as listed in Table 4.7 .
Figure 4.20: Correlation Length for the XY Model
Figure 4.21: Susceptibility for the XY Model
Thus, our data were found to be in excellent agreement with the KT theory and, in fact, this study provides the first direct measurement of the KT exponents, from both the correlation length and the susceptibility data, that is consistent with the KT predictions.
The XY model is the simplest O(N) model, having N=2 ; the O(N) model is a set of rotors (N-component, continuous-valued spins) on the unit sphere in N dimensions. For $N \ge 3$, these models are asymptotically free [ Polyakov:75a ], and for N=3 , there exist so-called instanton solutions. Some of these properties are analogous to those of gauge theories in four dimensions; hence, these models are interesting. In particular, the O(3) model in two dimensions should shed some light on the asymptotic freedom of QCD (SU(3)) in four dimensions. The renormalization group predicts how the susceptibility and the inverse correlation length (i.e., mass gap) m in the O(3) model depend on the coupling [ Brezin:76a ]. If m and the susceptibility vary according to these predictions, without the higher-order corrections, they are said to follow asymptotic scaling. Previous work was able to confirm that this picture is qualitatively correct, but was not able to probe deep enough into the region of large correlation lengths to obtain good agreement.
The combination of the over-relaxed algorithm and the computational power of the FPS T-Series allowed us to simulate very large lattices. We were thus able to simulate at coupling constants that correspond to correlation lengths up to 300, on lattices where finite-size effects are negligible. We were also able to gather large statistics and thus obtain small statistical errors. Our simulation is in good agreement with similar cluster calculations [Wolff:89b;90a]. Thus, we have validated and extended these results in a regime where our algorithm is the only known alternative to clustering.
Table 4.8: Coupling Constant, Lattice Size, Autocorrelation Time, Number of Over-relaxed Sweeps, Susceptibility, and Correlation Length for the O(3) Model
We have made extensive runs at 10 values of the coupling constant. At the lowest couplings, several hundred thousand sweeps were collected, while for the largest couplings, between 50,000 and 100,000 sweeps were made. Each sweep consists of between 10 iterations through the lattice at the lowest couplings and 150 iterations at the largest. The statistics we have gathered are equivalent to about 200 days' use of the full 128-node FPS machine.
Our results for the correlation length and susceptibility for each coupling and lattice size are shown in Table 4.8 . The autocorrelation times are also shown. The quantities measured on different-sized lattices at the same coupling agree, showing that the infinite-volume limit has been reached.
To compare the behavior of the correlation length and susceptibility with the asymptotic scaling predictions, we use the ``correlation length defect'' and the ``susceptibility defect'', which are defined as the ratios of the measured correlation length and susceptibility to their asymptotic scaling forms, so that asymptotic scaling is seen if both defects approach constants as the coupling grows. These defects are shown in Figures 4.22 and 4.23 , respectively. It is clear that asymptotic scaling does not set in over most of our range of couplings, and it is not possible to draw a clear conclusion even at the largest couplings, though the trends of the last two or three points may be toward constant behavior.
Figure 4.22: Correlation Length Defect Versus the Coupling Constant for the O(3) Model
Figure 4.23: Susceptibility Defect Versus the Coupling Constant for the O(3) Model
Figure 4.24: Decorrelation Time Versus Number of Over-relaxed Sweeps for Different Values of the Coupling
We gauged the speed of the algorithm in producing statistically independent configurations by measuring the autocorrelation time $\tau$. We used this to estimate the dynamical critical exponent z , which is defined by $\tau \propto \xi^z$. For a fixed number of over-relaxed sweeps per configuration, our fits determine z directly. However, we discovered that by increasing the number of over-relaxed sweeps in rough proportion to the correlation length, we can improve the performance of the algorithm significantly. To compare the speed of decorrelation between runs with different numbers of over-relaxed sweeps, we define a new quantity, which we call ``effort'': the computational work expended to obtain a decorrelated configuration. We define a new exponent from the scaling of this effort with the correlation length when the number of over-relaxed sweeps is increased with the correlation length. We also found that the behavior of the decorrelation time can be approximated over a good range by a simple scaling form. A fit to this form gives an effort exponent significantly lower than z . Figure 4.24 shows the decorrelation time versus the number of over-relaxed sweeps for different couplings, with the fits shown as solid lines.
Physical systems composed of discrete, macroscopic particles or grains that are not bonded to one another are important in civil, chemical, and agricultural engineering, as well as in natural geological and planetary environments. Granular systems are observed in rock slides, sand dunes, clastic sediments, snow avalanches, and planetary rings. In engineering and industry they are found in connection with the processing of cereal grains, coal, gravel, oil shale, and powders, and are well known to pose important problems associated with the movement of sediments by streams, rivers, waves, and the wind.
The standard approach to the theoretical modelling of multiparticle systems in physics has been to treat the system as a continuum and to formulate the model in terms of differential equations. As an example, the science of soil mechanics has traditionally focussed mainly on quasi-static granular systems, a prime objective being to define and predict the conditions under which failure of the granular soil system will occur. Soil mechanics is a macroscopic continuum model requiring an explicit constitutive law relating, for example, stress and strain. While very successful for the low-strain quasi-static applications for which it is intended, it is not clear how it can be generalized to deal with the high-strain, explicitly time-dependent phenomena which characterize a great many other granular systems of interest. Attempts at obtaining a generalized theory of granular systems using a differential equation formalism [ Johnson:87a ] have met with limited success.
An alternate approach to formulating physical theories can be found in the concept of cellular automata, first proposed by von Neumann in 1948. In this approach, the space of a physical problem would be divided up into many small, identical cells, each of which would be in one of a finite number of states. The state of a cell would evolve according to a rule that is both local (involves only the cell itself and nearby cells) and universal (all cells are updated simultaneously using the same rule).
The Lattice Grain Model [ Gutt:89a ] (LGrM) we discuss here is a microscopic, explicitly time-dependent, cellular automata model, and can be applied naturally to high-strain events. LGrM carries some attributes of both particle dynamics models [ Cundall:79a ], [Haff:87a;87b], [ Walton:84a ], [ Werner:87a ] (PDM), which are based explicitly on Newton's second law, and lattice gas models [ Frisch:86a ], [ Margolis:86a ] (LGM), in that its fundamental element is a discrete particle, but differs from these substantially in detail. Here we describe the essential features of LGrM, compare the model with both PDM and LGM, and finally discuss some applications.
The purpose of the lattice grain model is to predict the behavior of large numbers of grains (10,000 to 1,000,000) on scales much larger than a grain diameter. In this respect, it goes beyond the particle dynamics calculations of Section 9.2 , which are limited to far smaller numbers of grains by currently available computing resources [ Cundall:79a ], [Haff:87a;87b], [ Walton:84a ], [ Werner:87a ]. The particle dynamics models follow the motion of each individual grain exactly, and may be formulated in one of two ways depending upon the model adopted for particle-particle interactions.
In one formulation, the interparticle contact times are assumed to be of finite duration, and each particle may be in simultaneous contact with several others [ Cundall:79a ], [ Haff:87a ], [ Walton:84a ], [ Werner:87a ]. Each particle obeys Newton's law, F = ma , and a detailed integration of the equations of motion of each particle is performed. In this form, while useful for applications involving a much smaller number of particles than LGrM allows, PDM cannot compete with LGrM for systems involving large numbers of grains because of the complexity of PDM ``automata.''
In the second, simpler formulation, the interparticle contact times are assumed to be of infinitesimal duration, and particles undergo only binary collisions (the hard-sphere collisional models) [ Haff:87b ]. Hard-sphere models usually rely upon a collision-list ordering of collision events to avoid the necessity of checking all pairs of particles for overlaps at each time step. In regions of high particle number density, collisions are very frequent; thus, in problems where such high-density zones appear, hard-sphere models spend most of their time moving particles through very small distances using very small time steps. In granular flow, zones of stagnation, where particles are very nearly in contact much of the time, are common, and the hard-sphere model is therefore unsuitable, at least in its simplest form, as a model of these systems. LGrM avoids these difficulties because its time-stepping is controlled, not by a collision list, but by a scan frequency, which in turn is a function of the speed of the fastest particle and is independent of number density. Furthermore, although fundamentally a collisional model, LGrM can also mimic the behavior of consolidated or stagnated zones of granular material in a manner which will be described.
LGrM closely resembles LGM [ Frisch:86a ], [ Margolis:86a ] in some respects. First, for two-dimensional applications, the region of space in which the particles are to move is discretized into a triangular lattice-work, upon each node of which can reside a particle. The particles are capable of moving to neighboring cells at each tick of the clock, subject to certain simple rules. Finally, two particles arriving at the same cell (LGM) or adjacent cells (LGrM) at the same time, may undergo a ``collision'' in which their outgoing velocities are determined according to specified rules chosen to conserve momentum.
Each of the particles in LGM has the same magnitude of velocity and is allowed to move in one of six directions along the lattice, so that each particle travels exactly one lattice spacing in each time step. The single-velocity magnitude means that all collisions between particles are perfectly elastic and that energy conservation is maintained simply through particle number conservation. It also means that the temperature of the gas is uniform throughout time and space, thus limiting the application of LGM to problems of low Mach number. An exclusion principle is maintained in which no two particles of the same velocity may occupy one lattice point. Thus, each lattice point may have no more than six particles, and the state of a lattice point can be recorded using only six bits.
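This six-bit encoding can be made concrete as follows; the direction numbering is an assumed convention, and the helper names are illustrative.

/* Sketch of the six-bit lattice-gas site encoding: one bit per lattice
   direction, set when a particle with that velocity occupies the site.
   Direction indices 0..5 label the six triangular-lattice directions. */
#include <stdint.h>

typedef uint8_t LgmSite;                       /* only the low 6 bits are used */

static inline int     has_particle(LgmSite s, int dir)    { return (s >> dir) & 1; }
static inline LgmSite add_particle(LgmSite s, int dir)    { return s | (LgmSite)(1u << dir); }
static inline LgmSite remove_particle(LgmSite s, int dir) { return s & (LgmSite)~(1u << dir); }
static inline int     occupancy(LgmSite s)
{
    int n = 0;
    for (int d = 0; d < 6; d++) n += (s >> d) & 1;  /* at most six particles */
    return n;
}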
LGrM differs from LGM in that it has many possible velocity states, not just six. In particular, in LGrM not only the direction but the magnitude of the velocity can change in each collision. This is a necessary condition because the collision of two macroscopic particles is always inelastic, so that mechanical energy is not conserved. The LGrM particles satisfy a somewhat different exclusion principle: No more than one particle at a time may occupy a single site. This exclusion principle allows LGrM to capture some of the volume-filling properties of granular material-in particular, to be able to approximate the behavior of static granular masses.
The determination of the time step is more critical in LGrM than in LGM. If the time step is long enough that some particles travel several lattice spacings in one clock tick, the problem of finding the intersection of particle trajectories arises. This involves much computation and defeats the purpose of an automata approach. A very short time step would imply that most particles would not move even a single lattice spacing. Here we choose a time step such that the fastest particle will move exactly one lattice spacing. A ``position offset'' is stored for each of the slower particles, which are moved accordingly when the offset exceeds one-half lattice spacing. These extra requirements for LGrM automata imply a slower computation speed than expected in LGM simulations; but, as a dividend, we can compute inelastic grain flows of potential engineering and geophysical interest.
In order to keep the particle-particle interaction rules as simple as possible, all interparticle contacts, whether enduring contacts or true collisions, will be modelled as collisions. Collisions that model enduring contacts will transmit, in each time step, an impulse equal to the force of the enduring contact multiplied by the time step. The fact that collisions take place between particles on adjacent lattice nodes means that some particles may undergo up to six collisions in a time step. For simplicity, these collisions will be resolved as a series of binary collisions. The order in which these collisions are calculated at each lattice node, as well as the order in which the lattice nodes are scanned, is now an important consideration.
The rules of the Lattice Grain Model may be summarized as follows. In each time step, each particle accumulates a position offset determined by its velocity. Once the offset exceeds half the distance to the nearest lattice node, and that node is empty, the particle is moved to that node and its offset is decremented appropriately. Also, in a collision, the component of the offset along the line connecting the centers of the colliding particles is set to zero.
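A much-simplified sketch of this offset bookkeeping, for a single particle on a square lattice with periodic boundaries, is given below; the real model uses a triangular lattice, fixed walls, and collision resolution, and all names here are illustrative.

/* Position-offset bookkeeping for one particle on an NX x NY square
   lattice.  dt is the global time step, chosen so that the fastest
   particle moves exactly one lattice spacing per step. */
typedef struct {
    int    ix, iy;   /* lattice node currently occupied               */
    double ox, oy;   /* accumulated position offset, in lattice units */
    double vx, vy;   /* velocity, in lattice spacings per unit time   */
} Grain;

void update_offset(Grain *g, double dt, int NX, int NY, int occupied[])
{
    g->ox += g->vx * dt;                     /* accumulate sub-lattice motion */
    g->oy += g->vy * dt;

    int sx = (g->ox > 0.5) ? 1 : (g->ox < -0.5) ? -1 : 0;
    int sy = (g->oy > 0.5) ? 1 : (g->oy < -0.5) ? -1 : 0;
    if (!sx && !sy) return;

    int nx = (g->ix + sx + NX) % NX;         /* periodic boundaries           */
    int ny = (g->iy + sy + NY) % NY;
    if (!occupied[ny * NX + nx]) {           /* hop only into an empty node   */
        occupied[g->iy * NX + g->ix] = 0;
        occupied[ny * NX + nx]       = 1;
        g->ix = nx;   g->iy = ny;
        g->ox -= sx;  g->oy -= sy;           /* decrement offset by one spacing */
    }
}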
The transmission of ``static'' contact forces within a mass of grains (as in grains at rest in a gravitational field) is handled naturally within the above framework. Though a particle in a static mass of grains may be nominally at rest, its velocity may be nonzero (due to gravitational or pressure forces); and it will transmit the appropriate force (in the form of an impulse) to the particles under it by means of collisions. When these impulses are averaged over several time steps, the proper weights and pressures will emerge.
Figure 4.25: Definition of Lattice Numbers and Collision Directions
When implementing this algorithm on a computer, what is stored in the computer's memory is information concerning each point in the lattice, regardless of whether or not there is a particle at that lattice point. This allows for very efficient checking of the space around each particle for the presence of other particles (i.e., information concerning the six adjacent points in a triangular lattice will be found at certain known locations in memory). The need to keep information on empty lattice points in memory does not entail as great a penalty as might be thought; many lattice grain model problems involve a high density of particles, typically one for every one to four lattice points, and the memory cost per lattice point is not large. The memory requirements for the implementation of LGrM as described here are five variables per lattice site: two components of position, two of velocity, and one status variable, which denotes an empty site, an occupied site, or a bounding ``wall'' particle. If each variable is stored using four bytes of memory, then each lattice point requires 20 bytes.
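One possible layout of this per-site storage (the field names are illustrative, not those of the actual code) is:

/* Five 4-byte variables per lattice point, i.e., 20 bytes per site. */
#include <stdint.h>

enum { SITE_EMPTY = 0, SITE_OCCUPIED = 1, SITE_WALL = 2 };

typedef struct {
    float   px, py;    /* position offset components                  */
    float   vx, vy;    /* velocity components                         */
    int32_t status;    /* SITE_EMPTY, SITE_OCCUPIED, or SITE_WALL     */
} LatticeSite;         /* 5 x 4 bytes = 20 bytes per lattice point    */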
The standard configuration for a simulation consists of a lattice with a specified number of rows and columns bounded at the top and bottom by two rows of wall particles (thus forming the top and bottom walls of the problem space), and with left and right edges connected together to form periodic boundary conditions. Thus, the boundaries of the lattice are handled naturally within the normal position updating and collision rules, with very little additional programming. (Note: Since the gravitational acceleration can point in an arbitrary direction, the top and bottom walls can become side walls for chute flow. Also, the periodic boundary conditions can be broken by the placement of an additional wall, if so desired.)
Because of the nearest-neighbor type interactions involved in the model, the computational scheme was well suited to an nCUBE parallel processor. For the purpose of dividing up the problem, the hypercube architecture is unfolded into a two-dimensional array, and each processor is given a roughly equal-area section of the lattice. The only interaction between sections will be along their common boundaries; thus, each processor will only need to exchange information with its eight immediate neighbors. The program itself was written in C under the Cubix/CrOS III operating system.
The LGrM simulations performed so far have involved tens of thousands of automata. Trial application runs included two-dimensional, vertical, time-dependent flows in several geometries, of which two examples are given here: Couette flow and flow out of an hourglass-shaped hopper.
The standard Couette flow configuration consists of a fluid confined between two flat parallel plates of infinite extent, without any gravitational accelerations. The plates move in opposite directions with velocities that are equal and parallel to their surfaces, which results in the establishment of a velocity gradient and a shear stress in the fluid. For fluids that obey the Navier-Stokes equation, an analytical solution is possible in which the velocity gradient and shear stress are constant across the channel. If, however, we replace the fluid by a system of inelastic grains, the velocity gradient will no longer necessarily be constant across the channel. Typically, stagnation zones or plugs form in the center of the channel with thin shear-bands near the walls. Shear-band formation in flowing granular materials was analyzed earlier by Haff and others [ Haff:83a ], [ Hui:84a ] based on kinetic theory models.
The simulation was carried out with 5760 grains, located in a channel 60 lattice points wide by 192 long. Due to the periodic boundary conditions at the left and right ends, the problem is effectively infinite in length. The first simulation is intended to reproduce the standard Couette flow for a fluid; consequently, the particle-particle collisions were given a coefficient of restitution of 1.0 (i.e., perfectly elastic collisions) and the particle-wall collisions were given a .75 coefficient of restitution. The inelasticity of the particle-wall collisions is needed to simulate the conduction of heat (which is being generated within the fluid) from the fluid to the walls. The simulation was run until an equilibrium was established in the channel (Figure 4.26 (a)). The average x - and y -components of velocity and the second moment of velocity, as functions of distance across the channel, are plotted in Figure 4.26 (b).
Figure 4.26: (a) Elastic Particle Couette Flow; (b) x-component (1), y-component (2), and Second Moment (3) of Velocity
The second simulation used a coefficient of restitution of .75 for both the particle-particle and particle-wall collisions. The equilibrium results are shown in Figure 4.27 (a) and (b). As can be seen from the plots, the flow consists of a central region of particles compacted into a plug, with each particle having almost no velocity. Near each of the moving walls, a region of much lower density has formed in which most of the shearing motion occurs. Note the increase in value of the second moment of velocity (the granular ``thermal velocity'') near the walls, indicating that grains in this area are being ``heated'' by the high rate of shear. It is interesting to note that these flows are turbulent in the sense that shear stress is a quadratic, not a linear, function of shear rate.
Figure 4.27: (a) Inelastic Particle Couette Flow; (b) x-component (1), y-component (2), and Second Moment (3) of Velocity.
In the second problem, the flow of grains through a hopper or an hourglass with an opening only a few grain diameters wide was studied; the driving force was gravity. This is an example of a granular system which contains a wide range of densities, from groups of grains in static contact with one another to groups of highly agitated grains undergoing true binary collisions. Here, the number of particles used was 8310, and the lattice was 240 points long by 122 wide. Additional walls were added to form the sloped sides of the bin and to close off the bottom of the lattice, to prevent the periodic boundary conditions from reintroducing the falling particles back into the bin (Figure 4.28 (a)). This is a typical feature of automata modelling: it is often easier to configure the simulation to resemble a real experiment, in this case by explicitly ``catching'' spent grains, than to reprogram the basic code to erase such particles.
The hourglass flow in Figure 4.28 (b) showed internal shear zones, regions of stagnation, free-surface evolution toward an angle of repose, and an exit flow rate approximately independent of pressure head, as observed experimentally [ Tuzun:82a ]. It is hard to imagine that one could solve a partial differential equation describing such a complex, multiple-domain, time-dependent problem, even if the right equation were known (which is not the case).
Figure 4.28: (a) Initial Condition of Hourglass; (b) Hourglass Flow after 2048 Time Steps
These exploratory numerical experiments show that an automata approach to granular dynamics problems can be implemented on parallel computing machines. Further work remains to be done to assess more quantitatively how well such calculations reflect the real world, but the prospects are intriguing.
As already noted in Chapter 2 , the initial software used by C P was called CrOS, although its modest functionality hardly justified CrOS being called an operating system. Actually, this is an interesting issue. In our original model, the ``real'' operating system (UNIX in our case) ran on the ``host'' that directly or indirectly (via a network) connected to the hypercube. The nodes of the parallel machine need only provide the minimal services necessary to support user programs. This is the natural mode for all SIMD systems and is still offered by several important MIMD multicomputers. However, systems such as the IBM SP-1, Intel's Paragon series, and Meiko's CS-1 and CS-2 offer a full UNIX (or equivalent, such as MACH) on each node. This has many advantages, including the ability of the system to be arbitrarily configured; in particular, we can consider a multicomputer with N nodes as ``just'' N ``real'' computers connected by a high-performance network. This would lead to particularly good performance on remote disk I/O, such as that needed for the Network File System (NFS). The design of an operating system for the node is based partly on the programming usage paradigm and partly on the hardware. The original multicomputers all had small node memories (a fraction of a megabyte on the Cosmic Cube) and could not possibly hold UNIX on a node. Current multicomputers such as the CM-5, Paragon, and Meiko CS-2 would consider many megabytes a normal minimum node memory. This is easily sufficient to hold a full UNIX implementation with the extra functionality needed to support parallel programming. There are some, such as IBM Owego (Execube), Seitz at Caltech (MOSAIC) [Seitz:90a;92a], and Dally at MIT (J Machine) [Dally:90a;92a], who are developing very interesting families of highly parallel ``small node'' multicomputers for which a full UNIX on each node may be inappropriate.
Essentially all the applications described in this book are insensitive to these issues, which would only affect the convenience of program development and operating environment. C P's applications were all developed using a simple message-passing system involving C (and less often Fortran) node programs that sent messages to each other via subroutine call. The key function of CrOS and Express, described in Section 5.2 , was to provide this subroutine library.
There are some important capabilities that a parallel computing environment needs in addition to message passing and UNIX services. These include:
We did not perform any major computations in C P that required high-speed input/output capabilities. This reflects both our applications mix and the poor I/O performance of the early hypercubes. The applications described in Chapter 18 needed significant but not high bandwidth input/output during computation, as did our analysis of radio astronomy data. However, the other applications used input/output for checkpointing, interchange of parameters between user and program, and in greatest volume, checkpoint and restart. This input/output was typically performed between the host and (node 0 of) the parallel ensemble. Section 5.2.7 and in greater detail [ Fox:88a ] describe the Cubix system, which we developed to make this input/output more convenient. This system was overwhelmingly preferred by the C P community as compared to the conventional host-node programming style. Curiously, Cubix seems to have made no impact on the ``real world.'' We are not aware of any other group that has adopted it.
The evolution of the various message-passing paradigms used on the Caltech/JPL machines involved three generations of hypercubes and many different software concepts, which ultimately led to the development of Express , a general, asynchronous buffered communication system for heterogeneous multiprocessors.
Originally designed, developed, and used by scientists with applications-oriented research goals, the Caltech/JPL system software was written to generate near-term needed capability. Neither hindered nor helped by any preconceptions about the type of software that should be used for parallel processing, we simply built useful software and added it to the system library.
Hence, the software evolved from primitive hardware-dependent implementations into a sophisticated runtime library, which embodied the concepts of ``loose synchronization,'' domain decomposition, and machine independence. By the time the commercial machines started to replace our homemade hypercubes, we had evolved a programming model that allowed us to develop and debug code effectively, port it between different parallel computers, and run with minimal overheads. This system has stood the test of time and, although there are many other implementations, the functionality of CrOS and Express appears essentially correct. Many of the ideas described in this chapter, and the later Zipcode System of Section 16.1 , are embodied in the current message-passing standards activity, MPI [ Walker:94a ]. A detailed description of CrOS and Express will be found in [ Fox:88a ] and [ Angus:90a ], and is not repeated here.
How did this happen?
The original hypercubes described in Chapter 20 , the Cosmic Cube [ Seitz:85a ] and Mark II [ Tuazon:85a ] machines, had been designed and built as exploratory devices. We expected to be able to do useful physics and, in particular, were interested in high-energy physics. At that time, we were merely trying to extract exciting physics from an untried technology. These first machines came equipped with ``64-bit FIFOs,'' meaning that at a software level, two basic communication routines were available:
rdELT(packet, chan)
wtELT(packet, chan).
The latter pushed a 64-bit ``packet'' into the indicated hypercube channel, which was then extracted with the rdELT function. If the read happened before the write, the program in the reading node stopped and waited for the data to show up. If the writing node sent its data before the reading node was ready, it similarly waited for the reader.
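Because both calls block, a pair of neighboring nodes had to order their reads and writes carefully. The sketch below shows one common idiom, ordering the exchange by the parity of the node number in the channel bit; the exact argument types and return values of rdELT/wtELT are assumptions, and this is a sketch rather than the historical source.

/* Sketch of a deadlock-free packet exchange between two neighbouring
   hypercube nodes using the blocking rdELT/wtELT calls.  The neighbour
   across channel `chan` differs from us in bit `chan` of the node number,
   so the two sides take opposite branches. */
extern void rdELT(void *packet, int chan);   /* blocking read from channel  */
extern void wtELT(void *packet, int chan);   /* blocking write to channel   */

void exchange_packet(int my_node, int chan, void *send, void *recv)
{
    if ((my_node >> chan) & 1) {
        wtELT(send, chan);
        rdELT(recv, chan);
    } else {
        rdELT(recv, chan);
        wtELT(send, chan);
    }
}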
To make contact with the world outside the hypercube cabinet, a node had to be able to communicate with a ``host'' computer. Again, the FIFOs came into play with two additional calls:
rdIH(packet)
wtIH(packet),
which allowed node 0 to communicate with the host.
This rigidly defined behavior, executed on a hypercubic lattice of nodes, resembled a crystal, so we called it the Crystalline Operating System (CrOS). Obviously, an operating system with only four system calls is quite far removed from most people's concept of the breed. Nevertheless, they were the only system calls available and the name stuck.
We began to build algorithms and methods to exploit the power of parallel computers. With little difficulty, we were able to develop algorithms for solving partial differential equations [ Brooks:82b ], FFTs [ Newton:82a ], [ Salmon:86b ], and high-energy physics problems described in the last chapter [ Brooks:83a ].
As each person wrote applications, however, we learned a little more about the way problems were mapped onto the machines. Gradually, additional functions were added to the list to download and upload data sets from the outside world and to combine the operations of the rdELT and wtELT functions into something that exchanged data across a channel.
In each case, these functions were adopted, not because they seemed necessary to complete our operating system, but because they fulfilled a real need. At that time, debugging capabilities were nonexistent; a mistake in the program running on the nodes merely caused the machine to stop running. Thus, it was beneficial to build up a library of routines that performed common communication functions, which made reinventing tools unnecessary.
Up to this point, our primary concern was with communication between neighboring processors. Applications, however, tended to show two fundamental types of communication: local exchange of boundary condition data, and global operations connected with control or extraction of physical observables.
As seen from the examples in this book, these two types of communication are generally believed to be fundamental to all scientific problems-the modelled application usually has some structure that can be mapped onto the nodes of the parallel computer and its structure induces some regular communication pattern. A major breakthrough, therefore, was the development of what have since been called the ``collective'' communication routines, which perform some action across all the nodes of the machine.
The simplest example is that of `` broadcast ''- a function that enabled node 0 to communicate one or more packets to all the other nodes in the machine. The routine `` concat '' enabled each node to accumulate data from every other node, and `` combine '' let us perform actions, such as addition, on distributed data sets. The routine combine is often called a reduction operator.
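The idea behind combine is easily illustrated. The sketch below performs a global sum on a d-dimensional hypercube by exchanging and folding in a partner's partial result across each channel in turn; it shows the concept only and is not the CrOS calling sequence.

/* Global sum by recursive doubling on a dim-dimensional hypercube: after
   one exchange per channel, every node holds the sum over all 2^dim nodes.
   rdELT/wtELT are the blocking channel read/write described earlier. */
extern void rdELT(void *packet, int chan);
extern void wtELT(void *packet, int chan);

void combine_sum(int my_node, int dim, double *value)   /* value: in/out */
{
    for (int chan = 0; chan < dim; chan++) {
        double partner;
        if ((my_node >> chan) & 1) {      /* parity ordering avoids deadlock */
            wtELT(value, chan);
            rdELT(&partner, chan);
        } else {
            rdELT(&partner, chan);
            wtELT(value, chan);
        }
        *value += partner;                /* the reduction operation */
    }
}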
The development of these functions, and the natural way in which they could be mapped to the hypercube topology of the machines, led to great increases in both productivity on the part of the programmers and efficiency in the execution of the algorithms. CrOS quickly grew to a dozen or more routines.
By 1985, the Mark II machine was in constant use and we were beginning to examine software issues that had previously been of little concern. Algorithms, such as the standard FFT, had obvious implementations on a hypercube [ Salmon:86b ], [ Fox:88a ]-the ``bit-twiddling'' done by the FFT algorithm could be mapped onto a hypercube by ``twiddling'' the bits in a slightly revised manner. More problematic was the issue of two- or three-dimensional problem solving. A two- or three-dimensional problem could easily be mapped into a small number of nodes. However, one cannot so easily perform the mapping of grids onto 128 nodes connected as a seven-dimensional hypercube.
Another issue that quickly became apparent was that C P users did not have a good feel for the `` chan '' argument used by the primitive communication functions. Users wanted to think of a collection of processors, each labelled by a number, with data exchanged between them; but unfortunately, the software was driven instead by the hypercube architecture of the machine. Tolerance of the explanation ``Well, you XOR the processor number with one shifted left by the channel number'' was rapidly exceeded in all but the most stubborn users.
Both problems were effectively removed by the development of whoami [ Salmon:84b ]. We used the well-known techniques of binary grey codes to automate the process of mapping two, three, or higher dimensional problems to the hypercube. The whoami function took the dimensionality of the physical system being modelled and returned all the necessary `` chan '' values to make everything else work out properly. No longer did the new user have to spend time learning about channel numbers, XORing, and the ``mapping problem''-everything was done by the call to whoami . Even on current hardware, where long-range communication is an accepted fact, the techniques embodied by whoami result in programs that can run up to an order of magnitude faster than those using less convenient mapping techniques (see Figure 5.1 ).
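The essence of the grey-code trick is small enough to show directly; the function names below are illustrative, not the whoami interface.

/* Binary reflected Gray code: grid position p maps to hypercube node
   gray(p), so consecutive positions map to nodes differing in exactly one
   bit, and that bit index is the hypercube channel to use. */
unsigned gray(unsigned p)
{
    return p ^ (p >> 1);
}

int channel_to_next(unsigned p)          /* channel linking position p to p+1 */
{
    unsigned diff = gray(p) ^ gray(p + 1);   /* has exactly one bit set */
    int chan = 0;
    while (!(diff & 1u)) { diff >>= 1; chan++; }
    return chan;
}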
Figure 5.1: Mapping a Two-dimensional World
Up to this point, we had concentrated on the most obvious scientific problems: FFTs, ordinary and partial differential equations, matrices, and so on, which were all characterized by their amenability to the lock step, short-range communication primitives available. Note that some of these, such as the FFT and matrix algorithms, are not strictly ``nearest neighbor'' in the sense of the communication primitives discussed earlier, since they require data to be distributed to nodes further than one step away. These problems, however, are amenable to the ``collective communication'' strategies.
Based on our success with these problems, we began to investigate areas that were not so easily cast into the crystalline methodology. A long-term goal was the support of event-driven simulations, database machines, and transaction-processing systems, which did not appear to be crystalline .
In the shorter term, we wanted to study the physical process of ``melting'' [ Johnson:86a ] described in Section 14.2 . The melting process is different from the applications described thus far, in that it inherently involves some indeterminacies: the transition from an ordered solid to a random liquid involves complex and time-varying interactions. In the past, we had solved such an irregular problem, that of N-body gravity [ Salmon:86b ], by the use of what has since been called the ``long-range-force'' algorithm [ Fox:88a ]. This is a particularly powerful technique and leads to highly efficient programs that can be implemented with crystalline commands.
The melting process differs from the long-range force algorithm in that the interactions between particles do not extend to infinity, but are localized to some domain whose size depends upon the particular state of the solid/fluid. As such, it is very wasteful to use the long-range force technique, but the randomness of the interactions makes a mapping to a crystalline algorithm difficult (see Figure 5.2 ).
Figure 5.2: Interprocessor Communication Requirements
To address these issues effectively, it seemed important to build a communication system that allowed messages to travel between nodes that were not necessarily connected by ``channels,'' yet didn't need to involve all nodes collectively.
At this point, an enormous number of issues came up-routing, buffering, queueing, interrupts, and so on. The first cut at solving these problems was a system that never acquired a real name, but was known by the name of its central function, `` rdsort '' [ Johnson:85a ]. The basic concept was that a message could be sent from any node to any other node, at any time, and the receiving node would have its program interrupted whenever a message arrived. At this point, the user provided a routine called `` rdsort '' which, as its name implies, needed to read, sort and process the data.
While simple enough in principle, this programming model was not initially adopted (although it produced an effective solution to the melting problem). To users who came from a number-crunching physics background, the concept of ``interrupts'' was quite alien. Furthermore, the issues of sorting, buffering, mutual exclusion, and so on, raised by the asynchronous nature of the resulting programs, proved hard to code. Without debugging tools, it was extremely hard to develop programs using these techniques. Some of these ideas were taken further by the Reactive Kernel [ Seitz:88b ] (see Section 16.2 ), which does not, however, implement ``reaction'' with an interrupt-level handler. The recent development of active messages on the CM-5 has shown the power of the rdsort concepts [ Eiken:92a ].
The advent of the Mark III machine [ Peterson:85a ] generated a rapid development in applications software. In the previous five years, the crystalline system had shown itself to be a powerful tool for extracting maximum performance from the machines, but the new Mark III encouraged us to look at some of the ``programmability'' issues, which had previously been of secondary importance.
The first and most natural development was the generalization of the CrOS system for the new hardware [ Johnson:86c ], [ Kolawa:86d ]. Christened ``CrOS III,'' it allowed the flexibility of arbitrary message lengths (rather than multiples of the FIFO size) and hardware-supported collective communication: the Mark III supported simultaneous communication down multiple channels, which allowed fast cube and subcube broadcast [ Fox:88a ]. All of these enhancements, however, maintained the original concept of nearest-neighbor (in a hypercube) communication supported by collective communication routines that operated throughout (or on a subset of) the machine. In retrospect, the hypercube-specific nature of CrOS should not have been preserved in the major redesign that produced CrOS III.
At this point, the programming model for the machines remained pretty much as it had been in 1982. A program running on the host computer loaded an application into the hypercube nodes, and then passed data back and forth with routines similar in nature to the rdIH and wtIH calls described earlier. This remained the only method through which the nodes could communicate with the outside world.
During the Mark II period, it had quickly become apparent that most users were writing host programs that, while differing in detail, were identical in outline and function. An early effort to remove from the user the burden of writing yet another host program was a system called C3PO [ Meier:84b ], which had a generic host program providing a shell in which subroutines could be executed in the nodes. Essentially, this model freed the user from writing an explicit host program, but still kept the locus of control in the host.
Cubix [ Salmon:87a ] reversed this. The basic idea was that the parallel computer, being more powerful than its host, should play the dominant role. Programs running in the nodes should decide for themselves what actions to take and merely instruct the host machine to intercede on their behalf. If, for example, the node program wished to read from a file, it should be able to tell the host program to perform the appropriate actions to open the file and read data, package it up into messages, and transmit it back to the appropriate node. This was a sharp contrast to the older method in which the user was effectively responsible for each of these actions.
The basic outcome was that the user's host programs were replaced with a standard ``file-server'' program called cubix . A set of library routines was then created with a single protocol for transactions between the node programs and cubix , which related to such common activities as opening, reading and writing files, interfacing with the user, and so on.
This change produced dramatic results. Now, the node programs could contain calls to such useful functions as printf , scanf , and fopen , which had previously been forbidden. Debugging was much easier, albeit in the old-fashioned way of ``let's print everything and look at it.'' Furthermore, programs no longer needed to be broken down into ``host'' and ``node'' pieces and, as a result, parallel programs began to look almost like the original sequential programs.
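To give a feel for the style this enabled, the fragment below is a hedged sketch (not taken from any Cubix distribution; the file name is invented): under Cubix the ordinary stdio calls become requests to the host file server, and the same source compiles unchanged on a workstation.

    /* Sketch of the Cubix programming style: the node program reads its
     * parameters and prints results with ordinary C I/O.  Under Cubix these
     * calls are serviced by the host ``file-server'' program; on a sequential
     * machine the program runs as-is.                                        */
    #include <stdio.h>

    int main(void)
    {
        int    steps = 0;
        double dt    = 0.0;

        FILE *fp = fopen("params.dat", "r");    /* host opens the file     */
        if (fp != NULL) {
            fscanf(fp, "%d %lf", &steps, &dt);  /* data returned to nodes  */
            fclose(fp);
        }

        printf("running %d steps with dt = %g\n", steps, dt);

        /* ... the numerical work of the node program goes here ... */

        return 0;
    }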
Once file I/O was possible from the individual nodes of the machine, graphics soon followed through Plotix [ Flower:86c ], resulting in the development system shown at the heart of the family tree (Figure 5.3 ). The ideas embodied in this set of tools-CrOS III, Cubix, and Plotix-form the basis of the vast majority of C³P parallel programs.
Figure 5.3: The C³P ``Message-passing'' Family Tree
While the radical change that led to Cubix was happening, the non-crystalline users were still developing alternative communication strategies. As mentioned earlier, rdsort never became popular due to the burden of bookkeeping that was placed on the user and the unfamiliarity of the interrupt concept.
The ``9 routines'' [ Johnson:86a ] attempted to alleviate the most burdensome issues by removing the interrupt nature of the system and performing buffering and queueing internally. The resultant system was a simple generalization of the wtELT concept, which replaced the ``channel'' argument with a processor number. As a result, messages could be sent to non-neighboring nodes. An additional level of sophistication was provided by associating a ``type'' with each message. The recipient of the messages could then sort incoming data into functional categories based on this type.
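The prototypes below are a hedged sketch of the kind of calling sequence this implies; the names send_msg and recv_msg and their argument lists are invented for illustration and are not the actual ``9 routines.''

    /* Illustrative prototypes only: a destination processor number replaces
     * the channel argument of wtELT, and each message carries a type that the
     * receiver can use to sort buffered messages into functional categories. */
    int send_msg(int dest_node, int type, const void *buf, int nbytes);
    int recv_msg(int *src_node, int type, void *buf, int maxbytes);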
The benefit to the user was a simplified programming model. The only remaining problem was how to handle this newfound freedom of sending messages to arbitrary destinations.
We had originally planned to build a ray tracer from the tools developed while studying melting. There is, however, a fundamental difference between the melting process and the distributed database searching that goes on in a ray tracer. Ray tracing is relatively simple if the whole model can be stored in each processor, but we were considering the case of a geometric database larger than this.
Melting posed problems because the exact nature of the interaction was known only statistically-we might need to communicate with all processors up to two hops away from our node, or three hops, or some indeterminate number. Other than this, however, the problem was quite homogeneous, and every node could perform the same tasks as the others.
The database search is inherently non-deterministic and badly load-balanced because it is impossible to map objects into the nodes where they will be used. As a result, database queries need to be directed through a tree structure until they find the necessary information and return it to the calling node.
A suitable methodology for performing this kind of exercise seemed to be a real multitasking system in which ``processes'' could be created and destroyed on nodes dynamically, mapping naturally onto the complex database search patterns. We decided to create an operating system.
The crystalline system had been described, at least in the written word, as an operating system. The concept of writing a real operating system with file systems, terminals, multitasking, and so on, was clearly impossible while communication was restricted to single hops across hypercube channels. The new system, however, promised much more. The result was the ``Multitasking, Object-Oriented, Operating System'' (MOOOS, commonly known as MOOSE) [ Salmon:86a ]. The follow-up MOOS II is described in Section 15.2 .
The basic idea was to allow for multitasking-running more than one process per node, with remote task creation, scheduling, semaphores, signals-to include everything that a real operating system would have. The implementation of this system proved quite troublesome and strained the capabilities of our compilers and hardware beyond their breaking points, but was nothing by comparison with the problems encountered by the users.
The basic programming model was of simple, ``lightweight'' processes communicating through message box/pipe constructs. The overall structure was vaguely reminiscent of the standard UNIX multiprocessing system: fork/exec and pipes (Figure 5.4 ). Unfortunately, this was by now completely alien to our programmers, who were all more familiar with the crystalline methods previously employed. In particular, problems were encountered with naming. While a process that created a child would automatically know its child's ``ID,'' it was much more difficult for siblings to identify each other, and hence, to communicate. As a natural result, it was reasonably easy to build tree structures but difficult to perform domain decompositions. Despite these problems, useful programs were developed including the parallel ray tracer with a distributed database that had originally motivated the design [Goldsmith:86a;87a].
Figure 5.4: A ``MOOSE'' Process Tree
An important problem was that the hardware offered no memory protection between the lightweight processes running on a node. One had to guess how much memory to allocate to each process, which complicated debugging when the user guessed wrong. Later, the Intel iPSC implemented hardware memory protection, which made life simpler ([ Koller:88b ] and Section 15.2 ).
In using MOOSE, we wanted to explore dynamic load-balancing issues. A problem with standard domain decompositions is that irregularities in the work loads assigned to processors lead to inefficiencies since the entire simulation, proceeding in lock step, executes at the speed of the slowest node. The Crystal Router , developed at the same time as MOOSE, offered a simpler strategy.
By 1986, we began to classify our algorithms in order to generalize the performance models and identify applications that could be expected to perform well using the existing technology. This led to the idea of ``loosely synchronous'' programming.
The central concept is one in which the nodes compute for a while, then synchronize and communicate, continually alternating between these two types of activities. This computation model was very well-suited to our crystalline communication system, which enforced synchronization automatically. In looking at some of the problems we were trying to address with our asynchronous communication systems (The ``9 routines'' and MOOSE), we found that although the applications were not naturally loosely synchronous at the level of the individual messages, they followed the basic pattern at some higher level of abstraction.
In particular, we were able to identify problems in which it seemed that messages would be generated at a fairly uniform rate, but in which the moment when the data had to be physically delivered to the receiving nodes was synchronized. A load balancer, for example, might use some type of simulated-annealing [ Flower:87a ] or neural-network [ Fox:88e ] approach, as seen in Chapter 11 , to identify work elements that should be relocated to a different processor. As each decision is made, a message can be generated to tell the receiving node of its new data. It would be inefficient, however, to physically send these messages one at a time as the load-balancing algorithm progresses, especially since the results need only be acted upon once the load-balancing cycle has completed.
We developed the Crystal Router to address this problem [Fox:88a;88h]. The idea was that messages would be buffered on their node of origin until a synchronization point was reached when a single system call sent every message to its destination in one pass. The results of this technology were basically twofold.
The resultant system had some of the attractive features of the ``9 routines,'' in that messages could be sent between arbitrary nodes. But it maintained the high efficiency of the crystalline system by performing all its internode communications synchronously. A glossary of terms used is in Figure 5.5 .
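In outline, the usage pattern is something like the following self-contained sketch; the names cr_queue and cr_exchange are invented for illustration and are not the actual Crystal Router calls.

    /* Sketch of the crystal-router pattern: messages are buffered on their
     * node of origin during the computation phase and delivered in a single
     * loosely synchronous exchange at the synchronization point.            */
    #include <stdio.h>
    #include <string.h>

    #define MAXQ 128

    typedef struct { int dest; char payload[32]; } pending_t;

    static pending_t queue[MAXQ];
    static int       nqueued;

    /* Buffer a message locally; nothing is transmitted yet. */
    static void cr_queue(int dest, const char *payload)
    {
        queue[nqueued].dest = dest;
        strncpy(queue[nqueued].payload, payload,
                sizeof queue[nqueued].payload - 1);
        nqueued++;
    }

    /* Synchronization point: in the real system every node calls this at the
     * same logical point and all queued messages are routed in one pass.
     * Here we just print the queue to show the shape of the call.           */
    static void cr_exchange(void)
    {
        for (int i = 0; i < nqueued; i++)
            printf("deliver to node %d: %s\n", queue[i].dest, queue[i].payload);
        nqueued = 0;
    }

    int main(void)
    {
        /* Computation phase: decisions generate messages but send nothing. */
        cr_queue(3, "move element 17");
        cr_queue(5, "move element 42");

        cr_exchange();     /* one collective call delivers everything */
        return 0;
    }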
The crystal router was an effective system on early Caltech, JPL, and commercial multicomputers. It minimized latency and interrupt overhead and used optimal routing. It has not survived as a generally important concept, as it is not needed in this form on modern machines with different topologies and automatic hardware routing.
In all of the software development cycles, one of our primary concerns was portability . We wanted our programs to be portable not only between various types of parallel computers, but also between parallel and sequential computers. It was in this sense that Cubix was such a breakthrough, since it allowed us to leave all the standard runtime system calls in our programs. In most cases, Cubix programs will run either on a supported parallel computer or on a simple sequential machine through the provision of a small number of dummy routines for the sequential machine. Using these tools, we were able to implement our codes on all of the commercially and locally built hypercubes.
The next question to arise, however, concerned possible extensions to alternative architectures, such as shared-memory or mesh-based structures. The crystal router offered a solution. By design, messages in the crystal router can be sent to any other node; sending merely involves placing the message in an appropriate queue. When the interprocessor communication is actually invoked, the system is responsible for routing messages between processors-a step in which potential differences in the underlying hardware architecture can be concealed. As a result, applications using the crystal router can conceivably operate on any type of parallel or sequential hardware.
At the end of 1987, ParaSoft Corporation was founded by a group from C³P with the goal of providing a uniform software base-a set of portable programming tools-for all types of parallel processors (Figure 5.6 ).
Figure 5.6: Express System Components
The resultant system, Express [ ParaSoft:88a ], is a merger of the C³P message-passing tools, developed into a unified system that can be supported on all types of parallel computers. The basic components are:
Additionally, ParaSoft added:
ParaSoft extended the parallel debugger originally developed for the nCUBE hypercube [ Flower:87c ] and created a set of powerful performance analysis tools [ Parasoft:88f ] to help users analyze and optimize their parallel programs. This toolset, incorporating all of the concepts of the original work and available on a wide range of parallel computers, has been widely accepted and is now the most commonly used system at Caltech. It is interesting to note that the most successful parallel programs are still built around the crystalline style of internode communication originally developed for the Mark II hypercube in 1982. While other systems occasionally seem to offer easier routes to working algorithms, we usually find that a crystalline implementation offers significantly better performance.
At the current stage of development, we also believe that parallel processing is reasonably straightforward. The availability of sophisticated debugging tools and I/O systems has reduced debugging time by several orders of magnitude. Similarly, the performance evaluation system has proved itself very powerful in analyzing areas where potential improvements can be made in algorithms.
ParaSoft also supports a range of other parallel computing tools, some of which are described later in this chapter.
It is interesting to compare the work of other organizations with that performed at Caltech. In particular, our problem-solving approach to the art of parallel computing has, in some cases, led us down paths which we have since abandoned but which are still actively pursued by other groups. Yet, a completely fresh look at parallel programming methods may produce a more consistent paradigm than our evolutionary approach. In any case, the choice of a parallel programming system depends on whether the user is more interested in machine performance or ease of programming.
There are several systems that offer some or all of the features of Express, based on long-range communication by message passing. Many are more general operating environments with the features of ``real'' operating systems missing in Express and especially CrOS. We summarize some examples in the following:
JPL developed the Mercury message-passing system [ Lee:86a ] at the same time as we developed the 9 routines at Caltech. Mercury is similar to the 9 routines in that messages can be transmitted between any pair of nodes, irrespective of whether a channel connects them. Messages also have ``types'' and can be sorted and buffered by the system as in the 9 routines or Express. A special type of message allows one node to broadcast to all others.
Centaur is a simulation of CrOS III built on Mercury. This system was designed to allow programmers with crystalline applications the ability to operate either at the level of the hardware with high performance (with the CrOS III library) or within the asynchronous Mercury programming model, which had substantially higher (about a factor of three) message startup latency. When operating in Centaur mode, CrOS III programs may use advanced tools, such as the debugger, which require asynchronous access to the communication hardware.
VERTEX is the native operating system of the nCUBE. It shares with Express, Mercury, and the 9 routines the ability to send messages, with types, between arbitrary pairs of processors. Only two basic functions are supported to send and receive messages. I/O is not supported in the earliest versions of VERTEX, although this capability has been added in support of the second generation nCUBE hypercube.
The Reactive Kernel [ Seitz:88b ] is a message-passing system based on the idea that nodes will normally be sending messages in response to messages coming from other nodes. Like all the previously mentioned systems, the Reactive Kernel can send messages between any pair of nodes with a simple send/receive interface. However, the system call that receives messages does not distinguish between incoming messages. All sorting and buffering must be done by the user. As described in Chapter 16 , Zipcode has been built on top of the Reactive Kernel to provide similar capabilities to Express.
The NX system provided for the Intel iPSC series of multicomputers is also similar in functionality to the previously described long-range communication systems. It supports message types and provides sorting and buffering capabilities similar to those found in Express. No support is provided for nearest-neighbor communication in the crystalline style, although some of the collective communication primitives are supported.
The MACH operating system [ Tevanian:89a ] is a full implementation of UNIX for a shared-memory parallel computer. It supports all of the normally expected operating system facilities, such as multiuser access, disks, terminals, printers, and so on, in a manner compatible with the conventional Berkeley UNIX. MACH is also built with an elegant small (micro) kernel and a careful architecture of the system and user level functionality.
While this provides a strong basis for multiuser processing, it offers only simple parallel processing paradigms, largely based on the conventional UNIX interprocess communication protocols, such as ``pipes'' and ``sockets.'' As mentioned earlier in connection with MOOSE, these types of tools are not the easiest to use in tightly coupled parallel codes. The Open Software Foundation (OSF) has extended and commercialized MACH. They also have an AD (Advanced Development) prototype version for distributed memory machines. The latest Intel Paragon multicomputer offers OSF's new AD version of MACH on every node, but the operating system has been augmented with NX to provide high-performance message passing.
Helios [ DSL:89a ] is a distributed-memory operating system designed for transputer networks. It offers typical UNIX-like utilities, such as compilers, editors, and printers, all accessible from the nodes of the transputer system, although fewer than MACH supports. In common with MACH, however, the level of parallel processing support is quite limited. Users are generally encouraged to use pipes for interprocessor communication-no collective or crystalline communication support is provided.
The basic concept used in Linda [ Ahuja:86a ] is the idea of a tuple-space (database) for objects of various kinds. Nodes communicate by dropping objects into the database, which other nodes can then extract. This concept has a very elegant implementation, which is extremely simple to learn, but which can suffer from quite severe performance problems. This is especially so on distributed-memory architectures, where the database searching necessary to find an ``object'' can require intensive internode communication within the operating system.
More recent versions of Linda [ Gelertner:89a ] have extended the original concept by adding additional tuple-spaces and allowing the user to specify to which space an object should be sent and from which it should be retrieved. This new style is reminiscent of a mailbox approach, and is thus quite similar to the programming paradigm used in CrOS III or Express.
PVM is a very popular and elegant system that is freely available from Oak Ridge [ Sunderam:90a ], [ Geist:92a ]. This parallel virtual machine is notable for its support of a heterogeneous computing environment with, for instance, a collection of disparate architecture computers networked together.
There are several other message-passing systems, including active messages [ Eiken:92a ] discussed earlier, P4 [ Boyle:87a ], PICL [ Geist:90b ], EUI on the IBM SP-1, CSTools from Meiko, Parmacs [ Hempel:91a ], and CMMD on the CM-5 from Thinking Machines. PICL's key feature is the inclusion of primitives for gathering the data needed for performance visualization (Section 5.4 ). This could be an important feature in such low-level systems.
Most of the ideas in Express, PVM, and the other basic message-passing systems are incorporated in a new Message-Passing Interface (MPI) standard [ Walker:94a ]. This important development tackles basic point-to-point and collective communication. MPI does not address issues such as ``active messages'' or distributed computing and wide-area networks (e.g., what are correct protocols for video-on-demand and multimedia with real-time constraints). Operating systems issues, outside the communication layer, are also not considered in MPI.
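For comparison, a minimal MPI point-to-point example in C looks like the following; it shows only the standard MPI_Send/MPI_Recv calls and nothing specific to the other systems discussed above.

    /* Minimal MPI example: process 0 sends one integer to process 1.
     * Compile with an MPI C compiler (e.g., mpicc) and run on two processes. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);  /* tag 99 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
            printf("process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }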
The history of our message-passing system work at Caltech is interesting in that its motivation departs significantly from that of most other institutions. Since our original goals were problem-oriented rather than motivated by the desire to do parallel processing research, we tended to build utilities that matched our hardware and software goals rather than our aesthetic preferences. If our original machine had had multiplexed DMA channels and specialized routing hardware, we might have started off in a totally different direction. Indeed, this can be seen as motivation for developing some of the alternative systems described in the previous section.
In retrospect, we may have been lucky to have such limited hardware available, since it forced us to develop tools for the user rather than rely on an all-purpose communication system. The resultant decomposition and collective communication routines still provide the basis for most of our successful work-even with the development of Express, we still find that we return again and again to the nearest-neighbor, crystalline communication style, albeit using the portable Express implementation rather than the old rdELT and wtELT calls. Even as we attempt to develop automated mechanisms for constructing parallel code, we rely on this type of technology.
The advent of parallel UNIX variants has not solved the problems of message passing-indeed these systems are among the weakest in terms of providing user-level support for interprocessor communication. We continually find that the best performance, both from our parallel programs and the scientists who develop them, is obtained when working in a loosely synchronous programming environment such as Express, even when this means implementing such a system on top of a native, ``parallel UNIX.''
We believe that the work done by C³P is still quite unique, at least in its approach to problem solving. It is amusing to recall the comment of one new visitor to Caltech who, coming from an institution building sophisticated ``parallel UNIXs,'' was surprised to see the low level at which CrOS III operated. From our point of view, however, it gets the job done in an efficient and timely manner, which is of paramount importance.
Relatively little attention was paid in the early days of parallel computers to debugging the resulting parallel programs. We developed our approaches by trial and error during our various experiments in C³P, and debugging was never a major research project in C³P.
In this section, we shall consider some of the history and current technology of parallel debugging, as developed by C³P.
Method 1. Source Scrutiny

The way one worked on the early C³P machines was to compile the target code, download it to the nodes, and wait. If everything worked perfectly, results might come back. Under any other circumstances, nothing would come out. The only real way to debug was to stare at the source code.
The basic problem was that while the communication routines discussed in the previous chapter were adequate (and in some sense ideal) for the task of algorithm development, they lacked a lot in terms of debugging support. In order to ``see'' the value of a variable inside the nodes, one had to package it up into a message and then send it to the host machine. Similarly, the host code had to be modified to receive this message at the right time and format it for the user's inspection. Even then only node 0 could perform this task directly, and all the other nodes had to somehow get their data to node 0 before it could be displayed.
Given the complexity of this task, it is hardly surprising that users typically stared at their source code rather than attempt it. Ironically, this procedure actually tended to introduce new bugs in the process of detecting the old ones, because incorrect synchronization of the messages in nodes and host would lead to the machine hanging, probably somewhere in the new debugging code rather than at the location one was trying to debug. After several hours of fooling around, one would make the painful discovery that the debugging code itself was wrong and would have to start once more.
Method 2. Serial Channels

When the first C³P hypercubes were built, each node had been given a serial RS-232 channel. No one quite knew why this had been done, but it was pointed out that by attaching some kind of terminal, or even a PC, it might be possible to send ``print'' statements out of the back of one or more nodes.
This was quickly achieved, but proved to be less than the dramatic improvement one might have hoped for. The interface was really slow and only capable of printing simple integer values. Furthermore, one had to use it while sitting in the machine room, and it was necessary to attach the serial cable from the debugging terminal to the node to be debugged-an extremely hazardous process that could cause other cables to become loose.
A development that should probably have pointed us in the right direction immediately came when the MS-DOS program DEBUG was adapted to this interface. Finally, we could actually insert breakpoints in the target node code and examine memory!
Unfortunately, this too failed to become popular because of the extremely low level at which it operated. Memory locations had to be specified in hexadecimal and code could only be viewed as assembly language instructions.
A final blow to this method was that our machines still operated in ``single-user'' mode-that is, only a single user could be using the system at any one time. As a result, it was intolerable for a single individual to ``have'' the machine for a couple of hours struggling with the DEBUG program while others were waiting.
Method 3. Cubix

As has been described in the previous section on communication systems, the advent of the Cubix programming style brought a significant improvement to the life of the parallel code developer. For the first time, any node could print out its data values, not using some obscure and arcane functions but with normal printf and WRITE statements. To this extent, debugging parallel programs really did become as simple as debugging sequential ones.
Using this method took us out of the stone age: Each user would generate huge data files containing the values of all the important data and then go to work with a pocket calculator to see what went wrong.
Method 4. Help from the Manufacturer

The most significant advance in debugging technology, however, came with the first nCUBE machine. This system embodied two important advances:
The ``real'' kernel was a mixed blessing. As has been pointed out previously, we didn't really need most of its resources and resented the fact that the kernel imposed a message latency almost ten times that of the basic hardware. On the other hand, it supported real debugging capabilities.
Unfortunately, the system software supplied with the nCUBE hadn't made much more progress in this direction than we had with our `` DEBUG '' program. The debugger expected to see addresses in hex and displayed code as assembly instructions. Single stepping was only possible at the level of a single machine instruction.
Method 5. The Node Debugger: ndb

Our initial attempt to get something useful out of the nCUBE's debugging potential was something called `` bdb '' that communicated with nCUBE's own debugger through a pipe and attempted to provide a slightly more friendly user interface. In particular, it allowed stack frames to be unrolled and also showed the names of functions rather than absolute addresses. It was extremely popular.
As a result of this experience, we decided to build a full-blown, user-friendly, parallel programming debugger, finally resulting in the C³P and now ParaSoft tool known as `` ndb ,'' the ``node debugger.''
The basics of the design were straightforward, but tedious to code. Much work had to be done building symbol tables from executables, figuring out how line numbers mapped to memory addresses, and so on, but the most important discovery was that a parallel program debugger has to work in a rather different way than its sequential counterparts.
Lesson 1. Avoiding Deadlock

The first important discovery was that a parallel program debugger couldn't operate in the ``on'' or ``off'' modes of conventional debuggers. In sdb or dbx , for example, either the debugger is in control or the user program is running. There are no other states. Once you have issued a ``continue'' command, the user program continues to run until it either terminates or reaches another breakpoint, at which time you may once again issue debugger commands.
To see how this fails for a parallel program, consider the code outline shown in Figure 5.7 . Assume that we have two nodes, both stopped at breakpoints at line one. At this point, we can do all of the normal debugger activities including examination of variables, listing of the program, and so on. Now assume that we single-step only node 0 . Since line one is a simple assignment we have no problem and we move to line two.
Figure 5.7: Single-Stepping Through Message-Passing Code
Repeating this process is a problem, however, since we now try to send a message to node 1, which is not ready to receive it-node 1 is still sitting at its breakpoint at line one. If we adopted the sequential debugger standard in which the user program takes control whenever a command is given to step or continue the program, we would now have a deadlock, because node 0 will never return from its single-step command until node 1 does something. On the other hand, node 1 cannot do anything until it is given a debugger command.
In principle, we can get around this problem by redefining the action of the send_message function used in node 0. In the normal definition of our system at that time, this call should block until the receiving node is ready. By relaxing this constraint, we can allow the send_message function to return as soon as the data to be transmitted is safely reusable, without waiting for the receive.
This does not save the debugger. We now expect the single step from line two to line three to return, as will the trivial step to line four. But the single step to line five involves receiving a message from node 1 and no possible relaxing of the communication specification can deal with the fact that node 1 hasn't sent anything.
Deadlock is unavoidable.
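The figure itself is not reproduced here; based on the description above, its outline is roughly of the following form. This is a hedged reconstruction with invented names (recv_message, the variables, and the right-hand column are assumptions), not the actual figure: node 0 and node 1 run complementary code, so that line two pairs a send on node 0 with a receive on node 1, and line five pairs them the other way around.

    line        node 0                          node 1
     1          a = f(x)                        a = g(x)
     2          send_message(1, &a, n)          recv_message(0, &b, n)
     3          b = a * 2                       c = b * 2
     4          c = b + 1                       d = c + 1
     5          recv_message(1, &d, n)          send_message(0, &d, n)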
The solution to this problem is to simply make debugging a completely autonomous process which operates independently of the user program. Essentially, this means that any debugger command immediately returns and gives the user a new prompt. The single-step command, for example, doesn't wait for anything important to happen but allows the user to carry on giving debugger commands even though the user process may be ``hung'' as a consequence of the single step as shown in Figure 5.7 .
Lesson 1a. Who Gets the Input?

As an immediate consequence of the decision to leave the debugger in control of the keyboard at all times, we run into the problem of how to pass input to the program being debugged.
Again, sequential debuggers don't have this problem because the moment you continue or run the program it takes control of the keyboard and you enter data in the normal manner. In ndb , life is not so simple because if you start up your code and it prints on the screen

    Enter two integers: [I,J]

or some such, you can't actually type the values because they would be interpreted as debugger commands! One way around this is to have multiple windows on your workstation; in one you type debugger commands; in the other, input to your program. Another solution is to have a debugger command that explicitly switches control to the user program in just the same way that a sequential debugger would: ndb supports both mechanisms.
Lesson 2. Show State

Because the debugger operates in the manner just described, it becomes very important to give the user a quick way of seeing when something has really happened. Normal sequential debuggers give you this feedback by simply returning a prompt whenever the user program has encountered a breakpoint or terminated. In our case, we provide a simple command, `` show state ,'' to allow the user to monitor the progress of the node program.
As an example, the output when executed on node 0 at line two might be something like

    Node 0: Breakpoint, PC=[foo.c, 2]

which shows that the node is stopped at a breakpoint at the indicated line of a source file named `` foo.c ''. If we step again, the debugger gives us back a prompt and a very quick `` show state '' command might now show

    Node 0: Running, PC=[send.c, 244]

showing that the node is now running code somewhere inside a source file called `` send.c ''. Another similar command would probably show something like

    Node 0: Breakpoint, PC=[foo.c, 3]

indicating that the node had now reached the breakpoint on the next line. If the delay between the first two `` show state '' commands were too long, you might never see the ``Running'' state at all because the node will have performed its ``send'' operation and reached line three.
If you continued with this process of single stepping and probing with the `` show state '' command, you would eventually get to a state where the node would show as ``Running'' in the receive function from which it would never return until node 1 got around to sending its message.
Lesson 3. Sets of Nodes

The simplest applications of a sequential debugger for parallel programs would be similar to those already seen. Each command issued by the user to the debugger is executed on a particular node. Up to now, for example, we have considered only actions on node 0. Obviously, we can't make much progress in the example code shown in Figure 5.7 until node 1 moves from its initial breakpoint at line one.
We might extend the syntax by adding a `` pick '' command that lets us, for example, say

    pick node 1

and then execute commands there instead of on node 0. This would clearly allow us to make progress in the example we have been studying. On the other hand, it is very tedious to debug this way. Even on as few as four nodes, a sequence such as

    pick node 0
    show state
    pick node 1
    show state
    ...

is used frequently and is very painful to type. Running on 512 nodes in this manner is out of the question. The solution adopted for ndb is to use ``node sets.'' In this case, the above effect would be achieved with the command

    on all show state

or an equivalent alternative form.
The basic idea is that debugger commands can be applied to more than a single processor at once. In this way, you can obtain global information about your program without spending hours typing commands.
In addition to the simple concepts of a single node and ``all'' nodes, ndb supports other groups such as contiguous ranges of nodes, discontinuous ranges of nodes, ``even'' and ``odd'' parity groups, the ``neighbors'' of a particular node, and user-defined sets of nodes to facilitate the debugging process. For example, the command

    on 0, 1, nof 1, even show state

executes the `` show state '' command on nodes 0, 1, the neighbors of node 1, and all ``even parity'' nodes.
Lesson 4. Smart stepping

Once node sets are available to execute commands on multiple processors, another tricky issue comes up concerning single stepping. Going back to the example shown in Figure 5.7 , consider the effect of executing a sequence of commands such as

    on all step
    on all step
    on all step
    on all step

starting from the initial state in which both nodes are at a breakpoint at line one. The intent is fairly obvious-the user wants to single-step over the intermediate lines of code from line one, eventually ending up at line five.
In principle, the objections that gave rise to the independence of debugger and user program should no longer hold, because when we step from line two, both nodes are participating and thus the send/receive combination should be satisfied properly.
The problem, however, is merely passed down to the internal logic of the debugger. While it is true that the user has asked both nodes to step over their respective communication calls, the debugger is hardly likely to be able to deduce that. If the debugger expands (internally) the single-step command so that node 0 is stepped before node 1, then all might be well, since node 0 will step over its ``send'' before node 1 steps over its receive-a happy result. If, however, the debugger chooses the opposite expansion, stepping node 1 before node 0, it will hang just as badly as the original user interaction.
Even though the ``obvious'' expansion is the one that works in this case, this is not generally true-in fact, it fails when stepping from line four to line five in the example.
In general, there is no way for the debugger to know how to expand such command sequences reliably, and as a result a much ``smarter'' method of single stepping must be used, such as that shown schematically in Figure 5.8 .
Figure 5.8:
Logic for Single Stepping on Multiple Nodes
The basic idea is to loop over each of the nodes in the set being stepped trying to make ``some'' progress towards reaching the next stopping point. If no nodes can make progress, we check to see if some time-out has expired and if not, continue. This allows us to step over system calls that may take a significant time to complete when measured in machine instructions.
Finally, if no more progress can be made, we attempt to analyze the reason for the deadlock and return to the user anyway.
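In rough C, and with invented helper names standing in for debugger internals, the logic just described might be sketched as follows; this is an illustration of the idea in Figure 5.8, not the actual ndb source.

    /* Invented helper prototypes; in ndb these would be internal routines. */
    void start_timer(void);
    int  timed_out(void);
    int  try_to_advance(int node);          /* 1 if the node made progress  */
    int  at_next_stopping_point(int node);  /* 1 if node reached its target */
    void analyze_deadlock(const int *nodes, int nnodes);

    /* Step every node in a set, tolerating nodes that are temporarily
     * blocked inside communication calls.                                */
    int step_node_set(const int *nodes, int nnodes)
    {
        int progress, done;

        start_timer();
        do {
            progress = 0;
            done     = 1;
            for (int i = 0; i < nnodes; i++) {
                if (try_to_advance(nodes[i]))
                    progress = 1;
                if (!at_next_stopping_point(nodes[i]))
                    done = 0;
            }
            if (!progress && timed_out())
                break;                   /* give up on nodes that are stuck */
        } while (!done);

        if (!done)
            analyze_deadlock(nodes, nnodes);  /* report the likely reason */
        return done;
    }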
This process is not foolproof in the sense that we will sometimes ``give up'' on single steps that are actually going to complete, albeit slowly. But it has the great virtue that even when the user program ``deadlocks'', the debugger comes back to the user, often with a correct analysis of the reason for the deadlock.
Lesson 5. Show queue

Another interesting question about the debugger concerns the extensions and/or modifications that one might make to a sequential debugger.
One might be tempted to say that the parallel debugger is so different from its sequential counterparts that a totally new syntax and method of operation is justified. One then takes the chance that no one will invest the time needed to learn the new tool and it will never be useful.
For ndb , we decided to adopt the syntax of the well-known UNIX dbx debugger that was available on the workstations that we used for development. This meant that the basic command syntax was familiar to everyone using the system.
Of course we have already introduced several commands that don't exist in dbx , simply because sequential debuggers don't have need for them. The `` show state '' command is never required in sequential debuggers because the program is either running or it's stopped at a point that the debugger can tell you about. Similarly, one never has to give commands to multiple processors.
Another command that we learned early on to be very important was `` show q '', which monitored the messages in transit between processors. Because our parallel programs were really just sequential programs with additional message passing, the ``bugs'' that we were trying to find were not normally algorithmic errors but message-passing ones.
A typical scenario would be that the nodes would compute (correctly) and then reach some synchronization or communication point at which point the logic relating to message transfer would be wrong and everything would hang. At this point, it proved to be immensely useful to be able to go in with the debugger and look at which nodes had actually sent messages to other nodes.
Often one would see something like
    Node 0:
        Node 1, type 12, len 32
        (12 4a 44 82 3e 00 ...)
        Node 2, type 12, len 32
        (33 4a 5f ff 00 00 ...)
    Node 1: No messages
    Node 2: No messages

indicating that node 0 has received two messages of type 12 and length 32 bytes from node 1 and node 2, but that neither node 1 nor node 2 has any.
Armed with this type of information, it is usually extremely easy to detect the commonest type of parallel processing problem.
Lesson 5a. Message Passing Is Easy

An interesting corollary to the debugging style just described is that we learned that debugging message-passing programs was much easier than other types of parallel programming.
The important advantage that a user-friendly debugger brings to the user is the ability to slow down the execution of the program to the point where the user can ``see'' the things that go wrong. This fits well with the ``message-passing'' methodology since bugs in message passing usually result in the machine hanging. In this state, you have plenty of time to examine what's happening and deduce the error. Furthermore, the problem is normally completely repeatable since it usually relates to a logic error in the code.
In contrast, shared-memory or multiprocessing paradigms are much harder because the bugs tend to depend on the relative timing of various events within the code. As a result, the very act of using the debugger can cause the problem to show up in a different place or even to go away altogether. This is akin to that most frustrating of problems when you are tracking down a bug with print statements, only to find that just as you insert the climactic final statement which will isolate your problem, it goes away altogether!
Lesson 6. How Many Windows?

The debugger ndb was originally designed to be driven from a terminal by users typing commands, but with the advent of graphical workstations with windowing systems it was inevitable that someone would want a ``windowing'' version of the debugger.
It is interesting to note that many users' original conception was that the natural approach would be to port a sequential debugger and run multiple instances of it, each debugging one node.
This illusion is quickly removed, however, when we are debugging a program on many nodes with many invocations of a sequential debugger. Not only is it time-consuming setting up all of the windows, but activities such as single stepping become extremely frustrating since one has to go to each window in turn and type the ``continue'' command. Even providing a ``button'' that can be clicked to achieve this doesn't help much because you still have to be able to see the button in the overlapping windows, and however fast you are with the mouse it gets harder and harder to achieve this effect as the number of nodes on which your program is running grows.
Our attempt at solving this problem is to have two different window types: an ndb console and a node window. The console is basically a window-oriented version of the standard debugger. The lower panel allows the user to type any of the normal debugger commands and have them behave in the expected fashion. The buttons at the top of the display allow ``shortcuts'' for the often issued commands, and the center panel allows a shortcut for the most popular command of all:

    on all show state

This button doesn't actually generate the output from this command in the normal mode since, brief as its output is, it can still be tedious watching 512 copies of

    Node XXX: Breakpoint, [foo.c, 13]

scroll past. Instead, it presents the node state as a colored bar chart in which the various node states each have different colors. In this way, for example, you can ``poll'' until all the nodes hit a breakpoint by continually hitting the `` Update '' button until the status panel shows a uniform color and the message shows that all nodes have reached a breakpoint.
In addition to this usage, the color coding also vividly shows problems such as a node dividing by zero. In this case, the bar chart would show uniform colors except for the node that has died, which might show up in some contrasting shade.
The second important use of the `` Update '' button is to synchronize the views presented by the second type of window, the ``node windows.''
Each of these presents a view of a ``group'' of nodes, represented by one particular node chosen from the group. Thus, for example, you might choose to make a node window for the nodes 0-3, represented by node 0. In this case, the upper panel of the node window would show the source code being executed by node 0 while the lower panel would automatically direct commands to all four nodes in the group. The small status bar in the center shows a ``smiley'' face if all nodes in the group appear to be at the same source line number and a ``sad'' face if one or more nodes are at different places.
This method allows the user to control large groups of nodes and simultaneously see their source code while also monitoring differences in behavior. A common use of the system, for example, is to start with a single node window reflecting ``all nodes'' and to continue in this way until the happy face becomes sad, at which point additional node windows can be created to monitor those nodes which have departed from the main thread of execution.
The importance of the `` Update '' button in this regard is that the node windows have a tendency to get out of sync with the actual execution of the program. In particular, it would be prohibitively expensive to have each node window constantly tracking the program location of the nodes it was showing, since this would bombard the node program with status requests and also cause constant scrolling of the displayed windows. Instead, ndb chooses its own suitable points to update the displayed windows and can be forced to update them at other times with the `` Update '' button.
This section has emphasized the differences between ndb and sequential debuggers since those are the interesting features from the implementation standpoint. On the other hand, from the user's view, the most striking success of the tool is that it has made the debugging process so little different from that used on sequential codes. This can be traced to the loosely synchronous structure of most (C³P) parallel codes. Debugging fully asynchronous parallel codes can be much more challenging than the sequential case.
In practice, users have to be shown only once how to start up the debugging process, and be given a short list of the new commands that they might want to use. For users who are unfamiliar with the command syntax, the simplest route is to have them play with dbx on a workstation for a few minutes.
After this, the process tends to be very straightforward, mostly because of the programming styles that we tend to use. As mentioned in an earlier section, debugging totally asynchronous programs that generate multiple threads synchronizing with semaphores in a time-dependent manner is not ndb 's forte. On the other hand, debugging loosely synchronous message-passing programs has been reduced to something of a triviality.
In some sense, we can hardly be said to have introduced anything new. The basis on which ndb operates is very conventional, although some of the implications for the implementation are non-trivial. On the other hand, it provides an important and often critical service to the user. The next section will describe some of the more revolutionary steps that were taken to simplify the development process in the areas of performance analysis and visualization.
From the earliest days of parallel computing, the fundamental goal was to accelerate the performance of algorithms that ran too slowly on sequential machines. As has been described in many other places in this book, the effort to do basic research in computer science was always secondary to the need for algorithms that solved practical problems more quickly than was possible on other machines.
One might think that an important prerequisite for this would be advanced profiling technology. In fact, about the most advanced piece of equipment then in use was a wristwatch! Most algorithms were timed on one node, then on two, then on four, and so on. The results of this analysis were then compared with the theoretically derived models for the applications. If all was well, one proceeded to number-crunch; if not, one inserted print statements and timed the gaps between them to see what pieces of code were behaving in ways not predicted by the models.
Even the breakthrough of having a function that a program could call to get at timing information was a long time coming, and even then proved somewhat unpopular, since it had different names on different machines and didn't even exist on the sequential machines. As a result, people tended to just not bother with it rather than mess up their codes with many different timing routines.
Of course, this was all totally adequate for the first few applications that were parallelized, since their behavior was so simple to model. A program solving Laplace's equation on a square grid, for example, has a very simple performance model that one would actually have to work quite hard not to achieve in a parallel code. As time passed, however, more complex problems were attempted which weren't so easy to model, and tools had to be invented.
Of course, this discussion has missed a rather important point which we also tended to overlook in the early days.
When comparing performance of the problems on one, two, four, eight, and so on nodes, one is really only assessing the efficiency of the parallel version of the code. However, an algorithm that achieves 100 percent efficiency on a parallel computer may still be worthless if its absolute performance is lower than that of a sequential code running on another machine.
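In the usual notation (not spelled out in the text), if $T(N)$ is the time to solve a fixed problem on $N$ nodes, the quantities being compared here are

    S(N) = \frac{T(1)}{T(N)}, \qquad
    \varepsilon(N) = \frac{S(N)}{N} = \frac{T(1)}{N\,T(N)},

so a code can show an efficiency near one and still be slow in absolute terms if $T(1)$ itself is far from the best time achievable by a sequential code on some other machine.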
Again, this was not so important in the earliest days, since the big battle over architectures had not yet arisen. Nowadays, however, when there is a multitude of sequential and parallel supercomputers, it is extremely important to be able to know that a parallel version of a code is going to outperform a sequential version running on another architecture. It is becoming increasingly important to be able to understand what complex algorithms are doing and why, so that the performance of the software and hardware can both be tuned to achieve best results.
This section attempts to discuss some of the issues surrounding algorithm visualization, parallelization and performance optimization, and the tools which C³P developed to help in this area. A major recent general tool, PABLO [ Reed:91a ], has been developed at Illinois by Reed's group, but here we only describe the C³P activity. One of the earliest tools was Seecube [Couch:88a;88b].
The first question that must be asked of any algorithm when a parallel version is being considered is, ``What does it do?'' Surprisingly, this question is often quite hard to answer. Vague responses such as ``some sort of linear algebra'' are quite common and even if the name of the algorithm is actually known, it is quite surprising how often codes are ported without anyone actually having a good impression of what the code does.
One attempt to shed light on these issues by providing a data visualization service is vtool . One takes the original (sequential) source code and runs it through a preprocessor that instruments various types of data access. The program is then compiled with a special run time library and run in the normal manner. The result is a database describing the ways in which the algorithm or application makes use of its data.
Once this has been collected, vtool provides a service analogous to a home VCR which allows the application to be ``played back'' to show the memory accesses being made. Sample output is shown in Figure 5.9 .
Figure 5.9: Analysis of a Sorting Algorithm Using vtool
The basic idea is to show ``pictures'' of arrays together with a ``hot spot'' that shows where accesses and updates are being made. As the hot spot moves, it leaves behind a trail of gradually fading colors that dramatically shows the evolution of the algorithm. As this proceeds, the corresponding source code can be shown and the whole simulation can be stopped at any time so that a particularly interesting sequence can be replayed in slow motion or even one step at a time, both forward and backward.
In addition to showing simple access patterns, the display can also show the values being stored into arrays, providing a powerful way of debugging applications.
In the parallel processing arena, this tool is normally used to understand how an algorithm works at the level of its memory references. Since most parallel programs are based on the ideas of data distribution, it is important to know how the values at a particular grid point or location in space depend on those of neighbors. This is fundamental to the selection of a parallelization method. It is also central to understanding how the parallel and sequential versions of the code will differ, which becomes important when the optimization process begins.
It should be mentioned in passing that we have been surprised, in using this tool, at how often people's conceptions of the way that numerical algorithms work are either slightly or completely revised after seeing the visualization system at work.
Hopefully, the visualization system goes some way towards the development of a parallel algorithm. One must then code and debug the application which, as has been described previously, can be a reasonably time-consuming process. Finally, one comes to the ``crisis'' point of actually running the parallel code and seeing how fast it goes.
One of our major concerns in developing performance analysis tools was to make them easy to use. The standard UNIX method of taking the completed program, deleting all its object files, and then recompiling them with special switches seemed to be asking too much for parallel programs because the process is so iterative. On a sequential machine, the profiler may be run once or twice, usually just to check that the authors' impressions of performance are correct. On a parallel computer, we feel that the optimization phase should more correctly be included in the development cycle than as an afterthought, because we believe that few parallel applications perform at their best immediately after debugging is complete. We wanted, therefore, to have a system that could give important information about an algorithm without any undue effort.
The system to be described works with the simple addition of either a runtime switch or the definition of an ``environment'' variable, and makes available about 90% of the capabilities of the entire package. To use some of the most exotic features, one must recompile code.
As an example of the ``free'' profiling information that is available consider the display from the ctool utility shown in Figure 5.10 . This provides a summary of the gross ``overheads'' incurred in the execution of a parallel application divided into categories such as ``calculation,'' ``I/O,'' ``internode communication,'' ``graphics,'' and so on. This is the first type of information that is needed in assessing a parallel program and is obtained by simply adding a command line argument to an existing program.
Figure 5.10: Overhead Summary from ctool
At the next level of detail after this, the individual overhead categories can be broken down into the functions responsible for them. Within the ``internode communication'' category, for example, one can ask to be shown the times for each of the high-level communication functions, the number of times each was called and the distribution of message lengths used by each. This output is normally presented graphically, but can also be generated in tabular form (Figure 5.11 ) for accurate timing measurements. Again, this information can be obtained more or less ``for free'' by giving a command line argument.
Figure 5.11: Tabular Overhead Summary
The overhead summaries just described offer replies to the important question, ``What are the costs of executing this algorithm in parallel?'' Once this information is known, one typically proceeds to the question, ``Why do they cost this much?''
To answer this question we use etool , the event-tracing profiler.
The purpose of this tool is probably most apparent from its sample output, Figure 5.12 . The idea is that we present timelines for each processor on which the most important ``events'' are indicated by either numbered boxes or thin bars. The former indicate atomic events such as ``calling subroutine foo '' or ``beginning of loop at line 12,'' while the bars are used to indicate the beginning and end of extended events such as a read operation on a file or a global internode communication operation.
Figure 5.12: Simple Event Traces
The basic idea of this tool is to help understand why the various overheads observed in the previous analysis exist. In particular, one looks for behavior that doesn't fit with that expected of the algorithm.
One common situation, for example, is to look for places where a ``loosely synchronous'' operation is held up by the late arrival of one or more processors at the synchronization point. This is quite simple in etool ; an ``optimal'' loosely synchronous event would have bars in each processor that aligned perfectly in the vertical direction. The impact of a late processor shows up quite vividly, as shown in Figure 5.13 .
Figure 5.13: Sample Application Behavior as Seen by etool
This normally occurs either because of a poorly constructed algorithm or because of poor load balancing due to data dependencies.
An alternative pattern that shows up remarkably well is the sequential behavior of ``master-slave'' or ``client-server'' algorithms in which one particular node is responsible for assigning work to a number of other processors. These algorithms tend to show patterns similar to that of Figure 5.12 , in which the serialization of the loop that distributed work is quite evident.
Another way that the event-profiling system can be used is to collect statistics regarding the usage of particular code segments. Placing calls to the routine eprof_toggle around a particular piece of code causes information to be gathered describing how many times that block was executed, and the mean and variance of the time spent there. This is analogous to the ``block profiling'' supported by some compilers.
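A minimal sketch of what such instrumentation might look like in a user's code is shown below. The single integer block label passed to eprof_toggle is an assumption made for illustration; the text does not give the routine's argument list.

    /* Assumed prototype for the profiler entry point described above;
       the integer block label is an illustrative assumption, not the
       documented interface.                                           */
    extern void eprof_toggle(int block_id);

    void relax_step(double *x, int n)
    {
        eprof_toggle(1);    /* start accumulating statistics for block 1 */
        for (int i = 0; i < n; i++)
            x[i] = 0.5 * (x[i] + x[(i + 1) % n]);   /* the work being timed */
        eprof_toggle(1);    /* stop: call count, mean, and variance recorded */
    }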
The system first described, vtool , had as its goal the visualization of sequential programs prior to their parallelization. The distribution profiler dtool serves a similar purpose for parallel programs which rely on data distribution for their parallelism. The basic idea is that one can ``watch'' the distribution of a particular data object change as an algorithm progresses. Sample output is shown in Figure 5.14 .
Figure 5.14: Data Distribution Analysis
At the bottom of the display is a timeline which looks similar to that used in the event profiler, etool . In this case, however, the events shown are the redistribution operations on a particular data object. Clicking on any event with the mouse causes a picture of the data distribution among the nodes to be shown in the upper half of the display. Other options allow for fast and slow replays of a particular sequence of data transformations.
The basic idea of this tool is to look at the data distributions that are used with a view to either optimizing their use or looking for places in which redundant transformations are being made that incur high communication costs. Possible restructuring of the code may eliminate these transformations, thus improving performance. This is particularly useful in conjunction with automatic parallelization tools, which have a tendency to insert redundant communication in an effort to ensure program correctness.
As mentioned earlier, the most often neglected question with parallel applications is how fast they are in absolute terms. It is possible that this is a throwback to sequential computers, where profiling tools, although available, are rarely used. In most cases, if a program doesn't run fast enough when all the compiler's optimization capabilities are exhausted, one merely moves to a higher performance machine. Of course, this method doesn't scale well and doesn't apply at all in the supercomputer arena. Even more importantly, as processor technology becomes more and more complex, the performance gap between the peak speed of a system and that attained by compiled code gets ever wider.
The typical solution for sequential computers is the use of profiling tools like prof or gprof that provide a tabular listing of the routines in a program and the amount of time spent in each. This avoids the use of the wristwatch but only goes so far. You can certainly see which routines are the most expensive but no further.
The profiler xtool was designed to serve this purpose for parallel computers and in addition to proceed to lower levels of resolution: source code and even machine instructions. Sample displays are shown in Figure 5.15 . At the top is a graphical representation of the time spent executing each of the most expensive routines. The center shows a single routine at the level of its source code and the bottom panel shows individual machine instructions.
Figure 5.15: Output from the CPU Usage Profiler
The basic goal of this presentation is to allow the user to see where CPU time is being spent at any required level of detail. At the top level, one can use this information to develop or restructure algorithms, while at the lowest level one can see how the processor instructions operate and use this data to rework pieces of code in optimized assembly language.
Note that while the other profiling tools are directed specifically towards understanding the parallel processing issues of an application, this tool is aimed mostly at a thorough understanding of sequential behavior.
One of the most often asked questions about this profiling system is why there are so many separate tools rather than an all-encompassing system that tells you everything you wish to know about the application.
Our fundamental reason for choosing this method was to attempt to minimize the ``self-profiling'' problem that tends to show up in many systems in which the profiling activity actually spends most of its time profiling the analysis system itself. Users of the UNIX profiling tools, for example, have become adept at ignoring entries for routines such as mcount , which correspond to time spent within the profiling system itself.
Unfortunately, this is not so simple in a parallel program. In sequential applications, the effect of the profiling system is merely to slow down other types of operation, an effect which can be compensated for by merely subtracting the known overheads of the profiling operations. On a parallel computer, things are much more complicated, since slowing down one processor may affect another which in turn affects another, and so on until the whole system is completely distorted by the profiling tools.
Our approach to this problem is to package up the profiling subsystems in subsets which have more or less predictable effects, and then to let the user decide which systems to use in which cases. For example, the communication profiler, ctool , incurs very small overheads-typically a fraction of 1%-while the event profiler costs more and the CPU usage profiler, xtool , most of all. In common use, therefore, we tend to use the communication profiler first, and then enable the event traces. If these two trials yield consistent results, we move on to the execution and distribution profilers. We have yet to encounter an application in which this approach has failed, although the fact that we are rarely interested in microsecond accuracy helps in this regard.
Interestingly, we have found problems due to ``clock-skewing'' to have negligible impact on our results. It is true that clock skewing occurs in most parallel systems, but we find that our profiling results are accurate enough to be helpful without taking any special precautions in this regard. Again, this is mostly due to the fact that, for the kinds of performance analysis and optimization in which we are interested, resolution of tens or even hundreds of microseconds is usually quite acceptable.
Our assumption that parallel algorithms are complex entities seems to be borne out by the fact that nearly everyone who has invested the (minimal) time to use the profiling tools on their application has come away understanding something better than before. In some cases, the revelations have been so profound that significant performance enhancements have been made possible.
In general, the system has been found easy to use, given a basic understanding of the parallel algorithm being profiled, and most users have no difficulty recognizing their applications from the various displays. On the other hand, the integration between the different profiling aspects is not yet as tight as one might wish and we are currently working on this aspect.
Another interesting issue that comes up with great regularity is the request from users for a button marked ``Why?'', which would automatically analyze the profile data being presented and then point out a block of source code and a suggestion for how to improve its performance. In general, this is clearly too difficult, but it is interesting to note that certain types of runtime system are more amenable to this type of analysis than others. The ``distribution profiler,'' for instance, possesses enough information to perform quite complex communication and I/O optimizations on an algorithm and we are currently exploring ways of implementing these strategies. It is possible that this line of thought may eventually lead us to a more complete programming model than is in use now-one which will be more amenable to the automation of parallel processing that has long been our goal.
Synchronous problems have been defined in Section 3.4 as having the simplest temporal or computational structure. The problems are typically defined by a regular grid, as illustrated in Figure 4.3, and are parallelized by a simple domain decomposition. A synchronous temporal structure corresponds to each point in the data domain being evolved with an identical computational algorithm, and we summarize this in the caricature shown in Figure 6.1. We find several important synchronous problems in the academic applications, which formed the core of C³P's work. We expect, as shown in Chapter 19, that the ``real world'' (industry and government) will show fewer problems of the synchronous class. One hopes that a fundamental theory will describe phenomena in terms of simple, elegant, and uniform laws; these are likely to lead to a synchronous computational (temporal) structure. On the other hand, real-world problems typically involve macroscopic phenomenological models as opposed to fundamental theories of the microscopic world. Correspondingly, we find in the real world more loosely synchronous problems that only exhibit macroscopic temporal synchronization.
Figure 6.1: The Synchronous Problem Class
There is no black-and-white definition of synchronous since, practically, we allow some violations of the rigorous microscopic synchronization. This is already seen in Section 4.2's discussion of the irregularity of Monte Carlo ``accept-reject'' algorithms. A deeper example is irregular geometry problems, such as the partial differential equations of Chapters 9 and 12 with an irregular mesh. The simplest of these can be implemented well on SIMD machines as long as each node can access different addresses. In the High Performance Fortran analysis of Chapter 13, there is a class of problems lacking the regular grid of Figure 4.3. They cannot be expressed in terms of Fortran 90 with arrays of values. However, the simpler irregular meshes are topologically rectangular-they can be expressed in Fortran 90 with an array of pointers. The SIMD MasPar MP-1 and MP-2 support this node-dependent addressing, which MasPar terms an ``autonomous SIMD'' feature. We believe that just as SIMD is not a precise computer architecture, the synchronous problem class will also inevitably be somewhat vague, with some problems having architectures in a grey area between synchronous and loosely synchronous.
The applications described in Chapter 4 were all run on MIMD machines using the message-passing model of Chapter 5. Excellent speedups were obtained. Interestingly, even when C³P acquired a SIMD CM-2, which also supported this problem class well, we found it hard to move onto this machine because of the different software model-the data-parallel languages of Chapter 13-offered by SIMD machines. The development of High Performance Fortran, reviewed in Section 13.1, now offers the same data-parallel programming model on SIMD and MIMD machines for synchronous problems. Currently, nobody has efficiently ported the message-passing model to SIMD machines-even with the understanding that it would only be effective for synchronous problems. It may be that with this obvious restriction, the message-passing model could be implemented on SIMD machines.
This chapter includes a set of neural network applications. This is an important class of naturally parallel problems, and represents one approach to answering the question:
``How can one apply massively parallel machines to artificial intelligence (AI)?''
We were asked this many times at the start of C³P, since AI was one of the foremost fields in computer science at the time. Today, the initial excitement behind the Japanese fifth-generation project has abated and AI has transitioned to a routine production technology which is perhaps more limited than originally believed. Interestingly, the neural network approach leads to a synchronous structure, whereas the complementary actor or expert system approaches have a very different asynchronous structure. The high-temperature superconductivity calculations in Section 6.3 made a major impact on the condensed matter community. Quoting from Nature [ Maddox:90a ]:
``Yet some progress seems to have been made. Thus Hong-Qiang Ding and Miloje S. Makivic, from California Institute of Technology, now describe an exceedingly powerful Monte Carlo calculation of an antiferromagnetic lattice designed to allow for the simulation of (Phys. Rev. Lett. 64, 1,449; 1990). In this context, a Monte Carlo simulation entails starting with an arbitrary arrangement of spins on the lattice, and then changing them in pairs according to rules that allow all spin states to be reached without violating the overall constraints. The authors rightly boast of their access to Caltech's parallel computer system, but they have also devised a new and efficient algorithm for tracing out the evolution of their system. As is the custom in this part of the trade, they have worked with square patches of two-dimensional lattice with as many as 128 lattice spacings to each side. The outcome is a relationship between correlation length-the distance over which order, on the average, persists-and temperature; briefly, the logarithm of the correlation length is inversely proportional to the temperature. That, apparently, contradicts other models of the ordering process. In lanthanum copper oxide, the correlation length agrees well with that measured by neutron diffraction below (where there is a phase transition), provided the interaction energy is chosen appropriately. For what it is worth, that energy is not very different from estimates derived from Raman-scattering experiments, which provide a direct measurement of the energy of interaction by the change of frequency of the scattered light.''
The hypercube allowed much larger high-T_c calculations than the previous state of the art on conventional machines. Curiously, with QCD simulations (described in Section 4.3), we were only able at best to match the size of the Cray calculations of other groups. This probably reflects different cultures and computational expectations of the QCD and condensed matter communities. C³P had the advantage of dedicated facilities and could devote them to the most interesting applications.
Section 6.2 describes an early calculation, which was a continuation of our collaboration with Sandia on nCUBE applications. They, of course, followed this with a major internal activity, including their impressive performance analysis of 1024-node applications [ Gustafson:88a ]. There were several other synchronous applications in C³P that we will not describe in this book. Wasson solved the single-particle Schrödinger equation on a regular grid to study the ground state of nuclear matter as a function of temperature and pressure. His approach used the time-dependent Hartree-Fock method, but was never taken past the stage of preliminary calculations on the early Mark II machines [ Wasson:87a ]. There were also two interesting signal-processing algorithms. Pollara implemented the Viterbi algorithm for convolutional decoding of data sent on noisy communication channels [ Pollara:85a ], [ Pollara:86a ]. This has similarities with the Cooley-Tukey binary FFT parallelization described in [ Fox:88a ]. We also looked at alternatives to this binary FFT in a collaboration with Aloisio from the Italian Space Agency. The prime number (nonbinary) discrete Fourier transform produces a more irregular communication pattern than the binary FFT and, further, the node calculations are less easy to pipeline than in the conventional FFT. Thus, it is hard to achieve the theoretical advantage of the nonbinary FFT, which often needs fewer floating-point operations for a given analysis whose natural problem size may not be the power of two demanded by the binary FFT
[Aloisio:88a;89b;90b;91a;91b]. This parallel discrete FFT was designed for synthetic aperture radar applications for the analysis of satellite data [Aloisio:90c;90d].
The applications in Sections 6.7.3, 6.5, and 6.6 use the important multiscale approach to a variety of vision or image-processing problems. Essentially, all physical problems are usefully considered at several different length scales, and we will come back to this in Chapters 9 and 12 when we study partial differential equations (multigrid) and particle dynamics (fast multipole).
This work implemented a code on the nCUBE-1 hypercube for studying the evolution of two-dimensional, convectively-dominated fluid flows. An explicit finite difference scheme was used that incorporates the flux-corrected transport (FCT) technique developed by Boris and Book [ Boris:73a ]. When this work was performed in 1986-1987, it was expected that explicit finite difference schemes for solving partial differential equations would run efficiently on MIMD distributed-memory computers, but this had only been demonstrated in practice for ``toy'' problems on small hypercubes of up to 64 processors. The motivation behind this work was to confirm that a bona fide scientific application could also attain high efficiencies on a large commercial hypercube. The work also allowed the capabilities and shortcomings of the newly-acquired nCUBE-1 hypercube to be assessed.
Although first-order finite difference methods are monotonic and stable, they are also strongly dissipative, causing the solution to become smeared out. Second-order techniques are less dissipative, but are susceptible to nonlinear, numerical instabilities that cause nonphysical oscillations in regions of large gradient. The usual way to deal with these types of oscillation is to incorporate artificial diffusion into the numerical scheme. However, if this is applied uniformly over the problem domain, and enough is added to dampen spurious oscillations in regions of large gradient, then the solution is smeared out elsewhere. This difficulty is also touched upon in Section 12.3.1 . The FCT technique is a scheme for applying artificial diffusion to the numerical solution of a convectively-dominated flow problem in a spatially nonuniform way. More artificial diffusion is applied in regions of large gradient, and less in smooth regions. The solution is propagated forward in time using a second-order scheme in which artificial diffusion is then added. In regions where the solution is smooth, some or all of this diffusion is subsequently removed, so the solution there is basically second order. Where the gradient is large, little or none of the diffusion is removed, so the solution in such regions is first order. In regions of intermediate gradient, the order of the solution depends on how much of the artificial diffusion is removed. In this way, the FCT technique prevents nonphysical extrema from being introduced into the solution.
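A minimal one-dimensional sketch of this flux-correction step may help; it is not the code used in this study (which was two-dimensional), and the antidiffusion coefficient mu and the array names are illustrative assumptions. Here u_td[] holds the transported-and-diffused field, and the limiter is the standard Boris-Book form: the corrected antidiffusive flux is zeroed near extrema and kept at its full value in smooth regions, exactly the behavior described above.

    #include <math.h>

    /* Boris-Book limiter: sign(a) * max(0, min(|a|, sign(a)*b, sign(a)*c)) */
    static double limit3(double a, double b, double c)
    {
        double s = (a > 0.0) ? 1.0 : -1.0;
        double m = fabs(a);
        if (s * b < m) m = s * b;
        if (s * c < m) m = s * c;
        return (m > 0.0) ? s * m : 0.0;
    }

    /* Remove as much of the added diffusion as monotonicity allows.
       Interior points only; guard cells at each end are assumed filled. */
    void fct_antidiffuse(const double *u_td, double *u_new, int n, double mu)
    {
        for (int i = 2; i < n - 2; i++) {
            double a_r = mu * (u_td[i + 1] - u_td[i]);      /* raw fluxes */
            double a_l = mu * (u_td[i] - u_td[i - 1]);
            double ac_r = limit3(a_r, u_td[i + 2] - u_td[i + 1],
                                       u_td[i]     - u_td[i - 1]);
            double ac_l = limit3(a_l, u_td[i + 1] - u_td[i],
                                       u_td[i - 1] - u_td[i - 2]);
            u_new[i] = u_td[i] - (ac_r - ac_l);   /* corrected field value */
        }
    }

In the two-dimensional code the same limiting is applied separately in the positive and negative x and y directions, as described in the following subsections.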
The governing equations are similar to those in Section 12.3.1, namely, the two-dimensional Euler equations with body-force terms. Here ρ is the fluid mass density, E is the specific energy, u and v are the fluid velocities in the x and y directions, and the pressure, p, is given by the ideal-gas relation, in which γ is the constant adiabatic index. The motion of the fluid is tracked by introducing massless marker particles and allowing them to be advected with the flow; the number density of the marker particles therefore satisfies the corresponding advection equation.
The equations are solved on a rectilinear two-dimensional grid. Second-order accuracy in time is maintained by first advancing the velocities by a half time step, and then using these velocities to update all values for the full time step. The size of the time step is governed by the Courant condition.
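The Courant condition can be made concrete with a short sketch. The routine below is not taken from the original code: global_min() is a hypothetical reduction over all processors (the need for this communication is noted later), and the sound-speed formula assumes the ideal-gas pressure defined above.

    #include <math.h>

    extern double global_min(double local);   /* hypothetical global reduction */

    /* One common form of the Courant-limited time step for an explicit 2-D
       scheme: dt <= cfl * min over cells of dx/(|u|+c) and dy/(|v|+c).     */
    double courant_dt(const double *rho, const double *p,
                      const double *u, const double *v,
                      int npts, double dx, double dy,
                      double gam, double cfl)
    {
        double dt_local = 1.0e30;
        for (int i = 0; i < npts; i++) {
            double c   = sqrt(gam * p[i] / rho[i]);   /* local sound speed */
            double dtx = dx / (fabs(u[i]) + c);
            double dty = dy / (fabs(v[i]) + c);
            double dt  = (dtx < dty) ? dtx : dty;
            if (dt < dt_local) dt_local = dt;
        }
        return cfl * global_min(dt_local);   /* every processor gets the same dt */
    }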
The basic procedure in each time step is to first apply a five-point difference operator at each grid point to convectively transport the field values. These field values are then diffused in each of the positive and negative x and y directions. The behavior of the resulting fields in the vicinity of each grid point is then examined to determine how much diffusion to remove at that point. In regions where a field value is locally monotonic, nearly all the diffusion previously applied is removed for that field. However, in regions close to extrema, the amount of diffusion removed is less.
The code used in this study parallelizes well for a number of reasons. The discretization is static and regular, and the same operations are applied at each grid point, even though the evolution of the system is nonlinear. Thus, the problem can be statically load balanced at the start of the code by ensuring that each processor's rectangular subdomain contains the same number of grid points. In addition, the physics, and hence the algorithm, is local so the finite difference algorithm only requires communication between nearest neighbors in the hypercube topology. The extreme regularity of the FCT technique means that it can also be efficiently used to study convective transport on SIMD concurrent computers, such as the Connection Machine, as has been done by Oran, et al. [ Oran:90a ].
No major changes were introduced into the sequential code in parallelizing it for the hypercube architecture. Additional subroutines were inserted to decompose the problem domain into rectangular subdomains, and to perform interprocessor communication. Communication is necessary in applying the Courant condition to determine the size of the next time step, and in transferring field values at grid points lying along the edge of a processor's subdomain. Single rows and columns of field values were communicated as the algorithm required. Some inefficiency, due to communication latency, could have been avoided if several rows and/or columns had been communicated at the same time, but in order to avoid wasting memory on larger communication buffers, this was not done. This choice was dictated by the small amount of memory available on each nCUBE-1 processor.
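The edge exchange just described might look something like the following sketch. It is not the original nCUBE code: send_to() and recv_from() are hypothetical point-to-point wrappers, and the field is assumed to be stored with a one-cell guard ring around the local subdomain.

    extern void send_to(int node, const double *buf, int n);    /* hypothetical */
    extern void recv_from(int node, double *buf, int n);        /* hypothetical */

    /* f is an (nx+2) x (ny+2) array stored row-major; interior cells are
       i = 1..nx, j = 1..ny, and columns i = 0 and i = nx+1 are guard cells.
       left and right are the neighbouring processors in the x direction.
       (With real blocking message passing the two transfers would be
       ordered, e.g. by processor parity, to avoid deadlock.)              */
    void exchange_x(double *f, int nx, int ny, int left, int right)
    {
        double sbuf[ny], rbuf[ny];
        int w = ny + 2;

        for (int j = 0; j < ny; j++) sbuf[j] = f[nx * w + (j + 1)];
        send_to(right, sbuf, ny);            /* rightmost interior column */
        recv_from(left, rbuf, ny);
        for (int j = 0; j < ny; j++) f[0 * w + (j + 1)] = rbuf[j];

        for (int j = 0; j < ny; j++) sbuf[j] = f[1 * w + (j + 1)];
        send_to(left, sbuf, ny);             /* leftmost interior column */
        recv_from(right, rbuf, ny);
        for (int j = 0; j < ny; j++) f[(nx + 1) * w + (j + 1)] = rbuf[j];
    }

A matching exchange of single rows in the y direction completes the guard-cell update; these calls are also where several columns could have been batched to reduce latency, at the cost of larger buffers.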
As a sample problem, the onset and growth of the Kelvin-Helmholtz instability was studied. This instability arises when the interface between two fluids in shear motion is perturbed, and for this problem the body forces are zero. In Figure 6.2 (Color Plate), we show the development of the Kelvin-Helmholtz instability at the interface of two fluids in shear motion. In these figures, the density of the massless marker particles normalized by the fluid density is plotted on a color map, with red corresponding to a density of one, through green and blue, to white at a density of zero. Initially, all the marker particles are in the upper half of the domain, and the fluids in the lower- and upper-half domains have a relative shear velocity in the horizontal direction. A regular finite difference grid was used. Vortices form along the interface and interact before being lost to numerical diffusion. By processing the output from the nCUBE-1, a videotape of the evolution of the instability was produced. This sample problem demonstrates that the FCT technique is able to track the physical instability without introducing numerical instability.
Figure 6.2: Development of the Kelvin-Helmholtz instability at the interface of two fluids in shear motion.
Table 6.1: Timing Results in Seconds for a 512-Processor and a 1-Processor nCUBE-1. The table lists the numbers of grid points per processor in the x and y directions; the concurrent efficiency, overhead, and speedup are denoted by ε, f, and S.
The code was timed for the Kelvin-Helmholtz problem for hypercubes with dimension ranging from zero to nine. The results for the 512-processor case are presented in Table 6.1 , and show a speedup of 429 for the largest problem size considered. Subsequently, a group at Sandia National Laboratories, using a modified version of the code, attained a speedup of 1009 on a 1024-processor nCUBE-1 for a similar type of problem [ Gustafson:88a ]. The definitions of concurrent speedup, overhead, and efficiency are given in Section 3.5 .
An analytic model of the performance of the concurrent algorithm was developed; ignoring communication latency, the concurrent overhead was found to decrease with the grain size n, the number of grid points per processor. This is in approximate agreement with the results plotted in Figure 6.3, which shows the concurrent overhead for a number of different hypercube dimensions and grain sizes.
The FCT code was ported to the nCUBE-1 by David W. Walker [ Walker:88b ]. Gary Montry of Sandia National Laboratories supplied the original code, and made several helpful suggestions. A videotape of the evolution of the Kelvin-Helmholtz instability was produced by Jeff Goldsmith at the Image Processing Laboratory of the Jet Propulsion Laboratory.
Figure 6.3: Overhead, f, as a Function of the Grain Size n, the Number of Grid Points per Processor. Results are shown for nCUBE-1 hypercubes of dimension one to nine. The overhead for the 2-processor case (open circles) lies below that for the higher dimensional hypercubes. This is because the processors only communicate in one direction in the 2-processor case, whereas for hypercubes of dimension greater than one, communication is necessary in both the x and y directions.
Following the discovery of high-temperature superconductivity, two-dimensional quantum antiferromagnetic spin systems have received enormous attention from physicists worldwide. It is generally believed that high-temperature superconductivity occurs in the copper-oxygen planes shown in Figure 6.4. Many features can be explained [ Anderson:87a ] by the Hubbard theory of strongly coupled electrons, which at half-filling reduces to the spin-1/2 antiferromagnetic Heisenberg model, H = J Σ_⟨ij⟩ S_i · S_j, where the S_i are quantum spin operators and the sum runs over nearest-neighbor pairs. Furthermore, neutron scattering experiments on the parent compound reveal a rich magnetic structure which is also modelled by this theory.
Physics in two dimensions (as compared to three dimensions) is characterized by large fluctuations. Many analytical methods work well in three dimensions, but fail in two dimensions. For quantum systems, this means additional difficulties in finding solutions to the problem.
Figure 6.4: The Copper-Oxygen Plane, Where the Superconductivity Is Generally Believed to Occur. The arrows denote the quantum spins; the orbital wave functions shown lead to the interactions among them.
Figure 6.5: Inverse Correlation Length Measured in the Neutron Scattering Experiment (crosses) and in Our Simulation (squares). The curve is the fit shown in Figure 6.11.
New analytical methods have been developed to understand the low-T behavior of these two-dimensional systems, and progress has been made. These methods are essentially based on an expansion whose least reliable region is, unfortunately, the extreme quantum case. On the other hand, given sufficient computer power, Quantum Monte Carlo simulation [ Ding:90g ] can provide accurate numerical solutions of the model theory and quantitative comparison with the experiment (see Figure 6.5). Thus, simulations become a crucial tool in studying these problems. The work described here has made a significant contribution to the understanding of high-T_c materials, and has been well received by the science community [ Maddox:90a ].
Using the Suzuki-Trotter transformation, the two-dimensional quantum problem is converted into a system of three-dimensional classical Ising spins with complicated interactions. The partition function becomes a product of transfer matrices, one for each four-spin interaction. These four-spin squares go in the time direction on the three-dimensional lattice. This transfer matrix serves as the probability basis for a Monte Carlo simulation. The zero matrix elements are the consequence of the quantum conservation law. To avoid generating trial configurations with zero probability, and thus wasting CPU time since these trials will never be accepted, one should have the conservation law built into the updating scheme. Two types of local moves may locally change the spin configurations, as shown in Figure 6.6. A global move in the time direction flips all the spins along a time line; this update changes the magnetization. Another global move in the spatial directions changes the winding numbers.
Figure 6.6: (a) A ``Time-Flip.'' The white plaquette is a non-interacting one; the eight plaquettes surrounding it are interacting ones. (b) A ``Space-Flip.'' The white plaquette is a non-interacting one lying in the spatial dimensions; the four plaquettes going in the time direction are interacting ones.
This classical spin system in three dimensions is simulated using the Metropolis Monte Carlo algorithm. Starting with a given initial configuration, we locate a closed loop C of L spins in one of the four moves. After checking that they satisfy the conservation law, we compute the probability before the L spins are flipped, which is a product of the diagonal elements of the transfer matrix, and the probability after the spins are flipped, which is a product of the off-diagonal elements of the transfer matrix along the loop C. The Metropolis procedure is to accept the flip with a probability given by the ratio of the two, capped at one.
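A minimal sketch of this acceptance test, assuming only that the two probabilities have already been accumulated as products along the loop (the function and variable names are illustrative):

    #include <stdlib.h>

    /* Standard Metropolis test: accept with probability min(1, p_after/p_before). */
    int metropolis_accept(double p_before, double p_after)
    {
        double ratio = p_after / p_before;
        if (ratio >= 1.0)
            return 1;                    /* flip is at least as probable: always accept */
        return ((double)rand() / (double)RAND_MAX) < ratio;
    }

In the actual simulation the parallel random number generator described below would supply the random number rather than rand().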
Figure 6.7: A Vectorization of Eight ``Time-Flips.'' Spins along the t-direction are packed into computer words. The two 32-bit words, S1 and S2, contain eight ``time plaquettes,'' indicated by the dashed lines.
We implemented a simple and efficient multispin coding method which facilitates vectorization and saves index calculation and memory space. This is possible because each spin only has two states, up (1) or down (0), which is represented by a single bit in a 32-bit integer. Spins along the t -direction are packed into 32-bit words, so that the boundary communication along the x or y direction can be handled more easily. All the necessary checks and updates can be handled by the bitwise logical operations OR, AND, NOT, and XOR. Note that this is a natural vectorization, since AND operations for the 32 spins are carried out in the single AND operation by the CPU. The index calculations to address these individual spins are also minimized, because one only computes the index once for the 32 spins. The same principles are applied for both local and global moves. Figure 6.7 shows the case for time-loop coding.
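The following fragment sketches the flavor of this multispin coding; it is not the original C code, and the helper names are illustrative. The point is that one XOR tests all 32 time slices of a pair of neighboring sites at once, and one XOR with a mask flips any chosen subset of slices.

    #include <stdint.h>

    /* One bit per spin (1 = up, 0 = down); bit k holds time slice k. */

    /* Bits set where the two neighbouring sites carry opposite spins;
       only such antiparallel pairs can be updated without violating
       the conservation law.                                          */
    static inline uint32_t antiparallel(uint32_t s1, uint32_t s2)
    {
        return s1 ^ s2;
    }

    /* Flip the chosen time slices of both sites; mask has a 1 in each
       bit position (time slice) selected for the move.               */
    static inline void flip_pair(uint32_t *s1, uint32_t *s2, uint32_t mask)
    {
        *s1 ^= mask;
        *s2 ^= mask;
    }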
The fairly large three-dimensional lattices are partitioned into a ring of M processors, with the x-dimension distributed uniformly among the M processors. The local updates are easily parallelized since the connection is, at most, next-nearest neighbor (for the time-loop update). The needed spin-word arrays from the neighboring processors are copied into local storage by the shift routine in the CrOS communication system [ Fox:88a ] before doing the update. One of the global updates, the time line, can also be done in the same fashion. The communication is very efficient in the sense that the boundary spins are moved in a single communication shift, rather than in the larger number of shifts that a two-dimensional partition of the lattice would require. The overhead and latency associated with the communication are thus significantly reduced.
The winding-line global update along the x-direction is difficult to do in this fashion, because it involves spins on all M nodes. In addition, we need to compute the correlation functions, which present the same difficulty. However, since these operations are not used very often, we devised a fairly elegant way to parallelize them. A set of gather-scatter routines, based on cread and cwrite in CrOS, was written. In gather, the subspaces on each node are gathered into complete spaces on each node, preserving the original geometric connection. Parallelism is achieved because the global operations are then done on each node just as on a sequential computer, with each node only doing the part it originally covers. In scatter, the updated (changed) lattice configuration on a particular node (number zero) is scattered (distributed) back to all the nodes in the ring, exactly according to the original partition. Note that this scheme differs from the earlier decomposition scheme [ Fox:84a ] for the gravitation problem, where the memory size constraint is the main concern.
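A minimal sketch of such a gather, written as a ring all-gather built from repeated shifts, is given below. It is not the actual CrOS-based code: ring_rank() and ring_shift() are hypothetical wrappers standing in for the cread/cwrite calls, and the spin words are treated as plain unsigned integers.

    #include <string.h>

    extern int  ring_rank(void);                  /* hypothetical: 0 .. M-1          */
    extern void ring_shift(const void *send,      /* hypothetical: pass a block to   */
                           void *recv, int nb);   /* the right neighbour and receive */
                                                  /* the corresponding block from    */
                                                  /* the left neighbour              */

    /* Each of the M ring nodes owns `chunk' spin words in `local'; after the
       gather, every node holds the complete lattice in `full', laid out in
       the original ring order.                                              */
    void gather_ring(const unsigned int *local, unsigned int *full,
                     int chunk, int M)
    {
        unsigned int buf[chunk], tmp[chunk];
        int me = ring_rank();

        memcpy(buf, local, chunk * sizeof(unsigned int));
        memcpy(full + me * chunk, local, chunk * sizeof(unsigned int));

        for (int k = 0; k < M - 1; k++) {
            ring_shift(buf, tmp, chunk * (int)sizeof(unsigned int));
            /* after k+1 shifts, the block just received originated on node me-1-k */
            int src = (me - 1 - k + M) % M;
            memcpy(full + src * chunk, tmp, chunk * sizeof(unsigned int));
            memcpy(buf, tmp, chunk * sizeof(unsigned int));
        }
    }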
The hypercube nodes were divided into several independent rings, each ring holding an independent simulation, as shown in Figure 6.8. At higher temperatures, a moderately sized spin system is enough, so we can simulate several independent systems at the same time. At low temperatures, one needs larger systems, and all the nodes are then dedicated to a single large system. This simple parallelism makes the simulation very flexible and efficient. In the simulation, we used a parallel version of the Fibonacci additive random number generator [ Ding:88d ], which has a very long period.
Figure 6.8: The Configuration of the Hypercube Nodes. In the example, 32 nodes are configured as four independent rings, each consisting of 8 nodes. Each ring does an independent simulation.
We have made a systematic performance analysis by running the code on different lattice sizes and different numbers of nodes. The timing results for a realistic situation (20 sweeps of update, one measurement) were measured [ Ding:90k ]. The speedup, defined as the time on one node divided by the time on M nodes for the same size spin system and the same number of operations, is plotted in Figure 6.9. One can see that the speedup is quite close to the ideal case, denoted by the dashed line. For the quantum spin system, the 32-node hypercube speeds up the computation by a factor of 26.6, which is a very good result. However, running the same spin system on 16 nodes is more efficient, because we can then run two independent systems on the 32-node hypercube with a total speedup of about 29 (each a factor of 14.5). This is better described by the efficiency, defined as speedup/nodes, which is plotted in Figure 6.10. Clearly, the efficiency of the implementation is very high, generally over 90%.
Figure 6.9: Speedup of the Parallel Algorithm for the Lattice Systems Studied. The dashed line is the ideal case.
Figure 6.10: Efficiency of the Parallel Algorithm
Comparison with other supercomputers is interesting. For this program, the single-processor CRAY X-MP speed is approximately that of a 2-node Mark IIIfp. This indicates that our 32-node Mark IIIfp outperforms the CRAY X-MP by about a factor of 14! We note that our code is written in C and the vectorization is limited to the 32 bits inside the multispin-coded words. Rewriting the code in Fortran (Fortran compilers on the CRAY are more efficient) and fully vectorizing it, one might gain a factor of about three on the CRAY. Nevertheless, this quantum Monte Carlo code is clearly a good example of parallel computers easily (i.e., at the same programming level) outperforming conventional supercomputers.
We obtained many good results which were previously unknown. Among them, the correlation functions are perhaps the most important. First, the results can be directly compared with experiments, thus providing new understanding of the magnetic structure of the high-temperature superconducting materials. Second, and no less important, the behavior of the correlation function we obtained provides a crucial test for assessing various approximate methods.
In the large-spin-S (classical) limit, the correlation length grows exponentially as the temperature is lowered, which predicts too large a correlation length compared with experimental results. As S is reduced toward 1/2, the quantum fluctuations in the system become significant. Several approximate methods [ Chakravarty:88a ], [ Auerbach:88a ] predict a similar low-T behavior, differing in the power of the temperature prefactor (p = 0 or 1) and in a quantum renormalization constant. Our extensive quantum Monte Carlo simulations [ Ding:90g ] were performed on large spin-1/2 systems over a range of low temperatures. The correlation length, plotted against inverse temperature in Figure 6.11, falls onto a straight line surprisingly well throughout the whole temperature range, leading naturally to a pure exponential form in J/T, with the correlation length measured in units of the lattice constant a. This provides crucial support to the above-mentioned theories.
Figure 6.11: Correlation Length Measured at Various Temperatures. The straight line is the fit.
Direct comparison with experiments will not only test the validity of the Heisenberg model, but also determine the important parameter, the exchange coupling J. Setting the lattice constant equal to the spacing between Cu atoms in the plane, the Monte Carlo data are compared with those from neutron scattering experiments [ Endoh:88a ] in Figure 6.5. The agreement is very good. This provides strong evidence that the essential magnetic behavior is captured by the Heisenberg model. The quantum Monte Carlo result is an accurate first-principles calculation; no adjustable parameter is involved. Comparing directly with the experiment, the only adjustable parameter is J. This gives an independent determination of the effective exchange coupling.
Note that near the transition, the experimentally measured correlation length is systematically smaller than the theoretical curve of Equation 6.4. This is the combined result of small effects: frustration, anisotropies, interlayer coupling, and so on.
Various moments of the Raman spectrum have been calculated using series expansions and compared with experiments [ Singh:89a ]. This gives an estimate of J which is quite close to the above value determined from the correlation functions. Raman scattering probes the short-wavelength region, whereas neutron scattering measures the long-range correlations. The agreement of the J values obtained from these two rather different experiments is another significant indication that the magnetic interactions are dominated by the Heisenberg model.
Equation 6.4 is valid for all quantum AFM spins. The classic two-dimensional antiferromagnetic system discovered twenty years ago [ Birgeneau:71a ] is a spin-one system. Very recently, Birgeneau [ Birgeneau:90a ] fitted its measured correlation lengths to the same form. The fit is very good, as shown in Figure 6.12. The correction factor comes from integration of the two-loop beta-function without taking the limit, and could be neglected if T is very close to 0. For another antiferromagnet, Equation 6.4 also describes the data quite well [ Higgins:88a ].
Figure 6.12: Correlation Length Measured in the Neutron Scattering Experiment, with the Fit.
A common feature of Figures 6.11 and 6.12 is that the scaling relation of Equation 6.4, which is derived near zero temperature, is valid over a wide range of T. This differs drastically from the range of criticality in three-dimensional systems, where the width is usually about 0.2 or less. This is a consequence of the crossover temperature [ Chakravarty:88a ], at which the Josephson length scale becomes comparable with the thermal wavelength, being relatively high. This property is a general character of the low critical dimensions. In the quantum XY model, a Kosterlitz-Thouless transition occurs [ Ding:90b ] and the critical behavior remains valid well above the transition temperature.
As emphasized by Birgeneau, the spin-wave value for S = 1 fits the experiment quite well, whereas for S = 1/2 the spin-wave value differs significantly from the correct value, 1.25, in Equation 6.4. This indicates that the large quantum fluctuations in the spin-1/2 system are not adequately accounted for in spin-wave theory, whereas for the spin-one system, they are.
Figure 6.13 shows the energy density at various temperatures. At higher T, the high-temperature series expansion accurately reproduces our data. At low T, E approaches a finite ground-state energy. Another useful thermodynamic quantity is the uniform susceptibility, which is shown in Figure 6.14. Again, at high T, the series expansion coincides with our data. The location and height of the maximum are useful in determining J for the material.
Figure 6.13: Energy Measured as a Function of Temperature. Squares are from our work. The curve is the 10th-order high-T expansion.
Figure 6.14: Uniform Susceptibility Measured as a Function of Temperature. Symbols are as in Figure 6.13.
In conclusion, the quantum AFM Heisenberg spins are now well understood theoretically. The data from neutron scattering experiments for both the spin-1/2 and S = 1 compounds compare quite well with the theory. For the spin-1/2 compound, this leads to a direct determination of the exchange coupling J.
Quantum spin systems are well suited to the hypercube computer. Their spatial decomposition is straightforward, and the short-range nature of the interaction (excluding the occasional long-range global move) makes the extension to large numbers of processors simple. The hypercube connections made the use of the node computers efficient and flexible. High speedup can be achieved with reasonable ease, provided one improves the algorithm to minimize the communications.
The work described here is the result of the collaboration between H. Q. Ding and M. S. Makivic.
In this section, we discuss two further important developments based on the previous section (Section 6.3) on the isotropic Heisenberg quantum spins. These extensions are important in treating the observed phase transitions in two-dimensional magnetic systems. Theoretically, two-dimensional isotropic Heisenberg quantum spins remain in a paramagnetic state at all temperatures [ Mermin:66a ]. However, all crystals found in nature with strong two-dimensional magnetic character go through phase transitions into ordered states [ Birgeneau:71a ], [ DeJongh:74a ]. These include the recently discovered high-T_c materials, despite the presence of large quantum fluctuations in the spin-1/2 antiferromagnets.
We consider the cases where the magnetic spins interact through the isotropic Heisenberg coupling J supplemented by an additional coupling of strength h between the z-components of neighboring spins (Equation 6.12). When h adds an Ising-like anisotropy, the system goes through an Ising-like antiferromagnetic transition, very similar to those that occur in the high-T_c materials. In the case h = -J, that is, the XY model, the system exhibits a Kosterlitz-Thouless type of transition. In both cases, our simulation provides convincing and complete results for the first time.
Through the Matsubara-Matsuda transformation between spin-1/2 operators and bosonic creation/destruction operators, a general quantum lattice system can be mapped into a quantum spin system. Therefore, the phase transitions described here apply to general two-dimensional quantum systems. These results have broad implications for two-dimensional physical systems in particular, and for statistical systems in general.
The popular explanation for the antiferromagnetic ordering transitions in these high-T_c materials emphasizes the very small coupling between the two-dimensional layers. However, all these systems also exhibit some kind of in-plane anisotropy. An interesting case is the spin-one crystal discovered twenty years ago [ Birgeneau:71a ], whose magnetic behavior exhibits very strong two-dimensional character. It has a Néel ordering transition induced by an Ising-like anisotropy.
Our simulation provides clear evidence to support the picture that the in-plane anisotropy is also quite important in bringing about the observed antiferromagnetic transition in the most interesting spin-1/2 case. Adding a very small anisotropy energy will induce an ordering transition at a substantial temperature. This striking effect and related results agree well with a wide class of experiments, and provide some insight into these types of materials.
In the antiferromagnetic spin system, superexchange leads to the dominant isotropic coupling. One of the higher-order effects, due to the crystal field, reduces to a constant for these spin-1/2 high-T_c materials. Another second-order effect is the spin-orbit coupling; this effect picks out a preferred direction and leads to an anisotropic term, which also arises from the lattice distortion. More complicated terms, like the antisymmetric exchange, can also be generated. For simplicity and clarity, we focus the study on the antiferromagnetic Heisenberg model with an Ising-like anisotropy, as in Equation 6.12; the anisotropy parameter h is simply related to the usual reduced anisotropy energy. In the past, an anisotropy-field model has also been considered. However, its origin is less clear and, furthermore, it explicitly breaks the Ising symmetry.
For the large-anisotropy system, h = 1, the specific heat is shown for several system sizes in Figure 6.15(a). The peak becomes sharper and higher as the system size increases, indicating a divergent peak in the infinite system, similar to the two-dimensional Ising model. Defining a finite-system transition temperature by the location of the specific-heat peak, finite-size scaling theory [ Landau:76a ] predicts how this temperature approaches the infinite-system value as the lattice grows. Setting the correlation-length exponent to the Ising value, a good fit is obtained, as shown in Figure 6.15(b). A different scaling relation with the same exponent for the correlation length is also satisfied quite well, giving a consistent transition temperature. The staggered magnetization drops near this temperature, although the behavior is rounded off on these finite-size systems. All the evidence clearly indicates that an Ising-like antiferromagnetic transition occurs, with a divergent specific heat. In the smaller anisotropy case, similar behavior is found; the scaling of the correlation length, shown in Figure 6.16, again indicates a transition. However, the specific heat remains finite at all temperatures.
Figure 6.15: (a) The Specific Heat for Different Size Systems at h=1. (b) Finite-Size Scaling of the Transition Temperature.
Figure 6.16: The Inverse Correlation Lengths for the Two Anisotropic Systems and, for Comparison, the h=0 System. The straight lines are the scaling relation, from which the transition temperatures can be pinned down.
The most interesting case is a very small anisotropy, close to those in [ Birgeneau:71a ]. Figure 6.17 shows the staggered correlation function for this system compared with that of the isotropic model [ Ding:90g ]. The measured inverse correlation lengths, together with those for the isotropic model (h=0), are shown in Figure 6.16. Below the crossover temperature, the Ising behavior of a straight line becomes clear. Clearly, the system becomes antiferromagnetically ordered, and the scaling fit provides the best estimate of the transition temperature.
Figure 6.17: The Correlation Function for the Weakly Anisotropic System, Which Decays with a Finite Correlation Length. Also shown is the isotropic case, h=0.
It may seem a little surprising that a very small anisotropy can lead to a substantially high transition temperature. This may be explained by the following argument. At low T, the spins are highly correlated in the isotropic case. Since no direction is preferred, the correlated spins fluctuate in all directions, resulting in zero net magnetization. Adding a very small anisotropy introduces a preferred direction, so that the already highly correlated spins fluctuate around this direction, leading to a global magnetization.
More quantitatively, the crossover from the isotropic Heisenberg behavior to the Ising behavior occurs at the temperature where the correlation length reaches a value of order some power of the inverse anisotropy, as given by the scaling arguments of [ Riedel:69a ] in terms of the crossover exponent. In the two-dimensional model, both exponents entering this argument are infinite, but their ratio is approximately 1/2. This relation indicates the temperature range in which the Ising behavior is valid, which is clearly observed in Figure 6.16 for both anisotropies. At low T, for the isotropic quantum model, the correlation length grows exponentially with inverse temperature [ Ding:90g ]. Therefore, we expect the transition temperature to depend only weakly (logarithmically) on the anisotropy, with a spin-S-dependent constant of order one. Thus, even a very small anisotropy will induce a phase transition at a substantially high temperature. This crude picture, suggested a long time ago to explain the observed phase transitions, is now confirmed by extensive quantum Monte Carlo simulations for the first time. Note that this problem is an extreme case, both because it is an antiferromagnet (more difficult to order than a ferromagnet) and because it has the largest quantum fluctuations (spin-1/2). Since the predicted transition temperature varies slowly with h, we can estimate it for physically realistic anisotropies.
This simple result correctly predicts the Néel temperatures for a wide class of crystals found in nature, assuming the same level of anisotropy. For the high-T_c superconductors, our results give estimates quite close to the observed Néel temperatures; similar close predictions hold for other systems, both superconducting and insulating. For one high-T_c material, the prediction is in the same range as the observed Néel temperature, and much better than the naive expectation; in this crystal there is some degree of frustration (see below), so the actual transition is pushed down. These examples clearly indicate that the in-plane anisotropy could be quite important in bringing these high-T_c materials to Néel order. For the S=1 system, our results predict a transition temperature quite close to the observed one.
These results have direct consequences for the critical exponents. The onset of the transition is entirely due to the Ising-like anisotropy. Once the system becomes Néel-ordered, different layers in the three-dimensional crystal order at the same time. Spin fluctuations in different layers are incoherent, so that the critical exponents will be the two-dimensional rather than the three-dimensional Ising exponents; some of these materials show such behavior clearly. However, the interlayer coupling, although very small (much smaller than the in-plane anisotropy), could induce coherent correlations between the layers, so that the critical exponents will lie somewhere between the two- and three-dimensional Ising values; other materials seem to belong to this category.
Whether the ground state of the spin-1/2 antiferromagnet has long-range Néel order is a longstanding problem [ Anderson:87a ]. The existence of Néel order has been rigorously proved for larger spins. In the most interesting case, spin-1/2, numerical calculations on small lattices suggested the existence of the long-range order. Our simulation establishes the long-range order in the anisotropic cases studied here.
The fact that, near the transition, the spin system is quite sensitive to a tiny anisotropy could have a number of important consequences. For example, the measured correlation lengths are systematically smaller than the theoretical prediction [ Ding:90g ] near the transition. The weaker correlations probably indicate that frustration, due to the next-to-nearest-neighbor interaction, comes into play. This is consistent with the fact that the observed Néel temperature is below the value suggested by our results.
It is now well known that the two-dimensional (2D) classical (planar) XY model undergoes a Kosterlitz-Thouless (KT) [ Kosterlitz:73a ] transition at a finite temperature [ Gupta:88a ], characterized by an exponentially divergent correlation length and in-plane susceptibility. The transition, due to the unbinding of vortex-antivortex pairs, is weak; the specific heat has only a finite peak above the transition temperature.
Does the two-dimensional quantum XY model go through a phase transition? If so, what type of transition? This is a longstanding problem in statistical physics. The answers are relevant to a wide class of two-dimensional problems such as magnetic insulators, superfluidity, and melting, and possibly to the recently discovered high-T_c superconducting transition. Physics in two dimensions is characterized by large fluctuations. Changing from the classical model to the quantum model, additional quantum fluctuations (which are particularly strong in the case of spin-1/2) may alter the physics significantly. A direct consequence is that the already weak KT transition could be washed out completely.
The quantum XY model was first proposed [ Matsubara:56a ] in 1956 to study lattice quantum fluids. Later, high-temperature series studies raised the possibility of a divergent susceptibility for the two-dimensional model. For the classical planar model, the remarkable theory of Kosterlitz and Thouless [ Kosterlitz:73a ] provided a clear physical picture and correctly predicted a number of important properties. However, much less is known about the quantum model; in fact, it has been controversial. Using a large-order high-temperature expansion, Rogiers, et al. [ Rogiers:79a ] suggested a second-order transition at a finite temperature for spin-1/2. Later, real-space renormalization group analysis was applied to the model with contradictory and inconclusive results. DeRaedt, et al. [ DeRaedt:84a ] then presented an exact solution and a Monte Carlo simulation, both based on the Suzuki-Trotter transformation with small Trotter number m. Their results, both analytical and numerical, supported an Ising-like (second-order) transition at the Ising point, with a logarithmically divergent specific heat. Loh, et al. [ Loh:85a ] simulated the system with an improved technique. They found that the specific-heat peak remains finite and argued that a phase transition occurs at a temperature of about 0.5 or below, by measuring the change of the ``twist energy'' from a smaller to a larger lattice. The dispute between DeRaedt, et al., and Loh, et al., centered on the importance of using a large Trotter number m and of the global updates in small-size systems, which move the system from one subspace to another. Recent attempts to solve this problem still add fuel to the controversy.
The key to pinning down the existence and type of transition is a study of the correlation length and in-plane susceptibility, because their divergences constitute the most direct evidence of a phase transition. These quantities are much more difficult to measure, and large lattices are required in order to avoid finite-size effects. These key points are lacking in the previous works, and are the focus of our study. By extensive use of the Mark IIIfp Hypercube, we are able to measure spin correlations and thermodynamic quantities accurately on very large lattices. Our work [Ding:90h;92a] provides convincing evidence that a phase transition does occur at a finite temperature in the extreme quantum case, spin-1/2. At the transition point, the correlation length and susceptibility diverge exactly according to the Kosterlitz-Thouless form (Equation 6.18).
We plot the correlation length and the susceptibility in Figures 6.18 and 6.19. They show a tendency to diverge at some finite temperature. Indeed, we fit them to the form predicted by Kosterlitz and Thouless for the classical model, in which the correlation length diverges exponentially in the inverse square root of the distance from the transition temperature.
The fit is indeed very good (the chi-squared per degree of freedom is 0.81), as shown in Figure 6.18. A similar fit for the susceptibility is also very good, as shown in Figure 6.19. The good quality of both fits and the closeness of the transition temperatures obtained from them are the main results of this work. The fact that these fits also reproduce the expected scaling behavior is a further consistency check. These results strongly indicate that the spin-1/2 XY model undergoes a Kosterlitz-Thouless phase transition at a finite temperature. We note that this is consistent with the trend of the ``twist energy'' [ Loh:85a ], and that the rapid increase of the vortex density near the transition is due to the unbinding of vortex pairs. Figures 6.18 and 6.19 also indicate that the critical region is quite wide, which is very similar to the spin-1/2 Heisenberg model, where the scaling behavior holds far above the transition. These two-dimensional phenomena are in sharp contrast to the usual second-order transitions in three dimensions.
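As an illustration of how such a fit can be organized, the sketch below scans trial values of the transition temperature and, for each, performs a linear least-squares fit of ln xi against 1/sqrt(T - T_KT), which is what the KT form reduces to. The data points and the scan range are purely illustrative, not the measured values.

    #include <math.h>
    #include <stdio.h>

    /* Least-squares line y = a + b*x; returns the sum of squared residuals. */
    static double line_fit_sse(const double *x, const double *y, int n,
                               double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0, sse = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *a = (sy - *b * sx) / n;
        for (int i = 0; i < n; i++) {
            double r = y[i] - (*a + *b * x[i]);
            sse += r * r;
        }
        return sse;
    }

    int main(void)
    {
        /* illustrative (T, xi) pairs -- NOT the measured data */
        double T[]  = { 0.40, 0.45, 0.50, 0.60, 0.70 };
        double xi[] = { 60.0, 25.0, 14.0,  6.5,  4.0 };
        int n = 5;
        double best_sse = 1e30, best_tkt = 0.0, best_a = 0.0, best_b = 0.0;

        /* scan trial transition temperatures below the lowest data point */
        for (double tkt = 0.20; tkt < 0.395; tkt += 0.001) {
            double x[5], y[5], a, b;
            for (int i = 0; i < n; i++) {
                x[i] = 1.0 / sqrt(T[i] - tkt);   /* KT scaling variable          */
                y[i] = log(xi[i]);               /* ln of the correlation length */
            }
            double sse = line_fit_sse(x, y, n, &a, &b);
            if (sse < best_sse) {
                best_sse = sse; best_tkt = tkt; best_a = a; best_b = b;
            }
        }
        /* xi ~ A * exp(B / sqrt(T - T_KT)) with A = exp(a), B = b */
        printf("T_KT = %.3f  A = %.3f  B = %.3f\n",
               best_tkt, exp(best_a), best_b);
        return 0;
    }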
Figure 6.18: Correlation Length and Fit. (a) Correlation length versus T; the vertical line marks the temperature at which the fitted correlation length diverges. (b) The same data replotted against the Kosterlitz-Thouless variable; the straight line indicates agreement with the KT form.
Figure 6.19: (a) The plot of Figure 6.18(a) repeated on a coarser scale, showing both the high-temperature expansion (HTE) and the Kosterlitz-Thouless fit (KT). (b) Susceptibility and fit.
The algebraic exponent is consistent with the Ornstein-Zernike exponent at higher T. As the transition is approached, the exponent shifts down slightly and shows signs of approaching 1/4, its value at the transition for the classical model. This is consistent with Equation 6.21 .
Figure 6.20: Specific Heat. A different lattice size is used for part of the temperature range.
We measured the energy and the specific heat (for part of the temperature range we used a smaller lattice). The specific heat is shown in Figure 6.20 . We found that the specific heat has a peak above the transition temperature. The peak clearly shifts away from this position on the much smaller lattice. DeRaedt, et al. [ DeRaedt:84a ] suggested a logarithmically divergent specific heat on the basis of their simulation, which is likely an artifact of their small m values. One striking feature in Figure 6.20 is a very steep increase of the specific heat. The shape of the curve is asymmetric near the peak. These features of the curve differ from those of the classical XY model [ Gupta:88a ].
Quantum fluctuations are capable of pushing the transition point from its value in the classical model down to a lower value in the quantum spin-1/2 case, although they are not strong enough to push it down to zero. They also reduce the constant in the fit from 1.67 in the classical case to 1.18 in the spin-1/2 case.
The critical behavior in the quantum case is of the KT type, as in the classical case. This is a little surprising, considering the differences in the spin space. In the classical case, the spins are confined to the X - Y plane (thus the model is conventionally called a ``planar rotator'' model). This is important for the topological order in KT theory. The quantum spins are not restricted to the X - Y plane, because of the commutation relations. The KT behavior found in the quantum case indicates that the extra dimension in the spin space (which does not appear in the Hamiltonian) is actually unimportant. The out-of-plane correlations are very weak and short-ranged, and the out-of-plane susceptibility remains small over the whole temperature range.
These results for the XY model, together with those on the quantum Heisenberg model, strongly suggest that although quantum fluctuations at finite T can change the quantitative behavior of these nonfrustrated spin systems with continuous symmetries, the qualitative picture of the classical system persists. This can be understood from universality arguments: near the critical point, the dominant behavior of the system is determined by long-wavelength fluctuations, which are characterized by symmetries and dimensionality. The quantum effects only change the short-range fluctuations which, once integrated out, enter only as a renormalization of the physical parameters.
Our data also show that, for the XY model, the critical exponents are spin- S independent, in agreement with universality. More specifically, the exponent in Equation 6.18 could, in principle, differ from its classical value 1/2. Our data are sufficient to detect any systematic deviation from this value. For this purpose, the data are replotted in Figure 6.18(b). As expected, the data points all fall nicely on a straight line (except the point where the critical region presumably ends). A systematic deviation would lead to a slightly curved line instead of a straight line. In addition, the exponent seems to be consistent with the value for the classical system.
Our simulations reveal a rich structure, as shown in the phase diagram (Figure 6.21) for these quantum spins. The antiferromagnetically ordered region and the topologically ordered region are especially relevant to the high-temperature superconducting materials.
Figure 6.21: Phase Diagram for the Quantum Spin System of Equation 6.12. The solid points are from quantum Monte Carlo simulations. For large field magnitude, the system is practically an Ising system. Near h=0 or h=-2, the logarithmic relation of Equation 6.16 holds.
Finally, we point out the connection between the quantum XY system and the general two-dimensional quantum system with continuous symmetry. Through the above-mentioned Matsubara-Matsuda transformations, our result implies the existence of Kosterlitz-Thouless condensation in two-dimensional quantum systems. The symmetry of the XY model now becomes a continuous phase symmetry. This quantum KT condensation may have important implications for the mechanism of the recently discovered high-temperature superconducting transitions.
Vision (both biological and computer-based) is a complex process that can be characterized by multiple stages in which the original iconic information is progressively distilled and refined. The first researchers to approach the problem underestimated the difficulty of the task: after all, it does not take much effort for a human to open the eyes, form a model of the environment, recognize objects, move, and so on. But in recent years a scientific basis has been established for the first stages of the process ( low- and intermediate-level vision ) and a large set of special-purpose algorithms is available for high-level vision.
It is already possible to execute low-level operations (such as filtering, edge detection, and intensity normalization) in real time (30 frames/sec) using special-purpose digital hardware (such as digital signal processors). In contrast, higher level visual tasks tend to be specialized to particular applications, and require general-purpose hardware and software facilities.
Parallelism and multiresolution processing are two effective strategies to reduce the computational requirements of higher visual tasks (see, for example, [Battiti:91a;91b], [ Furmanski:88c ], [ Marr:76a ]). We describe a general software environment for implementing medium-level computer vision on large-grain-size MIMD computers. The purpose has been to implement a multiresolution strategy based on iconic data structures (two-dimensional arrays that can be indexed with the pixels' coordinates) distributed to the computing nodes using domain decomposition .
In particular, the environment has been applied successfully to the visible surface reconstruction and discontinuity detection problems. Initial constraints are transformed into a robust and explicit representation of the space around the viewer. In the shape from shading problem, the constraints are on the orientation of surface patches, while in the shape from motion problem (for example), the constraints are on the depth values.
We will describe a way to compute the motion ( optical flow ) from the intensity arrays of images taken at different times in Section 6.7 .
Discontinuities are necessary both to avoid mixing constraints pertaining to different physical objects during the reconstruction, and to provide a primitive perceptual organization of the visual input into different elements related to the human notion of objects.
The purpose of early vision is to undo the image formation process, recovering the properties of visible three-dimensional surfaces from the two-dimensional array of image intensities.
Computationally, this amounts to solving a very large system of equations. In general, the solution is not unique or does not exist (and therefore, one must settle for a suitable approximation).
The class of admissible solutions can be restricted by introducing a priori knowledge: the desired ``typical'' properties are enforced, transforming the inversion problem into the minimization of a functional . This is known as the regularization method [ Poggio:85a ]. Applying the calculus of variations, the stationary points are found by solving the Euler-Lagrange partial differential equations.
In standard methods for solving PDEs, the problem is first discretized on a finite-dimensional approximation space. The very large algebraic system obtained is then solved using, for example, ``relaxation'' algorithms which are local and iterative. The local structure is essential for the efficient use of parallel computation.
Because of the local nature of the relaxation process, solution errors on the scale of the solution grid step are corrected in a few iterations; larger scale errors, however, are corrected very slowly. Intuitively, in order to correct them, information must be spread over a large scale by the ``sluggish'' neighbor-neighbor influence. If we want a larger spread of influence per iteration, we need large-scale connections for the processing units; that is, we need to solve a simplified problem on a coarser grid.
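As a minimal sketch of such a local relaxation step (a generic Poisson-like model problem, not the specific early-vision functional used here), one Gauss-Seidel sweep in C looks like:

/* One Gauss-Seidel relaxation sweep for a Poisson-like problem
 * u_xx + u_yy = f on an n x n grid with spacing h.
 * Each interior point is replaced by the average of its four
 * neighbors minus the local source term: information travels about
 * one grid cell per sweep, which is why large-scale errors decay
 * so slowly on a fine grid. */
void relax(double *u, const double *f, int n, double h)
{
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            u[i*n + j] = 0.25 * (u[(i-1)*n + j] + u[(i+1)*n + j] +
                                 u[i*n + j - 1] + u[i*n + j + 1] -
                                 h * h * f[i*n + j]);
}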
The pyramidal structure of the multigrid solution grids is illustrated in Figure 6.22 .
Figure 6.22: Pyramidal Structure for Multigrid Algorithms and General Flow of Control
This simple idea and its realization in the multigrid algorithm not only leads to asymptotically optimal solution times (i.e., convergence in operations), but also dramatically decreases solution times for a variety of practical problems, as shown in [ Brandt:77a ].
The multigrid ``recipe'' is very simple. First, use relaxation to obtain an approximation with a smooth error on a fine grid. Then, exploiting the smoothness of the error, calculate corrections to this approximation on a coarser grid; to do this, first relax and then correct recursively on still coarser grids. Optionally, one can also use nested iteration (coarser grids provide a good starting point for finer grids) to speed up the initial part of the computation.
Historically, these ideas were developed starting from the 1960s by Bakhvalov, Fedorenko, and others (see Stüben, et al. [ Stuben:82a ]). The sequential multigrid algorithm has been used for solving PDEs associated with different early vision problems in [ Terzopoulos:86a ].
It is shown in [ Brandt:77a ] that, with a few modifications in the basic algorithms, the actual solution (not the error) can be stored in each layer ( full approximation storage algorithm ). This method is particularly useful for visual reconstruction where we are interested not only in the finest scale result, but also in the multiscale representation developed as a byproduct of the solution process.
Line processes [ Marroquin:84a ] are binary variables arranged in a two-dimensional array. An active line process between two neighboring pixels indicates that there is a physical discontinuity between them. Activation is, therefore, based on a measure of the difference in pixel properties, but must also take into account the presence of other LPs. The idea is that continuous, nonintersecting chains of LPs are preferred to discontinuous and intersecting ones, as shown in Figure 6.23 .
Figure 6.23: The Multiscale Interaction Favors the Formation of Continuous Chains of Line Processes. The sketch on the left shows the multiscale interaction of LPs which, together with the local interaction at the same scale, favors the formation of continuous chains (LPs caused by ``noise'' are filtered out at the coarse scales; the LPs caused by real discontinuities remain and act on the finer scales, see Figure 6.24). On the right, we show a favored (top) and a penalized (bottom) configuration. On the left, the coarsest scale is shown with increasing resolution in the two lower outlines of the hand.
We propose to combine the surface reconstruction and discontinuity detection phases in time and scale space . To do this, we introduce line processes at different scales, ``connect'' them to neighboring depth processes (henceforth DPs) at the same scale and to neighboring LPs on the finer and coarser scale. The reconstruction assigns equal priority to the two process types.
This scheme not only greatly improves convergence speed (the typical multigrid effect) but also produces a more consistent reconstruction of the piecewise smooth surface at the different scales.
Creation of discontinuities must be favored either by the presence of a ``large'' difference in the z values of the nearby DPs, or by the presence of a partial discontinuity structure that can be improved.
To measure the two effects in a quantitative way, it is useful to introduce two functions: cost and benefit . The benefit function for a vertical LP is , and analogously for a horizontal one. The idea is that the activation of one LP is beneficial when this quantity is large.
Cost is a function of the neighborhood configuration. A given LP updates its value in a manner depending on the values of nearby LPs. These LPs constitute the neighborhood, and we will refer to its members as the LPs connected to the original one. The neighborhood is shown in Figure 6.24 .
Figure 6.24: ``Connections'' Between Neighboring Line Processes, at the Same Scale and Between Different Scales
The updating rule for the LPs derived from the above requirements is:
Because Cost is a function of a limited number of binary variables, we used a look-up table approach to increase simulation speed and to provide a convenient way of simulating different heuristic proposals.
A specific parametrization for the values in the table is suggested in [ Battiti:90a ].
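As a minimal sketch (hypothetical names; the actual cost parametrization is that of [ Battiti:90a ]), the table-driven update can be written as:

/* Hypothetical sketch of a line-process update.
 * `benefit` measures the difference of the two neighboring depth
 * values; `neighborhood_code` packs the binary states of the
 * connected LPs (same scale plus coarser and finer scales) into an
 * index into a precomputed cost table. */
int update_line_process(double benefit, unsigned neighborhood_code,
                        const double *cost_table)
{
    return benefit > cost_table[neighborhood_code];   /* 1 = activate LP */
}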
The multigrid algorithm described in the previous section can be executed in many different ways on a parallel computer. One essential distinction that has to be made is related to the number of processors available and the ``size'' of a single processor.
The drawback of implementations on fine grain-size SIMD computers (where we assign one processor to each grid point) is that when iteration is on a coarse scale, all the nodes in the other scales (i.e., the majority of nodes) are idle, and the efficiency of computation is seriously compromised.
Furthermore, if the implementation is on a hypercube parallel computer and the mapping is such that all the communications paths in the pyramid are mapped into communication paths in the hypercube with length bounded by two [ Chan:86b ], a fraction of the nodes is never used because the total number of grid points is not equal to a power of two. This fraction is one third for two-dimensional problems encountered in vision.
Fortunately, if we use a MIMD computer with powerful processors, sufficient distributed memory, and two-dimensional internode connections (the hypercube contains a two-dimensional mesh), the above problems do not exist.
In this case, a two-dimensional domain decomposition can be used efficiently: A slice of the image with its associated pyramidal structure is assigned to each processor. All nodes are working all the time, switching between different levels of the pyramid as illustrated in Figure 6.25 .
Figure 6.25: Domain Decomposition for Multigrid Computation. Processor communication is on a two-dimensional grid; each processor operates at all levels of the pyramid.
No modification of the sequential algorithm is needed for points in the image belonging to the interior of the assigned domain. Conversely, points on the boundary need to know the values of points assigned to a nearby processor. For this purpose, the assigned domain is extended to contain points assigned to nearby processors, and a communication step before each iteration on a given layer updates this strip so that it contains the correct (most recent) values. Two exchanges are sufficient.
The recursive multiscale call mg(lay) is based on an alternation of relaxation steps and discontinuity detection steps, as follows (the software is written in C):
/* Recursive multigrid cycle on layer `lay': na pre-relaxation steps,
   nb coarse-grid correction cycles, nc post-relaxation steps. */
int mg(lay) int lay;
{
  int i;
  if(lay==coarsest) step(lay);
  else{
    i=na; while(i--) step(lay);        /* pre-smoothing */
    i=nb;
    if(i!=0){
      up(lay);                         /* transfer problem to the coarser grid */
      while(i--) mg(lay-1);            /* recursive coarse-grid cycles */
      down(lay-1);                     /* interpolate the correction back */
    }
    i=nc; while(i--) step(lay);        /* post-smoothing */
  }
}

/* One combined relaxation/discontinuity-detection step on layer `lay`. */
int step(lay) int lay;
{
  exchange_border_strip(lay);          /* update the boundary strip from neighbors */
  update_line_processes(lay);
  relax_depth_processes(lay);
}
Each step is preceded by an exchange of data on the border of the assigned domains.
Because the communication overhead is proportional to the linear dimension of the assigned image portion, the efficiency is high as soon as the number of pixels in this portion is large. Detailed results are in [ Battiti:91a ].
An iterative scheme for solving the shape from shading problem has been proposed in [ Horn:85a ]. A preliminary phase recovers information about orientation of the planes tangent to the surface at each point by minimizing a functional containing the image irradiance equation and an integrability constraint , as follows:
where , , I = measured intensity, and R = theoretical reflectance function.
After the tangent planes are available, the surface z is reconstructed, minimizing the following functional:
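As a sketch of the standard variational forms (our notation; the exact functionals of [ Horn:85a ] may differ in detail), the first phase minimizes an irradiance-plus-integrability functional over the gradient field (p, q), and the second fits the surface z to that gradient field:

\[
E_1(p,q) = \iint \big[\, I(x,y) - R(p,q) \,\big]^2 + \lambda \,(p_y - q_x)^2 \; dx\,dy ,
\qquad
E_2(z) = \iint (z_x - p)^2 + (z_y - q)^2 \; dx\,dy .
\]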
The Euler-Lagrange differential equations and their discretization are left as an exercise for the reader.
Figure 6.26 shows the reconstruction of the shape of a hemispherical surface starting from a ray-traced image . Left is the result of standard relaxation after 100 sweeps, right the ``minimal multigrid'' result (with computation time equivalent to 3 to 4 sweeps at the finest resolution).
Figure 6.26: Reconstruction of Shape From Shading: Standard Relaxation (top right) Versus Multigrid (bottom right). The original image is shown on the left.
This case is particularly hard for a standard relaxation approach. The image can be ``legally'' interpreted in two possible ways, as either a concave or a convex hemisphere. Starting from random initial values, after some relaxations, image patches typically ``vote'' for one or the other interpretation and try to extend the local interpretation to a global one. This is slow (given the local nature of the updating rule) and leads to an endless struggle in the regions that mark the border between different interpretations. The multigrid approach resolves this ``democratic impasse'' on the coarsest grids (much faster, because information spreads over large distances) and propagates the decision to the finer grids, which then concentrate their efforts on refining the initial approximation.
In Figure 6.27 , we show the reconstruction of the Mona Lisa face painted by Leonardo da Vinci.
Figure 6.27: Mona Lisa in Three Dimensions. The right figure shows the multigrid reconstruction.
For the surface reconstruction problem (see [ Terzopoulos:86a ]) the energy functional is:
A physical analogy is that of fitting the depth constraints with a membrane pulled by springs connected to them. The effect of active discontinuities is that of ``cutting the membrane'' in the proper places.
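A sketch of such a discrete membrane-plus-springs energy with line processes (our notation; see [ Terzopoulos:86a ] for the exact functional) is

\[
E(z, l) = \sum_i \alpha_i \,(z_i - d_i)^2
        + \lambda \sum_{\langle i,j \rangle} (z_i - z_j)^2 \,(1 - l_{ij})
        + \sum_{\langle i,j \rangle} \mathrm{Cost}(l_{ij}),
\]

where the \(d_i\) are the depth constraints (the springs), the \(z_i\) the membrane heights, and the \(l_{ij} \in \{0,1\}\) the line processes that ``cut'' the membrane between neighboring pixels i and j.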
Figure 6.28: Simulation Environment for Multigrid Surface Reconstruction from a Noisy Image. The top screen shows intermediate results and the bottom screen final results. For each screen, the upper part displays the activated discontinuities; the lower part, the gray-encoded z values of the surface.
Figure 6.29: The Original Surface (top) and the Surface Corrupted by 25% Noise (bottom)
Figure 6.30: The Reconstruction of a ``Randomville'' Scene Using the Multigrid Method. Each figure shows a different resolution.
Figures 6.28 through 6.30 show the simulation environment on the SUN workstation and the reconstruction of a ``Randomville'' image (random quadrangular blocks placed in the image plane). The original surface and the surface corrupted by 25% noise are shown in Figure 6.29, while the reconstruction at different scales is shown in Figure 6.30.
For ``images'' and 25% noise, a faithful reconstruction of the surface (within a few percent of the original one) is obtained after a single multiscale sweep (with V cycles) on four layers. The total computational time corresponds approximately to the time required by three relaxations on the finest grid. Because of the optimality of multiscale methods, time increases linearly with the number of image pixels.
The parallel simulation environment was written by Roberto Battiti [ Battiti:90a ]. Geoffrey Fox, Christof Koch, and Wojtek Furmanski contributed with many ideas and suggestions [ Furmanski:88c ].
A JPL group [ Synnott:90a ] also used the Mark III hypercube to find three-dimensional properties of planetary objects from the two-dimensional images returned from NASA's planetary missions and from the Hubble Space Telescope. The hypercube was used in a simple parallel mode with each node assigned calculations for a subset of the image pixels, with no internodal communication required. The estimation uses an iterative linear least-squares approach, where the data are the pixel brightness values in the images, and partial derivatives of theoretical models of these brightness values are computed for use in a square-root formulation of the normal equations. The underlying three-dimensional model of the object consists of a triaxial ellipsoid overlaid with a spherical harmonic expansion to describe low- to mid-spatial-frequency topographic or surface composition variations. The initial results were not followed through into production use for JPL missions, but this is likely to become an important application of parallel computing to image processing for planetary missions.
Much of the current interest in neural networks can be traced to the introduction a few years ago of effective learning algorithms for these systems ([ Denker:86a ], [ Parker:82a ], [ Rumelhart:86a ]). In [ Rumelhart:86a ] Chapter 8, it was shown that for some problems using multi-layer perceptrons (MLP), back-propagation was capable of finding a solution very reliably and quickly. Back-propagation has been applied to a number of realistic and complex problems [ Sejnowski:87a ], [ Denker:87a ]. The work of this section is described in [ Felten:90a ].
Real-world problems are inherently structured, so methods incorporating this structure will be more effective than techniques applicable to the general case. In practice, it is very important to use whatever knowledge one has about the form of possible solutions in order to restrict the search space. For multilayer perceptrons, this translates into constraining the weights or modifying the learning algorithm so as to embody the topology, geometry, and symmetries of the problem.
Here, we are interested in determining how automatic learning can be improved by following the above suggestion of restricting the search space of the weights. To avoid high-level cognition requirements, we consider the problem of classifying hand-printed upper-case Roman characters. This is a specific pattern-recognition problem, and has been addressed by methods other than neural networks. Generally the recognition is separated into two tasks: the first one is a pre-processing of the image using translation, dilation, rotations, and so on, to bring it to a standardized form; in the second, this preprocessed image is compared to a set of templates and a probability is assigned to each character or each category of the classification. If all but one of the probabilities are close to zero, one has a high confidence level in the identification. This second task is the more difficult one, and the performance achieved depends on the quality of the matching algorithm. Our focus is to study how well an MLP can learn a satisfactory matching to templates, a task one believes the network should be good at.
In regard to the task of preprocessing, MLPs have been shown capable [ Rumelhart:86a ] Chapter 8 of performing translations at least in part, but it is simpler to implement this first step using standard methods. This combination of traditional methods and neural network matching can give us the best of both worlds. In what follows, we suggest and test a learning procedure which preserves the geometry of the two-dimensional image from one length scale transformation to the next, and embodies the difference between coarse and fine scale features.
There are many architectures for neural networks; we shall work with Multi-Layer Perceptrons. These are feed-forward networks, and the network to be used in our problem is shown schematically in Figure 6.31 . There are two processing layers: output and hidden. Each one has a number of identical units (or ``neurons''), connected in a feed-forward fashion by wires, often called weights because each one is assigned a real number. The input to any given unit is the weighted sum over its incoming wires of the weight times the signal (or current) on that wire. For the hidden layer, the signal on a wire is the value of a bit of the input image; for the output layer, it is the output from a unit of the hidden layer.
Figure 6.31: A Multi-Layer Perceptron
Generally, the output of a unit is a nonlinear, monotonic-increasing function of the input. We make the usual choice and take
to be our neuron input/output function. is the threshold and can be different for each neuron. The weights and thresholds are usually the only quantities which change during the learning period. We wish to have a network perform a mapping M from the input space to the output space. Introducing the actual output for an input I , one first chooses a metric for the output space, and then seeks to minimize , where d is a measure of the distance between the two points. This quantity is also called the error function, the energy, or (the negative of) the harmony function. Naturally, depends on the 's. One can then apply standard minimization searches like simulated annealing [ Kirkpatrick:83a ] to attempt to change the 's so as to reduce the error. The most commonly used method is gradient descent, which for MLP is called back-propagation because the calculation of the gradients is performed in a feed-backwards fashion. Improved descent methods may be found in [ Dahl:87a ], [ Parker:87a ] and in Section 9.9 of this book.
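As a sketch in standard notation (the symbols here are ours), the usual choice of input/output function and the gradient-descent weight update are

\[
g(h) = \frac{1}{1 + e^{-(h - \theta)}}, \qquad h = \sum_i w_i x_i, \qquad
\Delta w_i = -\eta \,\frac{\partial E}{\partial w_i},
\]

where \(\theta\) is the neuron threshold, \(\eta\) the learning rate, and E the error function; back-propagation is simply an efficient, layer-by-layer evaluation of these gradients.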
The minimization often runs into difficulties because one is searching in a very high-dimensional space, and the minima may be narrow. In addition, the straightforward implementation of back-propagation will often fail because of the many minima in the energy landscape. This process of minimization is referred to as learning or memorization as the network tries to match the mapping M . In many problems, though, the input space is so huge that it is neither conceivable nor desirable to present all possible inputs to the network for it to memorize. Given part of the mapping M , the network is expected to guess the rest: This is called generalization. As shown clearly in [ Denker:87a ] for the case of a discrete input space, generalization is often an ill-posed problem: Many generalizations of M are possible. To achieve the kind of generalization humans want, it is necessary to tell the network about the mapping one has in mind. This is most simply done by constraining the weights to have certain symmetries as in [ Denker:87a ]. Our approach will be similar, except that the ``outside'' information will play an even more central role during the learning process.
To do character recognition using an MLP, we assume the input layer of the network to be a set of image pixels, which can take on analogue (or grey scale) values between 0 and 1. The two-dimensional set of pixels is mapped onto the set of input neurons in a fairly arbitrary way: For an image, the top row of N pixels is associated with the first N neurons, the next row of N pixels is associated with the next N neurons, and so forth. At the start of the training process, the network has no knowledge of the underlying two-dimensional structure of the problem (that is, if a pixel is on, nearby pixels in the two-dimensional space are also likely to be on). The network discovers the two-dimensional nature of the problem during the learning process.
We taught our networks the alphabet of 26 upper-case Roman characters. To encourage generalization, we show the net many different hand-drawn versions of each character. The 320-image training set is shown in Figure 6.32 . These images were hand-drawn using a mouse attached to a SUN workstation. The output is encoded in a very sparse way. There are only 26 outputs we want the net to give, so we use 26 output neurons and map the output pattern: first neuron on, rest off, to the character ``A;'' second neuron on, rest off, to ``B;'' and so on. Such an encoding scheme works well here, but is clearly unworkable for mappings with large output sets such as Chinese characters or Kanji. In such cases, one would prefer a more compact output encoding, with possibly an additional layer of hidden units to produce the more complex outputs.
Figure 6.32: The Training Set of 320 Handwritten Characters, Digitized on a Grid
As mentioned earlier, we do not feed images directly into the network. Instead, simple, automatic preprocessing is done which dilates the image to a standard size and then translates it to the center of the pixel space. This greatly enhances the performance of the system: it means that one can draw a character in the upper left-hand corner of the pixel space and the system easily recognizes it. If we did not have the preprocessing, the network would be forced to solve the much larger problem of recognizing characters of all possible sizes and locations in the pixel space. Two other worthwhile preprocessors are rotation (rotate to a standard orientation) and intensity normalization (set linewidths to some standard value). We do not have these in our current implementation.
The MLP is used only for the part of the algorithm where one matches to templates. Given any fixed set of exemplars, a neural network will usually learn this set perfectly, but the performance under generalization can be very poor. In fact, the more weights there are, the faster the learning (in the sense of number of iterations, not of CPU time), and the worse the ability to generalize. This was in part realized in [ Gullichsen:87a ], where the input grid was . If one has a very fine mesh at the input level, so that a great amount of detail can be seen in the image, one runs the risk of having terrible generalization properties because the network will tend to focus upon tiny features of the image, ones which humans would consider irrelevant.
We will show one approach to overcoming this problem. We desire the potential power of the large, high-resolution net, but with the stable generalization properties of small, coarse nets. Though not so important for upper-case Roman characters, where a rather coarse grid does well enough (as we will see), a fine mesh is necessary for other problems such as recognition of Kanji characters or handwriting. A possible ``fix,'' similar to what was done for the problem of clump counting [ Denker:87a ], is to hard wire the first layer of weights to be local in space, with a neighborhood growing with the mesh fineness. This reduces the number of weights, thus postponing the deterioration of the generalization. However, for an MLP with a single hidden layer, this approach will prevent the detection of many nonlocal correlations in the images, and in effect this fix is like removing the first layer of weights.
We would like to train large, high-resolution nets. If one tries to do this directly, by simply starting with a very large network and training by the usual back-propagation methods, not only is the training slow (because of the large size of the network), but the generalization properties of such nets are poor. As described above, a large net with many weights from the input layer to the hidden layer tends to ``grandmother'' the problem, leading to poor generalization.
The hidden units of an MLP form a set of feature extractors. Considering a complex pattern such as a Chinese character, it seems clear that some of the relevant features which distinguish it are large, long-range objects requiring little detail while other features are fine scale and require high resolution. Some sort of multiscale decomposition of the problem therefore suggests itself. The method we will present below builds in long-range feature extractors by training on small networks and then uses these as an intelligent starting point on larger, higher resolution networks. The method is somewhat analogous to the multigrid technique for solving partial differential equations.
Let us now present our multiscale training algorithm. We begin with the training set, such as the one shown in Figure 6.32 , defined at the high resolution. Each exemplar is coarsened by a factor of two in each direction using a simple grey-scale averaging procedure: 2 x 2 blocks of pixels in which all four pixels were ``on'' map to an ``on'' pixel, those in which three of the four were ``on'' map to a ``3/4 on'' pixel, and so on. The result is that each exemplar is mapped to a coarser exemplar in such a way as to preserve the large-scale features of the pattern. The procedure is then repeated until a suitably coarse representation of the exemplars is reached. In our case, we stopped at the coarse resolution used for the initial training described below.
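A minimal sketch of this 2 x 2 grey-scale averaging (hypothetical code, assuming row-major images with pixel values in [0, 1]):

/* Coarsen an n x n grey-scale image (n even) to (n/2) x (n/2)
 * by averaging each 2 x 2 block of pixels, preserving the
 * large-scale features of the pattern. */
void coarsen(const double *fine, double *coarse, int n)
{
    int m = n / 2;
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            coarse[i*m + j] = 0.25 * (fine[(2*i)*n   + 2*j] + fine[(2*i)*n   + 2*j + 1] +
                                      fine[(2*i+1)*n + 2*j] + fine[(2*i+1)*n + 2*j + 1]);
}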
At this point, an MLP is trained to solve the coarse mapping problem by one's favorite method (back-propagation, simulated annealing, and so on). In our case, we set up an MLP with 64 inputs (corresponding to the coarsest grid), 32 hidden units, and 26 output units. This was then trained on the set of 320 coarsened exemplars using simple back-propagation with a momentum term [ Rumelhart:86a ], Chapter 8. Satisfactory convergence was achieved after approximately 50 cycles through the training set.
We now wish to boost back to a high-resolution MLP, using the results of the coarse net. We use a simple interpolating procedure which works well. We leave the number of hidden units unchanged. Each weight from the input layer to the hidden layer is split, or ``un-averaged,'' into four weights (each now attached to its own pixel), each 1/4 the size of the original. The thresholds are left untouched during this boosting phase. This procedure gives a higher resolution MLP with an intelligent starting point for additional training at the finer scale. In fact, before any training at all is done with the boosted MLP, it recalls the exemplars quite well. This is a measure of how much information was lost in the coarsening. The boost-and-train process is repeated to reach the desired full-resolution MLP. The entire multiscale training process is illustrated in Figure 6.33 .
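A sketch of the corresponding ``un-averaging'' of the input-to-hidden weights (hypothetical code; the thresholds are left untouched, and the learning rate is scaled down accordingly, as noted later):

/* Boost the input-to-hidden weight matrix from an m x m input grid
 * to a 2m x 2m grid: each weight is split into four weights of one
 * quarter the size, attached to the four pixels of the corresponding
 * 2 x 2 block.  The nhidden hidden units are unchanged. */
void boost_weights(const double *w_coarse, double *w_fine, int m, int nhidden)
{
    int n = 2 * m;
    for (int k = 0; k < nhidden; k++)
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) {
                double w4 = 0.25 * w_coarse[k*m*m + i*m + j];
                w_fine[k*n*n + (2*i)*n   + 2*j]     = w4;
                w_fine[k*n*n + (2*i)*n   + 2*j + 1] = w4;
                w_fine[k*n*n + (2*i+1)*n + 2*j]     = w4;
                w_fine[k*n*n + (2*i+1)*n + 2*j + 1] = w4;
            }
}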
Figure 6.33: An Example Flowchart for the Multiscale Training Procedure. This was the procedure used in this text, but the averaging and boosting can be continued through an indefinite number of stages.
Here we give some details of our results and compare with the standard approach. As mentioned in the previous section, an MLP (1024 inputs, 32 hidden units, 26 output units) was trained on the set of Figure 6.32 using the multiscale method. Outputs are never exactly 0 or 1, so we defined a ``successful'' recognition to occur when the output value of the desired letter was greater than 0.9 and all other outputs were less than 0.1. The training on the coarsest grid used back-propagation with a momentum term and went through the exemplars sequentially. The weights are changed to reduce the error function for the current character. The result is that the system does not reach an absolute minimum; rather, at long times the weight values oscillate with a period equal to the time of one sweep through all the exemplars. This is not a serious problem, as the oscillations are very small in practice. Figure 6.34 shows the training curve for this problem. The first part of the curve is the training of the coarse network; even though the grid is a bit coarse, almost all of the characters can be memorized. Proceeding to the next grid by scaling the mesh size by a factor of two and using the corresponding exemplars, we obtained the second part of the learning curve in Figure 6.34. The net got 315/320 correct. After 12 additional sweeps on this net, a perfect score of 320/320 was achieved. The third part of Figure 6.34 shows the result of the final boost to the full resolution. In just two cycles on the full net, a perfect score of 320/320 was achieved and the training was stopped. It is useful to compare these results with a direct use of back-propagation on the full-resolution mesh without the multiscale procedure. Figure 6.35 shows the corresponding learning curve, with the result from Figure 6.34 drawn in for comparison. Learning via the multiscale method takes much less computer time. In addition, the internal structure of the resultant network is much different, and we now turn to this question.
How do these two networks compare for the real task of recognizing exemplars not belonging to the training set? We used as a generalization test set 156 more handwritten characters. Though there are no ambiguities for humans in this test set, the networks did make mistakes. The network from the direct method made errors 14% of the time, and the multiscale network made errors 9% of the time. We feel the improved performance of the multiscale net is due to the difference in quality of the feature extractors in the two cases. In a two-layer MLP, we can think of each hidden-layer neuron as a feature extractor which looks for a certain characteristic shape in the input; the function of the output layer is then to perform the higher level operation of classifying the input based on which features it contains. By looking at the weights connecting a hidden-layer neuron to the inputs, we can determine what feature that neuron is looking for.
Figure 6.34: The Learning Curve for Our Multiscale Training Procedure Applied to 320 Handwritten Characters. The first part of the curve is the training on the coarsest net, the second on the intermediate net, and the last on the full-resolution net. The curve is plotted as a function of CPU time rather than sweeps through the presentation set, in order to exhibit the speed of training on the smaller networks.
Figure 6.35: A Comparison of Multiscale Training with the Usual, Direct Back-propagation Procedure. The curve labelled ``Multiscale'' is the same as in Figure 6.34, only rescaled by a factor of two. The curve labelled ``Brute Force'' is from directly training a full-resolution network, from a random start, on the learning set. The direct approach does not quite learn all of the exemplars, and takes much more CPU time.
For example, Figure 6.36 shows the input weights of two neurons in the net. The neuron of (a) seems to be looking for a stroke extending downward and to the right from the center of the input field. This is a feature common to letters like A, K, R, and X. The feature extractor of (b) seems to be a ``NOT S'' recognizer and, among other things, discriminates between ``S'' and ``Z''.
Figure 6.36: Two Feature Extractors for the Trained Net. This figure shows the connection weights between one hidden-layer neuron and all the input-layer neurons. Black boxes depict positive weights, white boxes negative weights; the size of the box shows the magnitude. The position of each weight in the grid corresponds to the position of the input pixel. We can view these pictures as maps of the features each hidden-layer neuron is looking for. In (a), the neuron is looking for a stroke extending down and to the right from the center of the input field; this neuron fires upon input of the letter ``A,'' for example. In (b), the neuron is looking for something in the lower center of the picture, but it also has a strong ``NOT S'' component. Among other things, this neuron discriminates between an ``S'' and a ``Z''. The outputs of several such feature extractors are combined by the output layer to classify the original input.
Figure 6.37: The Same Feature Extractor as in Figure 6.36(b), after the Boost to the Finer Grid. There is an obvious correspondence between each connection in Figure 6.36(b) and clumps of connections here. This is due to the multiscale procedure, and leads to spatially smooth feature extractors.
Even at the coarsest scale, the feature extractors usually look for blobs rather than correlating a scattered pattern of pixels. This is encouraging since it matches the behavior we would expect from a ``good'' character recognizer. The multiscale process accentuates this locality, since a single pixel grows to a local clump of four pixels at each rescaling. This effect can be seen in Figure 6.37 , which shows the feature extractor of Figure 6.36 (b) after scaling to and further training. Four-pixel clumps are quite obvious in the network. The feature extractors obtained by direct training on large nets are much more scattered (less smooth) in nature.
Before closing, we would like to make some additional comments on the multiscale method and suggest some possible extensions.
In a pattern-recognition problem such as character recognition, the two-dimensional spatial structure of the problem is important. The multiscale method preserves this structure so that ``reasonable'' feature extractors are produced. An obvious extension to the present work is to increase the number of hidden units as one boosts the MLP to higher resolution. This corresponds to adding completely new feature extractors. We did not do this in the present case since 32 hidden units were sufficient: the problem of recognizing upper-case Roman characters is too easy. For a more challenging problem such as Chinese characters, adding hidden units will probably be necessary. We should mention that incrementally adding hidden units is easy to do and works well; we have used it to achieve perfect convergence of a back-propagation network for the problem of tic-tac-toe.
When boosting, the weights are scaled down by a factor of four and so it is important to also scale down the learning rate (in the back-propagation algorithm) by a factor of four.
We defined our ``blocking,'' or coarsening, procedure to be a simple, grey scale averaging of blocks. There are many other possibilities, well known in the field of real-space renormalization in physics. Other interesting blocking procedures include: using a scale factor, , different from two; using majority rule averaging; simple decimation; and so on.
Multiscale methods work well in cases where spatial locality or smoothness is relevant (otherwise, the interpolation approximation is bad). Another way of thinking about this is that we are decomposing the problem onto a set of spatially local basis functions such as gaussians. In other problems, a different set of basis functions may be more appropriate and hence give better performance.
The multiscale method uses results from a small net to help in the training of a large network. The different-sized networks are related by the rescaling or dilation operator. A variant of this general approach would be to use the translation operator to produce a pattern matcher for the game of Go. The idea is that at least some of the complexity of Go is concerned with local strategies. Instead of training an MLP to learn this on the full board of Go, do the training on a ``mini-Go'' board of or . The appropriate way to relate these networks to the full-sized one is not through dilations, but via the translations: The same local strategies are valid everywhere on the board.
Steve Otto had the original idea for the MultiScale training technique. Otto and Ed Felten and Olivier Martin developed the method. Jim Hutchinson contributed by supplying the original back-propagation program.
When moving objects in a scene are projected onto an image plane (for example, onto our retina), the real velocity field is transformed into a two-dimensional field, known as the motion field.
By taking more images at different times and calculating the motion field, we can extract useful parameters like the time to collision, which is valuable for obstacle avoidance. If we know the motion of the camera (or our own ego-motion), we can reconstruct the entire three-dimensional structure of the environment (if the camera translates, near objects will have a larger motion field than distant ones). The depth measurements can be used as starting constraints for a surface reconstruction algorithm like the one described in Section 9.9 .
In particular situations, the apparent motion of the brightness pattern, known as the optical flow, provides a sufficiently accurate estimate of the motion field. Although the adaptive scheme that we propose is applicable to different methods, the discussion will be based on the scheme proposed by Horn and Schunck [ Horn:81a ]. They use the assumptions that the image brightness of a given point remains constant over time and that the optical flow varies smoothly almost everywhere. Satisfaction of these two constraints is formulated as the problem of minimizing a quadratic energy functional (see also [ Poggio:85a ]). The appropriate Euler-Lagrange equations are then discretized on a single or multiple grid and solved using, for example, the Gauss-Seidel relaxation method ([ Horn:81a ], [ Terzopoulos:86a ]). The resulting system of equations (two for every pixel in the image) is:
where and are the optical flow variables to be determined, , , are the partial derivatives of the image brightness with respect to space and time, and are local averages, is the spatial discretization step, and controls the smoothness of the estimated optical flow.
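As a sketch in the standard Horn-Schunck notation (which may differ in detail from the discretization actually used here), the coupled update equations are

\[
u = \bar{u} - \frac{I_x \left( I_x \bar{u} + I_y \bar{v} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2},
\qquad
v = \bar{v} - \frac{I_y \left( I_x \bar{u} + I_y \bar{v} + I_t \right)}{\alpha^2 + I_x^2 + I_y^2},
\]

where (u, v) is the optical flow, \((\bar{u}, \bar{v})\) its local averages, \(I_x, I_y, I_t\) the brightness derivatives, and \(\alpha\) the smoothness parameter.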
Now, we need to estimate the partial derivatives in the above equations with discretized formulas starting from brightness values that are quantized (say integers from 0 to n ) and noisy. Given these derivative estimation problems, the optimal step for the discretization grid depends on local properties of the image. Use of a single discretization step produces large errors on some images. Use of a homogeneous multiscale approach, where a set of grids at different resolutions is used, may in some cases produce a good estimation on an intermediate grid and a bad one on the final and finest grid. Enkelmann and Glazer [ Enkelmann:88a ], [ Glazer:84a ] encountered similar problems.
These difficulties can be illustrated with the following one-dimensional example. Let's suppose that the intensity pattern observed is a superposition of two sinusoids of different wavelengths:
where R is the ratio of short to long wavelength components. Using the brightness constancy assumption ( or , see [ Horn:81a ]) the measured velocity is given by:
where and are the three-point approximations of the spatial and temporal brightness derivatives.
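As a sketch (assuming, for concreteness, a unit-amplitude long-wavelength component, a short-wavelength wavenumber k, and relative amplitude R), the pattern and the one-dimensional brightness-constancy estimate can be written

\[
I(x,t) = \sin\big(x - vt\big) + R \,\sin\big(k\,(x - vt)\big),
\qquad
\hat{v} = -\,\frac{\Delta_t I}{\Delta_x I},
\]

where \(\Delta_x\) and \(\Delta_t\) denote the three-point finite-difference approximations of the spatial and temporal derivatives; when the grid spacing is comparable to the short wavelength, \(\Delta_x I\) misrepresents the true derivative and the estimate can even change sign.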
Now, if we calculate the estimated velocity on two different grids, with spatial step equal to one and two, as a function of the parameter, R , we obtain the result illustrated in Figure 6.38 .
Figure 6.38: Measured Velocity for a Superposition of Sinusoidal Patterns as a Function of the Ratio of Short- to Long-Wavelength Components. The dashed and continuous lines correspond to the two grid spacings.
While the correct velocity is obtained on the coarser grid (in this case), on the finer one the measured velocity depends on the value of R. In particular, if R exceeds a certain value, we obtain a velocity in the opposite direction!
We propose a method for ``tuning'' the discretization grid to a measure of the reliability of the optical flow derived at a given scale. This measure is based on a local estimate of the errors due to noise and discretization, and is described in [Battiti:89g;91b].
First, a Gaussian pyramid [ Burt:84a ] is computed from the given images. This consists of a hierarchy of images obtained by filtering the original ones with Gaussian filters of progressively larger size.
Then, the optical flow field is computed at the coarsest scale using relaxation, and the estimated error is calculated for every pixel. If this quantity is less than a given threshold, the current value of the flow is interpolated to the finer resolutions without further processing. This is done by setting an inhibition flag contained in the grid points of the pyramidal structure, so that these points do not participate in the relaxation process. If, on the other hand, the error is larger than the threshold, the approximation is relaxed on a finer scale and the entire process is repeated until the finest scale is reached.
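A minimal sketch of the per-pixel decision at one pyramid level (hypothetical names; the error estimator itself is the one described in [Battiti:89g;91b]):

/* Hypothetical sketch: mark the grid points of one pyramid level.
 * err[] holds the local error estimate for each of the n pixels and
 * eps is the threshold; inhibited[] is the flag consulted at the
 * finer levels, where frozen points are only interpolated and do not
 * participate in the relaxation. */
void freeze_reliable_points(const double *err, int *inhibited, int n, double eps)
{
    for (int i = 0; i < n; i++)
        inhibited[i] = (err[i] < eps);   /* 1 = frozen: interpolate only */
}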
Figure 6.39: Adaptive Grid (shown on left) in the Multiresolution Pyramid; (middle) Gray Code Mapping Strategy; (right) Domain Decomposition Mapping Strategy. In the middle and right pictures, the activity pattern for three resolutions is shown at the top, for a simple one-dimensional case.
In this way, we obtain a local inhomogeneous approach where areas of the images, characterized by different spatial frequencies or by different motion amplitudes, are processed at the appropriate resolutions, avoiding corruption of good estimates by inconsistent information from a different scale (the effect shown in the previous example). The optimal grid structure for a given image is translated into a pattern of active and inhibited grid points in the pyramid, as illustrated in Figure 6.39 .
Figure 6.40: Efficiency and Solution Times
The motivation for freezing the motion field as soon as the error is below threshold is that the estimation of the error may itself become incorrect at finer scales and, therefore, useless in the decision process. It is important to point out that single-scale or homogeneous approaches cannot adequately solve the above problem. Intuitively, what happens in the adaptive multiscale approach is that the velocity is frozen as soon as the spatial and temporal differences at a given scale are big enough to avoid quantization errors, but small enough to avoid errors in the use of discretized formulas. The only assumption made in this scheme is that the largest motion in the scene can be reliably computed at one of the available resolutions. If the images contain motion discontinuities, line processes (indicating the presence of these discontinuities) are necessary to prevent smoothing where it is not desired (see [ Battiti:90a ] and the references contained therein).
Figure 6.41: Plaid Image (top); The Error in the Calculation of Optical Flow for both the Homogeneous (upper line) and Adaptive (lower line) Algorithms. The error is plotted as a function of computation time.
Figure 6.42: Reconstructed Optical Flow for the Translating ``Plaid'' Pattern of Figure 6.41. Homogeneous multiscale strategy (top), adaptive multiscale strategy (middle), and active (black) and inhibited (white) points.
Figure 6.43: Test Images and Motion Fields for a Natural (pine-cone) Image at Three Resolutions (top); Estimated versus Actual Velocity Plotted for Three Choices of Resolution (bottom). The dotted line indicates a ``perfect'' prediction.
Large grain-size multicomputers, with a mapping based on domain decomposition and limited coarsening, have been used to implement the adaptive algorithm, as described in Section 6.5 . The efficiency and solution times for an implementation with transputers (details in [ Battiti:91a ]) are shown in Figure 6.40 .
Real-time computation with high efficiency is within the reach of available digital technology!
On a board with four transputers, and using the Express communication routines from ParaSoft, the solution time for images is on the order of one second.
The software implementation is based on the multiscale vision environment developed by Roberto Battiti and described in Section 9.9 . Christof Koch and Edoardo Amaldi collaborated on the project.
Results of the algorithm show that the adaptive method is capable of effectively reducing the solution error. In Figures 6.41 through 6.43, we show two test images (a ``plaid'' pattern and a natural scene), together with the optical flow obtained.
For the ``plaid'' image, we show the r.m.s. error obtained with the adaptive (lower-line) and homogeneous (upper-line) scheme and the resulting fields.
For the natural image, we show in Figure 6.42 the average computed velocity (in a region centered on the pine cone) as a function of the correct velocity, for different numbers of layers. Increasing the number of resolution grids increases the range of velocities that are detected correctly by the algorithm. The pine cone is moving upward at a rate of 1.6 pixels per frame. The multiscale algorithm is always better than the single-level algorithm, especially at larger velocities.
The collective stereopsis algorithm described in [ Marr:76a ] was historically one of the first ``cooperative'' algorithms based on relaxation proposed for early vision.
The goal in stereopsis is to measure the difference in retinal position ( disparity ) of features of a scene observed with two eyes (or video cameras). This is achieved by placing a fiber of ``neurons'' (one for each disparity value) at each pixel position. Each neuron inhibits neurons of different disparities at the same location (because the disparity is unique) and excites neurons of the same disparity at nearby locations (because disparity tends to vary smoothly). After convergence, the activation pattern corresponds to the disparity field defined above.
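A sketch of one such cooperative update (hypothetical names and constants, in the spirit of the scheme of [ Marr:76a ]): excitation from same-disparity neighbors, inhibition from the other disparities at the same pixel, thresholded together with the initial raw match.

/* One update of a binary "neuron" at a given pixel and disparity.
 * excite_sum  counts the active same-disparity neighbors,
 * inhibit_sum counts the active neurons at other disparities for the
 * same pixel, and init is the initial (raw) match at this pixel.
 * E, I, and T are hypothetical weighting and threshold constants. */
int update_neuron(int excite_sum, int inhibit_sum, int init,
                  double E, double I, double T)
{
    double s = E * excite_sum - I * inhibit_sum + init;
    return s > T;                      /* 1 = neuron active at this disparity */
}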
Figure 6.44: Collective Stereopsis: (top left) definition of the geometry of stereoscopic vision; (bottom left) neural network activity (top three layers, disparity d=0, 1, 2) corresponding to the real-world structure illustrated; (right) results of iterations for the d=0 and d=2 layers of neurons. d measures the disparity value for pixels.
The parallel implementation is based on a straightforward domain decomposition and the results are illustrated in Figure 6.44 . They show the initial state of disparity computation and the evolution in time of the different layers of disparity neurons. Details are described in [ Battiti:88a ].
In Chapters 4 and 6, we studied the synchronous problem class, where the uniformity of the computation, that is, of the temporal structure, made the parallel implementation relatively straightforward. This chapter contains examples of the other major problem class, where a simple spatial structure leads to clear parallelization. We define the embarrassingly parallel class as the class of problems for which the computational graph is disconnected. This spatial structure allows a simple parallelization, as no (temporal) synchronization is involved. In Chapters 4 and 6, on the other hand, there was often substantial synchronization and associated communication; however, the uniformity of the synchronization allowed a clear parallelization strategy. One important feature of embarrassingly parallel problems is their modest node-to-node communication requirements: the definition of no spatial connection implies, in fact, no internode communication, but a practical problem would involve some amount of communication, if only to set up the problem and accumulate results. The low communication requirements of embarrassingly parallel problems make them particularly suitable for a distributed computing implementation on a network of workstations; even the low bandwidth of an Ethernet is often sufficient. Indeed, we used such a network of Sun workstations to support some of the simulations described in Section 7.2.
The caricature of this problem class, shown in Figure 7.1 , uses a database problem as an example. This is illustrated in practice by the DOWQUEST program where a CM-2 supports searching of financial data contained in articles that are partitioned equally over the nodes of this SIMD machine [Waltz:87a;88a,90a].
Figure 7.1: Embarrassingly Parallel Problem Class
This problem class can have either a synchronous or an asynchronous temporal structure. We have illustrated the former above, while the analysis of a large (high energy) physics data set exhibits an asynchronous temporal structure. Such experiments can record very large numbers of separate events, which can be analyzed independently. However, each event is usually quite different and would require both distinct instruction streams and very different execution times. This was realized early in the high energy physics community, and so-called farms (initially of special-purpose machines and now of commercial workstations) have been used extensively for production data analysis [ Gaines:87a ], [ Hey:88a ], [ Kunz:81a ].
The applications in Sections 7.2 and 7.6 obtain their embarrassingly parallel structure from running several full simulations, each with independent data. Each simulation could also be decomposed spatially; this spatial parallelization has since been pursued and is described for the neural network simulator in Section 7.6. Some of Chiu's random block lattice calculations also used an embarrassingly parallel approach, with 1024 separate lattices being calculated on the 1024-node nCUBE-1 at Sandia [ Chiu:90a ], [Fox:89i;89n]. This would not, of course, be possible for the QCD of Section 4.3, where each node could not hold an interesting size lattice. The spatial parallelism in the examples of Sections 7.2 and 7.6 is nontrivial to implement, as the irregularities make these problems loosely synchronous. This relatively difficult domain parallelism made it attractive to first exploit the independent parallelism coming from simulations with different parameters.
It is interesting that Sections 6.3 and 7.3 both address simulations of spin systems relevant to high-temperature superconductivity; depending on the algorithm used, one can get very different problem architectures (either synchronous or embarrassingly parallel in this case) for a given application.
The embarrassingly parallel gravitational lens application of Section 7.4 was frustrating for the developers as it needed software support not available at the time on the Mark III. Suitable software (MOOSE in Section 15.2 ) had been developed on the Mark II to support graphics ray tracing as briefly discussed in Section 14.1 . Thus, the calculation is embarrassingly parallel, but a distributed database is essentially needed to support the calculation of each ray. This was not available in CrOS III at the time of the calculations described in Section 7.4 .
In this section, we describe some large scale parallel simulations of dynamically triangulated random surfaces [ Baillie:90c ], [ Baillie:90d ], [ Baillie:90e ], [ Baillie:90j ], [ Baillie:91c ], [ Bowick:93a ]. Dynamically triangulated random surfaces have been suggested as a possible discretization for string theory in high energy physics and fluid surfaces or membranes in biology [ Lipowski:91a ]. As physicists, we shall focus on the former.
String theories describe the interaction of one-dimensional string-like objects in a fashion analogous to the way particle theories describe the interaction of zero-dimensional point-like particles. String theory has its genesis in the dual models that were put forward in the 1960s to describe the behavior of the hadronic spectrum then being observed. The dual model amplitudes could be derived from the quantum theory of a string-like object [ Nambu:70a ], [ Nielsen:70a ], [ Susskind:70a ]. It was later discovered that these so-called bosonic strings could apparently only live in 26 dimensions [ Lovelace:68a ] if they were to be consistent quantum-mechanically. They also had tachyonic (negative mass-squared) ground states, which is normally the sign of an instability. Later, fermionic degrees of freedom were added to the theory, yielding the supersymmetric Neveu-Schwarz-Ramond [ Neveu:71a ] (NSR) string. This has a critical dimension of 10, rather than 26, but still suffers from the tachyonic ground state. Around 1973, it became clear that QCD provided a plausible candidate for a model of the hadronic spectrum, and the interest in string models of hadronic interactions waned. However, about this time it was also postulated by numerous groups that strings [ Scherk:74a ] might provide a model for gravity, because they contain higher-spin excitations in a natural manner. A further piece of the puzzle fell into place in 1977, when a way was found to remove the tachyon from the NSR string [ Gliozzi:77a ]. The present explosion of work on string theory began with the work of Green and Schwarz [ Green:84a ], who found that only a small number of string theories could be made tachyon-free in 10 dimensions, and predicted the occurrence of one such theory that had not yet been constructed. This appeared soon after in the form of the heterotic string [ Gross:85a ], which is a sort of composite of the bosonic and supersymmetric models.
After these discoveries, the physics community leaped on string models as a way of constructing a unified theory of gravity [ Schwarz:85a ]. Means were found to compactify the unwanted extra dimensions and produce four-dimensional theories that were plausible grand unified models, that is, models which include both the standard model and gravity. Unfortunately, it now seems that much of the predictive power that came from the constraints on the 10-dimensional theories is lost in the compactification, so interest in string models for constructing grand unified theories has begun to fade. However, considered as purely mathematical entities, they have led and are leading to great advances in complex geometry and conformal field theory. Many of the techniques that have been used in string theory can also be directly translated to the field of real surfaces and membranes, and it is from this viewpoint that we want to discuss the subject.
As a point particle in space moves through time, it traces out a line, called the worldline; similarly, as the string, which looks like a line in space, moves through time, it sweeps out a two-dimensional surface called the worldsheet. Thus, there are two ways in which to discretize the string: either the worldsheet is discretized or the ( d -dimensional) space-time in which the string is embedded is discretized. We shall consider the former, which is referred to as the intrinsic approach; the latter is reviewed in [ Ambjorn:89a ]. Such discretized surface models fall into three categories: regular, fixed random, and dynamical random surfaces. In the first, the surface is composed of plaquettes in a d -dimensional regular hypercubic lattice; in the second, the surface is randomly triangulated once at the beginning of the simulation; and in the third, the random triangulation becomes dynamical (i.e., is changed during the simulation). It is these dynamically triangulated random surfaces we wish to simulate. Such a simulation is, in effect, that of a fluid surface. This is because the varying triangulation means that there is no fixed reference frame, which is precisely what one would expect of a fluid where the molecules at two different points could interchange and still leave the surface intact. In string theory, this is called reparametrization invariance. If, instead, one used a regular surface, one would be simulating a tethered or crystalline surface, on which there is considerable literature (see [ Ambjorn:89b ] for a survey of the work in the field). In this case, the molecules of the surface are frozen in a fixed array. There have also been simulations of fixed random surfaces; see, for example, [ Baig:89a ]. One other reason for studying random surface models is to understand integration over geometrical objects and discover whether the nonperturbative discretization procedures, which work so well for local field theories like QCD, can be applied successfully.
The partition function describing the quantum mechanics of a surface was first formulated by Polyakov [ Polyakov:81a ]. For a bosonic string embedded in d dimensions, it is written as

Z = \int \mathcal{D}g_{ab} \, \mathcal{D}X^{\mu} \, \exp\left( -\frac{T}{2} \int d^{2}\xi \, \sqrt{g} \, g^{ab} \, \partial_{a}X^{\mu} \, \partial_{b}X^{\mu} \right)   (7.1)

where \mu = 1, \ldots, d labels the dimensions of the embedding space, a, b = 1, 2 are the coordinates \xi^{a} on the worldsheet, and T is the string tension. The integration is over both the fields X^{\mu} and the metric g_{ab} on the worldsheet. X^{\mu}(\xi) gives the embedding of the two-dimensional worldsheet swept out by the string in the d -dimensional space in which it lives. If we integrate over the metric, we obtain an area action for the worldsheet,

S_{\mathrm{area}} = T \int d^{2}\xi \, \sqrt{\det\left( \partial_{a}X^{\mu} \, \partial_{b}X^{\mu} \right)} ,

which is a direct generalization of the length action for a particle, i.e.,

S_{\mathrm{length}} = m \int ds .
We can thus see that the action in Equation 7.1 is the natural area action that one might expect for a surface whose dynamics were determined by the surface tension.
The first discretized model of this partition function was suggested independently by three groups: [ Ambjorn:85a ], [ David:85a ], and [ Kazakov:85a ]. It takes the form

Z_{N} = \sum_{t \in T} \rho(t) \int \prod_{i=1}^{N} d^{d}X_{i} \, \exp\left( - \sum_{\langle ij \rangle} \left( X_{i} - X_{j} \right)^{2} \right) ,

where the outer sum runs over some set T of allowed triangulations t of the surface, weighted by their importance factors \rho(t), and is supposed to represent the effect of the metric integration in the path integral. The inner sum in the exponential is over the edges \langle ij \rangle of the triangulation, or mesh, and working with a fixed number of nodes N corresponds to working in a microcanonical ensemble of fixed intrinsic area. The model is that of a dynamically triangulated surface because one is instructed to perform the sum over different triangulations, so both the fields X_{i} on the mesh and the mesh itself are dynamical objects.
A considerable amount of effort has been devoted to simulating this pure area action, both in microcanonical form with a fixed number of nodes [ Billoire:86a ], [ Boulatov:86a ], [ Jurkiewicz:86a ] and in grand canonical form, where the number of nodes is allowed to change in a manner which satisfies detailed balance [ Ambjorn:87b ], [ David:87a ], [ Jurkiewicz:86b ]. (This allows measurements to be made of how the partition function varies with the number of nodes N, which determines an exponent called the string susceptibility.) The results are rather disappointing, in that the surfaces appear to be in a very crumpled state, as can be seen from measuring the gyration radius X^{2}, which gives a figure for the ``mean size'' of the surface. Its discretized form is

X^{2} = \frac{1}{N^{2}} \sum_{ij} \left( X_{i} - X_{j} \right)^{2} ,

where the sum now runs over all pairs of nodes ij. X^{2} is observed to grow only logarithmically with N. This means that the Hausdorff dimension, d_{H}, which measures how the surface grows upon the addition of intrinsic area and is defined by

X^{2} \sim N^{2/d_{H}} ,

is infinite. Analytical work [ Durhuus:84a ] shows that the string tension fails to scale in the continuum limit so that, heuristically speaking, it becomes so strong that it collapses the surface into something like a branched polymer. Thus, the pure area action does not provide a good continuum limit. It was observed in [ Ambjorn:85a ] and [ Espriu:87a ] that another way to understand the pathological behavior of the simulations was to note that spikelike configurations in the surface were not suppressed by the area action, allowing it to degenerate into a spiky, crumpled object. An example of such a configuration is shown in Figure 7.2 (a).
Figure 7.2:
(a) Crumpled Phase (b) Smooth Phase
To overcome this difficulty, one uses the fact that adding to the pure area action a term in the extrinsic curvature squared (as originally suggested by Polyakov [ Polyakov:86a ] and Kleinert [ Kleinert:86a ] for string models of hadron interactions) smooths out the surface. In three dimensions, the extrinsic curvature of a two-dimensional surface is given by

H(x) = \frac{1}{r_{1}(x)} + \frac{1}{r_{2}(x)} ,

where r_{1} and r_{2} are the principal radii of curvature at a point x on the surface. Two discretized forms of this extrinsic curvature term are possible (Equations 7.8 and 7.9). In the first (Equation 7.8), the inner sum is over the neighbors j of a node i, and the normalization is the sum of the areas of the surrounding triangles, as shown in Figure 7.3 . In the second (Equation 7.9), one takes the dot product of the unit normals of triangles which share a common edge. For sufficiently large values of the coupling, the worldsheet is smooth, as shown in Figure 7.2 (b). Analytical work [Ambjorn:87a;89b] strongly suggests, however, that a continuum limit will only be found in the limit of infinite extrinsic curvature coupling.
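Concretely, the second (edge) form can be written, up to normalization, in terms of the unit normals of adjacent triangles; the following is a sketch of a commonly used edge action, whose coupling and normalization conventions need not match Equation 7.9 exactly:

S_{\mathrm{edge}} = \lambda \sum_{\langle \alpha\beta \rangle} \left( 1 - \hat{n}_{\alpha} \cdot \hat{n}_{\beta} \right) ,

where the sum runs over pairs of triangles \alpha and \beta sharing an edge, \hat{n}_{\alpha} is the unit normal of triangle \alpha, and \lambda is the extrinsic curvature coupling. For a perfectly flat surface all normals are parallel and the term vanishes, while spiky configurations are penalized.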
Figure 7.3:
Illustration of First Form of Extrinsic Curvature (Equation 7.8)
It came as something of a surprise, therefore, when a simulation by Catterall [ Catterall:89a ] revealed that the discretization (Equation 7.8 ) seems to give a third-order phase transition to the smooth phase and the discretization (Equation 7.9 ) a second-order phase transition, the latter offering the possibility of defining a continuum limit at a finite value of the extrinsic curvature coupling (because of the divergence of the correlation length at a second-order transition). Further work by Baillie, Johnston, and Williams [ Baillie:90j ] confirmed the existence of this ``crumpling transition.'' Typical results, for a surface consisting of 288 nodes, are shown in the series of Figure 7.4 (Color Plate), for the discretization of Equation 7.8 . The extrinsic curvature coupling is increased from 0 (the crumpled phase) to 1.5 (the smooth phase). We estimate that the crumpling transition occurs at an intermediate value of this coupling.
Figure 7.4:
A 288-node DTRS uncrumpling as the extrinsic curvature coupling changes from 0 to 1.5.
Further studies of the crumpling transition using the ``edge action'' discretization (Equation 7.9 ) have recently been performed [ Ambjorn:92a ], [ Bowick:93a ] on larger lattices in order to see whether this is a genuine phase transition, or just a finite size effect due to the small mesh sizes which had been simulated.
To summarize, a dynamically triangulated random surface with a pure area action does not offer a good discretization of a bosonic string or of a fluid surface. The addition of an extrinsic curvature term appears to give a crumpling transition between a smooth and crumpled phase, but the nature of this transition is unclear. In order for the continuum limit to give a string theory, it is necessary that there be a second-order phase transition, so that the correlation length diverges and the details of the lattice discretization are irrelevant, as in lattice QCD (see Section 4.3 ).
In order to give the reader a feel for how one actually simulates a dynamically triangulated random surface, we briefly explain our computer program, string, which does this; more details can be found in [ Baillie:90e ]. As we explained previously, in order to incorporate the metric fluctuations, we randomly triangulate the worldsheet of the string or random surface to obtain a mesh and make it dynamical by allowing flips in the mesh that do not change the topology. The incorporation of the flips into the simulation makes vectorization difficult, so running on traditional supercomputers like the Cray is not efficient. Similarly, the irregular nature of the dynamically triangulated random surface inhibits efficient implementation on SIMD computers like the Distributed Array Processor and the Connection Machine. Thus, in order to get a large amount of CPU power behind our random surface simulations, we are forced to run on MIMD parallel computers. Here, we have a choice of two main architectures: distributed-memory hypercubes or shared-memory computers. We initially made use of the former, as several machines of this type were available to us, all running the same software environment, namely, ParaSoft's Express System [ ParaSoft:88a ]. Having the same software on different parallel computers makes porting the code from one to another very easy. In fact, we ran our string simulation program on the nCUBE hypercube (for a total of 1800 hours on 512 processors), the Symult Series 2010 (900 hours on 64 processors), and the Meiko Computing Surface (200 hours on 32 processors). Since this simulation fits easily into the memory of a single node of any of these hypercubes, we ran multiple simulations in parallel, giving, of course, linear speedup. Each node was loaded with a separate simulation (using a different random number generator seed), starting from a single mesh that had been equilibrated elsewhere, say on a Sun workstation. After allowing a suitable length of time for the meshes to decorrelate, data can be collected from each node, treating them as separate experiments. More recently, we have also run string on the GP1000 Butterfly (1000 hours on 14 processors) and TC2000 Butterfly II (500 hours on 14 processors) shared-memory computers, again with each processor performing a unique simulation. Parallelism is thereby obtained by ``decomposing'' the space of Monte Carlo configurations.
The reason that we run multiple independent Monte Carlo simulations, rather than distribute the mesh over the processors of the parallel computer, is that this domain decomposition would be difficult for such an irregular problem. This is because, with a distributed mesh, each processor wanting to change its part of the mesh would have to first check that the affected pieces were not simultaneously being changed by another processor. If they were, detailed balance would be violated and the Metropolis algorithm would no longer work. For a regular lattice this is not a problem, since we can do a simple red/black decomposition (Section 4.3 ); however this is not the case for an irregular lattice. Similar parallelization difficulties arise in other irregular Monte Carlo problems, such as gas and liquid systems (Section 14.2 ). For the random surfaces application, the size of the system which can be simulated using independent parallelism is limited not by memory requirements, but by the time needed to decorrelate the different meshes on each processor, which grows rapidly with the number of nodes in the mesh.
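The control structure of this style of run is simple enough to sketch. In the fragment below, the node number and node count are taken from the command line so that the sketch is self-contained; in a real Express program they would come from the run-time system, and dummy_sweep() is only a placeholder for the Monte Carlo update of one node-local mesh.

#include <stdio.h>
#include <stdlib.h>

/* Placeholder "simulation": in the real code this would be one Monte
   Carlo sweep of the node-local random-surface mesh.                  */
static double dummy_sweep(void)
{
    double x = 0.0;
    for (int i = 0; i < 1000; i++)
        x += rand() / (RAND_MAX + 1.0);
    return x / 1000.0;
}

int main(int argc, char **argv)
{
    /* Node number and node count; supplied here on the command line.  */
    int node   = (argc > 1) ? atoi(argv[1]) : 0;
    int nnodes = (argc > 2) ? atoi(argv[2]) : 1;

    /* Each node gets its own seed, so the copies of the simulation
       decorrelate and can be analyzed as independent experiments.      */
    srand(12345u + 1000u * (unsigned)node);

    for (int sweep = 0; sweep < 10; sweep++)
        printf("node %d/%d  sweep %d  observable %.4f\n",
               node, nnodes, sweep, dummy_sweep());
    return 0;
}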
The mesh is set up as a linked list in the programming language C, using software developed at Caltech for doing computational fluid dynamics on unstructured triangular meshes, called DIME (Distributed Irregular Mesh Environment) [Williams:88a;90b] (Sections 10.1 and 10.2 ). The logical structure is that of a set of triangular elements corresponding to the faces of the triangulation of the worldsheet, connected via nodes corresponding to the nodes of the triangulation of the worldsheet. The data required in the simulation (of random surfaces or computational fluid dynamics) are then stored in either the nodes or the elements. We simulate a worldsheet with a fixed number of nodes, N, which corresponds to the partition function of Equation 7.1 evaluated at a fixed area. We also fix the topology of the mesh to be spherical. (The results for other topologies, such as a torus, are similar.) The Monte Carlo procedure sweeps through the mesh moving the X values which live at the nodes, and doing a Metropolis accept/reject. It then sweeps through the mesh a second time doing the flips, and again performs a Metropolis accept/reject at each attempt. Figure 7.5 illustrates the local change to the mesh called a flip. For any edge, there is a triangle on each side, forming the quadrilateral ABCD, and the flip consists of changing the diagonal AC to BD. Both types of Monte Carlo update to the mesh can be implemented by identifying which elements and nodes would be affected by the change, then computing the resulting change in the action and accepting or rejecting the change with the Metropolis test.
For the X move, the affected elements are the neighbors of node i , and the affected nodes are the node i itself and its node neighbors, as shown in Figure 7.6 . For the flip, the affected elements are the two sharing the edge, and the affected nodes are the four neighbor nodes of these elements, as shown in Figure 7.7 .
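The fragment below sketches such an update for a single X move, using only the Gaussian edge action of the discretized model; the data layout, routine names, and the tiny hard-wired mesh are illustrative stand-ins rather than the actual DIME structures of the string program, and the flip move and extrinsic curvature terms are omitted.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define D        3     /* dimension of the embedding space          */
#define MAX_NBR  16    /* maximum number of neighbours of a node    */

typedef struct {
    double x[D];           /* embedding coordinates X_i             */
    int    nbr[MAX_NBR];   /* indices of neighbouring nodes         */
    int    nnbr;           /* number of neighbours                  */
} Node;

static double uniform(void) { return rand() / (RAND_MAX + 1.0); }

/* Gaussian edge action of node i with its neighbours:
   sum over neighbours j of |X_i - X_j|^2.                           */
static double local_action(const Node *mesh, int i, const double *xi)
{
    double s = 0.0;
    for (int n = 0; n < mesh[i].nnbr; n++) {
        const double *xj = mesh[mesh[i].nbr[n]].x;
        for (int mu = 0; mu < D; mu++) {
            double d = xi[mu] - xj[mu];
            s += d * d;
        }
    }
    return s;
}

/* One Metropolis trial of an X move: propose a shift of X_i and
   accept it with probability min(1, exp(-dS)).  Returns 1 if accepted. */
static int x_move(Node *mesh, int i, double step)
{
    double xnew[D];
    for (int mu = 0; mu < D; mu++)
        xnew[mu] = mesh[i].x[mu] + step * (2.0 * uniform() - 1.0);

    double dS = local_action(mesh, i, xnew) - local_action(mesh, i, mesh[i].x);
    if (dS <= 0.0 || uniform() < exp(-dS)) {
        for (int mu = 0; mu < D; mu++) mesh[i].x[mu] = xnew[mu];
        return 1;
    }
    return 0;
}

int main(void)
{
    /* A tiny illustrative "mesh": three mutually connected nodes.   */
    Node mesh[3] = {
        { {0, 0, 0}, {1, 2}, 2 },
        { {1, 0, 0}, {0, 2}, 2 },
        { {0, 1, 0}, {0, 1}, 2 },
    };
    srand(1);
    int accepted = 0, trials = 10000;
    for (int t = 0; t < trials; t++)
        accepted += x_move(mesh, t % 3, 0.5);
    printf("acceptance rate = %.3f\n", (double)accepted / trials);
    return 0;
}

A production code would, of course, use a better random number generator than rand() and would walk the DIME linked-list structures rather than a fixed array.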
Figure 7.6:
Nodes and Elements Affected by X Move
Figure 7.7:
Nodes and Elements Affected by Flip Move
Due to its irregular nature, string is an extremely good benchmark of the scalar performance of a computer. Hence, we timed it on several machines we had access to, yielding the numbers in Table 7.1 . Note that we timed one processor of the parallel machines. We see immediately that the Sun 4/60, known as the SPARCstation 1, had the highest performance of the Suns we tested. Moreover, this machine (running with its TI 8847 floating-point processor) is as fast for this code as the Motorola 88000 processor which is used in the TC2000 Butterfly. Turning to the hypercubes, we see that the nCUBE-2 is faster than the Meiko, which is twice as fast as the (scalar) Symult, which in turn is twice as fast as the nCUBE-1, per processor, for the string program. We have also run on the Weitek vector processors of the Mark III and Symult. The vector processors are faster than the scalar processors, but since string is entirely scalar, it does not run very efficiently on the vector processors and, hence, is still slower than on the Sun 4/60. The Mark III is as fast as the Symult, despite having one-third the clock rate, because it has a high-performance cache between its vector processor and memory. We have also timed the code on the new IBM and Hewlett-Packard workstations, and Cimarron Boozer of Sky Computers has optimized the code for the Intel i860. As a final comparison, the modern RISC workstations run the string code as fast as the CRAY X-MP.
Table 7.1:
Time Taken to Execute the String Program
We should emphasize that these performances are for scalar codes. A completely different picture emerges for codes which vectorize well, like QCD. QCD with dynamical fermions runs fastest on the CRAY X-MP, pure-gauge QCD runs considerably slower on one processor of the Mark III, and the Sun 4/60 achieves only a small fraction of either for pure-gauge QCD. This ratio of QCD performance (which we may claim as the ``realistic peak'' performance of the machines), 100:6:1, compares with 5:0.7:1 for strings. Thus, these two calculations from one area of physics illustrate clearly that the preferred computer architecture depends on the problem.
Large-scale numerical simulations are becoming increasingly important in many areas of science. Lattice QCD calculations have been commonplace in high energy physics for many years. More recently, this technique has been applied to string theories formulated as dynamically triangulated random surfaces. As we have pointed out, such computer simulations of strings are difficult to implement efficiently on all but MIMD computers, due to the inherently irregular nature of random surfaces. Moreover, on most MIMD machines, it is possible to get essentially perfect (linear) speedup by running multiple independent Monte Carlo simulations, since an entire simulation easily fits within each processor.
Although the mechanism of high-temperature superconductivity is not yet established, an enormous amount of experimental work has been completed on these materials and, as a result, a ``magnetic'' explanation has probably gained the largest number of adherents. In this picture, high-temperature superconductivity results from the effects of dynamical holes on the magnetic properties of the copper-oxide planes, perhaps through the formation of bound hole pairs. In the undoped materials (``precursor insulators''), these planes are magnetic insulators and appear to be well described by the two-dimensional spin-1/2 Heisenberg antiferromagnet,

H = J \sum_{\langle ij \rangle} \mathbf{S}_{i} \cdot \mathbf{S}_{j} , \qquad J > 0 ,

where each spin represents a d-electron on a site. Since many aspects of the two-dimensional Heisenberg antiferromagnet were obscure before the discovery of high-T_c superconductivity, this model has been the subject of intense numerical study, and comparisons with experiments on the precursor insulators have generally been successful. (A review of this subject including recent references has been prepared for the group [ Barnes:91a ].) If the proposed ``magnetic'' origin of high-temperature superconductivity is correct, one may only need to incorporate dynamical holes in the Heisenberg antiferromagnet to construct a model that exhibits high-temperature superconductivity. Unfortunately, such models (for example, the ``t-J'' model) are dynamical many-fermion systems and exhibit the ``minus sign problem,'' which makes them very difficult to simulate on large lattices using Monte Carlo techniques. The lack of appropriate algorithms for many-fermion systems accounts in large part for the uncertainty in the predictions of these models.
In our work, we carried out numerical simulations of the low-lying states of one- and two-dimensional Heisenberg antiferromagnets; the problems we studied on the hypercube which relate to high-T_c systems were the determination of low-lying energies and ground state matrix elements of the two-dimensional spin-1/2 Heisenberg antiferromagnet, and in particular the response of the ground state to anisotropic couplings in a generalized model with an anisotropy parameter g (sketched below).
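One common way to write such a generalized model introduces the anisotropy parameter g as a factor multiplying the transverse couplings; the parameterization below is an assumption on our part, chosen to be consistent with the behavior described in the next paragraph, rather than a transcription of the published Hamiltonian:

H(g) = J \sum_{\langle ij \rangle} \left[ g \left( S_{i}^{x} S_{j}^{x} + S_{i}^{y} S_{j}^{y} \right) + S_{i}^{z} S_{j}^{z} \right] , \qquad J > 0 ,

so that g = 1 is the isotropic Heisenberg point, g < 1 favors ordering along the z axis, and g > 1 favors ordering in the xy plane.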
Until recently, the possible existence of infinite-range spin antialignment (``staggered magnetization'') in the ground state of the two-dimensional Heisenberg antiferromagnet, which would imply spontaneous breaking of rotational symmetry, was considered an open question. Since the precursor insulators are observed to have a nonzero staggered magnetization, one might hope to observe it in the Heisenberg model as well. (It has actually been proven to be zero in the isotropic model above zero temperature, so this is a very delicate kind of long-range order.) Assuming that such order exists, one might expect to see various kinds of singular behavior in response to anisotropies, which would choose a preferred direction for symmetry breaking in the ground state. In our numerical simulations we measured the ground state energy per spin, the energy gap to the first spin excitation, and the z component of the staggered magnetization N, as a function of the anisotropy parameter g on L x L square lattices, extrapolated to the bulk limit. We did indeed find evidence of singular behavior at the isotropic point g=1, specifically that the ground state energy per spin is probably nonanalytic there (Figure 7.8 ), that the energy gap decreases to zero at g=1 (Figure 7.9 ) and remains zero for g>1, and that N decreases to a nonzero limit as g approaches one from below, is zero for g>1, and is undefined at g=1 ([Barnes:89a;89c]). Finite lattice results which led to this conclusion are shown in Figure 7.10 . (Perturbative and spin-wave predictions also appear in these figures; details are discussed in the publications we have cited.) These results are consistent with a ``spin flop'' transition, in which the long-range spin order is oriented along the energetically most favorable direction, which changes discontinuously from alignment along the z axis to alignment in the plane as g passes through the isotropic point. The qualitative behavior of the energy gap can be understood as a consequence of Goldstone's theorem, given these types of spontaneous symmetry breaking. Our results also provided interesting tests of spin-wave theory, which has been applied to the study of many antiferromagnetic systems including the two-dimensional Heisenberg model, but is of questionable accuracy for small spin. In this spin-1/2 case, we found that finite-size and anisotropy effects were qualitatively described surprisingly well by spin-wave theory, but that actual numerical values were sometimes rather inaccurate; for example, the energy gap due to a small easy-axis anisotropy was in error by about a factor of two.
Figure 7.8:
Ground State Energy per Spin
Figure 7.9:
Spin Excitation Energy Gap
In related work, we developed hypercube programs to study static holes in the Heisenberg model, as a first step towards more general Monte Carlo investigations of the behavior of holes in antiferromagnets. Preliminary static-hole results have been published [ Barnes:90b ], and our collaboration is now continuing to study high-T_c models on an Intel iPSC/860 hypercube at Oak Ridge National Laboratory.
For our studies on the Caltech machine, we used the DGRW (discrete guided random-walk) Monte Carlo algorithm [ Barnes:88c ], and incorporated algorithm improvements which lowered the statistical errors [ Barnes:89b ]. This algorithm solves the Euclidean time Schrödinger equation stochastically by running random walks in the configuration space of the system and accumulating a weight factor, which implicitly contains energies and matrix elements. Since the algorithm only requires a single configuration, our memory requirements were very small, and we simply placed a copy of the program on each node; no internode communication was necessary. A previously developed DGRW spin system Fortran program written by T. Barnes was rewritten in C and adapted to the hypercube by D. Kotchan, and an independent DGRW code was written by E. S. Swanson for debugging purposes. Our collaboration for this work eventually grew to include K. J. Cappon (who also wrote a DGRW Monte Carlo code) and E. Dagotto (UCSB/ITP) and A. Moreo (UCSB), who wrote Lanczos programs to give essentially exact results on the lattice. This provided an independent check of the accuracy of our Monte Carlo results.
Figure 7.10:
Ground State Staggered Magnetization
In addition to providing resources that led to these physics results, access to the hypercube and the support of the group were very helpful in the PhD programs of D. Kotchan and E. S. Swanson, and their experience has encouraged several other graduate-level theorists at the University of Toronto to pursue studies in computational physics, in the areas of high-temperature superconductivity (K. J. Cappon and W. MacReady) and Monte Carlo studies of quark model physics (G. Grondin).
This project of Apostolakis and Kochanek used the Caltech/JPL Mark III to simulate gravitational lenses. These are galaxies which bend the light of a background quasar to produce multiple images of it. Astronomers are very interested in these objects, and have discovered more than 10 of them to date. Several exhibit symptoms of lensing by more than one galaxy. This spurred us to simulate models of this class of lens. Our model systems were composed of two galaxy-like lensing potentials at different positions and redshifts. We studied about 100 cases at high resolution, taking about three weeks of running time on a 32-node Mark III. The algorithm we used is based on ray tracing. The problem is very irregular; this led us to use a scattered block decomposition. We achieved the performance needed for our purposes, but did not gain large speedups. The feature of the machine that was essential for our calculation was its large memory, because of the need for high resolution. Two of the cases we studied are illustrated in Figures 7.11 and 7.12 : Areas on the source plane that produce one, three, five, or seven images, and the respective image regions on the image plane can be seen. An interesting example of an extended source is also shown in each case. A detailed exposition of our results and a description of our algorithm for a concurrent machine are contained in [ Kochanek:88a ] and [ Apostolakis:88d ].
Figure 7.11:
Part A shows the areas of the source plane that produce
different numbers of images. Part B is a map of the areas of the image
plane with negative amplification, i.e., flipped images, and positive
amplification. Part C is a similar plot of the image plane, separating
the areas by the total number of images of the same source. An example
extended source is shown in A, whose images can be seen in Part B and
Part C.
Figure 7.12:
Part A shows the areas of the source plane that produce
different numbers of images. Part B is a map of the areas of the image
plane with negative amplification, that is, flipped images, and positive
amplification. Part C is a similar plot of the image plane, separating
the areas by the total number of images of the same source. An example
of an extended source is shown in A, whose images can be seen in Parts B
and C.
Many important algorithms in science and engineering are of the Monte Carlo type. This means that they employ pseudorandom number generators to simulate physical systems which are inherently probabilistic or statistical in nature. At other times, Monte Carlo is used to get a fast approximation to what is actually a large, deterministic computation. Examples of this are Lattice Gauge computations (Section 4.3 ) and Simulated Annealing methods (Sections 11.1.4 and 11.4 ).
Figure 7.13:
A Comparison of the Sequential and Concurrent Generation of
Random Numbers
Even for a sequential algorithm, the question of correlations between members of the pseudorandom number sequence is nontrivial. In the parallel case, at least for the popular linear congruential algorithm, it is easy for the parallel algorithm to exactly mimic what a sequential algorithm would do. This means that the parallel case can be reduced to the well-understood sequential case.
The fundamental idea is that each of the processors of an N-processor concurrent computer computes only every N-th number of the sequential random number sequence. The parallel sequences are staggered and interleaved so that the parallel computer exactly reproduces the sequential sequence. Figure 7.13 illustrates what happens in the parallel case versus the sequential case for a four-processor concurrent computer.
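A small, self-contained illustration of this interleaving for a linear congruential generator is given below. The constants are those of a standard 32-bit generator, x_{n+1} = 1664525 x_n + 1013904223 (mod 2^32), used here purely for illustration; the point is that composing the one-step map N times is again an affine map, so processor p can start from the p-th member of the sequential sequence and then jump N steps at a time, exactly reproducing every N-th sequential value.

#include <stdint.h>
#include <stdio.h>

#define A 1664525u       /* multiplier of the illustrative 32-bit LCG */
#define C 1013904223u    /* increment                                  */

/* One step of the sequential generator, x -> A*x + C  (mod 2^32).     */
static uint32_t lcg_step(uint32_t x) { return A * x + C; }

/* Compose the map n times: x -> AN*x + CN (mod 2^32).                  */
static void lcg_power(unsigned n, uint32_t *AN, uint32_t *CN)
{
    uint32_t a = 1u, c = 0u;
    for (unsigned k = 0; k < n; k++) {
        c = A * c + C;      /* c_{k+1} = A*c_k + C */
        a = A * a;          /* a_{k+1} = A*a_k     */
    }
    *AN = a;
    *CN = c;
}

int main(void)
{
    const unsigned N = 4;          /* number of processors             */
    const uint32_t seed = 12345u;  /* seed of the sequential generator */
    uint32_t AN, CN;
    lcg_power(N, &AN, &CN);

    /* Sequential reference sequence.                                   */
    uint32_t seq[16], x = seed;
    for (int i = 0; i < 16; i++) { x = lcg_step(x); seq[i] = x; }

    /* "Processor" p starts from the p-th sequential value and then     */
    /* strides by N; together the processors reproduce the sequence.    */
    for (unsigned p = 0; p < N; p++) {
        uint32_t xp = seed;
        for (unsigned k = 0; k <= p; k++)
            xp = lcg_step(xp);                  /* the p-th sequential value */
        for (unsigned idx = p; idx < 16; idx += N) {
            printf("proc %u holds %10u  (sequential value %10u)\n",
                   p, (unsigned)xp, (unsigned)seq[idx]);
            xp = AN * xp + CN;                  /* jump N steps in one go */
        }
    }
    return 0;
}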
Chapter 12 of [ Fox:88a ] has an extensive discussion of the parallel algorithm. This reference also has a discussion of what to do to achieve exact matching between parallel and sequential computations in more complex applications. We extend this work from the linear congruential method of [ Fox:88a ] to the so-called shift register sequences, which have longer periods and weaker correlations than the congruential method [ Chiu:88b ], [ Ding:88d ]. As an illustration, we use Ding's Fibonacci additive random number generator developed for the QCD calculations on the Mark IIIfp, as previously described in Section 4.3 . This generator uses an additive lagged-Fibonacci sequence (Equation 7.12) with a very long period. The assembly language implementation of Equation 7.12 on the Mark IIIfp generates floating-point random numbers normalized to the range [0,1).
Neurobiology is the study of the nervous system. Until recently, most neurobiology research centered around exposing different neural tissue preparations to a wide range of environmental stimuli and seeing how they responded. More recently, the growing field of computational neurobiology has involved constructing models of how we think the nervous system works [ Segev:89a ], [ Wehmeier:89a ], [ Yamada:89a ]. These models are then exposed to a wide range of experimental conditions and their responses compared to the real neural systems. Those models that are demonstrated to accurately and reliably mimic the behavior of real neural systems are then used to predict the neural system's response to new and untried experimental situations, and to make firm predictions about how the neural system should respond if our theory of neural functioning is correct. Simplified models are also used to determine which features of a real neural system are critical to its underlying behavior and function. In doing so, they also indicate which features of the real system have no effect on desired system performance.
Computer modelling of neural structures from the level of single cells to that of large networks has, until recently, been rather isolated from mainstream experimental neurophysiology [ Koch:92a ]. Largely, this has been due to limitations of computer power which have necessitated reducing the models to such a basic level that their biological relevance becomes questionable. More powerful computer platforms, such as the parallel computers which have been used at Caltech, allow the construction of simulations of sufficient detail for their results to be compared directly with experimental results. Furthermore, the inherent flexibility of the modelling approach allows the neurophysiologist to observe the effect of experimental manipulations which are presently difficult or impossible to carry out on a biological basis. In this way, neural modelling can make firm predictions that can be confirmed by later experimentation [ Bhalla:93a ].
The neural modelling community at Caltech has been fortunate in gaining access to several parallel computers. One of these, the experimental supercomputer produced by Intel called the Intel Touchstone Delta and described in Chapter 2 , held the record as the World's fastest computer while much of this work was being carried out. Unlike a traditional (serial) computer, a parallel computer is more analogous to our own biological computer (the nervous system) where tasks such as vision and hearing can continue to function independently of one another. The parallel style of computer would therefore seem to lend itself very well to neural modelling applications where many neural compartments (whether individual ion channels or whole cells) are active simultaneously [Nelson:89a;90b].
Having listed the suitability of the newer style of parallel computers for neural modelling tasks, it is important to examine why they are not in wider use. Traditionally, parallel computers have been much harder to program than traditional computers and the typical neurobiologist was expected to understand a lot about advanced computer science issues before he or she could adequately construct neural models on such a machine. As this is not considered a reasonable expectation in neural modelling circles, most modellers have continued to use traditional (serial) computers and have had to sacrifice model detail in order to get acceptable performance figures. Such cut-down models may still require more than 12 hours to complete on a traditional high-performance computer [ Bhalla:92a ]. Parallel computers hold the promise of allowing more detailed models to be run in an acceptable period of simulation time. In order to make this power available to a range of neural modellers, it was decided to produce a version of a widely used neural simulation system [ Wilson:89a ] which could take advantage of such a parallel computer with only minimal manipulation by the neural modeller.
GENESIS is a package designed to allow the construction of a wide variety of neural simulations. It was originally designed by Matt Wilson at Caltech to assist in his doctoral modelling work on the Piriform Cortex [ Wilson:89b ]. One of the design objectives was to allow the easy construction and alteration of a wide variety of neural models from detailed single cells all the way up to complex multilayered neural network structures. In order to make the simulator as flexible as possible, it was decided to adopt an object-oriented approach to the underlying simulator and to allow the user to include precompiled libraries of elements appropriate to their particular scale of modelling (e.g., detailed single cells or network-scale models). The structure of individual models was described via neural description script language files, which were interpreted as execution of the model proceeded. This combination of interpreted script files and precompiled element libraries has proved to be a very powerful approach to the problems of neural modelling at a variety of levels of detail from detailed single cell models all the way up to large network-scale simulations composed of thousands of neural elements.
GENESIS is an object-oriented neural simulator. All communication between the elements composing the simulation is via well-defined messages. As such, it was expected to fit well into the distributed computing environment of modern parallel computers.
It is designed in an object-oriented manner where each GENESIS element has private internal data that other elements cannot access directly. They can only access this information via predefined messages that request the internal state information from an element.
The GENESIS neural simulation system has now been running successfully on two of the Intel parallel computers at Caltech since 1991 and has already produced biologically interesting and previously unobtainable results. Much of the use of the simulator to date has been in the construction of a highly detailed model of the Cerebellar Purkinje Cell (work produced by Dr. Erik de Schutter at Caltech using the parallel GENESIS system provided by ourselves) [Schutter:91a;93a]. This is thought to be one of the most detailed and biologically realistic single-cell models developed to date. By utilizing the special capabilities of the Parallel GENESIS system, it has been possible to carry out simulated experiments which are presently very difficult to carry out experimentally. The initial results have been very promising and have shown several previously unsuspected properties of the Purkinje Cell, which arise as a result of the anatomical and physiological properties of the cell's dendritic tree. The ability to run up to 512 different Purkinje Cell models simultaneously has allowed the construction of statistically significant profiles of Purkinje Cell response patterns. This research has previously been impossible to conduct for detailed cell models because of the excessive computational power required. Until now, the only statistical behavior that has been described is for population dynamics of very simplified neural elements.
Currently, these machines are being used in two distinct ways (the task farming approach and the distributed model via the postmaster element) [Speight:92a;92b].
In the task farming approach , each node runs its own copy of a neural simulation (generally a detailed single-cell model). Each node and, therefore, each simulation runs totally independently of all other nodes. This method is particularly suited to examining large parameter spaces. In many of our applications, there are a wide variety of free parameters (i.e., those not defined experimentally). By using the task farming approach on these supercomputer-class machines, we can range widely across this huge parameter space looking for combinations which give biologically realistic results [ Bhalla:93a ] (i.e., similar to those measured experimentally). This allows us to make predictions for the future experimental measurement of these free parameters. It is also possible to run the same model many times in order to build up statistically significant summaries of the overall model behavior. The task farming approach is inherently parallel (zero communication between nodes and, therefore, linear scaling of computation with number of nodes available) and as such it is one of the most efficient programming styles available on any parallel machine (i.e., it allows the full utilization of the available computational power of the machine). This approach allows modelling in a single overnight run what would otherwise take a full year of nonstop computation on previously available computing platforms.
The postmaster element is a self-contained object within the GENESIS neural simulator. Like the objects in a true object-oriented language it is an entity composed of public (externally available) data, private (internal) data, and functions to operate on the object. Like other GENESIS elements, it is a black box that can be used to construct a neural simulation. For the present, modellers wishing to produce large distributed simulations that are able to take full advantage of the power inherent in our current parallel computers must specify how to distribute the different parts of their simulation over the available hardware computing nodes. While this may be an annoying necessity to certain neural modellers, it must be remembered that this technique allows the construction of simulations of such a size and computational complexity that they can be modelled on no other existing computer platforms (e.g., traditional serial supercomputers). Bearing this in mind, the requirement of explicitly stating how to distribute the model is a small price to pay for the power gained. This explicit method also brings with it certain advantages. Firstly, because the modeller is familiar with the computational demands of the different parts of his model, he is able to accurately balance the computational load over the available parallel hardware. This makes for far more efficient load balancing and scaling behavior than would be possible with an automatic decomposition scheme that has to balance the very different needs of both single-cell modellers and network-scale modellers. It is also less efficient to carry out automatic decomposition because the various GENESIS elements comprising a complete neural simulation have widely varying computational demands. As with other GENESIS elements, the postmaster acts as a self-contained black box which usually communicates information about its changing state via messages to other elements. It can both send messages to, and receive messages from, other GENESIS elements which may exist either on this particular hardware node or on others in the network. As a result of this, it is an element that ties together and coordinates the disparate parts of a model running on separate hardware nodes of the parallel computer. The actual messages transferred depend on the type of element with which it communicates. The postmaster element neither knows nor cares whether the quantity it is communicating is a membrane potential, a channel conductance (simple current flow), or the concentration of any substance from ions to complex neurotransmitters. The postmaster is a two-faced element. Its first face is that of a normal GENESIS element which it presents to the rest of the simulation. This first aspect of the postmaster is able to pass GENESIS messages between elements and allows use of the GENESIS Script Language to query its state. The second aspect of the postmaster is aware of the parallel nature of the underlying hardware and can use the operating system primitives for communicating information between separate physical nodes. In summary, the postmaster element is a conduit along which information can flow between nodes. It allows the modeller to tie together the disparate aspects of the simulation into a coherent whole.
To show how this mechanism works in practice, performance measurements are presented in Figure 7.14 , which are taken on the Intel Touchstone Delta parallel supercomputer based at Caltech. These figures show both scaling of performance, where more nodes allow the model to complete in a shorter timescale, and the construction of models so large that it has been impossible to model them before now.
Figure 7.14:
Results from Running a GENESIS Simulation of a Passive Cable
Model Composed of Varying Numbers of Compartments Distributed Across
Varying Numbers of Nodes on the Intel Touchstone Delta Parallel
Supercomputer
The previous record for the most complex GENESIS model produced on a traditional serial computer is approximately 80,000 elements, the Purkinje Cell model produced at Caltech by Dr. de Schutter [Schutter:91a;93a]. Using the postmaster element on a parallel supercomputer (the 512-node Intel Touchstone Delta), this limit has now been pushed to over two million GENESIS elements. As can be seen, this now allows for the construction of far more complex and realistic models than was previously possible. The present price to be paid for this freedom is that the modeller must explicitly distribute his or her simulation over the available hardware. This was a design decision that has allowed far greater efficiency of load balancing than would be possible using an automatic distribution technique, as the illustrated scaling graphs confirm. This technique also has the advantage of leaving the basic GENESIS script interface unaltered and is applicable to a wide variety of parallel hardware. As a result of the requirement to retain compatibility with the existing serial GENESIS implementation, another benefit has become obvious. By changing only the network layer of the postmaster element, it is possible to produce a version of the postmaster that can use traditional serial machines distributed across the Internet to produce a distributed model, which ties together existing supercomputers based anywhere on the network. The potential size of model that can be constructed in this manner is staggering, although the reality of network communication delays will limit its area of application to compute-limited tasks (cf. communication-bound simulations).
The assessment of the usefulness of a parallel neural modelling platform can be demonstrated by two extremes of neural modelling applications:
The individual subcompartments which make up a neuron's dendritic tree are active simultaneously, and each of these compartments may be studded with several independently functioning ionic membrane channels. This is illustrated in Figure 7.15.
Figure 7.15:
Cerebellar Purkinje cell model in GENESIS.
Our most detailed single-cell model to date.
A detailed network simulation may be composed of many thousands of individual nerve cells from a smaller number of biological cell types.
What does a parallel computer have to offer the single-cell modeller?
Most of the work on the parallel GENESIS at Caltech to date has been in the field of single-cell modelling. Several distinct ways of using the system have been developed allowing a variety of approaches by the single-cell modeller.
Each individual node on the parallel computer runs a separate, complete, single-cell model (task farming). This facility can be used to examine the sensitivity of the cell's performance to a wide range of physiological states including testing the effect of parameters which are at present difficult to measure experimentally [ Bhalla:93a ]. A selection of these appear below:
Like the experimentalist, the neural modeller can block or poison different ion channel subsets, with the added advantage that it is possible to block both channels where no chemical blocking agent currently exists and also to have 100% channel-blocking specificity (e.g., work conducted by D. Jaeger, Caltech [ Jaeger:93a ]).
An experiment at present difficult or impossible to perform in any physiological setup can easily be tested on the computer model system. For example, the independence of stimulation site on Purkinje Cell response (Work performed by E. de Schutter, Caltech [Schutter:91a;93a]).
The prediction of ionic channel distribution over different parts of a cell's dendritic arbor by observing the effect of changed distribution on the model cell's electrophysiological properties, and relating this to the experimentally measured behavior of the real neuron [ Bhalla:93a ]. Experimental confirmation of channel distribution predicted by the manipulation of such computer models seems likely to appear in the near future due to advances in monoclonal antibody techniques for different channel subsets.
Predicting the effect of changing membrane properties which are impossible to measure experimentally-for example, in the distal dendritic arbor or in ``spines'' (e.g., E. de Schutter, Caltech [ Bhalla:93a ], [ Schutter:91a ]).
The first example above is of modelling following and confirming physiology experiments. The latter examples are uses of neural modelling to predict future experimental findings. Although rarely used to date (because of computer limits on the model's level of biological realism), this synergistic use of neural modelling in predicting experimental results and suggesting new experiments appears to offer substantial benefits to the neurobiology community at large. Neural modelling on parallel computers, such as the Intel Touchstone Delta, is allowing modelling to adopt these new closer links to experimental work, thereby bridging the divide between experimenters and modellers. In the past, this divide has caused several experimentalists to question the relevance of funding modelling work. Hopefully, this attitude will change as more results of synergy between modelling experiments and physiology experiments become widely known.
The system allows the construction of larger and more detailed cell models than is possible on a traditional serial computer. The level of detail included in models to date has been limited either by the memory size constraints of the computer used, or by the computational time requirements of the model [ Bhalla:92a ]. A distributed model of a single cell on a parallel computer alleviates both of these constraints simultaneously. This allows the construction of larger cell models than have been previously possible but which nevertheless run in acceptable time frames.
What does a parallel computer have to offer the neural network modeller?
Much of the work to date has been on task farming (as described above), whereby each node runs its own copy of a cell. This is less useful to the network modeller but still allows detailed statistical information to be built up about network and population behavior. A more interesting way of using the parallel machine for network modelling is the distributed model scheme mentioned above. This allows networks to be both larger and to run more quickly than their counterparts on equivalent serial machines. A promising project in this category, although still in its very early stages, is the construction by Upinder Bhalla at Caltech of a detailed model of the rat olfactory bulb [ Bhalla:93a ]. This incorporates detailed cellular elements, which are rare in network class models. Such a network model makes far greater communication demands of the internode communications mechanism on the parallel computer than a distributed single-cell model. Initially an expanded version of the postmaster element [Speight:92a;92b] which was used for distributed single-cell models will be employed, but this may change as the different demands of a large network model become apparent.
All of the work on the Parallel GENESIS project was carried out in the laboratory of Professor James Bower at the California Institute of Technology.
Concurrent matrix algorithms were among the first to be studied on the hypercubes at Caltech [ Fox:82a ], and they have also been intensively studied at other institutions, notably Yale [ Ipsen:87b ], [Johnsson:87b;89a], [ Saad:85a ], and Oak Ridge National Laboratory [Geist:86a;89a], [Romine:87a;90a]. The motivation for this interest is the fact that matrix algorithms play a prominent role in many scientific and engineering computations. In this chapter, we study the so-called full or dense (and closely related banded) matrix algorithms, where essentially all elements of the matrix are nonzero. In Chapters 9 and 12 , we will treat the more common case of problems which, if formulated as matrix equations, are represented by sparse matrices. Here, most of the elements of the matrix are zero; one can apply full matrix algorithms to such sparse cases, but there are much better algorithms that exploit the sparseness to reduce the computational complexity. Within C^3P, we found two classes of important problems that needed full matrix algorithms. In Sections 8.2 and 8.3 , we cover two chemical scattering problems, which involve relatively small full matrices, where the matrix rows and columns are labelled by the different reaction channels. The currently most important real-world use of full matrix algorithms is computational electromagnetic simulations [ Edelman:92a ]. These are used by the defense industry to design aircraft and other military vehicles with low radar cross sections. Solutions of large sets of linear equations come from the method of moments approach to the electromagnetic equations [ Wang:91b ]. We investigated this method successfully at JPL [ Simoni:89a ], but in this book we only describe (in Section 9.4 ) the alternative finite-element approach to electromagnetic simulations. Such a sparse matrix formulation will be more efficient for large electromagnetic problems, but the moment method and its associated full matrix is probably the most common and numerically reliable approach.
Early work at Caltech on full matrices (1983 to 1987) focused on specific algorithms, such as matrix multiplication, matrix-vector products, and LU decomposition. A major issue in determining the optimal algorithm for these problems is choosing a decomposition which has good load balance and low communication overhead. Many matrix algorithms proceed in a series of steps in which rows and/or columns are successively made inactive. The scattered decomposition described in Section 8.1.2 is usually used to balance the load in such cases. The block decomposition, also described in Section 8.1.2 , generally minimizes the amount of data communicated, but results in sending several short messages rather than a few longer messages. Thus, a block decomposition is optimal for a multiprocessor with low message latency, or startup cost, such as the Caltech/JPL Mark II hypercube. For machines with high message latency, such as the Intel iPSC/1, a row decomposition may be preferable. The best decomposition, therefore, depends crucially on the characteristics of the concurrent hardware.
In recent years (1988-1990), interest has centered on the development of libraries of concurrent linear algebra routines. As discussed in Section 8.1.7 , two approaches have been followed at Caltech. One approach by Fox, et al. has led to a library of routines that are optimal for low latency, homogeneous hypercubes, such as the Caltech/JPL Mark II hypercubes. In contrast, Van de Velde has developed a library of routines that are generally suboptimal, but which may be ported to a wider range of multiprocessor architectures, and are suitable for incorporation into programs with dynamically changing data distributions.
The data decomposition (or distribution) is a major factor in determining the efficiency of a concurrent matrix algorithm, so before detailing the research into concurrent linear algebra done at Caltech, we shall first introduce some basic decomposition strategies.
The processors of a concurrent computer can be uniquely labelled by an index p = 0, 1, ..., N_proc - 1, where N_proc is the number of processors. A vector of length M may be decomposed over the processors by assigning the vector entry with global index m (where 0 <= m < M) to processor p, where it is stored as the i-th entry in a local array. Thus, the decomposition of a vector can be regarded as a mapping of the global index, m, to an index pair, (p, i), specifying the processor number and local index.
For matrix problems, the processors are usually arranged as a P x Q grid. Thus, the grid consists of P rows of processors and Q columns of processors, and N_proc = PQ. Each processor can be uniquely identified by its position, (p, q), on the processor grid. The decomposition of an M x N matrix can be regarded as the Cartesian product of two vector decompositions, one for the rows and one for the columns. The row mapping decomposes the M rows of the matrix over the P processor rows, and the column mapping decomposes the N columns of the matrix over the Q processor columns. Thus, if the row mapping sends m to (p, i) and the column mapping sends n to (q, j), then the matrix entry with global index (m, n) is assigned to the processor at position (p, q) on the processor grid, where it is stored in a local array with index (i, j).
Two common decompositions are the linear and scattered decompositions. The linear decomposition assigns contiguous entries in the global vector to the processors in blocks,

m -> (p, i), where p = floor(m/B) and i = m mod B,

with block size B = ceil(M/P) for a vector of length M distributed over P processors. The scattered decomposition assigns consecutive entries in the global vector to different processors,

m -> (p, i), where p = m mod P and i = floor(m/P).
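As a concrete illustration, the following short C program (an illustrative sketch, not code from the Caltech libraries; the block-size convention B = ceil(M/P) for the linear decomposition is our assumption) prints the processor number and local index assigned to each global index under the two mappings just described.

#include <stdio.h>

/* Map a global index m (0 <= m < M) to (processor p, local index i)
   for a vector of length M distributed over P processors.            */

/* Linear (block) decomposition: contiguous blocks of size B = ceil(M/P). */
static void linear_map(int m, int M, int P, int *p, int *i)
{
    int B = (M + P - 1) / P;    /* block size */
    *p = m / B;
    *i = m % B;
}

/* Scattered (cyclic) decomposition: consecutive entries go to
   consecutive processors.                                             */
static void scattered_map(int m, int M, int P, int *p, int *i)
{
    (void)M;                    /* the vector length is not needed here */
    *p = m % P;
    *i = m / P;
}

int main(void)
{
    int M = 16, P = 4;
    for (int m = 0; m < M; m++) {
        int pl, il, ps, is;
        linear_map(m, M, P, &pl, &il);
        scattered_map(m, M, P, &ps, &is);
        printf("m=%2d  linear->(p=%d,i=%d)  scattered->(p=%d,i=%d)\n",
               m, pl, il, ps, is);
    }
    return 0;
}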
Figure 8.1 shows examples of these two types of decomposition for a matrix.
Figure 8.1: These Eight Figures Show Different Ways of Decomposing a Matrix. Each cell represents a matrix entry, and is labelled by the position, (p, q), in the processor grid of the processor to which it is assigned. To emphasize the pattern of decomposition, the matrix entries assigned to the processor in the first row and column of the processor grid are shown shaded. Figures (a) and (b) show linear and scattered row-oriented decompositions, respectively, for four processors arranged as a 4 x 1 grid (P=4, Q=1). In Figures (c) and (d), the corresponding column-oriented decompositions are shown (P=1, Q=4). Figures (e) through (h) show linear and scattered block-oriented decompositions for 16 processors arranged as a 4 x 4 grid (P=Q=4).
The mapping of processors onto the processor grid is determined by the programming methodology, which in turn depends closely on the concurrent hardware. For machines such as the nCUBE-1 hypercube, it is advantageous to exploit any locality properties in the algorithm in order to reduce communication costs. In such cases, processors may be mapped onto the processor grid by a binary Gray code scheme [ Fox:88a ], [ Saad:88a ], which ensures that adjacent processors on the processor grid are directly connected by a communication channel. For machines such as the Symult 2010, for which the time to send a message between any two processors is almost independent of their separation in the hardware topology, locality of communication is not an issue, and the processors can be mapped arbitrarily onto the processor grid.
Figure 8.2: A Schematic Representation of a Pipeline Broadcast for an Eight-Processor Computer. White squares represent processors not involved in communication, and such processors are available to perform calculations. Shaded squares represent processors involved in communication, with the degree of shading indicating how much of the data has arrived at any given step. In the first six steps, those processors not yet involved in the broadcast can continue to perform calculations. Similarly, in steps 11 through 16, processors that are no longer involved in communicating can perform useful work, since they now have all the data necessary to perform the next stage of the algorithm.
One of the first linear algebra algorithms implemented on the Caltech/JPL Mark II hypercube was the multiplication of two dense matrices, A and B, to form the product, C = AB [ Fox:85b ]. The algorithm uses a block-oriented, linear decomposition, which is optimal for machines with low message latency when the subblocks are (as nearly as possible) square. Let us denote by C(p,q) the subblock of C in the processor at position (p,q) of the processor grid, with a similar designation applying to the subblocks of A and B. Then, if the processor grid is square, that is, P = Q, the matrix multiplication algorithm in block form is C(p,q) = sum over r of A(p,r) B(r,q).
The case in which P and Q differ involves some extra bookkeeping, but does not change the concurrent algorithm in any essential way.
On the Mark II hypercube, communication cost increases with processor separation, so processors are mapped onto the processor grid using a binary Gray code scheme. Two types of communication are required at each stage of the algorithm, and both exploit the hypercube topology to minimize communication costs. Subblocks of B are communicated to the processor above in the processor grid, and subblocks of A are broadcast along processor rows by a communication pipeline (Figure 8.2 ).
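The structure of this broadcast-multiply-roll scheme can be sketched in a few lines of C. The following sequential simulation (illustrative only; it replaces the CrOS communication calls with ordinary array copies and uses one scalar "subblock" per processor) shows how, at each of the Q stages, a block of A is broadcast along its processor row, multiplied into the resident block of B, and the blocks of B are then rolled upward by one position.

#include <stdio.h>

#define Q 4   /* Q x Q processor grid; one scalar "subblock" per processor */

int main(void)
{
    double A[Q][Q], B[Q][Q], C[Q][Q] = {{0}}, Bwork[Q][Q], Abcast[Q];

    /* fill A and B with test data */
    for (int p = 0; p < Q; p++)
        for (int q = 0; q < Q; q++) {
            A[p][q] = p + 0.1 * q;
            B[p][q] = (p == q) ? 1.0 : 0.0;   /* identity, so C should equal A */
            Bwork[p][q] = B[p][q];            /* the block that will be rolled  */
        }

    /* Q stages of broadcast-multiply-roll */
    for (int k = 0; k < Q; k++) {
        /* "broadcast": processor (p, (p+k) mod Q) sends its A block along row p */
        for (int p = 0; p < Q; p++)
            Abcast[p] = A[p][(p + k) % Q];

        /* local multiply-accumulate in every processor (p, q) */
        for (int p = 0; p < Q; p++)
            for (int q = 0; q < Q; q++)
                C[p][q] += Abcast[p] * Bwork[p][q];

        /* "roll": every processor passes its B block to the processor above */
        for (int q = 0; q < Q; q++) {
            double top = Bwork[0][q];
            for (int p = 0; p < Q - 1; p++)
                Bwork[p][q] = Bwork[p + 1][q];
            Bwork[Q - 1][q] = top;
        }
    }

    for (int p = 0; p < Q; p++) {
        for (int q = 0; q < Q; q++)
            printf("%6.2f ", C[p][q]);
        printf("\n");
    }
    return 0;
}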
The matrix multiplication algorithm has been modified for use on the Caltech/JPL Mark IIIfp hypercube [ Hipes:89b ]. The Mark II hypercube is a homogeneous machine in the sense that there is only one level in the memory hierarchy, that is, the local memory of each processor. However, each processor of the Mark IIIfp hypercube has a Weitek floating-point processor with a data cache. To take full advantage of the high processing speed of the Weitek, data transfer between local memory and the Weitek data cache must be minimized. Since there are two levels in the memory hierarchy of each processor (local memory and cache), the Mark IIIfp is an inhomogeneous hypercube. The main computational task in each stage of the concurrent algorithm is to multiply the subblocks in each processor, and for large problems not all of the data will fit into the cache. The multiplication is, therefore, done in inner product form on the Weitek by further subdividing the subblocks in each processor. This intraprocessor subblocking allows the multiplication in each processor to be done in a number of stages, during each of which only the data needed for that stage are explicitly loaded into the cache.
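The idea of intraprocessor subblocking is essentially the familiar tiling of a matrix product so that only a small working set needs to be resident in the cache at any one time. The following C sketch (illustrative; the tile size TB and local matrix size N are arbitrary choices, and the Weitek-specific cache loading is not shown) computes the local product tile by tile.

#include <stdio.h>

#define N  8     /* size of the local subblock held by one processor          */
#define TB 4     /* tile size chosen so that a few TB x TB tiles fit in cache */

/* C += A * B, computed tile by tile.  In the Mark IIIfp code the inner
   tile product would run on the Weitek with its operands loaded into
   the data cache; here everything is ordinary C.                      */
static void tiled_multiply(double A[N][N], double B[N][N], double C[N][N])
{
    for (int it = 0; it < N; it += TB)
        for (int jt = 0; jt < N; jt += TB)
            for (int kt = 0; kt < N; kt += TB)
                /* product of one TB x TB tile pair */
                for (int i = it; i < it + TB; i++)
                    for (int j = jt; j < jt + TB; j++) {
                        double s = C[i][j];
                        for (int k = kt; k < kt + TB; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}

int main(void)
{
    static double A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j);   /* identity: C should come out equal to A */
        }
    tiled_multiply(A, B, C);
    printf("C[3][5] = %g (expect %g)\n", C[3][5], A[3][5]);
    return 0;
}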
Independently, Cherkassky et al. [ Cherkassky:88a ], Berntsen [ Berntsen:89a ], and Aboelaze [ Aboelaze:89a ] improved Fox's algorithm for dense matrix multiplication, reducing its time complexity. The resulting expressions involve the number of processors, the time for one addition or one multiplication, and machine-dependent communication parameters defining bandwidth and latency [ Chrisochoides:92a ]; in this model, the communication cost of transferring w words is a fixed startup (latency) term plus a per-word (bandwidth) term.
A concurrent algorithm to perform the matrix-vector product y = Ax has also been implemented on the Caltech/JPL Mark II hypercube [ Fox:88a ]. Again, a block-oriented, linear decomposition is used for the matrix A. The vector x is decomposed linearly over the processor columns, so that all the processors in the same processor column contain the same portion of x. Similarly, at the end of the algorithm, the vector y is decomposed over the processor rows, so that all the processors in the same processor row contain the same portion of y. In block form, the matrix-vector product is y(p) = sum over q of A(p,q) x(q).
As in the matrix multiplication algorithm, the concurrent matrix-vector product algorithm is optimal for low-latency, homogeneous hypercubes if the subblocks of A are square.
Banded Matrix-Vector Multiplication
First, we consider the parallelization of the operation y = Ax on a linear array of processors when A is a banded matrix with given upper and lower bandwidths, and we assume that the matrices are stored using a sparse scheme [ Rice:85a ]. For simplicity, we describe the case in which each processor stores one row. The proposed implementation is based on a decomposition of the matrix A into an upper triangular matrix U (including the diagonal of A) and a strictly lower triangular matrix L, such that A = U + L. Furthermore, we assume that row i of A and the corresponding entry of x are stored in processor i. The vector y can then be expressed as y = Ux + Lx. The products Ux and Lx are computed in two phases, each requiring a number of iterations determined by the corresponding bandwidth. The computation involved is described in Figure 8.3. To estimate the complexity of the above algorithm, we assume without loss of generality that A has K non-zero elements; the time complexity, and the memory space required for each subdomain, can then be expressed in terms of K, the bandwidths, and the machine communication parameters.
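A sequential C sketch of the two-phase computation may make the splitting clearer. Here A is held in an ordinary dense array for readability (the actual implementation uses a sparse banded storage scheme), and the bandwidths BU and BL are arbitrary illustrative values; the first loop accumulates y = Ux and the second adds Lx.

#include <stdio.h>

#define N  6
#define BU 1   /* upper bandwidth */
#define BL 2   /* lower bandwidth */

int main(void)
{
    double A[N][N] = {{0}}, x[N], y[N];

    /* build a small banded test matrix and a vector */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            if (j - i <= BU && i - j <= BL)
                A[i][j] = 1.0 + i + 0.1 * j;
    }

    /* Phase 1: y = U x, where U is the upper triangular part of A,
       including the diagonal (only entries within the upper bandwidth). */
    for (int i = 0; i < N; i++) {
        y[i] = 0.0;
        for (int j = i; j <= i + BU && j < N; j++)
            y[i] += A[i][j] * x[j];
    }

    /* Phase 2: y += L x, where L is the strictly lower triangular part
       (entries within the lower bandwidth).                            */
    for (int i = 0; i < N; i++)
        for (int j = (i - BL > 0 ? i - BL : 0); j < i; j++)
            y[i] += A[i][j] * x[j];

    for (int i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}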
Figure 8.3:
The Pseudo Code for Banded Matrix-Vector Multiplication
Banded Matrix-Matrix Multiplication
Second, we consider the implementation of the product C = AB on a ring of processors when A and B are banded matrices with given upper and lower bandwidths. Again, we describe the realization for the case in which each processor holds one row and one column; the general case is a straightforward generalization. Processor i computes column i of the matrix C, and holds one row of the matrix A and one column of the matrix B.
The algorithm consists of two phases, as in the banded matrix-vector multiplication. In the first phase, each node starts by calculating its local contribution, and then each node i passes its data to node i-1; this phase is repeated a number of times determined by the bandwidth. In the second phase, each node restores its data and passes it to node i+1; this phase is likewise repeated a number of times determined by the bandwidth. The implementation proposed for this operation is described in Figure 8.4.
Figure 8.4:
The Pseudo Code for Banded Matrix-Matrix Multiplication
Without loss of generality, we assume that K_A and K_B are the numbers of non-zero elements of the matrices A and B, respectively. The parallel execution time can then be expressed in terms of these quantities, the bandwidths, and the machine communication parameters.
The above realization has been implemented on the nCUBE-1 [ Chrisochoides:90a ]. Tables 8.1 and 8.2 indicate the performance of BLAS 2 computations for a block tridiagonal matrix where each block is dense. In these experiments, each processor has the same computation to perform. The results indicate very satisfactory performance for this type of data.
Table 8.1: Measured maximum total elapsed time (in seconds) for multiplication of a block tridiagonal matrix with a vector.
Table 8.2: Measured maximum elapsed time (in seconds) for multiplication of a block tridiagonal matrix by a block tridiagonal matrix.
Factorization of Full Matrices
LU factorization of dense matrices, and the closely related Gaussian elimination algorithm, are widely used in the solution of linear systems of equations of the form Ax = b. LU factorization expresses the coefficient matrix, A, as the product of a lower triangular matrix, L, and an upper triangular matrix, U, so that A = LU. After factorization, the original system of equations can be written as a pair of triangular systems, Ly = b and Ux = y.
The first of these systems can be solved by forward reduction, and back substitution can then be used to solve the second system to give x. If A is an M-by-M matrix, LU factorization proceeds in M-1 steps, in the kth of which column k of L and row k of U are found, that is, the multipliers l(i,k) = a(i,k)/a(k,k) and the entries u(k,j) = a(k,j),
and the entries of A in a ``window'' extending from column k+1 to M-1 and row k+1 to M-1 are updated as a(i,j) <- a(i,j) - l(i,k) u(k,j).
Partial pivoting is usually performed to improve numerical stability. This involves reordering the rows or columns of A .
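For reference, the following self-contained C routine is a conventional sequential LU factorization with partial pivoting over rows (an illustrative sketch, not the concurrent code); it shows the three actions taken at step k: the pivot search and row interchange, the formation of the multipliers (column k of L), and the update of the trailing window.

#include <stdio.h>
#include <math.h>

#define M 4

/* In-place LU factorization with partial pivoting over rows: after the
   loop, U is stored on and above the diagonal of A and the multipliers
   (the strictly lower part of L) below it; piv[] records the row
   interchanges.                                                        */
static void lu_factor(double A[M][M], int piv[M])
{
    for (int k = 0; k < M - 1; k++) {
        /* find the pivot: the entry of largest magnitude in column k */
        int p = k;
        for (int i = k + 1; i < M; i++)
            if (fabs(A[i][k]) > fabs(A[p][k])) p = i;
        piv[k] = p;

        /* swap rows k and p */
        for (int j = 0; j < M; j++) {
            double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t;
        }

        /* column k of L: multipliers l(i,k) = a(i,k) / a(k,k) */
        for (int i = k + 1; i < M; i++)
            A[i][k] /= A[k][k];

        /* update the window: a(i,j) -= l(i,k) * u(k,j) */
        for (int i = k + 1; i < M; i++)
            for (int j = k + 1; j < M; j++)
                A[i][j] -= A[i][k] * A[k][j];
    }
}

int main(void)
{
    double A[M][M] = {{2, 1, 1, 0}, {4, 3, 3, 1},
                      {8, 7, 9, 5}, {6, 7, 9, 8}};
    int piv[M];
    lu_factor(A, piv);
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < M; j++) printf("%7.3f ", A[i][j]);
        printf("\n");
    }
    printf("pivots:");
    for (int k = 0; k < M - 1; k++) printf(" %d", piv[k]);
    printf("\n");
    return 0;
}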
In the absence of pivoting, the row- and column-oriented decompositions involve almost the same amounts of communication and computation. However, the row-oriented approach is generally preferred as it is more convenient for the back substitution phase [ Chu:87a ], [ Geist:86a ], although column-based algorithms have been proposed [ Li:87a ], [ Moler:86a ]. A block-oriented decomposition minimizes the amount of data communicated, and is the best approach on hypercubes with low message latency. However, since the block decomposition generally involves sending shorter messages, it is not suitable for machines with high message latency. In all cases, pipelining is the most efficient way of broadcasting rows and columns of the matrix since it minimizes the idle time that a processor must wait when participating in a broadcast , and effectively overlaps communication and calculation.
Load balance is an important issue in LU factorization. If a linear decomposition is used, the computation will be imbalanced and processors will become idle once they no longer contain matrix entries in the computational window. A scattered decomposition is much more effective in keeping all the processors busy, as shown in Figure 8.5 . The load imbalance is least when a scattered block-oriented decomposition is used.
Figure 8.5:
The Shaded Area in These Two Figures Shows the Computational Window
at the Start of Step Three of the LU Factorization Algorithm. In (a) we see
that by this stage the processors in the first row and column of the processor
grid have become idle if a linear block decomposition is used. In contrast,
in (b) we see that all processors continue to be involved in the computation
if a scattered block decomposition is used.
At Caltech, Van de Velde has investigated LU factorization of full matrices for a number of different pivoting strategies, and for various types of matrix decomposition, on the Intel iPSC/2 hypercube and the Symult 2010 [ Velde:90a ]. One observation based on this work was that if a linear decomposition is used, then in many cases pivoting results in a faster algorithm than no pivoting, since the exchange of rows effectively randomizes the decomposition, resulting in better load balance. Van de Velde also introduced a clever enhancement to the standard concurrent partial pivoting procedure. To illustrate this, consider partial pivoting over rows. Usually, only the processors in a single processor column are involved in the search for the pivot candidate, and the other processors are idle at this time. In Van de Velde's multirow pivoting scheme, in each processor column a search for a pivot is conducted concurrently within a randomly selected column of the matrix. This incurs no extra cost compared with the standard pivoting procedure, but improves the numerical stability. A similar multicolumn pivoting scheme can be used when pivoting over columns. Van de Velde concludes from his extensive experimentation with LU factorization schemes that a scattered decomposition generally results in a more efficient algorithm on the iPSC/2 and Symult 2010, and his work illustrates the importance of decomposition and pivoting strategy in determining load balance, and hence concurrent efficiency.
Figure 8.6: Schematic Representation of Step k of LU Factorization for an M-by-M Matrix, A, with Bandwidth w. The computational window is shown as a dark-shaded square, and matrix entries in this region are updated at step k. The light-shaded part of the band above and to the left of the window has already been factorized, and in an in-place algorithm contains the appropriate columns and rows of L and U. The unshaded part of the band below and to the right of the window has not yet been modified. The shaded region of the matrix B represents the window updated in step k of forward reduction, and in step M-k-1 of back substitution.
LU Factorization of Banded Matrices
Aldcroft et al. [ Aldcroft:88a ] have investigated the solution of linear systems of equations by LU factorization, followed by forward elimination and back substitution, when the coefficient matrix, A, is an M-by-M matrix of bandwidth w=2m-1. The case of multiple right-hand sides was considered, so the system may be written as AX=B, where X and B are matrices with one column per right-hand side. The LU factorization algorithm for banded matrices is essentially the same as that for full matrices, except that the computational window containing the entries of A updated in each step is different. If no pivoting is performed, the window has a fixed size determined by the bandwidth and lies along the diagonal, as shown in Figure 8.6. If partial pivoting over rows is performed, then fill-in will occur, and the window may grow to a larger maximum size. In the work of Aldcroft et al., the size of the window was allowed to vary dynamically. This involved some additional bookkeeping, but is more efficient than working with a fixed window of the maximum size. Additional complications arise from only storing the entries of A within the band in order to reduce memory usage.
As in the full matrix case, good load balance is ensured by using a scattered block decomposition for the matrices. As noted previously, this choice of decomposition also minimizes communication cost on low latency multiprocessors, such as the Caltech/JPL Mark II hypercube used in this work, but may not be optimal for machines in which the message startup cost is substantial.
A comparison between an analytic performance model and results on the Caltech/JPL Mark II hypercube shows that the concurrent overhead for the LU factorization algorithm falls to zero as the grain size, that is, the amount of the matrix stored in each processor, increases. This is true in both the pivoting and non-pivoting cases. Thus, the LU factorization algorithm scales well to larger machines.
As described for his chemistry application in Section 8.2, Hipes has studied the use of the Gauss-Jordan (GJ) algorithm as a means of solving systems of linear equations [ Hipes:89b ]. On a sequential computer, LU factorization followed by forward reduction and back substitution is preferable to GJ for solving linear systems, since the former has a lower operation count. Another apparent drawback of GJ is that it has generally been believed that the right-hand sides must be available a priori, which is a handicap in applications requiring the solution for multiple right-hand sides. Hipes' work has shown that this is not the case, and that a well-written parallel GJ solver is significantly more efficient on hypercubes than LU factorization with triangular solvers.
As noted by Gerasoulis, et al. [ Gerasoulis:88a ], GJ does not require the solution of triangular systems. The solution of such systems by LU factorization features an outer loop of fixed length and two inner loops of decreasing length, whereas GJ has two outer fixed-length loops and only one inner loop of decreasing length. GJ is, therefore, intrinsically more parallel than the LU solver, and its better load balance compensates for its higher operation count. Hipes has pointed out that the multipliers generated in the GJ algorithm can be saved where zeros are produced in the coefficient matrix. The entries in the coefficient matrix are, therefore, overwritten by the GJ multipliers, and we shall call this the GJ factorization (although we are not actually expressing the original matrix A as the product of two matrices). It is now apparent that the right-hand side matrix does not have to be known in advance, since a solution can be obtained using the previously computed multipliers. Another factor, noted by Hipes, favoring the use of the GJ solver on a multiprocessor, is the larger grain size maintained throughout the GJ factorization and solution phases, and the lower communication cost in the GJ solution phase.
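The following small C example (illustrative only; pivoting is omitted for clarity) shows the GJ factorization idea: each elimination step operates on all rows, the multipliers are stored where the zeros would appear, and the pivots are kept on the diagonal, so the stored quantities would suffice to process further right-hand sides later. Note also that the right-hand side emerges as the solution directly, with no triangular solve.

#include <stdio.h>

#define N 3

/* Gauss-Jordan elimination (no pivoting, for clarity): at step k the
   pivot row is scaled and used to zero column k in ALL other rows, not
   just the rows below it.  Following Hipes's observation, the
   multipliers are saved in the positions where zeros are produced.    */
int main(void)
{
    double A[N][N] = {{4, 2, 1}, {2, 5, 2}, {1, 2, 7}};
    double b[N]    = {7, 9, 10};

    for (int k = 0; k < N; k++) {
        double pivot = A[k][k];

        /* scale the pivot row (and the right-hand side entry) */
        for (int j = k + 1; j < N; j++) A[k][j] /= pivot;
        b[k] /= pivot;
        A[k][k] = pivot;   /* keep the pivot so new right-hand sides can be processed later */

        /* eliminate column k from every other row */
        for (int i = 0; i < N; i++) {
            if (i == k) continue;
            double m = A[i][k];                       /* multiplier */
            for (int j = k + 1; j < N; j++) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
            A[i][k] = m;   /* save the multiplier where the zero would appear */
        }
    }

    /* after the loop, b holds the solution x directly: no triangular solve */
    for (int i = 0; i < N; i++) printf("x[%d] = %g\n", i, b[i]);
    return 0;
}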
Hipes has implemented his GJ solver on the Caltech/JPL Mark III and nCUBE-1 hypercubes, and compared the performance with the usual LU solver [ Hipes:89d ]. In the GJ factorization, a scattered column decomposition is used, similar to that shown in Figure 8.1 (d). This ensures good load balance as columns become eliminated in the course of the algorithm. In the LU factorization, both rows and columns are eliminated so a scattered block decomposition is used. On both machines, it was found that the GJ approach is faster for sufficiently many right-hand sides.
Hipes has also compared the Gauss-Jordan (GJ) and Gaussian elimination (GE) algorithms for finding the inverse of a matrix [ Hipes:88a ]. This work was motivated by an application program that integrates a special system of ordinary differential equations arising in chemical dynamics simulations [ Hipes:87a ], [ Kuppermann:86a ]. The sequential GJ and GE algorithms have the same operation count for matrix inversion. However, Hipes found that the parallel GJ inversion has a more homogeneous load distribution and requires fewer communication calls than GE inversion, and so should result in a more efficient parallel algorithm. Hipes compared the two methods on the Caltech/JPL Mark II hypercube and, as expected, found the GJ inversion algorithm to be the faster.
Fox and Furmanski have also investigated matrix algorithms at Caltech [ Furmanski:88b ]. Among the parallel algorithms they discuss is the power method for finding the largest eigenvalue, and corresponding eigenvector, of a matrix A. This starts with an initial guess, x_0, at the eigenvector, and then generates subsequent estimates using x_{k+1} = A x_k / ||A x_k||.
As k becomes large, ||A x_k|| tends to the eigenvalue with the largest absolute value (except for a possible sign change), and x_k tends to the corresponding eigenvector. Since the main component of the algorithm is matrix-vector multiplication, it can be done as discussed in Section 8.1.2.
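A minimal sequential C version of the power method (illustrative; the matrix, starting vector, and fixed iteration count are arbitrary choices) is shown below; on the hypercube, the y = Ax step is simply replaced by the concurrent matrix-vector product.

#include <stdio.h>
#include <math.h>

#define N 3

/* Power method: repeatedly multiply the current estimate by A and
   normalize; the norm converges to |lambda_max| and the vector to the
   corresponding eigenvector (up to sign).                             */
int main(void)
{
    double A[N][N] = {{4, 1, 0}, {1, 3, 1}, {0, 1, 2}};
    double x[N] = {1, 1, 1};          /* initial guess x_0 */
    double lambda = 0.0;

    for (int k = 0; k < 100; k++) {
        double y[N], norm = 0.0;

        /* y = A x (on the hypercube this is the concurrent
           matrix-vector product of Section 8.1.2)            */
        for (int i = 0; i < N; i++) {
            y[i] = 0.0;
            for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
            norm += y[i] * y[i];
        }
        norm = sqrt(norm);

        /* x_{k+1} = A x_k / ||A x_k||; the norm estimates |lambda_max| */
        for (int i = 0; i < N; i++) x[i] = y[i] / norm;
        lambda = norm;
    }

    printf("largest |eigenvalue| approx %g\n", lambda);
    printf("eigenvector approx (%g, %g, %g)\n", x[0], x[1], x[2]);
    return 0;
}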
A more challenging algorithm to parallelize is the tridiagonalization of a symmetric matrix by Householder's method, which involves the application of a series of rotations to the original matrix. Although the basic operations involved in each rotation are straightforward (matrix-vector multiplication, scalar products, and so on), special care must be taken to balance the load. This is particularly difficult since the symmetry of the matrix A means that the basic structure being processed is triangular, and this is decomposed into a set of local triangular matrices in the individual processors. Load balance is optimized by scattering the rows over the processors, and the algorithm requires vectors to be broadcast and transposed.
Since matrix algorithms play such an important role in scientific computing, it is desirable to develop a library of linear algebra routines for concurrent multiprocessors. Ideally, these routines should be both optimal and general-purpose, that is, portable to a wide variety of multiprocessors. Unfortunately, these two objectives are antagonistic, and an algorithm that is optimal on one machine will often not be optimal on another. Even among hypercubes it is apparent that the optimal decomposition, and hence the optimal algorithm, depends on the message latency, with a block decomposition being best for low-latency machines, and a row decomposition often being best for machines with high latency. Another factor to be considered is that often a matrix algorithm is only part of a larger application code, so the data decomposition before and after the matrix computation may not be optimal for the matrix algorithm itself. We are faced with the choice of either transforming the decomposition before and after the matrix computation so that the optimal matrix algorithm can be used, or leaving the decomposition as it is and using a suboptimal matrix algorithm. To summarize, the main issues that must be addressed are the optimality of an algorithm on a given machine, its portability across machines, and the cost of any redistribution of data required before and after the matrix computation.
Two approaches to designing linear algebra libraries have been followed at Caltech. Fox, Furmanski, and Walker chose optimality as the most important concern in developing a set of linear algebra routines for low-latency, homogeneous hypercubes, such as the Caltech/JPL Mark II hypercube. These routines feature the use of the scattered decomposition to ensure load balance and to minimize communication costs. Transformations between decompositions are performed using the comutil library of global communication routines [ Angus:90a ], [ Fox:88h ], [ Furmanski:88b ]. This approach was mainly dictated by historical factors, rather than being a considered design decision: the hypercubes used most at Caltech up to 1987 were homogeneous and had low latency.
A different, and probably more useful, approach has been taken at Caltech by Van de Velde [ Velde:89b ], who opted for general-purpose library routines. The decomposition currently in use is passed to a routine through its argument list, so in general the decomposition is not changed and a suboptimal algorithm is used. The main advantage of this approach is that it is decomposition-independent and allows portability of code among a wide variety of multiprocessors. Moreover, the suboptimality of a routine must be weighed against the possibly large cost of transforming the data decomposition, so a suboptimal routine does not necessarily result in a slower algorithm once the time to change the decomposition is taken into account.
Occasionally, it may be advantageous to change the decomposition, and most changes of this type are what Van de Velde calls orthogonal . In an orthogonal redistribution of the data, each pair of processors exchanges the same amount of data. Van de Velde has shown [ Velde:90c ] that any orthogonal redistribution can be performed by the following sequence of operations: Local permutation - Global transposition - Local permutation
A local permutation merely involves reindexing the local data within individual processors. If we have P processors and P data items in each processor, then the global transposition, T, takes the item with local index i in processor p and sends it to processor i, where it is stored with local index p. Thus, T(p, i) = (i, p).
Van de Velde's transpose routine is actually a generalization of the hypercube-specific index routine in the comutil library.
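The index arithmetic of the global transposition is easy to state in code. The following C sketch (a sequential simulation of P processors, each holding P items; it is not Van de Velde's routine) applies T(p, i) = (i, p) and prints what each "processor" holds afterwards; every pair of processors has exchanged exactly one item, which is the defining property of an orthogonal redistribution.

#include <stdio.h>

#define P 4   /* P processors, P data items per processor */

/* Global transposition T: the item with local index i in processor p
   moves to processor i, where it is stored with local index p.  An
   orthogonal redistribution is a local permutation, followed by T,
   followed by another local permutation.                              */
int main(void)
{
    int before[P][P], after[P][P];

    /* tag each item with 100*p + i so its origin is visible */
    for (int p = 0; p < P; p++)
        for (int i = 0; i < P; i++)
            before[p][i] = 100 * p + i;

    /* T(p, i) = (i, p): every pair of processors exchanges one item */
    for (int p = 0; p < P; p++)
        for (int i = 0; i < P; i++)
            after[i][p] = before[p][i];

    for (int p = 0; p < P; p++) {
        printf("processor %d holds:", p);
        for (int i = 0; i < P; i++) printf(" %03d", after[p][i]);
        printf("\n");
    }
    return 0;
}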
Van de Velde has implemented his linear algebra library on the Intel iPSC/2 and the Symult 2010, and has used it in investigations of concurrent LU and QR factorization algorithms [ Velde:89b ], [ Velde:90a ], and in studies of invariant manifolds of dynamical systems [ Lorenz:89a ], [ Velde:90b ].
A group centered at Oak Ridge National Laboratory and the University of Tennessee is leading the development of a major new portable parallel full matrix library called ScaLAPACK [ Choi:92a ], [ Choi:92b ]. This is built around an elegant formulation of matrix problems in terms of the so-called level three BLAS, which are a set of submatrix operations introduced to support the basic LAPACK library [ Anderson:90c ], [ Dongarra:90a ]. This full matrix system embodies earlier ideas from LINPACK and EISPACK and is designed to ensure data locality and get good performance on shared-memory and vector supercomputers. The multicomputer ScaLAPACK is built around the scattered block decomposition described earlier.
The basic matrix algorithms appear to fall into the synchronous class in the language of Section 3.4. Correspondingly, one would expect to get good performance on SIMD machines. This is indeed true for matrix multiplication, but it is hard to get good SIMD performance on LU factorization and the more complicated matrix algorithms. Here the algorithm is not fully synchronous. In particular, there are several steps involving operations on a single row or column. These lead to two problems. Firstly, in such steps the available parallelism is reduced from order n^2 (for an n-by-n matrix) to order n; this is typically a serious problem on SIMD machines, such as the CM-2 or Maspar MP-1,2, which are fine grain and require ``massive parallelism.'' Secondly, the use of pivoting clearly introduces irregularity into the algorithm, which complicates the SIMD implementation. For these reasons, most research on matrix algorithms has concentrated on MIMD multicomputers, such as the hypercube.
Work on the concurrent Gauss-Jordan algorithm was mostly done by Paul Hipes. Eric Van de Velde developed the linear algebra library discussed in Section 8.1.7, and collaborated with Jens Lorenz in the work on invariant manifolds. Many of the other concurrent algorithms were devised by Geoffrey Fox. Wojtek Furmanski and David Walker worked on routines for transforming decompositions. The implementation of the banded LU solver on the Caltech/JPL Mark II hypercube was done by Tom Aldcroft, Arturo Cisneros, and David Walker.
There is considerable current interest in performing accurate quantum mechanical, three-dimensional, reactive scattering cross section calculations. Accurate solutions have, until recently, proved difficult and computationally expensive to obtain, in large part because of the lack of sufficiently powerful computers. Prior to the advent of supercomputers, one could only solve the equations of motion for model systems or for sufficiently light atom-diatom systems at low energy [Schatz:75a;76a;76b]. As a result of the development of efficient methodologies and increased access to supercomputers, there has been a remarkable surge of activity in this field. The use of symmetrized hyperspherical coordinates [ Kuppermann:75a ] and of the local hyperspherical surface function formalism [ Hipes:87a ], [ Kuppermann:86a ], [ Ling:75a ] has proven to be a successful approach to solving the three-dimensional Schrödinger equation [Cuccaro:89a;89b], [ Hipes:87a ], [ Kuppermann:86a ]. However, even for modest reactive scattering calculations, the memory and CPU demands are so great that CRAY-type supercomputers will soon be insufficient to sustain progress.
In this section, we show how quantum mechanical reactive scattering calculations can be structured so as to use MIMD-type parallel computer architectures efficiently. We present a concurrent algorithm for calculating local hyperspherical surface functions (LHSF) and use a parallelized version [ Hipes:88b ] of Johnson's logarithmic derivative method [Johnson:73a;77a;79a], modified to include the improvements suggested by Manolopoulos [ Manolopoulos:86a ], for integrating the resulting coupled-channel reactive scattering equations. We compare the results of scattering calculations on the Caltech/JPL Mark IIIfp 64-processor hypercube for the H + H2 system, for the J=0,1,2 partial waves on the LSTH [ Liu:73a ], [ Siegbahn:78a ], [Truhlar:78a;79a] potential energy surface, with those of calculations done on a CRAY X-MP/48 and a CRAY-2. Both accuracy and performance are discussed, and speed estimates are made for the Mark IIIfp 128-processor hypercube soon to become available and compared with those of the San Diego Supercomputer Center CRAY Y-MP/864 machine, which has recently been put into operation.
The detailed formulation of reactive scattering based on hyperspherical coordinates and local variational hyperspherical surface functions (LHSF) is discussed elsewhere [ Kuppermann:86a ], [ Hipes:87a ], [ Cuccaro:89a ]. We present a very brief review to facilitate the explanation of the parallel algorithms.
For a triatomic system, we label the three atoms α, β, and γ, and let (λ, ν, κ) be any cyclic permutation of these indices. We define as coordinates the mass-scaled [Delves:59a;62a] internuclear vector from atom ν to atom κ, and the mass-scaled position vector of atom λ with respect to the center of mass of the νκ diatom. The symmetrized hyperspherical coordinates [ Kuppermann:75a ] are the hyperradius ρ and a set of five angles, denoted collectively as ω: a hyperangle, the angle between the two mass-scaled vectors, the two polar angles of the position vector in a space-fixed frame, and the tumbling angle of the half-plane containing the three atoms around its edge. The Hamiltonian is the sum of a radial kinetic energy operator in ρ and the surface Hamiltonian, which contains all differential operators in ω and the electronically adiabatic potential. The surface Hamiltonian depends on ρ parametrically and is therefore the ``frozen''-hyperradius part of the full Hamiltonian.
The scattering wave function is labelled by the total angular momentum J, its projection M on the laboratory-fixed Z axis, the inversion parity with respect to the center of mass of the system, and the irreducible representation of the permutation group of the identical nuclei to which the electronuclear wave function, excluding the nuclear spin part, belongs [Lepetit:90a;90b]. It can be expanded in terms of the LHSF defined below, calculated at a discrete set of values of the hyperradius (Equation 8.12).
The index i is introduced to permit consideration of a set of many linearly independent solutions of the Schrödinger equation corresponding to distinct initial conditions which are needed to obtain the appropriate scattering matrices.
The LHSF and associated energies are, respectively, the eigenfunctions and eigenvalues of the surface Hamiltonian. They are obtained using a variational approach [ Cuccaro:89a ]. The variational basis set consists of products of Wigner rotation matrices, associated Legendre functions, and functions of the remaining internal angle which depend parametrically on the hyperradius and are obtained from the numerical solution of one-dimensional eigenvalue-eigenfunction differential equations involving a potential related to the full interaction potential.
The variational method leads to a generalized eigenvalue problem with coefficient and overlap matrices whose elements are five-dimensional integrals involving the variational basis functions.
The coefficients defined by Equation 8.12 satisfy a coupled set of second-order differential equations involving an interaction matrix whose elements are integrals over pairs of LHSF.
The configuration space is divided into a set of Q hyperspherical shells, within each of which we choose a value of the hyperradius used in expansion 8.12.
When changing from the LHSF set of one shell to that of the next, neither the wave function nor its derivative with respect to the hyperradius should change. This imposes continuity conditions on the expansion coefficients and their hyperradial derivatives at the shell boundaries, involving the overlap matrix between the LHSF evaluated at the two neighboring values of the hyperradius.
The five-dimensional integrals required to evaluate the elements of these matrices are performed analytically over three of the five angles and by two-dimensional numerical quadratures over the remaining two. These quadratures account for 90% of the total time needed to calculate the LHSF and the interaction and overlap matrices.
The system of second-order ordinary differential equations in the expansion coefficients is integrated as an initial value problem from small to large values of the hyperradius using Manolopoulos' logarithmic derivative propagator [ Manolopoulos:86a ]. Matrix inversions account for more than 90% of the time used by this propagator. All aspects of the physics can be extracted from the solutions at large hyperradius by a projection at constant hyperradius [ Hipes:87a ], [ Hood:86a ], [ Kuppermann:86a ].
The computer used for this work is a 64-processor Mark IIIfp hypercube. The Crystalline Operating System (CrOS), a channel-addressed, synchronous communication system, provides the library routines to handle communications between nodes [Fox:85d;85h;88a]. The programs are written in the C programming language, except for the time-consuming two-dimensional quadratures and matrix inversions, which are optimized in assembly language.
The hypercube was configured as a two-dimensional array of processors. The mapping is done using binary Gray codes [ Gilbert:58a ], [ Fox:88a ], [ Salmon:84b ], which gives the Cartesian coordinates in processor space and communication channel tags for a processor's nearest neighbors.
We mapped the matrices into processor space by a local (block) decomposition. Let P and Q be the numbers of processors in the rows and columns of the hypercube configuration, respectively. Element (i, j) of an M-by-N matrix is placed in the processor in row int(iP/M) and column int(jQ/N), where int(x) means the integer part of x.
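In code, the placement rule might look like the following C fragment; the exact integer-part expression is our reading of the rule above (one of several equivalent conventions when the matrix dimensions are multiples of the processor grid dimensions), not a transcription of the hypercube code.

#include <stdio.h>

/* One plausible reading of the placement rule described above: element
   (i, j) of an M x N matrix goes to the processor in row int(i*P/M)
   and column int(j*Q/N) of a P x Q processor grid.                    */
static void place(int i, int j, int M, int N, int P, int Q,
                  int *prow, int *pcol)
{
    *prow = (i * P) / M;   /* integer division takes the integer part */
    *pcol = (j * Q) / N;
}

int main(void)
{
    int M = 8, N = 8, P = 2, Q = 4;
    for (int i = 0; i < M; i += 3)
        for (int j = 0; j < N; j += 3) {
            int pr, pc;
            place(i, j, M, N, P, Q, &pr, &pc);
            printf("element (%d,%d) -> processor (%d,%d)\n", i, j, pr, pc);
        }
    return 0;
}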
The parallel code implemented on the hypercube consists of five major steps. Step one constructs, for each value of the hyperradius, a primitive basis set composed of products of Wigner rotation matrices, associated Legendre functions, and the numerical one-dimensional functions mentioned in Section 8.2.2, obtained by solving the corresponding one-dimensional eigenvalue-eigenvector differential equation using a finite difference method. This requires that a subset of the eigenvalues and eigenvectors of a tridiagonal matrix be found.
A bisection method [ Fox:84g ], [Ipsen:87a;87c], which accomplishes the eigenvalue computation using the TRIDIB routine from EISPACK [ Smith:76a ], was ported to the Mark IIIfp. This implementation of the bisection method allows computation of any number of consecutive eigenvalues specified by their indices. Eigenvectors are obtained using the EISPACK inverse iteration routine TINVIT with modified Gram-Schmidt orthogonalization. Each processor solves independent tridiagonal eigenproblems, since the number of eigenvalues desired from each tridiagonal system is small, but there are a large number of distinct tridiagonal systems. To achieve load balancing, we distributed subsets of the primitive functions among the processors in such a way that no processor computes more than one eigenvalue and eigenvector more than any other. These large-grain tasks are most easily implemented on MIMD machines; SIMD (Single Instruction Multiple Data) machines would require more extensive modifications and would be less efficient because of the sequential nature of effective eigenvalue iteration procedures. The one-dimensional bases obtained are then broadcast to all the other nodes.
In step two, a large number of two-dimensional quadratures involving the primitive basis functions which are needed for the variational procedure are evaluated. These quadratures are highly parallel procedures requiring no communication overhead once each processor has the necessary subset of functions. Each processor calculates a subset of integrals independently.
Step three assembles these integrals into the real, symmetric, dense coefficient and overlap matrices, which are distributed over processor space. The entire spectrum of eigenvalues and eigenvectors of the associated variational problem is sought. With the parallel implementation of the Householder method [ Fox:84h ], [ Patterson:86a ], this generalized eigensystem is tridiagonalized, and the resulting single tridiagonal matrix is solved completely in each processor with the QR algorithm [ Wilkinson:71a ]. The QR implementation is purely sequential, since each processor obtains the entire solution to the eigensystem. However, only different subsets of the solution are kept in different processors for the evaluation of the interaction and overlap matrices in step four. This part of the algorithm is not time consuming, and the straightforward sequential approach was chosen. It has the further effect that the resulting solutions are fully distributed, so no communication is required.
Step four evaluates the two-dimensional quadratures needed for the interaction and overlap matrices. The same type of algorithms are used as in step two. By far, the most expensive part of the sequential version of the surface function calculation is the calculation of the large number of two-dimensional numerical integrals required by steps two and four. These steps are, however, highly parallel and well suited for the hypercube.
Step five uses Manolopoulos' [ Manolopoulos:86a ] algorithm to integrate the coupled linear ordinary differential equations. The parallel implementation of this algorithm is discussed elsewhere [ Hipes:88b ]. The algorithm is dominated by parallel Gauss-Jordan matrix inversion and is I/O intensive, requiring the input of one interaction matrix per integration step. To reduce the I/O overhead, a second source of parallelism is exploited. The entire interaction matrix and overlap matrix data sets, at all values of the hyperradius, are loaded across the processors, and many collision energies are calculated simultaneously. This strategy works because the same set of data is used for each collision energy, and because enough main memory is available. Calculation of scattering matrices from the final logarithmic derivative matrices is not computationally intensive, and is done sequentially.
The program steps were all run on the Weitek coprocessor, which only supports 32-bit arithmetic. Experimentation has shown that this precision is sufficient for the work reported below. The 64-bit arithmetic hardware needed for larger calculations was installed after the present calculations were completed.
Accuracy
Calculations were performed for the H + H2 system on the LSTH surface [ Liu:73a ], [ Siegbahn:78a ], [Truhlar:78a;79a] for partial waves with total angular momentum J = 0, 1, 2. Flux is conserved to better than 1% for J = 0, 2.3% for J = 1, and 3.6% for J = 2 for all open channels over the entire energy range considered.
To illustrate the accuracy of the 32-bit arithmetic calculations, the scattering results from the 64-processor Mark IIIfp are shown in Figure 8.7 for J = 0, in which some transition probabilities are plotted as a function of the total collision energy, E. The differences between these results and those obtained using a CRAY X-MP/48 and a CRAY-2 do not exceed 0.004 in absolute value over the energy range investigated.
Figure 8.7: Probabilities as a Function of Total Energy E (Lower Abscissa) and Initial Relative Translational Energy (Upper Abscissa) for a Symmetry Transition in H + H2 Collisions on the LSTH Potential Energy Surface. The symbols label asymptotic states of the system in which v and j are the vibrational and rotational quantum numbers of the initial or final states. The vertical arrows on the upper abscissa denote the energies at which the corresponding states open up. The length of those arrows decreases as v spans the values 0, 1, and 2, and the numbers 0, 5, and 10 associated with the arrows define a labelling for the value of j. The number of LHSF used was 36, and the number of primitives used to calculate these surface functions was 80.
Table 8.3:
Performance of the surface function code.*
Timing and Parallel Efficiency
In Tables 8.3 and 8.4 , we present the timing data on the 64-processor Mark IIIfp, a CRAY X-MP/48 and a CRAY 2, for both the surface function code (including calculation of the overlap and interaction matrices) and the logarithmic derivative propagation code. For the surface function code, the speeds on the first two machines are about the same. The CRAY 2 is 1.43 times faster than the Mark IIIfp and 1.51 times faster than the CRAY X-MP/48 for this code. The reason is that this program is dominated by matrix-vector multiplications which are done in optimized assembly code in all three machines. For this particular operation, the CRAY 2 is 2.03 times faster than the CRAY X-MP/48 whereas, for more memory-intensive operations, the CRAY 2 is slower than the CRAY X-MP/48 [ Pfeiffer:90a ]. A slightly larger primitive basis set is required on the Mark IIIfp in order to obtain surface function energies of an accuracy equivalent to that obtained with the CRAY machines. This is due to the lower accuracy of the 32-bit arithmetic of the former with respect to the 64-bit arithmetic of the latter.
Table 8.4: Performance of the logarithmic derivative code. Based on a calculation using 245 surface functions and 131 energies, and a logarithmic derivative integration step of 0.01 bohr.
The efficiency (ε) of the parallel LHSF code was determined using the definition ε = T_1/(N T_N), where T_1 and T_N are, respectively, the execution times using a single processor and N processors. The single-processor times are obtained from runs performed after removing the overhead of the parallel code, that is, after removing the communication calls and some logical statements. Perfect efficiency (ε = 1) implies that the N-processor hypercube is N times faster than a single processor. In Figure 8.8, efficiencies for the surface function code (including the calculation of the overlap and interaction matrices) as a function of the size of the primitive basis set are plotted for 2, 4, 8, 16, 32, and 64 processor configurations of the hypercube. The global dimensions of the matrices used are chosen to be integer multiples of the number of processor rows and columns in order to ensure load balancing among the processors. Because of the limited size of a single processor's memory, the efficiency determination is limited to 32 primitives. As shown in Figure 8.8, the efficiencies increase monotonically and approach unity asymptotically as the size of the calculation increases. Converged results require large enough primitive basis sets that the efficiency of the surface function code is estimated to be about 0.95 or greater.
Figure 8.8:
Efficiency of the Surface Function Code (Including the Calculation
of the Overlap and Interaction Matrices) as a Function of the Global Matrix
Dimension (i.e., the Size of the Primitive Basis Set) for 2, 4, 8, 16,
32, and 64 Processors. The solid curves are straight line segments
connecting the data points for a fixed number of processors and are provided
as an aid to examine the trends.
The data for the logarithmic derivative code given in Table 8.4 for a 245-channel (i.e., 245 LHSF) example show that the Mark IIIfp has a speed about 62% of that of the CRAY 2, but only about 31% of that of the CRAY X-MP/48. This code is dominated by matrix inversions, which are done in optimized assembly code on all three machines. The reason for the slowness of the hypercube with respect to the CRAYs is that the efficiency of the parallel logarithmic derivative code is 0.52. This relatively low value is due to the fact that matrix inversions require a significant amount of interprocessor communication. Figure 8.9 displays efficiencies of the logarithmic derivative code as a function of the number of channels propagated for different processor configurations, as done previously for the Mark III hypercubes [ Hipes:88b ], [ Messina:90a ]. The data can be described well by an operations count formula developed previously for the matrix inversion part of the code [ Hipes:88a ]; this formula can be used to extrapolate the data to larger numbers of processors or channels. It can be seen that for an 8-processor configuration, the code runs with an efficiency of 0.81. This observation suggested that we divide the Mark IIIfp into eight clusters of eight processors each, and perform calculations for different energies in different clusters. The corresponding timing information is also given in Table 8.4. As can be seen from the last row of that table, the speed of the logarithmic derivative code using this configuration of the 64-processor Mark IIIfp rises to about 44% of that of the CRAY X-MP/48 and 88% of that of the CRAY 2. As the number of channels increases, the number of processors per cluster may be made larger in order to increase the amount of memory available in each cluster. The corresponding efficiency should continue to be adequate, due to the larger matrix dimensions involved.
Figure 8.9:
Efficiency of Logarithmic Derivative Code as a Function of the
Global Matrix Dimension (i.e., the Number of Channels or LHSF) for 8,
16, 32, and 64 Processors. The solid curves are straight-line segments
connecting the data points for a fixed number of processors, and are provided
as an aid to examine the trends.
Planned upgrades of the Mark IIIfp include increasing the number of processors to 128 and replacing the I/O system with high-performance CIO (concurrent I/O) hardware. Furthermore, new Weitek coprocessors, installed since the present calculations were done, perform 64-bit floating-point arithmetic at about the same nominal peak speed as the 32-bit boards. From the data in the present paper, it is possible to predict with good reliability the performance of this upgraded version of the Mark IIIfp (the CIO upgrade was never performed). A CRAY Y-MP/864 was installed at the San Diego Supercomputer Center, and measurements show that it is about two times faster than the CRAY X-MP/48 for the surface function code and 1.7 times faster for the logarithmic derivative code. In Table 8.5, we summarize the available or predicted speed information for the present codes for the current 64-processor and the planned 128-processor Mark IIIfp, as well as the CRAY X-MP/48, CRAY 2, and CRAY Y-MP/864 supercomputers. It can be seen that the Mark IIIfp machines are competitive with all of the currently available CRAYs (operating as single-processor machines). The results described in this paper demonstrate the feasibility of performing reactive scattering calculations with high efficiency in parallel fashion. As the number of processors continues to increase, such parallel calculations on systems of greater complexity will become practical in the not-too-distant future.
Table 8.5:
Overall speed of reactive scattering codes on several machines.
Collisions of low-energy electrons with atoms and molecules have been of both fundamental and practical interest since the early days of the quantum theory. Indeed, one of the first successes of quantum mechanics was an explanation of the curious transparency of certain gases to very slow electrons [ Mott:87a ]. Today, we have an excellent understanding of the physical principles involved in low-energy electron collisions in gases, and with it an ability to calculate the cross section, or probability, for various electron-atom collision processes to high accuracy [ Bartschat:89a ]. The case of electron collisions in molecular gases is, however, quite different. Although the same principles are involved, complications arising from the nonspherical shapes of molecules and their numerous internal degrees of freedom (vibrations and rotations) make calculating reliable cross sections for low-energy electron-molecule collisions a significant computational challenge.
At the same time, electron-molecule collision data is of growing practical importance. Plasma-based processing of materials [ Manos:89a ], [ JTIS:88a ] relies on collisions between ``hot'' electrons, with kinetic energies on the order of tens of electron-volts (eV), and gas molecules at temperatures of hundreds of kelvins, to generate reactive fragments (atoms, radicals, and ions) that could otherwise be obtained only at temperatures high enough to damage or destroy the surface being treated. Such low-temperature plasma processing is a key technology in the manufacture of semiconductors [ Manos:89a ], and has applications in many other areas as well [ JTIS:88a ], ranging from the hardening of metals to the deposition of polymer coatings.
The properties of materials-processing plasmas are sensitive to operating conditions, which are generally optimized by trial and error. However, efforts at direct numerical modelling of plasmas are being made [ Kushner:91a ], which hold the potential to greatly increase the efficiency of plasma-based processing. Since electron-molecule collisions are responsible for the generation of reactive species, clearly, an essential ingredient in plasma modelling is knowledge of the electron-molecule collision cross sections.
We have been engaged in studies of electron-molecule collisions for a number of years, using a theoretical approach, the Schwinger Multichannel (SMC) method, specifically formulated to handle the complexities of electron-molecule interactions [ Lima:90a ], [Takatsuka:81a;84a]. Implementations of the SMC method run in production mode both on small platforms (e.g., Sun SPARCstations) and on CRAY machines, and cross sections for several diatomic and small polyatomic molecules have been reported [ Brescansin:89a ], [Huo:87a;87b], [ Lima:89a ], [ Pritchard:89a ], [ Winstead:90a ]. Recently, however, the computational demands of detailed studies, combined with the high cost of cycles on CRAY-type machines, have led us to implement the SMC method on distributed-memory parallel computers, beginning with the JPL/Caltech Mark IIIfp and currently including Intel's iPSC/860 and Touchstone Delta machines . In the following, we will describe the SMC method, our strategy and experiences in porting it to parallel architectures, and its performance on different machines. We conclude with selected results produced by the parallel SMC code and some speculation on future prospects.
The collision of an electron with a molecule A may be illustrated schematically as e(E_0, k_0) + A -> e(E_1, k_1) + A*, where E_0 is the electron's initial kinetic energy and the momentum vector k_0 points in its initial direction of travel; after the collision, the electron travels along k_1 with kinetic energy E_1. If E_1 differs from E_0, the collision is said to be inelastic, and energy is transferred to the target, leaving it in an excited state, denoted A*. The quantity we seek is the probability of occurrence, or cross section, for this process, as a function of the energies E_0 and E_1 and of the angle between the directions k_0 and k_1. (Since a gas is a very large ensemble of randomly oriented molecules, the orientational dependence of these quantities for an asymmetric target A is averaged over in calculations.)
The SMC procedure [ Lima:90a ], [Takatsuka:81a;84a], a multichannel extension of Schwinger's variational principle [ Schwinger:47a ], is a method for obtaining cross sections for low-energy electron-molecule collision processes, including elastic scattering and vibrational or electronic excitation. As such, it is capable of accurately treating effects arising from electron indistinguishability and from polarization of the target by the charge of the incident electron, both of which can be important at low collision velocities. Moreover, it is formulated to be applicable to and efficient for molecules of arbitrary geometry.
The scattering amplitude, f, a complex quantity whose square modulus is proportional to the cross section, is approximated in the SMC method by a variational expression constructed from matrix elements of the interaction potential V between the trial basis and an interaction-free wave function, the product of a target electronic state and a plane wave describing the free scattering electron. Here V is the interaction potential between the scattering electron and the target, and the (N+1)-electron functions of the trial basis are spin-adapted Slater determinants which form a linear variational basis set for approximating the exact scattering wave functions in the initial and final channels. The variational expression also contains the elements of the inverse of the matrix representation, in this basis, of an operator built from V, the projector P onto open (energetically accessible) electronic states, and the (N+1)-electron Green's function projected onto open channels, evaluated at the total energy E of the system, with H the full Hamiltonian.
In our implementation, the (N+1)-electron functions are formed from antisymmetrized products of one-electron molecular orbitals which are themselves combinations of the Cartesian Gaussian orbitals commonly used in molecular electronic-structure studies. Expansion of the trial scattering wave function in such a basis of exponentially decaying functions is possible since the trial function of the SMC method need not satisfy scattering boundary conditions asymptotically [ Lima:90a ], [Takatsuka:81a;84a]. All matrix elements needed in the evaluation of f can then be obtained analytically, except those involving the projected Green's function. These terms are evaluated numerically via a momentum-space quadrature procedure [ Lima:90a ], [Takatsuka:81a;84a]. Once all matrix elements are calculated, the final step in the calculation is the solution of a system of linear equations to obtain the scattering amplitude.
The computationally intensive step in the above formulation is the evaluation of large numbers of so-called ``primitive'' two-electron integrals involving three Cartesian Gaussians and a plane wave, for all unique combinations of the Gaussians, and for a wide range of plane-wave momenta k in both magnitude and direction. These integrals are evaluated analytically by a set of subroutines comprising approximately two thousand lines of FORTRAN. Typical calculations require a very large number of calls to this integral-evaluation suite, consuming roughly 80% of the total computation time. Once calculated, the primitive integrals are assembled in appropriate combinations to yield the matrix elements appearing in the variational expression for f. The original CRAY code performs this procedure in two steps: first, a repeated linear transformation to integrals involving molecular orbitals, then a transformation from the molecular-orbital integrals to physical matrix elements. The latter step is equivalent to an extremely sparse linear transformation, whose coefficients are determined in an elaborate subroutine with a complicated logical flow.
The necessity of evaluating large numbers of primitive two-electron integrals makes the SMC procedure a natural candidate for parallelization on a coarse-grain MIMD machine. With a large memory per processor, it is feasible to load the integral evaluator on each node and to distribute the evaluation of the primitive integrals among all the processors. Provided issues of load balance and subsequent data reduction can be addressed successfully, high parallel efficiency may be anticipated, since the stage of the calculation which typically consumes the bulk of the computation time is thereby made perfectly parallel.
In planning the decomposition of the set of integrals onto the nodes, two principal issues must be considered. First, there are too many integrals to store in memory simultaneously, and certain indices must therefore be processed sequentially. Second, the transformation from primitive integrals to physical matrix elements, which necessarily involves interprocessor communication, should be as efficient and transparent as possible. With these considerations in mind, the approach chosen was to configure the nodes logically as a two-torus, on which is mapped an array of integrals whose columns are labeled by pairs of Gaussians and whose rows are labeled by the directions of k; the remaining indices (the third Gaussian and the magnitude of k) are processed sequentially. With this decomposition, the transformation steps and associated interprocessor communication can be localized and ``hidden'' in parallel matrix multiplications. This approach is both simple and efficient, and results in a program that is easily ported to new machines.
Care was needed in designing the parallel transformation procedure. Direct emulation of the sequential code-that is, transformation first to molecular-orbital integrals and then to physical matrix elements-is undesirable, because the latter step would entail a parallel routine of great complexity governing the flow of a relatively limited amount of data between processors. Instead, the two transformations are combined into a single step by using the logical outline of the original molecular-orbital-to-physical-matrix-element routine in a perfectly parallel routine which builds a distributed transformation matrix. The combined transformations are then accomplished by a single series of large, almost-full complex-arithmetic-matrix multiplications on the primitive-integral data set.
The remainder of the parallel implementation involves relatively straightforward modifications of the sequential CRAY code, with the exception of a series of integrations over angles arising in the evaluation of the matrix elements, and of the solution of a system of linear equations in the final phase of the calculation. The angular integration, done by Gauss-Legendre quadrature, is compactly and efficiently coded as a distributed matrix multiplication. The solution of the linear system is performed by a distributed LU solver [ Hipes:89b ] modified for complex arithmetic.
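To see why a quadrature can be written as a matrix multiplication, note that evaluating many integrals of products of sampled functions amounts to forming F G^T, where the quadrature weights have been folded into one of the factors. The following small C example (illustrative; it uses a three-point Gauss-Legendre rule on [-1, 1] and monomial "basis functions" rather than anything from the SMC code) computes all pairwise integrals this way.

#include <stdio.h>
#include <math.h>

#define NQ 3   /* 3-point Gauss-Legendre rule on [-1, 1] */
#define NF 3   /* number of basis functions: 1, x, x^2   */

int main(void)
{
    /* nodes and weights of the 3-point rule (exact through degree 5) */
    double xq[NQ] = {-sqrt(3.0 / 5.0), 0.0, sqrt(3.0 / 5.0)};
    double wq[NQ] = {5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0};

    /* sample the basis functions at the quadrature points, folding the
       quadrature weights into one of the two factors                   */
    double F[NF][NQ], G[NF][NQ];
    for (int m = 0; m < NF; m++)
        for (int q = 0; q < NQ; q++) {
            F[m][q] = pow(xq[q], m);
            G[m][q] = wq[q] * pow(xq[q], m);
        }

    /* all pairwise integrals at once: I = F * G^T, an ordinary matrix
       multiplication (this is the step done as a distributed matrix
       multiply on the hypercube)                                       */
    for (int m = 0; m < NF; m++)
        for (int n = 0; n < NF; n++) {
            double I = 0.0;
            for (int q = 0; q < NQ; q++) I += F[m][q] * G[n][q];
            printf("integral of x^%d * x^%d over [-1,1] = %7.4f\n", m, n, I);
        }
    return 0;
}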
The implementation described above has proven quite successful [ Hipes:90a ], [ Winstead:91d ] on the Mark IIIfp architecture for which it was originally designed, and has since been ported with modest effort to the Intel iPSC/860 and subsequently to the 528-processor Intel Touchstone Delta. No algorithmic modifications were necessary in porting to the Intel machines; modifications to improve efficiency will be described below. Complications did arise from the somewhat different communication model embodied in Intel's NX system, as compared to the more rigidly structured, loosely synchronous CrOS III operating system of the Mark IIIfp described in Chapter 5 . These problems were overcome by careful typing of all interprocessor messages-essentially, assigning of sequence numbers and source labels. In porting to the Delta, the major difficulty was the absence of a host processor. Our original parallel version of the SMC code left certain initialization and end-stage routines, which were computationally trivial but logically complicated, as sequential routines to be executed on the host. In porting to the Delta, we chose to parallelize these routines as well rather than allocate an extra node to serve as host. There is thus only one program to maintain and to port to subsequent machines, and a rectangular array of nodes, suitable for a matrix-oriented computation, is preserved.
Performance assessment on the Mark IIIfp has been published in [ Hipes:90a ]. In brief, a small but otherwise typical case was run both on the Mark IIIfp and on one processor of a CRAY Y-MP. Performance on 32 nodes of the Mark IIIfp surpassed that of the sequential code on the Y-MP; on 64 nodes, the performance was approximately three times higher than on the CRAY. Considering the small size of the test case, a reasonable parallel efficiency (60% on 64 nodes) was observed.
Performance of the original port of the parallel code from the Mark IIIfp to a 64-processor iPSC/860 hypercube, while adequate, was below expectations based on the 4:1 ratio of 64-bit floating-point peak speeds. Moreover, initial runs on up to 512 nodes of the Delta indicated very poor speedups. Timings at the subroutine level revealed that an excessive amount of time was being spent both in matrix multiplication and in construction of the distributed transformation matrix. Optimization is still in progress, and performance is still a small fraction of the machine's peak speed, but some improvements have been made.
Several steps were taken to improve the matrix multiplication. Blocking sends and receives were replaced with asynchronous NX routines, overlapping communication with computation; the absolute number of communications was reduced by grouping together small data blocks and by computing rather than communicating block sizes; one of the matrices was transposed in order to maximize the length of the innermost loop; and finally, the inner loop was replaced with a level-one BLAS call. Presently the floating-point work proceeds at 7 to , including loop overhead, depending on problem size. On the iPSC/860, throughput for the subroutine as a whole is generally limited by communication bandwidth to approximately . We expect to increase this by better matching the sizes of the two matrices being multiplied, which will require minor modifications in the top-level routine. Higher throughput, approximately , is obtained on the Delta. Further improvement is certainly possible, but communication overhead on the Delta is already below 10% for the application as a whole, and matrix multiplication time is no longer a major limitation.
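As an illustration of the last of these steps, the fragment below sketches a saxpy-based matrix multiply in which the innermost work is handed to the level-one BLAS; it is a simplified sketch in real arithmetic with illustrative array names, not the production routine, which works in complex arithmetic on distributed blocks.
      SUBROUTINE MATSAX( M, L, N, A, LDA, B, LDB, C, LDC )
*     Sketch of a matrix multiply C <- C + A*B in which the innermost
*     loop has been replaced by a call to the level-one BLAS routine
*     SAXPY, updating a full column of C at a time.
      INTEGER M, L, N, LDA, LDB, LDC
      REAL A( LDA, * ), B( LDB, * ), C( LDC, * )
      INTEGER J, K
      DO 20 J = 1, N
         DO 10 K = 1, L
*           C(1:M,J) <- C(1:M,J) + B(K,J)*A(1:M,K)
            CALL SAXPY( M, B( K, J ), A( 1, K ), 1, C( 1, J ), 1 )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END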
Reducing the time spent in constructing the transformation matrix proved to be a matter of removing index computations in the innermost loops. In the original implementation, integer modulo arithmetic was used on each call to determine the local components of the transformation matrix. This form of parallel overhead proved surprisingly costly. It was essentially eliminated by precomputing and storing three lists of pointers to the data elements needed locally. These pointers are used for indirect indexing of elements needed in a vector-vector outer product, which now runs at approximately . (Preceding the outer product with an explicit gather using the same pointers was tested, but proved counterproductive.) A BLAS call (daxpy), timed at 13.1 to for typical cases, was inserted elsewhere. Construction of the transformation matrices is now typically 1% of the total time, with throughput, including all logic and integer arithmetic as overhead, around .
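The indirect-indexing idea can be sketched as follows; the routine below performs an outer-product accumulation through precomputed pointer lists (two lists are shown here purely for illustration, whereas the actual code stores three), so that no modulo arithmetic appears in the inner loops.
      SUBROUTINE OUTPRD( M, N, X, Y, IPT, JPT, T, LDT )
*     Sketch of a rank-one (outer-product) update with indirect
*     indexing: the precomputed pointer lists IPT and JPT select the
*     locally stored rows and columns of T, so no index arithmetic is
*     done inside the loops.
      INTEGER M, N, LDT, IPT( * ), JPT( * )
      REAL X( * ), Y( * ), T( LDT, * )
      INTEGER I, J
      DO 20 J = 1, N
         DO 10 I = 1, M
            T( IPT( I ), JPT( J ) ) = T( IPT( I ), JPT( J ) )
     &                              + X( I ) * Y( J )
   10    CONTINUE
   20 CONTINUE
      RETURN
      END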
Table 8.6: SMC Performance on the Delta (MFLOPS)
In the present state of the program, the perfectly parallel integral-calculation step is the dominant element in most of our calculations, as desired and expected based on the amount of floating-point work. It is also the most complex step, however, with little linear algebra but with many math library calls (sin, cos, exp, sqrt), floating-point divides, and branches. Not surprisingly, therefore, it is comparatively slow. We have timed the CRAY version at on a single-processor Y-MP, reflecting the routine's intrinsically scalar character. Present performance on the i860 is about . Some additional optimization is planned, but substantial improvement may have to await more mature versions of the compiler and libraries.
Figure 8.10: Calculated Integral Elastic Cross Sections for Electron Scattering by the C3H6 Isomers Cyclopropane and Propylene. For comparison, experimental total cross sections of Refs. [Floeder:85a] (open symbols) and [Nishimura:91a] (filled symbols) are shown; triangles are cyclopropane and circles propylene data.
With the program components as described above, the present code should run on 512 nodes of the Delta at a sustained rate of approximately . In practice, lower performance is obtained, due to synchronization delays, load imbalance, file I/O, etc. Actual timings taken from 64- to 512-node production runs are given in Table 8.6 . The limited data available for the integral-evaluation package reflects the difficulty of obtaining an accurate operation count; for the case shown, a count was obtained using flow-tracing utilities on a CRAY. For the ``large'' case shown in the table, we estimate overall performance at , inclusive of all I/O and overhead, on 512 nodes of the Delta; this estimate is based on an approximate operation count for the integral package and actual counts for the remaining routines.
The distributed-memory SMC program has been applied to a number of elastic and inelastic electron-molecule scattering problems, emphasizing polyatomic gases of interest in low-temperature plasma applications [ JTIS:88a ], [ Manos:89a ]. Initial applications [ Hipes:90a ], [ Winstead:91d ] on the Mark IIIfp were to elastic scattering by ethylene (C2H4), ethane (C2H6), propane (C3H8), disilane (Si2H6), germane (GeH4), and tetrafluorosilane (SiF4). We have since studied elastic scattering by other systems, including phosphine (PH3), propylene (C3H6) and its isomer cyclopropane, n-butane (C4H10), and 1,2-trans-difluoroethylene, both on the Mark IIIfp and on the Intel machines. We have also examined inelastic collisions with ethylene [ Sun:92a ], formaldehyde (CH2O), methane (CH4), and silane (SiH4). Below we present selected results of these calculations, comparing where possible to experimental data.
Figure 8.10 shows integral elastic cross sections-that is, cross sections summed over all angles of scattering, plotted as a function of the electron's kinetic energy-for the two C3H6 isomers, cyclopropane and propylene. Scattering from propylene requires some special consideration, because of its small dipole moment [ Winstead:92a ]. These calculations were performed in the static-exchange approximation, neglecting polarization and excitation effects, on 256-node partitions of the Delta. The results in Figure 8.10 should be considered preliminary, since studies to test convergence of the cross section with respect to basis set are in progress, but we do not expect major changes at the energies shown. Corresponding experimental values have not been reported, but the total scattering cross section, of which elastic scattering is the dominant component, has been measured [ Floeder:85a ], [ Nishimura:91a ], and these data are included in Figure 8.10 . Both the calculation and the measurements show a clear isomer effect in the vicinity of the broad maximum, which gradually lessens at higher energies. At the level of approximation (static-exchange) used in these calculations, the maxima in the cross sections are expected to appear shifted to higher energies and somewhat broadened and lowered in intensity. Thus, for propylene, where some discrepancy is seen between the two measurements, our calculation appears to support the larger values of [ Nishimura:91a ].
Figure 8.11: Differential Cross Sections for Elastic Scattering of Electrons by Disilane and Ethane. Experimental points for ethane (circles) are from Ref. [Tanaka:88a]; disilane data (squares) are from Ref. [Tanaka:89a].
Figure 8.11 shows the calculated differential cross section, that is, the cross section as a function of scattering angle, for elastic scattering of electrons from ethane and its analogue disilane. These results were obtained on the Mark IIIfp within the static-exchange approximation. Agreement with experiment [Tanaka:88a;89a] is quite good: although there are quantitative differences where the magnitude of the cross section is small, the qualitative features are well reproduced for both molecules.
Calculations of electronic excitation cross sections are shown in Figures 8.12 and 8.13 . In Figure 8.12 , we present the integral cross section for excitation of the state of ethylene [ Sun:92a ], obtained on the Mark IIIfp in a two-channel approximation. This excitation weakens the C-C bond, and its cross section is relevant to the dissociation of ethylene by low-energy electron impact. As seen in the figure, the cross section increases rapidly from threshold (experimental value ) and reaches a fairly high peak value before beginning a gradual decline. The threshold rise is largely due to a d -wave ( ) contribution, seen as a shoulder around above threshold, which may arise from a core-excited shape resonance. Relative measurements of this cross section [ Veen:76a ], which we have placed on an absolute scale by normalizing to our calculated value at the broad maximum, show a much sharper structure near threshold, but are otherwise in good agreement.
Figure 8.12: Integral Cross Section for Electron-Impact Excitation of the State of Ethylene. Solid line: present two-channel result; dashed line: relative measurement of Ref. [Veen:76a], normalized to the calculated value at the broad maximum.
Figure 8.13 shows the cross section for electron-impact excitation of the and states of formaldehyde, obtained from a three-channel calculation. Portions of this calculation were done on the Mark IIIfp, the iPSC/860, and the Delta. Experimental data for these excitations are not available, but an independent calculation at a similar level of theory has been reported [ Rescigno:90a ], and is shown in the figure. Since the complex-Kohn calculation of [ Rescigno:90a ] included only partial waves up to , we show both the full SMC result, obtained from , and a restricted SMC result, obtained with f projected onto a spherical-harmonic basis , . The agreement between the restricted SMC result and that of [ Rescigno:90a ] is in general excellent; however, comparison to the full SMC result indicates that such a restriction introduces some errors at higher energies.
Figure 8.13: Calculated Integral Cross Sections for Electron-Impact Excitation of the and States of Formaldehyde, Obtained from Three-Channel Calculations. Solid lines: present SMC results; short-dashed lines: SMC results, limited to ; long-dashed lines: complex-Kohn calculations of Ref. [Rescigno:90a].
The concurrent implementation of a large sequential code which is in production on CRAY-type machines is a type of project which is likely to become increasingly common as commercial parallel machines proliferate and ``mainstream'' computer users are attracted by their potential. Several lessons which emerge from the port of the SMC code may prove useful to those contemplating similar projects. One is the value of focusing on the concurrent implementation and, so far as possible, avoiding or deferring minor improvements. If the original code is a reasonably effective production tool, such tinkering is unlikely to be of great enough benefit to justify the distraction from the primary goal of achieving a working concurrent version. On the other hand, major issues of structure and organization which bear directly on the parallel conversion deserve very careful attention, and should ideally be thought through before the actual conversion has begun. In the SMC case, the principal such issue was how to implement efficiently the transformation from primitive integrals to physical matrix elements. The solution arrived at not only suggested that a significant departure from the sequential code was warranted but also determined the data decomposition. A further point worth mentioning is that the conversion was greatly facilitated by the C P environment which fostered collaboration between workers familiar with the original code and its application, and workers adept at parallel programming practice, and in which there was ready access both to smaller machines for debugging runs and to larger production machines. Finally, we believe that the emphasis on achieving a simple communication strategy has justified itself in practice, not only in efficiency but in the portability and reliability of the program.
At present the parallel SMC code is essentially in production mode, all capabilities of the original sequential code having been implemented and some optimization performed. Further optimization of the primitive-integral package is in progress, but the major focus in the near future is likely to be applications on the one hand and extending the capabilities of the parallel code on the other. We are particularly interested in modifying the program to allow the study of electron scattering from open-shell systems (i.e., those with unpaired electrons), with a view to obtaining cross sections for some of the more important polyatomic species found in materials-processing plasmas. With continued progress in parallel hardware, we are very optimistic about the prospects for theory to make a substantial contribution to our knowledge of electron-polyatomic collisions.
Figure 9.1: The Loosely Synchronous Problem Class
The significance of loosely synchronous problems and their natural parallelism was an important realization that emerged gradually (perhaps in 1987 as a clear concept) as we accumulated results from C P research. As described in Figure 9.1 , fundamental theories often describe phenomena in terms of a set of similar entities obeying a single law. However, one does not usually describe practical problems in terms of their fundamental description in a theory such as QCD in Section 4.3 . Rather, we use macroscopic concepts. Looking at society, a particle physicist might view it as a bunch of quarks and gluons; a nuclear physicist as a collection of protons and neutrons; a chemist as a collection of molecules; a biochemist as a set of proteins; a biologist as myriad cells; and a social scientist as a collection of people. Each description is appropriate to answer certain questions, and it is usually clear which description should be used. If we consider a simulation of society, or a part thereof, only the QCD description is naturally synchronous. The other fields view society as a set of macroscopic constructs, which are no longer identical and typically have an irregular interconnect. This is caricatured in Figure 9.1 as an irregular network. The simulation is still data-parallel and, further, there is a critical macroscopic synchronization-in a time-stepped simulation, at every time step. This is an algorithmic synchronization that ensures natural scaling parallelism; that is, the efficiency and speedup are given by Equations 3.10 and 3.11 with the parameter of Equation 3.10 equal to zero, and the overhead is given by Equation 3.10 in terms of the system dimension. The efficiency depends only on the problem grain size and not explicitly on the number of processing nodes. As emphasized in [ Gustafson:88a ], these problems scale so that if one doubles both the machine and problem size, the speedup will also double with constant efficiency. This situation is summarized in Figure 9.2 .
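For orientation only, loosely synchronous performance estimates of this type generically take the form (the precise Equations 3.10 and 3.11 may differ in detail):
\[ \varepsilon = \frac{1}{1 + f_C}, \qquad S = N\,\varepsilon, \qquad f_C \propto \frac{t_{\rm comm}}{t_{\rm calc}}\, n^{-1/d}, \]
where n is the grain size (degrees of freedom stored per node), d is the system dimension, and N is the number of nodes.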
Figure 9.2: Speedup as a Function of Number of Processors
Why is there no synchronization overhead in this problem class? Picturesquely, we can say that the processors ``know'' that they are synchronized at the end of each algorithmic time step. We use time in the generalized complex system language of Section 3.3 and so it would represent, for instance, iteration number in a matrix problem. Operationally, we can describe the loosely synchronous class on a MIMD machine by the communication-calculation sequence in Figure 9.3 . The update (calculate) phase can involve very different algorithms and computations for the points stored in different processors. Thus, a MIMD architecture is needed in the general case. Synchronization is provided, as in Figure 9.3 by the internode communication phases at each time step. As described in Chapters 5 and 16 , this does not need, but certainly can use, the full asynchronous message-passing capability of a MIMD machine.
Figure 9.3: Communication-Calculation Phases in a Loosely Synchronous Problem
We have split the loosely synchronous problems into two chapters, with those in Chapter 12 showing more irregularities and greater need for MIMD architectures than the applications described in this chapter. There has been no definitive study of which loosely synchronous problems can run well on SIMD machines. Some certainly can, but not all. We have discussed some of these issues in Section 6.1 . If, as many expect, SIMD remains a cost-effective architecture offered commercially, it will be important to better clarify the class of irregular problems that definitely need the full MIMD architecture.
As mentioned above, the applications in this chapter are ``modestly'' loosely synchronous. They include particle simulations (Sections 9.2 and 9.3 ), solutions of partial differential equations (Sections 9.3 , 9.4 , 9.5 , 9.7 ), and circuit simulation (Sections 9.5 and 9.6 ). In Section 9.8 , we describe an optimal assignment algorithm that can be used for multiple-target Kalman filters and was developed for the large-scale battle management simulation of Sections 18.3 and 18.4 . Section 9.9 covers the parallelization of learning (``back-propagation'') neural nets with improved learning methods. An interesting C P application not covered in detail in this book was the calculation of an exchange energy in solid at temperatures below [Callahan:88a;88b]. This was our first major use of the nCUBE-1 in production mode and Callahan suffered all the difficulties of a pioneer with the, at the time, decidedly unreliable hardware and software. He used 250 hours on our 512-node nCUBE-1, which was equivalent to 1000 hours of a non-vectorized CRAY X-MP implementation. In discussing SIMD versus MIMD, one usually concentrates on the synchronization aspects. However, Callahan's application illustrates another point; namely, commercial SIMD machines typically have many more processors than a comparable MIMD computer. For example, Thinking Machines introduced the 32-node MIMD CM-5 as roughly equivalent (in price) to an SIMD CM-2. The SIMD architecture has 256 times as many nodes. Of course, the SIMD nodes are much simpler, but this still implies that one needs a large enough problem to exploit this extra number of nodes. There are some coarse-grain SIMD machines-especially special-purpose QCD machines [ Battista:92a ], [ Christ:86a ], [ Fox:93a ], [ Marinari:93a ]-but it is more natural to build fine-grain machines. If the node is large, one might as well add MIMD capability! Full matrix algorithms, such as LU decomposition (see Chapter 8 ), are often synchronous, but do not perform very well on SIMD machines due to insufficient parallelism [ Fox:92j ]. Many of the operations only involve single rows and columns and have severe load imbalance on fine-grain machines. Callahan's application did not exhibit ``massive'' parallelism, and so ``had'' to use a MIMD machine irrespective of his problem's temporal structure. He used 512 nodes on the nCUBE-1 by combining three forms of parallelism: two came from the problem formulation, with spatial and temporal parallelism; the third came from running four different parameter values concurrently.
A polymer simulation [Ding:88a;88b] by Ding and Goddard, using the reptation method, exhibited a similar effect. There is a chain of N chemical units and the algorithm involves special treatment of the two units at the beginning and end of the linear polymer. The MIMD program ran successfully on the Mark III and FPS T-Series, but the problem is too ``small'' (parallelism of ) for this algorithm to run on a SIMD machine even though most of the basic operations can be run synchronously.
This issue of available parallelism also complicates the implementation of multigrid algorithms on SIMD machines [Frederickson:88a;88b;89a;89b].
Geomorphology is the study of the small-scale surface evolution of the earth under the forces resulting from such agents as wind, water, gravity, and ice. Understanding and prediction in geomorphology are critically dependent upon the ability to model the processes that shape the landscape. Because these processes in general are too complicated on large scales to describe in detail, it is necessary to adopt a system of hierarchical models in which the behavior of small systems is summarized by a set of rules governing the next larger system; in essence, these rules constitute a simplified algorithm for the physical processes in the smaller system that cannot be treated fully at larger scales. A significant fraction of the processes in geomorphology involve entrainment, transport and deposition of particulate matter. Where the intergrain forces become comparable to or greater than the forces arising from the transporting agents, consideration of the properties of a granular material, a system of grains which collide with and slide against neighboring grains, is warranted. A micromechanical description of granular materials has proved difficult, except in energetic flow regimes [ Haff:83a ], [ Jenkins:83a ]. Thus, researchers have turned to dynamical and computer simulations at the level of individual grains in order to elucidate some of the basic mechanical properties of granular materials ([ Cundall:79a ], [ Walton:83a ] pioneered this simulation technique). In this section, we discuss the role that hypercube concurrent processing has played and is expected to play both in grain-level dynamical simulations and in relating these simulations to modelling the formation and evolution of landforms.
As an example of this approach to geomorphology, we shall consider efforts to model transport of sand by the wind based upon the grain-to-grain dynamics. Sand is transported by the wind primarily in saltation and in reptation [ Bagnold:41a ]. Saltating grains are propelled along the surface in short hops by the wind. Each collision between a saltating sand grain and the surface results in a loss of energy which is compensated, on the average, by energy acquired from the wind. Reptating grains are ejected from the sand surface by saltating grain-sand bed impacts; they generally come to rest shortly after returning to the sand surface.
Computer simulations of saltating grain impacts upon a loose grain bed were performed on an early version of the hypercube [Werner:88a;88b]. Collisions between a single impacting grain and a box of 384 circular grains were simulated. The grains interact through stiff, inelastic compressional contact forces plus a Coulomb friction force. The equations of motion for the particles are integrated forward in time using a predictor-corrector technique. At each step in time, the program checks for contacts between particles and, where contacts exist, computes the contact forces. Dynamical simulations of granular materials are computationally intensive, because the time scale of the interaction between grains (tens of microseconds) is much smaller than the time scale of the simulation (order one second).
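The precise force law is not reproduced above; a common model for such grain-grain contacts, given here only as an illustration, is a damped linear spring acting in compression, with the tangential force limited by Coulomb friction:
\[ F_n = \max\bigl(0,\; k_n\,\delta + \gamma_n\,\dot{\delta}\bigr), \qquad |F_t| \le \mu\,F_n, \]
where \(\delta\) is the overlap of two grains in contact, \(k_n\) is the contact stiffness, \(\gamma_n\) is a damping coefficient (providing the inelasticity), and \(\mu\) is the friction coefficient.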
The simulation was decomposed on a Caltech hypercube by assigning the processors to regions of space lying on a rectangular grid. The computation time is a combination of calculation time in each processor due to contact searches and to force computations, and of communication time in sending information concerning grain positions and velocities to neighboring processors for interparticle force calculations on processor boundaries. Because the force computation is complicated, the communication time was found to be a negligible fraction of the total computation time for granular materials in which enduring intergrain contacts are dominant. The boundaries between processors are changed incrementally throughout the calculation in order to balance the computational load among the processors. The optimal decomposition has enough particles per processor to diminish the relative importance of statistical fluctuations in the load, and a system of boundaries which conforms as much as possible to the geometry of the problem. For grain-bed impacts, efficiencies between 0.89 and 0.97 were achieved [ Werner:88a ].
Irregularities of the geometry are important in determining which sand grains interact with each other. Thus, it is not possible to find an efficient synchronous algorithm for this and many other particle interaction problems. The very irregular inhomogeneous astrophysical calculations described in Section 12.4 illustrate this point clearly. One also finds the same issue in molecular dynamics codes, such as CHARMM which are extensively used in chemistry. This problem is, however, loosely synchronous as we can naturally macroscopically synchronize the calculation after each time step-thus a MIMD implementation where each processor processes its own irregular collection of grains is very natural and efficient. The sand grain problem, unlike that of Section 12.4 , has purely local forces as the grains must be in physical contact to affect each other. Thus, only very localized communication is necessary. Note that Section 4.5 describes a synchronous formulation of this problem.
The results of the grain-bed impact simulations have facilitated treatment of two larger scale problems. In a simulation of steady-state saltation, calculation of saltating grain trajectories and of modifications to the wind velocity profile, due to acceleration of saltating grains, was combined with a grain-bed impact distribution function derived from experiments and simulations. This simulation yielded such characteristics of saltation as flux and erosive potential [ Werner:90a ]. A simulation of the rearrangement of surface grains in reptation led to the formation of self-organized small-scale bedforms, which resemble wind ripples in both size and shape [ Landry:93a ], [Werner:91a;93a]. Larger, more complicated ripple-formation simulations and a simulation of sand dune formation, using a similar approach which is under development, are problems that will require a combination of processing power and memory not available on present supercomputers. Ripple and dune simulations are expected to run efficiently with a spatial decomposition on a hypercube.
Water is an important agent for the transport of sediment. Unlike wind-blown sand transport, underwater sand transport requires simultaneous simulation of the grains and the fluid because water and sand are similar in density. We are developing a grain/fluid mixture simulation code for a hypercube in which the fluid is modelled by a gas composed of elastic hard circles (spheres in three dimensions). The simulation steps the gas forward at discrete time intervals, allowing the gas particles to collide (with another gas particle or a macroscopic grain) only once per step. The fluid velocity and the fluid force on each grain are computed by averaging. Since a typical void between macroscopic grains will be occupied by up to 1000 gas particles, the requisite computational speed and memory capacity can be found only in the hypercube architecture. Communication is expected to be minimal and load balancing can be accomplished for a sufficiently large system. It is expected that larger scale simulations of erosion and deposition by water [ Ahnert:87a ] will benefit from the findings of the fluid/grain mixture simulations. Also, these large-scale landscape evolution simulations are suitable themselves for a MIMD parallel machine.
Computer simulation is assuming an increasing role in geomorphology. We suggest that the development and availability of high-performance MIMD concurrent processors will have considerable influence upon the future of computing in geomorphology.
Plasmas-gases of electrically charged particles-are among the most complex fluids encountered in nature. Because of the long-range nature of the electric and magnetic interactions between the electrons and ions composing them, plasmas exhibit a wide variety of collective forms of motion, for example, coherent motions of large numbers of electrons, ions, or both. This leads to an extremely rich physics of plasmas. Plasma particle-in-cell (PIC) simulation codes have proven to be a powerful tool for the study of complex nonlinear plasma problems in many areas of plasma physics research such as space and astrophysical plasmas, magnetic and inertial confinement, free electron lasers, electron and ion beam propagation, and particle accelerators. In PIC codes, the orbits of thousands to millions of interacting plasma electrons and ions are followed in time as the particles move in electromagnetic fields calculated self-consistently from the charge and current densities created by these same plasma particles.
We developed an algorithm, called the general concurrent particle-in-cell algorithm (GCPIC) for implementing PIC codes efficiently on MIMD parallel computers [ Liewer:89c ]. This algorithm was first used to implement a well-benchmarked [ Decyk:88a ] one-dimensional electrostatic PIC code. The benchmark problem, used to benchmark the Mark IIIfp, was a simulation of an electron beam plasma instability ([ Decyk:88a ], [ Liewer:89c ]). Dynamic load balancing has been implemented in a one-dimensional electromagnetic GCPIC code [ Liewer:90a ]; this code was used to study electron dynamics in magnetosonic shock waves in space plasmas [ Liewer:91a ]. A two-dimensional electrostatic PIC code has also been implemented using the GCPIC algorithm with and without dynamic load balancing [Ferraro:90b;93a]. More recently, the two-dimensional electrostatic GCPIC code was extended to an electromagnetic code [ Krucken:91a ] and used to study parametric instabilities of large amplitude Alfvén waves in space plasmas [ Liewer:92a ].
In plasma PIC codes, the orbits of the many interacting plasma electrons and ions are followed as an initial value problem as the particles move in self-consistently calculated electromagnetic fields. The fields are found by solving Maxwell's equations, or a subset, with the plasma currents and charge density as source terms; the electromagnetic fields determine the forces on the particles. In a PIC code, the particles can be anywhere in the simulation domain, but the field equations are solved on a discrete grid. At each time step in a PIC code, there are two stages to the computation. In the first stage, the position and velocities of the particles are updated by calculating the forces on the particles from interpolation of the field values at the grid points; the new charge and current densities at the grid points are then calculated by interpolation from the new positions and velocities of the particles. In the second stage, the updated fields are found by solving the field equations on the grid using the new charge and current densities. Generally, the first stage accounts for most of the computation time because there are many more particles than grid points.
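As a concrete, if simplified, example of the interpolation in the first stage, the routine below deposits particle charge onto a one-dimensional grid with linear (cloud-in-cell) weighting; the interpolation order and the array names here are illustrative assumptions, not taken from the codes described in this section.
      SUBROUTINE DEPOS1( NP, X, QP, NG, DX, RHO )
*     Deposit NP particles of charge QP onto a 1-D grid of NG points
*     with spacing DX, using linear (cloud-in-cell) weighting.
*     Particle positions X(I) are assumed to satisfy
*     0 .LE. X(I) .LT. (NG-1)*DX.
      INTEGER NP, NG
      REAL X( NP ), QP, DX, RHO( NG )
      INTEGER I, J
      REAL S, W
      DO 10 J = 1, NG
         RHO( J ) = 0.0
   10 CONTINUE
      DO 20 I = 1, NP
         S = X( I ) / DX
         J = INT( S )
         W = S - REAL( J )
*        share the charge between the two nearest grid points
         RHO( J + 1 ) = RHO( J + 1 ) + QP * ( 1.0 - W ) / DX
         RHO( J + 2 ) = RHO( J + 2 ) + QP * W / DX
   20 CONTINUE
      RETURN
      END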
The GCPIC algorithm [ Liewer:89c ] is designed to make the most computationally intensive portion of a PIC code, which updates the particles and the resulting charge and current densities, run efficiently on a parallel processor. The time used to make these updates is generally on the order of 90% of the total time for a sequential code, with the remaining time divided between the electromagnetic field solution and the diagnostic computations.
To implement a PIC code in parallel using the GCPIC algorithm, the physical domain of the particle simulation is partitioned into subdomains, equal in number to the number of processors, such that all subdomains have roughly equal numbers of particles. For problems with nonuniform particle densities, these subdomains will be of unequal physical size. Each processor is assigned a subdomain and is responsible for storing the particles and the electromagnetic field quantities for its subdomain and for performing the particle computations for its particles. For a one-dimensional code on a hypercube, nearest-neighbor subdomains are assigned to nearest-neighbor processors. When particles move to new subdomains, they are passed to the appropriate processor. As long as the number of particles per subdomain is approximately equal, the processors' computational loads will be balanced. Dynamic load balancing is accomplished by repartitioning the simulation domain into subdomains with roughly equal particle numbers when the processor loads become sufficiently unbalanced. The computation of the new partitions, done in a simple way using a crude approximation to the plasma density profile, adds very little overhead to the parallel code.
The decomposition used for dividing the particles is termed the primary decomposition . Because the primary decomposition is not generally the optimum one for the field solution on the grid, a secondary decomposition is used to divide the field computation. The secondary decomposition remains fixed. At each time step, grid data must be transferred between the two decompositions [Ferraro:90b;93a], [ Liewer:89c ].
The GCPIC algorithm has led to a very efficient parallel implementation of the benchmarked one-dimensional electrostatic PIC code [ Liewer:89c ]. In this electrostatic code, only forces from self-consistent (and external) electric fields are included; neither an external nor a plasma-generated magnetic field is included.
The problem used to benchmark the one-dimensional electrostatic GCPIC code on the Mark IIIfp was a simulation of an instability in a plasma due to the presence of an electron beam. The six color pictures in Figure 9.4 (Color Plate) show results from this simulation from the Mark IIIfp. Plotted is electron phase space-position versus velocity of the electrons-at six times during the simulation. The horizontal axis is the velocity and the vertical axis is the position of the electrons. Initially, the background plasma electrons (magenta dots) have a Gaussian distribution of velocities about zero. The width of the distribution in velocity is a measure of the temperature of the electrons. The beam electrons (yellow dots) stream through the background plasma at five times the thermal velocity. The beam density was 10% of the density of the background electrons. Initially, these have a Gaussian distribution about the beam velocity. Both beam and background electrons are distributed uniformly in x . This initial configuration is unstable to an electrostatic plasma wave which grows by tapping the free energy of the electron beam. At early times, the unstable waves grow exponentially. The influence of this electrostatic wave on the electron phase space is shown in the subsequent plots. The beam electrons lose energy to the wave. The wave acts to try to ``thermalize'' the electrons' velocity distribution in the way collisions would act in a classical fluid. At some point, the amplitude of the wave's electrostatic potential is large enough to ``trap'' some of the beam and background electrons, leading to the visible swirls in the phase space plots. This trapping causes the wave to stop growing. In the end, the beam and background electrons are mixed and the final distribution is ``hotter''; kinetic energy from the electron beam has gone into heating both the background and beam electrons.
Figure 9.4: Time history of electron phase space in a plasma PIC simulation of an electron beam plasma instability on the Mark IIIfp hypercube. The horizontal axis is the electron velocity and the vertical axis is the position. Initially, a small population of beam electrons (green dots) stream through the background plasma electrons (magenta dots). An electrostatic wave grows, tapping the energy in the electron beam. The vortices in phase space at late times result from electrons becoming trapped in the potential of the wave. See section 9.3 of the text for further description.
Timing results for the benchmark problem, using the one-dimensional code without dynamic load balancing, are given in the tables. In Table 9.1 , results for the push time are given for various hypercube dimensions for the Mark III and Mark IIIfp hypercubes. Here, we define the push time as the time per particle per time step to update the particle positions and velocities (including the interpolation to find the forces at the particle positions) and to deposit (interpolate) the particles' contributions to the charge and/or current densities onto the grid. Table 9.1 shows the efficiency of the push for runs in which the number of particles increased linearly with the number of processors used, so that the number of particles per processor was constant ( fixed grain size ). The efficiency is defined to be T(1)/[N T(N)], where T(N) is the run time on N processors. In the ideal situation, a code's run time on N processors will be 1/N of its run time on one processor, and the efficiency is 100%. In practice, communication between nodes and unequal processor loads lead to a decrease in the efficiency.
Table 9.1: Hypercube Push Efficiency for Increasing Problem Size
The Mark III Hypercube consists of up to 64 independent processors, each with four megabytes of dynamic random access memory and 128 kilobytes of static RAM. Each processor consists of two Motorola MC68020 CPUs with a MC68882 Co-processor. The newer Mark IIIfp Hypercubes have, in addition, a Weitek floating-point processor on each node. In Table 9.1 , push times are given for both the Mark III processor (Motorola MC68882) and the Mark IIIfp processor (Weitek). For the Weitek runs, the entire parallel code was downloaded into the Weitek processors. The push time for the one-dimensional electrostatic code has been benchmarked on many computers [ Decyk:88a ]. Some of the times are given in Table 9.2 ; times for other computers can be found in [ Decyk:88a ]. For the Mark III and Mark IIIfp runs, 720,896 particles were used (11,264 per processor); for the other runs in Table 9.2 , 11,264 particles were used. In all cases, the push time is the time per particle per time step to make the particle updates. It can be seen that for the push portion of the code, the 64-processor Mark IIIfp is nearly twice the speed of a one-processor CRAY X-MP and 2.6 times the speed of a CRAY 2.
We have also compared the total run time for the benchmark code for a case with 720,896 particles and 1024 grid points run for 1000 time steps. The total run time on the 64-node Mark IIIfp was ; on a one-processor CRAY 2, . For this case, the 64-node Mark IIIfp was 1.6 times faster than the CRAY 2 for the entire code. For the Mark IIIfp run, about 10% of the total run time was spent in the initialization of the particles, which is done sequentially.
Benchmark times for the two-dimensional GCPIC code can be found in [ Ferraro:90b ].
The parallel one-dimensional electrostatic code was modified to include the effects of external and self-consistent magnetic fields. This one-dimensional electromagnetic code, with kinetic electrons and ions, has been used to study electron dynamics in oblique collisionless shock waves such as in the earth's bow shock. Forces on the particles are found from the fields at the grid points by interpolation. For this code, with variation in the x direction only, the orbit equations for the i-th particle are
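In a standard non-relativistic form (an assumption here; the code's units and normalization may differ), with spatial variation retained only in x, these orbit equations read
\[ \frac{dx_i}{dt} = v_{x,i}, \qquad m_i\,\frac{d\mathbf{v}_i}{dt} = q_i\left[\mathbf{E}(x_i,t) + \mathbf{v}_i \times \mathbf{B}(x_i,t)\right]. \]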
Motion is followed in the x direction only, but all three velocity components must be calculated in order to calculate the force. The longitudinal (along x ) electric field is found by solving Poisson's equation
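In SI form (again, the code may use normalized quantities), Poisson's equation and the resulting longitudinal field can be written
\[ \frac{\partial^2 \phi}{\partial x^2} = -\frac{\rho}{\epsilon_0}, \qquad E_x = -\frac{\partial \phi}{\partial x}. \]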
The transverse (to x) electromagnetic fields, E_y, E_z, B_y, and B_z, are found by solving
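A standard SI statement of these one-dimensional transverse field equations (the code's normalization, or its use of combined field variables, may differ) is
\[ \frac{\partial E_y}{\partial t} = -c^2\,\frac{\partial B_z}{\partial x} - \frac{J_y}{\epsilon_0}, \qquad \frac{\partial E_z}{\partial t} = c^2\,\frac{\partial B_y}{\partial x} - \frac{J_z}{\epsilon_0}, \]
\[ \frac{\partial B_y}{\partial t} = \frac{\partial E_z}{\partial x}, \qquad \frac{\partial B_z}{\partial t} = -\frac{\partial E_y}{\partial x}. \]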
Table 9.2: Comparison of Push Times on Various Computers
The plasma current density and charge density are found at the grid points by interpolation from the particle positions. Only the transverse ( y and z ) components of the plasma current are needed. These coupled particle and field equations are solved in time as an initial value problem. As in the electrostatic code, the fields are solved by Fourier-transforming the charge and current densities and solving the equation in k space, and advancing the Fourier components in time. External fields and currents can also be included. At each time step, the fields are transformed back to configuration space to calculate the forces needed to advance the particles to the next time step. The hypercube FFT routine described in Section 12.4 was used in the one-dimensional codes. Extending the existing parallel electrostatic code to include the electromagnetic effects required no change in the parallel decomposition of the code.
In the GCPIC electrostatic code, the partitioning of the grid was static. The grid was partitioned so that the computational load of the processors was initially balanced. As simulations progress and particles move among processors, the spatial distribution of the particles can change, leading to load imbalance. This can severely degrade the parallel efficiency of the push stage of the computation. To avoid this, dynamic load balancing has been implemented in a one-dimensional electromagnetic code [ Liewer:90a ] and a two-dimensional electrostatic code [ Ferraro:93a ].
To implement dynamic load balancing, the grid is repartitioned into new subdomains with roughly equal numbers of particles as the simulation progresses. The repartitioning is not done at every time step. The load imbalance is monitored at a user-specified interval. When the imbalance becomes sufficiently large, the grid is repartitioned and the particles moved to the appropriate processors, as necessary. The load was judged sufficiently imbalanced to warrant load balancing when the number of particles per processor deviated from the ideal value (= number of particles/number of processors) by , for example, twice the statistical fluctuation level.
The dynamic load balancing is performed during the push stage of the computation. Specifically, the new grid partitions are computed after the particle positions have been updated, but before the particles are moved to new processors, to avoid unnecessary movement of particles. If the loads are sufficiently balanced, the subroutine computing the new grid partitions is not called. The subroutine that moves the particles to the appropriate processors is called in either case.
To accurately represent the physics, a particle cannot move more than one grid cell per time step. As a result, in the static one-dimensional code, the routine which moves particles to new processors only had to move particles to nearest-neighbor processors. To implement dynamic load balancing, this subroutine had to be modified to allow particles to be moved to processors any number of steps away. Moving the particles to new processors after grid repartitioning can add significant overhead; however, this is incurred only at time steps when load balancing occurs.
The new grid partitions are computed by a very simple method which adds very little overhead to the parallel code. Each processor constructs an approximation to the plasma density profile, n(x), and uses this to compute the grid partitioning needed to balance the load. To construct the approximate density profile, each processor sends the locations of its current subdomain boundaries and its current number of particles to all other processors. From this information, each processor can compute the average plasma density in each processor and from this can create the approximate density profile (with as many points as processors). This approximate profile is used to compute the grid partitioning which approximately divides the particles equally among the processors. This is done by determining the set of subdomain boundaries x_p such that the integral of n(x) over each new subdomain is the same, that is, equal to the total number of particles divided by the number of processors.
Linear interpolation of the approximate profile is used in the numerical integration. The actual plasma density profile could also be used in the integration to determine the partitions. No additional computation would be necessary to obtain the local (within a processor) density profile because it is already computed for the field solution stage. However, it would require more communication to make the density profile global. Other methods of calculating new subdomain boundaries, such as sorting particles, require a much larger amount of communication and computational overhead.
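The following routine sketches this boundary computation under the piecewise-constant density approximation described above; the names and the exact interpolation are illustrative, not taken from the production code. Given the current boundaries and particle counts, it inverts the piecewise-linear cumulative particle count so that each new subdomain receives the same number of particles.
      SUBROUTINE REPART( P, XB, NPART, XNEW )
*     Sketch of the computation of new one-dimensional subdomain
*     boundaries for dynamic load balancing.  XB(0:P) holds the
*     current boundaries and NPART(1:P) the current particle count of
*     each of the P processors.  The cumulative particle count,
*     piecewise linear in x under the constant-density-per-subdomain
*     approximation, is inverted so that each new subdomain holds
*     NTOT/P particles.  (Assumes every old subdomain is nonempty.)
      INTEGER P, NPART( P )
      REAL XB( 0:P ), XNEW( 0:P )
      INTEGER J, K, NTOT
      REAL TARGET, CUM, DENS
      NTOT = 0
      DO 10 J = 1, P
         NTOT = NTOT + NPART( J )
   10 CONTINUE
      XNEW( 0 ) = XB( 0 )
      XNEW( P ) = XB( P )
      K   = 1
      CUM = 0.0
      DO 30 J = 1, P - 1
         TARGET = REAL( J ) * REAL( NTOT ) / REAL( P )
*        advance through old subdomains until the target count falls
*        within subdomain K
   20    CONTINUE
         IF ( CUM + REAL( NPART( K ) ) .LT. TARGET ) THEN
            CUM = CUM + REAL( NPART( K ) )
            K   = K + 1
            GO TO 20
         END IF
*        linear interpolation within old subdomain K
         DENS = REAL( NPART( K ) ) / ( XB( K ) - XB( K - 1 ) )
         XNEW( J ) = XB( K - 1 ) + ( TARGET - CUM ) / DENS
   30 CONTINUE
      RETURN
      END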
The GCPIC algorithm was developed and implemented by Paulett C. Liewer, Jet Propulsion Laboratory, Caltech, and Viktor K. Decyk, Physics Department, University of California, Los Angeles. R. D. Ferraro, Jet Propulsion Laboratory, Caltech, implemented the two-dimensional electrostatic PIC code using the GCPIC algorithm with dynamic load balancing.
A group at JPL, led by Jean Patterson, developed several hypercube codes for the solution of large-scale electromagnetic scattering and radiation problems. Two codes were parallel implementations of standard production-level EM analysis codes and the rest are largely or entirely new. Included in the parallel implementations of existing codes is the widely used numerical electromagnetics code (NEC-2) developed at Lawrence Livermore National Laboratory. Other codes include an integral equation formulation Patch code, a time-domain finite-difference code, a three-dimensional finite-elements code, and infinite and finite frequency selective surfaces codes. Currently, we are developing an anisotropic material modeling capability for the three-dimensional Finite Elements code and a three-dimensional coupled approach code. In the Coupled Approach, one uses finite elements to represent the interior of a scattering object, and boundary integrals for the exterior. Along with the analysis tools, we are developing an Electromagnetic Interactive Analysis Workstation (EIAW) as an integrated environment to aid in design and analysis. The workstation provides a general user interface for specification of an object to be analyzed and graphical representations of the results. The EIAW environment is implemented on an Apollo DN4500 Color Graphics Workstation and a Sun Sparc2. This environment provides a uniform user interface for accessing the available parallel processor resources (e.g., the JPL/Caltech Mark IIIfp and the Intel iPSC/860 hypercubes) [ Calalo:89b ].
One of the areas of current emphasis is the development of the anisotropic three-dimensional finite element analysis tool. We briefly describe this effort here. The finite element method is being used to compute solutions to open region electromagnetic scattering problems where the domain may be irregularly shaped and contain differing material properties. Such a scattering object may be composed of dielectric and conducting materials, possibly with anisotropic and inhomogeneous dielectric properties. The domain is discretized by a mesh of polygonal (two-dimensional) and polyhedral (three-dimensional) elements with nodal points at the corners. The finite element solution that determines the field quantities at these nodal points is stated using the Helmholtz equation. It is derived from Maxwell's equations describing the incident and scattered field for a particular wave number, k . The two-dimensional equation for the out-of-plane magnetic field, H_z, is given by
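A standard scalar form consistent with the statement below that the electric-field case follows by interchanging the material parameters (an assumed reconstruction; the exact expression used in the code may differ) is
\[ \nabla \cdot \left( \frac{1}{\epsilon_r}\,\nabla H_z \right) + k^2\,\mu_r\,H_z = 0, \]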
where ε_r is the relative permittivity and μ_r is the relative magnetic permeability. The equation for the electric field is similarly stated, interchanging ε_r and μ_r.
The open region problem is solved in a finite domain by imposing an artificial boundary condition for a circular boundary. For the two-dimensional case, we are applying the approach of Bayliss and Turkel [ Bayliss:80a ]. The cylindrical artificial boundary condition on scattered field, (where ), is given by
where is the radius of the artificial boundary, is the angular coordinate, and A and B are operators that are dependent on .
The differential Equation 9.7 can be converted to an integral equation by multiplying by a test function which has certain continuity properties. If the total field is expressed in terms of the incident and scattered fields, then we may substitute Equation 9.8 to arrive at our weak form equation
where F is the excitation, which depends on the incident field.
Substituting the field and test function representations in terms of nodal basis functions into Equation 9.9 forms a set of linear equations for the coefficients of the basis functions. The matrix which results from this finite-element approximation is sparse with nonzero elements clustered about the diagonal.
The solution technique for the finite-element problem is based on a domain decomposition . This decomposition technique divides the physical problem space among the processors of the hypercube. While elements are the exclusive responsibility of hypercube processors, the nodal points on the boundaries of the subdomains are shared. Because shared nodal points require that there be communication between hypercube processors, it is important for processing efficiency to minimize the number of these shared nodal points.
Figure 9.5: Domain Decomposition of the Finite Element Mesh into Subdomains, Each of Which Is Assigned to a Different Hypercube Processor.
The tedious process of specifying the finite-element model to describe the geometry of the scattering object is greatly simplified by invoking the graphical editor, PATRAN-Plus, within the Hypercube Electromagnetics Interactive Analysis Workstation. The graphical input is used to generate the finite-element mesh. Currently, we have implemented isoparametric three-node triangular, six-node triangular, and nine-node quadrilateral elements for the two-dimensional case, and linear four-node tetrahedral elements for the three-dimensional case.
Once the finite-element mesh has been generated, the elements are allocated to hypercube processors with the aid of a partitioning tool which we have developed. In order to achieve good load balance, each of the hypercube processors should receive approximately the same number of elements (which reflects the computation load) and the same number of subdomain edges (which reflects the communication requirement). The recursive inertial partitioning (RIP) algorithm chooses the best bisection axis of the mesh based on calculated moments of inertia. Figure 9.6 illustrates one possible partitioning for a dielectric cylinder.
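The essential step of RIP can be sketched as follows; this is an illustrative, RIP-style routine (not the partitioning tool itself) that finds the principal axis of the element centroids from their second moments and splits the elements into two equal halves at the median projection onto that axis.
      SUBROUTINE RIPCUT( N, XC, YC, P, W, ISIDE )
*     One level of a recursive-inertial-partitioning style bisection
*     (illustrative only): find the principal axis of the element
*     centroids (XC,YC), project the centroids onto it, and split
*     them into two equal halves at the median projection.  P and W
*     are REAL work arrays of length N; ISIDE(I) returns 0 or 1, the
*     half to which element I is assigned.  Degenerate (isotropic)
*     centroid distributions are not treated specially here.
      INTEGER N, ISIDE( N )
      REAL XC( N ), YC( N ), P( N ), W( N )
      INTEGER I, J
      REAL XBAR, YBAR, SXX, SYY, SXY, THETA, CT, ST, PMED, T
      XBAR = 0.0
      YBAR = 0.0
      DO 10 I = 1, N
         XBAR = XBAR + XC( I )
         YBAR = YBAR + YC( I )
   10 CONTINUE
      XBAR = XBAR / REAL( N )
      YBAR = YBAR / REAL( N )
*     second moments of the centroid distribution
      SXX = 0.0
      SYY = 0.0
      SXY = 0.0
      DO 20 I = 1, N
         SXX = SXX + ( XC( I ) - XBAR )**2
         SYY = SYY + ( YC( I ) - YBAR )**2
         SXY = SXY + ( XC( I ) - XBAR ) * ( YC( I ) - YBAR )
   20 CONTINUE
*     principal (long) axis of the distribution
      THETA = 0.5 * ATAN2( 2.0 * SXY, SXX - SYY )
      CT = COS( THETA )
      ST = SIN( THETA )
      DO 30 I = 1, N
         P( I ) = CT * ( XC( I ) - XBAR ) + ST * ( YC( I ) - YBAR )
         W( I ) = P( I )
   30 CONTINUE
*     median of the projections by a simple insertion sort (adequate
*     for a sketch; a selection algorithm would be used in practice)
      DO 50 I = 2, N
         T = W( I )
         DO 40 J = I - 1, 1, -1
            IF ( W( J ) .LE. T ) GO TO 45
            W( J + 1 ) = W( J )
   40    CONTINUE
         J = 0
   45    W( J + 1 ) = T
   50 CONTINUE
      PMED = W( ( N + 1 ) / 2 )
      DO 60 I = 1, N
         ISIDE( I ) = 0
         IF ( P( I ) .GT. PMED ) ISIDE( I ) = 1
   60 CONTINUE
      RETURN
      END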
Figure 9.6: Finite Element Mesh for a Dielectric Cylinder Partitioned Among Eight Hypercube Processors
The finite-element problem can be solved using several different strategies: iterative solution, direct solution, or a hybrid of the two. We employ all of these techniques in our finite elements testbed. We use a preconditioned biconjugate gradients approach for iterative solutions and a Crout solver for the direct solution [Peterson:85d;86a]. We have also developed a hybrid solver which first uses Gaussian elimination locally within hypercube processors, and then biconjugate gradients to resolve the remaining degrees of freedom [ Nour-Omid:87b ].
The output from the finite elements code is displayed graphically at the Electromagnetics Interactive Analysis Workstation. In Figure 9.7 (Color Plate) are plotted the real (on the left) and the imaginary (on the right) components of the total scalar field for a conducting cylinder of ka=50 . The absorbing boundary is placed at kr=62 . Figure 9.8 (Color Plate) shows the plane wave propagation indicated by vectors in a rectangular box (no scatterer). The box is modeled using linear tetrahedral elements. Figure 9.9 (Color Plate) shows the plane wave propagation (no scatterer) in a spherical domain, again using tetrahedral linear elements. The half slices show the internal fields. In the upper left is the x -component of the field, the lower left is the z -component, and on the right is the y -component with the fields shown as contours on the surface.
Figure 9.7: Results from the two-dimensional electromagnetic scalar finite-element code described in the text.
Figure 9.8: Test case for the electromagnetic three-dimensional code with no scatterer described in text.
Figure 9.9: Test case for an electromagnetic three-dimensional plane wave in a spherical domain with no scatterer, described in the text.
The speedups over the problem running on one processor are plotted for hypercube configurations ranging from 1 to 32 processors in Figure 9.10 . The problem for this set of runs is a two-dimensional dielectric cylinder model consisting of 9313 nodes.
Figure 9.10: Finite-Element Execution Speedup Versus Hypercube Size
The setup and solve portions of the total execution time demonstrate 87% and 81% efficiencies, respectively. The output portion, where the results obtained by each processor are sent back to the workstation, runs at about 50% efficiency. The input routine exhibits no speedup and greatly reduces the overall efficiency of the code to 63%. Clearly, this is an area on which we now must focus. We have recently implemented the partitioning code in parallel. We are also now reducing the size of the input file by compressing the contents of the mesh data file and removing formatted reads and writes. We are also developing a parallel mesh partitioner which iteratively refines a coarse mesh which was generated by the graphics software.
We are currently exploring a number of accuracy issues with regard to the finite elements and coupled approach solutions. Such issues include gridding density, element types, placement of artificial boundaries, and specification of basis functions. We are investigating outgoing wave boundary conditions; currently, we are using a modified Sommerfeld radiation condition in three dimensions. In addition, we are exploring a number of higher-order element types for three dimensions. Central to our investigations is the objective of developing analysis techniques for massive three-dimensional problems.
We have demonstrated that the parallel processing environments offered by the current coarse-grain MIMD architectures are very well suited to the solution of large-scale electromagnetic scattering and radiation problems. We have developed a number of parallel EM analysis codes that currently run in production mode. These codes are being embedded in a Hypercube Electromagnetic Interactive Analysis Workstation. The workstation environment simplifies the user specification of the model geometry and material properties, and the input of run parameters. The workstation also provides an ideal environment for graphically viewing the resulting currents and near- and far-fields. We are continuing to explore a number of issues to fully exploit the capabilities of this large-memory, high-performance computing environment. We are also investigating improved matrix solvers for both dense and sparse matrices, and have implemented out-of-core solving techniques, which will prevent us from becoming memory-limited. By establishing testbeds, such as the finite-element one described here, we will continue to explore issues that will maintain computational accuracy, while reducing the overall computation time for EM scattering and radiation analysis problems.
Efficient sparse linear algebra cannot be achieved as a straightforward extension of the dense case described in Chapter 8 , even for concurrent implementations. This paper details a new, general-purpose unsymmetric sparse LU factorization code built on the philosophy of Harwell's MA28, with variations. We apply this code in the framework of Jacobian-matrix factorizations arising from Newton iterations in the solution of nonlinear systems of equations. Serious attention has been paid to the data-structure requirements, complexity issues, and communication features of the algorithm. Key results include reduced-communication pivoting for both the ``analyze'' A-mode and repeated B-mode factorizations, and effective general-purpose data distributions useful for incrementally trading off process-column load balance in factorization against triangular-solve performance. Future planned efforts are cited in conclusion.
The topic of this section is the implementation and concurrent performance of sparse, unsymmetric LU factorization for medium-grain multicomputers. Our target hardware is distributed-memory, message-passing concurrent computers such as the Symult s2010 and Intel iPSC/2 systems. For both of these systems, efficient cut-through wormhole routing technology provides pairwise communication performance essentially independent of the spatial location of the computers in the ensemble [ Athas:88a ]. The Symult s2010 is a two-dimensional, mesh-connected concurrent computer; all examples in this paper were run on this variety of hardware. Message-passing performance, portability, and related issues relevant to this work are detailed in [ Skjellum:90a ].
Figure 9.11:
An Example of Jacobian Matrix Structures. In
chemical-engineering process flowsheets, Jacobians with main-band
structure, lower-triangular structure (feedforwards), upper-triangular
structure (feedbacks), and borders (global or artificially restructured
feedforwards and/or feedbacks) are common.
Questions of linear-algebra performance are pervasive throughout scientific and engineering computation. The need for high-quality, high-performance linear algebra algorithms (and libraries) for multicomputer systems therefore requires no attempt at justification. The motivation for the work described here has a specific origin, however. Our main higher level research goal is the concurrent dynamic simulation of systems modelled by ordinary differential and algebraic equations; specifically, dynamic flowsheet simulation of chemical plants (e.g., coupled distillation columns) [ Skjellum:90c ]. Efficient sequential integration algorithms solve staticized nonlinear equations at each time point via modified Newton iteration (cf., [ Brenan:89a ], Chapter 5). Consequently, a sequence of structurally identical linear systems must be solved; the matrices are finite-difference approximations to Jacobians of the staticized system of ordinary differential-algebraic equations. These Jacobians are large, sparse, and unsymmetric for our application area. In general, they possess both band and significant off-band structure. Generic structures are depicted in Figure 9.11 . This work should also bear relevance to electric power network/grid dynamic simulation where sparse, unsymmetric Jacobians arise, and elsewhere.
Figure 9.12:
Process-Grid Data Distribution of Ax=b. Representation of a concurrent matrix, and distributed-replicated concurrent vectors, on a logical process grid. The solution of Ax=b first appears in x, a column-distributed vector, and then is normally ``transposed'' via a global combine to the row-distributed vector y.
We solve the problem Ax=b where A is large, and includes many zero entries. We assume that A is unsymmetric both in sparsity pattern and in numerical values. In general, the matrix A will be computed in a distributed fashion, so we will inherit a distribution of the coefficients of A (cf., Figures 9.12 and 9.13 ). Following the style of Harwell's MA28 code for unsymmetric sparse matrices, we use a two-phase approach to this solution. There is a first LU factorization called A-mode or ``analyze,'' which builds data structures dynamically, and employs a user-defined pivoting function. The repeated B-mode factorization uses the existing data structures statically to factor a new, similarly structured matrix, with the previous pivoting pattern. B-mode monitors stability with a simple growth factor estimate. In practice, A-mode is repeated whenever instability is detected. The two key contributions of this sparse concurrent solver are reduced communication pivoting, and new data distributions for better overall performance.
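As an illustration of the stability monitor just described, the following C sketch shows the kind of growth-factor test that could trigger a return to A-mode; the threshold and the particular growth estimate are illustrative assumptions, not the solver's actual code.

#include <stdbool.h>

/* Illustrative growth-factor check: compare the largest magnitude seen
   while forming the B-mode factors against the largest magnitude in the
   original matrix; if growth is excessive, the preset pivot sequence is
   presumed unstable and A-mode is repeated.  The threshold is an assumed
   value, not taken from the solver described in the text.              */
bool bmode_unstable(double max_abs_original, /* max |a(i,j)| before factoring */
                    double max_abs_factors,  /* max magnitude seen in B-mode  */
                    double threshold)        /* e.g., 1.0e8 (assumed)         */
{
    double growth = max_abs_factors / max_abs_original;
    return growth > threshold;               /* if true, redo A-mode          */
}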
Figure 9.13:
Example of Process-Grid Data Distribution. An array with block-linear rows (B=2) and scattered columns on a logical process grid. Local arrays are denoted at left by the grid position of the owning process; subscripts are the global (I,J) indices.
Following Van de Velde [ Velde:90a ], we consider the LU factorization of a real N-by-N matrix A. It is well known (e.g., [ Golub:89a ], pp. 117-118) that for any such matrix A, an LU factorization of the form P_r A P_c = L U exists, where P_r and P_c are square (orthogonal) permutation matrices, and L and U are the unit lower- and upper-triangular factors, respectively. Whereas the pivot sequence is stored (two N-length integer vectors), the permutation matrices are not stored or computed with explicitly. Rearranging, based on the orthogonality of the permutation matrices, A = P_r' L U P_c'. We factor A with implicit pivoting (no rows or columns are exchanged explicitly as a result of pivoting); therefore, we do not store L and U directly, but rather permuted versions of them. The ``unravelling'' of the permutation matrices is accomplished readily (without implication of additional interprocess communication) during the triangular solves.
For the sparse case, performance is more difficult to quantify than for the dense case but, for example, banded matrices with bandwidth w can be factored with O(N w^2) work; we expect subcubic complexity in N for reasonably sparse matrices, and strive for subquadratic complexity for very sparse matrices. The triangular solves can be accomplished in work proportional to the number of entries in the respective triangular matrix L or U. The pivoting strategy is treated as a parameter of the algorithm and is not predetermined. We can consequently treat the pivoting function as an application-dependent function, and sometimes tailor it to special problem structures (cf., Section 7 of [ Velde:88a ]) for higher performance. As for all sparse solvers, we also seek subquadratic memory requirements in N, attained by storing matrix entries in linked-list fashion, as illustrated in Figure 9.14 .
Figure 9.14:
Linked-list Entry Structure of Sparse Matrix. A single entry
consists of a double-precision value (8 bytes), the local row (i) and column
(j) index (2 bytes each), a ``Next Column Pointer'' indicating the next
current column entry (fixed j), and a ``Next Row Pointer'' indicating the next
current row entry (fixed i), at 4 bytes each. Total: 24 bytes per entry.
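A minimal C declaration matching the entry layout in the caption might look as follows (field names are illustrative, and the 24-byte total assumes 4-byte pointers with alignment padding):

/* One nonzero of the sparse matrix, threaded into both its row list and
   its column list.  On a 32-bit multicomputer node: 8-byte value, two
   2-byte local indices, two 4-byte pointers, padded to 24 bytes.       */
struct sparse_entry {
    double               value;     /* numerical value                  */
    short                i, j;      /* local row and column indices     */
    struct sparse_entry *next_col;  /* next entry in the same column    */
    struct sparse_entry *next_row;  /* next entry in the same row       */
};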
For further discussion of LU factorizations and sparse matrices, see [ Golub:89a ], [ Duff:86a ].
At each stage of the concurrent LU factorization, the pivot element is chosen by the user-defined pivot function. Then, the pivot row (new row of U ) must be broadcast, and pivot column (new column of L ) must be computed and broadcast on the logical process grid (cf., Figure 9.12 ), vertically and horizontally, respectively. Note that these are interchangeable operations. We use this degree of freedom to reduce the communication complexity of particular pivoting strategies, while impacting the effort of the LU factorization itself negligibly.
We define two ``correctness modes'' of pivoting functions. In the first correctness mode, ``first row fanout,'' the exit conditions for the pivot function are: all processes must know the pivot process row; the pivot process row must know the pivot process column as well as the local matrix row of the pivot; and the pivot process must know, in addition, the pivot value and the local matrix column of the pivot. Partial column pivoting and preset pivoting can be set up to satisfy these correctness conditions as follows. For partial column pivoting, the kth row is eliminated at the kth step of the factorization. From this fact, each process can derive the pivot process row and the local matrix row using the row data-distribution function. Having identified themselves, the pivot-row processes can look for the largest element in the local matrix row and choose the pivot element globally among themselves via a combine. At completion, this places the pivot process column, the local matrix column, and the pivot value in the entire pivot process row. This completes the requirements for the ``first row fanout'' correctness mode. For preset pivoting, the kth elimination row and column are both given by the preset pivot sequence, and each process knows these values without communication. Furthermore, the pivot process looks up the pivot value. Hence, preset pivoting satisfies the requirements of this correctness mode also.
For ``first row fanout,'' universal knowledge of the pivot process row, and knowledge of the pivot matrix row by the pivot process row, allow the vertical broadcast of this row (new row of U). In addition, we broadcast the pivot process column, the local matrix column, and the pivot value simultaneously. This extends the correct pivot-column information to all processes, as well as the local matrix column and the pivot value to the pivot process column. Hence, the multiplier (L) column may be correctly computed and broadcast. Along with the multiplier-column broadcast, we include the pivot value. After this broadcast, all processes have the correct indices and the pivot value. This provides all that's required to complete the current elimination step.
For the second correctness mode, ``first column fanout,'' the exit conditions for the pivot function are: all processes must know the pivot process column, and the entire pivot process column must know the pivot process row, the pivot value, and the local matrix column of the pivot. The pivot process in addition knows the local matrix row. Partial row pivoting can be set up to satisfy these correctness conditions. The arguments are analogous to partial column pivoting and are given in [ Skjellum:90c ].
For ``first column fanout,'' the entire pivot process column knows the pivot value and the local column of the pivot. Hence, the multiplier column may be computed by dividing the pivot matrix column by the pivot value. This column of L can then be broadcast horizontally, including the pivot value and the pivot-row information as additional data. After this step, the entire ensemble has the correct pivot value and indices; in addition, the pivot process row has the correct local matrix row. Hence, the pivot matrix row may be identified and broadcast. This second broadcast completes the needed information in each process for effecting the kth elimination step.
Thus, when using partial row or partial column pivoting, only local combines within the pivot process column (respectively, row) are needed. The other processes don't participate in the combine, as they must without this methodology. Preset pivoting implies no pivoting communication at all, except very occasionally (e.g., 1 in 5000 times), as noted in [ Skjellum:90c ], to remove memory unscalabilities. This pivoting approach is a direct savings, gained at negligible additional broadcast overhead. See also [ Skjellum:90c ].
We introduce new closed-form, constant-time, constant-memory data distributions useful for sparse matrix factorizations and the problems that generate such matrices. We quantify evaluation costs in Table 9.3 .
Table 9.3:
Evaluation Times for Three Data Distributions
Every concurrent data structure is associated with a logical process grid at creation (cf., Figure 9.12 and [Skjellum:90a;90c]). Vectors are either row- or column-distributed within a two-dimensional process grid. Row-distributed vectors are replicated in each process column, and distributed in the process rows. Conversely, column-distributed vectors are replicated in each process row, and distributed in the process columns. Matrices are distributed both in rows and columns, so that a single process owns a subset of matrix rows and columns. This partitioning follows the ideas proposed by Fox et al. [ Fox:88a ] and others. Within the process grid, coefficients of vectors and matrices are distributed according to one of several data distributions. Data distributions are chosen to compromise between load-balancing requirements and constraints on where information can be calculated in the ensemble.
Definition 1 (Data-Distribution Function)
A data-distribution function maps three integers (I, P, M) to a pair (p, i), where I, with 0 <= I < M, is the global name of a coefficient, P is the number of processes among which all coefficients are to be partitioned, and M is the total number of coefficients. The pair (p, i) represents the process p (0 <= p < P) and the local (process-p) name i of the coefficient. The inverse distribution function transforms the local name i back to the global coefficient name I.
The formal requirements for a data-distribution function are as follows. Let the set of global coefficient names associated with each process be defined implicitly by the data-distribution function. These per-process sets must be pairwise disjoint, and their union must be the entire set of global coefficient names. The cardinality of each set is given by the distribution's cardinality function.
The linear and scatter data-distribution functions are most often defined. We generalize these functions (by blocking and scattering parameters) to incorporate practically important degrees of freedom. These generalized distribution functions yield optimal static load balance as do the unmodified functions described in [ Velde:90a ] for unit block size, but they differ in coefficient placement. This distinction is technical, but necessary for efficient implementations.
Definition 2 (Generalized Block-Linear)
The generalized block-linear distribution function is parameterized by the coefficient block size B; its explicit definition, inverse, and cardinality function are given in [ Skjellum:90c ]. For B=1, a load-balance-equivalent variant of the common linear data-distribution function is recovered. The generalized block-linear distribution function divides coefficients among the P processes so that each process receives a set of coefficients with contiguous global names, while optimally load-balancing the b = ceil(M/B) blocks among the P sets. Coefficient boundaries between processes are on multiples of B. The maximum possible coefficient imbalance between processes is B. If B does not divide the total number of coefficients evenly, the last block in process P-1 will be foreshortened.
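For concreteness, a plain block-linear map with block size B can be sketched in C as below; this is only an illustration of the idea (contiguous global names per process, boundaries on multiples of B), not the authors' exact generalized function, whose formulas appear in [ Skjellum:90c ].

#include <assert.h>

/* Illustrative block-linear distribution: coefficient I (0 <= I < M) is
   assigned to one of P processes in blocks of B consecutive names, with
   the first (b mod P) processes holding one extra block.  Returns the
   owning process *p and the local name *i.                             */
void block_linear(int I, int P, int M, int B, int *p, int *i)
{
    int b     = (M + B - 1) / B;     /* number of blocks                */
    int per   = b / P;               /* whole blocks per process        */
    int extra = b % P;               /* processes holding one extra     */
    int blk   = I / B;               /* block containing I              */
    int q, first_blk;

    assert(I >= 0 && I < M);
    if (blk < extra * (per + 1)) {   /* early processes: per+1 blocks   */
        q         = blk / (per + 1);
        first_blk = q * (per + 1);
    } else {                         /* remaining processes: per blocks */
        q         = extra + (blk - extra * (per + 1)) / per;
        first_blk = extra * (per + 1) + (q - extra) * per;
    }
    *p = q;
    *i = I - first_blk * B;          /* local name within process q     */
}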
Definition 3 (Parametric Functions)
To allow greater freedom in the distribution of coefficients among processes, we define a new, two-parameter distribution function family. The B blocking parameter (just introduced in the block-linear function) is mainly suited to the clustering of coefficients that must not be separated by an interprocess boundary (again, see [ Skjellum:90c ] for a definition of general block-scatter). Increasing B worsens the static load balance. Adding a second scaling parameter S (of no impact on the static load balance) allows the distribution to scatter coefficients to a greater or lesser degree, directly as a function of this one parameter. The two-parameter distribution function, its inverse, and its cardinality function are given explicitly in [ Skjellum:90c ]; the one-parameter distribution function family occurs as the special case B=1. For S=1, a block-scatter distribution results, while for sufficiently large S, the generalized block-linear distribution function is recovered.
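The limiting behaviour described above can be illustrated with an ordinary block-cyclic map, sketched below in C; S = 1 gives the pure scatter distribution, and S large enough that every global name falls in the first sweep gives a linear one. This is only an analogue of the one-parameter family, not its exact definition (see [ Skjellum:90c ]).

/* Illustrative block-cyclic ("scattered") map with scattering block S:
   consecutive groups of S coefficients are dealt to the P processes in
   round-robin fashion.  S = 1 is the scatter distribution; S so large
   that I/S < P for all I degenerates to a linear distribution.         */
void block_cyclic(int I, int P, int S, int *p, int *i)
{
    *p = (I / S) % P;                   /* owning process               */
    *i = (I / (S * P)) * S + (I % S);   /* local name within process *p */
}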
Definition 4 (Data Distributions)
Given a data-distribution function family, a process list of P (respectively, Q) processes, M (respectively, N) as the number of coefficients, and a row (respectively, column) orientation, a row (column) data distribution is defined by applying the chosen distribution function to the row (column) coefficients over that process list. A two-dimensional data distribution may be identified as consisting of a row and a column distribution defined over a two-dimensional process grid of P x Q processes.
Further discussion and detailed comparisons on data-distribution functions are offered in [ Skjellum:90c ]. Figure 9.13 illustrates the effects of linear and scatter data-distribution functions on a small rectangular array of coefficients.
Consider a fixed logical process grid of R processes, with R = P x Q. For the sake of argument, assume partial row pivoting during LU factorization for the retention of numerical stability. Then, for the LU factorization, it is well known that a scatter distribution is ``good'' for the matrix rows, and optimal if no off-diagonal pivots were chosen. Furthermore, the optimal column distribution is also scatter, because columns are chosen in order for partial row pivoting. Compatibly, a scatter distribution of matrix rows is also ``good'' for the triangular solves. However, for triangular solves, the best column distribution is linear, because this implies less intercolumn communication, as we detail below. In short, the optimal configurations conflict, and because explicit redistribution is expensive, a static compromise must be chosen. We address this need to compromise through the one-parameter distribution function described in the previous section, which offers a variable degree of scattering via the S-parameter. To first order, changing S does not affect the cost of computing the Jacobian (assuming columnwise finite-difference computation), because each process column works independently.
It's important to note that triangular solves derive no benefit from Q > 1. The standard column-oriented solve keeps one process column active at any given time. For any column distribution, the updated right-hand-side vectors are retransmitted W times (process column to process column) during the triangular solve, whenever the active process column changes. There are at least Q-1 such transmissions (linear distribution), and at most on the order of N transmissions (scatter distribution). The complexity of this retransmission is proportional to W N, representing quadratic work in N when W is itself proportional to N.
Calculation complexity for a sparse triangular solve is proportional to the number of elements in the triangular matrix, with a low leading coefficient. Often, there are O(N^(1+x)) elements, with x < 1, in the triangular matrices, including fill. This operation is then O(N^(1+x)), which is less than quadratic in N. Consequently, for large W, the retransmission step is likely of greater cost than the original calculation. This retransmission effect constrains the amount of scattering and the size of Q in order to have any chance of concurrent speedup in the triangular solves.
Using the one-parameter distribution with S > 1 implies that the active process column changes roughly N/S times, so that the retransmission complexity is reduced correspondingly, to about N^2/S. Consequently, we can bound the amount of retransmission work by making S sufficiently large. Clearly, the value of S at which the linear distribution is reached is a hard upper bound on useful scattering. We suggest picking S on the order of ten as a first guess, and several times that, more optimistically. The former choice basically reduces retransmission effort by an order of magnitude. Both examples in the following section illustrate the effectiveness of choosing S by these heuristics.
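A back-of-the-envelope estimate of this effect, under the assumptions stated in the comments (it is not code from the solver), can be written as:

#include <stdio.h>

/* Rough retransmission estimate for a column-oriented triangular solve:
   with a block-scattered column distribution of block size S over Q
   process columns, the active column changes roughly N/S times (never
   fewer than Q-1), and each change moves an O(N)-length right-hand-side
   segment.  N and Q below are assumed example values.                  */
int main(void)
{
    const double N = 13040.0;            /* matrix order (example below)      */
    const int    Q = 8;                  /* assumed number of process columns */
    const int    S_values[] = { 1, 10, 30 };

    for (int k = 0; k < 3; ++k) {
        int    S       = S_values[k];
        double changes = N / S;          /* approximate column switches       */
        if (changes < Q - 1) changes = Q - 1;
        printf("S = %2d: ~%6.0f retransmissions, ~%.2e words moved\n",
               S, changes, changes * N);
    }
    return 0;
}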
The two-parameter distribution can be used on the matrix rows to trade off load balance in the factorizations and triangular solves against the amount of (communication) effort needed to compute the Jacobian. In particular, a greater degree of scattering can dramatically increase the time required for a Jacobian computation (depending heavily on the underlying equation structure and problem), but significantly reduce load imbalance during the linear algebra steps. The communication overhead caused by multiple process rows suggests shifting toward smaller P and larger Q (a squatter grid), in which case greater concurrency is attained in the Jacobian computation, and the additional communication previously induced is then somewhat mitigated. The one-parameter distribution used on the matrix columns then proves effective in controlling the cost of the triangular solves by choosing the minimally allowable amount of column scattering.
Let's make explicit the performance objectives we consider when tuning S and, more generally, when tuning the grid shape. In the modified Newton iteration, for instance, a Jacobian factorization is reused until convergence slows unacceptably. An ``LU Factorization + Backsolve'' step is followed by a number of ``Forward + Backsolves,'' with that number varying dynamically throughout the calculation. Assuming an averaged number of solves per factorization (perhaps as large as five [ Brenan:89a ]), our first-level performance goal is a heuristic minimization, over S for fixed P, Q, of the B-mode factorization time plus that average number of triangular-solve times. This objective weights the reduction of triangular-solve costs versus B-mode factorization more heavily than we might at first have assumed, placing a greater potential gain on the one-parameter distribution for higher overall performance. More generally, we want heuristically to optimize the same quantity over S, P, Q, and R. Then the possibility of fine-tuning row and column distributions is important, as is the use of non-power-of-two grid shapes.
Table 9.4:
Order 13040 Band Matrix Performance
We consider an order 13040 banded matrix with a bandwidth of 326 under partial row pivoting. For this example, we have compiled timing results on a fixed process grid with random matrices (entries in the range 0-10,000), using different values of S on the column distribution (Table 9.4 ). We indicate timing for A-mode, B-mode, Backsolves, and Forward- and Backsolves together (``Solve'' heading). For this example, S=30 saves approximately 186 seconds of the triangular-solve cost compared to S=1, coming within roughly 6 seconds of the linear optimal. Simultaneously, we incur about 17 seconds of additional cost in B-mode, while saving about 93 seconds in the Backsolve. Depending on the assumed number of solves per factorization in the first above-mentioned objective function, we save about 262 (respectively, 76) seconds. Based on this example, and other experiences, we conclude that this is a successful practical technique for improving overall sparse linear algebra performance. The following example further bolsters this conclusion.
Now, we turn to a timing example of an order 2500 sparse, random matrix. The matrix has a random diagonal, plus 2 percent random fill of the off-diagonals; entries have a dynamic range of 0-10,000. Normally, data is averaged over random matrices for each grid shape (as noted), and over four repetitive runs for each random matrix. Partial row pivoting was used exclusively. Table 9.5 compiles timings for various grid shapes of row-scatter/column-scatter, and row-scatter/column-( S=10 ) distributions, for as few as nine nodes and as many as 128. Memory limitations set the lower bound on the number of nodes.
This example demonstrates that speedups are possible for this reasonably small sparse example with this general-purpose solver, and that the one-parameter distribution is critical to achieving better overall performance even for this random, essentially unstructured example. Without the one-parameter distribution, triangular-solve performance is poor, except in grid configurations where the factorization is itself degraded. Furthermore, the choice of S=10 is universally reasonable for the Q > 1 grid shapes illustrated here, so the distribution proves easy to tune for this type of matrix. We are able to maintain an almost constant speed for the triangular solves while increasing speed for both the A-mode and B-mode factorizations. We presume, based on experience, that the triangular-solve times are comparable to the sequential solution times; further study is needed in this area to see if and how performance can be improved. The consistent A-mode to B-mode ratio of approximately two is attributed primarily to reduced communication costs in B-mode, realized through the elimination of essentially all combine operations in B-mode.
Table 9.5:
Order 2500 Matrix Performance. Performance is a function of grid shape and size, and of the S-parameter. ``Best'' performance is obtained with S=10.
While triangular-solve performance exemplifies sequentialism in the algorithm, it should be noted that we do achieve significant overall performance improvements between 6 nodes and 96 nodes, and that the repeatedly used B-mode factorization remains dominant compared to the triangular solves even for 128 nodes. Consequently, efforts aimed at increasing the performance of the B-mode factorization (at the expense of additional A-mode work) are interesting to consider. For the factorizations, we also expect that we are achieving nontrivial speedups relative to one node, but we are unable to quantify this at present because of the memory limitations alluded to above.
There are several classes of future work to be considered. First, we need to take the A-mode ``analyze'' phase to its logical completion by including pivot-order sorting of the pointer structures, to improve performance for systems that should demonstrate subquadratic sequential complexity. This will require minor modifications to B-mode (which already takes advantage of column-traversing elimination) to reduce testing for inactive rows as the elimination progresses. We already realize optimal computational work in the triangular solves, and we mitigate the effect of the Q > 1 quadratic communication work using the one-parameter distribution.
Second, we need to exploit ``timelike'' concurrency in the linear algebra: multiple simultaneous pivots. This has been addressed by Alaghband for shared-memory implementations of MA28 with suitable heuristics [ Alaghband:89a ]. These efforts must be reconsidered in the multicomputer setting and effective variations must be devised. This approach should prove an important source of additional speedup for many chemical engineering applications, because of the tendency towards extreme sparsity, with mainly band and/or block-diagonal structure.
Third, we could exploit new communication strategies and data redistribution. Within a process grid, we could incrementally redistribute by utilizing the inherent broadcasts of L columns and U rows to improve load balance in the triangular solves at the expense of slightly more factorization computational overhead and significantly more memory overhead (a factor of nearly two). Memory overhead could be reduced at the expense of further communication if explicit pivoting were used concomitantly.
Fourth, we can develop adaptive broadcast algorithms that track the known load imbalance in the B-mode factorization, and shift greater communication emphasis to nodes with less computational work remaining. For example, the pivot column is naturally a ``hot spot,'' because the multiplier column (L column) must be computed before broadcast to the awaiting process columns. Allowing the non-pivot columns to handle the majority of the communication could be beneficial, even though this implies additional overall communication. We might likewise apply this to the pivot-row broadcast, and especially to the pivot process, because it must participate in two broadcast operations.
We could also utilize two process grids. When rows of U (columns of L) are broadcast, extra broadcasts to a secondary process grid could reasonably be included. The secondary process grid could work on redistributing to an efficient process-grid shape and size for triangular solves while the factorization continues on the primary grid. This overlapping of communication and computation could also be used to reduce the cost of transposing the solution vector from column-distributed to row-distributed, which normally follows the triangular solves.
The sparse solver supports arbitrary user-defined pivoting strategies. We have considered but not fully explored issues of fill-reduction versus minimum time; in particular we have implemented a Markowitz-count fill-reduction strategy [ Duff:86a ]. Study of the usefulness of partial column pivoting and other strategies is also needed. We will report on this in the future.
Reduced-communication pivoting and parametric distributions can be applied immediately to concurrent dense solvers with definite improvements in performance. While triangular solves remain lower-order work in the dense case, and may sensibly admit less tuning in S , the reduction of pivot communication is certain to improve performance. A new dense solver exploiting these ideas is under construction at present.
In closing, we suggest that the algorithms generating the sequences of sparse matrices must themselves be reconsidered in the concurrent setting. Changes that introduce multiple right-hand sides could help to amortize the linear algebra cost over multiple timelike steps of the higher level algorithm. Because of inevitable load imbalance, idle processor time is essentially free; algorithms that find ways to use this time by asking for more speculative (partial) solutions appear useful in working towards higher performance.
This work was performed by Anthony Skjellum and Alvin Leung while the latter held a Caltech Summer Undergraduate Research Fellowship. A helpful contribution was the dense concurrent linear algebra library provided by Eric Van de Velde, as well as his prototype sparse concurrent linear algebra library.
The accurate, high-speed solution of systems of ordinary differential-algebraic equations (DAEs) of low index is of great importance in chemical, electrical, and other engineering disciplines. Petzold's Fortran-based DASSL is the most widely used sequential code for solving DAEs. We have devised and implemented a completely new C code, Concurrent DASSL, specifically for multicomputers and patterned on DASSL [Skjellum:89a;90c]. In this work, we address the issues of data distribution and the performance of the overall algorithm, rather than just that of individual steps. Concurrent DASSL is designed as an open, application-independent environment below which linear algebra algorithms may be added in addition to standard support for dense and sparse algorithms. The user may furthermore attach explicit data interconversions between the main computational steps, or choose compromise distributions. A ``problem formulator'' (simulation layer) must be constructed above Concurrent DASSL, for any specific problem domain. We indicate performance for a particular chemical engineering application, a sequence of coupled distillation columns. Future efforts are cited in conclusion.
We discuss the design of a general-purpose integration system for ordinary differential-algebraic equations of low index, following up on our more preliminary discussion in [ Skjellum:89a ]. The new solver, Concurrent DASSL, is a parallel, C-language implementation of the algorithm codified in Petzold's DASSL, a widely used Fortran-based solver for DAE's [ Petzold:83a ], [ Brenan:89a ], and is based on a loosely synchronous model of communicating sequential processes [ Hoare:78a ]. Concurrent DASSL retains the same numerical properties as the sequential algorithm, but introduces important new degrees of freedom compared to it. We identify the main computational steps in the integration process; for each of these steps, we specify algorithms that have correctness independent of data distribution.
We cover the computational aspects of the major computational steps, and their data distribution preferences for highest performance. We indicate the properties of the concurrent sparse linear algebra as it relates to the rest of the calculation. We describe the proto-Cdyn simulation layer, a distillation-simulation-oriented Concurrent DASSL driver which, despite specificity, exposes important requirements for concurrent solution of ordinary DAE's; the ideas behind a template formulation for simulation are, for example, expressed.
We indicate formulation issues and specific features of the chemical engineering problem-dynamic distillation simulation. We indicate results for an example in this area, which demonstrates not only the feasibility of this method, but also the need for additional future work. This is needed both on the sparse linear algebra, and on modifying the DASSL algorithm to reveal more concurrency, thereby amortizing the cost of linear algebra over more time steps in the algorithm.
We address the following initial-value problem, consisting of combinations of N linear and nonlinear coupled, ordinary differential-algebraic equations over the interval t_0 <= t <= t_F:
IVP: F(y'(t), y(t), u(t), t) = 0,  y(t_0) = y_0,  y'(t_0) = y'_0,  (9.11)
with unknown state vector y(t) and known external inputs u(t), where y_0 and y'_0 are the given initial-value and derivative vectors, respectively. We will refer to Equation 9.11's deviation from zero as the residuals or residual vector. Evaluating the residuals means computing F (``model evaluation'') for specified arguments y', y, and t.
DASSL's integration algorithm can be used to solve systems fully implicit in y and y' and of index zero or one, and specially structured forms of index two (and higher) [ Brenan:89a , Chapter 5], where the index is the minimum number of times that part or all of Equation 9.11 must be differentiated with respect to t in order to express y' as a continuous function of y and t [ Brenan:89a , page 17].
By substituting a finite-difference approximation rho_n(y_n) for y'(t_n), we obtain
F(rho_n(y_n), y_n, u(t_n), t_n) = 0,  (9.12)
a set of (in general) nonlinear staticized equations. A sequence of Equation 9.12's will have to be solved, one at each discrete time t_n, in the numerical approximation scheme; neither M, the number of time points, nor the t_n's need be predetermined. In DASSL, the variable-step-size integration algorithm picks the t_n's as the integration progresses, based on its assessment of the local error. The discretization operator for y' varies during the numerical integration process and hence is subscripted as rho_n.
The usual way to solve an instance of the staticized equations, Equation 9.12, is via the familiar Newton-Raphson iterative method (yielding y_n): each iterate is updated by a correction obtained from the Jacobian of the staticized system, scaled by a damping factor c, given an initial, sufficiently good approximation to y_n. The classical method is recovered when the Jacobian is reevaluated at every iteration and c = 1, whereas a modified (damped) Newton-Raphson method results when an old Jacobian is retained (respectively, when c < 1). In the original DASSL algorithm and in Concurrent DASSL, the Jacobian is computed by finite differences rather than analytically; this departure leads in another sense to a modified Newton-Raphson method even though the Jacobian might be refreshed at every iteration and c = 1 might always be satisfied. For termination, a limit on the number of iterations is imposed; a further stopping criterion based on the size of the Newton correction is also incorporated (see Brenan et al. [ Brenan:89a , pages 121-124]).
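The flavour of the damped, frozen-Jacobian iteration can be seen in the following self-contained scalar sketch in C; it only illustrates the update, iteration cap, and step-size test, and is not Concurrent DASSL's corrector.

#include <math.h>
#include <stdio.h>

/* Damped Newton iteration y <- y - c * F(y)/J for the scalar equation
   y*y - 2 = 0, with the Jacobian frozen at its initial value (a
   "modified" Newton method) and a relative step-size stopping test.    */
int main(void)
{
    double y = 2.0, c = 0.8, tol = 1.0e-12;
    double J = 2.0 * y;                 /* old Jacobian, never refreshed */
    for (int k = 0; k < 50; ++k) {      /* iteration limit               */
        double F  = y * y - 2.0;        /* residual                      */
        double dy = -c * F / J;         /* damped correction             */
        y += dy;
        if (fabs(dy) <= tol * (1.0 + fabs(y))) break;   /* stopping test */
    }
    printf("y = %.12f, sqrt(2) = %.12f\n", y, sqrt(2.0));
    return 0;
}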
Following Brenan et al., the approximation to y' is replaced by a BDF-generated linear approximation of the form y' = alpha*y + beta, so that the Jacobian of the staticized system combines the partial derivatives of F with respect to y and y' as J = dF/dy + alpha*dF/dy'. From this approximation, we define the residual as a function of y alone in the intuitive way. We then consider Taylor's Theorem with remainder, from which we can easily express a forward finite-difference approximation for a Jacobian-vector product (assuming sufficient smoothness of F) as a scaled difference of two residual vectors (Equation 9.15). By picking the perturbation proportional to e_j, the jth unit vector in the natural basis, Equation 9.15 yields a first-order-accurate approximation, in the perturbation size delta_j, of the jth column of the Jacobian matrix:
J e_j = [ F(y + delta_j e_j) - F(y) ] / delta_j + O(delta_j),
where F here denotes the staticized residual regarded as a function of y alone.
Each of these N Jacobian-column computations is independent and trivially parallelizable. It's well known, however, that for special structures such as banded and block n -diagonal matrices, and even for general sparse matrices, a single residual can be used to generate multiple Jacobian columns [ Brenan:89a ], [ Duff:86a ]. We discuss these issues as part of the concurrent formulation section below.
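A small, self-contained C example of the columnwise forward-difference approximation is given below; the three-equation residual is a toy stand-in for the model evaluation, and the perturbation rule is an assumption chosen only for illustration.

#include <math.h>
#include <stdio.h>

#define NEQ 3

/* Toy residual F(y) = A*y - b; in the real code this would be the model
   evaluation of the staticized system.                                 */
static void residual(const double *y, double *F)
{
    static const double A[NEQ][NEQ] = {{4,1,0},{1,4,1},{0,1,4}};
    static const double b[NEQ]      = {1,2,3};
    for (int i = 0; i < NEQ; ++i) {
        F[i] = -b[i];
        for (int j = 0; j < NEQ; ++j) F[i] += A[i][j] * y[j];
    }
}

int main(void)
{
    double y[NEQ] = {0,0,0}, F0[NEQ], F1[NEQ], J[NEQ][NEQ];
    residual(y, F0);                          /* base residual           */
    for (int j = 0; j < NEQ; ++j) {           /* columns are independent */
        double d    = 1.0e-7 * (1.0 + fabs(y[j]));  /* perturbation size */
        double save = y[j];
        y[j] += d;
        residual(y, F1);                      /* perturbed residual      */
        for (int i = 0; i < NEQ; ++i) J[i][j] = (F1[i] - F0[i]) / d;
        y[j] = save;
    }
    for (int i = 0; i < NEQ; ++i)
        printf("%8.3f %8.3f %8.3f\n", J[i][0], J[i][1], J[i][2]);
    return 0;
}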
The solution of the Jacobian linear system of equations is required for each Newton iteration, either through a direct (e.g., LU-factorization) or iterative (e.g., preconditioned-conjugate-gradient) method. The most advantageous solution approach depends on N as well as on special mathematical properties and/or structure of the Jacobian matrix. Together, the inner (linear-equation solution) and outer (Newton-Raphson iteration) loops solve a single time point; the overall algorithm generates a sequence of solution points y_n.
In the present work, we restrict our attention to direct, sparse linear algebra as described in [ Skjellum:90d ], although future versions of Concurrent DASSL will support the iterative linear algebra approaches by Ashby, Lee, Brown, Hindmarsh et al. [ Ashby:90a ], [ Brown:91a ]. For the sparse LU factorization, the factors are stored and reused in the modified Newton scenario. Then, repeated use of the old Jacobian implies just a forward- and back-solve step using the triangular factors L and U . Practically, we can use the Jacobian for up to about five steps [ Brenan:89a ]. The useful lifetime of a single Jacobian evidently depends somewhat strongly on details of the integration procedure [ Brenan:89a ].
To use the Concurrent DASSL system on other than toy problems, a simulation layer must be constructed above it. The purpose of this layer is to accept a problem specification from within a specific problem domain, and formulate that specification for concurrent solution as a set of differential-algebraic equations, including any needed data. On one hand, such a layer could explicitly construct the subset of equations needed for each processor, generate the appropriate code representing the residual functions, and create a set of node programs for effecting the simulation. This is the most flexible approach, allowing the user to specify arbitrary nonlinear DAEs. It has the disadvantage of requiring a lot of compiling and linking for each run in which the problem is changed in any significant respect (including but not limited to data distribution), although with sophisticated tactics, parametric variations within equations could be permitted without recompiling from scratch, and incremental linking could be supported.
We utilize a template-based approach here, as we do in the Waveform-Relaxation paradigm for concurrent dynamic simulation [ Skjellum:88a ]. This is akin to the ASCEND II methodology utilized by Kuru and many others [ Kuru:81a ]. It is a compromise approach from the perspective of flexibility; interesting physical prototype subsystems are encapsulated into compiled code as templates. A template is a conceptual building block with states, nonstates, parameters, inputs, and outputs (see below). A general network made from instantiations of templates can be constructed at run time without changing any executable code. User input specifies the number and type of each template, their interconnection pattern, and the initial value of systemic states and extraneous (nonstate) variables, plus the value of adjustable parameters and more elaborate data, such as physical properties. The addition of templates requires new subroutines for the evaluation of the residuals of their associated DAEs, and for interfacing to the remainder of the system (e.g., parsing of user input, interconnectivity issues). With suitable automated tools, this addition process can be made straightforward to the user.
Importantly, the use of a template-based methodology does not imply a degradation in the numerical quality of the model equations or solution method used. We are not obliged to tear equations based on templates or groups of templates as is done in sequential-modular simulators [ Westerberg:79a ], [ Cook:80a ], where ``sequential'' refers in this sense to the stepwise updating of equation subsets, without connection to the number of computers assigned to the problem solution.
Ideally, the simulation layer could be made universal. That is, a generic layer of high flexibility and structural elegance would be created once and for all (and without predilection for a specific computational engine). Thereafter, appropriate templates would be added to articulate the simulator for a given problem domain. This is certainly possible with high-quality simulators such as ASCEND II and Chemsim (a recent Fortran-based simulator driving DASSL and MA28 [ Andersen:88a ], [ Duff:77a ], [ Petzold:83a ]). Even so, we have chosen to restrict our efforts to a more modest simulation layer, called proto-Cdyn, which can create arbitrary networks of coupled distillation columns. This restricted effort has required significant effort, and already allows us to explore many of the important issues of concurrent dynamic simulation. General-purpose simulators are for future consideration. They must address significant questions of user-interface in addition to concurrency-formulation issues.
In the next paragraphs, we describe the important features of proto-Cdyn. In doing so, we indicate important issues for any Concurrent DASSL driver.
A template is a prototype for a sequence of DAEs which can be used repeatedly in different instantiations. Normally, but not always, the template corresponds to some subsystem of a physical-model description of a system, like a tank or distillation tray. The key characteristics of a template are: the number of integration states it incorporates (typically fixed), the number of nonstate variables it incorporates (typically fixed), its input and output connections to other templates, and external sources (forcing functions) and sinks. State variables participate in the overall DASSL integration process. Nonstates are defined as variables which, given the states of a template alone, may be computed uniquely. They are essentially local tear variables. It is up to the template designer whether or not to use such local tear variables: They impact the numerical quality of the solution, in principle. Alternative formulations, where all variables of a template are treated as states, can be posed and comparisons made. Because of the superlinear growth of linear algebra complexity, the introduction of extra integration states must be justified on the basis of numerical accuracy. Otherwise, they artificially slow down the problem solution, perhaps significantly. Nonstates are extremely convenient and practically useful; they appear in all the dynamic simulators we have come across.
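One way to picture such a building block is the following hypothetical C descriptor; the field names are not taken from proto-Cdyn, they simply mirror the characteristics listed above.

/* Hypothetical template descriptor: counts of states and nonstates,
   connection counts, parameters, and the two evaluation phases
   (nonstate update, then residual evaluation).                         */
struct template_desc {
    int     n_states;      /* integration states owned by this template */
    int     n_nonstates;   /* locally computable (tear) variables       */
    int     n_inputs;      /* connections fed from other templates      */
    int     n_outputs;     /* states/nonstates exported to others       */
    double *params;        /* adjustable parameters, physical data, ... */
    void  (*update_nonstates)(const double *states, double *nonstates,
                              const double *params);
    void  (*residual)(const double *states, const double *nonstates,
                      const double *inputs, const double *params,
                      double t, double *res);
};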
The template state and nonstate structure implies a two-phase residual computation. First, given a state vector, the nonstates of each template are updated on a template-by-template basis. Then, given its states and nonstates, inputs from other templates, and external inputs, each template's residuals may be computed. In the sequential implementation, this poses no particular nuisances, other than having two evaluation loops over all templates. However, in concurrent evaluation, a communication phase intervenes between the nonstate and residual updates. This communication phase transmits all states and nonstates appearing as outputs of templates to their corresponding inputs at other templates. This transmission mechanism is considered below under concurrent formulation.
In general, the ``optimal'' ordering for the equations of a dynamic simulation will be too difficult to establish, because of the NP-hard issues involved in structure selection. However, many important heuristics can be applied, such as those that precedence-order the nonlinear equations, and those that permute the Jacobian structure to a more nearly triangular or banded form [ Duff:86a ]. For the proto-Cdyn simulator, we skirt these issues entirely, because it proves easy to arrange a network of columns to produce a ``good'' structure (a main block-tridiagonal Jacobian with off-block-diagonal entries for the intercolumn connections), simply by taking the distillation columns with their states in tray-by-tray, top-down (or bottom-up) order.
Given a set of DAEs, and an ordering for the equations and states (i.e., rows and columns of the Jacobian, respectively), we need to partition these equations among the multicomputer nodes, according to a two-dimensional process grid of shape P x Q. The partitioning of the equations forms, in main part, the so-called concurrent database. This grid structure is illustrated in [ Skjellum:90d , Figure 2]. In proto-Cdyn, we utilize a single process grid for the entire Concurrent DASSL calculation; that is, we do not currently exploit the Concurrent DASSL feature which allows explicit transformations between the main calculational phases (see below). In each process column, the entire set of equations is reproduced, so that any process column can compute not only the entire residual vector for a prediction calculation, but also any column of the Jacobian matrix.
A mapping between the global and local equations must be created. In the general case, it will be difficult to generate a closed-form expression for either the global-to-local mapping or its inverse (which also require storage). At most, we will have on hand a partial (or weak) inverse in each process, so that the corresponding global index of each local index will be available. Furthermore, in each node, a partial global-to-local list of the indices associated with the given node will be stored in global sort order. Then, by binary search, a weak global-to-local mapping will be possible in each process. That is, each process will be able to identify whether a global index resides within it and, if so, the corresponding local index. A strong mapping for row (column) indices will require communication between all the processes in a process row (respectively, column). In the foregoing, we make the tacit assumption that it is unreasonable practice to use storage proportional to the entire problem size N in each node, except if this unscalability can be removed cheaply when necessary for large problems.
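The weak global-to-local lookup can be sketched as an ordinary binary search over the sorted list of locally resident global indices; the function below is illustrative only.

/* Weak global-to-local mapping: my_globals[] holds, in ascending order,
   the global indices resident in this process.  Returns the local name
   of global index I, or -1 if I is not resident here.                  */
static int global_to_local(const int *my_globals, int n_local, int I)
{
    int lo = 0, hi = n_local - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (my_globals[mid] == I)     return mid;   /* local name       */
        else if (my_globals[mid] < I) lo = mid + 1;
        else                          hi = mid - 1;
    }
    return -1;                                      /* not resident     */
}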
The proto-Cdyn simulator works with templates of specific structure-each template is a form of a distillation tray and generates the same number of integration states. It therefore skirts the need for weak distributions. Consequently, the entire row-mapping procedure can be accomplished using the closed-form general two-parameter distribution function family described in [ Skjellum:90d ], where the block size B is chosen as the number of integration states per template. The column-mapping procedure is accomplished with the one-parameter distribution function family also described in [ Skjellum:90d ]. The effects of row and column degree-of-scattering are described in [ Skjellum:90d ] with attention to linear algebra performance.
Next, we turn to Equation 9.11 's (that is, IVP 's) concurrent numerical solution via the DASSL algorithm. We cover the major computational steps in abstract, and we also describe the generic aspects of proto-Cdyn in this connection. In the subsequent section, we discuss issues peculiar to the distillation simulation.
Broadly, the concurrent solution of IVP consists of three block operations: startup, dynamic simulation, and a cleanup phase. Significant concurrency is apparent only in the dynamic simulation phase. We will assume that the simulation interval requested generates enough work so that the startup and cleanup phases prove insignificant by comparison and consequently pose no serious Amdahl's-law bottleneck. Given this assumption, we can restrict our attention to a single step of IVP as illustrated schematically in Figure 9.15 .
Figure 9.15:
Major Computational Blocks of a Single Integration Step. A single step in the integration begins with a number of BDF-related computations, including the solution ``prediction'' step. Then, ``correction'' is achieved through Newton iteration steps, each involving a Jacobian computation and a linear-system solution (LU factorization plus forward- and back-solves). The computation of the Jacobian in turn relies upon multiple independent residual calculations, as shown. The three items enclosed in the rounded rectangle (Jacobian computation, through at most N residual computations, and LU factorization) are, in practice, computed less often than the others; the old Jacobian matrix is used in the iteration loop until convergence slows intolerably.
In the startup phase, a sequential host program interprets the user specification for the simulation. From this it generates the concurrent database: the templates and their mutual interconnections, data needed by particular templates, and a distribution of this information among the processes that are to participate. The processes are themselves spawned and fed their respective databases. Once they receive their input information, the processes rebuild the data structures for interfacing with Concurrent DASSL, and for generating the residuals. Tolerances and initial derivatives must be computed and/or estimated. Furthermore, in each process column, the processes must rendezvous to finalize their communication labelling for the transmission of states and nonstates to be performed during the residual calculation. This provides the basis for a reactive, deadlock-free update procedure described below.
The cleanup phase basically retrieves appropriate state values and returns them to the host for propagation to the user. Cleanup may be interspersed intermittently with the actual dynamic simulation. It provides a simple record of the results of simulation and terminates the concurrent processes at the simulation's conclusion.
The dynamic simulation phase consists of repetitive prediction and correction steps, and marches in time. Each successful time step requires the solution of one or more instances of Equation 9.12; additional time steps that converge but fail to satisfy error tolerances, or that fail to converge quickly enough, are necessarily discarded. In the next section, we cover aspects of these operations in more detail, for a single step.
The sequential time complexity of the integration computations is O(N), if considered separately from the residual calculation called in turn, which is also normally O(N) (see below). We pose these operations on a P x Q grid, where we assume that each process column can compute complete residual vectors. Each process column repeats the entire prediction operations: there is no speedup associated with Q > 1, and we replicate all DASSL BDF and predictor vectors in each process column. Taller, narrower grids are likely to provide the overall greatest speedup, though the residual calculation may saturate (and slow down again) because of excessive vertical communication requirements. It's definitely not true, however, that the tallest possible grid shape is optimal in all cases.
The distribution of coefficients in the rows has no impact on the integration operations, and is dictated largely by the requirements of the residual calculation itself. In practical problems, the concurrent database cannot be reproduced in each process (cf., [ Lorenz:92a ]), so a given process will only be able to compute some of the residuals. Furthermore, we may not have complete freedom in scattering these equations, because there will often be a trade-off between the degree of scattering and the amount of communication needed to form the entire residual vector.
The amount of integration-computation work is not terribly large; there is consequently a nontrivial but not tremendous effort involved in the integration computations. (Residual computations dominate in many if not most circumstances.) Integration operations consist mainly of vector-vector operations not requiring any interprocess communication and, in addition, fixed startup costs. Operations include prediction of the solution at the new time point, initiation and control of the Newton iteration that ``corrects'' the solution, convergence and error-tolerance checking, and so forth. For example, the approximation to y' is chosen within this block using the BDF formulas. For these operations, each process column currently operates independently, and repetitively forms the same results. Alternatively, each process column could stride with step Q, and row-combines could be used to propagate information across the columns [ Skjellum:90a ]. This alternative would increase speed for sufficiently large problems, and can easily be implemented. However, because of load imbalance in other stages of the calculation, we are convinced that including this type of synchronization could be a net negative rather than a positive for performance. This alternative will nevertheless be a future user-selectable option.
Included in these operations are a handful of norm operations, which constitute the main interprocess communication required by the integration computations step; norms are implemented concurrently via recursive doubling (combine) [ Stone:87a ], [ Skjellum:90a ]. Actually, the weighted norm used by DASSL requires two recursive doubling operations, each of which combines a scalar. The first operation obtains the vector coefficient of maximum absolute value, the second sums the weighted norm itself. Each can be implemented as Q independent column combines, each producing the same repetitive result, or a single Q -striding norm that takes advantage of the repetition of information, but utilizes two combines over the entire process grid. Both are supported in Concurrent DASSL, although the former is the default norm. As with the original DASSL, the norm function can be replaced, if desired.
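A sequential sketch of such a two-pass weighted norm is shown below, in the spirit of DASSL's norm; concurrently, each pass would be replaced by the corresponding combine. The weight vector and the scaling scheme are assumptions for illustration.

#include <math.h>

/* Two-pass weighted norm: pass 1 finds the largest weighted magnitude
   (one combine in the concurrent setting); pass 2 accumulates the
   scaled sum of squares (a second combine) and rescales at the end.    */
double weighted_norm(int n, const double *v, const double *w)
{
    double vmax = 0.0, sum = 0.0;
    for (int i = 0; i < n; ++i) {            /* pass 1: max |v[i]*w[i]|  */
        double t = fabs(v[i] * w[i]);
        if (t > vmax) vmax = t;
    }
    if (vmax == 0.0) return 0.0;
    for (int i = 0; i < n; ++i) {            /* pass 2: scaled squares   */
        double t = (v[i] * w[i]) / vmax;
        sum += t * t;
    }
    return vmax * sqrt(sum / n);
}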
Here, we consider the single residual computation required by the integration computations just described. Given a state vector y and an approximation for y', we need to evaluate the residual vector F. The exploitable concurrency available in this step is strictly a function of the model equations. As defined, there are N equations in this system, so we expect to use at best N computers for this step. Practically, there will be interprocess communication between the process rows, corresponding to the connectivity among the equations. This will place an upper limit on P (the number of row processes) that can be used before the speed will again decrease: we can expect efficient speedup for this step provided that the cost of the interprocess communication is insignificant compared to the single-equation grain size. As estimated in [ Skjellum:90a ], the granularity for the Symult s2010 multicomputer is about fifty, so this implies about 450 floating-point operations per communication in order to achieve 90% concurrent efficiency in this phase.
In general, we'd like to consider the Jacobian computation on a rectangular grid, using all P x Q processes to accomplish the calculation. With a general grid shape, we exploit some concurrency in both the column evaluations and the residual computations; the time for this step, the corresponding speedup, the residual-evaluation time with P row processes, and the apparent speedup compared to one row process can all be expressed in terms of the grid shape, assuming no shortcuts are available as a result of latency. This timing is exemplified in the example below, which does not take advantage of latency.
There is additional work whenever the Jacobian structure is rebuilt for better numerical stability in the subsequent LU factorization (A-mode); then, extra work is involved in each process in the filling of the initial Jacobian. In the normal case, work proportional to the number of local nonzeroes plus fill elements is incurred in each process for refilling the sparse Jacobian structure.
We have also devised a ``blocklike'' format, which will be applied to block n-diagonal matrices that include some off-block entries as well. Optimally, fewer residual computations will be needed than for the banded case. The same column-by-column compatible sets will be created, and the Curtis algorithm can also be applied. Hopefully, because of the less restrictive compatibility requirement, the blocklike case will produce higher concurrent speedups than those attained using the conservative bandlike assumption for Jacobians possessing blocklike structure. Comparative results will be presented in a future paper.
The algorithms and formalism needed to run this example amount to about 70,000 lines of C code including the simulation layer, Concurrent DASSL, the linear algebra packages, and support functions [Skjellum:90a;90c;90d].
In this simulation, we consider seven distillation columns arranged in a tree sequence [ Skjellum:90c ], working on the distillation of eight alcohols: methanol, ethanol, propan-1-ol, propan-2-ol, butan-1-ol, 2-methyl propan-1-ol, butan-2-ol, and 2-methyl propan-2-ol. Each column has 143 trays. Each tray is initialized to a nonsteady condition, and the system is relaxed to the steady state governed by a single-feed stream to the first column in the sequence. This setup generates suitable dynamic activity for illustrating the cost of a single ``transient'' integration step.
We note the performance in Table 9.6 . Because we have not exploited latency in the Jacobian computation, this calculation is quite expensive, as seen from the sequential times on a Sun 3/260 depicted there. (The Sun 3/260 is quite comparable in speed to a single Symult s2010 node and was lightly loaded during this test run.) As expected, Jacobian calculations speed up efficiently, and we are able to get an approximate speedup of 100 for this step using 128 nodes. The A-mode linear algebra also speeds up significantly. The B-mode factorization speeds up negligibly and quickly slows down again for more than 16 nodes. Likewise, the triangular solves are significantly slower than the sequential time. It should be noted that B-mode reflects two orders of magnitude speed improvement over A-mode. This reflects the fact that we are seeing almost linear time complexity in B-mode, since this example has a narrow block-tridiagonal Jacobian with too little off-diagonal coupling to generate much fill-in. It seems hard to imagine speeding up B-mode for such an example, unless we can exploit multiple pivots. We expect multiple-pivot heuristics to do reasonably well for this case, because of its narrow, nearly block-tridiagonal structure. We have used the Wilson-equation vapor-liquid equilibrium with the Antoine vapor-pressure equation. We have found that the thermodynamic calculations were much less demanding than we expected, with bubble-point computations requiring only a few iterations to converge. Consequently, there was not the greater weight of Jacobian calculations we expected beforehand. Our model assumes constant pressure and no enthalpy balances. We include no flow dynamics, but include liquid and vapor flows as states, because of the possibility of feedbacks.
Table 9.6:
Order 9009 Dynamic Simulation Data
If we were to exploit latency in the Jacobian calculation, we could reduce the sequential time for that step by a factor of about 100. This improvement would also carry through to the concurrent Jacobian times. At that point, the sequential ratio of Jacobian computation to B-mode factorization would be about 10:1. As is, we achieve legitimate speedups of about five. We expect to improve these results using the ideas discussed elsewhere in this book and in [ Skjellum:90d ].
From a modelling point of view, two things are important to note. First, the introduction of more nonideal thermodynamics would improve speedup, because these calculations fall within the Jacobian computation phase and the single-residual computation. Second, the introduction of a more realistic model will likewise bear on concurrency, and will likely improve it. For example, introducing flow dynamics, enthalpy balances, and vapor holdups makes the model more difficult to solve numerically (higher index). It also increases the chance of a wide range of stepsizes, and the possible need for additional A-mode factorizations to maintain stability in the integration process. Such operations are more costly, but also have a higher speedup. Furthermore, the more complex models will be less likely to have near diagonal dominance; consequently, more pivoting is to be expected, again increasing the chance for overall speedup compared to the sequential case. Looking ahead, we plan mainly to consider the waveform-relaxation approach more heavily, and also new classes of dynamic distillation simulations with Concurrent DASSL [ Skjellum:90c ].
We have developed a high-quality concurrent code, Concurrent DASSL, for the solution of ordinary differential-algebraic equations of low index. This code, together with appropriate linear algebra and simulation layers, allows us to explore the achievable concurrent performance of nontrivial problems. In chemical engineering, we have applied it thus far to a reasonably large, simple model of coupled distillation columns. We are able to solve this large problem, which is quite demanding on even a large mainframe because of huge memory requirements and nontrivial computational requirements; the speedups achieved thus far are legitimately at least five, when compared to an efficient sequential implementation. This illustrates the need for improvements to the linear algebra code, which are feasible because sparse matrices will admit multiple pivots heuristically. It also illustrates the need to consider hidden sources of additional timelike concurrency in Concurrent DASSL, perhaps allowing multiple right-hand sides to be attacked simultaneously by the linear algebra codes, and amortizing their cost more efficiently. Furthermore, the performance points up the need for detailed research into novel numerical techniques, such as waveform relaxation, which we have begun to do as well [ Skjellum:88a ].
Simple relaxation methods reduce the high-frequency components of the solution error by an order of magnitude in a few iterations. This observation is used to derive the multigrid method; see Brandt [ Brandt:77a ], Hackbusch [ Hackbusch:85a ]; Hackbusch and Trottenberg [ Hackbusch:82a ]. In the multigrid method, a smoothed problem is projected to a coarser grid. This coarse-grid problem is then solved recursively by smoothing and coarse-grid correction. The recursion terminates on the coarsest grid, where an exact solver is used. In the full multigrid method, a coarser grid is also used to compute an initial guess for the multigrid iteration on a finer grid. With this method, it is possible to solve the problem with an operation count proportional to the number of unknowns.
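As a concrete illustration of the procedure just described, the following sketch implements one V-cycle for the one-dimensional model problem -u'' = f with zero Dirichlet boundary conditions. The grid sizes, the weighted-Jacobi smoother, the full-weighting restriction, and the linear-interpolation prolongation are illustrative assumptions, not the particular operators chosen in this work.

/* Minimal 1-D multigrid V-cycle for -u'' = f on (0,1) with zero Dirichlet
 * boundary values.  n is the number of interior points (of the form 2^k - 1),
 * h the mesh width; u and f have n+2 entries with u[0] = u[n+1] = 0. */
#include <stdlib.h>

static void smooth(double *u, const double *f, int n, double h, int sweeps)
{
    /* Weighted Jacobi (omega = 2/3) on the 3-point Laplacian. */
    double *tmp = malloc((n + 2) * sizeof *tmp);
    for (int s = 0; s < sweeps; s++) {
        for (int i = 1; i <= n; i++)
            tmp[i] = (1.0/3.0) * u[i]
                   + (2.0/3.0) * 0.5 * (u[i-1] + u[i+1] + h*h*f[i]);
        for (int i = 1; i <= n; i++) u[i] = tmp[i];
    }
    free(tmp);
}

static void vcycle(double *u, const double *f, int n, double h)
{
    if (n == 1) {                      /* coarsest grid: exact solve */
        u[1] = 0.5 * h * h * f[1];
        return;
    }
    smooth(u, f, n, h, 3);             /* pre-smoothing */

    int nc = (n - 1) / 2;
    double *r  = calloc(n + 2,  sizeof *r);   /* residual          */
    double *fc = calloc(nc + 2, sizeof *fc);  /* coarse right side */
    double *ec = calloc(nc + 2, sizeof *ec);  /* coarse correction */

    for (int i = 1; i <= n; i++)       /* residual of the discrete operator */
        r[i] = f[i] - (2.0*u[i] - u[i-1] - u[i+1]) / (h*h);
    for (int i = 1; i <= nc; i++)      /* full-weighting restriction */
        fc[i] = 0.25 * (r[2*i-1] + 2.0*r[2*i] + r[2*i+1]);

    vcycle(ec, fc, nc, 2.0*h);         /* solve the coarse problem recursively */

    for (int i = 1; i <= nc; i++) {    /* linear-interpolation prolongation */
        u[2*i]   += ec[i];
        u[2*i-1] += 0.5 * (ec[i-1] + ec[i]);
    }
    u[n] += 0.5 * ec[nc];
    smooth(u, f, n, h, 3);             /* post-smoothing */
    free(r); free(fc); free(ec);
}

Repeated calls to vcycle() on the finest grid give the plain multigrid iteration; the full multigrid method additionally obtains the initial guess on each finer grid by prolongating the solution from the next coarser one.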
Multigrid methods are best understood for elliptic problems, such as the Poisson equation, stationary reaction-diffusion equations, implicit time-steps in parabolic problems, and so on. However, the multigrid approach is also successful for many other applications, from fluid flow to computer vision. Parallelization issues are independent of particular applications, and elliptic problems are a good test bed for the study of concurrent multigrid. We chose two- and three-dimensional stationary nonlinear reaction-diffusion equations in a rectangular domain, with suitable boundary conditions, as our model problem.
To parallelize multigrid, we proceed as follows (see also [ Velde:87a ], [ Velde:87b ]). First, a sequential multigrid procedure is developed. Here, the basic numerical problems are addressed: which smoothing, restriction, and prolongation operators to use, the resolution required (size of the finest grid), the number of levels (size of the coarsest grid), and the coarsest-grid solver. Second, this sequential multigrid code is generalized to include local grid refinement (in the neighborhood of singularities, for example). Three basic problems are addressed in this second stage: the algorithmic aspect of local grid refinement, the numerical treatment of interior boundaries, and the relaxation of partially overlapping grid patches. In the third and last step, the multigrid code is parallelized. This can now be done without introducing new numerical issues. Each concurrent process starts a sequential multigrid procedure, each one refining locally within a particular subdomain. To achieve this, a communication operation for the exchange of interior boundary values is needed. Depending on the size of the coarsest grid, it may also be necessary to develop a concurrent coarsest-grid solver independently.
This parallelization strategy has the advantage that all numerical problems can be addressed in the sequential stages. Although our implementation is for regular grids, the same strategy is also valid for irregular grids.
To simplify the switch to adaptive grids later, we use a multigrid variant known as the full approximation scheme . Thus, on every level, we compute an approximation to the solution of the original equation, not of an error equation. This multigrid procedure is defined by the following basic building blocks: a coarsest-grid solver, a solution restriction operator, a right-hand-side restriction operator, a prolongation operator, and a smoothing operator.
Two feasible coarsest-grid solvers are relaxation until convergence and a direct solver (embedded in a Newton iteration if the problem is nonlinear). The cost of solving a problem on the coarsest grid is, of course, related to the size of the coarsest grid. If the coarsest grid is very coarse, the cost is negligible. However, numerical reasons often dictate a minimum resolution for the coarsest grid. Moreover, elaborate computations may take place on the coarsest grid; see [ Bolstadt:86a ], [ Chan:82a ], [ Dinar:85a ] for examples of multigrid continuation. In some instances, the performance of the computations on the coarsest grids cannot be neglected.
Many alternatives exist for smoothing. Parallelization will be easiest with point relaxations. Jacobi underrelaxation and red-black Gauss-Seidel relaxation are particularly suited for concurrent implementations and for adaptive grids. Hence, we shall restrict our attention to point relaxation methods.
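For concreteness, a single red-black Gauss-Seidel sweep for the five-point Laplacian might look like the sketch below; the discretization and data layout are illustrative assumptions. Because every point of one color depends only on points of the other color, all points of a color can be relaxed simultaneously, which is what makes this smoother attractive for concurrent implementations.

/* One red-black Gauss-Seidel sweep for -Laplace(u) = f discretized with the
 * 5-point stencil on an (n+2) x (n+2) grid stored row-major; the outermost
 * rows and columns hold the Dirichlet boundary values.  Illustrative only. */
#define IDX(i, j, n) ((i) * ((n) + 2) + (j))

static void rb_gauss_seidel(double *u, const double *f, int n, double h)
{
    for (int color = 0; color < 2; color++)     /* first one color, then the other */
        for (int i = 1; i <= n; i++)
            for (int j = 1 + (i + color) % 2; j <= n; j += 2)
                u[IDX(i, j, n)] = 0.25 * (u[IDX(i-1, j, n)] + u[IDX(i+1, j, n)]
                                        + u[IDX(i, j-1, n)] + u[IDX(i, j+1, n)]
                                        + h * h * f[IDX(i, j, n)]);
}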
The intergrid transfers are usually simple: linear interpolation as the prolongation operator, injection or full-weight restriction as the restriction operator.
The main data structure of the sequential nonadaptive algorithm is a doubly linked list of grids, where a grid structure provides memory for the solution and right-hand-side vectors, and each grid is connected to one finer and one coarser grid. The sequential multigrid code has the following structure: a library of operations on grid functions, code to construct the doubly linked list of grids, and the main multigrid algorithm. We maintain this basic structure for the concurrent and adaptive algorithms. Although the doubly linked list of grids will be replaced by a more complex structure, the basic multigrid algorithm will not be altered. While the library of grid function operations will be expanded, the fundamental operations will remain the same. This is important, because the basic library for a general multigrid package with several options for each operator is large.
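A minimal sketch of such a grid-list structure is shown below; the field names are assumptions made for illustration, not the names used in the actual code.

/* Illustrative doubly linked list of grid levels: each level stores its
 * resolution and its solution and right-hand-side vectors, and is linked to
 * the next finer and next coarser level. */
#include <stdlib.h>

typedef struct Grid {
    int          nx, ny;      /* number of points in each direction      */
    double       h;           /* mesh width on this level                */
    double      *u;           /* solution vector                         */
    double      *rhs;         /* right-hand-side vector                  */
    struct Grid *finer;       /* next finer grid (NULL on finest level)  */
    struct Grid *coarser;     /* next coarser grid (NULL on coarsest)    */
} Grid;

static Grid *grid_new(int nx, int ny, double h, Grid *coarser)
{
    Grid *g = calloc(1, sizeof *g);
    g->nx = nx;  g->ny = ny;  g->h = h;
    g->u   = calloc((size_t)nx * ny, sizeof *g->u);
    g->rhs = calloc((size_t)nx * ny, sizeof *g->rhs);
    g->coarser = coarser;
    if (coarser) coarser->finer = g;
    return g;
}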
Here, we focus on the use of adaptive grids for sequential computations. We shall apply these ideas in the next section to achieve concurrency. Figure 9.16 illustrates the grid structure of a one-dimensional adaptive multigrid procedure. Fine grids are introduced only where necessary, in the neighborhood of a singularity, for example. In two and three dimensions the topology is more complicated, and it makes sense to refine in several subdomains that partially overlap.
Figure 9.16:
One-Dimensional Adaptive Multigrid Structure
We focus first on the intergrid transfers. Although these operators are straightforward, they are the source of some implementation difficulties for the concurrent algorithm, because load-balanced data distributions of fine and coarse grids are not compatible. The structure introduced here avoids these difficulties. Before introducing a fine grid on a subdomain, we construct an artificial coarse grid on the same subdomain. This artificial coarse grid, called a child grid , differs from a normal grid data structure only in that its data vectors (the solution and right-hand side) are subvectors of the parent-grid data vectors. Thus, child grids do not use extra memory for data (except for a negligible amount of bookkeeping). In Figure 9.16 , grid 1 is a parent grid with two children, grids 2 and 3. With child grids, the intergrid transfers of the nonadaptive procedure can be reused. The restriction, for example, takes place between a fine grid (defined over a subdomain) and a coarse child grid (in Figure 9.16 , between grids 4 and 2 and between grids 5 and 3, respectively). Because the data memory of the child and parent grids is shared, the appropriate subvectors of the coarse-grid data are updated automatically. Similarly, prolongation occurs between the child grid and the fine grid.
The basic data structure of the nonadaptive procedure, the doubly linked list of grids, is transformed radically as a result of child grids and their refinements. The data structure is now a tree of doubly linked lists; see Figure 9.16 . As mentioned before, the intergrid transfers are not affected by this complicated structure. For relaxation, the only significant difference is that more than one grid may have to be relaxed on each level. When the boundary of one grid intersects the interior of another grid on the same level, the boundary values must be interpolated (Figure 9.17 ). This does not affect the relaxation operators, as long as the relaxation step is preceded by a boundary-interpolation step.
Figure 9.17:
Boundary Interpolation
The same structure that made the multigrid code adaptive allows us to parallelize it. For now, assume that every process starts out with a copy of the coarsest grid, defined on the whole domain. Each process is assigned a subdomain in which to compute the solution to maximum accuracy. The collection of subdomains assigned to all processes covers the computational domain. Within each process, an adaptive grid structure is constructed so that the finest level at which the solution is needed envelops the assigned subdomain; see Figure 9.18 . The set of all grids (in whatever process they reside) forms a tree structure like the one described in the previous section. The same algorithm can be applied to it. Only one addition must be made to the program: overlapping grids on the same level but residing in different processes must communicate in order to interpolate boundary values. This is an operation to be added to the basic library.
Figure 9.18:
Use of Adaptive Multigrid for Concurrency
The coarsest grid can often be ignored as far as machine efficiency is concerned. As mentioned in Section 9.7.2 , the computations on the coarsest grid are sometimes substantial. In such cases, it is crucial to parallelize the coarsest-grid computations. With relaxation until convergence as the coarsest-grid solver, one could simply divide up the coarsest grid over all concurrent processes. It is more likely, however, that the coarsest-grid computations involve a direct solution method. In this case, the duplicated coarsest grid is well suited as an interface to a concurrent direct solver, because it simplifies the initialization of the coefficient matrix. We refer to Section 8.1 and [ Velde:90a ] for details on some direct solvers.
The total algorithm, adaptive multigrid and concurrent coarsest grid solver, is heterogeneous: its communication structure is irregular and varies significantly from one part of the program to the next, making the data distribution for optimal load balance difficult to predict. On the coarsest level, we achieve load balance by exploiting the data distribution independence of our linear algebra code; see [ Lorenz:92a ]. On the finer levels, load balance is obtained by allocating an approximately equal number of finest-grid points to each process.
The concurrent multigrid program was developed by Eric F. Van de Velde. Associated C P references are [ Lorenz:89a ], [ Lorenz:92a ], [ Velde:87a ], [ Velde:87b ], [ Velde:89b ], [ Velde:90a ].
The so-called assignment problem is of considerable importance in a variety of applications, and can be stated as follows. Let $A = \{a_i,\ i = 1,\dots,N\}$ and $B = \{b_j,\ j = 1,\dots,N\}$ be two sets of items, and let $d_{ij} = d(a_i, b_j)$ be a measure of the distance (dissimilarity) between individual items from the two lists. Taking the two lists to be of equal length $N$, the objective of the assignment problem is to find the particular mapping (permutation) $\pi: i \mapsto \pi(i)$ such that the total association score

$$D = \sum_{i=1}^{N} d_{i,\pi(i)}$$

is minimized over all permutations $\pi$.
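To make the objective concrete, the following brute-force reference routine evaluates every permutation and keeps the cheapest one. It is usable only for very small N (the O(N!) cost is exactly what the polynomial-time methods discussed next avoid); all names are illustrative.

/* Brute-force reference solution of the assignment problem: enumerate the
 * permutations of {0,...,n-1} and keep the one minimizing the total cost
 * sum_i d[i][perm[i]].  Only for tiny n. */
#include <float.h>

#define MAXN 10

static void search(double d[][MAXN], int n, int level, int used,
                   int perm[], double cost, double *best, int bestperm[])
{
    if (cost >= *best) return;              /* prune: cannot beat current best */
    if (level == n) {
        *best = cost;
        for (int i = 0; i < n; i++) bestperm[i] = perm[i];
        return;
    }
    for (int j = 0; j < n; j++)
        if (!(used & (1 << j))) {           /* column j not yet assigned */
            perm[level] = j;
            search(d, n, level + 1, used | (1 << j), perm,
                   cost + d[level][j], best, bestperm);
        }
}

/* Returns the minimum total cost; bestperm[i] is the column assigned to row i. */
double assign_bruteforce(double d[][MAXN], int n, int bestperm[])
{
    int perm[MAXN];
    double best = DBL_MAX;
    search(d, n, 0, 0, perm, 0.0, &best, bestperm);
    return best;
}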
For lists of length $N$, the naive (exhaustive search) complexity of the assignment problem is $O(N!)$. There are, however, a variety of exact solutions to the assignment problem with reduced complexity ([ Blackman:86a ], [ Burgeios:71a ], [ Kuhn:55a ]). Section 9.8.2 briefly describes one such method, Munkres algorithm [ Kuhn:55a ], and presents a particular sequential implementation. Performance of the algorithm is examined for the particularly nasty problem of associating lists of random points within the unit square. In Section 9.8.3 , the algorithm is generalized for concurrent execution, and performance results for runs on the Mark III hypercube are presented.
The input to the assignment problem is the matrix of dissimilarities from Equation 9.19 . The first point to note is that the particular assignment which minimizes Equation 9.21 is not altered if a fixed value is added to or subtracted from all entries in any row or column of the cost matrix D . Exploiting this fact, Munkres' solution to the assignment problem can be divided into two parts
The preceding paragraph provides a hopelessly incomplete hint as to the number theoretic basis for Munkres algorithm. The particular implementation of Munkres algorithm used in this work is as described in Chapter 14 of [ Blackman:86a ]. To be definite, let the columns of the distance matrix be associated with items from one of the two lists. The first step is to subtract the smallest item in each column from all entries in the column. The rest of the algorithm can be viewed as a search for special zero entries (starred zeros), and proceeds as follows:
Munkres Algorithm
The preceding algorithm involves flags (starred or primed) associated with zero entries in the distance matrix, as well as ``covered'' tags associated with individual rows and columns. The implementation of the zero tagging is done by first noting that there is at most one starred or primed zero in any row or column. The covers and zero tags of the algorithm are accordingly implemented using five simple arrays:
Figure 9.19:
Flowchart for Munkres Algorithm
Entries in the cover arrays CC and CR are one if the row or column is covered, zero otherwise. Entries in the zero-locator arrays ZS, ZR, and ZP are zero if no zero of the appropriate type exists in the indexed row or column.
With the star-prime-cover scheme of the preceding paragraph, a sequential implementation of Munkres algorithm is completely straightforward. At the beginning of Step 1, all cover and locator flags are set to zero, and the initial zero search provides an initial set of nonzero entries in ZS(). Step 2 sets appropriate entries in CC() to one and simply counts the covered columns. Steps 3 and 5 are trivially implemented in terms of the cover/zero arrays, and the ``alternating sequence'' for Step 4 is readily constructed from the contents of ZS(), ZR(), and ZP().
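As an example of how these arrays are used, a sketch of Step 2 is given below. It assumes that ZS() is indexed by column and holds the (1-based) row of the starred zero in that column, or zero if there is none; this indexing convention is an assumption of the sketch, not a statement of the original code.

/* Step 2 sketch: cover every column containing a starred zero and count the
 * covered columns; the algorithm terminates when the count reaches n. */
static int munkres_step2(const int ZS[], int CC[], int n)
{
    int covered = 0;
    for (int j = 0; j < n; j++)
        if (ZS[j] != 0) {       /* column j holds a starred zero */
            CC[j] = 1;
            covered++;
        }
    return covered;             /* == n  means the assignment is complete */
}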
As an initial exploration of Munkres algorithm, consider the task of associating two lists of random points within a 2D unit square, assuming the cost function in Equation 9.19 is the usual Cartesian distance. Figure 9.20 plots total CPU times for execution of Munkres algorithm for equal size lists versus list size. The vertical axis gives CPU times in seconds for one node of the Mark III hypercube. The circles and crosses show the time spent in Steps 5 and 3, respectively. These two steps (zero search and zero manufacture) account for essentially all of the CPU time. For the case, the total CPU time spent in Step 2 was about , and that spent in Step 4 was too small to be reliably measured. The large amounts of time spent in Steps 3 and 5 arise from the very large numbers of times these parts of the algorithm are executed. The case involves 6109 entries into Step 3 and 593 entries into Step 5.
Since the zero searching in Step 3 of the algorithm is required so often, the implementation of this step is done with some care. The search for zeros is done column-by-column, and the code maintains pointers to both the last column searched and the most recently uncovered column (Step 3.3) in order to reduce the time spent on subsequent re-entries to the Step 3 box of Figure 9.19 .
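A sketch of such a zero search is given below. It keeps only a single resume pointer (the column at which the previous search stopped), which is a simplification of the two pointers described above; the array names follow the text, but the details are illustrative assumptions.

/* Search for an uncovered zero of the reduced distance matrix D (Step 3).
 * CR[] and CC[] are the row and column cover flags.  Returns 1 and sets
 * (*row, *col) if an uncovered zero is found, 0 otherwise. */
static int last_col = 0;        /* column at which the previous search stopped */

static int find_uncovered_zero(const double *D, int nrows, int ncols,
                               const int CR[], const int CC[],
                               int *row, int *col)
{
    for (int k = 0; k < ncols; k++) {
        int j = (last_col + k) % ncols;     /* resume from the saved column */
        if (CC[j]) continue;                /* covered column: skip */
        for (int i = 0; i < nrows; i++) {
            if (CR[i]) continue;            /* covered row: skip */
            if (D[i * ncols + j] == 0.0) {
                *row = i;  *col = j;
                last_col = j;               /* remember for the next entry */
                return 1;
            }
        }
    }
    return 0;
}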
Figure 9.20:
Timing Results for the Sequential Algorithm Versus Problem Size
The dashed line of Figure 9.20 indicates the nominal scaling predicted for Munkres algorithm. By and large, the timing results in Figure 9.20 are consistent with this expected behavior. It should be noted, however, that both the nature of this scaling and the coefficient of are very dependent on the nature of the data sets. Consider, for example, two identical trivial lists
with the distance between items given by the absolute value function. For the data sets in Equation 9.22 , the preliminaries and Step 1 of Munkres algorithm completely solve the association in a time which scales as . In contrast, the random-point association problem is a much greater challenge for the algorithm, as nominal pairings indicated by the initial nearest-neighbor searches of the preliminary step are tediously undone in the creation of the staircaselike sequence of zeros needed for Step 4. As a brief, instructive illustration of the nature of this processing, Figure 9.21 plots the CPU time per step for the last passes through the outer loop of Figure 9.19 for the 150×150 assignment problem (recall that each pass through the outer loop increases the count by one). The processing load per step is seen to be highly nonuniform.
Figure 9.21:
Times Per Loop (i.e., per Increment) for the Last Several Loops in the Solution of the 150×150 Problem
The timing results from Figure 9.20 clearly dictate the manner in which the calculations in Munkres algorithm should be distributed among the nodes of a hypercube for concurrent execution. The zero and minimum element searches for Steps 3 and 5 are the most time consuming and should be done concurrently. In contrast, the essentially bookkeeping tasks associated with Steps 2 and 4 require insignificant CPU time and are most naturally done in lockstep (i.e., all nodes of the hypercube perform the same calculations on the same data at the same time). The details of the concurrent algorithm are as follows.
Data Decomposition
The distance matrix is distributed across the nodes of the hypercube, with entire columns assigned to individual nodes. (This assumes, effectively, that there are at least as many columns as nodes, which is always the case for assignment problems which are big enough to be ``interesting.'') The cover and zero locator lists defined in Section 9.8.2 are duplicated on all nodes.
Task Decomposition
The concurrent implementation of Step 5 is particularly trivial. Each node first finds its own minimum uncovered value, setting this value to some ``infinite'' token if all columns assigned to the node are covered. A simple loop on communication channels determines the global minimum among the node-by-node minimum values, and each node then modifies the contents of its local portion of the distance matrix according to Steps 5.2 and 5.3.
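The ``loop on communication channels'' is the familiar hypercube exchange across each channel in turn. The sketch below expresses it with MPI point-to-point calls as a stand-in for the Mark III communication primitives; it assumes the number of processes is exactly 2^cube_dim.

/* Global minimum by exchanging across each hypercube channel in turn; after
 * cube_dim exchanges every node holds the global minimum.  (A single
 * MPI_Allreduce with MPI_MIN would do the same job.) */
#include <mpi.h>

double global_min(double local_min, int cube_dim)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double mymin = local_min;
    for (int d = 0; d < cube_dim; d++) {
        int partner = rank ^ (1 << d);      /* neighbor across channel d */
        double theirs;
        MPI_Sendrecv(&mymin, 1, MPI_DOUBLE, partner, d,
                     &theirs, 1, MPI_DOUBLE, partner, d,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (theirs < mymin) mymin = theirs;
    }
    return mymin;                           /* identical on every node */
}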
The concurrent implementation of Step 3 is just slightly more awkward. On entry to Step 3, each node searches for zeros according to the rules of Section 9.8.2 , and fills a three-element status list:
where S is a zero-search status flag,
If the status is nonnegative, the last two entries in the status list specify the location of the found zero. A simple channel loop is used to collect the individual status lists of each node into all nodes, and the action taken next by the program is as follows:
The concurrent algorithm has been implemented on the Mark III hypercube, and has been tested against random point association tasks for a variety of list sizes. Before examining results of these tests, however, it is worth noting that the concurrent implementation is not particularly dependent on the hypercube topology. The only communication-dependent parts of the algorithm are
Table 9.7 presents performance results for the association of random lists of 200 points on the Mark III hypercube for various cube dimensions. (For consistency, of course, the same input lists are used for all runs.) Time values are given in CPU seconds for the total execution time, as well as the time spent in Steps 3 and 5. Also given are the standard concurrent execution efficiencies, $\epsilon = T_1/(N\,T_N)$ for $N$ nodes, as well as the number of times the Step 3 box of Figure 9.19 is entered during execution of the algorithm. The numbers of entries into the other boxes of Figure 9.19 are independent of the hypercube dimension.
Table 9.7:
Concurrent Performance for 200 Random Points. T is time, $\epsilon$ the efficiency, and N[Step 3] the number of times Step 3 is executed.
One aspect of the timing results in Table 9.7 should be noted: essentially all of the inefficiency of the concurrent algorithm is associated with Step 3 (compare, for example, the Step 3 times for two nodes and for one node). The times spent in Step 5 are approximately halved for each increase in the dimension of the hypercube. However, the efficiencies associated with the zero searching in Step 3 are rather poorer, particularly for larger numbers of nodes.
At a simple, qualitative level, the inefficiencies associated with Step 3 are readily understood. Consider the task of finding a single zero located somewhere inside an $N \times N$ matrix. The mean sequential search time is

$$T_1 \approx \frac{N^2}{2}\,\tau ,$$

where $\tau$ is the time to examine a single entry, since, on average, half of the entries of the matrix will be examined before the zero is found. Now consider the same zero search on two nodes. The node which has the half of the matrix containing the zero will find it in about half the time of Equation 9.26 . However, the other node will always search through all $N^2/2$ of its items before returning a null status for Equation 9.24 . Since the node which found the zero must wait for the other node before the (lockstep) modifications of zero locators and cover tags, the node without the zero determines the actual time spent in Step 3, so that

$$T_2 \approx \frac{N^2}{2}\,\tau \approx T_1 .$$
In the full program, the concurrent bottleneck is not as bad as Equation 9.27 would imply. As noted above, the concurrent algorithm can process multiple ``Boring'' Zs in a single pass through Step 3. The frequency of such multiple Zs per step can be estimated by noting the decreasing number of times Step 3 is entered with increasing hypercube dimension, as indicated in Table 9.7 . Moreover, each node maintains a counter of the last column searched during Step 3. On subsequent re-entries, columns prior to this marked column are searched for zeros only if they have had their cover tag changed during the prior Step 3 processing. While each of these algorithm elements does diminish the problems associated with Equation 9.27 , the fact remains that the search for zero entries in the distributed distance matrix is the least efficient step in concurrent implementations of Munkres algorithm.
The results presented in Table 9.7 demonstrate that an efficient implementation of Munkres algorithm is certainly feasible. Next, we examine how these efficiencies change as the problem size is varied.
The results shown in Tables 9.8 and 9.9 demonstrate an improvement of concurrent efficiencies with increasing problem size, which is the expected result. For the problem on eight nodes, the efficiency is only about 50%. This problem is too small for eight nodes, with only 12 or 13 columns of the distance matrix assigned to each individual node.
Table 9.8:
Concurrent Performance for Random Points
Table 9.9:
Concurrent Performance for Random Points
While the performance results in Tables 9.7 through 9.9 are certainly acceptable, it is nonetheless interesting to investigate possible improvements of efficiency for the zero searches in Step 3. The obvious candidate for an algorithm modification is some sort of checkpointing: At intermediate times during the zero search, the nodes exchange a ``Zero Found Yet?'' status flag, with all nodes breaking out of the zero search loop if any node returns a positive result.
For message-passing machines such as the Mark III, the checkpointing scheme is of little value, as the time spent in a single entry to Step 3 is not large compared to the node-to-node communication time. For example, for the two-node solution of the problem, the mean time for a single entry to Step 3 is only about , compared to a typical node-to-node communications time which can be a significant fraction of a millisecond. As a (not unexpected) consequence, all attempts to improve the Step 3 efficiencies through various ``Found Anything?'' schemes were completely unsuccessful.
The checkpointing difficulties for a message-passing machine could disappear, of course, on a shared-memory machine. If the zero-search status flags for the various nodes could be kept in memory locations readily (i.e., rapidly) accessible to all nodes, the problems of the preceding paragraph might be eliminated. It would be interesting to determine whether significant improvements on the (already good) efficiencies of the concurrent Munkres algorithm could be achieved on a shared-memory machine.
Computers and standard programming languages can be used efficiently for high-level, clearly formulated problems such as computing balance sheets and income statements, solving partial differential equations, or managing operations in a car factory. It is much more difficult to write efficient and fault-tolerant programs for ``simple'' primitive tasks like hearing, seeing, touching, manipulating parts, recognizing faces, avoiding obstacles, and so on. Usually, the existing artificial systems for the above tasks are within a narrowly limited domain of application, very sensitive to hardware and software failures, and difficult to modify and adapt to new environments.
Neural nets represent a new approach to bridging the gap between cheap computational power and solutions for some of the above-cited tasks. We as human beings like to consider ourselves good examples of the power of the neuronal approach to problem solving.
To avoid naive optimism and overinflated expectations about ``self-programming'' computers, it is safer to see this development as the creation of another level of tools insulating generic users looking for fast solutions from the details of sophisticated learning mechanisms. Today, generic users do not care about writing operating systems; in the near future, some users will not care about programming and debugging. They will have to choose appropriate off-the-shelf subsystems (both hardware and software) and an appropriate set of examples and high-level specifications; neural nets will do the rest. Neural networks have already been useful in areas like pattern classification, robotics, system modeling, and forecasting over time ([ Borsellino:61a ], [ Broomhead:88a ], [ Gorman:88a ], [ Sejnowski:87a ], [ Rumelhart:86b ], [ Lapedes:87a ]).
The focus of this work is on ``supervised learning'', that is, learning an association between input and output patterns from a set of examples. The mapping is executed by a feed-forward network with different layers of units, such as the one shown in Figure 9.22 .
Figure 9.22:
Multilayer Perceptron and Transfer Function
Each unit that is not in the input layer receives an input given by a weighted sum of the outputs of the previous layer and produces an output using a ``sigmoidal'' transfer function, with a linear range and saturation for large positive and negative inputs. This particular architecture has been considered because it has been used extensively in neural network research, but the learning method presented can be used for different network designs ([ Broomhead:88a ]).
The multilayer perceptron, initialized with random weights, presents random output patterns. We would like to execute a learning stage, progressively modifying the values of the connection strengths in order to make the outputs nearer to the prescribed ones.
It is straightforward to transform the learning task into an optimization problem (i.e., a search for the minimum of a specified function, henceforth called energy ). If the energy is defined as the sum of the squared errors between obtained and desired output pattern over the set of examples, minimizing it will accomplish the task.
A large fraction of the theoretical and applied research in supervised learning is based on the steepest descent method for minimization. The negative gradient of the energy with respect to the weights is calculated during each iteration, and a step is taken in that direction (if the step is small enough, energy reduction is assured). In this way, one obtains a sequence of weight vectors $w_0, w_1, w_2, \dots$ that converges to a local minimum of the energy function:

$$w_{k+1} = w_k - \epsilon\,\nabla E(w_k) ,$$

where $\epsilon$ is the learning rate (or ``learning speed'').
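In code, the basic steepest-descent loop is only a few lines; the sketch below uses callbacks for the energy and gradient and a fixed learning rate eps, which is exactly the parameter criticized below. Names and the absence of a stopping test are illustrative simplifications.

/* Fixed-step steepest descent on an energy E(w) of n weights. */
#include <stdlib.h>

typedef double (*energy_fn)(const double *w, int n);
typedef void   (*grad_fn)(const double *w, int n, double *g);

void steepest_descent(double *w, int n, energy_fn E, grad_fn dE,
                      double eps, int max_iter)
{
    double *g = malloc(n * sizeof *g);
    for (int k = 0; k < max_iter; k++) {
        dE(w, n, g);                    /* g = grad E(w_k)             */
        for (int i = 0; i < n; i++)
            w[i] -= eps * g[i];         /* w_{k+1} = w_k - eps * g     */
    }
    free(g);
    (void)E;                            /* E could drive a stopping test */
}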
Now, given that we are interested in converging to the local minimum in the shortest time (this is not always the case: to combat noise, some slowness may be desired), there is no good reason to restrict ourselves to steepest descent, and there are at least a couple of reasons in favor of other methods. First, the learning speed $\epsilon$ is a free parameter that has to be chosen carefully for each problem (if it is too small, progress is slow; if too large, oscillations may be created). Second, even in the optimal case of a step along the steepest descent direction bringing the system to the absolute minimum (along this direction), it can be proved that steepest descent can be arbitrarily slow, particularly when ``the search space contains long ravines that are characterized by sharp curvature across the ravine and a gently sloping floor'' [ Rumelhart:86b ]. In other words, if we are unlucky with the choice of units along the different dimensions (and this is a frequent event when the number of weights is 1000 or 10,000), it may be the case that during each iteration, the previous error is reduced by 0.000000001%!
The problem is essentially caused by the fact that the gradient does not necessarily point in the direction of the minimum, as shown in Figure 9.23 .
Figure 9.23:
Gradient Direction for Different Choice of Units
If the energy is quadratic, a large ratio of the maximum to minimum eigenvalues causes the ``zigzagging'' motion illustrated. In the next sections, we will illustrate two suggestions for tuning parameters in an adaptive way and selecting better descent directions.
There are no general prescriptions for selecting a learning rate for back-propagation that avoids oscillations and converges to a good local minimum of the energy in a short time. In many applications, some kind of ``black magic,'' or trial-and-error process, is employed. In addition, usually no single fixed learning rate is appropriate for the entire learning session.
Both problems can be solved by adapting the learning rate to the local structure of the energy surface.
We start with a given learning rate (the initial value does not matter) and monitor the energy after each weight update. If the energy decreases, the learning rate for the next iteration is increased by a constant factor. Conversely, if the energy increases (an ``accident'' during learning), this is taken as an indication that the step made was too long; the learning rate is decreased by a constant factor, the last change is cancelled, and a new trial is made. The process of reduction is repeated until a step that decreases the energy value is found (this will inevitably happen because the search direction is that of the negative gradient). An example of the size of the learning rate as a function of the iteration number is shown in Figure 9.24 .
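Continuing the steepest-descent sketch above (same energy_fn and grad_fn callbacks), the Bold Driver adaptation can be written as follows. The increase and decrease factors and the underflow guard are placeholder values chosen for illustration.

/* Bold Driver: accept a step and enlarge eps when the energy decreases;
 * otherwise cancel the step, shrink eps, and retry. */
#include <stdlib.h>
#include <string.h>

void bold_driver(double *w, int n, energy_fn E, grad_fn dE,
                 double eps, int max_iter)
{
    const double up = 1.1, down = 0.5;          /* assumed factors */
    double *g     = malloc(n * sizeof *g);
    double *w_old = malloc(n * sizeof *w_old);
    double e_old  = E(w, n);
    int stalled = 0;

    for (int k = 0; k < max_iter && !stalled; k++) {
        dE(w, n, g);                            /* gradient at the accepted point */
        memcpy(w_old, w, n * sizeof *w);
        for (;;) {                              /* retry until the energy drops */
            for (int i = 0; i < n; i++)
                w[i] = w_old[i] - eps * g[i];
            double e_new = E(w, n);
            if (e_new < e_old) {                /* success: grow the rate */
                e_old = e_new;
                eps *= up;
                break;
            }
            eps *= down;                        /* "accident": cancel and shrink */
            if (eps < 1e-12) {                  /* numerical underflow guard */
                memcpy(w, w_old, n * sizeof *w);
                stalled = 1;
                break;
            }
        }
    }
    free(g); free(w_old);
}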
Figure 9.24:
Learning Rate Magnitude as a Function of the Iteration Number for a Test Problem
The name ``Bold Driver'' was selected for the analogy with the learning process of young and inexperienced car drivers.
By using this ``brutally heuristic'' method, learning converges in a time that is comparable to, and usually better than, that of standard (batch) back-propagation with an optimal and fixed learning rate. The important difference is that the time-consuming meta-optimization phase for choosing the learning rate is avoided. The increase and decrease factors can be fixed once and for all, and performance does not depend critically on their choice.
Steepest descent has a bad reputation among researchers in optimization. From the literature (e.g., [ Gill:81a ]), we found a wide selection of more appropriate optimization techniques. Following the ``decision tree'' and considering the characteristics of large supervised learning problems (large memory requirements and time-consuming calculations of the energy and the gradient), the Broyden-Fletcher-Goldfarb-Shanno one-step memoryless quasi-Newton method (all adjectives are necessary to define it) is a good candidate, and it performed very efficiently on different problems.
Let us define the following vectors: the gradient $g_n = \nabla E(w_n)$, the last step $p_n = w_{n+1} - w_n$, and the gradient difference $y_n = g_{n+1} - g_n$. The one-dimensional search direction $d_n$ for the $n$th iteration is a modification of the negative gradient $-g_n$, as follows:

$$d_n = -g_n + A_n\,p_{n-1} + B_n\,y_{n-1} .$$

Every $N$ steps ($N$ being the number of weights in the network), the search is restarted in the direction of the negative gradient.

The coefficients $A_n$ and $B_n$ are combinations of scalar products of the vectors defined above.
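One standard set of coefficient expressions comes from Shanno's memoryless BFGS update, sketched below; it is given here as an illustration of the kind of scalar-product combinations involved, and the exact expressions used in this work may differ in detail.

/* One-step memoryless BFGS search direction: with p = w_n - w_{n-1} and
 * y = g_n - g_{n-1}, compute d = -g + A*p + B*y, where A and B follow from
 * applying the BFGS update (started from the identity) to -g. */
static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += a[i] * b[i];
    return s;
}

void bfgs_direction(const double *g, const double *p, const double *y,
                    double *d, int n)
{
    double py = dot(p, y, n);
    double pg = dot(p, g, n);
    double yg = dot(y, g, n);
    double yy = dot(y, y, n);
    double B  = pg / py;
    double A  = yg / py - (1.0 + yy / py) * B;

    for (int i = 0; i < n; i++)
        d[i] = -g[i] + A * p[i] + B * y[i];
}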
The one-dimensional minimization used in this work is based on quadratic interpolation and tuned to back-propagation where, in a single step, both the energy value and the negative gradient can be efficiently obtained. Details on this step are contained in [ Williams:87b ].
The computation during each step requires operations (the same behavior as standard batch back-propagation), while the CPU time for each step increases by an average factor of three for the problems considered. Because the total number of steps for convergence is much smaller, we measured a large net benefit in computing time.
Last but not least, this method can be efficiently implemented on MIMD parallel computers.
Neural nets are ``by definition'' parallel computing systems of many densely interconnected units. Parallel computation is the basic method used by our brain to achieve response times of hundreds of milliseconds, using sloppy biological hardware with computing times of a few milliseconds per basic operation.
Our implementation of the learning algorithm is based on the use of MIMD machines with large grain size. An efficient mapping strategy consists of assigning a subset of the examples (input-output pairs) and the entire network structure to each processor. To obtain proper generalization, the number of example patterns has to be much larger than the number of parameters defining the architecture (i.e., the number of connection weights). For this reason, the amount of memory used for storing the weights is not too large for significant problems.
Function and gradient evaluation is executed in parallel. Each processor calculates the contribution of the assigned patterns (with no communication), and a global combining-distributing step (see the ADDVEC routine in [ Fox:88a ]) calculates the total energy and gradient (let's remember that the energy is a sum of the patterns' contributions) and communicates the result to all processors.
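The sketch below illustrates this pattern-parallel evaluation, with MPI_Allreduce standing in for the combine-and-distribute (ADDVEC-style) step; pattern_energy_grad() is a placeholder for the per-pattern forward and backward pass.

/* Data-parallel energy and gradient: each process sums the contributions of
 * its own patterns, then the partial sums are added globally and the totals
 * are returned to every process. */
#include <mpi.h>
#include <stdlib.h>

/* Placeholder: adds pattern p's contribution to *e and grad[0..nw-1]. */
extern void pattern_energy_grad(int p, const double *w, int nw,
                                double *e, double *grad);

void energy_and_gradient(const double *w, int nw,
                         int first_pat, int last_pat,  /* this process's share */
                         double *energy, double *grad)
{
    double  local_e = 0.0;
    double *local_g = calloc(nw, sizeof *local_g);

    for (int p = first_pat; p <= last_pat; p++)
        pattern_energy_grad(p, w, nw, &local_e, local_g);

    MPI_Allreduce(&local_e, energy, 1,  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(local_g,  grad,   nw, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(local_g);
}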
Then the one-dimensional minimization along the search direction is completed and the weights are updated.
Figure 9.25:
``Healthy Food'' Has to Be Distinguished from ``Junk Food'' Using Taste and Smell Information.
This simple parallelization approach is promising: It can be easily adapted to different network representations and learning strategies, and it is going to be a fierce competitor with analog implementations of neural networks, when these are available for significant applications (let's remember that airplanes do not flap their wings ).
This problem consists of classifying a set of randomly generated patterns (with real values) in two classes. An example in two dimensions is given by the ``healthy food'' learning problem. Inputs are given by points in the ``smell'' and ``taste'' plane, corresponding to the different foods. The learning task consists of producing the correct classification as ``healthy food'' or ``junk food'' (Figure 9.25 ).
On this problem we obtained a speedup of 20-120 (going from 6 to 100 patterns in two dimensions).
In this case, the task is to predict the next value in the sequence (ergodic and chaotic) generated by the logistic map [ Lapedes:87a ], according to the recurrence relation $x_{n+1} = r\,x_n(1 - x_n)$.
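Training data for this task can be generated directly from the recurrence; the parameter value r = 4.0 (fully chaotic regime), the initial condition, and the transient length in the sketch below are illustrative assumptions.

/* Emit (x_n, x_{n+1}) input/target pairs from the logistic map. */
#include <stdio.h>

int main(void)
{
    const double r = 4.0;           /* assumed map parameter (chaotic regime) */
    double x = 0.3;                 /* arbitrary initial condition in (0,1)   */

    for (int n = 0; n < 100; n++)   /* discard an initial transient */
        x = r * x * (1.0 - x);

    for (int n = 0; n < 500; n++) { /* training pairs: input and target value */
        double next = r * x * (1.0 - x);
        printf("%.10f %.10f\n", x, next);
        x = next;
    }
    return 0;
}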
We tried different architectures and obtained a speedup of 400-500, and slightly better generalization properties for the BFGS optimization method presented.
In general, we obtained a larger speedup for problems with high-precision requirements (using real values for inputs or outputs). See also [ Battiti:89a ].
The distributed optimization [Battiti:89a;89e] software was developed by Roberto Battiti, modifying a back-propagation program written by Steve Otto. Fox and Williams inspired our first investigations into the optimization literature (Shanno's conjugate gradient [ Shanno:78a ] is used in [ Williams:87b ]).
In the next two chapters, we describe software tools developed to aid the user in the parallelization of some of the harder algorithms. Here we describe DIME, which is designed to generate irregular meshes, both statically and adaptively. DIME was already used in the application of Section 7.2 ; however, it is more typically used for partial differential equations describing problems with nonuniform and varying density.
A large fraction of the problems that we wish to solve with a computer are continuum simulations of physical systems, where the interesting variable is not a finite collection of numbers but a function on a domain. For the purposes of the computation such a continuous spatial domain is given a structure, or mesh, to which field values may be attached and neighboring values compared to calculate derivatives of the field.
If the domain of interest has a simple shape, such as a cylinder or cuboid, there may be a natural choice of mesh whose structure is very regular like that of a crystal, but when we come to more complex geometries such as the space surrounding an aircraft or inside turbomachinery, there are no regular structures that can adequately represent the domain. The only way to mesh such complex domains is with an unstructured mesh, such as that shown in Figure 10.1 . At the right is a plot of a solution to Laplace's equation on the domain.
Figure 10.1:
Mesh and Solution of Laplace Equation
Notice that the mesh is especially fine at the sharp corners of the boundary where the solution changes rapidly: A desirable feature for a mesh is its ability to adapt, so that when the solution begins to emerge, the mesh may be made finer where necessary.
Naturally, we would like to run our time-consuming physical simulation with the most cost-effective general-purpose computer, which we believe to be the MIMD architecture. In view of the difficulty of programming an irregular structure such as one of these meshes, and the special difficulty of doing so with an MIMD machine, I decided to write not just a program for a specialized application, but a programming environment for unstructured triangular meshes.
The resulting software (DIME: Distributed Irregular Mesh Environment, [ Williams:90b ]) is responsible for the mesh structure, and a separate application code runs a particular type of simulation on the mesh. DIME keeps track of the mesh structure, allowing mesh creation, reading and writing meshes to disk, and graphics, as well as adaptive refinement and certain topological changes to the mesh. It hides the parallelism from the application code, and splits the mesh among the processors in an efficient way.
The application code is responsible for attaching data to the elements and nodes of the mesh, and for manipulating and computing with these data and the data from its mesh neighborhood. DIME is designed to be portable between different MIMD parallel machines, but it also runs on any Unix machine, treating it as a parallel machine with just one processor. This ability to run on a sequential machine is due to DIME's use of the Cubix server (Section 5.2 ).
The most efficient speed for aircraft flight is just below the speed of sound: the transonic regime. Simulations of flight at these speeds consume large quantities of computer time, and are a natural candidate for a DIME application. In addition to the complex geometries of airfoils and turbines for which these simulations are required, the flow tends to develop singular regions, or shocks, in places that cannot be predicted in advance; the adaptive refinement capability of a DIME mesh allows the mesh to be fine enough to resolve the detail near shocks while keeping the regions of smooth flow coarsely meshed for economy (Section 12.3 ).
The version of DIME developed within C P was only able to mesh two-dimensional manifolds. More recent developments are described in Section 10.1.7 . The manifold may, however, be embedded in a higher-dimensional space. In collaboration with the Biology division at Caltech, we have simulated the electrosensory system of the weakly electric fish Apteronotus leptorhynchus . The simulation involves creating a mesh covering the skin of the fish, and using the boundary element method to calculate field strengths in the three-dimensional space surrounding the fish (Section 12.2 ).
In the same vein of embedding the mesh in higher dimensions, we have simulated a bosonic string of high-energy physics, embedding the mesh in up to 26 spatial dimensions. The problem here is to integrate over not only all positions of the mesh nodes, but also over all triangulations of the mesh (Section 7.2 ).
The information available to a DIME application is certain data stored in the elements and nodes of the mesh. When doing finite-element calculations, one would like a somewhat higher level of abstraction, which is to refer to functions defined on a domain, with certain smoothness constraints and boundary conditions. We have made a further software layer on top of DIME to facilitate this: DIMEFEM. With this we may add, multiply, differentiate and integrate functions defined in terms of the Lagrangian finite-element family, and define linear, bilinear, and nonlinear operators acting on these functions. When a bilinear operator is defined, a variational principle may be solved by conjugate-gradient methods. The preconditioner for the CG method may in itself involve solving a variational principle. The DIMEFEM package has been applied to a sophisticated incompressible flow algorithm (Section 10.2 ).
Figure 10.2 shows the structure of a DIME application. The shaded parts represent the contribution from the user, being a definition of a domain which is to be meshed, a definition of the data to be maintained at each element, node, and boundary edge of the mesh, and a set of functions that manipulate this data. The user may also supply or create a script file for running the code in batch mode.
Figure 10.2:
Major Components of DIME
The first input is the definition of a domain to be meshed. A file may be made using the elementary CAD program curvetool , which allows straight lines, arcs, and cubic splines to be manipulated interactively to define a domain.
Before sending a domain to a DIME application, it must be predigested to some extent, with the help of a human. The user must produce a coarse mesh that defines the topology of the domain to the machine. This is done with the program meshtool , which allows the user to create nodes and connect them to form a triangulation.
The user writes a program for the mesh, and this program is loaded into each processor of the parallel machine. When the DIME function readmesh() is called (or ``Readmesh'' clicked on the menu), the mesh created by meshtool is read into a single processor, and then the function balance_orb() may be called (or ``Balance'' clicked on the menu) to split the mesh into domains, one domain for each processor.
The user may also call the function writemesh () (or click ``Writemesh'' in the menu), which causes the parallel mesh to be written to disk. If that mesh is subsequently read in, it is read in its domain-decomposed form, with different pieces assigned to different processors.
In Figure 10.2 , the Cubix server is needed only if the DIME application is to run in parallel. The application also runs on a sequential machine, which is considered to be a one-processor parallel machine.
Figure 10.3 shows an example of a DIME boundary structure. The filled blobs are points , with curves connecting the points. Each curve may consist of a set of curve segments, shown in the figure separated by open circles. The curve segments may be straight lines, arcs of circles, or Bezier cubic sections. The program curvetool is for the interactive production of boundary files. When the domain is satisfactory, it should be meshed using meshtool .
Figure 10.3:
Boundary Structure
The program meshtool is used for defining boundaries and creating a triangulation of certain regions of a grid. Meshtool adds nodes to an existing triangulation using the Delaunay triangulation [ Bowyer:81a ]. A new node may be added anywhere except at the position of an existing node. Figure 10.4 illustrates how the Delaunay triangulation (thick gray lines) is derived from the Voronoi tesselation (thin black lines).
Figure 10.4:
Voronoi Tesselation and Resulting Delaunay Triangulation
Each node (shown by a blob in the figure) has a ``territory,'' or Voronoi polygon, which is the part of the plane closer to the node than to any other node. The divisions between these territories are shown as thin lines in the figure, and are the perpendicular bisectors of the lines between the nodes. This procedure tesselates the plane into a set of disjoint polygons and is called the Voronoi tesselation. Joining nodes whose Voronoi polygons have a common border creates a triangulation of the nodes known as the Delaunay triangulation. This triangulation has some desirable properties, such as the diagonal dominance of a finite-element stiffness matrix derived from the mesh [ Young:71a ].
Figure 10.5 shows a triangular mesh covering a rectangle, and Figure 10.6 the logical structure of that mesh split among four processors. The logical mesh shows the elements as shaded triangles and nodes as blobs. Each element is connected to exactly three nodes, and each node is connected to one or more elements. If a node is at a boundary, it has a boundary structure attached, together with a pointer to the next node clockwise around the boundary.
Figure 10.5:
A Mesh Covering a Rectangle
Figure 10.6:
The Logical Structure of the Mesh Split Among Four Processors
Each node, element and boundary structure has user data attached to it, which is automatically transferred to another processor if load-balancing causes the node or element to be moved to another processor. DIME knows only the size of the user data structures. Thus, these structures may not contain pointers, since when those data are moved to another processor the pointers will be meaningless.
The shaded ovals in Figure 10.5 are physical nodes , each of which consists of one or more logical nodes . Each logical node has a set of aliases, which are the other logical nodes belonging to the same physical node. The physical node is a conceptual object, and is unaffected by parallelism; the logical node is a copy of the data in the physical node, so that each processor which owns a part of that physical node may access the data as if it had the whole node.
DIME is meant to make distributed processing of an unstructured mesh almost as easy as sequential programming. However, there is a remaining ``kernel of parallelism'' that the user must bear in mind. Suppose each node of the mesh gathers data from its local environment (i.e., the neighboring elements); if that node is split among several processors, it will gather data only from those elements which lie in the same processor, and consequently each logical node will have only part of the result. We need to combine the partial results from the logical nodes and return the combined result to each. This facility is provided by a macro in DIME called NODE_COMBINE , which is called each time the node data is changed according to its local environment.
The Delaunay triangulation used by meshtool would be an ideal way to refine the working mesh, as well as to make a coarse mesh for initial download. Unfortunately, adding a new node to an existing Delaunay triangulation may have global consequences; it is not possible to predict in advance how much of the current mesh should be replaced to accommodate the new node. Doing this in parallel requires an enormous amount of communication to make sure that the processors do not tread on each other's toes [ Williams:89c ].
DIME uses the algorithm of Rivara [Rivara:84a;89a] for refinement of the mesh, which is well suited to loosely synchronous parallel operation, but results in a triangulation which is not a Delaunay triangulation, and thus lacks some desirable properties. The process of topological relaxation changes the connectivity of the mesh to make it a Delaunay triangulation.
It is usually desirable to avoid triangles in the mesh which have particularly acute angles, and topological relaxation will reduce this tendency. Another method of doing this is by moving the nodes toward the average position of their neighboring nodes; a physical analogy would be to think of the edges of the mesh as damped springs and allowing the nodes to move under the action of the springs.
When DIME operates in parallel, the mesh should be distributed among the processors of the machine so that each processor has about the same amount of mesh to deal with, and the communication time is as small as possible. There are several ways to do this, such as with a computational neural net, by simulated annealing, or even interactively.
DIME uses a strategy known as recursive bisection [ Fox:88mm ], which has the advantages of being robust, simple, and deterministic, though sometimes the resulting communication pattern may be less than optimal. The method is illustrated in Figure 10.7 : each blob represents the center of an element, and the vertical and horizontal lines represent processor divisions. First, a median vertical line is found which splits the set of elements into two sets of approximately equal numbers, then (with two-way parallelism) two horizontal medians which split the halves into four approximately equal quarters, then (with four-way parallelism) four vertical medians, and so on. Chapter 11 describes more general and powerful load-balancing methods.
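The sketch below shows the essential recursion: sort the element centers along alternating axes and split at the median until the requested (power-of-two) number of pieces is reached. Sorting the whole slice is used for simplicity in place of a true median-selection step; names are illustrative.

/* Orthogonal recursive bisection of element centres; proc[id] receives the
 * processor number of the element with that id.  nproc must be a power of two. */
#include <stdlib.h>

typedef struct { double x, y; int id; } Centre;

static int axis;                    /* 0: compare x (vertical cut), 1: compare y */

static int cmp_centre(const void *a, const void *b)
{
    const Centre *p = a, *q = b;
    double d = axis ? p->y - q->y : p->x - q->x;
    return (d > 0) - (d < 0);
}

static void bisect(Centre *c, int n, int nproc, int firstproc,
                   int depth, int *proc)
{
    if (nproc == 1) {
        for (int i = 0; i < n; i++) proc[c[i].id] = firstproc;
        return;
    }
    axis = depth % 2;               /* alternate vertical and horizontal cuts */
    qsort(c, n, sizeof *c, cmp_centre);
    int half = n / 2;               /* median split into two nearly equal sets */
    bisect(c,        half,     nproc / 2, firstproc,             depth + 1, proc);
    bisect(c + half, n - half, nproc / 2, firstproc + nproc / 2, depth + 1, proc);
}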
Figure 10.7:
Recursive Bisection
Figure 10.8 (Color Plate) and Figure 10.9 (Color Plate) are from a calculation of transonic flow over an airfoil (see Section 12.3 ). Figure 10.9 shows the parallel structure of the DIME mesh used to calculate the flow. The redundant copies of shared nodes have been separated to show the data connections between them in yellow and blue.
Figure 10.8:
Pressure plot of Mach 0.8 flow over a NACA0012 airfoil, with the sonic line shown. The mesh for this computation is shown in Figure 10.9.
Figure 10.9:
Depiction of the mesh for the transonic flow calculation of color Figure 10.8. Each group of similarly colored triangles is owned by an individual processor. The mesh has been dynamically adapted and load-balanced. The yellow lines connect copies of nodes which are in the same geometric position, but have been separated for the purpose of this picture. The load balancing is by orthogonal recursive bisection.
In the pressure plot, there is a vertical shock about two-thirds of the way downstream from the leading edge of the airfoil. This can also be seen in the mesh plot since the same region has especially small triangles and high mesh density. Since each processor has about the same number of triangles, the processor regions are also small in the neighborhood of the shock.
The DIME software was written by Roy Williams, and the C P work is published in [ Baillie:90e ], [Williams:87a;87b;88a;88d;89c;90b].
DIME has evolved recently into something rather more general: instead of a set of explicitly triangular elements which have access to the three nodes around them, the new language DIME++ has the idea of a set of objects that have an index to another set of objects. Just as DIME is able to refine its mesh dynamically, and load-balance the mesh, so in DIME++ the indices may be created and modified dynamically and the sets load-balanced [Williams:91a;91c;92a;93b].
This more general formulation of the interface frees the system from explicitly triangular meshes, and greatly expands and generalizes the range of problems that can be addressed: higher dimensions, different kinds of elements, multigrid, graph problems, and multiblock. Instead of linked lists, DIME++ stores data in long vectors for maximum efficiency; it is written as a C++ class library for extensibility and polymorphism.
DIMEFEM [ Williams:89a ] is a software layer which enables finite-element calculations to be done with the irregular mesh maintained by DIME. The data objects dealt with by DIMEFEM are finite-element functions (FEFs), which may be scalar or have several components (vector fields), as well as linear, multilinear and nonlinear operators which map these FEFs to numbers. The guiding principle is that interesting physical problems may be expressed in variational terms involving FEFs and operators on them [ Bristeau:87a ], [ Glowinski:84a ]. We shall use as an example a Poisson solver.
Poisson's equation is $\nabla^2 u = f$, which may also be expressed variationally as: find $u$ such that, for all $v$,

$$a(u, v) \equiv \int_\Omega \nabla u \cdot \nabla v \, d\Omega = -\int_\Omega f\,v \, d\Omega \equiv L(v) ,$$
where the unknown u and the dummy variable v are taken to have the correct boundary conditions. To implement this with DIMEFEM, we first allocate space in each element for the FEFs u and f , then explicitly set f to the desired function. We now define the linear operator L and bilinear operator a as above, and call the linear solver to evaluate u .
When DIME creates a new element by refinement, it comes equipped with a pointer to a block of memory of user-specified size which DIMEFEM uses to store the data representing FEFs and corresponding linear operators. A template is kept of this element memory containing information about which one is already allocated and which one is free. When an FEF is to be created, the memory allocator is called, which decides how much memory is needed per element and returns an offset from the start of the element-data-space for storing the new FEF. Thus, a function in DIMEFEM typically consists of allocating a stack of work space, doing calculations, then freeing the work space.
An FEF thus consists of a specification of an element type, a number of fields (one for scalar, two or more for vector), and an offset into the element data for the nodal values.
Finite-element approximations to functions form a finite-dimensional vector space, and as such may be multiplied by a scalar and added. Functions are provided to do these operations. If the function is expressed as Lagrangian elements it may also be differentiated, which changes the order of representation: For example, differentiating a quadratic element produces a linear element.
At present, DIMEFEM provides two kinds of elements, Lagrangian and Gaussian, although strictly speaking the latter is not a finite element because it possesses no interpolation functions. The Gaussian element is simply a collection of function values at points within each triangle and a set of weights, so that integrals may be done by summing the function values multiplied by the weights. As with one-dimensional Gaussian integration, integrals are exact to some polynomial order. We cannot differentiate Gaussian FEFs, but can apply pointwise operators such as multiplication and function evaluation that cannot be done in the Lagrangian representation.
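To illustrate the Gaussian representation, the sketch below integrates a function over a single triangle with the standard three-point edge-midpoint rule, which is exact for quadratics: the integral is just the sum of function values at the quadrature points multiplied by the weights, exactly the data a ``Gaussian element'' stores. This is a self-contained illustration, not DIMEFEM code.

#include <cmath>
#include <cstdio>

struct Point { double x, y; };

// Integrate f over the triangle (a,b,c) with the 3-point edge-midpoint rule,
// exact for polynomials up to degree 2: a weighted sum of point values.
template <typename F>
double integrate_triangle(Point a, Point b, Point c, F f) {
    double area = 0.5 * std::fabs((b.x - a.x) * (c.y - a.y) -
                                  (c.x - a.x) * (b.y - a.y));
    Point m1{0.5 * (a.x + b.x), 0.5 * (a.y + b.y)};
    Point m2{0.5 * (b.x + c.x), 0.5 * (b.y + c.y)};
    Point m3{0.5 * (c.x + a.x), 0.5 * (c.y + a.y)};
    double w = area / 3.0;                        // equal weights at the three midpoints
    return w * (f(m1) + f(m2) + f(m3));
}

int main() {
    // Integral of x*y over the unit right triangle (0,0)-(1,0)-(0,1) is 1/24.
    double I = integrate_triangle({0, 0}, {1, 0}, {0, 1},
                                  [](Point p) { return p.x * p.y; });
    std::printf("integral = %.6f (exact 1/24 = %.6f)\n", I, 1.0 / 24.0);
    return 0;
}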
Consider the nonlinear operator L defined by
The most accurate way to evaluate this is to start with u in Lagrangian form, differentiate, convert to Gaussian representation, exponentiate, then multiply by the weights and sum. This can be done explicitly with DIMEFEM, but in the future we hope to create an environment which ``knows'' about representations, linearity, and so on, and can parse an expression such as the above and evaluate it correctly.
The computational kernel of any finite-element software is the linear solver. We have implemented this with preconditioned conjugate gradient, so that the user supplies a linear operator L, an elliptic bilinear operator a, a scalar product S (a strongly elliptic symmetric bilinear operator which satisfies the triangle inequality), and an initial guess for the solution. The conjugate-gradient solver replaces the guess by the solution u of the standard variational equation $a(u,v) = L(v)$ for all $v$.
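A minimal sketch of preconditioned conjugate gradient in matrix form is given below; DIMEFEM works with operators acting on FEFs rather than explicit matrices, but the algebra is the same. Here the operator A stands in for the bilinear form a, and the preconditioner M_inv plays the role the scalar product S plays in the text; the example problem (a small 1D Laplacian with Jacobi preconditioning) is our own and is not taken from the book.

#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

using Vec = std::vector<double>;
using Op  = std::function<Vec(const Vec&)>;     // a linear operator in matrix-free form

static double dot(const Vec& x, const Vec& y) {
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}

// Preconditioned conjugate gradient: solve A u = b, where A is supplied as an
// operator and M_inv approximates its inverse.
Vec pcg(const Op& A, const Op& M_inv, const Vec& b, Vec u,
        int max_iter = 1000, double tol = 1e-10) {
    Vec r = A(u);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = b[i] - r[i];   // initial residual
    Vec z = M_inv(r), p = z;
    double rz = dot(r, z);
    for (int it = 0; it < max_iter && dot(r, r) > tol * tol; ++it) {
        Vec Ap = A(p);
        double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < u.size(); ++i) {
            u[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        z = M_inv(r);
        double rz_new = dot(r, z);
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    return u;
}

int main() {
    // Small SPD example: 1D Laplacian (tridiagonal) with Jacobi preconditioning.
    const int n = 5;
    Op A = [](const Vec& x) {
        Vec y(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) {
            y[i] = 2.0 * x[i];
            if (i > 0)            y[i] -= x[i - 1];
            if (i + 1 < x.size()) y[i] -= x[i + 1];
        }
        return y;
    };
    Op M_inv = [](const Vec& r) {                 // diagonal (Jacobi) preconditioner
        Vec z(r.size());
        for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / 2.0;
        return z;
    };
    Vec b(n, 1.0), u0(n, 0.0);
    Vec u = pcg(A, M_inv, b, u0);
    for (double v : u) std::printf("%.4f ", v);
    std::printf("\n");
    return 0;
}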
We have implemented a sophisticated incompressible flow solver using DIME and DIMEFEM. The algorithm is described more completely in [ Bristeau:87a ]. The evolution equation for an incompressible Newtonian fluid of viscosity $\nu$ is the Navier-Stokes equation, $\partial u/\partial t + (u \cdot \nabla)u - \nu \nabla^2 u + \nabla p = f$, together with the incompressibility condition $\nabla \cdot u = 0$.
We use a three-stage operator-split scheme whereby for each time step of length dt , the equation is integrated
The parameter is .
Each of these three implicit steps involves the solution of either a Stokes problem:
or the nonlinear problem:
where is a parameter inversely proportional to the time step. We solve the Navier-Stokes equation, and consequently also these subsidiary problems, with given velocity at the boundary (Dirichlet boundary conditions).
With velocity approximated by quadratic, and pressure by linear Lagrangian elements, we found that both the Stokes and nonlinear solvers converged in three to five iterations. We ran the square cavity problem as a benchmark.
To reach a steady-state solution, we adopted the following strategy: With a coarse mesh, keep advancing simulation time until the velocity field no longer changes, then refine the mesh, iterate until the velocity stabilizes, refine, and so on. The refinement strategy was as follows. The velocity is approximated with quadratic elements with discontinuous derivatives, so we can calculate the maximum of this derivative discontinuity for each element, then refine those elements above the 70th percentile of this quantity.
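A sketch of this percentile-based selection is given below; the function name and data layout are hypothetical, and in practice the indicator values would come from the derivative-discontinuity calculation just described.

#include <algorithm>
#include <cstdio>
#include <vector>

// Mark for refinement all elements whose error indicator (here, the maximum
// derivative discontinuity across the element's edges) lies above a given
// percentile -- the 70th percentile in the strategy described in the text.
std::vector<int> select_for_refinement(const std::vector<double>& indicator,
                                       double percentile /* e.g. 0.70 */) {
    std::vector<double> sorted(indicator);
    std::sort(sorted.begin(), sorted.end());
    double threshold = sorted[static_cast<std::size_t>(percentile * (sorted.size() - 1))];
    std::vector<int> marked;
    for (std::size_t e = 0; e < indicator.size(); ++e)
        if (indicator[e] > threshold) marked.push_back(static_cast<int>(e));
    return marked;
}

int main() {
    std::vector<double> jump = {0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4, 0.5};
    for (int e : select_for_refinement(jump, 0.70))
        std::printf("refine element %d\n", e);
    return 0;
}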
Figure 10.10 shows the results. At top left is the mesh after four cycles of this refinement and convergence , at Reynolds number 1000. We note heavy refinement at the top left and top right, where the boundary conditions force a discontinuity in velocity, and also along the right side where the near discontinuous vorticity field is being transported around the primary vortex. Bottom left shows the logical structure, split among four transputers. The top right and bottom right show streamlines and vorticity, respectively. The results accord well with the benchmark of [ Schreiber:83a ].
Figure 10.10:
Results for Square Cavity Problem with Reynolds Number 1000
The DIMEFEM software was written by Roy Williams, and the flow algorithm developed by R. Glowinski of the University of Houston [Williams:89a;90b].
We have seen many times that parallel computing involves breaking problems into parts which execute concurrently. In the simple regular problems seen in the early chapters, especially Chapters 4 and 6 , it was usually reasonably obvious how to perform this breakup to optimize performance of the program on a parallel machine. However, in Chapter 9 and even more so in Chapter 12 , we will find that the nature of the parallelism is as clear as before, but that it is nontrivial to implement efficiently.
Irregular loosely synchronous problems consist of a collection of heterogeneous tasks communicating with each other at the macrosynchronization points characteristic of this problem class. Both the execution time per task and the amount and pattern of communication can differ from task to task. In this section, we describe and compare several approaches to this problem. We note that formally this is a very hard (so-called NP-complete) optimization problem. With N tasks running on P processors, we cannot afford to examine every one of the $P^N$ possible assignments of tasks to processors. Experience has shown that this problem is easier than one would have thought-partly at least because one does not require the exactly optimal assignment. Rather, a solution whose execution time is, say, within 10% of the optimal value is quite acceptable. Remember, one has probably chosen to ``throw away'' a larger fraction than this of the possible MPP performance by using a high-level language such as Fortran or C on the node (independent of any parallelism issues). The physical optimization methods described in Section 11.3 and more problem-specific heuristics have shown themselves very suitable for this class of approximate optimization [Fox:91j;92i].
In 1985, at a DOE contract renewal review at Caltech, we thought that this load-balancing issue would be a major and perhaps the key stumbling block for parallel computing. However, this is not the case-it is a hard and important problem, but for loosely synchronous problems it can be solved straightforwardly [ Barnard:93a ], [Fox:92c;92h;92i], [ Mansour:92d ]. Our approach to this uses physical analogies and stems in fact from dinner conversations between Fox and David Jefferson, a collaborator from UCLA, at this meeting [Fox:85k;86a;88e;88mm].
An interesting computer science challenge is to understand why the NP-complete load-balancing problem appears ``easier in practice'' than the Travelling Salesman Problem, which is the generic NP-complete optimization problem. We will return to this briefly in Section 11.3 , but note that the ``shape of the objective function'' (in physics language, the ``energy landscape'') illustrated in Figure 11.1 appears critical. Load-balancing problems appear to fall into the ``easy class'' of NP-complete optimization problems with the landscape of Figure 11.1 (a).
The methods discussed in the following are only a sample of the many effective approaches developed recently: [ Barhen:88a ], [ Berger:87a ], [ Chen:88a ], [ Chrisochoides:91a ], [ Ercal:88a ], [ Ercal:88b ], [Farhat:88a;89b], [ Fox:88nn ], [ Hammond:92b ], [ Houstis:90a ], [ Livingston:88a ], [ Miller:92a ], [ Nolting:91a ], [ Teng:91a ], [ Walker:90b ]. The work of Simon [ Barnard:93a ], [ Pothen:90a ], [ Simon:91b ], [ Venkatakrishnan:92a ] on recursive spectral bisection-a method with similarities to the eigenvector recursive bisection (ERB) method mentioned later-has been particularly successful.
Figure 11.1:
Two Possible ``Energy Landscapes'' for an Optimization
Problem
A few general remarks are necessary; we use the phrases ``load balancing'' and ``data decomposition'' interchangeably. One needs both ab initio and dynamic distribution and redistribution of data on the parallel machine. We also can examine load balancing at the level of data or of tasks that encapsulate the data and algorithm. In elegant (but currently inefficient) software models with one datum per task, these formulations are equivalent. Our examples will do load balancing at the level of data values, but the task and data distribution problems are essentially equivalent.
Our methods are applicable to general loosely synchronous problems and indeed can be applied to arbitrary problem classes. However, we will choose a particular finite-element problem to illustrate the issues where one needs to distribute a mesh, such as that illustrated in Figure 11.2 . Each triangle or element represents a task which communicates with its neighboring three triangles. In doing, for example, a simulation of fluid flow on the mesh, each element of the mesh communicates regularly with its neighbors, and this pattern may be repeated thousands of times.
Figure 11.2:
An Unstructured Triangular Mesh Surrounding a Four-Element
Airfoil. The mesh is distributed among 16 processors, with divisions
shown by heavy lines.
We may classify load-balancing strategies into four broad types, depending on when the optimization is made and whether the cost of the optimization is included in the optimization itself:
Koller calls these problems adiabatic [ Fox:90nn ], using a physical analogy where the load balancer can be viewed as a heatbath keeping the problem in equilibrium. In adiabatic systems, changes are sufficiently slow that the heatbath can ``keep up'' and the system evolves from equilibrium state to equilibrium state.
If the mesh is solution-adaptive, that is, if the mesh, and hence the load-balancing problem, change discretely during execution of the code, then it is most efficient to decide the optimal mesh distribution in parallel. In this section, three parallel algorithms, orthogonal recursive bisection (ORB), eigenvector recursive bisection (ERB) and a simple parallelization of simulated annealing (SA) are discussed for load-balancing a dynamic unstructured triangular mesh on 16 processors of an nCUBE machine.
The test problem is a solution-adaptive Laplace solver, with an initial mesh of 280 elements, refined in seven stages to 5772 elements. We present execution times for the solver resulting from the mesh distributions using the three algorithms, as well as results on imbalance, communication traffic, and element migration.
In this section, we shall consider the quasi-dynamic case with observations on the time taken to do the load balancing that bear on the dynamic case. The testbed is an unstructured-mesh finite-element code, where the elements are the atoms of the problem, which are to be assigned to processors. The mesh is solution-adaptive, meaning that it becomes finer in places where the solution of the problem dictates refinement.
We shall show that a class of finite-element applications share common load-balancing requirements, and formulate load balancing as a graph-coloring problem. We shall discuss three methods for solving this graph-coloring problem: one based on statistical physics, one derived from a computational neural net, and one cheap and simple method.
We present results from running these three load-balancing methods, both in terms of the quality of the graph-coloring solution (machine-independent results), and in terms of the particular machine (16 processors of an nCUBE) on which the test was run.
An important class of problems are those which model a continuum system by discretizing continuous space with a mesh. Figure 11.2 shows an unstructured triangular mesh surrounding a cross-section of a four-element airfoil from an Airbus A-310. The variations in mesh density are caused by the nature of the calculation for which the mesh has been used; the airfoil is flying at Mach 0.8 to the left, so that a vertical shock extends upward at the trailing edge of the main airfoil, which is reflected in the increased mesh density.
The mesh has been split among 16 processors of a distributed machine, with the divisions between processors shown by heavy lines. Although the areas of the processor domains are different, the numbers of triangles or elements assigned to the processors are essentially the same. Since the work done by a processor in this case is the same for each triangle, the workloads for the processors are the same. In addition, the elements have been assigned to processors so that the number of adjacent elements which are in different processors is minimized.
In order to analyze the optimal distribution of elements among the processors, we must consider the way the processors need to exchange data during a calculation. To design a general load balancer for such calculations, we would like to specify this behavior with the fewest possible parameters, which do not depend on the particular mesh being distributed. The following remarks apply to several application codes, written to run with the DIME software (Section 10.1 ), which use two-dimensional unstructured meshes, as follows:
As far as load balancing is concerned, all of these codes are rather similar. This is because the algorithms used are local: Each element or node of the mesh gets data from its neighboring elements or nodes. In addition, a small amount of global data is needed; for example, when solving iteratively, each processor calculates the norm of the residual over its part of the mesh, and these local values must be combined globally to decide if the solve has converged.
We can analyze the performance of code using an approach similar to that in Section 3.5 . In this case, the computational kernel of each of these applications is iterative, and each iteration may be characterized by three numbers:
These numbers are listed in the following table for the five applications listed above:
The two finite-volume applications do not have iterative matrix solves, so they have no convergence checking and thus have no need for any global data exchange. The ratio c in the last column is the ratio of the third to the fifth columns and may be construed as follows. Suppose a processor has E elements, of which B are at the processor boundary. Then the amount of communication the processor must do compared to the amount of calculation is given by the general form of Equation 3.10 , which here becomes
It follows that a large value of c corresponds to an eminently parallelizable operation, since the communication rate is low compared to calculation. The ``Stress'' example has a high value of c because the solution being sought is a two-dimensional strain field; while the communication is doubled, the calculation is quadrupled, because the elements of the scalar stiffness matrix are replaced by block matrices, and each block requires four multiplies instead of one. For the ``Fluid'' example, with quadratic elements, there are the two components of velocity being communicated at both nodes and edges, which is a factor of four for communication, but the local stiffness matrix is correspondingly larger because of the quadratic elements. Thus, we conclude that the more interacting fields there are, and the higher the element order, the more efficiently the application runs in parallel.
We wish to distribute the elements among the processors of the machine to minimize both load imbalance (one processor having more elements than another) and communication between elements.
Our approach here is to write down a cost function which is minimized when the total running time of the code is minimized and is reasonably simple and independent of the details of the code. We then minimize this cost function and distribute the elements accordingly.
The load-balancing problem [Fox:88a;88mm] may be stated as a graph-coloring problem: Given an undirected graph of N nodes (finite elements), color these nodes with P colors (processors) to minimize a cost function H which is related to the time taken to execute the program for a given coloring. For DIME applications, it is the finite elements which are to be distributed among the processors, so the graph to be colored is actually the dual graph of the mesh, where each graph node corresponds to an element of the mesh and has (if it is not at a boundary) three neighbors.
We may construct the cost function as the sum of a part that minimizes load imbalance and one that minimizes communication,
$H = H_{\rm calc} + \mu H_{\rm comm}$,
where $H_{\rm calc}$ is the part of the cost function which is minimized when each processor has equal work, $H_{\rm comm}$ is minimal when communication time is minimized, and $\mu$ is a parameter expressing the balance between the two, with $\mu$ related to the number c discussed above. If $H_{\rm calc}$ and $H_{\rm comm}$ were proportional to the times taken for calculation and communication, then $\mu$ should be inversely proportional to c. For programs with a great deal of calculation compared to communication, $\mu$ should be small, and vice versa.
As $\mu$ is increased, the number of processors in use will decrease until eventually the communication is so costly that the entire calculation must be done on a single processor.
Let e, f label the nodes of the graph, and let $c_e$ be the color (or processor assignment) of graph node e. Then the number of graph nodes of color q is $N_q = \sum_e \delta_{c_e,q}$,
and $H_{\rm calc}$ is proportional to the maximum value of $N_q$, because the whole calculation runs at the speed of the slowest processor, and the slowest processor is the one with the most graph nodes. This ignores node and link (node-to-node communication) contention, which contribute to idle time.
The formulation as a maximum of $N_q$ is, however, not satisfactory when a perturbation is added to the cost function, such as that from the communication cost function. If, for example, we were to add a linear forcing term proportional to one of the $N_q$, the minimum of the perturbed cost function would jump discontinuously once the strength of that term crossed a threshold. This discontinuous behavior as a result of perturbations is undesirable, so we use a sum of squares instead, whose minima change smoothly with the magnitude of a perturbation:
$H_{\rm calc} = \alpha \sum_q N_q^2$,
where $\alpha$ is a scaling constant to be determined.
We now consider the communication part of the cost function. Let us define the matrix
$C_{qr} = \sum_{e \leftrightarrow f} \delta_{c_e,q}\,\delta_{c_f,r}$,
which is the amount of communication between processors q and r; the notation $e \leftrightarrow f$ means that the graph nodes e and f are connected by an edge of the graph.
The cost of communication from processors q to r depends on the machine architecture; for some parallel machines it may be possible to write down this metric explicitly. For example, with the early hypercubes, the cost is the number of bits which are different in the binary representations of the processor numbers q and r . The metric may also depend on the message-passing software, or even on the activities of other users for a shared machine. A truly portable load balancer would have no option but to send sample messages around and measure the machine metric, then distribute the graph appropriately. In this book, however, we shall avoid the question of the machine metric by simply assuming that all pairs of processors are equally far apart, except of course a processor may communicate with itself at no cost.
The cost of sending the quantity of data also depends on the programming: the cost will be much less if it is possible for the messages to be bundled together and sent as one, rather than separately. The major problem is latency: The cost to send a message in any distributed system is the sum of an initial fixed price and one proportional to the size of the message. This is also the case for the pricing of telephone calls, freight shipping, mail service, and many other examples from the everyday world. If the message is large enough, we may ignore latency: For the nCUBE used in Section 11.1.7 of this book, latency may be ignored if the message is longer than a hundred bytes or so. In the tests of Section 11.1.7 , most of the messages are indeed long enough to neglect latency, though there is certainly further work needed on load balancing in the presence of this important effect. We also ignore blocking (idling) due to needed resources being unavailable due to contention.
The result of this discussion is that we shall assume that the cost of communicating the quantity of data $C_{qr}$ is proportional to $C_{qr}$, unless q = r, in which case the cost is zero. This is a good assumption on many new machines, such as the Intel Touchstone series.
We shall now make the assumption that the total communication cost is the sum of the individual communications between processors,
$H_{\rm comm} = \beta \sum_{q \neq r} C_{qr}$,
where $\beta$ is a constant to be determined. Notice that any overlap between calculation and communication is ignored. Here, we have also ignored ``global'' contributions to $H_{\rm comm}$, such as the collective communication (global sums or reductions) mentioned in Section 11.1.1 .
Substituting the expression for $N_q$, the load-balance part of the cost function simplifies to $H_{\rm calc} = \alpha \sum_{e,f} \delta_{c_e,c_f}$.
The assumptions made to derive this cost function are significant. The most serious deviation from reality is neglecting the parallelism of communication, so that a minimum of this cost function may have grossly unbalanced communication loads. This turns out not to be the case, however, because when the mesh is equally balanced, there is a lower limit to the amount of boundary, analogous to a bubble having minimal surface area for fixed volume; if we then minimize the sum of surface areas for a set of bubbles of equal volumes, each surface must be minimized and equal.
We may now choose the scaling constants $\alpha$ and $\beta$. A convenient choice is such that the optimal $H_{\rm calc}$ and $H_{\rm comm}$ have contributions of about unit size from each processor; the form of the scaling constants follows because the surface area of a compact shape in d dimensions varies as the d-1 power of the size, while volume varies as the d power. The final form for H is
where d is the dimensionality of the mesh from which the graph came.
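As a concrete and deliberately simplified illustration, the C++ sketch below evaluates a cost of the form $\sum_q N_q^2 + \mu \times$ (number of cut graph edges) for a given coloring; the scaling constants discussed above are folded into $\mu$, the graph is given as an edge list, and the function and variable names are our own rather than anything from the DIME software.

#include <cstdio>
#include <utility>
#include <vector>

// Evaluate a simplified load-balance cost function for a coloring:
//   H = sum_q N_q^2 + mu * (number of graph edges whose endpoints have
//                           different colors, i.e. inter-processor traffic).
double cost(const std::vector<int>& color,                    // color[e] = processor of node e
            const std::vector<std::pair<int, int>>& edges,    // graph edges (e, f)
            int nproc, double mu) {
    std::vector<long> N(nproc, 0);
    for (int c : color) ++N[c];
    double h_calc = 0.0;
    for (long nq : N) h_calc += static_cast<double>(nq) * nq;

    long cut = 0;
    for (const auto& ef : edges)
        if (color[ef.first] != color[ef.second]) ++cut;

    return h_calc + mu * static_cast<double>(cut);
}

int main() {
    // A 4-node ring colored with 2 colors.
    std::vector<std::pair<int, int>> ring = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
    std::vector<int> balanced = {0, 0, 1, 1};   // two cut edges
    std::vector<int> lumped   = {0, 0, 0, 0};   // no cut edges, all work on one processor
    std::printf("balanced H = %.1f\n", cost(balanced, ring, 2, 1.0));  // 4+4+2 = 10
    std::printf("lumped   H = %.1f\n", cost(lumped,   ring, 2, 1.0));  // 16+0  = 16
    return 0;
}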
The formalism of this section has a simple physical interpretation
[Fox:86a;88kk;88mm;88tt;88uu], which we introduce here and discuss further in Section 11.2 . The data points (tasks) to be distributed can be thought of as particles moving around in the discrete space formed by the processors. This physical system is controlled by the Hamiltonian (energy function) given in Equation 11.9 . The two terms in the Hamiltonian have simple physical meanings illustrated in Figure 11.3 . The first term in Equation 11.9 ensures equal work per node and is a short-range repulsive force trying to push particles away if they land in the same node. The second term in Equation 11.9 is a long-range attractive force which links ``particles'' (data points) which communicate with each other. This force tries to pull particles together (into the same node) with a strength proportional to the information needed to be communicated between them. In general, this communication force depends on the architecture of the interconnect of the parallel machine, although Equation 11.9 has assumed a simple form for this. The analogy is preserved in general with the MPP interconnect architecture translating into a topology for the discrete space formed by the processors in the analogy. This topology implies a distance dependence force for the communication term in H . We can also extend the discussion to include the cost of moving data between processors to rebalance a dynamically changing problem. This migration cost becomes a third force attracting each particle to the processor in which it currently resides. Figure 11.3 illustrates these three forces.
Figure 11.3: Sixteen Data Points Distributed Optimally on Four Processors, Illustrating the Physical Analogy of Section 11.3. We take a simple two-dimensional mesh connection for the particles.
Note that the load-balancing problem becomes that of finding the equilibrium state of a system of particles with a ``conflict'' between short-range repulsive (hardcore) and long-range attractive forces. This scenario is qualitatively similar to classical atomic physics problems and leads one to expect that the physically based optimization methods could be effective. This physical analogy is extended in Section 11.2 where we show that the physical system exhibits effects that can be associated with temperature and phase transitions. We also indicate how it needs to be extended for problems with microscopic structure in their temporal properties.
This book presents a performance evaluation of three load-balancing algorithms, all of which run in parallel. With a massively parallel machine, it would not be possible to load-balance the mesh sequentially, because (1) there would be a serious sequential bottleneck, (2) there would not be enough memory in a host machine to store the entire distributed mesh, and (3) there would be a large cost incurred in communicating the entire mesh.
The three methods are orthogonal recursive bisection (ORB), eigenvector recursive bisection (ERB), and a parallel (collisional) form of simulated annealing (SA).
Simulated annealing [ Fox:88mm ], [ Hajek:88a ], [ Kirkpatrick:83a ], [ Otten:89a ] is a very general optimization method which stochastically simulates the slow cooling of a physical system. The idea is that there is a cost function H (in physical terms, a Hamiltonian) which associates a cost with a state of the system, a ``temperature'' T, and various ways to change the state of the system. The algorithm works by iteratively proposing changes and either accepting or rejecting each change. Having proposed a change, we may evaluate the change $\Delta H$ in the cost function. The proposed change may be accepted or rejected by the Metropolis criterion: if the cost function decreases, the change is accepted unconditionally; otherwise it is accepted, but only with probability $e^{-\Delta H/T}$. A crucial requirement for the proposed changes is reachability or ergodicity-that there be a sufficient variety of possible changes that one can always find a sequence of changes so that any system state may be reached from any other.
When the temperature is zero, changes are accepted only if H decreases, an algorithm also known as hill-climbing , or more generally, the greedy algorithm [ Aho:83a ]. The system soon reaches a state in which none of the proposed changes can decrease the cost function, but this is usually a poor optimum. In real life, we might be trying to achieve the highest point of a mountain range by simply walking upwards; we soon arrive at the peak of a small foothill and can go no further.
Conversely, if the temperature is very large, all changes are accepted, and we simply move at random, ignoring the cost function. Because of the reachability property of the set of changes, we explore all states of the system, including the global optimum.
Simulated annealing consists of running the accept/reject algorithm between the temperature extremes. We propose many changes, starting at a high temperature and exploring the state space, and gradually decreasing the temperature to zero while hopefully settling on the global optimum. It can be shown that if the temperature decreases sufficiently slowly (the inverse of the logarithm of the time), then the probability of being in a global optimum tends to certainty [ Hajek:88a ].
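A minimal sequential sketch of this procedure, applied to the simplified graph-coloring cost function of the previous section, is given below. It uses the random-color move of Figure 11.4 and, for brevity, a geometric cooling schedule rather than the logarithmic one required by the convergence theorem, so it should be read as an illustration only; a real code would also maintain the population table incrementally instead of recomputing it for every trial.

#include <cmath>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

// Sequential simulated annealing for graph coloring: random-color moves
// accepted by the Metropolis criterion, with geometric cooling.
int main() {
    const int n = 200, ncolor = 4;                  // ring graph of 200 nodes, 4 colors
    std::vector<std::pair<int, int>> edges;
    for (int e = 0; e < n; ++e) edges.push_back({e, (e + 1) % n});

    std::vector<int> color(n, 0);                   // rough initial coloring
    for (int e = 100; e < 150; ++e) color[e] = 1;
    for (int e = 150; e < 200; ++e) color[e] = 2;

    const double mu = 1.0;
    auto delta_H = [&](int e, int c_new) {          // change in H for recoloring node e
        int c_old = color[e];
        if (c_new == c_old) return 0.0;
        std::vector<long> N(ncolor, 0);
        for (int c : color) ++N[c];
        double d_calc = (double)(N[c_new] + 1) * (N[c_new] + 1) - (double)N[c_new] * N[c_new]
                      + (double)(N[c_old] - 1) * (N[c_old] - 1) - (double)N[c_old] * N[c_old];
        int left = (e + n - 1) % n, right = (e + 1) % n;
        int cut_old = (color[left] != c_old) + (color[right] != c_old);
        int cut_new = (color[left] != c_new) + (color[right] != c_new);
        return d_calc + mu * (cut_new - cut_old);
    };

    std::mt19937 rng(1234);
    std::uniform_int_distribution<int> pick_node(0, n - 1), pick_color(0, ncolor - 1);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    for (double T = 50.0; T > 0.01; T *= 0.95) {    // geometric cooling schedule
        for (int trial = 0; trial < 2000; ++trial) {
            int e = pick_node(rng), c = pick_color(rng);
            double dH = delta_H(e, c);
            if (dH <= 0.0 || uniform(rng) < std::exp(-dH / T))   // Metropolis criterion
                color[e] = c;
        }
    }
    long cut = 0;                                   // as in Figure 11.4, expect more
    for (auto& ef : edges)                          // boundaries than the optimal four
        cut += (color[ef.first] != color[ef.second]);
    std::printf("boundaries in final coloring: %ld\n", cut);
    return 0;
}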
Figure 11.4 shows simulated annealing applied to the load-balancing cost function in one dimension. The graph to be colored is a periodically connected linear array of 200 nodes, to be colored with four colors. The initial configuration, at the bottom of the figure, has the left 100 nodes colored white, two domains of 50 nodes each in mid grays, and no nodes in the darkest gray. We know that the global optimum is 50 nodes of each color, with all the nodes of the same color consecutive. Iterations run up the figure, with the final configuration at the top.
Figure 11.4:
Simulated Annealing of a Ring Graph of Size 200, with the Four
Graph Colors Shown by Gray Shades. The time history of the annealing
runs vertically, with the maximum temperature and the starting
configuration at the bottom, and zero temperature and the final optimum at
the top. The basic move is to change the color of a graph node to a
random color.
At each iteration of the annealing, a random node is chosen, and its color changed to a random color. This proposed move is accepted if the Metropolis criterion is satisfied. At the end of the annealing, a good balance is achieved at the top of the figure, with each color having equal numbers of nodes; but there are 14 places where the color changes (communication cost = 14), rather than the minimum of four.
In choosing the change to be made to the state of the system, there may be intuitive or heuristic reasons to choose a change which tends to reduce the cost function. For our example of load balancing, we know that the optimal coloring of the graph has equal-sized compact ``globules''; if we were to restrict the new color of a node to the color of one of its two neighbors, then the boundaries between colors move without creating new domains.
The effect of this algorithm is shown in Figure 11.5 , with the same number of iterations as Figure 11.4 . The imbalance of 100 white nodes is quickly removed, but there are only three colors, of about 67 nodes each, in the (periodically connected) final configuration. The problem is that the changes do not satisfy reachability; if a color is not present in the graph coloring, then it can never come back.
Figure 11.5: Same as Figure 11.4, Except the Basic Move Is to Change the Color of a Graph Node to the Color of One of the Neighbors.
Even if reachability is satisfied, a heuristic may degrade the quality of the final optimum, because a heuristic is coercing the state toward local minima in much the same way that a low temperature would. This may reduce the ability of the annealing algorithm to explore the state space, and cause it to drop into a local minimum and stay there, resulting in poor performance overall.
Figure 11.6 shows a solution to this problem. There is a high probability that the new color is that of one of the neighbors, but also a small probability of a ``seed'' color, which is a randomly chosen color. Now we see a much better final configuration, close to the global optimum. The balance is perfect, and there are five separate domains instead of the optimal four.
Figure 11.6: Same as Figure 11.4, Except the Basic Move Is to Change the Color of a Graph Node to the Color of One of the Neighbors with Large Probability, and to a Random Color with Small Probability.
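The move set of Figures 11.5 and 11.6 can be written as a small proposal function: with a small ``seed'' probability a completely random color is proposed (restoring reachability), otherwise the color of a random neighbor is copied, which moves domain boundaries without creating new domains. The sketch below is illustrative, with names of our own choosing; in the annealing fragment shown earlier it would replace the uniform random color draw.

#include <cstdio>
#include <random>
#include <vector>

// Propose a new color for a node: random "seed" color with small probability,
// otherwise the color of a randomly chosen neighbor.
int propose_color(int node,
                  const std::vector<std::vector<int>>& neighbors,   // adjacency lists
                  const std::vector<int>& color,
                  int ncolor, double seed_probability, std::mt19937& rng) {
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    if (uniform(rng) < seed_probability || neighbors[node].empty()) {
        std::uniform_int_distribution<int> any_color(0, ncolor - 1);
        return any_color(rng);                       // "seed" move: random color
    }
    std::uniform_int_distribution<int> pick(0, (int)neighbors[node].size() - 1);
    return color[neighbors[node][pick(rng)]];        // copy a neighbor's color
}

int main() {
    const int n = 6, ncolor = 4;
    std::vector<std::vector<int>> nbrs(n);
    std::vector<int> color = {0, 0, 0, 1, 1, 1};
    for (int e = 0; e < n; ++e) nbrs[e] = {(e + n - 1) % n, (e + 1) % n};  // ring graph
    std::mt19937 rng(42);
    std::printf("proposed color for node 2: %d\n",
                propose_color(2, nbrs, color, ncolor, 0.1, rng));
    return 0;
}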
Collisional Simulated Annealing
As presented so far, simulated annealing is a sequential algorithm, since whenever a move is made, an acceptance decision must be made before another move may be evaluated. A parallel variant, which we shall call collisional simulated annealing, would be to propose several changes to the state of the system, evaluate the Metropolis criterion on each simultaneously, then make those changes which are accepted. Figure 11.7 shows the results of the same set of changes as Figure 11.6 , but doing 16 changes simultaneously instead of sequentially. Now there are eight domains in the final configuration rather than five. The essential difference from the sequential algorithm is that the change $\Delta H$ resulting from several simultaneous changes is not the sum of the $\Delta H$ values obtained if the changes are made in sequence. We tend to get parallel collisions , where there may be two changes, each of which is beneficial, but which together are detrimental. For example, a married couple might need to buy a lawn mower; if either buys it, the result is beneficial to the couple, but if both simultaneously buy lawn mowers, the result is detrimental because they only need one.
Figure 11.7: Same as Figure 11.6, Except the Optimization Is Being Carried Out in Parallel by 16 Processors. Note the fuzzy edges of the domains caused by parallel collisions.
Figure 11.8 shows how parallel collisions can adversely affect the load-balancing process. At left, two processors share a small mesh, shown by the two colors, with a sawtooth division between them. There are seven edges with different colors on each side. In the middle are shown each processor's separate views of the situation, and each processor discovers that by changing the color of the teeth of the sawtooth it can reduce the boundary from 7 to 4. On the right is shown the result of these simultaneous changes; the boundary has increased to 15, instead of the 4 that would result if only one processor went ahead.
Figure 11.8:
Illustration of a Parallel Collision During Load Balance. Each
processor may make changes which decrease the boundary length, but the
combined changes increase the boundary.
The problem with this parallel variant is, of course, that we are no longer doing the correct algorithm, since each processor is making changes without consulting the others. As noted in [ Baiardi:89a ], [ Barajas:87a ], [ Braschi:90a ], [ Williams:86b ], we have an algorithm which is highly parallel, but not particularly efficient. We should note that when the temperature is close to zero, the success rate of changes (ratio of accepted to proposed changes) falls to zero: Since a parallel collision depends on two successful changes, the parallel collision rate is proportional to the square of the low success rate, so that the effects of parallel collisions must be negligible at low temperatures.
One approach [ Fox:88a ] [ Johnson:86a ] to the parallel collision problem is rollback . We make the changes in parallel, as above, then check to see if any parallel collisions have occurred, and if so, undo enough of the changes so that there are no collisions. While rollback ensures that the algorithm is carried out correctly, there may be a great deal of overhead, especially in a tightly coupled system at high temperature, where each change may collide with many others, and where most changes will be accepted. In addition, of course, rollback involves a large software and memory overhead since each change must be recorded in such a way that it can be rescinded, and a decision must be reached about which changes are to be undone.
For some cost functions and sets of changes, it may be possible to divide the possible changes into classes such that parallel changes within a class do not collide. An important model in statistical physics is the Potts model [ Wu:82a ], whose cost function is the same as the communication part of the load-balance cost function. If the underlying graph is a square lattice, the graph nodes may be divided into ``red'' and ``black'' classes, so called because the arrangement is like the red and black squares of a checkerboard . Then we may change all the red nodes or all the black nodes in parallel with no collisions.
Some highly efficient parallel simulated annealing algorithms have been implemented [ Coddington:90a ] for the Potts model using clustering. These methods are based on the locality of the Potts cost function: the change in cost function from a change in the color of a graph node depends only on the colors of the neighboring nodes of the graph. Unfortunately, the balance part of the cost function interferes with this locality in that widely separated (in terms of the Hamming distance) changes may collide, so these methods are not suitable for load balancing.
In this book, we shall use the simple collisional simulated annealing algorithm, making changes without checking for parallel collisions. Further work is required to invent and test more sophisticated parallel algorithms for simulated annealing, which may be able to avoid the degradation of performance caused by parallel collisions without unacceptable inefficiency from the parallelism [ Baiardi:89a ].
Clustering
Since the basic change made in the graph-coloring problem is to change the color of one node, a boundary can move at most one node per iteration. The boundaries between processors are diffusing toward their optimal configurations. A better change is to take a connected set of nodes which are the same color, and change the color of the entire set at once [ Coddington:90a ]. This is shown in Figure 11.9 , where the cluster is formed by first picking a random node; we then add nodes probabilistically to the cluster; in this case, a neighbor is added with probability 0.8 if it has the same color, and never if it has a different color. Once a neighbor has failed to be added, the cluster generation finishes. The coloring of the graph becomes optimal extremely quickly compared to the single-color-change method of Figure 11.6 .
Figure 11.9: Same as Figure 11.6, Except the Basic Move Is to Change the Color of a Connected Cluster of Nodes.
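A sketch of this cluster construction is given below: a random seed node is chosen, and same-colored neighbors are added with probability 0.8 until a probabilistic test first fails, at which point the cluster is complete and can be recolored as a single move. The function and variable names are our own, not those of the actual code.

#include <cstdio>
#include <queue>
#include <random>
#include <vector>

// Grow a connected same-colored cluster from a random seed node, adding each
// same-colored neighbor with a fixed probability; the growth stops as soon as
// one such probabilistic test fails.
std::vector<int> grow_cluster(const std::vector<std::vector<int>>& neighbors,
                              const std::vector<int>& color,
                              double add_probability, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, (int)color.size() - 1);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    int seed = pick(rng);
    std::vector<char> in_cluster(color.size(), 0);
    std::vector<int> cluster = {seed};
    std::queue<int> frontier;
    frontier.push(seed);
    in_cluster[seed] = 1;
    while (!frontier.empty()) {
        int e = frontier.front();
        frontier.pop();
        for (int f : neighbors[e]) {
            if (in_cluster[f] || color[f] != color[seed]) continue;
            if (uniform(rng) >= add_probability)   // first failure ends the growth
                return cluster;
            in_cluster[f] = 1;
            cluster.push_back(f);
            frontier.push(f);
        }
    }
    return cluster;
}

int main() {
    // 10-node ring, two colors in halves; a cluster never crosses a color boundary.
    const int n = 10;
    std::vector<std::vector<int>> nbrs(n);
    std::vector<int> color(n);
    for (int e = 0; e < n; ++e) {
        nbrs[e] = {(e + n - 1) % n, (e + 1) % n};
        color[e] = (e < n / 2) ? 0 : 1;
    }
    std::mt19937 rng(7);
    std::printf("cluster size %zu\n", grow_cluster(nbrs, color, 0.8, rng).size());
    return 0;
}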
Figure 11.10 shows the clustered simulated annealing running in parallel, where 16 clusters are chosen simultaneously. The performance is degraded, but still better than Figure 11.7 , which is parallel but with single color changes.
Figure 11.10: Same as Figure 11.7, Except That the Cluster Method Is Being Carried Out in Parallel by 16 Processors.
Summary of the Algorithm
The annealing algorithm as presented so far requires that several parameters be chosen for tuning; these parameters are identified in the description below.
First, we pick the initial coloring of the graph so that each graph node takes the color corresponding to the processor in which it currently resides. We form a population table, of which each processor has a copy, giving $N_q$, the number of nodes which have color q. We pick a value for $\mu$, the importance of communication.
We pick a maximum temperature and the number of stages during which the temperature is to be reduced to zero. Each stage consists of a number of changes to the graph coloring which may be accepted or rejected, with no communication between the processors. At the end of a stage, each processor has a different idea of the population table, and of the colors of neighboring graph nodes which are in different processors, because each processor has made changes without knowledge of the others; the processors therefore communicate to update the population tables and local neighbor information so that each processor is up to date. Each stage consists of either a given number of accepted changes, or a given number of rejected changes, whichever comes first, followed by a loosely synchronous communication between processors.
Each trial move within a stage consists of looking for a cluster of uniform color, choosing a new color for the cluster, evaluating the change in cost function, and using the Metropolis criterion to decide whether to accept it. The cluster is chosen by first picking a random graph node as a seed, and probabilistically forming a cluster. Neighboring nodes are added to the cluster with a given cluster probability if they are the same color as the seed and reside in the same processor.
The proposed new color for the cluster is chosen to be either a random color, with a given seed probability, or a color chosen at random from those of the neighbors of the cluster. The Metropolis criterion is then used to decide if the color change is to be accepted, and if so, the local copy of the population table is updated.
Rather than coloring the graph by direct minimization of the load-balance cost function, we may do better to reduce the problem to a number of smaller problems. The idea of recursive bisection is that it is easier to color a graph with two colors than many colors. We first split the graph into two halves, minimizing the communication between the halves. We can then color each half with two colors, and so on, recursively bisecting each subgraph.
There are two advantages to recursive bisection; first, each subproblem (coloring a graph with two colors) is easier than the general problem; second, there is natural parallelism. While the first stage is splitting a single graph in two, and is thus a sequential problem, there is two-way parallelism at the second stage, when the two halves are being split, and four-way parallelism when the four quarters are being split. Thus, coloring a graph with P colors is achieved in a number of stages which is logarithmic in P .
Both of the recursive bisection methods we shall discuss split a graph into two by associating a scalar quantity $s_e$, which we may call a separator field, with each graph node e. By evaluating the median S of the $s_e$, we can color the graph according to whether $s_e$ is greater or less than S. The median is chosen as the division so that the number of nodes in each half is automatically equal; the problem is now reduced to that of choosing the field $s_e$ so that the communication is minimized.
Orthogonal Recursive Bisection
A simple and cheap choice [ Fox:88mm ] for the separator field is based on the position of the finite elements in the mesh. We might let the value of $s_e$ be the x-coordinate of the center of mass of the element, so that the mesh is split in two by a median line parallel to the y-axis. At the next stage, we split each submesh by a median line parallel to the x-axis, alternating between x and y stage by stage, as shown in Figure 11.11 . Another example is shown in Figure 12.13 .
Figure 11.11:
Load Balancing by ORB for Four Processors. The elements (left)
are reduced to points at their centers of mass (middle), then split into
two vertically, then each half split into two horizontally. The result
(right) shows the assignment of elements to processors.
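The sketch below shows the heart of ORB for a power-of-two number of processors: elements are reduced to their centroids, split at the median x-coordinate, then each half is split at the median y-coordinate, and so on recursively. The data layout is hypothetical, and a real implementation would of course perform the splits in parallel.

#include <algorithm>
#include <cstdio>
#include <vector>

struct Element { double cx, cy; int proc; };   // centroid and processor assignment

// Orthogonal recursive bisection: split at the median of one coordinate,
// alternating axes level by level, until nparts (a power of two) is reached.
void orb(std::vector<Element*>& elems, int first_proc, int nparts, int axis) {
    if (nparts == 1) {
        for (Element* e : elems) e->proc = first_proc;
        return;
    }
    std::size_t half = elems.size() / 2;
    std::nth_element(elems.begin(), elems.begin() + half, elems.end(),
                     [axis](const Element* a, const Element* b) {
                         return axis == 0 ? a->cx < b->cx : a->cy < b->cy;
                     });
    std::vector<Element*> lo(elems.begin(), elems.begin() + half);
    std::vector<Element*> hi(elems.begin() + half, elems.end());
    orb(lo, first_proc,              nparts / 2, 1 - axis);
    orb(hi, first_proc + nparts / 2, nparts / 2, 1 - axis);
}

int main() {
    // A 4x4 grid of element centroids split among 4 processors.
    std::vector<Element> mesh;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            mesh.push_back({i + 0.5, j + 0.5, -1});
    std::vector<Element*> ptrs;
    for (Element& e : mesh) ptrs.push_back(&e);
    orb(ptrs, 0, 4, 0);
    for (const Element& e : mesh)
        std::printf("centroid (%.1f, %.1f) -> processor %d\n", e.cx, e.cy, e.proc);
    return 0;
}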
Better but more expensive methods for splitting a graph are based on finding a particular eigenvector of a sparse matrix which has the structure of the adjacency matrix of the graph, and using this eigenvector as a separator field [ Barnes:82a ], [ Boppana:87a ], [ Pothen:89a ].
Neural Net Model
For our discussion of eigenvector bisection, we use the concept of a computational neural net, based on the model of Hopfield and Tank [ Fox:88tt ], [ Hopfield:85b ]. When the graph is to be colored with two colors, these may be conveniently represented by the two states of a neuron, which we conventionally represent by the numbers -1 and +1 . The Hopfield-Tank neural net finds the minimum of a ``computational energy,'' which is a negative-definite quadratic form over a space of variables which may take these values -1 and +1 , and consequently is ideally suited to the two-processor load-balance problem. Rewriting the load balance cost function,
where the are ``neural firing rates,'' which are continuous variables during the computation and tend to 1 as the computation progresses. The first term of this expression is the communication part of the cost function, the second term ensures equal numbers of the two colors if is small enough, and the third term is zero when the are 1, but pushes the away from zero during the computation. The latter is to ensure that H is negative-definite, and the large but arbitrary constant plays no part in the final computation. The firing rate or output of a neuron is related to its activity by a sigmoid function which we may take to be . The constant adjusts the ``gain'' of the neuron as an amplifier. The evolution equations to be solved are then:
where is a time constant for the system and is the degree (number of neighbors) of the graph node e . If the gain is sufficiently low, the stable solution of this set of equations is that all the are zero, and as the gain becomes large, the grow and the tend to either -1 or +1 . The neural approach to load balancing thus consists of slowly increasing the gain from zero while solving this set of coupled nonlinear differential equations.
Let us now linearize this set of equations for small values of , meaning that we neglect the hyperbolic tangent, because for small x , . This linear set of equations may be written in terms of the vector u of all the values and the adjacency matrix A of the graph, whose element is 1 if and only if the distinct graph nodes e and f are connected by an edge of the graph. We may write
where D is a diagonal matrix whose elements are the degrees of the graph nodes, I is the identity matrix, and E is the matrix with 1 in each entry. This linear set of equations may be solved exactly from a knowledge of the eigenvalues and eigenvectors of the symmetric matrix N . If is sufficiently large, all eigenvalues of N are positive, and when is greater than a critical value, the eigenvector of N corresponding to its largest eigenvalue grows exponentially. Of course, when the neuron activities are no longer close to zero, the growth is no longer exponential, but this initial growth determines the form of the emerging solution.
If is sufficiently small, so that balance is strongly enforced, then the eigenspectrum of N is dominated by that of E . The highest eigenvalue of N must be chosen from the space of the lowest eigenvalue of E . The lowest eigenvalue of E is zero, with eigenspace given by those vectors with , which is just the balance condition. We observe that multiples of the identity matrix make no difference to the eigenvectors, and conclude that the dominant eigenvector s satisfies and , where is maximal. The matrix is the Laplacian matrix of the graph [ Pothen:89a ], and is positive semi-definite. The lowest eigenvector of the Laplacian has eigenvalue zero, and is explicitly excluded by the condition . Thus, it is the second eigenvector which we use for load balancing.
In summary, we have set up the load balance problem for two processors as a neural computation problem, producing a set of nonlinear differential equations to be solved. Rather than solve these, we have assumed that the behavior of the final solution is governed by the eigenstate which first emerges at a critical value of the gain. This eigenstate is the second eigenvector of the Laplacian matrix of the graph.
If we split a connected graph in two equal pieces while minimizing the boundary, we expect each half to be a connected subgraph of the original graph. This is not true in all geometries, but is in ``reasonable cases.'' This intuition is supported by a theorem of Fiedler [Fiedler:75a;75b] that when we do the splitting by the second eigenvector of the Laplacian matrix, at least one-half is always connected.
To calculate this second eigenstate, we use the Lanczos method [ Golub:83a ], [ Parlett:80a ], [ Pothen:89a ]. We can explicitly exclude the eigenvector of value zero, because the form of this eigenvector is equal entries for each element of the vector. The accuracy of the Lanczos method increases quickly with the number of Lanczos vectors used. We find that 30 Lanczos vectors are sufficient for splitting a graph of 4000 nodes.
A closely related eigenvector method [ Barnes:82a ], [ Boppana:87a ] is based on the second highest eigenvector of the adjacency matrix of the graph, rather than the second lowest eigenvector of the Laplacian matrix. The advantage of the Laplacian method is in the implementation: The first eigenvector is known exactly (the vector of all equal elements), so that it can be explicitly deflated in the Lanczos method.
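To make the separator-field idea concrete without reproducing a Lanczos code, the sketch below approximates the second eigenvector of the graph Laplacian by power iteration on the shifted matrix sI - L, explicitly deflating the known constant eigenvector at every step. This is a deliberately simple (and far slower) stand-in for the deflated Lanczos method used in the text, but for small graphs it produces the same separator field.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Approximate the Fiedler vector (second eigenvector of the Laplacian L = D - A)
// by power iteration on s*I - L, projecting out the constant eigenvector.
std::vector<double> fiedler(int n, const std::vector<std::pair<int, int>>& edges,
                            int iterations = 2000) {
    std::vector<int> degree(n, 0);
    for (const auto& ef : edges) { ++degree[ef.first]; ++degree[ef.second]; }
    int dmax = 0;
    for (int d : degree) dmax = std::max(dmax, d);
    double s = 2.0 * dmax;                        // upper bound on Laplacian eigenvalues

    std::vector<double> v(n);
    for (int i = 0; i < n; ++i) v[i] = std::sin(1.0 + i);   // arbitrary starting vector
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> w(n);                 // w = (s*I - L) v
        for (int i = 0; i < n; ++i) w[i] = (s - degree[i]) * v[i];
        for (const auto& ef : edges) {
            w[ef.first]  += v[ef.second];
            w[ef.second] += v[ef.first];
        }
        double mean = 0.0, norm = 0.0;
        for (double x : w) mean += x;
        mean /= n;
        for (double& x : w) x -= mean;            // deflate the constant eigenvector
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        for (double& x : w) x /= norm;            // renormalize
        v = w;
    }
    return v;
}

int main() {
    // 6-node path graph: the computed field varies monotonically along the path,
    // so the median split separates the two halves, as one would hope.
    std::vector<std::pair<int, int>> path = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}};
    std::vector<double> sfield = fiedler(6, path);
    for (int i = 0; i < 6; ++i) std::printf("s[%d] = %+.3f\n", i, sfield[i]);
    return 0;
}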
Figure 11.12 (Color Plate) shows eigenvector recursive bisection in action. A triangular mesh surrounding a four-element airfoil has already been split into eight pieces, with the pieces separated by gray lines. Each of these pieces is being split into two, and the plot shows the value of the eigenvector used to make the next split, shown by black lines. The eigenvector values range from large and positive in red through dark and light blue, green, yellow, and back to red. The eight eigenvector calculations are independent and are, of course, done in parallel.
Figure 11.12:
A stage of eigenvector recursive
bisection. A mesh has already been split into eight pieces, which are
separated by gray lines, and the eigenvector is depicted on each of these.
The next split (into sixteen pieces) is shown by the black lines.
Figure 11.13: Solution of the Laplace Equation Used to Test Load-Balancing Methods. The outer boundary has voltage increasing linearly from -1 to +1 in the vertical direction, the light shade is voltage +1, and the dark shade voltage -1.
The splitting is constructed by finding a median value for the eigenvector so that half the triangles have values greater than the median and half lower. The black line is the division between these.
The applications described in Section 11.1.1 have been implemented with DIME (Distributed Irregular Mesh Environment), described in Section 10.1 .
We have tested these three load-balancing methods using the application code ``Laplace'' described in Section 11.1.1 . The problem is to solve Laplace's equation with Dirichlet boundary conditions, in the domain shown in Figure 11.13 . The square outer boundary has voltage linearly increasing vertically from -1 to +1, the lightly shaded S-shaped internal boundary has voltage +1, and the dark shaded hook-shaped internal boundary has voltage -1. Contour lines of the solution are also shown in the figure, at a fixed contour interval.
The test begins with a relatively coarse mesh of 280 elements, all residing in a single processor, with the others having none. The Laplace equation is solved by Jacobi iteration, the mesh is refined based on the solution obtained so far, then is balanced by the method under test. This sequence-solve, refine, balance-is repeated seven times until the final mesh has 5772 elements. The starting and ending meshes are shown in Figure 11.14 .
Figure 11.14:
Initial and Final Meshes for the Load-Balancing Test. The
initial mesh with 280 elements is essentially a uniform meshing of the
square, and the final mesh of 5772 elements is dominated by the highly
refined S-shaped region in the center.
The refinement is solution-adaptive, so that the set of elements to be refined is based on the solution that has been computed so far. The refinement criterion is the magnitude of the gradient of the solution, so that the most heavily refined part of the domain is that between the S-shaped and hook-shaped boundaries where the contour lines are closest together. At each refinement, the criterion is calculated for each element of the mesh, and a value is found such that a given proportion of the elements are to be refined, and those with higher values than this are refined loosely synchronously. For this test of load balancing, we refined 40% of the elements of the mesh at each stage.
This refinement criterion is chosen not so much to improve the accuracy of the solution as to test the load-balancing methods as the mesh distribution changes. The initial mesh is essentially a square covered in mesh of roughly uniform density, and the final mesh is dominated by the long, thin S-shaped region between the internal boundaries, so the mesh changes character from two-dimensional to almost one-dimensional.
We ran this test sequence on 16 nodes of an nCUBE/10 parallel machine, using ORB, ERB, and two runs with SA (SA1 and SA2), which differed by a factor of ten in cooling rate and had different starting temperatures.
The eigenvector recursive bisection used the deflated Lanczos method for diagonalization, with three iterations of 30 Lanczos vectors each to find the second eigenvector. These numbers were chosen so that more iterations and Lanczos vectors produced no significant improvement, and fewer degraded the performance of the algorithm.
The parameters used for the collisional annealing were as follows:
In Figure 11.15 , we show the divisions between processor domains for the three methods at the fifth stage of the refinement, with 2393 elements in the mesh. The figure also shows the divisions for the ORB method at the fourth stage: Note the unfortunate processor division to the left of the S-shaped boundary which is absent at the fifth stage.
Figure 11.15:
Processor Divisions Resulting from the Load-Balancing
Algorithms. Top, ORB at the fourth and fifth stages; lower left, ERB at
the fifth stage; lower right, SA2 at the fifth stage.
We made several measurements of the running code, which can be divided into three categories: machine-independent measurements, machine-dependent measurements, and measurements relevant to dynamic load balancing.
Machine-independent Measurements
These are measurements of the quality of the solution to the graph-partitioning problem which are independent of the particular machine on which the code is run.
Let us define load imbalance to be the difference between the maximum and minimum numbers of elements per processor compared to the average number of elements per processor. More precisely, we should use equations (i.e., work) per processor as, for instance, with Dirichlet boundary conditions, the finite element boundary nodes are inactive and generate no equations [ Chrisochoides:93a ].
The two criteria for measuring communication overhead are the total traffic size , which is the sum over processors of the number of floating-point numbers sent to other processors per iteration of the Laplace solver, and the number of messages , which is the sum over processors of the number of messages used to accomplish this communication.
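These measures, together with the load imbalance, can be computed directly from an element-to-processor assignment and the element adjacency graph. The sketch below is an illustration under our own simplifying assumption that each processor bundles its boundary data into one message per neighboring processor; the names and data layout are not taken from the actual test code.

#include <algorithm>
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

// Machine-independent measures from an assignment proc[e] of elements to processors:
//   - load imbalance: (max - min) elements per processor over the average;
//   - total traffic: values sent per iteration, counted once for each direction
//     of every edge that crosses a processor boundary;
//   - message count: number of distinct (sender, receiver) processor pairs.
void measure(const std::vector<int>& proc,
             const std::vector<std::pair<int, int>>& edges, int nproc) {
    std::vector<long> count(nproc, 0);
    for (int p : proc) ++count[p];
    long maxc = *std::max_element(count.begin(), count.end());
    long minc = *std::min_element(count.begin(), count.end());
    double avg = double(proc.size()) / nproc;

    long traffic = 0;
    std::set<std::pair<int, int>> pairs;          // directed (sender, receiver) pairs
    for (const auto& ef : edges) {
        int p = proc[ef.first], q = proc[ef.second];
        if (p == q) continue;
        traffic += 2;                             // each side sends its value
        pairs.insert({p, q});
        pairs.insert({q, p});
    }
    std::printf("imbalance = %.1f%%, traffic = %ld, messages = %zu\n",
                100.0 * (maxc - minc) / avg, traffic, pairs.size());
}

int main() {
    // 4-element ring on 2 processors, deliberately unbalanced.
    std::vector<std::pair<int, int>> ring = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
    measure({0, 0, 0, 1}, ring, 2);
    return 0;
}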
These results are shown in Figure 11.16 . The load imbalance is significantly poorer for both the SA runs, because the method does not have the exact balance built in as do the RB methods, but instead exchanges load imbalance for reducing the communication part of the cost function. The imbalance for the RB methods comes about from splitting an odd number of elements, which of course cannot be exactly split in two.
Figure 11.16:
Machine-independent Measures of Load-Balancing Performance.
Left, percentage load imbalance; lower left, total amount of
communication; right, total number of messages.
There is a sudden reduction in total traffic size for the ORB method between the fourth and fifth stages of refinement. This is caused by the geometry of the mesh as shown at the top of Figure 11.15 ; at the fourth stage the first vertical bisection is just to the left of the light S-shaped region creating a large amount of unnecessary communication, and for the fifth and subsequent stages the cut fortuitously misses the highly refined part of the mesh.
Machine-dependent Measurements
These are measurements which depend on the particular hardware and message-passing software on which the code is run. The primary measurement is, of course, the time it takes the code to run to completion; this is the sum of the startup time, the load-balancing time, and the product of the number of iterations of the inner loop and the time per iteration. For quasi-static load balancing, we are assuming that the time spent on the basic problem computation is much longer than the load-balance time, so parallel computation time is our primary measurement of load-balancing performance. Rather than use an arbitrary time unit such as seconds for this measurement, we have counted this time per iteration as an equivalent number of floating-point operations (flops). For the nCUBE, the time unit is that of a 64-bit multiply. Thus, we measure flops per iteration of the Jacobi solver.
The secondary measurement is the communication time per iteration, also measured in flops. This is just the local communication in the graph, and does not include the time for the global combine which is necessary to decide if the Laplace solver has reached convergence .
Figure 11.17 shows the timings measured from running the test sequence on the 16-processor nCUBE. For the largest mesh, the difference in running time is about 18% between the cheapest load-balancing method (ORB) and the most expensive (SA2). The ORB method spends up to twice as much time communicating as the others, which is not surprising, since ORB pays little attention to the structure of the graph it is splitting, concentrating only on getting exactly half of the elements on each side of an arbitrary line.
Figure 11.17:
Machine-dependent Measures of Load-Balancing Performance.
Left, running time per Jacobi iteration in units of the time for a
floating-point operation (flop); right, time spent doing local
communication in flops.
The curves on the right of Figure 11.17 show the time spent in local communication at each stage of the test run. It is encouraging to note the similarity with the lower left panel of Figure 11.16 , showing that the time spent communicating is roughly proportional to the total traffic size, confirming the assumption made in Section 11.1.2 .
Measurements for Dynamic Load Balancing
After refinement of the mesh, one of the load-balancing algorithms is run and decisions are reached as to which of a processor's elements are to be sent away, and to which processor they are to be sent. As discussed in Section 10.1 , a significant fraction of the time taken by the load balancer is spent in this migration of elements, since not only must the element and its data be communicated, but space must be allocated in the new processor, other processors must be informed of the new address of the element, and so on. Thus, an important measure of the performance of an algorithm for dynamic (in contrast to quasi-dynamic) load balancing is the number of elements migrated, as a proportion of the total number of elements.
Figure 11.18 shows the percentage of the elements which migrated at each stage of the test run. The method which does best here is ORB, because refinement causes only slight movement of the vertical and horizontal median lines. The SA runs differ because of their different starting temperatures: SA1 started at a temperature low enough that the edges of the domains were just ``warmed up,'' in contrast to SA2, which started at a temperature high enough to completely forget the initial configuration, so that essentially all the elements are moved. The ERB method causes the largest amount of element migration, for two reasons. The first is that some elements are migrated several times, because the load balancing is done in stages for P processors; this is not a fundamental problem, and arises from the particular implementation of the method used here. The second is that a small change in mesh refinement may lead to a large change in the second eigenvector; perhaps a modification of the method could use the distribution of the mesh before refinement to create an inertial term, so that the change in eigenvector as the mesh is refined could be controlled.
Figure 11.18:
Percentage of Elements Migrated During Each Load-Balancing
Stage. The percentage may be greater than 100 because the recursive
bisection methods may cause the same element to be migrated several
times.
The migration time is only part of the time taken to do the load balancing, the other part being that taken to make the decisions about which element goes where. The total times for load balancing during the seven stages of the test run (solving the coloring problem plus the migration time) are shown in the table below:
For the test run, the time per iteration was measured in fractions of a second, and only a few iterations were needed to obtain full convergence of the Laplace equation, so a high-quality load balance is obviously irrelevant for this simple case. The point is that the more sophisticated the algorithm for which the mesh is being used, the greater the time taken in using the distributed mesh compared to the time taken for the load balance. For a sufficiently complex application-for example, unsteady reactive flow simulation-the calculations associated with each element of the mesh may take long enough that a few minutes spent load balancing is completely negligible, so that the quasi-dynamic assumption is justified.
The Laplace solver that we used for the test run embodies the typical operation that is done with finite-element meshes. This operation is matrix-vector multiply. Thus, we are not testing load-balancing strategies just for a Laplace solver but for a general class of applications, namely, those which use matrix-vector multiply as the heart of a scheme which iterates to convergence on a fixed mesh, then refines the mesh and repeats the convergence.
The case of the Laplace solver has a high ratio of communication to calculation, as may be seen from the discussion of Section 11.1.1 , and thus brings out differences in load-balancing algorithms particularly well.
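To make the basic operation concrete, here is a minimal sketch (not the code used for these tests) of a node-based Jacobi sweep for the Laplace equation on an unstructured mesh stored as adjacency lists; the names (phi, neighbors, fixed) and the convergence test are assumptions of the illustration. In the parallel code, each processor would own a subset of the nodes, exchange boundary values before every sweep, and perform the global combine for the convergence check mentioned above.

def jacobi_sweep(phi, neighbors, fixed):
    """One Jacobi iteration: each free node is replaced by the average of
    its neighbors -- in effect a sparse matrix-vector multiply."""
    new_phi = list(phi)
    for i, nbrs in enumerate(neighbors):
        if i not in fixed:                     # keep boundary values fixed
            new_phi[i] = sum(phi[j] for j in nbrs) / len(nbrs)
    return new_phi

def solve_laplace(phi, neighbors, fixed, tol=1e-6, max_iters=10000):
    """Iterate to convergence; the maximum-change test plays the role of
    the global combine needed to decide when the solver has converged."""
    for _ in range(max_iters):
        new_phi = jacobi_sweep(phi, neighbors, fixed)
        if max(abs(a - b) for a, b in zip(new_phi, phi)) < tol:
            return new_phi
        phi = new_phi
    return phi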
Each load-balancing algorithm may be measured by three criteria:
Orthogonal recursive bisection is certainly cheap, both in terms of the time it takes to solve the graph-coloring problem and the number of elements which must be migrated. It is also portable to different applications, the only required information being the dimensionality of the mesh. And it is easy to program. Our tests indicate, however, that more expensive methods can improve performance by over 20%. Because ORB pays no attention to the connectivity of the element graph, one suspects that as the geometry of the underlying domain and solution becomes more complex, this gap will widen.
Simulated annealing is actually a family of methods for solving optimization problems. Even when run sequentially, care must be taken in choosing the correct set of changes that may be made to the state space, and in choosing a temperature schedule to ensure a good optimum. We have tried a ``brute force'' parallelization of simulated annealing, essentially ignoring the parallelism. For sufficiently slow cooling, this method produces the best solution to the load-balancing problem when measured either against the load-balance cost function, or by timings on a real parallel computer. Unfortunately, it takes a long time to produce this high-quality solution, perhaps because some of the numerous input parameters are not set optimally. A more sensitive treatment is probably required to reduce or eliminate parallel collisions [ Baiardi:89a ]. Clearly, further work is required to make SA a portable and efficient parallel load balancer for parallel finite-element meshes. True portability may be difficult to achieve for SA, because the problem being solved is graph coloring, and graphs are extremely diverse; perhaps something approaching an expert system may be required to decide the optimal annealing strategy for a particular graph.
Eigenvalue recursive bisection seems to be a good compromise between the other methods, providing a solution of quality near that of SA at a price little more than that of ORB. The few parameters which must be set are concerned with the Lanczos algorithm for finding the second eigenvector. Mathematical analysis of the ERB method takes place in the familiar territory of linear algebra, in contrast to analysis of SA in the jungles of nonequilibrium thermodynamics. A major point in favor of ERB for balancing finite-element meshes is that the software for load balancing with ERB is shared to a large extent with the body of finite-element software: The heart of the eigenvector calculation is a matrix-vector multiply, which has already been efficiently coded elsewhere in the finite-element library. Recursive spectral bisection [ Barnard:93a ] has been developed as a production load balancer and very successfully applied to a variety of finite-element problems.
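As an illustration of the idea (not the C P implementation), the sketch below performs one spectral bisection step on an element graph given as adjacency lists. A production balancer would use a Lanczos iteration for the second eigenvector; here numpy's dense symmetric eigensolver stands in for brevity, and all names are assumptions of the sketch.

import numpy as np

def spectral_bisect(adjacency):
    """Split the vertices into two equal halves using the eigenvector of
    the graph Laplacian with the second-smallest eigenvalue (the Fiedler
    vector)."""
    n = len(adjacency)
    lap = np.zeros((n, n))
    for i, nbrs in enumerate(adjacency):
        lap[i, i] = len(nbrs)
        for j in nbrs:
            lap[i, j] = -1.0
    eigvals, eigvecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # second eigenvector
    order = np.argsort(fiedler)
    half = n // 2
    return set(order[:half].tolist()), set(order[half:].tolist())

Applying the same step recursively to the two halves (restricted to the corresponding subgraphs) yields a partition onto 2, 4, 8, ... processors, which is the recursive bisection described above.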
The C P research described in this section has been continued by Mansour in Fox's new group at Syracuse [Mansour:91a;92a-e].
He has considered simulated annealing, genetic algorithms, neural networks, and spectral bisection, producing a parallel implementation of each. Further, he introduced a multiscale or graph contraction approach in which large problems to be decomposed are not tackled directly but are first ``clumped'' or contracted to a smaller problem [ Mansour:93b ], [ Ponnusamy:93a ]. The latter can be decomposed using the basic techniques discussed above, and this solution of the small problem is used to initialize a fast refinement algorithm for the original large problem. This strategy has an identical philosophy to the multigrid approach (Section 9.7 ) for partial differential equations. We are currently collaborating with Saltz in integrating these data decomposers into the high-level data-parallel languages reviewed in Section 13.1 .
In [Fox:86a;92c;92h;93a], we point out some interesting features of the physical analogy and energy function introduced in Section 11.1.4 .
Suppose that we are using the simulated annealing method of Section 11.1.4 on a dynamically varying system. Assume that this annealing algorithm is running in parallel in the same machine on which the problem executes. Suppose that we use a (reasonably) optimal annealing strategy. Even in this case, the ``heatbath'' formed by load balancer and operating system can only ``cool'' the problem to a minimum temperature T_min. At this temperature, any further gains from improved decomposition by lowering the temperature will be outweighed by the time taken to perform the annealing. This temperature is independent of the performance of the computer; it is a property of the system being simulated. Thus, we can consider T_min as a new property of a dynamical complex system. High values of T_min imply that the system is rapidly varying; low values, that it is slowly varying.
Now we want to show that decompositions can lead to phase transitions between different states of the physical system defined by the analogy of Section 11.1.2 . In the language of Chapter 3 , we can say that the complex system representing this problem exhibits a phase transition. We illustrate this with a trivial particle dynamics problem shown in Figure 11.19 . Typically, we use for such problems the domain decomposition of Figure 11.19 (a), where each node of the parallel machine contains a single connected region (compare Section 12.4 ). Alternatively, we can use the scattered decomposition-described for matrices in Section 8.1 and illustrated in Figure 11.19 (b). One assigns to each processor several small regions of the space scattered uniformly throughout the domain. Each processor gets ``a piece of the action'' and shares those parts of the domain where the particle density, and hence the computational work, is either large or small. This was explored for partial differential equations in [ Morison:86a ]. The scattered decomposition is a local minimum-there is an optimal size for the scattered blocks of space assigned to each processor. Both in this example and generally, the scattered decomposition is not as good as domain decomposition. This is shown in Figure 11.19 (c), which sketches the energy H as a function of the chosen decomposition. Now, suppose that the particles move in time from t to a later time t', as shown in Figure 11.19 (d). The scattered decomposition minimum is unchanged, but as shown in Figure 11.19 (c),(d), the domain decomposition minimum moves with time.
Figure 11.19:
Particle Dynamics Problem on a Four-node System with Two
Decompositions. (a) Domain, time t; (b) scattered, times t and t';
(c) instantaneous energies; (d) domain decomposition changing from
time t to t'.
Now, one would often be interested not in the instantaneous energy H , but rather in its time average, given by Equation 11.13 .
For this new objective function , the time-averaged energy, the scattered decomposition can be the global minimum, as illustrated in Figure 11.20 . The domain decomposition is smeared with time and so its minimum is raised in value; the value of H at the scattered decomposition minimum is unchanged. We can study the average energy as a function of T_min and of the hardware communication-to-calculation ratio used in Equation 3.10 . As T_min increases or the hardware ratio decreases, we move from the situation of Figure 11.19 (c) to that of Figure 11.20 . In physics language, T_min and the hardware ratio are order parameters which control the phase transition between the two states scattered and domain . Rapidly varying systems (high T_min), rather than those with lower T_min, are the more likely to see the transition as the hardware ratio increases. This agrees with physical intuition, as we now describe. When T_min is small (slowly varying system), domain decomposition is the global minimum, and this switches to a scattered decomposition as T_min increases. In Figure 11.19 (a),(b), we can associate with each particle in the simulation a spin value which indicates the label of the processor to which it is assigned. Then we see the direct analogy to physical spin systems. At high temperatures, we have spin waves (scattered decomposition); at low temperatures, (magnetic) domains (domain decomposition).
Figure 11.20:
The Average Energy of Equation 11.13
We end by noting that in the analogy there is a class of problems which we call microscopically dynamic . These are explored in [ Fox:88f ], [Fox:88kk;88uu]. In this problem class, the fundamental entities (particles in the above analogy) move between nodes of the parallel machine on a microscopic time scale. The previous discussion considered only the adiabatic, loosely synchronous problems, where one can assume that the data elements (particles in the analogy) can be treated as fixed in a particular processor at each time instant. We will not give a general discussion here, but rather just illustrate the ideas with one example-the global sum calculation written in Fortran as
      DO 1 I=1, LIMIT1
      A(I)=0
      DO 1 J=1, LIMIT2
 1    A(I)=A(I) + B(I,J)
This is illustrated in Figure 11.21 (Color Plate) for the case LIMIT1=4 decomposed onto a four-node machine. The value of LIMIT1 is important for performance considerations but irrelevant for the discussion here. The optimal scheduling of communication and calculation is tricky and is discussed as the fold algorithm in [ Fox:88a ]. The four tasks of calculating the four A(I) cannot be viewed as particles fixed in particular nodes, since they move from node to node, and we cannot represent this movement in the formalism used up to now. Rather, we now represent the tasks by ``space-time'' strings or world lines, and one replaces Equation 11.9 by a Hamiltonian which describes interacting strings rather than interacting particles. This can be applied to event-driven simulations, message routing, and other microscopically dynamic problems. The strings need to be draped over the space-time grid formed by the complex computer as it evolves in time. Figure 11.21 (Color Plate) shows this compact ``draping'' for the fold algorithm.
Figure 11.21:
The Fold Algorithm. Four global sums
interleaved optimally on four processors.
We have successfully applied similar ideas to multivehicle and multiarm robot path planning and routing problems [ Chiu:88f ], [Fox:90e;90k;92c], [ Gandhi:90a ]. Comparison of the vehicle navigation in Figure 11.22 (Color Plate) with the computational routing problem in Figure 11.21 (Color Plate) illustrates the analogy.
Figure 11.22:
Two- and four-vehicle navigation
problems. In each case, vehicles have been given initial and final target
positions. The black squares are impassable and define a narrow pass.
Physical optimization methods [Fox:88ii;90e] were used to find solutions.
C P maintained a significant activity in optimization. There were several reasons for this, one of which was, of course, natural curiosity. Another was the importance of load balancing and data decomposition, which is, as discussed previously in this chapter, ``just'' an optimization problem. Again, we already mentioned in Section 6.1 our interest in neural networks as a naturally parallel approach to artificial intelligence. Section 9.9 and Section 11.1 have shown how neural networks can be used in a range of optimization problems. Load balancing has the important (optimization) characteristic of NP-completeness, which is believed to imply that an exponential time is needed to solve it exactly. Thus, we studied the travelling salesman problem (TSP), which is well known to be NP-complete and formally equivalent to other problems with this property. One important contribution of C P was the work of Simic [Simic:90a;91a].
Simic derived the relationship between the neural network [ Hopfield:86a ] and elastic net [Durbin:87a;89a], [Rose:90f;91a;93a], [ Yuille:90a ] approaches to the TSP. This work has been extensively reviewed [Fox:91j;92c;92h;92i] and we will not go into the details here. A key concept is that of physical optimization , which implies the use of a physics approach of minimizing the energy, that is, finding the ground state of a complex system set up as a physical analogy to the optimization problem. This idea is illustrated clearly by the discussion in Section 11.1.3 and Section 11.2 . One can understand some reasons why a physics analogy could be useful from two possible plots of the objective function to be minimized against the possible configurations, that is, against the values of the parameters to be determined. Physical systems tend to look like Figure 11.1 (a), where correlated (i.e., local) minima are ``near'' global minima. We usually do not get the very irregular landscape shown in Figure 11.1 (b). In fact, we do find the latter case with the so-called random field Ising model, and here conventional physics methods perform poorly [ Marinari:92a ], [ Guagnelli:92a ]. Ken Rose showed how these ideas could be generalized to a wide class of optimization problems as a concept called deterministic annealing [ Rose:90f ], [ Stolorz:92a ]. Annealing is illustrated in Figure 11.23 (Color Plate). One uses temperature to smooth out the objective function (energy function) so that at high temperature one can find the (smeared) global minimum without getting trapped in spurious local minima. The temperature is decreased skillfully, the search at each new, lower temperature being initialized by the solution found at the previous, higher temperature. This annealing can be applied either statistically [ Kirkpatrick:83a ], as in Sections 11.1 and 11.3 , or with a deterministic iteration. Neural and elastic networks can be viewed as examples of deterministic annealing. Rose generalized these ideas to clustering [Rose:90a;90c;91a;93a];
vector quantization used in coding [ Miller:92b ], [ Rose:92a ]; tracking [Rose:89b;90b]; and electronic packing [ Rose:92b ]. Deterministic annealing has also been used for robot path planning with many degrees of freedom [ Fox:90k ], [ Gandhi:90b ] (see also Figure 11.22 (Color Plate)), character recognition [ Hinton:92a ], scheduling problems [Gislen:89a;91a], [ Hertz:92a ], [ Johnston:92a ], and quadratic assignment [ Simic:91a ].
Figure 11.23:
Annealing tracks global minima by
initializing search at one temperature by minima found at other temperatures .
Neural networks have been shown to perform poorly in practice on the TSP [ Wilson:88a ], but we found them excellent for the formally equivalent load-balancing problem in Section 11.1 . This is now understood from the fact that the simple neural networks used in the TSP [ Hopfield:86a ] used many redundant neural variables, and the difficulties reported in [ Wilson:88a ] can be traced to the role of the constraints that remove redundant variables. The neural network approach summarized in Section 11.1.6 uses a parameterization that has no redundancy and so it is not surprising that it works well. The elastic network can be viewed as a neural network with some constraints satisfied exactly [ Simic:90a ]. This can also be understood by generalizing the conventional binary neurons to multistate or Potts variables [Peterson:89b;90a;93a].
Moscato developed several novel ways of combining simulated annealing with genetic algorithms [Moscato:89a;89c;89d;89e] and showed the power and flexibility of these methods.
The Travelling Salesman Problem (TSP) is probably the most well-known member of the wider field of combinatorial optimization (CO) problems. These are difficult optimization problems where the set of feasible solutions (trial solutions which satisfy the constraints of the problem but are not necessarily optimal) is a finite, though usually very large, set. The number of feasible solutions grows as some combinatoric factor such as N!, where N characterizes the size of the problem. We have already commented on the use of neural networks for the TSP in the previous section. Here we show how to combine problem-specific heuristics with simulated annealing, a physical optimization method.
It has often been the case that progress on the TSP has led to progress on many CO problems and more general optimization problems. In this way, the TSP is a playground for the study of CO problems. Though the present work concentrates on the TSP, a number of our ideas are general and apply to all optimization problems.
The most significant issues occur as one tries to find extremely good or exact solutions to the TSP. Many algorithms exist which are fast and find feasible solutions which are within a few percent of the optimum length. Here, we present algorithms which will usually find exact solutions to substantial instances of the TSP. We are limited by space considerations to a brief presentation of the method-more details may be found in [ Martin:91a ].
In a general instance of the TSP, one is given N ``cities'' and a matrix whose entry d(i,j) gives the distance or cost for going from city i to city j . Without loss of generality, the distances can be assumed to be positive. A ``tour'' consists of a list of N cities, c_1, c_2, ..., c_N, where each city appears once and only once. In the TSP, the problem is to find the tour with the minimum ``length,'' where the length is defined to be the sum of the lengths along each step of the tour,

$$L \;=\; \sum_{i=1}^{N} d(c_i, c_{i+1}),$$

and c_{N+1} is identified with c_1 to make it periodic.
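As an illustration (not part of the original development), the tour length can be computed directly from this definition; the Python sketch below uses names chosen only for the illustration and is reused by the later sketches in this section.

import math

def tour_length(tour, dist):
    """tour: list of city indices c_1 ... c_N; dist: N x N distance matrix."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))

def euclidean_distances(cities):
    """Distance matrix for cities given as (x, y) coordinates."""
    return [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in cities] for a in cities]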
Most common instances of the TSP have a symmetric distance matrix; we will hereafter focus on this case. All CO problems can be formulated as optimizing an objective function (e.g., the length) subject to constraints (e.g., legal tours).
In a local search method, one first defines a neighborhood topology on the set of all tours. For instance, one might define the neighborhood of a tour T to be all those tours which can be obtained by changing at most k edges of T. A tour is said to be locally opt if no tour in its neighborhood is shorter than it. One can search for locally opt tours by starting with a random tour and performing k-changes on it as long as the tour length decreases. In this way, one constructs a sequence of tours T_1, T_2, T_3, and so on. Eventually the process stops and one has reached a locally opt tour. Lin [ Lin:65a ] studied the cases k=2 and k=3, and showed that one could get quite good tours quickly. Furthermore, since in general there are quite a few locally opt tours, in order to find the globally optimal tour, he suggested repeating this process from random starts many times until one was confident all the locally opt tours had been found. Unfortunately, the number of locally opt tours rises exponentially with N, the number of cities. Thus, in general, it is more efficient to use a more sophisticated local opt (say, higher k) than to repeat the search from random starts many times. The current state-of-the-art optimization heuristic is an algorithm due to Lin and Kernighan [ Lin:73a ]. It is a variable-depth k-neighborhood search, and it is the benchmark against which all heuristics are tested. Since it is significantly better than three-opt, for any instance of the TSP there are many fewer L - K -opt tours than there are three-opt tours. This postpones the problem of doing exponentially many random starts until one reaches N on the order of a few hundred. For still larger N, the number of L - K -opt tours itself becomes unmanageable. Given that one really does want to tackle these larger problems, there are two natural ways to go. First, one can try to extend the neighborhood which L - K considers, just as L - K extended the neighborhood of three-changes. Second, one expects that instead of sampling the local opt tours in a random way, as is done by applying the local searches from random starts many times, it might be possible to obtain local opt tours in a more efficient way, say via a sampling with a bias in favor of the shorter tours. We will see that this gives rise to an algorithm which indeed enables one to solve much larger instances.
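The simplest case, k=2, can be sketched as follows (an illustrative Python version, not the code used for the results quoted later); it repeatedly applies improving two-changes until the tour is locally opt, reusing the tour representation of the earlier sketch.

def two_opt(tour, dist):
    """Apply improving 2-changes (edge exchanges) until the tour is
    locally opt with respect to the 2-change neighborhood."""
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue                 # the two edges share a city
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[(j + 1) % n]
                # replace edges (a,b) and (c,d) by (a,c) and (b,d)
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d] - 1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour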
Given that any local search method will stop in one of the many local opt solutions, it may be useful to find a way for the iteration to escape by temporarily allowing the tour length to increase. This leads to the popular method of ``simulated annealing'' [ Kirkpatrick:83a ].
One starts by constructing a sequence of tours T_1, T_2, and so on. At each step of this chain, one does a k-change (moves to a neighboring tour). If this decreases the tour length, the change is accepted; if the tour length increases, the change is rejected with some probability, in which case one simply keeps the old tour at that step. Such a stochastic construction of a sequence of tours is called a Markov chain . It can be viewed as a rather straightforward extension of the above local search to include ``noisiness'' in the search for shorter tours. Because increases in the tour length are possible, this chain never reaches a fixed point. For many such Markov chains, it is possible to show that, given enough time, the chain will visit every possible tour T, and that for very long chains the tours appear with a calculable probability distribution. Such Markov chains are closely inspired by physical models, where the chain construction procedure is called a Monte Carlo. The stochastic accept/reject part is supposed to simulate a random fluctuation due to temperature effects, and the temperature is a parameter which measures the bias towards short tours. If one wants to get to the globally optimal tour, one has to move the temperature down towards zero, corresponding to a strong bias in favor of short tours. Thus, one makes the temperature vary with time; the way this is done is called the annealing schedule, and the result is simulated annealing.
If the temperature is taken to zero too fast, the effect is essentially the same as setting the temperature to zero exactly, and then the chain just traps at a local opt tour forever. There are theoretical results on how slowly the annealing has to be done to be sure that one reaches the globally optimum solution, but in practice the running times are astronomical. Nevertheless, simulated annealing is a standard and widely used approach for many minimization problems. For the TSP, it is significantly slower than Lin-Kernighan, but it has the advantage that one can run for long times and slowly improve the quality of the solutions. See, for instance, the studies Johnson et al. [ Johnson:91a ] have done. The advantage is due to the improved sampling of the short length tours: Simulated annealing is able to ignore the tours which are not near the minimum length. An intuitive way to think about it is that for a long run, simulated annealing is able to try to improve an already very good tour, one which probably has many links in common with the exact optimum. The standard Lin-Kernighan algorithm, by contrast, continually restarts from scratch, throwing away possibly useful information.
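A small-step simulated annealing sketch along these lines uses random two-changes, the Metropolis accept/reject rule, and a simple geometric annealing schedule; the schedule, parameter names, and default values are assumptions of this illustration, and tour_length is the helper sketched earlier.

import math, random

def simulated_annealing(tour, dist, t_start=1.0, t_end=0.01, alpha=0.99, steps_per_t=200):
    n = len(tour)
    length = tour_length(tour, dist)
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            i, j = sorted(random.sample(range(n), 2))
            if i == 0 and j == n - 1:
                continue                     # edges share a city; no valid 2-change
            a, b = tour[i], tour[i + 1]
            c, d = tour[j], tour[(j + 1) % n]
            delta = dist[a][c] + dist[b][d] - dist[a][b] - dist[c][d]
            # downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-delta / t), the Metropolis rule
            if delta < 0 or random.random() < math.exp(-delta / t):
                tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                length += delta
        t *= alpha                           # geometric cooling schedule
    return tour, length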
Simulated annealing does not take advantage of the local opt heuristics. This means that instead of sampling local opt tours as does L - K repeated from random starts, the chain samples all tours. It would be a great advantage to be able to restrict the sampling of a Markov chain to the local opt tours only. Then the bias which the Markov chain provides would enable one to sample the shortest local opt tours more efficiently than local opt repeated from random starts. This is what our new algorithm does.
To do this, one has to find a way to go from one local opt tour, T, to another, T', and this is the heart of our procedure. We propose to do a change on T, which we call a ``kick.'' This can be a random p-change, for instance, but we will choose something smarter than that. Follow this kick by the local opt tour improvement heuristic until reaching a new local opt tour T'. Then accept or reject T' depending on the increase or decrease in tour length compared to T. This is illustrated in Figure 11.24 . Since there are many changes in going from T to T', we call this method a ``Large-Step Markov Chain.'' It can also be called ``Iterated Local Opt,'' but it should be realized that it is precisely finding a way to iterate which is the difficulty! The algorithm is far better than the small-step Markov chain methods (conventional simulated annealing) because the accept/reject procedure is not implemented on the intermediate tours, which are almost always of longer length. Instead, the accept/reject does not happen until the system has returned to a local minimum. The method directly steps from one local minimum to another. It is thus much easier to escape from local minima.
Figure 11.24:
Schematic Representation of the Objective Function and of the
Tour Modification Procedure Used in the Large-step Markov Chain
At this point, let us mention that this method is no longer a true simulated annealing algorithm. That is, the algorithm does NOT correspond to the simulation of any physical system undergoing annealing. The reason is that a certain symmetry property, termed ``detailed balance'' in the physics community, is not satisfied by the large-step algorithm. [ Martin:91a ] says a bit more about this. One consequence of this is that the parameter ``temperature'' which one anneals with no longer plays the role of a true, physical temperature-instead it is merely a parameter which controls the bias towards the optimum. The lack of a physical analogy may be the reason that this algorithm has not been tried before, even though much more exotic algorithms (such as appealing to quantum mechanical analogies!) have been proposed.
We have found that in practice, this methodology provides an efficient sampling of the local opt tours. There are a number of criteria which need to be met for the biased sampling of the Markov chain to be more efficient than plain random sampling. These conditions are satisfied for the TSP, and more generally whenever local search heuristics are useful. Let us stress before proceeding to specifics that this large-step Markov chain approach is extremely general, being applicable to any optimization problem where one has local search heuristics. It enables one to get a performance which is at least as good as local search, with substantial improvements over that if the sampling can be biased effectively. Finally, although the method is general, it can be adapted to match the problem of interest through the choice of the kick. We will now discuss how to choose the kick for the TSP.
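In outline, the large-step Markov chain can be sketched as below. This is an illustrative skeleton under the same assumptions as the earlier sketches: kick is any tour perturbation (such as the double-bridge move of the next paragraph), local_opt is a local search such as two_opt, and temperature = 0 gives the zero-temperature variant discussed later.

import math, random

def large_step_markov_chain(tour, dist, kick, local_opt, n_steps=1000, temperature=0.0):
    """Kick the current local optimum, re-optimize with the local search,
    then accept or reject the new local optimum."""
    current = local_opt(list(tour), dist)
    cur_len = tour_length(current, dist)
    best, best_len = list(current), cur_len
    for _ in range(n_steps):
        trial = local_opt(kick(list(current)), dist)
        trial_len = tour_length(trial, dist)
        delta = trial_len - cur_len
        if delta < 0 or (temperature > 0 and random.random() < math.exp(-delta / temperature)):
            current, cur_len = trial, trial_len
            if cur_len < best_len:
                best, best_len = list(current), cur_len
    return best, best_len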
Consider, for instance, the case where the local search is three-opt. If we used a kick consisting of a three-change, the three-opt would very often simply bring us back to the previous tour with no gain. Thus, it is probably a good idea to go to a four-change for the kick when the local search is three-opt. For more general local search algorithms, a good choice for the kick would be a k-change which does not occur in the local search. Surprisingly, it turns out that two-opt, three-opt, and especially L - K are structured so that there is one kick choice which is natural for all of them. To see this, it is useful to go back to the paper by Lin and Kernighan. In that paper, they define ``sequential'' changes and they also show that if the tour is to be improved, one can force all the partial gains during the k-change to be positive. A consequence of this is that the checkout for sequential k-changes can be completed in O(N^2) steps. It is easy to see that all two and three changes are sequential, and that the first nonsequential change occurs at k=4 (Figure 2 of their paper). We call this graph a ``double-bridge'' change because of what it does to the tour. It can be constructed by first doing a two-change which disconnects the tour; the second two-change must then reconnect the two parts, thereby creating a bridge. Note that both of the two-changes are bridges in their own way, and that the double-bridge change is the only nonsequential four-change which cannot be obtained by composing changes which are both sequential and leave the tour connected. If we included this double-bridge change in the definition of the neighborhood for a local search, checkout time would require O(N^4) steps (a factor N for each bridge essentially). Rather than doing this change as part of the local search, we include such changes stochastically as our kick. The double-bridge kick is the most natural choice for any local search method which considers only sequential changes. Because L - K does so many changes for k greater than three, but misses double-bridges, one can expect that most of what remains in excess length using L - K might be removed with our extension. The results below indicate that this is the case.
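The double-bridge kick itself is a short operation; the sketch below (again illustrative, assuming at least four cities) cuts the tour into four segments A|B|C|D and reconnects them as A|C|B|D, a move which cannot be produced by the sequential changes discussed above.

import random

def double_bridge(tour):
    """Cut the tour into four segments A, B, C, D and reconnect as A, C, B, D."""
    n = len(tour)                            # assumes n >= 4
    i, j, k = sorted(random.sample(range(1, n), 3))
    return tour[:i] + tour[j:k] + tour[i:j] + tour[k:]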
At first we implemented the Large-Step Markov Chain for the three-opt local search. We checked that we could solve to optimality problems of sizes up to 200 by comparing with a branch and bound program. For N=100 , the optimum was found in a few minutes on a SUN-3, while for N=200 an hour or two was required. For larger instances, we used problems which had been solved to optimality by other people. We ran our program on the Lin-318 instance solved to optimality by Padberg and Crowder. Our iterated three-opt found the optimal tour on each of five separate runs, with an average time of less than 20 hours on the SUN-3. We also ran on the AT&T-532 instance problem solved to optimality by Padberg and Rinaldi. By using a postreduction method inspired by tricks explained in the Lin-Kernighan paper, the program finds the optimum solution in 100 hours. It is of interest to ask what is the expected excess tour length for very large problems using our method with a reasonable amount of time. We have run on large instances of cities randomly distributed in the unit square. Ordinary three-opt gives an average length 3.6% above the Held-Karp bound, whereas the iterated three-opt does better than L - K (which is 2.2% above): it leads to an average of less than 2.0% above H - K . Thus we see that without much more algorithmic complexity, one can improve three-opt by more than 1.6%.
In [ Martin:91a ], we suggested that such a dramatic improvement should also carry over to the L - K local opt algorithm. Since then, we have implemented a version of L - K and have run it on the instances mentioned above. Johnson [ Johnson:90b ] and also Cook, Applegate, Chvatal [ Cook:90b ] have similarly investigated the improvement of iterated L - K over repeated L - K . It is now clear that the iterated L - K is a big improvement. Iterated L - K is able to find the solution to the Lin-318 instance in minutes, and the solution to the AT&T-532 problem in an hour. At a recent TSP workshop [ TSP:90a ], a 783-city instance constructed by Pulleyblank was solved to optimality by ourselves, Johnson, and Cook et al., all using the large-step method.
For large instances (randomly distributed cities), Johnson finds that iterated L - K leads to an average excess length of 0.84% above the Held-Karp bound. Previously it was expected that the exact optimum was somewhere above 1% from the Held-Karp bound, but iterated L - K disproves this conjecture.
One of the most exciting results of the experiments which have been performed to date is this: For ``moderate''-sized problems (such as the AT&T-532 or the 783 instance mentioned above), no careful ``annealing'' seems to be necessary. It is observed that just setting the temperature to zero (no uphill moves at all!) gives an algorithm which can often find the exact optimum. The implication is that, for the large-step Markov chain algorithm, the effective energy landscape has only one (or few) local minima! Almost all of the previous local minima have been modified to saddle points by the extended neighborhood structure of the algorithm.
Steve Otto had the original idea for the large-step Markov chain. Olivier Martin has made (and continues to make) many improvements towards developing new, fast local search heuristics. Steve Otto and Edward Felten have developed the programs, and are working on a parallel implementation.
This chapter contains some of the hardest applications we developed within C P at Caltech. The problems are still ``just'' data-parallel, with the natural ``massive'' (i.e., large-scale, directly proportional to the number of data elements or problem size) loosely synchronous parallelism summarized in Figure 12.1 . However, the irregularity of the problem-both static and dynamic-makes the implementation technically hard. Interestingly, after this hard work, we do find very good speedups; that is, this problem class has as much parallelism as the simpler synchronous problems of Chapters 4 and 6 . In fact, one finds that it is in this class of problems that parallel machines most clearly outperform traditional vector supercomputers [Fox:89i;89n;90o]. The (dynamic) irregularity makes the parallelism harder to expose, but it does not remove it; however, the irregularity of a problem can make it impossible to get good performance on (SIMD) vector processors.
The problems contained in this chapter are also typical of the hardest challenges for parallelizing compilers. These applications are not easy to write in a high-level language, such as High Performance Fortran of Chapter 13 , in a way that compilers can efficiently extract the parallelism. This area is one of major research activity with interesting contributions from the groups at Yale [ Bhatt:92a ] and Stanford [ Singh:92a ] for the N-body problem described in Section 12.4 .
The applications in this chapter can be summarized as follows:
We suggest that Chapters 12 , 14 , and 18 contain some of those applications which should be studied by computer scientists developing new software tools and parallel languages. This is where the application programmer needs help! We have separated off Chapter 14 , as the violation of the loose synchronization condition in this chapter produces different complications from the dynamic irregularity that characterizes the applications of Chapter 12 . Chapter 18 contains compound metaproblems combining all types of problem structure.
All animals are faced with the computationally intense task of continuously acquiring and analyzing sensory data from their environment. To ensure maximally useful data, animals appear to use a variety of motor strategies or behaviors to optimally position their sensory apparatus. In all higher animals, there are likely to be neural structures, processing both sensory and motor information, which can coordinate this exploratory behavior for the sake of sensory acquisition.
To study this feedback loop, we have chosen the weakly electric fish, which use a unique electrically based means of exploring their environment [ Bullock:86a ], [ Lissman:58a ]. These nocturnal fish, found in the murky waters of the Congo and Amazon, have developed electrosensory systems to allow them to detect objects without relying on vision. In fact, in some species this electric sense appears to be their primary sensory modality.
This sensory system relies on an electric organ which generates a weak electric field surrounding the fish's body that in turn is detected by specialized electroreceptor cells in the fish's skin. The presence of animate or inanimate objects in the local environment causes distortions of this electric field, which are interpreted by the fish. The simplicity of the sensory signal, in addition to the distributed external representation of the detecting apparatus, makes the electric fish an excellent animal through which to study the involvement in sensory discrimination of the motor system in general, and body position in particular.
Simulations in two dimensions [ Bacher:83a ], [ Heiligenberg:75a ] and measurements with actual fish have shown that body position, especially the tail angle, significantly alters the fields near the fish's skin.
To study quantitatively how the fish's behavior affects the ``electric images'' of objects, we are developing three-dimensional computer simulations of the electric fields that the fish generate and detect. These simulations, when calibrated with the measured fields, should allow us to identify and focus on behaviors that are most relevant to the fish's sensory acquisition tasks, and to predict the electrical consequences of the behavior of the fish with higher spatial resolution than possible in the tank.
Being able to visualize the electric fields, in false color on a simulated fish's body as it swims, may provide a new level of understanding of how these curious animals sense and respond to their world. For this simulation, we have chosen the fish Gnathonemus petersii .
We need to reduce the great complexity of a biological organism to a manageable physical model. The ingredients of this model are the fish body, shown in Figure 12.2 , the object that the fish is sensing, and the water exterior to both the fish and the object.
Figure 12.2:
Side and Top Views of the Fish, and Internal Potential Model
The real fish has some projecting fins, and our first approximation is to neglect these because their electrical properties are essentially the same as those of water.
We will assume that the fish is exploring a small conductive object, such as a small metal sphere. First, we reduce the geometrical aspect of the object to being pointlike, yet retaining some relevant electrical properties. Except when the object is another electric fish, we expect it to have no active electrical properties, but only to be an induced dipole .
We now come to the modelling of the fish body itself. This consists of a skin with electroreceptor cells which can detect potential differences, and a rather complex internal structure. We shall assume that the source voltage is maintained at the interface between the internal structure and the skin, so that we need not be concerned with the details of the internal structure. Thus, the fish body is modelled as two parts: an internal part with a given voltage distribution on its surface, and a surrounding skin with variable conductivity.
The upshot of this model is that we need to solve Laplace's equation in the water surrounding the fish, with an induced dipole at the position of the object the fish is investigating, and with a mixed or Cauchy boundary condition at the surface of the fish body.
The boundary element method [ Brebbia:83a ], [ Cruse:75a ] has been used for many applications where it is necessary to solve a linear elliptic partial differential equation. Because of the linearity of the underlying differential equation, there is a Green's function expressing the solution throughout the three-dimensional domain in terms of the behavior at the boundaries, so that the problem may be transformed into an integral equation on the boundary.
The discrete approximation to this integral equation results in the solution of a full set of simultaneous linear equations, one equation for each node of the boundary mesh; the conventional finite-difference method would result in solving a sparse set of equations, one for each node of a mesh filling the space. Let us compare these methods in terms of efficiency and software cost.
To implement the finite-difference method, we would first make a mesh filling the domain of the problem (i.e., a three-dimensional mesh), then for each mesh point set up a linear equation relating its field value to that of its neighbors. We would then need to solve a set of sparse linear equations. In the case of an exterior problem such as ours, we would need to pay special attention to the farfield, making sure the mesh extends out far enough and that the proper approximation is made at this outer boundary.
With the boundary element method, we discretize only the surface of the domain, and again solve a set of linear equations, except that now they are no longer sparse. The far field is no longer a problem, since this is taken care of analytically.
If it is possible to make a regular grid surrounding the domain of interest, then the finite-difference method is probably more efficient, since multigrid methods or alternating direction methods will be faster than the solution of a full matrix. It is with complex geometries, however, that the boundary element method can be faster and more efficient on sequential or distributed-memory machines. It is much easier to produce a mesh covering a curved two-dimensional manifold than a three-dimensional mesh filling the space exterior to the manifold. If the manifold is changing from step to step, the two-dimensional mesh need only be distorted, whereas a three-dimensional mesh must be completely remade, or at least strongly smoothed, to prevent tangling. If the three-dimensional mesh is not regular, the user faces the not inconsiderable challenge of explicit load balancing and communication at the processor boundaries.
Figure 12.3 shows a view of four of the model fish in some rather unlikely positions, with natural shading.
Figure 12.3:
Four Fish with Simple Shading
Figure 12.4 shows a side view of the fish with the free field (no object) shown in gray scale, and we can see how the potential ramp at the skin-body interface has been smoothed out by the resistivity of the skin. Figure 12.5 shows the computed potential contours for the midplane around the fish body, showing the dipole field emanating from the electric organ in the tail.
Figure 12.4:
Potential Distribution on the Surface of the Fish, with No
External Object
Figure 12.5:
Potential Contours on the Midplane of the Fish, Showing Dipole
Distribution from the Tail
Figure 12.6 (Color Plate) shows the difference field (voltage at the skin with and without the object) for three object positions, near the tail (top), at the center (middle) and near the head (bottom). It can be seen that this difference field, which is the sensory input for the fish, is greatest when the object is close to the head. A better view of the difference voltage is shown in Figure 12.7 , which shows the envelope of the difference voltage on the midline of the fish, for various object positions. Again, we see that the maximum sensory input occurs when the object is close to the head of the fish, rather than at the tail, where the electric organ is.
Figure 12.6:
Potentials on the surface of the electric
fish model as a conducting object moves from head (left) to tail (right) of
the fish, keeping 3cm from the midline (above the paper).
Figure 12.7:
Envelope of Voltage Differences Along Midline of the Fish, for
20 Object Positions, Each
Above Mid-plane
The BEM algorithm was written as a DIME application [Williams:90b;90c] by Roy Williams, Brian Rasnow of the Biology Division, and Chris Assad of the Engineering Division.
Unstructured meshes have been widely used for calculations with conventional sequential machines. Jameson [Jameson:86a;86b] uses explicit finite-element-based schemes on fully unstructured tetrahedral meshes to solve for the flow around a complete aircraft, and other workers [ Dannenhoffer:89a ], [ Holmes:86a ], [Lohner:84a;85a;86a] have used unstructured triangular meshes. Jameson and others [Jameson:87a;87b], [ Mavriplis:88a ], [ Perez:86a ], [ Usab:83a ] have used multigrid methods to accelerate convergence . For this work [ Williams:89b ], we have used the two-dimensional explicit algorithm of Jameson and Mavriplis [ Mavriplis:88a ].
An explicit update procedure is local, and hence well matched to a massively parallel distributed machine, whereas an implicit algorithm is more difficult to parallelize. The implicit step consists of solving a sparse set of linear equations, where matrix elements are nonzero only for mesh-connected nodes. Matrix multiplication is easy to parallelize since it is also a local operation, and the solve may thus be accomplished by an iterative technique such as conjugate gradient, which consists of repeated matrix multiplications. If, however, the same solve is to be done repeatedly for the same mesh, the most efficient (sequential) method is to first decompose the matrix in some way, resulting in fill-in. In terms of the mesh, this fill-in represents nonlocal connections between nodes: indeed, if the matrix were completely filled, the communication time would be proportional to N^2 for N nodes.
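For reference, here is a minimal conjugate-gradient sketch of the kind alluded to above, written so that the matrix appears only through a matrix-vector product. The names and the numpy dependence are assumptions of the illustration; on a distributed mesh, matvec would be the local stencil application plus a boundary exchange, and the dot products would become global combines.

import numpy as np

def conjugate_gradient(matvec, b, tol=1e-8, max_iters=1000):
    """Solve A x = b for symmetric positive-definite A, given only a
    routine matvec(x) that returns the product A x."""
    b = np.asarray(b, dtype=float)
    x = np.zeros(len(b))
    r = b - matvec(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:            # converged
            break
        p = r + (rs_new / rs_old) * p        # new search direction
        rs_old = rs_new
    return x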
The governing equations are the Euler equations, which are of advective type with no diffusion,

$$\frac{\partial \mathbf{U}}{\partial t} + \nabla \cdot \mathbf{F}(\mathbf{U}) \;=\; 0,$$
where U is a vector containing the information about the fluid at a point. I have used bold symbols to indicate an information vector, or a set of fields describing the state of the fluid. In this implementation, U consists of density, velocity, and specific total energy (or, equivalently, pressure); it could also include other information about the state of the fluid such as chemical mixture or ionization data. F is the flux vector and has the same structure as U in each of the two coordinate directions.
The numerical algorithm is explained in detail in [ Mavriplis:88a ], so only an outline is given here. The method uses linear triangular elements to approximate the field. First, a time step is chosen for each node which is constrained by a local Courant condition. The calculation consists of two parts, an advection step and a dissipation step.
The time stepping is done with a five-stage Runge-Kutta scheme, where the advection step is done five times, and the dissipation step is done twice. Since advection takes one communication stage and dissipation two, each full time step requires nine loosely synchronous communication stages.
After the initial transients have dispersed and the flow has settled, the mesh may be refined. The criterion used for deciding which elements are to be refined is based on the gradient of the pressure. The user specifies a percentage of elements which are to be refined, and a criterion value e is calculated for each element. A threshold e_0 is then found such that the given percentage of elements have a value of e greater than e_0, and those elements are refined. The criterion is not simply the gradient of the pressure, because the strongest shock in the simulation would soak up all the refinement, leaving weaker shocks unresolved. With the element area included in the criterion, regions will ``saturate'' after sufficient refinement, allowing weaker shocks to be refined.
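The percentile selection itself is simple; the sketch below (illustrative names, with the criterion values assumed to have been computed already from the pressure gradient and element area) returns the indices of the elements to refine.

def elements_to_refine(criterion, percentage):
    """Return indices of the top `percentage` percent of elements,
    i.e., those whose criterion value exceeds the implied threshold."""
    n_refine = int(len(criterion) * percentage / 100.0)
    ranked = sorted(range(len(criterion)), key=lambda i: criterion[i], reverse=True)
    return ranked[:n_refine]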
Figures 10.8 and 10.9 (Color Plates) show the pressure and computational mesh resulting from Mach 0.8 flow over a NACA0012 airfoil at 1.25 degrees angle of attack, computed with a 32-processor nCUBE machine. This problem is that used by the AGARD working group [ AGARD:83a ] in their benchmarking of compressible flow algorithms. The mesh has 5135 elements after four stages of adaptive refinement. Each processor has about the same number of elements. In the pressure plot is also shown the sonic line; the plot agrees well with the AGARD data.
Note the shock about 2/3 of the way downstream from the leading edge, and the corresponding increase in mesh density there.
Figure 12.8 (Color Plate) shows pressure in a wind-tunnel with a step. A Mach 3 stream comes in from the left, with a detached bow-shock upstream from the step. A second shock is attached by a Mach stem to the bow shock, which is then reflected from the walls of the wind tunnel.
Figure 12.8:
Pressure and mesh for a Mach 3 wind
tunnel with a step. The red lines in the mesh separate processor domains.
The mesh has been dynamically adapted and load-balanced with orthogonal
recursive bisection.
Notice how the mesh density is much greater in the neighborhood of the shocks and at the step where the pressure gradient is high. This computation was performed on 32 processors of a Symult machine.
The efficiency of any parallel algorithm increases as the computational load dominates the communication load [ Williams:90a ]. In the case of a domain-decomposed mesh, the computational time depends on the number of elements per processor, and the communication time on the number of nodes at the boundary of the processor domain. If there are N elements in total, distributed among n processors, we expect the computation to go as N/n and the communication as the square root of this, so that the efficiency should approach unity as the square root of N/n grows.
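Written out, and with c an illustrative constant lumping together the hardware communication-to-calculation ratio and the surface-to-volume geometry of the processor domains, the argument is:

$$\varepsilon \;=\; \frac{T_{\rm calc}}{T_{\rm calc} + T_{\rm comm}}
\;\approx\; \frac{1}{1 + c\,\sqrt{n/N}} \;\longrightarrow\; 1
\qquad \mbox{as } N/n \to \infty .$$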
We have run the example described above starting with a mesh of 525 elements, and refining 50% of the elements. In fact, more than 50% will be refined because of the nature of the refinement algorithm: in practice, it is about 70%. The refinement continues until the memory of the machine runs out.
Figure 12.9 shows timing results. At top right are results for 1, 4, 16, 64, and 256 nCUBE processors. The time taken per simulation time step is shown for the compressible flow algorithm against the number of elements in the simulation. The curves end when the processor memory is full. Each processor has a nominal amount of memory but, when all the software and communication buffers are accounted for, only part of this is available for the mesh.
The top left of Figure 12.9 shows the same curves for 1, 4, 16, 64, and 128 Symult processors, and at bottom left the results for 1, 4, 16, and 32 processors of a Meiko CS-1 computing surface. For comparison, the bottom right shows the results for one head of the CRAY Y-MP, and also for the Sun Sparcstation.
Figure 12.9:
Timings for Transonic Flow
Each figure has diagonal lines to guide the eye; these are lines of constant time per element. We expect the curves for the sequential machines to be parallel to these because the code is completely local and the time should be proportional to the number of elements. For the parallel machines, we expect the discrepancy from parallel lines to indicate the importance of communication inefficiency.
The transonic flow algorithm was written as a DIME application by Roy Williams, using the algorithm of A. Jameson of Princeton University and D. Mavriplis of NASA ICASE [Williams:89b;90a;90b].
Continuous physical systems must generally be ``discretized'' prior to analysis with a digital computer. In practice, there are relatively few ways to discretize a physical system. Finite-element and finite-difference approximations are useful for dealing with partial differential equations in a small number of dimensions (up to three). If the dimensionality of the independent variable space is large, however, discretization by finite difference or finite elements becomes unwieldy. For example, the collisionless Boltzmann equation,

$$\frac{\partial f}{\partial t} + \vec v \cdot \nabla_{\vec x} f - \nabla_{\vec x}\Phi \cdot \nabla_{\vec v} f \;=\; 0,$$

is expressed as a partial differential equation in six independent variables. A fairly modest discretization of the domain with 100 ``elements'' in each dimension would result in a system with 100^6 = 10^12 elements. A simulation of this size is out of the question on computers which will be available in the foreseeable future.
Fortunately, another means of discretization is available. Particle Simulation (or N-body simulation) is discussed at length by Hockney and Eastwood [ Hockney:81a ]. It is appropriate for systems like the collisionless and collisional Boltzmann equations, and hence it is applicable to a number of outstanding problems in astrophysics , where the basic physical processes are governed by Newtonian gravity and the Boltzmann equation [ Binney:87a ].
In such simulations, the phase-space density, f , is represented by a swarm of ``particles'', or ``bodies'' which evolve in time according to the dynamics of Newtonian gravity:

$$\vec F_i \;=\; \sum_{j \ne i} \frac{G\, m_i\, m_j\,(\vec r_j - \vec r_i)}{|\vec r_j - \vec r_i|^3},
\qquad
m_i \frac{d^2 \vec r_i}{dt^2} \;=\; \vec F_i, \qquad i = 1,\ldots,N.$$
The 3N second-order, ordinary differential equations may be integrated in time by a large number of methods, ranging from the very simple (Euler's method) to the very complex [ Aarseth:85a ]. The difficulty with using Equation 12.4 is that a straightforward implementation of the right-hand sides of these equations requires O(N^2) operations. Each of N accelerations is the vector sum of N-1 components, each of which requires a handful of floating-point operations (including at least one square-root). Even if one utilizes Newton's third law, one can cut the total number of operations by half, but the asymptotic behavior remains unchanged. N-body simulations using direct summation are practical up to a few tens of thousands of bodies on modern supercomputers. Even the teraflop performance promised by parallel computation would only increase this by an order of magnitude or so. Substantially larger simulations require alternative methods for evaluating the forces. The fact that gravity is ``long-range'' makes rapid evaluation of the forces problematical. It is not acceptable to simply disregard all bodies beyond a certain fixed cutoff, because the contribution of distant bodies does not decrease fast enough to balance the fact that the number of bodies at a given distance is an increasing function of distance.
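For comparison with the tree-based methods described next, here is a minimal sketch of the direct O(N^2) summation, using the pairwise symmetry of Newton's third law to halve the work; the softening parameter eps and all names are assumptions of the illustration.

import math

G = 6.674e-11   # gravitational constant in SI units

def accelerations(pos, mass, eps=0.0):
    """pos: list of 3-vectors; mass: list of scalars.  Returns the
    acceleration of each body due to all the others."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3   # pull of j on i
                acc[j][k] -= G * mass[i] * dx[k] * inv_r3   # equal and opposite
    return acc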
Recent algorithmic advances [ Appel:85a ], [ Barnes:86a ], [ Greengard:87b ], [ Jernighan:89a ], however, have shown that while it is not acceptable to disregard distant collections of bodies, it is possible to accurately approximate their contribution without summing all of the individual components. It has been known since the time of Newton that the effect of the earth on an apple may be computed by replacing the countless individual atoms in the earth with a single point-mass located at the earth's center. The force calculation is then:

$$\vec a_i \;\approx\; \frac{G\, M_{\rm cell}\,(\vec r_{\rm cm} - \vec r_i)}{|\vec r_{\rm cm} - \vec r_i|^3},$$

where M_cell and r_cm are the total mass and center of mass of the distant collection of bodies.
There are a number of ways to utilize this fact in a computer simulation [ Appel:85a ], [ Barnes:86a ], [ Greengard:87b ], [ Zhao:87a ]. The methods differ in choice of data structure, level of mathematical rigor, and complexity of the fundamental interactions. We shall consider an adaptive tree data structure, and an algorithm that treats each body independently. The algorithm begins by partitioning space into an oct-tree, that is, a tree whose nodes correspond to cubical regions of space. Each node may have up to eight daughter nodes, corresponding to the eight subcubes that are obtained by dividing in half in each Cartesian direction. The tree is defined by the following properties:
Figure 12.10:
(a) Expanded and (b) Flat Representation of an Adaptive Tree
Figure 12.11:
10,000 Body Barnes-Hut Tree
The oct-tree provides a convenient data structure which allows us to record the properties of the matter distribution on all length scales. It is especially convenient for astrophysical systems because it is adaptive. That is, the depth of the tree adjusts itself automatically to the local particle density. In order to use an approximation like Equation 12.5, we need to know certain properties of the matter distribution in each cell. In the simplest case, these properties are the mass and center-of-mass of the matter distribution, but it is possible to use quadrupole moments [ Hernquist:87a ], or higher-order moments [ Salmon:90a ] for added accuracy. All of these properties may be computed by a bottom-up traversal of the tree, combining the properties of the ``daughters'' of a node to get the properties of the node itself. The time required for this bottom-up traversal is proportional to the number of internal nodes in the tree, that is, $O(N)$.
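A sketch of that bottom-up pass, assuming an illustrative node layout (the field names are not taken from the original code): each internal node's mass and center of mass are accumulated from its daughters after the daughters themselves have been processed.

#include <stddef.h>

/* Illustrative oct-tree node: a leaf holds a single body (its mass and
 * position already stored in mass and cm); an internal node holds up
 * to eight daughters. */
struct cell {
    struct cell *child[8];   /* NULL where the subcube is empty */
    double mass;             /* total mass of bodies in this cube */
    double cm[3];            /* center of mass of those bodies */
    int    is_leaf;
};

/* Post-order (bottom-up) traversal: process the daughters first, then
 * combine their moments to obtain the moments of the node itself. */
void compute_moments(struct cell *c)
{
    if (c == NULL || c->is_leaf)
        return;                        /* a leaf already holds its body */

    c->mass = 0.0;
    c->cm[0] = c->cm[1] = c->cm[2] = 0.0;
    for (int k = 0; k < 8; k++) {
        struct cell *d = c->child[k];
        if (d == NULL)
            continue;
        compute_moments(d);
        c->mass += d->mass;
        for (int j = 0; j < 3; j++)
            c->cm[j] += d->mass * d->cm[j];
    }
    if (c->mass > 0.0)
        for (int j = 0; j < 3; j++)
            c->cm[j] /= c->mass;       /* mass-weighted average */
}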
Once the distribution of matter is represented on a number of length scales, it is possible to use the approximation in Equation 12.5 to reduce the number of operations required to find the force on a body. The force on each body is computed independently by a recursive procedure that traverses the tree from the top down. Beginning at the root of the tree, we simply apply a multipole acceptability criterion (MAC). This tells us whether Equation 12.5 (or an appropriate higher-order approximation) is sufficiently accurate. If it is, then we evaluate the approximation, and eliminate the summation over all the bodies contained within the node. Otherwise, we proceed recursively to the eight daughter cells of the node. Whenever we reach a terminal node, we simply compute the body-body interaction. The procedure is shown schematically in Figure 12.12 .
Figure 12.12:
The Barnes-Hut Algorithm
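A sketch of the top-down walk of Figure 12.12, using a simple size-over-distance criterion with threshold theta (the node layout, the monopole-only approximation, and theta itself are assumptions; the actual MAC choices are discussed below):

#include <math.h>
#include <stddef.h>

struct node {
    struct node *child[8];   /* NULL where the subcube is empty */
    double mass, cm[3];      /* monopole moments of the cube */
    double size;             /* edge length of the cube */
    int    is_leaf;          /* leaf: a single body located at cm */
};

/* Monopole contribution of (mass, cm) to the acceleration at x. */
static void add_monopole(const double x[3], double mass,
                         const double cm[3], double G, double a[3])
{
    double d[3] = { cm[0] - x[0], cm[1] - x[1], cm[2] - x[2] };
    double r2 = d[0]*d[0] + d[1]*d[1] + d[2]*d[2];
    if (r2 == 0.0)
        return;                            /* skip self-interaction */
    double rinv3 = 1.0 / (r2 * sqrt(r2));
    for (int k = 0; k < 3; k++)
        a[k] += G * mass * d[k] * rinv3;
}

/* Recursive top-down walk: accept the cell if size/distance is small
 * enough (the MAC succeeds), otherwise open it and descend to its
 * daughters; terminal nodes are simple body-body interactions. */
void tree_force(const struct node *c, const double x[3],
                double theta, double G, double a[3])
{
    if (c == NULL)
        return;
    double d[3] = { c->cm[0] - x[0], c->cm[1] - x[1], c->cm[2] - x[2] };
    double dist = sqrt(d[0]*d[0] + d[1]*d[1] + d[2]*d[2]);

    if (c->is_leaf || (dist > 0.0 && c->size / dist < theta))
        add_monopole(x, c->mass, c->cm, G, a);        /* MAC accepted */
    else
        for (int k = 0; k < 8; k++)                   /* MAC failed   */
            tree_force(c->child[k], x, theta, G, a);
}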
The performance of the algorithm depends on how we evaluate the MAC. For example, one could always answer ``no,'' in which case the performance would be identical to the $O(N^2)$ case (although the bookkeeping overhead would be somewhat higher, and we would not take advantage of Newton's third law). The specifics of how best to evaluate the MAC would take us too far afield; see [ Barnes:89d ], [ Makino:90a ], and [ Salmon:92a ].
Suffice it to say that all methods are based on the idea that the multipole approximation is accurate when the distance to the cell is large compared to the size of the cell. Essentially any criterion based on a ratio of size-of-cell to distance-to-cell will require $O(\log N)$ force evaluations to compute the total force on each body [ Barnes:86a ], [ Salmon:90a ]. Since the forces on all bodies are evaluated independently, the total number of evaluations is proportional to $N \log N$, which is a substantial improvement over the $O(N^2)$ behavior that results from a naive evaluation of Equation 12.3.
Computational science advances both in hardware and algorithms. Occasionally, algorithmic advances are of such tremendous significance that they completely overshadow the striking advances constantly being made by hardware. Tree codes are just such an algorithmic advance. It is literally true that a tree code running on a modest workstation can address larger problems than can the fastest parallel supercomputer running an $O(N^2)$ algorithm. It is well known [ Fox:84e ], [ Fox:88a ] that parallel computers can efficiently perform the $O(N^2)$ force evaluations required by direct summation. However, this fact is of limited significance now that a new class of algorithms has changed the underlying complexity of the problem. If parallel computers are to have an impact on the N-body problem, then they must be able to efficiently execute tree codes.
Parallelization of tree codes is a challenging problem. Typical astrophysical simulations are highly inhomogeneous. Spatial densities can vary by many orders of magnitude through the computational domain. The tree must be adaptive to deal with such a large dynamic range in densities; that is, it must be deep in regions of high particle density, and shallow in regions of low particle density. Furthermore, the structure of the inhomogeneities is often dynamic; for example, galaxies form, move, collide, and merge in cosmological simulations. A fixed tree and/or a fixed decomposition is not suitable for such a system. Despite these problems, it is possible to find parallelism in tree codes and to run them efficiently on large parallel computers [ Fox:89t ], [ Salmon:90a ], [Warren:92a;93a].
The technique of ``domain decomposition'' has been applied with excellent results to a number of other problem areas. We have found that a slightly abstracted concept of domain decomposition is also applicable to tree codes. Recall that a domain decomposition usually proceeds by ``assigning'' spatial domains to processors. In designing a parallel program, the precise meaning of ``assign'' is crucial. We adopt the following ``owner-computes'' definition of a domain: A domain is a rectangular region of simulation space. Assignment of a domain to a processor implies that the processor will be responsible for updating the positions and velocities of all particles located within that region of simulation space. We allow that processor domains might change from one time step to the next, based, presumably, on load-balancing considerations.
Processor domains are chosen using orthogonal recursive bisection, or ORB (see Section 11.1.5 ). Recall that ORB tries repeatedly to split some measure of the ``load'' in half, and assign the halves to sets of processors. In the present context, that means finding a coordinate so that half of the computational ``load'' is associated with particles above the split, and half is associated with particles below the split. The result of applying orthogonal recursive bisection to a system containing two ``galaxies,'' (well-separated regions with high local particle density) is shown in Figure 12.13 .
Figure 12.13:
Decomposition Resulting from Orthogonal Recursive
Bisection of a System with Two Galaxies
It is a simple matter to record the ``load'' associated with each particle. For example, one can count interactions, or one could simply read the clock before and after the force on the particle is computed. Then, in order to find the splitting coordinate, one simply executes a binary (or more sophisticated) search, seeking a value of the coordinate for which half of the per-particle work is above and half is below.
In fact, seeking the exact median coordinate of the per-particle work does not necessarily guarantee load balance. It guarantees load balance within the force calculation, but it does not account for load imbalance that may result during construction of the tree, or during the other phases of the computation. It is possible to account for these sources of load imbalance by seeking a coordinate which is not precisely at the median (i.e., the 50th percentile), but rather at another percentile. The new target percentile is found by measuring the actual load imbalance, and adjusting the target by a small amount on each time step to reduce the observed load imbalance [ Salmon:90a ].
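A sketch of the search for one splitting coordinate (the per-particle work array, the coordinate bounds, and the fixed iteration count are assumptions): a bisection on the coordinate finds the value at which a chosen fraction of the measured work, not necessarily one half, lies below the split.

#include <stddef.h>

/* Find a splitting coordinate s such that approximately a fraction
 * `target` (0.5 for the median, or an adjusted percentile) of the
 * total per-particle work lies at coordinates below s.  Simple
 * bisection on the coordinate interval [lo, hi]. */
double orb_split(size_t n, const double coord[], const double work[],
                 double lo, double hi, double target, int iters)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += work[i];

    for (int it = 0; it < iters; it++) {
        double mid = 0.5 * (lo + hi);
        double below = 0.0;
        for (size_t i = 0; i < n; i++)
            if (coord[i] < mid)
                below += work[i];
        if (below < target * total)
            lo = mid;        /* too little work below: move split up */
        else
            hi = mid;        /* too much work below: move split down */
    }
    return 0.5 * (lo + hi);
}

In the parallel setting, the sum of work below the trial split would be a global reduction over the processors on both sides of the bisector.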
Many parallel algorithms conform to a pattern of activity that can loosely be described as: decompose the problem domain, acquire the data each processor needs, and then compute locally.
We have already discussed decomposition, and described the use of orthogonal recursive bisection to determine processor domains. The next step is the acquisition of ``locally essential data,'' that is, the data that will be needed to compute the forces on the bodies in a local domain. In other applications one finds that the locally essential data associated with a domain is itself local. That is, it comes from a limited region surrounding the processor domain. In the case of hierarchical N-body simulations, however, the locally essential data is not restricted to a particular region of space. Nevertheless, the hierarchical nature of the algorithm guarantees that if a processor's domain is spatially limited, then any particle within that domain will not require detailed information about the particle distribution in distant regions of space. This idea is illustrated in Figure 12.14, which shows the parts of the tree that are required to compute forces on bodies in the grey region. Clearly, the locally essential data for a limited domain is much smaller than the total data set (shown in Figure 12.11). In fact, when the grain size of the domain is large, that is, when the number of bodies in the domain is large, the size of the locally essential data set is only a modest constant factor larger than the local data set itself [ Salmon:90a ]. This means that the work (both communication and additional computation) required to obtain and assemble the locally essential data set is proportional to the grain size, that is, is $O(N/P)$ for N bodies on P processors. In contrast, the work required to compute the forces in parallel is $O((N/P)\log N)$. The ``big-O'' notation can hide large constants which dominate practical considerations. Typical astrophysical simulations perform 200 to 500 interactions per body [ Hernquist:87a ], [ Warren:92a ], and each interaction costs from 30 to 60 floating-point operations. Thus, there is reason to be optimistic that assembly of the locally essential data set will not be prohibitively expensive.
Figure 12.14:
The Locally Essential Data Needed to Compute Forces in a
Processor Domain, Located in the Lower Left Corner of the System
Determining, in parallel, which data is locally essential for which processors is a formidable task. Two facts allow us to organize the communication of data into a regular pattern that guarantees that each processor receives precisely the locally essential data which it needs.
The procedure by which processors go from having only local data to having all locally essential data consists of a loop over each of the bisections in the ORB tree. To initialize the iteration, each processor builds a tree from its local data. Then, for each bisector, it traverses its tree, applying the DMAC at each node, using the complementary domain as an argument, that is, asking whether the given cell contains an approximation that is sufficient for all bodies in the domain on the other side of the current ORB bisector. If the DMAC succeeds, the cell is needed on the other side of the domain, so it is copied to a buffer and queued for transmission. Traversal of the current branch can stop at this point because no additional information within the current branch of the local tree can possibly be necessary on the other side of the bisector. If the DMAC fails, traversal continues to deeper levels of the tree. This procedure is shown schematically in code in Table 12.1.
Table 12.1: Outline of BuildLETree, which constructs a locally essential representation of a tree.
Figure 12.15 shows schematically how some data might travel around a 16-processor system during execution of the above code.
The second tree traversal in the above code conserves a processor's memory by reclaiming data which was transmitted through the processor, but which is not needed by the processor itself, or any other member of its current subset. In Figure 12.15 , the body sent from processor 0110 through 1110 and 1010 to 1011 would likely be deleted from processor 1110's tree during the pruning on channel 2, and from 1010's tree during the pruning on channel 0.
Figure 12.15: Data Flow in a 16-Processor System. Arrows indicate the flow of data and are numbered with a decreasing ``channel'' number corresponding to the bisector being traversed.
The Code requires the existence of a DMAC function. Obviously, the DMAC depends on the details of the MAC which will eventually be used to traverse the tree to evaluate forces. Notice, however, that the DMAC must be evaluated before the entire contents of a cell are available in a particular processor. (This happens whenever the cell itself extends outside of the processor's domain). Thus, the DMAC must rely on purely geometric criteria (the size and location of the cell), and cannot depend on, for example, the exact location of the center-of-mass of the cell. The DMAC is allowed, however, to err on the side of caution. That is, it is allowed to return a negative result about a cell even though subsequent data may reveal that the cell is indeed acceptable. The penalty for such ``false negatives'' is degraded performance, as they cause data to be unnecessarily communicated and assembled into locally essential data sets.
Figure 12.16: The Distance Used by the DMAC Is Computed by Finding the Shortest Distance Between the Processor Domain and the Boundary of the Cell.
Because the DMAC must work with considerably less information than the MAC, it is somewhat easier to categorically describe its behavior. Figure 12.16 shows schematically how the DMAC is implemented. Recall that the MAC is based on a ``distance-to-size'' ratio. The distance used by the DMAC is the shortest distance from the cell to the processor domain. The ``min-distance'' MAC [Salmon:90a;92a] uses precisely this distance to decide whether a multipole approximation is acceptable. Thus, in a sense, the min-distance MAC is best suited to parallelization because it is equivalent to its own DMAC, so the DMAC introduces no unnecessary ``false negative'' decisions. Fortunately, the min-distance MAC also resolves certain difficulties associated with more commonly used MACs, and is arguably the best of the ``simple'' MACs [ Salmon:92a ].
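A sketch of the purely geometric test (rectangular cells and domains, and a size-over-distance threshold theta, are assumptions): the distance used is the shortest distance between the processor domain and the cell, accumulated one Cartesian direction at a time.

#include <math.h>

/* Axis-aligned rectangular region: lower and upper corners. */
struct region {
    double lo[3], hi[3];
};

/* Shortest distance between two axis-aligned regions (zero if they
 * overlap), accumulated one Cartesian direction at a time. */
double region_distance(const struct region *a, const struct region *b)
{
    double d2 = 0.0;
    for (int k = 0; k < 3; k++) {
        double gap = 0.0;
        if (a->hi[k] < b->lo[k]) gap = b->lo[k] - a->hi[k];
        if (b->hi[k] < a->lo[k]) gap = a->lo[k] - b->hi[k];
        d2 += gap * gap;
    }
    return sqrt(d2);
}

/* DMAC: may the cell's multipole approximation be used for every body
 * in the domain?  Purely geometric -- it uses only the cell's size and
 * location, never its (possibly still incomplete) contents. */
int dmac(const struct region *cell, double cell_size,
         const struct region *domain, double theta)
{
    double d = region_distance(cell, domain);
    return d > 0.0 && cell_size / d < theta;
}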
It is possible to generate a huge amount of data related to parallel performance. One can vary the size of the problem, and/or the number of processors. Performance can be related to various problem parameters, for example, the nonuniformity of the particle distribution. Parallel overheads can be identified and attributed to communication, load imbalance, synchronization, or additional calculation in the parallel code [ Salmon:90a ]. All these provide useful diagnostics and can be used to predict performance on a variety of machines. However, they also tend to obscure the fact that the ultimate goal of parallel computation is to perform simulations larger or faster than would otherwise be possible.
Rather than analyze a large number of statistics, we restrict ourselves to the following ``bald'' facts.
In 1992, the 512-processor Intel Delta at Caltech evolved two astrophysical simulations with 17.15 million bodies for approximately 600 time steps. The machine ran at an aggregate speed exceeding 5000 Mflops. The systems under study were simulated regions of the universe 100 Mpc (megaparsec) and 25 Mpc in diameter, which were initialized with random density fluctuations consistent with the ``cold dark matter'' hypothesis and the recent results on the anisotropy of the microwave background radiation. The data from these runs exceeded 25 Gbytes, and is analyzed in [ Zurek:93a ]. Salmon and Warren were recipients of the 1992 Gordon Bell Prize for performance in practical parallel processing research.
Vortex methods are a powerful tool for the simulation of incompressible flows at high Reynolds number. They rely on a discrete Lagrangian representation of the vorticity field to approximately satisfy the Kelvin and Helmholtz theorems which govern the dynamics of vorticity for inviscid flows. A time-splitting technique can be used to include viscous effects. The diffusion equation is considered separately after convecting the particles with an inviscid vortex method. In our work, the viscous effects are represented by the so-called deterministic method. The approach was extended to problems where a flux of vorticity is used to enforce the no-slip boundary condition.
In order to accurately compute the viscous transport of vorticity, gradients need to be well resolved. As the Reynolds number is increased, these gradients get steeper and more particles are required to achieve the requisite resolution. In practice, the computing cost associated with the convection step dictates the number of vortex particles and puts an upper bound on the Reynolds number that can be simulated with confidence. That threshold can be increased by reducing the asymptotic time complexity of the convection step from $O(N^2)$ to $O(N \log N)$. The near field of every vortex particle is identified. Within that region, the velocity is computed by considering the pairwise interaction of vortices. The speedup is achieved by approximating the influence of the rest of the domain, the far field. In that context, the interaction of two vortex particles is treated differently depending on their spatial relation. The resulting computer code does not lend itself to vectorization but has been successfully implemented on concurrent computers.
Vortex methods (see [ Leonard:80a ]) are used to simulate incompressible flows at high Reynolds number. The two-dimensional inviscid vorticity equation,

$$\frac{\partial \omega}{\partial t} + (\mathbf{u}\cdot\nabla)\,\omega = 0,$$
is solved by discretizing the vorticity field into Lagrangian vortex particles,

$$\omega(\mathbf{x},t) \;\approx\; \sum_{i=1}^{N} \Gamma_i\, \delta\bigl(\mathbf{x} - \mathbf{x}_i(t)\bigr),$$
where $\Gamma_i$ is the strength, or circulation, of the $i$th particle. For an incompressible flow, the knowledge of the vorticity is sufficient to reconstruct the velocity field. Using complex notation, the induced velocity is given by

$$u - i\,v \;=\; \frac{1}{2\pi i}\sum_{j=1}^{N} \frac{\Gamma_j}{z - z_j(t)}.$$
The velocity is evaluated at each particle location and the discrete Lagrangian elements are simply advected at the local fluid velocity. In this way, the numerical scheme approximately satisfies the Kelvin and Helmholtz theorems that govern the motion of vortex lines. The numerical approximations have transformed the original partial differential equation into a set of 2N ordinary differential equations, an N-body problem. This class of problems is encountered in many fields of computational physics, for example, molecular dynamics, gravitational interactions, plasma physics and, of course, vortex dynamics.
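A direct-summation sketch of the velocity evaluation in complex notation (the smoothing length delta is an assumption; practical vortex codes replace the singular point-vortex kernel with a smoothed one). This is the $O(N^2)$ computation that the fast algorithm described next is designed to replace.

#include <complex.h>
#include <stddef.h>

/* Velocity induced at each vortex by all the others, using the complex
 * form of the Biot-Savart law for point vortices:
 *     u - i v = (1 / (2 pi i)) * sum_j  Gamma_j / (z - z_j).
 * A small smoothing length delta regularizes close encounters. */
void induced_velocity(size_t n, const double complex z[],
                      const double circ[], double delta,
                      double complex vel[])
{
    const double two_pi = 6.283185307179586;
    for (size_t i = 0; i < n; i++) {
        double complex sum = 0.0;
        for (size_t j = 0; j < n; j++) {
            if (j == i)
                continue;
            double complex dz = z[i] - z[j];
            /* smoothed kernel: Gamma_j * conj(dz) / (|dz|^2 + delta^2) */
            sum += circ[j] * conj(dz) / (dz * conj(dz) + delta * delta);
        }
        /* sum approximates sum_j Gamma_j / (z_i - z_j); dividing by
         * 2*pi*i gives u - i v, and conjugating gives u + i v. */
        vel[i] = conj(sum / (two_pi * I));
    }
}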
When each pairwise interaction is considered, distant and nearby pairs of vortices are treated with the same care. As a result, a disproportionate amount of time is spent computing the influence of distant vortices that have little influence on the velocity of a given particle. This is not to say that the far field is to be totally ignored since the accumulation of small contributions can have a significant effect. The key element in making the velocity evaluation faster is to approximate the influence of the far field by considering groups of vortices instead of the individual vortices themselves. When the collective influence of a distant group of vortices is to be evaluated, the very accurate representation of the group provided by its vortices can be overlooked and a cruder description that retains only its most important features can be used. These would be the group location, circulation, and, possibly, some coarse approximation of its shape and vorticity distribution.
A convenient approximate representation is based on multipole expansions. It would be possible to build a fast algorithm by evaluating the multipole expansion at the location of particles that do not belong to the group. This is basically the scheme used by Barnes and Hut [ Barnes:86a ] (the concurrent implementation of this algorithm is discussed in Section 12.4 ). Greengard and Rokhlin [ Greengard:87b ] went a step further by proposing group-to-group interactions. In this case, the multipole expansion is transformed into a Taylor series around the center of the second group, where the influence of the first one is sought. The expansions provide an accurate representation of the velocity field when the distance between the groups is large compared to their radii.
One now needs a data structure that is going to facilitate the search for acceptable approximations. As proposed by Appel [ Appel:85a ], a binary tree is used. In that framework, a giant cluster sits on top of the data structure; it includes all the vortex particles. It stores all the information relevant to the group, that is, its location, its radius, and the coefficients of the multipole expansion. In addition, it carries the address of its two children, each of them responsible for approximately half of the vortices of the parent group. Whenever smaller groups are sought, these pointers are used to rapidly access the relevant information. The children carry the description of their own group of vortices and are themselves pointing at two smaller groups, their own children, the grandchildren of the patriarchal group. More subgroups are created by equally dividing the vortices of the parent groups along the ``x'' and ``y'' axes alternately. This splitting process stops when all groups contain only a small number of vortices. Then, instead of pointing toward two smaller groups, the parent node points toward a list of vortices. This data structure provides a quick way to access groups, from the largest to the smallest ones, and ultimately to the individual vortices themselves. Appel's data structure is Lagrangian since it is built on top of the vortices and moves with them. As a result, it can be used for many time steps.
When the speed of this algorithm is compared with that of the classical $O(N^2)$ approach, the crossover occurs for as few as 150 vortices. At this point, the extra cost of maintaining the data structure is balanced by the savings associated with the approximate treatment of the far field. When N is increased further, the savings outweigh the extra bookkeeping and the proposed algorithm is faster than its competitor by a margin that increases with the number of vortices.
The global nature of the classical approach has made its parallel implementation fairly straightforward (see [ Fox:88a ]). However, as we have already seen in Section 12.4, that character was drastically changed by the fast algorithm as it introduced a strong component of locality. Globality is still present since the influence of a particle is felt throughout the domain, but more care and computational effort is given to its near field. The fast parallel algorithm has to reflect that dual nature, otherwise an efficient implementation will never be obtained. Moreover, the domain decomposition can no longer ignore the spatial distribution of the vortices. Nearby vortices are strongly coupled computationally, so it makes sense to assign them to the same processor. Binary bisection is used in the host to spatially decompose the domain. Then, only the vortices are sent to the processors, where a binary tree is locally built on top of them. For example, Figure 12.17 shows the portion of the data structure assigned to processor 1 in a four-processor environment.
In a fast algorithm context, sending a copy of the local data structure to half the other processors does not necessarily result in a load-balanced implementation. The work associated with processor-to-processor interactions now depends on their respective locations in physical space. A processor whose vortices are located at the center of the domain is involved in more costly interactions than a peripheral processor. To achieve the best possible load balancing, that central processor could send a copy of its data to more than half of the other processors and hence itself be responsible for a smaller fraction of the work associated with its vortices.
Before deciding which processor of a pair is going to send its data and which is going to receive, we minimize the number of pairs of processors that need to exchange their data structures. Following the domain decomposition, the portion of the data structure that sits above the subtrees is not present anywhere in the hypercube. That gap is filled by broadcasting the description of the largest group of every processor. By limiting the broadcast to one group per processor, only a small amount of data is actually exchanged but, as seen in Figure 12.18, this step gives every processor a coarse description of its surroundings and helps it find its place in the universe.
Figure 12.17:
Data Structure Assigned to Processor 1
Figure 12.18:
Data Structure Known to Processor 1 After Broadcast
If the vortices of processor A are far enough from those of processor B , it is even possible to use that coarse description to compute the interaction of A and B without an additional exchange of information. The far field of every processor can be quickly disposed of. After thinking globally, one now has to act locally; if the vortices of A are adjacent to those of B , a more detailed description of their vorticity field is needed to compute their mutual influence.
This requires a transfer of information from either A to B or from B to A . In the latter case, most of the work involved in the A-B interaction takes place in processor A . Obviously, processor B should not always send its information away since it would then remain idle while the rest of the hypercube is working. Load-balancing concerns will dictate the flow of information.
Since our objective is to compute the flow around a cylinder, the efficiency of the parallel implementation was tested on such a problem. The region surrounding the cylinder is uniformly covered with N particles. The parallel efficiency is shown in Figure 12.19 as a function of the hypercube size. The parallel implementation is fairly robust: The parallel efficiency remains high over the range of machine sizes tested. The number of vortices per processor was kept roughly constant at 1500, even though the parallel efficiency is not a strong function of the size of the problem. It is, however, much more sensitive to the quality of the domain decomposition. The fast parallel algorithm performs better when all the subdomains have approximately the same squarish shape or, in other words, when the largest group assigned to a processor is as compact as possible.
Figure 12.19:
Parallel Efficiency of the Fast Algorithm
The results of Figure 12.19 were obtained at early times when the Lagrangian particles are still distributed evenly around the cylinder, which makes the domain decomposition an easier task. At later times, the distribution of the vortices does not allow the decomposition of the domain into groups having approximately the same radius and the same number of vortices. Some subdomains cover a larger region of space and, as a result, the efficiency drops to approximately 0.6. This is mainly due to the fact that more processors end up in the near field of a processor responsible for a large group; the request lists are longer and more data has to be moved between processors.
The sources of overhead corresponding to Figure 12.19 are shown in Figure 12.20, normalized by the useful work. Load imbalance, the largest overhead contributor, is defined as the difference between the maximum useful work reported by a processor and the average useful work per processor. Further, the extra work includes the time spent making a copy of one's own data structure, the time required to absorb the returning information, and the work that was duplicated in all processors, namely, the search for acceptable interactions in the upper portion of the tree and the subsequent creation of the request lists. The remaining overhead has been lumped under communication time, although most of it is probably idle time (or synchronization time) that was not included in the definition of load imbalance.
Figure 12.20:
Load Imbalance (solid), Communication and Synchronization Time
(dash), and Extra Work (dot-dash) as a Function of the Number of
Processors
We expected that as P increases, the near field of a processor would eventually contain a fixed number of neighboring processors. The number of messages and the load imbalance would then reach an asymptote and the loss of efficiency would be driven by the much smaller communication and extra times. However, this has yet to happen at 32 processors and the communication time is already starting to make an impact.
Nevertheless, the fast algorithm, its reasonably efficient parallel implementation and the speed of the Mark III have made possible simulations with as many as 80,000 vortex particles.
These 80,000 particles were used to compute the flow past an impulsively started cylinder. Figure 12.21 (Color Plate) shows the vorticity field after five time units, meaning that the cylinder has been displaced by five radii; the Reynolds number is 3000. The pair of primary eddies induced by the body's motion is clearly visible along with a number of small structures produced by the interaction of the wake with the rear portion of the cylinder. It should be noted that symmetry has been enforced in the simulation. Streamlines derived from this vorticity distribution are presented in Figure 12.22 and compared with Bouard and Coutanceau's flow visualization [ Bouard:80a ] obtained at the same dimensionless time and Reynolds number.
Figure 12.21: Vorticity Field for Re = 3000 at Time = 5.0
Figure 12.22: Comparison of Computed Streamlines with Bouard and Coutanceau's Experimental Flow Visualization at Re = 3000 and Time = 5.0
The goal of computer simulations of spin models is to generate configurations of spins typical of statistical equilibrium and measure physical observables on this ensemble of configurations. The generation of configurations is traditionally performed by Monte Carlo methods such as the Metropolis algorithm [ Metropolis:53a ], which produce configurations $C$ with a probability given by the Boltzmann distribution, $P(C) \propto e^{-\beta E(C)}$, where $E(C)$ is the action, or energy, of the system in configuration $C$, and $\beta$ is the inverse temperature. One of the main problems with these methods in practice is that successive configurations are not statistically independent, but rather are correlated, with some autocorrelation time, $\tau$, between effectively independent configurations.
A key feature of traditional (Metropolis-like) Monte Carlo algorithms is that the updates are local (i.e., one spin at a time is updated), and its new value depends only on the values of spins which affect its contribution to the action, that is, only on local (usually nearest-neighbor) spins. Thus, in a single step of the algorithm, information about the state of a spin is transmitted only to its nearest neighbors. In order for the system to reach a new, effectively independent configuration, this information must travel a distance of order the (static or spatial) correlation length $\xi$. As the information executes a random walk around the lattice, one would suppose that the autocorrelation time $\tau \sim \xi^2$. However, in general, $\tau \sim \xi^z$, where z is called the dynamical critical exponent. Almost all numerical simulations of spin models have measured $z \approx 2$ for local update algorithms. (See also Sections 4.3, 4.4, 7.3, 12.6, and 14.2).
For a spin model with a phase transition, as the inverse temperature approaches the critical value, the correlation length $\xi$, and with it the autocorrelation time $\tau$, diverges to infinity, so that the computational efficiency rapidly goes to zero! This behavior is called critical slowing down (CSD), and until very recently it has plagued Monte Carlo simulations of statistical mechanical systems, in particular spin models, at or near their phase transitions. Recently, however, several new ``cluster algorithms'' have been introduced which decrease z dramatically by performing nonlocal spin updates, thus reducing (or even eliminating) CSD and facilitating much more efficient computer simulations.
The aim of the cluster update algorithms is to find a suitable collection of spins which can be flipped with relatively little cost in energy. We could obtain nonlocal updating very simply by using the standard Metropolis Monte Carlo algorithm to flip randomly selected bunches of spins, but then the acceptance would be tiny. Therefore, we need a method which picks sensible bunches or clusters of spins to be updated. The first such algorithm was proposed by Swendsen and Wang [ Swendsen:87a ], and was based on an equivalence between a Potts spin model [ Potts:52a ], [ Wu:82a ] and percolation models [ Stauffer:78a ], [ Essam:80a ], for which cluster properties play a fundamental role.
The Potts model is a very simple spin model of a ferromagnet, in which the spins can take q different values. The case q=2 is just the well-known Ising model. In the Swendsen and Wang algorithm, clusters of spins are created by introducing bonds between neighboring spins with a temperature-dependent probability if the two spins have the same value, and with zero probability if they do not. All such clusters are created and then updated by choosing a random new spin value for each cluster and assigning it to all the spins in that cluster.
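A sketch of the bond-placement step on an L x L periodic lattice (the lattice layout, the use of the C library random-number generator, and the value of the bond probability p are assumptions; p depends on the inverse temperature and on how the Potts action is normalized):

#include <stdlib.h>

/* Place Swendsen-Wang bonds on an L x L periodic lattice of Potts
 * spins.  bond[dir][s] is set to 1 when site s is bonded to its
 * neighbor in direction dir (0 = +x, 1 = +y).  A bond may appear only
 * between equal spins, and then only with probability p. */
void place_bonds(int L, const int spin[], double p, int *bond[2])
{
    for (int y = 0; y < L; y++) {
        for (int x = 0; x < L; x++) {
            int s  = y * L + x;
            int sx = y * L + (x + 1) % L;       /* +x neighbor */
            int sy = ((y + 1) % L) * L + x;     /* +y neighbor */
            bond[0][s] = spin[s] == spin[sx] &&
                         (double)rand() / RAND_MAX < p;
            bond[1][s] = spin[s] == spin[sy] &&
                         (double)rand() / RAND_MAX < p;
        }
    }
}

Once the bonds are placed, the connected clusters of bonded sites are identified (the labelling problem discussed below) and each cluster is assigned a single random new spin value.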
A variant of this algorithm, for which only a single cluster is constructed and updated at each sweep, has been proposed by Wolff [ Wolff:89a ]. The implementation of this algorithm is shown in Figures 12.23 through 12.25 (Color Plates), which show a q=3 Potts model at its critical temperature, with different colors representing the three different spin values. From the starting configuration (Figure 12.23 (Color Plate)), we choose a site at random, and construct a cluster around it by bonding together neighboring sites with the appropriate probabilities (Figure 12.24 (Color Plate)). All sites in this cluster are then given the same new spin value, producing the new configuration shown in Figure 12.25 (Color Plate), which is obviously far less correlated with the initial configuration than the result of a single Metropolis update (Figure 12.26 (Color Plate)). Although Wolff's method is probably the best sequential cluster algorithm, the Swendsen and Wang algorithm seems to be better suited for parallelization, since it involves the entire lattice rather than just a single cluster. We have, therefore, concentrated our attention on parallelizing the method of Swendsen and Wang, where all the clusters must be identified and labelled.
Figure 12.23: Initial configuration of 3-state Potts spins on which the Wolff algorithm is to be applied.
Figure 12.24: Configuration of Figure 12.23 with the bonds of the cluster constructed by the Wolff algorithm indicated in yellow.
Figure 12.25: Results of the Wolff algorithm applied to the spin configuration in Figure 12.23: all spins in the cluster flipped to the same new value (in this case from blue to red).
Figure 12.26: Results of the Metropolis algorithm applied to the spin configuration in Figure 12.23: only a few single spins flipped.
First we outline a sequential method for labelling clusters, the so-called ``ants in the labyrinth'' algorithm. The reason for its name is that we can visualize the algorithm as follows [ Dewar:87a ]. An ant is put somewhere on the lattice, and notes which of the neighboring sites are connected to the site it is on. At the next time step, this ant places children on each of these connected sites which are not already occupied. The children then proceed to reproduce likewise until the entire cluster is populated. In order to label all the clusters, we start by giving every site a negative label, set the initial cluster label to be zero, and then loop through all the sites in turn. If a site's label is negative, then the site has not already been assigned to a cluster so we place an ant on this site, give it the current cluster label, and let it reproduce, passing the label on to all its offspring. When this cluster is identified, we increment the cluster label and carry on repeating the ant-colony birth, growth, and death cycle until all the clusters have been identified.
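A sequential sketch of this procedure on an L x L periodic lattice (the bond representation follows the earlier sketch and is an assumption): a breadth-first ``ant colony'' grows outward from each unvisited site, passing the current cluster label to every site reachable through bonds.

#include <stdlib.h>

/* Label the connected clusters of an L x L periodic lattice.
 * bond[0][s] and bond[1][s] are the bonds from site s to its +x and +y
 * neighbors.  On return, label[s] >= 0 gives the cluster of site s;
 * the return value is the number of clusters found. */
int label_clusters(int L, int *bond[2], int label[])
{
    int n = L * L, nclusters = 0;
    int *queue = malloc(n * sizeof *queue);

    for (int s = 0; s < n; s++)
        label[s] = -1;                     /* -1: not yet visited */

    for (int start = 0; start < n; start++) {
        if (label[start] >= 0)
            continue;                      /* already in a cluster */
        int head = 0, tail = 0;
        label[start] = nclusters;
        queue[tail++] = start;             /* place the first ant */

        while (head < tail) {              /* the colony reproduces */
            int s = queue[head++];
            int x = s % L, y = s / L;
            int nbr[4], lnk[4];
            nbr[0] = y * L + (x + 1) % L;        lnk[0] = bond[0][s];
            nbr[1] = y * L + (x - 1 + L) % L;    lnk[1] = bond[0][nbr[1]];
            nbr[2] = ((y + 1) % L) * L + x;      lnk[2] = bond[1][s];
            nbr[3] = ((y - 1 + L) % L) * L + x;  lnk[3] = bond[1][nbr[3]];
            for (int k = 0; k < 4; k++)
                if (lnk[k] && label[nbr[k]] < 0) {
                    label[nbr[k]] = nclusters;
                    queue[tail++] = nbr[k];      /* a child is placed */
                }
        }
        nclusters++;
    }
    free(queue);
    return nclusters;
}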
As with the percolation models upon which the cluster algorithms are based, the phase transition in a spin model occurs when the clusters of bonded spins become large enough to span the entire lattice. Thus, near criticality (which in most cases is where we want to perform the simulation), clusters come in all sizes, from order N (where N is the number of sites in the lattice) right down to a single site. The highly irregular and nonlocal nature of the clusters means that cluster update algorithms do not vectorize well and hence give poor performance on vector machines. On this problem, a CRAY X-MP is only about ten times faster than a Sun 4 workstation. The irregularity of the clusters also means that SIMD machines are not well suited to this problem [Apostolakis:92a;93a], [ Baillie:91a ], [ Brower:91a ], whereas for the Metropolis type algorithms, they are perhaps the best machines available. It therefore appears that the optimum performance for this type of problem will come from MIMD parallel computers.
A parallel cluster algorithm involves distributing the lattice onto an array of processors using the usual domain decomposition. Clearly, a sequential algorithm can be used to label the clusters on each processor, but we need a procedure for converting these labels to their correct global values. We need to be able to tell many processors, which may be any distance apart, that some of their clusters are actually the same, to agree on which of the many different local labels for a given cluster should be assigned to be the global cluster label, and to pass this label to all the processors containing a part of that cluster. We have implemented two such algorithms, ``self-labelling'' and ``global equivalencing'' [ Baillie:91a ], [ Coddington:90a ].
We shall refer to this algorithm as ``self-labelling,'' since each site figures out which cluster it is in by itself from local information. This method has also been referred to as ``local label propagation'' [ Brower:91a ], [ Flanigan:92a ]. We begin by assigning each site, i, a unique cluster label, $c_i$. In practice, this is simply chosen as the position of that site in the lattice. At each step of the algorithm, every site in parallel looks in turn at each of its neighbors in the positive directions. If it is bonded to a neighboring site, n, which has a different cluster label, $c_n$, then both $c_i$ and $c_n$ are set to the minimum of the two. This is continued until nothing changes, by which time all the clusters will have been labelled with the minimum initial label of all the sites in the cluster. Note that checking termination of the algorithm involves each processor sending a termination flag (finished or not finished) to every other processor after each step, which can become very costly for a large processor array. This is an SIMD algorithm and can, therefore, be run on machines like the AMT DAP and TMC Connection Machine. However, the SIMD nature of these computers leads to very poor load balancing: Most processors end up waiting for the few in the largest cluster, which are the last to finish. We implemented this on the AMT DAP and obtained only about 20% efficiency.
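A sketch of a single self-labelling sweep over an L x L periodic lattice (the bond and label arrays follow the earlier sketches and are assumptions; labels are initialized to the site positions before the first sweep). The sweep is repeated until it reports that no label changed.

/* One self-labelling sweep: every site compares its cluster label with
 * each bonded neighbor in the positive directions and both take the
 * minimum.  Returns the number of labels changed. */
int self_label_sweep(int L, int *bond[2], int label[])
{
    int changed = 0;
    for (int y = 0; y < L; y++) {
        for (int x = 0; x < L; x++) {
            int s = y * L + x;
            int nbr[2] = { y * L + (x + 1) % L, ((y + 1) % L) * L + x };
            for (int dir = 0; dir < 2; dir++) {
                if (!bond[dir][s])
                    continue;
                int a = label[s], b = label[nbr[dir]];
                int m = a < b ? a : b;
                if (a != m) { label[s] = m;        changed++; }
                if (b != m) { label[nbr[dir]] = m; changed++; }
            }
        }
    }
    return changed;
}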
We can improve this method on a MIMD machine by using a faster sequential algorithm, such as ``ants in the labyrinth,'' to label the clusters in the sublattice on each processor, and then just use self-labelling on the sites at the edges of each processor to eventually arrive at the global cluster labels [ Baillie:91a ], [ Coddington:90a ], [ Flanigan:92a ]. The number of steps required to do the self-labelling will depend on the largest cluster which, at the phase transition, will generally span the entire lattice. The number of self-labelling steps will therefore be of the order of the maximum distance between processors, which for a square array of P processors is just $O(\sqrt{P})$. Hence, the amount of communication (and calculation) involved in doing the self-labelling, which is proportional to the number of iterations times the perimeter of the sublattice, behaves as L for an $L \times L$ lattice; whereas, the time taken on each processor to do the local cluster labelling is proportional to the area of the sublattice, which is $L^2/P$. Therefore, as long as L is substantially greater than the number of processors, we can expect to obtain a reasonable speedup. Of course, this algorithm suffers from the same type of load imbalance as the SIMD version. However, in this case, it is much less severe since most of the work is done with ``ants in the labyrinth,'' which is well load balanced. The speedups obtained on the Symult 2010, for a variety of lattice sizes, are shown in Figure 12.27. The dashed line indicates perfect speedup (i.e., 100% efficiency). For the large lattices that actually require many processors, running on 64 nodes (or running multiple simulations of 64 nodes each) gives quite acceptable efficiencies of about 70% to 80%.
Figure 12.27:
Speedups for Self-Labelling Algorithm
In this method we again use the fastest sequential algorithm to identify the clusters in the sublattice on every processor. Each processor then looks at the labels of sites along the edges of the neighboring processors in the positive directions, and works out which ones are connected and should be matched up. These lists of ``equivalences'' are all passed to one of the processors, which uses an algorithm for finding equivalence classes [ Knuth:68a ], [ Press:86a ] (which, in this case, are the global cluster labels) to match up the connected clusters. This processor then broadcasts the results back to all the other processors.
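A sketch of the equivalence-class step on the collecting processor, using a standard union-find with path compression (the data layout is an assumption; it is one common way to implement the equivalence-class algorithms cited above):

#include <stdlib.h>

static int *parent;                 /* provisional label -> parent label */

/* Every provisional label starts out as its own equivalence class. */
void uf_init(int nlabels)
{
    parent = malloc(nlabels * sizeof *parent);
    for (int i = 0; i < nlabels; i++)
        parent[i] = i;
}

/* Class representative of label a, with path compression. */
int uf_find(int a)
{
    while (parent[a] != a) {
        parent[a] = parent[parent[a]];
        a = parent[a];
    }
    return a;
}

/* Record one boundary equivalence between provisional labels a and b;
 * the smaller representative is kept, as in the self-labelling scheme. */
void uf_merge(int a, int b)
{
    a = uf_find(a);
    b = uf_find(b);
    if (a != b)
        parent[a < b ? b : a] = a < b ? a : b;
}

Each boundary match gathered from the processor array is recorded with uf_merge; afterwards uf_find maps every provisional label to its global representative, which is then broadcast back to the other processors.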
Figure 12.28:
Speedups for Global Equivalencing Algorithm
This part of the algorithm is purely sequential, and is thus a potentially disastrous bottleneck for large numbers of processors. It also requires this processor to have a large amount of memory in which to store all the labels from every other processor. The amount of work involved in doing the global matchup is proportional to P times the perimeter of the sublattice on each processor, or $L\sqrt{P}$ for an $L \times L$ lattice, so that the efficiency should be less than for self-labelling; although, we might still expect reasonable speedups if the number of processors is not extremely large. The speedups obtained on the Symult 2010 for a variety of lattice sizes are shown in Figure 12.28. The data point for the largest lattice on 128 processors is missing due to memory constraints. Global equivalencing gives about the same speedups as self-labelling for small numbers of processors, but as expected self-labelling does much better as the number of processors increases.
The problem of labelling clusters of spins is an example of a standard graph problem known as connected component labelling [ Horowitz:78a ]. Another important instance occurs in image analysis, in identifying and labelling the connected components of a binary or multicolored image composed of an array of pixels [ Rosenfeld:82a ]. There have been a number of parallel algorithms implemented for this problem [ Alnuweiri:92a ], [ Cypher:89a ], [ Embrechts:89a ], [ Woo:89a ]. The most promising of these parallel algorithms for spin models has a hierarchical divide-and-conquer approach [ Baillie:91a ], [ Embrechts:89a ]. The processor array is divided up into smaller subarrays of processors. In each subarray, the processors look at the edges of their neighbors for clusters which are connected across processor boundaries. As in global equivalencing, these equivalences are all passed to one of the nodes of the subarray, which places them in equivalence classes. The results of these partial matchings are similarly combined on each subarray, and this process is continued until finally all the partial results are merged together on a single processor to give the global cluster values.
Finally, we should mention the trivial parallelization technique of running independent Monte Carlo simulations on different processors. This method works well until the lattice size gets too big to fit into the memory of each node. In the case of the Potts model, for example, only lattices up to a moderate size will fit into 1 Mbyte, though most other spin models are more complicated and more memory-intensive. The smaller lattices, which are seen to give poor speedups in Figure 12.27 and Figure 12.28, can be run with 100% efficiency in this way. Note, of course, that this requires an MIMD computer. In fact, we have used this method to calculate the dynamical critical exponents of various cluster algorithms for Potts models [Baillie:90m;91b], [ Coddington:92a ] (see Section 4.4.3).
This research was performed by C. F. Baillie, P. D. Coddington, J. Apostolakis, and E. Marinari.
This section discusses sorting: the rearrangement of data into some set sequential order. Sorting is a common component of many applications and so it is important to do it well in parallel. Quicksort (to be discussed below) is fundamentally a divide-and-conquer algorithm and the parallel version is closely related to the recursive bisection algorithm discussed in Section 11.1. Here, we have concentrated on the best general-purpose sorting algorithms: bitonic, shellsort, and quicksort. No special properties of the list are exploited. If the list to be sorted has special properties, such as a known distribution (e.g., random numbers with a flat distribution between 0 and 1) or high degeneracy (many redundant items, e.g., text files), then other strategies can be faster. In the case of known data distribution, a bucketsort strategy (e.g., radix sort) is best, while the case of high degeneracy is best handled by the distribution counting method ([Knuth:73a, pp. 379-81]).
The ideas presented here are appropriate for MIMD machines and are somewhat specific to hypercubes (we will assume $2^d$ processors, where d is the dimension of the hypercube), but can easily be extended to other topologies.
There are two ways to measure the quality of a concurrent algorithm. The first may be termed ``speed at any cost,'' and here one optimizes for the highest absolute speed possible for a fixed-size problem. The other we can call ``speed per unit cost,'' where one, in addition to speed, worries about efficient use of the parallel machine. It is interesting that in sorting, different algorithms are appropriate depending upon which criterion is employed. If one is interested only in absolute speed, then one should pay for a very large parallel machine and run the bitonic algorithm. This algorithm, however, is inefficient. If efficiency also matters, then one should buy a much smaller parallel machine and use the much more efficient shellsort or quicksort algorithms.
Another way of saying this is: for a fixed-size parallel computer (the realistic case), quicksort and shellsort are actually the fastest algorithms on all but the smallest problem sizes. We continue to find the misconception that ``Everyone knows that the bitonic algorithm is fastest for sorting.'' This is not true for most combinations of machine size and list size.
The data are assumed to initially reside throughout the parallel computer, spread out in a random, but load-balanced, fashion (i.e., each processor begins with an approximately equal number of items). In our experiments, the data were positive integers and the sorting key was taken to be simply their numeric value. We require that at the end of the sorting process, the data residing in each node are sorted internally and these sublists are also sorted globally across the machine in some way.
In the merging strategy to be used by our sorting algorithms, the first step is for each processor to sort its own sublist using some fast algorithm. We take for this a combined quicksort/insertion sort which is described in detail as Algorithm Q by Knuth ([Knuth:73a, pp. 118-9]). Once the local (processor) sort is complete, it must be decided how to merge all of the sorted lists in order to form one globally sorted list. This is done in a series of compare-exchange steps. In each step, two neighboring processors exchange items so that each processor ends up with a sorted list and all of the items in one processor are greater than all of the items in the other. Thus, two sorted lists of m items each are merged into a sorted list of 2 m items (stored collectively in the memory of the two processors). The compare-exchange algorithm is interesting in its own right, but we do not have the space here to discuss it. The reader is referred to Chapter 18 of [ Fox:88a ] for the details.
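A sketch of one compare-exchange step as seen by the processor that keeps the low half (the partner's sorted sublist is assumed to have been received already; the communication itself and the convention that each processor keeps its original item count are assumptions):

#include <stddef.h>

/* Compare-exchange as seen by the processor that keeps the low half.
 * mine[] (n items) and theirs[] (m items) are both sorted; out[]
 * receives the n smallest of the combined items, still sorted.  The
 * partner performs the mirror-image merge and keeps the m largest. */
void compare_exchange_low(const int mine[], size_t n,
                          const int theirs[], size_t m,
                          int out[])
{
    size_t i = 0, j = 0;
    for (size_t k = 0; k < n; k++) {
        if (j >= m || (i < n && mine[i] <= theirs[j]))
            out[k] = mine[i++];
        else
            out[k] = theirs[j++];
    }
}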
Figure 12.29: Bitonic Scheme for d=3. This figure illustrates the six compare-exchange steps of the bitonic algorithm for d=3. Each diagram illustrates four compare-exchange processes which happen simultaneously. The arrows represent a compare-exchange between two processors. The largest items go to the processor at the point of the arrow, and the smallest items to the one at the base of the arrow.
Table 12.2: Bitonic sort on a hypercube. The rows are labelled by hypercube dimension, d; the columns by the number of items to sort.
Many algorithms for sorting on concurrent machines are based upon Batcher's bitonic sorting algorithm ([ Batcher:68a ], [Knuth:73a, pp.232-3]). The first step in the merge strategy is for each processor to internally sort via quicksort. One is then left with the problem of constructing a series of compare-exchange steps which will correctly merge sorted sublists. This problem is completely isomorphic to the problem of sorting a list of items by pairwise comparisons between items. Each one of our sublist compare-exchange operations is equivalent to a single compare-exchange between two individual items. The pattern of compare-exchanges for the bitonic algorithm for the d=3 case is shown in Figure 12.29 . More details and a specification of the bitonic algorithm can be found in Chapter 18 of [ Fox:88a ].
Table 12.2 shows the actual times and efficiencies for our implementation of the bitonic algorithm. Results are shown for sorting lists of a range of sizes on hypercubes with dimensions, d, ranging from one (2 nodes) to seven (128 nodes). Efficiencies are computed by comparing with single-processor times to quicksort the entire list (we take quicksort to be our benchmark sequential algorithm). The same information is also shown graphically in Figure 12.30.
Figure 12.30: The Efficiency of the Bitonic Algorithm Versus List Size for Various Size Hypercubes, Labelled by Cube Dimension d.
Clearly, the efficiencies fall off rapidly with increasing d . From the standpoint of cost-effectiveness, this algorithm is a failure. On the other hand, Table 12.2 shows that for fixed-list sizes and increasing machine size, the execution times continue to decrease. So, from the speed-at-any-cost point of view, the algorithm is a success. We attribute the inefficiency of the bitonic algorithm partly to communication overhead and some load imbalance during the compare-exchanges, but mostly to nonoptimality of the algorithm itself. In our definition of efficiency we are comparing the parallel bitonic algorithm to sequential quicksort. In bitonic, the number of cycles grows quadratically with d . This suggests that efficiency can be improved greatly by using a parallel algorithm that sorts in fewer operations without sacrificing concurrency.
This algorithm again follows the merge strategy and is motivated by the fact that d compare-exchanges in the d different directions of the hypercube result in an almost-sorted list. Global order is defined via ringpos , that is, the list will end up sorted on an embedded ring in the hypercube. After the d compare-exchange stages, the algorithm switches to a simple mopping-up stage which is specially designed for almost-sorted lists. This stage is optimized for moving relatively few items quickly through the machine and amounts to a parallel bucket brigade algorithm. Details and a specification of the parallel shellsort algorithm can be found in Chapter 18 of [ Fox:88a ].
It turns out that the mop-up algorithm takes advantage of the MIMD nature of the machine and that this characteristic is crucial to its speed. Only the few items which need to be moved are examined and processed. The bitonic algorithm, on the other hand, is natural for a SIMD machine. It involves much extra work in order to handle the worst case, which rarely occurs.
We refer to this algorithm as shellsort ([ Shell:59a ], [ Knuth:73a ] pp. 84-5, 102-5) or a diminishing increment algorithm. This is not because it is a strict concurrent implementation of the sequential namesake, but because the algorithms are similar in spirit. The important feature of Shellsort is that in early stages of the sorting process, items take very large jumps through the list reaching their final destinations in few steps. As shown in Figure 12.31 , this is exactly what occurs in the concurrent algorithm.
Figure 12.31: The Parallel Shellsort on a d=3 Hypercube. The left side shows what the algorithm looks like on the cube; the right shows the same when the cube is regarded as a ring.
The algorithm was implemented and tested with the same data as the bitonic case. The timings appear in Table 12.3 and are also shown graphically in Figure 12.32 . This algorithm is much more efficient than the bitonic algorithm, and offers the prospect of reasonable efficiency at large d . The remaining inefficiency is the result of both communication overhead and algorithmic nonoptimality relative to quicksort. For most list sizes, the mop-up time is a small fraction of the total execution time, though it begins to dominate for very small lists on the largest machine sizes.
Figure 12.32: The Efficiency of the Shellsort Algorithm Versus List Size for Various Size Hypercubes. The labelling of curves and axes is as in Figure 12.30.
The classic quicksort algorithm is a divide-and-conquer sorting method ([ Hoare:62a ], [ Knuth:73a ] pp.118-23). As such, it would seem to be amenable to a concurrent implementation, and with a slight modification (actually an improvement of the standard algorithm) this turns out to be the case.
The standard algorithm begins by picking some item from the list and using this as the splitting key. A loop is entered which takes the splitting key and finds the point in the list where this item will ultimately end up once the sort is completed. This is the first splitting point. While this is being done, all items in the list which are less than the splitting key are placed on the low side of the splitting point, and all higher items are placed on the high side. This completes the first divide. The list has now been broken into two independent lists, each of which still needs to be sorted.
The essential idea of the concurrent (hypercube) quicksort is the same. The first splitting key is chosen (a global step to be described below) and then the entire list is split, in parallel, between two halves of the hypercube. All items higher than the splitting key are sent in one direction in the hypercube, and all items less are sent the other way. The procedure is then called recursively, splitting each of the subcubes' lists further. As in Shellsort, the ring-based labelling of the hypercube is used to define global order. Once d splits occur, there remain no further interprocessor splits to do, and the algorithm continues by switching to the internal quicksort mentioned earlier. This is illustrated in Figure 12.33 .
Figure 12.33:
An Illustration of the Parallel Quicksort
So far, we have concentrated on standard quicksort. For quicksort to work well, even on sequential machines, it is essential that the splitting points land somewhere near the median of the list. If this isn't true, quicksort behaves poorly, the usual example being the quadratic time that standard quicksort takes on almost-sorted lists. To counteract this, it is a good idea to choose the splitting keys with some care so as to make evenhanded splits of the list.
Figure 12.34: Efficiency Data for the Parallel Quicksort Described in the Text. The curves are labelled as in Figure 12.30 and plotted against the logarithm of the number of items to be sorted.
This becomes much more important on the concurrent computer. In this case, if the splits are done haphazardly, not only will an excessive number of operations be necessary, but large load imbalances will also occur. Therefore, in the concurrent algorithm, the splitting keys are chosen with some care. One reasonable way to do this is to randomly sample a subset of the entire list (giving an estimate of the true distribution of the list) and then pick splitting keys based upon this sample. To save time, all splitting keys are found at once. This modified algorithm should perhaps be called samplesort: Each processor samples its local list, the sample is gathered and sorted, the splitting keys are chosen from the sorted sample, and the d splitting stages then proceed as before using these precomputed keys.
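A sketch of the splitter-selection step (the gathering of the sample onto one node, the use of the C library qsort, and the choice of evenly spaced ranks are assumptions): the sorted sample supplies the 2^d - 1 keys needed for all d levels of splits at once.

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Choose the 2^d - 1 splitting keys for a dimension-d hypercube from a
 * random sample gathered from the whole list (nsample is assumed to be
 * much larger than the number of buckets).  The sample is sorted and
 * the keys are taken at evenly spaced ranks, so that a representative
 * sample yields buckets of roughly equal size. */
void choose_splitters(int sample[], size_t nsample, int d, int splitter[])
{
    qsort(sample, nsample, sizeof sample[0], cmp_int);
    size_t nbuckets = (size_t)1 << d;
    for (size_t k = 1; k < nbuckets; k++)
        splitter[k - 1] = sample[k * nsample / nbuckets];
}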
Times and efficiencies for the parallel quicksort algorithm are shown in Table 12.4 . The efficiencies are also plotted in Figure 12.34 . In some cases, the parallel quicksort outperforms the already high performance of the parallel shellsort discussed earlier. There are two main sources of inefficiency in this algorithm. The first is a result of the time wasted sorting the sample. The second is due to remaining load imbalance in the splitting phases. By varying the sample size l , we achieve a trade-off between these two sources of inefficiency. Chapter 18 of [ Fox:88a ] contains more details regarding the choice of l and other ways to compute splitting points.
Before closing, it may be noted that there exists another way of thinking about the parallel quicksort/samplesort algorithm. It can be regarded as a bucketsort, in which each processor of the hypercube comprises one bucket. In the splitting phase, one attempts to determine reasonable limits for the buckets so that approximately equal numbers of items will end up in each bucket. The splitting process can be thought of as an optimal routing scheme on the hypercube which brings each item to its correct bucket. So, our version of quicksort is also a bucketsort in which the bucket limits are chosen dynamically to match the properties of the particular input list.
The sorting work began as a collaboration between Steve Otto and summer students Ed Felten and Scott Karlin. Ed Felten invented the parallel Shellsort; Felten and Otto developed the parallel Quicksort.
Two basic types of simulations exist for modelling systems of many particles: grid-based (point particles indirectly interacting with one another through the potential calculated from equivalent particle densities on a mesh) and particle-based (point particles directly interacting with one another through potentials at their positions calculated from the other particles in the system). Grid-based solvers traditionally model continuum problems, such as fluid and gas systems like the one described in Section 9.3, and mixed particle-continuum systems. Particle-based solvers find more use modeling discrete systems such as stars within galaxies, as discussed in Section 12.4, or other rarefied gases. Many different physical systems, including electromagnetic, gravitational, and fluid vortex interactions, are governed by Poisson's Equation:
\nabla^2 \phi = 4 \pi G \rho for the gravitational case. To evolve N particles in time, the exact solution to the problem requires calculating the force contribution to each particle from all other particles at each time step:

F_i = -G m_i \sum_{j \ne i} m_j \frac{r_i - r_j}{|r_i - r_j|^3} .
The O(N^2) operation count is prohibitive for simulations of more than a few thousand particles commonly required to represent astrophysical and vortex configurations of interest.
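For concreteness, the direct summation just written looks like the following in code; the structure layout, the choice G = 1, and the softening parameter eps are illustrative assumptions, not part of any particular production code.

#include <math.h>
#include <stddef.h>

typedef struct { double pos[3], mass, acc[3]; } Body;

/* Direct O(N^2) evaluation of the gravitational acceleration on every body. */
void direct_forces(Body *b, size_t n, double eps)
{
    for (size_t i = 0; i < n; i++)
        b[i].acc[0] = b[i].acc[1] = b[i].acc[2] = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = b[j].pos[0] - b[i].pos[0];
            double dy = b[j].pos[1] - b[i].pos[1];
            double dz = b[j].pos[2] - b[i].pos[2];
            double r2 = dx*dx + dy*dy + dz*dz + eps*eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            b[i].acc[0] += b[j].mass * dx * inv_r3;   /* attraction, G = 1 */
            b[i].acc[1] += b[j].mass * dy * inv_r3;
            b[i].acc[2] += b[j].mass * dz * inv_r3;
        }
}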
One method of decreasing the operation count utilizes grid-based solvers, which translate the particle problem into a continuum problem by interpolating the particles onto a mesh representing density and then solving the discretized equation. Initial implementations were based upon fast Fourier transform (FFT) and cloud-in-cell (CIC) methods, which can calculate the potential of a mass distribution on a three-dimensional grid with axes of length M in O(M^3 log M) operations, but at the cost of lower accuracy in the force resolution. All of these algorithms are discussed extensively by Hockney and Eastwood [Hockney:81a].
A newer type of grid-based solver for discretized equations classified as a multilevel or multigrid method has been in development for over a decade [ Brandt:77a ], [ Briggs:87b ]. Frequently, the algorithm utilizes a hierarchy of rectangular meshes on which a traditional relaxation scheme may be applied, but multiscale methods have expanded beyond any particular type of solver or even grids, per se. Relaxation methods effectively damp oscillatory error modes whose wave numbers are comparable to the grid size, but most of the iterations are spent propagating smooth, low-wave number corrections throughout the system. Multigrid utilizes this property by resampling the low-wave number residuals onto secondary, lower-resolution meshes, thereby shifting the error to higher wave numbers comparable to the grid spacing where relaxation is effective. The corrections computed on the lower-resolution meshes are interpolated back onto the original finer mesh and the combined solutions from the various mesh levels determine the result.
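The shape of such a scheme can be seen in the following one-dimensional sketch for the model problem -u'' = f. The Gauss-Seidel smoother, full-weighting restriction, linear interpolation, and grid sizes of 2^k - 1 interior points are standard textbook choices assumed here for illustration; they are not the particular solvers discussed above.

#include <stdlib.h>

/* Model problem -u'' = f on n interior points, u[0] = u[n+1] = 0. */
static void smooth(double *u, const double *f, int n, double h)
{
    for (int sweep = 0; sweep < 3; sweep++)          /* Gauss-Seidel sweeps */
        for (int i = 1; i <= n; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1] + h * h * f[i]);
}

static void vcycle(double *u, const double *f, int n, double h)
{
    smooth(u, f, n, h);                        /* pre-smoothing             */
    if (n < 3) return;                         /* coarsest level reached    */

    int nc = (n - 1) / 2;                      /* coarse interior points    */
    double *r  = calloc(n + 2,  sizeof *r);    /* fine-grid residual        */
    double *fc = calloc(nc + 2, sizeof *fc);   /* restricted residual       */
    double *ec = calloc(nc + 2, sizeof *ec);   /* coarse-grid correction    */

    for (int i = 1; i <= n; i++)               /* r = f - A u               */
        r[i] = f[i] + (u[i-1] - 2.0 * u[i] + u[i+1]) / (h * h);
    for (int i = 1; i <= nc; i++)              /* full-weighting restriction */
        fc[i] = 0.25 * (r[2*i-1] + 2.0 * r[2*i] + r[2*i+1]);

    vcycle(ec, fc, nc, 2.0 * h);               /* recurse on the coarser grid */

    for (int i = 1; i <= nc; i++) {            /* interpolate and correct   */
        u[2*i]   += ec[i];
        u[2*i-1] += 0.5 * (ec[i-1] + ec[i]);
    }
    u[2*nc+1] += 0.5 * ec[nc];

    smooth(u, f, n, h);                        /* post-smoothing            */
    free(r); free(fc); free(ec);
}

The relaxation sweeps damp the oscillatory part of the error on each level, while the smooth remainder is handed down to the coarser grid and its correction interpolated back, just as described above.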
Many grid-based methods for particle problems have incorporated some form of local direct-force calculation, such as the particle-particle/particle-mesh (PPPM) method or the method of local corrections (MLC), to correct the force on a local subset of particles. The grid is used to propagate the far-field component of the force, while direct-force calculations provide the near-field component either completely or as a correction to the ``external'' potential. The computational cost strongly depends on the criterion used to distinguish near-field objects from far-field objects. Extremely inhomogeneous systems of densely clustered particles can deteriorate to nearly O(N^2) cost if most of the particles are considered neighbors requiring direct force computation.
A class of alternative techniques, implemented with great success, efficiently calculates and combines the coefficients of an analytic approximation to the particle forces using spherical harmonic multipole expansions in three dimensions. The potential is written as a sum, over disjoint spatial regions V_m, of terms built from the multipole moments of the density distribution in each region and from the Green's function G. Instead of integrating G over each volume V_m, one may compute the potential (and, in a similar manner, the gradient) at any position x by calculating the multipole moments which characterize the density distribution in each region, evaluating G and its derivatives at x, and summing over indices. This algorithm is described more extensively in Section 12.4.
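Written in a generic Cartesian (Taylor-series) form, given here only for illustration rather than the spherical-harmonic expansion the text refers to, such an approximation reads

\phi(x) \;\approx\; \sum_m \sum_{|k| \le p} \frac{M_m^{(k)}}{k!}\, D^{k} G(x - x_m),
\qquad
M_m^{(k)} = \int_{V_m} (x_m - y)^{k}\, \rho(y)\, dy ,

where k is a multi-index, the V_m are the disjoint regions with centers x_m, \rho is the density, and G is the Green's function. Truncating at order p gives the analytic approximation whose coefficients, the moments M_m^{(k)}, the fast methods compute and combine.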
Not only does spatially sorting the particles into a tree-type data structure provide an efficient database for individual and collective particle information [Samet:90a], but the various algorithms require and utilize the hierarchical grouping of particles and combined information to calculate the force on each particle from the multipole moments in O(N log N) operations or less.
Implementations for three-dimensional problems frequently use an oct-tree: a cube divided into eight octants of equal spatial volume, which contain subcubes similarly divided. The cubes continue to nest until, depending on the algorithm, each cube contains either zero or one particles, or a few particles of equal number to the other ``terminal'' cells. Binary trees which subdivide the volume with planes chosen to evenly divide the number of particles instead of the space also have been used [Appel:85a]; a single bifurcation separates two particles spaced arbitrarily close together, while the oct-tree would require arbitrarily many subcubes refining one particular region. This approach may produce fewer artifacts by not imposing an arbitrary rectangular structure onto the simulation, but construction is more difficult and information about each cut must be stored and used throughout the computation.
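A bare-bones oct-tree node of the kind described above might look as follows in C; the field names, the one-particle-per-leaf convention, and the helper for creating a child octant are illustrative assumptions rather than any particular code's data structure.

#include <stdlib.h>

typedef struct OctNode {
    double center[3], half;          /* cube center and half-width           */
    double mass, com[3];             /* total mass and center of mass        */
    int    body;                     /* particle index at a leaf, else -1    */
    struct OctNode *child[8];        /* NULL until an octant is subdivided   */
} OctNode;

/* Which of the eight octants of `node` does position p fall in? */
static int octant(const OctNode *node, const double p[3])
{
    return  (p[0] > node->center[0])
         | ((p[1] > node->center[1]) << 1)
         | ((p[2] > node->center[2]) << 2);
}

/* Allocate the child cube for octant `oct`, nested inside its parent. */
static OctNode *make_child(const OctNode *parent, int oct)
{
    OctNode *c = calloc(1, sizeof *c);
    c->half = 0.5 * parent->half;
    c->body = -1;
    for (int k = 0; k < 3; k++)
        c->center[k] = parent->center[k] +
                       (((oct >> k) & 1) ? c->half : -c->half);
    return c;
}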
Initial implementations for both grid-based and multipole techniques normally span the entire volume with a uniform resolution net in which to catch the result. While this is adequate for homogeneous problems, it either wastes computational effort and storage or sacrifices accuracy for problems which exhibit clustering and structure. Many of the algorithms described above provide enough flexibility to allow adaptive implementations which can conform to complicated particle distribution or accuracy constraints.
Mesh-based algorithms have started to incorporate adaptive mesh refinement to decrease storage and wasted computational effort. Instead of solving the entire system with a fixed resolution grid designed to represent the finest structures, local regions may be refined adaptively depending on accuracy requirements such as the density of particles. Unlike finite-element and finite-volume algorithms, which deform a single grid by shifting or adding vertices, adaptive mesh refinement (AMR) algorithms simply overlay regions of interest with increasingly fine rectangular meshes. Berger, Colella, and Oliger have pioneered application of this method to hyperbolic partial differential equations [Berger:84a;89a]. Almgren recently has extended AMR for multigrid to an MLC implementation [ Almgren:91a ].
Adaptive mesh refinement traditionally has been limited to rectangular regions. McCormick and Quinlan have extended their very robust, inherently conservative adaptive mesh multilevel algorithm called asynchronous fast adaptive composite (AFAC) [ McCormick:89a ] to relax nonrectangular subregions directly between two grid levels. The algorithm is a true multiscale solver not limited to relaxation-type solvers. AFAC provides special benefits for parallel implementations because the various levels in a single multigrid cycle may be scheduled in any convenient order and combined at the end of the cycle instead of the traditional, sequentially-ordered V-cycle.
In the particle-based solver regime, the Barnes-Hut [Barnes:86a] method utilizes an adaptive tree to store information about one particle or the collective information about particles in the subcubes. Each particle calculates the force on itself from all of the other particles in the simulation by querying the hierarchical database, descending each branch of the tree until a user-specified accuracy criterion has been met. The accuracy is determined by the solid angle subtended by the cluster of particles within the cube from the vantage point of the particle calculating the force. If the cube contains a single particle or if all of the particles in the cube can be approximated by the center of mass, the force is computed using a multipole expansion; otherwise, each of the eight subcubes is examined in turn using the same criterion. By utilizing combined information instead of the individual data at the terminal node of each branch, the algorithm requires O(N log N) operations. Section 12.4 provides additional explanation while describing a parallel implementation of this method.
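A sketch of that descent, reusing the OctNode structure sketched earlier, is shown below. Only the monopole (center-of-mass) term is kept, G = 1 is assumed, and the opening parameter theta and softening eps are illustrative names; a production code would also skip the leaf containing the particle itself and add the higher multipole terms.

#include <math.h>

/* Accumulate into acc[] the acceleration at position p due to the tree
 * rooted at `node`, using the opening-angle criterion s/d < theta.        */
static void tree_force(const OctNode *node, const double p[3],
                       double theta, double eps, double acc[3])
{
    if (node == NULL || node->mass == 0.0) return;

    double dx = node->com[0] - p[0];
    double dy = node->com[1] - p[1];
    double dz = node->com[2] - p[2];
    double d  = sqrt(dx*dx + dy*dy + dz*dz + eps*eps);
    double s  = 2.0 * node->half;             /* side length of the cube    */

    if (node->body >= 0 || s / d < theta) {
        /* Leaf, or the cube subtends a small enough angle: approximate the
         * whole cluster by its center of mass (monopole term only here).  */
        double f = node->mass / (d * d * d);
        acc[0] += f * dx;  acc[1] += f * dy;  acc[2] += f * dz;
    } else {
        for (int i = 0; i < 8; i++)           /* open the cube              */
            tree_force(node->child[i], p, theta, eps, acc);
    }
}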
The Fast Multipole Method (FMM) developed by Greengard and Rokhlin [Greengard:87b] utilizes new techniques to quickly compute and combine the multipole approximations in O(N) operations. Initial implementations sorted the particles into groups on a fixed level of the tree, with the hierarchical pyramid structure providing the communication network used to combine and repropagate the multipole-calculated potential. Recent enhancements include adaptive refinement of the hierarchy, creating structures similar to a Barnes-Hut tree [Greengard:91a].
Both Katzenelson and Anderson have noted the applicability of a variety of ``tree algorithms'' to the N-body problem. Katzenelson utilizes the common structure of the Barnes-Hut and FMM algorithms to study how this problem can be mapped to a variety of parallel computer designs [ Katzenelson:89a ]. Anderson utilizes the multigrid framework as a basis for communication in his FMM implementation which substitutes Poisson integrals for spherical harmonic multipole expansions [ Anderson:90b ].
We propose that the exact same hierarchical structure used by particle-based methods may now be effectively utilized in an adaptive mesh refinement implementation. The spatially structured cubic volumes into which the mass-points are sorted are inherently situated, sized, and ordered as an efficient adaptive mesh representing the system of interest. Instead of interpreting the hierarchy as a graphical representation of the tree-shaped database, it can function as the physical mesh which links the grid resolution with the particle density. Figures 12.10(a) and 12.35 represent a two-dimensional tree structure from a particle simulation (simplified for ease of presentation). Figures 12.36 and 12.37 show the configuration in Figure 12.35 represented by a composite grid. The similarity between Figures 12.35 and 12.36 demonstrates the convergence of these two different approaches. Tree levels and cells need not correspond directly with grid levels and zones; that is, multiple particles (and cells) from multiple levels would be collected to form a single grid level of appropriate resolution aligned with the tree cells. Figure 12.11 shows a larger, more realistic two-dimensional tree for which a similar discussion applies.
Figure 12.35:
A Collapsed Representation of a Small, Two-dimensional Barnes-Hut
Tree Containing 32 Particles
Figure 12.36:
The Flattened Tree in Figure 12.35 Interpreted as a Composite Grid
Figure 12.37:
Another View of the Composite Grid in Figure 12.36, Showing the Individual Grid Levels from Which It Is Constituted
This relationship stems from the grid-based algorithm's reliance on the locality of the discrete operator and the particle-based schemes' similar utilization of locality to efficiently collect, combine, and redistribute the multipole moments. In the Poisson case, the locality stems from the regularity of harmonic functions which allow accurate approximation of the smooth, far-field solution by low-order representations [ Almgren:91a ]. Barnes-Hut requires the locality of the tree not just as a framework for the algorithm but to provide the ability to selectively descend into subcubes as needed during the computation, allowing Salmon to create ``locally essential'' data sets per processor [ Salmon:90a ]. Locality is common to and useful for many loosely synchronous parallel algorithms [ Fox:88a ].
This union of hierarchies provides opportunities beyond similar programming structure [Anderson:90b], [Katzenelson:89a]: It allows easier synthesis of combined particle and mesh algorithms and allows hierarchy-building developments to benefit both simulation methods. An additional advantage of the oct-tree over the binary tree (recursive bisection) for dividing space is evident when combining particle and mesh algorithms: The spatially divided oct-tree allows easy alignment with a mesh, while the binary tree does not easily overlay a mesh or another tree [Samet:88a]. The parallel implementation of the Barnes-Hut code by Salmon [Salmon:90a], including domain decomposition and tree construction, provides insights applicable to adaptive mesh refinement on massively parallel, multiple-instruction multiple-data (MIMD) computers. The locality of the algorithms provides precisely the structure necessary for efficient parallel domain decomposition and ordered, hypercubelike communication on MIMD architectures.
An astrophysical model combining a smooth fluid for gas dynamics with discrete particles representing massive objects can be computed entirely on a mesh or with a mixed simulation. The block structures available in the AFAC algorithm allow arbitrarily shaped, nested regions of rectangular meshes to be used as the relaxation grid for a multilevel algorithm; these regions can directly represent the partially complete subcubes present in the oct-tree data structures frequently used in three-dimensional particle simulations. When combining both methods, the density of mass points is no longer sufficient as an estimate of the necessary grid resolution, so additional criteria based upon acceptable error in other aspects of the simulation (for example, accurately reproducing shocks) will affect the construction of the mesh. But the grid can adapt to these constraints, and the hierarchy still provides the multipole information at points of interest.
If the method of local corrections is incorporated to provide greater accuracy for local interactions, the neighboring regions requiring correction can utilize the Barnes-Hut test of opening angle or the Salmon test of cumulative error contribution [ Salmon:92a ] instead of a direct proximity calculation. The correction can be calculated using a multipole expansion instead of the direct particle-particle interaction, which improves efficiency for the worst-case scenario of dense clusters. While the same machinery can be used to solve the entire particle problem with a multipole method, some boundary conditions may be much harder to implement, necessitating the use of a local correction grid method.
Grid-based particle simulation algorithms continue to provide an effective technique for studying systems of pointlike particles in addition to continuum systems. These methods are a useful alternative to gridless simulations, which cannot incorporate fluid interactions or complicated boundary conditions as easily or effectively. While the approach is quite different, the tree structure and enhanced accuracy criteria which are the bases of multipole methods are equally applicable as the fundamental structure of an adaptive mesh refinement algorithm. The two techniques complement each other well and can provide a useful environment both for studying mixed particle-continuum systems and for comparing results even when a mesh is not necessitated by the physically interesting aspects of the modelled system. The hierarchical structure naturally occurs in problems which demonstrate locality, such as systems governed by Poisson's Equation.
Implementations for parallel, distributed-memory computers gain direct benefit from the locality. Because both the grid-based and particle-based methods form the same hierarchical structure, a common data partitioning can be employed. A hybrid simulation using both techniques implicitly has the information for both components, particle and fluid, at hand on the local processor node, simplifying the software development and increasing the efficiency of computing such systems.
Considerations such as the efficiency of a deep, grid-based hierarchy with few or even one particle per grid cell need to be explored. Current particle-based algorithm research comparing computational accuracy against grid resolution (i.e., one can utilize lower computational accuracy with a finer grid, or less refinement with higher computational accuracy) will strongly influence this result. Also, the error created by interpolating the particles onto a grid and then solving the discrete equation must be addressed when comparing gridless and grid-based methods.
Essentially, all the work of C³P used the message-passing model, with the application scientist decomposing the problem by hand and generating C (and sometimes Fortran) plus message-passing code to express the parallel program. This book is designed to show that this message-passing model is effective. It gets good performance, and experienced users find it convenient to use, since it is the most powerful approach and can express essentially all problems as long as the software is suitably embellished with, if necessary, the functionality described in Chapters 15, 16, and 17. However, we can regard the success of message passing for parallel computing as comparable to the success of machine-language programming for conventional machines. This was how early computers were programmed, and it is still used today to get optimal performance for computational kernels and libraries. However, the overwhelming majority of lines of sequential code are developed not with machine language but with high-level languages such as Fortran, C, or even higher level object-oriented systems. There are at least two reasons to seek a higher level approach than message passing for parallel computing, reasons that are shared by the machine-language analogy.
Figure 13.1:
The Initial Integrated FortranD Environment
We can illustrate the portability issues with two anecdotes from C³P. Our original (Cosmic Cube and Mark II) hypercubes did not allow the overlap of communication and calculation. We then carefully designed the Mark IIIfp to allow the performance enhancement offered by this overlap. However, we made little use of this hardware feature because all our codes, algorithms, and software support (CrOS) had been developed for the original hardware. Even the ``Marine Corps'' of C³P was not willing to recode applications and systems software to gain the extra performance. Our software did port between MIMD machines as they evolved, and in this sense message passing is portable. However, the ``optimal'' message-passing implementation is hardware dependent and nonportable. The goal of higher level software systems is to rely on compilers and runtime systems to provide such optimization. As a second anecdote, we note that C³P shared a CM-2 with Argonne National Laboratory. C³P's use of this machine was disappointing even though several of our applications, such as QCD in Chapter 4, were very suitable for its SIMD architecture. We had excellent parallel (QCD) codes, which we ran in production, but these were written with message passing and could not run on the CM-2. We were not willing to recode in CMFortran to use the SIMD machine for this problem.
C³P was correct to concentrate on message passing on its MIMD machines; this is the only way to get good performance on most (excepting the CM-5) MIMD machines, even in 1992, ten years after we started. However, the enduring lesson of C³P was that ``Parallel Computing Works.'' There is no reason that our particular software approach should endure in the same fashion. Rather, we wish to embody the lessons of C³P's work in better and higher level software systems.
Table 13.1:
Reasons to build parallel languages on top of existing languages, especially Fortran, C, and C++
In 1987, Fox and Kennedy shared a crowded Olympus Airways flight from Athens to New York. Their conversations were key in establishing the collaboration on FortranD [Bozkus:93a;93b], [Choudhary:92c;92d;92e], [Fox:91e], [Ponnusamy:92c]. This combined the parallel compiler expertise of Kennedy [Callahan:88e], [Hiranandani:91a;91b] with C³P's wisdom in the practical use of parallel machines. Again, Fox's move to Syracuse allowed him to compare the successes of CMFortran on the SIMD CM-2 with those of message passing on the C³P MIMD emporium. He concluded that one could use high-level data-parallel Fortran for both SIMD and MIMD machines. This evolved FortranD from its initial Fortran 77D implementation to include a Fortran 90D version [Wu:92a], illustrated in Figure 13.1. Section 13.3 describes some of the experiments leading to this realization. We will not describe data-parallel Fortran in detail because the situation is still quite fluid, and this is an area that has grown spectacularly since 1990, when C³P finished its project.
Table 13.2:
Features of the Fortran(C) Plus Message-Passing Paradigm
Section 13.2 describes a prototype software tool built at Caltech and Rice by Vas Balasundaram and Uli Kremer to enable users to experiment with different decompositions. This was a component of the FortranD project set up as part of the NSF Center for Research in Parallel Computation (CRPC). FortranD was set up as a scalable language, that is,
``We may need to rewrite our code for a parallel machine, but the resulting scalable (FortranD) code should run with high efficiency on `all' current and future anticipated machines.''
Many new parallel languages have been proposed (OCCAM is a well-known example [Pritchard:91a]), but none are ``compelling''; that is, they do not solve enough parallel issues to warrant adoption. Thus, the recent trend has been to adapt existing languages such as Fortran [Brandes:92a], [Callahan:88e], [Chapman:92a], [Chen:92b], [Gerndt:90a], [Merlin:92a], [Zima:88a], C++ [Bodin:91a], C [Hamel:92a], [Hatcher:91a;91b], and Lisp. The latter is illustrated by the successful *Lisp, a parallel Lisp implementation available on the CM-1, 2, and 5. Table 13.1 summarizes some of the issues involved in choosing to adopt a new language rather than modifying an old one. Table 13.2 summarizes the message-passing approach and why we might choose to replace it by a higher level system, such as data-parallel C or Fortran, as summarized in Table 13.3. We were impressed by the C* language offered on the CM-2; Section 13.6 describes an early experiment to develop a loosely synchronous version of this. We should probably have explored this more thoroughly, although at the time we did not perceive it as our mission and realized this project would require major resources to develop a system with good performance. Indeed, the performance of the early CM-2 C* compiler was poor, and this also discouraged us. Quinn and Hatcher implemented a similar but more restrictive C MIMD compiler [Hatcher:91a]. ASPAR, described in Section 13.5, had similar goals to Fortran 77D, although it was targeted more as a migration tool than as an efficient complete compiler.
Table 13.3:
Issues in Data-Parallel Fortran Programming Paradigm
FortranD extends Fortran with a set of directives [Fox:91e], which help the compiler produce good code on a parallel machine. These directives include those specifying the decomposition of the data-parallel arrays onto the target hardware. The language includes forms of parallel loop (Forall and DO independent) for which parallelization can be asserted without a difficult dependence analysis. The run time library implements optimized parallel functions operating on the data-parallel arrays. Fortran 90D also includes the parallelism implied by the explicit array notation; for example, if A, B, and C are arrays of the same size, A=B+C is executed in parallel. This CRPC research was based in important ways on the research of C³P. Further, during 1992, an informal forum representing all the major players in the parallel computing arena agreed on a new industry-standard language, High Performance Fortran or HPF [Kennedy:93a]. This embodies all the essential ideas of FortranD, including the full Fortran 90 syntax. We have modified FortranD so that HPF is a subset of FortranD. The CRPC FortranD project continues as a research compiler to investigate extensions of HPF to handle more general problems and unsolved issues such as parallel I/O [Bordawekar:93a], [Rosario:93a]. We expect that data-parallel languages should eventually be able to express nearly all loosely synchronous problems, that is, the vast majority of scientific and engineering computations.
Table 13.4:
High Performance Fortran (HPF) and its extensions
The scope of HPF and FortranD is summarized in Table 13.4 . Table 13.4 (a) roughly covers both the synchronous and embarrassingly parallel calculations of Chapters 4 , 6 , 7 , and 8 . Note that we include computations such as the Kuppermann and McKoy chemical reaction problems in Chapter 8 , which mix the synchronous and embarrassingly parallel classes. The original FortranD [ Fox:91e ] and the initial HPF language [ Kennedy:93a ] should be able to express these two problem classes in such a way that the compiler will get good performance on MIMD and, for synchronous problems, SIMD machines [Choudhary:92d;92e]. Table 13.4 (b) covers the loosely synchronous problems of Chapters 9 and 12 , which need HPF extensions to express the irregular structure. We intend to incorporate the ideas of PARTI [ Berryman:91a ], [ Saltz:91b ] into FortranD as a prototype of an extended HPF that could handle loosely synchronous problems. The difficult applications in Sections 12.4 , 12.5 , 12.7 , and 12.8 have a hierarchical tree structure that is not easy to express [ Bhatt:92a ], [ Blelloch:92a ], [ Mou:90a ], [ Singh:92a ]. Table 13.4 (c) indicates that we have not yet studied HPF and FortranD for signal processing applications, although the iWarp group at Carnegie Mellon University has developed high level languages APPLY and ADAPT for this problem class [ Webb:92a ]. Table 13.4 (d) notes that we cannot express in FortranD and HPF the difficult asynchronous applications introduced in Chapter 14 .
We expect this study and implementation of data-parallel languages to be a growing and critical area of parallel computing.
In Section 13.7 , we contrast hierarchical and distributed memory systems. Both require data locality and we expect that data parallel languages such as High Performance Fortran will be able to use the HPF directives to improve performance of sequential machines by exploiting the cache and other levels of memory hierarchy better.
Here we discuss the trade off between message-passing and data-parallel languages from the problem architecture point of view developed in Chapter 3 .
We return to Figure 3.4, which expressed computation as a sequence of maps. We elaborate this in Figure 13.2, concentrating on the map of the (numerical formulation of the) problem onto the computer. This map can be performed in several stages reflecting the different software levels. Here, we are interested in the high-level software map. One often refers to the intermediate stage of this map as the virtual machine (VM), since one can think of it as abstracting the specific real machine into a generic VM. One could perhaps more accurately consider it a virtual problem, since one is expressing the details of a particular problem in the language of a general problem of a certain class. Naively, one can say in Figure 13.2 that this intermediate level is ``nearer'' the problem than the computer. One often thought of CMFortran as a language for SIMD machines. This is not accurate; rather, it is a language for synchronous problems (i.e., a particular problem architecture) which can be executed on all machine architectures. This is illustrated by the use of CMFortran on the MIMD CM-5 and by the HPF (FortranD) discussion of the previous subsection. These issues are summarized in Table 13.5. Generally, we believe that high-level software systems should be based on a study of problems and their architectures rather than on machine characteristics.
Figure 13.2:
Architecture of ``Virtual Problem'' Determines Nature of
High-Level Language
Figure 13.3 (Color Plate) illustrates the map of problem onto machine, emphasizing the different architectures of both. Here we regard message passing as a (low-level) paradigm that is naturally associated with a particular machine architecture; that is, it reflects a virtual machine: the generic MIMD architecture. One has a trade-off in languages between features optimized for a particular problem class and those optimized for a particular machine architecture. The figure is also drawn so as to emphasize that HPF corresponds to a virtual problem ``near'' the problem, while Fortran plus message passing is a paradigm ``near'' the computer.
Figure 13.3:
Problem architectures mapped into machine
architectures.
Figure 13.4 (Color Plate) illustrates the compilation and migration processes from this point of view. HPF is a language that reflects the problem structure. It is difficult but possible to produce a compiler that maps it onto the two machine (SIMD and MIMD) architectures in the figure. Fortran-plus message passing expresses the MIMD computer architecture. It is typically harder for the user to express the problem in this paradigm than in the higher level HPF. However, it is quite easy for the operating system to map explicit message passing efficiently onto a MIMD architecture. However, this is not true if one wishes to map message passing to a different architecture (such as a SIMD machine) where one must essentially convert (``compile'') the message passing back to the HPF expression of the problem. This is typically impossible as the message-passing formulation does not have all the necessary information contained in it. Expressing a problem in a specific language often ``hides'' information about the problem that is essential for parallelization. This is why much of the existing Fortran 77 sequential code cannot be parallelized. Critical information about the underlying problem cannot be discovered except at run time when it is hard to exploit. We discuss this point in more detail in the following subsection.
Figure 13.4:
Migration and compilation in the map of problems to computers.
Table 13.5:
Message Passing, Data-Parallel Fortran, Problem
Architectures
In Section 3.3, we noted that the concepts of space and time are not preserved in the mappings between complex systems defined in Equation 3.1. We can use this to motivate some advantage in using the array notation of Fortran 90. Consider a complex problem whose data domain is expressed in two Fortran arrays A and B declared with, say, DIMENSION A(100,100), B(100,100).
Suppose some part of the program involves adding the arrays, which is expressed as the single array statement

      A = A + B

in Fortran 90 (Equation 13.1) and as

      DO 1 J = 1,100
      DO 1 I = 1,100
 1    A(I,J) = A(I,J) + B(I,J)

in Fortran 77. In this last form, the data-parallel spatial manipulation of Equation 13.1 is converted into 10,000 time steps. In other words, Fortran 77 has not preserved the spatial structure of the problem. The task of a parallelizing Fortran 77 compiler is to reverse this procedure by recognizing that the sequential (time-stepped) DO loops are ``just'' a spatially (data-)parallel expression. We find the mappings:

      spatial structure (problem) -> temporal structure (Fortran 77 DO loops) -> space plus time (node program distributed over nodes).
Note that the final parallel computer implementation maps the original spatial structure into a combination of time (the ``node'' program) and space (distribution) over nodes.
We can attribute some of the difficulties in producing an effective parallelizing Fortran 77 compiler to this unfortunate mapping of space into time (control) shown in Equation 13.4. In the trivial example of Equation 13.3, one can undo this ``wrong,'' but in general there is not enough compile-time information in a Fortran 77 code to recover the original spatial parallelism. Viewed in this way, Fortran plus message passing also does not preserve the spatial structure, but rather maps it into a mix of space (the message-passing parallelism) and time (node Fortran).
Programming a distributed-memory parallel computer is a complicated task. It involves two basic steps: (1) specifying the partitioning of the data, and (2) writing the communication that is necessary in order to preserve the correct data flow and computation order. The former requires some intellectual effort, while the latter is straightforward but tedious work.
We have observed that programmers use several well-known tricks to optimize the communication in their programs. Many of these techniques are purely mechanical, relying more on clever juxtapositions and transformations of the code than on a deep knowledge of the algorithm. This is not surprising, since once the data domain has been partitioned, the data dependences in the program completely define the communication necessary between the separate partitions. It should, therefore, be possible for a software tool to automate step (2), once step (1) has been accomplished by the programmer.
This would allow the program to be written in a traditional sequential language extended with annotations for specifying data distribution, and have a software tool or compiler mechanically generate the node program for the distributed-memory computer. This strategy, illustrated by stages II and III in Figure 13.5 , is being studied by several researchers [ Callahan:88d ], [ Chen:88b ], [Koelbel:87a;90a], [ Rogers:89b ], [ Zima:88a ].
Figure 13.5:
The Program Development Process
What is missing in this scheme? Although the tedious step has been automated, the hard intellectual step of partitioning the data domain is still left entirely to the programmer. The choice of a partitioning strategy often involves some deeper knowledge of the algorithm itself, so we clearly cannot hope to automate this process completely. We could, however, provide some assistance in the data partitioning process, so that the programmer can make a better choice of partitioning schemes from all the available options. This section describes the design of an interactive data partitioning tool that provides exactly this kind of assistance.
The ultimate goal of the programmer is peak performance on the target computer. The realization of peak performance requires the understanding of many subtle relationships between the algorithm, the program, and the target machine architecture. Factors such as input data size, data dependences in the code, target machine characteristics, and the data partitioning scheme are related in very nonintuitive ways, and jointly determine the performance of the program. Thus, a data partitioning scheme that is chosen purely on the basis of some algorithmic property may not always be the best choice.
Let us examine the relationship between these aspects more closely, to illustrate the subtle complexities that are involved in choosing the partitioning of the data domain. Consider the following program:
      double precision A(N, N), B(N, N)

      do k = 1, cycles
         do j = 1, N
            do i = 2, N-1
               A(i, j) = f( B(i-1, j), B(i+1, j) )
            enddo
         enddo
         do j = 2, N-1
            do i = 2, N-1
               B(i, j) = g( A(i-1, j), A(i+1, j), A(i, j), A(i, j-1), A(i, j+1) )
            enddo
         enddo
      enddo
      end

example (A, B, N)
Here f and g represent functions with 4 and 10 double-precision floating-point operations, respectively. This program segment does not represent any particular realistic computation; rather, it was chosen to illustrate all the aspects of our argument using a small piece of code. The program segment was executed on 64 processors of an nCUBE, with array sizes ranging from N = 64 to N = 320. A and B were first partitioned as columns, so that each processor was assigned N/64 successive columns. The program was then run once again, this time with A and B partitioned as blocks, so that each processor was assigned a block of (N/8) by (N/8) elements. The resulting execution and communication times for the column and block partitioning schemes are shown in Figure 13.6. The communication time was measured by removing all computation in the loops.
Figure 13.6:
Timing results on an nCUBE, using 64 processors.
When employing a column partitioning scheme for arrays A and B, communication is only necessary after the first j loop. Each processor has to exchange boundary values with its left and right neighbors. In a block partitioning scheme, each processor has to communicate with its four neighbors after the first loop and with its north and south neighbors after the second loop. For small message lengths, the communication cost is dominated by the message startup time, whereas the transmission cost begins to dominate as the messages get longer (i.e., more data is exchanged at each communication step). This explains why the communication cost for the column partition exceeds that for the block partition once the arrays grow beyond a crossover size. It is clear from the graph that column partitioning is preferable below this crossover size, and block partitioning is preferable for larger sizes.
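The crossover can be made quantitative with the usual linear message-cost model (a generic model, not necessarily the one used by the tool). Writing t_s for the startup time and t_w for the per-element transmission time, the boundary exchanges described above cost, per outer (k) iteration,

T_{\mathrm{column}} \;\approx\; 2\,(t_s + N\,t_w),
\qquad
T_{\mathrm{block}} \;\approx\; 6\,\Bigl(t_s + \tfrac{N}{8}\,t_w\Bigr),

so the block scheme pays three times as many startups but moves less data, and it wins once N exceeds roughly 3.2\, t_s / t_w.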
The steps in the execution time graphs are caused mainly by load imbalance effects. For example, the step between N = 128 and N = 129 for the column partition is due to the fact that for size 129 one subdomain has an extra column, so that the processor assigned to that subdomain is still busy after all the others have finished, causing load imbalance in the system. Similar behavior can be observed for the block partition, but here the steps occur at smaller increments of the array size N. The steps in the communication time graphs are due to the fixed packet size on the nCUBE: messages that are even a few bytes longer need an extra packet to be transmitted.
The above example indicates that several factors contribute to the observed performance of a chosen partitioning scheme, making it difficult for a human to predict this behavior statically. Our aim is to make the programmer aware of these performance effects without having to run the program on the target computer. We hope to do this by providing an interactive tool that can give performance estimates in response to a data layout specification. The tool's performance estimates will allow the programmer to gauge the effect of a data partitioning scheme and thus provide some guidance in making a better choice.
When using the tool we envision, the programmer will select a program segment for analysis, and the system will provide assistance in choosing an efficient data partitioning for the computation in that program segment, for various problem sizes. In a first step, the user determines a set of reasonable partitionings based on the data dependence information and interprocedural analysis information provided by the tool. An important component of the system is the performance estimation module, which is subsequently used to select the best partitionings and distributions from among those examined. In the present version, the do loop is the only kind of program segment that can be selected. For simplicity, the set of possible partitions of an array is restricted to regular rectangular patterns such as by row, by column, or by block for a two-dimensional array and their higher dimensional analogs for arrays of larger dimensions. This permits the examination of all reasonable partitionings of the data in an acceptable amount of time.
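For a two-dimensional array, restricting the search to regular rectangular patterns means the candidate layouts are essentially the factorizations of the processor count. The following toy loop (the names and output format are purely illustrative) enumerates them:

#include <stdio.h>

int main(void)
{
    int p = 64;                          /* number of processors            */
    /* Every factorization p = pr * pc is one regular rectangular layout:
     * pr = 1 gives "by column", pc = 1 gives "by row", the rest "by block". */
    for (int pr = 1; pr <= p; pr++) {
        if (p % pr != 0) continue;
        int pc = p / pr;
        printf("%2d x %2d  (%s)\n", pr, pc,
               pr == 1 ? "by column" : pc == 1 ? "by row" : "by block");
    }
    return 0;
}

This small search space is what makes it feasible to examine all reasonable partitionings in an acceptable amount of time.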
The tool will permit the user to generalize from local partitionings to layouts for an entire program in easy steps, using repartitioning and redistribution whenever it leads to a better performance overall. In addition, the tool will support many program transformations that can lead to more efficient data layouts.
The principal value of such an environment for data partitioning and distribution is that it supports an exploratory programming style in which the user can experiment with different data partitioning strategies, and estimate the effect of each strategy for different input data sizes or different target machines without having to change the program or run the program each time.
Given a sequential Fortran program and a selected program segment (which in the preliminary version can only be a loop nest), the tool provides assistance in deriving a set of reasonable data partitions for the arrays accessed in that segment. The assistance is given in the form of data dependence information for variables accessed within the selected segment. When partitioning data, we must ensure that the parallel computations done by all the processors on their local partitions preserve the data dependence relations in the sequential program segment. If the computations done by the processors on the distributed data satisfy all the data dependences, the results of the computation will be the same as those produced by a sequential execution of the original program segment. There are two ways to achieve this: (1) by ``internalizing'' data dependences within each partition, so that all values required by computations local to a processor are available in its local data subdomain; or (2) by inserting appropriate communication to get the nonlocal data.
Let us consider a sample program segment and see how data dependence information can be used to help derive reasonable data partitionings for the arrays accessed in the segment.
      do j = 1, n
         do i = 1, n
            A(i, j) = f( A(i-1, j) )
            B(i, j) = g( A(i, j), B(i, j-1), B(i, j) )
         enddo
      enddo

P1. Example program segment.
Here f and g represent arbitrary functions; their exact nature is irrelevant to this discussion. When the programmer selects the ``do i'' loop, the tool indicates that there is one data dependence carried by the i loop: the dependence of A(i, j) on A(i-1, j). This dependence indicates that the computation of an element of A cannot be started until the element immediately above it in the previous row has been computed. The programmer then selects the outer ``do j'' loop to get the data dependences carried by the j loop. There is one such dependence, that of B(i, j) on B(i, j-1). This dependence indicates that the computation of an element of B cannot be started until the element immediately to the left of it in the previous column has been computed. Figure 13.7(a) illustrates the pattern of data dependences for the above program segment.
Figure 13.7:
Data Dependences Satisfied by Internalization and Communication for the Partitioning Schemes (a) A by Column, B by Column; (b) A by Column, B by Row; and (c) A by Block, B by Block. Dotted lines represent partition boundaries and numbers indicate virtual processor ids (the figures are shown for p = 4 virtual processors). For clarity, only a few of the dependences are shown.
The pattern of data dependences between references to elements of an array gives the programmer clues about how to partition the array. It is usually a good strategy to partition an array in a manner that internalizes all data dependences within each partition, so that there is no need to move data between the different partitions that are stored on different processors. This avoids expensive communication via messages. For example, the data dependence of A(i, j) on A(i-1, j) can be satisfied by partitioning A in a columnwise manner, so that the dependences are ``internalized'' within each partition. The data dependence of B(i, j) on B(i, j-1) can be satisfied by partitioning B row-wise, since this would internalize the dependences within each partition.
It is not enough to examine only the dependences that arise from references to the same array. In some cases, the data flow in the program implicitly couples two different arrays together, so that the partitioning of one affects the partitioning of the other. In our example, each point B(i, j) also requires the value A(i, j). We treat this as a special kind of data dependence called a value dependence (read ``B is value dependent on A''), to distinguish it from the traditional data dependence that is defined only between references to the same array. This value dependence must also be satisfied either by internalization or by communication. Internalization of the value dependence is possible only by partitioning B in the same manner as A, so that each B(i, j) and the value A(i, j) required by it are in the same partition.
Based on the pattern of data dependences in the program segment, the following partitioning choices can be derived: (1) A by column and B by column; (2) A by column and B by row; or (3) A by block and B by block (the three schemes illustrated in Figure 13.7).
The partitioning of A by row and B by column was not considered among the possible choices because, in this scheme, none of the dependences are internalized, thus requiring greater communication compared to (1), (2) or (3). Communication overhead is a major cause of performance degradation on most machines, so a reasonable first choice would be the partitioning scheme that requires the least communication. This can be determined either by analyzing the number of dependences that are cut by the partitioning (indicating the need for communication), or more accurately using the performance estimation module that is described in the next section.
For the selected program segment, the programmer picks one of the choices (1) through (3), and specifies the data partitioning via an interface provided by the tool. The tool responds by creating an internal data mapping that specifies the mapping of the data to a set of virtual processors. The number of virtual processors is equal to the number of partitions indicated by the data partitioning. The mapping of the virtual processors onto the physical processors is assumed to be done by the run time system, and this mapping is unspecified in the software layer. Henceforth, we will use the term ``processor'' synonymously with ``virtual processor.'' The internal data mapping is used by the performance estimator to compute an estimate of communication and other costs for the program segment. It is also used by the tool to determine the data that needs to be communicated between the processors.
Let us continue with our example program segment, and see how the internal mapping is constructed for partitioning (2), that is, A partitioned by column and B by row. The data mappings for the other two cases can be constructed in a similar manner. Let A and B be of size n-by-n and the number of (virtual) processors be p. For simplicity we assume that p divides n. The following two data mappings are computed: A$(k) = A(1:n, (k-1)n/p+1 : kn/p) and B$(k) = B((k-1)n/p+1 : kn/p, 1:n), for k = 1, ..., p.
The internal data mapping is used to solve the following two problems: (1) determining, for each processor, the array sections (and hence the statement instances) it owns and must compute; and (2) the inverse problem of determining, for a given array element, the processor that owns it and must supply it.
A useful technique that we will subsequently use on these sections is called ``translation.'' Translation refers to the conversion of an accessed section computed with respect to a particular loop to the section accessed with respect to an enclosing loop. For example, consider a reference to a two-dimensional array within a doubly nested loop. The section of the array accessed within each iteration of the innermost loop is a single element. The same reference, when evaluated with respect to the entire inner loop (i.e., all iterations of the inner loop) may access a larger section, such as a column of the array. If we evaluated the reference with respect to the outer loop (i.e., all iterations of the outer loop), we may notice that the reference results in an access of the entire array in a columnwise manner. Translation is thus a method of converting array sections in terms of enclosing loops, and we will denote this operation by the symbol `` ''.
The tool uses (1) to determine which processors should do what computations. The general rule used is: each processor executes only those program statements whose l -values are in its local storage. The l -values computed by a processor are said to be owned by the processor. In order to compute an l -value, several r -values may be required, and not all of them may be local to that processor. The inverse mapping (2) is used to determine the set of processors that own the desired r -values. These processors must send the r -value they own to the processor that will execute the statement.
The data mapping scheme described above works only for arrays. Scalar variables are assumed to be replicated, that is, every processor stores a copy of the scalar variable in its local memory. By the rule stated earlier, this implies that any statement that computes the value of a scalar is executed by all the processors.
The communication analysis algorithm takes the internal data mappings, the dependence graph, and the loop nesting structure of the specified program segment as its input. For each processor the algorithm determines information about all communications the processor is involved in. We will now illustrate the communication analysis algorithm using the example program segments P1, P2, and P3, where P2 is derived from P1, and P3 from P2, respectively, by a transformation called loop distribution .
Substantial performance improvement can be achieved by performing various code transformations on the program segment. For example, the loop-distribution transformation [ Wolfe:89a ] often helps reduce the overhead of communication. Loop distribution splits a loop into a set of smaller loops, each containing a part of the body of the original loop. Sometimes, this allows communication to be done between the resulting loops, which may be more efficient than doing the communication within the original loop.
Consider the program segment P1. If A is partitioned by column and B by row, communication will be required within the inner loop to satisfy the value dependence of B on A. Each message communicates a single element of A. For small message sizes and a large number of messages, the fraction of communication time taken up by message startup overhead is usually quite large. Thus, program P1 will most likely give poor performance because it involves the communication of a large number of small messages.
However, if we loop-distributed the inner do i loop over the two statements, the communication of A from the first do i loop to the second do i loop can be done between the two new inner loops. This allows each processor to finish computing its entire column partition of A in the first do i loop, and then send its part of A to the appropriate processors as larger messages, before starting computation of a partition of B in the second do i loop. This communication is done only once for each iteration of the outer do j loop, that is, a total of O(n) communication steps. In comparison, program P1 requires communication within the inner loop, which gives a total of O(n^2) communication steps:
      do j = 2, n
         do i = 2, n
            A(i, j) = f( A(i-1, j) )
         enddo
         do i = 2, n
            B(i, j) = g( A(i, j), B(i, j-1), B(i, j) )
         enddo
      enddo

P2. After loop distribution of i loop.
The reduction in the number of communication steps also results in greater parallelism, since the two inner do i loops can be executed in parallel by all processors without any communication. This effect is much more dramatic if we apply loop distribution once more, this time on the outer do j loop:
      do j = 2, n
         do i = 2, n
            A(i, j) = f( A(i-1, j) )
         enddo
      enddo
      do j = 2, n
         do i = 2, n
            B(i, j) = g( A(i, j), B(i, j-1), B(i, j) )
         enddo
      enddo

P3. After loop distribution of j loop.
For the same partitioning scheme (i.e., A by column and B by row), we now need only O(1) communication steps, which occur between the two outer do j loops. The computation of A in the first loop can be done in parallel by all processors, since all dependences within A are internalized in the partitions. After that, the required communication is performed to satisfy the value dependence of B on A. Then the computation of B can proceed in parallel, because all dependences within B are internalized in the partitions. The absence of any communication within the loops considerably improves efficiency.
Currently, the tool provides a menu of several program transformations, and the programmer can choose which one to apply. When a particular transformation is chosen by the programmer, the tool responds by automatically performing the transformation on the program segment, and updating all internal information automatically.
For the sake of illustration, let the size of A and B be 8-by-8 (i.e., n = 8), and let the number of (virtual) processors be p = 4. The following is a possible sequence of actions that the programmer could take using the tool.
After examining the data dependences within the program segment as reported by the tool, let us assume that the programmer decides to partition A by column and B by row. The tool computes the internal mapping:
A$(1) = A(1:8, 1:2) and B$(1) = B(1:2, 1:8).
A$(2) = A(1:8, 3:4) and B$(2) = B(3:4, 1:8).
A$(3) = A(1:8, 5:6) and B$(3) = B(5:6, 1:8).
A$(4) = A(1:8, 7:8) and B$(4) = B(7:8, 1:8).
To determine the communication necessary, the tool uses Algorithm COMM, shown in Figure 13.8 . For simple partitioning schemes as found in many applications, the communication computed by algorithm COMM can be parameterized by processor number, that is, evaluated once for an arbitrary processor. In addition, we are also investigating other methods to speed up the algorithm.
Figure 13.8:
Algorithm to Determine the Communication Induced by the Data
Partitioning Scheme
Consider program P1, for example. According to algorithm COMM, when the kth processor executes the first statement, the required communication is the set of (processor, data) pairs for the right-hand-side reference A(i-1, j), where the ranges of i and j are determined by the section of the LHS owned by processor k: in this case, all rows i and the columns j assigned to k (since A is partitioned columnwise). But the partitioning of A ensures that A(i-1, j) lies in the same column block as A(i, j), so the data is always local to k. The set of pairs will, therefore, be empty for any k. Thus, the execution of the first statement with A partitioned by column requires no communication.
When the kth processor executes the second statement, algorithm COMM considers the three right-hand-side references A(i, j), B(i, j-1), and B(i, j). The ranges of i and j are determined by the section of the LHS that is owned by processor k: in this case, the rows i assigned to k and all columns j (since B is partitioned rowwise). The second and third terms will be empty sets, because the row partitioning of B ensures that B(i, j-1) and B(i, j) are always local to k. The first term can be a nonempty set, because processor k owns only a few columns of A, while the range of j spans all the columns. Thus, communication may be required to get the nonlocal element A(i, j) before the kth processor can proceed with the computation of its B(i, j). The dependence from the definition of A(i, j) to its use is loop-independent. Algorithm COMM therefore computes commlevel, the common nesting level of the source and sink of the dependence, to be the level of the inner i loop. The section translated to the level of the inner i loop is simply the single element A(i, j). Thus, each message communicates this single element and the communication occurs within the inner i loop.
The execution of program P1 results in a large number of messages because each message only communicates a single element of A, and the communication occurs within the inner loop. Message startup and transmission costs are specified by the target machine parameters, and the average cost of each message is determined from the performance model. The tool computes the communication cost by multiplying the number of messages by the average cost of sending a single element message. This cost estimate is returned to the programmer.
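The bookkeeping just described can be reproduced with a toy enumeration. The following brute-force loop is not Algorithm COMM itself; it simply counts, under the owner-computes rule and with assumed block sizes of n/p rows or columns per processor, how many remote elements of A each virtual processor must fetch to execute the second statement of P1 with A by column and B by row.

#include <stdio.h>

#define N 8                               /* n = 8, as in the illustration  */
#define P 4                               /* p = 4 virtual processors       */

static int col_owner(int j) { return (j - 1) / (N / P); }   /* A by column  */
static int row_owner(int i) { return (i - 1) / (N / P); }   /* B by row     */

int main(void)
{
    int remote[P] = {0};
    /* Owner-computes: B(i,j) = g(A(i,j), ...) is executed by the processor
     * that owns row i of B; count the elements A(i,j) it does not own.     */
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= N; i++) {
            int k = row_owner(i);
            if (col_owner(j) != k)
                remote[k]++;
        }
    for (int k = 0; k < P; k++)
        printf("processor %d fetches %d remote elements of A\n", k, remote[k]);
    return 0;
}

Each of these fetches is a separate single-element message in P1, whereas P3 collects the same elements into one block per source processor.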
Now consider the program P2, with the same partitioning scheme for A and B. When the kth processor executes the first statement, the required communication as determined by algorithm COMM again involves the reference A(i-1, j), where the range of j is determined by the section of the LHS owned by processor k, in this case the columns assigned to k (since A is partitioned columnwise). Note that in this case the translated section is the column section A(1:7, j) rather than a single element. This is because commlevel is now the level of the outer j loop, so that the section must be translated to the level of the j loop. In other words, the reference to A(i-1, j) in the first statement results in an access of the first seven elements of the jth column of A during each iteration of the j loop. Since A is partitioned columnwise, this section will always be available locally in each processor, so that the above set is empty and no communication is required.
When processor k executes the second statement, the communication required is given by
The second and third terms will be empty sets since the required part of B is local to each k (because B is partitioned rowwise). The first term will be nonempty, because each processor owns only its own columns of A, while the range of j in the first term extends outside those columns. The data required by processor k from another processor q will therefore be a strip of the columns of A owned by q.
This data can be communicated between the two inner do i loops. Each message will communicate a strip of A. Fewer exchanges will be required compared to program P1, because each exchange now communicates a strip of A, and the communication occurs outside the inner loop. Once again, the performance model and target machine parameters are used by the tool to estimate the total communication cost, and this cost is returned to the programmer.
For most target machines, the communication cost in program P2 will be considerably less than in program P1, because of larger message size and fewer messages.
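The difference between the two schemes can be quantified with a simple latency-plus-bandwidth model of the kind the performance estimator embodies. The startup and per-byte times below are placeholders rather than measurements of any particular machine; the point is only that the startup term dominates when many single-element messages are sent, as in P1.

      program message_cost
        ! Compare the cost of sending n values as n single-element messages
        ! (as in P1) with sending them as n/m strips of m elements (as in P2),
        ! using cost = nmsg*(tstartup + nbytes*tbyte).  The constants are
        ! illustrative placeholders only.
        implicit none
        real, parameter :: tstartup = 100.0e-6   ! seconds per message (assumed)
        real, parameter :: tbyte    = 0.5e-6     ! seconds per byte (assumed)
        integer, parameter :: n = 1024           ! elements to communicate
        integer, parameter :: m = 8              ! strip length in the P2-style scheme
        integer, parameter :: bytes_per_elem = 8 ! double precision
        real :: cost_single, cost_strip

        cost_single = real(n)   * (tstartup + bytes_per_elem*tbyte)
        cost_strip  = real(n/m) * (tstartup + m*bytes_per_elem*tbyte)

        print '(a,es10.3,a)', 'n single-element messages: ', cost_single, ' s'
        print '(a,es10.3,a)', 'n/m strip messages:        ', cost_strip,  ' s'
      end program message_cost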
Next, let us consider program P3. Assuming that the same partitioning scheme is used for A and B, the execution of the first loop by the k th processor will require communication given by
But this is an empty set because of the column partitioning of A. Here commlevel is the level of the subroutine that contains the two loops, so the section is translated to this level by substituting the appropriate bounds for i and j. The translated section indicates that the reference in the first statement results in an access of a whole section of A during all iterations of the outer j loop that are executed by processor k.
When the k th virtual processor executes the second loop, the required communication is
The second and third terms will be empty sets because of the row partitioning of B. The first term will be nonempty, and the data required by processor k from another processor q will be a block of the columns of A owned by q. This block can be communicated between the two do j loops.
This communication can be done between the two loops, allowing computation within each of the two loops to proceed in parallel. The number of messages is the fewest for this case because a block of A is communicated during each exchange. Program P3 is thus likely to give superior performance compared to P1 or P2 on most machines. We ran programs P1, P2 and P3 with A partitioned by column and B by row, on 16 processors of the nCUBE at Caltech. The two functions applied in the statements consisted of one and two double-precision floating-point operations, respectively. The results of the experiment are shown in Figure 13.9 . The graphs clearly illustrate the performance improvement that comes from the reduction in the number of messages and the increase in the length of each message.
Figure 13.9:
Timing Results for Programs P1, P2 and P3 on the nCUBE, Using 16
Processors.
Given the results of the communication analysis in a program segment, the performance estimator can be used to predict the performance of that program segment on the target machine. The realization of such an estimator requires a simple static model of performance that is based on (1) target machine parameters such as the number of processors, the message startup and transmission costs, and the average times to perform different floating-point operations; (2) the size of the input data set; and (3) the data partitioning scheme.
We undertook a study of published performance models [ Chen:88b ], [ Fox:88a ], [ Gustafson:88a ], [ Saltz:87b ] for use in the performance estimator, and noticed that these theoretical models did not give accurate predictions in many cases. We concluded that the theoretical models suffered from several deficiencies.
Our effort to correct these defects resulted in an increased complexity of the model, and also necessitated the introduction of several machine-specific features. We felt that this was undesirable, and decided to investigate alternative methods [ Balasundaram:90d ].
We constructed a program that tested a series of communication patterns using a set of basic low-level portable communication utilities. This program, called a ``training set,'' is executed once on the target machine. The program computes timings for the different communication operations and averages them over all the processors. These timings are determined for a sequence of increasing data sizes. Since the graph of communication cost versus data size is usually a linear function, it can easily be described by specifying a few parameters (e.g., the slope). The training set thus generates a table whose entries contain the minimal information necessary to completely define the performance characteristic for each communication utility. This table is used in place of the theoretical model for the purposes of performance prediction.
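A minimal sketch of the fitting step is given below. The (size, time) pairs are invented for illustration, standing in for timings the training set would actually measure on the target machine; the least-squares fit then yields the intercept (startup) and slope (per-byte cost) that one table entry would record.

      program fit_training_set
        ! Least-squares fit of measured communication time versus message size,
        ! producing the (intercept, slope) pair that the training-set table
        ! would store for one utility.  The sample data below are made up.
        implicit none
        integer, parameter :: npts = 5
        real :: nbytes(npts), tmeas(npts)
        real :: sx, sy, sxx, sxy, slope, intercept
        integer :: i

        nbytes = (/ 1024.0, 2048.0, 4096.0, 8192.0, 16384.0 /)     ! bytes (assumed)
        tmeas  = (/ 0.60e-3, 1.05e-3, 1.95e-3, 3.80e-3, 7.40e-3 /) ! seconds (assumed)

        sx = 0.0; sy = 0.0; sxx = 0.0; sxy = 0.0
        do i = 1, npts
           sx  = sx  + nbytes(i)
           sy  = sy  + tmeas(i)
           sxx = sxx + nbytes(i)*nbytes(i)
           sxy = sxy + nbytes(i)*tmeas(i)
        end do
        slope     = (npts*sxy - sx*sy)/(npts*sxx - sx*sx)
        intercept = (sy - slope*sx)/npts

        print '(a,es10.3,a)', 'startup (intercept):   ', intercept, ' s'
        print '(a,es10.3,a)', 'per-byte cost (slope): ', slope, ' s/byte'
      end program fit_training_set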
Figure 13.10 shows some communication cost characteristics created using a part of our training set on 32 processors of an nCUBE. The data space was assumed to be a two-dimensional array that was partitioned columnwise; that is, each processor was assigned a set of consecutive columns. The utilities tested are basic EXPRESS communication operations; their cost characteristics are plotted in the figure.
Figure 13.10:
Communication cost characteristics of some EXPRESS utilities on the
nCUBE
The training set summarizes the characteristics shown in Figure 13.10 as a table with one entry per communication utility, each entry holding the fitted parameters (such as the slope and intercept) of the corresponding cost curve.
The communication cost estimate for a particular data size is then calculated from the table entries and the size of each message packet, which on the nCUBE is 1024 bytes.
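One simple cost model consistent with this description, offered here only as an illustrative assumption rather than the tool's actual formula, charges a startup time plus a per-packet time for each 1024-byte packet needed to carry the data.

      program packet_cost
        ! Assumed (not the book's exact) communication cost model: a startup
        ! cost plus a per-packet cost for each 1024-byte packet.  The two
        ! time constants are illustrative placeholders.
        implicit none
        integer, parameter :: pkt_size = 1024       ! bytes per nCUBE packet
        real, parameter :: tstartup = 120.0e-6      ! assumed startup time (s)
        real, parameter :: tpacket  = 450.0e-6      ! assumed time per packet (s)
        integer :: nbytes

        do nbytes = 512, 4096, 512
           print '(a,i5,a,es10.3,a)', 'bytes = ', nbytes, '  cost = ', &
                cost(nbytes), ' s'
        end do

      contains

        real function cost(n)
          integer, intent(in) :: n
          integer :: npackets
          npackets = (n + pkt_size - 1)/pkt_size    ! ceiling division
          cost = tstartup + npackets*tpacket
        end function cost

      end program packet_cost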
The static performance model is meant primarily to help the programmer discriminate between different data partitioning schemes. Our approach is to provide the programmer with the necessary tools to experiment with several data partitioning strategies, until converging on one that is likely to give satisfactory performance. The tool provides performance-estimate feedback each time the programmer specifies a partitioning.
Our emphasis in this work has been to try to recognize collective communication patterns rather than generate sequences of individual element sends and receives. Algorithm COMM determines this in a very natural way. This is especially important for loosely synchronous problems which represent a large class of scientific computations [ Fox:88a ]. Several communication utilities have been developed that provide optimal message-passing communication for such problems, provided the communication is of a regular nature and occurs collectively [ Fox:88h ].
We believe that our approach can be extended to derive partitioning schemes automatically. Data dependence and other information can be used to compute a fairly restricted set of reasonable data partitioning schemes for a selected program segment. The performance estimation module can then be applied in turn to each of the partitionings in the computed set.
The work described in this section was a joint effort between Caltech and Rice University, as part of the Center for Research on Parallel Computation (CRPC) research collaboration [ Balasundaram:90a ]. The principal researchers were Vasanth Balasundaram and Geoffrey Fox at Caltech, and Ken Kennedy and Ulrich Kremer at Rice. The data partitioning tool described here is being implemented as part of the ParaScope parallel programming environment under development at Rice University [ Balasundaram:89c ].
Near the end of the C P work at Caltech, we did some important experiments using Fortran 90 which formed the basis of aspects of the Fortran D project overviewed in Section 13.1 . These were partly motivated by Fox's change of architectural environment. At Caltech, he was surrounded by MIMD machines and the associated culture; at Syracuse's NPAC facility, the centerpiece in 1990 was a SIMD CM-2. In reading the CM Fortran (Fortran 90) manual, Fox noted that the Fortran 90 run-time support included all the important collective communication primitives (such as combine and broadcast) we had found important in CrOS and Express.
The first experiment involved a climate modelling code using spectral methods [Keppenne:89a;90a]. We had rashly promised a TRW group that we would be able to easily parallelize such a code. However, we had not realized that the code was written in C with extensive C++-like use of pointers. ParaSoft-responsible for the code conversion-was horrified and the task seemed daunting! However, Keppenne was interested in rewriting the code in Fortran 90, which was a ``neat'' language like C++. ParaSoft found that the resultant Fortran 90 code was straightforward to port to a variety of parallel machines, as shown in Tables 13.6 and 13.7 . Note that the new version of the code had an order-of-magnitude-higher performance than the original one on a single CPU CRAY Y-MP. The discipline implied by Fortran 90 allowed both ``outside computational scientists'' and the Cray compiler to ``understand'' the code. We analyzed this process and believe that we could indeed replace our friends at ParaSoft for this problem by a compiler-initially Fortran 90D-which could generate good SIMD and MIMD code. This is, of course, the motivation for the use of the array syntax feature in High Performance Fortran, as it captures the parallelism in a transparent fashion.
Table 13.6:
Logistics of Migration Experiment on Climate Code
Table 13.7:
Performance of a Climate Modelling Computational Kernel. In
each case, only minor (obviously needed) optimizations were performed.
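As a small illustration of why array syntax exposes the parallelism so directly, the kernel below (which is generic and not taken from the climate code) performs the same relaxation sweep first with explicit DO loops and then as a single Fortran 90 array assignment; the array form states the absence of loop-carried dependences explicitly, which is what a CM Fortran or Fortran 90D compiler can exploit.

      program array_syntax_demo
        ! Jacobi-style relaxation written two ways: with explicit DO loops and
        ! with Fortran 90 array syntax.  The array statement carries the same
        ! meaning but states the data parallelism explicitly.
        implicit none
        integer, parameter :: n = 64
        real :: a(n,n), bloop(n,n), barray(n,n)
        integer :: i, j

        call random_number(a)
        bloop = a
        barray = a

        ! Loop form
        do j = 2, n-1
           do i = 2, n-1
              bloop(i,j) = 0.25*(a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
           end do
        end do

        ! Array-syntax form: the whole interior is updated as one statement
        barray(2:n-1,2:n-1) = 0.25*(a(1:n-2,2:n-1) + a(3:n,2:n-1) &
                                  + a(2:n-1,1:n-2) + a(2:n-1,3:n))

        print *, 'max difference between the two forms: ', &
                 maxval(abs(bloop - barray))
      end program array_syntax_demo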
This experiment motivated the Fortran 90D language [ Fox:91f ], [ Wu:92a ], and we followed up the climate experiments with some other simple examples, which are summarized in Table 13.8 . This compares ``optimal hand-coded'' Fortran-plus message-passing code with what we expect a good Fortran 90D (HPF) compiler could produce from the (annotated) Fortran 90 source. The results are essentially perfect for the Gaussian elimination example and reasonable for the FFT. These estimates were borne out in practice [Bozkus:93a;93b] and the prototype Fortran 90D compiler developed at Syracuse produced code that was about 10% slower than the optimal node Fortran 77+ message-passing version.
Table 13.8:
Effectiveness of Fortran 90 on Two Simple Kernels. The
execution time is given as a function of the number of nodes used in
the iPSC2 multicomputer.
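For a kernel like the Gaussian elimination of Table 13.8, the essential computation can also be written with array sections; the rank-one update of the trailing submatrix is the step that a Fortran 90D or HPF compiler would turn into a broadcast plus local updates. The sketch below is an unpivoted factorization written for illustration only and is not the benchmarked code.

      program lu_array_syntax
        ! Unpivoted LU factorization written with Fortran 90 array sections.
        ! The rank-one update of the trailing submatrix is the step an
        ! HPF-style compiler would distribute across the nodes.
        implicit none
        integer, parameter :: n = 6
        real :: a(n,n)
        integer :: k

        call random_number(a)
        a = a + n*eye(n)          ! make the matrix comfortably nonsingular

        do k = 1, n-1
           a(k+1:n,k) = a(k+1:n,k)/a(k,k)                   ! column of L
           a(k+1:n,k+1:n) = a(k+1:n,k+1:n) &                ! trailing update
                - matmul(a(k+1:n,k:k), a(k:k,k+1:n))
        end do

        print *, 'diagonal of U: ', (a(k,k), k = 1, n)

      contains

        function eye(m) result(id)
          ! m-by-m identity matrix
          integer, intent(in) :: m
          real :: id(m,m)
          integer :: i
          id = 0.0
          do i = 1, m
             id(i,i) = 1.0
          end do
        end function eye

      end program lu_array_syntax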
The ability of neural networks to compute solutions to optimization problems has been widely appreciated since Hopfield and Tank's work on the travelling salesman problem [ Hopfield:85b ]. Chapter 11 reviews the general work in C P on optimization and physical and neural approaches. We have examined whether neural network optimization can be usefully applied to compiler optimizations. The problem is nontrivial because compiler optimizations usually involve intricate logical reasoning, but we were able to find an elegant formalism for turning a set of logical constraints into a neural network [ Fox:89l ]. However, our conclusions were that the method will only be viable if and when large hierarchically structured neural networks can be built. The neural approach to compiler optimization is worth pursuing, because such a compiler would not be limited to a finite set of code transformations and could handle unusual code routinely. Also, if the neural network were implemented in hardware, a processor could perform the optimizations at run time on small windows of code.
Figure 13.11 shows how a simple computation is scheduled by a neural network. The machine state is represented at five consecutive cycles by five sets of 20 neurons. The relevant portion of the machine comprises the three storage locations a , b , c and a register r , and the machine state is defined by showing which of the five relevant quantities occupies which location. A firing neuron (i.e., shaded block) in the row for a quantity and the column for r indicates that the quantity is in the register. The neural network is set up to ``know'' that computations can only be done in the register, and it produces a firing pattern representing a correct computation.
Figure 13.11:
A Neural Network Can Represent Machine States (top) and
Generate Correct Machine Code for Simple Computations (bottom).
The neural compiler was conceived by Geoffrey Fox and investigated by Jeff Koller [ Koller:88c ], [Fox:89l;90nn].
ASPAR was developed by Ikudome from C P [ Ikudome:90a ] in collaboration with ParaSoft. It was aimed at aiding the conversion of existing Fortran codes and embodies the experience especially of Flower and Kolawa. ASPAR targets applications involving particular stencil operations on arrays, noting that many sequential stencils need modification for parallel execution. In this way, ASPAR involves a collaboration between user and compiler in the parallelization process. The discussion in this section is due to Flower and Kolawa, and we include some of the introductory material as a contrast to the discussion given in the introductory sections of each chapter in this book, which largely reflect Fox's prejudice.
It is now a widely accepted fact that parallel computing is a successful technology. It has been applied to problems in many fields and has achieved excellent results on projects ranging in scope from academic demonstrations to complete commercial applications, as shown by other sections of this book.
Despite this success, however, parallel computing is still considered something of a ``black art'' to be undertaken only by those with intimate knowledge of hardware, software, physics, computer science and a wealth of other complex areas. To the uninitiated there is something frightening about the strange incantations that abound in parallel processing circles-not just the ``buzz words'' that come up in polite conversation but the complex operations carried out on a once elegant piece of sequential code in order for it to successfully run on a parallel processing system.
It is easy to define various ``degrees of difficulty'' in parallel processing. One such taxonomy might be as follows:
1. In this category fall the complex, asynchronous, real-time applications. A good example of such a beast is ``parallel chess'' [ Felten:88h ] of Section 14.3 , where AI heuristics must be combined with real-time constraints to solve the ill-posed problem of searching the ``tree'' of potential moves.
2. In this area one might put the very large applications of fairly straightforward science. Often, algorithms must be carefully constructed, but the greatest problems are the large scale of the overall system and the fact that different ``modules'' must be integrated into the complete system. An example might be the SDI simulation ``Sim88'' and its successors [ Meier:90a ] described in Section 18.3 . The parallel processing issues in such a code require careful thought but pose no insurmountable problems.
3. Problems such as large-scale fluid dynamics or oceanography [ Keppenne:90b ] mentioned in Section 13.3 often have complex physics but fairly straightforward and well-known numerical methods. In these cases, the majority of the work involved in parallelization comes from analysis of the individual algorithms, which can then often be parallelized separately. Each submodule is then a simpler, tractable problem which often has a ``well-known'' parallel implementation.
4. The simplest class of ``interesting'' parallel programs comprises partial differential equations [ Brooks:82b ], [ Fox:88a ] and the applications of Chapters 4 and 6 . In these cases the parallel processing issues are essentially trivial, but the successful implementation of the algorithm still requires some care to get the details correct.
5. The last class of problems comprises those with ``embarrassing parallelism,'' such as in Chapter 7 -essentially uncoupled loop iterations or functional units. In these cases, the parallel processing issues are again trivial, but the code still requires care if it is to work correctly in all cases.
The ``bottom line'' from this type of analysis is that all but the hardest cases pose problems in parallelization which are, at least conceptually, straightforward. Unfortunately, the actual practice of turning such concepts into working code is never trivial and rarely easy. At best it is usually an error-prone and time-consuming task.
This is the basic reason for ASPAR's existence. Experience has taught us that the complexities of parallel processing are really due not to any inherent problems but to the fact that human beings and parallel computers don't speak the same language. While a human can usually explain a parallel algorithm on a piece of paper with great ease, it is often a significant task to convert that picture to functional code. It is our hope that the bulk of the work can be automated by the use of ``parallelizing'' technologies such as ASPAR. In particular, we believe (and our results so far bear out this belief) that problems in all the previous categories (except possibly (1) above), can be either completely or significantly automated.
To understand the issues involved in parallelizing codes and the difference between ASPAR and other similar tools, we must examine two basic issues involved in parallelizing code: the local and the global views.
The local view of a piece of code is restricted to one or more loops or similar constructs upon which particular optimizations are to be applied. In this case little attention is paid to the larger scale of the application.
The global view of the program is one in which the characteristics of a particular piece of data or a function are viewed as a part of the complete algorithm. The impact of operating on one item is then considered in the context of the entire application. We believe that ASPAR offers a completely new approach to both views.
``Local'' optimization is a method which has been used in compilers for many years and whose principles are well understood. We can see the evolutionary path to ``parallelizing compilers'' as follows.
The goal of automatic parallelization is obviously not new, just as parallel processors are not new. In the past, support for advanced technologies was in the realm of the compiler, which took on the burden, for example, of hiding vectorizing hardware from innocent users.
Performing these tasks typically involves a fairly simple line of thought shown by the ``flow diagram'' in Figure 13.12 . Basically the simplest idea is to analyze the dependences between data objects within loops. If there are no dependences, then ``kick'' the vectorizer into performing all, or as many as it can handle, of the loop iterations at once. Classic vector code therefore has the appearance
      DO 10 I=1,10000
         A(I) = B(I) + C(I)*D(I)
   10 CONTINUE
Figure 13.12:
Vectorizability Analysis
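The same independence that lets a vectorizer act on this loop also lets its iterations be divided among processors. The sketch below simulates the block partitioning a parallelizing compiler might generate; the number of processors and the partitioning rule are illustrative choices, not the output of any particular compiler.

      program block_iterations
        ! Block-partition the independent iterations of
        !     A(I) = B(I) + C(I)*D(I),  I = 1, 10000
        ! over nproc processors.  The outer loop over k simulates the
        ! processors; in a real SPMD program each node would execute only
        ! its own (lo, hi) range.
        implicit none
        integer, parameter :: n = 10000, nproc = 4
        real :: a(n), b(n), c(n), d(n)
        integer :: k, i, lo, hi

        call random_number(b); call random_number(c); call random_number(d)

        do k = 1, nproc
           lo = (k-1)*n/nproc + 1
           hi = k*n/nproc
           do i = lo, hi
              a(i) = b(i) + c(i)*d(i)
           end do
        end do

        print *, 'checksum: ', sum(a)
      end program block_iterations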
We can easily derive parallelizing compilers from this type of technology by changing the box marked ``vectorize'' in Figure 13.12 to one marked ``parallelize.'' After all, if the loop iterations are independent, parallel operation is straightforward. Even better results can often be achieved by adding a set of ``restructuring operations'' to the analysis as shown in Figure 13.13 .
Figure 13.13:
A Parallelizing Compiler
The idea here is to perform complex ``code transformations'' on cases which fail to be independent during the first dependence analysis in an attempt to find a version of the same algorithm with fewer dependences. This type of technique is similar to other compiler optimizations such as loop unrolling and code inlining [ Zima:88a ], [ Whiteside:88a ]. Its goal is to create new code which produces exactly the same result when executed but which allows for better optimization and, in this case, parallelization.
The emphasis of the two previous techniques is still on producing exactly the same result in both sequential and parallel codes. They also rely heavily on sophisticated compiler technology to reach their goals.
ASPAR takes a rather different approach. One of its first assumptions is that it may be okay for the sequential and parallel codes to give different answers!
In technical terms, this assumption removes the requirement that loop iterations be independent before parallelization can occur. In practical terms, we can best understand this issue by considering a simple example: image analysis.
One of the fundamental operations of image analysis is ``convolution.'' The basic idea is to take an image and replace each pixel value by an average of its neighbors. In the simplest case we end up with an algorithm that looks like
      DO 10 I = 2,N-1
         DO 20 J = 2,N-1
            A(I,J) = 0.25*(A(I+1,J)+A(I-1,J)+A(I,J+1)+A(I,J-1))
   20    CONTINUE
   10 CONTINUE
To make this example complete, we show in Figure 13.14 the results of applying this operation to an extremely small (integer-valued) image.
Figure 13.14:
A Sequential Convolution
It is crucial to note that the results of this operation are not as trivial as one might naively expect. Consider the value at the point (I=3, J=3), which has the original value 52. To compute this value we are instructed to add the values at locations A(4,3), A(2,3), A(3,4) , and A(3,2) . If we looked only at the original data from the top of the figure, we might then conclude that the correct answer is simply one quarter of the sum of those four original values.
Note that the source code, however, modifies the array A while simultaneously using its values. As a result, the above calculation accesses the correct array elements, but by the time we get around to computing the value at (3,3), the values to the left and above have already been changed by previous loop iterations. As a result, the correct value at (3,3) is an average in which two of the four contributions have already been updated on previous loop iterations.
Obviously, this is no problem for a sequential program because the algorithm, as stated in the source code, is translated correctly to machine code by the compiler, which has no trouble executing the correct sequence of operations; the problems with this code arise, however, when we consider its parallelization.
The most obvious parallelization strategy is to simply partition the values to be updated among the available processors. Consider, for example, a version of this algorithm parallelized for four nodes.
Initially we divide up the original array A by assigning a quadrant to each processor. This gives the situation shown in Figure 13.15 . If we divide up the loop iterations in the same way, we see that the processor updating the top-left quadrant of the array must, for points adjacent to the quadrant boundary, compute an average in which some of the values are in its own quadrant and the others lie to the right of or below the processor boundary. This is not too much of a problem-on a shared-memory machine, we would merely access the value ``44'' directly, while on a distributed-memory machine, a message might be needed to transfer the value to our node. In neither case is the procedure very complex, especially since we are having the compiler or parallelizer do the actual communication for us.
Figure 13.15:
Data Distributed for Four Processors
The first problem comes in the processor responsible for the data in the top-right quadrant. Here we have to compute an average in which the values ``80'' and ``81'' are local, while the value ``52'' is in another processor's quadrant and therefore subject to the same issues just described for the top-left processor.
The crucial issue surrounds the value ``??'' in the previous expression. According to the sequential algorithm, this processor should wait for the top-left node to compute its value and then use this new result to compute the new data in the top-right quadrant. Of course, this represents a serialization of the worst kind, especially when a few moments' thought shows that this delay propagates through the other processors too! The end result is that no benefit is gained from parallelizing the algorithm.
Of course, this is not the way image analysis (or any of the other fields with similar underlying principles, such as PDEs, fluid mechanics, and so on) is done in parallel. The key fact which allows us to parallelize this type of code despite the dependences is the observation that a large number of sequential algorithms contain data dependences that are not crucial to the correct ``physical'' results of the application.
Figure 13.16:
ASPAR's Decision Structure
In this case, the data dependence that appears to prevent parallelization is also present in the sequential code but is typically irrelevant. This is not to say that its effects are not present but merely that the large-scale behavior of our application is unchanged by ignoring it. In this case, therefore, we allow the processor with the top-right quadrant of the image to use the ``old'' value of the cells to its left while computing new values, even though the processor with the top-left quadrant is actively engaged in updating them at the very same time that we are using them!
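In effect, the parallel code obtained by breaking this dependence behaves like a Jacobi-style update: each processor computes its new values entirely from old values, including an ``old'' boundary strip obtained from its neighbor. The sketch below imitates that behavior on a single processor by double buffering the array; the Express calls that ASPAR would insert to exchange the boundary strips are not shown.

      program old_value_convolution
        ! Convolution computed from a copy of the old array, which is how the
        ! block-parallel version behaves when each processor uses the "old"
        ! boundary values owned by its neighbors.
        implicit none
        integer, parameter :: n = 8
        real :: a(n,n), aold(n,n)
        integer :: i, j

        call random_number(a)
        aold = a                   ! freeze the pre-sweep values

        do j = 2, n-1
           do i = 2, n-1
              a(i,j) = 0.25*(aold(i+1,j) + aold(i-1,j) + aold(i,j+1) + aold(i,j-1))
           end do
        end do

        print *, 'updated interior checksum: ', sum(a(2:n-1,2:n-1))
      end program old_value_convolution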
While this discussion has centered on a particular type of application and the intricacies of parallelizing it, the arguments and features are common to an enormous range of applications. For this reason ASPAR works from a very different point of view than ``parallelizing'' compilers: its most important role is to find data dependences of the form just described-and break them! In doing this, we apply methods that are often described as stencil techniques.
In this approach, we try to identify a relationship between a new data value and the old values which it uses during computation. This method is much more general than simple dependence analysis and leads to a correspondingly higher success rate in parallelizing programs. The basic flow of ASPAR's deliberations might therefore be summarized in Figure 13.16 .
It is important to note that ASPAR provides options to enforce strict dependence checking as well as to override ``stencil-like'' dependences. By adopting this philosophy of checking for simple types of dependences, ASPAR more nearly duplicates the way humans address the issue of parallelization and this leads to its greater success. The use of advanced compilation techniques could also be useful, however, and there is no reason why ASPAR should ``give up'' at the point labelled ``Sequential'' in Figure 13.16 . A similar ``loopback'' via code restructuring, as shown in Figure 13.13 , would also be possible in this scenario and would probably yield good results.
Up to now, the discussion has rested mainly on the properties of small portions of code-often a single loop or a single group of nested loops in practical cases. While this is generally sufficient for a ``vectorizing'' compiler, it is too little for effective parallelization. To make the issues a little clearer, consider the following piece of code:
      DO 10 I=1,100
         A(I) = B(I) + C(I)
   10 CONTINUE
      DO 20 I=1,100
         D(I) = B(I) + C(100-I+1)
   20 CONTINUE
Taken in isolation (the local view), both of these loop constructs are trivially parallelizable and have no dependences. For the first loop, we would assign the first few values of the arrays A, B, and C to the first processor, the next few to the second, and so on until we had accounted for each loop iteration. For the second loop, we would assign the first few elements of A and B and the last few of C to the first node, and so on. Unfortunately, there is a conflict here in that one loop wants to assign values from array C in increasing order while the other wants them in decreasing order. This is the global decomposition problem.
The simplest solution in this particular case can be derived from the fact that array C only appears on the right-hand side of the two sets of expressions. Thus, we can avoid the problem altogether by not distributing array C at all. In this case, we have to perform a few index calculations, but we can still achieve good speedup in parallel.
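A sketch of the resulting decomposition, with A, B, and D block-distributed and C left undistributed (replicated on every node), is given below. The loop over k stands in for the separate nodes of an SPMD execution, and the block sizes are illustrative.

      program replicate_c
        ! Global decomposition in which A, B, D are block-distributed and the
        ! read-only array C is replicated on every node, avoiding the conflict
        ! between increasing- and decreasing-order access to C.
        implicit none
        integer, parameter :: n = 100, nproc = 4, nlocal = n/nproc
        real :: a(n), b(n), c(n), d(n)
        integer :: k, i, lo, hi

        call random_number(b)
        call random_number(c)       ! every node holds all of C

        do k = 1, nproc             ! k plays the role of the node number
           lo = (k-1)*nlocal + 1
           hi = k*nlocal
           do i = lo, hi
              a(i) = b(i) + c(i)       ! C accessed in increasing order
           end do
           do i = lo, hi
              d(i) = b(i) + c(n-i+1)   ! C accessed in decreasing order
           end do
        end do

        print *, 'checksums: ', sum(a), sum(d)
      end program replicate_c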
Unfortunately, life is not usually as simple as presented in this case. In typical codes, we would find that the logic which led to the ``nondistribution'' of array C would gradually spread out to the other data structures with the final result that we end up distributing nothing and often fail to achieve any speedup at all.
Addressing the global decomposition problem poses problems of a much more serious nature than the previous dependence analysis and local stencil methods because, while many clever compiler-related tricks are known to help the local problems, there is little theoretical analysis of more global problems. Only very recently, for example, do we find compilers that perform any kind of interprocedural analysis at all.
As a result, the resolution of this problem is really one which concerns the parallel programming model available to the parallelization tools. Again, ASPAR is unique in this respect.
To understand some of the possibilities, it is again useful to create a classification scheme for global decomposition strategies. It is interesting to note that, in some sense, the complexity of these strategies is closely related to our initial comments about the ``degree of difficulty'' of parallel processing.
This style is the simplest of all. We have a situation in which there are no data dependences among functional units other than initial and final output. Furthermore, each ``function'' can proceed independently of the others. The global decomposition problem is solved by virtue of never having appeared at all.
In this type of situation, the run-time requirements of the parallel processing system are quite small-typically, a ``send'' and ``receive'' paradigm is adequate to implement a ``master-slave'' processing scenario. This is the approach used by systems such as Linda [ Padua:86a ] and Strand [ Foster:90a ].
Of course, there are occasional complexities involved in this style of programming, such as the use of ``broadcast'' or data-reduction techniques to simplify common operations. For this reason higher level systems such as Express are often easier to use than their ``simpler'' contemporaries since they include standard mechanisms for performing commonly occurring operations.
This type of application is typified by areas such as numerical integration or convolution operations similar to those previously described.
Their characteristic is that while there are data dependences among program elements, these can be analyzed symbolically at compile time and catered for by suitable insertion of calls to a message-passing (for distributed-memory) or locking/unlocking (for shared-memory) library.
In the convolution case, for example, we provide calls which would arrange for the distribution of data values among processors and the communication of the ``boundary strip'' which is required for updates to the local elements.
In the integration example, we would require routines to sum up contributions to the overall integral computed in each node. For this type of application, only simple run time primitives are required.
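For the integration case, the required run-time primitive is just a global sum of per-node contributions. The sketch below shows the pattern for a trapezoidal rule; the loop over k again stands in for the nodes, and the Fortran SUM stands in for the combining utility that a library such as Express would supply.

      program distributed_integration
        ! Each "node" integrates its own subinterval of [0,1]; the partial
        ! results are then combined with a global sum (here simply SUM over
        ! the array of contributions).
        implicit none
        integer, parameter :: nproc = 4, nper = 250
        real :: partial(nproc), h, x, total
        integer :: k, i

        h = 1.0/(nproc*nper)
        do k = 1, nproc
           partial(k) = 0.0
           do i = (k-1)*nper, k*nper - 1
              x = i*h
              partial(k) = partial(k) + 0.5*h*(f(x) + f(x+h))   ! trapezoid
           end do
        end do

        total = sum(partial)        ! stands in for the global-sum utility
        print *, 'integral of 4/(1+x**2) on [0,1] = ', total    ! roughly pi

      contains

        real function f(x)
          real, intent(in) :: x
          f = 4.0/(1.0 + x*x)
        end function f

      end program distributed_integration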
Problems such as those encountered in large-scale scientific applications typically have behavior in which the global decomposition schemes for data objects vary in some standard manner throughout the execution of the program, but in a deterministic way which can be analyzed at compile-time: One routine might require an array to be distributed row-by-row, for example, while another might require the same array to be partitioned by columns or perhaps not at all.
These issues can be dealt with during the parallelization process but require much more sophisticated run-time support than those previously described. Particularly if the resulting programs are to scale well on larger numbers of nodes, it is essential that run-time routines be supplied to efficiently perform matrix transposition or boundary cell exchange or global concatenation operations. For ASPAR, these operations are provided by the Express run-time system.
The three categories of decomposition described so far can deal with a significant part of a large majority of ``real'' applications. By this we mean that good implementations of the various dependence analysis, dependence ``breaking'' and run-time support systems can correctly parallelize 90% of the code in any application that is amenable to automatic parallelization. Unfortunately, this is not really good enough.
Our real goal in setting out to produce automatic parallelization tools is to relieve the user of the burden of performing tricky manipulations by hand. Almost by definition, the 10% of each application left over by the application of the techniques described so far is the most complex part and probably represents about 95% of the complexity in parallelizing the original code by hand! So, at this point, all we have achieved is the automatic conversion of the really easy parts of the code, probably at the expense of introducing messy computer-generated code, which makes the understanding of the remaining 10% very difficult.
The solution to this problem comes from the adoption of a much more sophisticated picture of the run time environment.
The three decomposition methods already described suffer from the defect that they are all implemented, except in detail, during the compile-time ``parallelization'' of the original program. Thus, while the particular details of ``which column to send to which other processor'' and similar decisions may be deferred to the runtime support, the overall strategy is determined from static analysis of the sequential source code. ASPAR's method is entirely different.
Instead of enforcing global decomposition rules based on static evaluation of the code, ASPAR leaves all the decisions about global decomposition to the run time system and offers only hints as to possible optimizations, whenever they can safely be determined from static analysis. As a result, ASPAR's view of the previously troublesome code would be something along the lines of
C- I need B and C to be distributed in increasing order.
      DO 10 I=1,100
         A(I) = B(I) + C(I)
   10 CONTINUE
C- I need B to increase and C to decrease.
      DO 20 I=1,100
         D(I) = B(I) + C(100-I+1)
   20 CONTINUE
where the ``comments'' correspond to ASPAR's hints to the run time support.
The advantages of such an approach are extraordinary. Instead of being stymied by complex, dynamically changing decomposition strategies, ASPAR proceeds irrespective of these, merely expecting that the run time support will be smart enough to provide whatever data will be required for a particular operation.
As a result of this simplification in philosophy, ASPAR is able to successfully parallelize practically 100% of any application that can be parallelized at all, with no user intervention.
The success of ASPAR relies on two crucial pieces of technology: the stencil-based analysis used to find and break unimportant data dependences, and the smart run-time support for dynamic data distribution.
It is interesting that neither of these is the result of any extension to existing compiler technology; both are derived from our experience with parallel computers. This is consistent with our underlying philosophy of having ASPAR duplicate the methods which real programmers use to successfully parallelize code by hand. Obviously, not all problems are amenable to this type of automatic parallelization, but we believe that of the cases discussed in the opening paragraphs of this section we can usefully address all but the ``Extremely Difficult.''
In the simpler cases, we believe that the goal of eliminating the role of ``human error'' in generating correctly functioning parallel code has been accomplished.
The price that has been paid, of course, is the requirement for extremely smart runtime systems. The use of Express as the underlying mechanism for ASPAR has proved its value in addressing the simpler types of decomposition scheme.
The development of the dynamic data-distribution mechanisms required to support the more complex applications has led to a completely new way of writing, debugging, and optimizing parallel programs which we believe will become the cornerstone of the next generation of Express systems and may revolutionize the ways in which people think about parallel processing.
Coherent Parallel C (CPC) was originally motivated by the fact that for many parallel algorithms, the Connection Machine can be very easy to program. The work of this section is described in [ Felten:88a ]. In parallel with our efforts, Philip Hatcher and Michael Quinn developed a version of C*, now called Data-Parallel C, for MIMD computers. Their work is described in [ Hatcher:91a ].
The CPC language is not simply a C with parallel for loops; instead, a data-parallel programming model is adopted. This means that one has an entire process for each data object. An example of an ``object'' is one mesh point in a finite-element solver. How the processes are actually distributed on a parallel machine is transparent-the user is to imagine that an entire processor is dedicated to each process. This simplifies programming tremendously: complex if statements associated with domain boundaries disappear, and problems which do not exactly match the machine size and irregular boundaries are all handled transparently. Figure 13.17 illustrates CPC by contrasting ``normal'' hypercube programming with CPC programming for a simple grid-update algorithm.
Figure 13.17:
Normal Hypercube Programming Model versus CPC Model for the
Canonical Grid-based Problem. The upper part of the figure shows a
two-dimensional grid upon which the variables of the problem live. The
middle portion shows the usual hypercube model for this type of problem.
There is one process per processor and it contains a subgrid. Some variables
of the subgrid are on a process boundary, some are not. Drawn explicitly are
communication buffers and the channels between them which must be managed by
the programmer. The bottom portion of the figure shows the CPC view of the
same problem. There is one data object (a grid point) for each process so
that all variables are on a process boundary. The router provides a full
interconnect between the processes.
The usual communication calls are not seen at all at the user level. Variables of other processes (which may or may not be on another processor) are merely accessed, giving global memory. In our nCUBE implementation, this was implemented using the efficient global communications system called the crystal_router (see Chapter 22 of [ Fox:88a ]).
An actual run-time system was developed for the nCUBE and is described in [ Felten:88a ]. Much work remains to be done, of course. How to optimize in order to produce an efficient communications traffic is unexplored; a serious attempt to produce a fine-grained MIMD machine really involves new types of hardware, somewhat like Dally's J-machine.
Ed Felten and Steve Otto developed CPC.
In this section, we review some ideas of Fox, dating from 1987, that unify the decomposition methodologies for hierarchical- and distributed-memory computers [ Fox:87b ]. For a modern workstation, the hierarchical memory is formed by the cache and main memory. One needs to minimize the cache misses to ensure that, as far as possible, we reference data in cache and not in main memory. This is often referred to as the need for ``data locality.'' This term makes clear the analogy with distributed-memory parallel computers. As shown in this book, we need data locality in the latter case to avoid communications between processors. We put the discussion in this chapter because we anticipate an important application of these ideas to data-parallel Fortran. The directives in High Performance Fortran essentially specify data locality and we believe that an HPF compiler can use the concepts of this section to optimize cache use on hierarchical-memory machines. Thus, HPF and similar data-parallel languages will give better performance than conventional Fortran 77 compilers on all high-performance computers, not just parallel machines.
Figure 13.18:
Homogeneous and Hierarchical-Memory Multicomputers. The
black box represents the data that fit into the lowest level of the
memory hierarchy.
Figures 13.18 and 13.19 contrast distributed-memory multicomputers, shared-memory, and sequential hierarchical-memory computers. In each case, we denote by a black square the amount of data which can fit into the lowest level of the memory hierarchy. In machines such as the nCUBE-1,2 with a simple node, illustrated in Figure 13.18 (a), this amount is just what can fit into a node of the distributed-memory computer. In the other architectures shown in Figures 13.18 and 13.19 , the data corresponding to the black square represents what can fit into the cache. There is one essential difference between cache and distributed memory. Both need data locality, but in the parallel case the basic data is static and each node fetches additional information as necessary. This gives the familiar surface-over-volume communication overheads of Equation 3.10 . However, in the case of a cache, all the data must stream through it and not just the data needed to provide additional information. For distributed-memory machines, we minimize the need for information flow into and out of a grain as shown in Figure 3.9 . For hierarchical-memory machines, we need to maximize the number of times we access the data in cache. These are related but not identical concepts which we will now compare. We can use the space-time complex system language introduced in Chapter 3 .
Figure 13.19:
Shared Hierarchical-Memory Computers. ``Cache'' could either
be a time cache or local (software-controlled) memory.
Figure 13.20 introduces a new time constant, the time it takes to load a word into cache, which is contrasted with the calculation and communication time constants introduced in Section 3.5 . As shown in this figure, the cache overhead is also a ``surface-over-volume'' effect just as it was in Section 3.5 , but now the surface is measured in the temporal direction and the volume is that of a domain in space and time. We find that this memory-loading time, time, and memory hierarchy are analogous to the communication time, space, and distributed memory.
Figure 13.20: The Fundamental Time Constants of a Node. The information dimension represented by d is discussed in Section 3.5 .
Space-time decompositions are illustrated in Figure 13.21 for a simple one-dimensional problem. The decomposition in Figure 13.21 (a) is fine for distributed-memory machines, but has poor cache performance. It is blocked in space but not in time. The optimal decompositions are ``space-time'' blocked and illustrated in Figure 13.21 (b) and (c).
Figure 13.21:
Decompositions for a simple one-dimensional wave equation.
A ``space-time'' blocking is a universal high-performance implementation of data locality. It will lead to good performance on both distributed- and hierarchical-memory machines. This is best known for the use of the BLAS-3 matrix-matrix primitives in LAPACK and other matrix library projects (see Section 8.1 ) [ Demmel:91a ]. The next step is to generate such optimal decompositions from a High Performance Fortran compiler.
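The matrix-matrix case makes the idea concrete. In the blocked loop ordering below, each block of A and B is reused many times while it is resident in cache, which is the kind of reuse that BLAS-3 primitives package up; the block size is a tuning parameter chosen here for illustration, not a value taken from LAPACK.

      program blocked_matmul
        ! Cache-blocked matrix multiply: each (nb x nb) block of C is updated
        ! from blocks of A and B that fit in cache and are reused nb times.
        implicit none
        integer, parameter :: n = 256, nb = 32     ! nb is an assumed block size
        real :: a(n,n), b(n,n), c(n,n)
        integer :: ii, jj, kk, i, j, k

        call random_number(a)
        call random_number(b)
        c = 0.0

        do jj = 1, n, nb
           do kk = 1, n, nb
              do ii = 1, n, nb
                 do j = jj, jj+nb-1
                    do k = kk, kk+nb-1
                       do i = ii, ii+nb-1
                          c(i,j) = c(i,j) + a(i,k)*b(k,j)
                       end do
                    end do
                 end do
              end do
           end do
        end do

        print *, 'difference from MATMUL: ', maxval(abs(c - matmul(a,b)))
      end program blocked_matmul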
Figure 13.22:
Performance of a Random Surface Fortran Code on Five RISC
Architecture Sequential Computers. The optimizations are described in
the text.
We can illustrate these ideas with the application of Section 7.2 [ Coddington:93a ]. Table 7.1 records the performance of the original code used, but this C version was improved by an order of magnitude in performance in a carefully rewritten Fortran code. The relevance of data locality for this new code is shown in Figure 13.22 . For each of a set of five RISC processors, we show four performance numbers obtained by switching on and off ``system Fortran compiler optimization'' and ``data locality.'' As seen from Section 7.2 , this application is naturally very irregular, and normal data structures do not naturally express and preserve this locality. Even if one starts with neighboring points in the simulated space stored ``near'' each other in the computer, this proximity is not easily preserved by the dynamic retriangulation. In the ``data locality'' column, we have arranged storage to preserve locality as far as possible; neighboring physical points are stored near each other in memory. A notable feature of Figure 13.22 is that the Intel i860 shows the largest improvement from forcing data locality-even after compiler optimization, this action improves performance by 70%. A similar result was found in [ Das:92c ] for an unstructured mesh partial differential equation solver. Other architectures such as the HP9000-720 with large caches show smaller effects. In Figure 13.22 , locality was achieved by user manipulation-as discussed, a next step is to put such intelligence in parallel compilers.
Table 14.1:
84 Application Areas Used in a Survey 1988-89 from 400 Papers
The two applications in this chapter fall into the asynchronous problem class of Section 3.4 . This class is caricatured in Figure 14.1 and is the last and hardest to parallelize of the basic problem architectures introduced in Section 3.4 . Thus, we will use this opportunity to summarize some issues across all problem classes. It would be more logical to do this in Chapter 18 where we discuss the compound metaproblem class, which we now realize is very important. However, the discussion here is based on a survey [ Fox:88b ], [ Angus:90a ] undertaken from 1988 to 1989, at which time we had not introduced the concept of compound or hierarchical problem architectures.
Figure 14.1:
The Asynchronous Problem Class
Table 14.1 divides 84 application areas into eight (academic) disciplines. Examples of the areas are given in the table-essentially each application section in this book would lead to a separate area for the purposes of this table. These areas are listed in [ Angus:90a ], [Fox:88b;92b] and came from a reading in 1988 of about 400 papers which had seriously developed a parallel application or a nontrivial core algorithm. In 1988, it was possible to read essentially all such papers-the field has grown so much in the following years that a complete survey would now be a daunting task. Table 14.2 divides these application areas into the basic problem architectures used in the book. There are several caveats to be invoked for this table. As we have seen in Chapters 9 and 12 , the division between synchronous and loosely synchronous is not sharp and is still a matter of important debate. The synchronous problems are naturally suitable for SIMD architectures, while properly loosely synchronous and asynchronous problems require MIMD hardware. This classification is illustrated by a few of the major C P applications in Table 14.3 [ Fox:89t ], which also compares performance on various SIMD and MIMD machines in use in 1989.
Table 14.2:
Classification of 400 Applications in 84 Areas from 1989.
90% of applications scale to large SIMD/MIMD machines.
Table 14.3:
Classification of some C
P applications from 1989 and
their performances on machines at that time [Fox:89t]. A question mark
indicates the performance is unknown whereas an X indicates we expect or
have measured poor performance.
Table 14.2 can be interpreted as follows: 90% of application areas (i.e., all except the asynchronous class) naturally parallelize to large numbers of processors.
Forty-seven percent of applications will run well on SIMD machines while 43% need a MIMD architecture (this is a more precise version of Equation 3.21 ).
These numbers are rough for many reasons. The grey line between synchronous (perhaps generalized to autonomous SIMD in the MasPar language of Section 6.1 ) and loosely synchronous means that the division between SIMD and MIMD fractions is uncertain. Further, how should one weight each area? QCD of Section 4.3 is one of the application areas in Table 14.1 , but this uses an incredible amount of computer time and is a synchronous problem. Thus, weighting by computer cycles needed or used could also change the ratios significantly.
These tables can also be used to discuss software issues as described in Sections 13.1 and 18.2 . The synchronous and embarrassingly parallel problem classes (54%) are those directly supported by the initial High Performance Fortran language [ Fox:91e ], [ Kennedy:93a ]. The loosely synchronous problems (34%) need run-time and language extensions, which we are currently working on [ Berryman:91a ], [ Saltz:91b ], [ Choudhary:92d ], as mentioned in Section 13.1 (Table 13.4 ). With these extensions, we expect High Performance Fortran to be able to express nearly all synchronous, loosely synchronous, and embarrassingly parallel problems.
The fraction (10%) of asynchronous problems is in some sense pessimistic. There is one very important asynchronous area-event driven simulations-where the general scaling parallelism remains unclear. This is illustrated in Figure 14.1 and briefly discussed in Section 15.3 . However, the two cases described in this chapter parallelize well-albeit with very hard work from the user! Further, some of the asynchronous areas in Tables 14.1 and 14.2 are of the compound class of Chapter 18 and these also parallelize well.
The two examples in this chapter need different algorithmic and software support. In fact, as we will note in Section 15.1 , one of the hard problems in parallel software is to isolate those issues that need to be supported over and above those needed for synchronous and loosely synchronous problems. The software models needed for irregular statistical mechanics (Section 14.2 ), chess (Section 14.3 ), and event-driven simulations (Section 15.3 ) are quite different.
In Section 14.2 , the need for a sequential ordering takes the normally loosely synchronous time-stepped particle dynamics into an asynchronous class. Time-stamping the updates provides the necessary ordering and a ``demand-driven processing queue'' provides scaling parallelism. Communication must be processed using interrupts and the loosely synchronous communication style of Section 9.1 will not work.
Another asynchronous application developed by C P was the ray-tracing algorithm of Jeff Goldsmith and John Salmon [ Fox:87c ], [Goldsmith:87a;88a]. This application used two forms of parallelism, with both the pixels (rays) and the basic model to be rendered distributed. This allows very large models to be simulated, and the covers of our earlier books [ Fox:88a ], [ Angus:90a ] feature pictures rendered by this program. The distributed-model database requires software support similar to that of the application in Section 14.2 . The rays migrate from node to node as they traverse the model and need to access data not present in the node currently responsible for the ray. This application was a great technical success, but was not further developed as it used software (MOOSE of Section 15.2 ) which was only supported on our early machines. The model naturally forms a tree with the scene represented with increasing spatial resolution as you go down the different levels of the tree. Goldsmith and Salmon used a strategy similar to the hierarchical Barnes-Hut approach to particle dynamics described in Section 12.4 . In particular, the upper parts of the tree are replicated in all nodes and only the lower parts distributed. This removes ``sequential bottlenecks'' near the top of the tree just as in the astrophysics case. Originally, Salmon intended his thesis to study the computer science and science issues associated with hierarchical data structures. Multiscale methods are pervasive in essentially all physical systems. However, the success of the astrophysical applications led to this being his final thesis topic. Su, another student of Fox, has just finished his Ph.D. on the general mathematical properties of hierarchical systems [ Su:93a ].
In Section 14.3 , we have a much more irregular and dynamic problem, computer chess, where statistical methods are used to balance the processing of the different branches of the dynamically pruned tree. There is a shared database containing previous evaluations of positions, but otherwise the processing of the different possible moves is independent. One does need a clever ordering of the work (evaluation of the different final positions) to avoid a significant number of calculations being wasted because they would ``later'' be pruned away by a parallel calculation on a different processor. Branch and bound applications [ Felten:88c ], [Fox:87c;88v] have similar parallelization characteristics to computer chess. This was implemented in parallel as a ``best-first'' and not a ``depth-first'' search strategy and was applied to the travelling salesman problem (Section 11.4 ) to find exact solutions to test the physical optimization algorithms. It was also applied to the 0/1 knapsack problem, but for this and TSP, difficulties arose due to insufficient memory for holding the queues of unexplored subtrees. The depth-first strategy, used in our parallel computer chess program and sequential branch and bound, avoids the need for large memory. On sequential machines, virtual memory can be used for the subtree queues, but this was not implemented on the nCUBE-1 and indeed is absent on most current multicomputers.
Figure 14.2:
Issues Affecting Relation of Machine, Problem, and Software
Architecture
The applications in this chapter are easier than a full event-driven simulation because enough is known about the problem to find a reasonable parallel algorithm. The difficulty, then, is to make it efficient. Figure 14.2 is a schematic of problem architectures labelled by spatial and temporal properties. In general, the temporal characteristics-the problem architectures (synchronous, loosely synchronous and asynchronous)-determine the nature of the parallelism. One special case is the spatially disconnected problem class for which the temporal characteristic is essentially irrelevant. For the general spatially connected class, the nature of this connection will affect performance and ease of implementation, but not the nature of the parallelism. These issues are summarized in Table 14.4 . For instance, spatially irregular problems, such as those in Chapter 12 , are particularly hard to implement although they have natural parallelism. The applications in this chapter can be viewed as having little spatial connectivity and their parallelism comes because, although asynchronous, they are ``near'' the spatially disconnected class of Figure 14.2 .
Table 14.4:
Criterion for success in parallelizing a particular problem
on a particular machine.
Although we live in a three-dimensional world, many important processes involve interactions on surfaces, which are effectively two-dimensional. While experimental studies of two-dimensional systems have been successful in probing some aspects of such systems, computer simulation is another powerful tool that can be used to measure their properties. We have used a computer simulation to study the melting transition of a two-dimensional system of interacting particles [Johnson:86a;86b]. One purpose of the study is to investigate whether melting in two dimensions occurs through a qualitatively different process than it does in three dimensions. In three dimensions, the melting transition is a first-order transition which displays a characteristic latent heat. Halperin and Nelson [ Halperin:78a ], [ Nelson:79a ] and Young [ Young:79a ] have raised the possibility that melting in two dimensions could occur through a qualitatively different process. They have suggested that melting could consist of a pair of higher-order phase transitions, which lack a latent heat, that are driven by topological defects in the two-dimensional crystal lattice.
We studied a two-dimensional system of particles interacting through a truncated Lennard-Jones potential. The Lennard-Jones potential is
V(r) = 4\epsilon \left[ (\sigma/r)^{12} - (\sigma/r)^{6} \right],
where \epsilon is the energy parameter, \sigma is the length parameter, and r is the distance between two particles. The potential is repulsive at small separations and attractive at separations beyond the potential minimum. The potential energy of the whole system is the sum of the potential energies of each pair of interacting particles. In order to ease the computational requirements of the simulation, we have truncated the potential at a fixed particle separation.
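A direct transcription of the truncated pair potential is given below. The cutoff radius is written as a parameter; the particular value used here is an assumption and not necessarily the one used in the study.

      program lj_demo
        ! Truncated Lennard-Jones pair potential.  EPS and SIGMA are the energy
        ! and length parameters; RCUT is an assumed cutoff.
        implicit none
        real, parameter :: eps = 1.0, sigma = 1.0, rcut = 2.5*sigma
        real :: r
        integer :: i

        do i = 3, 10
           r = 0.3*i*sigma
           print '(a,f6.3,a,es12.4)', 'r = ', r, '   V(r) = ', vlj(r)
        end do

      contains

        real function vlj(r)
          real, intent(in) :: r
          real :: sr6
          if (r < rcut) then
             sr6 = (sigma/r)**6
             vlj = 4.0*eps*(sr6*sr6 - sr6)
          else
             vlj = 0.0             ! interaction ignored beyond the cutoff
          end if
        end function vlj

      end program lj_demo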
Mark A. Johnson wrote the Monte Carlo simulation of melting in two dimensions for his Ph.D. research at Caltech.
We chose to use a Monte Carlo method to simulate the interaction of the particles. The method consists of generating a sequence of configurations in such a way that the probability of being in configuration r, denoted P(r), is
P(r) \propto e^{-E_r/kT},
where E_r is the potential energy of configuration r, k is Boltzmann's constant, and T is the temperature. A configuration refers collectively to the positions of all the particles in the simulation. The update procedure that we describe in the next section generates such a sequence of configurations by repeatedly updating the position of each of the particles in the system. Averaging the values of such quantities as potential energy and pressure over the configurations gives their expected values in such a system.
The process of moving from one configuration to another is known as a Monte Carlo update. The update procedure we used involves three steps that allow the position of one particle to change [ Metropolis:53a ]. The first step is to choose a new position for the particle with uniform probability in a region about its current position. Next, the update procedure calculates the difference in potential energy between the current configuration and the new one. Finally, the new position for the particle is either accepted or rejected based on the difference in potential energy and rules that generate configurations with the required probability distribution.
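A minimal sketch of the acceptance step of such a Metropolis update is shown below. It assumes delta_E is the computed change in potential energy and beta = 1/kT; drand48() is a placeholder for whatever uniform random number generator the original program actually used.

#include <math.h>
#include <stdlib.h>

/* Metropolis acceptance test (sketch).  Returns 1 if the proposed move is
 * accepted.  Downhill moves are always accepted; uphill moves are accepted
 * with probability exp(-beta * delta_E), which generates configurations
 * with the required Boltzmann distribution.                                 */
int accept_move(double delta_E, double beta)
{
    if (delta_E <= 0.0)
        return 1;
    return drand48() < exp(-beta * delta_E);
}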
The two-dimensional system being studied has several characteristics that must be considered in designing an efficient algorithm for implementing the Monte Carlo simulation. One of the most important characteristics is that the interaction potential has a short range. The Lennard-Jones potential approaches zero quickly enough that the effect of distant particles can be safely ignored. We made the short-range nature of the potential precise by truncating it at a fixed cutoff separation. We must use the short-range nature of the potential to organize the particle positions so that the update procedure can quickly locate the particles whose potential energy changes during an update.
Another feature of the system that complicates the simulation is that the particles are not confined to a grid that would structure the data. Such irregular data make simultaneously updating multiple particles more difficult. One result of the irregular data is that the computational loads of the processors are unbalanced on a distributed-memory, MIMD processor. In order to minimize the effect of the load imbalance, the nodes of the concurrent processor must run asynchronously. We developed an interrupt-driven communication system [ Johnson:85a ] that allows the nodes to implement an asynchronous update procedure. This ``rdsort'' system is described in Section 5.2.5 and has similarities to the current active message ideas [ Eiken:92a ]. The interrupt-driven communication system allows a node to send requests for contributions to the change in potential energy that moving its particle would cause. Nodes receiving such requests compute the contribution of their particles and send a response reporting their result. This operating system was sophisticated but was used only for this application. However, as described in Chapter 5 , these ideas formed the basis of both MOOS II and the evolution of CrOS III into Express. Interestingly, Mark Johnson designed the loosely synchronous CrOS III message-passing system as part of his service for C P even though his particular application was one of the few that could not benefit from it.
Performing Monte Carlo updates in parallel requires careful attention to ensuring that simultaneous updates do not interfere with each other. Because the basic equations governing the Monte Carlo method remain unchanged, performing the updates in parallel requires that a consistent sequential ordering of the updates exists. No particular ordering is required; only the existence of such an ordering is critical. Particles that are farther apart than the range of the interaction potential cannot influence each other, so any arbitrary ordering of their updates is always consistent. However, if some of the particles being updated together are within the range of the potential, they cannot be updated as if they were independent because the result of one update affects the others. Fortunately, the symmetry of the potential guarantees that all of the affected particles are aware that their updates are interdependent.
Note that the Monte Carlo approach to melting or, more generally, any particle simulation is often much harder to parallelize than the competing time-stepped evolution approach. The latter would be loosely synchronous with natural parallelism. The need for a consistent sequential ordering in the Monte Carlo algorithm leads to the asynchronous temporal structure. It is interesting that on a sequential machine, both time-stepped and Monte Carlo methods would be equally easy to implement. However, even here the sequential ordering for the Monte Carlo method would make it hard to vectorize the algorithm on a conventional supercomputer. For regular Monte Carlo problems such as QCD, discussed in Section 4.3 , the sequential ordering constraint is present but trivial to implement, as the regular spatial structure allows one to predetermine a consistent update procedure; in particular, the normal red-black update structure achieves this. In the melting problem, one has a dynamically varying irregularity that allows no simple way of predetermining a consistent Monte Carlo update schedule.
Each node involved in the conflicting updates must act to resolve the situation by making one of only two possible decisions. For each request for contributions to the difference in potential energy of an update, a node can either send a response immediately or delay the response until its own update finishes. If the node sends the response immediately, it must use the old position of the particle that it is updating. If the node instead delays the response while waiting for its own update to finish, it will use the new position of the particle when its update finishes. If all of the nodes involved in the conflicting updates make consistent decisions, a sequential ordering of the updates will exist, ensuring the correctness of the Monte Carlo procedure. However, if two nodes both decide to send responses to each other based on the current positions of the particles they are updating, no such ordering will exist. If two nodes both decide to delay sending responses to each other, neither will be able to complete their update, causing the simulation to deadlock.
Several features of the concurrent update procedure make resolving such interdependent updates difficult. Each node must make its decision regarding the resolution of the conflicting updates in isolation from the other nodes because all of the nodes are running asynchronously to minimize load imbalance. However, the nodes cannot run completely asynchronously because assigning a consistent sequential ordering to the updates requires that the update procedure impose a synchronizing condition on the updates. Still, the condition should be as weak as possible so that the decrease in processing efficiency is minimized.
One solution to the problem of correctly ordering interdependent updates requires that a clock exist in each of the nodes. The update procedure records the time at which it begins updating a particle and includes that time with each of its requests for contributions to the difference in potential energy. When a node determines that its update conflicts with that of another node, it uses the times of the conflicting updates to resolve the dependence. The node sends a response immediately if the request involves an update that precedes its own. The node delays sending a response if its own update precedes the one that generated the request. Should the times be exactly equal, the unique number associated with each node provides a means of consistently ordering the updates. When each of the processors involved in the conflicting updates uses such a method to resolve the situation, a consistent sequential ordering must result. Using the time of each of the conflicting updates to determine their ordering allows the earliest updates to finish first, which achieves good load balance in the concurrent algorithm.
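The ordering rule can be summarized in a few lines of C. The structure and function names here are hypothetical; the real logic lived inside the interrupt-driven communication system described above.

/* Tag carried with each request for energy contributions (sketch). */
struct update_tag {
    double start_time;   /* local clock reading when the update began    */
    int    node_id;      /* unique node number, used only to break ties  */
};

/* Returns 1 if the remote update precedes ours, in which case we respond
 * immediately using the old particle position; returns 0 if our update
 * precedes the remote one, in which case we delay the response until our
 * own update finishes.                                                    */
int remote_update_precedes_ours(struct update_tag ours, struct update_tag theirs)
{
    if (theirs.start_time != ours.start_time)
        return theirs.start_time < ours.start_time;
    return theirs.node_id < ours.node_id;   /* equal times: order by node id */
}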
Although delaying a response to a conflicting update is a synchronizing condition, it is sufficiently weak that it does not seriously degrade the performance of the concurrent algorithm. A node can respond to other nodes' requests while waiting for responses to requests that it has generated. The node that delays sending a response can perform most of the computation to generate the response while it is waiting for responses to its own requests, because the position of only one particle is in question. In fact, the current implementation simply generates the two possible responses so that it can send the correct response immediately after its own update completes.
An interesting feature of the concurrent update algorithm is that it produces results that are inherently irreproducible. If two simulations start with exactly the same initial data, including random number seeds, the simulations will eventually differ. The source of the irreproducible behavior is that not all components of the concurrent processor are driven by the same clock. For instance, the communication channels that connect the nodes contain an asynchronous loop that allows the arrival times of messages to differ by arbitrarily small amounts. Such differences can affect the order in which requests are received, which in turn determines the order in which a node generates responses. Once such differences change the outcome of a single update, the two simulations begin to evolve independently. Both simulations continue to generate configurations with the correct probability distribution, so the statistical properties of the simulations do not change. However, the irreproducible behavior of the concurrent update algorithm can make debugging somewhat more difficult.
Because a complete performance analysis of the Monte Carlo simulation is rather lengthy, we provide only a summary of the analysis here. Calculating the efficiency of the concurrent update algorithm is relatively simple because it requires only measurements of the time an update takes on one node and on multiple nodes. A more difficult parameter to calculate is the load balance of the update procedure. In order to calculate the load balance, we measured the time required to send each type of message that the update uses. The total communication overhead is the sum of the overheads for each type of message, where the overhead for a message type is the product of the time to send a message of that type and the number of such messages. We calculated the number of each type of message by assuming a uniform distribution of particles. Because the update algorithm contains no significant serial components, we attributed to load imbalance the parallel overhead remaining after accounting for the communication overhead. The load balance is a factor that can range from 1/N, where N is the number of nodes, to 1, which occurs when the loads are balanced perfectly. We give the update time in seconds, the efficiency, and the load balance for several simulations on the 64-node Caltech hypercube in Table 14.5 ([ Johnson:86a ] p. 73).
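The quantities involved can be restated compactly; the notation below is ours, not that of the original analysis.
\[
  \varepsilon = \frac{T_1}{N\,T_N}, \qquad
  T_{\mathrm{comm}} = \sum_{m} t_m\, n_m , \qquad
  \frac{1}{N} \;\le\; \mbox{load balance} \;\le\; 1 ,
\]
where $T_1$ and $T_N$ are the update times on one and on $N$ nodes, $t_m$ is the time to send a message of type $m$, and $n_m$ is the number of such messages; a load balance of 1 corresponds to perfectly balanced loads.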
Table 14.5:
Simulations on the 64-node Caltech Hypercube
As this book shows, distributed-memory, multiple-instruction, multiple-data (MIMD) computers are successful in performing a large class of scientific computations. As discussed in Section 14.1 and the earlier chapters, these synchronous and loosely synchronous problems tend to have regular, homogeneous data sets, and the algorithms are usually ``crystalline'' in nature. Recognizing this, C P explored a set of algorithms which had irregular structure (as in Chapter 12 ) and asynchronous execution. At the start of this study, we were very unclear as to what parallel performance to expect. In fact, we achieved good speedup even in these hard problems.
Thus, as an attempt to explore a part of this interesting, poorly understood region in algorithm space, we implemented chess on an nCUBE-1 hypercube. Besides being a fascinating field of study in its own right, computer chess is an interesting challenge for parallel computers for several reasons.
One might also ask the question, ``Why study computer chess at all?'' We think the answer lies in the unusual position of computer chess within the artificial intelligence world. Like most AI problems, chess requires a program which will display seemingly intelligent behavior in a limited, artificial world. Unlike most AI problems, the programmers do not get to make up the rules of this world. In addition, there is a very rigorous procedure to test the intelligence of a chess program-playing games against humans. Computer chess is one area where the usually disparate worlds of AI and high-performance computing meet.
Before going on, let us state that our approach to parallelism (and hence speed) in computer chess is not the only one. Belle, Cray Blitz, Hitech, and the current champion, Deep Thought, have shown in spectacular fashion that fine-grained parallelism (pipelining, specialized hardware) leads to impressive speeds (see [ Hsu:90a ], [ Frey:83a ], [ Marsland:87a ], [ Ebeling:85a ], [ Welsh:85a ]). Our coarse-grained approach to parallelism should be viewed as a complementary, not a conflicting, method. Clearly the two can be combined.
In this section we will describe some basic aspects of what constitutes a good chess program on a sequential computer. Having done this, we will be able to intelligently discuss the parallel algorithm.
At present, all competitive chess programs work by searching a tree of possible moves and countermoves. A program starts with the current board position and generates all legal moves, all legal responses to these moves, and so on until a fixed depth is reached. At each leaf node, an evaluation function is applied which assigns a numerical score to that board position. These scores are then ``backed up'' by a process called minimaxing, which is simply the assumption that each side will choose the line of play most favorable to it at all times. If positive scores favor white, then white picks the move of maximum score and black picks the move of minimum score. These concepts are illustrated in Figure 14.3 .
Figure 14.3:
Game Playing by Tree Searching. The top half of the figure illustrates the general idea: Develop a full-width tree to some depth, then score the leaves with the evaluation function, f. The second half shows minimaxing, the reasonable supposition that white (black) chooses lines of play which maximize (minimize) the score.
The evaluation function employed is a combination of simple material balance and several terms which represent positional factors. The positional terms are small in magnitude but are important since material balance rarely changes in tournament chess games.
The problem with this brute-force approach is that the size of the tree explodes exponentially. The ``branching factor'' or number of legal moves in a typical position is about 35. In order to play master-level chess a search of depth eight appears necessary, which would involve a tree of $35^8$, or about $2\times 10^{12}$, leaf nodes.
Fortunately, there is a better way. Alpha-beta pruning is a technique which always gives the same answer as brute-force searching without looking at so many nodes of the tree. Intuitively, alpha-beta pruning works by ignoring subtrees which it knows cannot be reached by best play (on the part of both sides). This reduces the effective branching factor from 35 to about 6, which makes strong play possible.
The idea of alpha-beta pruning is illustrated in Figure 14.4. Assume that all child nodes are searched in the order of left to right in the figure. On the left side of the tree (the first subtree searched), we have minimaxed and found a score of +4 at depth one. Now, start to analyze the next subtree. The children report back scores of +5, -1, .... The pruning happens after the score of -1 is returned: since we are taking the minimum of the scores +5, -1, ..., we immediately have a bound on the scores of this subtree; we know the score will be no larger than -1. Since we are taking the maximum at the next level up (the root of the tree) and we already have a line of play better than -1 (namely, the +4 subtree), we need not explore this second subtree any further. Pruning occurs, as denoted by the dashed branch of the second subtree. The process continues through the rest of the subtrees.
Figure:
Alpha-Beta Pruning for the Same Tree as Figure 14.3. The tree is generated in left-to-right order. As soon as the score -1 is computed, we immediately have a bound ($\le -1$) on the level above, which is below the score of the +4 subtree. A cutoff occurs, meaning no more descendants of that node need to be searched.
The amount of work saved in this small tree was insignificant but alpha-beta becomes very important for large trees. From the nature of the pruning method, one sees that the tree is not evolved evenly downward. Instead, the algorithm pursues one branch all the way to the bottom, gets a ``score to beat'' (the alpha-beta bounds), and then sweeps across the tree sideways. How well the pruning works depends crucially on move ordering. If the best line of play is searched first, then all other branches will prune rapidly.
Actually, what we have discussed so far is not full alpha-beta pruning, but merely ``pruning without deep cutoffs.'' Full alpha-beta pruning shows up only in trees of depth four or greater. A thorough discussion of alpha-beta with some interesting historical comments can be found in Knuth and Moore [ Knuth:75a ].
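For concreteness, here is a minimal C sketch of full alpha-beta search in negamax form. The Board and Move types and the helper routines are hypothetical stand-ins, not the program's real interface, and the real program adds quiescence search, hashing, and move ordering on top of this skeleton.

/* Full alpha-beta search in negamax form (sketch only).                     */
typedef struct Board Board;   /* opaque position type (assumed)              */
typedef int Move;             /* move encoding (assumed)                     */
#define MAX_MOVES 256

int  generate_moves(Board *b, Move *out);   /* hypothetical helpers, not the */
void make_move(Board *b, Move m);           /* chess program's real routines */
void unmake_move(Board *b, Move m);
int  evaluate(const Board *b);              /* static score, side to move    */

int alphabeta(Board *b, int depth, int alpha, int beta)
{
    if (depth == 0)
        return evaluate(b);               /* leaf: apply the evaluation function */

    Move moves[MAX_MOVES];
    int n = generate_moves(b, moves);     /* mates and stalemates ignored here   */

    for (int i = 0; i < n; i++) {
        make_move(b, moves[i]);
        /* Negamax: the opponent's best score, negated, with the window flipped. */
        int score = -alphabeta(b, depth - 1, -beta, -alpha);
        unmake_move(b, moves[i]);
        if (score >= beta)
            return beta;                  /* cutoff: this line is already refuted */
        if (score > alpha)
            alpha = score;                /* new best line for the side to move   */
    }
    return alpha;
}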
The evaluation function of our program is similar in style to that of the Cray Blitz program [ Welsh:85a ]. The most important term is material, which is a simple count of the number of pieces on each side, modified by a factor which encourages the side ahead in material to trade pieces but not pawns. The material evaluator also recognizes known draws such as king and two knights versus king.
There are several types of positional terms, including pawn structure, king safety, center control, king attack, and specialized bonuses for things like putting rooks on the seventh rank.
The pawn structure evaluator knows about doubled, isolated, backward, and passed pawns. It also has some notion of pawn chains and phalanxes. Pawn structure computation is very expensive, so a hash table is used to store the scores of recently evaluated pawn structures. Since pawn structure changes slowly, this hash table almost always saves us the work of pawn structure evaluation.
King safety is evaluated by considering the positions of all pawns on the file the king is occupying and both neighboring files. A penalty is assessed if any of the king's covering pawns are missing or if there are holes (squares which can never be attacked by a friendly pawn) in front of the king. Additional penalties are imposed if the opponent has open or half-open files near the king. The whole king safety score is multiplied by the amount of material on the board, so the program will want to trade pieces when its king is exposed, and avoid trades when the opponent's king is exposed. As in pawn structure, king safety uses a hash table to avoid recomputing the same information.
The center control term rewards the program for posting its pieces safely in the center of the board. This term is crude since it does not consider pieces attacking the center from a distance, but it can be computed very quickly and it encourages the kind of straightforward play we want.
King attack gives a bonus for placing pieces near the opposing king. Like center control, this term is crude but tends to lead to positions in which attacking opportunities exist.
The evaluation function is rounded out by special bonuses to encourage particular types of moves. These include a bonus for castling, a penalty for giving up castling rights, rewards for placing rooks on open and half-open files or on the seventh rank, and a penalty for a king on the back rank with no air.
Of course it only makes sense to apply a static evaluation function to a position which is quiescent, or tactically quiet. As a result, the tree is extended beyond leaf nodes until a quiescent position is reached, where the static evaluator is actually applied.
We can think of the quiescence search as a dynamic evaluation function, which takes into account tactical possibilities. At each leaf node, the side to move has the choice of accepting the current static evaluation or of trying to improve its position by tactics. Tactical moves which can be tried include pawn promotions, most capture moves, some checks, and some pawn promotion threats. At each newly generated position the dynamic evaluator is applied again. At the nominal leaf nodes, therefore, a narrow (small branching factor) tactical search is done, with the static evaluator applied at all terminal points of this search (which end up being the true leaves).
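A sketch of such a dynamic evaluation follows, reusing the hypothetical Board, Move, and move helpers from the alpha-beta sketch above; generate_tactical_moves() is an assumed routine producing only captures, promotions, and similar tactical tries.

/* Quiescence search (sketch): the side to move may ``stand pat'' on the
 * static score or try to improve its position tactically.                 */
#define MAX_TACTICAL 64
int generate_tactical_moves(Board *b, Move *out);   /* assumed helper */

int quiesce(Board *b, int alpha, int beta)
{
    int stand_pat = evaluate(b);              /* accept the static evaluation... */
    if (stand_pat >= beta)
        return beta;
    if (stand_pat > alpha)
        alpha = stand_pat;

    Move moves[MAX_TACTICAL];
    int n = generate_tactical_moves(b, moves);      /* narrow branching factor */
    for (int i = 0; i < n; i++) {
        make_move(b, moves[i]);
        int score = -quiesce(b, -beta, -alpha);     /* ...or try tactics       */
        unmake_move(b, moves[i]);
        if (score >= beta)
            return beta;
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}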
Tournament chess is played under a strict time control, and a program must make decisions about how much time to use for each move. Most chess programs do not set out to search to a fixed depth, but use a technique called iterative deepening. This means a program does a depth two search, then a depth three search, then a depth four search, and so on until the allotted time has run out. When the time is up, the program returns its current best guess at the move to make.
Iterative deepening has the additional advantage that it facilitates move ordering. The program knows which move was best at the previous level of iterative deepening, and it searches this principal variation first at each new level. The extra time spent searching early levels is more than repaid by the gain due to accurate move ordering.
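The control loop can be sketched as follows, again with hypothetical helpers (time_is_up() and search_to_depth()) and reusing the Board and Move types from the earlier sketch; the real program can also abandon a search part-way through when the clock runs out.

/* Iterative deepening under a time budget (sketch).                       */
int  time_is_up(void);                       /* assumed clock-management helper */
Move search_to_depth(Board *b, int depth);   /* assumed full-width search       */

Move choose_move(Board *b)
{
    Move best = 0;
    for (int depth = 2; !time_is_up(); depth++)
        best = search_to_depth(b, depth);    /* keep the deepest completed result;
                                                the previous best move is searched
                                                first at each new depth           */
    return best;
}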
During the tree search, the same board position may occur several times. There are two reasons for this. The first is transposition, or the fact that the same board position can be reached by different sequences of moves. The second reason is iterative deepening-the same position will be reached in the depth two search, the depth three search, and so on. The hash table is a way of storing information about positions which have already been searched; if the same position is reached again, the search can be sped up or eliminated entirely by using this information.
The hash table plays a central role in a good chess program and so we will describe it in some detail. First of all, the hash table is a form of content-addressable memory: with each chess board (a node in the chess tree) we wish to associate some slot in the table. Therefore, a hashing function h is required, which maps chess boards to slots in the table. The function h is designed so as to scatter similar boards across the table. This is done because in any single search the boards appearing in the tree differ by just a few moves and we wish to avoid collisions (different boards mapping to the same slot) as much as possible. Our hash function is taken from [ Zobrist:70a ]. Each slot in the table contains the information recorded about a previously searched position; as described below, this includes the depth of the search that produced the entry, a score or bound, a suggested move, a 64-bit collision check, and a staleness flag.
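A sketch of this style of hashing and of a plausible slot layout is shown below. The table size, the field widths, and the use of mrand48() as a placeholder key generator are our assumptions, not details taken from the program.

#include <stdint.h>
#include <stdlib.h>

#define PIECE_KINDS 12        /* 6 piece types x 2 colours                  */
#define SQUARES     64
#define TABLE_BITS  20        /* 2^20 local slots (an assumed size)         */

static uint64_t zobrist_key[PIECE_KINDS][SQUARES];   /* random 64-bit keys  */
static uint64_t side_to_move_key;

/* Fill the key tables once at start-up; mrand48() is just a placeholder
 * 32-bit random number generator.                                          */
void init_zobrist(void)
{
    for (int p = 0; p < PIECE_KINDS; p++)
        for (int s = 0; s < SQUARES; s++)
            zobrist_key[p][s] =
                ((uint64_t)(uint32_t)mrand48() << 32) | (uint32_t)mrand48();
    side_to_move_key = ((uint64_t)(uint32_t)mrand48() << 32) | (uint32_t)mrand48();
}

/* Zobrist hash of a position: XOR of the keys of all (piece, square) pairs,
 * plus a key for the side to move.  piece_on[s] is an assumed board
 * representation holding -1 for an empty square.                           */
uint64_t board_hash(const int piece_on[SQUARES], int white_to_move)
{
    uint64_t h = white_to_move ? side_to_move_key : 0;
    for (int s = 0; s < SQUARES; s++)
        if (piece_on[s] >= 0)
            h ^= zobrist_key[piece_on[s]][s];
    return h;
}

/* Map a hash value to a slot index in the local part of the table.        */
uint32_t slot_index(uint64_t h)
{
    return (uint32_t)(h & ((1u << TABLE_BITS) - 1));
}

struct hash_slot {              /* one table slot (an assumed layout)       */
    uint64_t check;             /* 64-bit collision check                   */
    int16_t  score;             /* score or bound from the stored search    */
    uint8_t  depth;             /* depth of the subtree that produced it    */
    uint8_t  stale;             /* staleness flag, set between searches     */
    int      best_move;         /* suggested move to try first              */
};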
Whenever the program completes the search of a subtree of substantial size (i.e., one of depth greater than some minimum), the knowledge gained is written into the hash table. The writing is not completely naive, however. The table contains only a finite number of slots, so collisions occur; writeback acts to keep the most valuable information. The depth field of the slot helps in making the decision as to what is most valuable. The information coming from the subtree of greater depth (and hence, greater value) is kept.
The staleness flag allows us to keep information from one search to the next. When time runs out and a search is considered finished, the hash table is not simply cleared. Instead, the staleness flag is set in all slots. If, during the next search, a read is done on a stale slot, the staleness flag is cleared, the idea being that this position again seems to be useful. On writeback, if the staleness flag is set, the slot is simply overwritten, without checking the depths. This prevents the hash table from becoming clogged with old information.
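These rules can be stated in a couple of lines, again using the hypothetical hash_slot layout from the sketch above.

/* On a read hit, clear the staleness flag: the stored position has proven
 * useful again in the current search.                                      */
void hash_read_hit(struct hash_slot *slot)
{
    slot->stale = 0;
}

/* On write-back, a stale slot is simply overwritten; otherwise the entry
 * coming from the deeper (more valuable) subtree is the one that is kept.  */
void hash_writeback(struct hash_slot *slot, const struct hash_slot *incoming)
{
    if (slot->stale || incoming->depth >= slot->depth)
        *slot = *incoming;
}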
Proper use of an intelligent hash table such as the one described above gives one, in effect, a ``principal variation'' throughout the chess tree. As discussed in [ Ebeling:85a ], a hash table can effectively give near-perfect move ordering and hence, very efficient pruning.
The opening is played by making use of an ``opening book'' of known positions. Our program knows the theoretically ``best'' move in about 18,000 common opening positions. This information is stored as a hash table on disk and can be looked up quickly. This hash table resolves collisions through the method of chaining [ Knuth:73a ].
Endgames are handled by using special evaluation functions which contain knowledge about endgame principles. For instance, an evaluator for king and pawn endgames may be able to directly recognize a passed pawn which can race to the last rank without being caught by the opposing king. Except for the change in evaluation functions, the endgame is played in the same fashion as the middlegame.
Figure 14.5:
Slaves Searching Subtrees in a Self-scheduled Manner. Suppose
one of the searches (in this case, search two) takes a long time. The
advantage of self-scheduling is that, while this search is
proceeding in slave two, the other slaves will have done all the
remaining work. This very general technique works as long as the
dynamic range of the computation times is not too large.
Our program is implemented on an nCUBE/10 system. This is an MIMD (multiple instruction stream, multiple data stream) multicomputer, with each node consisting of a custom VLSI processor running at 7 MHz, 512 Kbytes of memory, and on-chip communication channels. There is no shared memory; processors communicate by message passing. The nodes are connected as a hypercube, but the VERTEX message-passing software [ nCUBE:87a ] gives the illusion of full connectivity. The nCUBE system at Caltech has 512 processors, but systems exist with as many as 1024 processors. The program is written in C, with a small amount of assembly code.
Some good chess programs do run in parallel (see [ Finkel:82a ], [ Marsland:84a ], [ Newborn:85a ], [Schaeffer:84a;86a]), but before our work nobody had tried more than about 15 processors. We were interested in using hundreds or thousands of processors. This forced us to squarely face all the issues of parallel chess: algorithms which work for a few processors do not necessarily scale up to hundreds of processors. An example of this is the occurrence of sequential bottlenecks in the control structure of the program. We have been very careful to keep control of the program decentralized so as to avoid these bottlenecks.
The parallelism comes from searching different parts of the chess tree at the same time. Processors are organized in a hierarchy with one master processor controlling several teams, each submaster controlling several subteams, and so on. The basic parallel operation consists of one master coming to a node in the chess tree, and assigning subtrees to his slaves in a self-scheduled way. Figure 14.5 shows a timeline of how this might happen with three subteams. Self-scheduling by the slaves helps to load-balance the computation, as can be seen in the figure.
So far, we have defined what happens when a master processor reaches a node of the chess tree. Clearly, this process can be repeated recursively. That is, each subteam can split into sub-subteams at some lower level in the tree. This recursive splitting process, illustrated in Figure 14.6 , allows large numbers of processors to come into play.
In conflict with this is the inherent sequential model of the standard alpha-beta algorithm. Pruning depends on fully searching one subtree in order to establish bounds (on the score) for the search of the next subtree. If one adheres to the standard algorithm in an overly strict manner, there may be little opportunity for parallelism. On the other hand, if one is too naive in the design of a parallel algorithm, the situation is easily reached where the parallel program searches an impressive number of board positions per second, but still does not search much more deeply than a single processor running the alpha-beta algorithm. The point is that one should not simply split or ``go parallel'' at every opportunity-as we will see below, it is sometimes better to leave processors idle for short periods of time and then do work at more effective points in the chess tree.
Figure:
The Splitting Process of Figure
14.5
is Now
Repeated, in a Recursive Fashion, Down the Chess Tree to Allow Large
Numbers of Processors to Come into Play. The topmost master has four
slaves, which are each in turn an entire team of processors, and so on.
This figure is only approximate, however. As explained in the text, the
splitting into parallel threads of computation is not done at every
opportunity but is tightly controlled by the global hash table.
The standard source on mathematical analysis of the alpha-beta algorithm is the paper by Knuth and Moore [ Knuth:75a ]. This paper gives a complete analysis for perfectly ordered trees, and derives some results about randomly ordered trees. We will concern ourselves here with perfectly ordered trees, since real chess programs achieve almost-perfect ordering.
Figure:
Pruning of a Perfectly Ordered Tree. The tree of
Figures
14.3
and
14.4
has
been extended another ply, and also the move ordering has been
rearranged so that the best move is always searched first. By
classifying the nodes into types as described in the text, the following
pattern emerges: All children of type one and three nodes are
searched, while only the first child of a type two node is searched.
In this context, perfect move-ordering means that in any position, we always consider the best move first. Ordering of the rest of the moves does not matter. Knuth and Moore show that in a perfectly ordered tree, the nodes can be divided into three types, as illustrated by Figure 14.7 . As in previous figures, nodes are assumed to be generated and searched in left-to-right order. The typing of the nodes is as follows. Type one nodes are on the ``principal variation.'' The first child of a type one node is type one and the rest of the children are type two. Children of type two nodes are type three, and children of type three nodes are type two.
How much parallelism is available at each node? The pruning of the perfectly ordered tree of Figure 14.7 offers a clue. By thinking through the alpha-beta procedure, one notices the following pattern: all children of type one and type three nodes are searched, while only the first child of a type two node is searched.
The implications of this for a parallel search are important. To efficiently search a perfectly ordered tree in parallel, one should search the children of type one and type three nodes in parallel, but search a type two node sequentially, since only its first child needs to be examined.
The key for parallel search of perfectly ordered chess trees, then, is to stay sequential at type two nodes, and go parallel at type three nodes. In the non-perfectly ordered case, the clean distinction between node types breaks down, but is still approximately correct. In our program, the hash table plays a role in deciding upon the node type. The following strategy is used by a master processor when reaching a node of the chess tree:
Make an inquiry to the hash table regarding this position.

If the hash table suggests a move, search it first, sequentially. In this context, ``sequentially'' means that the master takes her slaves with her down this line of play. This is to allow possible parallelism lower down in the tree.

If no move is suggested, or the suggested move fails to cause an alpha-beta cutoff, search the remaining moves in parallel. That is, farm the work out to the slaves in a self-scheduled manner.

This parallel algorithm, sketched in code below, is intuitively reasonable and also reduces to the correct strategy in the perfectly ordered case. In actual searches, we have explicitly (on the nCUBE graphics monitor) observed the sharp classification of nodes into type two and type three at alternate levels of the chess tree.
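A C sketch of this strategy is given here. Every helper name is a hypothetical stand-in for the program's real routines, and the scores returned by the team searches are assumed to already be from the master's point of view.

typedef struct Board Board;    /* opaque position type (assumed)            */
typedef int Move;              /* move encoding (assumed)                   */

int hash_suggests_move(const Board *b, Move *m);                    /* assumed */
int search_line_with_whole_team(Board *b, Move m,
                                int depth, int alpha, int beta);    /* assumed */
int farm_remaining_moves_to_slaves(Board *b,
                                   int depth, int alpha, int beta); /* assumed */

int master_search(Board *b, int depth, int alpha, int beta)
{
    Move suggested;
    if (hash_suggests_move(b, &suggested)) {
        /* Sequential phase: the master takes the whole team down the
         * suggested line first, hoping for an immediate cutoff (the
         * behavior appropriate to a type two node).                     */
        int score = search_line_with_whole_team(b, suggested, depth, alpha, beta);
        if (score >= beta)
            return beta;               /* refutation found: no split needed */
        if (score > alpha)
            alpha = score;
    }
    /* No suggestion, or no cutoff: behave as at a type three node and
     * search the remaining moves in parallel, self-scheduled.            */
    return farm_remaining_moves_to_slaves(b, depth, alpha, beta);
}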
The central role of the hash table in providing refutations and telling the program when to go parallel makes it clear that the hash table must be shared among all processors. Local hash tables would not work since the complex, dynamically changing organization of processors makes it very unlikely that a processor will search the same region of the tree in two successive levels of iterative deepening. A shared table is expensive on a distributed-memory machine, but in this case it is worth it.
Each processor contributes an equal amount of memory to the shared hash table. The global hash function maps each chess position to a global slot number consisting of a processor ID and a local slot number. Remote memory is accessed by sending a message to the processor in which the desired memory resides. To insure prompt service to remote memory requests, these messages must cause an interrupt on arrival. The VERTEX system does not support this feature, so we implemented a system called generalized signals [ Felten:88b ], which allows interrupt-time servicing of some messages without disturbing the running program.
When a processor wants to read a remote slot in the hash table, it sends a message containing the local slot number and the 64-bit collision check to the appropriate processor. When this message arrives the receiving processor is interrupted; it updates the staleness flag and sends the contents of the desired slot back to the requesting processor. The processor which made the request waits until the answer comes back before proceeding.
Remote writing is a bit more complicated due to the possibility of collisions. As explained previously, collisions are resolved by a priority scheme; the decision of whether to overwrite the previous entry must be made by the processor which actually owns the relevant memory. Remote writing is accomplished by sending a message containing the new hash table entry to the appropriate processor. This message causes an interrupt on arrival and the receiver examines the new data and the old contents of that hash table slot and decides which one to keep.
Since hash table data is shared among many processors, any access to the hash table must be an atomic operation. This means we must guarantee that two accesses to the same slot cannot happen at the same time. The generalized signals system provides a critical-section protection feature which can be used to queue remote read and write requests while an access is in progress.
Experiments show that the overhead associated with the global hash table is only a few percent, which is a small price to pay for accurate move ordering.
As we explained in an earlier section, slaves get work from their masters in a self-scheduled way in order to achieve a simple type of load balancing. This turns out not to be enough, however. By the nature of alpha-beta, the time necessary to search two different subtrees of the same depth can vary quite dramatically. A factor of 100 variation in search times is not unreasonable. Self-scheduling is somewhat helpless in such a situation. In these cases, a single slave would have to grind out the long search, while the other slaves (and conceivably, the entire rest of the machine) would merely sit idle. Another problem, near the bottom of the chess tree, is the extremely rapid time scales involved. Not only do the search times vary by a large factor, but this all happens at millisecond time scales. Any load-balancing procedure will therefore need to be quite fast and simple.
These ``chess hot spots'' must be explicitly taken care of. The master and submaster processors, besides just waiting for search answers, updating alpha-beta bounds, and so forth, also monitor what is going on with the slaves in terms of load balance. In particular, if some minimum number of slaves are idle and if there has been a search proceeding for some minimum amount of time, the master halts the search in the slave containing the hot spot, reorganizes all his idle slaves into a large team, and restarts the search in this new team. This process is entirely local to this master and his slaves and happens recursively, at all levels of the processor tree.
This ``shoot-down'' procedure is governed by two parameters: the minimum number of idle slaves, and the minimum time before calling a search a potential hot spot. These parameters are introduced to prevent the halting, processor rearrangement, and its associated overhead in cases which are not necessarily hot spots. The parameters are tuned for maximum performance.
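The trigger for this procedure amounts to a simple test; the parameter values below are placeholders, not the tuned values used in the program.

/* Shoot-down trigger (sketch).  A long-running search is declared a
 * potential hot spot only if enough slaves are idle and it has been
 * running for long enough.                                              */
#define MIN_IDLE_SLAVES       2        /* placeholder value */
#define MIN_SEARCH_SECONDS    0.05     /* placeholder value */

int should_shoot_down(int idle_slaves, double longest_search_seconds)
{
    return idle_slaves >= MIN_IDLE_SLAVES &&
           longest_search_seconds >= MIN_SEARCH_SECONDS;
}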
The payoff of dynamic load balancing has been quite large. Once the load-balancing code was written, debugged, and tuned, the program was approximately three times faster than before load balancing. Through observations of the speedup (to be discussed below), and also by looking directly at the execution of the program across the nCUBE (using the parallel graphics monitor, also to be discussed below) we have become convinced that the program is well load balanced and we are optimistic about the prospects for scaling to larger speedups on larger machines.
An interesting point regarding asynchronous parallel programming was brought forth by the dynamic load-balancing procedure. It is concerned with the question, ``Once we've rearranged the teams and completed the search, how do we return to the original hierarchy so as to have a reasonable starting point for the next search?'' Our first attempts at resetting the processor hierarchy met with disaster. It turned out that processors would occasionally not make it back into the hierarchy (that is, be the slave of someone) in time for another search to begin. This happened because of the asynchronous nature of the program and the variable amount of time that messages take to travel through the machine. Once this happened, the processor would end up in the wrong place in the chess tree and the program would soon crash. A natural thing to try in this case is to demand that all processors be reconnected before beginning a new search but we rejected this as being tantamount to a global resynchronization and hence very costly. We therefore took an alternate route whereby the code was written in a careful manner so that processors could actually stay disconnected from the processor tree, and the whole system could still function correctly. The disconnected processor would reconnect eventually-it would just miss one search. This solution seems to work quite well both in terms of conceptual simplicity and speed.
Speedup is defined as the ratio of sequential running time to parallel running time. We measure the speedup of our program by timing it directly with different numbers of processors on a standard suite of test searches. These searches are done from the even-numbered Bratko-Kopec positions [ Bratko:82a ], a well-known set of positions for testing chess programs. Our benchmark consists of doing two successive searches from each position and adding up the total search time for all 24 searches. By varying the depth of search, we can control the average search time of each benchmark.
The speedups we measured are shown in Figure 14.8 . Each curve corresponds to a different average search time. We find that speedup is a strong function of the time of the search (or equivalently, its depth). This result is a reflection of the fact that deeper search trees have more potential parallelism and hence more speedup. Our main result is that at tournament speed (the uppermost curve of the figure), our program achieves a speedup of 101 out of a possible 256. Not shown in this figure is our later result: a speedup estimated to be 170 on a 512-node machine.
Figure 14.8:
The Speedup of the Parallel Chess Program as a Function of
Machine Size and Search Depth. The results are averaged over a
representative test set of 24 chess positions. The speedup increases
dramatically with search depth, corresponding to the fact that there is
more parallelism available in larger searches. The uppermost curve
corresponds to tournament play: the program runs more than 100 times
faster on 256 nodes than on a single nCUBE node when playing at tournament
speed.
The ``double hump'' shape of the curves is also understood: The first dip, at 16 processors, occurs where the chess tree sometimes wants the processor hierarchy to be a one-level hierarchy and at other times a two-level hierarchy. We always use a one-level hierarchy for 16 processors, so we are suboptimal here. Perhaps this is an indication that a more flexible processor allocation scheme could do somewhat better.
One tool we have found extremely valuable in program development and tuning is a real-time performance monitor with color-graphics display. Our nCUBE hardware has a high-resolution color graphics monitor driven by many parallel connections into the hypercube. This gives sufficient bandwidth to support a status display from the hypercube processors in real time. Our performance-monitoring software was written by Rod Morison and is described in [ Morison:88a ].
The display shows us where in the chess tree each processor is, and it draws the processor hierarchy as it changes. By watching the graphics screen we can see load imbalance develop and observe dynamic load balancing as it tries to cope with the imbalance. The performance monitor gave us the first evidence that dynamic load balancing was necessary, and it was invaluable in debugging and tuning the load balancing code.
The best computer chessplayer of 1990 (Deep Thought) has reached grandmaster strength. How strong a player can be built within five years, using today's techniques?
Deep Thought is a chess engine implemented in VLSI that searches roughly 500,000 positions per second. The speed of Chiptest-type engines can probably be increased by about a factor of 30 through design refinements and improvements in fabrication technology. This factor comes from assuming a speed doubling every year, for five years. Our own results imply that an additional factor of 250 speedup due to coarse-grain parallelism is plausible. This is assuming something like a 1000-processor machine with each processor being an updated version of Deep Thought. This means that a machine capable of searching 3.75 billion ($3.75\times 10^{9}$) positions per second is not out of the question within five years.
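The arithmetic behind this estimate is simply
\[
  5\times 10^{5}\ \mbox{positions/s} \;\times\; 30 \;\times\; 250
  \;=\; 3.75\times 10^{9}\ \mbox{positions/s}.
\]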
Communication times will also need to be improved dramatically over the nCUBE-1 used. This will entail hardware specialization to the requirements of chess. How far communication speeds can be scaled and how well the algorithm can cope with proportionally slower communications are poorly understood issues.
The relationship between speed and playing strength is well-understood for ratings below 2500. A naive extrapolation of Thompson's results [ Thompson:82a ] indicates that a doubling in speed is worth about 40 rating points in the regime above 2500. Thus, this machine would have a rating somewhere near 3000, which certainly indicates world-class playing strength.
Of course nobody really knows how such a powerful computer would do against the best grandmasters. The program would have an extremely unbalanced style and might well be stymied by the very deep positional play of the world's best humans. We must not fall prey to the overconfidence which led top computer scientists to lose consecutive bets to David Levy!
Steve Otto and Ed Felten were the leaders of the chess project and did the majority of the work. Eric Umland began the project and would have been a major contributor but for his untimely death. Rod Morison wrote the opening book code and also developed the parallel graphics software. Summer students Ken Barish and Rob Fätland contributed chess expertise and various peripheral programs.
There is not now, nor will there be, a single software paradigm that applies to all parallel applications. The previous chapters have shown one surprising fact: at least 90% of all scientific and engineering computations can be supported by loosely synchronous message-passing systems, such as CrOS (Express) at a low level, and by data-parallel languages such as High Performance Fortran at a higher and somewhat less general level. The following chapters contain several different software approaches that sometimes are alternatives for synchronous and loosely synchronous problems, and sometimes are designed to tackle more general applications.
Figures 3.11(a) and 3.11(b) illustrate two compound problem architectures, one of which, the battle management simulation of Figure 3.11(b), is discussed in detail in Sections 18.3 and 18.4. MOVIE, discussed in Chapter 17, and the more ad hoc heterogeneous software approach described in Section 18.3 are designed as ``software glue'' to integrate many disparate interconnected modules. Each module may itself be data-parallel. The application of Figure 3.11(b) involves signal processing, for which data parallelism is natural, and linking of satellites, for which a message-passing system is natural. The integration needed in Figure 3.11(a) is of different modules of a complex design and simulation environment, such as that for a new aircraft or automobile described in Chapter 19. We redraw the application of Figure 3.11(a) in Figure 15.1 to indicate that one can view the software infrastructure in this case as forming a software ``bus'' or backbone into which one can ``plug'' application modules. This software integration naturally involves an interplay between parallel and distributed computing. This is shown graphically in Figure 15.2, which redisplays Figures 3.10 and 3.11(a) to better indicate the analogy between a software network (bus) and a heterogeneous computer network. The term metacomputer has been coined to describe the integration of a heterogeneous computer network so that it performs as a single system. Thus, we can term the systems in Chapter 17 and Section 18.3 metasoftware systems: software for implementing metaproblems on metacomputers.
Figure 15.1:
General Software Structure for Multidisciplinary Analysis and
Design
Figure:
The Mapping of Heterogeneous Problems onto Heterogeneous
Computer Systems combining Figure
3.10
and
Figure
3.11
(a).
The discussion in Chapter 16 shows how different starting points can lead to similar results. Express, discussed in Chapter 5 , can be viewed as a flexible (asynchronous) evolution of the original (loosely) synchronous CrOS message-passing system. Zipcode, described in Chapter 16 , starts with a basic asynchronous model but builds on top of it the structure necessary to efficiently represent loosely synchronous and synchronous problems.
MOOSE, described in the following section, was in some sense a dead end, but the lessons learned helped the evolution of our later software as described in Chapter 5 . MOOSE was designed to replace CrOS but users found it unnecessarily powerful for the relatively simple (loosely) synchronous problems being tackled by C P at the time.
The Time Warp operating system described briefly in Section 15.3 is an important software approach to the very difficult asynchronous event-driven simulations described in Section 14.1. The simulations described in Section 18.3 also used this software approach, combined in a heterogeneous environment with our fast, loosely synchronous system CrOS. This illustrates the need for and effectiveness of software designed to support a focussed subclass of problems. The evolution of software support for asynchronous problems would seem to require dividing the complete asynchronous class into subclasses for which one can separately provide appropriate support. The discussions of Sections 14.2, 14.3, 15.2, 15.3, and 18.3 represent C P's contributions to this isolation of subclasses with their needed software. Chapters 16 and 17 represent a somewhat different approach in developing general software frameworks which can be tailored to each problem class.
Applications involving irregular time behavior or dynamically varying data structures are difficult to program using the crystalline model or its variants. Examples are dynamically adaptive grids for studying shock waves in fluid dynamics, N-body simulations of gravitating systems, and artificial intelligence applications, such as chess. The few applications in this class that have been written typically use custom designed operating systems and special techniques.
To support applications in this class, we developed a new, general-purpose operating system called MOOSE for the Mark II hypercube [ Salmon:88a ], and later wrote an extended version called MOOS II for the Intel iPSC/1 [ Koller:88b ]. While the MOOSE system was fairly convenient for some applications, it became available at a time when the Mark II and iPSC/1 were falling into disuse because of uncompetitive performance. The iPSC/1 was used for MOOS II for two reasons: It had the necessary hardware support on the node, and, because of low performance, it had little production use for scientific simulation. Thus, we could afford to devote the iPSC/1 to the ``messy'' process of developing a new operating system which rendered the machine unusable to others for long periods of time. Only one major application was ever written using MOOSE (Ray tracing, [ Goldsmith:88a ], mentioned in Section 14.1 [ Salmon:88c ]), and only toy applications were written using MOOS II. Its main value was therefore as an experiment in operating system design and some of its features are now incorporated in Express (Section 5.2 ). The lightweight threads pioneered in MOOSE are central to essentially all new distributed- and shared-memory computing models-in particular MOVIE, described in Chapter 17 .
The user writes a MOOSE program as a large collection of small tasks that communicate with each other by sending messages through pipes, as shown in Figure 15.3 . Each task controls a piece of data, so it can be viewed as an object in the object-oriented sense (hence the name Multitasking Object-oriented OS). The tasks and pipes can be created at any time by any task on any node, so the whole system is completely dynamic.
Figure 15.3:
An Executing MOOSE Program Is a Dynamic Network (left) of Tasks
Communicating Through FIFO Buffers Called Pipes (right).
The MOOS II extensions allow one to form groups of tasks called teams that share access to a piece of data. Also, a novel feature of MOOS II is that teams are relocatable, that is, they can be moved from one node to another while they are running. This allows one to perform dynamic load balancing if necessary.
The various subsystems of MOOS II, which together form a complete operating system and programming environment, are shown in Figure 15.4 . For convenience, we attempted to preserve a UNIX flavor in the design and were also able to provide support for debugging and performance evaluation because the iPSC/1 hardware has built-in memory protection. Easy interaction with the host is achieved using ICubix, an asynchronous version of Cubix (Section 5.2 ) that gives each task access to the Unix system calls on the host. The normal C-compilers can be used for programming, and the only extra utility program required is a binder to link the user program to the operating system.
Figure 15.4:
Subsystems of the MOOS II Operating System
Despite the increased functionality, the performance of MOOS II on the iPSC/1 turned out to be slightly better than that of Intel's proprietary NX system.
Our plan was to use MOOS II to study dynamic load balancing (Chapter 11 ), and eventually incorporate a dynamic load balancer in the MOOSE system. However, our first implementation of a dynamic load balancer, along the lines of [ Fox:86h ], convinced us that dynamic load balancing is a difficult and many-faceted issue, so the net result was a better understanding of the subject's complexities rather than a general-purpose balancer.
The prototype dynamic load balancer worked as shown in Figure 15.5 and is appropriate for applications in which the number of MOOSE teams is constant but the amount of work performed by individual teams changes slowly with time. A centralized load manager on node 0 keeps statistics on all the teams in the machine. At regular intervals, the teams report the amount of computation and communication they have done since the last report, and the central manager computes the new load balance. If the balance can be improved significantly, some teams are relocated to new nodes, and the cycle continues.
Figure 15.5:
One Simple Load-Balancing Scheme Implemented in MOOS II
This centralized approach is simple and successful in that it relocates as few teams as possible to maintain the balance; its drawback is that computing which teams to move becomes a sequential bottleneck. For instance, for 256 teams on 16 processors, a simulated annealing optimization takes about 1.2 seconds on the iPSC/1, while the actual relocation process only takes about 0.3 seconds, so the method is limited to applications where load redistribution needs only to be done every 10 seconds or so. The lesson here is that, to be viable, the load optimization step itself must be parallelized. The same conclusion will also hold for any other distributed-memory machine, since the ratio of computation time to optimization time is fairly machine-independent.
Aside from finding some new reasons not to use old hardware, we were able to pinpoint issues worthy of further study, concerning parallel programming in general and load balancing in particular.
Future work will therefore have to focus less on the mechanism of moving tasks around and more on how to communicate load information between user and system.
MOOSE was written by John Salmon, Sean Callahan, Jon Flower and Adam Kolawa. MOOS II was written by Jeff Koller.
The C P references are: [Salmon:88a], [Koller:88a;88b;88d;89a], [Fox:86h].
Discrete-event simulations are among the most expensive of all computational tasks. With current technology, one sequential execution of a large simulation can take hours or days of sequential processor time. For example, many large military simulations take days to complete on standard single processors. If the model is probabilistic, many executions will be necessary to determine the output distributions. Nevertheless, many scientific, engineering, and military projects depend heavily on simulation because experiments on real systems are too expensive or too unsafe. Therefore, any technique that speeds up simulations is of great importance.
We designed the Time Warp Operating System (TWOS) to address this problem. TWOS is a multiprocessor operating system that runs parallel discrete-event simulations. We developed TWOS on the Caltech/JPL Mark III Hypercube. We have since ported it to various other parallel architectures, including the Transputer and a BBN Butterfly GP1000. TWOS is not intended as a general-purpose multiuser operating system, but rather as an environment for a single concurrent application (especially a simulation) in which synchronization is specified using virtual time [ Jefferson:85c ].
The innovation that distinguishes TWOS from other operating systems is its complete commitment to an optimistic style of execution and to process rollback for almost all synchronization. Most distributed operating systems either cannot handle process rollback at all or implement it as a rarely used mechanism for special purposes such as exception handling, deadlock, transaction abortion, or fault recovery. But the Time Warp Operating System embraces rollback as the normal mechanism for process synchronization, and uses it as often as process blocking is used in other systems. TWOS contains a simple, general distributed rollback mechanism capable of undoing or preventing any side effect, direct or indirect, of an incorrect action. In particular, it is able to control or undo such troublesome side effects as errors, infinite loops, I/O, creation and destruction of processes, asynchronous message communication, and termination.
TWOS uses an underlying kernel to provide basic message-passing capabilities, but the kernel is not used for any other purpose. On the Caltech/JPL Mark III Hypercube, this role was played by Cubix, described in Section 5.2. The other facilities of the underlying operating system are not used because rollback forces a rethinking of almost all operating system issues, including scheduling, synchronization, message queueing, flow control, memory management, error handling, I/O, and commitment. All of these are handled by TWOS. Only the extra work of implementing a correct message-passing facility prevents TWOS from being implemented on the bare hardware.
We have been developing TWOS since 1983. It is now an operational system that includes many advanced features such as dynamic creation and destruction of objects, dynamic memory management, and dynamic load management. TWOS is being used by the United States Army's Concept and Analysis Agency to develop a new generation of theater-level combat simulations. TWOS has also been used to model parallel processing hardware, computer networks, and biological systems.
Figure 15.6 shows the performance of TWOS on one simulation called STB88. This simulation models theater-level combat in central Europe [ Wieland:89a ]. The graph in Figure 15.6 shows how much version 2.3 of TWOS was able to speed up this simulation on varying numbers of nodes of a parallel processor. The speedup shown is relative to running a sequential simulator on a single node of the same machine used for the TWOS runs. The sequential simulator uses a splay tree for its event queue. It never performs rollback, and hence has a lower overhead than TWOS. The sequential simulator links with exactly the same application code as TWOS. It is intended to be the fastest possible general-purpose discrete-event simulator that can handle the same application code as TWOS.
Figure 15.6: Performance of TWOS on STB88
Figure 15.6 demonstrates that TWOS can run this simulation more than 25 times faster than running it on the sequential simulator, given sufficient numbers of nodes. On other applications, even higher speedups are possible. In certain cases, TWOS has achieved up to 70% of the maximum theoretical speedup, as determined by critical path analysis.
Research continues on TWOS. Currently, we are investigating dynamic load management [ Reiher:90a ]. Dynamic load management is important for TWOS because good speedups generally require careful mapping of a simulation's constituent objects to processor nodes. If the balance is bad, then the run is slow. But producing a good static balance takes approximately the same amount of work as running the simulation on a single node. Dynamic load management allows TWOS to achieve nearly the same speed with simple mappings as with careful mappings.
Dynamic load management is an interesting problem for TWOS because the utilizations of TWOS' nodes are almost always high. TWOS optimistically performs work whenever work is available, so nodes are rarely idle. On the other hand, much of the work done by a node may be rolled back, contributing nothing to completing the computation. Instead of balancing simple utilization, TWOS balances effective utilization, the proportion of a node's work that is not rolled back. Use of this metric has produced very good results.
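As a rough illustration (not TWOS's actual bookkeeping), effective utilization can be thought of as the committed fraction of the work a node performs; the field names below are hypothetical.

/* Sketch only: effective utilization = work that is never rolled back,
 * expressed as a fraction of elapsed time on the node.                */
typedef struct {
    double busy_time;        /* time spent executing events         */
    double rolled_back_time; /* portion of that work later undone   */
} NodeStats;

double effective_utilization(const NodeStats *n, double wall_clock)
{
    double committed = n->busy_time - n->rolled_back_time;
    return (wall_clock > 0.0) ? committed / wall_clock : 0.0;
}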
Future research directions for TWOS include database management, real-time user interaction with TWOS, and the application of virtual time synchronization to other types of parallel processing problems. [ Jefferson:87a ] contains a more complete description of TWOS.
Participants in this project were David Jefferson, Peter Reiher, Brian Beckman, Frederick Wieland, Mike Di Loreto, Philip Hontalas, John Wedel, Paul Springer, Leo Blume, Joseph Ruffles, Steven Belenot, Jack Tupman, Herb Younger, Richard Fujimoto, Kathy Sturdevant, Lawrence Hawley, Abe Feinberg, Pierre LaRouche, Matt Presley, and Van Warren.
Zipcode is a message-passing system developed originally by Skjellum, beginning at Caltech in the Summer of 1988 [ Skjellum:90a ], [ Skjellum:91c ], [ Skjellum:92c ] and [ Skjellum:91a ]. Zipcode was created to address features and issues absent in then-existing message-passing systems such as CrOS/Express, described in Section 5.2 . In particular, Zipcode was based on an underlying reactive asynchronous low-level message-passing system. CrOS was built on top of loosely synchronous low-level message-passing systems, which reflected C P's initial hardware and applications. Interestingly, both Zipcode and Express have evolved from their starting points to quite similar high-level functionality. Currently, Zipcode continues to serve as a vehicle to demonstrate high-level message-passing research concepts and, more importantly, to provide the basis for supporting vendor-independent scalable concurrent libraries; notably, the Multicomputer Toolbox [ Falgout:92a ], [Skjellum:91b;92a;92d]. The basic assertion of Zipcode is that carefully managed, expressive message passing is an effective way to program multicomputers and distributed computers, while low-level message passing is admittedly both error-prone and difficult.
The purpose of Zipcode is to manage the message-passing process within parallel codes in an open-ended way. This is done so that large-scale software can be constructed in a multicomputer application, with reduced likelihood that software so constructed will conflict in its dynamic resource use, thereby avoiding potentially hard-to-resolve, source-level conflicts. Furthermore, the message-passing notations provided are to reflect the algorithms and data organizations of the concurrent algorithms, rather than predefined tagging strategies. Tagging, while generic and easy to understand, proves insufficient to support manageable application development. Notational abstractions provide a means for the user to help Zipcode make runtime optimizations when a code runs on systems with specific hardware features. Abstraction is therefore seen as a means to higher performance, and notation is seen as a means towards more understandable, easier-to-develop-and-maintain concurrent software. Context allocation (see below) provides a ``social contract'' within which multiple libraries and codes can coexist reasonably. Contexts are like a system-managed ``hypertag''; contexts here are called ``Zipcodes''.
Safety in communication is achieved by context control; the main process data structure is the process list (a collective of processes that are to communicate). These constructs are handled dynamically by the system. Contexts are needed so that diverse codes can be brought together and made to work without the possibility of message-passing conflicts, and without the need to globalize the semantic and syntactic issues of message passing contained in each separate piece of code. For instance, the use and support of independently conceived concurrent libraries requires separate communication space, which contexts support. As applications mature, more contexts are likely to be needed, especially if diverse libraries are linked into the system, or a number of (possibly overlapping) process structures are needed to represent various phases of a calculation. In purely message-passing instances within Zipcode , contexts control the flow of messages through a global messaging resource. In more complex hierarchies, contexts will manage channel and/or shared-memory blocks in the user program, while the notation remains message-passing-like to the user. This evolution is transparent to the user.
Concurrent mathematical libraries are well supported by the definition of multidimensional, logical-process-grid primitives, as provided by Zipcode ; one-, two- and three-dimensional grids are currently supported (grid mail classes , also known as virtual topologies). Grids are used to assign machine-independent naming to the processes participating in a calculation, with a shape chosen by the user. Such grids form the basis for higher level data structures that describe how matrices and vectors are shared across a set of processes, but these descriptors are external to Zipcode . New grids may be aligned to existing grids to provide nesting, partitioning, and other desired subsetting of process grids, all done in the machine-independent notation of the parent grid. The routine whoami and associated routines, described in Section 5.2 , provide this capability in CrOS/Express.
Mail classes (such as new grid structures) may be added statically to the system; because code cannot move with data in extant multicomputers, mail classes have to be enumerated at compile-time. Because we at present retain a C implementation, rather than C++, the library must currently be modified explicitly to add new classes of mail, rather than by inheritance. Fortunately, the predefined classes (grids, tagged messages) address a number of the situations we have encountered thus far in practical applications. Non-mathematically oriented users may conceive of mail classes that we have not as yet imagined, and which might be application-specific.
Recently, we have evolved the Zipcode system to provide higher level application interfaces to the basic message-passing contexts and classes of mail. These interfaces allow us to unify the notions of heterogeneity and non-uniform memory access hierarchy in a single framework, on a context-by-context basis. For instance, we view a homogeneous collection of multicomputer nodes as a particular type of memory hierarchy. We see this unification of heterogeneity and memory hierarchy in our notation as an important conceptual advance, both for distributed- and concurrent-computing applications of Zipcode . Mainly, heterogeneity impacts transmission bandwidth and should not have to be treated as a separate feature in data transmission, nor should it be explicitly visible in user-defined application code or algorithms, except perhaps in highly restricted method definitions, for performance's sake.
For instance, the notations currently provided by Zipcode support writing application programs so that the same message-passing code can map reasonably well to heterogeneous architectures, to those with shared memory between subsets of nodes, and to those which support active-message strategies. Furthermore, it should be possible to cache limited internode channel resources within the library, transparent to the user. This is possible because the gather-send and scatter-receive notations remove message formatting from the user's control. We provide general gather/scatter specifications through persistent invoice data types. This notation is available both to C and Fortran programmers. As a side effect, we provide a clean interface for message passing in the Fortran environment. If compilers support code inlining and other optimizations, we are convinced that overheads can be drastically reduced for systems with lighter communication overheads than heretofore developed. Cheap dynamic allocation mechanisms also help in this regard, and are easily attainable. In all cases, the user will have to map the process lists to processors to take advantage of the hierarchies, but this can be done systematically using Zipcode .
We define message-passing operations on a context-by-context basis (methods), so that the methods implementing send, receive, combine, broadcast , and so on, are potentially different for each context, reflecting optimizations appropriate to given parts of a hierarchy (homogeneity, power-of-two, flat shared memory, and so on). We have to rely on the user to map the problem to take advantage of such special contexts, but we provide a straightforward mechanism to take advantage of hierarchy through the gather/scatter notation. When compilers provide inlining, we will see significant improvements in performance for lower latency realizations of the system. Higher level notation, and context-by-context method definitions are key to optimizing for memory hierarchy and heterogeneity. Because the user provides us with information on the desired operations, rather than instructions on how to do them, we are able to discover optimizations. Low-level notations cannot hope to achieve this type of optimization, because they do not expose the semantic information in their instructions, nor work over process lists, for which special properties may be asserted (except with extensive compile-time analysis).
This evolutionary process implies that Zipcode has surpassed its original Reactive Kernel/Cosmic Environment platform; it is now planned that Zipcode implementations will be based on one or more of the following in a given implementation:
Importantly, when a code is moved to a system that does not have special features (e.g., a purely message-passing system), the user code's calls to Zipcode will compile down to pure message passing, whereas the calls compile down to faster schemes within special hierarchies. This multifaceted approach to implementing Zipcode follows its original design philosophy; originally, the CE/RK primitives upon which Zipcode is based were the cheapest available primitives for system-level message passing, and hence the most attractive on which to build higher level services like Zipcode . Today, vendor operating systems are likely to provide additional services in the other categories mentioned above which, if used directly in applications, would prove unportable, unmanageable, or too low level (like direct use of CE/RK primitives). If a user needs to optimize a code for a specific system, he or she works in terms of process lists and contexts to get desirable mappings from which Zipcode can effect runtime optimizations.
The CE/RK primitives (originally central to Zipcode ) manage memory as well as message-passing operations. This is an important feature, carried into the Zipcode system, which helps reduce the number of copies needed to pass a message from sender to recipient (and hence the wasted bandwidth ). In CE/RK the system provides message space, which is freed upon transmission and allocated upon receipt. This approach removes the need for complicated strategies involving asynchronous sends, in which the user has to poll to see when his or her buffer is once again usable. Since the majority of transmissions in realistic applications involve a gather before send (and scatter on receive), rather than block-data transmission, these semantics provide, on the whole, good notational and performance benefits, while retaining simplicity. Zipcode extends the concept of the CE/RK-managed messages to include buffered messages (for global operations) and synchronizations. These three varieties of primitives make different assumptions about how memory is allocated (and by whom), and are implemented with the most efficient available system calls in a given Zipcode implementation. In all cases, actual memory allocation can be effected using lightweight allocation procedures in efficient implementations, rather than heavyweight mallocs. Therefore, the dynamic nature of the allocations need not imply significant performance penalties.
When moving Zipcode to a new system, the CE/RK layer will normally be the first interface provided, with additional interfaces provided if the hardware's special properties so warrant. In this way, user codes and libraries will come up to speed quickly, yet attain better performance as the Zipcode port is optimized for the new system. We see this as a desirable mode of operation, with the highest initial return on investment.
To appreciate the model upon which current Zipcode implementations are built, one needs to understand the scope and expressivity of the low-level CE/RK system.
Implementations of Zipcode to date interface to primitives of the CE/RK, a portable, lightweight multicomputer node operating system, which provides untyped blocked and unblocked message passing in a uniform host/node model, including type conversion primitives for heterogeneous host-node communications. Presently, the Reactive Kernel is implemented for Intel iPSC/1, iPSC/2, Sequent Symmetry, Symult S2010, and Intel iPSC860 Gamma prototype multicomputers, with emulations provided for the Intel Delta prototype, Thinking Machines CM-5, and nCUBE/2 6400 machines. Furthermore, Intel provides the CE/RK primitives at the lowest level (read: highest performance) on its Paragon system. CE/RK is also emulated on shared-memory computers such as the BBN TC2000 as well as networks of homogeneous NFS-connected workstations (e.g., Sun clusters). We see CE/RK primitives as a logical, flexible platform for our work and for other message-passing developers, upon which higher level layers such as Zipcode can be ported. Because most tagged message-passing systems with restrictive typing semantics do not provide quite enough receipt selectivity directly to support Zipcode , we find it often best to implement untagged primitives as the interface to which Zipcode works. The CE/RK emulations, built most often on vendor primitives, make use of any available tagging for bookkeeping purposes, and allow users of a specific vendor system to mix vendor-specific message passing with CE/RK- or Zipcode -based messaging.
One should view the CE/RK system as the default message-space management system for Zipcode (in C++ parlance, default constructor/new, destructor/free mechanisms), with the understanding that future implementations of Zipcode may prefer to use more primitive calls (e.g., packet protocols or active messages) to gain even greater performance. (Alternatively, if Paragon or similar implementations are very fast, such shortcuts will have commensurately less impact on performance.) Via the shortcut approach, Zipcode analogs of CE/RK calls will become the lowest level interface of message passing in the system, and become a machine-dependent layer.
The Cosmic Environment (CE) provides control for concurrent computation through a ``cube dæmon.'' This resource manager allows multiple real and emulated concurrent computers to be space-shared [ Seitz:88a ]. The following functions are provided, and we emulate these on systems that provide analogous host functionality (this emulation can be done efficiently on the nCUBE/2, almost trivially on the Gamma, and not at all on the Delta and CM-5):
To support Zipcode , we normally emulate the CE functions below. Again, some of these functions, particularly getmc() , freemc() , spawn() , and ckill() , are not available on all implementations (for instance, the Delta) and are restricted to the host program:
The RK calls required by Zipcode are as follows:
It is important to note that xsend() and xmsend() deallocate the message buffer after sending the message; they are semantically analogous to xfree() . The receive functions xrecv() and xrecvb() are semantically analogous to xmalloc() .
Zipcode -based programs are not to call any of these CE or RK functions directly. Both message passing and environment control are represented in Zipcode .
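Purely to illustrate the allocation semantics described above, and not as a pattern for Zipcode application code, the following sketch shows how a CE/RK-level sender and receiver might exchange one message. The prototypes are assumptions for illustration; exact signatures differ between CE/RK implementations.

/* Sketch only: the CE/RK prototypes below are assumed, not taken from
 * a particular CE/RK release; consult the CE/RK documentation.       */
extern char *xmalloc(int nbytes);
extern void  xsend(char *msg, int node, int pid);
extern char *xrecvb(void);
extern void  xfree(char *msg);

void sender_side(int dest_node, int dest_pid, int msg_len)
{
    char *msg = xmalloc(msg_len);    /* system allocates message space    */
    /* ... fill in the payload ... */
    xsend(msg, dest_node, dest_pid); /* the send deallocates msg          */
    /* msg must not be referenced after xsend()                           */
}

void receiver_side(void)
{
    char *msg = xrecvb();            /* blocking receive allocates msg    */
    /* ... use the payload ... */
    xfree(msg);                      /* receiver frees the message space  */
}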
Mailers maintain contexts and process lists. All communication operations use mailers. Mailers are created through a loose synchronization between the members of the proposed mailer's process list. A single process creates the process list, placing itself first in the list, and initiates the ``mailer-open'' call with this process information; as the initiator, it is called the ``Postmaster'' for the mailer. The other participants receive the process list as part of the synchronization procedure. A special reactive process, the ``Postmaster General,'' maintains and distributes zip codes as mailers are opened; essentially, the zip code count acts as a single location of shared memory. Below, each class defines an ``open'' function to create its mailer.
We have been able to classify a number of message-passing systems in [ Skjellum:92c ], though specific differences in sending and receiving strategies exist among common tagged-message-passing systems. In Zipcode , we define the L-class, which provides for receipt selectivity based on message source in unabstracted {node,pid} notation, and on a long-integer tag. This class can be used to define one or more contexts of tagged message systems, which call the primitives described fully in [ Skjellum:92b ]. However, and perhaps more interestingly, these L-class calls can be used to generate wrappers for all the major tagged message-passing systems. In the Zipcode manual we illustrate how this is done by showing a few of the wrappers for the PICL, NX, and Vertex systems [ Skjellum:92b ]. We also have a long-standing Zipcode -based emulation of the Livermore Message Passing System (LMPS) [ Welcome:92a ].
Furthermore, for each context a user declares, he or she is guaranteed that the L-class messages will not be mixed up, so that if vendor-style calls are used in different libraries, then these will not interfere with other parts of a program. This allows several existing tagged subroutines or programs to be brought together and face-lifted easily to work together, without changing tags or checking when and where message-passing resources might conflict. In short, this provides a general means to ensure tagged-message safety, as contemplated in [ Hart:93a ].
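As an illustration of the wrapper idea only, and with the caveat that the entry points below are hypothetical (the real L-class primitives are defined in [ Skjellum:92b ]), a vendor-style tagged send and receive could be layered on L-class calls roughly as follows.

/* Hypothetical sketch: l_send()/l_recvb() and their argument lists are
 * invented for illustration; see the Zipcode manual for the actual
 * L-class primitives and the PICL/NX/Vertex wrappers built on them.   */
typedef struct ZIP_MAILER ZIP_MAILER;

extern void  l_send (ZIP_MAILER *mlr, int node, int pid,
                     long tag, char *letter);
extern char *l_recvb(ZIP_MAILER *mlr, int node, int pid, long tag);

static ZIP_MAILER *wrapped_mailer;   /* one context per wrapped library */

/* vendor-style tagged send, confined to wrapped_mailer's context */
void vendor_send(int dest_node, int dest_pid, int type, char *buf)
{
    l_send(wrapped_mailer, dest_node, dest_pid, (long) type, buf);
}

/* vendor-style tagged blocking receive */
char *vendor_recv(int src_node, int src_pid, int type)
{
    return l_recvb(wrapped_mailer, src_node, src_pid, (long) type);
}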
Class-specific primitives for G2-Class mail have been defined for both higher efficiency and better abstraction. Small-g calls require mailer specification while big-G calls do not, analogous to the y- and Y-type calls defined generically above.
char *letter = g2_Recv(ZIP_MAILER *mailer, int p, int q);    /* unblocked */
letter = g2_Recvb(ZIP_MAILER *mailer, int p, int q);         /* blocked   */
void g2_Send(ZIP_MAILER *mailer, char *letter, int p, int q);
Collective operations combine, broadcast (fanout), and collapse (fanin) are defined and have been highly tuned for this class (see schematics in [ Skjellum:90c ]).
Combines and fanins are over arbitrary associative-commutative operators specified by (*comb_fn)() . Broadcasts share data of arbitrary length, assuming all participants know the source. Collapses combine information assuming all participants know the destination:
int error = g2_combine(ZIP_MAILER *mailer,  /* 2D grid mailer                */
                       void *buffer,        /* where result is accumulated   */
                       void (*comb_fn)(),   /* operator for combine          */
                       int size,            /* size of buffer items in bytes */
                       int items);          /* number of buffer items        */

error = g2_fanout(ZIP_MAILER *mailer,
                  void **data,              /* data/result         */
                  int *length,              /* data length         */
                  int orig_p, int orig_q);  /* grid origin of data */

error = g2_fanin(ZIP_MAILER *mailer,
                 int dest_p, int dest_q,    /* destination on grid */
                 void *buffer,
                 void (*comb_fn)(),
                 int size, int nitems);
Shorthands provide direct access to row and column children mailers, tersely providing common communication patterns within the two-dimensional grid:
g2_row_combine(mailer, buffer, comb_fn, size, items);
g2_col_combine(mailer, buffer, comb_fn, size, items);
g2_row_fanout(mailer, &data, &length, orig_q);
g2_col_fanout(mailer, &data, &length, orig_p);
and
g2_row_fanin(mailer, dest_q, buffer, comb_fn, size, items);
g2_col_fanin(mailer, dest_p, buffer, comb_fn, size, items);
The row/column instructions above compile to G1-grid calls, since rows and columns of G2 mailers are implemented via G1 mailers. G2-Grid mailer creation:
ZIP_MAILER *mailer = g2_grid_open(int *P, int *Q, ZIP_ADDRESSEES *addr);
Once a G2 grid mailer has been established, it is possible to derive subgrid mailers by a cooperative call between all the participants in the original g2_grid_open() . In normal applications, this will result in a set of additional mailers in the postmaster (usually host program) process, and one additional G2 grid mailer in each node process. This call allows subgrids to be aligned to the original grid in reasonably general ways, but requires a basic cartesian subgridding, in that each subgrid defined must be a rectangular collection of processes.
The postmaster of the original mailer (often the host process) initiates the subgrid open request as follows:
/* array of pointers to subgrid mailers: */
ZIP_MAILER **subgrid_mailers =
    g2_subgrid_open(ZIP_MAILER *mailer,   /* mailer already opened */
                    /* for each (p,q) on original grid, marks its subgrid: */
                    int (*select_fn)(int p, int q, void *extra),
                    void *select_extra,   /* extra data needed by select_fn() */
                    int *nsubgrids);      /* the number of subgrids created   */
while each process in the original g2_grid_open() does a second, standard g2_grid_open() :
ZIP_MAILER *subgrid_mailer = g2_grid_open(&P, &Q, NULL);
Each subgrid so created has its own unique contexts of communication.
Finally, it is often necessary to determine the grid shape , as well as the current process's location on the grid , when using two-dimensional logical grids. Often this information is housed only in the mailer (though some applications may choose to duplicate this information). The following calls provide simple access to these four quantities.
int p, q, P, Q;
ZIP_MAILER *mailer;

/* set variables (P,Q) to grid shape: */
void g2_PQ(ZIP_MAILER *mailer, int P, int Q);

/* set variables (p,q) to grid position: */
void g2_pq(ZIP_MAILER *mailer, int p, int q);
These are the preferred forms for accessing grid information from G2-Class mailers.
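To connect these calls, here is a small sketch of how a process might open a 2-by-2 logical grid over a given addressee list and sum one double across its grid row. The operator's argument convention is an assumption (the actual comb_fn prototype is defined by Zipcode, not by this example), the Zipcode declarations are assumed to be in scope, and error handling is omitted.

/* Sketch under assumptions: comb_fn is taken here to fold 'items'
 * elements of 'size' bytes from 'incoming' into 'accum' in place.   */
static void sum_doubles(void *accum, void *incoming, int size, int items)
{
    double *a = (double *) accum, *b = (double *) incoming;
    int k;
    for (k = 0; k < items; k++)
        a[k] += b[k];
}

void grid_fragment(ZIP_ADDRESSEES *addr)
{
    int P = 2, Q = 2;            /* requested logical grid shape   */
    double value = 1.0;          /* this process's contribution    */
    ZIP_MAILER *grid;

    grid = g2_grid_open(&P, &Q, addr);   /* cooperative grid open   */
    g2_row_combine(grid, &value, sum_doubles, (int) sizeof(double), 1);
    /* 'value' now holds the sum of contributions along my grid row */
}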
Shorthands provide access to the PQ-plane, QR-plane, and PR-plane children to which G2 grid operations may be applied, as above.
ZIP_MAILER *mailer;   /* 3D grid mailer */
ZIP_MAILER *PQ_plane_mailer, *QR_plane_mailer, *PR_plane_mailer;

PQ_plane_mailer = g3_PQ_plane(mailer);
QR_plane_mailer = g3_QR_plane(mailer);
PR_plane_mailer = g3_PR_plane(mailer);
To facilitate ease of use and to support heterogeneous parallel computers, Zipcode provides a mechanism to pack and unpack buffers and letters. Buffers are unstructured arrays of data provided by the user; they are applicable with buffer-oriented Zipcode commands. Letters are unstructured arrays of data provided by Zipcode based on user specification; they are tied to specific mail contexts and are dynamically allocated and freed.
Pack (gather) and unpack (scatter) are implemented with the use of Zip_Invoices . The analogy is taken from invoices or packing slips used to specify the contents of a postal package. An invoice informs Zipcode what variables are to be associated with a communication operation or communication buffer. This invoice is subsequently used when zip_pack() ( zip_unpack() ) is called to copy items from the variables specified into (out of) the communication buffer space to be sent (received); this implements gather-on-send- and scatter-on-receive-type semantics. In a heterogeneous environment, pack/unpacking will allow data conversions to take place without user intervention. Users who code with zip_pack() / zip_unpack() will have codes that are guaranteed to work in heterogeneous implementations of Zipcode .
The zip_new_invoice() call creates new invoices:
void zip_new_invoice(Zip_Invoice **inv, const char *format, ...)
The call zip_new_invoice() creates an invoice ( **inv ), while taking a variable number of arguments, starting with a format string ( format ) similar to the commonly used printf() strings. The format string contains one or more conversion specifications. A conversion specification is introduced by a percent sign (``%'') and is followed by:
For both the number of items to convert and stride, ``*'' or ``&'' can replace the hard-coded integer. If `*' is used, then the next argument in the argument list is used as an integer expression specifying the size of the conversion (or stride). Both the number of items to convert and the stride factor can be indirected by using ``&'' instead of an integer. The ``&'' indicates that a pointer to an integer should be stored, which will address the size of the invoice item (or stride) when it is packed. When ``&'' is used, the size is not evaluated immediately but is deferred until the actual packing of the data occurs. The ``&'' indirection consequently allows variable-size invoices to be constructed at runtime; we call this feature deferred sizing . The ``*'' allows the size of an invoice item (or stride) to be specified at run time.
One must be cautious of the scope of C variables when using ``&.'' For example, it is improper to create an invoice in a subroutine that has a local variable as a stride factor and then attempt to pass this invoice out and use it elsewhere, since the stride factor points at a variable that is no longer in scope. Unpredictable things will happen if this is attempted.
The single character types that are supported are as follows:
``c'' character,
``s'' short,
``i'' int,
``l'' long,
``f'' float, and
``d'' double.
For each conversion specification, a pointer to an array of that type must be passed as an argument.
User-defined types may be added to the system to ease the packing of complicated data structures. An extra field (for passing whatever the user wants) may be passed to the conversion routines by adding ``(*)'' to the end of the user-type name. The ``-'' character can be used to skip space so that one can selectively push/pull things out of a letter. This allows for unpacking part of a letter and then unpacking the rest based on the part unpacked.
The following code would pack variable i followed by elements of the double_array .
/* Example 1 */
ZIP_MAILER *mlr;
char *letter;
...
Zip_Invoice *invoice;
int i = 20;
int length;
double double_array[20];

zip_new_invoice(&invoice, "%i%10.2d", &i, double_array);
...
/* use the invoice (see below) */
letter = zip_malloc(mlr, zip_sizeof_invoice(mlr, invoice));
length = zip_pack(mlr, invoice, ZIP_LETTER, &letter, ZIP_IGNORE);
if (length == -1)
    /* an error occurred */
    ...
The second example is a variant of the first. The first pack call is the same, while the second packs the first five elements of the double_array .
/* Example 2 */
int len = 10, stride = 2;

zip_new_invoice(&invoice, "%i%&.&d", &i, &len, &stride, double_array);

/* use the invoice */
letter = zip_malloc(mlr, zip_sizeof_invoice(mlr, invoice));
length = zip_pack(mlr, invoice, ZIP_LETTER, &letter, ZIP_IGNORE);
...
len = 5;      /* set the length and stride for this use of the invoice */
stride = 1;

/* use the invoice */
letter = zip_malloc(mlr, zip_sizeof_invoice(mlr, invoice));
length = zip_pack(mlr, invoice, ZIP_LETTER, &letter, ZIP_IGNORE);
If a user-defined type matrix has been added to the system to pack matrix structures, then the following example shows how matrix -type data can be used in an invoice declaration. See also below on how to add a user-defined type.
/* Example 3 */
struct matrix M;   /* some user-defined type */
int i;
Extra extra;       /* contains some special info on packing a      */
                   /* `matrix'; often this will not be needed,     */
                   /* but this feature is provided for flexibility */

zip_new_invoice(&invoice, "%i%matrix(*)%20d", &i, &M, &extra, double_array);
At times it might be useful to know the size (in bytes) that is needed to hold the variables specified by an invoice. zip_sizeof_invoice returns the size (in bytes) that the invoice will occupy when packed. We have already used this in several examples above.
int zip_sizeof_invoice(ZIP_MAILER *mailer, Zip_Invoice *inv)
To delete an existing invoice when there is no more need for it, use zip_free_invoice() :
void zip_free_invoice(Zip_Invoice **inv)
This will free up the specified invoice and set *inv = NULL to help flag accidental access.
User-defined types for pack and unpack routines are defined using a registry mechanism provided by Zipcode .
int zip_register_invoice_type(char *name, Method *in, Method *out,
                              Method *len, Method *align)
The structure Method is a composite of a pointer-to-function, and additional state information for a function call. The details of Method declarations are beyond the scope of this presentation.
In the above, name is the user-defined name for the auxiliary type. User-defined names follow the ANSI standard for C identifiers. They begin with a nondigit (characters ``A'' through ``Z,'' ``a'' through ``z,'' and the underscore ``_''), followed by one or more nondigits or digits. User-defined type names currently have global scope so beware of name conflicts. User-defined types cannot be the same as one of the built-in types specified above. The in , out , len , and align are the Methods used to pack/unpack the user-defined type. They must have the following parameter lists
int in(ZIP_MAILER *mailer, void *
16.3.2 Packing and Unpacking
Packing is done when one wishes to copy the variables into the communications buffer space prior to transmission; to access the contents of a packed buffer, one must unpack it first.
int zip_pack(ZIP_MAILER *mailer, Zip_Invoice *inv, int buffer_type,
             char **ptr, int len)
This command packs the invoice. The meaning of buffer_type is either `` ZIP_BUFFER '' or `` ZIP_LETTER ,'' indicating whether we are packing into a buffer (say for a combine or fanout) or a letter (for sends/receives).
If one is packing a buffer and has preallocated the buffer space, then len must be set to the size of this allocated buffer space. If the invoice is too large to fit in this buffer space, an error occurs. By specifying *ptr = NULL and len = ZIP_IGNORE , the pack routine will allocate the space for the buffer based on the size of the invoice to be packed. Alternatively, if a preallocated letter is being packed, then pack will fill in the letter by using the invoice. If the letter provided is not large enough, then an error will occur. If no preallocated letter is available, the pack routine can create one automatically, provided *ptr = NULL . Note that len is ignored when letters are involved, as the size of letters can be determined with Zip_length() ; len should always be ZIP_IGNORE when packing letters. For either case, zip_pack() returns the number of bytes that the data from the invoice occupies in the communication space (letter or buffer).
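For example, a buffer (say, for a later combine) could be packed with automatic allocation as in the following sketch; mlr and invoice are assumed to have been created as in the earlier examples, the Zipcode declarations are assumed to be in scope, and error handling is schematic.

/* Sketch: pack into a buffer that zip_pack() allocates itself
 * (*ptr == NULL together with len == ZIP_IGNORE requests this). */
char *pack_into_buffer(ZIP_MAILER *mlr, Zip_Invoice *invoice, int *nbytes)
{
    char *buffer = NULL;
    *nbytes = zip_pack(mlr, invoice, ZIP_BUFFER, &buffer, ZIP_IGNORE);
    if (*nbytes == -1)
        return NULL;    /* packing failed */
    return buffer;      /* caller uses it in a combine or fanout */
}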
To unpack a letter, use
int zip_unpack(ZIP_MAILER *mailer, Zip_Invoice *inv, int buffer_type,
               char *ptr)
As in zip_pack() , inv is the invoice to unpack. The buffer_type parameter indicates the type of communication space being used; that is, whether we are unpacking a letter ( buffer_type = ZIP_LETTER ) or a buffer ( buffer_type = ZIP_BUFFER ). The parameter ptr is a pointer to the communication space. Unlike zip_pack() , we pass a pointer to the communication space to zip_unpack() , not a pointer to a pointer. The communication space must be freed by the caller after it is unpacked.
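On the receiving side, the pieces fit together roughly as follows. The letter-releasing call is written here as a hypothetical zip_free(), since this chapter does not name the actual deallocation routine, and the grid coordinates are placeholders.

/* Sketch: blocking receive on a G2 grid, scatter into the variables
 * described by 'invoice', then release the letter.  zip_free() is a
 * hypothetical name for the letter-deallocation call.               */
void receive_and_unpack(ZIP_MAILER *g2mlr, Zip_Invoice *invoice,
                        int src_p, int src_q)
{
    char *letter = g2_Recvb(g2mlr, src_p, src_q);    /* blocked receive  */
    zip_unpack(g2mlr, invoice, ZIP_LETTER, letter);  /* scatter contents */
    zip_free(g2mlr, letter);   /* caller frees the communication space   */
}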
16.3.3 The Packed-Message Functions
As may be apparent, many packs are followed almost immediately by sends, while corresponding receives are followed closely by unpacks. Not only is this notationally somewhat tedious, but it also limits the optimizations that can be done by Zipcode . To create a more flexible system, Zipcode provides the capability to do both the pack and communications in a single call. For instance,
g3_pack_send(ZIP_MAILER *g3mailer, int d1, int d2, int d3, Zip_Invoice *invoice)
takes care of creating the letter, packing the invoice, and sending it to the grid location specified by {d1,d2,d3} . Whenever possible, use pack_send-style routines, as they will generally be more amenable to run-time optimization than pack calls followed by send calls. Packed versions of collective operations are also provided. Here is the specific syntax for the G2, two-dimensional-grid pack combine:
int g2_pack_combine(ZIP_MAILER *g2mlr, Zip_Invoice *invoice, void (*func)())
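A brief usage sketch follows, reusing the illustrative sum_doubles operator from the grid example earlier in this section; the coordinates are placeholders and the reading of g2_pack_combine's return value as nonzero-on-error is an assumption.

/* Sketch: single-call packed operations over previously opened mailers. */
void packed_examples(ZIP_MAILER *g2mlr, ZIP_MAILER *g3mlr,
                     Zip_Invoice *invoice)
{
    /* pack 'invoice' and combine it over the whole 2D grid */
    if (g2_pack_combine(g2mlr, invoice, sum_doubles) != 0) {
        /* assumed: nonzero indicates failure */
    }

    /* pack 'invoice' and send it to location {0,0,0} of a 3D grid */
    g3_pack_send(g3mlr, 0, 0, 0, invoice);
}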
16.3.4 Fortran Interface
A Fortran (F77) interface is provided, but certain features have necessarily been omitted (awaiting Fortran 90). The syntax is different since there are no pointers in F77. No user-defined types are provided in the Fortran interface, as Fortran 77 does not provide structures. Once Fortran 90 has been adopted, user-defined types will likely appear in the Fortran interface.
Since Fortran does not allow variable argument functions, the construction of invoices differs from that of the C interface. An invoice is built up over several function calls, each one specifying the next field in the invoice.
C Example 1F
      integer mailer
      integer letter
      ...
      integer invoice
      integer i, length
      double precision double_array(20)

      call ZIP_INV_NEW(invoice)
      call ZIP_INV_ADD_INT(invoice, i, 1, 1, .false., .false.)
      call ZIP_INV_ADD_DBLE(invoice, double_array, 10, 2, .false., .false.)
C use the invoice
      call ZIP_SIZEOF_INVOICE(mailer, invoice, length)
      call YMALLOC(mailer, length, letter)
      call ZIP_PACK(mailer, invoice, ZIP_LETTER, letter, ZIP_IGNORE, length)
      if (length .eq. -1) then
C an error occurred
      ...
The above example packs the same invoice that Example 1 does in C. The last two arguments to ZIP_INV_ADD_INT() and ZIP_INV_ADD_DBLE() are the ``ignore-space'' and ``deferred-sizing'' logicals, respectively, to be explained via examples below. They also appear in the C syntax, but as part of the argument string via special characters.
C Example 2F
      integer invoice
      integer i, len, stride
      double precision double_array(20)

      i = 20
      len = 10
      stride = 2
      call ZIP_INV_NEW(invoice)
      call ZIP_INV_ADD_INT(invoice, i, 1, 1, .false., .false.)
C
C The .true. argument invokes deferred sizing of the data:
      call ZIP_INV_ADD_DBLE(invoice, double_array, len, stride, .false., .true.)
C pack or unpack call is made...
      len = 5
      stride = 1
C pack or unpack call is made...
This example performs the same work the C Example 2 did. To ignore space in a pack or unpack call, the ignore-space logical is set true. For instance:

      call ZIP_INV_ADD_INT(invoice, 1, 1, .true., .false.)

To ignore space and use deferred-sizing evaluation, both flags are set true:

      call ZIP_INV_ADD_INT(invoice, len, stride, .true., .true.)

Other Zipcode calls are very similar in Fortran to the C versions. A preprocessor is used to create some definitions for use by the Fortran programmer. The following conventions are followed in the Fortran interface.
In this section, we cover the initialization and termination protocols, and discuss how to get node processes spawned in the Zipcode model of multicomputer programming. Within this model, the user is not allowed to call CE/RK functions directly.
Each host program and node program must call the appropriate initialization function to initialize Zipcode for their process:
int error = Zip_init(void);      /* assume default mode for initialization */
error = Zip_global_init(void);   /* assume a simpler host-master model     */
void zip_exit(void);             /* terminate Zipcode session              */
The basic mailer manipulation commands (such as g1_grid_open() ) require the specification of process lists, currently as ordered pairs of nodes and process IDs packaged within a ZIP_ADDRESSEES structure. For application convenience, we supply optional commands to support the creation of such collections. One common collection is a cohort, a set of processes with the same process ID, distributed across a number of nodes. A cohort is often used to formulate a single-program, multiple-data (SPMD) calculation.
Cohort list creation:
ZIP_ADDRESSEES *addressees =
    zip_new_cohort(int N,           /* number of processes involved                 */
                   int node_bias,   /* node number of zeroth entry in list          */
                   int cohort_pid,  /* process ID of entire collection of processes */
                   int host_flag);  /* flags whether host process participates      */
Additionally, we provide a Zipcode -level spawn mechanism:
int result =
    zip_spawn(char *prog_name,             /* ASCII name of program to spawn                  */
              ZIP_ADDRESSEES *addressees,  /* addressee list to spawn                         */
              void *state,                 /* future expansion                                */
              int pm_flag);                /* flags if program is spawned on zeroth addressee */
where result is nonzero on failure. Most implementations require that this spawning function be effected in the host process, although the original CE/RK system did not make this restriction. A compatible zip_kill() is also defined:
result = zip_kill(addressees);
With the addition of these functions, Zipcode specifies an entire programming environment that can be completely divorced, if desired, from its original foundations in the CE/RK. This is possible so long as one can emulate appropriate CE/RK functions for Zipcode to use. This has been accomplished in release 2.0 of nCUBE's 6400 system software, for instance.
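To make the execution model concrete, here is a hedged sketch of a host program that initializes Zipcode, builds a cohort, spawns a node program onto it, and shuts down. The program name, node count, and the assumption that zero return values indicate success are placeholders, and the Zipcode declarations are assumed to be available from the appropriate header.

/* Sketch of a Zipcode host program; "node_prog" and the counts are
 * placeholders, and the sequence is illustrative rather than a
 * verbatim template from the Zipcode distribution.                 */
int main(void)
{
    ZIP_ADDRESSEES *cohort;
    int N = 4;                      /* number of node processes          */

    if (Zip_init() != 0)            /* default-mode init; 0 = success    */
        return 1;                   /* (assumed convention)              */

    /* nodes 0..N-1, process ID 0, host does not participate */
    cohort = zip_new_cohort(N, 0, 0, 0);

    if (zip_spawn("node_prog", cohort, NULL, 0) != 0) {  /* nonzero = failure */
        zip_exit();
        return 1;
    }

    /* ... open mailers and communicate with the node processes ... */

    zip_kill(cohort);               /* tear the node processes down      */
    zip_exit();                     /* terminate the Zipcode session     */
    return 0;
}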
Zipcode currently provides portable message-passing capability on a number of multicomputers. It also works on homogeneous networks of workstations and will soon be supported on heterogeneous networks and for heterogeneous multicomputers, when such systems are created. The key benefits of Zipcode are its ability to work over process lists designated by the user, to define separate contexts of communication so that message-passing complexity can be managed, and to allow different notations of message-passing appropriate to the concurrent algorithms being implemented. Tagged message passing is seen as a special case of the notations supported by Zipcode .
We see notational abstraction as helpful in dealing with issues of non-uniform memory access hierarchy and heterogeneity in multicomputers and distributed computers. Abstraction is a way to help Zipcode find additional runtime optimizations, rather than a tacit source of inefficiency. We believe that Zipcode implementations will be competitive in performance with tagged message systems whenever vendors make low-level access to their hardware and system calls available to us during our implementation phase.
The software system described here, MOVIE (Multitasking Object-oriented Visual Interactive Environment), is the most sophisticated developed by C P. Indeed, it is sufficiently complicated that the project, led by Wojtek Furmanski, didn't finish the first prototype until two years after the end of C P and Furmanski's move to Syracuse. MOVIE is designed to address the general compound problem class introduced in Section 3.6 and illustrated in Chapter 18 . Sections 17.3 and 17.2.10 describe current and potential MOVIE applications, and so provide an interesting discussion of many examples of this complex problem class. MOVIE is a new software system, integrating High Performance Computing (HPC) with the Open Systems standards for graphics and networking. The system was designed and prototyped by Furmanski at Caltech within the Caltech Concurrent Computation Program, and it is currently in the advanced implementation stage at the Northeast Parallel Architectures Center (NPAC), Syracuse University [ Furmanski:93a ]. The MOVIE System is structured as a multiserver network of interpreters of the high-level object-oriented programming language MovieScript. MovieScript derives from PostScript and extends it, in a C++ syntax-based, object-oriented, interpreted style, towards high-performance computing, three-dimensional graphics, and a general-purpose high-level communication protocol for distributed and MIMD-parallel computing. The present paper describes the overall design of the system with a focus on the HPC component, and discusses in more detail one current application (Terrain Map Understanding) and one planned application area (Virtual Reality).
The concept of the MOVIE System emerged in a series of computational experiments with various software models and hardware environments, performed by Furmanski during the last few years. His attitude was that of a computational scientist who tries to find the shortest path towards a functional (HPC) environment which would be both dynamic enough to fully utilize the hardware and software technology advances and stable enough to support reusable programming, resulting in extendible, backward-compatible and integrable application software.
MOVIE concepts derive from the computational science research within C P, such as optimal communication [ Fox:88h ] and load-balancing [ Fox:88e ] algorithms for loosely synchronous problems and, in the application sector, matrix algebra [ Furmanski:88b ], neural network [ Nelson:89a ], and machine vision [ Furmanski:88c ] algorithms. As a next step, we started to develop a high-performance software environment for neural networks and machine vision, and we realized that the full model in such areas must go beyond the regular HPC domain. New required components included dynamic interactive Graphical User Interfaces (GUIs), support for the irregular, dynamic computing which emerges, for example, in the higher, AI-based layers of machine vision, and support for system integration in a heterogeneous computational model involving diverse components such as regular massively parallel image processing and irregular, symbolic expert system techniques. This complex structure, a ``system of systems,'' is typical of the compound problem class.
Furmanski's work therefore decoupled for some time from the main C P/NPAC thrust; assuming tentatively that the regular HPC components were ``understood,'' he followed an independent exploratory route, making a series of computational experiments and identifying components of the ``next-step'' broader model for HPC, which would integrate all of the elements discussed above.
The communication and load-balancing algorithms were implemented on the Caltech/JPL Mark II hypercube. The need for interactive GUIs emerged for the first time during our work on the parallel implementation of a neurophysiological model for olfactory cortex [ Furmanski:87a ] and, in the next step, in the machine vision research [ Furmanski:88c ]. At that time (1988), we were using the nCUBE1 system at C P and also the ``personal hypercube'' system based on an IBM AT under XENIX with the 4-node nCUBE1 add-on board. The graphics support in the latter environment was nonexistent and we constructed from scratch a GUI system based on the interpreted language g [ Furmanski:88a ], custom designed and coupled with the regular parallel computing software components. The language g was 80286 assembly-coded and XENIX kernel-based and hence very fast. However, it couldn't be ported anywhere beyond this environment, which clearly became obsolete before the g-based implementation work was even fully completed. Some design concepts and implementation techniques from this first experiment survived and are now part of the MOVIE System, but the major lesson learned was that GUIs for HPC must be based on portable graphics standards rather than on custom-made or vendor-specific models.
It was at about the same time (the late 1980s) that the major multivendor effort started towards standardizing computer graphics in the UNIX environment. We were participating actively in this process, experimenting with successive models such as SunView, X10, X11, NeWS, XView, OpenLook, Motif, DPS, and finally PHIGS/PEX/GL. It was very difficult to build a stable graphics-intensive system for HPC during the period of the last four years, when the standardization efforts were competing with the vendor-specific customization efforts. However, certain generic concepts and required features of such a system gradually started to emerge in the course of our experiments with successive standard candidates.
For example, it became clear, with the onset of network-extensible, server-based graphics models such as X or NeWS, that any solid HPC environment must include the distributed-computing model as well and unify it with the SIMD- and MIMD-parallel models. Also, to cope with portability issues in the emerging heterogeneous HPC environments, such a system must include appropriate high-level software abstraction layers, supporting Virtual Machine-based techniques. A network of compute servers, tightly/loosely coupled in MIMD-parallel/distributed mode, with each server following the high-level Virtual Machine design, appeared to be the natural overall software architecture. Modern software techniques such as preemptive multithreading and object orientation are required to assure appropriate dynamics and functionality of such a server in diverse tasks involving graphics, computation, and communication. Among the emerging standards, the design closest to the above specification was offered by the NeWS (Network-extensible Window System) [ Gosling:89a ] server, developed by Sun. Following the NeWS ideas, we adopted PostScript [ Adobe:87a ] syntax for the server language design, extending it appropriately to support object-oriented techniques and enhancing substantially its functionality towards the HPC domain.
The resulting system was called MOVIE, due both to its adequate acronym and its stress on the relevance of interactive graphics in the system design. The server language, integrating computation, graphics, and communication, was called MovieScript. The first implementation of the MOVIE Server was done at Caltech [ Furmanski:89d ] on a Sun workstation and then the system was ported to the DEC environment at Syracuse University [ Furmanski:90b ]. Here, we learned about virtual reality [ Furmanski:91g ] which we now consider the ultimate GUI model for MOVIE, ``closing'' the system design. In the fall of 1991, the MOVIE group was formed and the ``individual researcher'' period of the system development is now followed by the team development period [ Furmanski:93a ].
Currently, MOVIE is in the advanced design and development process, organized as a sequence of internal prereleases of the system code and documentation. At the time of preparing this document (April 1992), MOVIE 0.4 has been released internally at NPAC. The associated technical documentation now contains manual drafts [Furmanski:92a;92b] and some 25 internal reports (about 700 pages total). The current total size of the MOVIE code, including source, binaries, and custom CASE tools, is on the order of 40 MB. Both documentation and code will evolve during the coming months towards the first external release, MOVIE 1.0, planned for the fall of 1993. MOVIE 1.0 will be associated with a set of reference/programming manual pairs for all basic system components, discussed later in this paper (MetaShell, MOVIE Server, MovieScript). Starting from MOVIE 1.0, we intend to provide user support, assure backward compatibility, and initiate a series of MOVIE application projects.
This paper discusses the MOVIE System and serves as a summary of the design and prototype development stage (Section 17.2 ). It also contains a brief description of the current status and the planned near-term applications. A more complete description can be found in [ Furmanski:93a ] (see also [Furmanski:93b;93d] for recent overview reports). Here we concentrate on one current application, Terrain Map Understanding (Section 17.3 ) (see also [ Cheng:92a ]), and on one planned application area, virtual reality (Section 17.4 ) (see [ Furmanski:92g ] for an overview and [Furmanski:93c;93e;93f] for the current status).
The MOVIE System is a network of MOVIE servers. The MOVIE Server is an interpreter of MovieScript. MovieScript is a high-level object-oriented programming language derived from PostScript. PostScript is embedded in the larger language model of MovieScript, which includes new types and operators as well as syntax extensions towards the C++-style object-oriented model with dynamic binding and multiple inheritance. The MOVIE Server is based on the custom-made, high-performance MovieScript interpreter. Some design concepts of MovieScript are inherited from the NeWS model developed by Sun. C-shell-based CASE tools are constructed for automated server language extension.

MOVIE 1.0 will offer a uniform MovieScript interface to all major components of the Open Systems software such as X/Motif/OpenLook, DPS/NeWS, PEX/GL, UNIX socket library-based networking, and Fortran90-style index-free matrix algebra. Subsequent releases will build on top of these standards and extend the model by more advanced modules such as database management, expert systems, and virtual reality. The language extensibility model is based on the concept of an inheritance forest , which allows us to enlarge both the functional and object-oriented components of MovieScript, both in the system and application sector and at the compiled and interpreted level. The default development model for MOVIE applications is based on MovieScript programming. System integration tools are also provided which allow third-party software to be incorporated into the system and structured as suitable language extensions. An integrated visualization model is provided, unifying two-dimensional pixel and vector graphics, three-dimensional graphics, and GUI toolkits. Interfaces to AVS-style dataflow-based visualization servers are also provided.

The MOVIE Server is a single C program, a single UNIX process, and a single X client. The server dynamics are governed by preemptive multithreading with real-time support. Threads, which are MovieScript lightweight processes, compute by interpreting MovieScript and communicate by sending/receiving MovieScript. A uniform model for networking and message passing is provided. Various forms of concurrency can be naturally implemented in MOVIE, such as single-server multitasking or multiserver networks for MIMD-parallel or heterogeneous distributed computing. Multiserver systems of multithreading language interpreters offer a novel approach to parallel processing, integrating data-parallel and dynamic, irregular components. Due to such system features as rapid prototyping, extensibility, modularity, and an ``in large'' programming model, MOVIE lends itself to building large, modern software applications of the compound or metaproblem class.
The server design, summarized in the previous section, can be conveniently formulated in terms of a Virtual Machine (VM) model. Our goal in MOVIE is to provide a uniform integration and development platform for diverse hardware architectures and software models. The natural strategy is to enforce homogeneity in such a heterogeneous collection by constructing an abstract software layer, implementing the VM ``assembler'' such that diverse software components are mapped on a consistent VM ``instruction set'' and diverse hardware architectures follow a uniform VM ``processor'' model.
Our initial hardware focus is a UNIX workstation and the target software volume is provided by the present set of Open Systems standards. This includes subsystems such as X for windowing, Motif/OpenLook for XtIntrinsics-based GUI toolkits, DPS/NeWS for PostScript graphics, PEX/GL for network-extensible three-dimensional graphics, AVS/Explorer-style dataflow-based visualization models, and C/C++ and Fortran77/Fortran90 as the major low-level programming languages. In the next stage, this environment will be extended by more advanced software models such as database management systems, expert systems, virtual reality operating shells, and so on. Massively parallel systems are considered in this approach as the ``special hardware'' extensions and will be discussed in the next sections.
The concept of Open Systems is to enforce interoperability among various vendors, but in practice the standardization efforts are often accompanied by the vendor-specific customization, driven by the marketing mechanisms. Examples include competing GUI toolkits such as Motif and OpenLook or three-dimensional graphics protocols such as PEX or GL. There are also deficiencies of the current integration models within the single-vendor software volume. The only currently existing fully consistent integration platform for the Open Systems software is provided at the level of the C programming language. However, C is not suitable for ``in large'' programming due to lack of rapid prototyping and ``impedance mismatch'' [ Bancilhon:88a ], generated by C interfaces to dedicated modules based on higher level languages (e.g., SQL-based DBMSs or PostScript-based vector graphics servers). In the HPC domain, the current standardization efforts are Fortran-based, which is an even less adequate language model for compound problems. In consequence, there is now an urgent need for the vendor-independent high-level integration platform of the VM type for the growing volume of the standard Open Systems software, also capable of incorporating the HPC component into the model. MOVIE System can be considered an attempt in this direction.
The choice of PostScript as the integration language represents a natural and in some sense unique minimal solution. A stack-based model, PostScript lends itself ideally as a Virtual Machine ``assembler.'' An interpreted high-level extensible model, it provides natural rapid prototyping capabilities. A Turing-equivalent model, it provides an effective integration factor between code and data and hence between computation and communication. Finally, the graphics model of PostScript is already a de facto standard for electronic printing/imaging and part of the Open Systems software in the form of DPS/NeWS servers.
The concept of the multithreading programmable server based on extended PostScript derives from the NeWS (Network-extensible Window System) server [ Gosling:89a ], developed by Sun in the late 1980s. The seminal ideas of NeWS for client-server-based device-independent windowing are substantially extended in MOVIE towards multiserver-based, Open System-conforming, device-independent, high-performance distributed computing.
Within the VM model, the C code for the MOVIE Server can be viewed as the ``hardware'' material used to build the virtual processor. The MOVIE Server illustrated in Figure 17.1 is a virtual processor and MovieScript plays the role of the virtual assembler. Continuing the VM analogy, we can consider MovieScript objects as VM ``words'' and the physical memory storing the content of objects as the VM ``registers.'' VM ``programs,'' that is, MovieScript ASCII files, are typically stored on disks, which play the role of VM memory.
Figure 17.1: Elements of the MOVIE Server Virtual Machine Involved in Executing the Script { 30 string }. The number 30 is represented as an atomic item with the T_Number tag, the FMT_Integer mask ON in the status flag vector, and the value field = 30. The operator string is represented as a static composite item with the value field pointing to the header, given by the object structure O_Operator, stored in static memory and containing the specific execution instruction, in this case a pointer to the appropriate C method. As a result of executing this method, the item 30 is popped and a string object is created and pushed on the operand stack. The string object is represented as a dynamic composite item with the value field pointing to the O_String header. The header contains object attributes such as the string length, whereas the string character buffer is stored in dynamic memory.
The MovieScript ``machine word'' or object handle is represented as a 64-bit-long C structure, referred to as an item, composed of a 32-bit-long tag field and a 32-bit-long value field. The tag field decomposes into a 16-bit-long object identifier field and a 16-bit-long status flag vector. The value field contains either the object value for atomic types (such as numbers) or the object pointer for composite types (such as strings or arrays). MovieScript array objects and stacks are implemented as vectors of items. Composite objects are handled by the custom Memory Manager. Each composite object contains a header with object attributes and (optionally) a data buffer. MOVIE memory consists of two sectors, static and dynamic, each implemented as a linked list of contiguous segments. Headers/buffers are located in static/dynamic memory. Static memory pointers are ``physical'' (time-independent), whereas buffers in the dynamic memory can be dynamically relocated by the heap fragmentation handler. Headers are assumed to be ``small'' (i.e., of fixed maximal size, much smaller than the memory segment size) and hence the static memory is assumed never to fragment in a nonrecoverable fashion.
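To make this layout concrete, the following C sketch shows one possible encoding of such an item. The type and field names (ms_item, type_id, and so on) are illustrative assumptions for this sketch, not the actual MOVIE source, and 32-bit pointers are assumed, as on the workstations of the time.

#include <stdint.h>

/* Hypothetical sketch of a MovieScript item: a 64-bit handle made of a
   32-bit tag (16-bit type identifier + 16-bit status flags) and a
   32-bit value (immediate value for atomic types, pointer for composite
   types).  On a machine with 32-bit pointers this occupies 8 bytes.    */
typedef struct ms_item {
    uint16_t type_id;     /* object identifier, e.g., number, string, array */
    uint16_t flags;       /* status flag vector, e.g., format, literal bits */
    union {
        int32_t  ivalue;  /* immediate value for atomic types               */
        void    *header;  /* pointer to the object header (composite types) */
    } value;
} ms_item;

/* Array objects and stacks are then simply vectors of items. */
typedef struct ms_stack {
    ms_item *base;        /* contiguous vector of items     */
    int      top;         /* index of the next free slot    */
    int      capacity;    /* allocated length of the vector */
} ms_stack;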
The persistence of the memory objects is controlled by the reference count mechanism. Buffer relocation is controlled by the lock counter. Each reference to the object buffer must be preceded/followed by the appropriate open/close commands which increment/decrement the lock count. Only the buffers with zero lock count are relocated during the heap compaction process. Item, header, and buffer components of an object are represented by three separate chunks of physical memory. The connectivity is provided by three pointers: item points to the header, header points to the buffer, and buffer points back to the header (the last pointer is used during the heap compaction).
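A minimal sketch of this locking and reference-count discipline is given below; the header layout and function names are hypothetical and only illustrate the bookkeeping described above, not the actual MOVIE Memory Manager.

/* Hypothetical object header; real MOVIE headers (O_String and so on)
   carry type-specific attributes as well.                              */
typedef struct ms_header {
    int   ref_count;    /* persistence: object is freed when this drops to zero */
    int   lock_count;   /* relocation: only zero-lock buffers may be moved      */
    char *buffer;       /* relocatable data buffer in dynamic memory            */
    int   length;       /* buffer length in bytes                               */
} ms_header;

/* Open/close bracket every access to the relocatable buffer. */
static char *buffer_open(ms_header *h)  { h->lock_count++; return h->buffer; }
static void  buffer_close(ms_header *h) { h->lock_count--; }

static void object_retain(ms_header *h) { h->ref_count++; }
static void object_release(ms_header *h)
{
    if (--h->ref_count == 0) {
        /* return the header (static memory) and the buffer (dynamic
           memory) to the memory manager; omitted in this sketch       */
    }
}

/* The heap compactor may relocate only unlocked buffers. */
static int buffer_is_relocatable(const ms_header *h)
{
    return h->lock_count == 0;
}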
The inner loop of the interpreter is organized as a large C switch with the case values given by the identifier fields of the object items. Some performance-critical primitive operators are built into the inner loop as explicit switch cases, while others are implemented as C functions or MovieScript procedures. Convenient CASE tools are constructed for automatic insertion of new primitives into the inner loop switch.
A single cycle of the interpreter contains the following steps: Check the software interrupt vector, take the next object from the execution stack, push it on the operand stack (if the object is literal), or jump to the switch case, given by the object identifier (if the object is executable). The interrupt vector is used to handle the system-clock-based requests such as thread switching, event handling, or network services, as well as the user requests such as debugging, monitoring, and so on.
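The following self-contained toy program sketches such a cycle for a one-operator instruction set; the tag names, item layout, and the add primitive are assumptions made for illustration, and the real inner loop additionally services the interrupt vector and dispatches on the full MovieScript type set.

#include <stdio.h>

enum { T_NUMBER, T_OPERATOR };            /* toy object identifiers     */
enum { OP_ADD };                          /* one toy primitive operator */

typedef struct { int type_id; int value; } item;

#define STACK_MAX 64
static item operand_stack[STACK_MAX];
static int  operand_top = 0;

static void push(item it) { operand_stack[operand_top++] = it; }
static item pop(void)     { return operand_stack[--operand_top]; }

static void execute(const item *program, int n)
{
    for (int pc = 0; pc < n; pc++) {
        /* a real cycle would first check the software interrupt vector */
        item obj = program[pc];
        switch (obj.type_id) {            /* the large C switch         */
        case T_NUMBER:                    /* literal objects are pushed */
            push(obj);
            break;
        case T_OPERATOR:                  /* executables are dispatched */
            if (obj.value == OP_ADD) {
                item b = pop(), a = pop();
                push((item){ T_NUMBER, a.value + b.value });
            }
            break;
        }
    }
}

int main(void)
{
    /* the MovieScript-like program ``30 12 add'' */
    item program[] = { {T_NUMBER, 30}, {T_NUMBER, 12}, {T_OPERATOR, OP_ADD} };
    execute(program, 3);
    printf("top of operand stack: %d\n", pop().value);   /* prints 42 */
    return 0;
}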
Both the MOVIE memory and the inner loop of the interpreter are performance-optimized and supported by internal caches, for example, to speed up systemdict requests or small-object creation. The MOVIE Server is faster than the NeWS or DPS servers in most basic operations such as control flow or arithmetic, often by a factor of two or more.
A currently popular approach to portable data-parallel computing is based on the Fortran90 model, which extends the scalar arithmetic of Fortran77 towards index-free matrix arithmetic. This concept, originally implemented as CM Fortran by TMC on the CM-2, is now being extended, as described in Chapter 13, in the form of the Fortran90D and High Performance Fortran models towards MIMD-parallel systems as well.
The Fortran90-based data-parallel model allows us to treat massively parallel machines as superfast mathematical co-processors/accelerators for matrix operations. The details of the parallel hardware architecture and even its existence are transparent to the Fortran programmer. Good programming practice is simply to minimize explicit loops and index manipulations and to maximize the use of matrices and index-free matrix arithmetic, optionally supported by the compiler directives to optimize data decompositions. The resultant product is a metaproblem programming system having as its core, for synchronous and loosely synchronous problems, an interpreter of High Performance Fortran.
The index-free vector/matrix algebra constructs appear in various languages, starting from the historically first APL model [ Brown:88b ]. Also, database query languages such as SQL can be viewed as vector models, operating on table components such as rows or columns. In interpreted languages, vector operations are useful also in sequential implementations since they allow reduction of the interpreter overhead. For example, scalar arithmetic in MovieScript is slower by a factor of five or more than C arithmetic: the C-coded interpreter performs the actual arithmetic operation and, additionally, a few control and stack manipulation operations. The absolute value of such overhead is similar for scalar and vector operands and hence it becomes relatively negligible with increasing vector size.
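This amortization argument can be made concrete with a toy cost model: each interpreted operation pays a roughly fixed dispatch cost plus one element operation per vector element, so the relative overhead falls off as the vector grows. The dispatch cost used below is an assumed illustrative number, not a measured MOVIE figure.

#include <stdio.h>

int main(void)
{
    /* assumed cost of decode and stack work per interpreted operation,
       measured in units of one C arithmetic operation                 */
    const double dispatch = 5.0;

    for (int n = 1; n <= 100000; n *= 10) {
        double total    = dispatch + (double)n;   /* one vector op of length n */
        double overhead = dispatch / total;       /* fraction lost to dispatch */
        printf("n = %6d   interpreter overhead = %5.1f%%\n",
               n, 100.0 * overhead);
    }
    return 0;
}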
In MovieScript, numerical computing is implemented in terms of the following types: number , record , and field . MovieScript numbers extend the PostScript model by adding formatted numbers such as Char , Short , Double , Complex , and so on. The original PostScript arithmetic preserves value (e.g., an integer result is converted to real in case of overflow), whereas the extended formatted arithmetic preserves format, as in the C language.
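The distinction can be sketched in C: C integer arithmetic is format-preserving (the result stays an int regardless of overflow), while a value-preserving scheme detects overflow and promotes the result to a real number, as PostScript does. The helper below is an illustrative sketch, not MOVIE code.

#include <stdio.h>
#include <limits.h>

/* a tagged number: either an integer or a real */
typedef struct { int is_real; union { int i; double r; } v; } number;

/* PostScript-style addition: preserve the value, promoting to real
   when the integer result would overflow                            */
static number add_value_preserving(int a, int b)
{
    number n;
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        n.is_real = 1;
        n.v.r = (double)a + (double)b;      /* promote to real        */
    } else {
        n.is_real = 0;
        n.v.i = a + b;                      /* stay in integer format */
    }
    return n;
}

int main(void)
{
    /* C-style formatted arithmetic would keep the int format here;
       the value-preserving result is promoted to real instead        */
    number n = add_value_preserving(INT_MAX, 1);
    printf("%s result: %.1f\n",
           n.is_real ? "real" : "int",
           n.is_real ? n.v.r : (double)n.v.i);
    return 0;
}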
Record is the interpreted abstraction of the C language structure. The MovieScript interface is similar to that for dictionary objects. The memory layout of the record buffer coincides with the C language layout of the corresponding structure. This feature is C compiler-dependent and it is parametrized in the MOVIE Server code in terms of a few typical alignment models, covering all currently popular 32-bit processors.
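This compiler dependence comes from structure padding and alignment rules; the record description on the interpreted side must reproduce whatever offsets the host C compiler generates. The sketch below probes such a layout with offsetof for a hypothetical example structure.

#include <stdio.h>
#include <stddef.h>

/* hypothetical C structure that an interpreted record might mirror */
struct particle {
    char   tag;
    double position[3];
    short  charge;
};

int main(void)
{
    /* The padding inserted after 'tag' (and after 'charge') depends on
       the compiler's alignment model; these are the values that the
       record layout must be parametrized to reproduce.                */
    printf("sizeof(struct particle) = %zu\n", sizeof(struct particle));
    printf("offset of tag           = %zu\n", offsetof(struct particle, tag));
    printf("offset of position      = %zu\n", offsetof(struct particle, position));
    printf("offset of charge        = %zu\n", offsetof(struct particle, charge));
    return 0;
}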
Field is an n-dimensional array of numbers, records, or object handles. All scalar arithmetic operators are polymorphically extended to the field domain in a way similar to Fortran90. This basic set of field operators is then further expanded to provide vectorial support for domains such as imaging, neural nets, databases, and so on.
Images are represented as two-dimensional fields of bytes, and the image-processing algorithms can typically be reduced to the appropriate field algebra. Since the interpreter overhead is negligible for large fields, MovieScript offers natural rapid prototyping tools for experimentation with the image-processing algorithms and with other regular computational domains such as PDEs or neural networks.
A table in the relational database can be represented as a one-dimensional field of records, with the record elements used as column labels. Most of the basic SQL commands can be expressed again in terms of the suitably extended field algebra operators.
PostScript syntax provides flexible language tools for manipulating field objects and it facilitates operations such as constructing regions (object-oriented version of sections in Fortran90) or building multi-dimensional fields. The MovieScript field operator creates an instance of the field type. For example,
/image Byte [256 512] field def
creates a 256-by-512 image (a two-dimensional array of bytes) and
/cube Bit [ 10 { 2 } repeat ] field def
creates a 10-dimensional binary hypercube. Regions are created by the ptr operator. For example,
/p image [ [ 0 2 $ ] [ 1 2 $ ] ] ptr def
creates a ``checkerboard pattern'' pointer p , and
/c image [ [1[ ]1] [1[ ]1] ] ptr def
creates the ``central region'' pointer c containing the original image excluding the one-pixel-wide boundaries. Pointers can be moved by the move operator; for example, one can move the central pointer c to the right by one pixel as follows:
/r c [ 1 0 ] move def
To act with the Laplace operator on the original image, we construct the right, left, up, and down shifts as above, denoted by r , l , u , d . We store the content of c in the temporary field t and then we perform the following data-parallel arithmetic operation:
t 4 mul [ r l u d ] { sub } forall ,
which is equivalent to the set of scalar arithmetic operations t(i,j) = 4 c(i,j) - r(i,j) - l(i,j) - u(i,j) - d(i,j), performed for each pixel in t .
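For reference, a plain C version of the same stencil is sketched below, with the field and pointer machinery replaced by explicit index arithmetic over the interior of the 256-by-512 byte image from the earlier example; the function name and the int result type are choices made for this sketch.

enum { NY = 256, NX = 512 };

/* Scalar equivalent of the data-parallel example above:
   t = 4*c - r - l - u - d over the image interior (the one-pixel
   boundary excluded, as in the central region c).                 */
void laplace(const unsigned char image[NY][NX], int t[NY][NX])
{
    for (int i = 1; i < NY - 1; i++) {
        for (int j = 1; j < NX - 1; j++) {
            t[i][j] = 4 * image[i][j]
                    - image[i][j + 1]     /* r: shifted right */
                    - image[i][j - 1]     /* l: shifted left  */
                    - image[i - 1][j]     /* u: shifted up    */
                    - image[i + 1][j];    /* d: shifted down  */
        }
    }
}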
The above examples illustrate the way new components of MovieScript are cooperating with the existing PostScript constructs. For example, we use literal PostScript arrays to define grid pointers and we extend polymorphic PostScript operators such as mul or sub to the field domain. New operators such as ptr are also polymorphic; for example, a two-dimensional region can be pointed to by either a two-element array or a two-component field, and so on. Array objects used as pointers can also be manipulated by appropriate language tools (e.g., they can be generated in the run time, concatenated, superposed, and so on), which provides flexibility in handling more complex matrix operations.
All section operations from the Fortran90 model are supported and appropriately encoded in terms of literal array pointers. Some of the resulting regions, such as rows, columns, and scattered or contiguous windows, are shown in Figure 17.2 . Furthermore, there is nothing special about rectangular regions in the PostScript model, which is equipped with vector graphics operators. Hence, the ptr operator can also be polymorphically extended to select arbitrary irregular regions, such as those illustrated in Figure 17.2 , for example, by allowing a PostScript path as a valid pointer object argument. This is a simple example of the uniform high-level design which crosses the boundaries of matrix algebra and graphics. Another example is provided by allowing the PostScript vector (and hence data-parallel) drawing operators to act on field objects. A diagonal two-dimensional array can then be constructed, for example, by ``drawing'' a diagonal line across the corresponding field ``canvas.''
Figure 17.2: Some Data-Parallel Pointers in the MovieScript Model, Created by the ptr Operator. Row, column, contiguous, and scattered rectangle correspond to various Fortran90-style sections, here appropriately encoded as grid pointers in terms of literal array objects. Other irregular regions in the figure can be generated by using the corresponding PostScript graphics path objects as arguments of the ptr operator. The n-dim grid pointer is given as an n-element array of 1-dim axis pointers. Axis pointers are given by numbers or arrays. A number pointer selects the corresponding single element along the axis and a 1-, 2-, or 3-element array selects a 1-dim region. If all elements of such an array are numbers, they are interpreted as min, step, and max offset values. If one (central) element is an array itself, the other elements are interpreted as the left/right margins and the array corresponds to the axis interior and is interpreted recursively according to the above rules. Special convenience symbols $ and [ ] stand for ``infinity'' and ``full span.''
Unlike the Fortran90 model where the arithmetic is part of the syntax design, there is nothing special about the arithmetic operators such as mul or add in MovieScript. New, more specialized and/or more complex regular field operators can be smoothly added to the design, extending the index-free arithmetic and supporting computational domains such as signal processing, neural networks, databases, and so on.
The implementation of data-parallel operations in MovieScript is clearly hardware-dependent. The regular, grid-based component of the design is functionally equivalent to Fortran90 and its implementation can directly benefit from the existing or forthcoming parallel Fortran support. Some more specialized operators can in fact be difficult or impractical to implement on particular systems, such as, for example, data-parallel PostScript drawing on some SIMD-parallel processor arrays. In such cases, only the restricted regular subset of the language will be supported. The main strength of the concurrent MOVIE model is in the domain of MIMD-parallel computing discussed below.
The MIMD MOVIE model is illustrated in Figure 17.3 . Basically, MOVIE Server plays the role of the general purpose node program or, rather, node operating shell. The MovieScript-based communication model is constructed on top of the compiled language-based communication library, provided either directly by the hardware vendor or by one of the portable low-level models such as the commercial Express package [ ParaSoft:88a ] or the public domain PICL package [ Geist:90b ] described in Chapters 5 and 16 . The MIMD operation of MOVIE can both support the asynchronous problem class and mimic the message-passing model for loosely synchronous applications.
Figure 17.3: Elements of the MIMD MOVIE Model. Each node runs asynchronously an identical, independent copy of the multithreading MOVIE Server, interpreting (a priori distinct) node MovieScript programs and communicating with other nodes via MovieScript messages. Regular and irregular components can be time-shared as illustrated in the figure. A single unique thread has been selected in each node (the one in the upper right corner) for regular processing and the other threads are participating in some independent or related irregular tasks. The regular thread processing is based on the ``MovieScript + message passing'' model, that is, all node programs are given by the same code which depends only parametrically on the node number. The mesh of regular threads is mapped on a single host thread which can be considered, for example, as a matrix algebra accelerator ``board'' within some sequential or distributed Virtual Machine model, involving the host server.
The interesting features of such a model stem from the multithreading character of the node program. The MIMD mesh of node servers can be configured either in the fully asynchronous or the regular mode. Various intermediate and/or mixed modes are also possible and useful. The default mode is asynchronous: each server maintains its own thread queue, executing individual thread programs and serving the communication requests according to the software-clock-based preemptive scheduling model. The system operation in this mode is similar to the distributed computing model and is discussed in more detail in Section 17.2.5 .
The simplest way to enforce the regular mode is by retaining only one thread in each node server and by following the conventional ``MovieScript + message passing'' loosely synchronous programming techniques. A more advanced, but also often more useful configuration is when the regular and asynchronous modes are time-shared. This is illustrated in Figure 17.3 , where a unique thread has been selected in each node to implement some regular algorithm and all other threads are involved in some irregular processing. The communication messages are thread-specific and hence the regular component is processed in a transparent way, without any conflicts with the irregular traffic. MovieScript scheduling is programmable and hence the system can adjust and control dynamically the time slices assigned to individual components.
The code development process for multicomponent algorithms factorizes into modular programs for individual threads or groups. In consequence, all techniques such as optimal regular communication or matrix algebra algorithms, constructed previously in compiled models (see, e.g., [ Fox:88h ], [ Furmanski:88b ]), can be easily reconstructed in MovieScript and organized as appropriate language extension.
A natural next step is to construct the Fortran90-style matrix algebra by using the physical communication layer and the already existing single node support in terms of the field objects now playing the role of node sections of the domain-decomposed global fields. Such construction represents the run-time interpreted version of the High Performance Fortran (Fortran90D) model [ Fox:91e ]. Compiler directives are replaced by ``interpreter directives,'' that is, MovieScript tools for data decomposition which can be employed in the dynamic real-time mode. Various interface models to the compiled Fortran90D environment can also be constructed. Furthermore, since arithmetic doesn't play any special role in the MovieScript syntax, the matrix algebra model can be naturally further extended by new, more complex and specialized regular operators, emerging in the application areas such as image processing, neural networks, and so on.
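As an illustration of the kind of bookkeeping such ``interpreter directives'' or compiler directives must perform, the sketch below computes a one-dimensional block decomposition of a global field over p nodes; it is a generic decomposition formula introduced here for illustration, not the MOVIE or Fortran90D implementation.

/* Generic 1-D block decomposition: node 'rank' out of 'p' owns a
   contiguous node section of a global field of 'n' elements; the
   first (n % p) nodes receive one extra element.                  */
typedef struct { int start; int count; } block;

block block_decompose(int n, int p, int rank)
{
    block b;
    int base = n / p;
    int rem  = n % p;
    b.count = base + (rank < rem ? 1 : 0);
    b.start = rank * base + (rank < rem ? rank : rem);
    return b;
}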
The advantage of the concurrent multithreading model is that the regular sector can be time-shared with the dynamic, irregular algorithms. The need for such configurations appears in complex applications such as machine vision, Command and Control, or virtual reality where the massively parallel regular algorithms (early vision, signal processing, rendering) are to be time-shared and often coupled by pipelines or feedback loops with the irregular components (AI, event-driven, geometry modeling).
Such problems can hardly be handled exclusively in the data-parallel, Fortran90-style model. The conventional, more versatile but less portable ``Fortran77 or C + message-passing'' techniques can be used, but then one effectively starts building the custom multithreading server for each large multicomponent application. In the MIMD MOVIE model, we reverse this process by first constructing the general-purpose multithreading services, organized as the user-extensible node operating shell.
Many other interesting features emerge in such a model. High-level PostScript messages can be dynamically created and destroyed. Dynamic point-like debugging and monitoring can be realized in a straightforward way at any time instance by sending an appropriate query script to the selected node. Longer chunks of the regular MovieScript code can be stored in a distributed fashion and broadcast only when synchronously invoked, that is, one can work with both distributed data and code. Static load-balancing and resource allocation techniques developed for compiled models (see, e.g., [Fox:88e;88tt;88uu]) apply, and can be significantly enhanced by new dynamic algorithms, utilizing the thread mobility features in the distributed MovieScript environment.
Distributed computing is the most natural environment for the MOVIE System. The communication model for MOVIE networks is based on one simple principle, uniform for distributed and MIMD-parallel architectures: nodes of such a network communicate by sending MovieScript. This model unifies communication and computation: computing in MOVIE occurs when a server interprets MovieScript, whereas communication occurs when a server sends MovieScript to be interpreted by another server on the network.
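At the lowest level, ``sending MovieScript'' amounts to shipping an ASCII script over a stream connection and letting the receiving server interpret it. The sketch below is written against the plain BSD socket API rather than any MOVIE interface; the server address, port number, and script are illustrative assumptions.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Deliver a MovieScript message to another server over a TCP socket.
   The MOVIE networking layer adds threading, scheduling, and protocol
   handling on top of this kind of base-level connection.              */
int send_script(const char *ip, int port, const char *script)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = inet_addr(ip);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    /* the receiving interpreter executes whatever script arrives */
    ssize_t n = write(fd, script, strlen(script));
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    /* hypothetical server address and a trivial script */
    return send_script("127.0.0.1", 6000, "/x 30 string def\n");
}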
Social human activities provide adequate analogies here. One can think of a MOVIE network as a society of autonomous intelligent agents, capable of internal information processing and of information exchange, both organized in terms of the same high-level language structures. The processing capabilities of such a system are in principle unlimited. Detailed programming paradigms for distributed computing are not specified initially at the MovieScript level and can be freely selected depending on the application needs. Successful computation/communication patterns with some reusability potential can then be retained within the system in the form of appropriate MovieScript extensions.
The MovieScript-based user-level model for MOVIE networks is uniform, elegant, and appealing. Its detailed technical implementation, however, is a complex task. Communication services must be included at the lowermost level of the inner loop of the MovieScript interpreter and coordinated with scheduling, event handling, software interrupts, and other dynamic components of the server. Also, when building interfaces to existing Open Systems components, we have to cope with various existing network and message-passing protocols. In the networking domain, we use the UNIX socket library as the base-C-level platform, but then the question arises of how to handle higher protocols such as NFS, RPC, XDR and a variety of recent ``open'' models (see, e.g. Figure 17.4 ). Similar uncertainties arise in handling and integrating various message-passing protocols for MIMD-parallel computing.
Figure 17.4: An Example of the Distributed MOVIE Environment. The figure illustrates various network-extensible graphics protocols used in implementing the uniform high-level MovieScript protocol. We denote by mps, nps, and dps, respectively, the MOVIE, NeWS, and Display PostScript protocols. MOVIE servers communicate directly via mps, whereas MovieScript messages sent to NeWS/DPS/X servers are internally translated by the MOVIE server to the remote server-specific protocols.
Since the consistent implementation of the MovieScript-based communication is one of the most complex tasks in the MOVIE development process, we adopted the following evolutionary and self-supporting approach. The system design was started in the single-node, single-thread configuration. The notion of multithreading was built into the design from the beginning by adopting consistent thread-relative addressing modes. In consequence, the detailed model for scheduling, networking, and message passing factorized as an independent sector of MovieScript and was initially postponed. The base interpreter loop was developed first. In the next stage, we constructed the field algebra for regular matrix processing, the interpreted object-oriented model with rapid prototyping capabilities, and the graphics/visualization/windowing layers with a focus on interpreted GUI interfaces.
These layers are currently in the mature stage and they can now be used to provide GUI support for prototyping multithreading distributed MOVIE networks, starting with the regular modules for concurrent matrix algebra and signal processing. The current status of the design and implementation work on scheduling, networking and message passing is described in [ Niemiec:92a ], [ Niemiec:92b ] and [ Furmanski:93a ].
Most of the MovieScript components discussed so far, such as the Fortran90-style matrix algebra or communication, are implemented in terms of extended sets of PostScript types and operators. In the area of data-parallel computing, based on a finite set of generic operations, the distinction between language models such as Fortran, C, C++, Lisp, or PostScript is largely a matter of taste. However, with the growing structural complexity of a given domain, typically associated with irregular, dynamic computational complexity, both the compiled imperative languages such as Fortran or C/C++ and the interpreted functional languages such as Lisp or PostScript become impractical. The best techniques invented so far to handle such complex problems are provided by the interpretive object-oriented models.
MovieScript extends PostScript by the full object-oriented sector with all ``orthodox'' components such as polymorphism, encapsulation, data abstraction, dynamic binding, and multiple inheritance. This extension process is organized structurally in the form of a two-dimensional inheritance forest which provides a novel design platform for integrating functional and object-oriented language structures.
All the original PostScript types such as array , string , dict , and so on, are retained and included in the topmost ``horizontal'' layer of primitive types in MovieScript. This layer is further extended by new computation, graphics, and communication primitives. The design objectives of this language sector are optimized performance, structural simplicity, and enforced polymorphism of the operator set. The group of primitive types within the inheritance forest plays the role of the root class in conventional object-oriented models.
At the same time, the PostScript syntax itself is also extended within the MovieScript design to support the C++-style, ``true'' object-oriented model with dynamic binding and multiple inheritance. The derived types , constructed via the inheritance mechanism starting from the primitive functional types, extend the inheritance forest in the ``vertical'' direction towards more composite, complex, and abstract language structures. A finite set of primitive types is constructed in C and hardwired into the server design. Other primitive types and all derived types are constructed at run time at the interpreted level. Some elements of the inheritance forest model are illustrated in Figure 17.5 .
Figure 17.5: Elements of the Inheritance Forest Model. The upper horizontal axis represents primitive MovieScript types such as dict, array, xtwidget, and so on. The forest of derived types extends down and follows the multiple inheritance model. Closed loops in the inheritance network are allowed and resolved by maintaining only a single copy of a degenerate superinstance. The figure illustrates the image browser class which can be thought of as being both a dictionary (of image names) and a widget (such as a selection list). An instance of a derived type is represented by a noncontiguous collection of superinstance headers and buffers, with each buffer maintaining a list of pointers to its superinstance headers, as illustrated in the figure.
The integration of the PostScript-style functional layer with the C++-style object-oriented layer, as well as the ``in large'' extensibility model which defines a suitable balance between both layers, are considered distinctive features of MovieScript. The idea is to encapsulate the structural complexity in the form of methods for derived types and to maintain a finite set of maximally polymorphic operators in the functional sector. The resulting organization is similar to the way complexity is handled by natural languages and human practices. The world is described by a large number of ``things'' (objects, words) and a relatively small number of ``rules'' (polymorphic operators, relations). We could define ``common English'' as a set of rules and a very restricted subset of objects. The ``expert English'' dialects are constructed by extending the vocabulary by more specialized and/or abstract objects with complex methods and inheritance patterns. The process of building expert extensions is graded and consistent with the human learning process.
Our claim is that the good ``in large'' computer language design should contain a nontrivial ``common English'' part, useful by itself for a broad set of generic tasks, and it should offer a graded, multiscale extensibility model towards specialized expert dialects. Indeed, we program by building reusable associations between software entities and names. Each ``in large'' programming model unavoidably contains a large number of names. The disciplined and structured process of naming software entities is crucial for successful complexity control. In languages such as C or Fortran, the ``common English'' part is reduced to arithmetic and simple control structures such as if , for , switch , and so on. All other names are simply mapped on a huge and ever-growing linear chain of library functions. The original language syntax, based on mathematics notation, degenerates towards a poorly organized functional programming style. ``In large'' programming in such languages becomes very complex.
More abstract models such as functional, object-oriented, and dataflow modular programming are more useful, but there are usually some structure-versus-function trade-offs in the individual language designs and the optimal choice for the ``in large'' model is far from obvious. A few examples of various language models are sketched in Figure 17.6 . In our approach, we consider the object-oriented techniques as the best available tool for building expert extensions (with the expert knowledge encapsulated in methods) and the functional model of PostScript as the best way of encoding the common part of the language. PostScript operators play the role of rules and PostScript primitive types represent the common vocabulary. The inheritance forest of MovieScript allows for a smooth transition from common to expert types.
Figure 17.6: Computational Vertices in Various Language Models. Solid arrows indicate input/output arguments or objects. Wavy lines indicate messages sent to objects. Dark blobs represent nonsyntactic components of the language. Light blobs represent polymorphic operators, considered as syntactic identifiers/keywords. Among the models illustrated in the figure, we consider the MovieScript organization of computational vertices as most adequate for ``in large'' programming. C, AVS, and PostScript have a poor encapsulation model. C and C++ are not convenient for dynamic dataflow programming as they don't offer any universal mechanism for multiobject/argument output. MovieScript vertices are constructed by superposition of the C++-style encapsulation model and the PostScript-style multiobject interaction model. Large MovieScript operators are functionally similar to AVS modules but they follow a multiscale structural design which enforces software economy and reusability.
The complexity of ``expert English'' is encapsulated in methods for derived types, and the general-purpose functionality of ``common English'' is exposed in terms of a restricted set of polymorphic operators, processing objects of all granularities. A multiscale language development model, supported by such organization, is discussed in Section 17.2.8 .
Support for graphics is currently the most elaborate sector of the Open Systems software. It is also the sector which has evolved most vigorously in recent years. The current standard environment, based on a collection of subsystems such as X, Motif/OpenLook, PHIGS/PEX/GL, and AVS/Explorer, offers broad functionality and a diversity of visualization tools, but it is still difficult to use in application programming. The associated C libraries are huge and the C-language-based development and integration model generates a severe compilation/linking-time bottleneck. The most extreme case is the PHIGS library, which is on the order of 8 MB and generates binaries of that size even for modest three-dimensional graphics applications. Furthermore, the competing subsystems grouped in the list above (for example, Motif and OpenLook) are typically available only in exclusive mode on a particular hardware platform and hence the associated C language application codes are nonportable.
Our approach in MOVIE is to design an integrated MovieScript-based model for graphics, GUIs, and visualization. We adopt the original PostScript model for scalable two-dimensional graphics as defined in [ Adobe:87a ] and we extend it by including other graphics subsystems. Even in the PostScript domain, however, we face uncertainties due to competing models offered by the DPS server from Adobe Systems, Inc. and the NeWS server from Sun. Since neither of these PostScript extension models is complete (e.g., neither offers a model for three-dimensional graphics), we don't follow either of them in building the MovieScript extension. Only the intersection of both models, given by the original PostScript model for printers, is adopted ``as is'' in MOVIE, and we build custom extensions towards windowing and event handling, compatible with other Open Systems components. The conflicting extension models of DPS and NeWS are not part of the MovieScript design but these language sectors can be accessed from MovieScript code since the MOVIE DPS/NeWS interface model supports programmability of remote PostScript servers.
Remote PostScript devices such as NeWS or DPS servers are accessed from the MovieScript code by the operators gop and gdef . The gop operator involves the following elements: key is a literal name, two of its operands are numbers, code is a MovieScript object capable of defining some remote PostScript code, and rop is the resulting MovieScript operator (with the prefix ``r'' standing for ``remote''). Here, gop installs the user-defined graphics operator (implemented as a PostScript procedure) in the remote PostScript server and also creates the local MovieScript operator rop associated with this remote operator. Both local and remote operators are associated with the common name specified by key . The code object can be a MovieScript procedure or string. The execution of rop consists of sending arguments from the MOVIE operand stack to the NeWS/DPS operand stack, executing the remote procedure in NeWS/DPS (associated with key and previously installed by gop ), and finally transporting output objects back from NeWS/DPS to MOVIE.
The associated gdef operator is simply the sequence { gop def } , that is, it installs rop in the local dictionary under the name key . In other words, the action of gdef is fully symmetric on the local (MOVIE) and remote (NeWS/DPS) servers. The gop output format can be used to handle rop differently, for example, by installing it as an instance method within the MovieScript class model.
MovieScript support is also provided to control the connection status and buffering modes along the PostScript-based communication lines.
The interface model described above was developed first for the NeWS server [ Furmanski:92d ], and it is now ported to DPS [ Podgorny:92b ].
MovieScript windowing is constructed by building an interface to the XtIntrinsics-based GUI toolkits. The generic interface model has been constructed and so far explicitly implemented for Motif [ Furmanski:92e ]. The OpenLook implementation is in progress. Mechanisms are provided for combining various toolkit components into the global GUI toolkit. The minimal set of components consists of the XtIntrinsics subtree provided by the X Consortium and a vendor-specific subtree such as Motif or OpenLook. This two-component model can be further extended by new user-provided components. Each toolkit component is implemented as an individual MovieScript shell. In particular, the shell Xi defines the intrinsic widgets, the shell Xm defines the Motif widgets, and so on. There is also a toolkit integration shell Xt which provides tools for combining toolkit components (e.g., Xt = Xi + Xm ). The implementation of the OpenLook interface in this model is reduced to specifying the shell Xol with the OpenLook widgets and building the full toolkit Xt = Xi + Xol .
The object-oriented model of XtIntrinsics is based on static binding and single inheritance. As such, it doesn't contain enough dynamics and functionality to motivate the faithful embedding in terms of derived types in MovieScript. Instead, we implement the widget classes as parametric modules in terms of a few primitive MovieScript types such as xtclass (widget class), xtwidget (widget instance), xtattr (widget attribute), and xtcallback (widget callback). The types xtclass and xtattr play the role of static containers of the corresponding Xlib information and they are supported only by a set of query/browse methods. The types xtwidget and xtcallback are dynamic, that is, their instances are created/destroyed in the run time.
The operator xtwidget creates an instance of the widget class, taking as input two objects: the parent widget and the array of attribute-value pairs. Attributes are specified by literal MovieScript names, coinciding with the corresponding Motif names. The Motif attribute set is suitably extended. For example, the widget class name itself is a special attribute, to be specified first in the attribute-value array. The associated value is the widget instance name as referred to by the X Resource Manager. Another special attribute is represented by the MovieScript atomic item $ which indicates a nested child widget. Its corresponding value is the attribute-value array for this child widget. The $[...] pairs of this type can be nested, which allows for creating trees of nested widgets linked by the parent-child relations. This construct is extensively used in building GUI interfaces. We illustrate it below with a simple example:
$ [/MainShell /main
  $ [/XmRowColumn /panel
    /orientation /Vertical
    $ [/XmPushButton /red
      /background [ 1.0 0.0 0.0 ]
      /activateCallback { (red) run }
    ]
    $ [/XmPushButton /green
      /background [ 0.0 1.0 0.0 ]
      /activateCallback { (green) run }
    ]
    $ [/XmPushButton /blue
      /background [ 0.0 0.0 1.0 ]
      /activateCallback { (blue) run }
    ]
  ]
] xtwidget realize
xtmainloop
xtinit
As a result of executing the MovieScript program above, the main application window will be created with three buttons, labelled by the color = red, green, blue strings and colored accordingly. By pressing a selected color button, the ./color file in the current directory will be executed, that is, interpreted as MovieScript code. In this example, the nested widget tree is constructed with depth three: main is created as a child of the root window, panel is created as a child of main, and, finally, the red, green, and blue buttons are created as panel children.
The GUI in this example is provided in terms of the button widgets and the associated callback procedures. The /activateCallback attribute for the button widget expects as its value a MovieScript procedure (executable array), to be executed whenever the X event ButtonPress is generated, that is, whenever the user presses this button. The callback procedure in MovieScript is a natural interpreted version of the conventional C language interface, in which one registers the callback functions to be invoked as a response to the appropriate X events, created by the GUI controls. The advantage of the MovieScript-based GUI model is the support for rapid prototyping. After constructing the control panel as in the example above, one can easily develop, modify, and test the scriptable callback procedures simply by editing the corresponding red, green, and blue files in the run-time mode.
A new model for visual distributed computing is proposed by the present generation of high-end dataflow-based visualization systems such as AVS from AVS, Inc. (formerly Stardent Computer, Inc.), Explorer from SGI, or public domain packages such as apE from OSC or Khoros from UNM.
The computational model of AVS is based on a collection of parametric modules, that is, autonomous building blocks which can be connected to form processing networks. Each module has definite I/O dataflow properties, specified in terms of a small collection of data structures such as field , colormap , or geometry . The Network Editor, operating as a part of the AVS kernel, offers interactive visual tools for selecting modules, specifying connectivity, and designing convenient GUIs to control module parameters. A set of base modules for mapping, filtering, and rendering is built into the AVS kernel. The user extensibility model is defined at the C/Fortran level: new modules can be constructed and appended to the system in the form of independent UNIX processes, supported by an appropriate dataflow interface.
From the MOVIE perspective, we see AVS-type systems as providing an interesting model for ``coarse grain'' modular extensibility of MovieScript, augmenting the native ``fine grain'' extensibility model discussed in Section 17.2.6 . An AVS module interpolates between the concepts of a PostScript operator (since it ``consumes'' a set of input objects and ``produces'' a set of output objects) and a class instance (since it also contains GUI-based ``methods'' to control internal module parameters). This is illustrated in Figure 17.6 , where we compare various language models in the context of ``in large'' programming. Consequently, AVS-style modules can be used to extend both the functional and object-oriented layers of MovieScript towards the UNIX domain in the form of user-provided independent UNIX processes. Also, any third-party source or ``dusty deck'' software package can be converted to the appropriate modular format and appended to the MOVIE system in terms of interface libraries similar to those developed for AVS modules. The advantage of the AVS extensibility model is maximal ``external'' software economy due to easy connectivity to third-party packages. The advantage of the MOVIE model, based on the MovieScript language extensibility, is maximal ``internal'' software economy within the native code volume generated by MOVIE developers. The merging of both techniques is particularly natural in the MovieScript context since PostScript itself can be viewed as a dataflow language.
An independent near-term issue is designing MOVIE interfaces to current and competing packages such as AVS and Explorer. Various possible interface models can be constructed in which the MOVIE server either plays the role of the compute server, offering high-level language tools for building AVS modules, or takes over control, with AVS used as a high-quality rendering device.
Scientific visualization systems such as AVS or Explorer offer sufficient functionality for relatively static graphics needs but they are not very useful for dynamic real-time graphics, for example, those required in virtual reality environments. Features of MOVIE such as interpretive multithreading, object orientation, and rapid prototyping are crucial in building such advanced interfaces, where the high-quality graphics support must be tightly coupled with high-performance computing and with the high-level-language-based development tools.
We are currently designing and implementing a custom three-dimensional graphics model in MovieScript which will be fully portable across platforms such as PHIGS, PEX, and GL and which will make full use of the functionality available in these protocols. The low-level component of this model is structurally similar to the Motif interface described above, that is, it is based on parametric modules implemented as primitive types with attribute-value input arrays. As in the Motif case, the purpose of this layer is to provide portable low-level interpreted interfaces to the appropriate C libraries and to facilitate further high-level design of derived types in the rapid prototyping mode. The initial design ideas and the current implementation status of this work are described in [ Faigle:92b ], [ Faigle:92c ], and [ Furmanski:93a ].
The integrated graphics model in MovieScript is simple at the user level and complex at the implementation level, as illustrated in Figure 17.7 .
Figure 17.7: Integrated Graphics Model in MovieScript. A uniform interface in terms of primitive types is constructed to the X, DPS/NeWS, and PEX/GL components of the Open Systems software and implemented, correspondingly, in terms of the Xlib, GL/PEXlib, and PostScript communication protocols. Additionally, an interface to the AVS server is constructed, supporting both the subroutine and coroutine operation modes. The AVS-style extensibility model in terms of the UNIX dataflow processes is illustrated both for AVS and MOVIE servers. Within this model, one can also import other graphics models and applications to the MOVIE environment. This is illustrated on the example of a third-party X Window application which is configured as a MovieScript object or operator.
Our main goal is to bring the heterogeneous collection of present standard subsystems (X, Motif/OpenLook, DPS/NeWS, PHIGS/PEX/GL) into the uniform sector of a high-level language. Interfaces to individual subsystems were discussed above. The overall strategy is to build first a uniform set of low-level primitive types for GUI toolkits and three-dimensional servers, structured as smooth extensions of the DPS/NeWS server-based PostScript graphics model for two-dimensional vector graphics. This interpreted layer is then used in the next stage to design the high-level object-oriented graphics world in terms of more complex derived types. The resulting graphics support is very powerful and in some sense unique: it fully utilizes the available Open Systems graphics software resources; it conforms to one of the standards (PostScript) at the level of primitives; and finally, it provides a user-friendly, intuitive, and complete programming interface for modern graphics applications.
As an independent component, we also provide the MovieScript interface to dataflow packages such as AVS/Explorer. Both coroutine and subroutine models for MOVIE-based AVS modules are supported, which allows for diverse interaction patterns between MOVIE and AVS servers. The AVS interface is, in principle, redundant since the graphics functionality of systems such as AVS/Explorer will soon be included in the PEX/GL-based 3D MOVIE model, but it is useful at the current stage, where various components of the 3D MOVIE model are still in the implementation process. In particular, the AVS interface was used in the Map Separates application, providing high-quality three-dimensional display tools for the MovieScript field algebra-based imaging and histogramming. We discuss this application in Chapter 3 .
MOVIE 1.0 will represent the minimal closed design of the MOVIE server, defined as the uniform object-oriented interpreted interface to all Open Systems resources defined in Section 17.2.1 . Such a model can then be further expanded both at the system level (i.e., by adding new emerging standards or by creating and promoting new standard candidates) and at the application level (i.e., by building MOVIE-based application packages).
Two basic structural entities used in the extension process are types and shells . Typically, types are used for system extensions and shells are used for building MOVIE applications. In fact, however, both type- and shell-based extensibility models, as well as system- and application-level extensions, can be mixed within ``in large'' programming paradigms.
Both type- and shell-based extensions can be implemented at the compiled or interpreted level. At the current stage, the compiled extensibility level is fully open for MOVIE developers. The detailed user-level extensibility model will be specified starting from MOVIE 1.0. Explicit user access to the C code server resources will be restricted and the dominant extension mode will be provided at the interpreted MovieScript level. The C/Fortran-based user-level extensions, as well as extensibility via third-party software, will be supported in the encapsulated, ``coarse-grain'' modular form similar to the AVS/Explorer model (see Section 17.2.7 ).
The type extension model is based on the inheritance forest and it was discussed in Section 17.2.3 . The shell extension model utilizes PostScript-style extensibility and is described below.
Structurally, a MovieScript shell is an instance of the shell type . Its special functional role stems from the fact that it provides mechanisms for extending the system dictionary by new types and the associated polymorphic operators. In consequence, types and shells are in a dual relationship; examples would be nodes and links in a network or nouns and verbs in a sentence. In a simple physical analogy, types play the role of particles, i.e., elementary entities in the computational domain, and shells provide interactions between particles. In conventional object-oriented models, objects (that is, particles) are the only structural entities and interactions must be constructed as a special kind of object. The organization in MOVIE is similar at the formal structural level, since MovieScript shells are instances of a MovieScript type, but there is a functional distinction between object-based and shell-based programming. The former follows the message-passing-based C++ syntax and can be visualized as a ``particle in external field'' type of interaction. The latter follows the dataflow-based PostScript syntax and can be visualized as multiparticle processes such as scattering, creation, annihilation, decay, fusion, and so on.
An appealing high-level language design model can be constructed by iterating the dual relation between types and shells in a multiscale fashion. Composite types of generation N+1 are constructed in terms of interactions involving types and shells of generation N . The ultimate structural component, that is, the system-wide type dictionary, is expected to be rich and diverse enough to match the complexity of ``real world'' computational problems. The ultimate functional component, that is, some very high level language defined by the associated shells, is expected to be simple, polymorphic, and easy to use (``common English''), with all complexity hidden in methods for specialized types (``expert English'').
In our particle physics analogy, this organization could be associated with the real-space renormalization group techniques. New composite types play the role of new collective variables at larger spatial scale, polymorphic operators correspond to the scale-invariant interaction vertices, and MovieScript shells contribute new effective interaction terms. Good high-level language design corresponds to the critical region, in which the number of effective ``relevant'' interactions stabilizes with increasing system size. Our conjecture is that natural languages can be viewed as such fixed points in some grammar space, and hence the best we can do to control computational complexity is to evolve in a similar direction when designing high-level computer languages.
MOVIE Server is a large C program and it requires appropriate software engineering tools for its development.
Commercial software systems are usually developed in terms of sophisticated commercial CASE tools. In the academic environment, one rarely builds large production systems and one usually uses simpler, lower level tools based on dialects of the UNIX shell, most typically the C shell, which now forms the standard text-mode UNIX interface on most workstations. The C-shell-based environment is most natural in the research working mode where the code is usually of small or moderate size, its typical lifetime is short, and it undergoes a series of major changes during the development cycle. These changes are often of unpredictable character and hence difficult to parametrize a priori in the form of some high-level CASE tools.
The MOVIE project aims at a large, commercial-quality production system, and yet it is created in the academic environment and contains substantial research components in the domain of HPDC. We therefore decided to select a compromise strategy and to start the MOVIE Server development process in terms of simple, custom-made, C-shell-based CASE tools. More explicitly, the current generation of CASE tools for MOVIE is structured as the interpreter of a very simple high-level object-oriented language called MetaShell, designed as a superset of the C shell. In this way, we assure compatibility with the standard academic environment and, at the same time, we provide somewhat more powerful software development tools than those offered by the plain text-mode UNIX environment.
A more functional language model for the CASE tools in MOVIE would be provided by MovieScript itself, due to its high-level features and built-in GUI support, but we need a consistent bootstrap scheme in such a process. A natural approach is to use the C shell to build MOVIE 1.0, then use its MovieScript to build MOVIE 2.0, and so on. Alternatively, we can consider the task of building high-quality visual ``intelligent'' CASE tools as one of the MOVIE 1.0-based application projects. We discuss these future plans in Section 17.2.10 and here we present the current MetaShell model from the MOVIE developer's point of view. The detailed technical documentation of the MetaShell tools can be found in [ Furmanski:92c ].
Figure 17.8: Sample Elements of the $MOVIEHOME Directory Tree. Dark blobs represent system nodes/names, shaded blobs represent user-provided nodes/names within the M-tree model. For example, each new type, such as Dict, automatically generates its subtree containing the following directories: Fcn (low-level object functions, used for implementing other object components), Act (object-dependent methods for polymorphic operators), Msg (methods implementing object messages), Const (predefined default instances of a given type), and Lib (C libraries of object functions).
The entire code volume associated with MOVIE is stored in the directory $MOVIEHOME, shown in Figure 17.8, which is installed as a UNIX environment variable and used as the base pathname for the MetaShell addressing modes. The most relevant nodes in this directory are: bin , sys , and M . The bin directory, to be included in the developer's path, contains the external binaries such as the main MetaShell script and the MOVIE Server binary. The sys directory contains diverse system-level support tools, for example, the C and C-shell code implementing the MetaShell model. The server code starts in the subdirectory M and we will refer to the associated directory tree, starting at M , as the M-tree .
M branches into a set of base software categories such as: Op (C or MovieScript source files implementing methods for the MovieScript operators), Lib (C language libraries), Err (MovieScript error operators), Key (system name objects), Type (MovieScript types), Shell (MovieScript shell objects), and so on.
Some of these nodes are simple, that is, they contain only a set of regular files (e.g., M/Op); some are composite, that is, they branch further into subdirectories (e.g., M/Lib which branches into system libraries). In the current implementation, the maximal branching level is five (e.g., directory M/Type/String/Lib/regex, which contains the string type library functions for handling regular expressions). Many structural aspects of the system can be presented in the form of some suitable M-tree mappings, listed below:
There is a one-to-one mapping between an M-tree directory and a MovieScript dictionary. The dictionary tree starts from the MetaDictionary M , which contains keys Op , Lib , and so on, associated with appropriate dictionaries as values. The Op dictionary contains MovieScript operators as values, the Lib dictionary contains the dictionaries of library functions as values, and so on. MetaDictionary mapping provides run-time interpreted access to all resources within the M-tree , and it can be used, for example, for building more advanced MovieScript-based CASE tools for the server development.
There is a one-to-one correspondence between the C names of various software entities (functions, structures, macros, and so on) and the location of the corresponding code within the M-tree . In consequence, the whole server code has a hypertext-style organization which facilitates software understanding, documentation, upgrades, and maintenance in the group development mode.
There is a unique 32-bit integer called MetaIndex associated with each software entity contributing to the server, such as functions, structures, or individual structure elements. The overall index is constructed by concatenating subindices along the M-tree path which allows for fast encoding/decoding between the binary (MetaIndex-based) and ASCII (pathname-based) addressing modes for the server code entities. Since the MovieScript itself can be viewed structurally as a subset of M-tree , one can construct a compact binary network protocol equivalent to the ASCII representation of the MovieScript code and more suitable for the high-speed communication purposes.
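The idea can be illustrated by packing per-level subindices into a single 32-bit word; the field widths and the five-level limit in the sketch below are assumptions chosen for illustration, not the actual MetaShell encoding.

#include <stdint.h>

enum { LEVELS = 5, BITS_PER_LEVEL = 6 };   /* 5 levels x 6 bits = 30 bits used */

/* Concatenate small subindices along an M-tree path into one 32-bit
   integer, and recover them again.  Each subindex must fit in
   BITS_PER_LEVEL bits.                                               */
static uint32_t metaindex_encode(const int sub[LEVELS])
{
    uint32_t index = 0;
    for (int i = 0; i < LEVELS; i++)
        index = (index << BITS_PER_LEVEL) | (uint32_t)(sub[i] & 0x3F);
    return index;
}

static void metaindex_decode(uint32_t index, int sub[LEVELS])
{
    for (int i = LEVELS - 1; i >= 0; i--) {
        sub[i] = (int)(index & 0x3F);
        index >>= BITS_PER_LEVEL;
    }
}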
The Makefile for the server binary is distributed along the M-tree in the form of independent modularized components for all functions, structures, and macros. The global Makefile is constructed from these components by a set of nested make-include directives.
The organization of the MOVIE Server Reference Manual mirrors the structure of the M-tree, with the appropriate M-nodes represented as nested parts, chapters, sections, and so on. There is a corresponding detailed manual page for each elementary component of the server such as a function, structure, or method, and an overview page for each composite component such as a type, shell, or library. An interactive documentation browser is available, currently based on the WYSIWYG Publisher program from Arbor Text, Inc. [ Podgorny:92a ].
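The first mapping above (M-tree directory to MovieScript dictionary) can be illustrated by a short sketch. The snippet below is written in Python purely for illustration; the actual MOVIE implementation is in C and MovieScript, and none of the names used here are taken from it. It walks a tree rooted at $MOVIEHOME/M and builds the corresponding nested dictionary, with file entries standing in for the operators and library functions they implement.

import os

def build_meta_dictionary(root):
    """Mirror a directory tree as a nested dictionary.

    Composite nodes (directories) map to sub-dictionaries; simple nodes
    (regular files) map to their pathnames, standing in for the operators,
    library functions, and other entities they implement.
    """
    node = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            node[entry] = build_meta_dictionary(path)   # e.g. "Lib" -> {...}
        else:
            node[entry] = path                          # e.g. "add.c" -> ".../M/Op/add.c"
    return node

# Hypothetical usage:
#   M = build_meta_dictionary(os.path.join(os.environ["MOVIEHOME"], "M"))
#   M["Op"] is then the dictionary of operator sources, M["Lib"] the library dictionaries, etc.

Similarly, the MetaIndex construction can be sketched as follows. The text only states that the index is a 32-bit quantity obtained by concatenating subindices along the M-tree path; the field widths used below (four 8-bit levels) are an assumption made here for illustration and are not the actual MOVIE layout.

BITS_PER_LEVEL = 8   # assumed field width, for illustration only
MAX_LEVELS = 4       # 4 x 8 = 32 bits, matching the 32-bit MetaIndex

def encode_meta_index(subindices):
    """Pack per-level subindices (root first) into a single 32-bit integer."""
    assert 0 < len(subindices) <= MAX_LEVELS
    index = 0
    for sub in subindices:
        assert 0 <= sub < (1 << BITS_PER_LEVEL)
        index = (index << BITS_PER_LEVEL) | sub
    # Left-justify so that shorter paths occupy the high-order fields.
    return index << (BITS_PER_LEVEL * (MAX_LEVELS - len(subindices)))

def decode_meta_index(index, levels):
    """Recover the per-level subindices of a path of known depth."""
    mask = (1 << BITS_PER_LEVEL) - 1
    return [(index >> (BITS_PER_LEVEL * (MAX_LEVELS - 1 - level))) & mask
            for level in range(levels)]

# A hypothetical path with subindices [3, 12, 5, 2] (say, M/Type/String/Lib):
#   encode_meta_index([3, 12, 5, 2]) == 0x030C0502
#   decode_meta_index(0x030C0502, 4) == [3, 12, 5, 2]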
MetaShell tools operate on nodes of the M-tree and its mappings in a way similar to how a query language operates on its database. Atomicity and integrity of all operations are assured. A typical command, creating a new C function foo in the library M/Type/String/Lib/regex, triggers a set of actions, performed automatically by MetaShell, that keep the M-tree mappings described above consistent.
MetaShell commands are organized in the object-oriented style, with each directory/file node of the M-tree represented as a MetaShell class/instance. The basic methods supported for all MetaShell objects are create, destroy, and query/browse. More sophisticated CASE tools, useful in the group development mode, are currently under construction, such as a class corresponding to the whole $MOVIEHOME (with instances represented by individual developers' copies of the system) or the server class (with instances representing the customized versions of the MOVIE server).
Starting from the first external release of the system, MOVIE 1.0, we intend to initiate a series of application projects in various computational domains. Below, we list and briefly describe some of the near-term applications which are currently in the planning stage. In each case, we point out the elements of the MOVIE system that are most relevant to the given domain.
This problem provided the initial motivation for developing MOVIE. Vision involves diverse computational modules, ranging from massively parallel algorithms for image processing to symbolic AI techniques, coupled in real time via feedforward and feedback pathways. As a consequence, the corresponding software environment needs to support both regular data-parallel computing and irregular, dynamic processing, all embedded in a uniform high-level programming model with consistent data structures and a consistent communication model between individual modules. Furmanski started the vision research within the Computation and Neural Systems (CNS) program at Caltech and then continued experiments with various image-processing and early/medium vision algorithms (Sections 6.5, 6.6, 6.7, 9.9) within the Terrain Map Understanding project (Section 17.3). The most recent framework is the new Computational Neuroscience Program (CNP) at Syracuse University, where various elements of our previous work on vision algorithms and the associated software support could be augmented by new ideas from biological vision and possibly integrated into a more complete machine vision system. We are also planning to couple some aspects of the vision research with the design and development work for virtual reality environments.
A broad class of neural network algorithms [ Grossberg:88a ], [ Hopfield:82a ], [ Kohonen:84a ], [ Rumelhart:86a ] can be implemented in terms of a suitable set of data-parallel operators [ Fox:88g ], [ Nelson:89a ]. Rapid prototyping capabilities of MOVIE, combined with the field algebra model, offer a convenient experimentation and portable development environment for neural network research. In fact, the need for such tools, integrated with the HPC support, was one of the original arguments driving the MOVIE project. We plan to continue our previous work on parallel neural network algorithms [ Fox:88e ], [ Ho:88c ], [ Nelson:89a ], now supported by rapid prototyping and visualization tools.
Within CNP, we also plan to continue our exploration of methods in computational neurobiology [ Furmanski:87a ], [ Nelson:89a ]. We want to couple MOVIE with popular neural network simulation systems such as Aspirin from MITRE or Genesis from Caltech and to provide the MOVIE-based HPC support for the neuroscience community. Another attractive area for neural network applications is in the context of load-balancing algorithms for the MIMD-parallel and distributed versions of the system. We plan to extend our previous algorithms for neural net-based static load balancing [ Fox:88e ] to the present, more dynamic MOVIE model and to construct ``neural routing'' techniques for MovieScript threads.
This class of neural net applications can be viewed as an instance of a broader domain referred to as physical computation , illustrated in Chapter 11, that is, using methods and intuitions of physics to develop new algorithms for hard problems in combinatorial optimization [Fox:88kk;88tt;88uu;90nn], [ Koller:89b ]. We also plan to continue this promising research path.
The new nCUBE2-based parallel Oracle system (currently version 7.0) is installed at NPAC within the joint JPL/NPAC database project sponsored by ASAS. The MIMD Oracle model is based on a mesh of SQL interpreters and hence follows an organization similar to the MIMD MOVIE model. We plan to develop a ``server parallel'' coupling between the Oracle and MOVIE systems, for example, by locating them on parallel subcubes and linking them via the common hypercube channel. This would allow for smooth integration of high-performance databases with high-performance computing, and also for extending the restricted parallelism of the current MIMD Oracle model with Fortran90-style data-parallel support for processing large distributed tables.
We also plan to experiment with object-oriented [ Zdonik:90a ] and intelligent [ Parsaye:89a ] database models in MOVIE and to develop MovieScript tools for integrating heterogeneous distributed database systems. MovieScript offers adequate language tools to address these modern database issues and to develop a bridge between relational and object-oriented techniques. For example, a table in the relational database can be represented in terms of MovieScript objects (fields of records) and then extended towards more versatile abstract data structures by using the inheritance mechanism.
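As a rough illustration of this relational-to-object bridge, the sketch below (Python, with hypothetical record names; this is not MOVIE or MovieScript code) represents a row of a relational table as an object and then extends it by inheritance with structure that a flat relational row cannot hold directly.

from dataclasses import dataclass, field
from typing import List

@dataclass
class EmployeeRecord:
    """One row of a hypothetical relational table: one attribute per column."""
    name: str
    department: str
    salary: float

@dataclass
class ManagerRecord(EmployeeRecord):
    """The same record extended via inheritance with nested structure
    (a list of direct reports) that a flat relational row cannot express."""
    reports: List[EmployeeRecord] = field(default_factory=list)

# A relational table then maps naturally onto a list of such objects:
staff = [EmployeeRecord("Ada", "R&D", 90000.0)]
boss = ManagerRecord("Grace", "R&D", 120000.0, reports=staff)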
The Global Change federal initiative raises unprecedented challenges in various associated technology areas such as parallel processing [ Rosati:91a ] and large object-oriented databases [ Stonebraker:91a ]. The complexity of this domain is due both to the huge data sizes and rates to be processed and to the diversity of the research and simulation areas involved, ranging from climate modelling to economics. In collaboration with the Bainbridge Technology Group, Ltd. (BTGL) [ Rosati:91b ], we are planning to evaluate MOVIE in the context of various computational tasks associated with Global Change, with a focus on advanced visualization, animation, and large-system integration.
Another computationally intensive domain is experimental High Energy Physics (HEP) at the Superconducting Super Collider (SSC) energy range. The SSC has since been cancelled, but similar challenges exist at CERN (Geneva) and at Fermilab near Chicago. We are examining areas such as high-end visualization and virtual reality (for event display and virtual detector engineering) [ Haupt:92a ], [ Skwarnicki:92a ], MIMD-parallel computing (e.g., for parallel GEANT-style Monte Carlo simulations) [ Fox:90bb ], and databases (parallel computing support, integration tools in the heterogeneous distributed environment). HEP is a computationally intensive discipline based on a mature and advanced, but so far custom-made, Fortran-based software environment. The computational challenges of the next high-energy experiments require modern software technology insertions such as HPC, advanced visualization, and rapid prototyping tools. We see the MOVIE model, appropriately interfaced to the existing Fortran77-based HEP systems and offering a convenient Fortran90-style portable extension towards HPC, as an attractive development and integration platform for the software environment at current and future experiments [ Furmanski:92f ].
Building on our work on Terrain Map Understanding (Section 17.3), we plan to add expert system support in MovieScript, to be used in late-vision tasks such as proximity analysis, GIS knowledge-based processing, and object recognition. This project, also a part of the ASAS Map Separates program, is planned with Coherent Research, Inc., Syracuse, NY, where a similar expert system capability is being developed for analyzing black-and-white handmade maps used by the local electric company (Niagara Mohawk).
We are also planning to build knowledge-based ``intelligent'' CASE tools to improve economy and to accelerate the MOVIE development process. Typical examples include smart class browsers or automated interface builders based on ``fuzzy'' specification of user requests. This approach is in the spirit of the Knowledge Based Software Engineering (KBSE) technology, recently advocated by DARPA, on the basis of a comprehensive analysis of software costs [ Boehm:90a ], as an efficient economy measure for next-generation software processes. Implementation of the KBSE concepts requires integrating expert system techniques with conventional software engineering practices. Since PostScript derives from Lisp, its appropriate extension in MovieScript towards symbolic processing offers a natural integration platform for KBSE tools.
The dynamic and integrative features of the MOVIE environment are well suited for modelling and prototyping various aspects and components of the new generation of C³I systems. The new objectives in this area are to cope efficiently with potentially smaller but more diversified and less predictable threats, and to operate in a robust, adaptive fashion in a dynamic heterogeneous distributed environment. The dynamic topology of the MOVIE network, supporting adaptive routing schemes to recover from network damage, is useful for such C³I functions as information transmission and battle management . The high-quality dynamic visualization services of the MOVIE model, evolving towards hypermedia navigation and virtual reality, are suitable for such C³I functions as planning and evaluation . Finally, the integrative high-level language model of MovieScript, supporting both data-parallel and irregular object-oriented computing, is adequate for such C³I functions as fusion and detection .
MOVIE is planned as one of the candidate software models for C³I simulation, modelling, and prototyping, to be evaluated within the new industrial CRDA (Cooperative Research and Development Agreement) on parallel software engineering, starting in summer 1992 and coordinated by Rome Laboratory.
We see virtual reality as a promising candidate for the ``ultimate'' human-machine interface technology and also as the most challenging system component of the MOVIE model, playing the role of the global integration and synchronization platform for all major design concepts of the system, including interpretive multiserver networking, preemptive multithreading with the real-time aspects, object-oriented three-dimensional graphics model, and support for high-performance computing. We describe this application area in more detail in Section 17.4 .
Analysis of terrain maps, digitized as (noisy) full-color images, is the first MOVIE application, funded by the ASAS agency in parallel with the base system development project.
A sample set of map images, provided to us by ASAS/JPL, is presented in Figure 17.9. The project has been split by the agency into several stages.
Figure 17.9:
A Sample Set of RGB Images of Terrain Maps, Provided to Us by
ASAS/JPL. Maps are of various sizes, scales, resolutions, saturation, and
intensity ranges. They also contain diverse topographic elements and
cartographic techniques.
This problem, posed by the DMA and addressed by several groups within the ASAS TECHBASE program (the Cartography group at JPL, the MOVIE group at NPAC, and Coherent Research, Inc. (CRI) at Syracuse), turns out to be highly nontrivial, especially above a certain critical accuracy level, of the order of 80%.
The JPL approach to map separates is based on backpropagation techniques. The CRI approach to map understanding is based on expert system techniques. Our MOVIE group approach is based on machine vision techniques. Our goal is to construct a complete map recognition system, including both separation and understanding components, structured as low- and high-level layers of a vision system and coupled by feedforward and feedback pathways.
The problem involves diverse computational domains such as image processing, pattern recognition, and AI, and it provided the initial driving force for developing general-purpose support based on the MOVIE system. At the current stage, we have completed the implementation of a class of early/medium vision algorithms for map separates, based on zero crossings for edge detection and RGB clustering for segmentation. The resulting techniques are comparable in quality to, and give higher performance than, the backpropagation-based approach.
Our conclusion from this stage is that further quality improvement in the separation process can be achieved only by coupling the low-level pixel-based techniques with the high-level approaches, based on symbolic representations, and by providing the feedback loop from the recognition layers to the separation layer.
From the computational perspective, the currently implemented layers are based on the MOVIE field algebra support for image processing. Two trial user interfaces constructed so far were based on the X/Motif interface for two-dimensional graphics and on the AVS interface for three-dimensional graphics. In preparation is a more complete tool, based on the uniform two- and three-dimensional graphics support in MOVIE and providing a testbed environment for evaluating the various techniques employed so far to handle this complex problem. As part of this testbed program, we have also recently implemented the backpropagation algorithm for map separates [ Simoni:92a ], following the techniques developed originally by the JPL group.
In the following, we discuss in more detail various algorithms involved in this problem, with the focus on the RGB clustering techniques. The material presented here is based on an internal report [ Fox:93b ].
We will use the map image in Figure 17.10 to illustrate concepts and algorithms discussed in this section.
Figure 17.10: Map Image, Referred to in the Text as ad250 and Given to Us by the JPL Group as the Test Case for the Backpropagation Algorithm. The original image is stored with 24 bits of color per pixel.
This image, referred to as ad250 , was given to us by the JPL group as the test case of their backpropagation algorithm. ad250 is a relatively complex image: it involves shaded relief, the color saturation is poor, and there are many isoclines represented by a broad spectrum of brown tints, fluctuating and intermixed at their boundaries with white, grey, green, and dark green.
The color separation of this image is not unique unless some human guidance is involved. For example, the gray shaded relief can either be considered an independent color or be ignored, that is, identified as white. Also, isoclines with various tints of brown can be labelled either by different colors or by a single effective brown. We obtained from JPL their results from the backpropagation algorithm for this image. The color selection ambiguities were resolved there by the map analyst during the network training stage and, since these decisions can be deduced from the final result, we adopted the same color mapping rules in our work.
The adopted rules identify the gray shaded relief with white and map the various tints of brown used for isoclines onto a single effective brown.
To summarize, the image in Figure 17.10 is to be separated into seven base colors: white, green, brown, magenta, blue, purple, and black. Having done this, we can easily declutter it within the contextual regions of these colors, that is, we can distinguish isoclines from roads (but we cannot, for example, distinguish a city name from the road boundary).
The separation results obtained at JPL using the backpropagation algorithm for the ad250 image are presented in Figure 17.11 .
Figure 17.11: Image ad250 from Figure 17.10, Separated into Seven Base Colors Using the JPL Backpropagation Algorithm. The net is trained on a subset of the image pixels. A set of 27 color values for an image window is provided each time and the required output is enforced for the central pixel. Ten hidden neurons are used.
Our approach is to explore computer vision techniques for map separates. This requires more labor and investment than the neural network method, which has the advantage of a ``black box''-type approach, but one expects that the vision-based strategy will eventually be more successful. The disadvantage of the backpropagation approach is that it doesn't leave much room for major improvements: it delivers reasonable-quality results for low-level separation, but it can hardly be improved by including more insights from higher-level reasoning modules, since some information is already lost inside the backpropagation ``black box.'' In contrast, machine vision is a hierarchical, coupled system with feedback, which allows for systematic improvements and provides the proper foundation for the full map understanding program.
The map separates problem translates in the vision jargon into segmentation and region labelling. These algorithms lie somewhere on the border of the early and medium vision modules. We have analyzed the RGB clustering algorithm for map image segmentation. In this method, one first constructs a three-dimensional histogram of the image intensity in the unit RGB cube. For a hypothetical ``easy'' image composed of a small, fixed number of colors, only a fixed number of histogram cells will be populated. By associating distinct labels with the individual nonempty cells, one can then filter the image, treating the histogram as a pixel label look-up table, that is, assigning to each image pixel the corresponding cell label. For ``real world'' map images involving color fluctuation and diffusion across region boundaries, the notion of isolated histogram cells should be replaced by that of color clusters. The color image segmentation algorithm first isolates and labels individual clusters. The whole RGB cube is then separated into a set of nonoverlapping polyhedral regions, specified by the cluster centers. More explicitly, each histogram cell is assigned the label of the nearest cluster detected by the clustering algorithm. The pixel region look-up table constructed this way is then used to assign region labels to the individual pixels in the image.
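The scheme just described can be summarized by a compact sketch (Python/NumPy, for exposition only; the actual implementation is built on the MOVIE field algebra, and the function name below is ours). It assumes the cluster centers have already been found, builds the cell-to-label look-up table as a nearest-center partition of the RGB cube, and then labels every pixel with a single table fetch.

import numpy as np

def segment_by_rgb_clustering(image, centers, bins=16):
    """Label each pixel of an RGB image by its nearest color cluster.

    image   : (H, W, 3) float array with values in [0, 1] (the unit RGB cube)
    centers : (K, 3) array of cluster centers in the RGB cube
    bins    : histogram resolution per color axis
    Returns an (H, W) integer array of cluster labels in the range 0..K-1.
    """
    # 1. Quantize every pixel into a histogram cell of the RGB cube.
    cells = np.minimum((image * bins).astype(int), bins - 1)          # (H, W, 3)

    # 2. Build the cell -> label look-up table: each cell gets the label of the
    #    nearest cluster center, i.e. a polyhedral partition of the cube.
    axis = (np.arange(bins) + 0.5) / bins                             # cell-center coordinates
    rr, gg, bb = np.meshgrid(axis, axis, axis, indexing="ij")
    cell_centers = np.stack([rr, gg, bb], axis=-1).reshape(-1, 3)     # (bins**3, 3)
    d2 = ((cell_centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    lut = d2.argmin(axis=1).reshape(bins, bins, bins)

    # 3. Label every pixel with one table fetch.
    return lut[cells[..., 0], cells[..., 1], cells[..., 2]]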
As a first test of the RGB clustering, we constructed a set of color histograms with fixed bin sizes and various resolutions (Figure 17.12). Even with such a crude tool, one can observe that the clustering method is very efficient for most image regions and, when appropriately extended to allow for irregular bin volumes, will provide a viable segmentation and labelling algorithm. The interactive tool in Figure 17.12 also provided a nice test of the rapid prototyping capabilities of MOVIE. The whole MovieScript code for the demo is very compact and was created interactively, based on the integrated scriptable tools for Motif, field algebra, and imaging.
Figure 17.12: Map Separates Tool Constructed in MovieScript in the Rapid Prototyping Mode to Test the RGB Clustering Techniques. The left image represents the full color source; the right image is separated into a fixed number of base colors. Three RGB histograms are constructed with three different bin sizes. Each histogram is represented as a sequence of RG planes, parametrized by the B values. The first row under the image panel contains eight blue planes of the first histogram; the second row contains the other two histograms. The content of each bin is encoded as an appropriate shade of gray. A mouse click into a selected square causes the corresponding separate to be displayed in the right image window, using the average color in the selected bin. In the separate mode, useful for previewing the histogram content, subsequent separates overwrite the content of the right window. In the compose mode, used to generate this snapshot, subsequent separates are superimposed. Tools are also provided for displaying all separates for a given histogram in the form of an array of images.
The simple, regular algorithm in Figure 17.12 cannot cope with problems such as shaded relief, nonconvex clusters, and the color ambiguities discussed above. To handle such problems, we need interactive tools to display, manipulate, and process three-dimensional histograms. The results presented below were obtained by coupling the MOVIE field algebra with the AVS visualization modules. A uniform MovieScript-based tool with similar functionality is in preparation, exploiting the three-dimensional graphics support currently being developed in MOVIE.
The RGB histogram for the ad250 image is presented in Figure 17.13. Each nonempty histogram cell is represented as a sphere, centered at the bin center, with the radius proportional to the integrated bin content and with the color given by the average RGB value of this cell. Poor color separation manifests itself as cluster concentration along the gray (R = G = B) diagonal of the cube. Two large clusters along this diagonal correspond to white and grey patches on the image. A ``pipe'' connecting these two clusters is the effect of shaded relief, composed of a continuous band of shades of gray. Three prominent off-diagonal clusters, forming a strip parallel to the major white-gray structure, represent two tints of true green and dark green, again with the shaded relief ``pipe.'' Brown isoclines are represented by an elongated cloud of small spheres, scattered along the white-gray structure.
Figure 17.13: Color histogram of the ad250 image (see Figure 17.10) in the unit RGB cube. Each bin is represented by a sphere with the radius proportional to the bin content and with the average color value in this bin.
The separation of these three elongated structures (white, green, and brown) represents the major complexity, since all three shapes are parallel and close to each other. The histogram in Figure 17.13 is constructed at a resolution which is slightly too low for the numerical analysis (discussed below) but useful for graphical representation as a black-and-white picture. The higher-resolution histogram used in the actual calculations contains too many small spheres to create any compelling three-dimensional sensation without color cues and interactive three-dimensional tools (it does, however, look impressive and spectacular on a full-color workstation with a 3D graphics accelerator). By working interactively with the histogram, one can observe that all three major clusters are in fact reasonably well separated.
All remaining colors separate in an easy way: shades of black again form a scattered diagonal strip which is far away from the three major clusters and separates easily from a similar, smaller parallel strip of shades of purple; red separates as a small but prominent cluster (close to the central gray blob in Figure 17.13); finally, blue is very dispersed and manifests itself as a broad cloud or dilute gas of very small spheres, invisible in Figure 17.13 but again separating easily into an independent polyhedral sector of the RGB cube.
The conclusion from this visual analysis, only partially reproduced by the picture in Figure 17.13, is that RGB clustering is a viable method for separating ad250 into the seven indicated base colors. As mentioned above, this separation process requires human guidance because of the color mapping ambiguities. The nontrivial technical problem from the domain of human-machine interfaces that we are now facing is how to operate interactively on complex geometrical structures in the RGB cube. A map analyst should select individual clusters and assign a unique label/color to each of them. As discussed above, these clusters are separable, but their shapes are complex, some of them given as clouds of small spheres, some others elongated, nonconvex, and so on.
A virtual reality-type interface with glove input and three-dimensional video output could offer a natural solution to this problem. For example, an analyst could encircle a selected cluster by a set of hand movements. Also, the analyst's presence inside the RGB cube, implemented in terms of the immersive video output, would allow for fast and efficient identification of various geometric structures.
For now, we have adopted a more cumbersome but also more realistic approach, implementable in terms of conventional GUI tools. Rather than separate clusters, we reconstruct them from a set of small spheres. An interactive tool was constructed in which an analyst can select a small region or even a single pixel in the image and assign an effective color/label to it. This procedure is iterated some number of times. For example, we click into some white areas and say: white. Then we click into a few levels of shaded relief and we say again: white. Finally, we click into the gray region and we also say: white. In a similar way, we click into some number of isoclines with various tints of brown and we say: brown. Each point selected in this way becomes the center of a new cluster.
Figure 17.14: A Set of About 80 Color Values, Selected Interactively and Mapped onto the Specified Set of Seven Base Colors as Described in the Text
The set of clusters selected this way defines the partition of the RGB cube into a set of nonoverlapping polyhedral regions. Each such region is convex and therefore the number of small clusters to be selected in this procedure must be much larger than the number of ``real'' clusters (which is seven in our case), since the real clusters often have complex, nonconvex shapes.
A sample selection of this type is presented in Figure 17.14. It contains about 80 small spheres, each in one of the seven base colors. Separation of ``easy'' colors such as blue or red can be accomplished in terms of a few clusters. Complex structures such as white, green, and brown require about 20 clusters each to achieve satisfactory results. The image is then segmented using the color look-up table constructed in this way, and a weight, proportional to the integrated content of the corresponding polyhedral region, is assigned to each small cluster. The same selection as in Figure 17.14, but now with the sphere radii proportional to this integrated content, is presented in Figure 17.15.
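The seed-based reconstruction can be sketched as follows (again an illustrative Python version with hypothetical argument names, not the actual MovieScript/AVS tool): each interactively selected RGB value carries one of the seven base-color labels, every pixel receives the base color of its nearest seed, and the weight of a seed is the number of pixels falling into its polyhedral region, which is what the sphere radii of Figure 17.15 encode.

import numpy as np

def separate_with_seeds(image, seed_rgb, seed_base_color):
    """Assign every pixel the base color of its nearest interactively chosen seed.

    image           : (H, W, 3) float array in the unit RGB cube
    seed_rgb        : (S, 3) array of RGB values picked by the analyst (S of order 80)
    seed_base_color : (S,) integer array of base-color labels (0..6), one per seed
    Returns (labels, weights): per-pixel base-color labels and, for each seed,
    the number of pixels falling into its polyhedral region.
    """
    pixels = image.reshape(-1, 3)
    # Distance from every pixel to every seed; a production version would go
    # through a histogram look-up table instead, as in the previous sketch.
    d2 = ((pixels[:, None, :] - seed_rgb[None, :, :]) ** 2).sum(axis=2)
    nearest_seed = d2.argmin(axis=1)
    weights = np.bincount(nearest_seed, minlength=len(seed_rgb))
    labels = np.asarray(seed_base_color)[nearest_seed].reshape(image.shape[:2])
    return labels, weights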
Figure 17.15: The Same Set of Selected Color Values as in Figure 17.14 But Now with the Radius Proportional to the Integrated Content of Each Polyhedral Cell with the Center Specified by the Selected RGB Value.
As seen, we have effectively reconstructed a shape with topology roughly similar to the original histogram in Figure 17.13 but now separated into the seven base colors.
The resulting separated image is presented in Figure 17.16 and compared with the JPL result in Figure 17.11 in the next section.
Figure 17.16: Image ad250 from Figure 17.10, Separated into Seven Base Colors Using the RGB Clustering Algorithm with the Clusters and Colors Selected as in Figure 17.15.
The quality of the separation algorithms in Figure 17.16 (RGB clustering) and in Figure 17.11 (neural network) is roughly similar. The RGB cluster-based result contains more high-frequency noise, since the algorithm is based on a point-to-point look-up table approach and performs no neighborhood analysis. This noise could easily be cleaned up by a simple postprocessor eliminating isolated pixels, but we have not done so yet. In our approach, image smoothness analysis is represented by another class of algorithms, discussed in Section 17.3.5.
The most important point to stress is that the RGB cluster-based method is much faster than the backpropagation method. Indeed, in the RGB clustering algorithm, the pixel label assignment is performed by a simple local look-up table computation which involves five numerical operations per pixel. The JPL backpropagation algorithm, employed in computing the result in Figure 17.11, contains 27 input neurons, 10 hidden neurons, and 7 output neurons, requiring about 700 numerical operations per pixel. The neural network chip speeds up the backpropagation-based separation algorithm by a factor of 10. In consequence, our algorithm is faster by a factor of about 100 than the JPL software algorithm, and it is still faster by a factor of about 10 when compared with the JPL hardware implementation.
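These rough counts can be reproduced by a back-of-the-envelope estimate. The accounting below is our own assumption made for illustration (one multiply and one add per weight of a fully connected 27-10-7 net, and a simple quantize-and-fetch accounting for the look-up table); it is not taken from [ Fox:93b ], but it is consistent with the numbers quoted above:

\[ N_{\mathrm{backprop}} \approx 2\,(27 \cdot 10 + 10 \cdot 7) = 680 \approx 700 \ \text{operations per pixel,} \]
\[ N_{\mathrm{LUT}} \approx \underbrace{3}_{\text{quantize } R,\,G,\,B} + \underbrace{2}_{\text{form bin index, fetch label}} = 5 \ \text{operations per pixel.} \]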
Our interpretation of these results, and our understanding of the backpropagation approach in view of the numerical and graphical experiments described above, is as follows. Both algorithms contain similar components. In both cases, we enter some color mapping information into the system during the ``training'' stage and we construct some internal look-up table. In our case, this look-up table is constructed as a set of labelled polyhedral regions, realizing a partition of the RGB cube, whereas in the backpropagation case it is implemented in terms of the hidden units. Our look-up table is optimal for the problem at hand, whereas backpropagation uses the ``general-purpose'' look-up table offered by its general-purpose input-output mapping capabilities. It is therefore understandable that our algorithm is much faster.
Still, both algorithms are probably functionally equivalent; that is, the backpropagation algorithm effectively constructs a very similar look-up table, performing RGB clustering in terms of hidden units and synaptic weights. But it does this in a very inefficient way. It is sometimes said that a neural network is always the ``second best'' solution to a problem. In complex perceptual or pattern-matching problems, the truly best solution is often unknown and the neural network approach is useful, whereas in early/medium vision problems such as map separates, the machine vision techniques are competitive in quality and more efficient. However, we stress that backpropagation, even if less efficient, is a convenient way to get reasonable results quickly as far as user development time is concerned. It maximizes initial user productivity, not algorithmic performance.
The backpropagation algorithm produces a cleaner separated image, as seen in Figures 17.11 and 17.16. This is because backpropagation operates on a 3×3 input window, whereas the RGB clustering uses a 1×1 window, that is, just a single pixel. Some smearing is therefore built into the neural network during the training period. The corresponding vision algorithms, involving neighborhood analysis based on image smoothness, are discussed in the next section.
Figure 17.17: Three-dimensional Surface Representing a Selected Color Plane (Red) for a Region of the ad250 Image from Figure 17.10 (Includes the Letter ``P'' from ``Prachatice'' in the Upper Right Corner). The X,Y coordinates of the surface correspond to pixel coordinates and the Z value of the surface is proportional to the image intensity.
Figure 17.17 presents a region from the ad250 image, displayed as a three-dimensional surface. The image pixel coordinates X,Y are mapped onto the surface X,Y coordinates, whereas the Z value of the surface for a given X,Y is proportional to the local intensity value of a given color plane. In Figure 17.17, the red plane was taken; similar pictures can be obtained for the green, blue, luminance, and any other plane filter. To identify the displayed region on the image, note the letter P, the first character of the name ``Prachatice,'' in the upper-right corner of the surface, and the number ``932'' below and to the left of it.
As seen, even though there are some local intensity fluctuations in the image, the resulting surface is reasonably smooth, and the segmentation problem can therefore also be addressed by using suitable edge detection techniques.
On an ``ideal'' map image, edges could be detected simply by identifying color discontinuities. On ``real world'' map images, the edges are typically not very abrupt due to the A/D conversion process; it is more appropriate to think in terms of fitting a smooth surface and analyzing rapid changes of the first derivatives or zeros of the second derivative. These types of techniques were developed originally by Marr and Hildreth [ Marr:80a ] and most recently by Canny [ Canny:87a ].
A single step of this algorithm combines Gaussian smoothing of a selected color plane with the detection of zero crossings of the second directional derivative D, taken along the direction of the local intensity gradient, as sketched below.
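The sketch below (Python/SciPy, written here for exposition; it is not the MOVIE implementation, and the function name is ours) smooths the chosen color plane with a Gaussian of width sigma, forms the second directional derivative D along the local gradient direction, and marks its zero crossings.

import numpy as np
from scipy.ndimage import gaussian_filter

def directional_zero_crossings(plane, sigma=2.0):
    """One Marr-Hildreth/Canny-style step on a single color plane.

    plane : (H, W) float array (e.g. the red plane of the map image)
    sigma : Gaussian smoothing width; a multiscale variant iterates over sigma
    Returns a boolean (H, W) mask marking zero crossings of the second
    directional derivative D taken along the local intensity gradient.
    """
    # Derivatives of the Gaussian-smoothed plane (axis 0 = y, axis 1 = x).
    Ix  = gaussian_filter(plane, sigma, order=(0, 1))
    Iy  = gaussian_filter(plane, sigma, order=(1, 0))
    Ixx = gaussian_filter(plane, sigma, order=(0, 2))
    Iyy = gaussian_filter(plane, sigma, order=(2, 0))
    Ixy = gaussian_filter(plane, sigma, order=(1, 1))

    # Second directional derivative along the gradient direction.
    eps = 1e-12
    D = (Ix**2 * Ixx + 2.0 * Ix * Iy * Ixy + Iy**2 * Iyy) / (Ix**2 + Iy**2 + eps)

    # Zero crossings: sign changes between horizontally or vertically adjacent pixels.
    zc = np.zeros(D.shape, dtype=bool)
    zc[:, :-1] |= (D[:, :-1] * D[:, 1:]) < 0
    zc[:-1, :] |= (D[:-1, :] * D[1:, :]) < 0
    return zc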
The result of the Canny filter applied to the ad250 image is presented in Figure 17.18. Each pixel is represented there as a color square, and the neighboring squares are separated by a one-pixel-wide black background. Zero crossings of D are marked as white segments and they always form closed contours.
As seen, the brown isoclines, which required substantial labor to separate with the RGB clustering techniques, are now detected in a very easy and clean way. However, there are also a number of spurious contours in Figure 17.18 which need to be rejected; the simplest selection technique would be to threshold each contour on a signal-to-noise measure of the intensity change across it and reject those that fall below the noise level.
A natural use of the Canny filter in Figure 17.18 could be as follows. The image is first segmented into Canny contours, which are thresholded as above and then labelled. For each contour, an average color is computed by integrating the color content enclosed by this contour. This reduced color palette is then used as input to the RGB clustering. Such an approach would guarantee, for example, that all brown isoclines in Figure 17.10 will be detected as smooth lines, in contrast to both the RGB clustering and neural network algorithms, which occasionally fail to reconstruct continuous isoclines.
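The contour-averaging step proposed here could be sketched as follows (Python/SciPy, illustrative only; the function and argument names are ours): label the regions enclosed by the zero-crossing contours, replace each pixel by the mean color of its region, and feed the resulting reduced palette into the RGB clustering described earlier.

import numpy as np
from scipy import ndimage

def average_colors_within_contours(image, zero_crossings):
    """Replace each pixel by the mean color of its contour-enclosed region.

    image          : (H, W, 3) RGB image
    zero_crossings : boolean (H, W) mask of contour pixels (see the previous sketch)
    Returns an (H, W, 3) image with one average color per enclosed region,
    suitable as a reduced-palette input to the RGB clustering step.
    """
    # Connected regions between the contours; contour pixels themselves get label 0.
    regions, nreg = ndimage.label(~zero_crossings)
    out = np.zeros_like(image, dtype=float)
    idx = np.arange(1, nreg + 1)
    for c in range(3):                                    # average each color plane per region
        means = ndimage.mean(image[..., c], labels=regions, index=idx)
        lut = np.concatenate(([0.0], means))              # a fuller version would also assign
        out[..., c] = lut[regions]                        # contour pixels to a neighboring region
    return out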
Figure 17.18: Output of the Canny Edge Detector Filter, Applied to a Region of the Image ad250 from Figure 17.10. Closed contours are zero crossings of the second directional derivative, taken in the direction of the local intensity gradient.
Consider, however, the ``Mexican hat''-shaped green patch in Figure 17.10, located in the upper left part of the image, between the name ``Prachatice'' and the number ``932.'' This patch was very easily and correctly detected both by RGB clustering and by the neural net, but we would fail to detect it by the single-step Canny filter described above. Indeed, after careful inspection of the contours in Figure 17.18, one can notice that there is no single closed zero-crossing line encircling this region. In consequence, any contour-based color averaging procedure as above would result in some green color ``leakage.'' Within the Canny edge detection program, such edges are detected using a multiscale approach. The base algorithm outlined above is iterated with increasing values of the Gaussian width, and some multiscale acceptance/rejection method is employed. The green patch would eventually manifest itself as a low-frequency edge for a sufficiently large value of the Gaussian width.
We intend to investigate such multiresolution edge-detection strategies in more detail. In our opinion, however, a more attractive approach is based on hybrid techniques, discussed in the next section.
The output of the Canny edge detector, composed of a set of nonoverlapping contiguous regions covering the whole image, is precisely in the format required as input by the expert system constructed by Coherent Research, Inc. in their SmartMaps system. This expert system performs such high-level tasks as object grouping, proximity analysis, Hough transforms, and so on. The output of RGB clustering and/or a neural network can also be structured in this format. Probably the best strategy at this point is to extend this expert system so that it selects the best final separation pattern from a set of trial candidates. A genetic-algorithm-type philosophy could be used as a guiding technique. Each low-level algorithm is typically successful within certain image regions and fails in others. A smart split-and-merge approach, consistent with some set of common-sense rules, could yield a much better low-level separation result than any individual low-level technique by itself. For example, the Canny edge detector would offer the brown isoclines as good candidates, and RGB clustering would offer the green patch as a good region candidate. Both propositions would be cross-checked and accepted as reasonable by both algorithms, and the final result would contain both types of regions, separated with high fidelity. This type of medium-level geometrical reasoning could then be augmented and reinforced by high-level contextual reasoning within the full map understanding program.
We have described in this chapter our current results for map separates, based on the RGB clustering algorithm. This method results in a separation of comparable or somewhat lower quality than the backpropagation algorithm, but it is faster by a factor of 100. We suggest that our RGB clustering algorithm is in fact essentially equivalent to a backpropagation algorithm. In neural network jargon, we can say that we have found an analytic representation for the bulk of the hidden-unit layer, which results in a dramatic performance improvement. This representation can be thought of numerically as a pixel region look-up table or geometrically as a set of polyhedral regions covering the RGB cube. Further quality improvement of our results will be achieved soon by refining our software tools and by coupling the RGB clustering with the zero-crossing-based segmentation and edge detection algorithms. Zero-crossing techniques in turn provide a natural algorithmic connection to our intended collaboration with Coherent Research, Inc. on high-level vision and AI/expert system techniques for map understanding.
Virtual reality (VR) is a new human-machine interface technology based on the full sensory immersion of participants in a virtual world, generated in real time by a high-performance computer. Virtual worlds can range from physical spaces such as those modelled by dynamic terrain viewers or architectural walk-through tools, through a variety of ``fantasy lands,'' to entirely abstract cognitive spaces, generated by dynamic visualization of low-dimensional parametric subspaces extracted from complex nontopographic databases.
The very concept of the immersive interface and the first prototypes were already known in the 1960s [ Sutherland:68a ] and 1970s [ Kilpatrick:76a ]. In the 1990s, VR technology is becoming affordable. Most current popular hardware implementations of the interface are based on a set of exotic peripherals such as goggles for wide solid-angle three-dimensional video output, head-position trackers, and gloves for sensory input and tactile feedback. Another immersion strategy is based on ``non-encumbered'' interfaces [ Krueger:91a ], implemented in terms of a real-time machine vision front end which analyzes participants' gestures and responds with synchronized sensory feedback from the virtual world.
VR projects cover a wide range of technologies and goals, including high-end scientific visualization (UNC), high-end space applications (NASA Ames), base research and technology transfer (HIT Lab), and low-end consumer market products (AGE).
The VR domain is growing vigorously and already has reached the mass media, generating the current ``VR hype.'' According to VR enthusiasts, this technology marks the new generation of computing and will start a revolution comparable in scope to personal computing in the early 1980s. In our opinion, this might be the correct assessment since VR seems to be the most natural logical next step in the evolution of human-machine interfaces and it might indeed become the ``ultimate solution'' for using computers because of its potential for maximal sensory integration with humans. However, the explicit implementations of VR will most probably vary very rapidly in the coming years, in parallel with the progress of technology, and most of the current solutions, systems, and concepts will become obsolete very soon.
Nevertheless, one is tempted to immediately start exploring this exciting field, additionally encouraged by the rapidly increasing affordability of VR peripherals. The typical cost of a peripheral unit for a VR environment has gradually decreased from $1M (Super Cockpit) in the 1970s, through $100K in the 1980s (NASA), down to $10K (VPL DataGlove) in the early 1990s. The new generation of low-cost ``consumer VR'' systems which will reach the broad market in the mid-1990s comes with a price tag of about $100. This clearly indicates that the time to get involved in VR is now!
VR opponents predict that VR will have its major impact in entertainment rather than R&D or education. However, there is already a new buzzword in VR newspeak, suggesting a compromise solution: edutainment ! From the software engineering perspective, the edutainment argument can be formulated as follows: the software models and standards generated today will mature perhaps five to ten years from now and hence they will be used by the present ``Nintendo generation.'' There is no reason to expect that these kids will accept anything less intuitive and natural for user interfaces than the current Nintendo standards, which will evolve rapidly during the coming years towards the full-blown VR interfaces.
Leaving aside longer term prognoses, we would expect that a few years from now, VR will be available on all systems in the form of an add-on option, more or less as the mouse was for personal computers a few years ago. We will soon be witnessing the new generation of consumer VR products for the broad entertainment market and, in the next stage, the transfer of this technology to the computer interface domain. These low-cost gloves and headsets, easy to use and attached to conventional monitors, will probably appear more and more frequently. VR applications will coexist with standard applications within the existing windowing systems. We will still be using conventional text editors and other window tools, whereas the add-on VR peripherals and software layers will allow us to enter virtual worlds (i.e., dynamic three-dimensional-intensive applications) through conventional two-dimensional windows.
Matrix Information Services, Inc. (MIS) recently finished an extensive survey of emerging VR programs, firms, and application areas [ MIS:91a ]. Some 40 sites have been identified. The claim is, however, that the actual number of new VR initiatives is much larger, since many large firms do not disclose any information about their VR startups. The first generation of commercial VR products identified by MIS includes applications in medical imaging, aerospace, business, engineering, transportation, architecture and design, law enforcement, education, tours and travel, manufacturing and training, personal computing, entertainment, and the arts. In fact, when Bill Bricken, HIT Lab's Chief Scientist, was asked to estimate the VR market some 20 years from now, he replied: ``Just the Gross National Product.'' Statements like this are clearly made to amplify the current VR hype for fund-raising purposes. Nevertheless, the diversity of emerging application areas might indeed suggest that VR is capable of embracing a substantial portion of today's computer market in the next decade.
Furmanski often encountered such enthusiastic opinions during his summer 1991 tour of West Coast labs and companies with representatives of BTGL [ Furmanski:91g ]. However, the same companies admit that the real VR market in the U.S. as of today is virtual. The bulk of their sales is in Japan, where the investments in VR R&D are an order of magnitude higher than in the U.S. We don't hear much about Japan's progress in VR since their approach is very different. Still, some of their latest achievements, like commercial products with nonencumbered, machine vision-based VR interfaces, have found their way into the media. In the U.S., this technology has been researched for years, first in the academic and then in the small business mode, by Myron Krueger, a true pioneer of artificial reality.
There is much less VR hype in Japan, and the VR technology is viewed there in a more modest fashion as a natural next generation of GUIs. It is intended to be fully integrated with existing computing environments rather than to be an entirely new computing paradigm. It is therefore very plausible that, due to this more organized, long-range approach, Japan will take the true leadership role in VR. This issue has been raised by then-Senator Gore, who advocated increasing R&D funds for VR in this country. One should also notice that federal support for virtual reality needs to be associated with similar ongoing efforts towards maintaining U.S. dominance in the domain of High Performance Computing, since we expect both technologies to become tightly coupled in the near future.
There is campuswide interest in multimedia/VR at Syracuse University, involving labs and departments such as the CASE Center, NPAC, the School of Information Studies, the Multimedia Lab, and the Advanced Graphics Research Lab. A small-scale virtual reality lab has been started, sponsored by the CASE Center and by Chris Gentile of AGE, Inc., an SU alumnus and partner in a successful New York State startup focused on low-end, broad-market consumer VR products. New planned collaborations with corporate sponsors include joint projects with SimGraphics Engineering, Inc., a California-based company developing high-quality graphics software for simulation, animation, and virtual engineering, and with Virtual Reality, Inc., a new East Coast startup interested in developing high-end VR systems with high-performance computing support.
On the base VR research side, there is a planned collaboration with Rome Laboratories [ Nilan:91a ] aimed at designing VR-based group decision support for modern C³I systems. The project also involves evaluating MOVIE as a candidate for the high-end VR operating shell. Within the new multidisciplinary Computational Neuroscience Program at Syracuse University, we are also planning to couple some vision and neural network research issues with design issues for VR environments, such as ``nonencumbered'' machine vision-based interfaces, VR-related perception limits, or neural net-based predictive tracking techniques for fast VR rendering.
Multimedia is a discipline closely associated with VR and strongly represented at Syracuse University by the Multimedia Lab within the CASE Center and by the Belfer Audio Lab. Some of the multimedia applications are more static and/or text-based than the dynamic three-dimensional VR environments. The borderline between the two disciplines is usually referred to as hypermedia navigation, that is, dynamic real-time exploration of multimedia databases. Large, complex databases and the associated R&D problems of integration, transmission, data abstraction, and so on, represent the common technology area connecting multimedia and VR projects.
Our interests at NPAC are towards high-performance VR systems, based on parallel computing support. A powerful VR environment could be constructed by combining the computational power and diverse functionality of the new parallel systems at NPAC: CM-5, nCUBE2, and DECmpp, connected by fast HIPPI networks. A natural VR task assignment would be: modeller/simulator on the CM-5, parallel database server on the nCUBE2, and renderer on the DECmpp, which basically covers all major computational challenges of virtual reality.
The relevance of parallel computing for VR is obvious and yet largely unexplored within the VR community. The popular computational engine for high-end VR is currently provided by the Silicon Graphics machines, and these systems are in fact custom parallel computers. But it remains to be seen whether this is the most cost-effective or scalable solution for VR. The most natural testbed setup for exploring various forms of parallelism for VR can be provided by general-purpose systems. The distributed environment described above, based on a heterogeneous collection of general-purpose parallel machines, would provide us with truly unique capabilities in the domain of high-end parallel/distributed VR. We intend to develop VR support in MOVIE and to use it as the base infrastructure system for high-end VR at NPAC. We discuss MOVIE's role in the VR area in more detail in the next section.
VR poses a true challenge for the underlying software environment, usually referred to as the VR operating shell . Such a system must integrate real-time three-dimensional graphics, large-scale object-oriented modelling and database techniques, event-driven simulation techniques, and an overall dynamics based on multithreaded distributed techniques. The emerging VR operating shells, such as Trix at Autodesk, Inc., VEOS at HIT Lab, and Body Electric at VPL, Inc., share many design features with the MOVIE system. A multiserver network of multithreading interpreters of a high-level object-oriented language seems to be the optimal software technology in the VR domain.
We expect MOVIE to play an important role in the planned VR projects at Syracuse University, described in the previous section. The system is capable of providing both the overall infrastructure (VR operating shell) and the high-performance computational model for addressing new challenges in computational science, stimulated by VR interfaces. In particular, we intend to address research topics in biological vision on visual perception limits [ Farell:91a ], [ Verghese:92a ], in association with analogous constraints on VR technology; research topics in machine vision in association with high-performance support for the ``non-encumbered'' VR interfaces [ Krueger:91a ]; and neural network research topics in association with the tracking and real-time control problems emerging in VR environments [ Simoni:92b ].
From the software engineering perspective, MOVIE can be used both as the base MovieScript-based software development platform and as the integration environment that allows us to couple and synchronize the various external VR software packages involved in the planned projects.
Figure 17.19 illustrates the MOVIE-based high-performance VR system planned at NPAC and discussed in the previous section. The high-performance computing, high-quality three-dimensional graphics, and VR peripherals modules are mapped onto an appropriate set of MovieScript threads. The overall synchronization necessary, for example, to sustain a constant frame rate is accomplished in terms of the real-time component of the MovieScript scheduling model. The object-oriented interpreted multithreading language model of MovieScript provides the critical mix of functionalities necessary to cope efficiently with prototyping in such complex software and hardware environments.
Figure 17.19:
Planned High-End Virtual Reality Environment at NPAC. New
parallel systems: CM-5, nCUBE2 and DECmpp are connected by the fast HIPPI
network and integrated with distributed FDDI clusters, high-end graphics
machines, and VR peripherals by mapping all these components on
individual threads of the VR MOVIE server. Overall synchronization is
achieved by the real-time support within the MOVIE scheduling model.
Although the figure presents only one ``human in the loop,'' the model can
also support in a natural way the multiuser, shared virtual worlds with
remote access capabilities and with a variety of interaction patterns
among the participants.
The MOVIE model-based high-performance VR server at NPAC could be employed in a variety of visualization-intensive R&D projects. It could also provide a powerful shared VR environment, accessible from remote sites. The MovieScript-based communication protocol and remote server programmability within the MOVIE network assure satisfactory performance of shared distributed virtual worlds even over low-bandwidth communication media such as telephone lines.
From the MOVIE perspective, we see VR as an asymptotic goal in the GUI area, or the ``ultimate'' user interface. Rather than directly build a specific VR operating shell, which would be short-lived given the current state of the art in VR peripherals, we instead construct the VR support in a graded fashion, closely following existing and emerging standards. A natural strategy is to extend the present MovieScript GUI sector, based on Motif and three-dimensional servers, with some minimal VR operating shell support.
Two possible public domain standard candidates in this area to be evaluated are VEOS from HIT Lab and MR (Minimal Reality) from the University of Alberta. We also plan to experiment with the Presence toolkit from DEC and with the VR_Workbench system from SimGraphics, Inc.
In parallel with evaluating emerging standard candidates, we will also attempt to develop a custom MovieScript-based VR operating shell. Present VR packages typically split into a static CAD-style authoring system for building virtual worlds and a dynamic real-time simulation system for visiting these worlds. The general-purpose support for both components is already present in the current MovieScript design: an interpretive object-oriented model with strong graphics support for the authoring system, and a multithreading multiserver model for the simulation system.
A natural next step is to merge both components within the common language model of MovieScript so that new virtual worlds could also be designed in the dynamic immersive mode. The present graphics speed limitations do not allow us to visit worlds much more complex than Boxvilles of various flavors, but this will change in the coming years. Simple solids can be modelled in the conventional mouse-based CAD style, but with the growing complexity of the required shapes and surfaces, more advanced tools such as VR gloves become much more functional. This is illustrated in Figure 17.20, where we present a natural transition from the CAD-style to the VR-style modelling environment. Such VR-based authoring systems will dramatically accelerate the process of building virtual worlds in areas such as industrial or fashion design, animation, art, and entertainment. They will also play a crucial role in designing nonphysical spaces, for example, for hypermedia navigation through complex databases, where there are no established VR technologies and the novel immersion ideas can be created only by active, dynamic human participation in the interface design process.
Figure 17.20:
Examples of the Glove-Based VR Interfaces for CAD and Art
Applications. The upper figure illustrates the planned tool for
interactive sculpturing or some complex, irregular CAD tasks. A set of
``chisels'' will be provided, starting from the simplest ``cutting plane''
tool to support the glove-controlled polygonal geometry modelling. The
lower figure illustrates a more advanced interface for the
glove-controlled surface modelling. Given the sufficient resolution of
the polygonal surface representation and the HPC support, one can
generate the illusion of smooth, plastic deformations for various materials.
Typical applications of such tools include fashion design, industrial
(e.g., automotive) design, and authoring systems for animation. The
ultimate goal in this direction is a virtual world environment for
creating new virtual worlds.
In this chapter, we discuss some large-scale applications involving a mixture of several computational tasks. The ISIS system described in Section 18.2 grew out of several smaller C³P projects undertaken by Rob Clayton and Brad Hager in Caltech's Geophysics Department. These are described in [Clayton:87a;88a], [ Gurnis:88a ], [Lyzenga:85a;88a], [ Raefsky:88b ]. The geophysics applications in C³P covered a broad range of topics and time scales. At the longest time scales, Hager's group used finite-element methods to study thermal convection processes in the Earth's mantle to understand the dynamics of plate tectonics. A similar algorithm was used to study the processes involved in earthquakes and crustal deformation over periods of 10 to 100 years. On a shorter time scale, Clayton simulated acoustic waves, such as those generated by an earth tremor in the Los Angeles basin. The algorithm was a finite-difference method using a high-order approximation. This (synchronous) regular grid was implemented using vector operations as the basic primitive, so that Clayton could easily use both Cray and hypercube machines. This strategy is a forerunner of the ideas embodied in the data-parallel High Performance Fortran of Chapter 13. Tanimoto developed a third type of geophysics application, with the MIMD hypercube decomposed as a pipeline to calculate the different resonating eigenmodes of the Earth, stimulated after an earthquake.
Sections 18.3 and 18.4 discuss a major series of simulations that were developed under U. S. Air Force sponsorship at the Jet Propulsion Laboratory in collaboration with Caltech. The application is caricatured in Figure 3.11(b), and Section 18.3 describes the overall architecture of the simulation. The major module was a sophisticated parallel Kalman filter, which is described in Section 18.4. Other complex applications developed by C³P included the use of the Mark IIIfp at JPL in an image-processing system that was used in real time to analyze images sent down by the space probe Voyager as it sped past Neptune. One picture produced by the hypercube at this encounter is shown in Figure 18.1 (Color Plate) [Groom:88a], [Lee:88a;89b]. Another major data analysis project in C³P involved using the 512-node nCUBE-1 to look at radio astronomy data to uncover the signature of pulsars. As indicated in Table 14.3, this system ``discovered'' more pulsars in 1989 than the original analysis software running on a large IBM-3090. This measure (pulsars located per unit time) seems more appropriate than megaflops for this application. A key algorithm used in the signal processing was a large, fast Fourier transform that was hand-coded for the nCUBE. This project also used the concurrent I/O subsystem on the nCUBE-1 and motivated our initial software work in this area, which has continued with software support from ParaSoft Corporation for the Intel Delta machine at Caltech. Figure 18.2 (Color Plate) illustrates results from this project; further details will be found in [Anderson:89c;89d;89e;90a], [Gorham:88a;88d;89a].
Figure 18.1:
Neptune, taken by Voyager 2 in 1989 and
processed by Mark IIIfp.
Figure 18.2a:
Apparent pulse period of a binary pulsar
in the globular cluster M15. The approximately eight-hour period (one of the
shortest known) corresponds to high radial velocities that are 0.1% of the
speed of light. This pulsar was discovered from analysis of radio astronomy
data in 1989 by the 512-node nCUBE-1 at Caltech.
Figure 18.2b:
Five pulsars in globular cluster M15.
These were discovered or confirmed (M15 A) by analysis on the nCUBE-1
[Anderson:89d], [Fox:89i;89y;90o], [Gorham:88a].
Another interesting signal-processing application by the same group was the use of high-performance computing in the removal of atmospheric disturbance from astronomical images, as illustrated by Figure 18.3. This combines a parallel multidimensional Fourier transform of the bispectrum with conjugate-gradient minimization [Gorham:88d] to reconstruct the phase. Turbulence, as shown in Figure 18.3(a), broadens images, but one can exploit the approximate constancy of the turbulence in atmospheric patches over a 10 to 100 millisecond period. The Mount Palomar telescope is used as an interferometer by dividing it spatially into one thousand ``separate'' small telescopes. Then standard astronomical interferometry techniques based on the bispectrum can be used to remove the turbulence effects, as shown in Figure 18.3(b), where one has increased statistics by averaging over 6,000 time slices [Fox:89i;89n;89y;90o].
Figure 18.3:
Optical Binary Star Before (a) and After (b) Atmospheric
Turbulence Removed. (a) Raw data from a six-second exposure of BS5747
(Corona Borealis) with a diameter of about 1 arcsecond. (b)
The reconstructed image on the nCUBE-1 on the same scale as (a) using
an average over 6,000 frames, each of which lasted 100 milliseconds.
Each figure is magnified by a factor of 1000 over the physical image at
the
Palomar telescope focus.
An important feature of these applications is that they are built up from a set of modules, as exemplified in Figures 3.10, 3.11, 15.1, and 15.2. They fall into the compound problem class defined in Section 3.6. We had originally (back in 1989, during a survey summarized in Section 14.1) classified such metaproblems as asynchronous. We now realize that metaproblems have a hierarchical structure-they are an asynchronous collection of modules. However, this asynchronous structure does not lead to the parallelization difficulties illustrated by the applications of Chapter 14. Thus, the ``massive'' parallelism does not come from the difficult synchronization of the asynchronously linked modules but rather from internal parallelization of the modules, which are individually synchronous (as, for example, with the FFT mentioned above) or loosely synchronous (as in the Kalman filter of Section 18.4). One can combine data parallelism inside each module with the functional asynchronous parallelism by executing each module concurrently. For example, in the Sim87, Sim88, and Sim89 simulations of Section 18.3, we implemented this with an effective but crude method. We divided the target machine-a 32-node to 128-node Mark IIIfp hypercube-into ``subcubes''-that is, the machine was statically partitioned with each module in Figure 3.11(b) assigned to a separate partition. Inside each partition, we used a fast optimized implementation of CrOS, while the parallelism between partitions was implemented by a variation of the Time Warp mechanism discussed briefly in Sections 15.3 and 18.3. In the following subsections, we discuss these software issues more generally.
Table 18.1:
Summary of Problem Classes
This is the last chapter on our voyage through the space of problem classes. Thus, we will use this opportunity to wrap up some general issues. We will, in particular, summarize the hardware and software architectures that are suitable for the different problem classes reviewed in Table 18.1. We will first sharpen the distinction between loosely synchronous and asynchronous problems. Let us compare:
Loosely Synchronous : Solution of partial differential equation with an unstructured mesh, as in Figure 12.8 (Color Plate).
Asynchronous : Communication linkage between satellites in three dimensions, as in Figure 3.11 (b).
Loosely Synchronous : Fast multipole approach to N-body problem, as in Figure 12.11 .
Asynchronous : Alpha-beta-pruned game tree coming from computer chess, as in Figure 14.4.
These examples show that asynchronous and loosely synchronous problems are represented by similar underlying irregular graphs. What are the differences? Notice that we can always treat a loosely synchronous problem as asynchronous and indeed many approaches do this. One just needs to ignore the macroscopic algorithmic synchronization present in loosely synchronous problems. When is this a good idea? One would treat loosely synchronous problems as asynchronous when:
Thus, we see that loosely synchronous problems have an irregular underlying graph, but the underlying macroscopic synchronicity allows either the user or compiler to achieve higher performance. This is an opportunity (to achieve better performance), but also a challenge (it is not easy to exploit). Typically, asynchronous problems-or at least asynchronous problems that will get reasonable parallelism-have as much irregularity as loosely synchronous problems. However, they have larger grain size and lower communication-to-calculation ratios (Equation 3.10), so that one can obtain good performance without the loosely synchronous constraint. For instance, the chess tree of Figure 14.4 is more irregular and dynamic than the Barnes-Hut tree of Figure 12.11. However, the leaves of the Barnes-Hut tree are more tightly coupled than those of the chess tree. In Figure 3.11(b), the satellites represent much larger grain size (and hence lower values of the ratio in Equation 3.10) than the small (in computational load) finite-element nodal points in Figure 12.8 (Color Plate).
As illustrated in Figure 18.4 , one must implement asynchronous levels of a problem with asynchronous software paradigms and execute on a MIMD machine. Synchronous and perhaps loosely synchronous components can be implemented with synchronous software paradigms and executed with good performance on SIMD architectures; however, one may always choose to use a more flexible software model and if necessary a more flexible hardware architecture. As we have seen, MIMD architectures support both asynchronous and the more restricted loosely synchronous class; SIMD machines support synchronous and perhaps some loosely synchronous problems. These issues are summarized in Tables 18.2 and 18.3 .
Figure 18.4:
Mapping of Asynchronous, Loosely Synchronous, and Synchronous
Levels or Components of Machine, Software and Problem. Each is
pictured hierarchically with the asynchronous level at the top and
synchronous components at the lowest level. Any one of the components may
be absent.
The approaches of Sections 12.4 and 12.8 exemplify the different choices that are available. In Section 12.8, Edelsohn uses an asynchronous system to control the high level of the tree, with the lower levels implemented loosely synchronously for the particle dynamics and the multigrid differential equation solver. Warren and Salmon use a loosely synchronous system at each level. Software support for such structured adaptive problems is discussed in [Choudhary:92d] as part of the plans to add support for properly loosely synchronous problems to Fortran D and High Performance Fortran (HPF).
Table 18.2:
What is the ``correct'' machine architecture for each problem
class?
In a traditional Fortran or HPF compiler, the unit of computation is the program or perhaps subroutine. Each Fortran statement (block) is executed sequentially (possibly with parallelism implied internally to statement (block) as in HPF), with synchronization at the end of each block. However, one could choose a smaller unit with loosely synchronous implementation of blocks and an overall asynchronous system for the statements (blocks). We are currently using this latter strategy for an HPF interpreter based on the MOVIE technology of Chapter 17 . This again illustrates that in a hierarchical problem, one has many choices at the higher level (coarse grain) of the problem. The parallel C++ system Mentat developed by Grimshaw [ Grimshaw:93b ] uses similar ideas.
Table 18.3:
Candidate Software Paradigms for each problem architecture.
We have already described how the application of Section 18.3 illustrates a compound or metaproblem. The software support is that of an adaptive asynchronous high-level system controlling data-parallel (synchronous or loosely synchronous) modules. Perhaps the best-developed system of this type is AVS, which was originally developed for visualization but can be used to control computational modules as well [Cheng:92a]. Examples of such use of AVS are [Mills:92a;92b] for financial modelling, [Cheng:93a] for electromagnetic simulation, and the NPSS system at NASA Lewis [Claus:92a] for multidisciplinary optimization, as in Figures 3.11(a), 15.1, and 15.2. As summarized in Table 18.3, MOVIE, described in Chapter 17, was designed precisely for such metaproblems, with the original target problem that of the many linked modules needed in large-scale image processing. Linda [Gelertner:89a] and its extension Trellis [Factor:90a], [Factor:90b] form an attractive commercial system that has been used for data fusion applications falling into this problem class. PCN [Chandy:90a] and its extensions CC++ [Chandy:92a] and Fortran-M [Foster:92a] were first implemented as reactive (asynchronous) software systems. However, it is planned to extend them to support the data-parallel modules needed for metaproblems.
The simulation systems of Sections 15.3 and 18.3 illustrate that one may need special functionality (in the cited cases, the support of event-driven simulation) in the high-level asynchronous component of the software system.
Clearly, this area is still poorly understood, as we have little experience. However, we expect such metaproblems to be the norm and not the exception as we tackle the large, complex problems needed in industry (Chapter 19 ).
The design goals and a prototype multicomputer implementation of an Interactive Seismic Imaging System (ISIS) are presented. The purpose of this system is to change the manner in which images of the subsurface are developed, by allowing the geologist/analyst to interactively observe the effects of changing focusing parameters, such as velocity. This technique leads to improved images and, perhaps more importantly, to an understanding of their accuracy.
ISIS is a multicomputer-based interactive system for the imaging of seismic reflection data. In the sense used here, interactive means that when the seismic analyst makes an adjustment to an imaging parameter, the displayed image is updated rapidly enough to create a feedback loop between the analyst and the imaging machine. This interactive responsiveness allows a much greater use of the analyst's talents and training than conventional seismic processing systems do. To carry out the interactive imaging, we introduce a set of conceptual tools for the analyst to exploit, and also suggest a new role for the structural geologist-who is usually charged with interpreting a seismic image-as geologist/analyst.
The task of the seismic analyst is twofold: to select the proper imaging steps for a given data set, and the proper imaging parameters to produce an accurate image of the subsurface. The ideal imaging system would allow the analyst to inspect every datum for the effects of parameter selections and to adjust those parameters for best results. In conventional practice, such a task would be extremely cumbersome, requiring the generation of hundreds of plots and dozens of passes through the data. ISIS, however, provides an efficient mechanism for accomplishing this task. The system keeps the entire data volume on-line and randomly accessible; thus, any gather may be assembled and displayed on the monitor very rapidly. A sequential series of gathers may be displayed at a rate of several per second, a feature we refer to as a movie . Movies provide an opportunity to inspect and edit the data and to adjust the imaging parameters on the data groupings that most naturally display the effects of those parameters. For example, a movie of shot gathers enables the analyst to quickly identify bad shots or to inspect the accuracy of the ground roll mutes. A movie of the midpoint gathers allows for the inspection of the normal moveout correction and the stretch mutes. A movie of receiver gathers permits the analyst to detect problematic surface conditions, and a movie of constant-offset gathers allows the analyst to study various offset-dependent characteristics. In this way, the analyst may inspect the entire data volume in various groupings in a few minutes and may stop at any point to interactively adjust the imaging parameters.
Some parameters have effects that manifest themselves more clearly in the composite image than they do in raw data gathers. For instance, the effects of the migration velocity are only apparent in the migrated image. Ideally, the analyst would adjust the imaging parameters and immediately see the effect on the image. We refer to this ability as interactive focusing , an analogy to a photographer focusing a camera while viewing an image through the viewfinder. A typical focusing technique is to alternate an image back and forth between under-focus and over-focus in diminishing steps until the point of optimal focus is reached. Any seismic analyst can easily recognize an over- or under-migrated image, but the ability to smoothly pass from one to the other allows for the fine-tuning of the velocity model. This process also allows the analyst to test the robustness of the reflectors and their orientation in the image. Other parameters, such as those used in deconvolution, for instance, may also be tuned interactively.
Another task of the analyst is to diagnose problems in the seismic image and take corrective action. An image may be contaminated by a variety of artifacts; it is important to eliminate them if possible, or identify them if not. To aid the analyst in this task, ISIS provides a feature called image deconstruction . Consider an analyst studying a stacked section. Image deconstruction allows the analyst to point the cursor to a feature on the image and call up the midpoint gather(s) that produced it. In the same way the analyst may display any of the shot or receiver gathers that provided traces to the midpoint gather(s). At this point, the analyst may use movies of the gathers to study the features of interest. By tying the image points back to the raw data through image deconstruction, the analyst has an additional tool for distinguishing true reflectors from processing artifacts.
In traditional practice, a seismic analyst will produce an image that is then sent to a structural geologist for interpretation and the construction of a geologic cross-section. A problem with that practice is that the features that are important to the geologist-relationships between geologic beds, the dip on structures, the thickness of beds, and so on-may have been given little consideration by the analyst. Similarly, the analyst may have little information as to the geologic constraints of the region being imaged-information that would aid in producing a more accurate image. The ISIS project was originally conceived in an attempt to blend the roles of analyst and geologist. In the role of geologist the user can make use of the interactive imaging facilities to gain useful information about the character and robustness of the imaged structure. A structural geologist provided with an interactive processing system can develop a much more thorough, dynamic understanding of the image than would ever be possible through the examination of a static section produced through some unknown processing sequence. In the role of analyst the user may apply geologic constraints and intuition to improve the imaging process. While we will continue to refer to the seismic analyst throughout this paper, we believe that through the use of interactive imaging the distinction between geologist and analyst will disappear.
As mentioned above, the principal task of the structural geologist is to interpret a seismic section and produce a geologic cross-section of a prospect. The act of making this interpretation also implicitly creates a seismic model. It should therefore be possible to use this as the input model in the imaging process. An image produced in this way should be very similar to the image from which the interpretation was originally made; if not, there is reason to suspect the accuracy of the model. A future addition to ISIS will allow the geologist to make interpretations as the imaging system honors the interpretation in recomputing the image. This process will further break down the barrier between imaging and interpretation, and between geologist and analyst.
It would be difficult to conceive of a fully automated system to process seismic data. The enormous complexity of geologic structures and the recorded data make such a system an unlikely near-term development. Similarly, it is difficult to imagine a generalized inversion formula for seismic reflection data, since the trade-off between the reflectivity and velocity structures of the subsurface is generally not completely constrained by the data. Now and for the foreseeable future, the expertise of a human analyst will be required for the accurate imaging of seismic data. This fact does not mean that the role of the machine will be minimized, however, as advances in imaging technique have more than kept pace with advances in hardware capability. But, until recently, the batch-processing paradigm in seismic imaging has been the only option. Currently, the seismic analyst uses his or her extensive experience and training only when the latest plot is generated by the processing software. ISIS is an attempt to change that paradigm by allowing for a much greater utilization of the analyst's abilities.
Other advantages of interactive imaging include the ability to process a seismic survey in just one or two days, and the generation of a self-documenting history of the imaging sequence (with the ability to return to any stage of the processing). ISIS should be an excellent educational tool, not only by providing students the ability to interact with data and imaging parameters, but also because it is programmable, providing a good platform for experimental algorithms.
ISIS consists of four main parts (Figure 18.5 ): a parallel, on-line seismic trace manager, a high-performance parallel computational engine, a parallel graphics display manager, and a window-based, interactive user interface. The data from a seismic survey are stored across an array of trace manager processes. These processes are responsible for providing seismic traces to the computational processes at transfer rates sufficiently high to keep the computational processes from being idle. The computational processes generate an image and deliver it to the display manager for display on a monitor. The user triggers this processing sequence by adjusting an imaging parameter. The system is designed to minimize the delay between the user's action and the refreshing of the image-if the delay is short enough, the imaging will be truly interactive.
Figure 18.5:
Imaging Tasks. The four principal divisions of the ISIS system are
shown. The dotted lines represent software layers that insulate the
computational processes from the other functions.
ISIS was designed to be a flexible, programmable imaging system. As described here, ISIS is actually two systems. The first is a set of system-level programs accessible through simple library interfaces. This software was designed to conceal implementation-specific details from the applications programmer. The trace manager and the display manager are part of the system-level software. The second level of ISIS, the applications level, is built upon the first. The user interface and seismic processing functions are part of the applications level. The system software was designed to minimize the effort needed to develop custom user interface and processing functions. We have developed both levels; the ISIS presented here is a processing system built upon the system software.
The advantages of the division between system and applications software are numerous: 1) the system is customizable, allowing for the addition of new imaging techniques or user interface technology; 2) the system will be portable with minimal effort at the applications end-for instance, the application interface to the trace manager would be the same regardless of whether the platform was a message-passing multicomputer, a shared-memory multicomputer, a network of workstations, or a single uniprocessor machine; and 3) the parallelism of the trace manager and display manager is concealed from the applications programmer, greatly simplifying the programming effort.
To provide the interactive imaging capabilities discussed above, the imaging hardware must provide certain minimum levels of performance. Figure 18.5 schematically represents data flowing from the trace manager to the computational processes and the image generated there being sent to the display manager for eventual display on the monitor. To perform the interactive focusing discussed above, the computational engine must deliver on the order of 200 MFLOPS or more. While this number may be difficult and expensive to obtain in a single-processor system, it is easily obtainable in parallel systems. Likewise, in order to satisfy the demands of interactive stacking, the trace manager is required to deliver many thousands of traces per second (approximately 4 to 8 Kbytes/trace) to the computational processes. Since these traces are essentially randomly ordered throughout a multigigabyte data volume, a simple calculation will show that, for current disk drive technology, the limiting factor in supplying the data is the disk seek time, not the aggregate transfer rate. Again, the solution is to have a number of disks working in parallel to provide the needed performance. Finally, in order to display movies of seismic gathers at eight frames per second, the graphics processors must be able to absorb and display eight megabytes of data per second. Once again, this requirement may be satisfied by multiple nodes working in parallel.
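To make the ``simple calculation'' concrete, the following back-of-the-envelope sketch (in Python, with drive parameters that are our own illustrative assumptions rather than measured ISIS figures) compares the seek-dominated cost of a random trace access with its transfer cost and estimates how many drives must work in parallel to reach a given delivery rate.

trace_bytes   = 6 * 1024      # a trace of roughly 4 to 8 Kbytes (assumed midpoint)
seek_ms       = 15.0          # assumed average seek plus rotational latency
transfer_MB_s = 2.0           # assumed sustained transfer rate of one drive
target_traces = 5000          # assumed "many thousands of traces per second"

transfer_ms  = trace_bytes / (transfer_MB_s * 1.0e6) * 1.0e3
per_trace_ms = seek_ms + transfer_ms          # seek time dominates the total
traces_per_drive = 1000.0 / per_trace_ms

print("per-trace cost: %.1f ms (seek %.1f ms, transfer %.1f ms)"
      % (per_trace_ms, seek_ms, transfer_ms))
print("traces/second from one drive: %.0f" % traces_per_drive)
print("drives needed for %d traces/second: %.0f"
      % (target_traces, target_traces / traces_per_drive))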
In addition to the general performance issues, which could be achieved by the creation of a specially built machine, or the addition of custom I/O devices to an existing supercomputer, we chose to use only commercially available hardware. The reasons for this choice are twofold: We wanted other interested researchers to be able to duplicate our efforts, and we wanted the system to be reasonably affordable.
From the point of view of the applications programmer, the trace manager consists of two principal functions: the first, datarequest, is a request for the trace manager to deliver certain data to the requesting process (e.g., a shot gather); the other function, getdata, is called repeatedly after a call to datarequest, each call returning a single trace until no traces remain and the request is satisfied. Because of the simplicity of this interface, the applications programmer need know nothing of the implementation details of the trace manager.
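As an illustration only, the calling pattern just described might look as follows; the trace_mgr object and its keyword arguments are assumed Python bindings, not the actual ISIS library signatures.

def fetch_shot_gather(trace_mgr, shot_id):
    """Assemble one shot gather using the two-call trace-manager interface."""
    trace_mgr.datarequest(gather="shot", key=shot_id)   # ask for the gather
    traces = []
    while True:
        trace = trace_mgr.getdata()    # one trace per call
        if trace is None:              # request satisfied; no traces remain
            return traces
        traces.append(trace)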
Each instance of the trace manager consists of an archive containing some portion of the data from a seismic survey, and a list containing information on the archived traces. In this implementation of ISIS, the archive takes the form of magnetic disk drives, but in other implementations the data may be stored or staged in process memory. A single copy of the data from a seismic survey is spread evenly among all the instances of the trace manager process.
When a computational process calls datarequest , each instance of the trace manager searches its trace list and generates a secondary list of traces that satisfy the request. Because there may be multiple computational processes, there may be several active request lists-at most, one for each computational process. The trace manager retrieves the listed traces from the archive and prepares them for delivery to the requesting process. Before delivering the traces, the trace manager may, at the behest of the requesting process, perform some simple object-oriented preprocessing, such as applying statics, mutes, and NMO.
The display manager, like the trace manager, is designed to conceal implementation details from the applications programmer. It consists of two calls: one for delivering a trace to the display manager for plotting, and another to inform the display manager that the image is complete. Each instance of the display manager buffers images until a signal from the user interface notifies it to copy or assemble an image in video memory and display it. This interface with the application allows the user to have complete control over what is displayed and the movie display rate.
At the system level, no hardware or software specification of the user interface is made; it is left to the applications designer to select an appropriate interface. The necessary communication between the user interface and the computational processes is accomplished through a system-level parameter database. The database manager maintains multiple user-defined databases and stores information in key/content pairs. When an imaging parameter is selected or modified, the user interface packs the parameter into a byte stream and stores it in a database (Figure 18.6 ).
Figure 18.6:
The User Interface/Database
The database manager then generates database events which are sent, along with the data, to the computational processes where they are dealt with as discussed in the next section.
This event-driven mechanism has several advantages over other means of managing control flow. It allows the system-level software to provide the communication between the user interface and the computational processes without the system needing any knowledge of the content of the messages, and without the user knowing the communications scheme. The data is packed and unpacked from user-defined structures only within the user-provided processes. Our implementation of ISIS uses Sun's XDR routines for packaging the data, which has the added advantage that it also resolves the byte-ordering differences between the host machine and the computational processors.
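A minimal sketch of this parameter path is given below; Python's struct module stands in for Sun's XDR encoding, and the database and event names are illustrative assumptions rather than the ISIS interfaces. The point is simply that the parameter crosses the system as an opaque, byte-order-safe byte stream and is unpacked only inside user-provided code.

import struct

def store_parameter(database, key, values):
    """User-interface side: pack a list of floats and file a database event."""
    packed = struct.pack(">%df" % len(values), *values)   # big-endian, XDR-like
    database[key] = packed
    return ("db_event", key)           # event forwarded to the compute processes

def load_parameter(database, key):
    """Computational-process side: unpack the parameter after the event arrives."""
    packed = database[key]
    return list(struct.unpack(">%df" % (len(packed) // 4), packed))

db = {}
event = store_parameter(db, "velocity", [1500.0, 1800.0, 2200.0])
print(event, load_parameter(db, "velocity"))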
The computational process consists of two parts: the system-level framework, and the user-defined, application-level processing functions. The user-defined functions perform the seismic imaging tasks and are free to pursue that end by whatever means the applications programmer finds appropriate, as long as the functions return normally and in a timely fashion. Figure 18.7 a is a schematic example of a typical user-defined function. The user function, when called, first retrieves any relevant parameters from the database. These parameters may be processing parameters, such as the velocity model, or they may be special information, such as a specification of the data to be processed. After retrieving the parameters, the function requests the appropriate data from the trace manager through a call to datarequest. It then loops over calls to getdata, performs computations, and executes the appropriate calls to plot the processed traces. The function may loop over several data requests if multiple gathers are needed, for instance, to build a stacked section. The function notifies the display manager when the image is complete, and the user function returns to the calling process.
Figure 18.7:
An Instance of a Computational Process: (a) a User-Defined
Function; (b) the Controlling Process with Several User Functions in Place.
The bold arrow running from the notifier to the user function ``Func 1''
indicates the currently active function.
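A hedged sketch of such a user-defined function follows. The db, trace_mgr, and display objects and their methods are assumed Python bindings standing in for the database, trace-manager, and display-manager libraries; the processing itself is reduced to placeholders.

def apply_nmo(trace, velocity):      # placeholder for the normal-moveout correction
    return trace

def apply_statics(trace, statics):   # placeholder for the statics correction
    return trace

def stacked_section(db, trace_mgr, display):
    """Build a stacked section: one stacked trace per midpoint."""
    velocity = db.get("velocity")            # imaging parameters from the database
    statics = db.get("statics")
    for midpoint in db.get("midpoint_range"):
        trace_mgr.datarequest(gather="midpoint", key=midpoint)
        stacked = None
        while True:
            trace = trace_mgr.getdata()      # one raw trace per call
            if trace is None:                # request satisfied
                break
            trace = apply_statics(apply_nmo(trace, velocity), statics)
            stacked = trace if stacked is None else stacked + trace
        display.plot_trace(stacked)          # deliver the stacked trace
    display.image_complete()                 # notify the display manager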
While the parallelism of the computational process cannot be hidden from the applications programmer, the programming task is made much simpler by concealing the implementation details of the trace manager, display manager, and database. To help facilitate parallelization, each instance of a computational process is provided with the total number of computational processes, as well as its own logical position in that number. With this information, most seismic imaging tasks can be easily parallelized by either data or domain decomposition.
The system-level software for the computational processes (Figure 18.7 b) consists of a main notifier loop that handles the database events and distributes control to the user-defined functions. The programmer is provided with functions for registering the processing functions with the notifier, along with the databases of interest to those functions. For instance, a function to plot shot gathers may depend on the statics database only, while a function to produce a stacked section may depend on the velocity database and the statics database. The applications programmer is also provided with an interface for selecting the active function. No more than one function may be active at any given time. Incoming database events are consumed by the notifier, the data are stored in the local database, and the notifier will call the active function if and only if that function has registered an interest in that particular database. This interface simplifies adding a new processing function or parameter to the existing system.
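The registration and dispatch logic just described can be sketched as follows; the class and method names are our own and stand in for the actual system-level interface.

class Notifier:
    """Minimal sketch of the notifier loop of Figure 18.7(b)."""
    def __init__(self):
        self.interests = {}          # user function -> set of database names
        self.active = None           # at most one active function at a time

    def register(self, func, databases):
        self.interests[func] = set(databases)

    def set_active(self, func):
        self.active = func

    def handle(self, db_name, packed, local_db):
        local_db[db_name] = packed   # always store the incoming data locally
        if self.active is not None and db_name in self.interests[self.active]:
            self.active(local_db)    # call the active function only if interested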
Figure 18.8:
Layout of Processes on the Meiko Multicomputer. Each box enclosing
a letter represents a node: trace manager processes are marked ``T,''
computational processes ``C,'' and display manager processes ``D.'' ``H'' is
the system host board, and ``W/S'' represents the Sun workstation, where the
user interface resides. Each trace manager has access to two disk drives
(small circles), and two processors also have 8mm tape drives. The lines
between nodes represent communications channels.
ISIS is implemented on a multicomputer manufactured by Meiko Scientific Ltd. Figure 18.8 is a schematic representation of the prototype ISIS hardware. The trace manager is implemented on eight nodes, each with an Inmos T800 processor and a SCSI controller responsible for two 1.2-gigabyte disk drives. Two of the trace manager nodes also control 8mm tape drives used for initial loading of data into the system. The computational processes reside on eight nodes with Intel i860 processors. The display manager is mapped onto two T800 nodes with video RAM and an RGB analog output that drives a color monitor. The user interface resides on a Sun SPARCstation that serves as the host machine for the Meiko system.
It should be noted that, because the i860 is nearly an order of magnitude faster than the T800, many of the functions of the trace manager and the display manager are actually performed on the computational nodes, but this detail is completely hidden from the applications programmer.
We consider this system a prototype. A simple evaluation of the capabilities of the hardware will show that it cannot provide the performance described in Section 18.2.6 . While this machine has proven to be extremely useful, a complete system would be two to four times the size. The system software is designed to be scalable, as is the hardware. In fact, if the size of the machine were doubled, the ISIS software would run as is, without requiring recompilation.
Because of the recent and ongoing advances in computer technology, interactive seismic imaging will become an increasingly powerful and affordable tool. It is only within the last two years that machines with all the capabilities necessary to perform interactive seismic imaging have been commercially available. In another ten years, machines with all the necessary capabilities will be no larger than a workstation and will be affordable even within the budget of the smallest departments. Because of the availability of these machines, interactive imaging will certainly replace the traditional methods. It is our hope that the success of the ISIS project will continue the trend toward true interactive imaging, and provide a model for systems of the future.
We have introduced several concepts that we believe will be important to any future systems: movies, interactive focusing, and image deconstruction. These tools provide the means for the analyst to interactively image seismic data. We also introduce the idea of geologist-as-analyst to extend the range of the imaging machine into the interpretation of the image, and to allow the geologist a better understanding of the image itself.
The design of ISIS concentrated on providing the building blocks of an interactive imaging system, and on the implementation of a prototype system. The imaging task is divided into four main parts: trace manager, display manager, computational engine, and user interface. Each part is implemented in a way that makes it scalable on multiprocessor systems, but conceals the implementation details from the applications programmer. Interfaces to the different parts are designed for simplicity and portability.
Increasingly, computer simulation is directed at predicting the behavior and performance of large manmade systems that themselves include multiple instances of imbedded digital computers. Often the computers are distributed, sometimes over wide geographic distances, and the system modelling becomes largely a combination computer/communication simulation. The type of simulation needed here can be characterized as having some elements that are simulated in a conventional sense where a statistical or descriptive model of the element is used. But other elements, particularly the imbedded computers, are emulated , which is to say that the computations performed nearly duplicate the functionality of that real-world element. For example, a ``tracker'' really tracks. It does not simply provide results that are in conformance with how a tracker should track.
In this manner, the simulator becomes both a predictor of system performance and an active participant in the system development as the behavior of the emulated elements is refined and evolved.
In 1987, the Mark III Hypercube Applications group at JPL undertook the most computationally demanding simulation of this type yet proposed: the detailed simulation of the global Strategic Defense Initiative System-sometimes known as Star Wars [Meier:89a;90a], [ Warren:88a ], [ Zimmerman:89a ].
A parallel processor was chosen to perform the simulation both because of its ability to deliver the computational power required and because it was closely reflective of the class of machines that might be used for the imbedded computers in the SDI System-that is, the simulation was helping to prove the applicability of parallel processing for complex real-time system applications. The Mark IIIfp Hypercube was the host machine of choice at the time (1987-1989).
The basic structure of the simulation-first called Simulation87-is shown in Figure 18.9. Here, an otherwise monolithic hypercube is subdivided into subcubes, each containing a data-parallel subapplication of typically synchronous character. Shown in this early and greatly simplified view are the principal modules: the Environment Generator, the Trackers, and the Battle Planner.
Figure 18.9:
Basic Simulation87 Structure
The details of each module are not important for our discussions here. What is pertinent is that each involves a substantial computation that runs on a parallel machine using standard data-parallel techniques. The intermodule communications take place over the normal hypercube communications channels in a (rather low-fidelity) emulation of the communications necessary in the real-world system. The execution of the simulation as a whole can then exploit two classes of parallelism: the multiple modules or functions execute concurrently and each function is itself a data-parallel process. Load balancing is done on a coarse level as shown by the size of the subcube allocations to each function. Emphasis was also placed on communicating information to graphics workstations that helped visualize and interpret the progress of the simulation.
This is a rather general structure for an emerging class of simulation that seeks to model large-scale system performance and employ elements both of pure computer simulation and this relatively unique element of emulation.
The most productive and efficient run-time environment interior to each subcube was that of CrOS-described above in Chapter 5-since the applications typically hosted were of the synchronous type. But the intermodule communications needed were definitely asynchronous. For them, the communications primitives needed were those that had already been developed in the Mercury OS [Burns:88a], similar to those described in Chapter 16. The system-level view was one of needing ``loosely coupled sets of tightly coupled multiprocessors.'' That is, a single node needed to be tightly coupled, using CrOS, to nearest neighbors in its local subcube, yet loosely coupled, using Mercury or another asynchronous protocol, to other subcubes working on other tasks. Of course, it would have been possible to use Mercury for communications of both types, but on the Mark III level of hardware implementation, the performance penalty for using the asynchronous protocol where a synchronous protocol would suffice was a factor of nearly five.
The CrOS latency for nearest-neighbor messaging on the Mark III was several times lower than Mercury's-both adequate figures for the 2-Mip 68020 data processors used on the Mark III, but often strained by the Weitek 16-MFLOPS floating-point accelerator. Messaging latency is still a key problem, even on the most recent machines. For example, on the Delta machine, NX and Express deliver a nearest-neighbor message with about the same latency as each other, but must now support a 120-Mip, 60-MFLOPS data processor.
To meet these hybrid needs and preserve maximum performance, a dual-protocol messaging system-called Centaur for its evocation of duality-was developed [ Lee:89a ]. To implement the disparate protocols involved-Mercury is interrupt driven whereas CrOS uses polled communications-it was determined that all messages would initially be assumed to be asynchronous and first handled as Mercury messages. A message that was actually synchronous contained a signal to that effect in its first packet header. Upon reading this signal, Mercury would disable interrupts and yield to the much faster CrOS machinery for the duration of the CrOS message. This scheme yielded synchronous and asynchronous performance only about 30% degraded from their counterparts in a nonmixed context.
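The dual-protocol dispatch can be caricatured as below; the handler objects, the header flag, and their names are illustrative assumptions, not the actual Centaur implementation.

SYNC_FLAG = 0x1      # assumed header bit marking a synchronous (CrOS-style) message

def on_first_packet(header, channel, mercury_handler, cros_handler, interrupts):
    """Every message starts on the interrupt-driven path; a flagged header
    hands the rest of the transfer to the fast polled path."""
    if header & SYNC_FLAG:
        interrupts.disable()             # stop Mercury from intervening
        try:
            cros_handler(channel)        # polled, CrOS-style transfer
        finally:
            interrupts.enable()
    else:
        mercury_handler(channel)         # interrupt-driven, Mercury-style transfer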
Three separate versions of the SDI simulations were constructed: Sim87, Sim88, and Sim89. Each was more elaborate and used more capable hardware than its predecessor. Sim87, for example, executed on a single 32-node Mark III; Sim89, in contrast, was implemented to run on the 128-node Mark IIIfp. Configuration was flexible; Figure 18.10 shows a typical example using two hypercubes and a network of Ethernet-connected workstations. The internal structure of the simulation showed similar evolution. Sim87 was not much more elaborate than indicated in Figure 18.9, but it evolved to the much more capable version shown in Figure 18.11 for Sim89 [Meier:90a], [Yeung:90a].
Figure 18.10:
Typical Sim89 Hardware Configuration
Features of Sim89 included more elaborate individual modules, outlined below.
Figure 18.11:
An Evolved SDI Simulation, Sim89
Figure 18.12:
A complex strategic defense situation
graphically summarized.
The Command Center was an important conceptual step. It repositioned the role of the workstations from one of passive display of the activities occurring internally to the hypercube to that of the key user interface into a network computing environment assisted by large-scale parallel machines. It, in effect, helped us merge our own mental picture of the paradigm of network computing with that of parallel processing into the more unifying view of cooperative, high-performance computing.
The original structure of multiple data-parallel function emulations executing concurrently was left intact by this evolution. The supporting services and means of synchronizing the various activities have evolved considerably, however.
The execution of the simulation as shown in Figure 18.9 is synchronized rather simply. Refer now to Figure 18.13 . By the structure of the desired activities, sensor data are sent to the Trackers, their tracks are sent to the Battle Planner, and the battle plans are returned to the Environment Generator (which calculates the effects of any defensive actions taken). The exchange of mono tracks shown is a communication internal to the Tracker's two subcubes.
Figure 18.13:
The Simulation of Figure 18.9 is Controlled by a Pipeline Synchronization
The simulation initiates by having each module forward its data to the next unit in the pipeline and then read its input data as initialization for the first set of computations. At initialization, all messages are null except the sensor messages from the Environment Generator. In the first computation cycle, only the Tracker is active; the Battle Planner has no tracks. After the tracker completes its initial sensor data processing (described in detail later in the chapter), the resulting tracks are forwarded to the Battle Planner, which starts computing its first battle plan while the Tracker is working on the second set of sensor data-a computational pipeline, or round, has been established. When an element finishes with a given work segment, the results are forwarded and that element then waits if necessary for the data that initiates the next work segment. Convenient, effective, but hardly of the generality of, say, Time Warp (described in Section 15.3 ) as a means of concurrency control.
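The pipeline synchronization amounts to a send-then-receive loop in every module, as in the following sketch; send and recv are assumed blocking channel operations, not the actual Mark III primitives.

def pipeline_module(compute, upstream, downstream, initial_output, rounds):
    """One stage of the Environment Generator -> Tracker -> Battle Planner pipeline."""
    output = initial_output               # null for all but the sensor source
    for _ in range(rounds):
        downstream.send(output)           # forward results to the next stage
        data = upstream.recv()            # block until the next round's input arrives
        output = compute(data)            # data-parallel work inside the subcube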
Yet a full implementation of Time Warp is not necessarily required here even in the most general circumstances. Remember that Time Warp implements a general but pure discrete event simulation. Its objective-speedup-is achieved by capitalizing on the functional parallelism inherent in all the multiple objects, analogous to the multiple functions being described here. It permits the concurrent execution of the needed procedures and ensures a correct simulation via rollback when the inevitable time accidents occur. In the type of simulation discussed here, not only is speedup often not the goal, but when it is, it is largely obtained via the data parallelism of each function and load-balanced by the judicious assignment of the correct number of processors to each. The speedup due to functional parallelism can be small and good performance can still result. A means of assuring a correct simulation, however, is crucial.
We have experimented with several synchronization schemes that will ensure this correctness even when the simulation is of a generality illustrated by Figure 18.11 . The most straightforward of these is termed time buckets and is useful whenever one can structure a simulation where activities taking place in one time interval, or bucket, can only have effects later on in the next or succeeding time buckets.
The initial implementation of the time bucket approach was in conjunction with an SDS communications simulation, one that sought to treat in higher fidelity the communications activities implicit in Figure 18.11 . In this simulation, the emulators of the communications processors aboard each satellite and the participating ground stations were distributed across the nodes of the Mark III hypercube. In the most primitive implementation, each node would emulate the role of a single satellite's comm processor; messages would be physically passed via the hypercube comm channels and a rather complete emulation would result.
Since the correspondence to the real world is not perfect-actual hypercube comm delays are not equal to the satellite-to-satellite light time delays, for example-this emulation must be controlled and synchronized just like a conventional discrete-event parallel simulation if time accidents are not to occur and invalidate the results. Figure 18.14 shows the use of the time bucket approach for synchronization in this situation. The key is to note that, because of the geometries involved, there is a minimum light time delay between any two satellites. If the processing is done in time steps-time buckets, if you will-of less than this delay, it is possible to ensure that accidents will never occur.
Figure 18.14:
The Time Bucket Approach to Synchronization Control
Referring to Figure 18.14, assume each of the processors is released at a common start time and is free to process all messages in its input queue up to and including the time bucket's duration, without regard to any coordination of the progress of simulation time in the other nodes. Since the minimum light time delay is longer than this bucket, it is not possible for a remote node to send a message and have it received (in simulation time) interior to the free processing time; no accidents can occur. Each node then processes all its events up to a simulation time advance of one time bucket. It waits until all processors have reached the same point and all messages-new events from outside nodes-have been received and placed properly in their local event queues. When all processors have finished, the next time bucket can be entered.
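In outline, each node's control loop is the following; node and barrier are assumed objects representing the local event queue and the global synchronization primitive.

def time_bucket_loop(node, bucket, end_time, barrier):
    """Fixed time buckets: bucket must not exceed the minimum inter-node delay."""
    t = 0.0
    while t < end_time:
        node.process_events(up_to=t + bucket)   # free-running inside the bucket
        t += bucket
        barrier()                               # wait for every node to reach t
        node.deliver_incoming_messages()        # enqueue new events from other nodes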
The maximum rate that simulation time can advance is, of course, determined by the slowest node to complete in each time bucket. If proceeding is desired, not as rapidly as possible but in real time (i.e., maintaining a close one-to-one correspondence between elapsed wall clock and simulation time), the nodes can additionally wait for the wall clock to catch up to simulation time before proceeding; this behavior is illustrated in Figure 18.4 .
While described as a communication simulation, this is a rather general technique. It can be used whenever the simulation modules can reasonably obey the constraint that they not schedule events less than the minimum delay into the future for objects external to the local node. It can work efficiently whenever the density of events per processor per time bucket is significantly greater than unity. A useful view of this technique is that the simulation is fully parallel and event-driven interior to a time bucket, but is controlled by an implicit global controller and is a time-stepped simulation with respect to coarser time resolutions.
Implementation varies. The communication simulation just described was hosted on the Mark III and took advantage of the global lines to coordinate the processor release times. In more general circumstances where globals are not available, an alternative time service [ Li:91a ] has been implemented and is currently used for a network-based parallel Strategic Air Defense simulation.
Where a fixed time step does not give adequate results, an alternate technique implementing an adaptive time step has been proposed and investigated [Steinman:91a]. This technique, termed breathing time buckets, is notionally diagrammed in Figure 18.15. Pictured there are the local event queues for two nodes. These event queues are complete and ordered at the assumed synchronized start-of-cycle simulation time. The object of the initial processing is to determine a ``global event horizon,'' which is defined as the time of the earliest new event that will be generated by the subsequent processing. Current events prior to that time may be processed in all the nodes without fear of time accidents. This time is determined by having each node optimistically process its own local events, but withhold the sending of messages, until it has reached a simulation time where the next event to be processed is a ``new'' event. The time reached is called the ``local event horizon.'' Once every node has determined its local horizon, a global minimum is determined and equated to the global event horizon. All nodes then roll back their local objects to that simulation time (easy, since no messages have been sent), send the messages that are valid, and commit to the event processing up to that point.
Figure 18.15:
Breathing Time Buckets as a Means of Synchronization Control
In implementation, there are many refinements and extensions of these basic ideas in order to optimize performance, but this is the fundamental construct. It is proving to be relatively easily implemented, gives good performance in a variety of circumstances, and has even outperformed Time Warp in some cases [ Steinman:92a ].
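One cycle of this scheme is sketched below in a shared-memory caricature: events are (time, payload) tuples held in per-node heaps, process_event returns any newly generated events, and message routing between nodes is reduced to pushing onto a queue. All names are illustrative; this is not the implementation of [Steinman:91a].

import heapq

def breathing_cycle(queues, process_event):
    """Run one breathing-time-buckets cycle over a list of per-node event heaps."""
    horizons, processed = [], []
    for q in queues:
        done, new_events = [], []
        # Optimistically process local events, withholding messages, until the
        # next event to pop would be one generated during this cycle.
        while q and (not new_events or q[0][0] < min(t for t, _ in new_events)):
            ev = heapq.heappop(q)
            msgs = process_event(ev)          # new (time, payload) events
            done.append((ev, msgs))
            new_events.extend(msgs)
        # Local event horizon: earliest new event generated on this node.
        horizons.append(min((t for t, _ in new_events), default=float("inf")))
        processed.append(done)
    global_horizon = min(horizons)            # global event horizon
    for q, done in zip(queues, processed):
        for ev, msgs in done:
            if ev[0] <= global_horizon:       # commit: event is safe
                for m in msgs:
                    heapq.heappush(q, m)      # in reality, routed to the target node
            else:
                heapq.heappush(q, ev)         # roll back: re-queue unprocessed event
    return global_horizon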
Synchronization control is but one issue, albeit the most widely discussed and debated, in building a general framework to support this class of simulation. Viewed from the perspective of cooperative high performance computing, the Simulation Framework can be seen as services needed by the individual applications programmer, but not provided by the network or parallel computer operating system. Providing support for:
Sim89, described broadly in the previous section, is designed to process a so-called mass raid scenario, in which a few hundred primary threats are launched within a one- to two-minute time window, together with about 40 to 60 secondary, anti-satellite launches. The primary targets boost through two stages of powered flight (total boost time is about 300 seconds), with each booster ultimately deploying a single post boost vehicle (PBV). Over the next few hundred seconds, each PBV in turn deploys 10 re-entry vehicles (RVs). The Sim89 environment does not yet include the factor of 10 to 100 increase in object counts due to decoys, as would be expected in the ``real world.''
The data available for the tracking task consist, essentially, of line-of-sight measurements from various sensing platforms to individual objects in the target ensemble at (fairly) regular time intervals. At present, all sensing platforms are assumed to travel in circular orbits about a spherically symmetric earth (neither of these assumptions/simplifications is essential). The current program simulates two classes of sensors: GEO platforms, in geostationary, equatorial orbits, and MEO platforms, in lower-altitude polar orbits. A fixed scan time is assumed for each sensor class, with GEO and MEO platforms scanning at different rates.
Figure 18.16 shows a small portion of the field of view of a MEO sensor at a time about halfway through the RV deployment phase of a typical Sim89 scenario. The circles are the data seen by the sensor at one scan and the crosses are the data seen by the same sensor one scan later. Given such data, the primary tasks of the tracking program are fairly simple to state:
Figure 18.16:
Typical Midcourse Data Sets, Consecutive Scans
In order to (in principle) process data from a wide variety of sensors, the Sim89 tracker adopts a simple unified sensing formalism. For each sensor, the standard reference plane is taken to be the plane passing through the center of the earth, normal to the vector from the center of the earth to the (instantaneous) satellite position. Note that these standard frames are not inertial. The two-dimensional data used by the tracking algorithm are the coordinates of the intersection of the reference plane and the line of sight from the sensor to the target. The intersection coordinates are defined in terms of a standard Cartesian basis in the reference frame, with one axis along the normal to the sensor's orbital plane, and the other parallel to the projection of the platform velocity onto the reference plane.
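The construction of these two-dimensional coordinates is, in essence, a projection, as in the following numpy sketch; vectors are in an Earth-centered frame, and the function name and arguments are ours, not those of the Sim89 code.

import numpy as np

def reference_plane_coords(sat_pos, sat_vel, target_pos):
    """Project the sensor-to-target line of sight onto the standard reference
    plane (through the Earth's center, normal to the satellite position) and
    return its coordinates in the (orbit-normal, in-plane velocity) basis."""
    n = sat_pos / np.linalg.norm(sat_pos)          # plane normal
    u1 = np.cross(sat_pos, sat_vel)
    u1 /= np.linalg.norm(u1)                       # normal to the orbital plane
    v_in = sat_vel - np.dot(sat_vel, n) * n        # velocity projected into the plane
    u2 = v_in / np.linalg.norm(v_in)
    d = target_pos - sat_pos                       # line-of-sight direction
    lam = -np.dot(n, sat_pos) / np.dot(n, d)       # plane equation: n . p = 0
    p = sat_pos + lam * d                          # intersection point
    return np.dot(p, u1), np.dot(p, u2)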
The task of interpreting data such as those shown in Figure 18.16 is clearly rather challenging. The tracking algorithm described in the next section is based on a number of elementary building blocks, which are now briefly described [Baillie:88f], [Gottschalk:87f;88a;90a;90b].
In order to associate observations from successive scans, a model for the expected motion of the underlying target is required. The system model used throughout the Sim89 tracker is based on a simple kinematic Kalman filter. Consider, for the moment, motion in one dimension. The model used to describe this motion is
where x, v, a, and j denote position, velocity, acceleration, and jerk (the time derivative of the acceleration), and the jerk is driven by a stochastic (noise) term. The Kalman filter based on Equation 18.1 is completely straightforward, and ultimately depends on a single noise-strength parameter, given in Equation 18.2.
The system model of Equation 18.1 is appropriate for describing targets travelling along trajectories with unknown but approximately smooth accelerations. The size of the noise term in Equation 18.2 determines the magnitude of abrupt changes in the acceleration which can be accommodated by the model without loss of track. For the typical noise value quoted in Equation 18.2 , scan-to-scan variations are easily accommodated.
During boost phase, the actual trajectories of the targets are, in principle, not known, and the substantial freedom for unanticipated maneuvering implicit in Equations 18.1 and 18.2 is essential. On the other hand, the exact equation of motion for ballistic targets (i.e., RVs) is completely known, so that the uncertainties in predicted positions according to the kinematic model are much larger than is necessary. Nonetheless, Equation 18.1 is maintained as the primary system model throughout all phases of the Sim89 tracker. This choice is based primarily on considerations of speed. Evaluations of predicted positions according to Equation 18.1 require only polynomial arithmetic and are much faster than predictions done using the exact equations of motion. Moreover, for the scan times under consideration, the differences between exact and polynomial predictions are certainly small compared to expected sensor measurement errors.
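Since the text of Equations 18.1 and 18.2 is not reproduced here, the following is only a generic kinematic Kalman filter consistent with the description above: the one-dimensional state carries position, velocity, acceleration, and jerk, the process noise enters through the jerk, and only the position is measured. The matrices and noise model are standard textbook choices, not the Sim89 ones.

import numpy as np

def kinematic_filter_step(x, P, z, dt, q_jerk, r_meas):
    """One predict/update cycle for state x = [position, velocity, acceleration, jerk]."""
    F = np.array([[1.0, dt, dt**2 / 2, dt**3 / 6],
                  [0.0, 1.0, dt, dt**2 / 2],
                  [0.0, 0.0, 1.0, dt],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = np.diag([0.0, 0.0, 0.0, q_jerk])      # noise drives only the jerk component
    H = np.array([[1.0, 0.0, 0.0, 0.0]])      # the sensor measures position only

    x = F @ x                                 # predict
    P = F @ P @ F.T + Q
    S = (H @ P @ H.T).item() + r_meas         # innovation variance
    K = (P @ H.T) / S                         # Kalman gain, shape (4, 1)
    x = x + K[:, 0] * (z - (H @ x).item())    # update with the scalar measurement z
    P = P - K @ H @ P
    return x, P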
While ``exact'' system models for target trajectories are not used in the tracker per se, they are used for the collection of tracking ``answers'' which are exchanged between tracking systems or between trackers and other elements in the full Sim89 environment. (This ``handover'' issue is discussed in more detail in the next section.)
Given the preceding prescription for estimating the state of a single target from a sequence of two-dimensional observations, the central issue in multitarget tracking is that of associating observations with tracks or observations on one scan with those of a subsequent scan (e.g., in Figure 18.16 , which x is paired with which o ). There are, in a sense, two extreme schemes for attempting this track hit association:
The track splitting model is robust in the sense that the correct track hit association is very likely to be generated and maintained at any step in track processing. The track extension task is also extremely ``localized,'' in the sense that splittings of any one track can be done independently of those for other tracks. This makes concurrent implementations of track splitting quite simple. The primary objections to track splitting are twofold:
The optimal association prescription is orthogonal to track splitting in the sense that the single ``best'' pairing is maintained in place of all plausible pairings. This best track-hit association is determined by a global optimization procedure, as follows. Let $\{a_i\}$ and $\{b_j\}$ be two lists of items (e.g., actual data and predicted data values). Let $C_{ij}$
be a cost for associating items $a_i$ and $b_j$ (e.g., the Cartesian distance between predicted and actual data positions for the data coordinates defined above). The optimal association of the two lists is that particular permutation,
i \rightarrow \pi(i),   (18.4)
such that the total association score,
S = \sum_i C_{i\,\pi(i)},   (18.5)
is minimized over all permutations of Equation 18.4.
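The minimization of Equation 18.5 over permutations is the classical linear assignment problem. As an illustration only, it can be solved with an off-the-shelf Hungarian/Munkres solver; the 3-by-3 cost matrix below is hypothetical, and SciPy is used here purely for convenience (it is not part of the original system).

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Hypothetical cost matrix: C[i, j] is the cost of pairing predicted
    # track i with observation j (e.g., a Cartesian distance).
    C = np.array([[1.0, 4.0, 5.0],
                  [2.0, 0.5, 7.0],
                  [6.0, 3.0, 0.2]])

    # linear_sum_assignment performs the same minimization as Equation 18.5
    # (an implementation of the Munkres/Hungarian algorithm).
    rows, cols = linear_sum_assignment(C)
    total_score = C[rows, cols].sum()
    print(list(zip(rows, cols)), total_score)  # optimal pairing and its score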
Leaving aside, for now, the question of computational costs associated with the minimization of Equation 18.5, there are some fundamental difficulties associated with the use of optimal associators in multitarget tracking models. In particular:
The manner in which the elements of the preceding section are combined into an overall tracking algorithm is governed by two fundamental assumptions:
Given these assumptions on the nature of the tracking problem, the overall form of the Sim89 tracking model is as illustrated in Figure 18.17. The basic elements are a pair of two-dimensional trackers, each receiving and processing data from its own sensor, a three-dimensional tracking module which combines information from the two two-dimensional systems, and a ``Handover'' module. The handover module controls both the manner in which the three-dimensional tracker sends its answers to whoever is listening and the way in which tracks from other systems are entered into the existing three-dimensional track files. The following subsections provide brief descriptions of the algorithms used in each of these component subtasks.
Figure 18.17:
Gross Structure of Sim89 Tracking Model
The primary function of the two-dimensional tracking module is fairly straightforward: Given two-dimensional data sets arriving at reasonably regular time intervals (scans) from the sensors, construct a big set of ``all'' plausible two-dimensional tracks linking these observations from scan-to-scan. This is done by way of a simple track-splitting module. The tracks from the two two-dimensional trackers in Figure 18.17 are the fundamental inputs to the three-dimensional track initialization algorithm described in the next subsection.
The adoption of track splitting in place of optimal association for the two-dimensional trackers is largely a consequence of assumption (A1) above. Without a restrictive model for the (unseen) motion along the sensor line of sight, the information available to the two-dimensional tracker is not sufficient to differentiate among plausible global track sets through the data points. Instead, the two-dimensional tracker attempts to form all plausible ``tracks'' through its own two-dimensional data set, with the distinction between real and phantom tracks deferred to the three-dimensional track initiation and association modules described in the next section.
With the receipt of a new data set from the sensors, the action of the two-dimensional tracker consists of several simple steps:
Figure 18.18:
Processing Flow for two-dimensional Mono Tracking
An item in the two-dimensional track file is described by an eight-component state vector
X = (\xi_y, \xi_z),   (18.6)
where the component vectors \xi_y and \xi_z on the RHS of Equation 18.6 are four-element kinematic state vectors (position, velocity, acceleration, and jerk) as defined for Equation 18.1, referred to the standard measurement axes y and z.
In principle, each track described by Equation 18.6 has an associated covariance matrix with 36 independent elements. In order to reduce the storage and CPU resource requirements of two-dimensional tracking, a simplifying assumption is made. The measurement error matrix R for a two-dimensional datum
d = (d_y, d_z)   (18.7)
is taken to have the simple form
R = \sigma^2 I,   (18.8)
with the same effective value \sigma^2 used to describe the measurement variance for each projection, and no correlation of the measurement errors. The assumption in Equations 18.7 and 18.8 is reasonable, provided the effective measurement error is made large enough, and it reduces the number of independent components in the covariance matrix from 36 to 10.
The central task of the two-dimensional track extension module is to find all plausible track-hit associations, subject to a set of criteria which define ``plausible.'' The primary association criterion is based on the track association score
S = \sum_{i=y,z} \Delta_i^2 / \sigma_i^2,   (18.9)
where \sigma_i^2 is the variance of the predicted data position along a reference axis and
\Delta_i = d_i^{obs} - d_i^{pred}   (18.10)
is the difference between the actual data value and that predicted by Equation 18.6 for the time of the datum. Equation 18.9 is simply a dimensionless measure of the size of the mismatch in Equation 18.10, normalized by the expected prediction error.
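A minimal sketch of such a normalized mismatch score, assuming the reconstructed form of Equations 18.9 and 18.10 given above; the function name and argument layout are illustrative.

    import numpy as np

    def association_score(actual, predicted, variances):
        """Dimensionless track-hit mismatch in the spirit of Equation 18.9.

        actual, predicted: the two reference-plane coordinates of the datum
        and of the track prediction at the time of the datum.
        variances: predicted position variances along each reference axis.
        """
        delta = np.asarray(actual) - np.asarray(predicted)      # Equation 18.10
        return float(np.sum(delta**2 / np.asarray(variances)))  # chi-square-like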
The first step in limiting track-hit associations is a simple cut on the association score of Equation 18.9. For the dense, multitarget environments used in Sim89, this simple cut is not sufficiently restrictive, and a variety of additional heuristic cuts are made. The most important of these are:
The actual track scoring cut is a bit more complicated than the preceding paragraph implies. Let S denote the nominal extension score of Equation 18.9. In addition, define a cumulative association score \bar{S} which is updated on each association in a fading-memory fashion (e.g., \bar{S} \leftarrow \gamma\bar{S} + (1-\gamma)S with a fixed weight 0 < \gamma < 1). An extension is accepted only if S is below some nominal cutoff (typically 3-4) and \bar{S} is below a more restrictive cut (2-3). This second cut prevents the creation of poor tracks with barely acceptable extension scores at each step.
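A sketch of this two-cut acceptance test, assuming the fading-memory update written above; the weight gamma and the numerical cutoffs are placeholders chosen in the ranges quoted in the text, not the actual Sim89 parameters.

    def accept_extension(score, cum_score, gamma=0.75,
                         score_cut=4.0, cum_cut=3.0):
        """Two-cut track extension test.

        score:     nominal extension score (Equation 18.9).
        cum_score: cumulative (fading-memory) association score so far.
        Returns (accepted, updated cumulative score).
        """
        new_cum = gamma * cum_score + (1.0 - gamma) * score  # fading memory
        accepted = (score < score_cut) and (new_cum < cum_cut)
        return accepted, new_cum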
The preceding rules for track-hit associations define the basic two-dimensional track extension formalism. There are, however, two additional problems which must be addressed:
In regard to the first problem, two entries in the track file are said to be equivalent if they involve the same associated data points over the past four scans. If an equivalent track pair is found in the track file, the track with the higher cumulative score is simply deleted. The natural mechanism for track deletion in a track-splitting model is based on the track's data association history. If no data items give acceptable association scores over some preset number of scans (typically 0-2), the track is simply discarded.
The equivalent-track merging and poor-track deletion mechanisms are not sufficient to prevent track file ``explosions'' in truly dense environments. A final track-limiting mechanism is simply a hard cutoff on the number of tracks maintained for any item in the data set. If more tracks than this limit give acceptable association scores to a particular datum, only the allowed number of pairings with the lowest total association scores are kept.
The complexity of the track extension algorithm is nominally proportional to the product of the number of new data items and the number of existing tracks. This computational burden is easily reduced to something much closer to linear in their sum by sorting both the incoming data and the predicted data values for existing tracks.
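One way to realize this sort-based reduction is sketched below, under the assumption of a single scalar sorting key and a fixed gating window; both are illustrative choices, not a description of the Sim89 code.

    import bisect

    def gated_candidates(sorted_data, predictions, window):
        """Candidate track-hit pairings using a sorted one-dimensional key.

        sorted_data: data key values, sorted in ascending order.
        predictions: list of (track_id, predicted_key) pairs.
        window:      gating half-width on the key.

        A binary search per prediction replaces the full tracks-times-data
        comparison, giving roughly linear-plus-logarithmic cost.
        """
        pairs = []
        for track_id, key in predictions:
            lo = bisect.bisect_left(sorted_data, key - window)
            hi = bisect.bisect_right(sorted_data, key + window)
            for idx in range(lo, hi):
                pairs.append((track_id, idx))
        return pairs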
The report formation subtask of the two-dimensional tracker collects/organizes established two-dimensional tracks into a list to be used as input for three-dimensional track initiations, where ``established'' simply means tracks older than some minimum cutoff age (typically seven hits). The task of initiating three-dimensional tracks from lists of two-dimensional tracks consists of two parts:
The essential element in these associations is the so-called ``hinge angle'' illustrated in Figure 18.19. Consider a single target viewed simultaneously by two different sensors. Assuming that each two-dimensional tracker knows the orbit of the other tracker's sensor, each tracker can independently reconstruct two reference planes in three-dimensional inertial space. The hinge angle is simply the angle between these two planes.
Figure 18.19:
Definition of the Stereo Association Angle
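For reference, the angle between two planes can be computed from vectors normal to them; the helper below is a generic sketch and does not encode the specific plane constructions of Figure 18.19.

    import numpy as np

    def angle_between_planes(n1, n2):
        """Angle (radians) between two planes, given vectors normal to them."""
        n1 = np.asarray(n1, dtype=float)
        n2 = np.asarray(n2, dtype=float)
        cos_angle = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2))
        return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))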
Once the time for the two-dimensional report has been specified, the steps involved in the report function are relatively straightforward:
The algorithm described in Section 18.4.4 is only applicable for extending tracks which already exist in the track file. The creation of new entries is done by a separate track initiation function.
The track initiator involves little more than searches for nearly collinear triples of data over the last three scans. A triplet is accepted if its deviation from collinearity falls below a cutoff, where the cutoff is generally fairly loose. In addition, a number of simple heuristic cuts (e.g., a maximum speed) are applied.
The initiator searches for all approximately linear triples over the last three scans, subject to the important additional restriction that no initiations to a particular item of the current data set are attempted if any established track (minimum age cut) already exists ending at that datum. The nominal complexity of the initiator is reduced substantially by exploiting the sorted nature of the incoming data sets.
Unlike the two-dimensional tracking module, the three-dimensional stereo tracker attempts to construct a single track for each (perceived) underlying target. The fundamental algorithm element for this type of tracking is the optimal associator described in Section 18.4.2. A single pass through the three-dimensional tracker utilizes optimal associations for two distinct subtasks:
Given a list of existing three-dimensional tracks and a set of observations from a particular sensor, the track extension task nominally consists of three basic steps:
A list of predicted data values for existing tracks is evaluated and is sorted using the same key as was used to sort the data set. The union of the sorted prediction and data sets is then broken into some number of gross subblocks, defined by appropriately large gaps in values of the sorting key. This reduces the single large association problem to a number of smaller subproblems.
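A sketch of this gap-based blocking step, assuming the sorting key is a single scalar per item and that min_gap stands in for the ``appropriately large'' gap; both choices are illustrative.

    def split_at_gaps(sorted_keys, min_gap):
        """Break a sorted list of key values into sub-blocks at large gaps.

        Items separated by more than min_gap cannot plausibly be associated,
        so each sub-block can be handed to the assignment solver on its own.
        Returns a list of (start, end) index ranges, end exclusive.
        """
        if not sorted_keys:
            return []
        blocks, start = [], 0
        for i in range(1, len(sorted_keys)):
            if sorted_keys[i] - sorted_keys[i - 1] > min_gap:
                blocks.append((start, i))
                start = i
        blocks.append((start, len(sorted_keys)))
        return blocks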
For each subproblem, a pruned distance matrix is evaluated, subject to two primary constraints:
where, for i = y,z, \Delta_i is the difference between the measured and predicted values of coordinate i (Equation 18.15) and \sigma_i^2 is the predicted variance for Equation 18.15 according to the three-dimensional tracking filter. The score is essentially a \chi^2 for the proposed association, and an association is accepted only if the score falls below a nominal cutoff. The maximum allowed number of associations for any single prediction is also limited (typically to no more than eight). If more data than this give acceptable association scores, the possible pairings are sorted by the association score and only the best fits (lowest scores) are kept.
The preceding scoring algorithm leads to a (generally) sparse distance matrix for the large subblocks defined through gaps in the sorting keys. The next step in the algorithm is a quick block diagonalization of the distance matrix through appropriate reorderings of the rows and columns. By this point, the original large association problem has been reduced to a large number of modest-sized subproblems, and the Munkres algorithm for minimizing the global cost in Equation 18.5 is (finally) used to find the optimal pairings.
Now at Syracuse University, Fox has set up a new program ACTION (Advanced Computing Technology is an Innovative Opportunity Now). This is funded by New York State to accelerate the introduction of parallel computing into the State's industry. The methodology is based directly on that proven successful in C³P. The applications scientists are now in different industries-not in different Caltech or JPL departments. There are many differences in detail between the projects. The basic hardware is now available commercially and need not be developed concurrently with applications and systems software. However, the applications are much harder. In C³P, a typical code was at most a few thousand lines long and often developed from scratch by each new graduate student. In ACTION, the codes are typically larger (say 100,000 lines) and longer lived.
We also find differences when we analyze the problem class. There are fewer regular synchronous problems in industry than in academia and many more of the metaproblem class with several different interrelated functions.
Table 19.1 presents some initial results of a survey of industrial applications [Fox:92e]. Note that we are at a stage analogous to the beginnings of C³P, when we first wandered around Caltech talking to computational scientists.
Table 19.1:
An Initial Survey of Industry and Government Opportunities for
High-Performance (Parallel) Computing
In general, we find that the central parallel algorithms needed in industry have usually already been studied by the research community. Thus, again we find that, ``in principle,'' parallel computing works. However, the software problem here is even harder, and it is not clear that the software issues that are key to the research applications are the same for industry. As described in Chapter 14 for High Performance Fortran, software standards are critical so that companies can be assured that their parallel software investment will be protected as hardware evolves.
One interesting initial conclusion about the industrial opportunities for parallel computers concerns the type of applications. Simulations of various sorts dominated the previous chapters of this book and most academic computing. However, we find that the industrial applications show that simulation, while very promising, is not the largest market in the long run. Rather, we live in the ``information era'' and it is in the processing of information that parallel computing will have its largest opportunity. This is not (just) transaction processing for the galaxywide network of automatic teller machines; rather, it is the storage and access of information followed by major processing (``number-crunching''). Examples include the interpretation of data from NASA's ``mission to planet Earth'' where the processing is large-scale image analysis; the scanning and correlation of technical and electronic information from the world's media to give early warning for economic and social crises; the integration of Medicaid databases to lower the burden on doctors and patients and identify inefficiencies. Interestingly, such information processing is currently not stressed in the national high-performance computing initiative.
In the following, we refer to the numerical label (item number) in the first column of Table 19.1.
Items 1, 4, 14, 15, and 16 are typical of major signal processing and feature identification problems in defense systems. Currently, special purpose hardware-typically with built-in parallelism-is used for such problems. We can expect that use of commercial parallel architectures will aid the software development process and enhance reuse. Parallel computing in acoustic beam forming (item 1) should allow adaptive on-line signal processing to maximize signal-to-noise ratio dynamically as a function of angle and time. Currently, the INTEL iWarp is being used, although SIMD architectures would be effective in this and most low-level signal processing problems. A SIMD initial processor would be augmented with a MIMD machine to do the higher level vision functions. Currently, JointStars (item 4) uses a VAX for the final tracking stage of their airborne synthetic aperture radar system. This was used very successfully in the Gulf War. However, parallel computing could enhance the performance of JointStars and allow it to track many moving targets-one may remember the difficulties in following the movement of SCUD launchers in the Gulf War. As shown in Chapter 18 , we already know good MIMD algorithms for multitarget tracking [Gottschalk:88a;90b].
We can expect the Defense Department to reduce the purchases of new planes, tanks, and ships. However, we see a significant opportunity to integrate new high-performance computer systems into existing systems at all levels of defense. This includes both avionics and mission control in existing aircraft and the hierarchy of control centers within the armed services. High-performance computing can be used both in the delivered systems and perhaps even more importantly in the simulation of their performance.
Modelling of the ocean environment (item 2) is a large-scale partial differential equation problem which can determine dynamically the acoustic environment in which sonar signals are propagating. Large scale (teraflop) machines would allow real-time simulation in a submarine and lead to dramatic improvement in detection efficiency.
Computational fluid dynamics, structural analysis, and electromagnetic simulation (item 3) are a major emphasis in the national high-performance computing initiative-especially within NASA and DOE. Such problems are described in Chapter 12. However, the industries that can use this application are typically facing major cutbacks and the integration of new technology faces major hurdles. How do you use parallelism when the corporation would like to shut down its current supercomputer center and, further, has a hiring freeze preventing personnel trained in this area from entering the company? We are collaborating with NASA in helping industry with a new consortium where several companies are banding together to accelerate the integration of parallelism into their working environment in the area of multidisciplinary design for electromagnetics, fluids, and structures. An interesting initial concept was a consortium project to develop a nonproprietary software suite of generic applications which would be modified by each company for its particular needs. One company would optimize the CFD code for a new commercial transport, another for aircraft engine design, another for automobile air drag simulation, another for automobile fan design, another for minimizing noise in air conditioners (item 7) or more efficient exhaust pumps (item 6). The electromagnetic simulation could be optimized either for stealth aircraft or the simulation of electromagnetic properties for a new high-frequency printed circuit board. In the latter case, we use simulation to identify problems which otherwise would require time-consuming fabrication cycles. Thus, parallel computing can accelerate the introduction of products to market and so give a competitive edge to corporations using it.
Power utilities (item 9) have several interesting applications of high-performance computing, including nuclear power safety simulation, and gas and electric transmission problems. Here the huge dollar value of power implies that small percentage savings can warrant large high-performance computing systems. There are many electrical transmission problems suitable for high-performance computing which are built around sparse matrix operations. For Niagara Mohawk, a New York utility, the matrix has about 4000 rows (and columns) with approximately 12 nonzero elements in each row (column). We are designing a parallel transient stability analysis system now. This would have some of the features described in DASSL (Section 9.6). Niagara Mohawk's problem (matrix size) can only use a modest (perhaps 16-node) parallel system. However, one could use large teraflop machines (10,000 nodes?) to simulate larger areas-such as the sharing of power over a national grid.
In a completely different area, the MONY Insurance Company (item 10) spends $70 million a year on data processing-largely on COBOL applications where they have some 15 million lines of code and a multi-year backlog. They see no immediate need for high-performance computing, but surely a more productive software environment would be of great value! Similarly, Empire Blue Cross/Blue Shield (item 11) processes 6.5 million medical insurance transactions every day. Their IBM 3090-400 handles this even with automatic optical scanning of all documents. Massively parallel systems could only be relevant if one could develop a new approach, perhaps with parallel computers examining the database with an expert system or neural network to identify anomalous situations. The states and federal government are burdened by the major cost of Medicaid and small improvements would have great value.
The major computing problem for Wall Street (items 12, 13) is currently centered on the large databases. SIAC runs the day-to-day operation of the New York and American Stock exchanges. Two acres (about 300) of Tandem computers handle the calls from brokers to traders on the floor. The traders already use an ``embarrassingly parallel'' decomposition with some 2000 stocks of the New York Stock Exchange decomposed over about 500 personal computers with about one PC per trader. For SIAC, the major problem is reliability and network management with essentially no down time ``allowed.'' High-performance computers could perhaps be used as part of a heterogeneous network management system to simulate potential bottlenecks and strategies to deal with faults. The brokerages already use parallel computers for economic modelling [Mills:92a;92b], [ Zenios:91b ]. This is obviously glamorous, with integration of sophisticated optimization methods very promising.
As our final example (item 17), we have the entertainment and education industries. Here high-performance computing is linked to areas such as multimedia and virtual reality with high bandwidth and sophisticated visualization and delivery systems. Some applications can be viewed as the civilian versions of military flight simulators, with commercial general-purpose parallel computers replacing the special-purpose hardware now used. Parallelism will appear at the low end with future extensions of Nintendo-like systems; at a medium scale for computer-generated stages in a future theater; at the high end with parallel supercomputers controlling simulations in tomorrow's theme parks. Here, a summer C³P project led by Alex Ho with three undergraduates may prove to be pioneering [Ho:89b], [Ho:90b]. They developed a parallel video game, Asteroids, on the nCUBE-1 and transputer systems [Fox:88v]. This game is a space war in a three-dimensional toroidal space with spacecraft, missiles, and rocks obeying some sort of laws of physics. We see this as a foretaste of a massively parallel supergame accessed by our children from throughout the globe with high-speed lines and consumer virtual reality interfaces. A parallel machine is a natural choice to support the realism and good graphics of the virtual worlds that would be demanded by the Nintendo generation. We note that, even today, the market for Nintendo and Sega video entertainment systems is an order of magnitude larger than that for supercomputers. High-performance computers should also appear in all major sports stadiums to perform image analysis as a training aid for coaches or to provide new views for cable TV audiences. We can imagine sensors and tracking systems developed for the Strategic Defense Initiative being adapted to track players on a football field rather than a missile launch from an unfriendly country. Many might consider this appropriate with American football being as aggressive as many military battles!
Otis (item 5) is another example of information processing, discussed generally in Section 19.1 . They are interested in setting up a database of elevator monitoring data which can be analyzed for indicators of future equipment problems. This would lead to improved reliability-an area where Otis and Japanese companies compete. In this way, high-performance computing can lead to competitive advantage in the ``global economic war.''
The C³P program, from its very initial proposal and project implementation, was designed to directly answer such questions as:
The contents of this book illustrate our answers to these questions with such results as:
As in all research projects, we made many unexpected discoveries. One of the most interesting was Computational Science. Namely, much of the work described in this book is clearly interdisciplinary. It mixes physics, chemistry, engineering and other applications with mathematics and computer science. The national high-performance computing initiative has stressed interdisciplinary teams in both its planning documents [FHPCP:89a] and its implementation in federal proposal solicitations (Commerce Business Daily). This idea was indeed part of the initial makeup and proposals of C³P. However, this is not actually what happened in many cases. Probably the most important work in C³P was not from teams of individuals, each with their own specialized skills. Rather, C³P relied on the research of individuals, and each individual possessed a mix of skills. We can give some examples.
Otto developed the initial QCD codes (Section 4.3) for the Cosmic Cube and its prototype. This required intricate knowledge of both the best physics and its numerical formulations. However, Otto also participated in the design and implementation of the hardware and its support software which later became Express. Otto obtained a physics Ph.D., but is now on the computer science faculty at the Oregon Graduate Center.
As a different example, we can quote the research in Chapter 11, which uses physics methods (such as simulated annealing) to solve a mathematics problem (optimization) for a computer science application (load balancing). Again, the design of higher-level languages (Chapters 13, 15 through 17) requires deep computer science compiler and operating system expertise, as well as application understanding, to design, say, the High Performance Fortran directives or MOVIE functionality. This mix of interests, which combines the skills of a computer scientist with those of an application area such as physics, was the rule and not the exception in C³P. In the following, we comment on some general implications of this.
C³P trained computational scientists ``accidentally'' by involving faculty, students, and staff in a research program whose success demanded interdisciplinary knowledge and work. Most of our students were at the Ph.D. level, although some undergraduates were involved through NSF REU (Research Experience for Undergraduates) and other research support. For instance, Felten made significant discoveries in new sorting algorithms (Section 12.7) while a physics undergraduate at Caltech. This work was awarded the prize for the best undergraduate research at Caltech during 1984. Felten is now in the Computer Science Ph.D. program at the University of Washington in Seattle.
We can ask whether such interdisciplinary computational science can be incorporated into the academic curriculum as well as appearing in leading-edge research projects. We can also ask if there is a role for computational science at the Ph.D., master's, and undergraduate levels, and in K-12 precollege education.
We believe that computational science should be taught academically at all levels and not confined to research projects [Fox:92d]. We believe that there is an interdisciplinary core of knowledge for computational science. Further, this core contains fundamental issues and is far more than a programming course in Fortran, Lotus 1-2-3, or even in more sophisticated systems such as Express or MOVIE.
An education in computational science would include the basics of applied computer science, numerical analysis, and simulation. Computational scientists need a broader education than the typical physicist or computer scientist. Their training in basic computer science, and how to apply it, must be joined with an understanding of one or more application areas, such as physics and the computational approach to physics. Computational scientists will need a computer laboratory course so they become facile with the use of computers. These must be modern parallel supercomputers, and not just the personal computers or workstations now used for students in most universities. This broad education will only be possible if existing fields can teach their material more concisely. For a computational physicist, for example, the courses in applied computer science could substitute for advanced courses in quantum theory, and the parallel computer laboratory could substitute for an experimental physics lab. Thus, we could train a computational physicist with a reasonable knowledge of both physics and computation. Although the details of parallel computing are changing rapidly, the graduate of such an education will be able to track future changes. Computational science naturally links scientific fields to computer science. Here again, a specialization in computational science is an attractive option for computer scientists. An understanding of applications will allow computer scientists to develop better hardware and software. Computational scientists, whether in computer science or in an application field such as physics, will benefit directly from technology that improves the performance of computers by a factor of two or so each year. Their theoretical colleagues will not be assisted as well by technological improvements, so computational science can be expected to be a field of growing rewards and opportunities, as compared to traditional areas.
We believe that students educated in computational science will find it a rewarding and exciting experience, which should give them excellent job opportunities. Only a few universities offer such a degree, however, and often only at the Ph.D. level. Fledgling programs exist at Caltech, Cornell, Clemson, Denver, Illinois, Michigan, North Carolina, Princeton, Rice, Stanford, Syracuse, and U.C. Davis. The Caltech and Syracuse programs are both based on lessons from C³P. These programs are diverse, and no national consensus as to the core knowledge of computational science has been developed. The NSF Supercomputer Centers at Cornell, Illinois, Pittsburgh, and San Diego have played a major role in enhancing the visibility and progress of computational science. However, these centers are set up outside of the academic framework of universities and do not contribute directly to developing computational science as an academic area. These centers, industry, the National laboratories, and indeed the federal government with its new high-performance computing and communication initiative, are all driving computational science forward. Academia is behind. Not only are there few computational science education programs, but there are also few faculty who could teach such a curriculum. The poor job opportunities for computationalists in leading universities naturally discourage students from entering the field and so again hinder the development of new educational programs. It will not be an easy issue to address, and probably only slow progress will be made as computational science gradually gains recognition in universities as a fundamentally exciting field. The inevitable dominance of parallel computing will help, as will the use of parallel computers in the NSF centers that have provided such a critical stimulus for computational science. Industry and the National laboratories already offer computational scientists excellent job opportunities, and the demand for such training will grow. Hopefully, this market pressure will lead to initiatives from within universities to hire, encourage, and promote new computational faculty, and educate students in computational science.
Consider the issues controlling the development of computational science in universities. As this field borrows and extends ideas from existing fields-computer science, biology, chemistry, physics, and so on-it will naturally face campus political hurdles as it challenges traditional and firmly held beliefs. These inevitable difficulties are exacerbated by administrative problems; many universities are facing a scenario of no growth, or even of declining funding and faculty size. This will mean that creation of new areas implies reductions in other areas. Computational science shares difficulties with other interdisciplinary areas, such as those associated with the growing interest in Planet Earth. The peer referee system used in the hiring and promoting of new faculty is perfect for ensuring high standards within the referees' domain of expertise. This tends to lead to very high-quality but isolated departments that find it hard to move into new areas. The same effect is seen in the peer review system used for the refereeing of scholarly papers and federal grants. Thus, universities find it hard to change, making it difficult for computational science to grow in academia. A key hurdle will be the development of some consensus in the community that computational science is, as we have asserted, fundamental and exciting. This needs to be quantified academically with the development of a core curriculum-a body of knowledge on which one can build computational science as an academic discipline.
There are two obvious approaches to filling the academic void identified as computational science. The boldest and simplest approach is to create an entirely new academic degree, ``Computational Science,'' administered by a new university department. This would give the field great visibility, and, once created, the independent department would be able to develop its educational program, research, and faculty hiring without direct interference from existing academic fields. Such a department would need strong support from the university administration to flourish, and even more so for its creation. This approach would not be easy to implement. There would be natural opposition from existing academic units for both good and not-so-good reasons. A telling criticism is that a freestanding Computational Science program is premature; there is as yet no agreement on a core body of knowledge that could define this field. Students graduating from this program might find it hard to progress up the academic ladder at the vast majority of universities that do not have such a department.
These difficulties are avoided by the second strategy for computational science, which, rather than filling the void with a new department, would broaden the existing fields to ``meet in the middle.'' Students could graduate with traditional degrees and have a natural academic future. This is the approach taken by the existing university Computational Science and Engineering programs. For example, consider the two fields of chemistry and computer science. A computational scientist would graduate with either a Chemistry or Computer Science degree. Later academic progress would be judged by the scientist's contributions to the corresponding base field. We have already argued that such an interdisciplinary education would allow the student to be a better chemist or a better computer scientist, respectively. Naturally, the chemistry graduate from the Computational Science program would not have received as complete an education in chemistry as is traditional for theoretical or experimental chemists. Some of the chemistry elective courses would have been replaced by computational science requirements. This change would need to be approved and evaluated by the Chemistry faculty, who would also need to identify key chemistry requirements to be satisfied by computational scientists. New courses might include computational chemistry and those covering the basics of computer science, numerical analysis, and simulation. The latter set would be taught either by computer scientists or interdisciplinary Computational Science faculty. The education of a computational scientist within a Computer Science department could be handled similarly. This would have an emphasis on applied computer science, and a training in at least one application area.
In this scenario, a degree in Computational Chemistry is equivalent to one in ``Chemistry within the Computational Science program.'' On the computer science side, one could see degrees in ``Computer Science with a minor in Chemistry,'' or a ``Ph.D. in Computer Science with a master's degree in Chemistry.'' At the academic level, we see an interdisciplinary program in computational science, but no separate department; faculty are appointed and students admitted to existing academic units. This approach to computational science allows us to develop and understand the core knowledge curricula in an evolutionary fashion. Implementing this more modest plan is certainly not easy, as one must modify the well-established degree requirements for the existing fields, such as chemistry and computer science. These modifications are easiest at the master's and especially at the Ph.D. level, and this is where most of the new programs have been established.
There seem to be very good reasons to establish undergraduate-level Computational Science programs as well. We also need to create an awareness in the (K-12) educational system of the importance of computation, and the possibility of Computational Science degrees. In this way, more high school students may choose Computational Science educational programs and careers. Further, in K-12 one emphasizes a general education without the specialization normal in college. The breadth of computational science makes this very suitable for pre-college education. We also expect that high-technology environments-such as virtual reality front ends to an interactive fluid flow or other physical simulation on a parallel supercomputer-will prove to be a valuable teaching tool for today's Nintendo generation. Kids with a background in computational science will interact better with this modern computer environment and so learn more about traditional fields-for example, more about the physics of fluid flow in the sample simulation mentioned above.
Eventually, everybody will learn computational science-it will be part of any general education. When all students take two years of basic applied computer science at college-including but not at all limited to programming-then it will be natural to define computational science in all its flavors as an extension of these two years of base courses. Computation, like mathematics, chemistry, physics, and humanities, is essential in the education of tomorrow's scientists and engineers.
This includes many, but certainly not all, of the key C³P participants. The bibliography and Appendix A cite the full set of C³P reports and authors.
Giovanni Aloisio
Dipartimento di Elettrotecnica ed Elettronica
Facolta di Ingegneria-Politecnico di Bari (Italy)
Via Orabona, 4
70125 Bari (Italy)
Aloisio@vaxle.le.infn.it
Worked from (11/86-end of project):
Investigating the efficiency of the Hypercube architecture in Real-Time
SAR data processing (``SAR Hypercube Project''). Non-traditional FFT
algorithms, such as the Prime Factor, have been coded to run on the
nCUBE, iPSC, and Mark IIIfp hypercubes. The optimal decomposition, on
a specific hypercube system, of a complete software package for digital
SAR data processing has been determined. This package has been
implemented in the sequential version on a VAX-780 at IESI/CNR
(Bari-Italy) and has been tested on digital raw data obtained by JPL
(SIR-B space Shuttle mission).
Now works on:
High Performance Distributed Computing (porting of several
applications under PVM and Net-EXPRESS. Parallel compilers, such as
HPF, will also be tested). A joint project with CCSF is in progress.
Ian Angus
Research Scientist
Boeing Computer Services
P. O. Box 24346, MS 7L-48
Seattle, WA 98124-0346
angus@atc.boeing.com
Worked from (1986-1987):
Involved primarily with the implementation of a Hypercube simulator
and with the design and first implementation of the Fortran Cubix
programming system.
Now works on:
Programming tools and environments, object oriented approaches to scientific
and parallel computing, and compilation of object oriented languages.
John Apostolakis
CERN
CN Division, 513-R-024
CH 1211 GENEVA 23, Switzerland
japost@dxcern.cern.ch
Worked from (9/86-end of project):
With lattice gauge theory, lattice spin models, and gravitational
lenses and the issues involved in developing efficient parallel
programs to simulate them.
Now works on:
Implementing experimental high energy physics applications on Massively
Parallel Processors.
Contributed Section 7.4, Statistical Gravitational
Lensing
Clive F. Baillie
Research Fellow
Computer Science Department
Campus Box 430
University of Colorado
Boulder CO 80309
clive@kilt.cs.colorado.edu
Worked from (9/86-end of project):
Implementations of physics problems, particularly clustering methods and
performance studies. Large-scale Monte-Carlo simulations of QCD, XY and
O(3) models, 3D Ising model, 2D Potts model and dynamically triangulated
random surfaces (DTRS).
Now works on:
Further work on DTRS, making them self-avoiding to simulate superstrings,
and adding Potts models to simulate quantum gravity coupled to matter.
Contributed Sections 4.3, Quantum Chromodynamics;
4.4, Spin Models; 7.2, Dynamically Triangulated Random Surfaces; and
12.6, Cluster Algorithms for Spin Models
Vasanth Bala
Member of the Technical Staff
Kendall Square Research
170 Tracer Lane
Waltham, Massachusetts 02154
vas@ksr.com
Worked from (8/89-end of project):
With the design of software tools,
compiler optimizations, and communication libraries for scalable parallel
computers.
Now works on:
Speculative instruction scheduling for superscalar RISC processors, and
general compiler optimization of C, Fortran90/HPF and C++ programs for
RISC-based parallel computers. After leaving Caltech C³P, was a research
staff member at IBM T. J. Watson Research Center (Yorktown Heights, NY)
involved in the design of the IBM SP1 parallel computer.
Contributed Section 13.2, A Software Tool
Ted Barnes
Staff Physicist
Theoretical Physics Division
Oak Ridge National Laboratory
Oak Ridge, Tennessee 37831-8083
and
Associate Professor of Physics
Department of Physics
University of Tennessee
Knoxville, Tennessee 37996
Worked from (1987-1989):
Monte Carlo calculations to simulate high-temperature superconductivity.
Now works on:
QCD spectroscopy, couplings and decays of hadrons, high-temperature
superconductivity.
Contributed Section 7.3, Numerical Study of High-T_c Spin Systems
Roberto Battiti
Assistant Professor of Physics
Universita` di Trento
Dipartimento di Matematica
38050 Povo (Trento), Italy
battiti@itnvax.cineca.it
Worked from (1986-end of project):
Parallel implementation of neural nets and vision algorithms; computational
complexity of learning algorithms.
Now works on:
Constructive and destructive learning methods for neural nets, ``natural''
problem solving such as genetic algorithms; application of neural nets
in financial and industrial areas.
Contributed Sections 6.5, A Hierarchical Scheme for
Surface Reconstruction and Discontinuity Detection; 6.7, An Adaptive
Multiscale Scheme for Real-Time Motion Field Estimation; 6.8,
Collective Stereopsis, and 9.9, Optimization Methods for Neural Nets:
Automatic Parameter Tuning and Faster Convergence
Jim Bower
Associate Professor of Biology
Computation and Neural Systems Program
California Institute of Technology
Mail Code 216-76
Pasadena, California 91125
jbower@smaug.bbb.caltech.edu
Worked from (1988-end of project):
Using concurrent computers to build large-scale realistic
models of the nervous system. We recognized early on that
truly realistic models of these complex systems would
require the power present in parallel computation. This is
reflected in the fact that the nervous system itself is
probably a parallel device. Leader of the GENESIS project
described in Section 7.6.
Now works on:
Current interest remains understanding the relationships
between the structure and the function of the nervous
system. We have recently published several scientific
papers that would not have been possible without the use of
the concurrent machines at Caltech.
Eugene D. Brooks, III
Deputy Associate Director
Advanced Technologies Computation Organization
Lawrence Livermore National Laboratory
P. O. Box 808, L-66
Livermore, CA 94550
brooks3@llnl.gov
Worked from (1981-1983):
The use of parallel computing to supply a new computational
capability for computational physics tasks.
Now works on:
Parallel computer architecture, parallel languages, computational
physics algorithms, and parallelization of computational physics
algorithms.
Robert W. Clayton
Professor of Geophysics
California Institute of Technology
Geophysics, 350 S. Mudd
Mail Code 252-21
Pasadena, CA 91125
clay@seismo.gps.caltech.edu
Worked from (1983-end of project):
Finite-difference solutions of wave phenomena. Imaging with
seismic reflection data.
Now works on:
Finite-difference solutions of wave phenomena. Imaging with
seismic reflection data.
Contributed Section 18.2, ISIS: An Interactive
Seismic Imaging System
Paul Coddington
Syracuse University
Northeast Parallel Architectures Center
111 College Place, 3-228 CST
Syracuse, New York 13244-4100
paulc@npac.syr.edu
Worked from (1988-end of project):
Developed parallel implementations of non-local Monte Carlo algorithms for
spin models of magnetism.
Now works on:
From 1990-92, worked as a Research Associate at NPAC on computational
physics applications, including new sequential and parallel Monte Carlo
algorithms for spin models and dynamically triangulated random surface
models of quantum gravity, as well as parallel algorithms for connected
component labeling and graph coloring. Also worked on improved
stochastic optimization techniques, such as simulated annealing.
From 1992 until the present, worked as a Research Scientist at NPAC leading a project on the use of parallel computing in the power utility industry. This involves porting existing code to parallel computers, and developing parallel algorithms for sparse matrix computations and differential-algebraic equation solvers.
Dave Curkendall
ALPHA Project Manager and
Advanced Parallel Processing Program Manager
Jet Propulsion Laboratory
4800 Oak Grove Drive, MC 138-310
Pasadena, California 91109
DAVE_CURKENDALL@macq_smtp.Jpl.Nasa.Gov
Worked from (8/84-end of project):
As Hypercube Task Manager and later as Hypercube Project Manager, was
interested in the hypercube hardware development, its operating system,
particularly the asynchronous message-passing developments of Mercury
and Centaur, and in the development of large-scale simulations.
Now works on:
The development of discrete event simulation software for parallel
machines and techniques for the remote, interactive exploration of
large, image and geographical databases.
Contributed Section 18.3, Parallel Simulations that Emulate Function
Hong-Qiang Ding
Member of Technical Staff
Jet Propulsion Laboratory
4800 Oak Grove Drive
Mail Stop 169-315
Pasadena, California 91109
hding@redwood.jpl.nasa.gov
Worked from (8/87-end of project):
Extensive and large-scale simulations of QCD and quantum spin models.
Now works on:
Developing efficient methods for long-range interactions and molecular
simulations; simulating model superconductors on parallel machines.
Contributed Sections 6.3, Magnetism in the
High-Temperature Superconductor Materials; and 6.4, Phase Transitions
in Two-dimensional Quantum Spin Systems
David Edelsohn
IBM T. J. Watson Research Center
P. O. Box 218
Yorktown Heights, NY 10598-0218
c1dje@watson.ibm.com
Worked from (1989-end of project):
Computational astrophysics simulations of galaxy formation and
evolution, and cosmology using concurrent, multiscale, hierarchical
N-body and adaptive mesh refinement algorithms.
Now works on:
As a doctoral candidate at the Northeast Parallel Architectures Center,
Syracuse University, his research interests include computational
astrophysics simulations of galaxy formation and evolution, and
cosmology using concurrent, multiscale, hierarchical N-body and adaptive
mesh refinement algorithms; and object-oriented concurrent languages.
He is visiting IBM as an IBM Computational Science Graduate Fellow.
Contributed Section 12.8, Hierarchical
Tree-Structures as Adaptive Meshes
Ed Felten
Assistant Professor
Department of Computer Science
Princeton University
35 Olden Street
Princeton, New Jersey 08544
felten@cs.princeton.edu
Worked from (1984-end of project):
Research interests included a variety of issues surrounding how to
implement irregular and non-numerical applications on
distributed-memory systems.
Now works on:
How to build system software for parallel machines, and how to
construct parallel programs to use this system software. More
generally, my research interests include parallel and distributed
computing, operating systems, architecture, and performance modeling.
Jon Flower
President
ParaSoft Corporation
2500 E. Foothill Blvd.
Pasadena CA 91107
jwf@parasoft.com
Worked from (1983-end of project):
High-energy physics simulations; programming tools, debugging and
visualization. Founder and President of ParaSoft Corporation
Now works on:
Programming environments, tools, libraries for parallel computers.
Contributed Sections 5.2, A ``Packet'' History of
Message-passing Systems; 5.3, Parallel Debugging; 5.4, Parallel
Profiling; and 13.5, ASPAR
Geoffrey C. Fox
Professor of Computer Science and Physics
Director, Northeast Parallel Architectures Center
Syracuse University
111 College Place
Syracuse, New York 13244-4100
gcf@npac.syr.edu
Worked from (1981-end of project):
Involved as Principal Investigator with particular attention to
applications, algorithms, and software. Developed the theory of problem
architecture to describe and classify results of C³P. Developed concepts
in computational science education based on student involvement in C³P
and implemented new curricula initially at Caltech and later at Syracuse
University.
Now works on:
From 1990 until the present, directs the project at Syracuse University,
which has a similar spirit to C³P, but is aimed more at industry than at
academic problems.
Contributed Chapters 1, 3, 19, and 20; Sections 4.1,
4.2, 5.1, 6.1, 7.1, 9.1, 11.2, 11.3, 12.1, 13.1, 13.3, 13.7, 14.1, 15.1, and
18.1
Sandy Frey
President, Reliable Distributed Information Corporation
Pasadena, CA 91107
sandy@ccsf.caltech.edu
Worked from (1984-1988):
Studying the system problems of implementing a teraflop machine
with 1980s technology, and the data management problems involved
in implementing massive data intensive applications in parallel
processing environments, such as hypercubes.
Now works on:
Data management problems involved in implementing massive data
intensive applications in parallel processing environments, such
as hypercubes.
Wojtek Furmanski
Research Professor of Physics
Syracuse University
201 Physics Building
Syracuse, New York 13244-1130
furm@npac.syr.edu
Worked from (1986-end of project):
Developed a class of optimal collective communication algorithms
implemented on Caltech hypercubes, and applied in parallel implementation of
neuroscience simulations and machine vision algorithms.
Now works on:
Based on lessons learned in these early parallel simulations, developed
the MOVIE system, aimed at providing a general-purpose platform for
interactive HPCC environments. MOVIE, initially used for terrain image
analysis, is now being developed further at NPAC. Recently, an HPF
interpreter has been constructed on top of MOVIE, and the system is being
extended with the aim of integrating HPCC and Virtual Reality software
technologies toward a broadband-network-based televirtuality environment.
Contributed Chapter 17, MOVIE - Multitasking
Object-oriented Visual Interactive Environment
Jeff Goldsmith
California Institute of Technology
Mail code 350-74
Pasadena, California 91125
jeff@gg.caltech.edu
Worked from (1985-end of project):
Computer Graphics.
Now works on:
Computer Graphics, in particular, computer-designed motion.
Peter Gorham
Project Manager
University of Hawaii at Manoa
Honolulu, Hawaii 96822
gorham@fermion.phys.Hawaii.Edu
Worked from (1987-end of project):
My work with C³P came about through Tom Prince's involvement
with the project. Tom hired me as a postdoc in 1987 and I arrived in
February of that year. Tom was beginning a collaboration with
Shri Kulkarni of the Caltech Astronomy Department in two areas:
first, a program to develop code for bispectral analysis of
astronomical speckle interferograms taken with the Hale
telescope; and second, a search for new radio pulsars using the
Arecibo Observatory's
transit telescope. In both cases,
the telescopes involved were among the largest of their class and the
data sets to be produced could only be managed with a supercomputer.
Also in both cases, the data analysis lent itself very well to parallel
processing techniques.
Both programs were very successful and Tom and I had the pleasure of seeing two graduate students complete their PhD requirements in each of the research areas (Stuart Anderson, pulsars; and Andrea Ghez, infrared speckle interferometry). Something of order a dozen research papers came out of this effort before I left for my present position in July of 1991, and a steady stream of results has come out since.
Now works on:
The Deep Underwater Muon and Neutrino Detector (DUMAND) project. This project is developing a large, deep ocean Cherenkov detector which will be sensitive to high energy neutrino interactions and will have the capability to produce images of the sky in the ``light'' of neutrinos, with angular resolution of order 1 degree. The motivation behind such research arises from current belief that emission of high energy neutrinos may be a dominant process by which active galactic nuclei and QSOs release energy into their galactic environment. Detection of such neutrinos would provide unique information about the central engine of such galaxies.
Thomas D. Gottschalk
Member of the Professional Staff
California Institute of Technology
Mail code 356-48
Pasadena, California 91125
tdg@cithex.cithep.caltech.edu
tdg@bigbird.jpl.nasa.gov
Worked from (1987-end of project):
Concurrent multi-target tracking for SDI scenarios/applications.
Now works on:
Multi-target tracking (aircraft and space objects), surveillance systems
operations, including sensor tasking, and design rule checking for VLSI
systems.
Contributed Sections 9.8, Munkres Algorithm for
Assignment; and 18.4, Multi-Target Tracking
Gary Gutt
Member of the Technical Staff
Jet Propulsion Laboratory
4800 Oak Grove Drive
Mail Stop 183-401
Pasadena, California 91109
gmg@mg.jpl.nasa.gov
Worked from (4/88-5/89):
Numerical simulation of granular systems using the lattice grain dynamics
paradigm.
Now works on:
Microgravity containerless materials processing; development of electrostatic
and electromagnetic positioning techniques for use in microgravity
containerless materials processing.
Contributed Section 4.5, An Automata Model for
Granular Materials
Peter Halamek
Technical Staff Member
Jet Propulsion Laboratory
Mail Stop 301-125L
Pasadena CA 91109
pxh@hamlet.caltech.edu
Worked from (6/88-1/89):
Image processing; determination of 3D physical properties of objects from
2D camera images taken aboard a spacecraft.
Now works on:
Optical navigation related research: improving accuracy of extended body
center-finding on images of celestial bodies.
Paul G. Hipes
Vice President
Salomon Brothers, Inc.
7 World Trade Center
37th Floor
New York, New York 10048
hipes@daffy.sbi.com
Worked from (11/87-end of project):
Direct solvers for dense systems of linear equations, special purpose
matrix O.D.E. solvers, electron-molecule scattering problems
approached with Schwinger variational methods, atom-molecule
scattering problems approached by direct expansion methods, and Green's
function Monte Carlo techniques for stationary states of many-electron
systems.
Now works on:
the term structure of interest rates and related topics in fixed
income arbitrage.
Alex Ho
Research scientist
IBM Almaden Research Center
K54/802
650 Harry Rd.
San Jose, California 95120-6099
Worked from (7/85-end of project):
Pattern recognition, artificial intelligence, neural nets, robot navigation.
Now works on:
Massively parallel computing, programming models, architectures,
fault-tolerance, performance evaluation.
Mark A. Johnson
Senior Engineer/Scientist
IBM Corporation
Internal Zip 4441
11400 Burnet Road
Austin, Texas 78758
maj@austin.ibm.com
Worked from (1983-1986):
Pursued research that led to a Ph.D. in Statistical Physics. Primary
research interests included studying melting in a two-dimensional system
of interacting particles.
Now works on:
System architecture in the area of High End Technical Systems of the
Advanced Workstations and Systems Division of IBM.
Contributed Section 14.2, Melting in Two Dimensions
Jai Sam Kim
Associate Professor, Department of Physics
Pohang Institute of Science and Technology
Hyoja-dong San 31
Pohang 780-784, S. KOREA
jsk@vision.postech.ac.kr
Worked from (1986-1988):
Involved in the development of the hypercube simulator NSIM. Later,
he parallelized the FFT codes with the Italian visitors Aloisio and
collaborators. Their work on the prime factor DFT code demonstrated
the usefulness of Crystal_Router and also the limitations of the
store-and-forward routing method. He wrote the FORTRAN application
codes that were included in Solving Problems on Concurrent
Processors, Vol. 2 [Angus:90a].
Now works on:
Shortly before returning to his home country, Korea, he joined the
interactive parallelizer project described in Chapter 13. He has not
been heard from for some time, but has recently parallelized some
working PDE codes used by mechanical engineers, on both NSIM and PVM.
Adam Kolawa
Chairman/CEO ParaSoft Corporation
2500 E. Foothill Blvd., Suite 104
Pasadena, California 91107
ukola@flea.parasoft.com
Worked from (1983-end of project):
Development of system software for parallel computers.
Now works on:
Development of software tools.
Jeff Koller
Computer Scientist
Information Sciences Institute
4676 Admiralty Way
Marina del Rey, California 90292
koller@isi.edu
Worked from (1987-1989):
MOOS II operating system, application of novel optimization techniques to
dynamic load balancing and compiler optimization.
Now works on:
VLSI design and system software for next-generation parallel machines.
Contributed Sections 13.4, Optimizing Compilers by
Neural Networks; and 15.2, MOOS II: An Operating System for Dynamic
Load Balancing on the iPSC/1
Aron Kuppermann
Professor
California Institute of Technology
Mail Code 127-72
Pasadena, California 91125
aron@caltech.edu
Worked from (beginning to end of project):
Quantum mechanical reaction dynamics; reactive scattering methodologies
suitable for MIMD machines.
Now works on:
Adapting quantum mechanical reaction dynamics codes to new parallel
machines.
Contributed Section 8.2, Quantum Mechanical
Reactive Scattering using a High Performance Parallel Computer
Paulette C. Liewer
Member of the Technical Staff
Jet Propulsion Laboratory
4800 Oak Grove Drive
Mail Stop 198-231
Pasadena, California 91109
pauly@hyper-spaceport.jpl.nasa.gov
and
Visiting Associate in Applied Physics
California Institute of Technology
Mail Code 128-95
Pasadena, California 91125
Worked from (1986-end of project):
Concurrent algorithms for particle-in-cell codes.
Now works on:
3D plasma particle-in-cell codes; application of concurrent PIC codes to
problems in solar, space and laboratory plasmas.
Contributed Section 9.3, Plasma Particle-in-Cell
Simulation of an Electron Beam Plasma Instability
Gregory A. Lyzenga
Associate Professor of Physics
Harvey Mudd College
Physics Department
Claremont, California 91711
lyzenga@hmcvax.ac.hmc.edu
Worked from (1985-end of project):
Parallel solution of finite element problems as applied to
geophysics, solid mechanics, fluid mechanics, and
electromagnetics.
Now works on:
Solid earth geophysics; mechanics of earthquakes and
tectonic deformation
Miloje Makivic
Computational Research Scientist
Northeast Parallel Architectures Center
111 College Place
Syracuse, New York 13244-4100
miloje@npac.syr.edu
Worked from (1988-end of project):
As a graduate student in the Division of Mathematics, Physics and Astronomy at
Caltech, collaborated with the C3P group. After 1990, used parallel
resources at C3P to develop computational physics algorithms,
specifically Monte Carlo methods on parallel processors for strongly
correlated quantum systems: spin systems, high-temperature
superconductors, disordered superconducting thin films, and general quantum
critical phenomena. Also worked on a self-consistent perturbation theory
approach to heavy fermions and low-dimensional magnets.
Now works on:
From 1990 until September 1993, worked as a post-doctoral researcher in the
Physics Department of Ohio State University. Presently working at Syracuse
University (NPAC) on the application of parallel computing in industry and
science. Current projects include atmospheric data assimilation and financial
modelling.
Vincent McKoy
Professor
California Institute of Technology
Mail Code 127-72
Pasadena, California 91125
bvm@citchem.bitnet
Worked from (3/89-9/89):
Studies of collisions of electrons with polyatomic molecules.
Now works on:
Using variational procedures to obtain cross-sections for electronic
excitation of molecules by electron impact.
Contributed Section 8.3, Studies of
Electron-Molecule Collisions on Distributed-memory Parallel Computers
Paul Messina
CCSF Executive Director
California Institute of Technology
Mail Code 158-79
Pasadena, California 91125
messina@CCSF.Caltech.edu
Worked from (1987-end of project):
Involved as Co-Investigator with particular emphasis on acquiring and
managing the computing facilities, and on the systems issues of concurrent
computing environments.
Now works on:
From 1990 until the present, has directed the Caltech Concurrent
Supercomputing Facilities, which have pushed the approaches conceived in
C3P to higher levels of performance. Also manages the CASA gigabit
network testbed project, which explores issues in distributed supercomputing.
Contributed Chapter 2, Technical Backdrop
Steve Otto
Assistant Professor
Department of Computer Science and Engineering
Oregon Graduate Institute of Science and Technology
20000 NW Walker Rd., P. O. Box 91000
Portland, Oregon 97291-1000
otto@cse.ogi.edu
Worked from (1981-1989):
QCD, Computer chess, fine-grained parallel systems, combinatoric optimization
schemes.
Now works on:
Parallel languages and compilation techniques for scalable parallel
architectures; new combinatoric optimization algorithms.
Contributed Sections 6.6, Character Recognition by
Neural Nets; 7.5, Parallel Random Number Generators; 11.4, An Improved
Method for the Traveling Salesman Problem; 12.7, Sorting; 13.6,
Coherent Parallel C; and 14.3, Computer Chess
Jean Patterson
Technical Group Supervisor for
Remote Sensing Analysis Systems and Modeling Group
Jet Propulsion Laboratory
4800 Oak Grove Drive
Mail Code 198-231
Pasadena, California 91109
jep@yosemite.Jpl.Nasa.Gov
Worked from (1984-end of project):
Data analysis and modeling for remote sensing applications that use
high-performance parallel processing systems for the analysis. In
particular, has been involved with electromagnetic scattering and
radiation analysis, atmospheric radiative transfer, and synthetic
aperture radar processing.
Now works on:
Continues with electromagnetics and atmospheric radiative transfer
modeling. Key participants in the finite element work include
Tom Cwik, Robert Ferraro, Nathan Jacobi, Paulette Liewer, Greg Lyzenga,
and Jay Parker.
Contributed Section 9.4, Computational
Electromagnetics
Francois Pepin
Staff Member
Canadair Aerospace Group
11324 Meunier
Montreal H3L 2Z6, Canada
Worked from (6/87-end of project):
Simulation of viscous incompressible flows using vortex methods; fast
algorithms for N-body problems.
Now works on:
Simulation of compressible flows over transport aircraft.
Contributed Section 12.5, Fast Vortex Algorithm and
Parallel Computing
Tom Prince
Professor of Physics
California Institute of Technology
Mail Code 220-47
Pasadena, California 91125
prince@caltech.edu
Worked from (1985-end of project):
Diffraction-limited imaging with large ground-based optical
and infrared telescopes, very-high sensitivity searches for
pulsars in globular clusters, and searches for pulsars in
short-orbit binary systems.
Now works on:
Very-high speed data acquisition and analysis, image
enhancement of astronomical infrared maps, and pulsar search
and detection.
Peter Reiher
Member of the Technical Staff
Jet Propulsion Laboratory
4800 Oak Grove Drive
Mail Stop 525-3660
Pasadena, California 91109
Worked from (11/87-1989):
The TimeWarp operating system, parallel programming synchronization methods.
Now works on:
Parallel and distributed operating systems.
Contributed Section 15.3, Time Warp
Ken Rose
Assistant Professor
University of California at Santa Barbara
Department of Electrical and Computer Engineering
Santa Barbara, California 93106
rose@ece.ucsb.edu
Worked from (7/89-end of project):
Combination of principles of information theory with tools
from statistical physics for solving hard optimization
problems. Particular applications included fuzzy and hard
clustering (pattern recognition and neural networks), vector
quantization (coding/communications), and tracking.
Now works on:
Information theory (particularly rate-distortion theory),
pattern recognition, source coding, communications, signal
processing, and nonconvex optimization.
John Salmon
Research Fellow
California Institute of Technology
Mail Code 206-49
Pasadena, California 91125
johns@ccsf.caltech.edu
Worked from (8/83-end of project):
As a graduate student and post-doc, pursued research in
astrophysical applications, Fourier transforms, ray-tracing, parallel
I/O, and operating systems.
Now works on:
Fast tree-based methods for N-body problems and other applications
(hydrodynamics, panel methods, random fields) have dominated recent
attention; continues to work with the current incarnations of C3P
at Caltech.
Contributed Section 12.4, Tree Codes for N-body
Simulations
Anthony Skjellum
Assistant Professor of Computer Science
NSF Engineering Research Center for Computational Field Simulation
and Computer Science Department
Mississippi State University
P. O. Drawer CS
300 Butler Hall
Mississippi State, Mississippi 39762-5623
tony@cs.msstate.edu
Worked from (9/87-end of project):
Parallel libraries, message-passing systems, portability, and chemical
engineering applications (flowsheeting).
Now works on:
Same as above, plus standards in message passing (the Message Passing
Interface Forum) and heterogeneous high-performance clusters.
Contributed Sections 9.5, LU Factorization of
Sparse, Unsymmetric Jacobi Matrices; 9.6, Concurrent DASSL Applied to
Dynamic Distillation Column Simulation; and Chapter 16, The Zipcode
Message-passing System
Michael D. Speight
Registrar in Medical Radiodiagnosis, Royal Infirmary of Edinburgh
c/o Medical Statistics Unit
University of Edinburgh
Teviot Place
Edinburgh EH8 9AG, Scotland
mds3@edinburgh.ac.uk
Worked from (1989-end of project):
Biologically realistic neural simulation on parallel computers. Most
recent involvement was via Jim Bower's group doing neural simulation
work on the Intel Touchstone Delta.
Now works on:
Virtual reality systems and parallel computing for manipulating medical
images (e.g., human brain MRI scans).
Contributed Section 7.6, Parallel Computing in
Neurobiology: The GENESIS Project
Eric Van de Velde
Senior Research Fellow
California Institute of Technology
Mail Code 217-50
Pasadena, California 91125
evdv@ama.caltech.edu
Worked from (6/86-end of project):
Algorithms for concurrent scientific computing; multigrid and linear
algebra algorithms.
Now works on:
Multigrid, linear algebra, fluid flow, reaction-diffusion equations.
Contributed Section 9.7, Adaptive Multigrid
David Walker
Member of the Technical Staff
Building 9207A, MS-8083
P.O. Box 2009
Oak Ridge National Laboratory
Oak Ridge, TN 37831-8083
walker@msr.epm.ornl.gov
Worked from (3/86-8/88):
Parallel linear algebra, parallel CFD, benchmarking, programming paradigms,
parallel FFT algorithms.
Now works on:
Linear algebra software for MIMD machines, concurrent particle-in-cell
algorithms for plasma simulations, benchmarking, molecular dynamics.
Contributed Sections 6.2, Convectively-Dominated
Flows and the Flux-Corrected Transport Technique; and 8.1, Full and
Banded Matrix Algorithms
Brad Werner
Assistant Professor
University of California at San Diego
Scripps Institution of Oceanography
Center for Coastal Studies
Mail Code 0209
9500 Gilman Drive
La Jolla, California 92093-0209
werner@hayek.ucsd.edu
Worked from (1983-1987):
Simulation of the dynamics of granular materials.
Now works on:
Quantitative geomorphology, nearshore and desert processes, granular
materials, computer simulation, pattern formation.
Contributed Section 9.2, Geomorphology by
Micromechanical Simulations
Roy Williams
Senior Staff Scientist
Concurrent Supercomputing Facilities
California Institute of Technology
Mail Code 206-49
Pasadena, California 91125
roy@ccsf.caltech.edu
Worked from (2/86-end of project):
Programming paradigms and algorithms for unstructured triangular meshes.
Now works on:
General unstructured meshes; finite-element and finite-volume methods;
reaction-diffusion equations; mesh generation in complex geometries.
Contributed Chapter 10, DIME Programming
Environment; Sections 11.1, Load Balancing as an Optimization Problem,
12.2, Simulation of the Electrosensory System of the Fish
Gnathonemus petersii
; and 12.3, Transonic Flow
Carl Winstead
Assistant Scientist
California Institute of Technology
Mail Code 127-72
Pasadena, California 91125
clw@cco.caltech.edu
Worked from (3/89-9/89):
Computation of electron-molecule collision cross-sections with parallel
machines.
Now works on:
Electron-molecule collision cross-sections relevant to low-temperature
plasmas; improving methods and algorithms in such calculations.