We shall use the following example to illustrate the material introduced so far, and to motivate new functions.
Example 2.11 Jacobi iteration - sequential code

      REAL A(0:n+1,0:n+1), B(1:n,1:n)
      ...
      ! Main Loop
      DO WHILE (.NOT.converged)
         ! perform 4-point stencil
         DO j=1, n
            DO i=1, n
               B(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
            END DO
         END DO
         ! copy result back into array A
         DO j=1, n
            DO i=1, n
               A(i,j) = B(i,j)
            END DO
         END DO
         ...                           ! convergence test omitted
      END DO

The code fragment describes the main loop of an iterative solver where, at each iteration, the value at a point is replaced by the average of its North, South, East, and West neighbors (a four-point stencil is used to keep the example simple). Boundary values do not change. We focus on the inner loop, where most of the computation is done, and use Fortran 90 syntax for clarity. Since this code has a simple structure, a data-parallel approach can be used to derive an equivalent parallel code. The array is distributed across processes, and each process is assigned the task of updating the entries of the part of the array it owns.
A parallel algorithm is derived from a choice of data distribution. The distribution should be balanced, allocating (roughly) the same number of entries to each processor, and it should minimize communication. The figure below illustrates two possible distributions: a 1D (block) distribution, where the matrix is partitioned in one dimension, and a 2D (block,block) distribution, where the matrix is partitioned in two dimensions. A balanced distribution is computed as shown in the sketch below.
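To make the load-balance requirement concrete, the following sketch computes a balanced 1D column-block distribution; it is not part of the original examples, but uses the same arithmetic that reappears in Example 2.12.

      ! Sketch: block sizes differ by at most one column.  The first
      ! MOD(n,p) processes receive n/p+1 columns, the remaining n/p.
      m = n/p                          ! integer division
      IF (myrank .LT. MOD(n,p)) THEN
         m = m + 1
      END IF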
Figure: Block partitioning of a matrix.
Since communication occurs at block boundaries, communication volume is minimized by the 2D partition, which has a better area-to-perimeter ratio. However, with this partition each processor communicates with four neighbors, rather than with two neighbors in the 1D partition. When the ratio n/P (P is the number of processors) is small, communication time is dominated by the fixed overhead per message, and the first partition leads to better performance. When the ratio is large, the second partition results in better performance. In order to keep the example simple, we shall use the first partition; a realistic code would use a ``polyalgorithm'' that selects one of the two partitions according to problem size, number of processors, and communication performance parameters.
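As a hedged illustration of such a polyalgorithm (not from the original text), the sketch below compares the estimated per-iteration communication time of the two partitions under a simple linear cost model; alpha (per-message overhead) and beta (per-word transfer time) are assumed machine parameters that would be obtained by measurement.

      ! Sketch only: select a partition from an assumed cost model.
      ! 1D partition: 2 neighbor exchanges of n words each.
      ! 2D partition: 4 neighbor exchanges of about n/SQRT(p) words each.
      REAL alpha, beta, t1d, t2d
      LOGICAL use_1d
      ...                              ! alpha, beta set from measurements
      t1d = 2.0*(alpha + beta*REAL(n))
      t2d = 4.0*(alpha + beta*REAL(n)/SQRT(REAL(p)))
      use_1d = (t1d .LE. t2d)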
The value of each point in array B is computed from the values of the four neighbor points in array A. Communications are needed at block boundaries in order to receive the values of neighbor points that are owned by another processor. Communications are simplified if an overlap area is allocated at each processor for storing the values to be received from the neighbor processor. Essentially, storage is allocated for each entry both at the producer and at the consumer of that entry: if an entry is produced by one processor and consumed by another, then storage for it is allocated at both processors. With such a scheme there is no need for dynamic allocation of communication buffers, and the location of each variable is fixed. Such a scheme works whenever the data dependencies in the computation are fixed and simple. In our case, they are described by a four-point stencil; therefore, a one-column overlap is needed for a 1D partition.
We shall partition array A with a one-column overlap. No such overlap is required for array B. The figure below shows the extra columns in A and how data is transferred at each iteration.
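The local storage layout this implies is sketched below; m denotes the local block width, computed as in Example 2.12, and the allocation matches the one used there.

      ! Sketch of the local layout at process myrank (1D partition,
      ! one-column overlap on each side):
      !
      !   A(1:n, 0)      copy of neighbor myrank-1's last column
      !   A(1:n, 1:m)    columns owned by this process
      !   A(1:n, m+1)    copy of neighbor myrank+1's first column
      !
      REAL, ALLOCATABLE :: A(:,:), B(:,:)
      ALLOCATE (A(0:n+1,0:m+1), B(1:n,1:m))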
We shall use an algorithm where all the values needed from a neighbor are brought in one message. Coalescing communications in this manner reduces the number of messages and generally improves performance; a short sketch of the cost difference follows the figure.
Figure: 1D block partitioning with overlap and communication pattern for Jacobi iteration.
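As a sketch of what coalescing buys (this contrast is not part of the original code), compare sending a boundary column element by element with sending it as one message. Since Fortran stores arrays in column-major order, the column B(1:n,1) is contiguous in memory, so a single send of count n suffices.

      ! Uncoalesced (for illustration only): n messages of one word each,
      ! paying the per-message overhead n times.
      DO i = 1, n
         CALL MPI_SEND(B(i,1), 1, MPI_REAL, myrank-1, tag, comm, ierr)
      END DO
      ! Coalesced, as in Examples 2.12 and 2.13: one message carrying
      ! the entire boundary column.
      CALL MPI_SEND(B(1,1), n, MPI_REAL, myrank-1, tag, comm, ierr)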
The resulting parallel algorithm is shown below.
Example 2.12 Jacobi iteration - first version of parallel code

      ...
      REAL, ALLOCATABLE :: A(:,:), B(:,:)
      ...
      ! Compute number of processes and myrank
      CALL MPI_COMM_SIZE(comm, p, ierr)
      CALL MPI_COMM_RANK(comm, myrank, ierr)
      ! Compute size of local block
      m = n/p
      IF (myrank.LT.(n-p*m)) THEN
         m = m+1
      END IF
      ! Allocate local arrays
      ALLOCATE (A(0:n+1,0:m+1), B(n,m))
      ...
      ! Main Loop
      DO WHILE (.NOT.converged)
         ! Compute
         DO j=1, m
            DO i=1, n
               B(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
            END DO
         END DO
         DO j=1, m
            DO i=1, n
               A(i,j) = B(i,j)
            END DO
         END DO
         ! Communicate
         IF (myrank.GT.0) THEN
            CALL MPI_SEND(B(1,1), n, MPI_REAL, myrank-1, tag, comm, ierr)
            CALL MPI_RECV(A(1,0), n, MPI_REAL, myrank-1, tag, comm, status, ierr)
         END IF
         IF (myrank.LT.p-1) THEN
            CALL MPI_SEND(B(1,m), n, MPI_REAL, myrank+1, tag, comm, ierr)
            CALL MPI_RECV(A(1,m+1), n, MPI_REAL, myrank+1, tag, comm, status, ierr)
         END IF
         ...
      END DO

This code is unsafe: each process executes its sends before posting any receive, so if the system does not buffer the outgoing messages, all processes may block in MPI_SEND and deadlock. One way to get a safe version of this code is to alternate the order of sends and receives: odd-rank processes will first send, then receive, and even-rank processes will first receive, then send. Thus, one achieves the communication pattern of Example .
The modified main loop is shown below. We shall later see simpler ways of dealing with this problem.
Example 2.13 Main loop of Jacobi iteration - safe version of parallel code

      ...
      ! Main Loop
      DO WHILE (.NOT.converged)
         ! Compute
         DO j=1, m
            DO i=1, n
               B(i,j) = 0.25*(A(i-1,j)+A(i+1,j)+A(i,j-1)+A(i,j+1))
            END DO
         END DO
         DO j=1, m
            DO i=1, n
               A(i,j) = B(i,j)
            END DO
         END DO
         ! Communicate
         IF (MOD(myrank,2).EQ.1) THEN           ! myrank is odd
            CALL MPI_SEND(B(1,1), n, MPI_REAL, myrank-1, tag, comm, ierr)
            CALL MPI_RECV(A(1,0), n, MPI_REAL, myrank-1, tag, comm, status, ierr)
            IF (myrank.LT.p-1) THEN
               CALL MPI_SEND(B(1,m), n, MPI_REAL, myrank+1, tag, comm, ierr)
               CALL MPI_RECV(A(1,m+1), n, MPI_REAL, myrank+1, tag, comm, status, ierr)
            END IF
         ELSE                                   ! myrank is even
            IF (myrank.GT.0) THEN
               CALL MPI_RECV(A(1,0), n, MPI_REAL, myrank-1, tag, comm, status, ierr)
               CALL MPI_SEND(B(1,1), n, MPI_REAL, myrank-1, tag, comm, ierr)
            END IF
            IF (myrank.LT.p-1) THEN
               CALL MPI_RECV(A(1,m+1), n, MPI_REAL, myrank+1, tag, comm, status, ierr)
               CALL MPI_SEND(B(1,m), n, MPI_REAL, myrank+1, tag, comm, ierr)
            END IF
         END IF
         ...
      END DO
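One of the simpler mechanisms alluded to above is MPI_SENDRECV, which combines a send and a receive in one call and is safe without any rank-parity argument. The sketch below rewrites the communication phase this way; the use of MPI_PROC_NULL for the missing neighbors at the boundary ranks is our assumption here, not taken from the examples above.

      ! Sketch only: the same exchange as in Example 2.13, written with
      ! MPI_SENDRECV.  Nonexistent neighbors are mapped to MPI_PROC_NULL,
      ! which turns the corresponding send/receive into a no-op.
      INTEGER left, right
      left  = myrank - 1
      right = myrank + 1
      IF (myrank.EQ.0)   left  = MPI_PROC_NULL
      IF (myrank.EQ.p-1) right = MPI_PROC_NULL
      ! send first column left; receive right neighbor's first column
      CALL MPI_SENDRECV(B(1,1), n, MPI_REAL, left, tag,            &
                        A(1,m+1), n, MPI_REAL, right, tag,         &
                        comm, status, ierr)
      ! send last column right; receive left neighbor's last column
      CALL MPI_SENDRECV(B(1,m), n, MPI_REAL, right, tag,           &
                        A(1,0), n, MPI_REAL, left, tag,            &
                        comm, status, ierr)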