Building the Collective Communications Module from the reference source

This document discusses building the Collective Communications Module from the reference source. In particular, it discusses building CCM from the MPI and Shmem reference implementations on the platforms listed below. It also discusses building for various sets of array ranks and porting the module to other platforms.

Versions and platforms

There are two primary versions and reference implementations of the Collective Communications Module. The first is based on MPI, the Message Passing Interface. A good discussion of MPI can be found at www-unix.mcs.anl.gov/mpi/. The second reference implementation is based on the Shared Memory or shmem routines defined by Cray and adapted by SGI. The companies' links for information on the shmem routines are www.cray.com/products/software/mpt.html and www.sgi.com/software/mpt/.

The make files distributed with the reference implementations will compile for the platforms listed below. There are no differences in MPI source for the various platforms for which it is compiled. Except for some minor differences, the shmem implementation has the same source for each platform. Obviously, there are different compilers and library settings for each machine and implementation. These differences are isolated into a small make include file.

Building with a predefined make include file

To build the reference implementations do the following.

1. Untar the source file in its own directory. For the MPI implementation:

mkdir mpi_src
mv mpi_src.tar mpi_src
cd mpi_src
tar -xf mpi_src.tar

or, for the shmem implementation:

mkdir shmem_src
mv shmem_src.tar shmem_src
cd shmem_src
tar -xf shmem_src.tar

2. Setenv

There is an environmental variable referenced in the make file that points to a make include file. The variable is CCM_COM. It needs to be set as follows:

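setenv CCM_COM value

where value is the entry in the table below for your platform and communications library:

                 Communications Library
Platform         MPI        Shmem
Apple OSX        darwin     NA
IBM SP           aix        NA
SGI              sgi_mpi    sgi_shmem
Cray SV1         NA         sv1_shmem
Cray T3e         t3e_mpi    t3e_shmem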

For example, on an IBM SP running tcsh you would type:

setenv CCM_COMM aix

3. Make

Type:
make make_mod
make

and the Collective Communications Module will be built.

What gets built?

The final products of the make are a library, libccm.a and a module file. The name of the module file is system dependent but is something like ccm.mod or CCM.mod. On most systems, both the library and mod files are required to run a program using the module.

A large number of intermediate "*.o" or object files are also created. These are made from "*.f90" source files. Most of the "*.f90" files are created by a preprocessor, make_mod. The Fortran source for make_mod is included with the distribution. Make_mod is actually run by the make file using a script, make_mod_script. The script and program take "*.input" files as input and produce the "*.f90" files. The preprocessor is discussed in more detail below.

There is also a file created called build_report. Build_report contains a collection of html formatted tables. The tables show the data types and ranks of arrays that are supported by different routines in the module. For example for ccm_gather we have the entries:

gather   1   2   3
0        y   y   y
1        y   y   y
2        y   y   y
3        y   y   y
gather   real(b4)  real(b8)  integer(def_int)  complex(c4)  complex(c8)  logical  character

This indicates that this routine was built for input arrays of rank 0 to 3 and output ranks 1 to 3, for two real sizes, integer, two complex sizes, logical and character variables. The methodology for adding higher array ranks is discussed in the "Preprocessor" section below.

What is in the make include file?

The make include file gives the flags and paths required to build the module. The one shown below is for compiling with the MPICH version of MPI on Apple OSX:

# for MPICH on Apple OSX
FFLAGS= 
MPICH=/usr/local/mpich
F90=f90 $(FFLAGS)
PF90=f90 $(FFLAGS)
INC_DIR= -I$(MPICH)/include
LIB_DIR= -L$(MPICH)/lib
LIB=-lfmpich -lmpich -lpmpich -lU77 -lc
RANLIB=ranlib

There are no special flags required to build using MPICH so the FFLAGS line is blank. The next line points to the base MPICH directory. The lines F90 and PF90 are the serial and parallel Fortran 90 compile lines. The next two lines are the "-I" and "-L" options for the MPICH include files and library. The LIB line gives all of the libraries required to build an application using MPICH. The final line points to ranlib. Note that on the SGI and SV1 ranlib is not required, so it is replaced with a dummy call to /bin/ls.

Preprocessor

Purpose

As discussed above, each routine is built for a collection of data types and array ranks. We must have a separate routine for each data type and input/output array rank pair so that we can take advantage of Fortran's capability to determine an array size for arguments passed to subroutines. This leads to a large number of routines. The ideal way to handle the large number of routines would be to use templates. We would write a generic routine once and let the compiler create the various instances that we need. Unfortunately, Fortran 90 lacks the template capability. The preprocessor make_mod is used instead of templates.

Make_mod takes as input the data types, the array ranks, and a generic source file, *.input. It produces an output file that contains routines for the requested data types and array ranks.

Source data

The input files have a header and generic source. Consider the input file for the MPI version of CCM_BCAST.

.true.
bcast
0 3 0 3
sp real(b4) myreal
dp real(b8) mydouble
in integer(def_int) myint
comp complex(c4) mycomp
dpcomp complex(c8) mydpcomp
logical logical mylogical
character character mycharacter
        subroutine $1_$2(x,root,the_err) !(x,to_send,root,the_err)
!base1 rank input .eq. 0 .and.  rank output .eq. 0
same
        end subroutine $1_$2
        subroutine $1_$2(x,root,the_err) !(x,to_send,root,the_err)
!base1 rank other than zero
            use ccm_numz
            use ccm_error_mod
            use ccm_rs_mod
            implicit none
            $3,intent (inout) :: x$4
            integer, optional, intent(in) :: root
            integer, optional, intent(out) :: the_err
            integer :: local_root,local_size
            integer :: r1,r2,mytype
            if(present(the_err))the_err=0
            r1=ccm_rank(x)
            r2=r1
            mytype=$6
            if(iand(do_tests,ccm_trace) .ne. 0) &
                write(*,*)myid," entered $1 with in/out ranks ",r1,r2," data type $3" 
            if(present(root))then
                local_root=root
            else
                local_root=MPI_ROOT
            endif
            local_size=1       ! r1=0
            local_size=size(x) ! r1>0
!start of error tests
            if(iand(do_tests,ccm_checksize) .ne. 0)then
                test_ray1(1)=local_size
                call ccm_check_sizes("bcast",test_ray1,test_ray2,test_ray3,local_root,passed_test)
                if(.not. passed_test)then
                    call ccm_warning("bcast",mytype,r1)
                    if(present(the_err))the_err=1
                    return
                endif
            endif
!end of error tests
            call MPI_BCAST(x,local_size,mytype,local_root,mycomm,MPI_ERR)
            if(mpi_err .ne. 0)then
                write(err_str1,"(""low level communication error:"",i5)")mpi_err
                call ccm_fatal("bcast",mytype,r1)
            endif
        end subroutine $1_$2

The first line .true. indicates that we want to define this routine only for input and output arrays of the same rank. (Note that this routine could be compiled with mixed rank arrays but for bcast this is not normally required since the input and output arrays are the same.)

The second line bcast gives the name of the routine. When the preprocessor creates the actual source this will be prepended with "ccm_" so the generic routine is actually called as

call ccm_bcast(....

The next line gives the ranks of the arrays for which this routine is defined, input ranks of 0 to 3 and output ranks of 0 to 3.

The next lines:

sp real(b4) myreal
dp real(b8) mydouble
in integer(def_int) myint
comp complex(c4) mycomp
dpcomp complex(c8) mydpcomp
logical logical mylogical
character character mycharacter

give information about the specific instances of the generic routine. The text in the first field is appended to the generic routine name to give base specific routine names. The base specific routine names have the ranks of the arrays appended to them to give a collection of specific routine names for a given generic routine.

The next field gives the data type for the routine. The kinds are defined in ccm_numz_mod.f90 to map to the two normal real and complex types, and the default integer type.

The third field gives an MPI data type. These are set in ccm_init_mod.f90. For the shmem version of the source these are dummy values but there is an additional field that indicates the data type for the shmem_put routines.

The rest of the file is the generic source for the routine. There are actually two blocks of text delineated by the lines:

subroutine $1_$2
end subroutine $1_$2

The first "subroutine" is for sending scalars only. The second subroutine is for all other cases. The line same in the first block tells the preprocessor to use the same source for the scalar and vector versions of the routine.

You will notice that the source contains $1, $2, $3, and so on. The preprocessor replaces these with other text, as indicated in the table below. Note that many of the replacements depend on the input and output array ranks:

$1 mod_name, ccm_bcast in this case
$2 routine name with ranks replaced with 00, 01, ... nm
$3 data type
$4 input data rank, replaced with one of the following
"               " "(:)            "
"(:,:)          " "(:,:,:)        "
"(:,:,:,:)      " "(:,:,:,:,:)    "
"(:,:,:,:,:,:)  " "(:,:,:,:,:,:,:)"
$5 output data rank, replaced with one of the following
"               " "(:)            "
"(:,:)          " "(:,:,:)        "
"(:,:,:,:)      " "(:,:,:,:,:)    "
"(:,:,:,:,:,:)  " "(:,:,:,:,:,:,:)"
$6 mpitype
$f input data rank, replaced with one of the following
"               " "(:)            "
"(:,1)          " "(:,1,1)        "
"(:,1,1,1)      " "(:,1,1,1,1)    "
"(:,1,1,1,1,1)  " "(:,1,1,1,1,1,1)"
$g output data rank, replaced with one of the following
"               " "(:)            "
"(:,1)          " "(:,1,1)        "
"(:,1,1,1)      " "(:,1,1,1,1)    "
"(:,1,1,1,1,1)  " "(:,1,1,1,1,1,1)"
$h routine name without ranks
$i shmem data type
$j pointer rank, replaced with " " for scalar, "(:)" for arrays
$k pointer rank, replaced with " " for scalar, "(:)" for arrays
$l i, numeric value for input rank 0 - n
$m j, numeric value for output rank 0 - n
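
The token replacement can be pictured with a short sketch. The Python below is not the make_mod source (that is Fortran and ships with the distribution). The function names rank_suffix and expand are hypothetical, and only the $3, $4 and $6 tokens from the table are modelled.

```python
import re

def rank_suffix(rank, width=15):
    """Build the declarator for a rank, e.g. 2 -> "(:,:)", padded to the table's width."""
    text = "" if rank == 0 else "(" + ",".join(":" * rank) + ")"
    return text.ljust(width)

def expand(line, tokens):
    """Replace each $x token that has an entry in tokens; leave unknown tokens alone."""
    return re.sub(r"\$([0-9a-m])",
                  lambda m: tokens.get(m.group(1), m.group(0)),
                  line)

# Values for the "sp" instance of bcast at rank 2, taken from the header lines above.
tokens = {"3": "real(b4)", "4": rank_suffix(2), "6": "myreal"}

print(expand("$3,intent (inout) :: x$4", tokens).rstrip())
print(expand("mytype=$6", tokens))
```

Run as written, the two template lines become a rank 2 real(b4) declaration and an assignment of myreal to mytype. Tokens without an entry, such as $1 and $2 here, are left untouched.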

The preprocessing program also has some additional filters that allow conditional compilation. Lines that contain the text in the table shown below are only added to the final source if the routine that is being generated is of the prescribed rank.

Lines containing the following    are compiled if
! r1=0                            input rank is 0
! r1>0                            input rank is > 0
! r2=0                            output rank is 0
! r2>0                            output rank is > 0
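
As a rough illustration of these filters, the sketch below keeps or drops template lines based on the four markers. It is a Python model written for this document, not the logic of make_mod itself. The name keep_line is hypothetical, and the assumption is that a line with no marker is always kept.

```python
def keep_line(line, r1, r2):
    """Return True if a template line belongs in the routine for input rank r1 and output rank r2."""
    markers = {
        "! r1=0": r1 == 0,
        "! r1>0": r1 > 0,
        "! r2=0": r2 == 0,
        "! r2>0": r2 > 0,
    }
    # A line is dropped only if it carries a marker whose condition is false.
    return all(ok for marker, ok in markers.items() if marker in line)

# Three lines from the bcast template shown earlier.
template = [
    "local_size=1       ! r1=0",
    "local_size=size(x) ! r1>0",
    "call MPI_BCAST(x,local_size,mytype,local_root,mycomm,MPI_ERR)",
]

scalar_version = [line for line in template if keep_line(line, 0, 0)]
array_version = [line for line in template if keep_line(line, 2, 2)]
```

With these inputs the scalar version keeps only the local_size=1 assignment and the array version keeps only the size(x) assignment. Both keep the MPI_BCAST call.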

Specifying array ranks and data types

The routines in the reference implementation can be built for array ranks up to 7 by using the preprocessor. All that is required is to change the third line of the *.input file. However, this is not advisable unless absolutely needed. The size of the module is proportional to the square of the maximum array rank. For a single generic routine that is defined for array ranks 0-7 there are (8 input array ranks)*(8 output array ranks)*(7 data types) = 448 subroutines. Note that the need for higher ranks can be reduced by using the Fortran 90 reshape function. Also, the number of routines can be reduced by removing data type lines from the *.input files by using the standard Fortran 90 comment indicator "!".
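
The arithmetic above can be checked with a few lines of Python. This is only a counting sketch. The function name count_routines is hypothetical, and the same_rank_only flag reflects an assumption about what the .true. first line of an *.input file does, namely restrict generation to pairs with equal input and output rank.

```python
def count_routines(in_lo, in_hi, out_lo, out_hi, n_types, same_rank_only=False):
    """Count specific routines generated for one generic routine."""
    in_ranks = range(in_lo, in_hi + 1)
    out_ranks = range(out_lo, out_hi + 1)
    if same_rank_only:
        pairs = [(i, j) for i in in_ranks for j in out_ranks if i == j]
    else:
        pairs = [(i, j) for i in in_ranks for j in out_ranks]
    return len(pairs) * n_types

print(count_routines(0, 7, 0, 7, 7))   # 448, the figure quoted above
print(count_routines(0, 3, 0, 3, 7))   # 112 for ranks 0-3
```

Commenting out two of the seven data type lines would drop the rank 0-7 count from 448 to 320, which shows why trimming unused types is worthwhile.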

Minimum lower bound values for array ranks

In order for an implementation to work correctly, the lower bounds for the input and output array ranks must have the values given in the table below. That is, the first and third values on the third line of the *.input files must have the following values. Note that ccm_send and ccm_recv are only defined for rank 1 arrays.

Routine          Input array   Output array
ccm_allreduce    0             0
ccm_alltoall     1             1
ccm_alltoallv    1             1
ccm_bcast        0             0
ccm_gather       0             1
ccm_gatherv      1             1
ccm_reduce       0             0
ccm_scatter      1             0
ccm_scatterv     1             1

Utility routines not defined in the API
Routine          Input array   Output array
ccm_rank         0             na
ccm_size         0             na
ccm_send         1             na
ccm_recv         na            1

Portability of the MPI reference implementation

Moving to other platforms

An effort was made to make the MPI reference implementation portable. The source should require only a few, if any, changes to move to a new platform. The known primary potential problem areas are related to data sizes. The data types used within the implementation are real(b8), real(b4), complex(c8), complex(c4), and integer(def_int) where we have b8, b4, c8, c4, and def_int defined in ccm_numz_mod.f90 as:

    integer, parameter:: b8 = selected_real_kind(10) 
    integer, parameter:: b4 = selected_real_kind(5)
    integer, parameter:: c8 = selected_real_kind(10) 
    integer, parameter:: c4 = selected_real_kind(5) 
    integer, parameter:: def_int = kind(iccm_dummy_int)
    

On most platforms these map to 4 and 8 byte reals, 8 and 16 byte complex values and the default integer which is either 2, 4, or 8 bytes. For some platforms the selected_real_kind parameters might need to be changed.

For MPI routines we create the types: myreal, mydouble, mycomp, mydpcomp and myint. These MPI types are defined in the routine ccm_init which is in the file ccm_init_mod.f90. The case statement


          case(200)
          ! use mpi_real4 and mpi_real8
             call MPI_TYPE_CONTIGUOUS(1,mpi_real4,myreal,mpi_err)
             call MPI_TYPE_COMMIT(myreal,mpi_err)
             if(mpi_err .ne. 0)call ccm_mpi_error("ccm_init")
             call MPI_TYPE_CONTIGUOUS(1,mpi_real8,mydouble,mpi_err)
             call MPI_TYPE_COMMIT(mydouble,mpi_err)
             if(mpi_err .ne. 0)call ccm_mpi_error("ccm_init")
          case default

defines the first two types. The complex types are defined in terms of these. For other platforms selecting another block of the case statement might be required to get the correct mapping between Fortran and MPI data types.

Some Fortran compilers support three real and complex data types; the third is sometimes called quad precision. Quad precision real values use 16 bytes. (On the Cray SV1 double precision values use 16 bytes, normal reals 8 bytes, but the SV1 also has a 4 byte real.)

Not all Fortran compilers have a third real data type and MPI does not have direct support for quad precision reals. So by default, the MPI reference implementation builds for two real data types. To build the MPI implementation for quad precision do the following:

(1) In the *.input files remove ! from the lines

!qp real(b16) 
!qpcomp complex(c16) 

     The script to_quad can be used to perform these edits
     
     to_quad *input


(2) In the file ccm_merge_mod.f90 remove the initial ! from lines containing !qp

    The script to_quad can be used to perform these edits
    
    to_quad ccm_merge_mod.f90

(3) Make as discussed above
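
The edits in steps (1) and (2) amount to removing the leading "!" from lines that carry the "!qp" tag. The Python below is a sketch of that rule only. It is not the to_quad script, whose contents are not shown here. The name enable_quad is hypothetical, and the third sample line is made up to stand in for a tagged line in ccm_merge_mod.f90.

```python
def enable_quad(lines):
    """Remove the initial "!" from commented lines that contain "!qp"."""
    out = []
    for line in lines:
        if "!qp" in line and line.lstrip().startswith("!"):
            line = line.replace("!", "", 1)   # drop only the first "!"
        out.append(line)
    return out

sample = [
    "dp real(b8) mydouble",
    "!qp real(b16) ",
    "!    some_statement  !qp",     # hypothetical tagged source line
    "! an ordinary comment",
]

edited = enable_quad(sample)
```

Lines without the tag, including ordinary comments, pass through unchanged.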

Portability of the shmem implementation

There is a constant


	integer, parameter :: maxprocs=64

defined in ccm_numz_sgi.f90 and ccm_numz_sv1.f90 that sets the maximum number of processors that can be used by the module. This parameter sets the sizes of two synchronization arrays


	integer :: sync_block
	common /ccm_p2p/ sync_block(-1:maxprocs*2)

in the same file and


        integer(selected_int_kind(18)) :: mycray(4,maxprocs)

in the ccm_checkin routine. To run on a larger number of processors, maxprocs must be increased. If you try to run on a number of processors greater than maxprocs, a warning message will be printed and the job will stop. Maxprocs is a constant, instead of a variable, to avoid the use of Cray style pointers.

Unfortunately, the Cray/SGI shmem library routines are not portable across a large number of platforms. Indeed, there are significant differences in the implementation of the shmem routines across various SGI and Cray machines. Thus a different philosophy was adopted for the shmem implementation.

The routines that are part of the API are defined in terms of generic send and receive operations. The shmem ccm_bcast routine, for example, does not contain a reference to shmem routines. It does call ccm_send and ccm_recv. Ccm_send and ccm_recv are defined for each data type in terms of shmem routines. These routines are only needed and defined for rank 1 arrays.
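
The layering described here can be illustrated with a small model in which a broadcast is written purely in terms of send and receive. This is a Python sketch under stated assumptions, not the CCM source. The Mailboxes class is a hypothetical in-memory stand-in for the transport, and the linear fan-out from the root is an assumed algorithm chosen for brevity.

```python
from collections import defaultdict, deque

class Mailboxes:
    """In-memory stand-in for point-to-point transport between ranks."""
    def __init__(self):
        self.queues = defaultdict(deque)   # keyed by (source, destination)

    def send(self, source, dest, data):
        self.queues[(source, dest)].append(list(data))

    def recv(self, source, dest):
        return self.queues[(source, dest)].popleft()

def bcast(net, myid, nprocs, root, data=None):
    """Broadcast built only from send and recv (linear fan-out from root)."""
    if myid == root:
        for dest in range(nprocs):
            if dest != root:
                net.send(root, dest, data)
        return list(data)
    return net.recv(root, myid)

net = Mailboxes()
nprocs, root = 4, 1
# Run the root first so its messages are queued before the others receive.
order = [root] + [r for r in range(nprocs) if r != root]
results = {r: bcast(net, r, nprocs, root, [1.0, 2.0] if r == root else None)
           for r in order}
```

Because bcast never touches the transport directly, swapping Mailboxes for another send and receive pair leaves the collective unchanged. That is the property that makes the shmem implementation a convenient starting point for a port.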

The shmem implementation is a good place to start when creating a new implementation of the API. Replacing the routines ccm_send and ccm_recv would constitute most of the work.

There is a coding practice used in the shmem implementation routines that could cause problems when moving to a new platform. Several of the routines make use of single dimensional pointers to access whole arrays of higher ranks. This works on most platforms but it is not universally portable.

As a simple example, say we have a two dimensional array, x(10,10), and a pointer, y(:). Some of the routines do things similar to the following to set values for the 2d array.

y=>x(:)

do i=1,100
	y(i)= ....
enddo

Although this works on most platforms, the Fortran standard dictates that we should only reference ten elements of the array using this scheme. For the shmem routines to be Fortran compliant, these types of operations with pointers should be replaced with explicit array copies.

By default, the shmem reference implementation builds for two real and complex data types, 4 and 8 bytes. To build the shmem implementation for 16 byte data types do the following:

(1) In the *.input files remove ! from the lines

!qp real(b16) 
!qpcomp complex(c16) 

     The script to_quad can be used to perform these edits
     
     to_quad *input


(2) In the file ccm_merge_mod.f90 remove the initial ! from lines containing !qp

    The script to_quad can be used to perform these edits
    
    to_quad ccm_merge_mod.f90

(3) Make as discussed above

Finally, it was discovered that some shmem implementations have problems sending complex values in some instances. To get around this problem, complex values are sent as a pair of reals. Contrary to what the online documentation says, the SV1 does not have a put operation for character values. One was written in terms of integers.