GBIS Benchmark Header File: pde1

   ==================================================================
   ===                                                            ===
   ===           GENESIS Distributed Memory Benchmarks            ===
   ===                                                            ===
   ===                           PDE1                             ===
   ===                                                            ===
   ===   3-Dimensional Poisson Solver Using Red-Black Relaxation  ===
   ===                                                            ===
   ===                 Versions: Std F77, PARMACS, PVM 3.1,       ===
   ===                           Subset HPF                       ===
   ===                                                            ===
   ===         Original authors: J. Klose, M. Lemke               ===
   ===               Subset HPF: Vladimir Getov                   ===
   ===           SOR version by: Ivan Wolton                      ===
   ===                                                            ===
   ===                Inquiries: HPC Centre                       ===
   ===                           Computing Services               ===
   ===                           University of Southampton        ===
   ===                           Southampton SO17 4BJ, U.K.       ===
   ===                                                            ===
   ===   Fax: +44 703 593939   E-mail: support@par.soton.ac.uk    ===
   ===                                                            ===
   ===          Last update: June 1994; Release: 3.0              ===
   ===                                                            ===
   ==================================================================


1. Description
--------------
The benchmark solves the Poisson-Equation on a 3-dimensional grid.  The
PARMACS and PVM versions use parallel red-black succesive over relaxation
(SOR) with Chebyshev acceleration.  The sequential and HPF versions do not
yet implement the SOR algorithm, but will be upgraded when time permits.

Many problems in the area of scientific computing are formulated in
terms of partial differential equations. Typical application areas are
Computational Fluid Dynamics, Meteorology, Climate Research, Oil
Reservoir simulation and others. The resulting PDEs are discretized
on some grid structure. The PDE is then represented by a large set of
(non)linear equations, each of which couples values at neighbouring grid
point with each other. For time-dependent problems, this set of
equations has to be determined (integration phase) and solved (solution
phase) at each time step.

This benchmark is an extreme example of the class of PDE solvers, as
due to the simplicity of the discretization of Poisson's equation the
number of floating point operations per gridpoint is quite small relative
to more complex PDEs. The ratio of computation to communication is
thus rather low.

The parallelization is performed by grid splitting. A part of the
computational grid is assigned to each processor. After each
computational step, values at the boundary of the subgrids are exchanged
with nearest neighbours.


2) Operating Instructions
-------------------------

A. Sequential version

The sequential version automatically produces results for a range of
problem sizes. The problem size is determined by the grid size, which
is related to the parameter N. The number of internal grid points in each
direction is 2**N, giving 2**3N internal points in 3 dimensions. There
are also 2 boundary points in each direction but no work is done at
these points and so the problem size is dependent only on the number
of internal grid points.

The parameter N is varied from 3 to MMAX within the benchmark and
the benchmark performance calculated for each resulting problem size.
The value of MMAX can be changed by editing the PARAMETER statement
in the file pde1.inc. The maximum value of MMAX which is consistent with 
the available processor memory should be chosen.
The memory required for array  storage is approximately
{ 2 * (2**MMAX + 2)**3 } * 8 bytes, which gives the following table:

	MMAX		Approx Memory required (Mbyte)
	  5			  0.6 
	  6			  4.6
	  7			 35.
	  8			274.

For the largest problem size the relative sizes of the grid in each
dimension is varied whilst keeping the overall problem size constant. 
This variation in the shape of the grid for the same problem size 
can increase performance by allowing more efficient vectorization.

To achieve a given accuracy in the timing measurements the number of
relaxations timed by the benchmark is specified by the input
parameter NITER. This should be chosen so that the benchmarked time
for NITER cycles on the smallest problem size is at least 100 times the
clock resolution .  (If the clock resolution is unknown this can be
determined using the TICK1 benchmark). For larger problem sizes, the
value of NITER is automatically reduced (subject to a minumum value of 5)
to maintain the overall benchmarked time constant for each problem size.



Compiling and running the sequential benchmark:

1) Change value of MMAX in file pde1.inc, if appropriate, to give maximum 
   problem size compatible with the available memory. (see above)

2) To compile and link the benchmark type:   `make' for the distributed 
   version or `make slave' for the single-node version.

3) To run the benchmark type:     pde1

4) Input NITER, the number of relaxations (suggested value 320)

 
   Output from the benchmark is written to the file "result"


B) Distributed Version

In the distributed version of the program the problem size, the
number of processors, and the number of relaxations are input from 
the standard input on channel 5.

The problem size is proportional to the total grid size, which is 
determined by the input parameter NN. 
The number of internal grid points in each direction is 2**NN, 
giving 2**3NN points in 3 dimensions.

The number of processors over which the lattice is distributed
is determined by the input parameter LOGP which is the log to base 2 of 
the required number of processors, ie.  number of processors = 2**LOGP.
The specified number of processors is configured as a 3D grid
internally within the program.

The recommended minimum number of iterations, NITER, supplied as input
is 10. For the smaller problem size this may need to be increased
if the total runtime is small compared with the clock resolution,
however on the larger problem sizes increasing NITER will result in
unneccesarily long run times.

The size of the local lattice determines the size of the workspace
required in the node program. The size of this workspace is determined
by a PARAMETER statement in the file node.u of the form:

       PARAMETER (NWORKD = 160000)

The size of NWORKD should be changed if necessary to ensure that it is
greater than or equal to (2**NN + 4)**3/(2**LOGP), a warning message
is printed if this condition is not satisfied and the program halts.
The maximum size of NWORKD, and hence of the local lattice size, is
constrained by the available node memory. The node memory required
is approximately 3 * NWORKD * 8 bytes. 

Suggested Problem Sizes :

It is recommended that the benchmark is run with four standard problem
sizes, given by NN  = 6, 7, 8 and 9. 

The optimum SOR relaxation parameter for each problem size is calculated
within the program. The rate of convergence of the solution is
theoretically dependent only on the problem size, after a certain
number of iterations. The radius of convergence is printed in
the results file and can be used to check correct operation of the code.
A little more verification is needed on this feature but details
should be available shortly. Contact the benchmark distributor for
more information on this.

Note that it may not be possible to run the largest problem size on all
machines because of restrictions on the available memory.
The approximate total memory required for array storage is given by the 
following table:

	 NN	Approx value of (2**NN+4)**3	Approx Memory required (Mbyte)
	  6		  .32 * 10**6			   8
	  7		  2.3 * 10**6			  56
	  8		   18 * 10**6			 430
	  9		  138 * 10**6			3300

To find the minimum node memory required to run each problem size, the total 
memory required should be divided by the number of processors on which the 
benchmark is run. 

The number of processors to be used will obviously depend on the
system available. The most important measurement is likely to be for
the largest power of two that will fit in to the machine. If time
permits the variation of performance with number of processors should be
investigated by reducing the number of processors by successive
factors of two or four.


Compiling and running the distributed benchmark:

1) Change value of NWORKD in file node.u, if appropriate, to give maximum 
   work space compatible with the available memory. (see above)

2) To compile and link the benchmark type:   make  

3) To run the benchmark type:     pde1
 
4) Input parameters NN, LOGP, NITER on standard
   input.

   Output from the benchmark is written to the file "result"

$Id: ReadMe,v 1.5 1994/06/28 11:38:21 igl Exp igl $
High Performance Computing Centre
Submitted by Mark Papiani,
last updated on 10 Jan 1995.