-------------------------------------------------------------------------------
Name of Program         : PSTSWM
                          (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worley@msr.epm.ornl.gov
-------------------------------------------------------------------------------
Cognizant Expert(s)     : Patrick H. Worley
CE's Organization       : Oak Ridge National Laboratory
CE's Address            : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
CE's Telephone #        : (615) 574-3128
CE's Fax #              : (615) 574-0680
CE's Email              : worley@msr.epm.ornl.gov

Cognizant Expert(s)     : Ian T. Foster
CE's Organization       : Argonne National Laboratory
CE's Address            : MCS 221/D-235
                          9700 S. Cass Avenue
                          Argonne, IL 60439
CE's Telephone #        : (708) 252-4619
CE's Fax #              : (708) 252-5986
CE's Email              : itf@mcs.anl.gov
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :

Modulo other commitments, Worley is prepared to respond quickly to questions
and bug reports, but expects to be kept informed as to the results of
experiments and modifications to the code.
-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree" :

PSTSWM Version 1.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code closely follows
the way CCM2, the NCAR Community Climate Model, handles the dynamical part of
the primitive equations, and the parallel algorithms implemented in the model
include those currently used in the message-passing parallel implementation of
CCM2. PSTSWM was written by Patrick Worley of Oak Ridge National Laboratory
and Ian Foster of Argonne National Laboratory, and is based partly on previous
parallel algorithm research by John Drake, David Walker, and Patrick Worley of
Oak Ridge National Laboratory. Both the code development and the parallel
algorithms research were funded by the DOE Computer Hardware, Advanced
Mathematics, and Model Physics (CHAMMP) program. The features of version 1.0
were frozen on 8/1/93, and it is this version we would offer initially as a
benchmark.

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written
by James Hack and Ruediger Jakob at NCAR to solve the shallow water equations
on a sphere using the spectral transform method. STSWM evolved from a spectral
shallow water model written by Hack (NCAR/CGD) to compare numerical schemes
designed to solve the divergent barotropic equations in spherical geometry.
STSWM was written partially to provide the reference solutions to the test
cases proposed by Williamson et al. (see citation [4] below), which were
chosen to test the ability of numerical methods to simulate important flow
phenomena.
These test cases are embedded in the code and are selectable at run-time via
input parameters specifying initial conditions, forcing, and analytic
solutions (for error analysis). The solutions are also published in a
Technical Note by Jakob et al. [3]. In addition, this code is meant to serve
as an educational tool for numerical studies of the shallow water equations. A
detailed description of the spectral transform method, and a derivation of the
equations used in this software, can be found in the Technical Note by Hack
and Jakob [2].

For PSTSWM, we rewrote STSWM to add vertical levels (in order to get the
correct communication and computation granularity for 3-D weather and climate
codes), to increase modularity and support code reuse, and to allow the
problem size to be selected at runtime without depending on dynamic memory
allocation. PSTSWM is meant to be a compromise between paper benchmarks and
the usual fixed benchmarks, allowing a significant amount of runtime-selectable
algorithm tuning. Thus, the goal is to see how quickly the numerical simulation
can be run on different machines without fixing the parallel implementation,
while forcing all implementations to execute the same numerical code (to
guarantee fairness). The code has also been written in such a way that linking
in optimized library functions for common operations, instead of the
"portable" code, will be simple.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and Foster)
and the program that supported the development of the code (DOE CHAMMP
program) in any resulting research or publications, and are encouraged to send
reprints of their work with this code to the authors. Also, the authors would
appreciate being notified of any modifications to the code. Finally, the code
has been written to allow easy reuse of code in other applications, and for
educational purposes. The authors encourage this, but also request that they
be notified when pieces of the code are used.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses INTEGER, REAL, COMPLEX, and DOUBLE PRECISION
variables. The code should work correctly for any system in which COMPLEX is
represented as 2 REALs. The include file params.i has parameters that can be
used to specify the length of these. Also, some REAL and DOUBLE PRECISION
parameter values may need to be modified for floating point number systems
with large mantissas, e.g., PI, TWOPI.

PSTSWM is currently being used on systems where
  Integers : 4 bytes
  Floats   : 4 bytes

The use of two precisions can be eliminated, but at the cost of a significant
loss of precision. (For 4-byte REALs, not using DOUBLE PRECISION increases the
error by approximately three orders of magnitude.) DOUBLE PRECISION results
are only used in set-up (computing Gauss weights and nodes and Legendre
polynomial values), and are not used in the body of the computation.
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The sequential code is documented in a file included in the distribution of
the code from NCAR:

  Jakob, Ruediger, Description of Software for the Spectral Transform Shallow
  Water Model Version 2.0.
  National Center for Atmospheric Research, Boulder, CO 80307-3000,
  August 1992

and in

  Hack, J.J. and R. Jakob, Description of a global shallow water model based
  on the spectral transform method, NCAR Technical Note TN-343+STR,
  January 1992.

Documentation of the parallel code is in preparation, but extensive
documentation is present in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

 1) Browning, G.L., J.J. Hack and P.N. Swarztrauber, A comparison of three
    numerical methods for solving differential equations on the sphere,
    Monthly Weather Review, 117:1058-1075, 1989.
 2) Hack, J.J. and R. Jakob, Description of a global shallow water model based
    on the spectral transform method, NCAR Technical Note TN-343+STR,
    January 1992.
 3) Jakob, R., J.J. Hack and D.L. Williamson, Reference solutions to shallow
    water test set using the spectral transform method, NCAR Technical Note
    TN-388+STR (in preparation).
 4) Williamson, D.L., J.B. Drake, J.J. Hack, R. Jakob and P.S. Swarztrauber,
    A standard test set for numerical approximations to the shallow water
    equations in spherical geometry, Journal of Computational Physics,
    Vol. 102, pp. 211-224, 1992.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

 5) Worley, P. H. and J. B. Drake, Parallelizing the Spectral Transform
    Method, Concurrency: Practice and Experience, Vol. 4, No. 4 (June 1992),
    pp. 269-291.
 6) Walker, D. W., P. H. Worley, and J. B. Drake, Parallelizing the Spectral
    Transform Method. Part II, Concurrency: Practice and Experience, Vol. 4,
    No. 7 (October 1992), pp. 509-531.
 7) Foster, I. T. and P. H. Worley, Parallelizing the Spectral Transform
    Method: A Comparison of Alternative Parallel Algorithms, Proceedings of
    the Sixth SIAM Conference on Parallel Processing for Scientific Computing
    (March 22-24, 1993), pp. 100-107.
 8) Foster, I. T. and P. H. Worley, Parallel Algorithms for the Spectral
    Transform Method (in preparation).
 9) Worley, P. H. and I. T. Foster, PSTSWM: A Parallel Algorithm Testbed and
    Benchmark (in preparation).
-------------------------------------------------------------------------------
Other relevant research papers:

10) Foster, I., W. Gropp, and R. Stevens, The parallel scalability of the
    spectral transform method, Monthly Weather Review, 120(5), 1992,
    pp. 835-850.
11) Drake, J. B., R. E. Flanery, I. T. Foster, J. J. Hack, J. G. Michalakes,
    R. L. Stevens, D. W. Walker, D. L. Williamson, and P. H. Worley, The
    Message-Passing Version of the Parallel Community Climate Model,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, ed., World
    Scientific Publishing Co. Pte. Ltd, Singapore, 1993, pp. 500-513.
12) Sato, R. K. and R. D. Loft, Implementation of the NCAR CCM2 on the
    Connection Machine, Proceedings of the Fifth ECMWF Workshop on Use of
    Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and
    T. Kauranne, ed., World Scientific Publishing Co. Pte. Ltd, Singapore,
    1993, pp. 371-393.
13) Barros, S. R. M. and Kauranne, T., On the Parallelization of Global
    Spectral Eulerian Shallow-Water Models, Proceedings of the Fifth ECMWF
    Workshop on Use of Parallel Processors in Meteorology (Nov. 23-27, 1992),
    Hoffman, G.-R. and T. Kauranne, ed., World Scientific Publishing Co. Pte.
    Ltd, Singapore, 1993, pp. 36-43.
14) Kauranne, T. and S. R. M.
    Barros, Scalability Estimates of Parallel Spectral Atmospheric Models,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, ed., World
    Scientific Publishing Co. Pte. Ltd, Singapore, 1993, pp. 312-328.
15) Pelz, R. B. and W. F. Stern, A Balanced Parallel Algorithm for Parallel
    Processing, Proceedings of the Sixth SIAM Conference on Parallel
    Processing for Scientific Computing (March 22-24, 1993), pp. 126-128.
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

The model code is primarily written in Fortran 77, but also uses DO ... ENDDO
and DO WHILE ... ENDDO, and the INCLUDE extension (to pull in common and
parameter declarations). It has been compiled and run on the Intel iPSC/2,
iPSC/860, Delta, and Paragon, the IBM SP1, and on Sun Sparcstation, IBM
RS/6000, and Stardent 3000/1500 workstations (as a sequential code).

Message passing is implemented using the PICL message passing system. All
message passing is encapsulated in three high-level routines:
  BCAST0 (broadcast)
  GMIN0  (global minimum)
  GMAX0  (global maximum)
two classes of low-level routines:
  SWAP, SWAP_SEND, SWAP_RECV, SWAP_RECVBEGIN, SWAP_RECVEND, SWAP1, SWAP2,
  SWAP3 (variants and/or pieces of the swap operation)
  SENDRECV, SRBEGIN, SREND, SR1, SR2, SR3 (variants and/or pieces of the
  send/recv operation)
and one synchronization primitive:
  CLOCKSYNC0
PICL instrumentation commands are also embedded in the code. Porting the code
to another message passing library will be simple, although some of the
runtime communication options may then no longer be legal. The PICL
instrumentation calls can be stubbed out (or removed) without changing the
functionality of the code, but some sort of synchronization is needed when
timing short benchmark runs.
-------------------------------------------------------------------------------
Total number of lines in source code: 28,204
Number of lines excluding comments  : 12,434
Size in bytes of source code        : 994,299
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

  problem  : 23 lines, 559 bytes, ascii
  algorithm: 33 lines, 874 bytes, ascii
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

  standard output: The number of lines and bytes is a function of the input
    specifications, but for benchmarking would normally be 63 lines
    (2000 bytes) of meaningful output. (On the Intel machines, FORTRAN STOP
    messages are sent from each processor at the end of the run, increasing
    this number.)
  timings: Each run produces one line of output, containing approx. 150 bytes.

Both files are ascii.
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

(P)STSWM solves the nonlinear shallow water equations on the sphere. The
nonlinear shallow water equations constitute a simplified atmospheric-like
fluid prediction model that exhibits many of the features of more complete
models, and that has been used to investigate numerical methods and to
benchmark a number of machines. Each run of PSTSWM uses one of 6 embedded
initial conditions and forcing functions.
These cases were chosen to stress test numerical methods for this problem and
to represent important flows that develop in atmospheric modeling. STSWM also
supports reading in arbitrary initial conditions, but this capability was
removed from the parallel code to simplify the development of the initial
implementation.
-------------------------------------------------------------------------------
Main algorithms used:

PSTSWM uses the spectral transform method to solve the shallow water
equations. During each timestep, the state variables of the problem are
transformed between the physical domain, where most of the physical forces are
calculated, and the spectral domain, where the terms of the differential
equation are evaluated. The physical domain is a tensor product
longitude-latitude grid. The spectral domain is the set of spectral
coefficients in a spherical harmonic expansion of the state variables, and is
normally characterized as a triangular array (using a "triangular" truncation
of spectral coefficients).

Transforming from physical coordinates to spectral coordinates involves
performing a real FFT for each line of constant latitude, followed by
integration over latitude using Gaussian quadrature (approximating the
Legendre transform) to obtain the spectral coefficients. The inverse
transformation involves evaluating sums of spectral harmonics and inverse real
FFTs, analogous to the forward transform.

Parallel algorithms are used to compute the FFTs and to compute the vector
sums used to approximate the forward and inverse Legendre transforms. Two
major alternatives are available for each transform: distributed algorithms,
which use a fixed data decomposition and compute results where they are
assigned, and transpose algorithms, which remap the domains so that the
transforms can be calculated sequentially. This translates to four major
parallel algorithms:
  a) distributed FFT / distributed Legendre transform (LT)
  b) transpose FFT   / distributed LT
  c) distributed FFT / transpose LT
  d) transpose FFT   / transpose LT
Multiple implementations are supported for each type of algorithm, and the
assignment of processors to transforms is also determined by input parameters.
For example, input parameters specify a logical 2-D processor grid and define
the data decomposition of the physical and spectral domains onto this grid. If
16 processors are used, these can be arranged as a 4x4 grid, an 8x2 grid, a
16x1 grid, a 2x8 grid, or a 1x16 grid. This specification determines how many
processors are used to calculate each parallel FFT and how many are used to
calculate each parallel LT.
-------------------------------------------------------------------------------
Skeleton sketch of application:

The main program calls INPUT to read problem and algorithm parameters and set
up arrays for spectral transformations, and then calls INIT to set up the test
case parameters. Routines ERRANL and NRGTCS are called once before the main
timestepping loop for error normalization, once after the main timestepping
loop for calculating energetics data and errors, and periodically during the
timestepping, as requested. The prognostic fields are initialized using
routine ANLYTC, which provides the analytic solution. Each call to STEP
advances the computed fields by a timestep DT. Timing logic surrounds the
timestepping loop, so the initialization phase is not timed. Also, a fake
timestep is calculated before beginning timing, to eliminate the first-time
"paging" effect currently seen on the Intel Paragon systems.
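For illustration, a minimal sketch of this driver structure follows. The
routine names are those described above, but the argument lists, the timer
routine (TIMER), and the handling of the untimed "fake" timestep are schematic
placeholders, not the actual PSTSWM interfaces.

      PROGRAM DRIVER
C     Schematic sketch only: routine names follow the description
C     above, but argument lists, the TIMER routine, and the handling
C     of the untimed "fake" timestep are illustrative placeholders.
      REAL DT, TAUE, TAU, T0, T1
C     read problem and algorithm parameters, set up transform arrays
      CALL INPUT(DT, TAUE)
C     set up test case parameters and initialize prognostic fields
C     from the analytic solution
      CALL INIT
      CALL ANLYTC
C     error normalization and energetics before timestepping
      CALL ERRANL
      CALL NRGTCS
C     untimed "fake" timestep to absorb the first-time paging effect
      CALL STEP(DT)
C     timed timestepping loop (TAUE is in hours, DT is in seconds)
      CALL TIMER(T0)
      TAU = 0.0
      DO WHILE (TAU .LT. TAUE*3600.0)
         CALL STEP(DT)
         TAU = TAU + DT
C        ERRANL and NRGTCS may also be called here periodically,
C        as requested by the input parameters
      ENDDO
      CALL TIMER(T1)
C     energetics and error analysis after timestepping
      CALL NRGTCS
      CALL ERRANL
      WRITE (*,*) 'timestepping loop seconds = ', T1 - T0
      STOP
      END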
STEP computes the first two time levels by two semi-implicit timesteps; normal
time-stepping is by a centered leapfrog scheme. STEP calls COMP1, which
chooses among an explicit numerical algorithm, a semi-implicit algorithm, and
a simplified algorithm associated with solving the advection equation (one of
the embedded test cases). The numerical algorithm used is an input parameter.
The basic outline of each timestep is the following:
  1) Evaluate non-linear product and forcing terms.
  2) Fourier transform the non-linear terms in place as a block transform.
  3) Compute and update the divergence, geopotential, and vorticity spectral
     coefficients. (Much of the calculation of the time update is "bundled"
     with the Legendre transform.)
  4) Compute the velocity fields and transform divergence, geopotential, and
     vorticity back to gridpoint space using
       a) an inverse Legendre transform and associated computations, and
       b) an inverse real block FFT.
PSTSWM has "fictitious" vertical levels, and all computations are duplicated
on the different levels, potentially significantly increasing the granularity
of the computation. (The number of vertical levels is an input parameter.) For
error analysis, a single vertical level is extracted and analyzed.
-------------------------------------------------------------------------------
Brief description of I/O behavior:

Processor 0 reads in the input parameters and broadcasts them to the rest of
the processors. Processor 0 also receives the error analysis and timing
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. There are three domains to
be distributed:
  a) physical domain: tensor product longitude-latitude grid
  b) Fourier domain : tensor product wavenumber-latitude grid
  c) spectral domain: triangular array, where each column contains the
     spectral coefficients associated with a given wavenumber; the larger the
     wavenumber, the shorter the column.
An unordered FFT is used, and the Fourier and spectral domains use the
"unordered" permutation when the data is distributed.

I) distributed FFT/distributed LT
  1) The tensor-product longitude-latitude grid is mapped onto the processor
     grid by assigning a block of contiguous longitudes to each processor
     column and by assigning one or two blocks of contiguous latitudes to each
     processor row. The vertical dimension is not distributed.
  2) After the FFT, the subsequent wavenumber-latitude grid is similarly
     distributed over the processor grid, with a block of the permuted
     wavenumbers assigned to each processor column.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber, or
     are duplicated over the column. What happens is a function of the
     particular distributed LT algorithm used.

II) transpose FFT/distributed LT
  1) same as in (I)
  2) Before the FFT, the physical domain is first remapped to a vertical
     layer-latitude decomposition, with a block of contiguous vertical layers
     assigned to each processor column and the longitude dimension not
     distributed. After the transform, the vertical level-latitude grid is
     distributed as before, and the wavenumber dimension is not distributed.
  3) After the LT, the spectral coefficients for a given vertical layer are
     either distributed evenly over the processors in a column, or are
     duplicated over that column. What happens is a function of the particular
     distributed LT algorithm used.

III) distributed FFT/transpose LT
  1) same as (I)
  2) same as (I)
  3) Before the LT, the wavenumber-latitude grid is first remapped to a
     wavenumber-vertical layer decomposition, with a block of contiguous
     vertical layers assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients
     associated with a given wavenumber and vertical layer are all on one
     processor, and the wavenumbers and vertical layers are distributed as
     before.

IV) transpose FFT/transpose LT
  1) same as (I)
  2) same as (II)
  3) Before the LT, the vertical level-latitude grid is first remapped to a
     vertical level-wavenumber decomposition, with a block of the permuted
     wavenumbers now assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients
     associated with a given wavenumber and vertical layer are all on one
     processor, and the wavenumbers and vertical layers are distributed as
     before.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The distribution is a function of the problem size (longitude, latitude,
vertical levels), the logical processor grid (PX, PY), and the algorithm
(transpose vs. distributed for FFT and LT).
-------------------------------------------------------------------------------
Brief description of load balance behavior :

The load is fairly well balanced. If PX and PY evenly divide the number of
longitudes, latitudes, and vertical levels, then all load imbalances are due
to the unequal distribution of spectral coefficients. As described above, the
spectral coefficients are laid out as a triangular array in most runs, where
each column corresponds to a different Fourier wavenumber. The wavenumbers are
partitioned among the processors in most of the parallel algorithms. Since
each column is a different length, a wrap mapping of the columns would
approximately balance the load. Instead, the natural "unordered" ordering of
the FFT is used with a block partitioning, which does a reasonable job of load
balancing without any additional data movement. The load imbalance is
quantified in Walker et al. [6]. If PX and PY do not evenly divide the
dimensions of the physical domain, then other load imbalances may be as large
as a factor of 2 in the worst case.
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

  MM, NN, KK - specify the number of Fourier wavenumbers and the spectral
     truncation used. For a triangular truncation, MM = NN = KK.
  NLON, NLAT, NVER - number of longitudes, latitudes, and vertical levels.
     There are required relationships between NLON, NLAT, and NVER, and
     between these and MM. These relationships are checked in the code. We
     will also provide a selection of input files that specify legal (and
     interesting) problems.
  DT - timestep (in seconds). (Must be small enough to satisfy the Courant
     stability condition. The code warns if the timestep is too large, but
     does not abort.)
  TAUE - end of model run (in hours)
-------------------------------------------------------------------------------
Give memory as function of problem size :

Executable size is determined at compile time by setting the parameter COMPSZ
in params.i. Per-node memory requirements are approximately (in REALs):

  associated Legendre polynomial values: MM*MM*NLAT/(PX*PY)
  physical grid fields: 8*NLON*NLAT*NVER/(PX*PY)
  spectral grid fields: 3*MM*MM*NVER/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
                        3*MM*MM*NVER/PX
  work space: 8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
              8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/PX

where BUFS1 and BUFS2 are input parameters (number of communication buffers).
BUFS1 and BUFS2 can be as small as 0 and as large as PX or PY. In standard
test cases, NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1, so memory requirements
are approximately:

  (2 + 108*(1+BUFS1) + 3*(1+BUFS2))*(M**3)/(4*PX*PY)
or
  (2 + 108*(1+BUFS1))*(M**3)/(4*PX*PY) + 3*(1+BUFS2)*(M**3)/(4*PX)
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

For a serial run, per timestep (very rough):

  nonlinear terms: 10*NLON*NLAT*NVER
  forward FFT: 40*NLON*NLAT*NVER*LOG2(NLON)
  forward LT and time update: 48*MM*NLAT*NVER + 7*(MM**2)*NLAT*NVER
  inverse LT and calculation of velocities:
     20*MM*NLAT*NVER + 14*(MM**2)*NLAT*NVER
  inverse FFT: 25*NLON*NLAT*NVER*LOG2(NLON)

Using the standard assumptions (NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1):

  approx. 460*(M**3) + 348*(M**3)*LOG2(M) + 24*(M**4) flops per timestep.

For a total run, multiply by the number of timesteps, TAUE*3600/DT (TAUE is in
hours, DT in seconds).
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :

This is a function of the algorithm chosen.

I) transpose FFT
  a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
       2*(PX-1) steps, D volume
     or
       2*LOG2(PX) steps, D*LOG2(PX) volume
II) distributed FFT
  a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
       2*LOG2(PX) steps, D*LOG2(PX) volume
III) transpose LT
  a) forward LT: let D = 8*NLON*NLAT*NVER/(PX*PY)
       2*(PY-1) steps, D volume
     or
       2*LOG2(PY) steps, D*LOG2(PY) volume
  b) inverse LT: let D = (3/2)*(MM**2)*NVER/(PX*PY)
       (PY-1) steps, D volume
     or
       LOG2(PY) steps, D*PY volume
IV) distributed LT
  a) forward + inverse LT: let D = 3*(MM**2)*NVER/(PX*PY)
       2*(PY-1) steps, D*PY volume
     or
       2*LOG2(PY) steps, D*PY volume

These are per-timestep costs. Multiply by the number of timesteps
(TAUE*3600/DT) for the total communication overhead.
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

Standard input files will be provided for

  T21: MM=NN=KK=21      T42: MM=NN=KK=42      T85: MM=NN=KK=85
       NLON=64               NLON=128              NLON=256
       NLAT=32               NLAT=64               NLAT=128
       NVER=8                NVER=16               NVER=32
       ICOND=2               ICOND=2               ICOND=2
       DT=4800.0             DT=2400.0             DT=1200.0
       TAUE=120.0            TAUE=120.0            TAUE=120.0

These are 5 day runs of the "benchmark" case specified in Williamson et
al. [4].
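As an illustration, the approximate formulas above can be evaluated with a
short self-contained program such as the following sketch (not part of
PSTSWM). The T42 values, the 4x4 logical processor grid, and BUFS1=BUFS2=0 are
illustrative choices; spectral coefficients are assumed to be distributed (not
duplicated) within processor columns.

      PROGRAM ESTMAT
C     Illustrative sketch (not part of PSTSWM): evaluates the
C     approximate per-node memory and flop formulas given above.
C     The values below are the standard T42 problem on an assumed
C     4x4 logical processor grid with BUFS1 = BUFS2 = 0; spectral
C     coefficients are assumed distributed within processor columns.
      INTEGER MM, NLON, NLAT, NVER, PX, PY, BUFS1, BUFS2
      REAL DT, TAUE, RLOG2, MEM, FLOPS, NSTEPS
      PARAMETER (MM=42, NLON=128, NLAT=64, NVER=16)
      PARAMETER (PX=4, PY=4, BUFS1=0, BUFS2=0)
      PARAMETER (DT=2400.0, TAUE=120.0)
C     log base 2 of NLON
      RLOG2 = LOG(REAL(NLON))/LOG(2.0)
C     per-node memory in REALs: Legendre polynomial values, physical
C     grid fields, spectral grid fields, and work space
      MEM = REAL(MM)*MM*NLAT/(PX*PY)
     &    + 8.0*NLON*NLAT*NVER/(PX*PY)
     &    + 3.0*MM*MM*NVER/(PX*PY)
     &    + 8.0*NLON*NLAT*NVER*BUFS1/(PX*PY)
     &    + 3.0*MM*MM*NVER*BUFS2/(PX*PY)
C     serial flops per timestep: nonlinear terms, forward FFT,
C     forward LT and update, inverse LT and velocities, inverse FFT
      FLOPS = 10.0*NLON*NLAT*NVER
     &      + 40.0*NLON*NLAT*NVER*RLOG2
     &      + (48.0*MM + 7.0*MM*MM)*NLAT*NVER
     &      + (20.0*MM + 14.0*MM*MM)*NLAT*NVER
     &      + 25.0*NLON*NLAT*NVER*RLOG2
C     number of timesteps: TAUE is in hours, DT is in seconds
      NSTEPS = TAUE*3600.0/DT
      WRITE (*,*) 'per-node memory (REALs)     = ', MEM
      WRITE (*,*) 'flops per timestep (serial) = ', FLOPS
      WRITE (*,*) 'flops per run (serial)      = ', FLOPS*NSTEPS
      STOP
      END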
Flops and memory requirements for serial runs are as follows (approx.):

  T21:    500,000 REALs      2,000,000,000 flops
  T42:  4,000,000 REALs     45,000,000,000 flops
  T85: 34,391,000 REALs  1,000,000,000,000 flops

Both memory and flops scale well, so, for example, the T42 run fits in
approximately 4 MB of memory per processor for a 4-processor run. But
different algorithms and different aspect ratios of the processor grid use
different amounts of memory.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand (looking primarily at inner loops, but eliminating common
subexpressions that the compiler is expected to find).
-------------------------------------------------------------------------------