-------------------------------------------------------------------------------
Name of Program         : PSTSWM
                          (Parallel Spectral Transform Shallow Water Model)
-------------------------------------------------------------------------------
Submitter's Name        : Patrick H. Worley
Submitter's Organization: Oak Ridge National Laboratory
Submitter's Address     : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
Submitter's Telephone # : (615) 574-3128
Submitter's Fax #       : (615) 574-0680
Submitter's Email       : worley@msr.epm.ornl.gov
-------------------------------------------------------------------------------
Cognizant Expert(s)     : Patrick H. Worley
CE's Organization       : Oak Ridge National Laboratory
CE's Address            : Bldg. 6012/MS-6367
                          P. O. Box 2008
                          Oak Ridge, TN 37831-6367
CE's Telephone #        : (615) 574-3128
CE's Fax #              : (615) 574-0680
CE's Email              : worley@msr.epm.ornl.gov

Cognizant Expert(s)     : Ian T. Foster
CE's Organization       : Argonne National Laboratory
CE's Address            : MCS 221/D-235
                          9700 S. Cass Avenue
                          Argonne, IL 60439
CE's Telephone #        : (708) 252-4619
CE's Fax #              : (708) 252-5986
CE's Email              : itf@mcs.anl.gov
-------------------------------------------------------------------------------
Extent and timeliness with which CE is prepared to respond to questions and
bug reports from ParkBench :

Modulo other commitments, Worley is prepared to respond quickly to questions
and bug reports, but expects to be kept informed as to the results of
experiments and modifications to the code.
-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Climate Modeling
-------------------------------------------------------------------------------
Application "pedigree" :

PSTSWM Version 1.0 is a message-passing benchmark code and parallel algorithm
testbed that solves the nonlinear shallow water equations using the spectral
transform method. The spectral transform algorithm of the code closely follows
the way CCM2, the NCAR Community Climate Model, handles the dynamical part of
the primitive equations, and the parallel algorithms implemented in the model
include those currently used in the message-passing parallel implementation of
CCM2. PSTSWM was written by Patrick Worley of Oak Ridge National Laboratory
and Ian Foster of Argonne National Laboratory, and is based partly on previous
parallel algorithm research by John Drake, David Walker, and Patrick Worley of
Oak Ridge National Laboratory. Both the code development and the parallel
algorithms research were funded by the DOE Computer Hardware, Advanced
Mathematics, and Model Physics (CHAMMP) program. The features of version 1.0
were frozen on 8/1/93, and it is this version we would offer initially as a
benchmark.

PSTSWM is a parallel implementation of a sequential code (STSWM 2.0) written
by James Hack and Ruediger Jakob at NCAR to solve the shallow water equations
on a sphere using the spectral transform method. STSWM evolved from a spectral
shallow water model written by Hack (NCAR/CGD) to compare numerical schemes
designed to solve the divergent barotropic equations in spherical geometry.
STSWM was written partially to provide the reference solutions to the test
cases proposed by Williamson et al. (see citation [4] below), which were
chosen to test the ability of numerical methods to simulate important flow
phenomena.
These test cases are embedded in the code and are selectable at run-time via
input parameters specifying initial conditions, forcing, and analytic
solutions (for error analysis). The solutions are also published in a
Technical Note by Jakob et al. [3]. In addition, this code is meant to serve
as an educational tool for numerical studies of the shallow water equations. A
detailed description of the spectral transform method, and a derivation of the
equations used in this software, can be found in the Technical Note by Hack
and Jakob [2].

For PSTSWM, we rewrote STSWM to add vertical levels (in order to get the
correct communication and computation granularity for 3-D weather and climate
codes), to increase modularity and support code reuse, and to allow the
problem size to be selected at runtime without depending on dynamic memory
allocation. PSTSWM is meant to be a compromise between paper benchmarks and
the usual fixed benchmarks, allowing a significant amount of runtime-selectable
algorithm tuning. Thus, the goal is to see how quickly the numerical simulation
can be run on different machines without fixing the parallel implementation,
while forcing all implementations to execute the same numerical code (to
guarantee fairness). The code has also been written in such a way that linking
in optimized library functions for common operations, instead of the
"portable" code, will be simple.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :

Yes, but users are requested to acknowledge the authors (Worley and Foster)
and the program that supported the development of the code (DOE CHAMMP
program) in any resulting research or publications, and are encouraged to send
reprints of their work with this code to the authors. Also, the authors would
appreciate being notified of any modifications to the code. Finally, the code
has been written to allow easy reuse of code in other applications, and for
educational purposes. The authors encourage this, but also request that they
be notified when pieces of the code are used.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:

The program currently uses INTEGER, REAL, COMPLEX, and DOUBLE PRECISION
variables. The code should work correctly for any system in which COMPLEX is
represented as 2 REALs. The include file params.i has parameters that can be
used to specify the length of these. Also, some REAL and DOUBLE PRECISION
parameter values may need to be modified for floating point number systems
with large mantissas, e.g., PI, TWOPI.

PSTSWM is currently being used on systems where
  Integers : 4 bytes
  Floats   : 4 bytes

The use of two precisions can be eliminated, but at the cost of a significant
loss of precision. (For 4-byte REALs, not using DOUBLE PRECISION increases the
error by approximately three orders of magnitude.) DOUBLE PRECISION results
are only used in set-up (computing Gauss weights and nodes and Legendre
polynomial values), and are not used in the body of the computation.
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :

The sequential code is documented in a file included in the distribution of
the code from NCAR:

  Jakob, Ruediger, Description of Software for the Spectral Transform Shallow
  Water Model Version 2.0.
  National Center for Atmospheric Research, Boulder, CO 80307-3000,
  August 1992

and in

  Hack, J.J. and R. Jakob, Description of a global shallow water model based
  on the spectral transform method, NCAR Technical Note TN-343+STR,
  January 1992.

Documentation of the parallel code is in preparation, but extensive
documentation is present in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :

 1) Browning, G.L., J.J. Hack and P.N. Swarztrauber, A comparison of three
    numerical methods for solving differential equations on the sphere,
    Monthly Weather Review, 117:1058-1075, 1989.
 2) Hack, J.J. and R. Jakob, Description of a global shallow water model based
    on the spectral transform method, NCAR Technical Note TN-343+STR,
    January 1992.
 3) Jakob, R., J.J. Hack and D.L. Williamson, Reference solutions to shallow
    water test set using the spectral transform method, NCAR Technical Note
    TN-388+STR (in preparation).
 4) Williamson, D.L., J.B. Drake, J.J. Hack, R. Jakob and P.S. Swarztrauber,
    A standard test set for numerical approximations to the shallow water
    equations in spherical geometry, Journal of Computational Physics,
    Vol. 102, pp. 211-224, 1992.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :

 5) Worley, P. H. and J. B. Drake, Parallelizing the Spectral Transform
    Method, Concurrency: Practice and Experience, Vol. 4, No. 4 (June 1992),
    pp. 269-291.
 6) Walker, D. W., P. H. Worley, and J. B. Drake, Parallelizing the Spectral
    Transform Method. Part II, Concurrency: Practice and Experience, Vol. 4,
    No. 7 (October 1992), pp. 509-531.
 7) Foster, I. T. and P. H. Worley, Parallelizing the Spectral Transform
    Method: A Comparison of Alternative Parallel Algorithms, Proceedings of
    the Sixth SIAM Conference on Parallel Processing for Scientific Computing
    (March 22-24, 1993), pp. 100-107.
 8) Foster, I. T. and P. H. Worley, Parallel Algorithms for the Spectral
    Transform Method (in preparation).
 9) Worley, P. H. and I. T. Foster, PSTSWM: A Parallel Algorithm Testbed and
    Benchmark (in preparation).
-------------------------------------------------------------------------------
Other relevant research papers:

10) Foster, I., W. Gropp, and R. Stevens, The parallel scalability of the
    spectral transform method, Monthly Weather Review, 120(5), 1992,
    pp. 835-850.
11) Drake, J. B., R. E. Flanery, I. T. Foster, J. J. Hack, J. G. Michalakes,
    R. L. Stevens, D. W. Walker, D. L. Williamson, and P. H. Worley, The
    Message-Passing Version of the Parallel Community Climate Model,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, ed., World
    Scientific Publishing Co. Pte. Ltd, Singapore, 1993, pp. 500-513.
12) Sato, R. K. and R. D. Loft, Implementation of the NCAR CCM2 on the
    Connection Machine, Proceedings of the Fifth ECMWF Workshop on Use of
    Parallel Processors in Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and
    T. Kauranne, ed., World Scientific Publishing Co. Pte. Ltd, Singapore,
    1993, pp. 371-393.
13) Barros, S. R. M. and Kauranne, T., On the Parallelization of Global
    Spectral Eulerian Shallow-Water Models, Proceedings of the Fifth ECMWF
    Workshop on Use of Parallel Processors in Meteorology (Nov. 23-27, 1992),
    Hoffman, G.-R. and T. Kauranne, ed., World Scientific Publishing Co. Pte.
    Ltd, Singapore, 1993, pp. 36-43.
14) Kauranne, T. and S. R. M.
    Barros, Scalability Estimates of Parallel Spectral Atmospheric Models,
    Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
    Meteorology (Nov. 23-27, 1992), Hoffman, G.-R. and T. Kauranne, ed., World
    Scientific Publishing Co. Pte. Ltd, Singapore, 1993, pp. 312-328.
15) Pelz, R. B. and W. F. Stern, A Balanced Parallel Algorithm for Parallel
    Processing, Proceedings of the Sixth SIAM Conference on Parallel
    Processing for Scientific Computing (March 22-24, 1993), pp. 126-128.
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :

The model code is primarily written in Fortran 77, but also uses DO ... ENDDO
and DO WHILE ... ENDDO, and the INCLUDE extension (to pull in common and
parameter declarations). It has been compiled and run on the Intel iPSC/2,
iPSC/860, Delta, and Paragon, the IBM SP1, and on Sun Sparcstation, IBM
RS/6000, and Stardent 3000/1500 workstations (as a sequential code).

Message passing is implemented using the PICL message passing system. All
message passing is encapsulated in three high-level routines:
  BCAST0 (broadcast)
  GMIN0  (global minimum)
  GMAX0  (global maximum)
two classes of low-level routines:
  SWAP, SWAP_SEND, SWAP_RECV, SWAP_RECVBEGIN, SWAP_RECVEND, SWAP1, SWAP2,
  SWAP3 (variants and/or pieces of the swap operation)
  SENDRECV, SRBEGIN, SREND, SR1, SR2, SR3 (variants and/or pieces of the
  send/recv operation)
and one synchronization primitive:
  CLOCKSYNC0
PICL instrumentation commands are also embedded in the code. Porting the code
to another message passing library will be simple, although some of the
runtime communication options may then no longer be legal. The PICL
instrumentation calls can be stubbed out (or removed) without changing the
functionality of the code, but some sort of synchronization is needed when
timing short benchmark runs.
-------------------------------------------------------------------------------
Total number of lines in source code: 28,204
Number of lines excluding comments  : 12,434
Size in bytes of source code        : 994,299
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :

  problem  : 23 lines, 559 bytes, ascii
  algorithm: 33 lines, 874 bytes, ascii
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if
formatted) :

  standard output: The number of lines and bytes is a function of the input
    specifications, but for benchmarking would normally be 63 lines
    (2000 bytes) of meaningful output. (On the Intel machines, FORTRAN STOP
    messages are sent from each processor at the end of the run, increasing
    this number.)
  timings: Each run produces one line of output, containing approx. 150 bytes.

Both files are ascii.
-------------------------------------------------------------------------------
Brief, high-level description of what application does:

(P)STSWM solves the nonlinear shallow water equations on the sphere. The
nonlinear shallow water equations constitute a simplified atmospheric-like
fluid prediction model that exhibits many of the features of more complete
models, and that has been used to investigate numerical methods and to
benchmark a number of machines. Each run of PSTSWM uses one of 6 embedded
initial conditions and forcing functions.
These cases were chosen to stress test numerical methods for this problem and
to represent important flows that develop in atmospheric modeling. STSWM also
supports reading in arbitrary initial conditions, but this capability was
removed from the parallel code to simplify the development of the initial
implementation.
-------------------------------------------------------------------------------
Main algorithms used:

PSTSWM uses the spectral transform method to solve the shallow water
equations. During each timestep, the state variables of the problem are
transformed between the physical domain, where most of the physical forces are
calculated, and the spectral domain, where the terms of the differential
equation are evaluated. The physical domain is a tensor product
longitude-latitude grid. The spectral domain is the set of spectral
coefficients in a spherical harmonic expansion of the state variables, and is
normally characterized as a triangular array (using a "triangular" truncation
of spectral coefficients).

Transforming from physical coordinates to spectral coordinates involves
performing a real FFT for each line of constant latitude, followed by
integration over latitude using Gaussian quadrature (approximating the
Legendre transform) to obtain the spectral coefficients. The inverse
transformation involves evaluating sums of spectral harmonics and inverse real
FFTs, analogous to the forward transform.

Parallel algorithms are used to compute the FFTs and to compute the vector
sums used to approximate the forward and inverse Legendre transforms. Two
major alternatives are available for each transform: distributed algorithms,
which use a fixed data decomposition and compute results where they are
assigned, and transpose algorithms, which remap the domains so that the
transforms can be calculated sequentially. This translates to four major
parallel algorithms:
  a) distributed FFT / distributed Legendre transform (LT)
  b) transpose FFT   / distributed LT
  c) distributed FFT / transpose LT
  d) transpose FFT   / transpose LT
Multiple implementations are supported for each type of algorithm, and the
assignment of processors to transforms is also determined by input parameters.
For example, input parameters specify a logical 2-D processor grid and define
the data decomposition of the physical and spectral domains onto this grid. If
16 processors are used, these can be arranged as a 4x4 grid, an 8x2 grid, a
16x1 grid, a 2x8 grid, or a 1x16 grid. This specification determines how many
processors are used to calculate each parallel FFT and how many are used to
calculate each parallel LT.
-------------------------------------------------------------------------------
Skeleton sketch of application:

The main program calls INPUT to read problem and algorithm parameters and set
up arrays for spectral transformations, and then calls INIT to set up the test
case parameters. Routines ERRANL and NRGTCS are called once before the main
timestepping loop for error normalization, once after the main timestepping
loop for calculating energetics data and errors, and periodically during the
timestepping, as requested. The prognostic fields are initialized using
routine ANLYTC, which provides the analytic solution. Each call to STEP
advances the computed fields by a timestep DT. Timing logic surrounds the
timestepping loop, so the initialization phase is not timed. Also, a fake
timestep is calculated before beginning timing, to eliminate the first-time
"paging" effect currently seen on the Intel Paragon systems.
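For illustration, a minimal sketch of this driver structure follows. The
routine names are those described above, but the argument lists, the timer
routine (TIMER), and the handling of the untimed "fake" timestep are schematic
placeholders, not the actual PSTSWM interfaces.

      PROGRAM DRIVER
C     Schematic sketch only: routine names follow the description
C     above, but argument lists, the TIMER routine, and the handling
C     of the untimed "fake" timestep are illustrative placeholders.
      REAL DT, TAUE, TAU, T0, T1
C     read problem and algorithm parameters, set up transform arrays
      CALL INPUT(DT, TAUE)
C     set up test case parameters and initialize prognostic fields
C     from the analytic solution
      CALL INIT
      CALL ANLYTC
C     error normalization and energetics before timestepping
      CALL ERRANL
      CALL NRGTCS
C     untimed "fake" timestep to absorb the first-time paging effect
      CALL STEP(DT)
C     timed timestepping loop (TAUE is in hours, DT is in seconds)
      CALL TIMER(T0)
      TAU = 0.0
      DO WHILE (TAU .LT. TAUE*3600.0)
         CALL STEP(DT)
         TAU = TAU + DT
C        ERRANL and NRGTCS may also be called here periodically,
C        as requested by the input parameters
      ENDDO
      CALL TIMER(T1)
C     energetics and error analysis after timestepping
      CALL NRGTCS
      CALL ERRANL
      WRITE (*,*) 'timestepping loop seconds = ', T1 - T0
      STOP
      END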
STEP computes the first two time levels by two semi-implicit timesteps; normal
time-stepping is by a centered leapfrog scheme. STEP calls COMP1, which
chooses among an explicit numerical algorithm, a semi-implicit algorithm, and
a simplified algorithm associated with solving the advection equation (one of
the embedded test cases). The numerical algorithm used is an input parameter.
The basic outline of each timestep is the following:
  1) Evaluate non-linear product and forcing terms.
  2) Fourier transform the non-linear terms in place as a block transform.
  3) Compute and update the divergence, geopotential, and vorticity spectral
     coefficients. (Much of the calculation of the time update is "bundled"
     with the Legendre transform.)
  4) Compute the velocity fields and transform divergence, geopotential, and
     vorticity back to gridpoint space using
       a) an inverse Legendre transform and associated computations, and
       b) an inverse real block FFT.
PSTSWM has "fictitious" vertical levels, and all computations are duplicated
on the different levels, potentially significantly increasing the granularity
of the computation. (The number of vertical levels is an input parameter.) For
error analysis, a single vertical level is extracted and analyzed.
-------------------------------------------------------------------------------
Brief description of I/O behavior:

Processor 0 reads in the input parameters and broadcasts them to the rest of
the processors. Processor 0 also receives the error analysis and timing
results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :

The processors are treated as a logical 2-D grid. There are three domains to
be distributed:
  a) physical domain: tensor product longitude-latitude grid
  b) Fourier domain : tensor product wavenumber-latitude grid
  c) spectral domain: triangular array, where each column contains the
     spectral coefficients associated with a given wavenumber; the larger the
     wavenumber, the shorter the column.
An unordered FFT is used, and the Fourier and spectral domains use the
"unordered" permutation when the data is distributed.

I) distributed FFT/distributed LT
  1) The tensor-product longitude-latitude grid is mapped onto the processor
     grid by assigning a block of contiguous longitudes to each processor
     column and by assigning one or two blocks of contiguous latitudes to each
     processor row. The vertical dimension is not distributed.
  2) After the FFT, the subsequent wavenumber-latitude grid is similarly
     distributed over the processor grid, with a block of the permuted
     wavenumbers assigned to each processor column.
  3) After the LT, the wavenumbers are distributed as before, and the spectral
     coefficients associated with any given wavenumber are either distributed
     evenly over the processors in the column containing that wavenumber, or
     are duplicated over the column. What happens is a function of the
     particular distributed LT algorithm used.

II) transpose FFT/distributed LT
  1) same as in (I)
  2) Before the FFT, the physical domain is first remapped to a vertical
     layer-latitude decomposition, with a block of contiguous vertical layers
     assigned to each processor column and the longitude dimension not
     distributed. After the transform, the vertical level-latitude grid is
     distributed as before, and the wavenumber dimension is not distributed.
  3) After the LT, the spectral coefficients for a given vertical layer are
     either distributed evenly over the processors in a column, or are
     duplicated over that column. What happens is a function of the particular
     distributed LT algorithm used.

III) distributed FFT/transpose LT
  1) same as (I)
  2) same as (I)
  3) Before the LT, the wavenumber-latitude grid is first remapped to a
     wavenumber-vertical layer decomposition, with a block of contiguous
     vertical layers assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients
     associated with a given wavenumber and vertical layer are all on one
     processor, and the wavenumbers and vertical layers are distributed as
     before.

IV) transpose FFT/transpose LT
  1) same as (I)
  2) same as (II)
  3) Before the LT, the vertical level-latitude grid is first remapped to a
     vertical level-wavenumber decomposition, with a block of the permuted
     wavenumbers now assigned to each processor row and the latitude dimension
     not distributed. After the transform, the spectral coefficients
     associated with a given wavenumber and vertical layer are all on one
     processor, and the wavenumbers and vertical layers are distributed as
     before.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :

The distribution is a function of the problem size (longitude, latitude,
vertical levels), the logical processor grid (PX, PY), and the algorithm
(transpose vs. distributed for FFT and LT).
-------------------------------------------------------------------------------
Brief description of load balance behavior :

The load is fairly well balanced. If PX and PY evenly divide the number of
longitudes, latitudes, and vertical levels, then all load imbalances are due
to the unequal distribution of spectral coefficients. As described above, the
spectral coefficients are laid out as a triangular array in most runs, where
each column corresponds to a different Fourier wavenumber. The wavenumbers are
partitioned among the processors in most of the parallel algorithms. Since
each column is a different length, a wrap mapping of the columns would
approximately balance the load. Instead, the natural "unordered" ordering of
the FFT is used with a block partitioning, which does a reasonable job of load
balancing without any additional data movement. The load imbalance is
quantified in Walker et al. [6]. If PX and PY do not evenly divide the
dimensions of the physical domain, then other load imbalances may be as large
as a factor of 2 in the worst case.
-------------------------------------------------------------------------------
Give parameters that determine the problem size :

  MM, NN, KK - specify the number of Fourier wavenumbers and the spectral
     truncation used. For a triangular truncation, MM = NN = KK.
  NLON, NLAT, NVER - number of longitudes, latitudes, and vertical levels.
     There are required relationships between NLON, NLAT, and NVER, and
     between these and MM. These relationships are checked in the code. We
     will also provide a selection of input files that specify legal (and
     interesting) problems.
  DT - timestep (in seconds). (Must be small enough to satisfy the Courant
     stability condition. The code warns if the timestep is too large, but
     does not abort.)
  TAUE - end of model run (in hours)
-------------------------------------------------------------------------------
Give memory as function of problem size :

Executable size is determined at compile time by setting the parameter COMPSZ
in params.i. Per-node memory requirements are approximately (in REALs):

  associated Legendre polynomial values: MM*MM*NLAT/(PX*PY)
  physical grid fields: 8*NLON*NLAT*NVER/(PX*PY)
  spectral grid fields: 3*MM*MM*NVER/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
                        3*MM*MM*NVER/PX
  work space: 8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/(PX*PY)
    or (if spectral coefficients are duplicated within a processor column)
              8*NLON*NLAT*NVER*BUFS1/(PX*PY) + 3*MM*MM*NVER*BUFS2/PX

where BUFS1 and BUFS2 are input parameters (number of communication buffers).
BUFS1 and BUFS2 can be as small as 0 and as large as PX or PY. In standard
test cases, NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1, so memory requirements
are approximately:

  (2 + 108*(1+BUFS1) + 3*(1+BUFS2))*(M**3)/(4*PX*PY)
or
  (2 + 108*(1+BUFS1))*(M**3)/(4*PX*PY) + 3*(1+BUFS2)*(M**3)/(4*PX)
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :

For a serial run, per timestep (very rough):

  nonlinear terms: 10*NLON*NLAT*NVER
  forward FFT: 40*NLON*NLAT*NVER*LOG2(NLON)
  forward LT and time update: 48*MM*NLAT*NVER + 7*(MM**2)*NLAT*NVER
  inverse LT and calculation of velocities:
     20*MM*NLAT*NVER + 14*(MM**2)*NLAT*NVER
  inverse FFT: 25*NLON*NLAT*NVER*LOG2(NLON)

Using the standard assumptions (NLON=2*NLAT, NLON=4*NVER, and NLON=3*MM+1):

  approx. 460*(M**3) + 348*(M**3)*LOG2(M) + 24*(M**4) flops per timestep.

For a total run, multiply by the number of timesteps, TAUE*3600/DT (TAUE is in
hours, DT in seconds).
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :

This is a function of the algorithm chosen.

I) transpose FFT
  a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
       2*(PX-1) steps, D volume
     or
       2*LOG2(PX) steps, D*LOG2(PX) volume
II) distributed FFT
  a) forward + inverse FFT: let D = 13*NLON*NLAT*NVER/(PX*PY)
       2*LOG2(PX) steps, D*LOG2(PX) volume
III) transpose LT
  a) forward LT: let D = 8*NLON*NLAT*NVER/(PX*PY)
       2*(PY-1) steps, D volume
     or
       2*LOG2(PY) steps, D*LOG2(PY) volume
  b) inverse LT: let D = (3/2)*(MM**2)*NVER/(PX*PY)
       (PY-1) steps, D volume
     or
       LOG2(PY) steps, D*PY volume
IV) distributed LT
  a) forward + inverse LT: let D = 3*(MM**2)*NVER/(PX*PY)
       2*(PY-1) steps, D*PY volume
     or
       2*LOG2(PY) steps, D*PY volume

These are per-timestep costs. Multiply by the number of timesteps
(TAUE*3600/DT) for the total communication overhead.
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files, memory
required, and number of floating point operations) :

Standard input files will be provided for

  T21: MM=NN=KK=21      T42: MM=NN=KK=42      T85: MM=NN=KK=85
       NLON=64               NLON=128              NLON=256
       NLAT=32               NLAT=64               NLAT=128
       NVER=8                NVER=16               NVER=32
       ICOND=2               ICOND=2               ICOND=2
       DT=4800.0             DT=2400.0             DT=1200.0
       TAUE=120.0            TAUE=120.0            TAUE=120.0

These are 5 day runs of the "benchmark" case specified in Williamson et
al. [4].
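As an illustration, the approximate formulas above can be evaluated with a
short self-contained program such as the following sketch (not part of
PSTSWM). The T42 values, the 4x4 logical processor grid, and BUFS1=BUFS2=0 are
illustrative choices; spectral coefficients are assumed to be distributed (not
duplicated) within processor columns.

      PROGRAM ESTMAT
C     Illustrative sketch (not part of PSTSWM): evaluates the
C     approximate per-node memory and flop formulas given above.
C     The values below are the standard T42 problem on an assumed
C     4x4 logical processor grid with BUFS1 = BUFS2 = 0; spectral
C     coefficients are assumed distributed within processor columns.
      INTEGER MM, NLON, NLAT, NVER, PX, PY, BUFS1, BUFS2
      REAL DT, TAUE, RLOG2, MEM, FLOPS, NSTEPS
      PARAMETER (MM=42, NLON=128, NLAT=64, NVER=16)
      PARAMETER (PX=4, PY=4, BUFS1=0, BUFS2=0)
      PARAMETER (DT=2400.0, TAUE=120.0)
C     log base 2 of NLON
      RLOG2 = LOG(REAL(NLON))/LOG(2.0)
C     per-node memory in REALs: Legendre polynomial values, physical
C     grid fields, spectral grid fields, and work space
      MEM = REAL(MM)*MM*NLAT/(PX*PY)
     &    + 8.0*NLON*NLAT*NVER/(PX*PY)
     &    + 3.0*MM*MM*NVER/(PX*PY)
     &    + 8.0*NLON*NLAT*NVER*BUFS1/(PX*PY)
     &    + 3.0*MM*MM*NVER*BUFS2/(PX*PY)
C     serial flops per timestep: nonlinear terms, forward FFT,
C     forward LT and update, inverse LT and velocities, inverse FFT
      FLOPS = 10.0*NLON*NLAT*NVER
     &      + 40.0*NLON*NLAT*NVER*RLOG2
     &      + (48.0*MM + 7.0*MM*MM)*NLAT*NVER
     &      + (20.0*MM + 14.0*MM*MM)*NLAT*NVER
     &      + 25.0*NLON*NLAT*NVER*RLOG2
C     number of timesteps: TAUE is in hours, DT is in seconds
      NSTEPS = TAUE*3600.0/DT
      WRITE (*,*) 'per-node memory (REALs)     = ', MEM
      WRITE (*,*) 'flops per timestep (serial) = ', FLOPS
      WRITE (*,*) 'flops per run (serial)      = ', FLOPS*NSTEPS
      STOP
      END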
Flops and memory requirements for serial runs are as follows (approx.):

  T21:    500,000 REALs      2,000,000,000 flops
  T42:  4,000,000 REALs     45,000,000,000 flops
  T85: 34,391,000 REALs  1,000,000,000,000 flops

Both memory and flops scale well, so, for example, the T42 run fits in
approximately 4 MB of memory per processor for a 4-processor run. But
different algorithms and different aspect ratios of the processor grid use
different amounts of memory.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :

Count by hand (looking primarily at inner loops, but eliminating common
subexpressions that the compiler is expected to find).
-------------------------------------------------------------------------------