==================================================================
===                                                            ===
===            GENESIS Distributed Memory Benchmarks           ===
===                                                            ===
===                            QCD2                            ===
===                                                            ===
===     Conjugate Gradient iteration in SU(3) lattice gauge    ===
===             theory with Kogut-Susskind fermions            ===
===                                                            ===
===               Original author:  John Merlin                ===
===               Modified by    :  Ivan Wolton                ===
===               PARMACS macros :  Vladimir Getov             ===
===       Department of Electronics and Computer Science       ===
===                  University of Southampton                 ===
===                  Southampton SO9 5NH, U.K.                 ===
===      fax.:+44-703-593045    e-mail:icw@uk.ac.soton.ecs     ===
===                                    vsg@uk.ac.soton.ecs     ===
===                                                            ===
===         Copyright: SNARC, University of Southampton        ===
===                                                            ===
===            Last update: June 1993; Release: 2.2            ===
===                                                            ===
==================================================================


1. Description
--------------

This benchmark consists of solving a large, sparse system of linear
equations using conjugate gradient iteration. The equations are
derived from a lattice gauge theory simulation using dynamical
Kogut-Susskind fermions. Conjugate gradient methods form the core of
several important algorithms for lattice gauge theory with fermions.
Supercomputer performance is essential for such problems, as the
inclusion of dynamical fermions increases the computational effort
required by several orders of magnitude over the 'quenched'
approximation. (The quenched approximation is used in the QCD1
benchmark.)

Simulations are defined on four-dimensional lattices which are
discrete approximations to continuum space-time. The basic variables
are 3 by 3 complex matrices; four such matrices are associated with
every lattice site. The benchmark takes the common approach of
updating the variables on all even sites, and then on all odd sites,
on alternate steps. Updating a site variable requires a number of
matrix multiplications and involves matrices from neighbouring sites.

Almost all the arithmetic operations are vectorizable. However, in
order to achieve this vectorization an overhead is incurred in
internal shifts of neighbouring matrices, which can become a
significant part of the execution time.

The parallel version of the program distributes the spatial
dimensions of the lattice over a cuboidal process grid.
Communications involve both the shifting of matrices from
neighbouring processors and a global summation followed by a
broadcast.


2. Operating Instructions
-------------------------

Changing problem size and number of processors:
-----------------------------------------------

The problem is based on a 4-dimensional space-time lattice of size:

    N = NX**3 * NT

For the purposes of the benchmark, NX & NT are specified as integer
powers of 2, so that:

    NX = 2**LOGNX ,  NT = 2**LOGNT

In the parallel version of the program the number of processors (NP)
over which the spatial dimensions of the lattice are distributed is
determined by the parameter LOGP, which is the log to base 2 of the
required number of processors, i.e. NP = 2**LOGP.

The specified number of processors is configured internally within
the program as a 3D grid:

    NP = NPX * NPY * NPZ

where NPX, NPY and NPZ are all powers of two and NPX >= NPY >= NPZ.

The local lattice size on each processor is then:

    n = NT * (NX/NPX) * (NX/NPY) * (NX/NPZ)

Suggested Problem Sizes:
------------------------

It is recommended that the benchmark is run with four standard
problem sizes: 4 * 8**3, 8 * 16**3, 16 * 32**3 and 16 * 64**3.
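
The following small, self-contained C program is not part of the
benchmark; it is only an illustrative sketch of the size arithmetic
described above. It derives N, a 3D process grid with
NPX >= NPY >= NPZ, and the local lattice size n from LOGNX, LOGNT and
LOGP. The way it deals the factors of two out to NPX, NPY and NPZ is
an assumption made for illustration; the benchmark's internal
decomposition may differ.

    /* Illustrative sketch only -- not part of the QCD2 sources.     */
    /* Derives the global lattice size N, a 3D process grid with     */
    /* NPX >= NPY >= NPZ, and the local lattice size n per processor */
    /* from LOGNX, LOGNT and LOGP as defined in section 2.           */
    #include <stdio.h>

    int main(void)
    {
        int lognx = 4, lognt = 3, logp = 3; /* e.g. 8 * 16**3, 8 procs */

        int  nx = 1 << lognx;               /* NX = 2**LOGNX */
        int  nt = 1 << lognt;               /* NT = 2**LOGNT */
        long n  = (long)nx * nx * nx * nt;  /* N  = NX**3 * NT */

        /* Deal the LOGP factors of two out over x, y and z in turn,  */
        /* so that NPX >= NPY >= NPZ (an assumed factorisation).      */
        int npx = 1, npy = 1, npz = 1;
        for (int i = 0; i < logp; i++) {
            if      (i % 3 == 0) npx *= 2;
            else if (i % 3 == 1) npy *= 2;
            else                 npz *= 2;
        }

        long local = (long)nt * (nx / npx) * (nx / npy) * (nx / npz);

        printf("global lattice N = %ld sites\n", n);
        printf("process grid     = %d x %d x %d  (NP = %d)\n",
               npx, npy, npz, npx * npy * npz);
        printf("local lattice n  = %ld sites per processor\n", local);
        return 0;
    }

For the 8 * 16**3 case on 8 processors shown above, this reports a
2 x 2 x 2 process grid and a local lattice of 4096 sites.
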
The input parameters and total memory requirement for array storage
for each problem size are given in the following table:

    Problem Size    LOGNT    LOGNX    Approx Memory (Mbyte)

      4 *  8**3       2        3                2
      8 * 16**3       3        4               32
     16 * 32**3       4        5              512
     16 * 64**3       4        6             4096

Compiling and Running the Benchmark:
------------------------------------

1) Choose the problem size and number of processors, and edit the
   include file qcd2.inc to set the appropriate parameters.

2) To compile and link the benchmark type: `make' for the distributed
   version or `make slave' for the single-node version.

3) If any of the parameters in the include files are changed, the
   code has to be recompiled. The makefile will automatically pass
   only the affected files to the compiler.

4) On some systems it may be necessary to allocate the appropriate
   resources before running the benchmark; e.g. on the iPSC/860, to
   reserve a cube of 8 processors, type:

       getcube -t8

5) To run either the sequential or the distributed version of the
   benchmark, type:

       qcd2

   The progress of the benchmark execution can be monitored via the
   standard output, whilst a permanent copy of the benchmark output
   is written to a file called 'result'.

6) If the run is successful and a permanent record is required, the
   file 'result' should be copied to another file before the next
   run overwrites it.


3. Hints for Optimisation (Blockshift versus indirect addressing)
------------------------------------------------------------------

Two routines are provided for the shift operation, blockshift and
shiftvec. Blockshift shifts coherent blocks corresponding to a given
lattice direction. The problem is that the block length is rather
small for the t & x directions, so with large vector start-up costs
the vector efficiency is poor. Hence in these directions it is more
efficient to use the indirect-addressing version of the shift
routine, shiftvec.

The shift routines are called from the routine dvec, which by default
uses shiftvec in the t and x directions and blockshift in the y & z
directions. For best performance with smaller lattices, blockshift
should be used in the y direction.


4. Accuracy Check
-----------------

The output results are best characterised by the total energy per
lattice point (output in column 3). The program can be considered to
have run successfully if the following two conditions are met:

1) The total energy should be constant to 5 decimal places for each
   iteration (a small variation in the final, 6th place is
   allowable).

2) This constant value should be close to 3.0.

Unfortunately it is difficult to be more precise, as the fermion and
gauge fields are initialised by a random number generator.
Consequently the exact value of the total energy depends on the
number of processors and the problem size. A simple way of automating
this check is sketched below.

$Id: ReadMe,v 1.2 1994/04/20 16:50:10 igl Rel igl $
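
The following standalone C sketch (not part of the QCD2 distribution)
is one hypothetical way of automating the accuracy check of section 4.
It assumes that each iteration appears in 'result' as a line whose
third whitespace-separated field is the total energy per lattice
point, and that lines which do not parse that way can be skipped; the
0.1 tolerance used for "close to 3.0" is likewise an arbitrary choice.

    /* Hypothetical post-processing check -- not part of the QCD2    */
    /* sources. Reads 'result', takes the 3rd numeric field of each  */
    /* parsable line as the total energy per lattice point, and      */
    /* tests the two conditions given in section 4.                  */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        FILE *fp = fopen("result", "r");
        if (fp == NULL) { perror("result"); return 1; }

        char   line[512];
        double first = 0.0, a, b, e;
        int    count = 0, steady = 1;

        while (fgets(line, sizeof line, fp)) {
            if (sscanf(line, "%lf %lf %lf", &a, &b, &e) != 3)
                continue;                    /* skip headers etc.    */
            if (count == 0)
                first = e;
            else if (fabs(e - first) > 1.0e-5)  /* ~5 decimal places */
                steady = 0;
            count++;
        }
        fclose(fp);

        if (count == 0) { puts("no energy values found"); return 1; }
        printf("iterations: %d   energy: %.6f\n", count, first);
        printf("constant to 5 d.p.: %s   close to 3.0: %s\n",
               steady ? "yes" : "no",
               fabs(first - 3.0) < 0.1 ? "yes" : "no");
        return 0;
    }
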
Submitted by Mark Papiani,
last updated on 10 Jan 1995.