==================================================================
===                                                            ===
===     GENESIS / PARKBENCH Distributed Memory Benchmarks      ===
===                                                            ===
===                           SOLVER                           ===
===                                                            ===
===                 Quark propagator generator                 ===
===                                                            ===
===            Versions: Std F77, PARMACS, PVM 3.1             ===
===                                                            ===
===     PARKBENCH authors: Stephen Booth, Nick Stanford        ===
===     GENESIS mods:      Ian Glendinning                     ===
===                                                            ===
===     Inquiries: HPC Centre                                  ===
===                Computing Services                          ===
===                University of Southampton                   ===
===                Southampton SO17 4BJ, U.K.                  ===
===                                                            ===
===     Fax: +44 703 593939   E-mail: support@par.soton.ac.uk  ===
===                                                            ===
===     Last update: Jul 1994; Release: 3.0                    ===
===                                                            ===
==================================================================

1. Description
--------------

SOLVER is part of an ongoing software development exercise carried
out by UKQCD (the United Kingdom Quantum Chromo-Dynamics
collaboration) to develop a new generation of simulation codes.  The
current generation of codes was highly tuned for a particular
machine architecture, so a software development exercise was started
to design and develop a set of portable codes.  This code was
developed by S. Booth and N. Stanford of the University of Edinburgh
during the course of 1993.

SOLVER is a benchmark code derived from the codes used to generate
quark propagators.  It is designed to benchmark and validate the
computational sections of this operation.  It differs from the
production code in that it initialises itself with non-trivial test
data rather than performing file access, because there is no
accepted standard for parallel file access.  The benchmark was
originally developed as part of a national UK procurement exercise.

The application generates quark propagators from a background gauge
configuration and a fermionic source.  This is equivalent to solving

    M psi = source

where psi is the quark propagator and M (a function operating on
psi) depends on the gauge fields.  The benchmark performs a cut-down
version of this operation.

The benchmark code initialises the gauge field to a unit gauge
configuration (the results for a unit gauge can be calculated
analytically, allowing a check on the results).  A gauge
transformation is then applied to the gauge field.  A unit gauge
field consists only of zeros and ones; applying a gauge
transformation generates non-trivial values.  Quantities
corresponding to physical observables should be unchanged by such a
transformation.  In the application code the gauge field would have
been read in from disk.

The source field is initialised to a point source (a single non-zero
point on one lattice site).  An iterative solver is called to
generate the quark propagator.  The solver routine also generates
timing information.  In the application code the propagator would
then be dumped to disk.  In the benchmark we use the quark
propagator to generate a physically significant quantity (the pion
propagator).  This generates a single real number for each timeslice
of the lattice.  These values are printed to standard output.

This procedure requires a large number of iterations.  For
benchmarking we are only interested in the time per iteration and
some check on the validity of the results.  We therefore usually
perform only a fixed number of iterations (say 50) to generate
accurate timing information, and verify the results by comparison
with other machines.

Memory as function of problem size:

The appropriate parameters for memory use are:

   Max_body  - maximum number of data points per processor
   Max_bound - maximum number of data points on a single boundary
               between two processors

If LX, LY, LZ and LT are the local lattice sizes, obtained by
dividing each lattice dimension by the corresponding processor grid
dimension and rounding up to the nearest integer, then

   Max_body  = (LX*LY*LZ*LT)/2
   Max_bound = MAX( LX*LY*LZ/2, LY*LZ*LT/2, LX*LZ*LT/2, LX*LY*LT/2 )

The code contains a number of build-time switches for variations in
the implementation that may be beneficial on some machines.  The
memory usage depends on these switches, but typical values are:

   108 * Max_body + 36 * Max_bound    floating-point values (Fpoints)
    16 * (Max_body + Max_bound)       INTEGERs

Number of floating-point operations as function of problem size:

Each iteration performs 2760 floating-point operations per lattice
site; for example, 50 iterations on a 24^3*48 lattice amount to
9.16e+10 floating-point operations.
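As a rough guide, the following stand-alone program (a sketch only,
not part of the distribution) evaluates the formulas above for a
chosen lattice and processor grid.  The 24^3*48 lattice, 2x2x2x2
processor grid and 50 iterations are example values, not defaults.

      PROGRAM sizes
C Sketch: evaluate the SOLVER memory and flop-count formulas above.
      INTEGER X_latt, Y_latt, Z_latt, T_latt
      INTEGER X_proc, Y_proc, Z_proc, T_proc
      INTEGER LX, LY, LZ, LT, body, bound, iters
C Example values only (24^3*48 lattice on a 2x2x2x2 grid).
      PARAMETER( X_latt = 24, Y_latt = 24, Z_latt = 24, T_latt = 48 )
      PARAMETER( X_proc = 2, Y_proc = 2, Z_proc = 2, T_proc = 2 )
      PARAMETER( iters = 50 )
C Local lattice sizes: divide by the grid size, rounding up.
      LX = (X_latt + X_proc - 1)/X_proc
      LY = (Y_latt + Y_proc - 1)/Y_proc
      LZ = (Z_latt + Z_proc - 1)/Z_proc
      LT = (T_latt + T_proc - 1)/T_proc
      body  = (LX*LY*LZ*LT)/2
      bound = MAX( LX*LY*LZ/2, LY*LZ*LT/2, LX*LZ*LT/2, LX*LY*LT/2 )
      WRITE(*,*) 'Max_body  =', body
      WRITE(*,*) 'Max_bound =', bound
      WRITE(*,*) 'Fpoints   =', 108*body + 36*bound
      WRITE(*,*) 'INTEGERs  =', 16*(body + bound)
C 2760 flop per site per iteration, summed over the whole lattice.
      WRITE(*,*) 'Gflop     =', 2760.0 * 1.0E-9 * REAL(iters) *
     $     REAL(X_latt)*REAL(Y_latt)*REAL(Z_latt)*REAL(T_latt)
      END

For the 24^3*48 example this reproduces the 9.16e+10 flop figure
quoted above (printed as 91.6 Gflop).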
2. Operating Instructions
-------------------------

The problem size and number of processors are set in the file
auto_size.h.  For example, to run an 8^3*16 lattice on 4 processors
use:

C Set the problem size, these numbers MUST be even.
      PARAMETER( X_latt = 8,
     $           Y_latt = 8,
     $           Z_latt = 8,
     $           T_latt = 16)

C Set the size of the processor grid.
#ifndef FAKE
      PARAMETER( X_proc = 1,
     $           Y_proc = 1,
     $           Z_proc = 2,
     $           T_proc = 2)
#endif

The preprocessor option FAKE can be used to select single-node
execution; how to do this is described in more detail below.  The
total number of processors used (not counting the front-end) is
X_proc * Y_proc * Z_proc * T_proc.

Any reasonable processor grid can be used, as the program will
automatically use an irregular decomposition if a regular
decomposition is not possible.  (I consider a processor grid where
N_proc is more than half N_latt to be unreasonable.)  NB: the local
lattice size in the X and T directions must be even, so an irregular
decomposition will be used in these directions if the lattice
dimension is an odd multiple of the corresponding processor-grid
width.

The precision of the target machine must be set in the file
precision.h.  Other compile-time switches are set in options.h and
solver_options.h; these should be ignored.

Compiling and Running the Benchmark:

To compile and link the benchmark for distributed execution type:

   make

The PVM version can be compiled for sequential, single-node
execution by typing:

   make CPPFLAGS="-DPVM -DFAKE"

To run the benchmark type:

   solver

Results are written to the file `solver.res'.  The flop/s figure
that is reported is per node.

The SOLVER program has been configured to use a fixed number of
iterations rather than to iterate to convergence.  The solver
routine is run twice, and the free pion propagator is calculated
after each run.  The first run is for 4 iterations and is used to
verify the results.
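The quantities checked are the free pion propagator timeslices
described in section 1.  For orientation, the sketch below shows the
conventional way such timeslice sums are formed from the quark
propagator: |psi|^2 is summed over the spatial volume and the 12
spin-colour components of each timeslice.  The routine name, array
layout and index order here are assumptions for illustration, not
those of the benchmark source.

      SUBROUTINE pion(psi, pi_t, NX, NY, NZ, NT)
C Illustrative sketch only: produces one real number per timeslice
C by summing |psi|^2 over the spatial volume and the 12 spin-colour
C components (the conventional point-source pion correlator).
      INTEGER NX, NY, NZ, NT
      COMPLEX psi(12, NX, NY, NZ, NT)
      REAL pi_t(NT)
      INTEGER ix, iy, iz, it, isc
      DO 50 it = 1, NT
         pi_t(it) = 0.0
         DO 40 iz = 1, NZ
         DO 40 iy = 1, NY
         DO 40 ix = 1, NX
         DO 40 isc = 1, 12
            pi_t(it) = pi_t(it)
     $           + REAL( psi(isc,ix,iy,iz,it)
     $           * CONJG(psi(isc,ix,iy,iz,it)) )
 40      CONTINUE
 50   CONTINUE
      END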
In this case the residuals and the non-zero elements of the free
pion propagator should be the same for all lattice sizes larger than
16^3*32:

STATUS:solver:print_pion:0:Timeslice 0 0.84581123E+00
STATUS:solver:print_pion:0:Timeslice 1 0.54563329E-01
STATUS:solver:print_pion:0:Timeslice 2 0.83493473E-02
STATUS:solver:print_pion:0:Timeslice 3 0.13466747E-02
STATUS:solver:print_pion:0:Timeslice 4 0.17637214E-03
STATUS:solver:print_pion:0:Timeslice 5 0.17347775E-04
STATUS:solver:print_pion:0:Timeslice 6 0.13273219E-05
STATUS:solver:print_pion:0:Timeslice 7 0.64078817E-07
STATUS:solver:print_pion:0:Timeslice 8 0.12376473E-08
STATUS:solver:print_pion:0:Timeslice 9 0.00000000E+00

There may be some variation due to rounding.

The second run is for an additional 50 iterations and should be used
only as a measure of performance.
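To make "the same up to rounding" concrete, a comparison along the
following lines can be used.  This is a sketch only, not part of the
distribution: the tolerance is an arbitrary choice, and the 'got'
values are placeholders to be replaced by the numbers printed in
solver.res.

      PROGRAM chkres
C Sketch: compare measured timeslice values against the reference
C values above, allowing for rounding variation.
      INTEGER i
      REAL ref(10), got(10), rdiff, tol
C Arbitrary tolerance; tighten or loosen to suit the machine.
      PARAMETER( tol = 1.0E-4 )
      DATA ref / 0.84581123E+00, 0.54563329E-01, 0.83493473E-02,
     $           0.13466747E-02, 0.17637214E-03, 0.17347775E-04,
     $           0.13273219E-05, 0.64078817E-07, 0.12376473E-08,
     $           0.00000000E+00 /
C Placeholders: copy the measured values from solver.res here.
      DATA got / 0.84581123E+00, 0.54563329E-01, 0.83493473E-02,
     $           0.13466747E-02, 0.17637214E-03, 0.17347775E-04,
     $           0.13273219E-05, 0.64078817E-07, 0.12376473E-08,
     $           0.00000000E+00 /
      DO 10 i = 1, 10
         IF (ref(i) .NE. 0.0) THEN
            rdiff = ABS(got(i) - ref(i)) / ABS(ref(i))
         ELSE
            rdiff = ABS(got(i))
         END IF
         IF (rdiff .GT. tol) WRITE(*,*) 'Timeslice', i-1,
     $        ' differs, relative difference', rdiff
 10   CONTINUE
      END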
$Id: ReadMe,v 1.1 1994/07/20 18:10:44 igl Exp igl $

Submitted by Mark Papiani,
last updated on 10 Jan 1995.