The performance of the communication network of a parallel computer plays a critical role in its overall performance, see [4,6]. Writing scientific programs with calls to MPI [5,9] routines is rapidly becoming the standard way to write programs with explicit message passing. Thus, to evaluate the performance of the communication network of a parallel computer for scientific computing, a collection of communication tests that use MPI for the message passing has been written. These communication tests are a significant enhancement of those used in [7] and have been designed to exercise the communication patterns that we feel are likely to occur in scientific programs. This paper reports the results of these tests on the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC ("Power 2 Super Chip").
DESCRIPTION OF THE COMMUNICATION TESTS AND RESULTS
All tests were written in Fortran with calls to MPI routines for the message passing. The communication tests were run with message sizes ranging from 8 bytes to 10 MBytes and with the number of processors ranging from 2 to 64. (Throughout this paper, MByte means 1,000,000 bytes, not 1024*1024 bytes, and KByte means 1,000 bytes, not 1024 bytes.) Because of memory limitations, for some of the tests the 10 MByte message size was replaced by a message of size 1 MByte. Message sizes were not chosen based on any knowledge of the computers being used but were chosen to represent a small message (8 bytes) and a ‘large’ message (10 MBytes), with two additional message sizes between these values (1 KByte and 100 KBytes). Some of these communication patterns took a very short time to execute, so they were looped to obtain a wall-clock time of at least one second in order to obtain more accurate timings. The time to execute a particular communication pattern was then obtained by dividing the total time by the number of loops. All timings were done using the MPI wall-clock timer, mpi_wtime(). A call to mpi_barrier was made just prior to the first call to mpi_wtime and again just prior to the second call to mpi_wtime to ensure processor synchronization for timing. Sixty-four-bit real precision was used on all machines.
Default MPI environment settings were used for all tests on all three machines. Tests run on the Cray T3E-900 and Cray Origin 2000 were executed on machines dedicated to running only our tests. For these machines, five runs were made and the best performance results are reported. Tests run on the IBM P2SC were executed using LoadLeveler so that only one job at a time would be executing on the 32 nodes used. However, jobs running on other nodes would sometimes cause variability in the data, so the tests were run at least ten times and the best performance numbers reported.
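To make the timing methodology concrete, the following is a minimal sketch, in Fortran with MPI, of the measurement loop described above. It is not the authors' actual test code; the loop count nloops and the message size n are illustrative values only.

    ! Sketch of the timing harness described above (illustrative values only).
    program time_pattern
      use mpi
      implicit none
      integer, parameter :: n = 125            ! 125 * 8 bytes = 1 KByte message
      integer :: ierr, rank, nloops, i
      double precision :: t0, t1, tpattern
      real(kind=8) :: a(n)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      a = 1.0d0
      nloops = 1000                            ! chosen so the total time is at least one second

      call mpi_barrier(mpi_comm_world, ierr)   ! synchronize just before the first mpi_wtime
      t0 = mpi_wtime()
      do i = 1, nloops
         ! ... one repetition of the communication pattern being measured ...
      end do
      call mpi_barrier(mpi_comm_world, ierr)   ! synchronize just before the second mpi_wtime
      t1 = mpi_wtime()

      tpattern = (t1 - t0) / nloops            ! time for one repetition of the pattern
      if (rank == 0) print *, 'time per repetition (seconds): ', tpattern
      call mpi_finalize(ierr)
    end program time_pattern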
Tests for the Cray T3E-900 were run on a 64-processor machine located in Chippewa Falls, Wisconsin. Each processor is a DEC EV5 microprocessor running at 450 MHz with a peak theoretical performance of 900 Mflop/s. The three-dimensional, bi-directional torus communication network of the T3E-900 has a bandwidth of 350 MBytes/second and a latency of 1 microsecond. For more information on the T3E-900 see [9]. The UNICOS/mk version 1.3.1 operating system, the cf90 version 3.0 Fortran compiler with the -O2 -dp compiler options, and MPI version 3.0 were used for these tests. The MPI implementation used had the hardware data streams work-around enabled even though this is not needed for the T3E-900.
Tests for the Cray Origin 2000 were run on a 128-processor machine located in Eagan, Minnesota. Each processor is a 195 MHz MIPS R10000 microprocessor with a peak theoretical performance of 390 Mflop/s and a 4 MByte secondary cache. Each node consists of two processors sharing a common memory. The communication network is a hypercube for up to 32 processors and is called a "fat bristled hypercube" for more than 32 processors since multiple hypercubes are interconnected via the CrayRouter. The node-to-node bandwidth is 150 MBytes per second and the maximum remote latency in a 128-processor system is about 1 microsecond. For more information see [9]. A pre-release version of the Irix 6.5 operating system, MPI from version 1.1 of the Message Passing Toolkit, and the MipsPro 7.20 Fortran compiler with the -O2 -64 compiler options were used for these tests.
Tests for the IBM P2SC were run at the Maui High Performance Computing Center. The peak theoretical performance of each P2SC processor is 480 Mflop/s for the 120 MHz thin nodes and 540 Mflop/s for the 135 MHz wide nodes. The communication network has a peak bi-directional bandwidth of 110 MBytes/second with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Performance tests were run on thin nodes, each with 128 MBytes of memory. At the Maui High Performance Computing Center, there were only 48 thin nodes available for running these tests, so there is no data for 64 processors. For more information about the P2SC see [10]. The AIX 4.1.5.0 operating system, the xlf version 4.1.0.0 Fortran compiler with the -O3 -qarch=pwr2 compiler options, and MPI version 2.2.0.2 were used for these tests.
Communication Test 1 (point-to-point, see table 1):
The first communication test measures the time required to send a real array A(1:n) from one processor to another by halving the time required to send A from one processor to another and then send A back to the original processor, where n is chosen to obtain a message of the desired size. Thus, to obtain a message of size 1 KByte, n = 1,000/8 = 125. Since each A(i) is 8 bytes, the communication rate for sending a message from one processor to another is calculated by 2*8*n/(wall-clock time), where the wall-clock time is the time to send A from one processor to another and then back to the original processor. This test is the same as the COMMS1 test described in section 3.3.1 of [4]. This test uses mpi_send and mpi_recv.
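The following is a minimal sketch of this round-trip measurement, assuming exactly two MPI ranks and the timing harness sketched earlier; the subroutine and variable names are illustrative, and the rate follows the 2*8*n/(wall-clock time) formula above.

    ! Sketch of the round-trip (ping-pong) test, assuming exactly two ranks.
    subroutine pingpong(n, nloops, rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: rate_kbytes_per_sec
      real(kind=8) :: a(n)
      integer :: rank, ierr, i, status(mpi_status_size)
      double precision :: t0, t1, tround

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         if (rank == 0) then
            call mpi_send(a, n, mpi_double_precision, 1, 0, mpi_comm_world, ierr)
            call mpi_recv(a, n, mpi_double_precision, 1, 0, mpi_comm_world, status, ierr)
         else if (rank == 1) then
            call mpi_recv(a, n, mpi_double_precision, 0, 0, mpi_comm_world, status, ierr)
            call mpi_send(a, n, mpi_double_precision, 0, 0, mpi_comm_world, ierr)
         end if
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()

      tround = (t1 - t0) / nloops                       ! round-trip time for one message
      rate_kbytes_per_sec = (2.0d0 * 8.0d0 * n / tround) / 1.0d3
    end subroutine pingpong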
Table 1 gives performance rates in KBytes per second. As is done in all the tables, the last column gives the ratios of the performance results of the T3E-900 to the IBM P2SC and of the T3E-900 to the Origin 2000. Recall that for small messages network latency, not bandwidth, is the dominant factor determining the communication rate. Notice that the achieved bandwidth on this test is significantly less than the bandwidth rates provided by the vendors: 350,000 KBytes/second for the T3E-900, 110,000 KBytes/second for the P2SC, and 150,000 KBytes/second for the Origin. All data in this paper has been rounded to three significant digits.
[Table 1: point-to-point communication rates in KBytes per second.]
Communication Test 2 (broadcast, see tables 2-a, 2-b and 2-c and figures 1 and 2):
This test measures communication rates for sending a message from one processor to all other processors and uses the mpi_bcast routine. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a normalized broadcast rate as
normalized broadcast rate = (total data rate)/(p-1),
where p is the number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KBytes per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the p-1 other processors. If the broadcast operation and communication network were able to concurrently transmit the messages, then D = R*(p-1) and thus the normalized broadcast rate would remain constant as p varied for a given message size. Therefore, for a fixed message size, the rate at which the normalized broadcast rate decreases as p increases indicates how far the broadcast operation is from being ideal. Assume the real array A(1:n) is broadcast from the root processor, where each A(i) is 8 bytes; then the communication rate is calculated by 8*n*(p-1)/(wall-clock time) and is normalized by dividing by p-1 to obtain the normalized broadcast rate.
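A minimal sketch of the broadcast measurement with a fixed root processor follows, again assuming the timing harness sketched earlier; the names are illustrative.

    ! Sketch of the broadcast test with a fixed root (illustrative names).
    subroutine bcast_rate(n, nloops, norm_rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: norm_rate_kbytes_per_sec
      real(kind=8) :: a(n)
      integer :: ierr, nprocs, i
      double precision :: t0, t1, t

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         call mpi_bcast(a, n, mpi_double_precision, 0, mpi_comm_world, ierr)  ! root fixed at 0
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()
      t = (t1 - t0) / nloops

      ! total data rate = 8*n*(p-1)/t; dividing by (p-1) gives the normalized broadcast rate
      norm_rate_kbytes_per_sec = (8.0d0 * n / t) / 1.0d3
    end subroutine bcast_rate

For the root-cycling variant reported in table 2-b, the fixed root 0 would instead be varied from repetition to repetition, for example root = mod(i-1, nprocs).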
Table 2-a gives the normalized broadcast rates obtained by keeping the root processor fixed for all repetitions of the broadcast. Figure 1 shows the graph of these results for a message of size 100 KBytes. Observe that for all machines, for a fixed message size, the normalized broadcast rate decreases as the number of processors increases (instead of remaining constant). Notice that the P2SC and Origin perform roughly the same and that the T3E-900 ranges from 1.1 to 4.7 times faster than the P2SC and Origin. Observe that the Origin does not scale as well as the T3E-900 as the number of processors increases. One might expect that the communication rate for a broadcast with 2 processors would be the same as the rate for communication test 1. However, the rates measured for the broadcast in tables 2-a and 2-b are higher than those measured in communication test 1 for all machines. It is not clear why this is so.
[Table 2-a: normalized broadcast rates with a fixed root processor; the last column gives the T3E/IBM and T3E/Origin ratios. Figure 1: normalized broadcast rates for a 100 KByte message.]
Table 2-b gives the normalized broadcast rates where the root processor is cycled through all p processors as the broadcast operation is repeated. Notice that the rates do change from those in table 2-a and that the maximum percent change depends on the machine: about 14% for the T3E-900, about 150% for the P2SC, and about 50% for the Origin.
[Table 2-b: normalized broadcast rates with the root processor cycled through all processors.]
To better understand the amount of concurrency occurring in the broadcast operation, define the log normalized broadcast rate as
log normalized broadcast rate = (total data rate)/log(p),
where p is the number of processors involved in the communication and log(p) is the log base 2 of p. Thus, if binary tree parallelism were being utilized, the log normalized data rate would be constant for a given message size as p varies. Table 2-c gives the log normalized data rates with a fixed root processor and shows that concurrency is in fact being utilized in the broadcast operation on these machines. Figure 2 shows these results for a message of size 100 KBytes. Notice that the performance of the T3E-900 is significantly better than binary tree parallelism for all message sizes tested. For messages of size 8 bytes and 1 KByte, the P2SC performs better than binary tree parallelism; it achieves binary tree parallelism for the other two message sizes. The Origin gives better than binary tree parallelism for a 1 KByte message and binary tree parallelism for the other three message sizes.
[Table 2-c: log normalized broadcast rates with a fixed root processor. Figure 2: log normalized broadcast rates for a 100 KByte message.]
Communication Test 3 (reduce, see table 3):
Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, …, p-1, where n is chosen to obtain the message of the desired size. Test 3 measures communication rates for varying sizes of the Ai's when calculating A = A0 + A1 + … + A(p-1) and placing A on the root processor. Thus, this test uses mpi_reduce with the mpi_sum option. Since each element of Ai is 8 bytes, the communication rate can be calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by p-1. As was done with mpi_bcast, one could also calculate a log normalized data rate. Table 3 contains log normalized data rates since, as was true for mpi_bcast, more insight into the level of concurrency is obtained. Table 3 shows that the Origin exhibits binary tree parallelism and the other two machines exhibit better than binary tree parallelism. Notice that the T3E-900 performs well compared with the two other machines for messages of size 8 bytes and 1 KByte. However, for messages of size 100 KBytes (with 8 or more processors) and 10 MBytes (with 4 or more processors), the IBM machine gives superior performance. These results suggest that, for the larger message sizes, IBM may be using a better algorithm for mpi_reduce than the algorithm used on the T3E.
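A minimal sketch of the reduce measurement, under the same assumptions as the earlier sketches; mpi_reduce with the mpi_sum option forms the element-wise sum on the root processor.

    ! Sketch of the reduce test: element-wise sum of each rank's array onto rank 0.
    subroutine reduce_rate(n, nloops, lognorm_rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: lognorm_rate_kbytes_per_sec
      real(kind=8) :: a(n), asum(n)
      integer :: ierr, nprocs, i
      double precision :: t0, t1, t

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         call mpi_reduce(a, asum, n, mpi_double_precision, mpi_sum, 0, mpi_comm_world, ierr)
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()
      t = (t1 - t0) / nloops

      ! total data rate = 8*n*(p-1)/t, divided here by log2(p) to give the log normalized rate
      lognorm_rate_kbytes_per_sec = (8.0d0 * n * (nprocs - 1) / t) &
                                    / (log(dble(nprocs)) / log(2.0d0)) / 1.0d3
    end subroutine reduce_rate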
[Table 3: log normalized data rates for the reduce test.]
Communication Test 4 (all reduce, see table 4 and figure 3):
This communication test is the same as communication test 3 except that A is placed on all processors instead of only on the root processor. This test uses the mpi_allreduce routine and is functionally equivalent to a reduce followed by a broadcast. Thus, the communication rate for this test is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by p-1 to get a normalized data rate. Since the normalized data rates drop sharply for fixed message sizes as the number of processors increases, more insight into the level of concurrency is obtained by displaying log normalized data rates, see table 4 and figure 3. Notice that the P2SC and Origin exhibit binary tree parallelism and the T3E does much better. Also notice that for most of the cases in test 4, the T3E-900 significantly outperforms the other two machines. The P2SC does not scale nearly as well as the T3E-900 for messages of size 8 bytes and 1 KByte. The Origin does not scale nearly as well as the T3E-900 for any of the message sizes.
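For reference, a minimal sketch of the collective call timed in this test; the surrounding timing loop and rate calculation are as in the earlier sketches.

    ! Sketch of the all-reduce test: the summed array ends up on every rank.
    subroutine allreduce_once(a, asum, n)
      use mpi
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in) :: a(n)
      real(kind=8), intent(out) :: asum(n)
      integer :: ierr

      ! Functionally equivalent to mpi_reduce followed by mpi_bcast, so the test
      ! credits the operation with 2*[8*n*(p-1)] bytes per repetition.
      call mpi_allreduce(a, asum, n, mpi_double_precision, mpi_sum, mpi_comm_world, ierr)
    end subroutine allreduce_once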
[Table 4 and Figure 3: log normalized data rates for the all reduce test.]
Communication Test 5 (gather, see table 5):
Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, …, p-1. This test uses the mpi_gather routine and measures the communication rate for gathering the Ai's into an array B located on the root processor, where B(1:n,i) = Ai(1:n) for i = 0, …, p-1. Since the normalized data rates drop sharply as the number of processors increases for a fixed message size, the log normalized data rate is used for reporting performance results for this test. Thus, the communication rate is calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by log(p). Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte instead of 10 MBytes. The relative performance results are quite mixed, but the T3E-900 outperformed the other machines on all of these tests. Notice the large drop in performance on the Origin for 8 byte and 1 KByte messages as the number of processors increases.
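A minimal sketch of the gather operation timed in this test, with illustrative names; the timing loop is as before.

    ! Sketch of the gather test: rank i's array becomes column i of B on the root.
    subroutine gather_once(a, b, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in) :: a(n)
      real(kind=8), intent(out) :: b(n, 0:nprocs-1)   ! only significant on the root
      integer :: ierr

      call mpi_gather(a, n, mpi_double_precision, b, n, mpi_double_precision, &
                      0, mpi_comm_world, ierr)
    end subroutine gather_once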
[Table 5: log normalized data rates for the gather test.]
Communication Test 6 (all gather, see table 6 and figure 4):
This test is the same as test 5 except that the gathered message is placed on all processors instead of only on the root processor. Test 6 is functionally equivalent to a gather followed by a broadcast and uses mpi_allgather. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and is divided by log(p) to obtain a log normalized data rate. Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte. Notice the large drop in relative performance of the Origin as the number of processors increases. Also notice that none of these machines was able to achieve binary tree parallelism on this test.
[Table 6 and Figure 4: log normalized data rates for the all gather test.]
Communication Test 7 (scatter, see table 7):
Assume that B is a two-dimensional array, B(1:n,0:p-1), where p is the number of processors used. This test uses mpi_scatter and measures communication rates for scattering B from the root processor to all other processors so that processor j receives B(1:n,j), for j = 0, …, p-1. The communication rate for this test is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
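A minimal sketch of the scatter operation timed in this test, with illustrative names.

    ! Sketch of the scatter test: column j of B on the root is delivered to rank j.
    subroutine scatter_once(b, arecv, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in) :: b(n, 0:nprocs-1)    ! only significant on the root
      real(kind=8), intent(out) :: arecv(n)
      integer :: ierr

      call mpi_scatter(b, n, mpi_double_precision, arecv, n, mpi_double_precision, &
                       0, mpi_comm_world, ierr)
    end subroutine scatter_once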
Notice that relative to the T3E-900, the Origin performance results decrease as the number of processors increases for each message size. This also happens for the P2SC for all message sizes other than 8 bytes. Observe that the Origin and IBM P2SC perform roughly the same for most cases and that the T3E-900 is 2 to 3 times faster than both of these machines for most tests. Also notice that none of these machines is able to achieve binary tree parallelism except for the P2SC on the 8 byte message.
[Table 7: log normalized data rates for the scatter test.]
Communication Test 8 (all-to-all, see table 8 and figure 5):
Assume C is a three-dimensional array, C(1:n,0:p-1,0:p-1), with C(1:n,j,0:p-1) stored on processor j. Also assume that C(1:n,j,k) is sent to processor k, where j and k both range from 0 to p-1. This test uses mpi_alltoall and the communication rate is calculated by 8*n*(p-1)*p/(wall-clock time) and then normalized by dividing by p rather than by log(p). As the number of processors increases, this test provides a good stress test for the communication network. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
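A minimal sketch of the all-to-all operation timed in this test, with illustrative names; rank j's block k is delivered to rank k.

    ! Sketch of the all-to-all test: rank j sends its k-th block to rank k for every k.
    subroutine alltoall_once(csend, crecv, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in)  :: csend(n, 0:nprocs-1)  ! block k goes to rank k
      real(kind=8), intent(out) :: crecv(n, 0:nprocs-1)  ! block j arrives from rank j
      integer :: ierr

      call mpi_alltoall(csend, n, mpi_double_precision, crecv, n, mpi_double_precision, &
                        mpi_comm_world, ierr)
    end subroutine alltoall_once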
Notice that table 8 and figure 5 use normalized data rates and not log normalized data rates. Thus, table 8 and figure 5 show the high level of parallelism achieved for mpi_alltoall on these machines, especially the T3E-900. Also notice that, relative to one another, the performance of the T3E-900 and P2SC remained nearly constant for all these tests, with the T3E-900 giving roughly twice the performance of the P2SC. However, the performance of the Origin relative to the other two machines dropped significantly as the number of processors increased. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 8 and Figure 5: normalized data rates for the all-to-all test.]
Communication Test 9 (broadcast-gather, see table 9):
This test uses mpi_bcast and mpi_gather and measures communication rates for broadcasting a message from the root processor to all other processors and then having the root processor gather these messages back from all processors. This test is included since there may be situations where the root processor will broadcast a message to the other processors, the other processors use this message to perform some calculations, and then the newly computed data is gathered back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used, the largest message used for this test was 1 MByte.
Notice that the T3E-900 significantly outperforms the other machines. Also observe that there seems to be a problem on the Origin for 8 byte messages with 64 processors. No machine achieved binary tree parallelism on this test. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 9: log normalized data rates for the broadcast-gather test.]
Communication Test 10 (scatter-gather, see table 10):
This test uses mpi_scatter followed by mpi_gather and measures communication rates for scattering a message from the root processor and then gathering these messages back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. The largest message size used for this test was 1 MByte because of the large memory requirements of this test when 64 processors are used. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 10: log normalized data rates for the scatter-gather test.]
Communication Test 11 (reduce-scatter, see table 11):
The mpi_reduce_scatter routine with the mpi_sum option is functionally equivalent to first reducing the messages from all processors onto a root processor and then scattering the reduced message to all processors. This MPI routine could be implemented by a reduce followed by a scatter. However, it can be implemented more efficiently: instead of reducing the arrays onto a single processor and then scattering the reduced array, each processor reduces only those elements required for its portion of the final scattered array. Our communication rate is based on this more efficient method. Thus, it is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate.
From table 11, notice that the T3E-900 achieves better than binary tree parallelism for all the message sizes tested. The P2SC and Origin achieve better than binary tree parallelism for 100 KByte and 10 MByte messages. Notice that the performance of the P2SC improves significantly for messages of size 100 KBytes and 10 MBytes.
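A minimal sketch of the reduce-scatter operation timed in this test, with illustrative names and assuming n is divisible by the number of processors.

    ! Sketch of the reduce-scatter test: the element-wise sum is computed and each
    ! rank receives an equal-sized block of the result.
    subroutine reduce_scatter_once(a, ablock, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs         ! n is assumed divisible by nprocs here
      real(kind=8), intent(in)  :: a(n)
      real(kind=8), intent(out) :: ablock(n/nprocs)
      integer :: recvcounts(nprocs), ierr

      recvcounts = n / nprocs                  ! each rank receives the same block length
      call mpi_reduce_scatter(a, ablock, recvcounts, mpi_double_precision, mpi_sum, &
                              mpi_comm_world, ierr)
    end subroutine reduce_scatter_once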
The next two communication tests are designed to measure communication between "neighboring" processors for a ring of processors using mpi_cart_create (with reorder = .true.), mpi_cart_shift, and mpi_sendrecv.
Communication Test 12 (right shift, see table 12):
This communication test sends a message from processor i to processor (i+1) mod p, for i = 0, 1, …, p-1. Observe that the data rates for this test will increase proportionally with p in an ideal parallel machine. Thus, for communication tests 12 and 13, we define the normalized data rate to be (total data rate)/p. In an ideal parallel computer, the normalized data rate for the above communication would be constant since all communication would be done concurrently. For this test the total data rate is calculated by 8*n*p/(wall-clock time).
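A minimal sketch of this right-shift pattern using the routines named above; the periodic one-dimensional Cartesian topology and mpi_sendrecv follow the description in the text, while the subroutine and variable names are illustrative.

    ! Sketch of the right-shift test on a periodic ring built with mpi_cart_create.
    subroutine right_shift_once(a, arecv, n)
      use mpi
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in)  :: a(n)
      real(kind=8), intent(out) :: arecv(n)
      integer :: ring_comm, nprocs, ierr, left, right
      integer :: dims(1), status(mpi_status_size)
      logical :: periodic(1)

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      dims(1) = nprocs
      periodic(1) = .true.
      ! One-dimensional periodic ring; reorder = .true. lets MPI renumber the ranks.
      call mpi_cart_create(mpi_comm_world, 1, dims, periodic, .true., ring_comm, ierr)
      ! For a shift of +1 along dimension 0, "left" is the source and "right" the destination.
      call mpi_cart_shift(ring_comm, 0, 1, left, right, ierr)

      ! Send to the right neighbor while receiving from the left neighbor.
      call mpi_sendrecv(a, n, mpi_double_precision, right, 0, &
                        arecv, n, mpi_double_precision, left, 0, &
                        ring_comm, status, ierr)
      call mpi_comm_free(ring_comm, ierr)
    end subroutine right_shift_once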
Table 12 gives the normalized data rates for the above communication in KBytes/second. Notice that both the T3E-900 and the P2SC scale well as the number of processors increases (although there is only data for the P2SC up to 32 processors), since the normalized data rates are roughly constant as the number of processors increases. Observe that table 12 shows normalized data rates and not log normalized data rates, and hence exhibits the high degree of parallelism achieved on this test for all three machines, especially the T3E-900. Notice that the performance of the Origin relative to the T3E-900 becomes much worse as the message size increases. Also observe that the T3E-900 is significantly faster than both of the other machines.
[Table 12: normalized data rates in KBytes/second for the right shift test.]
Communication Test 13 (left & right shift, see table 13 and figure 6):
This test is the same as the above test except that here a message is sent from processor i to each of its neighbors, (i-1) mod p and (i+1) mod p, for i = 0, 1, …, p-1. Thus, the amount of data being moved on the network is twice that of the previous test, so the normalized data rate is calculated by 2*8*n*p/(wall-clock time). Notice that the normalized data rates for communication test 13 are about the same as those for communication test 12. The Origin communication network allows for the concurrent sending of incoming and outgoing streams of data from one node to another. Because of this, one might expect the normalized data rates for the Origin on this test to be twice those of the previous test. However, this doubling of the data rate did not occur. Figure 6 shows the normalized data rate for this test for a message of size 100 KBytes.
[Table 13: normalized data rates for the left & right shift test. Figure 6: normalized data rates for a 100 KByte message.]
CONCLUSIONS
This study was conducted to evaluate the relative communication performance of the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC on a collection of 13 communication tests that call MPI routines. The communication tests were designed to include communication patterns that we feel are likely to occur in scientific programs. Tests were run for messages of size 8 bytes, 1 KByte, 100 KBytes and 10 MBytes using 2, 4, 8, 16, 32 and 64 processors (although 64 processors were not available on the P2SC). Because of memory limitations, for some of the tests the 10 MByte message size was replaced by a message of size 1 MByte. The relative performance of these machines varied depending on the communication test, but overall the T3E-900 was often 2 to 4 times faster than the Origin and the P2SC. The Origin and P2SC performed about the same for most of the tests. For a fixed message size, the performance of the Origin relative to the T3E-900 would often drop significantly as the number of processors increased. For a fixed message size, the performance of the P2SC relative to the T3E-900 would typically also drop as the number of processors increased, but this drop was not nearly as large as that on the Origin.
ACKNOWLEDGMENTS
Computer time on the Maui High Performance Computer Center’s P2SC was sponsored by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.
We would like to thank Cray Research Inc. for allowing us to use their T3E-900 and Origin 2000 located in Chippewa Falls, Wisconsin and Eagan, Minnesota, USA, respectively.
REFERENCES