The performance of the communication network of a parallel computer plays a critical role in its overall performance, see [4,6]. Writing scientific programs with calls to MPI [5,9] routines is rapidly becoming the standard way to write programs with explicit message passing. Thus, to evaluate the performance of the communication network of a parallel computer for scientific computing, a collection of communication tests that use MPI for the message passing has been written. These communication tests are a significant enhancement of those used in [7] and have been designed to exercise the communication patterns that we feel are likely to occur in scientific programs. This paper reports the results of these tests on the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC ("Power 2 Super Chip").
DESCRIPTION OF THE COMMUNICATION TESTS AND RESULTS
All tests were written in Fortran with calls to MPI routines for the message passing. The communication tests were run with message sizes ranging from 8 bytes to 10 MBytes and with the number of processors ranging from 2 to 64. (Throughout this paper, MByte means 1,000,000 bytes, not 1024*1024 bytes, and KByte means 1,000 bytes, not 1024 bytes.) Because of memory limitations, for some of the tests the 10 MByte message size was replaced by a message of size 1 MByte. Message sizes were not chosen based on any knowledge of the computers being used but were chosen to represent a small message (8 bytes) and a ‘large’ message (10 MBytes), with two additional message sizes between these values (1 KByte and 100 KBytes). Some of these communication patterns took a very short time to execute, so they were looped to obtain a wall-clock time of at least one second in order to obtain more accurate timings. The time to execute a particular communication pattern was then obtained by dividing the total time by the number of loops. All timings were done using the MPI wall-clock timer, mpi_wtime(). A call to mpi_barrier was made just prior to the first call to mpi_wtime and again just prior to the second call to mpi_wtime to ensure processor synchronization for timing. Sixty-four-bit real precision was used on all machines.
Default MPI environment settings were used for all tests on all three machines. Tests run on the Cray T3E-900 and Cray Origin 2000 were executed on machines dedicated to running only our tests. For these machines, five runs were made and the best performance results are reported. Tests run on the IBM P2SC were executed using LoadLeveler so that only one job at a time would be executing on the 32 nodes used. However, jobs running on other nodes would sometimes cause variability in the data, so the tests were run at least ten times and the best performance numbers reported.
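To make the timing methodology concrete, the following is a minimal sketch, in Fortran with MPI, of the measurement loop described above. It is not the authors' actual test code; the loop count nloops and the message size n are illustrative values only.

    ! Sketch of the timing harness described above (illustrative values only).
    program time_pattern
      use mpi
      implicit none
      integer, parameter :: n = 125            ! 125 * 8 bytes = 1 KByte message
      integer :: ierr, rank, nloops, i
      double precision :: t0, t1, tpattern
      real(kind=8) :: a(n)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      a = 1.0d0
      nloops = 1000                            ! chosen so the total time is at least one second

      call mpi_barrier(mpi_comm_world, ierr)   ! synchronize just before the first mpi_wtime
      t0 = mpi_wtime()
      do i = 1, nloops
         ! ... one repetition of the communication pattern being measured ...
      end do
      call mpi_barrier(mpi_comm_world, ierr)   ! synchronize just before the second mpi_wtime
      t1 = mpi_wtime()

      tpattern = (t1 - t0) / nloops            ! time for one repetition of the pattern
      if (rank == 0) print *, 'time per repetition (seconds): ', tpattern
      call mpi_finalize(ierr)
    end program time_pattern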
Tests for the Cray T3E-900 were run on a 64-processor machine located in Chippewa Falls, Wisconsin. Each processor is a DEC EV5 microprocessor running at 450 MHz with a peak theoretical performance of 900 Mflop/s. The three-dimensional, bi-directional torus communication network of the T3E-900 has a bandwidth of 350 MBytes/second and a latency of 1 microsecond. For more information on the T3E-900 see [9]. The UNICOS/mk version 1.3.1 operating system, the cf90 version 3.0 Fortran compiler with the -O2 -dp compiler options, and MPI version 3.0 were used for these tests. The MPI implementation used had the hardware data streams work-around enabled even though this is not needed for the T3E-900.
Tests for the Cray Origin 2000 were run on a 128-processor machine located in Eagan, Minnesota. Each processor is a 195 MHz MIPS R10000 microprocessor with a peak theoretical performance of 390 Mflop/s and a 4 MByte secondary cache. Each node consists of two processors sharing a common memory. The communication network is a hypercube for up to 32 processors and is called a "fat bristled hypercube" for more than 32 processors since multiple hypercubes are interconnected via the CrayRouter. The node-to-node bandwidth is 150 MBytes per second and the maximum remote latency in a 128-processor system is about 1 microsecond. For more information see [9]. A pre-release version of the Irix 6.5 operating system, MPI from version 1.1 of the Message Passing Toolkit, and the MipsPro 7.20 Fortran compiler with the -O2 -64 compiler options were used for these tests.
Tests for the IBM P2SC were run at the Maui High Performance Computing Center. The peak theoretical performance of each P2SC processor is 480 Mflop/s for the 120 MHz thin nodes and 540 Mflop/s for the 135 MHz wide nodes. The communication network has a peak bi-directional bandwidth of 110 MBytes/second with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Performance tests were run on thin nodes, each with 128 MBytes of memory. At the Maui High Performance Computing Center, there were only 48 thin nodes available for running these tests, so there is no data for 64 processors. For more information about the P2SC see [10]. The AIX 4.1.5.0 operating system, the xlf version 4.1.0.0 Fortran compiler with the -O3 -qarch=pwr2 compiler options, and MPI version 2.2.0.2 were used for these tests.
Communication Test 1 (point-to-point, see table 1):
The first communication test measures the time required to send a real array A(1:n) from one processor to another by halving the time required to send A from one processor to another and then send A back to the original processor, where n is chosen to obtain a message of the desired size. Thus, to obtain a message of size 1 KByte, n = 1,000/8 = 125. Since each A(i) is 8 bytes, the communication rate for sending a message from one processor to another is calculated by 2*8*n/(wall-clock time), where the wall-clock time is the time to send A from one processor to another and then back to the original processor. This test is the same as the COMMS1 test described in section 3.3.1 of [4]. This test uses mpi_send and mpi_recv.
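The following is a minimal sketch of this round-trip measurement, assuming exactly two MPI ranks and the timing harness sketched earlier; the subroutine and variable names are illustrative, and the rate follows the 2*8*n/(wall-clock time) formula above.

    ! Sketch of the round-trip (ping-pong) test, assuming exactly two ranks.
    subroutine pingpong(n, nloops, rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: rate_kbytes_per_sec
      real(kind=8) :: a(n)
      integer :: rank, ierr, i, status(mpi_status_size)
      double precision :: t0, t1, tround

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         if (rank == 0) then
            call mpi_send(a, n, mpi_double_precision, 1, 0, mpi_comm_world, ierr)
            call mpi_recv(a, n, mpi_double_precision, 1, 0, mpi_comm_world, status, ierr)
         else if (rank == 1) then
            call mpi_recv(a, n, mpi_double_precision, 0, 0, mpi_comm_world, status, ierr)
            call mpi_send(a, n, mpi_double_precision, 0, 0, mpi_comm_world, ierr)
         end if
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()

      tround = (t1 - t0) / nloops                       ! round-trip time for one message
      rate_kbytes_per_sec = (2.0d0 * 8.0d0 * n / tround) / 1.0d3
    end subroutine pingpong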
Table 1 gives performance rates in KBytes per second. As is done in all the tables, the last column gives the ratios of the performance results of the T3E-900 to the IBM P2SC and of the T3E-900 to the Origin 2000. Recall that for small messages network latency, not bandwidth, is the dominant factor determining the communication rate. Notice that the achieved bandwidth on this test is significantly less than the bandwidth rates provided by the vendors: 350,000 KBytes/second for the T3E-900, 110,000 KBytes/second for the P2SC, and 150,000 KBytes/second for the Origin. All data in this paper has been rounded to three significant digits.
[Table 1: point-to-point communication rates in KBytes per second.]
Communication Test 2 (broadcast, see tables 2-a, 2-b and 2-c and figures 1 and 2):
This test measures communication rates for sending a message from one processor to all other processors and uses the mpi_bcast routine. This test is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a normalized broadcast rate as
normalized broadcast rate = (total data rate)/(p-1),
where p is the number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KBytes per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the p-1 other processors. If the broadcast operation and communication network were able to concurrently transmit the messages, then D = R*(p-1) and thus the normalized broadcast rate would remain constant as p varied for a given message size. Therefore, for a fixed message size, the rate at which the normalized broadcast rate decreases as p increases indicates how far the broadcast operation is from being ideal. Assume the real array A(1:n) is broadcast from the root processor, where each A(i) is 8 bytes; then the communication rate is calculated by 8*n*(p-1)/(wall-clock time) and is normalized by dividing by p-1 to obtain the normalized broadcast rate.
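A minimal sketch of the broadcast measurement with a fixed root processor follows, again assuming the timing harness sketched earlier; the names are illustrative.

    ! Sketch of the broadcast test with a fixed root (illustrative names).
    subroutine bcast_rate(n, nloops, norm_rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: norm_rate_kbytes_per_sec
      real(kind=8) :: a(n)
      integer :: ierr, nprocs, i
      double precision :: t0, t1, t

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         call mpi_bcast(a, n, mpi_double_precision, 0, mpi_comm_world, ierr)  ! root fixed at 0
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()
      t = (t1 - t0) / nloops

      ! total data rate = 8*n*(p-1)/t; dividing by (p-1) gives the normalized broadcast rate
      norm_rate_kbytes_per_sec = (8.0d0 * n / t) / 1.0d3
    end subroutine bcast_rate

For the root-cycling variant reported in table 2-b, the fixed root 0 would instead be varied from repetition to repetition, for example root = mod(i-1, nprocs).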
Table 2-a gives the normalized broadcast rates obtained by keeping the root processor fixed for all repetitions of the broadcast. Figure 1 shows the graph of these results for a message of size 100 KBytes. Observe that for all machines, for a fixed message size, the normalized broadcast rate decreases as the number of processors increases (instead of remaining constant). Notice that the P2SC and Origin perform roughly the same and that the T3E-900 ranges from 1.1 to 4.7 times faster than the P2SC and Origin. Observe that the Origin does not scale as well as the T3E-900 as the number of processors increases. One might expect that the communication rate for a broadcast with 2 processors would be the same as the rate for communication test 1. However, the rates measured for the broadcast in tables 2-a and 2-b are higher than those measured in communication test 1 for all machines. It is not clear why this is so.
[Table 2-a: normalized broadcast rates with a fixed root processor; the last column gives the T3E/IBM and T3E/Origin ratios. Figure 1: normalized broadcast rates for a 100 KByte message.]
Table 2-b gives the normalized broadcast rates where the root processor is cycled through all p processors as the broadcast operation is repeated. Notice that the rates do change from those in table 2-a and that the maximum percent change depends on the machine: about 14% for the T3E-900, about 150% for the P2SC, and about 50% for the Origin.
[Table 2-b: normalized broadcast rates with the root processor cycled through all processors.]
To better understand the amount of concurrency occurring in the broadcast operation, define the log normalized broadcast rate as
log normalized broadcast rate = (total data rate)/log(p),
where p is the number of processors involved in the communication and log(p) is the log base 2 of p. Thus, if binary tree parallelism were being utilized, the log normalized data rate would be constant for a given message size as p varies. Table 2-c gives the log normalized data rates with a fixed root processor and shows that concurrency is in fact being utilized in the broadcast operation on these machines. Figure 2 shows these results for a message of size 100 KBytes. Notice that the performance of the T3E-900 is significantly better than binary tree parallelism for all message sizes tested. For messages of size 8 bytes and 1 KByte, the P2SC performs better than binary tree parallelism; it achieves binary tree parallelism for the other two message sizes. The Origin gives better than binary tree parallelism for a 1 KByte message and binary tree parallelism for the other three message sizes.
[Table 2-c: log normalized broadcast rates with a fixed root processor. Figure 2: log normalized broadcast rates for a 100 KByte message.]
Communication Test 3 (reduce, see table 3):
Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, …, p-1, where n is chosen to obtain the message of the desired size. Test 3 measures communication rates for varying sizes of the Ai's when calculating A = A0 + A1 + … + A(p-1) and placing A on the root processor. Thus, this test uses mpi_reduce with the mpi_sum option. Since each element of Ai is 8 bytes, the communication rate can be calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by p-1. As was done with mpi_bcast, one could also calculate a log normalized data rate. Table 3 contains log normalized data rates since, as was true for mpi_bcast, more insight into the level of concurrency is obtained. Table 3 shows that the Origin exhibits binary tree parallelism and the other two machines exhibit better than binary tree parallelism. Notice that the T3E-900 performs well compared with the two other machines for messages of size 8 bytes and 1 KByte. However, for messages of size 100 KBytes (with 8 or more processors) and 10 MBytes (with 4 or more processors), the IBM machine gives superior performance. These results suggest that, for the larger message sizes, IBM may be using a better algorithm for mpi_reduce than the algorithm used on the T3E.
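A minimal sketch of the reduce measurement, under the same assumptions as the earlier sketches; mpi_reduce with the mpi_sum option forms the element-wise sum on the root processor.

    ! Sketch of the reduce test: element-wise sum of each rank's array onto rank 0.
    subroutine reduce_rate(n, nloops, lognorm_rate_kbytes_per_sec)
      use mpi
      implicit none
      integer, intent(in) :: n, nloops
      double precision, intent(out) :: lognorm_rate_kbytes_per_sec
      real(kind=8) :: a(n), asum(n)
      integer :: ierr, nprocs, i
      double precision :: t0, t1, t

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      a = 1.0d0
      call mpi_barrier(mpi_comm_world, ierr)
      t0 = mpi_wtime()
      do i = 1, nloops
         call mpi_reduce(a, asum, n, mpi_double_precision, mpi_sum, 0, mpi_comm_world, ierr)
      end do
      call mpi_barrier(mpi_comm_world, ierr)
      t1 = mpi_wtime()
      t = (t1 - t0) / nloops

      ! total data rate = 8*n*(p-1)/t, divided here by log2(p) to give the log normalized rate
      lognorm_rate_kbytes_per_sec = (8.0d0 * n * (nprocs - 1) / t) &
                                    / (log(dble(nprocs)) / log(2.0d0)) / 1.0d3
    end subroutine reduce_rate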
[Table 3: log normalized data rates for the reduce test.]
Communication Test 4 (all reduce, see table 4 and figure 3):
This communication test is the same as communication test 3 except that A is placed on all processors instead of only on the root processor. This test uses the mpi_allreduce routine and is functionally equivalent to a reduce followed by a broadcast. Thus, the communication rate for this test is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by p-1 to get a normalized data rate. Since the normalized data rates drop sharply for fixed message sizes as the number of processors increases, more insight into the level of concurrency is obtained by displaying log normalized data rates, see table 4 and figure 3. Notice that the P2SC and Origin exhibit binary tree parallelism and the T3E does much better. Also notice that for most of the cases in test 4, the T3E-900 significantly outperforms the other two machines. The P2SC does not scale nearly as well as the T3E-900 for messages of size 8 bytes and 1 KByte. The Origin does not scale nearly as well as the T3E-900 for any of the message sizes.
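For reference, a minimal sketch of the collective call timed in this test; the surrounding timing loop and rate calculation are as in the earlier sketches.

    ! Sketch of the all-reduce test: the summed array ends up on every rank.
    subroutine allreduce_once(a, asum, n)
      use mpi
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in) :: a(n)
      real(kind=8), intent(out) :: asum(n)
      integer :: ierr

      ! Functionally equivalent to mpi_reduce followed by mpi_bcast, so the test
      ! credits the operation with 2*[8*n*(p-1)] bytes per repetition.
      call mpi_allreduce(a, asum, n, mpi_double_precision, mpi_sum, mpi_comm_world, ierr)
    end subroutine allreduce_once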
[Table 4 and Figure 3: log normalized data rates for the all reduce test.]
Communication Test 5 (gather, see table 5):
Assume that there are p processors and that processor i has a message, Ai(1:n), for i = 0, …, p-1. This test uses the mpi_gather routine and measures the communication rate for gathering the Ai's into an array B located on the root processor, where B(1:n,i) = Ai(1:n) for i = 0, …, p-1. Since the normalized data rates drop sharply as the number of processors increases for a fixed message size, the log normalized data rate is used for reporting performance results for this test. Thus, the communication rate is calculated by 8*n*(p-1)/(wall-clock time) and then normalized by dividing by log(p). Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte instead of 10 MBytes. The relative performance results are quite mixed, but the T3E-900 outperformed the other machines on all of these tests. Notice the large drop in performance on the Origin for 8 byte and 1 KByte messages as the number of processors increases.
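A minimal sketch of the gather operation timed in this test, with illustrative names; the timing loop is as before.

    ! Sketch of the gather test: rank i's array becomes column i of B on the root.
    subroutine gather_once(a, b, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in) :: a(n)
      real(kind=8), intent(out) :: b(n, 0:nprocs-1)   ! only significant on the root
      integer :: ierr

      call mpi_gather(a, n, mpi_double_precision, b, n, mpi_double_precision, &
                      0, mpi_comm_world, ierr)
    end subroutine gather_once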
[Table 5: log normalized data rates for the gather test.]
Communication Test 6 (all gather, see table 6 and figure 4):
This test is the same as test 5 except that the gathered message is placed on all processors instead of only on the root processor. Test 6 is functionally equivalent to a gather followed by a broadcast and uses mpi_allgather. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and is divided by log(p) to obtain a log normalized data rate. Because of the large amount of memory required to store B when a large number of processors is used, the largest message size used for this test was 1 MByte. Notice the large drop in relative performance of the Origin as the number of processors increases. Also notice that none of these machines was able to achieve binary tree parallelism on this test.
[Table 6 and Figure 4: log normalized data rates for the all gather test.]
Communication Test 7 (scatter, see table 7):
Assume that B is a two-dimensional array, B(1:n,0:p-1), where p is the number of processors used. This test uses mpi_scatter and measures communication rates for scattering B from the root processor to all other processors so that processor j receives B(1:n,j), for j = 0, …, p-1. The communication rate for this test is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
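A minimal sketch of the scatter operation timed in this test, with illustrative names.

    ! Sketch of the scatter test: column j of B on the root is delivered to rank j.
    subroutine scatter_once(b, arecv, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in) :: b(n, 0:nprocs-1)    ! only significant on the root
      real(kind=8), intent(out) :: arecv(n)
      integer :: ierr

      call mpi_scatter(b, n, mpi_double_precision, arecv, n, mpi_double_precision, &
                       0, mpi_comm_world, ierr)
    end subroutine scatter_once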
Notice that relative to the T3E-900, the Origin performance results decrease as the number of processors increases for each message size. This also happens for the P2SC for all message sizes other than 8 bytes. Observe that the Origin and IBM P2SC perform roughly the same for most cases and that the T3E-900 is 2 to 3 times faster than both of these machines for most tests. Also notice that none of these machines is able to achieve binary tree parallelism except for the P2SC on the 8 byte message.
[Table 7: log normalized data rates for the scatter test.]
Communication Test 8 (all-to-all, see table 8 and figure 5):
Assume C is a three-dimensional array, C(1:n,0:p-1,0:p-1), with C(1:n,j,0:p-1) stored on processor j. Also assume that C(1:n,j,k) is sent to processor k, where j and k both range from 0 to p-1. This test uses mpi_alltoall and the communication rate is calculated by 8*n*(p-1)*p/(wall-clock time) and then normalized by dividing by p rather than by log(p). As the number of processors increases, this test provides a good stress test for the communication network. Because of the large memory requirements when a large number of processors is used for this test, the largest message used for this test was 1 MByte.
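A minimal sketch of the all-to-all operation timed in this test, with illustrative names; rank j's block k is delivered to rank k.

    ! Sketch of the all-to-all test: rank j sends its k-th block to rank k for every k.
    subroutine alltoall_once(csend, crecv, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs
      real(kind=8), intent(in)  :: csend(n, 0:nprocs-1)  ! block k goes to rank k
      real(kind=8), intent(out) :: crecv(n, 0:nprocs-1)  ! block j arrives from rank j
      integer :: ierr

      call mpi_alltoall(csend, n, mpi_double_precision, crecv, n, mpi_double_precision, &
                        mpi_comm_world, ierr)
    end subroutine alltoall_once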
Notice that table 8 and figure 5 use normalized data rates and not log normalized data rates. Thus, table 8 and figure 5 show the high level of parallelism achieved for mpi_alltoall on these machines, especially the T3E-900. Also notice that, relative to one another, the performance of the T3E-900 and P2SC remained nearly constant for all these tests, with the T3E-900 giving roughly twice the performance of the P2SC. However, the performance of the Origin relative to the other two machines dropped significantly as the number of processors increased. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 8 and Figure 5: normalized data rates for the all-to-all test.]
Communication Test 9 (broadcast-gather, see table 9):
This test uses mpi_bcast and mpi_gather and measures communication rates for broadcasting a message from the root processor to all other processors and then having the root processor gather these messages back from all processors. This test is included since there may be situations where the root processor will broadcast a message to the other processors, the other processors use this message to perform some calculations, and then the newly computed data is gathered back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. Because of the large memory requirements when a large number of processors is used, the largest message used for this test was 1 MByte.
Notice that the T3E-900 significantly outperforms the other machines. Also observe that there seems to be a problem on the Origin for 8 byte messages with 64 processors. No machine achieved binary tree parallelism on this test. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 9: log normalized data rates for the broadcast-gather test.]
Communication Test 10 (scatter-gather, see table 10):
This test uses mpi_scatter followed by mpi_gather and measures communication rates for scattering a message from the root processor and then gathering these messages back to the root processor. The communication rate is calculated by 2*[8*n*(p-1)]/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate. The largest message size used for this test was 1 MByte because of the large memory requirements of this test when 64 processors are used. There was insufficient memory on the T3E-900 to run this test for a 1 MByte message with 64 processors.
[Table 10: log normalized data rates for the scatter-gather test.]
Communication Test 11 (reduce-scatter, see table 11):
The mpi_reduce_scatter routine with the mpi_sum option is functionally equivalent to first reducing the messages from all processors onto a root processor and then scattering the reduced message to all processors. This MPI routine could be implemented by a reduce followed by a scatter. However, it can be implemented more efficiently: instead of reducing the arrays onto a single processor and then scattering the reduced array, each processor reduces only those elements required for its portion of the final scattered array. Our communication rate is based on this more efficient method. Thus, it is calculated by 8*n*(p-1)/(wall-clock time) and then divided by log(p) to obtain the log normalized data rate.
From table 11, notice that the T3E-900 achieves better than binary tree parallelism for all the message sizes tested. The P2SC and Origin achieve better than binary tree parallelism for 100 KByte and 10 MByte messages. Notice that the performance of the P2SC improves significantly for messages of size 100 KBytes and 10 MBytes.
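A minimal sketch of the reduce-scatter operation timed in this test, with illustrative names and assuming n is divisible by the number of processors.

    ! Sketch of the reduce-scatter test: the element-wise sum is computed and each
    ! rank receives an equal-sized block of the result.
    subroutine reduce_scatter_once(a, ablock, n, nprocs)
      use mpi
      implicit none
      integer, intent(in) :: n, nprocs         ! n is assumed divisible by nprocs here
      real(kind=8), intent(in)  :: a(n)
      real(kind=8), intent(out) :: ablock(n/nprocs)
      integer :: recvcounts(nprocs), ierr

      recvcounts = n / nprocs                  ! each rank receives the same block length
      call mpi_reduce_scatter(a, ablock, recvcounts, mpi_double_precision, mpi_sum, &
                              mpi_comm_world, ierr)
    end subroutine reduce_scatter_once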
The next two communication tests are designed to measure communication between "neighboring" processors for a ring of processors using mpi_cart_create (with reorder = .true.), mpi_cart_shift, and mpi_sendrecv.
Communication Test 12 (right shift, see table 12):
This communication test sends a message from processor i to processor (i+1) mod p, for i = 0, 1, …, p-1. Observe that the data rates for this test will increase proportionally with p in an ideal parallel machine. Thus, for communication tests 12 and 13, we define the normalized data rate to be (total data rate)/p. In an ideal parallel computer, the normalized data rate for the above communication would be constant since all communication would be done concurrently. For this test the total data rate is calculated by 8*n*p/(wall-clock time).
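A minimal sketch of this right-shift pattern using the routines named above; the periodic one-dimensional Cartesian topology and mpi_sendrecv follow the description in the text, while the subroutine and variable names are illustrative.

    ! Sketch of the right-shift test on a periodic ring built with mpi_cart_create.
    subroutine right_shift_once(a, arecv, n)
      use mpi
      implicit none
      integer, intent(in) :: n
      real(kind=8), intent(in)  :: a(n)
      real(kind=8), intent(out) :: arecv(n)
      integer :: ring_comm, nprocs, ierr, left, right
      integer :: dims(1), status(mpi_status_size)
      logical :: periodic(1)

      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      dims(1) = nprocs
      periodic(1) = .true.
      ! One-dimensional periodic ring; reorder = .true. lets MPI renumber the ranks.
      call mpi_cart_create(mpi_comm_world, 1, dims, periodic, .true., ring_comm, ierr)
      ! For a shift of +1 along dimension 0, "left" is the source and "right" the destination.
      call mpi_cart_shift(ring_comm, 0, 1, left, right, ierr)

      ! Send to the right neighbor while receiving from the left neighbor.
      call mpi_sendrecv(a, n, mpi_double_precision, right, 0, &
                        arecv, n, mpi_double_precision, left, 0, &
                        ring_comm, status, ierr)
      call mpi_comm_free(ring_comm, ierr)
    end subroutine right_shift_once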
Table 12 gives the normalized data rates for the above communication in KBytes/second. Notice that both the T3E-900 and the P2SC scale well as the number of processors increases (although there is only data for the P2SC up to 32 processors), since the normalized data rates are roughly constant as the number of processors increases. Observe that table 12 shows normalized data rates and not log normalized data rates, and hence exhibits the high degree of parallelism achieved on this test for all three machines, especially the T3E-900. Notice that the performance of the Origin relative to the T3E-900 becomes much worse as the message size increases. Also observe that the T3E-900 is significantly faster than both of the other machines.
[Table 12: normalized data rates in KBytes/second for the right shift test.]
Communication Test 13 (left & right shift, see table 13 and figure 6):
This test is the same as the above test except that here a message is sent from processor i to each of its neighbors, (i-1) mod p and (i+1) mod p, for i = 0, 1, …, p-1. Thus, the amount of data being moved on the network is twice that of the previous test, so the normalized data rate is calculated by 2*8*n*p/(wall-clock time). Notice that the normalized data rates for communication test 13 are about the same as those for communication test 12. The Origin communication network allows for the concurrent sending of incoming and outgoing streams of data from one node to another. Because of this, one might expect the normalized data rates for the Origin on this test to be twice those of the previous test. However, this doubling of the data rate did not occur. Figure 6 shows the normalized data rate for this test for a message of size 100 KBytes.
[Table 13: normalized data rates for the left & right shift test. Figure 6: normalized data rates for a 100 KByte message.]
CONCLUSIONS
This study was conducted to evaluate the relative communication performance of the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC on a collection of 13 communication tests that call MPI routines. The communication tests were designed to include communication patterns that we feel are likely to occur in scientific programs. Tests were run for messages of size 8 bytes, 1 KByte, 100 KBytes and 10 MBytes using 2, 4, 8, 16, 32 and 64 processors (although 64 processors were not available on the P2SC). Because of memory limitations, for some of the tests the 10 MByte message size was replaced by a message of size 1 MByte. The relative performance of these machines varied depending on the communication test, but overall the T3E-900 was often 2 to 4 times faster than the Origin and the P2SC. The Origin and P2SC performed about the same for most of the tests. For a fixed message size, the performance of the Origin relative to the T3E-900 would often drop significantly as the number of processors increased. For a fixed message size, the performance of the P2SC relative to the T3E-900 would typically also drop as the number of processors increased, but this drop was not nearly as large as that on the Origin.
ACKNOWLEDGMENTS
Computer time on the Maui High Performance Computer Center’s P2SC was sponsored by the Phillips Laboratory, Air Force Material Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.
We would like to thank Cray Research Inc. for allowing us to use their T3E-900 and Origin 2000 located in Chippewa Falls, Wisconsin and Eagan, Minnesota, USA, respectively.
REFERENCES