Abstract
This paper reports the performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of sizes 8 Bytes (B), 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32 and 64 processors. Both machines provided a high level of concurrency for the nearest-neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2, running at least three times faster on most tests.
INTRODUCTION
The Message Passing Interface (MPI) [6,9] is rapidly replacing PVM [3] as the standard for writing scientific programs with explicit message passing. The communication network of a parallel computer plays an important role in its overall performance; see [4,5]. Communication in scientific programs can occur in so many different ways that it is not possible to test all of them. However, our MPI communication tests have been designed to exercise some of the communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate and compare the communication performance of the Cray Research T3E-600 and the IBM SP-2 on a collection of communication tests that use MPI for the message passing.
DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS
All communication tests have been written in Fortran with calls to MPI routines for the message passing. These tests were run with message sizes ranging from 8 Bytes (B) to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns execute in a very short amount of time, so they are looped until the total wall-clock time is at least one second in order to obtain accurate timings. The time for the communication pattern is then calculated by dividing the total time by the number of iterations. Timings were obtained using the standard Unix wall-clock timer gettimeofday. Each test was run at least ten times, and the best performance numbers are reported.
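For concreteness, a minimal sketch of such a timing harness follows. This is our reconstruction, not the authors' code: MPI_Wtime stands in for the gettimeofday wrapper used in the paper, a barrier stands in for the communication pattern being measured, and the automatic doubling of the iteration count is an assumption.

    program harness
      include 'mpif.h'
      integer :: ierr, rank, niter, i
      double precision :: t0, t1
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! Double the iteration count until the timed loop runs for
      ! at least one second of wall-clock time.
      niter = 1
      do
         call MPI_Barrier(MPI_COMM_WORLD, ierr)
         t0 = MPI_Wtime()
         do i = 1, niter
            call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! pattern under test
         end do
         t1 = MPI_Wtime()
         if (t1 - t0 >= 1.0d0) exit
         niter = niter * 2
      end do
      ! The time per communication is the total time divided by the
      ! number of iterations.
      if (rank == 0) print *, 'time per iteration (s):', (t1 - t0) / niter
      call MPI_Finalize(ierr)
    end program harness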
Tests for the Cray T3E-600 were run on a 64 processor T3E located at Cray Research's corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 Mflop/s. The communication network, a bi-directional 3-D torus, has a bandwidth of 350 MB/second and a latency of 1.5 microseconds. For more information on the T3E, see http://www.cray.com. The operating system was UNICOS/mk version 1.3.1, and the Fortran compiler was cf90 version 3.0 with the -O2 optimization level.
Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each of these processors is 267 Mflop/s. The communication network has a peak bi-directional bandwidth of 40 MB/second, with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, a 256 (64) bit path from memory to the data cache, and a 256 (128) bit path from the data cache to the processor bus. The SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes; performance tests were done separately for thin nodes and for wide nodes. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. This SP-2 did not have 64 wide nodes available for our tests, so no 64-processor results could be obtained for wide nodes. AIX version 4.1 was used, and the Fortran compiler was xlf version 3.2.4 with the -O2 optimization level.
Communication Test 1 (Table 1):
The first communication test sends a message from one processor to another; the second processor then sends a 4 Byte integer message back to the sender indicating that the message was received. The test was designed this way because there are situations where one wants to send a message and not proceed until a response has been received from the receiving processor. The communication rate was calculated from the total time required to send the message and to send back the 4 Byte response. This test uses mpi_send and mpi_recv. It differs from the COMMS1 test described in section 3.3.1 of [4], in which a message is sent to a processor and that same message is then sent back to the first processor. Table 1 reports performance rates in KB per second, where "K" means 1000, not 1024. Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin-node rates were 5% to 14% lower than wide-node rates.
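A minimal sketch of this test for two processors is given below. It is our reconstruction under assumed names (msgsize, buf, ack), not the authors' code.

    program test1
      include 'mpif.h'
      integer, parameter :: msgsize = 125   ! e.g. 1 KB of 8-byte reals
      integer :: ierr, rank, ack, status(MPI_STATUS_SIZE)
      double precision :: buf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = 0.0d0
      if (rank == 0) then
         ! Send the message and wait for the 4 Byte acknowledgment.
         call MPI_Send(buf, msgsize, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, ierr)
         call MPI_Recv(ack, 1, MPI_INTEGER, 1, 2, MPI_COMM_WORLD, status, ierr)
      else if (rank == 1) then
         ! Receive the message, then return a 4 Byte integer response.
         call MPI_Recv(buf, msgsize, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
         ack = 1
         call MPI_Send(ack, 1, MPI_INTEGER, 0, 2, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test1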
¹ A follow-on study is planned that will evaluate the performance of MPI on the SGI Origin 2000, IBM's follow-on to the SP-2, and the Cray Research T3E-900.
[Table 1: Communication Test 1 rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
Communication Test 2 (Table 2-a):
This test measures communication rates for sending a message from one processor to all of the other processors using mpi_bcast; it is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define the Normalized Broadcast Rate as

Normalized Broadcast Rate = (total data rate) / (N - 1),

where N is the total number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KB per second. Let R be the data rate when sending a message from one processor to another, and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to transmit the messages concurrently, then D = R*(N-1), and thus the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2-a reports Normalized Broadcast Rates for the T3E-600 and for both wide and thin nodes on the SP-2. Notice that in all cases, instead of remaining constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E-600 is roughly 3 to 5 times faster than wide nodes, but this factor decreases as the number of processors used increases.
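As an illustration, the following sketch (our reconstruction; msgsize and niter are assumed values) times the broadcast and computes the Normalized Broadcast Rate as defined above.

    program test2
      include 'mpif.h'
      integer, parameter :: msgsize = 12500   ! 100 KB of 8-byte reals
      integer, parameter :: niter = 100
      integer :: ierr, rank, nprocs, i
      double precision :: buf(msgsize), t0, t, drate, nbrate
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      buf = 1.0d0
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      do i = 1, niter
         call MPI_Bcast(buf, msgsize, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
      end do
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t = (MPI_Wtime() - t0) / niter
      ! Each broadcast delivers the message to the N-1 other
      ! processors; rates are in KB/s with K = 1000.
      drate  = dble(nprocs-1) * 8.0d0 * dble(msgsize) / t / 1.0d3
      nbrate = drate / dble(nprocs-1)
      if (rank == 0) print *, 'Normalized Broadcast Rate (KB/s):', nbrate
      call MPI_Finalize(ierr)
    end program test2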
[Table 2-a: Normalized Broadcast Rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
To see that there actually is concurrency occurring in the broadcast operation, define the Log Normalized Broadcast Rate as (total data rate)/Log(N), where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. If binary-tree parallelism were being utilized, the Log Normalized Broadcast Rate would be constant for a given message size as N varies. Table 2-b gives the Log Normalized Broadcast Rates and does in fact show that concurrency is being utilized in the broadcast operation on both machines.
[Table 2-b: Log Normalized Broadcast Rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
Communication Test 3 (Table 3):
Communication Test 3 measures the rates for broadcasting a message from one processor to all other processors using mpi_bcast and then having the receiving processors return this same message to the originating processor using mpi_send and mpi_recv with a wildcard. To eliminate the possibility of an optimizing compiler recognizing that the same message is being sent back, and hence that it need not be sent, one element of the message is altered by each receiving processor prior to sending the message back. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation described in Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. The trends in Table 3 are similar to those of Table 2-a, but the data rates for 10 MB messages with 16, 32 and 64 processors drop off significantly on the SP-2.
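A sketch of the pattern follows. It is our reconstruction, with assumed names; MPI_ANY_SOURCE is the standard MPI wildcard for the receive.

    program test3
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, j, status(MPI_STATUS_SIZE)
      double precision :: buf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      buf = 1.0d0
      call MPI_Bcast(buf, msgsize, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
      if (rank == 0) then
         ! Collect the returned messages in whatever order they arrive.
         do j = 1, nprocs - 1
            call MPI_Recv(buf, msgsize, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, 3, &
                          MPI_COMM_WORLD, status, ierr)
         end do
      else
         ! Alter one element so the compiler cannot elide the return send.
         buf(1) = buf(1) + 1.0d0
         call MPI_Send(buf, msgsize, MPI_DOUBLE_PRECISION, 0, 3, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test3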
[Table 3: Normalized Broadcast Rates in KB/s for Communication Test 3.]
Communication Test 4 (Table 4):
The rest of the communication tests are designed to measure communication between "neighboring" processors. Let N be the total number of processors and assume that they are numbered from 0 to N-1. This communication test sends a message from processor i to processor (i+1) mod N, for i = 0, 1, …, N-1. Observe that the data rates for this test should increase proportionally with N, since the communication can (one hopes) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

Normalized Data Rate = (total data rate) / N.

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant, since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate is not constant indicates how far from ideal this type of communication is on the parallel computer. This test uses mpi_send and mpi_recv. Table 4 reports the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well on both machines and that the T3E-600 significantly outperforms the SP-2.
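A sketch of the ring shift is shown below (our reconstruction). With blocking mpi_send and mpi_recv, some ordering is needed to avoid deadlock for large messages; having even ranks send first and odd ranks receive first is one common choice and an assumption on our part.

    program test4
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, dest, src, status(MPI_STATUS_SIZE)
      double precision :: sbuf(msgsize), rbuf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf = 1.0d0
      ! Processor i sends to (i+1) mod N and receives from (i-1) mod N.
      dest = mod(rank + 1, nprocs)
      src  = mod(rank - 1 + nprocs, nprocs)
      if (mod(rank, 2) == 0) then
         call MPI_Send(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 4, MPI_COMM_WORLD, ierr)
         call MPI_Recv(rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 4, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_Recv(rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 4, MPI_COMM_WORLD, status, ierr)
         call MPI_Send(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 4, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test4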
[Table 4: Normalized Data Rates in KB/s for Communication Test 4.]
Communication Test 5 (Table 5):
This next communication pattern sends a message from each processor i to both of its neighbors, (i-1) mod N and (i+1) mod N, for i = 0, 1, …, N-1, where N is the number of processors used in the test. Thus, the amount of data being moved on the network is twice that of the previous test. This test uses mpi_sendrecv. Table 5 reports the performance results. Notice that the T3E-600 handles the extra data traffic better than the SP-2. Also notice that the data rates scale well as the number of processors used increases.
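A sketch using mpi_sendrecv is shown below (our reconstruction, with assumed names). Each call pairs a send in one ring direction with the matching receive from the opposite neighbor, so blocking deadlock cannot occur.

    program test5
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, left, right, status(MPI_STATUS_SIZE)
      double precision :: sbuf(msgsize), rbufl(msgsize), rbufr(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf  = 1.0d0
      right = mod(rank + 1, nprocs)
      left  = mod(rank - 1 + nprocs, nprocs)
      ! Shift one copy of the message to the right neighbor ...
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, right, 5, &
                        rbufl, msgsize, MPI_DOUBLE_PRECISION, left, 5, &
                        MPI_COMM_WORLD, status, ierr)
      ! ... and one copy to the left neighbor.
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, left, 6, &
                        rbufr, msgsize, MPI_DOUBLE_PRECISION, right, 6, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Finalize(ierr)
    end program test5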
[Table 5: Normalized Data Rates in KB/s for Communication Test 5.]
Communication Test 6 (Table 6):
To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the processor numbers 0, 1, …, N-1. This communication pattern can then be described as sending a message from processor P(i) to processor P((i+1) mod N), for i = 0, 1, …, N-1. Notice that this is exactly the same as the communication pattern described in Communication Test 4 if one reorders the processor numbering from 0, 1, …, N-1 to P(0), P(1), …, P(N-1). Thus, the purpose of this test is to determine the impact on performance of reordering the numbering of the processors. Of course, the performance is likely to depend on the particular permutation selected. This test uses mpi_sendrecv. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E-600 and the SP-2 can in fact change significantly depending on the permutation selected.
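The sketch below illustrates the permuted ring (our reconstruction). The paper does not state which permutations were measured; the low-bit-flip permutation here is purely an example.

    program test6
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, i, me, dest, src
      integer :: status(MPI_STATUS_SIZE)
      integer, allocatable :: perm(:)
      double precision :: sbuf(msgsize), rbuf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf = 1.0d0
      allocate(perm(0:nprocs-1))
      ! Example permutation only: swap each even rank with its odd
      ! neighbor by flipping the low bit (assumes nprocs is even).
      do i = 0, nprocs - 1
         perm(i) = ieor(i, 1)
      end do
      ! Locate this rank's position in the permuted ring, then send to
      ! the processor at the next position and receive from the previous.
      me = 0
      do i = 0, nprocs - 1
         if (perm(i) == rank) me = i
      end do
      dest = perm(mod(me + 1, nprocs))
      src  = perm(mod(me - 1 + nprocs, nprocs))
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 7, &
                        rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 7, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Finalize(ierr)
    end program test6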
[Table 6: Normalized Data Rates in KB/s for Communication Test 6.]
CONCLUSIONS
The purpose of this study was to evaluate and compare the communication performance of the Cray Research T3E-600 and the IBM SP-2 on a collection of communication tests written in MPI. These tests have been designed to evaluate the performance of some of the communication patterns that we feel are likely to occur in scientific programs. Both machines showed a high level of concurrency on the nearest-neighbor communication tests and moderate concurrency on the broadcast communication tests. On the SP-2, thin nodes performed roughly 10% slower than wide nodes on our tests. The T3E-600 significantly outperformed the SP-2, running at least three times faster than the SP-2's wide nodes on most tests.
ACKNOWLEDGMENTS
Computer time on the Maui High Performance Computing Center's SP-2 was sponsored by the Phillips Laboratory, Air Force Materiel Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.
We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.
REFERENCES