COMPARING COMMUNICATION PERFORMANCE OF MPI
ON THE CRAY RESEARCH T3E-600 AND IBM SP-2*
by
Glenn R. Luecke and James J. Coyle
grl@iastate.edu and jjc@iastate.edu
Iowa State University
Ames, Iowa 50011-2251, USA
Waqar ul Haque
haque@unbc.edu
University of Northern British Columbia
Prince George, British Columbia, Canada V2N 4Z9
February 8, 1997
Revised August 13, 1997

Abstract

This paper reports the performance of the Cray Research T3E and the IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of sizes 8 Bytes (B), 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32 and 64 processors. Both machines provided a high level of concurrency for the nearest neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2, running at least three times faster on most of the tests.

INTRODUCTION

The Message Passing Interface (MPI) [6,9] is rapidly becoming the standard, in place of PVM [3], for writing scientific programs with explicit message passing. The communication network of a parallel computer plays an important role in its overall performance; see [4,5]. Communication can occur in so many different ways when running scientific programs that it is not possible to test all of them. However, our MPI communication tests have been designed to exercise some of the communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate and compare the communication performance of the Cray Research T3E-600 and the IBM SP-2 on a collection of communication tests that use MPI for the message passing.

DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS

All communication tests have been written in Fortran with calls to MPI routines for the message passing. The tests were run with message sizes ranging from 8 Bytes (B) to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns take very little time to execute, so they are executed in a loop long enough to obtain a wall-clock time of at least one second; the time for one instance of the communication pattern is then calculated by dividing the total time by the number of iterations. Timings were obtained using the standard UNIX wall-clock timer "gettimeofday". Tests were run at least ten times and the best performance numbers are reported.
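
A minimal sketch of such a timing harness is given below. This is our reconstruction, not the authors' code: it uses mpi_wtime rather than a Fortran wrapper around gettimeofday, and the iteration count niter is a placeholder that would be chosen so the timed region runs for at least one second.

    ! Sketch of the repetition-based timing scheme (assumed harness, not
    ! the authors' exact code).  The pattern is looped niter times so the
    ! timed region lasts at least one second; the per-pattern time is then
    ! the total time divided by niter.
    program time_pattern
      implicit none
      include 'mpif.h'
      integer, parameter :: niter = 1000   ! chosen so the timed region is >= 1 second
      integer :: ierr, rank, i
      double precision :: t0, t1, t_iter

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, rank, ierr)

      call mpi_barrier(mpi_comm_world, ierr)   ! start all processors together
      t0 = mpi_wtime()
      do i = 1, niter
         ! ... one instance of the communication pattern being timed ...
      end do
      t1 = mpi_wtime()

      t_iter = (t1 - t0) / niter               ! time for one communication pattern
      if (rank == 0) print *, 'seconds per pattern:', t_iter
      call mpi_finalize(ierr)
    end program time_pattern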

Tests for the Cray T3E-600 were run on a 64 processor T3E located at Cray Research's corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 Mflop/s. The communication network is a bi-directional 3-D torus with a bandwidth of 350 MB/second and a latency of 1.5 microseconds. For more information on the T3E, see http://www.cray.com. The operating system used was UNICOS/mk version 1.3.1, and the Fortran compiler was cf90 version 3.0 with the -O2 optimization level.

Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each processor is 267 Mflop/s. The communication network has a peak bi-directional bandwidth of 40 MB/second, with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, a 256 (64) bit path from memory to the data cache, and a 256 (128) bit path from the data cache to the processor bus. The SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes; performance tests were run separately for each node type. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. The machine did not have 64 wide nodes available for our tests, so no wide node results are reported for 64 processors. AIX version 4.1 was used, and the Fortran compiler was xlf version 3.2.4 with the -O2 optimization level.

Communication Test 1 (Table 1):

The first communication test sends a message from one processor to another; the second processor then sends a 4 Byte integer message back to the sender to indicate that the message was received. The test was designed this way because there are situations where one wants to send a message and not proceed until a response has been received from the receiving processor. The communication rate was calculated from the total time required to send the message and to receive the 4 Byte response. This test uses mpi_send and mpi_recv; a sketch of the kernel is given after Table 1. It differs from the COMMS1 test described in section 3.3.1 of [4], in which the entire message, rather than a short acknowledgment, is sent back to the first processor. Table 1 reports performance rates in KB per second, where "K" means "1000", not "1024". Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin node performance was 5% to 14% lower than wide node performance.

* A follow-on study is planned that will evaluate the performance of MPI on the SGI ORIGIN 2000, IBM's follow-on to the SP-2, and the Cray Research T3E-900.

Message Size    IBM SP-2      IBM SP-2      Cray          Ratio
in Bytes        thin node     wide node     T3E-600       T3E/wide
-------------------------------------------------------------------
         8           106           114           374        3.3
     1,000         5,625         6,519        20,374        3.1
   100,000        27,619        31,349       111,742        3.6
10,000,000        32,304        33,878       111,551        3.3

Table 1: Processor to processor communication rate in KB/second.
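
A minimal sketch of the Test 1 kernel is given below, assuming two MPI ranks; the buffer layout, tags, and names are illustrative, not the authors' code.

    ! Sketch of the Test 1 kernel: rank 0 sends an n-byte message to
    ! rank 1, which replies with a small integer acknowledgment (4 Bytes
    ! on the SP-2; the integer kind may differ by system).
    subroutine pingack(msg, n)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n            ! message size in bytes
      character, intent(inout) :: msg(n)  ! message buffer
      integer :: rank, ierr, ack, status(mpi_status_size)

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      if (rank == 0) then
         call mpi_send(msg, n, mpi_byte, 1, 1, mpi_comm_world, ierr)
         call mpi_recv(ack, 1, mpi_integer, 1, 2, mpi_comm_world, status, ierr)
      else if (rank == 1) then
         call mpi_recv(msg, n, mpi_byte, 0, 1, mpi_comm_world, status, ierr)
         ack = 1                          ! "message received" reply
         call mpi_send(ack, 1, mpi_integer, 0, 2, mpi_comm_world, ierr)
      end if
    end subroutine pingack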

Communication Test 2 (Table 2-a):

This test measures communication rates for sending a message from one processor to all of the other processors using mpi_bcast and is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define a Normalized Broadcast Rate as

(total data rate)/(N-1)

where N is the total number of processors involved in the communication and where the total data rate is the total amount of data sent on the communication network per unit time, measured in KB per second. Let R be the data rate when sending a message from one processor to another and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to transmit the messages concurrently, then D = R*(N-1) and the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2-a reports Normalized Broadcast Rates for the T3E-600 and for both wide and thin nodes on the SP-2. Notice that in all cases, instead of remaining constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E-600 is roughly 3 to 5 times faster than the wide nodes, though this factor generally decreases as the number of processors increases.
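
The following sketch shows how such a broadcast measurement might be coded; it is an assumed reconstruction (the names and the single, unlooped timing are illustrative), and it makes explicit that the Normalized Broadcast Rate reduces to the message size divided by the broadcast time.

    ! Sketch of the Test 2 measurement (assumed reconstruction).  Rank 0
    ! broadcasts an n-byte message to the other N-1 ranks.  Since the
    ! total data moved is n*(N-1) bytes, the Normalized Broadcast Rate
    ! n*(N-1)/(t1-t0)/(N-1) reduces to n/(t1-t0).  In the actual tests
    ! the call is looped as described earlier.
    subroutine bcast_rate(msg, n, rate)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n                 ! message size in bytes
      character, intent(inout) :: msg(n)
      double precision, intent(out) :: rate    ! normalized rate, KB/second
      integer :: ierr
      double precision :: t0, t1

      call mpi_barrier(mpi_comm_world, ierr)   ! start all ranks together
      t0 = mpi_wtime()
      call mpi_bcast(msg, n, mpi_byte, 0, mpi_comm_world, ierr)  ! root = rank 0
      t1 = mpi_wtime()

      rate = dble(n) / (t1 - t0) / 1000.0d0    ! "K" means 1000 bytes
    end subroutine bcast_rate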
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2              75           78          345     4.4
         8          4              52           55          195     3.5
         8          8              31           38          101     2.7
         8         16              20           24           47     2.0
         8         32              12           14           16     1.1
         8         64               5           na            4      na
     1,000          2           4,587        5,184       23,609     4.6
     1,000          4           2,711        3,374       13,485     4.0
     1,000          8           1,837        2,433        7,380     3.0
     1,000         16           1,366        1,717        3,759     2.2
     1,000         32             833        1,073        1,398     1.3
     1,000         64             378           na          412      na
   100,000          2          27,181       30,685       98,729     3.2
   100,000          4           9,755       10,826       51,874     4.8
   100,000          8           5,057        5,516       35,357     6.4
   100,000         16           4,650        8,268       26,632     3.2
   100,000         32           3,190        4,049       21,545     5.3
   100,000         64           1,943           na       17,833      na
10,000,000          2          32,298       33,614      100,336     3.0
10,000,000          4          10,970       11,332       53,220     4.7
10,000,000          8           5,507        5,680       36,821     6.5
10,000,000         16          10,638       16,472       27,321     1.7
10,000,000         32           9,662       14,523       22,371     1.5
10,000,000         64           8,147           na       18,466      na

Table 2-a: Normalized Broadcast Rates in KB/second.

To see that there actually is concurrency occurring in the broadcast operation, define

the Log Normalized Broadcast Rate as  (total data rate)/Log(N),

where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. Thus, if binary tree parallelism were being utilized, the Log Normalized Broadcast Rate would be constant for a given message size as N varies. For example, broadcasting a 1,000 byte message with 8 processors on the T3E, Table 2-a gives a Normalized Broadcast Rate of 7,380 KB/second, so the total data rate is 7,380*(8-1) = 51,660 KB/second and the Log Normalized Broadcast Rate is 51,660/3 = 17,220 KB/second, the corresponding entry in Table 2-b. Table 2-b gives the Log Normalized Broadcast Rates and does in fact show that concurrency is being utilized in the broadcast operation on both machines.
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2              75           78          345     4.4
         8          4              78           83          293     3.5
         8          8              72           89          236     2.7
         8         16              75           90          176     2.0
         8         32              74           87           99     1.1
         8         64              53           na           42      na
     1,000          2           4,587        5,184       23,609     4.6
     1,000          4           4,067        5,061       20,228     4.0
     1,000          8           4,286        5,677       17,220     3.0
     1,000         16           5,123        6,439       14,096     2.2
     1,000         32           5,165        6,653        8,668     1.3
     1,000         64           3,969           na        4,326      na
   100,000          2          27,181       30,685       98,729     3.2
   100,000          4          14,633       16,239       77,811     4.8
   100,000          8          11,800       12,871       82,500     6.4
   100,000         16          17,438       31,005       99,870     3.2
   100,000         32          19,778       25,104      133,579     5.3
   100,000         64          20,402           na      187,247      na
10,000,000          2          32,298       33,614      100,336     3.0
10,000,000          4          16,455       16,998       79,830     4.7
10,000,000          8          12,850       13,253       85,916     6.5
10,000,000         16          39,893       61,770      102,454     1.7
10,000,000         32          59,904       90,043      138,700     1.5
10,000,000         64          85,544           na      193,893      na

Table 2-b: Log Normalized Broadcast Rates in KB/second.
 

Communication Test 3 (Table 3):

Communication Test 3 measures the rates for broadcasting a message from one processor to all other processors using mpi_bcast and then having each receiving processor return the same message to the originating processor using mpi_send and mpi_recv with a wildcard source. To prevent an optimizing compiler from recognizing that the same message is being returned, and therefore need not be sent back, each receiving processor alters one element of the message before returning it. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation described in Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. Notice that the trends in Table 3 are similar to those of Table 2-a, but that the data rates for 10 MB messages with 16, 32 and 64 processors drop off significantly on the SP-2.
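
A sketch of this broadcast-and-return kernel is given below; it is our reconstruction, assuming the wildcard is mpi_any_source, with illustrative names and tags.

    ! Sketch of Test 3 (assumed reconstruction): rank 0 broadcasts an
    ! n-byte message; every other rank alters one element (so the compiler
    ! cannot elide the return) and sends the message back; rank 0 collects
    ! the N-1 replies with a wildcard source.
    subroutine bcast_return(msg, n, nprocs)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n, nprocs
      character, intent(inout) :: msg(n)
      integer :: rank, ierr, i, status(mpi_status_size)

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_bcast(msg, n, mpi_byte, 0, mpi_comm_world, ierr)
      if (rank == 0) then
         do i = 1, nprocs - 1    ! accept replies in whatever order they arrive
            call mpi_recv(msg, n, mpi_byte, mpi_any_source, 3, &
                          mpi_comm_world, status, ierr)
            ! each reply overwrites the same buffer; only the timing matters
         end do
      else
         msg(1) = 'x'            ! alter one element before returning
         call mpi_send(msg, n, mpi_byte, 0, 3, mpi_comm_world, ierr)
      end if
    end subroutine bcast_return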
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2             103          124          426     3.4
         8          4              64           74          202     2.7
         8          8              40           45           96     2.1
         8         16              23           26           37     1.4
         8         32              12           13           12     0.9
         8         64               5           na            3      na
     1,000          2           5,925        7,676       28,811     3.8
     1,000          4           3,665        4,735       11,482     2.4
     1,000          8           2,243        2,841        5,009     1.8
     1,000         16           1,229        1,578        1,584     1.0
     1,000         32             638          835          418     0.5
     1,000         64             294           na          106      na
   100,000          2          27,226       31,479      108,573     3.4
   100,000          4          11,292       12,778       68,032     5.3
   100,000          8           5,629        6,420       42,286     6.6
   100,000         16           2,768        3,419       25,039     7.3
   100,000         32           1,466        1,731       14,037     8.1
   100,000         64             758           na        7,568      na
10,000,000          2          32,109       33,766      108,512     3.2
10,000,000          4          13,117       13,542       70,423     5.2
10,000,000          8           6,518        6,764       47,711     7.1
10,000,000         16           3,333        3,964       29,399     7.4
10,000,000         32           1,899        2,016       16,962     8.4
10,000,000         64             946           na        9,310      na

Table 3: Normalized Broadcast Rates in KB/second for a broadcast and return.

Communication Test 4 (Table 4):

The rest of the communication tests are designed to measure communication between "neighboring" processors. As above, let N be the total number of processors and assume that they have been numbered from 0 to N-1. This communication test sends a message from processor i to processor (i+1) mod N, for i = 0, 1, …, N-1. Observe that the data rates for this test will increase proportionally with N since the communication can (ideally) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

(total data rate)/N.

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate varies with N indicates how far from ideal the machine performs this type of communication. This test uses mpi_send and mpi_recv. Table 4 reports the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well for both machines and that the T3E-600 significantly outperforms the SP-2.
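
A sketch of this ring shift is given below; it is an assumed reconstruction with illustrative names, and the even/odd ordering of the blocking calls is one simple way to avoid deadlock (the processor counts in these tests are all even).

    ! Sketch of the Test 4 ring shift (assumed reconstruction): each rank
    ! sends an n-byte message to the next rank around the ring and
    ! receives from the previous one.  Even and odd ranks order their
    ! send/recv oppositely so the blocking calls cannot deadlock.
    subroutine ring_shift(sbuf, rbuf, n)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n
      character, intent(in)  :: sbuf(n)
      character, intent(out) :: rbuf(n)
      integer :: rank, nprocs, next, prev, ierr, status(mpi_status_size)

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      next = mod(rank + 1, nprocs)            ! (i+1) mod N
      prev = mod(rank - 1 + nprocs, nprocs)   ! (i-1) mod N

      if (mod(rank, 2) == 0) then
         call mpi_send(sbuf, n, mpi_byte, next, 4, mpi_comm_world, ierr)
         call mpi_recv(rbuf, n, mpi_byte, prev, 4, mpi_comm_world, status, ierr)
      else
         call mpi_recv(rbuf, n, mpi_byte, prev, 4, mpi_comm_world, status, ierr)
         call mpi_send(sbuf, n, mpi_byte, next, 4, mpi_comm_world, ierr)
      end if
    end subroutine ring_shift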
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2              67           75          259     3.5
         8          4              67           72          231     3.2
         8          8              65           70          228     3.2
         8         16              61           67          226     3.4
         8         32              55           66          223     3.7
         8         64              43           na          222      na
     1,000          2           3,994        4,936       15,845     3.2
     1,000          4           3,642        4,671       14,382     3.1
     1,000          8           3,644        4,507       14,247     3.2
     1,000         16           3,475        4,303       14,142     3.3
     1,000         32           3,444        3,964       14,093     3.6
     1,000         64           3,294           na       13,864      na
   100,000          2          13,612       15,832       57,801     3.7
   100,000          4          13,554       15,694       57,126     3.6
   100,000          8          13,364       15,645       57,166     3.7
   100,000         16          13,290       15,389       57,079     3.7
   100,000         32          13,197       14,899       56,869     3.8
   100,000         64          12,630           na       55,955      na
10,000,000          2          16,091       16,893       56,001     3.3
10,000,000          4          16,056       16,939       47,475     2.8
10,000,000          8          15,722       16,808       40,355     2.4
10,000,000         16          15,791       16,767       40,336     2.4
10,000,000         32          15,497       16,739       40,307     2.4
10,000,000         64          13,245           na       40,126      na

Table 4: Normalized Data Rates in KB/second for processor i to processor (i+1) mod N.

Communication Test 5 (Table 5):

This next communication pattern sends a message from each processor i to both of its neighbors, (i-1) mod N and (i+1) mod N, for i = 0, 1, …, N-1, where N is the number of processors used in the test. Thus, the amount of data being moved on the network is twice that of the previous test. This test uses mpi_sendrecv. Table 5 reports the performance results. Notice that the T3E-600 handles the extra data traffic better than the SP-2, and that the data rates scale well as the number of processors increases.
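
A sketch of this neighbor exchange is given below; it is an assumed reconstruction with illustrative names and tags.

    ! Sketch of the Test 5 exchange (assumed reconstruction): each rank
    ! sends an n-byte message to both ring neighbors using mpi_sendrecv,
    ! which is deadlock-free by construction.
    subroutine neighbor_exchange(sbuf, rleft, rright, n)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n
      character, intent(in)  :: sbuf(n)             ! message sent to both sides
      character, intent(out) :: rleft(n), rright(n) ! messages received
      integer :: rank, nprocs, next, prev, ierr, status(mpi_status_size)

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nprocs, ierr)
      next = mod(rank + 1, nprocs)
      prev = mod(rank - 1 + nprocs, nprocs)

      ! send right / receive from the left, then send left / receive from the right
      call mpi_sendrecv(sbuf,  n, mpi_byte, next, 5, &
                        rleft, n, mpi_byte, prev, 5, &
                        mpi_comm_world, status, ierr)
      call mpi_sendrecv(sbuf,   n, mpi_byte, prev, 6, &
                        rright, n, mpi_byte, next, 6, &
                        mpi_comm_world, status, ierr)
    end subroutine neighbor_exchange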
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2              90          100          487     4.9
         8          4              85           95          478     5.1
         8          8              78           92          475     5.2
         8         16              69           89          470     5.3
         8         32              65           84          414     4.9
         8         64              54           na          386      na
     1,000          2           6,177        7,374       27,599     3.7
     1,000          4           5,420        6,965       26,804     3.8
     1,000          8           5,166        6,576       26,810     4.1
     1,000         16           4,809        6,616       26,294     4.0
     1,000         32           4,775        6,088       21,783     3.6
     1,000         64           3,845           na       20,773      na
   100,000          2          14,793       17,715      100,498     5.7
   100,000          4          15,849       20,493       94,391     4.6
   100,000          8          15,969       20,445      100,140     4.9
   100,000         16          15,706       20,389      100,059     4.9
   100,000         32          15,233       19,828       99,950     5.0
   100,000         64          14,914           na       99,212      na
10,000,000          2          17,106       18,638      104,615     5.6
10,000,000          4          18,821       22,556       81,668     3.6
10,000,000          8          18,334       22,396       72,293     3.2
10,000,000         16          18,122       22,437       72,372     3.2
10,000,000         32          17,799       22,111       72,483     3.3
10,000,000         64          16,518           na       72,512      na

Table 5: Normalized Data Rates in KB/second for processor i to processors (i+1) mod N and (i-1) mod N.

Communication Test 6 (Table 6):

To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the integers 0, 1, …, N-1. This communication pattern can then be described as sending a message from processor P(i) to processor P((i+1) mod N), for i = 0, 1, …, N-1. Notice that this is exactly the same as the communication pattern described in Communication Test 4 if one were to reorder the processor numbering from 0, 1, …, N-1 to P(0), P(1), …, P(N-1). Thus, the purpose of this test is to determine the impact of reordering the processor numbering on performance. Of course, the performance is likely to depend on the particular permutation selected. This test uses mpi_sendrecv. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E-600 and the SP-2 can in fact change significantly depending on the permutation selected.
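
A sketch of the permuted ring is given below; it is an assumed reconstruction in which the permutation perm and its inverse invperm are supplied by the caller, and all names are illustrative.

    ! Sketch of the Test 6 permuted ring (assumed reconstruction): perm is
    ! a permutation of 0..N-1 and invperm its inverse, so rank perm(i)
    ! plays the role that rank i played in Test 4.  Performance depends on
    ! which permutation is chosen.
    subroutine permuted_ring(sbuf, rbuf, n, perm, invperm, nprocs)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n, nprocs
      integer, intent(in) :: perm(0:nprocs-1), invperm(0:nprocs-1)
      character, intent(in)  :: sbuf(n)
      character, intent(out) :: rbuf(n)
      integer :: rank, i, next, prev, ierr, status(mpi_status_size)

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      i = invperm(rank)                        ! my position in the permuted ring
      next = perm(mod(i + 1, nprocs))          ! rank after me in the ring
      prev = perm(mod(i - 1 + nprocs, nprocs)) ! rank before me in the ring

      call mpi_sendrecv(sbuf, n, mpi_byte, next, 7, &
                        rbuf, n, mpi_byte, prev, 7, &
                        mpi_comm_world, status, ierr)
    end subroutine permuted_ring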
 

Message Size   Number of    IBM SP-2     IBM SP-2     Cray        Ratio
in bytes       Processors   thin nodes   wide nodes   T3E-600     T3E/wide
---------------------------------------------------------------------------
         8          2              88           99          483     4.9
         8          4              86           96          472     4.9
         8          8              79           94          461     4.9
         8         16              75           91          440     4.8
         8         32              69           85          408     4.8
         8         64              55           na          398      na
     1,000          2           5,957        7,318       28,127     3.8
     1,000          4           5,697        6,851       27,067     4.0
     1,000          8           5,353        6,770       26,578     3.9
     1,000         16           5,256        6,593       21,808     3.3
     1,000         32           5,046        6,339       20,775     3.3
     1,000         64           4,972           na       20,150      na
   100,000          2          14,599       17,868      100,425     5.6
   100,000          4          16,015       20,453       90,487     4.4
   100,000          8          15,793       20,474       97,556     4.8
   100,000         16          14,866       20,417       96,906     4.7
   100,000         32          15,554       20,328       89,936     4.4
   100,000         64          14,176           na       80,868      na
10,000,000          2          17,171       19,413       99,158     5.1
10,000,000          4          18,275       22,565       69,569     3.1
10,000,000          8          17,311       22,520       76,064     3.8
10,000,000         16          17,111       22,378       67,416     3.0
10,000,000         32          17,010       22,179       59,208     2.7
10,000,000         64          14,667           na       54,516      na

Table 6: Normalized Data Rates in KB/second for processor P(i) to processor P((i+1) mod N).

CONCLUSIONS

The purpose of this study was to evaluate and compare the communication performance of the Cray Research T3E-600 and IBM SP-2 on a collection of communication tests written in MPI. These tests were designed to evaluate the performance of some of the communication patterns that we feel are likely to occur in scientific programs. Both machines showed a high level of concurrency on the nearest neighbor communication tests and moderate concurrency on the broadcast communication tests. On the SP-2, thin nodes performed roughly 10% slower than wide nodes on our tests. The T3E-600 significantly outperformed the SP-2, running at least three times faster than the SP-2's wide nodes on most tests.

ACKNOWLEDGMENTS

Computer time on the Maui High Performance Computing Center's SP-2 was sponsored by the Phillips Laboratory, Air Force Materiel Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.

We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.

REFERENCES
 

  1. Cray MPP Fortran Reference Manual, SR-2504 6.2.2, Cray Research, Inc., June 1995.
  2. J. Dongarra, R. Whaley, A User's Guide to the BLACS v1.0, Computer Science Department Technical Report CS-95-281, University of Tennessee, 1995. (Available as LAPACK Working Note 94 at http://www.netlib.org/lapack/lawns/lawn94.ps)
  3. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, PVM: Parallel Virtual Machine, A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, 1994.
  4. R. Hockney, M. Berry, Public International Benchmarks for Parallel Computers: PARKBENCH Committee, Report No. 1, Scientific Programming, 3 (2): 101-146, 1994.
  5. R. Hockney, The Science of Computer Benchmarking, SIAM, Philadelphia, 1996.
  6. W. Gropp, E. Lusk, A. Skjellum, Using MPI, The MIT Press, 1994.
  7. G. Luecke, J. Coyle, W. Haque, J. Hoekstra, H. Jespersen, Performance Comparison of Workstation Clusters for Scientific Computing, SUPERCOMPUTER, vol. XII, no. 2, pp. 4-20, March 1996.
  8. Optimization and Tuning Guide for Fortran, C, and C++ for AIX Version 4, second edition, IBM, June 1996.
  9. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, J. Dongarra, MPI: The Complete Reference, The MIT Press, 1996.