Abstract
This paper reports the performance of the Cray Research T3E and IBM SP-2 on a collection of communication tests that use MPI for the message passing. These tests have been designed to evaluate the performance of communication patterns that we feel are likely to occur in scientific programs. Communication tests were performed for messages of sizes 8 Bytes (B), 1 KB, 100 KB, and 10 MB with 2, 4, 8, 16, 32 and 64 processors. Both machines provided a high level of concurrency for the nearest-neighbor communication tests and moderate concurrency on the broadcast operations. On the tests used, the T3E significantly outperformed the SP-2, running at least three times faster on most tests.
INTRODUCTION
The Message Passing Interface (MPI) [6,9] is rapidly replacing PVM [3] as the standard for writing scientific programs with explicit message passing. The communication network of a parallel computer plays an important role in its overall performance; see [4,5]. Communication in scientific programs can occur in so many different ways that it is not possible to test all of them. However, our MPI communication tests have been designed to exercise some of the communication patterns that we feel are likely to occur in scientific programs. The purpose of this study is to evaluate and compare the communication performance of the Cray Research T3E-600 and the IBM SP-2 on a collection of communication tests that use MPI for the message passing.
DESCRIPTION OF THE PERFORMANCE TESTS AND RESULTS
All communication tests have been written in Fortran with calls to MPI routines for the message passing. These tests were run with message sizes ranging from 8 Bytes (B) to 10 MB and with the number of processors ranging from 2 to 64. Some of these communication patterns execute in a very short amount of time, so they are looped until the total wall-clock time is at least one second in order to obtain accurate timings. The time for the communication pattern is then calculated by dividing the total time by the number of iterations. Timings were obtained using the standard Unix wall-clock timer gettimeofday. Each test was run at least ten times, and the best performance numbers are reported.
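For concreteness, a minimal sketch of such a timing harness follows. This is our reconstruction, not the authors' code: MPI_Wtime stands in for the gettimeofday wrapper used in the paper, a barrier stands in for the communication pattern being measured, and the automatic doubling of the iteration count is an assumption.

    program harness
      include 'mpif.h'
      integer :: ierr, rank, niter, i
      double precision :: t0, t1
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      ! Double the iteration count until the timed loop runs for
      ! at least one second of wall-clock time.
      niter = 1
      do
         call MPI_Barrier(MPI_COMM_WORLD, ierr)
         t0 = MPI_Wtime()
         do i = 1, niter
            call MPI_Barrier(MPI_COMM_WORLD, ierr)   ! pattern under test
         end do
         t1 = MPI_Wtime()
         if (t1 - t0 >= 1.0d0) exit
         niter = niter * 2
      end do
      ! The time per communication is the total time divided by the
      ! number of iterations.
      if (rank == 0) print *, 'time per iteration (s):', (t1 - t0) / niter
      call MPI_Finalize(ierr)
    end program harness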
Tests for the Cray T3E-600 were run on a 64 processor T3E located at Cray Research's corporate headquarters in Eagan, Minnesota. The peak theoretical performance of each processor is 600 Mflop/s. The communication network, a bi-directional 3-D torus, has a bandwidth of 350 MB/second and a latency of 1.5 microseconds. For more information on the T3E, see http://www.cray.com. The operating system was UNICOS/mk version 1.3.1, and the Fortran compiler was cf90 version 3.0 with the -O2 optimization level.
Tests for the IBM SP-2 were run at the Maui High Performance Computing Center. The peak theoretical performance of each of these processors is 267 Mflop/s. The communication network has a peak bi-directional bandwidth of 40 MB/second, with a latency of 40.0 microseconds for thin nodes and 39.2 microseconds for wide nodes. Wide (thin) nodes have a 256 (64) KB data cache, a 256 (64) bit path from memory to the data cache, and a 256 (128) bit path from the data cache to the processor bus. The SP-2 uses a bi-directional multistage interconnection network and may be configured with thin and/or wide nodes; performance tests were done separately for thin nodes and for wide nodes. For more information about the SP-2, see http://www.mhpcc.edu/training/workshop/html/ibmhwsw/ibmhwsw.html. This SP-2 did not have 64 wide nodes available for our tests, so no 64-processor results could be obtained for wide nodes. AIX version 4.1 was used, and the Fortran compiler was xlf version 3.2.4 with the -O2 optimization level.
Communication Test 1 (Table 1):
The first communication test sends a message from one processor to another; the second processor then sends a 4 Byte integer message back to the sender indicating that the message was received. The test was designed this way because there are situations where one wants to send a message and not proceed until a response has been received from the receiving processor. The communication rate was calculated from the total time required to send the message and to send back the 4 Byte response. This test uses mpi_send and mpi_recv. It differs from the COMMS1 test described in section 3.3.1 of [4], in which a message is sent to a processor and that same message is then sent back to the first processor. Table 1 reports performance rates in KB per second, where "K" means 1000, not 1024. Notice that the T3E consistently outperforms the wide nodes by more than a factor of three. Thin-node rates were 5% to 14% lower than wide-node rates.
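A minimal sketch of this test for two processors is given below. It is our reconstruction under assumed names (msgsize, buf, ack), not the authors' code.

    program test1
      include 'mpif.h'
      integer, parameter :: msgsize = 125   ! e.g. 1 KB of 8-byte reals
      integer :: ierr, rank, ack, status(MPI_STATUS_SIZE)
      double precision :: buf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = 0.0d0
      if (rank == 0) then
         ! Send the message and wait for the 4 Byte acknowledgment.
         call MPI_Send(buf, msgsize, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, ierr)
         call MPI_Recv(ack, 1, MPI_INTEGER, 1, 2, MPI_COMM_WORLD, status, ierr)
      else if (rank == 1) then
         ! Receive the message, then return a 4 Byte integer response.
         call MPI_Recv(buf, msgsize, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
         ack = 1
         call MPI_Send(ack, 1, MPI_INTEGER, 0, 2, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test1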
¹ A follow-on study is planned that will evaluate the performance of MPI on the SGI Origin 2000, IBM's follow-on to the SP-2, and the Cray Research T3E-900.
[Table 1: Communication Test 1 rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
Communication Test 2 (Table 2-a):
This test measures communication rates for sending a message from one processor to all of the other processors using mpi_bcast; it is the COMMS3 test described in [4]. To better evaluate the performance of this broadcast operation, define the Normalized Broadcast Rate as

Normalized Broadcast Rate = (total data rate) / (N - 1),

where N is the total number of processors involved in the communication and the total data rate is the total amount of data sent on the communication network per unit time, measured in KB per second. Let R be the data rate when sending a message from one processor to another, and let D be the total data rate for broadcasting the same message to the N-1 other processors. If the broadcast operation and communication network were able to transmit the messages concurrently, then D = R*(N-1), and thus the Normalized Broadcast Rate would remain constant as N varied for a given message size. Thus, the rate at which the Normalized Broadcast Rate decreases as N increases indicates how far the broadcast operation is from being ideal. Table 2-a reports Normalized Broadcast Rates for the T3E-600 and for both wide and thin nodes on the SP-2. Notice that in all cases, instead of remaining constant for a given message size, the Normalized Broadcast Rate decreases significantly as the number of processors increases. Also notice that for all message sizes the T3E-600 is roughly 3 to 5 times faster than wide nodes, but this factor decreases as the number of processors used increases.
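As an illustration, the following sketch (our reconstruction; msgsize and niter are assumed values) times the broadcast and computes the Normalized Broadcast Rate as defined above.

    program test2
      include 'mpif.h'
      integer, parameter :: msgsize = 12500   ! 100 KB of 8-byte reals
      integer, parameter :: niter = 100
      integer :: ierr, rank, nprocs, i
      double precision :: buf(msgsize), t0, t, drate, nbrate
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      buf = 1.0d0
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      do i = 1, niter
         call MPI_Bcast(buf, msgsize, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
      end do
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t = (MPI_Wtime() - t0) / niter
      ! Each broadcast delivers the message to the N-1 other
      ! processors; rates are in KB/s with K = 1000.
      drate  = dble(nprocs-1) * 8.0d0 * dble(msgsize) / t / 1.0d3
      nbrate = drate / dble(nprocs-1)
      if (rank == 0) print *, 'Normalized Broadcast Rate (KB/s):', nbrate
      call MPI_Finalize(ierr)
    end program test2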
[Table 2-a: Normalized Broadcast Rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
To see that there actually is concurrency occurring in the broadcast operation, define the Log Normalized Broadcast Rate as (total data rate)/Log(N), where N is the number of processors involved in the communication and Log(N) is the log base 2 of N. If binary-tree parallelism were being utilized, the Log Normalized Broadcast Rate would be constant for a given message size as N varies. Table 2-b gives the Log Normalized Broadcast Rates and does in fact show that concurrency is being utilized in the broadcast operation on both machines.
[Table 2-b: Log Normalized Broadcast Rates in KB/s for the T3E-600 and SP-2 thin and wide nodes.]
Communication Test 3 (Table 3):
Communication Test 3 measures the rates for broadcasting a message from one processor to all other processors using mpi_bcast and then having the receiving processors return this same message to the originating processor using mpi_send and mpi_recv with a wildcard. To eliminate the possibility of an optimizing compiler recognizing that the same message is being sent back, and hence that it need not be sent, one element of the message is altered by each receiving processor prior to sending the message back. Notice that this communication pattern causes significantly more data traffic on the communication network than the simple broadcast operation described in Communication Test 2. Table 3 gives the Normalized Broadcast Rates for this operation. The trends in Table 3 are similar to those of Table 2-a, but the data rates for 10 MB messages with 16, 32 and 64 processors drop off significantly on the SP-2.
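A sketch of the pattern follows. It is our reconstruction, with assumed names; MPI_ANY_SOURCE is the standard MPI wildcard for the receive.

    program test3
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, j, status(MPI_STATUS_SIZE)
      double precision :: buf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      buf = 1.0d0
      call MPI_Bcast(buf, msgsize, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
      if (rank == 0) then
         ! Collect the returned messages in whatever order they arrive.
         do j = 1, nprocs - 1
            call MPI_Recv(buf, msgsize, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, 3, &
                          MPI_COMM_WORLD, status, ierr)
         end do
      else
         ! Alter one element so the compiler cannot elide the return send.
         buf(1) = buf(1) + 1.0d0
         call MPI_Send(buf, msgsize, MPI_DOUBLE_PRECISION, 0, 3, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test3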
[Table 3: Normalized Broadcast Rates in KB/s for Communication Test 3.]
Communication Test 4 (Table 4):
The rest of the communication tests are designed to measure communication between "neighboring" processors. Let N be the total number of processors and assume that they are numbered from 0 to N-1. This communication test sends a message from processor i to processor (i+1) mod N, for i = 0, 1, …, N-1. Observe that the data rates for this test should increase proportionally with N, since the communication can (one hopes) be done in parallel. Thus, in a manner similar to the Normalized Broadcast Rate, we define the Normalized Data Rate to be

Normalized Data Rate = (total data rate) / N.

In an ideal parallel computer, the Normalized Data Rate for the above communication would be constant, since all communication would be done concurrently. Thus, the degree to which the Normalized Data Rate is not constant indicates how far from ideal this type of communication is on the parallel computer. This test uses mpi_send and mpi_recv. Table 4 reports the Normalized Data Rates for the above communication in KB/second. Notice that for all message sizes the data rate scales well on both machines and that the T3E-600 significantly outperforms the SP-2.
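A sketch of the ring shift is shown below (our reconstruction). With blocking mpi_send and mpi_recv, some ordering is needed to avoid deadlock for large messages; having even ranks send first and odd ranks receive first is one common choice and an assumption on our part.

    program test4
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, dest, src, status(MPI_STATUS_SIZE)
      double precision :: sbuf(msgsize), rbuf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf = 1.0d0
      ! Processor i sends to (i+1) mod N and receives from (i-1) mod N.
      dest = mod(rank + 1, nprocs)
      src  = mod(rank - 1 + nprocs, nprocs)
      if (mod(rank, 2) == 0) then
         call MPI_Send(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 4, MPI_COMM_WORLD, ierr)
         call MPI_Recv(rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 4, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_Recv(rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 4, MPI_COMM_WORLD, status, ierr)
         call MPI_Send(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 4, MPI_COMM_WORLD, ierr)
      end if
      call MPI_Finalize(ierr)
    end program test4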
[Table 4: Normalized Data Rates in KB/s for Communication Test 4.]
Communication Test 5 (Table 5):
This next communication pattern sends a message from each processor i to both of its neighbors, (i-1) mod N and (i+1) mod N, for i = 0, 1, …, N-1, where N is the number of processors used in the test. Thus, the amount of data being moved on the network is twice that of the previous test. This test uses mpi_sendrecv. Table 5 reports the performance results. Notice that the T3E-600 handles the extra data traffic better than the SP-2. Also notice that the data rates scale well as the number of processors used increases.
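A sketch using mpi_sendrecv is shown below (our reconstruction, with assumed names). Each call pairs a send in one ring direction with the matching receive from the opposite neighbor, so blocking deadlock cannot occur.

    program test5
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, left, right, status(MPI_STATUS_SIZE)
      double precision :: sbuf(msgsize), rbufl(msgsize), rbufr(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf  = 1.0d0
      right = mod(rank + 1, nprocs)
      left  = mod(rank - 1 + nprocs, nprocs)
      ! Shift one copy of the message to the right neighbor ...
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, right, 5, &
                        rbufl, msgsize, MPI_DOUBLE_PRECISION, left, 5, &
                        MPI_COMM_WORLD, status, ierr)
      ! ... and one copy to the left neighbor.
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, left, 6, &
                        rbufr, msgsize, MPI_DOUBLE_PRECISION, right, 6, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Finalize(ierr)
    end program test5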
[Table 5: Normalized Data Rates in KB/s for Communication Test 5.]
Communication Test 6 (Table 6):
To describe this last communication pattern, let N be the number of processors used for the test and let P be a permutation of the processor numbers 0, 1, …, N-1. This communication pattern can then be described as sending a message from processor P(i) to processor P((i+1) mod N), for i = 0, 1, …, N-1. Notice that this is exactly the same as the communication pattern described in Communication Test 4 if one reorders the processor numbering from 0, 1, …, N-1 to P(0), P(1), …, P(N-1). Thus, the purpose of this test is to determine the impact on performance of reordering the numbering of the processors. Of course, the performance is likely to depend on the particular permutation selected. This test uses mpi_sendrecv. Comparing Table 6 with Table 4, one observes that the Normalized Data Rates for both the T3E-600 and the SP-2 can in fact change significantly depending on the permutation selected.
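The sketch below illustrates the permuted ring (our reconstruction). The paper does not state which permutations were measured; the low-bit-flip permutation here is purely an example.

    program test6
      include 'mpif.h'
      integer, parameter :: msgsize = 12500
      integer :: ierr, rank, nprocs, i, me, dest, src
      integer :: status(MPI_STATUS_SIZE)
      integer, allocatable :: perm(:)
      double precision :: sbuf(msgsize), rbuf(msgsize)
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      sbuf = 1.0d0
      allocate(perm(0:nprocs-1))
      ! Example permutation only: swap each even rank with its odd
      ! neighbor by flipping the low bit (assumes nprocs is even).
      do i = 0, nprocs - 1
         perm(i) = ieor(i, 1)
      end do
      ! Locate this rank's position in the permuted ring, then send to
      ! the processor at the next position and receive from the previous.
      me = 0
      do i = 0, nprocs - 1
         if (perm(i) == rank) me = i
      end do
      dest = perm(mod(me + 1, nprocs))
      src  = perm(mod(me - 1 + nprocs, nprocs))
      call MPI_Sendrecv(sbuf, msgsize, MPI_DOUBLE_PRECISION, dest, 7, &
                        rbuf, msgsize, MPI_DOUBLE_PRECISION, src, 7, &
                        MPI_COMM_WORLD, status, ierr)
      call MPI_Finalize(ierr)
    end program test6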
[Table 6: Normalized Data Rates in KB/s for Communication Test 6.]
CONCLUSIONS
The purpose of this study was to evaluate and compare the communication performance of the Cray Research T3E-600 and the IBM SP-2 on a collection of communication tests written in MPI. These tests have been designed to evaluate the performance of some of the communication patterns that we feel are likely to occur in scientific programs. Both machines showed a high level of concurrency on the nearest-neighbor communication tests and moderate concurrency on the broadcast communication tests. On the SP-2, thin nodes performed roughly 10% slower than wide nodes on our tests. The T3E-600 significantly outperformed the SP-2, running at least three times faster than the SP-2's wide nodes on most tests.
ACKNOWLEDGMENTS
Computer time on the Maui High Performance Computing Center's SP-2 was sponsored by the Phillips Laboratory, Air Force Materiel Command, USAF, under cooperative agreement number F29601-93-2-0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Phillips Laboratory or the U.S. Government.
We would like to thank Cray Research Inc. for allowing us to use their T3E at their corporate headquarters in Eagan, Minnesota, USA.
REFERENCES