Processes of a parallel application distributed over a collection of processors must communicate problem parameters and results. In distributed memory multiprocessors or workstations on a network, the information is typically communicated with explicit message-passing subroutine calls. To send data to another process, a subroutine is usually provided that requires a destination address, a message, and a message length. The receiving process usually provides a buffer, a maximum length, and the sender's address. The programming model is often extended to include both synchronous and asynchronous communication, group communication (broadcast and multicast), and aggregate operations (e.g., global sum).
Message passing performance is usually measured in units of time or bandwidth (bytes per second). In this report, we choose time as the measure of performance for sending a small message. The time for a small, or zero-length, message is usually bounded by the speed of the signal through the media (latency) and any software overhead in sending/receiving the message. Small message times are important in synchronization and in determining the optimal granularity of parallelism. For large messages, bandwidth is the bounding metric, usually approaching the maximum bandwidth of the media. Choosing two numbers to represent the performance of a network can be misleading, so the reader is encouraged to plot communication time as a function of message length to compare and understand the behavior of message passing systems.
Message passing time is usually a linear
function of message size for two processors that are directly
connected.
For more complicated networks, a per-hop delay may increase
the message passing time.
Message-passing time, $t$, can be modeled as
$t = t_s + t_b n + t_h h$
with a start-up time, $t_s$, a per-byte cost, $t_b$,
and a per-hop delay, $t_h$, where $n$ is the number of bytes
per message and $h$ the number of hops a message must travel.
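The linear model above can be sketched numerically. This is a minimal illustration, not a measurement; the parameter values below (a start-up time, per-byte cost, and per-hop delay) are assumed for the example:

```python
def model_time(n, h, t_s=50e-6, t_b=10e-9, t_h=1e-6):
    """Modeled message-passing time in seconds: a fixed start-up
    cost, plus a per-byte cost for n bytes, plus a per-hop cost
    for h hops.  Parameter values here are illustrative only."""
    return t_s + t_b * n + t_h * h

# Example: a 1 KB message over a single hop.
t = model_time(1024, 1)
```

For nearest-neighbor communication the per-hop term contributes a single $t_h$ and, as noted below, is often negligible.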
On most current message-passing multiprocessors the per-hop delay
is negligible due to ``worm-hole'' routing
techniques and the small diameter of
the communication network [3].
The results reported in this report
reflect nearest-neighbor communication.
A linear least-squares fit can be used to calculate
$t_s$ and $t_b$ from experimental data of message-passing times versus
message length.
The start-up time, $t_s$, may be slightly different than
the zero-length time, and $1/t_b$ should be the asymptotic bandwidth.
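The fit can be sketched in a few lines. This is a generic closed-form least-squares fit of time versus message length (the synthetic data points are for illustration, not measurements from this report):

```python
def fit_linear(ns, ts):
    """Least-squares fit of t = t_s + t_b * n to measured times.
    Returns (t_s, t_b): estimated start-up time and per-byte cost."""
    k = len(ns)
    mean_n = sum(ns) / k
    mean_t = sum(ts) / k
    t_b = (sum((n - mean_n) * (t - mean_t) for n, t in zip(ns, ts))
           / sum((n - mean_n) ** 2 for n in ns))
    t_s = mean_t - t_b * mean_n
    return t_s, t_b

# Synthetic example: times generated from t_s = 60 us, t_b = 8 ns/byte.
ns = [0, 256, 1024, 4096, 16384]
ts = [60e-6 + 8e-9 * n for n in ns]
t_s, t_b = fit_linear(ns, ts)   # recovers ~60e-6 and ~8e-9
```

In practice the fit should be made over the message lengths of interest, since measured times are rarely perfectly linear across the full range.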
The message length at which half the maximum bandwidth is achieved,
$n_{1/2}$, is another metric of interest and is equal to
$t_s/t_b$ [10].
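Given fitted values of the start-up time and per-byte cost, the half-bandwidth message length follows directly. A minimal sketch, with assumed illustrative parameter values:

```python
def n_half(t_s, t_b):
    """Message length at which half the asymptotic bandwidth 1/t_b
    is achieved: bandwidth n/(t_s + t_b*n) equals 1/(2*t_b) when
    n = t_s/t_b."""
    return t_s / t_b

# Example: t_s = 60 us start-up, t_b = 8 ns/byte.
n = n_half(60e-6, 8e-9)   # 7500 bytes
```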
As with any metric that is a ratio, any notion of ``goodness''
or ``optimality'' of $n_{1/2}$ should only be considered in the
context of the underlying metrics $t_s$, $t_b$,
$t_h$, and $h$. For a more complete discussion of
these parameters see [9, 8].
There are a number of factors that can affect message passing performance. The number of times the message has to be copied or touched (e.g., checksums) is probably the most influential and is obviously a function of message size. The vendor may provide hints as to how to reduce message copies, for example, posting the receive before the send. Second-order effects of message size may also affect performance. Message lengths that are powers of two or a multiple of the cache-line size may provide better performance than slightly shorter lengths. Buffer alignment on word, cache-line, or page boundaries may also affect performance. For small messages, context-switch times may contribute to delays. Touching all the pages of the buffers beforehand can reduce virtual memory effects. For shared media, contention may also affect performance. There may also be some first-time effects that can be identified or eliminated by performing some ``warm up'' tests before collecting performance data. These ``warm up'' tests can be as simple as running the test a number of times before gathering the timing data.
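The warm-up procedure described above can be sketched as a generic timing harness. The operation being timed is a hypothetical stand-in here (a real test would time a ping-pong message exchange), and reporting the minimum over trials is one common convention, not the only one:

```python
import time

def timed_trials(op, warmup=5, trials=20):
    """Run `op` a few times untimed to absorb first-time effects
    (page faults, cold caches, connection setup), then time the
    remaining trials and return the best observed time in seconds."""
    for _ in range(warmup):
        op()
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        op()
        times.append(time.perf_counter() - start)
    return min(times)

# Stand-in workload; a real test would wrap a send/receive exchange.
best = timed_trials(lambda: sum(range(1000)))
```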
There are, of course, other parameters of a message-passing system that may affect performance for a given application. The aggregate bandwidth of the network, the amount of concurrency, reliability, scalability, and congestion management may all be issues.