The efficiency of the ScaLAPACK software depends
on efficient implementations of the BLAS and the BLACS being
provided by computer vendors (or others) for their computers.
The BLAS and the BLACS form a low-level interface between
ScaLAPACK software and different computer architectures.
Table 5.5 presents performance numbers indicating
how well the BLACS and Level 3 BLAS perform on different
distributed-memory computers. For each computer this table shows the
flop rate achieved by the matrix-matrix multiply Level 3 BLAS
routine SGEMM/DGEMM on a node versus the theoretical
peak performance of that node,
the underlying message-passing library
called by the BLACS,
and the approximate values of the latency and the
bandwidth achieved by the BLACS versus the
underlying message-passing software for the machine.
Table 5.5: BLACS and Level 3 BLAS performance indicators
The values for latency in Table 5.5 were obtained by timing the cost of a 0-byte message. The bandwidth numbers in Table 5.5 were obtained by increasing the message length until the message bandwidth was saturated. We used the same timing mechanism for both the BLACS and the underlying message-passing library.
These numbers are measured timings, not values derived from hardware peaks. They should therefore be regarded as approximate indicators of the observed performance between two nodes, rather than precise evaluations of the capabilities of the interconnection network. On the CRAY, the numbers reported are for MPI and the MPIBLACS, instead of the faster shmem library with CRAY's native BLACS.
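The methodology above corresponds to the usual linear cost model for point-to-point messaging, time(n) = latency + n/bandwidth: the 0-byte message isolates the latency term, and long messages expose the asymptotic bandwidth. The sketch below fits both parameters from a set of message timings; the sizes and timings are synthetic, illustrative values (assumed here, not measurements of any machine in Table 5.5).

```python
import numpy as np

def fit_latency_bandwidth(sizes, times):
    """Least-squares fit of the model: time = latency + size / bandwidth."""
    # Design matrix with a constant column (latency) and a size column
    # (whose coefficient is 1/bandwidth).
    A = np.column_stack([np.ones_like(sizes), sizes])
    (latency, inv_bw), *_ = np.linalg.lstsq(A, times, rcond=None)
    return latency, 1.0 / inv_bw

# Synthetic ping-pong timings for an assumed 50 us latency and
# 100 MB/s bandwidth -- purely illustrative numbers.
sizes = np.array([0.0, 1e3, 1e4, 1e5, 1e6])       # message size in bytes
times = 50e-6 + sizes / 100e6                     # seconds per message
latency, bandwidth = fit_latency_bandwidth(sizes, times)
print(f"latency ~ {latency * 1e6:.1f} us, bandwidth ~ {bandwidth / 1e6:.1f} MB/s")
```

In practice the two endpoints would be timed separately, as in the guide's procedure, rather than fitted jointly; the fit is shown only to make the cost model explicit.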
For all four computers, a machine-specific optimized BLAS
implementation was used for all the performance numbers reported
in this chapter. For the IBM Scalable POWERparallel 2 (SP2)
computer, the IBM Engineering and Scientific Subroutine Library (ESSL)
was used [88]. On the Intel XP/S MP Paragon computer, the Intel
Basic Math Library Software (Release 5.0) [89] was used.
On the Sun Ultra Enterprise 2
workstation, the Dakota Scientific Software Library (DSSL)
was used. The DSSL BLAS implementation used only one processor
per node. The speed of the BLAS matrix-matrix multiply routine shown
in Table 5.5 has been obtained for the operation
C <- C + A B,
where A, B, and C are square matrices of order 500.
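A flop rate for this operation is conventionally computed as 2n^3 operations divided by the elapsed time. The sketch below reproduces the measurement with NumPy's BLAS-backed matrix multiply standing in for the vendor SGEMM/DGEMM routines timed in the guide; the random matrices and the use of NumPy are assumptions for illustration.

```python
import time
import numpy as np

n = 500                       # matrix order used in Table 5.5
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))

flops = 2.0 * n**3            # multiplies and adds in one n x n product

t0 = time.perf_counter()
C += A @ B                    # the timed operation: C <- C + A B
elapsed = time.perf_counter() - t0

print(f"{flops / elapsed / 1e6:.1f} Mflop/s for order n = {n}")
```

The rate reported this way depends on the BLAS implementation that NumPy links against, which parallels the chapter's point that each machine used its vendor-optimized BLAS.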