The ScaLAPACK strategy for combining efficiency with portability is to build the software, as far as possible, from calls to the PBLAS for global computation. The PBLAS in turn rely on the BLAS for local computation and on the BLACS for communication.

The efficiency of the ScaLAPACK software depends on efficient implementations of the BLAS and the BLACS being provided by computer vendors (or others) for their machines. Thus, the BLAS and the BLACS form a low-level interface between ScaLAPACK software and different machine architectures. Above this level, all of the ScaLAPACK software is portable.
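The layering described above can be illustrated with a toy model. The sketch below is not ScaLAPACK, PBLAS, BLAS, or BLACS code; every function name in it is hypothetical, and the "processes" are simulated by plain Python lists. It only shows the idea that a global operation can be written entirely in terms of a local kernel and a communication primitive, so that only those two lower layers would need to change from machine to machine.

```python
# Toy model of the software layering. All names are hypothetical and
# the "processes" are simulated by a list of blocks in one Python process.

def local_dot(x, y):
    """BLAS role: purely local computation on one process's block."""
    return sum(a * b for a, b in zip(x, y))


def global_sum(partials):
    """BLACS role: combine one value per process (a simulated reduction)."""
    return sum(partials)


def distributed_dot(x_blocks, y_blocks):
    """PBLAS role: a global dot product over block-distributed vectors.

    This layer never touches the data directly; it only calls the
    local kernel and the communication primitive.
    """
    partials = [local_dot(xb, yb) for xb, yb in zip(x_blocks, y_blocks)]
    return global_sum(partials)


def split(v, nprocs):
    """Split v into at most nprocs contiguous blocks (a simple block layout)."""
    size = max(1, -(-len(v) // nprocs))  # ceiling division, at least 1
    return [v[i:i + size] for i in range(0, len(v), size)]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [5.0, 4.0, 3.0, 2.0, 1.0]
    print(distributed_dot(split(x, 2), split(y, 2)))
```

In this model, replacing `local_dot` or `global_sum` with a faster machine-specific version would leave `distributed_dot` unchanged, which is the sense in which the upper layer is portable.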

In this article, performance results are presented for three different distributed-memory concurrent computers: the IBM Scalable POWERparallel 2, the Intel XP/S Paragon, and a network of SPARC Ultra 1 workstations connected via switched ATM. Table 1 summarizes the relevant technical features of these machines.

**Table 1:** Machine Characteristics

For each machine, this table shows the type of processor node, the peak flop rate per node, the peak latency and bandwidth of the interconnection network, and the amount of physical memory per node. The numbers in parentheses are the corresponding values that a user program can actually achieve. For example, the flop rate in parentheses is the flop rate of the BLAS matrix-multiply, and the latency and bandwidth in parentheses are the values achieved by the BLACS. Finally, the amount of memory per node in parentheses is an approximation of the largest amount of memory available to the user's program.
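To make the notion of an achieved flop rate concrete, the following minimal sketch times a matrix multiply and converts the elapsed time into Mflop/s. It assumes the conventional operation count of 2mnk floating-point operations for an m-by-k times k-by-n product (one multiply and one add per inner-loop step). The pure-Python kernel and the function names are illustrative stand-ins, not the vendor BLAS used for the measurements in Table 1, so the rate it reports says nothing about those machines.

```python
import time


def matmul_flops(m, n, k):
    """Conventional flop count for C = A*B with A (m x k) and B (k x n)."""
    return 2 * m * n * k


def naive_matmul(a, b):
    """Reference triple-loop matrix multiply on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            aip = a[i][p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += aip * row_b[j]
    return c


def achieved_mflops(n, repeats=3):
    """Best-of-`repeats` achieved rate, in Mflop/s, for an n x n multiply."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        naive_matmul(a, b)
        best = min(best, time.perf_counter() - start)
    return matmul_flops(n, n, n) / best / 1e6


if __name__ == "__main__":
    print(f"{achieved_mflops(60):.2f} Mflop/s")
```

Taking the best of several repetitions is one common way to reduce timing noise; other conventions, such as averaging, are equally defensible.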

All performance numbers presented were obtained with real double-precision data, which corresponds to 64-bit floating-point arithmetic on every machine tested.
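As a small aside, a reader reproducing such measurements can check that a given platform's double-precision type really is 64 bits wide. The sketch below does this for the Python interpreter's `float`, which is a C `double` in CPython; the function name is hypothetical, and the check assumes that the target format is IEEE 754 binary64 (a 53-bit significand, counting the implicit leading bit, and a maximum binary exponent of 1024).

```python
import struct
import sys


def is_ieee_binary64():
    """Return True if the native C double looks like IEEE 754 binary64."""
    info = sys.float_info
    return (
        struct.calcsize("d") == 8   # 8 bytes = 64 bits of storage
        and info.radix == 2         # binary floating point
        and info.mant_dig == 53     # significand bits, incl. implicit bit
        and info.max_exp == 1024    # largest binary exponent plus one
    )


if __name__ == "__main__":
    print(is_ieee_binary64())
```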

Sat Feb 1 08:18:10 EST 1997