next up previous
Next: Block algorithms: Matrix Multiplication Up: Automatic Blocking of Nested Previous: Automatic Blocking of Nested

Introduction

An essential fact of life in very-large-scale integrated circuits is that transistors are cheap and wires are expensive. The concomitant fact in high-performance computing, especially parallel computing, is that computation is cheap and communication is expensive. The two types of communication that we are primarily concerned with here are communication between the processors in a multicomputer and communication between processors and main memory.

Both these forms of communication are characterized by long latency and low bandwidth compared to the CPU rate. For instance, in the CRAY-1 memory was able to provide only 80 Mwords per second to the vector unit, which could produce one result and take in two operands per clock at 80 MHz; thus the memory was too slow by a factor of three. This same phenomenon can be observed in the i860 RISC today, the NEC-SX supercomputers, the Alliant machines, the CM-2, and most other high-performance machines. Communication speeds are likewise slower than processor speeds in multicomputers such as the Intel iPSC/2. In that machine, processors communicate over 1-bit-wide channels but have full word-wide paths to local memory. While newer message passing machines will employ byte-wide communication channels, the evolving microprocessor already provides memory ports of 8 or 16 bytes.

The principal architectural solution to these problems is to provide a small but fast local memory at each processor. The memory may be managed by hardware on a demand basis (cache) or managed explicitly by software, either operating system or application. If the processor is B times faster than the data path to memory or to other processors, then it must make reference to that slow data path only once for every B operations in order not to be slowed down. For this to happen, it must get its data from the local memory roughly B-1 times out of every B. Software must organize the computation so that this ``hit ratio'' can be achieved.




next up previous
Next: Block algorithms: Matrix Multiplication Up: Automatic Blocking of Nested Previous: Automatic Blocking of Nested

Jack Dongarra
Tue Feb 18 15:39:11 EST 1997