Frequently Asked Questions on the Linpack Benchmark and Top500
(Last updated 6/17/2008 8:10 AM)
What is the Linpack Benchmark?
What is the Linpack Benchmark report?
What is the reference for the Linpack Benchmark
Report?
Is there a paper which describes the benchmark in
some detail and gives a historical perspective?
What is the theoretical peak performance?
What are the three benchmarks in the Linpack
Benchmark report?
What is the Linpack Fortran n = 100 benchmark?
What exactly does the Linpack Fortran n=100
benchmark time?
What is the Linpack n = 1000 benchmark (TPP, Best
Effort)?
What is Linpack’s “Highly Parallel Computing” benchmark?
Why are my performance results below the theoretical peak?
What are the ground rules for the first benchmark?
What are the ground rules for the second
benchmark?
What are the ground rules for the third benchmark?
To what accuracy must the solution conform?
Can I get a more personalized list of machine and
performance results?
How can I get the Linpack Benchmark program?
Is there a Java version of the Linpack Benchmark?
What do I do to run the Linpack Benchmark Program?
How does the Linpack Benchmark performance relate
to my application?
Are there errors in the Linpack Benchmark report?
How can I get the complete Linpack software
collection?
Where can I get an optimized version of the BLAS?
Is Linpack the most efficient way to solve systems
of equations?
How can I get the whole LAPACK software
collection?
What is the history behind the Linpack Benchmark?
How can I add my computer's result to the table?
How can I measure the execution time more
accurately and reliably?
Should I run the single or double precision version of the benchmark?
Can I do a mixed precision implementation of the benchmark?
How can I interpret the results from the
benchmark?
Can I use any matrix to run the benchmark?
What matrix is used to run the benchmark?
How can I get my computer listed on the Top500?
For HPL what problem size N should I run?
For HPL what block size NB should I use?
For HPL what process grid ratio P x Q should I use?
For HPL what about the one processor case?
For HPL why so many options in HPL.dat?
Can I use Strassen's Method when doing
the matrix multiplies?
Where can I get a copy of the Top500 report?
Where can I get the software to generate
performance results for the Top500?
Why would a machine appear in the Linpack Benchmark report but not in the Top500 list?
Why would a machine appear in the Top500 list and not in the Linpack Benchmark report?
What about a list of clusters?
How can I interpret the results from the Linpack
100x100 benchmark?
Do you have an archive of previous Linpack
Benchmark reports or results?
What is the HPC Challenge benchmark?
Where can I get
additional information on the HPC Challenge benchmark?
Is there a benchmark for sparse matrices?
Where can I get additional information on
benchmarks?
The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations. Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.
The Linpack Benchmark is something that grew out of the Linpack software project. It was originally intended to give users of the package a feeling for how long it would take to solve certain matrix problems. The benchmark started as an appendix to the Linpack Users' Guide and has grown since the Linpack Users' Guide was published in 1979.
The Linpack Benchmark report is entitled
“Performance of Various Computers Using Standard Linear Equations Software”.
The report lists the performance in Mflop/s of a number of computer systems. A
copy of the report is available at http://www.netlib.org/benchmark/performance.ps.
The Linpack Benchmark report should be referenced in the following way:
“Performance of Various Computers Using Standard Linear Equations Software”, Jack Dongarra, University of Tennessee Computer Science Technical Report CS-89-85.
The paper “The LINPACK Benchmark: Past, Present, and Future” by Jack Dongarra, Piotr Luszczek, and Antoine Petitet describes the benchmark in some detail and gives a historical perspective.
Mflop/s is a rate of execution, millions
of floating point operations per second. Whenever this term is used it will
refer to 64 bit floating point operations and the operations will be either
addition or multiplication. Gflop/s refers to billions of floating point
operations per second and Tflop/s refers to trillions
of floating point operations per second.
The theoretical peak is based not on an actual performance from a benchmark run, but on a paper computation to determine the theoretical peak rate of execution of floating point operations for the machine. This is the number manufacturers often cite; it represents an upper bound on performance. That is, the manufacturer guarantees that programs will not exceed this rate, a sort of "speed of light" for a given computer. The theoretical peak performance is determined by counting the number of floating-point additions and multiplications (in full precision) that can be completed during a period of time, usually the cycle time of the machine. For example, an Intel Itanium 2 at 1.5 GHz can complete 4 floating point operations per cycle, for a theoretical peak performance of 6 GFlop/s.
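Stated as a formula, for a single processor:

   theoretical peak (Flop/s) = (floating point operations per cycle) x (cycles per second)

So for the Itanium 2 example, 4 x 1.5x10^9 = 6x10^9 Flop/s, or 6 GFlop/s; for a multiprocessor system, multiply by the number of processors.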
The three benchmarks in the Linpack Benchmark report are the Linpack Fortran n = 100 benchmark (see Table 1 of the report), the Linpack n = 1000 benchmark (see Table 1 of the report), and Linpack’s Highly Parallel Computing benchmark (see Table 3 of the report).
The first benchmark is for a matrix of order 100 using the Linpack software in Fortran. The results can be found in Table 1 of the benchmark report. In order to run this benchmark, download the file from http://www.netlib.org/benchmark/linpackd; this is a Fortran program. In order to run the program you will need to supply a timing function called SECOND which should report the CPU time that has elapsed. The ground rules for running this benchmark are that you can make no changes to the Fortran code, not even to the comments. Only compiler optimization can be used to enhance performance.
The Linpack benchmark measures the
performance of two routines from the Linpack collection of software. These
routines are DGEFA and DGESL (these are double-precision versions; SGEFA and
SGESL are their single-precision counterparts). DGEFA performs the LU decomposition
with partial pivoting, and DGESL uses that decomposition to solve the given
system of linear equations.
Most of the time is spent in DGEFA. Once the matrix has been decomposed, DGESL is used to find the solution; this process requires O(n^2) floating-point operations, as opposed to the O(n^3) floating-point operations of DGEFA. The results for this benchmark can be found in Table 1, second column, under “LINPACK Benchmark n = 100” of the Linpack Benchmark Report.
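For reference, here is a minimal Fortran sketch of the calling sequence for the two routines; the test system is illustrative (not the benchmark matrix), and DGEFA/DGESL must be linked in from the Linpack library.

      PROGRAM SOLVE
C     Sketch of the Linpack calling sequence for DGEFA/DGESL.
      INTEGER LDA, N
      PARAMETER (LDA = 10, N = 10)
      DOUBLE PRECISION A(LDA,N), B(N)
      INTEGER IPVT(N), INFO, I, J
C     Arbitrary diagonally dominant test matrix; b is chosen as the
C     row sums of A so that the exact solution is all ones.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = 1.0D0
   10    CONTINUE
         A(J,J) = DBLE(N)
   20 CONTINUE
      DO 30 I = 1, N
         B(I) = DBLE(2*N - 1)
   30 CONTINUE
C     Factor A = L*U with partial pivoting.
      CALL DGEFA(A, LDA, N, IPVT, INFO)
C     Solve A*x = b with the factors (JOB = 0); x overwrites B.
      CALL DGESL(A, LDA, N, IPVT, B, 0)
      WRITE(*,*) 'x(1) =', B(1), '  x(n) =', B(N)
      END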
The second benchmark is for a matrix of size 1000 and can be found in Table 1 of the benchmark report. In order to run this benchmark, download the file from http://www.netlib.org/benchmark/1000d; this is a Fortran driver. The ground rules for running this benchmark are a bit more relaxed in that you can specify any linear equation solver you wish, implemented in any language. A requirement is that your method must compute a solution and the solution must conform to the prescribed accuracy. TPP stands for Toward Peak Performance; this is the title of the column in the benchmark report that lists the results.
The performance of a computer
is a complicated issue, a function of many interrelated quantities. These
quantities include the application, the algorithm, the size of the problem, the
high-level language, the implementation, the human level of effort used to
optimize the program, the compiler's ability to optimize, the age of the
compiler, the operating system, the architecture of the computer, and the
hardware characteristics. The results presented for this benchmark suite should not be extolled as measures of total system performance (unless enough analysis has been performed to indicate a reliable correlation of the benchmarks to the workload of interest) but, rather, as reference points for further evaluations.
There are many reasons why your results may vary from results recorded in the Linpack Benchmark Report. Issues such as load on the system, accuracy of the clock, compiler options, version of the compiler, size of cache, bandwidth from memory, amount of memory, etc. can affect the performance even when the processors are the same.
The third benchmark is called the Highly Parallel Computing Benchmark and can be found in Table 3 of the Benchmark Report. (This is the benchmark used for the Top500 report.) This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. The software most commonly used to run it, HPL, is available at:
http://www.netlib.org/benchmark/hpl/
The “ground rules” for running the first benchmark in the report, the n=100 case, are that the program is run as is with no changes to the source code; not even changes to the comments are allowed. The compiler, through compiler switches, can perform optimization at compile time. The user must supply a timing function called SECOND. SECOND returns the running CPU time for the process. The matrix generated by the benchmark program must be used to run this case.
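A minimal sketch of such a SECOND function, assuming a compiler that supports the Fortran 90 CPU_TIME intrinsic (older systems used vendor routines such as ETIME; check whether your driver expects SECOND to be REAL or DOUBLE PRECISION):

      REAL FUNCTION SECOND()
C     Returns the accumulated CPU time of the process in seconds.
C     One possible implementation, using the Fortran 90 CPU_TIME
C     intrinsic; this routine is supplied by the user, not by the
C     benchmark.
      REAL T
      CALL CPU_TIME(T)
      SECOND = T
      RETURN
      END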
The “ground rules” for running the second benchmark in the report, the n=1000 case, allow for a complete user replacement of the LU factorization and solver steps. The calling sequence should be the same as the original routines. The problem size should be of order 1000. The accuracy of the solution must satisfy the following bound:

   ||Ax - b|| / (n ||A|| ||x|| eps) = O(1)

where eps is the relative machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib.
The “ground rules” for running the third benchmark in the report, the Highly Parallel case, allow for a complete user replacement of the LU factorization and solver steps. The accuracy of the solution must satisfy the same bound:

   ||Ax - b|| / (n ||A|| ||x|| eps) = O(1)

where eps is the relative machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. The matrix used must be the same matrix used in the driver program available from netlib. There is no restriction on the problem size.
The solution to all three benchmarks must satisfy the following mathematical bound:

   ||Ax - b|| / (n ||A|| ||x|| eps) = O(1)

where eps is the relative machine precision (on IEEE machines this is 2^-53) and n is the size of the problem. This implies the computation must be done in 64 bit floating point arithmetic.
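As a sketch, the quantity can be computed as follows, assuming a Fortran 90 compiler; A and B here are the original matrix and right-hand side, X is the computed solution, and infinity norms are used. (EPSILON() returns 2^-52 rather than 2^-53; the O(1) test is insensitive to this factor of two.)

      DOUBLE PRECISION FUNCTION RESIDN(A, LDA, N, X, B)
C     Normalized residual ||Ax-b|| / (n*||A||*||x||*eps), a sketch
C     using infinity norms.
      INTEGER LDA, N, I, J
      DOUBLE PRECISION A(LDA,N), X(N), B(N)
      DOUBLE PRECISION EPS, ANORM, XNORM, RNORM, S
      EPS = EPSILON(1.0D0)
C     ||A||: maximum absolute row sum.
      ANORM = 0.0D0
      DO 20 I = 1, N
         S = 0.0D0
         DO 10 J = 1, N
            S = S + ABS(A(I,J))
   10    CONTINUE
         ANORM = MAX(ANORM, S)
   20 CONTINUE
C     ||x|| and ||A*x - b||.
      XNORM = 0.0D0
      RNORM = 0.0D0
      DO 40 I = 1, N
         XNORM = MAX(XNORM, ABS(X(I)))
         S = -B(I)
         DO 30 J = 1, N
            S = S + A(I,J)*X(J)
   30    CONTINUE
         RNORM = MAX(RNORM, ABS(S))
   40 CONTINUE
      RESIDN = RNORM / (DBLE(N)*ANORM*XNORM*EPS)
      RETURN
      END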
In order to have an entry included in
the Linpack Benchmark report the results must be computed using full precision.
By full precision we generally mean 64 bit floating point arithmetic or higher.
Note that this is not an issue of single or double precision as some systems
have 64-bit floating point arithmetic as single precision. It is a function of
the arithmetic used.
You can get a more personalized listing of machines by using the interface at http://performance.netlib.org/performance/html/PDSbrowse.html. This list is not kept current, however, and may lag the Linpack Benchmark report by months.
You can download the program used to generate the Linpack benchmark results from http://www.netlib.org/benchmark/linpackd. This is a Fortran program. There is a C version of the benchmark located at: http://www.netlib.org/benchmark/linpackc.
There is a Java version of the benchmark that can be run as an applet at: http://www.netlib.org/benchmark/linpackjava/
For the 100x100 Fortran version, you need to supply a timing function called SECOND. SECOND is an elapsed-time function that will be called from Fortran and is expected to return the running CPU time in seconds. In the program two calls to SECOND are made and the difference is taken to gather the time.
The performance of the Linpack benchmark is typical for applications where the basic operation is based on vector primitives, such as adding a scalar multiple of a vector to another vector. Many applications exhibit the same performance as the Linpack Benchmark. However, results should not be taken too seriously. In order to measure the performance of any computer it is critical to probe for the performance of your applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments it is often difficult to reliably measure the execution time of a single program. We trust that anyone actually evaluating machines and operating systems will gather more reliable and more representative data.
While we make every attempt to verify
the results obtained from users and vendors, errors are bound to exist and
should be brought to our attention. We encourage users to obtain the programs
and run the routines on their machines, reporting any discrepancies with the
numbers listed here.
The Linpack package is a collection of Fortran subroutines for solving various systems of linear equations (http://www.netlib.org/linpack/). The software in Linpack is based on a decompositional approach to numerical linear algebra. The general idea is the following. Given a problem involving a matrix, one factors or decomposes the matrix into a product of simple, well-structured matrices which can be easily manipulated to solve the original problem. The package has the capability of handling many different matrix types and different data types, and provides a range of options. Linpack itself is built on another package called the BLAS. Linpack was designed in the late 1970s and has been superseded by a package called LAPACK.
The Linpack software library is available from netlib. See http://www.netlib.org/linpack/
The
BLAS (Basic Linear Algebra Subprograms) are high quality "building
block" routines for performing basic vector and matrix operations. Level 1
BLAS do vector-vector operations, Level 2 BLAS do matrix-vector operations, and
Level 3 BLAS do matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they're commonly used in the development of
high quality linear algebra software, LINPACK and LAPACK for example. For
additional information see: http://www.netlib.org/blas/
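As an illustration, here is a minimal Fortran sketch calling a Level 1 routine (DAXPY, y := alpha*x + y) and a Level 3 routine (DGEMM, C := alpha*A*B + beta*C); the data are arbitrary and a BLAS library must be linked in:

      PROGRAM BLASDEM
C     Sketch of Level 1 and Level 3 BLAS calls.
      INTEGER N
      PARAMETER (N = 4)
      DOUBLE PRECISION X(N), Y(N), A(N,N), B(N,N), C(N,N)
      INTEGER I, J
      DO 20 J = 1, N
         X(J) = 1.0D0
         Y(J) = 2.0D0
         DO 10 I = 1, N
            A(I,J) = 1.0D0
            B(I,J) = 1.0D0
            C(I,J) = 0.0D0
   10    CONTINUE
   20 CONTINUE
C     Level 1 BLAS: y := 3*x + y.
      CALL DAXPY(N, 3.0D0, X, 1, Y, 1)
C     Level 3 BLAS: C := 1*A*B + 0*C.
      CALL DGEMM('N', 'N', N, N, N, 1.0D0, A, N, B, N, 0.0D0, C, N)
      WRITE(*,*) 'y(1) =', Y(1), '  C(1,1) =', C(1,1)
      END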
The
ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide
portable performance for the BLAS routines. At present, it provides C and Fortran77
interfaces to a portably efficient BLAS implementation, as well as a few
routines from LAPACK. For additional information see: http://www.netlib.org/atlas/
Linpack is not the most efficient software for solving matrix problems. This is mainly due to the way the algorithm and resulting software access memory. The memory access patterns of the algorithm disregard the multi-layered memory hierarchies of RISC architectures and vector computers, thereby spending too much time moving data instead of doing useful floating-point operations. LAPACK addresses this problem by reorganizing the algorithms to use block matrix operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be optimized to account for memory hierarchies, providing a transportable way to achieve high efficiency on diverse modern machines. We use the term “transportable” instead of “portable” because, for fastest possible performance, LAPACK requires that highly optimized block matrix operations be already implemented on each machine. These operations are performed by the Level 3 BLAS in most cases.
LAPACK is a software collection for solving various matrix problems in linear algebra; in particular, systems of linear equations, least squares problems, eigenvalue problems, and singular value decompositions. The software is based on the use of block-partitioned matrix techniques that aid in achieving high performance on RISC-based systems, vector computers, and shared memory parallel processors.
LAPACK can be obtained from netlib, see (http://www.netlib.org/lapack/)
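For example, a dense system of linear equations can be solved with LAPACK's driver routine DGESV; a minimal sketch (the test system is illustrative, and an LAPACK library must be linked in):

      PROGRAM LAPDEM
C     Solve A*x = b with the LAPACK driver DGESV.
      INTEGER N, NRHS
      PARAMETER (N = 3, NRHS = 1)
      DOUBLE PRECISION A(N,N), B(N)
      INTEGER IPIV(N), INFO, I, J
C     Arbitrary diagonally dominant test matrix; b is chosen as the
C     row sums of A so that the exact solution is all ones.
      DO 20 J = 1, N
         DO 10 I = 1, N
            A(I,J) = 1.0D0
   10    CONTINUE
         A(J,J) = DBLE(N)
   20 CONTINUE
      DO 30 I = 1, N
         B(I) = DBLE(2*N - 1)
   30 CONTINUE
C     On exit A holds the LU factors and B holds the solution.
      CALL DGESV(N, NRHS, A, N, IPIV, B, N, INFO)
      WRITE(*,*) 'INFO =', INFO, '  x =', (B(I), I = 1, N)
      END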
The Linpack Benchmark is, in some sense,
an accident. It was originally designed to assist users of the Linpack package
by providing information on execution times required to solve a system of
linear equations. The first “Linpack Benchmark” report appeared as an appendix
in the Linpack Users' Guide in 1979. The appendix comprised data for one
commonly used path in Linpack for a matrix problem of size 100, on a collection
of widely used computers (23 in all), so users could estimate the time required
to solve their matrix problem.
Over the years other data was added,
more as a hobby than anything else, and today the collection includes hundreds
of different computer systems.
You can contact Jack Dongarra and send him
the output from the benchmark program. When sending results please include the
specific information on the computer on which the test was run, the compiler,
the optimization that was used, and the site it was run on. You can contact
Dongarra by sending email to dongarra@cs.utk.edu.
In order to run the benchmark program you will have to supply a function to gather the execution time on your computer. The execution time is requested by a call to the Fortran function SECOND. It is expected that the routine returns the accumulated execution time of your program. Two calls to SECOND are made and the difference is taken to compute the execution time.
The Performance API (PAPI)
project specifies a standard application programming interface (API) for
accessing hardware performance counters available on most modern microprocessors.
These counters exist as a small set of registers that count Events, occurrences
of specific signals related to the processor's function. Monitoring these
events facilitates correlation between the structure of source/object code and
the efficiency of the mapping of that code to the underlying architecture.
For additional information see: http://icl.cs.utk.edu/projects/papi/
The benchmark must be run using 64 bit floating
point arithmetic. Mixed precision
arithmetic is not allowed in the implementation of the benchmark.
The results reported in the benchmark report reflect performance for 64 bit floating point arithmetic. On some machines this may be DOUBLE PRECISION, such as computers that have IEEE floating point arithmetic, and on other computers this may be single precision (declared REAL in Fortran), such as Cray’s vector computers.
When and how often are the results
updated in the benchmark report?
The benchmark report is updated
continuously as new results arrive. They are posted to the web as they are
updated.
The
benchmark must be run with the matrix generator that is supplied with the
source code. You are not allowed to change any aspect of the matrix.
The matrices are generated using a
pseudo-random number generator. The matrices are designed to force partial
pivoting to be performed in Gaussian Elimination.
The Top500 lists the 500 fastest computer systems in use today. The collection was started in 1993 and has been updated twice a year since then. The report lists the sites that have the 500 most powerful computer systems installed. The best Linpack benchmark performance achieved is used as a performance measure in ranking the computers.
To be listed on the Top500 you have to run the software that can be found at http://www.netlib.org/benchmark/hpl/ and the performance of the benchmark run must be within the range of the 500 fastest computers for that period of time.
HPL is a software
package that solves a (random) dense linear system in double precision (64
bits) arithmetic on distributed-memory computers. It can thus be regarded as a
portable as well as freely available implementation of the High Performance
Computing Linpack Benchmark.
In order to find out the best performance of your system, the largest problem size fitting in memory is what you should aim for. The amount of memory used by HPL is essentially the size of the coefficient matrix. So for example, if you have 4 nodes with 256 MB of memory each, this corresponds to 1 GB total, i.e., about 134 million double precision (8-byte) elements. The square root of that number is 11585. One definitely needs to leave some memory for the OS as well as for other things, so a problem size of 10000 is likely to fit. As a rule of thumb, 80% of the total amount of memory is a good guess. If the problem size you pick is too large, swapping will occur, and the performance will drop. If multiple processes are spawned on each node (say you have 2 processors per node), what counts is the amount of memory available to each process.
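The arithmetic above can be captured in a few lines; a sketch using the example values from the text (4 nodes of 256 MB and the 80% rule of thumb):

      PROGRAM HPLSZ
C     Rule-of-thumb HPL problem size: the N x N matrix of 8-byte
C     elements should fill about 80% of the total memory.
      DOUBLE PRECISION TOTMEM, FRAC
      INTEGER N
C     Example: 4 nodes with 256 MB of memory each.
      TOTMEM = 4.0D0 * 256.0D0 * 1024.0D0**2
      FRAC = 0.80D0
      N = INT(SQRT(FRAC*TOTMEM/8.0D0))
      WRITE(*,*) 'Suggested problem size N is about', N
      END

For these values the program prints an N of about 10362, consistent with the estimate of 10000 above.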
HPL uses the block size NB for the data distribution as well as for the computational granularity. From a data distribution point of view, the smaller NB is, the better the load balance; you definitely want to stay away from very large values of NB. From a computational point of view, too small a value of NB may limit performance by a large factor because almost no data reuse will occur in the highest level of the memory hierarchy, and the number of messages will also increase. Efficient matrix-multiply routines are often internally blocked, and small multiples of this blocking factor are likely to be good block sizes for HPL. The bottom line is that "good" block sizes are almost always in the [32 .. 256] interval. The best values depend on the computation / communication performance ratio of your system. To a much lesser extent, the problem size matters as well. Say, for example, you empirically found that 44 was a good block size with respect to performance. Then 88 or 132 are likely to give slightly better results for large problem sizes because of a slightly higher flop rate.
This depends on the physical interconnection network you have. Assuming a mesh or a switch, HPL "likes" a 1:k ratio with k in [1..3]. In other words, P and Q should be approximately equal, with Q slightly larger than P. Examples: 2 x 2, 2 x 4, 2 x 5, 3 x 4, 4 x 4, 4 x 6, 5 x 6, 4 x 8, and so on. If you are running on a simple Ethernet network, there is only one wire through which all the messages are exchanged. On such a network, the performance and scalability of HPL are strongly limited and very flat process grids are likely to be the best choices: 1 x 4, 1 x 8, 2 x 4, and so on.
HPL has been designed to perform well for large problem sizes on hundreds of nodes and more. The software works on one node, and for large problem sizes one can usually achieve pretty good performance on a single processor as well. For small problem sizes, however, the overhead due to message-passing, local indexing and so on can be significant.
There are quite a few reasons. First off, these options are useful to determine what matters and what does not on your system. Second, HPL is often used in the context of early evaluation of new systems. In such cases, not everything is usually working quite right, and it is convenient to be able to vary these parameters without recompiling. Finally, every system has its own peculiarities and one is likely to want to empirically determine the best set of parameters. In any case, one can always follow the advice provided in the tuning section of the HPL document and not worry about the complexity of the input file.
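For illustration, the parameters discussed above (N, NB, P, Q) are plain-text entries near the top of HPL.dat; the excerpt below is a rough sketch of its layout following the sample file shipped with HPL, with illustrative values N=10000, NB=128, P=2, Q=4:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
4            Qs
...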
Certainly. There is always room for performance improvements. Specific knowledge about a particular system is always a source of performance gains. Even from a generic point of view, better algorithms or more efficient formulations of the classic ones are potential winners.
The normal matrix multiplication algorithm requires n^3 + O(n^2) multiplications and about the same number of additions. Strassen's algorithm reduces the total number of operations to O(n^2.81) by recursively multiplying 2n x 2n matrices using seven n x n matrix multiplications. Thus using Strassen’s Algorithm will distort the true execution rate. As a result we do not allow Strassen’s Algorithm to be used for the TOP500 reporting. As a side note, in the "usual" matrix multiplication, we have an n^2 error term. In Strassen's method, the error exponent p for n^p ranges from 2 to 3.85, and the numerical error can be 10-100 times greater than that for standard multiplication.
The Top500 reports are maintained at http://www.top500.org/.
There is software available that has been optimized and that many people use to generate the Top500 performance results. This benchmark attempts to measure the best performance of a machine in solving a system of equations. The problem size and software can be chosen to produce the best performance. A copy of that software can be downloaded from:
http://www.netlib.org/benchmark/hpl/
In order to run this you will need MPI and an optimized version of the BLAS. For MPI you can see: http://www-unix.mcs.anl.gov/mpi/mpich/download.html and for the BLAS see: http://www.netlib.org/atlas/.
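Once built, the benchmark is typically launched under MPI with the input file HPL.dat in the current working directory; a sketch of the command line, where the process count must equal P x Q from HPL.dat:

   mpirun -np 8 ./xhpl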
There could be two reasons. First, the Linpack Benchmark report contains historic information; even if a computer is no longer in existence it can appear in the Linpack Benchmark report. This is unlike the Top500, which reports the 500 fastest computers in existence at a given point in time. The second reason is that the Top500 list comes out twice a year while the Linpack Benchmark report is updated continuously.
If a machine is in the Top500 list it should appear in the Linpack Benchmark report. If you see an instance where this is not the case, it is probably a mistake; please send email to Jack Dongarra (dongarra@cs.utk.edu) about the situation.
We are starting a new list on clusters; for more information see http://clusters.top500.org/.
When the Linpack Fortran
n = 100 benchmark is run it produces the following kind of results:
Please send the results of this run to:
Jack J. Dongarra
Computer Science Department
Fax: 865-974-8296
Internet: dongarra@cs.utk.edu
     norm. resid      resid           machep          x(1)            x(n)
  1.67005097E+00  7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

     times are reported for matrices of order 100
        dgefa      dgesl      total     mflops       unit      ratio
     times for array with leading dimension of 201
    1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
    1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
    1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
    1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02
     times for array with leading dimension of 200
    1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
    1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
    1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
    1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02
The norm. resid is a measure of the accuracy of the computation. The value should be O(1). If the value is much greater than O(100) it suggests that the results are not correct.
The resid is
the unnormalized quantity.
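In the driver the normalized quantity is computed, in effect, as

   norm. resid = resid / (n * ||A|| * ||x|| * machep)

that is, the residual ||Ax - b|| scaled by the problem size, the norms of the matrix and the solution, and the machine precision.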
The term machep measures the precision used to carry out the computation. On an IEEE floating point computer the value should be 2.22044605E-16.
The values of x(1) and x(n) are the first and last components of the solution. The problem is constructed so that the solution should be all ones.
There are two sets of timings performed, both on matrices of order 100. The first one is where the 2-dimensional array that contains the matrix has a leading dimension of 201, and the second is where the leading dimension is 200. This is done to see what effect, if any, the placement of the arrays in memory has on the performance.
Times for dgefa and dgesl are reported. dgefa factors the matrix using Gaussian elimination with partial pivoting and dgesl solves a system based on the factorization. dgefa requires 2/3 n^3 operations and dgesl requires n^2 operations. The value of total is the sum of the times, and mflops is the execution rate in millions of floating point operations per second. Here a floating point operation is taken to be a floating point addition or multiplication. Unit and ratio are obsolete and should be ignored.
If the time reported is negative or zero
then the clock resolution is not accurate enough for the granularity of the
work. In this case a different timing routine should be used that has better
resolution.
No archive is maintained of previous results. However, here is some information to provide a historical perspective. The numbers in the following tables have been extracted from old Linpack Benchmark Reports. It took a bit of "file archaeology" to put the list together since I don't have the complete set of reports.
Top Computers Over Time for the Linpack n=100 Benchmark
(Entries for this table began in 1979.)
Year  Computer                                     Number of Processors  Cycle time  Mflop/s
2006  NEC SX-8/1 (1 proc)                          1                     2 GHz       2177
2004  Intel Pentium Nocona (1 proc 3.6 GHz)        1                     3.6 GHz     1803
2003  HP Integrity Server rx2600 (1 proc 1.5 GHz)  1                     1.5 GHz     1635
2002  Intel Pentium 4 (3.06 GHz)                   1                     3.06 GHz    1414
2001  Fujitsu VPP5000/1                            1                     3.33 nsec   1156
2000  Fujitsu VPP5000/1                            1                     3.33 nsec   1156
1999  CRAY T916                                    4                     2.2 nsec    1129
1995  CRAY T916                                    1                     2.2 nsec    522
1994  CRAY C90                                     16                    4.2 nsec    479
1993  CRAY C90                                     16                    4.2 nsec    479
1992  CRAY C90                                     16                    4.2 nsec    479
1991  CRAY C90                                     16                    4.2 nsec    403
1990  CRAY Y-MP                                    8                     6.0 nsec    275
1989  CRAY Y-MP                                    8                     6.0 nsec    275
1988  CRAY Y-MP                                    1                     6.0 nsec    74
1987  ETA 10-E                                     1                     10.5 nsec   52
1986  NEC SX-2                                     1                     6.0 nsec    46
1985  NEC SX-2                                     1                     6.0 nsec    46
1984  CRAY X-MP                                    1                     9.5 nsec    21
1983  CRAY 1                                       1                     12.5 nsec   12
...
1979  CRAY 1                                       1                     12.5 nsec   3.4
These numbers come from the Linpack
Benchmark Report Table 1.
=====================================================================
Top Computers Over Time for the Linpack n=1000 Benchmark
(Entries for this table began in 1986.)
Year  Computer            Number of Processors  Cycle time  Measured Mflop/s  Peak Mflop/s
2006  NEC SX-8/8          8                     2 GHz       75140             128000
2000  NEC SX-5/16         16                    4.0 nsec    45030             64000
1995  CRAY T916           16                    2.2 nsec    19400             28800
1994                      4                     2 nsec      16170             32000
1993  NEC SX-3/44R        4                     2.5 nsec    15120             25600
1992  NEC SX-3/44         4                     2.9 nsec    13420             22000
1991  Fujitsu VP2600/10   1                     3.2 nsec    4009              5000
1990  Fujitsu VP2600/10   1                     3.2 nsec    2919              5000
1989  CRAY Y-MP/832       8                     6 nsec      2144              2667
1988  CRAY Y-MP/832       8                     6 nsec      2144              2667
1987  NEC SX-2            1                     6 nsec      885               1300
1986  CRAY X-MP-4         4                     9.5 nsec    713               840
These numbers come from the Linpack
Benchmark Report Table 1.
(Full precision; matrix size 1000; best
effort programming, maximum optimization permitted.)
Top Computers Over Time for the Highly-Parallel Linpack Benchmark
(Entries for this table began in 1991.)
Year       Computer                                     Number of Processors  Measured Gflop/s  Size of Problem  Size of 1/2 Perf  Theoretical Peak Gflop/s
2005-2006  IBM Blue Gene/L                              131072                280600            1769471                            367001
2002-2004  Earth Simulator Computer, NEC                5104                  35610             1041216          265408            40832
2001       ASCI White-Pacific, IBM SP Power 3           7424                  7226              518096           179000            11136
2000       ASCI White-Pacific, IBM SP Power 3           7424                  4938              430000                             11136
1999       ASCI Red, Intel Pentium II Xeon core         9632                  2379              362880           75400             3207
1998       ASCI Blue-Pacific SST, IBM SP 604E           5808                  2144              431344                             3868
1997       Intel ASCI Option Red (200 MHz Pentium Pro)  9152                  1338              235000           63000             1830
1996                                                    2048                  368.2             103680           30720             614
1995       Intel Paragon XP/S MP                        6768                  281.1             128600           25700             338
1994       Intel Paragon XP/S MP                        6768                  281.1             128600           25700             338
1993       Fujitsu NWT                                  140                   124.5             31920            11950             236
1992       NEC SX-3/44                                  4                     20.0              6144             832               22
1991       Fujitsu VP2600/10                            1                     4.0               1000             200               5
These numbers come from the Linpack Benchmark Report Table 3.
(Full precision; the manufacturer is allowed to solve as large a problem as desired, maximum optimization permitted.)
Measured Gflop/s is the measured peak
rate of execution for running the benchmark in billions of floating point
operations per second.
Size of Problem is the matrix size at
which the measured performance was observed.
Size of ½ Perf
is the size of problem needed to achieve ½ the measured peak performance.
The HPC Challenge benchmark consists at this time of 7 benchmarks: HPL, STREAM, RandomAccess, PTRANS, FFTE, DGEMM and b_eff Latency/Bandwidth. HPL is the Linpack TPP benchmark; the test stresses the floating point performance of a system. STREAM is a benchmark that measures sustainable memory bandwidth (in GB/s). RandomAccess measures the rate of random updates of memory. PTRANS measures the rate of transfer for large arrays of data from multiprocessor memory. Latency/Bandwidth measures (as the name suggests) latency and bandwidth of communication patterns of increasing complexity between as many nodes as is time-wise feasible.
For
additional information on the benchmark see: http://icl.cs.utk.edu/hpcc/
The Linpack Benchmark suite is built
around software for dense matrix problems. In May 2000 we started to put together
a benchmark for sparse iterative matrix problems. For additional information
see: http://www.netlib.org/benchmark/sparsebench/
For additional information on benchmarks see: http://www.netlib.org/benchweb/
Please send your comments to Jack
Dongarra at dongarra@cs.utk.edu.