Frequently Asked Questions on the Linpack Benchmark and Top500
(Last updated 3/21/2002 1:00 PM)
What is the Linpack Benchmark?
What is the Linpack Benchmark report?
What is the reference for the Linpack Benchmark
Report?
Is there a paper which describes the benchmark in
some detail and gives a historical perspective?
What are the three benchmarks in the Linpack
Benchmark report?
What is the Linpack Fortran n = 100 benchmark?
What exactly does the Linpack Fortran n=100
benchmark time?
What is the Linpack n = 1000 benchmark (TPP, Best
Effort)?
What is the Linpack’s “Highly Parallel Computing”
benchmark?
What are the ground rules for the first
benchmark?
What are the ground rules for the second
benchmark?
What are the ground rules for the third
benchmark?
To what accuracy must the solution conform?
Can I get a more personalized list of machine and
performance results?
How can I get the Linpack Benchmark program?
Is there a Java version of the Linpack Benchmark?
What do I do to run the Linpack Benchmark
Program?
How does the Linpack Benchmark performance relate
to my application?
Are there errors in the Linpack Benchmark report?
How can I get the complete Linpack software
collection?
Where can I get an optimized
version of the BLAS?
Is Linpack the most efficient way to solve
systems of equations?
How can I get the whole LAPACK software
collection?
What is the history behind the Linpack Benchmark?
How can I add my computer's result to the table?
How can I measure the execution time more accurately and reliably?
Should I run the single and double precision of
the benchmarks?
How can I interpret the results from the
benchmark?
What matrix is used to run the benchmark?
Where can I get a copy of the Top500 report?
Where can I get the software to generate
performance results for the Top500?
What about a list of clusters?
How can I interpret the results from the Linpack
100x100 benchmark?
Do you have an archive of previous Linpack
Benchmark reports or results?
Is there a benchmark for sparse matrices?
Where can I get additional information on
benchmarks?
The Linpack Benchmark is a measure of a
computer’s floating-point rate of execution. It is determined by running a
computer program that solves a dense system of linear equations. Over the years
the characteristics of the benchmark have changed a bit. In fact, there are
three benchmarks included in the Linpack Benchmark report.
The Linpack Benchmark grew out
of the Linpack software project. It was originally intended to give users of
the package a feeling for how long it would take to solve certain matrix
problems. The benchmark started as an appendix to the Linpack Users' Guide and
has grown since the Linpack Users' Guide was published in 1979.
The Linpack Benchmark report is entitled
“Performance of Various Computers Using Standard Linear Equations Software”.
The report lists the performance in Mflop/s of a number of computer systems. A
copy of the report is available at http://www.netlib.org/benchmark/performance.ps.
The Linpack Benchmark
report should be referenced in the following way:
“Performance of
Various Computers Using Standard Linear Equations Software”, Jack Dongarra,
The paper “The LINPACK Benchmark: Past, Present,
and Future” by Jack Dongarra, Piotr Luszczek, and Antoine Petitet
provides a look at the details of the benchmark and provides performance data
in graphics form for a number of machines on basic operations. A copy of the
paper is available at http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf.
Mflop/s is a rate of execution, millions of
floating point operations per second. Whenever this term is used it will refer
to 64 bit floating point operations and the operations will be either addition
or multiplication. Gflop/s refers to billions of floating point operations per
second and Tflop/s refers to trillions of floating
point operations per second.
The three benchmarks in the Linpack Benchmark
report are the Linpack Fortran n = 100 benchmark (see
Table 1 of the report), the Linpack n = 1000 benchmark (see Table 1 of the
report), and Linpack’s Highly Parallel Computing
benchmark (see Table 3 of the report).
The first benchmark is for a matrix of order 100
using the Linpack software in Fortran. The results can
be found in Table 1 of the benchmark report. To run this benchmark,
download the file from http://www.netlib.org/benchmark/Linpackd;
this is a Fortran program. To run the program
you will need to supply a timing function called SECOND, which should report the
CPU time that has elapsed. The ground rules for running this benchmark are that
you can make no changes to the Fortran code, not even
to the comments. Only compiler optimization can be used to enhance performance.
The Linpack benchmark measures the performance
of two routines from the Linpack collection of software. These routines are
DGEFA and DGESL (these are double-precision versions; SGEFA and SGESL are their
single-precision counterparts). DGEFA performs the LU decomposition with
partial pivoting, and DGESL uses that decomposition to solve the given system
of linear equations.
Most of the time is spent in DGEFA. Once the
matrix has been decomposed, DGESL is used to find the solution; this process
requires $O(n^2)$ floating-point operations,
as opposed to the $O(n^3)$
floating-point operations of DGEFA. The
results for this benchmark can be found in Table 1, second column, under “LINPACK
Benchmark n = 100” of the Linpack Benchmark Report.
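The division of work between the two routines can be illustrated with a small sketch. This is not the Linpack source: it is a minimal Python illustration of LU factorization with partial pivoting followed by a triangular solve, and the names `lu_factor` and `lu_solve` are hypothetical. Unlike DGEFA, this sketch swaps whole rows (including the stored multipliers), which keeps the solve step simple.

```python
def lu_factor(a):
    """Factor the square matrix a (list of row lists) in place.

    After the call, the strict lower triangle holds the multipliers (L)
    and the upper triangle holds U. Returns the list of pivot rows.
    """
    n = len(a)
    piv = []
    for k in range(n):
        # Partial pivoting: choose the largest entry in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        piv.append(p)
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]          # multiplier, stored in place of the zero
            m = a[i][k]
            for j in range(k + 1, n):   # O(n^2) work per column, O(n^3) overall
                a[i][j] -= m * a[k][j]
    return piv


def lu_solve(a, piv, b):
    """Solve A x = b using the factors from lu_factor; O(n^2) work."""
    n = len(a)
    x = list(b)
    for k in range(n):                  # apply the row interchanges to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):                  # forward substitution with unit L
        for j in range(i):
            x[i] -= a[i][j] * x[j]
    for i in range(n - 1, -1, -1):      # back substitution with U
        for j in range(i + 1, n):
            x[i] -= a[i][j] * x[j]
        x[i] /= a[i][i]
    return x
```

The triple loop in `lu_factor` is where the cubic operation count comes from; `lu_solve` only contains double loops, which is why the factorization dominates the run time.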
The second benchmark is for a matrix of size
1000 and can be found in Table 1 of the benchmark report. To run this
benchmark, download the file from http://www.netlib.org/benchmark/1000d;
this is a Fortran driver. The ground rules for running
this benchmark are a bit more relaxed in that you can specify any linear
equation solver you wish, implemented in any language. A requirement is that
your method must compute a solution and the solution must be accurate to
the prescribed accuracy. TPP stands for Toward Peak Performance; this is the
title of the column in the benchmark report that lists the results.
The third benchmark is called the Highly
Parallel Computing Benchmark and can be found in Table 3 of the Benchmark
Report. (This is the benchmark used for the Top500 report.) This benchmark
attempts to measure the best performance of a machine in solving a system of
equations. The problem size and software can be chosen to produce the best
performance. Software for this benchmark is available at
http://www.netlib.org/benchmark/hpl/
The “ground rules” for running the first
benchmark in the report, the n=100 case, are that the program is run as is with no
changes to the source code; not even changes to the comments are allowed.
Optimization may be performed only by the compiler, through compiler switches, at compile time.
The user must supply a timing function called SECOND. SECOND returns the
running CPU time for the process. The matrix generated by the benchmark program
must be used to run this case.
The “ground rules” for running the second
benchmark in the report, the n=1000 case, allow for a complete user replacement of
the LU factorization and solver steps. The calling sequence should be the same
as the original routines. The problem
size should be of order 1000. The accuracy of the solution must satisfy the
following bound:
$$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$$
where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib.
The “ground rules” for running the third
benchmark in the report, the Highly Parallel case, allow for a complete user
replacement of the LU factorization and solver steps. The accuracy of the
solution must satisfy the following bound:
$$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$$
where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. The matrix used must be the same matrix used in the driver program
available from netlib. There is no restriction on the
problem size.
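The solution to all three benchmarks must
satisfy the following mathematical formula:
$$\frac{\|Ax - b\|}{\|A\| \, \|x\| \, n \, \varepsilon} \le O(1)$$
where $\varepsilon$ is the machine precision (on IEEE machines this is $2^{-53}$) and $n$ is the size of the
problem. This implies the computation must be done in 64 bit floating point
arithmetic.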
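As a rough illustration of the accuracy check, the normalized residual can be computed as in the sketch below. The choice of norms is an assumption (maximum absolute entry for the matrix and maximum absolute component for the vectors), as is the function name `normalized_residual`; the benchmark report should be consulted for the authoritative definition.

```python
import sys


def normalized_residual(a, x, b, eps=sys.float_info.epsilon):
    """Return ||Ax - b|| / (n * ||A|| * ||x|| * eps).

    Assumed norms: max-abs entry for A, max-abs component for vectors.
    eps defaults to Python's float epsilon (2**-52); pass 2.0**-53 to use
    the unit roundoff instead.
    """
    n = len(a)
    resid = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    norm_a = max(abs(v) for row in a for v in row)
    norm_x = max(abs(v) for v in x)
    return resid / (n * norm_a * norm_x * eps)
```

A value of order one indicates that the computed solution is as accurate as 64-bit arithmetic allows; a solver running in 32-bit arithmetic would produce a value many orders of magnitude larger.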
In order to have an entry included in the
Linpack Benchmark report the results must be computed using full precision. By
full precision we generally mean 64 bit floating point arithmetic or higher.
Note that this is not an issue of single or double precision as some systems
have 64-bit floating point arithmetic as single precision. It is a function of
the arithmetic used.
You can get a more personalized listing of
machines by using the interface at http://performance.netlib.org/performance/html/PDSbrowse.html
You can download the program used to generate
the Linpack benchmark results from http://www.netlib.org/benchmark/linpackd.
This is a Fortran program. There is a C version of the
benchmark located at: http://www.netlib.org/benchmark/linpackc.
There is a Java version of the benchmark that can be downloaded as an applet
at:
There is a Java program at:
http://www.netlib.org/benchmark/linpackjava/
For the 100x100 Fortran
version, you need to supply a timing function called SECOND. SECOND is an
elapsed-time function that will be called from Fortran
and is expected to return the running CPU time in seconds. In the program two
calls to SECOND are made and the difference taken to obtain the time.
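The timing pattern described here can be sketched in a few lines. This is only an analogue written in Python, not the Fortran routine the benchmark expects: `second` and `time_call` are hypothetical names, and `time.process_time()` stands in for whatever CPU-time call your system provides.

```python
import time


def second():
    """Analogue of SECOND: CPU time used by this process so far, in seconds."""
    return time.process_time()


def time_call(work, *args):
    """Run work(*args) once; return (result, CPU seconds used)."""
    t0 = second()                 # first call to SECOND
    result = work(*args)
    elapsed = second() - t0       # second call; the difference is the time
    return result, elapsed
```

For example, `time_call(sum, range(1000))` returns the sum together with the CPU time it took to compute.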
The performance of the Linpack benchmark is
typical for applications where the basic operation is based on vector
primitives such as adding a scalar multiple of a vector to another vector. Many
applications exhibit the same performance as the Linpack Benchmark. However,
results should not be taken too seriously. In order to measure the performance
of any computer it’s critical to probe for the performance of your
applications. The Linpack Benchmark can only give one point of reference. In addition, in multiprogramming environments
it is often difficult to reliably measure the execution time of a single
program. We trust that anyone actually evaluating machines and operating
systems will gather more reliable and more representative data.
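The vector primitive mentioned above, adding a scalar multiple of one vector to another, looks like this in a minimal Python sketch. The name `axpy` follows the usual naming of this operation; the code is an illustration and not taken from the benchmark.

```python
def axpy(alpha, x, y):
    """Update y in place so that y = alpha * x + y; return y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        y[i] += alpha * x[i]      # one multiplication and one addition
    return y
```

Each element costs two floating-point operations and touches two memory locations, so applications dominated by this kind of loop tend to be limited by memory traffic in the same way the n = 100 benchmark is.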
While we make every attempt to verify the results
obtained from users and vendors, errors are bound to exist and should be
brought to our attention. We encourage users to obtain the programs and run the
routines on their machines, reporting any discrepancies with the numbers listed
here.
The Linpack package is a collection of Fortran subroutines for solving various systems of linear
equations. (http://www.netlib.org/Linpack/) The software in Linpack is based on
a decompositional approach to numerical linear
algebra. The general idea is the following. Given a problem involving a matrix,
one factors or decomposes the matrix into a product of simple, well-structured
matrices which can be easily manipulated to solve the original problem. The
package has the capability of handling many different matrix types and
different data types, and provides a range of options. Linpack itself is built
on another package called the BLAS. Linpack was designed in the late 70's and
has been superseded by a package called LAPACK.
The Linpack software library is available from netlib. See http://www.netlib.org/Linpack/
The BLAS (Basic Linear Algebra Subprograms) are high quality "building
block" routines for performing basic vector and matrix operations. Level 1
BLAS do vector-vector operations, Level 2 BLAS do matrix-vector operations, and
Level 3 BLAS do matrix-matrix operations. Because the BLAS are efficient,
portable, and widely available, they're commonly used in the development of
high quality linear algebra software, LINPACK and LAPACK for example. For
additional information see: http://www.netlib.org/blas/
The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide
portable performance for the BLAS routines. At present, it provides C and Fortran77
interfaces to a portably efficient BLAS implementation, as well as a few
routines from LAPACK. For additional information see: http://www.netlib.org/atlas/
Linpack is not the most efficient software for
solving matrix problems. This is mainly due to the way the algorithm and
resulting software access memory. The
memory access patterns of the algorithm disregard the multi-layered memory
hierarchies of RISC architectures and vector computers, so too
much time is spent moving data instead of doing useful floating-point operations. LAPACK
addresses this problem by reorganizing the algorithms to use block matrix
operations, such as matrix multiplication, in the innermost loops. For each computer architecture, block operations can be
optimized to account for memory hierarchies, providing a transportable way to
achieve high efficiency on diverse modern machines. We use the term
“transportable” instead of “portable” because, for fastest possible
performance, LAPACK requires that highly optimized block matrix operations be
already implemented on each machine. These operations are performed by the
Level 3 BLAS in most cases.
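To show what a block matrix operation looks like, here is a sketch of a blocked matrix multiplication. It illustrates the loop structure only: the function name `matmul_blocked` and the block size parameter `bs` are hypothetical, and interpreted Python will not show the cache benefit that an optimized Level 3 BLAS delivers.

```python
def matmul_blocked(a, b, bs=2):
    """Return C = A * B, computed one bs-by-bs block at a time."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):              # block row of C
        for kk in range(0, m, bs):          # block along the inner dimension
            for jj in range(0, p, bs):      # block column of C
                # Multiply one small block of A by one small block of B.
                # On real hardware these blocks are sized to stay in cache.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            c[i][j] += aik * b[k][j]
    return c
```

The arithmetic is identical to an unblocked multiplication; only the order in which the entries are visited changes, so each block is reused several times while it is still in fast memory.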
LAPACK is a software collection for solving various
matrix problems in linear algebra, in particular systems of linear equations, least squares problems,
eigenvalue problems, and singular value decomposition. The software is based on
the use of block partitioned matrix techniques that aid in achieving high
performance on RISC based systems, vector computers, and shared memory parallel
processors.
LAPACK can be obtained from netlib;
see http://www.netlib.org/lapack/
The Linpack Benchmark is, in some sense, an
accident. It was originally designed to assist users of the Linpack package by
providing information on execution times required to solve a system of linear
equations. The first “Linpack Benchmark” report appeared as an appendix in
the Linpack Users' Guide in 1979. The appendix comprised data for one commonly
used path in Linpack for a matrix problem of size 100, on a collection of
widely used computers (23 in all), so users could estimate the time required to
solve their matrix problem.
Over the years other data was added, more as a
hobby than anything else, and today the collection includes hundreds of
different computer systems.
You can contact Jack Dongarra and send him the
output from the benchmark program. When sending results please include the
specific information on the computer on which the test was run, the compiler,
the optimization that was used, and the site it was run on. You can contact
Dongarra by sending email to dongarra@cs.utk.edu.
In order to run the benchmark program you will
have to supply a function to gather the execution time on your computer. The
execution time is requested by a call to the Fortran
function SECOND. It is expected that the routine returns the accumulated
execution time of your program. Two calls to SECOND are
made and the difference taken to compute the execution time.
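When a single run is too short for the clock to resolve, one common technique is to repeat the work until the accumulated time is comfortably measurable and then divide. The sketch below illustrates that idea in Python; it is not part of the benchmark program, and `time_repeated` and `min_seconds` are hypothetical names.

```python
import time


def time_repeated(work, min_seconds=0.1):
    """Return the average CPU seconds per call of work().

    The repetition count is doubled until the total measured time
    reaches min_seconds, so clock granularity has little effect.
    """
    reps = 1
    while True:
        t0 = time.process_time()
        for _ in range(reps):
            work()
        elapsed = time.process_time() - t0
        if elapsed >= min_seconds:
            return elapsed / reps
        reps *= 2
```

Note that the ground rules for the n = 100 benchmark do not allow changes to the benchmark source, so a wrapper like this is only appropriate for your own experiments.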
The Performance API (PAPI)
project specifies a standard application programming interface (API) for
accessing hardware performance counters available on most modern microprocessors.
These counters exist as a small set of registers that count Events, occurrences
of specific signals related to the processor's function. Monitoring these
events facilitates correlation between the structure of source/object code and
the efficiency of the mapping of that code to the underlying architecture.
For additional information see:
http://icl.cs.utk.edu/projects/papi/
The results reported in the benchmark report
reflect performance for 64 bit floating point arithmetic. On some machines this
may be DOUBLE PRECISION, such as computers that have IEEE floating point
arithmetic, and on other computers this may be single precision (declared REAL
in Fortran), such as Cray’s vector computers.
When and how often are the results updated in
the benchmark report?
The benchmark report is updated continuously as
new results arrive. They are posted to the web as they are updated.
The matrices are generated using a pseudo-random
number generator. The matrices are designed to force partial pivoting to be
performed in Gaussian Elimination.
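A test problem of this general kind can be sketched as follows. The generator, seed, and value range here are assumptions for illustration and do not reproduce the benchmark's own matrix; the one property carried over is that the right-hand side is formed from the row sums, so the exact solution is a vector of ones.

```python
import random


def make_problem(n, seed=1325):
    """Return (a, b): a pseudo-random n-by-n matrix and its row sums.

    seed and the range (-2, 2) are illustrative choices only.
    """
    rng = random.Random(seed)     # fixed seed gives a reproducible matrix
    a = [[rng.uniform(-2.0, 2.0) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in a]   # A * ones = b, so the solution is all ones
    return a, b
```

Because the entries have no special structure, the largest entry in each column is rarely on the diagonal, which is what forces row interchanges during elimination.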
The Top500 lists the 500 fastest computer systems in use today. The collection was started in June 1993
and has been updated twice a year since then. The report lists the sites that
have the 500 most powerful computer systems installed. The best Linpack
benchmark performance achieved is used as the performance measure in ranking the
computers.
The Top500 reports are maintained at http://www.top500.org/.
Optimized software is available that
many people use to generate the Top500 performance results. This benchmark attempts to measure the best
performance of a machine in solving a system of equations. The problem size and
software can be chosen to produce the best performance. A copy of that software
can be downloaded from:
http://www.netlib.org/benchmark/hpl/
In order to run this you will need MPI and an
optimized version of the BLAS. For MPI you can see: http://www-unix.mcs.anl.gov/mpi/mpich/download.html
and for the BLAS see: http://www.netlib.org/atlas/
We are starting a new list on clusters. For more information see http://clusters.top500.org/.
When the Linpack Fortran
n = 100 benchmark is run it produces the following kind of results:

    Please send the results of this run to:
    Jack J. Dongarra
    Computer Science Department
    Fax: 865-974-8296
    Internet: dongarra@cs.utk.edu

       norm. resid       resid          machep           x(1)            x(n)
    1.67005097E+00  7.41628980E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order 100
        dgefa      dgesl      total     mflops       unit      ratio
    times for array with leading dimension of 201
    1.540E-03  6.888E-05  1.609E-03  4.268E+02  4.686E-03  2.873E-02
    1.509E-03  7.084E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
    1.509E-03  7.003E-05  1.579E-03  4.348E+02  4.600E-03  2.820E-02
    1.502E-03  6.593E-05  1.568E-03  4.380E+02  4.567E-03  2.800E-02
    times for array with leading dimension of 200
    1.431E-03  6.716E-05  1.498E-03  4.584E+02  4.363E-03  2.675E-02
    1.424E-03  6.694E-05  1.491E-03  4.605E+02  4.343E-03  2.663E-02
    1.431E-03  6.699E-05  1.498E-03  4.583E+02  4.364E-03  2.676E-02
    1.432E-03  6.439E-05  1.497E-03  4.588E+02  4.360E-03  2.673E-02

The norm. resid is a measure of the accuracy
of the computation. The value should be O(1). If the
value is much greater than O(100) it suggests that the
results are not correct.
The resid is the unnormalized quantity.
The term machep
measures the precision used to carry out the computation. On an IEEE floating
point computer the value should be 2.22044605e-16.
The values of x(1) and
x(n) are the first and last components of the solution. The problem is
constructed so that the components of the solution should all be ones.
There are two sets of timings, both performed on
matrices of size 100. In the first set the 2-dimensional array that
contains the matrix has a leading dimension of 201, and in the second set the
leading dimension is 200. This is done to see what effect, if any, the placement
of the arrays in memory has on the performance.
Times for dgefa and dgesl are reported. dgefa
factors the matrix using Gaussian
elimination with partial pivoting and dgesl
solves a system based on the factorization. dgefa requires about $\frac{2}{3}n^3$
operations and dgesl requires on the order of $n^2$
operations. The value of total is the sum of the times and mflops
is the execution rate, in millions of floating point operations per second.
Here a floating point operation is taken to be
a floating point addition or multiplication. Unit and ratio are obsolete and
should be ignored.
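The mflops figure can be reproduced from the total time with a short calculation. The sketch below assumes the operation count $\frac{2}{3}n^3 + 2n^2$ for factorization plus solve; that exact count is an assumption here, though it does reproduce the sample figures shown earlier. The function name `linpack_mflops` is hypothetical.

```python
def linpack_mflops(n, seconds):
    """Return the execution rate in Mflop/s for an n-by-n solve.

    Assumes 2/3*n**3 + 2*n**2 floating-point operations in total.
    """
    if seconds <= 0.0:
        raise ValueError("time must be positive; the clock may be too coarse")
    ops = 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
    return ops / seconds / 1.0e6
```

For the first sample row, `linpack_mflops(100, 1.609e-03)` gives roughly 427, in line with the 4.268E+02 reported by the program (the small difference comes from the rounding of the printed time).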
If the time reported is negative or zero then
the clock resolution is not accurate enough for the granularity of the work. In
this case a different timing routine should be used that has better resolution.
No archive is maintained of previous results.
However, here is some information to provide a historical perspective. The numbers in the following tables have been
extracted from old Linpack Benchmark Reports.
It took a bit of “file archaeology” to put the list together since I
don't have the complete set of reports.
Top Computers Over Time for the Linpack n=100
Benchmark
(Entries for this
table began in 1979.)
| Year | Computer | Number of Processors | Cycle time in nsecs | Mflop/s |
|------|----------|----------------------|---------------------|---------|
| 2001 | Fujitsu VPP5000/1 | 1 | 3.33 | 1156 |
| 2000 | Fujitsu VPP5000/1 | 1 | 3.33 | 1156 |
| 1999 | CRAY T916 | 4 | 2.2 | 1129 |
| 1995 | CRAY T916 | 1 | 2.2 | 522 |
| 1994 | CRAY C90 | 16 | 4.2 | 479 |
| 1993 | CRAY C90 | 16 | 4.2 | 479 |
| 1992 | CRAY C90 | 16 | 4.2 | 479 |
| 1991 | CRAY C90 | 16 | 4.2 | 403 |
| 1990 | CRAY Y-MP | 8 | 6.0 | 275 |
| 1989 | CRAY Y-MP | 8 | 6.0 | 275 |
| 1988 | CRAY Y-MP | 1 | 6.0 | 74 |
| 1987 | ETA 10-E | 1 | 10.5 | 52 |
| 1986 | NEC SX-2 | 1 | 6.0 | 46 |
| 1985 | NEC SX-2 | 1 | 6.0 | 46 |
| 1984 | CRAY X-MP | 1 | 9.5 | 21 |
| 1983 | CRAY 1 | 1 | 12.5 | 12 |
| ... | | | | |
| 1979 | CRAY 1 | 1 | 12.5 | 3.4 |
These numbers come from the Linpack Benchmark
Report Table 1.
=====================================================================
Top Computers Over Time for the Linpack n=1000
Benchmark
(Entries for this
table began in 1986.)
| Year | Computer | Number of Processors | Cycle time in nsec. | Measured Mflop/s | Peak Mflop/s |
|------|----------|----------------------|---------------------|------------------|--------------|
| 2000 | NEC SX-5/16 | 16 | 4.0 | 45030 | 64000 |
| 1995 | CRAY T916 | 16 | 2.2 | 1940 | 28800 |
| 1994 | | 4 | 2 | 16170 | 32000 |
| 1993 | NEC SX-3/44R | 4 | 2.5 | 15120 | 25600 |
| 1992 | NEC SX-3/44 | 4 | 2.9 | 13420 | 22000 |
| 1991 | Fujitsu VP2600/10 | 1 | 3.2 | 4009 | 5000 |
| 1990 | Fujitsu VP2600/10 | 1 | 3.2 | 2919 | 5000 |
| 1989 | CRAY Y-MP/832 | 8 | 6 | 2144 | 2667 |
| 1988 | CRAY Y-MP/832 | 8 | 6 | 2144 | 2667 |
| 1987 | NEC SX-2 | 1 | 6 | 885 | 1300 |
| 1986 | CRAY X-MP-4 | 4 | 9.5 | 713 | 840 |
These numbers come from the Linpack Benchmark
Report Table 1.
(Full precision; matrix size 1000; best effort
programming, maximum optimization permitted.)
Top Computers Over Time
for the Highly-Parallel Linpack Benchmark
(Entries for this
table began in 1991.)
| Year | Computer | Number of Processors | Measured Gflop/s | Size of Problem | Size of 1/2 Perf | Theoretical Peak Gflop/s |
|------|----------|----------------------|------------------|-----------------|------------------|--------------------------|
| 2001 | ASCI White-Pacific, IBM SP Power 3 | 7424 | 7226 | 518096 | 179000 | 11136 |
| 2000 | ASCI White-Pacific, IBM SP Power 3 | 7424 | 4938 | 430000 | | 11136 |
| 1999 | ASCI Red Intel Pentium II Xeon core | 9632 | 2379 | 362880 | 75400 | 3207 |
| 1998 | ASCI Blue-Pacific SST, IBM SP 604E | 5808 | 2144 | 431344 | | 3868 |
| 1997 | Intel ASCI Option Red (200 MHz Pentium Pro) | 9152 | 1338 | 235000 | 63000 | 1830 |
| 1996 | | 2048 | 368.2 | 103680 | 30720 | 614 |
| 1995 | Intel Paragon XP/S MP | 6768 | 281.1 | 128600 | 25700 | 338 |
| 1994 | Intel Paragon XP/S MP | 6768 | 281.1 | 128600 | 25700 | 338 |
| 1993 | Fujitsu NWT | 140 | 124.5 | 31920 | 11950 | 236 |
| 1992 | NEC SX-3/44 | 4 | 20.0 | 6144 | 832 | 22 |
| 1991 | Fujitsu VP2600/10 | 1 | 4.0 | 1000 | 200 | 5 |
These numbers come from the Linpack Benchmark
Report Table 3.
(Full precision; the manufacturer is allowed to
solve as large a problem as desired, maximum optimization permitted.)
Measured Gflop/s is the measured peak rate of
execution for running the benchmark in billions of floating point operations
per second.
Size of Problem is the matrix size at which the
measured performance was observed.
Size of ½ Perf is the
size of problem needed to achieve ½ the measured peak performance.
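One simple way to read these columns is to compare the measured rate with the theoretical peak. The sketch below does that arithmetic; the function name `fraction_of_peak` is hypothetical and the calculation is an illustration, not something the report itself tabulates.

```python
def fraction_of_peak(measured, peak):
    """Return measured / peak; both rates must be in the same units."""
    if peak <= 0:
        raise ValueError("peak rate must be positive")
    return measured / peak
```

For the 2001 entry in the table (7226 Gflop/s measured against a theoretical peak of 11136 Gflop/s), `fraction_of_peak(7226, 11136)` is about 0.65, meaning the benchmark reached roughly 65% of peak.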
The Linpack Benchmark suite is built around
software for dense matrix problems. In May 2000 we started to put together a
benchmark for sparse iterative matrix problems. For additional information see:
http://www.netlib.org/benchmark/sparsebench/
For additional information on benchmarks see: http://www.netlib.org/benchweb/
Please send your comments to Jack Dongarra at dongarra@cs.utk.edu.