Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!howland.erols.net!newsfeed.skycache.com!Cidera!news-hog.berkeley.edu!ucberkeley!cnn.nas.nasa.gov!marcy.nas.nasa.gov!eugene
From: eugene@marcy.nas.nasa.gov (Eugene N. Miya)
Newsgroups: comp.benchmarks
Subject: [l/m 3/10/94] Linpack (9/28) c.be FAQ
Date: 9 Sep 2000 12:25:00 GMT
Organization: NASA Ames Research Center, Moffett Field, CA
Lines: 147
Distribution: world
Message-ID: <8pda6s$j39$1@sun500.nas.nasa.gov>
Reply-To: eugene@amelia.nas.nasa.gov (Eugene N. Miya)
NNTP-Posting-Host: marcy.nas.nasa.gov
Keywords: who, what, where, when, why, how
Xref: senator-bedfellow.mit.edu comp.benchmarks:29898

 1  Introduction to FAQ chain and netiquette
 2  Benchmarking Concepts
 3  PERFECT
 4
 5  Performance Metrics
 6  Temporary scaffold of New FAQ material
 7  Music to benchmark by
 8  Benchmark types
 9  Linpack
10  Network Performance
11  NIST source and .orgs
12  Measurement Environments
13  SLALOM
14
15  12 Ways to Fool the Masses with Benchmarks
16  SPEC
17  Benchmark invalidation methods
18
19  WPI Benchmark
20  Equivalence
21  TPC
22
23
24
25  Ridiculously short benchmarks
26  Other miscellaneous benchmarks
27
28  References

With great help from Patrick McGehearty, and suggestions from Brad Carlile
on explaining compute intensity.

The LINPACK benchmark is a very simple LU decomposition of a dense linear
system (Gaussian elimination) by Jack Dongarra, one of the developers of
the LINPACK library and of the netlib numerical software server.
Ref: Dongarra's article in CACM on netlib, and reports in ACM SIGARCH
Computer Architecture News and the SIGNUM Newsletter.

It consists of three parts:

100x100 ("LINPACK Benchmark"): all Fortran, no changes allowed; an old
algorithm that has low compute intensity and makes poor use of memory
bandwidth.

1000x1000 ("TPP", best effort): no limits on algorithm selection, or on
the use of assembly language to improve performance.  Best implementations
currently use blocked solvers that make efficient use of memory with high
compute intensity.
LAPACK offers examples of this type of solver.

NxN ("A Look at Parallel Processing", problem size = NxN with N selected
by the vendor): best implementations use high-compute-intensity algorithms
scaled to a size where interprocessor communication cost is minimal
compared to the computation.

The term compute intensity is defined by Hockney & Jesshope as:

	compute intensity = operations / word

A complete LU solve has a compute intensity of:

	(2/3 * N**3 operations) / (2 * N**2 words)  =  N/3  (about 0.3333*N)

This sounds wonderful: even the 100x100 Linpack has a compute intensity of
33.  However, the rules say that you can only optimize the Fortran
provided, and it was written with BLAS 1 kernels (DAXPY).  DAXPY has a
compute intensity of 2/3 (two operations per three memory references) no
matter what the size of the matrix, so it requires a lot of memory
bandwidth to get any performance.  The Linpack 1000, with no limits on
algorithm, means that everyone uses a LAPACK solver based on the BLAS 3
kernels (DGEMM).  These have a compute intensity equal to the blocking
factor used in the algorithm.  Most vendors understand this, but most
users don't realize that this is the true limiting factor for Linpack.

Advantages:

Simple, fairly portable Fortran.  One of the shorter benchmarks; the
source is small enough to be carried on disk or in Jack's laptop without
consuming too much porting time.

A good attempt at experiment control, with stringent execution
requirements.  Dongarra also records the compiler options used to invoke
the Fortran compilers.  Record keeping is good: reports are quickly
available electronically and published with some frequency in
Supercomputing Review.

The 100x100 case represents a well-defined type of floating point
computation.  The 1000x1000 case allows vendors to showcase their
product's potential if they are so inclined.
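The compute-intensity arithmetic above can be checked in a few lines.
This is only an illustrative sketch (the function names are made up here,
and Python stands in for the benchmark's Fortran), not benchmark code:

```python
def daxpy(alpha, x, y):
    """BLAS 1 kernel shape: y <- alpha*x + y.  Per element this does
    2 flops (one multiply, one add) against 3 memory references
    (load x[i], load y[i], store y[i]), so its compute intensity is
    2/3 regardless of vector length."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def lu_compute_intensity(n):
    """Compute intensity of a full LU solve on an n x n matrix:
    (2/3)*n**3 operations over 2*n**2 words, which simplifies to n/3."""
    return ((2.0 / 3.0) * n ** 3) / (2.0 * n ** 2)
```

For N = 100 this gives the intensity of 33 quoted above, while the DAXPY
inner loop that the rules force on the 100x100 run stays pinned at 2/3.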
The third problem set is intended for use by vendors of highly parallel
systems which find even the 1000x1000 problem set too small when spread
over hundreds or thousands of processors.  In this case the vendor selects
N and demonstrates the asymptotic effective rate of their highly parallel
machine.

Disadvantages:

Diminishing parallelism during the decomposition (as in all Gaussian
elimination).  It only tests some numeric aspects of a system, on data
with well-defined behavior patterns.

The 100x100 problem set is quite small by today's standards, and can have
problems with accurate measurement on those machines which do not offer
sub-millisecond timer resolution.  The 100x100 problem set is also too
small to show the performance potential of machines with high startup
costs, such as massively parallel or parallel-vector architectures.  It
can also fit entirely in a machine with a large cache, failing to measure
the cache miss behavior of a slightly larger problem.  Finally, the
algorithm used by the all-Fortran code is suboptimal for machines which
can do significantly more floating point operations than memory-to-register
transfers.

The 1000x1000 problem set is intended to address these concerns.  Each
machine vendor is allowed to use whatever algorithm they choose, including
assembly language if they desire.  By changing algorithms and increasing
the problem size, many vendors are able to demonstrate the full potential
of their machines on the 1000x1000 problem set.  Generating true "best
effort" results is not free, and vendors which do not put a high priority
on floating point performance, or which do not expect a significant
improvement from 100x100 to 1000x1000, may not report results for the
1000x1000 problem set.

NETLIB benchmark index (Linpack benchmark):
	mail netlib@ornl.gov
	send index from benchmark
The benchmark directory includes linpack (100x100, 300x300, 1000x1000).
The entries of the report change drastically with time.
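The diminishing parallelism noted under Disadvantages is easy to see in a
toy unblocked factorization.  This sketch (no pivoting, Python rather than
Fortran, and emphatically not the actual Linpack source) shows the kernel
shape: the trailing submatrix updated at step k shrinks as k grows:

```python
def lu_inplace(a):
    """Factor a dense n x n matrix (list of lists of floats) in place,
    leaving the multipliers of L in the strict lower triangle and U in
    the upper triangle.  Total work is roughly (2/3)*n**3 operations."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]      # column of multipliers (DAXPY-like)
            for j in range(k + 1, n):
                # Rank-1 update of the trailing (n-k-1) x (n-k-1)
                # submatrix: the available parallel work shrinks at
                # every step k, as in all Gaussian elimination.
                a[i][j] -= a[i][k] * a[k][j]
    return a
```

A blocked (BLAS 3) solver reorganizes exactly these updates into
matrix-matrix multiplies to raise the compute intensity.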
Anyone interested in floating point performance should get a new copy from
netlib from time to time.

Also of interest about these sizes: they are not the powers of 2 which
characterize many benchmarks.  Powers of 2 can bias in favor of some
architectures and against other architectures.

               ^
              / \
           A /   \ s
          r /     \ m
         c /       \ h
        h /         \ t
       i /           \ i
      t /             \ r
     e /               \ o
    c /                 \ g
   t /                   \ l
  u /                     \ A
 r /                       \
  <_________________________>
 e         Language
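One common source of the power-of-2 bias mentioned above is cache
set-index aliasing: walking down a column of a matrix whose rows are a
power of 2 in bytes can land every access in the same few cache sets.  A
toy model, assuming a direct-mapped cache with hypothetical parameters
(64-byte lines, 256 sets) and 8-byte doubles:

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 256    # assumed number of sets in a direct-mapped cache

def sets_touched(row_bytes, n=256):
    """Count the distinct cache sets hit by n consecutive elements of a
    matrix column, i.e. addresses 0, row_bytes, 2*row_bytes, ..."""
    return len({(i * row_bytes) // LINE_BYTES % NUM_SETS for i in range(n)})
```

With rows of 1024 doubles (8192 bytes) the column walk ping-pongs between
just 2 of the 256 sets, thrashing them; padding the row to 1025 doubles
spreads the same walk across 64 sets.  Real caches differ, but the shape
of the effect is why non-power-of-2 sizes like 100 and 1000 are less
likely to reward or punish a particular memory system by accident.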