Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!howland.erols.net!newsfeed.skycache.com!Cidera!news-hog.berkeley.edu!ucberkeley!cnn.nas.nasa.gov!marcy.nas.nasa.gov!eugene
From: eugene@marcy.nas.nasa.gov (Eugene N. Miya)
Newsgroups: comp.benchmarks
Subject: [l/m 3/10/94] Linpack (9/28) c.be FAQ
Date: 9 Sep 2000 12:25:00 GMT
Organization: NASA Ames Research Center, Moffett Field, CA
Lines: 147
Distribution: world
Message-ID: <8pda6s$j39$1@sun500.nas.nasa.gov>
Reply-To: eugene@amelia.nas.nasa.gov (Eugene N. Miya)
NNTP-Posting-Host: marcy.nas.nasa.gov
Keywords: who, what, where, when, why, how
Xref: senator-bedfellow.mit.edu comp.benchmarks:29898

 1  Introduction to FAQ chain and netiquette
 2  Benchmarking Concepts
 3  PERFECT
 4
 5  Performance Metrics
 6  Temporary scaffold of New FAQ material
 7  Music to benchmark by
 8  Benchmark types
 9  Linpack
10  Network Performance
11  NIST source and .orgs
12  Measurement Environments
13  SLALOM
14
15  12 Ways to Fool the Masses with Benchmarks
16  SPEC
17  Benchmark invalidation methods
18
19  WPI Benchmark
20  Equivalence
21  TPC
22
23
24
25  Ridiculously short benchmarks
26  Other miscellaneous benchmarks
27
28  References

With great help from Patrick McGehearty, and suggestions from Brad Carlile
on explaining compute intensity.

The LINPACK benchmark is a very simple LU decomposition of a dense linear
system (Gaussian elimination) by Jack Dongarra, one of the developers of
the LINPACK library and of the netlib numerical software server.
Ref: Dongarra's article in CACM on netlib, and reports in ACM SIGARCH
Computer Architecture News and the SIGNUM Newsletter.

It consists of three parts:

100x100 ("LINPACK Benchmark"): all Fortran, no changes allowed; an old
algorithm that has low compute intensity and makes poor use of memory
bandwidth.

1000x1000 ("TPP", best effort): no limits on algorithm selection, or on
the use of assembly language to improve performance.  Best implementations
currently use blocked solvers that make efficient use of memory with high
compute intensity.
LAPACK offers examples of this type of solver.

NxN ("A Look at Parallel Processing", problem size = NxN with N selected
by the vendor): best implementations use high-compute-intensity algorithms
scaled to a size where interprocessor communication cost is minimal
compared to the computation.

The term compute intensity is defined by Hockney & Jesshope as:

	compute intensity = operations / word

A complete LU solve has a compute intensity of:

	(2/3 * N**3 operations) / (2 * N**2 words)  =  N/3  (about 0.3333*N)

This sounds wonderful: even the 100x100 Linpack has a compute intensity of
33.  However, the rules say that you can only optimize the Fortran
provided, and it was written with BLAS 1 kernels (DAXPY).  DAXPY has a
compute intensity of 2/3 (two operations per three memory references) no
matter what the size of the matrix, so it requires a lot of memory
bandwidth to get any performance.  The Linpack 1000, with no limits on
algorithm, means that everyone uses a LAPACK solver based on the BLAS 3
kernels (DGEMM).  These have a compute intensity equal to the blocking
factor used in the algorithm.  Most vendors understand this, but most
users don't realize that this is the true limiting factor for Linpack.

Advantages:

Simple, fairly portable Fortran.  One of the shorter benchmarks; the
source is small enough to be carried on disk or in Jack's laptop without
consuming too much porting time.

A good attempt at experiment control, with stringent execution
requirements.  Dongarra also records the compiler options used to invoke
the Fortran compilers.  Record keeping is good: reports are quickly
available electronically and published with some frequency in
Supercomputing Review.

The 100x100 case represents a well-defined type of floating point
computation.  The 1000x1000 case allows vendors to showcase their
product's potential if they are so inclined.
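The compute-intensity arithmetic above can be checked in a few lines.
This is only an illustrative sketch (the function names are made up here,
and Python stands in for the benchmark's Fortran), not benchmark code:

```python
def daxpy(alpha, x, y):
    """BLAS 1 kernel shape: y <- alpha*x + y.  Per element this does
    2 flops (one multiply, one add) against 3 memory references
    (load x[i], load y[i], store y[i]), so its compute intensity is
    2/3 regardless of vector length."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def lu_compute_intensity(n):
    """Compute intensity of a full LU solve on an n x n matrix:
    (2/3)*n**3 operations over 2*n**2 words, which simplifies to n/3."""
    return ((2.0 / 3.0) * n ** 3) / (2.0 * n ** 2)
```

For N = 100 this gives the intensity of 33 quoted above, while the DAXPY
inner loop that the rules force on the 100x100 run stays pinned at 2/3.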
The third problem set is intended for use by vendors of highly parallel
systems which find even the 1000x1000 problem set too small when spread
over hundreds or thousands of processors.  In this case the vendor selects
N and demonstrates the asymptotic effective rate of their highly parallel
machine.

Disadvantages:

Diminishing parallelism during the decomposition (as in all Gaussian
elimination).  It only tests some numeric aspects of a system, on data
with well-defined behavior patterns.

The 100x100 problem set is quite small by today's standards, and can have
problems with accurate measurement on those machines which do not offer
sub-millisecond timer resolution.  The 100x100 problem set is also too
small to show the performance potential of machines with high startup
costs, such as massively parallel or parallel-vector architectures.  It
can also fit entirely in a machine with a large cache, failing to measure
the cache miss behavior of a slightly larger problem.  Finally, the
algorithm used by the all-Fortran code is suboptimal for machines which
can do significantly more floating point operations than memory-to-register
transfers.

The 1000x1000 problem set is intended to address these concerns.  Each
machine vendor is allowed to use whatever algorithm they choose, including
assembly language if they desire.  By changing algorithms and increasing
the problem size, many vendors are able to demonstrate the full potential
of their machines on the 1000x1000 problem set.  Generating true "best
effort" results is not free, and vendors which do not put a high priority
on floating point performance, or which do not expect a significant
improvement from 100x100 to 1000x1000, may not report results for the
1000x1000 problem set.

NETLIB benchmark index (Linpack benchmark):
	mail netlib@ornl.gov
	send index from benchmark
The benchmark directory includes linpack (100x100, 300x300, 1000x1000).
The entries of the report change drastically with time.
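The diminishing parallelism noted under Disadvantages is easy to see in a
toy unblocked factorization.  This sketch (no pivoting, Python rather than
Fortran, and emphatically not the actual Linpack source) shows the kernel
shape: the trailing submatrix updated at step k shrinks as k grows:

```python
def lu_inplace(a):
    """Factor a dense n x n matrix (list of lists of floats) in place,
    leaving the multipliers of L in the strict lower triangle and U in
    the upper triangle.  Total work is roughly (2/3)*n**3 operations."""
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]      # column of multipliers (DAXPY-like)
            for j in range(k + 1, n):
                # Rank-1 update of the trailing (n-k-1) x (n-k-1)
                # submatrix: the available parallel work shrinks at
                # every step k, as in all Gaussian elimination.
                a[i][j] -= a[i][k] * a[k][j]
    return a
```

A blocked (BLAS 3) solver reorganizes exactly these updates into
matrix-matrix multiplies to raise the compute intensity.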
Anyone interested in floating point performance should get a new copy from
netlib from time to time.

Also of interest about these sizes: they are not the powers of 2 which
characterize many benchmarks.  Powers of 2 can bias in favor of some
architectures and against other architectures.

               ^
              / \
           A /   \ s
          r /     \ m
         c /       \ h
        h /         \ t
       i /           \ i
      t /             \ r
     e /               \ o
    c /                 \ g
   t /                   \ l
  u /                     \ A
 r /                       \
  <_________________________>
 e         Language
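One common source of the power-of-2 bias mentioned above is cache
set-index aliasing: walking down a column of a matrix whose rows are a
power of 2 in bytes can land every access in the same few cache sets.  A
toy model, assuming a direct-mapped cache with hypothetical parameters
(64-byte lines, 256 sets) and 8-byte doubles:

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 256    # assumed number of sets in a direct-mapped cache

def sets_touched(row_bytes, n=256):
    """Count the distinct cache sets hit by n consecutive elements of a
    matrix column, i.e. addresses 0, row_bytes, 2*row_bytes, ..."""
    return len({(i * row_bytes) // LINE_BYTES % NUM_SETS for i in range(n)})
```

With rows of 1024 doubles (8192 bytes) the column walk ping-pongs between
just 2 of the 256 sets, thrashing them; padding the row to 1025 doubles
spreads the same walk across 64 sets.  Real caches differ, but the shape
of the effect is why non-power-of-2 sizes like 100 and 1000 are less
likely to reward or punish a particular memory system by accident.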