Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!newsfeed.utk.edu!news-hog.berkeley.edu!ucberkeley!cnn.nas.nasa.gov!marcy.nas.nasa.gov!eugene
From: eugene@marcy.nas.nasa.gov (Eugene N. Miya)
Newsgroups: comp.benchmarks
Subject: [l/m 7/2/98] Performance metrics (5/28) c.be FAQ
Date: 5 Sep 2000 12:25:00 GMT
Organization: NASA Ames Research Center, Moffett Field, CA
Lines: 281
Distribution: world
Message-ID: <8p2oms$ppf$1@sun500.nas.nasa.gov>
Reply-To: eugene@amelia.nas.nasa.gov (Eugene N. Miya)
NNTP-Posting-Host: marcy.nas.nasa.gov
Keywords: who, what, where, when, why, how
Xref: senator-bedfellow.mit.edu comp.benchmarks:29891

 1 Introduction to FAQ chain and netiquette
 2
 3 PERFECT
 4 Performance/benchmark metric terminology
 5 Performance Metrics			<= this panel
 6 Temporary scaffold of New FAQ material
 7 Music to benchmark by
 8 Benchmark types
 9 Linpack
10 Network Performance
11 NIST source and .orgs
12 Benchmark Environments
13 SLALOM
14
15 12 Ways to Fool the Masses with Benchmarks
16 SPEC
17 Benchmark invalidation methods
18
19 WPI Benchmark
20 Equivalence
21 TPC
22
23
24
25 Ridiculously short benchmarks
26 Other miscellaneous benchmarks
27
28 References

The usual important quote is:
	What's important is the time it takes to solve MY problem.
This does not help the architect designing the next machine.  It is an
arrogant, closed-minded, Gestaltist statement which conflicts with the
analytic/reductionist needs of science.  Synthetic problems/benchmarks
have some, if limited, value.  We walk before we run, and we crawl
before we walk.  Similarly, right now, there is more benchmarking noise
than signal.

Perhaps the only, and certainly the best, measure is the second (time):
for one of the best-studied metrics, see the atomic clocks of the NIST.
It is subject to relativistic effects: the Lorentz time contraction.
Don't laugh: this is becoming more important at the pico-second level.

Less reliable measures include:

MIP, GIP, TIP; MIPS, GIPS, TIPS:
	Million (giga, billions; tera, trillions) Instructions Per Second.
	: Meaningless Indicator of Performance
	: "Marketing's" Indicator of Performance
	What's an "instruction?"  An instruction is an event.  It is
	frequently a minute change in the state of a CPU (and the
	computer).  Frequently, an instruction is synonymous with the
	clock rate of a machine; that ignores instructions requiring more
	than one clock tick to execute.
	A common fallacy among naive benchmarkers is that the CPU
	determines the speed of a computation.  This is frequently false.
	The people in the know these days understand Amdahl's "other law":
		1 MIPS for each 1 MB of main memory at 1 MB/s transfer
		to disk.

MFLOPS, GFLOPS, TFLOPS:
	Million (giga, billions; tera, trillions) Floating-Point
	Operations Per Second.
	: The measure ignores non-floating-point instructions.  It is
	particularly bad for numeric codes transitioning from 2-D to 3-D,
	since additional time is required for array address calculation,
	and for algorithms requiring big non-numeric steps like matrix
	transposition.
	: The original program name for Frank McMahon's Livermore Loops
	program.
	: One of the metrics used by Dongarra's LINPACK benchmark.

LIPS, KLIPS, MLIPS:
	Logical "inferences" Per Second -- from the logic programming
	community (Gabriel LISP benchmarks).  Also available in Prolog
	(Evan Tick).  LIPS roughly correspond to "calls per second" for
	very simple predicates.

Packets Per Second:
	Unit of measure used by the networking/communications community.
	Sometimes useful.
	: What do they do: make consistent packets?

MHz, GHz, bits per second, bytes per second, words per second:
	: Frequently used to mismeasure the performance of computer
	networks like Ethernet (tm).  It confuses the base-band carrier
	frequency with the data transfer rate.  It's not the truth, but
	it's not completely false.
	: Also sometimes called Null or Wait instructions.
TPS:
	Transactions Per Second, the metric agreed on by the Transaction
	Processing Performance Council.
	: What's a transaction?

Stones:
	An arbitrary unit of computation based on the Whetstone (or
	Dhrystone or other *stone) benchmark, which is subject to
	influences like compiler optimization or cache effects.

Normalized metrics:

SPECint92, SPECfp92:
	Normalized metrics based on performance against a DEC VAX-11/780.
	Based on the SPEC integer/floating-point workloads CINT92 and
	CFP92.

SPECmark89:
	A normalized metric based on performance against a DEC VAX-11/780.
	Based on the SPEC Release 1.2b workload (replaced by CINT92 and
	CFP92) on a 780 under glass.

Speed up:

Efficiency:

Our problem isn't counting seconds (intervals or days); it's counting
instructions, operations, floating-point operations.  Event counts like
instructions or operations are best done by non-intrusive
instruction/operation-counting hardware.  Such hardware is expensive,
to say the least.  Software profilers/event counters are also sometimes
useful, but they are subject to optimization.  We need to distinguish
"virtual" operations or instructions from real or actual instructions.

Prefixes: kilo, mega, giga, tera, peta, exa; milli, micro, nano, pico,
femto.

Performance metrics are unlike conventional mathematics.  You can't
make mathematical inferences (excepting "guaranteed not to exceed"
numbers), and you can't apply all mathematical operators.  The basis
for metric theory is that for a metric space X and a metric function
d() which maps pairs of elements of X to the real numbers:
	a) d(A,B) >= 0, and d(A,B) = 0 if and only if A = B
	b) d(A,B) = d(B,A)
	c) d(A,C) <= d(A,B) + d(B,C)	[triangle inequality]

You might have a benchmark sized for 128 elements.  A program might not
test well if it used 127 or 129 elements instead.  It is not possible
to infer or interpolate between values because of benchmarking
"gotchas."
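The size-sensitivity gotcha is easy to demonstrate for yourself.  The
following is a minimal sketch, not any standard benchmark: the kernel
(a strided column walk over an n x n array) and the sizes 127/128/129
are illustrative assumptions, and absolute times depend entirely on
your machine, interpreter, and cache.

```python
# Minimal sketch: time the same kernel at n = 127, 128, 129.
# Kernel and sizes are illustrative; this is not a standard benchmark.
import time

def column_walk(a, n):
    """Sum an n*n array stored row-major, walking down columns
    (stride n) -- an access pattern sensitive to power-of-two sizes."""
    s = 0.0
    for col in range(n):
        for row in range(n):
            s += a[row * n + col]
    return s

def time_once(fn):
    t0 = time.perf_counter()
    fn()
    return time.perf_counter() - t0

def best_of(trials, fn):
    """Report the minimum of several timings: usually the most
    reproducible figure for a short kernel."""
    return min(time_once(fn) for _ in range(trials))

if __name__ == "__main__":
    for n in (127, 128, 129):
        a = [1.0] * (n * n)
        t = best_of(5, lambda: column_walk(a, n))
        print("n = %3d: %8.3f ms" % (n, t * 1e3))
```

Note that the three times need not lie on any smooth curve; that is
the point.  Interpolating a "128-ish" result from the 127 and 129
measurements is exactly the kind of inference the metric discussion
above warns against.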
This is especially bad when dealing with powers of two: an artifact of
computer architecture, but sometimes also due to software (in a base-10
world).  Mathematics derives a large portion of its power from
assumptions of continuity.  Computers are very discrete objects.  What
works for case n might not work for case n-1 or n+1 (vector
architectures, for instance).  Some interesting things are learned by
simply modifying the size of a benchmark by one (remember Kernighan and
Plauger: beware of off-by-1 errors).

Can you even be assured of consistent measures?  Most benchmarks try to
run their tests in standalone conditions to attain consistency.  This
is an artifact of not being able to have a non-intrusive measurement
environment.

Measurement issues:
	1) Reproducibility: first and foremost.  You must be able to
	   reproduce performance.
	2) Accuracy and precision.  Tough because of human limits.
	3) Resolution.  Details sometimes count.
	4) History (memory).
	5) Another important issue: measurement tools and environments.

What are some nice ones?  Simple (non-standard) software tools:
	Several:	'arch' names the architecture
	Cray:		flotrace, hpm (hardware and software, actually),
			others
	SGI/MIPS:	gr_osview; ancillary: hinv (hardware inventory),
			pixie
	Convex:		syspic
	Obsolete:	gprof, prof
(Your names may not vary, but the tools do; watch for name collisions.)

Other useful tools should be reported.  Why?  Because most people do
not get reasonable experience with the various kinds of tools out there
to understand their advantages, drawbacks, etc.

Beware of the graphical tools.  They can deceive you.  All performance
monitoring tools can deceive you.  Use them carefully.

Example of a good/useful tool from a 'Class A' measurement environment.
Sample Cray Research, Inc. Hardware Performance Monitor (HPM) output:

hpm VERSION 1.3 (c) COPYRIGHT CRAY RESEARCH, INC.
UNPUBLISHED -- ALL RIGHTS RESERVED UNDER
THE COPYRIGHT LAWS OF THE UNITED STATES

STOP (called by EMPTY)
CP: 0.001s, Wallclock: 0.038s, 0.2% of 8-CPU Machine
HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 0:  CPU seconds   :      0.00  CP executing     :        197638

Million inst/sec (MIPS) :     44.47  Instructions     :         52730
Avg. clock periods/inst :      3.75
% CP holding issue      :     42.57  CP holding issue :         84134
Inst.buffer fetches/sec :     0.77M  Inst.buf. fetches:           913
Floating adds/sec       :     0.21M  F.P. adds        :           246
Floating multiplies/sec :     0.23M  F.P. multiplies  :           267
Floating reciprocal/sec :     0.05M  F.P. reciprocals :            54
I/O mem. references/sec :     0.22M  I/O references   :           256
CPU mem. references/sec :    14.58M  CPU references   :         17287
Floating ops/CPU second :     0.48M

STOP (called by EMPTY)
CP: 0.001s, Wallclock: 0.002s, 4.2% of 8-CPU Machine
HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 1:  CPU seconds   :   0.00119  CP executing     :        198071

Hold issue condition                  % of all CPs   actual # of CPs
Waiting on semaphores                      0.14             284
Waiting on shared registers                0.00               0
Waiting on A-registers/funct. units        9.35           18520
Waiting on S-registers/funct. units       27.98           55418
Waiting on V-registers                     1.35            2671
Waiting on vector functional units         0.00               9
Waiting on scalar memory references        0.56            1101
Waiting on block memory references         1.86            3685

STOP (called by EMPTY)
CP: 0.001s, Wallclock: 0.002s, 4.4% of 8-CPU Machine
HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 2:  CPU seconds   :   0.00121  CP executing     :        201785

Inst. buffer fetches/sec:     0.75M  total fetches    :           913
                                     fetch conflicts  :          5265
I/O memory refs/sec     :     0.00M  actual refs      :             0
  avg conflict/ref 0.00              actual conflicts :           100
Scalar memory refs/sec  :     5.51M  actual refs      :          6668
Block memory refs/sec   :     8.77M  actual refs      :         10619
CPU memory refs/sec     :    14.28M  actual refs      :         17287
  avg conflict/ref 0.15              actual conflicts :          2668
CPU memory writes/sec   :     8.66M  actual refs      :         10479
CPU memory reads/sec    :     5.62M  actual refs      :          6808

STOP (called by EMPTY)
CP: 0.001s, Wallclock: 0.030s, 0.2% of 8-CPU Machine
HWM mem: 97679, HWM stack: 2048, Stack overflows: 0

Group 3:  CPU seconds   :   0.00119  CP executing     :        198445

(octal) type of instruction        inst./CPUsec  actual inst.  % of all inst.
(000-017) jump/special           :        5.30M          6315          11.98
(020-077) scalar functional unit :       33.24M         39578          75.07
(100-137) scalar memory          :        5.60M          6668          12.65
(140-157,175) vector integer/log.:        0.01M            14           0.03
(160-174) vector floating point  :        0.00M             2           0.00
(176-177) vector load and store  :        0.12M           141           0.27

type of operation          ops/CPUsec  actual ops  avg. VL
Vector integer&logical   :      0.12M         138     9.86
Vector floating point    :      0.19M         232   116.00
Scalar functional unit   :     33.24M       39578

Examples of other Class A environments:
	SSI SS-1 (defunct)
	Tera MTA

=====

In memoriam to Rear Adm. Grace Murray Hopper, for all the
"nano seconds" and "pico seconds" she passed out (30 cm/1 ft copper
wires or salt grains).  She will be missed.

             ^
          A / \ s
         r /   \ m
        c /     \ h
       h /       \ t
      i /         \ i
     t /           \ r
    e /             \ o
   c /               \ g
  t /                 \ l
 u /                   \ A
 r <___________________>
 e        Language