.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m' .ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 An Extended Set of Basic Linear Algebra Subprograms: .sp Model Implementation and Test Programs .ps 11 .sp 3 Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 81 .sp .7i January 1987 .pn 0 .he ''-%-'' .he '''' .bp . .sp 10 .B .ce 1 .ps 11 Abstract .R .ps 10 .sp .5i This paper describes a model implementation and test software for the Level 2 Basic Linear Algebra Subprograms (Level 2 BLAS). The Level 2 BLAS are targeted at matrix-vector operations with the aim of providing more efficient, but portable, implementations of algorithms on high-performance computers. The model implementation provides a portable set of Fortran 77 Level 2 BLAS for machines where specialized implementations do not exist or are not required. The test software aims to verify that specialized implementations meet the specification of the Level 2 BLAS and that implementations are correctly installed. .in .bp .ft 3 .ps 11 .bp .LP .LP .EQ gsize 11 delim $$ .EN .TL .ps 12 .in 0 An Extended Set of .sp Basic Linear Algebra Subprograms: .sp Testing Software and Model Implementations .AU .ps 11 .in 0 Jack J. Dongarra\|$size -1 {"" sup \(dg}$\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' Argonne, Illinois 60439\h'.20i' .AU .ps 11 .in 0 Jeremy Du Croz\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Sven Hammarling\h'.20i' .AI .ps 10 .in 0 Numerical Algorithms Group Ltd.\h'.20i' NAG Central Office, Mayfield House\h'.20i' 256 Banbury Road, Oxford OX2 7DE\h'.20i' .AU .ps 11 .in 0 Richard J. Hanson\|$size -1 {"" sup \(dd}$\h'.15i' .AI .ps 10 .in 0 Applied Mathematics Division 2646 Sandia National Laboratory\h'.20i' Albuquerque, New Mexico 87185\h'.20i' .FS .ps 9 .vs 11p $size -1 {"" sup \(dg}$\|Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38. $size -1 {"" sup \(dd}$\|Work supported by the U.S. Department of Energy. Typeset on \*(DY. .FE .sp 3 .QS .ps 10 .in .25i .ll -.25i Abstract \(em This paper describes a model implementation and test software for the Level 2 Basic Linear Algebra Subprograms (Level 2 BLAS). The Level 2 BLAS are targeted at matrix-vector operations with the aim of providing more efficient, but portable, implementations of algorithms on high-performance computers. The model implementation provides a portable set of Fortran 77 Level 2 BLAS for machines where specialized implementations do not exist or are not required. The test software aims to verify that specialized implementations meet the specification of the Level 2 BLAS and that implementations are correctly installed. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .EQ delim $$ .EN .nr PO 1i .po 1i .ps 10 .nr FO .fo ''%'' .tp .sp 5 .NH Scope of the Algorithm .PP In [3] we have defined the specification of a set of Basic Linear Subprograms for selected matrix-vector operations usually referred to as the "Level 2 BLAS" or "Extended BLAS". They provide a standard framework to develop modular, portable, and efficient Fortran 77 code for many computational problems in linear algebra. Our hope is that specialized implementations of the Level 2 BLAS will be developed for many machines, especially for vector processors and other high-performance computers. Thus programs that call the Level 2 BLAS can be efficient across a wide range of machines. .PP To support and encourage the use of the Level 2 BLAS, this algorithm contains two components of software: .IP (1) A model implementation of the subprograms in Fortran 77. This enables the Level 2 BLAS to be used on any machine, regardless of whether a specialized implementation exists. It is described in Section 2. .sp .IP (2) Test programs, designed to ensure that implementations conform to the specification and have been correctly installed. The programs are described in Section 4. .LP Section 3 contains some advice on developing specialized implementations of the subprograms. Installation notes for the software are given in the Appendix. .NH The Model Implementation .SH 2.1 Programming considerations .PP There are many mathematically equivalent ways to implement the Level 2 BLAS even in standard Fortran 77, as discussed in Section 3. The choice of method for the model implementation has been guided by the following considerations: .sp .IP (1) The elements of the array A are accessed sequentially, column by column. On vector-processing machines this allows the columns of the array to be recognized as contiguous vectors (by the Fortran compiler). On virtual-memory machines it keeps page swaps to a minimum. .sp .IP (2) Special code is included for the commonly occurring cases when the increment parameters (INCX or INCY or both) which are used in the inner loops, are equal to 1. This code can use a simpler, and hence more efficient, Fortran indexing scheme and also allows contiguous vectors to be recognized by the Fortran compiler, for example: .sp .nf .cs 1 24 .KS .ps 8 DO 50, I = 1, M Y(I) = Y(I) + TEMP*A(I,J) 50 CONTINUE .fi .cs 1 .KE .ps 11 .sp instead of: .sp .KS .nf .ps 8 .cs 1 24 IY = KY DO 50, I = 1, M Y(IY) = Y(IY) + TEMP*A(I,J) IY = IY + INCY 50 CONTINUE .fi .cs 1 .ps 11 .sp .KE .IP (3) Provision is made to skip the innermost loop if relevant elements of the vectors X or Y are zero. This can yield a considerable gain in efficiency if the vectors are sparse, for example: .sp .KS .nf .ps 8 .cs 1 24 IF( X(JX).NE.ZERO )THEN TEMP = ALPHA*X(JX) DO 50, I = 1, M Y(I) = Y(I) + TEMP*A(I,J) 50 CONTINUE END IF .fi .cs 1 .ps 11 .sp .KE .SH 2.2 Efficiency .PP The model implementation is likely to achieve considerable efficiency on scalar-processing machines with a good optimizing compiler and even moderate efficiency on vector-processing machines with a good vectorizing compiler. .PP For illustration, Table 2.1 gives speeds obtained for some representative operations on a CRAY-1S, using automatic vectorization (no compiler directives to ignore data-dependencies were needed). The speeds are given in megaflops and were measured with $m = n = 256$, INCX = INCY = 1, UPLO = 'U', DIAG = 'N', and no zero elements in the data. Speeds with UPLO = 'L' were approximately the same as those with UPLO = 'U', speeds of the SV-routines were approximately the same as those of the corresponding MV-routines, and speeds of the routines using packed storage were approximately the same as those of the corresponding routines using 2-dimensional array storage. .sp .KS .ce Table 2.1: Speed of Level 2 BLAS on CRAY-1S in megaflops. .TS center; c s s c s s l c c l c c. Real Complex Routine TRANS Speed Routine TRANS Speed SGEMV 'N' 39 CGEMV 'N' 63 'T' 31 'T' 11 SSYMV 31 CHEMV 13 STRMV 'N' 33 CTRMV 'N' 56 'T' 20 'T' 11 SGER 39 CGERU 63 SSYR 36 CHER 56 SSYR2 47 CHER2 46 .TE .KE .sp Without automatic vectorization (i.e. running in scalar mode), the speeds were between 5 and 7 megaflops for real data, and between 10 and 14 megaflops for complex data. .PP Table 2.1 does not include measurements on the routines for banded matrices. In the model implementation of these routines, the vector-lengths are at most equal to the bandwidth, and hence for narrow bandwidths the routines run at roughly scalar speeds. However see section 3.3 concerning an alternative implementation of some of these routines. .SH 2.3 Language standards .PP The model implementation of the Level 2 BLAS is written entirely in portable standard Fortran 77 with two exceptions. .IP (1) For the routines that require a "double precision complex" data type (names beginning with Z), we have used the following extensions to standard Fortran: .sp .in +.25i .IP COMPLEX*16 type specification statements; .IP DCONJG and DCMPLX intrinsic functions whose argument and result are both of type COMPLEX*16; .IP DBLE intrinsic function with a COMPLEX*16 argument and DOUBLE PRECISION result, delivering the real part of the argument; .IP COMPLEX*16 constants formed from a pair of double precision constants in parentheses. .in -.25i .sp .IP (2) For the arguments of type CHARACTER that specify options, we wish to allow either upper-case or lower-case characters to be supplied. Lower-case characters are not part of the standard Fortran character set, but their use is so widespread that it would be unfriendly not to allow them. This can be an obstacle to portability on some systems (as is discussed in Section 7 of [3]), but we have avoided most of the problems by use of the auxiliary LOGICAL function LSAME described below. .SH 2.4 Auxiliary subprograms .PP Two auxiliary subprograms are called by the Level 2 BLAS: an error-handling routine XERBLA, called by all routines, and the character-comparison routine LSAME called by all except the _GER_ routines. Both these subprograms may be selectively modified by installers of the package as described in the Appendix. No changes need be made to the rest of the model code. .NH Notes on Implementation .PP Here we offer some advice to anyone planning to develop a specialized, machine-specific implementation of the Level 2 BLAS. The following broad possibilities should be considered: .sp .IP (a) Rewriting the algorithms in Fortran so that the structure of the inner loops is better adapted to the architecture of the machine. .sp .IP (b) Using machine-specific extensions to Fortran, such as array-syntax, compiler-directives or calls to library routines. .sp .IP (c) Coding the routines in assembly language. .sp .PP Approaches (b) and (c) should be considered as extensions of (a), not as alternatives. Implementors should not consider translating the model implementation into extended Fortran or assembly language without first considering whether the structure of the model implementation is well-adapted to their machine. .PP Each matrix-vector operation performed by the Level 2 BLAS involves doubly-nested loops. By interchanging the inner and outer loop indices we obtain two variant ways of organizing the computation. In one variant the matrix is accessed by columns, in the other by rows. For the MV and SV routines, the inner loop is equivalent, in one variant, to an inner-product vector operation, and in the other variant to a vector operation of the form $ y ~<-~ alpha x ~+~ y $ which we shall refer to as an AXPY. (For the rank-1 and rank-2 update routines the inner loop is always equivalent to an AXPY operation.) The choice of method will be principally governed by the relative efficiency of performing inner-products or AXPY operations, and by the cost of accessing a 2-dimensional array by rows instead of by columns. We discuss the options in more detail for individual routines below. To be specific, we discuss the real single precision routines. .SH 3.1 Routines using full matrix storage .PP We consider first those routines which require the matrix to be stored conventionally in a 2-dimensional array: columns of the matrix are stored in columns of the array and constitute contiguous vectors; rows of the matrix are stored in rows of the array and constitute vectors with constant stride whose elements are not contiguous. Accessing vectors with non-contiguous elements may require expensive gather or scatter operations (as on the CDC Cyber 205, for example), or may cost no more than accessing contiguous vectors except when memory-bank conflicts occur (as on the CRAY-1S). .SH 3.1.1 SGEMV .PP The operations performed by this routine have been discussed by Dongarra, Gustavson and Karp [5] and Daly and Du Croz [2]. The operation $ y ~ <- ~ alpha Ax ~+~ beta y $ (TRANS = 'N') can be performed either by AXPY operations with vector-length $m$, accessing $A$ by columns, or by inner-products with vector-length $n$, accessing $A$ by rows. The operation $ y ~<-~ alpha A sup T x ~+~ beta y $ (TRANS = 'T' or 'C') can be performed either by inner-products with vector-length $m$, accessing $A$ by columns, or by AXPY operations with vector-length $n$, accessing $A$ by rows. Here $m$ and $n$ are the numbers of rows and columns of $A$ respectively. .PP In both cases the AXPY form has the property that a sequence of AXPY operations are used to update a single left-hand-side vector. On a machine with vector-registers, this left-hand side vector (or segments of it) should ideally be held in a vector-register throughout the iterations of the outer loop in order to reduce the number of memory references (see [5]). If the routines are being implemented in Fortran and the compiler cannot recognize and take advantage of this property, then the technique of unrolling the outer loops may be applied (Dongarra and Eisenstat [4], Cowell and Thompson [1]). .PP Note that the relative advantage of the AXPY or inner-product forms depends on the values of $m$ and $n$, and an optimal implementation may need to switch between the two forms accordingly. .PP On parallel machines the cleanest way to achieve concurrency is to compute segments of the result vector in separate processors; this is discussed further by Dongarra and Sorensen [7]. .SH 3.1.2 SSYMV .PP As in SGEMV, each operation can be performed either by AXPY operations or by inner-products. However, in each case the matrix must be accessed partly by columns and partly by rows (because only half the matrix is stored). The model implementation uses a mixed form in which each iteration of the outer loop contains one AXPY operation and one inner-product operation, both involving the same column of the matrix. On many machines this mixed form can halve the number of memory references to elements of A; however if, say, inner products are markedly slower than AXPY operations, then they will govern the speed of the mixed form and a pure AXPY form may be preferable. The remarks made about the AXPY forms of SGEMV on vector-register machines apply here also, although the details are more complicated because the lengths of the left-hand-side vectors are not constant, but increase or decrease by 1 on each iteration of the outer loop. Unrolling the outer loops to a depth of 2 can be handled neatly as described by Dongarra, Kaufman and Hammarling [6]. In a pure AXPY form or pure inner product form it may be preferable to make two separate passes through the outer loop, one in which the matrix is accessed by rows and one in which it is accessed by columns. .SH 3.1.3 STRMV and STRSV .PP Again as in SGEMV, each operation can be performed either by AXPY operations or by inner-products, with the matrix being accessed by rows or columns depending on the value of TRANS. Those forms of the code which are not used in the model implementation can easily be derived from those which are. For example, to derive an AXPY form of the code for $ x ~<-~ L sup T x$, simply take the code for $x~ <-~ Ux$, and replace A(I,J) by A(J,I). The remarks in section 3.1.1 about implementing the AXPY forms on a machine with vector-registers apply here also, although as in SSYMV the vector-lengths are not constant throughout the outer loop. .PP The iterations of the outer loop must be performed in a particular order: forward from 1 to $n$ for the operations $x ~<-~ Ux$, $x ~<-~ L sup T x$, $x ~<-~ L sup -1 x$ and $x ~<-~ (U sup T ) sup -1 x$; backward from $n$ to 1 for the others. In STRMV this constraint is needed simply to allow the result vector to overwrite the input vector. In STRSV the recursive nature of the computation is more fundamental: each element of the result vector depends on previously computed elements. .SH 3.1.4 Rank-1 and Rank-2 update routines .PP Each column of the matrix can be updated by an AXPY operation, or in the case of the R2-routines by a double AXPY operation. Moreover on a parallel machine each column of the matrix can be updated concurrently. Interchanging the loop indices merely results in AXPY operations on rows of the matrix. .SH 3.2 Routines using packed storage .PP With the specified storage scheme for packed matrices, columns of the matrix are stored as contiguous vectors within the packed array. Rows of the matrix do not constitute vectors with constant stride. Hence those forms in which the matrix is accessed by columns are likely to be the only forms worth considering. .SH 3.3 Routines for banded matrices .PP The same choice between inner-product and AXPY forms is available as for operations on full matrices. With the specified storage scheme for banded matrices, columns of a matrix are stored in columns of the array and constitute contiguous vectors; rows of the matrix are stored in reverse diagonals of the array and constitute vectors with constant stride. Whether the matrix is accessed by rows or by columns, the vector-length is at most $kl+ku+1$ for SGBMV, and at most $k+1$ for the other banded routines; hence with typical bandwidths speeds on vector-processors may be slow. .PP However for SGBMV and SSBMV a third alternative can be used in which the matrix is accessed by diagonals and hence the array is accessed by rows. For SGBMV the essential features of the code (when INCX = INCY = 1) are: .sp .KS .ps 8 .nf .cs 1 24 IF( LSAME( TRANS, 'N' ) )THEN C C Form y := alpha*A*x + y. C DO 60, I = 1, KL + KU + 1 L = KU + 1 - I DO 50, J = MAX( 1, 1 + L ), MIN( N, M + L ) Y( J - L ) = Y( J - L ) + ALPHA*X( J )*A( I, J ) 50 CONTINUE 60 CONTINUE ELSE C C Form y := alpha*A'*x + y. C DO 100, I = 1, KL + KU + 1 L = KU + 1 - I DO 90, J = MAX( 1, 1 + L ), MIN( N, M + L ) Y( J ) = Y( J ) + ALPHA*X( J - L )*A( I, J ) 90 CONTINUE 100 CONTINUE END IF .sp .KE .ps 11 .fi .cs 1 In this form the inner loop is equivalent to a vector operation of the form V $<-$ V + $ alpha $*V*V (V a vector, $ alpha $ a scalar). The vector-lengths are close to $n$ and for large $n$ good speeds can be obtained which are more or less independent of the bandwidth, for example 30 megaflops for real data and 40 megaflops for complex data on a CRAY-1S. The same organization can be used for STBMV provided that a temporary work-vector can be created to hold the result, but cannot be used for STBSV because of the recursive nature of the computations. .SH 3.4 Other remarks .PP The model implementation includes separate segments of code for cases when INCX or INCY = 1 (or both): on many machines this is unnecessary. .PP Specialized implementations should, where possible, use straightforward comparison of characters, rather than the routine LSAME used by the model implementation. .NH The Test Programs .PP A separate test program exists for each of the four data types (REAL, COMPLEX, DOUBLE PRECISION and COMPLEX*16). All test programs conform to the same pattern with only the minimum necessary changes, so we shall talk generically about "the test program" in the singular. .PP The program has been designed not merely to check whether the model implementation has been correctly installed, but also to serve as a validation tool, and even as a modest debugging aid, for any specialized implementation. .PP The program has the following features: .sp .IP The parameters of the test problems and the names of the subprogram to be tested are specified by means of a data file, which can easily be modified for debugging. .IP The data for the test problem are generated internally, and the results are checked internally. .IP The program checks that no arguments are changed by the routines except the designated output vector or matrix. .IP All error exits (caused by illegal parameter values) are tested. .IP The program generates a concise summary report on the tests and optionally can generate a "history" or "snapshot" file as an additional debugging aid. .SH 4.1 Parameters of the Test Problems .PP Each test problem (i.e., each call of a subprogram to be tested) depends on a choice of values for the following parameters (where relevant to the particular subprograms): .sp .IP the dimensions $m$ and $n$; .IP the bandwidth arguments $k$, $kl$ and $ku$; .IP the options UPLO, TRANS, and DIAG; .IP the increments INCX and INCY; and .IP the scalars $ alpha $ and $ beta $. .PP All relevant combinations of the options UPLO, TRANS and DIAG are tested. The values of the other arguments are defined by a data file. Specifically, the program reads in a set $S sub n$ of values of $n$, a set $S sub k$ of values of $k$, a set $S sub inc$ of values for INCX and INCY, a set $S sub alpha $ of values for $ alpha $, and a set $S sub beta $ of values for $ beta $. .in 0 .fi .PP For subprograms that require a second dimension $m$, two values of $m$ are generated for each value of $n$, namely, $m ~=~ max(n-[n/2]-1,0)$ and $m ~=~ min(n+[n/2]+1,n sub max)$, where $n sub max $ is the maximum value permitted by the array dimensions in the program. If two bandwidth arguments $kl$ and $ku$ are required, they are generated from $k$ by $kl ~=~ max(k-1,0)$ and $ku ~=~ k$. .KS .PP The test problems are then generated in a nested loop structure: .nf .sp .in 1i for $n ~ \(mo ~ S sub n $ .in 1.25i for $k ~ \(mo ~ S sub k $ .in 1.5i for all relevant values of UPLO, TRANS and DIAG .in 1.75i for INCX $ \(mo ~ S sub inc $ .in 2i for INCY $ \(mo ~ S sub inc $ .in 2.25i for $ alpha ~ \(mo ~ S sub alpha $ .in 2.5i for $ beta ~ \(mo ~ S sub beta $. .in 0 .KE .sp .fi (Of course, arguments not relevant to the routine are omitted from the loop structure.) If $m$ = 0 or $n$ = 0 (a null problem), only one test with these values of $m$ and $n$ is generated. .PP Obviously, the sets $S sub n$, $S sub k$, etc. should be as small as possible; otherwise, a very large number of problems will be generated, and the test program will take a forbiddingly long time to run. On the other hand, for a comprehensive test it is essential to exercise all segments of the code and all special or extreme cases such as $n$ = 0, $n$ = 1, $k$ = 0, $k$ = $n-1$, INCX = 1, INCY = 1, $ alpha $ = 0, $ alpha $ = 1, $ beta $ = 0, $ beta $ = 1. Note that we cannot be sure what cases will be regarded as special or extreme in any specialized implementation. .PP A data-file which specifies sets of parameters suitable for many machines, is supplied with the test program, but installers and implementors must be alert to the possible need to extend or modify them (see Appendix). .SH 4.2 Data for the Test Problems .PP Data for the elements of the matrix $A$ and the vectors $x$ and $y$ are generated using a simple portable congruential number generator. Values for the elements of $A$ are uniformly distributed over $(-0.5,0.5)$, and for the elements of $x$ and $y$ over (0,1). Care is taken to ensure that the data have full working accuracy. Some of the vectors have selected elements set to zero so that special code for this case (see Section 2) can be tested. When DIAG = 'N', 1.0 is added to the diagonal elements of triangular matrices to ensure that they are reasonably well-conditioned. .PP Data for each test problem are first stored in a conventional two-dimensional array for the matrix $A$, and in contiguous one-dimensional arrays for the vectors $x$ and $y$. The matrix is stored as a full square or rectangular matrix, with all zero elements and unit diagonal elements stored explicitly. This form is used to compute the correct result, using the same simple code in each case. .PP The data are then copied into the arrays that will be passed to the subprogram being tested, taking into account the storage scheme required for the matrix, and of the values of INCX and INCY. The argument LDA is chosen to be 1 more than its minimum permitted value, that is, LDA = $m + 1 $ for the GE routines; $n + 1$ for the SY, HE and TR routines; $kl + ku + 2$ for the GB routines; and $k + 2$ for the SB, HB and TB routines. (If this value exceeds $ n sub max$, LDA is set equal to $ n sub max$.) .PP Elements in these arrays that are not to be referenced by the subprogram (for example, the subdiagonal elements of A when UPLO = 'U', or intervening elements of X when INCX > 1) are set to a "rogue" value ($ -10 sup 10 $), to increase the likelihood that a reference to them will be detected. If a fatal error is reported and an element of the computed result is of order $10 sup 10 $, then the routine has almost certainly referenced the wrong element of an array. .SH 4.3 Checking the Results .PP After each call of a subprogram being tested, its operation is checked in two ways. .PP First, each of its arguments, including all elements of the array arguments, is checked to see if it has been changed by the subprogram. If any argument, other than the specified elements of the result vector or matrix, has been changed, a fatal error is reported. (This check includes the supposedly unreferenced elements of the arrays, which were initialized with a rogue value.) .PP Second, the result vector or matrix computed by the subprogram is compared with the result computed by simple Fortran code. We do not expect exact agreement, because the two results are not necessarily computed by the same sequences of floating-point operations. We do, however, expect the differences to be insignificant to working precision in the following precise sense. .PP In the MV routines, each element of the result vector is defined by an expression of the form .EQ I y sub i <- alpha a sub i* sup T x ~+~ beta y sub i , .EN where $a sub i* sup T$ denotes the $i sup th$ row of $A$. (For the triangular matrix routines _TRMV, _TBMV, and _TPMV, we have $ alpha $ = 1 and $ beta $ = 0). This expression may be regarded as a simple inner product $y sub i ~ <- ~ u sup T v$ by writing $u sup T ~=~ ( alpha a sub i* sup T , ~y sub i ), ~~ v sup T ~=~ (x sup T , beta )$ The absolute error in the computed inner-product $ y sub i $ is bounded by .EQ I | y hat sub i ~-~ y sub i | ~ <= ~ n epsilon | ^u | sup T | ^v |, .EN where $ epsilon $ is the relative machine precision, and $ |u| sup T $ denotes the vector $ (|u sub 1 |,|u sub 2 |,...,|u sub n |) sup T $ (see Golub and van Loan [8], p.36 ). In our tests we have also to allow for the errors introduced in the multiplication by $ alpha $. On the other hand the above bound is usually a substantial over-estimate. We use the following semi-empirical approach. .PP For each element $y$ of the result, the program computes the test ratio: .sp .EQ I { | y hat sub i - y sub i | } over { epsilon ~ | ^u | sup T | ^v| } .EN .sp with $u$ and $v$ defined as above. This is compared with a constant threshold value, which is defined in the data file. Test ratios greater than the threshold are flagged as "suspect". On the basis of experience a threshold value of 16 is recommended (the largest value observed on a variety of machines has been 11.5). The precise value is not critical. Errors in the routines are most likely to be errors in array indexing, which will almost certainly lead to a totally wrong result. A more subtle potential error is the use of a single precision variable in a double precision computation. This is likely to lead to a loss of half the machine precision. The test program regards a test ratio greater than $ epsilon sup {- \(12}$ as a fatal error. .PP The R and R2 routines are checked in a similar way. Each element of the result is regarded as an inner-product of length 2 or 3: .EQ I a sub ij ~ <- ~ (a sub ij , alpha x sub i ) sup T left (^ pile { 1 above {y sub j} } right ) .EN or .EQ I a sub ij ~ <- ~ (a sub ij , alpha x sub i , alpha y sub i ) sup T left ( ^pile {1 above {y sub j } above {x sub j }} right ) . .EN The SV routines are checked as follows: if $y ~=~ T sup -1 x$ is the exact result, and $y hat $ is the computed result, then the test program computes $x hat ~=~ T { y hat } $ and compares it with $x$ as above. Thus the test ratio is .EQ I { | x hat sub i - x sub i | } over { epsilon | ^t sub i* | sup T ~| ^x sub i | } , .EN where $t sub i* $ denotes the $i sup th$ row of T. Theoretically, the test ratio should involve the condition number of T with respect to inversion, but the test program generates well-conditioned triangular matrices and, in practice the test ratios observed for these routines are no larger than for the others. .SH References .sp .IP [1] W.R. Cowell and C.P. Thompson, .I Transforming Fortran DO Loops to Improve Performance on Vector Architectures, .R Argonne National Laboratory Report ANL-85-63. .sp .IP [2] C. Daly and J.J. Du Croz, .I Performance of a Subroutine Library on Vector-Processing Machines, .R Computer Physics Communications, 37, pp 181-186 (1985). .sp .IP [3] J.J. Dongarra, J.J. Du Croz, S. Hammarling and R.J. Hanson, .I An Extended Set of Fortran Basic Linear Algebra Subprograms, .R Argonne National Laboratory Report ANL-MCS-TM-41 (Revision 3) (1986). .sp .IP [4] J.J. Dongarra and S.C. Eisenstat, .I Squeezing the Most out of an Algorithm in CRAY Fortran, .R ACM Trans. Math. Software, 10, pp 221-230 (1984). .sp .IP [5] J.J. Dongarra, F.G. Gustavson and A. Karp, .I Implementing Linear Algorithms for Dense Matrices on a Vector Pipeline Machine, .R SIAM Review, 26, pp 91-112 (1984). .sp .IP [6] J.J. Dongarra, L. Kaufman and S. Hammarling, .I Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers, .R Lin. Alg. Applic., 77, pp 113-136 (1986). .sp .IP [7] J.J. Dongarra and D.C. Sorensen, .I Linear Algebra on High-Performance Computers .R Proceedings Parallel Computing 85, Edited by U. Schendel, pp. 3-32, North Holland Pub. 1986. .sp .IP [8] G.H. Golub and C.F. Van Loan, .I Matrix Computations, .R The Johns Hopkins University Press, Baltimore, Maryland (1983). .bp .B .SH Appendix: Installation Notes .SH A1. Installing the model implementation .PP The subprograms fall into four sets according to the data-type of the matrices and vectors: REAL, COMPLEX, DOUBLE PRECISION and COMPLEX*16 (subprogram names beginning with S, C, D and Z, respectively). Choose which set or sets are to be installed. .PP Examine the auxiliary subprograms XERBLA and LSAME (which are independent of the data-type) and consider whether they need to be modified. .PP The subprogram XERBLA is called when one of the Level 2 BLAS detects an illegal value of one of its arguments. The version supplied with the model implementation writes a message to the standard output channel, for example .sp .ps 9 .cs 1 24 .nf ** On entry to STRSV parameter number 3 had an illegal value. .cs 1 .fi .ps 11 .sp and then executes a STOP statement. Installers may wish to redirect the error message to a different output channel, or to replace the STOP statement by a call to system-specific exception-handling or trace-back mechanisms. .PP The logical function LSAME is used to perform all character-comparison in the Level 2 BLAS in a case-insensitive manner. For example, the expression .ps 8 .cs 1 24 .nf LSAME(UPLO, 'U') .ps 11 .cs 1 .fi is equivalent to .ps 8 .cs 1 24 .nf (UPLO.EQ.'U').OR.(UPLO.EQ.'u'). .ps 11 .cs 1 .fi .PP The supplied version works correctly on all systems that use the ASCII code for internal representation of characters. For systems that use the EBCDIC code, one constant must be changed. For CDC systems with 6-12 bit representation, alternative code is provided in comments. Any of the versions work correctly on all systems if only upper-case characters are passed as arguments. .PP Compile the chosen sets of subprograms, together with LSAME and XERBLA, and create an object-library. .SH A2. Testing the model implementation .PP Select the test program or programs corresponding to the data-types handled by the subprograms that have been installed. .PP An annotated example of a data-file for the program can be obtained by editing the comments at the start of the main program. This defines the names and unit numbers of the output files, various parameters of the tests, and the names of those subprograms which are to be tested. The data-file for the REAL routines is illustrated below. The first 18 records are read using list-directed input, the last 16 records are read using the format ( A6, L2 ). .ps 8 .cs 1 24 .nf Record no. Record contents 1 'SBLAT2.SUMM' NAME OF SUMMARY OUTPUT FILE 2 6 UNIT NUMBER OF SUMMARY FILE 3 'SBLAT2.SNAP' NAME OF SNAPSHOT OUTPUT FILE 4 -1 UNIT NUMBER OF SNAPSHOT FILE (NOT USED IF .LT. 0) 5 F LOGICAL, T TO REWIND SNAPSHOT FILE AFTER EACH RECORD. 6 F LOGICAL, T TO STOP ON FAILURES. 7 T LOGICAL, T TO TEST ERROR EXITS. 8 16.0 THRESHOLD VALUE OF TEST RATIO 9 6 NUMBER OF VALUES OF N 10 0 1 2 3 5 9 VALUES OF N 11 4 NUMBER OF VALUES OF K 12 0 1 2 4 VALUES OF K 13 4 NUMBER OF VALUES OF INCX AND INCY 14 1 2 -1 -2 VALUES OF INCX AND INCY 15 3 NUMBER OF VALUES OF ALPHA 16 0.0 1.0 0.7 VALUES OF ALPHA 17 3 NUMBER OF VALUES OF BETA 18 0.0 1.0 0.9 VALUES OF BETA 19 SGEMV T PUT F FOR NO TEST. SAME COLUMNS. 20 SGBMV T PUT F FOR NO TEST. SAME COLUMNS. 21 SSYMV T PUT F FOR NO TEST. SAME COLUMNS. 22 SSBMV T PUT F FOR NO TEST. SAME COLUMNS. 23 SSPMV T PUT F FOR NO TEST. SAME COLUMNS. 24 STRMV T PUT F FOR NO TEST. SAME COLUMNS. 25 STBMV T PUT F FOR NO TEST. SAME COLUMNS. 26 STPMV T PUT F FOR NO TEST. SAME COLUMNS. 27 STRSV T PUT F FOR NO TEST. SAME COLUMNS. 28 STBSV T PUT F FOR NO TEST. SAME COLUMNS. 29 STPSV T PUT F FOR NO TEST. SAME COLUMNS. 30 SGER T PUT F FOR NO TEST. SAME COLUMNS. 31 SSYR T PUT F FOR NO TEST. SAME COLUMNS. 32 SSPR T PUT F FOR NO TEST. SAME COLUMNS. 33 SSYR2 T PUT F FOR NO TEST. SAME COLUMNS. 34 SSPR2 T PUT F FOR NO TEST. SAME COLUMNS. .cs 1 .ps 11 .fi .PP Change the 1st record of the data-file, if necessary, to ensure that the file-name is legal on your system. No other changes to the data-file should be necessary before an initial run of the test program, but some changes may be needed to ensure that the tests are sufficiently thorough (see below). .PP The data-file is read from unit NIN, which is set to 5 in a PARAMETER statement in the main program: change this if necessary. .PP Compile the test program, link in the required subprograms, and run the program. .PP Note: the test programs include an alternative version of the auxiliary subprogram XERBLA which is needed to test the error-exits from the Level 2 BLAS. On some systems special action must be taken to ensure that this version is linked in to the test program. If the model implementation of XERBLA is used, the test program will stop after writing an error-message from XERBLA. .PP The following table gives the approximate times taken to run the test programs, using the supplied data-file and the model implementation of the subprograms, on various machines: .KS .TS center; l l l l l l l l n n n n n l. S C D Z CRAY-1S 0.2 0.2 0.5 - minutes DEC VAX-11/750 3 4.5 4 8 NSC 32032 PC* 8 10 17 20 Compaq Deskpro 286** 12 44 24 50 .TE .ps 8 * DSI-32 coprocessor 10 Mhz; Green Hills Fortran; compiler options -01 -02 -X71; RAM disk for I/O. .br ** Lahey F77L V.2.10, 80287 - 5 Mhz, 80286 - 8 Mhz. .ps 11 .KE .PP If the tests using the supplied data-file are completed successfully, consider whether the tests have been sufficiently thorough. For example on a machine with vector-registers, at least one value of N greater than the length of a vector-register should be used, otherwise important parts of the compiled code may not be exercised by the tests. .PP The tests may fail, with either "suspect results" or "fatal errors." Suspect results - with a test ratio slightly greater than the threshold - are probably caused by anomalies in floating-point arithmetic on the machine; if this explanation is considered to be sufficient, increase the value of the threshold specified in the data-file. Fatal errors most probably indicate a compilation error or corruption of the source-text. An error detected by the system - for example, an array-subscript out of bounds, or use of an unassigned variable - is almost certainly due to the same causes. If the system does not provide adequate post-mortem information about the error, the snapshot file can give a little help (see section A5). .SH A3. Testing a specialized implementation .PP Proceed initially as described in section A2. .PP If the implementation does not use an error-handling subprogram XERBLA, compatible with the model implementation, then the data-file must be modified to suppress the testing of error-exits. .PP Consider very carefully what changes need to be made to the data-file, to ensure that the implementation has been thoroughly tested. For example, if the technique of loop-unrolling is applied, make sure that sufficient values of N are used to test all the clean-up code; if ALPHA .EQ. -1.0 is treated as a special case, add -1.0 to the values of ALPHA. .SH A4. Changing the parameters of the tests .PP The values supplied in the data-file must satisfy certain restrictions, defined by the following symbolic constants in the test program: .TS center; c c c l l n. Name Meaning Value NIDMAX Maximum number of values of N 9 NKBMAX Maximum number of values of K 7 NINMAX Maximum number of values of INCX, INCY 7 NALMAX Maximum number of values of ALPHA 7 NBEMAX Maximum number of values of BETA 7 NMAX Maximum value of N 65 INCMAX Maximum value of ABS(INCX), ABS(INCY) 2 .TE If necessary, modify the PARAMETER statements that define these symbolic constants. .SH A5. The History or Snapshot File .PP The main output file from the test program contains a concise report on the success or failure of the tests of each routine and the reasons for failure if it occurs. Optionally, the program writes to a separate file a one-line record giving details of the arguments in each call of a Level 2 BLAS subprogram, e.g., .cs 1 22 .nf 25: STRSV ('U', 'T', 'U', 3, A, 4, X, -1) .cs 1 .fi (The number 25 indicates that this is the 25th call of STRSV). The record is written immediately before the routine is called. .PP As a cumulative "history" file, this enables the user to monitor which tests are passed successfully before a failure occurs. Moreover, if an exception occurs in the Level 2 BLAS routine (e.g. array bound error or division by zero), the last record written to the file should give details of the call that caused the exception. However, on some systems the output buffers are not emptied when a program is terminated abnormally. Therefore, the program has an option to rewind the file after each record is written in order to force emptying of the buffer: in this mode the file presents a one-line "snapshot" of the current or most recent call to a Level 2 BLAS routine. .