.ds TH "\h'-.50m'\v'-.40m'^\v'.40m'\h'.50m'
.ds TH "\h'-.75m'\v'-.50m'^\v'.50m'
.ds TD "\h'-.75m'\v'-.45m'~\v'.45m'
.B
.nr BT ''-%-''
.he ''''
.pl 11i
.de fO
'bp
..
.wh -.5i fO
.LP
.nr LL 6.5i
.ll 6.5i
.nr LT 6.5i
.lt 6.5i
.ta 5.0i
.ft 3
.bp
.R
.sp 1i
.ce 100
.R
.sp .5i
 .
.sp 10
ARGONNE NATIONAL LABORATORY
.br
9700 South Cass Avenue
.br
Argonne, Illinois  60439
.sp .6i
.ps 12
.ft 3
An Extended Set of Basic Linear Algebra Subprograms:
.sp
Model Implementation and Test Programs
.ps 11
.sp 3
Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson
.sp 3
.ps 10
.ft 1
Mathematics and Computer Science Division
.sp 2
Technical Memorandum No. 81
.sp .7i
January 1987
.pn 0
.he ''-%-''
.he ''''
.bp
 .
.sp 10
.B
.ce 1
.ps 11
Abstract
.R
.ps 10
.sp .5i
This paper describes a model implementation and test software for the Level 2
Basic Linear Algebra Subprograms (Level 2 BLAS). The Level 2 BLAS are
targeted at matrix-vector operations with the aim of providing more
efficient, but portable, implementations of algorithms on high-performance
computers. The model implementation provides a portable set of Fortran 77
Level 2 BLAS for machines where specialized implementations do not exist or
are not required. The test software aims to verify that specialized
implementations meet the specification of the Level 2 BLAS and that
implementations are correctly installed.
.in
.bp
.ft 3
.ps 11
.bp
.LP
.LP
.EQ
gsize 11
delim $$
.EN
.TL
.ps 12
.in 0
An Extended Set of 
.sp
Basic Linear Algebra Subprograms:
.sp
Testing Software and Model Implementations
.AU
.ps 11
.in 0
Jack J. Dongarra\|$size -1 {"" sup \(dg}$\h'.15i'
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'
Argonne National Laboratory\h'.20i'
Argonne, Illinois 60439\h'.20i'
.AU
.ps 11
.in 0
Jeremy Du Croz\h'.20i'
.AI
.ps 10
.in 0
Numerical Algorithms Group Ltd.\h'.20i'
NAG Central Office, Mayfield House\h'.20i'
256 Banbury Road, Oxford OX2 7DE\h'.20i'
.AU
.ps 11
.in 0
Sven Hammarling\h'.20i'
.AI
.ps 10
.in 0
Numerical Algorithms Group Ltd.\h'.20i'
NAG Central Office, Mayfield House\h'.20i'
256 Banbury Road, Oxford OX2 7DE\h'.20i'
.AU
.ps 11
.in 0
Richard J. Hanson\|$size -1 {"" sup \(dd}$\h'.15i'
.AI
.ps 10
.in 0
Applied Mathematics Division 2646
Sandia National Laboratory\h'.20i'
Albuquerque, New Mexico 87185\h'.20i'
.FS
.ps 9
.vs 11p
$size -1 {"" sup \(dg}$\|Work supported in part by the Applied Mathematical
Sciences subprogram of the Office of Energy Research,
U. S. Department of Energy, under Contract W-31-109-Eng-38.

$size -1 {"" sup \(dd}$\|Work supported by the U.S. Department of Energy.

Typeset on \*(DY.
.FE
.sp 3
.QS
.ps 10
.in .25i
.ll -.25i
Abstract
\(em This paper describes a model implementation and test software for the Level 2
Basic Linear Algebra Subprograms (Level 2 BLAS). The Level 2 BLAS are
targeted at matrix-vector operations with the aim of providing more
efficient, but portable, implementations of algorithms on high-performance
computers. The model implementation provides a portable set of Fortran 77
Level 2 BLAS for machines where specialized implementations do not exist or
are not required. The test software aims to verify that specialized
implementations meet the specification of the Level 2 BLAS and that
implementations are correctly installed.
.in
.ll
.QE
 
.nr PS 11
.nr VS 16
.nr PD 0.5v
.EQ
delim $$
.EN
.nr PO 1i
.po 1i
.ps 10
.nr FO
.fo ''%''
.tp
.sp 5
.NH
Scope of the Algorithm
.PP 
In [3] we have defined the specification of a set of Basic Linear
Subprograms for selected matrix-vector operations usually referred to as
the "Level 2 BLAS" or "Extended BLAS". They provide a standard framework
to develop modular, portable, and efficient Fortran 77 code for many
computational problems in linear algebra. Our hope is that specialized
implementations of the Level 2 BLAS will be developed for many machines,
especially for vector processors and other high-performance computers.
Thus programs that call the Level 2 BLAS can be efficient across a wide
range of machines.
.PP 
To support and encourage the use of the Level 2 BLAS, this algorithm
contains two components of software:
.IP (1)
A model implementation of the subprograms in Fortran 77. This enables
the Level 2 BLAS to be used on any machine, regardless of whether a
specialized implementation exists. It is described in Section 2.
.sp
.IP (2)
Test programs, designed to ensure that implementations conform to the
specification and have been correctly installed. The programs are
described in Section 4.
.LP
Section 3 contains some advice on developing specialized implementations
of the subprograms.
Installation notes for the software are given in the Appendix.
.NH 
The Model Implementation
.SH
2.1 Programming considerations
.PP 
There are many mathematically equivalent 
ways to implement the Level 2 BLAS even in standard Fortran
77, as discussed in Section 3. The choice of method for the model
implementation has been guided by the following considerations:
.sp
.IP (1) 
The elements of the array A are accessed sequentially, column by
column. On vector-processing machines this allows the columns of the
array to be recognized as contiguous vectors (by the Fortran
compiler). On virtual-memory machines it keeps page swaps to a
minimum.
.sp
.IP (2) 
Special code is included for the commonly occurring cases when the
increment parameters (INCX or INCY or both) which are used in the
inner loops, are equal to 1. This code can use a simpler, and hence
more efficient, Fortran indexing scheme and also allows contiguous
vectors to be recognized by the Fortran compiler, for example:
.sp
.nf
.cs 1 24
.KS
.ps 8 
             DO  50, I = 1, M
                Y(I) = Y(I) + TEMP*A(I,J)
          50 CONTINUE
.fi
.cs 1
.KE
.ps 11
.sp
instead of:
.sp
.KS
.nf
.ps 8
.cs 1 24
             IY = KY
             DO 50, I = 1, M
                Y(IY) = Y(IY) + TEMP*A(I,J)
                IY    = IY    + INCY
          50 CONTINUE
.fi
.cs 1
.ps 11
.sp
.KE
.IP (3)
Provision is made to skip the innermost loop if relevant elements of
the vectors X or Y are zero. This can yield a considerable gain in
efficiency if the vectors are sparse, for example:
.sp
.KS
.nf
.ps 8
.cs 1 24
             IF( X(JX).NE.ZERO )THEN
                TEMP = ALPHA*X(JX)
                DO 50, I = 1, M
                   Y(I) = Y(I) + TEMP*A(I,J)
          50    CONTINUE
             END IF
.fi
.cs 1
.ps 11
.sp
.KE
.SH
2.2 Efficiency
.PP
The model implementation is likely to achieve considerable efficiency on
scalar-processing machines with a good optimizing compiler and even
moderate efficiency on vector-processing machines with a good vectorizing
compiler.
.PP 
For illustration, Table 2.1 gives speeds obtained for some representative
operations on a CRAY-1S, using automatic vectorization (no compiler
directives to ignore data-dependencies were needed). The speeds are given
in megaflops and were measured with $m = n = 256$, INCX = INCY = 1,
UPLO = 'U', DIAG = 'N', and no zero elements in the data. Speeds with
UPLO = 'L' were approximately the same as those with UPLO = 'U', speeds of
the SV-routines were approximately the same as those of the corresponding
MV-routines, and speeds of the routines using packed storage were
approximately the same as those of the corresponding routines using
2-dimensional array storage.
.sp
.KS
.ce 
Table 2.1: Speed of Level 2 BLAS on CRAY-1S in megaflops.
.TS
center;
c s s c s s
l c c l c c.
Real	Complex  
 
Routine	TRANS	Speed	Routine	TRANS	Speed
 
SGEMV	'N'	39	CGEMV	'N'	63
	'T'	31		'T'	11
SSYMV		31	CHEMV		13
STRMV	'N'	33	CTRMV	'N'	56
	'T'	20		'T'	11
SGER		39	CGERU		63
SSYR		36	CHER		56
SSYR2		47	CHER2		46
.TE
.KE
.sp
Without automatic vectorization (i.e. running in scalar mode), the speeds
were between 5 and 7 megaflops for real data, and between 10 and 14
megaflops for complex data.
.PP
Table 2.1 does not include measurements on the routines for banded
matrices. In the model implementation of these routines, the
vector-lengths are at most equal to the bandwidth, and hence for narrow
bandwidths the routines run at roughly scalar speeds. However see section
3.3 concerning an alternative implementation of some of these routines.
.SH
2.3 Language standards
.PP 
The model implementation of the Level 2 BLAS is written entirely in
portable standard Fortran 77 with two exceptions.
.IP (1) 
For the routines that require a "double precision complex" data type
(names beginning with Z), we have used the following extensions to
standard Fortran:
.sp
.in +.25i
.IP 
COMPLEX*16 type specification statements;
.IP
DCONJG and DCMPLX intrinsic functions whose argument and result are
both of type COMPLEX*16;
.IP
DBLE intrinsic function with a COMPLEX*16 argument and
DOUBLE PRECISION result, delivering the real part of
the argument;
.IP
COMPLEX*16 constants formed from a pair of double precision
constants in parentheses.
.in -.25i
.sp
.IP (2) 
For the arguments of type CHARACTER that specify options, we wish to
allow either upper-case or lower-case characters to be supplied.
Lower-case characters are not part of the standard Fortran character
set, but their use is so widespread that it would be unfriendly not to
allow them. This can be an obstacle to portability on some systems (as
is discussed in Section 7 of [3]), but we have avoided most of the
problems by use of the auxiliary LOGICAL function LSAME described
below.
.SH
2.4 Auxiliary subprograms
.PP 
Two auxiliary subprograms are called by the Level 2 BLAS: an
error-handling routine XERBLA, called by all routines, and the
character-comparison routine LSAME called by all except the _GER_
routines. Both these subprograms may be selectively modified by installers
of the package as described in the Appendix.
No changes need be made to the rest of the model code.
.NH
Notes on Implementation
.PP 
Here we offer some advice to anyone planning to develop a specialized,
machine-specific implementation of the Level 2 BLAS. The following broad
possibilities should be considered:
.sp
.IP (a) 
Rewriting the algorithms in Fortran so that the structure of the inner
loops is better adapted to the architecture of the machine.
.sp
.IP (b) 
Using machine-specific extensions to Fortran, such as array-syntax,
compiler-directives or calls to library routines.
.sp 
.IP (c) 
Coding the routines in assembly language.
.sp
.PP
Approaches (b) and (c) should be considered as extensions of (a), not as
alternatives. Implementors should not consider translating the model
implementation into extended Fortran or assembly language without first
considering whether the structure of the model implementation is
well-adapted to their machine.
.PP
Each matrix-vector operation performed by the Level 2 BLAS involves
doubly-nested loops. By interchanging the inner and outer loop indices we
obtain two variant ways of organizing the computation. In one variant the
matrix is accessed by columns, in the other by rows. For the MV and SV
routines, the inner loop is equivalent, in one variant, to an
inner-product vector operation, and in the other variant to a vector
operation of the form $ y ~<-~  alpha x ~+~ y $ which we shall refer to as an AXPY. (For
the rank-1 and rank-2 update routines the inner loop is always equivalent
to an AXPY operation.) The choice of method will be principally governed
by the relative efficiency of performing inner-products or AXPY
operations, and by the cost of accessing a 2-dimensional array by rows
instead of by columns. We discuss the options in more detail for
individual routines below. To be specific, we discuss the real single
precision routines.
.SH 
3.1 Routines using full matrix storage
.PP 
We consider first those routines which require the matrix to be stored
conventionally in a 2-dimensional array: columns of the matrix are stored
in columns of the array and constitute contiguous vectors; rows of the
matrix are stored in rows of the array and constitute vectors with
constant stride whose elements are not contiguous. Accessing vectors with
non-contiguous elements may require expensive gather or scatter operations
(as on the CDC Cyber 205, for example), or may cost no more than accessing
contiguous vectors except when memory-bank conflicts occur (as on the
CRAY-1S).
.SH 
3.1.1 SGEMV
.PP 
The operations performed by this routine have been discussed by Dongarra,
Gustavson and Karp [5] and Daly and Du Croz [2]. The operation
$ y  ~ <-  ~ alpha Ax ~+~  beta y $ (TRANS = 'N') can be performed either by AXPY operations
with vector-length $m$, accessing $A$ by columns, or by inner-products with
vector-length $n$, accessing $A$ by rows. The operation $ y ~<-~  alpha A sup T x ~+~   beta y $
(TRANS = 'T' or 'C') can be performed either by inner-products with
vector-length $m$, accessing $A$ by columns, or by AXPY operations with
vector-length $n$, accessing $A$ by rows. Here $m$ and $n$ are the numbers of rows
and columns of $A$ respectively.
.PP
In both cases the AXPY form has the property that a sequence of AXPY
operations are used to update a single left-hand-side vector. On a machine
with vector-registers, this left-hand side vector (or segments of it)
should ideally be held in a vector-register throughout the iterations of
the outer loop in order to reduce the number of memory references (see
[5]). If the routines are being implemented in Fortran and the compiler
cannot recognize and take advantage of this property, then the technique
of unrolling the outer loops may be applied (Dongarra and Eisenstat [4],
Cowell and Thompson [1]).
.PP 
Note that the relative advantage of the AXPY or inner-product forms
depends on the values of $m$ and $n$, and an optimal implementation may need
to switch between the two forms accordingly.
.PP
On parallel machines the cleanest way to achieve concurrency is to compute
segments of the result vector in separate processors; this is discussed
further by Dongarra and Sorensen [7].
.SH 
3.1.2 SSYMV
.PP 
As in SGEMV, each operation can be performed either by AXPY operations or
by inner-products. However, in each case the matrix must be accessed
partly by columns and partly by rows (because only half the matrix is
stored). The model implementation uses a mixed form in which each
iteration of the outer loop contains one AXPY operation and one
inner-product operation, both involving the same column of the matrix. On
many machines this mixed form can halve the number of memory references to
elements of A; however if, say, inner products are markedly slower than
AXPY operations, then they will govern the speed of the mixed form and a
pure AXPY form may be preferable. The remarks made about the AXPY forms of
SGEMV on vector-register machines apply here also, although the details
are more complicated because the lengths of the left-hand-side vectors are
not constant, but increase or decrease by 1 on each iteration of the outer
loop. Unrolling the outer loops to a depth of 2 can be handled neatly as
described by Dongarra, Kaufman and Hammarling [6]. In a pure AXPY form or
pure inner product form it may be preferable to make two separate passes
through the outer loop, one in which the matrix is accessed by rows and
one in which it is accessed by columns.
.SH 
3.1.3 STRMV and STRSV
.PP 
Again as in SGEMV, each operation can be performed either by AXPY
operations or by inner-products, with the matrix being accessed by rows or
columns depending on the value of TRANS. Those forms of the code which are
not used in the model implementation can easily be derived from those
which are. For example, to derive an AXPY form of the code for 
$ x ~<-~ L sup T x$,
simply take the code for $x~ <-~ Ux$, and replace A(I,J) by A(J,I). The
remarks in section 3.1.1 about implementing the AXPY forms on a machine
with vector-registers apply here also, although as in SSYMV the
vector-lengths are not constant throughout the outer loop.
.PP 
The iterations of the outer loop must be performed in a particular order:
forward from 1 to $n$ for the operations 
$x ~<-~ Ux$, $x ~<-~ L sup T x$, $x ~<-~ L sup -1  x$ and
$x ~<-~ (U sup T ) sup -1  x$; backward from $n$ to 1 for the others. In STRMV this
constraint is needed simply to allow the result vector to overwrite the
input vector. In STRSV the recursive nature of the computation is more
fundamental: each element of the result vector depends on previously
computed elements.
.SH 
3.1.4 Rank-1 and Rank-2 update routines
.PP 
Each column of the matrix can be updated by an AXPY operation, or in the
case of the R2-routines by a double AXPY operation. Moreover on a parallel
machine each column of the matrix can be updated concurrently.
Interchanging the loop indices merely results in AXPY operations on rows
of the matrix.
.SH 
3.2 Routines using packed storage
.PP 
With the specified storage scheme for packed matrices, columns of the
matrix are stored as contiguous vectors within the packed array. Rows of
the matrix do not constitute vectors with constant stride. Hence those
forms in which the matrix is accessed by columns are likely to be the only
forms worth considering.
.SH 
3.3 Routines for banded matrices
.PP 
The same choice between inner-product and AXPY forms is available as for
operations on full matrices. With the specified storage scheme for banded
matrices, columns of a matrix are stored in columns of the array and
constitute contiguous vectors; rows of the matrix are stored in reverse
diagonals of the array and constitute vectors with constant stride.
Whether the matrix is accessed by rows or by columns, the vector-length is
at most $kl+ku+1$ for SGBMV, and at most $k+1$ for the other banded routines;
hence with typical bandwidths speeds on vector-processors may be slow.
.PP
However for SGBMV and SSBMV a third alternative can be used in which the
matrix is accessed by diagonals and hence the array is accessed by rows.
For SGBMV the essential features of the code (when INCX = INCY = 1) are:
.sp
.KS
.ps 8
.nf
.cs 1 24 
         IF( LSAME( TRANS, 'N' ) )THEN
   C
   C        Form  y := alpha*A*x + y.
   C
            DO 60, I = 1, KL + KU + 1
               L = KU + 1 - I
               DO 50, J = MAX( 1, 1 + L ), MIN( N, M + L )
                  Y( J - L ) = Y( J - L ) + ALPHA*X( J )*A( I, J )
      50       CONTINUE
      60    CONTINUE
         ELSE
   C
   C        Form  y := alpha*A'*x + y.
   C
            DO 100, I = 1, KL + KU + 1
               L = KU + 1 - I
               DO 90, J = MAX( 1, 1 + L ), MIN( N, M + L )
                  Y( J ) = Y( J ) + ALPHA*X( J - L )*A( I, J )
      90       CONTINUE
     100    CONTINUE
         END IF
.sp
.KE
.ps 11
.fi
.cs 1
In this form the inner loop is equivalent to a vector operation of the
form V $<-$ V + $ alpha $*V*V (V a vector, $ alpha $ a scalar). The vector-lengths are close
to $n$ and for large $n$ good speeds can be obtained which are more or less
independent of the bandwidth, for example 30 megaflops for real data and
40 megaflops for complex data on a CRAY-1S. The same organization can be
used for STBMV provided that a temporary work-vector can be created to
hold the result, but cannot be used for STBSV because of the recursive
nature of the computations.
.SH 
3.4 Other remarks
.PP 
The model implementation includes separate segments of code for cases when
INCX or INCY = 1 (or both): on many machines this is unnecessary.
.PP 
Specialized implementations should, where possible, use straightforward
comparison of characters, rather than the routine LSAME used by the model
implementation.
.NH
The Test Programs
.PP 
A separate test program exists for each of the four data types (REAL,
COMPLEX, DOUBLE PRECISION and COMPLEX*16). All test programs conform to
the same pattern with only the minimum necessary changes, so we shall talk
generically about "the test program" in the singular.
.PP 
The program has been designed not merely to check whether the model
implementation has been correctly installed, but also to serve as a
validation tool, and even as a modest debugging aid, for any specialized
implementation.
.PP 
The program has the following features:
.sp
.IP
The parameters of the test problems and the names of the subprogram to
be tested are specified by means of a data file, which can easily be
modified for debugging.
.IP
The data for the test problem are generated internally, and the results
are checked internally.
.IP
The program checks that no arguments are changed by the routines except
the designated output vector or matrix.
.IP
All error exits (caused by illegal parameter values) are tested.
.IP
The program generates a concise summary report on the tests and
optionally can generate a "history" or "snapshot" file as an additional
debugging aid.
.SH 
4.1 Parameters of the Test Problems
.PP 
Each test problem (i.e., each call of a subprogram to be tested) depends
on a choice of values for the following parameters (where relevant to the
particular subprograms):
.sp
.IP
the dimensions $m$ and $n$;
.IP
the bandwidth arguments $k$, $kl$ and $ku$;
.IP
the options UPLO, TRANS, and DIAG;
.IP
the increments INCX and INCY; and
.IP
the scalars $ alpha $ and $ beta $.
.PP
All relevant combinations of the options UPLO, TRANS and DIAG are tested.
The values of the other arguments are defined by a data file.
Specifically, the program reads in
a set $S sub n$ of values of $n$,
a set $S sub k$ of values of $k$,
a set $S sub inc$ of values for INCX and INCY,
a set $S sub alpha $ of values for $ alpha $, and 
a set $S sub beta $ of values for $ beta $.
.in 0
.fi
.PP
For subprograms that require a second dimension $m$, two values
of $m$ are generated for each value of $n$, namely,
$m ~=~ max(n-[n/2]-1,0)$ and $m ~=~ min(n+[n/2]+1,n sub max)$,
where $n sub max $ is the maximum value permitted by the array
dimensions in the program.  If two bandwidth arguments $kl$ and
$ku$ are required, they are generated from $k$ by $kl ~=~ max(k-1,0)$
and $ku ~=~ k$.
.KS
.PP
The test problems are then generated in a nested loop
structure:
.nf
.sp
.in 1i
for $n ~ \(mo ~ S sub n $
.in 1.25i
for $k ~ \(mo ~ S sub k $
.in 1.5i
for all relevant values of UPLO, TRANS and DIAG
.in 1.75i
for INCX $ \(mo ~ S sub inc $
.in 2i
for INCY $ \(mo ~ S sub inc $
.in 2.25i
for $ alpha ~ \(mo ~ S sub alpha $
.in 2.5i
for $ beta ~ \(mo ~ S sub beta $.
.in 0
.KE
.sp
.fi
(Of course, arguments not relevant to the routine are omitted
from the loop structure.)  
If $m$ = 0 or $n$ = 0 (a null problem), only one test with these
values of $m$ and $n$ is generated.
.PP
Obviously, the sets $S sub n$, $S sub k$, etc. should be as small as
possible; otherwise, a very large number of problems will be
generated, and the test program will take a forbiddingly long
time to run.  On the other hand, for a comprehensive test it
is essential to exercise all segments of the code and all
special or extreme cases such as $n$ = 0, $n$ = 1, $k$ = 0, $k$ = $n-1$,
INCX = 1, INCY = 1, $ alpha $ = 0, $ alpha $ = 1, $ beta $ = 0,  
$ beta $ = 1.  Note that 
we cannot be sure what cases will be regarded as special or
extreme in any specialized implementation.
.PP
A data-file which specifies sets of parameters suitable for many machines,
is supplied with the test program, but installers and implementors must
be alert to the possible need to extend or modify them (see Appendix).
.SH 
4.2 Data for the Test Problems
.PP 
Data for the elements of the matrix $A$ and the vectors $x$ and $y$ are
generated using a simple portable congruential number generator.
Values for the elements of $A$ are uniformly distributed over 
$(-0.5,0.5)$,
and for the elements of $x$ and $y$ over (0,1). Care is taken to ensure that
the data have full working accuracy. Some of the vectors have selected
elements set to zero so that special code for this case (see Section 2)
can be tested. When DIAG = 'N', 1.0 is added to the diagonal elements of
triangular matrices to ensure that they are reasonably well-conditioned.
.PP 
Data for each test problem are first stored in a conventional 
two-dimensional array for the matrix $A$, and in contiguous one-dimensional
arrays for the vectors $x$ and $y$. The matrix is stored as a full square or
rectangular matrix, with all zero elements and unit diagonal elements
stored explicitly. This form is used to compute the correct result, using
the same simple code in each case.
.PP 
The data are then copied into the arrays that will be passed to the
subprogram being tested, taking into account the storage scheme required
for the matrix, and of the values of INCX and INCY. The argument LDA is
chosen to be 1 more than its minimum permitted value, that is, LDA = $m + 1 $
for the GE routines; $n + 1$ for the SY, HE and TR routines; $kl + ku + 2$ for
the GB routines; and $k + 2$ for the SB, HB and TB routines. (If this value
exceeds $ n sub max$, LDA is set equal to $ n sub max$.)
.PP
Elements in these arrays that are not to be referenced by the subprogram
(for example, the subdiagonal elements of A when UPLO = 'U', or
intervening elements of X when INCX > 1) are set to a "rogue" value
($ -10 sup 10 $), to increase the likelihood that a reference to them will be
detected. If a fatal error is reported and an element of the computed
result is of order $10 sup 10 $, then the routine has almost certainly referenced
the wrong element of an array.
.SH 
4.3 Checking the Results
.PP 
After each call of a subprogram being tested, its operation is checked in
two ways.
.PP 
First, each of its arguments, including all elements of the array
arguments, is checked to see if it has been changed by the subprogram. If
any argument, other than the specified elements of the result vector or
matrix, has been changed, a fatal error is reported. (This check includes
the supposedly unreferenced elements of the arrays, which were initialized
with a rogue value.)
.PP 
Second, the result vector or matrix computed by the subprogram is compared
with the result computed by simple Fortran code.
We do not expect exact agreement, because the two results are not
necessarily computed by the same sequences of floating-point operations.
We do, however, expect the differences to be insignificant to working
precision in the following precise sense.
.PP 
In the MV routines, each element of the result vector is defined by an
expression of the form
.EQ I
y sub i <- alpha a sub i* sup T x ~+~ beta y sub i ,
.EN 
where $a sub i* sup T$ denotes the $i sup th$ row of $A$.  
(For the triangular matrix routines
_TRMV, _TBMV, and _TPMV, we have $ alpha $ = 1 and $ beta $ = 0). This expression may be
regarded as a simple inner product 
$y sub i ~ <- ~ u sup T v$
by writing
$u sup T ~=~ ( alpha a sub i* sup T , ~y sub i ), ~~ v sup T ~=~ (x sup T , beta )$
The absolute error in the computed inner-product $ y sub i $ is
bounded by
.EQ I
 | y hat sub i ~-~ y sub i | ~ <= ~
n epsilon  | ^u | sup T  | ^v |,
.EN
where  $ epsilon $ is the relative machine precision, and  $ |u| sup T $ denotes the vector
$ (|u  sub 1 |,|u  sub 2 |,...,|u sub n  |) sup T $
(see Golub and van Loan [8], p.36 ). In our tests we
have also to allow for the errors introduced in the multiplication by $ alpha $.
On the other hand the above bound is usually a substantial over-estimate.
We use the following semi-empirical approach.
.PP
For each element $y$ of the result, the program computes the test ratio:
.sp
.EQ I
{ | y hat sub i - y sub i | } over { epsilon  ~ | ^u | sup T  | ^v| }
.EN
.sp
with $u$ and $v$ defined as above.
This is compared with a constant threshold
value, which is defined in the data file. Test ratios greater than the
threshold are flagged as "suspect". On the basis of experience a threshold
value of 16 is recommended (the largest value observed on a variety of
machines has been 11.5). The precise value is not critical. Errors in the
routines are most likely to be errors in array indexing, which will almost
certainly lead to a totally wrong result. A more subtle potential error is
the use of a single precision variable in a double precision computation.
This is likely to lead to a loss of half the machine precision. The test
program regards a test ratio greater than $ epsilon sup {- \(12}$ as a fatal error.
.PP
The R and R2 routines are checked in a similar way. Each element of the
result is regarded as an inner-product of length 2 or 3:
.EQ I
a sub ij ~ <- ~ (a sub ij , alpha x sub i ) sup T left (^
pile { 1 above {y sub j} } right )
.EN
or
.EQ I
a sub ij ~ <- ~ (a sub ij , alpha x sub i , alpha y sub i ) sup T
left ( ^pile {1 above {y sub j } above {x sub j }} right )  .
.EN
The SV routines are checked as follows: if $y ~=~ T sup -1 x$ is the exact result,
and 
$y hat $ is the computed result, then the test program computes 
$x hat  ~=~ T { y hat } $ and
compares it with $x$ as above. Thus the test ratio is
.EQ I
{ | x hat sub i - x sub i | } over { epsilon | ^t sub i* | sup T ~| ^x sub i | } ,
.EN
where 
$t sub i* $ denotes the $i sup th$ row of T.
Theoretically, the test ratio should
involve the condition number of T with respect to inversion, but the test
program generates well-conditioned triangular matrices and, in practice
the test ratios observed for these routines are no larger than for the
others.
.SH 
References
.sp
.IP [1] 
W.R. Cowell and C.P. Thompson, 
.I
Transforming Fortran DO Loops to
Improve Performance on Vector Architectures,
.R
Argonne National Laboratory Report ANL-85-63.
.sp
.IP  [2] 
C. Daly and J.J. Du Croz, 
.I
Performance of a Subroutine Library on
Vector-Processing Machines,
.R
Computer Physics Communications, 37, pp 181-186 (1985).
.sp
.IP [3] 
J.J. Dongarra, J.J. Du Croz, S. Hammarling and R.J. Hanson, 
.I
An Extended Set of Fortran Basic Linear Algebra Subprograms,
.R
Argonne National Laboratory Report ANL-MCS-TM-41 (Revision 3) (1986).
.sp 
.IP [4] 
J.J. Dongarra and S.C. Eisenstat, 
.I
Squeezing the Most out of an Algorithm in CRAY Fortran,
.R
ACM Trans. Math. Software, 10, pp 221-230 (1984).
.sp
.IP [5] 
J.J. Dongarra, F.G. Gustavson and A. Karp, 
.I
Implementing Linear Algorithms for Dense Matrices on a Vector Pipeline Machine,
.R
SIAM Review, 26, pp 91-112 (1984).
.sp 
.IP [6] 
J.J. Dongarra, L. Kaufman and S. Hammarling, 
.I
Squeezing the Most out of Eigenvalue Solvers on High-Performance Computers,
.R
Lin. Alg.  Applic., 77, pp 113-136 (1986).
.sp
.IP [7] 
J.J. Dongarra and D.C. Sorensen, 
.I
Linear Algebra on High-Performance Computers 
.R
Proceedings Parallel Computing 85, Edited by U. Schendel, 
pp. 3-32, North Holland Pub. 1986.
.sp
.IP [8] 
G.H. Golub and C.F. Van Loan, 
.I
Matrix Computations,
.R
The Johns Hopkins University Press, Baltimore, Maryland (1983).
.bp
.B
.SH
 Appendix: Installation Notes
.SH 
 A1. Installing the model implementation
.PP 
The subprograms fall into four sets according to the data-type of the
matrices and vectors: REAL, COMPLEX, DOUBLE PRECISION and COMPLEX*16
(subprogram names beginning with S, C, D and Z, respectively). Choose
which set or sets are to be installed.
.PP 
Examine the auxiliary subprograms XERBLA and LSAME (which are independent
of the data-type) and consider whether they need to be modified.
.PP 
The subprogram XERBLA is called when one of the Level 2 BLAS detects an
illegal value of one of its arguments. The version supplied with the model
implementation writes a message to the standard output channel, for
example
.sp
.ps 9
.cs 1 24
.nf
   ** On entry to STRSV  parameter number  3 had an illegal value.
.cs 1
.fi
.ps 11
.sp 
and then executes a STOP statement. Installers may wish to redirect the
error message to a different output channel, or to replace the STOP
statement by a call to system-specific exception-handling or trace-back
mechanisms.
.PP 
The logical function LSAME is used to perform all character-comparison in
the Level 2 BLAS in a case-insensitive manner. For example, the expression

.ps 8
.cs 1 24
.nf 
         LSAME(UPLO, 'U')
 
.ps 11
.cs 1
.fi
is equivalent to
 
.ps 8
.cs 1 24
.nf 
         (UPLO.EQ.'U').OR.(UPLO.EQ.'u').
 
.ps 11
.cs 1
.fi
.PP
The supplied version works correctly on all systems that use the ASCII
code for internal representation of characters. For systems that use the
EBCDIC code, one constant must be changed. For CDC systems with 6-12 bit
representation, alternative code is provided in comments. Any of the
versions work correctly on all systems if only upper-case characters are
passed as arguments.
.PP
Compile the chosen sets of subprograms, together with LSAME and XERBLA,
and create an object-library.
.SH 
A2. Testing the model implementation
.PP 
Select the test program or programs corresponding to the data-types
handled by the subprograms that have been installed.
.PP
An annotated example of a data-file for the program can be obtained by
editing the comments at the start of the main program. This defines the
names and unit numbers of the output files, various parameters of the
tests, and the names of those subprograms which are to be tested. The
data-file for the REAL routines is illustrated below.
The first 18 records are read using list-directed input,
the last 16 records are read using the format ( A6, L2 ).

.ps 8
.cs 1 24
.nf 
   Record no.  Record contents
 
          1    'SBLAT2.SUMM'     NAME OF SUMMARY OUTPUT FILE
          2    6        UNIT NUMBER OF SUMMARY FILE
          3    'SBLAT2.SNAP'     NAME OF SNAPSHOT OUTPUT FILE
          4    -1       UNIT NUMBER OF SNAPSHOT FILE (NOT USED IF .LT. 0)
          5    F     LOGICAL, T TO REWIND SNAPSHOT FILE AFTER EACH RECORD.
          6    F     LOGICAL, T TO STOP ON FAILURES.
          7    T     LOGICAL, T TO TEST ERROR EXITS.
          8    16.0  THRESHOLD VALUE OF TEST RATIO
          9    6                 NUMBER OF VALUES OF N
         10    0 1 2 3 5 9       VALUES OF N
         11    4                 NUMBER OF VALUES OF K
         12    0 1 2 4           VALUES OF K
         13    4                 NUMBER OF VALUES OF INCX AND INCY
         14    1 2 -1 -2         VALUES OF INCX AND INCY
         15    3                 NUMBER OF VALUES OF ALPHA
         16    0.0 1.0 0.7       VALUES OF ALPHA
         17    3                 NUMBER OF VALUES OF BETA
         18    0.0 1.0 0.9       VALUES OF BETA
         19    SGEMV  T PUT F FOR NO TEST. SAME COLUMNS.
         20    SGBMV  T PUT F FOR NO TEST. SAME COLUMNS.
         21    SSYMV  T PUT F FOR NO TEST. SAME COLUMNS.
         22    SSBMV  T PUT F FOR NO TEST. SAME COLUMNS.
         23    SSPMV  T PUT F FOR NO TEST. SAME COLUMNS.
         24    STRMV  T PUT F FOR NO TEST. SAME COLUMNS.
         25    STBMV  T PUT F FOR NO TEST. SAME COLUMNS.
         26    STPMV  T PUT F FOR NO TEST. SAME COLUMNS.
         27    STRSV  T PUT F FOR NO TEST. SAME COLUMNS.
         28    STBSV  T PUT F FOR NO TEST. SAME COLUMNS.
         29    STPSV  T PUT F FOR NO TEST. SAME COLUMNS.
         30    SGER   T PUT F FOR NO TEST. SAME COLUMNS.
         31    SSYR   T PUT F FOR NO TEST. SAME COLUMNS.
         32    SSPR   T PUT F FOR NO TEST. SAME COLUMNS.
         33    SSYR2  T PUT F FOR NO TEST. SAME COLUMNS.
         34    SSPR2  T PUT F FOR NO TEST. SAME COLUMNS.
.cs 1
.ps 11
.fi
.PP
Change the 1st record of the data-file, if necessary, to ensure that the
file-name is legal on your system. No other changes to the data-file
should be necessary before an initial run of the test program, but some
changes may be needed to ensure that the tests are sufficiently thorough
(see below).
.PP 
The data-file is read from unit NIN, which is set to 5 in a PARAMETER
statement in the main program: change this if necessary.
.PP 
Compile the test program, link in the required subprograms, and run the
program.
.PP 
Note: the test programs include an alternative version of the auxiliary
subprogram XERBLA which is needed to test the error-exits from the Level 2
BLAS. On some systems special action must be taken to ensure that this
version is linked in to the test program. If the model implementation of
XERBLA is used, the test program will stop after writing an error-message
from XERBLA.
.PP 
The following table gives the approximate times taken to run the test
programs, using the supplied data-file and the model implementation of the
subprograms, on various machines:
 
.KS
.TS
center;
l l l l l l l
l n n n n n l.
	S	C	D	Z
 
CRAY-1S	0.2	0.2	0.5	-	minutes
DEC VAX-11/750	3	4.5	4	8
NSC 32032 PC*	8	10	17	20
Compaq Deskpro 286**	12	44	24	50
.TE
.ps 8
* DSI-32 coprocessor 10 Mhz; Green Hills Fortran; compiler options
-01 -02 -X71; RAM disk for I/O.
.br
** Lahey F77L V.2.10, 80287 - 5 Mhz, 80286 - 8 Mhz.
.ps 11

.KE
.PP
If the tests using the supplied data-file are completed successfully,
consider whether the tests have been sufficiently thorough. For example on
a machine with vector-registers, at least one value of N greater than the
length of a vector-register should be used, otherwise important parts of
the compiled code may not be exercised by the tests.
.PP 
The tests may fail, with either "suspect results" or "fatal errors."
Suspect results - with a test ratio slightly greater than the threshold -
are probably caused by anomalies in floating-point arithmetic on the
machine; if this explanation is considered to be sufficient, increase the
value of the threshold specified in the data-file. Fatal errors most
probably indicate a compilation error or corruption of the source-text. An
error detected by the system - for example, an array-subscript out of
bounds, or use of an unassigned variable - is almost certainly due to the
same causes. If the system does not provide adequate post-mortem
information about the error, the snapshot file can give a little help (see
section A5).
.SH 
A3. Testing a specialized implementation
.PP 
Proceed initially as described in section A2.
.PP
If the implementation does not use an error-handling subprogram XERBLA,
compatible with the model implementation, then the data-file must be
modified to suppress the testing of error-exits.
.PP
Consider very carefully what changes need to be made to the data-file, to
ensure that the implementation has been thoroughly tested. For example, if
the technique of loop-unrolling is applied, make sure that sufficient
values of N are used to test all the clean-up code; if ALPHA .EQ. -1.0 is
treated as a special case, add -1.0 to the values of ALPHA.
.SH
A4. Changing the parameters of the tests
.PP 
The values supplied in the data-file must satisfy certain restrictions,
defined by the following symbolic constants in the test program:

.TS 
center;
c c c
l l n.
Name	Meaning	Value
 
NIDMAX	Maximum number of values of N	9
NKBMAX	Maximum number of values of K	7
NINMAX	Maximum number of values of INCX, INCY	7
NALMAX	Maximum number of values of ALPHA	7
NBEMAX	Maximum number of values of BETA	7
NMAX	Maximum value of N	65
INCMAX	Maximum value of ABS(INCX), ABS(INCY)	2
.TE
 
If necessary, modify the PARAMETER statements that define these symbolic
constants.
.SH 
A5. The History or Snapshot File
.PP 
The main output file from the test program contains a concise report on
the success or failure of the tests of each routine and the reasons for
failure if it occurs. Optionally, the program writes to a separate file a
one-line record giving details of the arguments in each call of a Level 2
BLAS subprogram, e.g.,
 
.cs 1 22
.nf
        25: STRSV ('U', 'T', 'U',  3,  A,  4,  X, -1)
.cs 1
.fi
 
(The number 25 indicates that this is the 25th call of STRSV). The record
is written immediately before the routine is called.
.PP 
As a cumulative "history" file, this enables the user to monitor which
tests are passed successfully before a failure occurs. Moreover, if an
exception occurs in the Level 2 BLAS routine (e.g. array bound error or
division by zero), the last record written to the file should give details
of the call that caused the exception. However, on some systems the output
buffers are not emptied when a program is terminated abnormally.
Therefore, the program has an option to rewind the file after each record
is written in order to force emptying of the buffer: in this mode the file
presents a one-line "snapshot" of the current or most recent call to a
Level 2 BLAS routine.

.