\" tbl file | eqn | nitroff -ms
.hw spars-ity
.ft 3
.ps 11
.EQ
gsize 11
delim @@
.EN
.TL
.ps 12
.in 0
Model Implementation and Test Package for
.sp
the Sparse Basic Linear Algebra Subprograms
.AU
.ps 11
.in 0
David S. Dodson
.AI
.ps 10
.in 0
Convex Computer Corporation
701 N. Plano Road
Richardson, Texas 75081
.AU
.ps 11
.in 0
Roger G. Grimes
.AI
.ps 10
.in 0
Boeing Computer Services, M/S 7L-21
P.O. Box 24346
Seattle, Washington 98124-0346
.AU
.ps 11
.in 0
John G. Lewis
.AI
.ps 10
.in 0
Boeing Computer Services, M/S 7L-21
P.O. Box 24346
Seattle, Washington 98124-0346
.FS
.ps 10
.vs 11p
Typeset on \*(DY.
.FE
.sp 2
.QS
.ps 10
.in .25i
.ll -.25i
.I Abstract
\(em This paper describes a model implementation and test software for the Sparse Basic Linear Algebra Subprograms (Sparse BLAS).
The Sparse BLAS perform vector operations common in sparse linear algebra, with the goal of providing efficient, but portable, implementations of algorithms for high-performance computers.
The model implementation provides a portable set of FORTRAN 77 Sparse BLAS for use on machines where specially tuned implementations do not exist or are not required.
The test software is designed to verify that tuned implementations meet the specifications of the Sparse BLAS and that implementations are correctly installed.
.in
.ll
.QE
.ps 11
.vs 16
.nr PS 11
.nr VS 16
.nr PD 0.5v
.NH
Introduction
.PP
In [2] we defined the basic operations, along with naming conventions and argument lists, for a set of extensions to the Basic Linear Algebra Subprograms [4,5].
These extensions perform vector operations common in sparse linear algebra.
They provide a standard framework around which modular, portable, and efficient FORTRAN 77 codes may be developed for many problems addressed by sparse linear algebra.
We hope that tuned (perhaps assembly language) implementations of these routines will be developed for many machines, especially for vector and other high-performance computers.
Thus, programs that call the Sparse BLAS can be efficient across a wide range of machines.
.PP
To support and encourage the use of the Sparse BLAS, this algorithm contains two software components:
.IP (1)
A model implementation of the Sparse BLAS in ANSI standard FORTRAN 77.
This enables the Sparse BLAS to be used on any machine, regardless of whether a tuned implementation exists.
It is described in Section 2.
.IP (2)
Test programs designed to ensure that implementations adhere to the definition and have been installed correctly.
The test programs are described in Section 4.
.NH
The Model Implementation
.NH 2
Efficiency
.PP
The model implementation is likely to achieve considerable efficiency on a scalar computer with a good optimizing compiler.
For example, with the highest level of optimization, the IBM FORTRAN H compiler achieves nearly the greatest possible asymptotic computational rate for all of the Sparse BLAS on IBM's scalar 370 and 30xx architectures.
There are, however, scalar computers for which the compilers obtain less of the potential speed of the machine; the final paragraph of this section discusses what is essentially such an example.
.PP
Satisfactory efficiency can be expected for the model implementation of _DOT-I and _GTHR on a vector computer with a good vectorizing compiler if the underlying hardware has vector instructions for the required indirect addressing.
For example, with CFT 1.14 on CRAY X-MP processors having the hardware scatter-gather feature, the model implementation of SGTHR achieves 70 percent of the asymptotic rate of a tuned assembly language version.
.PP
On the other hand, the unmodified model implementation of _AXPYI, _GTHRZ, _ROTI, and _SCTR will perform only at scalar speed on a vector processor or at sequential speed on a parallel processor.
The possibility that entries in INDX are repeated prevents the compiler from determining that there are no conflicts.
That these subroutines are used only in a safe environment cannot be communicated in standard FORTRAN 77.
The remedies, often nothing more than special compiler directives, are discussed in \(sc3.
.PP
Finally, on vector processors that do not have hardware instructions for handling indirect addressing in vector mode, the efficiency obtained with the model implementation will depend largely on the ability of the FORTRAN compiler to generate optimal, perhaps scalar, code.
This is well illustrated on the CRAY-1S, where the model implementation of CDOTCI compiled with CFT 1.13 achieves only 25 percent of the performance of an optimized assembly language version [7].
.NH 2
Language standards
.PP
The model implementation of the Sparse BLAS is written entirely in portable ANSI standard FORTRAN 77 with one exception.
The routines that require a double precision complex data type (names beginning with Z) use the following extensions to standard FORTRAN:
.IP
COMPLEX*16 type specification statements;
.IP
a DCONJG intrinsic function whose argument and result are of type COMPLEX*16; and
.IP
COMPLEX*16 constants formed by enclosing a pair of double precision constants in parentheses.
.NH
Notes on Implementation
.PP
Here we offer some advice to those planning to develop a machine-specific implementation of the Sparse BLAS.
The following possibilities should be considered:
.IP (a)
Use machine-specific extensions to FORTRAN, such as array syntax, compiler directives, or calls to library routines.
.IP (b)
Code the routines in assembly language.
.PP
Vectorizing or parallelizing FORTRAN compilers will detect apparent recurrences in some of the routines where an indirectly-addressed array is used as both an input and an output argument.
For example, for the loop in SAXPYI,
.KS
.TS
center;
r1 l.
	DO 10 I = 1, NZ
	Y(INDX(I)) = A * X(I) + Y(INDX(I))
10	CONTINUE
.TE
.KE
a vectorizing or parallelizing compiler will be unable to create optimal code for this loop unless it is informed that the values in INDX are distinct.
In that case the loop iterations are independent and can be processed in vector or parallel mode.
Optimizing compilers usually allow the programmer some machine-specific method, such as a compiler directive, for conveying this information.
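.PP
For illustration, the loop above might be preceded by a directive that asserts the independence of the iterations.
The form shown below is the Cray CFT directive CDIR$ IVDEP (``ignore vector dependencies''); other compilers provide analogous, but differently spelled, mechanisms:
.KS
.TS
center;
r1 l.
	CDIR$ IVDEP
	DO 10 I = 1, NZ
	Y(INDX(I)) = A * X(I) + Y(INDX(I))
10	CONTINUE
.TE
.KE
With the directive in place, the compiler is free to gather, compute, and scatter in vector mode; correctness remains the caller's responsibility, since the directive is simply a promise that the values in INDX are distinct.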
.PP
Even when the indirectly-addressed array is used only as an output argument, a parallelizing compiler will recognize possible storage hazards in which different values might be stored in the same array element.
The compiler will then inhibit parallelization because the result might be unpredictable if the loop iterations were not performed sequentially.
Thus, for the loop in SSCTR,
.KS
.TS
center;
r1 l.
	DO 10 I = 1, NZ
	Y(INDX(I)) = X(I)
10	CONTINUE
.TE
.KE
a parallelizing compiler can distribute the loop iterations among the processors, allowing them to be performed in any order, only if it is told that the indices are distinct.
.PP
Because certain routines impose the restriction that the values in INDX must be distinct, all of the DO loops in the model implementation are safe for vectorization or parallelization.
Thus, a vectorizing or parallelizing FORTRAN compiler that nevertheless leaves DO loops in the model implementation unoptimized should be overruled in the appropriate manner.
Since different compilers choose different mechanisms for controlling optimization, the use of the Sparse BLAS offers easier transportability than the use of in-line code because the modifications for different machines are clearly isolated.
.PP
In many applications of sparse linear algebra, the sparse vectors manipulated by the Sparse BLAS are the rows or columns of sparse matrices.
Even when a sparse matrix is huge, the number of nonzero elements in any one row or column is commonly small.
Therefore, it is important that the Sparse BLAS be as fast as possible for small values of NZ.
The asymptotic performance achieved as @roman NZ ~->~ inf@ is of less importance than in dense linear algebra.
Hence, in a tuned implementation, it might be desirable to use different code, separately optimized for small and large NZ, switching from one to the other at an appropriate threshold.
.PP
The particular form of a tuned implementation is especially important for high-performance computers.
These computers often obtain their speed by imposing restrictions on data access patterns that conflict with the generality of the sparse index patterns.
For such machines, the implementor should be aware that there are often multiple ways to realize the Sparse BLAS.
To illustrate some of the issues, we discuss the implementation of one of the Sparse BLAS on three different vector architectures.
.PP
Most sparse Gaussian elimination factorization algorithms spend the vast majority of their execution time in the Sparse BLAS _AXPYI loop:
.KS
.TS
center;
r1 l.
	DO 10 I = 1, NZ
	Y(INDX(I)) = A * X(I) + Y(INDX(I))
10	CONTINUE
.TE
.KE
Dembart and Neves [1] analyzed seven different formulations of this loop on the CDC STAR 100, a memory-to-memory vector machine with limited indirect addressing capability, and determined that there were combinations of vector length and vector density for which each of the seven formulations was the fastest.
Similar analyses by those authors showed corresponding results for the CYBER 203 and 205, although the ratios changed.
For reasonable combinations of vector lengths and densities, the most important of the seven formulations was the same on all three machines, and can be presented as a sequence of calls to two Sparse BLAS and one of the original BLAS:
.KS
.TS
center;
l.
CALL SGTHR (NZ, Y, TEMPY, INDX)
CALL SAXPY (NZ, A, X, 1, TEMPY, 1)
CALL SSCTR (NZ, TEMPY, INDX, Y)
.TE
.KE
This uses a temporary array and two data movement operations, but takes advantage of the vector arithmetic units for the numerical operations.
.PP
This SGTHR-SAXPY-SSCTR method is often the correct form for vector register computers with gather-scatter hardware.
On those machines the SGTHR-SAXPY-SSCTR method can be implemented as a single function consisting of an indirectly-addressed load into a vector register, followed by a SAXPY operation between the vector register and the compressed data, and completed with an indirectly-addressed store.
On such machines the only difference in performance between _AXPYI and _AXPY is accounted for by the need to access the index array and the use of indirect addressing in two of the three numerical memory transfers.
A CRAY X-MP, for example, can perform a SAXPYI operation at an asymptotic rate of 88 million floating-point operations per second (MFLOPS), compared with 192 MFLOPS for a SAXPY operation.
In [6], Lewis and Simon present performance figures comparing such a tuned implementation on a CRAY X-MP with an untuned implementation in the context of sparse Cholesky factorization.
.PP
In contrast, the CRAY-1 is a vector register machine without gather-scatter hardware.
The model implementation SAXPYI loop written in FORTRAN executes at a maximum rate of about 4 MFLOPS.
Woo and Levesque [8] analyzed the SGTHR-SAXPY-SSCTR approach, and showed that its maximum rate in assembly language was about 8 MFLOPS.
Alternatively, the fastest known assembly language implementation of the original loop uses only the scalar hardware, yet performs asymptotically at 13 MFLOPS, and is faster than the SGTHR-SAXPY-SSCTR method for every NZ [7].
.NH
The Test Programs
.PP
A separate test program exists for each of the four data types: REAL, DOUBLE PRECISION, COMPLEX, and COMPLEX*16.
All test programs conform to the same pattern with only the minimum necessary changes, so we shall talk generically about ``the test program'' in the singular.
The program has been designed not merely to check whether the model implementation has been installed correctly, but also to serve as a validation tool, and even as a modest debugging aid, for any tuned implementation.
The test program has the following features:
.IP
The parameters of the test problems are specified in a data file that can be modified easily to test specific vector lengths or for debugging.
.IP
The data for the test problems are generated internally and the results are checked internally.
.IP
The program checks that no arguments are changed by the routines except the expected output vector or vectors.
.IP
Null operations, i.e., those with negative or zero vector lengths, are checked.
.IP
The program generates a concise summary report on the tests.
.NH 2
Language standards
.PP
The Sparse BLAS test program is written entirely in portable ANSI standard FORTRAN 77, except that the test programs for the double precision complex routines use the following extensions to standard FORTRAN:
.IP
COMPLEX*16 type specification statements;
.IP
a DCMPLX intrinsic function that forms a COMPLEX*16 quantity from one or two DOUBLE PRECISION arguments;
.IP
a DCONJG intrinsic function that delivers the COMPLEX*16 complex conjugate of its COMPLEX*16 argument; and
.IP
COMPLEX*16 constants formed by enclosing a pair of double precision constants in parentheses.
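.PP
For illustration, a hypothetical fragment (not taken from the test program itself) that uses all of these extensions might read:
.KS
.TS
center;
l.
COMPLEX*16 ZA, ZB
DOUBLE PRECISION DR, DI
DR = 3.0D0
DI = 4.0D0
ZA = (1.0D0, -2.0D0)
ZB = DCONJG(ZA) + DCMPLX(DR, DI)
.TE
.KE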
.NH 2
Parameters of the Test Problems
.PP
Each test problem (i.e., each call of a subprogram to be tested) depends on a choice of values for the following parameters (where relevant to the particular subprogram):
.IP
the number of nonzeros, NZ;
.IP
the dimension, @n@, of the vector space containing @x@ and @y@;
.IP
the scalar @a@; and
.IP
for REAL and DOUBLE PRECISION, the scalars @c@ and @s@.
.PP
The values of these parameters are defined by a data file.
Specifically, the program reads in a set @S sub roman NZ@ of values of NZ, a set @S sub a@ of values of @a@, and a set @S sub cs@ of ordered pairs @(c,s)@ of Givens rotation parameters.
For each value of NZ, the values of @n@ are generated from NZ by @n ~ roman { = ~ 2 ~ max (NZ,1) }@.
(Although the relative density of vectors generated by these values of @n@ and NZ is atypical of sparse matrix applications, it is suitable for testing the logic, but not the performance, of a tuned implementation.)
.PP
The sets @S sub roman NZ@, @S sub a@, and @S sub cs@ should be chosen to exercise all segments of the code and all special or extreme cases such as @roman { NZ ~<~ 0 }@, @roman { NZ ~=~ 0 }@, @a ~ roman = ~ 0@, @(c,s) ~ roman = ~ (1,0)@, etc.
A data file that specifies sets of parameters suitable for many machines is supplied with the test program, but installers and implementors must be alert to the possible need to extend or modify them (see Appendix).
.NH 2
Data for the Test Problems
.PP
Data for the arrays X and Y are generated using simple expressions involving the SIN and COS intrinsic functions.
Data for the array INDX are generated by several schemes that should be sufficient to test the indirect addressing.
Elements in these arrays that are not supposed to be referenced by a subprogram (e.g., X(NZ+1), X(NZ+2), ..., INDX(NZ+1), INDX(NZ+2), ..., and the values of Y whose indices are not listed in INDX) are initialized to a rogue value to increase the likelihood that a reference to them will be detected.
The rogue value is @-10 sup 10@ for REAL and DOUBLE PRECISION quantities, (@-10 sup 10@,@-10 sup 10@) for COMPLEX and COMPLEX*16, and @-10 sup 7@ for INTEGER.
If an illegal subscript error is reported or a memory addressing error occurs, or if a fatal error is reported and an element of the computed result is of order @10 sup 10@, then the routine has almost certainly referenced the wrong element of an array.
.NH 2
Checking the Results
.PP
After calling each Sparse BLAS subprogram, the test program thoroughly checks its operation in two ways.
First, each of the arguments is checked to see if it has been changed.
This includes checking all elements of array arguments, including the supposedly unreferenced ones.
If any values other than the correct elements of the result vector or vectors have been changed, a fatal error is reported.
.PP
Second, the test program checks the result vector or vectors.
We expect exact results from the routines that merely move data without doing any floating-point arithmetic.
For subprograms that do use floating-point arithmetic to compute their results, exact agreement between the results and those computed by simple FORTRAN code is not expected, since the results are not necessarily computed by the same sequences of floating-point operations.
However, the differences are expected to be insignificant to working precision in the following precise sense.
.PP
In the _DOT-I routines, the absolute error in @w ~ roman = ~ x sup T y@ is bounded by
.EQ
|w hat~-~w|~<=~ roman NZ ~ epsilon |x| sup T |y|,
.EN
where @epsilon@ is the relative machine precision and @|x| sup T@ denotes the vector @(|x sub 1 |,|x sub 2 |,...,|x sub n |) sup T@ ([3], p. 36).
Since this bound is usually a substantial over-estimate, we use the following semi-empirical approach: the test ratio
.EQ
{|w hat~-~w|} over {epsilon |x| sup T |y|}
.EN
is compared with a constant threshold value that is defined in the data file.
Test ratios greater than the threshold are flagged as suspect.
On the basis of experience, a threshold value of 5 is recommended.
The precise value is not critical.
Errors in the routines are likely to be errors in array indexing, which will almost certainly lead to a totally incorrect result.
A more subtle potential error is the use of a single precision variable in a double precision computation.
This is likely to lead to a loss of about half the machine precision.
Therefore, the test program regards a test ratio greater than @epsilon sup {-1/2}@ as a fatal error.
.PP
In the _AXPYI routines, each component of the result vector is regarded as an inner product of length 2,
.EQ
y sub i ~ roman = ~ bold [ a~~~1 bold ] ~ left [ cpile { x sub i above y sub i } right ] ,
.EN
and the dot product error bound given above is applied componentwise.
Similarly, in the _ROTI routines, each component of the result vectors is regarded as an inner product of length 2:
.EQ
x sub i ~ roman = ~ bold [ c~~~s bold ] ~ left [ cpile { {x sub i} above {y sub i} } right ]
.EN
.EQ
y sub i ~ roman = ~ bold [ -s~~c bold ] ~ left [ cpile { {x sub i} above {y sub i} } right ] .
.EN
.SH
References
.IP [1]
B. Dembart and K.W. Neves, ``Sparse Triangular Factorization on Vector Computers,'' \fIExploring Applications of Parallel Processing,\fR Electric Power Research Institute, EL-566-QR, Palo Alto, California (1977), pp. 22-25.
.IP [2]
D.S. Dodson, R.G. Grimes, and J.G. Lewis, ``Sparse Extensions to the FORTRAN Basic Linear Algebra Subprograms,'' \fIthis issue.\fR
.IP [3]
G.H. Golub and C.F. Van Loan, \fIMatrix Computations,\fR The Johns Hopkins University Press, Baltimore, Maryland (1983).
.IP [4]
C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, ``Basic Linear Algebra Subprograms for FORTRAN Usage,'' \fIACM Transactions on Mathematical Software 5\fR, 3 (September 1979), pp. 308-323.
.IP [5]
C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, ``Algorithm 539: Basic Linear Algebra Subprograms for FORTRAN Usage,'' \fIACM Transactions on Mathematical Software 5\fR, 3 (September 1979), pp. 324-325.
.IP [6]
J.G. Lewis and H.D. Simon, ``The Impact of Hardware Gather/Scatter on Sparse Gaussian Elimination,'' \fIProceedings of the 1986 International Conference on Parallel Processing\fR (K. Hwang, S. Jacobs, and E. Swartzlander, eds.), IEEE Computer Society, Los Angeles (1986).
.IP [7]
\fIVectorPak Users Manual - CRAY Supplement,\fR Publication Number 20460-0501, Boeing Computer Services, Seattle (1986).
.IP [8]
P.T. Woo and J.M. Levesque, ``Benchmarking a Sparse Elimination Routine on the CYBER 205 and the CRAY-1,'' \fIProceedings of the Sixth SPE Symposium on Reservoir Simulation\fR (1982).
.bp
.SH
Appendix: Installation Notes
.SH
A1. Installing the Model Implementation
.PP
The subprograms fall into four sets according to the data type: REAL, DOUBLE PRECISION, COMPLEX, and COMPLEX*16 (subprogram names beginning with S, D, C, and Z, respectively).
If the FORTRAN compiler supports compiler directives to indicate where vectorization or parallelization is safe, these can be inserted in the source code at the places indicated in the comments.
Compile the sets to be installed and create an object library.
.SH
A2. Testing the Model Implementation
.PP
Select the test programs corresponding to the data types handled by the subprograms that have been installed.
.PP
An annotated example of a data file for the test program is contained in the comments at the start of the main program and can be extracted by editing.
The data file defines the name and unit number of the output file and various parameters of the tests.
The data file is read using list-directed input from unit NIN, which is set to 5 in a PARAMETER statement in the main program.
The data file for the REAL routines is illustrated below.
.KS
.ps 10
.vs 13
.TS
expand;
l l.
`SBLATS.SUMM'	NAME OF OUTPUT FILE
6	UNIT NUMBER OF OUTPUT FILE
100	MAX NUMBER OF ERROR MESSAGES
5.0	THRESHOLD VALUE OF TEST RATIO
16	NUMBER OF VALUES OF NZ
-1 0 1 2 5 9 31 32 33 63 64 65 127 128 129 257	VALUES OF NZ
3	NUMBER OF VALUES OF A
0.0 1.0 0.7	VALUES OF A
4	NUMBER OF VALUES OF (C,S)
1.0 0.0 -0.6 0.8	VALUES OF C
0.0 1.0 0.8 -0.6	VALUES OF S
.TE
.ps
.vs
.KE
.PP
Change the first record of the data file, if necessary, to ensure that the file name is legal on your system.
No other changes to the data file should be necessary before an initial run of the test program, although some changes may be needed to ensure that the tests are sufficiently thorough (see below).
.PP
Compile the test program, link in the required subprograms, and run the test program using the data file as input.
.PP
If the tests using the supplied data file complete successfully, consider whether the tests have been sufficiently thorough.
For example, on a machine with vector registers, at least one value of NZ greater than the length of a vector register should be used, or important parts of the compiled code may not be exercised by the tests.
.PP
The tests may fail with either suspect results or fatal errors.
Suspect results, where the test ratio is slightly greater than the threshold, are probably caused by anomalies in floating-point arithmetic on the machine; if this explanation is considered sufficient, increase the threshold value of the test ratio in the data file.
Fatal errors most probably indicate a compilation error or corruption of the source text.
An error detected by the system, e.g., an array subscript out of bounds or use of an uninitialized variable, is almost certainly due to the same causes.
.SH
A3. Testing a Tuned Implementation
.PP
Proceed initially as described in Section A2.
Consider very carefully what changes need to be made to the data file to ensure that the implementation has been thoroughly tested.
For example, if the technique of loop-unrolling is used, make certain that sufficient values of NZ are used to test all the clean-up code.
Similarly, if any values of A or (C,S) are treated specially, add them to the data file.
.SH
A4. Changing the Parameters of the Test
.PP
The values supplied in the data file must satisfy certain restrictions, dependent upon the following symbolic constants defined in PARAMETER statements in the main program:
.KS
.ps 10
.vs
.TS
center;
c c c
l l a.
Name	Meaning	Value
NNZMAX	Maximum number of values of NZ	24
NZMAX	Maximum value of NZ	320
NAMAX	Maximum number of values of A	7
NGMAX	Maximum number of values of C and S	7
.TE
.ps
.vs
.KE
If necessary, modify the PARAMETER statements that define these symbolic constants.
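.PP
For reference, the PARAMETER statements in question have the following general form (the exact layout in the main program may differ):
.KS
.TS
center;
l.
INTEGER NNZMAX, NZMAX, NAMAX, NGMAX
PARAMETER (NNZMAX = 24, NZMAX = 320, NAMAX = 7, NGMAX = 7)
.TE
.KE
For example, to test vector lengths up to 1024 on a machine with long vector registers, NZMAX would be raised to at least 1024; similarly, NNZMAX must be at least the number of values of NZ listed in the data file.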