.EQ     
delim @@
.EN
.TL
.ps 12
.in 0
Squeezing the Most out of an Algorithm\h'.35i'     
.br
in CRAY Fortran\h'.35i'     
.AU
.ps 11
.in 0
Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i'    
.AI
.ps 10
.in 0
Mathematics and Computer Science Division\h'.20i'   
Argonne National Laboratory\h'.20i'   
.AU
.ps 11
.in 0
Stanley C. Eisenstat\|@size -1 {"" sup \(dd}@\h'.20i'   
.AI
.ps 10
.in 0
Department of Computer Science and\h'.25i'  
Research Center for Scientific Computation\h'.20i'   
Yale University\h'.20i'   
.FS
.ps 9
.vs 11p
@size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical
Sciences Research Program (KC-04-02) of the Office of Energy Research of the 
U. S. Department of Energy under Contract W-31-109-Eng-38.
.sp .25v
@size -1 {"" sup \(dd}@\|Work supported in part by the Office of Naval Research 
under contract N00014-82-K-0184 and by the National Science Foundation under 
grant MCS-81-04874.
.FE
.QS
.ps 10
.in .25i
.ll -.25i
.I Abstract
\(em This paper describes a technique for achieving super-vector performance on
a CRAY-1 in a purely Fortran environment (i.e., without resorting to assembler
language).
The technique can be applied to a wide variety of algorithms in linear algebra,
and is beneficial in other architectural settings.
.in 
.ll 
.QE

.nr PS 11
.nr VS 16
.nr PD 0.5v
.SH
Introduction
.PP
There are three basic performance levels on the CRAY-1 \(em \fIscalar, vector, 
and super-vector\fP [4]:
.KS
.TS
box center;
c | c.
Performance Level	Rate of Execution*
_	_
Scalar	0\(mi4 MFLOPS
Vector	4\(mi50 MFLOPS
Super-Vector	50\(mi160 MFLOPS
.TE
.KE
.FS
.ps 9
.vs 11p
.ll
.sp .25v
*\|MFLOPS is an acronym for Million FLoating-point OPerations (additions or
multiplications) per Second.
.FE
The difference between scalar and vector modes is the use of vector instructions
to eliminate loop overhead and take full advantage of the pipelined functional
units.
The difference between vector and super-vector modes is the use of vector
registers to reduce the number of memory references (and thus avoid letting the
one path to/from memory become a bottleneck).
.PP
Typically, programs written in Fortran run at scalar or vector speeds, so that
one must resort to assembler language (or assembler language kernels) to improve
performance.
In this paper, we describe a technique for attaining super-vector speeds \fIfrom
Fortran\fP.@size -1 {"" sup \(dg}@
.FS  
.ps 9
.vs  11p  
@size -1 {"" sup \(dg}@\|We recognize that assembler code may be needed to
achieve the highest level of performance, and that its use in a small number of
`kernels' is not a significant barrier to transportability.
However, the approach presented does lead to high levels of performance, is
portable, and can be used to derive \fIalgorithmic\fP improvements in a much
wider class of problems than discussed in this paper.
.FE


.SH
The Ideal Setting@size -1 {"" sup \(dd}@   
.FS  
.ps 9
.vs  11p  
.sp .25v
@size -1 {"" sup \(dd}@\|See [4] for a more complete discussion.
.FE
.PP
Most algorithms in linear algebra are easily vectorized.
For example, consider the following subroutine which adds the product of a
matrix and a vector to another vector:
.DS B
.TS
l l.
	  SUBROUTINE SMXPY  (N1,Y,N2,LDM,X,M)
	  REAL  Y(\(**),  X(\(**),  M(LDM,\(**) 
	  DO  20  J = 1, N2
	       DO  10  I = 1, N1
	            Y(I) = Y(I) + X(J)\(**M(I,J)  
10	       CONTINUE
20	  CONTINUE
	  RETURN
	  END
.TE
.DE
The innermost loop is a SAXPY [5] (adding a multiple of one vector to another) 
and would be detected by a good vectorizing compiler.  
Thus, the Cray CFT Fortran compiler generates vector code of the general form:
.DS  
.I
.TS
center;
l.
Load vector \fRY\fP  
Load scalar \fRX(J)\fP  
Load vector \fRM(\(**,J)\fP  
Multiply scalar \fRX(J)\fP times vector \fRM(\(**,J)\fP  
Add result to vector \fRY\fP  
Store result in \fRY\fP  
.TE
.R
.DE
.PP
Note that there are \fIthree\fP vector memory references for each \fItwo\fP
vector floating-point operations.
Since there is only one path to/from memory and the memory bandwidth is 80
million words per second, the rate of execution cannot exceed \s-2\(ap\s053\ 1/3
MFLOPS (less than 50 MFLOPS when vector startup time is taken into account) \(em
\fIvector performance\fP.
.PP 
Thus to attain super-vector performance, it is necessary to expand the scope of
the vectorizing process to more than just simple vector operations.
In this case, a closer inspection reveals that the vector Y is stored and then
reloaded in successive SAXPY's.
If instead we accumulate Y in a vector register (up to 64 words at a time) until
all of the columns of M have been processed, we can avoid two of the three
vector memory references in the innermost loop.
The maximum rate of execution is then 160 MFLOPS (\s-2\(ap\s0148 MFLOPS when
vector startup time is taken into account) \(em \fIsuper-vector performance\fP.

.SH
Reality
.PP
The Cray CFT compiler does not detect the fact that the result can be
accumulated in a register (and not stored between successive vector operations).
Thus, the rate of execution is limited to vector speeds.
.PP
But if we unroll [1] the outer loop (in this case to a depth of four) and insert
parentheses to force the arithmetic operations to be performed in the most
efficient order, then the innermost loop becomes:
.DS B
.TS
l l.
	       DO  10  I = 1, N1
	            Y(I) = ((((Y(I)) + X(J\(mi3)\(**M(I,J\(mi3)) + X(J\(mi2)\(**M(I,J\(mi2))  
	 $                               + X(J\(mi1)\(**M(I,J\(mi1)) + X(J)   \(**M(I,J)  
10	       CONTINUE
.TE
.DE
Now the code generated by CFT has \fIsix\fP vector memory references for each
\fIeight\fP vector floating-point operations.
Thus the maximum rate of execution is \s-2\(ap\s0106\ 2/3 MFLOPS (\s-2\(ap\s0100
MFLOPS when vector startup time is taken into account) and the actual rate is
\s-2\(ap\s077 MFLOPS \(em \fIsuper-vector performance from Fortran\fP.
The complete subroutine SMXPY4 is given in Appendix I.

.SH
Generalizations
.PP
With this approach we can develop quite a collection of procedures from linear
algebra.
The key idea is to use two kernels \(em SMXPY and SXMPY (add a vector times a
matrix to another vector; see Appendix II) \(em to do the bulk of the work.
Since both kernels can be unrolled@size -1 {"" sup \(dg}@ to give super-vector
performance, the procedures themselves are capable of super-vector performance.
.FS  
.ps 9
.vs  11p  
@size -1 {"" sup \(dg}@\|Although there are only eight vector registers, this is
sufficient for any depth of unrolling.
.FE 
.PP
Many processes which involve elementary transformations can be described in
these terms, e.g., matrix multiplication, Cholesky decomposition, and LU
factorization (see Appendix III and [4, 6]).
However, the formulation is often not the ``natural'' one, which may be based on
outer-products of vectors or accumulating variable-length vectors, neither of
which can be super-vectorized in Fortran.
.PP
Tables 1-3 below summarize the results obtained for these procedures on a CRAY
1-S (as well as on the new CRAY 1-M@size -1 {"" sup \(dg}@ and CRAY 
X-MP@size -1 {"" sup \(dd}@) when the subroutines SMXPY and SXMPY were unrolled 
to the specified depth.
All runs used the CFT 1.11 Fortran compiler.
.FS  
.ps 9
.vs  11p  
.sp .25v
@size -1 {"" sup \(dg}@\|The CRAY 1-M is essentially a CRAY 1-S with ``slow''
memory.
It is faster in these tests because of a chaining anomaly \(em a vector load
issues earlier on the CRAY 1-S, causing a scalar-vector multiply to miss
chain-slot time.
.sp .25v
@size -1 {"" sup \(dd}@\|The CRAY X-MP is a multiprocessor, each processor
having a cycle time of 9.5 ns (vs. 12.5 ns for the CRAY 1-S) and three paths
to/from memory (two for vector loads, one for vector stores).  
These timings were obtained using only one processor.
While in principle the extra paths should remove the memory bottleneck, in
practice the unrolled code still runs faster because there are fewer vector
start-ups and less memory traffic (and thus fewer bank conflicts).
.FE 
.sp 1
.KS
.ce
\fITable 1\fP\^:  300 \(mu 300 Matrix Multiplication 
.TS
box center;
c | c   s   s
c | c   s   s
c | c | c | c
c | c | c | c
n | n | n | n.
Unrolled 	MFLOPS
	_
Depth	CRAY 1-M	CRAY 1-S	CRAY X-MP
_	_	_	_
1	39	40	106
2	60	53	151
4	83	72	161
8	101	86	170
16	111	96	177
.TE
.KE
.sp 1
.KS
.ce
\fITable 2\fP\^:  300 \(mu 300 Cholesky Decomposition
.R
.TS
box center;
c | c   s   s
c | c   s   s
c | c | c | c
c | c | c | c
n | n | n | n.
Unrolled 	MFLOPS
	_
Depth	CRAY 1-M	CRAY 1-S	CRAY X-MP
_	_	_	_
1	31	33	68
2	48	45	99
4	67	60	118
8	81	70	131
16	86	78	139
.TE
.KE
.sp 1
.KS
.ce
\fITable 3a\fP\^:  300 \(mu 300 LU Decomposition with Pivoting  
.R
.TS
box center;
c | c   s   s
c | c   s   s
c | c | c | c
c | c | c | c
n | n | n | n.
Unrolled 	MFLOPS
	_
Depth	CRAY 1-M	CRAY 1-S	CRAY X-MP
_	_	_	_
1	28	29	56
2	42	39	78
4	56	52	93
8	66	60	103
16	69	66	108
.TE
.KE
.sp 1
.KS
.ce 2
\fITable 3b\fP\^:  300 \(mu 300 LU Decomposition with Pivoting  
(Using an Assembler Language Implementation of ISAMAX@size -1 {"" sup \(dg}@)
.R
.TS
box center;
c | c   s   s
c | c   s   s
c | c | c | c
c | c | c | c
n | n | n | n.
Unrolled 	MFLOPS
	_
Depth	CRAY 1-M	CRAY 1-S	CRAY X-MP
_	_	_	_
1	30	32	62
2	46	43	96
4	64	59	117
8	78	68	129
16	83	76	136
.TE
.KE
.FS  
.ps 9
.vs  11p  
@size -1 {"" sup \(dg}@\|The search for the maximum element in the pivot column
(ISAMAX [5]) does not vectorize and thus limits performance.
These times were obtained using an assembler language implementation of ISAMAX.
.FE 
.sp 2
By contrast, 30 MFLOPS is often cited as a ``good rate for Fortran'' on the CRAY
1-S [3] and 100 MFLOPS as a ``good rate for CAL (Cray Assembler Language)'' [3]
(e.g., Fong and Jordan [4] report 107 MFLOPS for an assembler language
implementation of LU decomposition with pivoting).

.SH
Conclusion
.PP
We have described a technique that can produce significant gains in execution
speed on the CRAY-1.@size -1 {"" sup \(dd}@ 
.FS
.ps 9
.vs 11p
.ll
.sp .25v
@size -1 {"" sup \(dd}@\|See [2] for another approach.
.FE
Moreover, to the extent that this approach reduces loop overhead and takes
advantage of segmented functional units, it will be effective on more
conventional computers as well as on other ``super-computer'' architectures.
And since optimized assembler language implementations of the SMXPY and SXMPY
kernels are easy to code (as much so as any kernel) and frequently available,
one can get most of the advantages of assembler language while programming in
Fortran. 

.SH
ACKNOWLEDGMENTS
.PP
We would like to thank the National Magnetic Fusion Energy Computer Center for
providing computer time to carry out some of the experiments, Cray Research for
their cooperation, and Alan Hinds for many stimulating discussions on code
optimization.
.sp

.nr VS 13
.SH
REFERENCES
.sp .5v  
.IP [1] 4
J. J. Dongarra and A. R. Hinds,
``Unrolling loops in Fortran,''
.I
Software\(emPractice and Experience 
.R
\fB9\fP (1979),  219-229.
.sp .5v  
.IP [2] 4
Iain S. Duff
``The solution of sparse linear equations on the CRAY-1,''
.I
CRAY Channels
.R
\fB4:3\fP (1982),  4-9.
.sp .5v  
.IP [3] 4
I. S. Duff and J. K. Reid,
``Experience of Sparse Matrix Codes on the CRAY-1,''
.I
Computer Physics Communications
.R
\fB26\fP (1982), 293-302.
.sp .5v  
.IP [4] 4  
Kirby Fong and Thomas L. Jordan, 
.I 
Some Linear Algebra Algorithms and Their Performance on the CRAY-1,
.R 
Los Alamos Scientific Laboratory, UC-32, June 1977.
.sp .5v 
.IP [5] 4 
C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, 
``Basic Linear Algebra Subprograms for Fortran Usage,'' 
.I 
ACM Transactions on Mathematical Software 
.R 
\fB5\fP (1979), 308-371.  
.sp .5v
.IP [6] 4 
D. A. Orbits and D. A. Calahan, 
.I 
Data Flow Considerations in Implementing a Full Matrix Solver with Backing Store
on the CRAY-1, 
.R 
Report #98, Systems Engineering Laboratory, University of Michigan, September
1976.

.LP
.ps 10
.sp 2
.ss 24
.cs 1 24
.lg 0
.nf
.sp 2
.bp
.B
APPENDIX  I
.R


      SUBROUTINE SMXPY4 (N1, Y, N2, LDM, X, M)
      REAL Y(*), X(*), M(LDM,*)
C
C   PURPOSE:
C     Multiply matrix M times vector X and add the result to vector Y.
C
C   PARAMETERS:
C
C     N1 INTEGER, number of elements in vector Y, and number of rows in
C         matrix M
C
C     Y REAL(N1), vector of length N1 to which is added the product M*X
C
C     N2 INTEGER, number of elements in vector X, and number of columns
C         in matrix M
C
C     LDM INTEGER, leading dimension of array M
C
C     X REAL(N2), vector of length N2
C
C     M REAL(LDM,N2), matrix of N1 rows and N2 columns
C
C ----------------------------------------------------------------------
C
C   Cleanup odd vector
C
      J = MOD(N2,2)
      IF (J .GE. 1) THEN
         DO 10 I = 1, N1
            Y(I) = (Y(I)) + X(J)*M(I,J)
   10    CONTINUE
      ENDIF
C
C   Cleanup odd group of two vectors
C
      J = MOD(N2,4)
      IF (J .GE. 2) THEN
         DO 20 I = 1, N1
            Y(I) = ( (Y(I))
     $                + X(J-1)*M(I,J-1)) + X(J)*M(I,J)
   20    CONTINUE
      ENDIF
C
C   Main loop - groups of four vectors
C
      JMIN = J+4
      DO 40 J = JMIN, N2, 4
         DO 30 I = 1, N1
            Y(I) = ((( (Y(I))
     $             + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2))
     $             + X(J-1)*M(I,J-1)) + X(J)  *M(I,J)
   30    CONTINUE
   40 CONTINUE
C
      RETURN
      END
.bp
.B
APPENDIX  II
.R


      SUBROUTINE SMXPY (N1, Y, N2, LDM, X, M)
      REAL Y(*), X(*), M(LDM,*)
C
C   PURPOSE:
C     Multiply matrix M times vector X and add the result to vector Y.
C
C   PARAMETERS:
C
C     N1 INTEGER, number of elements in vector Y, and number of rows in
C         matrix M
C
C     Y REAL(N1), vector of length N1 to which is added the product M*X
C
C     N2 INTEGER, number of elements in vector X, and number of columns
C         in matrix M
C
C     LDM INTEGER, leading dimension of array M
C
C     X REAL(N2), vector of length N2
C
C     M REAL(LDM,N2), matrix of N1 rows and N2 columns
C
C ----------------------------------------------------------------------
C
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            Y(I) = (Y(I)) + X(J)*M(I,J)
   10    CONTINUE
   20 CONTINUE
C
      RETURN
      END
.bp
      SUBROUTINE SXMPY (N1, LDY, Y, N2, LDX, X, LDM, M)
      REAL Y(LDY,*), X(LDX,*), M(LDM,*)
C
C   PURPOSE:
C     Multiply row vector X times matrix M and add the result to row
C     vector Y.
C
C   PARAMETERS:
C
C     N1 INTEGER, number of columns in row vector Y, and number of
C         columns in matrix M
C
C     LDY INTEGER, leading dimension of array Y
C
C     Y REAL(LDY,N1), row vector of length N1 to which is added the
C         product X*M
C
C     N2 INTEGER, number of columns in row vector X, and number of
C         rows in matrix M
C
C     LDX INTEGER, leading dimension of array X
C
C     X REAL(LDX,N2), row vector of length N2
C
C     LDM INTEGER, leading dimension of array M
C
C     M REAL(LDM,N1), matrix of N2 rows and N1 columns
C
C ----------------------------------------------------------------------
C
      DO 20 J = 1, N2
         DO 10 I = 1, N1
            Y(1,I) = (Y(1,I)) + X(1,J)*M(J,I)
   10    CONTINUE
   20 CONTINUE
C
      RETURN
      END
.bp
.B
APPENDIX  III
.R


      SUBROUTINE MM (A, LDA, N1, N3, B, LDB, N2, C, LDC)
      REAL A(LDA,*), B(LDB,*), C(LDC,*)
C
C   PURPOSE:
C     Multiply matrix B times matrix C and store the result in matrix A.
C
C   PARAMETERS:
C
C     A REAL(LDA,N3), matrix of N1 rows and N3 columns
C
C     LDA INTEGER, leading dimension of array A
C
C     N1 INTEGER, number of rows in matrices A and B
C
C     N3 INTEGER, number of columns in matrices A and C
C
C     B REAL(LDB,N2), matrix of N1 rows and N2 columns
C
C     LDB INTEGER, leading dimension of array B
C
C     N2 INTEGER, number of columns in matrix B, and number of rows in
C         matrix C
C
C     C REAL(LDC,N3), matrix of N2 rows and N3 columns
C
C     LDC INTEGER, leading dimension of array C
C
C ----------------------------------------------------------------------
C
      DO 20 J = 1, N3
         DO 10 I = 1, N1
            A(I,J) = 0.0
   10    CONTINUE
         CALL SMXPY (N2,A(1,J),N1,LDB,C(1,J),B)
   20 CONTINUE
C
      RETURN
      END
.bp
      SUBROUTINE LLT (A, LDA, N, ROWI, INFO)
      REAL A(LDA,*), ROWI(*), T
C
C   PURPOSE:                                  t
C     Form the Cholesky factorization  A = L*L   of a symmetric positive
C     definite matrix A with factor L overwriting A.
C
C   PARAMETERS:
C
C     A REAL(LDA,N), matrix to be decomposed; only the lower triangle
C         need be supplied, the upper triangle is not referenced
C
C     LDA INTEGER, leading dimension of array A
C
C     N INTEGER, number of rows and columns in the matrix A
C
C     ROWI REAL(N), work array
C
C     INFO INTEGER, = 0 for normal return
C                   = I if I-th leading minor is not positive definite
C
C ----------------------------------------------------------------------
C
      INFO = 0
      DO 30 I = 1, N
C
C      Subtract multiples of preceding columns from I-th column of A
C
         DO 10 J = 1, I-1
            ROWI(J) = -A(I,J)
   10    CONTINUE
         CALL SMXPY (N-I+1,A(I,I),I-1,LDA,ROWI,A(I,1))
C
C      Test for non-positive definite leading minor
C
         IF (A(I,I) .LE. 0.0) THEN
            INFO = I
            GO TO 40
         ENDIF
C
C      Form I-th column of L
C
         T = 1.0/SQRT(A(I,I))
         A(I,I) = T
         DO 20 J = I+1, N
            A(J,I) = T*A(J,I)
   20    CONTINUE
   30 CONTINUE
   40 RETURN
      END
.bp
      SUBROUTINE LU (A, LDA, N, IPVT, INFO)
      INTEGER IPVT(*)
      REAL A(LDA,*), T
C
C   PURPOSE:
C     Form the LU factorization of A, where L is lower triangular and U
C     is unit upper triangular, with the factors L and U overwriting A.
C
C   PARAMETERS:
C
C     A REAL(LDA,N), matrix to be factored
C
C     LDA INTEGER, leading dimension of the array A
C
C     N INTEGER, number of rows and columns in the matrix A
C
C     IPVT INTEGER(N), sequence of pivot rows
C
C     INFO INTEGER, = 0 normal return.
C                   = J if L(J,J) is zero (whence A is singular)
C
C ----------------------------------------------------------------------
C
      INFO = 0
      DO 40 J = 1, N
C
C      Form J-th column of L
C
         CALL SMXPY (N-J+1,A(J,J),J-1,LDA,A(1,J),A(J,1))
C
C      Search for pivot
C
         T = ABS(A(J,J))
         K = J
         DO 10 I = J+1, N
            IF (ABS(A(I,J)) .GT. T) THEN
               T = ABS(A(I,J))
               K = I
            END IF
   10    CONTINUE
         IPVT(J) = K
C
C      Test for zero pivot
C
         IF (T .EQ. 0.0) THEN
            INFO = J
            GO TO 50
         ENDIF
C
C      Interchange rows
C
         DO 20 I = 1, N
            T = A(J,I)
            A(J,I) = A(K,I)
            A(K,I) = T
   20    CONTINUE
C
C      Form J-th row of U
C
         A(J,J) = 1.0/A(J,J)
         CALL SXMPY (N-J,LDA,A(J,J+1),J-1,LDA,A(J,1),LDA,A(1,J+1))
         T = -A(J,J)
         DO 30 I = J+1, N
            A(J,I) = T*A(J,I)
   30    CONTINUE
   40 CONTINUE
C
   50 RETURN
      END
.bp
.B
APPENDIX  IV
.R


      SUBROUTINE LLTS (A, LDA, N, X, B)
      REAL A(LDA,*), X(*), B(*), XK
C
C   PURPOSE:
C     Solve the symmetric positive definite system Ax = b given the
C     Cholesky factorization of A (as computed in LLT).
C
C   PARAMETERS:
C
C     A REAL(LDA,N), matrix which has been decomposed by routine LLT
C         in preparation for solving a system of equations 
C
C     LDA INTEGER, leading dimension of array A
C
C     N INTEGER, number of rows and columns in the matrix A
C
C     X REAL(N), solution of linear system
C
C     B REAL(N), right-hand-side of linear system
C
C ----------------------------------------------------------------------
C
      DO 10 K = 1, N
         X(K) = B(K)
   10 CONTINUE
C
      DO 30 K = 1, N
         XK = X(K)*A(K,K)
         DO 20 I = K+1, N
            X(I) = X(I) - A(I,K)*XK
   20    CONTINUE
         X(K) = XK
   30 CONTINUE
C
      DO 50 K = N, 1, -1
         XK = X(K)*A(K,K)
         DO 40 I = 1, K-1
            X(I) = X(I) - A(K,I)*XK
   40    CONTINUE
         X(K) = XK
   50 CONTINUE
C
      RETURN
      END
.bp
      SUBROUTINE LUS (A, LDA, N, IPVT, X, B)
      INTEGER IPVT(*)
      REAL A(LDA,*), X(*), B(*), XK
C
C   PURPOSE:
C     Solve the linear system Ax = b given the LU factorization of A (as
C     computed in LU).
C
C   PARAMETERS:
C
C     A REAL(LDA,N), matrix which has been decomposed by routine LU
C         in preparation for solving a system of equations 
C
C     LDA INTEGER, leading dimension of the array A
C
C     N INTEGER, number of rows and columns in the matrix A
C
C     IPVT INTEGER(N), sequence of pivot rows
C
C     X REAL(N), solution of linear system
C
C     B REAL(N), right-hand-side of linear system
C
C ----------------------------------------------------------------------
C
      DO 10 K = 1, N
         X(K) = B(K)
   10 CONTINUE
C
      DO 20 K = 1, N
         L = IPVT(K)
         XK = X(L)
         X(L) = X(K)
         X(K) = XK
   20 CONTINUE
C
      DO 40 K = 1, N
         XK = X(K)*A(K,K)
         DO 30 I = K+1, N
            X(I) = X(I) - A(I,K)*XK
   30    CONTINUE
         X(K) = XK
   40 CONTINUE
C
      DO 60 K = N, 1, -1
         XK = X(K)
         DO 50 I = 1, K-1
            X(I) = X(I) + A(I,K)*XK
   50    CONTINUE
   60 CONTINUE
C
      RETURN
      END

.