.EQ delim @@ .EN .TL .ps 12 .in 0 Squeezing the Most out of an Algorithm\h'.35i' .br in CRAY Fortran\h'.35i' .AU .ps 11 .in 0 Jack J. Dongarra\|@size -1 {"" sup \(dg}@\h'.15i' .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' .AU .ps 11 .in 0 Stanley C. Eisenstat\|@size -1 {"" sup \(dd}@\h'.20i' .AI .ps 10 .in 0 Department of Computer Science and\h'.25i' Research Center for Scientific Computation\h'.20i' Yale University\h'.20i' .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Work supported in part by the Applied Mathematical Sciences Research Program (KC-04-02) of the Office of Energy Research of the U. S. Department of Energy under Contract W-31-109-Eng-38. .sp .25v @size -1 {"" sup \(dd}@\|Work supported in part by the Office of Naval Research under contract N00014-82-K-0184 and by the National Science Foundation under grant MCS-81-04874. .FE .QS .ps 10 .in .25i .ll -.25i .I Abstract \(em This paper describes a technique for achieving super-vector performance on a CRAY-1 in a purely Fortran environment (i.e., without resorting to assembler language). The technique can be applied to a wide variety of algorithms in linear algebra, and is beneficial in other architectural settings. .in .ll .QE .nr PS 11 .nr VS 16 .nr PD 0.5v .SH Introduction .PP There are three basic performance levels on the CRAY-1 \(em \fIscalar, vector, and super-vector\fP [4]: .KS .TS box center; c | c. Performance Level Rate of Execution* _ _ Scalar 0\(mi4 MFLOPS Vector 4\(mi50 MFLOPS Super-Vector 50\(mi160 MFLOPS .TE .KE .FS .ps 9 .vs 11p .ll .sp .25v *\|MFLOPS is an acronym for Million FLoating-point OPerations (additions or multiplications) per Second. .FE The difference between scalar and vector modes is the use of vector instructions to eliminate loop overhead and take full advantage of the pipelined functional units. The difference between vector and super-vector modes is the use of vector registers to reduce the number of memory references (and thus avoid letting the one path to/from memory become a bottleneck). .PP Typically, programs written in Fortran run at scalar or vector speeds, so that one must resort to assembler language (or assembler language kernels) to improve performance. In this paper, we describe a technique for attaining super-vector speeds \fIfrom Fortran\fP.@size -1 {"" sup \(dg}@ .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|We recognize that assembler code may be needed to achieve the highest level of performance, and that its use in a small number of `kernels' is not a significant barrier to transportability. However, the approach presented does lead to high levels of performance, is portable, and can be used to derive \fIalgorithmic\fP improvements in a much wider class of problems than discussed in this paper. .FE .SH The Ideal Setting@size -1 {"" sup \(dd}@ .FS .ps 9 .vs 11p .sp .25v @size -1 {"" sup \(dd}@\|See [4] for a more complete discussion. .FE .PP Most algorithms in linear algebra are easily vectorized. For example, consider the following subroutine which adds the product of a matrix and a vector to another vector: .DS B .TS l l. SUBROUTINE SMXPY (N1,Y,N2,LDM,X,M) REAL Y(\(**), X(\(**), M(LDM,\(**) DO 20 J = 1, N2 DO 10 I = 1, N1 Y(I) = Y(I) + X(J)\(**M(I,J) 10 CONTINUE 20 CONTINUE RETURN END .TE .DE The innermost loop is a SAXPY [5] (adding a multiple of one vector to another) and would be detected by a good vectorizing compiler. Thus, the Cray CFT Fortran compiler generates vector code of the general form: .DS .I .TS center; l. Load vector \fRY\fP Load scalar \fRX(J)\fP Load vector \fRM(\(**,J)\fP Multiply scalar \fRX(J)\fP times vector \fRM(\(**,J)\fP Add result to vector \fRY\fP Store result in \fRY\fP .TE .R .DE .PP Note that there are \fIthree\fP vector memory references for each \fItwo\fP vector floating-point operations. Since there is only one path to/from memory and the memory bandwidth is 80 million words per second, the rate of execution cannot exceed \s-2\(ap\s053\ 1/3 MFLOPS (less than 50 MFLOPS when vector startup time is taken into account) \(em \fIvector performance\fP. .PP Thus to attain super-vector performance, it is necessary to expand the scope of the vectorizing process to more than just simple vector operations. In this case, a closer inspection reveals that the vector Y is stored and then reloaded in successive SAXPY's. If instead we accumulate Y in a vector register (up to 64 words at a time) until all of the columns of M have been processed, we can avoid two of the three vector memory references in the innermost loop. The maximum rate of execution is then 160 MFLOPS (\s-2\(ap\s0148 MFLOPS when vector startup time is taken into account) \(em \fIsuper-vector performance\fP. .SH Reality .PP The Cray CFT compiler does not detect the fact that the result can be accumulated in a register (and not stored between successive vector operations). Thus, the rate of execution is limited to vector speeds. .PP But if we unroll [1] the outer loop (in this case to a depth of four) and insert parentheses to force the arithmetic operations to be performed in the most efficient order, then the innermost loop becomes: .DS B .TS l l. DO 10 I = 1, N1 Y(I) = ((((Y(I)) + X(J\(mi3)\(**M(I,J\(mi3)) + X(J\(mi2)\(**M(I,J\(mi2)) $ + X(J\(mi1)\(**M(I,J\(mi1)) + X(J) \(**M(I,J) 10 CONTINUE .TE .DE Now the code generated by CFT has \fIsix\fP vector memory references for each \fIeight\fP vector floating-point operations. Thus the maximum rate of execution is \s-2\(ap\s0106\ 2/3 MFLOPS (\s-2\(ap\s0100 MFLOPS when vector startup time is taken into account) and the actual rate is \s-2\(ap\s077 MFLOPS \(em \fIsuper-vector performance from Fortran\fP. The complete subroutine SMXPY4 is given in Appendix I. .SH Generalizations .PP With this approach we can develop quite a collection of procedures from linear algebra. The key idea is to use two kernels \(em SMXPY and SXMPY (add a vector times a matrix to another vector; see Appendix II) \(em to do the bulk of the work. Since both kernels can be unrolled@size -1 {"" sup \(dg}@ to give super-vector performance, the procedures themselves are capable of super-vector performance. .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|Although there are only eight vector registers, this is sufficient for any depth of unrolling. .FE .PP Many processes which involve elementary transformations can be described in these terms, e.g., matrix multiplication, Cholesky decomposition, and LU factorization (see Appendix III and [4, 6]). However, the formulation is often not the ``natural'' one, which may be based on outer-products of vectors or accumulating variable-length vectors, neither of which can be super-vectorized in Fortran. .PP Tables 1-3 below summarize the results obtained for these procedures on a CRAY 1-S (as well as on the new CRAY 1-M@size -1 {"" sup \(dg}@ and CRAY X-MP@size -1 {"" sup \(dd}@) when the subroutines SMXPY and SXMPY were unrolled to the specified depth. All runs used the CFT 1.11 Fortran compiler. .FS .ps 9 .vs 11p .sp .25v @size -1 {"" sup \(dg}@\|The CRAY 1-M is essentially a CRAY 1-S with ``slow'' memory. It is faster in these tests because of a chaining anomaly \(em a vector load issues earlier on the CRAY 1-S, causing a scalar-vector multiply to miss chain-slot time. .sp .25v @size -1 {"" sup \(dd}@\|The CRAY X-MP is a multiprocessor, each processor having a cycle time of 9.5 ns (vs. 12.5 ns for the CRAY 1-S) and three paths to/from memory (two for vector loads, one for vector stores). These timings were obtained using only one processor. While in principle the extra paths should remove the memory bottleneck, in practice the unrolled code still runs faster because there are fewer vector start-ups and less memory traffic (and thus fewer bank conflicts). .FE .sp 1 .KS .ce \fITable 1\fP\^: 300 \(mu 300 Matrix Multiplication .TS box center; c | c s s c | c s s c | c | c | c c | c | c | c n | n | n | n. Unrolled MFLOPS _ Depth CRAY 1-M CRAY 1-S CRAY X-MP _ _ _ _ 1 39 40 106 2 60 53 151 4 83 72 161 8 101 86 170 16 111 96 177 .TE .KE .sp 1 .KS .ce \fITable 2\fP\^: 300 \(mu 300 Cholesky Decomposition .R .TS box center; c | c s s c | c s s c | c | c | c c | c | c | c n | n | n | n. Unrolled MFLOPS _ Depth CRAY 1-M CRAY 1-S CRAY X-MP _ _ _ _ 1 31 33 68 2 48 45 99 4 67 60 118 8 81 70 131 16 86 78 139 .TE .KE .sp 1 .KS .ce \fITable 3a\fP\^: 300 \(mu 300 LU Decomposition with Pivoting .R .TS box center; c | c s s c | c s s c | c | c | c c | c | c | c n | n | n | n. Unrolled MFLOPS _ Depth CRAY 1-M CRAY 1-S CRAY X-MP _ _ _ _ 1 28 29 56 2 42 39 78 4 56 52 93 8 66 60 103 16 69 66 108 .TE .KE .sp 1 .KS .ce 2 \fITable 3b\fP\^: 300 \(mu 300 LU Decomposition with Pivoting (Using an Assembler Language Implementation of ISAMAX@size -1 {"" sup \(dg}@) .R .TS box center; c | c s s c | c s s c | c | c | c c | c | c | c n | n | n | n. Unrolled MFLOPS _ Depth CRAY 1-M CRAY 1-S CRAY X-MP _ _ _ _ 1 30 32 62 2 46 43 96 4 64 59 117 8 78 68 129 16 83 76 136 .TE .KE .FS .ps 9 .vs 11p @size -1 {"" sup \(dg}@\|The search for the maximum element in the pivot column (ISAMAX [5]) does not vectorize and thus limits performance. These times were obtained using an assembler language implementation of ISAMAX. .FE .sp 2 By contrast, 30 MFLOPS is often cited as a ``good rate for Fortran'' on the CRAY 1-S [3] and 100 MFLOPS as a ``good rate for CAL (Cray Assembler Language)'' [3] (e.g., Fong and Jordan [4] report 107 MFLOPS for an assembler language implementation of LU decomposition with pivoting). .SH Conclusion .PP We have described a technique that can produce significant gains in execution speed on the CRAY-1.@size -1 {"" sup \(dd}@ .FS .ps 9 .vs 11p .ll .sp .25v @size -1 {"" sup \(dd}@\|See [2] for another approach. .FE Moreover, to the extent that this approach reduces loop overhead and takes advantage of segmented functional units, it will be effective on more conventional computers as well as on other ``super-computer'' architectures. And since optimized assembler language implementations of the SMXPY and SXMPY kernels are easy to code (as much so as any kernel) and frequently available, one can get most of the advantages of assembler language while programming in Fortran. .SH ACKNOWLEDGMENTS .PP We would like to thank the National Magnetic Fusion Energy Computer Center for providing computer time to carry out some of the experiments, Cray Research for their cooperation, and Alan Hinds for many stimulating discussions on code optimization. .sp .nr VS 13 .SH REFERENCES .sp .5v .IP [1] 4 J. J. Dongarra and A. R. Hinds, ``Unrolling loops in Fortran,'' .I Software\(emPractice and Experience .R \fB9\fP (1979), 219-229. .sp .5v .IP [2] 4 Iain S. Duff ``The solution of sparse linear equations on the CRAY-1,'' .I CRAY Channels .R \fB4:3\fP (1982), 4-9. .sp .5v .IP [3] 4 I. S. Duff and J. K. Reid, ``Experience of Sparse Matrix Codes on the CRAY-1,'' .I Computer Physics Communications .R \fB26\fP (1982), 293-302. .sp .5v .IP [4] 4 Kirby Fong and Thomas L. Jordan, .I Some Linear Algebra Algorithms and Their Performance on the CRAY-1, .R Los Alamos Scientific Laboratory, UC-32, June 1977. .sp .5v .IP [5] 4 C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, ``Basic Linear Algebra Subprograms for Fortran Usage,'' .I ACM Transactions on Mathematical Software .R \fB5\fP (1979), 308-371. .sp .5v .IP [6] 4 D. A. Orbits and D. A. Calahan, .I Data Flow Considerations in Implementing a Full Matrix Solver with Backing Store on the CRAY-1, .R Report #98, Systems Engineering Laboratory, University of Michigan, September 1976. .LP .ps 10 .sp 2 .ss 24 .cs 1 24 .lg 0 .nf .sp 2 .bp .B APPENDIX I .R SUBROUTINE SMXPY4 (N1, Y, N2, LDM, X, M) REAL Y(*), X(*), M(LDM,*) C C PURPOSE: C Multiply matrix M times vector X and add the result to vector Y. C C PARAMETERS: C C N1 INTEGER, number of elements in vector Y, and number of rows in C matrix M C C Y REAL(N1), vector of length N1 to which is added the product M*X C C N2 INTEGER, number of elements in vector X, and number of columns C in matrix M C C LDM INTEGER, leading dimension of array M C C X REAL(N2), vector of length N2 C C M REAL(LDM,N2), matrix of N1 rows and N2 columns C C ---------------------------------------------------------------------- C C Cleanup odd vector C J = MOD(N2,2) IF (J .GE. 1) THEN DO 10 I = 1, N1 Y(I) = (Y(I)) + X(J)*M(I,J) 10 CONTINUE ENDIF C C Cleanup odd group of two vectors C J = MOD(N2,4) IF (J .GE. 2) THEN DO 20 I = 1, N1 Y(I) = ( (Y(I)) $ + X(J-1)*M(I,J-1)) + X(J)*M(I,J) 20 CONTINUE ENDIF C C Main loop - groups of four vectors C JMIN = J+4 DO 40 J = JMIN, N2, 4 DO 30 I = 1, N1 Y(I) = ((( (Y(I)) $ + X(J-3)*M(I,J-3)) + X(J-2)*M(I,J-2)) $ + X(J-1)*M(I,J-1)) + X(J) *M(I,J) 30 CONTINUE 40 CONTINUE C RETURN END .bp .B APPENDIX II .R SUBROUTINE SMXPY (N1, Y, N2, LDM, X, M) REAL Y(*), X(*), M(LDM,*) C C PURPOSE: C Multiply matrix M times vector X and add the result to vector Y. C C PARAMETERS: C C N1 INTEGER, number of elements in vector Y, and number of rows in C matrix M C C Y REAL(N1), vector of length N1 to which is added the product M*X C C N2 INTEGER, number of elements in vector X, and number of columns C in matrix M C C LDM INTEGER, leading dimension of array M C C X REAL(N2), vector of length N2 C C M REAL(LDM,N2), matrix of N1 rows and N2 columns C C ---------------------------------------------------------------------- C DO 20 J = 1, N2 DO 10 I = 1, N1 Y(I) = (Y(I)) + X(J)*M(I,J) 10 CONTINUE 20 CONTINUE C RETURN END .bp SUBROUTINE SXMPY (N1, LDY, Y, N2, LDX, X, LDM, M) REAL Y(LDY,*), X(LDX,*), M(LDM,*) C C PURPOSE: C Multiply row vector X times matrix M and add the result to row C vector Y. C C PARAMETERS: C C N1 INTEGER, number of columns in row vector Y, and number of C columns in matrix M C C LDY INTEGER, leading dimension of array Y C C Y REAL(LDY,N1), row vector of length N1 to which is added the C product X*M C C N2 INTEGER, number of columns in row vector X, and number of C rows in matrix M C C LDX INTEGER, leading dimension of array X C C X REAL(LDX,N2), row vector of length N2 C C LDM INTEGER, leading dimension of array M C C M REAL(LDM,N1), matrix of N2 rows and N1 columns C C ---------------------------------------------------------------------- C DO 20 J = 1, N2 DO 10 I = 1, N1 Y(1,I) = (Y(1,I)) + X(1,J)*M(J,I) 10 CONTINUE 20 CONTINUE C RETURN END .bp .B APPENDIX III .R SUBROUTINE MM (A, LDA, N1, N3, B, LDB, N2, C, LDC) REAL A(LDA,*), B(LDB,*), C(LDC,*) C C PURPOSE: C Multiply matrix B times matrix C and store the result in matrix A. C C PARAMETERS: C C A REAL(LDA,N3), matrix of N1 rows and N3 columns C C LDA INTEGER, leading dimension of array A C C N1 INTEGER, number of rows in matrices A and B C C N3 INTEGER, number of columns in matrices A and C C C B REAL(LDB,N2), matrix of N1 rows and N2 columns C C LDB INTEGER, leading dimension of array B C C N2 INTEGER, number of columns in matrix B, and number of rows in C matrix C C C C REAL(LDC,N3), matrix of N2 rows and N3 columns C C LDC INTEGER, leading dimension of array C C C ---------------------------------------------------------------------- C DO 20 J = 1, N3 DO 10 I = 1, N1 A(I,J) = 0.0 10 CONTINUE CALL SMXPY (N2,A(1,J),N1,LDB,C(1,J),B) 20 CONTINUE C RETURN END .bp SUBROUTINE LLT (A, LDA, N, ROWI, INFO) REAL A(LDA,*), ROWI(*), T C C PURPOSE: t C Form the Cholesky factorization A = L*L of a symmetric positive C definite matrix A with factor L overwriting A. C C PARAMETERS: C C A REAL(LDA,N), matrix to be decomposed; only the lower triangle C need be supplied, the upper triangle is not referenced C C LDA INTEGER, leading dimension of array A C C N INTEGER, number of rows and columns in the matrix A C C ROWI REAL(N), work array C C INFO INTEGER, = 0 for normal return C = I if I-th leading minor is not positive definite C C ---------------------------------------------------------------------- C INFO = 0 DO 30 I = 1, N C C Subtract multiples of preceding columns from I-th column of A C DO 10 J = 1, I-1 ROWI(J) = -A(I,J) 10 CONTINUE CALL SMXPY (N-I+1,A(I,I),I-1,LDA,ROWI,A(I,1)) C C Test for non-positive definite leading minor C IF (A(I,I) .LE. 0.0) THEN INFO = I GO TO 40 ENDIF C C Form I-th column of L C T = 1.0/SQRT(A(I,I)) A(I,I) = T DO 20 J = I+1, N A(J,I) = T*A(J,I) 20 CONTINUE 30 CONTINUE 40 RETURN END .bp SUBROUTINE LU (A, LDA, N, IPVT, INFO) INTEGER IPVT(*) REAL A(LDA,*), T C C PURPOSE: C Form the LU factorization of A, where L is lower triangular and U C is unit upper triangular, with the factors L and U overwriting A. C C PARAMETERS: C C A REAL(LDA,N), matrix to be factored C C LDA INTEGER, leading dimension of the array A C C N INTEGER, number of rows and columns in the matrix A C C IPVT INTEGER(N), sequence of pivot rows C C INFO INTEGER, = 0 normal return. C = J if L(J,J) is zero (whence A is singular) C C ---------------------------------------------------------------------- C INFO = 0 DO 40 J = 1, N C C Form J-th column of L C CALL SMXPY (N-J+1,A(J,J),J-1,LDA,A(1,J),A(J,1)) C C Search for pivot C T = ABS(A(J,J)) K = J DO 10 I = J+1, N IF (ABS(A(I,J)) .GT. T) THEN T = ABS(A(I,J)) K = I END IF 10 CONTINUE IPVT(J) = K C C Test for zero pivot C IF (T .EQ. 0.0) THEN INFO = J GO TO 50 ENDIF C C Interchange rows C DO 20 I = 1, N T = A(J,I) A(J,I) = A(K,I) A(K,I) = T 20 CONTINUE C C Form J-th row of U C A(J,J) = 1.0/A(J,J) CALL SXMPY (N-J,LDA,A(J,J+1),J-1,LDA,A(J,1),LDA,A(1,J+1)) T = -A(J,J) DO 30 I = J+1, N A(J,I) = T*A(J,I) 30 CONTINUE 40 CONTINUE C 50 RETURN END .bp .B APPENDIX IV .R SUBROUTINE LLTS (A, LDA, N, X, B) REAL A(LDA,*), X(*), B(*), XK C C PURPOSE: C Solve the symmetric positive definite system Ax = b given the C Cholesky factorization of A (as computed in LLT). C C PARAMETERS: C C A REAL(LDA,N), matrix which has been decomposed by routine LLT C in preparation for solving a system of equations C C LDA INTEGER, leading dimension of array A C C N INTEGER, number of rows and columns in the matrix A C C X REAL(N), solution of linear system C C B REAL(N), right-hand-side of linear system C C ---------------------------------------------------------------------- C DO 10 K = 1, N X(K) = B(K) 10 CONTINUE C DO 30 K = 1, N XK = X(K)*A(K,K) DO 20 I = K+1, N X(I) = X(I) - A(I,K)*XK 20 CONTINUE X(K) = XK 30 CONTINUE C DO 50 K = N, 1, -1 XK = X(K)*A(K,K) DO 40 I = 1, K-1 X(I) = X(I) - A(K,I)*XK 40 CONTINUE X(K) = XK 50 CONTINUE C RETURN END .bp SUBROUTINE LUS (A, LDA, N, IPVT, X, B) INTEGER IPVT(*) REAL A(LDA,*), X(*), B(*), XK C C PURPOSE: C Solve the linear system Ax = b given the LU factorization of A (as C computed in LU). C C PARAMETERS: C C A REAL(LDA,N), matrix which has been decomposed by routine LU C in preparation for solving a system of equations C C LDA INTEGER, leading dimension of the array A C C N INTEGER, number of rows and columns in the matrix A C C IPVT INTEGER(N), sequence of pivot rows C C X REAL(N), solution of linear system C C B REAL(N), right-hand-side of linear system C C ---------------------------------------------------------------------- C DO 10 K = 1, N X(K) = B(K) 10 CONTINUE C DO 20 K = 1, N L = IPVT(K) XK = X(L) X(L) = X(K) X(K) = XK 20 CONTINUE C DO 40 K = 1, N XK = X(K)*A(K,K) DO 30 I = K+1, N X(I) = X(I) - A(I,K)*XK 30 CONTINUE X(K) = XK 40 CONTINUE C DO 60 K = N, 1, -1 XK = X(K) DO 50 I = 1, K-1 X(I) = X(I) + A(I,K)*XK 50 CONTINUE 60 CONTINUE C RETURN END .