.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 A Portable Environment for Developing Parallel Fortran Programs .ps 11 .sp 3 Jack J. Dongarra and Danny C. Sorensen .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 79 .sp .7i July, 1986 .pn 1 .he ''-%-'' .he '''' .in .bp .ft 3 .ps 11 .bp .LP .ds HT "\h'-.65m'\v'-.40m'^\v'.40m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define bl '\0' .EN .TL A Portable Environment for Developing Parallel Fortran Programs* .AU .ps 11 .in 0 J. J. Dongarra and D. C. Sorensen .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439-4844\h'.20i' .sp and\h'.20i' .sp Center for Supercomputing Research and Development\h'.20i' University of Illinois at Urbana-Champaign\h'.20i' 305 Talbot Laboratory\h'.20i' 104 South Wright Street\h'.20i' Urbana, Illinois 61801-2932\h'.20i' .nr PS 12 .nr VS 16 .nr PD 0.5v .FS * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contracts W-31-109-Eng-38, DE-AC05-840R21400, and DE-FG02-85ER25001. .FE .NH Introduction .PP Many new parallel computers are now emerging as commercial products[6]. Each of the vendors has provided its own parallel extensions to Fortran, and there is no agreement among any of them on which extensions should be included. There is not even an agreed-upon naming convention for extensions that have identical functionality. 
Program developers interested in producing implementations of parallel algorithms that will run on a number of different parallel machines are therefore faced with an overwhelming task. The process of developing portable parallel packages is complicated by factors beyond each computer manufacturer supplying its own, very different mechanism for parallel processing. A given implementation may require several different communicating parallel processes, perhaps with different levels of granularity. An efficient implementation may require the ability to start processes dynamically, perhaps many more than the number of physical processors in the system. This capability is either lacking or prohibitively expensive on most commercially available parallel computers. Instead, many of the manufacturers have limited themselves to providing one-level loop-based parallelism.
.PP
This paper describes an environment for the portable implementation of parallel algorithms in a Fortran setting. The package, called SCHEDULE, can help a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a manner that lends itself to transporting the resulting program across a wide variety of parallel machines. The package is designed to allow existing Fortran subroutines to be called through SCHEDULE without modification, thereby giving users access to a wide body of existing library software in a parallel setting.
.PP
We have been influenced by the work of Babb [1], Browne [2], and Lusk and Overbeek [8]. We present here an approach that applies to Fortran programs, allowing one to make use of existing libraries in a parallel setting. The approach taken here might be regarded as minimalist: it has a very limited scope, for two reasons. First, the goal of portability of user code will be less difficult to achieve.
Second, the real hope for a solution to the software problems associated with parallel programming lies with new programming languages or perhaps with the "right" extension to Fortran. This effort is therefore expected to have a limited lifetime; its purpose is to allow us to exploit existing hardware immediately.
.NH
Terminology
.PP
Within the science of parallel computation there seems to be no standard definition of terms. A certain terminology is adopted here for the sake of discussion. It is not "standard" and is intended to apply only within the scope of this document.
.R
.IP "Process \-"
A unit of computation: an independently executable Fortran subroutine, together with calling sequence parameters, common data, and externals.
.IP "Task \-"
A main program, processes, and a virtual processor.
.IP "Virtual Processor \-"
A process designed to assume the identity of every process within a given task (through an appropriate subroutine call).
.IP "Processor \-"
A physical device capable of executing a main program or a virtual processor.
.IP "Shared Data \-"
Variables that are read and/or written by more than one process (including copies of processes).
.IP "Data Dependency \-"
A situation wherein one process (A) reads any shared data that another process (B) writes. This data dependency is satisfied when B has written the shared data.
.IP "Schedulable Process \-"
A process whose data dependencies have all been satisfied.
.NH
Parallel Programming Ideas
.PP
When designing a parallel algorithm one is required to describe the data dependencies, parallel structures, and shared variables involved in the solution. Typically, such algorithms are first designed at a conceptual level and later implemented in Fortran and its extensions as provided by the computer manufacturer. Each manufacturer provides a different set of extensions and targets these extensions at different implementation levels.
For example, some manufacturers provide only test-and-set along with spawn-a-process, while others allow concurrent execution of different loop iterations.
.PP
Our aim here is to allow the user to define the data dependencies, parallel structures, and shared variables in his application and then to implement these ideas in a Fortran program written in terms of subroutine calls to our environment. Each set of subroutine calls to the environment specifies the subroutine, or process (unit of computation), along with the calling parameters and the data dependencies necessary to coordinate the parallel execution.
.PP
The basic philosophy here is that Fortran programs are naturally broken into subroutines that identify self-contained units of computation operating on shared data structures. This allows one to call existing library subroutines in a parallel setting without modification, and without having to write an envelope around the library subroutine call in order to conform to unusual data-passing conventions imposed by a given parallel programming environment.
.PP
A parallel(izable) program is written in terms of calls to subroutines that, in principle, may be performed either independently or according to data dependency requirements that the user is responsible for defining. The result is a serial program that can run in parallel, given a way to schedule the units of computation on a system of parallel processors while obeying the data dependencies.
.NH
Parallel Programming Using SCHEDULE
.PP
We have constructed a package called SCHEDULE which allows a user to specify the subroutine calls, along with the execution dependencies, needed to carry out parallel execution. To use SCHEDULE, one must be able to express (i.e., program) an algorithm in terms of processes and execution dependencies among the processes. A convenient way to view this is through a computational graph.
For example, the following graph
.KS
.nf
              A
             / \
            B   C
          / | \
         D  D  E
.fi
.KE
denotes five subroutines A, B, C, D, and E (here with two "copies" of subroutine D operating on different data). We intend the execution to start at the leaves of the dependency graph, with subroutines C, D, and E executing in parallel (D will be started twice with different data). After subroutines D (both copies) and E have returned, subroutine B will be started; and after both B and C finish, subroutine A will be started. To use SCHEDULE, one is required to specify the subroutine calling sequence of each of the six schedulable units of computation, along with a representation of this dependency graph.
.PP
For each node in the graph, SCHEDULE requires two subroutine calls. One contains information about the user's routine to be called, such as the name of the routine, the calling sequence parameters, and a simple tag to identify the process. The second subroutine call defines the dependencies in the graph to nodes above and below the one being specified, and specifies the tag that identifies the process. In this example, after an initial call to set up the environment for SCHEDULE, six pairs of calls would be made to define the relationships and data in this computational graph.
.PP
These concepts are perhaps more easily grasped through an actual Fortran example. A very simple example is a parallel algorithm for computing the inner product of two vectors. The intention here is to illustrate the mechanics of using SCHEDULE; this algorithm, and the use of SCHEDULE on a problem of such small granularity, are not necessarily recommended.
.nf
Problem:
Given real vectors @a@ and @b@, each of length @n@, compute @sigma ~=~ a sup T b @.

Parallel Algorithm:
Let @ a sup T ~=~ ( a sub 1 sup T , a sub 2 sup T ,...,a sub k sup T ) @ and
@ b sup T ~=~ ( b sub 1 sup T , b sub 2 sup T ,...,b sub k sup T ) @
be a partitioning of the vectors @a@ and @b@ into smaller vectors @ a sub i @ and @ b sub i @.
.sp
Compute (in parallel) @sigma sub j ~=~ a sub j sup T b sub j@ , @ j ~=~ 1,k@.
When all are done, @ sigma ~=~ sigma sub 1 ~+~ sigma sub 2 ~+~ ... ~+~ sigma sub k @.
.fi
Each of the parallel processes will execute code of the following form:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine inprod(m,a,b,sig)
      integer m
      real a(*),b(*),sig
      sig = 0.0
      do 100 j = 1,m
         sig = sig + a(j)*b(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
The following routine is used to accumulate the result:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine addup(k,sigma,temp)
      integer k
      real sigma,temp(*)
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.PP
The first step in constructing a code is to understand the parallel algorithm in terms of schedulable processes and a data dependency graph. Then the algorithm is expressed in standard (serial) Fortran. This code consists of a main program, which initializes the shared data, and a "parallel" subroutine \f3parprd\f1 that computes the inner product by invoking the parallel processes \f3inprod\f1 and \f3addup\f1. The program and associated data dependency graph are shown below.
.sp
Serial Code:
.nf
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      call parprd(n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end

      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m,indx,j
c
      m = n/k
      indx = 1
      do 200 j = 1,k
c
         call inprod(m,a(indx),b(indx),temp(j))
c
         indx = indx + m
         if (j .eq. k-1) m = n - indx + 1
  200 continue
c
      call addup(k,sigma,temp)
c
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.sp
Data Dependency Graph:
.nf

                   k+1
             /   /   |   \   \
            1   2    3 ... k-1  k

.fi
In this data dependency graph we have identified @k@ processes
.EQ
inprod(~ m,~ a(indx),~ b(indx),~ temp(j)~ ) ~ ~~,~ j~=~1,~ 2,...,~ k ~ ~~, ~ ~indx~=~1~+~ (j-1)*m
.EN
which are not data dependent.
Each of them reads a segment of the shared data @a ~,~b @ and writes on its own entry of the array @temp@, but none of them needs to read data that some other process will write. This fact is evident in the graphical representation, where they are @leaves@. One process,
.EQ
addup(k,sigma,temp),
.EN
labeled @k+1@, is data dependent on each of the processes @ 1,2,...,k@. This is because \f3addup\f1 must read each entry of the array @temp@ in order to compute the sum and place it into @sigma@.
.PP
From this data dependency graph we may proceed to write the parallel program. Once we have understood the computation well enough to have carried out these two steps, the invocation of SCHEDULE to provide for the parallel execution of schedulable processes is straightforward. Calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 are replaced by calls to SCHEDULE routines that identify the routines to be executed, as well as the information relating to the dependency graph. The modified code follows.
.nf
Parallel Main:
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
c
      EXTERNAL PARPRD
c
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k,NPROCS
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end

      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m1,m2,indx,j,jobtag,icango,ncheks,mychkn(2)
c
      EXTERNAL INPROD,ADDUP
      save m1,m2
c
      m1 = n/k
      indx = 1
      do 200 j = 1,k-1
c
c        express data dependencies
c
         JOBTAG = j
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL DEP(jobtag,icango,ncheks,mychkn)
         CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
c
         indx = indx + m1
  200 continue
      m2 = n - indx + 1
c
c     express data dependencies for clean up step
c
      JOBTAG = k
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m2,a(indx),b(indx),temp(k))
c
      indx = indx + m1
c
      JOBTAG = k+1
      ICANGO = k
      NCHEKS = 0
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,ADDUP,k,sigma,temp)
c
      return
      end
.ps 12
.cs 1
.vs 14
.fi
.PP
The code that will execute in parallel has been derived from the serial code by replacing the calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 with calls to SCHEDULE routines that invoke these routines. The modifications are signified by putting the calls to SCHEDULE routines in capital letters. Let us now describe the purpose of each of these calls.
.nf
.cs 1 24
.vs 10
.ps 9
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
.ps 12
.cs 1
.vs 14
.fi
This replaces the call to \f3parprd\f1 in the serial code. The effect is to devote @nprocs@ virtual processors to the parallel subroutine \f3parprd\f1. The parameter list following the subroutine name consists of the calling sequence one would use to make a normal call to \f3parprd\f1. Each of these parameters must be passed by reference and not by value; no constants or arithmetic expressions should be passed as parameters through a call to \f3sched\f1. This call to \f3sched\f1 activates @nprocs@ copies of a virtual processor \f3work\f1. This virtual processor is a SCHEDULE procedure (written in C) that is internal to the package and not explicitly available to the user.
.nf
.cs 1 24
.vs 10
.ps 9
      JOBTAG = j
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
.cs 1
.vs 14
.ps 12
.fi
This code fragment shows the @j-th@ instance of the process \f3inprod\f1 being placed on a queue. The information needed to schedule this process is contained in the data dependency graph. In this case the @j-th@ instance of a call to \f3inprod\f1 is being placed on the queue, so @jobtag@ is set to @j@. The value zero is placed in @icango@, indicating that this process does not depend on any others. If this process depended on @k@ other processes, then @icango@ would be set to @k@. Since only process @k+1@ depends on this one, @ncheks@ is set to 1 and @mychkn(1)@ is set to @k+1@.
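.PP
The bookkeeping that \f3dep\f1 and \f3putq\f1 set up can be modeled compactly in C, the language in which the SCHEDULE internals are written. The following sketch is an illustration only, not the package's actual source; the structure layout, the fixed-size arrays, and the simple ready queue are assumptions made for exposition.

```c
/* Illustrative C model of the dependency bookkeeping recorded by
 * dep/putq.  NOT the actual SCHEDULE source: names and data
 * structures here are assumptions made for exposition only. */

#define MAXJOBS  64
#define MAXCHEKS 8

struct node {
    int icango;               /* number of processes this one waits on    */
    int ncheks;               /* number of processes that wait on it      */
    int mychkn[MAXCHEKS];     /* jobtags of the processes that wait on it */
};

static struct node graph[MAXJOBS];
static int ready[MAXJOBS];    /* ready queue of schedulable jobtags */
static int nready = 0;

/* record the dependency information of one node (cf. the call to dep) */
static void dep(int jobtag, int icango, int ncheks, const int *mychkn)
{
    graph[jobtag].icango = icango;
    graph[jobtag].ncheks = ncheks;
    for (int i = 0; i < ncheks; i++)
        graph[jobtag].mychkn[i] = mychkn[i];
    if (icango == 0)          /* no dependencies: schedulable at once */
        ready[nready++] = jobtag;
}

/* called when process jobtag returns (cf. chekin): decrement the
 * icango of every dependent process; any that reaches zero becomes
 * schedulable and is placed on the ready queue */
static void chekin(int jobtag)
{
    for (int i = 0; i < graph[jobtag].ncheks; i++) {
        int t = graph[jobtag].mychkn[i];
        if (--graph[t].icango == 0)
            ready[nready++] = t;
    }
}
```

For the inner product graph with, say, @k = 3@, the calls corresponding to dep(j,0,1,{4}) for j = 1, 2, 3 and dep(4,3,0,...) make processes 1 through 3 immediately schedulable, and the third chekin places process 4 on the ready queue.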
.PP
The mechanism just described allows static scheduling of parallel processes: in this program the partitioning and data dependencies are known in advance, even though they are parameterized. It is also possible to allocate processes dynamically; that mechanism is explained later. It is worthwhile at this point to discuss the mechanism on which the package relies.
.NH
The SCHEDULE Mechanism
.PP
The calls to the SCHEDULE routines \f3dep\f1 and \f3putq\f1 place, respectively, process dependencies and process descriptors on a queue. There is a unique identifier @jobtag@ associated with each node of the dependency graph; internally this represents a pointer to a process. The items needed to specify a data dependency are @icango@, @ncheks@, and @mychkn@. The @icango@ entry specifies the number of processes that process @jobtag@ depends on. The @ncheks@ entry specifies the number of processes that depend on process @jobtag@. The @mychkn@ entry is an integer array whose first @ncheks@ entries contain the identifiers (i.e., @jobtag@s) of the processes that depend on process @jobtag@.
.PP
The initial call to @sched(nprocs,subname,...)@ causes @nprocs@ virtual processors, called \f3work\f1, to begin executing on @nprocs@ separate physical processors. Typically @nprocs@ should be set to a value less than or equal to the number of physical processors available on the given system. These \f3work\f1 routines access a ready queue of @jobtag@s of schedulable processes. Recall that a schedulable process is one whose data dependencies have been satisfied. After a \f3work\f1 routine has obtained the @jobtag@ of a schedulable process, it makes the subroutine call that was associated with that @jobtag@ in the call to \f3putq\f1. When this subroutine executes a @return@, control returns to \f3work\f1, and a SCHEDULE routine \f3chekin\f1 is called, which decrements the @icango@ counter of each of the @ncheks@ processes that depend on process @jobtag@.
If any of these @icango@ values is decremented to zero, the identifier of that process is placed on the ready queue immediately.
.NH
Low-Level Synchronization
.PP
Two low-level synchronization primitives are available within SCHEDULE: \f3lockon\f1 and \f3lockoff\f1. Each takes an integer argument. An example of usage is
.sp2
.nf
.ps 9
.cs 1 24
.vs 10
      call lockon(ilock)
      ilocal = indx
      indx = indx + 5
      call lockoff(ilock)
.fi
.ps 12
.cs 1
.vs 14
.sp2
In this example a critical section has been placed around the act of getting a local copy of the shared variable @indx@ and updating the value of @indx@. If several concurrent processes are executing this code, only one of them will be able to occupy this critical section at any given time. The variable @ilock@ must be a globally shared variable, and it must be initialized by calling the routine \f3lockasgn\f1. In the above example the statement
.sp
.nf
.ps 9
.cs 1 24
.vs 10
      call lockasgn(ilock,0)
.fi
.ps 12
.cs 1
.vs 14
.sp
must execute exactly once, before any of the calls to \f3lockon\f1 are made. If there are low-level data dependencies among any of the processes that will be scheduled, it will be necessary to enforce those data dependencies using locks. It is preferable to avoid locks where possible; in certain cases, however, such as pipelining, locks will be required.
.NH
Dynamic Allocation of Processes
.PP
The scheme presented above might be considered static allocation of processes: the number of processes and their data dependencies are known in advance. The entire data structure (internal to SCHEDULE) representing the computational graph can therefore be recorded before the computation begins, and it remains fixed throughout the computation.
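.PP
Returning briefly to the locking primitives of the previous section: their semantics can be sketched with an atomic test-and-set, in the spirit of the Alliant implementation mentioned later in this paper. The following C fragment is an illustrative model only, not SCHEDULE source; the C11 atomics are a modern stand-in, and the signatures differ from the Fortran-callable originals.

```c
/* Sketch of lockasgn/lockon/lockoff semantics on top of an atomic
 * test-and-set.  Illustration only; the real SCHEDULE primitives are
 * Fortran-callable and machine-specific. */
#include <stdatomic.h>

typedef atomic_flag slock;

/* initialize the lock to the unlocked state (cf. lockasgn(ilock,0)) */
static void lockasgn(slock *l) { atomic_flag_clear(l); }

/* spin until the test-and-set obtains the lock (cf. lockon) */
static void lockon(slock *l)   { while (atomic_flag_test_and_set(l)) ; }

/* release the lock (cf. lockoff) */
static void lockoff(slock *l)  { atomic_flag_clear(l); }
```

Between lockon and lockoff only one process at a time can hold the lock, which is exactly what makes the critical section around the shared variable @indx@ safe.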
In many situations, however, we will not know the computational graph in advance, and we will need the ability for one process to start, or spawn, another based on the computation that has taken place up to a given point in the spawning process. This dynamic allocation of processes is accomplished through the SCHEDULE subroutine \f3spawn\f1. The method of specifying a process is similar to the use of \f3putq\f1 described above.
.PP
We shall use the same example to illustrate this mechanism.
.nf
Processes:

      subroutine inprod  .... same as above
.ps 9
.cs 1 24
.vs 10
      subroutine addup(myid,n,k,a,b,sigma,temp)
      integer myid,n,k
      real a(*),b(*),sigma,temp(*)
c
c     declare local variables
c
      integer j,jdummy,m1,m2
c
      LOGICAL WAIT
      EXTERNAL INPROD
      save m1,m2
c
      go to (1111,2222), IENTRY(myid)
 1111 continue
c
      m1 = n/k
      indx = 1
      do 200 j = 1,k-1
c
c        replace the call to inprod with a call to spawn
c
         CALL SPAWN(myid,jdummy,INPROD,m1,a(indx),b(indx),temp(j))
         indx = indx + m1
  200 continue
      m2 = n - indx + 1
c
c     clean up step:
c     replace the call to inprod with a call to spawn
c
      CALL SPAWN(myid,jdummy,INPROD,m2,a(indx),b(indx),temp(k))
c
      nprocs = k
      L2222 = 2
c
c     If any of the spawned processes have not completed, RETURN
c     to the scheduler and help out.  This avoids busy waiting
c     and allows this code to be executed by one processor.
c
      if (WAIT(myid,nprocs,L2222)) return
 2222 continue
c
c     All have checked in; now add up the results.
c
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end
.sp2
.fi
.cs 1
.vs 14
.ps 12
The subroutine \f3parprd\f1 must change somewhat.
.nf
.sp2
.cs 1 24
.vs 10
.ps 9
      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer mychkn(1),icango,ncheks,jobtag
      EXTERNAL ADDUP
      save jobtag
c
      JOBTAG = 1
      ICANGO = 0
      NCHEKS = 0
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,ADDUP,jobtag,n,k,a,b,sigma,temp)
c
      return
      end
.fi
.cs 1
.vs 14
.ps 12
.NH
Experience with SCHEDULE
.PP
At present the experience with using SCHEDULE is limited but encouraging. Versions are running successfully on the VAX 11/780, Alliant FX/8, and CRAY-2 computers. That is, the same user code executes without modification on all three machines. Only the SCHEDULE internals are modified, and these modifications are minor. They involve such things as naming and parameter-passing conventions for the C-Fortran interface, coding the low-level synchronization primitives, and arranging to "create" the \f3work\f1 processes.
.PP
On the CRAY-2, process creation is accomplished using \f3taskstart\f1, and the low-level synchronization maps directly onto the synchronization routines provided by the CRAY multitasking library [3]. For the Alliant FX/8 we coded the low-level synchronization primitives using the machine's test-and-set instruction. To "create" the \f3work\f1 routines, we used the CVD$L CNCALL directive before a loop that performs @nprocs@ calls to the subroutine @work@.
.PP
In addition to some toy programs used for debugging SCHEDULE, several codes have been written and executed using SCHEDULE, including the algorithm TREEQL for the symmetric tridiagonal eigenvalue problem [7], a domain decomposition code for singularly perturbed convection-diffusion PDEs [4], and a block preconditioned conjugate gradient code for systems arising in reservoir simulation [5].
.SH
References
.sp
.IP [1]
R.G. Babb II,
.I
Parallel Processing with Large Grain Data Flow Techniques,
.R
IEEE Computer, Vol. 17, No. 7, 55-61, July 1984.
.sp
.IP [2]
J.C.
Browne,
.I
Framework for Formulation and Analysis of Parallel Computation Structures,
.R
Parallel Computing, 3, 1-9, 1986.
.sp
.IP [3]
CRAY-2 Multitasking User's Guide, Cray Research Inc., Minneapolis, MN, 1986.
.sp
.IP [4]
R. Chin, G. Hedstrom, F. Howes, and J. McGraw,
.I
Parallel Computation of Multiple-Scale Problems,
.R
New Computing Environments: Parallel, Vector, and Systolic, A. Wouk, ed., SIAM, Philadelphia, 1986.
.sp
.IP [5]
J.C. Diaz,
.I
Calculating the Block Preconditioner on Parallel Multivector Processors,
.R
Proceedings of the Workshop on Applied Computing in the Energy Field, Stillwater, Oklahoma, October 10, 1986. To appear.
.sp
.IP [6]
J.J. Dongarra and I.S. Duff,
.I
Advanced Architecture Computers,
.R
ANL-MCS-TM-57, October 1985.
.sp
.IP [7]
J.J. Dongarra and D.C. Sorensen,
.I
A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem,
.R
To appear, SIAM J. Sci. Stat. Comput.
.sp
.IP [8]
E. Lusk and R. Overbeek,
"Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors,"
.R
ANL-83-97, 1983.