.ds TH "\h'-.75m'\v'-.50m'^\v'.50m' .ds TD "\h'-.75m'\v'-.45m'~\v'.45m' .B .nr BT ''-%-'' .he '''' .pl 11i .de fO 'bp .. .wh -.5i fO .LP .nr LL 6.5i .ll 6.5i .nr LT 6.5i .lt 6.5i .ta 5.0i .ft 3 .bp .R .sp 1i .ce 100 .R .sp .5i . .sp 10 ARGONNE NATIONAL LABORATORY .br 9700 South Cass Avenue .br Argonne, Illinois 60439 .sp .6i .ps 12 .ft 3 A Portable Environment for Developing Parallel Fortran Programs .ps 11 .sp 3 Jack J. Dongarra and Danny C. Sorensen .sp 3 .ps 10 .ft 1 Mathematics and Computer Science Division .sp 2 Technical Memorandum No. 79 .sp .7i July, 1986 .pn 1 .he ''-%-'' .he '''' .in .bp .ft 3 .ps 11 .bp .LP .ds HT "\h'-.65m'\v'-.40m'^\v'.40m' .ds TD "\h'-.65m'\v'-.40m'~\v'.40m' .ds CT "\(mu\h'-.66m'\(ci .ds NI "\(mo\h'-.65m'/ .nr EP 1 .nr VS 14 .EQ gsize 11 delim @@ define times '\(mu' define ctimes '\(mu back 80 \(ci ^ ' define bl '\0' .EN .TL A Portable Environment for Developing Parallel Fortran Programs* .AU .ps 11 .in 0 J. J. Dongarra and D. C. Sorensen .AI .ps 10 .in 0 Mathematics and Computer Science Division\h'.20i' Argonne National Laboratory\h'.20i' 9700 South Cass Avenue\h'.20i' Argonne, Illinois 60439-4844\h'.20i' .sp and\h'.20i' .sp Center for Supercomputing Research and Development\h'.20i' University of Illinois at Urbana-Champaign\h'.20i' 305 Talbot Laboratory\h'.20i' 104 South Wright Street\h'.20i' Urbana, Illinois 61801-2932\h'.20i' .nr PS 12 .nr VS 16 .nr PD 0.5v .FS * Work supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy under Contracts W-31-109-Eng-38, DE-AC05-840R21400, and DE-FG02-85ER25001. .FE .NH Introduction .PP Many new parallel computers are now emerging as commercial products[6]. Each of the vendors has provided its own parallel extensions to Fortran, and there is no agreement among any of them on which extensions should be included. There is not even an agreed-upon naming convention for extensions that have identical functionality. 
Program developers interested in producing implementations of parallel algorithms that will run on a number of different parallel machines are therefore faced with an overwhelming task. The process of developing portable parallel packages is complicated by factors beyond each computer manufacturer supplying its own, very different mechanism for parallel processing. A given implementation may require several different communicating parallel processes, perhaps with different levels of granularity. An efficient implementation may require the ability to start processes dynamically, perhaps many more than the number of physical processors in the system. This capability is either lacking or prohibitively expensive on most commercially available parallel computers. Instead, many of the manufacturers have limited themselves to providing one-level loop-based parallelism.
.PP
This paper describes an environment for the portable implementation of parallel algorithms in a Fortran setting. The package, called SCHEDULE, can help a programmer familiar with a Fortran programming environment to implement a parallel algorithm in a manner that lends itself to transporting the resulting program across a wide variety of parallel machines. The package is designed to allow existing Fortran subroutines to be called through SCHEDULE without modification, thereby giving users access to a wide body of existing library software in a parallel setting.
.PP
We have been influenced by the work of Babb [1], Browne [2], and Lusk and Overbeek [8]. We present here an approach that applies to Fortran programs, allowing one to make use of existing libraries in a parallel setting. The approach taken here might be regarded as minimalist: it has a very limited scope, for two reasons. First, the goal of portability of user code will be less difficult to achieve.
Second, the real hope for a solution to the software problems associated with parallel programming lies with new programming languages or perhaps with the "right" extension to Fortran. This effort is therefore expected to have a limited lifetime; its purpose is to allow us to exploit existing hardware immediately.
.NH
Terminology
.PP
Within the science of parallel computation there seems to be no standard definition of terms. A certain terminology is adopted here for the sake of discussion. It is not "standard" and is intended to apply only within the scope of this document.
.R
.IP "Process \-"
A unit of computation: an independently executable Fortran subroutine, together with calling sequence parameters, common data, and externals.
.IP "Task \-"
A main program, processes, and a virtual processor.
.IP "Virtual Processor \-"
A process designed to assume the identity of every process within a given task (through an appropriate subroutine call).
.IP "Processor \-"
A physical device capable of executing a main program or a virtual processor.
.IP "Shared Data \-"
Variables that are read and/or written by more than one process (including copies of processes).
.IP "Data Dependency \-"
A situation wherein one process (A) reads any shared data that another process (B) writes. This data dependency is satisfied when B has written the shared data.
.IP "Schedulable Process \-"
A process whose data dependencies have all been satisfied.
.NH
Parallel Programming Ideas
.PP
When designing a parallel algorithm one is required to describe the data dependencies, parallel structures, and shared variables involved in the solution. Typically, such algorithms are first designed at a conceptual level and later implemented in Fortran and its extensions as provided by the computer manufacturer. Each manufacturer provides a different set of extensions and targets these extensions at different implementation levels.
For example, some manufacturers provide only test-and-set along with spawn-a-process, while others allow concurrent execution of different loop iterations.
.PP
Our aim here is to allow the user to define the data dependencies, parallel structures, and shared variables in his application and then to implement these ideas in a Fortran program written in terms of subroutine calls to our environment. Each set of subroutine calls to the environment specifies the subroutine, or process (unit of computation), along with the calling parameters and the data dependencies necessary to coordinate the parallel execution.
.PP
The basic philosophy here is that Fortran programs are naturally broken into subroutines that identify self-contained units of computation operating on shared data structures. This allows one to call existing library subroutines in a parallel setting without modification, and without having to write an envelope around the library subroutine call in order to conform to unusual data-passing conventions imposed by a given parallel programming environment.
.PP
A parallel(izable) program is written in terms of calls to subroutines that, in principle, may be performed either independently or according to data dependency requirements that the user is responsible for defining. The result is a serial program that can run in parallel, given a way to schedule the units of computation on a system of parallel processors while obeying the data dependencies.
.NH
Parallel Programming Using SCHEDULE
.PP
We have constructed a package called SCHEDULE which allows a user to specify the subroutine calls, along with the execution dependencies, needed to carry out parallel execution. To use SCHEDULE, one must be able to express (i.e., program) an algorithm in terms of processes and execution dependencies among the processes. A convenient way to view this is through a computational graph.
For example, the following graph
.KS
.nf
              A
             / \
            B   C
          / | \
         D  D  E
.fi
.KE
denotes five subroutines A, B, C, D, and E (here with two "copies" of subroutine D operating on different data). We intend the execution to start at the leaves of the dependency graph, with subroutines C, D, and E executing in parallel (D will be started twice with different data). After subroutines D (both copies) and E have returned, subroutine B will be started; and after both B and C finish, subroutine A will be started. To use SCHEDULE, one is required to specify the subroutine calling sequence of each of the six schedulable units of computation, along with a representation of this dependency graph.
.PP
For each node in the graph, SCHEDULE requires two subroutine calls. One contains information about the user's routine to be called, such as the name of the routine, the calling sequence parameters, and a simple tag to identify the process. The second subroutine call defines the dependencies in the graph to nodes above and below the one being specified, and specifies the tag that identifies the process. In this example, after an initial call to set up the environment for SCHEDULE, six pairs of calls would be made to define the relationships and data in this computational graph.
.PP
These concepts are perhaps more easily grasped through an actual Fortran example. A very simple example is a parallel algorithm for computing the inner product of two vectors. The intention here is to illustrate the mechanics of using SCHEDULE; this algorithm, and the use of SCHEDULE on a problem of such small granularity, are not necessarily recommended.
.nf
Problem:
Given real vectors @a@ and @b@, each of length @n@, compute @sigma ~=~ a sup T b @.

Parallel Algorithm:
Let @ a sup T ~=~ ( a sub 1 sup T , a sub 2 sup T ,...,a sub k sup T ) @ and
@ b sup T ~=~ ( b sub 1 sup T , b sub 2 sup T ,...,b sub k sup T ) @
be a partitioning of the vectors @a@ and @b@ into smaller vectors @ a sub i @ and @ b sub i @.
.sp
Compute (in parallel) @sigma sub j ~=~ a sub j sup T b sub j@ , @ j ~=~ 1,k@.
When all are done, @ sigma ~=~ sigma sub 1 ~+~ sigma sub 2 ~+~ ... ~+~ sigma sub k @.
.fi
Each of the parallel processes will execute code of the following form:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine inprod(m,a,b,sig)
      integer m
      real a(*),b(*),sig
      sig = 0.0
      do 100 j = 1,m
         sig = sig + a(j)*b(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
The following routine is used to accumulate the result:
.nf
.cs 1 24
.vs 10
.ps 9
      subroutine addup(k,sigma,temp)
      integer k
      real sigma,temp(*)
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.PP
The first step in constructing a code is to understand the parallel algorithm in terms of schedulable processes and a data dependency graph. Then the algorithm is expressed in standard (serial) Fortran. This code consists of a main program, which initializes the shared data, and a "parallel" subroutine \f3parprd\f1 that computes the inner product by invoking the parallel processes \f3inprod\f1 and \f3addup\f1. The program and associated data dependency graph are shown below.
.sp
Serial Code:
.nf
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      call parprd(n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end

      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m,indx,j
c
      m = n/k
      indx = 1
      do 200 j = 1,k
c
         call inprod(m,a(indx),b(indx),temp(j))
c
         indx = indx + m
         if (j .eq. k-1) m = n - indx + 1
  200 continue
c
      call addup(k,sigma,temp)
c
      return
      end
.fi
.ps 12
.cs 1
.vs 14
.sp
Data Dependency Graph:
.nf

                   k+1
             /   /   |   \   \
            1   2    3 ... k-1  k

.fi
In this data dependency graph we have identified @k@ processes
.EQ
inprod(~ m,~ a(indx),~ b(indx),~ temp(j)~ ) ~ ~~,~ j~=~1,~ 2,...,~ k ~ ~~, ~ ~indx~=~1~+~ (j-1)*m
.EN
which are not data dependent.
Each of them reads a segment of the shared data @a ~,~b @ and writes on its own entry of the array @temp@, but none of them needs to read data that some other process will write. This fact is evident in the graphical representation, where they are @leaves@. One process,
.EQ
addup(k,sigma,temp),
.EN
labeled @k+1@, is data dependent on each of the processes @ 1,2,...,k@. This is because \f3addup\f1 must read each entry of the array @temp@ in order to compute the sum and place it into @sigma@.
.PP
From this data dependency graph we may proceed to write the parallel program. Once we have understood the computation well enough to have carried out these two steps, the invocation of SCHEDULE to provide for the parallel execution of schedulable processes is straightforward. Calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 are replaced by calls to SCHEDULE routines that identify the routines to be executed, as well as the information relating to the dependency graph. The modified code follows.
.nf
Parallel Main:
.cs 1 24
.vs 10
.ps 9
      program main
      integer n,k
c
      EXTERNAL PARPRD
c
      real a(1000),b(1000),temp(50),sigma
      read (5,*) n,k,NPROCS
      do 100 j = 1,n
         a(j) = j
         b(j) = 1
  100 continue
c
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
c
      write(6,*) ' sigma = ',sigma
      stop
      end

      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer m1,m2,indx,j,jobtag,icango,ncheks,mychkn(2)
c
      EXTERNAL INPROD,ADDUP
      save m1,m2
c
      m1 = n/k
      indx = 1
      do 200 j = 1,k-1
c
c        express data dependencies
c
         JOBTAG = j
         ICANGO = 0
         NCHEKS = 1
         MYCHKN(1) = k+1
c
         CALL DEP(jobtag,icango,ncheks,mychkn)
         CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
c
         indx = indx + m1
  200 continue
      m2 = n - indx + 1
c
c     express data dependencies for clean up step
c
      JOBTAG = k
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m2,a(indx),b(indx),temp(k))
c
      indx = indx + m1
c
      JOBTAG = k+1
      ICANGO = k
      NCHEKS = 0
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,ADDUP,k,sigma,temp)
c
      return
      end
.ps 12
.cs 1
.vs 14
.fi
.PP
The code that will execute in parallel has been derived from the serial code by replacing the calls to \f3parprd\f1, \f3inprod\f1, and \f3addup\f1 with calls to SCHEDULE routines that invoke these routines. The modifications are signified by putting the calls to SCHEDULE routines in capital letters. Let us now describe the purpose of each of these calls.
.nf
.cs 1 24
.vs 10
.ps 9
      CALL SCHED(nprocs,PARPRD,n,k,a,b,temp,sigma)
.ps 12
.cs 1
.vs 14
.fi
This replaces the call to \f3parprd\f1 in the serial code. The effect is to devote @nprocs@ virtual processors to the parallel subroutine \f3parprd\f1. The parameter list following the subroutine name consists of the calling sequence one would use to make a normal call to \f3parprd\f1. Each of these parameters must be passed by reference and not by value; no constants or arithmetic expressions should be passed as parameters through a call to \f3sched\f1. This call to \f3sched\f1 activates @nprocs@ copies of a virtual processor \f3work\f1. This virtual processor is a SCHEDULE procedure (written in C) that is internal to the package and not explicitly available to the user.
.nf
.cs 1 24
.vs 10
.ps 9
      JOBTAG = j
      ICANGO = 0
      NCHEKS = 1
      MYCHKN(1) = k+1
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,INPROD,m1,a(indx),b(indx),temp(j))
.cs 1
.vs 14
.ps 12
.fi
This code fragment shows the @j-th@ instance of the process \f3inprod\f1 being placed on a queue. The information needed to schedule this process is contained in the data dependency graph. In this case the @j-th@ instance of a call to \f3inprod\f1 is being placed on the queue, so @jobtag@ is set to @j@. The value zero is placed in @icango@, indicating that this process does not depend on any others. If this process depended on @k@ other processes, then @icango@ would be set to @k@. Since only process @k+1@ depends on this one, @ncheks@ is set to 1 and @mychkn(1)@ is set to @k+1@.
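.PP
The bookkeeping that \f3dep\f1 and \f3putq\f1 set up can be modeled compactly in C, the language in which the SCHEDULE internals are written. The following sketch is an illustration only, not the package's actual source; the structure layout, the fixed-size arrays, and the simple ready queue are assumptions made for exposition.

```c
/* Illustrative C model of the dependency bookkeeping recorded by
 * dep/putq.  NOT the actual SCHEDULE source: names and data
 * structures here are assumptions made for exposition only. */

#define MAXJOBS  64
#define MAXCHEKS 8

struct node {
    int icango;               /* number of processes this one waits on    */
    int ncheks;               /* number of processes that wait on it      */
    int mychkn[MAXCHEKS];     /* jobtags of the processes that wait on it */
};

static struct node graph[MAXJOBS];
static int ready[MAXJOBS];    /* ready queue of schedulable jobtags */
static int nready = 0;

/* record the dependency information of one node (cf. the call to dep) */
static void dep(int jobtag, int icango, int ncheks, const int *mychkn)
{
    graph[jobtag].icango = icango;
    graph[jobtag].ncheks = ncheks;
    for (int i = 0; i < ncheks; i++)
        graph[jobtag].mychkn[i] = mychkn[i];
    if (icango == 0)          /* no dependencies: schedulable at once */
        ready[nready++] = jobtag;
}

/* called when process jobtag returns (cf. chekin): decrement the
 * icango of every dependent process; any that reaches zero becomes
 * schedulable and is placed on the ready queue */
static void chekin(int jobtag)
{
    for (int i = 0; i < graph[jobtag].ncheks; i++) {
        int t = graph[jobtag].mychkn[i];
        if (--graph[t].icango == 0)
            ready[nready++] = t;
    }
}
```

For the inner product graph with, say, @k = 3@, the calls corresponding to dep(j,0,1,{4}) for j = 1, 2, 3 and dep(4,3,0,...) make processes 1 through 3 immediately schedulable, and the third chekin places process 4 on the ready queue.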
.PP
The mechanism just described allows static scheduling of parallel processes: in this program the partitioning and data dependencies are known in advance, even though they are parameterized. It is also possible to allocate processes dynamically; that mechanism is explained later. It is worthwhile at this point to discuss the mechanism on which the package relies.
.NH
The SCHEDULE Mechanism
.PP
The calls to the SCHEDULE routines \f3dep\f1 and \f3putq\f1 place, respectively, process dependencies and process descriptors on a queue. There is a unique identifier @jobtag@ associated with each node of the dependency graph; internally this represents a pointer to a process. The items needed to specify a data dependency are @icango@, @ncheks@, and @mychkn@. The @icango@ entry specifies the number of processes that process @jobtag@ depends on. The @ncheks@ entry specifies the number of processes that depend on process @jobtag@. The @mychkn@ entry is an integer array whose first @ncheks@ entries contain the identifiers (i.e., @jobtag@s) of the processes that depend on process @jobtag@.
.PP
The initial call to @sched(nprocs,subname,...)@ causes @nprocs@ virtual processors, called \f3work\f1, to begin executing on @nprocs@ separate physical processors. Typically @nprocs@ should be set to a value less than or equal to the number of physical processors available on the given system. These \f3work\f1 routines access a ready queue of @jobtag@s of schedulable processes. Recall that a schedulable process is one whose data dependencies have been satisfied. After a \f3work\f1 routine has obtained the @jobtag@ of a schedulable process, it makes the subroutine call that was associated with that @jobtag@ in the call to \f3putq\f1. When this subroutine executes a @return@, control returns to \f3work\f1, and a SCHEDULE routine \f3chekin\f1 is called, which decrements the @icango@ counter of each of the @ncheks@ processes that depend on process @jobtag@.
If any of these @icango@ values is decremented to zero, the identifier of that process is placed on the ready queue immediately.
.NH
Low-Level Synchronization
.PP
Two low-level synchronization primitives are available within SCHEDULE: \f3lockon\f1 and \f3lockoff\f1. Each takes an integer argument. An example of usage is
.sp2
.nf
.ps 9
.cs 1 24
.vs 10
      call lockon(ilock)
      ilocal = indx
      indx = indx + 5
      call lockoff(ilock)
.fi
.ps 12
.cs 1
.vs 14
.sp2
In this example a critical section has been placed around the act of getting a local copy of the shared variable @indx@ and updating the value of @indx@. If several concurrent processes are executing this code, only one of them will be able to occupy this critical section at any given time. The variable @ilock@ must be a globally shared variable, and it must be initialized by calling the routine \f3lockasgn\f1. In the above example the statement
.sp
.nf
.ps 9
.cs 1 24
.vs 10
      call lockasgn(ilock,0)
.fi
.ps 12
.cs 1
.vs 14
.sp
must execute exactly once, before any of the calls to \f3lockon\f1 are made. If there are low-level data dependencies among any of the processes that will be scheduled, it will be necessary to enforce those data dependencies using locks. It is preferable to avoid locks where possible; in certain cases, however, such as pipelining, locks will be required.
.NH
Dynamic Allocation of Processes
.PP
The scheme presented above might be considered static allocation of processes: the number of processes and their data dependencies are known in advance. The entire data structure (internal to SCHEDULE) representing the computational graph can therefore be recorded before the computation begins, and it remains fixed throughout the computation.
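.PP
Returning briefly to the locking primitives of the previous section: their semantics can be sketched with an atomic test-and-set, in the spirit of the Alliant implementation mentioned later in this paper. The following C fragment is an illustrative model only, not SCHEDULE source; the C11 atomics are a modern stand-in, and the signatures differ from the Fortran-callable originals.

```c
/* Sketch of lockasgn/lockon/lockoff semantics on top of an atomic
 * test-and-set.  Illustration only; the real SCHEDULE primitives are
 * Fortran-callable and machine-specific. */
#include <stdatomic.h>

typedef atomic_flag slock;

/* initialize the lock to the unlocked state (cf. lockasgn(ilock,0)) */
static void lockasgn(slock *l) { atomic_flag_clear(l); }

/* spin until the test-and-set obtains the lock (cf. lockon) */
static void lockon(slock *l)   { while (atomic_flag_test_and_set(l)) ; }

/* release the lock (cf. lockoff) */
static void lockoff(slock *l)  { atomic_flag_clear(l); }
```

Between lockon and lockoff only one process at a time can hold the lock, which is exactly what makes the critical section around the shared variable @indx@ safe.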
In many situations, however, we will not know the computational graph in advance, and we will need the ability for one process to start, or spawn, another based on the computation that has taken place up to a given point in the spawning process. This dynamic allocation of processes is accomplished through the SCHEDULE subroutine \f3spawn\f1. The method of specifying a process is similar to the use of \f3putq\f1 described above.
.PP
We shall use the same example to illustrate this mechanism.
.nf
Processes:

      subroutine inprod  .... same as above
.ps 9
.cs 1 24
.vs 10
      subroutine addup(myid,n,k,a,b,sigma,temp)
      integer myid,n,k
      real a(*),b(*),sigma,temp(*)
c
c     declare local variables
c
      integer j,jdummy,m1,m2
c
      LOGICAL WAIT
      EXTERNAL INPROD
      save m1,m2
c
      go to (1111,2222), IENTRY(myid)
 1111 continue
c
      m1 = n/k
      indx = 1
      do 200 j = 1,k-1
c
c        replace the call to inprod with a call to spawn
c
         CALL SPAWN(myid,jdummy,INPROD,m1,a(indx),b(indx),temp(j))
         indx = indx + m1
  200 continue
      m2 = n - indx + 1
c
c     clean up step:
c     replace the call to inprod with a call to spawn
c
      CALL SPAWN(myid,jdummy,INPROD,m2,a(indx),b(indx),temp(k))
c
      nprocs = k
      L2222 = 2
c
c     If any of the spawned processes have not completed, RETURN
c     to the scheduler and help out.  This avoids busy waiting
c     and allows this code to be executed by one processor.
c
      if (WAIT(myid,nprocs,L2222)) return
 2222 continue
c
c     All have checked in; now add up the results.
c
      sigma = 0.0
      do 100 j = 1,k
         sigma = sigma + temp(j)
  100 continue
      return
      end
.sp2
.fi
.cs 1
.vs 14
.ps 12
The subroutine \f3parprd\f1 must change somewhat.
.nf
.sp2
.cs 1 24
.vs 10
.ps 9
      subroutine parprd(n,k,a,b,temp,sigma)
c
c     declare shared variables
c
      integer n,k
      real a(*),b(*),temp(*),sigma
c
c     declare local variables
c
      integer mychkn(1),icango,ncheks,jobtag
      EXTERNAL ADDUP
      save jobtag
c
      JOBTAG = 1
      ICANGO = 0
      NCHEKS = 0
c
      CALL DEP(jobtag,icango,ncheks,mychkn)
      CALL PUTQ(jobtag,ADDUP,jobtag,n,k,a,b,sigma,temp)
c
      return
      end
.fi
.cs 1
.vs 14
.ps 12
.NH
Experience with SCHEDULE
.PP
At present the experience with using SCHEDULE is limited but encouraging. Versions are running successfully on the VAX 11/780, Alliant FX/8, and CRAY-2 computers. That is, the same user code executes without modification on all three machines. Only the SCHEDULE internals are modified, and these modifications are minor. They involve such things as naming and parameter-passing conventions for the C-Fortran interface, coding the low-level synchronization primitives, and arranging to "create" the \f3work\f1 processes.
.PP
On the CRAY-2, process creation is accomplished using \f3taskstart\f1, and the low-level synchronization maps directly onto the synchronization routines provided by the CRAY multitasking library [3]. For the Alliant FX/8 we coded the low-level synchronization primitives using the machine's test-and-set instruction. To "create" the \f3work\f1 routines, we used the CVD$L CNCALL directive before a loop that performs @nprocs@ calls to the subroutine @work@.
.PP
In addition to some toy programs used for debugging SCHEDULE, several codes have been written and executed using SCHEDULE, including the algorithm TREEQL for the symmetric tridiagonal eigenvalue problem [7], a domain decomposition code for singularly perturbed convection-diffusion PDEs [4], and a block preconditioned conjugate gradient code for systems arising in reservoir simulation [5].
.SH
References
.sp
.IP [1]
R.G. Babb II,
.I
Parallel Processing with Large Grain Data Flow Techniques,
.R
IEEE Computer, Vol. 17, No. 7, 55-61, July 1984.
.sp
.IP [2]
J.C.
Browne,
.I
Framework for Formulation and Analysis of Parallel Computation Structures,
.R
Parallel Computing, 3, 1-9, 1986.
.sp
.IP [3]
CRAY-2 Multitasking User's Guide, Cray Research Inc., Minneapolis, MN, 1986.
.sp
.IP [4]
R. Chin, G. Hedstrom, F. Howes, and J. McGraw,
.I
Parallel Computation of Multiple-Scale Problems,
.R
New Computing Environments: Parallel, Vector, and Systolic, A. Wouk, ed., SIAM, Philadelphia, 1986.
.sp
.IP [5]
J.C. Diaz,
.I
Calculating the Block Preconditioner on Parallel Multivector Processors,
.R
Proceedings of the Workshop on Applied Computing in the Energy Field, Stillwater, Oklahoma, October 10, 1986. To appear.
.sp
.IP [6]
J.J. Dongarra and I.S. Duff,
.I
Advanced Architecture Computers,
.R
ANL-MCS-TM-57, October 1985.
.sp
.IP [7]
J.J. Dongarra and D.C. Sorensen,
.I
A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem,
.R
To appear, SIAM J. Sci. Stat. Comput.
.sp
.IP [8]
E. Lusk and R. Overbeek,
"Implementation of Monitors with Macros: A Programming Aid for the HEP and Other Parallel Processors,"
.R
ANL-83-97, 1983.