TGEX: a tool for portable parallel and distributed execution of unstructured problems Hai Xiang Lin1 , Arjan J.C. van Gemund2, Johan Meijdam1 and Pier Nauta3 1
Dept. of Applied Mathematics and Informatics, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands, e-mail: [email protected]
2 Dept. of Electrical Engineering, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands, e-mail: [email protected]
3 TNO-BOUW Institute, P.O. Box 49, 2600 AA, Delft, the Netherlands
Abstract. There exist a large number of scienti c and industrial appli-
cations with inherently unstructured parallelism which cannot be solved eectively using a data parallel approach. In this paper we present TGEX, a generic tool for the parallel execution of unstructured problems on distributed memory computers. A general computational problem is rst expressed in terms of a computation task graph which serves as a machineindependent representation of the parallel program. TGEX then executes the task graph in a macro data ow style on the target distributedmemory machine. Thus TGEX simpli es parallel programming by freeing the application programmer from the burden of low level task synchronisation and communication. TGEX is implemented using PVM and is therefore portable on a large number of distributed memory systems.
1 Introduction Despite recent advances in the use of data parallel languages, such as data parallel C and HPF, there exist many scienti c and industrial applications that cannot be eciently programmed using a data parallel approach. In data parallel programming, the computation on a data item is assigned to the processor which 'owns' that data item (the ownership is determined through the distribution of the partitioned data to the processors). However, a large amount of other scienti c and industrial applications do not have structured parallelism which can be eciently expressed merely by using regular array distributions. We can distinguish irregular or unstructured problems into three dierent classes. The rst class requires irregular data distribution (e.g. a CFD problem de ned on a non-rectangular grid), but the data are de ned at compile-time and the dependency is data parallel. Such irregular problems have been dealt by data parallel languages supporting irregular data distributions. Examples are Fortran D  and Vienna Fortran . The second class comprises problems whose dataaccess patterns and workload are only known at run-time. This complicates the task of explicitly specifying the data distribution (e.g. a mapping of array elements to processors) in a data parallel language. One of the solutions proposed is to use some library extension on top of a data parallel language such as HPF. In , the Chaos system is proposed which consists of a set of procedures that can
be used by an HPF-style compiler to automatically manage programmer-de ned distributions, partition loop iterations, and generate optimized communication schedules. The third class comprises problems, not only whose data-access patterns and workload are only known at run-time, but where the data dependency itself is dynamic and only known at run-time (e.g. a sparse direct solver in a nite element analysis software package [7, 8]). Such problems cannot be solved eectively using data parallel programming (or any data distribution-oriented approach). The third class consists of the most dicult problems in parallel programming. The rst two classes together with the regular problems can be dealt with using the data parallel approach and we call them data parallel (regular or irregular) problems. We call the third class of problems irregular non-data parallel or simply unstructured problems in order to distinguish them from the rst two classes. In order to ease the programming of unstructured problems, recently researchers have started experimenting with task parallel or object-oriented parallel programming systems such as Schedule , Jade , and Charm . Schedule is a well-known example of earlier work dealing with unstructured problems. It is based on a shared-memory system with Fortran 77 as the base language. Programmers are required to de ne the task (subroutine) inter-dependency by de ning an array. However it is not speci ed which data is required by a task (this information is not necessary on a shared-memory system but will be necessary on a distributed-memory system for implementing update/communication of dependent data). The Jade language is a parallel programming language (an extension to C) for exploiting coarse-grain concurrency in sequential, imperative programs. It is implemented on top of the SAM virtual shared memory system. The central concept in Jade is the shared-object. Tasks are dependent if they access the same shared-object. The structure of parallelism produced by the Jade program is a directed acyclic graph of tasks. The execution of the tasks is controlled using a farmer/worker model (i.e. one central scheduler). For a large number of processors this central scheduler can become a performance bottleneck. Moreover, optimization with respect to mapping and scheduling is not possible, subsequently a large communication overhead can occur. The programming model of Charm is a set of medium-grained and potentially parallel processes (called chares). These processes interact with each other via messages. Communications are speci ed by sending and receiving messages and parallel execution of the processes is synchronized by messages. One of the limitation is that no optimization on mapping and scheduling of processes is performed at compile-time or before the parallel execution. Because processes with frequent mutual message trac might be assigned to distant processors, this may cause a large amount of communication. As a result of this, Charm is probably more suited for shared-memory machines than distributed-memory machines. In this paper, we present the TGEX system which cleanly separates the speci cation of the computational problem from the control of parallel execution. In
TGEX, a general computational problem is represented by means of a task dependence graph which forms a machine-independent parallel program. Unlike the above approaches, this machine-independent program is statically mapped onto a given parallel machine while multiple tasks mapped onto the same processor are dynamically scheduled (locally) according to the macro data ow principle. In this hybrid approach, global, run-time scheduling overhead is avoided, whereas task/communication concurrency and work load balance is retained by mapping sucient tasks onto each node. TGEX is implemented using PVM for communication and can run on dierent distributed-memory machines. Experiments with a number of test problems and real applications show that it is an ecient tool for analyzing and executing unstructured problems.
Problem specification in high-level language
Translation (partitioning and granularity control)
Program generate Task graph
User defined functions Library functions
Annotated task graph activity
Data flow execution
Remapping and scheduling algorithms
Application development with TGEX
2 The TGEX system Our aim is to come to an ecient system for analyzing and executing unstructured problems. The most important design criteria to be considered are: 1. Portability: It should run on various distributed-memory computing platforms. 2. Ease of use: It should greatly ease the programming of unstructured problems for distributed-memory machines. Low level communication and explicit synchronization among dierent processes (especially so for parallelism of dynamic nature) is a tedious and error prone activity in parallel programming,
the programmer should be relieved from it. 3. Eciency: It should achieve a high parallel eciency and be scalable towards a large number of processors. As communication is often the dominant overhead, the system should be able to minimize the communication requirement and tolerate communication latency (e.g. through overlapping of computation with communication). The TGEX system is designed with these criteria in mind. In TGEX the speci cation of the computational problem is separated from the control of the parallel execution. Figure 1 illustrates the structure of the TGEX system. A task/data dependence graph is used for specifying the computational problem. A task graph is a directed graph where the nodes represent the computational tasks and the edges the data dependence relations (e.g. a task needs the result (output) of another task as input). For many problems the tasks can be naturally identi ed through functional or data decomposition or a combination of these. The decomposition and the task graph generation can be either performed by an application programmer or by a preprocessor (an example of the latter can be found in ). A computational problem can also be described in some high level programming language and then use some translation mechanism (for partitioning with granularity control) to generate the task graph. This research will be for example a part of the forthcoming research project AUTOMAP . At this moment, programmers de ne the task graph themselves or use some application-speci c program to generate the task graph (e.g. [6, 7]). The functions corresponding to the tasks must be provided by the programmer. Both Fortran or C functions can be linked to TGEX. Either user-programmed functions or library functions such as BLAS subroutines can be used. parallel do /* parallel section */ f = A(a, b); || g = B(c,d); end parallel do if (pid = 2) send g to processor 1; if (pid = 1) receive g from processor 2; if (pid = 1) h = C(f,g); Fig. 2.
g C h
Comparison between a task parallel program and TGEX source.
Figure 2 illustrates the dierence between TGEX and some task parallel programminglanguage. The problem to be computed is f = A(a; b); g = B (c; d); h = C (f; g). The left gure shows the program in some task parallel language and the right gure shows the task graph which is to be de ned in TGEX task graph interface format. The functions A; B and C can be implemented in a conventional (i.e. sequential) way, they can be used by both TGEX and the task parallel program. In a task parallel program, the programmer needs to deal with communication and synchronization for exercising some explicit control on the parallel execution. In TGEX, there is no program in the conventional sense, the user only
speci es what tasks must be performed and their data interdependence. The actual control (communication and synchronization) of the execution is fully taken care by the TGEX system. For ecient parallel execution on a given parallel architecture, the tasks must be mapped to processors such that communication cost is minimized while retaining a good load-balance. After the mapping is known, the task graph is annotated with the assignment information. Apart from actually executing the annotated graph, TGEX comes with a performance simulator, implemented using Pamela (PerformAnce ModEling LAnguage [4, 5]), that enables performance optimisation and scalability analysis in order to investigate the in uence of the various algorithm and machine properties. As mentioned earlier, TGEX has been implemented using PVM for communication. In order to enable true overlap between computation and communication, TGEX uses three autonomous processes on each processor: a sender, a receiver and an executer. Thus, while an asynchronous transfer is in progress, computation at both sending and receiving processors continues.
3 Parallel applications with TGEX We have applied TGEX for the parallel execution of a number of highly unstructured problems. In the following we consider two case studies. The rst experiment comprises computing the LU-factorization of a number of dense matrices. A given matrix is subdivided into a number of block submatrices and a update on or factorization of the entire submatrix is de ned as a task. The second experiment deals with parallel Cholesky factorization of symmetric sparse matrices. These sparse matrices comes from real applications of the DIANA software package. These problems have been computed on a Parsytec GCel transputer system, as well as a PowerXplorer4 . 3.1
Table 1 shows the execution time of computing an LU-factorization for a 1536
1536 matrix using the TGEX on the PowerXplorer. The BLAS 3 Fortran sub-
routines have been linked to TGEX for implementing the task functions. Table 1 contains the results of this matrix with a CBC (Cyclic Block-Column) mapping in several situations. The rst column gives the result when the actual computations are done. The second to the fourth column contains the results when the computation routines are replaced by synthetic ones that simply execute the same number of operations. Because of its numeric insensitivity, this synthetic mode enables experimentation with alternative modes of operation. Column two contains the result of the original amount computation and communication, column three the result without communication (F-mode) and column four the result without computation (C-mode). 4
Both systems were kindly made available by the Interdisciplinary Center for Computer-based Complex systems research Amsterdam (IC3A).
P real kernel 1 663.3 2 342.3 4 180.3 8 99.6 12 74.7 16 59.2
synthetic kernel normal F-mode C-mode 731.4 696.6 35.8 377.0 356.9 22.8 198.3 186.8 19.0 108.9 101.1 21.0 81.5 74.5 22.8 64.2 57.2 25.4
P real synthetic kernel kernel normal F-mode C-mode 1 663.3 731.4 696.6 35.8 2 368.5 413.7 357.7 114.0 4 207.3 235.9 180.3 120.7 8 127.6 141.2 92.2 108.0 12 94.8 104.7 61.2 84.1 16 108.7 119.3 47.5 107.9
Dense 15361536 matrix, CBC mapping.
Table 2. Dense 15361536 matrix, CB mapping.
Table 2 contains the result of the CB (Cyclic Block) mapping. The F-mode and C-mode measurements indicate that there is a large overlap between computation and communication (e.g., for CB mapping with P = 4 it holds F-mode + C-mode = 301 s whereas normal = 235.9 s). This feature makes the TGEX distinct from many other parallel programming systems. Figure 3 shows the speedup as a function of the number of processors. It should be noticed that the Dense, dim=1536, sub=48; CB=o, CBC=*, ideal=x 16
Speed-up dense 15261536 matrix with 4848 submatrices.
eciency is less than 1 mainly due to the fact that the communication capacity of the PowerXplorer is very unbalanced. The communication network is the same as that of a GCel, however the processor speed is increased 20-fold. The eciency on a GCel can be expected to be much higher. Unfortunately such a large matrix cannot be factorized on the system because of insucient processor memory.
Sparse Cholesky factorization
The computer solution of large and complex nite analysis problems requires a large amount of computing time. The most computation-intensive part of a FEM computation is the solution of a large sparse system of equations. The direct solver plays a prominent role in many commercial FEM software packages. Therefore, we consider the parallel direct solver for sparse systems in FEM computation. In [7, 8] the parallelization of the FEM software package DIANA5 has been considered. The parallelization is based on domain decomposition. After the domain decomposition, a so-called block structured DBBD-matrix (with a sparse number of blocks) is constructed via a partial ordering of the interfaces between the subdomains. Subsequently, a task graph is constructed in which each task corresponds to a matrix operation on a block submatrix (see Fig. 4). Due to fact that this task graph is strongly problem and decomposition dependent, it is generated by the DIANA program . Table 3 shows the performance resparse DBBD matrix physical structure
-1 L 5,1 = A5,1⋅L-T1,1⋅D1,1
L 5,2 =
T A5,5 =A5,5 - L5,1⋅D1,1⋅L5,1
A2,2 → L2,2⋅D2,2⋅LT2,2
L 6,2 =
L 6,3 =
L 7,3 =
A5,5 → L5,5⋅D5,5⋅LT5,5
(5,1) (5,2) 0 5 X 0 (6,2) (6,3) X 6 X 0 (7,3) (7,4) 0 X 7 Region II
computation task graph A1,1 → L1,1⋅D1,1⋅LT1,1
L 6,5 = A6,6= A6,6 → L6,6⋅D6,6⋅LT6,6
Parallelization of FEM computations
sults of a parallel LDLT -factorization of a sparse matrix obtained by subdivision into 16 subdomains. Although the corresponding task graph consists of about 1000 tasks such that each processor is typically assigned with a large number of tasks, the potential parallelism is limited (16 subdomains) which is re ected in the performance for large P .
4 Concluding remarks We present an approach of representing a general computational problem by means of a task graph. Moreover, the same task graph representation provided with a proper interface forms a machine-independent parallel program which can be executed on dierent parallel computer systems. Expermental results show that TGEX is an ecient general purpose system for parallel execution 5
DIANA is a registered trademark of TNO-BOUW Institute for Building and Construction Research, Rijswijk, the Netherlands
Mapping Random CB CBC 1 40.0 40.0 40.0 2 26.7 26.6 24.1 4 18.8 19.0 14.9
Mapping Random CB CBC 8 19.1 18.4 11.8 12 21.1 19.5 11.1 16 21.2 18.9 9.9
Execution time of a LDLT -factorization of a sparse matrix obtained by subdivision into 16 subdomains.
of unstructured problems. Future work includes the development of a graphic user interface for the task graph and a high-level language accompanied with a translation system.
References 1. B. Chapman, P. Mehrotra and H. Zima, "Programming in Vienna Fortran", Scienti c Programming, Vol. 1 (1), Fall 1992, pp. 1-50. 2. J. Dongarra and D. Sorensen, "SCHEDULE: Tools for developing and analyzing parallel Fortran programs", In Characteristics of parallel algorithms, D. Gannon, L. Jamieson and R. Douglass (editors), The MIT Press, Cambridge, 1987. 3. G. Fox et al., "Fortran D language speci cation", Tech. Report CRPC-TR90097, Center for Research on Parallel Computation, Rice Univ., Houston, Texas, 1990. 4. A.J.C. van Gemund and H.X. Lin, "Scalability Analysis of Parallel Finite Element Methods using Performance Simulation", in Proc. EUROSIM'95, Vienna, Sept. 1995, pp. 261-266. 5. A.J.C. van Gemund, "Performance Modeling of Parallel Systems", Ph.D. thesis, Delft University of Technology, 1996. 6. H.X. Lin, A methodology for the parallel direct solution of nite element systems, Ph.D. thesis, Delft University of Technology, 1993. 7. H.X. Lin and H.J. Sips, "Parallel direct solution of large sparse systems in nite element computations", in Proceedings of 7th ACM Int'l Conf. Supercomputing, ACM Press, Tokyo, Japan, July 19-23, 1993, pp. 261-270. 8. H.X. Lin, "A general approach for parallelizing the FEM software package DIANA", Proc. HPC Conf. '94, Singapore, 1994, pp. 261-270. 9. R. Ponnusamy, Y.S. Hwang, R. Das and J.H. Saltz, "Supporting irregular distributions using data-parallel languages", IEEE Parallel & Distributed Technology, Vol. 3 (1), Spring 1995, pp. 12-24. 10. M. Rinard, D.J. Scales and M.S. Lam, "Jade: A high-level, machine-independent language for parallel programming", IEEE Computer, Vol. 26 (6), 1993, pp. 28-38. 11. W.W. Shu and L.V. Kale, "Chare Kernel: A runtime support system for parallel computations", J. of Parallel and Distr. Comput., Vol. 11, 1990, pp. 198-211. 12. H.J. Sips, A.J.C. van Gemund and H.X. Lin, "AUTOMAP: A parallel coordination based programming system", Research project proposal, April 1995. (approved and funded by Dutch NWO SION research programme) This article was processed using the LATEX macro package with LLNCS style