Parallel Solution of Symmetric Banded Systems
Parallel Computing: Trends and Applications. G.R. Joubert, D. Trystram, F.J. Peters and D.J. Evans (Editors). © 1994 Elsevier Science B.V. All rights reserved.

Parallel Solution of Symmetric Banded Systems

H.X. Lin and M.R. Roest
Parallel Algorithms Group, TWI/TA, Delft University of Technology, Mekelweg 4, 2628 CD Delft, the Netherlands
An efficient parallel method is presented for the solution of symmetric banded systems through Cholesky factorization. The proposed algorithm, called the ParChol (Parallel Cholesky factorization) algorithm, is better suited for banded systems than known sparse matrix factorization algorithms. Analysis shows that the ParChol algorithm achieves linear speedup, since the communication overhead and the sequential fraction of the operations are small. For a matrix of dimension n and semibandwidth b, the parallel factorization time on a computer with p processors is O(nb^2/p) for p < (n/b)/log(n/b), so the efficiency is O(1).
1. INTRODUCTION

In many applications, a given symmetric banded linear system Ax = y has to be solved for many right-hand side vectors y. This can be done by first performing a Cholesky factorization of the matrix, after which the solution for each vector y can be found by solving two triangular systems. On sequential computers, the computation of a Cholesky factorization takes O(nb^2) time. Much effort has been put into finding good parallel algorithms for this problem. Most of the work is based on [2], which considers the problem of parallel sparse matrix Cholesky factorization. The basic idea is to work on different columns in parallel where possible. Improved versions of the original algorithm have been reported in e.g. [3] and [4]. However, for banded matrices with no zero elements within the band, the speedup of these algorithms can be no more than O(b). To increase the speedup factor, Bar-On [1] has designed a parallel Cholesky factorization algorithm with a theoretical time complexity of O(nb^2/p) using p < (n/b)/log(n/b) processors. Unfortunately, the resulting triangular factor, which has the same structure as the original band matrix, is not very well suited for parallel triangular solution. The maximum speedup is O(b) unless a lot of extra work is performed (the total number of operations is then increased to O(nb^2), from the O(nb) of sequential triangular solution). The parallel Cholesky factorization algorithm that is presented in this paper has the same theoretical time complexity but gives factors that allow efficient triangular solution. Furthermore, the required communication is limited. In the next section, the ParChol algorithm is described, the efficiency of the parallel forward and back substitution is indicated, and a few remarks are made on the data distribution.
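To fix ideas, the sequential baseline can be sketched in pure Python. The function names and the dense list-of-lists storage are our own illustrative choices, not the paper's:

```python
# Illustrative sketch (not from the paper): sequential banded Cholesky
# factorization A = L L^T for a symmetric positive definite matrix with
# semibandwidth b, plus the two triangular solves reused for each new y.

def band_cholesky(A, b):
    """Factor A (dense list of lists, SPD, semibandwidth b) into lower L."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Only the b entries left of the diagonal can be nonzero in row j.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(max(0, j - b), j))
        L[j][j] = s ** 0.5
        for i in range(j + 1, min(n, j + b + 1)):
            t = A[i][j] - sum(L[i][k] * L[j][k] for k in range(max(0, i - b), j))
            L[i][j] = t / L[j][j]
    return L

def band_solve(L, y, b):
    """Solve L z = y, then L^T x = z, touching only entries inside the band."""
    n = len(y)
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(max(0, i - b), i))) / L[i][i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, min(n, i + b + 1)))) / L[i][i]
    return x

# Tiny tridiagonal (b = 1) example with known solution x = [1, 1, 1]:
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
y = [5.0, 6.0, 5.0]
x = band_solve(band_cholesky(A, 1), y, 1)  # x ≈ [1.0, 1.0, 1.0]
```

The factorization's inner loops each touch O(b) earlier columns for every one of the n rows, giving the O(nb^2) count; each triangular solve is only O(nb), which is why reusing L across many right-hand sides pays off.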
Figure 1. Decomposition of the matrix A for p = 3
2. THE ALGORITHM

2.1. Factorization
Figure 1 is a sketch of the decomposition of matrix A into p smaller banded matrices (A_{2i-1,2i-1}, i = 1,...,p) and 5(p-1) matrices that represent the coupling between the A_{2i-1,2i-1}'s (A_{2i,2i}, A_{2i,2i-1}, A_{2i,2i+1}, A_{2i-1,2i}, A_{2i+1,2i}, i = 1,...,p-1). This decomposition forms the basis of the ParChol algorithm, which recursively reduces A to a much smaller symmetric matrix D (see Figure 2) and a factorized matrix LL^T. Figure 3 gives pseudo code for the algorithm. First, the matrices A_{2i-1,2i-1}, i = 1,...,p are factorized in parallel, after which the matrices A_{2i,2i-1} and A_{2i,2i+1}, i = 1,...,p-1 are updated. This gives the matrix L. Then the remaining part D is computed, which consists of the matrices D_{2i,2i}, D_{2i,2i-2}, D_{2i,2i+2}, i = 1,...,p-1. The matrix D is a symmetric positive definite block tridiagonal matrix of order (p-1)b. If p is large enough, it is worthwhile to apply the ParChol algorithm recursively to D, until the remaining part is sufficiently small for sequential factorization.
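The computation of D is, in essence, a block Schur complement. A minimal numeric sketch for p = 2, with 1x1 blocks (scalars standing in for the b-by-b blocks; the values are invented for illustration):

```python
# Minimal sketch (1x1 blocks, p = 2) of one ParChol reduction step:
# factor the two decoupled diagonal blocks A11 and A33 "in parallel",
# update the coupling blocks, and form the remaining matrix D.
from math import sqrt, isclose

A11, A22, A33 = 4.0, 5.0, 9.0   # diagonal blocks (SPD)
A21, A23 = 2.0, 3.0             # coupling blocks (A12 = A21, A32 = A23 by symmetry)

# Step 1 (parallel over i): Cholesky of the decoupled blocks.
L11, L33 = sqrt(A11), sqrt(A33)
# Step 2 (parallel): update the coupling blocks, e.g. L21 = A21 * L11^{-T}.
L21, L23 = A21 / L11, A23 / L33
# Step 3: remaining part D = A22 - L21 L21^T - L23 L23^T.
D = A22 - L21 ** 2 - L23 ** 2

# D equals the Schur complement of the decoupled blocks in A:
assert isclose(D, A22 - A21 ** 2 / A11 - A23 ** 2 / A33)  # D = 5 - 1 - 1 = 3
```

With b-by-b blocks the scalar divisions become triangular solves and the squares become matrix products, but the dependency structure (and hence the available parallelism) is the same.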
Figure 2. One step of the ParChol algorithm: A is reordered so that the decoupled diagonal blocks come first and is then partially factorized, leaving the remaining part D (matrix sketches not reproduced).
At most (p-1)/2 processors can be used to factorize the matrix D with the ParChol algorithm, as can be verified by applying the decomposition of Figure 1 to D and noting that the D_{2i+2,2i} are full matrices. It can be shown that the parallel time complexity of the algorithm is O(nb^2/p) for p < (n/b)/log(n/b).

2.2. Forward and back substitution
The triangular factor obtained by the ParChol algorithm allows efficient parallelization of the forward substitution, Lz = y, and the back substitution, L^T x = z. The parallel forward substitution starts with the solution of L_{2i-1,2i-1} z_{2i-1} = y_{2i-1}, i = 1,...,p. Then the subvectors y_{2i} are updated with the subvectors z_{2i-1} and z_{2i+1}, i.e.:

y_{2i} := y_{2i} - L_{2i,2i-1} z_{2i-1} - L_{2i,2i+1} z_{2i+1},   i = 1,...,p-1.
subroutine ParChol(input: A, p; output: set of L)
  do parallel i = 1,..,p
    Compute L_{2i-1,2i-1} from A_{2i-1,2i-1};
  enddopar
  do parallel i = 1,..,(p-1)
    do sequentially
      do parallel
        L_{2i,2i-1} = A_{2i,2i-1} L_{2i-1,2i-1}^{-T};
        || L_{2i,2i+1} = A_{2i,2i+1} L_{2i+1,2i+1}^{-T};
      enddopar
      do parallel
        D_{2i,2i} = A_{2i,2i} - L_{2i,2i-1} L_{2i,2i-1}^T - L_{2i,2i+1} L_{2i,2i+1}^T;
        || D_{2i+2,2i} = -L_{2i+2,2i+1} L_{2i,2i+1}^T;
      enddopar
    enddoseq
  enddopar
  if p > 3
    ParChol(D, (p-1)/2);
  else
    Compute the factorization of D sequentially;
  endif
Figure 3. The ParChol algorithm. The symbol || delimits processes that can run in parallel.
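The pseudo code of Figure 3, together with the substitution scheme of Section 2.2, can be exercised on the smallest nontrivial case: p = 3 with 1x1 blocks, i.e. a 5x5 tridiagonal SPD matrix. The following pure-Python sketch is our own transcription; scalars stand in for the b-by-b blocks, and the numbers are invented so that the exact solution is known:

```python
# One level of the ParChol scheme for p = 3 with 1x1 blocks (5x5
# tridiagonal SPD matrix). Sequencing follows Figure 3 and Section 2.2;
# the variable names are ours, not the paper's.
from math import sqrt

d = [4.0, 4.0, 4.0, 4.0, 4.0]      # diagonal blocks A_{j,j}
e = [1.0, 1.0, 1.0, 1.0]           # couplings A_{j+1,j} = A_{j,j+1}
y = [6.0, 12.0, 18.0, 24.0, 24.0]  # right-hand side; exact solution is 1..5

# Factorization (Figure 3). Parallel over the decoupled odd blocks:
L00, L22, L44 = sqrt(d[0]), sqrt(d[2]), sqrt(d[4])
# Update the coupling blocks:
L10, L12 = e[0] / L00, e[1] / L22
L32, L34 = e[2] / L22, e[3] / L44
# Remaining part D: block tridiagonal of order (p-1)*b = 2.
D11 = d[1] - L10 ** 2 - L12 ** 2
D33 = d[3] - L32 ** 2 - L34 ** 2
D31 = -L32 * L12                   # D_{2i+2,2i} = -L_{2i+2,2i+1} L_{2i,2i+1}^T

# Forward substitution (Section 2.2): solve the decoupled systems ...
z0, z2, z4 = y[0] / L00, y[2] / L22, y[4] / L44
# ... and update the even right-hand sides:
y1 = y[1] - L10 * z0 - L12 * z2
y3 = y[3] - L32 * z2 - L34 * z4
# Solve D x' = y' (p is small here, so sequentially; direct 2x2 solve):
det = D11 * D33 - D31 * D31
x1 = (D33 * y1 - D31 * y3) / det
x3 = (D11 * y3 - D31 * y1) / det
# Update the odd subvectors and back-substitute in parallel:
x0 = (z0 - L10 * x1) / L00
x2 = (z2 - L12 * x1 - L32 * x3) / L22
x4 = (z4 - L34 * x3) / L44

x = [x0, x1, x2, x3, x4]           # -> [1.0, 2.0, 3.0, 4.0, 5.0]
```

Note that the three odd blocks are factorized and back-substituted independently, while only the small system in D (order (p-1)b) couples the processors.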
Next, the system D x' = y' must be solved, where x' is made up of the subvectors x_{2i}, i = 1,...,p-1 of x and has length (p-1)b, and the right-hand side y' consists of the updated subvectors y_{2i}. The solution of this system is done recursively, in the same way as is described here for the original system. With the solution x', the subvectors z_{2i-1} can be updated:

z_{2i-1} := z_{2i-1} - L_{2i-2,2i-1}^T x_{2i-2} - L_{2i,2i-1}^T x_{2i},   i = 1,...,p,

and back substitution can proceed in parallel on p processors: L_{2i-1,2i-1}^T x_{2i-1} = z_{2i-1}, i = 1,...,p. It can be shown that the speedup factor S for the solution phase is:

S = p n(2b+1) / ( (4b+1)n - (4b^2+1) + 5b^2 p log(p-1) ),
which approaches S = p/2 if n >> bp.

2.3. Data distribution
The algorithm is specially suited for distributed memory machines. An efficient data distribution is as follows. Each processor i gets A_{2i-1,2i-1}, A_{2i,2i-1}, A_{2i-2,2i-1}, A_{2i,2i}, the subvectors y_{2i-1}, y_{2i} and the corresponding parts of the other vectors. From these data, processor i computes L_{2i-1,2i-1}, L_{2i-2,2i-1}, L_{2i,2i-1}, D_{2i,2i}, D_{2i,2i-2}, x_{2i-1} and x_{2i}. This distribution is essentially a rowwise distribution of the matrix. As was mentioned above, the number of processors is halved in each recursion. For the data, this means that upon entering a new recursion in the factorization, paired processors collect their computed D-matrices on the processor that will remain active. A similar data exchange happens in the forward and back substitution. The communication is limited during the factorization as well as during the forward and back substitution. During factorization, the volume of data that must be exchanged is O(b^2) on each recursion level, whereas it is O(b) on each recursion level in the forward and back substitution. Furthermore, the interconnection network in most modern-day computers is certainly capable of performing these data exchanges efficiently.

REFERENCES
1. I. Bar-On. Efficient logarithmic time parallel algorithms for the Cholesky decomposition and Gram-Schmidt process. Parallel Computing, 17:409-417, 1991.
2. A. George, M.T. Heath, J. Liu, and E. Ng. Sparse Cholesky factorization on a local-memory multiprocessor. SIAM J. Sci. Stat. Comput., 9(2):327-340, March 1988.
3. E. Ng and B.W. Peyton. A supernodal Cholesky factorization algorithm for shared-memory multiprocessors. SIAM J. Sci. Stat. Comput., 14(4):761-769, July 1993.
4. G. Zhang and H.C. Elman. Parallel sparse Cholesky factorization on a shared memory multiprocessor. Parallel Computing, 18:1009-1022, 1992.