
Parallel Algorithms and Applications, Vol. 13, pp. 289-305
© 1999 OPA (Overseas Publishers Association) N.V.
Published by license under the Gordon and Breach Science Publishers imprint.
Reprints available directly from the publisher. Photocopying permitted by license only.
Printed in Malaysia.

SOLVING SPARSE LEAST SQUARES PROBLEMS WITH PRECONDITIONED CGLS METHOD ON PARALLEL DISTRIBUTED MEMORY COMPUTERS

TIANRUO YANG^a,* and HAI XIANG LIN^b

^a Department of Computer and Information Science, Linköping University, S-58183 Linköping, Sweden; ^b Department of Technical Mathematics and Computer Science, TU Delft, P.O. Box 356, 2600 AJ Delft, The Netherlands

(Received 11 March 1997; Revised 27 February 1998)

In this paper we study the parallel aspects of PCGLS, a basic iterative method for solving sparse least squares problems on massively parallel distributed memory computers, based on the conjugate gradient method applied to the normal equations with an Incomplete Modified Gram-Schmidt (IMGS) preconditioner. The performance of these methods on this kind of architecture is usually limited by the global communication required for the inner products. We describe the parallelization of PCGLS and the IMGS preconditioner with two kinds of improvement. One is to accumulate the results of a number of inner products collectively; the other is to create situations where communication can be overlapped with computation. A theoretical model of the computation and communication phases is presented, which allows us to determine the optimal number of processors that minimizes the runtime. Several numerical experiments on the Parsytec GC/PowerPlus are presented.

Keywords: Sparse least squares problems; PCGLS algorithm; Modified Gram-Schmidt preconditioner; Global communication; Parallel distributed memory computers

1 INTRODUCTION

Many scientific and engineering applications, such as linear programming [7], the augmented Lagrangian method for computational fluid dynamics [15], and the natural factor method in structural engineering [3,17], give rise to sparse least squares problems. Because the coefficient matrix is symmetric and positive definite, the normal equations of the minimization problem can often be solved efficiently using the conjugate gradient method. The resulting method is used as the basic iterative method to solve the least squares problems. In theory, it is a straightforward extension of the standard conjugate gradient method. However, many variants of these methods appear to be numerically unstable. A comprehensive comparison of different implementations can be found in [6]. In this paper we use the most accurate version, CGLS1, as the standard CGLS; it will be introduced later on.

Preconditioning techniques that accelerate the convergence, such as column scaling and SSOR preconditioning [5], incomplete Cholesky factorization [20], polynomial preconditioning [1], and incomplete orthogonal factorization [19,23], have received extensive attention in the literature. We consider in this paper the Incomplete Modified Gram-Schmidt (IMGS) method as the preconditioner.

On massively parallel computers, the basic time-consuming computational kernels of the iterative schemes are usually: inner products, vector updates, matrix-vector products, and preconditioning. In many situations, especially when the matrix operations are well structured, these operations are suitable for implementation on vector and shared memory parallel computers [13].

* Corresponding author. E-mail: [email protected]
But for parallel distributed memory machines, the matrices and vectors are distributed over the processors, so that even when the matrix operations can be implemented efficiently by parallel operations, we cannot avoid the global communication required for inner product computations. Detailed discussions of the communication problem on distributed memory systems can be found in [10,22]. In other words, these global communication costs become relatively more and more important as the number of parallel processors increases, and thus they have the potential to affect the scalability of the algorithm in a negative way [9,10]. This aspect has received much attention, and several approaches have been suggested to improve the parallel performance [2,10].

We also consider preconditioners because the preconditioning part is often the most problematic part in a parallel environment. These preconditioners should preferably have a good convergence rate, only a moderate communication cost, and a high degree of parallelism, and should allow for a relatively simple implementation. In this paper we focus on the IMGS preconditioner, because its existence is theoretically guaranteed [24] and because it has special properties that can be efficiently exploited to reduce the global communication costs in practice.

In the present study we apply the techniques for reducing the global communication overhead of the inner products suggested by de Sturler and

van der Vorst [11]. In our approach we identify operations that can be executed while communication takes place (for overlapping), and we combine the messages for several inner products. For PCGLS the first goal, overlapping, is achieved by rescheduling the operations without changing the numerical stability of the method [12]. For IMGS preconditioning both goals are achieved by reformulating the IMGS orthogonalization procedure. Several numerical experiments are carried out on the Parsytec GC/PowerPlus at the National Supercomputing Center, Linköping University. These experimental results are compared to the results from our theoretical analysis.

In Section 2, we give a short description of the PCGLS algorithm with the IMGS preconditioner. We then present the communication model used in the theoretical analysis of time complexity, and a simple model for the computation time and communication costs of PCGLS and IMGS preconditioning, in Sections 3 and 4, respectively. In Section 5, comparisons between the sequential and the parallel performance are made through both theoretical complexity analysis and experimental observations.
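The first of the two improvements, accumulating several inner products in one global operation, can be simulated serially without any message-passing library (the "processors" below are just array slices; all names are ours, for illustration only): each processor forms local partial sums for both inner products at once, so a single reduction of a length-2 vector replaces two separate global sums.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nprocs = 1000, 4
u, v, w = (rng.standard_normal(n) for _ in range(3))

# Each "processor" owns a contiguous slice of the vectors.
slices = np.array_split(np.arange(n), nprocs)

# Local phase: every processor computes partial sums for BOTH
# inner products (u, v) and (u, w) from its own data.
partials = np.array([[u[s] @ v[s], u[s] @ w[s]] for s in slices])

# Global phase: one combined reduction per processor instead of
# two separate global accumulations.
uv, uw = partials.sum(axis=0)

print(np.isclose(uv, u @ v), np.isclose(uw, u @ w))  # True True
```

On a real distributed memory machine the `sum(axis=0)` step would be a single all-reduce of a two-element buffer, halving the number of global synchronization points for this pair of inner products.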

2 PCGLS WITH IMGS PRECONDITIONER

Basically, the linear least squares problem can be described as follows:

    min ||b - Ax||_2,

where A ∈ R^(m×n), rank(A) = n, is a given matrix. It is well known that x is a least squares solution if and only if the residual vector r = b - Ax is orthogonal to R(A), the range of A, or equivalently, when A^T(b - Ax) = 0. The resulting system of normal equations

    A^T A x = A^T b

is always consistent. The conjugate gradient method (CG) was developed in the early 1950s by Hestenes and Stiefel [18]. Because in exact arithmetic it converges in at most n steps, it was first considered as a direct method. However, the finite termination property does not hold with roundoff, and the method first came into wide use in the mid-1970s, when it was realized that it should be regarded as an iterative method. It has now become a basic tool for solving large sparse linear systems and linear least squares problems.
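The optimality condition above is easy to check numerically. The following snippet (our own illustration, using a library least squares solver and arbitrary test data) verifies that the computed solution makes the residual orthogonal to the range of A and therefore satisfies the normal equations:

```python
import numpy as np

# Arbitrary full-column-rank test problem: m = 8, n = 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)

# Least squares solution from a library routine.
x, *_ = np.linalg.lstsq(A, b, rcond=None)

# Optimality: r = b - Ax is orthogonal to R(A), i.e. A^T (b - Ax) = 0,
# which is exactly the (always consistent) normal equations A^T A x = A^T b.
r = b - A @ x
print(np.allclose(A.T @ r, 0, atol=1e-10))            # True
print(np.allclose(A.T @ A @ x, A.T @ b, atol=1e-10))  # True
```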


There are many different ways, all mathematically equivalent, to implement the conjugate gradient method, as described in [6,25]. In exact arithmetic they all generate the same sequence of approximations, but in finite precision the achieved accuracy may differ substantially. Elfving [14] compared several implementations of the conjugate gradient method and found CGLS1 to be the most accurate. A small variation of CGLS1 is obtained if, instead of r = b - Ax, the residual of the normal equations, s = A^T(b - Ax), is recurred; this variant, CGLS2, has poor numerical stability. Besides these two versions of CGLS, Paige and Saunders [21] developed algorithms based on the Lanczos bidiagonalization method of Golub and Kahan [16]. There are two forms of this bidiagonalization procedure, Bidiag1 and Bidiag2, which result in two algorithms with different numerical properties. A comprehensive comparison of these different implementations has been done by Björck et al. [6], including a detailed analysis of the failure of CGLS2 and LSQR2. In this paper we choose CGLS1, namely the standard CGLS, as the basic method for solving least squares problems throughout the rest of the paper. PCGLS, which is CGLS with a preconditioner M, can be sketched as follows:

ALGORITHM PCGLS

Let x_0 be an initial approximation and set

    r_0 = b - A x_0,    s_0 = p_0 = R^(-T)(A^T r_0),    γ_0 = (s_0, s_0);

for k = 0, 1, 2, ... compute

    t_k = R^(-1) p_k;
    q_k = A t_k;
    α_k = γ_k / (q_k, q_k);
    x_{k+1} = x_k + α_k t_k;
    r_{k+1} = r_k - α_k q_k;
    s̃_{k+1} = A^T r_{k+1};
    s_{k+1} = R^(-T) s̃_{k+1};
    γ_{k+1} = (s_{k+1}, s_{k+1});
    β_k = γ_{k+1} / γ_k;
    p_{k+1} = s_{k+1} + β_k p_k.


All the operations of the PCGLS method, except for the update of x, must be computed in sequence. Therefore attempts to perform some of these statements in PCGLS simultaneously are bound to fail. A better way to parallelize the PCGLS method is to exploit geometric parallelism. This means that the data are distributed over the processors in such a way that every processor is responsible for the computations on its own local data.

IMGS is a modified version of Incomplete Gram-Schmidt orthogonalization, used to obtain an incomplete orthogonal factorization preconditioner M = R, where A = QR + E is an approximation of a QR factorization, Q is an m × n matrix and R is an upper triangular matrix. A detailed description can be found in [25]. Let P_n = {(i, j) | i ≠ j, 1 ≤ i, j ≤ n} and assume that the matrix A has full rank. Suppose we are given a set of index pairs P such that P ⊆ P_n and (i, j) ∈ P implies that i < j
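To make the construction concrete, here is a rough serial sketch of an incomplete modified Gram-Schmidt factorization in the spirit described above. The function name and the dropping rule (keep r_ij only when (i, j) lies in the pattern P; diagonal entries are always kept) are our own reading, and the paper's precise rule may differ, since the definition of P is cut off in this excerpt.

```python
import numpy as np

def imgs(A, P):
    """Incomplete MGS sketch: A ~ Q R, dropping r_ij for (i, j) not in P.

    P is a set of 0-based index pairs (i, j) with i < j; the diagonal
    entries r_jj are always kept.  With the full pattern this reduces
    to plain modified Gram-Schmidt, so Q R = A exactly.
    """
    m, n = A.shape
    Q = A.astype(float).copy()
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = np.linalg.norm(Q[:, i])   # diagonal always kept
        Q[:, i] /= R[i, i]
        for j in range(i + 1, n):
            rij = Q[:, i] @ Q[:, j]
            if (i, j) in P:                 # keep the entry and eliminate
                R[i, j] = rij
                Q[:, j] -= rij * Q[:, i]
            # otherwise r_ij is dropped: R[i, j] stays zero and no
            # update is applied to column j
    return Q, R
```

M = R then serves as the preconditioner in PCGLS; sparser patterns P trade the orthogonality of Q (and hence preconditioner quality) for cheaper setup and cheaper triangular solves.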
