Kami adalah komunitas sharing. Jadi tolong bantu kami dengan mengunggah ** 1 ** dokumen baru atau yang ingin kami unduh:

ATAU SEPERTI DOWNLOAD SEGERA

Parallel Computing 31 (2005) 563–587 www.elsevier.com/locate/parco

A class of novel parallel algorithms for the solution of tridiagonal systems J. Verkaik a, H.X. Lin a

b,*

TNO MADYMO BV, Schoemakerstraat 97, P.O. Box 6071, 2600 JA Delft, The Netherlands b Delft University of Technology, Delft Institute of Applied Mathematics (DIAM), Mekelweg 4, 2628 CD Delft, The Netherlands

Received 15 June 2004; received in revised form 11 November 2004; accepted 17 January 2005 Available online 29 April 2005

Abstract In this paper, a new class of parallel Gaussian elimination algorithms is presented for the solution of tridiagonal matrix systems. The new algorithms, called ACER (alternating cyclic elimination and reduction), combine the advantages of the well known cyclic elimination algorithm (which is fast) and the cyclic reduction algorithms (which requires fewer operations). The ACER algorithms are developed with the unifying graph model. 2005 Elsevier B.V. All rights reserved. Keywords: Parallel matrix algorithm; ACER algorithms; Graph transformation; Unifying graph model

1. Introduction The solution of tridiagonal linear systems of equations occurs in many engineering and scientiﬁc computer applications. For instance, it is often a part of the solution process in numerical simulation of partial diﬀerential equations using ﬁnite diﬀerence or ﬁnite element discretizations. When the standard Gaussian elimination method is applied to a (pre-ordered) tridiagonal matrix, the computation is *

Corresponding author. Tel.: +31 15 27 87229; fax: +31 15 27 87295. E-mail address: [email protected] (H.X. Lin).

0167-8191/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2005.01.002

564

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

inherently sequential. That is the reason why it has inspired many researchers to study parallel algorithms for solving this problem (in the past three decades more than 200 journal papers have been published on this problem, e.g., see an online list of publications [7]). The serial data dependency in the standard Gaussian elimination/factorization process prohibits any form of parallelization. Therefore, diﬀerent algorithms are designed which make a trade-oﬀ between doing extra arithmetic operations and having a higher degree of parallelism. Such trade-oﬀ is typical in designing parallel algorithms for problems whose eﬃcient parallelization is not straightforward. The solution of tridiagonal system is such a problem that ideally shows the trade-oﬀ considerations in designing algorithms for not so straightforward or hard to parallelize problems. As we have mentioned that during the last four decades a large variety of parallel algorithms for the solution of tridiagonal systems have been proposed, some well known examples are: the cyclic reduction algorithm [4,6], the cyclic elimination, the recursive doubling algorithm [13], the block partitioned elimination algorithms [5,11,15], parallel Cholesky factorization [1] and the block partitioned (Cholesky) factorization algorithms [8], and the twisted-factorization [14]. These and other algorithms are designed in an ingenious way and by diﬀerent researchers in parallel computing through the years. The diﬀerent approaches are often presented in a diﬀerent way, like partition of rows or columns of the matrix, index permutation and elimination tree, etc. Recently, a unifying graph framework has been presented [9] which not only can describe many of these algorithms in a uniﬁed and comprehensive way, but also can be used to design new algorithms with certain desired properties. In this paper we present a class of new algorithms, called ACER (alternating cyclic elimination and reduction algorithm), which combines the advantages of the well known cyclic elimination and the cyclic reduction algorithm. The cyclic reduction algorithm is a special case in the class of ACER algorithms. The ACER algorithms have a wide range of variants varying from very fast, requiring a large total number of arithmetic operations, to slower, requiring a smaller number of arithmetic operations. Through out the discussions in this paper we assume that the matrices being considered are symmetric positive deﬁnite, such that pivoting for numerical stability is not necessary during the parallel elimination process. It should be further noted that although we are considering the solution of a system A x = b, we will mainly omit the description of update operations of the right hand side vector b. These update operations are straightforward since they just follow the update on the corresponding row of the matrix. The method we present is based on the elimination instead of factorization. When multiple right hand side vectors are to be solved, the methods based on factorization followed by forward and backward substitution are more eﬃcient (in terms of the required number of arithmetic operations) than the elimination type methods for general matrices. However, for tridiagonal matrices the diﬀerence is small, especially if all right hand side vectors are updated simultaneously during the parallel elimination process.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

565

The outline of this paper is as follows. First of all, the unifying graph theoretic framework is described in Section 2, then in Section 3 the cyclic reduction and cyclic elimination algorithms are discussed. In Section 4, we present the ACER class which was designed using the graph framework. Section 5 concludes this paper with some remarks and open questions.

2. The graph theoretic framework In this section we will ﬁrst brieﬂy review the relationship between graph transformation and Gaussian elimination, then we will describe the unifying graph framework which was used for designing the ACER class. A number of theorems are stated, but no proofs will be given here, readers are referred to [9] for the proofs. 2.1. Graph transformations and Gaussian elimination In [12], graphs are used for determining ﬁll-ins1 in a sparse matrix during Gaussian elimination. An adjacency graph or elimination graph associated with matrix A is an undirected graph with the set of edges deﬁned as E = {(i, j)ja(i, j) 5 0 or a(j, i) 5 0}. Note that we use the term edge and arc for undirected respectively directed connections between two nodes. It is shown that the elimination of a variable i only changes the coeﬃcients a(k, j) in the matrix if and only if there exist a pair of edges (k, i) and (i, j), see [12]. That means that for a sparse matrix the modiﬁcation in coeﬃcients is local. Since a ﬁllin during the elimination or factorization corresponds to the addition of a new edge in the elimination graph, minimizing ﬁll-ins is the same as minimizing the number of additional new edges in the elimination graph. With respect to parallel elimination, an important fact is that two nonadjacent nodes in an elimination graph can be eliminated independently. As was pointed out in [9], the undirected elimination graph has been successfully used for minimizing ﬁll-ins in sequential solution (e.g. [2,3]) and parallel factorization of sparse symmetric matrices. However, it cannot describe many operations in a parallel matrix algorithm. For instance, the recursive doubling algorithm [13] and the partition method [15] cannot be described in terms of eliminating each column or row exactly once during the elimination process. In the partition method some rows or columns are modiﬁed several times. Unlike in Gaussian elimination or LU-factorization the ﬁnal form is here not an upper triangular matrix. Therefore, we use directed graphs in our framework. In order to obtain the required high ﬂexibility we study the parallelism in eliminating an arc (j, i) associated with a(j, i) using a(i, i) instead of eliminating an entire column i. Note that we do not assume i > j or j > i here, i.e., we do not assume any pre-determined elimination ordering. We

1

A ﬁll-in is a coeﬃcient which is zero in the original A, but becomes nonzero during the elimination/ factorization.

566

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

consider the elimination of an arc as an operation which can be performed in any order or even repeatedly. Throughout the paper, we ignore numerical cancellations when considering ﬁll-ins and ﬁll-arcs during an elimination process. An nonzero is thus a logical nonzero and an accidental numerical cancellation is not considered as a zero coeﬃcient. 2.2. Description of Gaussian elimination In the graph model, a matrix A 2 Rnn is represented by a graph G(V, E), consisting of a set of nodes V and a set of arcs (edges) E. The set of nodes is deﬁned by V = {1, 2, . . ., n}, the set of arcs by E = {(i, j)ja (i, j) 5 0, i 2 V and j 2 V}. An arc (i, j) is said to have a begin node i and end node j. Arc (i, j) is an outgoing arc from i, and an ingoing arc to j. The set of successors and predecessors of node i is denoted by PRED(i) and SUCC(i), respectively. As a convention in this paper, we deﬁne that the node i is never a successor or predecessor of itself. A generalized Gaussian elimination step is described in this model by a graph transformation from graph G(a)(V, E(a)) to graph G(b)(V, E(b)). We will denote a graph transformation by (a) ! (b). When a graph G(b) is transformed to G(c)(V, E(c)) by a next Gaussian elimination step, we will denote the two graph transformations (a) ! (b) and (b) ! (c) by simply (a) ! (c). In a Gaussian elimination step a matrix element is eliminated, which can cause a ﬁll-in. In the unifying graph model this corresponds to eliminating an arc from graph G(a)(V, E(a)), and adding a new arc (ﬁll-in) to this graph. This results in a new graph G(b)(V, E(b)), corresponding to the matrix after the Gaussian elimination step. A ﬁll-in is thus an arc which does not belong to the graph G(a). The following theorem tells us which arcs are created and/or modiﬁed by eliminating arc (i, j) from the graph. Theorem 1. The elimination of arc (i, j) results in arcs (i, p) for all p2SUCC(j). 2.3. Description of parallel Gaussian elimination The following three basic operations on a graph G(a)(V, E(a)) are important for parallel Gaussian elimination: 1. Partitioning. Partition the graph G(a) into a number of subgraphs, and assign each subgraph to a processor. 2. Selection. Select a set of arcs S for parallel elimination. 3. Elimination and update. Remove S from the graph G(a)(V, E(a)), add a corresponding set of ﬁll-arcs to the graph. This results in a new graph G(b)(V, E(b)). In the following, we will call a graph transformation in relation with parallel Gaussian elimination a parallel elimination step. Now consider the parallel elimination of a pair of arcs S1 = (i1, j1) and S2 = (i2, j2) from the graph G(V, E). We will call this successful if elimination of S1 from graph

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

567

G(V, E)nS2 does not introduce a ﬁll-arc S2, and elimination of S2 from graph G(V, E)nS1 does not introduce a ﬁll-arc S1. The following theorem states when parallel elimination is successful. Theorem 2. Parallel elimination of arcs (i1, j1) and (i2, j2) is successful if and only if 1. i15i2; or 2. i1 = i2 and j1 is not in SUCC(j2) and j2 is not in SUCC(j1). Fig. 1(a) shows an example for which parallel elimination is not successful: clearly, arcs (2,1) and (2,3) cannot be eliminated in parallel. This is because elimination of arc (2,1) causes that arc (2,3) returns, and vice versa. Hence, the presence of arcs (1,3) and (3,1) results that there is no parallel elimination possible. Fig. 1(b) shows a graph without these arcs, and is called a bidirectional chain corresponding to a tridiagonal matrix. Clearly, the arcs (2, 1) and (2, 3) can be successfully eliminated, in contrast to the example of Fig. 1(a). In this case a so-called update conﬂict occurs. Per deﬁnition, an update conﬂict occurs when during a parallel elimination step the value of one coeﬃcient has to be modiﬁed several times. In Fig. 1(b) a update conﬂict occurs for the coeﬃcient a2,2. The following theorem says when a parallel elimination step is free of update conﬂicts. Theorem 3. Parallel elimination of arcs (i1, j1) and (i2, j2) is successful and free of update conflicts if and only if 1. i15i2; or 2. if i1 = i2 and SUCC(j1) \ SUCC(j2) = ;. Notice that parallel eliminations free of update conﬂict are not necessary conditions. The tests of parallel elimination (Theorem 2) and the test of update conﬂict (Theorem 3) can also be simpliﬁed by allowing each node being an initiating node at most once in a single parallel elimination step. Furthermore, additional conditions in selecting arcs for parallel elimination are sometimes preferred or required. Such additional conditions can be limiting the number of ﬁll-arcs, or requiring that the total number of ﬁll-arcs must be smaller than the number of arcs being eliminated

Fig. 1. Example for which the arcs (2, 1) and (2, 3) cannot be eliminated in parallel (a), and an example for which this can be done (b), but resulting in an update conﬂict in coeﬃcient a2,2.

568

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

in each step (the latter is to ensure the ﬁnite-ness of the elimination process). Additional conditions or heuristics in the selection can also be used for more regularity and control on the parallelism. Since without regularity it is impossible to describe a parallel algorithm for diﬀerent values of n, it is very important to retain regularity in the parallel elimination steps. In fact, many of the known parallel algorithms in literature correspond to applying a certain heuristics or imposing some structure in the elimination process. The class of ACER algorithms has both element of cyclic reduction (CR) and cyclic elimination (CE) in it. ACER combines the advantages of the cyclic elimination and the cyclic reduction algorithms. An important feature in the ACER algorithms is to keep the structure similarity as much as possible during the reduction and backsubstitution phases in the parallel elimination steps. During the design process of the ACER algorithms the graph model is used to provide us the insights in the structures of the CR and CE algorithms. Before we present the ACER algorithms, ﬁrst we describe the CR and CE algorithms using the unifying graph model. 3. The CR and CE algorithm 3.1. The CR algorithm Let the dimension of the matrix be given by n = 2 Æ 2k1, k = 1, 2, . . .. The CR algorithm [4] consists of two phases: the reduction phase, during which selective elimination of variables is done; and the back-substitution phase, during which the values of the eliminated xis are recovered. 3.1.1. The reduction phase The reduction phase consists of k parallel elimination steps, which we number from r = 1 to k. Deﬁne d = 2r1 and consider the following equations at step r: ðr1Þ

ðr1Þ

ðr1Þ

ðr1Þ

ðr1Þ

aid;i2d xi2d þ aid;id xid þ aid;i xi ¼ bid ; ðr1Þ

ðr1Þ

ai;id xid þ ai;i ðr1Þ

ðr1Þ

ðr1Þ

xi þ ai;iþd xiþd ¼ bi

ðr1Þ

ðr1Þ

ð1Þ ð2Þ

; ðr1Þ

aiþd;i xi þ aiþd;iþd xiþd þ aiþd;iþ2d xiþ2d ¼ biþd .

ð3Þ

Eq. (2) corresponds to the ith row of the system, Eqs. (1) and (3) to respectively the (i d)th and (i + d)th row. The basic idea is to eliminate the variables xid and xi+d in Eq. (2) in parallel for every i from the index set I = {2pdjp = 1, 2, . . ., 2k/d1}. The index set I is chosen in such a way that for two consecutive indices ij and ij+1 from I, Eq. (3) for ij equals the Eq. (1) for ij+1. The variables xid and xi+d can be ðr1Þ ðr1Þ eliminated by multiplying (1) and (3) by respectively ai;id =aid;id and ðr1Þ ðr1Þ ai;iþd =aiþd;iþd , and adding the result to Eq. (2). By this, the three equations (1)– (3) are reduced to one equation, i.e. ðrÞ

ðrÞ

ðrÞ

ðrÞ

ai;i2d xi2d þ ai;i xi þ ai;iþ2d xiþ2d ¼ bi

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

569

with the new coeﬃcients given by ðr1Þ

ðrÞ ai;i2d

¼

ðr1Þ aid;i2d

ai;id ðr1Þ

aid;id

ð4Þ

;

ðr1Þ

ðrÞ

ðr1Þ

ai;i ¼ ai;i

ðr1Þ

aid;i

ðr1Þ

ai;id

ðr1Þ

ðr1Þ

aid;id

aiþd;i

ai;iþd ðr1Þ

aiþd;iþd

ð5Þ

;

ðr1Þ

ðrÞ

ai;iþd

ðr1Þ

ai;iþ2d ¼ aiþd;iþ2d

ðr1Þ

aiþd;iþd

ð6Þ

;

ðr1Þ

ðrÞ

ðr1Þ

bi ¼ bi

ðr1Þ

bid

ðr1Þ

ai;id ðr1Þ

ðr1Þ

aid;id

biþd

ai;iþd ðr1Þ

aiþd;iþd

ð7Þ

.

The expressions (4)–(7) can be computed in parallel. Note that the quotients ðr1Þ ðr1Þ ðr1Þ ðr1Þ ai;id =aid;id and ai;iþd =aiþd;iþd need to be computed only once. Observation of the reduction phase shows that in each step r all variables are eliminated with indices which are odd multiples of d and not multiples of 2d, i.e.: xd, x3d, x5d, . . ., x2Æ2kd. Elimination of these 2k/d variables results in 2k/d 1 equations. At step r = k (d = 2k1) two variables x2k1 and x3 2k1 are eliminated, resulting in one equation in the unknown x2k . 3.1.2. The back-substitution phase The back-substitution phase also consists of k parallel elimination steps, which we now number from r = k to 1. After the reduction phase has been completed, one single equation in the unknown x2k resulted: ðkÞ

ðkÞ

a2k ;2k x2k ¼ b2k : ðkÞ

ðkÞ

From this equation it easily follows that x2k ¼ b2k =a2k ;2k . In the remaining k 1 steps, back-substitution takes place, i.e. ðr1Þ

xi ¼

bi

ðr1Þ

ðr1Þ

ai;id xid ai;iþd xiþd ðr1Þ

ai;i

;

ð8Þ

where xid and xi+d are recovered in step r + 1 of the back-substitution phase. This is done in parallel for every i from the index set I = {2pdjp = 1, 2, . . ., 2k/d 1}. Note that in the rth step of the back-substitution phase all variables are recovered which are eliminated in the rth step of the reduction phase. Using the framework, we can now generate the CR algorithm as shown in Fig. 2. The parallel elimination proceeds by starting with the elimination of all arcs initiated from even numbered nodes in the ﬁrst step, followed by repeatedly eliminates the set of arcs initiated at even numbered nodes with a distance of 2r1 to the end node at step r (they are independent from each other). Fig. 3 illustrates the diﬀerent elimination steps in the reduction and back substitution phase of the reduction algorithm.

570

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Fig. 2. The CR algorithm in terms of the framework (n = 2 Æ 2k 1).

(a)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(b)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(c)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(d)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(e)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(f)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(g)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Fig. 3. Illustration of the graph transformation process corresponding to the CR algorithm, n = 15 (k = 3). (a) ! (d): the reduction phase; (d) ! (g): the back-substitution phase.

3.1.3. Set notation In order to aid the design and presentation of the class of ACER algorithms, we introduce some notations of sets which are not only used to describe the CR algorithm but also the CE and the ACER algorithms. These sets of nodes and arcs are

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

571

ðrÞ

ðBCÞ : Chain formed by arcs ði; jÞ for which j i j j is maximum at step r; ðrÞ

ðBCN Þ

ðCRSCÞ

ð9Þ

ðrÞ

: Set of nodes in ðBCÞ ;

ðrÞ

: Subset of ðBCÞ

ðrÞ

ð10Þ

consisting of CR subchains: Each CR

subchain consists of two D links formed by three nodes and four arcs: ð11Þ ðrÞ

ð#CRSCÞ

ðrÞ

: Number of CR subchains in ðCRSCÞ .

ð12Þ

A subchain is a bidirectional linear chain which comprises a set of nodes and arcs. A D-link is the smallest (sub)chain which comprises two consecutive nodes with an arc in each of the two directions. We denote a subchain by (p; q), where p and q are respectively the smallest and largest node number of the subchain. The center node of a chain is the node at the middle of the chain. For illustration the values of the sets (9)–(12) in the example depicted in Fig. 3 at step r = 2 (reduction phase) are given by ðBCÞ

ð2Þ

¼ fð2;14Þg;

ð2Þ

ðBCN Þ

ðCRSCÞ

¼ f2; 4; 6; 8; 10; 12; 14g;

ð2Þ

¼ fð1;6Þ; ð6;10Þ; ð10;14Þg;

ð2Þ

ð#CRSCÞ

¼ 3.

Using these set deﬁnitions the CR algorithm of Fig. 2 can now be described as in Fig. 4.

Fig. 4. The CR algorithm in set notation (n = 2 Æ 2k1).

572

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

It can be shown that the following relation yields: ð#CRSCÞ

ðrÞ

¼ 2 2kr 1.

ð13Þ (1)

(2)

In the example of Fig. 3 we have (#CRSC) = 7, (#CRSC) = 3 and (#CRSC)(3) = 1. Characteristic for the CR algorithm is that (#CRSC)(k) = 1. In each step r = 1, 2, . . ., k of the reduction phase, expressions (4)–(7) can be evaluated in parallel. When maximum parallelism is considered, this can be done in exactly four operations, so in total the reduction phase minimally costs 4k operations. The ﬁrst step of the back-substitution phase costs minimal four operations, in the last k 1 steps expression (8) has to be evaluated, costing in total 4k 4 operations. Therefore, the time complexity of the CR algorithm reads T CR ¼ 8log2 ðn þ 1Þ 8:

ð14Þ

In the CR algorithm there are only ﬁll-ins introduced in the reduction phase. Simple analysis shows that the number of ﬁll-ins is equal to 2 Æ #(CRSC)(r) 2 at step r, and therefore the total number of ﬁll-ins is nþ1 F CR ¼ 2ðn 1Þ 4log2 : ð15Þ 2 3.2. The CE algorithm The CE algorithm is a greedy algorithm which simply removes the oﬀ-diagonal bands of the matrix from the main diagonal. Let the dimension of the tridiagonal matrix be given by n = 2k, k = 1, 2, . . .. In terms of the framework, the CE algorithm repeatedly eliminates the arcs (i, i + d) and (i, i d) with d = 2r1 at step r. The algorithm in graph notation is given by Fig. 5. For the k steps of the CE algorithm expressions (4)–(7) can be evaluated in parallel in minimal four steps, considering maximal parallelism. For the last step this can be done in three steps. Finally, obtaining the solution vector requires one step, thus in total 4k steps are needed. Hence, the time complexity is T CE ¼ 4log2 ðnÞ:

ð16Þ

In the ﬁrst k 1 steps of the CE algorithm ﬁll-ins are introduced. Analysis shows that the number of ﬁll-ins is given by F CE ¼ 2log2 ðnÞn 4n þ 4:

Fig. 5. The CE algorithm in terms of the framework (n = 2 Æ 2k).

ð17Þ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

573

4. The ACER algorithms When we compare the time complexity of the CR algorithm (14) with the time complexity of the CE algorithm (16), we see that the greedy CE approach costs much less operations. However, comparing the number of ﬁll-in arcs of the CR algorithm (15) with those introduced in the CE algorithm (17), we observe that signiﬁcantly more ﬁll-ins are introduced with the CE algorithm. This brings us to the idea of combining the advantages of those algorithms. Using the graph framework we can easily visualize and study the diﬀerent variants of parallel elimination algorithms, and that results in the ACER class of algorithms. 4.1. ACER in set notation Let the dimension of the matrix be given by n = 2lk 1 with l = 2,3, . . . and k = 1, 2, . . . For each l we denote ACER-l as a member of the ACER class. Like the CR algorithm, the ACER algorithm comprises a reduction and a back-substitution phase, each consisting of k parallel elimination steps. However, unlike the CR algorithm, each of these k steps in the reduction phase again consists of one or more substeps: each step consists of zero or more CE substeps, followed by one single CR substep at the end. In following the two phases will be described in detail using the graph notations. But before doing so, ﬁrst some additional notations of sets of nodes or arcs are introduced ðCESCÞ

ðrÞ

: Subset of ðBCÞ

ðrÞ

consisting of CE subchains: Each CE

subchain contains l 2D links formed by l 1 nodes and 2l 4 hboxarcs;

ð18Þ

ðCESCN ÞðrÞ : Set of nodes in the CE subchains ofðCESCÞðrÞ .

ð19Þ

ð#CESCÞðrÞ : Number of CE subchains in ðCESCÞðrÞ .

ð20Þ

Fig. 7 shows an example of an ACER algorithm with l = 3 for n = 17 and k = 2. In the ﬁrst steps (a) ! (c) the values of the subsets deﬁned above are the following: ðBCÞð1Þ ¼ fð1;17Þg;

ð21aÞ

ðBCN Þð1Þ ¼ f1; 2; . . . ; 17g;

ð21bÞ

ðCESCÞ

ð1Þ

ðCESCN Þ

¼ fð1;2Þ; ð4;5Þ; ð7;8Þ; ð10;11Þ; ð13;14Þ; ð16;17Þg;

ð1Þ

ð#CESCN Þ

¼ f1; 2; 4; 5; 7; 8; 10; 11; 13; 14; 16; 17g; ð1Þ

¼ 6;

ð21cÞ ð21dÞ ð21eÞ

574

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

ðCRSCÞ

ð1Þ

ð#CRSCÞ

¼ fð2;4Þ; ð5;7Þ; ð8;10Þ; ð11;13Þ; ð14;16Þg;

ð1Þ

ð21fÞ

¼ 5:

ð21gÞ

4.1.1. The reduction phase The reduction phase comprises k steps, and each of these steps again contains c + 1 substeps: c CE substeps followed by 1 CR substep. At each step r (r = 1, 2, . . ., k) of the reduction phase, ﬁrst the chain (BC)(r) formed by the largest arcs in the graph will be determined. The corresponding set of nodes in (BC)(r) is stored in (BCN)(r). Next, (BC)(r) is divided into alternate CE and CR subchains containing l2 and 2 D-links, respectively. The ﬁrst and last l 2 D-links of (BC)(r) diﬀer from the others. This results in two sets (CESC)(r) and (CRSC)(r) of subchains. In each step r all CE subchains in (CESC)(r) are transformed in parallel using the CE algorithm in Fig. 6. There are c substeps within an CE step, in substep t (t = 1, 2, . . ., c) all arcs with its begin and end node belonging to the CE subchains are eliminated. This results in two types of ﬁll-ins: 1. Fill-in arcs whose begin node is a node of the CE subchain, and the end node is the middle node of the neighboring CR subchain. 2. Fill-in arcs whose begin and end node are nodes belonging to the CE subchain. After c substeps there are no arcs in the graph left whose begin and end node belong both to the CE subchain. Until now we have not discussed about the value of the number of CE substeps c. Consider a CE subchain, the number of nodes in the CE subchain equals l 1 and

(a)

1

2

3

4

5

6

7

8

(b)

1

2

3

4

5

6

7

8

(c)

1

2

3

4

5

6

7

8

(d)

1

2

3

4

5

6

7

8

Fig. 6. Illustration of the graph transformation process corresponding to the CE algorithm, n = 8 (k = 3).

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

575

the number of arcs 2(l 2). For reasons of simplicity we mean with an arc an arc with both begin and end node in the current CE subchain under consideration. The constant c can be determined by examining how many substeps are needed to eliminate all l 2 arcs (i, j) with i > j or i < j. The constraint l 2 (1 + 21 + 22 + . . .2c1) 6 0, must be satisﬁed, meaning that c P log2 ðl 1Þ. Subsequently, using the upper-entier function c ¼ dlog2 ðl 1Þe.

ð22Þ

After c CE substeps are carried out in step r, a single CR substep follows. In this substep again in parallel for each CR subchain two outgoing arcs from the center node of the subchain are eliminated. For r = 1, 2, . . ., k 1 these eliminations result in two ﬁll-in arcs with the center node of the current CR subchain as the begin node and the center node in the left/right neighboring CR subchains as the end node. 4.1.2. The back-substitution phase At each step r (r = k, k1, . . ., 1) of the back-substitution phase all arcs whose begin and end node belonging to (BCN)(r) are eliminated. There are no ﬁll-in arcs introduced in this phase. Fig. 8 describes the ACER algorithm in terms of graph transformations using the framework. The following relations yield: ðrÞ

¼ 2lkr ;

ð23Þ

ðrÞ

¼ 2lkr 1.

ð24Þ

ð#CESCÞ and ð#CRSCÞ

In case of r = k we have (#CRSC)(k) = 1. 4.2. An example worked out in detail In the following, we consider an example for l = 3 and n = 17 to illustrate the ACER algorithm. Fig. 7 shows the process of graph transformation for this example problem. Since l = 3 it follows that c = 1. For the given n = 17 and l = 3, we have k = 2, therefore both reduction and back-substitution phase consists of 2 steps. Each reduction step contains two substeps: 1 CE substep (c = 1) and 1 CR substep. The CE substep (a) ! (b) and CR substep (b) ! (c) form together the ﬁrst step of the reduction phase, CE substep (c) ! (d) and CR substep (d) ! (e) form the second and the last step. Steps (e) ! (f) and (f) ! (g) form the back-substitution phase. In the ﬁrst step r = 1 of the reduction phase the sets in (21a)–(21g) are found. In the CE substep t = 1 of step r = 1 the following set of arcs are eliminated (from the graph in (a)): fð1; 2Þ; ð2; 1Þ; ð4; 5Þ; ð5; 4Þ; ð7; 8Þ; ð8; 7Þ; ð10; 11Þ; ð11; 10Þ; ð13; 14Þ; ð14; 13Þ; ð16; 17Þ; ð17; 16Þg

576

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

(a)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(b)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(c)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(d)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(e)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(f)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(g)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

Fig. 7. Illustration of the ACER algorithm with l = 3 for n = 17 and k = 2. (a) ! (e): the reduction phase; (e) ! (g): the back-substitution phase.

and the following ﬁll-arcs are added to the graph: fð1; 3Þ; ð5; 3Þ; ð4; 6Þ; ð8; 6Þ; ð7; 9Þ; ð11; 9Þ; ð10; 12Þ; ð14; 12Þ; ð13; 15Þ; ð17; 15Þg. This results in the graph shown in (b). In the CR substep of step r = 1 the set of arcs eliminated (from the graph in (b)) are fð3; 2Þ; ð3; 4Þ; ð6; 5Þ; ð6; 7Þ; ð9; 8Þ; ð9; 10Þ; ð12; 11Þ; ð12; 13Þ; ð15; 14Þ; ð15; 16Þg and the following set of ﬁll-arcs are added to the graph: fð3; 6Þ; ð6; 3Þ; ð6; 9Þ; ð9; 6Þ; ð9; 12Þ; ð12; 9Þ; ð12; 15Þ; ð15; 12Þg. In the second step r = 2 of the reduction phase we ﬁnd the following sets: ðBCÞð2Þ ¼ fð3;15Þg; ð2Þ

ðBCN Þ

ðCESCÞ

¼ f3; 6; 9; 12; 15g;

ð2Þ

¼ fð3;6Þ; ð12; 15Þg;

ðCESCN Þð2Þ ¼ f3; 6; 12; 15g.

ð25aÞ

ð25bÞ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

577

Fig. 8. The ACER algorithm in set notation (n = 2lk 1, l = 2, 3, . . ., k = 1, 2, . . .).

ð#CESCÞ ðCRSCÞ

ð2Þ

ð2Þ

¼ 2;

¼ fð6;12Þg;

ð25cÞ

ð#CRSCÞð2Þ ¼ 1. In the CE substep t = 1 of step r = 2 the arcs (3, 6), (6, 3), (12, 15) and (15, 12) from the graph in (c) are eliminated, and ﬁll-arcs (3, 9) and (15, 9) are added, resulting in graph (d). In the CR substep of r = 2 the arcs (9, 6) and (9, 12) form graph (d) are eliminated, resulting in graph (e). Herewith the reduction phase has been completed. In the second step r = 2 of the back-substitution phase all arcs in graph (e) with both begin and end node belonging to (BCN)(2) are eliminated (25a): (3, 9), (6, 9), (12, 9) and (15, 9). This results in graph (f). In the second and last step r = 2 of the back-substitution phase all arcs in graph (f) with both begin and end node belonging to (BCN)(1) (21-b) are eliminated. This results in graph (g) which has no arcs and the ACER algorithm terminates here. The ACER algorithms can also be described in more detail. This is shown in Fig. 9.

578

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Fig. 9. The ACER algorithm in terms of the framework (n = 2lk1, l = 2, 3, . . ., k = 1, 2, . . ., and c ¼ dlog2 ðl 1Þe).

4.3. Analysis of time complexity and number of ﬁll-ins For determining the time complexity and the number of ﬁll-ins of the ACER algorithm we consider the ACER algorithm in set notation, as described in Section 4.1. 4.3.1. Time complexity For the time complexity analysis of ACER we restrict ourselves solely to the case where maximum parallelism is exploited.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

579

4.3.1.1. The reduction phase. In each step r = 1, 2, . . ., k of the reduction phase all arcs in the set (CESC)(r) are eliminated simultaneously in c CE substeps. If there are two arcs in (CESC)(r) with the same end node, then a CE substep costs four operations. If the set does not contain such arcs, then the CE substep costs three operations, which is always the case for the last CE substep. By this, the CE substeps in the reduction phase takes 4(c 1)k + 3k operations. For c = 0 (l = 2) we correct this number by introducing the indicator function 1[c > 0] (=1 for c > 0; =0 otherwise), resulting in 4ck k Æ 1[c > 0] operations. After the ﬁnal CE substep one CR substep follows, costing four operations. Therefore, executing all the CR substeps in the reduction phase takes 4k operations, hence in total the reduction phase costs 4ck þ 4k k 1½c>0

ð26Þ

operations. 4.3.1.2. The back-substitution phase. In the ﬁrst step r = k of the back-substitution phase the variable x(n+1)/2 is recovered, costing one operation. After that, eliminating the arcs in (CRSC)(k) can be done in three operations, and the rest of the k 1 steps of the back-substitution phase cost four operations each. Therefore the back-substitution phase takes 4k

ð27Þ

operations. 4.3.1.3. Total operations. Adding (26) and (27) results in the following time complexity for the ACER algorithm: nþ1 T ACER ¼ ð4c þ 8 1½c>0 Þlogl . ð28Þ 2 Note that for l = 2 (c = 0) this expression equals to TCR, given by (14). 4.3.2. Number of ﬁll-ins Recall that only the reduction phase introduces ﬁll-in arcs. The total number of ﬁll-ins of the ACER algorithm consists of ﬁll-ins introduced in the ck CE substeps, and ﬁll-ins introduced in the k CR substeps. 4.3.2.1. Number of ﬁll-in arcs in ck CE substeps. As denoted in Section 4.1 we distinguish two diﬀerent types of ﬁll-ins: (1) ﬁll-in arcs whose begin node is a node of the CE subchain, and the end node is the middle node of the neighboring CR subchain, and (2) ﬁll-in arcs whose begin and end node belong to the CE subchain. Now the numbers of each of these types of ﬁll-in arcs will be determined. For the sake of simplicity a CE subchain is simply referred to as a subchain, and a CE substep simply as a substep. The number of type (1) ﬁll-ins per subchain can be determined as follows. In the ﬁrst substep t = 1 there are 2(l 1) 2 arcs eliminated, resulting in 2(l 1) 4 ﬁllin arcs. In the second substep t = 2 there are 2(l1)8 arcs eliminated, resulting in

580

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

2(l 1) 8 ﬁll-in arcs. This proces repeats itself for only c 1 times, because there are no ﬁll-ins introduced in substep t = c. By this, the total number of ﬁll-in arcs per subchain after c substeps is given by 2ðl 1Þðc 1Þ ð4 þ 8 þ 16 þ . . .Þ. |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ} c1 terms

The indicated series can be classiﬁed as a ﬁnite geometric series and is given by 4(2c1 1), and so the number of ﬁll-in arcs per subchain becomes 2ðl 1Þðc 1Þ 4ð2c1 1Þ: Multiplying this with the total number of subchains (#CESC)(r), given by (23), results in k1 f2ðl 1Þðc 1Þ 4ð2c1 1Þgf2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ ﬄ} g: |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ

ð29Þ

k terms

The indicated series in the above expression is ﬁnite geometric and given by 2lk 2 ; l1

ð30Þ

and therefore the total number of type (1) ﬁll-in arcs after ck substeps can be written as

2lk 2 2ðc 1Þðl 1Þ 4ð2c1 1Þ : ðl 1Þ

ð31Þ

Apart from the ﬁrst and last subchain, per subchain and after c substeps 2(l 2) ﬁll-in arcs of type (2) are introduced. Using expression (24) for (#CESC)(r) the number of type (2) ﬁll-in arcs introduced in the rth step is 2lkr 2ðl 2Þ 2ðl 2Þ; in which the last term 2(l 2) is a correction for the ﬁrst and last subchain. Taking the sum over k steps results in k1 2ðl 2Þf2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ . .ﬄ}.g 2ðl 2Þk. |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ k terms

The indicated series is given by (30), and therefore the total number of type (2) ﬁllin arcs after ck substeps can be written as 2ðl 2Þ

2lk 2 2ðl 2Þk: l1

ð32Þ

4.3.2.2. Number of ﬁll-in arcs in k CR substeps. Apart from the ﬁrst and last CR subchain, in each step two ﬁll-in arcs are introduced per CR subchain. With (#CRSC)(r) given by (23) and taking the sum over k steps, results in

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

581

2f2lk1 1 þ 2lk2 1 þ 2lk3 1 þ . . .g 2k k1 ¼ 2f2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ ﬄ} g 4k; |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ k terms

where the indicated series is given by (30). By this the total number of ﬁll-in arcs introduced in k CR substeps can be written as 2

2lk 2 4k: l1

ð33Þ

4.3.2.3. Total number of ﬁll-ins. Adding (31)–(33) the total number of ﬁll-ins of the ACER algorithm is given by

F ACER ¼

2cþ1 4 nþ1 2c ðn 1Þ 2llogl . l1 2

ð34Þ

Note that for l = 2 (c = 0) FACER is equal to number of ﬁll-ins for the CR algorithm FCR given by (15). 4.4. A comparison with the CR and CE algorithm In this subsection we will make a quantitative comparison for the time complexities and ﬁll-ins of the ACER, CE and CR algorithms. We will illustrate that the ACER algorithm combines the properties of the CE and CR algorithms. Table 1 summarizes the time complexities and number of ﬁll-ins for the methods discussed in this paper. For the sake of simplicity, we will assume that T and F are continuous functions of n, instead of discrete. 4.4.1. Time complexities We can rewrite TCR, TCE and TACER as follows T CR ¼ 8flog2 ðn þ 1Þ 1g; T CE ¼ 4log2 ðnÞ 1; T ACER ¼

4c þ 8 1½c>0 flog2 ðn þ 1Þ 1g. log2 l

For suﬃciently large values of n we can approximate this time complexities by T CR ¼ 8log2 ðnÞ; T CE ¼ 4log2 ðnÞ; 4c þ 8 1½c>0 log2 ðnÞ. T ACER ¼ log2 l

ð35Þ

582

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Table 1 Time complexities and number of ﬁll-ins for the CR, CE and ACER algorithm, for k = 1, 2, . . ., and l = 2, 3, . . ., and c ¼ dlogðl 1Þe n

T

F 2ðn 1Þ 4log2

CR

2 Æ 2 1

nþ1 8log2 2

CE

2k

4log2 ðnÞ

k

ACER

2log2 ðnÞn 4n þ 4

2cþ1 4 nþ1 2c ðn 1Þ 2llogl l1 2

nþ1 ð4c þ 8 1½c>0 Þlogl 2

k

2l 1

nþ1 2

These are the highest order terms, hence the time complexities are Oðlog2 ðnÞÞ. Now consider the factor in expression (35) of TACER. In the following table this factor is given for a certain range of l: l

2

3

4

5

6

7

8

9

10

100

c 4c þ 8 1½c>0 log2 l

0

1

2

2

3

3

3

3

4

7

8.00

6.94

7.50

6.46

7.35

6.77

6.33

5.99

6.92

5.27

This shows that for an increasing l, the factor of TACER decreases starting from the value 8 for the CR algorithm, and slowly tends to the factor 4 for the CE algorithm. It can be concluded that T CE < T ACER 6 T CR . Fig. 10 shows the diﬀerences in time complexities for an ACER-10 and ACER100 algorithm, compared to the CR and CE algorithm. The time complexities are plotted as a function of n, as well as the discrete values. This ﬁgure clearly shows that TCE < TACER100 < TACER10 < TCR. 4.4.2. Number of ﬁll-ins The expressions for FCR and FACER can be rewritten as F CR ¼ 2ðn 1Þ 4flog2 ðn þ 1Þ 1g; F CE ¼ 2log2 ðnÞn 4n þ 4;

F ACER ¼

2cþ1 4 2l flogl ðn þ 1Þ 1g. 2c ðn 1Þ l1 log2 ðlÞ

For n suﬃciently large, we have F CR ¼ 2n;

ð36Þ

F CE ¼ 2log2 ðnÞn;

ð37Þ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587 150

583

CR CE ACER10 ACER100 discrete value

time complexity T

100

50

0 0 10

1

2

10

10

3

10 number of equations n

4

5

10

6

10

10

Fig. 10. Time complexities for the ACER-10 and ACER-100 algorithms compared to the CR and CE algorithm. A · denotes a discrete value.

F ACER ¼

2c

2cþ1 4 n; l1

ð38Þ

showing that FCE is Oðlog2 ðnÞnÞ, and that FCR as well as FACER are OðnÞ. Again, consider the factor in (38) for FACER. For a certain range of l this factor is given by the following table: l c

2c

2

cþ1

2

3

4

5

6

7

8

9

10

100

0

1

2

2

3

3

3

3

4

7

2.00

2.67

3.00

3.60 4.00

4.29

4.50

4.89 11.45

4 n 2.00 l1

This shows that for an increasing l the factor of FACER slightly increases. From this quantitative analysis it follows that FACER P FCR, and since FACER is of higher order than FCE we have F CE > F ACER P F CR . Fig. 11 shows the diﬀerences in ﬁll-ins for a ACER-10 and ACER-100 algorithm, compared to the CE and CR algorithm. It yileds that FCE > FACER100 > FACER10 > FCR.

584

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587 7

x 10 3.5

CR CE ACER10 ACER100 discrete value

3

number of fill-ins F

2.5

2

1.5

1

0.5

0

0

1

2

3

4 5 6 number of equations n

7

8

9

10 5

x 10

Fig. 11. Fill-ins for the ACER-10 and ACER-100 algorithms compared to the CR and CR algorithm. A · denotes a discrete value.

4.5. Some variants of the ACER class In this subsection some variants of the ACER class will be discussed. For the ACER algorithm it follows that (#CRSC)(r), given by (24), is equal to the rth term of the sequence . . . ; 2l3 1; 2l2 1; 2l 1; 1 . |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ}

ð39Þ

k terms

A term of this sequence can be determined by subtracting l 1 from the preceding term, followed by a division by l. Characteristic for ACER is that (#CRSC)(k) = 1. If we now choose (#CRSC)(k) diﬀerent from the terms in (39) and as a positive integer, then one can observe that the process of alternating CE and CR cancels after step r = k. When we eliminate simultaneously the ﬁll-in arcs, which are introduced in the CR substep r = k, this results in a variant of the ACER class. Besides the k steps of the reduction phase in total l m log2 ðð#CRSCÞðkÞ Þ extra CE steps are required. The back-substitution phase for this ACER variant consists of k steps. Fig. 12 shows an example of a variant of the ACER algorithm, for

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

585

(a)

1

2

3

4

5

6

7

8

9

10

11

(b)

1

2

3

4

5

6

7

8

9

10

11

(c)

1

2

3

4

5

6

7

8

9

10

11

(d)

1

2

3

4

5

6

7

8

9

10

11

(e)

1

2

3

4

5

6

7

8

9

10

11

(f)

1

2

3

4

5

6

7

8

9

10

11

Fig. 12. Illustration of a variant for the ACER-2 (CR) algorithm for (#CRSC) (a) ! (d): the reduction phase; (d) ! (f): the back-substitution phase.

(k)

= 2 and n = 11 (k = 2).

which (#CRSC)(k) = 2. Step (c) ! (d) is an extra step in which the ﬁll-in arcs, which are introduced in CR substep (b) ! (c), are eliminated. Now a general expression for n will be derived for the variants of ACER, as deﬁned above. Let (#CRSC)(k) be diﬀerent from the term in (39) and a positive integer. In general, n for ACER is given by n ¼ lð#CRSCÞ

ð1Þ

þ l 1.

ð40Þ

When we write for simplicity (#CRSC)(k) = v, then (#CRSC)(r) is equal to the rth term of the sequence . . . ; ðv þ 1Þl3 1; ðv þ 1Þl2 1; ðv þ 1Þl 1; v; |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ} k terms

and the rth term is given by ð#CRSCÞðrÞ ¼ ðv þ 1Þlkr 1. By this, for r = 1 it follows that: ð#CRSCÞð1Þ ¼ ðv þ 1Þlk1 1; and substitution in (40) results in n ¼ fð#CRSCÞðkÞ þ 1glk 1.

ð41Þ (k)

For each positive integer (#CRSC) which does not belong to the sequence (39), this expression for n results in a diﬀerent variant class of the ACER class.

586

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

5. Concluding remarks In this paper a class of new parallel algorithms for the solution of tridiagonal matrix systems is presented, the ACER class. The ACER algorithms are designed and described with a general graph theoretic framework, which uniﬁes many diﬀerent parallel algorithms. It has been shown that the ACER algorithms combine the advantages of both the cyclic elimination and cyclic reduction algorithms. The power of the graph framework lies in that we can design parallel algorithms beyond the ability of detecting parallelism in an existing algorithm [10]. It remains an interesting and challenging problem to deﬁne similar frameworks for sparse matrix computations other than the Gaussian-elimination and factorization. We believe that many other parallel algorithms can be designed and analyzed using the graph framework. An interesting open question is: what is the minimum parallel time complexity to solve a tridiagonal system? For the tridiagonal matrix system, so far the fastest algorithm has a parallel time complexity of Oðlog2 ðnÞÞ (actually 4log2 ðnÞÞ. Can we use the graph model to prove the lower-bound of parallel time complexity for a tridiagonal system? (Under the given context of the set transformation rules).

References [1] I. Bar-On, Eﬃcient logarithmic time parallel algorithms for the Cholesky decomposition and Gram– Schmidt process, Parallel Comput. 17 (1991) 409–417. [2] I.S. Duﬀ, A.M. Erisman, J.K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, New York, 1986. [3] A. George, J.W.H. Liu, Computer Solution of Large Sparse Positive Deﬁnite Systems, Prentice-Hall Inc., Englewoods Cliﬀs, NJ, 1981. [4] R.W. Hockney, A fast direct solution of Poissons equation using Fourier analysis, J. ACM 12 (1965) 95–113. [5] S.L. Johnsson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Stat. Comput. 8 (1987) 354–392. [6] J.J. Lambiotte, R.G. Voigt, The solution of tridiagonal linear systems on the CDC STAR-100 computer, ACM TOMS 1 (1975) 308–329. [7] A bibliography on parallel solution of tri-diagonal systems of equations, Available from . [8] H.X. Lin, M.R.T. Roest, Parallel solution of symmetric banded systems, in: G.R. Joubert, D. Trystram, F.J. Peters, D.J. Evans (Eds.), Parallel Computing: Trends and Applications, 1994, pp. 537–540. [9] H.X. Lin, A unifying graph model for designing parallel algorithms for tridiagonal systems, Parallel Comput 27 (2001) 925–939. [10] H.X. Lin, Designing parallel sparse matrix algorithms beyond data dependence analysis, in: Proceedings of the ICCP 2001, Workshop High Performance Scientiﬁc Engineering Computing with Applications (Keynote), IEEE Computer Society Press, Valencia, September 3–7, 2001. [11] U. Meijer, A parallel partition method for solving banded systems of linear equations, Parallel Comput. 2 (1985) 33–43. [12] S. Parter, The use of linear graphs in Gaussian elimination, SIAM Rev. 3 (1961) 119–130. [13] H.S. Stone, An eﬃcient parallel algorithm for the solution of a tridiagonal linear system of equations, J. ACM 20 (1973) 27–38.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

587

[14] H.A. Vorst, van der, Large tridiagonal and block tridiagonal linear systems on vector and parallel computers, Parallel Comput. 5 (1987) 45–54. [15] H.H. Wang, A parallel method for tridiagonal equations, ACM TOMS 7 (1981) 167–183.

Lihat lebih banyak...
A class of novel parallel algorithms for the solution of tridiagonal systems J. Verkaik a, H.X. Lin a

b,*

TNO MADYMO BV, Schoemakerstraat 97, P.O. Box 6071, 2600 JA Delft, The Netherlands b Delft University of Technology, Delft Institute of Applied Mathematics (DIAM), Mekelweg 4, 2628 CD Delft, The Netherlands

Received 15 June 2004; received in revised form 11 November 2004; accepted 17 January 2005 Available online 29 April 2005

Abstract In this paper, a new class of parallel Gaussian elimination algorithms is presented for the solution of tridiagonal matrix systems. The new algorithms, called ACER (alternating cyclic elimination and reduction), combine the advantages of the well known cyclic elimination algorithm (which is fast) and the cyclic reduction algorithms (which requires fewer operations). The ACER algorithms are developed with the unifying graph model. 2005 Elsevier B.V. All rights reserved. Keywords: Parallel matrix algorithm; ACER algorithms; Graph transformation; Unifying graph model

1. Introduction The solution of tridiagonal linear systems of equations occurs in many engineering and scientiﬁc computer applications. For instance, it is often a part of the solution process in numerical simulation of partial diﬀerential equations using ﬁnite diﬀerence or ﬁnite element discretizations. When the standard Gaussian elimination method is applied to a (pre-ordered) tridiagonal matrix, the computation is *

Corresponding author. Tel.: +31 15 27 87229; fax: +31 15 27 87295. E-mail address: [email protected] (H.X. Lin).

0167-8191/$ - see front matter 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.parco.2005.01.002

564

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

inherently sequential. That is the reason why it has inspired many researchers to study parallel algorithms for solving this problem (in the past three decades more than 200 journal papers have been published on this problem, e.g., see an online list of publications [7]). The serial data dependency in the standard Gaussian elimination/factorization process prohibits any form of parallelization. Therefore, diﬀerent algorithms are designed which make a trade-oﬀ between doing extra arithmetic operations and having a higher degree of parallelism. Such trade-oﬀ is typical in designing parallel algorithms for problems whose eﬃcient parallelization is not straightforward. The solution of tridiagonal system is such a problem that ideally shows the trade-oﬀ considerations in designing algorithms for not so straightforward or hard to parallelize problems. As we have mentioned that during the last four decades a large variety of parallel algorithms for the solution of tridiagonal systems have been proposed, some well known examples are: the cyclic reduction algorithm [4,6], the cyclic elimination, the recursive doubling algorithm [13], the block partitioned elimination algorithms [5,11,15], parallel Cholesky factorization [1] and the block partitioned (Cholesky) factorization algorithms [8], and the twisted-factorization [14]. These and other algorithms are designed in an ingenious way and by diﬀerent researchers in parallel computing through the years. The diﬀerent approaches are often presented in a diﬀerent way, like partition of rows or columns of the matrix, index permutation and elimination tree, etc. Recently, a unifying graph framework has been presented [9] which not only can describe many of these algorithms in a uniﬁed and comprehensive way, but also can be used to design new algorithms with certain desired properties. In this paper we present a class of new algorithms, called ACER (alternating cyclic elimination and reduction algorithm), which combines the advantages of the well known cyclic elimination and the cyclic reduction algorithm. The cyclic reduction algorithm is a special case in the class of ACER algorithms. The ACER algorithms have a wide range of variants varying from very fast, requiring a large total number of arithmetic operations, to slower, requiring a smaller number of arithmetic operations. Through out the discussions in this paper we assume that the matrices being considered are symmetric positive deﬁnite, such that pivoting for numerical stability is not necessary during the parallel elimination process. It should be further noted that although we are considering the solution of a system A x = b, we will mainly omit the description of update operations of the right hand side vector b. These update operations are straightforward since they just follow the update on the corresponding row of the matrix. The method we present is based on the elimination instead of factorization. When multiple right hand side vectors are to be solved, the methods based on factorization followed by forward and backward substitution are more eﬃcient (in terms of the required number of arithmetic operations) than the elimination type methods for general matrices. However, for tridiagonal matrices the diﬀerence is small, especially if all right hand side vectors are updated simultaneously during the parallel elimination process.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

565

The outline of this paper is as follows. First of all, the unifying graph theoretic framework is described in Section 2, then in Section 3 the cyclic reduction and cyclic elimination algorithms are discussed. In Section 4, we present the ACER class which was designed using the graph framework. Section 5 concludes this paper with some remarks and open questions.

2. The graph theoretic framework In this section we will ﬁrst brieﬂy review the relationship between graph transformation and Gaussian elimination, then we will describe the unifying graph framework which was used for designing the ACER class. A number of theorems are stated, but no proofs will be given here, readers are referred to [9] for the proofs. 2.1. Graph transformations and Gaussian elimination In [12], graphs are used for determining ﬁll-ins1 in a sparse matrix during Gaussian elimination. An adjacency graph or elimination graph associated with matrix A is an undirected graph with the set of edges deﬁned as E = {(i, j)ja(i, j) 5 0 or a(j, i) 5 0}. Note that we use the term edge and arc for undirected respectively directed connections between two nodes. It is shown that the elimination of a variable i only changes the coeﬃcients a(k, j) in the matrix if and only if there exist a pair of edges (k, i) and (i, j), see [12]. That means that for a sparse matrix the modiﬁcation in coeﬃcients is local. Since a ﬁllin during the elimination or factorization corresponds to the addition of a new edge in the elimination graph, minimizing ﬁll-ins is the same as minimizing the number of additional new edges in the elimination graph. With respect to parallel elimination, an important fact is that two nonadjacent nodes in an elimination graph can be eliminated independently. As was pointed out in [9], the undirected elimination graph has been successfully used for minimizing ﬁll-ins in sequential solution (e.g. [2,3]) and parallel factorization of sparse symmetric matrices. However, it cannot describe many operations in a parallel matrix algorithm. For instance, the recursive doubling algorithm [13] and the partition method [15] cannot be described in terms of eliminating each column or row exactly once during the elimination process. In the partition method some rows or columns are modiﬁed several times. Unlike in Gaussian elimination or LU-factorization the ﬁnal form is here not an upper triangular matrix. Therefore, we use directed graphs in our framework. In order to obtain the required high ﬂexibility we study the parallelism in eliminating an arc (j, i) associated with a(j, i) using a(i, i) instead of eliminating an entire column i. Note that we do not assume i > j or j > i here, i.e., we do not assume any pre-determined elimination ordering. We

1

A ﬁll-in is a coeﬃcient which is zero in the original A, but becomes nonzero during the elimination/ factorization.

566

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

consider the elimination of an arc as an operation which can be performed in any order or even repeatedly. Throughout the paper, we ignore numerical cancellations when considering ﬁll-ins and ﬁll-arcs during an elimination process. An nonzero is thus a logical nonzero and an accidental numerical cancellation is not considered as a zero coeﬃcient. 2.2. Description of Gaussian elimination In the graph model, a matrix A 2 Rnn is represented by a graph G(V, E), consisting of a set of nodes V and a set of arcs (edges) E. The set of nodes is deﬁned by V = {1, 2, . . ., n}, the set of arcs by E = {(i, j)ja (i, j) 5 0, i 2 V and j 2 V}. An arc (i, j) is said to have a begin node i and end node j. Arc (i, j) is an outgoing arc from i, and an ingoing arc to j. The set of successors and predecessors of node i is denoted by PRED(i) and SUCC(i), respectively. As a convention in this paper, we deﬁne that the node i is never a successor or predecessor of itself. A generalized Gaussian elimination step is described in this model by a graph transformation from graph G(a)(V, E(a)) to graph G(b)(V, E(b)). We will denote a graph transformation by (a) ! (b). When a graph G(b) is transformed to G(c)(V, E(c)) by a next Gaussian elimination step, we will denote the two graph transformations (a) ! (b) and (b) ! (c) by simply (a) ! (c). In a Gaussian elimination step a matrix element is eliminated, which can cause a ﬁll-in. In the unifying graph model this corresponds to eliminating an arc from graph G(a)(V, E(a)), and adding a new arc (ﬁll-in) to this graph. This results in a new graph G(b)(V, E(b)), corresponding to the matrix after the Gaussian elimination step. A ﬁll-in is thus an arc which does not belong to the graph G(a). The following theorem tells us which arcs are created and/or modiﬁed by eliminating arc (i, j) from the graph. Theorem 1. The elimination of arc (i, j) results in arcs (i, p) for all p2SUCC(j). 2.3. Description of parallel Gaussian elimination The following three basic operations on a graph G(a)(V, E(a)) are important for parallel Gaussian elimination: 1. Partitioning. Partition the graph G(a) into a number of subgraphs, and assign each subgraph to a processor. 2. Selection. Select a set of arcs S for parallel elimination. 3. Elimination and update. Remove S from the graph G(a)(V, E(a)), add a corresponding set of ﬁll-arcs to the graph. This results in a new graph G(b)(V, E(b)). In the following, we will call a graph transformation in relation with parallel Gaussian elimination a parallel elimination step. Now consider the parallel elimination of a pair of arcs S1 = (i1, j1) and S2 = (i2, j2) from the graph G(V, E). We will call this successful if elimination of S1 from graph

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

567

G(V, E)nS2 does not introduce a ﬁll-arc S2, and elimination of S2 from graph G(V, E)nS1 does not introduce a ﬁll-arc S1. The following theorem states when parallel elimination is successful. Theorem 2. Parallel elimination of arcs (i1, j1) and (i2, j2) is successful if and only if 1. i15i2; or 2. i1 = i2 and j1 is not in SUCC(j2) and j2 is not in SUCC(j1). Fig. 1(a) shows an example for which parallel elimination is not successful: clearly, arcs (2,1) and (2,3) cannot be eliminated in parallel. This is because elimination of arc (2,1) causes that arc (2,3) returns, and vice versa. Hence, the presence of arcs (1,3) and (3,1) results that there is no parallel elimination possible. Fig. 1(b) shows a graph without these arcs, and is called a bidirectional chain corresponding to a tridiagonal matrix. Clearly, the arcs (2, 1) and (2, 3) can be successfully eliminated, in contrast to the example of Fig. 1(a). In this case a so-called update conﬂict occurs. Per deﬁnition, an update conﬂict occurs when during a parallel elimination step the value of one coeﬃcient has to be modiﬁed several times. In Fig. 1(b) a update conﬂict occurs for the coeﬃcient a2,2. The following theorem says when a parallel elimination step is free of update conﬂicts. Theorem 3. Parallel elimination of arcs (i1, j1) and (i2, j2) is successful and free of update conflicts if and only if 1. i15i2; or 2. if i1 = i2 and SUCC(j1) \ SUCC(j2) = ;. Notice that parallel eliminations free of update conﬂict are not necessary conditions. The tests of parallel elimination (Theorem 2) and the test of update conﬂict (Theorem 3) can also be simpliﬁed by allowing each node being an initiating node at most once in a single parallel elimination step. Furthermore, additional conditions in selecting arcs for parallel elimination are sometimes preferred or required. Such additional conditions can be limiting the number of ﬁll-arcs, or requiring that the total number of ﬁll-arcs must be smaller than the number of arcs being eliminated

Fig. 1. Example for which the arcs (2, 1) and (2, 3) cannot be eliminated in parallel (a), and an example for which this can be done (b), but resulting in an update conﬂict in coeﬃcient a2,2.

568

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

in each step (the latter is to ensure the ﬁnite-ness of the elimination process). Additional conditions or heuristics in the selection can also be used for more regularity and control on the parallelism. Since without regularity it is impossible to describe a parallel algorithm for diﬀerent values of n, it is very important to retain regularity in the parallel elimination steps. In fact, many of the known parallel algorithms in literature correspond to applying a certain heuristics or imposing some structure in the elimination process. The class of ACER algorithms has both element of cyclic reduction (CR) and cyclic elimination (CE) in it. ACER combines the advantages of the cyclic elimination and the cyclic reduction algorithms. An important feature in the ACER algorithms is to keep the structure similarity as much as possible during the reduction and backsubstitution phases in the parallel elimination steps. During the design process of the ACER algorithms the graph model is used to provide us the insights in the structures of the CR and CE algorithms. Before we present the ACER algorithms, ﬁrst we describe the CR and CE algorithms using the unifying graph model. 3. The CR and CE algorithm 3.1. The CR algorithm Let the dimension of the matrix be given by n = 2 Æ 2k1, k = 1, 2, . . .. The CR algorithm [4] consists of two phases: the reduction phase, during which selective elimination of variables is done; and the back-substitution phase, during which the values of the eliminated xis are recovered. 3.1.1. The reduction phase The reduction phase consists of k parallel elimination steps, which we number from r = 1 to k. Deﬁne d = 2r1 and consider the following equations at step r: ðr1Þ

ðr1Þ

ðr1Þ

ðr1Þ

ðr1Þ

aid;i2d xi2d þ aid;id xid þ aid;i xi ¼ bid ; ðr1Þ

ðr1Þ

ai;id xid þ ai;i ðr1Þ

ðr1Þ

ðr1Þ

xi þ ai;iþd xiþd ¼ bi

ðr1Þ

ðr1Þ

ð1Þ ð2Þ

; ðr1Þ

aiþd;i xi þ aiþd;iþd xiþd þ aiþd;iþ2d xiþ2d ¼ biþd .

ð3Þ

Eq. (2) corresponds to the ith row of the system, Eqs. (1) and (3) to respectively the (i d)th and (i + d)th row. The basic idea is to eliminate the variables xid and xi+d in Eq. (2) in parallel for every i from the index set I = {2pdjp = 1, 2, . . ., 2k/d1}. The index set I is chosen in such a way that for two consecutive indices ij and ij+1 from I, Eq. (3) for ij equals the Eq. (1) for ij+1. The variables xid and xi+d can be ðr1Þ ðr1Þ eliminated by multiplying (1) and (3) by respectively ai;id =aid;id and ðr1Þ ðr1Þ ai;iþd =aiþd;iþd , and adding the result to Eq. (2). By this, the three equations (1)– (3) are reduced to one equation, i.e. ðrÞ

ðrÞ

ðrÞ

ðrÞ

ai;i2d xi2d þ ai;i xi þ ai;iþ2d xiþ2d ¼ bi

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

569

with the new coeﬃcients given by ðr1Þ

ðrÞ ai;i2d

¼

ðr1Þ aid;i2d

ai;id ðr1Þ

aid;id

ð4Þ

;

ðr1Þ

ðrÞ

ðr1Þ

ai;i ¼ ai;i

ðr1Þ

aid;i

ðr1Þ

ai;id

ðr1Þ

ðr1Þ

aid;id

aiþd;i

ai;iþd ðr1Þ

aiþd;iþd

ð5Þ

;

ðr1Þ

ðrÞ

ai;iþd

ðr1Þ

ai;iþ2d ¼ aiþd;iþ2d

ðr1Þ

aiþd;iþd

ð6Þ

;

ðr1Þ

ðrÞ

ðr1Þ

bi ¼ bi

ðr1Þ

bid

ðr1Þ

ai;id ðr1Þ

ðr1Þ

aid;id

biþd

ai;iþd ðr1Þ

aiþd;iþd

ð7Þ

.

The expressions (4)–(7) can be computed in parallel. Note that the quotients ðr1Þ ðr1Þ ðr1Þ ðr1Þ ai;id =aid;id and ai;iþd =aiþd;iþd need to be computed only once. Observation of the reduction phase shows that in each step r all variables are eliminated with indices which are odd multiples of d and not multiples of 2d, i.e.: xd, x3d, x5d, . . ., x2Æ2kd. Elimination of these 2k/d variables results in 2k/d 1 equations. At step r = k (d = 2k1) two variables x2k1 and x3 2k1 are eliminated, resulting in one equation in the unknown x2k . 3.1.2. The back-substitution phase The back-substitution phase also consists of k parallel elimination steps, which we now number from r = k to 1. After the reduction phase has been completed, one single equation in the unknown x2k resulted: ðkÞ

ðkÞ

a2k ;2k x2k ¼ b2k : ðkÞ

ðkÞ

From this equation it easily follows that x2k ¼ b2k =a2k ;2k . In the remaining k 1 steps, back-substitution takes place, i.e. ðr1Þ

xi ¼

bi

ðr1Þ

ðr1Þ

ai;id xid ai;iþd xiþd ðr1Þ

ai;i

;

ð8Þ

where xid and xi+d are recovered in step r + 1 of the back-substitution phase. This is done in parallel for every i from the index set I = {2pdjp = 1, 2, . . ., 2k/d 1}. Note that in the rth step of the back-substitution phase all variables are recovered which are eliminated in the rth step of the reduction phase. Using the framework, we can now generate the CR algorithm as shown in Fig. 2. The parallel elimination proceeds by starting with the elimination of all arcs initiated from even numbered nodes in the ﬁrst step, followed by repeatedly eliminates the set of arcs initiated at even numbered nodes with a distance of 2r1 to the end node at step r (they are independent from each other). Fig. 3 illustrates the diﬀerent elimination steps in the reduction and back substitution phase of the reduction algorithm.

570

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Fig. 2. The CR algorithm in terms of the framework (n = 2 Æ 2k 1).

(a)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(b)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(c)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(d)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(e)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(f)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

(g)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Fig. 3. Illustration of the graph transformation process corresponding to the CR algorithm, n = 15 (k = 3). (a) ! (d): the reduction phase; (d) ! (g): the back-substitution phase.

3.1.3. Set notation In order to aid the design and presentation of the class of ACER algorithms, we introduce some notations of sets which are not only used to describe the CR algorithm but also the CE and the ACER algorithms. These sets of nodes and arcs are

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

571

ðrÞ

ðBCÞ : Chain formed by arcs ði; jÞ for which j i j j is maximum at step r; ðrÞ

ðBCN Þ

ðCRSCÞ

ð9Þ

ðrÞ

: Set of nodes in ðBCÞ ;

ðrÞ

: Subset of ðBCÞ

ðrÞ

ð10Þ

consisting of CR subchains: Each CR

subchain consists of two D links formed by three nodes and four arcs: ð11Þ ðrÞ

ð#CRSCÞ

ðrÞ

: Number of CR subchains in ðCRSCÞ .

ð12Þ

A subchain is a bidirectional linear chain which comprises a set of nodes and arcs. A D-link is the smallest (sub)chain which comprises two consecutive nodes with an arc in each of the two directions. We denote a subchain by (p; q), where p and q are respectively the smallest and largest node number of the subchain. The center node of a chain is the node at the middle of the chain. For illustration the values of the sets (9)–(12) in the example depicted in Fig. 3 at step r = 2 (reduction phase) are given by ðBCÞ

ð2Þ

¼ fð2;14Þg;

ð2Þ

ðBCN Þ

ðCRSCÞ

¼ f2; 4; 6; 8; 10; 12; 14g;

ð2Þ

¼ fð1;6Þ; ð6;10Þ; ð10;14Þg;

ð2Þ

ð#CRSCÞ

¼ 3.

Using these set deﬁnitions the CR algorithm of Fig. 2 can now be described as in Fig. 4.

Fig. 4. The CR algorithm in set notation (n = 2 Æ 2k1).

572

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

It can be shown that the following relation yields: ð#CRSCÞ

ðrÞ

¼ 2 2kr 1.

ð13Þ (1)

(2)

In the example of Fig. 3 we have (#CRSC) = 7, (#CRSC) = 3 and (#CRSC)(3) = 1. Characteristic for the CR algorithm is that (#CRSC)(k) = 1. In each step r = 1, 2, . . ., k of the reduction phase, expressions (4)–(7) can be evaluated in parallel. When maximum parallelism is considered, this can be done in exactly four operations, so in total the reduction phase minimally costs 4k operations. The ﬁrst step of the back-substitution phase costs minimal four operations, in the last k 1 steps expression (8) has to be evaluated, costing in total 4k 4 operations. Therefore, the time complexity of the CR algorithm reads T CR ¼ 8log2 ðn þ 1Þ 8:

ð14Þ

In the CR algorithm there are only ﬁll-ins introduced in the reduction phase. Simple analysis shows that the number of ﬁll-ins is equal to 2 Æ #(CRSC)(r) 2 at step r, and therefore the total number of ﬁll-ins is nþ1 F CR ¼ 2ðn 1Þ 4log2 : ð15Þ 2 3.2. The CE algorithm The CE algorithm is a greedy algorithm which simply removes the oﬀ-diagonal bands of the matrix from the main diagonal. Let the dimension of the tridiagonal matrix be given by n = 2k, k = 1, 2, . . .. In terms of the framework, the CE algorithm repeatedly eliminates the arcs (i, i + d) and (i, i d) with d = 2r1 at step r. The algorithm in graph notation is given by Fig. 5. For the k steps of the CE algorithm expressions (4)–(7) can be evaluated in parallel in minimal four steps, considering maximal parallelism. For the last step this can be done in three steps. Finally, obtaining the solution vector requires one step, thus in total 4k steps are needed. Hence, the time complexity is T CE ¼ 4log2 ðnÞ:

ð16Þ

In the ﬁrst k 1 steps of the CE algorithm ﬁll-ins are introduced. Analysis shows that the number of ﬁll-ins is given by F CE ¼ 2log2 ðnÞn 4n þ 4:

Fig. 5. The CE algorithm in terms of the framework (n = 2 Æ 2k).

ð17Þ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

573

4. The ACER algorithms When we compare the time complexity of the CR algorithm (14) with the time complexity of the CE algorithm (16), we see that the greedy CE approach costs much less operations. However, comparing the number of ﬁll-in arcs of the CR algorithm (15) with those introduced in the CE algorithm (17), we observe that signiﬁcantly more ﬁll-ins are introduced with the CE algorithm. This brings us to the idea of combining the advantages of those algorithms. Using the graph framework we can easily visualize and study the diﬀerent variants of parallel elimination algorithms, and that results in the ACER class of algorithms. 4.1. ACER in set notation Let the dimension of the matrix be given by n = 2lk 1 with l = 2,3, . . . and k = 1, 2, . . . For each l we denote ACER-l as a member of the ACER class. Like the CR algorithm, the ACER algorithm comprises a reduction and a back-substitution phase, each consisting of k parallel elimination steps. However, unlike the CR algorithm, each of these k steps in the reduction phase again consists of one or more substeps: each step consists of zero or more CE substeps, followed by one single CR substep at the end. In following the two phases will be described in detail using the graph notations. But before doing so, ﬁrst some additional notations of sets of nodes or arcs are introduced ðCESCÞ

ðrÞ

: Subset of ðBCÞ

ðrÞ

consisting of CE subchains: Each CE

subchain contains l 2D links formed by l 1 nodes and 2l 4 hboxarcs;

ð18Þ

ðCESCN ÞðrÞ : Set of nodes in the CE subchains ofðCESCÞðrÞ .

ð19Þ

ð#CESCÞðrÞ : Number of CE subchains in ðCESCÞðrÞ .

ð20Þ

Fig. 7 shows an example of an ACER algorithm with l = 3 for n = 17 and k = 2. In the ﬁrst steps (a) ! (c) the values of the subsets deﬁned above are the following: ðBCÞð1Þ ¼ fð1;17Þg;

ð21aÞ

ðBCN Þð1Þ ¼ f1; 2; . . . ; 17g;

ð21bÞ

ðCESCÞ

ð1Þ

ðCESCN Þ

¼ fð1;2Þ; ð4;5Þ; ð7;8Þ; ð10;11Þ; ð13;14Þ; ð16;17Þg;

ð1Þ

ð#CESCN Þ

¼ f1; 2; 4; 5; 7; 8; 10; 11; 13; 14; 16; 17g; ð1Þ

¼ 6;

ð21cÞ ð21dÞ ð21eÞ

574

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

ðCRSCÞ

ð1Þ

ð#CRSCÞ

¼ fð2;4Þ; ð5;7Þ; ð8;10Þ; ð11;13Þ; ð14;16Þg;

ð1Þ

ð21fÞ

¼ 5:

ð21gÞ

4.1.1. The reduction phase The reduction phase comprises k steps, and each of these steps again contains c + 1 substeps: c CE substeps followed by 1 CR substep. At each step r (r = 1, 2, . . ., k) of the reduction phase, ﬁrst the chain (BC)(r) formed by the largest arcs in the graph will be determined. The corresponding set of nodes in (BC)(r) is stored in (BCN)(r). Next, (BC)(r) is divided into alternate CE and CR subchains containing l2 and 2 D-links, respectively. The ﬁrst and last l 2 D-links of (BC)(r) diﬀer from the others. This results in two sets (CESC)(r) and (CRSC)(r) of subchains. In each step r all CE subchains in (CESC)(r) are transformed in parallel using the CE algorithm in Fig. 6. There are c substeps within an CE step, in substep t (t = 1, 2, . . ., c) all arcs with its begin and end node belonging to the CE subchains are eliminated. This results in two types of ﬁll-ins: 1. Fill-in arcs whose begin node is a node of the CE subchain, and the end node is the middle node of the neighboring CR subchain. 2. Fill-in arcs whose begin and end node are nodes belonging to the CE subchain. After c substeps there are no arcs in the graph left whose begin and end node belong both to the CE subchain. Until now we have not discussed about the value of the number of CE substeps c. Consider a CE subchain, the number of nodes in the CE subchain equals l 1 and

(a)

1

2

3

4

5

6

7

8

(b)

1

2

3

4

5

6

7

8

(c)

1

2

3

4

5

6

7

8

(d)

1

2

3

4

5

6

7

8

Fig. 6. Illustration of the graph transformation process corresponding to the CE algorithm, n = 8 (k = 3).

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

575

the number of arcs 2(l 2). For reasons of simplicity we mean with an arc an arc with both begin and end node in the current CE subchain under consideration. The constant c can be determined by examining how many substeps are needed to eliminate all l 2 arcs (i, j) with i > j or i < j. The constraint l 2 (1 + 21 + 22 + . . .2c1) 6 0, must be satisﬁed, meaning that c P log2 ðl 1Þ. Subsequently, using the upper-entier function c ¼ dlog2 ðl 1Þe.

ð22Þ

After c CE substeps are carried out in step r, a single CR substep follows. In this substep again in parallel for each CR subchain two outgoing arcs from the center node of the subchain are eliminated. For r = 1, 2, . . ., k 1 these eliminations result in two ﬁll-in arcs with the center node of the current CR subchain as the begin node and the center node in the left/right neighboring CR subchains as the end node. 4.1.2. The back-substitution phase At each step r (r = k, k1, . . ., 1) of the back-substitution phase all arcs whose begin and end node belonging to (BCN)(r) are eliminated. There are no ﬁll-in arcs introduced in this phase. Fig. 8 describes the ACER algorithm in terms of graph transformations using the framework. The following relations yield: ðrÞ

¼ 2lkr ;

ð23Þ

ðrÞ

¼ 2lkr 1.

ð24Þ

ð#CESCÞ and ð#CRSCÞ

In case of r = k we have (#CRSC)(k) = 1. 4.2. An example worked out in detail In the following, we consider an example for l = 3 and n = 17 to illustrate the ACER algorithm. Fig. 7 shows the process of graph transformation for this example problem. Since l = 3 it follows that c = 1. For the given n = 17 and l = 3, we have k = 2, therefore both reduction and back-substitution phase consists of 2 steps. Each reduction step contains two substeps: 1 CE substep (c = 1) and 1 CR substep. The CE substep (a) ! (b) and CR substep (b) ! (c) form together the ﬁrst step of the reduction phase, CE substep (c) ! (d) and CR substep (d) ! (e) form the second and the last step. Steps (e) ! (f) and (f) ! (g) form the back-substitution phase. In the ﬁrst step r = 1 of the reduction phase the sets in (21a)–(21g) are found. In the CE substep t = 1 of step r = 1 the following set of arcs are eliminated (from the graph in (a)): fð1; 2Þ; ð2; 1Þ; ð4; 5Þ; ð5; 4Þ; ð7; 8Þ; ð8; 7Þ; ð10; 11Þ; ð11; 10Þ; ð13; 14Þ; ð14; 13Þ; ð16; 17Þ; ð17; 16Þg

576

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

(a)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(b)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(c)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(d)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(e)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(f)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

(g)

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

Fig. 7. Illustration of the ACER algorithm with l = 3 for n = 17 and k = 2. (a) ! (e): the reduction phase; (e) ! (g): the back-substitution phase.

and the following ﬁll-arcs are added to the graph: fð1; 3Þ; ð5; 3Þ; ð4; 6Þ; ð8; 6Þ; ð7; 9Þ; ð11; 9Þ; ð10; 12Þ; ð14; 12Þ; ð13; 15Þ; ð17; 15Þg. This results in the graph shown in (b). In the CR substep of step r = 1 the set of arcs eliminated (from the graph in (b)) are fð3; 2Þ; ð3; 4Þ; ð6; 5Þ; ð6; 7Þ; ð9; 8Þ; ð9; 10Þ; ð12; 11Þ; ð12; 13Þ; ð15; 14Þ; ð15; 16Þg and the following set of ﬁll-arcs are added to the graph: fð3; 6Þ; ð6; 3Þ; ð6; 9Þ; ð9; 6Þ; ð9; 12Þ; ð12; 9Þ; ð12; 15Þ; ð15; 12Þg. In the second step r = 2 of the reduction phase we ﬁnd the following sets: ðBCÞð2Þ ¼ fð3;15Þg; ð2Þ

ðBCN Þ

ðCESCÞ

¼ f3; 6; 9; 12; 15g;

ð2Þ

¼ fð3;6Þ; ð12; 15Þg;

ðCESCN Þð2Þ ¼ f3; 6; 12; 15g.

ð25aÞ

ð25bÞ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

577

Fig. 8. The ACER algorithm in set notation (n = 2lk 1, l = 2, 3, . . ., k = 1, 2, . . .).

ð#CESCÞ ðCRSCÞ

ð2Þ

ð2Þ

¼ 2;

¼ fð6;12Þg;

ð25cÞ

ð#CRSCÞð2Þ ¼ 1. In the CE substep t = 1 of step r = 2 the arcs (3, 6), (6, 3), (12, 15) and (15, 12) from the graph in (c) are eliminated, and ﬁll-arcs (3, 9) and (15, 9) are added, resulting in graph (d). In the CR substep of r = 2 the arcs (9, 6) and (9, 12) form graph (d) are eliminated, resulting in graph (e). Herewith the reduction phase has been completed. In the second step r = 2 of the back-substitution phase all arcs in graph (e) with both begin and end node belonging to (BCN)(2) are eliminated (25a): (3, 9), (6, 9), (12, 9) and (15, 9). This results in graph (f). In the second and last step r = 2 of the back-substitution phase all arcs in graph (f) with both begin and end node belonging to (BCN)(1) (21-b) are eliminated. This results in graph (g) which has no arcs and the ACER algorithm terminates here. The ACER algorithms can also be described in more detail. This is shown in Fig. 9.

578

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Fig. 9. The ACER algorithm in terms of the framework (n = 2lk1, l = 2, 3, . . ., k = 1, 2, . . ., and c ¼ dlog2 ðl 1Þe).

4.3. Analysis of time complexity and number of ﬁll-ins For determining the time complexity and the number of ﬁll-ins of the ACER algorithm we consider the ACER algorithm in set notation, as described in Section 4.1. 4.3.1. Time complexity For the time complexity analysis of ACER we restrict ourselves solely to the case where maximum parallelism is exploited.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

579

4.3.1.1. The reduction phase. In each step r = 1, 2, . . ., k of the reduction phase all arcs in the set (CESC)(r) are eliminated simultaneously in c CE substeps. If there are two arcs in (CESC)(r) with the same end node, then a CE substep costs four operations. If the set does not contain such arcs, then the CE substep costs three operations, which is always the case for the last CE substep. By this, the CE substeps in the reduction phase takes 4(c 1)k + 3k operations. For c = 0 (l = 2) we correct this number by introducing the indicator function 1[c > 0] (=1 for c > 0; =0 otherwise), resulting in 4ck k Æ 1[c > 0] operations. After the ﬁnal CE substep one CR substep follows, costing four operations. Therefore, executing all the CR substeps in the reduction phase takes 4k operations, hence in total the reduction phase costs 4ck þ 4k k 1½c>0

ð26Þ

operations. 4.3.1.2. The back-substitution phase. In the ﬁrst step r = k of the back-substitution phase the variable x(n+1)/2 is recovered, costing one operation. After that, eliminating the arcs in (CRSC)(k) can be done in three operations, and the rest of the k 1 steps of the back-substitution phase cost four operations each. Therefore the back-substitution phase takes 4k

ð27Þ

operations. 4.3.1.3. Total operations. Adding (26) and (27) results in the following time complexity for the ACER algorithm: nþ1 T ACER ¼ ð4c þ 8 1½c>0 Þlogl . ð28Þ 2 Note that for l = 2 (c = 0) this expression equals to TCR, given by (14). 4.3.2. Number of ﬁll-ins Recall that only the reduction phase introduces ﬁll-in arcs. The total number of ﬁll-ins of the ACER algorithm consists of ﬁll-ins introduced in the ck CE substeps, and ﬁll-ins introduced in the k CR substeps. 4.3.2.1. Number of ﬁll-in arcs in ck CE substeps. As denoted in Section 4.1 we distinguish two diﬀerent types of ﬁll-ins: (1) ﬁll-in arcs whose begin node is a node of the CE subchain, and the end node is the middle node of the neighboring CR subchain, and (2) ﬁll-in arcs whose begin and end node belong to the CE subchain. Now the numbers of each of these types of ﬁll-in arcs will be determined. For the sake of simplicity a CE subchain is simply referred to as a subchain, and a CE substep simply as a substep. The number of type (1) ﬁll-ins per subchain can be determined as follows. In the ﬁrst substep t = 1 there are 2(l 1) 2 arcs eliminated, resulting in 2(l 1) 4 ﬁllin arcs. In the second substep t = 2 there are 2(l1)8 arcs eliminated, resulting in

580

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

2(l 1) 8 ﬁll-in arcs. This proces repeats itself for only c 1 times, because there are no ﬁll-ins introduced in substep t = c. By this, the total number of ﬁll-in arcs per subchain after c substeps is given by 2ðl 1Þðc 1Þ ð4 þ 8 þ 16 þ . . .Þ. |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ} c1 terms

The indicated series can be classiﬁed as a ﬁnite geometric series and is given by 4(2c1 1), and so the number of ﬁll-in arcs per subchain becomes 2ðl 1Þðc 1Þ 4ð2c1 1Þ: Multiplying this with the total number of subchains (#CESC)(r), given by (23), results in k1 f2ðl 1Þðc 1Þ 4ð2c1 1Þgf2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ ﬄ} g: |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ

ð29Þ

k terms

The indicated series in the above expression is ﬁnite geometric and given by 2lk 2 ; l1

ð30Þ

and therefore the total number of type (1) ﬁll-in arcs after ck substeps can be written as

2lk 2 2ðc 1Þðl 1Þ 4ð2c1 1Þ : ðl 1Þ

ð31Þ

Apart from the ﬁrst and last subchain, per subchain and after c substeps 2(l 2) ﬁll-in arcs of type (2) are introduced. Using expression (24) for (#CESC)(r) the number of type (2) ﬁll-in arcs introduced in the rth step is 2lkr 2ðl 2Þ 2ðl 2Þ; in which the last term 2(l 2) is a correction for the ﬁrst and last subchain. Taking the sum over k steps results in k1 2ðl 2Þf2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ . .ﬄ}.g 2ðl 2Þk. |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ k terms

The indicated series is given by (30), and therefore the total number of type (2) ﬁllin arcs after ck substeps can be written as 2ðl 2Þ

2lk 2 2ðl 2Þk: l1

ð32Þ

4.3.2.2. Number of ﬁll-in arcs in k CR substeps. Apart from the ﬁrst and last CR subchain, in each step two ﬁll-in arcs are introduced per CR subchain. With (#CRSC)(r) given by (23) and taking the sum over k steps, results in

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

581

2f2lk1 1 þ 2lk2 1 þ 2lk3 1 þ . . .g 2k k1 ¼ 2f2l þ 2lk2ﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ þ 2lk3 þ ﬄ} g 4k; |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ k terms

where the indicated series is given by (30). By this the total number of ﬁll-in arcs introduced in k CR substeps can be written as 2

2lk 2 4k: l1

ð33Þ

4.3.2.3. Total number of ﬁll-ins. Adding (31)–(33) the total number of ﬁll-ins of the ACER algorithm is given by

F ACER ¼

2cþ1 4 nþ1 2c ðn 1Þ 2llogl . l1 2

ð34Þ

Note that for l = 2 (c = 0) FACER is equal to number of ﬁll-ins for the CR algorithm FCR given by (15). 4.4. A comparison with the CR and CE algorithm In this subsection we will make a quantitative comparison for the time complexities and ﬁll-ins of the ACER, CE and CR algorithms. We will illustrate that the ACER algorithm combines the properties of the CE and CR algorithms. Table 1 summarizes the time complexities and number of ﬁll-ins for the methods discussed in this paper. For the sake of simplicity, we will assume that T and F are continuous functions of n, instead of discrete. 4.4.1. Time complexities We can rewrite TCR, TCE and TACER as follows T CR ¼ 8flog2 ðn þ 1Þ 1g; T CE ¼ 4log2 ðnÞ 1; T ACER ¼

4c þ 8 1½c>0 flog2 ðn þ 1Þ 1g. log2 l

For suﬃciently large values of n we can approximate this time complexities by T CR ¼ 8log2 ðnÞ; T CE ¼ 4log2 ðnÞ; 4c þ 8 1½c>0 log2 ðnÞ. T ACER ¼ log2 l

ð35Þ

582

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

Table 1 Time complexities and number of ﬁll-ins for the CR, CE and ACER algorithm, for k = 1, 2, . . ., and l = 2, 3, . . ., and c ¼ dlogðl 1Þe n

T

F 2ðn 1Þ 4log2

CR

2 Æ 2 1

nþ1 8log2 2

CE

2k

4log2 ðnÞ

k

ACER

2log2 ðnÞn 4n þ 4

2cþ1 4 nþ1 2c ðn 1Þ 2llogl l1 2

nþ1 ð4c þ 8 1½c>0 Þlogl 2

k

2l 1

nþ1 2

These are the highest order terms, hence the time complexities are Oðlog2 ðnÞÞ. Now consider the factor in expression (35) of TACER. In the following table this factor is given for a certain range of l: l

2

3

4

5

6

7

8

9

10

100

c 4c þ 8 1½c>0 log2 l

0

1

2

2

3

3

3

3

4

7

8.00

6.94

7.50

6.46

7.35

6.77

6.33

5.99

6.92

5.27

This shows that for an increasing l, the factor of TACER decreases starting from the value 8 for the CR algorithm, and slowly tends to the factor 4 for the CE algorithm. It can be concluded that T CE < T ACER 6 T CR . Fig. 10 shows the diﬀerences in time complexities for an ACER-10 and ACER100 algorithm, compared to the CR and CE algorithm. The time complexities are plotted as a function of n, as well as the discrete values. This ﬁgure clearly shows that TCE < TACER100 < TACER10 < TCR. 4.4.2. Number of ﬁll-ins The expressions for FCR and FACER can be rewritten as F CR ¼ 2ðn 1Þ 4flog2 ðn þ 1Þ 1g; F CE ¼ 2log2 ðnÞn 4n þ 4;

F ACER ¼

2cþ1 4 2l flogl ðn þ 1Þ 1g. 2c ðn 1Þ l1 log2 ðlÞ

For n suﬃciently large, we have F CR ¼ 2n;

ð36Þ

F CE ¼ 2log2 ðnÞn;

ð37Þ

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587 150

583

CR CE ACER10 ACER100 discrete value

time complexity T

100

50

0 0 10

1

2

10

10

3

10 number of equations n

4

5

10

6

10

10

Fig. 10. Time complexities for the ACER-10 and ACER-100 algorithms compared to the CR and CE algorithm. A · denotes a discrete value.

F ACER ¼

2c

2cþ1 4 n; l1

ð38Þ

showing that FCE is Oðlog2 ðnÞnÞ, and that FCR as well as FACER are OðnÞ. Again, consider the factor in (38) for FACER. For a certain range of l this factor is given by the following table: l c

2c

2

cþ1

2

3

4

5

6

7

8

9

10

100

0

1

2

2

3

3

3

3

4

7

2.00

2.67

3.00

3.60 4.00

4.29

4.50

4.89 11.45

4 n 2.00 l1

This shows that for an increasing l the factor of FACER slightly increases. From this quantitative analysis it follows that FACER P FCR, and since FACER is of higher order than FCE we have F CE > F ACER P F CR . Fig. 11 shows the diﬀerences in ﬁll-ins for a ACER-10 and ACER-100 algorithm, compared to the CE and CR algorithm. It yileds that FCE > FACER100 > FACER10 > FCR.

584

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587 7

x 10 3.5

CR CE ACER10 ACER100 discrete value

3

number of fill-ins F

2.5

2

1.5

1

0.5

0

0

1

2

3

4 5 6 number of equations n

7

8

9

10 5

x 10

Fig. 11. Fill-ins for the ACER-10 and ACER-100 algorithms compared to the CR and CR algorithm. A · denotes a discrete value.

4.5. Some variants of the ACER class In this subsection some variants of the ACER class will be discussed. For the ACER algorithm it follows that (#CRSC)(r), given by (24), is equal to the rth term of the sequence . . . ; 2l3 1; 2l2 1; 2l 1; 1 . |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ}

ð39Þ

k terms

A term of this sequence can be determined by subtracting l 1 from the preceding term, followed by a division by l. Characteristic for ACER is that (#CRSC)(k) = 1. If we now choose (#CRSC)(k) diﬀerent from the terms in (39) and as a positive integer, then one can observe that the process of alternating CE and CR cancels after step r = k. When we eliminate simultaneously the ﬁll-in arcs, which are introduced in the CR substep r = k, this results in a variant of the ACER class. Besides the k steps of the reduction phase in total l m log2 ðð#CRSCÞðkÞ Þ extra CE steps are required. The back-substitution phase for this ACER variant consists of k steps. Fig. 12 shows an example of a variant of the ACER algorithm, for

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

585

(a)

1

2

3

4

5

6

7

8

9

10

11

(b)

1

2

3

4

5

6

7

8

9

10

11

(c)

1

2

3

4

5

6

7

8

9

10

11

(d)

1

2

3

4

5

6

7

8

9

10

11

(e)

1

2

3

4

5

6

7

8

9

10

11

(f)

1

2

3

4

5

6

7

8

9

10

11

Fig. 12. Illustration of a variant for the ACER-2 (CR) algorithm for (#CRSC) (a) ! (d): the reduction phase; (d) ! (f): the back-substitution phase.

(k)

= 2 and n = 11 (k = 2).

which (#CRSC)(k) = 2. Step (c) ! (d) is an extra step in which the ﬁll-in arcs, which are introduced in CR substep (b) ! (c), are eliminated. Now a general expression for n will be derived for the variants of ACER, as deﬁned above. Let (#CRSC)(k) be diﬀerent from the term in (39) and a positive integer. In general, n for ACER is given by n ¼ lð#CRSCÞ

ð1Þ

þ l 1.

ð40Þ

When we write for simplicity (#CRSC)(k) = v, then (#CRSC)(r) is equal to the rth term of the sequence . . . ; ðv þ 1Þl3 1; ðv þ 1Þl2 1; ðv þ 1Þl 1; v; |ﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ{zﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄﬄ} k terms

and the rth term is given by ð#CRSCÞðrÞ ¼ ðv þ 1Þlkr 1. By this, for r = 1 it follows that: ð#CRSCÞð1Þ ¼ ðv þ 1Þlk1 1; and substitution in (40) results in n ¼ fð#CRSCÞðkÞ þ 1glk 1.

ð41Þ (k)

For each positive integer (#CRSC) which does not belong to the sequence (39), this expression for n results in a diﬀerent variant class of the ACER class.

586

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

5. Concluding remarks In this paper a class of new parallel algorithms for the solution of tridiagonal matrix systems is presented, the ACER class. The ACER algorithms are designed and described with a general graph theoretic framework, which uniﬁes many diﬀerent parallel algorithms. It has been shown that the ACER algorithms combine the advantages of both the cyclic elimination and cyclic reduction algorithms. The power of the graph framework lies in that we can design parallel algorithms beyond the ability of detecting parallelism in an existing algorithm [10]. It remains an interesting and challenging problem to deﬁne similar frameworks for sparse matrix computations other than the Gaussian-elimination and factorization. We believe that many other parallel algorithms can be designed and analyzed using the graph framework. An interesting open question is: what is the minimum parallel time complexity to solve a tridiagonal system? For the tridiagonal matrix system, so far the fastest algorithm has a parallel time complexity of Oðlog2 ðnÞÞ (actually 4log2 ðnÞÞ. Can we use the graph model to prove the lower-bound of parallel time complexity for a tridiagonal system? (Under the given context of the set transformation rules).

References [1] I. Bar-On, Eﬃcient logarithmic time parallel algorithms for the Cholesky decomposition and Gram– Schmidt process, Parallel Comput. 17 (1991) 409–417. [2] I.S. Duﬀ, A.M. Erisman, J.K. Reid, Direct Methods for Sparse Matrices, Oxford University Press, New York, 1986. [3] A. George, J.W.H. Liu, Computer Solution of Large Sparse Positive Deﬁnite Systems, Prentice-Hall Inc., Englewoods Cliﬀs, NJ, 1981. [4] R.W. Hockney, A fast direct solution of Poissons equation using Fourier analysis, J. ACM 12 (1965) 95–113. [5] S.L. Johnsson, Solving tridiagonal systems on ensemble architectures, SIAM J. Sci. Stat. Comput. 8 (1987) 354–392. [6] J.J. Lambiotte, R.G. Voigt, The solution of tridiagonal linear systems on the CDC STAR-100 computer, ACM TOMS 1 (1975) 308–329. [7] A bibliography on parallel solution of tri-diagonal systems of equations, Available from . [8] H.X. Lin, M.R.T. Roest, Parallel solution of symmetric banded systems, in: G.R. Joubert, D. Trystram, F.J. Peters, D.J. Evans (Eds.), Parallel Computing: Trends and Applications, 1994, pp. 537–540. [9] H.X. Lin, A unifying graph model for designing parallel algorithms for tridiagonal systems, Parallel Comput 27 (2001) 925–939. [10] H.X. Lin, Designing parallel sparse matrix algorithms beyond data dependence analysis, in: Proceedings of the ICCP 2001, Workshop High Performance Scientiﬁc Engineering Computing with Applications (Keynote), IEEE Computer Society Press, Valencia, September 3–7, 2001. [11] U. Meijer, A parallel partition method for solving banded systems of linear equations, Parallel Comput. 2 (1985) 33–43. [12] S. Parter, The use of linear graphs in Gaussian elimination, SIAM Rev. 3 (1961) 119–130. [13] H.S. Stone, An eﬃcient parallel algorithm for the solution of a tridiagonal linear system of equations, J. ACM 20 (1973) 27–38.

J. Verkaik, H.X. Lin / Parallel Computing 31 (2005) 563–587

587

[14] H.A. Vorst, van der, Large tridiagonal and block tridiagonal linear systems on vector and parallel computers, Parallel Comput. 5 (1987) 45–54. [15] H.H. Wang, A parallel method for tridiagonal equations, ACM TOMS 7 (1981) 167–183.

Kami adalah komunitas sharing. Jadi tolong bantu kami dengan mengunggah ** 1 ** dokumen baru atau yang ingin kami unduh:

ATAU SEPERTI DOWNLOAD SEGERA