On the Factor Refinement Principle and its Implementation on Multicore Architectures
Master's Public Lecture
Presented by: Md. Mohsin Ali
Supervisor: Dr. Marc Moreno Maza
Dept. of Computer Science (ORCCA Lab)
The University of Western Ontario, London, ON, Canada
December 15, 2011
◮ Factor Refinement
◮ Serial Algorithms
  Approach based on the naive refinement
  Approach based on the augment refinement
  Approach based on subproduct trees
◮ Motivation
◮ Implementation Challenges on Multicore Architectures
◮ Contribution
◮ Proposed Parallel Algorithms
  A d-n-c illustration
  Parallel algorithm based on the naive refinement
  Parallel algorithm based on the augment refinement
  Parallel algorithm based on subproduct trees
◮ Conclusion
Factor Refinement
Factor Refinement I

Definition
◮ Let D be a UFD and m_1, m_2, ..., m_r be elements of D.
◮ Let m be the product of m_1, m_2, ..., m_r.
◮ We say that elements n_1, n_2, ..., n_s of D form a GCD-free basis whenever gcd(n_i, n_j) = 1 for all 1 ≤ i < j ≤ s.
◮ Let e_1, e_2, ..., e_s be positive integers.
◮ We say that the pairs (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) form a refinement of m_1, m_2, ..., m_r if the following two conditions hold:
  (i) n_1, n_2, ..., n_s is a GCD-free basis,
  (ii) for every 1 ≤ i ≤ r there exist non-negative integers f_1, ..., f_s such that ∏_{1 ≤ j ≤ s} n_j^{f_j} = m_i, and moreover ∏_{1 ≤ i ≤ s} n_i^{e_i} = m.
When this holds, we also say that (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) is a coprime factorization of m.
Factor Refinement II

Example
Let m_1 = 30, m_2 = 42, and their product m = 1260. Then
(i) 5^1, 6^2, 7^1 is a refinement of 30 and 42,
(ii) 5, 6, 7 is a GCD-free basis of 30 and 42,
(iii) 5^1, 6^2, 7^1 is a coprime factorization of 1260.
Factor Refinement III

Applications
◮ Simplifying systems of polynomial equations and inequations:
  (i) the system ab ≠ 0, bc ≠ 0, ca ≠ 0 simplifies to a ≠ 0, b ≠ 0, c ≠ 0,
  (ii) [Diagram: the sets S_1, S_2, S_3 are refined into pieces A, B, C, D, E, F, G; here {A, B, C, D, E, F, G} can be seen as a GCD-free basis of {S_1, S_2, S_3}.]
◮ Consolidation of independent factorizations,
◮ etc.
Serial Algorithms: Approach based on the naive refinement and quadratic arithmetic
Approach based on the naive refinement I

Idea from Bach, Driscoll, and Shallit in 1990 [BDS90].
◮ Given a partial factorization of an integer m, say m = m_1 m_2, we compute d = gcd(m_1, m_2) and write m = (m_1/d) · d^2 · (m_2/d).
◮ This process is continued until all the factors are pairwise coprime.
◮ The same idea applies to the general case of more than two inputs, say m = m_1 m_2 ... m_ℓ.

Algebraic complexity
If m = m_1 m_2 ... m_ℓ, then this algorithm takes O(size(m)^3) bit operations, where
size(m) = 1 if m = 0, and size(m) = 1 + ⌊log_2 |m|⌋ if m > 0.
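As a concrete illustration, the following is a minimal serial C++ sketch of this loop (not the thesis code; it assumes C++17 for std::gcd and C++20 for std::erase_if). Entries are (base, exponent) pairs so that repeated factors are merged instead of looping forever:

  #include <cstdint>
  #include <numeric>   // std::gcd
  #include <utility>
  #include <vector>

  using Pair = std::pair<int64_t, int64_t>;  // (base, exponent)

  // Naive refinement: repeatedly pick two entries whose bases share a
  // nontrivial gcd d and rewrite a^e * b^g as (a/d)^e * d^(e+g) * (b/d)^g,
  // iterating until all bases are pairwise coprime (a GCD-free basis).
  std::vector<Pair> naive_refine(std::vector<Pair> f) {
      bool changed = true;
      while (changed) {
          changed = false;
          for (std::size_t i = 0; i < f.size() && !changed; ++i)
              for (std::size_t j = i + 1; j < f.size() && !changed; ++j) {
                  int64_t d = std::gcd(f[i].first, f[j].first);
                  if (d != 1) {
                      int64_t e = f[i].second, g = f[j].second;
                      f[i].first /= d;
                      f[j].first /= d;
                      f.push_back({d, e + g});
                      changed = true;
                  }
              }
          // discard trivial bases equal to 1
          std::erase_if(f, [](const Pair& p) { return p.first == 1; });
      }
      return f;
  }

On the earlier example, naive_refine({{30, 1}, {42, 1}}) returns (5, 1), (7, 1), (6, 2), i.e. the coprime factorization 5^1 · 6^2 · 7^1 of 1260.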
Serial Algorithms: Approach based on the augment refinement and quadratic arithmetic
Approach based on the augment refinement and quadratic arithmetic I

Again from Bach, Driscoll, and Shallit in 1990 [BDS90].
◮ The basic idea is the same as before, but organizing the computations more carefully leads to an improved complexity [BDS90].
◮ The trick is to keep track of the pairs (n_j, n_k) in an ordered list such that only elements adjacent in the list can have a nontrivial GCD.

Algebraic complexity
If m = m_1 m_2 ... m_ℓ, then this algorithm takes O(size(m)^2) bit operations, where
size(m) = 1 if m = 0, and size(m) = 1 + ⌊log_2 |m|⌋ if m > 0.
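The sketch below shows only the augment structure, i.e. inserting the inputs one at a time into an already coprime basis; it reuses Pair and naive_refine from the previous sketch as the settling primitive. The actual BDS algorithm avoids the full rescan by keeping the list ordered so that only adjacent entries can share a nontrivial GCD, which is what yields the quadratic bit complexity:

  // A sketch of the augment strategy only (not the BDS data structure).
  std::vector<Pair> augment_refine(const std::vector<int64_t>& inputs) {
      std::vector<Pair> basis;
      for (int64_t x : inputs) {
          basis.push_back({x, 1});                 // augment with next input
          basis = naive_refine(std::move(basis));  // settle it against the rest
      }
      return basis;
  }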
Serial Algorithms: Approach based on subproduct trees
Approach based on subproduct trees I

Idea of an asymptotically fast algorithm for GCD-free basis computation from Dahan, Moreno Maza, Schost, and Xie in 2005 [DMS+05].
◮ Divide the input into sub-problems until a base case is reached,
◮ Conquer the sub-problems from the leaves to the root, applying fast arithmetic based on subproduct trees (described later).

Algebraic complexity
The total number of field operations of this algorithm is O(M(d) log_2^3 d), where
◮ d is the sum of the degrees of the input polynomials,
◮ M(d) is the time for multiplying two univariate polynomials of degree less than d.
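A minimal sketch of the tree construction, using machine integers for readability (the cited algorithm works with univariate polynomials and fast multiplication; overflow is ignored here):

  #include <cstdint>
  #include <vector>

  // Subproduct tree: level 0 holds the inputs; each higher level holds
  // products of adjacent pairs, so the root is the product of all leaves.
  std::vector<std::vector<int64_t>> subproduct_tree(const std::vector<int64_t>& leaves) {
      std::vector<std::vector<int64_t>> tree{leaves};
      while (tree.back().size() > 1) {
          std::vector<int64_t> next;
          const std::vector<int64_t>& prev = tree.back();
          for (std::size_t i = 0; i + 1 < prev.size(); i += 2)
              next.push_back(prev[i] * prev[i + 1]);
          if (prev.size() % 2 != 0)
              next.push_back(prev.back());     // odd node is promoted unchanged
          tree.push_back(std::move(next));     // prev is no longer used here
      }
      return tree;  // tree.back()[0] is the product of all inputs
  }

For instance, subproduct_tree({2, 6, 7, 10, 15, 21, 22, 26}) has the eight inputs at level 0 and their full product at the root; the conquer phase of the GCD-free basis algorithm works level by level on such trees.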
Motivation I

Parallel Computation of the Minimal Elements of a Poset
◮ by Leiserson, Li, Moreno Maza, and Xie in 2010 [ELLMX10].
◮ This is a multithreaded (fork-join parallelism) approach which is divide-and-conquer, free of data races, and inspired by parallel merge sort.
◮ Its Cilk++ implementation shows nearly linear speed-up on 32-core processors for sufficiently large input data sets.
This work led us to the design and implementation of parallel factor refinement algorithms.
Implementation Challenges on Multicore Architectures
Multithreaded Parallelism on Multicore Architectures I

Multicore architectures
◮ A multi-core processor is a single computing component with two or more independent and tightly coupled processors, called cores, sharing memory.
◮ They also share the same bus and memory controller; thus memory bandwidth may limit performance.
◮ In order to maintain memory consistency, synchronization is needed between cores, which may also limit performance.
Multithreaded Parallelism on Multicore Architectures II

Fork-join parallelism
◮ This model represents the execution of a multithreaded program as a set of nonblocking threads denoted by the vertices of a dag, where the dag edges indicate dependencies between instructions.
◮ Assuming unit cost of execution for all threads, the number of vertices of the dag is the work (= running time on a single core).
◮ The maximum length of a path from the root to a leaf is the span (= running time on ∞ processors).
◮ The parallelism is the ratio of work to span (= average amount of work along the span).
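As a standard worked example (not from the slides), consider a divide-and-conquer sum of n numbers that recursively sums the two halves in parallel and adds the results:

  T_1(n) = 2 T_1(n/2) + Θ(1) = Θ(n)          (work)
  T_∞(n) = T_∞(n/2) + Θ(1) = Θ(log n)        (span)
  parallelism = T_1(n) / T_∞(n) = Θ(n / log n)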
The Ideal-cache Model I

[Figure 1: The ideal-cache model. The CPU operates on data in an ideal cache of Z words, organized in cache lines of L words each and managed with an optimal replacement strategy; lines are transferred between the cache and a large main memory, and Q denotes the number of cache misses incurred by a computation of work W.]
The Ideal-cache Model II
◮ The processor can only refer to words that reside in the cache, a small and fast memory containing Z words organized in cache lines of L words each.
◮ If the line containing a referenced word is not in the cache, the corresponding line needs to be brought in from the main memory. This is a cache miss. If the cache is full, a cache line must be evicted.
◮ Cache complexity analyzes algorithms in terms of cache misses.
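A standard worked bound in this model (not from the slides): scanning n contiguous words incurs at most

  Q_scan(n; Z, L) ≤ ⌈n/L⌉ + 1

cache misses: one per line touched, plus possibly one extra when the data does not start on a cache-line boundary.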
From Cilk to Cilk++ I

The language
◮ Cilk (resp. Cilk++) is an extension of C (resp. C++) implementing fork-join parallelism with two keywords, spawn and sync.
◮ A Cilk (resp. Cilk++) program has the same semantics as its C (resp. C++) elision.

Performance of the work-stealing scheduler
In theory, the scheduler of Cilk (resp. Cilk++) executes any Cilk (resp. Cilk++) computation in nearly optimal time on p processors, provided that
◮ for almost all parallel steps, there are at least p units of work which can be run concurrently,
◮ each processor is either working or stealing work,
◮ each thread executes in unit time.
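To make the keywords concrete, here is the classic fib example (not from the slides; it requires a Cilk-enabled compiler, and in Cilk++ the keywords are spelled cilk_spawn and cilk_sync):

  long fib(long n) {
      if (n < 2) return n;
      long x = cilk_spawn fib(n - 1);  // child strand may run in parallel ...
      long y = fib(n - 2);             // ... with the continuation
      cilk_sync;                       // wait for the spawned child
      return x + y;
  }

Deleting the two keywords yields a valid serial C++ program with the same semantics, which is exactly the elision property stated above.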
Parallelization overheads I

Overheads and burden
◮ In practice, the observed speedup factor may be less (sometimes much less) than the theoretical parallelism.
◮ Many factors explain this: simplifying assumptions of the fork-join parallelism model, architecture limitations, the cost of executing the parallel constructs, and scheduling overheads.

Parallelism vs. burdened parallelism
◮ Cilkview is a performance analyzer which calculates the work, the span, and the parallelism of a given Cilk++ program run.
◮ Cilkview also estimates the running time T_p on p processors as T_p = T_1/p + 1.7 × burden_span, where burden_span is 15000 instructions times the number of spawns along the span!
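A numeric illustration of this estimate (the figures are made up): a run with work T_1 = 10^9 instructions and 100 spawns along the span has burden_span = 100 × 15000 = 1.5 × 10^6 instructions, so on p = 8 cores Cilkview predicts roughly

  T_8 ≈ 10^9 / 8 + 1.7 × 1.5 × 10^6 ≈ 1.28 × 10^8 instructions.

Here the burden term is negligible, but it dominates when spawns are fine-grained relative to the work they fork off.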
Contribution I
◮ Parallel algorithm based on the naive refinement principle [NOT GOOD for data locality and thus for parallelism on multicore architectures].
◮ Parallel algorithm based on the augment refinement principle [GOOD for data locality and parallelism].
◮ Parallel algorithm based on subproduct trees [MORE CHALLENGING for implementation on multicore architectures].

Principle
All are divide-and-conquer (d-n-c), multithreaded algorithms, free of data races.
Proposed Parallel Algorithms
A d-n-c illustration
A d-n-c illustration I

Figure 2: Example of algorithm execution.
Input: 2, 6, 7, 10, 15, 21, 22, 26.
Expand (done in parallel): split into 2, 6, 7, 10 and 15, 21, 22, 26, then into the pairs (2, 6), (7, 10), (15, 21), (22, 26), and finally into the single factors 2^1, 6^1, 7^1, 10^1, 15^1, 21^1, 22^1, 26^1.
Merge (done in parallel):
◮ (2, 6) → 3^1, 2^2;  (7, 10) → 7^1, 10^1;  (15, 21) → 5^1, 7^1, 3^2;  (22, 26) → 11^1, 13^1, 2^2,
◮ then 3^1, 7^1, 5^1, 2^3 and 5^1, 7^1, 3^2, 11^1, 13^1, 2^2,
◮ Output: 11^1, 13^1, 3^3, 7^2, 5^2, 2^5.
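A serial sketch of the d-n-c driver matching the figure (reusing Pair and naive_refine from the earlier sketches; the pair-splitting primitive used here stands in for the refinement merge of the actual algorithms):

  // Split the inputs, refine the two halves recursively -- in the
  // parallel version the first recursive call is a cilk_spawn candidate
  // -- then merge the two coprime lists by refining their union.
  std::vector<Pair> dnc_refine(const std::vector<int64_t>& in,
                               std::size_t lo, std::size_t hi) {
      if (hi - lo == 1) return {{in[lo], 1}};
      std::size_t mid = lo + (hi - lo) / 2;
      std::vector<Pair> left  = dnc_refine(in, lo, mid);   // cilk_spawn here
      std::vector<Pair> right = dnc_refine(in, mid, hi);
      left.insert(left.end(), right.begin(), right.end());
      return naive_refine(std::move(left));                // merge step
  }

On the figure's input, dnc_refine({2, 6, 7, 10, 15, 21, 22, 26}, 0, 8) produces the output line 11^1, 13^1, 3^3, 7^2, 5^2, 2^5 (up to ordering).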
Proposed Parallel Algorithms
Parallel algorithms based on the naive refinement