  1. On the Factor Refinement Principle and its Implementation on Multicore Architectures
     Master's Public Lecture
     Presented by: Md. Mohsin Ali
     Supervisor: Dr. Marc Moreno Maza
     Dept. of Computer Science (ORCCA Lab), The University of Western Ontario, London, ON, Canada
     December 15, 2011

  2. Outline
     Factor Refinement
     Serial Algorithms
       Approach based on the naive refinement
       Approach based on the augment refinement
       Approach based on subproduct trees
     Motivation
     Implementation Challenges on Multicore Architectures
     Contribution
     Proposed Parallel Algorithms
       A d-n-c illustration
       Parallel algorithm based on the naive refinement
       Parallel algorithm based on the augment refinement
       Parallel algorithm based on subproduct trees
     Conclusion

  3. Factor Refinement

  4. Factor Refinement I
     Definition
     ◮ Let D be a UFD and m_1, m_2, ..., m_r be elements of D.
     ◮ Let m be the product of m_1, m_2, ..., m_r.
     ◮ We say that elements n_1, n_2, ..., n_s of D form a GCD-free basis whenever gcd(n_i, n_j) = 1 for all 1 ≤ i < j ≤ s.
     ◮ Let e_1, e_2, ..., e_s be positive integers.
     ◮ We say that the pairs (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) form a refinement of m_1, m_2, ..., m_r if the following conditions hold:
       (i) n_1, n_2, ..., n_s is a GCD-free basis,
       (ii) for every 1 ≤ i ≤ r there exist non-negative integers f_1, ..., f_s such that ∏_{1 ≤ j ≤ s} n_j^{f_j} = m_i,
       (iii) ∏_{1 ≤ i ≤ s} n_i^{e_i} = m.
     When this holds, we also say that (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) is a coprime factorization of m.

  5. Factor Refinement II
     Example
     Let m_1 = 30, m_2 = 42, and their product m = 1260. Then
     (i) 5^1, 6^2, 7^1 is a refinement of 30 and 42,
     (ii) 5, 6, 7 is a GCD-free basis of 30 and 42,
     (iii) 5^1, 6^2, 7^1 is a coprime factorization of 1260.
     A small programmatic check of this example follows.
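To make the example concrete, here is a minimal check written in plain C++ (the thesis implementations use Cilk++, a C++ extension; the layout of this snippet is mine, not from the slides). It verifies that 5, 6, 7 are pairwise coprime and that 5^1 · 6^2 · 7^1 recovers m = 30 · 42 = 1260.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>   // std::gcd (C++17)
#include <vector>

int main() {
    std::vector<long long> base = {5, 6, 7};
    std::vector<int>       exp  = {1, 2, 1};

    // (i) GCD-free basis: all pairs of basis elements are coprime.
    for (std::size_t i = 0; i < base.size(); ++i)
        for (std::size_t j = i + 1; j < base.size(); ++j)
            assert(std::gcd(base[i], base[j]) == 1);

    // (iii) the product of the n_i^{e_i} recovers m = m_1 * m_2 = 1260.
    long long product = 1;
    for (std::size_t i = 0; i < base.size(); ++i)
        for (int k = 0; k < exp[i]; ++k) product *= base[i];
    assert(product == 30LL * 42LL);
    return 0;
}
```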

  6. Factor Refinement III
     Applications
     ◮ Simplifying systems of polynomial equations and inequations:
       (i) the constraints ab ≠ 0, bc ≠ 0, ca ≠ 0 simplify to a ≠ 0, b ≠ 0, c ≠ 0;
       (ii) [Diagram: three overlapping sets S_1, S_2, S_3 partitioned into regions A, B, C, D, E, F, G; the set {A, B, C, D, E, F, G} can be seen as a GCD-free basis of {S_1, S_2, S_3}.]
     ◮ Consolidation of independent factorizations,
     ◮ etc.

  7. Serial Algorithms: Approach based on the naive refinement and quadratic arithmetic

  8. Approach based on the naive refinement I
     Idea from Bach, Driscoll, and Shallit in 1990 [BDS90].
     ◮ Given a partial factorization of an integer m, say m = m_1 m_2, we compute d = gcd(m_1, m_2) and write m = (m_1/d) · d^2 · (m_2/d).
     ◮ This process is continued until all the factors are pairwise coprime.
     ◮ The same idea applies to the general case of more than two inputs, say m = m_1 m_2 ... m_ℓ. A serial sketch of this principle follows this slide.
     Algebraic complexity
     If m = m_1 m_2 ... m_ℓ, then this algorithm takes O(size(m)^3) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log_2 |m|⌋ if m > 0.
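The following serial C++ sketch of the naive principle is mine (the slides contain no code): it keeps a list of (base, exponent) pairs and repeatedly applies the identity a^e · b^f = (a/d)^e · (b/d)^f · d^(e+f) with d = gcd(a, b) until all bases are pairwise coprime.

```cpp
#include <cstddef>
#include <numeric>   // std::gcd
#include <utility>
#include <vector>

using Factor = std::pair<long long, int>;  // (base, exponent)

// Naive refinement: split any two factors with a nontrivial GCD,
// and repeat until all bases are pairwise coprime.
std::vector<Factor> naive_refine(std::vector<Factor> fs) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < fs.size() && !changed; ++i)
            for (std::size_t j = i + 1; j < fs.size() && !changed; ++j) {
                long long d = std::gcd(fs[i].first, fs[j].first);
                if (d == 1) continue;
                auto [a, e] = fs[i];
                auto [b, f] = fs[j];
                fs.erase(fs.begin() + j);  // erase j first since j > i
                fs.erase(fs.begin() + i);
                // a^e * b^f = (a/d)^e * (b/d)^f * d^(e+f); drop unit factors.
                if (a / d != 1) fs.push_back({a / d, e});
                if (b / d != 1) fs.push_back({b / d, f});
                fs.push_back({d, e + f});
                changed = true;
            }
    }
    return fs;
}
```

For instance, naive_refine({{30, 1}, {42, 1}}) returns the coprime factorization 5^1, 6^2, 7^1 of slide 5, up to the order of the pairs.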

  9. Serial Algorithms: Approach based on the augment refinement and quadratic arithmetic

  10. Approach based on the augment refinement and quadratic arithmetic I
      Again from Bach, Driscoll, and Shallit in 1990 [BDS90].
      ◮ The basic idea is the same as before, but the computations are organized more carefully, leading to an improved complexity [BDS90].
      ◮ The trick is to keep track of the pairs (n_j, n_k) in an ordered list such that only elements adjacent in the list can have a nontrivial GCD. A simplified sketch of the augment step follows this slide.
      Algebraic complexity
      If m = m_1 m_2 ... m_ℓ, then this algorithm takes O(size(m)^2) bit operations, where size(m) = 1 if m = 0, and size(m) = 1 + ⌊log_2 |m|⌋ if m > 0.
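Below is a simplified C++ sketch of the augment idea, with two stated deviations from [BDS90]: it tracks only a GCD-free basis of values (no exponent bookkeeping), and it omits the ordered-list invariant that yields the quadratic bound. It only illustrates how one new element is refined against an already coprime list.

```cpp
#include <cstddef>
#include <numeric>   // std::gcd
#include <vector>

// Insert x into `basis`, which is assumed pairwise coprime on entry and
// is pairwise coprime again on exit. Any element that shares a factor
// with a pending piece is split, and all pieces are re-examined.
void insert_coprime(std::vector<long long>& basis, long long x) {
    std::vector<long long> pending = {x};
    while (!pending.empty()) {
        long long v = pending.back();
        pending.pop_back();
        if (v == 1) continue;
        bool coprime_to_all = true;
        for (std::size_t i = 0; i < basis.size(); ++i) {
            long long g = std::gcd(v, basis[i]);
            if (g == 1) continue;
            long long b = basis[i];
            basis.erase(basis.begin() + i);
            // Replace {b, v} by the pieces {g, b/g, v/g}; the product of
            // all tracked elements strictly decreases, so this terminates.
            pending.push_back(g);
            pending.push_back(b / g);
            pending.push_back(v / g);
            coprime_to_all = false;
            break;
        }
        if (coprime_to_all) basis.push_back(v);
    }
}
```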

  11. Serial Algorithms: Approach based on subproduct trees

  12. Approach based on subproduct trees I
      Idea of an asymptotically fast algorithm for GCD-free basis from Dahan, Moreno Maza, Schost, and Xie in 2005 [DMS+05].
      ◮ Divide the input into sub-problems until a base case is reached,
      ◮ Conquer the sub-problems from the leaves to the root, applying fast arithmetic based on subproduct trees (described later; a sketch of the tree construction follows this slide).
      Algebraic complexity
      The total number of field operations of this algorithm is O(M(d) log_2^3 d), where
      ◮ d is the sum of the degrees of the input polynomials,
      ◮ M(d) is a multiplication time for two univariate polynomials of degree less than d.
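The subproduct tree itself is easy to sketch. The version below works over machine integers to stay short, whereas the algorithm of [DMS+05] builds it over univariate polynomials with fast multiplication; the function name and layout are mine.

```cpp
#include <cstddef>
#include <vector>

// Level 0 holds the inputs; each higher level holds products of adjacent
// pairs; the last level holds one node, the product of all inputs.
std::vector<std::vector<long long>>
subproduct_tree(const std::vector<long long>& leaves) {
    std::vector<std::vector<long long>> tree = {leaves};
    while (tree.back().size() > 1) {
        std::vector<long long> cur = tree.back();  // copy to avoid aliasing
        std::vector<long long> next;
        for (std::size_t i = 0; i + 1 < cur.size(); i += 2)
            next.push_back(cur[i] * cur[i + 1]);   // fast multiplication goes here
        if (cur.size() % 2 == 1)
            next.push_back(cur.back());            // odd element carries up
        tree.push_back(next);
    }
    return tree;
}
```

For the inputs 2, 6, 7, 10 this builds the levels {2, 6, 7, 10}, {12, 70}, {840}.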

  13. Motivation I
      Parallel Computation of the Minimal Elements of a Poset
      ◮ by Leiserson, Li, Moreno Maza, and Xie in 2010 [ELLMX10].
      ◮ This is a multithreaded (fork-join parallelism) approach which is divide-and-conquer, free of data races, and inspired by parallel merge sort.
      ◮ Its Cilk++ implementation shows nearly linear speed-up on 32-core processors for sufficiently large input data sets.
      This work led us to the design and implementation of parallel factor refinement algorithms.

  14. Implementation Challenges on Multicore Architectures

  15. Multithreaded Parallelism on Multicore Architectures I
      Multicore architectures
      ◮ A multicore processor is a single computing component with two or more independent and tightly coupled processors, called cores, sharing memory.
      ◮ The cores also share the same bus and memory controller; thus memory bandwidth may limit performance.
      ◮ In order to maintain memory consistency, synchronization is needed between cores, which may also limit performance.

  16. Multithreaded Parallelism on Multicore Architectures II
      Fork-join parallelism
      ◮ This model represents the execution of a multithreaded program as a set of nonblocking threads denoted by the vertices of a dag, where the dag edges indicate dependencies between instructions.
      ◮ Assuming unit cost of execution for all threads, the number of vertices of the dag is the work (= running time on a single core).
      ◮ The maximum length of a path from the root to a leaf is the span (= running time on infinitely many processors).
      ◮ The parallelism is the ratio of work to span (= average amount of work available along the span). A toy example follows this slide.
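To make work and span concrete, here is a toy divide-and-conquer sum in plain C++, with std::async standing in for spawn. This is a sketch only: a Cilk++ version would use the keywords of the next slides, and spawning OS threads this way is far more expensive than Cilk's work stealing.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

long long dnc_sum(const std::vector<long long>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo <= 1024)  // serial base case
        return std::accumulate(a.begin() + lo, a.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    // "Fork": the left half may run in parallel with the right half.
    auto left = std::async(std::launch::async, dnc_sum, std::cref(a), lo, mid);
    long long right = dnc_sum(a, mid, hi);
    // "Join": wait for the forked child before combining.
    return left.get() + right;
    // Work: Theta(n) dag vertices; span: Theta(log n); parallelism: n / log n.
}
```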

  17. The Ideal-Cache Model I
      [Figure 1: the ideal-cache model. A CPU performs W work against a small cache of Z words, organized in cache lines of L words each and managed with an optimal replacement strategy; the cache is backed by an arbitrarily large main memory, and Q counts the cache misses.]

  18. The Ideal-Cache Model II
      ◮ The processor can only refer to words that reside in the cache, a small and fast memory containing Z words organized in cache lines of L words each.
      ◮ If the line containing a referenced word is not in the cache, the corresponding line is brought in from the main memory; this is a cache miss. If the cache is full, a cache line must be evicted.
      ◮ Cache complexity analyzes algorithms in terms of cache misses; for instance, scanning an array of n contiguous words incurs Θ(n/L) cache misses.

  19. From Cilk to Cilk++ I
      The language
      ◮ Cilk (resp. Cilk++) is an extension of C (resp. C++) implementing fork-join parallelism with two keywords, spawn and sync.
      ◮ A Cilk (resp. Cilk++) program has the same semantics as its C (resp. C++) elision.
      Performance of the work-stealing scheduler
      In theory, Cilk's (resp. Cilk++'s) scheduler executes any Cilk (resp. Cilk++) computation in a nearly optimal time on p processors, provided that
      ◮ for almost all parallel steps, there are at least p units of work which can be run concurrently,
      ◮ each processor is either working or stealing work,
      ◮ each thread executes in unit time.
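The textbook illustration of these two keywords is the Fibonacci function below. Note that Cilk++ actually spells them cilk_spawn and cilk_sync (the spelling varies across Cilk dialects), and deleting the keywords yields exactly the serial C++ elision mentioned above.

```cpp
// Cilk++-style code: not plain C++, it needs a Cilk-enabled compiler.
long long fib(long long n) {
    if (n < 2) return n;
    long long x = cilk_spawn fib(n - 1);  // child may run in parallel
    long long y = fib(n - 2);             // parent continues working
    cilk_sync;                            // wait for the spawned child
    return x + y;
}
```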

  20. Parallelization overheads I
      Overheads and burden
      ◮ In practice, the observed speedup factor may be less (sometimes much less) than the theoretical parallelism.
      ◮ Many factors explain this: the simplifying assumptions of the fork-join parallelism model, architecture limitations, the cost of executing the parallel constructs, and scheduling overheads.
      Parallelism vs. burdened parallelism
      ◮ Cilkview is a performance analyzer which calculates the work, the span, and the parallelism of a given Cilk++ program run.
      ◮ Cilkview also estimates the running time T_p on p processors as T_p = T_1/p + 1.7 × burden_span, where burden_span is 15,000 instructions times the number of spawns along the span! For instance, a run with only 10 spawns along the span already adds an estimated 1.7 × 150,000 = 255,000 instructions to T_p.

  21. Contribution I
      ◮ Parallel algorithm based on the naive refinement principle [NOT GOOD for data locality and thus for parallelism on multicore architectures].
      ◮ Parallel algorithm based on the augment refinement principle [GOOD for data locality and parallelism].
      ◮ Parallel algorithm based on subproduct trees [MORE CHALLENGING for implementation on multicore architectures].
      Principle
      All are based on algorithms which are divide-and-conquer (d-n-c), multithreaded, and free of data races.

  22. Proposed Parallel Algorithms: A d-n-c illustration

  23. A d-n-c illustration I
      Input: 2, 6, 7, 10, 15, 21, 22, 26
      Expand (done in parallel):
        2, 6, 7, 10 | 15, 21, 22, 26
        2, 6 | 7, 10 | 15, 21 | 22, 26
        2 | 6 | 7 | 10 | 15 | 21 | 22 | 26
      Merge (done in parallel):
        2^1 | 6^1 | 7^1 | 10^1 | 15^1 | 21^1 | 22^1 | 26^1
        3^1, 2^2 | 7^1, 10^1 | 5^1, 7^1, 3^2 | 11^1, 13^1, 2^2
        3^1, 7^1, 5^1, 2^3 | 5^1, 7^1, 3^2, 11^1, 13^1, 2^2
      Output: 11^1, 13^1, 3^3, 7^2, 5^2, 2^5
      Figure 2: Example of algorithm execution. A compact code rendering of this pattern follows.
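Here is a serial C++ rendering of this divide-and-conquer pattern, reusing the Factor type and the naive_refine sketch given after slide 8; the recursive calls are exactly where a Cilk++ version would spawn.

```cpp
#include <cstddef>
#include <vector>

// Refine in[lo, hi): factor-refine each half recursively, then merge the
// two coprime factorizations by refining their concatenation.
std::vector<Factor> dnc_refine(const std::vector<long long>& in,
                               std::size_t lo, std::size_t hi) {
    if (hi - lo == 1) return {{in[lo], 1}};      // leaf: the pair (m, 1)
    std::size_t mid = lo + (hi - lo) / 2;
    auto left  = dnc_refine(in, lo, mid);        // cilk_spawn in parallel code
    auto right = dnc_refine(in, mid, hi);
    left.insert(left.end(), right.begin(), right.end());
    return naive_refine(std::move(left));        // merge step
}
// dnc_refine({2, 6, 7, 10, 15, 21, 22, 26}, 0, 8) yields
// {(11,1), (13,1), (3,3), (7,2), (5,2), (2,5)} up to order, as in Figure 2.
```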

  24. Proposed Parallel Algorithms: Parallel algorithms based on the naive refinement
