CS 240A: Divide-and-Conquer with Cilk++
• Divide & Conquer Paradigm
• Solving recurrences
• Sorting: Quicksort and Mergesort
Thanks to Charles E. Leiserson for some of these slides
Work and Span (Recap)
T_P = execution time on P processors
T_1 = work
T_∞ = span
∙ Speedup on P processors: T_1/T_P
∙ Potential parallelism: T_1/T_∞
Sorting
∙ Sorting is possibly the most frequently executed operation in computing!
∙ Quicksort is usually the fastest sorting algorithm in practice, with an average running time of O(N log N) but O(N^2) worst-case performance
∙ Mergesort has worst-case performance of O(N log N) for sorting N elements
∙ Both are based on the recursive divide-and-conquer paradigm
QUICKSORT
∙ Basic Quicksort sorting an array S works as follows:
  § If the number of elements in S is 0 or 1, then return.
  § Pick any element v in S. Call this the pivot.
  § Partition the set S-{v} into two disjoint groups:
    ♦ S_1 = {x ∈ S-{v} | x ≤ v}
    ♦ S_2 = {x ∈ S-{v} | x ≥ v}
  § Return quicksort(S_1) followed by v followed by quicksort(S_2)
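The steps above can be sketched almost literally in standard C++. This is only an illustration, not the course's implementation: the name `quicksort_copy` is hypothetical, the pivot is assumed to be the first element (the slide allows any element), and elements equal to the pivot are assumed to go to S_2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Value-returning quicksort that follows the slide's steps one by one.
// Assumptions (not from the slides): pivot = first element, ties go to S2.
template <typename T>
std::vector<T> quicksort_copy(const std::vector<T>& S) {
    if (S.size() <= 1) return S;               // 0 or 1 elements: nothing to do
    const T v = S.front();                     // pick the pivot v
    std::vector<T> S1, S2;
    for (std::size_t i = 1; i < S.size(); ++i) {   // partition S - {v}
        if (S[i] < v) S1.push_back(S[i]);
        else          S2.push_back(S[i]);
    }
    std::vector<T> out = quicksort_copy(S1);   // quicksort(S1)
    out.push_back(v);                          // followed by v
    const std::vector<T> right = quicksort_copy(S2);
    out.insert(out.end(), right.begin(), right.end());  // followed by quicksort(S2)
    return out;
}
```

Copying into fresh vectors keeps the correspondence with the set notation obvious; an in-place version would partition within the array instead.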
QUICKSORT
14 13 45 56 34 31 32 21 78
Select Pivot
14 13 45 56 34 31 32 21 78
QUICKSORT
14 13 45 56 34 31 32 21 78
Partition around Pivot
56 13 31 21 45 34 32 14 78
QUICKSORT
56 13 31 21 45 34 32 14 78
Quicksort recursively
13 14 21 31 32 34 45 56 78
Parallelizing Quicksort
∙ Serial Quicksort sorts an array S as follows:
  § If the number of elements in S is 0 or 1, then return.
  § Pick any element v in S. Call this the pivot.
  § Partition the set S-{v} into two disjoint groups:
    ♦ S_1 = {x ∈ S-{v} | x ≤ v}
    ♦ S_2 = {x ∈ S-{v} | x ≥ v}
  § Return quicksort(S_1) followed by v followed by quicksort(S_2)
Parallel Quicksort (Basic)
• The second recursive call to qsort does not depend on the results of the first recursive call
• We can therefore speed up the sort by making both calls in parallel.

    template <typename T>
    void qsort(T begin, T end) {
      if (begin != end) {
        // Move every element smaller than the pivot (*begin) to the front.
        T middle = partition(
            begin, end,
            bind2nd(less<typename iterator_traits<T>::value_type>(), *begin));
        cilk_spawn qsort(begin, middle);     // may run in parallel with the next call
        qsort(max(begin + 1, middle), end);  // skips the pivot when it is the minimum
        cilk_sync;                           // wait for the spawned call
      }
    }
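Removing `cilk_spawn` and `cilk_sync` from the Cilk++ code leaves an ordinary serial program (its serial elision), which is handy for checking correctness with any C++ compiler. The sketch below is that elision under two assumptions that are not in the slides: the function is renamed `qsort_serial` (a hypothetical name), and a lambda replaces `bind2nd`, which was removed from the standard library in C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Serial elision of the parallel qsort; expects random-access iterators.
template <typename It>
void qsort_serial(It begin, It end) {
    using Value = typename std::iterator_traits<It>::value_type;
    if (begin != end) {
        const Value pivot = *begin;  // copy: partition moves elements around
        It middle = std::partition(begin, end,
                                   [&pivot](const Value& x) { return x < pivot; });
        qsort_serial(begin, middle);                     // was: cilk_spawn qsort(...)
        qsort_serial(std::max(begin + 1, middle), end);  // second half, as on the slide
        // was: cilk_sync
    }
}
```

Because the two recursive calls touch disjoint ranges, putting the spawn back does not change the result, only the schedule.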
Performance
∙ ./qsort 500000 -cilk_set_worker_count 1 >> 0.083 seconds
∙ ./qsort 500000 -cilk_set_worker_count 16 >> 0.014 seconds
∙ Speedup = T_1/T_16 = 0.083/0.014 = 5.93
∙ ./qsort 50000000 -cilk_set_worker_count 1 >> 10.57 seconds
∙ ./qsort 50000000 -cilk_set_worker_count 16 >> 1.58 seconds
∙ Speedup = T_1/T_16 = 10.57/1.58 ≈ 6.7
Measure Work/Span Empirically

    workspan ws;
    ws.start();
    sample_qsort(a, a + n);
    ws.stop();
    ws.report(std::cout);

∙ cilkscreen -w ./qsort 50000000
    Work = 21593799861
    Span = 1261403043
    Burdened span = 1261600249
    Parallelism = 17.1189
    Burdened parallelism = 17.1162
    #Spawn = 50000000
    #Atomic instructions = 14
∙ cilkscreen -w ./qsort 500000
    Work = 178835973
    Span = 14378443
    Burdened span = 14525767
    Parallelism = 12.4378
    Burdened parallelism = 12.3116
    #Spawn = 500000
    #Atomic instructions = 8
Analyzing Quicksort
56 13 31 21 45 34 32 14 78
Quicksort recursively
13 14 21 31 32 34 45 56 78
Assume we have a “great” partitioner that always generates two balanced sets
Analyzing Quicksort
∙ Work:
    T_1(n) = 2 T_1(n/2) + Θ(n)
    2 T_1(n/2) = 4 T_1(n/4) + 2 Θ(n/2)
    ⋮
    (n/2) T_1(2) = n T_1(1) + (n/2) Θ(2)
    Adding up all the lines: T_1(n) = Θ(n lg n)
∙ Span recurrence: T_∞(n) = T_∞(n/2) + Θ(n), which solves to T_∞(n) = Θ(n)
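Both results can be checked by unrolling the recurrences level by level. The derivation below is a sketch under simplifying assumptions that are not stated on the slide: n is a power of 2, and the Θ(n) partition cost is taken to be exactly cn for some constant c > 0.

```latex
% Assumptions: n = 2^k, partition cost on m elements is exactly c m.
\begin{align*}
T_1(n) &= \sum_{i=0}^{\lg n - 1} 2^i \cdot c\,\frac{n}{2^i} + n\,T_1(1)
        = c\,n \lg n + \Theta(n)
        = \Theta(n \lg n), \\
T_\infty(n) &= \sum_{i=0}^{\lg n - 1} c\,\frac{n}{2^i} + T_\infty(1)
        = 2c\,(n - 1) + \Theta(1)
        = \Theta(n).
\end{align*}
```

Every level of the recursion contributes cn to the work, so the lg n levels add up; the span follows only one branch per level, so its costs form a geometric series dominated by the top-level partition.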
Analyzing Quicksort
∙ Parallelism: T_1(n)/T_∞(n) = Θ(lg n). Not much!
∙ However, partitioning (i.e., constructing the array S_1 = {x ∈ S-{v} | x ≤ v}) can be accomplished in parallel in time Θ(lg n)
∙ Which gives a span T_∞(n) = Θ(lg^2 n)
∙ And parallelism Θ(n/lg n). Way better!
∙ Basic parallel qsort can be found in CLRS
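The improved span comes from replacing the Θ(n) partition term in the span recurrence with Θ(lg n). A sketch of the unrolling, assuming n is a power of 2, balanced partitions, and a partition span of exactly c lg m on m elements (c is a hypothetical constant):

```latex
% Assumptions: n = 2^k, balanced splits, parallel partition span = c lg m.
\begin{align*}
T_\infty(n) &= T_\infty(n/2) + c \lg n \\
            &= \sum_{i=0}^{\lg n - 1} c \lg\frac{n}{2^i} + T_\infty(1)
             = c \sum_{i=0}^{\lg n - 1} (\lg n - i) + \Theta(1) \\
            &= c\,\frac{\lg n\,(\lg n + 1)}{2} + \Theta(1)
             = \Theta(\lg^2 n), \\
\frac{T_1(n)}{T_\infty(n)} &= \frac{\Theta(n \lg n)}{\Theta(\lg^2 n)}
             = \Theta\!\left(\frac{n}{\lg n}\right).
\end{align*}
```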
The Master Method
The Master Method for solving recurrences applies to recurrences of the form*
    T(n) = a T(n/b) + f(n),
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^{log_b a} with f(n).
* The unstated base case is T(n) = Θ(1) for sufficiently small n.
Master Method: CASE 1
T(n) = a T(n/b) + f(n)
n^{log_b a} ≫ f(n)
Specifically, f(n) = O(n^{log_b a - ε}) for some constant ε > 0.
Solution: T(n) = Θ(n^{log_b a}).
E.g., matrix multiplication: a = 8, b = 2, f(n) = n^2 ⇒ T_1(n) = Θ(n^3)
Master Method: CASE 2
T(n) = a T(n/b) + f(n)
n^{log_b a} ≈ f(n)
Specifically, f(n) = Θ(n^{log_b a} lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^{log_b a} lg^{k+1} n).
E.g., qsort: a = 2, b = 2, k = 0 ⇒ T_1(n) = Θ(n lg n)
Master Method: CASE 3
T(n) = a T(n/b) + f(n)
n^{log_b a} ≪ f(n)
Specifically, f(n) = Ω(n^{log_b a + ε}) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
E.g., span of qsort: a = 1, b = 2, f(n) = Θ(n) ⇒ T_∞(n) = Θ(n)
Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^{log_b a - ε}), constant ε > 0
    ⇒ T(n) = Θ(n^{log_b a}).
CASE 2: f(n) = Θ(n^{log_b a} lg^k n), constant k ≥ 0
    ⇒ T(n) = Θ(n^{log_b a} lg^{k+1} n).
CASE 3: f(n) = Ω(n^{log_b a + ε}), constant ε > 0, and regularity condition
    ⇒ T(n) = Θ(f(n)).
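For the common special case where the driving function has the form f(n) = Θ(n^d lg^k n), picking the case reduces to comparing d with the critical exponent log_b a. The helper below is a hypothetical sketch of that comparison (`master_case` and `MasterCase` are illustrative names, not part of any library). It assumes the polynomial-times-log form, for which the regularity condition of Case 3 holds automatically, and it uses a small tolerance because log_b a is computed in floating point. The exponent k does not affect which case applies, so it is not a parameter.

```cpp
#include <cassert>
#include <cmath>

enum class MasterCase { One = 1, Two = 2, Three = 3 };

// Classify T(n) = a T(n/b) + Theta(n^d lg^k n), with a >= 1, b > 1, k >= 0.
// Assumption: f(n) has exactly this polynomial-times-log form.
MasterCase master_case(double a, double b, double d) {
    const double critical = std::log(a) / std::log(b);  // log_b a
    const double tol = 1e-9;                            // floating-point slack
    if (d < critical - tol) return MasterCase::One;     // f grows polynomially slower
    if (d > critical + tol) return MasterCase::Three;   // f grows polynomially faster
    return MasterCase::Two;                             // same polynomial order
}
```

For instance, the examples from the slides map to `master_case(8, 2, 2)` for matrix multiplication, `master_case(2, 2, 1)` for the work of qsort, and `master_case(1, 2, 1)` for its span.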
MERGESORT
∙ Mergesort is an example of a recursive sorting algorithm.
∙ It is based on the divide-and-conquer paradigm
∙ It uses the merge operation as its fundamental component (which takes in two sorted sequences and produces a single sorted sequence)
∙ Simulation of Mergesort
∙ Drawback of mergesort: Not in-place (uses an extra temporary array)
Merging Two Sorted Arrays

    template <typename T>
    void Merge(T *C, T *A, T *B, int na, int nb) {
      // Repeatedly copy the smaller front element of A or B into C.
      while (na>0 && nb>0) {
        if (*A <= *B) {
          *C++ = *A++; na--;
        } else {
          *C++ = *B++; nb--;
        }
      }
      // One input is exhausted: copy whatever remains of the other.
      while (na>0) {
        *C++ = *A++; na--;
      }
      while (nb>0) {
        *C++ = *B++; nb--;
      }
    }

Time to merge n elements = Θ(n).
3 12 19 46
4 14 21 23
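A small driver makes the behavior concrete. The sketch below repeats the slide's `Merge` so that it compiles on its own and applies it to the two example arrays; the wrapper `merge_demo` is a hypothetical name introduced only for this illustration.

```cpp
#include <cassert>
#include <vector>

// Merge from the slide: merges sorted A[0..na) and B[0..nb) into C.
template <typename T>
void Merge(T *C, T *A, T *B, int na, int nb) {
    while (na > 0 && nb > 0) {
        if (*A <= *B) { *C++ = *A++; na--; }
        else          { *C++ = *B++; nb--; }
    }
    while (na > 0) { *C++ = *A++; na--; }
    while (nb > 0) { *C++ = *B++; nb--; }
}

// Hypothetical driver: merges the slide's two example arrays.
std::vector<int> merge_demo() {
    int A[] = {3, 12, 19, 46};
    int B[] = {4, 14, 21, 23};
    std::vector<int> C(8);          // output must hold na + nb elements
    Merge(C.data(), A, B, 4, 4);
    return C;                       // 3 4 12 14 19 21 23 46
}
```

Each loop iteration writes exactly one output element, which is where the Θ(n) bound comes from.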
Parallel Merge Sort
A: input (unsorted)
B: output (sorted)
C: temporary

    template <typename T>
    void MergeSort(T *B, T *A, int n) {
      if (n==1) {
        B[0] = A[0];
      } else {
        T* C = new T[n];
        cilk_spawn MergeSort(C, A, n/2);
        MergeSort(C+n/2, A+n/2, n-n/2);
        cilk_sync;
        Merge(B, C, C+n/2, n/2, n-n/2);
        delete[] C;
      }
    }

    3 4 12 14 19 21 33 46
            merge
    3 12 19 46 | 4 14 21 33
            merge
    3 19 | 12 46 | 4 33 | 14 21
            merge
    19 | 3 | 12 | 46 | 33 | 4 | 21 | 14
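The recursion can be exercised without a Cilk++ compiler by dropping the `cilk_spawn` and `cilk_sync` keywords. The sketch below does that with three changes that are assumptions of this example, not part of the slide: the function is renamed `MergeSortSerial`, a `std::vector` replaces the manual `new[]`/`delete[]`, and `std::merge` stands in for the slide's `Merge`. It also returns early for n ≤ 0, since the slide's version only handles n ≥ 1.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial sketch: sorts A[0..n) into B[0..n) using a temporary C.
template <typename T>
void MergeSortSerial(T *B, const T *A, int n) {
    if (n <= 0) return;                  // guard added for this sketch
    if (n == 1) { B[0] = A[0]; return; }
    std::vector<T> C(n);                 // temporary, released automatically
    MergeSortSerial(C.data(), A, n / 2);                      // cilk_spawn on the slide
    MergeSortSerial(C.data() + n / 2, A + n / 2, n - n / 2);
    // cilk_sync on the slide: both halves of C are sorted from here on
    std::merge(C.begin(), C.begin() + n / 2,                  // left sorted half
               C.begin() + n / 2, C.end(),                    // right sorted half
               B);
}
```

Run on the bottom row of the diagram (19 3 12 46 33 4 21 14), it produces the top row.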
Work of Merge Sort

    template <typename T>
    void MergeSort(T *B, T *A, int n) {
      if (n==1) {
        B[0] = A[0];
      } else {
        T* C = new T[n];
        cilk_spawn MergeSort(C, A, n/2);
        MergeSort(C+n/2, A+n/2, n-n/2);
        cilk_sync;
        Merge(B, C, C+n/2, n/2, n-n/2);
        delete[] C;
      }
    }

CASE 2: n^{log_b a} = n^{log_2 2} = n, f(n) = Θ(n^{log_b a} lg^0 n)
Work: T_1(n) = 2 T_1(n/2) + Θ(n) = Θ(n lg n)
Span of Merge Sort

    template <typename T>
    void MergeSort(T *B, T *A, int n) {
      if (n==1) {
        B[0] = A[0];
      } else {
        T* C = new T[n];
        cilk_spawn MergeSort(C, A, n/2);
        MergeSort(C+n/2, A+n/2, n-n/2);
        cilk_sync;
        Merge(B, C, C+n/2, n/2, n-n/2);
        delete[] C;
      }
    }

CASE 3: n^{log_b a} = n^{log_2 1} = 1, f(n) = Θ(n)
Span: T_∞(n) = T_∞(n/2) + Θ(n) = Θ(n)
Parallelism of Merge Sort
Work: T_1(n) = Θ(n lg n)
Span: T_∞(n) = Θ(n)
Parallelism: T_1(n)/T_∞(n) = Θ(lg n)
We need to parallelize the merge!
Parallel Merge
[Figure: array A (indices 0 to na, split at ma = na/2) above array B (indices 0 to nb, split at mb-1 | mb by a binary search for A[ma]), with na ≥ nb. In both arrays the left part holds elements ≤ A[ma] and the right part holds elements ≥ A[ma]. One recursive P_Merge combines the two left parts and another combines the two right parts. Each recursive merge throws away at least na/2 ≥ n/4 elements.]
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4) n.
Parallel Merge

    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
      if (na < nb) {
        P_Merge(C, B, A, nb, na);   // make A the larger array
      } else if (na==0) {
        return;
      } else {
        int ma = na/2;
        int mb = BinarySearch(A[ma], B, nb);
        C[ma+mb] = A[ma];           // A[ma] lands at its final position
        cilk_spawn P_Merge(C, A, B, ma, mb);
        P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
        cilk_sync;
      }
    }

Coarsen base cases for efficiency.
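The slide does not show `BinarySearch`. The sketch below assumes it returns the number of elements of B that are smaller than the key, which is what `std::lower_bound` computes, and pairs it with the serial elision of `P_Merge` (renamed `P_MergeSerial`, a hypothetical name) so the index arithmetic can be tested with a plain C++ compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed contract: index mb with B[0..mb) < key <= B[mb..nb).
template <typename T>
int BinarySearch(const T& key, const T *B, int nb) {
    return static_cast<int>(std::lower_bound(B, B + nb, key) - B);
}

// Serial elision of P_Merge: merges sorted A[0..na) and B[0..nb) into C.
template <typename T>
void P_MergeSerial(T *C, const T *A, const T *B, int na, int nb) {
    if (na < nb) {
        P_MergeSerial(C, B, A, nb, na);        // make A the larger array
    } else if (na == 0) {
        return;                                // both arrays are empty
    } else {
        int ma = na / 2;                       // median position of A
        int mb = BinarySearch(A[ma], B, nb);   // split point of B
        C[ma + mb] = A[ma];                    // ma + mb elements precede A[ma]
        P_MergeSerial(C, A, B, ma, mb);        // cilk_spawn on the slide
        P_MergeSerial(C + ma + mb + 1, A + ma + 1, B + mb,
                      na - ma - 1, nb - mb);
        // cilk_sync on the slide
    }
}
```

The two recursive calls write to disjoint parts of C (below and above index ma + mb), which is why the first one can safely be spawned.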
Span of Parallel Merge

    template <typename T>
    void P_Merge(T *C, T *A, T *B, int na, int nb) {
      if (na < nb) {
        ⋮
        int mb = BinarySearch(A[ma], B, nb);
        C[ma+mb] = A[ma];
        cilk_spawn P_Merge(C, A, B, ma, mb);
        P_Merge(C+ma+mb+1, A+ma+1, B+mb, na-ma-1, nb-mb);
        cilk_sync;
      }
    }

CASE 2: n^{log_b a} = n^{log_{4/3} 1} = 1, f(n) = Θ(n^{log_b a} lg^1 n)
Span: T_∞(n) = T_∞(3n/4) + Θ(lg n) = Θ(lg^2 n)