Parallel Algorithms and Implementations
CS260 – Algorithmic Engineering
Yihan Sun
* Some of the slides are from MIT 6.712, 6.886 and CMU 15-853.
Last Lecture
• Scheduler
  • Helps you map your parallel tasks to processors
• Fork-join
  • Fork: create several tasks that will be run in parallel
  • Join: after all forked threads finish, synchronize them
• Work-span
  • Work: total number of operations (the sequential complexity)
  • Span (depth): the longest chain in the dependence graph
  • A computation with work W and span S can be scheduled on p processors in time O(W/p + S)
Last Lecture
• Write C++ code in parallel using Cilk

Pseudocode:
  reduce(A, n) {
    if (n == 1) return A[0];
    In parallel:
      L = reduce(A, n/2);
      R = reduce(A + n/2, n - n/2);
    return L + R;
  }

Cilk code:
  int reduce(int* A, int n) {
    if (n == 1) return A[0];
    int L, R;
    L = cilk_spawn reduce(A, n/2);
    R = reduce(A + n/2, n - n/2);
    cilk_sync;
    return L + R;
  }
Last Lecture
• Reduce/scan algorithms
  • Divide-and-conquer or blocking
• Coarsening
  • Avoid the overhead of fork-join
  • Make each subtask large enough
Concurrency & Atomic primitives
Concurrency
• When two threads access one memory location at the same time
• Whenever it is possible for two threads to access the same memory location, we need to consider concurrency
  • Usually we only care when at least one of the accesses is a write
  • Races – will be introduced later in the course
• Parallelism ≠ concurrency
  • In the reduce/scan algorithms we just saw, no concurrency occurs (not even concurrent reads are needed)
Concurrency
• The most important principle in dealing with concurrency is correctness
  • Does the program still give the expected output even when concurrency occurs?
• The second consideration is performance
  • Handling concurrency usually slows your algorithm down
  • The system needs to guarantee some correctness – this results in much overhead
Concurrency
• Correctness is the first consideration!

A joke for you to understand this:
  Alice: I can compute multiplication very fast.
  Bob: Really? What is 843342 × 3424?
  Alice: 20.
  Bob: What? That's not correct!
  Alice: Wasn't that fast?

• Sometimes concurrency is inevitable
  • Solution 1: Locks – usually safe, but slow
  • Solution 2: Atomic primitives
    • Supported by most systems
    • Need careful design
Atomic primitives
• Compare-and-swap (CAS)
  • bool CAS(value* p, value vold, value vnew): compare the value stored at pointer p with vold; if they are equal, change p's value to vnew and return true. Otherwise do nothing and return false.
• Test-and-set (TAS)
  • bool TAS(bool* p): if the Boolean value stored at p is false, set it to true and return true. Otherwise, return false.
• Fetch-and-add (FAA)
  • integer FAA(integer* p, integer x): add x to p's value, and return the old value.
• Priority-write (PW)
  • integer PW(integer* p, integer x): write x to p if and only if x is smaller than the current value at p.
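For reference, these primitives map naturally onto C++'s std::atomic. Below is a minimal sketch, assuming int-valued locations; priority-write has no direct standard equivalent, so it is built from a CAS loop here:

  #include <atomic>

  // CAS: compare_exchange_strong takes the expected value by reference
  // and overwrites it on failure, so we pass a local copy.
  bool cas(std::atomic<int>* p, int vold, int vnew) {
    return p->compare_exchange_strong(vold, vnew);
  }

  // TAS: exchange returns the previous value; we succeed only if it
  // was false before we set it to true.
  bool tas(std::atomic<bool>* p) {
    return p->exchange(true) == false;
  }

  // FAA: fetch_add returns the old value, matching the spec above.
  int faa(std::atomic<int>* p, int x) {
    return p->fetch_add(x);
  }

  // Priority-write: keep trying to install x while it is smaller than
  // the current value; compare_exchange_weak refreshes cur on failure.
  void priority_write(std::atomic<int>* p, int x) {
    int cur = p->load();
    while (x < cur && !p->compare_exchange_weak(cur, x)) {}
  }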
Use Atomic Primitives
• Fetch-and-add (FAA): integer FAA(integer* p, integer x): add x to p's value, and return the old value
  • Multiple threads want to add a value to a shared variable
  • Multiple threads want to obtain a globally sequentialized order

Racy version (shared variable sum = 5; P1 calls Add(3), P2 calls Add(4)):
  void Add(x) {
    temp = sum;      // both threads may read 5
    sum = temp + x;  // one writes 8, the other writes 9
  }
  // Here sum ends as 8 (but should be 12)

The shorthand has the same problem:
  void Add(x) { sum = sum + x; }

Fix with FAA:
  void Add(x) { FAA(&sum, x); }

Getting unique IDs from a shared variable count:
  int get_id() { return FAA(&count, 1); }
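A runnable C++ sketch of the FAA fix; the two-thread setup mirrors the P1/P2 example above:

  #include <atomic>
  #include <iostream>
  #include <thread>

  std::atomic<int> sum{5};

  void Add(int x) {
    sum.fetch_add(x);  // one atomic read-modify-write: no lost update
  }

  int main() {
    std::thread p1(Add, 3), p2(Add, 4);
    p1.join();
    p2.join();
    std::cout << sum << std::endl;  // always 12, under any interleaving
    return 0;
  }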
Use Atomic Primitives
• Compare-and-swap (CAS)
  • Multiple threads want to add to the head of a linked list

  struct node {
    value_type value;
    node* next;
  };
  shared variable node* head;

Unsafe version (two threads X1 and X2 can both read the same head, so one insertion is lost):
  void insert(node* x) {
    x->next = head;
    head = x;
  }

CAS version (retry whenever head changed between reading it and swinging it to x):
  void insert(node* x) {
    node* old_head = head;
    x->next = old_head;
    while (!CAS(&head, old_head, x)) {
      old_head = head;
      x->next = old_head;
    }
  }
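In C++ the CAS loop above can be written with std::atomic<node*>. This is a sketch (value_type fixed to int); compare_exchange_weak reloads old_head with the current head on failure, so the retry is implicit:

  #include <atomic>

  struct node {
    int value;
    node* next;
  };

  std::atomic<node*> head{nullptr};

  void insert(node* x) {
    node* old_head = head.load();
    do {
      x->next = old_head;  // link x in front of the head we saw
    } while (!head.compare_exchange_weak(old_head, x));
    // on failure, old_head now holds the current head, and we retry
  }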
Concurrency – rule of thumb
• Do not use concurrency, algorithmically
• If you have to (with the guarantee of correctness):
  • Do not use concurrent writes
• If you have to (with the guarantee of correctness):
  • Do not use locks; use atomic primitives (still, with the guarantee of correctness)
Filtering/packing
Parallel filtering / packing
• Given an array A of elements and a predicate function f, output an array B containing the elements of A that satisfy f

  f(x) = true,  if x is odd
         false, if x is even

  A = 4 2 9 3 6 5 7 11 10 8
  B = 9 3 5 7 11
Parallel filtering / packing
• How can we know the length of B in parallel?
  • Count the number of elements satisfying f (shown in red on the slide) – a parallel reduce
  • O(n) work and O(log n) depth

  A     = 4 2 9 3 6 5 7 11 10 8
  flags = 0 0 1 1 0 1 1 1  0  0
Parallel filtering / packing
• How can we know where 9 should go?
  • 9 is the first element satisfying f, 3 is the second, … : the prefix sum of the flags gives each kept element its output index

  Filter(A, n, B, f) {
    new array flag[n], ps[n];
    parallel_for (i = 1 to n) {
      flag[i] = f(A[i]);
    }
    ps = scan(flag, n);
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1])
        B[ps[i]] = A[i];
    }
  }

  A                   = 4 2 9 3 6 5 7 11 10 8
  flags of A          = 0 0 1 1 0 1 1 1  0  0
  prefix sum of flags = 0 0 1 2 2 3 4 5  5  5
  B (indices 1 to 5)  = 9 3 5 7 11
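A C++ sketch of Filter, assuming OpenCilk's cilk_for for the parallel loops (a plain for also works). For brevity the prefix sum is sequential here, where the lecture's version uses the parallel scan, and indices are 0-based rather than the pseudocode's 1-based:

  #include <cilk/cilk.h>
  #include <vector>

  // Writes the elements of A[0..n) satisfying pred into B, in order,
  // and returns how many elements were kept.
  template <typename T, typename Pred>
  int filter(const T* A, int n, T* B, Pred pred) {
    std::vector<int> ps(n + 1, 0);
    cilk_for (int i = 0; i < n; i++)   // compute flags in parallel
      ps[i + 1] = pred(A[i]) ? 1 : 0;
    for (int i = 1; i <= n; i++)       // sequential prefix sum
      ps[i] += ps[i - 1];
    cilk_for (int i = 0; i < n; i++)   // scatter kept elements
      if (ps[i + 1] != ps[i]) B[ps[i]] = A[i];
    return ps[n];
  }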
Application of filter: partition in quicksort
• For an array A, move the elements of A smaller than k to the left and those larger than k to the right

  A = 6 2 9 4 1 3 5 8 7 0
  Partition by 6
  Possible output: 2 4 1 3 5 0 6 9 8 7

• In general, the dividing criterion can be any predicate
Using filter for partition
(Looking at the left part as an example, using 6 as the pivot)

  Partition(A, n, k, B) {
    new array flag[n], ps[n];
    parallel_for (i = 1 to n) {
      flag[i] = (A[i] < k);
    }
    ps = scan(flag, n);
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1])
        B[ps[i]] = A[i];
    }
  }

  A                  = 6 2 9 4 1 3 5 8 7 0
  flag               = 0 1 0 1 1 1 1 0 0 1
  prefix sum of flag = 0 1 1 2 3 4 5 5 5 6
  packed             = 2 4 1 3 5 0

Can we avoid using too much extra space?
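One simple way to complete the partition in code (a sketch, not necessarily the lecture's exact scheme) is to run the filter sketch above twice, once per side of the pivot:

  // Partition A[0..n) into B: elements < k first, the rest after.
  // Reuses the filter() sketch from the previous slide's notes.
  template <typename T>
  void partition(const T* A, int n, T k, T* B) {
    int left = filter(A, n, B, [k](const T& x) { return x < k; });
    filter(A, n, B + left, [k](const T& x) { return !(x < k); });
  }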
Implementation trick: delayed sequence
Delayed sequence
• A sequence is a function, so it does not need to be stored
  • It maps an index (subscript) to a value
  • Saves space!
Delayed sequence
• A sequence is a function, so it does not need to be stored
• Saves space

Array version (running time: about 0.16s for n = 10^9, with coarsening):
  int reduce(int* A, int n) {
    if (n == 1) return A[0];
    int L, R;
    L = cilk_spawn reduce(A, n/2);
    R = reduce(A + n/2, n - n/2);
    cilk_sync;
    return L + R;
  }

  int main() {
    cin >> n;
    parallel_for (int i = 0; i < n; i++)
      A[i] = i;
    cout << reduce(A, n) << endl;
  }

Delayed-sequence version (running time: about 0.19s for n = 10^9, with coarsening):
  inline int get_val(int i) { return i; }

  int reduce(int start, int n, function f) {
    if (n == 1) return f(start);
    int L, R;
    L = cilk_spawn reduce(start, n/2, f);
    R = reduce(start + n/2, n - n/2, f);
    cilk_sync;
    return L + R;
  }

  int main() {
    cin >> n;
    cout << reduce(0, n, get_val) << endl;
  }
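The idea can be packaged as a small reusable type. A minimal sketch, with names of our own choosing (libraries such as ParlayLib provide a similar delayed sequence):

  #include <cstddef>

  // A "sequence" represented by a length and an index->value function;
  // element i is computed on demand instead of being stored.
  template <typename F>
  struct delayed_seq {
    std::size_t n;
    F f;
    auto operator[](std::size_t i) const { return f(i); }
    std::size_t size() const { return n; }
  };

  template <typename F>
  delayed_seq<F> make_delayed_seq(std::size_t n, F f) {
    return delayed_seq<F>{n, f};
  }

  // Usage: the sequence 0, 1, ..., n-1 in O(1) space:
  //   auto ids = make_delayed_seq(n, [](std::size_t i) { return (long long)i; });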
Partition without the flag array

Old version:
  Partition(A, n, k, B) {
    new array flag[n], ps[n];
    parallel_for (i = 1 to n) {
      flag[i] = (A[i] < k);
    }
    ps = scan(flag, n);
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1])
        B[ps[i]] = A[i];
    }
  }

New version:
  Partition(A, n, k, B) {
    new array ps[n];
    ps = scan(0, n,
      [&](int i) { return (A[i] < k); });
    parallel_for (i = 1 to n) {
      if (ps[i] != ps[i-1])
        B[ps[i]] = A[i];
    }
  }

Equivalent to having an array flag[i] = (A[i] < k), but without explicitly storing it.
(We can also get rid of the ps[] array, but that makes the program a bit more complicated.)
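A sketch of a scan that accepts the generator function directly (sequential and inclusive here for brevity; scan_delayed is our name, not the lecture's API):

  // Inclusive scan over the implicit sequence g(0), ..., g(n-1);
  // the flag array is never materialized.
  template <typename F>
  void scan_delayed(int n, F g, int* ps) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
      sum += g(i);   // g(i) plays the role of flag[i]
      ps[i] = sum;
    }
  }

  // Usage inside Partition:
  //   scan_delayed(n, [&](int i) { return A[i] < k ? 1 : 0; }, ps);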
Implementation trick: nested/granular/blocked parallel for-loops