Parallel Functional Arrays Ananya Kumar, Guy Blelloch, Robert Harper Carnegie Mellon University
Goals • Functional arrays • Efficient (constant time) • Parallel • Well-defined cost semantics
Previous Work – Monads • Thread mutable state • Enforce single reference to array • Need completely different code • Not parallel
Previous Work – Specialized Type System • Enforce single-threadedness of arrays • Not available in most languages • Hard to reason about
Previous Work – Reference Counting • Check reference counts • If one, update in place, else copy • Depends on compiler • Hard to reason about
Sequences • A = NEW(5, 0): 0 0 0 0 0 • B = SET(A, 0, 3): 3 0 0 0 0 • C = SET(A, 2, 3): 0 0 3 0 0 • D = SET(C, 4, 14): 0 0 3 0 14 • E = SET(D, 1, 11): 0 11 3 0 14
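The intended meaning of NEW, SET, and GET on the Sequences slide can be pinned down with a reference model that copies on every SET. This is only a specification sketch, not the constant-time implementation the talk is about; the names `new`, `get`, and `set_` are illustrative.

```python
from typing import Tuple

Seq = Tuple[int, ...]  # an immutable sequence of values


def new(n: int, v: int) -> Seq:
    """NEW(n, v): a sequence of n copies of v."""
    return (v,) * n


def get(a: Seq, i: int) -> int:
    """GET(a, i): the value at index i."""
    return a[i]


def set_(a: Seq, i: int, v: int) -> Seq:
    """SET(a, i, v): a new sequence; `a` itself is unchanged.

    This reference model copies the whole sequence, so it costs O(N).
    """
    return a[:i] + (v,) + a[i + 1:]


A = new(5, 0)
B = set_(A, 0, 3)
C = set_(A, 2, 3)   # A is still all zeros, so C differs from B
D = set_(C, 4, 14)
E = set_(D, 1, 11)
```

Every sequence stays readable after later SETs, which is the behaviour an efficient implementation has to preserve.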
Previous Work • N = size of array • Dietz – O(log log N) per operation • Trailer arrays – O(1) for leaves • Improvements by Chuang, O'Neill • No support for concurrency
Our Approach • Functional • Efficient – O(1) for leaves, fast for interior • Parallel – wait-free • Well-defined cost semantics
Sequence Implementation • Figure: array values 0 11 3 0 14 • Labels: C 2, D 3, E 4
Main Sections • Cost dynamics • Concurrent implementation
Fork-Join Parallelism • Expression: (1+2) || (3+4) • Fork: 1+2 and 3+4 are evaluated in parallel • Branch results: 3 and 7 • Join: (3, 7)
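The fork-join step on this slide can be sketched with Python's standard thread pool. The helper name `par` is an assumption chosen to mirror the `||` operator; the talk does not prescribe any particular runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def par(f, g):
    """Fork two computations, then join and return both results as a pair."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(f)    # fork the left branch
        right = pool.submit(g)   # fork the right branch
        return (left.result(), right.result())  # join


result = par(lambda: 1 + 2, lambda: 3 + 4)
```

The pair is only available once both branches have finished, which is the join point in the cost model.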
Work and Span • Cost tree with node costs N, log(N), 1, 1, 1, 1 • Work: size of cost tree = N + log(N) + 4 • Span: depth of cost tree = N + log(N) + 2
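Work and span can be computed by a simple recursion over a cost tree. The sketch below assumes one hypothetical tree shape that is consistent with the totals on the slide (the slide's actual drawing is not recoverable from the text); the class names `Leaf`, `Serial`, and `Parallel` are illustrative.

```python
import math
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    cost: int


@dataclass(frozen=True)
class Serial:
    parts: Tuple["Cost", ...]   # run one after another


@dataclass(frozen=True)
class Parallel:
    parts: Tuple["Cost", ...]   # run side by side


Cost = Union[Leaf, Serial, Parallel]


def work(c: Cost) -> int:
    """Total cost of all nodes (size of the cost tree)."""
    if isinstance(c, Leaf):
        return c.cost
    return sum(work(p) for p in c.parts)


def span(c: Cost) -> int:
    """Cost of the longest dependency chain (depth of the cost tree)."""
    if isinstance(c, Leaf):
        return c.cost
    if isinstance(c, Serial):
        return sum(span(p) for p in c.parts)
    return max(span(p) for p in c.parts)


N = 8
LOG_N = int(math.log2(N))  # 3

# Assumed shape: fork (1), then [N followed by log N] in parallel
# with [1 followed by 1], then join (1).
tree = Serial((
    Leaf(1),
    Parallel((
        Serial((Leaf(N), Leaf(LOG_N))),
        Serial((Leaf(1), Leaf(1))),
    )),
    Leaf(1),
))
```

With N = 8 this gives work N + log(N) + 4 = 15 and span N + log(N) + 2 = 13.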
Scheduling Theorems • Work + Span gives execution cost on a P-processor machine • Goal: evaluate cost of using sequences on a P-processor machine • Sufficient to evaluate work and span
Parallel Structural Dynamics • Cost of running program with ∞ processors • Deterministic
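The slide does not say which scheduling theorem it relies on. As an illustration, the sketch below assumes the standard greedy-scheduler (Brent-style) bound, under which running time on P processors is at most work / P + span. The function name is hypothetical.

```python
def greedy_time_bound(work: float, span: float, processors: int) -> float:
    """Upper bound on running time under a greedy scheduler: W / P + S.

    Assumption: the scheduler never leaves a processor idle while work
    is ready; the talk itself does not name the scheduler.
    """
    if processors < 1:
        raise ValueError("need at least one processor")
    return work / processors + span


one_proc = greedy_time_bound(15, 13, 1)
four_procs = greedy_time_bound(1000, 10, 4)
```

This is why bounding work and span is enough: the bound for any P follows from those two numbers alone.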
Interleaved Structural Dynamics • Cost of running program with 1 processor • Non-deterministic • Store which sequences are interior and leaf
Work = Non-Deterministic A (leaf), size N GET GET GET SET Join
Work (Good Interleaving) • A (leaf), size N • Operations before the Join: GET, GET, GET, SET • Current work at each of the four steps: 1, 1, 1, 1 • Total work after each step: 1, 2, 3, 4
Work (Bad Interleaving) • A (leaf), size N • Operations before the Join: GET, GET, GET, SET • Current work at each of the four steps: 1, 1, log(N), log(N) • Total work after each step: 1, 2, 2 + log(N), 2 + 2log(N)
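The difference between the two interleavings can be reproduced with a small cost simulation. It assumes the cost model suggested by the slides: operations on a leaf cost 1, a GET on an interior sequence costs log(N), and a SET turns the old sequence from leaf into interior. The function name and the specific orderings are illustrative.

```python
import math


def total_work(ops, n):
    """Total cost of running `ops` in the given order on one sequence A.

    Assumed model: A starts as a leaf; GET costs 1 on a leaf and log2(n)
    on an interior sequence; SET on a leaf costs 1 and makes A interior.
    A SET on an interior sequence is not modelled here.
    """
    log_n = max(1, int(math.log2(n)))
    is_leaf = True
    total = 0
    for op in ops:
        if op == "GET":
            total += 1 if is_leaf else log_n
        elif op == "SET":
            if not is_leaf:
                raise ValueError("SET on an interior sequence is not modelled")
            total += 1
            is_leaf = False
        else:
            raise ValueError(f"unknown operation: {op}")
    return total


N = 1024  # log2(N) = 10
good = total_work(["GET", "GET", "GET", "SET"], N)  # SET runs last
bad = total_work(["GET", "SET", "GET", "GET"], N)   # SET runs before two GETs
```

With N = 1024 the good ordering costs 4 and the bad ordering costs 2 + 2log(N) = 22, matching the running totals on the slides.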
GET-GET Case A (leaf), size N GET GET GET GET Join
SET-GET Case A (leaf), size N GET GET GET SET Join
SET-SET Case A (leaf), size N SET GET GET SET Join
Upper Bounding Work • Deterministic evaluational dynamics • Store which sequences are leaf and interior • Store the number of “cheap” (cost = 1) GETs on each sequence • At the join, if sequence was modified on one side, make the GETs expensive (cost = log(N))
Upper Bounding Work • Showed that upper bounds are valid for all interleavings • Showed that the upper bound is tight *
A = NEW(5, 0) • Seq A: Version 1 • ArrayData 1 (Version = 1): values 0 0 0 0 0
B = SET(A, 2, 5) • Seq A: Version 1 • Seq B: Version 2 • ArrayData 1 (Version = 2): values 0 0 5 0 0 • Log entry: Version 1, Value 0
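The version-and-log layout shown on these two slides can be sketched as a sequential (single-threaded) data structure. Everything beyond what the slides show is an assumption made for illustration: the per-index logs kept as sorted lists, the binary search in `get`, and the rule that a SET on an interior sequence first copies it into fresh array data. The concurrent, wait-free version discussed in the talk needs more care than this.

```python
import bisect
from dataclasses import dataclass


class ArrayData:
    """Shared mutable storage: latest values plus per-index logs of old values."""

    def __init__(self, values):
        self.values = list(values)
        self.version = 1
        # log_versions[i][k] is the last version at which index i still
        # held log_values[i][k]; both lists stay sorted by version.
        self.log_versions = [[] for _ in self.values]
        self.log_values = [[] for _ in self.values]


@dataclass(frozen=True)
class Seq:
    data: ArrayData
    version: int


def new(n, v):
    return Seq(ArrayData([v] * n), 1)


def is_leaf(s):
    # A leaf is the newest version of its array data.
    return s.version == s.data.version


def get(s, i):
    ad = s.data
    if is_leaf(s):
        return ad.values[i]
    # Interior: find the first log entry recorded at or after s.version.
    k = bisect.bisect_left(ad.log_versions[i], s.version)
    if k < len(ad.log_versions[i]):
        return ad.log_values[i][k]
    return ad.values[i]  # index i was never overwritten after s.version


def set_(s, i, v):
    if not is_leaf(s):
        # Assumed policy: copy an interior sequence before updating it.
        s = Seq(ArrayData([get(s, j) for j in range(len(s.data.values))]), 1)
    ad = s.data
    ad.log_versions[i].append(ad.version)  # remember the old value ...
    ad.log_values[i].append(ad.values[i])
    ad.values[i] = v                       # ... then update in place
    ad.version += 1
    return Seq(ad, ad.version)


A = new(5, 0)
B = set_(A, 2, 5)
```

After these two calls, `A` and `B` share one `ArrayData` at version 2 with values `[0, 0, 5, 0, 0]`, and index 2 carries a single log entry (version 1, value 0), so `get(A, 2)` still returns 0.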
Naïve SET • Implementation of SET(A, i, v) • First set values[i] = v • Then add a log entry to arraydata
GET-SET Race • Sequence A, version 1 • Array data AD, version 1 • Values = [0, 0, 0, 0, 0] • Logs = empty • Step 1 (Thread 1): Values[2] = 5 ✓ • Step 2 (Thread 2): GET(A, 2) returns 5, although A holds 0 at index 2 • Step 3 (Thread 1): add log entry to Logs[i] ✓
A Wait-Free Solution • Can be fixed by adding log entry before mutating values array • Other issues in GET require careful ordering • Other issues in SET require compare & swap
Experimental Results • Compared sequences to regular arrays • Random & sequential accesses • Writing: 2-3 times slower • Reading: under 10% slower
Concurrent Results • Compared – 1 thread reading a million times – 2 threads reading half a million times • 2 threads were > 1.75 times faster
Summary • Functional array implementation • O(1) operations for leaf • Wait-free concurrent • Well-defined cost semantics
Future Work • Prove concurrent costs of sequence implementation • Tighter cost bounds • Extend to disjoint sets, unordered sets • Lower bound for functional array costs
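The effect of the ordering fix can be seen by replaying the steps of one SET and one GET in a fixed order. This is a deterministic replay, not real concurrency: each step, including the whole GET, is treated as atomic, so it does not cover the further GET-ordering and compare-and-swap issues the slide mentions. The function names and step labels are hypothetical.

```python
def read(values, logs, i, reader_version):
    """GET for a sequence at `reader_version`, treated as one atomic step."""
    for version, old_value in logs[i]:
        if version >= reader_version:
            return old_value  # the value was overwritten after our version
    return values[i]


def run(order):
    """Replay SET(A, 2, 5) split into 'write' and 'log', with one 'get' by A."""
    values = [0, 0, 0, 0, 0]
    logs = [[] for _ in values]
    old = values[2]  # value the writer saw before starting
    seen = None
    for step in order:
        if step == "write":
            values[2] = 5
        elif step == "log":
            logs[2].append((1, old))  # version 1 held `old` at index 2
        elif step == "get":
            seen = read(values, logs, 2, reader_version=1)
        else:
            raise ValueError(f"unknown step: {step}")
    return seen


naive = run(["write", "get", "log"])       # log entry added too late
log_first = run(["log", "get", "write"])   # log entry added before the write
log_first_late_get = run(["log", "write", "get"])
```

In the naive order the reader at version 1 sees 5; with the log entry written first it sees 0 wherever the GET falls between the writer's two steps.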
Acknowledgements • Joe Tassarotti for lots of advice on correctness proof • Danny Sleator for ideas on lower bounds for functional array costs • NSF, Air Force Office, Intel for grants