Towards General-Purpose Acceleration by Exploiting Common Data-Dependence Forms
Vidushi Dadu, Jian Weng, Sihao Liu, Tony Nowatzki
UCLA
MICRO 2019
Challenging trade-off in domain-specific and domain-agnostic acceleration
• CPU: maximum generality
• Domain-agnostic: relies on vectorization and prefetching
• Domain-specific: maximum efficiency; support for application-specific dependencies
• Reason for the gap: control/memory data-dependence
• Our goal: domain-agnostic acceleration that moves toward maximum efficiency
Programmable Accelerators (e.g. GPUs) Fail to Handle Arbitrary Control/Memory Dependence
[Figure: Memory dependence: a vector request for a[3], a[0], a[5], a[1] makes arbitrary accesses into memory holding a[0] to a[5]. Control dependence: a branch transfers execution to an arbitrary code location among Code 1 to Code 4.]
Insight: Restricted control and memory dependence is sufficient for many data-processing algorithms.
Outline
• Irregularity is ubiquitous
• Sufficient and Exploitable forms of Control and Memory dependence
• Example Workload: Matrix Multiply
• Exploiting data-dependence with SPU accelerator
• uArch: Stream-join Dataflow & Compute-enabled Scratchpad
• SPU Multicore Design
• Evaluating SPU
• Conclusion
Irregularity is Ubiquitous
• Sparsity within dataset (Machine Learning): Pruned Neural Network, Decision tree building
• Data-structures representing relationships (Graphs): Bayesian Networks, Triangle Counting
• Purpose to reorder data (Databases): Sorting, Database Join (Table Z = Inner Join (X, Y) of Table X and Table Y)
Irregularity Stems from Data-dependence
Data-dependent aspects of execution:
1. Control flow: if (f(a[i])). Restricted control flow: Stream-Join
2. Memory access: b[a[i]]. Restricted memory access: Alias-Free Indirection
Main insight: There are narrow forms of dependence which are:
• Sufficient to express many algorithms (from ML, graph analytics, databases)
• Exploitable with minimal hardware overhead
Algorithm Classification
• Regular: no control/memory dependence
• Stream Join: restricted control dependence
• Alias-free Indirect: restricted memory dependence
• General Irregularity
Regular Example: Dense Matrix Multiply
[Figure: Input Vector A (N) × Input Matrix B (N×N), summed into Output Vector C (N); most entries of A and B are zero]
• No data-dependence; i.e. the dynamic pattern of control, data access, … is known a priori.
Sparse matrix-multiply can be implemented in two ways:
1. Inner product: Data-dependent control
2. Outer product: Data-dependent memory
Sparse Inner Product Multiply (stream-join)
CSR format: Compressed Sparse Row
  A:    idx 2 3 5   val 2 3 4
  B[0]: idx 0 1 3   val 1 4 1
  Conditional output: 0 0 0 1 (an output of 0 means no multiplication); the one match gives total += 3 * 1

    float sparse_dotp(row r1, r2)
      int i1 = 0, i2 = 0
      float total = 0
      while (i1 < r1.cnt && i2 < r2.cnt)
        if (r1.idx[i1] == r2.idx[i2])        // indices match: multiply
          total += r1.val[i1] * r2.val[i2]
          i1++; i2++
        elif (r1.idx[i1] < r2.idx[i2])       // advance the stream that is behind
          i1++
        else
          i2++
      ...

• Known memory access pattern, but unpredictability in control
• Stream Join (the if/elif/else chain is indicative of it):
  • Memory read can be independent of data*
  • Order that we consume streams of data is data-dependent
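The stream-join loop above can be made runnable as a minimal Python sketch. The `SparseRow` class and its field names are assumptions introduced for illustration; the input data is the A and B[0] example from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseRow:
    # Hypothetical container (not from the slides): one compressed sparse row.
    idx: List[int]    # sorted indices of the non-zero entries
    val: List[float]  # values matching idx position by position


def sparse_dotp(r1: SparseRow, r2: SparseRow) -> float:
    """Dot product of two sparse rows by joining their sorted index streams."""
    i1 = i2 = 0
    total = 0.0
    while i1 < len(r1.idx) and i2 < len(r2.idx):
        if r1.idx[i1] == r2.idx[i2]:
            # Indices match: this is the only case that multiplies.
            total += r1.val[i1] * r2.val[i2]
            i1 += 1
            i2 += 1
        elif r1.idx[i1] < r2.idx[i2]:
            i1 += 1  # r1 is behind, so consume from r1
        else:
            i2 += 1  # r2 is behind, so consume from r2
    return total


# Data from the slide: only index 3 appears in both rows.
a = SparseRow(idx=[2, 3, 5], val=[2.0, 3.0, 4.0])
b0 = SparseRow(idx=[0, 1, 3], val=[1.0, 4.0, 1.0])
result = sparse_dotp(a, b0)
print(result)  # 3.0
```

Both index arrays are read strictly in order, so the addresses are predictable; only the choice of which pointer to advance depends on the data.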
Sparse Outer Product Multiply (Alias-free Indirection)
CSC: Compressed Sparse Column
  A: idx 1 3 5                 val 2 3 4
  B: idx 0 1 5 3 4 0 3 5 0 3   val 1 2 2 3 2 4 3 5 1 1
  Products are accumulated into the output vector C

    float sparse_mv(row r1, m2)
      ...
      for i1 = 0 to r1.cnt, ++i1
        cid = r1.idx[i1]
        for i2 = ptr[cid] to ptr[cid+1], ++i2
          out_vec[m2.idx[i2]] += r1.val[i1] * m2.val[i2]   // indirection

• High memory unpredictability, but known control pattern
• No unknown dependencies (only atomic updates: out[i] = out[i] + prod[i])
• Alias-free Indirect:
  • Produce addresses depending on other data
  • Memory dependences, but no unknown (data-dependent) aliases
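A minimal runnable sketch of the outer-product loop follows, assuming it computes y = M·x for a sparse vector x and a matrix M stored in CSC. The argument names (`col_ptr`, `row_idx`, `mat_val`) and the 3×3 example matrix are illustrative assumptions, not data from the slide.

```python
from typing import List


def sparse_mv(vec_idx: List[int], vec_val: List[float],
              col_ptr: List[int], row_idx: List[int], mat_val: List[float],
              n_out: int) -> List[float]:
    """y = M @ x with x sparse (vec_idx, vec_val) and M in CSC form (assumed layout)."""
    out_vec = [0.0] * n_out
    for i1 in range(len(vec_idx)):
        cid = vec_idx[i1]  # column selected by a non-zero of x
        for i2 in range(col_ptr[cid], col_ptr[cid + 1]):
            # Indirect update: the destination address comes from row_idx.
            # The only dependence on out_vec is this read-modify-write.
            out_vec[row_idx[i2]] += vec_val[i1] * mat_val[i2]
    return out_vec


# Hypothetical example: M = [[1, 0, 2],
#                            [0, 3, 0],
#                            [4, 0, 5]]  and  x = [2, 0, 1]
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
mat_val = [1.0, 4.0, 3.0, 2.0, 5.0]
y = sparse_mv([0, 2], [2.0, 1.0], col_ptr, row_idx, mat_val, n_out=3)
print(y)  # [4.0, 0.0, 13.0]
```

The loop bounds are fixed by `col_ptr`, so control is predictable. The store addresses are data-dependent, but every store is an accumulation, so the order of updates does not change the result for these values.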
Graph Mining (e.g. Triangle Counting)
• For every pair of connected nodes (alias-free indirect), find if they have a common neighbor (stream-join)
[Figure: example graph on nodes a to f, with the edge list of each node]
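As a sketch of how the two forms combine in triangle counting, the Python below looks up the neighbor list of each edge's endpoint (indirect access) and intersects two sorted neighbor lists (stream-join). The oriented adjacency representation and the 4-node graph are assumptions chosen for illustration, not the graph from the slide.

```python
from typing import List


def count_common(xs: List[int], ys: List[int]) -> int:
    """Stream-join: count values present in both sorted lists."""
    i = j = n = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            n += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return n


def count_triangles(adj: List[List[int]]) -> int:
    """adj[u] holds the sorted neighbors of u with id greater than u (assumed
    orientation), so each triangle u < v < w is counted once, at edge (u, v)."""
    total = 0
    for u in range(len(adj)):
        for v in adj[u]:
            # Indirect access: adj[v] is addressed by data read from adj[u].
            total += count_common(adj[u], adj[v])
    return total


# Hypothetical graph with edges 0-1, 0-2, 1-2, 1-3, 2-3.
adj = [[1, 2], [2, 3], [3], []]
triangles = count_triangles(adj)
print(triangles)  # 2
```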
Algorithm               | Stream Join (irreg. control) | Alias-free Indirection (irregular memory)
Machine Learning
Neural Net (FC + Conv)  | Inner Product Mult.          | Outer Product Mult.
Supp. Vector (SVM)      | ""                           | ""
Decision Trees (GBDT)   | Sparse data access           | + Histogramming
Bayesian Networks       | Condition on node type       | + DAG Access
Graph
Page Rank & BFS         | Sparse join of active list   | + Indirect acc. for edges
Triangle Counting       | Find common neighbor edges   | + Indirect acc. for edges
Databases
Join (inner)            | Sort-Join                    | Hash-Join
Sort                    | Merge-Sort                   | Radix-Sort
Filter                  | Generate Filtered Col.       | Generate Column Ind.
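To make one row of this classification concrete, here is a hedged Python sketch of an inner join on a key column done both ways: a sort-join (stream-join form) and a hash-join (indirection form). It assumes unique non-negative integer keys and a simple modulo hash; the function names and data are illustrative only.

```python
from typing import List


def sort_join(xs: List[int], ys: List[int]) -> List[int]:
    """Stream-join form: both key columns are sorted and read in order."""
    i = j = 0
    out = []
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out


def hash_join(xs: List[int], ys: List[int], n_buckets: int = 8) -> List[int]:
    """Indirection form: bucket addresses are computed from the key data."""
    buckets: List[List[int]] = [[] for _ in range(n_buckets)]
    for k in xs:                      # build phase: data-dependent writes
        buckets[k % n_buckets].append(k)
    # Probe phase: data-dependent reads; control is a plain loop over ys.
    return [k for k in ys if k in buckets[k % n_buckets]]


# Hypothetical key columns of two tables.
table_x = [1, 3, 4, 7]
table_y = [2, 3, 7, 9]
joined_sorted = sort_join(table_x, table_y)
joined_hashed = hash_join(table_x, table_y)
print(joined_sorted, joined_hashed)  # [3, 7] [3, 7]
```

The two functions return the same keys here; they differ in where the data-dependence sits, in the branch of `sort_join` versus the addresses of `hash_join`.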
Approach: Start with a Dense Programmable Accelerator
[Figure: stereotypical dense accelerator core with a wide scratchpad, router, control (Ctrl) and systolic array. Examples shown: PuDianNao (ASPLOS’15), Google TPU v2 (ISCA’17), Tabla (HPCA’16).]
• Systolic array supporting stream-join control
• Compute-enabled scratchpad for fast alias-free indirect access (figure labels: banked scratchpad, I-ROB)
Specializing for Stream Join
Novel Dataflow for Stream Join
Sparse MM example: A: idx 2 3 5, val 2 3 4; B[0]: idx 0 1 3, val 1 4 1
Traditional dataflow (mapped onto the systolic array of PEs): generate addresses, load idxA and idxB, compare, increment the pointers, generate addresses, load valA and valB, multiply, accumulate
• Control-dependent load, cyclic dependence, unpredictable branch!
Novel stream-join dataflow: streams idxA, idxB, valA and valB feed a compare node (>, <, =) whose control output c steers the multiply and the accumulator
• Observation: For a stream join, memory is (mostly) separable from computation
• Idea: Allow dataflow to conditionally pop/discard/reset values based on control decisions.
[Animation frames: the compare node consumes the heads of the idxA and idxB streams one step at a time]
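The pop/discard idea can be modeled in software as a rough sketch: four streams are held in queues, a compare step produces a control token, and that token alone decides which stream heads are popped and whether the values reach the multiplier. This is an illustrative model under assumed names (`stream_join_dataflow`, the trace format), not the SPU hardware behavior; accumulator reset is not modeled.

```python
from collections import deque
from typing import List, Tuple


def stream_join_dataflow(idx_a: List[int], val_a: List[int],
                         idx_b: List[int], val_b: List[int]
                         ) -> Tuple[int, List[Tuple[int, int, str]]]:
    """Software model (assumed, simplified) of a stream-join dataflow."""
    s_idx_a, s_idx_b = deque(idx_a), deque(idx_b)
    s_val_a, s_val_b = deque(val_a), deque(val_b)
    acc = 0
    trace = []  # one (head of idxA, head of idxB, control token) per step
    while s_idx_a and s_idx_b:
        a, b = s_idx_a[0], s_idx_b[0]
        c = "=" if a == b else ("<" if a < b else ">")
        trace.append((a, b, c))
        if c in ("<", "="):   # pop stream A; its value is discarded unless c is "="
            s_idx_a.popleft()
            x = s_val_a.popleft()
        if c in (">", "="):   # pop stream B; its value is discarded unless c is "="
            s_idx_b.popleft()
            y = s_val_b.popleft()
        if c == "=":          # only matching indices reach multiply-accumulate
            acc += x * y
    return acc, trace


# Sparse MM example data from the slide.
acc, trace = stream_join_dataflow([2, 3, 5], [2, 3, 4], [0, 1, 3], [1, 4, 1])
print(acc)    # 3
print(trace)  # [(2, 0, '>'), (2, 1, '>'), (2, 3, '<'), (3, 3, '=')]
```

No address is computed from a comparison result in this model: the streams are filled in order up front, and control only decides when each head is consumed.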