Stream Programming: Explicit Parallelism and Locality
Bill Dally
Edge Workshop, May 24, 2006
Outline
• Technology Constraints → Architecture
• Stream programming
• Imagine and Merrimac
• Other stream processors
• Future directions
ILP is mined out – the end of superscalar processors. Time for a new architecture.
[Chart: processor performance (ps/Inst) vs. year, 1980–2020, log scale, with trend lines of 74%/year and 19%/year and projected gaps of 30:1, 1,000:1, and 30,000:1.]
Dally et al., "The Last Classical Computer", ISAT Study, 2001
Performance = Parallelism
Efficiency = Locality
Arithmetic is cheap, communication is expensive
• Arithmetic (90nm chip)
  – Can put 100s of FPUs on a chip
  – $0.50/GFLOPS, 50mW/GFLOPS
  – Exploit with parallelism
• Communication
  – Dominates cost: $8/GW/s, 2W/GW/s (off-chip)
  – BW decreases (and cost increases) with distance
  – Power increases with distance
  – Latency increases with distance, but can be hidden with parallelism
  – Need locality to conserve global bandwidth
[Figure: a 64-bit FPU (0.5mm, drawn to scale) on a 12mm, $200, 1GHz die; bandwidth decreases and power increases with wire distance; 1-clock reach shown.]
Cost of data access varies by 1000x

From                     Energy   Cost*    Time
Local Register           10pJ     $0.50    1ns
Chip Region (2mm)        50pJ     $2       4ns
Global on Chip (15mm)    200pJ    $10      20ns
Off chip (node mem)      1nJ      $50      200ns
Global                   5nJ      $500     1us

*Cost of providing 1GW/s of bandwidth; all numbers approximate
So we should build chips that look like this
An abstract view
[Diagram: a storage hierarchy – global memory at the top connected through switches to chip memory (CM) and local/regional memories (LM, RM), with registers (R) below feeding arithmetic units (A).]
Real question is: How to orchestrate movement of data
Conventional Wisdom: Use caches
[Same storage-hierarchy diagram as the previous slide.]
Caches squander bandwidth – our scarce resource
• Unnecessary data movement
• Poorly scheduled data movement
  – Idles expensive resources waiting on data
• More efficient to map programs to an explicit memory hierarchy (a minimal sketch follows)
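To make the contrast concrete, here is a minimal, hypothetical C++ sketch of the "explicit hierarchy" style: data is staged into a named on-chip buffer in bulk, operated on, and written back, instead of being faulted in a cache line at a time. The buffer name, block size, and the trivial compute are illustrative stand-ins, not Imagine/Merrimac APIs.

    #include <cstddef>
    #include <algorithm>

    // Illustrative only: local_buf stands in for explicit on-chip storage (an SRF
    // lane); a real stream processor would fill it with one bulk stream load.
    constexpr std::size_t BLOCK = 1024;

    void process_explicit(const float* in, float* out, std::size_t n) {
        float local_buf[BLOCK];                       // explicit staging area
        for (std::size_t base = 0; base < n; base += BLOCK) {
            std::size_t len = std::min<std::size_t>(BLOCK, n - base);
            for (std::size_t i = 0; i < len; ++i)     // one scheduled bulk transfer,
                local_buf[i] = in[base + i];          // not scattered demand misses
            for (std::size_t i = 0; i < len; ++i)     // compute touches only on-chip data
                local_buf[i] = local_buf[i] * 2.0f + 1.0f;
            for (std::size_t i = 0; i < len; ++i)     // one scheduled write-back
                out[base + i] = local_buf[i];
        }
    }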
Example – Simplified Finite-Element Code

loop over cells
  flux[i] = ...
loop over cells
  ... = f(flux[i], ...)
Explicitly block into SRF

loop over cells
  flux[i] = ...
loop over cells
  ... = f(flux[i], ...)

Flux is passed through the SRF – no memory traffic.
Explicitly block into SRF

loop over cells
  flux[i] = ...
loop over cells
  ... = f(flux[i], ...)

Explicit re-use of cells – no misses. (A blocked sketch follows.)
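A minimal sketch of what "block into the SRF" means for the loops above, written as plain C++ rather than StreamC/KernelC. The block size and the Cell, compute_flux, and f placeholders are hypothetical; the point is that the flux values for a block are produced into on-chip storage and consumed immediately, so they never round-trip through DRAM.

    #include <algorithm>

    struct Cell { float state; };                                        // placeholder cell record
    static float compute_flux(const Cell& c) { return c.state; }         // stand-in producer
    static float f(float flux, const Cell& c) { return flux + c.state; } // stand-in consumer

    constexpr int SRF_BLOCK = 512;   // sized so a block of flux values fits on-chip

    void fem_pass(const Cell* cells, float* result, int ncells) {
        float flux[SRF_BLOCK];                       // lives in explicit on-chip storage
        for (int base = 0; base < ncells; base += SRF_BLOCK) {
            int len = std::min(SRF_BLOCK, ncells - base);
            for (int i = 0; i < len; ++i)            // producer loop over the block
                flux[i] = compute_flux(cells[base + i]);
            for (int i = 0; i < len; ++i)            // consumer loop reuses flux on-chip
                result[base + i] = f(flux[i], cells[base + i]);
        }
    }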
Stream loads/stores (bulk operations) hide latency (1000s of words in flight)
[Diagram: cells are gathered from DRAM into the SRF, kernels fn1 and fn2 run out of the SRF and LRFs (with the flux stream passed between them), and updated cells are scattered back to DRAM.]
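The bulk gather/scatter picture can be sketched as double buffering: while a kernel works on one block already resident on-chip, the next block is being gathered. In this hypothetical C++ version the "gather" and "scatter" are plain copies; on Imagine/Merrimac they would be asynchronous stream memory operations issued far ahead of use, which is what keeps thousands of words in flight.

    #include <cstddef>
    #include <vector>

    constexpr std::size_t CHUNK = 1024;   // illustrative block size

    // Assumes in.size() is a multiple of CHUNK and out is at least as large as in.
    void stream_pipeline(const std::vector<float>& in, std::vector<float>& out) {
        float buf[2][CHUNK];                          // two SRF-like buffers
        std::size_t nchunks = in.size() / CHUNK;
        if (nchunks == 0) return;
        for (std::size_t i = 0; i < CHUNK; ++i)       // prologue: gather chunk 0
            buf[0][i] = in[i];
        for (std::size_t c = 0; c < nchunks; ++c) {
            std::size_t cur = c & 1, nxt = cur ^ 1;
            if (c + 1 < nchunks)                          // gather chunk c+1; on real hardware
                for (std::size_t i = 0; i < CHUNK; ++i)   // this overlaps the kernel below
                    buf[nxt][i] = in[(c + 1) * CHUNK + i];
            for (std::size_t i = 0; i < CHUNK; ++i)   // kernel: touches only resident data
                buf[cur][i] *= 0.5f;
            for (std::size_t i = 0; i < CHUNK; ++i)   // scatter results back to memory
                out[c * CHUNK + i] = buf[cur][i];
        }
    }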
Explicit storage enables simple, efficient execution
• All needed data and instructions on-chip – no misses
Caches lack predictability (controlled via a "wet noodle")
Caches are controlled via a "wet noodle": even at a 99% hit rate, one miss costs 100s of cycles and 10,000s of operation slots.
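As a rough illustration using the numbers above, and treating the miss penalty as 100 cycles: the average access takes about 0.99 × 1 + 0.01 × 100 ≈ 2 cycles, so a 1% miss rate already doubles effective latency, and on a chip with 100s of FPUs each 100+-cycle stall leaves on the order of 10,000 operation slots idle.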
So how do we program an explicit hierarchy?
Stream Programming: Parallelism, Locality, and Predictability
• Parallelism
  – Data parallelism across stream elements
  – Task parallelism across kernels
  – ILP within kernels
• Locality
  – Producer/consumer (sketched below)
  – Within kernels
• Predictability
  – Enables scheduling
[Diagram: a stream graph of kernels K1–K4 connected by streams.]
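A minimal, hypothetical C++ sketch of the model: two placeholder kernels are chained so each stream element flows producer-to-consumer without returning to global memory; elements are independent (data parallel), and the kernels themselves could run concurrently on different blocks (task parallel).

    #include <cstddef>
    #include <vector>

    // Placeholder kernels standing in for two of the kernels in the diagram.
    static inline float k1(float x) { return x * x; }
    static inline float k2(float y) { return y + 1.0f; }

    std::vector<float> run_pipeline(const std::vector<float>& in) {
        std::vector<float> out(in.size());
        for (std::size_t i = 0; i < in.size(); ++i) {   // data parallel across elements
            float tmp = k1(in[i]);    // intermediate stays in registers/local store
            out[i] = k2(tmp);         // producer/consumer locality: no DRAM round-trip
        }
        return out;
    }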
Evolution of Stream Programming
• 1997: StreamC/KernelC
  – Break programs into kernels
  – Kernels operate only on input/output streams and locals
  – Communication scheduling and stream scheduling
• 2001: Brook
  – Continues the constructs of streams and kernels
  – Hides underlying details
  – Too "one-dimensional"
• 2005: Sequoia
  – Generalizes kernels to "tasks"
  – Tasks operate on local data
  – Local data "gathered" in an arbitrary way
  – "Inner" tasks subdivide, "leaf" tasks compute
  – Machine-specific details factored out
StreamC/KernelC
[Diagram: stereo depth extraction – Image 0 and Image 1 each pass through two convolve kernels, and the results feed a SAD kernel that produces the depth map.]

KERNEL convolve(istream<int> a,
                ostream<int> y) {
  …
  loop_stream(a) {
    int ai, out;
    a >> ai;
    out = dotproduct(ai, …);
    y << out;
  }
}

STREAMPROG depth(…) {
  im_stream<pixels> in, tmp;
  …
  for (i=0; i<rows; i++) {
    convolve(in, tmp, …);
    convolve(tmp, conv_row, …);
  }
  …
  for (i=0; i<rows; i++) {
    SAD(conv_row, depth_row, …);
  }
  …
}
Explicit storage enables simple, efficient execution-unit scheduling
• Software pipelining of the ComputeCellInt kernel from StreamFEM3D
• Over 95% of peak with simple hardware
• Depends on explicit communication to make delays predictable
[Figure: software-pipeline schedule – one kernel iteration unrolled across functional units and overlapped across cycles 0–120.]
Stream scheduling exploits explicit storage to reduce bandwidth demand
[Figure: stream graph of the StreamFEM application – Gather Cells, Compute Flux States, Compute Numerical Flux, Compute Cell Interior, and Advance Cell kernels connected by element, face-geometry, numerical-flux, and cell-orientation streams, with read-only table-lookup data (the master element) held on-chip.]
• Prefetching, reuse, use/def, limited spilling
Sequoia – Generalize Kernels into Leaf Tasks
• Perform actual computation
• Analogous to kernels
• "Small" working set

void __task matmul::leaf( __in    float A[M][P],
                          __in    float B[P][N],
                          __inout float C[M][N] )
{
  for (int i=0; i<M; i++) {
    for (int j=0; j<N; j++) {
      for (int k=0; k<P; k++) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

[Figure: machine model – node memory above an aggregate LS, with local stores LS 0 … LS 7 each feeding an FU; the matmul leaf task runs at the LS/FU level.]
Inner tasks
• Decompose to smaller subtasks
  – Recursively
• "Larger" working sets

void __task matmul::inner( __in    float A[M][P],
                           __in    float B[P][N],
                           __inout float C[M][N] )
{
  tunable unsigned int U, X, V;

  blkset Ablks = rchop(A, U, X);
  blkset Bblks = rchop(B, X, V);
  blkset Cblks = rchop(C, U, V);

  mappar (int i=0 to M/U, int j=0 to N/V)
    mapreduce (int k=0 to P/X)
      matmul(Ablks[i][k], Bblks[k][j], Cblks[i][j]);
}

[Figure: the same machine model – the matmul inner task runs at the node-memory level and spawns matmul leaf tasks on the local stores (LS 0 … LS 7) and FUs.]
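For readers who prefer plain C++, here is a hypothetical rendering of what the inner/leaf pair above computes: a blocked matrix multiply in which the "inner" level chooses tile sizes (stand-ins for Sequoia's tunables U, X, V) and the "leaf" level multiplies one tile whose working set fits in local store.

    #include <algorithm>

    // Blocked matmul: C (MxN) += A (MxP) * B (PxN), with U/V/X as stand-in tunables.
    void matmul_blocked(const float* A, const float* B, float* C,
                        int M, int N, int P, int U = 64, int V = 64, int X = 64) {
        for (int i0 = 0; i0 < M; i0 += U)                 // "inner": tile the iteration space
            for (int j0 = 0; j0 < N; j0 += V)
                for (int k0 = 0; k0 < P; k0 += X)
                    // "leaf": dense multiply of one (U x X) by (X x V) block
                    for (int i = i0; i < std::min(i0 + U, M); ++i)
                        for (int j = j0; j < std::min(j0 + V, N); ++j) {
                            float acc = C[i * N + j];
                            for (int k = k0; k < std::min(k0 + X, P); ++k)
                                acc += A[i * P + k] * B[k * N + j];
                            C[i * N + j] = acc;
                        }
    }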
Stream Processors make communication explicit
Enables optimization
Stream architecture makes communication explicit – exploits parallelism and locality
[Figure: stream processor bandwidth hierarchy – LRFs beside the ALUs in each cluster, SRF lanes and the inter-cluster switch on-chip, and cache banks and DRAM banks reached via the chip pins and router; levels are annotated 10k×, 1k×, 100×, and chip crossing(s). ALU and cluster arrays are shown 1D here but may be laid out as 2D arrays.]
Imagine VLSI Implementation
• Chip details
  – 2.56 cm² die, 0.15 µm process, 21M transistors, 792-pin BGA
  – Collaboration with TI ASIC
  – Chips arrived on April 1, 2002
• Dual-Imagine test board
Application Performance (cont.)
[Figure: execution-time breakdown (0–100%) for DEPTH, MPEG, QRD, RTSL, and their average – kernel main-loop operations, kernel non-main-loop overhead, cluster stalls, memory stalls, stream-controller overhead, and host-bandwidth stalls.]
Applications match the bandwidth hierarchy
[Figure: LRF, SRF, and DRAM bandwidth (GB/s, log scale from 0.1 to 1000) at peak and for DEPTH, MPEG, QRD, and RTSL.]
Merrimac – Streaming Supercomputer
• Node: one stream processor (64 FPUs, 128 GFLOPS) with 2 GBytes of XDR-DRAM at 64 GBytes/s
• Board: 16 nodes – 1K FPUs, 2 TFLOPS, 32 GBytes
• Backplane: 32 boards – 512 nodes, 32K FPUs, 64 TFLOPS, 1 TBytes
• On-board network: 12 GBytes/s, 32+32 pairs
• Intra-cabinet network: 48 GBytes/s, 128+128 pairs, 6" Teradyne GbX, E/O and O/E conversion
• Inter-cabinet network: 768 GBytes/s, 2K+2K links, ribbon fiber
• Bisection bandwidth: 24 TBytes/s
• Scalable from a 2-TFLOP workstation to a 2-PFLOP supercomputer