Optimizing Indirect Memory References with milk
Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe (MIT)
PACT ’16, September 13, 2016, Haifa, Israel
Indirect Accesses
Indirect Accesses with OpenMP
[Figure: speedup bar chart, OpenMP vs. OpenMP +Milk; indices uniform in [0..100M); 8 threads, 8 MB L3]
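The slide's own code did not survive in this transcript, so the following is only a minimal sketch of the kind of loop under discussion. It assumes a histogram-style update through an index array; the names count_indirect, d, and num_bins are illustrative and not taken from the talk. Only standard OpenMP directives are used, and no milk-specific syntax is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline indirect update: count[d[i]] += 1 for every i.
// The index array d is read sequentially, but count is touched at
// effectively random addresses when d has no locality.
std::vector<uint32_t> count_indirect(const std::vector<uint32_t>& d,
                                     std::size_t num_bins) {
    std::vector<uint32_t> count(num_bins, 0);
    uint32_t* c = count.data();
    const uint32_t* idx = d.data();
    const long n = static_cast<long>(d.size());

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        assert(idx[i] < num_bins);  // every index must name a valid bin
        // Different threads may hit the same bin, so the update is atomic.
        #pragma omp atomic
        c[idx[i]] += 1;
    }
    return count;
}
```

If the compiler is run without OpenMP support, the pragmas are ignored and the loop runs serially with the same result.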
Indirect Accesses with milk
Highlighted code fragments: milk, if(!milk)
[Figure: speedup bar chart, OpenMP vs. OpenMP +Milk; indices uniform in [0..100M); 8 threads, 8 MB L3]
No Locality?
[Figure: accessed address vs. time]
• Cache miss
• TLB miss
• DRAM row miss
• No prefetching
Milk Clustering (8 threads)
[Figure: accessed address vs. time]
• Cache hit
• TLB hit
• DRAM row hit
• Effective prefetching
• No need for atomics!
Big (sparse) Data
[Image: map of the Internet, http://research.blogs.lincoln.ac.uk/files/2011/02/map-of-internet.png]
Big (sparse) Data
• Terabyte Working Sets
  - AWS 2TB VM
• In-memory Databases, Key-value stores
• Machine Learning
• Graph Analytics
Outline
• Milk programming model
• milk syntax
• MILK compiler and runtime

Foundations
• Milk programming model: extending BSP
• milk syntax: OpenMP for C/C++
• MILK compiler and runtime: LLVM/Clang

Milk: BSP extension
• Bulk-synchronous parallel (BSP) superstep
  - updates visible after a barrier
• Milk virtual processors can access only
  - One random cache line from DRAM
  - Sequential streams
  - Cache-resident data
Superstep Locality in Graph Applications
[Figure: ideal cache hit % under temporal locality (infinite cache) and spatial locality (64-byte lines) for the Road (d=2.4), Twitter (d=24), and Web (d=39) graphs, across the GAPBS benchmarks Betweenness Centrality, Breadth-First Search, Connected Components, Single-Source Shortest Paths, and PageRank]
Milk Execution Model
• Collection
• Distribution
• Delivery
Collection
[Figure: the statement "+= f(i);" over an index array d = 7 0 14 5 18 7 0 7 (positions 0..7) and a count array with slots 0..19; each tag from d is collected together with its payload f(0)..f(7)]

Distribution
[Figure: the collected (tag, payload) pairs are reordered so that equal tags become adjacent]

Delivery
[Figure: the reordered updates are applied to the count array]
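The three phases above can be sketched on the slide's example array. This is an illustrative, single-threaded sketch, not the MILK runtime: it assumes the update has the form count[d[i]] += f(i), uses a stable sort as a stand-in for distribution, and the names Deferred, clustered_update, and the body of f are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical payload function standing in for the slides' f(i).
inline uint32_t f(std::size_t i) { return static_cast<uint32_t>(i) + 1; }

// One deferred update: the tag is the index to update, payload is f(i).
struct Deferred {
    uint32_t tag;
    uint32_t payload;
};

std::vector<uint32_t> clustered_update(const std::vector<uint32_t>& d,
                                       std::size_t num_bins) {
    // Collection: record (tag, payload) pairs instead of updating in place.
    std::vector<Deferred> pending;
    pending.reserve(d.size());
    for (std::size_t i = 0; i < d.size(); ++i) {
        assert(d[i] < num_bins);
        pending.push_back({d[i], f(i)});
    }

    // Distribution: bring equal tags together. A stable sort is used here
    // only as a simple stand-in for a partitioning step.
    std::stable_sort(pending.begin(), pending.end(),
                     [](const Deferred& a, const Deferred& b) {
                         return a.tag < b.tag;
                     });

    // Delivery: apply the updates in tag order, so consecutive updates
    // touch the same or nearby slots of count.
    std::vector<uint32_t> count(num_bins, 0);
    for (const Deferred& r : pending) {
        count[r.tag] += r.payload;
    }
    return count;
}
```

Because addition is commutative, the reordered delivery produces the same count array as applying the updates in their original order.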
milk syntax
• milk clause in parallel loop
• milk directive per indirect access
  - tag: address to group by (in the example, 0)
  - pack: additional state (in the example, f(1))
pack Combiners
[Figure: after distribution, the tags in order are 0 0 5 7 7 7 14 18 with payloads f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4); a + combiner merges the payloads of equal tags, leaving one update each for tags 0, 5, 7, 14, and 18]
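The combining step in the figure can be sketched as a single pass over tag-sorted records. This is an assumption-laden illustration: it assumes an additive combiner and input already sorted by tag, and the names Deferred and combine_adjacent are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One deferred update: tag is the index to update, value is the payload.
struct Deferred {
    uint32_t tag;
    uint32_t value;
};

// Merge neighboring records that share a tag by adding their values,
// so each distinct tag is delivered once.
std::vector<Deferred> combine_adjacent(const std::vector<Deferred>& sorted) {
    std::vector<Deferred> out;
    for (const Deferred& r : sorted) {
        // The input is required to be sorted by tag.
        assert(out.empty() || out.back().tag <= r.tag);
        if (!out.empty() && out.back().tag == r.tag) {
            out.back().value += r.value;  // combine with the previous record
        } else {
            out.push_back(r);             // first record for this tag
        }
    }
    return out;
}
```

On the slide's example, eight sorted records collapse to five, one per distinct tag.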
MILK compiler and runtime
• Collection: loop transformation
• Delivery: outlined function with continuation
• Distribution: runtime library parallel multipass radix partitioning
Example: PageRank
[Figure labels: 7, 17, 0.5]
PageRank with OpenMP
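The code on this slide is missing from the transcript. As a rough stand-in, the sketch below shows the scatter step of a push-style PageRank iteration with standard OpenMP, which is one common way such a kernel produces indirect updates. The CSR layout, the names CsrGraph and push_contributions, and the omission of damping are assumptions made for illustration, not details from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR graph: the out-neighbors of u are
// neighbors[offsets[u]] .. neighbors[offsets[u + 1] - 1].
struct CsrGraph {
    std::vector<std::size_t> offsets;
    std::vector<int> neighbors;
    int num_nodes() const { return static_cast<int>(offsets.size()) - 1; }
};

// Each vertex u pushes rank[u] / out_degree(u) to every out-neighbor v.
// The write to sum[v] is the indirect access: v comes from the edge list.
std::vector<double> push_contributions(const CsrGraph& g,
                                       const std::vector<double>& rank) {
    assert(rank.size() == static_cast<std::size_t>(g.num_nodes()));
    std::vector<double> sum(rank.size(), 0.0);
    double* s = sum.data();
    const int n = g.num_nodes();

    #pragma omp parallel for
    for (int u = 0; u < n; ++u) {
        const std::size_t begin = g.offsets[u];
        const std::size_t end = g.offsets[u + 1];
        if (begin == end) continue;  // no out-edges, nothing to push
        const double contrib = rank[u] / static_cast<double>(end - begin);
        for (std::size_t e = begin; e < end; ++e) {
            const int v = g.neighbors[e];
            // Several sources may share a destination, hence the atomic.
            #pragma omp atomic
            s[v] += contrib;
        }
    }
    return sum;
}
```

In the terms of the milk syntax slide, the destination v would play the role of the tag and the contribution the role of the packed state; the exact directive spelling is not reproduced here because it does not appear in this transcript.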
PageRank with milk
[Figure labels: 7, 17, 0.5]
PageRank: Collection
[Figure labels: 0.5, 7]
Tag Distribution
[Figure: 9-bit radix partition of tagged records into pails held in L2; labels: tags 7 and 17, value 0.5, p=7]
Distribution: Pail Overflow
[Figure: pails in L2 and tubs in DRAM; labels: tags 7 and 17, values 0.5 and 0.2, p=7]
Milk Delivery
[Figure: tubs in DRAM and L2; labels: tags 7, 17, 27, values 0.1, 0.2, 0.3, 0.5]
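The pail and tub slides can be approximated with the sketch below. It borrows the names "pail" and "tub" and the 9-bit radix from the slides, but everything else is an assumption: it is single-threaded and single-pass (the runtime slide describes a parallel multipass scheme), kPailCapacity is a made-up value, and Record, Partitioner, and deliver are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Record {
    uint32_t tag;  // address to group by
    float pack;    // additional state carried with the tag
};

constexpr unsigned kRadixBits = 9;                 // radix width from the slide
constexpr std::size_t kFanout = 1u << kRadixBits;  // 512 partitions
constexpr std::size_t kPailCapacity = 16;          // hypothetical pail size

class Partitioner {
public:
    // shift selects which 9 bits of the tag form the radix digit.
    explicit Partitioner(unsigned shift)
        : shift_(shift), pails_(kFanout), tubs_(kFanout) {
        for (auto& pail : pails_) pail.reserve(kPailCapacity);
    }

    // Append to the small pail for this digit; spill when the pail is full.
    void append(Record r) {
        const std::size_t p = (r.tag >> shift_) & (kFanout - 1);
        pails_[p].push_back(r);
        if (pails_[p].size() == kPailCapacity) spill(p);
    }

    // Move whatever is left in the pails to their tubs.
    void flush() {
        for (std::size_t p = 0; p < kFanout; ++p) spill(p);
    }

    const std::vector<Record>& tub(std::size_t p) const {
        assert(p < kFanout);
        return tubs_[p];
    }

private:
    // Copy a pail's contents to the end of its tub in one sequential write.
    void spill(std::size_t p) {
        tubs_[p].insert(tubs_[p].end(), pails_[p].begin(), pails_[p].end());
        pails_[p].clear();
    }

    unsigned shift_;
    std::vector<std::vector<Record>> pails_;  // small buffers, meant to stay cached
    std::vector<std::vector<Record>> tubs_;   // large buffers, stand-in for DRAM
};

// Delivery: walk one tub at a time so updates stay within one tag range.
void deliver(const Partitioner& parts, std::vector<float>& sum) {
    for (std::size_t p = 0; p < kFanout; ++p) {
        for (const Record& r : parts.tub(p)) {
            assert(r.tag < sum.size());
            sum[r.tag] += r.pack;
        }
    }
}
```

A caller would append records during collection, call flush() once at the end, and then call deliver(); the values used with tags 7, 17, and 27 in any test of this sketch are arbitrary.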
Related Work
• Database JOIN optimizations
• [Shatdal94] cache partitioning
• [Manegold02, Kim09, Albutiu12, Balkesen15] TLB, SIMD, NUMA, non-temporal writes, software write buffers
Overall Speedup with milk
[Figure: speedup on the GAPBS benchmarks BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), and SSSP (Single-Source Shortest Paths); V=32M, i7-4790K, 8 MB L3; annotated speedups: 2.7× and 1.4×]
Indirect Access Cache Hit%
[Figure: cache hit %, baseline vs. milk, for BC, BFS, CC, PR, SSSP; V=32M, i7-4790K, 8 MB L3, 256 KB L2]
> 80% DRAM → < 22%
Stall Cycle Reduction
[Figure: L2 miss stalls and L3 miss stalls as % of total cycles, baseline vs. milk; PageRank, V=32M, d=16 uniform; 256 KB L2, 8 MB L3]
baseline: 6 of 7 cycles stalled!
Larger Graphs → Larger Speedups
[Figure: speedup for V = 2M, 8M, and 32M on BC, BFS, CC, PR, SSSP; d=16 uniform, 8 MB L3, i7-4790K]
Higher Degree → Higher Locality
[Figure: CountDegree speedup vs. average degree (1 to 64) for V=16M and V=32M; edge counts from 16M to 2B]
Q & A
http://milk-lang.org/

Backup Slides
Graph Datasets
Category  Graph      Vertices  Degree
Social    Facebook   1.5 B     290
Social    Twitter    300 M     200
Social    Twitter62  62 M      24
Web       CC12       3.5 B     36
Web       .sk        51 M      39
Road      US         24 M      2.4
Sources: [Backstrom14] [Ching15] [Beamer15] [CommonCrawl]
Degree Distribution
[Figure: cumulative % of edges vs. vertex degree rank for RMAT25, Uniform25, and Twitter’; annotations: V=62M, d=24 and V=32M, d=16; L3 marker]