
Optimizing Indirect Memory References with milk

Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe, MIT. PACT ’16, September 13, 2016, Haifa, Israel.


  1. Optimizing Indirect Memory References with milk
 Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe
 MIT, PACT ’16
 September 13, 2016, Haifa, Israel

  2. Indirect Accesses

  3. Indirect Accesses with OpenMP
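
The code on this slide did not survive extraction. As a minimal sketch of the kind of loop the title refers to, assuming a plain histogram-style update (the function name `indirect_add` and the increment of 1 are illustrative, not taken from the slides):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline indirect update: iteration i touches count[d[i]], an address chosen
// by the data. With OpenMP the update is atomic because two threads may hit
// the same element at the same time.
// Compiles with or without -fopenmp; without it the pragmas are ignored.
void indirect_add(std::vector<int>& count, const std::vector<int>& d) {
    int* c = count.data();
    const long n = static_cast<long>(d.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const std::size_t idx = static_cast<std::size_t>(d[static_cast<std::size_t>(i)]);
        assert(idx < count.size());  // index must fall inside the target array
        #pragma omp atomic
        c[idx] += 1;
    }
}
```

With indices spread uniformly over a range much larger than the cache, nearly every `c[idx]` access in such a loop is a random DRAM access.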

  4. Indirect Accesses with OpenMP
 [Chart: speedup, OpenMP vs. +Milk; indices uniform in [0..100M), 8 threads, 8 MB L3]

  5. Indirect Accesses with milk
 Code annotations: milk, if(!milk)
 [Chart: speedup, OpenMP vs. +Milk; indices uniform in [0..100M), 8 threads, 8 MB L3]

  6. No Locality?
 [Plot: address vs. time]

  7. No Locality?
 • Cache miss
 • TLB miss
 • DRAM row miss
 • No prefetching
 [Plot: address vs. time]

  8. No Locality?
 [Plot: address vs. time]

  9. No Locality?
 [Plot: address vs. time]

  10. No Locality?
 [Plot: address vs. time]

  11. Milk Clustering
 [Plot: address vs. time, 8 threads]

  12. Milk Clustering
 • Cache hit
 • TLB hit
 • DRAM row hit
 • Effective prefetching
 [Plot: address vs. time]

  13. Milk Clustering
 • Cache hit
 • TLB hit
 • DRAM row hit
 • Effective prefetching
 • No need for atomics!
 [Plot: address vs. time]

  14. Big (sparse) Data
 [Image: http://research.blogs.lincoln.ac.uk/files/2011/02/map-of-internet.png]

  15. Big (sparse) Data
 • Terabyte working sets
   - AWS 2TB VM
 • In-memory databases, key-value stores
 • Machine learning
 • Graph analytics

  16. Outline
 • Milk programming model
 • milk syntax
 • MILK compiler and runtime

  17. Foundations
 • Milk programming model: extending BSP
 • milk syntax: OpenMP for C/C++
 • MILK compiler and runtime: LLVM/Clang

  18. Milk: BSP extension
 • Bulk-synchronous parallel (BSP) superstep
   - updates visible after a barrier
 • Milk virtual processors can access only
   • One random cache line from DRAM
   • Sequential streams
   • Cache-resident data

  19. Superstep Locality in Graph Applications
 [Chart: ideal cache hit % from temporal locality (infinite cache) and spatial locality (64-byte lines) for Road (d=2.4), Twitter (d=24), and Web (d=39) graphs; panels: Betweenness Centrality, Breadth-First Search, Connected Components, Single-Source Shortest Paths, PageRank; benchmarks from [GAPBS]]

  20. Milk Execution Model
 • Collection
 • Distribution
 • Delivery

  21. Collection
 Code fragment: += f(i);
 i:     0 1 2  3 4  5 6 7
 d:     7 0 14 5 18 7 0 7
 count: slots 0..19

  22. Collection
 Code fragment: += f(i);
 i:        0    1    2    3    4    5    6    7
 d:        7    0    14   5    18   7    0    7
 tags:     7    0    14   5    18   7    0    7
 payloads: f(0) f(1) f(2) f(3) f(4) f(5) f(6) f(7)
 count: slots 0..19

  23. Distribution
 Code fragment: += f(i);
 d:        7    0    14   5    18   7    0    7
 tags:     7    0    14   5    18   7    0    7
 payloads: f(0) f(1) f(2) f(3) f(4) f(5) f(6) f(7)
 count: slots 0..19

  24. Distribution
 Code fragment: += f(i);
 d:        7    0    14   5    18   7    0    7
 tags:     0    0    5    7    7    7    14   18
 payloads: f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4)
 count: slots 0..19

  25. Delivery
 Code fragment: += f(i);
 d:        7    0    14   5    18   7    0    7
 tags:     0    0    5    7    7    7    14   18
 payloads: f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4)
 count: slots 0..19

  26. Delivery
 Code fragment: += f(i);
 d:     7 0 14 5 18 7 0 7
 count: slots 0..19
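
The three phases on slides 20 to 26 can be sketched in plain C++ as follows. This is a sequential illustration under stated assumptions, not the MILK runtime: `f` is a hypothetical stand-in (the slides do not define it), the names `TaggedUpdate`, `collect`, `distribute`, and `deliver` are illustrative, and a stable sort stands in for the runtime's radix partitioning.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One deferred update: first = tag (target index), second = payload (value to add).
using TaggedUpdate = std::pair<int, int>;

// Hypothetical stand-in for f(i); the slides do not define it.
int f(int i) { return i + 1; }

// Collection: record (tag, payload) instead of touching count[d[i]] right away.
std::vector<TaggedUpdate> collect(const std::vector<int>& d) {
    std::vector<TaggedUpdate> updates;
    updates.reserve(d.size());
    for (std::size_t i = 0; i < d.size(); ++i)
        updates.emplace_back(d[i], f(static_cast<int>(i)));
    return updates;
}

// Distribution: group updates by tag (stable sort used here for brevity).
void distribute(std::vector<TaggedUpdate>& updates) {
    std::stable_sort(updates.begin(), updates.end(),
                     [](const TaggedUpdate& a, const TaggedUpdate& b) {
                         return a.first < b.first;
                     });
}

// Delivery: apply updates in tag order, so accesses to count are clustered.
void deliver(const std::vector<TaggedUpdate>& updates, std::vector<int>& count) {
    for (const TaggedUpdate& u : updates) {
        assert(u.first >= 0 && static_cast<std::size_t>(u.first) < count.size());
        count[static_cast<std::size_t>(u.first)] += u.second;
    }
}
```

For d = 7 0 14 5 18 7 0 7, `distribute` leaves the tags in the order 0 0 5 7 7 7 14 18 with payloads f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4), matching the sorted row shown on the Combiners slide.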

  27. milk syntax
 • milk clause in parallel loop
 • milk directive per indirect access
   - tag: address to group by (example label: 0)
   - pack: additional state (example label: f(1))

  28. pack Combiners

  29. Combiners
 Code fragment: += f(i);
 d:        7    0    14   5    18   7    0    7
 tags:     0    0    5    7    7    7    14   18
 payloads: f(1) f(6) f(3) f(0) f(5) f(7) f(2) f(4)
 count: slots 0..19

  30. Combiners
 d:        7 0 14 5 18 7 0 7
 tags:     0         5    7                14   18
 payloads: f(1)+f(6) f(3) f(0)+f(5)+f(7)   f(2) f(4)
 count: slots 0..19
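
A combiner for `+` can be sketched as a single pass over updates that are already grouped by tag. The name `combine_add` and the integer payload type are assumptions made for illustration; the slides only show that equal-tag payloads are merged with `+` before delivery.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using TaggedUpdate = std::pair<int, int>;  // (tag, payload)

// Merge adjacent updates that share a tag by adding their payloads.
// Assumes the input is already ordered by tag (the output of distribution).
std::vector<TaggedUpdate> combine_add(const std::vector<TaggedUpdate>& sorted) {
    std::vector<TaggedUpdate> out;
    for (const TaggedUpdate& u : sorted) {
        assert(out.empty() || out.back().first <= u.first);  // input must be tag-ordered
        if (!out.empty() && out.back().first == u.first)
            out.back().second += u.second;  // same tag: fold into the previous entry
        else
            out.push_back(u);               // new tag: start a new entry
    }
    return out;
}
```

On the slide's example this turns eight updates into five, so delivery performs one write per distinct tag.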

  31. MILK compiler and runtime
 • Collection: loop transformation
 • Delivery: outlined function with continuation
 • Distribution: runtime library, parallel multipass radix partitioning
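
As a rough illustration of multipass radix partitioning, the sketch below runs one histogram, prefix-sum, and scatter pass per digit. It is sequential and processes the least-significant digit first; both choices are simplifying assumptions, and the struct and function names are hypothetical. The actual runtime is described on the slide only as parallel and multipass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TaggedUpdate {
    std::uint32_t tag;  // target index
    float payload;      // value to deliver
};

// One stable partitioning pass on the `bits`-wide digit of the tag that starts
// at bit `shift`: histogram, exclusive prefix sum, then scatter.
std::vector<TaggedUpdate> radix_pass(const std::vector<TaggedUpdate>& in,
                                     unsigned shift, unsigned bits) {
    assert(bits > 0 && bits <= 16 && shift + bits <= 32);
    const std::size_t buckets = std::size_t{1} << bits;
    const std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);

    std::vector<std::size_t> offset(buckets + 1, 0);
    for (const TaggedUpdate& u : in)
        ++offset[((u.tag >> shift) & mask) + 1];  // count items per bucket
    for (std::size_t b = 0; b < buckets; ++b)
        offset[b + 1] += offset[b];               // bucket start positions

    std::vector<TaggedUpdate> out(in.size());
    for (const TaggedUpdate& u : in)
        out[offset[(u.tag >> shift) & mask]++] = u;  // stable scatter
    return out;
}

// Multipass variant: after the last pass the updates are fully ordered by tag.
std::vector<TaggedUpdate> radix_partition(std::vector<TaggedUpdate> v,
                                          unsigned tag_bits, unsigned digit_bits) {
    for (unsigned shift = 0; shift < tag_bits; shift += digit_bits) {
        const unsigned left = tag_bits - shift;
        v = radix_pass(v, shift, left < digit_bits ? left : digit_bits);
    }
    return v;
}
```

Each pass reads its input sequentially and writes to at most 2^bits output streams, which is why the digit width matters for cache and TLB behavior.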

  32. Example: PageRank

  33. Example: PageRank
 [Diagram labels: 7, 17, 0.5]

  34. PageRank with OpenMP
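
The slide's code is not in the transcript. A push-style contribution step, one common way to write this part of PageRank with OpenMP, might look like the sketch below. The `CsrGraph` layout and the function name are assumptions, and damping and convergence checks are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compressed sparse row graph (hypothetical layout for this sketch):
// out-neighbors of u are neighbors[offsets[u] .. offsets[u + 1]).
struct CsrGraph {
    std::vector<std::size_t> offsets;
    std::vector<int> neighbors;
    std::size_t num_vertices() const { return offsets.size() - 1; }
};

// One push step: each vertex adds rank / out_degree to every out-neighbor.
// The write to sum[v] is the indirect access; atomics guard concurrent updates.
std::vector<float> push_contributions(const CsrGraph& g, const std::vector<float>& rank) {
    assert(rank.size() == g.num_vertices());
    std::vector<float> sum(g.num_vertices(), 0.0f);
    float* s = sum.data();
    const long n = static_cast<long>(g.num_vertices());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const std::size_t u = static_cast<std::size_t>(i);
        const std::size_t begin = g.offsets[u];
        const std::size_t end = g.offsets[u + 1];
        if (begin == end) continue;  // no out-edges, nothing to push
        const float contrib = rank[u] / static_cast<float>(end - begin);
        for (std::size_t e = begin; e < end; ++e) {
            #pragma omp atomic
            s[g.neighbors[e]] += contrib;
        }
    }
    return sum;
}
```

Read this way, the diagram labels 7, 17, and 0.5 would correspond to vertex 7 pushing a contribution of 0.5 to neighbor 17, though that reading is an interpretation of the residue rather than something the transcript states.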

  35. PageRank with milk

  36. PageRank with milk

  37. PageRank with milk
 [Diagram labels: 7, 17, 0.5]

  38. PageRank: Collection
 [Diagram labels: 0.5, 7]

  39. Tag Distribution
 [Diagram: pails in L2; 9-bit radix partition]

  40. Tag Distribution
 [Diagram: pails in L2; labels 17, 7, 0.5, p=7]

  41. Tag Distribution
 [Diagram: pails in L2; labels 17, 7, 0.5, p=7]

  42. Tag Distribution
 [Diagram: pails in L2; labels 17, 7, 0.5, p=7]

  43. Distribution: Pail Overflow
 [Diagram: pails in L2 overflowing to tubs in DRAM; labels 17, 7, 0.5, 0.2, p=7]
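
The pail-overflow idea can be sketched as small fixed-capacity buffers that are flushed in bulk to larger ones. "Pail" and "tub" are the slides' terms; the class `PailSet`, its interface, and the partition-by-high-bits rule are assumptions for illustration, and `std::vector` storage does not by itself control what stays in L2.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TaggedUpdate {
    std::uint32_t tag;
    float payload;
};

// Small fixed-capacity buffers ("pails") absorb scattered appends; a full pail
// is flushed in one sequential burst to its larger backing buffer ("tub").
class PailSet {
public:
    PailSet(std::size_t partitions, std::size_t pail_capacity, unsigned shift)
        : capacity_(pail_capacity), shift_(shift), pails_(partitions), tubs_(partitions) {
        assert(partitions > 0 && pail_capacity > 0 && shift < 32);
        for (auto& p : pails_) p.reserve(pail_capacity);
    }

    // Append to the pail selected by the tag's high bits; flush on overflow.
    void append(const TaggedUpdate& u) {
        const std::size_t p = u.tag >> shift_;
        assert(p < pails_.size());
        pails_[p].push_back(u);
        if (pails_[p].size() == capacity_) flush(p);
    }

    // Flush partially filled pails at the end of collection.
    void flush_all() {
        for (std::size_t p = 0; p < pails_.size(); ++p) flush(p);
    }

    const std::vector<TaggedUpdate>& tub(std::size_t p) const { return tubs_[p]; }

private:
    void flush(std::size_t p) {
        tubs_[p].insert(tubs_[p].end(), pails_[p].begin(), pails_[p].end());
        pails_[p].clear();
    }

    std::size_t capacity_;
    unsigned shift_;
    std::vector<std::vector<TaggedUpdate>> pails_;
    std::vector<std::vector<TaggedUpdate>> tubs_;
};
```

The point of the two levels is that random appends land in a small working set, while traffic to the large buffers is sequential.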

  44. Milk Delivery
 [Diagram: tubs in DRAM delivered through L2; labels 17 0.2, 27 0.1, 7 0.3, 0.5]

  45. Milk Delivery
 [Diagram: tubs in DRAM delivered through L2; labels 17 0.2, 27 0.1, 7 0.3, 0.5]

  46. Related Work
 • Database JOIN optimizations
 • [Shatdal94] cache partitioning
 • [Manegold02, Kim09, Albutiu12, Balkesen15] TLB, SIMD, NUMA, non-temporal writes, software write buffers

  47. Overall Speedup with milk
 [Chart: speedup on [GAPBS] benchmarks BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths); V=32M, [i7-4790K], 8 MB L3; annotated values 2.7× and 1.4×]

  48. Indirect Access Cache Hit %
 [Chart: cache hit %, baseline vs. milk, for BC, BFS, CC, PR, SSSP; V=32M, [i7-4790K], 8 MB L3, 256 KB L2]
 > 80% DRAM → < 22%

  49. Stall Cycle Reduction
 [Chart: % of total cycles spent in L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3), baseline vs. milk; PageRank, V=32M, d=16, uniform]
 baseline: 6 of 7 cycles stalled!

  50. Larger Graphs → Larger Speedups
 [Chart: speedup for V = 2M, 8M, 32M on BC, BFS, CC, PR, SSSP; d=16, uniform, 8 MB L3, [i7-4790K]]

  51. Higher Degree → Higher Locality
 [Chart: CountDegree speedup vs. average degree (1 to 64) for V=16M and V=32M; 16M edges to 2B edges]

  52. Q & A
 http://milk-lang.org/

  53. Backup Slides

  54. Graph Datasets
           Social                           Web             Road
 Graph     Facebook   Twitter   Twitter62   CC12    .sk     US
 Vertices  1.5 B      300 M     62 M        3.5 B   51 M    24 M
 Degree    290        200       24          36      39      2.4
 Sources: [Backstrom14] [Ching15] [Beamer15] [CommonCrawl]

  55. Degree Distribution
 [Chart: cumulative edges % vs. vertex degree rank for RMAT25, Uniform25, and Twitter’; labels V=62M, d=24 and V=32M, d=16; L3 capacity marked]
