
Whirlpool: Improving Dynamic Cache Management with Static Data Classification
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez (MIT CSAIL)
ASPLOS XXI – Atlanta, Georgia – 4 April 2016


  1. ASPLOS XXI – Atlanta, Georgia – 4 April 2016. Whirlpool: Improving Dynamic Cache Management with Static Data Classification. Anurag Mukkara, Nathan Beckmann, Daniel Sanchez, MIT CSAIL.

  2. Processors are limited by data movement. Data movement often consumes >50% of time & energy: an FP multiply-add costs ~20 pJ, while a DRAM access costs ~20,000 pJ. To scale performance, we must keep data near where it's used. But how do programs use memory? [Figure: cache banks. Good: nearby cache banks. Bad: faraway cache banks. Terrible: DRAM access.]

  3. Static policies have limitations. Static analysis of program code exploits program semantics or profiling, but it bakes a fixed policy into the binary, so it can't adapt to application phases, input-dependent behavior, or shared systems. E.g., scratchpads, bypass hints.

  4. Dynamic policies have limitations, too. Observing the binary's actual loads/stores makes a dynamic policy (e.g., data migration & replication) responsive to application behavior, but it is difficult to recover program semantics from loads/stores, and the mechanisms are expensive (e.g., extra data movement & directories).

  5. Combining static and dynamic is best. Static analysis or profiling of program code exploits program semantics at low overhead by classifying data into pools (A, B, C, D); observing the actual loads/stores at runtime keeps each pool's policy responsive to application behavior.

  6. Agenda: Case study; Manual classification; Parallel applications; WhirlTool.

  7. System configuration. Each core has private L1i/L1d caches and a private L2. Non-uniform cache access (NUCA): cache banks have different access latencies.

  8. Baseline dynamic NUCA scheme. We apply Whirlpool to Jigsaw [Beckmann, PACT'13], a state-of-the-art NUCA cache. Jigsaw allocates virtual caches (collections of parts of cache banks) and significantly outperforms prior D-NUCA schemes: it reduces cache misses and on-chip network traversals with simple mechanisms.

  9. Dynamic policies can reduce data movement. App: Delaunay triangulation. Compared to static NUCA, Jigsaw [Beckmann, PACT'13] performs somewhat better: 4% better performance, 12% lower energy.

  10. Static analysis can help! [Figure: access intensity vs. footprint (MB) for the Points, Vertices, and Triangles data structures.]

  11. Jigsaw with static classification. A few data structures (e.g., Points) are accessed far more frequently than others (Vertices, Triangles). Whirlpool vs. Jigsaw [Beckmann, PACT'13]: 19% better performance, 42% lower energy.

  12. Agenda: Case study; Manual classification; Parallel applications; WhirlTool.

  13. Whirlpool – manual classification. Organize application data into memory pools (e.g., Points and Triangles):
      int poolPoints = pool_create();
      Point* points = pool_malloc(sizeof(Point)*n, poolPoints);
      int poolTris = pool_create();
      Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
      Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);
      Insight: group semantically similar data into a pool.
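The pool API on this slide can be mimicked with a minimal shim. This sketch only tags allocations and tracks per-pool footprint; the real Whirlpool runtime additionally backs each pool with a virtual cache in Jigsaw. `MAX_POOLS` and the `pool_bytes` tracking array are assumptions of this sketch, not Whirlpool's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_POOLS 16  /* assumption of this sketch, not Whirlpool's limit */

static int num_pools = 0;
static size_t pool_bytes[MAX_POOLS];  /* bytes allocated per pool */

/* Create a new pool id. The real Whirlpool runtime would also back
 * the pool with its own virtual cache. */
int pool_create(void) {
    assert(num_pools < MAX_POOLS);
    return num_pools++;
}

/* Allocate memory tagged with a pool. This shim only tracks per-pool
 * footprint; the real runtime segregates the pool's pages so the
 * hardware can place them in the pool's virtual cache. */
void *pool_malloc(size_t size, int pool) {
    assert(pool >= 0 && pool < num_pools);
    pool_bytes[pool] += size;
    return malloc(size);
}
```

With a shim like this, the slide's snippet compiles as ordinary C; swapping in the real runtime changes only where the allocated bytes land in the cache hierarchy.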

  14. Minor changes to programs.
      Suite        Application               Pools  LOC
      PBBS         Delaunay triangulation    3      11
      PBBS         Maximal matching          3      13
      PBBS         Delaunay refinement       3      8
      PBBS         Maximal independent set   3      13
      PBBS         Minimal spanning forest   3      11
      SPECCPU2006  401.bzip2                 4      43
      SPECCPU2006  470.lbm                   2      21
      SPECCPU2006  429.mcf                   2      14
      SPECCPU2006  436.cactusADM             2      53

  15. Whirlpool on NUCA placement. Use pools to improve Jigsaw's decisions: each pool is allocated to a virtual cache, and Jigsaw transparently places pools in NUCA banks. Whirlpool requires no changes to core Jigsaw beyond slightly larger structures (a few KBs), plus minor improvements such as bypassing (see paper). Pools are useful elsewhere too, e.g., for dynamic prefetching.

  16. Significant improvements on some apps. [Figure: speedup and energy savings vs. Jigsaw for bzip2, refine, MST, lbm, mcf, cactus, matching, DT, and MIS.] Up to 38% better performance and up to 53% lower energy.

  17. Agenda: Case study; Manual classification; Parallel applications; WhirlTool.

  18. Conventional runtimes can harm locality: they optimize for load balance, not locality.

  19. Whirlpool co-locates tasks and data. Break the input into pools; the application indicates each task's affinity; the runtime schedules and steals tasks from near their data, and dynamically adapts data placement. This requires minimal changes to task-parallel runtimes.
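The "steal tasks from near their data" idea can be illustrated with a locality-aware victim choice: instead of stealing from a random victim as conventional work stealing does, a worker prefers the nearest victim with pending work. This is a toy sketch under assumed data structures (`dist`, `queue_len`, `NWORKERS`), not the runtime described in the paper:

```c
#include <assert.h>

#define NWORKERS 4  /* assumed worker count for this sketch */

/* Locality-aware victim selection: steal from the nearest worker that
 * has tasks. dist[i][j] is the on-chip distance between the cache
 * banks of workers i and j; queue_len[j] is the number of pending
 * tasks at worker j. Returns the victim id, or -1 if no work exists. */
int pick_victim(int self, const int dist[NWORKERS][NWORKERS],
                const int queue_len[NWORKERS]) {
    int victim = -1, best = 0;
    for (int j = 0; j < NWORKERS; j++) {
        if (j == self || queue_len[j] == 0) continue;
        if (victim < 0 || dist[self][j] < best) {
            victim = j;
            best = dist[self][j];
        }
    }
    return victim;
}
```

Because tasks are enqueued at the worker nearest their pool's data, stealing from the nearest non-empty queue keeps stolen tasks close to the data they touch.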

  20. Whirlpool improves locality.

  21. Whirlpool adapts the schedule dynamically: data placement implicitly schedules tasks.

  22. Significant improvements at 16 cores. Applications: divide-and-conquer algorithms (mergesort, FFT), graph analytics (PageRank, triangle counting, connected components), and graphics (Delaunay triangulation). Caveat: splitting data into pools can be expensive! [Figure: speedup and energy savings vs. Jigsaw for MS, FFT, TC, DT, PR, CC.] Up to 67% better performance and up to 2.6x lower energy.

  23. Agenda: Case study; Manual classification; Parallel applications; WhirlTool.

  24. WhirlTool – automated classification. Modifying program code is not always practical, so a profile-guided tool classifies data into pools automatically: the WhirlTool profiler intercepts the application's malloc() calls, the WhirlTool analyzer turns per-callpoint miss curves into a callpoint-to-pool map, and the WhirlTool runtime redirects allocations to pool_malloc() through the Whirlpool allocator.

  25. WhirlTool profiles miss curves. The profiler groups allocations by callpoint, profiles the accesses to each pool, and periodically records per-callpoint miss curves (misses as a function of cache size, over time).
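A miss curve of the kind the profiler records can be sketched with a textbook LRU stack-distance simulation over an address trace. This is a simplified stand-in for WhirlTool's profiler (full trace, O(trace x lines) scan, no sampling); `MAX_LINES` is an assumption of the sketch:

```c
#include <assert.h>
#include <string.h>

#define MAX_LINES 1024  /* max distinct cache lines tracked (sketch) */

/* Compute an LRU miss curve from an address trace: misses[s] is the
 * miss count of a fully associative LRU cache holding s lines. */
void miss_curve(const unsigned long *trace, int n,
                unsigned *misses, int max_size) {
    unsigned long stack[MAX_LINES];  /* stack[0] = most recently used */
    int depth = 0;
    memset(misses, 0, (max_size + 1) * sizeof(unsigned));
    for (int i = 0; i < n; i++) {
        /* Find the line's recency-stack depth (-1 if cold). */
        int d = -1;
        for (int j = 0; j < depth; j++)
            if (stack[j] == trace[i]) { d = j; break; }
        /* A cache of s lines hits iff the line is among the s most
         * recent, i.e. s > d; cold misses count for every size. */
        int first_hit = (d < 0) ? max_size + 1 : d + 1;
        for (int s = 0; s <= max_size && s < first_hit; s++)
            misses[s]++;
        /* Move the line to the top of the recency stack. */
        int end = (d < 0) ? depth : d;
        assert(end < MAX_LINES);
        for (int j = end; j > 0; j--) stack[j] = stack[j - 1];
        stack[0] = trace[i];
        if (d < 0) depth++;
    }
}
```

Running this once per callpoint's access stream yields exactly the per-callpoint curves the analyzer consumes.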

  26. WhirlTool analyzes curves to find pools. Hardware can only support a limited number of pools: Jigsaw uses 3 virtual caches per thread (0.6% area overhead over the LLC), and Whirlpool adds 4 pools, each mapped to a virtual cache (1.2% total area overhead over the LLC). WhirlTool must therefore cluster callpoints into semantically similar groups: agglomerative clustering reduces the per-callpoint miss curves to a callpoint-to-pool mapping.

  27. Example of agglomerative clustering. [Figure: callpoints merged pairwise into progressively fewer clusters.]

  28. WhirlTool's distance metric: how many misses are saved by separating pools? [Figure: miss curves vs. cache size. Pools 1 and 2 have similar curves (small distance); pool 3 differs (large distance). Combined vs. separated miss curves illustrate the savings.]

  29. WhirlTool matches manual hints. [Figure: speedup vs. Jigsaw for manual classification and WhirlTool across leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, and MIS, with speedups up to 38%.]

  30. Multiprogram mixes. On a 4-core system running random SPECCPU2006 apps (including those that do not benefit), Whirlpool improves performance (gmean over 20 mixes) by 35% over S-NUCA, 30% over an idealized shared-private D-NUCA [Herrero, ISCA'10], 26% over R-NUCA [Hardavellas, ISCA'09], 18% over page placement by Awasthi et al. [Awasthi, HPCA'09], and 5% over Jigsaw [Beckmann, PACT'13].

  31. Conclusion. Semantic information from applications improves the performance of dynamic policies; coordinated data and task placement gives large improvements in parallel applications; and automated classification reduces programmer burden.

  32. Thanks for your attention! Questions are welcome! WhirlTool code available at http://bit.ly/WhirlTool
