ASPLOS XXI – Atlanta, Georgia – 4 April 2016
Whirlpool: Improving Dynamic Cache Management with Static Data Classification
Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
MIT CSAIL
Processors are limited by data movement
Data movement often consumes >50% of time & energy
E.g., an FP multiply-add costs 20 pJ; a DRAM access costs 20,000 pJ
To scale performance, we must keep data near where it's used
But how do programs use memory?
[Figure: tiled chip with cache banks. Good: nearby cache banks; Bad: faraway cache banks; Terrible: DRAM access]
Static policies have limitations
[Diagram: Program Code → static analysis or profiling → Binary with a fixed policy, e.g., scratchpads, bypass hints]
Exploits program semantics
Fixed policy: can't adapt to application phases, input-dependent behavior, or shared systems
Dynamic policies have limitations, too
[Diagram: Binary → observe loads/stores → dynamic policy, e.g., data migration & replication]
Responsive to actual application behavior
Difficult to recover program semantics from loads/stores
Expensive mechanisms (e.g., extra data movement & directories)
Combining static and dynamic is best
[Diagram: Program Code → static analysis or profiling → Binary with data grouped into Pools A-D; at runtime, observed loads/stores drive a separate Policy for each pool]
Exploits program semantics at low overhead
Responsive to actual application behavior
Agenda: Case study | Manual classification | Parallel applications | WhirlTool
System configuration
[Figure: tiled multicore; each tile has a core, L1i/L1d, a private L2, and a shared LLC bank]
Non-uniform cache access (NUCA): cache banks have different access latencies
Baseline dynamic NUCA scheme
We apply Whirlpool to Jigsaw [Beckmann, PACT'13], a state-of-the-art NUCA cache
Jigsaw allocates virtual caches, collections of parts of cache banks
It significantly outperforms prior D-NUCA schemes:
  Reduces cache misses
  Reduces on-chip network traversals
  Uses simple mechanisms
Dynamic policies can reduce data movement
App: Delaunay triangulation
[Figure: LLC data placement under static NUCA vs. Jigsaw [Beckmann, PACT'13]]
The dynamic policy performs somewhat better:
  4% better performance
  12% lower energy
Static analysis can help!
[Figure: access intensity (accesses / footprint in MB) of the app's main data structures: Points, Vertices, Triangles]
Jigsaw with Static Classification
A few data structures (Points, Vertices, Triangles) are accessed far more intensely than the rest
Whirlpool vs. Jigsaw [Beckmann, PACT'13]:
  19% better performance
  42% lower energy
Agenda: Case study | Manual classification | Parallel applications | WhirlTool
Whirlpool – Manual classification
Organize application data into memory pools (e.g., Points, Triangles):

int poolPoints = pool_create();
Point* points = pool_malloc(sizeof(Point)*n, poolPoints);

int poolTris = pool_create();
Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);

Insight: group semantically similar data into a pool (see the sketch below)
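To illustrate why the per-application changes stay small (see the table on the next slide), here is a minimal sketch of how a program might adopt the pool API behind a thin wrapper that falls back to plain malloc when the Whirlpool runtime is absent. The POOL_CREATE/POOL_MALLOC macros and the whirlpool.h header name are assumptions for illustration, not part of Whirlpool's published interface.

```c++
// Hypothetical fallback wrapper (illustration only, not Whirlpool's API):
// pool-annotated code still builds and runs without the Whirlpool runtime.
#include <cstdlib>

#ifdef WHIRLPOOL
  #include "whirlpool.h"                         // assumed header exposing pool_create/pool_malloc
  #define POOL_CREATE()         pool_create()
  #define POOL_MALLOC(sz, pool) pool_malloc((sz), (pool))
#else
  #define POOL_CREATE()         (-1)             // pools ignored
  #define POOL_MALLOC(sz, pool) std::malloc(sz)  // plain allocation
#endif

struct Point { double x, y; };

Point* allocPoints(std::size_t n) {
    static int poolPoints = POOL_CREATE();        // one pool per semantic data structure
    return (Point*) POOL_MALLOC(sizeof(Point) * n, poolPoints);
}
```

With this kind of pattern, moving a data structure into its own pool is a one- or two-line change at each allocation site.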
Minor changes to programs

Suite          Application                Pools   LOC changed
PBBS           Delaunay triangulation       3        11
PBBS           Maximal matching             3        13
PBBS           Delaunay refinement          3         8
PBBS           Maximal independent set      3        13
PBBS           Minimal spanning forest      3        11
SPEC CPU2006   401.bzip2                    4        43
SPEC CPU2006   470.lbm                      2        21
SPEC CPU2006   429.mcf                      2        14
SPEC CPU2006   436.cactusADM                2        53
Whirlpool on NUCA placement
Use pools to improve Jigsaw's decisions:
  Each pool is allocated to a virtual cache
  Jigsaw transparently places pools in NUCA banks
Whirlpool requires no changes to core Jigsaw:
  Increases the size of a few structures (a few KBs)
  Minor improvements, e.g., bypassing (see paper)
Pools are useful elsewhere too, e.g., to guide dynamic prefetching
Significant improvements on some apps
[Charts: speedup and energy savings vs. Jigsaw for bzip2, refine, MST, lbm, mcf, cactus, matching, DT, MIS]
Up to 38% better performance
Up to 53% lower energy
Agenda: Case study | Manual classification | Parallel applications | WhirlTool
Conventional runtimes can harm locality
They optimize for load balance, not locality
Whirlpool co-locates tasks and data
Break the input into pools
The application indicates each task's affinity to a pool
Schedule and steal tasks near their data (sketch below)
Dynamically adapt data placement
Requires minimal changes to task-parallel runtimes
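A minimal sketch of the scheduling idea, assuming a toy task-parallel runtime: tasks carry a pool-affinity hint, land in the queue of a worker near that pool's banks, and idle workers steal from the nearest non-empty queue first. The AffinityRuntime class and its methods are hypothetical and single-threaded (no synchronization); they only illustrate the policy, not Whirlpool's actual runtime changes.

```c++
#include <deque>
#include <functional>
#include <vector>

// Toy task-parallel runtime with pool-affinity hints (illustration only).
class AffinityRuntime {
public:
    explicit AffinityRuntime(int nWorkers) : queues_(nWorkers) {}

    // The application tags each task with the pool that holds its data;
    // the task is enqueued at the worker assumed to sit near that pool's banks.
    void spawn(std::function<void()> task, int pool) {
        queues_[workerNearPool(pool)].push_back(std::move(task));
    }

    // A worker drains its own queue first; when empty, it steals from the
    // nearest non-empty queue, so stolen tasks still run close to their data.
    void run(int self) {
        while (true) {
            std::function<void()> task;
            if (!queues_[self].empty()) {
                task = std::move(queues_[self].front());
                queues_[self].pop_front();
            } else if (int victim = nearestVictim(self); victim >= 0) {
                task = std::move(queues_[victim].back());   // steal from the tail
                queues_[victim].pop_back();
            } else {
                return;                                     // no work left anywhere
            }
            task();
        }
    }

private:
    // Placeholder placement: pretend pool p's banks sit near worker p % nWorkers.
    // In Whirlpool, placement comes from the dynamic NUCA allocator instead.
    int workerNearPool(int pool) const { return pool % (int)queues_.size(); }

    // Scan outward from 'self'; a real runtime would order victims by NUCA distance.
    int nearestVictim(int self) const {
        int n = (int)queues_.size();
        for (int d = 1; d < n; d++)
            for (int v : {(self + d) % n, (self - d + n) % n})
                if (!queues_[v].empty()) return v;
        return -1;
    }

    std::vector<std::deque<std::function<void()>>> queues_;
};
```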
Whirlpool improves locality
Whirlpool adapts schedule dynamically
Data placement implicitly schedules tasks
Significant improvements at 16 cores
Applications:
  Divide-and-conquer algorithms: Mergesort (MS), FFT
  Graph analytics: PageRank (PR), Triangle Counting (TC), Connected Components (CC)
  Graphics: Delaunay Triangulation (DT)
Caveat: splitting data into pools can be expensive!
[Charts: speedup and energy savings vs. Jigsaw for MS, FFT, TC, DT, PR, CC]
Up to 67% better performance
Up to 2.6x lower energy
Agenda: Case study | Manual classification | Parallel applications | WhirlTool
WhirlTool – Automated classification
Modifying program code is not always practical
A profile-guided tool can automatically classify data into pools (profiling sketch below)
[Pipeline: Application (malloc) → WhirlTool Profiler → per-callpoint miss curves → WhirlTool Analyzer → callpoint-to-pool map → WhirlTool runtime (pool_malloc) → Whirlpool Allocator]
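A minimal sketch of the profiling side of this pipeline, assuming the profiler can intercept heap allocations; the profiled_malloc/profiled_free names and the return-address trick (GCC/Clang __builtin_return_address) are illustrative choices, not WhirlTool's actual implementation. The point is only that allocations get grouped by the call site ("callpoint") that issued them.

```c++
#include <cstdint>
#include <cstdlib>
#include <unordered_map>

// Illustrative profiling shim: group heap allocations by the callpoint that
// issued them, so accesses and misses can later be attributed per callpoint.
struct AllocInfo { void* callpoint; std::size_t size; };

static std::unordered_map<void*, AllocInfo>      g_liveAllocs;         // block -> origin
static std::unordered_map<void*, std::uint64_t>  g_bytesPerCallpoint;  // callpoint -> bytes

void* profiled_malloc(std::size_t size) {
    void* callpoint = __builtin_return_address(0);   // identifies the calling site
    void* p = std::malloc(size);
    if (p) {
        g_liveAllocs[p] = {callpoint, size};
        g_bytesPerCallpoint[callpoint] += size;
    }
    return p;
}

void profiled_free(void* p) {
    g_liveAllocs.erase(p);
    std::free(p);
}
```

The analyzer then works with per-callpoint miss curves rather than raw allocation sizes, as the next slide describes.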
WhirlTool profiles miss curves
Groups allocations by callpoint
Profiles accesses to each pool
Periodically records per-callpoint miss curves (see the stack-distance sketch below)
[Figure: per-callpoint miss curves (misses vs. cache size) recorded over time]
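For intuition, a miss curve can be derived offline from an access trace using LRU stack (reuse) distances, in the spirit of Mattson's classic stack algorithm. The sketch below is a naive software version for illustration only; it is not how WhirlTool's profiler records per-callpoint miss curves.

```c++
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

// Naive LRU stack-distance profiler: misses(s) for a cache of s lines equals
// cold misses plus all reuses whose stack depth is at least s.
struct MissCurveBuilder {
    std::list<std::uint64_t> stack;                               // MRU at the front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos;
    std::vector<std::uint64_t> depthHist;                         // depthHist[d] = reuses at depth d
    std::uint64_t coldMisses = 0;

    void access(std::uint64_t line) {                             // cache-line address
        auto it = pos.find(line);
        if (it == pos.end()) {
            coldMisses++;                                         // first touch
        } else {
            std::size_t depth = 0;                                // lines touched more recently
            for (auto s = stack.begin(); s != it->second; ++s) depth++;
            if (depth >= depthHist.size()) depthHist.resize(depth + 1, 0);
            depthHist[depth]++;
            stack.erase(it->second);
        }
        stack.push_front(line);
        pos[line] = stack.begin();
    }

    std::uint64_t missesAt(std::size_t lines) const {             // one point on the miss curve
        std::uint64_t m = coldMisses;
        for (std::size_t d = lines; d < depthHist.size(); d++) m += depthHist[d];
        return m;
    }
};
```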
WhirlTool analyzes curves to find pools
Hardware can only support a limited number of pools:
  Jigsaw uses 3 virtual caches per thread (0.6% area overhead over the LLC)
  Whirlpool adds 4 pools, each mapped to a virtual cache (1.2% total area overhead over the LLC)
Must cluster callpoints into semantically similar groups:
[Pipeline: per-callpoint miss curves → agglomerative clustering → callpoint-to-pool mapping]
Example of agglomerative clustering
[Figure: per-callpoint miss curves merged step by step into clusters 1, 2, and 3]
WhirlTool's distance metric
How many misses are saved by separating pools? (sketch below)
[Figure: miss curves of Pools 1-3 over cache size; Pools 1 and 2 have similar curves (small distance), Pool 3 differs (large distance); combined vs. separated miss curves]
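A minimal sketch of this idea, under a simplifying assumption: the shared-partition miss count is modeled by splitting capacity in proportion to each group's access rate, while the separated case takes the best static split. The exact curve combination WhirlTool uses differs; the sketch only captures that the distance is the misses saved by separating two callpoint groups, which agglomerative clustering then uses to merge the closest pair until only as many clusters remain as the hardware supports.

```c++
#include <algorithm>
#include <cstdint>
#include <vector>

// misses[s] = misses observed with s cache units; curves assumed non-empty.
using MissCurve = std::vector<std::uint64_t>;

// Separated case: best static split of 'total' units between the two groups.
std::uint64_t bestSplitMisses(const MissCurve& a, const MissCurve& b, std::size_t total) {
    std::uint64_t best = UINT64_MAX;
    for (std::size_t sa = 0; sa <= total; sa++) {
        std::size_t sb = total - sa;
        std::uint64_t m = a[std::min(sa, a.size() - 1)] + b[std::min(sb, b.size() - 1)];
        best = std::min(best, m);
    }
    return best;
}

// Shared case (simplifying assumption): capacity divided by access intensity.
std::uint64_t sharedMisses(const MissCurve& a, const MissCurve& b, std::size_t total,
                           double accRateA, double accRateB) {
    std::size_t sa = (std::size_t)(total * accRateA / (accRateA + accRateB));
    std::size_t sb = total - sa;
    return a[std::min(sa, a.size() - 1)] + b[std::min(sb, b.size() - 1)];
}

// Distance = misses saved by separating: small for similar/compatible curves,
// large when dedicated partitions help a lot.
std::uint64_t distance(const MissCurve& a, const MissCurve& b, std::size_t total,
                       double accRateA, double accRateB) {
    std::uint64_t shared = sharedMisses(a, b, total, accRateA, accRateB);
    std::uint64_t split  = bestSplitMisses(a, b, total);
    return shared > split ? shared - split : 0;
}
```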
WhirlTool matches manual hints
[Chart: speedup vs. Jigsaw (%) with Manual vs. WhirlTool classification for leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, MIS; the largest bars reach 38%]
Multiprogram mixes
4-core system running random SPEC CPU2006 apps, including apps that do not benefit from Whirlpool
Whirlpool improves performance by (gmean over 20 mixes):
  35% over S-NUCA
  30% over an idealized shared-private D-NUCA [Herrero, ISCA'10]
  26% over R-NUCA [Hardavellas, ISCA'09]
  18% over page placement by Awasthi et al. [Awasthi, HPCA'09]
  5% over Jigsaw [Beckmann, PACT'13]
Conclusion
Semantic information from applications improves the performance of dynamic policies
Coordinated data and task placement gives large improvements in parallel applications
Automated classification reduces programmer burden
Thanks for your attention! Questions are welcome!
WhirlTool code available at http://bit.ly/WhirlTool