On Automated Feedback-Driven Data Placement in Multi-tiered Memory • T. Chad Effler (1), Adam P. Howard (1), Tong Zhou (1), Michael R. Jantz (1), Kshitij A. Doshi (2), and Prasad A. Kulkarni (3) • (1) University of Tennessee, {teffler,ahoward,tzhou9,mrjantz}@utk.edu • (2) Intel Corporation, kshitij.a.doshi@intel.com • (3) University of Kansas, kulkarni@ittc.ku.edu
The Problem • Multi-Tiered Memory Hierarchies • Different Capacities • Different Performance • Cross Layer Data Management for Heterogeneous Tiers • Match Memory Needs Efficiently • Optimality • Transparency • Simplicity
Current Solutions • Hardware Managed Caching • Non-Flexible • Large Architectural Costs • OS Managed Data Placement • Reactive • Relies on Non-Standard Hardware • Developer Managed Data Placement
Feedback-Driven Data Placement • Allocation Site Partitioning • Knapsack • Hotset • Profile-Guided Management • Static Arena Allocation • Phase-based Arena Allocation
Collecting Application Guidance • Track by allocation sites • Site = call path up to malloc, calloc, etc. • Intuition 1: pedigree is a good predictor • Intuition 2: profile transferability • Profile = { ∀ site S : ⟨peak RSS, # post-cache accesses⟩ } • Partition sites into hot and cold sets • Knapsack, Hotset
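The partitioning step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the page-granularity sizes, and the sample numbers are all hypothetical. It assumes "Knapsack" means a 0/1 knapsack that maximizes post-cache accesses under the fast-tier capacity, and "Hotset" means a greedy pick by accesses per unit of peak RSS.

```python
from typing import Dict, Set, Tuple

# Hypothetical profile: site -> (peak RSS in pages, post-cache accesses).
PROFILE: Dict[str, Tuple[int, int]] = {
    "S1": (4, 10), "S2": (3, 90), "S3": (5, 20),
    "S4": (2, 5), "S5": (6, 15), "S6": (2, 80),
}


def knapsack_hot_set(profile: Dict[str, Tuple[int, int]], capacity: int) -> Set[str]:
    """0/1 knapsack: maximize total accesses with total peak RSS <= capacity."""
    # best[c] = (accesses, chosen sites) for the best selection using at most c pages.
    best = [(0, frozenset())] * (capacity + 1)
    for site, (size, accesses) in profile.items():
        # Iterate capacities downward so each site is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + accesses
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] | {site})
    return set(best[capacity][1])


def hotset_hot_set(profile: Dict[str, Tuple[int, int]], capacity: int) -> Set[str]:
    """Greedy variant (assumed): rank by accesses per page, skip sites that do not fit."""
    ranked = sorted(
        profile,
        key=lambda s: profile[s][1] / max(profile[s][0], 1),
        reverse=True,
    )
    hot, used = set(), 0
    for site in ranked:
        size = profile[site][0]
        if used + size <= capacity:
            hot.add(site)
            used += size
    return hot


HOT_KNAPSACK = knapsack_hot_set(PROFILE, capacity=5)
HOT_GREEDY = hotset_hot_set(PROFILE, capacity=5)
COLD = set(PROFILE) - HOT_KNAPSACK
```

With these made-up numbers and a 5-page fast tier, both strategies select S2 and S6 as hot and leave the remaining sites cold. On other inputs the two can disagree, since the greedy pass is not guaranteed to be optimal.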
Applying Application Guidance • Profile-guided allocation into arenas • Static: one arena for each tier [Figure: allocation guidance maps each site to an arena: S1 Cold → A1, S2 Hot → A0, S3 Cold → A1, S4 Cold → A1, S5 Cold → A1, S6 Hot → A0, … In the application address space, A0 contains sites 2 and 6 and A1 contains sites 1, 3, 4, and 5. Hardware: CPU 0 with MCDRAM and DDR.]
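The static scheme amounts to a lookup from allocation site to arena. The sketch below uses the hot/cold labels from the slide; the function names and the fallback for unprofiled sites are hypothetical, and binding A0 to MCDRAM and A1 to DDR is an assumption rather than something the slide states.

```python
from collections import defaultdict

# Guidance as shown on the slide: site -> "hot" or "cold".
GUIDANCE = {
    "S1": "cold", "S2": "hot", "S3": "cold",
    "S4": "cold", "S5": "cold", "S6": "hot",
}

# Assumption: A0 is backed by the fast tier (MCDRAM) and A1 by DDR.
ARENA_FOR_CLASS = {"hot": "A0", "cold": "A1"}


def arena_for_site(site, guidance=GUIDANCE, default="cold"):
    """Pick the arena for an allocation site.

    Sites missing from the profile fall back to the cold arena
    (a hypothetical policy chosen for this sketch).
    """
    return ARENA_FOR_CLASS[guidance.get(site, default)]


def arena_contents(guidance=GUIDANCE):
    """Group profiled sites by the arena they allocate from."""
    contents = defaultdict(list)
    for site in sorted(guidance):
        contents[arena_for_site(site, guidance)].append(site)
    return dict(contents)


CONTENTS = arena_contents()
```

Because the mapping is fixed before the run, the allocator never has to migrate data: every allocation from a given site lands in the same arena for the whole execution.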
Applying Application Guidance • Profile-guided allocation into arenas • Per-phase: one arena for each phase signature [Figure, shown once for phase 0 and once for phase 1: allocation guidance maps each site to the arena for its phase signature: S1: 10001 → A0, S2: 01111 → A1, S3: 10001 → A0, S4: 11010 → A2, S5: 10100 → A3, S6: 11010 → A2, … Arenas in the application address space: A0: 10001, A1: 01111, A2: 11010, A3: 10100. Hardware: CPU 0 with MCDRAM and DDR.]
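Grouping sites by phase signature can be sketched as follows. The signatures are the ones on the slide; everything else is an assumption made for illustration, in particular that bit i of a signature marks the site as hot in phase i (reading left to right) and that arenas are numbered in order of first appearance. The function names are hypothetical.

```python
# Phase signatures as shown on the slide: site -> bit string, one bit per phase.
SIGNATURES = {
    "S1": "10001", "S2": "01111", "S3": "10001",
    "S4": "11010", "S5": "10100", "S6": "11010",
}


def assign_arenas(signatures):
    """Give every distinct signature its own arena, numbered by first appearance."""
    arena_of_signature = {}
    site_to_arena = {}
    for site, sig in signatures.items():
        if sig not in arena_of_signature:
            arena_of_signature[sig] = f"A{len(arena_of_signature)}"
        site_to_arena[site] = arena_of_signature[sig]
    return site_to_arena, arena_of_signature


def hot_arenas(arena_of_signature, phase):
    """Arenas whose signature has a 1 in the given phase (assumed to mean 'hot')."""
    return sorted(a for sig, a in arena_of_signature.items() if sig[phase] == "1")


SITE_TO_ARENA, ARENA_OF_SIGNATURE = assign_arenas(SIGNATURES)
HOT_PHASE_0 = hot_arenas(ARENA_OF_SIGNATURE, 0)
HOT_PHASE_1 = hot_arenas(ARENA_OF_SIGNATURE, 1)
```

Under this reading, a phase change only requires deciding which whole arenas belong in the fast tier, rather than reconsidering each site individually.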
Simulation Framework • Marena: arena allocation library • Memtracer: Pin-based instrumentation tool • Ramulator: cycle-accurate DRAM simulator
Marena • Arena-based allocator • Built on jemalloc • Allocation-site guidance
Memtracer • Pin-based instrumentation tool • Profiling • Generates trace files
Ramulator • Memory simulator • Trace-based, cycle-accurate modeling • Modified to support tiered memory simulation
Framework [Figure: framework diagram. Labeled components: Executable (malloc(), free()), Arena Allocator, allocation guidance, memory usage guidance, Pin, instruction trace, and Ramulator (CPU, memory controller, HBM, DDR).]
Evaluation • Benchmarks: SPEC CPU2006 • Multi-tier configuration: HBM-DDR • HBM capacity = 12.5% of upper-level memory
Evaluation • Baseline Modes • Cache mode • Static First Touch (FT) • HBM/DDR only • Feedback-Guidance-Directed Modes • Static Train • Static Ref • Adaptive Ref • Reactive Profiling • First Touch Hot Page (FTHP)
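For readers unfamiliar with the static first-touch baseline, the sketch below shows the usual meaning of the term: each page is placed in the fast tier the first time it is touched, until that tier is full, and never moves afterwards. The function name, page IDs, and capacity are hypothetical, and the slides do not specify how the evaluated baseline is implemented.

```python
def first_touch_placement(page_accesses, hbm_capacity_pages):
    """Place each page on its first touch: HBM while it has room, DDR afterwards.

    page_accesses: iterable of page IDs in access order.
    Returns a dict mapping page ID -> "HBM" or "DDR". Pages never migrate.
    """
    placement = {}
    hbm_used = 0
    for page in page_accesses:
        if page in placement:
            continue  # already placed on an earlier touch
        if hbm_used < hbm_capacity_pages:
            placement[page] = "HBM"
            hbm_used += 1
        else:
            placement[page] = "DDR"
    return placement


# Hypothetical trace: pages 7 and 3 are touched first and fill a 2-page HBM.
PLACEMENT = first_touch_placement([7, 3, 7, 9, 3, 1], hbm_capacity_pages=2)
```

The policy uses no profile at all, which is why it serves as a baseline: whichever pages happen to be touched first occupy the fast tier, whether or not they turn out to be hot.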
[Figure: HBM-DDR3: baseline performance. Y-axis: IPC relative to DDR3-only (0.0 to 4.0). X-axis: benchmarks. Panels: 512 KB cache (bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average) and 8 MB cache (mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average). Series: HBM-only, static-FT, cache-mode.]
[Figure: Performance of static guidance strategies. Y-axis: IPC relative to DDR3-only (0.0 to 3.0). X-axis: benchmarks. Panels: 512 KB cache (bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average) and 8 MB cache (mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average). Series: static-ref, static-train, static-FT.]
[Figure: Performance of static and adaptive policies. Y-axis: IPC relative to DDR3-only (0.0 to 3.0). X-axis: benchmarks. Panels: 512 KB cache (bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average) and 8 MB cache (mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average). Series: FTHP, adaptive-ref, static-ref.]