CoMerge: Toward Efficient Data Placement in Shared Heterogeneous Memory Systems


  1. CoMerge: Toward Efficient Data Placement in Shared Heterogeneous Memory Systems. Thaleia Dimitra Doudali, Ada Gavrilovska.

  2. Motivation: Performance slowdown in heterogeneous memory systems. Application data objects live in a heterogeneous memory subsystem that combines DRAM with Non-Volatile Memory. Higher access latency ⇒ performance slowdown relative to ‘all-data-in-DRAM’ execution. How to reduce the slowdown? [Figure labels: application data objects; DRAM; Non-Volatile Memory; cost ↑.] MEMSYS 17

  3. Existing Solutions: Data tiering that maximizes DRAM accesses. Think about which data objects get allocated in DRAM, so that more memory requests are served with lower latency.
     Existing solutions:
     1. X-Mem - Dulloor et al.
     2. Dataplacer - Shen et al.
     3. Valgrind extension - Peña, Balaji.
     [Figure labels: application data objects; DRAM; Non-Volatile Memory; heterogeneous memory subsystem.]

  4. Problem Statement: Limited utility of existing solutions in shared systems. When Application 1 and Application 2 place their data objects in a shared memory system (DRAM and Non-Volatile Memory), which objects should now be in DRAM?
     Do the partitioning techniques using existing solutions:
     ● Reduce the slowdown across all collocated applications?
     ● Maximize DRAM utilization?
     ⇒ NO

  5. Our Contributions: What do we need to do differently?
     1. Sorting objects within one application: the co-benefit metric captures:
        a. Exact contribution of a data object to overall application runtime.
        b. Overall application sensitivity to execution over Non-Volatile Memory.
     2. Distributing DRAM across applications: the CoMerge memory sharing technique.
        a. Mitigates slowdown across all collocated applications.
        b. Maximizes the DRAM usage.

  6. Observations: What are we going to see next?
     1. Not all applications are slowed down to the same degree when accessing Non-Volatile Memory.
     2. Not all data objects of an application help reduce the performance slowdown when placed in DRAM.
     Benchmarks:
     ● Polybench: 30 simple algebraic kernels. Single-threaded.
     ● CORAL suite of mini-apps: 3 HPC representative kernels. Multi-threaded (OpenMP).
     Hardware testbed: Non-Volatile Memory is emulated for various combinations of reduced bandwidth and increased latency, e.g. B 0.5 : L 2 means 0.5 times the DRAM bandwidth : 2 times the DRAM latency. [Figure labels: CPU; DRAM; emulated NVM.]
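The B : L notation used for the emulated testbed can be read as a pair of scaling factors applied to a DRAM baseline. The sketch below is only an illustration of that reading: the `MemoryProfile` type, the function names, and the baseline numbers are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MemoryProfile:
    """Hypothetical description of a memory component."""
    bandwidth_gbs: float  # sustained bandwidth in GB/s
    latency_ns: float     # access latency in nanoseconds


def parse_setting(setting: str) -> Tuple[float, float]:
    """Parse a string such as 'B 0.5 : L 2' into (bandwidth_factor, latency_factor)."""
    b_part, l_part = (part.strip() for part in setting.split(":"))
    b_tag, b_val = b_part.split()
    l_tag, l_val = l_part.split()
    if b_tag != "B" or l_tag != "L":
        raise ValueError(f"unexpected setting: {setting!r}")
    return float(b_val), float(l_val)


def emulate_nvm(dram: MemoryProfile, setting: str) -> MemoryProfile:
    """Scale the DRAM baseline: bandwidth is multiplied by B, latency by L."""
    b_factor, l_factor = parse_setting(setting)
    return MemoryProfile(dram.bandwidth_gbs * b_factor, dram.latency_ns * l_factor)


# Assumed baseline values, chosen only to make the arithmetic easy to follow.
dram = MemoryProfile(bandwidth_gbs=20.0, latency_ns=80.0)
print(emulate_nvm(dram, "B 0.5 : L 2"))
```

Under this reading, a setting such as B 0.2 : L 5 describes a slower emulated memory than B 0.5 : L 2, since it has both less bandwidth and more latency.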

  7. Overall Application Sensitivity: Do all applications get slowed down in the same way when accessing Non-Volatile Memory? Applications show different levels of sensitivity to execution over slower memory components. [Figure: performance slowdown across Polybench/C, normalized to ‘all-data-in-DRAM’ execution; sensitivity classes: High, Medium, Low, None.]

  8. Data Object Sensitivity: Do all data objects help minimize the slowdown when allocated in DRAM? [Figure: per-object results with NVM fixed at B 0.2 : L 5; panels annotated with the observation numbers below.]
     Observations:
     1. For applications with no or low sensitivity, it does not matter which object is in DRAM.
     2. Different data objects can contribute equally to the application runtime.
     3. There can be objects whose allocation in DRAM is the only way to mitigate slowdown.

  9. Co-Benefit Metric: Can we capture the previous observations?
     Run time by objects in DRAM: F = all objects in DRAM, t(O) = only object O in DRAM, S = none in DRAM.
     ● Normalize: How much does a specific object help reduce the slowdown? B(O) places t(O) on a scale where S = 0 and F = 1.
     ● Scale: How can we make sure that objects of higher sensitivity kernels are getting prioritized? coB(O) rescales B(O) so that S = 0 and F = S/F.
     e.g. B(O) = 0.9 ⇒ coB(O) = 0.9 * low sensitivity = 0.9, whereas coB(O) = 0.9 * high sensitivity = 3.9.
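The normalize and scale steps of the co-benefit metric can be sketched as a short calculation. This is a reading of the slide, not the authors' code: it assumes B(O) is a linear interpolation of t(O) between S and F, and that the sensitivity factor is the ratio S/F. The function names and the run times in the usage lines are hypothetical.

```python
def benefit(t_obj: float, fast: float, slow: float) -> float:
    """B(O): map t(O) onto a scale where S (nothing in DRAM) is 0 and F (all in DRAM) is 1.

    Assumption: the mapping is linear between the two reference run times.
    """
    if slow <= fast:
        # The application is not slowed down at all, so no object can help.
        return 0.0
    return (slow - t_obj) / (slow - fast)


def co_benefit(t_obj: float, fast: float, slow: float) -> float:
    """coB(O): scale B(O) by the application's overall sensitivity, assumed to be S/F."""
    return benefit(t_obj, fast, slow) * (slow / fast)


# Hypothetical run times (seconds). Both objects have B(O) = 0.9,
# but the object of the more sensitive application is ranked higher.
high = co_benefit(t_obj=4.0, fast=3.0, slow=13.0)    # sensitivity S/F = 13/3
low = co_benefit(t_obj=10.1, fast=10.0, slow=11.0)   # sensitivity S/F = 1.1
print(round(high, 2), round(low, 2))
```

With these numbers the first object reproduces the slide's high-sensitivity example (0.9 scaled up to roughly 3.9), while the second stays close to its unscaled benefit.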

  10. DRAM Distribution: What are the goals of an efficient technique?
     1. Minimize overall runtime slowdown across all applications.
     2. Maximize the utilization of DRAM.
     [Figure: runtime under All-in-DRAM versus collocation, with the overall slowdown attributed to sharing and data tiering; DRAM holding Object 1, Object 2, Object 3 and an unutilized region.]

  11. Sharing DRAM: Sorting objects using the co-benefit metric. [Figure: DRAM contents under Fair Merge and CoMerge for jacobi-2d (high sensitivity) and adi (low sensitivity).]
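The slide names the two sharing policies but the transcript does not spell out their steps. The sketch below is therefore an assumption-laden illustration: it takes CoMerge to mean merging all applications' objects into one list ordered by co-benefit, takes Fair Merge to mean interleaving the per-application sorted lists round-robin, and fills DRAM greedily with whole objects. The `DataObject` type, object names, sizes, and co-benefit values are all hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List, NamedTuple, Tuple


class DataObject(NamedTuple):
    app: str
    name: str
    size_mb: int
    co_benefit: float


def co_merge_order(apps: Dict[str, List[DataObject]]) -> List[DataObject]:
    """Assumed CoMerge: one global list, highest co-benefit first."""
    merged = [obj for objs in apps.values() for obj in objs]
    return sorted(merged, key=lambda o: o.co_benefit, reverse=True)


def fair_merge_order(apps: Dict[str, List[DataObject]]) -> List[DataObject]:
    """Assumed Fair Merge: take the next-best object of each application in turn."""
    per_app = [sorted(objs, key=lambda o: o.co_benefit, reverse=True)
               for objs in apps.values()]
    order: List[DataObject] = []
    for round_objs in zip_longest(*per_app):
        order.extend(obj for obj in round_objs if obj is not None)
    return order


def fill_dram(order: List[DataObject], dram_mb: int) -> Tuple[List[DataObject], int]:
    """Greedily place whole objects in DRAM; return (placed objects, unused MB)."""
    placed, free = [], dram_mb
    for obj in order:
        if obj.size_mb <= free:
            placed.append(obj)
            free -= obj.size_mb
    return placed, free


# Hypothetical objects for a high-sensitivity and a low-sensitivity application.
apps = {
    "jacobi-2d": [DataObject("jacobi-2d", "A", 40, 3.9), DataObject("jacobi-2d", "B", 40, 2.0)],
    "adi": [DataObject("adi", "u", 40, 0.9), DataObject("adi", "v", 40, 0.5)],
}
for policy in (fair_merge_order, co_merge_order):
    placed, unused = fill_dram(policy(apps), dram_mb=80)
    print(policy.__name__, [o.name for o in placed], unused)
```

In this toy setup the round-robin order gives one DRAM slot to each application, while the global order gives both slots to the objects with the highest co-benefit.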

  12. Summary: More detailed analysis in the paper.
     ● Partitioning & existing solutions: Equal Split / Proportional Split, with xsbench, clomp and stream collocated: 7x / 6x slowdown, part of DRAM unused.
     ● Sharing & co-benefit metric: Fair Merge / CoMerge: 2.7x / 2.6x slowdown.
     The co-benefit metric allows CoMerge to achieve:
     ● Lower runtime across all collocated applications.
     ● Higher DRAM utilization.
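To see why static partitioning can leave DRAM unused, the two partitioning baselines can be sketched as simple arithmetic. This is only an illustration: the transcript does not say what Proportional Split is proportional to, so the sketch assumes each application's memory footprint, and the footprints and DRAM size below are made up.

```python
from typing import Dict


def equal_split(dram_mb: float, footprints: Dict[str, float]) -> Dict[str, float]:
    """Give every collocated application the same DRAM share."""
    share = dram_mb / len(footprints)
    return {app: share for app in footprints}


def proportional_split(dram_mb: float, footprints: Dict[str, float]) -> Dict[str, float]:
    """Give each application a share proportional to its footprint (an assumption)."""
    total = sum(footprints.values())
    return {app: dram_mb * fp / total for app, fp in footprints.items()}


def unused_dram(shares: Dict[str, float], footprints: Dict[str, float]) -> float:
    """DRAM that stays empty because a partition is larger than its owner's data."""
    return sum(max(0.0, shares[app] - footprints[app]) for app in footprints)


# Hypothetical footprints in MB for the three collocated applications.
footprints = {"xsbench": 100.0, "clomp": 60.0, "stream": 20.0}
for split in (equal_split, proportional_split):
    shares = split(90.0, footprints)
    print(split.__name__, shares, unused_dram(shares, footprints))
```

This toy model ignores object granularity; with whole objects that do not fit exactly into a partition, further DRAM can go unused under either split.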
