Can Far Memory Improve Job Throughput?
EuroSys 2020 Talk
Emmanuel Amaro, Christopher Branner-Augmon, Zhihong Luo, Amy Ousterhout, Marcos K. Aguilera (VMware Research), Aurojit Panda (NYU), Sylvia Ratnasamy, Scott Shenker
The rise of far memory
● Demand for memory has grown faster than availability
○ Prevalence of in-memory workloads
○ End of Moore's Law hinders DRAM progress
● Far memory lets a server use memory that resides on remote servers
Context: Memory provisioning
● Local memory can only be provisioned at coarse granularity
○ Example: 192GB = 12 x 16GB DIMMs; adding 48GB means adding 12 x 4GB DIMMs, giving 240GB, a 25% increase (arithmetic sketched below)
● Unbalanced memory configurations significantly limit memory bandwidth
○ 1 DIMM per controller → 35% of max system bandwidth
○ Balanced configuration: all slot-0 DIMMs have equal capacity and all slot-1 DIMMs have equal capacity
● If we measure upgrade granularity as memory per core in the cluster:
○ Far memory can be upgraded at much finer granularity than local memory
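To make the granularity point concrete, here is a tiny Python sketch of the arithmetic above; the 12-channel layout and DIMM sizes are just the example numbers from this slide:

```python
# Minimal sketch of the coarse-granularity upgrade arithmetic from the slide.
# Assumes a 12-channel server populated with one 16GB DIMM per channel;
# a balanced upgrade must add a DIMM of equal size to every channel.
CHANNELS = 12
BASE_DIMM_GB = 16
UPGRADE_DIMM_GB = 4                          # smallest DIMM in the example

base_gb = CHANNELS * BASE_DIMM_GB            # 192 GB
upgrade_gb = CHANNELS * UPGRADE_DIMM_GB      # 48 GB added in one balanced step
new_gb = base_gb + upgrade_gb                # 240 GB
print(f"{base_gb}GB -> {new_gb}GB, a {100 * upgrade_gb / base_gb:.0f}% jump")
# Output: 192GB -> 240GB, a 25% jump
```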
Focus of our work
1. How to make transparent access to far memory fast?
2. How to decide how much far memory each job uses?
3. Once we solve 1 and 2, can far memory improve job throughput?
Transparent and fast far memory access
● Operating system support → swapping with RDMA
○ Page fault handler brings pages from far memory into local memory
● Previous systems had poor latency and bandwidth due to overheads in the page fault handler:
○ Head-of-line blocking (high-priority reads queued behind low-priority reads; see the sketch below)
○ Asynchronous critical page reads (require a context switch)
○ Page reclamation during page faults
● Fastswap removes these key overheads:
○ Average page reads <5us
○ Applications can access far memory at 10Gbps (one thread) and 25Gbps (7 threads)
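The head-of-line blocking point can be illustrated with a toy first-in-first-out model. This is not Fastswap's actual RDMA queue code; it only models the queueing effect, and the 5us per-page read time is an assumption taken from the latency number above:

```python
# Toy FIFO model of head-of-line blocking on the swap-in path; this is not
# Fastswap's RDMA implementation. Assumes the NIC drains one queue in FIFO
# order and every page read takes READ_US microseconds (made-up number).
from collections import deque

READ_US = 5  # assumed per-page read time in microseconds

def completion_time(queue, target):
    """Time until `target` finishes if the queue is drained in FIFO order."""
    elapsed = 0
    for request in queue:
        elapsed += READ_US
        if request == target:
            return elapsed
    raise ValueError("target not in queue")

prefetch_reads = [f"prefetch-{i}" for i in range(7)]

# One shared queue: the critical (faulting) page waits behind prefetches.
shared_queue = deque(prefetch_reads + ["critical"])
print("shared queue   :", completion_time(shared_queue, "critical"), "us")   # 40 us

# A dedicated queue for critical reads removes the head-of-line blocking.
critical_queue = deque(["critical"])
print("dedicated queue:", completion_time(critical_queue, "critical"), "us")  # 5 us
```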
Fastswap read throughput (figure): 1.7x higher than previous approaches with one thread and 3.2x higher multithreaded
How much far memory should each job use?
● Far memory aware cluster scheduler
○ Improve job throughput
○ Pack the cluster densely by using far memory
● Strategy (sketched in code below):
○ If memory is not the constraining resource to admit more jobs → normal scheduling
○ If memory becomes the constraining resource → the scheduler shrinks local memory of existing jobs and places the residuals in far memory
● Key challenge: performance degradation is application-dependent
○ The scheduler needs to take this into account when shrinking
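Below is a minimal, illustrative sketch of the admission strategy just described. The Cluster/Job classes, the capacities, and the proportional shrink rule are assumptions for illustration; the real scheduler picks per-job shrink ratios with the memory-time policy on the next slide:

```python
# Minimal sketch of the far-memory-aware admission strategy described above.
# Classes, capacities, and the proportional shrink rule are illustrative.
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    cores: int
    mem_gb: float
    local_gb: float = 0.0            # local memory currently assigned

@dataclass
class Cluster:
    total_cores: int
    total_local_gb: float
    far_gb: float                    # capacity of the far memory node
    running: list = field(default_factory=list)

    @property
    def free_cores(self):
        return self.total_cores - sum(j.cores for j in self.running)

    @property
    def free_local_gb(self):
        return self.total_local_gb - sum(j.local_gb for j in self.running)

    @property
    def free_far_gb(self):
        return self.far_gb - sum(j.mem_gb - j.local_gb for j in self.running)

    def admit(self, job):
        if job.cores > self.free_cores:
            return False             # CPU, not memory, is the constraint here
        shortfall = job.mem_gb - self.free_local_gb
        if shortfall > 0:
            # Memory is the constraining resource: shrink running jobs'
            # local memory (here: proportionally) and spill the rest to far memory.
            shrinkable = sum(j.local_gb for j in self.running)
            if shortfall > self.free_far_gb or shortfall > shrinkable:
                return False         # not enough far memory to absorb the residuals
            for j in self.running:
                j.local_gb -= shortfall * j.local_gb / shrinkable
        job.local_gb = job.mem_gb    # the new job starts fully in local memory
        self.running.append(job)
        return True

cluster = Cluster(total_cores=48, total_local_gb=192, far_gb=192)
for job in [Job("A", 16, 120), Job("B", 16, 100), Job("C", 16, 60)]:
    print(job.name, "admitted" if cluster.admit(job) else "queued")
```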
Job degradation profiles with Fastswap (figure): slowdown when local memory shrinks is application-dependent, ranging from 1.11x to 3.03x across the jobs shown
Memory-time policy
● Uses memory-time products to find optimal shrink ratios for a set of jobs
○ A = local memory-time when no far memory is used
○ B + C = new memory-time when far memory is used (B = local part, C = far part), plotted against the local memory ratio
● Optimization problem, intuition:
○ Want to find the local memory ratio r for each job in the set
○ Such that we minimize B, the local memory usage over time (sketched below)
● Optimization runs when a job is admitted, or when a job finishes
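Here is one way the intuition could look in code, as a brute-force sketch rather than the paper's actual formulation or solver: it assumes each job's degradation profile f(r), the slowdown when running with local memory ratio r, is known, and it searches candidate ratios for the split that minimizes B. Job names, profiles, and capacities are made up:

```python
# Illustrative sketch of the memory-time intuition above, not the paper's
# exact formulation. Profiles and capacities below are made-up examples.
from itertools import product

JOBS = [
    # (name, memory demand GB, base runtime s, degradation profile f(r))
    ("insensitive", 32, 100, lambda r: 1.0 + 0.3 * (1 - r)),
    ("sensitive",   48, 100, lambda r: 1.0 + 4.0 * (1 - r) ** 2),
]
LOCAL_GB, FAR_GB = 64, 16                   # jobs demand 80GB in total
RATIOS = [i / 20 for i in range(10, 21)]    # candidate local ratios 0.5 .. 1.0
EPS = 1e-9

def local_memory_time(ratios):
    """B: sum over jobs of (local memory used) x (degraded runtime)."""
    return sum(r * mem * base * f(r)
               for (_, mem, base, f), r in zip(JOBS, ratios))

best = None
for ratios in product(RATIOS, repeat=len(JOBS)):
    local = sum(r * mem for (_, mem, _, _), r in zip(JOBS, ratios))
    far = sum((1 - r) * mem for (_, mem, _, _), r in zip(JOBS, ratios))
    if local > LOCAL_GB + EPS or far > FAR_GB + EPS:
        continue                            # infeasible split of local/far memory
    cost = local_memory_time(ratios)
    if best is None or cost < best[0]:
        best = (cost, ratios)

print(dict(zip((name for name, *_ in JOBS), best[1])))
```

With these made-up profiles the search shrinks the insensitive job and keeps the sensitive one fully local, which is the behavior the policy is after.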
Can Far Memory Improve Job Throughput?
● Baseline rack = no far memory, 40 servers (each 192GB and 48 cores)
○ Far (+0%) = convert one compute node into a far memory node; i.e. 192GB of far memory
○ Far (+X%) = X% additional rack memory available in the far memory node
● Workload: a list of 6000 mixed jobs with uniformly random arrivals
○ Each workload is executed on the different rack configurations
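For concreteness, a small arithmetic sketch of the rack configurations; it assumes that "X% additional rack memory" means X% of the baseline rack's total memory added on the far memory node, which is an interpretation rather than a statement from the slide:

```python
# Arithmetic sketch of the rack configurations above; names are illustrative,
# and the Far (+X%) interpretation is an assumption.
SERVERS, GB_PER_SERVER, CORES_PER_SERVER = 40, 192, 48

baseline_gb = SERVERS * GB_PER_SERVER       # 7680 GB of local memory, no far memory

def far_config(extra_pct):
    """Far (+X%): one node becomes the far memory node, plus X% extra rack memory."""
    compute_nodes = SERVERS - 1
    local_gb = compute_nodes * GB_PER_SERVER
    far_gb = GB_PER_SERVER + baseline_gb * extra_pct / 100
    return compute_nodes * CORES_PER_SERVER, local_gb, far_gb

for pct in (0, 10, 20):
    cores, local_gb, far_gb = far_config(pct)
    print(f"Far (+{pct}%): {cores} cores, {local_gb}GB local, {far_gb:.0f}GB far")
```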
Conclusion
● How to make transparent access to far memory fast?
○ Fastswap provides transparent far memory access with higher throughput than previous approaches: 1.7x with a single thread and 3.2x multithreaded
● How to decide how much far memory each job uses?
○ Our far memory aware scheduler decides using its memory-time policy
● Can far memory improve job throughput?
○ Yes, makespan improvements range from 10% to 40%
Thank you.