

  1. Improving Node-level MapReduce Performance using Processing-in-Memory Technologies
  Mahzabeen Islam, Marko Scrbak and Krishna M. Kavi
  Computer Systems Research Laboratory, Department of Computer Science & Engineering, University of North Texas, USA
  Mike Ignatowski and Nuwan Jayasena
  AMD Research - Advanced Micro Devices, Inc., USA

  2. Overview
  • Introduction
  • Motivation
  • Proposed Model
  • Server Architecture
  • Programming Framework
  • Experiments
  • Results
  • Conclusion and Future Work
  • Related Work
  • References

  [3] Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: ISPASS (2014)
  [4] Zhang, D., Jayasena, N., Lyashevsky, A., et al.: TOP-PIM: Throughput-oriented Programmable Processing in Memory. In: HPDC (2014)

  3. Introduction
  • 3D-stacked DRAM consists of DRAM dies stacked on top of a logic die; it provides higher memory bandwidth, lower access latency and lower energy consumption than existing DRAM technologies
    - Hybrid Memory Cube (HMC): capacity 2-4 GB, bandwidth 160 GB/s (15x DDR3), 70% less energy per bit [1]
  • The bottom logic die contains peripheral circuitry (row decoders, sense amplifiers, etc.), but still leaves enough silicon area for other logic
  • 3D-DRAM can be used as a large last-level cache, as main memory, or as a buffer in front of PCM
  • SRAM can be integrated in the logic layer to aid address translation (hardware page tables)
  • A recent trend is to put processing capabilities in the logic layer

  4. Processing in Memory
  • Processing-in-Memory (PIM) is the concept of moving computation closer to memory
  • Advantages:
    - Low access latency, high memory bandwidth and a high degree of parallelism can be achieved by adding simple processing cores in memory
    - Minimizes cache pollution by not transferring some data to the main cores
    - Data-intensive/memory-bound applications, which do not benefit from conventional cache hierarchies, could benefit from PIM
  • Concerns:
    - Designing an appropriate system architecture: too many design choices (main processor, PIM processors, memory hierarchy, communication channels, interfaces)
    - Requires changes to the operating system (memory management), the programming framework (e.g. the MapReduce library) and programming models (synchronization, coherence)

  5. Our Work
  • 3D-stacked DRAM has generated renewed interest in PIM
  • We can use several low-power cores in the logic layer of a 3D-DRAM to execute memory-bound functions closer to memory
  • Our current research focuses on Big Data analysis based on the MapReduce programming model
    - Map functions are good candidates for executing on PIM processors
    - We propose and evaluate a server architecture here
    - MapReduce is modified for shared-memory processors
      · We plan to investigate using PIM for other parts of MapReduce applications
      · And for other classes of applications (scale-out applications)
      · Contemporary research shows that emerging scale-out applications do not benefit from conventional processor architectures and cache hierarchies [2]
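To make "map functions are good candidates for PIM" concrete, here is a minimal sketch of the kind of map kernel involved: a word-count map task that streams once through its input split with little data reuse, which is why it gains nothing from deep cache hierarchies. The function name and its use are illustrative, not from the slides.

```python
from collections import Counter

def map_word_count(split: str) -> Counter:
    """Map task over one input split: emit (word, count) pairs.

    A streaming, cache-unfriendly kernel like this is the kind of
    work that could be offloaded to simple PIM cores close to the data.
    """
    counts = Counter()
    for word in split.split():
        counts[word.lower()] += 1
    return counts

# Each PIM core would run the map task on its own split independently:
partials = [map_word_count(s) for s in ["the cat sat", "The dog"]]
```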

  6. Proposed Server Architecture
  • Host processor connected to multiple 3D Memory Units (3DMUs)
  • PIM cores in the logic layer of each 3DMU
  • Simple, in-order, single-issue, energy-efficient PIM cores with only L1 caches
  • Processes running on the host control the execution of PIM threads
  • Unified memory view, as proposed by the Heterogeneous System Architecture (HSA) Foundation
  • A number of such nodes will make up a cluster

  [Figure: the host reaches the PIM and DRAM controllers through an abstract load/store interface; the memory dies sit above the logic die behind a timing-specific DRAM interface]

  7. Proposed MapReduce Framework
  • Adapt MapReduce frameworks for shared-memory systems that exhibit NUMA
    - We chose Phoenix++, which works with CMP and SMP systems
    - Phoenix++ needed to be modified for our purpose
  • Map phase: runs on the PIM cores, overlapped with reading the input (the host reads from files)
  • Reduce phase: special data structures (2D hash tables) allow local reduction in the 3DMUs to minimize the amount of data transferred during the final reduction
  • Merge phase: the initial stages can be performed by the PIM cores, and the rest by the host processor
  • Here we emphasize single (intra-) node MapReduce operation, and assume a global (inter-) node level of MapReduce operation will take place if we need a cluster of such nodes

  [Figure: a master process running on the host feeds input to four manager processes, one per 3DMU; each manager controls the PIM threads of its 3DMU]
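The reduce-phase idea above can be sketched as follows. This is our own illustration of a "2D hash table" (one row of partitioned hash tables per 3DMU): each 3DMU reduces its map output locally, so only already-reduced (key, sum) pairs cross to the host for the final reduction. All names and the partition count are hypothetical; the slides do not give this structure in code.

```python
from collections import defaultdict

NUM_MUS = 4          # 3D Memory Units, as in the proposed architecture
NUM_PARTITIONS = 8   # columns of the "2D" hash table (illustrative choice)

def local_reduce(mu_outputs):
    """Each 3DMU reduces its own map output into one row of
    partitioned hash tables: table[mu][partition][key] -> sum."""
    table = [[defaultdict(int) for _ in range(NUM_PARTITIONS)]
             for _ in range(NUM_MUS)]
    for mu, pairs in enumerate(mu_outputs):
        for key, val in pairs:
            part = hash(key) % NUM_PARTITIONS
            table[mu][part][key] += val
    return table

def final_reduce(table):
    """The host merges one partition column across all MUs at a time;
    only locally reduced pairs are transferred, not raw map output."""
    result = {}
    for part in range(NUM_PARTITIONS):
        for mu in range(NUM_MUS):
            for key, val in table[mu][part].items():
                result[key] = result.get(key, 0) + val
    return result
```

Partitioning by key hash means a given key lands in the same column in every MU, so the final reduction touches each column of the table exactly once.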

  8. Experiment Setup
  • Baseline vs. new system configuration

  [Figure: baseline node with two Xeon E5 sockets (P0, P1) connected by QPI, each with local memory]

  Table 1: Baseline System Configuration
    CPU:          2 x Xeon E5-2640, 6 cores per processor, 2 threads/core, out-of-order, 4-wide issue
    Clock speed:  2.5 GHz
    L3 cache:     15 MB/processor
    Power:        TDP = 95 W/processor
    Memory BW:    42.6 GB/s per processor
    Memory:       32 GB (8 x 4 GiB DDR3 DIMMs), NUMA enabled

  Table 2: New System Configuration
                       Host processor               PIM cores
    Processing unit:   1 Xeon E5-2640,              64 (= 4 x 16) ARM Cortex-A5,
                       6 cores, 2 threads/core,     in-order, single-issue
                       out-of-order, 4-wide issue
    Clock speed:       2.5 GHz                      1 GHz
    LL cache:          15 MB                        32 KB I and 32 KB D per core
    Power:             TDP = 95 W                   80 mW/core (5.12 W for 64 cores)
    Memory BW:         42.6 GB/s                    1.33 GB/s per core
    Memory:            32 GB (4 x 8 GiB 3DMUs)
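A quick arithmetic check of the Table 2 numbers; the aggregate PIM bandwidth figure is our own multiplication of the per-core value and is not stated on the slide.

```python
# Sanity check of the new-system numbers (Table 2).
pim_cores = 4 * 16                # 4 3DMUs x 16 ARM Cortex-A5 cores each
pim_power_w = pim_cores * 0.080   # 80 mW per core
pim_bw_gbs = pim_cores * 1.33     # 1.33 GB/s per core, summed over all cores

assert pim_cores == 64                    # matches "64 = 4 * 16"
assert abs(pim_power_w - 5.12) < 1e-9     # matches "5.12 W for 64 cores"
# pim_bw_gbs ~ 85 GB/s in aggregate, roughly double the host's 42.6 GB/s
```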

  9. Experiments and Analysis
  • Our assumption is that we can overlap the reading of input data with the execution of map tasks
  • The input reading is performed by the host CPU and the map tasks by the PIM cores

  [Figure: two timelines, 0-224 ms, of the host reading input splits round-robin into 3DMU0-3DMU3; in (a) the PIM cores in each 3DMU are mostly idle between splits, in (b) PIM core utilization is high]
  Fig.: (a) PIM cores mostly idle; (b) PIM core utilization is high

  • We do not want the cores to sit idle
  • Estimate the number of cores needed
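The read/map overlap described above can be sketched as a producer/consumer pipeline: the host thread deals input splits round-robin to per-3DMU queues, and "PIM" worker threads map each split as soon as it arrives. This is a behavioral sketch only; the function names and the use of Python threads to stand in for PIM cores are our own.

```python
import queue
import threading

def host_reader(splits, qs):
    """Host reads input splits and deals them round-robin to the 3DMUs."""
    for i, split in enumerate(splits):
        qs[i % len(qs)].put(split)
    for q in qs:
        q.put(None)  # sentinel: no more splits for this MU

def pim_worker(q, out, lock):
    """Worker standing in for one 3DMU's PIM cores: map splits on arrival."""
    while True:
        split = q.get()
        if split is None:
            break
        counts = {}
        for w in split.split():           # the map task: word count
            counts[w] = counts.get(w, 0) + 1
        with lock:                        # merge partial result
            for k, v in counts.items():
                out[k] = out.get(k, 0) + v

def run(splits, num_mus=4):
    qs = [queue.Queue() for _ in range(num_mus)]
    out, lock = {}, threading.Lock()
    workers = [threading.Thread(target=pim_worker, args=(q, out, lock))
               for q in qs]
    for t in workers:
        t.start()
    host_reader(splits, qs)   # reading overlaps with mapping
    for t in workers:
        t.join()
    return out
```

If the workers drain their queues faster than the reader fills them, they sit idle between splits; that is exactly the imbalance the core-count estimate on the next slide addresses.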

  10. Experiments and Analysis
  How many PIM cores per 3DMU do we need?
  • The time taken by the PIM cores to process an input split should be smaller than the time taken by the host to read one input split:

      (s * T_map) / (4n) <= T_read,   i.e.   n >= (s * T_map) / (4 * T_read)

  where
  • s is the factor indicating the relative slowdown of the simple PIM cores compared to the host
  • T_map is the time taken by the host to complete the map function on one input split
  • T_read is the time taken by the host to read one input split
  • There are 4 3DMUs and each contains n PIM cores
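A worked instance of this estimate, under the stated definitions (s = PIM slowdown factor, T_map = host map time per split, T_read = host read time per split, 4 3DMUs of n cores each). The no-idle condition gives n >= s*T_map / (4*T_read); the numerical inputs below are illustrative, not measurements from the slides.

```python
import math

def cores_per_mu(s, t_map, t_read, num_mus=4):
    """Minimum PIM cores per 3DMU so mapping keeps pace with reading:
    s * t_map / (num_mus * n) <= t_read  =>  n >= s * t_map / (num_mus * t_read)
    """
    return math.ceil(s * t_map / (num_mus * t_read))

# Illustrative numbers: PIM cores 4x slower than the host,
# host maps a split in 120 ms and reads one in 10 ms.
n = cores_per_mu(s=4, t_map=120, t_read=10)  # -> 12 cores per 3DMU
```

Rounding up with ceil is deliberate: a fractional core count would mean the PIM side falls slightly behind the reader and splits queue up.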
