Tiled-MapReduce: Optimizing Resource Usages of Data-parallel Applications on Multicore
Rong Chen, Haibo Chen, Binyu Zang
Parallel Processing Institute, Fudan University
Data-Parallel Applications
Data-parallel applications have emerged and grown rapidly over the past 10 years
• Google processed about 24 petabytes of data per day in 2008
• The movie "Avatar" takes over 1 petabyte of local storage for 3D rendering*
• …
* http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html
Data-Parallel Programming Model
MapReduce: a simple programming model for data-parallel applications, from Google
Two primitives:
• Map(input)
• Reduce(key, values)
The programmer provides only the functionality (the Map and Reduce functions); the MapReduce runtime handles parallelism, data distribution, fault tolerance, and load balance.
Example: WordCount
• Map(input): for each word in input, emit(word, 1)
• Reduce(key, values): int sum = 0; for each value in values, sum += value; emit(key, sum)
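To make the WordCount pseudocode above concrete, here is a minimal, self-contained C sketch in the same Map/Reduce style. The emit_intermediate() name and the trivial sequential grouping driver are illustrative assumptions, not the actual Phoenix API; a real runtime would run the Map and Reduce tasks in parallel.

```c
/* Minimal sequential sketch of WordCount in MapReduce style (illustrative;
 * not the actual Phoenix API). The "runtime" part groups the intermediate
 * (word, 1) pairs by key and hands each group to Reduce. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 4096

struct pair { char key[64]; long val; };
static struct pair inter[MAX_PAIRS];   /* intermediate buffer */
static int npairs;

static void emit_intermediate(const char *key, long val)
{
    if (npairs >= MAX_PAIRS) return;   /* toy bound for the example */
    snprintf(inter[npairs].key, sizeof(inter[npairs].key), "%s", key);
    inter[npairs++].val = val;
}

/* User-supplied Map: for each word in input, emit (word, 1). */
static void map(char *input)
{
    for (char *w = strtok(input, " \t\n"); w; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* User-supplied Reduce: sum all values emitted for one key. */
static void reduce(const char *key, const long *values, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += values[i];
    printf("%s\t%ld\n", key, sum);     /* emit(key, sum) */
}

static int cmp(const void *a, const void *b)
{
    return strcmp(((const struct pair *)a)->key, ((const struct pair *)b)->key);
}

int main(void)
{
    char text[] = "the boy and the dog saw the boy";
    map(text);                                     /* Map phase */

    /* "Runtime": group pairs by key (sort), then call Reduce per group. */
    qsort(inter, npairs, sizeof(inter[0]), cmp);
    long values[MAX_PAIRS];
    for (int i = 0; i < npairs; ) {
        int n = 0, j = i;
        while (j < npairs && strcmp(inter[j].key, inter[i].key) == 0)
            values[n++] = inter[j++].val;
        reduce(inter[i].key, values, n);           /* Reduce phase */
        i = j;
    }
    return 0;
}
```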
Multicore
Multicore has recently become commercially prevalent
• Quad-core and eight-core chips are common today
• Tens to hundreds of cores on a single chip will appear in the near future
MapReduce on Multicore
Phoenix [HPCA'07, IISWC'09]: a MapReduce runtime for shared-memory systems
• Targets: CMPs and SMPs, NUMA
• Features
  - Parallelism: threads
  - Communication: shared address space
• Heavily optimized runtime
  - Runtime algorithms, e.g. locality-aware task distribution
  - Scalable data structures, e.g. hash table
  - OS interaction, e.g. memory allocator, thread pool
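The slide only names locality-aware task distribution; one plausible shape such a scheme can take (purely an assumption for illustration, not Phoenix's actual scheduler) is per-core task queues, where each worker prefers map tasks over chunks assigned to its own core and steals from other queues only when its own runs dry.

```c
/* Hedged sketch of per-core task queues with work stealing as one form of
 * locality-aware task distribution. Illustrative only. Build with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define NTASKS   16

struct queue {
    int tasks[NTASKS];
    int head, tail;
    pthread_mutex_t lock;
};

static struct queue q[NWORKERS];

static int pop(struct queue *self)          /* take a task from one queue */
{
    int t = -1;
    pthread_mutex_lock(&self->lock);
    if (self->head < self->tail)
        t = self->tasks[self->head++];
    pthread_mutex_unlock(&self->lock);
    return t;
}

static int steal(int me)                    /* fall back to other queues */
{
    for (int i = 0; i < NWORKERS; i++) {
        if (i == me) continue;
        int t = pop(&q[i]);
        if (t >= 0) return t;
    }
    return -1;
}

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    for (;;) {
        int task = pop(&q[me]);             /* local chunk first: cache-warm  */
        if (task < 0) task = steal(me);     /* keep cores busy near the end   */
        if (task < 0) break;
        printf("worker %d maps chunk %d\n", me, task);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int w = 0; w < NWORKERS; w++) {
        pthread_mutex_init(&q[w].lock, NULL);
        q[w].head = q[w].tail = 0;
        for (int t = w; t < NTASKS; t += NWORKERS)   /* chunks dealt round-robin */
            q[w].tasks[q[w].tail++] = t;
    }
    for (int w = 0; w < NWORKERS; w++)
        pthread_create(&tid[w], NULL, worker, (void *)(long)w);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(tid[w], NULL);
    return 0;
}
```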
Implementation on Multicore
(Figure: execution flow of the MapReduce runtime on a multicore machine, from disk through main memory to output)
1. Start the worker threads and load the whole input file from disk into an in-memory input buffer
2. Split the input buffer into chunks
3. Map workers process the chunks and emit key/value pairs into an intermediate buffer (per-worker key and value arrays)
4. Reduce workers aggregate the intermediate pairs for each key into a final buffer
5. Merge the reduce outputs into the result buffer, write the output file, free the buffers, and end
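The intermediate buffer the figures hint at can be pictured as a grid of buckets indexed by (map worker, reduce partition): each map worker writes only its own row, so no locking is needed among mappers, and each reduce worker later scans one column. The layout and names below are assumptions for illustration, not Phoenix's actual data structures.

```c
/* Sketch of a per-(map worker, reduce partition) intermediate buffer. */
#include <stdio.h>

#define MAP_WORKERS    4
#define REDUCE_WORKERS 4
#define BUCKET_CAP     256

struct kv { const char *key; long val; };

struct bucket {
    struct kv pairs[BUCKET_CAP];
    int n;
};

/* intermediate[m][r]: pairs emitted by map worker m for reduce partition r */
static struct bucket intermediate[MAP_WORKERS][REDUCE_WORKERS];

static unsigned partition(const char *key)
{
    unsigned h = 5381;                    /* djb2-style hash picks the column */
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h % REDUCE_WORKERS;
}

static void emit_intermediate(int map_id, const char *key, long val)
{
    struct bucket *b = &intermediate[map_id][partition(key)];
    if (b->n < BUCKET_CAP) {
        b->pairs[b->n].key = key;
        b->pairs[b->n].val = val;
        b->n++;
    }
}

int main(void)
{
    /* Map workers 0 and 1 emit a few pairs; each writes only its own row. */
    emit_intermediate(0, "boy", 1);
    emit_intermediate(0, "but", 1);
    emit_intermediate(1, "boy", 1);

    /* Reduce worker r walks column r across all map workers' rows. */
    for (int r = 0; r < REDUCE_WORKERS; r++)
        for (int m = 0; m < MAP_WORKERS; m++)
            for (int i = 0; i < intermediate[m][r].n; i++)
                printf("reduce %d sees (%s, %ld) from map %d\n",
                       r, intermediate[m][r].pairs[i].key,
                       intermediate[m][r].pairs[i].val, m);
    return 0;
}
```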
Deficiencies of MapReduce on Multicore
High memory usage
• The whole input data is kept in main memory all the time
  e.g. WordCount with a 4GB input requires more than 4.3GB of memory on Phoenix (93% used by the input data)
Poor data locality
• All input data is processed at one time
  e.g. WordCount with a 4GB input has about a 25% L2 cache miss rate
Strict dependency barriers
• CPUs idle at the exchange of phases
Solution: Tiled-MapReduce
Contributions
• Tiled-MapReduce programming model
  - Tiling strategy
  - Fault tolerance (in paper)
• Three optimizations for the Tiled-MapReduce runtime
  - Input data buffer reuse
  - NUCA/NUMA-aware scheduler
  - Software pipeline
Outline
1. Tiled-MapReduce
2. Optimizations on TMR
3. Evaluation
4. Conclusion
Tiled-MapReduce
"Tiling strategy"
• Divide a large MapReduce job into a number of independent small sub-jobs
• Iteratively process one sub-job at a time
Requirement
• The Reduce function must be commutative and associative (illustrated by the sketch below)
• All 26 applications in the test suites of Phoenix and Hadoop meet this requirement
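A tiny C example of why Reduce must be commutative and associative: Tiled-MapReduce reduces each sub-job separately and then reduces the partial results again, so the final value must not depend on how or in what order the values were grouped. WordCount's summing Reduce satisfies this; the numbers below are made up for illustration.

```c
/* Tiled vs. flat reduction give the same answer when Reduce is
 * commutative and associative, as '+' is for WordCount. */
#include <stdio.h>

static long reduce_sum(const long *values, int n)   /* WordCount's Reduce */
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += values[i];
    return sum;
}

int main(void)
{
    /* Counts for the word "boy" emitted across three sub-jobs. */
    long iter1[] = { 1, 1, 1 };
    long iter2[] = { 1, 1 };
    long iter3[] = { 1, 1, 1, 1 };

    /* Tiled: reduce each sub-job, then reduce the partial results. */
    long partial[] = {
        reduce_sum(iter1, 3),
        reduce_sum(iter2, 2),
        reduce_sum(iter3, 4),
    };
    long tiled = reduce_sum(partial, 3);

    /* Flat: reduce all values of the key at once. */
    long all[] = { 1, 1, 1, 1, 1, 1, 1, 1, 1 };
    long flat = reduce_sum(all, 9);

    printf("tiled=%ld flat=%ld\n", tiled, flat);   /* both print 9 */
    return 0;
}
```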
Tiled-MapReduce
Extensions to the MapReduce model
(Figure: phase flow — Start, a loop of Map and Combine, then Reduce, Merge, End)
1. Replace the Map phase with a loop of Map and Reduce phases
2. Process one sub-job in each iteration
3. Rename the Reduce phase within the loop to the Combine phase
4. Modify the Reduce phase to process the partial results of all iterations
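Putting the four extensions together, the control flow can be sketched as a simple driver loop. This is only a skeleton with stub phase bodies and illustrative names (e.g. NUM_ITERATIONS), not Ostrich's actual code.

```c
/* Skeleton of the extended model: a loop of Map + Combine over sub-jobs,
 * then one Reduce over the accumulated partial results, then Merge. */
#include <stdio.h>

#define NUM_ITERATIONS 8          /* assumed number of sub-jobs */

static void map_subjob(int iter)     { printf("map sub-job %d\n", iter); }
static void combine_subjob(int iter) { printf("combine sub-job %d\n", iter); }
static void reduce_partials(void)    { printf("reduce partial results\n"); }
static void merge_output(void)       { printf("merge final output\n"); }

int main(void)
{
    /* Extensions 1+2: loop of Map and Combine, one sub-job per iteration. */
    for (int iter = 0; iter < NUM_ITERATIONS; iter++) {
        map_subjob(iter);
        combine_subjob(iter);     /* Extension 3: in-loop Reduce is Combine */
    }
    reduce_partials();            /* Extension 4: Reduce over all partials  */
    merge_output();
    return 0;
}
```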
Prototype of Tiled-MapReduce
Ostrich: a prototype of the Tiled-MapReduce programming model
• Demonstrates the effectiveness of the TMR programming model
• Based on the Phoenix runtime
• Follows its data structures and algorithms
Ostrich Implementation
(Figure: execution flow of Ostrich, analogous to Phoenix but processing the input one iteration window at a time)
• Start the worker threads and load only a portion of the input from disk
• Map workers process one iteration window of the input (one sub-job) per iteration, emitting into the intermediate buffer
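One way to read the "iteration window" in the figure: the runtime loads and processes one window of the input per iteration and reuses the same buffer for the next window, so memory is bounded by the window size rather than the input size. The sketch below is an assumption-laden illustration (window size, fread-based loading, stubbed sub-job), not Ostrich's implementation; a real runtime would also align windows to record boundaries.

```c
/* Iteration-window sketch: one reusable buffer, one sub-job per window. */
#include <stdio.h>
#include <stdlib.h>

#define WINDOW_SIZE (64 * 1024 * 1024)   /* assumed sub-job size: 64 MB */

static void run_subjob(const char *buf, size_t len)
{
    /* Map + Combine over this window (stub). */
    printf("processed a window of %zu bytes\n", len);
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <input>\n", argv[0]); return 1; }
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }

    char *window = malloc(WINDOW_SIZE);   /* one buffer, reused every iteration */
    if (!window) { fclose(fp); return 1; }

    size_t len;
    while ((len = fread(window, 1, WINDOW_SIZE, fp)) > 0)
        run_subjob(window, len);          /* one sub-job per window */

    free(window);
    fclose(fp);
    return 0;
}
```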