Everywhere Blocks for SIMD Programming

Authors: Rubens E. A. Moreira, Sylvain Collange, Fernando M. Q. Pereira
Speaker: Breno Campos Ferreira Guimarães

Trends in Massively Parallel Processing

Simple and also efficient
Explicit, yet safe programming!

Source: http://on-demand.gputechconf.com/gtc/2016/presentation/s6224-mark-harris.pdf

Department of Computer Science
Universidade Federal de Minas Gerais (Federal University of Minas Gerais, Brazil)

DIVERGENCES

Divergences

Kernel for parallel execution (CUDA):

    void kernel(int **A, int **B, int *N) {
      int tid(threadId.x);
      if (tid > N) {
        memcpy<<<1, 4>>>(A[tid], B[tid], N[tid]);
      } else {
        memcpy<<<1, 4>>>(B[tid], A[tid], N[tid]);
      }
    }

Control flow graph for kernel:

    if (threadId.x > N)
      then: memcpy(A, B, N);
      else: memcpy(B, A, N);

SIMD: lockstep execution! Threads T0, T1, T2, and T3 reach the branch together.

Divergence! T3 runs the "then" block while T0, T1, and T2 wait; T0, T1, and T2 then run the "else" block while T3 waits.

And waiting to process can be quite costly!
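The cost of the divergence above can be sketched with a small host-side model. This is plain C++, not CUDA, and it is only an illustration under stated assumptions: the function name `lockstep_steps`, the four-lane group size, and the per-path step costs are hypothetical and do not come from the slides. The model assumes that a path runs once whenever at least one lane needs it, and that lanes on the other path are masked off while it runs.

```cpp
#include <array>
#include <cassert>

// Hypothetical host-side model of a 4-lane SIMD group (not real CUDA).
constexpr int kLanes = 4;

// Returns the number of lockstep steps an if/else needs when the "then"
// path costs `then_cost` steps, the "else" path costs `else_cost` steps,
// and `takes_then[l]` records which path lane l takes.
int lockstep_steps(const std::array<bool, kLanes>& takes_then,
                   int then_cost, int else_cost) {
    assert(then_cost >= 0 && else_cost >= 0);
    bool any_then = false;
    bool any_else = false;
    for (bool t : takes_then) {
        any_then = any_then || t;
        any_else = any_else || !t;
    }
    // Each path is executed once if any lane needs it; lanes that did not
    // choose that path sit idle for its whole duration.
    return (any_then ? then_cost : 0) + (any_else ? else_cost : 0);
}
```

Under these assumptions, with T3 alone on the "then" path and both paths costing 10 steps, the group needs 20 steps; when all four lanes take the same path it needs only 10.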

Interlude: The Kernels of Samuel

    int idx = threadId.x;
    int dimx = threadDim.x;

    // 16250µs: few operations and assignment
    void F(int *data, int size) {
      for (int i = idx; i < size; i += dimx) {
        data[i] = size - i + 1;
      }
    }

    // 16153µs: constant assignment
    void M(int *data, int size) {
      for (int i = idx; i < size; i += dimx) {
        data[i] = size;
      }
    }

    void Q(int *data, int size) {
      for (int i = idx; i < size; i += dimx) {
        if (i % 2) data[i] = size;
      }
    }

    void P(int *data, int size) {
      for (int i = idx; i < size; i += dimx) {
        data[i] = random() % size;
      }
    }

F assigns the result of (size - i + 1) to data[i].
M assigns the constant value size to data[i].
Q also assigns size to data[i], but only for threads with odd index i.
P calls function random and assigns its value, modulo size, to data[i].

Source: http://homepages.dcc.ufmg.br/~fernando/classes/dcc888/ementa/slides/DivergenceAnalysis.pdf
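Kernel Q differs from M only in its `if (i % 2)` guard, which splits the lanes of a group on every iteration. The sketch below is a host-side C++ model, not CUDA, that counts how many lanes perform the store in each lockstep iteration of Q's loop. The function name `q_active_lanes` and the assumption that all `dimx` lanes advance together are illustrative and do not come from the slides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of kernel Q's loop, with `dimx` lanes
// advancing in lockstep. For each lockstep iteration it records how many
// lanes execute the store `data[i] = size`; lanes whose index i is even,
// or falls outside [0, size), do nothing in that iteration.
std::vector<int> q_active_lanes(int dimx, int size) {
    assert(dimx > 0 && size >= 0);
    std::vector<int> active;
    for (int base = 0; base < size; base += dimx) {
        int count = 0;
        for (int idx = 0; idx < dimx; ++idx) {
            int i = base + idx;  // index handled by lane idx in this iteration
            if (i < size && i % 2) {
                ++count;
            }
        }
        active.push_back(count);
    }
    return active;
}
```

In this model, `q_active_lanes(4, 8)` yields two iterations with 2 active lanes each: half of the four lanes are idle whenever the store runs.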
