
Vectorization Past Dependent Branches Through Speculation
Majedul Haque Sujon and R. Clint Whaley, Center for Computation & Technology (CCT), Louisiana State University (LSU); University of Texas at San Antonio (UTSA)*
Qing Yi, Department of Computer Science, University of Colorado - Colorado Springs (UCCS)
*Part of the research work was done while the authors were there.
September 12, 2013, PACT'2013

Outline
• Motivation
• Speculative Vectorization
• Integration within Our Framework
• Experimental Results
• Related Work
• Conclusions

Motivation
• SIMD vectorization is required to attain high performance on modern computers
• Many loops cannot be vectorized by existing techniques
– Only 18-30% of loops from two benchmarks can be auto-vectorized (Maleki et al. [PACT'11])
– A key inhibiting factor is control hazards
→ We introduce a new technique for vectorization past dependent branches, a major case where existing techniques fail

Example: SSQ Loop (NRM2)
Source loop:
    for(i=1; i<=N; i++) {
      ax = X[i];
      ax = ABS & ax;
      if (ax > scal) {
        t0 = scal/ax;
        t0 = t0*t0;
        ssq = 1.0+t1;
        scal = ax;
      }
      else {
        t0 = ax/scal;
        ssq += t0*t0;
      }
    }
The same loop body as a branch with two paths:
    ax = X[i];
    ax = ABS & ax;
    if (ax > scal) GOTO L2;
  Path-1 (branch not taken):
    t0 = ax/scal;
    ssq += t0*t0;
  Path-2 (L2, branch taken):
    t0 = scal/ax;
    t0 = t0*t0;
    ssq = 1.0+t1;
    scal = ax;
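For reference, the SSQ loop can be written as a small runnable C function. This is a sketch under stated assumptions, not the authors' code: the slide never defines t1, so the sketch assumes t1 = ssq * t0, which matches the usual scaled sum-of-squares update. The function name ssq_scalar, the helper abs_d, and the 0-based indexing are illustrative choices. The bit-mask absolute value (ABS & ax) is replaced by an ordinary comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C absolute value; the slide uses a bit mask (ABS & ax) instead. */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/*
 * Scaled sum of squares over x[0..n-1] (0-based, unlike the slide's 1..N).
 * On return, (*scal) * (*scal) * (*ssq) equals the sum of x[i]^2.
 * ASSUMPTION: t1 = ssq * t0; the slide leaves t1 undefined.
 * Like the slide code, this does not guard against 0/0 when both
 * ax and scal are zero.
 */
void ssq_scalar(const double *x, size_t n, double *scal, double *ssq)
{
    assert(scal != NULL && ssq != NULL);
    for (size_t i = 0; i < n; i++) {
        double ax = abs_d(x[i]);
        if (ax > *scal) {              /* Path-2: new largest magnitude */
            double t0 = *scal / ax;
            t0 = t0 * t0;
            double t1 = *ssq * t0;     /* assumed definition of t1 */
            *ssq = 1.0 + t1;
            *scal = ax;
        } else {                       /* Path-1: plain reduction on ssq */
            double t0 = ax / *scal;
            *ssq += t0 * t0;
        }
    }
}
```

Starting from scal = 0.0 and ssq = 1.0, the 2-norm of x is scal * sqrt(ssq) after the call. Path-2 runs only when a new largest magnitude appears, which is why the branch tends to go one way.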

Variable Analysis (1)
    ax = X[i];
    ax = ABS & ax;
    if (ax > scal) GOTO L2;
  Path-1:
    t0 = ax/scal;
    ssq += t0*t0;
  Path-2 (L2):
    t0 = scal/ax;
    t0 = t0*t0;
    ssq = 1.0+t1;
    scal = ax;
• scal: recurrent variable [unvectorizable pattern]
• ssq: recurrent variable [unvectorizable pattern]
• Path-1: scal is used before it is defined
• Path-2: scal is defined
• Statements that operate on scal are not vectorizable

Variable Analysis (2)
    ax = X[i];
    ax = ABS & ax;
    if (ax > scal) GOTO L2;
  Path-1:
    t0 = ax/scal;
    ssq += t0*t0;
  Path-2 (L2):
    t0 = scal/ax;
    t0 = t0*t0;
    ssq = 1.0+t1;
    scal = ax;
• scal: recurrent variable [unvectorizable pattern]
• ssq: recurrent variable [unvectorizable pattern]
• Path-1: ssq is a reduction, but it is defined in the other path
• Path-2: ssq is defined again
• Considering both paths, statements that operate on ssq are not vectorizable

Analysis of Path-1
    ax = X[i];
    ax = ABS & ax;
    if (ax > scal) GOTO L2;
  Path-1:
    t0 = ax/scal;
    ssq += t0*t0;
• scal: invariant
• ssq: reduction variable (vectorizable)
• ABS: invariant
• t0, ax: private variables
• Path-1: vectorizable

Speculative Vectorization
Vectorize past branches using speculation:
1. Vectorize a chosen path, speculating that it will be taken in consecutive loop iterations (e.g., vector-length iterations).
2. When speculation fails, re-evaluate the mis-vectorized iterations using scalar operations [Scalar Restart].

Vectorized Loop Structure
Vector Path:
• vector-prologue (initialization)
• vector-backup (if needed)
• vector-body
• vector-loop-update
• vector-epilogue (reduction)
Scalar Restart:
• vector-restore (if needed)
• vector-to-scalar (reduction)
• scalar loop of vector-length # of iterations
• scalar-to-vector update

Example Vectorized Code (SSQ)
Vector path:
    /* vector-prologue */
    Vssq = [ssq,0.0,0.0,0.0];
    Vscal= [scal,scal,scal,scal];
    VABS = [ABS,ABS,ABS,ABS];
  LOOP:
    /* vector-body */
    Vax = X[i:i+3];
    Vax = VABS & Vax;
    if (VEC_ANY_GT(Vax,Vscal)) GOTO SCALAR_RESTART;
    Vt0 = Vax/Vscal;
    Vssq += Vt0*Vt0;
    /* vector-loop-update */
    i+=4;
    if (i<=N4) GOTO LOOP;
    /* vector-epilogue */
    ssq = sum(Vssq[0:3]);
    scal = Vscal[0];
Scalar restart:
  SCALAR_RESTART:
    /* vector-to-scalar */
    ssq = sum(Vssq[0:3]);
    /* scalar loop */
    for(j=0; j<4; j++) {
      ax = X[i+j];
      ax = ABS & ax;
      if (ax > scal) {
        t0 = scal/ax;
        t0 = t0*t0;
        ssq = 1.0+t1;
        scal = ax;
      } else {
        t0 = ax/scal;
        ssq += t0*t0;
      }
    }
    /* scalar-to-vector */
    Vssq=[ssq,0.0,0.0,0.0];
    Vscal=[scal,scal,scal,scal];
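The control flow of the vectorized SSQ code can be tried out in portable C by emulating each vector register with a small array. The sketch below is an illustration, not output of the authors' compiler. It assumes a vector length of 4, assumes t1 = ssq * t0 (t1 is undefined on the slide), and adds a restart counter and a scalar tail loop that the slide does not show. The names ssq_speculative, ssq_step and abs_d are hypothetical. Each short loop over lanes stands in for one vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* assumed vector length: 4 doubles, as with AVX */

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* One scalar iteration covering both paths. ASSUMPTION: t1 = ssq * t0. */
static void ssq_step(double x, double *scal, double *ssq)
{
    double ax = abs_d(x);
    if (ax > *scal) {                  /* Path-2 */
        double t0 = *scal / ax;
        *ssq = 1.0 + *ssq * (t0 * t0);
        *scal = ax;
    } else {                           /* Path-1 */
        double t0 = ax / *scal;
        *ssq += t0 * t0;
    }
}

/* Returns the number of scalar restarts (added here for illustration). */
size_t ssq_speculative(const double *x, size_t n, double *scal, double *ssq)
{
    assert(scal != NULL && ssq != NULL);
    double vssq[VL] = { *ssq };        /* vector-prologue: [ssq,0,0,0] */
    size_t restarts = 0, i = 0;

    for (; i + VL <= n; i += VL) {     /* vector-loop-update */
        double vax[VL];
        int any_gt = 0;
        for (int l = 0; l < VL; l++) { /* vector-body: load, abs, compare */
            vax[l] = abs_d(x[i + l]);
            any_gt |= (vax[l] > *scal);
        }
        if (any_gt) {                  /* speculation failed: scalar restart */
            double s = 0.0;
            for (int l = 0; l < VL; l++) s += vssq[l];    /* vector-to-scalar */
            for (int l = 0; l < VL; l++) ssq_step(x[i + l], scal, &s);
            for (int l = 0; l < VL; l++) vssq[l] = 0.0;   /* scalar-to-vector */
            vssq[0] = s;
            restarts++;
        } else {                       /* Path-1 on all lanes at once */
            for (int l = 0; l < VL; l++) {
                double t0 = vax[l] / *scal;
                vssq[l] += t0 * t0;
            }
        }
    }

    double s = 0.0;                    /* vector-epilogue: reduce the lanes */
    for (int l = 0; l < VL; l++) s += vssq[l];
    for (; i < n; i++) ssq_step(x[i], scal, &s);  /* leftover iterations */
    *ssq = s;
    return restarts;
}
```

Because the compare happens before Vssq is touched, a failed speculation leaves no partial vector update to undo, so this sketch needs no vector-backup or vector-restore step.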

Integration within the iFKO framework
• iFKO (Iterative Floating Point Kernel Optimizer)
[Figure: iFKO block diagram. Components: Input Routine (HIL), Search Drivers, Specialized Compiler (FKO), Timers/Testers. Edge labels: HIL + flags, analysis results, optimized assembly, problem params, performance/test results.]
• Why necessary:
– To find the best path to speculate for SV
– To apply SV only when profitable

Results: SV vs Scalar
• AVX vector length: float 8, double 4
• Data: in-L2, random [-0.5, 0.5]; sin/cos [0, 2π]
• SV and scalar: both auto-tuned
• Machine: Intel Xeon CPU E5-2620
[Figure: bar chart of speedup over scalar (y-axis 0 to 8), single and double precision, kernels grouped as BLAS, ATLAS-LU Factorization, and GLIBC. Bar values in descending order: 6.83, 6.81, 5.96, 5.86, 5.43, 4.18, 3.47, 3.46, 3.2, 3.16, 3.01, 2.08, 1.08, 1.01, 1.01, 0.93, 0.92, 0.92]
• AMAX/IAMAX: 6.8x (float), 3.4x (double)
• NRM2, not vectorizable by prior methods: 4.18x (float), 2.08x (double)
• Slowdown of up to 8% for ASUM and COS (annotated bars: 0.93x, 1.01x, 0.92x, 1.08x, 0.92x, 1.01x)

Vectorization Strategies in iFKO
– VMMR (Vectorization after Max/Min Reduction):
• Eliminates max/min conditionals with vmax/vmin instructions
– VRC (Vectorization with Redundant Computation):
• Redundant computation with select/blend operations
• Only effective if all paths are vectorizable in our implementation
→ SV (Speculative Vectorization):
• Requires at least one vectorizable path
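To make the first two strategies concrete, here is one possible reading of VMMR and VRC applied to an AMAX-style loop (largest absolute value), with vector lanes emulated by arrays. It is a sketch only and does not reproduce iFKO's generated code. The vector length of 8, the function names, and the helpers lane_max and lane_select (stand-ins for a vmax instruction and a compare-plus-blend pair) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* assumed vector length: 8 floats, as with AVX */

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Stand-ins for vector instructions, applied one lane at a time. */
static float lane_max(float a, float b) { return a > b ? a : b; }
static float lane_select(int mask, float if_set, float if_clear)
{
    return mask ? if_set : if_clear;
}

/* Scalar reference: the branch depends on amax, which the loop updates. */
float amax_scalar(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float amax = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float ax = abs_f(x[i]);
        if (ax > amax) amax = ax;
    }
    return amax;
}

/* VMMR-style: the conditional is treated as a max reduction, so each
 * lane keeps its own running maximum and no branch remains. */
float amax_vmmr(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float vmax[VL] = { 0.0f };
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (int l = 0; l < VL; l++)
            vmax[l] = lane_max(vmax[l], abs_f(x[i + l]));
    float amax = 0.0f;
    for (int l = 0; l < VL; l++) amax = lane_max(amax, vmax[l]);
    for (; i < n; i++) amax = lane_max(amax, abs_f(x[i]));  /* tail */
    return amax;
}

/* VRC-style: both outcomes of the branch are available in every lane,
 * and a compare mask blends them. */
float amax_vrc(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float vmax[VL] = { 0.0f };
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (int l = 0; l < VL; l++) {
            float ax = abs_f(x[i + l]);
            int mask = ax > vmax[l];                    /* vector compare */
            vmax[l] = lane_select(mask, ax, vmax[l]);   /* blend */
        }
    float amax = 0.0f;
    for (int l = 0; l < VL; l++) amax = lane_max(amax, vmax[l]);
    for (; i < n; i++) amax = lane_max(amax, abs_f(x[i]));  /* tail */
    return amax;
}
```

Both variants need every path of the branch to be expressible per lane, which is the restriction SV relaxes by requiring only one vectorizable path.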

Comparing Vectorization Strategies with AMAX
• VMMR: only one branch to find max
• VRC: minimum redundant operation
• SV: strong directionality
• AVX vector length: float 8, double 4; Intel Xeon CPU E5-2620
[Figure: bar chart of speedup over scalar for VMMR, VRC, and SV. Single-precision bars, in descending order: 7.08, 6.81, 6.46; double-precision bars, in descending order: 3.5, 3.48, 3.13]
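The "strong directionality" that SV relies on for AMAX can be seen by counting how often speculation fails. The sketch below assumes the speculated path is the one where no element in the block exceeds the running maximum, so the vector path only compares and a scalar restart happens only when a new maximum shows up. The function amax_sv, its restarts counter, and the vector length of 8 are illustrative assumptions, not taken from iFKO.

```c
#include <assert.h>
#include <stddef.h>

#define VL 8  /* assumed vector length: 8 floats, as with AVX */

static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Speculative AMAX: *restarts receives the number of failed speculations. */
float amax_sv(const float *x, size_t n, size_t *restarts)
{
    assert(restarts != NULL);
    float amax = 0.0f;
    size_t i = 0;
    *restarts = 0;

    for (; i + VL <= n; i += VL) {
        int any_gt = 0;
        for (int l = 0; l < VL; l++)       /* vector abs + compare with amax */
            any_gt |= (abs_f(x[i + l]) > amax);
        if (!any_gt)
            continue;                      /* speculation held: nothing to update */
        for (int l = 0; l < VL; l++) {     /* scalar restart over this block */
            float ax = abs_f(x[i + l]);
            if (ax > amax) amax = ax;
        }
        (*restarts)++;
    }
    for (; i < n; i++) {                   /* leftover iterations */
        float ax = abs_f(x[i]);
        if (ax > amax) amax = ax;
    }
    return amax;
}
```

When the largest element appears early, almost every block stays on the vector path. Sorted ascending input is the worst case for this sketch, since every block triggers a restart.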

Related Work
• If-conversion: J.R. Allen [POPL'83]
– Converts control dependence to data dependence
• Bit masking to combine different values from if-else branches: Bik et al. [Int. J. PP'02]
• Formalized predicated execution with select/blend operations: Shin et al. [CGO'05]
– General approach

Conclusions
• Impressive speedup can be achieved when control flow is directional.
– Can vectorize some loops effectively when other methods can't.
• SSQ (NRM2): 4.18x (float), 2.08x (double)
• AMAX/IAMAX: 6.8x (float), 3.4x (double)
– Complementary to, and can be combined with, other existing vectorization methods (e.g., VRC)
– Specialized hardware is not needed
• Future work
– Investigate combining vectorization strategies
– Try under-speculation as veclen increases
– Speculative vectorization of multiple paths
– Loop specialization: switch to a scalar loop when misspeculation is frequent
