PSLP: Padded SLP Automatic Vectorization
Vasileios Porpodas†, Alberto Magni‡ and Timothy M. Jones†
University of Cambridge†, University of Edinburgh‡
EuroLLVM, APR 2015
slide 1 of 17, www.cl.cam.ac.uk/~vp331/
Why SIMD Vectorization?
[Figure: a. ILP: Scalar Reg. File feeding four Scalar Func. Units (FU). b. Vector Parallelism: Vector Reg. File feeding one Vector Unit with lanes 0, 1, 2, 3.]
• Scalable parallelism
• High performance
• Energy efficiency
• Supported since the mid 90's
• Frequent updates of vector ISAs (SSE4, AVX2)
• Vector generation is not done in hardware
• Low-level programming or a capable compiler
SLP Straight-Line Code Vectorizer
• Superword Level Parallelism [Larsen PLDI'00]
• State-of-the-art straight-line code vectorizer
• Implemented in most compilers (including GCC and LLVM)
• In theory it should be a superset of the loop vectorizer
  • Unroll the loop and vectorize with SLP
  • Even if the loop vectorizer fails, SLP could partly succeed
• In practice it is missing features present in the loop vectorizer (interleaved loads, predication)
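The "unroll the loop and vectorize with SLP" idea can be illustrated with a small C sketch. The function names and the loop body are assumptions made for illustration and do not come from the slides. After unrolling by four, the body holds four isomorphic statements that end in consecutive stores, which is the straight-line pattern an SLP vectorizer looks for.

```c
#include <stddef.h>

/* Original loop: one scalar statement per iteration. */
void scale_loop(float *b, const float *a, size_t n) {
    for (size_t i = 0; i < n; ++i)
        b[i] = a[i] * 2.0f + 1.0f;
}

/* Hand-unrolled by 4 (hypothetical example). The four statements in
 * the body are isomorphic and store to consecutive addresses, so they
 * form straight-line code that SLP could pack into vectors. */
void scale_unrolled(float *b, const float *a, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        b[i + 0] = a[i + 0] * 2.0f + 1.0f;
        b[i + 1] = a[i + 1] * 2.0f + 1.0f;
        b[i + 2] = a[i + 2] * 2.0f + 1.0f;
        b[i + 3] = a[i + 3] * 2.0f + 1.0f;
    }
    /* Scalar remainder for trip counts that are not a multiple of 4. */
    for (; i < n; ++i)
        b[i] = a[i] * 2.0f + 1.0f;
}
```

Both functions compute the same result. Whether a given compiler actually vectorizes the unrolled body depends on its cost model and target.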
SLP Vectorization Algorithm
[Flowchart: Scalar Code, steps 1 to 5, DONE]
• Input is scalar IR
1. Find vectorization seed instructions
   • Seed instructions are: (1) consecutive stores, (2) reductions
2. Generate graph of isomorphic scalar groups
   • Graph contains vectorizable isomorphic instructions
3. Calculate Scalar Cost and Vector Cost
   • Cost: weighted instruction count
4. If Vector Cost < Scalar Cost
   • Check vectorization profitability
   • YES: continue to step 5; NO: no vectors are emitted
5. Vectorize groups & emit vectors
   • Emit vectors only if profitable
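Steps 3 and 4 can be sketched in C as a toy cost model. This is a minimal illustration of a "weighted instruction count", not the cost model of LLVM or GCC. The struct, the function names and all weights are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One node of the SLP graph: `lanes` isomorphic scalar instructions
 * that would be replaced by a single vector instruction.
 * All weights are hypothetical, not taken from any real target. */
struct slp_group {
    int lanes;          /* number of scalar instructions in the group */
    int scalar_weight;  /* assumed cost of one scalar instruction */
    int vector_weight;  /* assumed cost of the vector instruction */
    int gather_weight;  /* assumed cost of extra inserts/extracts */
};

/* Step 3 (scalar side): weighted count of the scalar instructions. */
int scalar_cost(const struct slp_group *g, size_t n) {
    int cost = 0;
    for (size_t i = 0; i < n; ++i) {
        assert(g[i].lanes > 0);
        cost += g[i].lanes * g[i].scalar_weight;
    }
    return cost;
}

/* Step 3 (vector side): one vector instruction per group, plus any
 * insert/extract overhead needed to build or unpack the vector. */
int vector_cost(const struct slp_group *g, size_t n) {
    int cost = 0;
    for (size_t i = 0; i < n; ++i)
        cost += g[i].vector_weight + g[i].gather_weight;
    return cost;
}

/* Step 4: vectorize only if the vector code is strictly cheaper. */
bool is_profitable(const struct slp_group *g, size_t n) {
    return vector_cost(g, n) < scalar_cost(g, n);
}
```

With three four-lane groups of unit weight and no insert/extract overhead, the scalar cost is 12 and the vector cost is 3, so the check passes. A single four-lane group that needs eight inserts and extracts costs 9 as a vector against 4 as scalars, so it is rejected.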
When SLP Fails
1. Data dependencies
   [Figure: ADD1, ADD2, ADD3, ADD4]
2. Too many gather/scatter instructions. Costs outweigh benefits.
   [Figure: Original: ADD1, ADD2, ADD3, ADD4. Vectorized: Insert1 to Insert4, one vector instruction (ADD1 ADD2 ADD3 ADD4), Extract1 to Extract4.]
3. Non-isomorphism
   [Figure: ADD1, ADD2, MUL, ADD4]
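The three failure cases can be pictured with small C functions. These are hypothetical examples written for illustration, not code from the talk, and the array indices are arbitrary. They show the source patterns that a basic SLP vectorizer, as described above, would be expected to reject.

```c
/* 1. Data dependencies: each add consumes the previous result, so the
 *    four adds cannot execute as lanes of one vector instruction. */
float dependent_adds(float a, float b) {
    float add1 = a + b;
    float add2 = add1 + b;
    float add3 = add2 + b;
    float add4 = add3 + b;
    return add4;
}

/* 2. Gather/scatter overhead: the adds are independent and isomorphic,
 *    but their operands are scattered in memory and their results feed
 *    scalar code, so each lane needs an insert and an extract. */
float scattered_adds(const float *a, const float *b) {
    float add1 = a[0] + b[7];
    float add2 = a[5] + b[2];
    float add3 = a[9] + b[4];
    float add4 = a[3] + b[11];
    return add1 * add2 - add3 / add4; /* scalar uses of every lane */
}

/* 3. Non-isomorphism: the third statement multiplies instead of adding,
 *    so the four statements do not share one opcode. */
void mixed_ops(float *out, const float *a, const float *b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] * b[2];
    out[3] = a[3] + b[3];
}
```

In the second case the vector add itself is cheap. The surrounding inserts and extracts are what push the vector cost above the scalar cost.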
SLP Fails due to non-isomorphism
a. Input C code:
    ...
    B[i]   = A[i] * 7.0 + 1.0;
    B[i+1] = A[i+1] + 5.0;
    ...
b. DFG [Figure: nodes L, 7., *, 1., +, S for the first statement and L, 5., +, S for the second]
c. SLP internal graph [Figure: group 0 holds the two stores (S, S), group 1 holds the two adds (+, +), group 2 is marked NON-ISOMORPHIC, STOP!]
d. SLP vectorized groups [Figure]
Scalar Cost [Figure]
Legend: X = instruction node or constant; arrow = data flow edge
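The bottom-up walk that stops at the first non-isomorphic level can be sketched in C. This is a deliberate simplification and not the vectorizer's real data structure. Each statement is reduced to a linear chain of opcodes starting at the seed store, constant operands are ignored, and the enum and function names are hypothetical.

```c
#include <stddef.h>

enum opcode { OP_LOAD, OP_MUL, OP_ADD, OP_STORE };

/* Walk both lanes from the seed (index 0) towards the loads and count
 * how many levels carry the same opcode in both lanes. The walk stops
 * at the first mismatch or when either chain ends. */
size_t matching_levels(const enum opcode *lane0, size_t n0,
                       const enum opcode *lane1, size_t n1) {
    size_t depth = 0;
    while (depth < n0 && depth < n1 && lane0[depth] == lane1[depth])
        ++depth;
    return depth;
}
```

For the two statements above, lane 0 is STORE, ADD, MUL, LOAD and lane 1 is STORE, ADD, LOAD. The stores and the adds match, then MUL meets LOAD and the walk stops after two levels, which corresponds to groups 0 and 1 in the SLP internal graph.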