Is vectorization easy? Is vectorization enough?
Sébastien Ponce, Florian Lemaitre
Plan

1. Introduction
   - What is SIMD?
   - How is vectorization done?
2. Matrix-Vector product example
   - Impact of other optimizations on vectorization
   - Let's vectorize
   - Performance
3. Batch processing
   - Array of Structure
   - Structure of Array
4. Hand-made Vectorization
5. Check vectorization
   - Assembly
   - Callgrind
6. Conclusion & Guidelines

S. Ponce and F. Lemaitre, "Vectorization: Easy? Enough?", 11/12/2017
What is SIMD?

- Single Instruction Multiple Data
- Available on Intel architectures since 2000
- One SIMD instruction processes 4, 8, ... floats in the same time regular (scalar) arithmetic processes 1

    X     = [ x0     x1     x2     x3    ]
    Y     = [ y0     y1     y2     y3    ]
    X + Y = [ x0+y0  x1+y1  x2+y2  x3+y3 ]
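The four-lane addition in the diagram can be sketched in C as follows. This is only an illustration under stated assumptions: the helper name `add4` is hypothetical and does not come from the slides, and the SSE intrinsics are used only when the compiler defines `__SSE__`; otherwise a scalar fallback computes the same result.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* Adds four floats element-wise: out[k] = x[k] + y[k] for k = 0..3.
 * add4 is a hypothetical helper name, not taken from the slides. */
void add4(const float *x, const float *y, float *out)
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);     /* load x0..x3 (unaligned load) */
    __m128 vy = _mm_loadu_ps(y);     /* load y0..y3 */
    __m128 vs = _mm_add_ps(vx, vy);  /* one instruction, four additions */
    _mm_storeu_ps(out, vs);          /* store x0+y0 .. x3+y3 */
#else
    /* Portable fallback: four scalar additions. */
    for (size_t k = 0; k < 4; ++k)
        out[k] = x[k] + y[k];
#endif
}
```

Both branches produce the same values; only the number of arithmetic instructions issued differs.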
How is vectorization done?

Algorithm (C = A + B, scalar)
    input : A, B   // n vector
    output: C      // n vector
    for i = 0 : n do
        C[i] ← A[i] + B[i]

Algorithm (C = A + B, vector)
    input : A, B   // n vector
    output: C      // n vector
    C[:] ← A[:] + B[:]

Algorithm (C = A + B, SIMD)
    input : A, B   // n vector
    output: C      // n vector
    for i = 0 : 4 : n do
        C[i:i+4] ← A[i:i+4] + B[i:i+4]

Vector code: what you can write using matlab or numpy without matrices.

Vectorization is done in 3 steps:
1. Detect the pattern (e.g. a simple loop)
2. Convert the pattern into abstract vector code
3. Convert the vector code into fixed-width vector code
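A minimal C sketch of the scalar and fixed-width forms, assuming a vector width of 4 as in the SIMD pseudocode. The function names `add_scalar` and `add_width4` are hypothetical. The pseudocode implicitly assumes n is a multiple of 4; the remainder loop below is an addition to handle any n.

```c
#include <stddef.h>

/* Scalar form: one addition per iteration.
 * This is the kind of simple loop a compiler can detect as a pattern. */
void add_scalar(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Fixed-width form (assumed width: 4): the loop advances in blocks of 4,
 * and each block corresponds to one SIMD addition. */
void add_width4(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* C[i:i+4] <- A[i:i+4] + B[i:i+4] */
        for (size_t k = 0; k < 4; ++k)
            c[i + k] = a[i + k] + b[i + k];
    }
    /* Remainder: the last n % 4 elements are processed one by one. */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Both functions compute the same result; the second only makes explicit the structure that a vectorizing compiler would aim for.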
Matrix-Vector product

Algorithm (Y = A · X)
    input : A   // n × n matrix
    input : X   // n vector
    output: Y   // n vector
    temp  : s   // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

- Simple algorithm
- Used a lot: change of basis in ROOT
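The algorithm above translates directly to C. This sketch assumes the matrix is stored contiguously in row-major order, so that A[i, j] is `a[i * n + j]`; the slide does not specify the storage layout, and the function name `matvec` is hypothetical.

```c
#include <stddef.h>

/* Y = A * X for an n x n matrix.
 * Assumption: row-major storage, A[i, j] == a[i * n + j]. */
void matvec(const float *a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float s = 0.0f;                  /* scalar accumulator */
        for (size_t j = 0; j < n; ++j)
            s += a[i * n + j] * x[j];    /* row i of A times X */
        y[i] = s;
    }
}
```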
Impact of other optimizations: small loop unrolling

Algorithm (Y = A · X)
    input : A   // n × n matrix
    input : X   // n vector
    output: Y   // n vector
    temp  : s   // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

Algorithm (Y = A · X, unwound)
    input : A   // 3 × 3 matrix
    input : X   // 3 vector
    output: Y   // 3 vector
    Y[0] ← A[0, 0] · X[0] + A[0, 1] · X[1] + A[0, 2] · X[2]
    Y[1] ← A[1, 0] · X[0] + A[1, 1] · X[1] + A[1, 2] · X[2]
    Y[2] ← A[2, 0] · X[0] + A[2, 1] · X[1] + A[2, 2] · X[2]

- Complete unrolling is called unwinding.
- Compilers are able to unroll small loops, if it is considered worth it.
- The loop version is easier to understand:
    - for a human
    - for a compiler too
- The unrolled version makes vectorization hard: the pattern is not recognized.
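The two forms can be written side by side in C for the 3 × 3 case. This is a sketch assuming row-major storage in a flat array of 9 floats; the names `matvec3_loop` and `matvec3_unwound` are hypothetical. Whether a given compiler vectorizes either form depends on the compiler and its flags, so the sketch only shows that the two forms are equivalent.

```c
/* Loop version: the regular structure is visible to the compiler.
 * Assumption: row-major storage, A[i, j] == a[3 * i + j]. */
void matvec3_loop(const float a[9], const float x[3], float y[3])
{
    for (int i = 0; i < 3; ++i) {
        float s = 0.0f;
        for (int j = 0; j < 3; ++j)
            s += a[3 * i + j] * x[j];
        y[i] = s;
    }
}

/* Unwound version: same arithmetic written out by hand.
 * The loop structure is gone, so the compiler only sees
 * three independent scalar expressions. */
void matvec3_unwound(const float a[9], const float x[3], float y[3])
{
    y[0] = a[0] * x[0] + a[1] * x[1] + a[2] * x[2];
    y[1] = a[3] * x[0] + a[4] * x[1] + a[5] * x[2];
    y[2] = a[6] * x[0] + a[7] * x[1] + a[8] * x[2];
}
```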
Impact of other optimizations: loop order

Algorithm (Y = A · X, scalar ij)
    temp : s   // scalar accumulator
    for i = 0 : n do
        s ← 0
        for j = 0 : n do
            s ← s + A[i, j] · X[j]
        Y[i] ← s

Algorithm (Y = A · X, scalar ji)
    temp : x   // scalar
    for i = 0 : n do
        Y[i] ← 0
    for j = 0 : n do
        x ← X[j]
        for i = 0 : n do
            Y[i] ← Y[i] + A[i, j] · x

- Loop order can be changed.
    - This changes the way elements are accessed and processed.
    - Vectorization will not be applied the same way.
- ij order: A elements are accessed in Row-Major order.
- ji order: A elements are accessed in Column-Major order.
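The ji variant can be sketched in C as follows, again assuming row-major storage (A[i, j] is `a[i * n + j]`), which the slides do not specify; the name `matvec_ji` is hypothetical. Under that assumption the inner loop walks down a column of A, so consecutive iterations touch elements that are n floats apart in memory, while the writes to Y are contiguous.

```c
#include <stddef.h>

/* Y = A * X with the j loop outside and the i loop inside.
 * Assumption: row-major storage, A[i, j] == a[i * n + j]. */
void matvec_ji(const float *a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = 0.0f;                      /* Y is the accumulator here */

    for (size_t j = 0; j < n; ++j) {
        const float xj = x[j];            /* one element of X, reused n times */
        for (size_t i = 0; i < n; ++i)
            y[i] += a[i * n + j] * xj;    /* column j of A: stride n */
    }
}
```

The result is the same as with the ij order up to floating-point rounding, since the products are summed in the same order for each Y[i]; what changes is the memory access pattern and which loop is a candidate for vectorization.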