Parallel Programming and Heterogeneous Computing
SIMD: Integrated Accelerators
Max Plauth, Sven Köhler, Felix Eberhardt, Lukas Wenzel, and Andreas Polze
Operating Systems and Middleware Group
1 SIMD & AltiVec
ParProg 2019, SIMD: Integrated Accelerators, Sven Köhler
Definition: SIMD
SIMD ::= Single Instruction, Multiple Data
The same instruction is performed simultaneously on multiple data points (data-level parallelism).
First proposed for ILLIAC IV, University of Illinois (1966).
Today many architectures provide SIMD instruction set extensions:
  Intel: MMX, SSE, AVX
  ARM: VFP, NEON, SVE
  POWER: AltiVec (VMX), VSX
Flynn's Taxonomy on Multiprocessors (1966)
Classification along the instruction and data processing dimensions:

                        Single Data    Multiple Data
  Single Instruction    SISD           SIMD
  Multiple Instruction  MISD           MIMD

(Figure (C) Blaise Barney)
Scalar vs. SIMD
How many instructions are needed to add four numbers from memory?

  scalar:          A0 + B0 = C0
                   A1 + B1 = C1
                   A2 + B2 = C2
                   A3 + B3 = C3
  4-element SIMD:  [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3]

               scalar   4-element SIMD
  additions    4        1
  loads        8        2
  stores       4        1
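The instruction counts above can be sketched in C. This is a minimal illustration, not the lecture's own code: it assumes a GCC- or Clang-compatible compiler, and the names `v4si`, `add4_scalar` and `add4_simd` are hypothetical. The generic vector extension stands in for AltiVec's `vector int` so that the sketch also builds on non-POWER machines. Whether the compiler really emits a single vector addition depends on the target and the optimization flags.

```c
#include <assert.h>
#include <string.h>

/* Four 32-bit ints in one 128-bit value (GCC/Clang vector extension,
   used here as an assumed stand-in for AltiVec's `vector int`). */
typedef int v4si __attribute__((vector_size(16)));

/* Scalar version: 4 additions, 8 loads, 4 stores. */
void add4_scalar(const int *a, const int *b, int *c)
{
    assert(a && b && c);
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: 2 vector loads, 1 vector addition, 1 vector store. */
void add4_simd(const int *a, const int *b, int *c)
{
    assert(a && b && c);
    v4si va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load A0..A3 */
    memcpy(&vb, b, sizeof vb);   /* load B0..B3 */
    vc = va + vb;                /* one element-wise addition */
    memcpy(c, &vc, sizeof vc);   /* store C0..C3 */
}
```

Both functions produce the same four sums; only the number of operations the programmer writes down differs.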
Vector Registers on POWER8 (1)
32 vector registers containing 128 bits each.

Register file (figure):
  fpr0 … fpr31 map onto vsr0 … vsr31
  vr0 … vr31 (AltiVec/VMX) map onto vsr32 … vsr63
  VSX addresses vsr0 … vsr63

Each 128-bit register can be viewed as:
  1 quad word     (Quad Word 0)
  2 double words  (Double Word 0, Double Word 1)
  4 words         (Word 0 … Word 3)
  8 half words    (Half Word 0 … Half Word 7)
  16 bytes        (Byte 0 … Byte 15)

These are also used by several coprocessors: VSX, SHA2, AES, …
Vector Registers on POWER8 (2)
32 vector registers containing 128 bits each.
Depending on the instruction they are interpreted as
  16 (un)signed bytes
  8 (un)signed shorts
  4 (un)signed integers of 32 bits
  4 single precision floats
  2 (un)signed long integers of 64 bits
  2 double precision floats
  or 2, 4, 8, 16 logic values
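The idea that the instruction, not the register, decides the interpretation can be sketched with a plain C union. This is only an illustration on ordinary memory, not a model of the hardware; `reg128`, `fill_bytes`, `add_u8` and `add_u32` are hypothetical names, and the sketch assumes 4-byte `float` and 8-byte `double`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 128-bit register image, viewed through the lane layouts listed above. */
typedef union {
    uint8_t  u8[16];
    uint16_t u16[8];
    uint32_t u32[4];
    float    f32[4];
    uint64_t u64[2];
    double   f64[2];
} reg128;

/* Set all 16 bytes of the register image to the same value. */
void fill_bytes(reg128 *r, uint8_t value)
{
    assert(r);
    memset(r, value, sizeof *r);
}

/* "Byte add": 16 independent lanes, each wrapping at 8 bits. */
void add_u8(reg128 *r, uint8_t x)
{
    assert(r);
    for (int i = 0; i < 16; i++)
        r->u8[i] = (uint8_t)(r->u8[i] + x);
}

/* "Word add": 4 independent lanes, each wrapping at 32 bits. */
void add_u32(reg128 *r, uint32_t x)
{
    assert(r);
    for (int i = 0; i < 4; i++)
        r->u32[i] += x;
}
```

Starting from sixteen bytes of 0x01, adding 0xFF per byte wraps every lane to 0x00, while adding 0xFF per 32-bit word carries across byte boundaries and yields 0x01010200 in each word. The bits are the same; the chosen operation gives them their meaning.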
AltiVec Instruction Reference
For all instructions, registers and usage see PowerISA 2.07(B), chapters 6 & 7.

Excerpt (Version 2.07 B, 6.7.2 Vector Load Instructions):
The aligned byte, halfword, word, or quadword in storage addressed by EA is loaded into register VRT.
Programming Note: The Load Vector Element instructions load the specified element into the same location in the target register as the location into which it would be loaded using the Load Vector instruction.

Load Vector Element Byte Indexed X-form
  lvebx VRT,RA,RB
  Encoding: 31 | VRT | RA | RB | 7 | /   (fields start at bits 0, 6, 11, 16, 21, 31)
  if RA = 0 then b ← 0 else b ← (RA)
  EA ← b + (RB)
  eb ← EA[60:63]
  VRT ← undefined
  if Big-Endian byte ordering then VRT[8×eb : 8×eb+7] ← MEM(EA,1)
  else VRT[120-(8×eb) : 127-(8×eb)] ← MEM(EA,1)
  Let the effective address (EA) be the sum (RA|0)+(RB). Let eb be bits 60:63 of EA. If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA are placed into byte eb of register VRT. The remaining bytes in register VRT are set to undefined values.

Load Vector Element Halfword Indexed X-form
  lvehx VRT,RA,RB
  Encoding: 31 | VRT | RA | RB | 39 | /   (fields start at bits 0, 6, 11, 16, 21, 31)
  if RA = 0 then b ← 0 else b ← (RA)
  EA ← (b + (RB)) & 0xFFFF_FFFF_FFFF_FFFE
  eb ← EA[60:63]
  VRT ← undefined
  if Big-Endian byte ordering then VRT[8×eb : 8×eb+15] ← MEM(EA,2)
  else VRT[112-(8×eb) : 127-(8×eb)] ← MEM(EA,2)
  Let the effective address (EA) be the result of ANDing 0xFFFF_FFFF_FFFF_FFFE with the sum (RA|0)+(RB). Let eb be bits 60:63 of EA. If Big-Endian byte ordering is used for the storage access, the contents of the byte in storage at address EA …
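The address arithmetic in the excerpt can be traced with a few lines of C. This is a sketch of the pseudocode only, not an emulator: the function names are hypothetical, `ra_field_is_zero` models the case where the RA field of the instruction is 0, and `vrt_first_bit` generalizes the two slide formulas (byte and halfword) under the assumption that both follow the same pattern.

```c
#include <assert.h>
#include <stdint.h>

/* EA for lvebx: (RA|0) + (RB). If the RA field is 0, the base is 0. */
uint64_t ea_lvebx(int ra_field_is_zero, uint64_t ra, uint64_t rb)
{
    uint64_t b = ra_field_is_zero ? 0 : ra;
    return b + rb;
}

/* EA for lvehx: same sum, with the lowest bit cleared (halfword aligned). */
uint64_t ea_lvehx(int ra_field_is_zero, uint64_t ra, uint64_t rb)
{
    uint64_t b = ra_field_is_zero ? 0 : ra;
    return (b + rb) & 0xFFFFFFFFFFFFFFFEULL;
}

/* eb: bits 60:63 of EA. Bit 0 is the most significant bit in this
   numbering, so these are the four least significant bits. */
unsigned eb_of(uint64_t ea)
{
    return (unsigned)(ea & 0xF);
}

/* First bit of VRT that receives the element (size 1 = byte, 2 = halfword). */
unsigned vrt_first_bit(unsigned eb, unsigned size_bytes, int big_endian)
{
    assert(eb < 16);
    assert(size_bytes == 1 || size_bytes == 2);
    return big_endian ? 8 * eb : 128 - 8 * size_bytes - 8 * eb;
}
```

For example, with (RA) = 0x1000 and (RB) = 0x7, `lvebx` uses EA = 0x1007 and eb = 7, while `lvehx` clears the low bit and uses EA = 0x1006 and eb = 6.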
2 C-Interface
  #include <altivec.h>
  gcc -maltivec -mabi=altivec
  gcc -mvsx
  xlc -qaltivec -qarch=auto
Vector Data Types
The C-Interface introduces new keywords and data types:

  16 x 1 byte   (gcc -maltivec):  vector unsigned char, vector signed char, vector bool char
  8 x 2 bytes   (gcc -maltivec):  vector unsigned short, vector signed short, vector bool short, vector pixel
  4 x 4 bytes   (gcc -maltivec):  vector unsigned int, vector signed int, vector bool int, vector float
  2 x 8 bytes   (gcc -mvsx):      vector unsigned long, vector signed long, vector double
Vector Data Types: Initialization, Loading and Storing

  vector int va = {1, 2, 3, 4};

  int data[] = {1, 2, 3, 4, 5, 6, 7, 8};
  vector int vb = *((vector int *)data);

  int output[4];
  *((vector int *)output) = va;

Can be very slow!

  printf("vb = {%d, %d, %d, %d};\n",
         vb[0], vb[1], vb[2], vb[3]);
Aligned Addresses
Historically, memory addresses were required to be aligned at 16-byte boundaries for efficiency reasons. (POWER8 has improved unaligned load/store, and modern compilers will support you.)

  int data[] __attribute__((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};   /* compiler specific */
  int *output = aligned_alloc(16, NUM * sizeof(int));

  vector int va = vec_ld(0, data);   /* vec_ld(index, address) */
  vec_st(va, 0, output);             /* vec_st(value, index, address) */

vec_ld and vec_st access address + index, truncated to a multiple of 16.
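The truncation rule can be made concrete without any vector hardware. The following sketch uses hypothetical helper names and plain integer arithmetic; it shows which address an access of the form address + index would actually touch after truncation. It also rounds the allocation size up to a multiple of 16, because C11 words the `aligned_alloc` size requirement as an integral multiple of the alignment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* True if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Address used by a load/store that truncates (base + index) to a
   multiple of 16: the low four bits are simply cleared. */
uintptr_t truncated_address(const void *base, long index)
{
    return ((uintptr_t)base + (uintptr_t)index) & ~(uintptr_t)15;
}

/* Allocate room for n ints on a 16-byte boundary. The byte count is
   rounded up to a multiple of 16. Caller releases with free(). */
int *alloc_ints_aligned16(size_t n)
{
    assert(n > 0);
    size_t bytes = (n * sizeof(int) + 15) & ~(size_t)15;
    return aligned_alloc(16, bytes);
}
```

With an aligned base pointer, an index of 4 therefore still addresses the first 16-byte block, while an index of 16 addresses the second one. An unaligned pointer silently reads from the enclosing aligned block, which is why the data should be aligned in the first place.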
Vector Intrinsics
Operations are available through a rich set [1] of "overloaded functions" (actually intrinsics):

  vector int va = {4, 3, 2, 1};
  vector int vb = {1, 2, 3, 4};
  vector int vc = vec_add(va, vb);

  vector float vfa = {4, 3, 2, 1};
  vector float vfb = {1, 2, 3, 4};
  vector float vfc = vec_add(vfa, vfb);

(Figure: element-wise addition, [A0 A1 A2 A3] + [B0 B1 B2 B3] = [C0 C1 C2 C3])

[1] https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html
Vector Intrinsics: Lots of overloads
Attention: No implicit conversion! Also not all types for every operation.

  vector signed char vec_add (vector bool char, vector signed char);
  vector signed char vec_add (vector signed char, vector bool char);
  vector signed char vec_add (vector signed char, vector signed char);
  vector unsigned char vec_add (vector bool char, vector unsigned char);
  vector unsigned char vec_add (vector unsigned char, vector bool char);
  vector unsigned char vec_add (vector unsigned char, vector unsigned char);
  vector signed short vec_add (vector bool short, vector signed short);
  vector signed short vec_add (vector signed short, vector bool short);
  vector signed short vec_add (vector signed short, vector signed short);
  vector unsigned short vec_add (vector bool short, vector unsigned short);
  vector unsigned short vec_add (vector unsigned short, vector bool short);
  vector unsigned short vec_add (vector unsigned short, vector unsigned short);
  vector signed int vec_add (vector bool int, vector signed int);
  vector signed int vec_add (vector signed int, vector bool int);
  vector signed int vec_add (vector signed int, vector signed int);
  vector unsigned int vec_add (vector bool int, vector unsigned int);
  vector unsigned int vec_add (vector unsigned int, vector bool int);
  vector unsigned int vec_add (vector unsigned int, vector unsigned int);
  vector float vec_add (vector float, vector float);
  vector double vec_add (vector double, vector double);
  vector long long vec_add (vector long long, vector long long);
  vector unsigned long long vec_add (vector unsigned long long, vector unsigned long long);

[1] https://gcc.gnu.org/onlinedocs/gcc-6.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html
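C itself has no function overloading, so one name with many typed variants needs compiler support. How such type-based dispatch can look is sketched below with C11 `_Generic`. This is not how `altivec.h` is implemented; `my_vec_add`, `add_v4si`, `add_v4sf` and `demo_overloads` are hypothetical names, and the GCC/Clang vector extension again stands in for the AltiVec types.

```c
#include <assert.h>

typedef int   v4si __attribute__((vector_size(16)));
typedef float v4sf __attribute__((vector_size(16)));

/* One typed implementation per supported element type. */
static v4si add_v4si(v4si a, v4si b) { return a + b; }
static v4sf add_v4sf(v4sf a, v4sf b) { return a + b; }

/* Hypothetical "overloaded" front end: the implementation is selected at
   compile time from the static type of the first argument. There is no
   default branch, so an unsupported type is a compile-time error. */
#define my_vec_add(a, b) \
    _Generic((a), v4si: add_v4si, v4sf: add_v4sf)((a), (b))

/* Usage: the same name works for int and float vectors. */
void demo_overloads(void)
{
    v4si va = {4, 3, 2, 1};
    v4si vb = {1, 2, 3, 4};
    v4si vc = my_vec_add(va, vb);
    assert(vc[0] == 5 && vc[3] == 5);

    v4sf vfa = {4, 3, 2, 1};
    v4sf vfb = {1, 2, 3, 4};
    v4sf vfc = my_vec_add(vfa, vfb);
    assert(vfc[1] == 5.0f && vfc[2] == 5.0f);
}
```

The sketch covers two element types; the real `vec_add` list above is longer because every signed, unsigned and bool combination gets its own entry.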
Get Help: Programming Interface Manual
Highly helpful resource: http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

Each entry provides:
  □ Name of operation
  □ Pseudocode description
  □ Text description
  □ Graphical description
  □ Type table and corresponding assembly instruction

Example entry (Generic and Specific AltiVec Operations): vec_add, Vector Add

  d = vec_add(a, b)

  • Integer add:
      n ← number of elements
      do i = 0 to n-1
        d[i] ← a[i] + b[i]
      end
  • Floating-point add:
      do i = 0 to 3
        d[i] ← a[i] +fp b[i]
      end

  Each element of a is added to the corresponding element of b. Each sum is placed in the corresponding element of d.
  For vector float argument types, if VSCR[NJ] = 1, every denormalized operand element is truncated to a 0 of the same sign before the operation is carried out, and each denormalized result element is truncated to a 0 of the same sign.
  The valid combinations of argument types and the corresponding result types for d = vec_add(a, b) are shown in Figure 4-12, Figure 4-13, Figure 4-14, and Figure 4-15.

  (Figure: elements 0 to 15 of a and b are added pairwise into d.)

  Type table (excerpt):
    d                     a                     b                     maps to
    vector unsigned char  vector unsigned char  vector unsigned char  vaddubm d,a,b
    vector unsigned char  vector unsigned char  vector bool char      vaddubm d,a,b
    vector unsigned char  vector bool char      vector unsigned char  vaddubm d,a,b
    vector signed char    vector signed char    …
Get Help: IBM Knowledge Center
IBM provides online documentation of the extended standard, which GCC does not fully implement.