Assembly Language Programming
Parallel architectures
Zbigniew Jurkiewicz, Instytut Informatyki UW
December 4, 2017
Images: Song Ho Anh
Flynn taxonomy
There is a large variety of parallel architectures, so they can be classified by different criteria.
The oldest popular classification, due to Flynn, uses the concepts of instruction stream and data stream.
SISD (Single Instruction Single Data): a classical uniprocessor system with a single stream of instructions and a single stream of data.
If we multiply the data streams, we get a SIMD system (Single Instruction Multiple Data), where many processors execute the same instruction, each on a different data unit. This speeds up typical matrix operations, like multiplication or inversion. This mode was used in vector processors (e.g. Cray-1) and matrix (array) processors.
Flynn taxonomy
If we multiply the instruction streams instead, we get a MISD system (Multiple Instruction Single Data), used only in very specific applications (e.g. nondeterministic or mirror processing).
If we multiply both streams, we get a MIMD system (Multiple Instruction Multiple Data) with full concurrency: different processors can independently process different data.
Division by method of synchronization
  synchronous, e.g. SIMD;
  asynchronous, e.g. MIMD.
MMX
Intel's first approach to SIMD (Pentium MMX).
"Parasitic" on the FPU: it uses the same registers.
The first 64-bit instructions, e.g.
    movq mm0,[vector]
which moves a vector of values to an MMX register.
They are also available in SSE.
Checking for SSE technology
Before using advanced instructions we have to check whether our processor supports them.
Our laboratories have different processors!
The check is performed with the cpuid instruction.
CPUID
For function 1 we get in the EDX register
  bit 0: is FPU present?
  bit 11: do SYSENTER/SYSEXIT work? (SYSCALL/SYSRET is reported by function 80000001h, also in EDX bit 11)
  bit 15: is CMOV present?
  bit 23: is MMX present?
  bit 25: is SSE present?
  bit 26: is SSE2 present?
and in ECX
  bit 0: is SSE3 present?
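The check can also be sketched from C. This is a minimal example that assumes GCC or Clang on an x86 target, where the `<cpuid.h>` header provides `__get_cpuid`; the type `simd_features` and the function `detect_simd` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <cpuid.h>     /* GCC/Clang helper for the CPUID instruction (x86 only) */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: one flag per instruction set. */
typedef struct {
    bool mmx, sse, sse2, sse3;
} simd_features;

/* Runs CPUID function 1 and decodes some of the feature bits listed above.
   Returns false when the processor does not support function 1. */
static bool detect_simd(simd_features *out)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    assert(out != NULL);
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;

    out->mmx  = (edx >> 23) & 1u;   /* EDX bit 23 */
    out->sse  = (edx >> 25) & 1u;   /* EDX bit 25 */
    out->sse2 = (edx >> 26) & 1u;   /* EDX bit 26 */
    out->sse3 = (ecx >> 0)  & 1u;   /* ECX bit 0  */
    return true;
}
```

A program would call `detect_simd` once at startup and choose between an SSE code path and a plain fallback based on the flags.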
CPUID
For function 4 (called with ECX = 0, on Intel processors) we have in EAX
  bits 26..31: maximum number of addressable processor (core) IDs in the package, minus 1
Much more in the documentation.
XMM registers
Used by the SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE5, Advanced Vector Extensions, ... instruction sets.
In 32-bit mode: eight 128-bit registers XMM0–XMM7.
In 64-bit mode: sixteen 128-bit registers XMM0–XMM15.
XMM registers
They can store
  a vector of 16 8-bit integer values,
  a vector of 8 16-bit integer values,
  a vector of 4 32-bit integer or floating-point values,
  a pair ("vector") of 2 64-bit integer or floating-point values.
Data Transfers
Arithmetic instructions accept memory operands only at addresses aligned to 16 bytes (and of course registers).
Two kinds of transfers to/from registers:
  MOVDQU if the address is not aligned (Unaligned)
  MOVDQA if the address is aligned (Aligned)
Clearing (zeroing) a register:
    pxor xmm1,xmm1
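The two kinds of transfer can be illustrated with SSE2 intrinsics in C. This is a sketch under assumptions: a compiler that provides `<emmintrin.h>` and an x86 target with SSE2 enabled; `load16`, `copy16` and `clear16` are hypothetical helper names. The intrinsics `_mm_load_si128`, `_mm_loadu_si128` and `_mm_setzero_si128` correspond to MOVDQA, MOVDQU and the PXOR idiom, although the compiler is free to pick the exact instructions.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Loads 16 bytes from p, using the aligned form only when p allows it. */
static __m128i load16(const void *p)
{
    assert(p != NULL);
    if (((uintptr_t)p & 15u) == 0)
        return _mm_load_si128((const __m128i *)p);    /* MOVDQA */
    return _mm_loadu_si128((const __m128i *)p);       /* MOVDQU */
}

/* Copies 16 bytes between buffers of any alignment. */
static void copy16(uint8_t *dst, const uint8_t *src)
{
    assert(dst != NULL);
    _mm_storeu_si128((__m128i *)dst, load16(src));
}

/* Writes 16 zero bytes, using a zeroed register (the PXOR idiom). */
static void clear16(uint8_t *dst)
{
    assert(dst != NULL);
    _mm_storeu_si128((__m128i *)dst, _mm_setzero_si128());
}
```

The run-time alignment test is only there to show both forms side by side; code that cannot guarantee alignment would normally just use the unaligned load.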
Addition
Already present in MMX for integer vectors.
SSE adds operations on (small) vectors of single precision floating-point numbers, e.g. ADDPS (where P is from Packed and S from Single).
SSE2 extends this to double precision floating-point numbers, e.g. ADDPD.
There are also operations on single values, called scalars, e.g. ADDSS and ADDSD.
Example addition
We are to add two three-dimensional vectors, represented in single precision.
Each has to occupy 128 bits, so we assume that the "highest" (fourth) element is cleared.
    movups xmm0,[eax]
    movups xmm1,[ebx]
    addps  xmm0,xmm1
    movups [edx],xmm0
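The same addition can be written with SSE intrinsics in C, one intrinsic per instruction. This is only a sketch: the padded type `vec3_padded` and the function `vec3_add` are hypothetical, and an x86 target with SSE enabled is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical 3D vector padded to four floats; v[3] is kept at zero. */
typedef struct {
    float v[4];
} vec3_padded;

/* out = a + b, computed on all four lanes at once. */
static void vec3_add(const vec3_padded *a, const vec3_padded *b,
                     vec3_padded *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m128 x = _mm_loadu_ps(a->v);             /* movups xmm0,[eax] */
    __m128 y = _mm_loadu_ps(b->v);             /* movups xmm1,[ebx] */
    _mm_storeu_ps(out->v, _mm_add_ps(x, y));   /* addps, then movups [edx],xmm0 */
}
```

Because both padding lanes are zero, the fourth lane of the result stays zero as well.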
Multiplication
Similar to addition: MULPS/MULPD for vectors, MULSS/MULSD for scalars.
Operations scalar/vector
When we add a scalar to a vector, we must first replicate the scalar to make it look like a vector.
The simplest way is to use the "shuffling" operation:
    shufps xmm1,xmm2,mask
This operation stores in the first argument two elements selected from the first argument (in the lower half) and two elements selected from the second argument (in the upper half).
The selection is controlled by an 8-bit mask: four 2-bit fields, one per element of the result.
Shuffling [figure]
Adding scalar to vector
When shuffling, the first argument may be the same as the second one.
    movss  xmm0,[fscalar]
    shufps xmm0,xmm0,00000000B
    addps  xmm0,xmm1
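A C sketch of the same idea, using the `_mm_shuffle_ps` intrinsic for SHUFPS. The function names `add_scalar4` and `reverse4` are hypothetical and an x86 target with SSE is assumed. The second function is added only to show how the four 2-bit fields of the mask select elements; `_MM_SHUFFLE` is the standard macro from `<xmmintrin.h>` that packs them, highest field first.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* out[i] = vec[i] + scalar for i = 0..3. */
static void add_scalar4(const float *vec, float scalar, float *out)
{
    assert(vec != NULL && out != NULL);
    __m128 s = _mm_set_ss(scalar);       /* scalar in lane 0, other lanes zero */
    s = _mm_shuffle_ps(s, s, 0x00);      /* shufps xmm0,xmm0,0: copy lane 0 everywhere */
    _mm_storeu_ps(out, _mm_add_ps(s, _mm_loadu_ps(vec)));   /* addps */
}

/* out = (vec[3], vec[2], vec[1], vec[0]). */
static void reverse4(const float *vec, float *out)
{
    assert(vec != NULL && out != NULL);
    __m128 v = _mm_loadu_ps(vec);
    /* Fields from low to high: lane 0 takes element 3, lane 1 takes 2,
       lane 2 takes 1, lane 3 takes 0. */
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, v);
}
```

With a mask of zero every field selects element 0, which is why the scalar ends up in all four lanes.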
Comparisons
The comparison operations in SSE do not set flags (there are not enough flags).
Instead the first argument is set to true (all ones) or false (all zeros), of course separately for each pair of elements.
Thus comparison operations are destructive!
    cmpeqpd  xmm0,xmm1   ;result in xmm0 (equality chosen here as an example predicate)
    movmskpd eax,xmm0
    cmp      eax,0
    jne      tam
Remark: for scalars the instructions COMISD and UCOMISD directly modify EFLAGS, and we can use conditional jumps as usual.
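The compare-then-extract pattern can be sketched in C with SSE2 intrinsics. The function name `less_than_mask2` is hypothetical, the "less than" predicate is chosen only as an example, and an x86 target with SSE2 is assumed. Unlike the assembly instruction, the intrinsic returns its result as a new value, so the inputs are not overwritten at the C level.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Returns a 2-bit mask: bit i is set when a[i] < b[i]. */
static int less_than_mask2(const double *a, const double *b)
{
    assert(a != NULL && b != NULL);
    __m128d x = _mm_loadu_pd(a);
    __m128d y = _mm_loadu_pd(b);
    __m128d m = _mm_cmplt_pd(x, y);   /* CMPLTPD: each lane all ones or all zeros */
    return _mm_movemask_pd(m);        /* MOVMSKPD: gather the two sign bits */
}
```

A nonzero result means that at least one pair satisfied the predicate, which is what the `cmp eax,0` / `jne` pair tests in the assembly version.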
Comparisons [figure]
SSE3
Scalar floating-point operations in single and double precision work on the lower parts of XMM registers.
Register-to-register floating-point operations (instead of stack-based ones) leave less use for the FPU, e.g. GCC uses the FPU only in rare, special cases.
Floating-point constants must be fetched from memory; they cannot be put directly in instructions as immediates.
Comparing floating-point values has four possible results (also in C): <, =, > and "none of these" (unordered, for example when an argument is NaN).
SSE3
Added some interesting (read "exotic") instructions, for example horizontal addition HADDPS/HADDPD, which adds neighbouring pairs in the arguments and stores the result in the first argument.
We will use it to compute the scalar product of vectors:
    movups xmm0,[eax]   ;First vector
    movups xmm1,[ebx]   ;Second vector
    mulps  xmm0,xmm1    ;Products
    haddps xmm0,xmm0    ;Half-sums
    haddps xmm0,xmm0
    movss  [edx],xmm0
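The scalar product above can be sketched in C with the SSE3 intrinsic `_mm_hadd_ps`. The function name `dot4` is hypothetical. The `target("sse3")` attribute is a GCC/Clang extension, used here on the assumption that the file may be compiled without `-msse3`; the processor must still support SSE3 at run time.

```c
#include <assert.h>
#include <pmmintrin.h>  /* SSE3 intrinsics */
#include <stddef.h>

/* Scalar product of two 4-element single precision vectors. */
__attribute__((target("sse3")))
static float dot4(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
    __m128 p = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));   /* mulps: products */
    p = _mm_hadd_ps(p, p);     /* haddps: (p0+p1, p2+p3, p0+p1, p2+p3) */
    p = _mm_hadd_ps(p, p);     /* haddps: every lane holds p0+p1+p2+p3 */
    return _mm_cvtss_f32(p);   /* take lane 0, like movss [edx],xmm0 */
}
```

For three-dimensional vectors stored with a cleared fourth element, the same function works unchanged, because the extra product is zero.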