Assembly Language Programming Optimization, Zbigniew Jurkiewicz


  1. Assembly Language Programming Optimization. Zbigniew Jurkiewicz, Instytut Informatyki UW, December 9, 2017.

  2. Conditional transfer. Sometimes we make a comparison only to execute a single assignment depending on the result. In that case we can use a conditional move instruction, which performs the assignment only if the indicated condition holds. For example, cmove eax,ebx copies ebx into eax only if the most recently compared operands were equal (CMOVcc takes a register or memory source, not an immediate, so a constant must first be loaded into a register). The main advantage is that no conditional jump is needed, so the processor neither has to flush the pipeline after a misprediction nor rely on speculative execution. Two instruction families are relevant: conditional set (SETcc) and conditional move (CMOVcc).

  3. Conditional transfer: an example. Find the maximum of two numbers (arguments in EAX and EBX, result in ECX):
         mov ecx,eax
         cmp ebx,ecx
         cmova ecx,ebx   ;Unsigned comparison; use cmovg for signed numbers
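
In C the same selection can be written as an ordinary conditional assignment. The sketch below is only an illustration: the name max_u32 is hypothetical, and whether a compiler actually emits cmp/cmova for it depends on the compiler, the target and the optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: maximum of two unsigned values.
   The unsigned type matches the unsigned condition tested by cmova. */
uint32_t max_u32(uint32_t a, uint32_t b)
{
    uint32_t result = a;   /* mov ecx,eax */
    if (b > result)        /* cmp ebx,ecx */
        result = b;        /* cmova ecx,ebx, if the compiler chooses it */
    return result;
}
```

Inspecting the generated assembly (for example with the -S flag of GCC or Clang) is the only reliable way to see whether a jump was avoided.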

  4. Conditional transfers: errors. Assume we are compiling the C expression
         int *xp; ... return (xp ? *xp : 0);
     If xp is in rdi, we could try
         xor eax,eax        ;Maybe we will return zero
         test rdi,rdi       ;xp == 0 ?
         cmovne eax,[rdi]   ;Maybe we will return *xp
     But then xp is always dereferenced (even when it is a NULL pointer), because cmov reads its memory operand regardless of the condition, and this is exactly what we want to avoid.
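
One common way around the problem is to make the selection between two valid addresses and dereference only afterwards. The following is a minimal sketch of that idea in C; the name load_or_zero and the static dummy zero are assumptions made for illustration, not part of the original slides.

```c
#include <stddef.h>

/* Hypothetical helper: returns *xp, or 0 when xp is NULL. */
int load_or_zero(const int *xp)
{
    static const int zero = 0;
    /* Select between two valid addresses; a compiler may implement
       this selection with a conditional move on the pointer. */
    const int *src = xp ? xp : &zero;
    return *src;   /* the only dereference, always of a valid address */
}
```

The conditional move now operates on register values only, so a NULL pointer is never used as a memory operand.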

  5. Jump avoidance. Avoiding jumps is a larger problem. Let us look at the computation of the absolute value of a number:
         test eax,eax   ;Set the flags
         jns skip       ;Non-negative: nothing to do
         neg eax
     skip:

  6. Jump avoidance. There is a different way, without any jump:
         mov ecx,eax
         sar ecx,31    ;Sign bit copied everywhere: 0 or -1
         xor eax,ecx   ;Invert all bits if the number is negative
         sub eax,ecx   ;Subtracting -1 adds 1: two's complement negation
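
The same three steps can be sketched in C. The name abs_branchless is hypothetical, and the code assumes that >> on a negative int32_t is an arithmetic shift, which the C standard leaves implementation-defined (mainstream compilers for x86 and ARM behave this way). The result is returned as unsigned so that the most negative input does not overflow.

```c
#include <stdint.h>

/* Hypothetical helper: branch-free absolute value.
   Assumes x >> 31 is an arithmetic shift, like the sar instruction. */
uint32_t abs_branchless(int32_t x)
{
    int32_t mask = x >> 31;                    /* 0 if x >= 0, -1 if x < 0 */
    uint32_t flipped = (uint32_t)(x ^ mask);   /* invert the bits of a negative x */
    return flipped - (uint32_t)mask;           /* subtracting -1 adds 1 */
}
```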

  7. Power of 2. Another trick: how to check whether a number in EAX is a power of two?
         mov ebx,eax    ;These two instructions can be replaced
         dec ebx        ;by a single lea ebx,[eax - 1]
         test eax,ebx
         jnz isnot
     Note that zero also passes this test, so it must be checked separately if it should be rejected.
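
A minimal C sketch of the same test follows; the function name is_power_of_two is hypothetical. Unlike the assembly fragment, it explicitly rejects zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when exactly one bit of x is set. */
bool is_power_of_two(uint32_t x)
{
    /* x & (x - 1) clears the lowest set bit, so the result is zero
       exactly when x has at most one bit set; x != 0 excludes zero. */
    return x != 0 && (x & (x - 1)) == 0;
}
```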

  8. Hints. The processor tries to guess whether a conditional jump will be taken. With static prediction it assumes that a jump “backwards” will be taken. We can help it using hint prefixes HT (0x3e) and HNT (0x2e), for example
         test ecx,ecx
         db 3eh       ;HT = we will jump
         jz L9
         ...
     L9:
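
At the C level a related hint can be given to the compiler rather than to the processor. The sketch below uses __builtin_expect, a GCC/Clang builtin that is not part of standard C; it typically influences how the compiler lays out the two branches and does not by itself promise any prefix in the output. The macro UNLIKELY and the function sum_values are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical macro; __builtin_expect is a GCC/Clang extension. */
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)

/* Hypothetical helper: sums n values, treating a NULL input as a rare error. */
long sum_values(const int *values, size_t n)
{
    if (UNLIKELY(values == NULL))   /* tell the compiler this path is rare */
        return -1;
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += values[i];
    return total;
}
```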

  9. Hints. Sometimes keeping data in cache memory is not useful, because it is used only once. Non-temporal store instructions (MOVNTI, MOVNTPD, etc.) write directly to memory and bypass the cache.
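
From C, MOVNTI is reachable through the SSE2 intrinsic _mm_stream_si32. The sketch below assumes an x86 target with SSE2 and a compiler that provides emmintrin.h; the function name fill_streaming is hypothetical. Whether such stores actually pay off depends on the buffer size and the hardware.

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si32; also provides _mm_sfence */
#include <stddef.h>

/* Hypothetical helper: fills dst[0..n-1] with value using non-temporal stores. */
void fill_streaming(int *dst, size_t n, int value)
{
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&dst[i], value);   /* MOVNTI: store that bypasses the cache */
    _mm_sfence();   /* order the streamed stores before any later stores */
}
```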

  10. Conservativity of compiler. The C compiler must be conservative and generate code in such a way that all possible cases are covered. Example:
          void memclr (char *data, int n)
          {
              for (; n > 0; n--)
                  *data++ = 0;
          }
      If the compiler knew something about the alignment of data, it could generate code that zeroes 2, 4 or even 8 bytes in one step. However, it must assume the worst case.
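
One way for the programmer to supply the missing knowledge is through the pointer type. In the sketch below the hypothetical variant memclr_words takes a uint64_t pointer, which promises 8-byte alignment, and a length counted in 8-byte words; it illustrates the idea and is not a drop-in replacement for memclr.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of memclr: the pointer type promises 8-byte
   alignment, and the length is given in 8-byte words. */
void memclr_words(uint64_t *data, size_t n_words)
{
    for (; n_words > 0; n_words--)
        *data++ = 0;   /* one store clears 8 bytes */
}
```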

  11. Conservativity of compiler. There are a few elements of C/C++ that are classic examples of slowing programs down. The group is led by the conversion (cast) from a real number to an integer, for example
          int i; float f; ... i = (int)f;
      Such a conversion can take 50-100 processor cycles. Reason: C/C++ requires truncation toward zero, while the FPU rounds to nearest by default, so the coprocessor rounding mode has to be switched before the conversion and restored afterwards.

  12. Conservativity of compiler. Another Oscar nominee is pointer aliasing. In the code below the compiler will not hoist the evaluation of *p + 2 out of the loop:
          void Func1 (int a[], int *p)
          {
              int i;
              for (i = 0; i < 100; i++)
                  a[i] = *p + 2;
          }
      And it is right, because (hooray for C and C++ :-)
          void Func2()
          {
              int list[100];
              Func1(list, &list[8]);
          }
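
If the programmer knows that p never points into a[], the evaluation can be hoisted by hand. The sketch below shows this with a hypothetical variant named Func1_hoisted; note that it deliberately behaves differently from Func1 when the arguments do alias, as in Func2. The C99 restrict qualifier is another way of making the same promise to the compiler.

```c
/* Hypothetical variant of Func1: *p + 2 is hoisted by hand.
   The result differs from Func1 if p points into a[], as in Func2. */
void Func1_hoisted(int a[], const int *p)
{
    int value = *p + 2;   /* evaluated once, before the loop */
    for (int i = 0; i < 100; i++)
        a[i] = value;
}
```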

  13. Conservativity of compiler. Sometimes the recipes are simple. The code below fetches arg1->p1 from memory twice:
          struct S1 { int p1; };
          struct S2 { int p2, p3; };
          void f1 (struct S1 *arg1, struct S2 *arg2)
          {
              arg2->p2 += arg1->p1;
              arg2->p3 += arg1->p1;
          }
      It must work this way, because arg2->p2 and arg1->p1 may be the same memory cell. But it is enough to introduce a local variable holding arg1->p1.
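
The recipe with a local variable can be sketched as follows. The function name f1_local is hypothetical; the structure definitions are repeated so that the fragment stands on its own.

```c
struct S1 { int p1; };
struct S2 { int p2, p3; };

/* Hypothetical variant of f1: arg1->p1 is read once into a local. */
void f1_local(struct S1 *arg1, struct S2 *arg2)
{
    int p1 = arg1->p1;   /* single load; later stores cannot change p1 */
    arg2->p2 += p1;
    arg2->p3 += p1;
}
```

As with any manual hoisting, the result differs from f1 in the rare case where the two arguments really do overlap.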

  14. Assembler. Assembler lets us take advantage of low-level services: registers and direct input/output; violating the compiler conventions (different parameter passing, breaking the memory allocation rules, iterative calls of procedures); linking incompatible code fragments, e.g. built by different compilers; hand-optimizing code for a very particular hardware configuration.

  15. Extreme example (an appetizer). The following C code
          float a[4], b[4], c[4];
          for (int i = 0; i < 4; i++) {
              c[i] = a[i] > b[i] ? a[i] : b[i];
          }
      can be optimally coded as follows
          movaps xmm0,[a]   ;Load vector a (16-byte aligned)
          maxps xmm0,[b]    ;max(a,b)
          movaps [c],xmm0   ;c = a > b ? a : b

  16. Not enough registers or “two in one”. We have two variables, index and increment, both 16-bit (short). On ARM they can be put into one register, with index in the top half. Then the C code
          elem = tab[index];
          index += increment;
      could be written in assembler as
          LDRB Relem, [Rtab, Rindincr, LSR#16]
          ADD Rindincr, Rindincr, Rindincr, LSL#16
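
To see why the two ARM instructions are enough, the packing can be modelled in portable C. The sketch below is an illustration only: the names pack_indincr and load_and_advance are hypothetical, and tab is assumed to hold bytes, as the LDRB instruction suggests.

```c
#include <stdint.h>

/* Hypothetical helper: index in the top 16 bits, increment in the bottom 16. */
uint32_t pack_indincr(uint16_t index, uint16_t increment)
{
    return ((uint32_t)index << 16) | increment;
}

/* Models the two ARM instructions: load tab[index], then index += increment. */
uint8_t load_and_advance(const uint8_t *tab, uint32_t *indincr)
{
    uint8_t elem = tab[*indincr >> 16];   /* LDRB Relem, [Rtab, Rindincr, LSR#16] */
    *indincr += *indincr << 16;           /* ADD Rindincr, Rindincr, Rindincr, LSL#16 */
    return elem;
}
```

Shifting the whole register left by 16 discards the old index and moves the increment into the top half, so the addition changes only the index and leaves the increment in the bottom half intact.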

  17. Intel/AMD. The instruction set of CISC processors (x86) is not optimal, which is confirmed by several changes of architecture philosophy. It must be preserved for backward compatibility with systems from the 1980s, when RAM and disk storage were small and costly. But CISC also has some advantages: compact code suits cache memories of limited size. The main problem of x86 processors is the shortage of registers, alleviated a little in the design of x86-64.

  18. Graphics accelerators. Demanding graphics applications need platforms with a graphics coprocessor or accelerator card. The computational power they contain can also be used for other tasks, but this is another story (and it depends heavily on the hardware).

  19. 64-bit code. Advantages: more registers, so there is usually no need to store variables and intermediate results in RAM; efficient procedure calls, with parameters passed in registers; 64-bit registers for integers; better management of large memory blocks; built-in restricted SIMD (SSE); relative addressing of data, giving efficient relocatable code.

  20. 64-bit code. Disadvantages: addresses and stack slots are twice as large, which puts pressure on cache memory; access to static and global arrays requires more instructions for large memory images (mostly on Windows and Mac); computing the effective memory address is more complicated when the image size exceeds 2 GB; some instructions are longer.

  21. Intrinsic functions in C++. A newer approach to joining code from different levels. Intrinsic functions represent processor instructions known to the compiler. Example: the addition of floating-point vectors, ADDPS, may be written in C++ as the function _mm_add_ps. We can also define an appropriate vector class and overload the + operator in it. Intrinsic functions exist in the Microsoft, Intel and GNU compilers.
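
The same intrinsics are also available from plain C. The sketch below adds two vectors of four floats with _mm_add_ps; it assumes an x86 target with SSE and a compiler that provides xmmintrin.h, and the function name add4 is hypothetical. Unaligned loads and stores are used so that the arrays need no special alignment.

```c
#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: c[i] = a[i] + b[i] for four floats. */
void add4(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* ADDPS, then unaligned store */
}
```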

  22. Examining compiled code. Various reasons: checking for evident places to rewrite by hand in assembly language (or to switch a compiler flag, e.g. -O3 ;-); using the compiler as an intelligent typist, and the resulting code as a more comfortable base than starting from nothing (this code at least has correct interfaces with the environment, and those usually give us the most trouble); and sometimes we will discover an error in the compiler.

  23. Examining compiled code. Let us look at the loop
          for (int i = 0; i <= 15; i++) T[i] = i;
      The compiler should logically replace it by
          for (int i = 15; i >= 0; i--) T[i] = i;
      Reason: we save a comparison instruction (with 15), because the decrement already sets the flags needed by the conditional jump (here the sign flag, once i drops below zero).
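
Written out as a complete function, the descending form looks as follows. The name fill_descending is hypothetical. Whether a given compiler performs this reversal on its own, or prefers another transformation such as unrolling or vectorization, is best checked in the generated assembly.

```c
/* Hypothetical helper: fills T[0..15] with T[i] = i, counting down so
   that the loop condition can reuse the flags set by the decrement. */
void fill_descending(int T[16])
{
    for (int i = 15; i >= 0; i--)
        T[i] = i;
}
```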
