✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ Great Reality Great Reality CS 105 “Tour of the Black Holes of Computing” There’s more to performance than asymptotic complexity Constant factors matter too! Code Optimization and Performance Code Optimization and Performance Easily see 10:1 performance range depending on how code is written Must optimize at multiple levels: � Algorithm, data representations, procedures, and loops Must understand system to optimize performance How programs are compiled and executed How to measure program performance and identify bottlenecks How to improve performance without destroying code modularity, generality, readability CS 105 – 2 – Optimizing Compilers Optimizing Compilers Limitations of Optimizing Compilers Limitations of Optimizing Compilers Operate Under Fundamental Constraint Provide efficient mapping of program to machine Must not cause any change in program behavior under any possible condition Register allocation Often prevents making optimizations that would only affect behavior under pathological Code selection and ordering conditions Eliminating minor inefficiencies Behavior that may be obvious to the programmer can be obfuscated by languages and Don’t (usually) improve asymptotic efficiency coding styles E.g., data ranges may be more limited than variable types suggest Up to programmer to select best overall algorithm Big-O savings are (often) more important than constant factors Most analysis is performed only within procedures � But constant factors also matter Whole-program analysis is too expensive in most cases (gcc does quite a bit of interprocedural analysis—but not across files) Have difficulty overcoming “optimization blockers” Most analysis is based only on static information Potential memory aliasing Compiler has difficulty anticipating run-time inputs Potential procedure side effects When in doubt, the compiler must be conservative CS 105 CS 105 – 3 – – 4 –
✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ Generally Useful Optimizations Generally Useful Optimizations Compiler-Generated Code Motion (-O1) Compiler-Generated Code Motion (-O1) Optimizations you should do regardless of processor / compiler void set_row(double *a, double *b, long j; long i, long n) long ni = n*i; Code Motion { double *rowp = a+ni; long j; for (j = 0; j < n; j++) Reduce frequency with which computation performed for (j = 0; j < n; j++) rowp[j] = b[j]; a[n*i+j] = b[j]; � If it will always produce same result } � Especially moving code out of loop � Gcc often does this for you (so check assembly) set_row: testq %rcx, %rcx # Test n for (i = 0; i < n; i++) { jle .L1 # If 0, goto done for (i = 0; i < n; i++) int ni = n*i; imulq %rcx, %rdx # ni = n*i for (j = 0; j < n; j++) leaq (%rdi,%rdx,8), %rdx # rowp = A + ni*8 for (j = 0; j < n; j++) a[n*i + j] = b[j]; movl $0, %eax # j = 0 a[ni + j] = b[j]; .L3: # loop: } movsd (%rsi,%rax,8), %xmm0 # t = b[j] movsd %xmm0, (%rdx,%rax,8) # M[A+ni*8 + j*8] = t addq $1, %rax # j++ cmpq %rcx, %rax # j:n jne .L3 # if !=, goto loop .L1: # done: rep ; ret CS 105 CS 105 – 5 – – 6 – Strength Reduction Strength Reduction Share Common Subexpressions Share Common Subexpressions Replace costly operation with simpler one Reuse portions of expressions Shift, add instead of multiply or divide Gcc will do this with –O1 and up 16*x x << 4 � /* Sum neighbors of i,j */ long inj = i*n + j; � Utility is machine-dependent up = val[(i-1)*n + j ]; up = val[inj - n]; � Depends on cost of multiply or divide instruction down = val[(i+1)*n + j ]; down = val[inj + n]; left = val[i*n + j-1]; left = val[inj - 1]; � On Intel Nehalem, integer multiply requires 3 CPU cycles right = val[i*n + j+1]; right = val[inj + 1]; Recognize sequence of products sum = up + down + left + right; sum = up + down + left + right; Again, gcc often does it ���������������������������������������� ��������������������� int ni = 0; for (i = 0; i < n; i++) for (i = 0; i < n; i++) { leaq 1(%rsi), %rax # i+1 imulq %rcx, %rsi # i*n for (j = 0; j < n; j++) leaq -1(%rsi), %r8 # i-1 addq %rdx, %rsi # i*n+j for (j = 0; j < n; j++) a[n*i + j] = b[j]; a[ni + j] = b[j]; imulq %rcx, %rsi # i*n movq %rsi, %rax # i*n+j imulq %rcx, %rax # (i+1)*n subq %rcx, %rax # i*n+j-n ni += n; } imulq %rcx, %r8 # (i-1)*n leaq (%rsi,%rcx), %rcx # i*n+j+n addq %rdx, %rsi # i*n+j addq %rdx, %rax # (i+1)*n+j addq %rdx, %r8 # (i-1)*n+j CS 105 CS 105 – 7 – – 8 –
✁ ✁ ✁ ✁ ✁ ✁ ✁ Optimization Blocker #1: Procedure Calls Optimization Blocker #1: Procedure Calls Lower-Case Conversion Performance Lower-Case Conversion Performance Procedure to Convert String to Lower Case Time quadruples when double string length Quadratic performance #include <ctype.h> void lower(char *s) 250 { size_t i; 200 for (i = 0; i < strlen(s); i++) if (isupper(s[i])) CPU seconds 150 s[i] = tolower(s[i]); lower1 } 100 50 Extracted from many student programs 0 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 String length CS 105 CS 105 – 9 – – 10 – Convert Loop To Goto Form Convert Loop To Goto Form Calling Strlen Calling Strlen /* My version of strlen */ void lower(char *s) size_t strlen(const char *s) { { size_t i = 0; size_t length = 0; if (i >= strlen(s)) while (*s != '\0') { goto done; s++; loop: length++; if (isupper(s[i])) } s[i] = tolower(s[i]); return length; i++; } if (i < strlen(s)) goto loop; Strlen performance done: } Only way to determine length of string is to scan its entire length, looking for NUL character. Overall performance, string of length N strlen executed every iteration N calls to strlen, each takes O(N) time Overall O(N 2 ) performance CS 105 CS 105 – 11 – – 12 –
✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ ✁ Improving Performance Improving Performance Lower-Case Conversion Performance Lower-Case Conversion Performance void lower(char *s) Time doubles when double string length { Linear performance of lower2 size_t i; size_t len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') 250 s[i] -= ('A' - 'a'); } 200 CPU seconds 150 lower1 Move call to strlen outside of loop 100 Since result does not change from one iteration to another Form of code motion 50 lower2 0 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 String length CS 105 CS 105 – 13 – – 14 – Optimization Blocker: Procedure Calls Optimization Blocker: Procedure Calls Memory Matters Memory Matters /* Sum rows is of n X n matrix a Why couldn’t compiler move strlen out of inner loop? and store in vector b */ void sum_rows1(double *a, double *b, long n) { Procedure may have side effects long i, j; Might alter global state each time called for (i = 0; i < n; i++) { Function may not return same value for given arguments b[i] = 0; for (j = 0; j < n; j++) Depends on other parts of global state b[i] += a[i*n + j]; Procedure lower could interact with strlen } size_t lencnt = 0; } Warning: size_t strlen(const char *s) { Compiler treats procedure call as a black box # sum_rows1 inner loop size_t length = 0; .L4: Weak optimizations near them movsd (%rsi,%rax,8), %xmm0 # FP load while (*s != '\0') { addsd (%rdi), %xmm0 # FP add Remedies: s++; movsd %xmm0, (%rsi,%rax,8) # FP store length++; addq $8, %rdi Use inline functions cmpq %rcx, %rdi } � GCC does this with –O1 jne .L4 lencnt += length; » But only within single file return length; Code updates b[i] on every iteration Do your own code motion } Why couldn’t compiler optimize this away? CS 105 CS 105 – 15 – – 16 –
Recommend
More recommend