1
play

1 Reduction in Strength Compiler-Generated Code Motion (-O1) void - PDF document

Today Overview Generally Useful Optimizations Program Optimization Code motion/precomputation Strength reduction Sharing of common subexpressions CSci 2021: Machine Architecture and Organization Removing unnecessary


  1. Today  Overview  Generally Useful Optimizations Program Optimization  Code motion/precomputation  Strength reduction  Sharing of common subexpressions CSci 2021: Machine Architecture and Organization  Removing unnecessary procedure calls April 6th-15th, 2020  Optimization Blockers Your instructor: Stephen McCamant  Procedure calls  Memory aliasing Based on slides originally by:  Exploiting Instruction-Level Parallelism Randy Bryant, Dave O’Hallaron  Dealing with Conditionals 1 2 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Optimizing Compilers Performance Realities  Provide efficient mapping of program to machine There’s more to performance than asymptotic complexity  register allocation  code selection and ordering (scheduling)  dead code elimination  Constant factors matter too!  eliminating minor inefficiencies  Easily see 10:1 performance range depending on how code is written  Don’t (usually) improve asymptotic efficiency  Must optimize at multiple levels:  up to programmer to select best overall algorithm  algorithm, data representations, procedures, and loops  big-O savings are (often) more important than constant factors  Must understand system to optimize performance  but constant factors also matter  How programs are compiled and executed  How modern processors + memory systems operate  Have difficulty overcoming “optimization blockers”  potential memory aliasing  How to measure program performance and identify bottlenecks  How to improve performance without destroying code modularity and  potential procedure side-effects generality 3 4 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Limitations of Optimizing Compilers Generally Useful Optimizations Operate under fundamental constraint   Must not cause any change in program behavior  Optimizations that you or the compiler should do regardless  Except, possibly when program making use of nonstandard language of processor / compiler features  Often prevents it from making optimizations that would only affect behavior under pathological conditions.  Code Motion Behavior that may be obvious to the programmer can be obfuscated by  Reduce frequency with which computation performed  languages and coding styles  If it will always produce same result  e.g., Data ranges may be more limited than variable types suggest  Especially moving code out of loop Most analysis is performed only within procedures   Whole-program analysis is too expensive in most cases void set_row(double *a, double *b, long i, long n)  Newer versions of GCC do interprocedural analysis within individual files { long j;  But, not between code in different files long j; int ni = n*i ; for (j = 0; j < n; j++) Most analysis is based only on static information for (j = 0; j < n; j++)  a[n*i+j] = b[j]; a[ni+j] = b[j]; }  Compiler has difficulty anticipating run-time inputs When in doubt, the compiler must be conservative  5 6 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 1

  2. Reduction in Strength Compiler-Generated Code Motion (-O1) void set_row(double *a, double *b,  Replace costly operation with simpler one long j; long i, long n) long ni = n*i; {  Shift, add instead of multiply or divide double *rowp = a+ni; long j; for (j = 0; j < n; j++) for (j = 0; j < n; j++) 16*x --> x << 4 *rowp++ = b[j]; a[n*i+j] = b[j]; }  Utility machine dependent  Depends on cost of multiply or divide instruction – On Intel Nehalem, integer multiply requires 3 CPU cycles set_row:  Most valuable when it can be done within a loop testq %rcx, %rcx # Test n jle .L1 # If 0, goto done  “Induction variable” has value linear in loop execution count imulq %rcx, %rdx # ni = n*i leaq (%rdi,%rdx,8), %rdx # rowp = A + ni*8 movl $0, %eax # j = 0 int ni = 0; .L3: # loop: for (i = 0; i < n; i++) { for (i = 0; i < n; i++) { movsd (%rsi,%rax,8), %xmm0 # t = b[j] for (j = 0; j < n; j++) int ni = n*i; movsd %xmm0, (%rdx,%rax,8) # M[A+ni*8 + j*8] = t a[ni + j] = b[j]; for (j = 0; j < n; j++) addq $1, %rax # j++ a[ni + j] = b[j]; ni += n; cmpq %rcx, %rax # j:n } } jne .L3 # if !=, goto loop .L1: # done: rep ; ret 7 8 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Optimization Blocker #1: Procedure Calls Share Common Subexpressions  Reuse portions of expressions  GCC will do this with – O1  Procedure to Convert String to Lower Case void lower(char *s) { /* Sum neighbors of i,j */ long inj = i*n + j; up = val[(i-1)*n + j ]; up = val[inj - n]; size_t i; down = val[(i+1)*n + j ]; down = val[inj + n]; for (i = 0; i < strlen(s); i++) left = val[i*n + j-1]; left = val[inj - 1]; if (s[i] >= 'A' && s[i] <= 'Z') right = val[i*n + j+1]; right = val[inj + 1]; sum = up + down + left + right; sum = up + down + left + right; s[i] -= ('A' - 'a'); } 3 multiplications: i*n, (i – 1)*n, (i+1)*n 1 multiplication: i*n leaq 1(%rsi), %rax # i+1 imulq %rcx, %rsi # i*n leaq -1(%rsi), %r8 # i-1 addq %rdx, %rsi # i*n+j  Extracted from CMU 213 lab submissions, Fall, 1998 imulq %rcx, %rsi # i*n movq %rsi, %rax # i*n+j  Similar pattern seen in UMN 2018 HA1 imulq %rcx, %rax # (i+1)*n subq %rcx, %rax # i*n+j-n imulq %rcx, %r8 # (i-1)*n leaq (%rsi,%rcx), %rcx # i*n+j+n addq %rdx, %rsi # i*n+j addq %rdx, %rax # (i+1)*n+j addq %rdx, %r8 # (i-1)*n+j 9 10 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Lower Case Conversion Performance Convert Loop To Goto Form void lower(char *s) {  Time quadruples when double string length size_t i = 0;  Quadratic performance if (i >= strlen(s)) goto done; loop: 250 if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); 200 i++; CPU seconds if (i < strlen(s)) 150 lower1 goto loop; done: 100 } 50  strlen executed every iteration 0 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 500000 String length 11 12 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition 2

Recommend


More recommend