machine dependent optimization machine dependent
play

Machine-Dependent Optimization Machine-Dependent Optimization CS - PowerPoint PPT Presentation


  1. ✆ ✗ ✙ ✌ ✓ ✑ ✝ ☎ ✆ ✝ ✞ ✄ ✛ ✄ ✜ � ✚ ✢ ✚ � ✚ ✢ ✚ ✣ ✤ ☛ ✞ ✥ ✂ ✑ ✑ ✌ ✓ ☛ ☛ ✔ ✄ ☛ ✝ ✌ ☎ ✝ ✆ ✝ ✞ ✄ ☛ ✗ ✘ ✓ ✌ ✟ ✆ ✤ ✦ ✓ ✧ ✝ ✌ ✓ ★ ✓ ✄ ✆ ✡ ✄ ✝ ✆ ✓ ✓ ✩ ✝ ☛ ✆ ✓ ✌ ✁ ✝ ✠ ✓ ✧ ✆ ✧ ✣ ✓ ✤ ✤ ✥ ✦ ✎ ✌ ✝ ✆ ✍ ✗ ✘ ✌ ☛ ✟ ✆ ✝ ✞ ✄ ✧ ✓ ☛ ✂ ✠ ✆ ✎ ✑ ✩ ✎ ✌ ✟ ✄ ☎ ✍ ✎ ✌ ✝ ✆ ✍ ✌ ☛ ✝ ✆ ✍ ✏ ✞ ✟ ✑ ✒ ✆ ✞ ☞ ☛ ✓ ✁ ☛ ✓ ✆ ✟ ✑ ✘ ✡ ✌ ✓ ✆ ✂ ✆ ✄ ☎ ✆ ✝ ✞ ✄ ✟ ✠ ✡ ✄ ✝ ✌ ✞ ✔ ☛ ✄ ✕ ✞ ✄ ✆ ✌ ✞ ✠ ✔ ✄ ✆ ✆ ✌ ✂ ☎ ✆ ✝ ✞ ✄ ✖ ✓ ☎ ☎ ✍ ✓ ✟ ☛ ✆ ✌ ✂ ☎ ✆ ✝ ✞ ✁ ✕ ✄ ☎ ✕ ✓ ✍ ✍ ✟ ☎ ✟ ✆ ✟ ✖ ✓ ✓ Machine-Dependent Optimization Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing” Need to understand the architecture Not portable Machine-Dependent Optimization Machine-Dependent Optimization Not often needed …but critically important when it is Also helps in understanding modern machines CS 105 – 2 – Modern CPU Design Modern CPU Design Superscalar Processor Superscalar Processor Definition: A superscalar processor can issue and execute multiple ������������������� instructions in one cycle . The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically. Benefit: without programming effort, superscalar processor can take advantage of the instruction-level parallelism that most programs have Most modern CPUs are superscalar. Intel: since Pentium (1993) ��������� CS 105 CS 105 – 3 – – 4 –

  2. ✁ ✁ ✁ ✁ ✁ What Is a Pipeline? What Is a Pipeline? Pipelined Functional Units Pipelined Functional Units ������� long mult_eg(long a, long b, long c) { long p1 = a*b; ������� long p2 = a*c; long p3 = p1 * p2; ������� return p3; } Mul Mul Mul Mul Mul Mul Mul Mul Mul Mul Mul ���� � � � � � � � a*b a*c p1*p2 ������� a*b a*c p1*p2 ������� a*b a*c p1*p2 ������� Result Bucket Divide computation into stages Pass partial computations from stage to stage Stage i can start new computation once values passed to i+1 E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles CS 105 CS 105 – 5 – – 6 – Haswell CPU Haswell CPU x86-64 Compilation of Combine4 x86-64 Compilation of Combine4 8 Total Functional Units Inner Loop (Case: Integer Multiply) Multiple instructions can execute in parallel 2 load, with address computation .L519: # Loop: imull (%rax,%rdx,4), %ecx # t = t * d[i] 1 store, with address computation addq $1, %rdx # i++ 4 integer cmpq %rdx, %rbp # Compare length:i jg .L519 # If >, goto Loop 2 FP multiply 1 FP add 1 FP divide Some instructions take > 1 cycle, but can be pipelined Method Integer Double FP Instruction Latency Cycles/Issue Operation Add Mult Add Mult Load / Store 4 1 Combine4 1.27 3.01 3.01 5.01 Integer Multiply 3 1 Latency 1.00 3.00 3.00 5.00 Integer/Long Divide 3-30 3-30 Bound Single/Double FP Multiply 5 1 Single/Double FP Add 3 1 Single/Double FP Divide 3-15 3-15 CS 105 CS 105 – 7 – – 8 –

  3. ✁ ✁ ✁ Combine4 = Serial Computation (OP = *) Combine4 = Serial Computation (OP = *) Loop Unrolling (2x1) Loop Unrolling (2x1) Computation (length=8) void unroll2a_combine(vec_ptr v, data_t *dest) 1 d 0 { ((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) long length = vec_length(v); * * d[4]) * d[5]) * d[6]) * d[7]) d 1 long limit = length-1; data_t *d = get_vec_start(v); Sequential dependence * d 2 data_t x = IDENT; Performance: determined by latency of OP long i; * d 3 /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { * d 4 x = (x OP d[i]) OP d[i+1]; } /* Finish any remaining elements */ * d 5 for (; i < length; i++) { x = x OP d[i]; * d 6 } *dest = x; * d 7 } * Perform 2x more useful work per iteration CS 105 CS 105 – 9 – – 10 – Effect of Loop Unrolling Effect of Loop Unrolling Loop Unrolling with Reassociation (2x1a) Loop Unrolling with Reassociation (2x1a) void unroll2aa_combine(vec_ptr v, data_t *dest) Method Integer Double FP { Operation Add Mult Add Mult long length = vec_length(v); long limit = length-1; Combine4 1.27 3.01 3.01 5.01 data_t *d = get_vec_start(v); Unroll 2x1 1.01 3.01 3.01 5.01 data_t x = IDENT; long i; Latency 1.00 3.00 3.00 5.00 /* Combine 2 elements at a time */ Bound ����������������� for (i = 0; i < limit; i+=2) { x = x OP (d[i] OP d[i+1]); x = (x OP d[i]) OP d[i+1]; } x = (x OP d[i]) OP d[i+1]; /* Finish any remaining elements */ for (; i < length; i++) { Helps integer add by reducing number of overhead instructions x = x OP d[i]; } (Almost) achieves latency bound *dest = x; } Others don’t improve. Why? Can this change result of computation? Still sequential dependency Yes, for multiply and floating point. Why? CS 105 CS 105 – 11 – – 12 –

  4. ✁ ✁ ✁ ✁ ✁ Effect of Reassociation Effect of Reassociation Reassociated Computation Reassociated Computation Method Integer Double FP What changed: Operation Add Mult Add Mult Operations in the next iteration can be x = x OP (d[i] OP d[i+1]); started early (no dependency) Combine4 1.27 3.01 3.01 5.01 Unroll 2x1 1.01 3.01 3.01 5.01 d 0 d 1 Unroll 2x1a 1.01 1.51 1.51 2.51 Overall Performance * d 2 d 3 Latency 1.00 3.00 3.00 5.00 1 N elements, D cycles latency/op Bound * (N/2+1)*D cycles: d 4 d 5 * Throughput 0.50 1.00 1.00 0.50 CPE = D/2 Bound * d 6 d 7 * Nearly 2x speedup for Int *, FP +, FP * * * Reason: Breaks sequential dependency ��������������������������� ��������������������������� * x = x OP (d[i] OP d[i+1]); ���������������������������� Why is that? (next slide) ��������������������������� CS 105 CS 105 – 13 – – 14 – Loop Unrolling Loop Unrolling Effect of Separate Accumulators Effect of Separate Accumulators with Separate Accumulators (2x2) with Separate Accumulators (2x2) Method Integer Double FP void unroll2a_combine(vec_ptr v, data_t *dest) { Operation Add Mult Add Mult long length = vec_length(v); Combine4 1.27 3.01 3.01 5.01 long limit = length-1; data_t *d = get_vec_start(v); Unroll 2x1 1.01 3.01 3.01 5.01 data_t x0 = IDENT; data_t x1 = IDENT; Unroll 2x1a 1.01 1.51 1.51 2.51 long i; Unroll 2x2 0.81 1.51 1.51 2.51 /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { Latency Bound 1.00 3.00 3.00 5.00 x0 = x0 OP d[i]; Throughput Bound 0.50 1.00 1.00 0.50 x1 = x1 OP d[i+1]; } /* Finish any remaining elements */ Int + makes use of two load units for (; i < length; i++) { x0 = x0 OP d[i]; x0 = x0 OP d[i]; } x1 = x1 OP d[i+1]; *dest = x0 OP x1; } 2x speedup (over unroll2) for Int *, FP +, FP * Different form of reassociation CS 105 CS 105 – 15 – – 16 –

Recommend


More recommend