Machine-Dependent Optimization Machine-Dependent Optimization CS - PowerPoint PPT Presentation

✆ ✗ ✙ ✌ ✓ ✑ ✝ ☎ ✆ ✝ ✞ ✄ ✛ ✄ ✜ � ✚ ✢ ✚ � ✚ ✢ ✚ ✣ ✤ ☛ ✞ ✥ ✂ ✑ ✑ ✌ ✓ ☛ ☛ ✔ ✄ ☛ ✝ ✌ ☎ ✝ ✆ ✝ ✞ ✄ ☛ ✗ ✘ ✓ ✌ ✟ ✆ ✤ ✦ ✓ ✧ ✝ ✌ ✓ ★ ✓ ✄ ✆ ✡ ✄ ✝ ✆ ✓ ✓ ✩ ✝ ☛ ✆ ✓ ✌ ✁ ✝ ✠ ✓ ✧ ✆ ✧ ✣ ✓ ✤ ✤ ✥ ✦ ✎ ✌ ✝ ✆ ✍ ✗ ✘ ✌ ☛ ✟ ✆ ✝ ✞ ✄ ✧ ✓ ☛ ✂ ✠ ✆ ✎ ✑ ✩ ✎ ✌ ✟ ✄ ☎ ✍ ✎ ✌ ✝ ✆ ✍ ✌ ☛ ✝ ✆ ✍ ✏ ✞ ✟ ✑ ✒ ✆ ✞ ☞ ☛ ✓ ✁ ☛ ✓ ✆ ✟ ✑ ✘ ✡ ✌ ✓ ✆ ✂ ✆ ✄ ☎ ✆ ✝ ✞ ✄ ✟ ✠ ✡ ✄ ✝ ✌ ✞ ✔ ☛ ✄ ✕ ✞ ✄ ✆ ✌ ✞ ✠ ✔ ✄ ✆ ✆ ✌ ✂ ☎ ✆ ✝ ✞ ✄ ✖ ✓ ☎ ☎ ✍ ✓ ✟ ☛ ✆ ✌ ✂ ☎ ✆ ✝ ✞ ✁ ✕ ✄ ☎ ✕ ✓ ✍ ✍ ✟ ☎ ✟ ✆ ✟ ✖ ✓ ✓ Machine-Dependent Optimization Machine-Dependent Optimization CS 105 “Tour of the Black Holes of Computing” Need to understand the architecture Not portable Machine-Dependent Optimization Machine-Dependent Optimization Not often needed …but critically important when it is Also helps in understanding modern machines CS 105 – 2 – Modern CPU Design Modern CPU Design Superscalar Processor Superscalar Processor Definition: A superscalar processor can issue and execute multiple �� instructions in one cycle . The instructions are retrieved from a sequential instruction stream and are usually scheduled dynamically. Benefit: without programming effort, superscalar processor can take advantage of the instruction-level parallelism that most programs have Most modern CPUs are superscalar. Intel: since Pentium (1993) �� CS 105 CS 105 – 3 – – 4 –

✁ ✁ ✁ ✁ ✁ What Is a Pipeline? What Is a Pipeline? Pipelined Functional Units Pipelined Functional Units �� long mult_eg(long a, long b, long c) { long p1 = a*b; �� long p2 = a*c; long p3 = p1 * p2; �� return p3; } Mul Mul Mul Mul Mul Mul Mul Mul Mul Mul Mul �� a*b a*c p1*p2 �� a*b a*c p1*p2 �� a*b a*c p1*p2 �� Result Bucket Divide computation into stages Pass partial computations from stage to stage Stage i can start new computation once values passed to i+1 E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles CS 105 CS 105 – 5 – – 6 – Haswell CPU Haswell CPU x86-64 Compilation of Combine4 x86-64 Compilation of Combine4 8 Total Functional Units Inner Loop (Case: Integer Multiply) Multiple instructions can execute in parallel 2 load, with address computation .L519: # Loop: imull (%rax,%rdx,4), %ecx # t = t * d[i] 1 store, with address computation addq $1, %rdx # i++ 4 integer cmpq %rdx, %rbp # Compare length:i jg .L519 # If >, goto Loop 2 FP multiply 1 FP add 1 FP divide Some instructions take > 1 cycle, but can be pipelined Method Integer Double FP Instruction Latency Cycles/Issue Operation Add Mult Add Mult Load / Store 4 1 Combine4 1.27 3.01 3.01 5.01 Integer Multiply 3 1 Latency 1.00 3.00 3.00 5.00 Integer/Long Divide 3-30 3-30 Bound Single/Double FP Multiply 5 1 Single/Double FP Add 3 1 Single/Double FP Divide 3-15 3-15 CS 105 CS 105 – 7 – – 8 –

✁ ✁ ✁ Combine4 = Serial Computation (OP = *) Combine4 = Serial Computation (OP = *) Loop Unrolling (2x1) Loop Unrolling (2x1) Computation (length=8) void unroll2a_combine(vec_ptr v, data_t *dest) 1 d 0 { ((((((((1 * d[0]) * d[1]) * d[2]) * d[3]) long length = vec_length(v); * * d[4]) * d[5]) * d[6]) * d[7]) d 1 long limit = length-1; data_t *d = get_vec_start(v); Sequential dependence * d 2 data_t x = IDENT; Performance: determined by latency of OP long i; * d 3 /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { * d 4 x = (x OP d[i]) OP d[i+1]; } /* Finish any remaining elements */ * d 5 for (; i < length; i++) { x = x OP d[i]; * d 6 } *dest = x; * d 7 } * Perform 2x more useful work per iteration CS 105 CS 105 – 9 – – 10 – Effect of Loop Unrolling Effect of Loop Unrolling Loop Unrolling with Reassociation (2x1a) Loop Unrolling with Reassociation (2x1a) void unroll2aa_combine(vec_ptr v, data_t *dest) Method Integer Double FP { Operation Add Mult Add Mult long length = vec_length(v); long limit = length-1; Combine4 1.27 3.01 3.01 5.01 data_t *d = get_vec_start(v); Unroll 2x1 1.01 3.01 3.01 5.01 data_t x = IDENT; long i; Latency 1.00 3.00 3.00 5.00 /* Combine 2 elements at a time */ Bound �� for (i = 0; i < limit; i+=2) { x = x OP (d[i] OP d[i+1]); x = (x OP d[i]) OP d[i+1]; } x = (x OP d[i]) OP d[i+1]; /* Finish any remaining elements */ for (; i < length; i++) { Helps integer add by reducing number of overhead instructions x = x OP d[i]; } (Almost) achieves latency bound *dest = x; } Others don’t improve. Why? Can this change result of computation? Still sequential dependency Yes, for multiply and floating point. Why? CS 105 CS 105 – 11 – – 12 –

✁ ✁ ✁ ✁ ✁ Effect of Reassociation Effect of Reassociation Reassociated Computation Reassociated Computation Method Integer Double FP What changed: Operation Add Mult Add Mult Operations in the next iteration can be x = x OP (d[i] OP d[i+1]); started early (no dependency) Combine4 1.27 3.01 3.01 5.01 Unroll 2x1 1.01 3.01 3.01 5.01 d 0 d 1 Unroll 2x1a 1.01 1.51 1.51 2.51 Overall Performance * d 2 d 3 Latency 1.00 3.00 3.00 5.00 1 N elements, D cycles latency/op Bound * (N/2+1)*D cycles: d 4 d 5 * Throughput 0.50 1.00 1.00 0.50 CPE = D/2 Bound * d 6 d 7 * Nearly 2x speedup for Int *, FP +, FP * * * Reason: Breaks sequential dependency �� * x = x OP (d[i] OP d[i+1]); �� Why is that? (next slide) �� CS 105 CS 105 – 13 – – 14 – Loop Unrolling Loop Unrolling Effect of Separate Accumulators Effect of Separate Accumulators with Separate Accumulators (2x2) with Separate Accumulators (2x2) Method Integer Double FP void unroll2a_combine(vec_ptr v, data_t *dest) { Operation Add Mult Add Mult long length = vec_length(v); Combine4 1.27 3.01 3.01 5.01 long limit = length-1; data_t *d = get_vec_start(v); Unroll 2x1 1.01 3.01 3.01 5.01 data_t x0 = IDENT; data_t x1 = IDENT; Unroll 2x1a 1.01 1.51 1.51 2.51 long i; Unroll 2x2 0.81 1.51 1.51 2.51 /* Combine 2 elements at a time */ for (i = 0; i < limit; i+=2) { Latency Bound 1.00 3.00 3.00 5.00 x0 = x0 OP d[i]; Throughput Bound 0.50 1.00 1.00 0.50 x1 = x1 OP d[i+1]; } /* Finish any remaining elements */ Int + makes use of two load units for (; i < length; i++) { x0 = x0 OP d[i]; x0 = x0 OP d[i]; } x1 = x1 OP d[i+1]; *dest = x0 OP x1; } 2x speedup (over unroll2) for Int *, FP +, FP * Different form of reassociation CS 105 CS 105 – 15 – – 16 –

Machine-Dependent Optimization Machine-Dependent Optimization CS - PowerPoint PPT Presentation

Why Dependent Origination? So what is dependent origination? Dependent on ignorance, there

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

Dependent Eligibility Audit Dependent Eligibility Audit Purpose: The dependent eligibility audit

Time- -dependent Similarity Measure dependent Similarity Measure Time Time-dependent Similarity

Convex Optimization by Stephen Boyd, and Lieven Vandenberghe. Optimization for Machine Learning by

Convex Optimization 4. Convex Optimization Problems Prof. Ying Cui Department of Electrical

P2P Combinatorial Optimization Amir H. Payberah (amir@sics.se) P2P Combinatorial Optimization, 13

Toward dependent choice: a classical sequent calculus with dependent types Hugo Herbelin 1 ,

Local Function Optimization COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning

Outline IAML: Optimization Why we use optimization in machine learning The general

Outline Optimization Unconstrained Optimization Problems Machine Learning and Pattern

Machine Learning for Auto Optimization What is Machine Learning? Definition: Machine

Optimization of HPSG Grammar Implementations in Trale Georgiana Dinu Optimization of HPSG

Search Engine Optimization What is Search Engine Optimization Search Engine Optimization is the

Optimization Optimization Goal: Find the minimizer ! that minimizes the objective (cost)

Five Steps to Optimization Five Steps to Optimization Beyond Best Practices Beyond Best

Theory Exercises Exercise 1 Consider a simple uniprocessor system with no caches. How does

CS 333 Introduction to Operating Systems Class 2 OS-Related Hardware & Software The

CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous

ILP: COMPILER-BASED TECHNIQUES Mahdi Nazm Bojnordi Assistant Professor School of Computing

Multicore Multicore curiculum 1 Motivation Moores Law: the number of transistors double

Hosted I/O Architecture Micah Dowty Jeremy Sugerman USENIX WIOV 2008 1 Contents GPUs are

Video Codec Requirements and Evaluation Methodology t www.huawei.com

Development of Computer - Story of Steve What is a computer A high intelligence machine A

Machine-Dependent Optimization Machine-Dependent Optimization CS - PowerPoint PPT Presentation

Why Dependent Origination? So what is dependent origination? Dependent on ignorance, there

15-780: Optimization J. Zico Kolter March 14-16, 2015 1 Outline Introduction to optimization

Dependent Eligibility Audit Dependent Eligibility Audit Purpose: The dependent eligibility audit

Time- -dependent Similarity Measure dependent Similarity Measure Time Time-dependent Similarity

Convex Optimization by Stephen Boyd, and Lieven Vandenberghe. Optimization for Machine Learning by

Convex Optimization 4. Convex Optimization Problems Prof. Ying Cui Department of Electrical

P2P Combinatorial Optimization Amir H. Payberah (amir@sics.se) P2P Combinatorial Optimization, 13

Toward dependent choice: a classical sequent calculus with dependent types Hugo Herbelin 1 ,

Local Function Optimization COMPSCI 371D Machine Learning COMPSCI 371D Machine Learning

Outline IAML: Optimization Why we use optimization in machine learning The general

Outline Optimization Unconstrained Optimization Problems Machine Learning and Pattern

Machine Learning for Auto Optimization What is Machine Learning? Definition: Machine

Optimization of HPSG Grammar Implementations in Trale Georgiana Dinu Optimization of HPSG

Search Engine Optimization What is Search Engine Optimization Search Engine Optimization is the

Optimization Optimization Goal: Find the minimizer ! that minimizes the objective (cost)

Five Steps to Optimization Five Steps to Optimization Beyond Best Practices Beyond Best

Theory Exercises Exercise 1 Consider a simple uniprocessor system with no caches. How does

CS 333 Introduction to Operating Systems Class 2 OS-Related Hardware &amp; Software The

CS184c: Computer Architecture [Parallel and Multithreaded] Day 8: April 26, 2001 Simultaneous

ILP: COMPILER-BASED TECHNIQUES Mahdi Nazm Bojnordi Assistant Professor School of Computing

Multicore Multicore curiculum 1 Motivation Moores Law: the number of transistors double

Hosted I/O Architecture Micah Dowty Jeremy Sugerman USENIX WIOV 2008 1 Contents GPUs are

Video Codec Requirements and Evaluation Methodology t www.huawei.com

Development of Computer - Story of Steve What is a computer A high intelligence machine A

CS 333 Introduction to Operating Systems Class 2 OS-Related Hardware & Software The