

  1. Some Cache Optimization with Enhanced Pipeline Scheduling Seok-Young Lee, Jaemok Lee, Soo-Mook Moon School of Electrical Engineering & Computer Science Seoul National University, Korea

  2. Outline  Motivation and background  Cache optimizations with Enhanced Pipeline Scheduling  Experimental results  Summary and future work

  3. Cache Misses for Integer Programs  CPU stalls caused by data cache misses are serious, even in some integer programs  [Bar chart: CPU stall portion of the total running time for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 254.gap, 256.bzip, 300.twolf, 445.gobmk, 456.hmmer]

  4. Conventional Techniques  Many compiler optimization techniques have been used • Prefetches for array-accessing loops [Mowry’92] • Increasing locality in loops [Wolf’91] • Dynamic runtime optimization [Chilimbi’02]  But they are not well applicable to integer loops • Address estimation is not easy (e.g., pointer-chasing loops) • Complex control flows

  5. A Better Technique  In integer programs, it is easier to separate “hot” cache-missing loads from their consumers by cache-miss latencies • Simply implemented by an increased load latency during code scheduling  [Diagram: left, load x = [y] immediately followed by use x — the CPU stalls on a cache miss; right, independent instructions (use a, use b, use c, use d) fill the cache-miss latency between load x = [y] and use x — no CPU stall if the load and its consumer are separated]
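The effect described on this slide can be captured in a toy cycle-count model (an illustrative sketch, not the paper's scheduler; the 5-cycle latency matches the L1-miss figure given later in the experimental setup):

```python
# Toy model: a load has a miss latency; a consumer issued before the
# result is ready stalls the CPU for the remaining cycles.

MISS_LATENCY = 5  # cycles from load issue until the loaded value is available

def stall_cycles(independent_insts_between):
    """Stall seen by the consumer when `independent_insts_between`
    one-cycle independent instructions are scheduled between load and use."""
    ready_at = MISS_LATENCY                        # cycle when the value arrives
    use_issued_at = 1 + independent_insts_between  # load at cycle 0, then fillers
    return max(0, ready_at - use_issued_at)

# Consumer right after the load: 4 stall cycles.
print(stall_cycles(0))  # -> 4
# Four independent instructions fill the slack: no stall.
print(stall_cycles(4))  # -> 0
```

The difficulty, as the next slide notes, is finding enough independent instructions inside one iteration to fill the slack.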

  6. Our Proposal  However, naïve code scheduling is not enough • Code motion of hot loads can get stuck at the loop entry • Difficult to fill the added slack cycles fully and usefully • Actually, it did not show a tangible impact [Choi ’02 in EPIC-2]  Our proposal: moving hot loads across loop iterations

  7. Illustration of the Proposal  [Diagram: naïve separation — load a = @b is hoisted within its own iteration but gets stuck at the loop header, still close to use a; proposed separation — load a = @b of iteration n+1 is moved across the back edge into iteration n, far ahead of its use a in iteration n+1]  Moving hot loads across loop iterations is a code motion for software pipelining
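A hand-pipelined sketch of the proposal (illustrative source-level rewriting, not compiler output; the `* 2` consumer stands in for an arbitrary use):

```python
# Source loop: each iteration loads a value and immediately uses it,
# so the consumer stalls whenever the load misses.
def source_loop(values):
    out = []
    for i in range(len(values)):
        a = values[i]          # load a = @b   (hot load)
        out.append(a * 2)      # use a         (consumer right after the load)
    return out

# Pipelined loop: the load for iteration n+1 is issued in iteration n,
# so the miss latency overlaps the previous iteration's useful work.
def pipelined_loop(values):
    out = []
    if not values:
        return out
    a = values[0]              # prologue: first load hoisted out of the loop
    for i in range(1, len(values)):
        next_a = values[i]     # load for iteration i issued one iteration early
        out.append(a * 2)      # use of the previously loaded value
        a = next_a
    out.append(a * 2)          # epilogue: consume the last loaded value
    return out

print(pipelined_loop([1, 2, 3]))  # -> [2, 4, 6], same result as source_loop
```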

  8. Some Characteristics of Hot Loads  Located close to the loop entry  Tight data dependence chains to their source operands • Moving a hot load requires moving its dependent instructions as well  Difficult to estimate the target address  Often in a loop with complex control flow • Requires code motion above branches and joins

  9. Hot load example in 181.mcf

    while( arcin ) {
        tail = arcin->tail;
        if( tail->time + arcin->org_cost > latest ) {
            arcin = (arc_t *)tail->mark;
            continue;
        }
        ... /* complex and large code including an inner loop and a function call */
    }

  The hot load tail->time is a pointer-chasing load, close to the loop entry, in an outer loop with complex control flow (181.mcf source code and control flow graph)

  10. Hot load example in 164.gzip

    do {
        match = window + cur_match;
        if ( *(ush*)(match+best_len-1) != scan_end ||
             *(ush*)match != scan_start) continue;
        ... /* complex and large code including an inner loop and a function call */
    } while ((cur_match = prev[cur_match & WMASK]) > limit
             && --chain_length != 0);

  The hot loads are *(ush*)(match+best_len-1) and prev[cur_match & WMASK] (164.gzip source code and control flow graph)

  11. Cross-Iteration Global Scheduling  Separating hot loads such as tail->time requires two types of code motions • Code motion across loop back-edges: software pipelining • Code motion across branches and joins: global scheduling  Needs global scheduling across loop iterations  Enhanced pipeline scheduling

  12. Enhanced Pipeline Scheduling (EPS)  A software pipelining technique based on code motions • Global scheduling can be applied across loop back-edges • Aggressive code motions for scheduling useful instructions  If we exploit EPS appropriately, we can (1) separate hot loads and the consumers effectively and (2) fill the slack cycles usefully  Let us first review how EPS works

  13. EPS Illustration  EPS repetitively (1) defines a DAG by cutting edges of a loop and (2) performs DAG scheduling  [Diagram: the loop body x = x+4; y = load(x); cc = (y==0); if(!cc) goto loop; store x @A is pipelined step by step — the address increment is renamed into copies (x' = x+4 in the preheader, x'' = x'+4 in the body) and the load y = load(x') is moved across the back edge, so iteration n+1's load issues during iteration n]
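A Python rendition of the loop in this illustration may help (a sketch of the *effect* of the EPS transformation, not the algorithm itself; the dict `mem` stands in for memory, walked at stride 4 until a zero is loaded):

```python
# Original loop from the slide: x = x+4; y = load(x); cc = (y==0);
# if(!cc) goto loop; store x @A
def original_loop(mem, x0):
    x = x0
    while True:
        x = x + 4            # x = x+4
        y = mem[x]           # y = load(x)   (hot load, used immediately below)
        cc = (y == 0)        # cc = (y==0)
        if cc:
            break
    return x                 # store x @A

# After EPS cuts the back edge: the address increment and the next
# iteration's load move above the branch, using renamed copies (x', x'').
def eps_scheduled_loop(mem, x0):
    x1 = x0 + 4              # x' = x+4      (moved to the preheader)
    y = mem[x1]              # y = load(x')  (first load hoisted)
    while True:
        x = x1               # x = x'
        x2 = x1 + 4          # x'' = x'+4    (next iteration's address)
        cc = (y == 0)        # cc = (y==0)
        if cc:
            break
        y = mem[x2]          # load for iteration n+1 issued in iteration n
        x1 = x2              # x' = x''
    return x                 # store x @A

mem = {4: 7, 8: 7, 12: 0}
print(original_loop(mem, 0), eps_scheduled_loop(mem, 0))  # -> 12 12
```

In the transformed loop, the load for the next iteration already sits several instructions (and a branch) ahead of its use, which is exactly the slack the cache optimization wants.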

  14. CPU Stall Reduction with EPS  We simply add an L1-cache-miss latency for “hot” loads and schedule them with the EPS algorithm • Their consumer instructions will be scheduled far enough from them, even across loop iterations  [Diagram: in the scheduled loop the Load and its Use are separated by several instructions, with the back edge between them]  However, this is not that simple

  15. Issues in Stall Reduction with EPS  Adding slack cycles means more aggressive code motions • Some aggressive code motions such as speculative loads or join code motions have a negative side-effect if performed recklessly • Must limit aggressive code motion  On the other hand, hot loads and their source definitions should be scheduled aggressively • Must encourage aggressive code motion

  16. Hot Load-related Instructions  We split instructions into two groups: hot-load-related instructions and non-related instructions  Hot-load-related instructions are scheduled more aggressively than non-related instructions • Selective heuristics
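A minimal sketch of this split (an assumed dependence-graph representation, not Open-64's internals): an instruction is hot-load-related if some hot load transitively depends on it, i.e. it lies on the backward dependence slice of a hot load.

```python
from collections import deque

def hot_load_related(deps, hot_loads):
    """deps maps each instruction to the instructions it depends on;
    returns the hot loads plus their backward dependence slice."""
    related = set(hot_loads)
    work = deque(hot_loads)
    while work:
        inst = work.popleft()
        for src in deps.get(inst, ()):
            if src not in related:      # walk the dependence chains backwards
                related.add(src)
                work.append(src)
    return related

# The example from the next slide: add b = c + d feeds the hot load.
deps = {
    "ld1 a <= @b": ["add b = c + d"],
    "add b = c + d": ["def c", "def d"],
    "use a": ["ld1 a <= @b"],
}
print(sorted(hot_load_related(deps, ["ld1 a <= @b"])))
# -> ['add b = c + d', 'def c', 'def d', 'ld1 a <= @b']
```

Consumers such as use a are deliberately *not* in the related set: only the load and its source-operand chain get the aggressive treatment.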

  17. Scheduling Hot Load-related Instructions  [Diagram: the hot load ld1 a <= @b and its source-operand chain (def c, def d, add b = c + d) are the related instructions; the whole chain for iteration n+1 is scheduled above the branch into iteration n, while use a [iter n] and the other parts of the loop body stay in place]

  18. Stall-Reducing EPS for Open-64  We implemented EPS in Open-64 (version 3.0), an open-source compiler for IA-64 • http://www.open64.net/ • EPS is positioned between register allocation and global instruction scheduling in Open-64  We then implemented stall reduction for EPS • Detect “hot” loads via profiling

  19. Experimental Results  Experimental Environment • Intel Itanium2 processor, 900MHz – 256KB L1 D-cache (an L1 cache miss takes 5 cycles) • 10 integer benchmarks from SPEC CPU 2000 and 2006 • Use the Performance Monitoring Unit for detecting hot loads – Collect load instructions whose stall overhead takes over 2% of the running time – 12 loops in 10 benchmarks are selected – We do not touch other loops
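The selection rule on this slide can be sketched as follows (an illustrative sketch with an assumed profile format, not the paper's PMU tooling; the profile keys are hypothetical labels):

```python
def select_hot_loads(load_stalls, total_cycles, threshold=0.02):
    """Keep loads whose measured stall cycles exceed `threshold`
    (2% by default) of the total running time.

    load_stalls: {load_label: stall cycles attributed to that load}
    """
    return {label for label, stalls in load_stalls.items()
            if stalls > threshold * total_cycles}

# Hypothetical PMU profile: stall cycles per static load, out of
# 10,000 total cycles; only the first load crosses the 2% (200-cycle) bar.
profile = {"mcf:tail->time": 900, "gzip:scan_end": 150, "other:misc": 10}
print(sorted(select_hot_loads(profile, 10_000)))  # -> ['mcf:tail->time']
```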

  20. Experiment Configurations  Base: Open-64 – O3 with EPS disabled (1.0x)  EPS without cache optimizations • Strictly schedule hot loops only  EPS with cache optimizations • Strict heuristics – Limited code motions • Aggressive heuristics • Selective heuristics for hot-load-related instructions

  21. Stall Reduction and Performance Result  [Charts: stall cycles and total execution cycles, normalized to Base (1.0x), for gzip, vpr, gcc, mcf, parser, gap, bzip2, twolf, gobmk, hmmer; configurations: Strict EPS without cache optimization, Strict EPS with cache optimization]  Stall is reduced a little compared with the EPS-without-cache-optimization configuration  No tangible effects on execution cycles

  22. Stall Reduction and Performance Result  [Charts: stall cycles and total execution cycles for the same benchmarks; configurations now include Aggressive EPS with cache optimization]  Stall is reduced more  Execution cycles do not get better

  23. Stall Reduction and Performance Result  [Charts: stall cycles and total execution cycles; configurations now include Selective EPS with cache optimization]  Stall is reduced as much as with the aggressive configuration  Execution cycles are decreased, especially for gzip and mcf

  24. Summary and Future Work  EPS-based stall reduction achieves promising results • Adding an L1-cache-miss latency for hot loads to separate them from their consumers • Aggressively schedule hot-load-related instructions only  Future Work • More balanced heuristics between parallelism & stall reduction – Canceling code motions which have no advantage for either parallelism or stall reduction after EPS • Handling L2-cache misses for some of the hottest loads

  25. Thanks  Questions?
