Control-Flow Profiles; Code Motion Using Control Flow Profiles

CS553 Lecture Profile-Guided Optimizations

Profile-Guided Optimizations

Recall
– Instruction scheduling
– List scheduling
– Register renaming
– Loop unrolling
– Software pipelining
– Alias analysis
  – How can we use alias analysis for instruction scheduling?
  – What causes conservative results?

Today
– More instruction scheduling
– Profiling
– Trace scheduling

Motivation for Profiling

Limitations of static analysis
– Compilers can analyze possible paths but must behave conservatively
– Frequency information cannot be obtained through static analysis

How runtime information helps
– Control flow information
  [Figure: a branch "if c" whose two outcomes are taken 10% and 90% of the time]
  Optimize the more frequent path (perhaps at the expense of the less frequent path)
– Memory conflicts
    st r1,0(r5)
    ld r2,0(r4)
  If r5 and r4 always have different values, we can move the load above the store

Profile-Guided Optimizations

Basic idea
– Instrument and run program on sample inputs to get likely runtime behavior
– Can use this information to improve instruction scheduling
– Many other uses
  – Code placement
  – Inlining
  – Value speculation
  – Branch prediction
  – Class-based optimization (static method lookup)

Profiling Issues

Profile data
– Collected over whole program run
– May not be useful (unbiased branches)
– May not reflect all runs
– May be expensive and inconvenient to gather
  – Continuous profiling [Anderson 97]
– May interfere with program

Control-Flow Profiles

Commonly gather two types of information
– Execution frequencies of basic blocks
– Branch frequencies of conditional branches
– Represent information in a weighted flow graph
  [Figure: weighted flow graph of blocks 1 to 4, labeled with execution frequencies (100 on blocks 1 and 2) and branch frequencies (30 and 70)]

Instrumentation
– Insert instrumentation code at basic block entrances and before each branch
– Take average of values from multiple training runs
– Fairly inexpensive

Code Motion Using Control Flow Profiles

Code motion across basic blocks
– Increased scheduling freedom

Move code above a join
  [Figure: blocks A and B join at block C, which contains s1]
– If we want to move s1 to A, we must move s1 to both A and B

Move code below a split
  [Figure: block A, which contains s1, splits to blocks B and C]
– If we want to move s1 to B, we must move s1 to both B and C

Code Motion Using Control Flow Profiles (cont)

Code motion across basic blocks
– Increased scheduling freedom

Move code below a join
  [Figure: A, which contains s1, and B join at C; after the move, s1 sits in C and B reaches a copy C′]
– Tail duplication prevents B → C from seeing s1

Move code above a split
  [Figure: block A splits to B, which contains s1, and C]
– If we want to move s1 from B to A and if s1 would destroy a value along the A → C path, do renaming
– What if s1 might cause an exception?

Memory-Dependence Profiles

Gather information about memory conflicts
– Frequencies of address matches between pairs of loads and stores
– Attempts to answer the question: Are two references independent of one another?
– Concentrate on ambiguous reference pairs (those that the compiler cannot figure out)
    st1: store r5
    ld2: load r4
    (st1, ld2, 7)
  If this number is low, we can speculatively assume that st1 and ld2 do not conflict

Instrumentation
– Much more expensive (in both space and time) to gather than control flow information
– First perform control flow profiling
– Apply only to most frequently executed blocks
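The (st1, ld2, 7) record from the Memory-Dependence Profiles slide can be illustrated with a small sketch. This is not the instrumentation used in the lecture: the function name, the trace format, and the rule that a load "matches" when it reads the address most recently written by the paired store are all assumptions made for illustration.

```python
from collections import defaultdict


def memory_dependence_profile(trace, pairs):
    """Count address matches for each ambiguous (store, load) pair.

    trace: list of (instruction_id, address) events in execution order
           (hypothetical format, e.g. produced by instrumentation).
    pairs: (store_id, load_id) pairs the compiler could not disambiguate.
    Returns {(store_id, load_id): match_count}.
    """
    last_store_addr = {}                # store_id -> address it last wrote
    stores_for_load = defaultdict(list)
    stores = set()
    matches = {}
    for st, ld in pairs:
        stores_for_load[ld].append(st)
        stores.add(st)
        matches[(st, ld)] = 0

    for inst, addr in trace:
        if inst in stores:
            last_store_addr[inst] = addr
        # Assumption: a conflict is a load from the paired store's last address.
        for st in stores_for_load.get(inst, ()):
            if last_store_addr.get(st) == addr:
                matches[(st, inst)] += 1
    return matches


# Hypothetical run: st1 and ld2 touch the same address once in three iterations.
events = [("st1", 100), ("ld2", 200),
          ("st1", 104), ("ld2", 104),
          ("st1", 108), ("ld2", 300)]
profile = memory_dependence_profile(events, [("st1", "ld2")])
print(profile)  # {('st1', 'ld2'): 1}
```

A low count relative to the number of executions is the signal the slide describes: the scheduler may speculate that the pair is independent, as long as repair code covers the rare conflict.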

Trace Scheduling [Fisher 81] and [Ellis 85]

Basic idea
– We want large blocks to create large scheduling windows, but basic blocks are small because branches are frequent
– Create superblocks to increase scheduling window
– Use profile information to create good superblocks
– Optimize each superblock independently

Superblocks
– A sequence of basic blocks with a single entrance and multiple exits
  [Figure: flow graph of blocks 1 to 4 with a superblock marked]

Goals
– Want large superblocks
– Want to avoid early exits
– Want blocks that match actual execution paths

Trace Scheduling (example)

original:
    b[i] = “old”
    a[i] = ...
    if (a[i]>0) then
      b[i]=“new”;
    else
      stmt X
      stmt Y
    endif
    c[i] = ...

trace:
    b[i] = “old”
    a[i] = ...
    b[i]=“new”;
    c[i] = ...
    if (a[i]<=0) then goto repair
  continue:
    ...

  repair:
    restore old b[i]
    stmt X
    stmt Y
    recalculate c[i]?
    goto continue

Trace Scheduling (cont)

Three steps
1. Create superblocks
2. Enlarge superblocks
3. Compact (optimize) superblocks

1. Superblock formation
– Create traces using mutual-most-likely heuristic (two blocks A and B are mutual-most-likely if B is the most likely successor of A, and A is the most likely predecessor of B)
– A trace is a maximal sequence of mutual-most-likely blocks that does not contain a back edge
– Each block belongs to exactly one trace
  [Figure: weighted flow graph of blocks A to E with edge weights 70, 30, 70, 10, and 10]

Trace Scheduling (cont)

1. Superblock formation (cont)
– Convert traces into superblocks
– Use tail duplication to eliminate side entrances
– Tail duplication increases code size
  [Figure: a trace through A, B, and E with a side entrance from C, next to the resulting superblock in which C enters a duplicate E′]
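The superblock formation step described above can be sketched as follows. The graph, its weights, and the function names are hypothetical, not taken from the slides; the sketch also assumes that back edges were removed beforehand and that side entrances come only from blocks outside the trace.

```python
def form_traces(edges):
    """Group blocks into traces with the mutual-most-likely heuristic.

    edges: {(src, dst): weight}, forward edges only (assumption: back edges
    have already been removed). Returns a list of traces (lists of blocks).
    """
    blocks = []
    for src, dst in edges:
        for b in (src, dst):
            if b not in blocks:
                blocks.append(b)

    def most_likely(b, outgoing):
        # Heaviest successor (outgoing=True) or predecessor (outgoing=False).
        cands = {(d if outgoing else s): w for (s, d), w in edges.items()
                 if (s if outgoing else d) == b}
        return max(cands, key=cands.get) if cands else None

    link = {}                           # link[a] == b: a, b mutual-most-likely
    for a in blocks:
        b = most_likely(a, outgoing=True)
        if b is not None and most_likely(b, outgoing=False) == a:
            link[a] = b

    traces = []
    interior = set(link.values())       # blocks that continue some trace
    for b in blocks:
        if b in interior:
            continue
        trace = [b]
        while trace[-1] in link:
            trace.append(link[trace[-1]])
        traces.append(trace)
    return traces


def tail_duplicate(trace, edges):
    """Redirect side entrances into `trace` to primed copies of its tail.

    edges: set of (src, dst) pairs. Returns a new edge set.
    """
    pos = {b: i for i, b in enumerate(trace)}
    side = {(s, d) for (s, d) in edges if pos.get(d, 0) > 0 and s not in pos}
    if not side:
        return set(edges)
    first = min(pos[d] for _, d in side)
    tail = set(trace[first:])

    def copy(b):
        return b + "'" if b in tail else b

    result = set()
    for s, d in edges:
        if (s, d) in side:
            result.add((s, copy(d)))            # side entrance goes to the copy
        else:
            result.add((s, d))
            if s in tail:
                result.add((copy(s), copy(d)))  # edges leaving the copied tail
    return result


# Hypothetical weighted flow graph.
cfg = {("A", "B"): 70, ("A", "C"): 30, ("B", "E"): 70, ("C", "E"): 30}
traces = form_traces(cfg)
print(traces)  # [['A', 'B', 'E'], ['C']]
superblock_edges = tail_duplicate(traces[0], set(cfg))
print(sorted(superblock_edges))
# [('A', 'B'), ('A', 'C'), ('B', 'E'), ('C', "E'")]
```

After duplication the trace A, B, E has a single entrance at A, at the cost of one extra copy of E, which is the code-size increase the slide mentions.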

Trace Scheduling (cont)

2. Superblock enlargement
– Enlarge superblocks that are too small
– Code expansion can hurt i-cache performance

Three techniques for enlargement
– Branch target expansion
  – If the last branch in a superblock is likely to jump to the start of another superblock, append the contents of the target superblock to the first superblock
– Loop peeling
– Loop unrolling
– These last two techniques apply to superblock loops, which are superblocks whose last blocks are likely to jump to their first blocks
– Assume that each loop body has a single dominant path

Trace Scheduling (cont)

3. Optimizations
– Perform list scheduling for each superblock
– Memory-dependence profiles can be used to speculatively assume that load/store pairs do not conflict
  – Insert repair code in case the assumption is incorrect
– Software pipelining

Enhancements to Profile-Guided Code Scheduling

Path profiling [Ball and Larus 96]
– Collect information about entire paths instead of about individual edges
  [Figure: one edge profile in which every edge has count 50, next to two different path profiles]
– Limit paths to some specified length (can thus handle loops)
– Can also stop paths at back edges
– Disadvantages of path profiling?

Speculation based on memory-dependence profiles (example)

original:
    b[i] = “old”
    a[i] = ...
    if (a[i]>0) then
      b[i]=“new”;
    else
      stmt X
      stmt Y
    endif
    c[i] = a[j]

trace:
    b[i] = “old”
    c[i] = a[j]
    a[i] = ...
    b[i]=“new”;
    if (i==j) then goto deprepair
    if (a[i]<=0) then goto repair
  continue:
    ...

  deprepair:
    c[i] = a[i]
    if (a[i]<=0) then goto repair
    goto continue

  repair:
    restore old b[i]
    stmt X
    stmt Y
    goto continue
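The path profiling slide contrasts edge profiles with path profiles. The sketch below shows why the distinction matters: two different path profiles can collapse to the same edge profile. The flow graph (two branch diamonds in sequence, blocks A to G), the counts, and the function name are hypothetical.

```python
from collections import Counter


def edge_profile(path_profile):
    """Derive edge counts from path counts.

    path_profile: {tuple_of_blocks: execution_count}.
    """
    edges = Counter()
    for path, count in path_profile.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += count
    return edges


# Hypothetical CFG: A -> {B, C} -> D -> {E, F} -> G.
# Case 1: the two branches are perfectly correlated.
correlated = {
    ("A", "B", "D", "E", "G"): 50,
    ("A", "C", "D", "F", "G"): 50,
}
# Case 2: the two branches are independent.
independent = {
    ("A", "B", "D", "E", "G"): 25,
    ("A", "B", "D", "F", "G"): 25,
    ("A", "C", "D", "E", "G"): 25,
    ("A", "C", "D", "F", "G"): 25,
}

same = edge_profile(correlated) == edge_profile(independent)
print(same)                                   # True
print(edge_profile(correlated)[("A", "B")])   # 50
```

Every edge has count 50 in both cases, so an edge profile alone cannot tell a trace builder whether A, B, D, E, G is a dominant path (count 50) or just one of four equally likely paths (count 25).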

Lessons

Larger scope helps
– How can we increase scope? How do we schedule across control dependences?

Static information is limited
– Use profiles
– How else can profiles be used in optimization?
– Can we do these kinds of optimizations at runtime?

Concepts

Instruction scheduling
– Trace scheduling
  – Uses profile information
  – Looks at scopes beyond basic blocks

Miscellany
– Path profiling
