Program Analysis Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson
Outline Motivation ILP and its limitations Previous Work Limits of Control Flow on Parallelism Author’s Goal Abstract Machine Models Results & Conclusions Automatically Characterizing Large Scale Program Behavior Author’s Goal Finding Phases Results & conclusions
What is ILP? Instructions that do not have dependencies on each other; can be executed in any order
How much parallelism is there? That depends how hard you want to look for it... Ways to increase ILP: Register renaming Alias analysis Branch prediction Loop unrolling
ILP Challenges and Limitations to Multi- Issue Machines In order to achieve parallelism we should not have dependencies among instructions which are executing in parallel Inherent limitations of ILP 1 branch every 5 instructions Latencies of units: many operations must be scheduled Increase ports to Register File (bandwidth) Increase ports to memory (bandwidth) …
Dependencies
Previous Work: Wall ’ s Study Overall results parallelism = # of instr / # cycles required
Parallelism achieved by a quite ambitious HW style model The average parallelism is around 7, the median around 5.
Parallelism achieved by a quite ambitious SW style model The average parallelism is close to 9, but the median is still around 5.
Wall’s Study Conclusions The results contrasts sharply with previous experiments that assume perfect branch prediction, (i.e. oracle ) His results suggests a severe limitation in current approaches The reported speedups for this processor on a set of non- numeric programs ranges from 4.1 to 7.4 for an aggressive HW algorithm to predict outcomes of branches
Author’s Motivation and Goal Motivation : Much more parallelism is available on an oracle machine, suggesting that the bottleneck in Wall’s experiment is due to control flow Goal : Discover ways to increase parallelism by an order of magnitude beyond current approaches How? They analyze and evaluate the techniques of speculative execution, control dependence analysis, and following multiple flows of control Expectations? Hopefully establish the inadequacy of current approaches in handling control flow and identify promising new directions
Speculation with Branch Prediction Speculation is a form of guessing Important for branch prediction Need to “take our best shot” at predicting branch direction. A common technique to improve the efficiency of speculation is to only speculate on instructions from the most likely execution path If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly The success of this approach depends on the accuracy of branch prediction
Control Dependence Analysis b = 1 is control dependent on the condition a < 0 c = 2 is control independent branch on which an instruction is control dependent is the control dependence branch Control dependence in programs can be computed in a compiler using the reverse dominance frontier algorithm
Multiple Control Flows In the example above, control dependence analysis shows that the bar function can run concurrently with the preceding loop However, a uniprocessor can typically only follow one flow of control at any time Support multiple flows of control is necessary to fully exploit the parallelism uncovered by control dependence analysis Multiprocessor architectures are a general means of providing this support Each processor can follow an independent flow of control
Abstract Machine Models
Abstract Machine Models
Overall Parallelism Results for Numeric and Non- numeric Applications Parallelism for the CD machine is primarily limited by the constraint that branches must be executed in order. The constraints for the CD-MF machine only require that true data and control dependence be observed, the parallelism for this machine is a limit for all systems without speculative execution.
Results for Parallelism with Control Dependence Analysis Parallelism for the CD machine is primarily limited by the constraint that branches must be executed in order. The constraints for the CD-MF machine only require that true data and control dependence be observed, the parallelism for this machine is a limit for all systems without speculative execution.
Results for Parallelism with Speculative Execution and Control Dependence Analysis These results are comparable to Wall’s results for a similar machine
Conclusions Control flow in a program can severely limit the available parallelism To increase the available parallelism beyond the current level, the constraints imposed by control flow must be relaxed Some of the current highly parallel architectures lack adequate support for control flow
Automatically Characterizing Large Scale Program Behavior
Author’s Motivation and Goals Motivation : Programs can have wildly different behavior over their run time, and these behaviors can be seen even on the largest of scales. Understanding these large scale program behaviors can unlock many new optimizations Goals : To create an automatic system that is capable of intelligently characterizing time-varying program behavior To provide both analytic and software tools to help with program phase identification To demonstrate the utility of these tools for finding places to simulate How : Develop a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program
Large Scale Behavior for gzip
Approach Fingerprint each interval in program Enables building the high level model Basic Block Vector Tracks the code that is executing Long sparse vector Based on instruction execution frequency
Basic Block Vectors For each interval: ID: 1 2 3 4 5 . BB Exec Count: <1, 20, 0, 5, 0, …> weigh by Block Size: <8, 3, 1, 7, 2, …> = <8, 60, 0, 35, 0, …> Normalize to 1 = <8%,58%,0%,34%,0%,…>
Basic Block Similarity Matrix for gzip
Basic Block Similarity Matrix for gcc
Finding the Phases Basic Block Vector provide a compact and representative summary of the program’s behavior for intervals of execution But need to start by delineating a method of finding and representing the information; because there are so many intervals of execution that are similar to one another, one efficient representation is to group the intervals together that have similar behavior This problem is analogous to a clustering problem A Phase is a Cluster of BBVectors
Phase-finding Algorithm 1. Profile the basic blocks 2. Reduce the dimension of the BBV data to 15 dimensions using random linear projection 3. Use the k -means algorithm to find clusters in the data for many different values of k 4. Choose the clustering with the smallest k , such that it’s score is at least 90% as good as the best score.
Time varying and Cluster graph for gzip
Time varying and Cluster graph for gcc
Applications: Efficient Simulation Simulating to completion not feasible Detailed simulation on SPEC takes months Cycle level effects can’t be ignored To reduce simulation time it is only feasible to execute a small portion of the program, it is very important that the section simulated is an accurate representation of the program’s behavior as a whole. A SimPoint is a starting simulation place in a program’ s execution derived from their analysis.
Results for Single SimPoint
Multiple SimPoints Perform phase analysis For each phase in the program Pick the interval most representative of the phase This is the SimPoint for that phase Perform detailed simulation for SimPoints Weigh results for each SimPoint According to the size of the phase it represents
Results for Multiple SimPoints
Conclusions Use their analytical tools to automatically and efficiently analyze program behavior over large sections of execution, in order to take advantage of structure found in the program Clustering analysis can be used to automatically find multiple simulation points to reduce simulation time and to accurately model full program behavior
Back-up Slides
Comparisons Between Different Architectures Sequential Architecture Dependence Architecture Independence Architectures Additional info required None Specification of Minimally, a partial list of in the program dependences between independences. A complete operations specification of when and where each operation to be executed Typical kind of ILP Superscalar Dataflow VLIW processor Dependences analysis Performed by HW Performed by compiler Performed by compiler Independences analysis Performed by HW Performed by HW Performed by compiler Scheduling Performed by HW Performed by HW Performed by compiler Role of compiler Rearranges the code to Replaces some analysis Replaces virtually all the make the analysis and HW analysis and scheduling scheduling HW more HW successful
Large Scale Behavior for gcc
Recommend
More recommend