program analysis
play

Program Analysis Jose Lugo-Martinez CSE 240C: Advanced - PowerPoint PPT Presentation

Program Analysis Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson Outline Motivation ILP and its limitations Previous Work Limits of Control Flow on Parallelism Authors Goal Abstract Machine Models Results


  1. Program Analysis Jose Lugo-Martinez CSE 240C: Advanced Microarchitecture Prof. Steven Swanson

  2. Outline Motivation ILP and its limitations Previous Work Limits of Control Flow on Parallelism Author’s Goal Abstract Machine Models Results & Conclusions Automatically Characterizing Large Scale Program Behavior Author’s Goal Finding Phases Results & conclusions

  3. What is ILP? Instructions that do not have dependencies on each other; can be executed in any order

  4. How much parallelism is there? That depends how hard you want to look for it... Ways to increase ILP: Register renaming Alias analysis Branch prediction Loop unrolling

  5. ILP Challenges and Limitations to Multi- Issue Machines In order to achieve parallelism we should not have dependencies among instructions which are executing in parallel Inherent limitations of ILP 1 branch every 5 instructions Latencies of units: many operations must be scheduled Increase ports to Register File (bandwidth) Increase ports to memory (bandwidth) …

  6. Dependencies

  7. Previous Work: Wall ’ s Study Overall results parallelism = # of instr / # cycles required

  8. Parallelism achieved by a quite ambitious HW style model The average parallelism is around 7, the median around 5.

  9. Parallelism achieved by a quite ambitious SW style model The average parallelism is close to 9, but the median is still around 5.

  10. Wall’s Study Conclusions The results contrasts sharply with previous experiments that assume perfect branch prediction, (i.e. oracle ) His results suggests a severe limitation in current approaches The reported speedups for this processor on a set of non- numeric programs ranges from 4.1 to 7.4 for an aggressive HW algorithm to predict outcomes of branches

  11. Author’s Motivation and Goal Motivation : Much more parallelism is available on an oracle machine, suggesting that the bottleneck in Wall’s experiment is due to control flow Goal : Discover ways to increase parallelism by an order of magnitude beyond current approaches How? They analyze and evaluate the techniques of speculative execution, control dependence analysis, and following multiple flows of control Expectations? Hopefully establish the inadequacy of current approaches in handling control flow and identify promising new directions

  12. Speculation with Branch Prediction Speculation is a form of guessing Important for branch prediction Need to “take our best shot” at predicting branch direction. A common technique to improve the efficiency of speculation is to only speculate on instructions from the most likely execution path If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly The success of this approach depends on the accuracy of branch prediction

  13. Control Dependence Analysis b = 1 is control dependent on the condition a < 0 c = 2 is control independent branch on which an instruction is control dependent is the control dependence branch Control dependence in programs can be computed in a compiler using the reverse dominance frontier algorithm

  14. Multiple Control Flows In the example above, control dependence analysis shows that the bar function can run concurrently with the preceding loop However, a uniprocessor can typically only follow one flow of control at any time Support multiple flows of control is necessary to fully exploit the parallelism uncovered by control dependence analysis Multiprocessor architectures are a general means of providing this support Each processor can follow an independent flow of control

  15. Abstract Machine Models

  16. Abstract Machine Models

  17. Overall Parallelism Results for Numeric and Non- numeric Applications Parallelism for the CD machine is primarily limited by the constraint that branches must be executed in order. The constraints for the CD-MF machine only require that true data and control dependence be observed, the parallelism for this machine is a limit for all systems without speculative execution.

  18. Results for Parallelism with Control Dependence Analysis Parallelism for the CD machine is primarily limited by the constraint that branches must be executed in order. The constraints for the CD-MF machine only require that true data and control dependence be observed, the parallelism for this machine is a limit for all systems without speculative execution.

  19. Results for Parallelism with Speculative Execution and Control Dependence Analysis These results are comparable to Wall’s results for a similar machine

  20. Conclusions Control flow in a program can severely limit the available parallelism To increase the available parallelism beyond the current level, the constraints imposed by control flow must be relaxed Some of the current highly parallel architectures lack adequate support for control flow

  21. Automatically Characterizing Large Scale Program Behavior

  22. Author’s Motivation and Goals Motivation : Programs can have wildly different behavior over their run time, and these behaviors can be seen even on the largest of scales. Understanding these large scale program behaviors can unlock many new optimizations Goals : To create an automatic system that is capable of intelligently characterizing time-varying program behavior To provide both analytic and software tools to help with program phase identification To demonstrate the utility of these tools for finding places to simulate How : Develop a hardware independent metric that can concisely summarize the behavior of an arbitrary section of execution in a program

  23. Large Scale Behavior for gzip

  24. Approach Fingerprint each interval in program Enables building the high level model Basic Block Vector Tracks the code that is executing Long sparse vector Based on instruction execution frequency

  25. Basic Block Vectors For each interval: ID: 1 2 3 4 5 . BB Exec Count: <1, 20, 0, 5, 0, …> weigh by Block Size: <8, 3, 1, 7, 2, …> = <8, 60, 0, 35, 0, …> Normalize to 1 = <8%,58%,0%,34%,0%,…>

  26. Basic Block Similarity Matrix for gzip

  27. Basic Block Similarity Matrix for gcc

  28. Finding the Phases Basic Block Vector provide a compact and representative summary of the program’s behavior for intervals of execution But need to start by delineating a method of finding and representing the information; because there are so many intervals of execution that are similar to one another, one efficient representation is to group the intervals together that have similar behavior This problem is analogous to a clustering problem A Phase is a Cluster of BBVectors

  29. Phase-finding Algorithm 1. Profile the basic blocks 2. Reduce the dimension of the BBV data to 15 dimensions using random linear projection 3. Use the k -means algorithm to find clusters in the data for many different values of k 4. Choose the clustering with the smallest k , such that it’s score is at least 90% as good as the best score.

  30. Time varying and Cluster graph for gzip

  31. Time varying and Cluster graph for gcc

  32. Applications: Efficient Simulation Simulating to completion not feasible Detailed simulation on SPEC takes months Cycle level effects can’t be ignored To reduce simulation time it is only feasible to execute a small portion of the program, it is very important that the section simulated is an accurate representation of the program’s behavior as a whole. A SimPoint is a starting simulation place in a program’ s execution derived from their analysis.

  33. Results for Single SimPoint

  34. Multiple SimPoints Perform phase analysis For each phase in the program Pick the interval most representative of the phase This is the SimPoint for that phase Perform detailed simulation for SimPoints Weigh results for each SimPoint According to the size of the phase it represents

  35. Results for Multiple SimPoints

  36. Conclusions Use their analytical tools to automatically and efficiently analyze program behavior over large sections of execution, in order to take advantage of structure found in the program Clustering analysis can be used to automatically find multiple simulation points to reduce simulation time and to accurately model full program behavior

  37. Back-up Slides

  38. Comparisons Between Different Architectures Sequential Architecture Dependence Architecture Independence Architectures Additional info required None Specification of Minimally, a partial list of in the program dependences between independences. A complete operations specification of when and where each operation to be executed Typical kind of ILP Superscalar Dataflow VLIW processor Dependences analysis Performed by HW Performed by compiler Performed by compiler Independences analysis Performed by HW Performed by HW Performed by compiler Scheduling Performed by HW Performed by HW Performed by compiler Role of compiler Rearranges the code to Replaces some analysis Replaces virtually all the make the analysis and HW analysis and scheduling scheduling HW more HW successful

  39. Large Scale Behavior for gcc

Recommend


More recommend