High Performance Computing Systems (CMSC714)
Lecture 11: Measurement Tools
Abhinav Bhatele, Department of Computer Science
Summary of last lecture
• Scalable networks: fat-tree, dragonfly
• Use high-radix routers
• Many nodes connected to each switch
• Low network diameter, high bisection bandwidth
• Dynamic routing
Abhinav Bhatele, CMSC714
Performance analysis
• Parallel performance of a program might not be what we expect
• How do we find performance bottlenecks?
• Two parts to performance analysis: measurement and analysis/visualization
• Simplest tool: timers in the code and printf
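The "timers in the code and printf" approach can be sketched as follows. This is a minimal illustration, not a recommended tool: `timed` and `sum_of_squares` are hypothetical names introduced here, and the workload is a stand-in for a real compute kernel.

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), print its wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")
    return result, elapsed

def sum_of_squares(n):
    # Stand-in for a compute kernel (hypothetical workload).
    return sum(i * i for i in range(n))

result, seconds = timed("sum_of_squares", sum_of_squares, 100_000)
```

Manual timers like this are cheap to add but only cover the regions the programmer thought to wrap, which is one motivation for the tools on the following slides.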
Performance Tools
• Tracing tools
  • Capture entire execution trace
  • Vampir, Score-P
• Profiling tools
  • Typically use statistical sampling
  • Gprof
• Many tools can do both
  • TAU, HPCToolkit, Projections
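The difference between a trace and a profile can be sketched in a few lines. The example below records every function entry and exit (a trace) and then reduces that log to per-function call counts (a profile). It relies on Python's `sys.setprofile` hook, which is instrumentation rather than the statistical sampling most profilers use, and `record_trace`, `work`, and `leaf` are hypothetical names chosen for the illustration.

```python
import sys
import time
from collections import Counter

def record_trace(fn):
    """Run fn() and return a list of (timestamp, event, function_name) records."""
    events = []

    def hook(frame, event, arg):
        # Keep only Python-level function entry and exit events.
        if event in ("call", "return"):
            events.append((time.perf_counter(), event, frame.f_code.co_name))

    previous = sys.getprofile()
    sys.setprofile(hook)
    try:
        fn()
    finally:
        sys.setprofile(previous)  # restore whatever hook was active before
    return events

def leaf():
    return 1

def work():
    return leaf() + leaf()

# The trace keeps every event in order; its size grows with run length.
trace = record_trace(work)

# A profile is a summary of the same run: here, just call counts per function.
profile = Counter(name for _, event, name in trace if event == "call")
```

The trace preserves ordering and timestamps at the cost of storage that grows with execution time, while the profile stays small because it discards ordering.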
Metrics recorded
• Counts of function invocations
• Time spent in code
• Hardware counters
Calling contexts, trees, and graphs
• Calling context or call path: Sequence of function invocations leading to the current sample
• Calling context tree: dynamic prefix tree of all call paths in an execution
• Call graph: keep caller-callee relationships as arcs
[Figure: example tree of calls with nodes main; physics, solvers; mpi, hypre, mpi; psm2, psm2]
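The three definitions above can be made concrete with a short sketch that builds both structures from the same set of call paths. The function names come from the slide's figure, but the specific paths are hypothetical, and `CCTNode`, `build_cct`, `build_call_graph`, and `count_nodes` are names introduced here for illustration.

```python
class CCTNode:
    """One node of a calling context tree: a function in a specific calling context."""

    def __init__(self, name):
        self.name = name
        self.samples = 0    # samples whose call path ends exactly here
        self.children = {}  # callee name -> CCTNode

def build_cct(call_paths):
    """Merge call paths (outermost frame first) into a prefix tree."""
    root = CCTNode("<root>")
    for path in call_paths:
        node = root
        for name in path:
            if name not in node.children:
                node.children[name] = CCTNode(name)
            node = node.children[name]
        node.samples += 1
    return root

def build_call_graph(call_paths):
    """Collapse the same paths into a set of caller -> callee arcs."""
    return {(caller, callee) for path in call_paths
            for caller, callee in zip(path, path[1:])}

def count_nodes(node):
    """Count a node and all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())

# Hypothetical sampled call paths, using the function names on the slide.
paths = [
    ("main", "physics", "mpi", "psm2"),
    ("main", "solvers", "hypre"),
    ("main", "solvers", "mpi", "psm2"),
    ("main", "physics", "mpi", "psm2"),
]
cct = build_cct(paths)
arcs = build_call_graph(paths)
```

In the tree, `mpi` appears twice because it is reached through two different contexts (`physics` and `solvers`); in the call graph those collapse into a single `mpi` node with two incoming arcs.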
Output
• Flat profile: Listing of all functions with counts and execution times
• Call graph profile
• Calling context tree
[Figure: example call tree rooted at foo, with nodes bar, qux, waldo, baz, grault, quux, fred, garply, corge, plugh, xyzzy, thud]

Paper excerpt shown on the slide:

The static call graph can be constructed from the source text of the program. However, discovering the static call graph from the source text would require two moderately difficult steps: finding the source text for the program (which may not be available), and scanning and parsing that text, which may be in any one of several languages.

In our programming system, the static calling information is also contained in the executable version of the program, which we already have available, and which is in language-independent form. One can examine the instructions in the object program, looking for calls to routines, and note which routines can be called. This technique allows us to add arcs to those already in the dynamic call graph. If a statically discovered arc already exists in the dynamic call graph, no action is required. Statically discovered arcs that do not exist in the dynamic call graph are added to the graph with a traversal count of zero. Thus they are never responsible for any time propagation. However, they may affect the structure of the graph. Since they may complete strongly connected components, the static call graph construction is done before topological ordering.

5. Data Presentation

The data is presented to the user in two different formats. The first presentation simply lists the routines without regard to the amount of time their descendants use. The second presentation incorporates the call graph of the program.

5.1. The Flat Profile

The flat profile consists of a list of all the routines that are called during execution of the program, with the count of the number of times they are called and the number of seconds of execution time for which they are themselves accountable. The routines are listed in decreasing order of execution time. A list of the routines that are never called during execution of the program is also available to verify that nothing important is omitted by this execution. The flat profile gives a quick overview of the routines that are used, and shows the routines that are themselves responsible for large fractions of the execution time. In practice, this profile usually shows that no single function is overwhelmingly responsible for the total time of the program. Notice that for this profile, the individual times sum to the total execution time.

5.2. The Call Graph Profile

Ideally, we would like to print the call graph of the program, but we are limited by the two-dimensional nature of our output devices. We cannot assume that a call graph is planar, and even if it is, that we can print a planar version of it. Instead, we choose to list each routine, together with information about the routines that are its direct parents and children. This listing presents a window into the call graph. Based on our experience, both parent information and child information is important, and should be available without searching through the output.

The major entries of the call graph profile are the entries from the flat profile, augmented by the time propagated to each routine from its descendants. This profile is sorted by the sum of the time for the routine itself plus the time inherited from its descendants. The profile shows which of the higher level routines spend large portions of the total execution time in the routines that they call. For each routine, we show the amount of time passed by each child to the routine, which includes time for the child itself and for the descendants of the child (and thus the descendants of the routine). We also show the percentage these times represent of the total time accounted to the child. Similarly, the parents of each routine are listed, along with time, and percentage of total routine time, propagated to each one.

Cycles are handled as single entities. The cycle as a whole is shown as though it were a single routine, except that members of the cycle are listed in place of the children. Although the number of calls of each member from within the cycle are shown, they do not affect time propagation. When a child is a member of a cycle, the time shown is the appropriate fraction of the time for the whole cycle. Self-recursive routines have their calls broken down into calls from the outside and self-recursive calls. Only the outside calls affect the propagation of time.

The following example is a typical fragment of a call graph.

The entry in the call graph profile listing for this example is shown in Figure 4.

The entry is for routine EXAMPLE, which has the Caller routines as its parents, and the Sub routines as its children. The reader should keep in mind that all information is given with respect to EXAMPLE. The index in the first column shows that EXAMPLE is the second entry in the profile listing. The EXAMPLE routine is called ten times, four times by CALLER1, and six times by CALLER2. Consequently 40% of EXAMPLE's time is propagated to CALLER1, and 60% of EXAMPLE's time is propagated to CALLER2. The self and descendant fields of the parents show the amount of self and descendant time EXAMPLE propagates to them (but not the time used by the parents directly). Note that EXAMPLE calls itself recursively four times. The routine EXAMPLE calls routine SUB1 twenty times, SUB2 once, and never calls SUB3. Since SUB2 is called a total of five times, 20% of its self and descendant time is propagated to EXAMPLE's descendant time field. Because SUB1 is a ...
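The proportional time propagation described in the excerpt (EXAMPLE called four times by CALLER1 and six times by CALLER2) can be sketched as below. This is a simplified illustration that ignores cycles; `propagate_to_parents` is a hypothetical helper name, and the 2.0-second total is an assumed value, not a number from the paper.

```python
def propagate_to_parents(child_time, calls_from_parent):
    """Split a routine's self + descendant time among its callers by call count."""
    total_calls = sum(calls_from_parent.values())
    return {parent: child_time * calls / total_calls
            for parent, calls in calls_from_parent.items()}

# Call counts from the excerpt. Self-recursive calls are left out because
# only calls from outside affect the propagation of time.
calls_to_example = {"CALLER1": 4, "CALLER2": 6}

example_time = 2.0  # hypothetical self + descendant time in seconds
shares = propagate_to_parents(example_time, calls_to_example)

# Same rule applied one level down: SUB2 is called five times in total,
# once by EXAMPLE, so EXAMPLE inherits one fifth of SUB2's time.
sub2_time = 1.0  # hypothetical self + descendant time in seconds
sub2_share_for_example = sub2_time * 1 / 5
```

With these inputs CALLER1 receives 40% of EXAMPLE's time and CALLER2 receives 60%, matching the percentages in the excerpt. The rule assumes every call to a routine costs the same on average, which is the approximation gprof makes when it only has call counts.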
Questions
gprof: A Call Graph Execution Profiler
• Execution count: the paper describes two types of counts, either an actual count or a boolean. What is the benefit of introducing the second type?
• A call to the monitoring routine seems more informative but slower than an inline counter increment. Does the slowdown affect the accuracy of the monitoring? Is this trade-off generally worth it for profiling?
• It is not clear from the paper how the timing approximation is derived from the histogram. An illustrative example would help.
• Is there a principled way to extract the static call graph from a generic program?
• What are the different types of call graphs? How is each type best used for understanding program performance?
• How much memory does profiling data usually require? Related: how does gprof balance various overheads?
• How does timeslicing work on timeshare machines?
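On the histogram question, one way to picture the approximation is the sketch below: the program counter is sampled at a fixed period, each sample is mapped to the routine whose address range contains it, and a routine's estimated time is its sample count multiplied by the period. This is a simplification under stated assumptions, not gprof's actual implementation: it ignores histogram buckets that straddle two routines, and the addresses, routine names, and 0.01-second period are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter

def attribute_samples(pc_samples, symbols, period):
    """Estimate per-routine time from sampled program-counter values.

    symbols: list of (start_address, name) sorted by start_address; each
             routine is assumed to extend up to the next routine's start.
    period:  seconds between consecutive samples.
    """
    starts = [start for start, _ in symbols]
    counts = Counter()
    for pc in pc_samples:
        idx = bisect_right(starts, pc) - 1  # last routine starting at or before pc
        if idx >= 0:
            counts[symbols[idx][1]] += 1
    return {name: n * period for name, n in counts.items()}

# Hypothetical symbol table and samples.
symbols = [(0x1000, "main"), (0x1100, "solve"), (0x1400, "exchange")]
pc_samples = [0x1010, 0x1120, 0x1130, 0x1200, 0x1410]
times = attribute_samples(pc_samples, symbols, period=0.01)
```

Here `solve` collects three of the five samples, so it is charged three sampling periods. The estimate is statistical: short routines may receive no samples at all, and accuracy improves as the run gets longer relative to the sampling period.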
Questions
Binary Analysis for Measurement and Attribution of Program Performance
• The paper states “dynamic instrumentation remains susceptible to systematic measurement error because of instrumentation overhead”. Where do these overheads come from, compared to static and binary instrumentation?
• Loop optimizations performed by the compiler introduce a semantic gap between source code and binary. Has there been any effort to incorporate the compiler into the profiling system to reduce this gap?
• The paper suggests that HPCToolkit is better than gprof. How do they compare in practice when used to profile a program?
• How does highly optimized code make accurate profiling harder? How does binary analysis address these issues?
• What are the measurement techniques for instrumentation?
Questions?
Abhinav Bhatele
5218 Brendan Iribe Center (IRB) / College Park, MD 20742
phone: 301.405.4507 / e-mail: bhatele@cs.umd.edu