Dynamic Binary Optimization
EECS 768 Virtual Machines
● Introduction
● Application profiling
● Optimizing translation blocks
● Compatibility
● Code reordering
● Other code optimizations
Optimization Overview
● Identify frequently executed (hot) code regions
  • basic blocks
  • paths: indicate control flow
  • edges: approximation to paths
● Dynamic profiling
  • count execution frequencies
  • software or hardware implemented
● Form large translation blocks
  • traces and superblocks
● Schedule and optimize large blocks
Optimization Based On Profiling
Before:
  Basic Block A:      ...
                      R3 ← ...
                      R7 ← ...
                      R1 ← R2 + R3
                      BEQ L1 if R3 == 0
  Basic Block B:      ...
                      R6 ← R1 + R6
                      ...
  Basic Block C:  L1: R1 ← 0
                      ...
After:
  Basic Block A:      ...
                      R3 ← ...
                      R7 ← ...
                      BEQ L1 if R3 == 0
  Compensation code:  R1 ← R2 + R3
  Basic Block B:      ...
                      R6 ← R1 + R6
                      ...
  Basic Block C:  L1: R1 ← 0
                      ...
Optimization Based On Profiling (2)
Before:
  Basic Block A:      ...
                      R3 ← ...
                      R7 ← ...
                      R1 ← R2 + R3
                      BEQ L1 if R3 == 0
  Basic Block B:      ...
                      R6 ← R1 + R6
                      ...
  Basic Block C:  L1: R1 ← 0
                      ...
After:
  Superblock:         ...
                      R3 ← ...
                      R7 ← ...
                      BNE L2 if R3 != 0
                      R1 ← 0
                      ...
  Compensation code:
                  L2: R1 ← R2 + R3
  Basic Block B:      ...
                      R6 ← R1 + R6
                      ...
Program Behavior
● Many aspects of a program's behavior are predictable
  • branches, data values

        R3 ← 100
  loop: R1 ← mem(R2)          ; load from memory
        Br found if R1 == -1  ; look for -1
        R2 ← R2 + 4
        R3 ← R3 - 1
        Br loop if R3 != 0    ; loop closing branch
        ...
  found:

● Backward branch primarily taken
● Forward branch mostly not taken
Branch Behavior
● Conditional branch predominantly decided one way
  • either taken or not taken
[Figure: histogram of the fraction of static conditional branches (y-axis, 0% to 50%) against percent taken (x-axis, bins from 0-10% to >90%).]
Branch Behavior (2)
● Most branches decided the same way as on previous execution
  • backward conditional branches are mostly taken
  • forward conditional branches taken less often
[Figure: bar chart of the percent of dynamic branches decided the same as the previous time (y-axis, 0% to 100%), one bar per benchmark.]
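The "same as previous time" metric can be computed from a recorded outcome history of a single branch. The sketch below is illustrative only: the function name and the representation of outcomes as an int array (nonzero meaning taken) are assumptions, not something the slides define.

```c
#include <stddef.h>

/*
 * Fraction of executions of one branch whose outcome matches the
 * immediately preceding outcome of the same branch.
 * outcomes[i] != 0 means "taken" (assumed encoding).
 * Returns 0.0 when there are fewer than two executions to compare.
 */
double same_as_previous_fraction(const int *outcomes, size_t n)
{
    if (n < 2)
        return 0.0;

    size_t same = 0;
    for (size_t i = 1; i < n; i++) {
        if ((outcomes[i] != 0) == (outcomes[i - 1] != 0))
            same++;
    }
    return (double)same / (double)(n - 1);
}
```

For a loop-closing branch that is taken nine times and then falls through once, only the final execution differs from its predecessor, which is why such branches score high on this metric.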
Other Program Behavior
● Some indirect jumps have a single target
  • others have several targets (e.g. returns)
● Predictability extends to data values
  • many instructions always produce the same result
[Figure: bar chart of the fraction of instructions with a constant value (y-axis, 0 to 0.7) by instruction type (x-axis), with separate static and dynamic bars.]
Profiling
● Collect statistics about a program as it runs
  • branches (taken, not taken)
  • jump targets
  • data values
● Predictability allows these statistics to be used for optimizations in the future
● Profiling in a VM differs from traditional profiling used for compiler feedback
Conventional (Offline) Profiling
● Multiple passes through compiler
● Done at program development time
  • profiling overhead is a minor concern
● Can be based on global analysis
[Figure: the HLL program passes through the compiler front-end to an intermediate form, and the compiler back-end emits instrumented code. Executing the instrumented code on test data produces program statistics, which the optimizing compiler uses to produce the optimized binary.]
VM-Based (Online) Profiling
● Profile overhead is very important
  • profile time is part of total execution time
● Limited view of program (no a priori global view)
  • profile probes cannot be carefully placed
[Figure: the interpreter runs the program binary on program data, working from partially "discovered" code, and collects statistics that feed the translator/optimizer.]
Types of Profiles
● Block or node profiles
  • identify hot code blocks; fewer nodes than edges
● Edge profiles
  • more precise idea of program flow
  • block profile can be derived from edge profile
[Figure: the same control flow graph (blocks A to F) shown twice, once annotated with block counts and once with edge counts.]
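The claim that a block profile can be derived from an edge profile follows from the fact that a block executes once for every traversal of one of its incoming edges. A minimal sketch, assuming blocks are numbered from 0 and that the entry block's external entry count is known separately (the `Edge` type and function name are hypothetical):

```c
#include <stddef.h>

/* One profiled control-flow edge: src block -> dst block. */
typedef struct {
    int src;
    int dst;
    unsigned long count;   /* times this edge was traversed */
} Edge;

/*
 * Derive per-block execution counts from edge counts.
 * entry_count is the number of times entry_block was reached from
 * outside the profiled region (it has no incoming edge for those).
 */
void block_profile_from_edges(const Edge *edges, size_t n_edges,
                              int entry_block, unsigned long entry_count,
                              unsigned long *block_count, size_t n_blocks)
{
    for (size_t b = 0; b < n_blocks; b++)
        block_count[b] = 0;

    if (entry_block >= 0 && (size_t)entry_block < n_blocks)
        block_count[entry_block] = entry_count;

    /* Each traversal of an edge executes its destination block once. */
    for (size_t e = 0; e < n_edges; e++) {
        if (edges[e].dst >= 0 && (size_t)edges[e].dst < n_blocks)
            block_count[edges[e].dst] += edges[e].count;
    }
}
```

The reverse derivation is not possible in general: several different edge profiles can yield the same block counts, which is why the edge profile is the more precise of the two.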
Collecting Profiles
● Instrumentation-based
  • software probes: slow the program more, but need less total time than sampling
  • hardware probes: less overhead than software, but less well supported in processors; typically event counters
● Sampling-based
  • interrupt at random intervals and take a sample: slows the program less, but takes longer to collect the same amount of data
  • not useful during interpretation
Profiling During Interpretation
[Figure: the branch PC is hashed to index a profile table; each entry holds the PC, a taken count, and a not-taken count.]

Instruction function list:

  branch_conditional(inst) {
      BO = extract(inst,25,5);
      BI = extract(inst,20,5);
      displacement = extract(inst,15,14) * 4;
      . . .
      // code to compute whether branch should be taken
      . . .
      profile_addr = lookup(PC);        // hash the branch PC to its table entry
      if (branch_taken) {
          profile_cnt(profile_addr, taken)++;
          PC = PC + displacement;
      } else {
          profile_cnt(profile_addr, nottaken)++;
          PC = PC + 4;
      }
  }
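The `lookup(PC)` step and the two counters in the routine above can be sketched as a small hash table. Everything below is an assumption made for illustration: the table size, the hash (PC shifted right by two, assuming 4-byte instructions), the linear probing, and the names `ProfileEntry`, `profile_lookup`, and `profile_branch` do not come from the slides.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 1024u   /* assumed table size; must be a power of two */

typedef struct {
    uint32_t pc;          /* branch address that owns this slot */
    int      used;        /* nonzero once the slot is claimed */
    uint32_t taken;
    uint32_t not_taken;
} ProfileEntry;

/* Static storage is zero-initialized, so all counters start at 0. */
static ProfileEntry profile_table[PROFILE_SLOTS];

/* Find or create the entry for a branch PC (linear probing).
 * Returns NULL if the table is full. */
ProfileEntry *profile_lookup(uint32_t pc)
{
    uint32_t h = (pc >> 2) & (PROFILE_SLOTS - 1);
    for (uint32_t i = 0; i < PROFILE_SLOTS; i++) {
        ProfileEntry *e = &profile_table[(h + i) & (PROFILE_SLOTS - 1)];
        if (!e->used) {
            e->used = 1;
            e->pc = pc;
            return e;
        }
        if (e->pc == pc)
            return e;
    }
    return NULL;
}

/* Record one branch outcome and return the next PC, mirroring the
 * tail of branch_conditional. */
uint32_t profile_branch(uint32_t pc, int branch_taken, int32_t displacement)
{
    ProfileEntry *e = profile_lookup(pc);
    if (branch_taken) {
        if (e != NULL)
            e->taken++;
        return pc + (uint32_t)displacement;  /* wraps correctly for negative values */
    }
    if (e != NULL)
        e->not_taken++;
    return pc + 4;
}
```

Dropping the count when the table is full keeps the interpreter correct at the cost of a slightly incomplete profile, which is usually an acceptable trade for a profiling structure.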
Profiling Translated Code
● Software instrumentation in stub code
[Figure: a translated basic block exits through a fall-thru stub and a branch target stub, each with its own edge counter, counter (i) or counter (j). Each stub increments its edge counter; if the counter exceeds the trigger, it invokes the optimizer, else it branches to the fall-thru or target basic block.]
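The stub logic amounts to a counter and a threshold test per edge. A minimal sketch in C, assuming a fixed trigger value and modeling the "invoke optimizer" and "branch to next block" outcomes as a return code (the names `EdgeStub`, `HOT_TRIGGER`, and `edge_stub_execute` are hypothetical):

```c
#include <stdint.h>

#define HOT_TRIGGER 50u   /* assumed threshold; real systems tune this */

typedef enum {
    STUB_CONTINUE,          /* branch on to the fall-thru or target block */
    STUB_INVOKE_OPTIMIZER   /* edge is hot: hand control to the optimizer */
} StubAction;

typedef struct {
    uint32_t counter;       /* executions of this edge so far */
} EdgeStub;

/* Executed each time control leaves a translated block via this edge. */
StubAction edge_stub_execute(EdgeStub *stub)
{
    stub->counter++;
    if (stub->counter > HOT_TRIGGER)
        return STUB_INVOKE_OPTIMIZER;
    return STUB_CONTINUE;
}
```

Because the counters live on the exits of a block rather than in the block itself, this scheme collects an edge profile directly.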
Sampling
● Set interval counter
● Interrupt when counter hits zero
● Sample PC at that point
● Gives block profile
● Could be modified to give edge profile
[Figure: the interval counter is initialized and then decremented for each instruction; when zero is detected, a trap fires and the instruction address in the program counter is loaded as the sample PC.]
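The interval-counter steps above can be simulated in software. This sketch assumes a fixed interval and that the caller already knows which block contains the current PC; the `Sampler` type and function names are hypothetical, and a real implementation would rely on a hardware counter and trap rather than a per-instruction call.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t  interval;    /* instructions between samples */
    uint32_t  remaining;   /* the interval counter */
    uint32_t *samples;     /* per-block sample counts (the block profile) */
    size_t    n_blocks;
} Sampler;

void sampler_init(Sampler *s, uint32_t interval,
                  uint32_t *samples, size_t n_blocks)
{
    s->interval  = interval ? interval : 1;   /* avoid a zero interval */
    s->remaining = s->interval;
    s->samples   = samples;
    s->n_blocks  = n_blocks;
    for (size_t b = 0; b < n_blocks; b++)
        samples[b] = 0;
}

/* Call once per executed instruction; `block` is the block holding the PC. */
void sampler_tick(Sampler *s, size_t block)
{
    if (--s->remaining == 0) {
        /* Counter hit zero: the "trap". Record the sampled location. */
        if (block < s->n_blocks)
            s->samples[block]++;
        s->remaining = s->interval;           /* re-arm the counter */
    }
}
```

Sample counts are proportional to the time spent in each block only when enough samples have been taken, which is the "longer time to get the same amount of data" cost of sampling.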
Improving Code Locality
● Provide more optimization opportunities
● Spatial locality
  • consecutive memory accesses are adjacent
● Temporal locality
  • same memory access is repeated in near future
● Reasons for spatial and temporal locality
  • loops and sequential program flow
[Figure: code layouts of blocks A, B, and E.]
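Improving Locality: Example
[Figure: control flow graph of blocks A to G annotated with edge counts; the hot edges include A to B (30), A to D (70), B to C (29), D to E (68), C to G (29), and E to G (68), while F is rarely reached.]
Original code layout:
  A   Br cond1 == true
  B   Br cond2 == false
  C   Br uncond
  D   Br cond3 == true
  E   Br uncond
  F
  G   Br cond4 == true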
Improving Locality: Example (2)
● Little locality (spatial or temporal) in cache line that spans blocks E and F
● F seldom used
  • wasted I-cache space and I-fetch bandwidth
● Heavily used discontiguous code blocks
  • e.g., C and D
  • still wastes I-fetch bandwidth
[Figure: a cache line spanning the end of block E (Br uncond) and the start of block F.]
Improving Locality: Rearrange Code
Original layout:
  A   Br cond1 == true
  B   Br cond2 == false
  C   Br uncond
  D   Br cond3 == true
  E   Br uncond
  F
  G   Br cond4 == true
Rearranged layout:
  A   Br cond1 == false
  D   Br cond3 == true
  E
  G   Br cond4 == true
      Br uncond
  B   Br cond2 == false
  C   Br uncond
  F   Br uncond
Improving Locality: Procedure Inlining
● Inlining: duplicate procedure body at call site
● Partial inlining
  • follow dominant flow of control
  • not practical to find full procedure during dynamic incremental code discovery
● Disadvantages
  • increases code size
  • increases register pressure
[Figure: block A calls proc xyz (blocks X, Y, Z, ending in a return) and continues at block B; another call site, after block K and before block L, calls the same procedure. With inlining, procedure blocks are duplicated at the call sites.]
Improving Locality: Traces
● Divide program into chunks
  • may contain multiple blocks
● Greedy method
  • suitable for on-the-fly translation
  • start at hottest block not in trace
  • follow hottest edges
  • stop when trace reaches a certain size
  • stop when a block already in a trace is reached
[Figure: the example control flow graph (blocks A to G with edge counts) partitioned into Trace 1, Trace 2, and Trace 3.]
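The greedy method listed above can be sketched directly from its four rules. The `Profile` layout (a dense edge-count matrix), the `MAX_BLOCKS` bound, and the function name are assumptions for illustration, and ties are broken in favor of the lowest block index, which the slides do not specify.

```c
#include <stddef.h>

#define MAX_BLOCKS 16      /* assumed upper bound on blocks */
#define NO_TRACE   (-1)

typedef struct {
    size_t        n_blocks;                            /* must be <= MAX_BLOCKS */
    unsigned long block_count[MAX_BLOCKS];             /* block profile */
    unsigned long edge_count[MAX_BLOCKS][MAX_BLOCKS];  /* [src][dst]; 0 = no edge */
} Profile;

/* Assign every block to a trace; returns the number of traces formed.
 * trace_of[b] receives the trace index of block b. */
int form_traces(const Profile *p, size_t max_trace_len, int trace_of[MAX_BLOCKS])
{
    int n_traces = 0;
    if (max_trace_len == 0)
        max_trace_len = 1;
    for (size_t b = 0; b < p->n_blocks; b++)
        trace_of[b] = NO_TRACE;

    for (;;) {
        /* Start at the hottest block not yet in a trace. */
        int cur = -1;
        for (size_t b = 0; b < p->n_blocks; b++) {
            if (trace_of[b] == NO_TRACE &&
                (cur < 0 || p->block_count[b] > p->block_count[cur]))
                cur = (int)b;
        }
        if (cur < 0)
            break;                       /* every block is placed */

        size_t len = 0;
        while (len < max_trace_len) {    /* stop at the size limit */
            trace_of[cur] = n_traces;
            len++;

            /* Follow the hottest outgoing edge. */
            int next = -1;
            for (size_t d = 0; d < p->n_blocks; d++) {
                if (p->edge_count[cur][d] > 0 &&
                    (next < 0 ||
                     p->edge_count[cur][d] > p->edge_count[cur][next]))
                    next = (int)d;
            }
            /* Stop at a dead end or at a block already in a trace. */
            if (next < 0 || trace_of[next] != NO_TRACE)
                break;
            cur = next;
        }
        n_traces++;
    }
    return n_traces;
}
```

With illustrative counts modeled loosely on the example graph (A to D hotter than A to B, and G looping back to A), this procedure groups A, D, E, G into the first trace, B and C into the second, and leaves F alone in the third.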
Improving Locality: Traces (2)
● No redundancy
  • may reduce I-cache pressure
  • good for spatial locality
● Join points sometimes inhibit optimizations
● Typically not used in optimizing VMs
Improving Locality: Superblocks
● Superblock: one entry, multiple exits
● May contain redundant blocks (tail duplication)
● More commonly used by dynamic optimizers
  • better branch prediction
  • fewer constraints on optimizations
[Figure: the example control flow graph (blocks A to G) before and after tail duplication; afterward, block G is duplicated so that each superblock has a single entry.]
Superblocks: Example
[Figure: the example control flow graph (blocks A to G with edge counts), shown between the original layout and the superblock layout.]
Original layout:
  A   Br cond1 == true
  B   Br cond2 == false
  C   Br uncond
  D   Br cond3 == true
  E   Br uncond
  F
  G   Br cond4 == true
Superblock layout:
  A   Br cond1 == false
  D   Br cond3 == true
  E
  G   Br cond4 == true
      Br uncond

  B   Br cond2 == false
  C
  G   Br cond4 == true
      Br uncond

  F
  G   Br cond4 == true
      Br uncond