Emulation Michael Jantz
Acknowledgements
• Slides adapted from Chapter 2 in Virtual Machines: Versatile Platforms for Systems and Processes by James E. Smith and Ravi Nair
• Credit to Prasad A. Kulkarni: some slides were borrowed from his course on Virtual Machines at the University of Kansas
Outline
• Emulation
• Interpretation
  • Basic, indirect threaded, and direct threaded
• Binary translation
  • Code discovery, code location
• Other issues
  • Control transfer optimizations
  • Instruction set issues
Emulation vs. Simulation
• Emulation: process of implementing the interface / functionality of a (sub)system on a different system
  • Applies specifically to an instruction set
  • Different emulation techniques
    • Interpretation (instruction-at-a-time)
    • Binary translation (block-at-a-time)
• Simulation
  • Method for modeling a system’s operation
  • Goal is to study the process, not to imitate function
Definitions
• Guest
  • Environment supported by the underlying platform
• Host
  • Underlying platform used to provide an environment for the guest
[Figure: Guest, supported by Host]
Definitions
• Source ISA or binary
  • Original instruction set or binary
  • The ISA to be emulated
• Target ISA or binary
  • ISA of the host processor
  • Underlying ISA
• Source / target refer to ISAs
• Guest / host refer to platforms
[Figure: Source, emulated by Target]
Instruction Set Emulation
• Binaries in the source instruction set can be executed on a machine implementing the target instruction set
• Required for many VM implementations
• Example: IA-32 EL
Interpretation vs. Translation
• Interpretation
  • Simple, easy to implement
  • Low performance
• Binary translation
  • Complex implementation
  • Higher initial cost, better performance
• Techniques in between these extremes
  • Predecoding
  • Selective compilation
Interpreter State
• Must maintain state of machine implementing the source ISA
  • Registers
  • Memory
    • Code
    • Data
    • Stack
[Figure: interpreter state, showing the program counter, condition codes, Reg 0 through Reg n-1, code, data, stack, and the interpreter code]
Decode-And-Dispatch Interpreter
• Decode-and-dispatch loop
  • One instruction at a time
  • Decode the current instruction
  • Dispatch to corresponding interpreter routine

    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
            case LoadWordAndZero: LoadWordAndZero(inst); break;
            case ALU:             ALU(inst);             break;
            case Branch:          Branch(inst);          break;
            . . .
        }
    }
Decode-And-Dispatch Interpreter

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;   /* clear the upper 32 bits */
        PC = PC + 4;
    }
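The slides call `extract` without defining it. A minimal sketch follows, assuming the second argument is the position of the field's most significant bit (bit 0 being the least significant) and the third is the field width; that reading is an assumption, chosen because it matches `extract(inst,31,6)` returning the 6-bit primary opcode of a 32-bit PowerPC instruction.

```c
#include <stdint.h>

/* Hypothetical helper (not defined in the slides): return the `len`-bit
 * field of `inst` whose most significant bit is bit `start`, where bit 0
 * is the least significant bit. Assumes 1 <= len <= start + 1 <= 32. */
uint32_t extract(uint32_t inst, unsigned start, unsigned len)
{
    uint32_t shifted = inst >> (start + 1u - len);
    /* Avoid the undefined shift 1u << 32 when the whole word is requested. */
    uint32_t mask = (len >= 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    return shifted & mask;
}
```

For example, `lwz r1, 8(r2)` encodes as `0x80220008`; with this helper the fields come out as opcode 32, RT 1, RA 2, and displacement 8.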
Decode-And-Dispatch Interpreter

    ALU(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        extended_opcode = extract(inst, 10, 10);
        switch (extended_opcode) {
            case Add:         Add(inst);         break;
            case AddCarrying: AddCarrying(inst); break;
            case AddExtended: AddExtended(inst); break;
            . . .
        }
        PC = PC + 4;
    }
Decode-And-Dispatch Efficiency
• Decode-and-dispatch loop
  • Several branch instructions
  • Indirect branch on switch statement
• Interpreting an add instruction
  • Requires approximately 20 target instructions
  • Several expensive loads/stores to memory
• Hand-coded assembly can improve performance
  • Example: HotSpot JVM
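To make the loop concrete, here is a minimal sketch of a decode-and-dispatch interpreter for a toy source ISA. The three opcodes, the `ENC` encoding macro, the `Machine` struct, and `run_decode_dispatch` are all hypothetical names invented for illustration; only the field layout (opcode in bits 31..26, then two 5-bit register fields) follows the slides.

```c
#include <stdint.h>

enum { OP_HALT = 0, OP_ADDI = 1, OP_ADD = 2 };   /* hypothetical opcodes */

/* Hypothetical encoder: opcode | rt | ra | low 16 bits (immediate, or rb << 11). */
#define ENC(op, rt, ra, low16) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | \
     ((uint32_t)(ra) << 16) | ((uint32_t)(low16) & 0xFFFFu))

typedef struct {
    uint32_t pc;         /* byte address of the next source instruction */
    int32_t  regs[32];
    int      halt;
} Machine;

/* One source instruction per iteration: fetch, decode, dispatch via switch.
 * No bounds checking; the program is assumed to end with OP_HALT. */
void run_decode_dispatch(Machine *m, const uint32_t *code)
{
    while (!m->halt) {
        uint32_t inst = code[m->pc / 4];
        uint32_t rt = (inst >> 21) & 31u;
        uint32_t ra = (inst >> 16) & 31u;
        switch (inst >> 26) {
        case OP_ADDI:    /* rt = ra + sign-extended 16-bit immediate */
            m->regs[rt] = m->regs[ra] + (int16_t)(inst & 0xFFFFu);
            m->pc += 4;
            break;
        case OP_ADD:     /* rt = ra + rb, with rb in bits 15..11 */
            m->regs[rt] = m->regs[ra] + m->regs[(inst >> 11) & 31u];
            m->pc += 4;
            break;
        default:         /* OP_HALT and unknown opcodes stop the machine */
            m->halt = 1;
            break;
        }
    }
}
```

Every instruction passes through the loop test, the switch's indirect branch, and the `break` back to the loop head, which is the per-instruction branch overhead the slide refers to.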
Indirect Threaded Interpretation
• High number of branches in the decode-and-dispatch loop reduces performance
  • At least 5 branches per instruction
• Threaded interpretation
  • Append dispatch code to each interpretation routine
  • Removes 3 branches
  • Threads interpretation routines together
Indirect Threaded Interpretation

    LoadWordAndZero:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;
Indirect Threaded Interpretation

    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;
Indirect Threaded Interpretation
• Dispatch occurs indirectly through a table
  • Interpretation routines can be modified and relocated independently
• Advantages
  • Interpretation routines still portable
  • Improves efficiency over decode-and-dispatch
• Disadvantages
  • Increases interpreter code size
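The `goto *routine` in the slides' pseudocode is not standard C. One way to sketch indirect threading is with the GNU C "labels as values" extension (`&&label`), which GCC and Clang accept. The toy opcodes, `ENC`, and `run_indirect_threaded` below are hypothetical, and the sketch assumes the program contains only these three opcodes and ends with `OP_HALT`.

```c
#include <stdint.h>

enum { OP_HALT = 0, OP_ADDI = 1, OP_ADD = 2 };   /* hypothetical opcodes */

/* Hypothetical encoder: opcode | rt | ra | low 16 bits (immediate, or rb << 11). */
#define ENC(op, rt, ra, low16) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | \
     ((uint32_t)(ra) << 16) | ((uint32_t)(low16) & 0xFFFFu))

/* Interprets toy code until OP_HALT and returns the final value of register 3. */
int32_t run_indirect_threaded(const uint32_t *code)
{
    /* Dispatch table indexed by opcode (GNU C extension: address of a label). */
    static void *const dispatch[] = { &&do_halt, &&do_addi, &&do_add };
    int32_t regs[32] = {0};
    uint32_t pc = 0, inst;

/* Dispatch code appended to every routine: fetch, decode, jump through the table. */
#define DISPATCH() do { inst = code[pc / 4]; goto *dispatch[inst >> 26]; } while (0)

    DISPATCH();

do_addi:
    regs[(inst >> 21) & 31u] = regs[(inst >> 16) & 31u] + (int16_t)(inst & 0xFFFFu);
    pc += 4;
    DISPATCH();

do_add:
    regs[(inst >> 21) & 31u] = regs[(inst >> 16) & 31u] + regs[(inst >> 11) & 31u];
    pc += 4;
    DISPATCH();

do_halt:
    return regs[3];
#undef DISPATCH
}
```

There is no central loop: each routine ends with its own copy of the dispatch code, which is why code size grows. Because routines are reached only through the table, a routine can be replaced by changing one table entry.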
Indirect Threaded Interpretation
[Figure: two panels, "Decode-dispatch" and "Threaded". Each shows the source code, read through "data" accesses, and the interpreter routines; the decode-dispatch panel also has a dispatch loop.]
Predecoding
• Parse each instruction into a pre-defined data structure to facilitate interpretation
  • Separate opcodes, operands, etc.
  • Reduces shifts / masks for decoding
  • More useful when source ISA is CISC
• Example source instructions:

    lwz r1, 8(r2)
    add r3, r3, r1
    stw r3, 0(r4)
Predecoding

    struct instruction {
        unsigned long op;
        unsigned char dest, src1, src2;
    } code[CODE_SIZE];

    LoadWordandZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;    /* source PC: byte address in the original binary */
        TPC = TPC + 1;    /* index into the predecoded code array */
        if (halt || interrupt) goto exit;
        opcode = code[TPC].op;
        routine = dispatch[opcode];
        goto *routine;
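The predecoding pass itself is not shown in the slides. A minimal sketch follows, assuming 32-bit PowerPC-style words with the field positions used by the earlier `extract` calls. The `predecode` function and the extra `disp` field are assumptions added here: the slides keep the displacement in the `unsigned char src2`, which would truncate a full 16-bit displacement. For simplicity `op` holds only the primary opcode, not a combined opcode and extended opcode.

```c
#include <stddef.h>
#include <stdint.h>

/* Predecoded form, following the slides' struct. `disp` is an added field
 * (an assumption) so that 16-bit displacements are not truncated. */
struct instruction {
    unsigned long op;
    unsigned char dest, src1, src2;
    int16_t disp;
};

/* Hypothetical predecoder: split each source word into fields once,
 * so the interpreter routines need no shifts or masks at run time. */
void predecode(const uint32_t *src, size_t n, struct instruction *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t inst = src[i];
        out[i].op   = inst >> 26;              /* primary opcode, bits 31..26 */
        out[i].dest = (inst >> 21) & 31u;      /* RT, bits 25..21 */
        out[i].src1 = (inst >> 16) & 31u;      /* RA, bits 20..16 */
        out[i].src2 = (inst >> 11) & 31u;      /* RB, bits 15..11 */
        out[i].disp = (int16_t)(inst & 0xFFFFu);
    }
}
```

Applied to `lwz r1, 8(r2)` (`0x80220008`) this yields op 32, dest 1, src1 2, disp 8. The predecoded array is indexed by TPC while SPC still tracks the original byte address, which is why the slides advance both counters.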
Direct Threaded Interpretation
• Replace table lookup with direct access to address of interpreter routine
• Requires predecoding
• Reduces portability
Direct Threaded Interpretation

    LoadWordandZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        routine = code[TPC].op;    /* op field now holds the routine's address */
        goto *routine;
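A minimal sketch of direct threading, again using the GNU C labels-as-values extension. Label addresses are only visible inside the function that defines them, so this sketch predecodes inside the interpreter function; that arrangement, the toy opcodes, `ENC`, `struct predecoded`, and `run_direct_threaded` are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { OP_HALT = 0, OP_ADDI = 1, OP_ADD = 2, MAX_CODE = 64 };   /* hypothetical */

/* Hypothetical encoder: opcode | rt | ra | low 16 bits (immediate, or rb << 11). */
#define ENC(op, rt, ra, low16) \
    (((uint32_t)(op) << 26) | ((uint32_t)(rt) << 21) | \
     ((uint32_t)(ra) << 16) | ((uint32_t)(low16) & 0xFFFFu))

struct predecoded {
    void *routine;                  /* address of the interpreter routine */
    unsigned char dest, src1, src2;
    int16_t imm;
};

/* Predecodes n instructions (ending with OP_HALT), interprets them,
 * and returns the final value of register 3. */
int32_t run_direct_threaded(const uint32_t *src, size_t n)
{
    static void *const routines[] = { &&do_halt, &&do_addi, &&do_add };
    struct predecoded code[MAX_CODE];
    int32_t regs[32] = {0};
    size_t tpc = 0;

    if (n == 0 || n > MAX_CODE)
        return 0;

    /* Predecode: the opcode-to-routine table is consulted only here. */
    for (size_t i = 0; i < n; i++) {
        uint32_t inst = src[i];
        code[i].routine = routines[inst >> 26];
        code[i].dest = (inst >> 21) & 31u;
        code[i].src1 = (inst >> 16) & 31u;
        code[i].src2 = (inst >> 11) & 31u;
        code[i].imm  = (int16_t)(inst & 0xFFFFu);
    }

/* Dispatch is a single indirect jump, with no table lookup. */
#define NEXT() goto *code[tpc].routine

    NEXT();

do_addi:
    regs[code[tpc].dest] = regs[code[tpc].src1] + code[tpc].imm;
    tpc++;
    NEXT();

do_add:
    regs[code[tpc].dest] = regs[code[tpc].src1] + regs[code[tpc].src2];
    tpc++;
    NEXT();

do_halt:
    return regs[3];
#undef NEXT
}
```

The predecoded code now embeds addresses that are valid only for this build of the interpreter, which illustrates the portability cost noted in the slides.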
Direct Threaded Interpretation
[Figure: source code → pre-decoder → intermediate code → interpreter routines]
Binary Translation
• Convert source binary to target binary before execution
• Logical conclusion of predecoding
  • Removes parsing and jumps altogether
  • Allows optimizations on native code
• Achieves better performance than interpretation
• Generated code no longer portable
Binary Translation
[Figure: source code → binary translator → binary translated target code]
Binary Translation
• x86 source binary:

    addl %edx,4(%eax)
    movl 4(%eax),%edx
    add  %eax,4

• Translate to PowerPC target
  • r1 points to x86 register context block
  • r2 points to x86 memory image
  • r3 contains x86 ISA PC value
Binary Translation

    lwz  r4,0(r1)    ;load %eax from register block
    addi r5,r4,4     ;add 4 to %eax
    lwzx r5,r2,r5    ;load operand from memory
    lwz  r4,12(r1)   ;load %edx from register block
    add  r5,r4,r5    ;perform add
    stw  r5,12(r1)   ;put result into %edx
    addi r3,r3,3     ;update PC (3 bytes)

    lwz  r4,0(r1)    ;load %eax from register block
    addi r5,r4,4     ;add 4 to %eax
    lwz  r4,12(r1)   ;load %edx from register block
    stwx r4,r2,r5    ;store %edx value into memory
    addi r3,r3,3     ;update PC (3 bytes)

    lwz  r4,0(r1)    ;load %eax from register block
    addi r4,r4,4     ;add immediate
    stw  r4,0(r1)    ;place result back into %eax
    addi r3,r3,3     ;update PC (3 bytes)
Register Mapping
• Map source registers to target registers
  • Reduces memory loads / stores
• If the target has fewer registers than the source
  • Map some to memory
  • Map on per-block basis
Register Mapping
• r1 points to x86 register context block
• r2 points to x86 memory image
• r3 contains x86 ISA PC value
• r4 holds x86 register %eax
• r7 holds x86 register %edx
• etc.

    addi r16,r4,4    ;add 4 to %eax
    lwzx r17,r2,r16  ;load operand from memory
    add  r7,r17,r7   ;perform add of %edx
    addi r16,r4,4    ;add 4 to %eax
    stwx r7,r2,r16   ;store %edx value into memory
    addi r4,r4,4     ;increment %eax
    addi r3,r3,9     ;update PC (9 bytes)
Code Discovery Problem
• May be difficult to statically predecode or translate the entire source program
• Code Discovery Problem: how to find the beginning of all source instructions?
• Consider the x86 code:

    31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

  • Decoding from 8b b5: movl %esi, 0x08030000(%ebp) ??
  • Decoding from b5 00: mov %ch,0 ??
Code Discovery Problem
• Contributors to the code discovery problem
  • Variable length CISC instructions
  • Indirect jumps
  • Data interspersed with code
  • Padding instructions to align branch targets
[Figure: source ISA instruction stream containing inst. 1, inst. 2, inst. 3, a jump, reg. data (data in instruction stream), inst. 5, inst. 6, an uncond. brnch followed by pad (pad for instruction alignment), inst. 8, and a jump indirect to ???]
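The effect of variable-length instructions can be shown with a toy example. The sketch below assumes a hypothetical ISA (not x86) in which the opcode byte alone determines instruction length, and performs a linear sweep from a chosen starting offset; `inst_length` and `sweep` are names invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variable-length ISA: the opcode byte gives the length. */
size_t inst_length(uint8_t opcode)
{
    switch (opcode) {
    case 0x02: return 2;   /* opcode + 8-bit immediate */
    case 0x05: return 5;   /* opcode + 32-bit immediate */
    default:   return 1;   /* every other byte is a one-byte instruction */
    }
}

/* Linear sweep: record the offset of each instruction found when decoding
 * begins at `start`. Returns how many offsets were written to `starts`. */
size_t sweep(const uint8_t *bytes, size_t n, size_t start,
             size_t *starts, size_t max)
{
    size_t count = 0;
    for (size_t off = start; off < n && count < max;
         off += inst_length(bytes[off]))
        starts[count++] = off;
    return count;
}
```

For the bytes `05 02 01 02 05 01 01`, a sweep from offset 0 finds instructions at offsets 0, 5, 6, while a sweep from offset 1 finds them at 1, 3, 5, 6. Both decodings look valid, so without knowing the true entry points (for example, the targets of indirect jumps) a static translator cannot tell which boundaries are real.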