Configurable and Efficient Memory Access Tracing via Selective Expression-based x86 Binary Instrumentation
Simone Economo, Davide Cingolani, Alessandro Pellegrini and Francesco Quaglia
DIAG - Sapienza University of Rome
{economo,cingolani,pellegrini,quaglia}@diag.uniroma1.it
Memory access tracing
• Interception of memory accesses issued by a program
• Off-line and on-line applications
  – Performance evaluation of architectures
    • e.g., Trace-driven simulation
  – Detection of security vulnerabilities
    • e.g., Buffer overflows
  – Detection of memory inefficiencies
    • e.g., Memory leaks
  – Runtime optimization of programs
    • e.g., CC-NUMA systems
Tracing challenges
• Memory access tracing is interesting because
  – Intercepting all accesses may lead to excessive runtime overhead
    • e.g., profilers and debuggers
  – Intercepting only some accesses may lead to inaccurate tracing results
    • e.g., trace-driven simulation, run-time optimization
  – Users could want a trade-off between accuracy and overhead
    • e.g., "I'm willing to sacrifice some accuracy for less overhead"
  – Users could be interested in tracing accesses to bigger chunks
    • e.g., OS pages, cache lines, malloc chunks, etc.
Tracing techniques
• Hardware-based
  – Performance Monitoring Units (PMUs)
    • Tracing performed implicitly by the hardware running the program
• Software-based
  – Kernel-level
    • Usually limited to OS-page granularity (e.g., 4KB or 2MB)
  – Library-level
    • Usually limited to very specific application domains (e.g., MPI applications)
  – Binary Code Instrumentation
    • Performed explicitly and transparently by injecting additional code in the program
    • Our approach!
Our goals
– 1. Instrument a subset of the accesses using a smart selection algorithm
  – rather than the entire stream
  – Efficient: directly affects the tracing overhead
– 2. Make this subset representative
  – Accurate: directly affects the tracing accuracy
– 3. Add flexibility to tracing
  – in terms of subset size and tracing granularity
  – Configurable: should affect both overhead and accuracy
Instrumentation issues
• Memory addresses are encoded as expressions
  – linear combinations of registers and constants
  – evaluated to actual addresses at run-time
  – e.g., x86 SIB expressions (Scale-Index-Base)
    • evaluated to Base + Index * Scale + Displacement
• Memory address expressions are subject to some issues
  – Address multiplexing
    • A single expression can encode different addresses over time
  – Address aliasing
    • Different expressions can encode the same address at the same time
  – False chunk sharing
    • Constants in expressions don't carry memory-alignment information
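As a rough illustration of the SIB formula and of the multiplexing and aliasing issues, the sketch below evaluates Base + Index * Scale + Displacement for a few expressions. The function name `sib_address` and all register contents are made up for this example; they are not taken from the slides.

```python
def sib_address(base: int, index: int, scale: int, disp: int) -> int:
    """Evaluate an x86 SIB expression: Base + Index * Scale + Displacement."""
    if scale not in (1, 2, 4, 8):  # the only scale factors x86 can encode
        raise ValueError("scale must be 1, 2, 4 or 8")
    return base + index * scale + disp


# Address multiplexing: 0x601120(,%rax,4) names different addresses
# as rax changes over time (the rax values 0 and 3 are hypothetical).
first = sib_address(base=0, index=0, scale=4, disp=0x601120)
second = sib_address(base=0, index=3, scale=4, disp=0x601120)

# Address aliasing: (%rax) and -0x4(%rbp) name the same location
# whenever rax happens to hold rbp - 4 (hypothetical register contents).
rbp = 0x7FFC0000
rax = rbp - 4
alias_a = sib_address(base=rax, index=0, scale=1, disp=0)
alias_b = sib_address(base=rbp, index=0, scale=1, disp=-0x4)
```

Neither effect is visible from the instruction text alone, which is why a static selection step can only reason about the expressions and not about the addresses they produce.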
Instrumentation issues on x86/GCC/Linux
• x86 SIB addressing (Scale-Index-Base) is complex
  – The same structure is used for addressing different types of memory
    • e.g., The base address of a static object can be specified through an immediate
      mov 0x601120(,%rax,4),%edi
    • e.g., The base address of a dynamic object must be specified through a register
      mov -0x4(%rbp),%edx
  – An address can be computed in more convoluted ways
    • e.g., A register in a SIB expression can be the result of another SIB expression
      lea 0x0(,%rax,4),%rdx
      add %rdx,%rax
      mov (%rax),%esi
Our contributions
• An abstract addressing model
  – Formalizes the structure and complexity of SIB expressions
• A selection algorithm
  – Deals with the intrinsic issues of tracing via instrumentation
  – Satisfies the efficiency, accuracy and flexibility goals
Base-Index-Displacement (BID) model
• A BID address field is a placeholder for a value
  – either a register identifier or an immediate
• A BID address expression is a tuple of fields <b,i,d>
  – evaluates to the address b + i + d
• A BID template is a family of expressions
  – sharing the same type (register or immediate) for each field
➡ x86 SIB expressions fall into two BID templates:
  1. RRI, when the base address is a register (e.g., dynamic memory or convoluted accesses to all kinds of memory)
  2. IRR, when the base address is an immediate (e.g., static memory)
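One possible way to model BID expressions in code is sketched below. It is only an illustration: the class name `BID`, the use of `None` for an unused register slot, and the omission of the SIB scale factor are assumptions of this sketch, not details given in the slides.

```python
from dataclasses import dataclass
from typing import Dict, Union

# A field is a register identifier (str), an immediate (int),
# or None for an unused register slot (an assumption of this sketch).
Field = Union[str, int, None]


@dataclass(frozen=True)
class BID:
    b: Field
    i: Field
    d: Field

    def template(self) -> str:
        """One letter per field: 'I' for an immediate, 'R' for a register slot."""
        return "".join("I" if isinstance(f, int) else "R"
                       for f in (self.b, self.i, self.d))

    def evaluate(self, regs: Dict[str, int]) -> int:
        """Evaluate b + i + d for concrete register contents (scale omitted)."""
        def value(f: Field) -> int:
            if f is None:
                return 0
            return f if isinstance(f, int) else regs[f]
        return value(self.b) + value(self.i) + value(self.d)


stack_slot = BID(b="rbp", i=None, d=-0x4)        # mov -0x4(%rbp),%edx
static_elem = BID(b=0x601120, i="rax", d=None)   # mov 0x601120(,%rax,4),%edi
```

Under these assumptions `stack_slot.template()` gives `"RRI"` and `static_elem.template()` gives `"IRR"`, matching the two templates listed on the slide.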
Selection algorithm
• It relies on two user-defined parameters:
  1. Instrumentation factor = ω
     • Determines the percentage of traced accesses at runtime
     • Affects overhead and accuracy
  2. Chunk size = C
     • Determines the granularity of tracing
     • Partially affects accuracy
• It elides the address multiplexing problem
  – Register values coming from multiple control-flow paths are ignored
    • The internal state is discarded at basic-block boundaries
  – Updates to the contents of registers are tracked
    • Including possible updates coming from conditional data-flow instructions
Expression equality
• Two BID expressions are equal if and only if
  – they share the same fields
  – they share the same values for each field
• Pointer aliasing can still occur
  – because the contents of registers are unpredictable
  – ...but there are no false positives
Expression representatives
• Equal expressions form a cluster led by a representative
  – so that further analysis doesn't have to consider the whole cluster
  – its access count is the size of the cluster that it represents
➡ Tracing a representative means tracing the cluster
  – a single instrumentation coin buys tracing of the whole cluster
  – reduces the overhead without affecting the accuracy
Expression distance
• The distance between two representatives is
  – evaluated on a field-by-field basis
    • by comparing register identifiers for equality (e.g., rax ≠ rbx)
    • by comparing immediates through their absolute difference (e.g., |0x10 - 0x18|)
  – zero if they are likely to fall into the same C-byte chunk
  – greater if they are likely to produce more distant addresses
• False chunk sharing is still possible
  – because only runtime addresses have memory-alignment information
  – ...but the probability of false positives decreases as C increases
  – ...and also as the gaps between immediates decrease
Distance function for RRI expressions
[Figure: decision tree over the tests b1 = b2, i1 = i2 and |d1 - d2| ≥ C, with True/False branches ending in the distance values 0, 1, 3, 4 and 5]
Distance function for IRR expressions
[Figure: decision tree over the tests |b1 - b2| ≥ C, i1 = i2 and d1 = d2, with True/False branches ending in the distance values 0, 1, 3, 4 and 5]
Example of false chunk sharing
[Figure: two expressions e1 and e2 whose absolute difference is less than C]
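A small numeric sketch can make false chunk sharing concrete: two addresses whose absolute difference is below C may still land in different C-byte chunks, depending on alignment. The addresses and the helper names `chunk_index` and `straddle_fraction` are hypothetical, and the last function assumes every alignment within a chunk is equally likely.

```python
def chunk_index(address: int, chunk_size: int) -> int:
    """Index of the chunk_size-byte chunk that contains the address."""
    return address // chunk_size


def straddle_fraction(gap: int, chunk_size: int) -> float:
    """Fraction of alignments for which two addresses `gap` bytes apart
    fall into different chunks (assumes uniformly distributed alignment)."""
    hits = sum(
        1
        for offset in range(chunk_size)
        if chunk_index(offset, chunk_size) != chunk_index(offset + gap, chunk_size)
    )
    return hits / chunk_size


C = 16
e1, e2 = 0x60111C, 0x601124      # 8 bytes apart, yet in different 16-byte chunks
same_a, same_b = 0x601120, 0x601128  # 8 bytes apart, same 16-byte chunk

split = chunk_index(e1, C) != chunk_index(e2, C)
shared = chunk_index(same_a, C) == chunk_index(same_b, C)
```

With an 8-byte gap, `straddle_fraction(8, 16)` is 0.5 while `straddle_fraction(8, 64)` is 0.125, which is consistent with the slide's remark that false positives become less likely as C grows or as the gap shrinks.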
Expression scores
• The score of a representative is a tuple composed of
  1. Access count = how many other accesses are traced for free
  2. Average distance ≃ how well the access "samples" the address space
➡ The higher the score, the more valuable the access
  – tells where an instrumentation coin is best spent
  – improves the accuracy without affecting the overhead
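The score tuple could be computed along the following lines. This is a sketch under assumptions the slides do not state: the average is taken over all other representatives, and tuples are compared lexicographically (access count first). The function name `compute_scores` and the toy distance values are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def compute_scores(
    reps: List[str],
    counts: Dict[str, int],
    distance: Callable[[str, str], int],
) -> Dict[str, Tuple[int, float]]:
    """Return <access count, average distance> for every representative."""
    result = {}
    for r in reps:
        others = [o for o in reps if o != r]
        # Assumption: average over all other representatives.
        avg = sum(distance(r, o) for o in others) / len(others) if others else 0.0
        result[r] = (counts[r], avg)
    return result


# Toy input: x and y are at distance 0, z is far from both.
toy_distance = {frozenset("xy"): 0, frozenset("xz"): 4, frozenset("yz"): 4}
toy_scores = compute_scores(
    ["x", "y", "z"],
    {"x": 3, "y": 1, "z": 1},
    lambda a, b: toy_distance[frozenset((a, b))],
)
best = max(toy_scores, key=toy_scores.get)  # lexicographic tuple comparison
```

Here `x` wins on access count alone; between `y` and `z`, which tie on access count, `z` ranks higher because its average distance is larger.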
Selecting expressions
• Reduced to a (0,1)-knapsack problem, solved iteratively
  – Items are representatives
  – Values are scores
  – Weights are all equal
  – The knapsack size is ω % of all representatives
  – Iteration i sees the residual space left by iteration i - 1
➡ Maximize sum of values, for all representatives, such that
  – items in the knapsack don't exceed the residual space
The iterative (0,1)-knapsack
• Base step
  – Choose representatives and compute scores
• Iterative step
  1. Solve a residual (0,1)-knapsack instance
     1. Select the next most-valuable representative (ignoring frozen ones)
     2. Place it in the knapsack
     3. Freeze all zero-distance representatives
  2. If there is residual space in the knapsack
     1. Unfreeze all representatives
     2. Start a new iterative step
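The steps above can be read as a greedy loop, sketched below. This is one interpretation rather than the authors' implementation: the function name `select_representatives`, rounding the knapsack size down, and comparing scores as plain tuples are all assumptions, and the distance function is passed in as a parameter.

```python
import math
from typing import Callable, Dict, List, Tuple


def select_representatives(
    reps: List[str],
    scores: Dict[str, Tuple[int, float]],
    distance: Callable[[str, str], int],
    omega: float,
) -> List[str]:
    """Pick about omega * len(reps) representatives, best scores first."""
    capacity = math.floor(omega * len(reps))  # assumption: round down
    knapsack: List[str] = []
    while len(knapsack) < capacity:
        # New iterative step: everything not yet selected is unfrozen again.
        candidates = [r for r in reps if r not in knapsack]
        if not candidates:
            break
        frozen = set()
        for r in sorted(candidates, key=lambda x: scores[x], reverse=True):
            if len(knapsack) == capacity:
                break
            if r in frozen:
                continue
            knapsack.append(r)  # place it in the knapsack
            # Freeze every other candidate at zero distance from r.
            frozen.update(o for o in candidates
                          if o != r and o not in knapsack and distance(r, o) == 0)
    return knapsack


# Toy input: a and b are at zero distance from each other.
toy = {"a": (5, 1.0), "b": (4, 1.0), "c": (3, 1.0), "d": (1, 1.0)}
near = lambda x, y: 0 if {x, y} == {"a", "b"} else 1
half = select_representatives(list(toy), toy, near, 0.5)
full = select_representatives(list(toy), toy, near, 1.0)
```

In the toy run with ω = 50%, `b` is skipped in favour of `c` even though it scores higher, because it is frozen once `a` is selected; with ω = 100% a second iteration unfreezes `b` and picks it last.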
Example
ω = 50%, C = 16B, n = 18, m = ?
 1. RRI mov -0x4(%rbp),%edx
 2. RRI mov -0x8(%rbp),%eax
 3. RRI mov -0x18(%rbp),%rax
 4. RRI mov -0x4(%rbp),%edx
 5. RRI mov -0x8(%rbp),%eax
 6. RRI mov -0x18(%rbp),%rax
 7. RRI mov (%rax),%esi
 8. RRI mov -0x4(%rbp),%edx
 9. RRI mov -0xc(%rbp),%eax
10. IRR mov 0x601120(,%rax,4),%edi
11. RRI mov -0xc(%rbp),%edx
12. RRI mov -0x8(%rbp),%eax
13. IRR mov 0x601060(,%rax,4),%eax
14. RRI mov -0x4(%rbp),%edx
15. RRI mov -0x8(%rbp),%eax
16. RRI mov -0x18(%rbp),%rax
17. RRI mov -0x4(%rbp),%edx
18. RRI mov (%rax),%esi
Example (the number after the template is the cluster size, shown at the first occurrence of each expression)
ω = 50%, C = 16B, n = 18, m = ?
 1. RRI 5 mov -0x4(%rbp),%edx
 2. RRI 4 mov -0x8(%rbp),%eax
 3. RRI 3 mov -0x18(%rbp),%rax
 4. RRI   mov -0x4(%rbp),%edx
 5. RRI   mov -0x8(%rbp),%eax
 6. RRI   mov -0x18(%rbp),%rax
 7. RRI 1 mov (%rax),%esi
 8. RRI   mov -0x4(%rbp),%edx
 9. RRI 2 mov -0xc(%rbp),%eax
10. IRR 1 mov 0x601120(,%rax,4),%edi
11. RRI   mov -0xc(%rbp),%edx
12. RRI   mov -0x8(%rbp),%eax
13. IRR 1 mov 0x601060(,%rax,4),%eax
14. RRI   mov -0x4(%rbp),%edx
15. RRI   mov -0x8(%rbp),%eax
16. RRI   mov -0x18(%rbp),%rax
17. RRI   mov -0x4(%rbp),%edx
18. RRI 1 mov (%rax),%esi
Example
ω = 50%, C = 16B, n = 18, m = 8
1. RRI mov -0x4(%rbp),%edx          score = <5, ?>
2. RRI mov -0x8(%rbp),%eax          score = <4, ?>
3. RRI mov -0x18(%rbp),%rax         score = <3, ?>
4. RRI mov (%rax),%esi              score = <1, ?>
5. RRI mov -0xc(%rbp),%eax          score = <2, ?>
6. IRR mov 0x601120(,%rax,4),%edi   score = <1, ?>
7. IRR mov 0x601060(,%rax,4),%eax   score = <1, ?>
8. RRI mov (%rax),%esi              score = <1, ?>
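The access counts in this example can be reproduced with a small clustering sketch. It assumes that "tracking register updates" can be modelled by bumping a per-register version number whenever a register is overwritten, so that two textually identical expressions are equal only if their registers have not changed in between. The encoding of the 18 loads and the function name `cluster` are made up for this illustration.

```python
from collections import Counter, defaultdict

# (base register, index register, scale, displacement, destination register)
# for the 18 loads of the example. Immediate bases such as 0x601120 are stored
# in the displacement slot; 32-bit destinations are listed by their 64-bit name.
TRACE = [
    ("rbp", None, 1, -0x4, "rdx"), ("rbp", None, 1, -0x8, "rax"),
    ("rbp", None, 1, -0x18, "rax"), ("rbp", None, 1, -0x4, "rdx"),
    ("rbp", None, 1, -0x8, "rax"), ("rbp", None, 1, -0x18, "rax"),
    ("rax", None, 1, 0x0, "rsi"), ("rbp", None, 1, -0x4, "rdx"),
    ("rbp", None, 1, -0xC, "rax"), (None, "rax", 4, 0x601120, "rdi"),
    ("rbp", None, 1, -0xC, "rdx"), ("rbp", None, 1, -0x8, "rax"),
    (None, "rax", 4, 0x601060, "rax"), ("rbp", None, 1, -0x4, "rdx"),
    ("rbp", None, 1, -0x8, "rax"), ("rbp", None, 1, -0x18, "rax"),
    ("rbp", None, 1, -0x4, "rdx"), ("rax", None, 1, 0x0, "rsi"),
]


def cluster(trace):
    """Group equal expressions; equality includes the register versions."""
    version = defaultdict(int)  # bumped each time a register is overwritten
    counts = Counter()
    for base, index, scale, disp, dest in trace:
        key = (base, version[base], index, version[index], scale, disp)
        counts[key] += 1
        version[dest] += 1      # the load overwrites its destination register
    return counts


clusters = cluster(TRACE)
sizes = list(clusters.values())  # in order of first appearance
```

Under this model the two `mov (%rax),%esi` loads end up in separate clusters because `rax` is overwritten between them, which gives the m = 8 representatives and the access counts 5, 4, 3, 1, 2, 1, 1, 1 listed above.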