Reducing Code Size with Run-time Decompression

Reducing Code Size with Run-time Decompression Charles Lefurgy, Eva Piccininni, and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The University of Michigan, Ann Arbor


  1. Reducing Code Size with Run-time Decompression Charles Lefurgy, Eva Piccininni, and Trevor Mudge Advanced Computer Architecture Laboratory Electrical Engineering and Computer Science Dept. The University of Michigan, Ann Arbor High-Performance Computer Architecture (HPCA-6) January 10-12, 2000

  2. Motivation
  • Problem: embedded code size
    – Constraints: cost, area, and power
    – Must fit the program in on-chip memory
    – Compilers vs. hand-coded assembly
      • Portability
      • Development costs
      • Code bloat
  • Solution: code compression
    – Reduce compiled code size
    – Take advantage of instruction repetition
  • Implementation questions
    – Hardware or software?
    – Code size?
    – Execution speed?
  [Diagram: embedded system (CPU, RAM, ROM, I/O) holding the original program vs. the compressed program]

  3. Software decompression
  • Previous work
    – Decompression unit: whole program [Taunton91]
      • No memory savings
    – Decompression unit: procedures [Kirovski97] [Ernst97]
      • Requires a large decompression memory
      • Fragmentation of the decompression memory
      • Slow
  • Our work
    – Decompression unit: 1 or 2 cache lines
    – High-performance focus
    – New profiling method

  4. Dictionary compression algorithm
  • Goal: fast decompression
  • Dictionary contains each unique instruction once
  • Replace program instructions with short indices
  [Diagram: 32-bit instructions in the original .text segment (e.g. lw r2,r3; lw r15,r3) are replaced by 16-bit indices (5, 30) into a .dictionary segment that holds the unique 32-bit instructions]

  5. Decompression
  • Algorithm
    1. I-cache miss invokes the decompressor (exception handler)
    2. Fetch index
    3. Fetch dictionary word
    4. Place instruction in the I-cache (special instruction)
  • Write directly into the I-cache
  • Decompressed instructions exist only in the I-cache
  [Diagram: on a miss, the processor reads an index (5) from the indices in memory, looks up the dictionary word (add r1,r2,r3), and stores it into the I-cache]

  6. CodePack
  • Overview
    – IBM
    – PowerPC
    – First system with instruction-stream compression
    – Decompresses during I-cache miss
  • Software CodePack vs. our Dictionary scheme:

                                  Dictionary        CodePack
    Codewords (indices)           fixed-length      variable-length
    Decompression granularity     1 cache line      2 cache lines
    Decompression overhead        75 instructions   1120 instructions

  7. Compression ratio
  • compression ratio = compressed size / original size
    – CodePack: 55% - 63%
    – Dictionary: 65% - 82%
  [Chart: compression ratio per benchmark for Dictionary and CodePack: go, ijpeg, cc1, perl, vortex, mpeg2enc, pegwit, ghostscript]

  8. Simulation environment • SimpleScalar • Pipeline: 5 stage, in-order • I-cache: 16KB, 32B lines, 2-way • D-cache: 8KB, 16B lines, 2-way • Memory: 10 cycle latency, 2 cycle rate 8

  9. Performance
  • CodePack: very high overhead
  • Reduce overhead by reducing cache misses
  [Chart: go slowdown relative to native code for CodePack, Dictionary, and native at 4KB, 16KB, and 64KB I-cache sizes]

  10. Cache miss
  • Control slowdown by optimizing the I-cache miss ratio
  [Chart: slowdown relative to native code vs. I-cache miss ratio (0%-8%) for CodePack and Dictionary at 4KB, 16KB, and 64KB I-caches]

  11. Selective compression
  • Hybrid programs
    – Compress only some procedures
    – Trade size for speed
    – Avoid decompression overhead
  • Profile methods
    – Count dynamic instructions
      • Example: Thumb
      • Use when compressed code executes more instructions
      • Goal: reduce the number of executed instructions
    – Count cache misses
      • Example: CodePack
      • Use when compressed code has a longer cache-miss latency
      • Goal: reduce cache-miss latency

  12. Cache miss profiling
  • Cache-miss profiling reduces overhead by 50%
  • Loop-oriented benchmarks benefit most
    – Approach the performance of native code
  [Chart: Pegwit (encryption) slowdown relative to native code vs. compression ratio (60%-100%), CodePack with dynamic-instruction profile vs. cache-miss profile]

  13. CodePack vs. Dictionary
  • More compression may yield better performance
    – CodePack produces smaller code than Dictionary compression
    – Even with some procedures left native, CodePack is smaller
    – At a given size, CodePack is faster because it can afford more native code
  [Chart: Ghostscript slowdown relative to native code vs. compression ratio (60%-100%), CodePack vs. Dictionary, both with cache-miss profiles]

  14. Conclusions
  • High-performance software decompression is possible
    – Dictionary is faster than CodePack, but with a 5-25% compression-ratio difference
    – Hardware support needed:
      • I-cache miss exception
      • An instruction to store into the I-cache
  • Tune performance by reducing cache misses
    – Cache size
    – Code placement
  • Selective compression
    – Use a cache-miss profile for loop-oriented benchmarks
  • Code placement affects decompression overhead
    – Future work: unify code placement and compression

  15. Web page
  http://www.eecs.umich.edu/compress
