Reducing Code Size with Run-time Decompression
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
High-Performance Computer Architecture (HPCA-6), January 10-12, 2000
Motivation
• Problem: embedded code size
  – Constraints: cost, area, and power
  – Fit program in on-chip memory
  – Compilers vs. hand-coded assembly
    • Portability
    • Development costs
  – Code bloat
• Solution: code compression
  – Reduce compiled code size
  – Take advantage of instruction repetition
• Implementation
  – Hardware or software?
  – Code size?
  – Execution speed?
[Figure: embedded systems (CPU, RAM, ROM, I/O) holding the original program and the compressed program]
Software decompression
• Previous work
  – Decompression unit: whole program [Tauton91]
    • No memory savings
  – Decompression unit: procedures [Kirovski97][Ernst97]
    • Requires large decompression memory
    • Fragmentation of decompression memory
    • Slow
• Our work
  – Decompression unit: 1 or 2 cache-lines
  – High performance focus
  – New profiling method
Dictionary compression algorithm
• Goal: fast decompression
• Dictionary contains unique instructions
• Replace program instructions with short index
[Figure: original program vs. compressed program. Original .text segment (32-bit instructions): lw r2,r3; lw r2,r3; lw r15,r3; lw r15,r3; lw r15,r3. Compressed .text segment (contains 16-bit indices): 5; 5; 30; 30; 30. .dictionary segment (32-bit entries): lw r2,r3; lw r15,r3]
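The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' compressor: the function name `compress`, the placeholder instruction words, and the choice to number dictionary entries in order of first appearance are all assumptions (the slide's figure uses indices 5 and 30 instead).

```python
from typing import Dict, List, Tuple

INDEX_BITS = 16  # index width shown on the slide


def compress(instructions: List[int]) -> Tuple[List[int], List[int]]:
    """Return (dictionary, indices) for a list of 32-bit instruction words.

    Hypothetical helper: entries are numbered in order of first appearance.
    """
    dictionary: List[int] = []      # unique 32-bit instruction words
    position: Dict[int, int] = {}   # instruction word -> dictionary index
    indices: List[int] = []         # one short index per original instruction
    for word in instructions:
        if word not in position:
            if len(dictionary) >= 1 << INDEX_BITS:
                raise ValueError("too many unique instructions for a 16-bit index")
            position[word] = len(dictionary)
            dictionary.append(word)
        indices.append(position[word])
    return dictionary, indices


# Arbitrary placeholder words standing in for `lw r2,r3` and `lw r15,r3`;
# they are not real instruction encodings.
LW_R2 = 0xAAAA0002
LW_R15 = 0xAAAA000F

program = [LW_R2, LW_R2, LW_R15, LW_R15, LW_R15]
dictionary, indices = compress(program)
```

Looking up each index in the dictionary reproduces the original instruction sequence, which is what makes decompression a simple table lookup.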
Decompression
• Algorithm
  1. I-cache miss invokes decompressor (exception handler)
  2. Fetch index
  3. Fetch dictionary word
  4. Place instruction in I-cache (special instruction)
• Write directly into I-cache
• Decompressed instructions only exist in I-cache
[Figure: processor (Proc.) with I-cache and D-cache; memory holding the dictionary (entry Add r1,r2,r3) and the indices (entry 5)]
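The four steps above can be modeled in software as follows. This is a sketch under stated assumptions, not the actual exception handler: the name `handle_icache_miss` is hypothetical, the I-cache is modeled as a plain Python dict from address to instruction word (standing in for the special store-into-I-cache instruction), the native text is assumed to start at address 0, and the 32-byte line size is taken from the simulation environment slide.

```python
from typing import Dict, List

LINE_BYTES = 32                          # I-cache line size (simulation slide)
WORD_BYTES = 4                           # one 32-bit instruction
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def handle_icache_miss(miss_addr: int,
                       indices: List[int],
                       dictionary: List[int],
                       icache: Dict[int, int]) -> None:
    """Step 1: called on an I-cache miss; fills the whole missing line."""
    line_addr = miss_addr - (miss_addr % LINE_BYTES)   # align to line start
    first = line_addr // WORD_BYTES                    # first instruction number
    for slot in range(WORDS_PER_LINE):
        n = first + slot
        if n >= len(indices):                          # past the end of the program
            break
        index = indices[n]                             # step 2: fetch index
        word = dictionary[index]                       # step 3: fetch dictionary word
        icache[line_addr + slot * WORD_BYTES] = word   # step 4: place in I-cache


# Hypothetical 10-instruction program with three unique instruction words.
dictionary = [0x11, 0x22, 0x33]
indices = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
icache: Dict[int, int] = {}
handle_icache_miss(36, indices, dictionary, icache)    # miss in the second line
```

Only the line containing the missing address is filled, so decompressed instructions never occupy main memory in this model.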
CodePack
• Overview
  – IBM
  – PowerPC
  – First system with instruction stream compression
  – Decompress during I-cache miss
• Software CodePack

                          Dictionary        CodePack
Codewords (indices)       Fixed-length      Variable-length
Decompress granularity    1 cache line      2 cache lines
Decompression overhead    75 instructions   1120 instructions
Compression ratio
• compression ratio = compressed size / original size
  – CodePack: 55% - 63%
  – Dictionary: 65% - 82%
[Figure: bar chart of compression ratio (0% to 100%) for Dictionary and CodePack on go, ijpeg, cc1, perl, vortex, mpeg2enc, pegwit, ghostscript]
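As a worked example of the definition above, the following sketch computes the ratio for the dictionary scheme. It assumes the compressed size is the 16-bit indices plus the 32-bit dictionary entries and ignores the decompressor code itself; the function name and that accounting are assumptions, not the authors' measurement method.

```python
def dictionary_compression_ratio(total_instructions: int,
                                 unique_instructions: int,
                                 index_bits: int = 16,
                                 word_bits: int = 32) -> float:
    """compression ratio = compressed size / original size (smaller is better)."""
    original = total_instructions * word_bits
    # Assumed accounting: one index per instruction plus one dictionary
    # entry per unique instruction.
    compressed = total_instructions * index_bits + unique_instructions * word_bits
    return compressed / original


# The five-instruction, two-entry example from the dictionary slide:
# (5 * 16 + 2 * 32) / (5 * 32) = 144 / 160
ratio = dictionary_compression_ratio(total_instructions=5, unique_instructions=2)
```

Under this accounting the ratio can never drop below 50% with 16-bit indices, and it approaches that floor only when few instructions are unique.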
Simulation environment
• SimpleScalar
• Pipeline: 5 stage, in-order
• I-cache: 16KB, 32B lines, 2-way
• D-cache: 8KB, 16B lines, 2-way
• Memory: 10 cycle latency, 2 cycle rate
Performance
• CodePack: very high overhead
• Reduce overhead by reducing cache misses
[Figure: Go benchmark. Slowdown relative to native code (0 to 22) vs. I-cache size (4KB, 16KB, 64KB) for CodePack, Dictionary, and Native]
Cache miss
• Control slowdown by optimizing I-cache miss ratio
[Figure: slowdown relative to native code (0 to 40) vs. I-cache miss ratio (0% to 8%) for CodePack and Dictionary with 4KB, 16KB, and 64KB I-caches]
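Why the miss ratio controls slowdown can be seen in a rough first-order model. Everything in this sketch is an assumption made for illustration: one cycle per instruction, a flat miss penalty equal to the 10-cycle memory latency from the simulation slide, one handler invocation per miss, and one cycle per handler instruction. The handler lengths (75 and 1120 instructions) come from the CodePack slide. It is not the simulator used for the measurements and is not expected to reproduce the plotted numbers.

```python
DICTIONARY_HANDLER = 75    # decompression overhead in instructions (CodePack slide)
CODEPACK_HANDLER = 1120    # decompression overhead in instructions (CodePack slide)


def slowdown(miss_ratio: float,
             handler_instructions: int,
             miss_penalty_cycles: int = 10) -> float:
    """Estimated cycles per instruction with decompression, relative to native.

    miss_ratio is I-cache misses per executed instruction (assumed definition).
    """
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must be between 0 and 1")
    native = 1.0 + miss_ratio * miss_penalty_cycles
    compressed = 1.0 + miss_ratio * (miss_penalty_cycles + handler_instructions)
    return compressed / native


dictionary_at_2_percent = slowdown(0.02, DICTIONARY_HANDLER)
codepack_at_2_percent = slowdown(0.02, CODEPACK_HANDLER)
```

In this model the slowdown is exactly 1 when there are no misses and grows with the miss ratio, faster for the longer handler, which is the qualitative behavior the slide describes.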
Selective compression
• Hybrid programs
  – Only compress some procedures
  – Trade size for speed
  – Avoid decompression overhead
• Profile methods
  – Count dynamic instructions
    • Example: Thumb
    • Use when compressed code has more instructions
    • Reduce number of executed instructions
  – Count cache misses
    • Example: CodePack
    • Use when compressed code has longer cache miss latency
    • Reduce cache miss latency
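One way to turn either profile into a hybrid program is a greedy selection, sketched below. The `Procedure` record, the `select_native` function, the coverage threshold, and the profile numbers are all hypothetical; the slides do not specify the exact selection procedure. The sketch only shows how the choice of metric changes which procedures stay native.

```python
from typing import List, NamedTuple


class Procedure(NamedTuple):
    name: str
    size_bytes: int
    dynamic_instructions: int   # profile 1: executed instruction count
    icache_misses: int          # profile 2: I-cache miss count


def select_native(procedures: List[Procedure],
                  metric: str,
                  coverage: float) -> List[str]:
    """Keep the hottest procedures native until `coverage` of the metric is covered."""
    total = sum(getattr(p, metric) for p in procedures)
    ranked = sorted(procedures, key=lambda p: getattr(p, metric), reverse=True)
    native: List[str] = []
    covered = 0
    for p in ranked:
        if total == 0 or covered >= coverage * total:
            break
        native.append(p.name)            # leave this procedure uncompressed
        covered += getattr(p, metric)
    return native


# Made-up profile data for illustration only.
profile = [
    Procedure("inner_loop", 400, 900_000, 20),
    Procedure("dispatch", 2000, 50_000, 700),
    Procedure("init", 1500, 1_000, 80),
    Procedure("error_path", 3000, 49_000, 200),
]

by_misses = select_native(profile, "icache_misses", coverage=0.5)
by_instructions = select_native(profile, "dynamic_instructions", coverage=0.5)
```

With this made-up data the two profiles pick different procedures: a tight loop executes many instructions but rarely misses once it is cached, so a miss-based profile leaves it compressed and spends the native-code budget elsewhere.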
Cache miss profiling
• Cache miss profile reduces overhead by 50%
• Loop-oriented benchmarks benefit most
  – Approach performance of native code
[Figure: Pegwit (encryption). Slowdown relative to native code (1.00 to 1.12) vs. compression ratio (60% to 100%) for CodePack: dynamic instructions and CodePack: cache miss]
CodePack vs. Dictionary
• More compression may have better performance
  – CodePack has smaller size than Dictionary compression
  – Even with some native code, CodePack is smaller
  – CodePack is faster due to using more native code
[Figure: Ghostscript. Slowdown relative to native code (0.5 to 4.0) vs. compression ratio (60% to 100%) for CodePack: cache miss and Dictionary: cache miss]
Conclusions
• High-performance SW decompression possible
  – Dictionary faster than CodePack, but 5-25% compression ratio difference
  – Hardware support
    • I-cache miss exception
    • Store-instruction instruction
• Tune performance by reducing cache misses
  – Cache size
  – Code placement
• Selective compression
  – Use cache miss profile for loop-oriented benchmarks
• Code placement affects decompression overhead
  – Future: unify code placement and compression
Web page
http://www.eecs.umich.edu/compress