

  1. Evaluation of a High Performance Code Compression Method
     Charles Lefurgy, Eva Piccininni, and Trevor Mudge
     Advanced Computer Architecture Laboratory
     Electrical Engineering and Computer Science Dept.
     The University of Michigan, Ann Arbor
     MICRO-32, November 16-18, 1999

  2. Motivation
     • Problem: embedded code size
       – Constraints: cost, area, and power
       – Fit program in on-chip memory
       – Compilers vs. hand-coded assembly
         • Portability
         • Development costs
       – Code bloat
     • Solution: code compression
       – Reduce compiled code size
       – Take advantage of instruction repetition
       – Systems use cheaper processors with smaller on-chip memories
     • Implementation
       – Code size?
       – Execution speed?
     [Diagram: embedded system (CPU, RAM, ROM, I/O) holding the original program vs. the compressed program]

  3. CodePack
     • Overview
       – IBM; PowerPC instruction set
       – First system with instruction-stream compression
       – 60% compression ratio, ±10% performance [IBM]
         • Performance gain due to prefetching
     • Implementation
       – Binary executables are compressed after compilation
       – Compression dictionaries tuned to application
       – Decompression occurs on L1 cache miss
         • L1 caches hold decompressed data
         • Decompress 2 cache lines at a time (16 insns)
       – PowerPC core is unaware of compression

  4. CodePack encoding
     • 32-bit insn is split into 2 16-bit words
     • Each 16-bit word compressed separately
     • Each codeword is a short tag followed by a dictionary index; more
       frequent values get shorter codes, and a dedicated tag encodes zero
     • An escape tag (1 1 1) is followed by the 16 raw bits for values not
       in the dictionary
     [Table: variable-length encodings for the upper and lower 16 bits — tag
      patterns select dictionary index widths (dictionary sections of 8 up to
      256 entries) or the raw-bit escape]
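The split-and-encode scheme above can be sketched in a few lines. This is a toy model, not IBM's exact tables: the tag widths, index widths, and dictionary contents here are illustrative assumptions.

```python
# Toy CodePack-style encoder for one 16-bit halfword.
# Tag widths, index widths, and dictionary contents are assumptions,
# not the exact IBM CodePack encoding tables.

def encode_halfword(half, dictionary):
    """Return a (tag, payload_bits) pair for one 16-bit halfword."""
    if half == 0:
        return ("00", "")                        # frequent zero: shortest code
    if half in dictionary:
        index = dictionary[half]
        if index < 64:
            return ("01", format(index, "06b"))  # short index into dictionary
        return ("10", format(index, "08b"))      # longer index
    # Escape: halfword not in dictionary, emit the 16 raw bits
    return ("11", format(half, "016b"))

# Example: compress both halves of a 32-bit instruction separately
dictionary = {0x3821: 0, 0x7C08: 1}              # hypothetical frequent halves
insn = 0x7C083821
hi, lo = insn >> 16, insn & 0xFFFF
print(encode_halfword(hi, dictionary))           # ('01', '000001')
print(encode_halfword(lo, dictionary))           # ('01', '000000')
```

Because the tag alone determines codeword length, decoding needs no length prefix beyond the tag itself.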

  5. CodePack decompression
     • The L1 I-cache miss address is split into fields (bits 0-5, 6-25, 26-31)
     • Fetch index: the middle field selects an entry in the index table
       (in main memory), which gives the byte-aligned block address
     • Fetch compressed instructions (in main memory)
     • 1 compressed instruction = hi tag + low tag + hi index + low index
     • Decompress: the hi and low indices select entries in the high and low
       dictionaries, yielding the high 16 bits and low 16 bits of the native
       instruction
     • A compression block covers 16 instructions
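The miss-handling flow above can be sketched as follows. The data structures are simplified assumptions (the real hardware operates on packed bit streams), but the two main-memory accesses and the two dictionary lookups per instruction match the slide.

```python
# Sketch of the decompression path on an L1 I-cache miss.
# Data structures are simplified stand-ins for packed bit streams.

def decompress_block(miss_addr, index_table, compressed_mem, hi_dict, lo_dict):
    """Rebuild the 16-instruction compression block containing miss_addr."""
    block_num = miss_addr // (16 * 4)       # 16 insns x 4 bytes per block
    byte_addr = index_table[block_num]      # memory access 1: fetch the index
    codes = compressed_mem[byte_addr]       # memory access 2: fetch codewords
    insns = []
    for hi_idx, lo_idx in codes:            # one (hi, lo) index pair per insn
        hi = hi_dict[hi_idx]                # upper 16 bits from high dictionary
        lo = lo_dict[lo_idx]                # lower 16 bits from low dictionary
        insns.append((hi << 16) | lo)
    return insns

# Tiny worked example with made-up dictionaries:
insns = decompress_block(
    miss_addr=0,
    index_table=[0],                        # block 0 starts at byte 0
    compressed_mem={0: [(0, 1)]},           # one instruction in the block
    hi_dict=[0x7C08, 0x3821],
    lo_dict=[0x0000, 0x3821],
)
print([hex(i) for i in insns])              # ['0x7c083821']
```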

  6. Compression ratio
     • compression ratio = compressed size / original size
     • Average: 62%
     [Bar chart: compression ratio (0-100%) for each benchmark]
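Note the direction of this metric: a *lower* ratio is better, so the 62% average means compressed programs are 38% smaller.

```python
# Compression ratio as defined on the slide: lower is better.
def compression_ratio(compressed_size, original_size):
    return compressed_size / original_size

# e.g. a 62 KB compressed image from a 100 KB original
print(f"{compression_ratio(62_000, 100_000):.0%}")   # 62%
```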

  7. CodePack programs
     • Compressed executable breakdown (go benchmark):
       Indices 51%, Tags 25%, Raw bits 14%, Index table 5%, Escape 3%,
       Pad 1%, Dictionary 1%
     – 17%-25% raw bits: not compressed!
       • Includes escape bits
       • Compiler optimizations might help
     – 5% index table
     – 2 KB dictionary (fixed cost)
     – 1% pad bits
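A quick sanity check on the pie-chart numbers: the components account for the whole compressed executable, and for go the uncompressed portion (raw + escape bits) lands at the low end of the 17%-25% range.

```python
# Breakdown of the compressed "go" executable (percent of compressed size),
# taken from the slide's pie chart.
breakdown = {
    "indices": 51, "tags": 25, "raw bits": 14, "index table": 5,
    "escape": 3, "pad": 1, "dictionary": 1,
}
assert sum(breakdown.values()) == 100           # components cover everything
print(breakdown["raw bits"] + breakdown["escape"])   # 17 (% not compressed)
```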

  8. I-cache miss timing
     • Native code uses critical word first
     • Compressed code must be fetched sequentially
     • Example shows a miss to the 5th instruction in a cache line
       – 32-bit insns, 64-bit bus
     • Three scenarios:
       a) Native code: instruction cache miss, then instructions from main
          memory
       b) Compressed code: instruction cache miss, then the index from main
          memory, then codes from main memory through the decompressor
       c) Compressed code + optimizations: the index comes from an index
          cache (A) and 2 decompressors work in parallel (B)
     [Timeline: cycles t=0 to 30; legend: L1 cache miss, fetch index, fetch
      instructions (first line / remaining lines), decompression cycle,
      critical instruction word]
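A rough cycle model makes the sequential-fetch penalty concrete. The memory parameters come from the baseline slide (10-cycle latency, 2-cycle rate, 64-bit bus carrying 2 instructions per beat), but the accounting itself is my simplification, not the paper's measured timeline.

```python
# Rough cycle model of a miss to the Nth instruction in a line.
# LATENCY/RATE follow the baseline slide; the cost accounting is a
# simplification for illustration only.

LATENCY, RATE, INSNS_PER_BEAT = 10, 2, 2

def native_miss_cycles(critical_insn):
    # Critical word first: the needed beat arrives on the first transfer.
    return LATENCY + RATE

def codepack_miss_cycles(critical_insn):
    # Sequential: fetch the index, fetch codewords up to the critical insn,
    # then decompress them one per cycle (single decoder assumed).
    index_fetch = LATENCY + RATE
    beats = critical_insn // INSNS_PER_BEAT + 1
    code_fetch = LATENCY + RATE * beats
    decompress = critical_insn + 1
    return index_fetch + code_fetch + decompress

# Miss to the 5th instruction (index 4), as in the slide's example:
print(native_miss_cycles(4), codepack_miss_cycles(4))   # 12 33
```

The two optimizations attack exactly the two largest terms here: the index cache removes `index_fetch`'s memory access, and a second decoder halves `decompress`.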

  9. Baseline results
     • CodePack causes up to 18% performance loss
       – SimpleScalar, 4-issue, out-of-order
       – 16 KB caches
       – Main memory: 10-cycle latency, 2-cycle rate
     [Bar chart: instructions per cycle (0 to 1.8) for native vs. CodePack on
      cc1, go, perl, vortex]

  10. Optimization A: Index cache
     • Remove the index table access with a cache
       – A cache hit removes the main-memory access for the index
       – optimized: 64 lines, fully associative, 4 indices/line
         (<15% miss ratio)
         • Within 8% of native code
       – perfect: an infinitely sized index cache
         • Within 5% of native code
     [Bar chart: speedup over native code (0 to 1.2) for CodePack, optimized,
      and perfect on cc1, go, perl, vortex]
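The index cache from the slide can be sketched as below. The 64-line, 4-indices-per-line, fully associative geometry is from the slide; the LRU replacement policy is my assumption.

```python
from collections import OrderedDict

# Sketch of the fully associative index cache: 64 lines, 4 block indices
# per line. LRU replacement is an assumption, not stated on the slide.

class IndexCache:
    def __init__(self, lines=64, indices_per_line=4):
        self.lines = lines
        self.per_line = indices_per_line
        self.cache = OrderedDict()          # line tag -> list of indices
        self.hits = self.misses = 0

    def lookup(self, block_num, index_table):
        tag = block_num // self.per_line
        if tag in self.cache:
            self.hits += 1
            self.cache.move_to_end(tag)     # refresh LRU position
        else:
            self.misses += 1                # costs a main-memory access
            if len(self.cache) >= self.lines:
                self.cache.popitem(last=False)   # evict LRU line
            base = tag * self.per_line
            self.cache[tag] = index_table[base:base + self.per_line]
        return self.cache[tag][block_num % self.per_line]

# Spatial locality pays off: one miss pulls in 4 neighboring indices.
table = list(range(100))
c = IndexCache()
c.lookup(5, table)                          # miss, fills indices 4-7
c.lookup(4, table)                          # hit
print(c.hits, c.misses)                     # 1 1
```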

  11. Optimization B: More decoders
     • Codeword tags enable fast extraction of codewords
       – Enables parallel decoding
     • Try adding more decoders for faster decompression
     • 2 decoders: performance within 13% of native code
     [Bar chart: speedup over native code (0 to 1.0) for CodePack and 2, 3,
      and 16 insn/cycle decoders on cc1, go, perl, vortex]
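Why tags enable parallelism: the tag alone determines each codeword's length, so a cheap scanner can locate every codeword boundary without decoding any payload, and independent decoders can then take alternating codewords. The tag-to-width table below is an illustrative assumption, not the real CodePack table.

```python
# Boundary scan over a tag-prefixed bitstream. The 2-bit tags and payload
# widths are illustrative assumptions.

TAG_PAYLOAD_BITS = {"00": 0, "01": 6, "10": 8, "11": 16}

def codeword_boundaries(bitstream):
    """Return the start offset of each codeword; two decoders can then be
    assigned alternating codewords and run in parallel."""
    starts, pos = [], 0
    while pos < len(bitstream):
        starts.append(pos)
        tag = bitstream[pos:pos + 2]        # peek at the tag only
        pos += 2 + TAG_PAYLOAD_BITS[tag]    # skip tag + payload
    return starts

bits = "00" + "01" + "000001" + "11" + "0" * 16
print(codeword_boundaries(bits))            # [0, 2, 10]
```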

  12. Comparison of optimizations
     • The index cache provides the largest benefit
     • Optimizations
       – index cache: 64 lines, 4 indices/line, fully associative
       – 2nd decoder
     • Speedup over native code: 0.97 to 1.05
     • Speedup over CodePack: 1.17 to 1.25
     [Bar chart: speedup over native code (0 to 1.2) for CodePack, index
      cache, 2nd decoder, and both optimizations on cc1, go, perl, vortex]

  13. Cache effects
     • Cache size controls the normal CodePack slowdown
     • Optimizations do well on small caches: 1.14 speedup
     [Line chart, go benchmark: speedup over native code (0 to 1.4) for
      CodePack and optimized at 1 KB, 4 KB, 16 KB, and 64 KB caches]

  14. Memory latency
     • Optimized CodePack performs better with slow memories
       – Fewer memory accesses than native code
     [Line chart, go benchmark: speedup over native code (0 to 1.2) for
      CodePack and optimized at 0.5x, 1x, 2x, 4x, and 8x memory latency]

  15. Memory width
     • CodePack provides a speedup for small buses
     • Optimizations help performance degrade gracefully as bus size
       increases
     [Line chart, go benchmark: speedup over native code (0 to 1.2) for
      CodePack and optimized at 16-, 32-, 64-, and 128-bit bus sizes]

  16. Conclusions
     • CodePack works for instruction sets other than PowerPC
     • Performance can be improved at modest cost
       – Remove decompression overhead: index lookup, dictionary lookup
     • Compression can speed up execution
       – Compressed code requires fewer main-memory accesses
       – CodePack includes simple prefetching
     • Systems that benefit most from compression
       – Narrow buses
       – Slow memories
     • Workstations might benefit from compression
       – Fewer L2 misses
       – Less disk access

  17. Web page
     http://www.eecs.umich.edu/~tnm/compress
