Reducing Code Size with Run-time Decompression
Charles Lefurgy, Eva Piccininni, and Trevor Mudge
Advanced Computer Architecture Laboratory
Electrical Engineering and Computer Science Dept.
The University of Michigan, Ann Arbor
High-Performance Computer Architecture (HPCA-6), January 10-12, 2000

Motivation
• Problem: embedded code size
  – Constraints: cost, area, and power
  – Fit program in on-chip memory
  – Compilers vs. hand-coded assembly
    • Portability
    • Development costs
  – Code bloat
• Solution: code compression
  – Reduce compiled code size
  – Take advantage of instruction repetition
• Implementation
  – Hardware or software?
  – Code size?
  – Execution speed?
[Figure: two embedded systems (CPU, RAM, ROM, I/O), one holding the original program and one holding the compressed program]

Software decompression
• Previous work
  – Decompression unit: whole program [Taunton91]
    • No memory savings
  – Decompression unit: procedures [Kirovski97][Ernst97]
    • Requires large decompression memory
    • Fragmentation of decompression memory
    • Slow
• Our work
  – Decompression unit: 1 or 2 cache lines
  – High performance focus
  – New profiling method

Dictionary compression algorithm
• Goal: fast decompression
• Dictionary contains unique instructions
• Replace program instructions with short index
[Figure: the original program's .text segment holds 32-bit instructions (lw r2,r3 and lw r15,r3, each repeated); the compressed program's .text segment holds 16-bit indices (5 and 30) into a .dictionary segment of 32-bit instructions]
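The scheme on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the function names and the instruction encodings are hypothetical, and indices are assigned in order of first appearance (the slide's figure uses indices 5 and 30 instead).

```python
from typing import Dict, List, Tuple

def compress(text_words: List[int]) -> Tuple[List[int], List[int]]:
    """Return (indices, dictionary) for a list of 32-bit instruction words."""
    dictionary: List[int] = []      # unique instructions, one entry each
    position: Dict[int, int] = {}   # instruction word -> dictionary index
    indices: List[int] = []         # replaces the original .text segment
    for word in text_words:
        if word not in position:
            position[word] = len(dictionary)
            dictionary.append(word)
        indices.append(position[word])
    if len(dictionary) > 1 << 16:
        raise ValueError("more unique instructions than a 16-bit index can address")
    return indices, dictionary

def decompress(indices: List[int], dictionary: List[int]) -> List[int]:
    """Recover the original instruction words by dictionary lookup."""
    return [dictionary[i] for i in indices]

# Hypothetical 32-bit encodings standing in for the slide's two instructions.
LW_R2_R3 = 0x8C620000
LW_R15_R3 = 0x8C6F0000

program = [LW_R2_R3, LW_R2_R3, LW_R15_R3, LW_R15_R3, LW_R15_R3]
indices, dictionary = compress(program)
assert decompress(indices, dictionary) == program
```

Because every codeword has the same length, decompression is a single table lookup per instruction, which is what makes the scheme fast.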

Decompression
• Algorithm
  1. I-cache miss invokes decompressor (exception handler)
  2. Fetch index
  3. Fetch dictionary word
  4. Place instruction in I-cache (special instruction)
• Write directly into I-cache
• Decompressed instructions only exist in I-cache
[Figure: memory (dictionary entry "Add r1,r2,r3", indices "5 ..."), processor, I-cache, and D-cache, with arrows marking the steps above]
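The four steps above can be simulated in software. The sketch below is only a model under stated assumptions: the I-cache is a Python dict keyed by line address, the 32-byte line size is taken from the simulation environment slide, indices are stored as big-endian 16-bit values, and the index for the instruction at native address `text_base + 4*n` is assumed to sit at byte offset `2*n` of the index segment. All names are hypothetical.

```python
import struct
from typing import Dict, List

LINE_BYTES = 32                      # I-cache line size (assumed from the simulated setup)
INSN_BYTES = 4                       # 32-bit native instructions
INDEX_BYTES = 2                      # 16-bit dictionary indices
INSNS_PER_LINE = LINE_BYTES // INSN_BYTES

def handle_icache_miss(miss_addr: int, text_base: int, indices_mem: bytes,
                       dictionary: List[int], icache: Dict[int, List[int]]) -> None:
    """Step 1: called on an I-cache miss; fills the whole missed line."""
    line_addr = miss_addr & ~(LINE_BYTES - 1)           # align down to the line start
    first_insn = (line_addr - text_base) // INSN_BYTES  # instruction number of slot 0
    line: List[int] = []
    for slot in range(INSNS_PER_LINE):
        offset = (first_insn + slot) * INDEX_BYTES
        (index,) = struct.unpack_from(">H", indices_mem, offset)  # step 2: fetch index
        line.append(dictionary[index])                            # step 3: fetch dictionary word
    icache[line_addr] = line                                      # step 4: place line in I-cache

# Usage with made-up data: 16 instructions drawn from a 3-entry dictionary.
dictionary = [0x11111111, 0x22222222, 0x33333333]
idx = [0, 1, 2, 0] * 4
indices_mem = struct.pack(">16H", *idx)
icache: Dict[int, List[int]] = {}
handle_icache_miss(0x1024, 0x1000, indices_mem, dictionary, icache)
```

In this model the decompressed instructions exist only in `icache`, mirroring the slide's point that main memory never holds the native code.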

CodePack
• Overview
  – IBM
  – PowerPC
  – First system with instruction stream compression
  – Decompress during I-cache miss
• Software CodePack

                          Dictionary         CodePack
Codewords (indices)       Fixed-length       Variable-length
Decompress granularity    1 cache line       2 cache lines
Decompression overhead    75 instructions    1120 instructions

Compression ratio
• compression ratio = compressed size / original size
  – CodePack: 55% - 63%
  – Dictionary: 65% - 82%
[Figure: bar chart of compression ratio (0% to 100%) for Dictionary and CodePack on go, ijpeg, cc1, perl, vortex, mpeg2enc, pegwit, and ghostscript]
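For the dictionary scheme, the ratio can be worked out directly from the instruction counts. The sketch below assumes 32-bit instructions and 16-bit indices as on the dictionary slide, and it ignores the size of the decompressor itself; the function name and the example counts are hypothetical.

```python
def dictionary_compression_ratio(total_insns: int, unique_insns: int,
                                 insn_bytes: int = 4, index_bytes: int = 2) -> float:
    """compressed size / original size for fixed-length dictionary compression."""
    original = total_insns * insn_bytes
    # One index per instruction, plus one dictionary entry per unique instruction.
    compressed = total_insns * index_bytes + unique_insns * insn_bytes
    return compressed / original

# Made-up program: 10,000 instructions, of which 2,000 are unique.
ratio = dictionary_compression_ratio(10_000, 2_000)
print(f"{ratio:.0%}")
```

With these sizes the ratio is 50% plus the fraction of instructions that are unique, so fixed 16-bit indices alone can never push it below 50%.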

Simulation environment
• SimpleScalar
• Pipeline: 5 stage, in-order
• I-cache: 16KB, 32B lines, 2-way
• D-cache: 8KB, 16B lines, 2-way
• Memory: 10 cycle latency, 2 cycle rate

Performance
• CodePack: very high overhead
• Reduce overhead by reducing cache misses
[Figure: Go benchmark; slowdown relative to native code (0 to 22) vs. I-cache size (4KB, 16KB, 64KB) for CodePack, Dictionary, and Native]

Cache miss
• Control slowdown by optimizing I-cache miss ratio
[Figure: slowdown relative to native code (0 to 40) vs. I-cache miss ratio (0% to 8%) for CodePack and Dictionary, each with 4KB, 16KB, and 64KB I-caches]

Selective compression
• Hybrid programs
  – Only compress some procedures
  – Trade size for speed
  – Avoid decompression overhead
• Profile methods
  – Count dynamic instructions
    • Example: Thumb
    • Use when compressed code has more instructions
    • Reduce number of executed instructions
  – Count cache misses
    • Example: CodePack
    • Use when compressed code has longer cache miss latency
    • Reduce cache miss latency
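One way to picture the two profile methods is a greedy selection that leaves the highest-ranked procedures native until a size budget is used up. This is an illustrative sketch rather than the authors' selection procedure: the `Proc` record, the per-procedure compressed sizes, the budget, and the profile numbers are all made up.

```python
from typing import Callable, List, NamedTuple, Set

class Proc(NamedTuple):
    name: str
    native_bytes: int      # size if left uncompressed
    compressed_bytes: int  # size if compressed (assumed known per procedure)
    cache_misses: int      # I-cache misses from a profiling run
    dyn_insns: int         # dynamic instruction count from a profiling run

def select_native(procs: List[Proc], size_budget: int,
                  key: Callable[[Proc], int]) -> Set[str]:
    """Leave the highest-ranked procedures native while the program fits the budget."""
    total = sum(p.compressed_bytes for p in procs)  # start fully compressed
    native: Set[str] = set()
    for p in sorted(procs, key=key, reverse=True):
        grown = total + p.native_bytes - p.compressed_bytes
        if grown <= size_budget:                    # skip procedures that do not fit
            total = grown
            native.add(p.name)
    return native

procs = [
    Proc("loop_kernel", 400, 240, cache_misses=5, dyn_insns=900_000),
    Proc("parser", 2000, 1200, cache_misses=800, dyn_insns=50_000),
    Proc("init", 1000, 600, cache_misses=20, dyn_insns=1_000),
]
by_misses = select_native(procs, 2900, key=lambda p: p.cache_misses)
by_insns = select_native(procs, 2900, key=lambda p: p.dyn_insns)
```

In this made-up profile the hot loop executes most of the instructions but rarely misses in the I-cache, so the cache miss ranking spends the budget on `parser` instead, which is where the decompressor would otherwise be invoked most often.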

Cache miss profiling
• Cache miss profile reduces overhead 50%
• Loop-oriented benchmarks benefit most
  – Approach performance of native code
[Figure: Pegwit (encryption); slowdown relative to native code (1.00 to 1.12) vs. compression ratio (60% to 100%) for CodePack with the dynamic-instruction profile and with the cache-miss profile]

CodePack vs. Dictionary
• More compression may have better performance
  – CodePack has smaller size than Dictionary compression
  – Even with some native code, CodePack is smaller
  – CodePack is faster due to using more native code
[Figure: Ghostscript; slowdown relative to native code (0.5 to 4.0) vs. compression ratio (60% to 100%) for CodePack and Dictionary, both using the cache-miss profile]

Conclusions
• High-performance SW decompression possible
  – Dictionary faster than CodePack, but 5-25% compression ratio difference
  – Hardware support
    • I-cache miss exception
    • Store-instruction instruction
• Tune performance by reducing cache misses
  – Cache size
  – Code placement
• Selective compression
  – Use cache miss profile for loop-oriented benchmarks
• Code placement affects decompression overhead
  – Future: unify code placement and compression

Web page
http://www.eecs.umich.edu/compress
