Cache Aware Optimization of Stream Programs
Janis Sermulins, William Thies, Rodric Rabbah, and Saman Amarasinghe
LCTES, Chicago, June 2005
Streaming Computing Is Everywhere!
• Prevalent computing domain with applications in embedded systems
  – As well as desktops and high-end servers
Properties of Stream Programs
• Regular and repeating computation
• Independent actors with explicit communication
• Data items have short lifetimes
[Figure: FM radio pipeline — AtoD, FMDemod, Duplicate split into band-pass branches (LPF 1–3, HPF 1–3), RoundRobin join, Adder, Speaker]
Application Characteristics: Implications on Caching

               Scientific                      Streaming
Control        Inner loops                     Single outer loop
Data           Persistent array processing     Limited-lifetime producer-consumer
Working set    Whole-program                   Small
Implications   Natural fit for cache hierarchy Demands novel mapping
Application Characteristics: Implications on Compiler

               Scientific                       Streaming
Parallelism    Fine-grained                     Coarse-grained
Data access    Global random access             Local producer-consumer
Communication  Implicit                         Explicit
Implications   Limited program transformations  Potential for global reordering
Motivating Example

Baseline:
  for i = 1 to N
    A(); B(); C();
  end

Full Scaling:
  for i = 1 to N
    A();
  end
  for i = 1 to N
    B();
  end
  for i = 1 to N
    C();
  end

[Chart: instruction and data working-set sizes for each version, relative to cache size]
Motivating Example: Cache Opt

Cache Opt (each filter's execution scaled by 64, chosen so the data working set still fits the cache):
  for j = 1 to N/64
    for i = 1 to 64
      A();
    end
    for i = 1 to 64
      B();
    end
    for i = 1 to 64
      C();
    end
  end

[Chart: instruction and data working-set sizes for Baseline, Full Scaling, and Cache Opt, relative to cache size]
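The cache-optimized schedule above can be sketched in C (a minimal sketch, not the compiler's actual output; the filter bodies, buffer names, and the values N = 256 and S = 64 are illustrative):

```c
#include <assert.h>

#define N 256   /* total steady-state iterations (hypothetical) */
#define S 64    /* scaling factor chosen to fit the cache (hypothetical) */

int ab[S], bc[S];   /* FIFO buffers between A->B and B->C; only S items live */
int out[N];

static void A(int i, int j) { ab[j] = i; }          /* source: produce item */
static void B(int j)        { bc[j] = ab[j] * 2; }  /* double each item */
static void C(int i, int j) { out[i] = bc[j] + 1; } /* sink: add one */

/* Execute each filter S times in a row before moving to the next, so each
 * filter's code stays hot in the I-cache while the inter-filter buffers
 * (S items each) still fit in the D-cache. */
void run_scaled(void) {
    for (int base = 0; base < N; base += S) {
        for (int j = 0; j < S; j++) A(base + j, j);
        for (int j = 0; j < S; j++) B(j);
        for (int j = 0; j < S; j++) C(base + j, j);
    }
}
```

The result is identical to running A(); B(); C(); per iteration; only the execution order (and hence cache behavior) changes.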
Outline • StreamIt • Cache Aware Fusion • Cache Aware Scaling • Buffer Management • Related Work and Conclusion
Model of Computation
• Synchronous Dataflow [Lee 92]
  – Graph of autonomous filters
  – Communicate via FIFO channels
  – Static I/O rates
• Compiler decides on an order of execution (schedule)
  – Many legal schedules
  – Schedule affects locality
  – Lots of previous work on minimizing buffer requirements between filters
[Figure: pipeline A/D → Band Pass → Duplicate → four parallel Detect → LED branches]
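Because I/O rates are static, the compiler can solve the standard SDF balance equations to find a steady-state schedule. A sketch for one producer-consumer pair (function names are ours, not StreamIt's):

```c
#include <assert.h>

static int gcd(int a, int b) {
    while (b) { int t = a % b; a = b; b = t; }
    return a;
}

/* For a producer pushing `push` items and a consumer popping `pop` items
 * per execution, the minimal steady state runs the producer pop/g times
 * and the consumer push/g times, where g = gcd(push, pop). After that,
 * the channel holds exactly as many items as before. */
void steady_state(int push, int pop, int *prod_reps, int *cons_reps) {
    int g = gcd(push, pop);
    *prod_reps = pop / g;
    *cons_reps = push / g;
}
```

For example, a filter pushing 2 items feeding one popping 3 must run 3 and 2 times, respectively, per steady-state iteration.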
Example StreamIt Filter

float→float filter FIR (int N) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    push(result);
    pop();
  }
}

[Figure: each execution reads a window of N input items (peek), pushes one output, and pops one input]
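In C terms, the work function's peek/pop/push semantics amount to a sliding dot product over the input stream. A sketch assuming N = 4 taps and unit weights (in the real filter, weights is filter state):

```c
#define TAPS 4

/* Hypothetical coefficients; all ones so outputs are easy to check. */
static const float weights[TAPS] = {1, 1, 1, 1};

/* peek(i) reads the item i positions ahead without consuming it; pop()
 * consumes one item. So output k depends on in[k] .. in[k+TAPS-1], and
 * the per-execution pop() is modeled by advancing k. */
void fir(const float *in, float *out, int nout) {
    for (int k = 0; k < nout; k++) {
        float result = 0;
        for (int i = 0; i < TAPS; i++)
            result += weights[i] * in[k + i];  /* peek(i) */
        out[k] = result;                       /* push(result) */
    }
}
```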
StreamIt Language Overview
• StreamIt is a novel language for streaming
  – Exposes parallelism and communication
  – Architecture independent
  – Modular and composable: simple structures composed to create complex graphs
  – Malleable: change program behavior with small modifications
[Figure: hierarchical stream structures — filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel computation, joiner); feedback loop (joiner, body, splitter)]
Freq Band Detector in StreamIt

void->void pipeline FrequencyBand {
  float sFreq = 4000;
  float cFreq = 500/(sFreq*2*pi);
  float wFreq = 100/(sFreq*2*pi);
  add D2ASource(sFreq);
  add BandPassFilter(100, cFreq-wFreq, cFreq+wFreq);
  add splitjoin {
    split duplicate;
    for (int i=0; i<4; i++) {
      add pipeline {
        add Detect(i/4);
        add LED(i);
      }
    }
    join roundrobin(0);
  }
}

[Figure: A/D → Band Pass → Duplicate → four parallel Detect/LED branches]
Outline • StreamIt • Cache Aware Fusion • Cache Aware Scaling • Buffer Management • Related Work and Conclusion
Fusion
• Fusion combines adjacent filters into a single filter

Before (filter 1 executed once, filter 2 executed twice):
  work pop 1 push 2 {
    int a = pop();
    push(a);
    push(a);
  }
  work pop 1 push 1 {
    int b = pop();
    push(b * 2);
  }

After (fused):
  work pop 1 push 2 {
    int t1, t2;
    int a = pop();
    t1 = a;
    t2 = a;
    int b = t1;
    push(b * 2);
    int c = t2;
    push(c * 2);
  }

• Reduces method call overhead
• Improves producer-consumer locality
• Allows optimizations across filter boundaries
  – Register allocation of intermediate values
  – More flexible instruction scheduling
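The fused example can be checked as plain C (a sketch: the inter-filter FIFO becomes the locals t1 and t2, which the C compiler can keep in registers instead of memory):

```c
/* One execution of the fused filter: pops one input item, writes the
 * first output via the return value and the second via out2. The
 * hypothetical names fused/out2 are ours. */
int fused(int a, int *out2) {
    int t1 = a, t2 = a;   /* filter 1: push(a); push(a); */
    int b = t1;           /* filter 2, first execution */
    int c = t2;           /* filter 2, second execution */
    *out2 = c * 2;
    return b * 2;
}
```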
Evaluation Methodology
• StreamIt compiler generates C code
  – Baseline StreamIt optimizations: unrolling, constant propagation
  – Compile C code with gcc 3.4 at -O3
• StrongARM 1110 (XScale) embedded processor
  – 370 MHz, 16 KB I-cache, 8 KB D-cache
  – No L2 cache (memory 100× slower than cache)
  – Median user time reported
• Suite of 11 StreamIt benchmarks
• Evaluate two fusion strategies:
  – Full Fusion
  – Cache Aware Fusion
Results for Full Fusion (StrongARM 1110)
Hazard: the instruction or data working set of the fused program may exceed the cache size!
Cache Aware Fusion (CAF)
• Fuse filters so long as:
  – The fused instruction working set fits the I-cache
  – The fused data working set fits the D-cache
• Leave a fraction of the D-cache for input and output, to facilitate cache aware scaling
• Use a hierarchical fusion heuristic
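A minimal sketch of the fusion test (our reading of the heuristic, not the compiler's code; the cache sizes match the StrongARM 1110 above, but the reserved I/O fraction and all names are assumptions):

```c
#include <stdbool.h>

#define ICACHE 16384          /* StrongARM 1110: 16 KB I-cache */
#define DCACHE  8192          /*                  8 KB D-cache */
#define IO_FRACTION 0.33      /* hypothetical fraction reserved for I/O */

typedef struct {
    int inst_ws;   /* instruction working set, bytes */
    int data_ws;   /* data working set, bytes */
} Filter;

/* Fuse two adjacent filters only if the combined instruction working set
 * fits the I-cache and the combined data working set fits the part of the
 * D-cache not reserved for input/output buffers. */
bool can_fuse(Filter a, Filter b) {
    int inst = a.inst_ws + b.inst_ws;
    int data = a.data_ws + b.data_ws;
    return inst <= ICACHE
        && data <= (int)(DCACHE * (1.0 - IO_FRACTION));
}
```

A real implementation would apply this check hierarchically over the stream graph rather than pairwise in isolation.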
Full Fusion vs. CAF
Outline • StreamIt • Cache Aware Fusion • Cache Aware Scaling • Buffer Management • Related Work and Conclusion
Improving Instruction Locality

Baseline (I-cache miss rate = 1):
  for i = 1 to N
    A(); B(); C();
  end

Full Scaling (I-cache miss rate = 1/N):
  for i = 1 to N
    A();
  end
  for i = 1 to N
    B();
  end
  for i = 1 to N
    C();
  end

If the combined code of A, B, and C exceeds the I-cache, the baseline misses on every filter invocation; scaling amortizes each filter's code fetch over N executions.
Impact of Scaling (Fast Fourier Transform)
How Much To Scale?

Our scaling heuristic:
• Scale as much as possible
• Ensure at least 90% of filters have data working sets (state + I/O) that fit into the cache
[Chart: data working-set size (state vs. I/O) of filters A, B, C with no scaling and with scaling by 3, 4, and 5, relative to cache size]
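The heuristic can be sketched as a search over scaling factors (the linear state + m × I/O working-set model, the 90% threshold encoding, and all names are our assumptions, not the compiler's code):

```c
#define DCACHE 8192   /* StrongARM 1110 D-cache, bytes */

typedef struct {
    int state;        /* persistent filter state, bytes */
    int io_per_exec;  /* input + output bytes per execution */
} ScaledFilter;

/* Does filter f, executed m times in a row, keep its data working set
 * within the D-cache? */
static int fits(const ScaledFilter *f, int m) {
    return f->state + m * f->io_per_exec <= DCACHE;
}

/* Pick the largest scaling factor for which at least 90% of the n
 * filters still fit. */
int choose_scaling(const ScaledFilter *fs, int n, int max_m) {
    int best = 1;
    for (int m = 1; m <= max_m; m++) {
        int ok = 0;
        for (int i = 0; i < n; i++)
            ok += fits(&fs[i], m);
        if (ok * 10 >= n * 9)   /* at least 90% of filters fit */
            best = m;
    }
    return best;
}
```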
Impact of Scaling (Fast Fourier Transform): heuristic choice is within 4% of optimal
Scaling Results
Outline • StreamIt • Cache Aware Fusion • Cache Aware Scaling • Buffer Management • Related Work and Conclusion
Sliding Window Computation

float→float filter FIR (int N) {
  work push 1 pop 1 peek N {
    float result = 0;
    for (int i = 0; i < N; i++) {
      result += weights[i] * peek(i);
    }
    push(result);
    pop();
  }
}

[Figure: each output peeks at N consecutive inputs; the window slides by one item per execution, so N−1 items stay live between executions]
Performance vs. Peek Rate (StrongARM 1110): FIR
Evaluation for Benchmarks (StrongARM 1110)
[Chart comparing two buffer-management configurations: caf + scaling + modulation vs. caf + scaling + copy-shift]
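The copy-shift strategy being compared can be sketched in C (assumed semantics: the live items that were peeked but not yet popped are copied to the front of a linear buffer before it is refilled, so the consumer indexes the buffer directly instead of applying the modulo of the modulation scheme on every access):

```c
#include <string.h>

/* After the consumer has popped `popped` of the `total` items in the
 * buffer, shift the remaining live items (e.g. the peek(N) window minus
 * the pops) back to index 0. The producer then appends after them. */
void copy_shift(float *buf, int total, int popped) {
    memmove(buf, buf + popped, (size_t)(total - popped) * sizeof(float));
}
```

The copy costs extra memory traffic per steady state, but removes a wrap-around check or modulo from the inner loop of every peeking filter.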
Results Summary
[Chart across architectures; annotations: Large L2 Cache; Large L2 Cache; Large Reg. File, VLIW]
Outline • StreamIt • Cache Aware Fusion • Cache Aware Scaling • Buffer Management • Related Work and Conclusion
Related Work
• Minimizing buffer requirements
  – S. S. Bhattacharyya, P. Murthy, and E. Lee:
    • Software Synthesis from Dataflow Graphs (1996)
    • APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams into Efficient Software Implementations (1997)
    • Synthesis of Embedded Software from Synchronous Dataflow Specifications (1999)
  – P. K. Murthy and S. S. Bhattacharyya:
    • A Buffer Merging Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications (1999)
    • Buffer Merging – A Powerful Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications (2000)
  – R. Govindarajan, G. Gao, and P. Desai: Minimizing Memory Requirements in Rate-Optimal Schedules (1994)
• Fusion
  – T. A. Proebsting and S. A. Watterson: Filter Fusion (1996)
• Cache optimizations
  – S. Kohli: Cache Aware Scheduling of Synchronous Dataflow Programs (2004)
Conclusions
• The streaming paradigm exposes parallelism and allows massive reordering to improve locality
• Must consider both data and instruction locality
  – Cache Aware Fusion enables local optimizations by judiciously increasing the instruction working set
  – Cache Aware Scaling improves instruction locality by judiciously increasing the buffer requirements
• Simple optimizations have high impact
  – Cache optimizations yield significant speedup over both baseline and full fusion on an embedded platform