

  1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing. Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro, University of Southern California. http://www-pdpc.usc.edu. 19 May 2000

  2. HiDISC: Hierarchical Decoupled Instruction Set Computer

  [Quad chart. System diagram: Sensor Inputs (FLIR, SAR, VIDEO, ATR/SLD, Scientific) feed an Application; the Decoupling Compiler maps it onto the HiDISC Processor (Registers, Cache, Memory) alongside a Dynamic Database, producing Situational Awareness.]

  New Ideas:
  - A dedicated processor for each level of the memory hierarchy
  - Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  - Hide memory latency by converting data access predictability to data access locality
  - Exploit instruction-level parallelism without extensive scheduling hardware
  - Zero-overhead prefetches for maximal computation throughput

  Impact:
  - 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  - 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  - 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  - Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
  - Allows the compiler to solve indexing functions for irregular applications
  - Reduced system cost for high-throughput scientific codes

  Schedule (April 98 - May 01):
  - Defined benchmarks; performed instruction-level simulations on hand-compiled benchmarks
  - Defined HiDISC architecture; completed simulator
  - Continue simulations of more benchmarks (SAR); generate performance statistics and evaluate architecture design; update simulator
  - Develop and test a full decoupling compiler

  University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro

  3. HiDISC: Hierarchical Decoupled Instruction Set Computer

  - Technological Trend: memory latency is getting longer relative to microprocessor speed (40% per year)
  - Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
  - Domain: benchmarks with large data sets: symbolic, signal processing and scientific programs
  - Present Solutions: multithreading (homogeneous), larger caches, prefetching, software multithreading

  4. Present Solutions

  - Larger Caches: slow; work well only if the working set fits in the cache and there is temporal locality; cannot be tailored for each application.
  - Hardware Prefetching: behavior is based on past and present execution-time behavior.
  - Software Prefetching: must ensure the overheads of prefetching do not outweigh the benefits, which forces conservative prefetching; adaptive software prefetching is required to change the prefetch distance at run-time; hard to insert prefetches for irregular access patterns.
  - Multithreading: solves the throughput problem, not the memory latency problem.

  5. The HiDISC Approach

  Observation:
  - Software prefetching impacts compute performance
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

  Approach:
  - Add a processor to manage prefetching -> hide overhead
  - Compiler explicitly manages the memory hierarchy
  - Prefetch distance adapts to the program's runtime behavior

  6. What's HiDISC

  [Diagram: the compiler splits a program into three instruction streams: Computation Instructions to the Computation Processor (CP, attached to the registers), Access Instructions to the Access Processor (AP, attached to the cache), and Cache Management Instructions to the Cache Management Processor (CMP, attached to the second-level cache and main memory).]

  - A dedicated processor for each level of the memory hierarchy
  - Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  - Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
  - Exploit instruction-level parallelism without extensive scheduling hardware
  - Zero-overhead prefetches for maximal computation throughput

  7. Decoupled Architectures

  [Diagram: four pipelines compared side by side, each backed by a second-level cache and main memory.]

  - MIPS (conventional): a single 8-way processor with registers and a cache.
  - DEAP (decoupled): a 5-way Computation Processor (CP) and a 3-way Access Processor (AP), coupled through Store Address, Store Data, and Load queues.
  - CAPP (decoupled): a 5-way processor and a 3-way Cache Management Processor (CMP), coupled through a Slip Control Queue.
  - HiDISC (new, decoupled): a 2-way CP, a 3-way AP, and a 3-way CMP, combining the store/load queues with the Slip Control Queue.

  8. Slip Control Queue

  The Slip Control Queue (SCQ) adapts dynamically:

      if (prefetch_buffer_full())
          don't change size of SCQ;
      else if ((2 * late_prefetches) > useful_prefetches)
          increase size of SCQ;
      else
          decrease size of SCQ;

  - Late prefetches: prefetched data arrived after the load had been issued
  - Useful prefetches: prefetched data arrived before the load had been issued
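The adaptation policy on this slide can be sketched in runnable form. This is a minimal illustrative simulation, not the authors' hardware implementation; the function name, the counter arguments, and the min/max size bounds are assumptions added for the sketch.

```python
def adapt_scq_size(size, prefetch_buffer_full, late_prefetches, useful_prefetches,
                   min_size=1, max_size=64):
    """One adaptation step for the Slip Control Queue (SCQ) size.

    Mirrors the slide's policy: freeze the size while the prefetch buffer
    is full; grow when late prefetches dominate (2 * late > useful),
    meaning prefetches are not issued far enough ahead of their loads;
    otherwise shrink. The min/max bounds are illustrative, not from the
    slides.
    """
    if prefetch_buffer_full:
        return size                      # don't change size of SCQ
    if 2 * late_prefetches > useful_prefetches:
        return min(size + 1, max_size)   # increase size of SCQ
    return max(size - 1, min_size)       # decrease size of SCQ

# Example: with 5 late vs. 6 useful prefetches, 2*5 > 6, so the SCQ grows.
print(adapt_scq_size(8, False, 5, 6))   # -> 9
print(adapt_scq_size(8, True, 5, 6))    # buffer full -> stays 8
print(adapt_scq_size(8, False, 1, 6))   # mostly useful -> shrinks to 7
```

Growing the SCQ effectively lengthens the prefetch distance, so the same loop keeps more fetches in flight when memory is the bottleneck.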

  9. Decoupling Programs for HiDISC-3 (Discrete Convolution - Inner Loop)

  Inner Loop Convolution:

      for (j = 0; j < i; ++j)
          y[i] = y[i] + (x[j] * h[i-j-1]);

  Computation Processor Code:

      while (not end of loop)
          y = y + (x * h);
      send y to SDQ

  Access Processor Code:

      for (j = 0; j < i; ++j) {
          load (x[j]);
          load (h[i-j-1]);
          GET_SCQ;
      }
      send (EOD token)
      send address of y[i] to SAQ

  Cache Management Processor Code:

      for (j = 0; j < i; ++j) {
          prefetch (x[j]);
          prefetch (h[i-j-1]);
          PUT_SCQ;
      }
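To make the split concrete, here is a minimal software simulation of the AP and CP streams cooperating through queues for one inner-loop step. Python deques stand in for the hardware Load Queue (LQ) and Store Address Queue (SAQ); the function and queue names are illustrative assumptions, and the two roles run sequentially here rather than on separate processors as in real HiDISC.

```python
from collections import deque

def convolution_step(x, h, i):
    """Compute y[i] of the discrete convolution via a decoupled split.

    The Access Processor (AP) role walks the index pattern and enqueues
    operand values on a load queue, then an end-of-data (EOD) token, and
    the store address on a store address queue. The Computation Processor
    (CP) role never computes addresses: it only consumes operand pairs
    from the load queue until it sees the EOD token.
    """
    load_queue = deque()
    store_addr_queue = deque()

    # Access Processor code: issue loads, EOD token, and the store address.
    for j in range(i):
        load_queue.append(x[j])
        load_queue.append(h[i - j - 1])
    load_queue.append(None)        # EOD token
    store_addr_queue.append(i)     # address of y[i] to SAQ

    # Computation Processor code: multiply-accumulate until EOD.
    y_i = 0
    while True:
        xv = load_queue.popleft()
        if xv is None:             # EOD token reached
            break
        hv = load_queue.popleft()
        y_i += xv * hv
    return store_addr_queue.popleft(), y_i

# Example: y[3] = x[0]*h[2] + x[1]*h[1] + x[2]*h[0] = 6 + 10 + 12 = 28
print(convolution_step([1, 2, 3], [4, 5, 6], 3))   # -> (3, 28)
```

Because the AP runs ahead and fills the queue, the CP's loop body sees operands as register reads rather than cache misses, which is exactly the latency-hiding effect the slides describe.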

  10. Benchmarks

  Benchmark   Source of Benchmark               Lines of Code   Description                                  Data Set Size
  LLL1        Livermore Loops [45]              20              1024-element arrays, 100 iterations          24 KB
  LLL2        Livermore Loops                   24              1024-element arrays, 100 iterations          16 KB
  LLL3        Livermore Loops                   18              1024-element arrays, 100 iterations          16 KB
  LLL4        Livermore Loops                   25              1024-element arrays, 100 iterations          16 KB
  LLL5        Livermore Loops                   17              1024-element arrays, 100 iterations          24 KB
  Tomcatv     SPECfp95 [68]                     190             33x33-element matrices, 5 iterations         <64 KB
  MXM         NAS kernels [5]                   113             Unrolled matrix multiply, 2 iterations       448 KB
  CHOLSKY     NAS kernels                       156             Cholesky matrix decomposition                724 KB
  VPENTA      NAS kernels                       199             Invert three pentadiagonals simultaneously   128 KB
  Qsort       Quicksort sorting algorithm [14]  58              Quicksort                                    128 KB
