HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro
University of Southern California
http://www-pdpc.usc.edu
19 May 2000
HiDISC: Hierarchical Decoupled Instruction Set Computer

System flow: Sensor Inputs → Application (FLIR, SAR, VIDEO, ATR/SLD, Scientific) → Decoupling Compiler → HiDISC Processor (one processor per memory-hierarchy level: Registers, Cache, Memory) → Dynamic Database → Situational Awareness

New Ideas
• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput

Impact
• 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
• 7.4x speedup for matrix multiply over an in-order issue superscalar processor
• 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
• Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
• Allows the compiler to solve indexing functions for irregular applications
• Reduced system cost for high-throughput scientific codes

Schedule (April 98 start to May 01 end)
• April 98: Defined benchmarks; performed instruction-level simulations on hand-compiled benchmarks; defined the HiDISC architecture
• April 99: Completed simulator
• April 00: Continue simulations of more benchmarks (SAR); generate performance statistics and evaluate architecture design; update simulator
• May 01: Develop and test a full decoupling compiler

University of Southern California, Alvin M. Despain, Jean-Luc Gaudiot, Manil Makhija and Wonwoo Ro
HiDISC: Hierarchical Decoupled Instruction Set Computer (Motivation)

Technological trend: memory latency is getting longer relative to microprocessor speed (which improves roughly 40% per year)
Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
Present solutions: larger caches, hardware prefetching, software prefetching, multithreading
Present Solutions

Larger caches:
• Slow
• Work well only if the working set fits in the cache and there is temporal locality
• Cannot be tailored for each application

Hardware prefetching:
• Behavior is based only on past and present execution-time behavior

Software prefetching:
• Must ensure the overheads of prefetching do not outweigh the benefits, which forces conservative prefetching
• Adaptive software prefetching is required to change the prefetch distance during run-time
• Hard to insert prefetches for irregular access patterns

Multithreading:
• Solves the throughput problem, not the memory latency problem
The HiDISC Approach

Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach:
• Add a processor to manage prefetching, hiding the prefetch overhead
• The compiler explicitly manages the memory hierarchy
• The prefetch distance adapts to the program's run-time behavior
What's HiDISC

The compiler splits a program into three instruction streams, one per processor:
• Computation instructions drive the Computation Processor (CP), which works out of the registers
• Access instructions drive the Access Processor (AP), which works out of the cache
• Cache-management instructions drive the Cache Management Processor (CMP), which manages the second-level cache and main memory

• A dedicated processor for each level of the memory hierarchy
• Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
• Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
• Exploit instruction-level parallelism without extensive scheduling hardware
• Zero-overhead prefetches for maximal computation throughput
Decoupled Architectures

• MIPS (conventional): a single 8-way processor with registers and cache
• DEAP (decoupled): a 5-way Computation Processor (CP) and a 3-way Access Processor (AP), connected through load, store-address, and store-data queues
• CAPP (decoupled): a 5-way processor and a 3-way Cache Management Processor (CMP), connected through a Slip Control Queue
• HiDISC (new, decoupled): a 2-way CP, a 3-way AP, and a 3-way CMP, combining DEAP's load/store queues with CAPP's Slip Control Queue

All four configurations sit above a second-level cache and main memory.
Slip Control Queue

The Slip Control Queue (SCQ) adapts dynamically:

if (prefetch_buffer_full())
    don't change size of SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;

• Late prefetches: prefetched data arrived after the load had been issued
• Useful prefetches: prefetched data arrived before the load had been issued
Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Original inner loop:

for (j = 0; j < i; ++j)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (not end of loop)
    y = y + (x * h);
send y to SDQ

Access Processor code:

for (j = 0; j < i; ++j) {
    load(x[j]);
    load(h[i-j-1]);
    GET_SCQ;
}
send (EOD token)
send address of y[i] to SAQ

Cache Management Processor code:

for (j = 0; j < i; ++j) {
    prefetch(x[j]);
    prefetch(h[i-j-1]);
    PUT_SCQ;
}
Benchmarks

Benchmark | Source of Benchmark | Lines of Code | Description | Data Set Size
LLL1 | Livermore Loops [45] | 20 | 1024-element arrays, 100 iterations | 24 KB
LLL2 | Livermore Loops | 24 | 1024-element arrays, 100 iterations | 16 KB
LLL3 | Livermore Loops | 18 | 1024-element arrays, 100 iterations | 16 KB
LLL4 | Livermore Loops | 25 | 1024-element arrays, 100 iterations | 16 KB
LLL5 | Livermore Loops | 17 | 1024-element arrays, 100 iterations | 24 KB
Tomcatv | SPECfp95 [68] | 190 | 33x33-element matrices, 5 iterations | <64 KB
MXM | NAS kernels [5] | 113 | Unrolled matrix multiply, 2 iterations | 448 KB
CHOLSKY | NAS kernels | 156 | Cholesky matrix decomposition | 724 KB
VPENTA | NAS kernels | 199 | Invert three pentadiagonals simultaneously | 128 KB
Qsort | Quicksort sorting algorithm [14] | 58 | Quicksort | 128 KB