CUstom Built HEterogeneous Multi-Core ArCHitectures (CUBEMACH): Breaking the Conventions
Nagarajan Venkateswaran, Director, Waran Research Foundation
Karthikeyan Palavedu Saravanan, Nachiappan Chidambaram Nachiappan: Research Trainees (2008 - 2010), Waran Research Foundation
Aravind Vasudevan, Balaji Subramaniam, Ravindhiran Mukundarajan: Former Research Trainees (2007 - 2009), Waran Research Foundation
Chennai, India
Motivation: Heterogeneity Redefined
• Cost-effective, high-performance Custom Built Heterogeneous Multi-Core node design for a wider class of applications
  – Inter- and intra-core heterogeneity
• Breaking the Conventions
  – Multiple User Multiple Application without space-time sharing in a cluster: cost sharing across users
  – Single User Multiple Application without space-time sharing (non-multiprogramming): cost sharing across applications
Overview
• Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)
• Design Space
  – Architectural Space
  – Optimization Space
  – Customer Vendor Interaction
  – Simulation Space
• CUBEMACH Design and Simulation Tool Framework
• Conclusion
Custom Built Heterogeneous Multi-Core Architectures (CUBEMACH)
• CUBEMACH promises
  – Increased resource utilization
  – Multiple-application-flavored architectures
  – Elimination of space-time sharing at the quantum level during multiple application execution
  – Manufacturing and operational cost reduction
CUBEMACH Design Paradigm
Architectural Design Space - CUBEMACH
[Figure: CUBEMACH Architectural Space. Labels: Memory (SRAM, DRAM), ALISA, ONNET, Compiler-On-Silicon (PCOS, SCOS), ALFU]
Architectural Space
• Why ALU? Why not ALFU?
  – Hardwired units
  – Design: homogeneously structured
  – Reduced instruction generation and fetches: employs a higher-level ISA
  – Reduced memory-functional unit interaction
  – Helps execute multiple applications without space and time sharing
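To make the "reduced instruction generation and fetches" point concrete, the sketch below counts instructions for a naive n x n matrix multiplication issued as scalar ALU operations versus a single algorithm-level instruction. The counting model (one instruction per scalar multiply or add, loads and stores ignored, and one instruction per invocation of a hardwired Matmul unit regardless of n) and the function names are illustrative assumptions, not figures from the CUBEMACH work.

```python
def scalar_alu_instruction_count(n: int) -> int:
    """Scalar arithmetic instructions for a naive n x n matrix multiply.

    Each of the n*n output elements needs n multiplies and n-1 adds.
    Loads, stores and loop overhead are deliberately ignored.
    """
    multiplies = n * n * n
    adds = n * n * (n - 1)
    return multiplies + adds


def alfu_instruction_count(n: int) -> int:
    """Assumption: one algorithm-level instruction triggers a hardwired
    Matmul unit that handles the whole n x n product."""
    return 1


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, scalar_alu_instruction_count(n), alfu_instruction_count(n))
```

Under this toy model the scalar count grows cubically with n while the algorithm-level count stays constant, which is the intuition behind moving to a higher-level ISA.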
Algorithm Level Functional Unit (ALFU)
[Figure: ALFU / HLFU design parameters. Labels include: delay requirements, centralized / decentralized control, ALFU types, class of algorithms, scalar unit size, class of memory units, ALFU / HLFU number, characteristics of MIP processor cells, input bits, grain size, type of MIP architecture cell]
ALU vs ALFU Instruction Generation Results
Architectural Space Contd…
Sample Algorithm Level Functional Units
• Scalar Units
  – Scalar Adder / Subtractor
  – Scalar Multiplier
  – Scalar Divider
  – Comparator
  – Sorter
  – Multiple Operand Adder
  – Min / Max Finder
• Vector Units
  – Inner Product
• Matrix Centric Units
  – Matmul
  – Matadd
  – Chain Matadd
• Graph Theoretic Units
  – Graph Traversal Unit (BFS, DFS)
  – KL Graph Partitioning
Architectural Space Contd…
ALISA – Algorithm Level Instruction Set Architecture
• Algorithm Level Instructions
• Triggers ALFUs
• ALISA Multiple VLIWs
• ALISA for heterogeneous multi-cores
Architectural Space Contd…
Hierarchical Compilation Scheme
• PCOS partitions a problem into sub-problems – Level 1
• SCOS partitions the sub-problems into ALFU-level instructions – Level 2
[Figure: Application → PCOS → Sub-Application → SCOS → Instruction]
ALISA & Compiler On Silicon
[Figure: design parameters of ALISA and the Compiler-On-Silicon (PCOS, SCOS). Labels include: number of ALISA fields, field length, types of instructions in the ISA, decoding / encoding logic, number and types of instructions per ALISA, rate of output generation, scheduler output length and generation rate, number of parallel units, scheduler processing rate, number of I/O ports]
Architectural Space Contd…
ON-Node-Network Architecture
[Figure labels: Global Router, Local Router, Sub-Local Router, ALFU Population, Core, 2D Torus]
Architectural Space Contd…
ON-Node-Network Architecture
[Figure labels: H-Tree Topology, Global Router, Local Router, Sub-Local Router]
Comparison of Conventional NOCs with ONNET
                     ONNET                                    Conventional NOCs
Type of Switch       MIN                                      Crossbar
Number of Routers    N * log2(N)                              N^2
Hierarchy            Yes                                      No
Switching Latency    log2(Number of Inputs) * Switch Delay    Number of Inputs * Switch Delay
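The formulas in the comparison above can be evaluated directly. The following is a minimal sketch that only transcribes the table's expressions; the function names and the sample values of N and switch delay are assumptions chosen for illustration.

```python
import math


def onnet_router_count(n: int) -> float:
    """ONNET (MIN-based) router count from the table: N * log2(N)."""
    return n * math.log2(n)


def crossbar_router_count(n: int) -> int:
    """Conventional crossbar NOC router count from the table: N^2."""
    return n * n


def onnet_switching_latency(num_inputs: int, switch_delay: float) -> float:
    """ONNET switching latency: log2(number of inputs) * switch delay."""
    return math.log2(num_inputs) * switch_delay


def crossbar_switching_latency(num_inputs: int, switch_delay: float) -> float:
    """Conventional NOC switching latency: number of inputs * switch delay."""
    return num_inputs * switch_delay


if __name__ == "__main__":
    switch_delay = 2.0  # arbitrary time units, illustrative only
    for n in (16, 64, 256):
        print(
            n,
            onnet_router_count(n),
            crossbar_router_count(n),
            onnet_switching_latency(n, switch_delay),
            crossbar_switching_latency(n, switch_delay),
        )
```

The gap between the logarithmic and linear (or quadratic) terms widens as N grows, which is the scaling argument the table makes for the hierarchical MIN-based design.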
On Node Network Architecture
[Figure: ONNET design parameters. Labels include: routers, route decoding logic and decoding rate, number of decoders, data rate, buffer / stack size, number of buffers, I/O ports, logical grouping, length of address, HLFU count and organization, input traffic, type of MIN, packetization, packet size, packet switching, input / output data size, word length, path latency, destination router ID, destination HLFU ID]
Optimization Space
[Figure: CUBEMACH optimization flow, steps numbered 1 to 9. Labels: Input Initial Architecture Parameters, Candidate Core Formation, Power Model, Performance Model, Calculated Power to Performance Ratio, Desired Power to Performance Ratio, Simulated Annealing, Game Theory, Selected Parameters, Final CUBEMACH, Multiple Applications; linked to the CUBEMACH Architectural Space and the CUBEMACH Simulation Space]
Optimization Space
• Generates an optimized CUBEMACH for input specifications such as:
  – Power
  – Performance
  – Cost
  – Initial architecture
• Power and performance models
• Uses Game Theory (GT) and Simulated Annealing (SA) to optimize power and performance
• Uses Kernighan-Lin (KL) graph partitioning for core grouping
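As a rough illustration of the simulated annealing part only (game theory and KL core grouping are not covered), the sketch below searches over a single architecture parameter so that a calculated power-to-performance ratio approaches a desired one. The power and performance models, the single `num_alfus` parameter, and all constants are toy assumptions made up for this example; they are not the CUBEMACH models.

```python
import math
import random


def power_model(num_alfus: int) -> float:
    # Toy assumption: fixed static power plus a linear per-unit cost.
    return 5.0 + 2.0 * num_alfus


def performance_model(num_alfus: int) -> float:
    # Toy assumption: diminishing returns as units are added.
    return 10.0 * math.sqrt(num_alfus)


def cost(num_alfus: int, desired_ratio: float) -> float:
    """Distance between calculated and desired power-to-performance ratio."""
    calculated = power_model(num_alfus) / performance_model(num_alfus)
    return abs(calculated - desired_ratio)


def anneal(desired_ratio: float, start: int = 1, lo: int = 1, hi: int = 256,
           steps: int = 2000, t0: float = 1.0, cooling: float = 0.995,
           seed: int = 0) -> int:
    """Simulated annealing over the number of ALFUs (hypothetical knob)."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    current = best = start
    temperature = t0
    for _ in range(steps):
        # Propose a neighbour one unit away, clamped to [lo, hi].
        candidate = min(hi, max(lo, current + rng.choice((-1, 1))))
        delta = cost(candidate, desired_ratio) - cost(current, desired_ratio)
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = candidate
        if cost(current, desired_ratio) < cost(best, desired_ratio):
            best = current
        temperature *= cooling  # geometric cooling schedule
    return best


if __name__ == "__main__":
    chosen = anneal(desired_ratio=1.0)
    print(chosen, cost(chosen, 1.0))
```

In a real flow the cost would come from the power and performance models fed by the simulator, and the search space would span many architecture parameters rather than one.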
Sample CUBEMACH Architecture
CUBEMACH Design Implementation: Supercomputer On Chip (SCOC) IP Cores
SCOC IP Cores
• ALFUs designed as SCOC IP cores
• Soft IP cores
• Coarse-grained reusable soft IP cores
• Scalable IP cores
Optimization Space Contd…
Customer Vendor Interaction
[Figure: customer-vendor design flow. Labels: Customers (App 1 to App 4, multiple simultaneous applications), Application Requirements – Power & Performance, CUBEMACH Node Manufacturers / System Vendors, Workload Generation, Intermediate Format, SCOC IP Cores, CUBEMACH Design Space, Initial Candidate Architecture, CUBEMACH Simulator, CUBEMACH Optimizer, Intermediate Architecture, Final Optimized Heterogeneous Multi-Core Architecture, Layout, Fabrication of IP Cores]
CUBEMACH Simulator
• Pthreads-based simulator
• Evaluates the candidate CUBEMACH architecture
• Feeds results to the CUBEMACH optimizer
• The CUBEMACH Optimization Engine (COE) produces the optimized architecture
• Simulation and optimization: an iterative process
• Consists of
  – ALFU sub-simulator
  – COS sub-simulator
  – ONNET sub-simulator
  – Memory sub-simulator
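The iterative simulate-then-optimize loop described above can be sketched as follows. This is only a structural illustration: Python threads stand in for the Pthreads-based simulator, the four sub-simulators return made-up latency figures, and the architecture fields (`alfus`, `routers`, `memory_banks`) and the bottleneck-driven `optimize_step` are hypothetical stand-ins for the real simulator and the COE.

```python
import math
from concurrent.futures import ThreadPoolExecutor


# Hypothetical sub-simulators: each maps a candidate architecture
# (a plain dict) to a toy latency figure.
def simulate_alfu(arch):
    return 100.0 / arch["alfus"]


def simulate_cos(arch):
    return 10.0


def simulate_onnet(arch):
    return 2.0 * math.sqrt(arch["routers"])


def simulate_memory(arch):
    return 50.0 / arch["memory_banks"]


SUB_SIMULATORS = {
    "alfu": simulate_alfu,
    "cos": simulate_cos,
    "onnet": simulate_onnet,
    "memory": simulate_memory,
}


def simulate(arch):
    """Run all sub-simulators concurrently, one thread each."""
    with ThreadPoolExecutor(max_workers=len(SUB_SIMULATORS)) as pool:
        futures = {name: pool.submit(fn, arch)
                   for name, fn in SUB_SIMULATORS.items()}
        return {name: future.result() for name, future in futures.items()}


def optimize_step(arch, results):
    """Toy stand-in for the optimizer: grow the resource behind the
    slowest sub-simulator, if that resource is tunable here."""
    updated = dict(arch)  # leave the caller's candidate untouched
    bottleneck = max(results, key=results.get)
    knob = {"alfu": "alfus", "memory": "memory_banks"}.get(bottleneck)
    if knob is not None:
        updated[knob] += 1
    return updated


def design_loop(initial_arch, iterations):
    """Alternate simulation and optimization for a fixed number of rounds."""
    arch = dict(initial_arch)
    for _ in range(iterations):
        arch = optimize_step(arch, simulate(arch))
    return arch


if __name__ == "__main__":
    start = {"alfus": 2, "routers": 4, "memory_banks": 2}
    print(design_loop(start, iterations=5))
```

Running the sub-simulators concurrently and feeding their combined results back into the optimizer mirrors the simulator-to-optimizer feedback the slide describes, without claiming anything about the actual tool's internals.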
[Figure: CUBEMACH Simulator]
What we have seen…
Integrated CUBEMACH Design Paradigm…