JADE Heterogeneous Multiprocessor Design & Simulation Environment Jiang Xu
Acknowledgement
  Intel Labs: Bin Li, Ravi Iyer, Ramesh Illikkal
  HP Labs: Qiong Cai
  Current PhD students: Rafael Kioji Vivas Maeda, Peng Yang, Zhe Wang, Haoran Li, Zhehui Wang, Zhongyuan Tian, Zhifei Wang, Duong Huu Kinh Luan, Xuanqi Chen
  Past members: Xiaowen Wu, Weichen Liu, Xuan Wang, Yaoyao Ye
  2016-06-09 Jiang Xu (HKUST)
PERFECT Computing Systems
  Design targets
    Performance
    Energy efficiency
    Reliability
    Functionality
    Extensibility
    Cost
    Testability
  More cores and memory on a chip and in a system
  Heterogeneous
Huge Design Space to Explore
  Application
    IoT/IoE, mobile, data center, HPC, mainframe …
    Wireless communication, multimedia processing, machine learning, database …
  Processor
    CPU, GPU, FPGA, DSP, ASIP, ASIC …
    Homogeneous vs. heterogeneous multiprocessor
  Memory and storage
    Hierarchy
    Cache coherence
    DRAM, SRAM, flash, STT-RAM …
  Interconnect
    Ad-hoc, bus, NoC, hybrid …
    Regular vs. irregular topology
    Protocol: routing, flow control, congestion control …
    Switch/router architecture
    Electrical, optical, RF …
  Support
    FinFET, FD-SOI, GAA, CNT FET …
    Power delivery and management
    Clock distribution and management
    Thermal, aging, noise …
  Peripherals
    Network interface, user interface, management …
  [Figure: example systems, including a mesh of cores and bus-based designs built from CPU, GPU, DSP, FPGA, MPEG and RISC blocks, SRAM, memory controllers, arbiters, bridges, USB, LCD controllers, power managers and GPIO]
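To give a feel for how quickly the options on this slide multiply, the sketch below counts the combinations when one option is picked from each of five categories. The option names are copied from the slide; treating each category as a single independent choice, and the names `design_axes`, `count_design_points` and `enumerate_design_points`, are assumptions made only for this illustration.

```python
from itertools import product
from math import prod

# Options copied from the slide. Treating every category as one independent
# choice is a simplifying assumption for illustration, not a claim about
# how real systems are composed.
design_axes = {
    "processor": ["CPU", "GPU", "FPGA", "DSP", "ASIP", "ASIC"],
    "interconnect": ["ad-hoc", "bus", "NoC", "hybrid"],
    "medium": ["electrical", "optical", "RF"],
    "memory": ["DRAM", "SRAM", "flash", "STT-RAM"],
    "device": ["FinFET", "FD-SOI", "GAA", "CNT FET"],
}


def count_design_points(axes):
    """Number of combinations when one option is picked per axis."""
    return prod(len(options) for options in axes.values())


def enumerate_design_points(axes):
    """Yield each combination as a dict mapping axis name to chosen option."""
    names = list(axes)
    for combo in product(*(axes[name] for name in names)):
        yield dict(zip(names, combo))


total = count_design_points(design_axes)
first = next(enumerate_design_points(design_axes))
print(total, first)
```

Even this coarse view, which ignores topology, protocol, core counts and every numeric parameter, already yields more than a thousand candidate designs.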
Simulation-based Architecture Exploration
  Benchmark applications with sample input data sets
  System software
  Cycle-accurate "full-system" architecture simulator
  Speed-up techniques
    Simplify interconnect, memory, processor, OS, etc.
    Sampling application executions
    Sampling inputs
    Break causality to better parallelize simulations
    Hybrids of the above techniques
  [Figure: simulation flow. Benchmark applications (programs, sample inputs) → compilation → instructions; system software (operating system, device drivers); both run on the architecture simulator, which models the architecture under evaluation]
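The "sampling application executions" technique can be sketched as follows: simulate only every k-th interval of the run in detail and extrapolate to the whole execution. This is a minimal illustration, not the method of any particular simulator; `simulate_interval` is a hypothetical stand-in for an expensive detailed simulation, and its cycle counts are made up.

```python
def simulate_interval(index):
    """Hypothetical stand-in for a detailed simulation of one execution
    interval. The returned cycle count is synthetic illustration data."""
    return 1000 + 50 * (index % 4)


def estimate_total_cycles(num_intervals, period):
    """Simulate every `period`-th interval and extrapolate to the full run.

    Returns the estimated total cycle count and the number of intervals
    that were actually simulated.
    """
    sampled = [simulate_interval(i) for i in range(0, num_intervals, period)]
    mean_cycles = sum(sampled) / len(sampled)
    return mean_cycles * num_intervals, len(sampled)


# Reference value: simulate all intervals (the slow path sampling avoids).
exact = sum(simulate_interval(i) for i in range(1000))

# Sampled estimate: roughly one in seven intervals is simulated.
estimate, simulated = estimate_total_cycles(1000, period=7)
relative_error = abs(estimate - exact) / exact
print(simulated, relative_error)
```

The sampling period matters: if it lines up with a periodic pattern in the workload, the estimate becomes biased, which is one reason sampled results need validation against full runs.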
The Good, the Bad and the Ugly
  Good for detailed/late-stage design
    Tweaking, testing, debugging …
  Bad for early design space exploration
    Too slow to provide essential system statistics such as average and worst-case performance, energy efficiency, cost …
  Ugly for heterogeneous systems
    Compilation for heterogeneous ISAs, hardware accelerator, FPGA …
    OS support of new large-scale heterogeneous systems without drivers
  [Figure: simulation flow. Benchmark applications (programs, sample inputs) → compilation → instructions; system software (operating system, device drivers); architecture simulator; architecture under evaluation]
Joint Application/Architecture Design Exploration
  COSMIC
    Application models for heterogeneous multiprocessor system explorations
  JADE
    Heterogeneous multiprocessor system design and simulation platform
  [Figure: exploration flow]
    Applications: algorithms, programs, sample inputs
    COSMIC application analysis: algorithm partition; computation, communication, and memory analysis and profiling
    Application models: TCG models, statistical application models, recorded application models
    Mapping, routing, scheduling algorithms: task mapping & scheduling, traffic routing plan, memory space mapping
    JADE: architecture under evaluation
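The task mapping and scheduling step in this flow can be illustrated with a small greedy sketch. It is not JADE's or COSMIC's algorithm or data format: the task graph, the processor names and the execution times are hypothetical, and communication cost between processors is ignored for brevity.

```python
# Hypothetical task graph: task -> (predecessors, execution time per
# processor type). Tasks are listed in topological order.
tasks = {
    "load":   ([],                {"cpu": 4,  "dsp": 6}),
    "fft":    (["load"],          {"cpu": 10, "dsp": 3}),
    "filter": (["load"],          {"cpu": 5,  "dsp": 4}),
    "store":  (["fft", "filter"], {"cpu": 2,  "dsp": 5}),
}

# Hypothetical heterogeneous platform: processor name -> processor type.
processors = {"cpu0": "cpu", "dsp0": "dsp"}


def schedule(tasks, processors):
    """Greedy earliest-finish-time mapping and scheduling.

    Each task is placed on the processor where it would finish soonest,
    given when its predecessors finish and when the processor is free.
    Returns the plan (task -> processor, start, finish) and the makespan.
    """
    proc_free = {p: 0 for p in processors}
    finish = {}
    plan = {}
    for name, (preds, cost) in tasks.items():
        ready = max((finish[p] for p in preds), default=0)

        def finish_on(proc):
            return max(ready, proc_free[proc]) + cost[processors[proc]]

        best = min(processors, key=finish_on)
        start = max(ready, proc_free[best])
        finish[name] = start + cost[processors[best]]
        proc_free[best] = finish[name]
        plan[name] = (best, start, finish[name])
    return plan, max(finish.values())


plan, makespan = schedule(tasks, processors)
for task, (proc, start, end) in plan.items():
    print(f"{task}: {proc} [{start}, {end})")
print("makespan:", makespan)
```

A fuller model would add a traffic routing plan and memory space mapping, which the slide lists alongside task mapping and scheduling.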
JADE Heterogeneous Multiprocessor Simulation Environment
  JADE (Joint Application/Architecture Design Exploration)
    Heterogeneous system designs
    Early design space exploration
    Systematic system evaluation
  Highlights
    Statistical, recorded and synthetic application models
    Network-on-chip and off-chip networks
    Optical and electrical interconnects
    Memory subsystem
    Built-in power analysis
  [Figure: JADE block diagram]
    Hardware architecture: processor architecture, memory hierarchy, coherence protocol
    Network architecture: optical, electrical
    Architecture template and energy library: processor library, optical and electrical network library, memory and cache coherence library
    COSMIC benchmark: recorded application model, statistical application model, synthetic application model
    MRS (mapping, routing, scheduling) algorithms: task mapping and scheduling, communication traffic routing plan, memory space mapping
    JADE: processor, memory, network, peripherals
    Output: memory access trace, system behavior, performance analysis, energy analysis
COSMIC Heterogeneous Multiprocessor Benchmark
  Application               Description
  Machine Learning - FMP    Financial market prediction using machine learning
  Machine Learning - ALIP   Machine learning based image indexing
  Molecular Dynamics        Simulating molecular dynamics when molecules hit surfaces of solid atoms
  Ray Tracing               3D scenes rendering
  Ultrasound                Medical diagnostics using 2D/3D ultrasound imaging
  Fast Fourier Transform    Fast Fourier Transform with complex number inputs
  LDPC Encoder              Low-density parity-check code encoder
  TURBO Decoder             Turbo code decoder
  Reed-Solomon              Reed-Solomon code encoder and decoder
  Collaborating with application experts
  More applications are under development
Exploration Cases
  I2CON inter/intra-chip optical network
  SUOR optical NoC
  Electrical mesh-based NoC
  Memory hierarchy
    Private L1 caches
    Shared L2 cache, 16 banks
    16 memory controllers
  Processor core
    ARM-v7a
    7nm, 1GHz, 0.6V
  [Figure: chip layout with a grid of optical network interfaces (ONI), memory controllers along the edges and a controller. Legend: cluster of cores, core, electrical wire, optical waveguide, optical network interface (ONI)]
Performance and Scalability
Energy Efficiency and Scalability