
KPN on heterogeneous multi / many-core processors, Yukoh Matsumoto



  1. Debugging of application software based on KPN on heterogeneous multi / many-core processors
     Yukoh Matsumoto, Ph.D., President & CEO, Architect, TOPS Systems Corp. (multicore / manycore provider in Japan)
     MAD 2013

  2. Agenda
     - Porting Application onto Heterogeneous Manycore
     - Case Study: Real-Time Ray Tracing, 800 TFLOPS on Desk Top Machine
     - Architecture & Algorithm Co-Design
     - Deep Performance Analysis
     - Software Partitioning into Kahn Process Network
     - System Performance Modeling
     - System Performance Simulation
     - Debugging Issues and Challenges
     - Working on Better Solutions
     - Conclusions

  3. Parallel Processing Goal
     [Figure labels: Single CPU; Dual Core CPU; CPU + GPU; TOPSTREAM Cluster / Core. Cited source: "Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs", Michael I. Gordon, William Thies, and Saman Amarasinghe, MIT]
     - Exploit more parallelism for higher performance

  4. Our Multi-/Many-core SW Development Flow: Sequential to Distributed Processing
     [Figure labels, in flow order: Sequential Computation; Decomposition; Fine Grain Tasks; Multiple Processes, i.e. Distributed Processing (Kahn Process Network); Mapping; Distributed Processing on Multicore. Verification labels: Functional Verification; Performance Verification]
     - Debugging of both functionality and performance

  5. Programming Model based on KPN
     - Basic distributed processing based on KPN
     - A process consists of multiple tasks (single executable)
     - A process works with local memory
     - Point-to-point, unidirectional communication channels
     - Finite FIFO depth within a channel
     - Read-only access to global memory is allowed
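The programming model on this slide can be illustrated with a minimal sketch. It is not TOPS Systems code: it assumes Python threads stand in for processes, a bounded `queue.Queue` stands in for a finite-depth FIFO channel, and `DONE`, `producer`, `scale`, `consumer`, and `run_network` are hypothetical names chosen for the example.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not taken from the slides


def producer(out_ch, n):
    """Source process: writes n tokens, blocking whenever the FIFO is full."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)


def scale(in_ch, out_ch, factor):
    """Middle process: blocking read, compute on local data, blocking write."""
    while True:
        token = in_ch.get()
        if token is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(token * factor)


def consumer(in_ch, results):
    """Sink process: drains its input channel into a list."""
    while True:
        token = in_ch.get()
        if token is DONE:
            return
        results.append(token)


def run_network(n=8, depth=2):
    # Each channel is point-to-point and unidirectional, with a finite depth.
    a_to_b = queue.Queue(maxsize=depth)
    b_to_c = queue.Queue(maxsize=depth)
    results = []
    procs = [
        threading.Thread(target=producer, args=(a_to_b, n)),
        threading.Thread(target=scale, args=(a_to_b, b_to_c, 3)),
        threading.Thread(target=consumer, args=(b_to_c, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(run_network())
```

Because every process only does blocking reads and writes on its own channels, the result does not depend on thread scheduling or on the FIFO depth chosen; only the timing changes.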

  6. Experiences: Sequential to Distributed Processing
     - Computer Vision
       - SIFT: 10 cores, 4 cores, 8 cores
       - Haar-Like: 4 cores
       - SVM: 4 cores
     - Computer Graphics
       - Ray Tracing: 73 cores
     - Codec
       - H.264 Decoder: 10 cores
       - JPEG Decoder: 10 cores
     - Wireless Communication
       - 802.11b MAC and Baseband: 4 cores
     Distributed Processing on Heterogeneous Multicore / Manycore

  7. Case Study: TOPSTREAM™ RTRT
     Ultra-Accurate Real-Time Ray Tracing
     - Color model with 35 bands
     - Rendering on free surface (Bezier)
     - HD (1920 x 1080 pixels) @ 30 frame/s
     Performance Requirement
     - ≒ 800 TFLOPS / system; 88 TFLOPS / chip
     - Intel CPU: 100k chips; NVIDIA GPU: 20k ~ 30k chips; TOPSTREAM: 9 chips
     LSI Design (Estimated)
     - Technology: TSMC 45nm
     - Clock Frequency: 750 MHz
     - Chip Size: 17mm × 17mm
     - Logic: 267.7 MGate (73 Heterogeneous Manycore)
     - Memory: 23 Mbit
     TOPSTREAM™ RTRT (Desk Top Machine)
     - 60cm × 60cm × 20cm = 72,000 cm³
     - Power Consumption: 1000 W (max)
     - Heterogeneous Many Core: 0.88 TFLOPS/W
     [Block diagram labels. Master Node: 64-bit RISC processor with I$ and D$, L2-I memory (code) 96 kByte, L2-D memory (data) 784 kByte, memory interface, I/O, peripheral bus, bus master, node control, CPU bus, bus bridge, HDD. On-chip global bus (TOPSTREAM™ bus) with distributed arbitration. Slave Node: MMP, clusters C0 to C7, bus bridge. Cluster: L1-I memory 64 kByte, L1-D memory 128 kByte, inter-core event interface, local bus (TOPSTREAM™ bus), Core0 to Core7, each with L0. A cluster includes 9 heterogeneous cores.]
     (Image generated by visual simulation)
     ※ Joint R&D with TOYOTA Motor & NIHON UNISYS
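The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the deck (9 chips, 88 TFLOPS per chip, HD at 30 frame/s, and the "88 TFLOPS @ 100 W" figure from the co-design slide); the per-pixel budget is a derived illustration, not a number the slides state, and the variable names are made up for the example.

```python
CHIPS = 9
TFLOPS_PER_CHIP = 88
WATTS_PER_CHIP = 100          # from "88TFLOPS@100W" on the co-design slide
WIDTH, HEIGHT, FPS = 1920, 1080, 30

# 9 chips x 88 TFLOPS = 792 TFLOPS, which the slide rounds to about 800.
system_tflops = CHIPS * TFLOPS_PER_CHIP

# Pixels that must be rendered every second at HD, 30 frame/s.
pixels_per_second = WIDTH * HEIGHT * FPS

# Derived budget: floating-point operations available per rendered pixel.
flop_per_pixel = system_tflops * 1e12 / pixels_per_second

# Per-chip efficiency, matching the 0.88 TFLOPS/W quoted on the slide.
chip_tflops_per_watt = TFLOPS_PER_CHIP / WATTS_PER_CHIP

if __name__ == "__main__":
    print(system_tflops, pixels_per_second)
    print(round(flop_per_pixel / 1e6, 1), "MFLOP per pixel")
    print(chip_tflops_per_watt, "TFLOPS/W")
```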

  8. Heterogeneous Multi-Core drives Computer Graphics Paradigm Shift
     - Synthesis: animations, movies, video games
       - Algorithms: polygon-based ray tracing
       - Computer performance requirement: ~ 1 TFLOPS
     - Reproduction: replace prototypes and samples
       - Industrial design & showrooms: automotive, buildings, houses, etc.
       - Algorithms: natural-surface-based ray tracing, photon mapping
       - Computer performance requirement: 100s of TFLOPS or more

  9. Architecture-Algorithm Co-Design for Application Domain Specific Computing
     - Requirements: applications "Ray Tracing" and "Photon Mapping"; 88 TFLOPS @ 100 W, 750 MHz
     - Performance vs. Power Optimization
       - Performance = f × IPC
       - Power = ½ α C V² f
     - Architectural optimization and algorithmic optimization both reduce the walls: ILP wall, memory wall, power wall
     [Figure labels, SW partitioning: Legacy Software (Sequential); Analysis; Splitting; Fine Grain Tasks; Grouping; Process; Distributed Processing (KPN model); Mapping]
     [Figure labels, HW/SW co-design: System Spec; HW Spec; SW Spec; HW design (RTL); SW design; Performance & Power Simulation with TS-ISIM (ISS); HW/SW Co-Verification (RTL); Distributed Processing]
     [Figure labels, TOPSTREAM™ Platform IP: TOPSTREAM™ Architecture; Co-operating Elements; Patents; Multi-Core Base HW (RTL); TOPS_Lib]
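The two formulas on this slide explain why spreading work over more, slower cores can pay off. The sketch below is an illustration under stated assumptions, not data from the talk: the activity factor, capacitance, supply voltages, and IPC are made-up values, the workload is assumed to parallelize perfectly across two cores, and only the 750 MHz clock comes from the slides.

```python
def performance(freq_hz, ipc):
    """Performance = f x IPC (instructions per second)."""
    return freq_hz * ipc


def dynamic_power(alpha, cap_farads, vdd, freq_hz):
    """Power = 1/2 * alpha * C * V^2 * f (dynamic switching power only)."""
    return 0.5 * alpha * cap_farads * vdd ** 2 * freq_hz


# Hypothetical electrical parameters, chosen only to make the ratio visible.
ALPHA, CAP, IPC = 0.2, 1e-9, 1.0
F_FULL, V_FULL = 750e6, 1.0      # 750 MHz is the clock quoted in the deck
F_HALF, V_LOW = F_FULL / 2, 0.8  # assumed: half the clock allows a lower supply

one_core_perf = performance(F_FULL, IPC)
two_core_perf = 2 * performance(F_HALF, IPC)

one_core_power = dynamic_power(ALPHA, CAP, V_FULL, F_FULL)
two_core_power = 2 * dynamic_power(ALPHA, CAP, V_LOW, F_HALF)

power_ratio = two_core_power / one_core_power

if __name__ == "__main__":
    print(one_core_perf == two_core_perf)  # same throughput
    print(round(power_ratio, 2))           # fraction of the original power
```

Under these assumptions the two slower cores match the throughput of the single fast core while the V² term cuts dynamic power; static leakage and communication cost are ignored here.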

  10. Architecture & Algorithm Co-Design: Optimizations go Bidirectional
      - Partitioning based on analysis
        - Functional equivalency checking method
      - Map to KPN
        - Functional partitioning
        - Merging
      - Communication optimization
        - Network topology
        - FIFO
      - Mapping onto cores
        - Static
        - Dynamic
      - Multi-core model optimization
        - Extended instructions
        - Memory hierarchy
        - etc.
      [Figure labels: Distributed Processing model (SW-C1 to SW-C7, OS); KPN model (P-1 to P-4); Multi-core model (CPU, Core1 to Core4, Bus / Network)]
      Can expect more than 10× performance improvement

  11. System Level Architecture
      - Distributed Processing with KPN
        - Non-shared memory processes
        - Zero-Overhead Message Passing mechanism (ZOMP)
      - Combination of Parallelisms
        - Distributed parallel processing (task, pipeline)
        - Data parallelism (high-level, instruction-level)
        - Data parallel (SIMD)
      - Stream Processing (Core)
        - Kernel
        - Stream-In (read message)
        - Stream-Out (write message)
      - Optimization of Core
        - Support for stream processing: background stream
        - Complex instructions: reduction of kernel cycles
        - FIFO support mechanism
        - Reduction of energy for instruction / data supply
      [Figure labels: Kahn Process Network with Task-A and Task-B, local memories, and FIFOs (*); timeline over time of Task-A to Task-D showing data parallel, task parallel, and the combination of data & task parallel; the core can keep processing the kernel]
      Combination of Parallelisms, Stream Processing, and ASIP

  12. Basic concept of stream processing: "Maximize processor efficiency"
      - Careful scheduling of Stream-In and Stream-Out
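Why the scheduling of Stream-In and Stream-Out matters can be seen from a simple cycle-count model. The sketch below is an idealized illustration, not a model from the talk: it assumes every data block has the same stream-in, kernel, and stream-out time, that the transfers run in the background concurrently with the kernel, and that buffering is sufficient. The function names and cycle counts are hypothetical.

```python
def serial_cycles(n_blocks, t_in, t_kernel, t_out):
    """Core waits for each transfer: in, kernel, out run back to back."""
    return n_blocks * (t_in + t_kernel + t_out)


def overlapped_cycles(n_blocks, t_in, t_kernel, t_out):
    """Three-stage pipeline: after the first block fills the pipeline,
    one block completes every max(stage time) cycles."""
    if n_blocks == 0:
        return 0
    return t_in + t_kernel + t_out + (n_blocks - 1) * max(t_in, t_kernel, t_out)


def kernel_utilization(n_blocks, t_kernel, total_cycles):
    """Fraction of the total time the core spends executing the kernel."""
    return n_blocks * t_kernel / total_cycles


if __name__ == "__main__":
    n, t_in, t_k, t_out = 100, 20, 50, 20  # made-up cycle counts
    serial = serial_cycles(n, t_in, t_k, t_out)
    overlapped = overlapped_cycles(n, t_in, t_k, t_out)
    print(serial, round(kernel_utilization(n, t_k, serial), 3))
    print(overlapped, round(kernel_utilization(n, t_k, overlapped), 3))
```

In this model the kernel becomes the only bottleneck once the transfers are hidden behind it, which is the sense in which the core "can keep processing the kernel".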

  13. Real-Time Ray Tracing Performance Analysis: Result Examples
      - Performance requirement analysis
        - Performance / Frame
        - Performance / Area
        - Performance / Ray
        - Performance / Ray Type
        - Performance / Function
        - Computation / Function
        - Memory Access / Function
      - Design decisions
        - Memory allocation
        - Memory hierarchy
        - Processing unit
        - Special instructions
        - Floating point to fixed point
      The big challenge was huge dynamic load changes (max. 3751×)

  14. Partitioning and KPN model for Ray Tracing
      - Partitioning of the ray tracing process
        - Based on processing flow: functional partitioning
      - Mapping of Processor
      - Two levels of functional verification
        - 1st level: each process
        - 2nd level: whole KPN
        - Equivalency test with a number of input / output data sets
      [Figure: Kahn Process Network (KPN) model for Ray Tracing. Legend: Process (local memory); Channel (point-to-point, FIFO). Phases: Ray Generation; Rendering on Intersect with Color and Lighting. Process labels: Input, Ray Generation, Space Check, Voxel Traverse, BBox Check, Surf Rough Check, Surface Intersect, Trim Check, Depth Test, Haikei (background), Create Node, Light, Object Lighting, Background Lighting, Output]
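The two verification levels named on this slide can be sketched as two small checks against the original sequential code. This is a toy illustration under assumptions: the three-stage computation stands in for the real ray tracing stages, and `sequential_reference`, `proc_scale`, `proc_offset`, `proc_square`, and both check functions are hypothetical names.

```python
def sequential_reference(x):
    """Stand-in for the legacy sequential code; returns every intermediate value."""
    a = x * 2
    b = a + 1
    c = b * b
    return [a, b, c]


# The same computation after functional partitioning into three processes.
def proc_scale(x):
    return x + x


def proc_offset(x):
    return x + 1


def proc_square(x):
    return x ** 2


KPN_PROCESSES = [proc_scale, proc_offset, proc_square]


def check_each_process(inputs):
    """1st level: feed each process the reference input of its own stage."""
    failures = []
    for x in inputs:
        golden = sequential_reference(x)
        stage_inputs = [x] + golden[:-1]
        for proc, stage_in, expected in zip(KPN_PROCESSES, stage_inputs, golden):
            if proc(stage_in) != expected:
                failures.append((proc.__name__, x))
    return failures


def check_whole_network(inputs):
    """2nd level: run the whole chain and compare only the final output."""
    failures = []
    for x in inputs:
        token = x
        for proc in KPN_PROCESSES:
            token = proc(token)
        if token != sequential_reference(x)[-1]:
            failures.append(x)
    return failures


if __name__ == "__main__":
    data_set = range(-3, 4)
    print(check_each_process(data_set), check_whole_network(data_set))
```

Running the first level before the second keeps a failure local: a mismatch names one process and one input instead of only reporting that the end-to-end output differs.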

  15. Debugging of Multicore is crazy!
      [Figure: execution of 11 cores with synchronization points]
      - Each core is executing its own instruction stream

  16. Human nature
      - Typical engineers
        - can follow "a single instruction stream" when debugging
        - make mistakes with "two instruction streams"
        - have no way to cope with "three instruction streams"
      - Key for multicore debugging
        - Extract "one stream" of information and concentrate on it
        - Provide tools that let the engineer concentrate on debugging

  17. Debugging of application with KPN model
      [Figure labels: QCP 1, QCP 2, QCP 3, QCP 4]
      - Something wrong on Filter Function
      - Focus on FIFO
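"Focus on FIFO" suggests observing the tokens on each channel rather than stepping through every core. The sketch below shows one way this could look, assuming a linear chain of processes and a known-good (golden) version to compare against; it is not the MPArchitect tooling, and `run_with_taps`, `first_divergence`, and the toy filter bug are invented for the example.

```python
def run_with_taps(processes, inputs):
    """Run a linear chain of processes and record every token on every channel."""
    traces = []
    tokens = list(inputs)
    for proc in processes:
        tokens = [proc(t) for t in tokens]
        traces.append(tokens)  # trace of the FIFO at this process's output
    return traces


def first_divergence(traces, golden_traces):
    """Return (channel index, token index) of the earliest mismatch, or None."""
    for ch, (got, want) in enumerate(zip(traces, golden_traces)):
        for idx, (g, w) in enumerate(zip(got, want)):
            if g != w:
                return ch, idx
    return None


def double(x):
    return 2 * x


def good_filter(x):
    return min(x, 10)


def buggy_filter(x):
    return min(x, 8)  # hypothetical bug: wrong clamp limit


def add_one(x):
    return x + 1


if __name__ == "__main__":
    data = [1, 4, 6]
    golden = run_with_taps([double, good_filter, add_one], data)
    actual = run_with_taps([double, buggy_filter, add_one], data)
    print(first_divergence(actual, golden))
```

Scanning channels from upstream to downstream means the first mismatching FIFO points at the process that feeds it, so the engineer only has to follow that one stream.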

  18. MPArchitect provides several tools
      - Instruction profile
      - Activities inside core
      - On-chip bus usage
      Activity monitor helps programming for low power

  19. Performance Considerations: Real-Time Ray Tracing, KPN-based Distributed Processing Flow
      [Figure labels: Selected Sub-Area; Virtual Process; Primary Ray Gen; Reflection Ray Gen; Shadow Ray Gen; Ray Tree Gen; Priority Selection; Ray Gen; Space Check; Voxel Traverse; Sub-Area BBoxCheck; Surf Intersect; DepthTest/Haikei/CreateNode; Lighting (OL, BL); Reflection; Brightness (35 band); DEPTH; Critical Path; numeric annotations 4, 16, 23, 32, n; markers ①, ②, ③]
      - Optimization for load balancing
      - Critical loop and buffers for load balancing
