Chai: Collaborative Heterogeneous Applications for Integrated-architectures
Juan Gómez-Luna (1), Izzat El Hajj (2), Li-Wen Chang (2), Víctor García-Flores (3,4), Simon Garcia de Gonzalo (2), Thomas B. Jablin (2,5), Antonio J. Peña (4), and Wen-mei Hwu (2)
(1) Universidad de Córdoba, (2) University of Illinois at Urbana-Champaign, (3) Universitat Politècnica de Catalunya, (4) Barcelona Supercomputing Center, (5) MulticoreWare, Inc.
Motivation
• Heterogeneous systems are moving towards tighter integration
  • Shared virtual memory, coherence, system-wide atomics
  • OpenCL 2.0, CUDA 8.0
• A benchmark suite is needed for:
  • Analyzing collaborative workloads
  • Evaluating new architecture features
Application Structure
• [diagram: applications decompose into fine-grain tasks/sub-tasks and coarse-grain tasks/sub-tasks]
Data Partitioning
• [diagram: execution flow in which sub-tasks A and B are partitioned by data across the CPU and GPU]
Data Partitioning: Bézier Surfaces
• Output surface points are distributed across devices (see the sketch below)
• [figure: tiles of 3D surface points, some processed on the CPU and some on the GPU]
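The following is a minimal CUDA sketch of the static output-partitioning idea, not the Chai implementation itself (Chai distributes whole tiles of surface points and also ships OpenCL and C++AMP versions); the row-based split and all function names are illustrative.

```cuda
// Static output partitioning for cubic Bezier surface evaluation: the CPU
// computes the first `cpuRows` rows of the output grid while the GPU kernel
// computes the remaining rows.
#include <cuda_runtime.h>
#include <math.h>

__host__ __device__ inline float bern3(int k, float t) {     // cubic Bernstein basis B_k^3(t)
    const float c[4] = {1.f, 3.f, 3.f, 1.f};
    return c[k] * powf(t, (float)k) * powf(1.f - t, (float)(3 - k));
}

__host__ __device__ inline float3 evalPoint(const float3* ctrl, float u, float v) {
    float3 p = make_float3(0.f, 0.f, 0.f);                    // weighted sum of 4x4 control points
    for (int ki = 0; ki < 4; ++ki)
        for (int kj = 0; kj < 4; ++kj) {
            float w = bern3(ki, u) * bern3(kj, v);
            p.x += w * ctrl[ki * 4 + kj].x;
            p.y += w * ctrl[ki * 4 + kj].y;
            p.z += w * ctrl[ki * 4 + kj].z;
        }
    return p;
}

__global__ void bezierGPU(const float3* ctrl, float3* out, int resU, int resV, int firstRow) {
    int i = firstRow + blockIdx.y * blockDim.y + threadIdx.y; // GPU takes rows >= firstRow
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < resU && j < resV)
        out[i * resV + j] = evalPoint(ctrl, i / (float)(resU - 1), j / (float)(resV - 1));
}

void bezierCPU(const float3* ctrl, float3* out, int resU, int resV, int cpuRows) {
    for (int i = 0; i < cpuRows; ++i)                         // CPU takes the first rows
        for (int j = 0; j < resV; ++j)
            out[i * resV + j] = evalPoint(ctrl, i / (float)(resU - 1), j / (float)(resV - 1));
}
```

In a unified (U) version, `ctrl` and `out` would come from a single shared allocation; a discrete (D) version would copy `ctrl` to the device and merge the GPU's rows of `out` back afterwards.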
Data Partitioning: Image Histogram
• Two variants: input pixels distributed across devices, or output bins distributed across devices (input partitioning sketched below)
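As an illustration of the input-partitioned variant, here is a minimal CUDA sketch under stated assumptions (managed memory, a GPU with compute capability >= 6.0 for `atomicAdd_system`, a GCC/Clang host compiler for `__sync_fetch_and_add`, and concurrent CPU/GPU access to managed memory); Chai's own version is OpenCL with fine-grained SVM.

```cuda
// Input partitioning for a 256-bin histogram: the GPU processes pixels [0, split),
// the CPU processes [split, n), and both update the same bins through
// system-wide atomics, so no merge step is needed afterwards.
#include <cuda_runtime.h>

__global__ void histGPU(const unsigned char* pixels, unsigned int* bins, int split) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < split)
        atomicAdd_system(&bins[pixels[i]], 1u);      // system-wide: visible to the CPU
}

void histCPU(const unsigned char* pixels, unsigned int* bins, int split, int n) {
    for (int i = split; i < n; ++i)
        __sync_fetch_and_add(&bins[pixels[i]], 1u);  // CPU-side atomic on the shared bins
}

void runHistogram(const unsigned char* pixels, unsigned int* bins, int n, int split) {
    // pixels and bins are assumed to come from cudaMallocManaged, with bins zeroed.
    histGPU<<<(split + 255) / 256, 256>>>(pixels, bins, split);
    histCPU(pixels, bins, split, n);                 // CPU works while the kernel runs
    cudaDeviceSynchronize();
}
```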
Data Partitioning: Padding
• Rows are distributed across devices
• Challenge: operating in place requires inter-worker synchronization
Data Partitioning: Stream Compaction
• Rows are distributed across devices
• Like padding, but irregular and involves predicate computations
Data Partitioning: Other Benchmarks
• Canny Edge Detection: different devices process different images
• Random Sample Consensus: workers on different devices process different models
• In-place Transposition: workers on different devices follow different cycles
Types of Data Partitioning
• Partitioning strategy:
  • Static (fixed work for each device)
  • Dynamic (devices contend on a shared worklist; see the sketch below)
• Flexible interface for defining partitioning schemes
• Partitioned data:
  • Input (e.g., Image Histogram)
  • Output (e.g., Bézier Surfaces)
  • Both (e.g., Padding)
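Below is a minimal sketch of the dynamic strategy, offered as an illustration rather than Chai's partitioning interface: CPU and GPU workers contend on a shared block counter in managed memory, so the faster device naturally takes more blocks. It assumes `atomicAdd_system` (compute capability >= 6.0), concurrent managed access, and a GCC/Clang host compiler.

```cuda
// Dynamic partitioning via a shared worklist: both devices atomically grab the
// next block index from the same counter until the worklist is exhausted.
#include <cuda_runtime.h>

#define BLOCK_WORK 1024                                       // elements per worklist block

__host__ __device__ inline void processElement(float* data, unsigned int i) {
    data[i] *= 2.f;                                           // placeholder per-element work
}

__global__ void workerGPU(unsigned int* nextBlock, unsigned int numBlocks, float* data) {
    __shared__ unsigned int myBlock;
    while (true) {
        if (threadIdx.x == 0)
            myBlock = atomicAdd_system(nextBlock, 1u);        // claim the next block
        __syncthreads();
        if (myBlock >= numBlocks) return;                     // worklist exhausted
        for (unsigned int i = threadIdx.x; i < BLOCK_WORK; i += blockDim.x)
            processElement(data, myBlock * BLOCK_WORK + i);   // threads split the block
        __syncthreads();                                      // finish before the next claim
    }
}

void workerCPU(unsigned int* nextBlock, unsigned int numBlocks, float* data) {
    while (true) {
        unsigned int b = __sync_fetch_and_add(nextBlock, 1u); // contend on the same counter
        if (b >= numBlocks) return;
        for (unsigned int i = 0; i < BLOCK_WORK; ++i)
            processElement(data, b * BLOCK_WORK + i);
    }
}
```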
Fine-grain Task Partitioning
• [diagram: execution flow in which sub-tasks A and B are interleaved at fine granularity across the CPU and GPU]
Fine-grain Task Partitioning: Random Sample Consensus
• Data partitioning (RSCD): models distributed across devices
• Task partitioning (RSCT): model fitting on the CPU, model evaluation on the GPU (sketched below)
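A minimal sketch of the fitting-on-CPU / evaluation-on-GPU split, using a toy 2D line model instead of Chai's flow-vector model and serializing the two stages rather than overlapping them the way the RSCT workers do; `points` and `inliers` are assumed to live in managed memory.

```cuda
// Fine-grain task split for RANSAC: the cheap, serial model fitting runs on the
// CPU, while the data-parallel inlier evaluation runs as a CUDA kernel.
#include <cuda_runtime.h>
#include <cstdlib>

struct Line { float a, b; };                          // toy model: y = a*x + b

__global__ void evaluateGPU(const float2* points, int n, Line m,
                            float tol, unsigned int* inliers) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(points[i].y - (m.a * points[i].x + m.b)) < tol)
        atomicAdd(inliers, 1u);                       // count inliers for this candidate model
}

Line fitCPU(const float2* points, int n) {            // fit a candidate from two random points
    float2 p = points[rand() % n], q = points[rand() % n];
    Line m;
    m.a = (q.y - p.y) / (q.x - p.x + 1e-6f);
    m.b = p.y - m.a * p.x;
    return m;
}

unsigned int ransac(const float2* points, int n, int iterations,
                    float tol, unsigned int* inliers /* managed scratch counter */) {
    unsigned int best = 0;
    for (int it = 0; it < iterations; ++it) {
        Line m = fitCPU(points, n);                   // fitting: CPU
        *inliers = 0;
        evaluateGPU<<<(n + 255) / 256, 256>>>(points, n, m, tol, inliers);
        cudaDeviceSynchronize();                      // evaluation: GPU
        if (*inliers > best) best = *inliers;
    }
    return best;
}
```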
Fine-grain Task Partitioning: Task Queue System
• The CPU enqueues tasks into queues; the GPU reads and processes them until the queues are empty (sketched below)
• Two variants: synthetic tasks (a mix of short and long tasks, TQ) and histogram tasks (TQH)
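The following is a hedged sketch of the enqueue/read pattern under several assumptions (managed memory with concurrent CPU/GPU access, compute capability >= 6.0, a GCC/Clang host compiler, and `volatile` as a blunt way to keep the spin loops re-reading memory); the `Task` layout and all names are hypothetical, not Chai's.

```cuda
// Task queue: the host publishes task descriptors and advances a tail counter;
// GPU thread blocks claim slots by advancing a head counter with a system-wide
// atomic and spin until the slot is filled or the producer signals completion.
#include <cuda_runtime.h>

struct Task { int type; int offset; int size; };      // hypothetical descriptor

__global__ void consumeGPU(const Task* queue, unsigned int* head,
                           volatile unsigned int* tail, volatile int* done,
                           float* data) {
    __shared__ unsigned int idx;
    __shared__ int stop;
    while (true) {
        if (threadIdx.x == 0) {
            idx = atomicAdd_system(head, 1u);         // claim the next queue slot
            stop = 0;
            while (idx >= *tail)                      // spin until the host enqueues it
                if (*done && idx >= *tail) { stop = 1; break; }
        }
        __syncthreads();
        if (stop) return;                             // producer finished, queue drained
        Task t = queue[idx];
        for (int i = t.offset + threadIdx.x; i < t.offset + t.size; i += blockDim.x)
            data[i] += 1.f;                           // placeholder task body
        __syncthreads();                              // done with idx before the next claim
    }
}

void produceCPU(Task* queue, volatile unsigned int* tail, volatile int* done,
                const Task* tasks, int n) {
    for (int i = 0; i < n; ++i) {
        queue[i] = tasks[i];                          // write the descriptor first...
        __sync_synchronize();                         // ...then publish it by moving the tail
        *tail = i + 1;
    }
    *done = 1;                                        // no more tasks will be enqueued
}
```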
Coarse-grain Task Partitioning
• [diagram: execution flow in which coarse-grain sub-tasks A and B are assigned to different devices]
Coarse-grain Task Partitioning: Breadth-First Search & Single-Source Shortest Path
• Small frontiers are processed on the CPU; large frontiers are processed on the GPU (see the sketch below)
• SSSP performs more computation than BFS, which hides communication/memory latency
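Here is a minimal CUDA sketch of that switching idea (an illustration, not Chai's BFS, which coordinates the devices through system-wide atomics on shared worklists): all arrays are assumed to live in managed memory so either device can expand a level, and `threshold` is a hypothetical tuning knob.

```cuda
// Coarse-grain task partitioning for BFS on a CSR graph: each level's frontier
// is expanded on the CPU when it is small and on the GPU when it is large.
#include <cuda_runtime.h>

__global__ void expandGPU(const int* rowPtr, const int* colIdx, int* dist,
                          const int* frontier, int frontierSize,
                          int* nextFrontier, int* nextSize, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int v = frontier[i];
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int u = colIdx[e];
        if (atomicCAS(&dist[u], -1, level + 1) == -1)     // first visitor claims u
            nextFrontier[atomicAdd(nextSize, 1)] = u;
    }
}

void expandCPU(const int* rowPtr, const int* colIdx, int* dist,
               const int* frontier, int frontierSize,
               int* nextFrontier, int* nextSize, int level) {
    for (int i = 0; i < frontierSize; ++i) {
        int v = frontier[i];
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int u = colIdx[e];
            if (dist[u] == -1) { dist[u] = level + 1; nextFrontier[(*nextSize)++] = u; }
        }
    }
}

void bfsCollaborative(const int* rowPtr, const int* colIdx, int* dist,
                      int* frontier, int* nextFrontier, int* nextSize,
                      int source, int numVertices, int threshold) {
    for (int v = 0; v < numVertices; ++v) dist[v] = -1;
    dist[source] = 0;
    frontier[0] = source;
    int frontierSize = 1, level = 0;
    while (frontierSize > 0) {
        *nextSize = 0;
        if (frontierSize < threshold) {                   // small frontier: CPU
            expandCPU(rowPtr, colIdx, dist, frontier, frontierSize, nextFrontier, nextSize, level);
        } else {                                          // large frontier: GPU
            expandGPU<<<(frontierSize + 127) / 128, 128>>>(rowPtr, colIdx, dist, frontier,
                                                           frontierSize, nextFrontier, nextSize, level);
            cudaDeviceSynchronize();
        }
        int* tmp = frontier; frontier = nextFrontier; nextFrontier = tmp;  // swap frontiers
        frontierSize = *nextSize;
        ++level;
    }
}
```

SSSP follows the same structure but relaxes edge weights in each expansion, which adds the extra computation the slide mentions.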
Coarse-grain Task Partitioning: Canny Edge Detection
• Data partitioning (CEDD): images distributed across devices
• Task partitioning (CEDT): stages (Gaussian filter, Sobel filter, non-maximum suppression, hysteresis) distributed across devices and pipelined
Benchmarks and Implementations
• Implementations:
  • OpenCL-U and OpenCL-D
  • CUDA-U and CUDA-D
  • CUDA-U-Sim and CUDA-D-Sim (for simulators such as gem5-gpu)
  • C++AMP
• U denotes unified memory; D denotes discrete memory with explicit copies
Benchmark Diversity

Data Partitioning:
Benchmark  Partitioning Granularity  Partitioned Data  System-wide Atomics  Load Balance
BS         Fine                      Output            None                 Yes
CEDD       Coarse                    Input, Output     None                 Yes
HSTI       Fine                      Input             Compute              No
HSTO       Fine                      Output            None                 No
PAD        Fine                      Input, Output     Sync                 Yes
RSCD       Medium                    Output            Compute              Yes
SC         Fine                      Input, Output     Sync                 No
TRNS       Medium                    Input, Output     Sync                 No

Fine-grain Task Partitioning:
Benchmark  System-wide Atomics  Load Balance
RSCT       Sync, Compute        Yes
TQ         Sync                 No
TQH        Sync                 No

Coarse-grain Task Partitioning:
Benchmark  System-wide Atomics  Partitioning   Concurrency
BFS        Sync, Compute        Iterative      No
CEDT       Sync                 Non-iterative  Yes
SSSP       Sync, Compute        Iterative      No
Evaluation Platform
• AMD Kaveri A10-7850K APU
  • 4 CPU cores
  • 8 GPU compute units
• AMD APP SDK 3.0
• Profiling: CodeXL, gem5-gpu
Benefits of Collaboration
• Collaborative execution improves performance
• [charts: execution time (ms) of Bézier Surfaces and Stream Compaction on 1/2/4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores, across several input sizes]
• Bézier Surfaces: up to 47% improvement over GPU only
• Stream Compaction: up to 82% improvement over GPU only
Benefits of Collaboration
• The optimal number of devices is not always the maximum and varies across datasets
• [charts: execution time (ms) of Padding and Single-Source Shortest Path on 1/2/4 CPU cores, GPU only, and GPU + 1/2/4 CPU cores, across several datasets]
• Padding: up to 16% improvement over GPU only
• Single-Source Shortest Path: up to 22% improvement over GPU only
Benefits of Unified Memory
• [chart: kernel execution time (normalized) of Discrete (D) vs. Unified (U) versions across all benchmarks]
• Comparable (same kernels); system-wide atomics make Unified sometimes slower
• Unified kernels can exploit more parallelism
• Unified kernels avoid kernel launch overhead
Benefits of Unified Memory
• [chart: execution time (normalized), broken into Kernel, Copy To Device, and Copy Back & Merge, for Discrete (D) vs. Unified (U) versions across all benchmarks]
• Unified versions avoid copy overhead
Benefits of Unified Memory
• [chart: execution time (normalized), broken into Kernel, Copy To Device, Copy Back & Merge, and Allocation, for Discrete (D) vs. Unified (U) versions across all benchmarks]
• SVM allocation seems to take longer
• (The two host-side flows are contrasted in the sketch below)
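The difference the breakdown shows can be summarized by the two host-side flows below; this is a CUDA-flavored sketch (Chai's OpenCL-U uses clSVMAlloc and fine-grained SVM instead), with `kernelWork` as a placeholder kernel.

```cuda
// Discrete (D): allocate device memory, copy input in, launch, copy results back
// and merge. Unified (U): one managed allocation both devices touch; no explicit
// copies or merge, but the allocation itself tends to be more expensive.
#include <cuda_runtime.h>

__global__ void kernelWork(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.f;                            // placeholder compute
}

void runDiscrete(const float* input, float* output, int n) {
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));                                 // Allocation
    cudaMemcpy(d_data, input, n * sizeof(float), cudaMemcpyHostToDevice);   // Copy To Device
    kernelWork<<<(n + 255) / 256, 256>>>(d_data, n);                        // Kernel
    cudaMemcpy(output, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);  // Copy Back & Merge
    cudaFree(d_data);
}

void runUnified(int n) {
    float* data;
    cudaMallocManaged(&data, n * sizeof(float));          // Allocation (often slower than cudaMalloc)
    kernelWork<<<(n + 255) / 256, 256>>>(data, n);        // Kernel: CPU and GPU share `data`
    cudaDeviceSynchronize();                              // results visible to the CPU, no copies
    cudaFree(data);
}
```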
C++ AMP Performance Results
• [chart: per-benchmark speedup of OpenCL-U vs. C++AMP, normalized to the faster of the two]
Benchmark Diversity
• [charts: GPU execution profiles (VALUBusy, MemUnitBusy, VALUUtilization, CacheHit) and system-wide atomic operations per thousand cycles on CPU and GPU, per benchmark]
• Varying intensity in the use of system-wide atomics
• Diverse execution profiles
Benefits of Collaboration on FPGA
• Case study: Canny Edge Detection
• [chart: execution time (s), broken into Compute, Copy, and Idle, for CPU-only, FPGA-only, data-partitioned, and task-partitioned versions on Stratix V and Arria 10]
• Similar improvement from data and task partitioning
• Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE'17 Vision Track.
Benefits of Collaboration on FPGA
• Case study: Random Sample Consensus
• [chart: execution time (ms) of data vs. task partitioning on Stratix V and Arria 10 for work distributions ranging from 0.0 to 1.0]
• Task partitioning exploits the disparity in the nature of the tasks
• Source: Collaborative Computing for Heterogeneous Integrated Systems. ICPE'17 Vision Track.
Released
• Website: chai-benchmarks.github.io
• Code: github.com/chai-benchmarks/chai
• Online forum: groups.google.com/d/forum/chai-dev
• Papers:
  • Chai: Collaborative Heterogeneous Applications for Integrated-architectures. ISPASS'17.
  • Collaborative Computing for Heterogeneous Integrated Systems. ICPE'17 Vision Track.
Chai: Collaborative Heterogeneous Applications for Integrated-architectures
chai-benchmarks.github.io
Thank You!