Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs

Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten
Computer Systems Laboratory, Cornell University
50th ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-2017)
Inter-Core
● Task-Based Parallel Programming Frameworks
  ○ Intel TBB, Cilk

Intra-Core
● Packed-SIMD Vectorization
  ○ Intel AVX, Arm NEON
Challenges of Combining Tasks and Vectors

  void app_kernel_tbb_avx( int N, float* src, float* dst ) {
    // Pack data into padded, aligned chunks
    //   src -> src_chunks[NUM_CHUNKS][SIMD_WIDTH]
    //   dst -> dst_chunks[NUM_CHUNKS][SIMD_WIDTH]
    ...
    // Use TBB across cores
    parallel_for( range( 0, NUM_CHUNKS, TASK_SIZE ), [&]( range r ) {
      for ( int i = r.begin(); i < r.end(); i++ ) {
        // Use packed-SIMD within a core
        #pragma simd vlen(SIMD_WIDTH)
        for ( int j = 0; j < SIMD_WIDTH; j++ ) {
          if ( src_chunks[i][j] > THRESHOLD )
            dst_chunks[i][j] = DoLightCompute( src_chunks[i][j] );
          else
            dst_chunks[i][j] = DoHeavyCompute( src_chunks[i][j] );
        }
      }
    });
    ...
  }

Challenge #1: Intra-Core Parallel Abstraction Gap
Challenge #2: Inefficient Execution of Irregular Tasks
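To make Challenge #2 concrete: when the divergent inner loop above is vectorized, a common outcome is mask-and-blend if-conversion, where both DoLightCompute and DoHeavyCompute are evaluated for every element and a mask selects the result, so light elements still pay for the heavy path. A scalar sketch of that behavior, reusing the names from the kernel above (illustrative only, not actual compiler output):

  // Illustrative scalar model of mask-and-blend if-conversion: both sides
  // of the branch are evaluated for every element, then a mask selects
  // which result is kept.
  for ( int j = 0; j < SIMD_WIDTH; j++ ) {
    bool  take_light = ( src_chunks[i][j] > THRESHOLD );
    float light = DoLightCompute( src_chunks[i][j] );  // executed either way
    float heavy = DoHeavyCompute( src_chunks[i][j] );  // executed either way
    dst_chunks[i][j] = take_light ? light : heavy;
  }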
Native Performance Results
[Figure: performance results on a native system for regular vs. irregular application kernels]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA SW: API and ISA Hint

  void app_kernel_lta( int N, float* src, float* dst ) {
    LTA_PARALLEL_FOR( 0, N, (dst, src), ({
      if ( src[i] > THRESHOLD )
        dst[i] = DoComputeLight( src[i] );
      else
        dst[i] = DoComputeHeavy( src[i] );
    }));
  }

  void loop_task_func( void* a, int start, int end, int step = 1 );

Hint that hardware can potentially accelerate task execution
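A rough sketch of what LTA_PARALLEL_FOR might lower to, assuming it packages the loop body into a loop-task function with the signature above and hands it to the runtime. The LoopTaskArgs struct, the lta_runtime_execute entry point, and the THRESHOLD value are illustrative placeholders, not the paper's actual macro or runtime API:

  // Hypothetical expansion sketch; names below (LoopTaskArgs,
  // lta_runtime_execute, THRESHOLD) are illustrative, not the real API.
  static const float THRESHOLD = 1.0f;     // placeholder threshold
  extern float DoComputeLight( float x );  // defined elsewhere in the app
  extern float DoComputeHeavy( float x );

  // Hypothetical runtime entry point that partitions [start, end) into
  // loop-tasks and may issue the ISA hint for each one.
  extern void lta_runtime_execute( void (*task)( void*, int, int, int ),
                                   void* args, int start, int end );

  struct LoopTaskArgs {
    float*       dst;
    const float* src;
  };

  // Loop-task function with the signature shown on the slide: executes
  // iterations [start, end) of the original loop body with the given step.
  static void loop_task_func( void* a, int start, int end, int step )
  {
    LoopTaskArgs* args = static_cast<LoopTaskArgs*>( a );
    for ( int i = start; i < end; i += step ) {
      if ( args->src[i] > THRESHOLD )
        args->dst[i] = DoComputeLight( args->src[i] );
      else
        args->dst[i] = DoComputeHeavy( args->src[i] );
    }
  }

  void app_kernel_lta_expanded( int N, float* src, float* dst )
  {
    LoopTaskArgs args = { dst, src };
    lta_runtime_execute( loop_task_func, &args, 0, N );
  }

Usage mirrors the kernel on the slide: the runtime decides how to split [0, N) into loop-tasks and whether each task runs on the general-purpose core or is accelerated by the LTA.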
LTA SW: Task-Based Runtime
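A minimal serial sketch of how a TBB-like task-based runtime might recursively partition a loop range into loop-tasks. The names (execute_range, TASK_GRAIN) and the absence of work stealing are simplifications for illustration, not the paper's runtime:

  // Illustrative recursive range partitioning in a TBB-like runtime.
  static const int TASK_GRAIN = 64;  // smallest range run as one loop-task

  static void execute_range( void (*task)( void*, int, int, int ),
                             void* args, int start, int end )
  {
    if ( end - start <= TASK_GRAIN ) {
      // Leaf range: run it as a single loop-task. On a core with an LTA,
      // this is where the ISA hint would let hardware accelerate the task.
      task( args, start, end, 1 );
      return;
    }
    // Split the range in half; a real runtime would spawn one half as a
    // stealable task so idle worker threads can pick it up.
    int mid = start + ( end - start ) / 2;
    execute_range( task, args, start, mid );
    execute_range( task, args, mid, end );
  }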
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA HW: Fully-Coupled LTA
Coupling is better for regular workloads (amortizes frontend and memory overheads).
LTA HW: Fully-Decoupled LTA
Decoupling is better for irregular workloads (hides latencies).
LTA HW: Task-Coupling Taxonomy
[Figure: task-coupling taxonomy; a task group executes in lock-step. +: higher performance on irregular workloads, -: higher area/energy]
More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area and energy.
LTA HW: Task-Coupling Taxonomy
Does it matter whether we decouple in space or in time?
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
Evaluation: Methodology
• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
  • Scientific computing: N-body simulation, MRI-Q, SGEMM
  • Image processing: bilateral filter, RGB-to-CMYK, DCT
  • Graph algorithms: breadth-first search, maximal matching
  • Search/sort algorithms: radix sort, substring matching
• gem5 + PyMTL co-simulation for cycle-level performance
• Component/event-based area/energy modeling
  • Uses an area/energy dictionary backed by VLSI results and McPAT
Evaluation: Design Space Exploration
[Figure: LTA design space with spatial decoupling on one axis and temporal decoupling on the other, subject to resource constraints]
• Prefer spatial decoupling over temporal decoupling
• Reduce spatial decoupling to improve energy efficiency
Evaluation: Multicore LTA Performance
[Figure: speedups for regular and irregular application kernels; highlighted results include 10.7x, 5.2x, 2.9x, and 4.4x]
Evaluation: Area-Normalized Performance
[Figure: area-normalized speedups; highlighted results include 1.8x, 1.6x, and 1.2x]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
Related Work
• Challenge #1: Intra-Core Parallel Abstraction Gap
  • Persistent threads for GPGPUs (S. Tzeng et al.)
  • OpenCL, OpenMP, C++ AMP
  • Cilk for vectorization (B. Ren et al.)
  • And more...
• Challenge #2: Inefficient Execution of Irregular Tasks
  • Variable warp sizing (T. Rogers et al.)
  • Temporal SIMT (S. Keckler et al.)
  • Vector-lane threading (S. Rivoire et al.)
  • And more...
• Please see the paper for more detailed references!