Using Intra-Core Loop-Task Accelerators to Improve the Productivity and Performance of Task-Based Parallel Programs

Ji Kim, Shunning Jiang, Christopher Torng, Moyang Wang, Shreesha Srinath, Berkin Ilbeyi, Khalid Al-Hawaj, and Christopher Batten
Computer Systems Laboratory, Cornell University
50th ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-2017)
Inter-Core
● Task-Based Parallel Programming Frameworks
  ○ Intel TBB, Cilk

Intra-Core
● Packed-SIMD Vectorization
  ○ Intel AVX, Arm NEON
Challenges of Combining Tasks and Vectors

  void app_kernel_tbb_avx( int N, float* src, float* dst ) {
    // Pack data into padded, aligned chunks
    //   src -> src_chunks[NUM_CHUNKS][SIMD_WIDTH]
    //   dst -> dst_chunks[NUM_CHUNKS][SIMD_WIDTH]
    ...
    // Use TBB across cores
    parallel_for( range( 0, NUM_CHUNKS, TASK_SIZE ), [&]( range r ) {
      for ( int i = r.begin(); i < r.end(); i++ ) {
        // Use packed-SIMD within a core
        #pragma simd vlen(SIMD_WIDTH)
        for ( int j = 0; j < SIMD_WIDTH; j++ ) {
          if ( src_chunks[i][j] > THRESHOLD )
            dst_chunks[i][j] = DoLightCompute( src_chunks[i][j] );
          else
            dst_chunks[i][j] = DoHeavyCompute( src_chunks[i][j] );
        }
      }
    });
    ...
  }

Challenge #1: Intra-Core Parallel Abstraction Gap
Challenge #2: Inefficient Execution of Irregular Tasks
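To make Challenge #2 concrete: when the divergent inner loop above is vectorized, a common outcome is mask-and-blend if-conversion, where both DoLightCompute and DoHeavyCompute are evaluated for every element and a mask selects the result, so light elements still pay for the heavy path. A scalar sketch of that behavior, reusing the names from the kernel above (illustrative only, not actual compiler output):

  // Illustrative scalar model of mask-and-blend if-conversion: both sides
  // of the branch are evaluated for every element, then a mask selects
  // which result is kept.
  for ( int j = 0; j < SIMD_WIDTH; j++ ) {
    bool  take_light = ( src_chunks[i][j] > THRESHOLD );
    float light = DoLightCompute( src_chunks[i][j] );  // executed either way
    float heavy = DoHeavyCompute( src_chunks[i][j] );  // executed either way
    dst_chunks[i][j] = take_light ? light : heavy;
  }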
Native Performance Results
[Figure: performance results on a native system for regular vs. irregular application kernels]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA SW: API and ISA Hint

  void app_kernel_lta( int N, float* src, float* dst ) {
    LTA_PARALLEL_FOR( 0, N, (dst, src), ({
      if ( src[i] > THRESHOLD )
        dst[i] = DoComputeLight( src[i] );
      else
        dst[i] = DoComputeHeavy( src[i] );
    }));
  }

  void loop_task_func( void* a, int start, int end, int step = 1 );

Hint that hardware can potentially accelerate task execution
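A rough sketch of what LTA_PARALLEL_FOR might lower to, assuming it packages the loop body into a loop-task function with the signature above and hands it to the runtime. The LoopTaskArgs struct, the lta_runtime_execute entry point, and the THRESHOLD value are illustrative placeholders, not the paper's actual macro or runtime API:

  // Hypothetical expansion sketch; names below (LoopTaskArgs,
  // lta_runtime_execute, THRESHOLD) are illustrative, not the real API.
  static const float THRESHOLD = 1.0f;     // placeholder threshold
  extern float DoComputeLight( float x );  // defined elsewhere in the app
  extern float DoComputeHeavy( float x );

  // Hypothetical runtime entry point that partitions [start, end) into
  // loop-tasks and may issue the ISA hint for each one.
  extern void lta_runtime_execute( void (*task)( void*, int, int, int ),
                                   void* args, int start, int end );

  struct LoopTaskArgs {
    float*       dst;
    const float* src;
  };

  // Loop-task function with the signature shown on the slide: executes
  // iterations [start, end) of the original loop body with the given step.
  static void loop_task_func( void* a, int start, int end, int step )
  {
    LoopTaskArgs* args = static_cast<LoopTaskArgs*>( a );
    for ( int i = start; i < end; i += step ) {
      if ( args->src[i] > THRESHOLD )
        args->dst[i] = DoComputeLight( args->src[i] );
      else
        args->dst[i] = DoComputeHeavy( args->src[i] );
    }
  }

  void app_kernel_lta_expanded( int N, float* src, float* dst )
  {
    LoopTaskArgs args = { dst, src };
    lta_runtime_execute( loop_task_func, &args, 0, N );
  }

Usage mirrors the kernel on the slide: the runtime decides how to split [0, N) into loop-tasks and whether each task runs on the general-purpose core or is accelerated by the LTA.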
LTA SW: Task-Based Runtime
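A minimal serial sketch of how a TBB-like task-based runtime might recursively partition a loop range into loop-tasks. The names (execute_range, TASK_GRAIN) and the absence of work stealing are simplifications for illustration, not the paper's runtime:

  // Illustrative recursive range partitioning in a TBB-like runtime.
  static const int TASK_GRAIN = 64;  // smallest range run as one loop-task

  static void execute_range( void (*task)( void*, int, int, int ),
                             void* args, int start, int end )
  {
    if ( end - start <= TASK_GRAIN ) {
      // Leaf range: run it as a single loop-task. On a core with an LTA,
      // this is where the ISA hint would let hardware accelerate the task.
      task( args, start, end, 1 );
      return;
    }
    // Split the range in half; a real runtime would spawn one half as a
    // stealable task so idle worker threads can pick it up.
    int mid = start + ( end - start ) / 2;
    execute_range( task, args, start, mid );
    execute_range( task, args, mid, end );
  }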
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
LTA HW: Fully-Coupled LTA
Coupling is better for regular workloads (amortizes frontend and memory overheads).
LTA HW: Fully-Decoupled LTA
Decoupling is better for irregular workloads (hides latencies).
LTA HW: Task-Coupling Taxonomy
[Figure: task-coupling taxonomy; a task group executes in lock-step. +: higher performance on irregular workloads, -: higher area/energy]
More decoupling (more task groups) in either space or time improves performance on irregular workloads at the cost of area and energy.
LTA HW: Task-Coupling Taxonomy
Does it matter whether we decouple in space or in time?
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
Evaluation: Methodology
• Ported 16 application kernels from PBBS and in-house benchmark suites with diverse loop-task parallelism
  • Scientific computing: N-body simulation, MRI-Q, SGEMM
  • Image processing: bilateral filter, RGB-to-CMYK, DCT
  • Graph algorithms: breadth-first search, maximal matching
  • Search/sort algorithms: radix sort, substring matching
• gem5 + PyMTL co-simulation for cycle-level performance
• Component/event-based area/energy modeling
  • Uses an area/energy dictionary backed by VLSI results and McPAT
Evaluation: Design Space Exploration
[Figure: LTA design space with spatial decoupling on one axis and temporal decoupling on the other, subject to resource constraints]
• Prefer spatial decoupling over temporal decoupling
• Reduce spatial decoupling to improve energy efficiency
Evaluation: Multicore LTA Performance
[Figure: speedups for regular and irregular application kernels; highlighted results include 10.7x, 5.2x, 2.9x, and 4.4x]
Evaluation: Area-Normalized Performance
[Figure: area-normalized speedups; highlighted results include 1.8x, 1.6x, and 1.2x]
Loop-Task Accelerator (LTA) Vision
● Motivation
● Challenge #1: LTA SW
● Challenge #2: LTA HW
● Evaluation
● Conclusion
Related Work
• Challenge #1: Intra-Core Parallel Abstraction Gap
  • Persistent threads for GPGPUs (S. Tzeng et al.)
  • OpenCL, OpenMP, C++ AMP
  • Cilk for vectorization (B. Ren et al.)
  • And more...
• Challenge #2: Inefficient Execution of Irregular Tasks
  • Variable warp sizing (T. Rogers et al.)
  • Temporal SIMT (S. Keckler et al.)
  • Vector-lane threading (S. Rivoire et al.)
  • And more...
• Please see the paper for more detailed references!