OpenCL-Based Erasure Coding on Heterogeneous Architectures
Guoyang Chen, Huiyang Zhou, Xipeng Shen (North Carolina State University); Josh Gahm, Narayan Venkat, Skip Booth, John Marshall (Cisco Systems, Inc.)
Email: gchen11@ncsu.edu
Introduction
• A key challenge in storage systems
o Failures (disk sector, entire disk, storage site)
• A solution:
o Erasure coding
• Intel's Intelligent Storage Acceleration Library (ISA-L)
Motivation
• Erasure coding approaches
o Replication (simple, but high storage cost for its fault tolerance)
o Reed-Solomon coding (lower storage cost, high fault tolerance, but computationally complex)
o ......
• Motivation:
o Explore various heterogeneous architectures to accelerate Reed-Solomon coding.
Reed-Solomon Coding
• Block-based parity encoding
o Inputs are partitioned into 'srcs' blocks, each of 'length' bytes.
o Encode matrix: dests > srcs
o Dest[m][j] = ⊕_{k=0}^{srcs-1} V[m][k] × Src[k][j], i.e., Dest = V × Src
• sum (⊕): 8-bit XOR; mul (×): GF(2^8) multiplication
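Concretely, the encoding is a matrix product over GF(2^8). A minimal C sketch, assuming the common Reed-Solomon polynomial 0x11D (the function names are illustrative, not from the paper's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) multiplication via the Russian Peasant method, assuming the
   primitive polynomial x^8+x^4+x^3+x^2+1 (0x11D), a common
   Reed-Solomon choice. */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;                 /* conditional add = XOR */
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0)); /* double, reduce */
    }
    return p;
}

/* Dest = V x Src: Dest[m][j] = XOR over k of V[m][k] * Src[k][j]. */
static void rs_encode(int dests, int srcs, size_t length,
                      const uint8_t *V,    /* dests x srcs   */
                      const uint8_t *Src,  /* srcs  x length */
                      uint8_t *Dest) {     /* dests x length */
    for (int m = 0; m < dests; m++)
        for (size_t j = 0; j < length; j++) {
            uint8_t acc = 0;
            for (int k = 0; k < srcs; k++)
                acc ^= gf_mul(V[m * srcs + k], Src[k * length + j]);
            Dest[m * length + j] = acc;
        }
}
```

With a parity row of all 1s, the output degenerates to a plain XOR of the source blocks, which is a handy sanity check.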
GF(2^8) Multiplication
• 3 ways to do Galois Field multiplication:
o Russian Peasant algorithm: pure logic operations.
o 2 small tables: 256 bytes per table, 3 table lookups, 3 logic operations.
o 1 large table: 256×256 bytes, no logic operations, one lookup.
Refer to the paper for details.
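The first two variants can be sketched in C. This is a hedged illustration assuming the polynomial 0x11D with generator 2 (a common Reed-Solomon choice; the paper's field representation may differ); the large-table variant would simply precompute all 256×256 products for a single lookup.

```c
#include <assert.h>
#include <stdint.h>

#define POLY 0x11D  /* assumed primitive polynomial */

/* Variant 1: Russian Peasant -- shifts and XORs only, no tables. */
static uint8_t gf_mul_peasant(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? (POLY & 0xFF) : 0));
    }
    return p;
}

/* Variant 2: two 256-byte tables. a*b = exp[(log a + log b) mod 255]
   -- 3 lookups (log a, log b, exp) plus a few logic operations. */
static uint8_t gf_log[256], gf_exp[256];

static void build_tables(void) {
    uint16_t x = 1;                 /* successive powers of the generator 2 */
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= POLY;   /* reduce modulo POLY */
    }
}

static uint8_t gf_mul_tables(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[(gf_log[a] + gf_log[b]) % 255];
}
```

The two variants agree on every input pair, which is easy to verify exhaustively since the field only has 256 elements.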
Reed-Solomon Coding on CPUs
• Intel ISA-L
o Single-threaded.
o Used as the baseline.
• Adding multithreading support
o Partition the input matrix in a column-wise manner.
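The column-wise partition can be sketched in C as follows (helper names and the serial driver loop are illustrative; the real code would hand each column range to its own worker thread, which is safe because the ranges write disjoint parts of Dest):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GF(2^8) multiply (Russian Peasant; polynomial 0x11D assumed). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return p;
}

/* Encode only columns [col_begin, col_end): the unit of work one
   thread owns under the column-wise partition. */
static void encode_columns(int dests, int srcs, size_t length,
                           size_t col_begin, size_t col_end,
                           const uint8_t *V, const uint8_t *Src,
                           uint8_t *Dest) {
    for (int m = 0; m < dests; m++)
        for (size_t j = col_begin; j < col_end; j++) {
            uint8_t acc = 0;
            for (int k = 0; k < srcs; k++)
                acc ^= gf_mul(V[m * srcs + k], Src[k * length + j]);
            Dest[m * length + j] = acc;
        }
}

/* Split the columns into near-equal ranges, one per thread. The loop
   runs the ranges serially here so the sketch stays portable; no
   locking on Dest is needed because the ranges are disjoint. */
static void encode_mt(int dests, int srcs, size_t length, int nthreads,
                      const uint8_t *V, const uint8_t *Src, uint8_t *Dest) {
    for (int t = 0; t < nthreads; t++)
        encode_columns(dests, srcs, length,
                       length * t / nthreads,
                       length * (t + 1) / nthreads,
                       V, Src, Dest);
}
```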
Reed-Solomon Coding on GPUs
• The computation of each element in the output matrix is independent of the others.
• Fine-grain parallelization
o One work-item per byte of the output matrix (baseline).
• Optimizations?
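The baseline mapping can be sketched by writing the body of one work-item as a plain C function, with a serial loop standing in for the (dests × length) NDRange so the sketch runs without an OpenCL runtime (names are illustrative, and the GF(2^8) polynomial 0x11D is an assumption):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) multiply (Russian Peasant; polynomial 0x11D assumed). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return p;
}

/* One baseline "work-item": produce exactly one output byte. In the
   OpenCL kernel, row and col would come from get_global_id(). */
static void encode_one_byte(int row, size_t col, int srcs, size_t length,
                            const uint8_t *V, const uint8_t *Src,
                            uint8_t *Dest) {
    uint8_t acc = 0;
    for (int k = 0; k < srcs; k++)
        acc ^= gf_mul(V[row * srcs + k], Src[k * length + col]);
    Dest[row * length + col] = acc;
}

/* Serial stand-in for enqueuing a dests x length NDRange. */
static void launch_baseline(int dests, int srcs, size_t length,
                            const uint8_t *V, const uint8_t *Src,
                            uint8_t *Dest) {
    for (int row = 0; row < dests; row++)
        for (size_t col = 0; col < length; col++)
            encode_one_byte(row, col, srcs, length, V, Src, Dest);
}
```

Since no two work-items touch the same output byte, the kernel needs no synchronization at all, which is what makes this fine-grain mapping attractive on GPUs.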
Reed-Solomon Coding on GPUs - Opt (A)
• A. Optimize GPU memory bandwidth
o Memory coalescing (work-items in one work-group access data in the same row).
o Vectorization (read a uint4 at a time) ⇒ higher bandwidth.
• Each work-item handles 16 bytes of data.
Reed-Solomon Coding on GPUs - Opt (B)
• B. Overcoming the memory bandwidth limit using texture caches and tiling
o Work-items in the same row share the same values from V ⇒ put the encode matrix and the large lookup table (64KB, for GF(2^8) multiplication) in the texture cache.
o Put Src in the texture cache by using tiling (as in matrix multiplication).
• Not helpful. Bottleneck: the kernel is computation bound.
Dest = V × Src
Reed-Solomon Coding on GPUs - Opt (C)
• C. Hiding data transmission latency over PCIe
o Partition the input into multiple groups.
• One stream per group.
o Overlap data copy time with computation time.
Stream 1: H2D | Compute | D2H
Stream 2:       H2D | Compute | D2H
Stream 3:             H2D | Compute | D2H
......
Stream N:                   H2D | Compute | D2H
Reed-Solomon Coding on GPUs - Opt (D)
• D. Shared virtual memory to eliminate memory copies
o Shared virtual memory (SVM) is supported in OpenCL 2.0.
• Available on AMD APUs.
• No need for explicit data copies.
Reed-Solomon Coding on FPGAs
• FPGAs
o Abundant on-chip logic for computation.
o Pipelined parallelism instead of the data parallelism used on GPUs.
o Relatively low memory access bandwidth.
• Reed-Solomon coding
o Computation bound ⇒ a good candidate for FPGAs.
o Same baseline code as used on GPUs (one work-item per byte).
Reed-Solomon Coding on FPGAs - Opt (A)
• A. Vectorization to optimize FPGA memory bandwidth
o Each work-item reads 64 bytes from the input.
Reed-Solomon Coding on FPGAs - Opt (B)
• B. Overcoming the memory bandwidth limit using tiling
o Load a tile of the input matrix into local memory shared by a work-group.
o A larger tile size yields more data reuse and reduces off-chip memory traffic.
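The tiling idea can be sketched in C, with a small stack buffer standing in for FPGA local memory (TILE = 32 and the polynomial 0x11D are assumptions; in the real kernel the tile load would be performed cooperatively by the work-group):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 32  /* tile width in bytes; a tuning knob (value assumed) */

/* GF(2^8) multiply (Russian Peasant; polynomial 0x11D assumed). */
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
    }
    return p;
}

/* Reference: direct (untiled) encode, for comparison. */
static void rs_encode(int dests, int srcs, size_t length,
                      const uint8_t *V, const uint8_t *Src, uint8_t *Dest) {
    for (int m = 0; m < dests; m++)
        for (size_t j = 0; j < length; j++) {
            uint8_t acc = 0;
            for (int k = 0; k < srcs; k++)
                acc ^= gf_mul(V[m * srcs + k], Src[k * length + j]);
            Dest[m * length + j] = acc;
        }
}

/* Tiled encode: stage a srcs x TILE slice of Src in a small buffer
   standing in for local memory; every output row then reuses the
   staged tile instead of re-reading off-chip memory. */
static void rs_encode_tiled(int dests, int srcs, size_t length,
                            const uint8_t *V, const uint8_t *Src,
                            uint8_t *Dest) {
    uint8_t tile[srcs][TILE];          /* C99 VLA as "local memory" */
    for (size_t j0 = 0; j0 < length; j0 += TILE) {
        size_t tw = (length - j0 < TILE) ? (length - j0) : TILE;
        for (int k = 0; k < srcs; k++)          /* load the tile once */
            memcpy(tile[k], Src + k * length + j0, tw);
        for (int m = 0; m < dests; m++)         /* reuse it for all rows */
            for (size_t jj = 0; jj < tw; jj++) {
                uint8_t acc = 0;
                for (int k = 0; k < srcs; k++)
                    acc ^= gf_mul(V[m * srcs + k], tile[k][jj]);
                Dest[m * length + j0 + jj] = acc;
            }
    }
}
```

Each Src byte is read from "off-chip" memory once per tile instead of once per output row, which is the reuse a larger tile size amplifies.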
Reed-Solomon Coding on FPGAs - Opt (C)
• C. Loop unrolling and kernel replication to fully utilize FPGA logic resources
o __attribute__((num_compute_units(n))): n replicated pipelines.
o Loop unrolling: a deeper pipeline.
Experiments
• Input: an 836.9MB file.
• CPU: Intel(R) Xeon(R) CPU E5-2697 v3 (28 cores)
• GPU: NVIDIA K40m with CUDA 7.0; AMD Carrizo
• FPGA: Altera Stratix V A7
On CPU
• srcs = 30, dests = 33
• [Chart: encode bandwidth (GB/s) vs. number of threads; peaks at 2.84GB/s around 56 threads]
On NVIDIA K40m
• One stream: best with the large table (2.15GB/s).
• 8 streams: 3.9GB/s.
• [Chart: encode bandwidth]
On AMD Carrizo
• SVM is not as good as streaming.
o Overhead of the blocking functions to map and unmap SVM buffers.
• The texture cache doesn't work well.
• [Chart: encode bandwidth (GB/s) for char/int/int4 accesses under SVM vs. streaming; all below 0.6GB/s]
On FPGA
• DMA read/write bandwidth is about 3GB/s.
• We focus only on kernel throughput.
• Assume the DMA engine's bandwidth can be easily increased.
• [Chart: kernel encode bandwidth (GB/s, log scale) for the large-table, small-table, and Russian Peasant variants with char/int/int16 accesses, tiling, and unrolling]
Overall
• Considering the price, the FPGA platform is the most promising but needs to improve its current PCIe DMA interface.
• [Chart: encode bandwidth (GB/s) vs. srcs (10-30, dests = srcs + 3) for GPU, FPGA, multi-threaded CPU (MC-CPU), and single-threaded CPU (ST-CPU)]
NEW Update: Kernel + Memory Copy between Host and Device
• [Chart: encode bandwidth (GB/s) for file1 and file2 on BDW+SVM, BDW, Arria 10, and Stratix V]
• file1 has a size of 29MB; file2 has a size of 438MB.
• BDW: integrated FPGA (Arria 10) on a Xeon core.
• SVM (shared virtual memory): the map/unmap overhead is included.
• Arria 10: discrete FPGA board through PCIe.
• Stratix V: discrete FPGA board through PCIe.
Conclusions
• Explored different computing devices for erasure codes.
• Different optimizations suit different devices.
• FPGA is the most promising device for erasure codes.