GPU Parallel Implementation of the Approximate K-SVD Algorithm Using OpenCL

GPU Parallel Implementation of the Approximate K-SVD Algorithm Using OpenCL
Paul Irofti (1), Bogdan Dumitrescu (2)
(1) University Politehnica of Bucharest, paul@irofti.net
(2) Tampere University of Technology, bogdan.dumitrescu@tut.fi
EUSIPCO’2014

Outline
1. Introduction
2. OpenCL
3. AK-SVD
4. PAK-SVD
5. Conclusions

The Problem
Given:
- initial dictionary $D_0$
- set of training signals Y
- target sparsity s
- number of iterations K
Output:
- trained dictionary D
- sparse representations X
such that Y ≈ DX.

Optimization Problem
Solving the optimization problem:
$$\underset{D,\,X}{\text{minimize}} \;\; \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le s, \;\; \forall i$$
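To make the objective and the constraint concrete, the sketch below evaluates $\|Y - DX\|_F^2$ and checks the sparsity bound on every column of X. It is only an illustration: the function names and the toy matrices are assumptions made here, not part of the slides, and matrices are stored as plain lists of rows.

```python
def frobenius_error_sq(Y, D, X):
    """Return ||Y - D X||_F^2 for matrices stored as lists of rows."""
    p, n, m = len(Y), len(X), len(Y[0])
    total = 0.0
    for r in range(p):
        for c in range(m):
            approx = sum(D[r][k] * X[k][c] for k in range(n))
            total += (Y[r][c] - approx) ** 2
    return total


def satisfies_sparsity(X, s):
    """Check ||x_i||_0 <= s for every column x_i of X."""
    m = len(X[0])
    return all(sum(1 for row in X if row[c] != 0) <= s for c in range(m))


# Toy data (hypothetical): p = 2, n = 3 atoms, m = 2 signals.
D = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
X = [[2.0, 0.0],
     [0.0, 3.0],
     [0.0, 0.0]]
Y = [[2.0, 0.0],
     [0.0, 4.0]]

error = frobenius_error_sq(Y, D, X)   # only entry (1, 1) differs: (4 - 3)^2
feasible = satisfies_sparsity(X, 1)   # each column uses a single atom
```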

General Approach
Most algorithm iterations involve two essential steps:
- sparse coding Y using dictionary D, resulting in X
- updating the dictionary using the current representations X
Existing solutions:
- Sparse representations: SP, MP, OMP
- Dictionary update: MOD, K-SVD, AK-SVD

Current State
Practical applications employing these methods show:
- good results
- low representation errors
- slow running times
  - top consumer: the sparse representation stage
  - dictionary update performed one atom at a time
  - each update step depends on the one before it
Our approach:
- update more than one atom at a time
- distributed sparse coding
- new parallel algorithm: PAK-SVD

Platform
OpenCL platform:
- execute small functions (kernels) in parallel
- processing elements ⊂ compute units ⊂ OpenCL device
- workload topology defined as an n-dimensional space
Notation: NDR: ⟨x, y, z⟩
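As a rough illustration of how an n-dimensional range is carved into work-groups, the sketch below enumerates the work-items of an NDR(⟨4, 4⟩, ⟨2, 2⟩) split and derives each item's group and local index from its global index. The helper name `ndr_work_items` is hypothetical, and the sketch assumes the global size is a multiple of the work-group size in every dimension.

```python
from itertools import product


def ndr_work_items(global_size, local_size):
    """List (global_id, group_id, local_id) for every work-item of an NDR.

    Assumes each global dimension is a multiple of the matching
    work-group dimension.
    """
    for g, l in zip(global_size, local_size):
        if g % l != 0:
            raise ValueError("global size must be a multiple of local size")
    items = []
    for gid in product(*(range(g) for g in global_size)):
        group = tuple(i // l for i, l in zip(gid, local_size))
        local = tuple(i % l for i, l in zip(gid, local_size))
        items.append((gid, group, local))
    return items


# 2D example: a 4 x 4 global range split into 2 x 2 work-groups.
items = ndr_work_items((4, 4), (2, 2))
groups = {group for _, group, _ in items}
```

Each work-item's global index equals its group index times the work-group size plus its local index, which is the relation a kernel relies on when it picks the data it is responsible for.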

[Figure: N-Dimensional Range, 2D example]

[Figure: Memory layout]

Hardware
ATI FirePro V8800 (FireGL V) specifications:
- 1600 streaming processors
- 2048 MB global memory
- 32 KB local memory
- 256 maximum work-group size
- 20 maximum compute units
- OpenCL v1.2 compliant
- 2640 single-precision GFLOPS
- 528 double-precision GFLOPS

Time Counting
Counting in CPU ticks, bypassing:
- unsynchronized tick counts between different cores on a multiprocessor system
- lack of serialization with MSVC compilers on x64 systems
- EBX/RBX register spilling issues with GCC compilers when using position-independent code
On the machine we tested, one tick represents roughly 0.3125 ns.
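The tick length above implies a simple conversion from tick counts to wall-clock time. The short sketch below does that arithmetic; the constant comes from the slide, while the function name and the sample tick count are assumptions for illustration.

```python
NS_PER_TICK = 0.3125  # tick length reported on the slide, in nanoseconds


def ticks_to_seconds(ticks):
    """Convert a CPU tick count to seconds using the reported tick length."""
    return ticks * NS_PER_TICK * 1e-9


# Ticks per nanosecond is the tick rate in GHz.
tick_rate_ghz = 1.0 / NS_PER_TICK

# Hypothetical measurement: 3.2 billion ticks.
elapsed = ticks_to_seconds(3_200_000_000)
```

A tick of 0.3125 ns corresponds to a tick rate of about 3.2 GHz, so 3.2 billion ticks amount to roughly one second.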

AK-SVD Algorithm
Data: given dictionary D and signal set Y, compute sparse representations X and optimize dictionary D.
Iterations:
- sparse coding: for each signal y in Y
  - use OMP(D, y) to compute its representation x in X
- dictionary update: for each atom d in D
  - remove d from the dictionary
  - find the signals using d in their representation
  - optimize d keeping the representations and the dictionary fixed
  - update the representations by using the new atom d
  - update the dictionary by reintroducing the optimized atom d
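The dictionary update steps listed above can be sketched as follows. This is a minimal illustration, assuming the usual approximate K-SVD atom update (the new atom is the normalized product of the restricted residual with the current coefficients, and the coefficients are then recomputed from that atom). For readability the sketch forms the restricted residual explicitly, which an efficient implementation would avoid. Function names and toy data are hypothetical.

```python
import math


def residual_norm_sq(Y, D, X):
    """Return ||Y - D X||_F^2 for matrices stored as lists of rows."""
    n = len(X)
    return sum((Y[r][c] - sum(D[r][k] * X[k][c] for k in range(n))) ** 2
               for r in range(len(Y)) for c in range(len(Y[0])))


def aksvd_dictionary_update(Y, D, X):
    """One sequential sweep over all atoms; modifies D and X in place."""
    p, n, m = len(Y), len(X), len(Y[0])
    for j in range(n):
        # Find the signals that use atom j in their representation.
        I = [c for c in range(m) if X[j][c] != 0]
        if not I:
            continue  # unused atom: left unchanged in this sketch
        # Residual on those signals with atom j removed from the dictionary.
        F = [[Y[r][c] - sum(D[r][k] * X[k][c] for k in range(n) if k != j)
              for c in I] for r in range(p)]
        g = [X[j][c] for c in I]
        # Optimize the atom with the representations fixed: d = F g / ||F g||.
        d = [sum(F[r][i] * g[i] for i in range(len(I))) for r in range(p)]
        norm = math.sqrt(sum(v * v for v in d))
        if norm == 0.0:
            continue
        d = [v / norm for v in d]
        # Update the representations with the new atom: g = F^T d.
        for i, c in enumerate(I):
            X[j][c] = sum(F[r][i] * d[r] for r in range(p))
        # Reintroduce the optimized atom, so later atoms see the change.
        for r in range(p):
            D[r][j] = d[r]


# Toy data (hypothetical): p = 2, n = 2 unit-norm atoms, m = 3 signals.
Y = [[1.0, 0.5, 2.0],
     [1.0, 2.0, 2.0]]
D = [[1.0, 0.0],
     [0.0, 1.0]]
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]

err_before = residual_norm_sq(Y, D, X)
aksvd_dictionary_update(Y, D, X)
err_after = residual_norm_sq(Y, D, X)
```

Because each atom is written back before the next one is processed, the loop over j is inherently sequential.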

Comments
Observations:
- the dictionary is changed on each update step
- so are the sparse representations
- the current atom’s update depends on all of the atoms updated before it
- AK-SVD eliminates the need to explicitly compute the residual

PAK-SVD Sparse Coding
Data: given dictionary $D \in \mathbb{R}^{p \times n}$ and signal set $Y \in \mathbb{R}^{p \times m}$, compute sparse representations $X \in \mathbb{R}^{n \times m}$.
Sparse coding with OMP:
- using an NDR(⟨m⟩, ⟨any⟩) splitting
- large memory footprint: O(ns), where s is the desired sparsity
- all the matrices are kept in global memory
- each PE computes OMP for a single data item from Y
PE 1: $X_1 = \mathrm{OMP}(Y_1)$, PE 2: $X_2 = \mathrm{OMP}(Y_2)$, ..., PE m: $X_m = \mathrm{OMP}(Y_m)$
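The one-signal-per-PE scheme can be sketched as below: `omp` codes a single signal and `sparse_code` applies it to every column of Y independently. This is a plain sequential illustration, not the OpenCL kernel from the talk. All function names are hypothetical, and the least-squares step uses the normal equations with Gaussian elimination, chosen here only for brevity.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting.

    Assumes A is square and nonsingular.
    """
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for c in range(col, k + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * k
    for r in range(k - 1, -1, -1):
        z[r] = (M[r][k] - sum(M[r][c] * z[c] for c in range(r + 1, k))) / M[r][r]
    return z


def omp(D, y, s):
    """Greedy OMP: return n coefficients with at most s nonzeros."""
    p, n = len(D), len(D[0])
    atoms = [[D[r][j] for r in range(p)] for j in range(n)]  # columns of D
    support, coef = [], []
    residual = y[:]
    for _ in range(s):
        # Select the unused atom most correlated with the residual.
        corr = [abs(sum(a[r] * residual[r] for r in range(p))) for a in atoms]
        for j in support:
            corr[j] = 0.0
        j = max(range(n), key=corr.__getitem__)
        if corr[j] < 1e-12:
            break  # residual is (numerically) zero
        support.append(j)
        # Least squares on the selected atoms via the normal equations.
        G = [[sum(atoms[a][r] * atoms[b][r] for r in range(p)) for b in support]
             for a in support]
        rhs = [sum(atoms[a][r] * y[r] for r in range(p)) for a in support]
        coef = solve_linear(G, rhs)
        residual = [y[r] - sum(coef[i] * atoms[a][r] for i, a in enumerate(support))
                    for r in range(p)]
    x = [0.0] * n
    for i, a in enumerate(support):
        x[a] = coef[i]
    return x


def sparse_code(D, Y, s):
    """Code every signal independently; each iteration stands in for one PE."""
    n, m = len(D[0]), len(Y[0])
    cols = [omp(D, [row[c] for row in Y], s) for c in range(m)]
    return [[cols[c][k] for c in range(m)] for k in range(n)]


# Toy data (hypothetical): 3 unit-norm atoms in R^2, 2 signals, sparsity 1.
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
Y = [[2.0, 1.8],
     [0.0, 2.4]]
X = sparse_code(D, Y, 1)
```

No iteration of the loop in `sparse_code` reads anything written by another, which is what allows one work-item per signal.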

PAK-SVD Dictionary Update
Data: $D \in \mathbb{R}^{p \times n}$, $Y \in \mathbb{R}^{p \times m}$ and $X \in \mathbb{R}^{n \times m}$
Dictionary update for batches of $\tilde{n}$ atoms from D:
- calculate the full residual matrix $E = Y - DX$
- for each atom d from the current batch, do in parallel:
  - compensate the error matrix E as if the current atom was missing from the dictionary
  - find the signals using d in their representation
  - optimize d keeping the representations and the error matrix fixed
  - update the representations by using the new atom d
  - update the dictionary by reintroducing the optimized atom d
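The batch update above can be emulated sequentially as in the sketch below. To mimic parallel execution, every atom in a batch is computed from the same residual E and the same D and X, and results are written back only once the whole batch is done. The atom update formula is the same assumed approximate K-SVD step (normalized residual-times-coefficients product); function names, the `batch` parameter and the toy data are hypothetical.

```python
import math


def residual_norm_sq(Y, D, X):
    """Return ||Y - D X||_F^2 for matrices stored as lists of rows."""
    n = len(X)
    return sum((Y[r][c] - sum(D[r][k] * X[k][c] for k in range(n))) ** 2
               for r in range(len(Y)) for c in range(len(Y[0])))


def paksvd_dictionary_update(Y, D, X, batch):
    """Update atoms in batches of `batch`; modifies D and X in place."""
    p, n, m = len(Y), len(X), len(Y[0])
    for start in range(0, n, batch):
        # Full residual E = Y - D X, computed once per batch.
        E = [[Y[r][c] - sum(D[r][k] * X[k][c] for k in range(n))
              for c in range(m)] for r in range(p)]
        updates = []
        for j in range(start, min(start + batch, n)):  # one PE per atom
            I = [c for c in range(m) if X[j][c] != 0]
            if not I:
                continue  # unused atom: left unchanged in this sketch
            # Compensate E as if atom j were missing: F = E_I + d_j x_{j,I}.
            F = [[E[r][c] + D[r][j] * X[j][c] for c in I] for r in range(p)]
            g = [X[j][c] for c in I]
            d = [sum(F[r][i] * g[i] for i in range(len(I))) for r in range(p)]
            norm = math.sqrt(sum(v * v for v in d))
            if norm == 0.0:
                continue
            d = [v / norm for v in d]
            g_new = [sum(F[r][i] * d[r] for r in range(p)) for i in range(len(I))]
            updates.append((j, I, d, g_new))
        # Write back after the batch, so atoms in a batch never see each other.
        for j, I, d, g_new in updates:
            for r in range(p):
                D[r][j] = d[r]
            for i, c in enumerate(I):
                X[j][c] = g_new[i]


def run(batch):
    """Apply one update sweep to fresh toy data and return the final error."""
    Y = [[1.0, 0.5, 2.0], [1.0, 2.0, 2.0]]
    D = [[1.0, 0.0], [0.0, 1.0]]
    X = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
    paksvd_dictionary_update(Y, D, X, batch)
    return residual_norm_sq(Y, D, X)


err_one_at_a_time = run(1)   # batch of 1: atoms updated one after another
err_full_batch = run(2)      # batch of n: all atoms updated together
```

In this toy case the two atoms are used by disjoint signals, so both batch sizes reach the same result; in general the batch size changes the outcome because atoms in a batch do not see each other's updates.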

PAK-SVD Dictionary Update (2)
We use an NDR(⟨$\tilde{n}$⟩, ⟨any⟩) splitting for updating $\tilde{n}$ atoms at a time:
PE 1: $D_1, X_{D_1}$; PE 2: $D_2, X_{D_2}$; ...; PE $\tilde{n}$: $D_{\tilde{n}}, X_{D_{\tilde{n}}}$
Each PE is in charge of updating one atom.
Memory layout:
- private: d, the atom being updated
- local or global: I, indices of signals using d
- global: E, X, D

Matrix Multiplication
OpenCL implementation:
- split the N-dimensional space as NDR(⟨n, m⟩, ⟨64, 64⟩)
- block-based multiplication
- calculating a block is performed within a work-group
Memory layout:
- global: input and output matrices
- local: copied input block sub-matrices
- private: vectorized types for dot operations
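Block-based multiplication can be illustrated with the sequential sketch below, where each (bi, bj) output block plays the role of one work-group and the copied tiles stand in for local memory. The function name, the block size and the sample matrices are assumptions for illustration; the vectorized dot operations mentioned above are not modeled.

```python
def blocked_matmul(A, B, block):
    """Multiply A (n x k) by B (k x m) one output block at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for bi in range(0, n, block):        # each (bi, bj) pair: one work-group
        for bj in range(0, m, block):
            for bk in range(0, k, block):
                # Copy the two input tiles, as a kernel would into local memory.
                a_tile = [row[bk:bk + block] for row in A[bi:bi + block]]
                b_tile = [row[bj:bj + block] for row in B[bk:bk + block]]
                # Accumulate this tile pair's contribution to the output block.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        C[bi + i][bj + j] += sum(a_row[t] * b_tile[t][j]
                                                 for t in range(len(b_tile)))
    return C


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
B = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]
C = blocked_matmul(A, B, 2)
```

Slicing handles edge tiles that are smaller than the block size, so the dimensions need not be multiples of `block` in this sketch.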

Experimental Results: Error
[Figure: RMSE (dB) versus iterations (0 to 200); curves: AK-SVD, $\tilde{n} = 64$, $\tilde{n} = 256$, $\tilde{n} = 512$]
Error evolution for m = 16384, n = 512, s = 12.

Experimental Results: Performance (1)
[Figure: $\log_{10}$(time (s)) versus number of atoms (128, 256, 512); curves: CPU, $\tilde{n}$ = 1, 2, 4, 8, 16, 32, 64, 128]
Execution times for m = 16384, s = 10, K = 200.

Experimental Results: Performance (2)
[Figure: $\log_{10}$(time (s)) versus number of signals (8192, 16384, 32768, 65536); curves: CPU, $\tilde{n}$ = 1, 8, 16, 32, 64, 128, 256, 512]
Execution times for n = 512, s = 8, K = 100.

Experimental Results: More Error Results
Table: Final errors for AK-SVD and PAK-SVD with $\tilde{n} = n$.

        n = 128           n = 256           n = 512
 s      AK      PAK       AK      PAK       AK      PAK
 4      0.0425  0.0407    0.0385  0.0387    0.0376  0.0372
 6      0.0374  0.0349    0.0334  0.0316    0.0311  0.0297
 8      0.0345  0.0306    0.0294  0.0272    0.0259  0.0245
10      0.0322  0.0276    0.0276  0.0239    0.0233  0.0206
12      0.0319  0.0249    0.0254  0.0205    0.0221  0.0176

Conclusions
PAK-SVD improves AK-SVD:
- performs up to 12x faster
- parallel sparse coding stage
- parallel dictionary update
- smaller representation error
