FPGA Acceleration for Computational Glass-Free Displays
Zhuolun He and Guojie Luo, Peking University
FPGA, Feb. 2017
Motivation: Hyperopia/Myopia Issues
Background Technology: Glass-Free Display
• Light-field display
– [Huang and Wetzstein, SIGGRAPH 2014]
• Correcting for visual aberrations
– Display: predistorted content
– Retina: desired image
[Figure: the display emits a target light field that forms the desired perception on the retina]
Related Technologies: Light Field Camera
Related: Near-Eye Light-Field Display
Source: NVIDIA, SIGGRAPH Asia 2013
Pinhole Array vs. Microlens
One 75 µm pinhole every 390 µm, manufactured using lithography
In this Paper…
• Analyze the computational kernels
• Accelerate using FPGAs
• Propose several optimizations
Computational Glass-Free Display
[Figure: the display's target light field propagates to the retina to form the desired perception]
The perceived retinal image is a linear projection of the displayed light field: $v = P\,u$
Casting as a Model Fitting Problem
minimize $g(y) = \|v - Qy\|^2$
subject to $0 \le y \le 1$
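L-BFGS (introduced on the next slides) only needs objective values and gradients. For this least-squares objective the gradient has the standard closed form below; handling the box constraint $0 \le y \le 1$ by projection is an assumption, since the slides do not state the method:

```latex
g(y) = \|v - Qy\|^2 = (v - Qy)^\top (v - Qy)
\qquad\Longrightarrow\qquad
\nabla g(y) = 2\,Q^\top (Qy - v)
```

Each gradient evaluation is therefore two sparse matrix-vector products (with $Q$ and $Q^\top$), which is why SpMV shows up later as the other key kernel.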
Background of the L-BFGS Algorithm
• L-BFGS: a widely-used convex optimization algorithm
[Flowchart:]
1. Calculate gradient $\nabla g(y_l)$
2. Converged? If yes, done; otherwise:
3. Calculate direction $q_l$
4. Search for step length $\beta_l$
5. Update $y_{l+1} = y_l + \beta_l q_l$ and repeat
Background of the L-BFGS Algorithm
• Input (history size $m$):
– iterates $y_{l-m+1}, \ldots, y_l$ and gradients $\nabla g(y_{l-m+1}), \ldots, \nabla g(y_l)$
– differences $t_k = y_{k+1} - y_k$ and $z_k = \nabla g(y_{k+1}) - \nabla g(y_k)$
• Output: direction $q_l$
• Two-loop recursion:
$q_l = -\nabla g(y_l)$
for $j = l-1$ down to $l-m$:
  $\beta_j = (t_j \cdot q_l) \,/\, (t_j \cdot z_j)$
  $q_l = q_l - \beta_j z_j$
end for
$q_l = q_l \cdot (t_{l-1} \cdot z_{l-1}) \,/\, (z_{l-1} \cdot z_{l-1})$
for $j = l-m$ to $l-1$:
  $\gamma_j = (z_j \cdot q_l) \,/\, (t_j \cdot z_j)$
  $q_l = q_l + (\beta_j - \gamma_j)\,t_j$
end for
return direction $q_l$
• Computational kernels: dot products and vector updates (sketched below)
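A minimal NumPy sketch of the two-loop recursion above; `ts` and `zs` hold the last $m$ difference vectors $t_k$ and $z_k$ (the names are illustrative, not from the paper):

```python
import numpy as np

def two_loop_direction(grad_l, ts, zs):
    """L-BFGS two-loop recursion, as on this slide.
    grad_l: gradient at y_l; ts[k] = y_{k+1} - y_k; zs[k] = gradient
    difference. Assumes at least one stored (t, z) pair."""
    q = -grad_l
    m = len(ts)
    betas = [0.0] * m
    # First loop: j = l-1 down to l-m.
    for j in reversed(range(m)):
        betas[j] = np.dot(ts[j], q) / np.dot(ts[j], zs[j])
        q = q - betas[j] * zs[j]
    # Scale by the initial (diagonal) Hessian approximation.
    q = q * (np.dot(ts[-1], zs[-1]) / np.dot(zs[-1], zs[-1]))
    # Second loop: j = l-m up to l-1.
    for j in range(m):
        gamma = np.dot(zs[j], q) / np.dot(ts[j], zs[j])
        q = q + (betas[j] - gamma) * ts[j]
    return q  # search direction q_l
```

Every operation is one of the two kernels named above: a dot product over full-length vectors or a scalar-times-vector update.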
Vector-free L-BFGS Algorithm
• Original idea: [NIPS 2014]
• Observation: throughout the two-loop recursion (previous slide), $q_l$ stays a linear combination of the basis vectors in $\{t_k\}$ and $\{z_k\}$ (plus $\nabla g(y_l)$)
• Techniques (see the sketch below):
– dot product ⇒ table lookup + scalar op.
– vector update ⇒ coefficient update
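A sketch of the vector-free idea under one concrete set of assumptions: the basis is ordered $B = [t_0, \ldots, t_{m-1}, z_0, \ldots, z_{m-1}, \nabla g(y_l)]$ and $D[i][j] = B_i \cdot B_j$ is the precomputed dot-product table (the layout and names are illustrative):

```python
import numpy as np

def vector_free_direction_coeffs(D, m):
    """Two-loop recursion on coefficients only: the direction is tracked
    as delta with q_l = sum_i delta[i] * B[i]; every dot product becomes
    table lookups, every vector update a coefficient update."""
    n = 2 * m + 1
    delta = np.zeros(n)
    delta[2 * m] = -1.0                      # q = -grad(y_l)

    def dot_with(i):                         # B[i] . q via table lookups
        return sum(delta[k] * D[i][k] for k in range(n))

    betas = np.zeros(m)
    for j in reversed(range(m)):
        betas[j] = dot_with(j) / D[j][m + j]           # (t_j.q)/(t_j.z_j)
        delta[m + j] -= betas[j]                       # q -= beta_j * z_j
    delta *= D[m - 1][2 * m - 1] / D[2 * m - 1][2 * m - 1]  # scaling step
    for j in range(m):
        gamma = dot_with(m + j) / D[j][m + j]          # (z_j.q)/(t_j.z_j)
        delta[j] += betas[j] - gamma                   # q += (b-g) * t_j
    return delta
```

Only the $(2m+1)^2$ scalar table and the coefficients cross the loops; the full-length basis vectors are read once at the end, when $q_l = \sum_i \delta_i B_i$ is materialized.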
[Figure: dataflow comparison. Original L-BFGS: dotprod(), scalar-vector multiplies, and vector-vector adds operate on full-length vectors, from $\nabla g(y_l)$ down to $q_l$. Vector-free L-BFGS: lookups into a dot-product table plus scalar ops update coefficient sets {coeff.}, producing $q_l$ at the end.]
Updating the Dot Product Table
• [NIPS 2014]: distributed computing using MapReduce; focus: minimize #syncs
• Ours: FPGA acceleration with small on-chip BRAM; focus: minimize data transfers
• Similar idea to reduce data transfers (a sketch of the incremental update follows):
– dot product ⇒ lookup + scalar op.
– vector update ⇒ coeff. update
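A sketch of what the incremental table update can look like: when the newest pair and a fresh gradient arrive, only the table entries involving them must be computed, so each stored vector is streamed once per iteration (the function name and table keying are illustrative, not the paper's exact scheme):

```python
import numpy as np

def new_table_entries(ts, zs, grad_new):
    """Dot-product table entries to (re)compute after appending the
    newest pair (ts[-1], zs[-1]); all other entries are unchanged and
    reused. Cost: O(m) dot products, i.e. O(m*d) data movement, instead
    of rebuilding the whole table for O(m^2 * d)."""
    t_new, z_new = ts[-1], zs[-1]
    entries = {}
    for k in range(len(ts)):
        for name, vec in (("t", ts[k]), ("z", zs[k])):
            entries[(name, k, "t_new")] = np.dot(vec, t_new)
            entries[(name, k, "z_new")] = np.dot(vec, z_new)
            entries[(name, k, "grad")] = np.dot(vec, grad_new)
    return entries
```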
Distributed vs. FPGA-based
• [NIPS 2014]: distributed computing using MapReduce; minimize #syncs; data transfer: $8md$
• Ours: FPGA acceleration with small on-chip BRAM; minimize data transfers; data transfer: $(4m+4)d$
where $m$ is the history size (e.g., 10) and $d$ is the image size.
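Plugging in the slide's example $m = 10$ makes the gap concrete:

```latex
8md = 80d
\qquad \text{vs.} \qquad
(4m+4)d = 44d
```

i.e., roughly $1.8\times$ less data moved per L-BFGS iteration, independent of the image size $d$.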
Sparse Matrix-Vector Multiplication
minimize $g(y) = \|v - Qy\|^2$
• Size of matrix/vector
– Sparse matrix $Q$: $16384 \times 490000$
– Variable $y$: $490000$
Sparse Matrix-Vector Multiplication
minimize $g(y) = \|v - Qy\|^2$
• Problem: storage of $Q$ (~810K non-zero entries, ~600 unique values)
• Solution:
– sparsity ⇒ compressed row storage (CRS)
– range of indices ⇒ bitwidth reduction
– #unique values ⇒ look-up table (LUT)

Format    Storage (MB)
flat      32112.64
COO       6.63
CRS       5.24
CRS+LUT   2.90
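A back-of-envelope model that roughly reproduces the table. The bitwidths below are assumptions chosen to illustrate the three techniques, not the paper's exact encoding:

```python
# Rough storage model for the 16384 x 490000 sparse matrix Q.
nnz = 810_000            # ~810K non-zero entries (slide)
rows, cols = 16_384, 490_000
uniq = 600               # ~600 unique values (slide)

bits_row, bits_col = 14, 19   # 2^14 >= 16384 rows, 2^19 >= 490000 cols
bits_val, bits_lut = 32, 10   # float32 value vs. 10-bit LUT index

flat = rows * cols * bits_val // 8                     # dense float32
coo = nnz * (bits_row + bits_col + bits_val) // 8      # packed triples
crs = (nnz * (bits_col + bits_val) + (rows + 1) * 32) // 8
crs_lut = (nnz * (bits_col + bits_lut) + (rows + 1) * 32) // 8 + uniq * 4

for name, nbytes in [("flat", flat), ("COO", coo),
                     ("CRS", crs), ("CRS+LUT", crs_lut)]:
    print(f"{name:8s} {nbytes / 1e6:10.2f} MB")
```

These widths give 32112.64, 6.58, 5.23, and 3.00 MB; the small gaps to the table's figures come from encoding details the slide does not state.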
Sparse Matrix-Vector Multiplication
minimize $g(y) = \|v - Qy\|^2$
• Problem: partitioning vector $y$ for parallel access
• "Solution":
– matrix $Q$ is irregular but constant
– ⇒ access pattern is non-affine but statistically analyzable
– ⇒ enumerate factors of $|y|$ as partitioning factors, scoring each as sketched below

Factor  Method  Min cycle/row  Max cycle/row  Total cycles
980     cyclic  1              1              16384
1225    cyclic  1              1              16384
1250    cyclic  1              2              19840
…
1400    block   4              18             188564
1250    block   5              18             193276
…
1       N/A     37             54             816272
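Because $Q$ is fixed, every candidate factor can be scored offline by replaying its column-access pattern; below is a hypothetical scorer for cyclic partitioning (one BRAM port per bank, banks accessed in parallel), not the paper's exact cost model:

```python
def score_cyclic_factor(col_indices_by_row, factor):
    """For cyclic partitioning of y into `factor` banks (element i lives
    in bank i % factor), a row's cycle count is the worst-case number of
    accesses colliding on a single bank."""
    total, worst, best = 0, 0, float("inf")
    for cols in col_indices_by_row:
        hits = {}
        for c in cols:                        # count accesses per bank
            b = c % factor
            hits[b] = hits.get(b, 0) + 1
        cycles = max(hits.values(), default=0)
        total += cycles
        worst, best = max(worst, cycles), min(best, cycles)
    return best, worst, total
```

With factor 1 every access collides on the single bank, so the total equals the number of nonzeros, matching the table's 816272; enumerating the factors of $|y| = 490000$ and keeping the best total reproduces the search, e.g. factor 980 reaching 1 cycle/row.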
Overall Design of the Accelerator
• [Li et al., FPGA 2015]
• Maximize performance
• Subject to resource constraints
Experimental Evaluation
Runtime comparison, time in seconds:
• Baseline: 124.5
• SpMV optimization: 65.49
• L-BFGS enhancement: 47.47
• Parameter tuning in L-BFGS: 25.26
• Overall result after other fine tunings: 9.74
+: 12.78× speedup
-: peak memory bandwidth < 800 MB/s
Conclusions
• Summary
– Bandwidth-friendly L-BFGS algorithm
– Application-specific sparse matrix compression
– Memory partitioning for non-affine accesses
• Future work
– Possibility of real-time processing
– Constructing the transformation matrix via eyeball tracking
– A demonstration system
Questions?
Runtime Profiling of a 2-min L-BFGS Run
[Backup slide: charts of runtime per procedure and per operation]