Black-box approximation of bivariate functions …for FPGAs David B. Thomas, Imperial College London
Chosen approximation problem • Input Ω xy = Ω x x Ω y ( 2 fx Z ) x ( 2 fy Z ) • Fixed-point domain: • Fixed-point range: Ω r = 2 fr Z • A function: f : Ω xy → Ω r • An error budget: ( ε >=2 fr ) ε Does not need to be differentiable, analytic, or smooth • Output • An approximation: a : Ω xy → Ω r ꓯ (x,y) Ω xy : | f(x,y) – a(x,y) | < ε • A certificate: • Synthesisable C
Classic example : atan2 • Input domain : Ω xy = [-1,1) x [-1,1) fx=-8, fy=-8 • Output domain : fr = -16
Classic example : atan2 • Output domain : fr = -1
1d function approximation • Function-specific range-reduction • Split the remaining problem up into easier parts • Come up with small primitives that locally fit • Can use Remez algorithm to find best polynomial x* i x r Range Domain Primitive Range r* x* Reduction Splitting Evaluation Reconst.
2d function approximation Polynomial Uniform grid Spline Symmetries Binary split Relative -> Absolute RBF Odd-even function Quadtree Chebfun2 Hash-table Kernels x* i x r Range Domain Primitive Range y* r* x* y Reduction Splitting Evaluation Reconst. y*
Why this is hard? • Remez approach does not work • Degree m polynomials get expensive in 2D • Memory: O(m) -> O(m 2 ) coefficients • Area : O(m) -> O(m 2 ) multipliers • Delay: O(m) -> O(2m) stages • Uniform domain splitting requires too much RAM • Non-uniform domain splitting get expensive • Large number of splitting coefficients • Potential for very deep trees
Why make it harder? • Why not require partial derivatives? • Doesn’t actually buy us much in practise • Makes it more difficult for end-users • Why not smooth/analytic/continuous functions? • Going to have to prove correctness empirically anyway • Why not exploit function-specific range reductions? • Many interesting 2D functions do not have any • Often we just get symmetry, so “only” a factor of 2 saving
Making it easier • Target domains of up to about 2 24 x 2 24 bits • 1 CPU : Verify by exhaustion up to ~ 2 20 x 2 20 in a day • 9 EC2 c4.large: verify up to 2 24 x 2 24 in a day and $108 • Though arguably a bit pointless • Target range of up to ~2 24 bits • e.g. [0,1) with lsb=-24, or [-4,+4) with lsb=-16 • Expect user to try to manage range a bit • Though the method will work with larger ranges • Treat double as “real” • Target function given as: double f(double x, double y); • Reasonable given the above assumptions • We are unlikely to have mpfr versions of functions
Restricting the solution space • Influenced by experience with FloatApprox • No function-specific range reduction • Only use binary trees to split the domain • Hash-tables are very promising: not covered here • Only use polynomial patches as primitives • Chebfun2-like patches quite promising Binary split Polynomial i x Domain Primitive r x* y Splitting Evaluation y*
High-level Algorithm int a (int x , int y ) { // Constant table of polynomials static const poly_t polynomials [...] = { ... }; // Select index of polynomial segment int i = select_segment ( x , y ); // Get all the coefficients poly_t poly = polynomials [ i ]; // Evaluate that patch at input co-ordinates return eval ( poly , x , y ); }
Segmentation by binary split struct node_t { bool dir ; int split ; int left , right ; }; int select_segment (int x , int y ) { // Constant table of splits static const node_t n0 [ 1 ] = { ... }; static const node_t n1 [ 2 ] = { ... }; static const node_t n2 [ 4 ] = { ... }; static const node_t n3 [ 7 ] = { ... }; // Select index of polynomial segment using pipeline int i = 0 ; i =( n0 [ i ]. dir ? x : y ) < n0 [ i ]. split ? n0 [ i ]. left : n0 [ i ]. right ; i =( n1 [ i ]. dir ? x : y ) < n1 [ i ]. split ? n1 [ i ]. left : n1 [ i ]. right ; i =( n2 [ i ]. dir ? x : y ) < n2 [ i ]. split ? n2 [ i ]. left : n2 [ i ]. right ; i =( n3 [ i ]. dir ? x : y ) < n3 [ i ]. split ? n3 [ i ]. left : n3 [ i ]. right ; return i ; }
Residual of atan2 Degree=2, fracBits=-10, maxError=0.001
Residual of atan2 degree 3, polys=154 degree 4, polys=95
Residual of atan2 maxError=0.001, polys=154, maxError=0.0001, polys=676
Poly count versus input precision Quadratic, maxError=0.001 65536 Number of polynomials 16384 4096 1024 256 64 10 12 14 16 18 Domain Fractional Bits func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t
Poly count versus input precision Cubic, maxError=0.001 65536 Number of polynomials 16384 4096 1024 256 64 10 12 14 16 18 Domain Fractional Bits func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t
Controlling the split • Which dimension to split: x or y? • Alternating dimensions: uniformly sub-divide to squares • Repeat one dimension: allows long thin patches • Where should we split along the dimension? • Split in the middle: narrow binary constants in tree • Split adaptatively: manage the depth of the tree
Curvature based splitting • We need to estimate curvature in some way • Currently use number of Chebyshev coefficients • Fit Chebyshev curves to f( x ,y mid ) and f(x mid , y ) • Use highest degree non-zero coeff as “curvature” • Split along the “hardest” direction • Also played around with other methods • E.g. varying split point to balance curvature • Can sometimes reduce tree depth
Naïve versus curvature split 10000 Number of polynomials 1000 100 10 Function curvature_split naive
Evaluating a 2D polynomial • Standard 1D polynomial • p(x) = a 0 + a 1 x + a 2 x 2 + a 3 x 3 … a m x m • Extending to a 2D polynomial + a 3,0 x 3 + … • p(x,y) = a 0,0 + a 1,0 x + a 2,0 x 2 + a 0,1 y + a 1,1 x y + a 2,1 x 2 y + a 3,1 x 3 y + … + a 0,2 y 2 + a 1,2 x y 2 + a 2,2 x 2 y 2 + a 3,2 x 3 y 2 + … + a 0,3 y 3 + a 1,3 x y 3 + a 2,3 x 2 y 3 + a 3,3 x 3 y 3 + … + • Need m 2 coefficients • Naïve form • Requires m 2 multipliers • Horrible numeric properties: requires many guard bits
2D Horner form c33 c23 c13 c03 x * * * * c32 + c22 + c12 + c02 + * * * * c31 + c21 + c11 + c01 + * * * * c30 + c20 + c10 + c00 + * + * + * + y
Advantages of Horner form • Only requires (m-1) 2 multipliers • Latency is 2(m+1) multipliers • Evaluation guard bits are roughly O(log m) bits • Requires -1 <= x <= +1 and -1 <= y <= +1 • Pre-scale all polynomials • Start with native polynomial range [xL,xH) x [yL,yH) • Use offsets xO = -(xL+xH)/2 and yL = -(yL+hH)/2 • Find integers xS and yS such that: -1 <= 2 xS (xL+xO) <0.5 and 0.5 < 2 xS (xH+xO) <= +1 • • Also drastically reduces range of intermediates • Means multiplier input widths are much smaller
Fitting polynomials • We don’t have the Remez property • Use least squares to fit the coefficients + a 3,0 x 3 + … + a 1,0 x + a 2,0 x 2 p(x,y) = a 0,0 + a 0,1 y + a 1,1 x y + a 2,1 x 2 y + a 3,1 x 3 y + … + a 0,2 y 2 + a 1,2 x y 2 + a 2,2 x 2 y 2 + a 3,2 x 3 y 2 + … + a 0,3 y 3 + a 1,3 x y 3 + a 2,3 x 2 y 3 + a 3,3 x 3 y 3 + … • Find a set of n points p spread over interval • Regress to find best coefficients in least squares sense
Optimising polynomials • Least squares fit is not a minimax approximation… • Attempt 1: successive over-constrained fit • Use residual in previous fit to nudge next fit • Attempt 2 : downhill simplex to optimise coefficients • Tiny improvements in maximum error • Attempt 3 : use Chebyshev nodes during fit • No noticeable improvement over uniform nodes • Attempt 4 : express the problem as discrete LP • Far too slow to be useful
Overall: total BRAMs, error=10 -3 30 Number of Block RAMs 25 20 15 10 5 0 10 12 14 16 18 Domain fractional bits func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t
Overall: total BRAMs, error=10 -5 500 Number of Block RAMs 400 300 200 100 0 10 12 14 16 18 Domain fractional bits func_atan2_whole func_chebfun2_ex_f1 func_gamma_p func_gamma_p_inv func_gamma_p_inva func_lbeta func_normpdf_offset func_owens_t
Conclusion • Black-box 2D function approximation is feasible • At least up to around 2 24 in each dimension • Major constraint is block RAM usage • Most functions only take up a fraction of RAM though • Route to hardware is via HLS • Generate C code describing tables • Ongoing work • Optimising the splitting process • Improving monomial basis selection • Lower-level hardware optimisations
Depopulated polynomials c03 x * c12 c02 + * * c21 c11 + c01 + * * * c30 c20 + c10 + c00 + * + * + * + y
Recommend
More recommend