  1. From automatic differentiation to message passing. On the Balcony. Tom Minka, Microsoft Research

  2. What I do: algorithms for probabilistic inference
      • Expectation Propagation
      • Non-conjugate variational message passing
      • A* sampling
      (Applications: Probabilistic Programming, TrueSkill)

  3. Machine Learning Language
      • A machine learning language should (among other things) simplify implementation of machine learning algorithms

  4. Machine Learning Language
      • A general-purpose machine learning language should (among other things) simplify implementation of all machine learning algorithms

  5. Roadmap
      1. Automatic Differentiation
      2. AutoDiff lacks approximation
      3. Message passing generalizes AutoDiff
      4. Compiling to message passing

  6. 1. Automatic / algorithmic differentiation

  7. Recommended reading
      • “Evaluating Derivatives” by Griewank and Walther (2008)

  8. Programs are the new formulas
      • Programs can specify mathematical functions more compactly than formulas
      • Program is not a black box: undergoes analysis and transformation
      • Numbers are assumed to have infinite precision

  9. Multiply-all example
      As formulas:
      • $f = \prod_j x_j$
      • $df = \sum_j dx_j \prod_{k \neq j} x_k$

  10. Multiply-all example
      Input program:
        c[1] = x[1]
        for i = 2 to n
          c[i] = c[i-1]*x[i]
        f = c[n]
      Derivative program:
        dc[1] = dx[1]
        for i = 2 to n
          dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
        df = dc[n]
      ($f = \prod_j x_j$, $df = \sum_j dx_j \prod_{k \neq j} x_k$)
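
As a concrete check, here is a minimal Python sketch of the derivative program above (a hand-written translation of the slide's pseudocode, not the slides' own code):

def multiply_all_forward(x, dx):
    # Execution of the input program together with its linearization:
    # c[i] accumulates the running product, dc[i] its differential along dx.
    n = len(x)
    c = [0.0] * n
    dc = [0.0] * n
    c[0] = x[0]
    dc[0] = dx[0]
    for i in range(1, n):
        c[i] = c[i-1] * x[i]
        dc[i] = dc[i-1] * x[i] + c[i-1] * dx[i]
    return c[n-1], dc[n-1]   # f and df

# Example: f = 2*3*4 = 24 and df along dx = (1,0,0) is 3*4 = 12, i.e.
# multiply_all_forward([2.0, 3.0, 4.0], [1.0, 0.0, 0.0]) == (24.0, 12.0)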

  11. Phases of AD
      • Execution: replace every operation with a linear one
      • Accumulation: collect linear coefficients

  12. Execution phase
      The expression x*y + y*z is replaced by its linearization dx*y + x*dy + dy*z + y*dz
      (computation graph with scale factors shown on the slide)

  13. Accumulation phase
      For dx*y + x*dy + dy*z + y*dz:
      • coefficient of dx = 1*y
      • coefficient of dy = 1*x + 1*z
      • coefficient of dz = 1*y
      Gradient vector = (1*y, 1*x + 1*z, 1*y)
      (forward and reverse accumulation graphs shown on the slide)
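
A small Python sketch of forward accumulation for this example, f = x*y + y*z: each input perturbation is seeded to 1 in turn, so recovering the full gradient takes one pass per input, whereas reverse accumulation collects all coefficients in a single backward pass (function names are mine, for illustration only):

def linearized(x, y, z, dx, dy, dz):
    # Execution-phase output: the linearization of f = x*y + y*z.
    return dx*y + x*dy + dy*z + y*dz

def forward_accumulate(x, y, z):
    # Forward accumulation: seed one perturbation at a time.
    return (linearized(x, y, z, 1, 0, 0),   # coefficient of dx = y
            linearized(x, y, z, 0, 1, 0),   # coefficient of dy = x + z
            linearized(x, y, z, 0, 0, 1))   # coefficient of dz = y

# forward_accumulate(x, y, z) returns the gradient vector (y, x + z, y).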

  14. Linear composition
      e*(a*x + b*y) + f*(c*y + d*z) = (e*a)*x + (e*b + f*c)*y + (f*d)*z
      (composition of the linear graphs shown on the slide)

  15. Dynamic programming
      • Reverse accumulation is dynamic programming
      • Backward message is a sum over paths to the output
      (graph on the slide with backward messages (e+f)*a and (e+f)*b)

  16. Source-to-source translation
      • Tracing approach builds a graph during the execution phase, then accumulates it
      • Source-to-source produces a gradient program matching the structure of the original

  17. Multiply-all example
      Input program:
        c[1] = x[1]
        for i = 2 to n
          c[i] = c[i-1]*x[i]
        return c[n]
      Derivative program:
        dc[1] = dx[1]
        for i = 2 to n
          dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
        return dc[n]
      (dataflow graphs of the loop bodies shown on the slide)

  18. Multiply-all example
      Derivative program:
        dc[1] = dx[1]
        for i = 2 to n
          dc[i] = dc[i-1]*x[i] + c[i-1]*dx[i]
        return dc[n]
      Gradient program:
        dcB[n] = 1
        for i = n downto 2
          dcB[i-1] = dcB[i]*x[i]
          dxB[i] = dcB[i]*c[i-1]
        dxB[1] = dcB[1]
        return dxB
      (dataflow graphs of the loop bodies shown on the slide)
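
A Python sketch of the gradient program, again following the slide's pseudocode: a forward sweep stores the intermediates c[i], then a reverse sweep accumulates the adjoints.

def multiply_all_gradient(x):
    n = len(x)
    # Forward sweep: run the input program and keep the intermediates.
    c = [0.0] * n
    c[0] = x[0]
    for i in range(1, n):
        c[i] = c[i-1] * x[i]
    # Reverse sweep: cB[i] = df/dc[i], xB[i] = df/dx[i].
    cB = [0.0] * n
    xB = [0.0] * n
    cB[n-1] = 1.0
    for i in range(n-1, 0, -1):
        cB[i-1] = cB[i] * x[i]
        xB[i] = cB[i] * c[i-1]
    xB[0] = cB[0]
    return xB

# multiply_all_gradient([2.0, 3.0, 4.0]) == [12.0, 8.0, 6.0]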

  19. General case
      c = f(x,y)
      Derivative: dc = df1(x,y) * dx + df2(x,y) * dy
      Gradient: dxB = dcB * df1(x,y), dyB = dcB * df2(x,y)
      (dataflow graph shown on the slide)
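
The same rules in Python for scalar inputs; f1 and f2 are hypothetical stand-ins for the partial-derivative functions df1 and df2 above, assumed to be supplied by the user:

def derivative_rule(f1, f2, x, y, dx, dy):
    # Forward rule for c = f(x, y): dc = df1(x,y)*dx + df2(x,y)*dy
    return f1(x, y) * dx + f2(x, y) * dy

def gradient_rule(f1, f2, x, y, cB):
    # Reverse rule: each input adjoint is the output adjoint scaled by a partial.
    return cB * f1(x, y), cB * f2(x, y)

# Example for c = x*y: f1 = lambda x, y: y and f2 = lambda x, y: x.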

  20. Fan-out
      • If a variable is read multiple times, we need to add its backward messages
      • Non-incremental approach: transform the program so that each variable is defined and used at most once on every execution path

  21. Fan-out example
      Input program:
        a = x * y
        b = y * z
        c = a + b
      Edge program:
        (y1,y2) = dup(y)
        a = x * y1
        b = y2 * z
        c = a + b
      Gradient program:
        aB = cB
        bB = cB
        y2B = bB * z
        y1B = aB * x
        yB = y1B + y2B
        …
      (dup node for y shown on the slide; see the sketch below)
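
A Python sketch of this gradient program; the adjoints for x and z are not shown on the slide (they follow the same rule), so the lines computing them here are my own completion:

def fanout_gradient(x, y, z):
    # Forward: c = x*y + y*z, with y read twice (fan-out).
    a = x * y
    b = y * z
    c = a + b
    # Reverse: start from cB = 1 and propagate adjoints edge by edge.
    cB = 1.0
    aB = cB              # c = a + b
    bB = cB
    y2B = bB * z         # b = y2 * z
    y1B = aB * x         # a = x * y1
    yB = y1B + y2B       # dup: add the two backward messages for y
    xB = aB * y          # not on the slide; same rule applied to x ...
    zB = bB * y          # ... and to z
    return c, (xB, yB, zB)

# Gradient of x*y + y*z is (y, x + z, y), matching the accumulation slide.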

  22. Summary of AutoDiff
                                         AD     Message passing
      Programs not formulas              Yes    Yes
      Graph structure / sparsity         Yes    Yes
      Source-to-source                   Yes    Yes
      Only one execution path            Yes    Not always
      Single forward-backward sweep      Yes    Not always
      Exact                              Yes    Not always

  23. 2. AutoDiff lacks approximation

  24. Approximate gradients for big models
      • Mini-batching (m indices s drawn from 1:n):
        $\nabla \sum_{i=1}^{n} f_i(\theta) \approx \nabla \frac{n}{m} \sum_{s \sim (1:n)} f_s(\theta) = \frac{n}{m} \sum_{s \sim (1:n)} \nabla f_s(\theta)$
      • User changes the input program to be approximate, then computes the exact (AutoDiff) gradient
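
A sketch of the mini-batch estimator in Python, assuming a scalar parameter and a hypothetical user-supplied grad_f(i, theta) that returns the gradient of the i-th term:

import random

def minibatch_gradient(grad_f, n, theta, m):
    # Sample m of the n terms and rescale by n/m: an unbiased estimate
    # of the exact gradient of sum_i f_i(theta).
    total = 0.0
    for _ in range(m):
        i = random.randrange(n)          # draw a term index from 1:n (0-based)
        total += grad_f(i, theta)
    return (n / m) * total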

  25. Black-box variational inference
      1. Approximate the marginal log-likelihood with a lower bound: $\log \int p(x, D)\,dx \geq -KL(q \,\|\, p)$
      2. Approximate the lower bound by importance sampling
      3. Compute the exact gradient of the approximation
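
A minimal sketch of steps 1-2 in Python, using plain Monte Carlo samples from a reparameterized Gaussian q rather than general importance sampling; log_joint is a hypothetical user-supplied function computing log p(x, D):

import math, random

def elbo_estimate(log_joint, mean, log_std, num_samples=10):
    # Stochastic lower bound: average of log p(x, D) - log q(x) over
    # samples x = mean + std * eps, eps ~ N(0, 1).
    # AutoDiff of this estimate is step 3: the exact gradient of the approximation.
    std = math.exp(log_std)
    total = 0.0
    for _ in range(num_samples):
        eps = random.gauss(0.0, 1.0)
        x = mean + std * eps
        log_q = -0.5 * eps ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
        total += log_joint(x) - log_q
    return total / num_samples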

  26. AutoDiff in Tractable Models
      • AutoDiff can mechanically derive reverse summation algorithms for tractable models
        • Markov chains, Bayesian networks (Darwiche, 2003)
        • Generative grammars, parse trees (Eisner, 2016)
      • Posterior expectations are derivatives of the marginal log-likelihood, which can be computed exactly
      • User must provide the forward summation algorithm
      (a small example grammar and parse tree shown on the slide)

  27. Approximation in Tractable Models
      • Approximation is useful in tractable models
        • Sparse forward-backward (Pal et al., 2006)
        • Beam parsing (Goodman, 1997)
      • Cannot be obtained through AutoDiff of an approximate model
      • Neither can Viterbi

  28. MLL should facilitate approximations
      • Expectations
      • Fixed-point iteration
      • Optimization
      • Root finding
      • Should all be natively supported

  29. 3. Message-passing generalizes AutoDiff

  30. Message-passing
      • Approximate reasoning about the exponential state space of a program, along all execution paths
      • Propagates state summaries in both directions
      • Forward can depend on backward and vice versa
      • Iterate to convergence

  31. Interval constraint propagation
      • What is the largest and smallest value each variable could have?
      • Each operation in the program is interpreted as a constraint between its inputs and output
      • Propagates information forward and backward until convergence

  32. Circle-parabola example
      Find $(x, y)$ that satisfies $x^2 + y^2 = 1$ and $y = x^2$

  33. Circle-parabola program
      Input program:
        y = x^2
        yy = y^2
        z = y + yy
        assert(z == 1)
      (dataflow graph shown on the slide)

  34. Interval propagation program
      Input program:
        y = x^2
        yy = y^2
        z = y + yy
        assert(z == 1)
      Edge program:
        y = x^2
        (y1,y2) = dup(y)
        yy = y1^2
        z = y2 + yy
        assert(z == 1)
      (dataflow graph with the dup node shown on the slide)

  35. Interval propagation program
      Edge program:
        y = x^2
        (y1,y2) = dup(y)
        yy = y1^2
        z = y2 + yy
        assert(z == 1)
      Message program:
        yF = xF^2
        zB = [1,1]
        Until convergence:
          y1F = yF ∩ y2B
          y2F = yF ∩ y1B
          yyF = y1F^2
          y1B = sqrt(y1F, yyB)
          y2B = zB - yyF
          yyB = zB - y2F

  36. Running ^2 backwards
      For yy = y1^2, the backward message is y1B = sqrt(y1F, yyB) = project[ y1F ∩ sqrt(yyB) ]
      Example:
        yyB = [1, 4]
        sqrt(yyB) = [-2, -1] ∪ [1, 2]
        y1F = [0, 10]
        y1F ∩ sqrt(yyB) = [] ∪ [1, 2]
        project[ y1F ∩ sqrt(yyB) ] = [1, 2]
      Compare: y1F ∩ project[ sqrt(yyB) ] = [0, 2]

  37. Results
      • If all intervals start at (−∞, ∞) then x → (−1, 1) (overestimate)
      • Apply subdivision
      • Starting at x = (0.1, 1) gives x → (0.786, 0.786)

  38. Interval propagation program
      Message program:
        yF = xF^2
        zB = [1,1]
        Until convergence:
          (perform updates)
        yB = y1B ∩ y2B
        xB = sqrt(xF, yB)
      (the full update list and dataflow graph are shown on the slide; see the sketch below)
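
A minimal Python sketch of this message program, specialized to the circle-parabola example and assuming all quantities are nonnegative so that only the positive square-root branch is kept (function names are mine, not the slides'):

def square(lo, hi):
    # Forward message through ^2 for a nonnegative interval.
    return (lo * lo, hi * hi)

def inv_square(lo, hi):
    # Backward message through ^2, keeping only the nonnegative root branch.
    return (max(lo, 0.0) ** 0.5, max(hi, 0.0) ** 0.5)

def intersect(a, b):
    return (max(a[0], b[0]), min(a[1], b[1]))

def propagate(xF=(0.1, 1.0), iters=200):
    # Message program for y = x^2; yy = y^2; z = y + yy; assert(z == 1).
    y1B = y2B = yyB = (0.0, float('inf'))
    zB = (1.0, 1.0)                              # the assert pins z to [1, 1]
    yF = square(*xF)                             # y = x^2
    for _ in range(iters):
        y1F = intersect(yF, y2B)                 # dup: each copy sees the other's backward
        y2F = intersect(yF, y1B)
        yyF = square(*y1F)                       # yy = y1^2
        y1B = intersect(y1F, inv_square(*yyB))
        y2B = (zB[0] - yyF[1], zB[1] - yyF[0])   # z = y2 + yy, solved for y2
        yyB = (zB[0] - y2F[1], zB[1] - y2F[0])   # z = y2 + yy, solved for yy
    yB = intersect(y1B, y2B)
    xB = intersect(xF, inv_square(*yB))          # x = sqrt(y), positive branch
    return xB

# propagate() shrinks x to roughly (0.786, 0.786), matching the results slide.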

  39. Typical message-passing program
      1. Pass messages into the loopy core
      2. Iterate
      3. Pass messages out of the loopy core
      Analogous to Stan’s “transformed data” and “generated quantities”

  40. Simplifications of message-passing
      • Message dependencies dictate execution
      • If forward messages do not depend on backward, it becomes non-iterative
      • If forward messages only include a single state, only one execution path is explored
      • AutoDiff has both properties

  41. Other message-passing algorithms

  42. Probabilistic Programming
      • Probabilistic programs are the new Bayesian networks
      • Using a program to specify a probabilistic model
      • Program is not a black box: undergoes analysis and transformation to help inference

  43. Loopy belief propagation
      • Loopy belief propagation has the same structure as interval propagation, but uses distributions
        • Gives forward and backward summations for tractable models
      • Expectation propagation adds projection steps
        • Approximate expectations for intractable models
        • Parameter estimation in non-conjugate models
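
As an illustration of the tractable case, here is a Python sketch of belief propagation on a Markov chain (the classic forward-backward algorithm); the argument names are hypothetical: prior[k] is the initial state distribution, trans[j][k] = P(next = k | current = j), and likes[t][k] is the likelihood of observation t under state k.

def chain_marginals(prior, trans, likes):
    # Forward and backward messages on a Markov chain, combined and
    # normalized to give the posterior marginal of each state variable.
    n, K = len(likes), len(prior)
    fwd = [[0.0] * K for _ in range(n)]
    bwd = [[1.0] * K for _ in range(n)]
    for k in range(K):
        fwd[0][k] = prior[k] * likes[0][k]
    for t in range(1, n):
        for k in range(K):
            fwd[t][k] = likes[t][k] * sum(fwd[t-1][j] * trans[j][k] for j in range(K))
    for t in range(n - 2, -1, -1):
        for k in range(K):
            bwd[t][k] = sum(trans[k][j] * likes[t+1][j] * bwd[t+1][j] for j in range(K))
    marginals = []
    for t in range(n):
        p = [fwd[t][k] * bwd[t][k] for k in range(K)]
        s = sum(p)
        marginals.append([v / s for v in p])
    return marginals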

  44. Gradient descent
      • Parameters send their current value out, receive gradients in, and take a step
      • Gradients fall out of the EP equations
      • Part of the same iteration loop
      (factor graph on the slide: the parameter θ exchanges messages θ and ∇f(θ) with the factor f(θ))
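
A minimal sketch of the parameter node's loop in Python; receive_gradient stands for the backward message the rest of the graph would send, and is a hypothetical name:

def gradient_descent_node(theta, receive_gradient, step_size=0.1, iters=100):
    # The parameter sends its current value out, receives a gradient
    # message back, and takes a step; this repeats inside the same
    # iteration loop as the other messages.
    for _ in range(iters):
        g = receive_gradient(theta)    # backward message at the current value
        theta = theta - step_size * g  # take a step against the gradient
    return theta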

  45. Gibbs sampling
      • Variables send their current value out, receive conditional distributions in
      • Collapsed variables send/receive distributions as in BP
      • No need to collapse in the model
      (factor graph with the exchanged messages shown on the slide)
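
A Python sketch of the uncollapsed case for two variables, with hypothetical samplers for the two conditionals: each variable sends its current value out and receives (a sample from) the conditional of the other variable given that value.

def gibbs(sample_x_given_y, sample_y_given_x, x0, y0, num_samples=1000):
    # Alternate: resample each variable from its conditional given the
    # other variable's current value (the incoming message).
    x, y = x0, y0
    samples = []
    for _ in range(num_samples):
        x = sample_x_given_y(y)
        y = sample_y_given_x(x)
        samples.append((x, y))
    return samples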

  46. Thanks!
      Model-based machine learning book: http://mbmlbook.com/
      Infer.NET is open source: http://dotnet.github.io/infer
