
Approximate Inference Part 1 of 2 – Tom Minka, Microsoft Research



  1. Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1

  2. Bayesian paradigm • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data) 2

  3. Bayesian paradigm • Bayesian posterior distribution summarizes what we’ve learned from training data and prior knowledge • Can use posterior to: – Describe training data – Make predictions on test data – Incorporate new data (online learning) • Today’s question: How to efficiently represent and compute posteriors? 3

  4. Factor graphs • Shows how a function of several variables can be factored into a product of simpler functions • f(x,y,z) = (x+y)(y+z)(x+z) • Very useful for representing posteriors 4

  5. Example factor graph: p(x_i | m) = N(x_i; m, 1) 5

  6. Two tasks • Modeling – What graph should I use for this data? • Inference – Given the graph and data, what is the mean of x (for example)? – Algorithms: • Sampling • Variable elimination • Message-passing (Expectation Propagation, Variational Bayes, …) 6

  7. A (seemingly) intractable problem 7

  8. Clutter problem • Want to estimate x given multiple y’s 8
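The model behind this slide appears only as an image; in Minka's EP paper the clutter problem uses a Gaussian prior on x and a likelihood that mixes a Gaussian centred on x with a broad background ("clutter") Gaussian. A minimal sketch under that assumed form, with made-up observations (the array `y` and all names here are mine), computing the exact posterior on a grid:

```python
import numpy as np
from scipy.stats import norm

# Clutter problem (assumed form, following Minka's EP paper):
#   p(x)     = N(x; 0, 100)
#   p(y_i|x) = (1 - w) N(y_i; x, 1) + w N(y_i; 0, 10),   w = 0.5
w = 0.5
y = np.array([0.1, 1.3, 2.2, 2.8, 3.9])      # hypothetical observations

grid = np.linspace(-1, 4, 2000)               # grid over x
prior = norm.pdf(grid, loc=0, scale=np.sqrt(100))
lik = np.prod((1 - w) * norm.pdf(y[:, None], loc=grid, scale=1)
              + w * norm.pdf(y[:, None], loc=0, scale=np.sqrt(10)),
              axis=0)
post = prior * lik                            # unnormalized p(x, D)
post /= np.trapz(post, grid)                  # normalize on the grid
```

A grid posterior of this kind is what the "exact" curves on the next slides refer to.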

  9. Exact posterior [plot: exact p(x,D) versus x, for x from −1 to 4] 9

  10. Representing posterior distributions • Sampling – good for complex, multi-modal distributions – slow, but predictable accuracy • Deterministic approximation – good for simple, smooth distributions – fast, but unpredictable accuracy 10

  11. Deterministic approximation Laplace’s method • Bayesian curve fitting, neural networks (MacKay) • Bayesian PCA (Minka) Variational bounds • Bayesian mixture of experts (Waterhouse) • Mixtures of PCA (Tipping, Bishop) • Factorial/coupled Markov models (Ghahramani, Jordan, Williams) 11

  12. Moment matching Another way to perform deterministic approximation • Much higher accuracy on some problems Assumed-density filtering (1984) Loopy belief propagation (1997) Expectation Propagation (2001) 12

  13. Today • Moment matching (Expectation Propagation) Tomorrow • Variational bounds (Variational Message Passing) 13

  14. Best Gaussian by moment matching [plot: exact p(x,D) and the best Gaussian fit versus x, for x from −1 to 4] 14
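A minimal sketch of the moment-matching step, for a posterior tabulated on a grid such as the hypothetical clutter posterior above (function name is mine): the Gaussian minimizing KL(p‖q) matches the mean and variance of p.

```python
import numpy as np

def moment_matched_gaussian(grid, post):
    """Best Gaussian under KL(p || q): match the mean and variance of p,
    given here as a normalized density `post` tabulated on `grid`."""
    mean = np.trapz(grid * post, grid)
    var = np.trapz((grid - mean) ** 2 * post, grid)
    return mean, var
```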

  15. Strategy • Approximate each factor by a Gaussian in x 15

  16. Approximating a single factor 16

  17. (naïve) Fit the approximate factor to f_i(x) on its own, ignoring the context q^{\i}(x) in f_i(x) × q^{\i}(x) = p(x) 17

  18. (informed) Fit the approximate factor to f_i(x) in its context, so that the product with q^{\i}(x) matches f_i(x) × q^{\i}(x) = p(x) 18

  19. Single factor with Gaussian context 19

  20. Gaussian multiplication formula 20
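The formula on this slide is an image; the standard identity it names (the product of two Gaussian densities in x is a scaled Gaussian in x) is:

```latex
\mathcal{N}(x; m_1, v_1)\,\mathcal{N}(x; m_2, v_2)
  = \mathcal{N}(m_1; m_2, v_1 + v_2)\,\mathcal{N}(x; m, v),
\qquad
\frac{1}{v} = \frac{1}{v_1} + \frac{1}{v_2},
\quad
\frac{m}{v} = \frac{m_1}{v_1} + \frac{m_2}{v_2}
```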

  21. Approximation with narrow context 21

  22. Approximation with medium context 22

  23. Approximation with wide context 23

  24. Two factors [factor graph: two factors attached to x] Message passing 24

  25. Three factors [factor graph: three factors attached to x] Message passing 25

  26. Message Passing = Distributed Optimization • Messages represent a simpler distribution q(x) that approximates p(x) – A distributed representation • Message passing = optimizing q to fit p – q stands in for p when answering queries • Choices: – What type of distribution to construct (approximating family) – What cost to minimize (divergence measure) 26

  27. Distributed divergence minimization • Write p as product of factors: • Approximate factors one by one: • Multiply to get the approximation: 27
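The three equations on this slide are images; in words they say that p(x) is a product of factors, each factor is replaced by an approximate factor, and the approximate factors are multiplied to give q(x). A minimal sketch of that scheme as Expectation Propagation on the assumed 1-D clutter model above (Gaussian sites, moments computed numerically on a grid, no damping or negative-precision safeguards; data and names are mine):

```python
import numpy as np
from scipy.stats import norm

# EP sketch for the 1-D clutter model assumed earlier:
#   p(x) = N(x; 0, 100),  p(y_i|x) = (1 - w) N(y_i; x, 1) + w N(y_i; 0, 10)
w, prior_var = 0.5, 100.0
y = np.array([0.1, 1.3, 2.2, 2.8, 3.9])        # hypothetical observations

# Gaussian sites in natural parameters: precision tau_i, mean-times-precision nu_i
tau = np.zeros(len(y))
nu = np.zeros(len(y))
grid = np.linspace(-40, 40, 8001)               # used for numerical moment matching

for sweep in range(20):                         # approximate factors one by one
    for i in range(len(y)):
        # Cavity q^{\i}(x): prior times every site except i
        tau_cav = 1.0 / prior_var + tau.sum() - tau[i]
        nu_cav = nu.sum() - nu[i]
        cavity = norm.pdf(grid, nu_cav / tau_cav, np.sqrt(1.0 / tau_cav))
        # Tilted distribution: cavity times the exact factor f_i
        f_i = (1 - w) * norm.pdf(y[i], grid, 1) + w * norm.pdf(y[i], 0, np.sqrt(10))
        tilted = cavity * f_i
        Z = np.trapz(tilted, grid)
        mean = np.trapz(grid * tilted, grid) / Z
        var = np.trapz((grid - mean) ** 2 * tilted, grid) / Z
        # Moment-matched Gaussian divided by the cavity gives the new site
        tau[i] = 1.0 / var - tau_cav
        nu[i] = mean / var - nu_cav

# Multiply the prior and all sites to get the Gaussian approximation q(x)
post_prec = 1.0 / prior_var + tau.sum()
print("q(x) mean:", nu.sum() / post_prec, " variance:", 1.0 / post_prec)
```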

  28. Gaussian found by EP [plot: EP Gaussian, exact posterior, and best Gaussian p(x,D) versus x, for x from −1 to 4] 28

  29. Other methods [plot: VB and Laplace approximations versus the exact p(x,D), for x from −1 to 4] 29

  30. Accuracy Posterior mean: exact = 1.64864, ep = 1.64514, laplace = 1.61946, vb = 1.61834 Posterior variance: exact = 0.359673, ep = 0.311474, laplace = 0.234616, vb = 0.171155 30

  31. Cost vs. accuracy [plots for 20 points and 200 points] Deterministic methods improve with more data (posterior is more Gaussian) Sampling methods do not 31

  32. Censoring example • Want to estimate x given multiple y’s: p(x) = N(x; 0, 100), p(y_i | x) = N(y_i; x, 1), p(|y_i| > t | x) = ∫_{−∞}^{−t} N(y; x, 1) dy + ∫_{t}^{∞} N(y; x, 1) dy 32
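A minimal sketch of that censored likelihood in code, using the Gaussian CDF and survival function (t is the censoring threshold; the function name is mine):

```python
from scipy.stats import norm

def censored_likelihood(x, t):
    """p(|y_i| > t | x) for the censoring example:
    the mass of N(y; x, 1) lying outside the interval [-t, t]."""
    return norm.cdf(-t, loc=x, scale=1) + norm.sf(t, loc=x, scale=1)
```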

  33. Time series problems 33

  34. Example: Tracking Guess the position of an object given noisy measurements [figure: object path x_1 … x_4 with noisy measurements y_1 … y_4] 34

  35. Factor graph [chain x_1 – x_2 – x_3 – x_4 with observations y_1 … y_4] e.g. x_t = x_{t−1} + ν_t (random walk), y_t = x_t + noise; want distribution of x’s given y’s 35
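For this linear-Gaussian chain, forward message passing is exactly the Kalman filter; a minimal sketch with assumed noise variances (q for the random walk, r for the measurements, and the prior m0, v0 – none of these are given on the slide, and the function name is mine):

```python
import numpy as np

def kalman_filter_1d(y, q=0.01, r=1.0, m0=0.0, v0=100.0):
    """Forward messages for x_t = x_{t-1} + nu_t, y_t = x_t + noise."""
    m, v = m0, v0
    means, variances = [], []
    for obs in y:
        v = v + q                      # predict: add random-walk variance
        k = v / (v + r)                # Kalman gain
        m = m + k * (obs - m)          # update with the measurement
        v = (1 - k) * v
        means.append(m)
        variances.append(v)
    return np.array(means), np.array(variances)

means, variances = kalman_filter_1d(np.array([0.2, 0.4, 0.1, 0.5]))
```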

  36. Approximate factor graph [factor graph over x_1 … x_4] 36

  37. Splitting a pairwise factor [figure: the pairwise factor on x_1, x_2 before and after splitting] 37

  38. Splitting in context [figure: the factor on x_2, x_3 before and after splitting, in the context of the rest of the graph] 38

  39. Sweeping through the graph [chain x_1 … x_4] 39

  40. Sweeping through the graph [chain x_1 … x_4] 40

  41. Sweeping through the graph [chain x_1 … x_4] 41

  42. Sweeping through the graph [chain x_1 … x_4] 42

  43. Example: Poisson tracking • y_t is a Poisson-distributed integer with mean exp(x_t) 43

  44. Poisson tracking model p(x_1) ~ N(0, 100), p(x_t | x_{t−1}) ~ N(x_{t−1}, 0.01), p(y_t | x_t) = exp(y_t x_t − e^{x_t}) / y_t! 44
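A minimal sketch that samples from this model as written (the variances 100 and 0.01 come from the slide; the length T, seed, and function name are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_poisson_tracking(T):
    """Sample (x, y) from the Poisson tracking model:
    x_1 ~ N(0, 100), x_t | x_{t-1} ~ N(x_{t-1}, 0.01), y_t ~ Poisson(exp(x_t))."""
    x = np.empty(T)
    x[0] = rng.normal(0.0, np.sqrt(100.0))   # note: exp(x_t) can be large for extreme prior draws
    for t in range(1, T):
        x[t] = rng.normal(x[t - 1], np.sqrt(0.01))
    y = rng.poisson(np.exp(x))
    return x, y

x, y = sample_poisson_tracking(50)
```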

  45. Factor graph [chain x_1 … x_4 with observations y_1 … y_4, and its approximation] 45

  46. Approximating a measurement factor [figure: the measurement factor linking x_1 and y_1, and its approximation in x_1] 46

  47. 47

  48. Posterior for the last state 48

  49. 49

  50. 50

  51. EP for signal detection (Qi and Minka, 2003) • Wireless communication problem • Transmitted signal = a sin(ωt + φ) • Vary (a, φ) to encode each symbol • In complex numbers: a e^{iφ} [diagram: the complex plane with a point at a e^{iφ}] 51

  52. Binary symbols, Gaussian noise • Symbols are s^0 = 1 and s^1 = −1 (in the complex plane) • Received signal y_t = a sin(ωt + φ) + noise • Optimal detection is easy in this case [diagram: y_t relative to s^0 and s^1] 52

  53. Fading channel • Channel systematically changes amplitude and phase: y_t = x_t s_t + noise • s_t = transmitted symbol • x_t = channel multiplier (complex number) • x_t changes over time [diagram: y_t, x_t s^0, x_t s^1 in the complex plane] 53

  54. Differential detection • Use last measurement to estimate state: x_t ≈ y_{t−1} / s_{t−1} • State estimate is noisy – can we do better? [diagram: y_t compared with y_{t−1} and −y_{t−1}] 54
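A minimal sketch of differential detection for binary symbols under the assumptions above: the previous sample and symbol give a noisy channel estimate, and the symbol closest to the new sample under that estimate is chosen. The function name and the assumption of a known first symbol are mine.

```python
import numpy as np

def differential_detect(y, s0=1):
    """Differential detection for binary symbols s_t in {+1, -1}.
    Uses x_t ≈ y_{t-1} / s_{t-1} as the (noisy) channel estimate,
    then picks the symbol closest to y_t under that estimate.
    y: complex received samples; s0: assumed first symbol (e.g. training)."""
    s = [s0]
    for t in range(1, len(y)):
        x_hat = y[t - 1] / s[-1]          # channel estimate from the last sample
        # Choose s_t in {+1, -1} minimizing |y_t - x_hat * s_t|
        s.append(1 if np.real(y[t] * np.conj(x_hat)) >= 0 else -1)
    return np.array(s)
```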

  55. Factor graph [symbols s_1 … s_4, observations y_1 … y_4, channel states x_1 … x_4] Symbols can also be correlated (e.g. error-correcting code) Channel dynamics are learned from training data (all 1’s) 55

  56. 56

  57. 57

  58. Splitting a transition factor 58

  59. Splitting a measurement factor 59

  60. On-line implementation • Iterate over the last δ measurements • Previous measurements act as prior • Results comparable to particle filtering, but much faster 60

  61. 61

  62. Classification problems 62

  63. Spam filtering by linear separation [scatter plot: Not spam vs. Spam] Choose a boundary that will generalize to new data 63

  64. Linear separation Minimum training error solution (Perceptron) Too arbitrary – won’t generalize well 64

  65. Linear separation Maximum-margin solution (SVM) Ignores information in the vertical direction 65

  66. Linear separation Bayesian solution (via averaging) Has a margin, and uses information in all dimensions 66

  67. Geometry of linear separation Separator is any vector w such that: x_i^T w > 0 (class 1), x_i^T w < 0 (class 2), ||w|| = 1 (sphere) This set has an unusual shape SVM: Optimize over it Bayes: Average over it 67
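A crude sketch of "Bayes: Average over it" by rejection sampling on the unit sphere, included to illustrate the geometry; it is a stand-in, not the EP algorithm of the following slides. Labels are folded into the points so every constraint reads x_i^T w > 0; the data, function name, and sample count are all mine.

```python
import numpy as np

def bayes_average_separator(X, y, n_samples=200000, seed=0):
    """'Average over the version space': draw w uniformly on the unit sphere,
    keep those with x_i^T w > 0 for every labeled point (labels folded into X),
    and average the survivors. A sketch of the Bayesian solution, not EP."""
    rng = np.random.default_rng(seed)
    Xy = X * y[:, None]                              # fold labels into the points
    W = rng.normal(size=(n_samples, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)    # uniform directions on the sphere
    keep = (Xy @ W.T > 0).all(axis=0)                # w's that separate all the data
    w = W[keep].mean(axis=0)
    return w / np.linalg.norm(w)

# Hypothetical 2-D example
X = np.array([[1.0, 2.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = bayes_average_separator(X, y)
```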

  68. Performance on linear separation EP Gaussian approximation to posterior 68

  69. Factor graph p(w) = N(w; 0, I) 69

  70. Computing moments q^{\i}(w) = N(w; m^{\i}, V^{\i}) 70

  71. Computing moments 71

  72. Time vs. accuracy A typical run on the 3-point problem Error = distance to true mean of w Billiard = Monte Carlo sampling (Herbrich et al, 2001) Opper & Winther’s algorithms: MF = mean-field theory TAP = cavity method (equiv to Gaussian EP for this problem) 72

  73. Gaussian kernels • Map data into high-dimensional space so that 73
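The rest of this slide is a formula image; it presumably refers to the standard Gaussian (RBF) kernel, under which inner products in the implicit high-dimensional feature space are computed as k(x, x′) = exp(−‖x − x′‖² / 2σ²). A minimal sketch (function name and default σ are mine):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||x_i - x'_j||^2 / (2 sigma^2)).
    The implicit feature map is infinite-dimensional; inner products in that
    space are obtained through k(x, x') without ever forming the features."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```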

  74. Bayesian model comparison • Multiple models M_i with prior probabilities p(M_i) • Posterior probabilities: • For equal priors, models are compared using model evidence: 74
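The two formulas on this slide are images; the standard expressions for the posterior model probabilities and the model evidence are:

```latex
p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_j p(D \mid M_j)\, p(M_j)},
\qquad
p(D \mid M_i) = \int p(D \mid \theta, M_i)\, p(\theta \mid M_i)\, d\theta
```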

  75. Highest-probability kernel 75

  76. Margin-maximizing kernel 76
