  1. Expectation Propagation Tom Minka Microsoft Research, Cambridge, UK 2006 Advanced Tutorial Lecture Series, CUED 1

  2. A typical machine learning problem 2

  3. Spam filtering by linear separation [figure: 'Spam' vs. 'Not spam' points]. Choose a boundary that will generalize to new data 3

  4. Linear separation Minimum training error solution (Perceptron) Too close to data – won’t generalize well 4

  5. Linear separation: maximum-margin solution (SVM) ignores information in the vertical direction 5

  6. Linear separation: Bayesian solution (via averaging) has a margin, and uses information in all dimensions 6

  7. Geometry of linear separation. The separator is any vector w such that w^T x_i > 0 (class 1), w^T x_i < 0 (class 2), with ||w|| = 1 (sphere). This set has an unusual shape. SVM: optimize over it. Bayes: average over it 7

  8. Performance on linear separation (EP = Gaussian approximation to the posterior) 8

  9. Bayesian paradigm • Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data) 9

  10. Factor graphs • Shows how a function of several variables can be factored into a product of simpler functions • f(x,y,z) = (x+y)(y+z)(x+z) • Very useful for representing posteriors 10
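
A minimal sketch (not from the slides) of what this factorisation looks like as a data structure: each factor is stored with its variable scope, and the full function is the product of the factors. The variable names and helper below are illustrative only.

```python
# A factor graph, minimally: a list of (scope, function) pairs.
factors = [
    (("x", "y"), lambda x, y: x + y),
    (("y", "z"), lambda y, z: y + z),
    (("x", "z"), lambda x, z: x + z),
]

def evaluate(assignment):
    """Evaluate f(x,y,z) = (x+y)(y+z)(x+z) as the product of its factors."""
    value = 1.0
    for scope, fn in factors:
        value *= fn(*(assignment[v] for v in scope))
    return value

print(evaluate({"x": 1.0, "y": 2.0, "z": 3.0}))  # (1+2)*(2+3)*(1+3) = 60.0
```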

  11. Example factor graphs 11

  12. Two tasks • Modeling – What graph should I use for this data? • Inference – Given the graph and data, what is the mean of x (for example)? – Algorithms: • Sampling • Variable elimination • Message-passing (Expectation Propagation, Variational Bayes, …) 12

  13. Division of labor • Model construction – Domain specific (computer vision, biology, text) • Inference computation – Generic, mechanical – Further divided into: • Fitting an approximate posterior • Computing properties of the approx posterior 13

  14. Benefits of the division • Algorithmic knowledge is consolidated into general graph-based algorithms (like EP) • Applied research has more freedom in choosing models • Algorithm research has much wider impact 14

  15. Take-home message • Applied researcher: – express your model as a factor graph – use graph-based inference algorithms • Algorithm researcher: – present your algorithm in terms of graphs 15

  16. A (seemingly) intractable problem 16

  17. Clutter problem • Want to estimate x given multiple y’s 17

  18. Exact posterior [plot of p(x,D) against x over the range -1 to 4] 18
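
For concreteness, here is a grid-based sketch of this exact posterior. The slide shows only the plot; the model details below (prior x ~ N(0, 100), observations from the mixture (1-w) N(y; x, 1) + w N(y; 0, 10) with w = 0.5) follow the standard clutter-problem setup in Minka's EP work and are an assumption here, and the data points are hypothetical.

```python
import numpy as np

# Exact clutter-problem posterior on a grid (1-D).  Model and data are
# assumptions for illustration, not the numbers behind the slide's plot.
def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

w = 0.5
y_data = np.array([0.1, 1.5, 2.3, 2.6, 3.1])   # hypothetical observations

xs = np.linspace(-1, 4, 2000)                  # grid matching the slide's axis
post = normal_pdf(xs, 0.0, 100.0)              # prior factor
for y in y_data:                               # one clutter likelihood per y
    post *= (1 - w) * normal_pdf(y, xs, 1.0) + w * normal_pdf(y, 0.0, 10.0)
post /= np.trapz(post, xs)                     # normalized p(x | D) on the grid
```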

  19. Representing posterior distributions. Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy. Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy 19

  20. Deterministic approximation. Laplace’s method: • Bayesian curve fitting, neural networks (MacKay) • Bayesian PCA (Minka). Variational bounds: • Bayesian mixture of experts (Waterhouse) • Mixtures of PCA (Tipping, Bishop) • Factorial/coupled Markov models (Ghahramani, Jordan, Williams) 20

  21. Moment matching: another way to perform deterministic approximation • Much higher accuracy on some problems • Assumed-density filtering (1984) • Loopy belief propagation (1997) • Expectation Propagation (2001) 21

  22. Best Gaussian by moment matching [plot comparing the exact p(x,D) with the best Gaussian over x from -1 to 4] 22
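
Continuing the grid sketch above (reusing its hypothetical `xs`, `post`, and `normal_pdf`), the best-Gaussian curve of this slide is obtained by matching the mean and variance of the exact posterior:

```python
# Moment matching: the best Gaussian (in the KL sense used by EP) shares the
# mean and variance of the exact posterior computed on the grid.
mean = np.trapz(xs * post, xs)
var = np.trapz((xs - mean) ** 2 * post, xs)
best_gaussian = normal_pdf(xs, mean, var)      # compare this curve with `post`
```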

  23. Strategy • Approximate each factor by a Gaussian in x 23

  24. Approximating a single factor 24

  25. (naïve) Approximate the factor by itself: choose f̃_i(x) ≈ f_i(x), so that f̃_i(x) × q^{\i}(x) ≈ f_i(x) × q^{\i}(x) = p(x) 25

  26. (informed) Approximate the factor in its context: choose f̃_i(x) so that the product f̃_i(x) × q^{\i}(x) matches f_i(x) × q^{\i}(x) = p(x) 26

  27. Single factor with Gaussian context 27

  28. Gaussian multiplication formula 28
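
The formula itself did not survive the transcription, but the standard result is that Gaussians multiply (and divide) by adding (subtracting) natural parameters. A small sketch with illustrative helper names:

```python
# N(x; m1, v1) * N(x; m2, v2)  is proportional to  N(x; m, v)  with
#   1/v = 1/v1 + 1/v2   and   m/v = m1/v1 + m2/v2.
def gauss_multiply(m1, v1, m2, v2):
    r = 1.0 / v1 + 1.0 / v2                    # precisions add
    h = m1 / v1 + m2 / v2                      # precision-weighted means add
    return h / r, 1.0 / r                      # (mean, variance) of the product

def gauss_divide(m1, v1, m2, v2):
    # Division subtracts natural parameters; EP uses this to remove one
    # approximate factor from q(x).  The result can have negative variance.
    r = 1.0 / v1 - 1.0 / v2
    h = m1 / v1 - m2 / v2
    return h / r, 1.0 / r
```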

  29. Approximation with narrow context 29

  30. Approximation with medium context 30

  31. Approximation with wide context 31

  32. Two factors: message passing [factor graph over x] 32

  33. Three factors: message passing [factor graph over x] 33

  34. Message Passing = Distributed Optimization • Messages represent a simpler distribution q(x) that approximates p(x) – A distributed representation • Message passing = optimizing q to fit p – q stands in for p when answering queries • Choices: – What type of distribution to construct (approximating family) – What cost to minimize (divergence measure) 34

  35. Distributed divergence minimization • Write p as a product of factors: p(x) = ∏_i f_i(x) • Approximate the factors one by one: f_i(x) ≈ f̃_i(x) • Multiply to get the approximation: q(x) = ∏_i f̃_i(x) 35

  36. Global divergence to local divergence • Global divergence: D( ∏_i f_i(x) || ∏_i f̃_i(x) ) • Local divergence: D( f_i(x) q^{\i}(x) || f̃_i(x) q^{\i}(x) ) 36

  37. Message passing • Messages are passed between factors • Messages are factor approximations: f̃_i(x) • Factor f_i receives the context q^{\i}(x) (the product of the other messages) – Minimize the local divergence to get f̃_i(x) – Send f̃_i(x) to the other factors – Repeat until convergence 37
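
A compact sketch of this loop for the 1-D clutter model assumed earlier: Gaussian approximating family, each local divergence minimized by moment matching (done numerically on a grid to keep the code short). Data, grid, and helper names are illustrative, so the numbers will not match the slides.

```python
import numpy as np

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

w, y_data = 0.5, np.array([0.1, 1.5, 2.3, 2.6, 3.1])   # assumed clutter data
xs = np.linspace(-40, 40, 8001)                        # wide grid for quadrature

prior_r, prior_h = 1.0 / 100.0, 0.0            # natural params of the N(0,100) prior
fac_r = np.zeros(len(y_data))                  # approximate factors, initially uniform
fac_h = np.zeros(len(y_data))

for sweep in range(20):                        # repeat until (effectively) converged
    for i, y in enumerate(y_data):
        q_r, q_h = prior_r + fac_r.sum(), prior_h + fac_h.sum()
        cav_r, cav_h = q_r - fac_r[i], q_h - fac_h[i]   # remove factor i: context q^{\i}
        if cav_r <= 0:                         # skip an ill-defined cavity
            continue
        cav_m, cav_v = cav_h / cav_r, 1.0 / cav_r
        # Tilted distribution = exact factor times its context.
        exact_factor = (1 - w) * normal_pdf(y, xs, 1.0) + w * normal_pdf(y, 0.0, 10.0)
        tilted = exact_factor * normal_pdf(xs, cav_m, cav_v)
        Z = np.trapz(tilted, xs)
        m = np.trapz(xs * tilted, xs) / Z      # moment matching = local KL minimization
        v = np.trapz((xs - m) ** 2 * tilted, xs) / Z
        # New approximate factor = matched Gaussian divided by the context.
        fac_r[i] = 1.0 / v - cav_r
        fac_h[i] = m / v - cav_h

q_r, q_h = prior_r + fac_r.sum(), prior_h + fac_h.sum()
print("EP posterior: mean %.4f, variance %.4f" % (q_h / q_r, 1.0 / q_r))
```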

  38. Gaussian found by EP [plot comparing the EP Gaussian, the best Gaussian, and the exact p(x,D) over x from -1 to 4] 38

  39. Other methods [plot comparing the Laplace and VB approximations with the exact p(x,D) over x from -1 to 4] 39

  40. Accuracy. Posterior mean: exact = 1.64864, ep = 1.64514, laplace = 1.61946, vb = 1.61834. Posterior variance: exact = 0.359673, ep = 0.311474, laplace = 0.234616, vb = 0.171155 40

  41. Cost vs. accuracy [panels for 20 points and 200 points]. Deterministic methods improve with more data (the posterior is more Gaussian); sampling methods do not 41

  42. Time series problems 42

  43. Example: Tracking. Guess the position of an object given noisy measurements [figure: object positions x_1..x_4 with measurements y_1..y_4] 43

  44. Factor graph over states x_1..x_4 and measurements y_1..y_4, with e.g. x_t = x_{t-1} + noise (random walk) and y_t = x_t + noise; we want the distribution of the x's given the y's 44
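
In this linear-Gaussian special case, a single forward sweep of Gaussian message passing along the chain is exactly the Kalman filter. A minimal sketch; the noise variances and measurements below are assumptions:

```python
import numpy as np

q_trans, r_obs = 0.1, 0.5                      # assumed transition / measurement noise variances
y = np.array([0.2, 0.4, 0.1, 0.9])             # hypothetical measurements y_1..y_4

m, v = 0.0, 10.0                               # broad Gaussian prior on x_1
filtered = []
for t, yt in enumerate(y):
    if t > 0:
        v = v + q_trans                        # predict through x_t = x_{t-1} + noise
    k = v / (v + r_obs)                        # gain for the factor y_t = x_t + noise
    m, v = m + k * (yt - m), (1 - k) * v       # condition on y_t
    filtered.append((m, v))

print(filtered)                                # p(x_t | y_1..y_t) for each t
```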

  45. Approximate factor graph over x_1..x_4 45

  46. Splitting a pairwise factor [the factor between x_1 and x_2 is replaced by a product of single-variable approximations] 46

  47. Splitting in context [the same split for the factor between x_2 and x_3, inside the rest of the graph] 47

  48. Sweeping through the graph [messages passed along the chain x_1..x_4] 48

  49. Sweeping through the graph 49

  50. Sweeping through the graph 50

  51. Sweeping through the graph 51

  52. Example: Poisson tracking • y_t is a Poisson-distributed integer with mean exp(x_t) 52

  53. Poisson tracking model: p(x_1) ~ N(0, 100), p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01), p(y_t | x_t) = exp(y_t x_t − exp(x_t)) / y_t! 53
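
A short sketch that simulates exactly this model as stated on the slide, useful for generating test data for the inference that follows (the seed and length are arbitrary):

```python
import numpy as np

# Simulate the Poisson tracking model:
#   x_1 ~ N(0, 100),  x_t | x_{t-1} ~ N(x_{t-1}, 0.01),  y_t ~ Poisson(exp(x_t)).
rng = np.random.default_rng(0)
T = 100

x = np.empty(T)
x[0] = rng.normal(0.0, np.sqrt(100.0))
for t in range(1, T):
    x[t] = rng.normal(x[t - 1], np.sqrt(0.01))

y = rng.poisson(np.exp(x))                     # observed counts
```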

  54. Factor graph for the Poisson tracking model [states x_1..x_4, observations y_1..y_4] and its approximation 54

  55. Approximating a measurement factor [the factor linking y_1 to x_1] 55
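
Since the Poisson factor is not Gaussian in x_1, one way to do its EP update is to match moments numerically under the Gaussian context and then divide the context back out. A sketch with an assumed context and observed count:

```python
import numpy as np

# One EP update for a Poisson measurement factor p(y | x) = exp(y*x - e^x) / y!
# under a Gaussian context N(x; m_cav, v_cav).  Context and count are assumed.
m_cav, v_cav, y = 0.5, 1.0, 3

xs = np.linspace(m_cav - 10 * np.sqrt(v_cav), m_cav + 10 * np.sqrt(v_cav), 4001)
context = np.exp(-0.5 * (xs - m_cav) ** 2 / v_cav) / np.sqrt(2 * np.pi * v_cav)
factor = np.exp(y * xs - np.exp(xs))           # Poisson likelihood in x (up to 1/y!)

tilted = factor * context                      # exact factor times its context
Z = np.trapz(tilted, xs)
m_new = np.trapz(xs * tilted, xs) / Z          # moment-matched Gaussian posterior
v_new = np.trapz((xs - m_new) ** 2 * tilted, xs) / Z

# Approximate factor = matched Gaussian / context, kept in natural parameters.
fac_precision = 1.0 / v_new - 1.0 / v_cav
fac_precmean = m_new / v_new - m_cav / v_cav
```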

  57. Posterior for the last state 57

  60. EP for signal detection • Wireless communication problem • Transmitted signal = a sin(ωt + φ) • Vary (a, φ) to encode each symbol • In complex numbers: the symbol is a e^{iφ}, a point with magnitude a and angle φ in the (Re, Im) plane 60

  61. Binary symbols, Gaussian noise • Symbols are 1 and –1 (in the complex plane) • Received signal = a sin(ωt + φ) + noise • Recovered y_t = â e^{iφ̂} = a e^{iφ} + noise • Optimal detection is easy in this case [compare y_t with the symbols s_0 and s_1] 61

  62. Fading channel • Channel systematically changes amplitude and phase: y_t = x_t s + noise • x_t changes over time [y_t shown relative to x_t s_0 and x_t s_1] 62

  63. Differential detection • Use the last measurement to estimate the state • Binary symbols only • No smoothing of the state [compare y_t with the noisy previous measurement y_{t-1}] 63

  64. Factor graph [symbols s_1..s_4, measurements y_1..y_4, channel states x_1..x_4]. Symbols can also be correlated (e.g. error-correcting code). Dynamics are learned from training data (all 1’s) 64

  67. Splitting a transition factor 67

  68. Splitting a measurement factor 68

  69. On-line implementation • Iterate over the last δ measurements • Previous measurements act as prior • Results comparable to particle filtering, but much faster 69

  71. Linear separation revisited 71

  72. Geometry of linear separation. The separator is any vector w such that w^T x_i > 0 (class 1), w^T x_i < 0 (class 2), with ||w|| = 1 (sphere). This set has an unusual shape. SVM: optimize over it. Bayes: average over it 72

  73. Factor graph 73

  74. Performance on linear separation (EP = Gaussian approximation to the posterior) 74

  75. Time vs. accuracy. A typical run on the 3-point problem; error = distance to the true mean of w. Billiard = Monte Carlo sampling (Herbrich et al., 2001). Opper & Winther’s algorithms: MF = mean-field theory, TAP = cavity method (equivalent to Gaussian EP for this problem) 75

  76. Gaussian kernels • Map data into high-dimensional space so that 76
