Approximate Inference (Part 1 of 2)
Tom Minka
Microsoft Research, Cambridge, UK
Machine Learning Summer School 2009
http://mlg.eng.cam.ac.uk/mlss09/
Bayesian paradigm
• Consistent use of probability theory for representing unknowns (parameters, latent variables, missing data)
Bayesian paradigm
• The Bayesian posterior distribution summarizes what we’ve learned from training data and prior knowledge
• Can use the posterior to:
  – Describe training data
  – Make predictions on test data
  – Incorporate new data (online learning)
• Today’s question: How to efficiently represent and compute posteriors?
Factor graphs
• A factor graph shows how a function of several variables can be factored into a product of simpler functions
• Example: f(x,y,z) = (x+y)(y+z)(x+z)
• Very useful for representing posteriors
Example factor graph
p(x_i | m) = N(x_i; m, 1)
Two tasks
• Modeling
  – What graph should I use for this data?
• Inference
  – Given the graph and data, what is the mean of x (for example)?
  – Algorithms:
    • Sampling
    • Variable elimination
    • Message-passing (Expectation Propagation, Variational Bayes, …)
A (seemingly) intractable problem
Clutter problem
• Want to estimate x given multiple y’s
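For reference, this is the clutter model used in Minka's EP papers; the specific parameter values below (outlier weight w and the variances 10 and 100) are an assumption here, not something readable off the slide:
  p(x) = N(x; 0, 100)
  p(y_i | x) = (1 − w) N(y_i; x, 1) + w N(y_i; 0, 10),  e.g. w = 0.5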
Exact posterior
[Figure: exact posterior p(x,D) plotted against x]
Representing posterior distributions
• Sampling: good for complex, multi-modal distributions; slow, but predictable accuracy
• Deterministic approximation: good for simple, smooth distributions; fast, but unpredictable accuracy
Deterministic approximation
Laplace’s method
• Bayesian curve fitting, neural networks (MacKay)
• Bayesian PCA (Minka)
Variational bounds
• Bayesian mixture of experts (Waterhouse)
• Mixtures of PCA (Tipping, Bishop)
• Factorial/coupled Markov models (Ghahramani, Jordan, Williams)
Moment matching
Another way to perform deterministic approximation
• Much higher accuracy on some problems
• Assumed-density filtering (1984)
• Loopy belief propagation (1997)
• Expectation Propagation (2001)
Today
• Moment matching (Expectation Propagation)
Tomorrow
• Variational bounds (Variational Message Passing)
Best Gaussian by moment matching
[Figure: exact posterior p(x,D) and the best Gaussian fit, plotted against x]
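A minimal numerical sketch of what "best Gaussian by moment matching" means here, using the clutter model sketched above (the parameter values and observations are illustrative, not taken from the slides): evaluate the exact posterior on a grid, then match its mean and variance.

```python
import numpy as np

def norm_pdf(y, mean, var):
    """Gaussian density N(y; mean, var)."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def best_gaussian(y_data, w=0.5, prior_var=100.0, clutter_var=10.0):
    """Moment-match a Gaussian to the exact clutter posterior, using a
    fine grid over x (feasible because x is one-dimensional)."""
    x = np.linspace(-10.0, 10.0, 4001)
    log_post = -0.5 * x ** 2 / prior_var                     # log prior, up to a constant
    for yi in y_data:
        lik = (1 - w) * norm_pdf(yi, x, 1.0) + w * norm_pdf(yi, 0.0, clutter_var)
        log_post += np.log(lik)
    p = np.exp(log_post - log_post.max())
    p /= p.sum()                                             # weights on the uniform grid
    mean = np.sum(x * p)
    var = np.sum((x - mean) ** 2 * p)
    return mean, var                                         # the moment-matched Gaussian N(x; mean, var)

# Example: a few observations, one of which looks like clutter
print(best_gaussian([1.2, 2.0, 1.7, -5.0]))
```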
Strategy
• Approximate each factor by a Gaussian in x
Approximating a single factor
(naïve)
Approximate f_i(x) by a Gaussian on its own, ignoring the rest of the decomposition f_i(x) × q^{\i}(x) = p(x)
(informed)
Approximate f_i(x) by a Gaussian that is accurate where the context q^{\i}(x) has mass, so that the product still matches f_i(x) × q^{\i}(x) = p(x)
Single factor with Gaussian context
Gaussian multiplication formula
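The formula itself appears only as an image in the extraction; the standard identity being referred to is:
  N(x; m_1, v_1) N(x; m_2, v_2) = N(m_1; m_2, v_1 + v_2) N(x; m, v)
  where 1/v = 1/v_1 + 1/v_2 and m/v = m_1/v_1 + m_2/v_2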
Approximation with narrow context
Approximation with medium context
Approximation with wide context
Two factors
[Diagram: two factors attached to x]
Message passing
Three factors
[Diagram: three factors attached to x]
Message passing
Message Passing = Distributed Optimization
• Messages represent a simpler distribution q(x) that approximates p(x)
  – A distributed representation
• Message passing = optimizing q to fit p
  – q stands in for p when answering queries
• Choices:
  – What type of distribution to construct (approximating family)
  – What cost to minimize (divergence measure)
Distributed divergence minimization
• Write p as a product of factors:
• Approximate the factors one by one:
• Multiply to get the approximation:
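The three equations on this slide are images in the extraction; their standard form (following Minka's EP papers, so the notation here is an assumption) is:
  p(x) = ∏_i f_i(x)
  \tilde f_i(x) = argmin over \tilde f_i of D( f_i(x) q^{\i}(x) || \tilde f_i(x) q^{\i}(x) ),  where q^{\i}(x) = ∏_{j≠i} \tilde f_j(x)
  q(x) = ∏_i \tilde f_i(x)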
Gaussian found by EP
[Figure: EP approximation vs. exact posterior and best Gaussian, p(x,D) plotted against x]
Other methods
[Figure: VB and Laplace approximations vs. exact posterior, p(x,D) plotted against x]
Accuracy
Posterior mean:      exact = 1.64864   ep = 1.64514   laplace = 1.61946   vb = 1.61834
Posterior variance:  exact = 0.359673  ep = 0.311474  laplace = 0.234616  vb = 0.171155
Cost vs. accuracy
[Figures: cost vs. accuracy with 20 points and with 200 points]
• Deterministic methods improve with more data (posterior is more Gaussian)
• Sampling methods do not
Censoring example
• Want to estimate x given multiple y’s
p(x) = N(x; 0, 100)
p(y_i | x) = N(y_i; x, 1)
p(|y_i| > t | x) = ∫_{-∞}^{-t} N(y; x, 1) dy + ∫_{t}^{∞} N(y; x, 1) dy
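A small sketch of how the censored likelihood above can be evaluated and combined with fully observed measurements. This is a hedged illustration: the slides do not specify the computation, and mixing observed with censored measurements is an assumption made here.

```python
import numpy as np
from scipy.stats import norm

def censored_likelihood(x, t):
    """p(|y| > t | x) for y ~ N(x, 1): mass below -t plus mass above t."""
    return norm.cdf(-t - x) + norm.sf(t - x)

def posterior_moments(observed_y, n_censored, t, prior_var=100.0):
    """Posterior mean and variance of x on a grid, combining fully observed
    y's with measurements known only to satisfy |y| > t."""
    x = np.linspace(-15.0, 15.0, 6001)
    log_post = -0.5 * x ** 2 / prior_var
    for y in observed_y:
        log_post += norm.logpdf(y, loc=x, scale=1.0)
    log_post += n_censored * np.log(censored_likelihood(x, t))
    p = np.exp(log_post - log_post.max())
    p /= p.sum()
    mean = np.sum(x * p)
    return mean, np.sum((x - mean) ** 2 * p)

print(posterior_moments(observed_y=[0.5, 1.2], n_censored=3, t=2.0))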
Time series problems
Example: Tracking
Guess the position of an object given noisy measurements
[Figure: object trajectory x_1 … x_4 with noisy measurements y_1 … y_4]
Factor graph
[Factor graph: chain x_1 – x_2 – x_3 – x_4 with measurements y_1 … y_4]
e.g. x_t = x_{t-1} + ν_t  (random walk)
     y_t = x_t + noise
Want the distribution of the x’s given the y’s
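When both the random-walk noise and the measurement noise are Gaussian, the messages on this chain are exactly Gaussian and a forward sweep reduces to the Kalman filter. A minimal sketch (the noise variances q and r and the prior are illustrative values, not from the slides):

```python
import numpy as np

def forward_messages(y, q=0.01, r=1.0, prior_mean=0.0, prior_var=100.0):
    """Forward Gaussian message passing on the chain x_t = x_{t-1} + noise(q),
    y_t = x_t + noise(r). Returns filtered means/variances of p(x_t | y_1..t)."""
    means, variances = [], []
    m, v = prior_mean, prior_var          # prior placed on x_0 for simplicity
    for yt in y:
        v = v + q                         # message through the transition factor
        k = v / (v + r)                   # combine with the measurement factor
        m = m + k * (yt - m)
        v = (1 - k) * v
        means.append(m)
        variances.append(v)
    return np.array(means), np.array(variances)

m, v = forward_messages([0.1, 0.3, 0.2, 0.5])
print(m, v)
```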
Approximate factor graph
[Factor graph: chain over x_1 … x_4 with approximated factors]
Splitting a pairwise factor
[Diagram: the pairwise factor between x_1 and x_2, before and after splitting]
Splitting in context
[Diagram: the factor between x_2 and x_3, before and after splitting in context]
Sweeping through the graph
[Diagrams, four slides: successive steps of a sweep over the chain x_1 … x_4]
Example: Poisson tracking
• y_t is a Poisson-distributed integer with mean exp(x_t)
Poisson tracking model
p(x_1) ~ N(0, 100)
p(x_t | x_{t-1}) ~ N(x_{t-1}, 0.01)
p(y_t | x_t) = exp(y_t x_t − e^{x_t}) / y_t!
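The measurement factor is not Gaussian, so the moment-matching step needs a one-dimensional integral. A sketch of one way to do it, using plain numerical quadrature (the slides do not say how the integrals are actually computed):

```python
import numpy as np
from scipy.special import gammaln

def match_poisson_factor(y, m, v, n_grid=2001, width=8.0):
    """Mean and variance of the tilted distribution N(x; m, v) * Poisson(y | exp(x)),
    computed by quadrature on a grid around the Gaussian context."""
    x = np.linspace(m - width * np.sqrt(v), m + width * np.sqrt(v), n_grid)
    log_tilted = (-0.5 * (x - m) ** 2 / v                # Gaussian context
                  + y * x - np.exp(x) - gammaln(y + 1))  # Poisson log-likelihood
    p = np.exp(log_tilted - log_tilted.max())
    p /= p.sum()
    new_m = np.sum(x * p)
    new_v = np.sum((x - new_m) ** 2 * p)
    return new_m, new_v   # parameters of the updated Gaussian on x_t

# Example: Gaussian context N(0, 1), observed count y_t = 3
print(match_poisson_factor(3, 0.0, 1.0))
```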
Factor graph
[Diagrams: the factor graph over x_1 … x_4 with measurements y_1 … y_4, and its approximation over x_1 … x_4]
Approximating a measurement factor
[Diagram: the factor between x_1 and y_1 and its Gaussian approximation in x_1]
Posterior for the last state
EP for signal detection (Qi and Minka, 2003)
• Wireless communication problem
• Transmitted signal = a sin(ωt + φ)
• Vary (a, φ) to encode each symbol
• In complex numbers: a e^{iφ}
[Figure: the symbol a e^{iφ} in the complex plane, with Re and Im axes]
Binary symbols, Gaussian noise
• Symbols are s^0 = 1 and s^1 = −1 (in the complex plane)
• Received signal: y_t = a sin(ωt + φ) + noise
• Optimal detection is easy in this case
[Figure: received signal y_t relative to the symbols s^0 and s^1]
Fading channel
• The channel systematically changes amplitude and phase:
  y_t = x_t s_t + noise
• s_t = transmitted symbol
• x_t = channel multiplier (complex number)
• x_t changes over time
[Figure: received signal y_t and the rotated/scaled symbols x_t s^0, x_t s^1 in the complex plane]
Differential detection
• Use the last measurement to estimate the state: x_t ≈ y_{t-1} / s_{t-1}
• State estimate is noisy – can we do better?
[Figure: y_t compared with the previous measurement y_{t-1} and −y_{t-1}]
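A toy sketch of the differential-detection baseline for the binary symbols {+1, −1}. The decision rule here (pick the symbol nearest to the prediction from the noisy channel estimate) is an assumption for illustration, not read from the slides:

```python
import numpy as np

def differential_detect(y, first_symbol=1.0):
    """Sequential differential detection for symbols {+1, -1}.
    The channel multiplier is estimated from the previous measurement."""
    symbols = np.array([1.0, -1.0])
    detected = [first_symbol]                 # assume a known first symbol
    for t in range(1, len(y)):
        x_hat = y[t - 1] / detected[-1]       # noisy channel estimate x_t ~ y_{t-1}/s_{t-1}
        s_hat = symbols[np.argmin(np.abs(y[t] - x_hat * symbols))]
        detected.append(s_hat)
    return np.array(detected[1:])

# Example with a slowly rotating channel and small noise
rng = np.random.default_rng(0)
true_s = rng.choice([1.0, -1.0], size=20); true_s[0] = 1.0
x = np.exp(1j * 0.05 * np.arange(20))
y = x * true_s + 0.05 * (rng.standard_normal(20) + 1j * rng.standard_normal(20))
print(np.mean(differential_detect(y) == true_s[1:]))
```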
Factor graph
[Factor graph: symbols s_1 … s_4, measurements y_1 … y_4, channel states x_1 … x_4]
• Symbols can also be correlated (e.g. error-correcting code)
• Channel dynamics are learned from training data (all 1’s)
Splitting a transition factor
Splitting a measurement factor
On-line implementation
• Iterate over the last δ measurements
• Previous measurements act as a prior
• Results comparable to particle filtering, but much faster
Classification problems
Spam filtering by linear separation
[Figure: two classes of points, “Not spam” and “Spam”]
Choose a boundary that will generalize to new data
Linear separation
Minimum training error solution (Perceptron)
Too arbitrary – won’t generalize well
Linear separation
Maximum-margin solution (SVM)
Ignores information in the vertical direction
Linear separation
Bayesian solution (via averaging)
Has a margin, and uses information in all dimensions
Geometry of linear separation
The separator is any vector w such that:
  x_i^T w > 0  (class 1)
  x_i^T w < 0  (class 2)
  ||w|| = 1  (sphere)
This set has an unusual shape
SVM: optimize over it
Bayes: average over it
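A brute-force illustration of "Bayes: average over it": rejection sampling of unit-norm separators that classify the training data correctly, then averaging them. This is not the EP algorithm the slides describe (it is only feasible in low dimensions); it just shows what is being averaged. The example data are made up.

```python
import numpy as np

def bayes_point(X, labels, n_samples=100_000, seed=None):
    """Average over all sampled unit-norm separators consistent with the data."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    w = rng.standard_normal((n_samples, d))
    w /= np.linalg.norm(w, axis=1, keepdims=True)          # points on the unit sphere
    ok = np.all((X @ w.T) * labels[:, None] > 0, axis=0)   # w separates every training point
    mean_w = w[ok].mean(axis=0)
    return mean_w / np.linalg.norm(mean_w)                 # averaged separator, renormalized

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5]])
labels = np.array([1.0, 1.0, -1.0])                        # +1 for class 1, -1 for class 2
print(bayes_point(X, labels, seed=0))
```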
Performance on linear separation
[Figure: EP Gaussian approximation to the posterior]
Factor graph
p(w) = N(w; 0, I)
Computing moments
q^{\i}(w) = N(w; m^{\i}, V^{\i})
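The moment computations on these slides are images in the extraction. The key closed-form result they rely on, for a Gaussian context multiplied by a step-function factor θ(x_i^T w), is stated below from standard EP / Bayes-point-machine derivations rather than read from the slides (Φ is the standard normal CDF):
  Z = ∫ N(w; m^{\i}, V^{\i}) θ(x_i^T w) dw = Φ(z),  with z = x_i^T m^{\i} / sqrt(x_i^T V^{\i} x_i)
  E[w] = m^{\i} + V^{\i} x_i N(z; 0, 1) / ( Φ(z) sqrt(x_i^T V^{\i} x_i) )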
Time vs. accuracy
[Figure: a typical run on the 3-point problem; error = distance to the true mean of w]
• Billiard = Monte Carlo sampling (Herbrich et al., 2001)
• Opper & Winther’s algorithms:
  – MF = mean-field theory
  – TAP = cavity method (equivalent to Gaussian EP for this problem)
Gaussian kernels
• Map data into a high-dimensional space so that:
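The rest of this sentence is an equation that did not survive extraction. What such slides typically show is the kernel trick with a Gaussian (RBF) kernel; the exact form and the scale parameter b below are assumptions, not the slide's own notation:
  k(x, x') = φ(x)^T φ(x') = exp( −||x − x'||² / b )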
Bayesian model comparison
• Multiple models M_i with prior probabilities p(M_i)
• Posterior probabilities:
• For equal priors, models are compared using the model evidence:
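The equations on this slide are images in the extraction; the standard formulas being referred to are:
  p(M_i | D) = p(D | M_i) p(M_i) / Σ_j p(D | M_j) p(M_j)
  p(D | M_i) = ∫ p(D | θ, M_i) p(θ | M_i) dθ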
Highest-probability kernel
Margin-maximizing kernel