Tractable Inference for Probabilistic Models
Manfred Opper (Aston University, Birmingham, U.K.)
Collaboration with: Ole Winther (TU Denmark), Dörthe Malzahn (TU Denmark), Lehel Csató (Aston U)
The General Structure
D = observed data, S = hidden variables (unknown causes, etc.)
Bayes rule:
P(S | D) = P(D | S) × P(S) / P(D)
where P(S | D) is the posterior distribution, P(D | S) the likelihood, and P(S) the prior distribution.
Overview • Inference with probabilistic models: Examples • A “canonical” model • Problems with inference and approximate solutions • Cavity/TAP approximation • Applications • Outlook
Example I: Modeling with Gaussian Processes
• Observations: data D = (y_1, ..., y_N) observed at points x_i ∈ R^D.
[Figure: GP regression example. BV set size: 10; likelihood parameter: 2.0594]
• Model for observations:
y_i = f(x_i) + "noise" (regression, eg. with positive noise)
y_i = sign[f(x_i) + "noise"] (classification)
• A priori information about the "latent variable" (the function f): realization of a Gaussian random process with covariance K(x, x').
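A minimal numpy sketch of this model, assuming a squared-exponential covariance and Gaussian noise for the regression case; the kernel, length scale and noise level are illustrative choices, not those used in the talk.

```python
import numpy as np

def se_kernel(x1, x2, ell=3.0):
    # Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 ell^2))
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / ell) ** 2)

rng = np.random.default_rng(0)
x_train = np.linspace(-20, 20, 40)
x_test = np.linspace(-20, 20, 200)
sigma2 = 0.1                                    # noise variance (illustrative)

# Draw a latent f from the GP prior and noisy observations y = f + noise
K = se_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))
f = rng.multivariate_normal(np.zeros_like(x_train), K)
y = f + np.sqrt(sigma2) * rng.standard_normal(x_train.shape)

# Gaussian-noise posterior mean: E[f(x*)|D] = K(x*, X) (K + sigma^2 I)^{-1} y
alpha = np.linalg.solve(K + sigma2 * np.eye(len(x_train)), y)
post_mean = se_kernel(x_test, x_train) @ alpha
```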
Modeling with Gaussian Processes: Wind Fields
Ambiguities in the local observation model (an MDN network) for measuring wind velocity fields from satellite data.
Solution: model the prior distribution of wind fields using a Gaussian process.
Example II: Code Division Multiple Access (CDMA)
• K users in mobile communication try to transmit message bits S_1, ..., S_K with S_k ∈ {−1, +1} over a single channel.
• Modulation: multiply the message with a spreading code x_k(n), n = 1, ..., N_c.
• Received signals:
y(n) = \sum_{k=1}^{K} S_k x_k(n) + \sigma \varepsilon(n)
• Inference: estimate the S_k from the y(n) (= regression with binary variables).
(Introduced to the machine learning community by Toshiyuki Tanaka.)
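A small sketch of this observation model, assuming random ±1 spreading codes; all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, N_c, sigma = 8, 16, 0.5                      # users, chips per bit, noise level

S = rng.choice([-1, 1], size=K)                 # message bits S_k
X = rng.choice([-1, 1], size=(K, N_c))          # spreading codes x_k(n)

# Received chips: y(n) = sum_k S_k x_k(n) + sigma * eps(n)
y = S @ X + sigma * rng.standard_normal(N_c)

# Matched-filter estimate (a simple baseline, not the Bayes-optimal detector)
S_hat = np.sign(X @ y)
```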
A Canonical Class of Distributions
P(S) = \frac{1}{Z} \prod_i \rho_i(S_i) \exp\Big( \sum_{i<j} S_i J_{ij} S_j \Big)
ρ_i models local observations (likelihood) or local constraints; J_{ij} couples sites i and j.
The normalization Z usually coincides with the probability P(D) of the observed data.
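A brute-force sketch of this distribution for small N and binary S_i, enumerating all 2^N states to obtain Z and the marginal means exactly; this is precisely the computation that becomes infeasible for large N. The couplings and local terms are illustrative.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N = 10
J = np.triu(rng.standard_normal((N, N)) / np.sqrt(N), 1)   # couplings J_ij, i < j
theta = rng.standard_normal(N)                  # local terms: rho_i(S_i) = exp(theta_i S_i)

states = np.array(list(product([-1, 1], repeat=N)))         # all 2^N configurations
log_w = states @ theta + np.einsum('si,ij,sj->s', states, J, states)
w = np.exp(log_w)

Z = w.sum()                                     # normalization Z
m = (states * w[:, None]).sum(axis=0) / Z       # exact means E[S_i]
```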
Problems with Inference
• Variables are dependent → high-dimensional integrals/sums.
• Exact inference is impossible if the random variables are continuous (and non-Gaussian).
• The Laplace approximation for integrals is impossible if the integrand is non-differentiable.
• "Learning" the coupling matrix J by the EM algorithm (maximum likelihood) requires the correlations E[S_i S_j].
Non-variational Approximations
• Bethe approximation / belief propagation (Yedidia, Freeman & Weiss): "tree-like" graphs.
• TAP-type approximations: many neighbours, weak dependencies; neighbourhood → Gaussian random influence.
Gibbs Free Energy
• Gives the moments and Z = P(D) simultaneously.
• Applicability of optimization methods.
\Phi(\mathbf{m}, \mathbf{M}) = \min_Q \big\{ \mathrm{KL}(Q \,\|\, P) \mid E_Q[S_i] = m_i,\; E_Q[S_i^2] = M_i,\; i = 1, \dots, N \big\} - \ln Z
[Figure: Φ(m) as a function of m; the minimum value −ln P(D) is attained at m = E[S].]
TAP Approximation to the Free Energy
Introduce a tunable interaction strength l:
P_l(S) = \frac{1}{Z_l} \prod_i \rho_i(S_i) \exp\Big( l \sum_{i<j} S_i J_{ij} S_j \Big)
Exact result:
\Phi_{l=1} = \Phi_{l=0} + \int_0^1 dl\, \frac{\partial \Phi_l}{\partial l} = \Phi_{l=0} - \frac{1}{2} \sum_{i,j} m_i J_{ij} m_j - \frac{1}{2} \int_0^1 dl\, \mathrm{Tr}(C_l J)
with covariance C_l.
• TAP (Thouless, Anderson & Palmer): expand Φ_l to O(l^2).
• Adaptive TAP (Opper & Winther): Gaussian approximation for C_l,
C_l^g = (\Lambda_l - l J)^{-1}
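A toy numerical sketch of the Gaussian-approximated interaction term above, using C_l^g = (Λ_l − lJ)^{−1}; here Λ is held fixed over l for simplicity (in adaptive TAP it is determined self-consistently), so this only illustrates the structure of the l-integral. All matrices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 20
J = rng.standard_normal((N, N)) / np.sqrt(N)
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

m = np.tanh(rng.standard_normal(N))             # illustrative means m_i
Lam = np.diag(2.0 + rng.random(N))              # illustrative site precisions Lambda

# Interaction correction: -1/2 m^T J m - 1/2 int_0^1 Tr(C_l^g J) dl,
# with C_l^g = (Lambda - l J)^{-1}; the integral is done by the trapezoid rule
ls = np.linspace(0.0, 1.0, 101)
tr = np.array([np.trace(np.linalg.inv(Lam - l * J) @ J) for l in ls])
integral = np.sum(0.5 * (tr[1:] + tr[:-1]) * np.diff(ls))
correction = -0.5 * m @ J @ m - 0.5 * integral
```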
Properties of the TAP Free Energy
• The free energy has the form
\Phi_{\mathrm{TAP}}(\mathbf{m}, \mathbf{M}) = \Phi_0(\mathbf{m}, \mathbf{M}) + \Phi_g(\mathbf{m}, \mathbf{M}) - \Phi_{g0}(\mathbf{m}, \mathbf{M})
The Φ's are convex and correspond to:
Φ_0(m, M): true likelihood, no interactions.
Φ_g(m, M): Gaussian likelihood, full interactions.
Φ_{g0}(m, M): Gaussian likelihood, no interactions.
• The minimizing hyperparameters of Φ_TAP equal the fixed points of an approximate EM algorithm.
Relation to the Cavity Approach
\Phi_0 = \max_{\lambda^0, \gamma^0} \Big\{ -\sum_i \ln Z_i(\gamma_i^0, \lambda_i^0) + \mathbf{m}^T \gamma^0 + \tfrac{1}{2} \mathbf{M}^T \lambda^0 \Big\}
with
Z_i(\gamma_i^0, \lambda_i^0) = \int dS\, \rho_i(S) \exp\Big( \gamma_i^0 S + \tfrac{1}{2} \lambda_i^0 S^2 \Big) = \int dS\, \rho_i(S)\, E_z\Big[ \exp\big( S (\gamma_i^0 + \sqrt{\lambda_i^0}\, z) \big) \Big]
with z a standard normal Gaussian random variable.
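A tiny numerical check of the Gaussian-average identity used in the second form of Z_i, assuming a binary local term (so the integral over S becomes a two-point sum); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
gamma0, lam0 = 0.7, 0.3                         # arbitrary cavity parameters (lam0 >= 0)

# Direct form: Z_i = sum_S exp(gamma0 * S + 0.5 * lam0 * S^2), S in {-1, +1}
Z_direct = sum(np.exp(gamma0 * S + 0.5 * lam0 * S ** 2) for S in (-1, 1))

# Gaussian-average form: Z_i = sum_S E_z[ exp(S (gamma0 + sqrt(lam0) z)) ], z ~ N(0, 1)
z = rng.standard_normal(200_000)
Z_gauss = sum(np.mean(np.exp(S * (gamma0 + np.sqrt(lam0) * z))) for S in (-1, 1))
# Z_direct and Z_gauss agree up to Monte Carlo error
```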
Algorithm: Expectation Propagation (T. Minka)
Introduce an effective Gaussian distribution whose likelihood is
\prod_{i=1}^{N} \rho_i^g(S_i) = \prod_{i=1}^{N} e^{-\lambda_i S_i^2 + \gamma_i S_i}
• Visit site i. Replace the Gaussian likelihood by the true likelihood; the new marginal is
P_i(S) \propto P_i^g(S)\, \frac{\rho_i(S)}{\rho_i^g(S)}
Recompute E[S_i] and E[S_i^2].
• Recompute λ_i and γ_i → new site.
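A minimal sketch of this EP / adaptive-TAP moment-matching loop for the canonical model, assuming binary S_i ∈ {−1, +1} with ρ_i(S_i) ∝ exp(θ_i S_i), so the tilted moments are simple tanh expressions. The site terms are parameterized as exp(γ_i S_i − ½λ_i S_i²); damping, initialization and the numerical guard are pragmatic choices, not part of the algorithm as stated in the talk.

```python
import numpy as np

def ep_binary(J, theta, n_sweeps=100, damp=0.5):
    """EP/adaptive-TAP sketch for P(S) ∝ prod_i e^{theta_i S_i} exp(sum_{i<j} S_i J_ij S_j)."""
    N = J.shape[0]
    lam, gam = np.ones(N), np.zeros(N)           # Gaussian site parameters
    for _ in range(n_sweeps):
        C = np.linalg.inv(np.diag(lam) - J)      # Gaussian covariance C^g = (Lambda - J)^{-1}
        mu = C @ gam
        for i in range(N):
            v, m = C[i, i], mu[i]
            prec_c = 1.0 / v - lam[i]            # cavity precision: site i's term removed
            if prec_c <= 1e-10:
                continue                         # skip numerically unsafe updates
            v_c = 1.0 / prec_c
            m_c = v_c * (m / v - gam[i])
            # Tilted moments with the true binary site: E[S_i] = tanh(theta_i + m_c / v_c)
            m_t = np.tanh(theta[i] + m_c / v_c)
            v_t = 1.0 - m_t ** 2
            # Moment matching -> new site parameters, with damping for stability
            lam[i] = damp * (1.0 / v_t - 1.0 / v_c) + (1 - damp) * lam[i]
            gam[i] = damp * (m_t / v_t - m_c / v_c) + (1 - damp) * gam[i]
    C = np.linalg.inv(np.diag(lam) - J)
    return C @ gam                               # approximate means E[S_i]

rng = np.random.default_rng(5)
N = 10
J = np.triu(rng.standard_normal((N, N)) * 0.2 / np.sqrt(N), 1)
J = J + J.T                                      # symmetric weak couplings, zero diagonal
theta = 0.5 * rng.standard_normal(N)
m_ep = ep_binary(J, theta)                       # for small N, compare with brute-force enumeration
```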
Exact Average-Case Behaviour: Random J Matrix Ensembles, N → ∞
Assume an orthogonal random matrix ensemble for J_N with asymptotic scaling of the generating function
\frac{1}{N} \ln \Big\langle e^{\frac{1}{2} \mathrm{Tr}(A J_N)} \Big\rangle_J \simeq \mathrm{Tr}\, G(A/N)
For N → ∞: the average-case properties (replica symmetry) of exact inference and of the ADATAP approximation agree (if there is a single solution).
Application: Non-Gaussian Regression
y = f(x) + ξ with positive noise p(ξ) = λ e^{−λξ} \mathbb{1}_{ξ>0}: estimate the parameter λ with N = 1000.
[Figure: regression with positive noise. BV set size: 10; likelihood parameter: 2.0594]
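A small sketch of generating data from this positive-noise model and recovering λ by maximum likelihood when the latent function is known; the stand-in f and the parameter values are illustrative, and this deliberately skips the GP/ADATAP treatment of f used in the talk.

```python
import numpy as np

rng = np.random.default_rng(6)
N, lam_true = 1000, 2.0

x = rng.uniform(-20, 20, size=N)
f = np.sin(x / 3.0)                              # stand-in latent function (illustrative)
xi = rng.exponential(scale=1.0 / lam_true, size=N)   # positive noise, p(xi) = lam * exp(-lam * xi)
y = f + xi

# With f known, the exponential ML estimate is 1 / mean(residuals);
# the talk instead infers lam jointly with a GP posterior over f.
lam_hat = 1.0 / np.mean(y - f)
```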
Example: Estimation of Wind Fields
[Figure: wind field estimates. Panels: likelihood, Monte Carlo prediction, ADATAP prediction; velocity scale bars 10 m s⁻¹ and 20 m s⁻¹.]
CDMA Results I (Winther & Fabricius)
[Figure: scatter plots of the Bayes-optimal predictions h_i = artanh(m_i), naive mean field vs. exact and ADATAP vs. exact.]
Results for K = 8 users and N_c = 16.
CDMA Results II (Winther & Fabricius)
[Figure: bit-error rate (BER) vs. number of users K for naive mean field, adaptive TAP, linear MMSE, hard serial IC, and matched filter.]
Bit-error rate as a function of the number of users, at SNR = 10 dB and spreading factor N_c = 20.
Approximate Analytical Bootstrap
Goal: estimate average-case properties (eg. test errors, uncertainty) of a statistical predictor (eg. an SVM) without held-out test data.
Bootstrap (Efron): generate new pseudo training data by resampling the old training data with replacement.
Original training data: D_0 = (z_1, z_2, z_3)
Bootstrap samples: D_1 = (z_1, z_1, z_2); D_2 = (z_1, z_2, z_2); D_3 = (z_3, z_3, z_3), ...
Problem: each sample requires time-consuming retraining of the predictor (see the sketch below).
Approximate analytical approach: average over samples with the help of the "replica trick".
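The brute-force baseline that the analytical approach avoids: resample the training set with replacement and retrain the predictor on each bootstrap sample. The sketch below uses a ridge-regression stand-in for the predictor; the dataset and model are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
N, d, B = 200, 5, 500                            # training size, input dim, bootstrap samples

X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.3 * rng.standard_normal(N)
x_test = rng.standard_normal(d)

def fit_ridge(X, y, alpha=1.0):
    # Stand-in predictor: ridge weights (X^T X + alpha I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# Each bootstrap sample requires a full retraining of the predictor
preds = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)             # resample training data with replacement
    preds.append(x_test @ fit_ridge(X[idx], y[idx]))

pred_mean, pred_std = np.mean(preds), np.std(preds)   # bootstrap uncertainty at x_test
```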
Support Vector Classifier (Vapnik)
The SVM predicts y = sign[\hat{f}_{D_0}(x)] for x ∈ R^d, with
\hat{f}_{D_0}(x) = \sum_{j=1}^{N} y_j \alpha_j K(x, x_j)
and K a positive definite kernel. Setting S_i = \sum_{j=1}^{N} y_j \alpha_j K(x_i, x_j), the α's can be found from the convex optimization problem:
minimize S^T K^{-1} S under the constraints S_i y_i ≥ 1, i = 1, ..., N.
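A minimal sketch of this kernel SVM predictor, using scikit-learn's SVC with a large C as a stand-in for the hard-margin constraint S_i y_i ≥ 1; the RBF kernel, its parameters and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(8)
N, d = 200, 2
X = rng.standard_normal((N, d))
y = np.sign(X[:, 0] + 0.5 * np.sin(3 * X[:, 1]))     # synthetic labels in {-1, +1}

# A very large C approximates the hard-margin constraint S_i y_i >= 1
svm = SVC(kernel='rbf', gamma=1.0, C=1e6).fit(X, y)

x_new = np.array([[0.3, -0.7]])
f_hat = svm.decision_function(x_new)             # f_hat(x) = sum_j y_j alpha_j K(x, x_j) + b
y_pred = np.sign(f_hat)                          # same as svm.predict(x_new)
```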
Probabilistic Formulation of Support Vector Machines
Define the prior
\mu[S] = \frac{1}{\sqrt{(2\pi)^N \beta^{-N} |K|}} \exp\Big( -\frac{\beta}{2} S^T K^{-1} S \Big)
and the pseudo-likelihood
\prod_j P(y_j \mid S) = \prod_j \Theta(y_j S_j - 1)
where Θ(u) = 1 for u > 0 and 0 otherwise. For β → ∞, the measure P[S | D] ∝ μ[S] P(D | S) concentrates at the vector \hat{S} which solves the SVM optimization problem.
Analytical Average Using Replicas
Let s_j = number of times data point y_j appears in the bootstrap sample D.
E_D[Z^n] = E_D \int \prod_{a=1}^{n} \big( dS^a\, \mu[S^a] \big) \prod_{j,a} P^{s_j}(y_j \mid S_j^a)
= \int \prod_{a=1}^{n} \big( dS^a\, \mu[S^a] \big) \exp\Big( \frac{S}{N} \sum_{j=1}^{N} \Big[ \prod_{a=1}^{n} P(y_j \mid S_j^a) - 1 \Big] \Big)
(S = bootstrap sample size.)
A new intractable statistical model with coupled replicas! Need approximate inference tools and the limit n → 0.
Results: Classification & Regression
Compare the TAP approximation theory with bootstrap simulation (= sampling + retraining).
Generalization error:
[Figure, left: bootstrapped classification error vs. bootstrap sample size S for the Wisconsin (N=683), Pima (N=532), Sonar (N=208) and Crabs (N=200) data sets, comparing simulation and TAP theory; a second axis shows the average number of test points.]
[Figure, right: bootstrapped square loss vs. size S of the bootstrap sample for the Boston data set (N=506), comparing simulation with the approximate theories (TAP, variational Gaussian, mean field).]
SVM Results Cont'd
Uncertainty of the SVM prediction at test points.
[Figure: density of the bootstrapped local field at a test input x; inset: simulation vs. theory for p(−1|x) (simulation 0.376, theory 0.405).]
Regression: Distribution of the Predictor on Training Points
[Figure: density of the bootstrapped prediction at input x_372; inset: abundance as a function of L1.]
Outlook • Systematic improvement • Tractable substructures • More complex dependencies (eg directed graphs) • Fast algorithms & sparsity • Combinatorial optimization problems, metastability • Performance bounds?