 
Neural Networks - II
Henrik I Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology, Atlanta, GA 30332-0280
hic@cc.gatech.edu
Outline
1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary
Introduction
- Last lecture:
  - Neural networks as a layered regression problem
  - Feed-forward networks
  - Linear model with activation functions
  - Global optimization
- Coverage of multi-modal networks
- Bayesian models for neural networks
Motivation
- The models thus far have assumed a Gaussian distribution
- How about multi-modal distributions?
- How about inverse problems?
- Mixture models are one possible solution
Simple Robot Example
[Figure: a two-link robot arm with link lengths $L_1$, $L_2$ and joint angles $\theta_1$, $\theta_2$. The same end-effector position $(x_1, x_2)$ is reached by both an "elbow up" and an "elbow down" configuration.]
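To make the ambiguity of the inverse problem concrete, the following minimal sketch computes both joint-angle solutions for one end-effector position of a planar two-link arm. The kinematic model, the unit link lengths, and the function names `forward` and `inverse` are illustrative assumptions; the slide itself only shows the figure.

```python
import math

def forward(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position (x1, x2) of a planar two-link arm (assumed model)."""
    x1 = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    x2 = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x1, x2

def inverse(x1, x2, l1=1.0, l2=1.0):
    """Return both joint-angle solutions for a reachable target position."""
    c2 = (x1 * x1 + x2 * x2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    c2 = max(-1.0, min(1.0, c2))  # guard against rounding just outside [-1, 1]
    solutions = []
    for theta2 in (math.acos(c2), -math.acos(c2)):
        theta1 = math.atan2(x2, x1) - math.atan2(
            l2 * math.sin(theta2), l1 + l2 * math.cos(theta2))
        solutions.append((theta1, theta2))
    return solutions

# Two different configurations map to the same point, so the
# conditional distribution of the angles given the position is bimodal.
for angles in inverse(1.2, 0.5):
    print(angles, forward(*angles))
```

The forward map is single-valued, while the inverse map returns two answers for the same target. A least-squares fit of the inverse would average the two branches, which is the motivation for modelling a multi-modal conditional density instead.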
Simple Functional Approximation Example
[Figure: two plots, each with both axes running from 0 to 1.]
Basic Formulation
- Objective: approximation of $p(t \mid x)$
- A generic model:
  $$p(t \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\big(t \mid \mu_k(x), \sigma_k^2(x)\big)$$
- Here a Gaussian mixture is used, but any distribution could be the basis
- Parameters to be estimated: $\pi_k(x)$, $\mu_k(x)$ and $\sigma_k^2(x)$
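As a small illustration of the mixture formula, the sketch below evaluates $p(t \mid x)$ for a scalar target at one fixed input, so the parameters are plain numbers rather than functions of $x$. The function names and the example values are assumptions chosen for illustration.

```python
import math

def gaussian(t, mu, sigma2):
    """Univariate Gaussian density N(t | mu, sigma2)."""
    return math.exp(-(t - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def mixture_density(t, pis, mus, sigma2s):
    """p(t | x) = sum_k pi_k N(t | mu_k, sigma_k^2) for one fixed input x."""
    return sum(p * gaussian(t, m, s2) for p, m, s2 in zip(pis, mus, sigma2s))

# Hypothetical two-component example: modes near t = -1 and t = +1.
pis, mus, sigma2s = [0.5, 0.5], [-1.0, 1.0], [0.1, 0.1]
for t in (-1.0, 0.0, 1.0):
    print(t, mixture_density(t, pis, mus, sigma2s))
```

The density is high near both means and low between them, which a single Gaussian cannot represent.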
The mixture density network
[Figure: a network with inputs $x_1, \dots, x_D$ and outputs $\theta_1, \dots, \theta_M$, which parameterize the conditional density $p(t \mid x)$.]
The Model Parameters
- The mixing coefficients must satisfy
  $$\sum_{k=1}^{K} \pi_k(x) = 1, \qquad 0 \le \pi_k(x) \le 1$$
  which is achieved using a softmax:
  $$\pi_k(x) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}$$
- The variance must be positive, so a good choice is
  $$\sigma_k(x) = \exp(a_k^{\sigma})$$
- The means can be represented by direct activations:
  $$\mu_{kj}(x) = a_{kj}^{\mu}$$
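The mapping from raw output activations to valid mixture parameters can be sketched as follows. The function name `mdn_parameters` and the flat-list layout of the activations are assumptions for illustration; subtracting the maximum before exponentiating is a common numerical-stability step that leaves the softmax unchanged.

```python
import math

def mdn_parameters(a_pi, a_sigma, a_mu):
    """Map raw network output activations to mixture parameters."""
    m = max(a_pi)  # shift for numerical stability; the softmax is unchanged
    exps = [math.exp(a - m) for a in a_pi]
    total = sum(exps)
    pis = [e / total for e in exps]          # softmax: non-negative, sums to 1
    sigmas = [math.exp(a) for a in a_sigma]  # exponential: always positive
    mus = list(a_mu)                         # means: activations used directly
    return pis, sigmas, mus

pis, sigmas, mus = mdn_parameters([2.0, 0.0, -1.0], [-0.5, 0.0, 0.3], [0.1, 0.5, 0.9])
print(pis, sigmas, mus)
```

Because the constraints are built into the output transformations, the network weights can be optimized without any explicit constraint handling.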
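The Energy Equation(s)
- The error function is then, as seen before, the negative log likelihood:
  $$E(w) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(x_n, w)\, \mathcal{N}\big(t_n \mid \mu_k(x_n, w), \sigma_k^2(x_n, w)\big) \right\}$$
- By computing the derivatives we can minimize $E(w)$
- Let us use
  $$\gamma_{nk} = \gamma_k(t_n \mid x_n) = \frac{\pi_k \mathcal{N}_{nk}}{\sum_{l} \pi_l \mathcal{N}_{nl}}$$
- The derivatives with respect to the output activations are then
  $$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_{nk}$$
  $$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_{nk} \left\{ \frac{\mu_{kl} - t_{nl}}{\sigma_k^2} \right\}$$
  $$\frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk} \left\{ L - \frac{\| t_n - \mu_k \|^2}{\sigma_k^2} \right\}$$
  where $L$ is the dimensionality of the target $t$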
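The following sketch evaluates the per-sample error $E_n$ and the three gradient expressions for a scalar target (so $L = 1$) at a single data point. The function name and the list-based layout are assumptions for illustration; the gradients are taken with respect to the output activations, not the network weights, which would follow by backpropagation.

```python
import math

def mdn_error_and_grads(a_pi, a_sigma, a_mu, t):
    """E_n and its gradients w.r.t. the output activations, scalar target (L = 1)."""
    L = 1
    m = max(a_pi)
    exps = [math.exp(a - m) for a in a_pi]
    z = sum(exps)
    pis = [e / z for e in exps]
    sigmas = [math.exp(a) for a in a_sigma]
    # Component densities N_nk for the single observed target t.
    dens = [math.exp(-(t - mu) ** 2 / (2.0 * s * s)) / math.sqrt(2.0 * math.pi * s * s)
            for mu, s in zip(a_mu, sigmas)]
    weighted = [p * d for p, d in zip(pis, dens)]
    total = sum(weighted)
    gammas = [w / total for w in weighted]  # responsibilities gamma_nk
    error = -math.log(total)                # E_n
    g_pi = [p - g for p, g in zip(pis, gammas)]
    g_mu = [g * (mu - t) / (s * s) for g, mu, s in zip(gammas, a_mu, sigmas)]
    g_sigma = [g * (L - (t - mu) ** 2 / (s * s))
               for g, mu, s in zip(gammas, a_mu, sigmas)]
    return error, g_pi, g_mu, g_sigma

error, g_pi, g_mu, g_sigma = mdn_error_and_grads([0.2, -0.4], [-0.5, 0.1], [0.3, 1.5], 0.9)
print(error, g_pi, g_mu, g_sigma)
```

A useful sanity check for expressions like these is to compare them against central finite differences of `error` with respect to each activation.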
A Toy Example
[Figure: four panels (a) to (d), each with both axes running from 0 to 1.]
Mixture density networks
- The net is optimizing a mixture of parameters
- Different parts correspond to different components
- Each part has its own set of "energy terms" and gradients
- This illustrates the flexibility but also the complications
Introductory Remarks
- What if the output were a probability distribution?
- Could we optimize over the posterior distribution $p(t \mid x)$?
- Assume it is Gaussian to enable processing:
  $$p(t \mid x, w, \beta) = \mathcal{N}\big(t \mid y(x, w), \beta^{-1}\big)$$
- Let's consider how we can analyze the problem.
The Laplace Approximation - I
- Sometimes the posterior is no longer Gaussian
  - This makes integration challenging
  - Closed-form solutions might not be available
- How can we generate an approximation?
- Obviously, a Gaussian approximation would be helpful: this is the Laplace approximation
- Consider for now
  $$p(z) = \frac{f(z)}{\int f(a)\, da}$$
  where the denominator is merely for normalization and is considered unknown
- Assume the mode $z_0$ has been determined, so that $\left. df(z)/dz \right|_{z = z_0} = 0$
The Laplace Approximation - II
- The Taylor expansion of $\ln f$ around $z_0$ is then (the first-order term vanishes at the mode)
  $$\ln f(z) \approx \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$
  where
  $$A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}$$
- Taking the exponential:
  $$f(z) \approx f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
  which can be normalized to
  $$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
- The extension to multivariate distributions is straightforward (see book).
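The two Laplace slides can be sketched numerically as follows. The code assumes the mode $z_0$ is already known and estimates $A$ with a central finite difference of $\ln f$; the function name, the step size `h`, and the example $f(z) = z^4 e^{-2z}$ are assumptions chosen for illustration, not taken from the lecture.

```python
import math

def laplace_approximation(log_f, z0, h=1e-4):
    """Return (A, q) for the Gaussian q(z) fitted at a known mode z0 of f."""
    # A = -(d^2/dz^2) ln f(z) at z0, estimated by a central finite difference.
    A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / (h * h)

    def q(z):
        # Normalized Gaussian with mean z0 and precision A.
        return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

    return A, q

# Hypothetical example: f(z) = z^4 exp(-2 z) for z > 0, whose mode is z0 = 4 / 2 = 2.
def log_f(z):
    return 4.0 * math.log(z) - 2.0 * z

A, q = laplace_approximation(log_f, z0=2.0)
print(A, q(2.0))
```

For this example the second derivative of $\ln f$ is $-4 / z^2$, so $A$ should come out close to 1 at $z_0 = 2$. Note that $f$ never needs to be normalized: only the curvature of $\ln f$ at the mode enters the approximation.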
Posterior Parameter Distribution
- Back to Bayesian neural networks
- For an IID dataset with target values $\mathbf{t} = \{t_1, \dots, t_N\}$ we have
  $$p(\mathbf{t} \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, w), \beta^{-1}\big)$$
- With a Gaussian prior $p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$, the posterior is then
  $$p(w \mid \mathbf{t}, \alpha, \beta) \propto p(w \mid \alpha)\, p(\mathbf{t} \mid w, \beta)$$
- As usual we have
  $$\ln p(w \mid \mathbf{t}) = -\frac{\alpha}{2} w^T w - \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \text{const}$$
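The log posterior can be evaluated directly, up to its additive constant, as in the sketch below. The function name, the argument layout, and the tiny one-hidden-unit model `y` are hypothetical and only serve to make the example runnable.

```python
import math

def log_posterior(w, xs, ts, alpha, beta, y):
    """Unnormalized ln p(w | t): Gaussian prior term plus Gaussian likelihood term."""
    prior = -0.5 * alpha * sum(wi * wi for wi in w)
    sq_err = sum((y(x, w) - t) ** 2 for x, t in zip(xs, ts))
    return prior - 0.5 * beta * sq_err

# Hypothetical toy model: one tanh hidden unit, y(x, w) = w[1] * tanh(w[0] * x).
def y(x, w):
    return w[1] * math.tanh(w[0] * x)

print(log_posterior([1.0, 1.0], [1.0, 2.0], [0.5, -0.5], alpha=2.0, beta=4.0, y=y))
```

Maximizing this quantity over `w` is the same as minimizing a sum-of-squares error with a quadratic weight-decay regularizer, which is how $w_{\mathrm{MAP}}$ is typically found.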
Posterior Parameter Distribution - II
- We can use the Laplace approximation to estimate the distribution:
  $$A = -\nabla^2 \ln p(w \mid \mathbf{t}, \alpha, \beta) = \alpha I + \beta H$$
  where $H$ is the Hessian of the sum-of-squares error
- The approximation would be
  $$q(w \mid \mathbf{t}) = \mathcal{N}\big(w \mid w_{\mathrm{MAP}}, A^{-1}\big)$$
- In turn we have the predictive distribution
  $$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}\big(t \mid y(x, w_{\mathrm{MAP}}), \sigma^2\big)$$
  where
  $$\sigma^2 = \beta^{-1} + g^T A^{-1} g, \qquad g = \left. \nabla_w y(x, w) \right|_{w = w_{\mathrm{MAP}}}$$
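To show how the pieces fit together, the sketch below computes the predictive mean and variance for a deliberately simple stand-in model, $y(x, w) = w_0 + w_1 x$. This linear model is an assumption made so that the Hessian $H = \sum_n \phi_n \phi_n^T$ with $\phi_n = (1, x_n)$ is exact and $w_{\mathrm{MAP}}$ has a closed form; for a real network both would come from numerical optimization and a Hessian evaluation. The function name and data are hypothetical.

```python
def predictive(x, xs, ts, alpha, beta):
    """Predictive mean and variance for the toy model y = w0 + w1 * x."""
    n = len(xs)
    # Hessian of the sum-of-squares error: H = sum_n phi_n phi_n^T, phi_n = (1, x_n).
    h00, h01, h11 = float(n), sum(xs), sum(v * v for v in xs)
    # A = alpha * I + beta * H (symmetric 2x2 matrix).
    a00, a01, a11 = alpha + beta * h00, beta * h01, alpha + beta * h11
    det = a00 * a11 - a01 * a01
    i00, i01, i11 = a11 / det, -a01 / det, a00 / det  # entries of A^{-1}
    # For this linear model, w_MAP = beta * A^{-1} * Phi^T t.
    b0 = beta * sum(ts)
    b1 = beta * sum(xv * tv for xv, tv in zip(xs, ts))
    w0, w1 = i00 * b0 + i01 * b1, i01 * b0 + i11 * b1
    # g = grad_w y(x, w) = (1, x); it does not depend on w for a linear model.
    g0, g1 = 1.0, x
    var = 1.0 / beta + g0 * (i00 * g0 + i01 * g1) + g1 * (i01 * g0 + i11 * g1)
    return w0 + w1 * x, var

xs, ts = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
print(predictive(1.5, xs, ts, alpha=1e-6, beta=25.0))
print(predictive(10.0, xs, ts, alpha=1e-6, beta=25.0))
```

The variance has two parts: the noise term $\beta^{-1}$, which is constant, and the term $g^T A^{-1} g$, which reflects parameter uncertainty and grows as the query point moves away from the training inputs.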
Optimization of Hyper-parameters
- How do we estimate $\alpha$ and $\beta$?
- We can consider the marginal likelihood
  $$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid w, \beta)\, p(w \mid \alpha)\, dw$$
- From linear regression we have the eigenvalue decomposition
  $$\beta H u_i = \lambda_i u_i$$
  where $H$ is the Hessian of the error $E$
- As with regression, we have
  $$\alpha = \frac{\gamma}{w_{\mathrm{MAP}}^T w_{\mathrm{MAP}}}$$
  where $\gamma = \sum_i \lambda_i / (\alpha + \lambda_i)$ is the effective rank of the Hessian (the effective number of parameters)
- Similarly, $\beta$ can be derived to be
  $$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ y(x_n, w_{\mathrm{MAP}}) - t_n \}^2$$
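A single re-estimation step for the hyper-parameters can be sketched as below. The function name and argument layout are assumptions; the eigenvalues of $\beta H$, the weights $w_{\mathrm{MAP}}$, and the residuals $y(x_n, w_{\mathrm{MAP}}) - t_n$ are taken as given inputs, and the numbers in the example are made up.

```python
def reestimate(eigenvalues, alpha, w_map, residuals):
    """One re-estimation step for alpha and beta from quantities at w_MAP."""
    # eigenvalues: the lambda_i of beta * H, evaluated at w_MAP.
    gamma = sum(lam / (alpha + lam) for lam in eigenvalues)
    alpha_new = gamma / sum(w * w for w in w_map)
    n = len(residuals)
    # 1 / beta = (1 / (N - gamma)) * sum of squared residuals.
    beta_new = (n - gamma) / sum(r * r for r in residuals)
    return gamma, alpha_new, beta_new

gamma, alpha_new, beta_new = reestimate(
    eigenvalues=[9.0, 1.0], alpha=1.0,
    w_map=[1.0, 2.0], residuals=[0.1, -0.2, 0.1, 0.2, -0.1])
print(gamma, alpha_new, beta_new)
```

Since $\gamma$ depends on $\alpha$ and on the eigenvalues at $w_{\mathrm{MAP}}$, updates of this kind are typically alternated with re-optimizing the weights until the values settle.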
Bayesian Neural Networks
- Modelling of the system as a probabilistic generator
- Use standard techniques to generate $w_{\mathrm{MAP}}$
- In addition, we can generate estimates for the precision/variance
Summary
- With neural nets we have a general functional estimator
- They can be applied to both regression and discrimination
- The basis functions can be a broad set of functions
- NNs can also be used for estimation of mixture systems
- Estimation of probability distributions is also possible with Gaussian approximations (using $w_{\mathrm{MAP}}$, $\beta$)
- Neural nets are a rich area with a long history.