  Introduction
    Last lecture: neural networks as a layered regression problem
    Feed-forward networks
    Linear model with activation functions
    Global optimization
    Coverage of multi-modal networks
    Bayesian models for neural networks

  Motivation
    The models thus far have assumed a Gaussian distribution
    How about multi-modal distributions?
    How about inverse problems?
    Mixture models are one possible solution

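  Simple Robot Example
    [Figure: two-link robot arm with link lengths L_1, L_2 and joint angles θ_1, θ_2. The same end position (x_1, x_2) is reached in both an "elbow up" and an "elbow down" configuration.]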
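
The ambiguity in the robot example can be checked numerically. The sketch below assumes a planar two-link arm with the textbook forward kinematics (joint angle θ_2 measured relative to the first link). The function names `forward` and `inverse`, the unit link lengths and the target point are illustrative choices, not taken from the lecture.

```python
import math

def forward(theta1, theta2, l1=1.0, l2=1.0):
    """End position (x1, x2) of a planar two-link arm (assumed kinematics)."""
    x1 = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    x2 = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x1, x2

def inverse(x1, x2, l1=1.0, l2=1.0):
    """Return both joint-angle solutions for a reachable target (x1, x2)."""
    c2 = (x1**2 + x2**2 - l1**2 - l2**2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target is out of reach")
    solutions = []
    for theta2 in (math.acos(c2), -math.acos(c2)):  # the two elbow configurations
        theta1 = math.atan2(x2, x1) - math.atan2(
            l2 * math.sin(theta2), l1 + l2 * math.cos(theta2))
        solutions.append((theta1, theta2))
    return solutions

target = (1.2, 0.5)
for theta1, theta2 in inverse(*target):
    # Both angle pairs map back to the same end position.
    print((theta1, theta2), "->", forward(theta1, theta2))
```

The forward map is single-valued, but the inverse map returns two distinct answers for one target. A regression model that outputs a single conditional mean would therefore average the two configurations, which is the motivation for a multi-modal model of p(t | x).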

  Simple Functional Approximation Example
    [Figure: two plots, each with both axes running from 0 to 1]

  Basic Formulation
    Objective: approximation of $p(t \mid x)$
    A generic model
      $$p(t \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\left(t \mid \mu_k(x), \sigma_k^2(x)\right)$$
    Here a Gaussian mixture is used, but any distribution could be the basis
    Parameters to be estimated: $\pi_k(x)$, $\mu_k(x)$ and $\sigma_k^2(x)$
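
As a small illustration of the generic model, the sketch below evaluates the mixture density for one fixed input x. It assumes a scalar target; the helper names and the parameter values are hypothetical and only chosen to give two well-separated modes.

```python
import math

def gaussian(t, mu, sigma):
    """Univariate normal density N(t | mu, sigma^2)."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_density(t, pis, mus, sigmas):
    """p(t | x) for one input x, given that input's mixture parameters."""
    return sum(p * gaussian(t, m, s) for p, m, s in zip(pis, mus, sigmas))

# Hypothetical parameters for a single x: two modes at t = 0.2 and t = 0.8.
pis, mus, sigmas = [0.5, 0.5], [0.2, 0.8], [0.05, 0.05]
for t in (0.2, 0.5, 0.8):
    print(t, mixture_density(t, pis, mus, sigmas))
```

The density is high at both modes and close to zero between them, which a single Gaussian centred on the conditional mean (here 0.5) could not represent.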

  The mixture density network
    [Figure: a network maps the inputs x_1, ..., x_D to the parameters θ_1, ..., θ_M, which define the conditional density p(t | x)]

  The Model Parameters
    Mixing coefficients must satisfy
      $$\sum_{k=1}^{K} \pi_k(x) = 1, \qquad 0 \le \pi_k(x) \le 1$$
    achieved using a softmax
      $$\pi_k(x) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}$$
    The variance must be positive, so a good choice is
      $$\sigma_k(x) = \exp(a_k^{\sigma})$$
    The means can be represented by direct activations
      $$\mu_{kj}(x) = a_{kj}^{\mu}$$
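
The three mappings can be sketched in a few lines. The function name `mdn_parameters` and the activation values are hypothetical, and a scalar target is assumed so that each component has a single mean. Subtracting the largest activation before exponentiating is a standard numerical safeguard that leaves the softmax unchanged; it is not part of the slide.

```python
import math

def mdn_parameters(a_pi, a_mu, a_sigma):
    """Map raw output activations to valid mixture parameters."""
    shift = max(a_pi)                        # stabilises exp() without changing the result
    exps = [math.exp(a - shift) for a in a_pi]
    total = sum(exps)
    pis = [e / total for e in exps]          # softmax: non-negative, sums to 1
    sigmas = [math.exp(a) for a in a_sigma]  # exponential keeps sigma_k > 0
    mus = list(a_mu)                         # means are the activations themselves
    return pis, mus, sigmas

pis, mus, sigmas = mdn_parameters([2.0, -1.0, 0.5], [0.1, 0.5, 0.9], [-3.0, 0.0, 1.0])
print(pis, mus, sigmas)
```

Because the constraints are built into the mapping, the network's output activations can be optimized without any explicit constraint handling.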

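  The Energy Equation(s)
    The error function is then, as seen before,
      $$E(w) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(x_n, w)\, \mathcal{N}\left(t_n \mid \mu_k(x_n, w), \sigma_k^2(x_n, w)\right) \right\}$$
    Computing the derivatives, we can minimize $E(w)$
    Let us use the responsibilities
      $$\gamma_{nk} = \gamma_k(t_n \mid x_n) = \frac{\pi_k \mathcal{N}_{nk}}{\sum_{l=1}^{K} \pi_l \mathcal{N}_{nl}}$$
    where $\mathcal{N}_{nk} = \mathcal{N}\left(t_n \mid \mu_k(x_n, w), \sigma_k^2(x_n, w)\right)$ and $E_n$ is the contribution of data point $n$ to $E(w)$
    The derivatives are then
      $$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_{nk}$$
      $$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_{nk} \left\{ \frac{\mu_{kl} - t_{nl}}{\sigma_k^2} \right\}$$
      $$\frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk} \left\{ L - \frac{\lVert t_n - \mu_k \rVert^2}{\sigma_k^2} \right\}$$
    where $L$ is the dimensionality of the target $t$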
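
One way to gain confidence in the derivative expressions is to compare them with finite differences. The sketch below does this for a single data point with a scalar target (so L = 1) and two components. All function names and numerical values are illustrative assumptions, not part of the lecture.

```python
import math

def params(a_pi, a_mu, a_sigma):
    """Softmax for pi_k, identity for mu_k, exponential for sigma_k."""
    m = max(a_pi)
    e = [math.exp(a - m) for a in a_pi]
    total = sum(e)
    return [v / total for v in e], list(a_mu), [math.exp(a) for a in a_sigma]

def normal(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def error(t, a_pi, a_mu, a_sigma):
    """E_n = -ln sum_k pi_k N(t | mu_k, sigma_k^2) for one data point."""
    pis, mus, sigmas = params(a_pi, a_mu, a_sigma)
    return -math.log(sum(p * normal(t, m, s) for p, m, s in zip(pis, mus, sigmas)))

def gradients(t, a_pi, a_mu, a_sigma, L=1):
    """Analytic derivatives of E_n with respect to the output activations."""
    pis, mus, sigmas = params(a_pi, a_mu, a_sigma)
    joint = [p * normal(t, m, s) for p, m, s in zip(pis, mus, sigmas)]
    total = sum(joint)
    gammas = [j / total for j in joint]                 # responsibilities gamma_nk
    d_pi = [p - g for p, g in zip(pis, gammas)]
    d_mu = [g * (m - t) / s**2 for g, m, s in zip(gammas, mus, sigmas)]
    d_sigma = [g * (L - (t - m) ** 2 / s**2) for g, m, s in zip(gammas, mus, sigmas)]
    return d_pi, d_mu, d_sigma

def numeric(t, acts, group, k, h=1e-6):
    """Central finite difference of E_n w.r.t. activation k of the given group."""
    plus = [list(a) for a in acts]
    minus = [list(a) for a in acts]
    plus[group][k] += h
    minus[group][k] -= h
    return (error(t, *plus) - error(t, *minus)) / (2.0 * h)

acts = ([0.3, -0.2], [0.1, 0.9], [-1.0, -0.5])   # (a_pi, a_mu, a_sigma), hypothetical
t = 0.4
analytic = gradients(t, *acts)
for group in range(3):
    for k in range(2):
        print(group, k, analytic[group][k], numeric(t, acts, group, k))
```

In a full network these derivatives with respect to the output activations would be the starting point for back-propagation to the weights w.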

  A Toy Example
    [Figure: four panels (a)-(d), each with both axes running from 0 to 1]

  Mixture density networks
    The net is optimizing a mixture of parameters
    Different parts correspond to different components
    Each part has its own set of "energy terms" and gradients
    Illustrates the flexibility, but also the complications

  Introductory Remarks
    What if the output were a probability distribution?
    Could we optimize over the posterior distribution?
      $$p(t \mid x)$$
    Assume it is Gaussian to enable processing
      $$p(t \mid x, w, \beta) = \mathcal{N}\left(t \mid y(x, w), \beta^{-1}\right)$$
    Let's consider how we can analyze the problem

  The Laplace Approximation - I
    Sometimes the posterior is no longer Gaussian
    This makes integration challenging
    Closed-form solutions might not be available
    How can we generate an approximation?
    Obviously, using a Gaussian approximation would be helpful
    Using a Laplace approximation: consider for now
      $$p(z) = \frac{f(z)}{\int f(a)\, da}$$
    The denominator is merely for normalization and is considered unknown
    Assume the mode $z_0$ has been determined, so that $df(z)/dz = 0$ at $z = z_0$

  The Laplace Approximation - II
    Taylor expansion of $\ln f$ around $z_0$ is then
      $$\ln f(z) \approx \ln f(z_0) - \frac{1}{2} A (z - z_0)^2$$
    where
      $$A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}$$
    Taking the exponential
      $$f(z) \approx f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
    which can be normalized to
      $$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
    The extension to multivariate distributions is straightforward (see book)
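
The recipe (find the mode, measure the curvature of ln f there, read off a Gaussian) can be sketched for a one-dimensional case. The target f(z) = z^4 exp(-z) is an illustrative choice, not from the lecture; it has mode z_0 = 4 and A = 1/4, which makes the result easy to check by hand. The derivatives are taken by finite differences and the mode is located with Newton iterations, both of which are assumptions of this sketch.

```python
import math

def laplace_1d(log_f, z_init, h=1e-4, steps=50):
    """Fit q(z) = N(z | z0, 1/A) to an unnormalised density f, given ln f."""
    def d1(z):
        return (log_f(z + h) - log_f(z - h)) / (2.0 * h)

    def d2(z):
        return (log_f(z + h) - 2.0 * log_f(z) + log_f(z - h)) / h**2

    z = z_init
    for _ in range(steps):            # Newton iterations towards the mode z0
        z -= d1(z) / d2(z)
    return z, -d2(z)                  # mode z0 and curvature A

def q(z, z0, A):
    """Normalised Gaussian approximation with mean z0 and precision A."""
    return math.sqrt(A / (2.0 * math.pi)) * math.exp(-0.5 * A * (z - z0) ** 2)

def log_f(z):
    return 4.0 * math.log(z) - z      # ln of f(z) = z^4 exp(-z), for z > 0

z0, A = laplace_1d(log_f, z_init=3.0)
print(z0, A, q(z0, z0, A))
```

Note that only ln f is needed, never the normalising integral, which is what makes the approximation usable when the denominator is unknown.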

  Posterior Parameter Distribution
    Back to the Bayesian neural networks
    For an IID dataset with target values $\mathbf{t} = \{t_1, \ldots, t_N\}$ we have
      $$p(\mathbf{t} \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}\left(t_n \mid y(x_n, w), \beta^{-1}\right)$$
    The posterior is then
      $$p(w \mid \mathbf{t}, \alpha, \beta) \propto p(w \mid \alpha)\, p(\mathbf{t} \mid w, \beta)$$
    As usual we have
      $$\ln p(w \mid \mathbf{t}) = -\frac{\alpha}{2} w^T w - \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \text{const}$$

  Posterior Parameter Distribution - II
    We can use the Laplace approximation to estimate the distribution
      $$A = -\nabla^2 \ln p(w \mid \mathbf{t}, \alpha, \beta) = \alpha I + \beta H$$
    The approximation would be
      $$q(w \mid \mathbf{t}) = \mathcal{N}\left(w \mid w_{\text{MAP}}, A^{-1}\right)$$
    In turn we have
      $$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left(t \mid y(x, w_{\text{MAP}}), \sigma^2\right)$$
    where
      $$\sigma^2 = \beta^{-1} + g^T A^{-1} g, \qquad g = \left. \nabla_w y(x, w) \right|_{w = w_{\text{MAP}}}$$
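
To make the predictive variance concrete, the sketch below works through the scalar case, where A and g are plain numbers. It assumes a toy one-weight model y(x, w) = tanh(w x), a small made-up dataset, fixed values of α and β, finite-difference derivatives and a Newton search for w_MAP. None of these choices come from the lecture.

```python
import math

def y(x, w):
    return math.tanh(w * x)           # toy one-weight "network" (an assumption)

def neg_log_post(w, xs, ts, alpha, beta):
    """-ln p(w | t) up to a constant: alpha/2 w^2 + beta/2 sum (y - t)^2."""
    sse = sum((y(x, w) - t) ** 2 for x, t in zip(xs, ts))
    return 0.5 * alpha * w * w + 0.5 * beta * sse

def diff1(f, z, h=1e-4):
    return (f(z + h) - f(z - h)) / (2.0 * h)

def diff2(f, z, h=1e-4):
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / h**2

def laplace_predictive(x_new, xs, ts, alpha, beta, w=0.5, steps=100):
    """Return (predictive mean, predictive variance, w_MAP) for input x_new."""
    def f(v):
        return neg_log_post(v, xs, ts, alpha, beta)

    for _ in range(steps):                       # Newton search for w_MAP
        w -= diff1(f, w) / diff2(f, w)
    A = diff2(f, w)                              # A = alpha + beta * H (scalar case)
    g = diff1(lambda v: y(x_new, v), w)          # g = dy/dw evaluated at w_MAP
    return y(x_new, w), 1.0 / beta + g * g / A, w

xs = [-1.0, -0.5, 0.5, 1.0]                      # made-up inputs
ts = [-0.70, -0.35, 0.40, 0.65]                  # made-up noisy targets
for x_new in (0.0, 0.75):
    print(x_new, laplace_predictive(x_new, xs, ts, alpha=0.1, beta=25.0))
```

The variance has two parts: the noise term 1/β, which is the same everywhere, and the term g A^{-1} g, which reflects uncertainty in w and vanishes where the output does not depend on w (here at x = 0).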

  Optimization of Hyper-parameters
    How do we estimate $\alpha$ and $\beta$?
    We can consider the problem
      $$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid w, \beta)\, p(w \mid \alpha)\, dw$$
    From linear regression we have the eigenvalue decomposition
      $$\beta H u_i = \lambda_i u_i$$
    where $H$ is the Hessian of the error $E$
    As with regression we have
      $$\alpha = \frac{\gamma}{w_{\text{MAP}}^T w_{\text{MAP}}}$$
    where $\gamma$ is the effective rank of the Hessian
    Similarly, $\beta$ can be derived to be
      $$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ y(x_n, w_{\text{MAP}}) - t_n \}^2$$
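
The two update formulas can be sketched as a single re-estimation step. The slide does not spell out how γ is computed; the sketch assumes the standard evidence-framework definition γ = Σ_i λ_i / (α + λ_i), with λ_i the eigenvalues of βH. The eigenvalues, weights and residuals below are made-up numbers, and the function names are hypothetical.

```python
def effective_parameters(eigenvalues, alpha):
    """gamma = sum_i lambda_i / (alpha + lambda_i), lambda_i eigenvalues of beta*H.

    This definition is an assumption taken from the standard evidence framework.
    """
    return sum(lam / (alpha + lam) for lam in eigenvalues)

def update_hyperparameters(eigenvalues, alpha, w_map, residuals):
    """One re-estimation step for alpha and beta at a fixed w_MAP.

    residuals holds y(x_n, w_MAP) - t_n for every data point.
    """
    gamma = effective_parameters(eigenvalues, alpha)
    n = len(residuals)
    new_alpha = gamma / sum(w * w for w in w_map)
    new_beta = (n - gamma) / sum(r * r for r in residuals)   # from 1/beta = SSE / (N - gamma)
    return new_alpha, new_beta, gamma

# Made-up values: one well-determined direction, one borderline, one poorly determined.
eigenvalues = [50.0, 0.5, 0.01]
w_map = [1.0, -0.5, 0.1]
residuals = [0.1, -0.2, 0.05, 0.15, -0.1]
new_alpha, new_beta, gamma = update_hyperparameters(eigenvalues, 0.5, w_map, residuals)
print(new_alpha, new_beta, gamma)
```

Directions with λ_i much larger than α contribute almost 1 to γ, and directions with λ_i much smaller than α contribute almost 0. Since the eigenvalues and w_MAP themselves depend on α and β, a step like this would typically be alternated with re-optimizing w.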

  Bayesian Neural Networks
    Modelling of system as a probabilistic generator
    Use standard techniques to generate $w_{\text{MAP}}$
    We can in addition generate estimates for the precision/variance

  Summary
    With neural nets we have a general functional estimator
    Can be applied both for regression and discrimination
    The basis functions can be a broad set of functions
    NNs can also be used for estimation of mixture systems
    Estimation of probability distributions is also possible for Gaussians (approximation with $w_{\text{MAP}}$, $\beta$)
    Neural nets are a rich area with a long history


