T-61.182 Information Theory and Machine Learning
38. Introduction to Neural Networks
40. Capacity of a Single Neuron
41. Learning as Inference
Presented by Yang, Zhi-rong on 22 April 2004
Contents
Introduction to Neural Networks
– Memories
– Terminology
Capacity of a Single Neuron
– Neural network learning as communication
– The capacity of a single neuron
– Counting threshold functions
Learning as Inference
– Neural network learning as inference
– Beyond optimization: making predictions
– Implementation by Monte Carlo method
– Implementation by Gaussian approximations
Memories
Address-based memory scheme
– not associative
– not robust or fault-tolerant
– not distributed
Biological memory systems
– content addressable
– error-tolerant and robust
– parallel and distributed
Terminology
– Architecture
– Activity rule
– Learning rule
– Supervised neural networks
– Unsupervised neural networks
NN learning as communication
1. Obtain adapted weights:
   { t_n }_{n=1}^N
          ↓
   { x_n }_{n=1}^N  →  Learning algorithm  →  w
2. Communication:
   { x_n }_{n=1}^N,  w  →  { \hat{t}_n }_{n=1}^N
The capacity of a single neuron
General position
Definition 1: A set of points \{ x_n \} in K-dimensional space is in general position if any subset of size \le K is linearly independent.
The linear threshold function:
    y = f\left( \sum_{k=1}^{K} w_k x_k \right), \qquad f(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0 \end{cases}
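As a concrete illustration (my own sketch, not part of the slides), the threshold function above can be evaluated in a few lines of Octave; the weights and the input point below are arbitrary choices.

    w = [1; -1];                 % hypothetical weights, K = 2
    x = [2; 1];                  % one input point
    a = w' * x;                  % activation a = sum_k w_k x_k
    y = double(a > 0);           % f(a) = 1 if a > 0, 0 otherwise
    disp(y)                      % prints 1 for this particular w and x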
Counting threshold functions
Denote by T(N, K) the number of distinct threshold functions on N points in general position in K dimensions. In this section the author derives a formula for T(N, K).
To start with, let us work out a few cases by hand:
– K = 1: for any N, T(N, 1) = 2
– N = 1: for any K, T(1, K) = 2
– K = 2: T(N, 2) = 2N (for example, the XOR labelling of four points is unrealizable)
Counting threshold functions
Final result:
    T(N, K) = \begin{cases} 2^N & K \ge N \\ 2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K < N \end{cases}
Vapnik-Chervonenkis dimension (VC dimension)
[Figure: T(N, K) / 2^N plotted against N/K]
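A small Octave sketch (mine, not from the slides) that evaluates this final result and checks it against the hand-worked cases; the values of N and K below are arbitrary.

    N = 5;  K = 2;                               % example values, chosen arbitrarily
    if K >= N
      T = 2^N;                                   % every labelling of the N points is realizable
    else
      T = 2 * sum(arrayfun(@(k) nchoosek(N - 1, k), 0:K-1));
    endif
    printf("T(%d, %d) = %d\n", N, K, T)          % prints T(5, 2) = 10, matching T(N, 2) = 2N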
NN learning as inference
Objective function to be minimized:
    M(w) = G(w) + \alpha E_W(w)
with error function
    G(w) = -\sum_n \left[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln(1 - y(x^{(n)}; w)) \right]
and regularizer
    E_W(w) = \frac{1}{2} \sum_i w_i^2
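For concreteness, here is a minimal Octave sketch of these quantities for a single sigmoid neuron y(x; w) = 1/(1 + e^{-w^T x}); the toy data, weights and \alpha are my own placeholders, not from the slides.

    X = [0 0; 0 1; 1 0; 1 1]';      % toy inputs, one column per example (placeholder data)
    t = [0 0 0 1];                  % toy targets t^(n)
    w = [0.5; -0.3];                % current weights
    alpha = 0.01;                   % weight-decay constant
    a = w' * X;                     % activations a^(n) = w' x^(n)
    y = 1 ./ (1 + exp(-a));         % outputs y(x^(n); w)
    G  = -sum(t .* log(y) + (1 - t) .* log(1 - y));   % cross-entropy error G(w)
    EW = 0.5 * sum(w.^2);           % regularizer E_W(w)
    M  = G + alpha * EW;            % objective M(w) to be minimized
    printf("G = %.4f, E_W = %.4f, M = %.4f\n", G, EW, M)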
NN learning as inference
Finally,
    P(w | D, \alpha) = \frac{P(D | w) P(w | \alpha)}{P(D | \alpha)}                              (1)
                     = \frac{e^{-G(w)} \, e^{-\alpha E_W(w)} / Z_W(\alpha)}{P(D | \alpha)}        (2)
                     = \frac{1}{Z_M} \exp(-M(w))                                                  (3)
NN learning as inference
Denote y(w; x) \equiv P(t = 1 | x, w). Then
    P(t | x, w) = y^t (1 - y)^{1-t} = \exp\left[ t \ln y + (1 - t) \ln(1 - y) \right]
The likelihood can be expressed in terms of the error function:
    P(D | w) = \exp[-G(w)]
Similarly for the regularizer:
    P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)
Making predictions
Over-confident prediction (example)
[Figure: a toy data set with query points A and B, illustrating that predictions based on a single optimized w can be over-confident away from the training data]
Bayesian prediction: marginalizing
Take the whole posterior ensemble into account:
    P(t^{(N+1)} | x^{(N+1)}, D, \alpha) = \int d^K w \; P(t^{(N+1)} | x^{(N+1)}, w, \alpha) \, P(w | D, \alpha)
Try to find a way of computing the integral:
    P(t^{(N+1)} = 1 | x^{(N+1)}, D, \alpha) = \int d^K w \; P(t^{(N+1)} = 1 | x^{(N+1)}, w, \alpha) \, \frac{1}{Z_M} \exp(-M(w))
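One way to approximate this integral is a simple Monte Carlo average over posterior samples of w, such as those produced by the Langevin method on the next slide. The sketch below is my own and uses random stand-in samples just to show the shape of the computation.

    R = 100;
    wsamples = randn(2, R);            % stand-in for R posterior samples of w (K = 2)
    xnew = [1; 0];                     % hypothetical test input x^(N+1)
    ysum = 0;
    for r = 1:R
      a = wsamples(:, r)' * xnew;      % activation under sample w^(r)
      ysum = ysum + 1 / (1 + exp(-a)); % y = P(t^(N+1) = 1 | x^(N+1), w^(r))
    endfor
    pred = ysum / R;                   % Monte Carlo estimate of P(t^(N+1) = 1 | x^(N+1), D, alpha)
    printf("predictive probability approx %.3f\n", pred)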
The Langevin Monte Carlo Method
g = gradM(w);                 % gradient of the objective at the current w
M = findM(w);                 % current value of the objective M(w)
for l = 1:L
  p = randn(size(w));         % draw a fresh Gaussian momentum
  H = p'*p/2 + M;             % Hamiltonian at the current state
  p = p - epsilon*g/2;        % half-step in p
  wnew = w + epsilon*p;       % full step in w
  gnew = gradM(wnew);
  p = p - epsilon*gnew/2;     % second half-step in p
  Mnew = findM(wnew);
  Hnew = p'*p/2 + Mnew;       % Hamiltonian at the proposed state
  dH = Hnew - H;
  if (dH < 0 || rand() < exp(-dH))   % Metropolis accept/reject
    g = gnew;  w = wnew;  M = Mnew;
  endif
endfor
The Langevin Monte Carlo Method
'Gradient descent with added noise':
    \Delta w = -\frac{1}{2} \epsilon^2 g + \epsilon p
Speedup by Hamiltonian Monte Carlo (replace the single leapfrog step by Tau steps):
wnew = w;  gnew = g;
for tau = 1:Tau
  p = p - epsilon*gnew/2;     % half-step in p
  wnew = wnew + epsilon*p;    % full step in w
  gnew = gradM(wnew);
  p = p - epsilon*gnew/2;     % second half-step in p
endfor
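The samplers above assume helper functions findM and gradM; the slides do not define them, so the following is a hedged sketch for the single sigmoid neuron with M(w) = G(w) + \alpha E_W(w), using placeholder toy data.

    X = [0 0; 0 1; 1 0; 1 1]';   % toy inputs, one column per example (placeholders)
    t = [0 0 0 1];               % toy targets
    alpha = 0.01;                % weight-decay constant
    sigm  = @(a) 1 ./ (1 + exp(-a));
    findM = @(w) -sum(t .* log(sigm(w'*X)) + (1 - t) .* log(1 - sigm(w'*X))) ...
                 + alpha * 0.5 * sum(w.^2);                 % M(w) = G(w) + alpha E_W(w)
    gradM = @(w) X * (sigm(w'*X) - t)' + alpha * w;         % dG/dw = sum_n (y^(n) - t^(n)) x^(n)
    % With these handles defined, the sampler can be run after initializing, e.g.:
    % w = zeros(2, 1);  epsilon = 0.1;  L = 1000;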
Gaussian approximations
Taylor-expand M(w):
    M(w) \simeq M(w_{MP}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP}) + \cdots
with Hessian matrix
    A_{ij} \equiv \left. \frac{\partial^2 M(w)}{\partial w_i \partial w_j} \right|_{w = w_{MP}}
The Gaussian approximation is defined as
    Q(w; w_{MP}, A) = \left[ \det(A / 2\pi) \right]^{1/2} \exp\left( -\frac{1}{2} (w - w_{MP})^T A (w - w_{MP}) \right)
Gaussian approximations
The second derivative of M(w) with respect to w is given by
    \frac{\partial^2 M(w)}{\partial w_i \partial w_j} = \sum_{n=1}^{N} f'(a^{(n)}) \, x_i^{(n)} x_j^{(n)} + \alpha \delta_{ij}
where
    f(a) \equiv \frac{1}{1 + e^{-a}}, \qquad a^{(n)} = \sum_j w_j x_j^{(n)}
Gaussian approximations
    P(a | x, D, \alpha) = \text{Normal}(a_{MP}, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{(a - a_{MP})^2}{2 s^2} \right)
where a_{MP} = a(x; w_{MP}) and s^2 = x^T A^{-1} x.
Gaussian approximations
Therefore the marginalized output is
    P(t = 1 | x, D, \alpha) = \psi(a_{MP}, s^2) \equiv \int da \, f(a) \, \text{Normal}(a_{MP}, s^2)
A further approximation can be applied:
    \psi(a_{MP}, s^2) \simeq \phi(a_{MP}, s^2) \equiv f(\kappa(s) \, a_{MP})
where
    \kappa(s) = 1 / \sqrt{1 + \pi s^2 / 8}
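Putting the last few slides together, here is a hedged Octave sketch of the moderated prediction f(\kappa(s) a_{MP}); the toy data and the stand-in w_{MP} are my own placeholders (in practice w_{MP} would come from minimizing M(w)).

    X = [0 0; 0 1; 1 0; 1 1]';  t = [0 0 0 1];  alpha = 0.01;   % placeholder data
    wMP = [2; 2];                               % stand-in for the MAP weights w_MP
    xq  = [3; 3];                               % query input, far from the data
    a = wMP' * X;                               % activations a^(n)
    y = 1 ./ (1 + exp(-a));                     % f(a^(n)); for the sigmoid, f'(a) = y (1 - y)
    A = X * diag(y .* (1 - y)) * X' + alpha * eye(length(wMP));  % Hessian from the formula above
    aMP = wMP' * xq;                            % a_MP = a(x; w_MP)
    s2  = xq' * (A \ xq);                       % s^2 = x' A^{-1} x
    kappa = 1 / sqrt(1 + pi * s2 / 8);          % kappa(s)
    printf("f(a_MP) = %.3f, f(kappa a_MP) = %.3f\n", ...
           1 / (1 + exp(-aMP)), 1 / (1 + exp(-kappa * aMP)))   % over-confident vs moderated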
Exercises
– Practice on counting threshold functions: Ex. 40.6 (page 490)
– Prove the approximation to the Hessian matrix: Ex. 41.1 (page 501)