Chapter 7. Neural Networks
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475 © Wei Pan
Introduction
◮ Chapter 11; only feed-forward NNs are covered. Related to projection pursuit regression: f(x) = Σ_{m=1}^M g_m(w_m' x), where each w_m is a vector of weights and g_m is a smooth nonparametric function, both to be estimated.
◮ Two high waves: the 1960s and the late 1980s-90s.
◮ A biological neuron vs an artificial neuron (perceptron). Google: images biological neural network tutorial.
  Minsky & Papert's (1969) XOR problem: XOR(X_1, X_2) = 1 if X_1 ≠ X_2; = 0 otherwise, with X_1, X_2 ∈ {0, 1}. Perceptron: f = I(α_0 + α'X > 0) cannot reproduce XOR, since the two classes are not linearly separable.
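As a quick illustration of the XOR problem (my own R sketch, not part of the original slides; the weight grid is an arbitrary choice): a brute-force search finds no single perceptron f = I(α_0 + α_1 X_1 + α_2 X_2 > 0) that reproduces XOR.

X <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))
y <- as.integer(X$X1 != X$X2)                      # XOR(X1, X2)
grid <- expand.grid(a0 = seq(-2, 2, by = 0.25),    # candidate weights (arbitrary grid)
                    a1 = seq(-2, 2, by = 0.25),
                    a2 = seq(-2, 2, by = 0.25))
matches <- apply(grid, 1, function(a) {
  f <- as.integer(a["a0"] + a["a1"] * X$X1 + a["a2"] * X$X2 > 0)
  all(f == y)                                      # does this perceptron reproduce XOR?
})
sum(matches)                                       # 0: no perceptron on the grid works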
◮ McCulloch & Pitts model (1943): n_j(t) = I(Σ_{i→j} w_{ij} n_i(t−1) > θ_j). w_{ij} can be > 0 (excitatory) or < 0 (inhibitory).
◮ Feldman's (1985) "one hundred step program": at most about 100 sequential steps fit within a human reaction time, since a human can recognize another person in ~100 ms while a single neuron takes ~1 ms to fire. ⇒ the human brain works in a massively parallel and distributed way.
◮ Cognitive science: human vision is performed in a series of layers in the brain.
◮ Humans can learn.
◮ Hebb (1949) model: w_{ij} ← w_{ij} + η y_i y_j, reinforcing learning by simultaneous activations.
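A tiny R sketch of the Hebb rule above (my own illustration, not course code): one update w_ij ← w_ij + η y_i y_j strengthens the connections between units that are active together.

hebb_update <- function(w, y, eta = 0.1) {
  w + eta * outer(y, y)          # outer(y, y)[i, j] = y_i * y_j
}
w <- matrix(0, 3, 3)             # 3 units, all weights start at 0
y <- c(1, 1, 0)                  # units 1 and 2 fire together; unit 3 is silent
w <- hebb_update(w, y)           # w[1, 2] and w[2, 1] increase; weights touching unit 3 do not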
Feed-forward NNs
◮ Fig 11.2
◮ Input: X.
◮ A hidden layer (or layers): for m = 1, ..., M, Z_m = σ(α_{0m} + α_m' X), Z = (Z_1, ..., Z_M)'. e.g. σ(v) = 1/(1 + exp(−v)), the sigmoid (inverse logit) function.
◮ Output: f_1(X), ..., f_K(X). T_k = β_{0k} + β_k' Z, T = (T_1, ..., T_K)', f_k(X) = g_k(T). e.g. regression: g_k(T) = T_k; classification: g_k(T) = exp(T_k) / Σ_{j=1}^K exp(T_j), the softmax (multinomial inverse logit) function.
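A minimal R sketch of this forward pass (my own code in the slide's notation; the matrix shapes are my assumption): sigmoid hidden units Z, linear scores T, and a softmax output for classification.

sigmoid <- function(v) 1 / (1 + exp(-v))

forward <- function(X, alpha0, alpha, beta0, beta) {
  ## X: n x p inputs; alpha: M x p, alpha0: length M; beta: K x M, beta0: length K
  Z <- sigmoid(sweep(X %*% t(alpha), 2, alpha0, "+"))   # Z_m = sigma(alpha_0m + alpha_m' X)
  Tmat <- sweep(Z %*% t(beta), 2, beta0, "+")           # T_k = beta_0k + beta_k' Z
  expT <- exp(Tmat - apply(Tmat, 1, max))               # subtract row max for numerical stability
  expT / rowSums(expT)                                  # f_k(X) = exp(T_k) / sum_j exp(T_j)
}

For regression, one would instead return Tmat directly (g_k(T) = T_k).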
Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 11.
FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network. [Figure: inputs X_1, ..., X_p feed hidden units Z_1, ..., Z_M, which feed outputs Y_1, ..., Y_K.]
Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 11.
FIGURE 11.3. Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate, and we can see that large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.
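A few lines of R reproduce the idea of Figure 11.3 (a sketch, not the book's code): σ(sv) for several activation rates s.

sigma <- function(v) 1 / (1 + exp(-v))
curve(sigma(x), from = -10, to = 10, col = "red",
      xlab = "v", ylab = "sigma(s v)")                  # s = 1
curve(sigma(0.5 * x), add = TRUE, col = "blue")         # s = 1/2: softer activation
curve(sigma(10 * x), add = TRUE, col = "purple")        # s = 10: nearly a hard threshold at v = 0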
◮ How to fit the model?
◮ Given training data: (Y_i, X_i), i = 1, ..., n.
◮ For regression, minimize R(θ) = Σ_{k=1}^K Σ_{i=1}^n (Y_{ik} − f_k(X_i))^2.
◮ For classification, minimize R(θ) = −Σ_{k=1}^K Σ_{i=1}^n Y_{ik} log f_k(X_i), and classify by G(x) = arg max_k f_k(x).
◮ Can use other loss functions.
◮ How to minimize R(θ)? Gradient descent, called back-propagation (§11.4). Very popular and appealing! (recall the Hebb model)
◮ Other algorithms: Newton's, conjugate-gradient, ...
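The two criteria and the classification rule above, written out in R (a sketch under my own conventions: Y is an n x K 0/1 indicator matrix and f is the n x K matrix of fitted values):

sse_loss  <- function(Y, f) sum((Y - f)^2)                  # regression: sum of squared errors
xent_loss <- function(Y, f) -sum(Y * log(pmax(f, 1e-12)))   # classification: cross-entropy, guarded log
classify  <- function(f) max.col(f)                         # G(x) = arg max_k f_k(x)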
Back-propagation algorithm
◮ Given: training data (Y_i, X_i), i = 1, ..., n.
◮ Goal: estimate the α's and β's. Consider R(θ) = Σ_i Σ_k (Y_{ik} − f_k(X_i))^2 := Σ_i R_i.
◮ Denote Z_{mi} = σ(α_{0m} + α_m' X_i) and Z_i = (Z_{1i}, ..., Z_{Mi})'. Then
  ∂R_i/∂β_{km} = −2 (Y_{ik} − f_k(X_i)) g_k'(β_k' Z_i) Z_{mi} := δ_{ki} Z_{mi},
  ∂R_i/∂α_{ml} = −Σ_k 2 (Y_{ik} − f_k(X_i)) g_k'(β_k' Z_i) β_{km} σ'(α_m' X_i) X_{il} := s_{mi} X_{il},
  where δ_{ki} and s_{mi} are "errors" from the current model.
◮ Update at step r+1:
  β_{km}^{(r+1)} = β_{km}^{(r)} − γ_r Σ_i ∂R_i/∂β_{km} |_{(β^{(r)}, α^{(r)})},
  α_{ml}^{(r+1)} = α_{ml}^{(r)} − γ_r Σ_i ∂R_i/∂α_{ml} |_{(β^{(r)}, α^{(r)})},
  where γ_r is the learning rate; it can be fixed or selected by a line search.
◮ Training epoch: one cycle of updates through the training set.
◮ +: simple and intuitive; −: slow.
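A compact R sketch of one back-propagation update for the squared-error case with identity outputs g_k(T) = T_k (so g_k' = 1); this is my own implementation of the formulas above, not code from the course, and it sums the gradients over the whole training sample before updating.

backprop_step <- function(X, Y, alpha0, alpha, beta0, beta, gamma = 0.01) {
  sigmoid <- function(v) 1 / (1 + exp(-v))
  Z <- sigmoid(sweep(X %*% t(alpha), 2, alpha0, "+"))   # Z_mi: n x M hidden activations
  f <- sweep(Z %*% t(beta), 2, beta0, "+")              # f_k(X_i) with identity output
  delta <- -2 * (Y - f)                                 # delta_ki: output-layer "errors"
  s <- (delta %*% beta) * Z * (1 - Z)                   # s_mi = sum_k delta_ki beta_km sigma'(.)
  ## gradient-descent updates with learning rate gamma (gamma_r on the slide)
  beta_new   <- beta   - gamma * t(delta) %*% Z         # dR/dbeta_km  = sum_i delta_ki Z_mi
  beta0_new  <- beta0  - gamma * colSums(delta)
  alpha_new  <- alpha  - gamma * t(s) %*% X             # dR/dalpha_ml = sum_i s_mi X_il
  alpha0_new <- alpha0 - gamma * colSums(s)
  list(alpha0 = alpha0_new, alpha = alpha_new, beta0 = beta0_new, beta = beta_new)
}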
Some issues
◮ Starting values: many local minima exist. Multiple random starts; model averaging, ...
◮ Over-fitting? Old days: adding more and more units and hidden layers ...
  Early stopping!
  Regularization: add a penalty term, e.g. ridge: minimize R(θ) + λ J(θ) with J(θ) = Σ_{km} β_{km}^2 + Σ_{ml} α_{ml}^2; called weight decay; Fig 11.4.
◮ Performance: Figs 11.6-11.8.
◮ Example code: ex7.1.r (a generic sketch with nnet follows below).
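ex7.1.r itself is not reproduced here; the following is a hedged sketch of the kind of fit it presumably illustrates, using the standard R package nnet, where size is the number of hidden units M and decay is the weight-decay penalty λ.

library(nnet)
set.seed(1)                                    # results depend on random starting weights
fit <- nnet(Species ~ ., data = iris,
            size = 5,                          # M = 5 hidden units
            decay = 0.01,                      # weight decay (ridge penalty lambda)
            maxit = 500, trace = FALSE)
table(predicted = predict(fit, iris, type = "class"), truth = iris$Species)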