Chapter 7. Neural Networks
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475
© Wei Pan
Introduction
◮ Chapter 11; only focus on feedforward NNs. Related to projection pursuit regression: f(x) = Σ_{m=1}^M g_m(w_m' x), where each w_m is a vector of weights and g_m is a smooth nonparametric function; both to be estimated. really?
◮ Here: + CNNs; later recurrent NNs (for sequence data). No autoencoders (unsupervised) ...? Goodfellow, Bengio, Courville (2016). Deep Learning. http://www.deeplearningbook.org/
◮ Two high waves: in the 1960s and in the late 1980s-90s.
◮ McCulloch & Pitts model (1943): n_j(t) = I(Σ_{i→j} w_ij n_i(t−1) > θ_j). w_ij can be > 0 (excitatory) or < 0 (inhibitory).
◮ A biological neuron vs an artificial neuron (perceptron). Google: images biological neural network tutorial
◮ Minsky & Papert's (1969) XOR problem: XOR(X_1, X_2) = 1 if X_1 ≠ X_2; = 0 o/w. X_1, X_2 ∈ {0, 1}. Perceptron: f = I(α_0 + α'X > 0).
◮ Feldman's (1985) "one hundred step program": at most 100 steps within a human reaction time, because a human can recognize another person in 100 ms, while the processing time of a single neuron is about 1 ms. ⇒ the human brain works in a massively parallel and distributed way.
◮ Cognitive science: human vision is performed in a series of layers in the brain.
◮ Humans can learn.
◮ Hebb (1949) model: w_ij ← w_ij + η y_i y_j, reinforcing learning by simultaneous activations.
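To make the XOR problem concrete, here is a minimal Python sketch (not from the slides; the weights are hand-picked for illustration): a single perceptron f = I(α_0 + α'X > 0) cannot represent XOR, but one hidden layer of two threshold units can.

```python
# Hand-set two-layer threshold network computing XOR.
# All weights below are illustrative, chosen by hand.

def step(v):
    """Perceptron activation I(v > 0)."""
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR gate
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: AND gate
    return step(h1 - h2 - 0.5)  # output: OR and not AND = XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))
```

No single linear threshold on (X_1, X_2) separates {(0,1), (1,0)} from {(0,0), (1,1)}; the hidden layer makes the classes linearly separable.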
Feed-forward NNs
◮ Fig 11.2
◮ Input: X
◮ A (hidden) layer: for m = 1, ..., M, Z_m = σ(α_0m + α_m' X), Z = (Z_1, ..., Z_M)'.
activation function: σ(v) = 1/(1 + exp(−v)), sigmoid (or logit^{−1}); Q: what is each Z_m?
hyperbolic tangent: tanh(v) = 2σ(2v) − 1.
◮ ...(may have multiple (hidden) layers)...
◮ Output: f_1(X), ..., f_K(X). T_k = β_0k + β_k' Z, T = (T_1, ..., T_K)', f_k(X) = g_k(T).
regression: g_k(T) = T_k; classification: g_k(T) = exp(T_k) / Σ_{j=1}^K exp(T_j); the softmax or multi-logit^{−1} function.
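The layer computations above can be sketched in a few lines of NumPy (illustrative only: the dimensions p, M, K and the random weights are made up, not from the slides):

```python
import numpy as np

def sigmoid(v):
    # sigma(v) = 1 / (1 + exp(-v))
    return 1.0 / (1.0 + np.exp(-v))

def softmax(t):
    e = np.exp(t - t.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
p, M, K = 4, 3, 2                   # illustrative: inputs, hidden units, classes
alpha0 = rng.normal(size=M)
alpha = rng.normal(size=(M, p))
beta0 = rng.normal(size=K)
beta = rng.normal(size=(K, M))

X = rng.normal(size=p)
Z = sigmoid(alpha0 + alpha @ X)     # hidden layer: Z_m = sigma(a_0m + a_m' X)
T = beta0 + beta @ Z                # output scores: T_k = b_0k + b_k' Z
f = softmax(T)                      # classification: g_k(T) = softmax
```

For regression, one would simply take f = T (identity output g_k(T) = T_k) instead of the softmax.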
[Figure from Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 11.] FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network: inputs X_1, ..., X_p, hidden units Z_1, ..., Z_M, outputs Y_1, ..., Y_K.
[Figure from Elements of Statistical Learning (2nd Ed.), Hastie, Tibshirani & Friedman 2009, Chap 11.] FIGURE 11.3. Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate; large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.
◮ How to fit the model?
◮ Given training data: (Y_i, X_i), i = 1, ..., n.
◮ For regression, minimize R(θ) = Σ_{k=1}^K Σ_{i=1}^n (Y_ik − f_k(X_i))².
◮ For classification, minimize R(θ) = −Σ_{k=1}^K Σ_{i=1}^n Y_ik log f_k(X_i). And G(x) = arg max_k f_k(x).
◮ Can use other loss functions.
◮ How to minimize R(θ)? Gradient descent, called back-propagation (§11.4). Very popular and appealing! recall the Hebb model
◮ Other algorithms: Newton's, conjugate-gradient, ...
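The two training criteria can be written out on a tiny made-up example (the one-hot labels and fitted values below are illustrative only):

```python
import numpy as np

# n = 2 examples, K = 2 classes; Y_ik is one-hot, f_k(X_i) are network outputs.
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
f = np.array([[0.8, 0.2],
              [0.3, 0.7]])

R_sq = np.sum((Y - f) ** 2)     # regression criterion: sum of squared errors
R_ce = -np.sum(Y * np.log(f))   # classification criterion: cross-entropy
G = f.argmax(axis=1)            # classifier G(x) = arg max_k f_k(x)
```

Here R_sq = 0.2² + 0.2² + 0.3² + 0.3² = 0.26 and G classifies both examples correctly.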
Back-propagation algorithm
◮ Given: training data (Y_i, X_i), i = 1, ..., n.
◮ Goal: estimate the α's and β's. Consider R(θ) = Σ_i Σ_k (Y_ik − f_k(X_i))² := Σ_i R_i := Σ_i Σ_k r_ki², with residuals r_ki = Y_ik − f_k(X_i).
◮ NN: input X_i, output (f_1(X_i), ..., f_K(X_i))'.
Z_mi = σ(α_0m + α_m' X_i), Z_i = (Z_1i, ..., Z_Mi)',
T_ki = β_0k + β_k' Z_i, T_i = (T_1i, ..., T_Ki)', f_k(X_i) = g_k(T_i) = T_ki.
◮ Chain rule:
∂R_i/∂β_km = (∂R_i/∂r_ki)(∂r_ki/∂g_k)(∂g_k/∂T_ki)(∂T_ki/∂β_km)
= −2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) Z_mi := δ_ki Z_mi,
Back-propagation algorithm (cont'ed)
◮ Similarly,
∂R_i/∂α_ml = Σ_k (∂R_i/∂r_ki)(∂r_ki/∂g_k)(∂g_k/∂T_ki)(∂T_ki/∂Z_mi)(∂Z_mi/∂α_ml)
= −Σ_k 2(Y_ik − f_k(X_i)) g_k'(β_k' Z_i) β_km σ'(α_m' X_i) X_il := s_mi X_il,
where δ_ki, s_mi are "errors" from the current model.
◮ Update at step r+1:
β_km^(r+1) = β_km^(r) − γ_r Σ_i ∂R_i/∂β_km |_{β^(r), α^(r)},
α_ml^(r+1) = α_ml^(r) − γ_r Σ_i ∂R_i/∂α_ml |_{β^(r), α^(r)}.
γ_r: learning rate; a tuning parameter; can be fixed or selected/decayed. too large/small then ...
◮ training epoch: a cycle of updating through the whole training set
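The gradient formulas above can be sketched in NumPy for one observation with identity output g_k(T) = T_k (regression case). The dimensions and random weights are illustrative only; the analytic gradient is checked against a finite-difference approximation.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(1)
p, M, K = 3, 4, 2                        # illustrative dimensions
a0, A = rng.normal(size=M), rng.normal(size=(M, p))
b0, B = rng.normal(size=K), rng.normal(size=(K, M))
X, Y = rng.normal(size=p), rng.normal(size=K)

def loss(B_):
    """R_i = sum_k (Y_ik - f_k(X_i))^2 as a function of the beta matrix."""
    Z = sigmoid(a0 + A @ X)
    f = b0 + B_ @ Z
    return np.sum((Y - f) ** 2)

# Analytic gradients from the slides (g_k' = 1 for the identity output):
Z = sigmoid(a0 + A @ X)
f = b0 + B @ Z
delta = -2.0 * (Y - f)                   # delta_ki = -2 (Y_ik - f_k(X_i)) g_k'
grad_B = np.outer(delta, Z)              # dR_i/d(beta_km) = delta_ki * Z_mi
s = (B.T @ delta) * Z * (1.0 - Z)        # s_mi = sigma'(.) * sum_k beta_km delta_ki
grad_A = np.outer(s, X)                  # dR_i/d(alpha_ml) = s_mi * X_il

# Finite-difference check on one beta entry:
eps = 1e-6
B2 = B.copy()
B2[0, 0] += eps
num = (loss(B2) - loss(B)) / eps
```

Note how s_mi "back-propagates" the output-layer errors δ_ki through the weights β_km and the local derivative σ'(α_m' X_i) = Z_mi (1 − Z_mi).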
Some issues
◮ Starting values: existence of many local minima and saddle points. Multiple tries; model averaging, ... Data preprocessing: centering at 0 and scaling.
◮ Stochastic gradient descent (SGD): use a minibatch (i.e. a random subset) of the training data for each iteration; minibatch size: 32 or 64 or 128 or ..., a tuning parameter.
◮ +: simple and intuitive; −: slow
◮ Modifications: SGD + Momentum.
SGD: x_{t+1} = x_t − γ∇f(x_t).
SGD+M: v_{t+1} = ρ v_t + ∇f(x_t), x_{t+1} = x_t − γ v_{t+1}.
... (AdaGrad, RMSProp) ... Adam, the default (now!)
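The two update rules above can be compared on a toy 1-D quadratic f(x) = x² (a minimal sketch; the learning rate γ = 0.1, momentum ρ = 0.9, and iteration count are illustrative choices, not from the slides):

```python
# Plain SGD vs SGD with momentum on f(x) = x^2, gradient f'(x) = 2x.

def grad(x):
    return 2.0 * x

gamma, rho = 0.1, 0.9        # illustrative learning rate and momentum
x_sgd, x_mom, v = 5.0, 5.0, 0.0
for _ in range(300):
    # SGD: x_{t+1} = x_t - gamma * grad(x_t)
    x_sgd = x_sgd - gamma * grad(x_sgd)
    # SGD+M: v_{t+1} = rho * v_t + grad(x_t); x_{t+1} = x_t - gamma * v_{t+1}
    v = rho * v + grad(x_mom)
    x_mom = x_mom - gamma * v
```

Both converge to the minimum at 0; with ρ near 1 the momentum iterate overshoots and oscillates before settling, which on ill-conditioned surfaces (unlike this toy one) is what speeds it up along flat directions.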
Some issues (cont'ed)
◮ Over-fitting? Universal Approximation Theorem: if we add more units or layers, then...
1) Early stopping!
2) Regularization: add a penalty term, e.g. ridge; use R(θ) + λJ(θ) with J(θ) = Σ_km β_km² + Σ_ml α_ml²; called weight decay; Fig 11.4. Performance: Figs 11.6-8.
3) Regularization: Dropout: randomly drop a subset/proportion of nodes/units or connections during training; an ensemble; more robust.
◮ A main technical issue with a deep NN: gradients vanishing or exploding. why? use ReLU: f(x) = max(0, x); batch normalization; ....
◮ Transfer learning: reusing trained networks. why? http://jmlr.org/proceedings/papers/v32/donahue14.pdf
◮ Example code: ex7.1.r
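Why gradients vanish with sigmoids: σ'(v) = σ(v)(1 − σ(v)) ≤ 1/4, so a gradient back-propagated through many sigmoid layers shrinks by a factor of at most 1/4 per layer, while ReLU passes a derivative of 1 through its active units. A minimal numerical sketch (the depth and pre-activations are illustrative):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

depth = 20
v = 0.0                                   # pre-activation at each layer (best case!)

# Product of per-layer derivatives along one path through the network:
sig_grad = np.prod([sigmoid(v) * (1.0 - sigmoid(v))] * depth)  # (1/4)^20
relu_grad = 1.0 ** depth                  # ReLU: f'(v) = 1 for v > 0

print(sig_grad)                           # ~ 9e-13: vanishes even at v = 0
```

At v = 0 the sigmoid derivative is at its maximum 1/4, so (1/4)^20 ≈ 9 × 10^−13 is the best case; away from 0 the shrinkage is even worse, which is one reason ReLU and batch normalization help train deep networks.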