Chapter 7. Neural Networks
Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
Email: weip@biostat.umn.edu
PubH 7475/8475 © Wei Pan
Introduction
◮ Chapter 11; only feed-forward NNs are covered. Related to projection pursuit regression: f(x) = Σ_{m=1}^M g_m(w_m' x), where each w_m is a vector of weights and g_m is a smooth nonparametric function, both to be estimated.
◮ Two high waves: the 1960s and the late 1980s-90s.
◮ A biological neuron vs an artificial neuron (perceptron). Google: images biological neural network tutorial.
  Minsky & Papert's (1969) XOR problem: XOR(X_1, X_2) = 1 if X_1 ≠ X_2; = 0 otherwise, with X_1, X_2 ∈ {0, 1}. Perceptron: f = I(α_0 + α'X > 0) cannot reproduce XOR, since the two classes are not linearly separable.
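As a quick illustration of the XOR problem (my own R sketch, not part of the original slides; the weight grid is an arbitrary choice): a brute-force search finds no single perceptron f = I(α_0 + α_1 X_1 + α_2 X_2 > 0) that reproduces XOR.

X <- expand.grid(X1 = c(0, 1), X2 = c(0, 1))
y <- as.integer(X$X1 != X$X2)                      # XOR(X1, X2)
grid <- expand.grid(a0 = seq(-2, 2, by = 0.25),    # candidate weights (arbitrary grid)
                    a1 = seq(-2, 2, by = 0.25),
                    a2 = seq(-2, 2, by = 0.25))
matches <- apply(grid, 1, function(a) {
  f <- as.integer(a["a0"] + a["a1"] * X$X1 + a["a2"] * X$X2 > 0)
  all(f == y)                                      # does this perceptron reproduce XOR?
})
sum(matches)                                       # 0: no perceptron on the grid works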
◮ McCulloch & Pitts model (1943): n_j(t) = I(Σ_{i→j} w_{ij} n_i(t−1) > θ_j). w_{ij} can be > 0 (excitatory) or < 0 (inhibitory).
◮ Feldman's (1985) "one hundred step program": at most about 100 sequential steps fit within a human reaction time, since a human can recognize another person in ~100 ms while a single neuron takes ~1 ms to fire. ⇒ the human brain works in a massively parallel and distributed way.
◮ Cognitive science: human vision is performed in a series of layers in the brain.
◮ Humans can learn.
◮ Hebb (1949) model: w_{ij} ← w_{ij} + η y_i y_j, reinforcing learning by simultaneous activations.
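A tiny R sketch of the Hebb rule above (my own illustration, not course code): one update w_ij ← w_ij + η y_i y_j strengthens the connections between units that are active together.

hebb_update <- function(w, y, eta = 0.1) {
  w + eta * outer(y, y)          # outer(y, y)[i, j] = y_i * y_j
}
w <- matrix(0, 3, 3)             # 3 units, all weights start at 0
y <- c(1, 1, 0)                  # units 1 and 2 fire together; unit 3 is silent
w <- hebb_update(w, y)           # w[1, 2] and w[2, 1] increase; weights touching unit 3 do not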
Feed-forward NNs
◮ Fig 11.2
◮ Input: X.
◮ A hidden layer (or layers): for m = 1, ..., M, Z_m = σ(α_{0m} + α_m' X), Z = (Z_1, ..., Z_M)'. e.g. σ(v) = 1/(1 + exp(−v)), the sigmoid (inverse logit) function.
◮ Output: f_1(X), ..., f_K(X). T_k = β_{0k} + β_k' Z, T = (T_1, ..., T_K)', f_k(X) = g_k(T). e.g. regression: g_k(T) = T_k; classification: g_k(T) = exp(T_k) / Σ_{j=1}^K exp(T_j), the softmax (multinomial inverse logit) function.
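A minimal R sketch of this forward pass (my own code in the slide's notation; the matrix shapes are my assumption): sigmoid hidden units Z, linear scores T, and a softmax output for classification.

sigmoid <- function(v) 1 / (1 + exp(-v))

forward <- function(X, alpha0, alpha, beta0, beta) {
  ## X: n x p inputs; alpha: M x p, alpha0: length M; beta: K x M, beta0: length K
  Z <- sigmoid(sweep(X %*% t(alpha), 2, alpha0, "+"))   # Z_m = sigma(alpha_0m + alpha_m' X)
  Tmat <- sweep(Z %*% t(beta), 2, beta0, "+")           # T_k = beta_0k + beta_k' Z
  expT <- exp(Tmat - apply(Tmat, 1, max))               # subtract row max for numerical stability
  expT / rowSums(expT)                                  # f_k(X) = exp(T_k) / sum_j exp(T_j)
}

For regression, one would instead return Tmat directly (g_k(T) = T_k).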
Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 11.
FIGURE 11.2. Schematic of a single hidden layer, feed-forward neural network. [Figure: inputs X_1, ..., X_p feed hidden units Z_1, ..., Z_M, which feed outputs Y_1, ..., Y_K.]
Elements of Statistical Learning (2nd Ed.) © Hastie, Tibshirani & Friedman 2009, Chap 11.
FIGURE 11.3. Plot of the sigmoid function σ(v) = 1/(1 + exp(−v)) (red curve), commonly used in the hidden layer of a neural network. Included are σ(sv) for s = 1/2 (blue curve) and s = 10 (purple curve). The scale parameter s controls the activation rate, and we can see that large s amounts to a hard activation at v = 0. Note that σ(s(v − v_0)) shifts the activation threshold from 0 to v_0.
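A few lines of R reproduce the idea of Figure 11.3 (a sketch, not the book's code): σ(sv) for several activation rates s.

sigma <- function(v) 1 / (1 + exp(-v))
curve(sigma(x), from = -10, to = 10, col = "red",
      xlab = "v", ylab = "sigma(s v)")                  # s = 1
curve(sigma(0.5 * x), add = TRUE, col = "blue")         # s = 1/2: softer activation
curve(sigma(10 * x), add = TRUE, col = "purple")        # s = 10: nearly a hard threshold at v = 0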
◮ How to fit the model?
◮ Given training data: (Y_i, X_i), i = 1, ..., n.
◮ For regression, minimize R(θ) = Σ_{k=1}^K Σ_{i=1}^n (Y_{ik} − f_k(X_i))^2.
◮ For classification, minimize R(θ) = −Σ_{k=1}^K Σ_{i=1}^n Y_{ik} log f_k(X_i), and classify by G(x) = arg max_k f_k(x).
◮ Can use other loss functions.
◮ How to minimize R(θ)? Gradient descent, called back-propagation (§11.4). Very popular and appealing! (recall the Hebb model)
◮ Other algorithms: Newton's, conjugate-gradient, ...
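The two criteria and the classification rule above, written out in R (a sketch under my own conventions: Y is an n x K 0/1 indicator matrix and f is the n x K matrix of fitted values):

sse_loss  <- function(Y, f) sum((Y - f)^2)                  # regression: sum of squared errors
xent_loss <- function(Y, f) -sum(Y * log(pmax(f, 1e-12)))   # classification: cross-entropy, guarded log
classify  <- function(f) max.col(f)                         # G(x) = arg max_k f_k(x)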
Back-propagation algorithm
◮ Given: training data (Y_i, X_i), i = 1, ..., n.
◮ Goal: estimate the α's and β's. Consider R(θ) = Σ_i Σ_k (Y_{ik} − f_k(X_i))^2 := Σ_i R_i.
◮ Denote Z_{mi} = σ(α_{0m} + α_m' X_i) and Z_i = (Z_{1i}, ..., Z_{Mi})'. Then
  ∂R_i/∂β_{km} = −2 (Y_{ik} − f_k(X_i)) g_k'(β_k' Z_i) Z_{mi} := δ_{ki} Z_{mi},
  ∂R_i/∂α_{ml} = −Σ_k 2 (Y_{ik} − f_k(X_i)) g_k'(β_k' Z_i) β_{km} σ'(α_m' X_i) X_{il} := s_{mi} X_{il},
  where δ_{ki} and s_{mi} are "errors" from the current model.
◮ Update at step r+1:
  β_{km}^{(r+1)} = β_{km}^{(r)} − γ_r Σ_i ∂R_i/∂β_{km} |_{(β^{(r)}, α^{(r)})},
  α_{ml}^{(r+1)} = α_{ml}^{(r)} − γ_r Σ_i ∂R_i/∂α_{ml} |_{(β^{(r)}, α^{(r)})},
  where γ_r is the learning rate; it can be fixed or selected by a line search.
◮ Training epoch: one cycle of updates through the training set.
◮ +: simple and intuitive; −: slow.
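A compact R sketch of one back-propagation update for the squared-error case with identity outputs g_k(T) = T_k (so g_k' = 1); this is my own implementation of the formulas above, not code from the course, and it sums the gradients over the whole training sample before updating.

backprop_step <- function(X, Y, alpha0, alpha, beta0, beta, gamma = 0.01) {
  sigmoid <- function(v) 1 / (1 + exp(-v))
  Z <- sigmoid(sweep(X %*% t(alpha), 2, alpha0, "+"))   # Z_mi: n x M hidden activations
  f <- sweep(Z %*% t(beta), 2, beta0, "+")              # f_k(X_i) with identity output
  delta <- -2 * (Y - f)                                 # delta_ki: output-layer "errors"
  s <- (delta %*% beta) * Z * (1 - Z)                   # s_mi = sum_k delta_ki beta_km sigma'(.)
  ## gradient-descent updates with learning rate gamma (gamma_r on the slide)
  beta_new   <- beta   - gamma * t(delta) %*% Z         # dR/dbeta_km  = sum_i delta_ki Z_mi
  beta0_new  <- beta0  - gamma * colSums(delta)
  alpha_new  <- alpha  - gamma * t(s) %*% X             # dR/dalpha_ml = sum_i s_mi X_il
  alpha0_new <- alpha0 - gamma * colSums(s)
  list(alpha0 = alpha0_new, alpha = alpha_new, beta0 = beta0_new, beta = beta_new)
}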
Some issues
◮ Starting values: many local minima exist. Multiple random starts; model averaging, ...
◮ Over-fitting? Old days: adding more and more units and hidden layers ...
  Early stopping!
  Regularization: add a penalty term, e.g. ridge: minimize R(θ) + λ J(θ) with J(θ) = Σ_{km} β_{km}^2 + Σ_{ml} α_{ml}^2; called weight decay; Fig 11.4.
◮ Performance: Figs 11.6-11.8.
◮ Example code: ex7.1.r (a generic sketch with nnet follows below).
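ex7.1.r itself is not reproduced here; the following is a hedged sketch of the kind of fit it presumably illustrates, using the standard R package nnet, where size is the number of hidden units M and decay is the weight-decay penalty λ.

library(nnet)
set.seed(1)                                    # results depend on random starting weights
fit <- nnet(Species ~ ., data = iris,
            size = 5,                          # M = 5 hidden units
            decay = 0.01,                      # weight decay (ridge penalty lambda)
            maxit = 500, trace = FALSE)
table(predicted = predict(fit, iris, type = "class"), truth = iris$Species)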