Regression, Layered Neural Networks
- networks of continuous units
- regression problems
- gradient descent, backpropagation of error
- the role of the learning rate
- online learning, stochastic approximation
Of Neurons and Networks

biological neurons (very brief)
- single neurons
- synapses and networks
- synaptic plasticity and learning

simplified description
- inspiration for artificial neural networks

artificial neural networks
- architectures and types of networks:
  recurrent attractor neural networks (associative memory)
  feed-forward neural networks (classification / regression)

Neural Networks 2
Of Neurons and Networks

neurons: highly specialized cells
- cell body (soma)
- incoming dendrites
- branched axon

- many neurons: ≳ 10^12 in the human cortex
- highly connected: ≳ 1000 neighbors
- action potentials / spikes: cells generate electric pulses, which travel along the branches of the axon
- signals pass from the pre-synaptic to the post-synaptic neuron across the synaptic cleft

Neural Networks 3
Of Neurons and Networks

synapses:
- a pre-synaptic pulse arriving at an excitatory / inhibitory synapse triggers / hinders post-synaptic spike generation (transmitter released from vesicles crosses the synaptic cleft and binds to receptors)
- excitatory: the incoming pulse increases the postsynaptic membrane potential
- inhibitory: the incoming pulse decreases the postsynaptic membrane potential
- all-or-nothing response:
  potential exceeds threshold ⇨ postsynaptic neuron fires
  potential is sub-threshold ⇨ postsynaptic neuron rests

Neural Networks 4
Of Neurons and Networks

simplified description of neural activity: firing rates
- replace the train of single spikes (on a ms time scale) by the mean activity S(t), e.g. in spikes / ms

Neural Networks 5
Of Neurons and Networks

(mean) local potential at neuron i (with activity S_i):
weighted sum of incoming activities  Σ_j w_ij S_j

synaptic weights:
w_ij > 0  excitatory synapse
w_ij = 0  no synapse
w_ij < 0  inhibitory synapse

Neural Networks 6
Activation Function

non-linear response:  S_i = h( Σ_j w_ij S_j )

important class of functions: sigmoidal activation
- minimal activity  h(x → −∞) ≡ 0
- maximal activity  h(x → +∞) ≡ 1
- monotonic increase  h'(x) > 0

just one example:
h(x_i) = ½ (1 + tanh[γ (x_i − θ)])   with  x_i = Σ_j w_ij S_j
gain parameter γ, local threshold θ

Neural Networks 7
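As a minimal numerical sketch of this sigmoidal activation (the function name `h` follows the slide; the default parameter values are my own illustrative choice):

```python
import math

def h(x, gamma=1.0, theta=0.0):
    """Sigmoidal activation h(x) = 1/2 * (1 + tanh(gamma * (x - theta)))."""
    return 0.5 * (1.0 + math.tanh(gamma * (x - theta)))

# the three defining properties:
print(h(-100.0))        # ~0: minimal activity for x -> -inf
print(h(+100.0))        # ~1: maximal activity for x -> +inf
print(h(0.5) > h(0.2))  # True: monotonic increase
```

Note that h(θ) = 1/2 for any gain: the threshold θ marks the midpoint of the sigmoid.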
Activation Function

non-linear response:  S_i = g( Σ_j w_ij S_j )

sigmoidal activation with symmetric range:
- minimal activity  g(x → −∞) ≡ −1
- maximal activity  g(x → +∞) ≡ +1
- monotonic increase  g'(x) > 0

just one example:
g(x_i) = tanh[γ (x_i − θ)]   with  x_i = Σ_j w_ij S_j
gain parameter γ, local threshold θ

Neural Networks 8
McCulloch-Pitts Neurons

an extreme case: infinite gain γ → ∞

g(x_i) = tanh[γ (x_i − θ)] → sign[x_i − θ] = { +1 for x_i ≥ θ ; −1 for x_i < θ }

McCulloch-Pitts [1943]: the model neuron is either quiescent or maximally active; graded responses are not considered

(don't confuse the local threshold θ with the all-or-nothing threshold in spiking neurons)

Neural Networks 9
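The infinite-gain limit can be observed numerically; a small sketch (the sample values of γ and x are arbitrary illustrations):

```python
import math

def g(x, gamma, theta=0.0):
    """Sigmoidal activation tanh(gamma * (x - theta))."""
    return math.tanh(gamma * (x - theta))

# as the gain grows, g approaches the step function sign(x - theta):
for gamma in (1.0, 10.0, 1000.0):
    print(gamma, g(0.2, gamma), g(-0.2, gamma))
```

For γ = 1000 the outputs are already indistinguishable from ±1 in double precision.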
Synaptic Plasticity

D. Hebb [1949], hypothesis: Hebbian learning

consider
- presynaptic neuron A
- postsynaptic neuron B
- excitatory synapse w_BA

If A and B (frequently) fire at the same time, the excitatory synaptic strength w_BA increases
→ memory effect: will favor joint activity in the future

for symmetrized firing rates −1 ≤ S_A, S_B ≤ +1, change of synaptic strength:
Δw_BA ∝ S_A S_B   (pre-synaptic × post-synaptic activity)

Neural Networks 10
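The rule Δw_BA ∝ S_A S_B can be sketched in a few lines; the learning rate `eta` and the repetition loop are illustrative additions, not part of the slide:

```python
def hebb_step(w_BA, S_A, S_B, eta=0.1):
    """Hebbian update: change of w_BA proportional to pre- x post-synaptic activity."""
    return w_BA + eta * S_A * S_B

w = 0.0
for _ in range(5):               # A and B repeatedly fire together ...
    w = hebb_step(w, +1.0, +1.0)
print(w)                         # ... so the synapse is strengthened

# with the symmetrized rates, anti-correlated activity
# (S_A = +1, S_B = -1) weakens the synapse instead
```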
Artificial Neural Networks

in the following:
- assembled from simple firing-rate neurons
- connected by weights, real-valued synaptic strengths
- various architectures and types of networks

e.g.: attractor neural networks, recurrent networks
dynamical systems with weights w_ij and activities S_i(t), S_j(t), e.g. the Hopfield model: a network of McCulloch-Pitts neurons that can operate as an associative memory by learning of synaptic interactions
(illustration: N = 5 neurons, partial connectivity)

Neural Networks 11
feed-forward networks

- layered architecture (here: 6-3-4-1)
- input layer (external stimulus)
- directed connections w_ij (here: only to the next layer)
- hidden units (internal representation)
- output unit(s) (function of the input vector)

S_i = g( Σ_j w_ij S_j )   where j runs over the previous layer only

Neural Networks 12
the perceptron revisited

input units  ξ_j ∈ ℝ, ξ ∈ ℝ^N
weights  w_j ∈ ℝ, w ∈ ℝ^N
single output unit
S = sign( Σ_{j=1}^N w_j ξ_j − θ )

output = "linearly separable function" of the input variables, parameterized by the weight vector w and the threshold θ

Neural Networks 13
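A sketch of the perceptron output for a concrete weight vector; the AND-like example on ±1 inputs is my own illustration of a linearly separable function, not taken from the slide:

```python
def perceptron(xi, w, theta=0.0):
    """S = sign(sum_j w_j * xi_j - theta), with sign(x) = +1 for x >= 0, -1 otherwise."""
    x = sum(wj * xj for wj, xj in zip(w, xi)) - theta
    return 1 if x >= 0 else -1

# w = (1, 1), theta = 1.5 implements AND on inputs from {-1, +1}^2:
# only (+1, +1) gives a weighted sum (= 2) above the threshold
for xi in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(xi, perceptron(xi, (1, 1), theta=1.5))
```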
convergent two-layer architecture

input units  ξ_j ∈ ℝ, ξ ∈ ℝ^N
input-to-hidden weights  w_j^(k)
hidden layer units  S_k = g( Σ_j w_j^(k) ξ_j )
hidden-to-output weights  v_k
single output unit σ

output = non-linear function of the input variables:
σ = g( Σ_{k=1}^K v_k S_k ) = g( Σ_k v_k g( Σ_j w_j^(k) ξ_j ) )

parameterized by the set of all weights (and thresholds)

Neural Networks 14
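The nested expression σ = g(Σ_k v_k g(Σ_j w_j^(k) ξ_j)) translates directly into code; a minimal sketch with g = tanh (gain γ = 1, thresholds omitted for brevity):

```python
import math

def two_layer(xi, W, v):
    """sigma = g(sum_k v_k * g(w^(k) . xi)) with g(x) = tanh(x).

    W: list of K input-to-hidden weight vectors w^(k)
    v: list of K hidden-to-output weights
    """
    S = [math.tanh(sum(wj * xj for wj, xj in zip(wk, xi))) for wk in W]
    return math.tanh(sum(vk * Sk for vk, Sk in zip(v, S)))

# K = 2 hidden units, N = 2 inputs (weights chosen arbitrarily)
sigma = two_layer([1.0, 0.0], W=[[1.0, 0.0], [0.0, 1.0]], v=[1.0, 1.0])
```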
networks of continuous nodes

continuous activation functions, e.g. g(x) = tanh(γ x), for all nodes in the network

given a network architecture, the weights (and thresholds) parameterize a function (input/output relation):
ξ ∈ ℝ^N → σ(ξ) ∈ ℝ   (here: single output unit)

learning as a regression problem
set of examples with real-valued labels:  { ξ^μ, τ^μ = τ(ξ^μ) }, μ = 1, …, P
training:  (approximately) implement σ(ξ^μ) = τ(ξ^μ) for all μ
generalization:  σ(ξ) ≈ τ(ξ) in the application to novel data

Neural Networks 15
error measure and training

training strategy: employ an error measure for the comparison of student/teacher outputs

just one very popular and plausible choice, the quadratic deviation:
e(σ, τ) = ½ (σ − τ)²

cost function:
E = (1/P) Σ_{μ=1}^P e^μ = (1/P) Σ_{μ=1}^P ½ ( σ(ξ^μ) − τ(ξ^μ) )²

- defined for a given set of example data
- guides the training process
- is a differentiable function of the weights and thresholds
- training by gradient descent: minimization of E

Neural Networks 16
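The cost function over P examples, as a short sketch (function and variable names are my own):

```python
def cost(sigma, examples):
    """E = (1/P) * sum_mu 1/2 * (sigma(xi^mu) - tau^mu)**2."""
    P = len(examples)
    return sum(0.5 * (sigma(xi) - tau) ** 2 for xi, tau in examples) / P

# a student that always outputs 0, on labels +/-1:
examples = [([0.0], 1.0), ([0.0], -1.0)]
print(cost(lambda xi: 0.0, examples))  # 0.5
```

Because `sigma` is passed in as a function, the same cost works for any architecture, from a single unit to a deep layered network.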
a single unit

ξ_j ∈ ℝ, ξ ∈ ℝ^N,  w ∈ ℝ^N
σ = g( Σ_{j=1}^N w_j ξ_j )

E(w) = (1/P) Σ_{μ=1}^P ½ ( g(w · ξ^μ) − τ^μ )²

∂E(w)/∂w_k = (1/P) Σ_{μ=1}^P ( g(w · ξ^μ) − τ^μ ) g'(w · ξ^μ) ξ_k^μ

∇_w E(w) = (1/P) Σ_{μ=1}^P ( g(w · ξ^μ) − τ^μ ) g'(w · ξ^μ) ξ^μ

Neural Networks 17
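For g = tanh (so g'(x) = 1 − tanh²(x)) the gradient formula becomes a few lines of code; a sketch assuming γ = 1 and no threshold:

```python
import math

def grad_E(w, examples):
    """Gradient of E(w) = (1/P) sum_mu 1/2 (g(w . xi^mu) - tau^mu)^2, g = tanh."""
    P = len(examples)
    grad = [0.0] * len(w)
    for xi, tau in examples:
        x = sum(wj * xj for wj, xj in zip(w, xi))
        g, g_prime = math.tanh(x), 1.0 - math.tanh(x) ** 2
        for k in range(len(w)):
            # (g - tau) * g'(x) * xi_k, averaged over the P examples
            grad[k] += (g - tau) * g_prime * xi[k] / P
    return grad
```

The common factor (g − τ) g'(x) is the per-example "delta" that reappears in backpropagation for multilayer networks.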
Backpropagation of Error

convenient calculation of the gradient in multilayer networks (← chain rule)

example: continuous two-layer network with K hidden units
inputs  ξ ∈ ℝ^N
weights  w_k ∈ ℝ^N, k = 1, 2, …, K
hidden units  σ_k(ξ) = g( w_k · ξ )
output  σ(ξ) = h( Σ_{j=1}^K v_j g( w_j · ξ ) )

Exercise: derive ∂E/∂v_k and ∇_{w_k} E

the weights w_k and v_k are used
– downward for the calculation of hidden states and output
– upward for the calculation of the gradient

Neural Networks 18
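A sketch of the exercise's solution via the chain rule, taking h = g = tanh and a single example (ξ, τ); the variable names are my own illustration, not the slides' notation:

```python
import math

def backprop(xi, tau, W, v):
    """Gradients of e = 1/2 (sigma - tau)^2 for sigma = tanh(sum_k v_k tanh(w^(k) . xi))."""
    # forward ("downward") pass: hidden states and output
    S = [math.tanh(sum(wj * xj for wj, xj in zip(wk, xi))) for wk in W]
    y = sum(vk * Sk for vk, Sk in zip(v, S))
    sigma = math.tanh(y)
    # backward ("upward") pass: the output error is propagated via the chain rule
    delta = (sigma - tau) * (1.0 - sigma ** 2)               # de/dy
    grad_v = [delta * Sk for Sk in S]                        # de/dv_k
    grad_W = [[delta * v[k] * (1.0 - S[k] ** 2) * xj for xj in xi]
              for k in range(len(W))]                        # de/dw^(k)_j
    return grad_v, grad_W
```

The hidden states S_k computed on the way down are reused on the way up, which is exactly why backpropagation is cheaper than differentiating each weight independently.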
backpropagation

A.E. Bryson, Y.-C. Ho (1969). Applied Optimal Control: Optimization, Estimation and Control. Blaisdell Publishing, p. 481

P. Werbos (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University

D.E. Rumelhart, G.E. Hinton, R.J. Williams (1986). Learning representations by back-propagating errors. Nature 323 (6088): 533-536

Neural Networks 19
backpropagation

[figure: 1987, 1995]

Neural Networks 20
gradient descent

the negative gradient gives the direction of steepest descent in E

simple gradient-based minimization of E: sequence w_0 → w_1 → … → w_t → w_{t+1} → …
with  w_{t+1} = w_t − η ∇E|_{w_t}
approaches some (local) minimum of E

learning rate η
– controls the step size of the algorithm
– has to be small enough to ensure convergence
– should be as large as possible to facilitate fast learning

Neural Networks 21
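Putting the pieces together, a sketch of plain gradient descent for a single tanh unit; the toy teacher τ(ξ) = tanh(ξ_1 − ξ_2), the value of η, and the step count are my own illustrative choices:

```python
import math

def E(w, examples):
    """Quadratic cost for the single-unit student sigma = tanh(w . xi)."""
    return sum(0.5 * (math.tanh(sum(a * b for a, b in zip(w, xi))) - tau) ** 2
               for xi, tau in examples) / len(examples)

def gradient_descent(examples, w0, eta=0.5, steps=200):
    """Iterate w_{t+1} = w_t - eta * grad E(w_t)."""
    w = list(w0)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, tau in examples:
            x = sum(a * b for a, b in zip(w, xi))
            delta = (math.tanh(x) - tau) * (1.0 - math.tanh(x) ** 2)
            for k in range(len(w)):
                grad[k] += delta * xi[k] / len(examples)
        w = [wk - eta * gk for wk, gk in zip(w, grad)]
    return w

# teacher tau(xi) = tanh(xi_1 - xi_2), i.e. the target weights are w* = (1, -1)
examples = [([1.0, 0.0], math.tanh(1.0)),
            ([0.0, 1.0], math.tanh(-1.0)),
            ([1.0, 1.0], 0.0)]
w = gradient_descent(examples, w0=[0.1, -0.1])
# E(w, examples) is now far below its initial value; a much larger eta
# would instead make the sequence oscillate or diverge
```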