Learning and Meta-learning

• computation
  – making predictions
  – choosing actions
  – acquiring episodes
  – statistics
• algorithm
  – gradient ascent (e.g. of the likelihood)
  – correlation
  – Kalman filtering
• implementation
  – Hebbian synaptic plasticity
  – neuromodulation
Types of Learning

supervised ($v|u$): inputs $\mathbf{u}$ and desired or target outputs $v$ both provided, e.g. prediction → outcome
reinforcement ($\max r|u$): input $\mathbf{u}$ and scalar evaluation $r$, often with a temporal credit assignment problem
unsupervised or self-supervised ($u$): learn structure from statistics

These are closely related:
  supervised: learn $P[v|\mathbf{u}]$
  unsupervised: learn $P[v, \mathbf{u}]$
Hebb

Famously suggested: if cell A consistently contributes to the activity of cell B, then the synapse from A to B should be strengthened.

• strong element of causality
• what about weakening (LTD)?
• multiple timescales – STP to protein synthesis
• multiple biochemical mechanisms
• systems:
  – hippocampus – multiple sub-areas
  – neocortex – layer and area differences
  – cerebellum – LTD is the norm
Neural Rules

[Figure: field potential amplitude (mV) versus time (min). A 100 Hz tetanus drives the response from the control level to a potentiated level (LTP); subsequent 2 Hz stimulation depresses it to a depressed, partially depotentiated level (LTD).]
Stability and Competition

Hebbian learning involves positive feedback. Control by:

• LTD – usually not enough (covariance versus correlation)
• saturation – prevent synaptic weights from getting too big (or too small); triviality beckons
• competition
  – spike-time dependent learning rules
  – normalization over pre-synaptic or post-synaptic arbors
    • subtractive: decrease all synapses by the same amount, whether large or small
    • multiplicative: decrease large synapses by more than small synapses
Preamble

Linear firing rate model
$$\tau_r \frac{dv}{dt} = -v + \mathbf{w} \cdot \mathbf{u} = -v + \sum_{b=1}^{N_u} w_b u_b$$
Assume that $\tau_r$ is small compared with the rate of change of the weights; then $v = \mathbf{w} \cdot \mathbf{u}$ during plasticity.

Then have
$$\tau_w \frac{d\mathbf{w}}{dt} = f(v, \mathbf{u}, \mathbf{w})$$

Supervised rules use targets to specify $v$ – neural basis in ACh?
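A minimal numerical sketch of this model (the time constant, weights and input below are assumed illustrative values): on the fast timescale $\tau_r$, the rate $v$ relaxes to the steady state $\mathbf{w} \cdot \mathbf{u}$ used throughout the plasticity analysis.

```python
import numpy as np

# Minimal sketch of the linear firing-rate model (illustrative values).
tau_r = 0.01                      # rate time constant (s) -- assumed
dt = 0.001                        # integration step (s)
w = np.array([0.5, 1.0, -0.3])    # synaptic weights (assumed)
u = np.array([1.0, 0.2, 0.8])     # static input (assumed)

v = 0.0
for _ in range(100):
    # tau_r dv/dt = -v + w . u
    v += dt / tau_r * (-v + w @ u)

print(v, w @ u)  # v has relaxed to the steady state w . u
```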
The Basic Hebb Rule

$$\tau_w \frac{d\mathbf{w}}{dt} = v\mathbf{u}$$
Averaging over input statistics gives
$$\tau_w \frac{d\mathbf{w}}{dt} = \langle v\mathbf{u} \rangle = \langle \mathbf{u}\mathbf{u} \rangle \cdot \mathbf{w} = Q \cdot \mathbf{w}$$
where $Q$ is the input correlation matrix.

Positive feedback instability:
$$\tau_w \frac{d}{dt}|\mathbf{w}|^2 = 2\tau_w \mathbf{w} \cdot \frac{d\mathbf{w}}{dt} = 2v^2$$

Also have the discretised version
$$\mathbf{w} \to \mathbf{w} + \frac{T}{\tau_w} Q \cdot \mathbf{w},$$
integrating over time, presenting patterns for $T$ seconds.
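A small sketch of the discretised rule (toy correlation matrix assumed): the norm of $\mathbf{w}$ blows up, while its direction converges on the leading eigenvector of $Q$ – the instability and the eigen-analysis of the later slides in one picture.

```python
import numpy as np

# Discretised Hebb rule w -> w + (T/tau_w) Q.w on a toy correlation matrix.
rng = np.random.default_rng(0)
Q = np.array([[1.0, 0.6],
              [0.6, 1.0]])      # assumed input correlation matrix
tau_w, T = 100.0, 1.0

w = rng.standard_normal(2)
for _ in range(2000):
    w += (T / tau_w) * Q @ w

evals, evecs = np.linalg.eigh(Q)
e1 = evecs[:, np.argmax(evals)]          # principal eigenvector
print(np.linalg.norm(w))                 # diverges: positive feedback
print(w / np.linalg.norm(w), e1)         # direction aligns with e1 (up to sign)
```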
Covariance Rule

Since LTD really exists, contra Hebb:
$$\tau_w \frac{d\mathbf{w}}{dt} = (v - \theta_v)\mathbf{u} \qquad \text{or} \qquad \tau_w \frac{d\mathbf{w}}{dt} = (\mathbf{u} - \boldsymbol{\theta}_u) v$$
If $\theta_v = \langle v \rangle$ or $\boldsymbol{\theta}_u = \langle \mathbf{u} \rangle$ then
$$\tau_w \frac{d\mathbf{w}}{dt} = C \cdot \mathbf{w}$$
where $C = \langle (\mathbf{u} - \langle \mathbf{u} \rangle)(\mathbf{u} - \langle \mathbf{u} \rangle) \rangle$ is the input covariance matrix.

Still unstable:
$$\tau_w \frac{d}{dt}|\mathbf{w}|^2 = 2v(v - \langle v \rangle)$$
which averages to the (positive) variance of $v$.
BCM Rule

Odd to have LTD with $v = 0$ or $\mathbf{u} = \mathbf{0}$. Evidence for
$$\tau_w \frac{d\mathbf{w}}{dt} = v\mathbf{u}(v - \theta_v).$$

[Figure: weight change per unit $u$ as a function of $v$ – depression for $0 < v < \theta_v$, potentiation for $v > \theta_v$.]

If $\theta_v$ slides to match a high power of $v$,
$$\tau_\theta \frac{d\theta_v}{dt} = v^2 - \theta_v$$
with a fast $\tau_\theta$, then get competition between synapses – intrinsic stabilization.
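A hedged sketch of these dynamics (the two input patterns, time constants and bounds are all assumed): with the threshold tracking $v^2$ quickly, the symmetric solution is unstable and the neuron typically becomes selective to one pattern.

```python
import numpy as np

# Sketch of BCM with a fast sliding threshold; all constants are assumed.
rng = np.random.default_rng(1)
patterns = np.array([[1.0, 0.0],
                     [0.0, 1.0]])       # two assumed input patterns
tau_w, tau_theta, dt = 50.0, 1.0, 0.1

w = np.array([0.5, 0.5])
theta = 0.0
for _ in range(20000):
    u = patterns[rng.integers(2)]       # present a random pattern
    v = w @ u
    w += dt / tau_w * v * u * (v - theta)
    theta += dt / tau_theta * (v**2 - theta)   # fast sliding threshold
    w = np.clip(w, 0.0, 5.0)            # keep weights bounded

print(w)  # one weight typically wins: competition between synapses
```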
Subtractive Normalisation

Could normalise $|\mathbf{w}|^2$ or $\sum_b w_b = \mathbf{n} \cdot \mathbf{w}$, with $\mathbf{n} = (1, 1, \ldots, 1)$.

For subtractive normalisation of $\mathbf{n} \cdot \mathbf{w}$:
$$\tau_w \frac{d\mathbf{w}}{dt} = v\mathbf{u} - \frac{v(\mathbf{n} \cdot \mathbf{u})}{N_u}\mathbf{n}$$
with dynamic subtraction, since
$$\tau_w \frac{d\,\mathbf{n} \cdot \mathbf{w}}{dt} = v\,\mathbf{n} \cdot \mathbf{u}\left(1 - \frac{\mathbf{n} \cdot \mathbf{n}}{N_u}\right) = 0$$
as $\mathbf{n} \cdot \mathbf{n} = N_u$.

Strongly competitive – typically all the weights bar one go to 0. Therefore use an upper saturating limit.
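A quick check (toy random inputs assumed) that the subtractive term freezes the total weight $\mathbf{n} \cdot \mathbf{w}$ while individual weights still differentiate; adding the zero-bound and upper saturating limit described above is what makes the rule strongly competitive.

```python
import numpy as np

# Hebb with subtractive normalisation: the sum n.w stays fixed.
rng = np.random.default_rng(2)
N_u, tau_w, dt = 5, 100.0, 0.1

w = rng.random(N_u)
print(w.sum())                    # total synaptic weight at the start
for _ in range(5000):
    u = rng.random(N_u)           # assumed input statistics
    v = w @ u
    # v u - v (n.u)/N_u n : the same amount is subtracted from every synapse
    w += dt / tau_w * (v * u - v * u.mean())

print(w.sum())                    # unchanged: growth along n is removed
print(w)                          # individual weights drift apart regardless
```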
The Oja Rule

A multiplicative way to ensure $|\mathbf{w}|^2$ is constant:
$$\tau_w \frac{d\mathbf{w}}{dt} = v\mathbf{u} - \alpha v^2 \mathbf{w}$$
gives
$$\tau_w \frac{d|\mathbf{w}|^2}{dt} = 2v^2(1 - \alpha|\mathbf{w}|^2),$$
so $|\mathbf{w}|^2 \to 1/\alpha$.

Dynamic normalisation – could also enforce normalisation all the time.
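A sketch on correlated Gaussian toy data (covariance and learning constants assumed) showing both advertised properties: $|\mathbf{w}|^2$ converges to $1/\alpha$, and $\mathbf{w}$ lines up with the principal eigenvector.

```python
import numpy as np

# Oja rule on correlated zero-mean Gaussian inputs (toy covariance assumed).
rng = np.random.default_rng(3)
C = np.array([[2.0, 1.2],
              [1.2, 1.0]])               # assumed input covariance
L = np.linalg.cholesky(C)
tau_w, alpha, dt = 100.0, 1.0, 0.5

w = 0.1 * rng.standard_normal(2)
for _ in range(50000):
    u = L @ rng.standard_normal(2)       # sample with covariance C
    v = w @ u
    w += dt / tau_w * (v * u - alpha * v**2 * w)

evals, evecs = np.linalg.eigh(C)
print(w @ w)                             # -> 1/alpha
print(w, evecs[:, np.argmax(evals)])     # w aligns with e1 (up to sign)
```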
Timing-Based Rules

[Figure: A. EPSP amplitude (% of control) versus time (min) for pairings at +10 ms, ±100 ms and −10 ms. B. Percent potentiation as a function of $t_{\text{post}} - t_{\text{pre}}$ (ms). Data from slice cortical pyramidal cells and the Xenopus retinotectal system.]

• window of 50 ms
• gets Hebbian causality right
• rate-description:
$$\tau_w \frac{d\mathbf{w}}{dt} = \int_0^\infty d\tau \left( H(\tau)\, v(t)\, \mathbf{u}(t - \tau) + H(-\tau)\, v(t - \tau)\, \mathbf{u}(t) \right)$$
• spike-based description necessary if an input spike can have a measurable impact on an output spike
• critical factor is the overall integral – net LTD with 'local' LTP
• partially self-stabilizing
Timing-Based Rules

Gütig et al.; van Rossum et al.:
$$\Delta w_i = \begin{cases} -\lambda f_-(w_i) K(\Delta t) & \text{if } \Delta t \le 0 \\ \lambda f_+(w_i) K(\Delta t) & \text{if } \Delta t > 0 \end{cases}$$
$$K(\Delta t) = e^{-|\Delta t|/\tau} \qquad f_+(w) = (1 - w)^\mu \qquad f_-(w) = \alpha w^\mu$$
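A direct transcription of this weight-dependent rule (the parameter values below are assumptions for illustration; $\Delta t = t_{\text{post}} - t_{\text{pre}}$, with $w$ kept in $[0, 1]$ as the soft bounds $f_\pm$ presuppose):

```python
import numpy as np

# Weight-dependent STDP of the Gütig et al. / van Rossum et al. form.
lam, tau, mu, alpha = 0.01, 20.0, 0.5, 1.1   # assumed parameter values

def stdp_dw(w, dt_ms):
    """Weight change for one spike pair; dt_ms = t_post - t_pre (ms)."""
    K = np.exp(-abs(dt_ms) / tau)
    if dt_ms > 0:                        # pre before post: potentiation
        return lam * (1.0 - w)**mu * K
    return -lam * alpha * w**mu * K      # post before pre: depression

w = 0.5
for dt_ms in [+5.0, +5.0, -5.0, +30.0]:  # example spike-pair timings
    w = np.clip(w + stdp_dw(w, dt_ms), 0.0, 1.0)
print(w)
```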
FP Analysis

How can we predict the weight distribution?
$$\frac{1}{\rho_{\text{in}}} \frac{\partial P(w, t)}{\partial t} = -p_p P(w, t) - p_d P(w, t) + p_p P(w - w_p, t) + p_d P(w + w_d, t)$$
Taylor-expanding about $P(w, t)$ leads to a Fokker-Planck equation.

Need to work out $p_d$ and $p_p$; assume steady firing.
Depression: $p_d = t_{\text{window}} / t_{\text{isi}}$
Potentiation (input spike affects output spike): $p_p = \int_0^{t_w} P(\delta t)\, d\delta t$
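A Monte Carlo sketch of the jump process this master equation describes (all probabilities and jump sizes assumed): histogramming many independently evolving weights gives the stationary $P(w)$ that the Fokker-Planck expansion approximates.

```python
import numpy as np

# Monte Carlo sketch of the potentiation/depression jump process.
rng = np.random.default_rng(4)
p_p, p_d = 0.10, 0.12        # per-input-spike probabilities (assumed)
w_p, w_d = 0.02, 0.02        # jump sizes (assumed)
n_synapses, n_spikes = 5000, 20000

w = np.full(n_synapses, 0.5)
for _ in range(n_spikes):
    r = rng.random(n_synapses)          # disjoint events since p_p + p_d < 1
    w += np.where(r < p_p, w_p, 0.0) - np.where(r > 1 - p_d, w_d, 0.0)
    w = np.clip(w, 0.0, 1.0)            # hard bounds on the weight

hist, edges = np.histogram(w, bins=20, range=(0, 1), density=True)
print(hist)   # empirical stationary P(w)
```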
Single Postsynaptic Neuron

Basic Hebb rule:
$$\tau_w \frac{d\mathbf{w}}{dt} = Q \cdot \mathbf{w}$$
Analyse using an eigendecomposition of $Q$:
$$Q \cdot \mathbf{e}_\mu = \lambda_\mu \mathbf{e}_\mu \qquad \lambda_1 \ge \lambda_2 \ge \ldots$$
Since $Q$ is symmetric and positive (semi-)definite:
• complete set of real orthonormal evecs
• with non-negative eigenvalues
• whose growth is decoupled

Write
$$\mathbf{w}(t) = \sum_{\mu=1}^{N_u} c_\mu(t) \mathbf{e}_\mu$$
then
$$c_\mu(t) = c_\mu(0) \exp\left(\frac{\lambda_\mu t}{\tau_w}\right)$$
and $\mathbf{w}(t) \to \alpha(t) \mathbf{e}_1$ as $t \to \infty$.
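A quick numerical check (toy $Q$ assumed) that the closed-form modal solution matches direct integration of the Hebb dynamics:

```python
import numpy as np

# Check c_mu(t) = c_mu(0) exp(lambda_mu t / tau_w) against Euler integration.
Q = np.array([[1.0, 0.6],
              [0.6, 1.0]])                 # assumed correlation matrix
tau_w, t_end, dt = 100.0, 50.0, 0.01
w0 = np.array([0.3, -0.1])

evals, evecs = np.linalg.eigh(Q)           # real orthonormal eigenvectors
c0 = evecs.T @ w0                          # modal coefficients at t = 0
w_closed = evecs @ (c0 * np.exp(evals * t_end / tau_w))

w = w0.copy()
for _ in range(int(t_end / dt)):           # Euler: tau_w dw/dt = Q.w
    w += dt / tau_w * Q @ w

print(w, w_closed)                         # the two solutions agree
```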
Constraints

Unconstrained, $\alpha(t) = \exp(\lambda_1 t / \tau_w) \to \infty$.

• Oja makes $\mathbf{w}(t) \to \mathbf{e}_1 / \sqrt{\alpha}$
• saturation can disturb the outcome

[Figure: weight trajectories in the ($w_1$, $w_2$) plane under the constrained dynamics, panels A and B.]

• subtractive constraint:
$$\tau_w \dot{\mathbf{w}} = Q \cdot \mathbf{w} - \frac{(\mathbf{w} \cdot Q \cdot \mathbf{n})}{N_u} \mathbf{n}$$
Sometimes $\mathbf{e}_1 \propto \mathbf{n}$ – so its growth is stunted; and $\mathbf{e}_\mu \cdot \mathbf{n} = 0$ for $\mu \ne 1$, so
$$\mathbf{w}(t) = (\mathbf{w}(0) \cdot \mathbf{e}_1) \mathbf{e}_1 + \sum_{\mu=2}^{N_u} \exp\left(\frac{\lambda_\mu t}{\tau_w}\right) (\mathbf{w}(0) \cdot \mathbf{e}_\mu) \mathbf{e}_\mu$$
Translation Invariance

A particularly important case for development has
$$Q_{bb'} = Q(b - b') \qquad \langle u_b \rangle = \langle u \rangle$$
Write $\mathbf{n} = (1, \ldots, 1)$ and $J = \mathbf{n}\mathbf{n}^T$; then $Q' = Q - \langle u \rangle^2 J$.

1. $\mathbf{e}_\mu \cdot \mathbf{n} = 0$: AC modes are unaffected
2. $\mathbf{e}_\mu \cdot \mathbf{n} \ne 0$: DC modes are affected
3. $Q$ has discrete sines and cosines as eigenvectors
4. the Fourier spectrum of $Q$ gives the eigenvalues
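A numerical illustration of points 3 and 4 (periodic toy kernel assumed): a circulant, translation-invariant $Q$ is diagonalised by discrete Fourier modes, so its eigenvalues coincide with the Fourier spectrum of the kernel.

```python
import numpy as np

# A circulant (translation-invariant) Q is diagonalised by Fourier modes.
N = 16
b = np.arange(N)
kernel = np.exp(-np.minimum(b, N - b)**2 / 8.0)   # assumed Q(b - b'), periodic
Q = np.array([[kernel[(i - j) % N] for j in b] for i in b])

evals = np.sort(np.linalg.eigvalsh(Q))
spectrum = np.sort(np.fft.fft(kernel).real)       # Fourier spectrum of kernel
print(np.allclose(evals, spectrum))               # True: the spectra coincide
```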
PCA

What is the significance of $\mathbf{e}_1$?

[Figure: panels A–C, input data clouds and the weight vector in the ($u_1, w_1$)–($u_2, w_2$) plane; $\mathbf{w}$ aligns with the principal axis of the data.]

• optimal linear reconstruction: minimise $E(\mathbf{w}, \mathbf{g}) = \langle |\mathbf{u} - \mathbf{g}v|^2 \rangle$
• information maximisation: $I[v, \mathbf{u}] = H[v] - H[v|\mathbf{u}]$ under a linear model
• assume $\langle \mathbf{u} \rangle = \mathbf{0}$, or use $C$ instead of $Q$
Linear Reconstruction

$$E(\mathbf{w}, \mathbf{g}) = \langle |\mathbf{u} - \mathbf{g}v|^2 \rangle = K - 2\mathbf{w} \cdot Q \cdot \mathbf{g} + |\mathbf{g}|^2\, \mathbf{w} \cdot Q \cdot \mathbf{w}$$
quadratic in $\mathbf{w}$ with minimum at
$$\mathbf{w}^* = \frac{\mathbf{g}}{|\mathbf{g}|^2}$$
making
$$E(\mathbf{w}^*, \mathbf{g}) = K - \frac{\mathbf{g} \cdot Q \cdot \mathbf{g}}{|\mathbf{g}|^2}.$$
Look for a solution with $\mathbf{g} = \sum_k (\mathbf{e}_k \cdot \mathbf{g}) \mathbf{e}_k$ and $|\mathbf{g}|^2 = 1$:
$$E(\mathbf{w}^*, \mathbf{g}) = K - \sum_{k=1}^{N} (\mathbf{e}_k \cdot \mathbf{g})^2 \lambda_k$$
which clearly has $\mathbf{e}_1 \cdot \mathbf{g} = 1$ and $\mathbf{e}_2 \cdot \mathbf{g} = \mathbf{e}_3 \cdot \mathbf{g} = \ldots = 0$ at its minimum.

Therefore $\mathbf{g}$ and $\mathbf{w}$ both point along the principal component.
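A numerical sanity check (toy $Q$ assumed): among unit-norm $\mathbf{g}$, the reconstruction error $E(\mathbf{w}^*, \mathbf{g}) = K - \mathbf{g} \cdot Q \cdot \mathbf{g} / |\mathbf{g}|^2$ is never lower than at $\mathbf{g} = \mathbf{e}_1$.

```python
import numpy as np

# Check that g = e1 minimises E(w*, g) = K - g.Q.g / |g|^2 over directions g.
rng = np.random.default_rng(5)
Q = np.array([[2.0, 1.2],
              [1.2, 1.0]])                  # assumed input correlation matrix
evals, evecs = np.linalg.eigh(Q)
e1 = evecs[:, np.argmax(evals)]

def E_star(g):                              # constant K dropped
    return -(g @ Q @ g) / (g @ g)

samples = rng.standard_normal((10000, 2))   # random candidate directions
best = min(E_star(g) for g in samples)
print(best, E_star(e1))                     # random search never beats e1
```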
Infomax (Linsker)

$$\mathrm{argmax}_{\mathbf{w}}\; I[v, \mathbf{u}] = H[v] - H[v|\mathbf{u}]$$
A very general unsupervised learning suggestion:
• $H[v|\mathbf{u}]$ is not quite well defined unless $v = \mathbf{w} \cdot \mathbf{u} + \eta$ with noise $\eta$ – for a deterministic mapping it is unbounded below
• $H[v] = \frac{1}{2} \log 2\pi e \sigma^2$ for a Gaussian

If $P[\mathbf{u}] \sim \mathcal{N}[\mathbf{0}, Q]$ then $v \sim \mathcal{N}[0, \mathbf{w} \cdot Q \cdot \mathbf{w} + \upsilon^2]$:
maximise $\mathbf{w} \cdot Q \cdot \mathbf{w}$ subject to $|\mathbf{w}|^2 = 1$ (note the normalisation).
Same problem as above: implies that $\mathbf{w} \propto \mathbf{e}_1$.

If non-Gaussian, only maximising an upper bound on $I[v, \mathbf{u}]$.
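A small check (toy $Q$ and noise variance assumed): for this linear-Gaussian channel $I[v, \mathbf{u}] = \frac{1}{2}\log(1 + \mathbf{w} \cdot Q \cdot \mathbf{w} / \upsilon^2)$, and among unit-norm $\mathbf{w}$ it peaks at $\mathbf{w} = \mathbf{e}_1$.

```python
import numpy as np

# Gaussian channel v = w.u + eta: I[v,u] = 0.5 log(1 + w.Q.w / nu2).
rng = np.random.default_rng(6)
Q = np.array([[2.0, 1.2],
              [1.2, 1.0]])                 # assumed input covariance
nu2 = 0.5                                  # assumed noise variance of eta

def info(w):
    return 0.5 * np.log(1.0 + (w @ Q @ w) / nu2)

evals, evecs = np.linalg.eigh(Q)
e1 = evecs[:, np.argmax(evals)]
ws = rng.standard_normal((10000, 2))
ws /= np.linalg.norm(ws, axis=1, keepdims=True)   # unit-norm candidates
print(max(info(w) for w in ws), info(e1))         # e1 attains the maximum
```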
Ocular Dominance

[Figure: retina–thalamus–cortex architecture. A cortical unit $v(a)$ receives feedforward weights $W(a; b)$ from left- and right-eye thalamic inputs $u(b)$, with competitive cortical interactions $A(a; b)$; right panel, the resulting ocularity map alternating between L and R.]

• retina-thalamus-cortex
• OD develops around eye-opening
• interaction with refinement of topography
• interaction with orientation
• interaction with ipsi/contra-innervation
• effect of manipulations to input
Start Simple

Consider one input from each eye:
$$v = w_R u_R + w_L u_L.$$
Then
$$Q = \langle \mathbf{u}\mathbf{u} \rangle = \begin{pmatrix} q_S & q_D \\ q_D & q_S \end{pmatrix}$$
has
$$\mathbf{e}_1 = (1, 1)/\sqrt{2},\ \lambda_1 = q_S + q_D \qquad \mathbf{e}_2 = (1, -1)/\sqrt{2},\ \lambda_2 = q_S - q_D$$
so if $w_+ = w_R + w_L$ and $w_- = w_R - w_L$ then
$$\tau_w \frac{dw_+}{dt} = (q_S + q_D) w_+ \qquad \tau_w \frac{dw_-}{dt} = (q_S - q_D) w_-.$$
Since $q_D \ge 0$, $w_+$ dominates – so use subtractive normalisation:
$$\tau_w \frac{dw_+}{dt} = 0 \qquad \tau_w \frac{dw_-}{dt} = (q_S - q_D) w_-$$
so $w_- \to \pm\omega$ and one eye dominates.
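A sketch of this two-input model (correlation values and limits assumed): with subtractive normalisation, $w_+$ is frozen while $w_-$ grows until the saturation limits are hit, and which eye wins depends only on the sign of the initial $w_-$.

```python
import numpy as np

# Two-eye model with subtractive normalisation: w+ frozen, w- grows.
rng = np.random.default_rng(7)
q_S, q_D = 1.0, 0.4                    # assumed same/opposite-eye correlations
Q = np.array([[q_S, q_D],
              [q_D, q_S]])
n = np.ones(2)
tau_w, dt = 100.0, 0.1

w = np.array([0.5, 0.5]) + 0.01 * rng.standard_normal(2)  # near-symmetric start
for _ in range(20000):
    dw = Q @ w - (w @ Q @ n) / 2 * n   # Hebb + subtractive normalisation
    w = np.clip(w + dt / tau_w * dw, 0.0, 1.0)   # saturation limits

print(w)   # one weight saturates at 1, the other at 0: ocular dominance
```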
Orientation Selectivity

Model is exactly the same – input correlations come from ON/OFF cells:

[Figure: C. the ON/OFF correlation function $Q^-(b)$ versus $b$; D. its Fourier transform $\tilde{Q}^-(\tilde{b})$, peaked at a non-zero spatial frequency.]

Now the dominant mode of $Q^-$ has spatial structure: a centre-surround version is also possible, but is usually dominated because of non-linear effects.