9.54 Review: Levels + Biophysics + Supervised Learning. Shimon Ullman, Tomaso Poggio, Danny Harari, Daniel Zysman, Darren Seibert. 9.54, fall semester 2014
Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. David Marr. Foreword by Shimon Ullman, afterword by Tomaso Poggio. David Marr's posthumously published Vision (1982) influenced a generation of brain and cognitive scientists, inspiring many to enter the field. In Vision, Marr describes a general framework for understanding visual perception and touches on broader questions about how the brain and its functions can be studied and understood. In Marr's framework, the process of vision constructs a set of representations. A central theme, and one that has had far-reaching influence in both neuroscience and cognitive science, is the notion of different levels of analysis: in Marr's framework, the computational level, the algorithmic level, and the hardware implementation level. Now, thirty years later, the main problems that occupied Marr remain fundamental open problems in the study of perception.
Levels of Understanding (1977): Computation; Algorithms; Wetware, hardware, circuits and components
Levels of Understanding A case study of levels of understanding: the fly’s visual system
Levels of Understanding (1977-2012): Evolution; Learning/Development; Computation; Algorithms; Wetware, hardware, circuits and components. Poggio, T., The Levels of Understanding framework, revised. MIT-CSAIL-TR-2012-014, CBCL-308.
[e3 ANDNOT (i1 OR i2 OR i3)] OR [e2 ANDNOT (i1 OR i2)] OR (e1 ANDNOT i1)

(e1 ANDNOT i1) OR (e2 ANDNOT i2) OR {[(e3 ANDNOT i3) OR (e4 ANDNOT i4) OR (e5 ANDNOT i5) OR (e6 ANDNOT i6)] ANDNOT i7}
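As an illustrative sketch, the second expression above can be evaluated directly as Boolean logic; the function name and the example inputs below are arbitrary, and e1..e6, i1..i7 simply mirror the signals in the formula.

def andnot(a, b):
    # a AND NOT b
    return a and not b

def response(e, i):
    # (e1 ANDNOT i1) OR (e2 ANDNOT i2) OR
    # {[(e3 ANDNOT i3) OR ... OR (e6 ANDNOT i6)] ANDNOT i7}
    inner = (andnot(e[3], i[3]) or andnot(e[4], i[4]) or
             andnot(e[5], i[5]) or andnot(e[6], i[6]))
    return andnot(e[1], i[1]) or andnot(e[2], i[2]) or andnot(inner, i[7])

# Example inputs (made up): only channel 3 is excited, nothing is inhibited.
e = {k: (k == 3) for k in range(1, 7)}
i = {k: False for k in range(1, 8)}
print(response(e, i))   # True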
Thus Y − MX = 0 if an exact solution exists; more generally, look for M that minimizes ||Y − MX||^2. If M = w^T, setting the gradient to zero,
∇V(w) = −2(Y − w^T X) X^T = 0,
yields Y X^T = w^T X X^T, and hence w^T = Y X^T (X X^T)^{-1}.

For the regularized problem min_w ||Y − w^T X||^2 + λ||w||^2, look for w such that
∇V(w) = −2(Y − w^T X) X^T + 2λ w^T = 0,
which yields Y X^T = w^T X X^T + λ w^T, and hence w^T = Y X^T (X X^T + λ I)^{-1}.
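A numerical sketch of the two closed forms just derived, using NumPy; the dimensions, the random data, and λ = 0.1 are illustrative choices (X is p x n with one example per column, Y is 1 x n).

import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 5, 50, 0.1
X = rng.standard_normal((p, n))                      # p x n, one example per column
w_true = rng.standard_normal((1, p))
Y = w_true @ X + 0.01 * rng.standard_normal((1, n))  # 1 x n targets

# Unregularized least squares: w^T = Y X^T (X X^T)^{-1}
wT_ls = Y @ X.T @ np.linalg.inv(X @ X.T)
# Tikhonov / ridge: w^T = Y X^T (X X^T + lam I)^{-1}
wT_ridge = Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(p))

print(np.allclose(wT_ls, w_true, atol=0.1))                 # recovers w_true up to noise
print(np.linalg.norm(wT_ridge) <= np.linalg.norm(wT_ls))    # regularization shrinks the solution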
Example: the representer theorem in the linear case. Some simple linear algebra shows that
w^T = Y X^T (X X^T)^{-1} = Y (X^T X)^{-1} X^T = C X^T,
since X^T (X X^T)^{-1} = (X^T X)^{-1} X^T. Then
f(x) = w^T x = C X^T x = Σ_{i=1}^n c_i x_i^T x.
We can compute C (n coefficients) or w (p coefficients) depending on whether n ≤ p. The above result is the most basic form of the representer theorem.
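A small NumPy check of the equivalence above: with n ≤ p the coefficients c_i can be computed instead of w, and both forms give the same prediction. The shapes and data are illustrative.

import numpy as np

rng = np.random.default_rng(1)
p, n = 10, 6                              # n <= p, so X^T X is invertible
X = rng.standard_normal((p, n))           # columns are the training points x_i
Y = rng.standard_normal((1, n))

C = Y @ np.linalg.inv(X.T @ X)            # row vector of c_i, one per training point
w = (C @ X.T).ravel()                     # w^T = C X^T, i.e. w = X c

x_new = rng.standard_normal(p)
f_w = w @ x_new                                              # parametric form: f(x) = w^T x
f_c = sum(C[0, i] * (X[:, i] @ x_new) for i in range(n))     # sum_i c_i x_i^T x
print(np.isclose(f_w, f_c))                                  # True: same prediction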
Stability and (Tikhonov) Regularization. Consider f(x) = w^T x = Σ_{j=1}^p w_j x_j, and R(f) = w^T w.

min_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))^2   gives   w^T = Y X^T (X X^T)^{-1}

min_{f∈H} (1/n) Σ_{i=1}^n (y_i − f(x_i))^2 + λ ||f||^2   gives   w^T = Y X^T (X X^T + λ I)^{-1}

where the regularizer λ ||f||^2 reduces to λ ||w||^2 in the case of linear functions.
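An illustrative stability check (the perturbation size and λ = 1.0 are arbitrary): perturbing a single training label moves the regularized solution less than the unregularized one.

import numpy as np

rng = np.random.default_rng(2)
p, n, lam = 20, 25, 1.0
X = rng.standard_normal((p, n))
Y = rng.standard_normal((1, n))
Y2 = Y.copy()
Y2[0, 0] += 5.0                            # perturb one training label

def solve(targets, lam):
    # w^T = targets X^T (X X^T + lam I)^{-1}; lam = 0 gives plain least squares
    return targets @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(p))

change_ls    = np.linalg.norm(solve(Y, 0.0) - solve(Y2, 0.0))
change_ridge = np.linalg.norm(solve(Y, lam) - solve(Y2, lam))
print(change_ls > change_ridge)            # the regularized solution is more stable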
From Linear to Nonparametric Models. We can now consider a truly nonparametric model:

f(x) = Σ_{j≥1} w_j Φ(x)_j = Σ_{i=1}^n c_i K(x, x_i),   where   K(x, x_i) = Σ_{j≥1} Φ(x_i)_j Φ(x)_j.

We have
C_n = (X_n X_n^T + λ n I)^{-1} Y_n   with   (X_n X_n^T)_{i,j} = x_i^T x_j   in the linear case,
C_n = (K_n + λ n I)^{-1} Y_n   with   (K_n)_{i,j} = K(x_i, x_j)   in the general kernel case.
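A compact sketch of the kernel formula C_n = (K_n + λnI)^{-1} Y_n with a Gaussian kernel as one possible choice of K; the bandwidth, λ, and the toy regression data are illustrative.

import numpy as np

rng = np.random.default_rng(3)
n, lam, sigma = 40, 0.01, 0.5
x = rng.uniform(-1, 1, size=(n, 1))              # n one-dimensional training inputs
y = np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(n)

def gaussian_kernel(a, b, sigma=sigma):
    # K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), computed for all pairs
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

K = gaussian_kernel(x, x)                        # (K_n)_{ij} = K(x_i, x_j)
c = np.linalg.solve(K + lam * n * np.eye(n), y)  # C_n = (K_n + lam*n*I)^{-1} Y_n

x_test = np.array([[0.3]])
f_test = gaussian_kernel(x_test, x) @ c          # f(x) = sum_i c_i K(x, x_i)
print(f_test[0], np.sin(3 * 0.3))                # prediction vs. underlying function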
For K(x, y) = <x, y>,
f(x) = Σ_{i=1}^N c_i K(x_i, x) = Σ_{i=1}^N c_i <x, x_i> = Σ_{i=1}^N Σ_{j=1}^D c_i (x_i)_j x_j = Σ_{j=1}^D w_j x_j,
with w_j = Σ_{i=1}^N c_i (x_i)_j, thus w = X c.
For linear kernels, ||f||_K^2 = c^T K c = c^T X^T X c = w^T w.
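A quick NumPy check of w = Xc and ||f||_K^2 = c^T K c = w^T w for the linear kernel; the dimensions and data are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
D, N = 7, 4
X = rng.standard_normal((D, N))      # columns are the training points x_i
c = rng.standard_normal(N)

K = X.T @ X                          # linear kernel: K_ij = <x_i, x_j>
w = X @ c                            # w_j = sum_i c_i (x_i)_j, i.e. w = X c

x = rng.standard_normal(D)
print(np.isclose(w @ x, sum(c[i] * (X[:, i] @ x) for i in range(N))))  # same f(x) either way
print(np.isclose(c @ K @ c, w @ w))  # ||f||_K^2 = c^T K c = w^T w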
Thus Y − MX = 0 if an exact solution exists; more generally, look for M that minimizes ||Y − MX||^2. The solution is given by setting the gradient to zero:
∇V(M) = −2(Y − MX) X^T = 0,
yielding Y X^T = M X X^T, that is, M = Y X^T (X X^T)^{-1}.
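A short check of the matrix closed form M = Y X^T (X X^T)^{-1}, here with multi-dimensional outputs so that M is a genuine matrix; noiseless targets are used so the generating M is recovered exactly.

import numpy as np

rng = np.random.default_rng(5)
k, p, n = 3, 5, 100                       # k outputs, p inputs, n examples
X = rng.standard_normal((p, n))
M_true = rng.standard_normal((k, p))
Y = M_true @ X                            # noiseless targets

M_hat = Y @ X.T @ np.linalg.inv(X @ X.T)  # M = Y X^T (X X^T)^{-1}
print(np.allclose(M_hat, M_true))         # True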
How could minimization be done in general, in practice, by the brain? Probably not by an analytic solution. The gradient offers a general way to compute a solution to a minimization problem: the dynamics
dM/dt = −γ ∇V(M)
finds the elements of M which correspond to min V(M). As an example, let us look again at min ||Y − MX||^2, using ∇V(M) = −2(Y − MX) X^T.
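A sketch of this gradient dynamics, discretized with a small constant step γ; it converges to the same M as the closed-form solution. The step size and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(6)
k, p, n, gamma = 2, 4, 50, 0.001
X = rng.standard_normal((p, n))
Y = rng.standard_normal((k, n))

M = np.zeros((k, p))
for _ in range(5000):
    grad = -2 * (Y - M @ X) @ X.T        # grad V(M) = -2 (Y - M X) X^T
    M = M - gamma * grad                 # dM/dt = -gamma * grad V(M), discretized

M_closed = Y @ X.T @ np.linalg.inv(X @ X.T)
print(np.allclose(M, M_closed, atol=1e-4))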
Let us make the example more specific. Assume that the y_i are scalar. Then M = w^T, and
min_{m_{i,j}} ||MX − Y||^2   becomes   min_{w∈R^d} (1/n) Σ_{i=1}^n (y_i − w^T x_i)^2,
yielding
∇V(M) = ∇V(w^T) = −(2/n) Σ_{i=1}^n (y_i − w^T x_i) x_i^T,
and thus (absorbing constants into γ_t)
dw^T/dt = γ_t Σ_{i=1}^n (y_i − w_t^T x_i) x_i^T.
Discretizing time in
dw^T/dt = γ_t Σ_{i=1}^n (y_i − w_t^T x_i) x_i^T
we obtain
w_{t+1}^T = w_t^T + γ_t Σ_{i=1}^n (y_i − w_t^T x_i) x_i^T.
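A runnable sketch of this discrete update for scalar targets; the step size γ_t is kept constant and the data are synthetic.

import numpy as np

rng = np.random.default_rng(7)
d, n, gamma = 3, 200, 0.001
X = rng.standard_normal((d, n))          # columns x_i
w_true = rng.standard_normal(d)
y = w_true @ X                           # scalar targets y_i = w_true^T x_i

w = np.zeros(d)
for t in range(2000):
    residual = y - w @ X                 # (y_i - w_t^T x_i) for all i
    w = w + gamma * X @ residual         # w_{t+1} = w_t + gamma_t * sum_i (y_i - w_t^T x_i) x_i

print(np.allclose(w, w_true, atol=1e-3))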
Gradient descent has several nice properties, but it is still not “biological”. The update
w_{t+1}^T = w_t^T + γ_t Σ_{i=1}^n (y_i − w_t^T x_i) x_i^T
can be written as
dw^T/dt = −γ_t Σ_{i=1}^n ∇V_i(w),
where V_i(w) = (y_i − w^T x_i)^2. Stochastic gradient descent instead uses one example at a time:
dw^T/dt = −γ_t ∇V_i(w),   i = 1, ..., n.
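A matching sketch of stochastic gradient descent, cycling through the examples one at a time with a constant step size (an illustrative choice).

import numpy as np

rng = np.random.default_rng(8)
d, n, gamma = 3, 200, 0.01
X = rng.standard_normal((d, n))
w_true = rng.standard_normal(d)
y = w_true @ X

w = np.zeros(d)
for epoch in range(50):
    for i in range(n):                                     # one example at a time
        grad_i = -2 * (y[i] - w @ X[:, i]) * X[:, i]       # grad of V_i(w) = (y_i - w^T x_i)^2
        w = w - gamma * grad_i                             # stochastic gradient step

print(np.allclose(w, w_true, atol=1e-2))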