MIT 9.520/6.860, Fall 2019
Statistical Learning Theory and Applications
Class 02: Statistical Learning Setting
Lorenzo Rosasco

Learning from examples

◮ Machine Learning deals with systems that are trained from data rather than being explicitly programmed.
◮ Here we describe the framework considered in statistical learning theory.

All starts with DATA

◮ Supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\}$.
◮ Unsupervised: $\{x_1, \dots, x_m\}$.
◮ Semi-supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\} \cup \{x_1, \dots, x_m\}$.

Supervised learning

Problem: given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$, find $f$ such that $f(x_{\mathrm{new}}) \approx y_{\mathrm{new}}$.

The supervised learning problem

◮ $X \times Y$ probability space, with measure $P$.
◮ $\ell : Y \times Y \to [0, \infty)$, measurable loss function.

Define the expected risk:
$$L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))].$$

Problem: solve
$$\min_{f : X \to Y} L(f),$$
given only
$$S_n = (x_1, y_1), \dots, (x_n, y_n) \sim P^n,$$
i.e. $n$ i.i.d. samples w.r.t. $P$, fixed but unknown.

Data space

◮ $X$ input space
◮ $Y$ output space

Input space

$X$ input space:
◮ Linear spaces, e.g.
  – vectors,
  – functions,
  – matrices/operators.
◮ “Structured” spaces, e.g.
  – strings,
  – probability distributions,
  – graphs.

Output space

$Y$ output space:
◮ Linear spaces, e.g.
  – $Y = \mathbb{R}$, regression,
  – $Y = \mathbb{R}^T$, multitask regression,
  – $Y$ Hilbert space, functional regression.
◮ “Structured” spaces, e.g.
  – $Y = \{-1, 1\}$, classification,
  – $Y = \{1, \dots, T\}$, multicategory classification,
  – strings,
  – probability distributions,
  – graphs.

Probability distribution

$$P(x, y) = P_X(x)\, P(y \mid x),$$

◮ $P_X$ marginal distribution on $X$,
◮ $P(y \mid x)$ conditional distribution on $Y$ given $x \in X$.

Reflects the uncertainty and stochasticity of the learning problem.

Conditional distribution and noise

Regression:
$$y_i = f^*(x_i) + \epsilon_i.$$

◮ Let $f^* : X \to Y$ be a fixed function,
◮ $\epsilon_1, \dots, \epsilon_n$ zero mean random variables, $\epsilon_i \sim \mathcal{N}(0, \sigma)$,
◮ $x_1, \dots, x_n$ random, so that
$$P(y \mid x) = \mathcal{N}(f^*(x), \sigma).$$

(Figure: noisy samples $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the target function $f^*$.)

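A minimal sketch of sampling data from this regression model; the particular $f^*$, noise level, and input distribution are my own toy choices, not specified on the slide.

```python
# Sample S_n from y_i = f*(x_i) + eps_i with Gaussian noise,
# so that P(y|x) = N(f*(x), sigma).
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    # Hypothetical target function, used only for illustration.
    return np.sin(2 * np.pi * x)

n, sigma = 100, 0.2
x = rng.uniform(0.0, 1.0, size=n)        # x_1, ..., x_n drawn at random
eps = sigma * rng.standard_normal(n)     # zero-mean Gaussian noise
y = f_star(x) + eps                      # y_i = f*(x_i) + eps_i
S_n = list(zip(x, y))                    # the training set S_n
```
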
Conditional distribution and misclassification

Classification:
$$P(y \mid x) = \{P(1 \mid x),\, P(-1 \mid x)\}.$$

Noise in classification: overlap between the classes,
$$\Delta_\delta = \left\{ x \in X \;:\; \left| P(1 \mid x) - \tfrac{1}{2} \right| \le \delta \right\}.$$

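A small sketch of sampling labels from a conditional distribution with class overlap; the sigmoid $P(1 \mid x)$ and the threshold $\delta$ are hypothetical choices of mine, not from the slide.

```python
# Sample y in {-1, +1} from P(y|x) and flag the overlap region Delta_delta.
import numpy as np

rng = np.random.default_rng(0)

def p_plus(x):
    # Hypothetical P(1|x): a sigmoid, so classes overlap where P(1|x) is near 1/2.
    return 1.0 / (1.0 + np.exp(-5.0 * x))

n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = np.where(rng.uniform(size=n) < p_plus(x), 1, -1)   # y ~ P(y|x)

delta = 0.1
overlap = np.abs(p_plus(x) - 0.5) <= delta             # x in Delta_delta
print(f"{overlap.mean():.0%} of the sampled points fall in Delta_delta")
```
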
Marginal distribution and sampling

$P_X$ takes into account uneven sampling of the input space.

Marginal distribution, densities and manifolds

$$p(x) = \frac{dP_X(x)}{dx} \quad \Rightarrow \quad p(x) = \frac{dP_X(x)}{d\,\mathrm{vol}(x)}$$

(Figure: two scatter plots of samples drawn according to $P_X$.)

Loss functions

$$\ell : Y \times Y \to [0, \infty)$$

◮ Cost of predicting $f(x)$ in place of $y$.
◮ Measures the pointwise error $\ell(y, f(x))$.
◮ Part of the problem definition, since $L(f) = \int_{X \times Y} \ell(y, f(x))\, dP(x, y)$.

Note: sometimes it is useful to consider losses of the form $\ell : Y \times G \to [0, \infty)$ for some space $G$, e.g. $G = \mathbb{R}$.

Loss for regression

$$\ell(y, y') = V(y - y'), \qquad V : \mathbb{R} \to [0, \infty).$$

◮ Square loss: $\ell(y, y') = (y - y')^2$.
◮ Absolute loss: $\ell(y, y') = |y - y'|$.
◮ $\epsilon$-insensitive loss: $\ell(y, y') = \max(|y - y'| - \epsilon, 0)$.

(Figure: plots of the square, absolute, and $\epsilon$-insensitive losses as functions of $y - y'$.)

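A short sketch, written by me but following the slide's formulas, of the three regression losses, each of the form $\ell(y, y') = V(y - y')$.

```python
import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.5):
    # Zero inside a tube of width eps around the target, linear outside.
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)

r = np.linspace(-1.0, 1.0, 5)   # residuals y - y'
print(square_loss(r, 0.0), absolute_loss(r, 0.0), eps_insensitive_loss(r, 0.0))
```
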
Loss for classification

$$\ell(y, y') = V(-y y'), \qquad V : \mathbb{R} \to [0, \infty).$$

◮ 0-1 loss: $\ell(y, y') = \Theta(-y y')$, where $\Theta(a) = 1$ if $a \ge 0$ and $0$ otherwise.
◮ Square loss: $\ell(y, y') = (1 - y y')^2$.
◮ Hinge loss: $\ell(y, y') = \max(1 - y y', 0)$.
◮ Logistic loss: $\ell(y, y') = \log(1 + \exp(-y y'))$.

(Figure: plots of the 0-1, square, hinge, and logistic losses as functions of the margin $y y'$.)

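A short sketch, again my own code following the slide's formulas, of the four classification losses written as functions of the margin $m = y\,y'$.

```python
import numpy as np

def zero_one_loss(m):
    # Theta(-m): 1 if the margin is <= 0 (wrong side or on the boundary), else 0.
    return (m <= 0).astype(float)

def square_loss(m):
    return (1.0 - m) ** 2

def hinge_loss(m):
    return np.maximum(1.0 - m, 0.0)

def logistic_loss(m):
    return np.log1p(np.exp(-m))

m = np.array([-1.0, 0.0, 0.5, 2.0])   # margins y * f(x)
for loss in (zero_one_loss, square_loss, hinge_loss, logistic_loss):
    print(loss.__name__, loss(m))
```
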
Loss functions for structured prediction

Loss specific for each learning task, e.g.
◮ Multiclass: square loss, weighted square loss, logistic loss, …
◮ Multitask: weighted square loss, absolute loss, …
◮ …

Expected risk

$$L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))] = \int_{X \times Y} \ell(y, f(x))\, dP(x, y),$$

with $f \in \mathcal{F}$, $\mathcal{F} = \{ f : X \to Y \mid f \text{ measurable} \}$.

Example:
$$Y = \{-1, +1\}, \quad \ell(y, f(x)) = \Theta(-y f(x)) \quad \Rightarrow \quad L(f) = P(\{(x, y) \in X \times Y \mid f(x) \ne y\}).$$

(Here $\Theta(a) = 1$ if $a \ge 0$ and $0$ otherwise.)

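A minimal sketch of approximating the expected risk by a Monte Carlo average over samples from $P$, here with the 0-1 loss so that $L(f)$ is the misclassification probability. The distribution and the candidate $f$ are toy choices of my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_plus(x):
    # Hypothetical conditional probability P(1|x).
    return 1.0 / (1.0 + np.exp(-4.0 * x))

def sample(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < p_plus(x), 1, -1)
    return x, y

def f(x):
    # Some fixed measurable candidate f: X -> {-1, +1}.
    return np.where(x > 0.2, 1, -1)

x, y = sample(1_000_000)
L_hat = np.mean(f(x) != y)    # Monte Carlo estimate of P(f(x) != y)
print(f"estimated expected 0-1 risk: {L_hat:.4f}")
```
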
Target function

$$f_P = \arg\min_{f \in \mathcal{F}} L(f)$$

can be derived for many loss functions. It is possible to show that
$$L(f) = \int dP(x, y)\, \ell(y, f(x)) = \int dP_X(x) \underbrace{\int \ell(y, f(x))\, dP(y \mid x)}_{L_x(f(x))},$$

◮ $\displaystyle \inf_{f \in \mathcal{F}} L(f) = \int dP_X(x) \inf_{a \in \mathbb{R}} L_x(a)$.
◮ Minimizers of $L(f)$ can be derived “pointwise” from the inner risk $L_x(f(x))$.
◮ Measurability of this pointwise definition of $f_P$ can be ensured.

Target functions in regression

$$f_P(x) = \arg\min_{a \in \mathbb{R}} L_x(a).$$

◮ Square loss: $\displaystyle f_P(x) = \int_Y y\, dP(y \mid x)$.
◮ Absolute loss: $f_P(x) = \mathrm{median}(P(y \mid x))$, where
$$\mathrm{median}(p(\cdot)) = y \;\text{ s.t. }\; \int_{-\infty}^{y} dp(t) = \int_{y}^{+\infty} dp(t).$$

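A quick numerical sanity check with a toy model of my own choosing: for the square loss the target $f_P(x)$ is the conditional mean of $P(y \mid x)$, for the absolute loss it is the conditional median; with symmetric Gaussian noise both coincide with $f^*(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_star(x):
    return np.sin(2 * np.pi * x)

x0, sigma, n = 0.3, 0.5, 200_000
y = f_star(x0) + sigma * rng.standard_normal(n)   # samples from P(y | x = x0)

print("f*(x0)             :", f_star(x0))
print("conditional mean   :", y.mean())      # minimizer of E[(y - a)^2 | x0]
print("conditional median :", np.median(y))  # minimizer of E[|y - a| | x0]
```
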
Target functions in classification

◮ Misclassification loss: $f_P(x) = \mathrm{sign}(P(1 \mid x) - P(-1 \mid x))$.
◮ Square loss: $f_P(x) = P(1 \mid x) - P(-1 \mid x)$.
◮ Logistic loss: $\displaystyle f_P(x) = \log \frac{P(1 \mid x)}{P(-1 \mid x)}$.
◮ Hinge loss: $f_P(x) = \mathrm{sign}(P(1 \mid x) - P(-1 \mid x))$.

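A small sketch (my own code) evaluating the slide's four target functions when the conditional probability $P(1 \mid x)$ is known; the sigmoid used for $P(1 \mid x)$ is a hypothetical choice.

```python
import numpy as np

def p_plus(x):
    return 1.0 / (1.0 + np.exp(-3.0 * x))   # hypothetical P(1|x)

def targets(x):
    p1 = p_plus(x)
    pm1 = 1.0 - p1
    return {
        "misclassification": np.sign(p1 - pm1),   # Bayes classifier
        "square":            p1 - pm1,            # 2 P(1|x) - 1
        "logistic":          np.log(p1 / pm1),    # log-odds
        "hinge":             np.sign(p1 - pm1),   # again the Bayes classifier
    }

for name, value in targets(np.array([-0.5, 0.0, 0.5])).items():
    print(name, value)
```
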
Different loss, different target

◮ Each loss function defines a different optimal target function. Learning enters the picture when the latter is impossible or hard to compute (as in simulations).
◮ As we see in the following, loss functions also differ in terms of the induced computations.

Learning algorithms

Solve
$$\min_{f \in \mathcal{F}} L(f),$$
given only $S_n = (x_1, y_1), \dots, (x_n, y_n) \sim P^n$.

Learning algorithm:
$$S_n \mapsto \hat{f}_n = \hat{f}_{S_n}.$$

$\hat{f}_n$ estimates $f_P$ given the observed examples $S_n$.

How to measure the error of such an estimate?

Excess risk

Excess risk:
$$L(\hat{f}) - \min_{f \in \mathcal{F}} L(f).$$

Consistency: for any $\epsilon > 0$,
$$\lim_{n \to \infty} P\!\left( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) > \epsilon \right) = 0.$$

Other forms of consistency

Consistency in expectation:
$$\lim_{n \to \infty} \mathbb{E}\!\left[ L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \right] = 0.$$

Consistency almost surely:
$$P\!\left( \lim_{n \to \infty} L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) = 0 \right) = 1.$$

Note: the different notions of consistency correspond to different notions of convergence for random variables: weak (in probability), in expectation, and almost sure.

Sample complexity, tail bounds and error bounds

◮ Sample complexity: for any $\epsilon > 0$, $\delta \in (0, 1]$, when $n \ge n_{P,\mathcal{F}}(\epsilon, \delta)$,
$$P\!\left( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \ge \epsilon \right) \le \delta.$$
◮ Tail bounds: for any $\epsilon > 0$, $n \in \mathbb{N}$,
$$P\!\left( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \ge \epsilon \right) \le \delta_{P,\mathcal{F}}(n, \epsilon).$$
◮ Error bounds: for any $\delta \in (0, 1]$, $n \in \mathbb{N}$,
$$P\!\left( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \le \epsilon_{P,\mathcal{F}}(n, \delta) \right) \ge 1 - \delta.$$

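As a concrete illustration, not stated on the slide but a standard textbook example: for empirical risk minimization over a finite class $\mathcal{H}$ with a loss bounded in $[0, M]$, Hoeffding's inequality with a union bound (and the decomposition $L(\hat f) - \min_{\mathcal H} L \le 2 \sup_{f \in \mathcal H} |\hat L(f) - L(f)|$) gives a tail bound, which can then be inverted into an error bound and a sample complexity.

```latex
% Tail bound (Hoeffding + union bound, finite H, loss in [0, M]):
\[
P\Big( L(\hat f) - \min_{f \in \mathcal H} L(f) \ge \epsilon \Big)
\;\le\; 2|\mathcal H|\, e^{-n\epsilon^2 / (2M^2)} \;=\; \delta_{P,\mathcal H}(n,\epsilon).
\]
% Setting the right-hand side equal to \delta and solving for \epsilon (error bound)
% or for n (sample complexity):
\[
\epsilon_{P,\mathcal H}(n,\delta) = M\sqrt{\tfrac{2}{n}\log\tfrac{2|\mathcal H|}{\delta}},
\qquad
n_{P,\mathcal H}(\epsilon,\delta) = \tfrac{2M^2}{\epsilon^2}\log\tfrac{2|\mathcal H|}{\delta}.
\]
```
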
No free-lunch theorem

A good algorithm should have small sample complexity for many distributions $P$.

No free lunch: is it possible to have an algorithm with small (finite) sample complexity for all problems? The no free lunch theorem provides a negative answer: given any algorithm, there exists a problem for which its learning performance is arbitrarily bad.

Algorithm design: complexity and regularization

The design of most algorithms proceeds as follows:
◮ Pick a (possibly large) class of functions $\mathcal{H}$, ideally such that
$$\min_{f \in \mathcal{H}} L(f) = \min_{f \in \mathcal{F}} L(f).$$
◮ Define a procedure $A_\gamma(S_n) = \hat{f}_\gamma \in \mathcal{H}$ to explore the space $\mathcal{H}$.

Bias and variance

Key error decomposition. Let $f_\gamma$ be the solution obtained with an infinite number of examples. Then
$$L(\hat{f}_\gamma) - \min_{f \in \mathcal{H}} L(f) = \underbrace{L(\hat{f}_\gamma) - L(f_\gamma)}_{\text{Variance / Estimation}} + \underbrace{L(f_\gamma) - \min_{f \in \mathcal{H}} L(f)}_{\text{Bias / Approximation}}$$

Small bias leads to a good data fit, high variance to possible instability.

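A toy numerical illustration of the decomposition; the setup is my own, not from the slide: $\mathcal{H}$ is linear functions of degree-3 polynomial features, $A_\gamma$ is ridge regression with parameter $\gamma$, $f_\gamma$ is approximated by solving with a very large sample, and $\min_{f \in \mathcal{H}} L(f)$ by the unregularized fit on that same large sample.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3

def f_star(x):
    return np.sin(2 * np.pi * x)

def phi(x, degree=3):
    return np.vander(x, degree + 1, increasing=True)   # Phi(x) for f(x) = w^T Phi(x)

def fit(x, y, gamma):
    # Ridge: minimize (1/n) sum (y_i - w^T Phi(x_i))^2 + gamma ||w||^2.
    Phi = phi(x)
    return np.linalg.solve(Phi.T @ Phi + gamma * len(x) * np.eye(Phi.shape[1]), Phi.T @ y)

def risk(w, n_test=500_000):
    # Monte Carlo estimate of the square-loss expected risk L(f_w).
    x = rng.uniform(0.0, 1.0, n_test)
    y = f_star(x) + sigma * rng.standard_normal(n_test)
    return np.mean((y - phi(x) @ w) ** 2)

def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, f_star(x) + sigma * rng.standard_normal(n)

gamma, n = 1e-1, 30
x_big, y_big = sample(500_000)
w_gamma = fit(x_big, y_big, gamma)   # proxy for f_gamma (infinite-data solution)
w_best = fit(x_big, y_big, 0.0)      # proxy for argmin over H of L(f)
w_hat = fit(*sample(n), gamma)       # f_hat_gamma from n examples

print("estimation   :", risk(w_hat) - risk(w_gamma))
print("approximation:", risk(w_gamma) - risk(w_best))
```
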
ERM and structural risk minimization

Consider $(\mathcal{H}_\gamma)_\gamma$ such that
$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \dots \subset \mathcal{H}_\gamma \subset \dots \subset \mathcal{H}.$$
Then, let
$$\hat{f}_\gamma = \arg\min_{f \in \mathcal{H}_\gamma} \hat{L}(f), \qquad \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)).$$

Example (a classical one): $\mathcal{H}_\gamma$ are functions $f(x) = w^\top x$ (or $f(x) = w^\top \Phi(x)$) such that $\|w\| \le \gamma$.

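A minimal sketch of ERM over the classical example $\mathcal{H}_\gamma = \{ f(x) = w^\top x : \|w\| \le \gamma \}$; the use of projected gradient descent, the square loss, and the toy data are my own implementation choices, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(w, gamma):
    # Euclidean projection onto the ball {||w|| <= gamma}.
    norm = np.linalg.norm(w)
    return w if norm <= gamma else w * (gamma / norm)

def erm_constrained(X, y, gamma, steps=2000, lr=0.01):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y)   # gradient of (1/n) sum (y_i - w^T x_i)^2
        w = project(w - lr * grad, gamma)
    return w

# Toy data: y = w*^T x + noise, with ||w*|| exceeding the smaller gammas below.
n, d = 200, 5
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

for gamma in (0.5, 1.0, 3.0):   # nested classes H_0.5, H_1, H_3
    w_hat = erm_constrained(X, y, gamma)
    print(f"gamma={gamma}: ||w_hat|| = {np.linalg.norm(w_hat):.2f}")
```
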