  1. RegML 2020 Class 1: Statistical Learning Theory. Lorenzo Rosasco, UNIGE-MIT-IIT

  2. All starts with DATA
◮ Supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\}$
◮ Unsupervised: $\{x_1, \dots, x_m\}$
◮ Semi-supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\} \cup \{x_1, \dots, x_m\}$

  3. Learning from examples

  4. Setting for the supervised learning problem
◮ $X \times Y$ probability space, with measure $\rho$.
◮ $S_n = (x_1, y_1), \dots, (x_n, y_n) \sim \rho^n$, i.e. sampled i.i.d.
◮ $L : Y \times Y \to [0, \infty)$, measurable loss function.
◮ Expected risk
  $\mathcal{E}(f) = \int_{X \times Y} L(y, f(x)) \, d\rho(x, y)$.
Problem: solve
  $\min_{f : X \to Y} \mathcal{E}(f)$,
given only $S_n$ ($\rho$ fixed, but unknown).

  5. Data space: $X$ input space, $Y$ output space.

  6. Input space
$X$ input space:
◮ linear spaces, e.g.
  – vectors,
  – functions,
  – matrices/operators
◮ "structured" spaces, e.g.
  – strings,
  – probability distributions,
  – graphs

  7. Output space
$Y$ output space:
◮ linear spaces, e.g.
  – $Y = \mathbb{R}$, regression,
  – $Y = \mathbb{R}^T$, multi-task regression,
  – $Y$ Hilbert space, functional regression,
◮ "structured" spaces, e.g.
  – $Y = \{+1, -1\}$, classification,
  – $Y = \{1, \dots, T\}$, multi-class classification,
  – strings,
  – probability distributions,
  – graphs

  8. Probability distribution
Reflects uncertainty and stochasticity of the learning problem:
  $\rho(x, y) = \rho_X(x) \, \rho(y \mid x)$,
◮ $\rho_X$ marginal distribution on $X$,
◮ $\rho(y \mid x)$ conditional distribution on $Y$ given $x \in X$.

  9. Conditional distribution and noise
[Figure: samples $(x_1, y_1), \dots, (x_5, y_5)$ scattered around a function $f^*$]
Regression: $y_i = f^*(x_i) + \epsilon_i$,
◮ $f^* : X \to Y$ a fixed function,
◮ $\epsilon_1, \dots, \epsilon_n$ zero-mean random variables,
◮ $x_1, \dots, x_n$ random.
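
The regression model above is easy to simulate. Below is a minimal sketch, assuming a hypothetical target f_star (a sine wave) and Gaussian noise; these particular choices are illustrative and not part of the slides.

    # Sketch of the regression data model y_i = f*(x_i) + eps_i, with x_i drawn
    # from the marginal rho_X and eps_i zero-mean noise. f_star and the noise
    # level are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    def f_star(x):
        # hypothetical "true" regression function
        return np.sin(2 * np.pi * x)

    n = 50
    x = rng.uniform(0.0, 1.0, size=n)      # x_1, ..., x_n ~ rho_X
    eps = rng.normal(0.0, 0.1, size=n)     # zero-mean noise
    y = f_star(x) + eps                    # y_i = f*(x_i) + eps_i
    S_n = list(zip(x, y))                  # the training set S_n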

  10. Conditional distribution and misclassification
Classification: $\rho(y \mid x) = \{\rho(1 \mid x), \rho(-1 \mid x)\}$.
Noise in classification: overlap between the classes,
  $\Delta_t = \{\, x \in X \;:\; |\rho(1 \mid x) - \rho(-1 \mid x)| \le t \,\}$.
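
A minimal sketch of the noisy region $\Delta_t$, computed on a grid for an illustrative conditional probability rho_plus standing in for $\rho(1 \mid x)$ (that choice is an assumption, not from the slides):

    # Sketch: Delta_t = {x : |rho(1|x) - rho(-1|x)| <= t} for a hypothetical rho(1|x).
    import numpy as np

    def rho_plus(x):
        # hypothetical rho(1|x)
        return 1.0 / (1.0 + np.exp(-4.0 * x))

    x = np.linspace(-1.0, 1.0, 2001)
    t = 0.5
    # note |rho(1|x) - rho(-1|x)| = |2 rho(1|x) - 1|
    delta_t = x[np.abs(2.0 * rho_plus(x) - 1.0) <= t]
    print(delta_t.min(), delta_t.max())    # extent of the noisy region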

  11. Marginal distribution and sampling
$\rho_X$ takes into account uneven sampling of the input space.

  12. Marginal distribution, densities and manifolds
  $p(x) = \dfrac{d\rho_X(x)}{dx} \;\to\; p(x) = \dfrac{d\rho_X(x)}{d\,\mathrm{vol}(x)}$
[Figure: two sample scatter plots over $[-1, 1] \times [-1, 1]$]

  13. Loss functions
  $L : Y \times Y \to [0, \infty)$,
◮ the cost of predicting $f(x)$ in place of $y$,
◮ part of the problem definition: $\mathcal{E}(f) = \int L(y, f(x)) \, d\rho(x, y)$,
◮ measures the pointwise error.

  14. Losses for regression
  $L(y, y') = L(y - y')$
◮ Square loss: $L(y, y') = (y - y')^2$,
◮ Absolute loss: $L(y, y') = |y - y'|$,
◮ $\epsilon$-insensitive: $L(y, y') = \max(|y - y'| - \epsilon, 0)$.
[Figure: square, absolute and $\epsilon$-insensitive losses as functions of $y - y'$]
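
A minimal sketch of these three regression losses, written as functions of the residual $r = y - y'$; the function names and the default value of eps are illustrative.

    # Regression losses as functions of the residual r = y - y'.
    import numpy as np

    def square_loss(r):
        return r ** 2

    def absolute_loss(r):
        return np.abs(r)

    def eps_insensitive_loss(r, eps=0.5):
        # zero inside the tube |r| <= eps, linear outside
        return np.maximum(np.abs(r) - eps, 0.0)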

  15. Losses for classification
  $L(y, y') = L(-y y')$
◮ 0-1 loss: $L(y, y') = \mathbf{1}_{\{-y y' > 0\}}$,
◮ Square loss: $L(y, y') = (1 - y y')^2$,
◮ Hinge loss: $L(y, y') = \max(1 - y y', 0)$,
◮ Logistic loss: $L(y, y') = \log(1 + \exp(-y y'))$.
[Figure: 0-1, square, hinge and logistic losses as functions of the margin $y y'$]
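
A companion sketch of the classification losses, written as functions of the margin $m = y f(x)$ (inputs are assumed to be NumPy arrays; names are illustrative):

    # Classification losses as functions of the margin m = y * f(x).
    import numpy as np

    def zero_one_loss(m):
        return (m < 0).astype(float)     # 1 on sign disagreement, 0 otherwise

    def square_loss_cls(m):
        return (1.0 - m) ** 2

    def hinge_loss(m):
        return np.maximum(1.0 - m, 0.0)

    def logistic_loss(m):
        return np.log1p(np.exp(-m))

    m = np.linspace(-2.0, 2.0, 5)
    print(hinge_loss(m), logistic_loss(m))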

  16. Losses for structured prediction
Loss specific for each learning task, e.g.
◮ Multi-class: square loss, weighted square loss, logistic loss, ...
◮ Multi-task: weighted square loss, absolute loss, ...
◮ ...

  17. Expected risk
  $\mathcal{E}(f) = \mathcal{E}_L(f) = \int_{X \times Y} L(y, f(x)) \, d\rho(x, y)$
Note that $f \in \mathcal{F}$, where $\mathcal{F} = \{ f : X \to Y \mid f \text{ measurable} \}$.
Example: $Y = \{-1, +1\}$, $L(y, f(x)) = \mathbf{1}_{\{-y f(x) > 0\}}$, then
  $\mathcal{E}(f) = \mathbb{P}(\{ (x, y) \in X \times Y \mid f(x) \neq y \})$.
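
Since $\rho$ is unknown, the expected risk can only be approximated; on a large i.i.d. sample it is simply the average loss. A minimal sketch for the 0-1 loss example above, where both the data-generating distribution and the classifier f are illustrative assumptions:

    # Monte Carlo approximation of the expected risk for the 0-1 loss,
    # i.e. the misclassification probability P(f(x) != y).
    import numpy as np

    rng = np.random.default_rng(0)

    def sample_rho(n):
        # x ~ rho_X (uniform on [-1, 1]), y | x with P(y = 1 | x) = sigmoid(4x)
        x = rng.uniform(-1.0, 1.0, size=n)
        p_plus = 1.0 / (1.0 + np.exp(-4.0 * x))
        y = np.where(rng.uniform(size=n) < p_plus, 1, -1)
        return x, y

    def f(x):
        # a simple candidate classifier
        return np.where(x > 0, 1, -1)

    x, y = sample_rho(100_000)
    risk_estimate = np.mean(f(x) != y)   # approximates E(f) = P(f(x) != y)
    print(risk_estimate)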

  18. Target function
  $f_\rho = \operatorname*{arg\,min}_{f \in \mathcal{F}} \mathcal{E}(f)$,
can be derived for many loss functions...

  19. Target functions in regression
Square loss:
  $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$
Absolute loss:
  $f_\rho(x) = \operatorname{median} \rho(y \mid x)$,
where
  $\operatorname{median} p(\cdot) = y$ s.t. $\int_{-\infty}^{y} dp(t) = \int_{y}^{+\infty} dp(t)$.
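
A quick numerical check of these two facts at a fixed $x$: the expected square loss of a constant prediction is minimized by the conditional mean, and the expected absolute loss by the conditional median. The conditional distribution below (an exponential, so mean and median differ) is an illustrative stand-in for $\rho(y \mid x)$.

    # Constant predictions c scored under square and absolute loss on a sample of y's.
    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.exponential(scale=1.0, size=200_000)    # skewed, so mean != median

    cs = np.linspace(0.0, 3.0, 601)                 # candidate constant predictions
    sq_risk = [np.mean((y - c) ** 2) for c in cs]
    abs_risk = [np.mean(np.abs(y - c)) for c in cs]

    print(cs[np.argmin(sq_risk)], np.mean(y))       # minimizer ~ conditional mean
    print(cs[np.argmin(abs_risk)], np.median(y))    # minimizer ~ conditional median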

  20. Target functions in classification
0-1 loss:
  $f_\rho(x) = \operatorname{sign}(\rho(1 \mid x) - \rho(-1 \mid x))$
Square loss:
  $f_\rho(x) = \rho(1 \mid x) - \rho(-1 \mid x)$
Logistic loss:
  $f_\rho(x) = \log \dfrac{\rho(1 \mid x)}{\rho(-1 \mid x)}$
Hinge loss:
  $f_\rho(x) = \operatorname{sign}(\rho(1 \mid x) - \rho(-1 \mid x))$
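
These target functions only require the conditional probability $\rho(1 \mid x)$. A minimal sketch, with an illustrative choice of rho_plus standing in for $\rho(1 \mid x)$:

    # Classification target functions computed from a given rho(1|x).
    import numpy as np

    def rho_plus(x):
        # hypothetical rho(1|x)
        return 1.0 / (1.0 + np.exp(-4.0 * x))

    def targets(x):
        p = rho_plus(x)            # rho(1|x)
        q = 1.0 - p                # rho(-1|x)
        return {
            "0-1 / hinge": np.sign(p - q),
            "square":      p - q,
            "logistic":    np.log(p / q),
        }

    print(targets(np.array([-0.5, 0.25, 1.0])))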

  21. Learning algorithms
  $S_n \mapsto \hat f_n = \hat f_{S_n}$
$\hat f_n$ estimates $f_\rho$ given the observed examples $S_n$.
How to measure the error of an estimator?

  22. Excess risk
Excess risk:
  $\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)$,
Consistency: for any $\epsilon > 0$,
  $\lim_{n \to \infty} \mathbb{P}\big( \mathcal{E}(\hat f_n) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \big) = 0$.

  23. Tail bounds, sample complexity and error bounds
◮ Tail bounds: for any $\epsilon > 0$, $n \in \mathbb{N}$,
  $\mathbb{P}\big( \mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \big) \le \delta(n, \mathcal{F}, \epsilon)$
◮ Sample complexity: for any $\epsilon > 0$, $\delta \in (0, 1]$, when $n \ge n_0(\epsilon, \delta, \mathcal{F})$,
  $\mathbb{P}\big( \mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon \big) \le \delta$,
◮ Error bounds: for any $\delta \in (0, 1]$, $n \in \mathbb{N}$, with probability at least $1 - \delta$,
  $\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \le \epsilon(n, \mathcal{F}, \delta)$.
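
A Monte Carlo illustration of a tail bound in the simplest possible setting (this setup is an illustrative assumption, not from the slides): with the square loss and constant predictors, the best constant is the mean $\mu$ of $y$, and the excess risk of the empirical mean on $S_n$ is $(\bar y_n - \mu)^2$. The empirical tail probability $\mathbb{P}(\text{excess risk} > \epsilon)$ shrinks as $n$ grows.

    # Empirical tail probability of the excess risk of the empirical mean.
    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, eps, trials = 0.0, 1.0, 0.05, 2000

    for n in [10, 100, 1000]:
        samples = rng.normal(mu, sigma, size=(trials, n))
        excess = (samples.mean(axis=1) - mu) ** 2   # excess risk per trial
        print(n, np.mean(excess > eps))             # estimate of P(excess > eps)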

  24. Error bounds and no-free-lunch theorem
Theorem. For any $\hat f$, there exists a problem for which
  $\mathbb{E}\big[ \mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \big] > 0$.

  25. No-free-lunch theorem continued
Theorem. For any $\hat f$, there exists a $\rho$ such that
  $\mathbb{E}\big[ \mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \big] > 0$.
The way out: restrict the search from $\mathcal{F}$ to a hypothesis space $\mathcal{H}$.

  26. Hypothesis space
  $\mathcal{H} \subset \mathcal{F}$
E.g. $X = \mathbb{R}^d$,
  $\mathcal{H} = \{ f(x) = \langle w, x \rangle = \sum_{j=1}^{d} w_j x_j \mid w \in \mathbb{R}^d, \ \forall x \in X \}$,
then $\mathcal{H} \simeq \mathbb{R}^d$.

  27. Finite dictionaries
  $D = \{ \phi_i : X \to \mathbb{R} \mid i = 1, \dots, p \}$
  $\mathcal{H} = \{ f(x) = \sum_{j=1}^{p} w_j \phi_j(x) \mid w_1, \dots, w_p \in \mathbb{R}, \ \forall x \in X \}$
  $f(x) = w^\top \Phi(x)$, with $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$.
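
A minimal sketch of a hypothesis space built from a finite dictionary: Phi maps $x$ to the feature vector $(\phi_1(x), \dots, \phi_p(x))$ and every $f \in \mathcal{H}$ is $f(x) = w^\top \Phi(x)$. The monomial dictionary below is an illustrative choice; with $\Phi(x) = x$ one recovers the linear hypothesis space of the previous slide.

    # A dictionary-based hypothesis space f(x) = w . Phi(x).
    import numpy as np

    def Phi(x, p=4):
        # dictionary of monomials phi_j(x) = x^j, j = 1, ..., p (illustrative)
        return np.array([x ** j for j in range(1, p + 1)])

    def f(x, w):
        # an element of H, parameterized by w in R^p
        return w @ Phi(x)

    w = np.array([1.0, -0.5, 0.0, 0.25])
    print(f(0.3, w))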

  28. This class
Learning theory ingredients:
◮ data space/distribution,
◮ loss function, risks and target functions,
◮ learning algorithms and error estimates,
◮ hypothesis space.

  29. Next class
◮ Regularized learning algorithms: penalization
◮ Statistics and computations
◮ Nonparametrics and kernels
