  1. Data Sciences – CentraleSupelec
     Advanced Machine Learning, Course II – Linear regression / Linear classification
     Emilie Chouzenoux, Center for Visual Computing, CentraleSupelec
     emilie.chouzenoux@centralesupelec.fr

  2. Linear regression
     Motivations:
     ◮ Simple approach (essential for understanding more sophisticated ones)
     ◮ Interpretable description of the relations between inputs and outputs
     ◮ Can outperform nonlinear models when training data are few, noisy, or sparse
     ◮ Extended applicability when combined with basis-function methods (see Lab)
     Applications: prediction of
     ◮ Future product sales, based on past buying behaviour
     ◮ Economic growth of a country or state
     ◮ How many houses will sell in the coming months, and at what price
     ◮ Number of goals a player will score in coming matches, based on previous performances
     ◮ Exam results, as a function of the hours of study a student puts in

  3. Linear model
     Training data: $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $i = 1, \dots, n$.
     $(x_i)_{1 \le i \le n}$ are inputs / transformed versions of the inputs (e.g., through a log) / basis expansions.
     Fitting model: $y_i \approx f(x_i)$ $(\forall i = 1, \dots, n)$ with, for every $i \in \{1, \dots, n\}$,
     $$ f(x_i) = \beta_0 \cdot 1 + \beta_1 x_{i1} + \dots + \beta_d x_{id} = x_i'^\top \beta = [X\beta]_i, $$
     where $X \in \mathbb{R}^{n \times (d+1)}$ is the matrix whose $i$-th row is $x_i' = [1, x_{i1}, \dots, x_{id}]$.
     $[\beta_1, \dots, \beta_d]$ defines a hyperplane in $\mathbb{R}^d$, and $\beta_0$ can be viewed as a bias shifting the function $f$ perpendicularly to that hyperplane.
     Goal: using the training set, learn the linear function $f$ (parametrized by $\beta$) that predicts a real value $y$ from an observation $x$.
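To make the notation concrete, here is a minimal NumPy sketch (my addition, not from the slides; all names and values are illustrative) that builds the augmented design matrix $X$ with a leading column of ones and evaluates $f(x_i) = [X\beta]_i$ for an arbitrary $\beta$:

```python
import numpy as np

# Toy data: n = 5 samples in d = 2 dimensions (illustrative values).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))           # raw inputs x_i in R^d
beta = np.array([0.5, 2.0, -1.0])     # [beta_0, beta_1, beta_2]

# Augmented design matrix: i-th row is x_i' = [1, x_i1, ..., x_id].
X = np.hstack([np.ones((x.shape[0], 1)), x])   # shape (n, d+1)

# Fitted values f(x_i) = x_i'^T beta = [X beta]_i, computed all at once.
f = X @ beta
print(f.shape)    # (5,)
```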

  4. Least squares
     Principle: search for $\beta$ that minimizes the sum of squared residuals
     $$ F(\beta) = \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 = \frac{1}{2} \|X\beta - y\|^2 = \frac{1}{2} \|e\|^2, $$
     with $e = X\beta - y$ the residual vector.

  5. Optimization (reminders?)
     We search for a solution to $\min_\beta F(\beta)$, where $F : \mathbb{R}^{d+1} \to \mathbb{R}$ is convex.
     $\hat\beta$ is a minimizer if and only if $\nabla F(\hat\beta) = 0$, where $\nabla F$ is the gradient of $F$, such that
     $$ [\nabla F(\beta)]_j = \frac{\partial F(\beta)}{\partial \beta_j} \quad (\forall j = 0, \dots, d). $$
     Note that $F$ also reads
     $$ F(\beta) = \frac{1}{2} y^\top y - \beta^\top X^\top y + \frac{1}{2} \beta^\top X^\top X \beta. $$
     Its gradient is $\nabla F(\beta) = -X^\top y + X^\top X \beta$.
     Assuming that $X$ has full column rank, $X^\top X$ is positive definite, the solution is unique, and it reads
     $$ \hat\beta = (X^\top X)^{-1} X^\top y. $$
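As a quick numerical illustration of the closed form (a sketch under the full-column-rank assumption, not course code), the estimator can be computed with NumPy; in practice the SVD-based np.linalg.lstsq is preferred over forming $(X^\top X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # design matrix with intercept
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)               # noisy linear observations

# Normal equations, as on the slide: beta_hat solves (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically safer equivalent (SVD-based least squares).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))   # True (up to rounding)
```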

  6. White board

  7. Interpretation
     The fitted values at the training inputs are
     $$ \hat y = X\hat\beta = X (X^\top X)^{-1} X^\top y = H y, $$
     where $H$ is called the "hat matrix". This matrix computes the orthogonal projection of $y$ onto the linear subspace spanned by the columns of $X$.
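A small self-contained sketch (illustrative, not from the slides) checking the projection interpretation: $H$ is symmetric and idempotent, and $Hy$ coincides with the fitted values $X\hat\beta$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T, computed via a linear solve rather than an explicit inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(H @ y, X @ beta_hat))  # fitted values: H y = X beta_hat
print(np.allclose(H @ H, H))             # idempotent: H is a projection
print(np.allclose(H, H.T))               # symmetric: an orthogonal projection
```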

  8. Statistical properties
     Variance: $\mathrm{Var}(\hat\beta) = (X^\top X)^{-1} \sigma^2$ for uncorrelated observations $y_i$ with variance $\sigma^2$ and deterministic $x_i$.
     Unbiased estimator:
     $$ \hat\sigma^2 = \frac{1}{n - (d+1)} \sum_{i=1}^{n} (y_i - \hat y_i)^2. $$
     Inference properties: assume that $Y = \beta_0 + \sum_{j=1}^{d} X_j \beta_j + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then $\hat\beta$ and $\hat\sigma$ are independent, and
     ◮ $\hat\beta \sim \mathcal{N}\big(\beta, (X^\top X)^{-1} \sigma^2\big)$
     ◮ $(n - (d+1))\, \hat\sigma^2 \sim \sigma^2 \chi^2_{n-(d+1)}$
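The following hedged sketch (my addition; names illustrative) computes the unbiased variance estimate $\hat\sigma^2$ and the resulting standard errors of $\hat\beta$ on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 2
sigma = 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
beta_true = np.array([1.0, -2.0, 3.0])
y = X @ beta_true + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Unbiased noise-variance estimate: sigma_hat^2 = ||resid||^2 / (n - (d+1)).
sigma2_hat = resid @ resid / (n - (d + 1))

# Estimated covariance of beta_hat: (X^T X)^{-1} sigma_hat^2; its diagonal gives squared standard errors.
cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat
print(np.sqrt(sigma2_hat), np.sqrt(np.diag(cov_beta)))
```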

  9. High-dimensional linear regression
     Problems with least-squares regression when $d$ is large:
     ◮ Accuracy: the hyperplane fits the training data well but predicts (generalizes) badly (low bias / large variance).
     ◮ Interpretation: we want to identify a small subset of features that are important/relevant for predicting the data.
     Regularization: $F(\beta) = \frac{1}{2}\|y - X\beta\|^2 + \lambda R(\beta)$, with
     ◮ ridge regression: $R(\beta) = \frac{1}{2}\|\beta\|^2$
     ◮ shrinkage (lasso): $R(\beta) = \|\beta\|_1$
     ◮ subset selection: $R(\beta) = \|\beta\|_0$
     An explicit solution exists in the case of ridge; otherwise an optimization method is usually needed!
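For ridge, the penalized normal equations give the closed form $\hat\beta_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. A minimal sketch follows (my addition; for simplicity it also penalizes the intercept, which in practice is usually left unpenalized):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 10
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
y = rng.normal(size=n)
lam = 1.0

# Ridge closed form: (X^T X + lambda I)^{-1} X^T y, via a linear solve.
# NOTE: this identity also penalizes the intercept; setting its first diagonal
# entry to 0 would leave beta_0 unshrunk.
I = np.eye(X.shape[1])
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)
print(beta_ridge)
```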

  10. White board

  11. Penalty functions
      Contour plots of $\sum_j |\beta_j|^q$ (figure).
      When the columns of $X$ are orthonormal, the estimators can be deduced from the LS estimator $\hat\beta$ according to:
      ◮ Ridge: $\hat\beta_j / (1 + \lambda)$ (weight decay)
      ◮ Lasso: $\operatorname{sign}(\hat\beta_j)\,\big(|\hat\beta_j| - \lambda\big)_+$ (soft thresholding)
      ◮ Best subset: $\hat\beta_j \cdot \delta_{\{\hat\beta_j^2 \ge 2\lambda\}}$ (hard thresholding)
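A short sketch (not from the slides; function names are mine) implementing the three orthonormal-design rules elementwise on a vector of least-squares coefficients:

```python
import numpy as np

def ridge_shrink(beta_ls, lam):
    """Ridge with orthonormal design: uniform weight decay."""
    return beta_ls / (1.0 + lam)

def soft_threshold(beta_ls, lam):
    """Lasso with orthonormal design: sign(b) * max(|b| - lam, 0)."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

def hard_threshold(beta_ls, lam):
    """Best subset (l0 penalty) with orthonormal design: keep b if b^2 >= 2*lam."""
    return beta_ls * (beta_ls**2 >= 2.0 * lam)

beta_ls = np.array([3.0, -0.8, 0.2, -2.5])
for rule in (ridge_shrink, soft_threshold, hard_threshold):
    print(rule.__name__, rule(beta_ls, lam=1.0))
```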

  12. White board

  13. White board

  14. Robust regression
      Challenge: estimation methods that are insensitive to outliers and, possibly, to high-leverage points.
      Approach: M-estimation,
      $$ F(\beta) = \sum_{i=1}^{n} \rho(y_i - x_i'^\top \beta), $$
      with $\rho$ a potential function satisfying:
      ◮ $\rho(e) \ge 0$ and $\rho(0) = 0$
      ◮ $\rho(e) = \rho(-e)$
      ◮ $\rho(e) \ge \rho(e')$ for $|e| \ge |e'|$
      A minimizer satisfies $\sum_{i=1}^{n} \dot\rho(y_i - x_i'^\top \hat\beta)\, x_i' = 0$, which leads to the IRLS algorithm.

  15. IRLS algorithm
      Core idea: let $\rho$ be defined as $\rho(x) = \varphi(|x|)$ for all $x \in \mathbb{R}$, where
      (i) $\varphi$ is differentiable on $]0, +\infty[$,
      (ii) $\varphi(\sqrt{\cdot})$ is concave on $]0, +\infty[$,
      (iii) $\dot\varphi(x) \ge 0$ for all $x \in [0, +\infty[$,
      (iv) $\lim_{x \to 0,\, x > 0} \omega(x) := \dot\varphi(x)/x \in \mathbb{R}$.
      Then, for all $y \in \mathbb{R}$,
      $$ (\forall x \in \mathbb{R}) \qquad \rho(x) \le \rho(y) + \dot\rho(y)\,(x - y) + \tfrac{1}{2}\, \omega(|y|)\, (x - y)^2. $$
      (Figure: a function $f$ and its quadratic majorant $h(\cdot, y)$, touching at $y$.)
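To connect this majorization with the IRLS update slide below, here is a brief derivation (my addition; presumably what the white-board step works out). Apply the inequality to each term of $F$ at $y = e_i^k := y_i - x_i'^\top \beta_k$, and use $\dot\rho(y) = \omega(|y|)\, y$:
$$ F(\beta) \;\le\; \sum_{i=1}^{n} \Big[ \rho(e_i^k) + \omega(|e_i^k|)\, e_i^k \big(e_i(\beta) - e_i^k\big) + \tfrac{1}{2}\, \omega(|e_i^k|)\, \big(e_i(\beta) - e_i^k\big)^2 \Big] \;=\; \mathrm{const} + \tfrac{1}{2}\, (y - X\beta)^\top W_k\, (y - X\beta), $$
where $e_i(\beta) = y_i - x_i'^\top \beta$ and $W_k = \mathrm{Diag}\big(\omega(|e_i^k|)\big)$. Minimizing the right-hand side over $\beta$ is a weighted least-squares problem, whose solution is exactly $\beta_{k+1} = (X^\top W_k X)^{-1} X^\top W_k y$.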

  16. Examples of functions $\rho$ (the corresponding weights $\omega(x)$ are left as an exercise)
      Convex:
      ◮ $|x| - \delta \log(|x|/\delta + 1)$
      ◮ $x^2/2$ if $|x| < \delta$, $\;\delta|x| - \delta^2/2$ otherwise (Huber function)
      ◮ $\log(\cosh(x))$
      ◮ $(1 + x^2/\delta^2)^{\kappa/2} - 1$
      Nonconvex:
      ◮ $1 - \exp\big(-x^2/(2\delta^2)\big)$
      ◮ $x^2/(2\delta^2 + x^2)$
      ◮ $1 - \big(1 - x^2/(6\delta^2)\big)^3$ if $|x| \le \sqrt{6}\,\delta$, $\;1$ otherwise (Tukey biweight)
      ◮ $\tanh\big(x^2/(2\delta^2)\big)$
      ◮ $\log(1 + x^2/\delta^2)$
      with $(\lambda, \delta) \in\, ]0, +\infty[^2$ and $\kappa \in [1, 2]$.
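As a hedged illustration of one row of the exercise (my own sketch, not the official solution), the following code checks numerically that, for the Huber potential, $\omega(x) = \dot\varphi(x)/x$ equals $\min(1, \delta/|x|)$ on $x > 0$:

```python
import numpy as np

delta = 1.5

def rho_huber(x, delta=delta):
    """Huber potential: x^2/2 if |x| < delta, else delta*|x| - delta^2/2."""
    x = np.abs(x)
    return np.where(x < delta, 0.5 * x**2, delta * x - 0.5 * delta**2)

def omega_huber(x, delta=delta):
    """Candidate IRLS weight for Huber: phi'(|x|)/|x| = min(1, delta/|x|)."""
    return np.minimum(1.0, delta / np.abs(x))

x = np.linspace(0.1, 5.0, 50)          # avoid x = 0, where omega takes its limit value 1
h = 1e-6
drho = (rho_huber(x + h) - rho_huber(x - h)) / (2 * h)   # finite-difference derivative of rho
print(np.allclose(drho / x, omega_huber(x), atol=1e-5))  # True: omega(x) = rho'(x)/x on x > 0
```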

  17. White board

  18. IRLS algorithm:
      $$ (\forall k \in \mathbb{N}) \qquad \beta_{k+1} = (X^\top W_k X)^{-1} X^\top W_k y, $$
      with the IRLS weight matrix $W_k = \mathrm{Diag}\big(\omega(y - X\beta_k)\big)$, where $\omega$ is applied entrywise to the residual vector.
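A compact IRLS sketch with the Huber weight (my addition with illustrative names, not the course's reference implementation): starting from the least-squares fit, it alternates between computing weights from the current residuals and solving the weighted normal equations:

```python
import numpy as np

def irls(X, y, delta=1.35, n_iter=50):
    """IRLS for M-estimation with the Huber potential.

    Iterates beta_{k+1} = (X^T W_k X)^{-1} X^T W_k y,
    with W_k = Diag(omega(|y - X beta_k|)) and omega(e) = min(1, delta/|e|).
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the LS solution
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.minimum(1.0, delta / np.maximum(np.abs(resid), 1e-12))  # Huber weights
        XtW = X.T * w                                # X^T W_k without forming Diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

# Toy example with outliers: IRLS should be less affected than plain least squares.
rng = np.random.default_rng(5)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)
y[:5] += 20.0                                        # a few gross outliers
print("LS  :", np.linalg.lstsq(X, y, rcond=None)[0])
print("IRLS:", irls(X, y))
```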

  19. Linear classification
      Applications:
      ◮ Sentiment analysis from text features
      ◮ Handwritten digit recognition
      ◮ Gene expression data classification
      ◮ Object recognition in images
      Goal: learn linear functions $f_k(\cdot)$ that divide the input space into a collection of $K$ regions.
      ◮ Fit a linear function to $\Pr(G = k \mid X = x)$ directly $\sim$ linear regression
      ◮ More generally, fit a linear function to a transformation of $\Pr(G = k \mid X = x)$
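One concrete reading of the first bullet (a hedged sketch, not necessarily the course's construction): regress a one-hot indicator matrix of the classes on $X$ by least squares, then assign a new point to the class with the largest fitted score:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, K = 150, 2, 3

# Toy data: three Gaussian blobs, labels in {0, 1, 2}.
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
g = rng.integers(0, K, size=n)
x = means[g] + rng.normal(size=(n, d))

X = np.hstack([np.ones((n, 1)), x])          # design matrix with intercept
Y = np.eye(K)[g]                             # one-hot indicator matrix, shape (n, K)

# One least-squares fit per class: column k of B is the coefficient vector of f_k.
B = np.linalg.lstsq(X, Y, rcond=None)[0]     # shape (d+1, K)

# Classify by the largest fitted score f_k(x).
pred = np.argmax(X @ B, axis=1)
print("training accuracy:", np.mean(pred == g))
```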
