Loss Functions

Reminder: Δ(y, ȳ) = cost of predicting ȳ if y is correct.

Optimal decision: choose g : X → Y to minimize the expected loss

      L_Δ(y; x) = Σ_{ȳ∈Y} p(ȳ|x) Δ(ȳ, y) = Σ_{ȳ≠y} p(ȳ|x) Δ(ȳ, y)        (since Δ(y, y) = 0)

      g(x) = argmin_{y∈Y} L_Δ(y; x)        (pick the label of smallest expected loss)

Special case: Δ(y, ȳ) = ⟦y ≠ ȳ⟧, e.g. the loss matrix
      ⎛ 0 1 1 ⎞
      ⎜ 1 0 1 ⎟   (for 3 labels)
      ⎝ 1 1 0 ⎠

      g_Δ(x) = argmin_{y∈Y} L_Δ(y; x) = argmin_{y∈Y} Σ_{ȳ≠y} p(ȳ|x)
             = argmin_{y∈Y} (1 − p(y|x)) = argmax_{y∈Y} p(y|x)        (→ Bayes classifier)
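For concreteness, a minimal numpy sketch of this decision rule; the posterior vector and loss matrix below are made-up example values, not from the lecture:

```python
import numpy as np

# Hypothetical posterior p(y|x) over 3 labels and a loss matrix Delta[ybar, y]
posterior = np.array([0.5, 0.3, 0.2])          # p(y|x) for y = 0, 1, 2
Delta = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 1, 0]])                  # 0/1 loss

# Expected loss of predicting each label: L(y) = sum_ybar p(ybar|x) * Delta(ybar, y)
expected_loss = posterior @ Delta
g_x = np.argmin(expected_loss)                 # Bayes-optimal prediction

print(expected_loss)   # [0.5 0.7 0.8]
print(g_x)             # 0 == argmax of the posterior, as derived above
```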
Learning Paradigms

Given: training data {(x₁, y₁), ..., (xₙ, yₙ)} ⊂ X × Y

Approach 1) Generative Probabilistic Models
 1) Use training data to obtain an estimate p(x|y) for any y ∈ Y
 2) Compute p(y|x) ∝ p(x|y) p(y)
 3) Predict using g(x) = argmin_y Σ_ȳ p(ȳ|x) Δ(ȳ, y).

Approach 2) Discriminative Probabilistic Models
 1) Use training data to estimate p(y|x) directly.
 2) Predict using g(x) = argmin_y Σ_ȳ p(ȳ|x) Δ(ȳ, y).

Approach 3) Loss-minimizing Parameter Estimation
 1) Use training data to search for the best g : X → Y directly.
Generative Probabilistic Models

This is what we did in the RoboCup example!
◮ Training data X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}, X × Y ⊂ X × Y
◮ For each y ∈ Y, build a model for p(x|y) from X_y := {xᵢ ∈ X : yᵢ = y}
  ◮ Histogram: if x can take only few discrete values
  ◮ Kernel Density Estimator: p(x|y) ∝ Σ_{xᵢ∈X_y} k(xᵢ, x)
  ◮ Gaussian: p(x|y) = G(x; μ_y, Σ_y) ∝ exp(−½ (x − μ_y)⊤ Σ_y⁻¹ (x − μ_y))
  ◮ Mixture of Gaussians: p(x|y) = Σ_{k=1}^K π_y^k G(x; μ_y^k, Σ_y^k)

[Figure: class-conditional densities p(x|+1), p(x|−1) (Gaussians) and the resulting class posteriors p(+1|x), p(−1|x) for equal priors p(+1) = p(−1) = ½]

Typically: Y small, i.e. few possible labels; X low-dimensional, e.g. RGB colors, X = ℝ³
But: large Y is possible with the right tools → "Intro to graphical models"
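A minimal numpy sketch of the Gaussian variant (1-D class-conditional Gaussians with equal priors); the synthetic data is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D training data for two classes (illustration only)
X_pos = rng.normal(+1.0, 1.0, size=100)
X_neg = rng.normal(-1.0, 1.0, size=100)

def fit_gaussian(samples):
    """Estimate mean and variance of a class-conditional Gaussian p(x|y)."""
    return samples.mean(), samples.var()

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

mu_p, var_p = fit_gaussian(X_pos)
mu_n, var_n = fit_gaussian(X_neg)

def posterior_pos(x, prior_pos=0.5):
    """p(+1|x) ∝ p(x|+1) p(+1), normalized over both classes."""
    num = gaussian_pdf(x, mu_p, var_p) * prior_pos
    den = num + gaussian_pdf(x, mu_n, var_n) * (1 - prior_pos)
    return num / den

print(posterior_pos(np.array([-2.0, 0.0, 2.0])))  # low, roughly 0.5, high
```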
Discriminative Probabilistic Models

Most popular: Logistic Regression
◮ Training data X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}, X × Y ⊂ X × Y
◮ To simplify notation: assume X = ℝᵈ, Y = {±1}
◮ Parametric model with free parameter w ∈ ℝᵈ:

      p(y|x) = 1 / (1 + exp(−y w⊤x))

◮ Find w by maximizing the conditional data likelihood

      w = argmax_{w∈ℝᵈ} Π_{i=1}^n p(yᵢ|xᵢ)
        = argmin_{w∈ℝᵈ} Σ_{i=1}^n log(1 + exp(−yᵢ w⊤xᵢ))

Extensions to very large Y → "Structured Outputs" (Wednesday)
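A minimal sketch of this maximum (conditional) likelihood fit by plain gradient descent; the learning rate, iteration count, and toy data are arbitrary choices, not from the lecture:

```python
import numpy as np

def _sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_regression(X, y, lr=0.1, iters=1000):
    """Minimize sum_i log(1 + exp(-y_i <w, x_i>)) by gradient descent.
    X: (n, d) array of inputs, y: (n,) array of labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)                              # y_i <w, x_i>
        # d/dw log(1 + exp(-m_i)) = -sigmoid(-m_i) * y_i * x_i
        grad = -(X * (y * _sigmoid(-margins))[:, None]).sum(axis=0)
        w -= lr * grad / n
    return w

# Toy usage on roughly separable data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = fit_logistic_regression(X, y)
print(np.mean(np.sign(X @ w) == y))   # training accuracy
```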
Loss-minimizing Parameter Estimation

◮ Training data X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}, X × Y ⊂ X × Y
◮ Simplify: X = ℝᵈ, Y = {±1}, Δ(y, ȳ) = ⟦y ≠ ȳ⟧
◮ Choose a hypothesis class (which classifiers do we consider?)
      H = {g : X → Y}        (e.g. all linear classifiers)
◮ Expected loss of a classifier g : X → Y on a sample x:
      L(g, x) = Σ_{y∈Y} p(y|x) Δ(y, g(x))
◮ Expected overall loss of a classifier:
      L(g) = Σ_{x∈X} p(x) L(g, x) = Σ_{x∈X} Σ_{y∈Y} p(x, y) Δ(y, g(x)) = 𝔼_{x,y} Δ(y, g(x))
◮ Task: find the "best" g in H, i.e. g := argmin_{g∈H} L(g)

Note: for simplicity, we always write Σ_x. When X is infinite (i.e. almost always), read this as ∫_X dx.
Rest of this Lecture

Part II:  H = {linear classifiers}
Part III: H = {nonlinear classifiers}
Part IV (if there's time): Multi-class Classification
Notation...

◮ data points X = {x₁, ..., xₙ}, xᵢ ∈ ℝᵈ   (think: feature vectors)
◮ class labels Y = {y₁, ..., yₙ}, yᵢ ∈ {+1, −1}   (think: cat or no cat)
◮ goal: classification rule g : ℝᵈ → {−1, +1}
◮ parameterize g(x) = sign f(x) with f : ℝᵈ → ℝ:
      f(x) = a₁x₁ + a₂x₂ + ··· + a_d x_d + a₀
  simplify notation: x̂ = (1, x), ŵ = (a₀, a₁, ..., a_d):
      f(x) = ⟨ŵ, x̂⟩        (inner/scalar product in ℝ^{d+1}; also written ŵ·x̂ or ŵ⊤x̂)
◮ out of laziness, we just write f(x) = ⟨w, x⟩ with x, w ∈ ℝᵈ.
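A small sketch of this bias-augmentation trick (purely illustrative values):

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to every input so the bias a0 becomes part of w."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[0.5, 2.0],
              [1.5, -1.0]])          # two points in R^2
w_hat = np.array([0.3, 1.0, -0.5])   # (a0, a1, a2) in R^{d+1}

f = augment(X) @ w_hat               # f(x) = <w_hat, x_hat>
print(np.sign(f))                    # predicted labels g(x) = sign f(x)
```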
Linear Classification – the classical view

Given X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}.

Any w partitions the data space into two half-spaces by means of f(x) = ⟨w, x⟩:
the region with f(x) > 0 and the region with f(x) < 0.

[Figure: 2-D training points of two classes and a separating hyperplane with normal vector w]

"What's the best w?"
Criteria for Linear Classification

What properties should an optimal w have?

Given X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}.

[Figure: two candidate hyperplanes that cut through the point clouds]

Are these the best? No, they misclassify many examples.
Criterion 1: Enforce sign⟨w, xᵢ⟩ = yᵢ for i = 1, ..., n.
Criteria for Linear Classification

What properties should an optimal w have?

Given X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}. What's the best w?

[Figure: two hyperplanes that separate the training data but pass very close to some points]

Are these the best? No, they would be "risky" for future samples.
Criterion 2: Ensure sign⟨w, x⟩ = y for future (x, y) as well.
Criteria for Linear Classification

Given X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ}.
Assume that future samples are similar to current ones. What's the best w?

[Figure: separating hyperplane with a margin region of width ρ on either side]

Maximize "robustness": use w such that we can maximally perturb the input samples without introducing misclassifications.

Central quantity:
      margin(x) = distance of x to the decision hyperplane = ⟨w/‖w‖, x⟩
Maximum Margin Classification

The maximum-margin solution is determined by a maximization problem:

      max_{w∈ℝᵈ, γ∈ℝ⁺}  γ

      subject to   sign⟨w, xᵢ⟩ = yᵢ          for i = 1, ..., n,
                   |⟨w/‖w‖, xᵢ⟩| ≥ γ         for i = 1, ..., n.

Classify new samples using f(x) = ⟨w, x⟩.
Maximum Margin Classification

The maximum-margin solution is equivalently determined by the maximization problem:

      max_{w∈ℝᵈ, ‖w‖=1, γ∈ℝ}  γ

      subject to   yᵢ⟨w, xᵢ⟩ ≥ γ          for i = 1, ..., n.

Classify new samples using f(x) = ⟨w, x⟩.
Maximum Margin Classification

We can rewrite this as a minimization problem: drop the constraint ‖w‖ = 1 and rescale w by 1/γ, so that the margin constraint becomes yᵢ⟨w, xᵢ⟩ ≥ 1; maximizing γ then corresponds to minimizing ‖w‖.

      min_{w∈ℝᵈ}  ‖w‖²

      subject to   yᵢ⟨w, xᵢ⟩ ≥ 1          for i = 1, ..., n.

Classify new samples using f(x) = ⟨w, x⟩.

Maximum Margin Classifier (MMC)
Maximum Margin Classification

From the viewpoint of optimization theory,

      min_{w∈ℝᵈ}  ‖w‖²        subject to   yᵢ⟨w, xᵢ⟩ ≥ 1   for i = 1, ..., n

is rather easy:
◮ The objective function is differentiable and convex.
◮ The constraints are all linear.
We can find the globally optimal w in O(d³) (usually much faster).
◮ There are no local minima.
◮ We have a definite stopping criterion.
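As a sketch of how directly this QP can be handed to an off-the-shelf convex solver, here is one possible formulation using the cvxpy library; the toy data is invented and assumed linearly separable through the origin:

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable 2-D data (illustration only)
X = np.array([[2.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(w))            # ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]            # y_i <w, x_i> >= 1
problem = cp.Problem(objective, constraints)
problem.solve()

print(w.value)                       # maximum-margin weight vector
print(np.sign(X @ w.value))          # all training points classified correctly
```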
Linear Separability

What is the best w for this dataset?

[Figure: almost linearly separable data; a hyperplane with a large margin ρ, and one sample xᵢ inside the margin region — a margin violation of size ξᵢ]

Possibly this one, even though one sample is misclassified.
Linear Separability

What is the best w for this dataset?

[Figure: a hyperplane that classifies all training points correctly, but with a very small margin]

Maybe not this one, even though all points are classified correctly.
Linear Separability

What is the best w for this dataset?

[Figure: a hyperplane with a large margin ρ and a single margin violation of size ξᵢ]

Trade-off: large margin vs. few mistakes on the training set
Soft-Margin Classification

Mathematically, we formulate the trade-off using slack variables ξᵢ:

      min_{w∈ℝᵈ, ξᵢ∈ℝ⁺}  ‖w‖² + (C/n) Σ_{i=1}^n ξᵢ

      subject to   yᵢ⟨w, xᵢ⟩ ≥ 1 − ξᵢ     for i = 1, ..., n,
                   ξᵢ ≥ 0                 for i = 1, ..., n.

Linear Support Vector Machine (linear SVM)

◮ We can fulfill every constraint by choosing ξᵢ large enough.
◮ The larger ξᵢ, the larger the objective (that we try to minimize).
◮ C is a regularization/trade-off parameter:
  ◮ small C → constraints are easily ignored
  ◮ large C → constraints are hard to ignore
  ◮ C = ∞ → hard-margin case → no errors on the training set
◮ Note: the problem is still convex and efficiently solvable.
Solving for the Soft-Margin Solution

Reformulate:

      min_{w∈ℝᵈ, ξᵢ∈ℝ⁺}  ‖w‖² + (C/n) Σ_{i=1}^n ξᵢ

      subject to   yᵢ⟨w, xᵢ⟩ ≥ 1 − ξᵢ  and  ξᵢ ≥ 0     for i = 1, ..., n.

We can read off the optimal values ξᵢ = max{0, 1 − yᵢ⟨w, xᵢ⟩}.

Equivalent optimization problem (with λ = 1/C):

      min_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − yᵢ⟨w, xᵢ⟩}

◮ Now unconstrained optimization, but non-differentiable
◮ Solve efficiently, e.g., by the subgradient method → "Large-scale visual recognition" (Thursday)
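A minimal sketch of such a subgradient method for the unconstrained form above; the constant step size, iteration count, and toy data are arbitrary choices, not from the lecture:

```python
import numpy as np

def linear_svm_subgradient(X, y, lam=0.01, lr=0.1, iters=2000):
    """Minimize lam*||w||^2 + (1/n) sum_i max(0, 1 - y_i <w, x_i>) by subgradient descent.
    X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        active = margins < 1                      # samples with non-zero hinge loss
        # Subgradient: 2*lam*w - (1/n) * sum over active samples of y_i x_i
        subgrad = 2 * lam * w - (X[active] * y[active, None]).sum(axis=0) / n
        w -= lr * subgrad
    return w

# Toy usage
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])
w = linear_svm_subgradient(X, y)
print(np.mean(np.sign(X @ w) == y))   # training accuracy
```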
Linear SVMs in Practice

Efficient software packages:
◮ liblinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
◮ SVMperf: http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html
◮ see also: Pegasos: http://www.cs.huji.ac.il/~shais/code/
◮ see also: sgd: http://leon.bottou.org/projects/sgd

Training time:
◮ approximately linear in the data dimensionality
◮ approximately linear in the number of training examples

Evaluation time (per test example):
◮ linear in the data dimensionality
◮ independent of the number of training examples

Linear SVMs are currently the most frequently used classifiers in Computer Vision.
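For illustration, one possible way to train such a classifier is through scikit-learn's wrapper around liblinear; the package choice, parameter value, and synthetic data are my own, not prescribed by the lecture:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data (illustration only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 1, (100, 5)), rng.normal(-1, 1, (100, 5))])
y = np.hstack([np.ones(100), -np.ones(100)])

clf = LinearSVC(C=1.0)        # liblinear-based linear SVM
clf.fit(X, y)
print(clf.score(X, y))        # training accuracy
print(clf.coef_.shape)        # learned weight vector w, shape (1, 5)
```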
Linear Classification – the modern view

Geometric intuition is nice, but are there any guarantees?

◮ The SVM solution is g(x) = sign f(x) for f(x) = ⟨w, x⟩ with

      w = argmin_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − yᵢ⟨w, xᵢ⟩}

◮ What we really wanted to minimize is the expected loss:

      g = argmin_{g∈H}  𝔼_{x,y} Δ(y, g(x))

  with H = {g(x) = sign f(x) | f(x) = ⟨w, x⟩ for w ∈ ℝᵈ}.

What's the relation?
Linear Classification – the modern view

SVM training is an example of Regularized Risk Minimization. General form:

      min_{f∈F}  Ω(f) + (1/n) Σ_{i=1}^n ℓ(yᵢ, f(xᵢ))

      Ω(f): regularizer;   (1/n) Σᵢ ℓ(yᵢ, f(xᵢ)): loss on the training set, the "risk"

Support Vector Machine:

      min_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n max{0, 1 − yᵢ⟨w, xᵢ⟩}

◮ F = {f(x) = ⟨w, x⟩ | w ∈ ℝᵈ}
◮ Ω(f) = λ‖w‖² for any f(x) = ⟨w, x⟩
◮ ℓ(y, f(x)) = max{0, 1 − y f(x)}        (Hinge loss)
Linear Classification – the modern view: the loss term

Observation 1: the empirical loss approximates the expected loss.
For i.i.d. training examples (x₁, y₁), ..., (xₙ, yₙ):

      𝔼_{x,y} Δ(y, g(x)) = Σ_{x∈X} Σ_{y∈Y} p(x, y) Δ(y, g(x)) ≈ (1/n) Σ_{i=1}^n Δ(yᵢ, g(xᵢ))

Observation 2: the Hinge loss upper-bounds the 0/1-loss.
For Δ(y, ȳ) = ⟦y ≠ ȳ⟧ and g(x) = sign⟨w, x⟩ one has

      Δ(y, g(x)) = ⟦y⟨w, x⟩ < 0⟧ ≤ max{0, 1 − y⟨w, x⟩}

Combination:

      𝔼_{x,y} Δ(y, g(x)) ≲ (1/n) Σᵢ max{0, 1 − yᵢ⟨w, xᵢ⟩}

Intuition: a small "risk" term in the SVM objective → few mistakes in the future
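A tiny numeric check of Observation 2; the margin values are chosen arbitrarily for illustration:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])     # values of y * <w, x>

zero_one = (margins < 0).astype(float)              # 0/1-loss  [[y<w,x> < 0]]
hinge = np.maximum(0.0, 1.0 - margins)              # hinge loss max{0, 1 - y<w,x>}

print(zero_one)                   # [1. 1. 0. 0. 0.]
print(hinge)                      # [3.  1.5 1.  0.5 0. ]
print(np.all(hinge >= zero_one))  # True: the hinge loss upper-bounds the 0/1-loss
```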
Linear Classification – the modern view: the regularizer

Observation 3: only minimizing the loss term can lead to overfitting.
We want classifiers that have small loss, but are simple enough to generalize.
Linear Classification – the modern view: the regularizer

Ad-hoc definition: a function f : ℝᵈ → ℝ is simple if it is not very sensitive to the exact input.

[Figure: two functions with small and large slope dy/dx; sensitivity is measured by the slope f′]

For linear f(x) = ⟨w, x⟩, the slope is ‖∇ₓ f‖ = ‖w‖:
minimizing ‖w‖² encourages "simple" functions.

Formal results, including proper bounds on the generalization error: e.g. [Shawe-Taylor, Cristianini, "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004]
Other classifiers based on Regularized Risk Minimization

There are many other RRM-based classifiers, including variants of the SVM:

L1-regularized Linear SVM

      min_{w∈ℝᵈ}  λ‖w‖_{L1} + (1/n) Σ_{i=1}^n max{0, 1 − yᵢ⟨w, xᵢ⟩}

‖w‖_{L1} = Σ_{j=1}^d |wⱼ| encourages sparsity:
◮ the learned weight vector w will have many zero entries
◮ acts as a feature selector
◮ evaluation of f(x) = ⟨w, x⟩ becomes more efficient

Use this if you have prior knowledge that the optimal classifier should be sparse.
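As a rough illustration of the sparsity effect, one option is scikit-learn's LinearSVC with an ℓ1 penalty; note that that implementation pairs the ℓ1 penalty with the squared hinge loss rather than the plain hinge loss shown above, and the data here is synthetic:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
# 20-dimensional data in which only the first two features carry class information
X = rng.normal(0, 1, (200, 20))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

clf = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=0.1)
clf.fit(X, y)
w = clf.coef_.ravel()
print(np.sum(np.abs(w) > 1e-6))   # number of non-zero weights: typically far fewer than 20
```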
Other classifiers based on Regularized Risk Minimization

SVM with squared slacks / squared Hinge loss

      min_{w∈ℝᵈ, ξᵢ∈ℝ}  λ‖w‖² + (1/n) Σ_{i=1}^n ξᵢ²

      subject to   yᵢ⟨w, xᵢ⟩ ≥ 1 − ξᵢ  and  ξᵢ ≥ 0.

Equivalently:

      min_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n (max{0, 1 − yᵢ⟨w, xᵢ⟩})²

Also has a max-margin interpretation, but the objective is once differentiable.
Other classifiers based on Regularized Risk Minimization

Least-Squares SVM, aka Ridge Regression

      min_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n (1 − yᵢ⟨w, xᵢ⟩)²

Loss function: ℓ(y, f(x)) = (y − f(x))²        ("squared loss")

◮ Easier to optimize than the regular SVM: closed-form solution for w,
      w = X⊤(nλ·I + XX⊤)⁻¹ y
  where the rows of X are the xᵢ⊤ and I is the n×n identity matrix.
◮ But: the loss does not really reflect classification: ℓ(y, f(x)) can be big even if sign f(x) = y
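A minimal numpy sketch of this closed-form solution on synthetic data; it evaluates both the n×n form written above and the equivalent d×d form and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 200, 5, 0.01
X = np.vstack([rng.normal(+1, 1, (n // 2, d)), rng.normal(-1, 1, (n // 2, d))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

# Closed-form least-squares SVM / ridge solution: w = X^T (n*lam*I + X X^T)^{-1} y
w = X.T @ np.linalg.solve(n * lam * np.eye(n) + X @ X.T, y)

# Equivalent primal form, cheaper when d << n: w = (n*lam*I + X^T X)^{-1} X^T y
w_primal = np.linalg.solve(n * lam * np.eye(d) + X.T @ X, X.T @ y)
print(np.allclose(w, w_primal))          # True
print(np.mean(np.sign(X @ w) == y))      # training accuracy
```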
Other classifiers based on Regularized Risk Minimization

Regularized Logistic Regression

      min_{w∈ℝᵈ}  λ‖w‖² + (1/n) Σ_{i=1}^n log(1 + exp(−yᵢ⟨w, xᵢ⟩))

Loss function: ℓ(y, f(x)) = log(1 + exp(−y f(x)))        ("logistic loss")

◮ Smooth (C∞-differentiable) objective
◮ Often gives results similar to the SVM
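Because the objective is smooth, a generic off-the-shelf optimizer suffices; a short sketch using scipy, with arbitrary parameter values and synthetic data:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(+1, 1, (100, 3)), rng.normal(-1, 1, (100, 3))])
y = np.hstack([np.ones(100), -np.ones(100)])
lam = 0.01

def objective(w):
    """lam*||w||^2 + (1/n) sum_i log(1 + exp(-y_i <w, x_i>))"""
    margins = y * (X @ w)
    return lam * w @ w + np.mean(np.log1p(np.exp(-margins)))

result = minimize(objective, x0=np.zeros(X.shape[1]), method='L-BFGS-B')
w = result.x
print(np.mean(np.sign(X @ w) == y))   # training accuracy
```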
Summary – Linear Classifiers

(Linear) Support Vector Machines
◮ geometric intuition: maximum-margin classifier
◮ well-understood theory: regularized risk minimization

Many variants of losses and regularizers
◮ first: try Ω(·) = ‖·‖²
◮ encourage sparsity: Ω(·) = ‖·‖_{L1}
◮ differentiable losses: easier numeric optimization

[Figure: 0/1 loss, Hinge loss, squared Hinge loss, squared loss and logistic loss plotted as functions of y f(x)]

Fun fact: different losses often have similar empirical performance
◮ don't blindly believe claims like "My classifier is the best."
Nonlinear Classification
Nonlinear Classification

What is the best linear classifier for this dataset?

[Figure: 2-D dataset in which the two classes are not linearly separable]

None. We need something nonlinear!
Nonlinear Classification

Idea 1) Combine multiple linear classifiers into a nonlinear classifier

[Figure: outputs σ(f₁(x)), ..., σ(f₄(x)) of several linear classifiers combined by a further unit σ(f₅(x))]
Nonlinear Classification: Boosting

Boosting

Situation:
◮ we have many simple classifiers (typically linear), h₁, ..., h_k : X → {±1}
◮ none of them is particularly good

Method:
◮ construct a stronger nonlinear classifier: g(x) = sign Σⱼ αⱼ hⱼ(x) with αⱼ ∈ ℝ
◮ typically: iterative construction for finding α₁, α₂, ...

Advantage:
◮ very easy to implement

Disadvantage:
◮ computationally expensive to train
◮ finding base classifiers can be hard
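To make the combination rule concrete, a small sketch that evaluates such a weighted vote over hand-made base classifiers; the stumps and weights here are invented for illustration, whereas in practice the αⱼ would come from an iterative procedure such as AdaBoost:

```python
import numpy as np

# Three weak base classifiers h_j: R^2 -> {-1, +1} (hypothetical decision stumps)
def h1(X): return np.sign(X[:, 0] - 0.2)
def h2(X): return np.sign(X[:, 1] + 0.5)
def h3(X): return np.sign(-X[:, 0] - X[:, 1])

base_classifiers = [h1, h2, h3]
alphas = np.array([0.7, 0.4, 0.2])      # invented weights

def g(X):
    """Boosted classifier g(x) = sign( sum_j alpha_j h_j(x) )."""
    votes = np.column_stack([h(X) for h in base_classifiers])   # shape (n, k)
    return np.sign(votes @ alphas)

X = np.array([[1.0, 1.0], [-1.0, -0.3], [0.0, 2.0]])
print(g(X))
```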
Nonlinear Classification: Decision Tree

Decision Trees

[Figure: a tree that tests f₁(x), then f₂(x) or f₃(x) depending on the sign, and outputs a label y at each leaf]

Advantage:
◮ easy to interpret
◮ handles the multi-class situation

Disadvantage:
◮ by themselves, typically worse results than other modern methods

[Breiman, Friedman, Olshen, Stone, "Classification and Regression Trees", 1984]
Nonlinear Classification: Random Forest

Random Forest

[Figure: an ensemble of several randomized decision trees of the kind shown above]

Method:
◮ construct many decision trees randomly (under some constraints)
◮ classify using the majority vote

Advantage:
◮ conceptually easy
◮ works surprisingly well

Disadvantage:
◮ computationally expensive to train
◮ expensive at test time if the forest has many trees

[Breiman, "Random Forests", 2001]
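For illustration, one readily available implementation is scikit-learn's random forest; the library choice, parameters, and synthetic data are my own:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data that is not linearly separable (ring vs. center)
rng = np.random.default_rng(6)
X = rng.normal(0, 1, (400, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.score(X, y))                         # training accuracy
print(forest.predict([[0.0, 0.0], [2.0, 2.0]]))   # inner point -> -1, outer point -> +1
```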
Nonlinear Classification: Neural Networks

Artificial Neural Network / Multilayer Perceptron / Deep Learning

[Figure: units fᵢ(x) = ⟨wᵢ, x⟩ followed by a nonlinearity σ, arranged in layers; the outputs σ(f₁(x)), ..., σ(f₄(x)) feed into a final unit σ(f₅(x))]

Multi-layer architecture:
◮ first layer: inputs x
◮ each layer k evaluates f₁ᵏ, ..., f_mᵏ and feeds its output to the next layer
◮ last layer: output y

Advantage:
◮ biologically inspired → easy to explain to non-experts
◮ efficient at evaluation time

Disadvantage:
◮ non-convex optimization problem
◮ many design parameters, few theoretical results → "Deep Learning" (Tuesday)

[Rumelhart, Hinton, Williams, "Learning Internal Representations by Error Propagation", 1986]
Nonlinearity: Data Preprocessing

Idea 2) Preprocess the data

[Figure, top: a dataset in Cartesian coordinates (x, y) that is not linearly separable.
 Figure, bottom: the same dataset in polar coordinates (r, θ), where it is separable.]

But: both plots show the same dataset! Top: Cartesian coordinates. Bottom: polar coordinates.
Nonlinearity: Data Preprocessing

Idea 2) Preprocess the data

[Figure, top: nonlinear separation in Cartesian coordinates (x, y).
 Figure, bottom: linear separation in polar coordinates (r, θ).]

A linear classifier in polar space acts nonlinearly in Cartesian space.
Generalized Linear Classifier

Given
◮ X = {x₁, ..., xₙ}, Y = {y₁, ..., yₙ},
◮ any (nonlinear) feature map φ : ℝᵏ → ℝᵐ.

Solve the minimization for φ(x₁), ..., φ(xₙ) instead of x₁, ..., xₙ:

      min_{w∈ℝᵐ, ξᵢ∈ℝ⁺}  ‖w‖² + (C/n) Σ_{i=1}^n ξᵢ

      subject to   yᵢ⟨w, φ(xᵢ)⟩ ≥ 1 − ξᵢ     for i = 1, ..., n.

◮ The weight vector w now comes from the target space ℝᵐ.
◮ Distances/angles are measured by the inner product ⟨·, ·⟩ in ℝᵐ.
◮ The classifier f(x) = ⟨w, φ(x)⟩ is linear in w, but nonlinear in x.
Example Feature Mappings

◮ Polar coordinates:
      φ : (x, y) ↦ (√(x² + y²), ∠(x, y))
◮ d-th degree polynomials:
      φ : (x₁, ..., xₙ) ↦ (1, x₁, ..., xₙ, x₁², ..., xₙ², ..., x₁ᵈ, ..., xₙᵈ)
◮ Distance map:
      φ : x⃗ ↦ (‖x⃗ − p⃗₁‖, ..., ‖x⃗ − p⃗_N‖)
  for a set of N prototype vectors p⃗ᵢ, i = 1, ..., N.
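A small sketch of the polar-coordinate and distance-map features; the points and prototypes are invented for illustration:

```python
import numpy as np

def polar_features(X):
    """phi(x, y) = (sqrt(x^2 + y^2), angle(x, y)) for each row of X."""
    r = np.linalg.norm(X, axis=1)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

def distance_map(X, prototypes):
    """phi(x) = (||x - p_1||, ..., ||x - p_N||) for a set of prototype vectors."""
    return np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)

# Toy points (illustration only)
rng = np.random.default_rng(7)
X = rng.normal(0, 1, (6, 2))
print(polar_features(X))                                   # shape (6, 2)
print(distance_map(X, np.array([[0.0, 0.0], [1.0, 1.0]]))) # shape (6, 2): one distance per prototype
```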
Representer Theorem

Solve the soft-margin minimization for φ(x₁), ..., φ(xₙ) ∈ ℝᵐ:

      min_{w∈ℝᵐ, ξᵢ∈ℝ⁺}  ‖w‖² + (C/n) Σ_{i=1}^n ξᵢ                              (1)

      subject to   yᵢ⟨w, φ(xᵢ)⟩ ≥ 1 − ξᵢ     for i = 1, ..., n.

For large m, won't solving for w ∈ ℝᵐ become impossible? No!

Theorem (Representer Theorem)
The minimizing solution w to problem (1) can always be written as

      w = Σ_{j=1}^n αⱼ φ(xⱼ)        for coefficients α₁, ..., αₙ ∈ ℝ.

[Schölkopf, Smola, "Learning with Kernels", 2001]
Kernel Trick

Rewrite the optimization using the Representer Theorem:
◮ insert w = Σ_{j=1}^n αⱼ φ(xⱼ) everywhere,
◮ minimize over αᵢ instead of w.

      min_{w∈ℝᵐ, ξᵢ∈ℝ⁺}  ‖w‖² + (C/n) Σ_{i=1}^n ξᵢ

      subject to   yᵢ⟨w, φ(xᵢ)⟩ ≥ 1 − ξᵢ     for i = 1, ..., n.
Kernel Trick

Rewrite the optimization using the Representer Theorem:
◮ insert w = Σ_{j=1}^n αⱼ φ(xⱼ) everywhere,
◮ minimize over αᵢ instead of w.

      min_{αᵢ∈ℝ, ξᵢ∈ℝ⁺}  ‖Σ_{j=1}^n αⱼ φ(xⱼ)‖² + (C/n) Σ_{i=1}^n ξᵢ

      subject to   yᵢ Σ_{j=1}^n αⱼ ⟨φ(xⱼ), φ(xᵢ)⟩ ≥ 1 − ξᵢ     for i = 1, ..., n.

The former m-dimensional optimization is now n-dimensional.
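The substituted problem only ever touches the inner products k(x, x′) = ⟨φ(x), φ(x′)⟩. A small sketch of how the decision function and the ‖w‖² term look in terms of such a kernel matrix; the RBF kernel, coefficients, and data are arbitrary choices for illustration, not quantities computed by the lecture's training procedure:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(8)
X_train = rng.normal(0, 1, (10, 2))        # the x_j in the expansion
alpha = rng.normal(0, 1, 10)               # coefficients alpha_j (arbitrary here)

K = rbf_kernel(X_train, X_train)           # K[i, j] = <phi(x_i), phi(x_j)>
w_norm_sq = alpha @ K @ alpha              # ||w||^2 = sum_ij alpha_i alpha_j k(x_i, x_j)

def f(X_new):
    """Decision function f(x) = <w, phi(x)> = sum_j alpha_j k(x_j, x)."""
    return rbf_kernel(X_new, X_train) @ alpha

print(w_norm_sq)
print(np.sign(f(rng.normal(0, 1, (3, 2)))))   # predictions for three new points
```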