Statistical Machine Learning: A Crash Course
Part II: Classification & SVMs
Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS
Bayesian Decision Theory
■ Decision rule: decide $C_1$ if $p(C_1 \mid x) > p(C_2 \mid x)$
• We do not need the normalization!
• This is equivalent to $p(x \mid C_1)\, p(C_1) > p(x \mid C_2)\, p(C_2)$
• Which is equivalent to $\frac{p(x \mid C_1)}{p(x \mid C_2)} > \frac{p(C_2)}{p(C_1)}$
■ Bayes optimal classifier:
• A classifier obeying this rule is called a Bayes optimal classifier.
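A quick worked check of the equivalence (numbers made up for illustration, not from the slides): suppose $p(C_1) = 0.3$, $p(C_2) = 0.7$, and for some observation $x$ we have $p(x \mid C_1) = 0.6$ and $p(x \mid C_2) = 0.2$. Then $p(x \mid C_1)\, p(C_1) = 0.18 > 0.14 = p(x \mid C_2)\, p(C_2)$, and equivalently the likelihood ratio $0.6 / 0.2 = 3$ exceeds the prior ratio $0.7 / 0.3 \approx 2.33$, so we decide $C_1$.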
Bayesian Decision Theory
■ Bayesian decision theory: $p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)}$
• Model and estimate the class-conditional density $p(x \mid C_k)$ as well as the class prior $p(C_k)$
• Compute the posterior $p(C_k \mid x)$
• Minimize the error probability by maximizing $p(C_k \mid x)$
■ New approach:
• Directly encode the “decision boundary”
• Without modeling the densities directly
• Still minimize the error probability
Discriminant Functions
■ Formulate classification using comparisons:
• Discriminant functions: $y_1(x), \ldots, y_K(x)$
• Classify $x$ as class $C_k$ iff: $y_k(x) > y_j(x) \;\; \forall j \neq k$
• Example: discriminant functions from the Bayes classifier
  $y_k(x) = p(C_k \mid x)$
  $y_k(x) = p(x \mid C_k)\, p(C_k)$
  $y_k(x) = \log p(x \mid C_k) + \log p(C_k)$
Discriminant Functions
■ Special case: 2 classes
  $y_1(x) > y_2(x) \;\Leftrightarrow\; y_1(x) - y_2(x) > 0 \;\Leftrightarrow\; y(x) > 0$
• Example: Bayes classifier
  $y(x) = p(C_1 \mid x) - p(C_2 \mid x)$
  $y(x) = \log \frac{p(x \mid C_1)}{p(x \mid C_2)} + \log \frac{p(C_1)}{p(C_2)}$
Example: Bayes classifier
■ 2 classes, Gaussian class-conditionals: decision boundary
[Figure: contour plots of $p(x \mid C_1)\, p(C_1)$ and $p(x \mid C_2)\, p(C_2)$, of the difference $p(x \mid C_1)\, p(C_1) - p(x \mid C_2)\, p(C_2)$, and of the log ratio $\log \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)}$, each with the decision boundary marked]
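As a concrete illustration of this slide, here is a minimal Python sketch of a 2-class Bayes classifier with Gaussian class-conditionals. The means, covariances, and priors are made-up illustration values, not the ones used for the figure.

```python
# Minimal sketch: 2-class Bayes classifier with Gaussian class-conditionals.
# All distribution parameters below are assumptions for illustration only.
import numpy as np
from scipy.stats import multivariate_normal

# Assumed class-conditional densities p(x|C_k) and class priors p(C_k)
p_x_given_c1 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]])
p_x_given_c2 = multivariate_normal(mean=[2.0, 1.0], cov=[[1.5, 0.0], [0.0, 0.5]])
prior_c1, prior_c2 = 0.4, 0.6

def y(x):
    """Discriminant y(x) = log p(x|C1)p(C1) - log p(x|C2)p(C2).
    y(x) > 0  <=>  decide C1 (Bayes-optimal decision rule)."""
    return (p_x_given_c1.logpdf(x) + np.log(prior_c1)
            - p_x_given_c2.logpdf(x) - np.log(prior_c2))

x = np.array([1.0, 0.5])
print("decide C1" if y(x) > 0 else "decide C2")
```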
Linear Discriminant Functions
■ 2-class problem: $y(x) > 0$: decide class 1, otherwise class 2
■ Simplest case: linear decision boundary
• Linear discriminant function: $y(x) = w^T x + w_0$, with normal vector $w$ and offset $w_0$
Linear Discriminant Functions
■ Illustration for 2D case: $y(x) = w^T x + w_0$
[Figure: decision boundary $y = 0$ separating region $\mathcal{R}_1$ ($y > 0$) from region $\mathcal{R}_2$ ($y < 0$); $w$ is normal to the boundary, $y(x) / \|w\|$ is the signed distance of $x$ to the decision boundary, and $-w_0 / \|w\|$ is the signed distance of the boundary from the origin]
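A small Python sketch of how the quantities on this slide could be computed; the weight vector and offset are arbitrary illustration values, not taken from the figure.

```python
# Minimal sketch: evaluating a linear discriminant y(x) = w^T x + w_0
# and the signed distance y(x)/||w|| to the decision boundary.
import numpy as np

w = np.array([2.0, -1.0])   # normal vector of the decision boundary (assumed)
w0 = 0.5                    # offset (assumed)

def linear_discriminant(x):
    return w @ x + w0

def signed_distance(x):
    # positive on the class-1 side of the boundary, negative on the class-2 side
    return linear_discriminant(x) / np.linalg.norm(w)

x = np.array([1.0, 1.0])
label = 1 if linear_discriminant(x) > 0 else 2
print(f"y(x) = {linear_discriminant(x):.2f}, "
      f"distance = {signed_distance(x):.2f}, class {label}")
```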
Linear Discriminant Functions
■ 2 basic cases:
[Figure: two example data sets, one linearly separable, one not linearly separable]
Multi-Class Case
■ What if we constructed a multi-class classifier from several 2-class classifiers?
[Figure: ambiguous regions (“?”) arising in the one-versus-all (one-versus-the-rest) and one-versus-one constructions]
• If we base our decision rule on binary decisions, this may lead to ambiguities.
Multi-Class Case
■ Better solution: (we have seen this already)
• Use discriminant functions $y_1(x), \ldots, y_K(x)$ to encode how strongly we believe in each class
• Decision rule: $y_k(x) > y_j(x) \;\; \forall j \neq k$
• If the discriminant functions are linear, the decision regions $\mathcal{R}_k$ are connected and convex
[Figure: two points $x_A, x_B$ in region $\mathcal{R}_k$ and a point $\hat{x}$ on the line segment between them, illustrating convexity]
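A minimal Python sketch of this argmax decision rule with $K$ linear discriminants; the weight vectors and offsets are random placeholder values.

```python
# Minimal sketch: multi-class decision with K linear discriminants
# y_k(x) = w_k^T x + w_k0, choosing the class with the largest value.
import numpy as np

K, d = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))    # one weight vector per class (assumed values)
w0 = rng.normal(size=K)        # one offset per class (assumed values)

def classify(x):
    scores = W @ x + w0            # y_1(x), ..., y_K(x)
    return int(np.argmax(scores))  # decide C_k with the largest y_k(x)

x = np.array([0.5, -1.0])
print("decide class", classify(x) + 1)
```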
Discriminant Functions
■ Why might we want to use discriminant functions?
■ Example: 2 classes
• We could easily fit the class-conditionals using Gaussians and use a Bayes classifier.
• How about now?
• Do these points matter for making the decision between the two classes?
Distribution-free Classifiers
■ Main idea:
• We do not necessarily need to model all details of the class-conditional distributions to come up with a good decision boundary.
  - The class-conditionals may have many intricacies that do not matter at the end of the day.
• If we can learn where to place the decision boundary directly, we can avoid some of the complexity.
■ Nonetheless:
• It would be unwise to believe that such classifiers are inherently superior to probabilistic ones.
First Attempt: Least Squares
■ Try to achieve a certain value of the discriminant function:
  $y(x) = +1 \;\Leftrightarrow\; x \in C_1$
  $y(x) = -1 \;\Leftrightarrow\; x \in C_2$
• Training data: $X = \{x_1, \ldots, x_n\}, \; x_i \in \mathbb{R}^d$
• Labels: $Y = \{y_1, \ldots, y_n\}, \; y_i \in \{-1, 1\}$
■ Linear discriminant function:
• Try to enforce $x_i^T w + w_0 = y_i, \;\; \forall i = 1, \ldots, n$
• One linear equation for each training data point/label pair.
First Attempt: Least Squares
■ Linear equation system: $x_i^T w + w_0 = y_i, \;\; \forall i = 1, \ldots, n$
• Define $\hat{x}_i = \begin{pmatrix} x_i \\ 1 \end{pmatrix}$ and $\hat{w} = \begin{pmatrix} w \\ w_0 \end{pmatrix}$
• Rewrite the equation system: $\hat{x}_i^T \hat{w} = y_i, \;\; \forall i = 1, \ldots, n$
• Or in matrix-vector notation: $\hat{X}^T \hat{w} = y$ with $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_n]$ and $y = [y_1, \ldots, y_n]^T$
First Attempt: Least Squares
■ Overdetermined system of equations: $\hat{X}^T \hat{w} = y$
• $n$ equations, $d+1$ unknowns
■ Look for the least squares solution:
  $\|\hat{X}^T \hat{w} - y\|^2 \to \min$
  $(\hat{X}^T \hat{w} - y)^T (\hat{X}^T \hat{w} - y) \to \min$
  $\hat{w}^T \hat{X} \hat{X}^T \hat{w} - 2 y^T \hat{X}^T \hat{w} + y^T y \to \min$
• Set the derivative to zero:
  $2 \hat{X} \hat{X}^T \hat{w} - 2 \hat{X} y = 0$
  $\hat{w} = \underbrace{(\hat{X} \hat{X}^T)^{-1} \hat{X}}_{\text{pseudo-inverse}} \, y$
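A short Python sketch of this least-squares fit on synthetic data. It uses NumPy's least-squares solver rather than forming the pseudo-inverse explicitly (which gives the same solution); note that here the data points are stored as rows, whereas the slides stack them as columns of $\hat{X}$. All data and names are illustrative.

```python
# Minimal sketch: least-squares fit of a linear discriminant by solving
# the augmented system X_hat w_hat = y in the least-squares sense.
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2D classes with labels +1 / -1 (made-up data)
X1 = rng.normal(loc=[1.0, 1.0], size=(50, 2))
X2 = rng.normal(loc=[-1.0, -1.0], size=(50, 2))
X = np.vstack([X1, X2])                              # n x d, rows = data points
y = np.concatenate([np.ones(50), -np.ones(50)])

# Augment each data point with a constant 1 to absorb the offset w_0
X_hat = np.hstack([X, np.ones((X.shape[0], 1))])     # n x (d+1)

# Least-squares solution (equivalent to applying the pseudo-inverse to y)
w_hat, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
w, w0 = w_hat[:-1], w_hat[-1]

pred = np.sign(X @ w + w0)
print("training accuracy:", np.mean(pred == y))
```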
First Attempt: Least Squares
■ Problem: Least-squares is very sensitive to outliers
[Figure: two scatter plots with the fitted discriminant — with no outliers the least-squares discriminant works; with outliers present it breaks down]
New Strategy
■ If our classes are linearly separable, we want to make sure that we find a separating (hyper)plane.
■ First such algorithm we will see:
• The perceptron algorithm [Rosenblatt, 1962]
• A true classic!
Perceptron Algorithm
■ Perceptron discriminant function: $y(x) = \mathrm{sign}(w^T x + b)$
■ Algorithm:
• Start with some “weight” vector $w$ and some bias $b$
• For all data points $x_i$ with class labels $y_i \in \{-1, 1\}$:
  - If $x_i$ is correctly classified, i.e. $y(x_i) = y_i$, do nothing.
  - Otherwise, if $y_i = 1$, update the parameters using: $w \leftarrow w + x_i$, $b \leftarrow b + 1$
  - Otherwise, if $y_i = -1$, update the parameters using: $w \leftarrow w - x_i$, $b \leftarrow b - 1$
• Repeat until convergence.
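A minimal Python sketch of the perceptron updates from this slide, run on a small synthetic, linearly separable data set (the data are made up for illustration).

```python
# Minimal sketch of the perceptron update rule: on each misclassified point,
# w <- w + x_i, b <- b + 1 if y_i = +1, and w <- w - x_i, b <- b - 1 if y_i = -1.
import numpy as np

def perceptron(X, y, max_epochs=100):
    """X: n x d data matrix, y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i + b) != y_i:   # misclassified point
                w += y_i * x_i                # both update cases in one line
                b += y_i
                errors += 1
        if errors == 0:                       # converged: all points correct
            break
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], size=(20, 2)),
               rng.normal([-2, -2], size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w, b = perceptron(X, y)
print("w =", w, "b =", b)
```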
Perceptron Algorithm
■ Intuition:
[Figure sequence over four slides: the perceptron decision boundary on a 2D toy data set, shown after successive updates]