Classification and Pattern Recognition
Léon Bottou, NEC Labs America
COS 424 – 2/23/2010
The machine learning mix and match

Goals
– Classification, clustering, regression, other.
Representation
– Parametric vs. kernels vs. nonparametric
– Probabilistic vs. nonprobabilistic
– Linear vs. nonlinear
– Deep vs. shallow
Capacity control
– Explicit: architecture, feature selection
– Explicit: regularization, priors
– Implicit: approximate optimization
– Implicit: Bayesian averaging, ensembles
Operational considerations
– Loss functions
– Budget constraints
– Online vs. offline
Computational considerations
– Exact algorithms for small datasets
– Stochastic algorithms for big datasets
– Parallel algorithms
Topics for today's lecture

The same mix-and-match map as the previous slide, with Classification as today's topic.
Summary

1. Bayesian decision theory
2. Nearest neighbours
3. Parametric classifiers
4. Surrogate loss functions
5. ROC curve
6. Multiclass and multilabel problems
Classification, a.k.a. pattern recognition

Association between patterns x ∈ X and classes y ∈ Y.
• The pattern space X is unspecified. For instance, X = R^d.
• The class space Y is an unordered finite set.

Examples:
• Binary classification (Y = {±1}): fraud detection, anomaly detection, ...
• Multiclass classification (Y = {C_1, C_2, ..., C_M}): object recognition, speaker identification, face recognition, ...
• Multilabel classification (Y is a power set): document topic recognition, ...
• Sequence recognition (Y contains sequences): speech recognition, signal identification, ...
Probabilistic model

Patterns and classes are represented by random variables X and Y.

P(X, Y) = P(X) P(Y|X) = P(Y) P(X|Y)
Bayes decision theory

Consider a classifier x ∈ X ↦ f(x) ∈ Y.

Maximize the probability of a correct answer:
  P{f(X) = Y} = ∫ 1I(f(x) = y) dP(x, y)
              = ∫ Σ_{y∈Y} 1I(f(x) = y) P{Y = y | X = x} dP(x)
              = ∫ P{Y = f(x) | X = x} dP(x)

Bayes optimal decision rule:  f*(x) = arg max_{y∈Y} P{Y = y | X = x}

Bayes optimal error rate:  B = ∫ ( 1 − max_{y∈Y} P{Y = y | X = x} ) dP(x).
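A minimal sketch (not from the slides) of the Bayes optimal rule when the class priors and class-conditional densities are known; the priors, means, and 1-D Gaussian densities below are illustrative assumptions only.

    import numpy as np

    # Assumed toy setup: two classes with known priors and known
    # class-conditional densities p_y(x); none of these numbers are from the slides.
    priors = {+1: 0.3, -1: 0.7}
    means  = {+1: 2.0, -1: 0.0}

    def p_y(x, y, var=1.0):
        # Class-conditional density p_y(x) for a unit-variance Gaussian.
        return np.exp(-(x - means[y]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def bayes_optimal(x):
        # f*(x) = argmax_y P{Y = y | X = x}, equivalently argmax_y P_y * p_y(x).
        return max(priors, key=lambda y: priors[y] * p_y(x, y))

    print(bayes_optimal(0.8))   # picks the class with the larger scaled density at x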
Bayes optimal decision rule

Comparing the class densities p_y(x) scaled by the class priors P_y = P{Y = y}.

[Figure: scaled class densities P_y p_y(x); the hatched area represents the Bayes optimal error rate.]
How to build a classifier from data

Given a finite set of training examples {(x_1, y_1), ..., (x_n, y_n)}:

• Estimate probabilities:
  – Find a plausible probability distribution (next lecture).
  – Compute or approximate the optimal Bayes classifier.
• Minimize the empirical error:
  – Choose a parametrized family of classification functions a priori.
  – Pick the one that minimizes the observed error rate.
• Nearest neighbours:
  – Determine the class of x on the basis of the closest example(s).
Nearest neighbours

Let d(x, x') be a distance on the patterns.

Nearest neighbour rule (1NN)
– Give x the class of the closest training example.
– f_nn(x) = y_{nn(x)} with nn(x) = arg min_i d(x, x_i).

K-nearest neighbours rule (kNN)
– Give x the most frequent class among the K closest training examples.

K-nearest neighbours variants
– Weighted votes (according to the distances).
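A minimal kNN sketch (not from the slides), assuming squared Euclidean distance and NumPy arrays of training patterns and labels:

    import numpy as np
    from collections import Counter

    def knn_classify(x, X_train, y_train, k=1):
        # Squared Euclidean distances to all n training examples.
        dists = np.sum((X_train - x) ** 2, axis=1)
        # Indices of the k closest training examples.
        nearest = np.argsort(dists)[:k]
        # Majority vote among their labels (plain 1NN rule when k = 1).
        return Counter(y_train[nearest]).most_common(1)[0][0]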
Voronoi tessellation

[Figure: Voronoi cells for the Euclidean distance in the plane and for the cosine distance on the sphere.]

– 1NN: piecewise constant classifier defined on the Voronoi cells.
– kNN: same, but with smaller cells and additional constraints.
1NN and the optimal Bayes error

Theorem (Cover & Hart, 1967): assume η_y(x) = P{Y = y | X = x} is continuous.
When n → ∞,   B ≤ P{f_nn(X) ≠ Y} ≤ 2B.

Easy proof when there are only two classes. Let η(x) = P{Y = +1 | X = x} and let x* denote the nearest neighbour of x.
– B = ∫ min(η(x), 1 − η(x)) dP(x)
– P{f_nn(X) ≠ Y} = ∫ [ η(x)(1 − η(x*)) + (1 − η(x)) η(x*) ] dP(x)
                 → ∫ 2 η(x)(1 − η(x)) dP(x)   as n → ∞, since x* → x
                 ≤ ∫ 2 min(η(x), 1 − η(x)) dP(x) = 2B.
1NN versus kNN

[Figure: asymptotic error of the k-NN rule for k = 1, 3, 5, 7, 51, compared with the Bayes error and twice the Bayes error.]

Using more neighbours
– gets closer to the Bayes rule in the limit;
– needs more examples to approach the condition η(x_{knn(x)}) ≈ η(x).

K is a capacity parameter, to be determined using a validation set.
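A minimal sketch (not from the slides) of choosing K on a held-out validation set, assuming the knn_classify helper sketched earlier and a small list of candidate values:

    import numpy as np

    def validation_error(k, X_train, y_train, X_val, y_val):
        # Fraction of validation examples misclassified by the k-NN rule.
        preds = np.array([knn_classify(x, X_train, y_train, k=k) for x in X_val])
        return np.mean(preds != y_val)

    def select_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 51)):
        # Pick the capacity parameter K with the smallest validation error.
        return min(candidates,
                   key=lambda k: validation_error(k, X_train, y_train, X_val, y_val))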
Computation

Straightforward implementation
– Computing f(x) requires n distance computations.
– (−) Grows with the number of examples.
– (+) Embarrassingly parallelizable.

Data structures to speed up the search: K-D trees
– (+) Very effective in low dimension.
– (−) Nearly useless in high dimension.

Shortcutting the computation of distances
– Stop computing as soon as a distance becomes non-competitive.

Using the triangle inequality d(x, x_i) ≥ |d(x, x') − d(x_i, x')|
– Pick r well spread patterns x^(1), ..., x^(r).
– Precompute d(x_i, x^(j)) for i = 1...n and j = 1...r.
– Lower bound: d(x, x_i) ≥ max_{j=1...r} |d(x, x^(j)) − d(x_i, x^(j))|.
– Shortcut if the lower bound is not competitive.
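A minimal sketch (not from the slides) of the triangle-inequality shortcut for 1NN, assuming a metric dist and a small set of reference patterns X_ref; in practice the training-to-reference distances would be precomputed once, not inside the query.

    import numpy as np

    def nn_with_pruning(x, X_train, X_ref, dist):
        # Distances from each training example to the reference patterns
        # (precompute and cache these in a real implementation).
        train_to_ref = np.array([[dist(xi, xr) for xr in X_ref] for xi in X_train])
        x_to_ref = np.array([dist(x, xr) for xr in X_ref])

        best_i, best_d = None, np.inf
        for i, xi in enumerate(X_train):
            # Triangle-inequality lower bound on d(x, x_i).
            lower = np.max(np.abs(x_to_ref - train_to_ref[i]))
            if lower >= best_d:
                continue            # shortcut: x_i cannot be the nearest neighbour
            d = dist(x, xi)         # only compute the full distance when competitive
            if d < best_d:
                best_i, best_d = i, d
        return best_i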
Distances

Nearest neighbour performance is sensitive to the distance.

Euclidean distance: d(x, x') = (x − x')^2
– Do not take the square root!

Mahalanobis distance: d(x, x') = (x − x')⊤ A (x − x')
– Mahalanobis distance: A = Σ^{-1}
– Safe variant: A = (Σ + εI)^{-1}

Dimensionality reduction:
– Diagonalize Σ = Q⊤ Λ Q.
– Drop the low eigenvalues and the corresponding eigenvectors.
– Define x̃ = Λ^{-1/2} Q x. Precompute all the x̃_i.
– Compute d(x, x_i) = (x̃ − x̃_i)^2.
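A minimal sketch (not from the slides) of the whitening transform behind the Mahalanobis distance; the regularization eps and the option to truncate low eigenvalues are the "safe variant" and dimensionality reduction mentioned above, with assumed default values.

    import numpy as np

    def whitening_transform(X, eps=1e-3, n_components=None):
        # Covariance of the training patterns, regularized as (Sigma + eps*I).
        sigma = np.cov(X, rowvar=False) + eps * np.eye(X.shape[1])
        eigvals, eigvecs = np.linalg.eigh(sigma)      # Sigma = Q diag(Lambda) Q^T
        order = np.argsort(eigvals)[::-1]             # largest eigenvalues first
        if n_components is not None:
            order = order[:n_components]              # drop low eigenvalues/eigenvectors
        Q, lam = eigvecs[:, order], eigvals[order]
        return Q / np.sqrt(lam)                       # columns scaled by Lambda^{-1/2}

    # Squared Euclidean distance between projected patterns X @ W equals the
    # (regularized, possibly truncated) Mahalanobis distance between the originals;
    # the projections of all training examples can be precomputed.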
Discriminant function

Binary classification: y = ±1.

Discriminant function f_w(x)
– Assigns class sign(f_w(x)) to pattern x.
– Symbol w represents the parameters to be learnt.

Example: linear discriminant function
– f_w(x) = w⊤ Φ(x).
Example: The Perceptron

The perceptron is a linear discriminant function.

[Figure: retina → associative area producing x → threshold element computing sign(w⊤x).]
The Perceptron Algorithm

– Initialize w ← 0.
– Loop
  – Pick example x_i, y_i.
  – If y_i w⊤Φ(x_i) ≤ 0 then w ← w + y_i Φ(x_i).
– Until all examples are correctly classified.

Perceptron theorem
– Guaranteed to stop if the training data is linearly separable.

Perceptron via stochastic gradient descent
– SGD for minimizing C(w) = Σ_i max(0, −y_i w⊤Φ(x_i)) gives:
  if y_i w⊤Φ(x_i) ≤ 0 then w ← w + γ y_i Φ(x_i).
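A minimal sketch of the perceptron updates above (not from the slides), assuming Φ(x) = x, labels in {−1, +1}, and step size γ = 1:

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        # X: (n, d) array of patterns, y: (n,) array of labels in {-1, +1}.
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * np.dot(w, xi) <= 0:    # misclassified (or on the boundary)
                    w += yi * xi               # perceptron update, gamma = 1
                    mistakes += 1
            if mistakes == 0:                  # stops once all examples are correct,
                break                          # guaranteed if the data is separable
        return w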
The Perceptron Mark 1 (1957)

The Perceptron is not an algorithm. The Perceptron is a machine!
Minimize the empirical error rate

Empirical error rate:
  min_w (1/n) Σ_{i=1}^n 1I{ y_i f(x_i, w) ≤ 0 }

Misclassification loss function
– Noncontinuous
– Nondifferentiable
– Nonconvex

[Figure: the misclassification loss plotted against y ŷ(x).]
Surrogate loss function

Minimize instead:
  min_w (1/n) Σ_{i=1}^n ℓ( y_i f(x_i, w) )

Quadratic surrogate loss
– Quadratic: ℓ(z) = (z − 1)^2

[Figure: the quadratic loss plotted against y ŷ(x).]
Surrogate loss functions

Exp loss and log loss
– Exp loss: ℓ(z) = exp(−z)
– Log loss: ℓ(z) = log(1 + exp(−z))

Hinges
– Perceptron loss: ℓ(z) = max(0, −z)
– Hinge loss: ℓ(z) = max(0, 1 − z)

[Figure: these losses plotted against y ŷ(x).]
Surrogate loss functions

Quadratic + sigmoid
– Let σ(z) = tanh(z).
– ℓ(z) = ( σ(3z/2) − 1 )^2

Ramp
– Ramp loss: ℓ(z) = [1 − z]_+ − [s − z]_+

[Figure: these losses plotted against y ŷ(x).]
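A minimal sketch (not from the slides) collecting these surrogate losses as functions of the margin z = y f(x, w); the default s = −1 in the ramp loss is an assumed value, since s is left as a parameter above.

    import numpy as np

    def quadratic_loss(z):         return (z - 1.0) ** 2
    def exp_loss(z):               return np.exp(-z)
    def log_loss(z):               return np.log1p(np.exp(-z))        # log(1 + exp(-z))
    def perceptron_loss(z):        return np.maximum(0.0, -z)
    def hinge_loss(z):             return np.maximum(0.0, 1.0 - z)
    def sigmoid_quadratic_loss(z): return (np.tanh(1.5 * z) - 1.0) ** 2
    def ramp_loss(z, s=-1.0):      return np.maximum(0.0, 1.0 - z) - np.maximum(0.0, s - z)

    # Unlike the 0/1 misclassification loss 1I{z <= 0}, all of these are continuous
    # in z, which makes the training objective easier to optimize.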