Naive Bayes hypothesis

Discretize range(H) into 10 segments.
→ P(H, W, F | G) is a 3-dimensional array.
→ 10³ values to estimate → requires lots of data!
Curse of dimensionality: #(data) scales exponentially with #(features).

Reminder, conditional probabilities:
P(H, W, F | G) = P(H | G) × P(W | G, H) × P(F | G, H, W)

Naive Bayes "what if?":
P(W | G, H) = P(W | G)
P(F | G, H, W) = P(F | G)
→ Then P(H, W, F | G) = P(H | G) × P(W | G) × P(F | G)
→ only 3 × 10 values to estimate
Naive Bayes hypothesis, cont'd

P(W | S, H) = P(W | S) — what does that mean?
"Among male individuals, the weight is independent of the height."
What do you think?

Despite that naive assumption, Naive Bayes classifiers perform very well!
Let's formalize that a little more.
II - 2 Naive Bayes classifiers in one slide!

posterior = prior × likelihood / evidence

P(Y | X_1, ..., X_n) = P(Y) × P(X_1, ..., X_n | Y) / P(X_1, ..., X_n)

Naive conditional independence assumption: ∀ i ≠ j, P(X_i | Y, X_j) = P(X_i | Y)

⇒ P(Y | X_1, ..., X_n) = (1/Z) × P(Y) × ∏_{i=1}^{n} P(X_i | Y)

If Y ∈ {1, ..., k} and each P(X_i | Y) has q parameters, the NBC has (k − 1) + nqk parameters θ.

Given {x_i, y_i}_{0 ≤ i ≤ N}, θ̂ = θ̂_MLE := argmax_{θ ∈ Θ} (log) L(x_1, ..., x_N; θ)

Prediction: NBC(x) := argmax_{y ∈ [1,k]} P_θ̂(Y = y) × ∏_{i=1}^{n} P_θ̂(X_i = x_i | Y = y)
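As a minimal sketch (not from the slides), the whole recipe above is available off the shelf in R through the e1071 package (assumed installed); the data frame below is a stand-in for the running sex/height/weight/foot-size example.

# Minimal Naive Bayes sketch with e1071 (assumes install.packages("e1071"))
library(e1071)

# Toy data mimicking the running example (FS = foot size)
gens <- data.frame(
  S  = factor(c("M","M","M","M","F","F","F","F")),
  H  = c(1.82, 1.80, 1.70, 1.80, 1.52, 1.65, 1.68, 1.75),
  W  = c(82, 86, 77, 75, 45, 68, 59, 68),
  FS = c(30, 28, 30, 25, 15, 20, 18, 23)
)

# naiveBayes() estimates P(S) and, for each numeric feature, a Gaussian P(feature | S)
nb <- naiveBayes(S ~ H + W + FS, data = gens)

# Posterior probabilities for a new individual (1.81 m, 59 kg, 21 cm)
predict(nb, data.frame(H = 1.81, W = 59, FS = 21), type = "raw")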
II - 3 Back to the example

P(G | H, W, F) = (1/Z) × P(G) × P(H | G) × P(W | G) × P(F | G)

S(ex)   H(eight) (m)   W(eight) (kg)   F(oot size) (cm)
M       1.82           82              30
M       1.80           86              28
M       1.70           77              30
M       1.80           75              25
F       1.52           45              15
F       1.65           68              20
F       1.68           59              18
F       1.75           68              23

P(S = M) = ?
P(H = 1.81 | S = M) = ?
P(W = 59 | S = M) = ?
P(F = 21 | S = M) = ?
II - 3 Back to the example

P(G | H, W, F) = (1/Z) × P(G) × P(H | G) × P(W | G) × P(F | G)

> gens <- read.table("sex classif.csv", sep=";",
+                    col.names=c("S","H","W","F"))   # column names assumed here
> library("MASS")
> fitdistr(gens[1:4,2], "normal")   # Gaussian fit of the male heights (rows 1-4)
...
> # unnormalized posterior score for S = M: prior 0.5 times the three Gaussian densities
> 0.5*dnorm(1.81, mean=1.78,  sd=0.04690416) *
+      dnorm(59,   mean=80,    sd=4.301163)  *
+      dnorm(21,   mean=28.25, sd=2.0463382)
> # unnormalized posterior score for S = F
> 0.5*dnorm(1.81, mean=1.65,  sd=0.08336666) *
+      dnorm(59,   mean=60,    sd=9.407444)  *
+      dnorm(21,   mean=19,    sd=2.915476)
II - 3 Back to the example

P(G | H, W, F) = (1/Z) × P(G) × P(H | G) × P(W | G) × P(F | G)

S is discrete; H, W and F are assumed Gaussian.

S   p_S    μ̂_{H|S}   σ̂_{H|S}   μ̂_{W|S}   σ̂_{W|S}   μ̂_{F|S}   σ̂_{F|S}
M   0.5    1.78       0.0469     80         4.3012     28.25      2.0463
F   0.5    1.65       0.0834     60         9.4074     19         2.9154

P(M | 1.81, 59, 21) = (1/Z) × 0.5
  × exp(−(1.78 − 1.81)² / (2·0.0469²)) / √(2π·0.0469²)
  × exp(−(80 − 59)² / (2·4.3012²)) / √(2π·4.3012²)
  × exp(−(28.25 − 21)² / (2·2.0463²)) / √(2π·2.0463²)
  = (1/Z) × 7.854·10⁻¹⁰

P(F | 1.81, 59, 21) = (1/Z) × 0.5
  × exp(−(1.65 − 1.81)² / (2·0.0834²)) / √(2π·0.0834²)
  × exp(−(60 − 59)² / (2·9.4074²)) / √(2π·9.4074²)
  × exp(−(19 − 21)² / (2·2.9154²)) / √(2π·2.9154²)
  = (1/Z) × 1.730·10⁻³
II - 3 Back to the example

P(G | H, W, F) = (1/Z) × P(G) × P(H | G) × P(W | G) × P(F | G)

Conclusion: given the data, (1.81 m, 59 kg, 21 cm) is more likely to be female.
II - 4 General features

Using the naive assumption, we have

P(Y | X_1, ..., X_p) = (1/Z) × P(Y) × ∏_{j=1}^{p} P(X_j | Y)

Continuous X_j: use a Gaussian approximation
  Assume a normal distribution X_j | Y = y ∼ N(μ_jy, σ_jy)
  Or discretize X_j | Y = y via binning (often better if many data points)

Binary X_j: use a Bernoulli approximation
  Bernoulli distribution X_j | Y = y ∼ B(p_jy)
Algorithm

Train: for all possible values of Y and X_j, compute P̂_n(Y = y) and P̂_n(X_j = x_j | Y = y).

Predict: given (x_1, ..., x_p), return the y that maximizes P̂_n(Y = y) × ∏_{j=1}^{p} P̂_n(X_j = x_j | Y = y).
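A hand-rolled sketch of this train/predict scheme for purely discrete features is given below; the helper names nbc_train and nbc_predict are illustrative, not part of any package.

# X: data.frame of factors, y: factor of class labels
nbc_train <- function(X, y) {
  prior <- table(y) / length(y)                                            # P_hat(Y = y)
  cond  <- lapply(X, function(xj) prop.table(table(xj, y), margin = 2))    # P_hat(Xj | Y = y)
  list(prior = prior, cond = cond)
}

nbc_predict <- function(model, xnew) {          # xnew: a one-row data.frame
  scores <- sapply(names(model$prior), function(y) {
    p <- log(model$prior[[y]])
    for (j in names(model$cond))
      p <- p + log(model$cond[[j]][as.character(xnew[[j]]), y])
    p
  })
  names(which.max(scores))                      # argmax over the classes
}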
When should you use NBC?

Needs little data to estimate its parameters.
Easily handles large feature spaces.
Requires little tuning (but a bit of feature engineering).
Without careful tuning, more complex approaches are often outperformed by NBC...
...despite the naive independence assumption!

If you want to understand why: The Optimality of Naive Bayes, H. Zhang, FLAIRS, 2004.
A little more

Computational amendments: never say never!

P̂(Y = y | X_j = x_j, j ∈ [1, p]) = (1/Z) × P̂(Y = y) × ∏_{j=1}^{p} P̂(X_j = x_j | Y = y)

But if P̂(X_j = x_j | Y = y) = 0, then all other information from X_j is lost!
→ never set a probability estimate below ε (sample correction).

Additive model

Log-likelihood: log P̂(Y | X) = −log Z + log P̂(Y) + Σ_{j=1}^{p} log P̂(X_j | Y), and:

log [ P̂(Y | X) / P̂(Ȳ | X) ] = log [ P̂(Y) / (1 − P̂(Y)) ] + Σ_{j=1}^{p} log [ P̂(X_j | Y) / P̂(X_j | Ȳ) ]
                             = α + Σ_{j=1}^{p} g_j(X_j)
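A small sketch of the two amendments above: floor the estimates at ε and score in log-space. All numbers here are illustrative; probs is a hypothetical matrix of estimated P̂(X_j = x_j | Y = y).

eps   <- 1e-4
prior <- c(spam = 0.3, ham = 0.7)
probs <- rbind(spam = c(0.20, 0.00, 0.05),   # a zero estimate would wipe out the whole class...
               ham  = c(0.01, 0.10, 0.08))
probs <- pmax(probs, eps)                    # ...so floor it at eps (sample correction)

# Additive (log-likelihood) score: log P(Y) + sum_j log P(Xj | Y), up to -log Z
scores <- log(prior) + rowSums(log(probs))
names(which.max(scores))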
II - 5 A real-world example: spam filtering

Build a NBC that classifies emails as spam/non-spam, using the occurrence of words.

Any ideas?
The data

Data = a bunch of emails, labeled as spam/non-spam.
The Ling-spam dataset: http://csmining.org/index.php/ling-spam-datasets.html.

Preprocessing

From each email: remove stop-words, apply lemmatization, remove non-words.
The data

Before:
Subject: Re: 5.1344 Native speaker intuitions
The discussion on native speaker intuitions has been extremely interesting, but I worry that my brief intervention may have muddied the waters. I take it that there are a number of separable issues. The first is the extent to which a native speaker is likely to judge a lexical string as grammatical or ungrammatical per se. The second is concerned with the relationships between syntax and interpretation (although even here the distinction may not be entirely clear cut).

After:
re native speaker intuition discussion native speaker intuition extremely interest worry brief intervention muddy waters number separable issue first extent native speaker likely judge lexical string grammatical ungrammatical per se second concern relationship between syntax interpretation although even here distinction entirely clear cut
The data

Keep a dictionary V of the |V| most frequent words.
Count the occurrences of each dictionary word in each example email.

m emails
n_i words in email i
|V| words in the dictionary

What is Y? What are the X_i?
Text classification features

Y = 1 if the email is a spam.
X_k = 1 if word k of the dictionary appears in the email.

Estimator of P(X_k = 1 | Y = y), where x_j^i is the j-th word of email i and y_i is the label of email i:

φ_{k,y} = ( Σ_{i=1}^{m} Σ_{j=1}^{n_i} 1{x_j^i = k and y_i = y} + 1 ) / ( Σ_{i=1}^{m} 1{y_i = y} n_i + |V| )
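A sketch of this smoothed estimator computed from a document-by-word count matrix; counts and labels are hypothetical names for the m × |V| matrix of word occurrences and the 0/1 spam indicator vector.

phi_hat <- function(counts, labels, y) {
  num <- colSums(counts[labels == y, , drop = FALSE]) + 1   # word counts in class y, +1 Laplace smoothing
  den <- sum(counts[labels == y, ]) + ncol(counts)          # total words in class y, + |V|
  num / den
}
# phi_spam <- phi_hat(counts, labels, 1); phi_ham <- phi_hat(counts, labels, 0)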
Getting started in R

> library(Matrix)               # for spMatrix()
> trainingSet <- read.table("emails-train-features.txt", sep=" ",
                            col.names=c("document","word","count"))
> labelSet <- read.table("emails-train-labels.txt", sep=" ", col.names=c("spam"))
> num.features <- 2500
> # sparse document-by-word count matrix
> doc.word.train <- spMatrix(max(trainingSet[,1]), num.features,
                             as.vector(trainingSet[,1]), as.vector(trainingSet[,2]),
                             as.vector(trainingSet[,3]))
> doc.class.train <- labelSet[,1]
> source("trainSpamClassifier.r")     # your very own classifier!
> params <- trainSpamClassifier(doc.word.train, doc.class.train)
> testingSet <- read.table("emails-test-features.txt", sep=" ",
                           col.names=c("document","word","count"))
> doc.word.test <- spMatrix(max(testingSet[,1]), num.features,
                            as.vector(testingSet[,1]), as.vector(testingSet[,2]),
                            as.vector(testingSet[,3]))
> source("testSpamClassifier.r")
> prediction <- testSpamClassifier(params, doc.word.test)   # does it work well?
Going further in text mining in R

The "Text Mining" package:
http://cran.r-project.org/web/packages/tm/
http://tm.r-forge.r-project.org/

Useful if you want to change the features on the previous dataset.
Schedule

1 Introduction
  I - 1 Motivations
  I - 2 Binary supervised classification
  I - 3 Statistical model
2 Naive Bayes classifier
  II - 1 Posterior distribution
  II - 2 Naive Bayes classifiers in one slide
  II - 3 Back to the example
  II - 4 General features
  II - 5 A real-world example: spam filtering
3 Nearest Neighbour rule
  III - 1 A very standard classification algorithm
  III - 2 Statistical framework
  III - 3 Margin assumption
  III - 4 Classification abilities
  III - 5 Short example
4 Support Vector Machines
  Motivation
III - 1 A very standard classification algorithm

Metric space (K, ‖·‖). Given x ∈ K, we rank the n observations according to their distances to x:

‖X_(1)(x) − x‖ ≤ ‖X_(2)(x) − x‖ ≤ ... ≤ ‖X_(n)(x) − x‖.

X_(m)(x) is the m-th closest neighbour of x in D_n and Y_(m)(x) is the corresponding label.

Φ_{n,k}(x) := 1 if (1/k) Σ_{j=1}^{k} Y_(j)(x) > 1/2, and 0 otherwise.   (1)

A simple picture... [Figure. Left: decision with 3-NN. Right: Bayes classifier Φ_Bayes.]
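A minimal k-NN sketch using the class package (assumed installed); the train/test matrices and labels below are hypothetical simulated data, not the lecture's dataset.

library(class)
set.seed(1)
train <- matrix(runif(200), ncol = 2)                 # 100 training points in [0,1]^2
cl    <- factor(ifelse(2*train[,1] + train[,2] > 1.5, 1, 0))
test  <- matrix(runif(200), ncol = 2)

pred3 <- knn(train, test, cl, k = 3)                  # majority vote among the 3 nearest neighbours
head(pred3)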
III - 1 A very standard classification algorithm

Influence of k on the k-NN classifier? k ∈ {1, 3, 20, 200}:
k = 1 → overfitting (global variance),
k = 200 → underfitting (global bias).
III - 2 Statistical framework

Assumptions on the distribution of X:
X is compactly supported (on K).
The law P_X of X has a density w.r.t. μ (Lebesgue measure on K).
Regular support: ∀ x ∈ K, ∀ r ≤ r_0, λ(K ∩ B(x, r)) ≥ c_0 λ(B(x, r)).
This assumption means that K does not possess a kind of fractal structure.

We assume at last that η = g/(f + g) is L-Lipschitz w.r.t. ‖·‖:
∃ L > 0, ∀ x ∈ K, ∀ h, |η(x + h) − η(x)| ≤ L‖h‖.
III - 3 Margin assumption

Margin assumption H_MA(α), introduced by Mammen & Tsybakov ('99): there exists a real value α ≥ 0 such that

∀ ε ≤ ε_0,  P_X( |η(X) − 1/2| ≤ ε ) ≤ C ε^α.

[Figure. Solid line: η = 1/2; dashed lines: η = 1/2 ± ε.]

This is a local property around the boundary η = 1/2.
If α = +∞, η has a spatial discontinuity and jumps across the level 1/2.
If η "crosses" the boundary 1/2, then α = 1.
If η possesses r vanishing derivatives on the set η = 1/2, then α = 1/(r + 1).
III - 4 Classification abilities

Given the previous simulations when k varies, a careful choice of k is needed! The following theorem holds:

Theorem (2007, 2014)
(a) For any classification algorithm Φ_n, there exists a distribution satisfying the margin assumption and the assumptions on the density such that

R(Φ_n) − R(Φ_Bayes) ≥ C n^{−(1+α)/(2+d)}.

(b) This lower bound is optimal and is reached by the k_n-NN rule with k_n = n^{2/(2+d)}.

Standard situation: α = 1, excess risk ∼ n^{−2/(2+d)}, where d is the dimension of the state space. In the 1-D case, we reach the rate n^{−2/3}.

The effect of the dimension is dramatic! It is related to the curse of dimensionality.

Important need: reduce the effective dimension of the data while preserving the discrimination (compute a PCA and project on the main directions, or use a preliminary feature selection algorithm).
III - 5 Short example

Dataset: 100 samples of (x_1, x_2) ∼ U([0, 1]²).
Class label:
If (x_1, x_2) is above the line, 2x_1 + x_2 > 1.5, choose Y ∼ B(p) with p < 0.5.
If (x_1, x_2) is below the line, 2x_1 + x_2 < 1.5, choose Y ∼ B(q) with q > 0.5.

[Figure: simulated data and 1-NN classification; true frontier vs. the k-NN zones for classes 0 and 1.]

Bayes classification error: 0.1996. 1-NN classifier error: 0.35422. Performs poorly!
III - 5 Short example

Optimization with a cross-validation criterion: the optimal choice of k leads to k = 12. Works better!

Theoretical recommendation: k_n ∼ n^{2/(2+d)} ≃ 10 here.
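A sketch of picking k by leave-one-out cross-validation with class::knn.cv, on data simulated in the spirit of the example above (p = 0.2 and q = 0.8 are illustrative choices, not the values used in the slides).

library(class)
set.seed(1)
n <- 100
x <- matrix(runif(2 * n), ncol = 2)
above <- 2*x[,1] + x[,2] > 1.5
y <- factor(ifelse(above, rbinom(n, 1, 0.2), rbinom(n, 1, 0.8)))   # p = 0.2, q = 0.8

cv_err <- sapply(1:30, function(k) mean(knn.cv(x, y, k = k) != y)) # LOO error for each k
which.min(cv_err)                 # data-driven choice of k
round(n^(2/4))                    # theoretical recommendation n^{2/(2+d)} with d = 2, here ~10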
4 Support Vector Machines
  Motivation
Linearly separable data

Intuition: how would you separate the white points from the black ones?
Separation hyperplane

Any separating hyperplane can be written (β, β_0) such that:
∀ i = 1..N, β^T x_i + β_0 ≥ 0 if y_i = +1,
∀ i = 1..N, β^T x_i + β_0 ≤ 0 if y_i = −1.

This can be written: ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 0.
Separation hyperplane

But... y_i (β^T x_i + β_0) is the signed distance between point i and the hyperplane (β, β_0).

Margin of a separating hyperplane: min_i y_i (β^T x_i + β_0)?
Separation hyperplane

Optimal separating hyperplane: maximize the margin between the hyperplane and the data.

max_{β, β_0} M
such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M and ‖β‖ = 1.
Separation hyperplane

Let's get rid of ‖β‖ = 1:

∀ i = 1..N, (1/‖β‖) y_i (β^T x_i + β_0) ≥ M
⇒ ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M ‖β‖.
Separation hyperplane

∀ i = 1..N, y_i (β^T x_i + β_0) ≥ M ‖β‖

If (β, β_0) satisfies this constraint, then ∀ α > 0, (αβ, αβ_0) does too.
Let's choose to have ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1; then we need to set ‖β‖ = 1/M.
Separation hyperplane

Now M = 1/‖β‖. Geometrical interpretation?

So max M ⇔ min_{β, β_0} ‖β‖ ⇔ min_{β, β_0} (1/2) ‖β‖².
Separation hyperplane

Optimal separating hyperplane (continued)

min_{β, β_0} (1/2) ‖β‖²
such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1.

Maximize the margin M = 1/‖β‖ between the hyperplane and the data.
Optimal separating hyperplane

min_{β, β_0} (1/2) ‖β‖²
such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1.

It's a QP problem!

L_P(β, β_0, α) = (1/2) ‖β‖² − Σ_{i=1}^{N} α_i [ y_i (β^T x_i + β_0) − 1 ]

KKT conditions:
∂L_P/∂β = 0 ⇒ β = Σ_{i=1}^{N} α_i y_i x_i
∂L_P/∂β_0 = 0 ⇒ 0 = Σ_{i=1}^{N} α_i y_i
∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) − 1 ] = 0
∀ i = 1..N, α_i ≥ 0
Optimal separating hyperplane

∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) − 1 ] = 0. Two possibilities:

α_i > 0: then y_i (β^T x_i + β_0) = 1, x_i is on the margin's boundary.
α_i = 0: then x_i is anywhere on the boundary or further... but does not participate in β.

β = Σ_{i=1}^{N} α_i y_i x_i

The x_i for which α_i > 0 are called Support Vectors.
Optimal separating hyperplane

Dual problem:

max_{α ∈ R_+^N} L_D(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
such that Σ_{i=1}^{N} α_i y_i = 0.

Solving the dual problem is a maximization in R^N, rather than a (constrained) minimization in R^n.
Usual algorithm: SMO = Sequential Minimal Optimization.
Optimal separating hyperplane

And β_0?

Solve α_i [ y_i (β^T x_i + β_0) − 1 ] = 0 for any i such that α_i > 0.
Optimal separating hyperplane

Overall: β = Σ_{i=1}^{N} α_i y_i x_i, with α_i > 0 only for the support vectors x_i.

Prediction: f(x) = sign(β^T x + β_0) = sign( Σ_{i=1}^{N} α_i y_i x_i^T x + β_0 ).
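A sketch of a maximal-margin linear SVM in R with e1071 (assumed installed); the toy clouds are illustrative, and a large cost is used to approximate the hard-margin problem on separable data.

library(e1071)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))    # two well-separated clouds
y <- factor(rep(c(-1, 1), each = 20))

fit <- svm(x, y, kernel = "linear", cost = 1e3, scale = FALSE)
fit$index                                            # indices of the support vectors
predict(fit, x[1:5, ])                               # sign(beta^T x + beta_0) on a few points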
Non-linearly separable data?

Slack variables ξ = (ξ_1, ..., ξ_N):

y_i (β^T x_i + β_0) ≥ M − ξ_i   or   y_i (β^T x_i + β_0) ≥ M (1 − ξ_i),

with ξ_i ≥ 0 and Σ_{i=1}^{N} ξ_i ≤ K.
Non-linearly separable data?

y_i (β^T x_i + β_0) ≥ M (1 − ξ_i) ⇒ misclassification if ξ_i ≥ 1.

Σ_{i=1}^{N} ξ_i ≤ K ⇒ at most K misclassifications.
Non-linearly separable data?

Optimal separating hyperplane:

min_{β, β_0} ‖β‖
such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1 − ξ_i, ξ_i ≥ 0, Σ_{i=1}^{N} ξ_i ≤ K.
Non-linearly separable data?

Optimal separating hyperplane:

min_{β, β_0} (1/2) ‖β‖² + C Σ_{i=1}^{N} ξ_i
such that ∀ i = 1..N, y_i (β^T x_i + β_0) ≥ 1 − ξ_i, ξ_i ≥ 0.
Optimal separating hyperplane

Again a QP problem.

L_P = (1/2) ‖β‖² + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} α_i [ y_i (β^T x_i + β_0) − (1 − ξ_i) ] − Σ_{i=1}^{N} μ_i ξ_i

KKT conditions:
∂L_P/∂β = 0 ⇒ β = Σ_{i=1}^{N} α_i y_i x_i
∂L_P/∂β_0 = 0 ⇒ 0 = Σ_{i=1}^{N} α_i y_i
∂L_P/∂ξ_i = 0 ⇒ α_i = C − μ_i
∀ i = 1..N, α_i [ y_i (β^T x_i + β_0) − (1 − ξ_i) ] = 0
∀ i = 1..N, μ_i ξ_i = 0
∀ i = 1..N, α_i ≥ 0, μ_i ≥ 0
Optimal separating hyperplane

Dual problem:

max_{α ∈ R_+^N} L_D(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j x_i^T x_j
such that Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ C.
Optimal separating hyperplane

α_i [ y_i (β^T x_i + β_0) − (1 − ξ_i) ] = 0   and   β = Σ_{i=1}^{N} α_i y_i x_i.

Again:
α_i > 0: then y_i (β^T x_i + β_0) = 1 − ξ_i, x_i is a support vector. Among these:
  ξ_i = 0: then 0 ≤ α_i ≤ C,
  ξ_i > 0: then α_i = C (because μ_i = 0, because μ_i ξ_i = 0).
α_i = 0: then x_i does not participate in β.
Optimal separating hyperplane

Overall: β = Σ_{i=1}^{N} α_i y_i x_i, with α_i > 0 only for the support vectors x_i.

Prediction: f(x) = sign(β^T x + β_0) = sign( Σ_{i=1}^{N} α_i y_i x_i^T x + β_0 ).
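A short sketch of the role of the cost parameter C in the soft-margin problem, again with e1071 on hypothetical noisy data: a smaller C tolerates more slack and typically yields more support vectors.

library(e1071)
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- factor(ifelse(x[,1] + x[,2] + rnorm(100, sd = 0.8) > 0, 1, -1))   # overlapping classes

sapply(c(0.01, 1, 100), function(C)
  svm(x, y, kernel = "linear", cost = C, scale = FALSE)$tot.nSV)        # number of support vectors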
Non-linear SVMs?

Key remark

h : X → H, x ↦ h(x) is a mapping to a p-dimensional Euclidean space (p ≫ n, possibly infinite).

SVM classifier in H (with x' = h(x) and x'_i = h(x_i)): f(x') = sign( Σ_{i=1}^{N} α_i y_i ⟨x'_i, x'⟩ + β_0 ).

Suppose K(x, x') = ⟨h(x), h(x')⟩. Then:

f(x) = sign( Σ_{i=1}^{N} α_i y_i K(x_i, x) + β_0 ).
Kernels

Kernel: K(x, y) = ⟨h(x), h(y)⟩ is called a kernel function.

Example: X = R², H = R³, h(x) = (x_1², √2 x_1 x_2, x_2²), so that K(x, y) = h(x)^T h(y) = ⟨x, y⟩².

What if we knew that K(·, ·) is a kernel, without explicitly building h? The SVM would be a linear classifier in H, but we would never have to compute h(x) for training or prediction! This is called the kernel trick.
Kernels

Under what conditions is K(·, ·) an acceptable kernel?
Answer: if it is an inner product on a (separable) Hilbert space. In more general words, we are interested in positive definite kernels:

Positive definite kernels
K(·, ·) is a positive definite kernel on X if

∀ n ∈ N, x ∈ X^n and c ∈ R^n, Σ_{i,j=1}^{n} c_i c_j K(x_i, x_j) ≥ 0.
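A small numerical sketch of this condition: the positive-definiteness above is equivalent to every Gram matrix (K(x_i, x_j))_{i,j} being positive semi-definite, which we can check on illustrative points through its eigenvalues.

set.seed(1)
x <- matrix(rnorm(20), ncol = 2)                     # 10 points in R^2
gamma <- 0.5
K <- exp(-gamma * as.matrix(dist(x))^2)              # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # >= 0 up to rounding error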
Kernels

Mercer's condition
Given K(x, y), if for all g such that ∫ g(x)² dx < ∞,

∬ K(x, y) g(x) g(y) dx dy ≥ 0,

then there exists a mapping h(·) such that K(x, y) = ⟨h(x), h(y)⟩.
Kernels

Examples of kernels:
polynomial: K(x, y) = (1 + ⟨x, y⟩)^d
radial basis: K(x, y) = e^{−γ ‖x − y‖²} (very often used in R^n)
sigmoid: K(x, y) = tanh(κ_1 ⟨x, y⟩ + κ_2)
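A sketch of a non-linear (radial basis) SVM on data that is not linearly separable, again with e1071; the values of gamma and cost are illustrative and would normally be tuned by cross-validation.

library(e1071)
set.seed(1)
x <- matrix(rnorm(400), ncol = 2)
y <- factor(ifelse(rowSums(x^2) > 1.4, 1, -1))       # circular decision boundary

fit <- svm(x, y, kernel = "radial", gamma = 1, cost = 1)
mean(predict(fit, x) == y)                           # training accuracy of the kernel classifier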
Kernels

What do you think: is it good or bad to send all data points into a feature space with p ≫ n?
SVM and kernels for classification

min_{β, β_0} (1/2) ‖β‖² + C Σ_{i=1}^{N} ξ_i
such that ∀ i = 1..N, y_i (β^T h(x_i) + β_0) ≥ 1 − ξ_i, ξ_i ≥ 0.

Dual problem:

max_{α ∈ R_+^N} L_D(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j ⟨h(x_i), h(x_j)⟩
                       = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j)
such that Σ_{i=1}^{N} α_i y_i = 0 and 0 ≤ α_i ≤ C.
SVM and kernels for classification

Overall: β = Σ_{i=1}^{N} α_i y_i h(x_i), with α_i > 0 only for the support vectors x_i.

Prediction: f(x) = sign(β^T h(x) + β_0) = sign( Σ_{i=1}^{N} α_i y_i K(x_i, x) + β_0 ).
Why would you use SVM?

With kernels, it sends the data into a higher (sometimes infinite) dimensional feature space, where the data is linearly separable / linearly interpolable.
It produces a sparse predictor (many coefficients are zero).
It automatically maximizes the margin (and thus controls the generalization error?).
It performs very well on complex, non-linearly separable / fittable data.
SVM for regression

Now we don't want to separate, but to fit. Contradictory goals?

Fit the data: minimize Σ_{i=1}^{N} V(y_i − f(x_i)), where V is a loss function.
Keep large margins: minimize ‖β‖.

Support Vector Regression

min_{β, β_0} (1/2) ‖β‖² + C Σ_{i=1}^{N} V(y_i − (β^T x_i + β_0))
Loss functions

ε-insensitive:        V(z) = 0 if |z| ≤ ε, |z| − ε otherwise
Laplacian:            V(z) = |z|
Gaussian:             V(z) = (1/2) z²
Huber's robust loss:  V(z) = z²/(2σ) if |z| ≤ σ, |z| − σ/2 otherwise
ε-SVR

min_{β, β_0} (λ/2) ‖β‖² + C Σ_{i=1}^{N} (ξ_i + ξ*_i)
subject to   y_i − ⟨β, x_i⟩ − β_0 ≤ ε + ξ_i,
             ⟨β, x_i⟩ + β_0 − y_i ≤ ε + ξ*_i,
             ξ_i, ξ*_i ≥ 0.

As previously, this is a QP problem.

L_P = (λ/2) ‖β‖² + C Σ_{i=1}^{N} (ξ_i + ξ*_i)
      − Σ_{i=1}^{N} α_i (ε + ξ_i − y_i + ⟨β, x_i⟩ + β_0)
      − Σ_{i=1}^{N} α*_i (ε + ξ*_i + y_i − ⟨β, x_i⟩ − β_0)
      − Σ_{i=1}^{N} (η_i ξ_i + η*_i ξ*_i)
ε-SVR cont'd

L_D = −(1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (α_i − α*_i)(α_j − α*_j) ⟨x_i, x_j⟩
      − ε Σ_{i=1}^{N} (α_i + α*_i) + Σ_{i=1}^{N} y_i (α_i − α*_i)

Dual optimization problem:

max_α L_D
subject to Σ_{i=1}^{N} (α_i − α*_i) = 0, α_i, α*_i ∈ [0, C].
ε-SVR, support vectors

KKT conditions:
α_i (ε + ξ_i − y_i + ⟨β, x_i⟩ + β_0) = 0
α*_i (ε + ξ*_i + y_i − ⟨β, x_i⟩ − β_0) = 0
(C − α_i) ξ_i = 0
(C − α*_i) ξ*_i = 0

If α_i^(*) = 0, then ξ_i^(*) = 0: points inside the ε-insensitivity "tube" don't participate in β.
If α_i^(*) > 0, then:
  if ξ_i^(*) = 0, then x_i is exactly on the border of the "tube", with α_i^(*) ∈ [0, C];
  if ξ_i^(*) > 0, then α_i^(*) = C: outliers are support vectors.
SVR prediction

f(x) = Σ_{i=1}^{N} (α_i − α*_i) ⟨x_i, x⟩ + β_0
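A final sketch of ε-insensitive support vector regression with e1071 (type = "eps-regression"); the data, epsilon and cost below are illustrative and would normally be tuned.

library(e1071)
set.seed(1)
x <- matrix(seq(0, 2*pi, length.out = 100), ncol = 1)   # one-dimensional inputs
y <- sin(x[, 1]) + rnorm(100, sd = 0.2)

fit  <- svm(x, y, type = "eps-regression", kernel = "radial", epsilon = 0.1, cost = 1)
length(fit$index)            # only points on/outside the eps-tube end up as support vectors
yhat <- predict(fit, x)      # f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + beta_0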