

  1. PLUGIN CLASSIFIERS: NAIVE BAYES, LDA, LOGISTIC REGRESSION (Matthieu R Bloch, Tuesday, January 28, 2020)

  2. LOGISTICS
     TAs and office hours:
     - Monday: Mehrdad (TSRB 523a), 2pm-3:15pm
     - Tuesday: TJ (VL C449 Cubicle D), 1:30pm-2:45pm
     - Wednesday: Matthieu (TSRB 423), 12:00pm-1:15pm
     - Thursday: Hossein (VL C449 Cubicle B), 10:45am-12:00pm
     - Friday: Brighton (TSRB 523a), 12pm-1:15pm
     Homework 1 posted on Canvas:
     - Due Wednesday January 29, 2020 (11:59PM EST) (Wednesday February 5, 2020 for DL)
     Lecture notes updated:
     - Versions 1.1 posted for lectures 1, 3, 4, 5 (small typos)
     Logistics for homework submission:
     - Upload a separate PDF file
     - Put problems in order
     - Show your work ("Similar to above, etc." does not show work)
     - Include a listing of your code (example on Overleaf)

  3. RECAP: NAIVE BAYES
     Consider a (random) feature vector $x = [x_1, \dots, x_d]^\top \in \mathbb{R}^d$ and the label $y$.
     Naive assumption: given $y$, the features $\{x_i\}_{i=1}^d$ of $x$ are independent, i.e., $P_{x|y} = \prod_{i=1}^d P_{x_i|y}$.
     Main benefit: we only need univariate densities $P_{x_i|y}$, and we can combine discrete/continuous features.
     Procedure:
     - Estimate a priori class probabilities $\pi_k$ for $0 \le k \le K-1$.
     - Estimate class conditional densities $p_{x_i|y}(x|k)$ for $1 \le i \le d$ and $0 \le k \le K-1$.
     Lemma. The maximum likelihood estimate of $\pi_k$ is $\hat{\pi}_k = \frac{N_k}{N}$ where $N_k \triangleq |\{y_i : y_i = k\}|$ (see the sketch below).
     What about $P_{x|y}$?
     - Continuous features: often Gaussian, use the ML estimate.
     - Discrete (categorical) features: often Multinomial, use the ML estimate.
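
     A minimal Python sketch of the prior estimate in the lemma above, assuming the labels sit in a NumPy array y with values in {0, ..., K-1}; the array and function names are illustrative, not from the lecture:

         import numpy as np

         def estimate_priors(y, K):
             """ML estimate pi_hat_k = N_k / N, with N_k = |{i : y_i = k}|."""
             N = len(y)
             counts = np.bincount(y, minlength=K)  # N_k for k = 0, ..., K-1
             return counts / N

         # e.g. estimate_priors(np.array([0, 0, 1, 2, 1, 0]), K=3) -> array([0.5, 0.333, 0.167])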

  4. NAIVE BAYES (CT'D)
     Assume the $j$th feature $x_j$ takes $J$ distinct values $\{0, \dots, J-1\}$.
     Lemma. The maximum likelihood estimate of $P_{x_j|y}(\ell|k)$ is $\hat{P}_{x_j|y}(\ell|k) = \frac{N^{(j)}_{\ell,k}}{N_k}$, where $N^{(j)}_{\ell,k} \triangleq |\{x_i : y_i = k \text{ and } x_{ij} = \ell\}|$.
     The naive Bayes estimator is $h_{\text{NB}}(x) = \operatorname{argmax}_k \hat{\pi}_k \prod_{j=1}^d \hat{P}_{x_j|y}(x_j|k)$.
     Naive Bayes can be completely wrong! Example: bivariate Gaussian case.
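
     The lemma and the decision rule above can be turned into a short sketch; this is a hedged illustration (the names X, y and the common alphabet size J are assumptions), not the lecture's code. Unseen (value, class) pairs give zero probabilities here; the smoothing fix appears on the next slide.

         import numpy as np

         def fit_categorical_nb(X, y, J, K):
             """X: (N, d) ints in {0,...,J-1}; y: (N,) ints in {0,...,K-1}."""
             N, d = X.shape
             pi = np.bincount(y, minlength=K) / N   # pi_hat_k = N_k / N
             P = np.zeros((d, J, K))                # P[j, l, k] = P_hat_{x_j|y}(l | k)
             for k in range(K):
                 Xk = X[y == k]                     # the N_k samples with y_i = k
                 for j in range(d):
                     P[j, :, k] = np.bincount(Xk[:, j], minlength=J) / len(Xk)
             return pi, P

         def predict_nb(x, pi, P):
             """h_NB(x) = argmax_k pi_hat_k * prod_j P_hat(x_j | k), in the log domain."""
             scores = np.log(pi) + sum(np.log(P[j, x[j], :]) for j in range(len(x)))
             return int(np.argmax(scores))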

  5. NAIVE BAYES AND BAG OF WORDS
     Classification of documents into categories (politics, sports, etc.).
     Represent a document as a vector $x = [x_1, \dots, x_d]^\top$ with $x_j$ the number of occurrences of word $j$ in the document.
     Model documents of length $n$ and assume words are distributed among the $d$ words independently at random (multinomial distribution).
     Estimate parameters:
     - Compute the ML estimate of the document class priors: $\hat{\pi}_k = \frac{N_k}{N}$.
     - Compute the ML estimate of the probability that word $j$ occurs in class $k$ across all documents: $\hat{\mu}_{j,k} = \frac{\sum_\ell \ell N^{(j)}_{\ell,k}}{\sum_{j'=1}^d \sum_\ell \ell N^{(j')}_{\ell,k}}$.
     Run the classifier: $h_{\text{NB}}(x) = \operatorname{argmax}_k \hat{\pi}_k \prod_{j=1}^d (\hat{\mu}_{j,k})^{x_j}$.
     Weakness of the approach: some words may not show up at training but show up at testing.
     Use Laplace smoothing: $\hat{\mu}_{j,k} = \frac{1 + \sum_\ell \ell N^{(j)}_{\ell,k}}{d + \sum_{j'=1}^d \sum_\ell \ell N^{(j')}_{\ell,k}}$.
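
     A possible NumPy sketch of the multinomial bag-of-words fit with Laplace smoothing, where X holds the word-count vectors and n_jk is the total count of word j over all class-k training documents; all names are assumptions for illustration:

         import numpy as np

         def fit_multinomial_nb(X, y, K, alpha=1.0):
             """X: (N, d) word counts; alpha=1 gives the Laplace smoothing on the slide."""
             N, d = X.shape
             pi = np.bincount(y, minlength=K) / N   # class priors
             mu = np.zeros((d, K))
             for k in range(K):
                 n_jk = X[y == k].sum(axis=0)       # count of each word in class k
                 mu[:, k] = (alpha + n_jk) / (alpha * d + n_jk.sum())  # (1 + n_jk) / (d + sum_j n_jk)
             return pi, mu

         def classify_document(x, pi, mu):
             """h_NB(x) = argmax_k pi_k * prod_j mu_{j,k}^{x_j}, evaluated in the log domain."""
             return int(np.argmax(np.log(pi) + x @ np.log(mu)))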

  6. LINEAR DISCRIMINANT ANALYSIS (LDA)
     Consider a (random) feature vector $x = [x_1, \dots, x_d]^\top \in \mathbb{R}^d$ and the label $y$.
     Assumption: given $y$, the feature vector has a Gaussian distribution, $P_{x|y} \sim \mathcal{N}(\mu_k, \Sigma)$.
     The mean is class dependent but the covariance matrix is not:
     $\phi(x; \mu, \Sigma) \triangleq \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$
     Estimate the parameters from data:
     - $\hat{\pi}_k = \frac{N_k}{N}$
     - $\hat{\mu}_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i$
     - $\hat{\Sigma} = \frac{1}{N} \sum_{k=0}^{K-1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$ (recall the assumption about the covariance matrix)
     Lemma. The LDA classifier is $h_{\text{LDA}}(x) = \operatorname{argmin}_k \left( \frac{1}{2}(x - \hat{\mu}_k)^\top \hat{\Sigma}^{-1} (x - \hat{\mu}_k) - \log \hat{\pi}_k \right)$.
     For $K = 2$, the LDA is a linear classifier.
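
     A compact sketch of the estimates and the argmin rule in the lemma, assuming NumPy arrays X of shape (N, d) and y of shape (N,); this is an illustration of the formulas above, not reference code:

         import numpy as np

         def fit_lda(X, y, K):
             N, d = X.shape
             pi = np.bincount(y, minlength=K) / N                        # pi_hat_k = N_k / N
             mu = np.vstack([X[y == k].mean(axis=0) for k in range(K)])  # class means, shape (K, d)
             Sigma = np.zeros((d, d))                                    # single shared covariance
             for k in range(K):
                 diff = X[y == k] - mu[k]
                 Sigma += diff.T @ diff
             return pi, mu, Sigma / N

         def lda_predict(x, pi, mu, Sigma):
             Sigma_inv = np.linalg.inv(Sigma)
             scores = [0.5 * (x - mu[k]) @ Sigma_inv @ (x - mu[k]) - np.log(pi[k])
                       for k in range(len(pi))]
             return int(np.argmin(scores))                               # h_LDA(x)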

  7. [Figure slide: no recoverable text]

  8. [Figure slide: no recoverable text]

  9. LINEAR DISCRIMINANT ANALYSIS (CT'D)
     A generative model is rarely accurate.
     Number of parameters to estimate: $K-1$ class priors, $Kd$ means, $\frac{d(d+1)}{2}$ elements of the covariance matrix (see the quick check below).
     - Works well if $N \gg d$.
     - Works poorly if $N \ll d$ without other tricks (dimensionality reduction, structured covariance).
     Biggest concern: "one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling p(x|y)]." (Vapnik, 1998)
     Revisit the binary classifier with LDA:
     $\eta_1(x) = \frac{\pi_1 \phi(x; \mu_1, \Sigma)}{\pi_1 \phi(x; \mu_1, \Sigma) + \pi_0 \phi(x; \mu_0, \Sigma)} = \frac{1}{1 + \exp(-(w^\top x + b))}$
     We do not need to estimate the full joint distribution!
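
     A quick numerical check of the parameter count above; the values d = 100, K = 2 are an arbitrary illustration, not from the slides:

         d, K = 100, 2
         n_params = (K - 1) + K * d + d * (d + 1) // 2  # priors + means + covariance entries
         print(n_params)  # 5251 parameters for only 100 features, hence the need for N >> d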

  10. LOGISTIC REGRESSION
     Assume that $\eta_1(x)$ is of the form $\frac{1}{1+\exp(-(w^\top x + b))}$.
     Estimate $\hat{w}$ and $\hat{b}$ from the data directly.
     Plug in the result to obtain $\hat{\eta}(x) = \frac{1}{1+\exp(-(\hat{w}^\top x + \hat{b}))}$.
     The function $x \mapsto \frac{1}{1+e^{-x}}$ is called the logistic function.
     The binary logistic classifier is (linear): $h_{\text{LC}}(x) = \mathbf{1}\{\hat{\eta}(x) \ge \frac{1}{2}\} = \mathbf{1}\{\hat{w}^\top x + \hat{b} \ge 0\}$.
     How do we estimate $\hat{w}$ and $\hat{b}$?
     - From the LDA analysis: $\hat{w} = \hat{\Sigma}^{-1}(\hat{\mu}_1 - \hat{\mu}_0)$, $\hat{b} = \frac{1}{2}\hat{\mu}_0^\top \hat{\Sigma}^{-1} \hat{\mu}_0 - \frac{1}{2}\hat{\mu}_1^\top \hat{\Sigma}^{-1} \hat{\mu}_1 + \log\frac{\hat{\pi}_1}{\hat{\pi}_0}$
     - Direct estimation of $(w, b)$ by maximum likelihood.
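
     A hedged sketch of the plug-in route on this slide: compute (w_hat, b_hat) from the fitted LDA parameters and classify with the resulting linear rule. Function and variable names are assumptions; the direct ML route is on the next slide.

         import numpy as np

         def lda_plugin_w_b(pi, mu, Sigma):
             """w_hat = Sigma^{-1}(mu_1 - mu_0); b_hat as given on the slide."""
             Sigma_inv = np.linalg.inv(Sigma)
             w = Sigma_inv @ (mu[1] - mu[0])
             b = (0.5 * mu[0] @ Sigma_inv @ mu[0]
                  - 0.5 * mu[1] @ Sigma_inv @ mu[1]
                  + np.log(pi[1] / pi[0]))
             return w, b

         def logistic_classify(x, w, b):
             eta = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # eta_hat(x), logistic function of w^T x + b
             return int(eta >= 0.5)                    # equivalently 1{w^T x + b >= 0}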

  11. MLE FOR LOGISTIC REGRESSION
     We have a parametric density model for $p_\theta(y|x) = \hat{\eta}(x)$.
     Standard trick: $\theta = [b\ \ w^\top]^\top$ and $\tilde{x} = [1\ \ x^\top]^\top$.
     This allows us to lump in the offset and write $\eta(x) = \frac{1}{1 + \exp(-\theta^\top \tilde{x})}$.
     Given our dataset $\{(\tilde{x}_i, y_i)\}_{i=1}^N$, the likelihood is $L(\theta) \triangleq \prod_{i=1}^N P_\theta(y_i | \tilde{x}_i)$.
     For $K = 2$ with $\mathcal{Y} = \{0, 1\}$ we obtain
     $L(\theta) \triangleq \prod_{i=1}^N \eta(\tilde{x}_i)^{y_i} (1 - \eta(\tilde{x}_i))^{1-y_i}$
     $\ell(\theta) = \sum_{i=1}^N \left( y_i \log \eta(\tilde{x}_i) + (1 - y_i) \log(1 - \eta(\tilde{x}_i)) \right)$
     $\ell(\theta) = \sum_{i=1}^N \left( y_i \theta^\top \tilde{x}_i - \log\left(1 + e^{\theta^\top \tilde{x}_i}\right) \right)$
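
     A minimal sketch of the final log-likelihood above, plus a plain gradient-ascent fit. The gradient expression sum_i (y_i - eta(x_tilde_i)) x_tilde_i is standard but not shown on this slide, and the learning rate and iteration count are arbitrary assumptions:

         import numpy as np

         def log_likelihood(theta, X_tilde, y):
             """l(theta) = sum_i ( y_i theta^T x_i - log(1 + exp(theta^T x_i)) ), rows of X_tilde are [1, x^T]."""
             z = X_tilde @ theta
             return np.sum(y * z - np.logaddexp(0.0, z))  # logaddexp gives a stable log(1 + e^z)

         def fit_logistic_mle(X_tilde, y, lr=0.1, n_iters=1000):
             theta = np.zeros(X_tilde.shape[1])
             for _ in range(n_iters):
                 eta = 1.0 / (1.0 + np.exp(-(X_tilde @ theta)))
                 theta += lr * X_tilde.T @ (y - eta)      # ascend l(theta)
             return theta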
