  1. CS145: INTRODUCTION TO DATA MINING Text Data: Naïve Bayes Instructor: Yizhou Sun yzsun@cs.ucla.edu December 7, 2017

  2. Methods to be Learnt
  • Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (vector data); Naïve Bayes for Text (text data)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (vector data); PLSA (text data)
  • Prediction: Linear Regression; GLM* (vector data)
  • Frequent Pattern Mining: Apriori; FP growth (set data); GSP; PrefixSpan (sequence data)
  • Similarity Search: DTW (sequence data)

  3. Naïve Bayes for Text • Text Data • Revisit of Multinomial Distribution • Multinomial Naïve Bayes • Summary 3

  4. Text Data • Word/term • Document • A sequence of words • Corpus • A collection of documents 4

  5. Text Classification Applications • Spam detection From: airak@medicana.com.tr Subject: Loan Offer Do you need a personal or business loan urgent that can be process within 2 to 3 working days? Have you been frustrated so many times by your banks and other loan firm and you don't know what to do? Here comes the Good news Deutsche Bank Financial Business and Home Loan is here to offer you any kind of loan you need at an affordable interest rate of 3% If you are interested let us know. • Sentiment analysis 5

  6. Represent a Document
  • Most common way: Bag-of-Words
    • Ignore the order of words; keep the counts
  • [Figure: example document-term count matrix omitted]
  • For document $d$, $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word of the vocabulary in the document
  • Also known as the vector space model

  7. More Details
  • Represent the document as a vector where each entry corresponds to a different word and the number at that entry is how many times that word appears in the document (or some function of it)
  • The number of distinct words is huge
    • Select and use a smaller set of words that are of interest
    • E.g., uninteresting words such as 'and', 'the', 'at', 'is' are removed; these are called stop-words
    • Stemming: remove endings, e.g., 'learn', 'learning', 'learnable', 'learned' can all be replaced by the single stem 'learn'
    • Other simplifications can also be invented and used
  • The set of remaining distinct words is called the dictionary or vocabulary; fix an ordering of the terms so that each term can be referred to by its index
  • Can be extended to bi-grams, tri-grams, and so on
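
A minimal sketch of the bag-of-words pipeline described above (lowercasing, stop-word removal, counting against a fixed vocabulary); the tiny stop-word list and the example documents are made up for illustration, and stemming is omitted.

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"and", "the", "at", "is", "a", "of", "to"}

def tokenize(doc):
    """Lowercase, split on whitespace, strip punctuation, and drop stop-words."""
    return [w.strip(".,!?").lower() for w in doc.split()
            if w.strip(".,!?").lower() not in STOP_WORDS]

def build_vocabulary(corpus):
    """Fix an ordering of all remaining terms so each has an index."""
    vocab = sorted({w for doc in corpus for w in tokenize(doc)})
    return {word: idx for idx, word in enumerate(vocab)}

def bag_of_words(doc, vocab):
    """Return the count vector x_d = (x_d1, ..., x_dN) for one document."""
    counts = Counter(tokenize(doc))
    return [counts.get(word, 0) for word in vocab]

# Toy corpus (made up for illustration).
corpus = ["The loan offer is urgent", "I loved the movie, loved it!"]
vocab = build_vocabulary(corpus)
print(vocab)
print([bag_of_words(d, vocab) for d in corpus])
```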

  8. Limitations of Vector Space Model • Dimensionality • High dimensionality • Sparseness • Most of the entries are zero • Shallow representation • The vector representation does not capture semantic relations between words 8

  9. Naïve Bayes for Text • Text Data • Revisit of Multinomial Distribution • Multinomial Naïve Bayes • Summary 9

  10. Bernoulli and Categorical Distribution
  • Bernoulli distribution
    • Discrete distribution that takes two values {0, 1}
    • $P(X = 1) = p$ and $P(X = 0) = 1 - p$
    • E.g., tossing a coin that lands heads or tails
  • Categorical distribution
    • Discrete distribution that takes more than two values, i.e., $x \in \{1, \ldots, K\}$
    • Also called the generalized Bernoulli distribution or the multinoulli distribution
    • $P(X = k) = p_k$ and $\sum_k p_k = 1$
    • E.g., rolling a fair die gives each of the values 1 to 6 with probability 1/6

  11. Binomial and Multinomial Distribution
  • Binomial distribution
    • Number of successes (i.e., the total number of 1's) in $n$ independent Bernoulli trials with success probability $p$
    • $x$: number of successes
    • $P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$
  • Multinomial distribution (multivariate random variable)
    • Repeat $n$ independent trials of a categorical distribution
    • Let $x_k$ be the number of times value $k$ is observed; note $\sum_k x_k = n$
    • $P(X_1 = x_1, X_2 = x_2, \ldots, X_K = x_K) = \frac{n!}{x_1! \, x_2! \cdots x_K!} \prod_k p_k^{x_k}$
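
As a quick sanity check on the formulas above, here is a small script that evaluates the binomial and multinomial pmfs directly from their definitions; the trial counts and probabilities are arbitrary examples, not from the slides.

```python
from math import comb, factorial, prod

def binomial_pmf(x, n, p):
    """P(X = x) for n independent Bernoulli(p) trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) for n = sum(x) categorical(p) trials."""
    n = sum(x)
    coef = factorial(n) // prod(factorial(xk) for xk in x)
    return coef * prod(pk**xk for pk, xk in zip(p, x))

print(binomial_pmf(3, 10, 0.5))                      # P(3 heads in 10 fair coin tosses)
print(multinomial_pmf([2, 1, 2, 2, 1, 2], [1/6]*6))  # one outcome of 10 fair die rolls
```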

  12. Naïve Bayes for Text • Text Data • Revisit of Multinomial Distribution • Multinomial Naïve Bayes • Summary 12

  13. Bayes' Theorem: Basics
  • Bayes' Theorem: $P(h \mid X) = \frac{P(X \mid h) \, P(h)}{P(X)}$
  • Let X be a data sample ("evidence")
  • Let h be a hypothesis that X belongs to class C
  • P(h) (prior probability): the probability of hypothesis h
    • E.g., the probability of the "spam" class
  • P(X|h) (likelihood): the probability of observing the sample X, given that the hypothesis holds
    • E.g., the probability of an email given that it is spam
  • P(X): the marginal probability that the sample data is observed
    • $P(X) = \sum_h P(X \mid h) \, P(h)$
  • P(h|X) (posterior probability): the probability that the hypothesis holds given the observed data sample X
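
To make these pieces concrete, a worked example with made-up numbers (the 0.2, 0.001, and 0.0001 below are purely illustrative): suppose the prior is $P(\text{spam}) = 0.2$, the likelihood of a particular email under the spam hypothesis is $P(X \mid \text{spam}) = 0.001$, and under not-spam it is $P(X \mid \text{not spam}) = 0.0001$. Then

$$
P(X) = P(X \mid \text{spam})\,P(\text{spam}) + P(X \mid \text{not spam})\,P(\text{not spam}) = 0.001 \times 0.2 + 0.0001 \times 0.8 = 0.00028,
$$
$$
P(\text{spam} \mid X) = \frac{P(X \mid \text{spam})\,P(\text{spam})}{P(X)} = \frac{0.0002}{0.00028} \approx 0.714.
$$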

  14. Classification: Choosing Hypotheses
  • Maximum Likelihood (maximize the likelihood):
    $h_{ML} = \arg\max_{h \in H} P(X \mid h)$
  • Maximum a posteriori (maximize the posterior):
    $h_{MAP} = \arg\max_{h \in H} P(h \mid X) = \arg\max_{h \in H} P(X \mid h) \, P(h)$
    • Useful observation: it does not depend on the denominator P(X)

  15. Classification by Maximum A Posteriori
  • Let D be a training set of tuples and their associated class labels, where each tuple is represented by a p-dimensional attribute vector x = (x_1, x_2, ..., x_p)
  • Suppose there are m classes, y ∈ {1, 2, ..., m}
  • Classification is to derive the maximum a posteriori class, i.e., the class j with maximal P(y = j | x)
  • This can be derived from Bayes' theorem:
    $p(y = j \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x} \mid y = j) \, p(y = j)}{p(\boldsymbol{x})}$
  • Since p(x) is constant for all classes, only $p(\boldsymbol{x} \mid y) \, p(y)$ needs to be maximized

  16. Now Come to Text Setting
  • A document is represented as a bag of words
    • $\boldsymbol{x}_d = (x_{d1}, x_{d2}, \ldots, x_{dN})$, where $x_{dn}$ is the count of the $n$th word in the vocabulary
  • Model $p(\boldsymbol{x}_d \mid y)$ for class $y$
    • Follows a multinomial distribution with parameter vector $\boldsymbol{\beta}_y = (\beta_{y1}, \beta_{y2}, \ldots, \beta_{yN})$, i.e.,
    $p(\boldsymbol{x}_d \mid y) = \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}}$
  • Model $p(y = j)$
    • Follows a categorical distribution with parameter vector $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_m)$, i.e.,
    $p(y = j) = \pi_j$

  17. Classification Process Assuming Parameters are Given
  • Find the $y$ that maximizes $p(y \mid \boldsymbol{x}_d)$, which is equivalent to maximizing
    $y^* = \arg\max_y p(\boldsymbol{x}_d, y) = \arg\max_y p(\boldsymbol{x}_d \mid y) \, p(y)$
    $= \arg\max_y \frac{(\sum_n x_{dn})!}{x_{d1}! \, x_{d2}! \cdots x_{dN}!} \prod_n \beta_{yn}^{x_{dn}} \times \pi_y$
    $= \arg\max_y \prod_n \beta_{yn}^{x_{dn}} \times \pi_y$  (the multinomial coefficient is constant for every class, denoted $c_d$)
    $= \arg\max_y \sum_n x_{dn} \log \beta_{yn} + \log \pi_y$
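
A minimal sketch of this decision rule in the log domain, assuming the parameters beta (one vector per class) and pi are already given; the toy numbers and the function name are my own, not from the slides.

```python
import math

def predict(x, beta, pi):
    """Return the class j maximizing sum_n x_n * log(beta[j][n]) + log(pi[j]).

    x    : list of word counts for one document, length N
    beta : beta[j][n] = probability of word n under class j (each row sums to 1)
    pi   : pi[j] = prior probability of class j
    """
    best_class, best_score = None, -math.inf
    for j in range(len(pi)):
        score = math.log(pi[j]) + sum(
            x_n * math.log(beta[j][n]) for n, x_n in enumerate(x) if x_n > 0
        )
        if score > best_score:
            best_class, best_score = j, score
    return best_class

# Toy example with N = 3 words and 2 classes (numbers are illustrative).
beta = [[0.7, 0.2, 0.1],   # class 0 favors word 0
        [0.1, 0.3, 0.6]]   # class 1 favors word 2
pi = [0.5, 0.5]
print(predict([5, 1, 0], beta, pi))  # expect class 0
print(predict([0, 2, 4], beta, pi))  # expect class 1
```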

  18. Parameter Estimation via MLE
  • Given a corpus and a label for each document: $D = \{(\boldsymbol{x}_d, y_d)\}$
  • Find the MLE estimators for $\Theta = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_m, \boldsymbol{\pi})$
  • The log-likelihood function for the training dataset:
    $\log L = \log \prod_d p(\boldsymbol{x}_d, y_d \mid \Theta) = \sum_d \log p(\boldsymbol{x}_d, y_d \mid \Theta)$
    $= \sum_d \log p(\boldsymbol{x}_d \mid y_d) \, p(y_d) = \sum_d \left( \sum_n x_{dn} \log \beta_{y_d n} + \log \pi_{y_d} + \log c_d \right)$
    (the $\log c_d$ term does not involve the parameters and can be dropped for optimization purposes)
  • The optimization problem:
    $\max_\Theta \log L$
    $\text{s.t.} \quad \pi_j \ge 0 \text{ and } \sum_j \pi_j = 1; \quad \beta_{jn} \ge 0 \text{ and } \sum_n \beta_{jn} = 1 \text{ for all } j$

  19. Solve the Optimization Problem
  • Use the Lagrange multiplier method (a sketch of the derivation is given after this slide)
  • Solution:
    $\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}}$
    • $\sum_{d: y_d = j} x_{dn}$: total count of word n in class j
    • $\sum_{d: y_d = j} \sum_{n'} x_{dn'}$: total count of all words in class j
    $\hat{\pi}_j = \frac{\sum_d 1(y_d = j)}{|D|}$
    • $1(y_d = j)$ is the indicator function, which equals 1 if $y_d = j$ holds
    • $|D|$: total number of documents
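
For reference, a brief sketch of the Lagrange multiplier step for the $\beta$ parameters, filled in here since the slide only states the result; the $\pi$ case is analogous. For each class $j$, the relevant part of the objective is $\sum_{d: y_d = j} \sum_n x_{dn} \log \beta_{jn}$ subject to $\sum_n \beta_{jn} = 1$. Form the Lagrangian and set its derivative to zero:

$$
\mathcal{L} = \sum_{d: y_d = j} \sum_n x_{dn} \log \beta_{jn} + \lambda \Big(1 - \sum_n \beta_{jn}\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial \beta_{jn}} = \frac{\sum_{d: y_d = j} x_{dn}}{\beta_{jn}} - \lambda = 0
\;\Rightarrow\;
\beta_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\lambda}.
$$

Summing over $n$ and using $\sum_n \beta_{jn} = 1$ gives $\lambda = \sum_{d: y_d = j} \sum_{n'} x_{dn'}$, which yields the estimator on the slide above.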

  20. Smoothing
  • What if some word n never appears in class j in the training dataset?
    • Then $\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn}}{\sum_{d: y_d = j} \sum_{n'} x_{dn'}} = 0$
    • $\Rightarrow p(\boldsymbol{x}_d \mid y = j) \propto \prod_n \beta_{jn}^{x_{dn}} = 0$ for any document containing word n
    • But the other words may strongly indicate that the document belongs to class j
  • Solution: add-1 smoothing, or Laplacian smoothing
    $\hat{\beta}_{jn} = \frac{\sum_{d: y_d = j} x_{dn} + 1}{\sum_{d: y_d = j} \sum_{n'} x_{dn'} + N}$
    • $N$: total number of words in the vocabulary
  • Check: does $\sum_n \hat{\beta}_{jn} = 1$ still hold?
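
Putting the estimation and smoothing slides together, a minimal training sketch (function and variable names are my own, not from the slides); it produces the smoothed estimates and checks that each class's $\hat{\beta}_j$ still sums to 1.

```python
def train_multinomial_nb(X, y, num_classes, alpha=1.0):
    """MLE with add-alpha smoothing (Laplacian smoothing for alpha = 1).

    X : list of documents, each a list of word counts of length N
    y : list of class labels in {0, ..., num_classes - 1}
    Returns (beta, pi) with beta[j][n] = p(word n | class j), pi[j] = p(class j).
    """
    N = len(X[0])
    # pi_j = (number of documents with label j) / |D|
    pi = [sum(1 for label in y if label == j) / len(y) for j in range(num_classes)]

    beta = []
    for j in range(num_classes):
        # Total count of each word n over the documents of class j.
        word_counts = [sum(x[n] for x, label in zip(X, y) if label == j)
                       for n in range(N)]
        total = sum(word_counts)
        # Smoothed estimate: (count + alpha) / (total + alpha * N).
        beta.append([(c + alpha) / (total + alpha * N) for c in word_counts])
    return beta, pi

# Toy corpus: 3 documents over a 4-word vocabulary (made-up counts).
X = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 0, 4, 2]]
y = [0, 0, 1]
beta, pi = train_multinomial_nb(X, y, num_classes=2)
print(pi)                                             # [2/3, 1/3]
print([abs(sum(row) - 1.0) < 1e-12 for row in beta])  # each beta_j sums to 1
```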
