

  1. CS 4501 Machine Learning for NLP. Text Classification (I): Logistic Regression. Yangfeng Ji, Department of Computer Science, University of Virginia

  2. Overview
     1. Problem Definition
     2. Bag-of-Words Representation
     3. Case Study: Sentiment Analysis
     4. Logistic Regression
     5. L2 Regularization
     6. Demo Code

  3. Problem Definition

  4. Case I: Sentiment Analysis [Pang et al., 2002]

  5. Case II: Topic Classification
     Example topics:
     ◮ Business
     ◮ Arts
     ◮ Technology
     ◮ Sports
     ◮ ...

  6-7. Classification
     ◮ Input: a text x
       ◮ Example: a product review on Amazon
     ◮ Output: y ∈ Y, where Y is the predefined category set (sample space)
       ◮ Example: Y = {Positive, Negative}
     The pipeline of text classification:
       Text → Numeric Vector x → Classifier → Category y
     Footnote: in this course, we use x for both a text and its representation, with no distinction.

  8-9. Probabilistic Formulation
     With the conditional probability P(Y | X), the prediction on Y for a given text X = x is
       ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (1)
     Or, for simplicity,
       ŷ = argmax_{y ∈ Y} P(y | x)    (2)
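To make the prediction rule in Eq. (2) concrete, here is a minimal sketch (not from the slides); the conditional probabilities are made-up numbers for one hypothetical text.

```python
# Minimal sketch of y_hat = argmax_y P(y | x) over a binary label set.
# The probabilities are made-up values for one hypothetical text x.
probs = {"Positive": 0.73, "Negative": 0.27}   # P(y | x) for each y in Y

# Predict the label with the highest conditional probability.
y_hat = max(probs, key=probs.get)
print(y_hat)   # Positive
```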

  10-14. Key Questions
     Recall
     ◮ the formulation defined in the previous slides
       ŷ = argmax_{y ∈ Y} P(Y = y | X = x)    (3)
     ◮ the pipeline of text classification
       Text → Numeric Vector x → Classifier → Category y
     Building a text classifier is about answering the following two questions:
     1. How to represent a text as x?
       ◮ Bag-of-words representation
     2. How to estimate P(y | x)?
       ◮ Logistic regression models
       ◮ Neural network classifiers

  15. Bag-of-Words Representation

  16-18. Bag-of-Words Representation
     Example texts:
       Text 1: I love coffee.
       Text 2: I don't like tea.
     Step I: convert each text into a collection of tokens
       Tokenized text 1: I love coffee
       Tokenized text 2: I don t like tea
     Step II: build a dictionary/vocabulary
       Vocabulary: { I, love, coffee, don, t, like, tea }
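As a rough illustration of Steps I and II (a simple regex tokenizer is assumed here; the slides do not specify which tokenizer is used):

```python
import re

texts = ["I love coffee.", "I don't like tea."]

# Step I: tokenize. The assumed regex keeps only alphabetic runs, so "don't"
# splits into "don" and "t", matching the tokenized texts on the slide.
tokenized = [re.findall(r"[A-Za-z]+", text) for text in texts]
# [['I', 'love', 'coffee'], ['I', 'don', 't', 'like', 'tea']]

# Step II: build the vocabulary in order of first appearance.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)
print(vocab)   # ['I', 'love', 'coffee', 'don', 't', 'like', 'tea']
```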

  19-20. Bag-of-Words Representations
     Step III: based on the vocab, convert each text into a numeric representation
       vocab:   I  love  coffee  don  t  like  tea
       x^(1) = [1   1     1       0   0   0     0]^T
       x^(2) = [1   0     0       1   1   1     1]^T
     The pipeline of text classification:
       Text → Numeric Vector x (bag-of-words representation) → Classifier → Category y
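Step III can be sketched as follows, reusing the hypothetical `tokenized` and `vocab` from the previous snippet (a real pipeline would more likely use something like scikit-learn's CountVectorizer):

```python
# Step III: map a tokenized text to a vector of word counts over the vocabulary.
def bag_of_words(tokens, vocab):
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab.index(token)] += 1
    return vector

x1 = bag_of_words(tokenized[0], vocab)   # [1, 1, 1, 0, 0, 0, 0]
x2 = bag_of_words(tokenized[1], vocab)   # [1, 0, 0, 1, 1, 1, 1]
```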

  21-22. Preprocessing for Building Vocab
     1. Convert all characters to lowercase
        UVa, UVA → uva
     2. Map low-frequency words to a special token unk
        Zipf's law: f(w_t) ∝ 1 / r_t  (a word's frequency is inversely proportional to its frequency rank, so most word types are rare)
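A minimal sketch of both preprocessing steps (the frequency cutoff of 2 is an arbitrary assumption; the actual demo code may use a different threshold):

```python
from collections import Counter

def preprocess(tokenized_texts, min_count=2):
    # Step 1: lowercase every token, so "UVa" and "UVA" both become "uva".
    lowered = [[token.lower() for token in tokens] for tokens in tokenized_texts]

    # Step 2: map low-frequency tokens to the special token "unk".
    # By Zipf's law most word types are rare, so this keeps the vocabulary small.
    counts = Counter(token for tokens in lowered for token in tokens)
    return [[token if counts[token] >= min_count else "unk" for token in tokens]
            for tokens in lowered]
```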

  23. Information Embedded in BoW Representations
     It is critical to keep in mind what information is preserved in bag-of-words representations:
     ◮ Keep:
       ◮ the words that appear in a text
     ◮ Lose:
       ◮ word order
       ◮ sentence boundaries
       ◮ paragraph boundaries
       ◮ ...
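A small illustration of the word-order loss (the sentence pair is my own example, not from the slides):

```python
from collections import Counter

# Two sentences with opposite meanings receive identical bag-of-words counts,
# showing that word order is not preserved.
print(Counter("man bites dog".split()) == Counter("dog bites man".split()))   # True
```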

  24. Case Study: Sentiment Analysis

  25-28. A Simple Predictor
     Consider the following toy example
       Tokenized text 1: I love coffee
       Tokenized text 2: I don t like tea
     With the vocabulary (I, love, coffee, don, t, like, tea):
       x^(1) = [1 1 1 0 0 0 0]^T
       w_Pos = [0 1 0 0 0 1 0]^T
       w_Neg = [0 0 0 1 0 0 0]^T
     The prediction of sentiment polarity can be formulated as
       w_Pos^T x = 1 > w_Neg^T x = 0    (4)
     Essentially, this way of making predictions amounts to counting the positive and negative words in the text.
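A short sketch of the counting predictor from the toy example, with the vectors copied from the slide:

```python
import numpy as np

# Vocabulary order: I, love, coffee, don, t, like, tea
x1    = np.array([1, 1, 1, 0, 0, 0, 0])   # "I love coffee"
w_pos = np.array([0, 1, 0, 0, 0, 1, 0])   # fires on the positive words "love", "like"
w_neg = np.array([0, 0, 0, 1, 0, 0, 0])   # fires on the negative word "don" (don't)

# Predict Positive when the positive-word count beats the negative-word count.
label = "Positive" if w_pos @ x1 > w_neg @ x1 else "Negative"
print(w_pos @ x1, w_neg @ x1, label)   # 1 0 Positive
```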

  29-31. Another Example
     The limitation of word counting:
       x^(2) = [1 0 0 1 1 1 1]^T
       w_Pos = [0 1 0 0 0 1 0]^T
       w_Neg = [0 0 0 1 0 0 0]^T
     Here w_Pos^T x^(2) = w_Neg^T x^(2) = 1, so counting cannot tell that text 2 ("I don t like tea") is negative.
     ◮ Different words should contribute differently, e.g., not vs. dislike
     ◮ Sentiment word lists are not complete
       Example II (Positive): "Din Tai Fung, every time I go eat at anyone of the locations around the King County area, I keep being reminded on why I have to keep coming back to this restaurant. ..."

  32. Logistic Regression

  33-34. Log-linear Models
     Directly model a linear classifier as
       h_y(x) = w_y^T x + b_y    (5)
     with
     ◮ x ∈ N^V: vector, bag-of-words representation
     ◮ w_y ∈ R^V: vector, classification weights associated with label y
     ◮ b_y ∈ R: scalar, bias of label y in the training set
     About Label Bias
       Consider a case where we have 90 positive examples and 10 negative examples in the training set. With b_Pos > b_Neg, a classifier can get 90% of its predictions correct without even resorting to the texts.
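To see the label-bias remark in numbers, here is a tiny sketch (the zero-weight setup and the bias values are illustrative assumptions, not from the slides):

```python
# With zero weights, the score h_y(x) = b_y depends only on the label bias, so a
# classifier with b_Pos > b_Neg predicts Positive for every text. On a training
# set of 90 positive and 10 negative examples that already yields 90% accuracy.
b_pos, b_neg = 1.0, 0.0
labels = ["Positive"] * 90 + ["Negative"] * 10
predictions = ["Positive" if b_pos > b_neg else "Negative" for _ in labels]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)   # 0.9
```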

  35-36. Logistic Regression
     Rewrite the linear decision function in the log-probabilistic form
       log P(y | x) ∝ w_y^T x + b_y  (the right-hand side is h_y(x) from Eq. 5)    (6)
     or, in the probabilistic form,
       P(y | x) ∝ exp(w_y^T x + b_y)    (7)
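To connect Eq. (7) to code, here is a minimal sketch that exponentiates the linear scores and normalizes them into probabilities (a softmax); the weights and biases are made-up numbers for the 7-word toy vocabulary. In practice the parameters would be learned, e.g. with scikit-learn's LogisticRegression, whose default penalty is the L2 regularization listed in the overview.

```python
import numpy as np

# P(y | x) ∝ exp(w_y^T x + b_y), normalized over the label set {Positive, Negative}.
x = np.array([1, 0, 0, 1, 1, 1, 1])                     # bag-of-words for "I don t like tea"
W = np.array([[0.1, 1.2, 0.3, -0.2, 0.0, 0.8, 0.1],     # w_Pos (made-up weights)
              [0.1, -0.9, 0.0, 1.1, 0.4, -0.1, 0.2]])   # w_Neg (made-up weights)
b = np.array([0.05, -0.05])                             # label biases b_Pos, b_Neg

scores = W @ x + b                          # h_y(x) for each label y
probs = np.exp(scores - scores.max())       # subtract the max for numerical stability
probs /= probs.sum()                        # normalize so the probabilities sum to 1
print(dict(zip(["Positive", "Negative"], probs.round(3))))   # Negative is more probable
```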
