

  1. 1 INF4080 – 2020 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning

  2. 2 (Mostly Text) Classification, Naive Bayes Lecture 3, 31 Aug

  3. Today - Classification 3 • Motivation • Classification • Naive Bayes classification • NB for text classification • The multinomial model • The Bernoulli model • Experiments: training, test and cross-validation • Evaluation

  4. Motivation 4

  5. Did Mikhail Sholokhov write And Quiet Flows the Don? 5 • Sholokhov, 1905-1984 • And Quiet Flows the Don, published 1928-1940 • Nobel prize, literature, 1965 • Authorship contested, e.g. by Aleksandr Solzhenitsyn, 1974 • Geir Kjetsaa (UiO) et al., 1984, refuted the contestants • Nils Lid Hjort, 2007, confirmed Kjetsaa by using sentence lengths and advanced statistics • Kjetsaa, according to Hjort: "In addition to various linguistic analyses and several doses of detective work, quantitative data were gathered and organised, for example, relating to word lengths, frequencies of certain words and phrases, sentence lengths, grammatical characteristics, etc." • https://en.wikipedia.org/wiki/Mikhail_Sholokhov

  6. Positive or negative movie review? 6 • unbelievably disappointing • Full of zany characters and richly applied satire, and some great plot twists • this is the greatest screwball comedy ever filmed • It was pathetic. The worst part about it was the boxing scenes. • From Jurafsky & Martin

  7. What is the subject of this article? 7 • A MEDLINE article is to be assigned a category from the MeSH Subject Category Hierarchy • Antagonists and Inhibitors • Blood Supply • Chemistry • Drug Therapy • Embryology • Epidemiology • … • From Jurafsky & Martin

  8. Classification 8

  9. Classification 9 • Can be rule-based, but mostly machine learned • Text classification is a sub-class • Text classification examples: spam detection, genre classification, language identification, sentiment analysis (positive-negative) • Other types of classification: word sense disambiguation, sentence splitting, tagging, named-entity recognition

  10. Machine learning 10 • Supervised: given classes, and given examples of correct classes • Unsupervised: construct classes • Main types of machine learning: 1. supervised (classification: categorical; regression: numerical), 2. unsupervised, 3. semi-supervised, 4. reinforcement learning

  11. Supervised classification 11

  12. Supervised classification 12 • Given: a well-defined set of observations, O, and a given set of classes, C = {c1, c2, …, ck} • Goal: a classifier, γ, a mapping from O to C • For supervised training one needs a set of pairs from O × C • Examples:

      Task                       | O                     | C
      Spam classification        | E-mails               | spam, no-spam
      Language identification    | Pieces of text        | Arabian, Chinese, English, Norwegian, …
      Word sense disambiguation  | Occurrences of "bass" | sense1, …, sense8

  13. Features 13 • To represent the objects in O, extract a set of features • Be explicit: which features; for each feature, the type (categorical, or numeric: discrete/continuous) and the value space • Example, O: email – features: length, sender, contained words, language, … • Example, O: person – features: height, weight, hair color, eye color, … • Cf. first lecture • Classes and features are both attributes of the observations

  14. Supervised classification • A given set of classes, C = {c1, c2, …, ck} • A well-defined class of observations, O • Some features f1, f2, …, fn • For each feature: a set of possible values V1, V2, …, Vn • The set of feature vectors: V = V1 × V2 × … × Vn • Each observation in O is represented by some member of V: written (f1 = v1, f2 = v2, …, fn = vn), or (v1, v2, …, vn) if we have decided the order • A classifier, γ, can be considered a mapping from V to C
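A minimal sketch (in Python, which the slides themselves do not use) of an observation represented as a feature vector and a classifier γ mapping it to a class; the features follow the email example on the Features slide, while the values and the rule inside classify() are purely hypothetical.

    # Hypothetical illustration: an observation as a feature vector (f1=v1, ..., fn=vn)
    observation = {
        "length": 1200,                                  # numeric feature
        "sender": "unknown@spam.example",                # categorical feature
        "contained_words": ["free", "offer", "click"],   # set-valued feature
        "language": "English",                           # categorical feature
    }

    def classify(obs):
        # A toy stand-in for the classifier gamma: V -> C = {spam, no-spam}
        if "free" in obs["contained_words"]:
            return "spam"
        return "no-spam"

    print(classify(observation))   # -> spam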

  15. A variety of ML classifiers 15 • k-Nearest Neighbors • Rocchio • Naive Bayes • Logistic regression (Maximum entropy) • Support Vector Machines • Decision Trees • Perceptron • Multi-layered neural nets ("Deep learning")

  16. Naïve Bayes 16

  17. Example: Jan. 2021 • Student: "Professor, do you think I will enjoy IN3050?" • Professor: "I can give you a scientific answer using machine learning."

  18. Baseline • Survey: asked all the students of 2020 • 200 answered: 130 yes, 70 no • Baseline classifier: choose the majority class • Baseline answer: "Yes, you will like it." • Accuracy 0.65 = 65% • (With two classes, the majority baseline always gives accuracy > 0.5)
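A minimal sketch of the majority-class baseline, assuming the 200 survey answers are stored as a plain Python list; the 130/70 split matches the numbers on the slide.

    from collections import Counter

    # 130 'yes' and 70 'no' answers from the (imaginary) 2020 survey
    labels = ["yes"] * 130 + ["no"] * 70

    # Majority-class baseline: always predict the most frequent class
    majority_class, count = Counter(labels).most_common(1)[0]
    accuracy = count / len(labels)

    print(majority_class, accuracy)   # -> yes 0.65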

  19. Example: one year from now, Jan. 2021 • Student: "Professor, do you think I will enjoy IN3050?" • Professor: "To answer that, I have to ask you some questions."

  20. The 2020 survey (imaginary) Ask each of the 200 students: • Did you enjoy the course? Yes/no • Do you like mathematics? Yes/no • Do you have programming experience? None/some/good (= 3 or more courses) • Have you taken advanced machine learning courses? Yes/no • And many more questions, but we have to simplify here

  21. Results of the 2020 survey: a data set

      Student no | Enjoy maths | Programming | Adv. ML | Enjoy
      1          | Y           | Good        | N       | Y
      2          | Y           | Some        | N       | Y
      3          | N           | Good        | Y       | N
      4          | N           | None        | N       | N
      5          | N           | Good        | N       | Y
      6          | N           | Good        | Y       | Y
      …
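A minimal sketch of how the rows above might be stored for later training; the dictionary keys (maths, prog, advml) are hypothetical shorthands for the column names, and only the six rows shown are included.

    # Each row is a (features, label) pair; the label is the 'Enjoy' column
    data = [
        ({"maths": "Y", "prog": "Good", "advml": "N"}, "Y"),
        ({"maths": "Y", "prog": "Some", "advml": "N"}, "Y"),
        ({"maths": "N", "prog": "Good", "advml": "Y"}, "N"),
        ({"maths": "N", "prog": "None", "advml": "N"}, "N"),
        ({"maths": "N", "prog": "Good", "advml": "N"}, "Y"),
        ({"maths": "N", "prog": "Good", "advml": "Y"}, "Y"),
        # ... the remaining 194 rows of the (imaginary) survey
    ]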

  22. Summary of the 2020 survey

  23. Our new student • We ask our incoming new student the same three questions • From the table we can see, e.g., that if she has good programming skills, no AdvML course, and does not like maths, there is a 40/44 chance she will enjoy the course • But what should we say to a student with some programming background and an adv. ML course who does not like maths?

  24. A little more formal 24
  • What we do is that we consider
      P(enjoy = yes | prog = good, AdvML = no, Maths = no)   and
      P(enjoy = no | prog = good, AdvML = no, Maths = no)
    and decide on the class which has the largest probability, in symbols
      argmax_{y ∈ {yes, no}} P(enjoy = y | prog = good, AdvML = no, Maths = no)
  • But there may be many more features
  • An exponential growth in possible combinations
  • We might not have seen all combinations, or they may be rare
  • Therefore we apply Bayes' theorem, and we make a simplifying assumption

  25. Naive Bayes: Decision 25
  • Given an observation (f1 = v1, f2 = v2, …, fn = vn)
  • Consider P(s_m | f1 = v1, f2 = v2, …, fn = vn) for each class s_m
  • Choose the class with the largest value, in symbols
      argmax_{s_m ∈ S} P(s_m | f1 = v1, f2 = v2, …, fn = vn)
  • i.e. choose the most probable class given the observation

  26. Naive Bayes: Model 26
  • Bayes' formula:
      P(s_m | f1 = v1, …, fn = vn) = P(f1 = v1, …, fn = vn | s_m) P(s_m) / P(f1 = v1, …, fn = vn)
  • Sparse data: we may not even have seen (f1 = v1, f2 = v2, …, fn = vn)
  • We assume (wrongly) independence:
      P(f1 = v1, …, fn = vn | s_m) = ∏_{i=1}^{n} P(f_i = v_i | s_m)
  • Putting it together, choose
      argmax_{s_m ∈ S} P(s_m | f1 = v1, …, fn = vn) = argmax_{s_m ∈ S} P(s_m) ∏_{i=1}^{n} P(f_i = v_i | s_m)
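A minimal sketch of this decision rule, assuming the prior P(s) and the conditionals P(f_i = v_i | s) have already been estimated (their estimation follows on the next two slides); the dictionary layout is an assumption made for illustration.

    # Naive Bayes decision: argmax over classes of P(s) * prod_i P(f_i = v_i | s)
    # priors: class -> P(s); cond: (class, feature, value) -> P(f = v | s)
    def nb_classify(observation, classes, priors, cond):
        best_class, best_score = None, 0.0
        for s in classes:
            score = priors[s]
            for feature, value in observation.items():
                # unseen (class, feature, value) combinations get probability 0 under plain MLE
                score *= cond.get((s, feature, value), 0.0)
            if score > best_score:
                best_class, best_score = s, score
        return best_class

In practice one usually sums log probabilities instead of multiplying raw probabilities, to avoid numerical underflow when there are many features.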

  27. Naive Bayes: Training 1 27
  • Maximum likelihood:
      P̂(s_m) = C(s_m, o) / C(o)
    where C(s_m, o) is the number of occurrences of observations o in class s_m, and C(o) is the total number of observations
  • Observe what we are doing:
      We are looking for the true probability P(s_m)
      P̂(s_m) is an approximation to this, our best guess from a set of observations
  • Maximum likelihood means that it is the model which makes the set of observations we have seen most likely

  28. Naive Bayes: Training 2 28
  • Maximum likelihood:
      P̂(f_i = v_i | s_m) = C(f_i = v_i, s_m) / C(s_m)
  • where C(f_i = v_i, s_m) is the number of observations o such that o belongs to class s_m and the feature f_i takes the value v_i
  • C(s_m) is the number of observations belonging to class s_m
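A minimal sketch of these two maximum-likelihood estimates, computed by counting over (features, label) pairs in the layout used for the survey rows above; the function name and data layout are assumptions for illustration.

    from collections import Counter, defaultdict

    # P^(s_m) = C(s_m) / N   and   P^(f_i = v_i | s_m) = C(f_i = v_i, s_m) / C(s_m)
    def nb_train(data):
        class_counts = Counter(label for _, label in data)
        priors = {s: c / len(data) for s, c in class_counts.items()}

        pair_counts = defaultdict(int)
        for features, label in data:
            for feature, value in features.items():
                pair_counts[(label, feature, value)] += 1

        cond = {key: count / class_counts[key[0]]
                for key, count in pair_counts.items()}
        return priors, cond

    priors, cond = nb_train(data)   # 'data' as in the earlier sketch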

  29. Back to example 29 • Collect the numbers • Estimate the probabilities

  30. Back to example 30
  • argmax_{c_m ∈ C} P(c_m) ∏_{i=1}^{n} P(f_i = v_i | c_m)
  • P(yes) × P(good | yes) × P(A:no | yes) × P(M:no | yes) = 130/200 × 100/130 × 115/130 × 59/130 ≈ 0.2
  • P(no) × P(good | no) × P(A:no | no) × P(M:no | no) = 70/200 × 22/70 × 53/70 × 39/70 ≈ 0.046
  • So we predict that the student will most probably enjoy the class
  • Accuracy on training data: 75%
  • Compare to baseline: 65%
  • Best classifier: 80%
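A small check of this arithmetic in Python, using only the counts stated on the slide.

    # Unnormalised Naive Bayes scores for the new student
    # (good programming, no AdvML course, does not like maths)
    score_yes = (130 / 200) * (100 / 130) * (115 / 130) * (59 / 130)
    score_no = (70 / 200) * (22 / 70) * (53 / 70) * (39 / 70)

    print(round(score_yes, 3), round(score_no, 3))   # -> 0.201 0.046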
