LDA [Credits: Mike Smith, Las Vegas Sun 2013]


  1. LDA

  2. [Credits: Mike Smith, Las Vegas Sun 2013] LDA

  3. [Credits: IITD Library]

  7. In text, the hidden variables are the thematic structure. What are the topics that describe this collection? How does a new document fit into the topic structure?

  8. Credits: [David Blei, KDD12]

  9. Credits: [David Blei, KDD12]

  10. P(topics, proportions, assignments | documents)

  11. $\theta$, $\beta$, $Z_{d,n}$, $W_{d,n}$, $\alpha$, $\eta$

  12. $\theta$, $\beta$, $Z_{d,n}$, $W_{d,n}$, $\alpha$, $\eta$

  13. $\alpha$, $\theta$, $\beta$, $\eta$, $Z_{d,n}$, $W_{d,n}$

  14. $\theta$, $\alpha$

  15. [Credits: Wikipedia]

  19. $\theta$, $\beta$, $Z_{d,n}$, $W_{d,n}$, $\alpha$, $\eta$

  20. Topic 1: PGM ($\beta_1$): Bayesian: 0.1, Markov: 0.09, Network: 0.07, Inference: 0.07, …; Topic 2: ML ($\beta_2$): Inference: 0.2, Posterior: 0.15, Regression: 0.1, Gradient: 0.09, …; Topic 3: AI ($\beta_3$): Markov: 0.09, Reinforcement: 0.08, Planning: 0.08, …; Topic 4: Deep Learning ($\beta_4$): Backpropagation: 0.15, Convolution: 0.1, LSTM: 0.09, Dropout: 0.07, …. Example document: $\theta_d$ = (Topic 1: 0.7, Topic 2: 0.1, Topic 3: 0.15, Topic 4: 0.05); $Z_{d,n}$ = Topic 1; $W_{d,n}$ = Markov

  22. $\alpha = 1$

  23. $\alpha = 10$

  24. $\alpha = 100$

  25. $\alpha = 1$

  26. $\alpha = 0.1$

  27. $\alpha = 0.01$

  28. $p(\beta, \theta, z \mid w) = \dfrac{p(\beta, \theta, z, w)}{\iint_{\beta,\theta} \sum_{z} p(\beta, \theta, z, w)}$

  29. $x_{1:N}$, $z_{1:M}$

  30. $\nu$

  31. $q(\beta, z)$

  34. $n(z_{1:N})$

  35. $\theta$, $n_k(z_{-i})$, $z_{-i}$

  36. LDA
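The generative story behind the LDA slides (per-document topic proportions, a topic assignment for each word, then the observed word) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not code from the deck: the function names `sample_dirichlet` and `generate_document`, the two-topic setup, the four-word vocabulary, and all probabilities are hypothetical toy values.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]

def generate_document(alpha, topics, vocab, n_words, rng):
    """Sample one document from the LDA generative process."""
    # Per-document topic proportions: theta_d ~ Dirichlet(alpha)
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # Topic assignment for this position: z_{d,n} ~ Categorical(theta_d)
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Observed word: w_{d,n} ~ Categorical(beta_z)
        w = rng.choices(vocab, weights=topics[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words

# Toy setup (hypothetical): 2 topics over a 4-word vocabulary.
vocab = ["markov", "bayesian", "gradient", "regression"]
topics = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0 favors "markov" and "bayesian"
    [0.05, 0.05, 0.50, 0.40],  # topic 1 favors "gradient" and "regression"
]

rng = random.Random(0)
theta, z, words = generate_document([0.5, 0.5], topics, vocab, 8, rng)
print(theta, z, words)
```

Inference runs this story in reverse: only `words` is observed, and the posterior over `theta`, `z`, and the topic distributions has to be approximated. Values of `alpha` below 1 tend to concentrate each document on few topics, while larger values spread the proportions more evenly.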

  37. Steve Jobs (TYPE: Death, DATE: Oct 6); Yelp (TYPE: IPO, DATE: March 2); iPad (TYPE: Launch, DATE: Mar 7)

  38. Claim: This is worth investigating

  39. [Prachi] Events shown as URL: http://statuscalendar.com

  40. [Nupur] Model Architecture; [Happy] Normalization?; [Shantanu, Surag] Error Accumulation; [Himanshu, Prachi] Reliance on POS tagger

  41. Since spread of printing press; Timebank, MUC & ACE competitions; Limited to narrow domains; Performance is still not great

  42. Short; Easy to write (even on mobile devices); Instantly and widely disseminated; Many irrelevant messages; Many redundant messages

  43. `2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz'. “The Hobbit has FINALLY started filming! I cannot wait!” “watchng american dad.”

  44. Annotated 2400 tweets (about 34K tokens); Train on in-domain data

  45. [Chart: P, R, and F for Stanford vs. T-NER; axis from 0 to 0.8]

  47. Sports, Politics, Product releases, …; Allow more customized calendars; Could be useful in upstream tasks

  48. Might start talking about different things; Might want to focus on different groups of users

  49. Generative Probabilistic Models; Discovers types which match the data; No need to annotate individual events; Don’t need to commit to a specific set of types; Modular, can integrate into various applications

  50. Each Event Phrase is modeled as a mixture of types; Each Event Type is associated with a distribution over Entities and Dates; P(SPORTS | cheered) = 0.6, P(POLITICS | cheered) = 0.4. [Happy, Arindam, Akshay, Surag, Dinesh R] Liked; [Akshay] New entities?; [Anshul] Sensitive to parameters

  51. 1,000 iterations of burn-in; Parallelized sampling (approximation) using MPI [Newman et al. 2009]. [Happy, Nupur] Disliked manual annotation; [Anshul] ‘Legal’, ‘Food’ not event categories

  52. Using types discovered by the topic model; Supervised classification using 10-fold cross validation; Treat event phrases like bag of words. [Nupur] Multiple entity events?; [Nupur, Anshul] Very simple baseline

  53. What they ate for lunch; Entities such as McDonalds would be frequent on most days; Only show if entities appear more than expected

  54. $G^2 = \sum_{x \in \{e, \neg e\},\, y \in \{d, \neg d\}} O_{x,y} \times \ln\left(\dfrac{O_{x,y}}{E_{x,y}}\right)$; terms shown: $O_{e,d}$, $O_{e,\neg d}$, $E_{e,d}$. [Happy, Akshay, Shantanu, Nupur, Anshul, Rishab, Dinesh R] Liked; [Barun, Shantanu] Same event on multiple days?; [Rishab] Why not $\chi^2$?

  55. [Akshay, Barun] Liked

  56. End-to-end Evaluation; No Named Entity Recognition; Rely on significance test to rank n-grams; A few extra heuristics (filter out temporal expressions, etc.)

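The idea on slide 53 of showing an entity only when it appears more than expected, together with the reliance on a significance test on slide 56, can be illustrated with a small association score. The sketch below assumes the statistic on slide 54 has the form of a sum of O × ln(O / E) over the four cells of a 2×2 table (entity present or absent, date present or absent), with E computed under independence. The function name `g_squared` and its count arguments are hypothetical, chosen only for this illustration.

```python
import math

def g_squared(n_both, n_entity, n_date, n_total):
    """Association score between an entity e and a date d from co-occurrence counts.

    n_both:   tweets mentioning both e and d
    n_entity: tweets mentioning e
    n_date:   tweets mentioning d
    n_total:  all tweets
    """
    # Observed 2x2 table: rows = (e, not e), columns = (d, not d)
    observed = [
        [n_both, n_entity - n_both],
        [n_date - n_both, n_total - n_entity - n_date + n_both],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]

    score = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            # Expected count if e and d were independent
            e = row_totals[i] * col_totals[j] / n_total
            if o > 0:  # the term O * ln(O / E) tends to 0 as O -> 0
                score += o * math.log(o / e)
    return score

# An entity mentioned at its usual rate on a date vs. one that spikes on it.
baseline = g_squared(n_both=10, n_entity=100, n_date=100, n_total=1000)
spike = g_squared(n_both=50, n_entity=100, n_date=100, n_total=1000)
print(baseline, spike)
```

The textbook G-test multiplies this sum by 2, which leaves any ranking unchanged. The score also grows when an entity is rarer than expected on a date, so a check that the observed co-occurrence exceeds the expected one would be needed to keep only the "more than expected" cases.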
