LDA
[Credits: Mike Smith, Las Vegas Sun 2013]
[Credits: IITD Library]
In text, the hidden variables are the thematic structure. What are the topics that describe this collection? How does a new document fit into the topic structure?
Credits: [David Blei, KDD12]
P(topics, proportions, assignments | documents)
[Figure: LDA graphical model with variables α, θ, $z_{d,n}$, $W_{d,n}$, β, η]
[Figure: Dirichlet distribution over θ with parameter α]
[Credits: Wikipedia]
[Figure: LDA graphical model with variables α, θ, $z_{d,n}$, $W_{d,n}$, β, η]
Topic 1: PGM ($\beta_1$): Bayesian: 0.1, Markov: 0.09, Network: 0.07, Inference: 0.07, …
Topic 2: ML ($\beta_2$): Inference: 0.2, Posterior: 0.15, Regression: 0.1, Gradient: 0.09, …
Topic 3: AI ($\beta_3$): Markov: 0.09, Reinforcement: 0.08, Planning: 0.08, …
Topic 4: Deep Learning ($\beta_4$): Backpropagation: 0.15, Convolution: 0.1, LSTM: 0.09, Dropout: 0.07, …
$\theta_d$: Topic 1: 0.7, Topic 2: 0.1, Topic 3: 0.15, Topic 4: 0.05
$z_{d,n}$: Topic 1
$W_{d,n}$: Markov
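The example above can be read as one step of LDA's generative process: draw a topic assignment from the document's topic proportions, then draw a word from the chosen topic. The following is a minimal Python sketch of that process, not a definitive implementation. It assumes the topic-word numbers pair up with the four topics in the order they appear on the slide, it lumps the elided words ("…") into a hypothetical `<other>` token so each topic sums to 1, and all function names are illustrative.

```python
import random

def with_remainder(top_words):
    """Pad a truncated word list so its probabilities sum to 1.

    The slide elides most of each topic's vocabulary; the leftover mass is
    assigned to a hypothetical "<other>" token (an assumption, not slide data).
    """
    dist = dict(top_words)
    dist["<other>"] = 1.0 - sum(top_words.values())
    return dist

def sample(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_document(theta, topics, length, rng):
    """Return (topic, word) pairs: z ~ theta_d, then w ~ beta_z."""
    pairs = []
    for _ in range(length):
        z = sample(theta, rng)          # per-word topic assignment
        w = sample(topics[z], rng)      # word drawn from that topic
        pairs.append((z, w))
    return pairs

# Topic-word distributions as read from the slide (column pairing assumed).
topics = {
    "Topic 1": with_remainder({"Bayesian": 0.10, "Markov": 0.09,
                               "Network": 0.07, "Inference": 0.07}),
    "Topic 2": with_remainder({"Inference": 0.20, "Posterior": 0.15,
                               "Regression": 0.10, "Gradient": 0.09}),
    "Topic 3": with_remainder({"Markov": 0.09, "Reinforcement": 0.08,
                               "Planning": 0.08}),
    "Topic 4": with_remainder({"Backpropagation": 0.15, "Convolution": 0.10,
                               "LSTM": 0.09, "Dropout": 0.07}),
}

# Topic proportions for one document, as given on the slide.
theta_d = {"Topic 1": 0.7, "Topic 2": 0.1, "Topic 3": 0.15, "Topic 4": 0.05}

rng = random.Random(0)
doc = generate_document(theta_d, topics, 10, rng)
for z, w in doc:
    print(z, w)
```

With these proportions most sampled words come from Topic 1, which is how a pair such as (Topic 1, Markov) arises.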
[Figures: Dirichlet distribution for α = 1, α = 10, α = 100, and for α = 1, α = 0.1, α = 0.01]
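To make the effect of these values concrete, the sketch below reads them (1, 10, 100 and 1, 0.1, 0.01) as the concentration parameter of a symmetric Dirichlet prior, which is an assumption about what the plots showed. It draws samples for each value and reports the average size of the largest component. The dimension (10), the sample count, and the function names are illustrative choices, not taken from the slides.

```python
import random

def sample_dirichlet(alpha, dim, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when alpha is very small
            return [x / total for x in draws]

def mean_largest_component(alpha, dim, num_samples, rng):
    """Average of max(theta) over several draws; near 1 means sparse draws."""
    largest = [max(sample_dirichlet(alpha, dim, rng)) for _ in range(num_samples)]
    return sum(largest) / num_samples

rng = random.Random(0)
results = {
    alpha: mean_largest_component(alpha, dim=10, num_samples=500, rng=rng)
    for alpha in (100, 10, 1, 0.1, 0.01)
}
for alpha, value in results.items():
    print(f"alpha={alpha}: mean largest component = {value:.3f}")
```

Large values give proportions close to uniform, while values below 1 concentrate most of the mass on a few components, so each document uses only a few topics.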
$p(\beta, \theta, z \mid w) = \frac{p(\beta, \theta, z, w)}{\int_{\beta} \int_{\theta} \sum_{z} p(\beta, \theta, z, w)}$
$x_{1:N}$, $z_{1:M}$
$\nu$
$q(\beta, z)$
$n(z_{1:N})$
$\theta$, $n_k(z_{-i})$, $z_{-i}$
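The count notation above suggests a collapsed Gibbs sampler, in which each assignment is resampled given counts computed from all other assignments. Only fragments of the slide survive, so the following is a sketch of the commonly used collapsed Gibbs update for LDA rather than a reconstruction of the slide. It assumes symmetric priors `alpha` and `eta` and integer word ids; all names are illustrative.

```python
import random

def collapsed_gibbs_lda(docs, num_topics, vocab_size, alpha, eta, sweeps, rng):
    """Collapsed Gibbs sampling for LDA over documents given as word-id lists.

    Each token's topic is resampled with probability proportional to
    (n_dk + alpha) * (n_kw + eta) / (n_k + vocab_size * eta),
    where the counts exclude the token being resampled.
    """
    n_dk = [[0] * num_topics for _ in docs]                # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word counts
    n_k = [0] * num_topics                                 # tokens per topic
    z = []

    # Random initialization of all assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional over topics for this token.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + eta)
                    / (n_k[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw

# Toy corpus: word ids 0-2 and 3-5 tend to co-occur (made-up data).
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
z, n_dk, n_kw = collapsed_gibbs_lda(
    toy_docs, num_topics=2, vocab_size=6, alpha=0.1, eta=0.1,
    sweeps=200, rng=random.Random(0),
)
print(n_dk)
```

Normalizing each row of `n_dk` (after adding `alpha`) gives an estimate of that document's topic proportions.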
Steve Jobs | TYPE: Death | DATE: Oct 6
Yelp | TYPE: IPO | DATE: March 2
iPad | TYPE: Launch | DATE: Mar 7
Claim: This is worth investigating
• [Prachi] Events shown as url http://statuscalendar.com
[Nupur] Model Architecture
[Happy] Normalization?
[Shantanu, Surag] Error Accumulation
[Himanshu, Prachi] Reliance on POS tagger
Since spread of printing press
Timebank
MUC & ACE competitions
• Limited to narrow domains
• Performance is still not great
Short
Easy to write (even on mobile devices)
Instantly and widely disseminated
Many irrelevant messages
Many redundant messages
`2m', `2ma', `2mar', `2mara', `2maro', `2marrow', `2mor', `2mora', `2moro', `2morow', `2morr', `2morro', `2morrow', `2moz', `2mr', `2mro', `2mrrw', `2mrw', `2mw', `tmmrw', `tmo', `tmoro', `tmorrow', `tmoz', `tmr', `tmro', `tmrow', `tmrrow', `tmrrw', `tmrw', `tmrww', `tmw', `tomaro', `tomarow', `tomarro', `tomarrow', `tomm', `tommarow', `tommarrow', `tommoro', `tommorow', `tommorrow', `tommorw', `tommrow', `tomo', `tomolo', `tomoro', `tomorow', `tomorro', `tomorrw', `tomoz', `tomrw', `tomz'
“The Hobbit has FINALLY started filming! I cannot wait!”
“watchng american dad.”
• Annotated 2400 tweets (about 34K tokens)
• Train on in-domain data
[Figure: bar chart of precision (P), recall (R), and F-score (F) for Stanford and T-NER; y-axis from 0 to 0.8]
Sports
Politics
Product releases
…
Allow more customized calendars
Could be useful in upstream tasks
Might start talking about different things
Might want to focus on different groups of users
Generative Probabilistic Models
Discovers types which match the data
No need to annotate individual events
Don’t need to commit to a specific set of types
Modular, can integrate into various applications
Each event phrase is modeled as a mixture of types
P(SPORTS | cheered) = 0.6
P(POLITICS | cheered) = 0.4
Each event type is associated with a distribution over entities and dates
[Happy, Arindam, Akshay, Surag, Dinesh R] Liked
[Akshay] New entities?
[Anshul] Sensitive to parameters
1,000 iterations of burn-in
Parallelized sampling (approximation) using MPI [Newman et al. 2009]
[Happy, Nupur] Disliked manual annotation
[Anshul] ‘Legal’, ‘Food’ not event categories
Using types discovered by the topic model
Supervised classification using 10-fold cross validation
Treat event phrases like bag of words
[Nupur] Multiple entity events?
[Nupur, Anshul] Very simple baseline
What they ate for lunch
Entities such as McDonalds would be frequent on most days
Only show an entity if it appears more often than expected
$G^2$
$G^2 = \sum_{x \in \{e, \neg e\},\ y \in \{d, \neg d\}} O_{x,y} \times \ln\left(\frac{O_{x,y}}{E_{x,y}}\right)$
$O_{e,d}$, $O_{e,\neg d}$, $E_{e,d}$
[Happy, Akshay, Shantanu, Nupur, Anshul, Rishab, Dinesh R] Liked
[Barun, Shantanu] Same event on multiple days?
[Rishab] Why not $\chi^2$?
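As a worked illustration of the statistic above, the sketch below computes the sum of O × ln(O / E) over the 2×2 table of entity (e or not e) against date (d or not d). It makes several assumptions not stated on the slide: O is a tweet count, E is the count expected if entity and date were independent, and cells with zero observations contribute nothing. The function name and the example counts are made up.

```python
import math

def g_squared(n_ed, n_e, n_d, n_total):
    """Sum of O * ln(O / E) over the 2x2 entity-by-date contingency table.

    n_ed:    tweets mentioning both the entity and the date
    n_e:     tweets mentioning the entity
    n_d:     tweets mentioning the date
    n_total: all tweets
    """
    observed = {
        (True, True): n_ed,
        (True, False): n_e - n_ed,
        (False, True): n_d - n_ed,
        (False, False): n_total - n_e - n_d + n_ed,
    }
    entity_totals = {True: n_e, False: n_total - n_e}
    date_totals = {True: n_d, False: n_total - n_d}

    score = 0.0
    for (has_entity, has_date), o in observed.items():
        if o == 0:
            continue  # treat 0 * ln(0) as 0
        # Expected count under independence of entity and date.
        expected = entity_totals[has_entity] * date_totals[has_date] / n_total
        score += o * math.log(o / expected)
    return score

# Made-up counts: an entity in 100 of 1000 tweets, a date in 50 of 1000.
independent = g_squared(n_ed=5, n_e=100, n_d=50, n_total=1000)
weak = g_squared(n_ed=10, n_e=100, n_d=50, n_total=1000)
strong = g_squared(n_ed=40, n_e=100, n_d=50, n_total=1000)
print(independent, weak, strong)
```

The conventional G-test multiplies this sum by 2, which does not change the ranking. The score also grows when an entity co-occurs with a date less often than expected, so a ranking that follows the "more than expected" rule would additionally check that the observed joint count exceeds the expected one.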
[Akshay, Barun] Liked
End-to-end Evaluation
No Named Entity Recognition
Rely on significance test to rank ngrams
A few extra heuristics (filter out temporal expressions etc.)