  1. Filter keywords and majority class strategies for company name disambiguation on Twitter Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group CLEF 2011 Conference September 19-22, Amsterdam

  2. Goal • Two signals suggested by intuition: – Filter keywords – Majority Class • Do they help characterize and solve the problem?

  3. WePS-3 Online Reputation Management Task

  6. Tweets for query «jaguar» • related tweets=8 • unrelated tweets=2 • Related ratio = 8/(8+2) = 0.8

  7. Tweets for query «orange» • related tweets=0 • unrelated tweets=10 • Related ratio = 0

  8. Tweets for query «apple» • related tweets=5 • unrelated tweets=5 • Related ratio = 0.5
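
The related ratio used on the «jaguar», «orange» and «apple» slides can be sketched in a few lines; the label lists below just reproduce the counts from the slides:

```python
def related_ratio(labels):
    """Fraction of tweets whose gold label is 'related' to the company."""
    related = sum(1 for label in labels if label == "related")
    return related / len(labels) if labels else 0.0

# Label sets mirroring the three example queries
jaguar = ["related"] * 8 + ["unrelated"] * 2
orange = ["unrelated"] * 10
apple = ["related"] * 5 + ["unrelated"] * 5

print(related_ratio(jaguar))  # 0.8
print(related_ratio(orange))  # 0.0
print(related_ratio(apple))   # 0.5
```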

  9. Fingerprint representation

  13. WePS-3 Task 2 Systems

  15. Filter keywords

  19. Tweets for query «apple» • positive keyword: store • 4 tweets annotated as «related» • negative keyword: eating • 2 tweets annotated as «unrelated» • Accuracy= 1.0 • Recall=60%
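
A minimal sketch of how the slide's accuracy and recall figures arise: tweets matched by a positive keyword are predicted «related», tweets matched by a negative keyword «unrelated», and the rest are left unclassified. Only the keywords store/eating and the 5-related/5-unrelated split come from the slides; the tweet texts are invented for illustration.

```python
def filter_keyword_eval(tweets, positives, negatives):
    """Evaluate filter keywords: accuracy over classified tweets only,
    recall as the fraction of all tweets correctly classified."""
    correct, classified = 0, 0
    for text, gold in tweets:
        words = text.lower().split()
        if any(k in words for k in positives):
            pred = "related"
        elif any(k in words for k in negatives):
            pred = "unrelated"
        else:
            continue  # no keyword matched: tweet stays unclassified
        classified += 1
        correct += (pred == gold)
    accuracy = correct / classified if classified else 0.0
    recall = correct / len(tweets) if tweets else 0.0
    return accuracy, recall

# Invented tweets mirroring the «apple» example (5 related, 5 unrelated)
tweets = [
    ("queuing at the apple store", "related"),
    ("apple store opens downtown", "related"),
    ("genius bar at the apple store", "related"),
    ("apple store gift cards", "related"),
    ("eating an apple right now", "unrelated"),
    ("eating apple pie at home", "unrelated"),
    ("apple event rumors", "related"),
    ("apple trees in bloom", "unrelated"),
    ("apple orchard visit", "unrelated"),
    ("apple juice for breakfast", "unrelated"),
]
acc, rec = filter_keyword_eval(tweets, {"store"}, {"eating"})
print(acc, rec)  # 1.0 0.6
```

Six of the ten tweets contain a keyword and all six predictions are correct, hence accuracy 1.0 and recall 60%.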

  20. Manual keywords (perfect for a Web user)
  • amazon – positive: electronics, books, apparel, computers, buy – negative: river, rainforest, deforestation, bolivian, brazilian
  • fox – positive: tv, broadcast, shows, episodes, fringe – negative: animal, terrier, hunting, bones, volkswagen, racing
  • ford – positive: motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric – negative: tom, harrison, henry, glenn, gucci

  21. Oracle keywords (perfect on Twitter)
  • amazon – positive: sale, books, deal, deals, gift – negative: followdaibosyu, pest, plug, brothers, pirotta
  • fox – positive: money, weather, leader, denouncing, megan – negative: matthew, lazy, valley, viewers, michael
  • ford – positive: mustang, focus, hybrid, motor, truck – negative: tom, harrison, rob, bring, coppola

  24. Upper bound of Filter Keywords
  • Oracle keywords
  – 20 oracle keywords ≈ 50% recall
  – 5 oracle keywords ≈ 30% recall
  • Manual keywords
  – ≈10 per company
  – 14.61% recall (vs. 39.97% with 10 oracle keywords)
  – 0.86 accuracy
  • Twitter ≠ Web
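
One plausible reading of "oracle keywords" is a greedy selection made with access to the gold labels: keep only terms whose tweets all share one label (perfect accuracy) and repeatedly pick the term covering the most still-uncovered tweets. This is an assumed reconstruction, not the paper's exact procedure.

```python
from collections import defaultdict

def oracle_keywords(tweets, k):
    """Greedily pick up to k zero-error terms maximizing tweet coverage.
    tweets: list of (text, gold_label). Returns (keywords, recall)."""
    occ = defaultdict(set)
    for i, (text, _) in enumerate(tweets):
        for w in set(text.lower().split()):
            occ[w].add(i)
    # 'perfect' terms: every tweet containing the term has the same label
    perfect = {w: ids for w, ids in occ.items()
               if len({tweets[i][1] for i in ids}) == 1}
    chosen, covered = [], set()
    for _ in range(k):
        best = max(perfect, key=lambda w: len(perfect[w] - covered), default=None)
        if best is None or not (perfect[best] - covered):
            break  # no term adds coverage
        chosen.append(best)
        covered |= perfect.pop(best)
    recall = len(covered) / len(tweets) if tweets else 0.0
    return chosen, recall

# Tiny invented example: "store" is perfect and covers two related tweets
data = [
    ("apple store opens", "related"),
    ("apple store deals", "related"),
    ("eating apple pie", "unrelated"),
    ("apple tablet rumor", "related"),
]
kws, rec = oracle_keywords(data, k=2)
# kws[0] is "store"; with a second singleton term recall reaches 0.75
```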

  27. Majority Class

  28. Tweets for query «jaguar» • related tweets=8 • unrelated tweets=2 • Related ratio = 8/(8+2) = 0.8 • Accuracy= 0.80 • Recall=100%

  29. Upper bound of Majority Class (winner-takes-all)
  • For each test case / company name
  – all unrelated or all related
  • Optimal decision
  – 0.80 accuracy
  • ≈ best manual system (0.83)
  • > best automatic system (0.75)
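
The winner-takes-all upper bound above can be sketched directly: assign every tweet of a test case the majority gold label, which is the best a single all-or-nothing decision can do.

```python
def winner_takes_all(labels):
    """Oracle upper bound for the majority-class strategy: predict the
    majority gold label for every tweet of one company/test case."""
    ratio = sum(1 for label in labels if label == "related") / len(labels)
    majority = "related" if ratio >= 0.5 else "unrelated"
    accuracy = ratio if majority == "related" else 1.0 - ratio
    return majority, accuracy

labels = ["related"] * 8 + ["unrelated"] * 2  # the «jaguar» example
print(winner_takes_all(labels))  # ('related', 0.8)
```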

  32. Filter keywords + majority class upper bound [diagram: Tweets · Filter keywords (oracle or manual) · Majority Class?]

  33. (1) winner-takes-all [diagram: Tweets · Filter keywords (oracle or manual) · Majority Class]

  34. (2) winner-takes-remainder [diagram: Tweets · Filter keywords (oracle or manual) · Majority Class]
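
A sketch of one plausible reading of winner-takes-remainder: keyword-matched tweets keep their keyword prediction, and the remaining tweets all receive the majority label among the keyword-classified ones. The tie-breaking default to «related» is an assumption of this sketch.

```python
def winner_takes_remainder(tweets, positives, negatives):
    """Combine filter keywords with majority class: keywords decide
    matched tweets; their majority label decides the remainder."""
    preds = {}
    for i, text in enumerate(tweets):
        words = text.lower().split()
        if any(k in words for k in positives):
            preds[i] = "related"
        elif any(k in words for k in negatives):
            preds[i] = "unrelated"
    n_rel = sum(1 for p in preds.values() if p == "related")
    # majority among keyword-classified tweets (ties default to related)
    majority = "related" if n_rel >= len(preds) - n_rel else "unrelated"
    return [preds.get(i, majority) for i in range(len(tweets))]

labels = winner_takes_remainder(
    ["apple store opens", "apple store deals today", "eating apple pie",
     "apple juice recipe", "apple trees in bloom"],
    {"store"}, {"eating"})
print(labels)  # ['related', 'related', 'unrelated', 'related', 'related']
```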

  35. (3) bootstrapping [diagram: Tweets · Filter keywords (oracle or manual) · training · Machine learning · application]
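
The bootstrapping strategy can be sketched end to end: filter keywords label a seed set, a classifier is trained on it and applied to the remaining tweets. The slides only say "machine learning"; the tiny Naive Bayes below is an illustrative stand-in, not the system's actual learner.

```python
from collections import Counter
import math

def bootstrap_classify(tweets, positives, negatives):
    """Sketch of strategy (3): keyword-labeled seed tweets train a
    Naive Bayes model, which then labels the remaining tweets."""
    seed, rest = {}, []
    for i, text in enumerate(tweets):
        words = text.lower().split()
        if any(k in words for k in positives):
            seed[i] = "related"
        elif any(k in words for k in negatives):
            seed[i] = "unrelated"
        else:
            rest.append(i)
    # per-class word counts from the seed set
    counts = {"related": Counter(), "unrelated": Counter()}
    class_n = Counter(seed.values())
    for i, label in seed.items():
        counts[label].update(tweets[i].lower().split())
    vocab = set(counts["related"]) | set(counts["unrelated"])

    def log_score(words, label):
        # Laplace-smoothed log-probability of the tweet under the class
        total = sum(counts[label].values())
        s = math.log((class_n[label] + 1) / (len(seed) + 2))
        for w in words:
            s += math.log((counts[label][w] + 1) / (total + len(vocab) + 1))
        return s

    preds = dict(seed)
    for i in rest:
        words = tweets[i].lower().split()
        preds[i] = max(("related", "unrelated"),
                       key=lambda c: log_score(words, c))
    return [preds[i] for i in range(len(tweets))]

labeled = bootstrap_classify(
    ["new apple store sale", "eating apple pie now", "big sale this weekend"],
    {"store"}, {"eating"})
# the unseen third tweet shares "sale" with the seed 'related' tweet
print(labeled)  # ['related', 'unrelated', 'related']
```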

  37. Filter keywords + majority class ≈ «all related» baseline

  41. Filter keywords + majority class baseline
  • Automatic discovery of filter keywords: keyword classification [diagram: Terms · Filter keywords (automatic)]
  – 13 term features:
  • 3 collection-based features
  • 6 Web-based features
  • 4 expanded-by-co-occurrence features
  – 3 classification methods:
  • Machine learning (neural net + all features)
  • Heuristic (2 features: col_c_specificity + cooc_om_assoc)
  • Hybrid (neural net + heuristic's features)

  42. Automatic Tweets Classification [accuracy chart; values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48; series: WePS-3 systems (manual), WePS-3 systems (automatic), Filter keywords + Majority Class, baseline]

  43. Conclusions
  • Fingerprint representation
  – Behaviour of binary classification systems on skewed datasets
  – Baselines independent of corpus
  • Twitter ≠ Web
  – Oracle keywords ≠ Manual keywords
  • Filter keywords & majority class strategies
  – Useful signals that help solve the problem
  – Both signals alone already give competitive performance

