Filter keywords and majority class strategies for company name disambiguation on Twitter Damiano Spina, Enrique Amigó and Julio Gonzalo {damiano,enrique,julio}@lsi.uned.es UNED NLP & IR Group CLEF 2011 Conference September 19-22, Amsterdam
Goal • Two signals coming from intuition: – Filter keywords – Majority Class • Do they help characterize and solve the problem?
WePS-3 Online Reputation Management Task
Tweets for query «jaguar» • related tweets=8 • unrelated tweets=2 • Related ratio = 8/(8+2) = 0.8
Tweets for query «orange» • related tweets=0 • unrelated tweets=10 • Related ratio = 0
Tweets for query «apple» • related tweets=5 • unrelated tweets=5 • Related ratio = 0.5
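The related ratio on the three example slides is the share of a company name's tweets annotated as related. A minimal sketch of that arithmetic follows; the function name and the label strings are illustrative assumptions, not part of the WePS-3 data format.

```python
def related_ratio(labels):
    """Fraction of tweets annotated as 'related' for one company name."""
    if not labels:
        return 0.0
    related = sum(1 for label in labels if label == "related")
    return related / len(labels)


# Toy annotations mirroring the counts on the slides (hypothetical data).
jaguar = ["related"] * 8 + ["unrelated"] * 2
orange = ["unrelated"] * 10
apple = ["related"] * 5 + ["unrelated"] * 5

for name, labels in [("jaguar", jaguar), ("orange", orange), ("apple", apple)]:
    print(name, related_ratio(labels))
```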
Fingerprint representation
WePS-3 Task 2 Systems
Filter keywords
Tweets for query «apple» • positive keyword: store • 4 tweets annotated as «related» • negative keyword: eating • 2 tweets annotated as «unrelated» • Accuracy= 1.0 • Recall=60%
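The «apple» example can be reproduced with a small sketch. It assumes that a tweet containing a positive keyword is labelled related, one containing a negative keyword is labelled unrelated, and all other tweets are left unlabelled. Recall is read here as the share of tweets covered by some keyword, which is how the 60% figure works out (6 of 10 tweets). The tweet texts and function names are invented for illustration.

```python
def label_by_keywords(text, positive, negative):
    """Return 'related', 'unrelated', or None if no filter keyword matches."""
    tokens = set(text.lower().split())
    if tokens & positive:
        return "related"
    if tokens & negative:
        return "unrelated"
    return None


def accuracy_and_coverage(tweets, positive, negative):
    """Accuracy over covered tweets, and the fraction of tweets covered."""
    covered = correct = 0
    for text, gold in tweets:
        predicted = label_by_keywords(text, positive, negative)
        if predicted is None:
            continue  # tweet not covered by any filter keyword
        covered += 1
        correct += predicted == gold
    accuracy = correct / covered if covered else 0.0
    return accuracy, covered / len(tweets)


# Hypothetical tweets: 5 related, 5 unrelated, as on the slide.
tweets = [
    ("new apple store opening downtown", "related"),
    ("queue at the apple store today", "related"),
    ("apple store has the new ipod", "related"),
    ("bought a laptop at the apple store", "related"),
    ("apple keynote was great", "related"),
    ("eating an apple for breakfast", "unrelated"),
    ("i love eating apple pie", "unrelated"),
    ("apple juice is the best", "unrelated"),
    ("picked an apple from the tree", "unrelated"),
    ("apple cider season", "unrelated"),
]

print(accuracy_and_coverage(tweets, {"store"}, {"eating"}))
```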
Manual keywords (perfect for a Web user)
Company name | Positive keywords | Negative keywords
amazon | electronics, books, apparel, computers, buy | river, rainforest, deforestation, bolivian, brazilian
fox | tv, broadcast, shows, episodes, fringe, bones | animal, terrier, hunting, volkswagen, racing
ford | motor, cars, hybrids, crossovers, mondeo, focus, fiesta, prices, dealer, electric | tom, harrison, henry, glenn, gucci
Oracle keywords (perfect on Twitter)
Company name | Positive keywords | Negative keywords
amazon | sale, books, deal, deals, gift | followdaibosyu, pest, plug, brothers, pirotta
fox | money, weather, leader, denouncing, viewers | megan, matthew, lazy, valley, michael
ford | mustang, focus, hybrid, motor, truck | tom, harrison, rob, bring, coppola
Upper bound of Filter Keywords • Oracle keywords – 5 oracle keywords ≈ 30% recall – 20 oracle keywords ≈ 50% recall • Manual keywords – ≈10 per company – 14.61% recall (vs. 39.97% with 10 oracle keywords) – 0.86 accuracy • Twitter ≠ Web
Majority Class
Tweets for query «jaguar» • related tweets=8 • unrelated tweets=2 • Related ratio = 8/(8+2) = 0.8 • Accuracy= 0.80 • Recall=100%
Upper bound of Majority Class winner-takes-all • For each test case (company name) – all unrelated or all related • Optimal decision – 0.80 accuracy • ≈ best manual system (0.83) • > best automatic system (0.75)
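The winner-takes-all upper bound can be sketched as follows, assuming the optimal decision is taken per company name with access to the gold annotations: every tweet receives the majority label, so the accuracy for that name is the larger of the related ratio and its complement. The data below are the toy counts from the earlier example slides, not the WePS-3 test set, so the average is not the 0.80 reported above.

```python
def winner_takes_all_accuracy(labels):
    """Accuracy when every tweet gets the majority gold label for this name."""
    related = labels.count("related")
    unrelated = len(labels) - related
    return max(related, unrelated) / len(labels)


# Toy annotations (hypothetical), matching the example slides.
companies = {
    "jaguar": ["related"] * 8 + ["unrelated"] * 2,
    "orange": ["unrelated"] * 10,
    "apple": ["related"] * 5 + ["unrelated"] * 5,
}

per_company = {name: winner_takes_all_accuracy(labels)
               for name, labels in companies.items()}
macro_average = sum(per_company.values()) / len(per_company)
print(per_company, macro_average)
```

Names with a skewed related ratio (jaguar, orange) are easy under this strategy, while balanced names (apple) gain nothing from it.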
Filter keywords + majority class upper bound Filter keywords (oracle or manual) Majority Class? Tweets
(1) winner-takes-all Filter keywords (oracle or manual) Majority Class Tweets
(2) winner-takes-remainder Filter keywords (oracle or manual) Majority Class Tweets
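Strategies (1) and (2) can be sketched together. The sketch assumes that the filter keywords first label a seed of tweets and that the majority class is taken over that seed; winner-takes-all then assigns the majority class to every tweet, while winner-takes-remainder keeps the keyword labels on the seed and assigns the majority class only to uncovered tweets. The fallback label used when no keyword matches at all, and all function names, are assumptions made for illustration.

```python
from collections import Counter


def label_by_keywords(text, positive, negative):
    """Return 'related', 'unrelated', or None if no filter keyword matches."""
    tokens = set(text.lower().split())
    if tokens & positive:
        return "related"
    if tokens & negative:
        return "unrelated"
    return None


def seed_majority(seed_labels, default="related"):
    """Majority class among keyword-labelled tweets (default is an assumption)."""
    votes = Counter(label for label in seed_labels if label is not None)
    # On a tie, most_common returns the label that was counted first.
    return votes.most_common(1)[0][0] if votes else default


def winner_takes_all(texts, positive, negative):
    seed = [label_by_keywords(t, positive, negative) for t in texts]
    return [seed_majority(seed)] * len(texts)


def winner_takes_remainder(texts, positive, negative):
    seed = [label_by_keywords(t, positive, negative) for t in texts]
    majority = seed_majority(seed)
    # Seed tweets keep their keyword label; the rest take the majority class.
    return [label if label is not None else majority for label in seed]


# Hypothetical tweets for the query "apple".
texts = [
    "new apple store opening downtown",
    "queue at the apple store today",
    "apple store has the new ipod",
    "bought a laptop at the apple store",
    "apple keynote was great",
    "eating an apple for breakfast",
    "i love eating apple pie",
    "apple juice is the best",
    "picked an apple from the tree",
    "apple cider season",
]

print(winner_takes_all(texts, {"store"}, {"eating"}))
print(winner_takes_remainder(texts, {"store"}, {"eating"}))
```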
(3) bootstrapping Filter keywords (oracle or manual) training Machine learning application Tweets
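The bootstrapping idea can be illustrated with a small sketch: tweets labelled by the filter keywords serve as training data, and the trained model is then applied to other tweets. The slide only says "Machine learning"; the bag-of-words naive Bayes classifier below is a stand-in chosen for brevity, not the classifier used by the authors, and the seed tweets are invented.

```python
import math
from collections import Counter

LABELS = ("related", "unrelated")


def train(seed):
    """seed: list of (text, label) pairs produced by the filter keywords."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for text, label in seed:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(text, word_counts, label_counts):
    """Pick the label with the highest Laplace-smoothed log probability."""
    vocab = set().union(*(word_counts[label] for label in LABELS))
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in LABELS:
        score = math.log((label_counts[label] + 1) / (total_docs + len(LABELS)))
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # +1 in the denominator reserves mass for unseen tokens.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical seed: tweets already labelled via "store" / "eating".
seed = [
    ("new apple store opening downtown", "related"),
    ("queue at the apple store today", "related"),
    ("apple store has the new ipod", "related"),
    ("bought a laptop at the apple store", "related"),
    ("eating an apple for breakfast", "unrelated"),
    ("i love eating apple pie", "unrelated"),
]

model = train(seed)
print(predict("visiting the store", *model))
print(predict("eating breakfast", *model))
```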
Filter keywords + majority class ≈ ‘ all related ’ baseline
Filter keywords + majority class baseline • Automatic Discovery of Filter Keywords: Keyword Classification Terms Filter keywords (automatic) – 13 Term features: • 3 Collection-based features • 6 Web-based features • 4 Expanded by co-occurrence features – 3 classification methods • Machine learning (Neural net + all features) • Heuristic (2 features: col_c_specificity + cooc_om_assoc ) • Hybrid (Neural net + heuristic’s features)
Automatic Tweets Classification • [Figure: bar chart of accuracy; series labels: WePS-3 systems (manual), WePS-3 systems (automatic), Filter keywords + Majority Class, baseline; values shown: 0.83, 0.75, 0.73, 0.63, 0.56, 0.48]
Conclusions • Fingerprint representation – Behaviour of binary classification systems on skewed datasets – Baselines independent of corpus • Twitter ≠ Web – Oracle keywords ≠ Manual keywords • Filter keywords & majority class strategies – Useful signals to help solve the problem – Both signals alone already give competitive performance