Mining the Temporal Statistics of Query Terms for Searching Social Media Posts

Mining the Temporal Statistics of Query Terms for Searching Social Media Posts
Oct. 1st, 2017, ICTIR’17, Amsterdam
Jinfeng Rao, Ferhan Ture, Xing Niu, Jimmy Lin

Task: Ad-hoc Search in the Social Media Domain
• Given a stream of tweets and a set of interest profiles (users' queries; an interest profile is roughly a topic), return a ranked list of tweets.
• Example query MB001: BBC World Service staff cuts

Background
• Challenges for social media search
    • Posts are usually very short: 140 characters for tweets.
    • Posts are written in a highly concise way and can be quite noisy.
    • Many abbreviations, misspellings, typos, emojis, hashtags, etc.
• Time is an important relevance signal
    • Relevant posts are more likely to cluster around the time breaking news happens.
    • Example query MB001 from TREC 2011: BBC World Service staff cuts
    • [Figure: distribution of relevant docs (ground truth) for MB001. The x axis denotes the number of days prior to query time; the height of a bar denotes the number of relevant docs during that time interval.]
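The histogram described in the figure can be sketched as follows. This is a minimal illustration, assuming timestamps are given in seconds; the function name and the choice to ignore documents posted after the query time are assumptions, not details from the slides.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def days_prior_histogram(query_time, doc_times):
    """Count relevant docs per whole-day offset before the query time.

    Timestamps are assumed to be in seconds. Docs posted after the
    query time are skipped (an assumption made for this sketch).
    """
    hist = Counter()
    for t in doc_times:
        if t <= query_time:
            days_prior = int((query_time - t) // SECONDS_PER_DAY)
            hist[days_prior] += 1
    return dict(hist)


if __name__ == "__main__":
    q = 10 * SECONDS_PER_DAY
    relevant_times = [q - 3600, q - 90000, q - 100000]
    print(days_prior_histogram(q, relevant_times))
```

A sharp peak at a small number of days prior would correspond to the clustering of relevant posts around a news event.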

Combine Lexical and Temporal Evidence
• Moving window, Dakka et al. TKDE’12 [2]
• Kernel density estimation, Efron et al. SIGIR’14 [3]
    $\hat{f}_\omega(x) = \frac{1}{nh} \sum_{i=0}^{n} \omega_i K\left(\frac{x - x_i}{h}\right)$
• Recurrent neural networks, Rao et al. NeuIR’17 [4]
• [Figure: pseudo trend.]
• However, these approaches all require two-stage retrieval:
    • Initial retrieval: estimate the ground truth distribution (pseudo trend).
    • Second retrieval: rerank docs with the estimated pseudo trend.
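The weighted kernel density estimate can be sketched in a few lines. This is an illustration only: the Gaussian kernel, the bandwidth value, and taking n as the number of points are assumptions of the sketch, and the weighting schemes used by the cited work are not reproduced here.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def weighted_kde(x, points, weights, bandwidth):
    """Evaluate (1 / (n * h)) * sum_i w_i * K((x - x_i) / h) at x."""
    n = len(points)
    total = sum(
        w * gaussian_kernel((x - xi) / bandwidth)
        for xi, w in zip(points, weights)
    )
    return total / (n * bandwidth)


if __name__ == "__main__":
    # Hypothetical timestamps (days prior to query time) of retrieved docs.
    doc_days = [1.0, 1.2, 1.5, 9.0]
    uniform_weights = [1.0] * len(doc_days)
    for day in (1.0, 5.0, 9.0):
        print(day, weighted_kde(day, doc_days, uniform_weights, bandwidth=1.0))
```

With uniform weights this reduces to an ordinary kernel density estimate; non-uniform weights let some documents contribute more to the estimated pseudo trend.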

Research Question
• Research question: can we make use of the temporal statistics of query terms (term trends) to predict the ground truth?
• What is a term trend? The term's frequencies in the collection, counted for each 5-minute interval.
• An example of ground truth and term trends for query MB127 “hagel nomination filibustered” from the TREC 2013 topic set:
    [Figure: ground truth and term trends for MB127. Strong correlation!]
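A term trend as defined above can be sketched by counting term occurrences in 5-minute bins. This is a minimal sketch: the `(timestamp, text)` post format, integer timestamps in seconds, and whitespace tokenization are assumptions made for illustration.

```python
BIN_SECONDS = 5 * 60  # one bin per 5 minutes, as on the slide


def term_trend(posts, term, start, end):
    """Return per-bin counts of `term` for posts with start <= timestamp < end.

    `posts` is an iterable of (timestamp_seconds, text) pairs with integer
    timestamps; text is lowercased and split on whitespace (an assumption).
    """
    n_bins = (end - start + BIN_SECONDS - 1) // BIN_SECONDS
    counts = [0] * n_bins
    for timestamp, text in posts:
        if start <= timestamp < end:
            tokens = text.lower().split()
            counts[(timestamp - start) // BIN_SECONDS] += tokens.count(term)
    return counts


if __name__ == "__main__":
    posts = [
        (0, "hagel nomination"),
        (10, "Hagel hagel"),
        (400, "nomination filibustered"),
        (650, "hagel"),
    ]
    print(term_trend(posts, "hagel", start=0, end=900))
```

Computing one such series per query unigram and bigram gives the set of term trends that the regression later combines.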

Approach: Temporal Modeling via Regression
[Figure: ground truth and term trends.]
Goal: approximate the ground truth (Y) by taking a weighted sum of all term trends (f_t).
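The goal stated above amounts to a per-bin weighted sum. The sketch below shows that combination together with a squared-error measure of how well it approximates Y; the squared error is an assumption chosen for illustration, and the weights are hypothetical values rather than learned ones.

```python
def combine_trends(trends, weights):
    """Weighted sum of term trends, computed bin by bin.

    `trends` is a list of equal-length count series (one per term);
    `weights` holds one weight per term.
    """
    n_bins = len(trends[0])
    return [
        sum(w * trend[j] for w, trend in zip(weights, trends))
        for j in range(n_bins)
    ]


def squared_error(prediction, target):
    """Sum of squared differences between prediction and ground truth."""
    return sum((p - y) ** 2 for p, y in zip(prediction, target))


if __name__ == "__main__":
    trends = [[1, 0, 2], [0, 4, 0]]   # hypothetical term trends f_t
    weights = [0.5, 0.25]             # hypothetical term weights
    ground_truth = [0.5, 1.0, 1.0]    # hypothetical Y
    y_hat = combine_trends(trends, weights)
    print(y_hat, squared_error(y_hat, ground_truth))
```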

Term Importance Modeling
• Bursty terms can be more informative.
• We adopt the definition of entropy to measure the importance of terms.
• Entropy is computed from the counts of a particular term t (unigram or bigram), {c_1, c_2, …, c_n}.
• Lower entropy = burstier term trend = more important term.
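The entropy formula itself is not preserved in this transcript. A minimal sketch, assuming the standard Shannon entropy over the counts normalized into a probability distribution (natural logarithm), illustrates why a bursty trend scores lower than a flat one.

```python
import math


def trend_entropy(counts):
    """Shannon entropy of a count series normalized to probabilities.

    Assumption: p_i = c_i / sum(c), H = -sum(p_i * ln p_i), with empty
    bins contributing zero. An all-zero series returns 0.0 by convention.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


if __name__ == "__main__":
    flat = [5, 5, 5, 5]     # term used evenly over time
    bursty = [0, 0, 20, 0]  # term concentrated in one interval
    print(trend_entropy(flat), trend_entropy(bursty))
```

The flat series reaches the maximum entropy for four bins, while the series concentrated in one bin has entropy zero, matching the "lower entropy = burstier" reading on the slide.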

Approach: Temporal Modeling via Regression
• Two questions in this non-linear regression modeling:
    • Q1: How to model the weights of different query terms?
    • Q2: How to differentiate the contribution of unigrams from that of bigrams?
• Q1 solution: exponential mapping from entropy to term weight.
• Q2 solution: assume the unigram weight is u_i; then the bigram weight is (1 − u_i). u_i depends on R_i, where R_i is the difference between the maximum unigram entropy and the maximum bigram entropy.
• Intuition: R_i > 0 => max(unigram_entropy) > max(bigram_entropy) => u_i > 0.5

Approach: Temporal Modeling via Regression
• Problem reformulation and objective loss: [equations shown on the slide]
• The objective can be minimized with gradient descent (more details in the paper).
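Since the slide's equations are not preserved here, the following is only a simplified stand-in: plain gradient descent on a squared-error loss for a linear weighted sum of term trends. It does not reproduce the entropy-based, non-linear parameterization from the talk; the loss, learning rate, and step count are all assumptions.

```python
def fit_trend_weights(trends, target, lr=0.01, steps=1000):
    """Fit one weight per term trend by gradient descent on squared error.

    Minimizes sum_j (sum_t w_t * f_t[j] - y[j])^2 over the weights w_t.
    """
    weights = [0.0] * len(trends)
    for _ in range(steps):
        # Residual of the current weighted sum at every time bin.
        residual = [
            sum(w * trend[j] for w, trend in zip(weights, trends)) - y
            for j, y in enumerate(target)
        ]
        # Gradient of the squared error with respect to each weight.
        grads = [
            2.0 * sum(r * f for r, f in zip(residual, trend))
            for trend in trends
        ]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


if __name__ == "__main__":
    trends = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]  # hypothetical term trends
    ground_truth = [3.0, 4.0, 3.0]               # hypothetical Y
    print(fit_trend_weights(trends, ground_truth))
```

The fixed learning rate works for this small example; real count series would likely need normalization or a smaller step size to keep the updates stable.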

Combine Term Trend with Pseudo Trend
• Two ways to estimate the ground truth distribution:
    • Document-level: pseudo trend through an initial retrieval
    • Term-level: regression over term trends
• Term trend and pseudo trend are combined in a linear ranking model.
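A linear ranking model of this kind can be sketched as a weighted sum of per-document features. The feature names (`lexical`, `pseudo_trend`, `term_trend`) and the weight values below are hypothetical; the slide does not preserve the actual feature set or the learned weights.

```python
def linear_score(features, weights):
    """Weighted sum of a document's feature values."""
    return sum(weights[name] * value for name, value in features.items())


def rerank(docs, weights):
    """Sort documents by descending linear score."""
    return sorted(
        docs,
        key=lambda doc: linear_score(doc["features"], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical features: a lexical score plus the two temporal
    # density estimates evaluated at each document's timestamp.
    docs = [
        {"id": "a", "features": {"lexical": 1.0, "pseudo_trend": 0.1, "term_trend": 0.1}},
        {"id": "b", "features": {"lexical": 0.8, "pseudo_trend": 0.9, "term_trend": 0.7}},
    ]
    weights = {"lexical": 1.0, "pseudo_trend": 0.5, "term_trend": 0.5}
    print([doc["id"] for doc in rerank(docs, weights)])
```

In this toy example document "b" overtakes "a" despite a lower lexical score, because both temporal signals place it near the estimated peak.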

Experimental Setup
• Topic set: TREC Microblog Track 2013 and 2014, 115 topics in total.
• Collection: Tweets2013 (~243 million tweets)
• Metrics: Mean Average Precision (AP) and Precision at 30 (P30)
• Three data splits:
    • Odd-even: odd-numbered topics (57 topics) for training, even-numbered topics (58 topics) for testing
    • Even-odd: switch the train/test split
    • Cross: 4-fold cross validation
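The odd-even split and the two metrics can be sketched as follows. This is an illustration under simple assumptions: topics are identified by integers, relevance is binary, and the function names are made up for the sketch rather than taken from any evaluation toolkit.

```python
def odd_even_split(topic_ids):
    """Split topic ids into (odd-numbered, even-numbered) lists."""
    odd = [t for t in topic_ids if t % 2 == 1]
    even = [t for t in topic_ids if t % 2 == 0]
    return odd, even


def precision_at_k(ranked, relevant, k=30):
    """Fraction of the top-k ranked docs that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    train, test = odd_even_split(range(1, 7))
    print(train, test)
    ranked = ["a", "x", "b"]
    relevant = {"a", "b"}
    print(precision_at_k(ranked, relevant, k=2), average_precision(ranked, relevant))
```

Averaging `average_precision` over all test topics gives the mean figure reported per split; the even-odd split simply swaps the two lists returned by `odd_even_split`.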

Baselines
1. Query Likelihood (QL)
2. Recency Prior, Li et al. CIKM’03 [1]
3. Moving Window, Dakka et al. TKDE’12 [2]
4. Kernel Density Estimation (KDE), Efron et al. SIGIR’14 [3]
    • Uniform-based weighting (IRDu)
    • Score-based weighting (IRDs)
    • Rank-based weighting (IRDr)
    • Oracle (upper bound)

Main Results
• Conclusions:
    • KDE with rank-based weights (IRDr) is the strongest baseline.
    • Our approach (Reg-IRDr) significantly outperforms all baselines, and is even close to the upper bound in some splits.

Randomized Experiments
[Figure: average improvement over the QL baseline, summarized over 30 random train/test splits.]

Per-Topic Analysis
[Figure: per-topic P30 improvement against the Query Likelihood (QL) baseline and the best KDE baseline (IRDr).]

Analysis of the Best-Performing Topic 144
How do term trend signals help?
• [Figure legend: red for the ground truth distribution; green for the pseudo trend estimated by the best KDE method (IRDr); blue for term trends.]
• Conclusion: A combination of pseudo trend (KDE) and term trend (our approach) provides a more accurate estimate of the ground truth distribution.

Conclusion
• We are the first to study the temporal statistics of query terms for social media search.
• Our learning-to-rank and regression models show that this new signal is effective.
• For efficiency, use our term trend modeling technique.
• For effectiveness, use the combination of pseudo trend and term trend modeling.

Thanks for listening! Any questions?

References
1. Xiaoyan Li and W. Bruce Croft. 2003. Time-Based Language Models. In CIKM. 469–475.
2. Wisam Dakka, Luis Gravano, and Panagiotis G. Ipeirotis. 2012. Answering General Time-Sensitive Queries. TKDE.
3. Miles Efron, Jimmy Lin, Jiyin He, and Arjen de Vries. 2014. Temporal Feedback for Tweet Search with Non-Parametric Density Estimation. In SIGIR. 33–42.
4. Jinfeng Rao, Hua He, Haotian Zhang, Ferhan Ture, Royal Sequiera, Salman Mohammed, and Jimmy Lin. 2017. Integrating Lexical and Temporal Signals in Neural Ranking Models for Social Media Search. In SIGIR Workshop on Neural Information Retrieval (Neu-IR).
