Time-dependent Similarity Measure of Queries

Time-dependent Similarity Measure of Queries Using Historical Click-through Data - PowerPoint PPT Presentation



  1. Time-dependent Similarity Measure of Queries Using Historical Click-through Data. Qiankun Zhao*, Steven C. H. Hoi*, Tie-Yan Liu, et al. Presented by: Tie-Yan Liu. * This work was done when Zhao and Hoi were interns at Microsoft Research Asia

  2. Outline • Background • Observations and Motivation • Our Approach • Empirical Study • Future Work

  3. Background • A dilemma for Web search engines: queries are very short (~2.5 terms) and term usage is inconsistent • The Web is not well-organized • Users express queries with their own vocabulary

  4. Background (cont’d) • Solution: query expansion • Document-term-based expansion (KDD00, SIGIR05): a query can be expanded with the top keywords in the top-k relevant documents • Query-term-based expansion (WWW02, CIKM04): a query can be expanded with similar queries (queries are similar if they lead to similar pages; pages are similar if they are visited via similar queries) • Click-through data have been used for query expansion in much previous work

  5. Background (cont’d) • Click-through data: log data about the interactions between users and Web search engines • Typical click-through data representation

  6. Observation 1 • Accuracy of query similarity: calculated from all the click-through data before a given time point vs. calculated only from the click-through data within that time interval (month)

  7. Observation 2 • Event-driven and dynamic character of query similarity: the keyword “firework” and related pages become more popular one week before the event and reach the peak on July 4th • “firework + market” and “firework + show” become popular and reach their peaks a few days before July 4th • “firework + injuries” and “firework + picture” show a slight delay in the number of times they are issued and visited

  8. Motivations • Exploit the click-through data for semantic similarity of queries by incorporating temporal information • Combine explicit content similarity and implicit semantic similarity

  9. Our Approach

  10. Time-Dependent Concepts • Calendar schema and pattern • Example: calendar schema <day, month, year>; calendar pattern <15, *, *>; the point <15, 1, 2002> is contained in the pattern <15, *, *>
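
The containment relation on the slide can be sketched in a few lines; the function name and tuple encoding below are illustrative assumptions, not from the paper.

```python
# Sketch of the calendar schema/pattern idea: a schema fixes the order of
# time units, a pattern uses '*' as a wildcard, and a concrete time point
# is "contained in" a pattern when every non-wildcard unit matches.
def contains(pattern, point):
    """Return True if the calendar point matches the calendar pattern."""
    return all(p == "*" or p == v for p, v in zip(pattern, point))

# Schema <day, month, year>: <15, 1, 2002> is contained in <15, *, *>.
print(contains((15, "*", "*"), (15, 1, 2002)))   # True
print(contains((15, "*", "*"), (16, 1, 2002)))   # False
```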

  11. Time-Dependent Concepts • Click-through subgroup • Example: based on the schema <day, week> and the patterns <1,*>, <2,*>, …, <7,*>, we can partition the data into 7 groups, corresponding to Sun, Mon, Tue, …, Sat
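
The partition step can be sketched as follows; the record layout (query, url, weekday) and the sample data are assumptions for illustration, with the seven patterns <1,*>, …, <7,*> reducing to grouping by weekday.

```python
from collections import defaultdict

# Toy click-through records: (query, clicked url, day-of-week 1..7).
records = [
    ("weather forecast", "weather.com", 1),   # Sunday
    ("fox news", "foxnews.com", 2),           # Monday
    ("kids toy", "toysrus.com", 1),           # Sunday
]

# Partition the data into one subgroup per calendar pattern <d, *>.
subgroups = defaultdict(list)
for query, url, weekday in records:
    subgroups[weekday].append((query, url))

print(len(subgroups[1]))  # 2 records fall into the Sunday subgroup
```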

  12. Similarity Measure • For efficiency and simplicity, we measure the query similarity in a certain time slot based only on the click-through data of that slot • Vector representation of queries with respect to clicked documents; the weight w_i is defined by Page Frequency (PF) and Inverted Query Frequency (IQF)
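
The slide only names PF and IQF; the exact formulas below are an assumption modeled on TF-IDF (PF as the fraction of a query's clicks landing on a page, IQF down-weighting pages clicked from many queries), and the click counts are toy data.

```python
import math

# query -> {page: click count}; illustrative data, not from the paper.
clicks = {
    "firework show":   {"p1": 8, "p2": 2},
    "firework market": {"p1": 5, "p3": 5},
}

def query_vector(q):
    """Weight each clicked page by PF * IQF (assumed TF-IDF-style forms)."""
    total = sum(clicks[q].values())
    n_queries = len(clicks)
    vec = {}
    for page, c in clicks[q].items():
        pf = c / total                                   # page frequency
        df = sum(1 for other in clicks if page in clicks[other])
        iqf = math.log((n_queries + 1) / df)             # +1 smoothing, assumed
        vec[page] = pf * iqf
    return vec

print(query_vector("firework show"))
```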

  13. Similarity Measure • Query similarity measures: cosine function; marginalized kernel • By introducing query clusters, one can model the query similarity in a more semantic way
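
The cosine measure named on the slide, over sparse query vectors of clicked-page weights, can be sketched as below (the marginalized kernel variant is not shown; the example weights are hypothetical).

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {page: weight} dicts."""
    pages = set(u) | set(v)
    dot = sum(u.get(p, 0.0) * v.get(p, 0.0) for p in pages)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q1 = {"p1": 0.32, "p2": 0.22}   # illustrative PF*IQF weights
q2 = {"p1": 0.18, "p3": 0.55}
print(round(cosine(q1, q2), 3))
```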

  14. Time-Dependent Similarity Measure

  15. Empirical Evaluation • Dataset: click-through log of a commercial search engine • June 16, 2005 to July 17, 2005 • Total size of 22 GB • Only queries from the US • Calendar schema and pattern: <hour, day, month> with patterns <1, *, *>, <2, *, *>, … • Divide the data into 24 subgroups • Average subgroup size: 59,400,000 query-page pairs

  16. Empirical Examples • kids + toy, map + route

  17. Empirical Examples • weather + forecast, fox + news

  18. Quality Evaluation • Experimental settings: partition the 32-day dataset into two parts • First part for model construction • Second part for model evaluation • Accuracy is defined as the percentage difference between the actual similarity and the model-based prediction • 1000 representative query pairs with similarity larger than 0.3 over the entire dataset • Half of them are top queries of the month • Half are selected manually, related to real-world events such as “hurricane”
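
The accuracy definition on the slide can be sketched as follows; the precise formula (1 minus the mean relative error) and the sample similarity values are assumptions for illustration.

```python
# Accuracy as 1 - mean relative (percentage) difference between the
# actual similarities and the model-based predictions; assumed formula.
def accuracy(actual, predicted):
    errs = [abs(a - p) / a for a, p in zip(actual, predicted) if a > 0]
    return 1.0 - sum(errs) / len(errs)

actual    = [0.40, 0.50, 0.35]   # pairs with similarity > 0.3, as in the setup
predicted = [0.38, 0.45, 0.35]
print(round(accuracy(actual, predicted), 3))
```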

  19. Experimental Results

  20. Experimental Results • For example, when the distance is 1 and the training data size is 10, we summarize all the accuracy values that use days i to 10+i as training data and day 10+1+i as testing data

  21. Experimental Results

  22. Conclusion • Presented a preliminary study of the dynamic nature of query similarity using click-through data • Observed and verified with real data that query similarity is dynamic and event-driven • Proposed a time-dependent model • For future work, we will investigate an adaptive way to determine the most suitable time granularity for two given queries

  23. Thanks! tyliu@microsoft.com http://research.microsoft.com/users/tyliu
