Top-k Processing for Search and Information Discovery in Social Applications


  1. Top-k Processing for Search and Information Discovery in Social Applications
     Sihem Amer-Yahia, Julia Stoyanovich
     Social top-k @ Joint RuSSIR/EDBT Summer School 2011

  2. Why top-k?
     • Practical
       – many applications, with new ones emerging every day
     • Beautiful
       – basic algorithms (TA, NRA) are simple and intuitive
       – nice theoretical properties: instance optimality
     • Insightful
       – optimizes for I/O operations, a good abstraction of run-time performance
       – trades off run-time performance against space overhead
       – representative of many data management problems
     • Influential
       – Optimal Aggregation Algorithms for Middleware, PODS 2001, by Ronald Fagin, Amnon Lotem and Moni Naor
       – 733 citations (as of July 11, 2011), PODS 2011 test-of-time award

  3. Why Social?
     • Social science is an old discipline
     • Social Web: a new opportunity where the actions of millions of users online can be gathered and processed (content sites like Delicious, streams like Twitter, and media like ALJ online, in the form of factual and opinion data)
     • Computing: large-scale data processing
     • Social Computing: the science of gathering, storing and processing social breadcrumbs left by millions of users, in order to enhance their online experience => scaling up social science

  4. About us
     • Sihem
       – PhD INRIA, France (1999)
       – AT&T Research (1999-2006), Yahoo! Research (2006-2011)
       – Now: Principal Research Scientist at Qatar Foundation
       – Research interests: social computing, large-scale data processing
     • Julia
       – Data management at several New York start-ups (1998-2003)
       – PhD Columbia University, USA (2009)
       – Now: Postdoctoral Researcher at the University of Pennsylvania
       – Research interests: ranking and data exploration, biological and social applications

  5. Course outline
     • Top-k and its applications
       – Fundamental top-k algorithms
       – Personalized search
       – Group recommendation
     • Social: user studies
       – Group recommendation (MovieLens)
       – Travel itinerary extraction (Flickr)
     • Open problems in top-k and social
     Feel free to interrupt and ask questions!

  6. A plug: Web Data Management
     A new book by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart. Will be published in print in December 2011. Available for free download from http://webdam.inria.fr/Jorge/files/wdmd.pdf

  7. Top-k Processing for Search and Information Discovery in Social Applications
     Lecture 1: Fundamental Top-k Algorithms
     Sihem Amer-Yahia, Julia Stoyanovich

  8. Quote of the day
     Life is the sum of all your choices. ~Albert Camus

  9. What is top-k processing?
     • Find k items that best answer a user’s query
       – As a set, as a sorted list, or as a sorted list with scores
       – Usually from among N items, where N >> k
     • Application domains
       – Search over structured datasets with user-defined preferences
         • Find large apartments in a good school district in Brooklyn
         • Find cheap hotels that are near a beach
       – Web search & other document retrieval / ranking tasks
         • Find documents about “Winter Olympics Sochi”
       – Search over multimedia repositories (with probability scores)
         • Find images that show a palm tree next to a house
       – Many others, including in the social domain!

  10. SQL example

  11. Top-k vs. SQL querying
      • Compared to standard SQL querying
        – Relevance to a degree, not Boolean
        – Return only the best items, not all items
        – Quality of an item is expressed by a score
        – A score is assigned to items by a ranking (scoring) function
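
The contrast between Boolean and ranked retrieval can be sketched with Python's built-in `sqlite3` module. The `hotels` table, its columns, the sample rows, and the weights in the scoring expression are all hypothetical, chosen only for illustration; lower scores are treated as better.

```python
import sqlite3

# Hypothetical table: price in $, distance to the beach in meters
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (name TEXT, price INTEGER, dist_m INTEGER)")
conn.executemany(
    "INSERT INTO hotels VALUES (?, ?, ?)",
    [("h1", 100, 1000), ("h2", 90, 1200), ("h3", 110, 1100), ("h4", 120, 1100)],
)

# Standard SQL: a row either satisfies the predicate or it does not,
# and every matching row is returned
boolean_rows = conn.execute(
    "SELECT name FROM hotels WHERE price <= 100 AND dist_m <= 1000"
).fetchall()

# Top-k: every row gets a score from a ranking function,
# and only the k best rows are returned (here: lowest score first)
k = 2
top_k_rows = conn.execute(
    "SELECT name, 10 * price + dist_m AS score "
    "FROM hotels ORDER BY score ASC LIMIT ?",
    (k,),
).fetchall()

print(boolean_rows)
print(top_k_rows)
```

Note that `ORDER BY ... LIMIT` only expresses the top-k semantics; whether the engine evaluates it without scoring every row is a separate question.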

  12. Outline
      ✓ Intro
      • Semantics
        – Ranking functions
        – Dominance and skylines
      • Fundamental algorithms
        – Fagin algorithm (FA)
        – Threshold algorithm (TA)
        – No random access algorithm (NRA)
      • Extensions and alternatives
        – top-k with expensive predicates (MPro)
        – TAAT vs. DAAT vs. …

  13. Ranking functions
      • Ranking (scoring) functions are used to compute the score of an item
      • Item r(x_1, …, x_m), where x_i are the ranking attributes, e.g., size of a house in m², number of times “Sochi” occurs in the text
        score(r) = g(f_1(x_1), …, f_m(x_m))
        – where f_i are monotone functions, e.g., f(x) = 2 × x
        – g is a monotone aggregation function, e.g., sum, average, max, min
        – e.g., score(r) = 2 × size + 3 × quality of school district
      Definitions
      • A function f is monotone if f(x) ≤ f(y) whenever x ≤ y
      • An aggregation function g is monotone if g(r) ≤ g(r’) whenever r.x_i ≤ r’.x_i for all i
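
A minimal Python sketch of the form score(r) = g(f_1(x_1), …, f_m(x_m)), using the slide's example 2 × size + 3 × quality. The helper name `make_score` and the two sample apartments are assumptions made for illustration.

```python
def make_score(transforms, aggregate):
    """Build score(r) = aggregate(f_1(x_1), ..., f_m(x_m)).

    transforms: one monotone function f_i per ranking attribute
    aggregate:  a monotone aggregation function g over a list of numbers
    """
    def score(item):
        return aggregate([f(x) for f, x in zip(transforms, item)])
    return score

# score(r) = 2 * size + 3 * quality of school district
score = make_score([lambda size: 2 * size, lambda quality: 3 * quality], sum)

# Hypothetical apartments: (size in m^2, school district quality)
a = (80, 7)
b = (95, 9)

# b is at least as good as a in every attribute, so monotonicity
# guarantees score(a) <= score(b)
print(score(a), score(b))
```

Swapping `sum` for `max` or `min` keeps the function monotone, so the same guarantee holds.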

  14. Ranking functions and dominance
      • Monotone aggregation ranking functions express user preferences
        – r(price, distToBeach); score(r) = w_1 × price + w_2 × distToBeach
        – score(r_1) < score(r_2) whenever r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
        – This holds for any positive weights w_1 and w_2! (What if one of the weights was negative? What if both were negative?)
        – In this example, we are interested in minimizing the score
      • Mapping a multi-dimensional (e.g., 2D) space into R
        – In top-k, we look for k tuples with the best score, which is weight-specific, but …
          • users may differ on the weights in their preferences
          • users may want to understand the structure of the 2D space
        – Skylines to the rescue!
          • make the structure of the space explicit
          • are independent of the weights

  15. Skylines represent dominance
      • Item r_1 is better than r_2 according to its score if
        – score(r_1) < score(r_2), which is guaranteed for any positive weights when
        – r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
      • This corresponds to a notion of dominance (Pareto-optimality)
        – Definition: r_1 dominates r_2 iff r_1 is as good as or better than r_2 in all dimensions, and strictly better in at least one dimension
        – Some items may be incomparable
          • r_1: $100, 1.0 km
          • r_2: $90, 1.2 km
          • r_3: $110, 1.1 km
          • r_4: $120, 1.1 km
        – Observe: dominance is transitive
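
The dominance definition can be checked directly on the four hotels listed above. This is a sketch that assumes lower is better in both dimensions (price and distance); the function names `dominates` and `skyline` are illustrative.

```python
# Hotels from the slide: (price in $, distance to beach in km)
hotels = {
    "r1": (100, 1.0),
    "r2": (90, 1.2),
    "r3": (110, 1.1),
    "r4": (120, 1.1),
}

def dominates(a, b):
    """True if a is as good or better than b everywhere, strictly better somewhere."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def skyline(items):
    """Names of the items that no other item dominates."""
    return {
        name
        for name, attrs in items.items()
        if not any(dominates(other, attrs) for other in items.values())
    }

print(dominates(hotels["r1"], hotels["r3"]))  # r1 is cheaper and closer than r3
print(dominates(hotels["r1"], hotels["r2"]))  # r1 is closer but more expensive
print(sorted(skyline(hotels)))
```

Here r_1 and r_2 are incomparable, r_1 dominates r_3, and r_3 dominates r_4 (same distance, lower price), so by transitivity r_1 also dominates r_4.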

  16. Skyline examples

  17. Skyline properties
      • For any monotone ranking function f
        – If point r is best according to f, then r is on the skyline
          • A top-1 hotel will always be on the skyline, irrespective of the weights!
        – If point r is on the skyline, then there exists an f for which r is best
          • Every hotel on the skyline is someone’s favorite
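
Both properties can be spot-checked on the four hotels from the dominance slide. The sketch below only tries weighted sums with a handful of positive weights, which is an assumption made for illustration (the properties are stated for monotone functions in general), and the weight values themselves are arbitrary.

```python
# Hotels from the dominance slide: (price in $, distance to beach in km)
hotels = {"r1": (100, 1.0), "r2": (90, 1.2), "r3": (110, 1.1), "r4": (120, 1.1)}

def dominates(a, b):
    """Lower is better: a is no worse everywhere and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

skyline = {
    name
    for name, attrs in hotels.items()
    if not any(dominates(other, attrs) for other in hotels.values())
}

# Arbitrary positive weights (w1 for price, w2 for distance)
weight_choices = [(1, 1), (1, 10), (1, 100), (1, 1000), (5, 1)]

winners = set()
for w1, w2 in weight_choices:
    # We minimize the score, so the top-1 item has the smallest weighted sum
    best = min(hotels, key=lambda n: w1 * hotels[n][0] + w2 * hotels[n][1])
    winners.add(best)

print(sorted(skyline))
print(sorted(winners))
```

Every winner lies on the skyline, and for this small dataset each skyline hotel wins for at least one of the weight choices.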

  18. Outline
      ✓ Intro
      ✓ Semantics
        ✓ Ranking functions
        ✓ Dominance and skylines
      • Fundamental algorithms
        – Fagin algorithm (FA)
        – Threshold algorithm (TA)
        – No random access algorithm (NRA)
      • Extensions and alternatives
        – top-k with expensive predicates (MPro)
        – TAAT vs. DAAT vs. …

  19. Quantifying top-k algorithm performance
      • Execution time
        – Sequential access (SA)
          • accessing items in order, e.g., by reading from a cursor
          • similar in concept to a sequential disk read, where seek time is amortized over multiple accesses
        – Random access (RA)
          • accessing items out of order, e.g., a primary key lookup
          • similar to a random disk read
          • typically more expensive than an SA (even by orders of magnitude), sometimes impossible
        – Why not use wall clock time?
      • Buffer size
        – How much state do we have to keep during computation?
        – Is the size bounded by some constant (e.g., k), or is it linear in the size of the dataset (N)? (recall that k << N)
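
One way to make this cost model concrete is to count accesses instead of measuring time. The sketch below is an assumption-laden illustration: the class `CountingList`, the function `access_cost`, and the unit costs (an RA costing 10 times an SA) are hypothetical and not taken from the slides.

```python
class CountingList:
    """A list sorted by decreasing value that counts how it is accessed."""

    def __init__(self, entries):
        # entries: iterable of (item_id, value) pairs
        entries = list(entries)
        self._sorted = sorted(entries, key=lambda e: -e[1])
        self._by_id = dict(entries)
        self._cursor = 0
        self.sequential = 0  # number of sequential accesses (SA)
        self.random = 0      # number of random accesses (RA)

    def next(self):
        """Sequential access: return the next (item_id, value), or None at the end."""
        if self._cursor >= len(self._sorted):
            return None
        entry = self._sorted[self._cursor]
        self._cursor += 1
        self.sequential += 1
        return entry

    def lookup(self, item_id):
        """Random access: fetch the value of a specific item."""
        self.random += 1
        return self._by_id[item_id]


def access_cost(lst, c_sa=1.0, c_ra=10.0):
    """Total cost as a weighted count of accesses (unit costs are assumptions)."""
    return lst.sequential * c_sa + lst.random * c_ra


income = CountingList([("r1", 150), ("r2", 150), ("r7", 75)])
income.next()
income.next()
income.lookup("r7")
print(income.sequential, income.random, access_cost(income))
```

Unlike wall clock time, these counts do not depend on the machine, the storage layer, or caching effects, which makes algorithms easier to compare.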

  20. Example 1
      id      income (K$)   netWorth (K$)   score = income + netWorth
      r_1     150           350             500
      r_2     150           425             575
      r_3     125           450             575
      r_4     100           450             550
      r_5     100           200             300
      r_6     80            100             180
      r_7     75            500             575
      r_8     75            50              125
      r_9     50            300             350
      r_10    50            50              100

  21. Naïve computation of the top-k answers
      • Algorithm
        – Compute the score of each item
        – Sort items in decreasing order of score
        – Return k items with the highest score
      • Properties of naïve solution
        – Advantages?
        – Disadvantages?
      • Idea: throw space at the problem
        – pre-compute inverted lists for components of the score
        – aggregate partial scores at run-time
      • Main intuition:
        – top-k items will appear near the top of sufficiently many lists
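
The three steps of the naïve algorithm can be sketched in Python on the Example 1 data (score = income + netWorth). The names `items` and `naive_top_k` are illustrative, and breaking ties by input order is an assumption, since the slides do not say how ties are resolved.

```python
# Example 1 data: id -> (income in K$, netWorth in K$)
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}

def naive_top_k(items, k):
    # Step 1: compute the score of every item (touches all N items)
    scored = [(item_id, income + net_worth)
              for item_id, (income, net_worth) in items.items()]
    # Step 2: sort in decreasing order of score (stable, so ties keep input order)
    scored.sort(key=lambda pair: -pair[1])
    # Step 3: return the k highest-scoring items
    return scored[:k]

print(naive_top_k(items, 3))
```

The sketch reads and sorts all N items even though only k are returned, which is the cost that the pre-computed inverted lists are meant to avoid.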

  22. The basic indexing structure: inverted list
      L_1                    L_2
      id      income         id      netWorth
      r_1     150            r_7     500
      r_2     150            r_3     450
      r_3     125            r_4     450
      r_4     100            r_2     425
      r_5     100            r_1     350
      r_6     80             r_9     300
      r_7     75             r_5     200
      r_8     75             r_6     100
      r_9     50             r_8     50
      r_10    50             r_10    50
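
The two lists above can be derived from the Example 1 data by sorting on one score component at a time. This is a minimal sketch; the function name `build_inverted_list` is hypothetical, and ties are kept in input order, which happens to match the order shown on the slide.

```python
# Example 1 data: id -> (income in K$, netWorth in K$)
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}

def build_inverted_list(items, attribute_index):
    """Return (id, value) pairs sorted by decreasing value of one attribute."""
    entries = [(item_id, attrs[attribute_index]) for item_id, attrs in items.items()]
    # sorted() is stable, so items with equal values keep their input order
    return sorted(entries, key=lambda entry: -entry[1])

L1 = build_inverted_list(items, 0)  # sorted by income
L2 = build_inverted_list(items, 1)  # sorted by netWorth

for (id1, income), (id2, net_worth) in zip(L1, L2):
    print(f"{id1:>4} {income:>4}    {id2:>4} {net_worth:>4}")
```

The lists are built once, ahead of query time, so a query can read each list from the top instead of scoring and sorting all items.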
