
SLIDE 1

Social top-k @ Joint RuSSIR/EDBT Summer School 2011

Top-k Processing for Search and Information Discovery in Social Applications

Sihem Amer-Yahia Julia Stoyanovich

SLIDE 2


Why top-k?

Practical

– many applications, with new ones emerging every day

Beautiful

– basic algorithms (TA, NRA) are simple and intuitive
– nice theoretical properties
– instance optimality

Insightful

– optimizes for I/O operations, a good run-time performance abstraction
– trades off run-time performance against space overhead
– representative of many data management problems

Influential

– Optimal Aggregation Algorithms for Middleware, PODS 2001, by Ronald Fagin, Amnon Lotem and Moni Naor
– 733 citations (as of July 11, 2011), PODS 2011 test-of-time award

SLIDE 3


Why Social?

Social science is an old discipline
Social Web: a new opportunity where actions of millions of users online can be gathered and processed (content sites like Delicious, streams like Twitter, and media like ALJ online, in the form of factual and opinion data)

Computing: large-scale data processing
Social Computing: the science of gathering, storing and processing social breadcrumbs left by millions of users, in order to enhance their online experience => scaling up social science

SLIDE 4


About us

Sihem

– PhD INRIA, France (1999)
– AT&T Research (1999-2006), Yahoo! Research (2006-2011)
– Now: Principal Research Scientist at Qatar Foundation
– Research interests: social computing, large-scale data processing

Julia

– Data management at several New York start-ups (1998-2003)
– PhD Columbia University, USA (2009)
– Now: Postdoctoral Researcher at the University of Pennsylvania
– Research interests: ranking and data exploration, biological and social applications

SLIDE 5


Course outline

Top-k and its applications

– Fundamental top-k algorithms
– Personalized search
– Group recommendation

Social: user studies

– Group recommendation (MovieLens)
– Travel itinerary extraction (Flickr)

Open problems in top-k and social

Feel free to interrupt and ask questions!

SLIDE 6


A plug: Web Data Management

A new book by Serge Abiteboul, Ioana Manolescu, Philippe Rigaux, Marie-Christine Rousset, Pierre Senellart. Will be published in print in December 2011. Available for free download from http://webdam.inria.fr/Jorge/files/wdmd.pdf

SLIDE 7


Top-k Processing for Search and Information Discovery in Social Applications

Lecture 1: Fundamental Top-k Algorithms

Sihem Amer-Yahia Julia Stoyanovich

SLIDE 8


Quote of the day

Life is the sum of all your choices. ~Albert Camus

SLIDE 9


What is top-k processing?

Find k items that best answer a user’s query

– As a set, as a sorted list, or as a sorted list with scores
– Usually from among N items, where N >> k

Application domains

– Search over structured datasets with user-defined preferences

Find large apartments in a good school district in Brooklyn
Find cheap hotels that are near a beach

– Web search & other document retrieval / ranking tasks

Find documents about “Winter Olympics Sochi”

– Search over multimedia repositories (with probability scores)

Find images that show a palm tree next to a house

– Many others, including in the social domain!

SLIDE 10


SQL example

SLIDE 11


Top-k vs. SQL querying

Compared to standard SQL querying

– Relevance to a degree, not Boolean
– Return only the best items, not all items
– Quality of an item is expressed by a score
– A score is assigned to items by a ranking (scoring) function

SLIDE 12


Outline

 Intro

Semantics

– Ranking functions – Dominance and skylines

Fundamental algorithms

– Fagin algorithm (FA) – Threshold algorithm (TA) – No random access algorithm (NRA)

Extensions and alternatives

– top-k with expensive predicates (MPro) – TAAT vs. DAAT vs. ….

SLIDE 13


Ranking functions

Ranking (scoring) functions are used to compute the score of an item
Item r(x1, …, xm), where xi are the ranking attributes, e.g., size of a house in m², number of times “Sochi” occurs in the text

score(r) = g(f1(x1), …, fm(xm))

– where fi are monotone functions, e.g., f(x) = 2 × x
– g is a monotone aggregation function, e.g., sum, average, max, min
– e.g., score(r) = 2 × size + 3 × quality of school district

Definitions

A function f is monotone if f(x) ≤ f(y) whenever x ≤ y
An aggregation function g is monotone if g(r) ≤ g(r’) whenever r.xi ≤ r’.xi for all i
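The composition score(r) = g(f1(x1), …, fm(xm)) can be sketched in a few lines. This is only an illustration: the helper name `make_score` and the sample attribute values are assumptions, not part of the lecture.

```python
from typing import Callable, Sequence

def make_score(fs: Sequence[Callable[[float], float]],
               g: Callable[[Sequence[float]], float]) -> Callable[[Sequence[float]], float]:
    """Build score(r) = g(f1(x1), ..., fm(xm)) from per-attribute functions fs and aggregate g."""
    def score(xs: Sequence[float]) -> float:
        return g([f(x) for f, x in zip(fs, xs)])
    return score

# The slide's example: score(r) = 2 * size + 3 * quality of school district.
score = make_score([lambda size: 2 * size, lambda quality: 3 * quality], sum)

# Hypothetical items (size in m^2, school-district quality); the second is
# at least as large in every attribute, so monotonicity forces score <=.
small = (80, 5)
large = (100, 7)
assert score(small) <= score(large)
print(score(small), score(large))  # 175 221
```

Swapping `sum` for `min` or `max` keeps the function monotone, which is all the algorithms later in the lecture rely on.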

SLIDE 14


Ranking functions and dominance

Monotone aggregation ranking functions express user preferences

– r(price, distToBeach); score(r) = w1 × price + w2 × distToBeach
– score(r1) < score(r2) whenever r1.price < r2.price AND r1.distToBeach < r2.distToBeach
– This holds for any positive weights w1 and w2! (What if one of the weights were negative? What if both were negative?)
– In this example, we are interested in minimizing the score

Mapping a multi-dimensional (e.g., 2D) space into R

– In top-k, we look for k tuples with the best score, which is weight-specific, but …

users may differ on the weights in their preferences
users may want to understand the structure of the 2D space

– Skylines to the rescue!

make the structure of the space explicit
are independent of the weights

SLIDE 15


Skylines represent dominance

Item r1 is better than r2 according to its score if

– score(r1) < score(r2), which holds for any positive weights whenever
– r1.price < r2.price AND r1.distToBeach < r2.distToBeach

This corresponds to a notion of dominance (Pareto-optimality)

– Definition: r1 dominates r2 iff r1 is as good or better than r2 in all dimensions, and strictly better in at least one dimension.
– Some items may be incomparable

r1: $100, 1.0 km
r2: $ 90, 1.2 km
r3: $110, 1.1 km
r4: $120, 1.1 km

– Observe: dominance is transitive
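The dominance definition can be checked directly on the four hotels listed above. The following is a minimal sketch, assuming that lower is better in both dimensions (price and distance); the function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# (price in $, distance to beach in km), taken from the slide.
hotels = {"r1": (100, 1.0), "r2": (90, 1.2), "r3": (110, 1.1), "r4": (120, 1.1)}

def skyline(items):
    """Names of all items that are not dominated by any other item."""
    return sorted(name for name, p in items.items()
                  if not any(dominates(q, p) for q in items.values()))

print(dominates(hotels["r1"], hotels["r3"]))  # True: cheaper and closer
print(dominates(hotels["r1"], hotels["r2"]))  # False: r1 and r2 are incomparable
print(skyline(hotels))                        # ['r1', 'r2']
```

Here r1 dominates r3 and r3 dominates r4 (same distance, lower price), so r1 also dominates r4, which illustrates transitivity.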

SLIDE 16


Skyline examples

SLIDE 17


Skyline properties

For any monotone ranking function f

– If point r is best according to f, then r is on the skyline

A top-1 hotel will always be on the skyline, irrespective of the weights!

– If point r is on the skyline, then there exists an f for which r is best

Every hotel on the skyline is someone’s favorite
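Both properties can be observed on a small data set. The sketch below reuses the four hotels from the dominance slide and sweeps a few weight pairs; restricting f to weighted sums and picking these particular weights are assumptions made for illustration (the properties are stated for monotone functions in general).

```python
# (price in $, distance to beach in km); lower is better in both dimensions.
hotels = {"r1": (100, 1.0), "r2": (90, 1.2), "r3": (110, 1.1), "r4": (120, 1.1)}

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

skyline = {name for name, p in hotels.items()
           if not any(dominates(q, p) for q in hotels.values())}

def best_hotel(w_price, w_dist):
    """Top-1 hotel when minimizing w_price * price + w_dist * distance."""
    return min(hotels, key=lambda n: w_price * hotels[n][0] + w_dist * hotels[n][1])

# Sweep a few positive weight pairs (w, 1 - w).
winners = {best_hotel(w, 1 - w) for w in (0.01, 0.5, 0.99)}

assert winners <= skyline                  # every top-1 answer lies on the skyline
print(sorted(winners), sorted(skyline))    # ['r1', 'r2'] ['r1', 'r2']
```

With a very small price weight the closest hotel r1 wins; otherwise the cheapest hotel r2 wins, so each skyline hotel is the favorite for some weighting.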

SLIDE 18


Outline

 Intro  Semantics

 Ranking functions  Dominance and skylines

Fundamental algorithms

– Fagin algorithm (FA) – Threshold algorithm (TA) – No random access algorithm (NRA)

Extensions and alternatives

– top-k with expensive predicates (MPro) – TAAT vs. DAAT vs. ….

SLIDE 19


Quantifying top-k algorithm performance

Execution time

– Sequential access (SA)

accessing items in order, e.g., by reading from a cursor
similar concept to a sequential disk read, where seek time is amortized over multiple accesses

– Random access (RA)

accessing items out of order, e.g., a primary key lookup
similar to a random disk read
typically more expensive than an SA (even by orders of magnitude), sometimes impossible

– Why not use wall clock time?

Buffer size

– How much state do we have to keep during computation?
– Is the size bounded by some constant (e.g. k), or is it linear in the size of the dataset (N)? (recall that k << N)

SLIDE 20


Example 1

id    income (K$)   netWorth (K$)   score = income + net worth
r1    150           350             500
r2    150           425             575
r3    125           450             575
r4    100           450             550
r5    100           200             300
r6     80           100             180
r7     75           500             575
r8     75            50             125
r9     50           300             350
r10    50            50             100

SLIDE 21


Naïve computation of the top-k answers

Algorithm

– Compute the score of each item
– Sort items in decreasing order of score
– Return k items with the highest score
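The three steps can be written down directly. This is a minimal sketch on the data of Example 1, assuming the sum aggregate used there; the function name `naive_topk` is illustrative.

```python
# Example 1 data: id -> (income, netWorth), both in K$.
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}

def naive_topk(items, k):
    """Score every item, sort by decreasing score, return the first k."""
    scored = [(rid, income + net_worth) for rid, (income, net_worth) in items.items()]
    # sorted() is stable, so tied items keep their original order.
    scored = sorted(scored, key=lambda pair: -pair[1])
    return scored[:k]

print(naive_topk(items, 3))  # [('r2', 575), ('r3', 575), ('r7', 575)]
```

Every item is read and scored, so the cost grows with N regardless of how small k is.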

Properties of naïve solution

– Advantages? – Disadvantages?

Idea : throw space at the problem

– pre-compute inverted lists for components of the score
– aggregate partial scores at run-time

Main intuition:

– top-k items will appear near the top of sufficiently many lists

SLIDE 22


The basic indexing structure: inverted list

L1 (sorted by income)      L2 (sorted by netWorth)
id    income               id    netWorth
r1    150                  r7    500
r2    150                  r3    450
r3    125                  r4    450
r4    100                  r2    425
r5    100                  r1    350
r6     80                  r9    300
r7     75                  r5    200
r8     75                  r6    100
r9     50                  r8     50
r10    50                  r10    50
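The two lists can be derived from the table of Example 1 by sorting on one score component at a time. A minimal sketch follows; the helper name `inverted_list` is an assumption, and ties are left in input order.

```python
# Example 1 data: id -> (income, netWorth), both in K$.
items = {
    "r1": (150, 350), "r2": (150, 425), "r3": (125, 450), "r4": (100, 450),
    "r5": (100, 200), "r6": (80, 100), "r7": (75, 500), "r8": (75, 50),
    "r9": (50, 300), "r10": (50, 50),
}

def inverted_list(items, component):
    """Return (id, value) pairs sorted by decreasing value of one score component."""
    pairs = [(rid, vals[component]) for rid, vals in items.items()]
    # sorted() is stable, so ties keep the insertion order of `items`.
    return sorted(pairs, key=lambda p: -p[1])

L1 = inverted_list(items, 0)  # by income
L2 = inverted_list(items, 1)  # by netWorth
print(L1[:3])  # [('r1', 150), ('r2', 150), ('r3', 125)]
print(L2[:3])  # [('r7', 500), ('r3', 450), ('r4', 450)]
```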

SLIDE 23


Fagin Algorithm (FA)

Algorithm

– Access all lists sequentially (SA), in parallel
– STOP once k items have been seen sequentially in all lists
– Compute scores of incomplete items by performing a random access (RA)
– Sort on score, return the best k items

Is this algorithm correct?
Performance of FA for Example 1 with k = 3

– Number of SA
– Number of RA
– Max buffer size
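One way to work through the performance questions is to run FA on the two inverted lists of Example 1. The sketch below assumes the sum aggregate from Example 1 and uses in-memory dictionaries (`lookup`) as a stand-in for random access; all function and variable names are illustrative.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def fagin(lists, k):
    lookup = [dict(lst) for lst in lists]   # stand-in for random access by id
    seen = [dict() for _ in lists]          # per list: id -> partial score seen via SA
    sa = 0
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            rid, val = lst[depth]
            seen[i][rid] = val
            sa += 1
        # Stop once k items have been seen sequentially in every list.
        if len(set.intersection(*(set(s) for s in seen))) >= k:
            break
    candidates = set().union(*seen)         # every item seen in at least one list
    ra = 0
    scores = {}
    for rid in candidates:
        total = 0
        for i in range(len(lists)):
            if rid in seen[i]:
                total += seen[i][rid]
            else:                           # missing component: one random access
                total += lookup[i][rid]
                ra += 1
        scores[rid] = total
    top = sorted(scores.items(), key=lambda t: (-t[1], t[0]))[:k]
    return top, sa, ra, len(candidates)

print(fagin([L1, L2], 3))
# ([('r2', 575), ('r3', 575), ('r7', 575)], 8, 2, 5)
```

Under these assumptions FA stops at depth 4 (8 sequential accesses), needs 2 random accesses (for r1 and r7), and buffers 5 items.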

SLIDE 24


Threshold Algorithm (TA)

Algorithm

– Access all lists sequentially (SA), in parallel
– After each cursor move

Compute the score of the item r under the cursor with random accesses (RA)
Record r in the buffer if buffer size < k; otherwise, if r’s score > kth score, replace the kth item in the buffer with r
Update the threshold θ = Σ current list scores
STOP when kth score > θ

– Return the k items currently in the buffer

Is this algorithm correct?
Performance of TA for Example 1 with k = 3

– Number of SA
– Number of RA
– Max buffer size
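The following sketch runs TA on the inverted lists of Example 1 under a few stated assumptions: the aggregate is a sum, the threshold is checked once per round (after all cursors have moved one step), the stop test is the strict kth score > θ from the slide (Fagin et al. halt as soon as the kth score is at least θ), and no scores are cached, so every sequential access triggers one random access. Names are illustrative.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def threshold_algorithm(lists, k):
    lookup = [dict(lst) for lst in lists]   # stand-in for random access by id
    buffer = {}                             # id -> complete score, at most k entries
    sa = ra = 0
    for depth in range(len(lists[0])):
        current = []                        # scores under the cursors in this round
        for i, lst in enumerate(lists):
            rid, val = lst[depth]
            sa += 1
            current.append(val)
            score = val
            for j in range(len(lists)):     # fetch the other components by RA
                if j != i:
                    score += lookup[j][rid]
                    ra += 1
            if rid in buffer:
                continue
            if len(buffer) < k:
                buffer[rid] = score
            else:
                worst = min(buffer, key=buffer.get)
                if score > buffer[worst]:   # replace the current kth item
                    del buffer[worst]
                    buffer[rid] = score
        theta = sum(current)                # best possible score of any unseen item
        if len(buffer) == k and min(buffer.values()) > theta:
            break
    top = sorted(buffer.items(), key=lambda t: (-t[1], t[0]))
    return top, sa, ra

print(threshold_algorithm([L1, L2], 3))
# ([('r2', 575), ('r3', 575), ('r7', 575)], 8, 8)
```

With these choices TA uses 8 sequential and 8 random accesses while the buffer never holds more than k = 3 items. At depth 3 the threshold equals the kth score (575), so a non-strict stop test would halt two sequential accesses earlier.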

SLIDE 25


Comparison between FA and TA

Example 2 with k = 3
Theorem 1

– # SA in TA ≤ # SA in FA

Theorem 2

– TA requires bounded buffers, while buffers in FA are not bounded

SLIDE 26


Instance optimality of TA

Theorem 3

– TA is instance-optimal over the class of algorithms A that

1. correctly find the top-k answers and
2. do not make any random guesses

Does Theorem 3 hold over all algorithms that correctly identify the top-k answers?

Example 3

SLIDE 27


What if random accesses were impossible?

When is this the case?
Common approach: change semantics of the result

– Sometimes it suffices to output the top-k as a set
– Sometimes we can get away with outputting top-k in sorted order, but with no scores

Example 4

SLIDE 28


No Random Access algorithm (NRA)

Algorithm

– Access all lists sequentially (SA), in parallel
– After each cursor move compute

Worst-case score W(r), best-case score B(r) for each seen r
Sort all seen items on W(r), breaking ties by B(r)
θ = Σ current list scores (this is the best-case score of any unseen object)
STOP when W(r) of the kth object > θ and no seen object outside the top k has B(r) above that W(r)

– If RA is possible, compute complete scores of the top-k items
– Return the top-k items

NRA with Example 1 for k = 3
Performance

– # SA, # RA, buffer size
– Optimal performance if no RAs are allowed!
– In reality, computation may be slow, because of re-sorting large buffers at each step
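NRA can be traced on Example 1 with the sketch below. It assumes a sum aggregate, a lower bound of 0 for any unseen component, and one stop check per round. Besides comparing the kth worst-case score with θ, it also checks that no seen item outside the top k has a best-case score above the kth worst-case score, as in the NRA description of Fagin et al.; on Example 1, stopping on θ alone would halt at depth 4 and return r4 in place of r7. Names are illustrative.

```python
L1 = [("r1", 150), ("r2", 150), ("r3", 125), ("r4", 100), ("r5", 100),
      ("r6", 80), ("r7", 75), ("r8", 75), ("r9", 50), ("r10", 50)]
L2 = [("r7", 500), ("r3", 450), ("r4", 450), ("r2", 425), ("r1", 350),
      ("r9", 300), ("r5", 200), ("r6", 100), ("r8", 50), ("r10", 50)]

def nra(lists, k):
    m = len(lists)
    known = {}                 # id -> list of components, None where not yet seen
    sa = 0
    ranked = []
    for depth in range(len(lists[0])):
        current = []           # scores under the cursors in this round
        for i, lst in enumerate(lists):
            rid, val = lst[depth]
            sa += 1
            current.append(val)
            known.setdefault(rid, [None] * m)[i] = val

        def worst(rid):        # W(r): unseen components count as 0
            return sum(v for v in known[rid] if v is not None)

        def best(rid):         # B(r): unseen components take the current list score
            return sum(current[i] if v is None else v
                       for i, v in enumerate(known[rid]))

        ranked = sorted(known, key=lambda r: (-worst(r), -best(r), r))
        theta = sum(current)   # best-case score of any unseen item
        if len(ranked) >= k:
            kth_w = worst(ranked[k - 1])
            if kth_w > theta and all(best(r) <= kth_w for r in ranked[k:]):
                break
    return ranked[:k], sa, len(known)

print(nra([L1, L2], 3))
# (['r2', 'r3', 'r7'], 14, 8)
```

Under these assumptions NRA needs 14 sequential accesses and no random accesses, and it keeps all 8 seen items in the buffer, which is the price paid for avoiding RA.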

SLIDE 29


Outline

 Intro  Semantics

 Ranking functions  Dominance and skylines

 Fundamental algorithms

 Fagin algorithm (FA)  Threshold algorithm (TA)  No random access algorithm (NRA)

Extensions and alternatives

– top-k with expensive predicates (MPro) – TAAT vs. DAAT

SLIDE 30


A classification of top-k algorithms

Both (sorted and random access): FA, TA, Quick-combine, Multi-Step
Sorted access only: NRA, Stream-combine
Random access only: MPro, Upper, Pick

SLIDE 31


Minimal Probing (MPro)

Recall TA

– sequentially accesses one list
– probes all other lists for each encountered object

What if random accesses (probes) were very expensive?

– User-defined functions
– Calls to external services, e.g., web-based

Idea: execute only the necessary probes
Assumptions

– One component of the score is accessed sequentially (SA), other components are probed (RA)
– Probing schedule is given, global (same for all items) or per-item

SLIDE 32


The MPro algorithm

For an item r

– Worst(r): worst-case score seen so far
– Best(r): best-case score, unseen components have max score (1.0)

What is a necessary probe?

– The probe on item r is necessary if there do not exist k items r1, …, rk such that Best(r) < Best(ri) for each ri

The MPro Algorithm

– Access items sequentially
– Maintain a best-case queue (a.k.a. the ceiling queue)

STOP when k items with complete scores are at the head of the queue
Probe the first incomplete item, re-sort the queue

SLIDE 33


An MPro example

    SELECT id
    FROM House
    WHERE new(age) s,              -- sequential access
          cheap(price, size) pC,   -- random access, a.k.a. “probe”
          large(size) pL           -- random access
    ORDER BY MIN(s, pC, pL)
    STOP AFTER 2

id   s = new(age)   pC = cheap(price, size)   pL = large(size)   score = MIN(s, pC, pL)
a    0.90           0.85                      0.75               0.75
b    0.80           0.78                      0.90               0.78
c    0.70           0.75                      0.20               0.20
d    0.60           0.90                      0.90               0.60
e    0.50           0.70                      0.80               0.50
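The example can be traced with a small priority-queue sketch. It assumes a global probing schedule (pC first, then pL), MIN as the aggregate, and 1.0 as the maximum value of an unprobed component; for brevity, the sequential access on s is simulated by loading all s values into the queue up front. The function name and data layout are illustrative.

```python
import heapq

# id -> (s, pC, pL); s comes from sequential access, pC and pL from probes.
houses = {
    "a": (0.90, 0.85, 0.75),
    "b": (0.80, 0.78, 0.90),
    "c": (0.70, 0.75, 0.20),
    "d": (0.60, 0.90, 0.90),
    "e": (0.50, 0.70, 0.80),
}

def mpro(houses, k):
    """Return the top-k items under MIN and the number of probes performed."""
    probes = 0
    # Ceiling queue ordered by best-case score: entries are (-Best, id, components known).
    # With only s known, Best(r) = MIN(s, 1.0, 1.0) = s.
    heap = [(-vals[0], rid, 1) for rid, vals in houses.items()]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        neg_best, rid, n_known = heapq.heappop(heap)
        vals = houses[rid]
        if n_known == len(vals):
            # A complete item at the head of the queue cannot be beaten: output it.
            result.append((rid, -neg_best))
            continue
        # Necessary probe: evaluate the next predicate in the schedule for the head item.
        probes += 1
        best = min(-neg_best, vals[n_known])
        heapq.heappush(heap, (-best, rid, n_known + 1))
    return result, probes

print(mpro(houses, 2))  # ([('b', 0.78), ('a', 0.75)], 4)
```

Only a and b are probed (4 probes in total), whereas probing every predicate of every item would take 10; items c, d and e are never probed because their best-case scores stay below the two complete scores.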

SLIDE 34


The big picture: top-k is term-at-a-time

Here items are called “documents” (IR terminology)

Term-at-a-time (TAAT)

– Partial scores for items kept in accumulators
– May (and ideally should) terminate before all items are considered, but each item may be considered more than once
– Typically outperforms DAAT when contributions of terms to the score are independent and the dataset is reasonably small

Document-at-a-time (DAAT)

– Complete scores for items computed (i.e., typically all items are considered)
– Smaller memory footprint than in TAAT
– Easier to execute in parallel than TAAT
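The difference in processing order can be illustrated on a toy index. In this sketch the posting lists, integer partial scores and function names are made up for illustration, the aggregate is a sum, and neither variant implements early termination; it only contrasts per-document accumulators with a merge over document-ordered lists.

```python
import heapq
from collections import defaultdict
from itertools import groupby

# Hypothetical posting lists: term -> [(doc id, partial score)], sorted by doc id.
postings = {
    "winter":   [(1, 5), (2, 1), (4, 3)],
    "olympics": [(1, 4), (3, 6), (4, 2)],
    "sochi":    [(2, 2), (3, 1), (4, 5)],
}

def taat(postings, k):
    """Term-at-a-time: scan one list after another, keeping an accumulator per document."""
    acc = defaultdict(int)
    for plist in postings.values():
        for doc, partial in plist:
            acc[doc] += partial
    best = heapq.nlargest(k, ((score, doc) for doc, score in acc.items()))
    return [(doc, score) for score, doc in best]

def daat(postings, k):
    """Document-at-a-time: merge the lists by doc id and finish one document before the next."""
    merged = heapq.merge(*postings.values())
    complete = ((sum(p for _, p in group), doc)
                for doc, group in groupby(merged, key=lambda posting: posting[0]))
    # nlargest keeps only k candidates in memory at any time.
    best = heapq.nlargest(k, complete)
    return [(doc, score) for score, doc in best]

print(taat(postings, 2))  # [(4, 10), (1, 9)]
print(daat(postings, 2))  # [(4, 10), (1, 9)]
```

TAAT holds one accumulator per document touched, while DAAT only ever holds the current document and the k best so far, which matches the memory footprint remark above.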

SLIDE 35


Summary and outlook

Semantics of top-k queries

– Items have scores that are made up of components
– Components are aggregated using monotone aggregation

Fundamental algorithms

– Use the inverted list indexing structure
– Have an access strategy and a stopping condition
– TA: instance-optimal over the class of reasonable algorithms
– NRA: useful when random access is expensive or impossible

Generalizations and extensions
Next lecture

– using top-k for personalized search in social tagging sites

SLIDE 36


References and further reading

1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor, PODS 2001.
2. Minimal probing: supporting expensive predicates for top-k queries. Kevin Chen-Chuan Chang and Seung-won Hwang, SIGMOD 2002.
3. Evaluating top-k queries over Web-accessible databases. Nicolas Bruno, Luis Gravano and Amélie Marian, ICDE 2002.
4. The Skyline operator. Stephan Börzsönyi, Donald Kossmann and Konrad Stocker, ICDE 2001.
5. Query evaluation: strategies and optimizations. Howard Turtle and James Flood, Information Processing and Management 1995.
