Evaluating Strategies for Similarity Search on the Web

Taher H. Haveliwala, Aristides Gionis, Dan Klein, Piotr Indyk
{taherh,gionis,klein}@cs.stanford.edu
indyk@theory.lcs.mit.edu

Similarity search
- Given a query Web page q, return Web pages that are “similar” to q
- Example pages:
  www.moneycentral.com
  www.pathfinder.com/money
  www.moneyworld.co.uk
  www.money.com
  www.etrade.com
  www.moneyclub.com

Similarity search
- Two major issues:
  - Choose the strategy that best captures the notion of Web-page “similarity”
  - Scale up the chosen strategy to a repository of millions of pages

Related work
- Finding Related Pages in the WWW [Dean, Henzinger WWW8 ’99]
- Automatic Resource Compilation ... [Chakrabarti et al. WWW7 ’98]
- Commercial search engines

Model for document similarity
- Represent each Web page as a bag of terms
  - content, anchor-text, links, ...
- Similarity of two pages is given by the similarity of their respective bags
  - cosine
  - Jaccard
- Strategy for (page → bag) is the crucial step in quality of sim()

Model for document similarity
- For pages a and b, with respective bags α and β, define
  sim(a, b) = |α ∩ β| / |α ∪ β|

Similarity search system
[Diagram. Preprocessing: Web, page → representation (using strategy θ), Page Representations, Indexing, Sim Index. Query-time: Query Processing.]
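
The definition of sim(a, b) above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that intersection and union on bags are taken as multiset operations (minimum and maximum of the term counts), which the slides do not spell out. The function name `jaccard` and the sample bags are hypothetical.

```python
from collections import Counter


def jaccard(bag_a: Counter, bag_b: Counter) -> float:
    """Multiset Jaccard similarity: |a ∩ b| / |a ∪ b|.

    Assumption: intersection uses the minimum count per term and
    union uses the maximum count per term.
    """
    inter = sum((bag_a & bag_b).values())  # Counter & keeps min counts
    union = sum((bag_a | bag_b).values())  # Counter | keeps max counts
    return inter / union if union else 0.0


# Hypothetical bags for two pages.
a = Counter({"music": 2, "great": 2, "click": 1})
b = Counter({"music": 1, "world": 1, "welcome": 1})
print(jaccard(a, b))
```

Only the shared term "music" contributes to the intersection, so two bags built by different strategies for the same page can look quite dissimilar; this is why the (page → bag) step matters so much.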

Similarity search system
[Diagram. Preprocessing: Web, page → representation, Page Representations, Indexing, Sim Index. Query-time: Query Processing.]

Possible term choices
Example pages:
- http://www.foobar.com/
  ...click here for a great music page...
  ...click here for great sports page...
- http://www.music.com/
  MusicWorld
  Welcome
  Enter our site
- http://www.baz.com/
  ...what I had for lunch...
  ...this music is great...

Content
(same example pages)
bag: www.music.com
  music    1
  world    1
  welcome  1
  ...

Links
(same example pages)
bag: www.music.com
  www.foobar.com  1
  www.baz.com     1

Anchor windows
(same example pages)
bag: www.music.com
  music  2
  great  2
  click  1
  ...

Parameter space for bag generation
- Space of parameters considered:
  - content vs. links vs. anchor windows
  - anchor window length
  - term weighting schemes
- Choice of a particular assignment of parameters, θ, defines a similarity search strategy

Similarity search system
[Diagram. Preprocessing: Web, page → representation (using strategy θ), Page Representations, Indexing, Sim Index. Query-time: Query Processing.]

(Strategy, query) → similarity ordering
- Inputs:
  - θ ∈ Θ: strategy (i.e., parameter setting)
  - q ∈ Web: query page
- Outputs:
  - τ: list of web pages ordered by similarity to q using strategy θ
- τ = Τ(θ, q)
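
The anchor-window bag for www.music.com can be reproduced with a short sketch. The slides do not say which words form each anchor or how the window is measured, so the code assumes a window of W tokens on each side of the anchor text, with W = 0 meaning anchor text only. The anchor positions below ("music page" on foobar.com, "music" on baz.com), the function name `anchor_window_bags`, and the data layout are hypothetical; only the first text fragment of foobar.com is used.

```python
from collections import Counter, defaultdict


def anchor_window_bags(pages, window):
    """Build one bag per link target from the text around each anchor.

    pages: {url: (tokens, links)}, where links is a list of
           (start, end, target_url) and tokens[start:end] is the anchor text.
    window: number of tokens taken on each side of the anchor (assumption).
    """
    bags = defaultdict(Counter)
    for _url, (tokens, links) in pages.items():
        for start, end, target in links:
            lo = max(0, start - window)
            hi = min(len(tokens), end + window)
            bags[target].update(tokens[lo:hi])
    return bags


pages = {
    "http://www.foobar.com/": (
        "click here for a great music page".split(),
        [(5, 7, "http://www.music.com/")],  # assumed anchor: "music page"
    ),
    "http://www.baz.com/": (
        "this music is great".split(),
        [(1, 2, "http://www.music.com/")],  # assumed anchor: "music"
    ),
}

bag = anchor_window_bags(pages, window=8)["http://www.music.com/"]
print(bag.most_common(3))
```

Under these assumptions the counts for "music", "great" and "click" match the bag shown on the Anchor windows slide. Shrinking `window` to 0 leaves only the anchor words themselves.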

Evaluating strategies
- Goal: find “best” θ_i ∈ Θ
- Develop system to measure quality of different parameter settings
  - What do you choose as the ground truth for Web-page similarity?
  - How do you compare a particular strategy to this ground truth?

Web directories (Yahoo!, ODP)
- Hand-constructed hierarchical directories such as Yahoo! and the Open Directory Project (ODP) can be used as an external quality measure
- Do not directly provide ranked similarity listings
- Do contain many implicit similarity judgements

Directory → Similarity judgements
[Diagram: directory tree. Open Directory > Computers > {Hardware, Software}, with leaf pages www.hardware.com, www.software.com, www.programming.com, www.machine.com.]

(Directory, query) → similarity ordering
[Diagram: pages grouped relative to the Query as Same Class, Sibling Class, Cousin Class, Unrelated.]
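
One way to turn a directory into the four groups named above is to compare the category paths of the query and of another page. The slides name the groups but do not define them, so the rules below are an assumption: same class means the same category, sibling classes share a parent, cousin classes share a grandparent, and everything else (including categories at different depths) is treated as unrelated. The function name, the `LEVEL` table and the category paths are hypothetical.

```python
# Smaller level = judged more similar to the query by the directory.
LEVEL = {"same": 0, "sibling": 1, "cousin": 2, "unrelated": 3}


def directory_relation(query_path: str, page_path: str) -> str:
    """Classify a page relative to the query by comparing category paths.

    Assumed rules (not stated on the slides): identical path -> same,
    shared parent -> sibling, shared grandparent -> cousin, else unrelated.
    Categories at different depths are treated as unrelated for simplicity.
    """
    q = query_path.strip("/").split("/")
    p = page_path.strip("/").split("/")
    if q == p:
        return "same"
    if len(q) == len(p) and q[:-1] == p[:-1]:
        return "sibling"
    if len(q) == len(p) and len(q) >= 2 and q[:-2] == p[:-2]:
        return "cousin"
    return "unrelated"


print(directory_relation("Computers/Software", "Computers/Hardware"))
```

Mapping each page to `LEVEL[directory_relation(...)]` yields a weak ordering: pages within the same group are tied, which is why the directory gives only a partial ranking.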

Evaluating strategies
1. Restrict attention during evaluation phase to pages in the directory D
2. Compare similarity ordering induced by parameter setting θ_i to the similarity ordering induced by the directory, over test set of query pages
3. Choose the θ_i that agrees most closely with the judgements in D

(Directory, query) → similarity ordering
- Inputs:
  - D: hierarchical directory
  - q ∈ D: query page
- Outputs:
  - τ: list of pages of D partially ordered by similarity to q, using the ordering implicit in D
- τ = Τ(D, q)
- The above is for evaluating similarity search, not performing it!

Directory vs. Strategy
[Diagram: the Open Directory (ODP) induces a weak order around the Query (Same Class, Sibling Class, Cousin Class, Unrelated); strategy θ_i induces a total order.]

Comparing two orderings
- Based on Kruskal-Goodman Γ
- Inputs:
  - τ_odp: strict weak ordering of pages
  - τ_i: total ordering of pages according to θ_i
- Output:
  - -1 ≤ Γ ≤ 1: measure of agreement
  - Γ = 2 × Pr[τ_odp and τ_i agree on ordering of (u, v)] - 1
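
The Γ formula above can be computed directly by counting page pairs. The sketch below is illustrative and makes one assumption the slide leaves open: pairs that the directory ordering ties (for example two pages in the same group) are skipped, so the probability is taken over pairs that τ_odp actually orders. The function name `gamma` and the toy data are hypothetical.

```python
from itertools import combinations


def gamma(odp_level: dict, strategy_rank: list) -> float:
    """Γ = 2 * Pr[orderings agree on (u, v)] - 1.

    odp_level: page -> level in the directory ordering (smaller = closer
               to the query); equal levels are ties in the weak order.
    strategy_rank: pages in decreasing similarity under a strategy
                   (a total order).
    Assumption: pairs tied in odp_level are ignored.
    """
    pos = {page: i for i, page in enumerate(strategy_rank)}
    agree = total = 0
    for u, v in combinations(strategy_rank, 2):
        if odp_level[u] == odp_level[v]:
            continue  # the directory expresses no preference for this pair
        total += 1
        if (odp_level[u] < odp_level[v]) == (pos[u] < pos[v]):
            agree += 1
    if total == 0:
        raise ValueError("no pair is ordered by the directory")
    return 2 * agree / total - 1


levels = {"a": 0, "b": 1, "c": 1, "d": 3}  # hypothetical pages
print(gamma(levels, ["a", "b", "c", "d"]))
print(gamma(levels, ["b", "a", "d", "c"]))
```

A ranking that never contradicts the directory scores 1, a fully reversed ranking scores -1, and a ranking that agrees on half of the ordered pairs scores 0.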

Directory vs. Strategy
[Diagram: ODP ordering and strategy θ_i ordering of a page pair. Agreement.]

Directory vs. Strategy
[Diagram: ODP ordering and strategy θ_i ordering of a page pair. Disagreement!]

Example of two rankings with different Γ scores
Query page: www.aabga.org (American Association of Botanical Gardens and Arboreta)

Ranking with Γ = 0.5312:
- Canadian Botanical Conservation Network, http://www.rbg.ca/cbcn
- The Royal Horticultural Society, www.rhs.org.uk
- The American Rhododendron Society, http://www.rhododendron.org
- Gardener’s Supply Company, www.vg.com
- The New England Botanical Club, www.herbaria.harvard.edu/collections/nebc/nebc.html

Ranking with Γ = 0.3096:
- The Huntington Library, Art Collections, and Botanical Gardens, www.huntington.org
- The American Rhododendron Society, www.rhododendron.org
- American Chiropractic Association, www.amerchiro.org
- American Trakehner Association (horses), www.americantrakehner.com
- American Subcontractors Association, www.asaonline.com

Evaluating strategies
1. For each θ_i ∈ Θ:
   Γ_θi = Avg_{q ∈ D} [ Γ( Τ(D, q), Τ(θ_i, q) ) ]
2. Select strategy θ* = argmax_θi [ Γ_θi ]
- Only assumes that higher agreement, on average, with ODP is a good thing
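
The two-step selection rule above amounts to an average followed by an argmax. The sketch below assumes the per-query Γ values have already been computed for each strategy; the strategy names and all numbers are made up for illustration and are not results from the talk.

```python
from statistics import mean


def select_strategy(per_query_gamma):
    """Average Γ over the query set for each strategy, then take the argmax.

    per_query_gamma: {strategy_name: [Γ for each query page q in D]}
    Returns (best_strategy, {strategy_name: average Γ}).
    """
    avg = {theta: mean(scores) for theta, scores in per_query_gamma.items()}
    best = max(avg, key=avg.get)
    return best, avg


# Hypothetical per-query scores for two hypothetical strategies.
per_query_gamma = {
    "theta_1": [0.30, 0.20, 0.40],
    "theta_2": [0.50, 0.40, 0.45],
}
best, avg = select_strategy(per_query_gamma)
print(best, avg)
```

Because only the averages are compared, a strategy can win overall while still losing on individual queries, which matches the slide's remark that the method only assumes higher average agreement with ODP is desirable.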

Experimental results
- 42 million page subset of the Web from the Stanford WebBase
- Following results restrict attention to two colors: same class and sibling class
- D: 300 pairs of sibling clusters from ODP

Directory vs. Strategy
[Diagram: the Open Directory (ODP) induces a weak order around the Query (Same Class, Sibling Class, Cousin Class, Unrelated); strategy θ_i induces a total order.]

Feature space: term selection
- Content
- Inlinks
- Anchor-windows
  - Basic: window size W ∈ {0, 4, 8, 16, 32}
  - Syntactic: averaged 3 words in both directions
  - Topical: averaged 21 words in both directions

Γ scores
[Bar chart: Sibling-Γ (y-axis, 0.00 to 0.45) per term-selection strategy: content, inlinks, anchor windows of size 0, 4, 8, 16 and 32, syntactic, topical. Bar heights are not recoverable.]

Directory → Similarity judgements
[Diagram: directory tree. Computers > {Hardware, Software}, with leaf pages www.hardware.com, www.software.com, www.programming.com, www.machine.com.]

Orthogonality
[Bar chart: Fraction of Pairs that are Orthogonal (y-axis, 0 to 1) per term-selection strategy. Bar heights are not recoverable.]

Composite schemes
[Bar chart: Sibling-Γ (y-axis, 0.426 to 0.440) for three schemes: Anchor-Window-32; Anchor-Window-32, Content; Anchor-Window-32, Content, Links. Bar heights are not recoverable.]

Feature space: term weighting
- Distance weighting for anchor-window terms
[Diagram: Left window, Anchor text, Right window.]

Weighting schemes
[Bar chart: Sibling-Γ (y-axis, 0.38 to 0.46) for two settings: None, Distance. Bar heights are not recoverable.]

Feature space: term weighting
- Frequency based weighting schemes
  - Inverse Document Frequency (IDF)
    - attenuate weights for frequent terms
  - Nonmonotonic Document Frequency (NMDF)
    - attenuate weights for frequent and infrequent terms

Term weighting (*DF)
[Bar chart: Sibling-Γ (y-axis, 0.42 to 0.48) for four settings: None, log, sqrt, NMDF. Bar heights are not recoverable.]

Comparison of best and worst
[Bar chart: Sibling-Γ (y-axis, 0 to 1) for two settings: Worst setting, Best setting. Bar heights are not recoverable.]
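
Frequency-based weighting can be illustrated with a small sketch. The slides give no formulas, so this assumes the "log" and "sqrt" chart labels denote log-scaled and square-root-scaled inverse document frequency in their common textbook forms; NMDF is left out because its definition is not shown. The function name `weight_bag`, its parameters and the sample counts are hypothetical.

```python
import math


def weight_bag(bag, doc_freq, n_docs, scheme="log"):
    """Scale each term count by a document-frequency-based weight.

    Assumed forms (not taken from the slides):
      none: weight 1
      log:  log(n_docs / df)
      sqrt: sqrt(n_docs / df)
    """
    weighted = {}
    for term, count in bag.items():
        ratio = n_docs / doc_freq[term]
        if scheme == "none":
            w = 1.0
        elif scheme == "log":
            w = math.log(ratio)
        elif scheme == "sqrt":
            w = math.sqrt(ratio)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        weighted[term] = count * w
    return weighted


# Hypothetical corpus statistics: "the" appears in every document.
bag = {"the": 3, "rhododendron": 1}
doc_freq = {"the": 1000, "rhododendron": 10}
print(weight_bag(bag, doc_freq, n_docs=1000, scheme="log"))
```

Both assumed forms attenuate frequent terms as the IDF bullet describes: a term present in every document gets weight 0 under "log" and weight 1 under "sqrt", while a rare term is boosted.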