PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli andrea.esuli@isti.cnr.it Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, 1 — 56124 Pisa, Italy ISTI:Science seminar, May 12, 2009 Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48

Outline
1. Introduction
2. The PP-Index
3. Experiments
4. Demo
5. Conclusions

Outline: Introduction
- Similarity search
- Permutation based methods
- Locality-sensitive hashing methods

Similarity search

The similarity search model involves:
- a collection of objects $D$, belonging to a domain $O$;
- a query object $q \in O$;
- a distance function $d : O \times O \to \mathbb{R}^+$.

The goal is to sort the objects in $D$ by their distance from $q$ and return those closest to $q$, which are considered the most similar.

Typically only the $k$ top-ranked objects are returned ($k$-NN query), or those within a maximum distance $r$ (range query).

Choosing a meaningful value of $r$ is often difficult. $k$-NN queries are usually preferred, especially in end-user applications, partly because they give direct control over the size of the result set.
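As a minimal sketch of the two query types, assuming points in the plane and the Euclidean distance, both can be answered by scoring every object of D against q. The function names `knn_query` and `range_query` and the toy data are illustrative assumptions, not taken from the slides.

```python
import heapq
import math


def euclidean(a, b):
    """L2 distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_query(collection, q, k, d=euclidean):
    """Return the k objects of `collection` closest to q, nearest first."""
    return heapq.nsmallest(k, collection, key=lambda o: d(q, o))


def range_query(collection, q, r, d=euclidean):
    """Return every object within distance r of q, nearest first."""
    hits = [o for o in collection if d(q, o) <= r]
    return sorted(hits, key=lambda o: d(q, o))


# Hypothetical toy collection D in R^2 and a query point q.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0), (5.0, 1.0)]
q = (0.4, 0.0)

print(knn_query(D, q, k=2))
print(range_query(D, q, r=2.1))
```

With a k-NN query the caller fixes the result size up front, whereas with a range query the number of hits depends entirely on how r relates to the data.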

Similarity search

Example ($\mathbb{R}^2$, $L_2$):

[Figure 1: Range query. Figure 2: $k$-NN query ($k = 5$).]

Approximate similarity search

Exhaustive search: for every $o_i \in D$, compute the distance $d(q, o_i)$ while keeping track of which objects satisfy the query. It does not scale to large collections.

Exact methods: return the same results as exhaustive search, but use data structures that exploit the properties of the similarity space (e.g., vector spaces, metric spaces) to reduce the number of objects of $D$ that must be compared with the query. They are usually efficient, but still not efficient enough for huge collections.

Approximate methods: gain efficiency by accepting that the results may contain errors (e.g., $d(q, o_1) < d(q, o_2)$, yet $o_2$ is in the results and $o_1$ is not).
- Approximation is acceptable, e.g., when $d$ is itself an approximation of a complex, human-perceived concept of similarity.
- It scales to large collections.
- Approximate methods are typically derived by "relaxing" exact methods.
- Some proposals are approximate by design, e.g., the locality-sensitive hashing (LSH) index and the permutation-based index (the PP-Index takes inspiration from both).

Approximate similarity search

Approximation quality:
- What have we missed?
- What have we included?
- How much have we saved?

[Figure 3: Approximate result for a $k$-NN query ($k = 5$).]
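One common way to put numbers on these three questions is sketched below. It assumes the exact result is available for comparison. The measures (recall, missed and spurious objects, fraction of distance computations avoided) and all names are illustrative assumptions, not necessarily the measures used in the experiments of this talk.

```python
def recall(exact_ids, approx_ids):
    """Fraction of the exact k-NN result that the approximate result kept."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


def missed_and_spurious(exact_ids, approx_ids):
    """Objects wrongly left out, and objects wrongly let in."""
    exact, approx = set(exact_ids), set(approx_ids)
    return sorted(exact - approx), sorted(approx - exact)


def fraction_saved(n_distances_computed, collection_size):
    """Share of distance computations avoided relative to exhaustive search."""
    return 1.0 - n_distances_computed / collection_size


# Hypothetical object identifiers for a 5-NN query.
exact_result = [4, 5, 9, 8, 0]
approx_result = [4, 5, 9, 8, 11]

print(recall(exact_result, approx_result))
print(missed_and_spurious(exact_result, approx_result))
print(fraction_saved(200, 1000))
```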

Permutation based methods

Independently proposed by Amato and Savino [1] and Chavez et al. [2], using different data structures.

The idea: an object is represented by its view of the surrounding world. Intuitively, if two objects "see" the elements of a set of reference objects $R$ in the same order of (increasing) distance, they are likely to be close to each other.

Example: Where am I likely to live if I see the main European cities in the following order?
Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki.

[1] G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages 1-10.
[2] E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9):1647-1658, 2008.

Permutation based methods

The method:
- A set of reference objects $R = \{r_0, \dots, r_{|R|-1}\} \subset O$ is defined (e.g., by randomly selecting $|R|$ objects from $D$).
- Every object $o_i \in D$ is then represented by a permutation $\Pi_{o_i}$ of $\langle 0, \dots, |R|-1 \rangle$, i.e., the list of the identifiers of the reference objects, sorted by the distance of the corresponding reference objects from $o_i$.
- The search process mainly consists of computing $\Pi_q$ and estimating the true distance $d(q, o_i)$ with a permutation-based distance $d'(\Pi_q, \Pi_{o_i})$, e.g., Spearman's footrule distance.

Amato and Savino have shown that using only the prefix $\Pi^l_{o_i}$ of the permutation $\Pi_{o_i}$ (e.g., $l = 100$ when $|R| = 500$) improves both efficiency and effectiveness.

The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., $l = 6$ when $|R| = 1000$). Unlike previous approaches, it uses the permutation prefixes only to quickly find a small set of candidate objects from $D$ for inclusion in the results, not to estimate their relative order.
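The representation described above can be sketched in a few lines, assuming 2-dimensional points, the Euclidean distance, and four hand-picked reference objects. The helper names and data are illustrative assumptions. The footrule is computed here on full permutations only; comparing prefixes additionally needs a convention for identifiers missing from one prefix, which this sketch leaves out.

```python
import math


def euclidean(a, b):
    """L2 distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def permutation(obj, refs, d=euclidean):
    """Identifiers of the reference objects, sorted by increasing distance from obj."""
    return sorted(range(len(refs)), key=lambda i: d(obj, refs[i]))


def prefix(perm, l):
    """First l identifiers of a permutation, as a hashable tuple."""
    return tuple(perm[:l])


def spearman_footrule(perm_a, perm_b):
    """Sum over reference ids of the absolute difference of their positions."""
    pos_b = {ref_id: rank for rank, ref_id in enumerate(perm_b)}
    return sum(abs(rank - pos_b[ref_id]) for rank, ref_id in enumerate(perm_a))


# Hypothetical reference objects R and three data objects.
R = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
o1, o2, o3 = (1.0, 0.5), (1.2, 0.4), (3.5, 3.0)

p1, p2, p3 = (permutation(o, R) for o in (o1, o2, o3))
print(p1, p2, p3)
print(spearman_footrule(p1, p2), spearman_footrule(p1, p3))
print(prefix(p3, 2))
```

The two nearby objects `o1` and `o2` see the reference objects in the same order, so their footrule distance is zero, while the distant `o3` sees them in a different order.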

Permutation based methods

[Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, with full-length permutations (left) or permutation prefixes of length 3 (right).]

Locality-sensitive hashing methods

A family $H$ of hash functions $h : O \to U$ is called $(r, \epsilon, p_1, p_2)$-sensitive, with $r, \epsilon > 0$ and $p_1 > p_2 > 0$, if for any $p, q \in O$:
- if $d(p, q) \le r$ then $P[h(p) = h(q)] \ge p_1$;
- if $d(p, q) > r(1 + \epsilon)$ then $P[h(p) = h(q)] \le p_2$;
for any function $h$ randomly selected from $H$.

Intuitively: two objects collide with a (high) probability $x_1 \ge p_1$ if they are within distance $r$ of each other, and with a (low) probability $x_2 \le p_2$ if they are farther apart than $r(1 + \epsilon)$.

LSH-Index [3]:
- $j$ randomly chosen functions $h_i \in H$ define a hash function $g(x) = (h_1(x), h_2(x), \dots, h_j(x))$, which lowers the bad collision probability significantly, to at most $p_2^j$.
- $t$ different hash tables are built, based on randomly generated functions $g_1, \dots, g_t$, in order to increase the good collision probability.

[3] P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages 604-613.
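The effect of concatenating j functions and then using t tables can be checked with a short calculation. The sketch below assumes the hash functions are drawn independently and treats p as the collision probability of a single function for a given pair of objects. The numeric values are hypothetical and chosen only for illustration.

```python
def key_collision_probability(p, j):
    """Probability that two objects agree on all j concatenated hash values,
    when each independent function collides with probability p."""
    return p ** j


def any_table_collision_probability(p, j, t):
    """Probability that the two objects share a bucket in at least one of t tables."""
    return 1.0 - (1.0 - p ** j) ** t


# Hypothetical single-function collision probabilities for a near and a far pair.
p_near, p_far = 0.9, 0.5
j, t = 10, 20

print(key_collision_probability(p_near, j), key_collision_probability(p_far, j))
print(any_table_collision_probability(p_near, j, t),
      any_table_collision_probability(p_far, j, t))
```

Longer keys suppress collisions between distant objects but also lower the chance that close objects collide in any single table, which is why several tables are needed.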

Locality-sensitive hashing methods

The LSH-Index is hard to tune (in the length of its hash keys) for good effectiveness, because the appropriate key length depends on the data distribution.

LSH-Forest [4]:
- Uses variable-length hash keys.
- Long hash keys are indexed in a prefix tree (LSH-Tree).
- At search time the key length is varied in order to retrieve a given number of candidate objects.
- Candidate objects are retrieved sequentially from a data store on disk.
- Multiple LSH-Trees, i.e., a forest, are used to improve effectiveness.

The PP-Index uses a similar data structure.

[4] M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages 651-660.
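The idea of varying the key length at search time can be illustrated with a toy in-memory prefix tree. This is a simplified sketch under stated assumptions, not the LSH-Forest algorithm as published and not the PP-Index itself: keys are short strings, everything lives in memory, and the class and method names are hypothetical.

```python
class PrefixTree:
    """Toy prefix tree mapping fixed-length keys to object identifiers."""

    def __init__(self):
        self.children = {}   # symbol -> PrefixTree
        self.ids = []        # identifiers whose key ends at this node

    def insert(self, key, obj_id):
        node = self
        for symbol in key:
            node = node.children.setdefault(symbol, PrefixTree())
        node.ids.append(obj_id)

    def collect(self):
        """All identifiers stored in this subtree, in depth-first key order."""
        found = list(self.ids)
        for symbol in sorted(self.children):
            found.extend(self.children[symbol].collect())
        return found

    def candidates(self, key, min_count):
        """Shorten the matched prefix of `key` until the subtree below it
        holds at least `min_count` identifiers, or the root is reached."""
        path = [self]
        for symbol in key:
            child = path[-1].children.get(symbol)
            if child is None:
                break
            path.append(child)
        for node in reversed(path):
            found = node.collect()
            if len(found) >= min_count:
                return found
        return self.collect()


# Hypothetical keys and object identifiers.
tree = PrefixTree()
for key, obj_id in [("0110", "a"), ("0111", "b"), ("0100", "c"), ("1010", "d")]:
    tree.insert(key, obj_id)

print(tree.candidates("0110", 1))
print(tree.candidates("0110", 2))
print(tree.candidates("0110", 3))
```

Because identifiers under a common prefix come out contiguously in depth-first order, storing the objects in that same order would let a real index read the candidates sequentially from disk.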

Outline: The PP-Index
- Data structures
- Algorithms
