Peer-to-Peer Similarity Search in Metric Spaces Christos - PowerPoint PPT Presentation

Peer-to-Peer Similarity Search in Metric Spaces
Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis
http://www.db-net.aueb.gr/cdoulk/
cdoulk@aueb.gr
Department of Informatics
Athens University of Economics and Business (AUEB)
Athens, Greece
Christos Doulkeridis, AUEB

Motivation
• Similarity search in metric spaces
• Objects are represented in a high-dimensional feature space
• Complex distance functions (e.g. text, multimedia)
• Goal: share the computational load over a set of computers → Peer-to-Peer
• DBISP2P'07 session on P2P similarity search
• Existing work
  – Centralized settings
  – Structured P2P systems (not preserving peer autonomy)

Outline
1. Preliminaries
   a. Metric spaces
   b. iDistance
2. SIMPEER
   a. Construction
   b. Range query processing
   c. k-NN query processing
3. Experimental results
4. Conclusions & further work

Metric Space
• Metric space M = (D, d)
  – d(p,q) = d(q,p) (symmetry)
  – d(p,q) > 0 for q ≠ p, and d(p,p) = 0 (non-negativity)
  – d(p,q) ≤ d(p,o) + d(o,q) (triangle inequality)
• Similarity queries
  – Range queries: R(q,r) = { u ∈ D | d(q,u) < r }
  – k-NN queries: NN_k(q)
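The two query types can be illustrated with a brute-force baseline. This is only a sketch: the Euclidean distance stands in for an arbitrary metric d, and the function names and sample data are illustrative assumptions, not part of SIMPEER.

```python
import math

def euclidean(p, q):
    # One possible metric d; any function satisfying the three axioms would do.
    return math.dist(p, q)

def range_query(data, q, r, d=euclidean):
    # R(q, r) = { u in D | d(q, u) < r }, with the strict inequality used on the slide.
    return [u for u in data if d(q, u) < r]

def knn_query(data, q, k, d=euclidean):
    # NN_k(q): the k objects closest to q (ties broken by input order).
    return sorted(data, key=lambda u: d(q, u))[:k]

# Illustrative data set D.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
in_range = range_query(data, (0.0, 0.0), 2.0)
nearest = knn_query(data, (0.0, 0.0), 3)
```

Every object is compared against q here, which is exactly the cost that the indexing and clustering schemes on the following slides try to avoid.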

iDistance – Indexing the Distance
• Space partitioning into n clusters
• Reference points K_i
• Each cluster mapped to an interval
• Each object x mapped to a 1-d value
• Values indexed in a B+-tree
• Query R(q,r)
  – If a query intersects with a cluster
  – Scan the interval
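A minimal sketch of the mapping and the interval scan follows. It uses the commonly cited iDistance key, i·c + d(K_i, x), where c is a constant larger than any cluster radius. The sorted Python list with `bisect` is a stand-in for the B+-tree, and the nearest-reference assignment, the value of c, and all names are assumptions made for illustration.

```python
import bisect
import math

def build_idistance(points, refs, c):
    """Map each point to the 1-d key i*c + d(K_i, x) and sort by key."""
    entries = []
    radii = [0.0] * len(refs)
    for x in points:
        # Assumption: x belongs to the cluster of its nearest reference point.
        i = min(range(len(refs)), key=lambda j: math.dist(x, refs[j]))
        dist = math.dist(x, refs[i])
        radii[i] = max(radii[i], dist)
        entries.append((i * c + dist, x))
    entries.sort()
    keys = [k for k, _ in entries]
    objs = [x for _, x in entries]
    return keys, objs, radii

def idistance_range_query(keys, objs, radii, refs, c, q, r):
    result = []
    for i, K in enumerate(refs):
        dq = math.dist(K, q)
        if dq - r > radii[i]:
            continue  # the query ball does not intersect cluster i
        # Scan only the part of cluster i's interval that the query can reach.
        lo = i * c + max(0.0, dq - r)
        hi = i * c + min(radii[i], dq + r)
        start = bisect.bisect_left(keys, lo)
        end = bisect.bisect_right(keys, hi)
        # The 1-d scan yields candidates; the real distance filters them.
        result.extend(x for x in objs[start:end] if math.dist(q, x) < r)
    return result

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 12.0)]
refs = [(0.0, 0.0), (10.0, 10.0)]
keys, objs, radii = build_idistance(points, refs, c=100.0)
hits = idistance_range_query(keys, objs, radii, refs, 100.0, (0.5, 0.0), 1.0)
```

In this example the second cluster is pruned without touching any of its keys, because the query ball cannot reach it.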

SIMPEER 3-level Clustering Scheme
1. Each peer
   • clusters its own data
   • indexes local points using iDistance
2. Each super-peer
   • receives its peers' cluster descriptions
   • computes the hyper-clusters using our extension of iDistance
3. Super-peers
   • exchange hyper-clusters
   • build a set of routing clusters
[Figure: super-peer architecture]

iDistance Extension
• Map clusters, not points
• Index the furthest point of each cluster only!
• Each cluster C_j mapped to a 1-d value
• Values indexed in a B+-tree
• Query R(q,r)
  – Search region [ d(O_i,q) - r, r'_i ]
[Figure: hyper-cluster with center O_i and radius r'_i, containing clusters C_1, C_2, C_3 and intersected by query R(q,r)]
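The search region above can be sketched for a single hyper-cluster as follows. The key d(O_i, K_j) + r_j is an assumption consistent with "index the furthest point of each cluster"; the slide's own formula did not survive extraction. A sorted list again stands in for the B+-tree, and all names are hypothetical.

```python
import bisect
import math

def build_hypercluster_index(O, clusters):
    """clusters: list of (K_j, r_j). Key of a cluster = distance from O to its furthest point."""
    entries = sorted(
        ((math.dist(O, K) + r, (K, r)) for K, r in clusters),
        key=lambda e: e[0],
    )
    keys = [k for k, _ in entries]
    ordered = [cl for _, cl in entries]
    r_prime = keys[-1] if keys else 0.0  # hyper-cluster radius r'_i
    return keys, ordered, r_prime

def candidate_clusters(O, keys, ordered, r_prime, q, r):
    lo = math.dist(O, q) - r
    if lo > r_prime:
        return []  # the query ball lies entirely outside the hyper-cluster
    # Search region [d(O, q) - r, r']: by the triangle inequality, every
    # cluster with a smaller key cannot intersect the query.
    start = bisect.bisect_left(keys, lo)
    # Exact ball-intersection test on the remaining candidates.
    return [(K, rj) for K, rj in ordered[start:] if math.dist(K, q) <= rj + r]

O = (0.0, 0.0)
clusters = [((1.0, 0.0), 0.5), ((5.0, 0.0), 1.0), ((0.0, 8.0), 1.0)]
keys, ordered, r_prime = build_hypercluster_index(O, clusters)
selected = candidate_clusters(O, keys, ordered, r_prime, (6.0, 0.0), 0.5)
```

The lower bound is safe because an intersecting cluster satisfies d(O,q) ≤ d(O,K_j) + d(K_j,q) ≤ d(O,K_j) + r_j + r, so its key is at least d(O,q) - r.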

Peer Query Processing
• Data organization
  – Clustering / Space partitioning
  – iDistance
• LC_p sent to super-peer
  – LC_p = { C_i : (K_i, r_i) }
  – Example: LC_p = { C_1(K_1,r_1), C_2(K_2,r_2), C_3(K_3,r_3) }
• Range query processing
  – Scan intervals of B+-tree
[Figure: peer data space with clusters (K_1,r_1), (K_2,r_2), (K_3,r_3) and query R(q,r); leaf nodes of the B+-tree]

Super-Peer Query Processing
• Super-peer
  – Creates hyper-clusters based on peer clusters
  – Indexes the furthest point of each peer's cluster
  – Example: LHC_sp = { HC_1(O_1,r'_1), HC_2(O_2,r'_2), HC_3(O_3,r'_3) }
• Range query processing
  – Find peers to forward the query
  – Peer selection mechanism
[Figure: super-peer data space with hyper-clusters (O_1,r'_1), (O_2,r'_2), (O_3,r'_3) and query R(q,r)]

Routing Indices
• Super-peers broadcast hyper-clusters
• Recipient super-peers
  – Treat hyper-clusters similarly to peer clusters
• Build routing clusters RC_i
  – Used to determine the neighbouring super-peer to forward the query
  – Super-peer selection mechanism

k-NN Query Processing
• Convert k-NN query to range query R(q,r)
  – Use estimated range r
  – Based on (peer) cluster information at a super-peer → local estimation
  – Based on hyper-cluster information at a super-peer → global estimation
  – No communication required for estimation!
• Maximum 2 round-trips required!
  – If fewer than k objects are retrieved, a second round-trip cannot be avoided
  – Super-peer computes an upper bound for r, based on its peers' data
• Goal: make a good estimation, such that
  – First round-trip is enough (overestimate r)
  – r is sufficient, but not too large (do not overestimate r too much)

Histogram Construction
• Distribution of distances
  – F(r) = Pr{ d(q,p) ≤ r }
• Expected number of objects retrieved by R(q,r)
  – #objs(R(q,r)) = n × F(r)
• Assumption
  – "high" homogeneity of viewpoints inside a cluster [Ciaccia, PODS'98]
  – Approximate F_q with a sampled distance distribution F
[Figure: histogram for cluster i, frequency of distances for d ≤ 2r_B; bin boundaries r_B, 2r_B, ..., s·r_B with values F_i(r_B), F_i(2r_B), ..., F_i(s·r_B)]
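A sampled distance distribution and the resulting cardinality estimate might be computed as in the sketch below. The equi-width bins of width r_B, the use of all pairwise distances in a sample, and the step-function lookup are illustrative assumptions; the slides do not spell out these details.

```python
import itertools
import math

def distance_histogram(sample, r_b, s):
    """Cumulative histogram: hist[j] approximates F((j + 1) * r_b).

    Requires at least two sample points.
    """
    dists = [math.dist(p, q) for p, q in itertools.combinations(sample, 2)]
    return [
        sum(1 for d in dists if d <= j * r_b) / len(dists)
        for j in range(1, s + 1)
    ]

def F(hist, r_b, r):
    # Step-function lookup: value at the last bin boundary not exceeding r.
    j = min(int(r // r_b), len(hist))
    return hist[j - 1] if j > 0 else 0.0

def expected_objects(n, hist, r_b, r):
    # #objs(R(q, r)) = n x F(r)
    return n * F(hist, r_b, r)

# Four sample points on a line; pairwise distances are 1, 1, 1, 2, 2, 3.
sample = [(0.0,), (1.0,), (2.0,), (3.0,)]
hist = distance_histogram(sample, r_b=1.0, s=3)
estimate = expected_objects(600, hist, 1.0, 2.0)
```

Rounding r down to a bin boundary can only lower the estimated count, which pushes a radius search toward larger r; that matches the stated preference for overestimating r.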

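Local Estimation (LE)
• Query inside cluster C_i
  – Condition: d(K_i,q) + r ≤ r_i
  – Estimated #objects: n_i × F_i(r)
• Query partially overlapping cluster C_i
  – Condition: d(K_i,q) + r > r_i
  – Estimated #objects: n_i × F_i(r'), where r' = (r_i + r - d(K_i,q)) / 2
• Binary search on [0, s·r_B] to find the smallest r for which the estimated number of objects ≥ k
[Figure: the two cases, showing cluster C_i with center K_i and radius r_i, query R(q,r), and the radius r' of the overlap]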
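The two cases and the binary search can be sketched as follows. Each F_i is passed as a callable; the linear F used in the example, the tolerance, the early return for non-intersecting clusters, and all names are hypothetical choices for illustration.

```python
def estimate_cluster(n_i, F_i, r_i, d_kq, r):
    """Estimated number of objects of one cluster inside R(q, r)."""
    if d_kq + r <= r_i:
        return n_i * F_i(r)  # query ball lies inside the cluster
    r_prime = (r_i + r - d_kq) / 2  # half the width of the overlap
    if r_prime <= 0:
        return 0.0  # assumption: no overlap contributes no objects
    return n_i * F_i(r_prime)

def estimate_radius(clusters, k, r_max, tol=1e-6):
    """Smallest r in [0, r_max] whose total estimate reaches k (within tol).

    clusters: list of (n_i, F_i, r_i, d(K_i, q)). Relies on each F_i being
    non-decreasing, so the total estimate is non-decreasing in r.
    """
    def total(r):
        return sum(estimate_cluster(n, F, ri, d, r) for n, F, ri, d in clusters)

    if total(r_max) < k:
        return r_max  # even the largest radius is not estimated to suffice
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total(mid) >= k:
            hi = mid
        else:
            lo = mid
    return hi

def linear_F(r):
    # Hypothetical distance distribution: F(r) = r / 4, capped at 1.
    return min(1.0, max(0.0, r / 4.0))

# Query at the cluster center: 100 * r / 4 >= 25 first holds at r = 1.
r_inside = estimate_radius([(100, linear_F, 2.0, 0.0)], k=25, r_max=4.0)
# Query outside the cluster: r' = (r - 1) / 2 must reach 1, so r = 3.
r_overlap = estimate_radius([(100, linear_F, 2.0, 3.0)], k=25, r_max=8.0)
```

At the boundary d(K_i,q) + r = r_i the overlap radius r' equals r, so the two cases agree there and the estimate does not jump.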

Global Estimation (GE)
• Hyper-clusters enhanced with 2 histograms:
  – (hc_i) Number of clusters intersecting the query (nc_i)
     • Distance distribution of clusters within a hyper-cluster
  – (hd_i) Number of data objects contained in the intersection (nd_i)
     • Superimpose cluster histograms, by keeping the minimum value of each bin
     • Also keep the minimum cardinality of all clusters
• Estimated #objects: nc_i(r) × nd_i(r)
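The superimposition step and the final product can be sketched as below. How nc_i(r) is derived from hc_i is not detailed on the slide, so it is passed in as a given number; equal binning across clusters, the step-function lookup, and all names are assumptions.

```python
def superimpose(histograms):
    """Bin-wise minimum of equally binned cumulative cluster histograms (hd_i)."""
    return [min(bins) for bins in zip(*histograms)]

def nd(hd, n_min, r_b, r):
    # Objects per intersecting cluster, using the pessimistic histogram and
    # the minimum cluster cardinality.
    j = min(int(r // r_b), len(hd))
    return n_min * (hd[j - 1] if j > 0 else 0.0)

def global_estimate(nc_r, hd, n_min, r_b, r):
    # Estimated #objects = nc_i(r) x nd_i(r); nc_r is assumed to be known.
    return nc_r * nd(hd, n_min, r_b, r)

cluster_hists = [[0.5, 0.75, 1.0], [0.25, 1.0, 1.0]]
cardinalities = [200, 120]
hd = superimpose(cluster_hists)
estimate = global_estimate(2, hd, min(cardinalities), r_b=1.0, r=2.0)
```

Keeping the minimum of each bin and the minimum cardinality makes the per-cluster count a lower bound under these assumptions, which again biases the derived radius toward overestimation.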

Experimental Setup
• GT-ITM topology generator (4K-16K peers)
• #Super-peers = {200, 400}
• DEG_sp = 4-7
• DEG_p = 20-60
• k_p = 10
• Synthetic {uniform, clustered} datasets
  – 8-32d, 3M-12M objects
• Real datasets
  – VEC: 1M 45-dim vectors of color image features
  – CovType: 581K 54-dim instances of forest Covertype data

Construction Cost
• Mainly depends on super-peer topology
• One-time cost!
• Approx. 1.5MB per super-peer
[Figure: total construction cost (MB) vs. DEG_sp (4 to 7), for N_sp=200 and N_sp=400]

Range Queries – Response Time
• (N_sp=200, N_p=2000, n=1M, d=16)
• Increases only slightly with cardinality
• Higher response time in clustered dataset
• Most results come from the same network paths, causing delays
[Figure: response time (sec) vs. cardinality (x10^6) for uniform and clustered datasets with k=60 and k=120; network transfer rate 4KB/sec]

Range Queries – Success Ratio
• Clustered dataset (N_sp=200, N_p=2000, n=1M)
• Success ratio = how many of the contacted peers (super-peers) returned results
[Figure: success ratio vs. query selectivity (x10^-5, from 2 down to 0.33) for super-peers (SP) and peers (P), d=8 and d=32]

k-NN Queries – Overestimation (%)
• VEC dataset (N_sp=200, N_p=2000)
  – LE better (initially)
  – GE becomes better with increasing k_sp
[Figure: overestimation (%) of LE/RE and GE/RE for k=50 and k=100, with k_sp=5, 10, 15]

Conclusions & Further Work
• SIMPEER
  – A metric-based framework for P2P similarity search
  – Utilizes a three-level clustering scheme
• Support for range and k-NN query processing
• Distributed statistics
• Further work
  – Extension for non-vector-based data representations
  – Devise an approach that deals with uniform data distributions in a better way

Thank you for your attention!
More info: http://www.db-net.aueb.gr/cdoulk/
cdoulk@aueb.gr

