Peer-to-Peer Similarity Search in Metric Spaces Christos Doulkeridis, Akrivi Vlachou, Yannis Kotidis, Michalis Vazirgiannis http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr Department of Informatics Athens University of Economics and Business (AUEB) Athens, Greece Christos Doulkeridis, AUEB 1
Motivation
• Similarity search in metric spaces
• Objects are represented in a high-dimensional feature space
• Complex distance functions (e.g. text, multimedia)
• Goal: share the computational load over a set of computers → Peer-to-Peer
• DBISP2P’07 session on P2P similarity search
• Existing work
  – Centralized settings
  – Structured P2P systems (not preserving peer autonomy)
Outline
1. Preliminaries
   a. Metric spaces
   b. iDistance
2. SIMPEER
   a. Construction
   b. Range query processing
   c. k-NN query processing
3. Experimental results
4. Conclusions & further work
Metric Space
• Metric space M = (D, d)
  – d(p,q) = d(q,p) (symmetry)
  – d(p,q) > 0 for q ≠ p, and d(p,p) = 0 (non-negativity)
  – d(p,q) ≤ d(p,o) + d(o,q) (triangle inequality)
• Similarity queries
  – Range queries: R(q,r) = { u ∈ D | d(q,u) < r }
  – k-NN queries: NN_k(q)
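The two query types can be illustrated with a brute-force scan. This is only a minimal sketch: the function names and the choice of Euclidean distance are assumptions made for illustration, and any function d satisfying the metric axioms above could be passed instead.

```python
import heapq
import math


def euclidean(p, q):
    """One possible metric d; chosen here only for illustration."""
    return math.dist(p, q)


def range_query(data, q, r, d=euclidean):
    """R(q, r): all objects u with d(q, u) < r (strict, as on the slide)."""
    return [u for u in data if d(q, u) < r]


def knn_query(data, q, k, d=euclidean):
    """NN_k(q): the k objects closest to q, nearest first."""
    return heapq.nsmallest(k, data, key=lambda u: d(q, u))


data = [(0, 0), (1, 0), (3, 4)]
print(range_query(data, (0, 0), 2))
print(knn_query(data, (0, 0), 1))
```

Both functions compute d(q, u) for every object, which is exactly the cost that an index such as iDistance tries to avoid.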
iDistance – Indexing the Distance
• Space partitioning into n clusters
• Reference points K_i
• Each cluster mapped to an interval
• Each object x mapped to a 1-d value
• Values indexed in a B+-tree
• Query R(q,r)
  – If the query intersects a cluster, scan the corresponding interval
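A minimal sketch of the idea follows. It assumes the commonly cited iDistance mapping y = i·c + d(K_i, x), where c is a constant larger than any cluster radius; the slide's own formula did not survive extraction. A sorted Python list searched with bisect stands in for the B+-tree, and the class and method names are hypothetical.

```python
import bisect
import math


class IDistanceIndex:
    """Toy iDistance: a sorted key list stands in for the B+-tree."""

    def __init__(self, points, refs, c):
        # c is assumed to exceed every cluster radius, so intervals never overlap.
        self.refs, self.c = refs, c
        self.radii = [0.0] * len(refs)
        entries = []
        for x in points:
            # Assign x to its nearest reference point K_i.
            i = min(range(len(refs)), key=lambda j: math.dist(refs[j], x))
            dist = math.dist(refs[i], x)
            self.radii[i] = max(self.radii[i], dist)
            entries.append((i * c + dist, x))  # assumed mapping: y = i*c + d(K_i, x)
        entries.sort(key=lambda e: e[0])
        self.keys = [k for k, _ in entries]
        self.vals = [x for _, x in entries]

    def range_query(self, q, r):
        result = []
        for i, ref in enumerate(self.refs):
            dq = math.dist(ref, q)
            if dq - r > self.radii[i]:
                continue  # query ball does not reach cluster i
            lo = i * self.c + max(0.0, dq - r)
            hi = i * self.c + min(self.radii[i], dq + r)
            a = bisect.bisect_left(self.keys, lo)
            b = bisect.bisect_right(self.keys, hi)
            # Candidates from the interval scan are refined with the real distance.
            result.extend(x for x in self.vals[a:b] if math.dist(q, x) < r)
        return result
```

The interval bounds follow from the triangle inequality: any x with d(q, x) < r satisfies d(K_i, q) − r < d(K_i, x) < d(K_i, q) + r.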
SIMPEER 3-level Clustering Scheme
1. Each peer
   • clusters its own data
   • indexes local points using iDistance
2. Each super-peer
   • receives its peers’ cluster descriptions
   • computes the hyper-clusters using our extension of iDistance
3. Super-peers
   • exchange hyper-clusters
   • build a set of routing clusters
[Figure: super-peer architecture]
iDistance Extension
• Map clusters, not points
• Index the furthest point of each cluster only!
• Each cluster C_j mapped to a 1-d value
• Values indexed in a B+-tree
• Query R(q,r)
  – Search region [ d(O_i,q) − r, r’_i ]
[Figure: hyper-cluster with center O_i and radius r’_i, containing clusters C_1, C_2, C_3 and intersected by the query R(q,r)]
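The following sketch shows how a super-peer could use such an index to pick the peers whose clusters may intersect a range query. It assumes that the key of a cluster C_j = (K_j, r_j) inside hyper-cluster i is i·c + d(O_i, K_j) + r_j, i.e. the distance from O_i to the cluster's furthest possible point; the slide's mapping formula did not survive extraction, so this is an assumption, as are all function and parameter names.

```python
import bisect
import math


def build_hyper_index(hyper_centers, clusters, c):
    """clusters: list of (K_j, r_j, peer_id). Returns (sorted entries, radii r'_i)."""
    radii = [0.0] * len(hyper_centers)
    entries = []
    for K, r_j, peer in clusters:
        i = min(range(len(hyper_centers)),
                key=lambda h: math.dist(hyper_centers[h], K))
        far = math.dist(hyper_centers[i], K) + r_j  # furthest point of C_j from O_i
        radii[i] = max(radii[i], far)               # hyper-cluster radius r'_i
        entries.append((i * c + far, K, r_j, peer))
    entries.sort(key=lambda e: e[0])
    return entries, radii


def peers_for_range_query(hyper_centers, entries, radii, c, q, r):
    """Peers owning at least one cluster that may intersect R(q, r)."""
    keys = [e[0] for e in entries]
    peers = set()
    for i, O in enumerate(hyper_centers):
        dq = math.dist(O, q)
        if dq - r > radii[i]:
            continue  # query does not reach hyper-cluster i
        # Search region [d(O_i, q) - r, r'_i], shifted into interval i.
        a = bisect.bisect_left(keys, i * c + max(0.0, dq - r))
        b = bisect.bisect_right(keys, i * c + radii[i])
        for _, K, r_j, peer in entries[a:b]:
            if math.dist(K, q) <= r_j + r:  # exact cluster/query intersection test
                peers.add(peer)
    return peers
```

The lower bound is safe because a cluster that intersects the query satisfies d(O_i, q) ≤ d(O_i, K_j) + r_j + r, so its key cannot fall below d(O_i, q) − r.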
Peer Query Processing
• Data organization
  – Clustering / Space partitioning
  – iDistance
• LC_p sent to super-peer
  – LC_p = { C_i : (K_i, r_i) }
• Range query processing
  – Scan intervals of B+-tree
[Figure: peer data space with LC_p = {C_1(K_1,r_1), C_2(K_2,r_2), C_3(K_3,r_3)}, the query R(q,r), and the leaf nodes of the B+-tree]
Super-Peer Query Processing
• Super-peer
  – Creates hyper-clusters based on peer clusters
  – Indexes the furthest point of each peer’s cluster
• Range query processing
  – Find peers to forward the query to
  – Peer selection mechanism
[Figure: super-peer data space with LHC_sp = {HC_1(O_1,r’_1), HC_2(O_2,r’_2), HC_3(O_3,r’_3)} and the query R(q,r)]
Routing Indices
• Super-peers broadcast hyper-clusters
• Recipient super-peers
  – Treat hyper-clusters similarly to peer clusters
• Build routing clusters RC_i
  – Used to determine the neighbouring super-peer to forward the query to
  – Super-peer selection mechanism
k-NN Query Processing
• Convert k-NN query to range query R(q,r)
  – Use estimated range r
  – Based on (peer) cluster information at a super-peer → local estimation
  – Based on hyper-cluster information at a super-peer → global estimation
  – No communication required for estimation!
• Maximum 2 round-trips required!
  – If fewer than k objects are retrieved, a second round-trip cannot be avoided
  – Super-peer computes an upper bound for r, based on its peers’ data
• Goal: make a good estimation, such that
  – First round-trip is enough (overestimate r)
  – r is sufficient, but not too large (do not overestimate r too much)
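The two-round-trip logic can be sketched on a single in-memory data set. This is a centralized simulation under stated assumptions: r_est stands for the estimated range, r_upper for the upper bound computed by the super-peer, and both are simply passed in. The function name is hypothetical and no network communication is modelled.

```python
import heapq
import math


def knn_via_range(data, q, k, r_est, r_upper):
    """Answer a k-NN query with at most two range queries.

    Returns (k nearest objects, number of round-trips used).
    """
    rounds = 1
    hits = [u for u in data if math.dist(q, u) < r_est]
    if len(hits) < k:
        # The estimate was too small: retry once with the upper bound.
        rounds = 2
        hits = [u for u in data if math.dist(q, u) < r_upper]
    nearest = heapq.nsmallest(k, hits, key=lambda u: math.dist(q, u))
    return nearest, rounds


data = [(0, 0), (1, 0), (2, 0), (5, 0)]
print(knn_via_range(data, (0, 0), 3, r_est=1.5, r_upper=10.0))
print(knn_via_range(data, (0, 0), 3, r_est=3.0, r_upper=10.0))
```

A slightly overestimated r_est keeps the query to one round-trip, at the price of retrieving more objects than strictly needed.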
Histogram Construction
• Distribution of distances
  – F(r) = Pr{ d(q,p) ≤ r }
• Expected number of objects retrieved by R(q,r)
  – #objs(R(q,r)) = n × F(r)
• Assumption
  – “high” homogeneity of viewpoints inside a cluster [Ciaccia, PODS’98]
  – Approximate F_q with a sampled distance distribution F
[Figure: histogram of cluster i with bins F_i(r_B), F_i(2r_B), ..., F_i(s·r_B) over the distances r_B, 2r_B, ..., s·r_B; for example, F_i(2r_B) is the frequency of distances d ≤ 2r_B]
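A small sketch of how such a cumulative histogram could be built and used to estimate n × F(r). Two choices here are assumptions rather than the authors' stated method: the distances are measured from a single viewpoint `ref` (for instance the cluster center), and a radius that falls between two bin boundaries is rounded down to the last fully covered bin.

```python
import math


def build_histogram(points, ref, r_b, s):
    """Bin j (0-based) stores F((j+1)*r_b): the fraction of points with d <= (j+1)*r_b."""
    dists = [math.dist(ref, p) for p in points]
    n = len(dists)
    return [sum(1 for d in dists if d <= (j + 1) * r_b) / n for j in range(s)]


def cdf(hist, r_b, r):
    """Approximate F(r) by the last bin boundary not exceeding r (assumption)."""
    j = min(len(hist), int(r // r_b))
    return hist[j - 1] if j > 0 else 0.0


def expected_objects(hist, n, r_b, r):
    """#objs(R(q, r)) = n x F(r)."""
    return n * cdf(hist, r_b, r)


points = [(1, 0), (2, 0), (3, 0), (4, 0)]
hist = build_histogram(points, (0, 0), r_b=1.0, s=4)
print(hist)
print(expected_objects(hist, len(points), 1.0, 2.0))
```

Rounding down underestimates the number of retrieved objects, which pushes a k-NN range estimate towards larger r.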
Local Estimation (LE)
• Case 1: the query R(q,r) lies entirely inside cluster C_i
  – Condition: d(K_i,q) + r ≤ r_i
  – Estimated #objects: n_i × F_i(r)
• Case 2: the query R(q,r) is not fully contained in cluster C_i
  – Condition: d(K_i,q) + r > r_i
  – Estimated #objects: n_i × F_i(r’), where r’ = (r_i + r − d(K_i,q)) / 2
• Binary search on [0, s·r_B] to find the smallest r for which the estimated number of objects ≥ k
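The estimation can be sketched as a bisection over r. The cluster summaries are given as dictionaries with hypothetical keys ("n", "hist", "r", "d" for n_i, the cumulative histogram F_i, r_i and d(K_i,q)). The sketch reads r’ as (r_i + r − d(K_i,q)) / 2, which makes the two cases agree at d(K_i,q) + r = r_i, and it rounds radii down to the last bin boundary. All of these are assumptions.

```python
def cdf(hist, r_b, r):
    """Cumulative histogram lookup; bin j holds F((j+1)*r_b)."""
    j = min(len(hist), int(r // r_b))
    return hist[j - 1] if j > 0 else 0.0


def estimate_cluster(n_i, hist, r_b, r_i, d_kq, r):
    """Expected objects of one cluster inside R(q, r), following the two cases."""
    if d_kq + r <= r_i:                       # case 1: query inside the cluster
        return n_i * cdf(hist, r_b, r)
    r_eff = (r_i + r - d_kq) / 2              # case 2: reduced radius r'
    return n_i * cdf(hist, r_b, r_eff) if r_eff > 0 else 0.0


def estimate_knn_radius(clusters, r_b, s, k, iters=40):
    """Smallest r in [0, s*r_b] whose estimated result size is >= k."""
    def total(r):
        return sum(estimate_cluster(c["n"], c["hist"], r_b, c["r"], c["d"], r)
                   for c in clusters)

    lo, hi = 0.0, s * r_b
    if total(hi) < k:
        return hi                             # even the largest range is not enough
    for _ in range(iters):                    # bisection; total(r) is non-decreasing
        mid = (lo + hi) / 2
        if total(mid) >= k:
            hi = mid
        else:
            lo = mid
    return hi


cluster = {"n": 100, "hist": [0.1, 0.4, 0.8, 1.0], "r": 4.0, "d": 0.0}
print(estimate_knn_radius([cluster], r_b=1.0, s=4, k=30))
```

Bisection is valid here because the estimate never decreases as r grows: F_i is non-decreasing, and r’ increases with r.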
Global Estimation (GE)
• Hyper-clusters enhanced with 2 histograms:
  – (hc_i) Number of clusters intersecting the query (nc_i)
    • Distance distribution of clusters within a hyper-cluster
  – (hd_i) Number of data objects contained in the intersection (nd_i)
    • Superimpose cluster histograms, by keeping the minimum value of each bin
    • Also keep the minimum cardinality of all clusters
• Estimated #objects: nc_i(r) × nd_i(r)
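The superimposition step can be sketched as follows. The slide only states that the minimum of each bin and the minimum cardinality are kept; reading nd_i(r) as (minimum cardinality) × (superimposed histogram at r), and passing nc_i(r) in as a plain number, are assumptions made for this illustration, as are the function names.

```python
def superimpose(histograms):
    """Bin-wise minimum over the cumulative histograms of all clusters."""
    return [min(bins) for bins in zip(*histograms)]


def build_hd(clusters):
    """clusters: list of (cumulative histogram, cardinality) pairs."""
    hd = superimpose([hist for hist, _ in clusters])
    n_min = min(n for _, n in clusters)
    return hd, n_min


def estimate_objects(nc, hd, n_min, r_b, r):
    """nc_i(r) x nd_i(r), with nd_i(r) taken as n_min * hd(r) (assumption)."""
    j = min(len(hd), int(r // r_b))
    nd = n_min * hd[j - 1] if j > 0 else 0.0
    return nc * nd


hd, n_min = build_hd([([0.2, 0.6, 1.0], 50), ([0.1, 0.7, 1.0], 80)])
print(hd, n_min)
print(estimate_objects(nc=2, hd=hd, n_min=n_min, r_b=1.0, r=2.0))
```

Taking minima makes the per-cluster object count a lower bound, so the range found for a given k tends to be overestimated rather than underestimated.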
Experimental Setup
• GT-ITM topology generator (4K-16K peers)
• #Super-peers = {200, 400}
• DEG_sp = 4-7
• DEG_p = 20-60
• k_p = 10
• Synthetic {uniform, clustered} datasets
  – 8-32d, 3M-12M objects
• Real datasets
  – VEC: 1M 45-dim vectors of color image features
  – CovType: 581K 54-dim instances of forest Covertype data
Construction Cost
• Mainly depends on super-peer topology
• One-time cost!
• Approx. 1.5MB per super-peer
[Figure: total construction cost (MB) vs. DEGsp (4 to 7), for Nsp=200 and Nsp=400]
Range Queries – Response Time
• Setup: N_sp=200, N_p=2000, n=1M, d=16; network transfer rate 4KB/sec
• Response time increases only slightly with cardinality
• Higher response time in the clustered dataset
• Most results come from the same network paths, causing delays
[Figure: response time (sec) vs. cardinality (x10^6, from 3 to 12), for Uniform and Clustered datasets with k=60 and k=120]
Range Queries – Success Ratio
• Clustered dataset (N_sp=200, N_p=2000, n=1M)
• Success ratio = the share of contacted peers (super-peers) that returned results
[Figure: success ratio (0 to 100) vs. query selectivity (x10^-5, from 2 down to 0.33), for super-peers (SP) and peers (P) with d=8 and d=32]
k-NN Queries – Overestimation (%)
• VEC dataset (N_sp=200, N_p=2000)
  – LE better (initially)
  – GE becomes better with increasing k_sp
[Figure: overestimation (%) of LE/RE and GE/RE, for k=50 and k=100 with ksp=5, 10, 15]
Conclusions & Further Work
• SIMPEER
  – A metric-based framework for P2P similarity search
  – Utilizes a three-level clustering scheme
• Support for range and k-NN query processing
• Distributed statistics
• Further work
  – Extension for non-vector-based data representations
  – Devise an approach that deals with uniform data distributions in a better way
Thank you for your attention!
More info: http://www.db-net.aueb.gr/cdoulk/ cdoulk@aueb.gr