  1. Keyword Searching in Hypercubic Manifolds. Yu-En Lu, Steven Hand, Pietro Lio, University of Cambridge Computer Laboratory

  2. Motivation
  • Unstructured P2P networks such as Gnutella evaluate complex queries by flooding the network, while nothing can be guaranteed
  • Distributed Hash Tables evaluate only simple queries due to hashing, whilst guarantees are provided (at least theoretically)
  • What if we could cluster similar objects in similar regions of the network via hashing alone?
    • No preprocessing is needed
    • No global knowledge required, only the hash
    • Plug & play on top of current DHT designs

  3. Types of Queries
  • Exact query: K1 ∧ K2 ∧ K3 (DHT systems), e.g. "Harry Potter V.mpg"
  • Range query: K1 ∧ K2 ∧ K3 ∧ K4 ≥ 128 (PHT, Mercury, P-Grid etc.), e.g. "Harry Potter V.mpg AND bit-rate > 128kbps"
  • Partial match query: K1 ∧ K2 ∨ K3 ∨ K4 (Qube), e.g. "Harry Potter [III,IV].mpg"
  • Flawed query: K(i+1) ∧ K(j−20) (Qube), e.g. "Hary Porter.mpg"
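
  A minimal sketch (not from the slides) of how these query classes can be evaluated over keyword sets; the function names and the example object are invented for illustration:

    # Hypothetical sketch: evaluating query classes over keyword sets.
    def matches_exact(obj_keywords, required):
        # Exact query: K1 ∧ K2 ∧ K3, every keyword must be present.
        return required <= obj_keywords

    def matches_partial(obj_keywords, required, optional):
        # Partial-match query, e.g. K1 ∧ K2 ∧ (K3 ∨ K4).
        return required <= obj_keywords and bool(optional & obj_keywords)

    obj = {"harry", "potter", "iv", "mpg"}
    print(matches_exact(obj, {"harry", "potter", "v"}))              # False
    print(matches_partial(obj, {"harry", "potter"}, {"iii", "iv"}))  # True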

  4. A Projection of Object Feature Space
  [Figure: objects projected into a feature space, with clusters such as "Movie" and "French Cuisine"; possible answers to the query "Harry" lie near one another]

  5. A Qube View of the Mappings: Features → Overlay → Nodes
  [Figure: objects such as "Bon Jovi" and "Harry Potter" mapped to vertices of an abstract graph, which in turn map to nodes of the network topology]
  • Each object is represented as a bit-string, where 1 denotes that it contains a keyword and 0 that it does not
  • Each bit-string is then hashed onto the P2P name space
  • The nodes in the network choose positions in the P2P space randomly and link with each other in some overlay topology; in our case, a hypercube is used
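
  To make the mapping concrete, here is a hedged sketch of the object-to-bit-string encoding described above; the vocabulary and its ordering are invented for illustration, and the slides do not specify the actual encoding:

    # Each object becomes a bit-string over a fixed keyword vocabulary:
    # 1 = the object contains the keyword, 0 = it does not.
    VOCAB = ["harry", "potter", "movie", "music", "bon", "jovi"]

    def feature_bits(keywords):
        return "".join("1" if w in keywords else "0" for w in VOCAB)

    print(feature_bits({"harry", "potter", "movie"}))  # "111000"
    # This bit-string is then hashed onto the P2P name space.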

  6. Design Principles and Fundamental Trade-offs
  • Latency vs. message complexity
    • Low message complexity usually means low latency
    • Not entirely true for DHT systems
  • Fairness vs. performance
    • Sending everything to a handful of ultra-peers is fast and simple
    • Having things spread across the network means a fairer system (and perhaps better availability)
  • Storage vs. synchronisation complexity
    • The most popular queries may be processed by querying one random node, thanks to generous replication/caching
    • For some applications, such as a distributed inverted index, frequent synchronisation is costly

  7. Hashing and Network Topology
  [Figure: keyword sets such as "Harry Potter", "Harry Potter music", and "Harry Potter movie 7" encoded as bit-strings and placed on the hypercube]
  • Summary hash h : {0,1}* → {0,1}^b
  • Non-expansive: d(h(x), h(y)) ≤ d(x, y)
  • Fair partitioning: |h⁻¹(u)| = |h⁻¹(v)|
  • Keyword edges link nodes within one-word distance
  • Similar objects are located in manifolds of the hypercube
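
  One classic construction with both properties is bit sampling: keep b fixed coordinates of the input. Flipping one input bit flips at most one output bit, so d(h(x), h(y)) ≤ d(x, y), and for fixed-length inputs every output bucket has an equal-sized preimage. The sketch below is an illustrative stand-in, not necessarily the hash used in Qube:

    # Summary hash by bit sampling: h(x) keeps b fixed positions of x.
    COORDS = [0, 3, 5, 9]  # assumed sampled positions, b = 4

    def summary_hash(x):
        return "".join(x[i] for i in COORDS)

    def hamming(a, b):
        return sum(c1 != c2 for c1, c2 in zip(a, b))

    x, y = "0001010010", "0101010011"
    # Non-expansive: hashed distance never exceeds the original distance.
    assert hamming(summary_hash(x), summary_hash(y)) <= hamming(x, y)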

  8. Query Processing
  [Figure: a 4-dimensional hypercube with nodes labelled 0000-1111; a query for "Harry Potter Music" is routed toward the region holding "Harry Potter" and "Harry Potter VII"]
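
  The forwarding step in the figure can be illustrated with standard greedy bit-fixing on a hypercube: each hop corrects one differing dimension, so a query reaches any b-bit target in at most b hops. This is a generic sketch; Qube's actual routing rule may differ:

    def route(src, dst):
        # Greedy bit-fixing: correct one differing coordinate per hop.
        cur = list(src)
        hops = ["".join(cur)]
        for i in range(len(cur)):
            if cur[i] != dst[i]:
                cur[i] = dst[i]
                hops.append("".join(cur))
        return hops

    print(route("0010", "1011"))  # ['0010', '1010', '1011']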

  9. Experimental Setup
  • A hypercube DHT is instantiated where end-to-end distances are drawn from the King dataset*
    • The King dataset contains latency measurements between a set of DNS servers
  • Surrogates for a logical ID are chosen using a Plaxton-style post-fix matching scheme
  • Nodes choose DHT IDs randomly
    • No network proximity, so as to expose worst-case performance and the trade-off with dimensionality and caching
  • A sample of FreeDB**, a free online CD album database containing 20 million songs, is used to reflect actual objects in the real world
  • Gnutella query traces served as our query samples
  * http://pdos.csail.mit.edu/p2psim/kingdata/
  ** http://www.freedb.org
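
  A hedged sketch of the surrogate selection mentioned above: among live node IDs, pick the one sharing the longest post-fix (suffix) with the logical ID. Tie-breaking and other details of the simulator's rule are not given in the slides:

    def postfix_len(a, b):
        # Length of the common suffix of two ID strings.
        n = 0
        while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
            n += 1
        return n

    def surrogate(logical_id, node_ids):
        return max(node_ids, key=lambda nid: postfix_len(nid, logical_id))

    print(surrogate("1011", ["0001", "0111", "1100"]))  # '0111' (shares suffix "11")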

  10. Retrieval Cost
  * Recall rate = percentage of relevant objects found
  ** Legend tuples denote (b, n), where b is the dimensionality and n is the size of the network
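
  For reference, this is the standard information-retrieval definition: recall = |relevant ∩ retrieved| / |relevant|. As an illustrative example (not a result from the slides), finding 8 of 10 relevant songs gives a recall rate of 80%.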

  11. Query Latency in Wide Area
  The selection of b controls the degree of clustering

  12. Network Performance
  * This result uses the query "bon jovi" as an example; there are 3,242 distinct related songs in FreeDB

  13. Conclusion
  • Qube spreads objects across the network according to their similarity
    • Better fairness and availability
    • Zero preprocessing
    • Little synchronisation needed
  • By tuning the parameter b, one may choose the degree of performance/fairness trade-off
  • We are further investigating lower-latency schemes to trim probing cost and to decouple query accuracy from network size

  14. Future Work
  • Large-scale simulation (>100K nodes with a realistic network latency generator)
  • Flash-crowd query models and replication/caching
  • Distributed proximity searches such as kNN under the Euclidean metric

  15. Thank you! Yu-En.Lu@cl.cam.ac.uk
