Project AutoMate
Squid: Decentralized Discovery Service

C. Schmidt, The AutoMate Group
The Applied Software Systems Laboratory
Rutgers, The State University of New Jersey
http://automate.rutgers.edu
CAIP Autonomic Computing Workshop, June 2003

Outline
• Introduction
• Related Work
• Design
• Evaluation
• Ongoing work

Motivation
• The need for information discovery in large, decentralized, distributed resource-sharing environments, in the absence of global knowledge of naming conventions
• Examples:
  – P2P document sharing systems
  – Grid resource discovery
  – Web service discovery
  – Collaboration

Overview
• Squid is a Peer-to-Peer (P2P) indexing and information discovery system
• Supports decentralized information discovery in AutoMate
• Supports complex queries containing partial keywords, wildcards and range queries
• Guarantees that all existing data elements matching a query will be found, with bounded cost in terms of the number of messages and nodes involved

Related Work
Information discovery in P2P systems
• Unstructured (Gnutella-like)
  – Unstructured overlay network, search by flooding
• Hybrid (Napster)
  – Unstructured overlay network, centralized directories for search
• Data lookup (CAN, Chord, Pastry, etc.)
  – Structured overlay, Internet-scale DHT
• Structured keyword search
  – Structured overlay, extends data-lookup protocols
  – Examples:
    • Distributed inverted indices
    • Space-filling curves

Design - Overview
[Figure: Document (kw1, kw2, …, kwD) → D-dimensional keyword space → SFC → 1-dimensional index space → Peers (P1, P2, …, Pk, …)]

The keyword space
• Documents have assigned keywords
[Figure: 2-dimensional keyword space for a P2P sharing system, with a document located by the keywords "Computer" and "Network"]
[Figure: 3-dimensional keyword space for storing computational resources, using the attributes storage space, base bandwidth and cost; a computational resource is a point in this space]

Hilbert Space-Filling Curve (SFC)
• f: N^d → N, recursive generation
[Figure: first-order Hilbert curve (2-bit indices 00 to 11) and second-order Hilbert curve (4-bit indices 0000 to 1111) on a 2-dimensional grid with binary axis coordinates]
• Properties:
  – Digital causality
  – Locality preserving
  – Clustering
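
The mapping f: N^d → N can be made concrete for d = 2 with the widely used iterative conversion between grid coordinates and Hilbert index. This is a minimal sketch for illustration, not the Squid implementation: the names `rot`, `xy2d` and `d2xy`, the grid size, and the particular orientation of the curve are assumptions and need not match the figure on the slide.

```python
def rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so that sub-curves connect end to end."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x  # swap the axes
    return x, y


def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)  # which quadrant, in curve order
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d


def d2xy(n, d):
    """Inverse mapping: cell (x, y) that has Hilbert index d."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


if __name__ == "__main__":
    # Walk the second-order curve (4 x 4 grid) in index order.
    print([d2xy(4, d) for d in range(16)])
```

Consecutive indices always land in neighboring cells, which is the locality-preserving property listed above: points that are close on the curve are close in the keyword space.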

Using SFC to generate the index space
• The d-dimensional keyword space is mapped to a 1-dimensional index space using the SFC
[Figure: a document with the keywords "Computer" and "Network" mapped onto the curve]

The overlay network
• Use Chord as the overlay network
[Figure: overlay network with 5 nodes (0, 13, 29, 40, 51) and an identifier space from 0 to 63 (2^6 identifiers)]
• Cost to look up data: O(log_2 N)
• Each node stores the keys that map to the segment of the curve between itself and the predecessor node

The Query Engine
• Query: a combination of keywords, partial keywords, wildcards and ranges
• Examples:
  – (computer, network)
  – (computer, net*)
  – (comp*, *)
  – (256-512MB, *, 10Mbps-*) for the attributes (memory, cost, base bandwidth)
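
The query forms listed for the query engine can be illustrated with a small matcher that decides whether a data element's keyword tuple satisfies a query. This is a sketch under assumptions, not Squid's query engine: the function names are hypothetical, partial keywords are treated as prefix matches, and ranges are written as numeric `(low, high)` tuples with `None` for an open end instead of the unit-suffixed strings on the slide.

```python
def term_matches(term, value):
    """Match one query term against one keyword or attribute value."""
    if term == "*":                      # wildcard: any value
        return True
    if isinstance(term, tuple):          # range (low, high), None = open end
        low, high = term
        return (low is None or value >= low) and (high is None or value <= high)
    if isinstance(term, str) and term.endswith("*"):   # partial keyword
        return isinstance(value, str) and value.startswith(term[:-1])
    return term == value                 # complete keyword


def matches(query, keywords):
    """True if every dimension of the query matches the element."""
    return len(query) == len(keywords) and all(
        term_matches(t, v) for t, v in zip(query, keywords)
    )


if __name__ == "__main__":
    print(matches(("computer", "net*"), ("computer", "network")))
    # (memory in MB, cost, base bandwidth in Mbps), all values hypothetical
    print(matches(((256, 512), "*", (10, None)), (384, 5, 100)))
```

In the system described here a query is not checked element by element; it is translated into regions of the keyword space. The matcher only pins down which elements such a region is expected to contain.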

Query Processing
• Step 1: Translate the query into the relevant clusters on the SFC-based index space
  – Query, e.g. (computer, *)
• Step 2: Query the appropriate nodes in the overlay
  – In the example, query the nodes 13 and 29
[Figure: the clusters of the query on the curve, and the overlay ring with nodes 0, 13, 29, 40, 51]

Query optimization
• Not all clusters that are generated for a query exist in the network, so the query can be optimized
• SFC generation is recursive, so cluster generation is recursive, and the process of cluster generation can be viewed as a tree
• Optimization: embed the tree into the overlay, and prune tree nodes during the construction phase

Query optimization – illustration
• Solve the query (011, *)
[Figure: recursive refinement of the query (011, *) on the Hilbert curve at successive levels, and the resulting cluster tree whose nodes are labeled with index prefixes]
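
The two query-processing steps and the tree view of cluster generation can be sketched as follows for a 2-dimensional space. This is an illustration under several assumptions, not the Squid algorithm itself: keywords are taken to be already mapped to integer grid coordinates, the names `clusters`, `refine` and `nodes_for` are hypothetical, the refinement runs in one process instead of being embedded into the overlay, and it prunes only sub-regions that cannot match the query (pruning of clusters that hold no data in the network is not modeled). The ring of nodes 0, 13, 29, 40, 51 is the one from the overlay slide.

```python
from bisect import bisect_left


def xy2d(n, x, y):
    """Hilbert index of cell (x, y) on an n x n grid (n a power of two)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def clusters(n, xr, yr):
    """Step 1: index intervals (lo, hi) covering the query box.

    xr and yr are inclusive (low, high) coordinate ranges.
    """
    found = []

    def refine(x0, y0, size):
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < xr[0] or x0 > xr[1] or y1 < yr[0] or y0 > yr[1]:
            return  # prune: this subtree cannot contain a match
        if xr[0] <= x0 and x1 <= xr[1] and yr[0] <= y0 and y1 <= yr[1]:
            # An aligned block is one contiguous run of size*size indices.
            block = size * size
            base = xy2d(n, x0, y0) // block * block
            found.append((base, base + block - 1))
            return
        half = size // 2
        for dx in (0, half):
            for dy in (0, half):
                refine(x0 + dx, y0 + dy, half)

    refine(0, 0, n)
    found.sort()
    merged = []
    for lo, hi in found:  # join runs that touch into one cluster
        if merged and lo == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged


def nodes_for(cluster_list, node_ids):
    """Step 2: nodes whose curve segment overlaps any cluster."""
    ring = sorted(node_ids)

    def successor(key):  # first node at or after key, wrapping around
        return ring[bisect_left(ring, key) % len(ring)]

    # Enumerating every key is fine for a sketch on a 64-entry index space.
    return sorted({successor(k) for lo, hi in cluster_list
                   for k in range(lo, hi + 1)})


if __name__ == "__main__":
    cl = clusters(8, (3, 3), (0, 7))  # the query (011, *) on an 8 x 8 grid
    print(cl)
    print(nodes_for(cl, [0, 13, 29, 40, 51]))
```

Each call to `refine` corresponds to one node of the cluster tree; returning early on a non-intersecting block is what removes a whole subtree from further refinement.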

Query optimization – illustration (continued)
[Figure: the cluster tree for the query (011, *) mapped onto the overlay; the 6-bit identifiers shown are 000000, 000100, 001001, 001111, 011110 and 111000]
• Embed the leftmost tree path (solid arrows) and the rightmost path (dashed arrows) onto the overlay network topology

Load balancing
• Load balancing at node join:
  – Generate more than one ID for the new node, send join requests into the network, and join with the ID that places the node in the most crowded part of the network
• Load balancing at runtime:
  – Periodically run a local load-balancing algorithm between neighbors and redistribute the load
  – Use virtual nodes that can migrate to less loaded physical nodes
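
The join-time load balancing described in the Load balancing slide can be sketched as a choice among candidate IDs. This is one possible reading, stated as an assumption: "most crowded" is interpreted as the position where the joining node would take over the most keys, and the function names, the key list and the candidate IDs are hypothetical. A real node would learn these counts from the replies to its join requests rather than from a global key list.

```python
from bisect import bisect_left


def keys_taken(ring, keys, candidate):
    """Keys a node joining with ID `candidate` would take over.

    `ring` is the sorted list of current node IDs. The new node would
    store the keys in (predecessor, candidate].
    """
    i = bisect_left(ring, candidate)
    pred = ring[i - 1]  # i == 0 wraps around to the largest ID
    if pred < candidate:
        return sum(1 for k in keys if pred < k <= candidate)
    # The segment wraps past zero.
    return sum(1 for k in keys if k > pred or k <= candidate)


def choose_join_id(node_ids, keys, candidates):
    """Pick the candidate ID that relieves the most loaded region."""
    ring = sorted(node_ids)
    return max(candidates, key=lambda c: keys_taken(ring, keys, c))


if __name__ == "__main__":
    nodes = [0, 13, 29, 40, 51]
    stored_keys = [1, 2, 3, 4, 5, 6, 30, 45]  # hypothetical key indices
    print(choose_join_id(nodes, stored_keys, candidates=[7, 35, 60]))
```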

Experimental evaluation
• 1000 to 5400 nodes
• Up to 10^6 keys (unique keyword combinations)
• Metrics:
  – Number of routing nodes
  – Number of processing nodes
  – Number of data nodes
  – Number of messages
• Query types:
  – Q1: (computer, *), (comp*, *, *)
  – Q2: (comp*, net*), (computer, network, *)
  – Q3: range queries

2D keyword space – Q1 and Q2 queries
• System size increases from 1000 to 5400 nodes, keys from 2×10^5 to 10^6

3D keyword space – Q1 and Q2 queries

3D keyword space – range queries

Load balancing
[Figure: the distribution of the keys in the index space. The index space was partitioned into 5000 intervals. X-axis: the index space (intervals); Y-axis: number of keys per interval.]
[Figure: the distribution of the keys when using only the load balancing at node join technique. X-axis: nodes in the system; Y-axis: number of keys.]
[Figure: the distribution of the keys when using both the load balancing at node join technique and the local load balancing. X-axis: nodes in the system; Y-axis: number of keys.]

Ongoing work
• Tests with a 5-dimensional keyword space
• Develop new methods to further prune the clusters that do not exist in the network
• Implement the actual system on top of the Chord lookup system

Future work
• Ranking
• New overlay topology
• Replication and caching

References
• [1] T. Bially. A class of dimension changing mapping and its application to bandwidth compression. Ph.D. thesis, Polytechnic Institute of Brooklyn, June 1967.
• [2] I. Stoica, R. Morris, D. Karger, F. Kaashoek and H. Balakrishnan. Chord: A Scalable Peer-To-Peer Lookup Service for Internet Applications. In Proceedings of ACM SIGCOMM, 2001.
• [3] C. Schmidt and M. Parashar. Flexible Information Discovery in Decentralized Distributed Systems. In Proceedings of IEEE High Performance Distributed Computing, June 2003.
• [4] M. Agarwal et al. AutoMate: Enabling Autonomic Applications on the Grid. In Proceedings of the Autonomic Computing Workshop, June 2003.