A Peer-to-Peer Inverted Index Implementation for Word-based Content Search Nuno Lopes University of Minho October 2003
P2P System Characterization • Scalable up to millions of nodes • Highly dynamic node membership • Reduced node uptime: 1 hour on average • No centralized authority � 2003 Nuno Lopes c 1 SDDI 2003
1st Generation of P2P Systems File Sharing Oriented • Napster Centralized search with p2p file download ⇒ Single point-of-failure • Gnutella Broadcast based search ⇒ Network overloaded � 2003 Nuno Lopes c 2 SDDI 2003
Searching Model • Local model Individual peer search Examples: Gnutella, Pedone’02 • Global model Information is placed on a global (distributed) shared index � 2003 Nuno Lopes c 3 SDDI 2003
2nd Generation of P2P Systems Distributed Hash Table (DHT) Based • Examples: Chord, Pastry, others... • Simple hash table operations on ( key , value ) pairs • Efficient routing: O (log N ) hops for any peer • Scalable state information: O (log N ) routing entries per peer • But... incapable of searching � 2003 Nuno Lopes c 4 SDDI 2003
Inverted Index Description • Association word �→ { document location } SET • Document Location Set is highly dynamic • Follows Zipf distribution 700 600 500 # Documents 400 300 200 100 0 0 5000 10000 15000 20000 25000 30000 35000 Words � 2003 Nuno Lopes c 5 SDDI 2003
Inverted Index API • INSERT( word , reference ) • REMOVE( word , reference ) • HAS REF( word , reference ): bool • GET REF( word ): reference • NEXT REF( word , reference ): reference � 2003 Nuno Lopes c 6 SDDI 2003
Inverted Index Implementation Index is splited in constant size blocks, accessed through 2 layers: • DHT as base platform for block-oriented storage ⇒ Unsuitable as a stand-alone implementation • B+ tree for block management Responsible for the set implementation to each word � 2003 Nuno Lopes c 7 SDDI 2003
Current Simulation Settings • Only the B+ tree layer is simulated • Peers store a single block each • Messages have an atomic cost • Single client requests index operations on the system • Data consists on 1000 small documents with 36499 unique words � 2003 Nuno Lopes c 8 SDDI 2003
Initial Simulation Results • B+ trees make the storage load uniform across peers • However... root blocks for popular words have high network load 800 700 600 500 Access rate 400 300 200 100 0 0 10000 20000 30000 40000 50000 60000 Blocks � 2003 Nuno Lopes c 9 SDDI 2003
Caching Mechanism • Clients have high probability of requesting the same blocks for popular words • Caching of (non-leaf) blocks reduces the number of accesses • In order to avoid stale copies, leaf blocks are never cached • Higher level blocks are less probable to become modified and therefore stale � 2003 Nuno Lopes c 10 SDDI 2003
Simulation Results (Using Cache) • The use of a cache mechanism (LRU) distributes more evenly the network load on peers • Access rates were reduced by a factor of 10 60 50 40 Access rate 30 20 10 0 0 10000 20000 30000 40000 50000 60000 Blocks � 2003 Nuno Lopes c 11 SDDI 2003
Open Questions • Measurement of DHT as stand-alone implementation of inverted index • Analysis of the block caching mechanism to determine the best cache size for different numbers of peers on the system • Implementation of multiple blocks to peer association for studying effective peer load • AND and OR search operators implementation and load measurement � 2003 Nuno Lopes c 12 SDDI 2003
Recommend
More recommend