Arpeggio: Metadata Searching and Content Sharing with Chord

Austin T. Clements, Dan R. K. Ports, David R. Karger
{aclements, drkp, karger}@mit.edu
MIT Computer Science and Artificial Intelligence Laboratory

Outline
- Overview
- Searching
- Index Gateways
- Content Distribution
- Conclusion

The Content-Sharing Problem
Goals
- Find files matching a search query
- Identify sources for a file
- Want full decentralization
Assumptions
- Only searching metadata
- Metadata is small (compared to actual data)
- Highly dynamic and unstable network topology, content, and sources

DHTs (Almost) to the Rescue
- Great for lookup-by-name
- Insufficient for efficient search-by-content
- Powerful underlying LOOKUP abstraction

Index Entries
Keywords: (debian, disk1, iso)
  name     Debian Disk1.iso
  file ID  cdb79ca3db1f39b1940ed5...
  size     586MB
  type     application/x-iso9660-image
  ...
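
Distributed Inverted Indexing
Files (keywords, then metadata block):
  (debian, disk1, iso)    name Debian Disk1.iso    file ID cdb79ca3db1f3...
  (debian, disk2, iso)    name Debian Disk2.iso    file ID 5ccbf54d7e502...
  (disk1, freebsd, iso)   name Freebsd Disk1.iso   file ID fbcbfdff31f27de...
Keyword indexes: debian, disk1, disk2, freebsd, iso
Query "debian disk1?": intersect the debian index with the disk1 index
  {Debian Disk1.iso, Debian Disk2.iso} ∩ {Debian Disk1.iso, Freebsd Disk1.iso}
Problem: Network hosage (whole index lists must cross the network to be intersected)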
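
The per-keyword scheme on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not Arpeggio's implementation: the `index` dictionary stands in for index nodes that would live on different DHT peers, and the file IDs and the `insert` and `search` names are made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for per-keyword index nodes spread across the DHT.
index = defaultdict(set)  # keyword -> set of file IDs

def insert(file_id, keywords):
    """Add the file to the index of each of its keywords."""
    for kw in keywords:
        index[kw].add(file_id)

def search(query_keywords):
    """Fetch every keyword's full posting set and intersect them.

    In a real deployment each posting set would be shipped over the
    network, which is the cost the slide complains about.
    """
    postings = [index.get(kw, set()) for kw in query_keywords]
    return set.intersection(*postings) if postings else set()

insert("debian-disk1", ["debian", "disk1", "iso"])
insert("debian-disk2", ["debian", "disk2", "iso"])
insert("freebsd-disk1", ["disk1", "freebsd", "iso"])

print(search(["debian", "disk1"]))
```

The query touches two posting sets of two entries each to return a single result; with popular keywords the sets being transferred grow much faster than the answer.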

Index-Side Filtering
- Keywords are small, so store keywords in index
- Pick one index node
- Send full query
- Index node performs filtering and returns only relevant results
- Can also include other filterable metadata, e.g. file size, MP3 bitrate, etc.

Index-Side Filtering
Files (keywords, then metadata block):
  (debian, disk1, iso)    name Debian Disk1.iso    file ID cdb79ca3db1f3...
  (debian, disk2, iso)    name Debian Disk2.iso    file ID 5ccbf54d7e502...
  (disk1, freebsd, iso)   name Freebsd Disk1.iso   file ID fbcbfdff31f27de...
Keyword indexes: debian, disk1, disk2, freebsd, iso
Query "debian disk1?": sent in full to one index node, which filters and returns
  {Debian Disk1.iso}
Problem: Poor query load-balancing
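
A minimal sketch of index-side filtering, assuming each index entry carries the file's full keyword set so a single node can answer the whole query. The `index_nodes` dictionary, the file IDs, and the rule for picking a node (alphabetically first query keyword) are assumptions made for this example, not details from the slides.

```python
from collections import defaultdict

# Hypothetical stand-in: keyword -> list of (file ID, full keyword set).
index_nodes = defaultdict(list)

def insert_entry(file_id, keywords):
    """Store the entry, including all of its keywords, under each keyword."""
    entry = (file_id, frozenset(keywords))
    for kw in entry[1]:
        index_nodes[kw].append(entry)

def filtered_search(query_keywords):
    """Send the full query to one index node, which filters locally."""
    query = frozenset(query_keywords)
    if not query:
        return set()
    node = sorted(query)[0]  # assumed choice: any single query keyword would do
    return {fid for fid, kws in index_nodes.get(node, []) if query <= kws}

insert_entry("debian-disk1", ["debian", "disk1", "iso"])
insert_entry("debian-disk2", ["debian", "disk2", "iso"])
insert_entry("freebsd-disk1", ["disk1", "freebsd", "iso"])

print(filtered_search(["debian", "disk1"]))
```

Only the matching entries leave the index node, but every query containing a popular keyword can still land on that keyword's node, which is the load-balancing problem the slide names.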

Keyword-Set Indexing
- Build index on keyword sets rather than keywords
- Store subsets of size ≤ K
- More keyword-set indexes, but each is shorter
- Single-keyword indexes are less important, so can be truncated
  - < 29% of web searches have only 1 keyword. [Reynolds & Vahdat 2003]
- To search: send filtered query to any K-size subset index

Keyword-Set Indexing (K = 2)
Files (keywords, then metadata block):
  (debian, disk1, iso)    name Debian Disk1.iso    file ID cdb79ca3db1f3...
  (debian, disk2, iso)    name Debian Disk2.iso    file ID 5ccbf54d7e502...
  (disk1, freebsd, iso)   name Freebsd Disk1.iso   file ID fbcbfdff31f27de...
Keyword-set indexes:
  debian, disk1, iso, debian disk1, debian iso, disk1 iso,
  disk2, debian disk2, disk2 iso, ...
Query: "debian disk1?"
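
The K = 2 example can be sketched as follows. This is an illustration under stated assumptions rather than Arpeggio's code: `keyword_set_index` is an in-memory stand-in for the DHT, the file IDs are invented, and the query is routed to the alphabetically first K-subset, although the slides allow any K-size subset.

```python
from collections import defaultdict
from itertools import combinations

K = 2  # maximum keyword-subset size, as in the slide's example

# Hypothetical stand-in: keyword subset -> list of (file ID, full keyword set).
keyword_set_index = defaultdict(list)

def index_keys(keywords, k=K):
    """All non-empty keyword subsets of size at most k."""
    kws = sorted(set(keywords))
    return [frozenset(c)
            for size in range(1, min(k, len(kws)) + 1)
            for c in combinations(kws, size)]

def insert_entry(file_id, keywords):
    """Store the entry under every keyword subset of size at most K."""
    entry = (file_id, frozenset(keywords))
    for key in index_keys(keywords):
        keyword_set_index[key].append(entry)

def search(query_keywords, k=K):
    """Route the query to one subset index, which filters on the full query."""
    query = frozenset(query_keywords)
    if not query:
        return set()
    key = frozenset(sorted(query)[:k])  # assumed choice of subset
    return {fid for fid, kws in keyword_set_index.get(key, []) if query <= kws}

insert_entry("debian-disk1", ["debian", "disk1", "iso"])
insert_entry("debian-disk2", ["debian", "disk2", "iso"])
insert_entry("freebsd-disk1", ["disk1", "freebsd", "iso"])

print(len(index_keys(["debian", "disk1", "iso"])))  # subsets for one file
print(search(["debian", "disk1"]))
```

Because a three-keyword query can be answered by any of its three 2-keyword subset indexes, a client that picks among them at random spreads load that would otherwise concentrate on single-keyword nodes.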

Indexing Cost
m = number of metadata keywords
K = maximum subset size parameter
I(m) = number of index entries

\[
I(m) =
\begin{cases}
2^m - 1 & \text{if } m \le K \\
\sum_{i=1}^{K} \binom{m}{i} & \text{if } m > K
\end{cases}
\;=\; O(m^K)
\]

[Plot: I(m) (log scale, 1 to 512) versus m (1 to 9), one curve each for K = 1, 2, 3, 4, ∞]

For files with many metadata keywords, I(m) is polynomial in m.
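
The cost formula is easy to evaluate directly. The sketch below is a straightforward transcription of I(m); the function name `index_entries` and the use of `None` to mean K = ∞ are conventions chosen for this example.

```python
from math import comb

def index_entries(m, k=None):
    """Number of index entries I(m) for a file with m keywords.

    k is the maximum subset size K; k=None stands for K = infinity.
    """
    if k is None or m <= k:
        return 2**m - 1  # every non-empty subset is indexed
    return sum(comb(m, i) for i in range(1, k + 1))

# Values at m = 9 for the K settings shown in the plot.
for k in (1, 2, 3, 4, None):
    print(k, index_entries(9, k))
```

For m = 9 this gives 9, 45, 129, 255, and 511 entries for K = 1, 2, 3, 4, and ∞ respectively, which shows how capping K turns exponential growth into polynomial growth.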

Storage Costs (FreeDB)
  Number of songs                      21,195,244
  Total index entries (K = 1)         134,403,379
  Index entries per song (K = 1)         6.274406
  Total index entries (K = 3)       1,494,688,373
  Index entries per song (K = 3)        66.078093
⇒ Total storage cost for K = 3 is only an order of magnitude more than that required for a K = 1 inverted index.

Choosing K
Larger K improves query load distribution but increases indexing costs.
[Plot: relative index size (log scale, 1 to 100000) versus K (2 to 14)]
For web searches: average query length 2.53

Index Gateways
[Diagram: source peers S1, S2, S3 each insert the metadata block M_F into indexes I1: a, I2: b, I3: a b]
- Each file has one metadata block, stored in I(m) indexes.
- Each peer sharing the file will insert the same metadata block into each index.
- Total insertion cost for m metadata keywords and s source peers: sI(m) messages.
Problem: expensive and redundant

Index Gateways
Solution: aggregate updates at an index gateway
- Receives metadata blocks from sources and sends to indexes only when necessary
[Diagram: S1, S2, S3 send M_F to gateway G, which forwards it to I1: a, I2: b, I3: a b]
Insertion cost is now s + I(m) (vs. sI(m))!
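
The message counts can be checked with a small simulation of the three-source, three-index diagram. The `IndexGateway` class is hypothetical, and it assumes that "only when necessary" means the gateway publishes to the indexes when the first source announces the file; the slides do not spell out the exact rule.

```python
class IndexGateway:
    """Hypothetical gateway for a single file's metadata block."""

    def __init__(self, index_names):
        self.index_names = list(index_names)
        self.sources = set()
        self.messages = 0  # source-to-gateway plus gateway-to-index messages

    def announce(self, source):
        """Handle one source peer announcing that it shares the file."""
        self.messages += 1  # the source's message to the gateway
        if not self.sources:
            # Assumed rule: only the first announcement reaches the indexes.
            self.messages += len(self.index_names)
        self.sources.add(source)

def direct_insertion_cost(s, num_indexes):
    """Messages when every source inserts into every index itself."""
    return s * num_indexes

indexes = ["a", "b", "a b"]  # I(m) = 3 for keywords a and b
gateway = IndexGateway(indexes)
for peer in ["S1", "S2", "S3"]:
    gateway.announce(peer)

print(gateway.messages, direct_insertion_cost(3, len(indexes)))
```

With s = 3 and I(m) = 3 the gateway path uses 6 messages against 9 for direct insertion, and the gap widens as either the number of sources or the number of indexes grows.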

Direct Storage?
- Content is large
- Network has churn
  - Kazaa median session length 2.4 minutes [Gummadi et al. 2003]
Problem: DHT storage of content is impractical