Comparing Hybrid Peer-to-Peer Systems
Beverly Yang and Hector Garcia-Molina
Presented by Marco Barreno
November 3, 2003
CS 294-4: Peer-to-peer systems

Hybrid peer-to-peer systems
- Pure peer-to-peer systems are hard to scale
  - Gnutella
- Look at hybrids between p2p and server-client
  - Servers will index files, clients download from each other directly
  - Searching can be done more efficiently on a server
  - Napster (but Napster had its own problems...)
- Several other architectures

Questions for hybrid systems
- Best way to organize servers?
- Index replication policy?
- What queries are submitted often?
- How do we deal with churn?
- How do query patterns affect performance?

Contributions of this paper
- Presents several architectures for hybrid systems
- Presents and evaluates a probabilistic model for queries
- Compares architectures quantitatively, based on their models and the music sharing domain
- Compares strategies in non-music-sharing domains (a bit)
General concepts: basic actions
- Login
  - A client connects to a server and uploads metadata about the files it offers
  - It is a local user to that server, a remote user to others
- Query
  - A list of words to search on
  - Satisfied if preset maximum number of results found
- Download
  - Contact peer directly after getting info from server

Goal
- The goal of this study is to maximize UsersPerServer
- What do you think of this goal?

Batch vs. incremental logins
- Batch: on login/logout, user's entire metadata set is added/removed
  - Allows index to remain small, but login/logout is expensive
- Incremental: metadata kept in index at all times, and only deltas are sent at login
  - Saves much effort on login/logout
  - Queries become more expensive, as server must filter for online users

Architectures (1)
- Chained architecture
  - Servers are arranged in a linear chain (ring?)
  - Each server keeps metadata for local users
  - Unsatisfied queries sent along chain
  - Logins and downloads scalable; queries potentially expensive
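
A minimal sketch of how a query might travel through the chained architecture, to make the "unsatisfied queries sent along chain" bullet concrete. The `Server` class, the `chained_query` function, the word-in-title matching rule, and the wrap-around (ring) ordering are all assumptions made for illustration; the slides leave open whether the chain is a ring.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """One server in the chain; it indexes only its local users' files."""
    name: str
    index: dict = field(default_factory=dict)  # file title -> owner address

    def local_matches(self, words):
        # Hypothetical matching rule: every query word appears in the title.
        return [(title, owner) for title, owner in self.index.items()
                if all(w.lower() in title.lower().split() for w in words)]


def chained_query(chain, start, words, max_results):
    """Forward the query along the chain until max_results are found.

    Returns (results, servers_contacted).
    """
    results = []
    contacted = 0
    for hop in range(len(chain)):
        server = chain[(start + hop) % len(chain)]  # assumption: ring order
        contacted += 1
        results.extend(server.local_matches(words))
        if len(results) >= max_results:
            break  # query satisfied; stop forwarding
    return results[:max_results], contacted


chain = [
    Server("A", {"blue moon": "peer1", "red river": "peer2"}),
    Server("B", {"blue sky": "peer3"}),
    Server("C", {"blue train": "peer4"}),
]
results, hops = chained_query(chain, start=0, words=["blue"], max_results=2)
```

Here the query is satisfied after two servers, so the third is never contacted. Raising `max_results` forces the query to visit every server, which is the expensive case the slide warns about.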
Architectures (2)
- Full replication architecture
  - Each server keeps metadata about all users
  - Logins expensive
  - Queries cheap

Architectures (3)
- Hash architecture
  - Metadata words hashed so a particular server is responsible for a particular subset of them
  - Queries sent to relevant servers
  - On login, metadata sent to all relevant servers
  - Limited number of servers need to see each query, but sending the lists may be expensive

Architectures (4)
- Unchained architecture
  - Servers are independent and don't communicate
  - A user can only search files on the server he/she connects to
  - Napster
  - Disadvantage: user's views are limited
  - Advantage: scales very well (as servers, users increase together)

Query model
- Universe of queries: q_1, q_2, q_3, ...; densities f, g
- g(i) is probability that a submitted query is query q_i (query popularity)
- f(i) is probability that any given file will match query q_i (selection power)
- g tells us what queries users like to submit, while f tells us which files users like to store
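
As a rough illustration of the hash architecture described above, the sketch below routes each metadata word to one server and answers a multi-word query by intersecting the per-word lists. The function names, the use of SHA-1 modulo the server count, and the in-memory `indexes` list are assumptions for the example, not details taken from the paper.

```python
import hashlib


def server_for_word(word, num_servers):
    """Map a metadata word to the server responsible for it."""
    digest = hashlib.sha1(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_servers


def login(indexes, user, files):
    """Send each (word, file) pair to the server that owns the word."""
    for title in files:
        for word in title.lower().split():
            sid = server_for_word(word, len(indexes))
            indexes[sid].setdefault(word, set()).add((title, user))


def query(indexes, words):
    """Fetch one list per word from its server, then intersect the lists."""
    lists = []
    servers_seen = set()
    for word in words:
        sid = server_for_word(word, len(indexes))
        servers_seen.add(sid)
        lists.append(indexes[sid].get(word.lower(), set()))
    matches = set.intersection(*lists) if lists else set()
    return matches, servers_seen


indexes = [dict() for _ in range(4)]  # one inverted index per server
login(indexes, "peer1", ["blue moon", "blue sky"])
login(indexes, "peer2", ["red moon"])
matches, servers_seen = query(indexes, ["blue", "moon"])
```

A two-word query touches at most two servers regardless of how many exist, but the full list for each word has to be shipped before the intersection can be taken, which is where the "sending the lists may be expensive" cost comes from.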
Expected results for chained
- ExServ = Expected number of servers needed to obtain R results (MaxResults)
- If P(s) is the probability that exactly s servers are needed to return R or more results, we have:
  - [equation not recovered]

Expected values for others
- ExServ trivially 1 for full replication and unchained
- ExServ is equivalent to balls-in-bins for hash
- ExLocalResults based on (UsersPerServer * FilesPerUser) files
- ExTotalResults based on (ExLocalResults * k) files

Distributions for f() and g()
- Exponential distributions work well for music domain
- Monotonically decreasing
- Popularity and selection power are correlated
  - Most popular has highest selection power, and so on

Validation of query model
- M(n) = expected # results from n files
- Q(n) = probability we don't get R results
- These data gathered from OpenNap
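
To make M(n), Q(n), and ExServ more tangible, here is a small numerical sketch. It is not the paper's derivation: it assumes a finite universe of queries, truncated exponential shapes for g and f with hypothetical parameters `lam_g`, `lam_f`, and `f_scale`, and that each file matches a query independently (so match counts are binomial). Under those assumptions ExServ for a chain of k servers is the expectation of s under P(s), capped at k.

```python
import math


def make_model(num_queries, lam_g, lam_f, f_scale):
    """Build truncated exponential g (popularity) and f (selection power)."""
    raw_g = [math.exp(-i / lam_g) for i in range(1, num_queries + 1)]
    total = sum(raw_g)
    g = [x / total for x in raw_g]  # normalized so that g sums to 1
    f = [f_scale * math.exp(-i / lam_f) for i in range(1, num_queries + 1)]
    return g, f


def expected_results(n, g, f):
    """M(n): expected number of matches among n files for a random query."""
    return sum(gi * n * fi for gi, fi in zip(g, f))


def prob_fewer_than(r, n, p):
    """P(Binomial(n, p) < r): chance that n files yield fewer than r matches."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(min(r, n + 1)))


def prob_unsatisfied(n, r, g, f):
    """Q(n): probability a random query gets fewer than r results from n files."""
    return sum(gi * prob_fewer_than(r, n, fi) for gi, fi in zip(g, f))


def expected_servers(k, files_per_server, r, g, f):
    """ExServ for a chain of k servers; a query never visits more than k."""
    total = 0.0
    for s in range(1, k + 1):
        # P(s): unsatisfied after s-1 servers, satisfied (or out of servers) at s.
        before = prob_unsatisfied((s - 1) * files_per_server, r, g, f)
        after = prob_unsatisfied(s * files_per_server, r, g, f) if s < k else 0.0
        total += s * (before - after)
    return total


# Hypothetical parameters, chosen only so the example runs quickly.
g, f = make_model(num_queries=50, lam_g=10.0, lam_f=10.0, f_scale=0.05)
m = expected_results(1000, g, f)
q = prob_unsatisfied(1000, 5, g, f)
ex = expected_servers(k=5, files_per_server=1000, r=5, g=g, f=f)
```

With these definitions M(n) grows linearly in n, Q(n) shrinks as more files are searched, and ExServ always lies between 1 and k.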
Performance model
- CPU cycles
  - Cost estimates based on examination and guesswork, plus some experiments
- Inter-server bandwidth
  - Varies among architectures
- Server-client bandwidth
  - Napster protocol: Login, AddFile, RemoveFile
- Memory requirements

Evaluation
- Metric: max users per server (throughput, not latency)
- Matched OpenNap relatively well for batch logins
- Take min over resources (iterative estimation)

Beyond music
- f() and g() could be different
- May be no or negative correlation
  - e.g. Adding "price > 0" to a query makes it less popular but doesn't change size of result set
  - e.g. Archive system will return more results from farther in the past (queries presumably rarer)
- No or negative correlation can be modeled by adjusting the ratio of the parameters to f and g
  - No: r = 1
  - Negative: r >> 1
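
One way to read "take min over resources (iterative estimation)" from the Evaluation slide is sketched below: for each resource, find how many users fit, and report the tightest limit. Because per-user cost can itself depend on the number of users (a larger index makes queries costlier), the sketch searches for the limit rather than dividing once. The cost functions and capacities are hypothetical numbers, not figures from the paper, and the search assumes total demand never decreases as users are added.

```python
def max_users_per_server(capacity, cost_per_user, upper=10_000_000):
    """Largest user count whose total demand fits within every resource.

    capacity: dict resource -> available units per second
    cost_per_user: dict resource -> function(users) giving per-user demand,
        which may grow with the number of users (e.g. a larger index).
    """
    def fits(users):
        return all(users * cost_per_user[r](users) <= capacity[r]
                   for r in capacity)

    lo, hi = 0, upper
    while lo < hi:  # binary search for the largest count that still fits
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


# Hypothetical numbers, chosen only to make the example run.
capacity = {"cpu": 1_000_000.0, "bandwidth": 500_000.0}
cost_per_user = {
    "cpu": lambda u: 10.0 + 0.001 * u,  # query cost grows with index size
    "bandwidth": lambda u: 20.0,        # flat per-user traffic
}
limit = max_users_per_server(capacity, cost_per_user)
```

In this made-up configuration bandwidth is the binding resource, so the server tops out at 25,000 users even though the CPU could carry slightly more.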
CPU performance vs. r
- (plot of CPU performance against r; image not preserved)

Conclusion
- Chained is the best architecture for music domain
- Full replication might be good with lots of cheap memory and stable network connections
- Incremental logins do best when there is negative correlation between f and g, and they perform best in short, bandwidth-limited sessions