On Random Sampling, its Applications, and Efficient Distributed Algorithms
Asad Awan, Ronaldo Ferreira, Ramanathan Muralikrishna, Suresh Jagannathan, and Ananth Grama
Parallel and Distributed Systems Lab, Department of Computer Sciences, Purdue University
Acknowledgements: National Science Foundation
Overview of P2P Networks
• Structured:
  – Napster: star topology.
  – Chord, Pastry: O(log n)-regular graphs.
  – CAN: mesh structure.
• Unstructured:
  – Gnutella, Freenet, Morpheus, ...
• New generation:
  – BitTorrent.
Research Problems in P2P Networks
• File sharing is the predominant application of P2P networks.
• Structured:
  – Guaranteed resource lookup using hashes, but lacks keyword search.
• Unstructured:
  – Search:
    ∗ Flooding, used by real-world software, has high overhead.
    ∗ Recent techniques rely on replication [Cohen et al., SIGCOMM '02] and random walks [Gkantsidis et al., INFOCOM '04].
    ∗ Trend towards distributed randomized algorithms.
  – Distributed randomized algorithms, applications:
    ∗ (1) search, (2) duplicate elimination and controlled replication, (3) leader election, (4) routing, (5) group communication.
    ∗ Underlying substrate: uniform sampling in real-world distributed settings.
Duplicate Elimination
• Issues:
  – Reliability: at least one copy of the data must remain.
  – Scalability to large unstructured networks.
• Techniques:
  – Delta encoding, duplicate block suppression, Rabin fingerprints for chunking, REBL [USENIX '04] for local sites.
  – Self-Arranging Lossy Associative Database (SALAD) [ICDCS '02]: aggregates file content and local information.
  – Probabilistic approaches:
    ∗ Elimination by estimation.
    ∗ Elimination by election.
Elimination by Estimation
• Two-step approach:
  – Estimate the number of copies.
  – Based on this estimate and the required replication factor of the system, each site deletes its copy probabilistically.
• Estimation:
  – Abstracted using a balls-and-bins model.
  – Each peer with a duplicate:
    ∗ selects γ√n peers u.a.r. and sends each an estimate message;
    ∗ selects another γ√n peers u.a.r. and requests the number of unique estimate messages received by each peer;
    ∗ estimates the number of copies as 1/γ² of the total received in the previous step.
  – If K is the number of copies, the standard deviation of the above estimate is √K/γ.
Elimination by Estimation (continued)
• Replication factor: ρ.
• Current number of copies: K (K ≫ ρ).
• Each site keeps its copy with probability ρ/K.
• Message complexity: O(n√n).
• Issues:
  – The number of sites n must be known.
  – Requires a method for uniform sampling.
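The estimation procedure above can be sketched as a centralized simulation. This is not from the source; the network size n, the value of γ, and the peer ids are assumed purely for illustration:

```python
import math
import random

def estimate_copies(n, holders, gamma, rng):
    """Balls-and-bins estimate of the number of copies K of a file.
    n: network size, holders: peers holding a copy, gamma: accuracy
    parameter (std. dev. of each estimate is roughly sqrt(K)/gamma)."""
    m = round(gamma * math.sqrt(n))
    # Step 1: every holder sends an estimate message to m peers chosen u.a.r.
    received = [set() for _ in range(n)]      # unique senders seen per peer
    for h in holders:
        for peer in rng.sample(range(n), m):
            received[peer].add(h)
    # Step 2: every holder polls another m peers for their unique-message
    # counts and scales the total by 1/gamma^2.
    estimates = []
    for h in holders:
        total = sum(len(received[p]) for p in rng.sample(range(n), m))
        estimates.append(total / gamma ** 2)
    return estimates

rng = random.Random(1)
n, K, gamma = 10_000, 400, 2
holders = rng.sample(range(n), K)
est = estimate_copies(n, holders, gamma, rng)
mean = sum(est) / len(est)
print(round(mean))  # close to K = 400
```

Each site would then keep its copy with probability ρ/K̂, using its own estimate K̂, to drive the replica count toward ρ.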
Elimination by Election
• For each file, elect a leader.
• The leader is responsible for the file.
• Traditional leader election:
  – A unique leader is a requirement.
  – The identity of the leader is known to all participants.
  – Message complexity: O(n log n).
• Randomized leader election:
  – Uniqueness is desired but not absolutely required.
  – Participants need not know the identity of the elected leader.
  – Message complexity: O(n).
Definitions and Features
• Contender: a participating site in the protocol that holds a copy of the file.
• Mediator: a site that receives a message from a contender and arbitrates whether the contender participates in subsequent steps of the protocol.
• Round: communication between a contender and a set of mediators.
• Features:
  – A unique leader is elected w.h.p. (1 − 1/n^Ω(1)).
  – Message complexity: O(n).
  – Round complexity: O(log n).
Randomized Leader Election
• Performed in two phases.
• First phase:
  – Goal: reduce the number of contenders to a desired level.
  – In round i, each contender C sends a message to √(2^i ln 2) mediators selected u.a.r.
  – Each mediator M that received a message from C sends back a response indicating that C can proceed to round i + 1 iff M did not receive a message from another contender.
  – C proceeds to round i + 1 if it receives positive responses from all the mediators to which it sent a message; otherwise C deletes its local copy of the file.
  – After each round, the expected number of contenders halves.
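A centralized sketch of the first phase (not from the source; the graph size, round count, and the rounding of √(2^i ln 2) to an integer quorum size are assumptions):

```python
import random

def first_phase_round(contenders, n, m, rng):
    """One contention round: each contender sends to m mediators chosen
    u.a.r. and survives iff none of its mediators heard from anyone else."""
    senders = {}                       # mediator id -> list of senders
    quorum = {}
    for c in contenders:
        quorum[c] = rng.sample(range(n), m)
        for med in quorum[c]:
            senders.setdefault(med, []).append(c)
    return [c for c in contenders
            if all(senders[med] == [c] for med in quorum[c])]

rng = random.Random(5)
n = 4096
survivors = list(range(n))             # initially every site may contend
for i in range(1, 7):
    m = max(1, round((2 ** i * 0.693) ** 0.5))   # ~ sqrt(2^i ln 2)
    survivors = first_phase_round(survivors, n, m, rng)
print(len(survivors))                  # shrinks roughly geometrically
```

The field thins because a contender dies as soon as any of its mediators hears from a competitor; with quorum sizes growing as above, the expected survivor count roughly halves per round.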
An Example
[Figure: eight contenders cast balls (messages) to mediators over successive rounds. Round 1: each contender casts one ball. Round 2: each contender casts two balls. Round 3 (log n): each contender casts 5 balls. The early rounds form the first phase; the final round is the second phase, in which the leader is elected.]
Randomized Leader Election (continued)
• Second phase:
  – Goal: elect a unique leader.
  – Based on the Probabilistic Quorum Systems of Malkhi et al. (PODC '97).
  – Each contender sends a message with a random number to √(n ln n) mediators selected u.a.r.
  – A mediator responds positively only to the contender with the largest random number it has received.
  – A contender is the leader iff it receives positive responses from all its mediators.
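The second phase can be sketched similarly (not from the source; n, the number of surviving contenders, and the tie-handling rule are assumptions):

```python
import math
import random

def second_phase(contenders, n, q, rng):
    """Probabilistic-quorum election: each contender sends a random number
    to q mediators chosen u.a.r.; it is leader iff every one of its
    mediators saw no larger number."""
    draw = {c: rng.random() for c in contenders}
    quorum = {c: rng.sample(range(n), q) for c in contenders}
    best = {}                          # mediator -> largest number it saw
    for c in contenders:
        for med in quorum[c]:
            best[med] = max(best.get(med, 0.0), draw[c])
    leaders = [c for c in contenders
               if all(best[med] == draw[c] for med in quorum[c])]
    return leaders, draw

rng = random.Random(7)
n = 10_000
contenders = list(range(8))            # survivors of the first phase
q = round(math.sqrt(n * math.log(n)))  # quorum size ~ sqrt(n ln n)
leaders, draw = second_phase(contenders, n, q, rng)
print(leaders)
```

The contender holding the globally largest number always gets all-positive responses, so at least one leader emerges; because any two quorums of size √(n ln n) intersect w.h.p., a second leader is unlikely.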
Distributed Uniform Sampling in Real-World Networks
• A random walk of a prescribed minimum length yields a random sample, independent of the origin of the walk.
• Random walks are conveniently analyzed using a Markov model:
  – Represent the network as a graph G(V, E), and define the transition probability matrix P, where p_ij = 1/d_i for (i, j) ∈ E.
    ∗ Walks are memoryless: Pr(X_t = j | X_{t−1} = i, X_{t−2} = i_1, ...) = p_ij.
    ∗ G connected and aperiodic ⇒ the Markov chain is irreducible and aperiodic.
  – A long walk reaches the stationary distribution π, irrespective of origin and path.
    → This gives random sampling with distribution π_i = d_i / 2|E|.
    → Is this a uniformly random sample? Not if the network nodes have a non-uniform degree distribution!
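The claim π_i = d_i / 2|E| is easy to check numerically. A minimal sketch (not from the source) using power iteration on an assumed 5-node graph, a triangle with a two-edge tail so the chain is irreducible and aperiodic:

```python
def stationary(P, iters=500):
    """Power-iterate a row vector pi toward the stationary distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
n = 5
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
# Simple random walk: p_ij = 1/d_i for each neighbor j of i.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in adj[i]:
        P[i][j] = 1.0 / len(adj[i])
pi = stationary(P)
expected = [len(adj[i]) / (2 * len(edges)) for i in range(n)]
print([round(x, 3) for x in pi])  # matches d_i/2|E| = [0.2, 0.2, 0.3, 0.2, 0.1]
```

The degree-3 node is sampled three times as often as the degree-1 node, which is exactly the bias the following slides set out to remove.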
Random Walks in Real-World Networks
• Real-world networks have non-uniform degree distributions, e.g., the power-law topology of Gnutella.
• Stationary sampling distribution for random walks on a 50,000-node power-law random graph:
[Figure: probability of selection vs. node (sorted by degree), for a simple random walk of length 30 log n; selection probabilities range from about 1e-05 to 9e-05.]
• Node selection probability varies by almost an order of magnitude!
Uniform Sampling via Random Walks
• To get uniform sampling we must change P, the matrix over which the random walk is defined. What class of transition matrices yields random walks with π_i = 1/n?
• Recall that the stationary distribution satisfies π^T = π^T P.
  With π_uniform = (1/n)1^T: (1/n)1^T = (1/n)1^T P ⇒ the sum of entries in each column of P is 1. Thus, P must be doubly stochastic.
• Observation: a symmetric transition probability matrix is doubly stochastic. It is row-stochastic because each row of probabilities sums to 1, and column-stochastic by virtue of symmetry.
• A key issue remains: how long must the walk be to reach stationarity?
Length of Random Walks
• Eigenstructure of P: 1 = λ1 > |λ2| ≥ |λ3| ≥ ... ≥ |λn|.
• Convergence to stationarity: P^T π = Pπ = π (for symmetric P). Thus π is an eigenvector with eigenvalue 1, i.e., the largest eigenvalue.
  ∴ P^∞ = 1π^T (infinite-step walk).
• Perron-Frobenius theorem: P^t = P^∞ + O(|λ2|^t · t^(m2−1)), where m2 is the multiplicity of λ2.
  |λ2| < 1 ⇒ |λ2|^t ≈ 0 even for a shorter walk if |λ2| is small.
• Mixing time, i.e., length of walk, = O(log n / (1 − SLEM)) for expanders (Lovász '96), where SLEM is the second largest eigenvalue modulus. The length of walk required to reach the stationary distribution is small if the SLEM is small.
Enabling Efficient Uniform Sampling
Aim: design distributed algorithms that locally compute transition probabilities between neighboring nodes, so that the resulting global transition matrix has stationary distribution π_uniform.
• Known algorithms:
  – Maximum-Degree (MD) algorithm: P^md is symmetric ⇒ π_uniform.
      p^md_ij = 1/d_max,           if i ≠ j and j ∈ Γ(i)
                1 − d_i/d_max,     if i = j
                0,                 otherwise.
  – Metropolis-Hastings (MH) algorithm (adapted for a uniform stationary distribution): P^mh is symmetric ⇒ π_uniform.
      p^mh_ij = 1/max(d_i, d_j),          if i ≠ j and j ∈ Γ(i)
                1 − Σ_{j∈Γ(i)} p^mh_ij,   if i = j
                0,                        otherwise.
• Issues:
  – MD requires knowledge of a global dynamic variable (d_max).
  – Both algorithms have high self-transition probabilities, which intuitively implies longer mixing times.
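The MH rule can be checked directly (not from the source; the 5-node example graph is assumed): build P^mh from the degrees and verify it is symmetric, hence doubly stochastic, hence has a uniform stationary distribution.

```python
def metropolis_hastings(adj):
    """Build the degree-based Metropolis-Hastings transition matrix whose
    stationary distribution is uniform: p_ij = 1/max(d_i, d_j) on edges,
    with the leftover mass on the self-loop."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            P[i][j] = 1.0 / max(len(adj[i]), len(adj[j]))
        P[i][i] = 1.0 - sum(P[i][j] for j in adj[i])
    return P

edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
n = 5
adj = [[] for _ in range(n)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
P = metropolis_hastings(adj)
# Symmetric, so every column (like every row) sums to 1: doubly stochastic.
print([round(sum(P[i][j] for i in range(n)), 6) for j in range(n)])
```

Note the low-degree leaf node ends up with a large self-loop (here p_44 = 1/2), illustrating the high self-transition probabilities that motivate RWD on the next slide.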
Enabling Efficient Uniform Sampling (continued)
• Random Weight Distribution (RWD):
  – Initialization: assign a small constant transition probability 1/ρ (system parameter ρ ≥ d_max) to each edge. This leaves a high self-transition probability, called the node's weight.
  – Iteration: each node i randomly distributes its weight to its neighbors by incrementing the transition probability with them symmetrically, using ACKs and NACKs.
  – Termination: for each node i, either p_ii = 0 or p_jj = 0 for every neighbor j of i.
  – Each increment is done by a quantum value δ (system parameter).
  – ρ is a static system parameter, an overestimate of d_max. It is often easy to estimate; e.g., supernodes have a maximum connection limit in P2P networks. As we shall demonstrate, a high overestimate is adequate.
  – If the quantum δ is small, the number of increment messages is higher (msgs < (1 − d_i/ρ)/δ + d_i < 1/δ + d_i). These messages can be piggybacked on routine communication between neighbors (ping-pong, queries, etc.).
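A centralized sketch of RWD as described above. This is not from the source; ρ, δ, the example graph, and the exact ACK/NACK rule are assumptions, and the sketch terminates with leftover weights smaller than δ rather than exactly zero:

```python
import random

def rwd(adj, rho, delta, rng):
    """Random Weight Distribution sketch. Every edge starts at 1/rho
    (rho >= d_max); each node then pushes its self-loop weight onto
    incident edges in quanta of delta, symmetrically: an increment
    succeeds (ACK) only if the neighbor can match it from its own weight."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in adj[i]:
            P[i][j] = 1.0 / rho
        P[i][i] = 1.0 - len(adj[i]) / rho      # the node's "weight"
    progress = True
    while progress:
        progress = False
        for i in rng.sample(range(n), n):      # nodes act in random order
            if P[i][i] < delta:
                continue                       # nothing left to distribute
            ready = [j for j in adj[i] if P[j][j] >= delta]
            if not ready:
                continue                       # all neighbors NACK
            j = rng.choice(ready)
            P[i][j] += delta                   # symmetric increment ...
            P[j][i] += delta
            P[i][i] -= delta                   # ... paid from both weights
            P[j][j] -= delta
            progress = True
    return P

rng = random.Random(3)
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
adj = [[] for _ in range(5)]
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)
P = rwd(adj, rho=8, delta=0.01, rng=rng)
print([round(sum(row), 6) for row in P])   # rows still sum to 1
```

Every increment preserves symmetry and row sums, so the result remains doubly stochastic and the walk's stationary distribution stays uniform, while the self-transition mass is driven down toward zero.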