
Why Graphs?



  1. Why Graphs?
• Let us now look at implementing graph algorithms in MapReduce.
• Discussion is based on the book and slides by Jimmy Lin and Chris Dyer.
• Analyze hyperlink structure of the Web
• Social networks
  – Facebook friendships, Twitter followers, email flows, phone call patterns
• Transportation networks
  – Roads, bus routes, flights
• Interactions between genes, proteins, etc.

What is a Graph?
• G = (V, E)
  – V: set of vertices (nodes)
  – E: set of edges (links), E ⊆ V × V
• Edges can be directed or undirected
• Graph might have cycles or not (acyclic graph)
• Nodes and edges can be annotated
  – E.g., social network: node has demographic information like age; edge has type of relationship like friend or family

Graph Problems
• Graph search and path planning
  – Find driving directions from A to B
  – Recommend possible friends in social network
  – How to route IP packets or delivery trucks
• Graph clustering
  – Identify communities in social networks
  – Partition large graph to parallelize graph processing
• Minimum spanning trees
  – Connected graph of minimum total edge weight

More Graph Problems
• Bipartite graph matching
  – Match nodes on “left” with nodes on “right” side
  – E.g., match job seekers and employers, singles looking for dates, papers with reviewers
• Maximum flow
  – Maximum traffic between source and sink
  – E.g., optimize transportation networks
• Finding “special” nodes
  – E.g., disease hubs, leader of a community, people with influence

Graph Representations
• Usually one of these two:
  – Adjacency matrix
  – Adjacency list

  2. Adjacency Matrix
• Matrix M of size |N| by |N|
  – Entry M(i,j) contains weight of edge from node i to node j; 0 if no edge

      1  2  3  4
  1   0  1  0  1
  2   1  0  1  1
  3   1  0  0  0
  4   1  0  1  0

[Figure: the corresponding graph on nodes 1 to 4. Example source: Jimmy Lin]

Properties
• Advantages
  – Easy to manipulate with linear algebra
    • M × M: entry (i,j) = number of two-step paths to go from node i to node j
  – Operation on outlinks and inlinks corresponds to iteration over rows and columns
• Disadvantage
  – Huge space overhead for sparse matrix
  – E.g., Facebook friendship graph

Adjacency List
• Compact row-wise representation of matrix

  1: 2, 4
  2: 1, 3, 4
  3: 1
  4: 1, 3

Properties
• Advantages
  – More space-efficient
  – Still easy to compute over outlinks for each node
• Disadvantage
  – Difficult to compute over inlinks for each node
• Note: remember inverse Web graph discussion

Parallel Breadth-First Search
• Case study: single-source shortest path problem
  – Find the shortest path from a source node s to all other nodes in the graph
• For non-negative edge weights, Dijkstra’s algorithm is the classic sequential solution
  – Initialize distance d[s]=0, all others to ∞
  – Maintain priority queue of nodes sorted by distance
  – Remove first node u from queue and update d[v] for each node v in adjacency list of u if (1) v is in queue and (2) d[v] > d[u]+weight(u,v)
Example from Jimmy Lin’s presentation

Dijkstra’s Algorithm Example
[Figure: five-node example graph in its initial state. The source has distance 0, all other nodes have distance ∞. Edge weights shown: 10, 5, 1, 2, 3, 9, 2, 4, 6, 7. Example from CLR]
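Returning to the two graph representations: the following is a minimal sketch, using the four-node example matrix from the adjacency matrix slide, of how the adjacency list, the inlinks, and the two-step path counts could be derived. The function names (`to_adjacency_list`, `inlinks`, `two_step_paths`) are illustrative and not part of the original slides.

```python
# Adjacency matrix of the four-node example (row i = outlinks of node i+1).
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def to_adjacency_list(matrix):
    """Row-wise compaction: keep only the columns with a nonzero entry."""
    return {
        i + 1: [j + 1 for j, w in enumerate(row) if w != 0]
        for i, row in enumerate(matrix)
    }


def inlinks(adj):
    """Invert an adjacency list; this needs a pass over every outlink."""
    result = {node: [] for node in adj}
    for u, outs in adj.items():
        for v in outs:
            result[v].append(u)
    return result


def two_step_paths(matrix):
    """M x M: entry (i, j) counts the two-step paths from node i to node j."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


adj = to_adjacency_list(M)   # outlinks per node
inv = inlinks(adj)           # inlinks per node
M2 = two_step_paths(M)       # two-step path counts
```

Reading outlinks from `adj` is a single lookup, whereas `inlinks` has to scan the whole structure, which is the disadvantage of the adjacency list noted on the slide.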

  3. Dijkstra’s Algorithm Example (continued)
[Figures: five successive states of the same example graph. The source keeps distance 0; the distance labels shown on the four other nodes are listed below. Example from CLR]
• State 1: 10, ∞, 5, ∞
• State 2: 8, 14, 5, 7
• State 3: 8, 13, 5, 7
• State 4: 8, 9, 5, 7
• State 5 (final): 8, 9, 5, 7

Parallel Single-Source Shortest Path
• Priority queue is core element of Dijkstra’s algorithm
  – No global shared data structure in MapReduce
• Dijkstra’s algorithm proceeds sequentially, node by node
  – Taking non-min node could affect correctness of algorithm
• Solution: perform parallel breadth-first search
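The sequential algorithm traced in the example can be sketched as follows. This is a minimal illustration, not the slides' own code. The node names (`s`, `t`, `x`, `y`, `z`) and edge directions are assumptions reconstructed from the standard CLR textbook figure, since the extracted slides preserve only the edge weights and distance labels. Python's `heapq` has no decrease-key operation, so the sketch pushes duplicate entries and skips stale ones instead.

```python
import heapq
import math


def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    queue = [(0, source)]          # priority queue ordered by distance
    done = set()                   # nodes already removed from the queue
    while queue:
        d_u, u = heapq.heappop(queue)
        if u in done:
            continue               # stale entry left over from an earlier push
        done.add(u)
        for v, w in graph[u].items():
            # Update d[v] only if v is still pending and the new path is shorter.
            if v not in done and dist[v] > d_u + w:
                dist[v] = d_u + w
                heapq.heappush(queue, (dist[v], v))
    return dist


# Assumed reconstruction of the example graph (weights match the slide labels).
GRAPH = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"s": 7, "x": 6},
}

distances = dijkstra(GRAPH, "s")
```

Under these assumptions the result agrees with the final state of the example (8, 9, 5, 7 for the non-source nodes). The single shared priority queue is exactly the piece that has no counterpart in MapReduce.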

  4. Parallel Breadth-First Search
• Start at source s
• In first round, find all nodes reachable in one hop from s
• In second round, find all nodes reachable in two hops from s, and so on
• Keep track of min distance for each node
  – Also record corresponding path
• Iterations stop when no shorter path possible

BFS Visualization
[Figure: BFS expanding over a graph with nodes n0 to n9. Example from Jimmy Lin’s presentation]

MapReduce Code: Single Iteration

  map(nid n, node N)  // N stores node’s current min distance and adjacency list
    d = N.distance
    emit(nid n, N)  // Pass along graph structure
    for all nid m in N.adjacencyList do
      emit(nid m, d + w(n,m))  // Emit distances to reachable nodes

  reduce(nid m, [d1,d2,…])
    dMin = ∞; M = ∅
    for all d in [d1,d2,…] do
      if isNode(d) then M = d  // Recover graph structure
      else if d < dMin then  // Look for min distance in list
        dMin = d
    if dMin < M.distance  // Needed to avoid overwriting of source node’s distance
      M.distance = dMin  // Update node’s shortest distance
    emit(nid m, node M)

Overall Algorithm
• Need driver program to control the iterations
• Initialization: SourceNode.distance = 0, all others have distance = ∞
• When to stop iterating?
• If all edges have weight 1, can stop as soon as no node has ∞ distance any more
  – Can detect this with Hadoop counter
• Number of iterations depends on graph diameter
  – In practice, many networks show the small-world phenomenon, e.g., six degrees of separation

Dealing With Diverse Edge Weights
• “Detour” path can be shorter than “direct” connection, hence cannot stop as soon as all node distances are finite
• Stop when no node’s shortest distance changes any more
  – Can be detected with Hadoop counter
  – Worst case: |N| iterations
[Figure: graph with nodes n1 to n9 in which every edge has weight 1 except one edge of weight 10. Example from Jimmy Lin’s presentation]

MapReduce Algorithm Analysis
• Brute-force approach that performs many irrelevant computations
  – Computes distances for nodes that still have infinity distance
  – Repeats previous computations inside “search frontier”
• Dijkstra’s algorithm only explores the search frontier, but needs the priority queue
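The map and reduce functions plus the driver loop can be simulated in a single process to see how the iterations behave. The sketch below is an in-memory stand-in, not Hadoop code. The shuffle is a plain dictionary, the `updates` count plays the role of the Hadoop counter, and the small "detour" graph with its node names is a hypothetical example rather than the one in the figure. It assumes every node appears as a key in `graph`, so that its node record is always passed along.

```python
import math
from collections import defaultdict


def map_fn(nid, node):
    """Emit the node itself, then a candidate distance for each neighbor."""
    yield nid, node                      # pass along graph structure
    d = node["distance"]
    for m, w in node["adj"].items():
        yield m, d + w                   # distance to m via this node


def reduce_fn(values):
    """Recover the node record and keep the minimum candidate distance."""
    d_min, node = math.inf, None
    for v in values:
        if isinstance(v, dict):
            node = v                     # recover graph structure
        elif v < d_min:
            d_min = v
    changed = d_min < node["distance"]   # never overwrites a shorter distance
    if changed:
        node = {**node, "distance": d_min}
    return node, changed


def run_iteration(nodes):
    """One MapReduce round: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for nid, node in nodes.items():
        for key, value in map_fn(nid, node):
            groups[key].append(value)
    new_nodes, updates = {}, 0
    for nid, values in groups.items():
        new_nodes[nid], changed = reduce_fn(values)
        updates += changed               # stand-in for the Hadoop counter
    return new_nodes, updates


def parallel_sssp(graph, source):
    """Driver: iterate until no node's shortest distance changes any more."""
    nodes = {
        n: {"distance": 0 if n == source else math.inf, "adj": dict(adj)}
        for n, adj in graph.items()
    }
    iterations = 0
    while True:
        nodes, updates = run_iteration(nodes)
        iterations += 1
        if updates == 0:
            break
    return {n: node["distance"] for n, node in nodes.items()}, iterations


# Hypothetical detour graph: the direct edge a->b (10) loses to a->c->d->b (3).
DETOUR = {"a": {"b": 10, "c": 1}, "b": {}, "c": {"d": 1}, "d": {"b": 1}}
result, rounds = parallel_sssp(DETOUR, "a")
```

In this example all distances are already finite after the second round, but `b` only reaches its true distance in the third, which illustrates why the stopping rule for diverse edge weights has to check for changes rather than for remaining ∞ values. The final round exists only to confirm that nothing changed.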

  5. Typical Graph Processing in MapReduce
• Graph represented by adjacency list per node, plus extra node data
• Map works on a single node u
  – Node u’s local state and links only
• Node v in u’s adjacency list is intermediate key
  – Passes results of computation along outgoing edges
• Reduce combines partial results for each destination node
• Map also passes graph itself to reducers
• Driver program controls execution of iterations

PageRank Introduction
• Popularized by Google for evaluating the quality of a Web page
• Based on random Web surfer model
  – Web surfer can reach a page by jumping to it or by following the link from another page pointing to it
  – Modeled as random process
• Intuition: important pages are linked from many other (important) pages
  – Goal: find pages with greatest probability of access

PageRank Definition
• PageRank of page n:
  – $P(n) = \alpha \frac{1}{|V|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}$
  – |V| is number of pages (nodes)
  – $\alpha$ is probability of random jump
  – L(n) is the set of pages linking to n
  – P(m) is m’s PageRank
  – C(m) is m’s out-degree
• Definition is recursive
  – Compute by iterating until convergence (fixpoint)

Computing PageRank
• Similar to BFS for shortest path
• Computing P(n) only requires P(m) and C(m) for all pages linking to n
  – During iteration, distribute P(m) evenly over outlinks
  – Then add contributions over all of n’s inlinks
• Initialization: any probability distribution over the nodes

PageRank Example
[Figures: five-node graph shown before and after each of two iterations; edge labels give the PageRank mass sent along each outlink. Source: Jimmy Lin’s presentation]

  Node   Initial   After iteration 1   After iteration 2
  n1     0.2       0.066               0.1
  n2     0.2       0.166               0.133
  n3     0.2       0.166               0.183
  n4     0.2       0.3                 0.2
  n5     0.2       0.3                 0.383
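One PageRank iteration, as described under Computing PageRank, can be sketched in a few lines. This is an illustration under stated assumptions, not the slides' code. The link structure in `GRAPH` is inferred from the node values and edge labels of the example (it is not spelled out in the extracted text), the random-jump probability `alpha` is set to 0 because that is what reproduces the example's numbers, and every node is assumed to have at least one outlink.

```python
def pagerank_iteration(graph, ranks, alpha=0.0):
    """One round: spread P(m) evenly over m's outlinks, then sum per inlink."""
    n = len(graph)
    incoming = {node: 0.0 for node in graph}
    for m, outlinks in graph.items():
        share = ranks[m] / len(outlinks)     # P(m) / C(m)
        for target in outlinks:
            incoming[target] += share
    # Combine the random-jump term with the mass received over inlinks.
    return {
        node: alpha / n + (1 - alpha) * incoming[node]
        for node in graph
    }


# Link structure inferred from the example's numbers (an assumption).
GRAPH = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

ranks0 = {node: 0.2 for node in GRAPH}       # uniform initialization
ranks1 = pagerank_iteration(GRAPH, ranks0)   # after iteration 1
ranks2 = pagerank_iteration(GRAPH, ranks1)   # after iteration 2
```

In a MapReduce setting the inner loop corresponds to the map phase (emit `share` keyed by `target`) and the final sum to the reduce phase, with a driver repeating rounds until the values stop changing.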
