Let us now look at implementing graph algorithms in MapReduce.
• Discussion is based on the book and slides by Jimmy Lin and Chris Dyer

Why Graphs?
• Analyze hyperlink structure of the Web
• Social networks
  – Facebook friendships, Twitter followers, email flows, phone call patterns
• Transportation networks
  – Roads, bus routes, flights
• Interactions between genes, proteins, etc.

What is a Graph?
• G = (V, E)
  – V: set of vertices (nodes)
  – E: set of edges (links), E ⊆ V × V
• Edges can be directed or undirected
• Graph might have cycles or not (acyclic graph)
• Nodes and edges can be annotated
  – E.g., social network: node has demographic information like age; edge has type of relationship like friend or family

Graph Problems
• Graph search and path planning
  – Find driving directions from A to B
  – Recommend possible friends in social network
  – How to route IP packets or delivery trucks
• Graph clustering
  – Identify communities in social networks
  – Partition large graph to parallelize graph processing
• Minimum spanning trees
  – Connected graph of minimum total edge weight

More Graph Problems
• Bipartite graph matching
  – Match nodes on “left” with nodes on “right” side
  – E.g., match job seekers and employers, singles looking for dates, papers with reviewers
• Maximum flow
  – Maximum traffic between source and sink
  – E.g., optimize transportation networks
• Finding “special” nodes
  – E.g., disease hubs, leader of a community, people with influence

Graph Representations
• Usually one of these two:
  – Adjacency matrix
  – Adjacency list
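A minimal Python sketch of the two representations named above, using a small four-node directed graph. The function names are hypothetical and chosen only for illustration. The sketch converts an adjacency matrix into an adjacency list and then derives each node's inlinks, which takes a pass over the whole list because the list is organized by outlinks.

```python
def matrix_to_adjacency_list(matrix):
    """Turn a square 0/1 adjacency matrix into {node: [outlink targets]}.

    Nodes are numbered from 1, matching the row/column order of the matrix.
    """
    return {
        i + 1: [j + 1 for j, weight in enumerate(row) if weight != 0]
        for i, row in enumerate(matrix)
    }


def invert_adjacency_list(adjacency):
    """Compute the inlinks of every node by scanning all outlink lists."""
    inlinks = {node: [] for node in adjacency}
    for source, targets in adjacency.items():
        for target in targets:
            inlinks[target].append(source)
    return inlinks


# Illustrative graph: entry [i][j] is 1 if there is an edge from node i+1 to node j+1.
example_matrix = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]

example_list = matrix_to_adjacency_list(example_matrix)
example_inlinks = invert_adjacency_list(example_list)
print(example_list)
print(example_inlinks)
```

The matrix stores |N| × |N| entries regardless of how many edges exist, while the list stores one entry per edge, which is why the list form is usually preferred for sparse graphs.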
Adjacency Matrix
• Matrix M of size |N| by |N|
  – Entry M(i,j) contains weight of edge from node i to node j; 0 if no edge

      1  2  3  4
  1   0  1  0  1
  2   1  0  1  1
  3   1  0  0  0
  4   1  0  1  0

Example source: Jimmy Lin

Properties
• Advantages
  – Easy to manipulate with linear algebra
    • M · M: entry (i,j) = number of two-step paths to go from node i to node j
  – Operation on outlinks and inlinks corresponds to iteration over rows and columns
• Disadvantage
  – Huge space overhead for sparse matrix
  – E.g., Facebook friendship graph

Adjacency List
• Compact row-wise representation of matrix

      1  2  3  4
  1   0  1  0  1        1: 2, 4
  2   1  0  1  1        2: 1, 3, 4
  3   1  0  0  0        3: 1
  4   1  0  1  0        4: 1, 3

Properties
• Advantages
  – More space-efficient
  – Still easy to compute over outlinks for each node
• Disadvantage
  – Difficult to compute over inlinks for each node
• Note: remember inverse Web graph discussion

Parallel Breadth-First Search
• Case study: single-source shortest path problem
  – Find the shortest path from a source node s to all other nodes in the graph
• For non-negative edge weights, Dijkstra’s algorithm is the classic sequential solution
  – Initialize distance d[s] = 0, all others to ∞
  – Maintain priority queue of nodes sorted by distance
  – Remove first node u from queue and update d[v] for each node v in adjacency list of u if (1) v is in queue and (2) d[v] > d[u] + weight(u,v)
Example from Jimmy Lin’s presentation

Dijkstra’s Algorithm Example
[Figure: weighted directed example graph with edge weights between 1 and 10; initial state, source node has distance 0. Example from CLR]
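The sequential procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the slides' own code: the node names s, t, x, y, z and the edge list are assumptions, chosen to match the textbook graph that the CLR figures appear to show. Python's `heapq` has no decrease-key operation, so the sketch pushes a new queue entry on every update and skips stale entries when they are popped.

```python
import heapq
import math


def dijkstra(graph, source):
    """Single-source shortest paths for non-negative edge weights.

    graph maps each node to a list of (neighbor, weight) pairs.
    """
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    queue = [(0, source)]  # priority queue ordered by tentative distance
    finished = set()       # nodes already removed from the queue
    while queue:
        d_u, u = heapq.heappop(queue)
        if u in finished:
            continue  # stale entry left over from an earlier update
        finished.add(u)
        for v, weight in graph[u]:
            # Update only if v is still "in the queue" and the new path is shorter.
            if v not in finished and dist[v] > d_u + weight:
                dist[v] = d_u + weight
                heapq.heappush(queue, (dist[v], v))
    return dist


# Assumed example graph (hypothetical labels, weights as in the CLR-style figure).
example_graph = {
    "s": [("t", 10), ("y", 5)],
    "t": [("x", 1), ("y", 2)],
    "x": [("z", 4)],
    "y": [("t", 3), ("x", 9), ("z", 2)],
    "z": [("s", 7), ("x", 6)],
}

distances = dijkstra(example_graph, "s")
print(distances)
```

Each loop iteration depends on the globally smallest tentative distance, which is the sequential bottleneck that a MapReduce formulation has to work around.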
Dijkstra’s Algorithm Example (continued)
[Figures: remaining steps of Dijkstra’s algorithm on the same example graph; the distances of the non-source nodes settle at 8, 9, 5, and 7. Example from CLR]

Parallel Single-Source Shortest Path
• Priority queue is core element of Dijkstra’s algorithm
  – No global shared data structure in MapReduce
• Dijkstra’s algorithm proceeds sequentially, node by node
  – Taking non-min node could affect correctness of algorithm
• Solution: perform parallel breadth-first search
Parallel Breadth-First Search
• Start at source s
• In first round, find all nodes reachable in one hop from s
• In second round, find all nodes reachable in two hops from s, and so on
• Keep track of min distance for each node
  – Also record corresponding path
• Iterations stop when no shorter path possible

BFS Visualization
[Figure: BFS visualization on an example graph with nodes n0 to n9. Example from Jimmy Lin’s presentation]

MapReduce Code: Single Iteration

  map(nid n, node N)
    // N stores node’s current min distance and adjacency list
    d = N.distance
    emit(nid n, N)                      // Pass along graph structure
    for all nid m in N.adjacencyList do
      emit(nid m, d + w(n,m))           // Emit distances to reachable nodes

  reduce(nid m, [d1,d2,…])
    dMin = ∞; M = ∅
    for all d in [d1,d2,…] do
      if isNode(d) then
        M = d                           // Recover graph structure
      else if d < dMin then             // Look for min distance in list
        dMin = d
    if dMin < M.distance                // Needed to avoid overwriting of source node’s distance
      M.distance = dMin                 // Update node’s shortest distance
    emit(nid m, node M)

Overall Algorithm
• Need driver program to control the iterations
• Initialization: SourceNode.distance = 0, all others have distance = ∞
• When to stop iterating?
• If all edges have weight 1, can stop as soon as no node has distance ∞ any more (assuming every node is reachable from the source)
  – Can detect this with Hadoop counter
• Number of iterations depends on graph diameter
  – In practice, many networks show the small-world phenomenon, e.g., six degrees of separation

Dealing With Diverse Edge Weights
• “Detour” path can be shorter than “direct” connection, hence cannot stop as soon as all node distances are finite
• Stop when no node’s shortest distance changes any more
  – Can be detected with Hadoop counter
  – Worst case: |N| iterations
[Figure: example graph with nodes n1 to n9 in which all edges have weight 1 except one edge of weight 10. Example from Jimmy Lin’s presentation]

MapReduce Algorithm Analysis
• Brute-force approach that performs many irrelevant computations
  – Computes distances for nodes that still have infinity distance
  – Repeats previous computations inside “search frontier”
• Dijkstra’s algorithm only explores the search frontier, but needs the priority queue
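The map and reduce functions and the driver loop described above can be simulated in plain Python, without Hadoop. This is a sketch under stated assumptions: the in-memory grouping in `run_iteration` stands in for the shuffle phase, the `updates` count stands in for a Hadoop counter, every node is assumed to have its own record (possibly with an empty adjacency list), and all function names and the example graph are hypothetical.

```python
import math
from collections import defaultdict


def map_node(nid, node):
    """Emit the node record itself, plus a candidate distance per outlink."""
    yield nid, node  # pass along graph structure
    for m, weight in node["adjacency"].items():
        yield m, node["distance"] + weight


def reduce_node(values):
    """Recover the node record and apply the smallest candidate distance."""
    d_min, node = math.inf, None
    for value in values:
        if isinstance(value, dict):
            node = value  # the node record (graph structure)
        elif value < d_min:
            d_min = value
    changed = d_min < node["distance"]
    if changed:
        node = dict(node, distance=d_min)
    return node, changed


def run_iteration(graph):
    """One simulated MapReduce round; returns the new graph and an update count."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for nid, node in graph.items():
        for key, value in map_node(nid, node):
            groups[key].append(value)
    new_graph, updates = {}, 0
    for nid, values in groups.items():
        new_graph[nid], changed = reduce_node(values)
        updates += changed  # plays the role of a Hadoop counter
    return new_graph, updates


def shortest_paths(adjacency, source):
    """Driver: iterate until no node's shortest distance changes any more."""
    graph = {
        nid: {"distance": 0 if nid == source else math.inf, "adjacency": dict(adj)}
        for nid, adj in adjacency.items()
    }
    iterations = 0
    while True:
        graph, updates = run_iteration(graph)
        iterations += 1
        if updates == 0:
            return {nid: n["distance"] for nid, n in graph.items()}, iterations


# Hypothetical graph where the detour a -> c -> d -> b beats the direct edge a -> b.
example = {"a": {"b": 10, "c": 1}, "b": {}, "c": {"d": 1}, "d": {"b": 1}}
result, rounds = shortest_paths(example, "a")
print(result, rounds)
```

In this example node b first receives distance 10 over the direct edge and is corrected to 3 only two rounds later, which is why the driver waits for a round without any update instead of stopping once all distances are finite.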
Typical Graph Processing in MapReduce
• Graph represented by adjacency list per node, plus extra node data
• Map works on a single node u
  – Node u’s local state and links only
• Node v in u’s adjacency list is intermediate key
  – Passes results of computation along outgoing edges
• Reduce combines partial results for each destination node
• Map also passes graph itself to reducers
• Driver program controls execution of iterations

PageRank Introduction
• Popularized by Google for evaluating the quality of a Web page
• Based on random Web surfer model
  – Web surfer can reach a page by jumping to it or by following the link from another page pointing to it
  – Modeled as random process
• Intuition: important pages are linked from many other (important) pages
  – Goal: find pages with greatest probability of access

PageRank Definition
• PageRank of page n:
  – $P(n) = \alpha \frac{1}{|V|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)}$
  – |V| is number of pages (nodes)
  – α is probability of random jump
  – L(n) is the set of pages linking to n
  – P(m) is m’s PageRank
  – C(m) is m’s out-degree
• Definition is recursive
  – Compute by iterating until convergence (fixpoint)

Computing PageRank
• Similar to BFS for shortest path
• Computing P(n) only requires P(m) and C(m) for all pages linking to n
  – During iteration, distribute P(m) evenly over outlinks
  – Then add contributions over all of n’s inlinks
• Initialization: any probability distribution over the nodes

PageRank Example
[Figure: five-node example graph, PageRank values before and after each of two iterations]

  Node   Initial   After iteration 1   After iteration 2
  n1     0.2       0.066               0.1
  n2     0.2       0.166               0.133
  n3     0.2       0.166               0.183
  n4     0.2       0.3                 0.2
  n5     0.2       0.3                 0.383

Source: Jimmy Lin’s presentation
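One PageRank iteration as defined above can be sketched in Python. Several things here are assumptions rather than part of the slides: the function name, the requirement that every node has at least one outlink (dangling nodes are not handled), and the edge list of the five-node graph, which is not recoverable from the text and was chosen because it appears to reproduce the values of the example. Those values are consistent with α = 0, i.e., no random jump.

```python
from collections import defaultdict


def pagerank_iteration(graph, ranks, alpha):
    """Apply P(n) = alpha / |V| + (1 - alpha) * sum of P(m) / C(m) over inlinks.

    graph maps each node to its list of outlink targets; every node is
    assumed to have at least one outlink.
    """
    incoming = defaultdict(float)
    for m, outlinks in graph.items():
        share = ranks[m] / len(outlinks)  # distribute P(m) evenly over outlinks
        for target in outlinks:
            incoming[target] += share     # add contributions over inlinks
    num_nodes = len(graph)
    return {
        n: alpha / num_nodes + (1 - alpha) * incoming[n]
        for n in graph
    }


# Assumed edge list for the five-node example (not given in the text).
example_graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

ranks0 = {n: 0.2 for n in example_graph}  # uniform initial distribution
ranks1 = pagerank_iteration(example_graph, ranks0, alpha=0.0)
ranks2 = pagerank_iteration(example_graph, ranks1, alpha=0.0)
print(ranks1)
print(ranks2)
```

In a MapReduce setting the loop over outlinks corresponds to the map phase and the per-node sum to the reduce phase, with a driver repeating the iteration until the values stop changing.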