Search Engine Architecture 6. Link Analysis
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Noted slides adapted from Lin et al.’s Big Data Infrastructure, UMD Spring 2015, with cosmetic changes.

Today’s Agenda
• Graph problems and representations
• Parallel breadth-first search
• PageRank
• Optimizing graph algorithms

What’s a graph?
• G = (V, E), where
  • V represents the set of vertices (nodes)
  • E represents the set of edges (links)
  • Both vertices and edges may contain additional information
• Different types of graphs:
  • Directed vs. undirected edges
  • Presence or absence of cycles
• Graphs are everywhere:
  • Hyperlink structure of the web
  • Physical structure of computers on the Internet
  • Interstate highway system
  • Social networks
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Some Graph Problems
• Finding shortest paths
  • Routing Internet traffic and UPS trucks
• Finding minimum spanning trees
  • Telco laying down fiber
• Finding max flow
  • Airline scheduling
• Identifying “special” nodes and communities
  • Breaking up terrorist cells, spread of avian flu
• Bipartite matching
  • Monster.com, Match.com
• And of course... PageRank
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Graphs and MapReduce
• A large class of graph algorithms involve:
  • Performing computations at each node, based on node features, edge features, and local link structure
  • Propagating computations: “traversing” the graph
• Key questions:
  • How do you represent graph data in MapReduce?
  • How do you traverse a graph in MapReduce?
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Representing Graphs
• G = (V, E)
• Two common representations
  • Adjacency matrix
  • Adjacency list
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Adjacency Matrices
Represent a graph as an n x n square matrix M
• n = |V|
• M_ij = 1 means a link from node i to node j

      1  2  3  4
  1   0  1  0  1
  2   1  0  1  1
  3   1  0  0  0
  4   1  0  1  0

[Figure: the corresponding directed graph on nodes 1 to 4.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Adjacency Matrices: Critique
• Advantages:
  • Amenable to mathematical manipulation
  • Iteration over rows and columns corresponds to computations on outlinks and inlinks
• Disadvantages:
  • Lots of zeros for sparse matrices
  • Lots of wasted space
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Adjacency Lists
Take adjacency matrices… and throw away all the zeros

      1  2  3  4
  1   0  1  0  1        1: 2, 4
  2   1  0  1  1        2: 1, 3, 4
  3   1  0  0  0        3: 1
  4   1  0  1  0        4: 1, 3

Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
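As a minimal sketch of "throwing away the zeros", the Python below converts the 4 x 4 matrix from the slide into an adjacency list. The helper name `matrix_to_adjacency_list` and the 1-based node numbering in the dictionary keys are assumptions made for illustration.

```python
# Adjacency matrix from the slide: M[i][j] == 1 means a link
# from node i+1 to node j+1 (the slide numbers nodes from 1).
M = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
]


def matrix_to_adjacency_list(matrix):
    """Keep only the non-zero entries of each row (hypothetical helper)."""
    return {
        i + 1: [j + 1 for j, cell in enumerate(row) if cell]
        for i, row in enumerate(matrix)
    }


adjacency = matrix_to_adjacency_list(M)
print(adjacency)
```

The matrix stores n * n cells regardless of how many links exist, while the list stores one entry per link, which is where the space saving on sparse graphs comes from.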

Adjacency Lists: Critique
• Advantages:
  • Much more compact representation
  • Easy to compute over outlinks
• Disadvantages:
  • Much more difficult to compute over inlinks
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
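To illustrate why inlinks are harder: an adjacency list answers "where does node 2 point?" with a single lookup, but "who points to node 2?" requires a pass over every list. A rough sketch, assuming the four-node adjacency list from the earlier slide; the function name `invert` is hypothetical.

```python
# Adjacency list from the "Adjacency Lists" slide (outlinks per node).
adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}


def invert(adjacency):
    """Build the inlink lists by scanning every outlink once."""
    inlinks = {node: [] for node in adjacency}
    for source, targets in adjacency.items():
        for target in targets:
            inlinks.setdefault(target, []).append(source)
    return {node: sorted(sources) for node, sources in inlinks.items()}


outlinks_of_2 = adjacency[2]   # one lookup
inlinks = invert(adjacency)    # full scan of all lists
print(outlinks_of_2, inlinks[2])
```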

Single-Source Shortest Path
• Problem: find shortest path from a source node to one or more target nodes
  • Shortest might also mean lowest weight or cost
• First, a refresher: Dijkstra’s Algorithm
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Dijkstra’s Algorithm Example
[Figure: weighted graph with edge weights 1, 2, 2, 3, 4, 5, 6, 7, 9, 10. Tentative distances: source 0; all other nodes ∞. Example from CLR.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Dijkstra’s Algorithm Example
[Figure: same graph after the first step. Tentative distances: source 0; two nodes updated to 10 and 5; the remaining two still ∞. Example from CLR.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Dijkstra’s Algorithm Example
[Figure: same graph after the second step. Tentative distances: source 0; other nodes 8, 14, 5, and 7. Example from CLR.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
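For reference, a minimal heap-based sketch of Dijkstra's algorithm in Python. The slides preserve only the edge weights and the tentative distances, so the edge endpoints and the node names `s`, `t`, `x`, `y`, `z` below are assumptions, chosen to be consistent with those numbers.

```python
import heapq

# Assumed edge layout; weights match the ten weights shown in the figure.
graph = {
    "s": {"t": 10, "y": 5},
    "t": {"x": 1, "y": 2},
    "x": {"z": 4},
    "y": {"t": 3, "x": 9, "z": 2},
    "z": {"s": 7, "x": 6},
}


def dijkstra(graph, source):
    """Return the lowest-cost distance from source to every node."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]
    settled = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in settled:
            continue  # stale heap entry for an already settled node
        settled.add(node)
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist[neighbor]:
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


distances = dijkstra(graph, "s")
print(distances)
```

The key property is that only the single cheapest unsettled node is expanded at each step, which relies on one global priority queue.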

Single-Source Shortest Path
• Problem: find shortest path from a source node to one or more target nodes
  • Shortest might also mean lowest weight or cost
• Single processor machine: Dijkstra’s Algorithm
• MapReduce: parallel breadth-first search (BFS)
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Finding the Shortest Path
• Consider simple case of equal edge weights
• Solution to the problem can be defined inductively
• Here’s the intuition:
  • Define: b is reachable from a if b is on adjacency list of a
  • DISTANCETO(s) = 0
  • For all nodes p reachable from s, DISTANCETO(p) = 1
  • For all nodes n reachable from some other set of nodes M, DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
[Figure: source s reaches node n through intermediate nodes m1, m2, m3, which lie at distances d1, d2, d3 from s.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
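The inductive definition can be sketched on a single machine as a level-by-level search: each pass assigns distance d + 1 to every unvisited node reachable from the nodes at distance d. This is an illustrative sketch only; `distance_to` is a hypothetical name, and the sample graph reuses the four-node adjacency list from the earlier slides.

```python
def distance_to(adjacency, source):
    """Hop counts from source, following the inductive definition."""
    distance = {source: 0}            # DISTANCETO(s) = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for m in frontier:
            for n in adjacency.get(m, []):
                if n not in distance:  # first discovery is the minimum
                    distance[n] = distance[m] + 1
                    next_frontier.append(n)
        frontier = next_frontier
    return distance


adjacency = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}
hops = distance_to(adjacency, 1)
print(hops)
```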

Visualizing Parallel BFS
[Figure: graph of nodes n0 to n9.]

Source: Wikipedia (Wave), via Lin et al., Big Data Infrastructure, UMD Spring 2015.

From Intuition to Algorithm
• Data representation:
  • Key: node n
  • Value: d (distance from start), adjacency list (nodes reachable from n)
  • Initialization: for all nodes except for start node, d = ∞
• Mapper:
  • ∀ m ∈ adjacency list: emit (m, d + 1)
• Sort/Shuffle
  • Groups distances by reachable nodes
• Reducer:
  • Selects minimum distance path for each reachable node
  • Additional bookkeeping needed to keep track of actual path
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
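One round of the mapper, shuffle, and reducer described above can be simulated in plain Python. This is a sketch under assumptions, not Hadoop code: the function names, the sample graph, and the choice to have the mapper re-emit each node's own current distance (so that already-reached nodes keep their value) are all illustrative.

```python
from collections import defaultdict

INF = float("inf")


def mapper(node, value):
    """Emit (m, d + 1) for every m on the adjacency list of node."""
    d, adjacency_list = value
    yield node, d                      # assumption: keep the current distance
    if d < INF:                        # unreached nodes have nothing to offer
        for m in adjacency_list:
            yield m, d + 1


def shuffle(pairs):
    """Group emitted distances by the node they refer to."""
    groups = defaultdict(list)
    for key, distance in pairs:
        groups[key].append(distance)
    return groups


def reducer(node, distances):
    """Select the minimum distance seen for this node."""
    return node, min(distances)


# Hypothetical graph: key -> (distance from start, adjacency list).
graph = {
    "n1": (0, ["n2", "n3"]),
    "n2": (INF, ["n4"]),
    "n3": (INF, ["n4"]),
    "n4": (INF, []),
}

pairs = [pair for node, value in graph.items() for pair in mapper(node, value)]
round_one = dict(reducer(node, ds) for node, ds in shuffle(pairs).items())
print(round_one)
```

After this single round only the direct neighbors of the start node have finite distances; `n4` is still unreached.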

Multiple Iterations Needed
• Each MapReduce iteration advances the “frontier” by one hop
  • Subsequent iterations include more and more reachable nodes as frontier expands
  • Multiple iterations are needed to explore entire graph
• Preserving graph structure:
  • Problem: Where did the adjacency list go?
  • Solution: mapper emits (n, adjacency list) as well
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

BFS Pseudo-Code
[Figure: pseudo-code listing for parallel BFS.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
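A minimal in-process sketch of the whole iterative job in Python follows. It is not the pseudo-code from the slide; it only illustrates the two ideas from the preceding slides: the mapper passes the adjacency list along with the distances, and a driver repeats the job until nothing changes. The tags `"graph"` and `"dist"`, all function names, and the sample graph are assumptions.

```python
from collections import defaultdict

INF = float("inf")


def mapper(node, record):
    d, adjacency_list = record
    yield node, ("graph", adjacency_list)   # preserve the graph structure
    yield node, ("dist", d)                 # preserve the current distance
    if d < INF:
        for m in adjacency_list:
            yield m, ("dist", d + 1)


def reducer(node, values):
    adjacency_list, best = [], INF
    for kind, payload in values:
        if kind == "graph":
            adjacency_list = payload        # recover the structure
        else:
            best = min(best, payload)       # keep the shortest distance
    return node, (best, adjacency_list)


def run_iteration(records):
    """Simulate one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for node, record in records.items():
        for key, value in mapper(node, record):
            groups[key].append(value)
    return dict(reducer(node, values) for node, values in groups.items())


def parallel_bfs(adjacency, source):
    """Driver: rerun the job until no distance changes."""
    records = {
        n: (0 if n == source else INF, list(out))
        for n, out in adjacency.items()
    }
    iterations = 0
    while True:
        updated = run_iteration(records)
        if updated == records:
            return {n: d for n, (d, _) in records.items()}, iterations
        records, iterations = updated, iterations + 1


adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances, iterations = parallel_bfs(adjacency, "a")
print(distances, iterations)
```

In a real cluster the driver would typically detect convergence with a job counter rather than by comparing full outputs; the comparison here just keeps the sketch short.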

Single Source: Weighted Edges
• Now add positive weights to the edges
  • Why can’t edge weights be negative?
• Simple change: add weight w for each edge in adjacency list
  • In mapper, emit (m, d + w_p) instead of (m, d + 1) for each node m
• That’s it?
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Stopping Criterion
• How many iterations are needed in parallel BFS (positive edge weight case)?
• Convince yourself: when a node is first “discovered”, we’ve found the shortest path
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Stopping Criterion
• How many iterations are needed in parallel BFS (positive edge weight case)?
• Convince yourself: when a node is first “discovered”, we’ve found the shortest path
• Not true!
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.
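A small counterexample makes the "Not true!" concrete. In the hypothetical graph below, node `a` is first discovered through a direct edge of weight 10, but a three-hop path of total weight 3 only arrives two rounds later. The sketch collapses each map/shuffle/reduce round into one `relax_round` function (an assumed name) and stops when a round changes no distance.

```python
INF = float("inf")


def relax_round(weighted, dist):
    """One parallel round: every reached node offers d + w to its neighbors."""
    new_dist = dict(dist)
    for node, edges in weighted.items():
        if dist[node] == INF:
            continue
        for neighbor, w in edges.items():
            # Uses the previous round's distances, as a mapper would.
            new_dist[neighbor] = min(new_dist[neighbor], dist[node] + w)
    return new_dist


# Hypothetical weighted graph: node -> {neighbor: edge weight}.
weighted = {
    "s": {"a": 10, "b": 1},
    "b": {"c": 1},
    "c": {"a": 1},
    "a": {},
}

dist = {node: INF for node in weighted}
dist["s"] = 0
history_of_a = []                      # distance of "a" after each round
while True:
    updated = relax_round(weighted, dist)
    if updated == dist:                # stopping criterion: nothing changed
        break
    dist = updated
    history_of_a.append(dist["a"])

print(history_of_a, dist)
```

With weighted edges the job therefore has to keep iterating until no distance changes, rather than stopping once every node has been discovered.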

Comparison to Dijkstra
• Dijkstra’s algorithm is more efficient
  • At each step, only pursues edges from minimum-cost path inside frontier
• MapReduce explores all paths in parallel
  • Lots of “waste”
  • Useful work is only done at the “frontier”
• Why can’t we do better using MapReduce?
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

Additional Complexities
[Figure: search frontier; nodes s, p, q, r and n1 to n9; edges of weight 1, plus one edge of weight 10.]
Source: Lin et al., Big Data Infrastructure, UMD Spring 2015.

PageRank
