A foray into graph mining
Neil Shah, April 15th, 2019
(Graph) data is prevalent
• 2.5 exabytes of data produced every day
• 90% of it generated in the last 2 years
• Data is produced as the product of a highly interconnected world
[Figure: platform statistics, e.g. 1.3 billion users, 244 million users, 187 million daily actives, 480 million products, 1 billion daily mobile views, 3.5 billion daily snaps]
(Graph) data shapes perspectives
[Figure: examples of graph-powered applications, e.g. search engine ranking and product recommendation]
What’s in a graph?
• Graphs consist of nodes, edges and attributes
• e.g. the Facebook social network, where:
• nodes = individuals
• edges = friendships
• attributes = gender (node), # of messages exchanged (edge)
• Graphs can easily model relationships between entities:
• Who-follows-whom on a social network
• Who-buys-what on an e-commerce platform
• Who-calls-whom using a certain cellular provider
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Graph preliminaries – directionality
[Figure: two users-by-users graphs over nodes u1–u11, illustrating undirected vs. directed edges]
Graph preliminaries – degree
• Degree: # of adjacent edges
• Degree(u7) = 2
[Figure: users-by-users graph over nodes u1–u11]
Graph preliminaries – out- and in-degree
• Degree: # of adjacent edges
• Out-degree: # of outgoing edges
• In-degree: # of incoming edges
• Out-degree(u4) = 1
• In-degree(u6) = 2
[Figure: directed users-by-users graph]
Graph preliminaries – weighted degree
• Weighted degree: total sum of adjacent edge weights
• i.e. “how many times did two users communicate”
• Weighted-degree(u6) = 7
[Figure: weighted users-by-users graph]
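The degree notions above can be computed directly from an edge list; a minimal sketch on a made-up toy graph (the edges and weights below are illustrative, not the exact graph from the slides):

```python
from collections import defaultdict

# Toy directed, weighted edge list: (source, target, weight).
edges = [("u1", "u2", 3), ("u2", "u3", 4), ("u3", "u2", 1), ("u4", "u3", 2)]

out_deg = defaultdict(int)       # of outgoing edges
in_deg = defaultdict(int)        # of incoming edges
weighted_deg = defaultdict(int)  # sum of adjacent edge weights

for src, dst, w in edges:
    out_deg[src] += 1
    in_deg[dst] += 1
    weighted_deg[src] += w  # each edge's weight counts toward both endpoints
    weighted_deg[dst] += w

print(out_deg["u2"], in_deg["u2"], weighted_deg["u2"])
```

A node's (unweighted, undirected) degree is just its in-degree plus its out-degree here.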
Graph preliminaries – ego(net)
• Ego: a single, central node
• Ego network (egonet): the nodes and edges within one “hop” of the ego
• Egonet(u7) = nodes {u7, u3, u5}, edges {u7-u3, u7-u5}
[Figure: users-by-users graph]
Graph preliminaries – connectivity
• Two nodes are connected if there is a path between them.
• A graph is fully connected if all node pairs are connected.
• u1 and u8 are connected; u3 and u5 are connected
• u1 and u9 are not connected
• This graph is not fully connected
[Figure: users-by-users graph]
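Connectivity checks reduce to reachability, which breadth-first search answers directly; a sketch on a hypothetical graph with two components (the edges below are made up for illustration):

```python
from collections import deque, defaultdict

# Undirected toy graph with two components: {u1..u4} and {u9, u10}.
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u9", "u10")]
nodes = {"u1", "u2", "u3", "u4", "u9", "u10"}

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def connected(a, b):
    """BFS from a; returns True iff b is reachable (a path exists)."""
    seen, queue = {a}, deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return True
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

def fully_connected():
    """All pairs connected iff everything is reachable from any one node."""
    ordered = sorted(nodes)
    return all(connected(ordered[0], n) for n in ordered[1:])

print(connected("u1", "u4"), connected("u1", "u9"), fully_connected())
```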
Graph preliminaries – node and edge types
• A heterogeneous graph has multiple node and/or edge types.
• e.g. users and products
• Who-buys-what and who-rates-what
[Figure: bipartite users-by-products graph over users u1–u6 and products p1–p5]
Graph preliminaries – matrix representation
• Graph connectivity can be summarized in an adjacency matrix A.
• A_{i,j} = # (or weight) of edges from node i to node j
• A is usually very sparse (makes compact representations possible!)
[Figure: users-by-users graph and its adjacency matrix]
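Because A is usually very sparse, storing only the nonzero entries is far cheaper than a dense n-by-n array; a minimal sketch of such a compact representation (the edge list is a made-up example):

```python
# Sparse adjacency "matrix" as a dict keyed by (i, j): only nonzero
# entries are stored, so memory scales with #edges, not #nodes squared.
edges = [(0, 1), (1, 2), (2, 0), (3, 2)]  # toy directed edges

A = {}
for i, j in edges:
    A[(i, j)] = A.get((i, j), 0) + 1  # count (or sum weights of) i -> j edges

def entry(i, j):
    """A_{i,j} with an implicit 0 for absent (i, j) pairs."""
    return A.get((i, j), 0)

# 4 stored nonzeros vs. 16 entries a dense 4x4 matrix would need.
print(entry(0, 1), entry(1, 0), len(A))
```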
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Key question: what does a graph “look like”?
• At first look: large, unwieldy and seemingly random.
• Spoiler: in actuality, most real-world graphs are far from random.
[Figure: trace-route paths on the internet (Lyon ’03)]
A quick detour: “random” graphs
• Erdős–Rényi random graph model: graph G(n, p)
• n = number of nodes
• p = probability of an edge between any two nodes (edges independent)
• Expected # of edges: E[m] = p * n(n-1)/2
• Degree distribution (binomial): P(deg = k) = C(n-1, k) * p^k * (1-p)^(n-1-k)
[Figure: Babaoglu ’18]
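The G(n, p) model is simple enough to sample directly, and the edge count concentrates around its expectation p * n(n-1)/2; a small sketch (n and p are arbitrary choices for illustration):

```python
import random

def erdos_renyi(n, p, seed=0):
    """Sample an undirected G(n, p): each of the n(n-1)/2 node pairs
    becomes an edge independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

n, p = 200, 0.05
edges = erdos_renyi(n, p)
expected = p * n * (n - 1) / 2  # E[m] = 995 for these parameters
print(len(edges), expected)
```

The sampled count lands close to 995 because the number of edges is Binomial(n(n-1)/2, p), whose standard deviation here is only about 31.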
What about real graphs?
[Figures: Faloutsos ’99, log(# peers) vs. log(# routers); Adamic ’02, log(# visitors) vs. log(# sites); Viswanath ’09, log(# posts) vs. log(# users)]
• X-axis: degree, Y-axis: frequency/probability
• Degree distributions of real graphs are not “random”
• What exactly are they, then?
The “scale-free” property
• Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: P(deg = k) ∝ k^(-α)
• Scaling the input by a multiple simply results in proportional scaling of the whole function
• Power laws are linear on log-log scales
• Typically 2 ≤ α ≤ 3
[Figure: log(# visitors) vs. log(# sites)]
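Since power laws are linear on log-log axes, the exponent α can be estimated with a least-squares line fit in log space; a minimal sketch on synthetic data (the histogram and its constant are made up, chosen to follow an exact power law):

```python
import math

# Synthetic degree histogram following freq(k) = 1000 * k^(-2.5) exactly,
# so the log-log fit should recover alpha = 2.5.
alpha_true = 2.5
degrees = list(range(1, 50))
freqs = [1000 * k ** (-alpha_true) for k in degrees]

# Ordinary least squares on (log k, log freq): slope = -alpha.
xs = [math.log(k) for k in degrees]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

alpha_hat = -slope
print(round(alpha_hat, 3))  # ~2.5
```

On real (noisy, binned) data, a plain log-log regression is known to be biased; it is shown here only to make the "linear on log-log scales" point concrete.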
Scale-freeness is evident in many domains Newman ‘05
Why are many real graphs scale-free?
• Hypothesis: preferential attachment, or a “rich-get-richer” effect
• Generative process to construct a network:
• Start with m0 nodes, each with at least 1 edge
• At each timestep, add a new node with m edges connecting it to m already-existing nodes
• The probability that the new node connects to node i depends on i's degree k_i, as P(i) = k_i / Σ_j k_j
• Many real-world variants of this effect: academic citations, recommendation, virality
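The generative process above can be sketched in a few lines; this is the standard repeated-nodes implementation trick (sampling uniformly from a list in which node i appears deg(i) times is exactly degree-proportional sampling), with n and m chosen arbitrarily for illustration:

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches m edges to existing
    nodes with probability proportional to their current degree."""
    rng = random.Random(seed)
    targets = list(range(m))  # seed nodes, each listed once to bootstrap
    edges = []
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:  # m distinct, degree-proportional targets
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.append(t)          # t's degree grew by 1
        targets.extend([new] * m)      # new node enters with degree m
    return edges

edges = preferential_attachment(500, 2)
deg = Counter()
for a, b in edges:
    deg[a] += 1
    deg[b] += 1
print(deg.most_common(3))  # early nodes become hubs: rich get richer
```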
Real graphs have “small-world” effects • How “far apart” are nodes in real graphs? • Interestingly, not very far! The typical number is 6 . You may have heard of the “six degrees of separation” • Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64) • Leskovec ‘08: avg. distance between node pairs on MSN messenger has mode 6 (sample size 180M nodes and 1.3B edges)
What causes the small-world effect?
• Hypothesis: the abundance of hubs, or high-degree nodes
• Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which provide short paths between them
How do real graphs “grow” over time?
• Consider a time-evolving graph G with n(t) nodes and e(t) edges at time t
• Suppose that n(t+1) = 2n(t); what is e(t+1)?
• Not only is it > 2e(t); the growth is actually superlinear and follows the power law e(t) ∝ n(t)^a, with 1 ≤ a ≤ 2 generally
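Given two snapshots of a growing graph, the densification exponent a follows directly from e(t) ∝ n(t)^a; a tiny worked example (the snapshot sizes are hypothetical):

```python
import math

# Hypothetical snapshots of a growing graph: (num_nodes, num_edges).
n1, e1 = 10_000, 40_000
n2, e2 = 20_000, 120_000  # nodes doubled, but edges tripled: superlinear

# e(t) ∝ n(t)^a  =>  a = log(e2/e1) / log(n2/n1)
a = math.log(e2 / e1) / math.log(n2 / n1)
print(round(a, 3))  # log(3)/log(2) ~ 1.585, inside the typical 1 <= a <= 2
```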
Real graphs exhibit densification
[Figures: avg. out-degree increases over time; power law in # edges vs. # nodes over time]
Moreover, the graph diameter shrinks • Graph diameter = max(distance between node pairs) • Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer • Hypothesis: Once again due to prevalence and growth of hubs
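The diameter definition above is computable by running BFS from every node and taking the largest shortest-path distance; a sketch on a toy graph that also mirrors the hub hypothesis (adding a shortcut edge shrinks the diameter; the example edges are made up):

```python
from collections import deque, defaultdict

def diameter(edges):
    """Max over node pairs of shortest-path length (undirected, connected)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def eccentricity(src):
        # BFS distances from src; eccentricity is the farthest one.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        return max(dist.values())

    return max(eccentricity(v) for v in adj)

# A path 0-1-2-3 has diameter 3; adding the shortcut 0-3 shrinks it to 2,
# analogous to hubs pulling real graphs' nodes closer together.
path = [(0, 1), (1, 2), (2, 3)]
print(diameter(path), diameter(path + [(0, 3)]))
```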
Much more work done on graph behaviors • Generative graph models (Leskovec ‘05) • Patterns in sizes of connected components (Kang ‘10) • Node in-degree (popularity) over time (McGlohon ‘07) • Duration of calls in phone-call networks (Vaz de Melo ‘10) • Temporal structure evolution (Shah ‘15) … the list goes on
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Key question: how can we leverage graphs for recommendation/ranking tasks? • Measuring webpage importance • Link prediction and recommendation • Local methods • Global methods
PageRank for large-scale search engines
• Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance?
• Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns
• 150 million web pages → 1.7 billion links
• Backlinks and forward links: A and B are C’s backlinks; C is a forward link of both A and B
Content adapted from Li ’09
Simplified PageRank
Idea: each page distributes its own PageRank equally to its forward links, recursively. “An important page has many important pages pointing to it.”
• u: a web page
• B_u: the set of u’s backlinks
• N_v: the number of forward links of page v
• c: a normalization factor making R a probability distribution
• R(u) = c * Σ_{v ∈ B_u} R(v) / N_v
• Simplified PageRank is the stationary probability distribution of a random walk on the graph: a surfer keeps clicking successive links at random.
Simplified PageRank: worked example
• Pages: Yahoo, Amazon (Amzn), Microsoft (MS)
• Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”
• The adjacency matrix is transposed and column-normalized (this accounts for each page distributing rank equally among its forward links)
• Starting from uniform initial PageRank scores, the calculation is repeated iteration by iteration until the scores converge
Problem with simplified PageRank
• A loop (“rank sink”): during each iteration, the loop accumulates rank but never distributes rank to other pages!