A foray into graph mining
Neil Shah, April 15th, 2019
(Graph) data is prevalent
• 2.5 exabytes of data produced every day
• 90% of it generated in the last 2 years
• Data is produced as the product of a highly interconnected world
[Figure: platform statistics, e.g. 1.3 billion users, 244 million users, 187 million daily actives, 480 million products, 1 billion daily mobile views, 3.5 billion daily snaps]
(Graph) data shapes perspectives
[Figure: examples of graph-powered applications, e.g. search engine ranking and product recommendation]
What’s in a graph?
• Graphs consist of nodes, edges and attributes
• e.g. the Facebook social network, where:
• nodes = individuals
• edges = friendships
• attributes = gender (node), # of messages exchanged (edge)
• Graphs can easily model relationships between entities:
• Who-follows-whom on a social network
• Who-buys-what on an e-commerce platform
• Who-calls-whom using a certain cellular provider
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Graph preliminaries – directionality
[Figure: two users-by-users graphs over nodes u1–u11, illustrating undirected vs. directed edges]
Graph preliminaries – degree
• Degree: # of adjacent edges
• Degree(u7) = 2
[Figure: users-by-users graph over nodes u1–u11]
Graph preliminaries – out- and in-degree
• Degree: # of adjacent edges
• Out-degree: # of outgoing edges
• In-degree: # of incoming edges
• Out-degree(u4) = 1
• In-degree(u6) = 2
[Figure: directed users-by-users graph]
Graph preliminaries – weighted degree
• Weighted degree: total sum of adjacent edge weights
• i.e. “how many times did two users communicate”
• Weighted-degree(u6) = 7
[Figure: weighted users-by-users graph]
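The degree notions above can be computed directly from an edge list; a minimal sketch on a made-up toy graph (the edges and weights below are illustrative, not the exact graph from the slides):

```python
from collections import defaultdict

# Toy directed, weighted edge list: (source, target, weight).
edges = [("u1", "u2", 3), ("u2", "u3", 4), ("u3", "u2", 1), ("u4", "u3", 2)]

out_deg = defaultdict(int)       # of outgoing edges
in_deg = defaultdict(int)        # of incoming edges
weighted_deg = defaultdict(int)  # sum of adjacent edge weights

for src, dst, w in edges:
    out_deg[src] += 1
    in_deg[dst] += 1
    weighted_deg[src] += w  # each edge's weight counts toward both endpoints
    weighted_deg[dst] += w

print(out_deg["u2"], in_deg["u2"], weighted_deg["u2"])
```

A node's (unweighted, undirected) degree is just its in-degree plus its out-degree here.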
Graph preliminaries – ego(net)
• Ego: a single, central node
• Ego network (egonet): the nodes and edges within one “hop” of the ego
• Egonet(u7) = nodes {u7, u3, u5}, edges {u7-u3, u7-u5}
[Figure: users-by-users graph]
Graph preliminaries – connectivity
• Two nodes are connected if there is a path between them.
• A graph is fully connected if all node pairs are connected.
• u1 and u8 are connected; u3 and u5 are connected
• u1 and u9 are not connected
• This graph is not fully connected
[Figure: users-by-users graph]
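Connectivity checks reduce to reachability, which breadth-first search answers directly; a sketch on a hypothetical graph with two components (the edges below are made up for illustration):

```python
from collections import deque, defaultdict

# Undirected toy graph with two components: {u1..u4} and {u9, u10}.
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u4"), ("u9", "u10")]
nodes = {"u1", "u2", "u3", "u4", "u9", "u10"}

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def connected(a, b):
    """BFS from a; returns True iff b is reachable (a path exists)."""
    seen, queue = {a}, deque([a])
    while queue:
        cur = queue.popleft()
        if cur == b:
            return True
        for nxt in adj[cur] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

def fully_connected():
    """All pairs connected iff everything is reachable from any one node."""
    ordered = sorted(nodes)
    return all(connected(ordered[0], n) for n in ordered[1:])

print(connected("u1", "u4"), connected("u1", "u9"), fully_connected())
```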
Graph preliminaries – node and edge types
• A heterogeneous graph has multiple node and/or edge types.
• e.g. users and products
• Who-buys-what and who-rates-what
[Figure: bipartite users-by-products graph over users u1–u6 and products p1–p5]
Graph preliminaries – matrix representation
• Graph connectivity can be summarized in an adjacency matrix A.
• A_{i,j} = # (or weight) of edges from node i to node j
• A is usually very sparse (makes compact representations possible!)
[Figure: users-by-users graph and its adjacency matrix]
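Because A is usually very sparse, storing only the nonzero entries is far cheaper than a dense n-by-n array; a minimal sketch of such a compact representation (the edge list is a made-up example):

```python
# Sparse adjacency "matrix" as a dict keyed by (i, j): only nonzero
# entries are stored, so memory scales with #edges, not #nodes squared.
edges = [(0, 1), (1, 2), (2, 0), (3, 2)]  # toy directed edges

A = {}
for i, j in edges:
    A[(i, j)] = A.get((i, j), 0) + 1  # count (or sum weights of) i -> j edges

def entry(i, j):
    """A_{i,j} with an implicit 0 for absent (i, j) pairs."""
    return A.get((i, j), 0)

# 4 stored nonzeros vs. 16 entries a dense 4x4 matrix would need.
print(entry(0, 1), entry(1, 0), len(A))
```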
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Key question: what does a graph “look like”?
• At first look: large, unwieldy and seemingly random.
• Spoiler: in actuality, most real-world graphs are far from random.
[Figure: trace-route paths on the internet (Lyon ’03)]
A quick detour: “random” graphs
• Erdős–Rényi random graph model: graph G(n, p)
• n = number of nodes
• p = probability of an edge between any two nodes (edges independent)
• Expected # of edges: E[m] = p * n(n-1)/2
• Degree distribution (binomial): P(deg = k) = C(n-1, k) * p^k * (1-p)^(n-1-k)
[Figure: Babaoglu ’18]
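The G(n, p) model is simple enough to sample directly, and the edge count concentrates around its expectation p * n(n-1)/2; a small sketch (n and p are arbitrary choices for illustration):

```python
import random

def erdos_renyi(n, p, seed=0):
    """Sample an undirected G(n, p): each of the n(n-1)/2 node pairs
    becomes an edge independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

n, p = 200, 0.05
edges = erdos_renyi(n, p)
expected = p * n * (n - 1) / 2  # E[m] = 995 for these parameters
print(len(edges), expected)
```

The sampled count lands close to 995 because the number of edges is Binomial(n(n-1)/2, p), whose standard deviation here is only about 31.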
What about real graphs?
[Figures: Faloutsos ’99, log(# peers) vs. log(# routers); Adamic ’02, log(# visitors) vs. log(# sites); Viswanath ’09, log(# posts) vs. log(# users)]
• X-axis: degree, Y-axis: frequency/probability
• Degree distributions of real graphs are not “random”
• What exactly are they, then?
The “scale-free” property
• Real-world graphs are often scale-free, meaning that their degree distribution obeys a power law: P(deg = k) ∝ k^(-α)
• Scaling the input by a multiple simply results in proportional scaling of the whole function
• Power laws are linear on log-log scales
• Typically 2 ≤ α ≤ 3
[Figure: log(# visitors) vs. log(# sites)]
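Since power laws are linear on log-log axes, the exponent α can be estimated with a least-squares line fit in log space; a minimal sketch on synthetic data (the histogram and its constant are made up, chosen to follow an exact power law):

```python
import math

# Synthetic degree histogram following freq(k) = 1000 * k^(-2.5) exactly,
# so the log-log fit should recover alpha = 2.5.
alpha_true = 2.5
degrees = list(range(1, 50))
freqs = [1000 * k ** (-alpha_true) for k in degrees]

# Ordinary least squares on (log k, log freq): slope = -alpha.
xs = [math.log(k) for k in degrees]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

alpha_hat = -slope
print(round(alpha_hat, 3))  # ~2.5
```

On real (noisy, binned) data, a plain log-log regression is known to be biased; it is shown here only to make the "linear on log-log scales" point concrete.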
Scale-freeness is evident in many domains Newman ‘05
Why are many real graphs scale-free?
• Hypothesis: preferential attachment, or a “rich-get-richer” effect
• Generative process to construct a network:
• Start with m0 nodes, each with at least 1 edge
• At each timestep, add a new node with m edges connecting it to m already-existing nodes
• The probability that the new node connects to node i depends on i's degree k_i, as P(i) = k_i / Σ_j k_j
• Many real-world variants of this effect: academic citations, recommendation, virality
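The generative process above can be sketched in a few lines; this is the standard repeated-nodes implementation trick (sampling uniformly from a list in which node i appears deg(i) times is exactly degree-proportional sampling), with n and m chosen arbitrarily for illustration:

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches m edges to existing
    nodes with probability proportional to their current degree."""
    rng = random.Random(seed)
    targets = list(range(m))  # seed nodes, each listed once to bootstrap
    edges = []
    for new in range(m, n):
        chosen = set()
        while len(chosen) < m:  # m distinct, degree-proportional targets
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets.append(t)          # t's degree grew by 1
        targets.extend([new] * m)      # new node enters with degree m
    return edges

edges = preferential_attachment(500, 2)
deg = Counter()
for a, b in edges:
    deg[a] += 1
    deg[b] += 1
print(deg.most_common(3))  # early nodes become hubs: rich get richer
```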
Real graphs have “small-world” effects • How “far apart” are nodes in real graphs? • Interestingly, not very far! The typical number is 6 . You may have heard of the “six degrees of separation” • Milgram ‘69: avg. # of hops for a letter to travel from Nebraska to Boston was 6.2 (sample size 64) • Leskovec ‘08: avg. distance between node pairs on MSN messenger has mode 6 (sample size 180M nodes and 1.3B edges)
What causes the small-world effect?
• Hypothesis: the abundance of hubs, or high-degree nodes
• Even though most nodes aren’t connected to most other nodes, they are connected to hubs, which provide short paths between them
How do real graphs “grow” over time?
• Consider a time-evolving graph G with n(t) nodes and e(t) edges at time t
• Suppose that n(t+1) = 2n(t); what is e(t+1)?
• Not only is it > 2e(t); the growth is actually superlinear and follows the power law e(t) ∝ n(t)^a, with 1 ≤ a ≤ 2 generally
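Given two snapshots of a growing graph, the densification exponent a follows directly from e(t) ∝ n(t)^a; a tiny worked example (the snapshot sizes are hypothetical):

```python
import math

# Hypothetical snapshots of a growing graph: (num_nodes, num_edges).
n1, e1 = 10_000, 40_000
n2, e2 = 20_000, 120_000  # nodes doubled, but edges tripled: superlinear

# e(t) ∝ n(t)^a  =>  a = log(e2/e1) / log(n2/n1)
a = math.log(e2 / e1) / math.log(n2 / n1)
print(round(a, 3))  # log(3)/log(2) ~ 1.585, inside the typical 1 <= a <= 2
```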
Real graphs exhibit densification
[Figures: avg. out-degree increases over time; power law in # edges vs. # nodes over time]
Moreover, the graph diameter shrinks • Graph diameter = max(distance between node pairs) • Leskovec ‘05 shows that diameter actually shrinks over time, instead of growing. In other words, nodes tend to get closer • Hypothesis: Once again due to prevalence and growth of hubs
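The diameter definition above is computable by running BFS from every node and taking the largest shortest-path distance; a sketch on a toy graph that also mirrors the hub hypothesis (adding a shortcut edge shrinks the diameter; the example edges are made up):

```python
from collections import deque, defaultdict

def diameter(edges):
    """Max over node pairs of shortest-path length (undirected, connected)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def eccentricity(src):
        # BFS distances from src; eccentricity is the farthest one.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        return max(dist.values())

    return max(eccentricity(v) for v in adj)

# A path 0-1-2-3 has diameter 3; adding the shortcut 0-3 shrinks it to 2,
# analogous to hubs pulling real graphs' nodes closer together.
path = [(0, 1), (1, 2), (2, 3)]
print(diameter(path), diameter(path + [(0, 3)]))
```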
Much more work done on graph behaviors • Generative graph models (Leskovec ‘05) • Patterns in sizes of connected components (Kang ‘10) • Node in-degree (popularity) over time (McGlohon ‘07) • Duration of calls in phone-call networks (Vaz de Melo ‘10) • Temporal structure evolution (Shah ‘15) … the list goes on
Roadmap • Preliminaries • Notable graph properties • Cool applications • Recommendation and ranking • Clustering • Anomaly detection • Takeaways
Key question: how can we leverage graphs for recommendation/ranking tasks? • Measuring webpage importance • Link prediction and recommendation • Local methods • Global methods
PageRank for large-scale search engines
• Key problem: how to prioritize/curate a large (ever-growing) hyperlinked body of pages by importance and relevance?
• Key idea: leverage the hyperlink citation graph (page-links-page) to rank page importance according to connectivity patterns
• 150 million web pages → 1.7 billion links
• Backlinks and forward links: A and B are C’s backlinks; C is a forward link of both A and B
Content adapted from Li ’09
Simplified PageRank
Idea: each page distributes its own PageRank equally to its forward links, recursively. “An important page has many important pages pointing to it.”
• u: a web page
• B_u: the set of u’s backlinks
• N_v: the number of forward links of page v
• c: a normalization factor making R a probability distribution
• R(u) = c * Σ_{v ∈ B_u} R(v) / N_v
• Simplified PageRank is the stationary probability distribution of a random walk on the graph: a surfer keeps clicking successive links at random.
Simplified PageRank: worked example
• Pages: Yahoo, Amazon (Amzn), Microsoft (MS)
• Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”
• The adjacency matrix is transposed and column-normalized (this accounts for each page distributing rank equally among its forward links)
• Starting from uniform initial PageRank scores, the calculation is repeated iteration by iteration until the scores converge
Problem with simplified PageRank
• A loop (“rank sink”): during each iteration, the loop accumulates rank but never distributes rank to other pages!