Course: Data mining
Lecture: Computing basic graph statistics
Aristides Gionis, Department of Computer Science, Aalto University
visiting Sapienza University of Rome, fall 2016

algorithmic tools

efficiency considerations
• data on the web and in social media are typically of extremely large scale (easily reaching billions)
• how to compute simple graph statistics?
• even quadratic algorithms are not feasible in practice

hashing and sketching
• probabilistic / approximate methods
• sketching: create sketches that summarize the data and allow simple statistics to be estimated in small space
• hashing: hash objects in such a way that similar objects have a larger probability of being mapped to the same value than non-similar objects

estimator theorem
• consider a set of items U
• a fraction ρ of them have a specific property
• estimate ρ by sampling
• how many samples N are needed?
$$N \ge \frac{4}{\epsilon^2 \rho} \log \frac{2}{\delta}$$
for an ε-approximation with probability at least 1 − δ
• notice: it does not depend on |U| (!)
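To get a feeling for the bound, the sketch below simply evaluates it for given ρ, ε and δ. The function name `samples_needed` is hypothetical, and reading the logarithm as the natural logarithm is an assumption (the slide writes only "log").

```python
import math


def samples_needed(rho, eps, delta):
    """Smallest integer N with N >= 4 / (eps^2 * rho) * ln(2 / delta).

    Assumption: 'log' in the estimator theorem is the natural logarithm.
    """
    return math.ceil(4.0 / (eps * eps * rho) * math.log(2.0 / delta))


# The bound depends on rho, eps and delta only, never on the size of U.
n_common = samples_needed(rho=0.1, eps=0.1, delta=0.05)
n_rare = samples_needed(rho=0.001, eps=0.1, delta=0.05)
print(n_common, n_rare)
```

Making the property 100 times rarer makes the required number of samples roughly 100 times larger, which is why sampling works poorly for rare properties.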

homework
use the Chernoff bound to derive the estimator theorem

applications of the algorithmic tools to real scenarios

clustering coefficient and triangles

clustering coefficient
$$C = \frac{3 \times \text{number of triangles in the network}}{\text{number of connected triples of vertices}}$$
• how to compute it?
• how to compute the number of triangles in a graph?
• assume that the graph is very large and stored on disk [Buriol et al., 2006]
• count triangles when the graph is seen as a data stream
• two models:
  – edges are stored in any order
  – edges in order: all edges incident to one vertex are stored sequentially
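For a graph small enough to fit in memory, the definition of C can be evaluated exactly. The following is a minimal sketch under that assumption; the helper names and the toy edge list are made up for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs; self-loops are skipped."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_connected_triples(adj):
    """Paths of length 2: a vertex of degree d is the center of d (d - 1) / 2 of them."""
    return sum(len(n) * (len(n) - 1) // 2 for n in adj.values())


def count_triangles(adj):
    """Each triangle closes one path at each of its 3 vertices, hence the division by 3."""
    closed = 0
    for nbrs in adj.values():
        for b, c in combinations(nbrs, 2):
            if c in adj[b]:
                closed += 1
    return closed // 3


def clustering_coefficient(adj):
    triples = count_connected_triples(adj)
    return 3 * count_triangles(adj) / triples if triples else 0.0


# Toy graph: a triangle 1-2-3 with a pendant edge 3-4.
toy = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
print(clustering_coefficient(toy))
```

This exact computation touches every pair of neighbors of every vertex, so it is only meant for small graphs; the streaming setting of the lecture needs the sampling ideas that follow.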

counting triangles
• the brute-force algorithm checks every triple of vertices
• obtain an approximation by sampling triples

sampling algorithm for counting triangles
• how many samples are required?
• let T be the set of all triples and T_i the set of triples that have i edges, i = 0, 1, 2, 3
• by the estimator theorem, to get an ε-approximation with probability 1 − δ, the number of samples should be
$$N \ge O\left(\frac{|T|}{|T_3|} \, \frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$$
• but |T| can be very large compared to |T_3|
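A sketch of this naive estimator, assuming the vertex list and the edge set fit in memory: sample vertex triples uniformly, measure the fraction that are triangles, and scale by |T|. The function name and the representation of edges as `frozenset` pairs are illustrative choices, not part of the lecture.

```python
import random
from itertools import combinations


def estimate_triangles_by_triples(nodes, edge_set, num_samples, rng):
    """Estimate |T_3| as (fraction of sampled triples that are triangles) * |T|.

    edge_set is assumed to hold frozenset({u, v}) for every undirected edge.
    """
    nodes = list(nodes)
    n = len(nodes)
    total_triples = n * (n - 1) * (n - 2) // 6  # |T|
    hits = 0
    for _ in range(num_samples):
        a, b, c = rng.sample(nodes, 3)
        if (frozenset((a, b)) in edge_set
                and frozenset((a, c)) in edge_set
                and frozenset((b, c)) in edge_set):
            hits += 1
    return hits / num_samples * total_triples


rng = random.Random(0)
# Complete graph on 5 vertices: every one of the 10 triples is a triangle.
k5_edges = {frozenset(p) for p in combinations(range(5), 2)}
print(estimate_triangles_by_triples(range(5), k5_edges, 100, rng))
```

On a sparse graph almost every sampled triple has no edges at all, so |T| / |T_3| is huge and the estimator needs an impractical number of samples.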

counting triangles
• incidence model: all edges incident to each vertex appear in order in the stream
• sample connected triples

sampling algorithm for counting triangles
• incidence model
• consider the sample space S = { b - a - c | (a, b), (a, c) ∈ E }
• $|S| = \sum_i d_i (d_i - 1) / 2$
1: sample X ⊆ S (paths b - a - c)
2: estimate the fraction of X for which the edge (b, c) is present
3: scale by |S|
• gives an (ε, δ) approximation
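The three steps can be sketched for an in-memory graph as follows. This is an illustration under stated assumptions, not the streaming algorithm itself: `adj` is an assumed dictionary of neighbor sets, and the final division by 3 is added here because scaling by |S| gives the number of closed paths, and every triangle contains three of them.

```python
import random


def estimate_triangles_by_wedges(adj, num_samples, rng):
    """adj maps each vertex to the set of its neighbors (undirected graph)."""
    centers = [v for v in adj if len(adj[v]) >= 2]
    # Number of paths b - a - c centered at each vertex a: d_a (d_a - 1) / 2.
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in centers]
    size_s = sum(weights)  # |S|
    if size_s == 0:
        return 0.0
    closed = 0
    for _ in range(num_samples):
        # Step 1: a uniform path from S = center chosen proportionally to its
        # number of paths, then a uniform pair of its neighbors.
        a = rng.choices(centers, weights=weights, k=1)[0]
        b, c = rng.sample(list(adj[a]), 2)
        # Step 2: is the closing edge (b, c) present?
        if c in adj[b]:
            closed += 1
    # Step 3: scale by |S|, then divide by 3 to turn closed paths into triangles.
    return closed / num_samples * size_s / 3


rng = random.Random(0)
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(estimate_triangles_by_wedges(k4, 200, rng))
```

Compared with sampling arbitrary triples, every sample here already has two edges, so the success probability is 3|T_3| / |S| instead of |T_3| / |T|.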

counting triangles: incidence stream model
SampleTriangle [Buriol et al., 2006]
1st pass: count the number of paths of length 2 in the stream
2nd pass: uniformly choose one path (a, b, c)
3rd pass: if ((b, c) ∈ E) β = 1 else β = 0
return β
we have
$$E[\beta] = \frac{3|T_3|}{|T_2| + 3|T_3|}, \quad \text{with} \quad |T_2| + 3|T_3| = \sum_u \frac{d_u (d_u - 1)}{2},$$
so
$$|T_3| = E[\beta] \sum_u \frac{d_u (d_u - 1)}{6}$$
and the space needed is
$$O\left(\left(1 + \frac{|T_2|}{|T_3|}\right) \frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$$
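The three passes might look as follows in code. This is a sketch under several assumptions rather than the procedure of [Buriol et al., 2006]: the stream is modeled as a function `make_stream` that can be called once per pass and yields (vertex, neighbor) pairs grouped by vertex, every undirected edge appears in both directions, the chosen path (a, b, c) is read as b - a - c with center a, and the second pass buffers one adjacency list at a time for simplicity.

```python
import random
from itertools import groupby


def sample_triangle(make_stream, rng):
    """One run of the three passes; returns (beta, number of paths of length 2)."""
    # 1st pass: count paths of length 2 from the degree of each vertex block.
    num_paths = 0
    for _, block in groupby(make_stream(), key=lambda e: e[0]):
        d = sum(1 for _ in block)
        num_paths += d * (d - 1) // 2
    if num_paths == 0:
        return 0, 0
    # 2nd pass: pick the k-th path; the center is hit with probability
    # proportional to its number of paths, then a uniform neighbor pair is drawn.
    k = rng.randrange(num_paths)
    seen = 0
    chosen = None
    for a, block in groupby(make_stream(), key=lambda e: e[0]):
        nbrs = [v for _, v in block]  # simplification: buffers one adjacency list
        w = len(nbrs) * (len(nbrs) - 1) // 2
        if k < seen + w:
            b, c = rng.sample(nbrs, 2)
            chosen = (a, b, c)
            break
        seen += w
    # 3rd pass: look for the closing edge (b, c).
    _, b, c = chosen
    beta = 0
    for u, v in make_stream():
        if u == b and v == c:
            beta = 1
            break
    return beta, num_paths


def estimate_triangles(make_stream, num_runs, rng):
    """Average beta over independent runs and apply |T_3| = E[beta] * (sum) / 6."""
    total, num_paths = 0, 0
    for _ in range(num_runs):
        beta, num_paths = sample_triangle(make_stream, rng)
        total += beta
    return total / num_runs * num_paths / 3


k4 = {v: [u for u in range(4) if u != v] for v in range(4)}
k4_stream = lambda: ((u, v) for u in k4 for v in k4[u])
print(estimate_triangles(k4_stream, 50, random.Random(0)))
```

Since `num_paths` equals the sum of d_u (d_u − 1) / 2, dividing it by 3 matches the factor d_u (d_u − 1) / 6 in the formula for |T_3|. In practice the independent runs would share the same three passes instead of re-reading the stream for each run.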

properties of the sampling space
it should be possible to
• estimate the size of the sampling space
• sample an element uniformly at random

homework
1. compute triangles in 3 passes when edges appear in arbitrary order
2. compute triangles in 1 pass when edges appear in arbitrary order
3. compute triangles in 1 pass in the incidence model

counting graph minors

counting other minors
• count all minors in very large graphs
  – connected subgraphs
  – size 3 and 4
  – directed or undirected graphs
• why?
• modeling networks, “signature” structures, e.g., copying model
• anomaly detection, e.g., spam link farms
[Alon, 2007, Bordino et al., 2008]

counting minors in large graphs
• characterize a graph by the distribution of its minors
[figure: all undirected minors of size 4]
[figure: all directed minors of size 3]
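As an illustration of such a distribution, the sketch below counts by brute force the connected induced 4-node subgraphs of a small undirected graph. Reading "minors" as connected induced subgraphs is an assumption, and the shape names are informal labels. The six connected undirected graphs on 4 vertices happen to have distinct sorted degree sequences, which is what the lookup table relies on.

```python
from collections import Counter
from itertools import combinations

# Sorted degree sequence inside the 4-vertex subset -> informal shape name.
# Degree sequences of disconnected subsets do not appear here and are skipped.
SHAPES = {
    (1, 1, 2, 2): "path",
    (1, 1, 1, 3): "star",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed triangle",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def four_node_profile(adj):
    """Count connected induced 4-node subgraphs; adj maps vertex -> set of neighbors."""
    counts = Counter()
    for quad in combinations(list(adj), 4):
        members = set(quad)
        degrees = tuple(sorted(len(adj[v] & members) for v in quad))
        shape = SHAPES.get(degrees)
        if shape is not None:
            counts[shape] += 1
    return counts


def as_distribution(counts):
    total = sum(counts.values())
    return {shape: c / total for shape, c in counts.items()} if total else {}


# Cycle on 5 vertices: removing any one vertex leaves a path on 4 vertices.
c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(as_distribution(four_node_profile(c5)))
```

Enumerating all 4-subsets costs on the order of n^4 operations, so this only serves to make the feature vector concrete; for large graphs the counts have to be estimated by sampling.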

sampling algorithm for counting triangles
• incidence model
• consider the sample space S = { b - a - c | (a, b), (a, c) ∈ E }
• $|S| = \sum_i d_i (d_i - 1) / 2$
1: sample X ⊆ S (paths b - a - c)
2: estimate the fraction of X for which the edge (b, c) is present
3: scale by |S|
• gives an (ε, δ) approximation

adapting the algorithm
sampling spaces:
• 3-node directed
• 4-node undirected
are the sampling space properties satisfied?

datasets

graph class             type          # instances
synthetic               un/directed   39
wikipedia               un/directed   7
webgraphs               un/directed   5
cellular                directed      43
citation                directed      3
food webs               directed      6
word adjacency          directed      4
author collaboration    undirected    5
autonomous systems      undirected    12
protein interaction     undirected    3
US road                 undirected    12

clustering of undirected graphs

                 assigned to cluster
graph class       0    1    2    3    4    5    6
AS graph         12    0    0    0    0    0    0
collaboration     0    0    3    2    0    0    0
protein           1    0    0    1    0    0    1
road-graph        0   12    0    0    0    0    0
wikipedia         0    0    0    0    2    5    0
synthetic        11    0    0    0    0    0   28
webgraph          2    0    0    1    0    0    0

clustering of directed graphs

feature class                               accuracy compared to ground truth
standard topological properties (81)        0.74%
minors of size 3                            0.78%
minors of size 4                            0.84%
minors of size 3 and 4                      0.91%

graph distance distributions

small-world phenomena
small worlds: graphs with short paths
• Stanley Milgram (1933-1984), “The man who shocked the world”
• obedience to authority (1963)
• small-world experiment (1967)

Milgram’s experiment
• 300 people (starting population) are asked to dispatch a parcel to a single individual (target)
• the target was a Boston stockbroker
• the starting population is selected as follows:
  – 100 were random Boston inhabitants (group A)
  – 100 were random Nebraska stockbrokers (group B)
  – 100 were random Nebraska inhabitants (group C)

Milgram’s experiment
• rules of the game:
  – parcels could be directly sent only to someone the sender knows personally
• 453 intermediaries happened to be involved in the experiments (besides the starting population and the target)

Milgram’s experiment
questions Milgram wanted to answer:
1. how many parcels will reach the target?
2. what is the distribution of the number of hops required to reach the target?
3. is this distribution different for the three starting subpopulations?

Milgram’s experiment
answers to the questions
1. how many parcels will reach the target? 29%
2. what is the distribution of the number of hops required to reach the target? average was 5.2
3. is this distribution different for the three starting subpopulations? YES: average for groups A/B/C was 4.6/5.4/5.7

[figure: chain lengths]

measuring what?
but what did Milgram’s experiment reveal, after all?
1. that the world is small
2. that people are able to exploit this smallness

graph distance distribution
• obtain information about a large graph, e.g., a social network
• macroscopic level
• distance distribution
• mean distance
• median distance
• diameter
• effective diameter
• ...

graph distance distribution
• given a graph, d(x, y) is the length of the shortest path from x to y, defined as ∞ if one cannot go from x to y
• for undirected graphs, d(x, y) = d(y, x)
• for every t, count the number of pairs (x, y) such that d(x, y) = t
• the fraction of pairs at distance t is a distribution
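For a small unweighted graph, the distribution can be computed exactly with one breadth-first search per vertex. The sketch below assumes the graph is a dictionary of neighbor sets, counts ordered pairs (x, y) with x ≠ y, and normalizes over reachable pairs only; how pairs at distance ∞ should enter the normalization is a choice the slide leaves open.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_distribution(adj):
    """Fraction of reachable ordered pairs (x, y), x != y, at each distance t."""
    counts = Counter()
    for x in adj:
        for y, d in bfs_distances(adj, x).items():
            if y != x:
                counts[d] += 1
    reachable = sum(counts.values())
    if reachable == 0:
        return {}
    return {t: c / reachable for t, c in sorted(counts.items())}


# Path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
dist = distance_distribution(path)
mean_distance = sum(t * p for t, p in dist.items())
diameter = max(dist)
print(dist, mean_distance, diameter)
```

Running a search from every vertex takes time proportional to the number of vertices times the number of edges, which is exactly the kind of cost that is not feasible at web scale.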
