A Random Walk Around The Block
Johan Ugander, Stanford University
Joint work with: Isabel Kloumann (Facebook) & Jon Kleinberg (Cornell)
Google Mountain View, August 17, 2016
Seed set expansion
• Given a graph G = (V, E), the goal is to accurately identify a target set T ⊂ V from a smaller seed set S ⊂ T.
[Figure: example graph highlighting the target set T, the seed set S inside it, and the remaining nodes scored by Personalized PageRank]
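A minimal sketch of this pipeline, assuming networkx; the graph, parameter values, and the `expand_seed_set` helper are illustrative, not the method from the talk:

```python
# Seed set expansion by Personalized PageRank: rank non-seed nodes by the
# PPR mass they receive when the walk restarts at the seed set S.
import networkx as nx

def expand_seed_set(G, seeds, alpha=0.85, budget=50):
    # Restart distribution: uniform over the seed set, zero elsewhere.
    personalization = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in G}
    ppr = nx.pagerank(G, alpha=alpha, personalization=personalization)
    candidates = [v for v in G if v not in seeds]
    # The top `budget` candidates are the proposed expansion of S toward T.
    return sorted(candidates, key=ppr.get, reverse=True)[:budget]

# Toy usage on a planted two-community graph (parameters are illustrative).
G = nx.planted_partition_graph(2, 50, 0.2, 0.02, seed=1)
seeds = list(range(5))          # a few nodes from the first block
print(expand_seed_set(G, seeds)[:10])
```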
Seed set expansion
• Given a graph G = (V, E), the goal is to accurately identify a target set T ⊂ V from a smaller seed set S ⊂ T.
• Applications: broadly, ranking on graphs and recommendation systems
  • Spam filtering (Wu & Chellapilla '07)
  • Community detection (Weber et al. '13)
  • Missing data inference (Mislove et al. '14)
• Common methods:
  • Semi-supervised learning (Zhu et al. '03)
  • Diffusion-based classification (Jeh & Widom '03; Kloster & Gleich '14)
  • Outwardness, modularity, and more (Bagrow '08; Kloumann & Kleinberg '14)
Recall curves for seed set expansion (Kloumann & Kleinberg '14)
• Recall curve: true positive rate as a function of the number of items returned, based on a small, uniformly random seed set.
• Kloumann & Kleinberg '14 tested many different methods on data and broadly found Personalized PageRank to be best.
• Truncated PPR (first K steps) is comparable to full PPR from K = 4 onward.
• The Heat Kernel was later found to be comparable to PPR.
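A recall curve can be computed directly from a ranking of candidate nodes. The sketch below (plain Python, illustrative names) measures the fraction of the unseen target set T \ S recovered among the top-k returned items:

```python
def recall_curve(ranking, target, seeds):
    """ranking: candidate nodes in decreasing score order (seeds excluded);
    target: the true target set T; seeds: the seed set S, a subset of T."""
    unseen = set(target) - set(seeds)      # only non-seed targets count as recoverable
    hits, curve = 0, []
    for v in ranking:
        hits += (v in unseen)
        curve.append(hits / len(unseen))   # recall after returning the top-k items
    return curve
```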
Diffusion-based node classification
• Classification based on random walk landing probabilities.
• r_k^v: probability that a random walk starting in S is at v after k steps.
• (r_1^v, r_2^v, ..., r_K^v): truncated vector of landing probabilities.
• Personalized PageRank and Heat Kernel ranking:
  PPR(v) \propto \sum_{k=1}^{\infty} \alpha^k \, r_k^v, \qquad HK(v) \propto \sum_{k=1}^{\infty} \frac{t^k}{k!} \, r_k^v
• General diffusion score function:
  score(v) = \sum_{k=1}^{\infty} w_k \, r_k^v
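A minimal sketch of how the landing probabilities r_k and the general diffusion score could be computed by iterating the random-walk transition matrix; it assumes numpy, a dense adjacency matrix A with no isolated nodes, and uses illustrative function names:

```python
import numpy as np

def landing_probabilities(A, seeds, K):
    """Return [r_1, ..., r_K], where r_k[v] = Pr[walk started in S is at v after k steps]."""
    n = A.shape[0]
    d = A.sum(axis=1)                  # node degrees (assumed nonzero)
    P = A / d[:, None]                 # row-stochastic random-walk transition matrix
    r = np.zeros(n)
    r[seeds] = 1.0 / len(seeds)        # walk starts uniformly over the seed set S
    out = []
    for _ in range(K):
        r = r @ P                      # one step of the walk
        out.append(r.copy())
    return out

def diffusion_score(A, seeds, weights):
    """score(v) = sum_k w_k * r_k[v], truncated at K = len(weights) steps."""
    R = landing_probabilities(A, seeds, K=len(weights))
    return sum(w * r for w, r in zip(weights, R))
```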
Diffusion-based node classification
• Personalized PageRank and Heat Kernel = two parametric families of linear weights in score(v) = \sum_{k=1}^{K} w_k \, r_k^v:
  • PPR: w_k = \alpha^k
  • Heat Kernel: w_k = t^k / k!
[Figure: weight w_k versus walk length k (log scale), for PPR with \alpha = 0.85 and 0.99 and for the Heat Kernel with t = 1, 5, 15 (Kloster & Gleich '14)]
• Question in this work: What weights are "optimal" for diffusion-based classification?
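The two weight families, truncated at K terms, can be written down directly and plugged into the diffusion_score sketch above (illustrative code, assuming numpy):

```python
import numpy as np
from math import factorial

def ppr_weights(alpha, K):
    return np.array([alpha ** k for k in range(1, K + 1)])             # w_k = alpha^k

def heat_kernel_weights(t, K):
    return np.array([t ** k / factorial(k) for k in range(1, K + 1)])  # w_k = t^k / k!

# e.g. scores = diffusion_score(A, seeds, ppr_weights(0.85, K=10))
```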
The stochastic block model
• C blocks; focus on C = 2 blocks: 1 = "Target", 2 = "Other"
• n_1, n_2 nodes in the two blocks
• Independent edge probabilities:
  • Edge probability within a block = p_in
  • Edge probability across blocks = p_out
• (Results for C > 2 as well; see paper)
[Figure: two-block graph with dense within-block edges (p_in) and sparser cross-block edges (p_out)]
• A model with many names:
  • Stochastic Block Model (Holland et al. '83)
  • Affiliation Model (Frank & Harary '82)
  • Planted Partition Model (Dyer & Frieze '89)
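A small sketch of sampling such a two-block graph with independent edge probabilities p_in and p_out; it assumes numpy, and all parameter values are illustrative:

```python
import numpy as np

def sample_sbm(n1, n2, p_in, p_out, seed=0):
    """Sample a symmetric adjacency matrix for a two-block stochastic block model."""
    rng = np.random.default_rng(seed)
    n = n1 + n2
    block = np.array([0] * n1 + [1] * n2)     # block labels: 0 = "Target", 1 = "Other"
    # Pairwise edge probabilities: p_in within a block, p_out across blocks.
    probs = np.where(block[:, None] == block[None, :], p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, k=1)   # sample each unordered pair once
    A = (upper | upper.T).astype(float)       # symmetric, no self-loops
    return A, block

A, block = sample_sbm(n1=50, n2=50, p_in=0.2, p_out=0.02)
```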