SLIDE 1

Recent advances in local graph clustering and the transition to global analysis

Kimon Fountoulakis @CS UWaterloo 02/07/2020 Workshop: From Local to Global Information

SLIDE 2

Motivation: detection of small clusters in large and noisy graphs

  • Real large-scale graphs have rich local structure
  • We often have to detect small clusters in large and noisy graphs, rather than partition graphs with nice global structure

Protein-protein interaction graph: color denotes similar functionality. US-Senate graph: a clean bi-partition in year 1865, around the end of the American Civil War.

SLIDE 3

Our goals

  • new methods that are able to probe graphs with billions of nodes and edges,
  • the running time of the new methods should depend on the size of the output instead of the size of the whole graph,
  • the new methods should be supported by worst- and average-case theoretical guarantees.

Large-scale data with multiple noisy small-scale and meso-scale clusters determine the need for such methods.

SLIDE 4

Existing and new local graph clustering methods

  • As a warm-up: non-linear PageRank.
  • Non-linear combinatorial diffusions.
  • Non-linear diffusions which balance between spectral and combinatorial diffusions.

The vast majority of existing methods perform some sort of linear diffusion, e.g., PageRank. We need models that do better than simply averaging probabilities.

SLIDE 5

Current local and global developments for local graph clustering methods

  • Local analysis
  • Local to global

SLIDE 6

About this talk

  • I will mostly discuss methods, demonstrate theoretical results, and present experiments that promote understanding of the methods, within the available time.
  • For extensive experiments on real data please check the cited papers. We have literally performed hundreds of experiments measuring the performance of local graph clustering methods.

SLIDE 7

Local Graph Clustering

SLIDE 8

The local graph clustering problem

  • Definition: find a set of nodes A, given a seed node inside a target set B.
  • The set A should have good precision/recall w.r.t. the set B.
  • The running time should depend on the size of A instead of the size of the whole graph.

SLIDE 9

Data: Facebook Johns Hopkins, A. L. Traud, P. J. Mucha and M. A. Porter, Physica A, 391(16), 2012

Facebook Johns Hopkins social network: color denotes class year

Students of year 2009

SLIDE 10

Local graph clustering: example

Data: Facebook Johns Hopkins, A. L. Traud, P. J. Mucha and M. A. Porter, Physica A, 391(16), 2012

SLIDE 11

Data: The MIPS mammalian protein-protein interaction database. Bioinformatics, 21(6):832-834, 2005

Protein structure similarity: color denotes similar function

SLIDE 12

Data: The MIPS mammalian protein-protein interaction database. Bioinformatics, 21(6):832-834, 2005

Local graph clustering finds 2% of the graph

SLIDE 13

Data: The MIPS mammalian protein-protein interaction database. Bioinformatics, 21(6):832-834, 2005

Local graph clustering finds 1% of the graph

SLIDE 14

Or we might want to detect galaxies

SLIDE 15

Warm-up: non-linear PageRank

SLIDE 16

Some definitions

  • Graph: G = (V, E), where V is the set of nodes and E the set of edges, with |V| = n and |E| = m.
  • n × n adjacency matrix: A.
  • An element A_{ij} of A is equal to 1 if nodes i and j are connected, and 0 otherwise.

SLIDE 17

Some definitions

  • Degree matrix: D = diag(A 1_n), where 1_n is a vector of all ones; each diagonal element of D is the number of neighbors of a node.
  • Random walk matrix: A D^{-1}.
  • Lazy random walk matrix: W = (1/2)(I + A D^{-1}).
  • Graph Laplacian: L = D − A.

SLIDE 18
Linear diffusion: personalized PageRank

  • Simple idea: use a random walk from a seed node. The nodes with the highest probability after k steps constitute a cluster.
  • Consider a diffusion process where we perform a lazy random walk step with probability 1 − α, and jump back to a given seed node with probability α:

p_{k+1} = (α s 1_n^T + (1 − α) W) p_k

  • where s is the indicator vector of the seed node and α ∈ (0,1) is the teleportation parameter.
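As an illustration, here is a minimal dense-matrix sketch of this diffusion; the function name, parameter values, and the assumption of an unweighted graph with no isolated nodes are ours, not from the slides:

```python
import numpy as np

def personalized_pagerank(A, seed, alpha=0.15, iters=100):
    """Power method for personalized PageRank: p <- alpha*s + (1-alpha)*W p."""
    n = A.shape[0]
    d = A.sum(axis=0)                      # node degrees
    W = 0.5 * (np.eye(n) + A / d)          # lazy random walk matrix (column-stochastic)
    s = np.zeros(n)
    s[seed] = 1.0                          # indicator vector of the seed node
    p = s.copy()
    for _ in range(iters):
        p = alpha * s + (1 - alpha) * (W @ p)  # teleport with prob. alpha, lazy walk otherwise
    return p
```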

SLIDE 19

Let's get rid of the tail

  • For the stationary personalized PageRank vector, most of the probability mass is concentrated around the seed node.
  • This means that the sorted personalized PageRank vector has a long tail over nodes far away from the seed node.
  • We can efficiently cut this tail using l1-regularized PageRank, without even having to compute the long tail.

SLIDE 20

Non-linear PageRank diffusion

  • Instead of using the power method to compute the PageRank vector, we can perform a non-linear power method, where we do a random walk step first and then threshold small values to zero:

p_{k+1} = prox_{ρα‖D·‖_1}( (1 − α) W p_k + α s ),   where (1 − α) W p_k + α s is the random walk step,

  • and where the prox operator reduces components smaller than ρα d_i to zero, component-wise:

[prox_{ρα‖D·‖_1}(x)]_i = x_i − ρα d_i  if x_i ≥ ρα d_i,  and 0 otherwise.
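A minimal sketch of this non-linear power method, in the same setup as the personalized PageRank sketch above (the parameter values are illustrative assumptions):

```python
import numpy as np

def nonlinear_pagerank(A, seed, alpha=0.15, rho=1e-4, iters=100):
    """Non-linear power method: random walk step, then degree-weighted thresholding."""
    n = A.shape[0]
    d = A.sum(axis=0)
    W = 0.5 * (np.eye(n) + A / d)          # lazy random walk matrix
    s = np.zeros(n)
    s[seed] = 1.0
    p = np.zeros(n)
    for _ in range(iters):
        z = (1 - alpha) * (W @ p) + alpha * s     # random walk step
        p = np.maximum(z - rho * alpha * d, 0.0)  # prox: shrink by rho*alpha*d_i, clip at zero
    return p                                      # sparse: most entries are exactly zero
```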

SLIDE 21

A far-fetched relation to graph neural networks

Non-linear PageRank:  p_{k+1} = prox_{ρα‖D·‖_1}( (1 − α) W p_k + α s )

Graph neural network layer:  p_{k+1} = ReLU( random walk matrix × parameters × p_k )

SLIDE 22

L1-regularized PageRank

  • The stationary vector of the non-linear PageRank diffusion corresponds to the optimal solution of the l1-regularized PageRank problem:

minimize_x  f(x) + g(x),  with  f(x) = (1/2) x^T Q x − α x^T s  and  g(x) = ρα ‖D x‖_1,

where Q = αD + ((1 − α)/2) L.

Fountoulakis et al. Variational Perspective of Local Graph Clustering, Mathematical Programming, 2017

SLIDE 23

Properties of the l1-regularized optimal solution

  • Theorem
  • If the graph is unweighted, then the number of nonzero nodes in the optimal solution is bounded by 1/ρ.
  • If the graph is weighted, then the volume of the nonzero nodes in the optimal solution is bounded by 1/ρ.

Fountoulakis et al. Variational Perspective of Local Graph Clustering, Mathematical Programming, 2017

SLIDE 24

The solution path is monotonic

  • Theorem
  • Let x̂(ρ) be the solution of the l1-regularized problem as a function of ρ.
  • Then x̂(ρ) is a component-wise monotone function: x̂(ρ_0) ≤ x̂(ρ_1) for ρ_0 > ρ_1.
  • The inequality becomes strict when a component is positive.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 25

Stage-wise for recovering the whole path

  • Corollary
  • The stage-wise algorithm converges to the l1-regularized solution path as we drive the step-size η of the algorithm to zero.

Stage-wise algorithm (a code sketch follows below):
1) Choose i such that the scaled gradient d_i^{-1} ∇_i f(x_k) is the largest in magnitude among i ∈ [n].
2) Update [x_{k+1}]_i = [x_k]_i + η d_i.

  • The running time of stage-wise depends on the nonzero nodes and their neighbors, and not on the size of the whole graph.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
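A small sketch of the stage-wise scheme for f(x) = (1/2) x^T Q x − α x^T s; dense matrices are used for clarity, and the sign convention and stopping rule are our assumptions (a local implementation would touch only the support and its neighbors):

```python
import numpy as np

def stagewise_path(Q, s, d, alpha=0.15, eta=1e-4, steps=5000):
    """Stage-wise coordinate updates tracing an approximate l1-regularized path."""
    n = Q.shape[0]
    x = np.zeros(n)
    path = [x.copy()]
    for _ in range(steps):
        grad = Q @ x - alpha * s       # gradient of f(x) = 0.5 x^T Q x - alpha x^T s
        scores = grad / d              # degree-scaled gradient d_i^{-1} grad_i
        i = int(np.argmin(scores))     # most negative scaled gradient: steepest coordinate
        if scores[i] >= 0:             # no descent coordinate left: stop
            break
        x[i] += eta * d[i]             # small stage-wise step on coordinate i
        path.append(x.copy())
    return x, path
```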
SLIDE 26

Stage-wise for recovering the whole path - example

Figure: the exact l1-regularized solution path next to the stage-wise path with step-size η = 10^{-4}.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 27

What if we do not want to recover the whole path?

Proximal gradient descent:

x_{k+1} := argmin_x  g(x) + f(x_k) + ⟨∇f(x_k), x − x_k⟩ + (1/2)‖x − x_k‖_2^2,

where f(x_k) + ⟨∇f(x_k), x − x_k⟩ is the first-order Taylor approximation of f and (1/2)‖x − x_k‖_2^2 is an upper bound on the approximation error, applied to

minimize  f(x) + g(x),  with  f(x) = (1/2) x^T Q x − α x^T s  and  g(x) = ρα ‖D x‖_1.

This requires a careful implementation to avoid excessive running time:
  • Need to maintain a set of non-zero nodes.
  • Update x and the gradient only for the non-zero nodes and their neighbors at each iteration.

Fountoulakis et al. Variational Perspective of Local Graph Clustering, Mathematical Programming, 2017
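A sketch of the proximal gradient iteration under stated assumptions: SciPy sparse matrices, a step size using the bound λ_max(Q) ≤ d_max, and a dense gradient for clarity (a truly local implementation would update only the support and its neighbors):

```python
import numpy as np
import scipy.sparse as sp

def l1_reg_pagerank(A, seed, alpha=0.15, rho=1e-4, iters=200):
    """Proximal gradient for: min 0.5 x^T Q x - alpha x^T s + rho*alpha*||Dx||_1, x >= 0."""
    n = A.shape[0]
    d = np.asarray(A.sum(axis=1)).ravel()
    D = sp.diags(d)
    Q = alpha * D + 0.5 * (1 - alpha) * (D - A)   # Q = alpha*D + (1-alpha)/2 * L
    s = np.zeros(n)
    s[seed] = 1.0
    x = np.zeros(n)
    step = 1.0 / d.max()               # safe step size: 1/lambda_max(Q) >= 1/d_max
    for _ in range(iters):
        grad = Q @ x - alpha * s
        x = np.maximum(x - step * grad - step * rho * alpha * d, 0.0)  # gradient + prox step
    return x
```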

SLIDE 28

Theorem: non-decreasing non-zero nodes

Figure: number of nonzeros vs. iterations for proximal gradient, together with the optimal number of non-zeros; the count is non-decreasing.

Fountoulakis et al. Variational Perspective of Local Graph Clustering, Mathematical Programming, 2017

SLIDE 29

Open problem: is accelerated prox. grad. a local algorithm?

Figure: number of nonzeros vs. iterations for accelerated prox. grad. and proximal gradient; the accelerated method touches many more nodes along the way.

Running times (Õ hides logarithmic factors):
  • Gradient descent: Õ(vol(Ŝ)/μ).
  • Accelerated gradient descent: Õ(vol(G)/√μ); the open problem is whether a local Õ(vol(Ŝ)/√μ) is achievable.
  • Ŝ: support of the optimal solution, i.e., the non-zero nodes.
  • μ: strong convexity parameter of the problem.

SLIDE 30

Two ways to measure performance of the l1-regularized PageRank model

  • Average-case: performance under a stochastic block model; recover a cluster using the output of l1-regularized PageRank.
  • Worst-case: use conductance to measure the quality of the output; show that the output has conductance value similar to that of a target cluster around the seed node.

Fountoulakis et al. Variational Perspective of Local Graph Clustering, Mathematical Programming, 2017

W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020

Zhu et al. A local algorithm for finding well-connected clusters, ICML, 2013

SLIDE 31

Average-case guarantees

SLIDE 32

Average-case performance

Local random model

  • Given a graph G with n nodes, let K be a target cluster inside G.
  • Two nodes in K are connected with probability p.
  • Nodes in K are connected to nodes in K^c with probability q.
  • The rest of the edges can be drawn using any other model.
SLIDE 33

Expected l1-regularized PageRank

  • The optimal solution of the expected problem identifies the target cluster.
  • Theorem
  • Suppose that the seed node is selected from the target cluster K. The optimal solution of

x* := argmin_x  (1/2) x^T E[Q] x − α x^T s + ρα ‖E[D] x‖_1

  • satisfies supp(x*) = K,
  • as long as ρ = O(p/d̄²), where d̄ is the expected degree of nodes in the target cluster.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 34

Results for l1-regularized PageRank for noisy data

  • In practice, we do not have access to the expected graph. We are given a realization of the local random model that includes "noise", i.e., edges from the target cluster to the rest of the graph.
  • We have two results for the noisy case.
  • First result: zero false negatives and bounded false positives.
  • Second result: with additional assumptions on the seed nodes, we can show exact recovery.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 35

Results for l1-regularized PageRank for noisy data

  • Theorem (bounded false positives)
  • Suppose p²k ≥ O(log k), where k is the size of the target cluster,
  • and ρ = O(γp/d̄²), where γ = pk/d̄, i.e., the probability of staying inside the target cluster in one step.
  • Then, with probability 1 − 6 exp(−O(p²k)), the optimal solution of the realized problem has zero false negatives and the false positives are bounded:

vol(FP) ≤ vol(K) (O(1/γ²) − 1)

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 36

Results for l1-regularized PageRank for noisy data

Definitions

  • γ = pk/d̄, i.e., the probability of staying inside the target cluster in one step.
  • Φ(B) := (number of edges leaving B) / (sum of degrees of vertices in B), assuming B is the smaller part of the graph.
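For concreteness, a small sketch of the conductance computation from this definition (NetworkX is an assumed convenience, not from the slides):

```python
import networkx as nx

def conductance(G, B):
    """Phi(B) = edges leaving B / sum of degrees of vertices in B (B the smaller side)."""
    B = set(B)
    boundary = sum(1 for u, v in G.edges() if (u in B) != (v in B))  # edges leaving B
    vol = sum(G.degree(u) for u in B)                                # volume of B
    return boundary / vol
```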

SLIDE 37
Results for l1-regularized PageRank for noisy data

  • Theorem (exact recovery)
  • Let q = O(1/n).
  • Then, with probability at least 1 − O(e^{−k}), there exists a good seed node such that, if we use that seed node, we get supp(x̂) = K,
  • as long as d_j ≥ O(1/(γp)) for all j ∈ K^c.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 38
Results for l1-regularized PageRank for noisy data

  • The assumption that q = O(1/n) implies that there is a constant number of edges leaving the cluster, which sounds artificial,
  • but it is not, because it also covers the case where the size of the target cluster is k = O(1).
  • This is a realistic local graph clustering setting, where we attempt to recover a very small target cluster of constant size with a constant number of edges leaving the cluster.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 39

Worst-case guarantees

SLIDE 40

Some definitions

  • Conductance of target cluster B: Φ(B) := (number of edges leaving B) / (sum of degrees of vertices in B), assuming B is the smaller part of the graph.
  • Internal connectivity of target cluster B: IC(B) := the minimum conductance of the subgraph induced by B.

SLIDE 41

Worst-case performance

  • Theorem (by Zhu et al.)
  • Assume that the internal connectivity of the target cluster K is larger than its conductance: IC²(K) / (Φ(K) log vol(K)) ≥ Ω(1).
  • False positives are bounded by vol(FP) ≤ O(Φ(K)/IC(K)) vol(K).
  • False negatives are bounded by vol(FN) ≤ O(Φ(K)/IC(K)) vol(K).

Zhu et al. A local algorithm for finding well-connected clusters, ICML, 2013

SLIDE 42

Compare average- and worst-case

                 False positives                           False negatives
Average-case     vol(FP) ≤ vol(K)(O(1/γ²) − 1)             zero
Worst-case       vol(FP) ≤ vol(K) O((1 − γ) log k)         vol(FN) ≤ vol(K) O((1 − γ) log k)

  • The average-case result on FP is stronger for large values of γ.
  • Also, for the average-case we can prove exact recovery.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 43

Comparison to planted cluster model

  • Example: p = 1 and q = O(log n / n).
  • Using semidefinite programming one can achieve exact recovery as long as k ≥ O(log n),
  • while our results guarantee zero false negatives and a constant proportion of false positives.
  • However, our method is not allowed to touch the whole graph.

  • W. Ha, K. Fountoulakis, M. Mahoney. Statistical Guarantees of Local Graph Clustering. AISTATS-2020
SLIDE 44

Combinatorial Diffusion: Capacity Releasing Diffusion

SLIDE 45

Problem: spectral diffusions might leak mass

  • ℓ1-regularized PageRank (best tuning): Precision = 0.73, Recall = 0.91

Target cluster: students of year 2008. Red nodes: output of the algorithm.
Data: Facebook Colgate University, A. L. Traud, P. J. Mucha and M. A. Porter, Physica A, 391(16), 2012

SLIDE 46

Solving the problem of spreading mass indiscriminately by gradual release of edge capacity

  • Spectral diffusions: even distribution of the residual probability mass to neighbors.
  • Capacity Releasing Diffusion: controls the amount of mass to be sent over an edge by using the height "h" of a node.
  • In theory, this results in bounded mass leaked outside of the target cluster.
  • In practice, this results in much better precision and recall.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 47

Capacity Releasing Diffusion algorithm

Maintain mass "m" and height "h" for each node. Definitions: degree(v) = #edges of node v; a node is saturated when m(v) ≥ deg(v); excess mass = max(m(v) − deg(v), 0).

Initial state: m = 0, h = 0 for all nodes A-H.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 48

Capacity Releasing Diffusion algorithm

State after overflowing the seed, m(A) = 2 deg(A): A: m=4, h=1; all other nodes: m=0, h=0.

Algorithm:
  • Overflow the seed: m(A) = 2 deg(A).
  • Iterate: push excess mass to unsaturated nodes with lower height, until m(v) ≤ deg(v) for all nodes v.
  • Overflow again, m(v) = 2 m(v), and repeat while the mass still fits, i.e., while m(v) ≤ 2 deg(v) for all nodes v.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 49

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; all other nodes: m=0, h=0.
Step: push excess mass to unsaturated nodes with lower height.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 50

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; all other nodes: m=0, h=0.
Step: pick node A (it has excess mass) and a neighbor of A with lower height "h".

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 51

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; all other nodes: m=0, h=0.
Step: pick node C.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 52

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; all other nodes: m=0, h=0.
Step: pick node A (it has excess mass) and push 1 unit. Gradual release: push at most "h" flow to a chosen neighbor, i.e., do not push more than the height of A.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 53

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; C: m=1, h=0; all other nodes: m=0, h=0.
Step: push excess mass to unsaturated nodes with lower height.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 54

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; C: m=1, h=0; all other nodes: m=0, h=0.
Step: pick node A (it has excess mass) and a new edge of node A with residual flow less than "h"; push 1 unit.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 55

Capacity Releasing Diffusion algorithm

State: A: m=2, h=1; B: m=1, h=0; C: m=1, h=0; all other nodes: m=0, h=0.
Step: push excess mass to unsaturated nodes with lower height.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 56

Capacity Releasing Diffusion algorithm

State after overflow, m(v) = 2 m(v): A: m=4, h=1; B: m=2, h=0; C: m=2, h=0; all other nodes: m=0, h=0.
Step: push excess mass to unsaturated nodes with lower height.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 57

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; B: m=2, h=0; C: m=2, h=0; all other nodes: m=0, h=0.
Step: pick node A (it has excess mass) and a neighbor of A with lower height "h".

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 58

Capacity Releasing Diffusion algorithm

State: A: m=4, h=1; B: m=2, h=0; C: m=2, h=0; all other nodes: m=0, h=0.
Step: pick node A (it has excess mass) and push 1 unit. Gradual release: push at most "h" flow to a chosen neighbor, never more than the height of A.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 59

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; B: m=2, h=0; C: m=3, h=0; all other nodes: m=0, h=0.
Step: note that C now has excess mass, so it has to be added to the candidate nodes.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 60

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; B: m=2, h=0; C: m=3, h=0; all other nodes: m=0, h=0.
Step: pick node C (it has excess mass) and a neighbor of C with lower height "h".

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 61

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; B: m=2, h=0; C: m=3, h=1; all other nodes: m=0, h=0.
Step: there is no neighbor of C with lower height, so increase the height of C by 1.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017

SLIDE 62

Capacity Releasing Diffusion algorithm

State: A: m=3, h=1; B: m=2, h=0; C: m=3, h=1; all other nodes: m=0, h=0.
Step: repeat until there is no node with excess mass.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
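Pulling the walkthrough together, a simplified sketch of the CRD inner loop; this is a schematic reading of the slides, not the authors' implementation, and it ignores per-edge capacities and the exact push/relabel schedule:

```python
import networkx as nx

def crd_unit_flow(G, mass, height, max_height=10):
    """One CRD inner phase: push excess mass until m(v) <= deg(v) everywhere (or heights cap out).

    mass and height are dicts keyed by node and are modified in place.
    """
    def excess(v):
        return max(mass[v] - G.degree(v), 0)

    active = [v for v in G if excess(v) > 0]
    while active:
        v = active.pop()
        if excess(v) == 0 or height[v] >= max_height:
            continue
        # unsaturated neighbors with lower height can receive mass
        low = [u for u in G[v] if height[u] < height[v] and mass[u] < G.degree(u)]
        if low:
            u = low[0]
            mass[v] -= 1            # gradual release: push one unit at a time
            mass[u] += 1
            for w in (u, v):
                if excess(w) > 0:
                    active.append(w)
        else:
            height[v] += 1          # relabel: no eligible neighbor, raise the height
            active.append(v)
    return mass, height

# Usage sketch: overflow a seed node "A", then run one phase.
# G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
# mass = {v: 0 for v in G}; height = {v: 0 for v in G}
# mass["A"] = 2 * G.degree("A"); height["A"] = 1
# crd_unit_flow(G, mass, height)
```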

SLIDE 63

Theoretical comparison to spectral diffusions

  • Weaker assumptions: the theoretical bound on FP/FN needs the "signal" to be only polylogarithmically stronger than the "noise", as opposed to quadratically stronger for spectral methods.
  • Better running time: the running time is 1/IC(B) times faster than spectral methods.
  • Better worst-case guarantees: the output A satisfies Φ(A) ≤ O(Φ(B)), as opposed to Φ(A) ≤ O(Φ(B)/IC(B)).

Here the internal connectivity IC(B) of the target B plays the role of the "signal" and the conductance Φ(B) of the target plays the role of the "noise", where, assuming B is the smaller part of the graph, Φ(B) := (number of edges leaving B) / (sum of degrees of vertices in B) and IC(B) := the minimum conductance of the subgraph induced by B.

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 64

Example on Facebook Colgate University social network

Target cluster: students of year 2008.

  • ℓ1-regularized PageRank (best tuning): Precision = 0.73, Recall = 0.94
  • Capacity Releasing Diffusion: Precision = 0.93, Recall = 0.94

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 65

Example on Facebook Johns Hopkins social network

Target cluster: students with the same major.

  • ℓ1-regularized PageRank (best tuning): Precision = 0.71, Recall = 0.91
  • Capacity Releasing Diffusion: Precision = 0.87, Recall = 0.94

  • D. Wang, K. Fountoulakis, M. Mahoney, S. Rao. Capacity Releasing Diffusion for Speed and Locality. ICML 2017
SLIDE 66
p-norm Flow Diffusions

SLIDE 67

Spectrum of methods

  • Spectral diffusions (e.g., PageRank): easy to understand, fast in practice.
  • Combinatorial diffusions (e.g., capacity releasing diffusion): robust to noise.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.

SLIDE 68

Spectrum of methods

  • The p-norm flow diffusion sits in between spectral diffusions and combinatorial diffusions.
  • p-norm flow diffusion is a family of convex optimization problems that characterizes the trade-off between spectral and combinatorial diffusions.
  • This allows us to define methods that are the best of both worlds.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
slide-69
SLIDE 69

Some definitions - incidence matrix

Incidence matrix B for the example graph on nodes A-H; one row per edge, with +1 at one endpoint and −1 at the other:

       A    B    C    D    E    F    G    H
A-B    1   -1
A-C    1        -1
B-C         1   -1
C-D              1   -1
D-E                   1   -1
D-F                   1        -1
D-G                   1             -1
F-H                        1             -1

  • The ordering of the edges and their direction is arbitrary.
  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
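A tiny sketch constructing this incidence matrix with NumPy, using the edge list and orientation from the table above:

```python
import numpy as np

nodes = list("ABCDEFGH")
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("D", "G"), ("F", "H")]

idx = {v: i for i, v in enumerate(nodes)}
B = np.zeros((len(edges), len(nodes)))
for r, (u, v) in enumerate(edges):
    B[r, idx[u]] = 1.0    # +1 at one endpoint of the edge
    B[r, idx[v]] = -1.0   # -1 at the other; orientation is arbitrary
```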
SLIDE 70

Some definitions - flow variables

  • Let f be a vector where each component corresponds to an edge, for example: (f_AB, f_AC, f_BC, f_CD, f_DE, f_DF, f_DG, f_FH).
  • The magnitude of a component of f is the amount of flow that passes through that edge.
  • The sign of a component of f is the direction of the flow.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 71

Some definitions - net flow

  • Let Δ be a non-negative vector; each component of Δ indicates the initial mass at a node.
  • B^T f is a vector that captures the net flow on each node.
  • B^T f + Δ indicates the net mass on every node.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 72

Node capacities

  • We will require that each node i has capacity equal to its degree d_i.
  • We will say that the initial mass Δ has been diffused when the net mass on each node is at most its capacity:

B^T f + Δ (net mass per node)  ≤  d (capacity per node)

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 73

Diffusion as an optimization formulation

minimize ‖f‖_p   subject to:  B^T f + Δ ≤ d

  • Out of all possible flows that satisfy the capacities, we are interested in the one with minimum L_p norm, where p ∈ [2, ∞).

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 74

Relation to other methods

  • For p = 2, the dual of the 2-norm flow diffusion problem is

minimize (1/2) ‖Bx‖_2^2 − x^T Δ + ‖Dx‖_1,

which is a regularized spectral problem, very similar to ℓ1-regularized PageRank.

  • For p → ∞, the dual of the ∞-norm flow diffusion problem is

minimize ‖Bx‖_1 − x^T Δ + ‖Dx‖_1,

which is a regularized min-cut problem, very similar to the so-called flow-improve methods.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 75

Rounding

  • In practice we solve the dual of the p-norm flow problem

minimize −x^T Δ + ‖Dx‖_1   subject to:  ‖Bx‖_q ≤ 1,  x ≥ 0,

so we have direct access to the dual variables.

  • Sort the dual variables in descending order.
  • Output the prefix set with smallest conductance (a sweep cut).

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
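A small sketch of this rounding step, given a dict x of dual values per node (NetworkX assumed; the boundary is recomputed from scratch for clarity, and the support is assumed to be a strict subset of the graph):

```python
import networkx as nx

def sweep_cut(G, x):
    """Sort nodes by x (descending); return the prefix set with smallest conductance."""
    total_vol = 2 * G.number_of_edges()
    order = sorted((v for v in G if x[v] > 0), key=lambda v: -x[v])
    best_set, best_phi = set(), float("inf")
    prefix = set()
    for v in order:
        prefix.add(v)
        cut = sum(1 for _ in nx.edge_boundary(G, prefix))  # edges leaving the prefix
        vol = sum(G.degree(u) for u in prefix)
        phi = cut / min(vol, total_vol - vol)              # conductance w.r.t. the smaller side
        if phi < best_phi:
            best_set, best_phi = set(prefix), phi
    return best_set, best_phi
```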
SLIDE 76
p-norm network flow diffusions - conductance guarantees

  • Theorem - Let C be the target cluster with conductance Φ(C). If Δ is initialized inside C and the input seed set sufficiently overlaps with C, then the output A satisfies

Φ(A) ≤ O( Φ(C)^{1−1/p} )

  • Cheeger-type result for p = 2.
  • Constant factor approximation when p → ∞, similar to combinatorial diffusions.
  • Smooth transition for general values of p in between.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 77
p-norm network flow diffusions - algorithm

  • Simple randomized coordinate descent.
  • Running time: O( (|Δ|/γ) (|Δ|/ε)^{1−2/p} log(1/ε) ), where
  • |Δ| represents the magnitude of the initial mass,
  • γ is the strong convexity parameter of the dual problem,
  • ε is the required accuracy.
  • p = 2 gives the usual running time for spectral methods, Õ(|Δ|).
  • p → ∞ gives the usual running time for combinatorial methods, Õ(|Δ|²/ε).

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 78
p-norm network flow diffusions - summary

  • There is a trade-off between the quality of the output and the running time.
  • The larger p is, the better the output with respect to conductance.
  • However, the larger p is, the longer the running time for solving the problem.
  • In practice, small values p ∈ [2, 8] give the best of both worlds.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 79

Performance

  • LFR synthetic model, basically a stochastic block model.
  • μ is a parameter that controls noise; the higher μ is, the more noise.

Figure: conductance (left) and F1 measure (right) as functions of μ, comparing p = 2, p = 4, p = 8, l1-regularized PageRank, and non-linear power iteration.

  • S. Yang, D. Wang, K. Fountoulakis. p-Norm Flow Diffusion for Local Graph Clustering.
SLIDE 80

Local to Global Applications: Network Community Profiles, Node Embeddings, Graph Visualization, Semi-Supervised Learning

(no theory ☹, preliminary work)

SLIDE 81

Network Community Profiles

Clusters with smallest conductance correspond to galaxies

SLIDE 82

Node embeddings

Goal: represent a node with a low-dimensional vector.

Types of node embeddings:
  • Global embeddings
  • Local embeddings, i.e., spectral and combinatorial

We use node embeddings for graph visualization, semi-supervised learning and graph partitioning.

SLIDE 83

Global embeddings

  • Compute the Laplacian matrix L = D − A.
  • Compute the k non-trivial eigenvectors of L.
  • Stack the eigenvectors as columns of an n × k matrix U.
  • Each row of U is a vector representation (node embedding) of a node.
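A minimal sketch of this pipeline (dense eigendecomposition for clarity; large graphs would need sparse eigensolvers):

```python
import numpy as np

def global_embedding(A, k):
    """Rows of U are k-dimensional spectral embeddings from the graph Laplacian."""
    d = A.sum(axis=1)
    L = np.diag(d) - A               # graph Laplacian L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    U = vecs[:, 1:k + 1]             # skip the trivial (constant) eigenvector
    return U
```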

SLIDE 84

Local spectral embeddings

  • Choose N seed sets at random.
  • For each seed set, run a local spectral algorithm.
  • Stack the resulting vectors as columns of an n × N matrix X.
  • Compute the k principal left singular vectors of X.
  • Stack the singular vectors as columns of an n × k matrix U.
  • Each row of U is a vector representation (node embedding) of a node.
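A sketch of this pipeline, reusing the nonlinear_pagerank routine sketched earlier as the local spectral algorithm (single-node seed sets are an illustrative simplification):

```python
import numpy as np

def local_embedding(A, N=100, k=10, seed=0):
    """Stack N local diffusion vectors, then keep k principal left singular vectors."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    X = np.column_stack([
        nonlinear_pagerank(A, rng.integers(n))   # one local vector per random seed node
        for _ in range(N)
    ])                                           # n x N matrix of local vectors
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]                              # each row: a k-dimensional node embedding
```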

SLIDE 85

Local flow embeddings

  • Choose N seed sets at random.
  • For each seed set, run a local flow algorithm.
  • Stack the resulting vectors as columns of an n × N matrix X.
  • Compute the k principal left singular vectors of X.
  • Stack the singular vectors as columns of an n × k matrix U.
  • Each row of U is a vector representation (node embedding) of a node.

SLIDE 86

Graph visualization - US highway network 


  • Edges represent nationally funded highways, and nodes represent intersections.
  • Mostly a toy graph for demonstration purposes.
slide-87
SLIDE 87

Graph visualization - global embeddings

  • Color shows true longitude.
  • Global embeddings seem to correlate with longitude.
  • But they compress major regions of the northeastern US (Washington, New York, Boston) as well as the Western US (Los Angeles, San Diego, Phoenix).

SLIDE 88

Figure panels: map, local spectral embeddings, local flow embeddings.

Graph visualization - local embeddings

  • With global embeddings, the Western US (Los Angeles, San Diego, Phoenix) was quite compressed.
  • Local embeddings help to de-compress the region.
  • Local spectral and flow embeddings seem to be qualitatively different.
SLIDE 89

Main Galaxy Sample data

  • Each node is a galaxy.
  • Edges represent distances among galaxies.
  • The distance is determined by comparing the emission spectra of the two galaxies.
  • There are 517,182 galaxies (nodes) and each galaxy is connected to 4 neighbor galaxies (edges).

Mapping the similarities of spectra: global and local approaches to SDSS galaxies. The Astrophysical Journal, 2016.

SLIDE 90

Local spectral and flow embeddings - Main Galaxy Sample data

Figure panels: global embedding, local spectral, local flow, and a zoom-in on the dense region.

SLIDE 91

Local spectral and flow embeddings - Main Galaxy Sample data

  • Structural differences in the visualization also translate to clusters with smaller conductance (shown for k = 50 and k = 100).

SLIDE 92

Semi-supervised learning

Problem:
  • Infer unknown labels for all nodes, when given a few nodes with known labels.
  • We assume that the graph edges represent a high likelihood of sharing a label.

Algorithm:
  • For each class, we randomly select a small subset of nodes, and we fix the labels of these nodes as known.
  • We then run a spectral or a flow method with this set of nodes as the seed. This gives one spectral or flow vector per class.
  • For each unlabelled node, we look at the corresponding coordinate in these vectors and give it the label of the class with the highest value, as sketched below.
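A compact sketch of this labeling rule, again reusing the nonlinear_pagerank sketch as the per-class diffusion (running one diffusion per seed node and summing is an illustrative simplification):

```python
import numpy as np

def diffuse_labels(A, seeds_per_class):
    """seeds_per_class: dict mapping a label to a list of seed node indices."""
    n = A.shape[0]
    labels = sorted(seeds_per_class)
    scores = np.zeros((n, len(labels)))
    for j, lab in enumerate(labels):
        for seed in seeds_per_class[lab]:
            scores[:, j] += nonlinear_pagerank(A, seed)  # one diffusion vector per class
    return [labels[j] for j in scores.argmax(axis=1)]    # highest coordinate wins
```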

SLIDE 93

Semi-supervised learning

Figure: results with the true labels included in the seeds.

Info about the data:
  • PubMed is a citation network: 19,717 scientific publications about diabetes with 44,338 citation links.
  • By construction of the graph, articles about one type of diabetes cite others about the same type more often.

SLIDE 94

Software

LocalGraphClustering on GitHub

SLIDE 95

Thank you!