The ground truth about metadata and community detection in networks. Leto Peel, Université catholique de Louvain. arXiv:1608.05878
Community detection: Split nodes into groups based on their pattern of links
Data generating process: Generate nodes and assign to communities, T. Generate links in G dependent on community membership, g(T).
Community detection: Observe G. Infer T, f(G). Assess performance on how well we recover T.
Ground truth in real networks?
Networks can have metadata that describe the nodes:
  social networks: age, sex, ethnicity, race, etc.
  food webs: feeding mode, species body mass, etc.
  internet: data capacity, physical location, etc.
  protein interactions: molecular weight, association with cancer, etc.
Recovering metadata implies sensible methods. Figure panels: stochastic block model; stochastic block model with degree correction. Karrer, Newman. Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011). Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005).
Metadata often treated as ground truth. "Do you think that's ground truth you're detecting?" Yang & Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach (2013).
Diagram: Metadata, M; Ground truth, T; Communities, C = f(G). We would like to measure d(T, f(G)), but in practice we measure d(M, f(G)), and d(M, T) is unknown.
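The diagram leaves the comparison d(·, ·) unspecified. As a minimal sketch, assuming d is taken to be one minus the normalized mutual information (NMI) between two labelings of the same nodes, it could be computed as follows. The function name and the arithmetic-mean normalization are choices made for this illustration, not something the slides specify.

```python
import math
from collections import Counter


def normalized_mutual_information(a, b):
    """NMI between two labelings of the same nodes (1.0 means identical partitions)."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same nodes")
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    count_ab = Counter(zip(a, b))

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings put every node in a single group

    mutual_information = sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    # Assumed normalization: arithmetic mean of the two entropies.
    return mutual_information / ((h_a + h_b) / 2)


# Label names do not matter, only the grouping does.
T = [0, 0, 0, 1, 1, 1]
C = [1, 1, 1, 0, 0, 0]
d_T_C = 1.0 - normalized_mutual_information(T, C)
```

With M in place of T, the same function gives d(M, f(G)), which is the quantity that is actually available when metadata stand in for ground truth.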
When communities ≠ metadata...
(i) the metadata do not relate to the network structure,
(ii) the detected communities and the metadata capture different aspects of the network's structure,
(iii) the network contains no structure (e.g., an E-R random graph),
(iv) the community detection algorithm does not perform well.
Typically we assume (iv) is the only possible cause.
The Karate Club network. Figure labels: Instructor, President. Split into factions.
‘This can be explained by noting that he was only three weeks away from a test for black belt (master status) when the split in the club occurred. Had he joined the officers’ [President's] club he would have had to give up his rank and begin again in a new style of karate with a white (beginner’s) belt, since the officers had decided to change the style of karate practiced in their new club’ - Zachary 1977
You only see what you look for... US politics is more than two opposing views. Adamic, Glance. The political blogosphere and the 2004 US election: divided they blog. 36–43 (2005). Peixoto, T. P. Hierarchical Block Structures and High-Resolution Model Selection in Large Networks. Phys. Rev. X 4, 011047 (2014).
Different generative processes = different community structures
Many good partitions... Evans, T. S. Clique graphs and overlapping communities. J. Stat. Mech. 2010, P12037–22 (2010).
Metadata are not ground truth for community detection
Metadata are not ground truth for community detection
No interpretability of negative results:
(i) M unrelated to network structure
(ii) C and M capture different aspects of network structure
(iii) the network has no structure
(iv) the algorithm does not perform well
Multiple sets of metadata exist. Which set is ground truth?
We see what we look for. Confirmation bias. Publication bias.
“Community” is model dependent. Do we expect all networks across all domains to have the same relationship with communities?
Community detection is an inverse problem. Data generation: Communities, T → Network, G, via g(T). Community detection: Network, G → Communities, via f(G).
However, in real networks both T and g are unknown. For any graph there exist a (Bell) number of possible “ground truth” partitions, and an infinite number of capable generative models. The map {generative models, g} × {partitions, T} → {graph G} is many to one (proof linked from the original slide). The community detection problem is ill-posed (no unique solution).
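To give a sense of scale for the "(Bell) number" of candidate partitions, the sketch below counts the ways to partition n labeled nodes using the Bell triangle recurrence. The function name is chosen for this illustration.

```python
def bell_number(n):
    """Number of ways to partition a set of n labeled nodes (Bell triangle)."""
    row = [1]
    for _ in range(n):
        # Each new row starts with the last entry of the previous row;
        # every further entry adds the entry to its left and the one above-left.
        next_row = [row[-1]]
        for value in row:
            next_row.append(next_row[-1] + value)
        row = next_row
    return row[0]


print(bell_number(5))   # 52
print(bell_number(10))  # 115975
```

The count grows faster than exponentially in n, so enumerating every candidate "ground truth" is out of reach even for a network as small as the 34-node Karate Club.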
A No Free Lunch Theorem for community detection? The NFL theorem (supervised learning) states that there cannot exist a classifier that is a priori better than any other, averaged over all possible problems. Wolpert, D. H. The lack of a priori distinctions between learning algorithms. Neural Computation 8, 1341–1390 (1996).
A No Free Lunch Theorem for community detection. NFL Theorem for community detection (paraphrased): For the community detection problem, with accuracy measured by adjusted mutual information, the uniform average of the accuracy of any method f over all possible community detection problems is a constant which is independent of f (proof linked from the original slide). On average, no community detection algorithm performs better than any other.
arXiv:1608.05878
So, what about metadata? Metadata = types of nodes. Communities = how nodes interact. Metadata + Communities = how different types of nodes interact with each other. We require new methods to understand the relationship between metadata and structure.
Are the metadata related to the network structure? Blockmodel Entropy Significance Test, addressing (i) the metadata do not relate to the network structure. Do metadata and detected communities capture different aspects of network structure? neoSBM, addressing (ii) communities and metadata capture different aspects of network structure.
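The slide names the Blockmodel Entropy Significance Test without describing how it works. The sketch below shows one way a test of this kind could be set up, under assumptions that are not taken from the slides: a Bernoulli stochastic blockmodel on an undirected simple graph, the entropy defined as the negative maximized log-likelihood when the metadata are used as the partition, and a p-value defined as the fraction of random label permutations whose entropy is no larger than the observed one. The function names `sbm_entropy` and `entropy_permutation_p_value` are hypothetical, and this is not the authors' implementation.

```python
import math
import random
from collections import Counter
from itertools import combinations_with_replacement


def sbm_entropy(edges, labels):
    """Negative maximized log-likelihood of a Bernoulli SBM for a given partition."""
    sizes = Counter(labels)
    edge_counts = Counter()
    for i, j in edges:
        r, s = sorted((labels[i], labels[j]))
        edge_counts[(r, s)] += 1
    entropy = 0.0
    for r, s in combinations_with_replacement(sorted(sizes), 2):
        # Number of node pairs that could carry an edge between groups r and s.
        pairs = sizes[r] * (sizes[r] - 1) // 2 if r == s else sizes[r] * sizes[s]
        m = edge_counts[(r, s)]
        if pairs == 0 or m == 0 or m == pairs:
            continue  # completely empty or completely full blocks contribute zero
        p = m / pairs
        entropy -= m * math.log(p) + (pairs - m) * math.log(1 - p)
    return entropy


def entropy_permutation_p_value(edges, metadata, n_permutations=1000, seed=None):
    """Fraction of label permutations that describe the graph at least as well."""
    rng = random.Random(seed)
    observed = sbm_entropy(edges, metadata)
    shuffled = list(metadata)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if sbm_entropy(edges, shuffled) <= observed + 1e-12:
            hits += 1
    return hits / n_permutations


# Toy graph: two disjoint 4-cliques whose metadata match the cliques exactly.
clique_edges = [(i, j) for base in (0, 4)
                for i in range(base, base + 4)
                for j in range(i + 1, base + 4)]
aligned_metadata = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = entropy_permutation_p_value(clique_edges, aligned_metadata, seed=0)
```

A small p-value suggests the metadata partition describes the network better than a random relabeling would, which speaks against explanation (i). It does not say the metadata are the only, or the best, partition.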
The Stochastic Blockmodel. Edges are conditionally independent given community membership: $p_{ij} = p(e_{ij} \mid z_i, z_j, \omega) = \omega_{z_i, z_j}$. Figure: block matrix of edge densities, with intra-community density on the diagonal blocks and inter-community density on the off-diagonal blocks; shading indicates increasing density.
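The edge rule on this slide can be sketched as a sampler. This is a minimal illustration, assuming an undirected graph without self-loops and a symmetric matrix ω indexed by integer community labels. The function name `sample_sbm` and its parameters are chosen here for illustration.

```python
import random


def sample_sbm(z, omega, seed=None):
    """Sample undirected edges where pair (i, j) is linked with probability omega[z[i]][z[j]]."""
    rng = random.Random(seed)
    n = len(z)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Each pair is an independent Bernoulli trial given the memberships.
            if rng.random() < omega[z[i]][z[j]]:
                edges.append((i, j))
    return edges


# Assortative example: dense within communities, sparse between them.
z = [0, 0, 0, 1, 1, 1]
omega = [[0.9, 0.1],
         [0.1, 0.9]]
edges = sample_sbm(z, omega, seed=42)
```

Swapping the diagonal and off-diagonal values of ω gives a disassortative structure from the same model, which is one sense in which “community” depends on the model.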