Scalable Algorithm for Probabilistic Overlapping Community Detection

Scalable Algorithm for Probabilistic Overlapping Community Detection
Kento Nozawa, Kei Wakabayashi
University of Tsukuba
WSDM 2017 workshop on SWM

Large Graph
It’s hard to analyze a large graph.
Examples:
• Citation networks
• Co-author relationships
• Social networks
• Hyperlinks on web pages
Need: decompose a large graph into smaller subgraphs

Community Structures in Graph
In the same community,
• nodes are densely connected internally
• nodes resemble each other
  – Same affiliation
  – Same interest
  – Related research area

Overlapping Community
Each node can belong to multiple communities.
Many graphs have overlapping communities
  – Ex. Related research areas in a co-author graph
    • Blue: Data mining
    • Red: Machine learning
    • A has published in both areas
[Figure: example graph with nodes A, B, C, D, E, w, x, y, z]

Bag-of-nodes Representation
Bag-of-words for graph
• A node corresponds to one document
• The node and its adjacency list correspond to words in the document
[Figure: the example graph and its bag-of-nodes]
Node as doc   Nodes as words
A             A, B, C, D, E, x, y, z
B             B, A, E
C             C, D, E, A
D             D, E, C, A
E             E, B, A, C, D
x             x, y, A, w
y             y, A, x, z
z             z, y, A
w             w, x
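The bag-of-nodes construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `bag_of_nodes` and the edge list `EDGES` are assumptions, with the edges read off the example adjacency lists on the slide.

```python
def bag_of_nodes(edges):
    """Map each node to a 'document': the node itself plus its neighbours."""
    docs = {}
    for u, v in edges:
        # The first time a node is seen, its document starts with the node itself.
        docs.setdefault(u, [u]).append(v)
        docs.setdefault(v, [v]).append(u)
    return docs


# Undirected edges of the example graph (assumed from the slide's adjacency lists).
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("A", "x"), ("A", "y"), ("A", "z"),
    ("B", "E"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("x", "y"), ("x", "w"), ("y", "z"),
]

docs = bag_of_nodes(EDGES)
print(docs["B"])  # ['B', 'A', 'E']
```

Each edge is expected to appear once; a duplicated edge would repeat a word in both documents.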

Latent Dirichlet Allocation (LDA) [Blei+, 2003]
• Probabilistic generative model for bag-of-words
• Finds topics from word co-occurrence
• Each topic defines a distribution over all words
Topics (distribution over all words):
  – author 0.14, cite 0.12, citation 0.11, review 0.11, …
  – coffee 0.15, drink 0.15, beans 0.14, cafe 0.13, …
Documents (bag-of-words):
  – "Coffee shop": drink, drink, coffee, coffee, coffee, coffee, beans, beans, espresso, cafe
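The generative story behind LDA can be sketched as follows: each document draws its own topic mixture from a Dirichlet prior, then every word picks a topic from that mixture and a word from that topic. The code below is a rough illustration using only the standard library. The names `sample_dirichlet` and `generate_document`, the four-word vocabulary, the topic probabilities, and `alpha=0.5` are all assumptions chosen for the example, not values from the slides.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw from a symmetric Dirichlet by normalising independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, vocabulary, length, alpha, rng):
    """Generate one bag-of-words document from the LDA generative process."""
    theta = sample_dirichlet(alpha, len(topics), rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # topic for this word
        words.append(rng.choices(vocabulary, weights=topics[k])[0])
    return words


# Illustrative toy model (hypothetical numbers): two topics over four words.
VOCABULARY = ["author", "cite", "coffee", "drink"]
TOPICS = [
    [0.5, 0.5, 0.0, 0.0],  # a "research" topic
    [0.0, 0.0, 0.5, 0.5],  # a "coffee shop" topic
]

print(generate_document(TOPICS, VOCABULARY, 10, 0.5, random.Random(0)))
```

Inference runs this story in reverse: given only the documents, it estimates the topics and the per-document mixtures.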

LDA for Graph
A topic represents an overlapping community.
Each community is an affiliation probability distribution over nodes.
Graph as documents:
Node as doc   Nodes as words
A             A, B, C, D, E, x, y, z
B             B, A, E
C             C, D, E, A
D             D, E, C, A
E             E, B, A, C, D
x             x, y, A, w
y             y, A, x, z
z             z, y, A
w             w, x
Communities (distributions over nodes):
  – E 0.20, C 0.20, D 0.18, B 0.18, A 0.12, x 0.04, y 0.03, z 0.03, w 0.02
  – x 0.22, y 0.22, z 0.20, A 0.15, w 0.07, C 0.05, D 0.04, B 0.04, E 0.01
E, C, D, B and A belong to the first community with high probability.
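One way to read overlapping memberships off these distributions is to keep every node whose probability clears a cut-off. The sketch below uses the two example distributions from the slide; the 0.10 threshold and the names `members` and `memberships` are assumptions for illustration, since the slides do not say how memberships are extracted.

```python
# The two example community distributions shown on the slide.
COMMUNITIES = {
    "community 1": {"E": 0.20, "C": 0.20, "D": 0.18, "B": 0.18, "A": 0.12,
                    "x": 0.04, "y": 0.03, "z": 0.03, "w": 0.02},
    "community 2": {"x": 0.22, "y": 0.22, "z": 0.20, "A": 0.15, "w": 0.07,
                    "C": 0.05, "D": 0.04, "B": 0.04, "E": 0.01},
}


def members(distribution, threshold):
    """Nodes whose affiliation probability reaches the (assumed) threshold."""
    return {node for node, p in distribution.items() if p >= threshold}


def memberships(communities, threshold=0.10):
    """Map each node to the set of communities it belongs to."""
    result = {}
    for name, distribution in communities.items():
        for node in members(distribution, threshold):
            result.setdefault(node, set()).add(name)
    return result


print(sorted(memberships(COMMUNITIES)["A"]))  # ['community 1', 'community 2']
```

With this cut-off, A lands in both communities, while w (0.02 and 0.07) is assigned to neither; a lower threshold would trade precision for coverage.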

Stochastic Variational Inference [Mimno+, 2012]
Inference algorithm based on stochastic gradient descent
  – Updates parameters using the nodes sampled in each iteration
  – Mini-batch size: # nodes sampled as documents
Example: when mini-batch size is 4, sampling from the graph as documents may pick
Node as doc   Nodes as words
B             B, A, E
C             C, D, E, A
z             z, y, A
w             w, x
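The outer loop of a stochastic update can be sketched as below: sample a mini-batch of node documents, then move the global parameters toward a mini-batch estimate with a decaying step size. This shows only the sampling and the step-size-weighted blend; the per-document variational step that produces the estimate is omitted. The function names and the defaults `tau0=1.0` and `kappa=0.7` are assumptions, not settings reported in the slides.

```python
import random


def sample_minibatch(bag_of_nodes, batch_size, rng):
    """Pick batch_size node documents uniformly at random, without replacement."""
    chosen = rng.sample(sorted(bag_of_nodes), batch_size)
    return {node: bag_of_nodes[node] for node in chosen}


def step_size(t, tau0=1.0, kappa=0.7):
    """Decaying step size rho_t = (tau0 + t) ** -kappa, with kappa in (0.5, 1]."""
    return (tau0 + t) ** -kappa


def blend(old, estimate, rho):
    """Move each global parameter toward its mini-batch estimate by a factor rho."""
    return [(1.0 - rho) * o + rho * e for o, e in zip(old, estimate)]


# The example graph as documents (from the slide).
DOCS = {
    "A": ["A", "B", "C", "D", "E", "x", "y", "z"],
    "B": ["B", "A", "E"], "C": ["C", "D", "E", "A"], "D": ["D", "E", "C", "A"],
    "E": ["E", "B", "A", "C", "D"], "x": ["x", "y", "A", "w"],
    "y": ["y", "A", "x", "z"], "z": ["z", "y", "A"], "w": ["w", "x"],
}

batch = sample_minibatch(DOCS, 4, random.Random(0))
print(len(batch))  # 4
```

Because each iteration touches only the sampled documents, the cost per update depends on the mini-batch size rather than on the number of nodes in the graph.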

Experiment
Evaluation of scalability for the graph size
• Runtime for overlapping community detection
Quality metrics for overlapping communities
• Triangle participation ratio (TPR)
  – Ratio of #nodes in a community that belong to a triangle
  – Higher is better
• Conductance
  – Ratio of a community's #edges that link to an outer node
  – Lower is better
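Both metrics can be computed directly from an adjacency structure. The sketch below makes two assumptions the slide leaves open: a triangle counts only if all three nodes are inside the community, and conductance is the usual cut-over-volume ratio (boundary edges divided by the sum of member degrees). Function names and the example edge list are illustrative.

```python
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def triangle_participation_ratio(adj, community):
    """Fraction of members that sit in at least one triangle inside the community."""
    members = set(community)
    in_triangle = 0
    for u in members:
        inside = adj[u] & members
        # u is in a triangle if any two of its in-community neighbours are linked.
        if any(w in adj[v] for v, w in combinations(inside, 2)):
            in_triangle += 1
    return in_triangle / len(members)


def conductance(adj, community):
    """Boundary edges divided by the total degree of the community's members."""
    members = set(community)
    cut = sum(len(adj[u] - members) for u in members)
    volume = sum(len(adj[u]) for u in members)
    return cut / volume if volume else 0.0


# Example graph from the earlier slides (edges assumed from its adjacency lists).
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("A", "x"), ("A", "y"), ("A", "z"),
    ("B", "E"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("x", "y"), ("x", "w"), ("y", "z"),
]
adj = build_adjacency(EDGES)
print(triangle_participation_ratio(adj, ["A", "x", "y", "z", "w"]))  # 0.8
print(conductance(adj, ["A", "x", "y", "z", "w"]))  # 0.25
```

In the second community, w has a single neighbour and so cannot be part of a triangle, which is why the ratio drops to 4 of 5 members.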

Experimental Datasets
Name         #nodes       #edges
DBLP         317,080      1,049,866
Orkut        3,072,441    117,185,083
Friendster   65,608,366   1,806,067,135
From SNAP Datasets
Only Friendster:
• Stored in MySQL
• Sample as many records as the mini-batch size from the table
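Sampling a mini-batch from a database table, rather than from memory, might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that the example is self-contained; the table name `docs`, its columns, and the strategy of drawing random integer keys and fetching by primary key are all assumptions, since the slides do not describe the schema or the query.

```python
import random
import sqlite3


def create_store(bag_of_nodes):
    """Load node documents into an in-memory table keyed by a dense integer id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, node TEXT, words TEXT)")
    rows = [(i, node, " ".join(words))
            for i, (node, words) in enumerate(sorted(bag_of_nodes.items()))]
    conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def sample_minibatch_from_table(conn, batch_size, rng):
    """Draw random ids in Python, then fetch only those rows by primary key."""
    (n_rows,) = conn.execute("SELECT COUNT(*) FROM docs").fetchone()
    ids = rng.sample(range(n_rows), batch_size)
    placeholders = ",".join("?" * batch_size)
    query = f"SELECT node, words FROM docs WHERE id IN ({placeholders})"
    return {node: words.split() for node, words in conn.execute(query, ids)}


# A few node documents from the example graph.
DOCS = {
    "B": ["B", "A", "E"], "C": ["C", "D", "E", "A"], "D": ["D", "E", "C", "A"],
    "x": ["x", "y", "A", "w"], "z": ["z", "y", "A"], "w": ["w", "x"],
}

conn = create_store(DOCS)
batch = sample_minibatch_from_table(conn, 4, random.Random(0))
print(len(batch))  # 4
```

Fetching by key keeps each iteration's database work proportional to the mini-batch size, which is the property that matters when the full graph does not fit in memory.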

Comparison of Runtime
[Figure: runtime comparison; annotated values: 2 hours, 7 min.]
Settings: #communities: 4,000; #iterations: 1,000; mini-batch size: 2,000

The Metrics of DBLP Communities
• TPR: the median of SVBLDA is the third best
• Conductance: the median of SVBLDA is the third worst
Settings: #communities: 4,000; #iterations: 1,000; mini-batch size: 2,000

Parameter Sensitivity in DBLP
• Varying mini-batch size or #iterations while fixing the other parameter
• No significant improvement in TPR/Conductance when mini-batch size > 3,000 or #iterations > 2,000
Fixed values: mini-batch size: 2,000; #iterations: 1,000

Conclusion
• Scalable community detection algorithm based on LDA for large graphs
• About 2 hours to detect communities from the large graph
• It’s unnecessary to set a large mini-batch size and #iterations for the DBLP dataset
