Community Recovery in Graphs with Locality
Yuxin Chen†, Govinda Kamath†, Changho Suh∗, David Tse†
†Stanford, ∗KAIST

Community recovery / graph clustering
• Community structures are common in many social networks (image credits: The Future Buzz; S. Papadopoulos)
• Community recovery: partition users into several clusters based on their friendships / similarities

Community recovery in computational biology
A genome phasing problem
• phase info for each SNP: (1) maternally inherited, (2) paternally inherited
• linking reads: relative phase relation of 2 (or more) SNPs
• Haplotype phasing: retrieve phase info of all SNPs from linking reads

Stochastic block model / censored block model
Pairwise measurements for any pair $(i, j)$ of nodes:
$$ y_{i,j} \overset{\text{ind.}}{\sim} \begin{cases} P_0, & \text{if } i \text{ and } j \text{ are from the same community} \\ P_1, & \text{else} \end{cases} $$

Problem: nodes often have locality
• Most prior work: (almost) equally likely to sample between any pair of nodes
  – Condon et al., Jalali et al., Chen et al., Abbe et al., Mossel et al., Hajek et al., Chin et al., ...
• More realistically: samples come mainly (or exclusively) from nearby nodes
• In new technologies like 10x Genomics: (1) $n \sim 10^5$ SNPs; (2) linking range $\sim 100$ SNPs

This work: how to deal with measurement locality in community recovery?

A two-community model
• n variables we seek: $x_1, \cdots, x_n \in \{0, 1\}$
  – encode community membership (figure labels: $x_i = 0$, $x_i = 1$)

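Measurement model: random sampling
• Constraint graph $G$ [figure: constraint graph with two spans of width $r$ marked]
• Random sampling: pick m randomly chosen edges of $G$
• Noise model: on each of these m edges $(i, j)$, take an independent sample
$$ y_{i,j} \overset{\text{ind.}}{=} \begin{cases} x_i \oplus x_j, & \text{with prob. } 1 - \theta \\ x_i \oplus x_j \oplus 1, & \text{else} \end{cases} $$
  where $\theta$ is the measurement error rate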
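
The measurement model above can be simulated in a few lines. This is only a sketch under assumptions the slides do not spell out: the constraint graph is taken to be a ring in which each node is linked to the next r nodes, the m edges are drawn uniformly without replacement, and the helper names `ring_edges` and `sample_measurements` are hypothetical.

```python
import random

def ring_edges(n, r):
    """Edges of an assumed ring constraint graph: node i is linked to the
    next r nodes, wrapping around (requires 2 * r < n to avoid duplicates)."""
    return [(i, (i + d) % n) for i in range(n) for d in range(1, r + 1)]

def sample_measurements(x, edges, m, theta, rng):
    """Pick m edges uniformly at random (without replacement, an assumption)
    and return triples (i, j, y) where y = x_i XOR x_j, flipped w.p. theta."""
    samples = []
    for i, j in rng.sample(edges, m):
        parity = x[i] ^ x[j]
        flip = 1 if rng.random() < theta else 0
        samples.append((i, j, parity ^ flip))
    return samples

rng = random.Random(0)
n, r, theta = 12, 2, 0.2
x = [rng.randint(0, 1) for _ in range(n)]   # hidden community labels
edges = ring_edges(n, r)                    # n * r = 24 edges
samples = sample_measurements(x, edges, m=10, theta=theta, rng=rng)
```

Setting `theta=0.0` returns exact parities, which is a convenient sanity check before adding noise.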

Modeling locality via constraint graph
• Global / long-range measurements [figure: constraint graph; randomly picked edges]
• Local measurements [figure: constraint graph with range $r$ marked; randomly picked edges] (e.g. $r \sim n^{0.4}$ for 10x)
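
As a rough consistency check against the 10x figures quoted earlier (about $10^5$ SNPs and a linking range of about 100 SNPs), the scaling $r \sim n^{0.4}$ works out as follows, ignoring constant factors:

```latex
$$ r \sim n^{0.4} = (10^{5})^{0.4} = 10^{2} = 100 $$
```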

Information and computation limits
1. How many samples are needed to recover $\{x_i\}$ reliably (up to global offset)?
2. How to recover efficiently?
[Figure labels: global samples, local samples, prior works]
Encouraging news: one can obtain efficient recovery within linear time

Proposed algorithm: a 3-stage linear-time paradigm

Spectral-Stitching: Stage 1
Start by running spectral method on core complete subgraphs
$$ L = \underbrace{\mathbb{E}[L]}_{\text{rank-1}} + (L - \mathbb{E}[L]) $$
• Compute rank-1 approximation of $L$ (sample matrix restricted to the subgraph)

Spectral-Stitching: Stage 1
Split all nodes into overlapping subsets and run spectral methods separately
• Approximate solution within each subgraph
  – Key observation: approx. recovery needs only O(1) samples per node
• Inconsistent global phases across subgraphs
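
A minimal sketch of Stage 1, under simplifying assumptions the deck does not specify: windows of consecutive indices stand in for the overlapping subsets, samples are `(i, j, y)` triples with `y` a noisy parity, the sample matrix has entry +1 for `y = 0` and -1 for `y = 1`, and the leading eigenvector is found by power iteration on a diagonally shifted matrix. The names `overlapping_windows` and `spectral_labels` are hypothetical.

```python
import random

def overlapping_windows(n, width, step):
    """Consecutive index windows of `width` nodes, advancing by `step`
    (step < width), so that neighbouring windows share nodes."""
    return [list(range(s, min(s + width, n)))
            for s in range(0, n - width + step, step)]

def spectral_labels(nodes, samples, iters=200):
    """Label `nodes` by the sign pattern of the leading eigenvector of the
    sample matrix restricted to them (correct only up to a global flip)."""
    index = {v: k for k, v in enumerate(nodes)}
    size = len(nodes)
    L = [[0.0] * size for _ in range(size)]
    for i, j, y in samples:
        if i in index and j in index:
            sign = 1.0 if y == 0 else -1.0
            L[index[i]][index[j]] += sign
            L[index[j]][index[i]] += sign
    # Shift the diagonal so every eigenvalue is non-negative; power iteration
    # then converges to the algebraically largest eigenvalue's eigenvector.
    shift = max(sum(abs(t) for t in row) for row in L)
    for k in range(size):
        L[k][k] += shift
    rng = random.Random(0)
    v = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    for _ in range(iters):
        Lv = [sum(L[a][b] * v[b] for b in range(size)) for a in range(size)]
        norm = sum(t * t for t in Lv) ** 0.5
        if norm == 0.0:
            break
        v = [t / norm for t in Lv]
    return {node: (0 if v[index[node]] >= 0 else 1) for node in nodes}
```

Each window would be passed to `spectral_labels` separately; because every call is only correct up to a global flip, the per-window outputs still need to be aligned with one another.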

Spectral-Stitching: Stage 2
Calibrate phases across subgraphs by checking their correlations
Purpose of Stages 1-2: obtain an approximate solution for all nodes
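
One way to picture the calibration step is the sketch below. It assumes each subgraph comes with a dictionary of local 0/1 labels that is correct up to a global flip, and it aligns windows one after another by majority agreement on the nodes they share with what has already been calibrated. The function name `stitch` and this sequential scheme are illustrative assumptions, not necessarily the authors' procedure.

```python
def stitch(windows, local_labels):
    """windows: list of node lists, each overlapping its predecessor.
    local_labels: list of dicts node -> 0/1, one per window.
    Returns one dict of labels with consistent global phase."""
    global_labels = dict(local_labels[0])
    for nodes, labels in zip(windows[1:], local_labels[1:]):
        overlap = [v for v in nodes if v in global_labels]
        agree = sum(1 for v in overlap if labels[v] == global_labels[v])
        # Flip this window if it disagrees with the calibrated labels
        # on most of the overlap (negative correlation).
        flip = 1 if 2 * agree < len(overlap) else 0
        for v in nodes:
            if v not in global_labels:
                global_labels[v] = labels[v] ^ flip
    return global_labels

truth = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
windows = [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
# Windows 1 and 3 are deliberately given the opposite global phase.
local = [{v: truth[v] ^ (k % 2) for v in w} for k, w in enumerate(windows)]
stitched = stitch(windows, local)
```

In this toy example the local labels are error-free, so the stitched result matches `truth` exactly; with noisy local estimates the majority rule on the overlap is what absorbs the errors.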

Spectral-Stitching: Stage 3
Clean up all remaining errors by iterative refinement
• local majority vote using all samples
• Key observation: exact recovery needs at least Θ(log n) samples per node
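
The local majority vote can be sketched as follows. The assumptions here are mine: samples are `(i, j, y)` triples with `y` a noisy value of `x_i XOR x_j`, all nodes are updated synchronously in each round, ties keep the current estimate, and the number of rounds is a free parameter. The name `refine` is hypothetical.

```python
def refine(labels, samples, rounds=3):
    """Iteratively re-estimate each label by majority vote over the values
    implied by its neighbours' current labels and the observed parities."""
    n = len(labels)
    neighbours = [[] for _ in range(n)]
    for i, j, y in samples:
        neighbours[i].append((j, y))
        neighbours[j].append((i, y))
    current = list(labels)
    for _ in range(rounds):
        updated = list(current)
        for i in range(n):
            # Each sample (i, j, y) votes for x_i = x_j XOR y.
            votes = [current[j] ^ y for j, y in neighbours[i]]
            ones = sum(votes)
            if 2 * ones > len(votes):
                updated[i] = 1
            elif 2 * ones < len(votes):
                updated[i] = 0
            # On a tie (or no samples) the current estimate is kept.
        current = updated
    return current

truth = [0, 1, 1, 0, 1, 0, 0, 1]
n = len(truth)
samples = [(i, (i + d) % n, truth[i] ^ truth[(i + d) % n])
           for i in range(n) for d in (1, 2)]
start = list(truth)
start[3] ^= 1                      # one error left over from Stages 1-2
cleaned = refine(start, samples)
```

Here every node has four noise-free samples, so the single wrong label is outvoted in the first round while its neighbours keep a 3-to-1 majority for their correct values.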

Main results: rings
Theorem:
$$ \text{minimum sample complexity} = \frac{0.5\, n \log n}{1 - \exp\{-\mathsf{KL}(0.5 \,\|\, \theta)\}} $$
Info and comput. limits meet!
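
To get a feel for the theorem, the threshold can be evaluated numerically. The sketch assumes that KL(0.5 ‖ θ) is the divergence between Bernoulli(0.5) and Bernoulli(θ) and that all logarithms are natural; under that reading exp{-KL(0.5 ‖ θ)} simplifies to 2·sqrt(θ(1-θ)). The function names are hypothetical.

```python
import math

def kl_half_vs(theta):
    """KL divergence between Bernoulli(0.5) and Bernoulli(theta), in nats."""
    return 0.5 * math.log(0.5 / theta) + 0.5 * math.log(0.5 / (1.0 - theta))

def min_sample_complexity(n, theta):
    """0.5 * n * log(n) / (1 - exp(-KL(0.5 || theta))), for theta != 0.5."""
    if not 0.0 < theta < 1.0 or theta == 0.5:
        raise ValueError("theta must lie in (0, 1) and differ from 0.5")
    return 0.5 * n * math.log(n) / (1.0 - math.exp(-kl_half_vs(theta)))

# With theta = 0.2: exp(-KL) = 2 * sqrt(0.2 * 0.8) = 0.8,
# so the threshold is 0.5 * n * log(n) / 0.2 = 2.5 * n * log(n).
threshold = min_sample_complexity(100_000, 0.2)
```

As θ approaches 0.5 the denominator tends to zero and the required number of samples grows without bound, which matches the intuition that pure-noise measurements carry no information.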

An insensitivity phenomenon
[Figure panels: complete graph, ring, small-world]
Info and comput. limits are identical for many spatially invariant graphs

Empirical success rate vs. sample size
• n = 100,000, input error rate = 0.2
• 10 Monte Carlo runs to get each point
• Each run takes ∼ 6.4 sec on a Mac Pro

Extension: beyond spatially invariant graphs
[Figure panels: lines and grids, each with locality radius $r$ marked]
[Plot: info limit (sample complexity) vs. locality radius $r$, with curves for grids, lines, and rings; $r$-axis ticks at $n^{0.25}$, $n^{0.5}$, $n^{0.75}$, $n$]
Information and comput. limits achievable by same algorithm

Extension: beyond pairwise measurements
• New technologies (e.g. 10x) provide multi-linked reads from same chromosome, not just two
• Algorithm and theory can be easily extended to see performance gain
[Plot: total # SNPs touched (ticks at $5 n \log n$, $10 n \log n$, $15 n \log n$) vs. error rate per read (ticks at 0.1, 0.2), with curves for paired reads, triple-linked reads, and infinite-linked reads]

Initial results on real data (haplotype phasing)
• NA12878 dataset from 10x Genomics
• # SNPs n: 34240 ∼ 191829, sample size m: 102633 ∼ 574189

Concluding remarks
• Studied community recovery when measurements are highly local
  – motivated by genome phasing and social networks
• Information limits can be achieved in linear time for a broad family of models
Full version of paper available at http://arxiv.org/abs/1602.03828
