  1. Seriation & Ranking: Spectral Approach. Fajwel Fogel, CNRS & ENS, Paris. With Alexandre d’Aspremont, Francis Bach, Rodolphe Jenatton, & Milan Vojnovic (CNRS, INRIA, ENS Paris & MSR Cambridge).

  2. The seriation problem
  - Pairwise similarity information S_ij on n variables.
  - Suppose the data has a serial structure, i.e. there is an order π such that S_{π(i),π(j)} decreases with |i − j| (an R-matrix). Can we recover π?
  [Figure: similarity matrix — shuffled input vs. reconstructed ordering.]
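As a minimal illustration (a hypothetical toy construction, not from the slides), a strictly decreasing R-matrix can be built as S_ij = n − |i − j|:

```python
import numpy as np

def r_matrix(n):
    """Hypothetical toy R-matrix: similarity decreases with |i - j|."""
    idx = np.arange(n)
    return n - np.abs(idx[:, None] - idx[None, :])

S = r_matrix(6)
# Check the R-matrix property along one row: entries fall off with |i - j|.
i = 2
for j in range(6):
    for k in range(6):
        if abs(i - j) <= abs(i - k):
            assert S[i, j] >= S[i, k]
```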

  3. DNA de novo assembly
  Seriation has direct applications in DNA de novo assembly.
  - Genomes are cloned multiple times and randomly cut into shorter reads (~400 bp), which are fully sequenced.
  - Reorder the reads to recover the genome. (From Wikipedia...)

  4. Seriation: a combinatorial problem
  - Combinatorial solution [FJBA, 2013; Laurent and Seminaroti, 2014]: for R-matrices, 2-SUM ⟺ seriation.
  - 2-SUM: assign similar items to nearby positions in the reordering, i.e. find the permutation π of items 1 to n that minimizes
      Σ_{i,j=1}^n S_{i,j} (π(i) − π(j))².   (1)
  - The 2-SUM problem is NP-complete for generic matrices S [George and Pothen, 1997].
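The 2-SUM objective in (1) is straightforward to evaluate directly; this sketch (toy data assumed, not from the slides) checks that on an R-matrix the serial order scores no worse than a random shuffle:

```python
import numpy as np

def two_sum(S, pi):
    """2-SUM objective: sum_{i,j} S_ij * (pi(i) - pi(j))^2."""
    pi = np.asarray(pi)
    d = pi[:, None] - pi[None, :]
    return float((S * d ** 2).sum())

# On an R-matrix, the serial (identity) order should score no worse
# than a random shuffle of the positions.
n = 8
S = n - np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
shuffled = np.random.default_rng(0).permutation(n)
assert two_sum(S, np.arange(n)) <= two_sum(S, shuffled)
```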

  5. A spectral solution
  Spectral seriation. Define the Laplacian of S as L_S = diag(S1) − S. The Fiedler vector of S is
      f = argmin_{1ᵀx = 0, ‖x‖₂ = 1} xᵀ L_S x,
  i.e. the eigenvector associated with the second-smallest eigenvalue of the Laplacian. The Fiedler vector reorders an R-matrix in the noiseless case.
  Theorem [Atkins, Boman, Hendrickson, et al., 1998] (Spectral seriation). Suppose S ∈ Sⁿ is a pre-R matrix, with a simple Fiedler value whose Fiedler vector f has no repeated values. If Π ∈ P is such that the permuted Fiedler vector Πf is monotonic, then ΠSΠᵀ is an R-matrix.
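The spectral procedure above can be sketched in a few lines (a toy R-matrix is assumed for the check; `fiedler_order` is a hypothetical helper name):

```python
import numpy as np

def fiedler_order(S):
    """Spectral seriation: sort items by the Fiedler vector of the
    Laplacian L_S = diag(S 1) - S."""
    L = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(L)    # eigenvectors, ascending eigenvalues
    return np.argsort(vecs[:, 1])  # sort by second-smallest eigenvector

# Shuffle a toy R-matrix; the spectral order should restore serial form
# (up to a full reversal, which leaves this symmetric matrix unchanged).
n = 10
S = (n - np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])).astype(float)
perm = np.random.default_rng(1).permutation(n)
S_shuffled = S[np.ix_(perm, perm)]
order = fiedler_order(S_shuffled)
S_recovered = S_shuffled[np.ix_(order, order)]
assert np.allclose(S_recovered, S)
```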

  6. Spectral solution: advantages
  - Exact for R-matrices.
  - Quite robust to noise, by arguments similar to perturbation results in spectral clustering.
  - Scales very well, especially when the similarity matrix is sparse (as in DNA sequencing and ranking).

  7. Ranking with pairwise comparisons

  8. Ranking
  Goal: given pairwise comparisons between a set of items, find the most consistent global order of these items.
  Applications:
  - sports competitions (e.g. chess, football...)
  - crowdsourcing services (e.g. TopCoder...)
  - online computer games...

  9. Ranking
  Classical methods:
  - ranking by score (e.g. #wins − #losses) [Huber, 1963; Wauthier et al., 2013]
  - ranking by “skills” under a probabilistic model [Bradley and Terry, 1952; Luce, 1959; Herbrich et al., 2006]
  - ranking according to the principal eigenvector of a transition matrix [Page et al., 1998; Negahban et al., 2012]
  - ...
  Two main issues:
  - missing comparisons
  - non-transitive comparisons (e.g. a < b and b < c but a > c).

  10. Ranking

  11. Casting the ranking problem as a seriation problem
  - Input: a matrix of pairwise comparisons C, where C_{i,j} ∈ [−1, 1]; e.g. for a tournament, C_{i,j} ∈ {−1, 0, 1} (loss, tie, win).
  - Idea: count matching comparisons of i and j against other items k.
  Example: in a tournament setting, if players i and j had the same outcomes against other opponents k, they should have a similar rank.

  12. Casting the ranking problem as a seriation problem
  - Construct a similarity matrix S:
      S_{i,j} = Σ_{k : i,j compared with k} σ(C_{i,k}, C_{j,k}),
  where σ is a similarity measure.
  - Example: when σ(a, b) = 1 + ab, S = n·11ᵀ + CCᵀ.
  [Figure: comparison matrix and the resulting similarity matrix.]
  - Is this the right way to solve the ranking problem in the presence of corrupted and missing comparisons?
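With σ(a, b) = 1 + ab and no missing comparisons, the similarity can be written directly as S = n·11ᵀ + CCᵀ. This sketch assumes a small hypothetical round-robin tournament where lower-indexed items beat higher-indexed ones:

```python
import numpy as np

def serialrank_similarity(C):
    """Matching-comparison similarity with sigma(a, b) = 1 + a*b:
    S = n * 11^T + C C^T for a full comparison matrix C."""
    n = C.shape[0]
    return n * np.ones((n, n)) + C @ C.T

# Toy tournament: C[i, j] = 1 if i beats j; here item 0 beats everyone.
n = 5
C = np.sign(np.arange(n)[None, :] - np.arange(n)[:, None]).astype(float)
S = serialrank_similarity(C)
# Adjacent items in the true order agree on more opponents than
# distant ones, so their similarity is higher.
assert S[0, 1] > S[0, 4]
```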

  13. SerialRank
  New ranking algorithm: SerialRank.
  - A very simple procedure:
    - compute a similarity matrix from pairwise comparisons (e.g. count matching comparisons);
    - solve the corresponding seriation problem (e.g. using the spectral solution).
  - Might be improved by designing new similarities.

  14. Choice of similarity
  - In applications, the design of the similarity can have a major impact.
  - For ranking, depending on the nature of your data (cardinal or ordinal, ties, etc.), you might adapt the similarity.
  - For DNA assembly, you would like a similarity robust to sequencing noise.
  - Ongoing work...

  15. Performance guarantees for SerialRank
  - Robustness to missing/corrupted comparisons: similarity-based ranking is more robust than typical score-based rankings (i.e. #wins − #losses).
  - Exact recovery regime: exact recovery of the underlying ranking with probability 1 − o(1) for o(√n) random missing/corrupted comparisons.
  - Approximate recovery regime: competitive with other approaches for partial observations and corrupted comparisons (cf. numerical experiments).

  16. Performance guarantees for SerialRank
  [Figure: recovered rank vs. item for SerialRank (top) and score-based ranking (bottom), with the corresponding comparison and similarity matrices.]
  All comparisons given: corrupted entries induce ties in the score-based ranking but not in the similarity-based ranking.

  17. Perturbation analysis
  - Derive an asymptotic analytical expression for the Fiedler vector in the noise-free setting.
  - Use perturbation results (e.g. the Davis–Kahan theorem) to bound the perturbation of the Fiedler vector under missing/corrupted comparisons.
  - Get theoretical guarantees for SerialRank in settings where only few comparisons are available.

  18. Perturbation analysis: analytical expression of the Fiedler vector
  - Use results on the convergence of Laplacian operators to describe the spectrum of the unperturbed Laplacian.
  - Following the same analysis as in [von Luxburg ’08], we can prove that asymptotically, once normalized by n², and apart from the first and second eigenvalues, the spectrum of the Laplacian matrix is contained in the interval [0.5, 0.75].
  - Moreover, we can characterize the eigenfunctions of the limit Laplacian operator (i.e. the limit of the normalized Laplacians L_n) by a differential equation, which gives an asymptotic analytical expression for the Fiedler vector.

  19. Perturbation analysis: analytical expression of the Fiedler vector
  - With the same notation as in [von Luxburg ’08], here k(x, y) = 1 − |x − y|. The degree function is
      d(x) = ∫₀¹ k(x, y) dP(y) = ∫₀¹ k(x, y) dy
  (samples are uniformly ranked), which gives d(x) = −x² + x + 1/2.
  - We deduce that the range of d is [0.5, 0.75]. The interesting eigenvalues (i.e. here the second eigenvalue) are not in this range.
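The degree computation can be checked numerically; this sketch (hypothetical, with a hand-rolled trapezoid rule) verifies both the closed form d(x) = −x² + x + 1/2 and its range [0.5, 0.75] on [0, 1]:

```python
import numpy as np

def trap(fy, ys):
    """Composite trapezoid rule for samples fy on the grid ys."""
    return float(np.sum((fy[1:] + fy[:-1]) * np.diff(ys)) / 2.0)

# Check d(x) = int_0^1 (1 - |x - y|) dy against the closed form.
ys = np.linspace(0.0, 1.0, 100001)
for x in (0.0, 0.25, 0.5, 1.0):
    numeric = trap(1.0 - np.abs(x - ys), ys)
    assert abs(numeric - (-x**2 + x + 0.5)) < 1e-6

# Range of d on [0, 1]: minimum 0.5 at the endpoints, maximum 0.75 at 0.5.
xs = np.linspace(0.0, 1.0, 1001)
d = -xs**2 + xs + 0.5
assert abs(d.min() - 0.5) < 1e-9
assert abs(d.max() - 0.75) < 1e-9
```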

  20. Perturbation analysis: analytical expression of the Fiedler vector
  - We can also characterize the eigenfunctions f by a differential equation:
      U f(x) = λ f(x) for all x ∈ [0, 1]
      ⟹ f″(x)(1/2 − λ + x − x²) + 2 f′(x)(1 − 2x) = 0 for all x ∈ [0, 1].   (2)
  - The asymptotic expression for the Fiedler vector is a solution of this differential equation with λ < 0.5.
  - Very accurate numerically, even for small values of n.

  21. Perturbation analysis: analytical expression of the Fiedler vector
  [Figure: comparison between the asymptotic analytical expression of the Fiedler vector and the numerical values obtained from the eigenvalue decomposition, for n = 10 (left) and n = 100 (right).]

  22. Perturbation analysis
  Goal: get a result similar to that for the point-score method (cf. [Wauthier et al., 2013]). Show that for any precision parameter µ, with a proportion of observations p ≳ log n / (µn),
      max |π̃ − π| ≲ µn   with high probability,
  up to constants and log(n) factors.

  23. Perturbation analysis: classical perturbation results
  Davis–Kahan theorem. If |λ̂₃ − λ₂| > |λ₃ − λ₂|/2 and |λ̂₁ − λ₂| > |λ₁ − λ₂|/2, then
      ‖f − f̂‖₂ ≤ √2 ‖L̂ − L‖_op / min(λ₂ − λ₁, λ₃ − λ₂).
  Weyl’s inequality. Let L_S and L_S̃ be n×n positive definite matrices and let L_R = L_S̃ − L_S. Let λ₁ ≤ … ≤ λₙ and λ̃₁ ≤ … ≤ λ̃ₙ be the eigenvalues of L_S and L_S̃ respectively. Then, for all i, |λ̃ᵢ − λᵢ| ≤ ‖L_R‖₂.
  + concentration inequalities.
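Weyl's inequality is easy to sanity-check numerically on random symmetric matrices (a hypothetical sketch; `np.linalg.eigvalsh` returns eigenvalues in ascending order):

```python
import numpy as np

# Weyl: eigenvalues of a symmetric matrix move by at most the
# spectral norm of a symmetric perturbation.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
L_S = A + A.T                      # symmetric "Laplacian-like" matrix
B = 0.1 * rng.normal(size=(6, 6))
L_R = B + B.T                      # symmetric perturbation
lam = np.linalg.eigvalsh(L_S)          # ascending eigenvalues
lam_tilde = np.linalg.eigvalsh(L_S + L_R)
op_norm = np.linalg.norm(L_R, 2)       # spectral norm ||L_R||_2
assert np.all(np.abs(lam_tilde - lam) <= op_norm + 1e-10)
```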

  24. Numerical results: ranking
  Synthetic datasets with random missing/corrupted comparisons. Evaluate the Kendall rank correlation coefficient τ between the recovered ranking and the “true” ranking (τ ∈ [−1, 1]; τ = 1 means identical rankings).
  [Figure: Kendall τ vs. % missing, % corrupted, and % missing with 20% corrupted comparisons, for 100 items. SR: SerialRank, PS: point score, RC: rank centrality, BTL: Bradley–Terry.]
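Kendall's τ as used here can be computed directly from its definition (a hypothetical reference implementation; in practice `scipy.stats.kendalltau` would typically be used):

```python
import numpy as np

def kendall_tau(r1, r2):
    """Kendall rank correlation between two rankings (position vectors):
    (#concordant pairs - #discordant pairs) / (n choose 2)."""
    n = len(r1)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(r1[i] - r1[j]) * np.sign(r2[i] - r2[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

assert kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]) == 1.0   # identical rankings
assert kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]) == -1.0  # fully reversed
```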

  25. Numerical results: ranking
  Real datasets: TopCoder and England Premier League.
  [Figure: % upsets in the top k vs. k, for the official TopCoder ranking, PS, RC, BTL, SR, semi-supervised, and Dis. SR: SerialRank, PS: point score, RC: rank centrality, BTL: Bradley–Terry.]

  26. Conclusion
  Results:
  - Ranking as a seriation problem, with perturbation results.
  - Good performance on some applications, without specific tuning.
  Open problems:
  - Impact of similarity measures.
  - Predictive power of SerialRank.
