Seriation, Spectral Clustering and de novo genome assembly
Antoine Recanati, CNRS & ENS
with Alexandre d'Aspremont, Thomas Kerdreux, Thomas Brüls, CNRS - ENS Paris & Genoscope.
A. Recanati, Symbiose Seminar, June 2018
Seriation
The Seriation Problem.
• Pairwise similarity information $A_{ij}$ on $n$ variables.
• Suppose the data has a serial structure, i.e. there is an order $\pi$ such that $A_{\pi(i)\pi(j)}$ decreases with $|i-j|$ (R-matrix).
Recover $\pi$?
[Figure: similarity matrix, input (left) and reconstructed (right).]
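To make the problem statement concrete, here is a minimal sketch that builds a synthetic R-matrix, hides its order with a random permutation, and undoes the shuffle using the known permutation. The helper names (`make_r_matrix`, `is_r_matrix`) and the linear decay profile are illustrative assumptions, not part of the talk.

```python
import numpy as np

def make_r_matrix(n, bandwidth):
    """Similarity that decays linearly with |i - j| and vanishes beyond `bandwidth`."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.maximum(bandwidth - dist, 0).astype(float)

def is_r_matrix(A):
    """True if every row is non-increasing when moving away from the diagonal."""
    n = A.shape[0]
    for i in range(n):
        right = A[i, i:]        # from the diagonal towards the right edge
        left = A[i, :i + 1]     # from the left edge towards the diagonal
        if np.any(np.diff(right) > 0) or np.any(np.diff(left) < 0):
            return False
    return True

rng = np.random.default_rng(0)
A = make_r_matrix(n=8, bandwidth=4)

# Shuffle rows and columns with the same permutation: this is the "input"
# a seriation method would receive.
perm = rng.permutation(8)
A_shuffled = A[np.ix_(perm, perm)]

# With the permutation known, the inverse permutation restores the R-matrix.
# Seriation is the problem of finding such an ordering from A_shuffled alone.
inverse = np.argsort(perm)
A_recovered = A_shuffled[np.ix_(inverse, inverse)]
```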
Genome Assembly
Seriation has direct applications in (de novo) genome assembly.
• Genomes are cloned multiple times and randomly cut into shorter reads (∼400bp to 100kbp), which are fully sequenced.
• Reorder the reads to recover the genome.
Genome Assembly
Overlap Layout Consensus (OLC). Three stages.
• Compute overlap between all read pairs.
• Reorder overlap matrix to recover read order.
• Average the read values to create a consensus sequence.
The read reordering problem is a seriation problem.
Genome Assembly in Practice
Noise. In the noiseless case, the overlap matrix is an R-matrix. In practice...
• There are base calling errors in the reads, typically 2% to 20% depending on the process.
• Entire parts of the genome are repeated, which breaks the serial structure.
Sequencing technologies
• Next generation: short reads (∼400bp), few errors (∼2%). Repeats are challenging.
• Third generation: long reads (∼10kbp), more errors (∼15%). Can resolve some repeats, but not all of them, and the noise can be challenging.
Genome Assembly in Practice
Current assemblers.
• With short accurate reads, the reordering problem is solved by combinatorial methods using the topology of the assembly graph and additional pairing information.
• With long noisy reads, reads are corrected before assembly (hybrid correction or self-mapping).
• Layout and consensus are not clearly separated; many heuristics...
• minimap+miniasm: first assembler to work directly on raw long reads (but the consensus sequence is as noisy as the raw reads).
Outline
• Introduction
• Spectral relaxation of Seriation (Spectral Ordering)
• Multi-dimensional Spectral Ordering
• Results (Application to genome assembly)
2-SUM and the Graph Laplacian
The 2-SUM Combinatorial Problem.
• The 2-SUM problem is written
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{\pi(i)\pi(j)} (i-j)^2$$
or alternatively,
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{ij} (\pi(i)-\pi(j))^2$$
• Optimal permutation $\pi^*$: high values of $A_{ij}$ ⇔ low $|\pi(i)-\pi(j)|$, i.e., $i$ and $j$ lie close to each other.
2-SUM and the Graph Laplacian
Graph Laplacian
• $A$: adjacency matrix of an undirected weighted graph ($A_{ij} > 0$ iff there is an edge between nodes $i$ and $j$).
• Node $i$ has degree $d_i = \sum_j A_{ij}$. Degree matrix $D = \mathrm{diag}(A\mathbf{1}) = \mathrm{diag}(d)$.
• Laplacian matrix $L = D - A$.
• The Laplacian can be viewed as a quadratic form,
$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (f_i - f_j)^2$$
2-SUM and the Graph Laplacian
Mathematical reminder
• For a vector $f = (f_1, \ldots, f_n)^T \in \mathbb{R}^n$ and a matrix $M \in \mathbb{R}^{n \times n}$, we have
$$f^T M f = \sum_{i,j=1}^{n} M_{ij} f_i f_j$$
• $(\lambda \in \mathbb{R}, u \in \mathbb{R}^n)$ is an eigenvalue-eigenvector pair of $L \in \mathbb{R}^{n \times n}$ iff $Lu = \lambda u$.
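2-SUM and the Graph Laplacian
The Laplacian can be viewed as a quadratic form,
$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (f_i - f_j)^2$$
Indeed, for any $f \in \mathbb{R}^n$,
$$\begin{aligned}
f^T L f &= f^T D f - f^T A f \\
&= \sum_{i=1}^{n} f_i^2 D_{ii} - \sum_{i,j=1}^{n} A_{ij} f_i f_j \\
&= \sum_{i=1}^{n} f_i^2 \Big( \sum_{j=1}^{n} A_{ij} \Big) - \sum_{i,j=1}^{n} A_{ij} f_i f_j \\
&= \sum_{i,j=1}^{n} A_{ij} (f_i^2 - f_i f_j) \\
&= \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (f_j^2 + f_i^2 - 2 f_i f_j) \\
&= \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (f_i - f_j)^2
\end{aligned}$$
The step that introduces the factor $\frac{1}{2}$ uses the symmetry of $A$: $\sum_{i,j} A_{ij} f_i^2 = \sum_{i,j} A_{ij} f_j^2$.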
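The identity can also be checked numerically. The sketch below draws a random symmetric weight matrix and a random vector (both arbitrary choices made for illustration) and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric, non-negative weights with an empty diagonal.
B = rng.random((n, n))
A = (B + B.T) / 2
np.fill_diagonal(A, 0.0)

# Laplacian L = D - A with D = diag(A 1).
L = np.diag(A.sum(axis=1)) - A
f = rng.standard_normal(n)

lhs = f @ L @ f                                          # f^T L f
rhs = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)   # (1/2) sum_ij A_ij (f_i - f_j)^2
print(np.isclose(lhs, rhs))
```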
2-SUM and the Graph Laplacian
The Laplacian can be viewed as a quadratic form,
$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} (f_i - f_j)^2$$
• $L$ is symmetric and positive semi-definite.
• $L$ has $n$ non-negative, real-valued eigenvalues, $0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$.
• $\mathbf{1} = (1, \ldots, 1)^T$ is an eigenvector associated with the eigenvalue $0$.
• If the graph of $A$ has $K$ connected components, the eigenvalue $0$ has multiplicity $K$, with eigenvectors being indicators of the connected components.
• If $f \in \{-1, +1\}^n$, $f^T L f$ is, up to a constant factor, the objective of min-cut (clustering).
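The link between connected components and the zero eigenvalue can be observed on a small example. This sketch assumes a toy graph made of three disjoint cliques (sizes chosen arbitrarily) and counts the numerically zero eigenvalues of its Laplacian.

```python
import numpy as np

def laplacian(A):
    """Graph Laplacian L = diag(A 1) - A."""
    return np.diag(A.sum(axis=1)) - A

# Three disjoint, fully connected blocks of sizes 3, 2 and 4.
sizes = [3, 2, 4]
n = sum(sizes)
A = np.zeros((n, n))
start = 0
for s in sizes:
    A[start:start + s, start:start + s] = 1.0
    start += s
np.fill_diagonal(A, 0.0)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(laplacian(A))
num_zero = int(np.sum(np.abs(eigvals) < 1e-9))
print(num_zero, len(sizes))   # number of zero eigenvalues vs. number of components
```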
2-SUM and the Graph Laplacian
• The 2-SUM problem is written
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{\pi(i)\pi(j)} (i-j)^2$$
or alternatively,
$$\min_{\pi \in \mathcal{P}} \sum_{i,j=1}^{n} A_{ij} (\pi(i)-\pi(j))^2$$
i.e.,
$$\min_{\pi \in \mathcal{P}} \pi^T L \pi$$
• For certain matrices $A$, 2-SUM ⇔ seriation [Fogel et al., 2013].
• NP-Complete for generic matrices $A$.
• Constraints $\pi \in \mathcal{P}$?
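For very small $n$, 2-SUM can be solved by exhaustive search, which illustrates why the combinatorial constraint is the hard part. The sketch below assumes a toy banded similarity matrix and a hand-picked shuffle; `two_sum` and `brute_force_two_sum` are illustrative helper names. On this instance the planted order attains the brute-force minimum.

```python
import itertools
import numpy as np

def two_sum(A, pi):
    """2-SUM objective sum_ij A_ij (pi(i) - pi(j))^2 for a position vector pi."""
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(A * (pi[:, None] - pi[None, :]) ** 2))

def brute_force_two_sum(A):
    """Exhaustive search over all n! position vectors; only viable for tiny n."""
    n = A.shape[0]
    return min(itertools.permutations(range(1, n + 1)),
               key=lambda p: two_sum(A, p))

# Toy R-matrix with linear decay, then shuffled with a fixed permutation.
n, bandwidth = 6, 3
idx = np.arange(n)
A_true = np.maximum(bandwidth - np.abs(idx[:, None] - idx[None, :]), 0).astype(float)
perm = np.array([3, 0, 5, 1, 4, 2])
A = A_true[np.ix_(perm, perm)]

# Row a of the shuffled matrix is item perm[a] of the ordered one,
# so its planted position (1-based) is perm[a] + 1.
true_positions = perm + 1
best = brute_force_two_sum(A)
print(two_sum(A, best), two_sum(A, true_positions))
```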
Spectral relaxation
$$\min_{\pi \in \mathcal{P}} \pi^T L_A \pi \qquad \text{(2SUM)}$$
Set of permutation vectors:
$$\pi(i) \in \{1, \ldots, n\}, \ \forall\, 1 \le i \le n, \qquad \pi^T \mathbf{1} = \frac{n(n+1)}{2}, \qquad \|\pi\|_2^2 = \frac{n(n+1)(2n+1)}{6}$$
• Since $L\mathbf{1} = 0$, (2SUM) is invariant under $\pi \leftarrow \pi - \frac{n+1}{2}\mathbf{1}$, so enforce $\pi^T \mathbf{1} = 0$.
• Up to a dilation, we can choose $\|\pi\|_2^2 = 1$.
• Relax the integer constraints and let $\pi(i) \in \mathbb{R}$.
Spectral relaxation
Spectral Seriation. Define the Laplacian of $A$ as $L_A = \mathrm{diag}(A\mathbf{1}) - A$. The Fiedler vector of $A$ is written
$$f = \operatorname*{argmin}_{\mathbf{1}^T x = 0,\ \|x\|_2 = 1} x^T L_A x$$
and is the eigenvector associated with the second smallest eigenvalue of the Laplacian. The Fiedler vector reorders an R-matrix in the noiseless case.
Theorem [Atkins, Boman, and Hendrickson, 1998]. Spectral seriation. Suppose $A \in \mathbf{S}_n$ is a pre-R matrix, with a simple Fiedler value whose Fiedler vector $f$ has no repeated values. Suppose that $\Pi \in \mathcal{P}$ is such that the permuted Fiedler vector $\Pi f$ is monotonic, then $\Pi A \Pi^T$ is an R-matrix.
Spectral Ordering Algorithm
The Algorithm.
Input: Connected similarity matrix $A \in \mathbb{R}^{n \times n}$
1: Compute the Laplacian $L = \mathrm{diag}(A\mathbf{1}) - A$
2: Compute $x^*$, the eigenvector of $L$ associated with its second smallest eigenvalue
3: Sort the values of $x^*$
Output: Permutation $\pi$ such that $x^*_{\pi(1)} \le x^*_{\pi(2)} \le \ldots \le x^*_{\pi(n)}$
[Figure: similarity matrix (left) and Fiedler vector (right).]
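The three steps translate almost line by line into NumPy. The following is a minimal sketch using a dense eigendecomposition, tested on a shuffled synthetic R-matrix; the function name `spectral_ordering` and the test matrix are assumptions made for illustration, and a sparse eigensolver would be a more natural choice for large overlap matrices.

```python
import numpy as np

def spectral_ordering(A):
    """Return the permutation that sorts the Fiedler vector of A."""
    L = np.diag(A.sum(axis=1)) - A     # step 1: Laplacian
    _, eigvecs = np.linalg.eigh(L)     # eigenvalues come back in ascending order
    fiedler = eigvecs[:, 1]            # step 2: second smallest eigenvalue
    return np.argsort(fiedler)         # step 3: sort its entries

# Synthetic connected R-matrix with linear decay, then shuffled.
n, bandwidth = 20, 5
idx = np.arange(n)
A_true = np.maximum(bandwidth - np.abs(idx[:, None] - idx[None, :]), 0).astype(float)
rng = np.random.default_rng(0)
perm = rng.permutation(n)
A_shuffled = A_true[np.ix_(perm, perm)]

order = spectral_ordering(A_shuffled)
A_reordered = A_shuffled[np.ix_(order, order)]
```

The recovered order is only defined up to reversal, since the sign of an eigenvector is arbitrary. The test matrix here is symmetric under reversal, so either direction gives back the same R-matrix.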
Spectral Solution
• The spectral solution is easy to compute and scales well (polynomial time).
• But it is sensitive to noise and not flexible (hard to include additional structural constraints).
• Other (convex) relaxations can handle structural constraints and solve more robust objectives than 2SUM.
Genome assembly pipeline
• Overlap: computed from k-mers, yielding a similarity matrix $A$.
• Layout: $A$ is thresholded to remove noise-induced overlaps, and reordered with the spectral ordering algorithm. The layout is then refined with overlap information.
• Consensus: the genome is sliced into windows.
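As a rough illustration of the overlap and thresholding steps, the sketch below counts shared k-mers between error-free reads cut from a random toy genome. Everything here (read length, step, `k`, the threshold value, the helper names) is an assumption for illustration; it is not the overlapper used in the actual pipeline.

```python
import random
import numpy as np

def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def overlap_matrix(reads, k):
    """Similarity A_ij = number of distinct k-mers shared by reads i and j."""
    sets = [kmers(r, k) for r in reads]
    n = len(reads)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = len(sets[i] & sets[j])
    return A

# Toy genome and error-free reads; consecutive reads overlap by 50 bases.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(600))
read_len, step = 100, 50
reads = [genome[s:s + read_len] for s in range(0, len(genome) - read_len + 1, step)]

A = overlap_matrix(reads, k=10)

# Hypothetical threshold: discard weak similarities that are likely spurious.
A_thresholded = np.where(A >= 5, A, 0.0)
```

In this toy setting the reads are already listed in genome order, so `A` is concentrated near the diagonal; a real dataset would present them in arbitrary order, which is where the layout step comes in.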
Spectral Solution vs Noisy Synthetic data
[Figure: similarity matrix (left) and Fiedler vector (right).]
• Gaussian noise over a perfect R-matrix.
Spectral Solution vs Real DNA data
[Figure: similarity matrix (left) and Fiedler vector (right).]
• Repeats are a more structured noise that makes the method fail.
Outline
• Introduction
• Spectral relaxation of Seriation (Spectral Ordering)
• Multi-dimensional Spectral Ordering
• Results (Application to genome assembly)
Multi-dimensional Spectral Embedding
(Spoiler Alert!) There is information in the rest of the eigenvectors of $L$.
[Figure: 3D scatter plot of the first three eigenvectors of $L$ with non-zero eigenvalue.]
Multi-Dim 2-SUM and the Graph Laplacian
Generalize the quadratic expression involving the Laplacian,
$$\mathrm{Tr}\big( \tilde{\Phi}^T L_A \tilde{\Phi} \big) = \frac{1}{2} \sum_{i,j=1}^{n} A_{ij} \|y_i - y_j\|_2^2$$
• Let $0 = \lambda_0 < \lambda_1 \le \ldots \le \lambda_{n-1}$, $\Lambda \triangleq \mathrm{diag}(\lambda_0, \ldots, \lambda_{n-1})$, $\Phi = \big( \mathbf{1}, f^{(1)}, \ldots, f^{(n-1)} \big)$, be the eigendecomposition of $L = \Phi \Lambda \Phi^T$.
• For any $K < n$, $\Phi^{(K)} \triangleq \big( f^{(1)}, \ldots, f^{(K)} \big)$ defines a $K$-dimensional embedding
$$y_i = \big( f^{(1)}(i), f^{(2)}(i), \ldots, f^{(K)}(i) \big)^T \in \mathbb{R}^K, \quad \text{for } i = 1, \ldots, n \qquad \text{(K-LE)}$$
which solves the following embedding problem,
$$\begin{array}{ll}
\text{minimize} & \sum_{i,j=1}^{n} A_{ij} \|y_i - y_j\|_2^2 \\
\text{such that} & \tilde{\Phi} = \big( y_1^T, \ldots, y_n^T \big)^T \in \mathbb{R}^{n \times K}, \quad \tilde{\Phi}^T \tilde{\Phi} = I_K, \quad \tilde{\Phi}^T \mathbf{1}_n = 0_K
\end{array} \qquad \text{(Lap-Emb)}$$
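The embedding (K-LE) and the constraints of (Lap-Emb) can be checked numerically. This is a minimal sketch on a synthetic connected banded matrix; the function name `laplacian_embedding`, the matrix, and the choice $K = 3$ are illustrative assumptions.

```python
import numpy as np

def laplacian_embedding(A, K):
    """Rows y_i of the returned (n, K) array are the K-LE embedding of the nodes."""
    L = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(L)    # ascending eigenvalues, orthonormal columns
    return eigvecs[:, 1:K + 1]        # skip the constant eigenvector (eigenvalue 0)

# Connected toy similarity matrix with linear decay.
n, bandwidth, K = 30, 6, 3
idx = np.arange(n)
A = np.maximum(bandwidth - np.abs(idx[:, None] - idx[None, :]), 0).astype(float)
L = np.diag(A.sum(axis=1)) - A

Y = laplacian_embedding(A, K)

# Both sides of the generalized quadratic form.
trace_form = np.trace(Y.T @ L @ Y)
sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=2)   # ||y_i - y_j||^2
pairwise_form = 0.5 * np.sum(A * sq_dists)
```

Sorting the first column of `Y` recovers the one-dimensional spectral ordering; the remaining columns carry the additional information exploited by the multi-dimensional approach.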