Hierarchical organization of syntenic blocks in large genomic datasets
Daniel Doerr
Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University
Workshop on Data Structures in Bioinformatics, February 4, 2020
Outline
Introduction
Synteny hierarchies for permutations
Synteny hierarchies for sequences
PSyCHO
Data structures for large-scale comparisons
Objective: multi-species whole-genome comparisons
Solution: pan-genome data structures
... but these are only suitable for very similar genomes
Abstraction by decomposition
genomes decomposed into syntenic blocks
essential for studying genome evolution between distant species
current studies restricted to protein-coding genes
omission of many other conserved genomic regions
[Figure: a syntenic block marked on a double-stranded DNA sequence]
What is synteny?
A zoo of definitions:
"the same ribbon" (Renwick, 1971), set of markers co-located on the same chromosome
markers must be collinear
local rearrangements allowed
mostly tool-centric: FISH, GRIMM/DRIMM-Synteny, Cyntenator, i-ADHoRe, Sibelia, CoGe, Satsuma, etc.
[Figure: blocks A and B in genomes G and H]
What is synteny?
Dilemma: there is no one true decomposition of genomes into syntenic blocks.
Definition [Ghiurcuta and Moret, 2014]
Homology assignment: set $\mathcal{H}$ of pairwise (equivalence) relations.
Syntenic block (SB): single marker or set of contiguous syntenic blocks.
Given two genomes $G$, $H$ and a homology assignment $\mathcal{H}$, two SBs $A \subseteq G$ and $B \subseteq H$ are homologous if
for each $a \in A$: $\exists\, (a, h) \in \mathcal{H},\ h \in H \implies (a, b') \in \mathcal{H},\ b' \in B$, and
for each $b \in B$: $\exists\, (b, g) \in \mathcal{H},\ g \in G \implies (a', b) \in \mathcal{H},\ a' \in A$.
[Figure: homologous SBs A in genome G and B in genome H]
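The homology condition on two SBs can be checked directly. The following is a minimal sketch, assuming markers are plain strings and the homology assignment is given as a set of `(marker_in_G, marker_in_H)` pairs; the function name `are_homologous` and this encoding are illustrative choices, not taken from the slides. It reads the condition as: every marker of one block that has any homolog in the other genome must have at least one homolog inside the other block.

```python
from collections import defaultdict


def are_homologous(block_a, block_b, homology):
    """Check the SB homology condition for block_a (in G) and block_b (in H).

    homology: iterable of (g_marker, h_marker) pairs (assumed encoding).
    """
    g_to_h = defaultdict(set)
    h_to_g = defaultdict(set)
    for g, h in homology:
        g_to_h[g].add(h)
        h_to_g[h].add(g)

    a_set, b_set = set(block_a), set(block_b)

    # Each a with a homolog somewhere in H needs a homolog inside B.
    a_ok = all(
        not g_to_h.get(a) or bool(g_to_h[a] & b_set) for a in a_set
    )
    # Symmetric condition for each b with a homolog somewhere in G.
    b_ok = all(
        not h_to_g.get(b) or bool(h_to_g[b] & a_set) for b in b_set
    )
    return a_ok and b_ok


if __name__ == "__main__":
    hom = {("g1", "h2"), ("g2", "h1"), ("g3", "h3")}
    print(are_homologous({"g1", "g2"}, {"h1", "h2"}, hom))
    print(are_homologous({"g1", "g2"}, {"h1"}, hom))
```

Markers without any homolog (for example, lineage-specific insertions) do not violate the condition under this reading, which is what lets blocks tolerate gaps.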
Synteny hierarchy
What are the homologous SBs of G, H?
G, H are covered by one homologous SB pair
... but this pair contains several other homologous SB pairs
[Figure: genomes G and H]
Synteny hierarchies for permutations
Common intervals in permutations
Definition
A pair of intervals of two permutations is common if they share the same set of elements.
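The definition can be turned into a brute-force enumeration. The sketch below assumes both permutations are Python lists over the same elements and uses 0-based, inclusive index pairs; `common_intervals` is a hypothetical helper name. It relies on the observation that an interval of `p` is common exactly when the positions of its elements in `q` form a contiguous range.

```python
def common_intervals(p, q):
    """Enumerate common intervals of permutations p and q.

    Returns a list of ((i, j), (lo, hi)): p[i..j] and q[lo..hi] contain
    the same set of elements. Indices are 0-based and inclusive.
    Quadratic brute force, for illustration only.
    """
    pos_in_q = {value: idx for idx, value in enumerate(q)}
    found = []
    for i in range(len(p)):
        lo = hi = pos_in_q[p[i]]
        for j in range(i, len(p)):
            lo = min(lo, pos_in_q[p[j]])
            hi = max(hi, pos_in_q[p[j]])
            # j - i + 1 distinct positions spanning hi - lo + 1 slots
            # are contiguous exactly when both counts agree.
            if hi - lo == j - i:
                found.append(((i, j), (lo, hi)))
    return found


if __name__ == "__main__":
    for pair in common_intervals([1, 2, 3, 4], [2, 1, 4, 3]):
        print(pair)
```

Singletons and the full permutation are always reported, which matches the trivial members required of a closed set of intervals.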
Synteny hierarchy
PQ-tree [Booth and Lueker, 1976]
"Q"-node: collinear; "P"-node: permute freely
[Figure: PQ tree with P- and Q-nodes over genomes G and H]
Booth and Lueker
PQ tree construction: linear time w.r.t. input size, i.e., the number of 1s of an $n \times m$ matrix
number of markers: $n$
number of common intervals: $m \in O(n^2)$
... but cubic w.r.t. output size (the matrix can hold up to $n \cdot m \in O(n^3)$ ones): the PQ tree has only $O(n)$ nodes!
Intervals of a PQ tree
Definition [Bergeron et al., 2008]
The frontier of a node is the set of labels of the leaves of the subtree rooted at this node, or a singleton comprising a leaf label.
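As an illustration of the frontier definition, here is a minimal sketch of a PQ tree as a plain recursive data structure. The class names `Leaf` and `Node` and the `kind` field are assumptions made for this example; the sketch only computes frontiers and does not implement the Booth and Lueker construction.

```python
from dataclasses import dataclass, field


@dataclass
class Leaf:
    label: str


@dataclass
class Node:
    kind: str  # "P" (children permute freely) or "Q" (children collinear)
    children: list = field(default_factory=list)


def frontier(node):
    """Return the set of leaf labels below node (a singleton for a leaf)."""
    if isinstance(node, Leaf):
        return {node.label}
    labels = set()
    for child in node.children:
        labels |= frontier(child)
    return labels


if __name__ == "__main__":
    inner = Node("P", [Leaf("b"), Leaf("c")])
    root = Node("Q", [Leaf("a"), inner, Leaf("d")])
    print(sorted(frontier(inner)))
    print(sorted(frontier(root)))
```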
Sets of common intervals in permutations
Definition [Bergeron et al., 2008]
A set of intervals $\mathcal{I}$ is closed if $(1), \ldots, (n) \in \mathcal{I}$, $(1..m) \in \mathcal{I}$, and for each pair of intervals $(i..k), (j..l) \in \mathcal{I}$ s.t. $i < j \leq k < l$, also $(i..j), (j..k), (k..l), (i..l) \in \mathcal{I}$.
[Figure: two overlapping intervals with endpoints i, j, k, l]
Commuting sets
Definition [Bergeron et al., 2008]
Two intervals $A$, $B$ commute if $A \subseteq B$ or $B \subseteq A$ or $A \cap B = \emptyset$.
... and a set of intervals $\mathcal{I}$ is commuting if all pairs of intervals commute.
Strong intervals
Definition [Bergeron et al., 2008]
Given a set of intervals $\mathcal{I}$, an interval $A$ is strong if it commutes with all intervals $B \in \mathcal{I}$.
The strong intervals of a closed set of intervals $\mathcal{I}$ are the frontier of the PQ tree of $\mathcal{I}$.
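Strong intervals can be filtered out of a given interval set with a pairwise test. This sketch assumes intervals are inclusive `(start, end)` tuples over positions; `commute` and `strong_intervals` are illustrative names, and the quadratic filter is meant to mirror the definition rather than to be efficient.

```python
def commute(a, b):
    """True if interval a contains b, b contains a, or they are disjoint."""
    (i, k), (j, l) = a, b
    disjoint = k < j or l < i
    a_in_b = j <= i and k <= l
    b_in_a = i <= j and l <= k
    return disjoint or a_in_b or b_in_a


def strong_intervals(intervals):
    """Keep the intervals that commute with every interval in the set."""
    return [a for a in intervals if all(commute(a, b) for b in intervals)]


if __name__ == "__main__":
    # All intervals of the identity permutation on three markers.
    all_intervals = [(1, 1), (2, 2), (3, 3), (1, 2), (2, 3), (1, 3)]
    print(strong_intervals(all_intervals))
```

In the example, `(1, 2)` and `(2, 3)` overlap without nesting, so neither is strong; only the singletons and the full interval survive.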
Synteny hierarchies for sequences
SB hierarchy
Context-dependency: two sets of common intervals intersect only if all their intervals intersect in the corresponding sequences
[Figure: sequences G, H, I]
Sets of common intervals in sequences
Definition
A set of intervals $\mathcal{I}$ is near-closed if $(1), \ldots, (n) \in \mathcal{I}$, $(1..m) \in \mathcal{I}$, and for each pair of intervals $(i..k), (j..l) \in \mathcal{I}$ s.t. $i < j \leq k < l$, also $(i..l) \in \mathcal{I}$.
Lemma
Let $\mathcal{I}$ be a near-closed set of intervals. Then there exists a unique PQ-tree with frontier $\mathcal{F}$ such that for the set of strong intervals $\mathcal{I}' \subseteq \mathcal{I}$ holds true that $\mathcal{I}' \subseteq \mathcal{F}$ and $|\mathcal{I}| \geq \lceil 1/2 \cdot |\mathcal{F}| \rceil$.
[Figure: two overlapping intervals with endpoints i, j, k, l]
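The near-closed property can be tested by brute force. The sketch below assumes inclusive `(start, end)` tuples over positions `1..n` and, as a simplifying assumption, takes the full interval to be `(1, n)` for a sequence of length `n`; `is_near_closed` is a hypothetical helper name.

```python
def is_near_closed(intervals, n):
    """Check the near-closed property for intervals over positions 1..n.

    Assumption: the full interval is (1, n). Requires all singletons,
    the full interval, and the union of every properly overlapping pair.
    """
    interval_set = set(intervals)
    if any((x, x) not in interval_set for x in range(1, n + 1)):
        return False
    if (1, n) not in interval_set:
        return False
    for (i, k) in interval_set:
        for (j, l) in interval_set:
            # Proper overlap: i < j <= k < l demands the union (i, l).
            if i < j <= k < l and (i, l) not in interval_set:
                return False
    return True


if __name__ == "__main__":
    base = [(1, 1), (2, 2), (3, 3), (4, 4), (1, 4)]
    print(is_near_closed(base + [(1, 2), (2, 3)], 4))
    print(is_near_closed(base + [(1, 2), (2, 3), (1, 3)], 4))
```

Compared with the closed sets used for permutations, only the union of two overlapping intervals is required here, not their intersection or differences.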
PSyCHO
PSyCHO: Principled Synteny using Common Intervals and Hierarchical Organization
http://github.com/danydoerr/PSyCHO
Construction of a synteny hierarchy
input: raw genomic sequences of genomes G, H, I
genome segmentation, yielding marker-order sequences and a marker similarity graph
discovery of homologous SBs
synteny hierarchy construction
[Figure: three-step pipeline (1, 2, 3) over genomes G, H, I]
Similarity graph, syntenic contexts, homologous SBs
1. reference-based reconstruction of syntenic contexts
2. handling of insertions/deletions (work in progress)
3. reference-based discovery of homologous syntenic blocks in each context
computational problems: finding δ-teams in sequences; enumerating common intervals in k sequences
[Figure: reference subject to indel handling, with genomes G2 and G3]
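To give a feel for the common-interval problem on sequences, where markers may repeat, the following sketch handles the simplified case of k = 2 sequences. It assumes a character-set notion of common intervals (two intervals match when they contain the same set of markers) and makes no claim to be the algorithm used in PSyCHO; `charset_intervals` and `common_charset_intervals` are illustrative names, and the enumeration is a quadratic brute force.

```python
def charset_intervals(seq):
    """Map each marker set to the intervals (i, j) of seq that realize it."""
    by_set = {}
    for i in range(len(seq)):
        seen = set()
        for j in range(i, len(seq)):
            seen.add(seq[j])
            by_set.setdefault(frozenset(seen), []).append((i, j))
    return by_set


def common_charset_intervals(s, t):
    """Marker sets occurring as an interval in both s and t, with locations.

    Indices are 0-based and inclusive.
    """
    in_s, in_t = charset_intervals(s), charset_intervals(t)
    return {c: (in_s[c], in_t[c]) for c in in_s.keys() & in_t.keys()}


if __name__ == "__main__":
    shared = common_charset_intervals("abca", "bac")
    for markers, (locs_s, locs_t) in shared.items():
        print(sorted(markers), locs_s, locs_t)
```

Because a marker set can occur at several locations in one sequence, each shared set maps to lists of intervals rather than to a single pair, which is the main difference from the permutation case.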