Overlap Graph and Clumps
Mireille Régnier
LIX and INRIA
Mireille.Regnier@inria.fr
Web page: algo.inria.fr/regnier
October 9, 2008
AlBio08: An Optimized Counting Graph
Outline
1. Introduction and principles
2. Overlap Graph
3. Combinatorics of clumps
4. Open problems
Cis-regulation
Cis-regulation changes
Example: the caudal motif in early developmental enhancers from Drosophila (Papatsenko et al., 2002)

(a) Aligned motifs
GCTTTTTTATGGTCGGC
TCGCTTTTATGGCCCAA
CAGTTTTTATGTCTTTA
CCGTTTTGATGGCGGTG
AAATTTTTAGGGAACCA
GCCCGTTTATGGTTCCC
GACACTTTATGTGACAA
TCGGATTTATGACACAA
ATGTCTTTATGATTATT
GCAACTTTTGGGCCATA
CCCTTTTGTTGGCCAAA

(b) Counts
A |  2  3  2  2  1  0  0  0  9  0  0  2  1  3  3  4  7
C |  3  7  3  2  3  0  0  0  0  0  0  0  6  4  5  2  2
G |  4  0  5  1  1  0  0  2  0  2 11  7  1  1  2  1  1
T |  2  1  1  6  6 11 11  9  2  9  0  2  3  3  1  4  1

(c) Position-specific scoring matrix
A | -0.22  0.06 -0.22 -0.22 -0.62 -1.32 -1.32 -1.32  0.98 -1.32 -1.32 -0.22 -0.62  0.06  0.06  0.28  0.75
C |  0.06  0.75  0.06 -0.22  0.06 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32 -1.32  0.62  0.28  0.47 -0.22 -0
G |  0.28 -1.32  0.47 -0.62 -0.62 -1.32 -1.32 -0.22 -1.32 -0.22  1.16  0.75 -0.62 -0.62 -0.22 -0.62 -0
T | -0.22 -0.62 -0.62  0.62  0.62  1.16  1.16  0.98 -0.22  0.98 -1.32 -0.22  0.06  0.06 -0.62  0.28 -0
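The counts in (b) can be recomputed directly from the aligned sites in (a). The scoring formula behind (c) is not stated on the slide, so the sketch below assumes one: a natural-log odds score with a pseudocount of 1 per letter, a uniform background of 0.25, and truncation to two decimals. Under that assumption the complete entries of (c) appear to be reproduced. The names `count_matrix` and `pssm` are illustrative.

```python
import math

# The 11 aligned sites from panel (a).
SITES = [
    "GCTTTTTTATGGTCGGC", "TCGCTTTTATGGCCCAA", "CAGTTTTTATGTCTTTA",
    "CCGTTTTGATGGCGGTG", "AAATTTTTAGGGAACCA", "GCCCGTTTATGGTTCCC",
    "GACACTTTATGTGACAA", "TCGGATTTATGACACAA", "ATGTCTTTATGATTATT",
    "GCAACTTTTGGGCCATA", "CCCTTTTGTTGGCCAAA",
]
ALPHABET = "ACGT"


def count_matrix(sites):
    """Per-letter counts for each column of the alignment."""
    width = len(sites[0])
    return {a: [sum(s[j] == a for s in sites) for j in range(width)]
            for a in ALPHABET}


def pssm(counts, n_sites, pseudo=1.0, background=0.25):
    """Assumed scoring: ln(smoothed frequency / background), truncated to 2 decimals."""
    def score(c):
        freq = (c + pseudo) / (n_sites + pseudo * len(ALPHABET))
        return math.trunc(100 * math.log(freq / background)) / 100
    return {a: [score(c) for c in row] for a, row in counts.items()}


counts = count_matrix(SITES)
scores = pssm(counts, len(SITES))
```

With these assumptions a column where all 11 sites agree scores ln(4 · 12/15), and a letter never seen in a column scores ln(4 · 1/15).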
Probability Weight Matrices
Probability function
- Threshold $s$: a word (site) $w$ is similar iff $\mathrm{score}(w) > s$.
- P-value: $\mathrm{Prob}_n(\exists H ;\ \mathrm{score}(H) > s)$.
Algorithms and data structures
- candidate-motif extraction
Model accuracy
- Improve PWM with structural information
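The threshold rule defines a finite set of similar words. As a minimal sketch, the code below enumerates that set for a small 3-column scoring matrix. The matrix values, the threshold 2.0 and the uniform letter model are all invented for illustration and are not taken from the talk.

```python
from itertools import product

# Hypothetical 3-column scoring matrix (values invented for illustration).
PSSM = {
    "A": [1.0, -1.0, 0.5],
    "C": [-1.0, -1.0, -0.5],
    "G": [-0.5, 1.0, -1.0],
    "T": [0.0, 0.5, 1.0],
}


def score(word, pssm):
    """Sum of the column scores of the letters of `word`."""
    return sum(pssm[x][j] for j, x in enumerate(word))


def similar_words(pssm, s):
    """All words whose score is strictly above the threshold s."""
    m = len(next(iter(pssm.values())))
    return sorted("".join(w) for w in product("ACGT", repeat=m)
                  if score(w, pssm) > s)


motif = similar_words(PSSM, 2.0)

# Probability that one fixed text position starts a similar word,
# assuming independent uniform letters (an assumption, not the talk's model).
site_prob = len(motif) * 0.25 ** 3
```

Brute-force enumeration is only practical for short matrices. The P-value over a text of length $n$ is a different quantity, because occurrences of similar words can overlap.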
Principles
Biological function
- Overrepresented words
- Underrepresented words
Statistical software
- candidate-motif extraction
- statistical significance
Probability Computation
"Classic" methods vs. graphs
- induction [GuOd81];
- languages [ReSz98]; automata [NiFlSa00].
Space/time complexity
- Exact (all $n$) → AhoPro (NIIGenetika, Inria)
- $O(n \times |\Sigma|)$; $n$: text size; $\Sigma$: data structure.
Drawbacks
- $n$ dependency;
- numerical precision.
Probability Computation
"Classic" methods vs. graphs
- induction [GuOd81];
- languages [ReSz98]; automata [NiFlSa00].
Space/time complexity
- Approximation → RSA-tools, Spatt, AhoSoft (NIIGenetika, Inria)
- $O(1 \times |\Sigma|)$
Drawbacks
- size of the data structure;
- tightness.
Aho-Corasick searching automaton
[Figure: Aho-Corasick searching automaton]
Aho-Corasick automaton: searching and computing
- Step $n$: $w_n$ = largest prefix found = ATA;
- Step $n+1$: character $x$ found:
  - $x = G$: $wx = ATAG \in \mathrm{Graph}$, so $w_{n+1} = ATAG$;
  - $x = A, C, T$: $wx \notin \mathrm{Graph}$:
    * $x = C$: $w = A \cdot TA$, $w_{n+1} = TAC \in \mathrm{Graph}$
    * $x = T$: $w = AT \cdot A$, $w_{n+1} = AT \in \mathrm{Graph}$
    * $x = A$: $AA, TAA \notin \mathrm{Graph}$, $w_{n+1} = \mathrm{root}$
[Figure: Aho-Corasick searching automaton]
AhoPro: probability computation
Step $n$: $(p_n(w))_{w \in \mathrm{Graph}}$, where $p_n(w) = \mathrm{Prob}(\text{largest prefix ending at } n \text{ is } w)$.
Induction:
$p_{n+1}(ATAG) = p_n(ATA) \cdot p(G)$
$p_{n+1}(AT) = p_n(ATA) \cdot p(T) + p_n(AGA) \cdot p(T) + p_n(CA) \cdot p(T) + p_n(TA) \cdot p(T)$
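The induction above can be sketched as a pass over the transitions of a prefix automaton. This is a minimal illustration, not AhoPro itself. It makes three assumptions that are not from the talk: the word set `WORDS` is hypothetical, the graph contains every prefix of every word (the talk's graph may be smaller), and letters follow a uniform Bernoulli model.

```python
def build_nodes(words):
    """All prefixes of the words, including the empty prefix (the root)."""
    return {w[:i] for w in words for i in range(len(w) + 1)}


def step(state, x, nodes):
    """Longest suffix of state + x that is a node; '' is the root."""
    w = state + x
    while w not in nodes:
        w = w[1:]
    return w


def advance(dist, nodes, p):
    """One induction step: p_{n+1}(w') sums p_n(w) * p(x) over transitions w --x--> w'."""
    new = dict.fromkeys(nodes, 0.0)
    for w, mass in dist.items():
        for x, px in p.items():
            new[step(w, x, nodes)] += mass * px
    return new


WORDS = ["ATAG", "TAC"]                              # hypothetical word set
P = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}     # assumed uniform letter model
NODES = build_nodes(WORDS)

# history[n][w] plays the role of p_n(w); at n = 0 all mass sits on the root.
history = [dict.fromkeys(NODES, 0.0)]
history[0][""] = 1.0
for _ in range(8):
    history.append(advance(history[-1], NODES, P))
```

In this sketch the only transition into ATAG comes from ATA on the letter G, so `history[n + 1]["ATAG"]` equals `history[n]["ATA"] * P["G"]`, matching the first induction line. Each step visits every (node, letter) pair once, so the cost of $n$ steps grows linearly with $n$ and with the size of the graph.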
Aho-Corasick automaton: searching and computing
Left relation: $H_1 \,\mathcal{R}_L\, H_2 \iff \mathrm{Father}_{LOG}(H_1) = \mathrm{Father}_{LOG}(H_2)$
$\widetilde{\{ATACACA, ATAGATA\}}_{ATA}$
ATA: largest prefix of ATACACA that is a suffix in $\mathcal{H}$
Right relation: $H_1 \,\mathcal{R}_R\, H_2 \iff \mathrm{Mother}_{ROG}(H_1) = \mathrm{Mother}_{ROG}(H_2)$
$\overline{\{ATACACA, ATACACA\}}_{ACA} \cup \{AGACACA,\ \}$
ACA: largest suffix of ATACACA that is a prefix in $\mathcal{H}$
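The two relations group words by their largest overlap with the set. A minimal sketch follows, under two assumptions. First, "father" is read as the longest proper prefix of a word that is a suffix of some word of the set, and "mother" as the longest proper suffix that is a prefix of some word of the set. Second, the set `WORDS` is hypothetical: it contains the three words visible on the slide plus ACATATA, which is invented so that some word starts with ACA.

```python
from collections import defaultdict


def father(h, words):
    """Longest proper prefix of h that is a suffix of some word in the set."""
    for k in range(len(h) - 1, 0, -1):
        if any(w.endswith(h[:k]) for w in words):
            return h[:k]
    return ""


def mother(h, words):
    """Longest proper suffix of h that is a prefix of some word in the set."""
    for k in range(len(h) - 1, 0, -1):
        if any(w.startswith(h[-k:]) for w in words):
            return h[-k:]
    return ""


def classes(words, key):
    """Equivalence classes of the relation 'same father' or 'same mother'."""
    groups = defaultdict(list)
    for h in words:
        groups[key(h, words)].append(h)
    return dict(groups)


WORDS = ["ATACACA", "ATAGATA", "AGACACA", "ACATATA"]   # hypothetical set
left_classes = classes(WORDS, father)
right_classes = classes(WORDS, mother)
```

For this set, ATACACA and ATAGATA fall in the same left class (father ATA), and ATACACA and AGACACA fall in the same right class (mother ACA).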
Computation on Graph: induction
Aho-Corasick automaton: searching and computing
First occurrence at position $n = 18$:
GGGGGGGG | ATACACA |
no $H \in \mathcal{H}$ | · · · | $n$
AND NOT
GGGGCATT | ATACACA |
GGGGACAT | ATACACA |
GGACATAT | ATACACA |
GGAGACAC | ATACACA |
· · ·
All marked nodes in AhoGraph
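The "no earlier occurrence" condition can be enforced by stopping the probability flow at marked nodes. The sketch below is one way to do that on a plain prefix automaton, which is an assumption and not necessarily the talk's optimized graph. The function name, the word set in the usage line and the uniform letter model are also hypothetical.

```python
def first_occurrence_probs(words, p, n_max):
    """first[n - 1] = probability that the first occurrence of any word ends at position n."""
    nodes = {w[:i] for w in words for i in range(len(w) + 1)}
    # A node is marked when reaching it means some word of the set has just ended.
    marked = {v for v in nodes if any(v.endswith(w) for w in words)}

    def step(state, x):
        w = state + x
        while w not in nodes:
            w = w[1:]
        return w

    alive = {"": 1.0}          # mass of texts with no occurrence so far, per node
    first = []
    for _ in range(n_max):
        nxt, hit = {}, 0.0
        for state, mass in alive.items():
            for x, px in p.items():
                t = step(state, x)
                if t in marked:
                    hit += mass * px       # a first occurrence ends here
                else:
                    nxt[t] = nxt.get(t, 0.0) + mass * px
        first.append(hit)
        alive = nxt
    return first


UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed letter model
first = first_occurrence_probs(["ATA", "TAC"], UNIFORM, 6)
```

Texts that already contain an occurrence are dropped from `alive`, so a prefix such as GGGGCATT followed by ATACACA would not be counted again at a later position.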
Overlap graph: probability computation
Compute $(p_n(H))_{H \in \mathcal{H}}$ using LOG and ROG.
- LOG: dependency on the past
- ROG: information to transfer (memory)
Graph traversals...
Clump counts
- First occurrence: "small" $n$.
- $k$ occurrences: large $n$.
⇒ approximation
⇒ generating functions
⇒ clumps
Clump counts
With $H_1 = AACGGAA$ and $H_2 = GAATCA$:
AACGGAACGGAACGGAATCACGGAA
A $k$-decomposition is counted with coefficient $(-1)^k$ [BoClReVa05].
Contribution: $(-1)^7 = -1$
With $AACAACAACAA = AA(CAA)^3$:
AACAACAACAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · ACAACAACAA ·
No contribution: even = odd
AACAACAACAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · CAA · ACAACAACAA ·
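To make the first example concrete, the sketch below locates the clumps in a text. It assumes that a clump is a maximal run of occurrences in which each occurrence overlaps the run so far by at least one letter. It only finds clumps and counts their occurrences; it does not compute the signed $k$-decompositions. The function names are illustrative.

```python
def occurrences(text, words):
    """Sorted (start, end) pairs, end exclusive, for every occurrence of every word."""
    return sorted((i, i + len(w)) for w in words
                  for i in range(len(text) - len(w) + 1)
                  if text.startswith(w, i))


def clumps(text, words):
    """Group occurrences into maximal runs of overlapping occurrences."""
    result = []
    for start, end in occurrences(text, words):
        if result and start < result[-1]["end"]:
            # Overlaps the current clump: extend it.
            result[-1]["end"] = max(result[-1]["end"], end)
            result[-1]["count"] += 1
        else:
            result.append({"start": start, "end": end, "count": 1})
    return result


TEXT = "AACGGAACGGAACGGAATCACGGAA"
found = clumps(TEXT, ["AACGGAA", "GAATCA"])
```

On this text the sketch finds a single clump, AACGGAACGGAACGGAATCA, made of three occurrences of $H_1$ and one of $H_2$. The trailing CGGAA belongs to no occurrence.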
Open problems: frameshift and riboswitches
Boxes: $(w_1, w_2, \tilde{w}_1, \tilde{w}_2)$
With: P. Nicodeme.