determining coding cpg islands

Determining coding CpG islands as regions significant for Markov chain based counting statistics

Guideline Introduction Methods Results Outlook Determining coding CpG islands as regions significant for Markov chain based counting statistics Alexander Schönhuth Centrum Wiskunde & Informatica Amsterdam joint work with Meromit


  1. Determining coding CpG islands as regions significant for Markov chain based counting statistics
     Alexander Schönhuth, Centrum Wiskunde & Informatica, Amsterdam
     Joint work with Meromit Singer, Alexander Engström and Lior Pachter (UC Berkeley)
     Rutgers University, New Jersey, October 12, 2011

  2. Guideline
     • Introduction: Cytosine Deamination, Problem Definition
     • Methods: The Null Model, The Algorithm
     • Results: Epigenetic Association, New Findings
     • Outlook

  4. Introduction: Cytosine Deamination
     • CpG dinucleotides degrade more frequently than other dinucleotides.
     • Methylated cytosines mutate to thymine through deamination (C → T).
     • CpG islands: substrings in the genome with unusually high CpG content.
     [Figure: deamination of a methylated (CH3) cytosine turns ATATG TTGGA CG into ATATG TTGGA TG; a CpG-rich stretch A AG C TTG CG CG CG CG illustrates an island.]
     • CpG islands are not affected by neutral mutation rates due to epigenetic constraint ☞ computational inference is possible.
     • Still the most popular criteria:
       • Gardiner-Garden / Frommer: length ≥ 200 bp, GC content ≥ 0.5, CpG Obs/Exp ≥ 0.6
       • Takai / Jones: length ≥ 500 bp, GC content ≥ 0.55, CpG Obs/Exp ≥ 0.65
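The two threshold rules on this slide can be checked mechanically. The following is a minimal sketch, assuming the usual definition of the observed/expected ratio, Obs/Exp = (#CG · N) / (#C · #G) for a window of length N; the function and constant names are hypothetical and not taken from the talk.

```python
# Hypothetical helper names; thresholds are the ones quoted on the slide.
GARDINER_GARDEN_FROMMER = (200, 0.5, 0.6)   # (min length, min GC content, min Obs/Exp)
TAKAI_JONES = (500, 0.55, 0.65)


def cpg_stats(seq):
    """Return (length, GC content, CpG Obs/Exp) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # CG cannot overlap itself, so str.count is exact
    gc_content = (c + g) / n if n else 0.0
    # Assumed definition: observed CG count over the count expected
    # if C and G occurred independently at their observed frequencies.
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return n, gc_content, obs_exp


def passes(seq, criteria):
    """Check a sequence against one (length, GC, Obs/Exp) threshold triple."""
    min_len, min_gc, min_oe = criteria
    n, gc_content, obs_exp = cpg_stats(seq)
    return n >= min_len and gc_content >= min_gc and obs_exp >= min_oe
```

For example, `passes("CG" * 100, GARDINER_GARDEN_FROMMER)` holds, while the same 200 bp string fails `TAKAI_JONES` on length alone. Thresholds of this kind carry no significance statement, which is the gap the rest of the talk addresses.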

  6. Generic Motivation: Computation of CpG Islands
     Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following).
     Output: A set of non-overlapping substrings $G_1, \dots, G_L$ which are "most significant" in terms of their CpG content.
     • In doing so, one would like to control the false discovery rate $E(V/L)$, where $V$ is the number of false positives among the $L$ reported substrings; that is, the expected fraction of false positives.

  10. Methods: Definitions
     • Let $\Sigma = \{A, C, G, T\}$, let $\tilde{G} \in \Sigma^n$ be an $n$-mer, and let $|\tilde{G}|$ and $\#(\tilde{G}, CG)$ denote the length of $\tilde{G}$ and the number of CG occurrences in $\tilde{G}$.
     • For example, $\tilde{G} = CGACG$: $|\tilde{G}| = 5$, $\#(\tilde{G}, CG) = 2$.
     • Let $Z_n$ be the random variable defined by $Z_n : \Sigma^n \to \mathbb{N}$, $\tilde{G} \mapsto \#(\tilde{G}, CG)$.
     • Let $\tilde{G}$ be a genomic substring of length $n$ and $m := \#(\tilde{G}, CG)$.
     • Consider the tail probability $p(\tilde{G}) := p_{n,m} := P(\{Z_n \ge m\})$, the probability that a randomly drawn $n$-mer contains at least $m$ CGs.
     Wanted: Genomic substrings $\tilde{G}$ of significantly small $p(\tilde{G})$.
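The definitions above can be made concrete with a small sketch. Since the distribution $P$ is not yet specified at this point in the talk, the code below assumes, purely for illustration, that every $n$-mer is equally likely, and computes the tail probability by brute-force enumeration (feasible only for small $n$). The function names are hypothetical.

```python
from itertools import product


def count_cg(seq):
    """#(seq, CG): number of positions where the dinucleotide CG starts."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def tail_probability_uniform(n, m):
    """P(Z_n >= m), ASSUMING all 4**n n-mers are equally likely.

    This uniform model is an illustrative stand-in; the talk replaces it
    with a Markov chain null model.
    """
    hits = sum(
        1
        for letters in product("ACGT", repeat=n)
        if count_cg("".join(letters)) >= m
    )
    return hits / 4 ** n
```

With the slide's example, `count_cg("CGACG")` gives 2, and under the uniform assumption `tail_probability_uniform(5, 2)` equals 12/1024, because only the patterns CGCG?, CG?CG and ?CGCG contain two CGs.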

  12. Methods: Problem Specification (Computation of CpG Islands)
     Input: A genome $G$ or a set of genomic sequences $G_i$ (exons in the following) and a user-specified threshold $\alpha \in [0, 1]$.
     Output: A set of non-overlapping substrings $\tilde{G}_1, \dots, \tilde{G}_L$ in $G$ or in the $G_i$ which minimize
     $$\prod_{l=1}^{L} p(\tilde{G}_l) = \prod_{l=1}^{L} p_{n_l, m_l}$$
     where $n_l := |\tilde{G}_l|$ and $m_l := \#(\tilde{G}_l, CG)$, such that $E(V/L) \le \alpha$.
     • Some additional, biologically reasonable constraints will apply.
     • Still missing: specification of $P$.

  13. Null Model: Markov Chains
     • The standard approach to CpG island detection is a hidden Markov model.
     • Issue: this requires specifying an "island model".

  14. Null Model: Markov Chains
     Parameter estimation for a null model alone is straightforward: collect dinucleotide frequencies into a Markov transition probability matrix
     $$M = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}.$$
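One way to collect dinucleotide frequencies into such a matrix is sketched below. This is a minimal illustration, assuming plain maximum-likelihood estimates with rows and columns ordered A, C, G, T; the function name and the uniform fallback for nucleotides that never occur as a predecessor are hypothetical choices, not something stated in the talk.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def transition_matrix(sequences):
    """Estimate M[y][x] = P(next = x | current = y) from dinucleotide counts."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for a, b in zip(seq, seq[1:]):
            # Skip dinucleotides containing ambiguity codes such as N.
            if a in NUCLEOTIDES and b in NUCLEOTIDES:
                counts[a, b] += 1
    matrix = []
    for a in NUCLEOTIDES:
        total = sum(counts[a, b] for b in NUCLEOTIDES)
        # Assumption: fall back to a uniform row if `a` was never observed
        # as the first letter of a dinucleotide.
        matrix.append(
            [counts[a, b] / total if total else 0.25 for b in NUCLEOTIDES]
        )
    return matrix
```

Each row sums to one, so the result can be used directly as the matrix $M$ of a first-order chain.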

  16. Methods: Computation of Probabilities
     • Consider the probability vectors $\pi_{n,m} = [\pi_{n,m}(A), \pi_{n,m}(C), \pi_{n,m}(G), \pi_{n,m}(T)] \in [0, 1]^4$, where $\pi_{n,m}(x)$ is the probability that the Markov chain generates a sequence of length $n$ which contains at least $m$ CGs and which ends in the nucleotide $x \in \{A, C, G, T\}$.
     • For all $n \in \mathbb{N}$ initialize $\pi_{n,0} = \pi$, where $\pi$ with $\pi^T M = \pi^T$ is the stationary (left) eigenvector associated with the Markov chain.
     • Recursively compute
     $$\pi_{n,m}^T = \pi_{n-1,m}^T \cdot \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & 0 & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix} + \pi_{n-1,m-1}^T \cdot \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & p_{CG} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
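The recursion can be sketched as a small dynamic program. The code below is an illustration under stated assumptions rather than the authors' implementation: the chain is started in its stationary distribution, the stationary vector is approximated by power iteration, and the base case $\pi_{1,m} = 0$ for $m \ge 1$ (a single nucleotide contains no CG) is added because the slide does not spell it out. All function names are hypothetical.

```python
NUCLEOTIDES = "ACGT"
C, G = NUCLEOTIDES.index("C"), NUCLEOTIDES.index("G")


def stationary(M, iterations=1000):
    """Approximate pi with pi^T M = pi^T by power iteration (assumed to converge)."""
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [sum(pi[y] * M[y][x] for y in range(4)) for x in range(4)]
    return pi


def tail_probability(M, n, m):
    """p_{n,m}: probability that a length-n sequence has at least m CGs."""
    pi = stationary(M)
    # cur[k][x] = P(sequence of current length has >= k CGs and ends in x).
    # Assumed base case for length 1: k = 0 is certain, k >= 1 is impossible.
    cur = [pi[:]] + [[0.0] * 4 for _ in range(m)]
    for _ in range(2, n + 1):
        nxt = [pi[:]]  # pi_{n,0} = pi for every n, as on the slide
        for k in range(1, m + 1):
            row = []
            for x in range(4):
                # First term: M with the C -> G entry zeroed out.
                value = sum(
                    cur[k][y] * M[y][x]
                    for y in range(4)
                    if not (y == C and x == G)
                )
                # Second term: only the C -> G transition adds a new CG.
                if x == G:
                    value += cur[k - 1][C] * M[C][G]
                row.append(value)
            nxt.append(row)
        cur = nxt
    return sum(cur[m])
```

As a sanity check, a uniform matrix `[[0.25] * 4 for _ in range(4)]` makes all 5-mers equally likely, and `tail_probability(uniform, 5, 2)` then agrees with the direct count 12/1024. The running time is proportional to $n \cdot m$, in contrast to the $4^n$ cost of enumerating all $n$-mers.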

  17. Bona Fide Islands: Significance vs. Epigenetic Score
     [Figure: ROC plots, panels A and B. x-axis: False Alarm Rate; y-axis: Hit Rate; curves: episcore, p-value, obs/exp cg.]
     ROC plots: p-values vs. epigenetic score vs. CpG Obs/Exp on bona fide islands for prediction of open chromatin and differential methylation.

  18. Exonic CpG Islands: Coding Constraint vs. Epigenetic Constraint
     • In exons, CpGs are preserved due to both coding and epigenetic constraint.
     [Figure: the sequence A AG C TTG CG CG CG CG with CGs annotated as preserved by epigenetic constraint or by coding constraint; the genetic code.]
     • Coding CpG island: an exonic substring with significant CpG content due to epigenetic constraint.

  19. Null Model: 5th-order Markov chain
     $$\begin{array}{c|cccc}
      & A & C & G & T \\ \hline
     AAAAA & P(A^5 \to A) & P(A^5 \to C) & P(A^5 \to G) & P(A^5 \to T) \\
     AAAAC & P(A^4C \to A) & P(A^4C \to C) & P(A^4C \to G) & P(A^4C \to T) \\
     AAAAG & P(A^4G \to A) & P(A^4G \to C) & P(A^4G \to G) & P(A^4G \to T) \\
     AAAAT & P(A^4T \to A) & P(A^4T \to C) & P(A^4T \to G) & P(A^4T \to T) \\
     \vdots & \vdots & \vdots & \vdots & \vdots \\
     TTTTA & P(T^4A \to A) & P(T^4A \to C) & P(T^4A \to G) & P(T^4A \to T) \\
     TTTTC & P(T^4C \to A) & P(T^4C \to C) & P(T^4C \to G) & P(T^4C \to T) \\
     TTTTG & P(T^4G \to A) & P(T^4G \to C) & P(T^4G \to G) & P(T^4G \to T) \\
     TTTTT & P(T^5 \to A) & P(T^5 \to C) & P(T^5 \to G) & P(T^5 \to T)
     \end{array}$$
     • $4^6 = 4096$ transition probabilities ($4^5 = 1024$ contexts times 4 nucleotides) to be learned from data
     • Needed: dinucleotide counting statistics on 5th-order Markov chains
     • Goal: determine the significance of exonic substrings
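Learning such a table from data amounts to counting which nucleotide follows each 5-mer context. The sketch below is a minimal illustration with plain maximum-likelihood estimates; the function name `fit_markov` and the dictionary layout are hypothetical, and contexts that never occur in the training sequences are simply absent from the result (a real application would likely need pseudocounts or a similar fallback).

```python
from collections import Counter, defaultdict

NUCLEOTIDES = "ACGT"


def fit_markov(sequences, order=5):
    """Estimate P(context -> x) for every observed context of length `order`."""
    counts = defaultdict(Counter)
    valid = set(NUCLEOTIDES)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, nxt = seq[i - order:i], seq[i]
            # Skip windows containing ambiguity codes such as N.
            if set(context) <= valid and nxt in valid:
                counts[context][nxt] += 1
    model = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        model[context] = {x: followers[x] / total for x in NUCLEOTIDES}
    return model
```

For instance, `fit_markov(["ACGACGACT"], order=2)["AC"]` assigns probability 2/3 to G and 1/3 to T. With `order=5` the full table has up to $4^5$ rows of four probabilities each.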
