
Finding Motifs Using Random Projections by J. Buhler and M. Tompa



  1. Finding Motifs Using Random Projections by J. Buhler and M. Tompa. A presentation by Guénola Drillon, Anisah Ghoorah, Lin Han and Frank Dondelinger.

  2. Overview
     - DNA motifs
     - The problem
     - Current approaches
     - New algorithm: PROJECTION
     - PROJECTION's results
     - Conclusions

  3. DNA Motifs
     - DNA
       - 4 nucleotides: A, T, C and G
     - DNA motifs
       - Short, recurring patterns in DNA
       - Strongly conserved
       - Have a biological function, e.g. gene regulation, gene interaction
       - Indicate sequence-specific binding sites for proteins
     - Source: D'haeseleer (2006)

  4. Motif Representations
     - Consensus pattern (using the IUPAC code)
     - Frequency matrix
     - Logo
     - Source: D'haeseleer (2006)

  5. Motif Finding Problem
     - Planted (l, d)-motif
       - Example of a planted (11,2)-motif: CCGATTACCGA
     - l-mers
       - All possible contiguous subsequences of length l in each sequence

  6. Problem Definition
     - Given t sequences, each of length n, find a motif M of length l, where each planted instance differs from M in d positions
       - This is the planted (l, d)-motif problem
     - No prior knowledge of the motif M
     - Figure: t sequences, each containing one planted instance
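The problem definition above can be made concrete with a small sketch. The helper names (`lmers`, `hamming`, `is_planted_motif`) are illustrative and not from the presentation, and the check assumes each planted instance differs from M in exactly d positions, as the slide states.

```python
def lmers(seq, l):
    """Return every contiguous substring of length l in seq."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_planted_motif(motif, seqs, d):
    """True if every sequence contains an l-mer exactly d substitutions from motif.

    Assumption: 'differs from M in d positions' is read as exactly d.
    """
    l = len(motif)
    return all(any(hamming(motif, w) == d for w in lmers(s, l)) for s in seqs)
```

A brute-force checker like this only verifies a candidate motif; finding M without prior knowledge is the hard part that the rest of the talk addresses.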

  7. Why Motif Finding?
     - Comparative genomics
       - Study similar genes in different species using microarrays
       - Identification of transcription factor binding sites
       - Genetic regulatory networks
     - Genomes are large and complex
     - Simple search won't work!
     - A more efficient search algorithm is needed

  8. Current approaches (1): Local Search
     - Gibbs Sampling (Lawrence et al., 1993)
       - Obtain an initial motif model
       - Use an iterative, probability-based approach to find the correct motif
     - MEME (Bailey & Elkan, 1995)
       - Obtain an initial motif
       - Use an EM approach to find the correct motif
     - CONSENSUS (Hertz & Stormo, 1999)
       - Obtain an initial motif
       - Use an iterative approach to build up motifs by adding more and more pattern instances

  9. Problem with Local Search
     - Depends on initial conditions
     - Local optima issues
       - Returns the best solution in the neighbourhood
       - Not necessarily the best planted motif

  10. Current approaches (2)
     - Enumeration
       - Exhaustive enumeration of all possible motifs M
       - Covers the entire search space
       - No risk of getting stuck in a local optimum
     - Problem
       - Too rigid for most real-world binding sites
       - Runs in time exponential in the motif length

  11. Current approaches (3)
     - WINNOWER (Pevzner & Sze, 2000)
       - Graph-theoretic approach which represents a motif as a large clique
     - SP-STAR (Pevzner & Sze, 2000)
       - Heuristic local improvement technique using a scoring function
     - Both solve the planted (15,4)-motif problem
     - Problem
       - Both fail on the planted (14,4)-, (16,5)- and (18,6)-motif problems

  12. New Approach: the PROJECTION Algorithm
     - Random projections (global search)
     - Motif refinement (local search)

  13. Random Projection Hash h(x)
     - Choose k of the l positions at random
     - For an l-mer x, h(x) is the k-mer formed by the residues of x at the k chosen positions
     - This is a projection from l-dimensional space onto a k-dimensional subspace
     - Example: l = 15, k = 7, projection = (2, 4, 5, 7, 11, 12, 13)
       - x = ATGGCATTCAGATTC
       - h(x) = TGCTGAT
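A minimal sketch of the projection hash, assuming 0-based position lists internally (the slide uses 1-based positions). The function names `random_projection` and `project` are illustrative.

```python
import random


def random_projection(l, k, rng=random):
    """Choose k of the l positions (0-based) uniformly at random, in sorted order."""
    return sorted(rng.sample(range(l), k))


def project(x, positions):
    """h(x): the k-mer made of the residues of x at the chosen 0-based positions."""
    return "".join(x[p] for p in positions)


# The slide's example, with its 1-based positions converted to 0-based.
slide_positions = [p - 1 for p in (2, 4, 5, 7, 11, 12, 13)]
slide_hash = project("ATGGCATTCAGATTC", slide_positions)
```

Keeping the positions sorted means h(x) preserves the left-to-right order of the selected residues, which matches the slide's example.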

  14. Random Projection
     - 4^k buckets in total
     - M: the motif; h(M): the planted bucket
     - If k < l - d, several planted motif instances are likely to hash to the planted bucket
     - If k is not too small, a random bucket holds fewer than one l-mer on average
     - The planted bucket is therefore highly enriched with l-mers, which makes it possible to recover the motif

  15. Random Projection
     - s is the threshold for a potential planted bucket
     - Choose buckets that contain at least s l-mers

  16. Example of Hashing and Buckets
     - l = 7, k = 4, with projection positions (1, 2, 5, 7)
     - Sequence: ...TAGACATCCGACTTGCCTTACTAC...
     - Buckets shown: ATGC (contains the l-mer ATCCGAC), GCTC

  17. Example of Hashing and Buckets
     - s = 3
     - Choose buckets which contain at least 3 l-mers
     - Buckets shown: ATGC, GCTC, CATC, ATTC
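The hashing and thresholding steps can be sketched as follows. This is an illustrative version, not the authors' code: `hash_lmers` and `enriched_buckets` are assumed names, positions are 0-based, and a bucket is kept when it holds at least s l-mers, following the threshold rule stated earlier in the talk.

```python
from collections import defaultdict


def hash_lmers(seqs, l, positions):
    """Hash every l-mer of every sequence into a bucket keyed by its projection.

    Each bucket stores (sequence index, start offset, l-mer) triples.
    """
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(seqs):
        for start in range(len(seq) - l + 1):
            x = seq[start:start + l]
            key = "".join(x[p] for p in positions)
            buckets[key].append((seq_index, start, x))
    return buckets


def enriched_buckets(buckets, s):
    """Keep only the buckets that contain at least s l-mers."""
    return {key: items for key, items in buckets.items() if len(items) >= s}


# Slide example: l = 7, projection positions (1, 2, 5, 7) given 1-based.
positions = [p - 1 for p in (1, 2, 5, 7)]
example = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, positions)
```

Storing the sequence index and offset alongside each l-mer makes it easy to go back to the input when a bucket is later used as a starting point for refinement.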

  18. Three Important Parameters: projection size k
     - Choose k < l - d, and not too small, to keep the planted bucket highly enriched
     - Choose k large enough that each bucket holds fewer than one random l-mer on average

  19. Three Important Parameters: bucket threshold s
     - Varies according to the data used
     - Case of (20, 2) and (16, 5)
     - Larger number of sequences

  20. Three Important Parameters: the number m of independent trials to run
     $$m = \frac{\log(1 - Q)}{\log(B)}$$
     - Q: probability that the planted bucket contains s or more motif instances in at least one of the m trials
     - B: probability that the planted bucket contains fewer than s planted instances in a single trial, computed over a number of independent Bernoulli trials
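The formula for m follows from requiring that the planted bucket fails to reach s instances in all m trials with probability at most 1 - Q, i.e. B^m <= 1 - Q. The sketch below works this out under two labelled assumptions that are not on the slide: each of the t planted instances has exactly d mutated positions, so it lands in the planted bucket with probability C(l-d, k) / C(l, k); and m is rounded up to an integer. The function names are illustrative.

```python
from math import ceil, comb, log


def p_hat(l, d, k):
    """Probability that the k chosen positions avoid all d mutated positions.

    Assumption: every planted instance has exactly d mutated positions.
    """
    return comb(l - d, k) / comb(l, k)


def prob_fewer_than(s, t, p):
    """B: probability of fewer than s successes in t Bernoulli(p) trials."""
    return sum(comb(t, i) * p**i * (1 - p) ** (t - i) for i in range(s))


def num_trials(l, d, k, s, t, q):
    """Smallest integer m with 1 - B**m >= q (rounding up is an assumption)."""
    b = prob_fewer_than(s, t, p_hat(l, d, k))
    if not 0.0 < b < 1.0:
        raise ValueError("B must lie strictly between 0 and 1")
    return max(1, ceil(log(1 - q) / log(b)))
```

For instance, `num_trials(15, 4, 7, 4, 20, 0.95)` gives the number of projections needed for the (15,4) problem with k = 7, s = 4 and 20 sequences under these assumptions.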

  21. Motif Refinement
     - We have our buckets: now what?
     - For each large enough bucket h:
       - Use h as a starting point W_h
       - Apply EM to refine W_h to W_h*
       - Get a consensus motif C using the model W_h*
     - At the end, return the best C found

  22. Starting Point W_h
     - W_h is a model for the motif
     - It is a 4 x l matrix: W_h(i, j) = probability of base i in position j
     - An approximation that works in practice

  23. Starting Point W_h: Example
     - In bucket h: AGT, AAA, AGC
     - W_h (rows are bases, columns are positions 1, 2, 3):
       - T: 0, 0, 1/3
       - A: 1, 1/3, 1/3
       - C: 0, 0, 1/3
       - G: 0, 2/3, 0
     - To avoid too many zeroes, add the background probability b_i using Laplace smoothing
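A sketch of building the starting model from a bucket. The slide only says that background probabilities are added via Laplace smoothing; the exact form used here, (count + pseudo * b_i) / (N + pseudo), is an assumption, as are the names `initial_model` and `pseudo`. The model is stored as one dictionary per position.

```python
BASES = "ACGT"


def initial_model(bucket_lmers, background=None, pseudo=1.0):
    """Return W_h as a list of {base: probability} dicts, one per position.

    Assumption: smoothing adds pseudo * background[base] to each count.
    With pseudo=0.0 this is the plain frequency matrix.
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    n = len(bucket_lmers)
    l = len(bucket_lmers[0])
    model = []
    for j in range(l):
        column = {}
        for b in BASES:
            count = sum(x[j] == b for x in bucket_lmers)
            column[b] = (count + pseudo * background[b]) / (n + pseudo)
        model.append(column)
    return model
```

With `pseudo=0.0` the bucket {AGT, AAA, AGC} reproduces the unsmoothed matrix from the slide; with the default every entry becomes strictly positive, which the later likelihood-ratio computations rely on.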

  24. Refinement: Finding W_h*
     - Use EM to refine the initial model W_h
     - Let S be the dataset and P the background distribution
     - Find W_h* that (locally) maximises:
       $$\frac{\Pr(S \mid W_h^{*}, P)}{\Pr(S \mid P)}$$
     - This could take a long time! Better: run only a few iterations of EM
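The slide does not spell out the EM update, so the following is only a sketch under stated assumptions: each sequence contains exactly one motif instance (a MEME-style "one occurrence per sequence" model), the background P is fixed, and the M-step uses the same pseudo-count smoothing idea as the starting model. All function names are illustrative, and W must have strictly positive entries.

```python
BASES = "ACGT"


def likelihood_ratio(x, W, P):
    """Pr(x | W) / Pr(x | P) for an l-mer x and a positional model W."""
    r = 1.0
    for j, b in enumerate(x):
        r *= W[j][b] / P[b]
    return r


def em_step(seqs, W, P, pseudo=1.0):
    """One EM iteration, assuming exactly one motif instance per sequence."""
    l = len(W)
    counts = [{b: 0.0 for b in BASES} for _ in range(l)]
    for seq in seqs:
        starts = range(len(seq) - l + 1)
        # E-step: posterior weight of each start position in this sequence.
        ratios = [likelihood_ratio(seq[i:i + l], W, P) for i in starts]
        total = sum(ratios)
        for i, r in zip(starts, ratios):
            weight = r / total
            for j in range(l):
                counts[j][seq[i + j]] += weight
    # M-step: re-estimate each column from the expected counts, smoothed by P.
    n = len(seqs)
    return [{b: (col[b] + pseudo * P[b]) / (n + pseudo) for b in BASES}
            for col in counts]


def refine(seqs, W, P, iterations=5):
    """Run only a few EM iterations, as the slide recommends."""
    for _ in range(iterations):
        W = em_step(seqs, W, P)
    return W
```

Capping the number of iterations trades a fully converged model for speed, which matters because refinement is repeated for every enriched bucket.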

  25. Refinement: Find the Motif
     - From W_h*, we want to determine the motif
     - For each input sequence, determine the likeliest l-mer with respect to W_h*
     - The likelihood of an l-mer x is determined by:
       $$\frac{\Pr(x \mid W_h^{*})}{\Pr(x \mid P)}$$
     - This gives a set T of t most likely l-mers, one per sequence

  26. Refinement: Find the Motif
     - Example: most likely 2-mer in AGT (assume P is the same for all bases in all positions)
     - W (rows are bases, columns are positions 1, 2):
       - T: 0.01, 0.49
       - A: 0.88, 0.20
       - C: 0.01, 0.30
       - G: 0.10, 0.01
     - Two 2-mers: AG and GT
       - Likelihood of AG: 0.88 x 0.01 = 0.0088
       - Likelihood of GT: 0.10 x 0.49 = 0.0490
     - Add GT to the set T
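Selecting the likeliest l-mer per sequence can be sketched as below. The names `likeliest_lmer` and `likeliest_set` are illustrative, and the model is assumed to be a list of per-position dictionaries; ties are broken in favour of the leftmost l-mer, which is a choice made here rather than something the slides specify.

```python
def likelihood_ratio(x, W, P):
    """Pr(x | W) / Pr(x | P) for an l-mer x and a positional model W."""
    r = 1.0
    for j, b in enumerate(x):
        r *= W[j][b] / P[b]
    return r


def likeliest_lmer(seq, W, P):
    """The l-mer of seq with the highest likelihood ratio (leftmost on ties)."""
    l = len(W)
    candidates = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return max(candidates, key=lambda x: likelihood_ratio(x, W, P))


def likeliest_set(seqs, W, P):
    """T: one likeliest l-mer per input sequence."""
    return [likeliest_lmer(s, W, P) for s in seqs]


# The slide's 2-position example model and a uniform background.
W_slide = [
    {"T": 0.01, "A": 0.88, "C": 0.01, "G": 0.10},
    {"T": 0.49, "A": 0.20, "C": 0.30, "G": 0.01},
]
P_uniform = {b: 0.25 for b in "ACGT"}
```

With a uniform background the denominator is the same for every l-mer, so the ranking reduces to the products shown on the slide.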

  27. Refinement: Find the Motif
     - Once the set T is complete, find its consensus C_h
     - Then calculate s(T): the number of l-mers in T that are further than d away from C_h
     - Return the consensus with the smallest value of s(T) over all buckets and all runs
     - Ideally, find C_h such that s(T) = 0

  28. Refinement: Find the Motif
     - Example: l = 3, d = 1, T = {AGT, AAA, AGC}
     - Consensus: AG? (many tie-breaking schemes are possible); let's say the consensus is AGT
       - dist(AGT, AGT) = 0
       - dist(AAA, AGT) = 2, and 2 > d
       - dist(AGC, AGT) = 1
     - s(T) = 1
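A sketch of the consensus and the score s(T). The slide notes that many tie-breaking schemes are possible; this version assumes ties go to the base seen first in T, which happens to reproduce the slide's choice of AGT. The function names are illustrative.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def consensus(T):
    """Most frequent base per position; ties go to the base seen first in T."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*T))


def s_score(T, c, d):
    """s(T): number of l-mers in T that are further than d from consensus c."""
    return sum(hamming(x, c) > d for x in T)
```

A different tie-break (for example alphabetical) would give a different consensus for the same T and therefore a different s(T), so the scheme should be fixed before comparing scores across buckets.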

  29. Refinement: A Heuristic
     - For the simulated data, we can do better than minimising s(T) over all buckets and all runs
     - sc(T) = number of l-mers in T that are at most d away from C_h
     - Let T' contain, for each input sequence, the l-mer that is closest to C_h
     - If sc(T') > sc(T), replace T with T' and repeat
     - Usually converges quickly
     - If the final score sc(T) = t, return the motif; otherwise maximise the score over all buckets and all runs

  30. Refinement: A Heuristic
     - Example: l = 3, d = 1, T = {AGT, AAA, AGC}, C_h = AGT
     - sc(T) = 2
     - If S = {AGTC, AAAT, AGCT}, then T' = {AGT, AAT, AGC}
     - sc(T') = 3, so return the consensus of T' (which happens to be C_h)
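The iterative heuristic can be sketched as a simple loop. Two points are assumptions rather than statements from the slides: each set is scored against its own consensus, and ties (both in the consensus and in the closest l-mer) go to the first candidate encountered. The names `closest_lmer`, `sc` and `refine_set` are illustrative.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def consensus(T):
    """Most frequent base per position; ties go to the base seen first in T."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*T))


def closest_lmer(seq, c):
    """The l-mer of seq nearest to c in Hamming distance (leftmost on ties)."""
    l = len(c)
    return min((seq[i:i + l] for i in range(len(seq) - l + 1)),
               key=lambda x: hamming(x, c))


def sc(T, c, d):
    """sc(T): number of l-mers in T within distance d of consensus c."""
    return sum(hamming(x, c) <= d for x in T)


def refine_set(seqs, T, d):
    """Replace T by the closest l-mers while the score strictly improves."""
    c = consensus(T)
    score = sc(T, c, d)
    while True:
        T_new = [closest_lmer(s, c) for s in seqs]
        c_new = consensus(T_new)
        new_score = sc(T_new, c_new, d)
        if new_score <= score:
            return T, c, score
        T, c, score = T_new, c_new, new_score
```

The loop always terminates because the score strictly increases on every accepted step and can never exceed the number of sequences t.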

  31. PROJECTION Algorithm Recap
     - Do random projections
       - Hash l-mers to buckets using k random positions
       - Use sufficiently full buckets (at least s l-mers) as starting points
     - Do motif refinement
       - Get the model W_h from bucket h
       - Refine it to a (locally) optimal model W_h* (using e.g. EM)
       - Return the best consensus motif

  32. Experimental Results
     - Experiments on Simulated Data
     - Limitations on Solvable (l,d)-Motif Problems
     - Transcription Factor Binding Sites
     - Ribosome Binding Sites

  33. Experiments on Simulated Data
     - Simulated data:
       1. A motif M is chosen randomly
       2. t independent planted instances are produced, each by randomly selecting d positions in M and substituting the residues there
       3. The position of each instance in its input sequence is selected randomly
       4. The remaining n - l residues of each sequence are chosen randomly
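The four generation steps can be sketched as follows. This is an illustrative generator, not the authors' code: it assumes uniform base frequencies, that each selected position is changed to a different base, and a seeded generator for reproducibility. The names `plant_instance` and `simulate` are hypothetical.

```python
import random

BASES = "ACGT"


def plant_instance(motif, d, rng):
    """Copy of motif with exactly d randomly chosen positions substituted.

    Assumption: a substituted position always receives a different base.
    """
    instance = list(motif)
    for p in rng.sample(range(len(motif)), d):
        instance[p] = rng.choice([b for b in BASES if b != motif[p]])
    return "".join(instance)


def simulate(t, n, l, d, seed=0):
    """Return (motif, sequences, start offsets) for a planted (l, d) problem."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(BASES) for _ in range(l))
    seqs, starts = [], []
    for _ in range(t):
        instance = plant_instance(motif, d, rng)
        start = rng.randrange(n - l + 1)
        # The remaining n - l residues are drawn uniformly at random.
        background = "".join(rng.choice(BASES) for _ in range(n - l))
        seqs.append(background[:start] + instance + background[start:])
        starts.append(start)
    return motif, seqs, starts
```

Returning the true start offsets alongside the sequences makes it straightforward to score a prediction afterwards.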

  34. Experiments on Simulated Data
     - Performance coefficient, illustrated on example sequences:
       - GGACCTCAATGCAGGATACACCGATCGGTA
       - GGAGTACGGCAAGTCCCCATGTGAGGACCT
       - AGGCTGGACCAGGACCTGACTCTACACCTA
       - TGGACCTGCAGGATACAGCGGGACCTATCG
       - ...
     - K = the t*l residue positions in the t planted motif instances
     - P = the corresponding set of residue positions in the instances predicted by the algorithm
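The formula for the performance coefficient did not survive in the slide text. The sketch below assumes the usual definition from the planted-motif literature (Pevzner & Sze), PC = |K ∩ P| / |K ∪ P|, and represents each residue position as a (sequence index, offset) pair. The function name and the start-offset inputs are illustrative.

```python
def performance_coefficient(known_starts, predicted_starts, l):
    """PC = |K & P| / |K | P| over (sequence index, residue offset) pairs.

    Assumption: one known and one predicted instance per sequence, each
    given by its 0-based start offset.
    """
    known = {(i, p) for i, s in enumerate(known_starts)
             for p in range(s, s + l)}
    predicted = {(i, p) for i, s in enumerate(predicted_starts)
                 for p in range(s, s + l)}
    return len(known & predicted) / len(known | predicted)
```

Under this definition a perfect prediction scores 1, a prediction with no overlapping residues scores 0, and a partially shifted prediction falls in between.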

  35. Experiments on Simulated Data
     - Figure: average PC on planted (l,d)-motif problems. Horizontal axis: (l,d) from (10,2), (11,2), (12,3), (13,3), (14,4), (15,4), (16,5), (17,5), (18,6) to (19,6); vertical axis: PC; series: PROJECTION, WINNOWER, SP-STAR, Gibbs.
     - Results:
       - Averages over 20 random instances
       - All runs used projection size k = 7 and bucket threshold s = 4, but a different number of iterations m
       - WINNOWER (k = 2)

  36. Limitations on Solvable (l,d)-Motif Problems
     - Why is it difficult to find an (l,d)-motif?
     - What is the difference between (9,2), (11,3), (13,4), (15,5), (17,6) and (10,2), (12,3), (14,4), (16,5), (18,6)?
     - In general, between (l,d) and (l+1,d)?

  37. Limitations on Solvable (l,d)-Motif Problems
     - The probability that a random l-mer matches a motif M with up to d substitutions:
       $$p_d = \sum_{i=0}^{d} \binom{l}{i} \left(\frac{3}{4}\right)^{i} \left(\frac{1}{4}\right)^{l-i}$$
     - The expected number of random (spurious) motifs:
       $$E(l,d) = 4^{l} \left(1 - (1 - p_d)^{n-l+1}\right)^{t}$$
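The two formulas can be evaluated directly to see why increasing l by one at fixed d matters so much. The values n = 600 and t = 20 used in the comparison below are assumptions (a common benchmark setting for this problem; the slides do not state them), and the function names are illustrative.

```python
from math import comb


def p_d(l, d):
    """Probability that a random l-mer is within d substitutions of a fixed motif."""
    return sum(comb(l, i) * 0.75**i * 0.25 ** (l - i) for i in range(d + 1))


def expected_spurious(l, d, n, t):
    """E(l, d): expected number of l-mers that look like (l, d)-motifs by chance."""
    return 4**l * (1 - (1 - p_d(l, d)) ** (n - l + 1)) ** t


# Assumed benchmark setting: t = 20 sequences of length n = 600.
e_9_2 = expected_spurious(9, 2, 600, 20)
e_10_2 = expected_spurious(10, 2, 600, 20)
```

Under these assumed parameters the (9,2) problem is expected to contain more than one spurious motif as good as the planted one, while for (10,2) the expectation is far below one, which is the contrast the preceding slide asks about.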

  38. Limitations on Solvable (l,d)-Motif Problems
     - Statistics of spurious (l,d)-motifs in simulated data:
