Finding Motifs Using Random Projections by J. Buhler and M. Tompa A - PowerPoint PPT Presentation

Finding Motifs Using Random Projections
by J. Buhler and M. Tompa
A Presentation by Guénola Drillon, Anisah Ghoorah, Lin Han, Frank Dondelinger
1/48

Overview
• DNA motifs
• The problem
• Current approaches
• New algorithm - PROJECTION
• PROJECTION’s results
• Conclusions
2/48

DNA Motifs
• DNA
  − 4 nucleotides: A, T, C and G
• DNA Motifs
  − Short, recurring patterns in DNA
  − Strongly conserved
  − Have biological function, e.g. gene regulation, gene interaction
  − Indicate sequence-specific binding sites for proteins
D’haeseleer (2006)
3/48

Motif Representations
• Consensus pattern
  − Using IUPAC code
• Frequency matrix
• Logo
D’haeseleer (2006)
4/48

Motif Finding Problem
• Planted (l, d)-Motif
  − Planted (11,2)-Motif: CCGATTACCGA
• l-mers
  − All possible subsequences of length l in each sequence
5/48

Problem Definition
• Given t sequences, each of length n, find a motif M of length l, where each planted instance differs from M in d positions
  − Planted (l, d)-Motif
• No prior knowledge of motif M
[Figure labels: planted instance; t sequences]
6/48

Why Motif Finding?
• Comparative genomics
  − Study similar genes in different species using microarrays
  − Identification of transcription factor binding sites
  − Genetic regulatory network
• Genomes are large and complex
• Simple search won’t work!
• Need a more efficient search algorithm
7/48

Current approaches (1) - Local Search
• Gibbs Sampling - Lawrence et al. (1993)
  − Obtain an initial motif model
  − Use an iterative approach based on probability to find correct motif
• MEME - Bailey & Elkan (1995)
  − Obtain an initial motif
  − Use EM approach to find correct motif
• CONSENSUS - Hertz & Stormo (1999)
  − Obtain an initial motif
  − Use an iterative approach to build up motifs by adding more and more pattern instances
8/48

Problem with Local Search
• Depends on initial conditions
• Local optima issues
  − Returns best solution in neighbourhood
  − Not necessarily the best planted motif
9/48

Current approaches (2)
• Enumeration
  − Exhaustive enumeration of all possible motifs M
  − Covers the entire search space
  − No risk of getting stuck in a local optimum
• Problem
  − Too rigid for most real-world binding sites
  − Runs in time exponential in the motif length
10/48

Current approaches (3)
• WINNOWER - Pevzner & Sze (2000)
  − Graph-theoretic approach which represents a motif as a large clique
• SP-STAR - Pevzner & Sze (2000)
  − Heuristic local improvement technique using a scoring function
• Both solve the planted (15,4)-motif problem
• Problem
  − Both fail to solve the planted (14,4)-, (16,5)- and (18,6)-motif problems
11/48

New Approach
PROJECTION Algorithm
  − Random Projections (global search)
  − Motif Refinement (local search)
12/48

Random Projection
Hash h(x)
• Choose k of the l positions at random
• Consider x as an l-mer, then h(x) is the k-mer resulting from selecting k residues of x
• A projection from l-dimensional space onto a k-dimensional subspace
• Example: l = 15, k = 7
  x = ATGGCATTCAGATTC
  Projection = (2, 4, 5, 7, 11, 12, 13)
  h(x) = TGCTGAT
13/48
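The hash on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `choose_projection` and `project` are hypothetical, and positions are treated as 1-based to match the slide's example.

```python
import random

def choose_projection(l, k, rng=random):
    """Pick k of the l positions (1-based, sorted) uniformly at random."""
    return sorted(rng.sample(range(1, l + 1), k))

def project(x, positions):
    """h(x): the k-mer made of the residues of x at the given 1-based positions."""
    return "".join(x[p - 1] for p in positions)

# The slide's example: l = 15, k = 7
print(project("ATGGCATTCAGATTC", (2, 4, 5, 7, 11, 12, 13)))  # TGCTGAT
```

In a full run, `choose_projection` would be called once per trial and the same positions applied to every l-mer of every sequence.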

Random Projection
• 4^k buckets in total
• M: motif; h(M): the planted bucket
• If k < l - d, several planted instances can be expected to hash to the planted bucket
• If k is not too small, a random bucket is expected to contain fewer than one l-mer
• A planted bucket that is highly enriched in l-mers enables recovering the motif
14/48
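A quick back-of-the-envelope check of the "fewer than one l-mer per random bucket" condition, assuming uniformly random background sequence so that the t(n - l + 1) l-mers spread evenly over the 4^k buckets. The values t = 20 and n = 600 are assumptions for illustration; they are not stated on these slides.

```python
def expected_random_bucket_size(t, n, l, k):
    """Expected number of l-mers per bucket if the t*(n-l+1) l-mers hashed uniformly."""
    return t * (n - l + 1) / 4 ** k

# Assumed setting: t = 20 sequences of length n = 600, l = 15, k = 7
print(expected_random_bucket_size(20, 600, 15, 7))
```

Under these assumed values the result is below one, which is the regime the slide describes; a smaller k would push it above one.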

Random Projection
• s is the threshold for a potential planted bucket
• Choose buckets that contain at least s l-mers
15/48

Example of Hashing and Buckets
l = 7, k = 4 with projection positions (1,2,5,7)
...TAGACATCCGACTTGCCTTACTAC...
The l-mer ATCCGAC hashes to bucket ATGC
Buckets shown: ATGC, GCTC
16/48

Example of Hashing and Buckets
• s = 3
• Choose buckets which contain at least 3 l-mers
Buckets shown: ATGC, GCTC, CATC, ATTC
17/48
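The hashing and thresholding steps from the two example slides could look roughly like this. It is a sketch under assumptions: `hash_lmers` and `enriched` are hypothetical names, bucket members are stored as (sequence index, start offset) pairs, and the threshold follows the "at least s" rule.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Hash every l-mer of every sequence into the bucket named by its projection.

    positions are 1-based; each bucket stores (sequence index, start offset) pairs.
    """
    buckets = defaultdict(list)
    for si, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p - 1] for p in positions)
            buckets[key].append((si, start))
    return buckets

def enriched(buckets, s):
    """Keep only buckets that contain at least s l-mers."""
    return {key: members for key, members in buckets.items() if len(members) >= s}

# The slide's fragment, with l = 7 and projection positions (1, 2, 5, 7)
b = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, (1, 2, 5, 7))
print(b["ATGC"], b["GCTC"])  # [(0, 5)] [(0, 14)]
```

Storing positions rather than the l-mers themselves keeps the buckets small and lets the refinement step look the strings up again when needed.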

Three Important Parameters
Projection size k
• k < l - d and k not too small, to keep the planted bucket highly enriched
• Larger k to ensure we have less than one l-mer in each random bucket
18/48

Three Important Parameters
Bucket threshold s
• Varies according to the data we use
• Case of (20, 2) and (16, 5)
• Larger number of sequences
19/48

Three Important Parameters
The number m of independent trials to run:
  $m = \frac{\log(1 - Q)}{\log(B)}$
• Q: probability that the planted bucket contains s or more motif instances in at least one of the m trials
• B: probability that fewer than s planted instances land in the planted bucket in a single trial (fewer than s successes in a number of independent Bernoulli trials)
20/48
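A rough numerical sketch of the trial-count formula. Two things here are assumptions not stated on the slide: the per-instance success probability is taken as C(l-d, k) / C(l, k) (the chance that k random positions avoid all d mutated positions), and the result is rounded up to a whole number of trials. The function names are hypothetical.

```python
from math import ceil, comb, log

def planted_hit_probability(l, d, k):
    """Assumed: chance that k random positions miss all d mutated positions."""
    return comb(l - d, k) / comb(l, k)

def prob_fewer_than(s, t, p):
    """B: probability of fewer than s successes in t Bernoulli(p) trials."""
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(s))

def trials_needed(l, d, k, s, t, q=0.95):
    """m = log(1 - Q) / log(B), rounded up; requires 0 < B < 1."""
    b = prob_fewer_than(s, t, planted_hit_probability(l, d, k))
    return ceil(log(1 - q) / log(b))

# Assumed setting: (15,4) motif, k = 7, s = 4, t = 20 sequences, Q = 0.95
print(trials_needed(15, 4, 7, 4, 20))
```

The default Q = 0.95 is an illustrative choice; a higher Q or a larger threshold s both increase the number of trials.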

Motif Refinement
We have our buckets: Now what?
For each large enough bucket h:
• Use h as a starting point W_h
• Apply EM to refine W_h to W_h*
• Get consensus motif C using model W_h*
At the end, return best C found
21/48

Starting Point W_h
W_h is a model for the motif
4 x l matrix → W_h(i, j) = probability of base i in position j
Approximation that works in practice
22/48

Starting Point W_h: Example
In bucket h: AGT, AAA, AGC
W_h (rows: bases, columns: positions):
       1     2     3
  A    1    1/3   1/3
  C    0     0    1/3
  G    0    2/3    0
  T    0     0    1/3
To avoid too many zeroes, add background probability b_i using Laplace smoothing
23/48
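One way to build the starting model from a bucket, sketched in Python. The exact smoothing scheme is an assumption: here the background probability b_i is added as a pseudocount to each column before normalising, and the background defaults to uniform (0.25 per base). The name `profile_from_bucket` is hypothetical.

```python
BASES = "ACGT"

def profile_from_bucket(lmers, background=None):
    """Column-wise base frequencies of the bucket's l-mers, smoothed with pseudocounts.

    Assumed scheme: W[b][j] = (count of b in column j + b_i) / (N + sum of all b_i).
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    n = len(lmers)
    l = len(lmers[0])
    total_pseudo = sum(background.values())
    W = {b: [0.0] * l for b in BASES}
    for j in range(l):
        for b in BASES:
            count = sum(1 for x in lmers if x[j] == b)
            W[b][j] = (count + background[b]) / (n + total_pseudo)
    return W

W = profile_from_bucket(["AGT", "AAA", "AGC"])
print(W["A"], W["G"])
```

Passing a background of all zeros reproduces the unsmoothed matrix from the slide (for example 2/3 for G in position 2).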

Refinement: Finding W_h*
Use EM to refine initial model W_h
Let S be the dataset, P the background distribution
Find W_h* that (locally) maximises:
  $\frac{\Pr(S \mid W_h^*, P)}{\Pr(S \mid P)}$
Could take a long time!
Better: Run only a few iterations of EM
24/48

Refinement: Find the Motif
From W_h*, want to determine motif:
For each input sequence:
  Determine likeliest l-mer w.r.t. W_h*
  Likelihood of l-mer x determined by:
    $\frac{\Pr(x \mid W_h^*)}{\Pr(x \mid P)}$
Get set T of t most likely l-mers
25/48

Refinement: Find the Motif
Example: Most likely 2-mer in AGT (Assume P same for all bases in all positions)
W (rows: bases, columns: positions):
       1     2
  A  0.88  0.20
  C  0.01  0.30
  G  0.10  0.01
  T  0.01  0.49
Two 2-mers: AG and GT
Likelihood of AG: 0.88 x 0.01 = 0.0088
Likelihood of GT: 0.10 x 0.49 = 0.0490
Add GT to set T
26/48
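The selection step in this example can be reproduced with a short sketch. The function names are hypothetical, and the uniform background (0.25 per base) is the slide's own simplifying assumption; with a uniform background the ratio ranks l-mers exactly as the raw products on the slide do.

```python
W = {
    "A": [0.88, 0.20],
    "C": [0.01, 0.30],
    "G": [0.10, 0.01],
    "T": [0.01, 0.49],
}
UNIFORM = {b: 0.25 for b in "ACGT"}

def likelihood_ratio(x, W, background):
    """Pr(x | W) / Pr(x | P), multiplying position by position."""
    ratio = 1.0
    for j, base in enumerate(x):
        ratio *= W[base][j] / background[base]
    return ratio

def likeliest_lmer(seq, W, background, l):
    """The l-mer of seq with the highest likelihood ratio."""
    lmers = [seq[i:i + l] for i in range(len(seq) - l + 1)]
    return max(lmers, key=lambda x: likelihood_ratio(x, W, background))

print(likeliest_lmer("AGT", W, UNIFORM, 2))  # GT
```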

Refinement: Find the Motif
Once set T is complete:
• Find consensus C_h
• Then calculate s(T): number of l-mers in T that are further than d away from C_h
• Return the consensus with the smallest value s(T) over all buckets and all runs
Ideally, find C_h such that s(T) = 0
27/48

Refinement: Find the Motif
Example: l = 3, d = 1, T = {AGT, AAA, AGC}
Consensus: AG? -> Many schemes possible
Let's say consensus AGT
  dist(AGT, AGT) = 0
  dist(AAA, AGT) = 2   (2 > d)
  dist(AGC, AGT) = 1
s(T) = 1
28/48
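A small sketch of the consensus and s(T) computation from this example. The tie-breaking rule in `consensus` (alphabetical order among equally frequent bases) is an assumption, one of the "many schemes possible"; all function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def consensus(lmers):
    """Most frequent base per column; ties broken alphabetically (an assumption)."""
    out = []
    for column in zip(*lmers):
        counts = Counter(column)
        top = max(counts.values())
        out.append(min(b for b, c in counts.items() if c == top))
    return "".join(out)

def s_score(T, C, d):
    """s(T): number of l-mers in T further than d from the consensus C."""
    return sum(1 for x in T if hamming(x, C) > d)

T = ["AGT", "AAA", "AGC"]
print(s_score(T, "AGT", 1))  # 1, as on the slide
print(consensus(T))          # AGA under the alphabetical tie-break
```

With the alphabetical tie-break the third column resolves to A rather than T, and every l-mer in T is then within distance 1 of the consensus, which shows how much the choice of scheme can matter.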

Refinement: A Heuristic
For the simulated data, we can do better than minimising s(T) over all buckets and all runs.
• sc(T) = number of l-mers in T that are at most d away from C_h
• Let T' contain, for each input sequence, the l-mer that is closest to C_h
• If sc(T') > sc(T), replace T with T' and repeat
Usually converges quickly.
If final score sc(T) = t, return the motif; otherwise maximise the score over all buckets and all runs
29/48

Refinement: A Heuristic
Example: l = 3, d = 1, T = {AGT, AAA, AGC}, C_h = AGT
sc(T) = 2
If: S = {AGTC, AAAT, AGCT}
then: T' = {AGT, AAT, AGC}
sc(T') = 3 → return consensus of T' (which happens to be C_h)
30/48
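The iterative heuristic can be sketched as follows. Several details are assumptions: each new set T' is scored against its own recomputed consensus, ties between equally close l-mers go to the leftmost one, and consensus ties are broken alphabetically. The function names are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus(lmers):
    """Most frequent base per column; ties broken alphabetically (an assumption)."""
    out = []
    for column in zip(*lmers):
        counts = Counter(column)
        top = max(counts.values())
        out.append(min(b for b, c in counts.items() if c == top))
    return "".join(out)

def sc(T, C, d):
    """sc(T): number of l-mers in T at most d away from the consensus C."""
    return sum(1 for x in T if hamming(x, C) <= d)

def closest_lmer(seq, C):
    """Leftmost l-mer of seq with minimum Hamming distance to C."""
    l = len(C)
    return min((seq[i:i + l] for i in range(len(seq) - l + 1)),
               key=lambda x: hamming(x, C))

def refine(sequences, T, C, d):
    """Replace T by the closest l-mers while that strictly improves the score."""
    score = sc(T, C, d)
    while True:
        T_new = [closest_lmer(seq, C) for seq in sequences]
        C_new = consensus(T_new)
        new_score = sc(T_new, C_new, d)
        if new_score <= score:
            return T, C, score
        T, C, score = T_new, C_new, new_score

# The slide's example
print(refine(["AGTC", "AAAT", "AGCT"], ["AGT", "AAA", "AGC"], "AGT", d=1))
# (['AGT', 'AAT', 'AGC'], 'AGT', 3)
```

Because the score must strictly increase and cannot exceed the number of sequences, the loop always terminates.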

PROJECTION Algorithm Recap
PROJECTION algorithm:
• Do random projections
  − Hash l-mers to buckets using k random positions
  − Use full buckets as starting points
• Do motif refinement
  − Get model W_h from bucket h
  − Refine to optimal model W_h* (using e.g. EM)
  − Return best consensus motif
31/48

Experimental Results
• Experiments on Simulated Data
• Limitations on Solvable (l,d)-Motif Problems
• Transcription Factor Binding Sites
• Ribosome Binding Sites
32/48

Experiments on Simulated Data
Simulated Data:
1 – a motif M is chosen randomly
2 – t independent planted instances are produced, each by randomly selecting d positions in M and mutating the base at those positions
3 – the position of each instance in its input sequence is selected randomly
4 – the remaining n-l residues of each sequence are chosen randomly
33/48
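A generator for this kind of simulated data might look like the sketch below. Assumptions: bases are drawn uniformly from A, C, G, T, each of the d chosen positions is changed to a different base (so an instance differs from M in exactly d positions), and a fixed seed is used for reproducibility. The names `mutate` and `simulate` are hypothetical.

```python
import random

BASES = "ACGT"

def mutate(motif, d, rng):
    """Copy of motif with exactly d randomly chosen positions changed."""
    instance = list(motif)
    for p in rng.sample(range(len(motif)), d):
        instance[p] = rng.choice([b for b in BASES if b != motif[p]])
    return "".join(instance)

def simulate(t, n, l, d, seed=0):
    """Return (motif, sequences, planted start offsets) for a planted (l,d) problem."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(BASES) for _ in range(l))
    sequences, starts = [], []
    for _ in range(t):
        instance = mutate(motif, d, rng)
        start = rng.randrange(n - l + 1)
        background = "".join(rng.choice(BASES) for _ in range(n - l))
        sequences.append(background[:start] + instance + background[start:])
        starts.append(start)
    return motif, sequences, starts

motif, sequences, starts = simulate(t=5, n=50, l=8, d=2, seed=1)
print(motif, starts)
```

Returning the planted start offsets alongside the sequences makes it possible to score a prediction afterwards.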

Experiments on Simulated Data
Performance coefficient: $PC = |K \cap P| / |K \cup P|$
• K = the t*l residue positions in the t planted motif instances
• P = corresponding set of residues in the instances predicted by the algorithm
Example sequences shown on the slide:
GGACCTCAATGCAGGATACACCGATCGGTA
GGAGTACGGCAAGTCCCCATGTGAGGACCT
AGGCTGGACCAGGACCTGACTCTACACCTA
TGGACCTGCAGGATACAGCGGGACCTATCG
……
34/48
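A sketch of the performance coefficient, assuming the usual definition as the size of the intersection of K and P divided by the size of their union, and assuming one planted and one predicted instance per sequence, each given by its start offset. The function name is hypothetical.

```python
def performance_coefficient(known_starts, predicted_starts, l):
    """|K ∩ P| / |K ∪ P| over (sequence index, residue position) pairs."""
    K = {(i, s + j) for i, s in enumerate(known_starts) for j in range(l)}
    P = {(i, s + j) for i, s in enumerate(predicted_starts) for j in range(l)}
    return len(K & P) / len(K | P)

print(performance_coefficient([0], [1], 3))        # 0.5 (2 shared of 4 positions)
print(performance_coefficient([4, 9], [4, 9], 5))  # 1.0 (perfect prediction)
```

A prediction that is shifted by a single position still earns partial credit, which makes the measure less brittle than an exact-match count.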

Experiments on Simulated Data
[Figure: Average PC on planted (l,d)-motif problems. Series: PROJECTION, WINNOWER, SP-STAR, Gibbs. x-axis: (l,d) = (10,2), (11,2), (12,3), (13,3), (14,4), (15,4), (16,5), (17,5), (18,6), (19,6); y-axis: PC]
Results:
− Average over 20 random instances
− All runs used projection size k = 7 and bucket threshold s = 4, but a different number of iterations m
− WINNOWER (k = 2)

Limitations on Solvable (l,d)-Motif Problems
Why is it difficult to find an (l,d)-motif?
What is the difference between:
  (9,2) (11,3) (13,4) (15,5) (17,6)
and
  (10,2) (12,3) (14,4) (16,5) (18,6)?
Between (l,d) and (l+1,d)?
36/48

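Limitations on Solvable (l,d)-Motif Problems
• The probability that a random sequence of length l will correspond to a motif M with up to d substitutions:
  $p_d = \sum_{i=0}^{d} \binom{l}{i} \left(\frac{3}{4}\right)^{i} \left(\frac{1}{4}\right)^{l-i}$
• The expected number of random (spurious) motifs, i.e. l-mers that occur by chance with up to d substitutions at least once in each of t random sequences of length n:
  $E(l,d) = 4^{l} \left(1 - (1 - p_d)^{n-l+1}\right)^{t}$
37/48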
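The two formulas are easy to evaluate numerically, which helps answer the question of why (l,d) and (l+1,d) behave so differently. In this sketch, t = 20 and n = 600 are assumed values (they are not given on this slide), and the function names are hypothetical.

```python
from math import comb

def p_d(l, d):
    """Probability that a random l-mer is within d substitutions of a fixed motif."""
    return sum(comb(l, i) * (3 / 4) ** i * (1 / 4) ** (l - i) for i in range(d + 1))

def expected_spurious(l, d, t=20, n=600):
    """E(l,d) = 4^l * (1 - (1 - p_d)^(n - l + 1))^t under the assumed t and n."""
    p = p_d(l, d)
    return 4 ** l * (1 - (1 - p) ** (n - l + 1)) ** t

for l, d in [(13, 4), (14, 4), (15, 4)]:
    print((l, d), expected_spurious(l, d))
```

Under these assumed settings, (13,4) gives an expected count above one, meaning random motifs as good as the planted one are likely to exist, while (14,4) and (15,4) fall far below one.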

Limitations on Solvable (l,d)-Motif Problems Statistics of spurious (l,d)-motifs in simulated data : 38/48
