CSE 527, Autumn 2009
5 – Motifs: Representation & Discovery

Outline
Previously: Learning from data
• MLE: Max Likelihood Estimators
• EM: Expectation Maximization (MLE w/ hidden data)
These slides:
• Bio: Expression & regulation
  Expression: creation of gene products
  Regulation: when/where/how much of each gene product; complex and critical
• Comp: using MLE/EM to find regulatory motifs in biological sequence data

Gene Expression & Regulation

Gene Expression
Recall that a gene is a DNA sequence for a protein.
To say a gene is expressed means that
• it is transcribed from DNA to RNA
• the mRNA is processed in various ways
• the mRNA is exported from the nucleus (eukaryotes)
• the mRNA is translated into protein
A key point: not all genes are expressed all the time, in all cells, or at equal levels.

RNA Transcription
Some genes heavily transcribed (many are not)
(Figure: Alberts, et al.)

Regulation
In most cells, pro- or eukaryote, easily a 10,000-fold difference between least- and most-highly expressed genes.
Regulation happens at all steps. E.g., some genes are highly transcribed, some are not transcribed at all, some transcripts can be sequestered then released, or rapidly degraded, some are weakly translated, some are very actively translated, ...
Below, focus on 1st step only: transcriptional regulation

E. coli growth on glucose + lactose
http://en.wikipedia.org/wiki/Lac_operon

1965 Nobel Prize, Physiology or Medicine
François Jacob, Jacques Monod, André Lwoff

Sea Urchin - Endo16

DNA Binding Proteins
A variety of DNA binding proteins (so-called “transcription factors”; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding genes.

The Double Helix
(Figure: Los Alamos Science)

In the groove
Different patterns of potential H bonds at edges of different base pairs, accessible esp. in major groove

Helix-Turn-Helix DNA Binding Motif

H-T-H Dimers
Bind 2 DNA patches, ~1 turn apart
Increases both specificity and affinity

Zinc Finger Motif

Leucine Zipper Motif
Homo-/hetero-dimers and combinatorial control
(Figure: Alberts, et al.)

Some Protein/DNA interactions well-understood

MyoD
http://www.rcsb.org/pdb/explore/jmol.do?structureId=1MDY&bionumber=1

But the overall DNA binding “code” still defies prediction
(Figure: CAP)

Summary
• Proteins can bind DNA to regulate gene expression (i.e., production of other proteins & themselves)
• This is widespread
• Complex combinatorial control is possible
• But it’s not the only way to do this...

Sequence Motifs
Motif: “a recurring salient thematic element”
Last few slides described structural motifs in proteins.
Equally interesting are the DNA sequence motifs to which these proteins bind - e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences.

DNA binding site summary
• Complex “code”
• Short patches (4-8 bp)
• Often near each other (1 turn = 10 bp)
• Often reverse-complements
• Not perfect matches

E. coli Promoters
“TATA Box” ~10 bp upstream of transcription start
How to define it?
Example sequences: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
• Consensus is TATAAT, BUT all differ from it
• Allow k mismatches?
• Equally weighted?
• Wildcards like R, Y? ({A,G}, {C,T}, resp.)

E. coli Promoters
“TATA Box” - consensus TATAAT, ~10 bp upstream of transcription start
Not exact: of 168 studied (mid 80’s)
– nearly all had 2/3 of TAxyzT
– 80-90% had all 3
– 50% agreed in each of x,y,z
– no perfect match
Other common features at -35, etc.

TATA Box Frequencies (values in percent)
base\pos    1    2    3    4    5    6
A           2   95   26   59   51    1
C           9    2   14   13   20    3
G          10    1   16   15   13    0
T          79    3   44   13   17   96

TATA Scores
A “Weight Matrix Model” or “WMM”
base\pos    1    2    3    4    5    6
A         -36   19    1   12   10  -46
C         -15  -36   -8   -9   -3  -31
G         -13  -46   -6   -7   -9  -46 (?)
T          17  -31    8   -9   -6   19
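
The relation between the two TATA tables is not spelled out on these slides. The sketch below tests one plausible reading: each score is 10 x log2(frequency / 25%), rounded to the nearest integer. The formula, the uniform 25% background, and the 1% floor applied to the zero entry (the one marked "(?)") are all assumptions of this sketch; the function name `freq_to_score` is hypothetical.

```python
import math

# Percent frequencies copied from the "TATA Box Frequencies" table
# (rows: bases, columns: positions 1-6).
TATA_FREQ_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}


def freq_to_score(percent, background=25.0, scale=10.0, floor=1.0):
    """Scaled, rounded log2 ratio of an observed frequency to the background.

    Assumption: a frequency of 0 is floored at `floor` percent so the
    logarithm stays finite.
    """
    p = max(percent, floor)
    return round(scale * math.log2(p / background))


# Rebuild the whole score table under the assumptions above.
TATA_SCORES = {
    base: [freq_to_score(p) for p in row]
    for base, row in TATA_FREQ_PERCENT.items()
}

for base, row in TATA_SCORES.items():
    print(base, row)
```

Under these assumptions the computed table agrees with every entry of the TATA Scores table, including the -46 at G, position 6.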

Scanning for TATA
Slide the weight matrix along the sequence; the score of each window is the sum of the matrix entries for the bases under it.
base\pos    1    2    3    4    5    6
A         -36   19    1   12   10  -46
C         -15  -36   -8   -9   -3  -31
G         -13  -46   -6   -7   -9  -46
T          17  -31    8   -9   -6   19
Sequence: A C T A T A A T C G
Window C T A T A A: -15 - 31 + 1 - 9 + 10 - 46 = -90
Window T A T A A T:  17 + 19 + 8 + 12 + 10 + 19 = 85
Window A T A A T C: -36 - 31 + 1 + 12 - 6 - 31 = -91

Scanning for TATA
[Figure: bar chart of the window score at each position along the sequence ACTATAATCGATCGATGCTAGCATGCGGATATGAT; y-axis: Score, -150 to 100.]
Stormo, Ann. Rev. Biophys. Biophys Chem, 17, 1988, 241-263

Score Distribution (Simulated)
[Figure: histogram of scores; x-axis: score, -150 to 90; y-axis: count, 0 to 3500.]

TATA Scan at 2 genes
[Figure: two panels, LacI and LacZ; x-axis: position, -400 to +400 around AUG; y-axis: Score, -150 to 50.]
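
The scan itself can be sketched in a few lines. This is a minimal illustration using the TATA score matrix and the ten-base example sequence from the slides; the function name `scan` and the dictionary layout are choices made for this sketch, not anything prescribed by the slides.

```python
# Scores copied from the "TATA Scores" weight matrix
# (rows: bases, columns: positions 1-6).
TATA_SCORES = {
    "A": [-36, 19, 1, 12, 10, -46],
    "C": [-15, -36, -8, -9, -3, -31],
    "G": [-13, -46, -6, -7, -9, -46],
    "T": [17, -31, 8, -9, -6, 19],
}


def scan(sequence, wmm):
    """Return the WMM score of every full-width window, left to right."""
    width = len(next(iter(wmm.values())))
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add the matrix entry for the base seen at each window position.
        scores.append(sum(wmm[base][i] for i, base in enumerate(window)))
    return scores


scores = scan("ACTATAATCG", TATA_SCORES)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, "best window starts at offset", best)
```

The windows starting at the second, third, and fourth bases score -90, 85, and -91, matching the values on the slide; the highest-scoring window is TATAAT itself.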

Neyman-Pearson
Given a sample $x_1, x_2, \ldots, x_n$ from a distribution $f(\ldots \mid \Theta)$ with parameter $\Theta$, want to test hypothesis $\Theta = \theta_1$ vs $\Theta = \theta_2$.
Might as well look at likelihood ratio (or log likelihood ratio):
$$\frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_2)} > \tau$$

Weight Matrices: Statistics
Assume:
$f_{b,i}$ = frequency of base $b$ in position $i$ in TATA
$f_b$ = frequency of base $b$ in all sequences
Log likelihood ratio, given $S = B_1 B_2 \ldots B_6$:
$$\log \frac{P(S \mid \text{“promoter”})}{P(S \mid \text{“nonpromoter”})} = \log \frac{\prod_{i=1}^{6} f_{B_i,i}}{\prod_{i=1}^{6} f_{B_i}} = \sum_{i=1}^{6} \log \frac{f_{B_i,i}}{f_{B_i}}$$
Assumes independence

Score Distribution (Simulated)
[Figure: histogram of scores; x-axis: score, -150 to 90; y-axis: count, 0 to 3500.]

What’s best WMM?
Given, say, 168 sequences $s_1, s_2, \ldots, s_k$ of length 6, assumed to be generated at random according to a WMM defined by 6 x (4-1) parameters $\theta$, what’s the best $\theta$?
E.g., what’s MLE for $\theta$ given data $s_1, s_2, \ldots, s_k$?
Answer: like coin flips or dice rolls, count frequencies per position (see HW).
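
The "count frequencies per position" answer can be sketched as follows. For illustration this uses the eight length-3 sequences from the "Another WMM example" slide rather than the 168 promoters, which are not listed in these slides; the function name `mle_wmm` is hypothetical.

```python
from collections import Counter


def mle_wmm(sequences, alphabet="ACGT"):
    """Maximum likelihood WMM: per-position base frequencies.

    Returns one dict per position, mapping each base to the fraction of
    input sequences that have that base at that position.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)  # missing bases count as 0
        columns.append({base: counts[base] / n for base in alphabet})
    return columns


# Five ATG, two GTG, one TTG.
sites = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]
wmm = mle_wmm(sites)
for i, column in enumerate(wmm, start=1):
    print("Col", i, column)
```

Column 1 comes out as A 0.625, G 0.25, T 0.125, and columns 2 and 3 are all T and all G respectively. Each column has only 4 - 1 free parameters because its frequencies must sum to 1.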

Weight Matrices: Chemistry
Experiments show ~80% correlation of log likelihood weight matrix scores to measured binding energy of RNA polymerase to variations on TATAAT consensus [Stormo & Fields]

Another WMM example
8 Sequences: ATG, ATG, ATG, ATG, ATG, GTG, GTG, TTG

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

LLR     Col 1   Col 2   Col 3
A        1.32   -∞      -∞
C       -∞      -∞      -∞
G        0      -∞       2.00
T       -1.00    2.00   -∞

Log-Likelihood Ratio: $\log_2 \frac{f_{x_i,i}}{f_{x_i}}$, with $f_{x_i} = \frac{1}{4}$

Non-uniform Background
• E. coli - DNA approximately 25% A, C, G, T
• M. jannaschii - 68% A-T, 32% G-C
LLR from previous example, assuming $f_A = f_T = 3/8$, $f_C = f_G = 1/8$:

LLR     Col 1   Col 2   Col 3
A        0.74   -∞      -∞
C       -∞      -∞      -∞
G        1.00   -∞       3.00
T       -1.58    1.42   -∞

e.g., G in col 3 is 8 x more likely via WMM than background, so ($\log_2$) score = 3 (bits).

Relative Entropy
AKA Kullback-Leibler Distance/Divergence, AKA Information Content
Given distributions $P$, $Q$:
$$H(P \| Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)} \ge 0$$
Notes:
Let $P(x) \log \frac{P(x)}{Q(x)} = 0$ if $P(x) = 0$ [since $\lim_{y \to 0} y \log y = 0$]
Undefined if $0 = Q(x) < P(x)$
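
The two LLR tables for the three-column example can be reproduced with a short sketch. The frequencies and both backgrounds are taken from the slides; the function name `llr_matrix` and the use of `-math.inf` for zero-frequency entries are choices made for this illustration.

```python
import math

# Frequencies of the three-column example (rows: bases, columns: Col 1-3).
FREQ = {
    "A": [0.625, 0.0, 0.0],
    "C": [0.0, 0.0, 0.0],
    "G": [0.25, 0.0, 1.0],
    "T": [0.125, 1.0, 0.0],
}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
SKEWED = {"A": 3 / 8, "C": 1 / 8, "G": 1 / 8, "T": 3 / 8}


def llr_matrix(freq, background):
    """log2(f_{b,i} / f_b) per entry; zero frequencies map to -infinity."""
    return {
        base: [
            math.log2(f / background[base]) if f > 0 else -math.inf
            for f in row
        ]
        for base, row in freq.items()
    }


for name, background in (("uniform", UNIFORM), ("non-uniform", SKEWED)):
    print(name)
    for base, row in llr_matrix(FREQ, background).items():
        print(" ", base, [round(v, 2) for v in row])
```

With the non-uniform background, G in column 3 scores log2(1 / (1/8)) = 3 bits, while A in column 1 drops from 1.32 to 0.74 because A is already common in the background.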

WMM: How “Informative”? Mean score of site vs bkg?
For any fixed length sequence $x$, let
$P(x)$ = Prob. of $x$ according to WMM
$Q(x)$ = Prob. of $x$ according to background
Relative Entropy:
$$H(P \| Q) = \sum_{x \in \Omega} P(x) \log_2 \frac{P(x)}{Q(x)}$$
$H(P \| Q)$ is expected log likelihood score of a sequence randomly chosen from WMM;
$-H(Q \| P)$ is expected score of Background
Expected score difference: $H(P \| Q) + H(Q \| P)$

WMM Scores vs Relative Entropy
$H(P \| Q) = 5.0$, $-H(Q \| P) = -6.8$
[Figure: histogram of scores with $-H(Q \| P)$ and $H(P \| Q)$ marked; x-axis: score, -150 to 90; y-axis: count, 0 to 3500.]
On average, foreground model scores > background by 11.8 bits (score difference of 118 on 10x scale used in examples above).

For a WMM:
$$H(P \| Q) = \sum_i H(P_i \| Q_i)$$
where $P_i$ and $Q_i$ are the WMM/background distributions for column $i$.
Proof: exercise
Hint: Use the assumption of independence between WMM columns

WMM Example, cont.
Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Uniform
LLR     Col 1   Col 2   Col 3
A        1.32   -∞      -∞
C       -∞      -∞      -∞
G        0      -∞       2.00
T       -1.00    2.00   -∞
RelEnt   0.70    2.00    2.00    Total: 4.70

Non-uniform
LLR     Col 1   Col 2   Col 3
A        0.74   -∞      -∞
C       -∞      -∞      -∞
G        1.00   -∞       3.00
T       -1.58    1.42   -∞
RelEnt   0.51    1.42    3.00    Total: 4.93
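
The RelEnt rows, and the claim that a WMM's relative entropy is the sum over its columns, can be checked numerically. The sketch below uses the example frequencies and the two backgrounds from the slides; the function names are hypothetical, and the brute-force sum over all 64 sequences is only a numerical sanity check, not the proof the slides leave as an exercise.

```python
import math
from itertools import product

# Columns of the three-position example WMM (frequencies from the slides).
WMM = [
    {"A": 0.625, "C": 0.0, "G": 0.25, "T": 0.125},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
SKEWED = {"A": 3 / 8, "C": 1 / 8, "G": 1 / 8, "T": 3 / 8}


def column_relative_entropy(p, q):
    """H(P_i || Q_i) in bits for one column."""
    total = 0.0
    for base, p_x in p.items():
        if p_x == 0:
            continue  # convention from the slides: term is 0 when P(x) = 0
        if q[base] == 0:
            raise ValueError("undefined when Q(x) = 0 < P(x)")
        total += p_x * math.log2(p_x / q[base])
    return total


def wmm_relative_entropy(wmm, background):
    """Sum of the per-column relative entropies."""
    return sum(column_relative_entropy(col, background) for col in wmm)


def brute_force_relative_entropy(wmm, background):
    """H(P || Q) summed directly over every sequence of the WMM's length."""
    total = 0.0
    for seq in product("ACGT", repeat=len(wmm)):
        p = math.prod(col[b] for col, b in zip(wmm, seq))
        q = math.prod(background[b] for b in seq)
        if p > 0:
            total += p * math.log2(p / q)
    return total


for name, bg in (("uniform", UNIFORM), ("non-uniform", SKEWED)):
    per_column = [round(column_relative_entropy(c, bg), 2) for c in WMM]
    print(name, per_column, round(wmm_relative_entropy(WMM, bg), 2))
```

The per-column values round to 0.70, 2.00, 2.00 (total 4.70) for the uniform background and 0.51, 1.42, 3.00 (total 4.93) for the non-uniform one, and the brute-force sum agrees with the column-wise sum in both cases.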