CSE P 590 A
Autumn 2008
Lecture 5
Motifs: Representation & Discovery

George Palade
Nov. 19, 1912 -- Oct 8, 2008
- 1966 Albert Lasker Award for Basic Medical Research
- 1974 Nobel Prize in Physiology or Medicine (with Albert Claude and Christian de Duve)
- Identified the function of mitochondria, ribosomes and cellular secretion

Outline
Last week: Learning from data
- MLE: Max Likelihood Estimators
- EM: Expectation Maximization (MLE w/ hidden data)
Expression & regulation
- Expression: creation of gene products
- Regulation: when/where/how much of each gene product; complex and critical
Next: using MLE/EM to find regulatory motifs in biological sequence data

Gene Expression & Regulation
Gene Expression
Recall a gene is a DNA sequence for a protein
To say a gene is expressed means that it
- is transcribed from DNA to RNA
- the mRNA is processed in various ways
- is exported from the nucleus (eukaryotes)
- is translated into protein
A key point: not all genes are expressed all the time, in all cells, or at equal levels

RNA Transcription
Some genes heavily transcribed (many are not)
Alberts, et al.

Regulation
In most cells, pro- or eukaryote, easily a 10,000-fold difference between least- and most-highly expressed genes
Regulation happens at all steps. E.g., some transcripts can be sequestered then released, or rapidly degraded; some are weakly translated, some are very actively translated; some are highly transcribed, some are not transcribed at all
Below, focus on 1st step only: transcriptional regulation

E. coli growth on glucose + lactose
http://en.wikipedia.org/wiki/Lac_operon
1965 Nobel Prize
François Jacob and Jacques Monod

The Double Helix
Los Alamos Science

DNA Binding Proteins
A variety of DNA binding proteins (“transcription factors”; a significant fraction, perhaps 5-10%, of all human proteins) modulate transcription of protein coding genes
In the groove
Different patterns of potential H bonds at edges of different base pairs, accessible esp. in major groove

Helix-Turn-Helix DNA Binding Motif

H-T-H Dimers
Bind 2 DNA patches, ~1 turn apart
Increases both specificity and affinity

Zinc Finger Motif
Leucine Zipper Motif
Homo-/hetero-dimers and combinatorial control
Alberts, et al.

Some Protein/DNA interactions well-understood
CAP

But the overall DNA binding “code” still defies prediction
Bacterial Met Repressor
(a beta-sheet DNA binding domain)
- Negative feedback loop: high Met level ⇒ repress Met synthesis genes
- SAM (Met derivative)

Summary
- Proteins can bind DNA to regulate gene expression (i.e., production of other proteins & themselves)
- This is widespread
- Complex combinatorial control is possible
But it’s not the only way to do this...

Sequence Motifs
Motif: “a recurring salient thematic element”
Last few slides described structural motifs in proteins
Equally interesting are the DNA sequence motifs to which these proteins bind, e.g., one leucine zipper dimer might bind (with varying affinities) to dozens or hundreds of similar sequences

DNA binding site summary
- Complex “code”
- Short patches (4-8 bp)
- Often near each other (1 turn = 10 bp)
- Often reverse-complements
- Not perfect matches
E. coli Promoters
“TATA Box” ~10bp upstream of transcription start
How to define it?
- Consensus is TATAAT
- BUT all differ from it: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
- Allow k mismatches?
- Equally weighted?
- Wildcards like R,Y? ({A,G}, {C,T}, resp.)

E. coli Promoters
“TATA Box” - consensus TATAAT, ~10bp upstream of transcription start
Not exact: of 168 studied (mid 80’s)
- nearly all had 2/3 of TAxyzT
- 80-90% had all 3
- 50% agreed in each of x,y,z
- no perfect match
Other common features at -35, etc.

TATA Box Frequencies (percent)
base   pos 1   pos 2   pos 3   pos 4   pos 5   pos 6
A        2      95      26      59      51       1
C        9       2      14      13      20       3
G       10       1      16      15      13       0
T       79       3      44      13      17      96

TATA Scores
base   pos 1   pos 2   pos 3   pos 4   pos 5   pos 6
A      -36      19       1      12      10     -46
C      -15     -36      -8      -9      -3     -31
G      -13     -46      -6      -7      -9     -46 (?)
T       17     -31       8      -9      -6      19
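The numbers in the scores table look consistent with 10 x log2(frequency / 25%), rounded to the nearest integer, i.e. a scaled log-odds score against a uniform background. That reading is inferred from the two tables, not stated on the slide. A minimal sketch under that assumption follows; the names `FREQ_PERCENT` and `score_matrix` are illustrative, and flooring the zero count for G at position 6 to 1% is a guess at how the slide arrived at its "-46 (?)" entry.

```python
import math

# TATA box frequencies in percent, positions 1..6, copied from the table above.
FREQ_PERCENT = {
    "A": [2, 95, 26, 59, 51, 1],
    "C": [9, 2, 14, 13, 20, 3],
    "G": [10, 1, 16, 15, 13, 0],
    "T": [79, 3, 44, 13, 17, 96],
}

def score_matrix(freq_percent, background=25.0, scale=10.0, floor=1.0):
    """Scaled log-odds: scale * log2(f / background), rounded to an integer.

    Assumption: frequencies below `floor` percent are raised to `floor`,
    so a zero count does not produce minus infinity.
    """
    return {
        base: [round(scale * math.log2(max(f, floor) / background)) for f in row]
        for base, row in freq_percent.items()
    }

SCORES = score_matrix(FREQ_PERCENT)
for base in "ACGT":
    print(base, SCORES[base])
```

Under these assumptions the printed rows match the scores table entry for entry, which supports the log-odds reading but does not prove that this is how the original table was produced.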
Scanning for TATA
[Figure: the weight matrix (rows A, C, G, T) slid along the sequence ACTATAATCG, one 6-base window at a time]

Scanning for TATA
[Figure: bar chart of window scores (y axis: Score, -150 to 100) along the sequence AACTATAATCGATCGATGCTAGCATGCGGATATGATCGTACTATAATCG; labeled bars include 85, 66, 50, 50, 23, -93 and -95]
Stormo, Ann. Rev. Biophys. Biophys Chem, 17, 1988, 241-263

Score Distribution (Simulated)
[Figure: histogram of simulated scores; x axis: score, -150 to 90; y axis: count, 0 to 3500]

Weight Matrices: Statistics
Assume:
$f_{b,i}$ = frequency of base b in position i in TATA
$f_b$ = frequency of base b in all sequences
Log likelihood ratio, given $S = B_1 B_2 \ldots B_6$:
$$\log \frac{P(S \mid \text{``promoter''})}{P(S \mid \text{``nonpromoter''})} = \log \frac{\prod_{i=1}^{6} f_{B_i,i}}{\prod_{i=1}^{6} f_{B_i}} = \sum_{i=1}^{6} \log \frac{f_{B_i,i}}{f_{B_i}}$$
Assumes independence
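Because the log likelihood ratio is a sum of per-position terms, scanning reduces to sliding a 6-base window along the sequence and adding up one table entry per position. A minimal sketch, assuming the integer values from the TATA Scores table; the function names `window_score` and `scan` are illustrative, not from the lecture:

```python
# Integer TATA scores for positions 1..6, copied from the TATA Scores table.
SCORES = {
    "A": [-36, 19, 1, 12, 10, -46],
    "C": [-15, -36, -8, -9, -3, -31],
    "G": [-13, -46, -6, -7, -9, -46],
    "T": [17, -31, 8, -9, -6, 19],
}
WIDTH = 6

def window_score(window, scores=SCORES):
    """Sum of per-position scores; adding them is where independence is assumed."""
    return sum(scores[base][i] for i, base in enumerate(window))

def scan(sequence, scores=SCORES, width=WIDTH):
    """Return (start, window, score) for every width-long window of the sequence."""
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, window, window_score(window, scores)))
    return hits

hits = scan("ACTATAATCG")
for start, window, score in hits:
    print(start, window, score)

# Highest-scoring window in this short example.
best = max(hits, key=lambda hit: hit[2])
print("best:", best)
```

On this 10-base example the consensus window TATAAT scores 85 and every other window scores below zero, which is the separation that scanning relies on.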
Score Distribution (Simulated)
[Figure: histogram of simulated scores; x axis: score, -150 to 90; y axis: count, 0 to 3500]

Neyman-Pearson
Given a sample $x_1, x_2, \ldots, x_n$ from a distribution $f(\ldots \mid \theta)$ with parameter $\theta$, want to test hypothesis $\theta = \theta_1$ vs $\theta = \theta_2$.
Might as well look at likelihood ratio:
$$\frac{f(x_1, x_2, \ldots, x_n \mid \theta_1)}{f(x_1, x_2, \ldots, x_n \mid \theta_2)} > \tau$$

Weight Matrices: Chemistry
Experiments show ~80% correlation of log likelihood weight matrix scores to measured binding energy of RNA polymerase to variations on TATAAT consensus [Stormo & Fields]

What’s best WMM?
Given, say, 168 sequences $s_1, s_2, \ldots, s_k$ of length 6, assumed to be generated at random according to a WMM defined by 6 x (4-1) parameters $\theta$, what’s the best $\theta$?
E.g., what’s MLE for $\theta$ given data $s_1, s_2, \ldots, s_k$?
Answer: like coin flips or dice rolls, count frequencies per position (see HW).
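The "count frequencies per position" answer can be sketched in a few lines. This is an illustration, not the homework solution: the function name `mle_wmm` is made up here, and the input is just the six example sequences listed on the E. coli Promoters slide, a toy sample far smaller than the 168 sequences mentioned above.

```python
from collections import Counter

ALPHABET = "ACGT"

def mle_wmm(sequences):
    """MLE of a WMM: the relative frequency of each base at each position."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    n = len(sequences)
    wmm = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)  # missing bases count as 0
        wmm.append({b: counts[b] / n for b in ALPHABET})
    return wmm

# Toy data: the six variants shown on the E. coli Promoters slide.
SITES = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
wmm = mle_wmm(SITES)
for i, column in enumerate(wmm, start=1):
    print(i, column)
```

Each column has four frequencies that sum to 1, so only three are free, which is where the 6 x (4-1) parameter count comes from.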
Another WMM example
8 Sequences: ATG, ATG, ATG, ATG, ATG, GTG, GTG, TTG

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Log-Likelihood Ratio: $\log_2 \frac{f_{x_i,i}}{f_{x_i}}$, with $f_{x_i} = \frac{1}{4}$

LLR     Col 1   Col 2   Col 3
A        1.32   -∞      -∞
C       -∞      -∞      -∞
G        0      -∞       2.00
T       -1.00    2.00   -∞

Non-uniform Background
• E. coli - DNA approximately 25% A, C, G, T
• M. jannaschii - 68% A-T, 32% G-C
LLR from previous example, assuming $f_A = f_T = 3/8$, $f_C = f_G = 1/8$

LLR     Col 1   Col 2   Col 3
A        0.74   -∞      -∞
C       -∞      -∞      -∞
G        1.00   -∞       3.00
T       -1.58    1.42   -∞

e.g., G in col 3 is 8 x more likely via WMM than background, so (log2) score = 3 (bits).

WMM: How “Informative”?
Mean score of site vs bkg?
For any fixed length sequence x, let
P(x) = Prob. of x according to WMM
Q(x) = Prob. of x according to background
Relative Entropy:
$$H(P \| Q) = \sum_{x \in \Omega} P(x) \log_2 \frac{P(x)}{Q(x)}$$
[Figure: score distribution with -H(Q||P) and H(P||Q) marked]
H(P||Q) is expected log likelihood score of a sequence randomly chosen from WMM;
-H(Q||P) is expected score of Background

Relative Entropy
AKA Kullback-Leibler Distance/Divergence, AKA Information Content
Given distributions P, Q
$$H(P \| Q) = \sum_{x \in \Omega} P(x) \log \frac{P(x)}{Q(x)} \ge 0$$
Notes:
Let $P(x) \log \frac{P(x)}{Q(x)} = 0$ if $P(x) = 0$ [since $\lim_{y \to 0} y \log y = 0$]
Undefined if $0 = Q(x) < P(x)$
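The LLR entries and the relative entropy of the 8-sequence example can be recomputed directly from the definitions. A minimal sketch: the helper names (`column_freqs`, `llr`, `relative_entropy`) are illustrative, and summing the per-column values into one total relies on the columns being treated as independent.

```python
import math

ALPHABET = "ACGT"
SITES = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]  # the 8 example sequences

def column_freqs(sites):
    """Per-column base frequencies (the MLE of the WMM)."""
    n = len(sites)
    return [{b: sum(s[i] == b for s in sites) / n for b in ALPHABET}
            for i in range(len(sites[0]))]

def llr(p, q):
    """log2(p / q); minus infinity when the WMM frequency p is zero."""
    return math.log2(p / q) if p > 0 else -math.inf

def relative_entropy(P, Q):
    """H(P||Q) in bits, using the convention that terms with P(x) = 0 are 0."""
    total = 0.0
    for x in P:
        if P[x] > 0:
            if Q[x] == 0:
                raise ValueError("undefined: Q(x) = 0 < P(x)")
            total += P[x] * math.log2(P[x] / Q[x])
    return total

UNIFORM = {b: 0.25 for b in ALPHABET}
AT_RICH = {"A": 3 / 8, "C": 1 / 8, "G": 1 / 8, "T": 3 / 8}

freqs = column_freqs(SITES)
results = {}
for name, background in [("uniform", UNIFORM), ("non-uniform", AT_RICH)]:
    per_column = [relative_entropy(col, background) for col in freqs]
    results[name] = per_column
    print(name, [round(h, 2) for h in per_column], round(sum(per_column), 2))
```

With these inputs the per-column values come out near 0.70, 2.00, 2.00 (total about 4.70 bits) for the uniform background and 0.51, 1.42, 3.00 (total about 4.93 bits) for the A-T rich one.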
WMM Scores vs Relative Entropy
[Figure: histogram of simulated scores; x axis: score, -150 to 90; y axis: count, 0 to 3500; markers at H(P||Q) = 5.0 and -H(Q||P) = -6.8]

For WMM, you can show (based on the assumption of independence between columns), that:
$$H(P \| Q) = \sum_i H(P_i \| Q_i)$$
where $P_i$ and $Q_i$ are the WMM/background distributions for column i.

WMM Example, cont.

Freq.   Col 1   Col 2   Col 3
A       0.625   0       0
C       0       0       0
G       0.250   0       1
T       0.125   1       0

Uniform
LLR      Col 1   Col 2   Col 3   Total
A         1.32   -∞      -∞
C        -∞      -∞      -∞
G         0      -∞       2.00
T        -1.00    2.00   -∞
RelEnt    0.70    2.00    2.00   4.70

Non-uniform
LLR      Col 1   Col 2   Col 3   Total
A         0.74   -∞      -∞
C        -∞      -∞      -∞
G         1.00   -∞       3.00
T        -1.58    1.42   -∞
RelEnt    0.51    1.42    3.00   4.93

Pseudocounts
Are the -∞’s a problem?
- Certain that a given residue never occurs in a given position? Then -∞ just right
- Else, it may be a small-sample artifact
Typical fix: add a pseudocount to each observed count: small constant (e.g., .5, 1)
Sounds ad hoc; there is a Bayesian justification
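To see what a pseudocount does to the -∞ entries, the 8-sequence example can be re-estimated with a constant added to every count. A minimal sketch, assuming a pseudocount of 1 (one of the two values suggested above) and a uniform background; `smoothed_freqs` and `llr_table` are illustrative names:

```python
import math

ALPHABET = "ACGT"
SITES = ["ATG"] * 5 + ["GTG"] * 2 + ["TTG"]  # the 8 example sequences

def smoothed_freqs(sites, pseudocount=1.0):
    """Per-column frequencies after adding `pseudocount` to every base count."""
    denom = len(sites) + pseudocount * len(ALPHABET)
    return [
        {b: (sum(s[i] == b for s in sites) + pseudocount) / denom
         for b in ALPHABET}
        for i in range(len(sites[0]))
    ]

def llr_table(freqs, background):
    """log2(f / background) per column; finite because every f is positive."""
    return [{b: math.log2(col[b] / background[b]) for b in ALPHABET}
            for col in freqs]

UNIFORM = {b: 0.25 for b in ALPHABET}
freqs = smoothed_freqs(SITES, pseudocount=1.0)
table = llr_table(freqs, UNIFORM)
for b in ALPHABET:
    print(b, [round(col[b], 2) for col in table])
```

Every entry is now finite: bases never seen in a column get a strongly negative score (about -1.58 bits here) instead of -∞, at the cost of pulling the observed frequencies slightly toward uniform.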