statistical analysis applied to genome and proteome
play

Statistical analysis applied to genome and proteome analyses - PowerPoint PPT Presentation

Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices


  1. Statistical analysis applied to genome and proteome analyses (lecture 5) Warning: These notes serve only as a scaffold and should be completed with notes taken during the class Motif identification with position dependent weight matrices Estimating background models and score distributions felix.naef@epfl.ch

  2. Stripe 2 enhancer

  3. � ��������������������� Transcription Factors involved from Segal, nature 2008 2 5 10 0.2 25 1 0 1 Conc. 20 2 0 8 8 0.008 Hunchback 15 1 5 0.1 6 6 Expr. 10 1 0 4 4 5 -0.05 0.0 0 5 2 2 Coop. -11 -7 -3 1 5 9 13 19 0 0 0 0 6.22 0-50 400-450 800-850 100 75 50 25 0 3 5 3 1 0.2 1.0 Conc. 4 0.01 2 2 3 0.1 0.5 Expr. Giant 2 -1.98 1 1 0.0 0.0 1 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 6.68 0-50 400-450 800-850 100 75 50 25 0 20 5 20 1 0.3 3.5 Conc. 3.0 4 0.004 15 15 0.2 2.5 Kruppel 3 Expr. 2.0 10 10 0.1 2 -0.22 1.5 5 5 0.0 1.0 1 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 4.45 100 75 50 25 0 0-50 400-450 800-850 10 10 1 0.4 1.5 1 5 Conc. 8 8 0.3 1 2 0.09 1.0 0.2 9 6 6 Knirps Expr. 0.5 0.1 6 4 4 -1.15 0.0 0.0 3 2 2 Coop. -11 -7 -3 1 5 9 13 17 0 0 0 0 1.03 100 75 50 25 0 0-50 400-450 800-850

  4. Goals of this lecture • Get a flavor about modeling cis-regulatory modules from the point of view of protein-DNA interactions 1. bioinformatics method for modeling enhancers 2. statistical assessment of likelihood scores for single TFs 3. discuss some recent examples

  5. Main questions 1. Given the consensus for the TF involved, what are the most favorable configurations for a given set of concentrations 2. How is the above configuration translated into gene expression, in other words how does the transcription apparatus read the configuration. 3. Ultimate aim: can one predict expression from sequence, once the TF activities are given (measured)

  6. Graphically cell type 1 cell type 2

  7. Things that matter because of mass action 1. Affinity (binding energy, how many H-bonds are formed) 2. concentrations (local) of the TFs 3. competition (steric hindrance) 4. cooperativity • all these can be taken into accounts in HMM using models (cf. the latest Segal paper)

  8. Outline of the class 1. Physical and statistical models for Protein DNA- interactions and cis-regulatory modules • one and multi- component systems for one site • HMMs 2. Case studies: • Circadian enhancers • “Predicting expression patterns from regulatory sequence in Drosophila segmentation”, Eran Segal et al., Nature 2008 3. Exercises

  9. Statistical and physical models for Protein DNA- interactions 1. Weight matrices and likelihoods - W is distribution over the space of sequences of length L L - � P ( S | W ) = w j ( S j ) - j =1 simplest model: independent positions - interpretation: P(S|W) is the likelihood that the sequence S was drawn from the distribution W - c P(S|W) is proportional to the weight of the state where the site S is occupied 2. Binding energy (e.g. measured in kcal/mol) - � E ( S ) = � j ( S j ) j - simplest model: additive energies (~independence) - is proportional to the weight of the state where e − ( E ( S ) − µ ) /kT = e µ/kT e − E ( S ) /kT the site S is occupied

  10. Getting a sense of the numbers • 1 good H-bond contributes 1 kcal/mol • RT=0.6 kcal/mol • exp(1/0.6)~5.3 • thus one H-bond contributes a factor ~5 in affinity (K d )

  11. Site occupancies and the law of mass action • Consider a single TF - 2 state model • chemical kinetics F + D ∅ ↔ D F F K d = k − 1 D F = F + K d k 1 • statistical mechanics 2 state 0,1 c 1 c e − β ( E 1 − E 0 ) c 1 ˆ occupancy of state 1: � n 1 � = c e − β ( E 1 − E 0 ) = 1 + c 1 K d + c 1 ˆ 1 1 c P ( S | W ) c 1 • probabilities ˆ � n 1 � = P ( S | B ) + c 1 c P ( S | W ) ˆ

  12. Site occupancy and the law of mass action • N state model: N TFs compete for the same genomic site - generalize the above forms, the factor that wins is the one with the largest product cK, if some of these products are similar the site will be shared among different factors K i c i 1 � n i � = K = 1 + � N K d i =1 K i c i - one H-bond less can be compensated by a factor 5 in concentration

  13. Long sequences with multiple sites • so far only room for one site • now: tile the sequences and compute observable by weighted averaging over each possible state ( � p ) n, � • n i is the color and p i the position of the site (sites cannot overlap!)

  14. Hidden Markov Models (HMMs) • Efficient methodology for computing the sums over all possible configurations (cf. book by R. Durbin et al.) − β � Z e − β � K 1 q ( E 0 ( S q )) , i =1 ( E nk ( S pk ) − µ nk,pk ) e P ( S, � n ) = − β �� K q E 0 ( S q ) � k =1 E nk ( S pk ) − � e β � 1 k µ nk,pk = Z e , � �� � � �� � P ( S | � n ) P ( � n ) � Z ( S ) = P ( S, � n ) � � n - Is the partition function, its logarithm is the free energy or the Forward score in HMM parlance - usually one compute the Forward score for a set of matrices minus the Forward score for pure background

  15. Background models for likelihood scores (binding energies) 1. Empirical models (brute force) 2. Gaussian approximation - is most accurate for long weight and weakly polarized weight matrices - uses the Central Limit Theorem (CLT) 3. Saddle-point approximation - is a size correction to the former, can thus handle shorter matrices - uses inverse Laplace transforms

  16. Goal (taken from Djorjevic, Gen Res 2003) • • Occupancy vs. likelihood distribution of -log-likelihood

  17. 2 Background models The purpose of this section is to introduce an analytical method for computing the expected distribution of likelihood scores from a PSSM model. This follows the desription in Djorjevic et al . To evaluate the statistical significance of a candidate binding site in a genomic sequence, i.e. a position where the log-likelihood E ( S ) = � � α ǫ j,S j is high, we would like to compute the expected distribution j of scores for random sequences described by a background model with properties that resemble the real sequence. For the sake of simplicity, we will consider a zero-th order background model with frequencies b α for letter α and � α b α = 1. 2.1 Gaussian model The log-likelihood E is a sum of independent random variables ǫ j . In the large L limit, i.e. when the model has many sites, we can apply the central limit theorem (CLT). Namely we know in that limit that the distribution of E will be a Gaussian with mean ¯ E = � j ¯ ǫ j , where ¯ ǫ j = � α b α ǫ j, α , and with variance χ 2 = � ǫ j ) 2 b α . j χ 2 j = � � α ( ǫ j, α − ¯ j

  18. � � � 2.2 Saddle-point approximation One can do better by invoking the following trick. The background distribution ρ ( E ) can be thought of as a density of states in the space of possible sequences and thus be written as: � i ∞ + γ � i ∞ + γ 1 1 � � e β ( E − E ( s )) d β = e β E Z ( β ) d β , ρ ( E ) = P ( S ) δ ( E − E ( S )) = P ( S ) (10) 2 π i 2 π i − i ∞ + γ − i ∞ + γ S S where the first equation use a representation of the δ − function as an inverse Laplace transform, and the S P ( S ) e − β E ( S ) . The last integral can be approximated by second introduces the partition function Z ( β ) = � identifying the saddle point in the integrand: β ∗ = min β real e β E +ln Z ( β ) , and taking the complex integration through the saddle ( γ = β ∗ ). At this point ζ ( β ∗ + ib ) = ζ ( β ∗ ) + 1 d β 2 ζ ( β ∗ ) b 2 + O ( b 3 ), where ζ = ln Z and d 2 2 therefore we find after that 2 ln(2 π d 2 ln ρ ( E ) ≈ ln Z ( β ∗ ) + β ∗ E − 1 d β 2 ln Z ( β ∗ )) . (11) The extremum condition leads to an implicit equation for β ∗ : E = − d d β Z ( β ∗ ) . (12) In an order zero model the partition function factorizes and so it is straight forward to find β ∗ numerically, and subsequently the second derivative. Concrete example will be shown during the exercises

  19. Case studies 1. Circadian Ebox signals (Paquet et al., 2008) 2. cis-regulatory code in the segment polarity network in Drosophila (Segal et al., 2008)

  20. Case study I Modeling an evolutionary conserved circadian cis-element Eric Paquet, Guillaume Rey and Felix Naef (PLoS CB, 2008) E1 converged ( p 1 =2 -11 , p 2 =2 -4 ) E2 converged

  21. Circadian rhythms and oscillators – period ~24 hrs (+/- 1 hrs) – found all over the tree of life – central and peripheral clocks – limit cycle oscillators Circadian locomotor assay from Konopka and Benzer 1971

  22. Circadian running-wheel assay for mouse Environment → ??? → SCN pacemaker → ??? → behavior → ??? → → ??? → phase locked free running

  23. Molecular rhythms Circadian profiles in fly heads (Claridge-Chang et al. 2001, Wijnen 2006) 172 Drosophila circadian genes φ = 0 0 24 24 phase [hr] no enrichment for canonical E-boxes among cyclers

  24. Recurrent themes in oscillator designs – inter-locked (delayed) feedback loops – post-translational modifications and protein-proteins interactions – coupling with metabolism Understand the green arrows Which genes are direct CLK targets ? from Schibler and Naef, ‘05

Recommend


More recommend