Characterization of transcription factor binding sites by high-throughput SELEX



  1. Characterization of transcription factor binding sites by high-throughput SELEX: overview of the HTPSELEX Database

     Transcription Factor Binding Sites: Features and Facts
     - Degenerate sequence motifs
     - Typical length: 6-20 bp
     - Low information content: 8-12 bits (1 site per 250-4000 bp)
     - Quantitative recognition mechanism: the measurable affinity of different sites may vary over three orders of magnitude
     - Regulatory function often depends on cooperative interactions with neighboring sites

  2. Representation of the binding specificity by a scoring matrix (also referred to as a weight matrix)

     | Base |   1 |   2 |   3 |   4 |   5 |   6 |   7 |   8 |  9 |
     |------|-----|-----|-----|-----|-----|-----|-----|-----|----|
     | A    | -10 | -10 | -14 | -12 | -10 |   5 |  -2 | -10 | -6 |
     | C    |   5 | -10 | -13 | -13 |  -7 | -15 | -13 |   3 | -4 |
     | G    |  -3 | -14 | -13 | -11 |   5 | -12 | -13 |   2 | -7 |
     | T    |  -5 |   5 |   5 |   5 | -10 |  -9 |   5 | -11 |  5 |

     Strong binding site C T T T G A T C T: 5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43

     Random sequence A C G T A C G T A: -10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83
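The two example scores on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the matrix values are copied from the slide, while the names `WEIGHTS` and `matrix_score` are hypothetical and chosen only for illustration.

```python
# Scoring matrix copied from the slide: one row per base, one column per position.
WEIGHTS = {
    "A": [-10, -10, -14, -12, -10,   5,  -2, -10, -6],
    "C": [  5, -10, -13, -13,  -7, -15, -13,   3, -4],
    "G": [ -3, -14, -13, -11,   5, -12, -13,   2, -7],
    "T": [ -5,   5,   5,   5, -10,  -9,   5, -11,  5],
}
WIDTH = 9


def matrix_score(site: str) -> int:
    """Sum the weight of the observed base at each position (hypothetical helper)."""
    if len(site) != WIDTH:
        raise ValueError(f"site must be {WIDTH} bases long")
    return sum(WEIGHTS[base][i] for i, base in enumerate(site))


print(matrix_score("CTTTGATCT"))  # strong binding site: 43
print(matrix_score("ACGTACGTA"))  # random sequence: -83
```

Because the score is a plain sum over positions, each column contributes independently of the others, which is the additivity assumption the following slides rely on.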

  3. Physical interpretation of a weight matrix

     - Weight matrix elements represent relative binding energies between DNA base pairs and protein surface areas (base-pair acceptor sites).
     - A weight matrix column describes the base preferences of a base-pair acceptor site.

     Berg-von Hippel model of protein-DNA interactions

     The weight matrix score expresses the binding free energy of the protein-DNA complex in arbitrary units:

     $$-\Delta G(x) = S(x) + \text{const.}, \qquad S(x) = \sum_{i=1}^{N} w_i(x_i)$$

     It is convenient to express the binding free energy in dimension-free $RT$ units:

     $$E(x) = \sum_{i=1}^{N} \varepsilon_i(x_i), \qquad \varepsilon_i(b) = -\frac{w_i(b)}{RT}$$

     On a relative scale, the binding constant for sequence $x$ is given by:

     $$K_{rel}(x) = e^{E(x)}$$

     For sequences longer than the weight matrix:

     $$K_{rel}(x) = \frac{1}{\sum_i e^{-E(x_i \ldots x_{i+N-1})}} \quad \text{or} \quad K_{rel}(x) = \frac{1}{\max_i e^{-E(x_i \ldots x_{i+N-1})}}$$

     (the index $i$ runs over all subsequence starting positions on both strands)
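The relative binding constant of a sequence longer than the matrix can be sketched as follows. The code assumes the reading $K_{rel}(x) = 1 / \sum_i e^{-E(\cdot)}$ (or $1 / \max_i e^{-E(\cdot)}$), under which a smaller value means tighter binding. The 3-column matrix is a made-up toy example, weights are taken to be already in $RT$ units (so $\varepsilon_i(b) = -w_i(b)$), and all function names are hypothetical.

```python
import math

# Hypothetical 3-column weight matrix, assumed to be expressed in RT units.
WEIGHTS = {
    "A": [ 2.0, -1.0, -1.0],
    "C": [-1.0,  2.0, -1.0],
    "G": [-1.0, -1.0,  2.0],
    "T": [-1.0, -1.0, -1.0],
}
N = 3
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def energy(window: str) -> float:
    """E(x) = sum_i epsilon_i(x_i), with epsilon_i(b) = -w_i(b) (RT = 1 assumed)."""
    return sum(-WEIGHTS[base][i] for i, base in enumerate(window))


def boltzmann_terms(seq: str) -> list:
    """e^(-E) for every length-N window on both strands."""
    if len(seq) < N:
        raise ValueError("sequence is shorter than the weight matrix")
    reverse_strand = seq.translate(COMPLEMENT)[::-1]
    return [
        math.exp(-energy(strand[i:i + N]))
        for strand in (seq, reverse_strand)
        for i in range(len(strand) - N + 1)
    ]


def k_rel_sum(seq: str) -> float:
    """All windows contribute to binding."""
    return 1.0 / sum(boltzmann_terms(seq))


def k_rel_max(seq: str) -> float:
    """Only the best window counts."""
    return 1.0 / max(boltzmann_terms(seq))


print(k_rel_sum("TTACGTT"), k_rel_max("TTACGTT"))
```

The two variants differ only in whether weaker windows are allowed to contribute; when one window dominates, they give nearly the same value.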

  4. Berg-von Hippel Theory: Information Content

     The energy terms of a weight matrix can be computed from the base frequencies $p_i(b)$ found in in vitro or in vivo selected binding sites:

     $$\varepsilon_i(b) = -\frac{1}{\lambda} \ln \frac{p_i(b)}{q(b)}$$

     Here $q(b)$ is the background frequency of base $b$, and $\lambda$ is an unknown parameter related to the stringency of the binding conditions.

     The information content of a binding site has been defined as the relative entropy of the base frequency matrix with respect to the background base frequencies:

     $$IC = \sum_{i=1}^{N} \sum_{b=A}^{T} p_i(b) \log_2 \frac{p_i(b)}{q(b)}$$

     Paradox: $\lambda$ depends on the selection conditions (e.g. the protein concentration); therefore the base frequencies observed in selected binding sites do not reflect a protein-intrinsic property.

     Weight matrices/profiles from a biochemical viewpoint
     - A weight matrix expresses the sequence specificity of a DNA-binding protein.
     - A column describes the base preferences of a surface area of the DNA-binding protein.
     - Weights of a weight matrix can be interpreted as additive binding energy contributions: no interactions between binding site positions!
     - According to the Berg-von Hippel theory, negated binding energies are proportional to the logarithms of the base frequencies observed in an in vivo or in vitro selected set of binding sites.
     - Weight matrices can thus be used to compute relative binding energies or dissociation constants for oligonucleotides of any sequence, which in turn can be experimentally determined by gel shift experiments.
     - An accurate weight matrix for the binding specificity of a transcription factor is one that accurately predicts binding constants.
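Both formulas on this slide translate directly into code. The sketch below is illustrative only: the frequency values are invented, `lam` is a placeholder because the slide states that $\lambda$ is unknown, and the function names are hypothetical. Frequencies passed to `energies` must be strictly positive, so in practice pseudocounts would be added first.

```python
import math

BASES = "ACGT"


def energies(freqs, background, lam=1.0):
    """epsilon_i(b) = -(1/lambda) * ln(p_i(b) / q(b)); lam is a placeholder value."""
    return [
        {b: -math.log(column[b] / background[b]) / lam for b in BASES}
        for column in freqs
    ]


def information_content(freqs, background):
    """IC = sum_i sum_b p_i(b) * log2(p_i(b) / q(b)), taking 0 * log(0) as 0."""
    return sum(
        column[b] * math.log2(column[b] / background[b])
        for column in freqs
        for b in BASES
        if column[b] > 0
    )


# Invented example: uniform background and a two-column frequency matrix.
background = {b: 0.25 for b in BASES}
freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(energies(freqs, background)[0])
print(information_content(freqs, background))
```

Changing `lam` rescales every energy by the same factor but leaves the information content untouched, which is one way to see the paradox noted above: the frequencies alone cannot fix the energy scale.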

  5. Experimental techniques for estimating the parameters of a TF specificity matrix
     - Competitive bandshifts (EMSA) → relative binding constants of oligonucleotides
     - Alignment of in vivo sites → base frequency matrix (from 10-100 sequences)
     - In vitro selection (SELEX) → base frequency matrix (up to 200 sequences)
     - SAGE/SELEX → base frequency matrix (up to 10,000 binding sequences)
     - Exhaustive mutagenesis + K_rel assay → intrinsic specificity matrix
     - Protein binding arrays + magic algorithm → intrinsic specificity matrix

     Some problems and limitations:
     - A base probability matrix is generated by an alignment or probabilistic modeling algorithm → no direct observation
     - K_rel is usually not very precise (within a factor of 2)
     - Point mutations may create a binding site in another frame

     Modeling of a Transcription Factor Binding Site from High Throughput SELEX Data Using a Hidden Markov Modeling Approach

     Emmanuelle Roulet, Nicolas Mermod (Center for Biotechnology UNIL-EPFL, Lausanne, Switzerland)
     Anamaria A Camargo, Andrew JG Simpson (Ludwig Institute of Cancer Research, Sao Paulo, Brazil)
     Philipp Bucher (Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Epalinges s/Lausanne, Switzerland)

     Nat. Biotechnol. 20, 831-835 (2002)

  6. Motivation and Goals of the Project

     Motivation: Accurate and reliable computational tools to predict transcription factor binding sites are still not available.

     Potential reasons:
     1. Lack of adequate experimental data
     2. Lack of adequate computational models
     3. Lack of an adequate method to estimate the parameters of a computational model from the experimental data

     Goal: To develop a combined computational-experimental protocol to derive an accurate predictive model of the sequence specificity of a DNA-binding protein

     Potential benefits:
     1. Being able to predict transcription factor binding in genome sequences
     2. Insights into molecular mechanisms of sequence-specific protein-DNA interactions
     3. Ability to rationally design gene control regions of desired properties for biotechnological applications

  7. Our Approach to the Problem of Characterizing the Sequence-Specificity of a DNA-Binding Transcription Factor

     1. Choice of a quantitative predictive model for representing the binding specificity. Our choice: a profile-HMM
     2. Choice of an experimental method to generate data for estimating the model parameters. Our choice: a SELEX experiment
     3. Choice of a machine learning algorithm to estimate the model parameters from the data. Our choice: the Baum-Welch HMM training algorithm
     4. Validation of the approach and optimization of the experimental parameters by a computer simulation of steps 2 and 3
     5. Adjustment of the experimental protocol to produce the necessary data as suggested by the computer simulation
     6. Generation of the experimental data
     7. Building a binding site model from the data
     8. A posteriori validation of the model by cross-validation and comparison with independent experimental results

     Study Object: Transcription Factor CTF/NFI
     - Dimeric DNA-binding protein recognizing a palindromic sequence motif with consensus sequence TTGGC(N5)GCCAA
     - First isolated as a replication factor of Adenovirus type 2
     - Later independently isolated as a CCAAT-box binding transcription factor
     - Can activate transcription of a reporter gene in transfected cells
     - Recently shown to be implicated in regulatory pathways related to tumor progression and immune response
     - Biochemical mechanism of gene regulation still elusive

  8. Old CTF/NFI Binding Site Profile

     Example: TGGGCATATAGCCAC

     Score: 10 - 1 + 10 + 10 + 10 + 0 + 10 + 10 + 10 + 10 + 9 = 88

  9. Random sequence library

     5’-TCCATCTCTTCTGTATGTCG AGATCTA.N(25).TAGATCT CCTAACCGACTCCGTTAATT-3’

     Second strand synthesis by PCR (Primer 1 and Primer 2; Bgl II sites on both sides of the random region):

     5’-TCCATCTCTTCTGTATGTCG AGATCTA.N(25).TAGATCT CCTAACCGACTCCGTTAATT-3’
     3’-AGGTAGAGAAGACATACAGA TCTAGAT.N(25).ATCTAG AGGATTGGCTGAGGCAATTAA-5’

     Selection cycles: selection of binding sequences (gel shift), followed by amplification

     Digestion with Bgl II:

     5’-GATCTA..N(25)..TA-3’
          AT..N(25)..TA CTAG

     Concatemerization and cloning (site 1, site 2, site 3):

     5’-GATCTA…N(25)…TAGATCTA…N(25)…TAGATCTA…N(25)…TA-3’
          AT…N(25)…ATCTAGAT…N(25)…ATCTAGAT…N(25)…ATCTAG

     HTS sequencing

     Principle of the Baum-Welch hidden Markov model training algorithm

     Initial model + training sequences → trained model

     Training sequences:

     AACAGCGTGCCAACTAGTGATCACA
     CCACAACFFACGCCCAAATAACCAA
     GTTAGTGGACCGCTTCCAGCAATCT
     ATCACGGCACCCCATTTTTCTGTCT
     TGGTAAATTAATAATAAAACAGTGG
     GCGCGTGATTTGGCATCGTCCCATA
     AAGTTGGCTTTTCACCAATAGCGAG
     ...

     How does it work?
     1. The initial model serves as the current model.
     2. Training sequences are aligned to the current model.
     3. New base and transition frequencies are estimated from the multiple alignment generated by step 2. The new model becomes the current model.
     4. Steps 2 and 3 are repeated until convergence is reached.
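The four training steps can be illustrated with a deliberately simplified sketch. It is not the Baum-Welch algorithm itself: it uses a fixed-width, gap-free frequency matrix with no transition probabilities, scans one strand only, and aligns each sequence to its single best-scoring window ("hard" assignment), whereas Baum-Welch weights all possible alignments by their posterior probability. The training sequences, the seed consensus, and all function names are hypothetical.

```python
import math

BASES = "ACGT"


def seed_model(consensus, match=0.7):
    """Step 1: initial model; the consensus base gets `match`, the rest share 1 - match."""
    other = (1.0 - match) / 3
    return [{b: (match if b == c else other) for b in BASES} for c in consensus]


def best_window(seq, model):
    """Step 2: offset of the window with the highest log-likelihood under the model."""
    width = len(model)

    def loglik(start):
        return sum(math.log(model[i][seq[start + i]]) for i in range(width))

    return max(range(len(seq) - width + 1), key=loglik)


def reestimate(seqs, offsets, width, pseudo=1.0):
    """Step 3: new base frequencies from the aligned windows, with pseudocounts."""
    model = []
    for i in range(width):
        counts = {b: pseudo for b in BASES}
        for seq, offset in zip(seqs, offsets):
            counts[seq[offset + i]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in BASES})
    return model


def train(seqs, model, max_iter=50):
    """Step 4: repeat alignment and re-estimation until the alignment is stable."""
    offsets = None
    for _ in range(max_iter):
        new_offsets = [best_window(s, model) for s in seqs]
        if new_offsets == offsets:  # convergence: the alignment no longer changes
            break
        offsets = new_offsets
        model = reestimate(seqs, offsets, len(model))
    return model, offsets


# Invented toy data: each sequence contains the half-site TTGGC once.
seqs = ["AATTGGCAA", "TTGGCTTTT", "CCCTTGGCC", "GATTGGCGA"]
model, offsets = train(seqs, seed_model("TTGGC"))
print(offsets)                                            # [2, 0, 3, 2]
print("".join(max(BASES, key=col.get) for col in model))  # TTGGC
```

A full profile-HMM additionally learns transition probabilities, which is what lets it model features such as a variable spacer; this sketch leaves that part out.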

  10. Doing the Experiment

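  11. Results: CTF/NFI

     Clone statistics

     | Cycle | Seq. reads | Clones | Colonies | Clones with detectable inserts |
     |-------|------------|--------|----------|--------------------------------|
     | 0     | 468        | 425    | 427      | 295                            |
     | 1     | 623        | 364    | 553      | 111                            |
     | 2     | 545        | 392    | 447      | 208                            |
     | 3     | 2234       | 1445   | 1619     | 1187                           |
     | 4     | 378        | 215    | 318      | 102                            |

     Site statistics

     | Cycle | Sites | Different sites | Diff. sites, err < 0.01/bp | Diff. sites, err < 0.001/bp |
     |-------|-------|-----------------|----------------------------|-----------------------------|
     | 0     | 2262  | 2262            | 1482                       | 825                         |
     | 1     | 1678  | 1678            | 1227                       | 954                         |
     | 2     | 1572  | 1572            | 731                        | 203                         |
     | 3     | 8813  | 8813            | 7385                       | 5585                        |
     | 4     | 1156  | 1156            | 552                        | 309                         |
     | SUM   | 15481 | 15481           | 11377                      | 7876                        |

     New CTF/NFI model
     - Hidden Markov Model (frequencies given in %): [figure not preserved]
     - Scoring profile (relative energy units): [figure not preserved]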
