Stochastic processes and Hidden Markov Models
Dr Mauro Delorenzi and Dr Frédéric Schütz
Swiss Institute of Bioinformatics
EMBnet course – Basel 23.3.2006

Introduction
• A mainstream topic in bioinformatics is the problem of sequence annotation: given a DNA/RNA or protein sequence, we want to identify “interesting” elements.
• Examples:
  – DNA/RNA: genes, promoters, splicing signals, segmentation of heterogeneous DNA, binding sites, etc.
  – Proteins: coiled-coil domains, transmembrane domains, signal peptides, phosphorylation sites, etc.
  – Generally: homologs, etc.
• “The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster”
  – http://www.fruitfly.org/GASP1/tutorial/presentation/

Sequence annotation
• The sequences of many of these interesting elements can be characterized statistically, so we are interested in modeling them.
• By modeling, we mean finding statistical models that can:
  – Accurately describe the observed elements of provided sequences;
  – Accurately predict the presence of particular elements in new, unannotated sequences;
  – If possible, be readily interpretable and provide some insight into the actual biological process involved (i.e. not a black box).

Example: heterogeneity of DNA sequences
• The nucleotide composition of segments of genomic DNA changes between different regions in a single organism.
  – Example: coding regions in the human genome tend to be GC-rich.
• Modeling the differences between different homogeneous regions is interesting because:
  – These differences often have a biological meaning;
  – Many bioinformatics tools depend on the “background distribution” of nucleotides, often assumed to be constant.

Modeling tools (quick review)
• Among the different tools used for modeling sequences, we have (sorted by increasing complexity):
  – Consensus sequences
  – Regular expressions
  – Position-Specific Scoring Matrices (PSSM), or Weight Matrices
  – Markov Models, Hidden Markov Models and other stochastic processes
• These tools (in particular the stochastic processes) are also used for bioinformatics problems other than pure sequence analysis.

Consensus sequence
• Exact sequence that corresponds to a certain region.
• Example: transcription initiation in E. coli
  – Transcription is initiated at the promoter; the sequence of the promoter is recognised by the sigma factor of RNA polymerase.
  – For the sigma factor σ70, the consensus sequence of the promoter is given by
        -35          -10
      TTGACA   …   TATAAT
• Very rigid: it does not allow for any variation.
• This also works well for restriction enzyme sites or, in general, for sites for which strict conservation is important (in the case of restriction sites, cutting the DNA at a certain site is a question of “life and death” for the DNA).
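
To make the rigidity of a consensus sequence concrete, here is a minimal Python sketch of exact-match searching for the two σ70 hexamers. The helper name `find_exact` and the toy sequence are made up for illustration; the spacing between the -35 and -10 elements is not given on the slide, so it is not modeled here.

```python
def find_exact(dna, motif):
    """Return the 0-based start of every exact occurrence of motif in dna."""
    positions = []
    start = dna.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one base further on so overlapping hits are also reported.
        start = dna.find(motif, start + 1)
    return positions


# Toy sequence invented for this example (not real E. coli data).
toy = "GGTTGACATTTACGCGTATAATGC"

print(find_exact(toy, "TTGACA"))       # -35 element
print(find_exact(toy, "TATAAT"))       # -10 element
print(find_exact("TTGATA", "TTGACA"))  # a single mismatch gives no hit at all
```

The last call shows the limitation discussed on the slide: a site that differs from the consensus by one nucleotide is simply not reported.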

Example: binding site for TF p53
• The Transcription Factor Binding Site (TFBS) for p53 has been described as having the consensus sequence
      GGA CATG CCC * GGG CATG TCT
  where * represents a spacer of variable length.
• In this case, the sequence is not entirely conserved; this is believed to allow the cell some flexibility in the level of response to different signals (which was not possible or desirable for restriction sites).

Example: binding site for TF p53 (continued)
• This flexibility translates into the need for more complicated models to describe the site.
• Since the binding site is not entirely conserved, the consensus sequence represents only the nucleotides most frequently observed.
• The protein could potentially bind to many other similar, but different, sites along the genome.
• In theory, if the sites are not independent, the protein may not even bind to the actual consensus sequence!

Patterns/Regular Expressions
• Patterns attempt to explain observed motifs by identifying the most important combinations of positions and nucleotides/residues of a given site (to be compared with the consensus sequence, where the most important nucleotide/residue at each position was identified).
• They are often described using the regular expression syntax.
• Prosite database (developed at the SIB): http://www.expasy.org/prosite/

Example: Cys-Cys-His-His zinc finger DNA binding domain
• Its characteristic motif has the regular expression
      C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
  where ‘x’ means any amino acid, ‘(2,4)’ means between 2 and 4 occurrences, and ‘[…]’ indicates a list of possible amino acids.
• Example: 1ZNF: XYKCGLCERSFVEKSALSRHQRVHKNX
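
The zinc finger pattern above can be tried out directly with Python's standard `re` module. The sketch below is only illustrative: `prosite_to_regex` is a hypothetical helper that handles just the syntax elements used on this slide (`x`, `(n)`, `(n,m)`, `[...]` and `-`), not the full Prosite pattern language.

```python
import re


def prosite_to_regex(pattern):
    """Translate the subset of Prosite syntax used above into a Python regex.

    Assumes residues are upper-case letters, so a lower-case 'x' can only
    be the "any amino acid" wildcard.
    """
    regex = pattern.replace("-", "")                  # '-' only separates elements
    regex = regex.replace("x", ".")                   # x      -> any residue
    regex = regex.replace("(", "{").replace(")", "}") # (2,4)  -> {2,4}
    return regex


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
SEQ_1ZNF = "XYKCGLCERSFVEKSALSRHQRVHKNX"

regex = prosite_to_regex(ZINC_FINGER)
match = re.search(regex, SEQ_1ZNF)

print(regex)                         # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H
print(match.start(), match.group())  # 3 CGLCERSFVEKSALSRHQRVH
```

The match covers the two cysteines and the two histidines of the 1ZNF example, starting at the first cysteine (0-based position 3).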

Example: TFBS for p53
• The TFBS has been described with the pattern
      →← … →←
  where →← is the palindromic sequence
      (5’) Pu-Pu-Pu-C-[AT][TA]-G-Py-Py-Py (3’)
  and “…” is a spacer of 0 to 14 nucleotides.
• Note that this pattern (with the palindromic condition) cannot be expressed using a regular expression (at least not in a simple or general way).
J. Hoh et al., “The p53MH algorithm and its application in detecting p53-responsive genes”, PNAS, 99(13), June 2002, 8467-8472.

Example: TFBS for p53 (continued)
• The pattern approach clearly allows more flexibility than the consensus sequence; however, it is still too rigid, especially for sites that are not well conserved.
• When applying the pattern, each possible amino acid or nucleotide at a given position has the same weight, i.e. it is not possible to specify whether one is more likely to appear than another.
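
As a rough illustration of the p53 pattern, the sketch below encodes only the per-position nucleotide classes (Pu = A or G, Py = C or T) and the 0 to 14 nucleotide spacer as a Python regular expression. It deliberately does not check the palindromic condition, which, as noted above, a plain regular expression cannot express in a simple way. The names `HALF_SITE` and `FULL_SITE` are made up for this example.

```python
import re

# One half-site: Pu-Pu-Pu-C-[AT][TA]-G-Py-Py-Py, with Pu = [AG] and Py = [CT].
HALF_SITE = "[AG]{3}C[AT][AT]G[CT]{3}"

# Two half-sites separated by a spacer of 0 to 14 nucleotides.
# The palindromic relation between the halves is NOT enforced here.
FULL_SITE = re.compile(f"({HALF_SITE})([ACGT]{{0,14}})({HALF_SITE})")

# The consensus half-sites from the earlier slide, joined by an arbitrary
# 4-nucleotide spacer chosen only for this example.
candidate = "GGACATGCCC" + "ACGT" + "GGGCATGTCT"
hit = FULL_SITE.search(candidate)
print(hit.group(1), hit.group(2), hit.group(3))

# Every nucleotide allowed at a position is treated alike: a site is either
# accepted or rejected, with no notion of "more likely" or "less likely".
print(FULL_SITE.search("GGACATGCCA" + "GGGCATGTCT"))  # None: A is not a pyrimidine
```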

Position-Specific Scoring Matrices
• “Stochastic consensus sequence”
• Indicates the relative importance of a given nucleotide or amino acid at a certain position.
• Usually built from an alignment of sequences corresponding to the domain we are interested in, and either a collection of sequences known not to contain the domain or (most often) background probabilities for the different nucleotides or amino acids.

Building a PSSM

Counts from 242 known sites:
Pos.       1       2       3       4       5       6
A          9     214      63     142     118       8
C         22       7      26      31      52      13
G         18       2      29      38      29       5
T        193      19     124      31      43     216

Relative frequencies f_bl:
Pos.       1       2       3       4       5       6
A       0.04    0.88    0.26    0.59    0.49    0.03
C       0.09    0.03    0.11    0.13    0.22    0.05
G       0.07    0.01    0.12    0.16    0.12    0.02
T       0.80    0.08    0.51    0.13    0.18    0.89

PSSM: log(f_bl / p_b), where p_b = background probabilities:
Pos.       1       2       3       4       5       6
A      -2.76    1.82    0.06    1.23    0.96   -2.92
C      -1.46   -3.11   -1.22   -1.00   -0.22   -2.21
G      -1.76   -5.00   -1.06   -0.67   -1.06   -3.58
T       1.67   -1.66    1.04   -1.00   -0.49    1.84
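
The construction above (counts, then relative frequencies, then log-odds against the background) can be sketched in a few lines of Python. Two things are assumptions, since the slide does not state them: a uniform background of 0.25 per nucleotide and base-2 logarithms. With those choices most entries agree with the table to about 0.01, while a few (for example G at position 2) differ by more, which suggests but does not confirm that recipe.

```python
import math

# Counts copied from the "Building a PSSM" table (242 known sites, 6 positions).
COUNTS = {
    "A": [9, 214, 63, 142, 118, 8],
    "C": [22, 7, 26, 31, 52, 13],
    "G": [18, 2, 29, 38, 29, 5],
    "T": [193, 19, 124, 31, 43, 216],
}

# Assumption: uniform background probabilities (not stated on the slide).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pssm(counts, background):
    """Return (relative frequencies, log-odds scores) per base and position.

    Uses base-2 logs and no pseudocounts (both assumptions), so a zero
    count would raise ValueError; the table above contains none.
    """
    n_pos = len(next(iter(counts.values())))
    totals = [sum(row[i] for row in counts.values()) for i in range(n_pos)]
    freqs = {b: [c / totals[i] for i, c in enumerate(row)]
             for b, row in counts.items()}
    pssm = {b: [math.log2(f / background[b]) for f in row]
            for b, row in freqs.items()}
    return freqs, pssm


freqs, pssm = build_pssm(COUNTS, BACKGROUND)
for base in "ACGT":
    print(base, " ".join(f"{score:6.2f}" for score in pssm[base]))
```

For real data with zero counts, some form of pseudocount would be needed before taking the logarithm.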

Scoring a sequence using a PSSM
• Move the matrix along the sequence and score each “window”.
• Peaks should occur at the “true” sites.
• Of course, in general any threshold will have some false positive and false negative rate.

Sequence: C T A T A A T C

Pos.     1     2     3     4     5     6
A      -38    19     1    12    10   -48
C      -15   -38    -8   -10    -3   -32
G      -13   -48    -6    -7   -10   -48
T       17   -32     8    -9    -6    19

Window sums:
Window 1 (C T A T A A): -93
Window 2 (T A T A A T): +85
Window 3 (A T A A T C): -95

Sequence logo: graphical representation
[Figure: sequence logo of the Cys-Cys-His-His zinc finger DNA binding domain]
The total height of each stack represents the degree of conservation of each position; the heights of the letters in a stack are proportional to their frequencies.
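
The sliding-window scoring described on the “Scoring a sequence using a PSSM” slide can be sketched as follows, using the integer matrix and the sequence CTATAATC from that slide. The function name `score_windows` and the threshold value are hypothetical choices made for this example only.

```python
# Integer PSSM copied from the scoring slide (rows: bases, columns: positions 1-6).
PSSM = {
    "A": [-38, 19, 1, 12, 10, -48],
    "C": [-15, -38, -8, -10, -3, -32],
    "G": [-13, -48, -6, -7, -10, -48],
    "T": [17, -32, 8, -9, -6, 19],
}


def score_windows(seq, pssm):
    """Slide the matrix along seq and return one summed score per window."""
    width = len(next(iter(pssm.values())))
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        # Add up the entry for the observed base at each matrix position.
        scores.append(sum(pssm[base][i] for i, base in enumerate(window)))
    return scores


scores = score_windows("CTATAATC", PSSM)
print(scores)  # [-93, 85, -95]

# Hypothetical threshold, chosen only to separate the peak in this toy case.
THRESHOLD = 50
hits = [start for start, s in enumerate(scores) if s >= THRESHOLD]
print(hits)    # [1]
```

Only the middle window (TATAAT) scores above the threshold; on real data any fixed threshold trades false positives against false negatives, as the slide points out.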

PSSM for p53 binding site

Counts from 37 known sites:
Pos.     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
A       14    11    26     0    28   2.5     0   0.5     0     3     6     2  11.5     0    27     4     0   0.5     1     2
C        3     1     1    36     1   0.5     0  24.5    33    23     2     0   0.5    36     2     0     0   9.5    24    15
G       16    24    10     0     0     0    37     0     0     0  23.5    34    25     0     2     1    37     0     0     3
T        4     1     0     1     7    34     0    12     4    10   5.5     1     0     1     5    32     0    27    12    16

J. Hoh et al., “The p53MH algorithm and its application in detecting p53-responsive genes”, 2002.

What is missing?
• PSSMs help to deal with the stochastic distribution of symbols at a given position.
• However, they lack the ability to deal with the length distribution of the motif they describe.
• Stochastic processes provide a more general framework to deal with these questions.
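
Returning to the p53 count matrix above: taking the most frequent nucleotide at each of the 20 positions gives back the two half-sites of the consensus sequence quoted earlier (GGA CATG CCC and GGG CATG TCT, without the spacer). The short sketch below illustrates this; the function name `consensus_from_counts` is made up for the example.

```python
# Counts copied from the "PSSM for p53 binding site" table (20 positions).
P53_COUNTS = {
    "A": [14, 11, 26, 0, 28, 2.5, 0, 0.5, 0, 3,
          6, 2, 11.5, 0, 27, 4, 0, 0.5, 1, 2],
    "C": [3, 1, 1, 36, 1, 0.5, 0, 24.5, 33, 23,
          2, 0, 0.5, 36, 2, 0, 0, 9.5, 24, 15],
    "G": [16, 24, 10, 0, 0, 0, 37, 0, 0, 0,
          23.5, 34, 25, 0, 2, 1, 37, 0, 0, 3],
    "T": [4, 1, 0, 1, 7, 34, 0, 12, 4, 10,
          5.5, 1, 0, 1, 5, 32, 0, 27, 12, 16],
}


def consensus_from_counts(counts):
    """Pick the most frequent base at each position of a count matrix."""
    n_pos = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda base: counts[base][i]) for i in range(n_pos)
    )


consensus = consensus_from_counts(P53_COUNTS)
print(consensus[:10], consensus[10:])  # GGACATGCCC GGGCATGTCT
```

The consensus keeps only the winner at each position. At position 1, for instance, G (16 counts) barely beats A (14 counts), and that near-tie is exactly the information a PSSM retains and a consensus sequence discards.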