Five hierarchical levels of sequence-structure correlations in proteins Chris Bystroff Rensselaer Polytechnic Institute Troy, New York, USA
What does structure prediction tell us about the physics of folding? Check one: A. If we can predict protein structures, then we know how proteins fold. B. If we know how proteins fold, then we can predict protein structures.
Two ways to predict protein structure... Database search (statistics): query sequence → best alignment. Folding simulation (physics): query sequence → lowest energy.
...two very different underlying principles. Darwin (query sequence → best alignment): proteins with a common ancestor have the same fold. Timescale: millions of years. Boltzmann (query sequence → lowest energy): proteins adopt the minimum free energy conformation. Timescale: microseconds to seconds.
Darwin versus Boltzmann. Do hybrid models make sense? [Figure: methods placed on a spectrum from Darwin to Boltzmann: BLAST (sequence similarity), threading (global structure similarity), Rosetta (knowledge-based physics), AMBER (physics)]
We know proteins fold via pathways: local structure first, eliminating alternate pathways, then global structure.
Proteins can fold because they don't have to search all of conformational space.
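A back-of-the-envelope version of this argument, usually attributed to Levinthal, is sketched below. The numbers (100 residues, 3 conformations per residue, 10^13 conformations sampled per second) are illustrative assumptions, not values from the slides.

```latex
N_{\mathrm{conf}} = 3^{100} \approx 5 \times 10^{47}, \qquad
t_{\mathrm{search}} = \frac{5 \times 10^{47}}{10^{13}\ \mathrm{s^{-1}}}
                    = 5 \times 10^{34}\ \mathrm{s} \approx 10^{27}\ \text{years}
```

Under these assumptions an exhaustive search would take vastly longer than the microseconds to seconds in which proteins fold, so folding has to follow pathways that prune most of conformational space.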
We know that proteins have a hierarchy of structural similarity... Class conserves 2° content. Architecture conserves packing of 2° structure. Topology* conserves chain connectivity. *Fold recognition algorithms work at this level. Image borrowed from CATH database.
Can we use the database to make models for folding pathways? Steps along the folding pathway (early to late), each with the corresponding step in data mining:
(1) initiation: local motifs
(2) propagation: extended local motifs
(3) condensation: pairs of motifs
(4) molten globule: multiple motifs
(5) native state: aligned multiple motifs
Hierarchical level 1: Folding initiation site motifs. [Figure: non-homologous sequences that share the recurrent sequence pattern MQTIFF] Is it a recurrent structure?
Sampling bias creates problems for motif mining. First we must "factor out" inheritance.
Removing database redundancy. (1) Cluster sequences into phylogenetic trees. One family, one count. (2) Apply a tree weight w_k to each sequence. (3) Convert each position to a probability distribution, the "sequence profile":
$$P_{ij} = \frac{\sum_{k \in \mathrm{seqs}} w_k \, \delta(s_{kj} = aa_i)}{\sum_{k \in \mathrm{seqs}} w_k}$$
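A minimal sketch of step (3), assuming the tree weights have already been computed: each column of an alignment becomes a weighted frequency distribution over the 20 amino acids. The function name, the toy sequences, and the weights are hypothetical. Skipping gap characters in the denominator is also an assumption; the slide's formula sums over all sequences.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def weighted_profile(aligned_seqs, weights):
    """Return P[j][aa] = sum_k w_k * [s_kj == aa] / sum_k w_k for each column j."""
    if len(aligned_seqs) != len(weights):
        raise ValueError("one weight per sequence is required")
    length = len(aligned_seqs[0])
    profile = []
    for j in range(length):
        counts = {aa: 0.0 for aa in AMINO_ACIDS}
        total = 0.0
        for seq, w in zip(aligned_seqs, weights):
            aa = seq[j]
            if aa in counts:          # assumption: gaps and unknown residues are skipped
                counts[aa] += w
                total += w
        profile.append(
            {aa: (c / total if total > 0 else 0.0) for aa, c in counts.items()}
        )
    return profile

# Toy family: the two identical sequences share the weight of a single lineage.
seqs = ["MQTIFF", "MQTIFF", "MKTLFF"]
weights = [0.5, 0.5, 1.0]
profile = weighted_profile(seqs, weights)
print(profile[1]["Q"], profile[1]["K"])   # 0.5 0.5
```

Without the weights, the duplicated sequence would push Q to 2/3 at the second position; with them, each lineage counts once.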
Clustering sequence profiles to find recurrent patterns. Each dot represents a short profile. Similarity metric D(p, q): product of the log-likelihood ratios (LLR) of the two profiles p and q, summed over amino acids i = 1,20 and positions l = 1,L. [Figure: sequence profile for positions 26-32; rows are amino acids in the order G P D E K R H S T N Q A M Y W V I L F C]
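One way to read "product of log-likelihood ratios" is sketched below: for each position and amino acid, take the LLR of each profile against a background frequency, multiply the two, and sum. The pseudocount blend inside `llr`, the background frequencies, and the two-letter toy alphabet are assumptions made for illustration; the exact normalization on the slide is not recoverable.

```python
import math

def llr(p, background, pseudo=0.5):
    """Log-likelihood ratio of a profile frequency against a background frequency.
    The pseudocount blend is an assumption that keeps the log finite when p == 0."""
    return math.log((p + pseudo * background) / ((1.0 + pseudo) * background))

def profile_similarity(p, q, background):
    """Sum over positions and amino acids of LLR(p) * LLR(q)."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    score = 0.0
    for col_p, col_q in zip(p, q):
        for aa, bg in background.items():
            score += llr(col_p.get(aa, 0.0), bg) * llr(col_q.get(aa, 0.0), bg)
    return score

# Two-letter toy alphabet, one position, to keep the example readable.
background = {"A": 0.5, "G": 0.5}
p = [{"A": 0.9, "G": 0.1}]
q = [{"A": 0.8, "G": 0.2}]   # same preference as p
r = [{"A": 0.1, "G": 0.9}]   # opposite preference
sim_pq = profile_similarity(p, q, background)
sim_pr = profile_similarity(p, r, background)
print(sim_pq > 0 > sim_pr)   # True
```

Profiles that deviate from background in the same direction score positive, and profiles that deviate in opposite directions score negative, which is the property a clustering step needs.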
The I-sites Library. [Figure: example motifs: type-I hairpin, diverging type-2 turn, serine hairpin, frayed helix, alpha-alpha corner, glycine helix N-cap, proline helix C-cap. Backbone angles: ψ = green, φ = red. Amino acids are arranged from non-polar to polar.]
Are I-sites really folding initiation sites? Prediction experiments (Bystroff & Baker, Proteins, 1997) NMR data on peptides (Yi et al, J.Mol.Biol., 1998) Molecular dynamics simulations (Bystroff & Garde, Proteins, 2002)
Level 2. Motif grammar. Arrangement of I-sites motifs in proteins is highly non-random. [Figure labels: helix, helix cap, beta strand, beta turn] Adjacencies can be modeled as a Markov chain.
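As a rough illustration of modeling adjacencies as a Markov chain, the sketch below estimates first-order transition probabilities from motif labels read along each chain. The label names and the three example chains are hypothetical, not data from the I-sites library.

```python
from collections import Counter, defaultdict

def transition_probabilities(label_sequences):
    """Estimate first-order Markov transition probabilities from label sequences."""
    counts = defaultdict(Counter)
    for labels in label_sequences:
        for prev, nxt in zip(labels, labels[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for prev, c in counts.items()
    }

# Hypothetical motif annotations along three chains.
chains = [
    ["helix", "helix_cap", "beta_strand", "beta_turn", "beta_strand"],
    ["helix", "helix_cap", "beta_strand"],
    ["beta_strand", "beta_turn", "beta_strand", "helix"],
]
probs = transition_probabilities(chains)
print(probs["helix"])         # {'helix_cap': 1.0}
print(probs["beta_strand"])   # beta_turn 2/3, helix 1/3
```

A highly non-random arrangement shows up as a sparse transition table: most motif pairs never occur as neighbors.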
Aligned motifs become a Markov chain. [Figure: aligned profiles and aligned structures of an α helix followed by a Type-1 or Type-2 G α C-cap; state topology: the α helix states branch into the Type-1 and Type-2 G α C-cap states]
A Markov state from HMMSTR. Each state i has transitions a_hi from a previous state and a_ij, a_ik to next states. It emits amino acid symbols b_i = {ACDEF...} and structure symbols r_i = {HGEBdblLex}, d_i = {HST}, c_i = {mnhd...}.
Discretized structure states: backbone angle regions ( r i )
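How an HMM works. We have S (the sequence). We want Q (the state sequence). P(Q|S) is the probability of Q given S, which up to the normalizing constant P(S) is
$$P(Q \mid S) \propto \pi_{q_1}(s_1) \prod_{t=2,N} a_{q_{t-1} q_t} \, B_{q_t}(s_t)$$
where π are the starting states, a are the arrows (transitions), and B combines the amino acid profile b with one of the structure emissions d, r, or c:
$$B_i(s_t) = b_i(O_t) \cdot \begin{pmatrix} d_i(D_t) \\ r_i(R_t) \\ c_i(C_t) \end{pmatrix}$$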
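The path score on this slide can be sketched with a toy two-state model. The code below scores one state path and finds the best path with the standard Viterbi recursion, in log space. It assumes π_{q1}(s_1) means start probability times emission, and it uses only an amino acid emission table; the states, symbols, and probabilities are made up and far smaller than HMMSTR.

```python
import math

def log_joint(path, seq, start, trans, emit):
    """log of pi_{q1} B_{q1}(s1) * prod_{t>=2} a_{q(t-1) q(t)} B_{q(t)}(s_t)."""
    lp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for t in range(1, len(seq)):
        lp += math.log(trans[path[t - 1]][path[t]])
        lp += math.log(emit[path[t]][seq[t]])
    return lp

def viterbi(seq, start, trans, emit):
    """Most probable state path Q for the observed sequence S (log space)."""
    states = list(start)
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    back = []
    for symbol in seq[1:]:
        prev_score, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev_score[r] + math.log(trans[r][s]))
            pointers[s] = best
            score[s] = (prev_score[best] + math.log(trans[best][s])
                        + math.log(emit[s][symbol]))
        back.append(pointers)
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):      # follow back-pointers from the end
        path.append(pointers[path[-1]])
    return path[::-1], score[last]

# Hypothetical model: H favors alanine (A), E favors valine (V).
start = {"H": 0.5, "E": 0.5}
trans = {"H": {"H": 0.9, "E": 0.1}, "E": {"H": 0.1, "E": 0.9}}
emit = {"H": {"A": 0.8, "V": 0.2}, "E": {"A": 0.2, "V": 0.8}}
best_path, best_score = viterbi("AAVV", start, trans, emit)
print(best_path)   # ['H', 'H', 'E', 'E']
```

Working in log space avoids underflow on long sequences; the product in the formula becomes a sum of logs.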
HMMSTR: Hidden Markov Model for local protein STRucture. 282 nodes, 317 transitions. Unified model for 31 distinct sequence-structure motifs (Bystroff & Baker, J. Mol. Biol., 2000)
Level 1: I-sites (initiation). Level 2: HMMSTR (propagation).
Level 3: Pairwise Motif-Motif Contact Potentials. G(p, q, s) represents the free energy of a motif-motif contact:
$$G(p,q,s) = -\log \frac{\sum_{\mathrm{PDBselect}} \; \sum_{i \,\ni\, D_{i,i+s} < 8\,\text{Å}} \Gamma(i,p)\,\Gamma(i+s,q)}{\sum_{\mathrm{PDBselect}} \; \sum_{i} \Gamma(i,p)\,\Gamma(i+s,q)}$$
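A minimal sketch of this potential, assuming Γ(i, p) is the probability of HMM state p at position i and that distances are available per residue pair. The data layout (a dict with `gamma` and `dist`) and the toy protein are hypothetical; the 8 Å cutoff is the one on the slide.

```python
import math

def contact_potential(proteins, p, q, s, cutoff=8.0):
    """G(p, q, s) = -log(weighted pairs in contact / all weighted pairs).

    Each protein is a dict with
      'gamma': list over positions of {state: probability} (assumed state posteriors)
      'dist':  function (i, j) -> distance in angstroms
    """
    in_contact = 0.0
    all_pairs = 0.0
    for prot in proteins:
        gamma, dist = prot["gamma"], prot["dist"]
        for i in range(len(gamma) - s):
            w = gamma[i].get(p, 0.0) * gamma[i + s].get(q, 0.0)
            all_pairs += w
            if dist(i, i + s) < cutoff:
                in_contact += w
    if in_contact == 0.0 or all_pairs == 0.0:
        # Assumption: a pair never seen in contact is treated as maximally unfavorable.
        return float("inf")
    return -math.log(in_contact / all_pairs)

# Toy protein: states alternate a, b, a, b; only the first a-b pair is in contact.
table = {(0, 1): 5.0, (1, 2): 6.0, (2, 3): 10.0}
toy = {
    "gamma": [{"a": 1.0}, {"b": 1.0}, {"a": 1.0}, {"b": 1.0}],
    "dist": lambda i, j: table[(i, j)],
}
g = contact_potential([toy], "a", "b", 1)
print(round(g, 3))   # 0.693
```

Half of the a-b pairs at separation 1 are in contact, so the energy is -log(1/2). A real data set would need pseudocounts instead of the infinite-energy fallback.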
What is a contact map? Definition:
$$S(i,j) = \begin{cases} 1 & \text{if } d(i,j) \le D \\ 0 & \text{if } d(i,j) > D \end{cases}$$
Both axes: sequence. Red: favorable contact. Blue: unfavorable. Plotted quantity: E(i,j).
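The definition translates directly into code. The sketch below builds a binary contact map from a list of 3D points; the default cutoff D = 8 Å is borrowed from the contact potential slide and the coordinates are hypothetical.

```python
import math

def contact_map(coords, cutoff=8.0):
    """S[i][j] = 1 if d(i, j) <= cutoff else 0, for a list of (x, y, z) points."""
    n = len(coords)
    return [
        [1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
        for i in range(n)
    ]

# Four points on a line, 5 angstroms apart (hypothetical C-alpha positions).
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cmap = contact_map(coords)
for row in cmap:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [0, 1, 1, 1]
# [0, 0, 1, 1]
```

The map is symmetric with ones on the diagonal, so only the upper triangle carries information.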
Features in a contact map can be interpreted as a TOPS diagram (helices, strands). Which one is right?
A rule-based simulation procedure. [Figure: CASP5 target T0130: contact energies, ab initio prediction, and true contact map; legend: amphipathic, non-polar]
Level 4: Multibody arrangements of local motifs It is difficult to see similarities between these two proteins, but...
Different folds can have the same arrangement of secondary structure elements. [Figure: 1alk and 1vpt, with secondary structure elements numbered in chain order]
SCALI : Structural Core ALIgnment
How SCALI works: (1) Gapless alignment of HMMSTR states. (2) Initialize the tree search with one gapless fragment. (3) Add a new fragment if and only if it is compatible and has a high score. (4) A branch ends in a leaf when no more fragments can be added. Score of a leaf = aligned contacts + permutation penalty.
HMMs may be built based on non-sequential alignments. Markov states represent amino acid sequences and positions in space. Connections between them represent loops.
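The tree search in steps (2) to (4) can be sketched as a depth-first enumeration of compatible fragment sets. This is not the SCALI implementation. Several pieces are stand-ins chosen for illustration: fragments are hypothetical tuples `(start_a, start_b, length, score)`, "compatible" means no residue overlap in either protein, the per-fragment score stands in for aligned contacts, and the permutation penalty is a fixed cost per order inversion.

```python
def overlaps(f, g):
    """True if two fragments share residues in protein A or in protein B."""
    (a1, b1, n1, _), (a2, b2, n2, _) = f, g
    overlap_a = not (a1 + n1 <= a2 or a2 + n2 <= a1)
    overlap_b = not (b1 + n1 <= b2 or b2 + n2 <= b1)
    return overlap_a or overlap_b

def inversions(frags):
    """Number of fragment pairs whose order differs between the two proteins."""
    ordered = sorted(frags)                      # sorted by position in protein A
    return sum(
        1
        for i in range(len(ordered))
        for j in range(i + 1, len(ordered))
        if ordered[i][1] > ordered[j][1]
    )

def leaf_score(frags, penalty):
    """Stand-in leaf score: summed fragment scores minus a permutation penalty."""
    return sum(f[3] for f in frags) - penalty * inversions(frags)

def best_leaf(fragments, min_score=1.0, penalty=0.5):
    """Enumerate maximal compatible fragment sets (leaves) and return the best one."""
    candidates = [f for f in fragments if f[3] >= min_score]
    best = (float("-inf"), ())

    def grow(chosen, start):
        nonlocal best
        for k in range(start, len(candidates)):
            if all(not overlaps(candidates[k], g) for g in chosen):
                grow(chosen + (candidates[k],), k + 1)
        # A leaf is a non-empty set to which no remaining fragment can be added.
        is_leaf = chosen and all(
            f in chosen or any(overlaps(f, g) for g in chosen) for f in candidates
        )
        if is_leaf:
            score = leaf_score(chosen, penalty)
            if score > best[0]:
                best = (score, chosen)

    grow((), 0)
    return best

f1 = (0, 10, 5, 4.0)    # A 0-4 aligned to B 10-14
f2 = (6, 0, 5, 3.0)     # A 6-10 aligned to B 0-4: permuted relative to f1
f3 = (3, 12, 5, 5.0)    # overlaps both f1 and f2 in protein A
f4 = (12, 20, 4, 0.5)   # below the score threshold
score, chosen = best_leaf([f1, f2, f3, f4])
print(score, chosen)    # 6.5 ((0, 10, 5, 4.0), (6, 0, 5, 3.0))
```

The non-sequential pair f1 + f2 beats the single higher-scoring fragment f3 even after paying one permutation penalty. The enumeration is exponential in the worst case, so a practical search would need pruning.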
Hidden Markov models for α/β/α proteins
Non-sequential clusters may be useful for classifying proteins. Core packing classes: multiple non-sequential alignments are more specific than “architecture” but not as specific as “topology”.
Level 1: I-sites (initiation). Level 2: HMMSTR (propagation). Level 3: HMMSTR-CM (condensation). Level 4: SCALI (molten globule).
Level 5: Global topology Separation of the SCOP 1.53 database into training and test sequences, shown for the G proteins test family
Support Vector Machine. 4052 proteins --> 54-dimensional vector. Each dimension is the order of appearance of HMMSTR states for one family. [Figure: axes x1 and x2, optimal hyperplane, support vectors]
HMMSTR as the basis for a Support Vector Machine. SCOP benchmark of 54 sequence families, 4052 proteins, each represented as a 282-dimensional vector = probability of each HMMSTR state. (Hou, Y. et al., Bioinformatics, 2003; Proteins, 2004)
No sparse data problem as we mine longer and longer patterns! Why? Steps along the folding pathway (early to late), each with its model and complexity:
(1) initiation: I-sites, ~40 motifs
(2) propagation: HMMSTR, 1.1 transitions/node
(3) condensation: HMMSTR-CM, ~1% of pairs occur
(4) molten globule: SCALI, only self-avoiding paths
(5) native state: SVM-HMMSTR, ~1000-2000 folds
Are there any conclusions? We assumed that proteins fold in a certain, hierarchical manner, mined the data accordingly and found recurrence at every level, from short motifs to global structure.
Funding from: HMMSTR : NSF-CISE Chris Bystroff Vesteinn Thorsson David Baker SVM-HMMSTR Yaoming Huang Bystroff Lab ( Nat.Univ.Singapore ) Yu Shao Donna Crone Yuna Hou Xin Yuan Rachel van Duyne Mong-Li Lee Kwang Kim Ben Cole Wynne Hsu www.bioinfo.rpi.edu/~bystrc/ HMMSTR says: Think Globally, Act Locally.
Are I-sites folding initiation sites? Patterns of conservation suggest an energetic motive: 1. backbone angle constraints; 2. sidechain contacts; 3. negative design.
NMR structures confirm independent folding. [Figure, panels (a)-(d): sequence profiles of the diverging turn motif (positions 1-7 and 26-32; rows are amino acids; color scale from -1 to 1) and the NMR structure of a 7-residue I-sites motif in isolation (Yi et al., J. Mol. Biol., 1998)]