Kpax – Protein Structure Alignment

Dave Ritchie
Team Orpailleur
Inria Nancy – Grand Est

Outline

Overview of Protein Sequences and Structures
Structural Alignment Using Dynamic Programming
The Kpax Algorithm Explained
Demo: Using Kpax on Linux
Practical: Homology Modeling Using Kpax + Modeler

Protein Sequences and Structures

[Figure] Source: "The Gam protein of bacteriophage Mu is an orthologue of eukaryotic Ku", F.A. di Fagagna et al., EMBO Reports (2003), 4, 47–52

Comparing Two Strings

Q. Suppose we have two strings, e.g. EXPONENTIAL and POLYNOMIAL. How do we measure their similarity?

A1. In information theory, the edit distance measures the cost of transforming one string into another using one-character edits.

A2. Match 3 letters and then give a score for each pair...

     POLYNOMIAL
            |||
    EXPONENTIAL

Q. Suppose gaps are allowed. What is the best possible alignment?

A. How about

    --POLYNOM-IAL
      ||  |   |||
    EXPO--NENTIAL

or

    --POLYNOMIAL
      ||  |  |||
    EXPONEN-TIAL

?

Q. Which is better?

A1. The second one? (6 matches + 3 gaps vs. 6 matches + 5 gaps)

A2. ... It depends on the score for each pair and the penalty for a gap.
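To make the "it depends on the score and the gap penalty" point concrete, here is a minimal sketch that counts matches, mismatches and gaps in a given alignment and combines them into one score. The function name `score_alignment` and the default weights (match = 1, mismatch = 0, gap = 1) are illustrative assumptions, not values taken from the slides.

```python
def score_alignment(top, bottom, match=1.0, mismatch=0.0, gap=1.0):
    """Return (matches, mismatches, gaps, score) for two gapped strings.

    The weights are hypothetical defaults chosen only for illustration.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # a column containing a gap character
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # two different letters
    score = match * matches + mismatch * mismatches - gap * gaps
    return matches, mismatches, gaps, score


# The two candidate alignments from the slide.
first = score_alignment("--POLYNOM-IAL", "EXPO--NENTIAL")
second = score_alignment("--POLYNOMIAL", "EXPONEN-TIAL")
print(first)   # 6 matches, 5 gaps
print(second)  # 6 matches, 3 gaps
```

With these assumed weights the second alignment scores higher, but making gaps free (`gap=0.0`) makes the two alignments score the same, which is why the answer depends on the scoring scheme.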
Dynamic Programming

Dynamic programming (DP) is a method of dividing a problem into smaller sub-problems. It was first described by Richard Bellman in the 1950s. But instead of using recursion, it uses a table ("memoisation" in 1950s language).

Goal: find similarity $E(n,m)$ between two strings: $x[1:n]$ and $y[1:m]$

Sub-goal: find $E(i,j)$ between two prefixes: $x[1:i]$ and $y[1:j]$

Observation: the best alignment must end on one of three columns: $x[i]$ over $y[j]$, $x[i]$ over a gap (−), or a gap (−) over $y[j]$

Method: build similarity table with scores $S(i,j)$ and penalties $P(i)$:

$$E(i,j) = \max \begin{cases} E(i-1,j-1) + S(i,j) \\ E(i,j-1) - P(i) \\ E(i-1,j) - P(j) \end{cases}$$

Then, "trace back" from $E(n,m)$ to $E(1,1)$ to extract the alignment

Back-Tracking Through The DP Scoring Table

[Figure: DP scoring table with POLYNOMIAL along the columns and EXPONENTIAL down the rows, showing the back-tracking pointers]

This gives the desired optimal alignment

    --POLYNOMIAL
      ||  |  |||
    EXPONEN-TIAL

3D Least-Squares Fitting

Least-squares fitting finds the 3D rotation/translation matrix $M$ that minimises the sum of squared distances:

$$F = \sum_{i=1}^{N} \left( x^A_i - M\,x^B_i \right)^2$$

For proteins, the $x_i$ are normally Cα atom coordinates

The translational part is easy: shift centres of mass to the origin

The rotation can be found using eigenvector or quaternion methods

The residual error (RMSD) is then given by

$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x^A_i - M\,x^B_i \right)^2}$$

So, given a list of aligned Cα's, we can fit optimally to some RMSD

So, What's The Problem?

DP is "perfect" for 1D string matching

Least-squares fitting is "perfect" for 3D superposition

BUT

Proteins are not made of 1D symbols or 3D points. They are made of complex 3D chemical components (amino acid residues). It is difficult to write a good scoring function to compare residues...

Similar 1D protein sub-sequences can have different 3D shapes (α-helices, β-strands), i.e. global environment can affect local shape.

We don't know a priori the right 1D pairings for 3D fitting...

Proteins are globally flexible. Even if many local 1D regions "match", not all of them might simultaneously superpose well in 3D space...

ADDITIONALLY!

Proteins can contain multiple repeats and/or transpositions...
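The DP recurrence and trace-back for string alignment can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `align` is hypothetical, the pair score S(i,j) is a simple match/mismatch value, the position-dependent penalties P(i), P(j) are replaced by one constant `gap`, and the table carries an extra row and column 0 for the empty prefix, so the trace-back ends at cell (0, 0).

```python
def align(x, y, match=1.0, mismatch=-1.0, gap=1.0):
    """Global alignment of strings x and y; returns (score, top, bottom)."""
    n, m = len(x), len(y)

    def s(i, j):
        # Pair score S(i, j) for x[i] against y[j] (1-based indices).
        return match if x[i - 1] == y[j - 1] else mismatch

    # E[i][j] is the best score for the prefixes x[1:i] and y[1:j].
    E = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        E[i][0] = E[i - 1][0] - gap
    for j in range(1, m + 1):
        E[0][j] = E[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i - 1][j - 1] + s(i, j),  # x[i] over y[j]
                          E[i][j - 1] - gap,          # gap over y[j]
                          E[i - 1][j] - gap)          # x[i] over gap

    # Trace back from E[n][m], re-deriving which move produced each cell.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and E[i][j] == E[i - 1][j - 1] + s(i, j):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif j == 0 or (i > 0 and E[i][j] == E[i - 1][j] - gap):
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return E[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = align("EXPONENTIAL", "POLYNOMIAL", mismatch=0.0)
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score, so the strings printed here depend on the tie-breaking order in the trace-back and need not be identical to the alignment drawn on the slide.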
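The least-squares superposition of two already-paired coordinate lists can be sketched with NumPy. Note the swapped-in technique: the slides mention eigenvector or quaternion methods for the rotation, whereas this sketch uses the SVD-based Kabsch procedure, which solves the same minimisation. The function name `superpose` and the array layout (one row per Cα atom) are assumptions made for illustration.

```python
import numpy as np


def superpose(A, B):
    """Fit B onto A; return (R, t, rmsd) with fitted = B @ R.T + t.

    A and B are (N, 3) arrays of paired coordinates (e.g. C-alpha atoms).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Translational part: move both centres of mass to the origin.
    centre_a, centre_b = A.mean(axis=0), B.mean(axis=0)
    A0, B0 = A - centre_a, B - centre_b
    # Rotational part: SVD of the 3x3 covariance matrix (Kabsch).
    H = B0.T @ A0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = centre_a - R @ centre_b
    fitted = B @ R.T + t
    # Residual error: root mean square of the per-atom distances.
    rmsd = float(np.sqrt(((A - fitted) ** 2).sum(axis=1).mean()))
    return R, t, rmsd
```

The function only answers "how well do these pairs fit?"; it says nothing about which residues should be paired in the first place, which is the gap the slides go on to discuss.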
Over 100 Structure Alignment Algorithms in 25 Years

http://en.wikipedia.org/wiki/Structural_alignment_software

90 more...

Quick List of Structural Alignment Approaches

"elastic" Gaussian scoring
"double dynamic programming" on Cα distance matrices
triples or higher fragments (8-tuples) of Cα atoms
backbone Cα vectors
backbone torsion angles
secondary structure elements
geometric hashing
Voronoi tessellations
structural alphabets
Lagrangian contact map optimisation
eigenvector analysis of distance matrices
Fourier correlations
Gaussian fragments ...

Introducing Kpax

http://kpax.loria.fr/

Dynamic programming with Gaussian scores
Uses NO sequence similarity OR secondary structure information
Very fast database search (CATH, SCOP, Pfam, ..., user-defined)
Rigid and flexible structural alignments
Multiple flexible alignments coming soon...

Defining Local Coordinate Frames

All Cα atoms have highly conserved tetrahedral geometry
Exploit this to define a "canonical" Cα–C–N orientation
e.g. put Cα at origin; C on -ve z axis; N in +ve xz plane

[Figure: residues placed in the canonical local frame]

Now, ALL α-helices and β-strands look the same at the origin
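The canonical frame described above can be sketched with NumPy. The helper names `local_frame` and `to_local` are hypothetical, and "N in +ve xz plane" is read here as "N lies in the xz plane with a positive x coordinate"; this illustrates the geometry and is not Kpax's actual code.

```python
import numpy as np


def local_frame(ca, c, n):
    """Return (R, origin) for the frame with CA at the origin,
    C on the negative z axis and N in the xz plane with x > 0."""
    ca, c, n = (np.asarray(v, dtype=float) for v in (ca, c, n))
    z = ca - c                      # points away from C, so C has z < 0
    z /= np.linalg.norm(z)
    v = n - ca
    x = v - np.dot(v, z) * z        # part of CA->N perpendicular to z
    x /= np.linalg.norm(x)
    y = np.cross(z, x)              # completes a right-handed frame
    return np.vstack([x, y, z]), ca


def to_local(points, R, origin):
    """Express global coordinates (rows of `points`) in the local frame."""
    return (np.asarray(points, dtype=float) - origin) @ R.T
```

Applying `to_local` to the Cα atoms up-stream and down-stream of a residue gives coordinates that no longer depend on where the protein sits in space, so fragments from different proteins can be compared directly.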
Comparing Structural Fragments

In the canonical frame, similar structures have similar distances between their up-stream and down-stream Cα atoms:

[Figure: structural fragments superposed in the canonical frame]

But how to combine all the distances into a single score?

Representing Local Geometry as a Product of Gaussians

Calculate Gaussian distribution of all Cα atoms in CATH:

[Figure: scatter plot of CATH Cα positions in the local frame (axes x, y, z) for up-stream and down-stream neighbours -3, -2, -1, +1, +2, +3]

Gives Gaussian width $\sigma_k$ for each up-stream and down-stream Cα

Then, represent residue $i$ as a product of Gaussians:

$$\psi_i = \phi^{-n}_i(x_{i-n}) \cdots \phi^{-1}_i(x_{i-1})\, \phi^{+1}_i(x_{i+1}) \cdots \phi^{+n}_i(x_{i+n})$$

each individual Gaussian function has the form:

$$\phi^{k}_i(x_{i+k}) = N_k\, e^{-\beta_k r_k^2 / 2\sigma_k^2}$$

Calculating a Per-Residue Local Similarity Score

Calculate the local-frame similarity, $K^{\mathrm{local}}_{ij}$, as an overlap integral

$$K^{\mathrm{local}}_{ij} = \int \psi_i\, \psi_j \; dx_{-n} \cdots dx_{+n}.$$

With products of Gaussians, this reduces to a simple sum

$$K^{\mathrm{local}}_{ij} = e^{-\sum_{k=-n}^{n} \beta_k R^2_{i+k,j+k} / 4\sigma_k^2},$$

In identical α-helices, β-strands, and even loops, $K^{\mathrm{local}}_{ij} = 1$.

Nice, but how to match correctly a short α-helix with a longer one?

Detecting Secondary Structure Elements

By sliding a model α-helix and β-strand along a structure, Kpax detects its secondary structure elements (SSEs) automatically (it does not distinguish π or 3₁₀ helices or detect β-turns). Here are some examples:

[Figure: examples of detected secondary structure elements]
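The closed-form local similarity score, an exponential of a weighted sum of squared distances, can be sketched numerically. Here `R[i+k, j+k]` is taken to be the distance between the k-th neighbour of residue i and the k-th neighbour of residue j, each expressed in its own residue's local frame; that reading, the function name `k_local`, and the default β = 1 are assumptions for illustration. Real σ values would come from the CATH statistics, which are not reproduced here.

```python
import numpy as np


def k_local(frag_i, frag_j, sigma, beta=None):
    """Local similarity of two residues from their neighbour coordinates.

    frag_i, frag_j: (m, 3) arrays, one row per neighbour offset k,
                    given in each residue's own local frame.
    sigma:          (m,) Gaussian widths, one per offset (placeholders).
    beta:           (m,) weights, one per offset; defaults to all ones.
    """
    fi = np.asarray(frag_i, dtype=float)
    fj = np.asarray(frag_j, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    beta = np.ones_like(sigma) if beta is None else np.asarray(beta, dtype=float)
    r2 = ((fi - fj) ** 2).sum(axis=1)   # squared distance per offset k
    return float(np.exp(-(beta * r2 / (4.0 * sigma ** 2)).sum()))
```

Identical fragments give a score of exactly 1, and the score decays towards 0 as the neighbouring Cα positions drift apart, with wider σ values making the corresponding offsets more tolerant.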