Knowledge discovery in large biological data sets using hybrid classifier/evolutionary algorithms

Dr. Michael L. Raymer


  1. Knowledge discovery in large biological data sets using hybrid classifier/evolutionary algorithms
     Dr. Michael L. Raymer
     Department of Computer Science and Engineering / Biomedical Sciences Program

  2. EC Approaches
     • Knowledge Discovery
       - Solvation Prediction
     • Protein Structure Modeling/Prediction
       - Combinatorial comparative modeling
     M. Raymer, Interface 2004

  3. Ligand Screening & Docking
     • Complementarity
       - Shape
       - Chemical
       - Electrostatic

  4. Solvation complication
     • The protein surface is highly solvated
       - Protein crystals are 27–77% water

  5. Solvation conservation
     • Question 1: Given a solvated crystal structure, find those water molecules that are likely to be conserved upon protein-ligand binding.
     [Figure labels: protein surface, water molecule, ligand]

  6. Water Binding Site Prediction
     Question 2: Given a structural model or unsolvated structure, identify likely solvent binding positions.
     [Figure: Unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.]

  7. Pattern Recognition Approach
     [Figure: labeled training data (features f1 to f5) are used to build a classifier, which then produces a classification/prediction for an unlabeled feature vector (f1 to f5).]

  8. Crystallographic Waters
     • False positives
       - Crystallographic interfacial waters
       - Reduction of R-free by including water molecules
     • False negatives
       - Poor resolution
       - Smeared density and computational refinement

  9. Data Set Generation
     • 30 pairs of proteins: ligand-bound and unbound
       - Minimal conformational change upon binding (backbone RMSD < 0.5)
       - 2.0 Å or better resolution
       - Low residual error (R < 0.22)
     • ~3000 water molecules in the first hydration shell

  10. Conserved and Displaced
      [Figure: Rigid body superimposition of aspartic protease, unbound structure (2APR, red) along with peptidyl ligand-bound structure (3APR, cyan). Only active-site waters of bound structure shown.]

  11. Probe Site Generation
      [Figure: Aspartic protease (2APR) with crystallographically observed and computer-generated water molecules.]

  12. Feature Generation
      • Computable from crystal coordinates, or (less desirable) structure factors
        - Empirical
      • Likely to be associated with water binding

  13. Atomic Density (ADN)
      [Figure: A water molecule in the ligand-free structure of dihydrofolate reductase (1DR2). The atomic density of this water molecule is 5.]
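
The slide reports an atomic density of 5 for one water molecule but does not define the measure. A minimal sketch, assuming atomic density is the number of protein atoms within a fixed radius of the water; the function name `atomic_density` and the default radius are illustrative assumptions, not values taken from the slides:

```python
import math

def atomic_density(water_xyz, protein_atoms_xyz, radius=3.6):
    """Count protein atoms within `radius` of a water position.

    Assumption: the radius (in angstroms) is a placeholder chosen for
    illustration; the slides do not state the cutoff that was used.
    """
    return sum(
        1
        for atom_xyz in protein_atoms_xyz
        if math.dist(water_xyz, atom_xyz) <= radius
    )

# Toy coordinates (not from any real structure).
water = (0.0, 0.0, 0.0)
protein = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0), (4.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
density = atomic_density(water, protein)
```

With these toy coordinates, the atoms at distances 1.0, 3.0, and about 3.46 fall inside the cutoff, and the atom at distance 4.0 does not.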

  14. Prediction of water molecules
      [Figure: DHFR complex with biopterin, colored according to AHP (1DR2/1DR3). Displaced water molecules from the free structure are shown as wireframe spheres.]

  15. Temperature Factor (B-Value)
      [Figure: The backbone of dihydrofolate reductase (1DR2) is shown as ribbons colored according to crystallographic temperature factor (B-value).]

  16. Features Measured
      • Temperature factor (BVAL)
      • Atomic Density (ADN)
      • Atomic Hydrophilicity (AHP)
      • Hydrogen bonds to protein (HBDP)
      • Hydrogen bonds to water (HBDW)
      • Mobility (MOB): $\mathrm{MOB} = \dfrac{B_w / B_{avg}}{Occ_w / Occ_{avg}}$
      • ABVAL
      • NBVAL
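
The mobility feature can be illustrated with a short calculation. This sketch reads the slide's definition as the water's B-value relative to an average B-value, divided by the water's occupancy relative to an average occupancy. Which set of atoms the averages are taken over is not stated on the slide, so the averages are passed in as plain arguments; the function name `mobility` is hypothetical.

```python
def mobility(b_water, occ_water, b_avg, occ_avg):
    """Mobility as (B_w / B_avg) / (Occ_w / Occ_avg).

    Assumption: b_avg and occ_avg are precomputed averages; the slide
    does not say which atoms they are averaged over.
    """
    return (b_water / b_avg) / (occ_water / occ_avg)

# A water with a B-value 1.5x the average and average occupancy.
high_b = mobility(b_water=30.0, occ_water=1.0, b_avg=20.0, occ_avg=1.0)

# A water with an average B-value but half the average occupancy.
low_occ = mobility(b_water=20.0, occ_water=0.5, b_avg=20.0, occ_avg=1.0)
```

Under this reading, either a high B-value or a low occupancy raises the mobility score.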

  17. Highly overlapping distributions
      [Figure: B-value, H-bonds, and AHP, rotated to show distribution.]
      PCA shows similar overlap among the first two components.
      LDA obtains nearly random (55%) two-class accuracy.

  18. Knowledge Discovery
      The black box classifier does not help elucidate why the water molecules bind where they do.
      [Figure: Classifier; unsolvated and solvated Aspartic Protease (3APR) with peptidyl inhibitor.]

  19. EC: Feature Extraction
      [Figure: a large-n, moderate-d database passes through a feature space projection (EA) into a classifier (KNN).]

  20. Feature Weighted knn
      [Figure: (a) Class 1, Class 2, and an unknown point plotted on Feature 1 vs. Feature 2; (b) the same points with one feature's scale extended.]
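
The effect of rescaling a feature axis on nearest-neighbor voting can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: it uses a weighted Euclidean distance and a simple majority vote, and the function name `weighted_knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def weighted_knn_predict(query, train_x, train_y, weights, k):
    """Classify `query` by majority vote among its k nearest training points.

    Each feature difference is multiplied by its weight before the
    Euclidean distance is taken, so a weight of 0 removes a feature
    and a large weight stretches that axis.
    """
    def distance(a, b):
        return math.sqrt(
            sum((w * (ai - bi)) ** 2 for w, ai, bi in zip(weights, a, b))
        )

    nearest = sorted(
        range(len(train_x)), key=lambda i: distance(query, train_x[i])
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: feature 1 separates the classes, feature 2 is misleading.
train_x = [(0.0, 5.0), (0.1, 4.0), (0.2, 6.0), (1.0, 0.0), (1.1, 0.2), (0.9, -0.1)]
train_y = ["A", "A", "A", "B", "B", "B"]
query = (0.05, 0.0)

unweighted = weighted_knn_predict(query, train_x, train_y, weights=(1.0, 1.0), k=3)
reweighted = weighted_knn_predict(query, train_x, train_y, weights=(1.0, 0.0), k=3)
```

With equal weights the misleading second feature dominates the distance and the query is assigned to class "B"; with the second feature weighted to zero, the three nearest neighbors are all class "A".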

  21. GA & knn Interaction
      [Figure: the genetic algorithm sends a masked weight vector (W1 ... W5) and k to the KNN classifier, which returns a fitness value.]
      Fitness: how is it calculated?

  22. Weighting and Masking
      • How do we sample feature subsets?
        - Weight below a threshold value: slow sampling
        - Masking: chromosome layout W1 W2 W3 W4 W5 | M1 M2 M3 M4 M5 | k (example shown: weight 73.2, mask bit 0)
      • Classifier parameters (k) on the chromosome
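
One way to read the masked chromosome is as three consecutive segments: feature weights, one mask bit per feature, and the classifier parameter k. The sketch below decodes such a chromosome under that assumption; the flat-list encoding and the name `decode_chromosome` are hypothetical, chosen only to illustrate how a mask bit of 0 switches a feature off regardless of its weight.

```python
def decode_chromosome(chromosome, n_features):
    """Split a flat chromosome into effective feature weights and k.

    Assumed layout: [W1..Wn, M1..Mn, k]. A feature whose mask bit is 0
    gets an effective weight of 0.0, whatever its weight gene holds.
    """
    weights = chromosome[:n_features]
    mask = chromosome[n_features:2 * n_features]
    k = int(chromosome[2 * n_features])
    effective = [w if m else 0.0 for w, m in zip(weights, mask)]
    return effective, k

# Five features; the first has a large weight (73.2) but is masked out.
chromosome = [73.2, 1.5, 0.4, 9.0, 2.2, 0, 1, 1, 0, 1, 3]
effective_weights, k = decode_chromosome(chromosome, n_features=5)
```

Because a single bit flip removes or restores a feature, the search can sample feature subsets directly instead of waiting for a weight to drift below a threshold.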

  23. The Cost Function
      • We can direct the search toward any objective.
        - Classification accuracy
        - Class balance
        - Feature subset parsimony (reduce d)
      • The GA minimizes the cost function:

      $$\mathrm{cost}(\vec{w}, k) = C_{acc} \times \mathrm{err}(\vec{w}, k) + C_{pars} \times \mathrm{nonzero}(\vec{w}) + C_{vote} \times \mathrm{incorrect\_votes}(\vec{w}, k) + C_{bal} \times \mathrm{bal}(\vec{w}, k)$$
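
The cost function is a weighted sum of four terms: an error term, a parsimony term counting nonzero weights, an incorrect-votes term, and a balance term. The sketch below shows only the combination step. How the error, incorrect-vote, and balance terms are measured is not specified on the slide, so they are taken as precomputed inputs, and the coefficient values in the example are illustrative assumptions.

```python
def cost(err, weights, incorrect_votes, bal, *, c_acc, c_pars, c_vote, c_bal):
    """Weighted sum of the four cost terms the GA minimizes.

    err, incorrect_votes, and bal are assumed to be computed elsewhere
    by evaluating the kNN classifier with the given weights and k.
    """
    nonzero = sum(1 for w in weights if w != 0.0)  # parsimony term
    return (
        c_acc * err
        + c_pars * nonzero
        + c_vote * incorrect_votes
        + c_bal * bal
    )

# Illustrative inputs and coefficients (not taken from the slides).
example_cost = cost(
    err=0.25,
    weights=[0.0, 1.5, 0.0, 2.0],
    incorrect_votes=0.5,
    bal=0.125,
    c_acc=20.0,
    c_pars=1.0,
    c_vote=2.0,
    c_bal=4.0,
)
```

Raising `c_pars` relative to `c_acc` pushes the search toward smaller feature subsets at some expense in accuracy, which is the trade-off the coefficient list is meant to control.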

  24. Data Partitioning
      [Figure: the data are partitioned into three sets: classifier training, tuning/fitness calculation, and validation.]
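
A three-way partition of this kind can be sketched as follows. The interpretation is an assumption: the training set supplies the kNN reference points, the tuning set is used to score candidate weight vectors during the search, and the validation set is held out until the end. The split fractions, seed, and function name `three_way_split` are illustrative and not taken from the slides.

```python
import random

def three_way_split(items, train_frac=0.5, tune_frac=0.25, seed=0):
    """Shuffle items and split them into training, tuning, and validation sets.

    Whatever remains after the training and tuning portions becomes
    the validation set, so the three parts are disjoint and exhaustive.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_tune = int(len(shuffled) * tune_frac)
    train = shuffled[:n_train]
    tune = shuffled[n_train:n_train + n_tune]
    validation = shuffled[n_train + n_tune:]
    return train, tune, validation

train, tune, validation = three_way_split(range(100))
```

Keeping the validation set out of the fitness calculation matters because the GA tunes the weights against the tuning set, so accuracy measured there is optimistically biased.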

  25. Cross Validation Results

      Solvation site prediction (accuracy, %)
      Classifier        Total    Non-site  Site     Balance
      Logistic          69.331   65.496    73.164    7.668
      NeuralNetwork     69.293   66.003    72.582    6.579
      VotedPerceptron   69.246   66.754    71.737    4.983
      SMO               69.068   57.759    76.470   18.711

      Ligand-binding conservation prediction (accuracy, %)
      Classifier        Total    Disp      Cons     Balance
      NeuralNetwork     66.618   44.174    80.705   36.531
      j48               66.023   37.061    84.200   47.138
      ADTree            65.969   44.268    79.589   35.321
      VotedPerceptron   65.742   36.453    84.141   47.688
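
The Balance column is not defined on the slide. The listed values agree, to within rounding in the last digit, with the absolute difference between the two per-class accuracies, so the sketch below assumes that definition; the function name `balance` is hypothetical.

```python
def balance(acc_class_a, acc_class_b):
    """Absolute gap between two per-class accuracies, in percentage points.

    Assumption: this matches the table's Balance column; the slide
    itself does not give the formula.
    """
    return round(abs(acc_class_a - acc_class_b), 3)

# Logistic row of the solvation site table: non-site 65.496, site 73.164.
logistic_balance = balance(65.496, 73.164)

# SMO row of the same table: non-site 57.759, site 76.470.
smo_balance = balance(57.759, 76.470)
```

Read this way, a smaller balance value means the classifier treats the two classes more evenly, which is why SMO's similar total accuracy hides a much larger gap between classes.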
