Data Mining of Chemical Compounds Using Functional Groups
Ali Rathore
Chabot College, Electrical Engineering & Computer Science
Mentor: Sayan Ranu
Advisor: Dr. Ambuj Singh
Department: Computer Science, Database & Bioinformatics Lab (DBL)
Funding: National Science Foundation, Division of Information & Intelligent Systems

Data Mining of Chemical Compounds
Research Goals
Database + Parameters → Datamine → Significant Substructures (Pattern Set)
Molecule Characterization: Neighborhood of Each Atom; Functional Groups
Example: Alkenyl (Ethylene); Hydroxyl (Methanol); Benzene, Phenol, Aniline

Research Method
Original Method: Load Data into Computer → Mine the Data → Get Results
New Method: Load Data into Computer → Preprocess Data → Mine the Data → Get Results
Database → GraphSig → Significant Substructures (Pattern Set)

GraphSig Results

[Chart: Time vs. Database Size]
[Chart: Runtime Comparison of OA, LEAP and GraphSig; data labels: ~55 mins, ~42 mins, ~9 mins]

Comparison of “Accuracy” (Score out of 100)

Database    GraphSig    Optimal Assignment Kernel    Scalable Leap Search
MCF-7       68          76                           77
MOLT-4      65          72                           74
NCI-H23     79          79                           80
OVCAR-8     67          78                           79
P388        79          84                           84
PC-3        66          76                           76
SF-295      75          77                           80
SN12C       75          80                           80
SW-620      70          76                           77
UACC-257    65          75                           81
Yeast       64          71                           73
Average     70.2        76.7                         78.2

S. Ranu, A. Singh. “GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases”

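The "Average" row of the accuracy table can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the variable names are illustrative, the three score columns are referred to by position rather than by method, and the SW-620 row is assumed to read 70, 76, 77. Under those assumptions, the slide's averages match the column means truncated (not rounded) to one decimal place.

```python
import math

# Scores per database, in the left-to-right column order of the table.
# Assumption: the SW-620 row reads 70, 76, 77.
scores = {
    "MCF-7": (68, 76, 77),
    "MOLT-4": (65, 72, 74),
    "NCI-H23": (79, 79, 80),
    "OVCAR-8": (67, 78, 79),
    "P388": (79, 84, 84),
    "PC-3": (66, 76, 76),
    "SF-295": (75, 77, 80),
    "SN12C": (75, 80, 80),
    "SW-620": (70, 76, 77),
    "UACC-257": (65, 75, 81),
    "Yeast": (64, 71, 73),
}

# Mean of each score column across all databases.
columns = list(zip(*scores.values()))
means = [sum(col) / len(col) for col in columns]

# Truncate (rather than round) to one decimal place.
truncated = [math.floor(m * 10) / 10 for m in means]

print([round(m, 2) for m in means])
print(truncated)
```
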
GraphSig Results

AIDS Database + Parameters → GraphSig
3-azido-thymidine (AZT): most-used medicine for controlling HIV.

Leukemia Database + Parameters → GraphSig
• Only difference is the presence of Antimony (Sb) and Bismuth (Bi).
• May lead chemists to try other metals from the same group.
• Sb & Bi cannot be mined using other techniques.

Preprocessing Method
Original Molecules → Find Functional Groups → Replace with “Atoms” → New Molecules → GraphSig → Significant Substructures (“Better” Pattern Set)

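The "Find Functional Groups" and "Replace with “Atoms”" steps can be illustrated on a toy molecule. This is a minimal sketch, not the project's implementation: the graph representation, the function names (`find_hydroxyls`, `collapse_group`), the single hydroxyl rule, and the pseudo-atom label "OH" are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Atoms = Dict[int, str]          # atom id -> element symbol
Bonds = Set[Tuple[int, int]]    # undirected bonds stored as (low id, high id)


def neighbors(bonds: Bonds, atom: int) -> Set[int]:
    """Return the ids of all atoms bonded to `atom`."""
    return {b if a == atom else a for a, b in bonds if atom in (a, b)}


def find_hydroxyls(atoms: Atoms, bonds: Bonds) -> List[Set[int]]:
    """Find O-H groups: an oxygen with exactly two neighbors, one of them hydrogen."""
    groups = []
    for atom_id, element in atoms.items():
        if element != "O":
            continue
        nbrs = neighbors(bonds, atom_id)
        hydrogens = [n for n in nbrs if atoms[n] == "H"]
        if len(nbrs) == 2 and len(hydrogens) == 1:
            groups.append({atom_id, hydrogens[0]})
    return groups


def collapse_group(atoms: Atoms, bonds: Bonds, group: Set[int], label: str):
    """Replace all atoms in `group` with one pseudo-atom carrying `label`."""
    new_id = min(group)
    new_atoms = {i: el for i, el in atoms.items() if i not in group}
    new_atoms[new_id] = label
    new_bonds = set()
    for a, b in bonds:
        a2 = new_id if a in group else a
        b2 = new_id if b in group else b
        if a2 != b2:  # drop bonds that were internal to the group
            new_bonds.add((min(a2, b2), max(a2, b2)))
    return new_atoms, new_bonds


# Methanol (CH3OH): carbon 0, hydrogens 1-3, oxygen 4, hydroxyl hydrogen 5.
atoms = {0: "C", 1: "H", 2: "H", 3: "H", 4: "O", 5: "H"}
bonds = {(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)}

hydroxyls = find_hydroxyls(atoms, bonds)
new_atoms, new_bonds = atoms, bonds
for group in hydroxyls:
    new_atoms, new_bonds = collapse_group(new_atoms, new_bonds, group, "OH")

print(len(atoms), "atoms before,", len(new_atoms), "after")
```

In this toy case the six-atom methanol graph becomes a five-node graph in which the hydroxyl group is a single node, which is the kind of size reduction the preprocessing step aims for before mining.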
Summary
• Data Mining of Chemical Compounds
  • Automated extraction of implicit information.
  • Discovery of previously unknown patterns.
  • Analysis of databases of chemical compounds.
• Allows chemists to:
  • Predict behavior of new compounds.
  • Identify compounds with wanted properties.
• Allows pharmacists to:
  • Create drugs using significant substructures.
  • Classify compounds as active or inactive.

Acknowledgements
Liu-Yen Kramer, CNSI Education Programs Development Analyst
Dr. Evelyn Hu, CNSI Scientific Director
Jens-Uwe Kuhn, INSET Program Coordinator
Dr. Nick Arnold, INSET Faculty Coordinator
Sayan Ranu, Graduate Student Mentor
Dr. Ambuj Singh, Computer Science Faculty Advisor
Everyone at Database & Bioinformatics Lab

Questions? Thank You
Preprocessing Results
[Chart: Preprocessing Timings. x-axis: Number of Molecules (Thousand), ticks 5, 10, 20, 30, 40; y-axis: Seconds; series: Find FGs, Replace Functional Groups, Total; data labels: 12.76, 31, 58.97, 90.757, 122.67]
[Chart: Average Molecule Size. y-axis: Number of Atoms; series: Original Molecules, New "Molecules"; data label: 25.6]
