Chemical Insights from a Random Forest Prediction of Molecular Quantum Properties Beomchang Kang Seoul National University 2019.11.8, 1st XAIENCE Conference
Fluorescent molecule • Bio-imaging • Specification • Cell organelles • Proteins • Observation • Structure • Dynamics
What makes a good fluorescent molecule? • High quantum yield in the visible region • Distinctive color • Low toxicity • High synthetic accessibility
Towards discovery of novel and effective fluorescent molecules • Prediction of quantum properties for a given molecule • High quantum yield • Distinctive color • Searching the chemical space for molecules of desired properties
Today, I focus on… • Prediction of • Oscillator strength to get high quantum yield • Excitation energy • Gaining chemical insight from Random Forest results
Excitation Energy • Energy difference between two states • Electronic transition • Determines color
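Because the excitation energy determines color, it can help to convert an energy in eV into the corresponding photon wavelength through λ = hc/E. The following is a minimal sketch, not part of the talk: the function names and the 380 to 750 nm visible window are assumptions chosen for illustration.

```python
# hc expressed in eV*nm (rounded physical constant).
HC_EV_NM = 1239.841984


def wavelength_nm(excitation_energy_ev):
    """Photon wavelength in nm for a transition energy given in eV."""
    if excitation_energy_ev <= 0:
        raise ValueError("excitation energy must be positive")
    return HC_EV_NM / excitation_energy_ev


def is_visible(wavelength, low_nm=380.0, high_nm=750.0):
    """True if the wavelength lies in an assumed visible window."""
    return low_nm <= wavelength <= high_nm


if __name__ == "__main__":
    for energy in (2.0, 2.5, 5.0):
        lam = wavelength_nm(energy)
        print(f"{energy:.1f} eV -> {lam:.1f} nm, visible: {is_visible(lam)}")
```

Under this conversion, excitation energies of roughly 1.7 to 3.3 eV fall inside the assumed visible window.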
Oscillator strength (OS) • Dimensionless quantity • Probability of absorption or emission of electromagnetic radiation • In transitions between energy levels • To have high OS • Orbital shapes of the two states must be different
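For reference, the textbook definition of the oscillator strength for a transition from state 0 to state n, written in atomic units, is shown below. It is not taken from the slides, and the symbols are introduced here only for illustration.

```latex
f_{0n} = \frac{2}{3}\,\Delta E_{0n}\,\bigl|\boldsymbol{\mu}_{0n}\bigr|^{2},
\qquad
\boldsymbol{\mu}_{0n} = \Bigl\langle \psi_0 \Bigm| \sum_i \mathbf{r}_i \Bigm| \psi_n \Bigr\rangle
```

Here ΔE_0n is the excitation energy and μ_0n is the transition dipole moment between the two states, so the OS grows with both the transition energy and the squared transition dipole.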
Methods
Prediction of molecular properties • Molecule → Predictor → Property
PubChemQC Database • Molecular quantum calculations • DFT • TD-DFT • Molecules from PubChem • Actually synthesized • Molecular orbitals • Quantum properties • Classical properties
Data set for RF • From PubChemQC • Only H, B, C, N, O, F, P, S, Cl • Only neutral molecules • Randomly selected 0.5 M compounds • Training:Test = 9:1
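The selection steps on this slide (element filter, neutral molecules only, random sample, 9:1 split) can be sketched as follows. The record layout (a dict with `elements` and `charge` keys) and the function names are hypothetical; the actual PubChemQC data format and the talk's pipeline are not shown in the slides.

```python
import random

# Elements allowed by the data set definition on the slide.
ALLOWED_ELEMENTS = frozenset({"H", "B", "C", "N", "O", "F", "P", "S", "Cl"})


def is_eligible(record):
    """Keep neutral molecules built only from the allowed elements."""
    return record["charge"] == 0 and set(record["elements"]) <= ALLOWED_ELEMENTS


def sample_and_split(records, n_sample, test_fraction=0.1, seed=0):
    """Filter, draw a random sample, and split it into (train, test)."""
    eligible = [r for r in records if is_eligible(r)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(eligible, min(n_sample, len(eligible)))
    n_test = round(len(chosen) * test_fraction)
    return chosen[n_test:], chosen[:n_test]


if __name__ == "__main__":
    # Toy records standing in for database entries.
    records = [{"id": i, "elements": ["C", "H", "O"], "charge": 0} for i in range(20)]
    records.append({"id": 100, "elements": ["C", "H", "Br"], "charge": 0})
    records.append({"id": 101, "elements": ["C", "H", "N"], "charge": 1})
    train, test = sample_and_split(records, n_sample=20)
    print(len(train), len(test))
```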
Random Forest • Advantages • Simple • White-box • Feature importance • From feature importance • Chemical insight • To be compared with deep learning methods
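How a Random Forest exposes feature importance can be illustrated with scikit-learn. This is a sketch on synthetic data, assuming scikit-learn and NumPy are available; the talk does not state which implementation or hyperparameters were used, and the toy target below is invented for the demonstration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for fingerprint data: 500 "molecules", 32 binary "bits".
X = rng.integers(0, 2, size=(500, 32))
# Toy target that depends mostly on bit 3, plus a little noise.
y = 0.5 * X[:, 3] + 0.01 * rng.standard_normal(500)

# Hyperparameters here are arbitrary illustration values.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances: one value per input bit, normalized to sum to 1.
importances = model.feature_importances_
top_bit = int(np.argmax(importances))
print("most important bit:", top_bit)
```

In this toy setting the model should single out the bit that drives the target, which is the same mechanism used to rank fingerprint bits by importance.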
Extended-Connectivity Fingerprint (ECFP) • 2D molecule -> identifiers • Parameter: radius • Bit vector of ECFP • Hashing • One-hot encoding (binary) • Parameter: number of bits
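The last step of the fingerprint, turning integer substructure identifiers into a fixed-length bit vector, can be sketched in plain Python. This covers only the folding step and assumes folding by modulo; generating the identifiers themselves (hashing atom environments up to the chosen radius) is left to a cheminformatics toolkit, and the identifier values below are made up.

```python
def fold_to_bits(identifiers, n_bits):
    """Fold integer substructure identifiers into a fixed-length 0/1 vector."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    bits = [0] * n_bits
    for ident in identifiers:
        # Assumed folding rule: identifier modulo vector length.
        bits[ident % n_bits] = 1
    return bits


if __name__ == "__main__":
    # Hypothetical identifiers; 5 and 16389 collide when n_bits = 16384.
    vector = fold_to_bits([5, 16389, 42], n_bits=16384)
    print(sum(vector), vector[5], vector[42])
```

Because different identifiers can land on the same position, a single bit may represent several distinct fragments, which matters when importance values are later mapped back to chemistry.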
Results & Discussion
RF result - Excitation energy • RMSE 0.4500 eV • Pearson r 0.8689
RF result - Oscillator strength • RMSE 0.066 • Pearson r 0.7300
0.5 M set • Mean 0.042 • Median 0.009 • Std 0.096
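The two reported metrics, RMSE and Pearson r, can be computed as in the sketch below. The function names are illustrative. The demo also shows a general fact that is useful when reading an RMSE next to a standard deviation: a predictor that always outputs the mean has an RMSE equal to the population standard deviation of the target. Relating this to the statistics above assumes they describe oscillator strength, which the slides do not state explicitly.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    y = [0.0, 0.1, 0.5]  # toy target values
    mean_y = sum(y) / len(y)
    # A constant mean predictor: its RMSE equals the population std of y.
    print(rmse(y, [mean_y] * len(y)), statistics.pstdev(y))
```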
Feature importance to Fragments • Bit number: 1 … 6128 6129 6130 … 16384 • Importance: 0.xxx 0.xxx 0.022 0.xxx 0.xxx • Many fragments…
Random Forest - Feature importance • Oscillator strength • ECFP6 • n_bit = 16384 • Feature importance > 0.02
[Figure: bit number 6129; Cc1=cc=c(o1)c=C; oscillator strength 0.4690]
Bit number | Feature importance | # of fragments
9352 | 0.0330 | 115
8017 | 0.0251 | 107
6192 | 0.0218 | 129
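Selecting bits above an importance threshold and counting how many distinct fragments share each bit can be sketched as follows. The function names, the modulo folding rule, and the toy inputs are assumptions for illustration and do not reproduce the talk's numbers.

```python
from collections import defaultdict


def important_bits(importances, threshold=0.02):
    """Return (bit, importance) pairs above the threshold, highest first."""
    pairs = [(i, v) for i, v in enumerate(importances) if v > threshold]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fragments_per_bit(fragment_ids, n_bits):
    """Map each bit index to the set of fragment identifiers folded onto it."""
    mapping = defaultdict(set)
    for frag in fragment_ids:
        mapping[frag % n_bits].add(frag)  # assumed modulo folding
    return mapping


if __name__ == "__main__":
    toy_importances = [0.0, 0.033, 0.01, 0.025]
    print(important_bits(toy_importances))
    # Hypothetical fragment identifiers folded into 8 bits.
    shared = fragments_per_bit([1, 9, 17, 2], n_bits=8)
    print({bit: len(frags) for bit, frags in shared.items()})
```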
Important Fragments • Number of molecules containing the fragment > 3 • Feature importance > 0.02 • ECFP6, 16384-bit vector • Mean OS > 0.1
Fragment | Radius | Mean OS | # of molecules
[image] | 1 | 0.175 | 10590
[image] | 3 | 0.175 | 4
[image] | 2 | 0.342 | 9
[image] | 3 | 0.211 | 11
[image] | 1 | 0.207 | 6263
[image] | 3 | 0.101 | 4
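The fragment filter described here (more than 3 molecules per fragment, mean OS above 0.1) amounts to a group-by over molecules. Below is a minimal sketch with a hypothetical input layout of (fragments, oscillator strength) pairs; the fragment labels and values are invented.

```python
from collections import defaultdict


def summarize_fragments(molecules, min_count=3, min_mean_os=0.1):
    """Return {fragment: (mean OS, molecule count)} for fragments that pass.

    `molecules` is an iterable of (fragments, oscillator_strength) pairs.
    A fragment passes if it occurs in more than `min_count` molecules and
    its mean oscillator strength exceeds `min_mean_os`.
    """
    values = defaultdict(list)
    for fragments, os_value in molecules:
        for frag in set(fragments):  # count each molecule once per fragment
            values[frag].append(os_value)
    summary = {}
    for frag, vals in values.items():
        mean_os = sum(vals) / len(vals)
        if len(vals) > min_count and mean_os > min_mean_os:
            summary[frag] = (mean_os, len(vals))
    return summary


if __name__ == "__main__":
    toy = [
        (["A", "C"], 0.3), (["A", "C"], 0.4), (["A"], 0.3), (["A"], 0.4),
        (["B"], 0.9), (["B"], 0.8),
    ]
    print(summarize_fragments(toy))
```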
Fragment with high OS • C(=C)c(c)o • Radius = 2 • 9 molecules • Mean OS = 0.342
Ethyl 5-ethenylfuran-2-carboxylate • OS = 0.5230
5-Ethenyl-3H-1,3-oxazole-2-thione • OS = 0.4790
Ethyl 2-(5-ethenylfuran-2-yl)propanoate • OS = 0.4730
Thank You!