Exploring the Use of TensorFlow to Predict Connection Table Information within Chemical Structures
Brodie Schroeder
Machine Learning Basics
● Gives "computers the ability to learn without being explicitly programmed." – Arthur Samuel
● Goal is to solve problems with "generalized" algorithms that apply to many different problems
● Unsupervised and supervised learning
● Artificial Neural Networks and Deep Learning
[Image of a handwritten digit] = ?
Basic Artificial Neural Network
y = <activation function>(Wx + b)
Common activation functions: Softmax, Sigmoid, ReLU, ...
[Image of a handwritten digit] = ?
y = [0,0,1,0,0,0,0,0,0,0] → The image appears to be a '2'
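The single-layer computation above, y = activation(Wx + b), can be sketched in plain NumPy. This is a minimal illustration with toy shapes, not the network used later in the deck:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z), applied elementwise
    return np.maximum(0.0, z)

def dense_layer(x, W, b, activation=relu):
    # One fully connected layer: y = activation(Wx + b)
    return activation(W @ x + b)

# Toy example: 3 inputs -> 2 outputs (weights chosen arbitrarily)
W = np.array([[1.0, -1.0, 0.5],
              [0.0,  2.0, -0.5]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 3.0])

y = dense_layer(x, W, b)  # -> array([0.6, 2.3])
```

Stacking several such layers, with a final activation that maps outputs into [0, 1], is all the later model does.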
Goal for this Project Given the XYZ coordinates of atoms and their bonding information within a chemical structure, predict a bonding table for all atoms.
Example Dataset

benzene
ACD/Labs0812062058

  6  6  0  0  0  0  0  0  0  0  1 V2000
   1.9050  -0.7932   0.0000 C
   1.9050  -2.1232   0.0000 C      XYZ coordinates and the atom type.
   0.7531  -0.1282   0.0000 C      This will be the 'x' input in our model.
   0.7531  -2.7882   0.0000 C
  -0.3987  -0.7932   0.0000 C
  -0.3987  -2.1232   0.0000 C
  2  1  1  0  0  0  0
  3  1  2  0  0  0  0              Connection information for bonding
  4  2  2  0  0  0  0              between atoms. We will use this to
  5  3  1  0  0  0  0              train our model.
  6  4  1  0  0  0  0
  6  5  2  0  0  0  0
M  END
$$$$
Parsing SDF Files

Each row holds the atomic number of the atom, followed by its Euclidean distances to the other atoms:

  6.0,  0.0,   4.312, 6.223, 7.321, 3.221, 9.023, 2.345, 1.652, 4.791,  C
  6.0,  1.542, 0.0,   4.222, 8.231, 6.321, 1.999, 4.562, 8.345, 2.221,  C
  6.0,  2.221, 5.012, 0.0,   4.223, 6.723, 7.232, 9.821, 3.323, 4.124,  C
  8.0,  7.010, 3.011, 7.221, 0.0,   5.434, 7.777, 8.421, 5.341, 9.981,  O
  6.0,  4.312, 3.221, 3.563, 7.212, 0.0,   6.521, 7.623, 3.253, 7.456,  C
  8.0,  2.333, 5.321, 6.872, 6.454, 8.991, 0.0,   4.221, 6.213, 4.343,  O
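Building this feature matrix amounts to computing all pairwise Euclidean distances from the XYZ block and prepending each atom's atomic number. A minimal sketch (the `atoms` list and variable names are illustrative, not the presentation's actual parser; coordinates are the first three carbons from the benzene example):

```python
import numpy as np

# Hypothetical parsed atoms: (atomic number, x, y, z)
atoms = [
    (6, 1.9050, -0.7932, 0.0),
    (6, 1.9050, -2.1232, 0.0),
    (6, 0.7531, -0.1282, 0.0),
]

coords = np.array([a[1:] for a in atoms])        # (n, 3) XYZ coordinates
nums = np.array([a[0] for a in atoms], float)    # atomic numbers

# Pairwise Euclidean distances via broadcasting:
# diff[i, j] = coords[i] - coords[j]
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))          # (n, n), zero diagonal

# Prepend the atomic-number column -> one row per atom: [Z, d1, ..., dn]
features = np.hstack([nums[:, None], dist])       # (n, n + 1)
```

The broadcasting form avoids an explicit double loop, which is one route to the "improve code for calculating distances" item on the future-work slide.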
Parsing SDF Files

Boolean array of connections (left), built from the SDF bond block (right):

  0.0, 1.0, 1.0, 0.0, 0.0, 0.0      2  1  1
  1.0, 0.0, 0.0, 1.0, 0.0, 0.0      3  1  2
  1.0, 0.0, 0.0, 0.0, 1.0, 0.0      4  2  2
  0.0, 1.0, 0.0, 0.0, 0.0, 1.0      5  3  1
  0.0, 0.0, 1.0, 0.0, 0.0, 1.0      6  4  1
  0.0, 0.0, 0.0, 1.0, 1.0, 0.0      6  5  2
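Turning the bond block into that boolean array is a small symmetric-adjacency construction. A sketch, assuming the bond rows have already been parsed into `(atom_i, atom_j, bond_order)` tuples (the names here are illustrative):

```python
import numpy as np

# Bond block rows from the benzene example: (atom_i, atom_j, bond_order)
bonds = [(2, 1, 1), (3, 1, 2), (4, 2, 2), (5, 3, 1), (6, 4, 1), (6, 5, 2)]
n_atoms = 6

adj = np.zeros((n_atoms, n_atoms))
for i, j, order in bonds:
    # SDF atom indices are 1-based; the boolean array ignores bond order,
    # recording only whether two atoms are connected
    adj[i - 1, j - 1] = 1.0
    adj[j - 1, i - 1] = 1.0
```

Note that bond order (single vs. double) is discarded here, matching the slide: the model predicts only connectivity, not bond type.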
Input and Training Data

x =  6.0, 0.0,   4.312, 6.223, 7.321, 3.221, 9.023,      y_ =  0.0, 1.0, 1.0, 0.0, 0.0, 0.0,
     6.0, 1.542, 0.0,   4.222, 8.231, 6.321, 1.999,            1.0, 0.0, 0.0, 1.0, 0.0, 0.0,
     6.0, 2.221, 5.012, 0.0,   4.223, 6.723, 7.232,            1.0, 0.0, 0.0, 0.0, 1.0, 0.0,
     8.0, 7.010, 3.011, 7.221, 0.0,   5.434, 7.777,            0.0, 1.0, 0.0, 0.0, 0.0, 1.0,
     6.0, 4.312, 3.221, 3.563, 7.212, 0.0,   6.521,            0.0, 0.0, 1.0, 0.0, 0.0, 1.0,
     8.0, 2.333, 5.321, 6.872, 6.454, 8.991, 0.0,              0.0, 0.0, 0.0, 1.0, 1.0, 0.0,

     (6 x 7 = 42 values)                                       (6 x 6 = 36 values)
Input and Training Data
● Build two Python lists
● List 'a' is a list of flattened 2-D NumPy matrices containing Euclidean distances and atom type
● List 'b' is a list of flattened 2-D NumPy matrices containing bonding information for all atoms
● Matrix size is capped at 28 x 29 and 28 x 28 respectively (only molecules with 28 or fewer atoms are included)
● If the molecule has fewer than 28 atoms, the matrix is padded with zeros
● a[n] corresponds to b[n]
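The cap-and-pad step above fixes the input and output sizes used by the model (28 × 29 = 812 and 28 × 28 = 784). A minimal sketch of that step (the function name is illustrative):

```python
import numpy as np

MAX_ATOMS = 28

def pad_and_flatten(mat, n_cols):
    # Zero-pad a per-molecule matrix up to 28 rows x n_cols columns,
    # then flatten to a 1-D vector matching the model's fixed input size.
    n = mat.shape[0]
    if n > MAX_ATOMS:
        return None  # molecules with more than 28 atoms are skipped
    padded = np.zeros((MAX_ATOMS, n_cols))
    padded[:n, :mat.shape[1]] = mat
    return padded.ravel()

# Distance/atom-type matrix is n x (n+1); bond matrix is n x n.
# For the 6-atom benzene example (dummy values here):
x_vec = pad_and_flatten(np.ones((6, 7)), 29)   # length 28 * 29 = 812
y_vec = pad_and_flatten(np.ones((6, 6)), 28)   # length 28 * 28 = 784
```

Zero-padding keeps every molecule the same size, at the cost of the network spending capacity on padding positions that are always zero.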
What is TensorFlow?
● "Open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them."
● Provides an API that makes it easy to set up, design, and train deep learning models.
Building the Model

x = tf.placeholder(tf.float32, [None, 812])
y_ = tf.placeholder(tf.float32, [None, 784])

# Weights and biases must be defined before the layers that use them
W1 = tf.Variable(tf.truncated_normal([812, 784], stddev=0.1))
b1 = tf.Variable(tf.truncated_normal([784], stddev=0.1))
layer1 = tf.nn.relu(tf.add(tf.matmul(x, W1), b1))

W2 = tf.Variable(tf.truncated_normal([784, 784], stddev=0.1))
b2 = tf.Variable(tf.truncated_normal([784], stddev=0.1))
layer2 = tf.nn.relu(tf.add(tf.matmul(layer1, W2), b2))

W3 = tf.Variable(tf.truncated_normal([784, 784], stddev=0.1))
b3 = tf.Variable(tf.truncated_normal([784], stddev=0.1))
layer3 = tf.nn.relu(tf.add(tf.matmul(layer2, W3), b3))

W = tf.Variable(tf.truncated_normal([784, 784], stddev=0.1))
b = tf.Variable(tf.truncated_normal([784], stddev=0.1))
logits = tf.add(tf.matmul(layer3, W), b)
y = tf.nn.sigmoid(logits)

Building the Model

# sigmoid_cross_entropy_with_logits applies the sigmoid itself,
# so it must be given the pre-sigmoid logits, not y
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=logits))
train_step = tf.train.AdamOptimizer(0.001).minimize(cross_entropy)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

a, b = get_batch()
train_len = len(a)

# Round the sigmoid outputs to 0/1 before comparing with the labels;
# comparing raw floats for equality would almost never match
correct_prediction = tf.equal(y_, tf.round(y))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Training
for i in range(train_len):
    batch_xs = a[i]
    batch_ys = b[i]
    _, loss, acc = sess.run([train_step, cross_entropy, accuracy],
                            feed_dict={x: batch_xs, y_: batch_ys})
    print("Loss= " + "{:.6f}".format(loss) + " Accuracy= " + "{:.5f}".format(acc))
Building the Model

# Test trained model
cumulative_accuracy = 0.0
for i in range(train_len):
    acc_batch_xs = a[i]
    acc_batch_ys = b[i]
    cumulative_accuracy += accuracy.eval(
        feed_dict={x: acc_batch_xs, y_: acc_batch_ys})
print("Test Accuracy= {}".format(cumulative_accuracy / train_len))
Results thus far...
Test Accuracy = 0.865 (approx. 10,000 training samples)
Future Improvements and Optimization
● Cache results of parsing the SDF file
● Improve code for calculating distances
● Improve initial values of weights
● Overtraining or undertraining?
● TensorBoard visualization
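The first item, caching the parsed SDF results, could look something like the sketch below: save the two NumPy arrays after the first parse and reload them on later runs. The cache filename, the `parse_sdf` callable, and `load_dataset` are all hypothetical names for illustration, not code from the project:

```python
import os
import numpy as np

# Hypothetical cache file; np.savez stores both arrays in one .npz archive
CACHE = "parsed_sdf_cache.npz"

def load_dataset(sdf_path, parse_sdf):
    # Fast path: reuse a previous parse if the cache file exists
    if os.path.exists(CACHE):
        data = np.load(CACHE)
        return data["a"], data["b"]
    # Slow path: parse the SDF file, then cache the result for next time
    a, b = parse_sdf(sdf_path)
    np.savez(CACHE, a=a, b=b)
    return a, b
```

One caveat with any such cache: it must be invalidated (e.g. by deleting the file) whenever the SDF file or the parsing logic changes.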
Questions? View the code: https://github.com/Allvitende/chemical-modeling/