RDKit (cheminformatics) Neo4j Integration
Presenter: Evgeny Sorokin
Mentors: Christian Pilger (BASF), Greg Landrum (RDKit), Stefan Armbruster (Neo4j)
Motivation
● Neo4j = useful tool to map knowledge
● Chemical/pharmaceutical R&D:
  ○ Required: mapping data of completely different nature (recipe, process, application test, chemical structures)
  ○ Knowledge graphs are frequently a good choice here over other data models
  ○ Problem: Neo4j does not support chemical structures
● RDKit
  ○ is a widely used Open Source tool to deal with chemical structures
  ○ has proven its value in conjunction with Postgres
● Idea: enrich Neo4j's capabilities by combining it with RDKit => GSoC project
Chemical structure representation
● Not intended: “dissolve” atoms and bonds as nodes and relations into the graph!
● Intended: use available structure representation as node properties
  ○ SMILES format: c1ccccc1 (single-line ASCII representation --> exact search via string matching)
  ○ MOL format: (3D coordinates: richer format, more details --> advantages in sub-structure searches)
name: benzene
formula: C6H6
SMILES: c1ccccc1
Chemical structures example
name: benzene
formula: C6H6
SMILES: c1ccccc1
Requirements
● Basic functionality:
  ○ Exact chemical search (“find the molecule benzene")
  ○ Chemical substructure search (“find all molecules that contain a benzene moiety")
● Typical application scenarios in graph context
  ○ Find entry points into the graph
  ○ Filter paths during graph traversal with chemical structure conditions
How was it implemented - storage in a graph
● A new node with labels :Chemical:Structure is processed by the RDKit event handler
● From either the smiles or the mdlmol property, a list of 7-8 properties is created per node:
  ○ canonical_smiles
  ○ inchi
  ○ formula
  ○ molecular_weight
  ○ fp - bit-vector fingerprint
  ○ fp_ones - count of positive bits
  ○ mdlmol
  ○ smiles [optional]
● A full text index is created for the fingerprint property
How was it implemented - exact search
Simple case: compare two canonical SMILES strings with each other, find a match.
SMILES: O=S(=O)(Cc1ccccc1)CS(=O)(=O)Cc1ccccc1
Canonical SMILES: O=S(=O)(CC1=CC=CC=C1)CS(=O)(=O)CC1=CC=CC=C1
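The exact-search idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_CANONICAL`, `toy_canonicalize`, and `exact_search` are hypothetical names, and the hand-written lookup table stands in for a real canonicalizer so the sketch runs without any cheminformatics library. The point is that different spellings of the same molecule reduce to one canonical string, so the search itself is plain string equality on the stored `canonical_smiles` property.

```python
# Hand-written stand-in for a real canonicalizer (an assumption for this sketch).
# Each key is one way of writing a molecule; the value is the single canonical form.
TOY_CANONICAL = {
    "CCO": "CCO",          # ethanol
    "OCC": "CCO",          # ethanol, written from the other end
    "C(O)C": "CCO",        # ethanol, written with a branch
    "CC(=O)O": "CC(=O)O",  # acetic acid
    "OC(C)=O": "CC(=O)O",  # acetic acid, different atom order
}


def toy_canonicalize(smiles):
    """Map a SMILES string to its canonical form using the toy table."""
    return TOY_CANONICAL[smiles]


def exact_search(nodes, query_smiles, canonicalize=toy_canonicalize):
    """Return names of nodes whose canonical_smiles equals the canonical query."""
    key = canonicalize(query_smiles)
    return [node["name"] for node in nodes if node["canonical_smiles"] == key]


# Nodes store the canonical form once, at insertion time.
nodes = [
    {"name": "ethanol", "canonical_smiles": toy_canonicalize("OCC")},
    {"name": "acetic acid", "canonical_smiles": toy_canonicalize("CC(=O)O")},
]

print(exact_search(nodes, "C(O)C"))
print(exact_search(nodes, "OC(C)=O"))
```

With RDKit's Python bindings installed, the stand-in could be swapped for `lambda s: Chem.MolToSmiles(Chem.MolFromSmiles(s))` (after `from rdkit import Chem`) without changing `exact_search`.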
How was it implemented - substructure search (SSS)
A chemical fingerprint is a characteristic pattern that encodes the presence of particular structural features in a molecule.
Two kinds: bitvector and count-based fingerprints
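The difference between the two fingerprint kinds can be shown with a small pure-Python sketch. Everything here is an assumption made for illustration: the feature names are invented labels, the 64-bit length is arbitrary, and hashing with `zlib.crc32` is a stand-in for whatever feature hashing a real fingerprinting routine uses. A count-based fingerprint records how often each feature position occurs, while a bitvector only records whether it occurs at all.

```python
import zlib
from collections import Counter

N_BITS = 64  # illustrative fingerprint length, not taken from the project


def feature_index(feature, n_bits=N_BITS):
    """Hash a feature label to a stable position in the fingerprint."""
    return zlib.crc32(feature.encode("utf-8")) % n_bits


def count_fingerprint(features, n_bits=N_BITS):
    """Count-based fingerprint: position -> number of occurrences."""
    return Counter(feature_index(f, n_bits) for f in features)


def bit_fingerprint(features, n_bits=N_BITS):
    """Bitvector fingerprint: 1 if the position occurs at least once, else 0."""
    bits = [0] * n_bits
    for f in features:
        bits[feature_index(f, n_bits)] = 1
    return bits


# Hypothetical feature labels for a molecule with two rings and two sulfone groups.
features = ["aromatic_ring", "aromatic_ring", "sulfone", "sulfone", "methylene"]

counts = count_fingerprint(features)
bits = bit_fingerprint(features)

print("occurrences kept by the count-based fingerprint:", sum(counts.values()))
print("positions set in the bitvector fingerprint:", sum(bits))
```

The bitvector is the form that maps naturally onto a full text index, since each set position can be treated as one word.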
How was it implemented - SSS
1. Each of the structures is encoded as a bitvector fingerprint
2. Bitvectors are transformed into a string of the indexes of their positive (set) bits
3. A fulltext index is applied to the transformed bitvectors (numbers -> words)
4. Search is done with constraints regarding specific properties.
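The steps above can be sketched in pure Python. This is a minimal illustration, not the project's code: `bits_to_words`, `make_props`, and `screen` are hypothetical helpers, the molecule names and bitvectors are made up, positions are counted from 1 to match the quiz example at the end of the deck, and using `fp_ones` as a cheap pre-filter is an assumption about how the "constraints regarding specific properties" might look.

```python
def bits_to_words(bits, start=1):
    """Turn a bitvector into a space-separated string of set-bit positions."""
    return " ".join(str(i) for i, b in enumerate(bits, start=start) if b)


def make_props(bits):
    """Build the two fingerprint properties stored on a node."""
    return {"fp": bits_to_words(bits), "fp_ones": sum(bits)}


def screen(query_bits, candidates):
    """Return names of candidates whose fingerprint contains every query bit."""
    query_words = set(bits_to_words(query_bits).split())
    query_ones = len(query_words)
    hits = []
    for name, props in candidates.items():
        # A structure with fewer set bits than the query cannot contain all of them.
        if props["fp_ones"] < query_ones:
            continue
        # Every "word" of the query must appear among the candidate's words.
        if query_words <= set(props["fp"].split()):
            hits.append(name)
    return hits


# Made-up fingerprints for two hypothetical molecules.
candidates = {
    "mol_a": make_props([1, 0, 1, 0, 1, 1, 0, 0, 1]),
    "mol_b": make_props([1, 0, 0, 0, 1, 0, 0, 0, 0]),
}
query = [1, 0, 1, 0, 0, 0, 0, 0, 1]

print(candidates["mol_a"]["fp"])
print(screen(query, candidates))
```

A fingerprint screen of this kind can return false positives, because distinct features may share a bit, so the surviving candidates are normally confirmed with a real substructure match.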
Chemical reactions’ relationships
What are possible applications 2.) expand path apoc.path.expand
Resources
● https://github.com/rdkit/neo4j-rdkit
● https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf
● https://www.rdkit.org/
● https://neo4j.com/docs/cypher-manual/current/schema/index/
● http://tiny.cc/mol_block_definition
● @evgerher via telegram
Hunger games Q&A
1) Hard: which format can represent a chemical structure that has chirality (e.g. lactic acid)?
   a) SMILES
   b) MOL block
   c) All of the above
2) Medium: how does a bitvector fingerprint differ from a count-based fingerprint?
   a) Harder to store
   b) Does not support similarity search
   c) Does not keep track of the number of occurrences
3) Easy: the transformation of the bitvector [1 0 1 0 1 1 0 0 1] is:
   a) “1 3 5 6 9”
   b) “2 4 6 7 9”
   c) “1 3 5 6 9”