Logic Programming for Big Data in Computational Biology
Nicos Angelopoulos
Wellcome Sanger Institute, Hinxton, Cambridge
nicos.angelopoulos@sanger.ac.uk
18.9.18
overview
◮ knowledge for Bayesian machine learning over model structure
◮ applied knowledge representation for biological data analytics
Bayesian inference of model structure (Bims)
A Bayesian machine learning system that can model prior knowledge by means of probabilistic logic programming.
Nomenclature
◮ DLPs = Distributional logic programs
◮ Bims = Bayesian inference of model structure
Timeline
◮ Theory (York, 2000-5)
◮ Applications (Edinburgh, 2006-8, IAH 2009, NKI 2013)
◮ Bims library and theory paper 2015-2017
Bims Overview
◮ syntax of DLPs
◮ a succinct classification tree prior program
◮ Bayesian learning of model structure
◮ learning classification and regression trees
◮ Bayesian learning of Bayesian networks
◮ the bims library
DLPs: description
We extend LP's clausal syntax with probabilistic guards that associate a resolution step using a particular clause with a probability whose value is computed on-the-fly. The intuition is that this value can be used as the probability with which the clause is selected for resolution. Thus, in addition to the logical relation that a clause defines over the objects appearing as arguments in its head, the clause also defines a probability distribution over aspects of this relation.
DLPs example
(C1)  member(H, [H|T]).
      member(El, [H|T]) :- member(El, T).
(G1)  L :: length(List, L) ∼ El :: umember(El, List)
(C2)  1/L :: L :: umember(El, [El|Tail]).
(C3)  1 - 1/L :: L :: umember(El, [H|Tail]) :- umember(El, Tail).
DLPs probabilistic goals
(C4)  1/L :: L :: umember(El, [El|Tail]).
(C5)  1 - 1/L :: L :: umember(El, [H|Tail]) :- K is L - 1, K :: umember(El, Tail).
DLPs query
?- umember(X, [a,b,c]).
X = a (1/3 of the times = 1/3);
X = b (1/3 of the times = 2/3 * 1/2);
X = c (1/3 of the times = 2/3 * 1/2 * 1).
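The uniformity claimed by the query can be checked by simulation. The following is a minimal Python sketch of the sampling behaviour of clauses C4 and C5, not the DLP machinery itself; the function names `umember` and `empirical_frequencies`, the sample size and the seed are illustrative assumptions.

```python
import random
from collections import Counter


def umember(items, rng):
    """Sample one element the way clauses C4/C5 do: with L elements left,
    take the head with probability 1/L, otherwise move to the tail with L - 1."""
    remaining = len(items)                  # plays the role of the guard value L
    for head in items:
        if rng.random() < 1.0 / remaining:  # clause C4 selected
            return head
        remaining -= 1                      # clause C5: K is L - 1
    raise ValueError("cannot sample from an empty list")


def empirical_frequencies(items, n_samples=30000, seed=0):
    """Estimate how often each element is returned (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(umember(items, rng) for _ in range(n_samples))
    return {x: counts[x] / n_samples for x in items}


if __name__ == "__main__":
    print(empirical_frequencies(["a", "b", "c"]))
```

Each frequency should come out close to 1/3, matching the products 1/3, 2/3 * 1/2 and 2/3 * 1/2 * 1 on the slide.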
simple tree prior
?- cart(ζ, ξ, A, M).
M = nd(x2,1,nd(x1,0,lf,lf),lf)
(C0)  cart(ζ, ξ, M, Cart) :-
          ψ_0 is ζ,
          ψ_0 : split(0, ζ, ξ, M, Cart).
(C1)  ψ_D : split(D, ζ, ξ, M_B, nd(F, Val, L, R)) :-
          ψ_{D+1} is ζ * (1 + D)^(-ξ),
          D1 is D + 1,
          r_select(F, Val, M_B, L_B, R_B),
          ψ_{D+1} : split(D1, ζ, ξ, L_B, L),
          ψ_{D+1} : split(D1, ζ, ξ, R_B, R).
(C2)  1 - ψ_D : split(D, ζ, ξ, M_B, lf).
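To get a feel for the trees this prior generates, the clauses can be imitated with a short Python sketch. It is an illustration under stated assumptions rather than the Bims implementation: the exponent in ψ is read as (1 + D) raised to −ξ, the data-dependent r_select step is replaced by a uniform choice among feature names, split values are omitted, and `max_depth` is a hypothetical safety bound that is not part of the slide.

```python
import random


def sample_tree(zeta, xi, features, rng, depth=0, psi=None, max_depth=20):
    """Return "lf" or ("nd", feature, left, right), following C0-C2."""
    if psi is None:
        psi = zeta                              # C0: psi_0 is zeta
    if depth >= max_depth or rng.random() >= psi:
        return "lf"                             # C2: leaf with probability 1 - psi_D
    child_psi = zeta * (1 + depth) ** (-xi)     # C1: psi_{D+1}, assumed exponent form
    feature = rng.choice(features)              # stand-in for r_select (assumption)
    left = sample_tree(zeta, xi, features, rng, depth + 1, child_psi, max_depth)
    right = sample_tree(zeta, xi, features, rng, depth + 1, child_psi, max_depth)
    return ("nd", feature, left, right)


def count_leaves(tree):
    """Number of leaves in a sampled tree (hypothetical helper)."""
    if tree == "lf":
        return 1
    _, _, left, right = tree
    return count_leaves(left) + count_leaves(right)


if __name__ == "__main__":
    rng = random.Random(42)
    print(sample_tree(0.95, 1.0, ["x1", "x2"], rng))
```

Larger ζ makes splits more likely overall, while larger ξ makes deep splits rarer, so the two parameters together control the expected size of sampled trees.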
Bims theory
Bayes' Theorem
\[ p(M \mid D) = \frac{p(D \mid M)\, p(M)}{\sum_{M} p(D \mid M)\, p(M)} \]
Metropolis-Hastings
\[ \alpha(M_i, M^{*}) = \min\left\{ \frac{q(M^{*}, M_i)\, P(D \mid M^{*})\, P(M^{*})}{q(M_i, M^{*})\, P(D \mid M_i)\, P(M_i)},\ 1 \right\} \]
DLP defined model space
From M_i identify G_i, then sample forward to M⋆. q(M_i, M⋆) is the probability of proposing M⋆ when M_i is the current model.
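The Metropolis-Hastings acceptance step can be sketched in a few lines of Python. This is a generic illustration, assuming that log-likelihoods, log-priors and log proposal probabilities are already available as numbers; the function and argument names are hypothetical and do not come from the bims library.

```python
import math
import random


def log_acceptance(log_lik_cur, log_prior_cur, log_q_cur_to_new,
                   log_lik_new, log_prior_new, log_q_new_to_cur):
    """log of alpha(M_i, M*) = min{ q(M*,M_i) P(D|M*) P(M*)
                                   / (q(M_i,M*) P(D|M_i) P(M_i)), 1 }."""
    log_ratio = (log_q_new_to_cur + log_lik_new + log_prior_new) \
        - (log_q_cur_to_new + log_lik_cur + log_prior_cur)
    return min(0.0, log_ratio)


def accept(log_alpha, rng):
    """Accept the proposed model with probability exp(log_alpha)."""
    return rng.random() < math.exp(log_alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    # Proposal with half the likelihood, equal priors, symmetric proposal.
    la = log_acceptance(0.0, 0.0, 0.0, math.log(0.5), 0.0, 0.0)
    print(math.exp(la), accept(la, rng))
```

Working in log space avoids underflow when the likelihoods are products over many data points.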
Pyruvate kinase interactors
objective: improve the chances of discovering binding molecules based on examples from screened chemical libraries.
pyruvate kinase affinity data: 582 active and 582 inactive molecules. Dragon software produces 1500 property descriptors for each molecule, of which about 1100 were used.
ten-fold cross-validation: compared to Feed Forward Neural Networks and Support Vector Machines by splitting the data into ten train/test segments.
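Splitting the data into ten train/test segments can be sketched as follows. This is a plain, unstratified split written for illustration; the slides do not say how the segments were drawn, so the shuffling, the seed and the function name `ten_fold_indices` are assumptions.

```python
import random


def ten_fold_indices(n_items, n_folds=10, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds segments.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


if __name__ == "__main__":
    # 582 active + 582 inactive molecules = 1164 examples.
    for train, test in ten_fold_indices(1164):
        print(len(train), len(test))
```

Each model under comparison is trained on the `train` indices and scored on the matching `test` indices, so all methods see the same ten segments.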
best likelihood model
ten-fold validation
\[ \text{Sensitivity} = \frac{T^{+}}{T^{+} + F^{-}}, \qquad \text{Specificity} = \frac{T^{-}}{T^{-} + F^{+}} \]
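Reading T+ and T− as true positives and true negatives, and F+ and F− as false positives and false negatives, the two measures can be computed as in this small Python sketch. The function name and the 1/0 label encoding (1 for active, 0 for inactive) are illustrative assumptions.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels, 1 = active, 0 = inactive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # T+
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # F-
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # T-
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # F+
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one actual positive and one actual negative")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    print(sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1]))
```

In a ten-fold setting the counts would typically be accumulated over the ten test segments before taking the ratios.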
molecules of Eduliss according to BCarts
Bims: Bayesian inference of model structure
Released in 2016 as an easily installable SWI-Prolog library
Includes (IJAR paper in 2017)
◮ priors and likelihoods for: CARTs and Bayesian networks
◮ hooks for user defined models
Probabilistic logic programming
◮ thesis: probabilistic finite domains
◮ PLP workshop and IJAR associated issues (5th edition)
knowledge-based computational biology
◮ graphical models (focal adhesion dynamics, NKI, 2011-3)
◮ proteomics functional analysis (TKSilac, KSR1, ATG9A, Imperial, 2014-5)
◮ mutational profiling (14MG, Sanger, 2016-8)
Graphical models of FAD
Graphical models (aka Bayesian networks) can provide a network view of dependencies among variables, capturing much richer information than pairwise correlations. In this project, microscopy-based variables characterising focal adhesion over time are connected for a number of conditions in the HGF pathway.
tkSilac: tyrosine kinase screen
◮ MCF7 cell line
◮ 33 SILAC runs
◮ 65/66 expressed tyrosine kinases
◮ 4739 quantified in some experiment
◮ 1000 quantified in 60 or more TK KO
Figure 2 (heatmap; color key spans values 0.5 to 2; labels are the 65 silenced tyrosine kinases)
Fig. 2. Heatmap of quantified proteins after TK silencing. The overall pattern of regulation is shown in the heatmap of quantified values. After normalization to siControl, values of fold changes are all above 0, with value 1 showing that the expression levels of the specific protein are not altered after silencing TKs. For each knockdown (rows)
Figure 4 (tyrosine kinase clusters with the associated regulated proteins; bar chart of Up and Down counts per cluster)
Cluster  No.  TKs
1        8    ABL1, AXL, EPHA2, EPHA4, LMTK2, MST1R, NTRK1, RYK
2        15   ABL2, BTK, CSF1R, CSK, DDR1, EPHA3, EPHA6, EPHA7, EPHB2, EPHB6, ERBB3, FES, FGFR1, FLT3, FYN
3        3    EGFR, EPHA1, NTRK3
4        6    EPHB1, EPHB4, FGFR2, FLT1, FRK, INSR
5        2    EPHB3, ERBB2
6        6    ERBB4, LMTK3, MATK, ROR2, TYRO3, ZAP70
7        7    HCK, IGF1R, MERTK, MET, PDGFRB, PTK6, SRC
8        3    JAK1, LCK, PTK2
9        12   JAK2, KDR, NTRK2, PTK2B, PTK7, RET, ROR1, SYK, TEC, TNK1, TYK2, YES1
10       3    LYN, STYK1, TNK2
Figure 5 (two panels, A and B; rows are clusters 1 to 10, x-axis is %)
Panel A terms: Cell communication, Cell cycle, Reproduction, Metabolic process, Transport, Immune system, Growth, Development, Apoptosis, Cell adhesion
Panel B terms: Transporter activity, Structural molecule activity, Receptor activity, Protein binding, TF activity, Molecular transducer activity, Enzyme regulator activity, Chemoattractant activity, Catalytic activity, Binding, Antioxidant activity
Fig. 5. Characterization of a functional portrait for each cluster. A, A functional profile of top GO biologic processes that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO biologic process term. The color coding and the number for each cluster are indicated as above. B, A functional profile of top GO molecular functions that the up- and downregulated proteins belong to is presented. x-axis shows the percentage of hits in each cluster that belong to a GO molecular function term.