Linking biological data using data science and cross-disciplinary software development
Florian Huber¹, Justin van der Hooft², Simon Rogers³, Marnix Medema², Lars Ridder¹
¹ Netherlands eScience Center; ² Bioinformatics Group, Wageningen University; ³ School of Computing Science, University of Glasgow
de-RSE conference, Potsdam, 05/06/2019
Breaking down scientific monocultures by cross-disciplinary software development
talk by Alys Brett Florian Huber | @me_datapoint | de-RSE 2019
We signal challenges and opportunities at the intersection of software and academic research Photography: Elodie Burrillon
Our technological expertise areas
• Big data analytics: scientific visualization, machine learning, information retrieval, computer vision, information visualization, text mining
• Efficient computing: low power computing, accelerated computing, orchestrated computing, high performance computing, distributed computing
• Optimized data handling: databases, linked data, handling sensor data, information integration, data assimilation
What do we do?
• Research software
• Link between researchers and IT infrastructure
• Data stewards/data scientists
• Cross-disciplinary transfer
Example project: Integrated ‘omics’ analysis
• NL eScience Center
• Medema lab - Wageningen UR, NL
• UCSD: Madeleine Ernst, Pieter Dorrestein
• Glasgow University: Simon Rogers, Andrew Ramsay, Grimur Hjorleifsson Eldjar
secondary metabolites
DNA mass spectra HMM (Hidden Markov Model) + manually written rules
Mass spectrometry and fragmentation
Figure: ionization → mass separation → mass trapping → mass detection, with MS1 and MS2 spectra plotted over m/z.
The fragments are used to puzzle out the metabolite structure.
Bacteria, fungi, and plants produce a large and diverse arsenal of high-value molecules:
• rapamycin (immunosuppressant)
• vancomycin (antibiotic)
• doxorubicin (chemotherapeutic agent)
• spinosad (insecticide)
• pneumocandin (antifungal)
• lovastatin (cholesterol-lowering agent)
Figure: mass spectrometry fragmentation spectrum.
The challenge is the large-scale coupling of spectral data to the molecular structures of known and, especially, novel natural products.
But… How similar are they? Spectral similarity
How similar are they? What does ‘similar’ mean?
Example sentences: “…likes cake with a cappuccino.” and “…loves to have a cookie and a coffee.”
Possible criteria: number of words? number of characters? grammatical structure? topic? meaning? style? phonetic structure?
Terminology: each token is a ‘word’; each full example, such as “…likes cake with a cappuccino.” or “…loves to have a cookie and a coffee.”, is a ‘sentence’ (or ‘document’).
Count how often ‘words’ co-occur (find word ‘context’).
Figure: N × N co-occurrence matrix over all words in the corpus, with example rows and columns for ‘monster’, ‘cake’, ‘cookie’, and ‘sweet’ and example counts between 0 and 24. N: number of words in the dictionary.
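The counting step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: the toy corpus is made up, and treating a whole document as the ‘context’ (rather than a sliding window) is an assumption made here for brevity.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count in how many documents each unordered pair of distinct words co-occurs."""
    counts = Counter()
    for words in documents:
        # Use the set of words so each pair is counted once per document;
        # sorting gives every pair a canonical (alphabetical) key.
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


# Toy corpus, invented for illustration (not the data from the talk).
corpus = [
    ["cookie", "monster", "likes", "cookie"],
    ["sweet", "cake"],
    ["sweet", "cookie"],
    ["sweet", "cake", "cookie"],
]

counts = cooccurrence_counts(corpus)
print(counts[("cake", "sweet")])      # 2
print(counts[("cookie", "monster")])  # 1
print(counts[("cake", "monster")])    # 0 (never seen together)
```

The pair counts are the off-diagonal entries of the symmetric N × N matrix; storing them sparsely in a `Counter` avoids allocating N² cells when most pairs never co-occur.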
‘Word2Vec’ → lower-dimensional context vector
Figure: the word co-occurrence matrix is factorized, giving each word a lower-dimensional vector, e.g. $V_{\mathrm{cake}}$ and $V_{\mathrm{cookie}}$.
NLP → metabolomics: use peaks as words
Peak positions such as m(A), m(Aa), and m(A’’) take the role of ‘words’; the same co-occurrence approach then yields peak vectors such as $V_{A}$ and $V_{Aa}$.
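To make "peaks as words" concrete, here is a minimal sketch of how peak positions could be turned into word-like tokens. The function name `peaks_to_words`, the `peak@` prefix, and the rounding to two decimals are all hypothetical choices for illustration; the slides do not state how peaks were binned.

```python
def peaks_to_words(mz_values, decimals=2):
    """Turn peak positions (m/z) into 'word' strings by rounding to a fixed precision."""
    # Peaks that round to the same value become the same 'word'.
    return [f"peak@{mz:.{decimals}f}" for mz in mz_values]


# Invented peak list for one spectrum (illustration only).
spectrum = [89.0391, 147.0632, 147.0629, 303.0501]
words = peaks_to_words(spectrum)
print(words)
# ['peak@89.04', 'peak@147.06', 'peak@147.06', 'peak@303.05']
```

With this encoding, one spectrum plays the role of a ‘document’, and any text-based embedding method can be applied to a collection of spectra unchanged.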
Spectral similarity measures: NLP/word2vec-based method
The ‘document’ vector of a spectrum is the sum of its ‘word’ (peak) vectors:
$$V_{\mathrm{spectrum1}} = V_{A} + V_{B} + V_{A'} + V_{B'} + \dots$$
$$V_{\mathrm{spectrum2}} = V_{Aa} + V_{Bb} + V_{A'} + V_{Bb'} + \dots$$
$$\mathrm{Similarity} = \cos(\theta) = \frac{V_{\mathrm{spectrum1}} \cdot V_{\mathrm{spectrum2}}}{\lVert V_{\mathrm{spectrum1}} \rVert \, \lVert V_{\mathrm{spectrum2}} \rVert}$$
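The summation and cosine steps on this slide can be sketched as follows. The two-dimensional word vectors are invented for illustration (real embeddings would be learned and have many more dimensions), and the helper names are hypothetical.

```python
import math


def spectrum_vector(words, word_vectors):
    """Sum the vectors of all 'words' (peaks) in one spectrum."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-D word vectors: 'Aa' lies close to 'A', 'Bb' close to 'B'.
word_vectors = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "Aa": [0.9, 0.1],
    "Bb": [0.1, 0.9],
}

v1 = spectrum_vector(["A", "B"], word_vectors)
v2 = spectrum_vector(["Aa", "Bb"], word_vectors)  # related but different peaks
v3 = spectrum_vector(["A", "A"], word_vectors)    # only 'A'-type peaks

sim_12 = cosine_similarity(v1, v2)
sim_13 = cosine_similarity(v1, v3)
print(round(sim_12, 3), round(sim_13, 3))
```

In this toy setup spectra 1 and 2 share no identical peak, yet they score close to 1 because their peaks have similar vectors, while spectrum 3 scores lower. The sketch does not guard against all-zero vectors, which would need handling in real code.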
Spectral similarity measures: evaluation
Dataset: 11,000 spectra with known molecular structures.
Figure: example (fake) spectra annotated with similarity scores 0.23, 0.13, and 0.85.
Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (classical score)
Figure: spectrum-by-spectrum score matrix (spectrum IDs 0, 1, 2, …) and a histogram of molecular similarity scores (circular fingerprint: Morgan3 / ECFP6) for the 10,000 highest ‘classical’ scores (scores > 0.998), with regions labeled ‘low molecular similarity’ and ‘high molecular similarity’ and an annotation of 16%.
Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (NLP-based score)
Figure: spectrum-by-spectrum score matrix (spectrum IDs 0, 1, 2, …) and a histogram of molecular similarity scores (circular fingerprint: Morgan3 / ECFP6) for the 10,000 highest NLP-based scores (scores > 0.84), with regions labeled ‘low molecular similarity’ and ‘high molecular similarity’ and an annotation of 73%.
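One way to summarize such a comparison is to take the best-scoring spectrum pairs and ask what fraction of them also has high molecular similarity. The sketch below shows that idea only; the function name, the toy pairs, and the 0.5 threshold are assumptions, and this is not necessarily how the percentages on the slides were computed.

```python
def fraction_high_similarity(pairs, top_n, threshold):
    """Among the top_n pairs by spectral score, return the fraction whose
    molecular (reference) similarity is at or above the threshold.

    pairs: list of (spectral_score, molecular_similarity) tuples.
    """
    top = sorted(pairs, key=lambda pair: pair[0], reverse=True)[:top_n]
    high = sum(1 for _, mol_sim in top if mol_sim >= threshold)
    return high / len(top)


# Invented (spectral score, molecular similarity) pairs for illustration.
pairs = [(0.99, 0.9), (0.95, 0.2), (0.90, 0.8), (0.50, 0.95), (0.10, 0.1)]

frac = fraction_high_similarity(pairs, top_n=4, threshold=0.5)
print(frac)  # 0.75
```

Running the same function on the scores from two different spectral similarity measures gives directly comparable numbers, as long as `top_n` and `threshold` are held fixed.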
Spectral similarity measures: examples
Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates according to molecular networking similarity; the figure carries two ‘bad’ annotations.
Spectral similarity measures: examples
Figure: query molecule (spectrum ID: 3351) and its 9 closest candidates according to Word2vec-based spectral similarity.
RSEs creating unique links
• RSEs work in teams with a broad range of expertise and backgrounds.
• RSEs work on projects in different scientific domains.
→ This creates opportunities unlike anywhere else in the academic setting!
- Transfer methods/techniques between domains.
- Spot potential synergies between (sub-)fields.
Interested in Research Software?
The Netherlands eScience Center is the Dutch national center of excellence for the development and application of research software to advance academic research.
Join the team! n.renaud@esciencecenter.nl
Florian Huber: @me_datapoint
Carlos Martinez-Ortiz: @neocarlitos
+31 (0)20 460 4770 | www.esciencecenter.nl | blog.esciencecenter.nl