Linking biological data using data science and cross-disciplinary software development
Florian Huber (1), Justin van der Hooft (2), Simon Rogers (3), Marnix Medema (2), Lars Ridder (1)
(1) Netherlands eScience Center; (2) Bioinformatics Group, Wageningen University; (3) School of Computing Science, University of Glasgow
de-RSE conference, Potsdam, 05/06/2019

Breaking down scientific monocultures by cross-disciplinary software development

talk by Alys Brett Florian Huber | @me_datapoint | de-RSE 2019

We signal challenges and opportunities at the intersection of software and academic research Photography: Elodie Burrillon

Our technological expertise areas
Big data analytics: Scientific visualization, Machine learning, Information retrieval, Computer vision, Information visualization, Text mining
Efficient computing: Low power computing, Accelerated computing, Orchestrated computing, High performance computing, Distributed computing
Optimized data handling: Databases, Linked data, Handling sensor data, Information integration, Data assimilation

What do we do?
• Research software
• Link between researchers and IT infrastructure
• Data stewards/data scientists
• Cross-disciplinary transfer

Example project: Integrated ‘omics’ analysis
NL eScience Center
Medema lab - Wageningen UR, NL
UCSD: Madeleine Ernst, Pieter Dorrestein
Glasgow University: Simon Rogers, Andrew Ramsay, Grimur Hjorleifsson Eldjar

secondary metabolites

DNA mass spectra HMM (Hidden Markov Model) + manually written rules

Mass spectrometry and fragmentation
(Figure: Ionization, Mass Separation with an MS1 spectrum over m/z, Mass Trapping, and Mass Detection with an MS2 spectrum over m/z.)
Fragments to puzzle the metabolite structure

Bacteria, fungi, and plants produce a large and diverse arsenal of high-value molecules: rapamycin (immunosuppressant), vancomycin (antibiotic), doxorubicin (chemotherapeutic agent), spinosad (insecticide), pneumocandin (antifungal), lovastatin (cholesterol-lowering agent).
(Figure: mass spectrometry fragmentation spectrum.)
The challenge is the large-scale coupling of spectral data to molecular structures of known and especially novel natural product molecules.

Spectral similarity: but how similar are they?

How similar are they? What does similar mean?
…likes cake with a cappuccino.
…loves to have a cookie and a coffee.
Number of words? Number of characters? Grammatical structure? Topic? Meaning? Style? Phonetic structure?

‘word’: a single term within a phrase.
‘sentence’ (or ‘document’): a whole phrase, such as “…likes cake with a cappuccino.” or “…loves to have a cookie and a coffee.”

Count how often ‘words’ co-occur (find word ‘context’). The counts form an N x N matrix over all words in the corpus, where N is the number of words in the dictionary.
(Figure: excerpt of the matrix for the words ‘monster’, ‘cake’, ‘cookie’, and ‘sweet’, with example counts of 0, 9, 17, and 24.)
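
The counting step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `cooccurrence_counts` is hypothetical, the example documents are made up, and the whole document is treated as the context of each word (Word2Vec itself works with a sliding window rather than an explicit count matrix).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how often two distinct words appear in the same document."""
    counts = Counter()
    for words in documents:
        # Each unordered pair of distinct words is counted once per document.
        for a, b in combinations(sorted(set(words)), 2):
            counts[(a, b)] += 1
    return counts


# Toy corpus (made up for illustration).
documents = [
    ["likes", "cake", "with", "a", "cappuccino"],
    ["loves", "to", "have", "a", "cookie", "and", "a", "coffee"],
    ["sweet", "cake", "and", "sweet", "cookie"],
]
counts = cooccurrence_counts(documents)
```

Pairs are stored with the alphabetically smaller word first, so `counts[("cake", "cookie")]` holds the cake/cookie entry of the symmetric matrix. Pairs that never co-occur simply return 0.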

‘Word2Vec’ → lower-dimensional context vector
(Figure: factorization of the co-occurrence matrix gives each word a vector, e.g. V_cake and V_cookie.)

NLP → metabolomics: use peaks as words. The peak positions, e.g. m(Aa), m(A), m(A’’), are the ‘words’.
(Figure: the same co-occurrence matrix, now over peak positions, yields peak vectors such as V_A and V_Aa.)
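
One possible way to turn peak positions into ‘words’ is to round each m/z value and format it as a string token. The sketch below assumes exactly that: the helper `peaks_to_words`, the `peak@` token format, the rounding to two decimals, and the example m/z values are all hypothetical and not taken from the talk. Peak intensities are ignored here.

```python
def peaks_to_words(mz_values, decimals=2):
    """Turn fragment peak positions (m/z) into string tokens ('words')."""
    return [f"peak@{mz:.{decimals}f}" for mz in mz_values]


# One spectrum becomes one 'document': a list of peak 'words'.
spectrum = [105.0701, 163.0389, 289.0712]  # made-up m/z values
document = peaks_to_words(spectrum)
```

With spectra represented as lists of tokens, the same tooling used for text (co-occurrence counting, word embeddings) can be applied without modification.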

Spectral similarity measures. NLP/word2vec-based method: the ‘document’ vector of a spectrum is the sum of its ‘word’ vectors.
$V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$
$V_{\text{spectrum2}} = V_{Aa} + V_{Bb} + V_{A'} + V_{Bb'} + \dots$
$\text{Similarity} = \cos(\theta) = \frac{V_{\text{spectrum1}} \cdot V_{\text{spectrum2}}}{\lVert V_{\text{spectrum1}} \rVert \, \lVert V_{\text{spectrum2}} \rVert}$
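
A minimal sketch of the two steps on this slide, summing ‘word’ vectors into a spectrum vector and comparing two spectra by cosine similarity. The function names and the two-dimensional toy vectors are assumptions for illustration; real embeddings would come from a trained model and have many more dimensions.

```python
import math


def spectrum_vector(words, word_vectors):
    """Sum the vectors of all 'words' (peaks) in one spectrum."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vectors[word]):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """cos(theta) between two vectors; assumes neither is the zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 'word' vectors (made up); related peaks get similar vectors.
word_vectors = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "Aa": [0.9, 0.1],
    "Bb": [0.2, 0.8],
}
v_spectrum1 = spectrum_vector(["A", "B"], word_vectors)
v_spectrum2 = spectrum_vector(["Aa", "Bb"], word_vectors)
similarity = cosine_similarity(v_spectrum1, v_spectrum2)
```

Because the score depends on the word vectors rather than on exact peak matches, two spectra can score highly even when none of their peaks coincide, as long as their peaks tend to occur in similar contexts.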

Spectral similarity measures: evaluation. Dataset: 11,000 spectra with known molecular structures.
(Figure: example similarity scores of 0.23, 0.13, and 0.85 between fake spectra.)

Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (classical score)
(Figure: matrix of molecular similarity scores between spectra, indexed by spectrum ID on both axes and computed from circular fingerprints (Morgan3 / ECFP6); histogram for the 10,000 highest ‘classical’ scores*, ranging from low to high molecular similarity and annotated with 16%.)
* = scores > 0.998

Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (NLP-based score)
(Figure: matrix of molecular similarity scores between spectra, indexed by spectrum ID on both axes and computed from circular fingerprints (Morgan3 / ECFP6); histogram for the 10,000 highest NLP-based scores*, ranging from low to high molecular similarity and annotated with 73%.)
* = scores > 0.84
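
The evaluation idea behind these histograms can be sketched as follows: take the best-scoring spectrum pairs under a given spectral similarity measure and check how many of them also have high molecular similarity. Everything specific here is an assumption for illustration: the slides name the fingerprint type but not the comparison metric, so Tanimoto similarity on sets of fingerprint on-bits is assumed, and the function names, the threshold, and the toy data are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


def fraction_high_similarity(spectral_scores, fingerprints, top_n, threshold):
    """Share of the top_n spectral-score pairs whose molecular similarity
    is at least `threshold`."""
    best = sorted(spectral_scores, key=spectral_scores.get, reverse=True)[:top_n]
    hits = sum(
        1 for i, j in best
        if tanimoto(fingerprints[i], fingerprints[j]) >= threshold
    )
    return hits / len(best)


# Toy data (made up): fingerprint on-bits per spectrum ID and spectral
# similarity scores per pair of spectrum IDs.
fingerprints = {0: {1, 2, 3, 4}, 1: {1, 2, 3, 5}, 2: {7, 8, 9}}
spectral_scores = {(0, 1): 0.95, (0, 2): 0.90, (1, 2): 0.10}
share = fraction_high_similarity(
    spectral_scores, fingerprints, top_n=2, threshold=0.5
)
```

Running the same function with two different spectral similarity measures on the same dataset gives directly comparable numbers, which is the kind of comparison the two histogram slides make.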

Spectral similarity measures: examples. Query molecule (spectrum ID: 3351) and its 9 closest candidates according to molecular networking similarity.
(Figure: candidate structures 1 to 9; two of them are marked ‘bad’.)

Spectral similarity measures: examples. Query molecule (spectrum ID: 3351) and its 9 closest candidates according to Word2vec-based spectral similarity.
(Figure: candidate structures 1 to 9.)

RSEs creating unique links
• RSEs work in teams with a broad range of expertise and backgrounds.
• RSEs work on projects in different scientific domains.
→ This creates opportunities unlike anywhere else in the academic setting!
- Transfer methods/techniques between domains.
- Spot potential synergies between (sub-)fields.

Interested in Research Software?
The Netherlands eScience Center is the Dutch national center of excellence for the development and application of research software to advance academic research.
Join the team! n.renaud@esciencecenter.nl
Florian Huber: @me_datapoint
Carlos Martinez-Ortiz: @neocarlitos
+31 (0)20 460 4770 | www.esciencecenter.nl | blog.esciencecenter.nl
