
Linking biological data using data science and cross-disciplinary software development



  1. Linking biological data using data science and cross-disciplinary software development. Florian Huber (1), Justin van der Hooft (2), Simon Rogers (3), Marnix Medema (2), Lars Ridder (1). (1) Netherlands eScience Center; (2) Bioinformatics Group, Wageningen University; (3) School of Computing Science, University of Glasgow. de-RSE conference, Potsdam, 05/06/2019.

  2. Breaking down scientific monocultures by cross-disciplinary software development. Florian Huber (1), Justin van der Hooft (2), Simon Rogers (3), Marnix Medema (2), Lars Ridder (1). (1) Netherlands eScience Center; (2) Bioinformatics Group, Wageningen University; (3) School of Computing Science, University of Glasgow. de-RSE conference, Potsdam, 05/06/2019.

  3. talk by Alys Brett Florian Huber | @me_datapoint | de-RSE 2019

  4. Florian Huber | @me_datapoint | de-RSE 2019

  5. We signal challenges and opportunities at the intersection of software and academic research. (Photography: Elodie Burrillon)

  6. Our technological expertise areas. Big data analytics: scientific visualization, machine learning, information retrieval, computer vision, information visualization, text mining. Efficient computing: low power computing, accelerated computing, orchestrated computing, high performance computing, distributed computing. Optimized data handling: databases, linked data, handling sensor data, information integration, data assimilation.

  7. What do we do? Research software; link between researchers and IT infrastructure; data stewards/data scientists; cross-disciplinary transfer.

  8. Example project: Integrated ‘omics’ analysis. NL eScience Center; Medema lab, Wageningen UR, NL; UCSD: Madeleine Ernst, Pieter Dorrestein; Glasgow University: Simon Rogers, Andrew Ramsay, Grimur Hjorleifsson Eldjar

  9. Secondary metabolites

  10. DNA mass spectra HMM (Hidden Markov Model) + manually written rules


  13. Mass spectrometry and fragmentation (diagram built up over slides 13 to 15): ionization, mass separation (MS1 spectrum, m/z axis), mass trapping, mass detection (MS2 spectrum, m/z axis). Fragments are used to puzzle out the metabolite structure.

  16. Bacteria, fungi, and plants produce a large and diverse arsenal of high-value molecules: rapamycin (immunosuppressant), vancomycin (antibiotic), doxorubicin (chemotherapeutic agent), spinosad (insecticide), pneumocandin (antifungal), lovastatin (cholesterol-lowering agent). (Figure: mass spectrometry fragmentation spectrum.) The challenge is large-scale coupling of spectral data to molecular structures of known and especially novel natural product molecules.

  17. But… how similar are they? Spectral similarity.


  20. How similar are they? What does similar mean? Example sentences: ‘…likes cake with a cappuccino.’ and ‘…loves to have a cookie and a coffee.’ Possible criteria: number of words? number of characters? grammatical structure? topic? meaning? style? phonetic structure?
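
The question on this slide can be made concrete with a minimal sketch: a purely surface-level measure (here Jaccard overlap of word sets, chosen only for illustration and not taken from the talk) rates the two example sentences as almost unrelated, although their meaning is close. The helper names `tokenize` and `jaccard` are hypothetical.

```python
def tokenize(sentence):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?…")
        if word:
            tokens.append(word)
    return tokens


def jaccard(words_a, words_b):
    """Share of distinct words that both word lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# The two example sentences from the slide.
t1 = tokenize("…likes cake with a cappuccino.")
t2 = tokenize("…loves to have a cookie and a coffee.")

# Only the word "a" is shared, so the overlap is very low.
word_overlap = jaccard(t1, t2)
```

A measure that captures topic or meaning needs some notion of which words appear in similar contexts, which is what the following slides build up.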

  21. Terminology: each token is a ‘word’; each example (‘…likes cake with a cappuccino.’, ‘…loves to have a cookie and a coffee.’) is a ‘sentence’ (or ‘document’).

  22. Count how often ‘words’ co-occur (find word ‘context’). Result: an N x N matrix over all words in the corpus, where N is the number of words in the dictionary. (Matrix figure with example words cake, cookie, sweet, monster; e.g. cake/sweet: 24, cookie/sweet: 17.)
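
A minimal sketch of the counting step, assuming that "co-occur" means "appear in the same document" (a real pipeline would more likely use a sliding context window). The toy corpus and the function names `cooccurrence_counts` and `to_matrix` are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, in how many documents both occur."""
    counts = Counter()
    for doc in documents:
        # sorted(set(...)) gives each pair once per document, in a fixed order.
        for a, b in combinations(sorted(set(doc)), 2):
            counts[(a, b)] += 1
    return counts


def to_matrix(counts, vocabulary):
    """Build the symmetric N x N co-occurrence matrix as nested lists."""
    index = {word: i for i, word in enumerate(vocabulary)}
    n = len(vocabulary)
    matrix = [[0] * n for _ in range(n)]
    for (a, b), count in counts.items():
        i, j = index[a], index[b]
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix


# Toy corpus (hypothetical), each document already split into words.
corpus = [
    ["sweet", "cake"],
    ["sweet", "cookie"],
    ["cookie", "monster"],
    ["sweet", "cake", "cookie"],
]
vocabulary = sorted({word for doc in corpus for word in doc})
counts = cooccurrence_counts(corpus)
matrix = to_matrix(counts, vocabulary)
```

Each row of `matrix` can be read as a raw, high-dimensional "context" description of one word.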

  23. ‘Word2Vec’ → lower-dimensional context vector: the word co-occurrence matrix (words such as cake, cookie, sweet, monster) is factorized into lower-dimensional matrices.

  24. ‘Word2Vec’ → lower-dimensional context vector: each word gets its own vector, e.g. V_cookie and V_cake.

  25. NLP → metabolomics: use peaks as words. Peak positions such as m(Aa), m(A), m(A’’) are the ‘words’; the same procedure yields peak vectors such as V_A and V_Aa.
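
One way to picture "peaks as words" is to round each peak position to a fixed number of decimals and turn it into a string token. The sketch below assumes a spectrum is a list of `(m/z, intensity)` pairs; the token format `peak@…`, the rounding precision, and the intensity cut-off are illustrative assumptions, not details given in the slides.

```python
def spectrum_to_document(peaks, decimals=1, min_intensity=0.0):
    """Convert (mz, intensity) peaks into a list of 'word' tokens.

    Peaks below min_intensity are dropped; the remaining peak positions
    are rounded to `decimals` places so that near-identical masses map
    to the same word.
    """
    words = []
    for mz, intensity in sorted(peaks):
        if intensity >= min_intensity:
            words.append(f"peak@{mz:.{decimals}f}")
    return words


# Made-up example spectrum: (m/z, relative intensity).
spectrum = [(105.07, 0.4), (77.04, 1.0), (51.02, 0.05)]
document = spectrum_to_document(spectrum, decimals=1, min_intensity=0.1)
```

With spectra expressed as documents of such tokens, the co-occurrence and embedding steps from the text example carry over unchanged.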

  26. Spectral similarity measures: NLP/word2vec-based method. The ‘document’ vector of a spectrum is the sum of its ‘word’ vectors: $V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$

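  27. Spectral similarity measures: NLP/word2vec-based method. The ‘document’ vector of a spectrum is the sum of its ‘word’ vectors: $V_{\text{spectrum1}} = V_A + V_B + V_{A'} + V_{B'} + \dots$ and $V_{\text{spectrum2}} = V_{Aa} + V_{Bb} + V_{A'} + V_{Bb'} + \dots$ The similarity is the cosine of the angle $\theta$ between the two: $\text{Similarity} = \cos(\theta) = \dfrac{V_{\text{spectrum1}} \cdot V_{\text{spectrum2}}}{\lVert V_{\text{spectrum1}} \rVert \, \lVert V_{\text{spectrum2}} \rVert}$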
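
The two steps on this slide (sum the word vectors, then take the cosine) can be sketched in a few lines. The 2-D word vectors below are invented for illustration; in practice they would come from a trained embedding model. Skipping unknown words is an assumption of this sketch.

```python
import math


def document_vector(words, word_vectors):
    """Sum the vectors of all known words in a document."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # assumption: words without a vector are ignored
        for i, value in enumerate(vec):
            total[i] += value
    return total


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|); defined as 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors: "Aa" lies close to "A", and "Bb" close to "B".
word_vectors = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "Aa": [0.9, 0.1],
    "Bb": [0.1, 0.9],
}
v_spectrum1 = document_vector(["A", "B"], word_vectors)
v_spectrum2 = document_vector(["Aa", "Bb"], word_vectors)
similarity = cosine_similarity(v_spectrum1, v_spectrum2)
```

In this toy case the two spectra share no identical peak, yet the score is close to 1 because their peaks have similar vectors.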

  28. Spectral similarity measures: evaluation. Dataset: 11,000 spectra with known molecular structures. (Figure: example similarity scores of 0.23, 0.13, and 0.85 between illustrative fake spectra.)

  29. Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (classical score). The 10,000 highest ‘classical’ scores are those > 0.998. Histogram regions are labelled ‘low molecular similarity’ and ‘high molecular similarity’; the slide annotates 16%. (Matrix figure: spectra (ID) against spectra (ID).) Molecular similarity scores are based on a circular fingerprint (Morgan3 / ECFP6).

  30. Molecular similarity scores: histogram of reference scores for the 10,000 best-scoring pairs (NLP-based score). The 10,000 highest NLP-based scores are those > 0.84. Histogram regions are labelled ‘low molecular similarity’ and ‘high molecular similarity’; the slide annotates 73%. (Matrix figure: spectra (ID) against spectra (ID).) Molecular similarity scores are based on a circular fingerprint (Morgan3 / ECFP6).
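
A rough sketch of how such a reference score could be computed and summarized. The slides name the fingerprint type (Morgan3 / ECFP6) but not the comparison measure; Tanimoto similarity is assumed here because it is a common choice for fingerprints. The fingerprints are made-up sets of "on" bit positions (a real workflow would generate them with a cheminformatics toolkit), and the 0.5 cut-off for "high" similarity is hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def fraction_above(scores, threshold):
    """Share of scores at or above the threshold."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up fingerprints for three molecules.
fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 9, 40}
fp3 = {2, 3}

# Reference scores for the spectrum pairs that a spectral measure ranked highest.
reference_scores = [tanimoto(fp1, fp2), tanimoto(fp1, fp3), tanimoto(fp2, fp3)]
share_high = fraction_above(reference_scores, threshold=0.5)
```

Comparing `share_high` between two spectral similarity measures is one way to express how often their top-ranked pairs are also structurally similar.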

  31. Spectral similarity measures: examples. Query molecule (spectrum ID: 3351) and its 9 closest candidates, numbered 1 to 9, according to molecular networking similarity. Two of the candidates are marked ‘bad’.

  32. Spectral similarity measures: examples. Query molecule (spectrum ID: 3351) and its 9 closest candidates, numbered 1 to 9, according to Word2vec-based spectral similarity.

  33. RSEs creating unique links. RSEs work in teams with a broad range of expertise and backgrounds, and on projects from different scientific domains. → This creates opportunities unlike anywhere else in the academic setting: transferring methods and techniques between domains, and spotting potential synergies between (sub-)fields.

  34. Interested in research software? The Netherlands eScience Center is the Dutch national center of excellence for the development and application of research software to advance academic research. Join the team! Contact: n.renaud@esciencecenter.nl, +31 (0)20 460 4770, www.esciencecenter.nl, blog.esciencecenter.nl. Florian Huber (@me_datapoint), Carlos Martinez-Ortiz (@neocarlitos).
