Non-targeted analysis supported by data and cheminformatics

  1. Non-targeted analysis supported by data and cheminformatics delivered via the US EPA CompTox Chemicals Dashboard
      Antony Williams (http://www.orcid.org/0000-0002-2668-4821), Alex Chao, Tom Transue, Tommy Cathey, Elin Ulrich and Jon Sobus
      1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC
      2) Oak Ridge Institute for Science and Education (ORISE) Research Participant, RTP, NC
      3) GDIT, Research Triangle Park, North Carolina, United States
      4) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC
      The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA
      August 2019 ACS Fall Meeting, San Diego

  2. An intro to the Dashboard
      • Freely available web-based database from the National Center for Computational Toxicology
      • Providing data for 875,000 substances including
        – Experimental and predicted physicochemical properties
        – In vivo toxicity data harvested from dozens of public resources
        – In vitro bioactivity data for thousands of chemicals and assays
        – Exposure data including chemicals in consumer products
        – Real time predictions for >20 physchem and toxicological endpoints
      • Dashboard is used by mass spectrometrists for chemical identification
      • A quick view of general capabilities…

  3. CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard): 875k Chemical Substances

  4. Detailed Chemical Pages

  5. Access to Chemical Hazard Data

  6. Sources of Exposure to Chemicals

  7. Link Access: links based on chemical identifiers to dozens of online resources, including analytical data

  8. MassBank of North America https://mona.fiehnlab.ucdavis.edu

  9. “MS-ready” structures

  10. Overview of MS-Ready Structures
      • All structure-based chemical substances are algorithmically processed to
        – Split multicomponent chemicals into individual structures
        – Desalt and neutralize individual structures
        – Remove stereochemical bonds from all chemicals
      • MS-Ready structures are then mapped to original substances to provide a path from chemicals detected by mass spectrometry back to the original substances
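
The processing steps on this slide can be illustrated with a rough string-level sketch. This is not the Dashboard's actual workflow, which is not shown in the slides and would need a cheminformatics toolkit. It assumes structures are given as SMILES strings; the `COUNTER_IONS` table and all function names are hypothetical, and neutralization is deliberately left out because it cannot be done reliably on raw strings.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately tiny table of fragments treated as counter-ions.
COUNTER_IONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "Br"}


def strip_stereo(smiles: str) -> str:
    # Drop tetrahedral markers (@, @@) and slash/backslash double-bond markers.
    return re.sub(r"[@/\\]", "", smiles)


def ms_ready_components(smiles: str) -> List[str]:
    # Step 1: split a multicomponent SMILES into its components.
    components = smiles.split(".")
    # Step 2: "desalt" by dropping known counter-ion fragments.
    kept = [c for c in components if c not in COUNTER_IONS]
    # Step 3: remove stereo markers and de-duplicate, preserving order.
    result: List[str] = []
    for component in kept:
        flat = strip_stereo(component)
        if flat not in result:
            result.append(flat)
    return result


def build_mapping(substances: Dict[str, str]) -> Dict[str, List[str]]:
    # Map each MS-ready form back to every original substance containing it.
    mapping: Dict[str, List[str]] = {}
    for name, smiles in substances.items():
        for component in ms_ready_components(smiles):
            mapping.setdefault(component, []).append(name)
    return mapping


example = build_mapping({
    "alanine enantiomer 1, HCl salt": "C[C@H](N)C(=O)O.Cl",
    "alanine enantiomer 2": "C[C@@H](N)C(=O)O",
})
```

In this toy example both substances collapse to the same stereo-free component, so a structure detected by mass spectrometry can be traced back to either original record.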

  11. (no text on this slide)

  12. MS-Ready Mappings from Details Page

  13. Two MS-Ready Mappings Set

  14. MS-Ready Mappings Set: all substances containing component

  15. Mass/Formula Searching and Metadata Ranking

  16. Advanced Searches: Mass Search

  17. Advanced Searches: Mass Search

  18. MS-Ready Structures for Formula Search

  19. MS-Ready Mappings
      • EXACT Formula: C10H16N2O8: 3 Hits

  20. MS-Ready Mappings
      • Same Input Formula: C10H16N2O8
      • MS-Ready Formula Search: 125 Chemicals

  21. MS-Ready Mappings
      • Exact Formula: 3 hits
      • MS-Ready Formula: 125 hits!!
        – ONLY 8 of the 125 are single-component chemicals
        – 3 are neutral compounds and 2 are charged
      • How can we rank the candidates list?
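
Why an MS-Ready formula search returns more hits than an exact formula search can be shown with a toy index. The records below are illustrative only and are not Dashboard data: C10H16N2O8 is the formula of EDTA, and its sodium salts have different whole-substance formulae but the same desalted, neutralized component. The function and variable names are hypothetical.

```python
from collections import defaultdict

# Illustrative records: (substance, whole-substance formula, MS-ready component formulae).
SUBSTANCES = [
    ("EDTA", "C10H16N2O8", ["C10H16N2O8"]),
    ("EDTA disodium salt", "C10H14N2Na2O8", ["C10H16N2O8"]),
    ("EDTA tetrasodium salt", "C10H12N2Na4O8", ["C10H16N2O8"]),
    ("Caffeine", "C8H10N4O2", ["C8H10N4O2"]),
]


def build_indexes(records):
    exact = defaultdict(list)     # whole-substance formula -> substances
    ms_ready = defaultdict(list)  # component formula -> substances containing it
    for name, formula, component_formulae in records:
        exact[formula].append(name)
        for component_formula in sorted(set(component_formulae)):
            ms_ready[component_formula].append(name)
    return exact, ms_ready


exact_index, ms_ready_index = build_indexes(SUBSTANCES)
exact_hits = exact_index["C10H16N2O8"]        # only the free acid
ms_ready_hits = ms_ready_index["C10H16N2O8"]  # free acid plus both salts
```

The same mechanism, applied to every salt, mixture and stereo variant in a large database, is what widens 3 exact hits into a much longer MS-Ready candidate list.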

  22. Candidate ranking using metadata

  23. Data Source Ranking of “known unknowns”
      • A mass and/or formula search is for an unknown chemical, but it is a known chemical contained within a reference database
      • Most likely candidate chemicals have the most associated data sources, most associated literature articles, or both
      • Diagram labels: formula C14H22N2O3 and mass 266.16304 (query); Chemical Reference Database; sorted candidate structures
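
The formula and mass shown on this slide are two views of the same query: 266.16304 is the monoisotopic mass of C14H22N2O3. A minimal sketch of that arithmetic follows, assuming standard monoisotopic masses for C, H, N and O only. The function names and the 5 ppm default tolerance are illustrative choices, not values from the slides.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes; only C, H, N, O are covered here.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825032, "N": 14.003074004, "O": 15.994914620}


def monoisotopic_mass(formula: str) -> float:
    # Sum element masses for a simple formula such as "C14H22N2O3" (no brackets or charges).
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def within_ppm(observed: float, theoretical: float, ppm: float = 5.0) -> bool:
    # True when the observed mass lies within the given ppm window of the theoretical mass.
    return abs(observed - theoretical) / theoretical * 1e6 <= ppm


query_mass = monoisotopic_mass("C14H22N2O3")
matches_slide_value = within_ppm(266.16304, query_mass)
```

An element outside the small table raises a `KeyError`, which keeps the sketch from silently returning a wrong mass.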

  24. The original ChemSpider work

  25. Is a bigger database better?
      • ChemSpider contained 26 million chemicals at the time of the original work
      • Much BIGGER today
      • Is bigger better??
      • Are there other metadata to use for ranking?

  26. Using Metadata for Ranking
      • Chosen dashboard metadata to rank candidates
        – Associated data sources
          • Lists in the underlying database (more about lists later)
          • Associated data sources in PubChem
          • Specific source types (e.g. water, surfactants, pesticides)
        – Number of associated literature articles (PubMed)
        – Chemicals in the environment: the number of products/categories containing the chemical is an important source of data (from CPDat database)

  27. Identification ranks for 1783 chemicals using multiple data streams
      • Data Sources alone rank ~75% of the chemicals as Top Hit
      • Legend: DS: Data Sources; PC: PubChem; PM: PubMed; STOFF: DB; KEMI: DB
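
Ranking by metadata can be sketched as a simple sort. The slides do not give the actual scoring scheme, so this assumes the simplest reading: order candidates by data source count and break ties on PubMed article count. The `Candidate` class, the candidate names and every count below are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    data_sources: int     # number of associated data sources (invented values below)
    pubmed_articles: int  # number of associated PubMed articles (invented values below)


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    # Most data sources first; PubMed article count breaks ties.
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_articles),
        reverse=True,
    )


ranked = rank_candidates([
    Candidate("candidate A", data_sources=3, pubmed_articles=10),
    Candidate("candidate B", data_sources=41, pubmed_articles=950),
    Candidate("candidate C", data_sources=3, pubmed_articles=120),
])
ranked_names = [c.name for c in ranked]
```

Other streams named on the slide, such as product counts from CPDat, could be added as further fields and sort keys in the same way.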

  28. Comparing Search Performance
      • When dashboard contained 720k chemicals
      • Only 3% of ChemSpider size
      • What was the comparison in performance?

  29. SAME dataset for comparison

  30. How did performance compare?
      • For the same 162 chemicals, Dashboard outperforms ChemSpider for both Mass and Formula Ranking

  31. How did performance compare?

  32. Data Quality is important
      • Data quality in free web-based databases!

  33. Public Databases require curation
      • There is significant bloating in the public databases because of lack of curation
      • The number of hits retrieved based on mass or formula searching can explode based on poorly represented chemicals, especially stereochemistry issues
      • MS-Ready structures will map back to multiple versions of “the same chemical”.

  34. Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search

  35. Comparing ChemSpider Structures

  36. Comparing ChemSpider Structures

  37. Other Searches

  38. Batch Searching mass and formula

  39. Batch Searching
      • Singleton searches are useful but we work with thousands of masses and formulae!
      • Typical questions
        – What is the list of chemicals for the formula CxHyOz?
        – What is the list of chemicals for a mass +/- error?
        – Can I get chemical lists in Excel files? In SDF files?
        – Can I include properties in the download file?
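
The "mass +/- error" question can be sketched as a batch lookup against a mass-sorted reference table. This is a local illustration, not the Dashboard's batch search: the three-entry `REFERENCE` table, the 10 ppm default and the function names are assumptions, and CSV stands in for the Excel export mentioned on the slide.

```python
import csv
import io
from bisect import bisect_left, bisect_right

# Hypothetical reference table of (monoisotopic mass, formula), sorted by mass.
REFERENCE = sorted([
    (194.08038, "C8H10N4O2"),
    (266.16304, "C14H22N2O3"),
    (292.09067, "C10H16N2O8"),
])


def batch_mass_search(query_masses, reference=REFERENCE, ppm=10.0):
    # Return {query mass: [formulae within +/- ppm]} using binary search on sorted masses.
    masses = [mass for mass, _ in reference]
    results = {}
    for query in query_masses:
        tolerance = query * ppm / 1e6
        low = bisect_left(masses, query - tolerance)
        high = bisect_right(masses, query + tolerance)
        results[query] = [formula for _, formula in reference[low:high]]
    return results


def to_csv(results) -> str:
    # Flatten the hits into CSV text that a spreadsheet program can open.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["query_mass", "formula"])
    for query, formulae in results.items():
        for formula in formulae:
            writer.writerow([query, formula])
    return buffer.getvalue()


hits = batch_mass_search([194.0805, 266.1630, 300.0])
report = to_csv(hits)
```

Sorting the reference masses once keeps each lookup logarithmic, which matters when a run produces thousands of query masses.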

  40. Batch Searching Formula/Mass

  41. Searching batches using MS-Ready Formula (or mass) searching

  42. Mass Spectrometry Related Searches

  43. Find me “related structures” Formula-Based Search

  44. Select Chemicals of Interest

  45. Find me “related structures” Based on Structure Similarity

  46. Find me “related structures” Based on Structure Similarity

  47. Find me “related structures” Structure Similarity – sort on mass

  48. Chemical Lists

  49. Chemical Lists

  50. EPAHFR: Hydraulic Fracturing

  51. PFAS lists of Chemicals

  52. Research in Progress

  53. Predicted Mass Spectra http://cfmid.wishartlab.com/
      • MS/MS spectra prediction for ESI+, ESI-, and EI
      • Predictions generated and stored for >800,000 structures, to be accessible via Dashboard

  54. Search Expt. vs. Predicted Spectra

  55. Search Expt. vs. Predicted Spectra
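
The slides do not say how experimental and predicted spectra are scored against each other. One common approach, shown here purely as an assumption, is a cosine similarity over binned peaks. Spectra are taken to be lists of `(m/z, intensity)` pairs; the 0.01 bin width, the peak values and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Spectrum = List[Tuple[float, float]]  # (m/z, intensity) pairs


def binned(spectrum: Spectrum, bin_width: float = 0.01) -> Dict[int, float]:
    # Sum intensities of peaks falling into the same m/z bin.
    bins: Dict[int, float] = {}
    for mz, intensity in spectrum:
        key = round(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine_similarity(a: Spectrum, b: Spectrum, bin_width: float = 0.01) -> float:
    # Cosine of the angle between two binned intensity vectors (0 = no shared peaks).
    va, vb = binned(a, bin_width), binned(b, bin_width)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


experimental = [(91.054, 100.0), (119.049, 40.0)]
predicted_same = [(91.054, 100.0), (119.049, 40.0)]
predicted_other = [(77.039, 100.0), (105.034, 60.0)]

score_same = cosine_similarity(experimental, predicted_same)
score_other = cosine_similarity(experimental, predicted_other)
```

Fixed-width binning can split peaks that sit near a bin edge, so a production comparison would more likely match peaks within a tolerance window.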

  56. Spectral Viewer Comparison

  57. Prototype Development

  58. Prototype Development

  59. API services and Open Data
      • Present API and web services available at https://actorws.epa.gov/actorws/ but major redevelopment is underway
      • Downloadable data available via the downloads page

  60. Web Services https://actorws.epa.gov/actorws/
      • Data in UI, JSON and XML format

  61. InChIKey to DTXCIDs https://actorws.epa.gov/actorws/dsstox/v02/msready?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N
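
The lookup URL on this slide can be assembled programmatically. The sketch below only builds and validates the request URL and makes no network call, since the slides note that these services are being redeveloped and do not show the response layout. The endpoint and the `identifier` parameter come from the slide; the function name and the InChIKey shape check are assumptions.

```python
import re
from urllib.parse import urlencode

# Endpoint as printed on the slide; it may no longer be live.
BASE_URL = "https://actorws.epa.gov/actorws/dsstox/v02/msready"

# Standard InChIKey shape: 14 letters, 10 letters, 1 letter, separated by hyphens.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def msready_url(inchikey: str) -> str:
    # Build the MS-ready lookup URL for one InChIKey, rejecting malformed keys.
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError(f"not a valid InChIKey: {inchikey!r}")
    return f"{BASE_URL}?{urlencode({'identifier': inchikey})}"


url = msready_url("UVOFGKIRTCCNKG-UHFFFAOYSA-N")
```

Validating the key before building the URL catches copy-and-paste damage, such as a stray space, before any request is sent.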

  62. Data and Services used by the Community

  63. NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236

  64. Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2

  65. MassBank mapping to Dashboard, based on Web Service lookup

  66. Conclusion
      • Dashboard access to data for ~875,000 chemicals
      • MS-Ready data facilitates structure identification
      • Related metadata facilitates candidate ranking
      • Relationship mappings and chemical lists of great utility
      • Dashboard and contents are one part of the solution
      • New developments in progress, especially API development, will be very enabling…

  67. Acknowledgements
      • IT Development team, especially Jeff Edwards and Jeremy Dunne
      • Chris Grulke for the ChemReg system
      • NERL colleagues: Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton, Alex Chao
      • Emma Schymanski, LCSB, Luxembourg
      • NORMAN Network and all contributors

  68. Contact
      Antony Williams, US EPA Office of Research and Development, National Center for Computational Toxicology
      EMAIL: Williams.Antony@epa.gov
      ORCID: https://orcid.org/0000-0002-2668-4821
