http://www.orcid.org/0000-0002-2668-4821 Non-targeted analysis supported by data and cheminformatics delivered via the US EPA CompTox Chemicals Dashboard Antony Williams, Alex Chao, Tom Transue, Tommy Cathey, Elin Ulrich and Jon Sobus 1) National Center for Computational Toxicology, U.S. Environmental Protection Agency, RTP, NC 2) Oak Ridge Institute for Science and Education (ORISE) Research Participant, RTP, NC 3) GDIT, Research Triangle Park, North Carolina, United States 4) National Exposure Research Laboratory, U.S. Environmental Protection Agency, RTP, NC The views expressed in this presentation are those of the author and do not necessarily reflect the views or policies of the U.S. EPA. August 2019 ACS Fall Meeting, San Diego
An intro to the Dashboard • Freely available web-based database from the National Center for Computational Toxicology • Provides data for 875,000 substances, including – Experimental and predicted physicochemical properties – In vivo toxicity data harvested from dozens of public resources – In vitro bioactivity data for thousands of chemicals and assays – Exposure data including chemicals in consumer products – Real-time predictions for >20 physchem and toxicological endpoints • Dashboard is used by mass spectrometrists for chemical identification • A quick view of general capabilities… 1
CompTox Chemicals Dashboard https://comptox.epa.gov/dashboard 875k Chemical Substances 2
Detailed Chemical Pages 3
Access to Chemical Hazard Data 4
Sources of Exposure to Chemicals 5
Link Access Links based on chemical identifiers to dozens of online resources – including analytical data 6
MassBank of North America https://mona.fiehnlab.ucdavis.edu 7
“MS-ready” structures 8
Overview of MS-Ready Structures • All structure-based chemical substances are algorithmically processed to – Split multicomponent chemicals into individual structures – Desalt and neutralize individual structures – Remove stereochemical bonds from all chemicals • MS-Ready structures are then mapped to original substances to provide a path between chemicals detected by mass spectrometry to original substances 9
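The processing steps on this slide can be illustrated with a toy sketch. It is not the Dashboard's implementation: it works on SMILES strings with plain string operations, the `COUNTER_IONS` set and the function names are hypothetical, and the neutralization step is omitted because it needs a real cheminformatics toolkit.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical, deliberately tiny list of components treated as salts/counter-ions.
COUNTER_IONS = {"Cl", "Br", "[Na+]", "[K+]", "[Cl-]"}


def strip_stereo(smiles: str) -> str:
    """Drop SMILES stereo markers (@, /, \\) from a single component."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


def ms_ready(smiles: str) -> List[str]:
    """Split a multicomponent SMILES, drop counter-ions, remove stereo.

    Neutralization of charged components is NOT attempted in this sketch.
    """
    components = smiles.split(".")
    kept = [c for c in components if c not in COUNTER_IONS]
    return sorted({strip_stereo(c) for c in kept})


def map_to_substances(substances: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each MS-ready form back to the original substances that contain it."""
    mapping = defaultdict(list)
    for name, smiles in substances.items():
        for form in ms_ready(smiles):
            mapping[form].append(name)
    return dict(mapping)


# Illustrative input: two stereoisomers and a hydrochloride salt of alanine.
substances = {
    "L-alanine": "C[C@H](N)C(=O)O",
    "D-alanine": "C[C@@H](N)C(=O)O",
    "L-alanine hydrochloride": "C[C@H](N)C(=O)O.Cl",
}
mapping = map_to_substances(substances)
```

In this sketch all three substances collapse onto one MS-ready form, which is the kind of one-to-many mapping the slide describes between a detected structure and the original substances.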
MS-Ready Mappings from Details Page 11
Two MS-Ready Mappings Set 12
MS-Ready Mappings Set All substances containing component 13
Mass/Formula Searching and Metadata Ranking 14
Advanced Searches Mass Search 15
Advanced Searches Mass Search 16
MS-Ready Structures for Formula Search 17
MS-Ready Mappings • Exact Formula: C10H16N2O8: 3 Hits 18
MS-Ready Mappings • Same Input Formula: C10H16N2O8 • MS Ready Formula Search: 125 Chemicals 19
MS-Ready Mappings • Exact Formula – 3 hits • MS-Ready Formula – 125 hits!! – ONLY 8 of the 125 are single component chemicals – 3 are neutral compounds and 2 are charged • How can we rank the candidates list? 20
Candidate ranking using metadata 21
Data Source Ranking of “known unknowns” • A mass and/or formula search is for an unknown chemical, but it is a known chemical contained within a reference database • Most likely candidate chemicals have the most associated data sources, most associated literature articles, or both • [Figure labels: C14H22N2O3; 266.16304; Chemical Reference Database; Sorted candidate structures] 22
The original ChemSpider work 23
Is a bigger database better? • ChemSpider contained 26 million chemicals at the time of the original work • It is much BIGGER today • Is bigger better?? • Are there other metadata to use for ranking? 24
Using Metadata for Ranking • Chosen dashboard metadata to rank candidates – Associated data sources • Lists in the underlying database (more about lists later) • Associated data sources in PubChem • Specific source types (e.g. water, surfactants, pesticides) – Number of associated literature articles (PubMed) – Chemicals in the environment – the number of products/categories containing the chemical is an important source of data (from the CPDat database) 25
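A minimal sketch of metadata-based ranking follows. It assumes each candidate carries two of the counts named on the slide (data sources and PubMed articles) and sorts by them in that order; the `Candidate` class, the tie-breaking rule, and all numbers are invented for illustration and are not the Dashboard's scoring scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    data_sources: int      # number of associated data sources (hypothetical values)
    pubmed_articles: int   # number of associated PubMed articles (hypothetical values)


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so that the best-documented chemical comes first.

    Data-source count is the primary key; PubMed count breaks ties.
    """
    return sorted(
        candidates,
        key=lambda c: (c.data_sources, c.pubmed_articles),
        reverse=True,
    )


# Invented candidates sharing one formula or mass.
hits = [
    Candidate("candidate A", data_sources=4, pubmed_articles=12),
    Candidate("candidate B", data_sources=57, pubmed_articles=900),
    Candidate("candidate C", data_sources=4, pubmed_articles=30),
]
ranked_names = [c.name for c in rank_candidates(hits)]
```

Other metadata streams from the slide, such as product counts, could be added as further sort keys in the same way.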
Identification ranks for 1783 chemicals using multiple data streams • Data Sources alone rank ~75% of the chemicals as Top Hit • Legend: DS: Data Sources; PC: PubChem; PM: PubMed; STOFF: DB; KEMI: DB 26
Comparing Search Performance • When dashboard contained 720k chemicals • Only 3% of ChemSpider size • What was the comparison in performance? 27
SAME dataset for comparison 28
How did performance compare? For the same 162 chemicals, Dashboard outperforms ChemSpider for both Mass and Formula Ranking 29
How did performance compare? 30
Data Quality is important • Data quality in free web-based databases! 31
Public Databases require curation • There is significant bloating in the public databases because of lack of curation • The number of hits retrieved based on mass or formula searching can explode based on poorly represented chemicals – especially stereochemistry issues • MS-Ready structures will map back to multiple versions of “the same chemical”. 32
Will the correct Microcystin LR Stand Up? ChemSpider Skeleton Search 33
Comparing ChemSpider Structures 34
Comparing ChemSpider Structures 35
Other Searches 36
Batch Searching mass and formula 37
Batch Searching • Singleton searches are useful, but we work with thousands of masses and formulae! • Typical questions – What is the list of chemicals for the formula CxHyOz? – What is the list of chemicals for a mass +/- error? – Can I get chemical lists in Excel files? In SDF files? – Can I include properties in the download file? 38
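The "mass +/- error" question can be sketched as below. This is an assumption-laden illustration, not the Dashboard's batch search: the function names, the ppm-based tolerance, and the two-entry `library` are hypothetical, and the atomic masses are commonly tabulated monoisotopic values for C, H, N and O only.

```python
import re
from typing import Dict, List

# Monoisotopic masses of the most abundant isotopes (only the elements used here).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}


def monoisotopic_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C14H22N2O3'."""
    mass = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return mass


def batch_search(masses: List[float], library: List[str], ppm: float) -> Dict[float, List[str]]:
    """For each query mass, list library formulae within +/- ppm of it."""
    results = {}
    for query in masses:
        tolerance = query * ppm / 1e6
        results[query] = [
            f for f in library if abs(monoisotopic_mass(f) - query) <= tolerance
        ]
    return results


# Hypothetical two-entry library built from formulae that appear in this deck.
library = ["C14H22N2O3", "C10H16N2O8"]
matches = batch_search([266.16304, 292.09067], library, ppm=5)
```

Expressing the window in ppm rather than as an absolute mass difference is a design choice here; an absolute window in Da would only change the `tolerance` line.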
Batch Searching Formula/Mass 39
Searching batches using MS-Ready Formula (or mass) searching 40
Mass Spectrometry Related Searches 41
Find me “related structures” Formula-Based Search 42
Select Chemicals of Interest 43
Find me “related structures” Based on Structure Similarity 44
Find me “related structures” Based on Structure Similarity 45
Find me “related structures” Structure Similarity – sort on mass 46
Chemical Lists 47
Chemical Lists 48
EPAHFR: Hydraulic Fracturing 49
PFAS lists of Chemicals 50
Research in Progress 51
Predicted Mass Spectra http://cfmid.wishartlab.com/ • MS/MS spectra prediction for ESI+, ESI-, and EI • Predictions generated and stored for >800,000 structures, to be accessible via Dashboard 52
Search Expt. vs. Predicted Spectra
Spectral Viewer Comparison 55
Prototype Development 56
Prototype Development 57
API services and Open Data • Present API and web services are available at https://actorws.epa.gov/actorws/, but major redevelopment is underway • Downloadable data are available via the downloads page 58
Web Services https://actorws.epa.gov/actorws/ • Data in UI, JSON and XML format 59
InChIKey to DTXCIDs https://actorws.epa.gov/actorws/dsstox/v02/msready?identifier=UVOFGKIRTCCNKG-UHFFFAOYSA-N 60
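A small sketch of how such a lookup URL might be assembled from an InChIKey is shown here. Only the endpoint path and the `identifier` parameter come from the slide; the helper name and the format check are assumptions. No request is sent, because the slides note that the services are being redeveloped and the response layout is not described.

```python
import re
from urllib.parse import urlencode

# Endpoint as shown on the slide; it may no longer resolve.
BASE_URL = "https://actorws.epa.gov/actorws/dsstox/v02/msready"

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def msready_url(inchikey: str) -> str:
    """Build the MS-ready lookup URL for one InChIKey (hypothetical helper)."""
    if not INCHIKEY_PATTERN.match(inchikey):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return f"{BASE_URL}?{urlencode({'identifier': inchikey})}"


url = msready_url("UVOFGKIRTCCNKG-UHFFFAOYSA-N")
```

Validating the key before building the URL keeps malformed identifiers out of a batch of lookups.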
Data and Services used by the Community 61
NORMAN Suspect List Exchange https://www.norman-network.com/?q=node/236 62
Integration to MetFrag in place https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0299-2 63
MassBank mapping to Dashboard Based on Web Service lookup 64
Conclusion • Dashboard access to data for ~875,000 chemicals • MS-Ready data facilitates structure identification • Related metadata facilitates candidate ranking • Relationship mappings and chemical lists of great utility • Dashboard and contents are one part of the solution • New developments in progress, especially API development, will be very enabling… 65
Acknowledgements • IT Development team – especially Jeff Edwards and Jeremy Dunne • Chris Grulke for the ChemReg system • NERL colleagues – Jon Sobus, Elin Ulrich, Mark Strynar, Seth Newton, Alex Chao • Emma Schymanski, LCSB, Luxembourg • NORMAN Network and all contributors 66
Contact Antony Williams US EPA Office of Research and Development National Center for Computational Toxicology EMAIL: Williams.Antony@epa.gov ORCID : https://orcid.org/0000-0002-2668-4821 67