How We Handle Mass Spectra NIST Mass Spectrometry Data Center
NIST/EPA/NIH Mass Spectral Library
[Figure: numbers of spectra (replicates and compounds) by library release, '78 to '02; vertical axis 0 to 200,000; sources marked Red Books, EPA, NIST]
Libraries Distributed/Year
[Figure: libraries distributed per year, '88 to '04; vertical axis 0 to 4500]
The Data
[Figure: chemical structure drawings with numbered atoms]
Connection Table
[Figure: connection table for a four-atom structure containing Cl, with single (S) and double (D) bonds]
From Structure to Spectrum: A Mass “Fragmentogram”
[Figure: electron ionization of a molecule of mass 140 u (loss of one electron), followed by fragmentation to ions of mass 125 u and 99 u]
Molecular Fingerprints
[Figure: panels labeled VX, HD, GB]
I will discuss
• Library Searching – Full and Partial Spectra
• Spectrum Purification
• Chemical Structure Representation
• Peptide Spectra Libraries
Instrument ‘Noise Signature’
[Figure: 250 hexachlorobenzene spectra from the same instrument (calibration mix); bars show quartiles]
Instrument Effects
Library Search
[Figure: unknown spectrum compared with library spectra; labels in order: unknown, MF=93, sarin, MF=68]
Spectral Similarity
$$\frac{\sum M R}{\sqrt{\sum M^2 \, \sum R^2}}$$
• M = f(Abundance), peak in measured spectrum
• R = f(Abundance), peak in reference spectrum
• Sum over all peaks
• f(Abundance):
  – Abundance
  – Abundance * m/z
  – Certainty
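A minimal sketch of one common reading of this slide: a normalized dot product (cosine) between transformed peak abundances, summed over all peaks. The function names, the exponents, and the three toy spectra are illustrative assumptions, not values from the talk or from the NIST software.

```python
import math

def transform(spectrum, mz_power=0.0, abundance_power=1.0):
    """Map {m/z: abundance} to {m/z: f(abundance)}.

    mz_power=0 gives plain abundance; mz_power=1 gives abundance * m/z.
    Both exponents are illustrative parameters.
    """
    return {mz: (mz ** mz_power) * (ab ** abundance_power)
            for mz, ab in spectrum.items()}

def similarity(measured, reference, mz_power=0.0, abundance_power=1.0):
    """Cosine similarity between two spectra after the same transform."""
    m = transform(measured, mz_power, abundance_power)
    r = transform(reference, mz_power, abundance_power)
    peaks = set(m) | set(r)  # sum over all peaks in either spectrum
    dot = sum(m.get(p, 0.0) * r.get(p, 0.0) for p in peaks)
    norm = math.sqrt(sum(v * v for v in m.values()) *
                     sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

# Made-up spectra, for illustration only.
unknown = {99: 100.0, 125: 35.0, 81: 20.0}
library = {99: 100.0, 125: 30.0, 81: 25.0}
other = {43: 100.0, 57: 80.0, 99: 5.0}

print(round(similarity(unknown, library), 3))
print(round(similarity(unknown, other), 3))
print(round(similarity(unknown, library, mz_power=1.0), 3))
```

Setting `mz_power=1.0` gives the "Abundance * m/z" option, which puts more weight on high-mass peaks.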
Algorithm Performance
12,592 Replicate Spectra against NIST Library (Percent Correct)

Model                    Top Hit   Top 2 Hits   Top 3 Hits
Correlation – Weighted   74.9      86.9         91.7
Correlation              72.9      85.9         90.8
Euclidean Distance       71.9      83.9         88.9
Absolute Distance        67.9      80.3         85.5
PBM - Published          64.7      78.4         84.8
Hites/Hertz/Biemann      64.4      77.2         83.2
FP/FN Above Given Match Factor for NIST Library Spectra
[Figure: fraction recovered (0.0 to 1.0) vs. match factor threshold (0 to 100); false negatives from 21,000 replicate spectra, false positives from 108,000 compounds]
FP/FN Expanded View
[Figure: fraction recovered vs. match factor (80 to 100); curves labeled m/z weighting, no weighting, FN, and FP x 10,000]
FP Depends on Spectrum Uniqueness
[Figure: FP (0 to 200) vs. match factor (0 to 100) for decalin, decane, TMB, HCB, DMPB, sarin, and malathion]
HCB = hexachlorobenzene; DMPB = dimethylphenobarbital; TMB = 1,2,3-trimethylbenzene
Multiple Ion Monitoring
• What is it?
  – Use 2-5 major peaks in the spectrum of the target
• 10 to 100 times more sensitive
• What’s the problem?
  – Major target peaks can be matched by minor sample peaks
• What we have done:
  – Examined the risk, using the library as a source of potential false positive IDs
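The risk assessment described above can be sketched as a scan over a library: count how many non-target spectra contain every monitored ion at or above some fraction of their own base peak. The matching rule, the function names, and the four toy spectra are assumptions made for illustration; the actual criteria used in the study are not given on the slide.

```python
def matches_target_ions(spectrum, target_ions, min_ratio):
    """True if every target m/z is present at >= min_ratio * base peak."""
    base = max(spectrum.values())
    return all(spectrum.get(mz, 0.0) >= min_ratio * base for mz in target_ions)

def count_false_positives(library, target_name, target_ions, min_ratio):
    """Count library spectra, other than the target, that pass the ion test."""
    return sum(
        1
        for name, spectrum in library.items()
        if name != target_name
        and matches_target_ions(spectrum, target_ions, min_ratio)
    )

# Made-up library of {m/z: abundance} spectra, for illustration only.
toy_library = {
    "target": {99: 100, 125: 35, 81: 20},
    "A": {99: 10, 125: 4, 81: 3, 43: 100},
    "B": {57: 100, 71: 60, 99: 2},
    "C": {99: 50, 125: 1, 41: 100},
}

print(count_false_positives(toy_library, "target", [99, 125], 1 / 32))
print(count_false_positives(toy_library, "target", [99, 125], 1 / 128))
print(count_false_positives(toy_library, "target", [99, 125, 81], 1 / 128))
```

In this toy data, accepting weaker matching peaks (a smaller `min_ratio`) admits more false positives, and monitoring a third ion removes one of them.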
False Positive Risk vs Number of Peaks Used
[Figure 1: median FPP (FP/spectrum, 0.00001 to 1) vs. number of peaks (1 to 5); curves labeled BMA and by abundance ratio 1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1; abundance ratio = biggest search peak / matching peak in FP]
Mass Spectral Peak Occurrences are Correlated
[Figure: relative joint occurrence probabilities (0 to 100) vs. difference in peak position (m/z difference, 0 to 100), for small, medium, and big peaks]
FP Observed and Computed (from individual peak probabilities)
[Figure: actual FP (0.1 to 10000) vs. observed FPP percentile/10 (1 to 10); reference line labeled No Peak Correlation]
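As a rough illustration of what "computed from individual peak probabilities" could mean under a no-correlation assumption: if each monitored peak occurs in a random library spectrum with some probability, and occurrences were independent, the expected number of false positives would be the library size times the product of those probabilities. The function name and the probabilities are hypothetical; the library size reuses the 108,000 compounds quoted earlier in the talk.

```python
import math

def expected_false_positives(peak_probabilities, library_size):
    """Expected FP count if peak occurrences were statistically independent."""
    return library_size * math.prod(peak_probabilities)

# Hypothetical occurrence probabilities for two, then three, monitored peaks.
print(expected_false_positives([0.1, 0.05], 108000))
print(expected_false_positives([0.1, 0.05, 0.02], 108000))
```

Because peak occurrences are correlated, an estimate of this kind should be expected to differ from the observed count.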
Search Results Depend on Search Spectrum Quality AMDIS: http://chemdata.nist.gov
Real Data
[Figure: total ion chromatogram and a mass spectrum (scan)]
Chromatogram with single ion
AMDIS Analysis of Data
[Figure: AMDIS match = 81, shown with a structure drawing containing O, P, and F]
Order of Analysis
• Noise Analysis – find ‘Noise Factor’
• Find and quantify maximizing ions
• Combine to create ‘Model Peak’
• Use Model Peak shape (intensity vs time) to purify spectra
• Find best matching library spectrum
Derive Noise Factor
$\mathrm{Noise} = K_{\mathrm{noise}} \sqrt{\mathrm{Intensity}}$
Finding Possible Peaks for Each m/z
[Figure: labels: Maximum rate; Scan number; n]
Find Possible Compounds: Do Ions Maximize at Same Time?
[Figure: ion maxima between scans 1 and 2, on an axis marked in tenths of a scan (.0 to .9)]
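A minimal sketch of these two steps, under simplifying assumptions: each m/z has an ion chromatogram (abundance per scan), a "possible peak" is a local maximum above a height threshold, and ions are grouped when they maximize at the same whole scan. The function names, the `min_height` threshold, and the traces are hypothetical; grouping by whole scan is cruder than locating each maximum to a fraction of a scan.

```python
def find_maxima(abundances, min_height=0.0):
    """Return scan indices where one ion chromatogram has a local maximum."""
    peaks = []
    for n in range(1, len(abundances) - 1):
        rises = abundances[n] > abundances[n - 1]
        falls = abundances[n] >= abundances[n + 1]
        if rises and falls and abundances[n] >= min_height:
            peaks.append(n)
    return peaks

def maxima_by_mz(ion_chromatograms, min_height=0.0):
    """Apply find_maxima to every m/z trace."""
    return {mz: find_maxima(trace, min_height)
            for mz, trace in ion_chromatograms.items()}

def group_by_apex(maxima):
    """Group m/z values whose chromatograms maximize at the same scan."""
    groups = {}
    for mz, scans in maxima.items():
        for scan in scans:
            groups.setdefault(scan, []).append(mz)
    return {scan: sorted(mzs) for scan, mzs in groups.items()}

# Made-up ion chromatograms (abundance per scan), for illustration only.
traces = {
    99: [0, 2, 10, 30, 12, 3, 0],
    125: [0, 2, 6, 18, 7, 2, 0],
    43: [5, 6, 5, 20, 40, 22, 6],
}

maxima = maxima_by_mz(traces, min_height=10)
print(maxima)
print(group_by_apex(maxima))
```

Here m/z 99 and 125 maximize together and would be candidates for one component, while m/z 43 maximizes one scan later.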
Separate the Components
[Figure: ion maxima with abundances on a tenth-of-scan axis; candidate groupings marked yes, yes, NO]
A ‘Model Peak’ Provides Shape
[Figure: same display as the previous slide]
The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.
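The model-shape definition can be sketched as follows. Two things here are assumptions: the `sharpness` function is a hypothetical stand-in (the slide does not define sharpness), and "within 75% of the maximum" is read as "at least 0.75 times the largest sharpness among the candidate ions". The traces are made up.

```python
import math

def sharpness(trace, apex):
    """Hypothetical sharpness: drop from the apex to its lower neighbor,
    scaled by the square root of the apex height."""
    height = trace[apex]
    lower_neighbor = min(trace[apex - 1], trace[apex + 1])
    return (height - lower_neighbor) / math.sqrt(height)

def model_peak(ion_chromatograms, first_scan, last_scan, fraction=0.75):
    """Sum the traces that maximize in [first_scan, last_scan] and whose
    sharpness is at least `fraction` of the sharpest candidate."""
    candidates = {}
    for mz, trace in ion_chromatograms.items():
        apex = max(range(len(trace)), key=trace.__getitem__)
        interior = 0 < apex < len(trace) - 1
        if interior and first_scan <= apex <= last_scan:
            candidates[mz] = sharpness(trace, apex)
    if not candidates:
        return [], []
    cutoff = fraction * max(candidates.values())
    chosen = sorted(mz for mz, s in candidates.items() if s >= cutoff)
    n_scans = len(next(iter(ion_chromatograms.values())))
    shape = [sum(ion_chromatograms[mz][i] for mz in chosen)
             for i in range(n_scans)]
    return chosen, shape

# Made-up ion chromatograms (abundance per scan), for illustration only.
traces = {
    99: [0, 2, 10, 30, 12, 3, 0],
    125: [0, 2, 6, 18, 7, 2, 0],
    43: [5, 6, 5, 20, 40, 22, 6],    # maximizes outside the range
    57: [8, 9, 10, 11, 10, 9, 8],    # broad background, low sharpness
}

chosen, shape = model_peak(traces, first_scan=3, last_scan=3)
print(chosen)
print(shape)
```

The resulting shape (intensity vs. scan) is what a purification step would then fit against every other ion trace to decide how much of each ion belongs to the component.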
AMDIS Testing – Closely Eluting Components
Representing Chemical Identity
• Visual: 2D Structure
• Text: IUPAC Name
• Digital: No Accepted, Open Method
• Solution: The IUPAC/NIST Chemical Identifier
Connection Table
[Figure: connection table for a four-atom structure containing Cl, with single (S) and double (D) bonds]
Chemical Identity Problems
[Figure: structure drawings with methyl (CH3) groups]
• Registry Number possible for each exact form, mixture, unknown, unspecified
• Experts required
• Expensive, ambiguous and error prone
Requirements
• Different compounds have different identifiers
  – Keep all distinguishing structural information
[Figure: two structures compared, with identifiers labeled IChI-1 and IChI-2]
Requirements
• One compound has only one identifier
  – Omit unnecessary information
[Figure: several drawings of the same N/O group with different bond and charge depictions, all giving the same INChI]
3 Steps to INChI
• Chemistry – ‘Normalize’ Input Structure
  – Implement chemical rules
• Math – ‘Canonicalize’ (label the atoms)
  – Equivalent atoms get the same label
• Format – ‘Serialize’ Labeled Structure
  – Output as character string (‘name’)
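To make the "canonicalize, then serialize" idea concrete, here is a toy sketch that tries every atom numbering of a very small structure and keeps the lexicographically smallest description. This brute-force approach is only an illustration of the requirement that one compound yields one string regardless of input order; it is not the INChI algorithm, it skips the normalization step, and the output format is invented.

```python
from itertools import permutations

def canonical_string(atoms, bonds):
    """atoms: list of element symbols; bonds: iterable of (i, j) index pairs.

    Tries every numbering (feasible only for tiny structures) and serializes
    the smallest (elements, bonds) description found.
    """
    best = None
    for order in permutations(range(len(atoms))):
        # order[k] is the original index of the atom that gets new label k
        new_label = {old: new for new, old in enumerate(order)}
        elements = tuple(atoms[old] for old in order)
        edges = tuple(sorted(
            tuple(sorted((new_label[a], new_label[b]))) for a, b in bonds
        ))
        candidate = (elements, edges)
        if best is None or candidate < best:
            best = candidate
    elements, edges = best
    bond_text = ",".join(f"{a + 1}-{b + 1}" for a, b in edges)
    return ".".join(elements) + "/" + bond_text

# Ethanol heavy atoms entered with two different atom orders.
ethanol_a = canonical_string(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_string(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether: same formula, different connectivity.
ether = canonical_string(["C", "O", "C"], [(0, 1), (1, 2)])

print(ethanol_a, ethanol_b, ether)
```

Both ethanol inputs serialize to the same string, while dimethyl ether, with the same atoms but different connectivity, gets a different one.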
“Layers”
[Figure: chemical substances described in layers: formula, connectivity, stereo, isotope]
Nitrobenzene
[Figure: nitrobenzene structure with canonical numbering 1 to 9]

Layers         Description
formula        C6H5NO2
connectivity   8-7(9)6-4-2-1-3-5-6
H-atoms        1-5H
charges
MSG
[Figure: MSG structure with canonical numbering 1 to 10, plus Na]

Layers         Description
formula        C5H8NO4.Na
connectivity   6-3(5(9)10)1-2-4(7)8;
H-atoms        1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3     3-;
charges        -1;+1

C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1
INChI Test Version
[Screenshot: Input/Result panes; options: Mobile H On/Off, Include Org-Metal Bonds]
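The layered string shown for MSG can be taken apart with a simple split on "/": the first segment is the formula, and each later segment starts with a one-letter layer tag. The sketch below is a reading aid written for this slide's example; the function name is hypothetical and it is not a validating parser for any released identifier format.

```python
def split_layers(identifier):
    """Split a '/'-separated layered identifier into {tag: content}.

    The first segment is stored under 'formula'; every other segment is
    stored under its leading one-letter tag.
    """
    parts = identifier.split("/")
    layers = {"formula": parts[0]}
    for part in parts[1:]:
        layers[part[0]] = part[1:]
    return layers

# Identifier string as shown on the MSG slide.
msg = ("C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;"
       "/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1")

for tag, content in split_layers(msg).items():
    print(tag, content)
```

Each layer can then be compared on its own, for example matching two identifiers on formula and connectivity while ignoring stereo.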
Peptide Mass Spectra: Libraries for Organisms
• Proteins are linear sequences of amino acids
  – characteristic of Genome (organism)
• Peptides are ‘digested’ fragments of proteins
• MS ‘sequences’ peptides to reveal source Protein
• Peptide fragmentation spectra are not quite predictable
• Peptide fragmentation spectra for a ‘genome’ can be contained in one Library.
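A small sketch of the digestion step, assuming a trypsin-like rule (cleave after K or R unless the next residue is P). The enzyme, the function name, and the protein sequence are assumptions made for illustration; the slide does not name an enzyme.

```python
def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides

# Hypothetical protein sequence, for illustration only.
peptides = tryptic_digest("MKHLQLAIRGGKPAW")
print(peptides)

# A per-organism library could then be keyed by peptide sequence and charge.
library_keys = [f"{p}/2+" for p in peptides]
print(library_keys)
```

Because the set of peptides follows from the genome, a measured spectrum can be looked up by peptide and charge state in a single library for that organism.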
Spectrum Prediction Programs
Peptide Spectra Reference Library (multiple measurements each of 10,000 peptides)
[Figure: reference spectrum for HLQLAIR/2+]
MS Mapped to the Genome From Eric Deutsch, ISB, 6/2004