Statistical modeling in molecular medicine: proteomics (Anna Gambin)


  1. Statistical modeling in molecular medicine: proteomics Anna Gambin Institute of Informatics, University of Warsaw

  2. Outline: • mass spectrometry basics • modeling isotopic distribution • modeling exopeptidase activity • incorporating MEROPS data • peptidase activity in time • modeling electron transfer dissociation • deconvolution of spectra • modeling fragmentation

  3. Mass spectrometry. Data source: Center for Proteomics, Antwerp, Belgium

  4. Identifying proteins is complicated: • there are plenty of proteins in a sample • proteins are frequently fragmented • even a single protein has a complicated signal

  5. Chemical compounds are made of different isotopes: the isotopic envelope

  6. Huge number of isotopologues of a compound $\mathrm{C}_c\mathrm{H}_h\mathrm{N}_n\mathrm{O}_o\mathrm{S}_s$, counted in terms of $n_e$ (number of atoms of element $e$) and $i_e$ (number of isotopes of element $e$)
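
The size of this count can be sketched with a standard stars-and-bars argument, which is an assumption about what the slide's lost formula expressed: an element with n_e atoms and i_e stable isotopes admits C(n_e + i_e - 1, i_e - 1) different isotope-count vectors, and the counts multiply across elements. The names `ISOTOPE_COUNTS` and `count_isotopologues` are illustrative and do not come from the slides.

```python
from math import comb

# Number of stable isotopes per element (C, H, N: 2; O: 3; S: 4).
ISOTOPE_COUNTS = {"C": 2, "H": 2, "N": 2, "O": 3, "S": 4}

def count_isotopologues(formula):
    """Count isotope-count configurations for a formula given as {element: atoms}."""
    total = 1
    for element, n_atoms in formula.items():
        i_e = ISOTOPE_COUNTS[element]
        # Ways to split n_atoms indistinguishable atoms among i_e isotopes.
        total *= comb(n_atoms + i_e - 1, i_e - 1)
    return total

print(count_isotopologues({"H": 2, "O": 1}))  # water: 3 * 3 = 9
print(count_isotopologues({"O": 200}))        # 200 oxygen atoms alone: 20301
```

Even a single element with 200 atoms and three isotopes already gives tens of thousands of configurations, and the product over five elements grows much faster.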

  7. Important observation: some isotopic variants are more probable than others. P( ) =

  8. Assume: 1) the isotopic variants of individual atoms are independent; 2) elements differ in the abundances of their isotopes. P( ) =
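
Under the two assumptions above, the probability of an isotopic variant factorizes into one multinomial term per element. The following is a minimal sketch of that computation; the function names are hypothetical and the abundance values are approximate natural abundances used only for illustration.

```python
from math import comb, prod

def multinomial_pmf(counts, probs):
    """Probability of observing `counts` atoms of each isotope of one element."""
    coefficient = 1
    remaining = sum(counts)
    for c in counts:
        coefficient *= comb(remaining, c)
        remaining -= c
    return coefficient * prod(p ** c for p, c in zip(probs, counts))

def isotopologue_probability(config, abundances):
    """Product over elements; `config` maps element -> tuple of isotope counts."""
    return prod(multinomial_pmf(config[e], abundances[e]) for e in config)

# Approximate natural abundances (illustrative values, not from the slides).
ABUNDANCES = {"C": (0.9893, 0.0107), "H": (0.999885, 0.000115)}

# Methane with only the lightest isotopes vs. methane with one 13C atom.
light = isotopologue_probability({"C": (1, 0), "H": (4, 0)}, ABUNDANCES)
heavy = isotopologue_probability({"C": (0, 1), "H": (4, 0)}, ABUNDANCES)
print(light > heavy)  # the all-light variant is far more probable
```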

  9. Example: the isotope counts of 200 oxygen atoms satisfy $o_0 + o_1 + o_2 = 200$

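  10. How much do we gain by considering the smallest set with a fixed probability $p$? With $k = \sum_{e \in \text{Elements}} (i_e - 1)$, the size of that set is approximately
      $$C_{\text{lattice}} \, \frac{\pi^{k/2}}{\Gamma(k/2 + 1)} \, \bigl(q_{\chi^2(k)}(p)\bigr)^{k/2} \prod_{e \in \text{Elements}} \sqrt{n_e^{\,i_e - 1} \det \Delta_e} \;\propto\; \prod_{e \in \text{Elements}} n_e^{(i_e - 1)/2},$$
      whereas the total number of configurations grows like $\prod_{e \in \text{Elements}} n_e^{\,i_e - 1}$.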

  11. To get the smallest set with probability P: (1) find the most probable variant; (2) while Total Probability < P, get the next layer of variants v such that p > P(v) >= q·p, where p is the smallest P(v) in the previous layer; (3) trim the least probable variants from the last layer, keeping Total Probability >= P.

  12. Monotonic Expansion Property (multinomial distribution): for each v, the set {w : P(w) >= P(v)} is adjacent to v. [Figure: smallest set with the current Total Probability.]
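
The search for the smallest set and the adjacency property above can be sketched as a best-first exploration. This is a simplification and not the layered procedure from the slides: a binary heap replaces the layers, so configurations come out one at a time in non-increasing probability and no final trimming is needed. All function names are hypothetical, `elements` is assumed to be a list of `(number_of_atoms, isotope_probabilities)` pairs with strictly positive probabilities, and the starting point is found by simple hill climbing.

```python
import heapq
from math import exp, lgamma, log

def log_multinomial(counts, probs):
    """Log-probability of the isotope counts of one element (all probs > 0)."""
    out = lgamma(sum(counts) + 1)
    for c, p in zip(counts, probs):
        out += c * log(p) - lgamma(c + 1)
    return out

def log_prob(config, elements):
    return sum(log_multinomial(c, probs) for c, (_, probs) in zip(config, elements))

def neighbours(config):
    """Configurations obtained by changing the isotope of a single atom."""
    for e, counts in enumerate(config):
        for a in range(len(counts)):
            if counts[a] == 0:
                continue
            for b in range(len(counts)):
                if a != b:
                    moved = list(counts)
                    moved[a] -= 1
                    moved[b] += 1
                    yield config[:e] + (tuple(moved),) + config[e + 1:]

def most_probable(elements):
    """Hill-climb from rounded expected counts to a most probable configuration."""
    config = []
    for n, probs in elements:
        counts = [int(n * p) for p in probs]
        counts[probs.index(max(probs))] += n - sum(counts)
        config.append(tuple(counts))
    config = tuple(config)
    while True:
        best = max(neighbours(config), key=lambda c: log_prob(c, elements), default=config)
        if log_prob(best, elements) <= log_prob(config, elements):
            return config
        config = best

def smallest_set(elements, target):
    """Collect configurations in decreasing probability until `target` is reached."""
    start = most_probable(elements)
    heap = [(-log_prob(start, elements), start)]
    seen = {start}
    chosen, total = [], 0.0
    while heap and total < target:
        neg_lp, config = heapq.heappop(heap)
        chosen.append(config)
        total += exp(-neg_lp)
        for nb in neighbours(config):
            if nb not in seen:
                seen.add(nb)
                heapq.heappush(heap, (-log_prob(nb, elements), nb))
    return chosen, total

# Toy element: 2 atoms, isotope probabilities 0.9 and 0.1.
configs, covered = smallest_set([(2, (0.9, 0.1))], 0.9)
print(configs)  # [((2, 0),), ((1, 1),)], covering about 0.99
```

The heap costs a logarithmic factor per configuration, which is the price of this simplification compared with a layered scheme.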

  14. Our OPTIMAL implementation uses: • a queue for storing subsequent layers • a version of quickselect for trimming • other tricks. Complexity: O(n) in the total number of configurations.
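
The trimming step can be sketched with a quickselect-style partition, assuming the last layer is given simply as a list of probabilities and `needed` is the probability mass still missing. This is an illustrative version with hypothetical names, not the implementation referred to on the slide; a real implementation would carry the configurations along with their probabilities.

```python
import random

def trim_layer(layer, needed):
    """Keep the most probable entries of `layer` until their sum reaches `needed`.

    Runs in expected linear time because each round discards or accepts
    a random fraction of the remaining items without sorting them.
    """
    kept = []
    items = list(layer)
    while items and needed > 0:
        pivot = random.choice(items)
        high = [p for p in items if p > pivot]
        equal = [p for p in items if p == pivot]
        low = [p for p in items if p < pivot]
        high_sum = sum(high)
        if high_sum >= needed:
            # The cut falls inside the more probable part: recurse into it.
            items = high
            continue
        # Everything above the pivot is needed; then take pivot-valued items.
        kept.extend(high)
        needed -= high_sum
        for p in equal:
            if needed <= 0:
                break
            kept.append(p)
            needed -= p
        items = low
    return kept

print(sorted(trim_layer([0.1, 0.3, 0.2, 0.05], 0.45)))  # [0.2, 0.3]
```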

  15. We provide theoretical background and get better run times

  16. Proteolytic fragmentation, LC-MS/MS: • data for colorectal cancer patients and healthy donors • ca. 1000 peptides • preprocessing: spectra interpretation and retention time alignment

  17. Exopeptidase activity • motivation: differential exoprotease activities contribute to cancer type–specific serum peptidome degradation • our goal: the first formal model estimated from LC-MS/MS data. Reference: Villanueva, J., Nazarian, A., Lawlor, K., et al. 2008. A sequence-specific exopeptidase activity test (SSEAT) for “functional” biomarker discovery. Mol. Cell. Proteomics 7, 509–518.

  18. Cleavage graph (nodes in the example: FTSSTS; FTSST, TSSTS, SSTSY; FTSS, TSST, SSTS, STSY; FTS, TSS, SST, STS, TSY; FT, SS, TS, ST, SY). Transition intensities for the Markov process describing the flow of particles through the graph, i.e. the process of peptidome degradation:
      $$Q(x, x') = \begin{cases}
      a_{\star i} & \text{(create) if } x'_i = x_i + 1,\ x'_{-i} = x_{-i} \text{ for some } i, \\
      a_{r(i,j)}\, x_i & \text{(move) if } x'_i = x_i - 1,\ x'_j = x_j + 1 \text{ and } x'_{-i-j} = x_{-i-j} \text{ for some } i \to j, \\
      a_{i\dagger}\, x_i & \text{(annihilate/degrade) if } x'_i = x_i - 1,\ x'_{-i} = x_{-i} \text{ for some } i.
      \end{cases}$$
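
The three kinds of transitions can be enumerated directly from a state. The sketch below assumes a state is a dictionary from peptide to particle count and that the intensities are given as dictionaries `create` (a_{*i}), `move` (a_{r(i,j)}, keyed by edge) and `degrade` (a_{i†}); these names and the toy numbers are hypothetical.

```python
def transitions(x, create, move, degrade):
    """Return (rate, next_state) pairs reachable from state x in one jump."""
    out = []
    # create: a new particle of peptide i appears at constant rate a_{*i}
    for i, rate in create.items():
        nxt = dict(x)
        nxt[i] = nxt.get(i, 0) + 1
        out.append((rate, nxt))
    # move: one particle of i is cleaved into j, at rate a_{r(i,j)} * x_i
    for (i, j), rate in move.items():
        if x.get(i, 0) > 0:
            nxt = dict(x)
            nxt[i] -= 1
            nxt[j] = nxt.get(j, 0) + 1
            out.append((rate * x[i], nxt))
    # annihilate/degrade: one particle of i leaves the system, at rate a_{i†} * x_i
    for i, rate in degrade.items():
        if x.get(i, 0) > 0:
            nxt = dict(x)
            nxt[i] -= 1
            out.append((rate * x[i], nxt))
    return out

state = {"FTS": 2, "FT": 0}
for rate, nxt in transitions(state, {"FTS": 1.0}, {("FTS", "FT"): 0.5}, {"FT": 0.3}):
    print(rate, nxt)
```

A stochastic simulation would pick one of these transitions with probability proportional to its rate and advance time by an exponential waiting time with the summed rate.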

  19. In equilibrium. Proposition 1 (Equilibrium distribution). The process $(X(t))$ has the equilibrium (stationary) distribution $\pi$ given by
      $$\pi(x) = \prod_{i \in V} e^{-\lambda_i} \frac{\lambda_i^{x_i}}{x_i!},$$
      where the configuration of intensities $(\lambda_i)_{i \in V}$ is the unique solution to the following system of "balance" equations:
      $$\sum_{k \to i} \lambda_k\, a_{r(k,i)} + a_{\star i} = \lambda_i \Bigl( \sum_{i \to j} a_{r(i,j)} + a_{i\dagger} \Bigr) \quad \text{for every } i \in V.$$
      Old as the hills, but…
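
When the cleavage graph is acyclic (cleavage only shortens peptides, which is assumed here), the balance equations can be solved node by node in topological order, since each intensity depends only on its predecessors. The sketch below uses hypothetical names and toy rates; it also assumes every node has a positive total outflow.

```python
def equilibrium_intensities(order, create, move, degrade):
    """Solve the balance equations; `order` lists nodes with parents before children."""
    lam = {}
    for i in order:
        inflow = create.get(i, 0.0) + sum(
            lam[k] * a for (k, j), a in move.items() if j == i
        )
        outflow = degrade.get(i, 0.0) + sum(
            a for (k, j), a in move.items() if k == i
        )
        lam[i] = inflow / outflow
    return lam

# Toy graph: FTS is created, loses a residue at either end, products degrade.
lam = equilibrium_intensities(
    order=["FTS", "FT", "TS"],
    create={"FTS": 2.0},
    move={("FTS", "FT"): 1.0, ("FTS", "TS"): 1.0},
    degrade={"FT": 0.5, "TS": 0.5},
)
print(lam)  # {'FTS': 1.0, 'FT': 2.0, 'TS': 2.0}
```

In equilibrium the particle counts are then independent Poisson variables with these means.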

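  20. Hierarchical Bayesian model (Metropolis-Hastings to sample from the posterior):
      hyperparameters: $S_{\text{shape}}, S_{\text{rate}}$, $(B_r)_{r \in R}$, $(B_{\star i})_{i \in V_{\text{in}}}$
      $s \sim \text{Gamma}(S_{\text{shape}}, S_{\text{rate}})$
      $(b_r)_{r \in R} \sim \text{Dir}((B_r)_{r \in R})$
      $(b_{\star i})_{i \in V_{\text{in}}} \sim \text{Dir}((B_{\star i})_{i \in V_{\text{in}}})$
      $\lambda_i = \lambda_i(s, b_\star, b)$ for $i \in V$
      $x_i \sim \text{Poiss}(\lambda_i)$ for $i : \epsilon_i = 1$
      missing readings: $\delta_i \sim \text{Bern}(q)$ for $i : \epsilon_i = 1$
      errors: $(\epsilon_i)_{i \in V}$
      $y_i \sim \text{LogNormal}(x_i, \tau)$ for $i : \delta_i = 1$
      $y_i \sim \text{Background}$ for $i : \delta_i = 0$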
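
The sampling idea can be illustrated on a deliberately reduced toy problem, which is an assumption made for this sketch and not the model from the slide: a single intensity s with a Gamma(shape, rate) prior and Poisson observations. A random-walk Metropolis-Hastings sampler with a symmetric Gaussian proposal then accepts a proposal with probability min(1, posterior ratio). All names and numbers are hypothetical.

```python
import random
from math import exp, log

def log_posterior(s, data, shape, rate):
    """Gamma(shape, rate) prior on s with Poisson(s) likelihood, up to a constant."""
    return (shape - 1 + sum(data)) * log(s) - (rate + len(data)) * s

def metropolis_hastings(data, shape, rate, steps=20000, step_size=0.8, seed=0):
    rng = random.Random(seed)
    s = 1.0
    current = log_posterior(s, data, shape, rate)
    samples = []
    for _ in range(steps):
        proposal = s + rng.gauss(0.0, step_size)
        if proposal > 0:  # proposals outside the support are rejected
            candidate = log_posterior(proposal, data, shape, rate)
            if rng.random() < exp(min(0.0, candidate - current)):
                s, current = proposal, candidate
        samples.append(s)
    return samples

samples = metropolis_hastings([3, 4, 5, 4], shape=2.0, rate=1.0)
kept = samples[2000:]  # discard burn-in
print(sum(kept) / len(kept))  # should be near the exact posterior mean 18 / 5
```

This toy posterior is conjugate, so the exact answer Gamma(18, 5) is available for checking; in the full hierarchical model no such closed form is expected, which is why sampling is used.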

  21. NON-TRIVIAL TASK: filling the cleavage graph with real data. Data: • 1000 peptides: mass, charge, retention time • 243 precursor peptides • ca. 40 000 subsequences. From the amino acid sequence: • calculate mass • consider all charges • predict retention time (random forests). Quite often: missing reads and errors!

  22. Cleavage graph for real proteolytic events: • 20 colorectal cancer patients and 20 healthy donors • ca. 1000 peptides • preprocessing phase. MUCH SMALLER cleavage graphs! [Figure: cleavage graph for the peptide MSFT†LTN†K, with edges labeled by thermolysin, pepsin and chymotrypsin and fragments MSFT, MSFTL, LTNK, LTN, TN, K.]

  23. Identified enzymes make sense! [Figure: heatmap with color key and histogram, peptidases by data set no.] Peptidases shown: plasmin, neprilysin, calpain-2, matrix metallopeptidase-3, kallikrein-related peptidase 3, aminopeptidase PILS, legumain, cathepsin K, membrane-type matrix metallopeptidase-4, cathepsin H, ADAM10 peptidase, ADAM17 peptidase, caspase-1, ADAMTS4 peptidase, pepsin A, chymotrypsin C, ADAMTS5 peptidase, membrane-type matrix metallopeptidase-6, calpain-1, cathepsin L, cathepsin G, myeloblastin, chymase (Homo sapiens-type), tryptase alpha, matrix metallopeptidase-20, tripeptidyl-peptidase I, elastase-1, granzyme B (Homo sapiens-type), cathepsin S, trypsin 1, membrane-type matrix metallopeptidase-3, cathepsin B, eupitrilysin.

  24. Stochastic dynamics in time (A. Gambin, B. Kluge, Modeling Proteolysis from MS data). From MEROPS: denote by $\rho_{vw}$ the vector of all peptidase affinity coefficients for the cleavage $v \dagger w$. To be estimated: the vector $c$ of peptidase cutting intensities. The intensity to perform the cleavage is proportional to $x_u$:
      $$Q_{xx'} = \begin{cases} c^T \rho_{vw}\, x_u & \text{if } x' = x - \delta_u + \delta_v + \delta_w \text{ and } u = v \dagger w, \\ 0 & \text{otherwise.} \end{cases}$$
      With $P(x, t) = P(X(t) = x)$, the chemical master equation (CME) reads
      $$\partial_t P(x, t) = \sum_{y \neq x} \bigl( Q_{yx} P(y, t) - Q_{xy} P(x, t) \bigr) = \sum_{u = v \dagger w} c^T \rho_{vw} \bigl[ (x_u + 1)\, P(x + \delta_u - \delta_v - \delta_w, t) - x_u P(x, t) \bigr] = \sum_{u = v \dagger w} c^T \rho_{vw} \bigl[ x'_u P(x', t) - x_u P(x, t) \bigr],$$
      where in the last expression $x' = x + \delta_u - \delta_v - \delta_w$. No more monomolecular system: we have reactions A -> B and A -> B + C (endopeptidases). [Figure: cleavage graph for MSFT†LTN†K with thermolysin, pepsin and chymotrypsin edges.]

  25. Interesting moments... Denote by $E_q(t)$ the expected number of instances of peptide $q$ at time $t$:
      $$E_q(t) = \sum_x x_q P(x, t), \qquad \frac{\partial}{\partial t} E_q(t) = \sum_{u \to q} \lambda_{uq} E_u(t) + \lambda_{qq} E_q(t), \quad q \in V,$$
      $$E(t) = E(0)^T \exp(\Lambda t), \qquad \Lambda = (\lambda_{vw})_{v, w \in V}.$$
      [Figure: the matrix $\Lambda$ for peptide VAHRFKDLGEEN.]
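
The closed-form solution for the expected counts can be checked numerically on a toy case. The sketch below computes exp(Λt) with a plain Taylor series combined with scaling and squaring, in pure Python so that it has no dependencies; the function names and the single reaction u -> v + w with unit rate are assumptions made for illustration.

```python
from math import exp

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_exp(m, terms=20, squarings=10):
    """Matrix exponential via Taylor series of m / 2**squarings, then repeated squaring."""
    n = len(m)
    small = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def expected_counts(e0, lam, t):
    """E(t) = E(0)^T exp(Lambda t), returned as a list indexed like e0."""
    p = mat_exp([[v * t for v in row] for row in lam])
    return [sum(e0[u] * p[u][q] for u in range(len(e0))) for q in range(len(e0))]

# One cleavage u -> v + w with rate 1: lambda_uu = -1, lambda_uv = lambda_uw = 1.
LAMBDA = [[-1.0, 1.0, 1.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
print(expected_counts([1.0, 0.0, 0.0], LAMBDA, 1.0))
print([exp(-1.0), 1 - exp(-1.0), 1 - exp(-1.0)])  # analytic values for comparison
```

For this toy system the precursor decays exponentially while each product accumulates by the same amount, which is what the two printed lists should agree on.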

  26. ETD fragmentation: • more fragments • more insight into structure • more confidence in correct identification

  27. ETD: some bonds get easily broken, others not

  28. The goal of masstodon: • understand fragmentation inside the instrument under different experimental conditions • use purified chemical samples • study fragmentation pathways. Solution: locate fragments in data: 1. deconvolute signals and 2. infer fragmentation reaction constants
