Nonparametric combinatorial sequence models



  1. Nonparametric combinatorial sequence models
     Fabian L. Wauthier, UC Berkeley
     with Nebojsa Jojic (MSR) and Michael I. Jordan (UCB)
     30th March, 2011
     Fabian L. Wauthier: Nonparametric combinatorial sequence models, 1

  8. Biological motivation: Sequence variability

     Y N Q S E A G S H I I Q R M Y G C D
     Y N Q S E A G S H T L Q R M Y G C D
     Y N Q S E D G S H T I Q I M Y G C D

     ◮ Suppose we are given aligned sequences.
     ◮ Interest in understanding sequence variability:
       • Functional properties, domains, ancestral inference, etc.
     ◮ Many simplifying assumptions in previous work:
       • Site independence: Kingman coalescents, phylogenetic trees.
       • Full site dependence: mixture models.
       • Sequential stochastic process: HMMs, changepoint models.

     Our interest: sequences where these assumptions do not hold.
     ◮ Partial, long-range site dependencies.
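
To make "sequence variability" concrete, the sketch below looks for the sites at which the three aligned sequences from this slide disagree and measures how variable each site is. It is only an illustration: the function names `column_entropy` and `variable_sites` and the use of Shannon entropy as a variability score are assumptions, not something taken from the talk.

```python
from collections import Counter
from math import log2

# The three aligned sequences shown on the slide (spaces removed).
ALIGNMENT = [
    "YNQSEAGSHIIQRMYGCD",
    "YNQSEAGSHTLQRMYGCD",
    "YNQSEDGSHTIQIMYGCD",
]


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed at one aligned site.

    Entropy is an assumed variability score, chosen here for illustration.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def variable_sites(alignment):
    """Return the 0-based indices of sites where the sequences disagree."""
    return [i for i, col in enumerate(zip(*alignment)) if len(set(col)) > 1]


if __name__ == "__main__":
    for site in variable_sites(ALIGNMENT):
        column = "".join(seq[site] for seq in ALIGNMENT)
        print(site, column, round(column_entropy(column), 3))
```

On this toy alignment the variable sites are 5, 9, 10 and 12 (0-based); all other sites are conserved. A per-site score like this treats sites independently, which is exactly the simplification the slide argues against, so it is a starting point for inspection rather than a model.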

  15. Example: MHC I proteins
      Freeman and Company, 2007

      ◮ MHC I proteins present peptide chains to T-cell receptors.
      ◮ Peptides originating from virus protein ⇒ destruction of cell.
      ◮ Variability: duplication + mutation + fitness pressure.

      Our interest: model sequence variability, not its origins.

  20. Example: MHC I proteins
      Freeman and Company, 2007

      ◮ Binding site decomposes into pockets (Sidney et al., 2008). Expect partial site linkage.
        ⇒ Full site (in)dependence inappropriate.
      ◮ Variability due to evolutionary pressure on 3D binding site. Variable sites are discontiguous ⇒ long-range dependencies.
        ⇒ Markovian analysis inappropriate.

  26. Our model: high level
      Main idea: Each sequence is composed of smaller components.

      1. Sites grouped into discontiguous, aligned components (gray).
      2. Components of a sequence assigned a PSSM (colors).
      3. Symbols sampled from assigned PSSMs.

      Cf. Probabilistic index map (Jojic and Caspi, CVPR 2004; Jojic et al., UAI 2004)
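
The three steps above can be sketched as a small generative simulation. This is a simplified illustration under stated assumptions, not the model from the talk: the site grouping is fixed by hand, each group has a fixed number of candidate PSSMs, and each sequence picks one of them uniformly at random. The names `random_pssm`, `generate` and `pssms_per_group` are hypothetical.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_pssm(sites, rng):
    """Build a hypothetical PSSM: one categorical distribution per site.

    Raising the uniform draws to a power skews each distribution toward a
    few symbols; this is an arbitrary choice made for illustration.
    """
    pssm = {}
    for site in sites:
        weights = [rng.random() ** 4 for _ in ALPHABET]
        total = sum(weights)
        pssm[site] = [w / total for w in weights]
    return pssm


def generate(num_sequences, site_groups, pssms_per_group, seed=0):
    """Sample sequences given a fixed partition of sites into groups.

    `site_groups` is assumed to partition the indices 0..length-1; groups
    may be discontiguous (step 1 on the slide).
    """
    rng = random.Random(seed)
    length = sum(len(group) for group in site_groups)
    # A small library of candidate PSSMs for every site group.
    library = [
        [random_pssm(group, rng) for _ in range(pssms_per_group)]
        for group in site_groups
    ]
    sequences = []
    for _ in range(num_sequences):
        symbols = [None] * length
        for group, candidates in zip(site_groups, library):
            # Step 2: this sequence's component is assigned one PSSM
            # (uniformly here, which is an assumption).
            pssm = rng.choice(candidates)
            # Step 3: symbols at the group's sites are sampled from it.
            for site in group:
                symbols[site] = rng.choices(ALPHABET, weights=pssm[site])[0]
        sequences.append("".join(symbols))
    return sequences


if __name__ == "__main__":
    groups = [[0, 3, 4], [1, 2, 5]]  # two discontiguous site groups
    for seq in generate(5, groups, pssms_per_group=2):
        print(seq)
```

Because sites in the same group share a PSSM assignment, they vary together across sequences even when they are far apart in the sequence, while sites in different groups vary independently.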

  30. Missing information
      Do not know how many site groups/PSSMs there are!

      ◮ Our approach: put a prior distribution on these unknowns.
      ◮ Our model: a Chinese Restaurant Franchise (CRF) conditioned on a Chinese Restaurant Process (CRP).
        1. CRP: induces prior on number of site groups.
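
As a rough illustration of how a CRP puts a prior on the number of site groups, the sketch below samples a partition of sites using the standard sequential CRP construction: each site joins an existing group with probability proportional to that group's size, or opens a new group with probability proportional to a concentration parameter. The function name `sample_crp_partition` and the parameter name `alpha` are assumptions for illustration; the slides do not give the parameterisation used in the talk, and the CRF layer is not shown here.

```python
import random


def sample_crp_partition(num_sites, alpha, seed=0):
    """Sample a partition of sites 0..num_sites-1 from a CRP.

    `alpha` (> 0) is the concentration parameter: larger values tend to
    produce more groups.
    """
    rng = random.Random(seed)
    groups = []
    for site in range(num_sites):
        # Existing group k has weight len(groups[k]); a new group has weight alpha.
        weights = [len(group) for group in groups] + [alpha]
        k = rng.choices(range(len(groups) + 1), weights=weights)[0]
        if k == len(groups):
            groups.append([site])  # open a new site group
        else:
            groups[k].append(site)  # join an existing site group
    return groups


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        partition = sample_crp_partition(18, alpha, seed=1)
        print(alpha, len(partition), partition)
```

The number of groups is not fixed in advance; it is a random outcome of the sampling, which is what lets the number of site groups be inferred rather than specified. Note that the sampled groups need not be contiguous runs of sites.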
