
Clustering Eric Xing 2 1 Object Recognition and Tracking (1.9, - PDF document

School of Computer Science. Infinite Mixture and Dirichlet Process. Probabilistic Graphical Models (10-708), Lecture 20, Nov 28, 2007.


  1. School of Computer Science. Infinite Mixture and Dirichlet Process. Probabilistic Graphical Models (10-708), Lecture 20, Nov 28, 2007. Eric Xing.
     [Figure: graphical model over variables X1 to X8: Receptor A (X1), Receptor B (X2), Kinase C (X3), Kinase D (X4), Kinase E (X5), TF F (X6), Gene G (X7), Gene H (X8).]
     Clustering

  2. Object Recognition and Tracking
     [Figure: object positions (1.9, 9.0, 2.1), (1.8, 7.4, 2.3), (1.9, 6.1, 2.2), (0.7, 5.1, 3.2), (0.6, 5.9, 3.2), (0.9, 5.8, 3.1) across frames t=1, t=2, t=3.]
     Modeling The Mind
     Latent brain processes: view picture, read sentence, decide whether consistent.
     [Figure: the latent processes, a ∑ node, and the fMRI scan over t=1, …, t=T.]

  3. The Evolution of Science
     [Figure: research circles (Phy, Bio, CS), research topics, and PNAS papers along a timeline from 1900 to 2000 and beyond (?).]
     Partially Observed, Open and Evolving Possible Worlds
     - Unbounded # of objects/trajectories
     - Changing attributes
     - Birth/death, merge/split
     - Relational ambiguity
     - The parametric paradigm:
       - Finite
       - Structurally unambiguous
     [Figure: entity space $\Xi^{*}_{t+1|t}$ with event model $p(\{\phi_k^{0}\})$ or $p(\{\phi_k^{1:T}\})$ and motion model $p(\{\phi_k^{t+1}\} \mid \{\phi_k^{t}\})$; observation space with sensor model $p(x \mid \{\phi_k\})$.]
     How to open it up?

  4. A Classical Approach
     - Clustering as Mixture Modeling
     - Then "model selection"
     Model Selection vs. Posterior Inference
     - Model selection
       - "intelligent" guess: ???
       - cross validation: data-hungry
       - information theoretic (AIC, TIC, MDL): $\arg\min \mathrm{KL}\big(f(\cdot) \,\|\, g(\cdot \mid \hat{\theta}_{ML}, K)\big)$; parsimony, Occam's Razor
       - Bayes factor: need to compute data likelihood
     - Posterior inference: we want to handle uncertainty of model complexity explicitly
       $p(M \mid D) \propto p(D \mid M)\, p(M)$, with $M \equiv \{\theta, K\}$
       - we favor a distribution that does not constrain M in a "closed" space!
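
As a rough illustration of the information-theoretic route listed above, the sketch below scores candidate mixture sizes K with AIC, defined as 2 × (number of parameters) − 2 × (maximized log-likelihood). The log-likelihood values, the parameter-count formula, and the names `aic` and `select_k` are made up for illustration and do not come from the lecture.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood


def select_k(fits, params_per_component):
    """Pick the K with the smallest AIC.

    fits maps K -> maximized log-likelihood of a K-component mixture.
    The parameter count assumes each component has `params_per_component`
    free parameters plus K - 1 free mixing weights (an assumption).
    """
    scores = {
        k: aic(ll, k * params_per_component + (k - 1))
        for k, ll in fits.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical maximized log-likelihoods for K = 1..4 (illustrative only).
example_fits = {1: -520.0, 2: -450.0, 3: -446.0, 4: -445.5}
best_k, aic_scores = select_k(example_fits, params_per_component=2)
```

This returns a single K and discards the uncertainty about it, which is the limitation that the posterior-inference view on the same slide is meant to address.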

  5. Two "Recent" Developments
     - First order probabilistic languages (FOPLs)
       - Examples: PRM, BLOG …
       - Lift graphical models to "open" world (#rv, relation, index, lifespan …)
       - Focus on complete, consistent, and operating rules to instantiate possible worlds, and formal language of expressing such rules
       - Operational way of defining distributions over possible worlds, via sampling methods
     - Bayesian Nonparametrics
       - Examples: Dirichlet processes, stick-breaking processes …
       - From finite, to infinite mixture, to more complex constructions (hierarchies, spatial/temporal sequences, …)
       - Focus on the laws and behaviors of both the generative formalisms and resulting distributions
       - Often offer explicit expression of distributions, and expose the structure of the distributions; motivate various approximate schemes
     Clustering
     - How to label them?
     - How many clusters???

  6. Genetic Demography
     - Are there genetic prototypes among them?
     - What are they?
     - How many? (how many ancestors do we have?)
     Genetic Polymorphisms
     [Figure not recovered.]

  7. Biological Terms
     - Genetic polymorphism: a difference in DNA sequence among individuals, groups, or populations
     - Single Nucleotide Polymorphism (SNP): DNA sequence variation occurring when a single nucleotide (A, T, C, or G) differs between members of the species
       - Each variant is called an "allele"
       - Almost always bi-allelic
       - Account for most of the genetic diversity among different (normal) individuals, e.g. drug response, disease susceptibility
     From SNPs to Haplotypes
     - Alleles of adjacent SNPs on a chromosome form haplotypes
     - Powerful in the study of disease association or genetic evolution

  8. Haplotype and Genotype
     - Haplotype: a collection of alleles derived from the same chromosome
     Genotypes (chromosome phase is unknown) and their haplotype re-construction (chromosome phase is known), one locus per row:
       Genotype    Haplotypes
       2   13      13   2
       1   6       1    6
       9   15      9    15
       4   17      17   4
       1   9       1    9
       2   6       6    2
       9   17      9    17
       2   12      12   2
       12  7       7    12
       6   14      14   6
       1   7       7    1
       18  18      18   18
       1   4       1    4
       10  10      10   10
     Ancestral Inference
     [Figure: graphical model with $\theta_k$, $A_k$, $H_{n_1}$, $H_{n_2}$, $G_n$ and a plate of size N; the number of ancestors is marked "?".]
     Essentially a clustering problem, but …
     - Better recovery of the ancestors leads to better haplotyping results (because of more accurate grouping of common haplotypes)
     - True haplotypes are obtainable with high cost, but they can validate the model more objectively (as opposed to examining saliency of clustering)
     - Many other biological/scientific utilities
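
To make the phase ambiguity concrete, here is a minimal sketch that enumerates every unordered haplotype pair consistent with an unphased genotype. The function name `phasings` and the data layout (a list of per-locus allele pairs) are assumptions made for illustration; the example alleles are the first three loci of the slide's table.

```python
from itertools import product


def phasings(genotype):
    """Enumerate distinct unordered haplotype pairs consistent with a genotype.

    genotype is a list of (allele, allele) pairs, one per locus, with
    unknown phase.
    """
    seen = set()
    for flips in product([False, True], repeat=len(genotype)):
        h1 = tuple(b if f else a for (a, b), f in zip(genotype, flips))
        h2 = tuple(a if f else b for (a, b), f in zip(genotype, flips))
        seen.add(frozenset([h1, h2]))  # (h1, h2) and (h2, h1) are the same
    return seen


# First three loci of the slide's genotype column.
candidates = phasings([(2, 13), (1, 6), (9, 15)])
```

With three heterozygous loci there are four candidate resolutions, and the count doubles with each additional heterozygous locus, which is why haplotype re-construction needs a statistical model rather than enumeration.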

  9. A Finite (Mixture of) Allele Model
     - The probability of a genotype g:
       $p(g) = \sum_{h_1, h_2 \in H} p(h_1, h_2)\, p(g \mid h_1, h_2)$
       where H is the haplotype pool, $p(h_1, h_2)$ is the population haplotype model, and $p(g \mid h_1, h_2)$ is the genotyping model.
     - Standard settings:
       - $|H| = K \ll 2^{J}$: fixed-sized population haplotype pool
       - $p(h_1, h_2) = p(h_1)\, p(h_2) = f_1 f_2$: Hardy-Weinberg equilibrium
     - Problem: K? H?
     [Figure: graphical model with nodes $H_{n_1}$, $H_{n_2}$, $G_n$.]
     An Infinite (Mixture of) Allele Model
     [Figure: graphical model with $\theta_k$ and $A_k$ in a plate of size ∞, and $H_{n_1}$, $H_{n_2}$, $G_n$ in a plate of size N.]
     - How? Via a nonparametric hierarchical Bayesian formalism!
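
The finite-pool genotype probability can be sketched in a few lines. This is an illustration under stated assumptions, not the lecture's implementation: alleles are coded 0/1, a genotype is the per-locus allele count, genotyping is noiseless, and the pool frequencies are invented.

```python
from itertools import product


def genotype_prob(g, pool):
    """p(g) = sum over h1, h2 in pool of f[h1] * f[h2] * p(g | h1, h2).

    pool maps each haplotype (tuple of 0/1 alleles) to its frequency f.
    Assumes noiseless genotyping: p(g | h1, h2) is 1 when the per-locus
    allele sums of h1 and h2 equal g, else 0.
    """
    total = 0.0
    for (h1, f1), (h2, f2) in product(pool.items(), repeat=2):
        if tuple(a + b for a, b in zip(h1, h2)) == tuple(g):
            total += f1 * f2  # Hardy-Weinberg: p(h1, h2) = f1 * f2
    return total


# Hypothetical pool with K = 3 haplotypes over J = 2 loci.
pool = {(0, 0): 0.5, (1, 1): 0.3, (0, 1): 0.2}
p_het = genotype_prob((1, 1), pool)
```

The pool size K is fixed in advance here, which is exactly the quantity the slide flags as the open problem.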

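  10. Stick-breaking Process
      $G = \sum_{k=1}^{\infty} \pi_k\, \delta(\theta_k)$
      - Location: $\theta_k \sim G_0$
      - Mass: $\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j)$, with $\beta_k \sim \mathrm{Beta}(1, \alpha)$ and $\sum_{k=1}^{\infty} \pi_k = 1$
      [Figure: a unit stick broken successively with $\beta$ = 0.4, 0.5, 0.8 on remaining lengths 1, 0.6, 0.3, giving $\pi$ = 0.4, 0.3, 0.24.]
      Graphical Model
      [Figure: graphical model with $\theta_k$ and $A_k$ in a plate of size ∞, and $H_{n_1}$, $H_{n_2}$, $G_n$ in a plate of size N.]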
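
A minimal sketch of the stick-breaking weights, assuming a finite truncation level (the process itself is infinite). The function names and the `truncation` parameter are illustrative choices, not part of the lecture.

```python
import random


def weights_from_betas(betas):
    """pi_k = beta_k * prod_{j<k} (1 - beta_j)."""
    weights, remaining = [], 1.0
    for beta in betas:
        weights.append(beta * remaining)  # break off a fraction of the stick
        remaining *= 1.0 - beta           # what is left for later breaks
    return weights


def sample_stick_breaking(alpha, truncation, rng=None):
    """Draw the first `truncation` weights with beta_k ~ Beta(1, alpha)."""
    rng = rng or random.Random()
    return weights_from_betas(
        rng.betavariate(1.0, alpha) for _ in range(truncation)
    )


# The slide's figure: breaks of 0.4, 0.5, 0.8 give masses 0.4, 0.3, 0.24.
slide_weights = weights_from_betas([0.4, 0.5, 0.8])
```

Under truncation the weights sum to slightly less than 1; the leftover mass belongs to the components that were not drawn.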

  11. Chinese Restaurant Process
      $P(c_i = k \mid \mathbf{c}_{-i})$ for successive customers i (the $\alpha$ term is the probability of opening a new table):
        customer   table 1 ($\theta_1$)    table 2 ($\theta_2$)    table 3
        1          1                       0                       0
        2          1/(1+α)                 α/(1+α)                 0
        3          1/(2+α)                 1/(2+α)                 α/(2+α)
        4          1/(3+α)                 2/(3+α)                 α/(3+α)
        i          m_1/(i+α−1)             m_2/(i+α−1)             …, α/(i+α−1)
      CRP defines an exchangeable distribution on partitions over an (infinite) sequence of samples; such a distribution is formally known as the Dirichlet Process (DP).
      The DP Mixture of Ancestral Haplotypes
      - The customers around a table form a cluster
        - associate a mixture component (i.e., a population haplotype) with a table
        - sample $\{a, \theta\}$ at each table from a base measure $G_0$ to obtain the population haplotype and nucleotide substitution frequency for that component
      [Figure: customers 1 to 9 seated around tables, each table labeled $\{A, \theta\}$.]
      - With $p(h \mid \{A, \theta\})$ and $p(g \mid h_1, h_2)$, the CRP yields a posterior distribution on the number of population haplotypes (and on the haplotype configurations and the nucleotide substitution frequencies)
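
The seating rule can be simulated directly. The sketch below is illustrative: `crp_probs` and `sample_crp` are hypothetical names, and tables are indexed from 0 rather than 1.

```python
import random


def crp_probs(counts, alpha):
    """Seating probabilities for the next customer.

    counts[k] is the number of customers already at table k. The last
    entry of the result is the probability of opening a new table.
    """
    denom = sum(counts) + alpha
    return [m / denom for m in counts] + [alpha / denom]


def sample_crp(n, alpha, rng=None):
    """Seat n customers one at a time; return (seating, table counts)."""
    rng = rng or random.Random()
    counts, seating = [], []
    for _ in range(n):
        probs = crp_probs(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):  # the customer opened a new table
            counts.append(0)
        counts[k] += 1
        seating.append(k)
    return seating, counts


# Fourth customer with tables of size 1 and 2 and alpha = 1.
fourth = crp_probs([1, 2], 1.0)
```

The number of occupied tables is not fixed in advance; it grows with n at a rate controlled by alpha.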

  12. DP-haplotyper
      [Figure: $\alpha$ and $G_0$ parameterize a DP that generates G; infinite mixture components $\{A, \theta\}$ (for population haplotypes), indexed by K; likelihood model (for individual haplotypes and genotypes) over $H_{n_1}$, $H_{n_2}$, $G_n$ in a plate of size N.]
      - Inference: Markov Chain Monte Carlo (MCMC)
        - Gibbs sampling
        - Metropolis-Hastings
      Model components
      - Choice of base measure:
        $G_0 \sim \prod_j \mathrm{Unif}(a_j) \cdot \mathrm{Beta}(\theta_j)$
      - Nucleotide-substitution model:
        $p(h_i \mid \{a_k, \theta_k\}) = \prod_j p(h_{i,j} \mid a_{k,j}, \theta_{k,j})$, where
        $p(h_{i,j} \mid a_{k,j}, \theta_{k,j}) = \begin{cases} \theta_{k,j} & \text{if } h_{i,j} = a_{k,j} \\ 1 - \theta_{k,j} & \text{if } h_{i,j} \neq a_{k,j} \end{cases}$
      - Noisy genotyping model:
        $p(g_i \mid h_{i_1}, h_{i_2}) = \prod_j p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j})$, where
        $p(g_{i,j} \mid h_{i_1,j}, h_{i_2,j}) = \begin{cases} \gamma & \text{if } h_{i_1,j} \oplus h_{i_2,j} = g_{i,j} \\ \frac{1-\gamma}{2} & \text{if } h_{i_1,j} \oplus h_{i_2,j} \neq g_{i,j} \end{cases}$
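
The two likelihood terms are simple per-locus products and can be sketched as follows. The coding is an assumption made for illustration: alleles are 0/1, and $\oplus$ is taken to be the per-locus allele count (0, 1, or 2). The function names are hypothetical.

```python
def haplotype_likelihood(h, a, theta):
    """p(h | a, theta) = prod_j (theta_j if h_j == a_j else 1 - theta_j)."""
    prob = 1.0
    for h_j, a_j, theta_j in zip(h, a, theta):
        prob *= theta_j if h_j == a_j else 1.0 - theta_j
    return prob


def genotype_likelihood(g, h1, h2, gamma):
    """p(g | h1, h2) with per-locus factor gamma or (1 - gamma) / 2.

    Assumption: alleles are coded 0/1 and the combined genotype at a locus
    is the allele count h1_j + h2_j.
    """
    prob = 1.0
    for g_j, x, y in zip(g, h1, h2):
        prob *= gamma if x + y == g_j else (1.0 - gamma) / 2.0
    return prob


# One mismatch against the founder at the second locus.
p_h = haplotype_likelihood((0, 1), (0, 0), (0.9, 0.8))
```

With three possible genotype values per locus, the factor $(1-\gamma)/2$ splits the error mass evenly over the two wrong values, so each per-locus distribution sums to 1.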

  13. Gibbs sampling
      Starting from some initial haplotype reconstruction $H^{(0)}$, pick a first table with an arbitrary $a_1^{(0)}$, and form the initial population-hap pool $A^{(0)} = \{a_1^{(0)}\}$:
      i) Choose an individual i and one of his/her two haplotypes t, uniformly and at random, from all ambiguous individuals;
      ii) Sample $c_{i_t}^{(t+1)}$ from $p(c_{i_t}^{(t+1)} \mid \mathbf{c}_{-i_t}^{(t)}, H^{(t)}, A^{(t)})$, update $\mathbf{c}^{(t+1)}$;
      iii) Sample $a_k^{(t+1)}$, where $k = c_{i_t}^{(t+1)}$, from $p(a_k^{(t+1)} \mid \{h_{i'_{t'}}^{(t)} : \forall\, i'_{t'} \text{ s.t. } c_{i'_{t'}}^{(t+1)} = k\})$, update $A^{(t+1)}$;
      iv) Sample $h_{i_t}^{(t+1)}$ from $p(h_{i_t}^{(t+1)} \mid c_{i_t}^{(t+1)}, H_{-i_t}^{(t)}, A^{(t+1)})$, update $H^{(t+1)}$.
      Convergence of Ancestral Inference
      [Figure not recovered.]
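
Step (ii) is the part that lets the number of ancestors change. The sketch below shows one common form of that step for a CRP mixture, under simplifying assumptions that are not from the lecture: binary alleles, a single fixed $\theta$ shared by all loci and tables, and a founder drawn uniformly for a new table. The counts are assumed to exclude the haplotype being resampled, and all names are hypothetical.

```python
import random


def match_likelihood(h, a, theta):
    """p(h | a, theta) with one shared theta for all loci (a simplification)."""
    prob = 1.0
    for h_j, a_j in zip(h, a):
        prob *= theta if h_j == a_j else 1.0 - theta
    return prob


def indicator_probs(h, counts, founders, theta, alpha):
    """Normalized p(c = k | rest) for step (ii).

    Existing table k: proportional to counts[k] * p(h | founders[k], theta).
    New table: proportional to alpha * 0.5 ** len(h), the likelihood of h
    averaged over a founder drawn uniformly from binary haplotypes.
    """
    scores = [m * match_likelihood(h, a, theta)
              for m, a in zip(counts, founders)]
    scores.append(alpha * 0.5 ** len(h))
    total = sum(scores)
    return [s / total for s in scores]


def sample_indicator(h, counts, founders, theta, alpha, rng=None):
    """Draw a table index; the last index means 'open a new table'."""
    rng = rng or random.Random()
    probs = indicator_probs(h, counts, founders, theta, alpha)
    return rng.choices(range(len(probs)), weights=probs)[0]


probs = indicator_probs((0, 0), [2, 1], [(0, 0), (1, 1)], 0.9, 1.0)
```

A full sampler would alternate this draw with steps (iii) and (iv), resampling the founder of the chosen table and then the individual's haplotype given the genotype.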

  14. Haplotyping Error
      The Gabriel data
      [Figure not recovered.]
