Clustering Multivariate Binary Outcomes with Restricted Latent Class Models: A Bayesian Approach. Zhenke Wu, Assistant Professor of Biostatistics, School of Public Health, University of Michigan, Ann Arbor. Joint Statistical Meetings 2018.


  1. Clustering Multivariate Binary Outcomes with Restricted Latent Class Models: A Bayesian Approach. Zhenke Wu, Assistant Professor of Biostatistics, School of Public Health, University of Michigan, Ann Arbor. Joint Statistical Meetings 2018, Vancouver, August 2, 2018. Contact: zhenkewu@umich.edu; zhenkewu.com. R package: rewind, https://github.com/zhenkewu/rewind

  2. Motivating Example. [Figure: six panels. Top row: ($Y_{il}$): data; ($Y_{il}$): design; Hierarchical Clustering (cut with true # clusters). Bottom row: Standard LCA (true # clusters); Subset clustering (Hoff, 2005); Proposed. The data and design panels plot Subject (1:N) against Dimension (1:L); the four clustering panels plot Subject (1:N) against Subject (1:N).]

  3. Take-away: Accurate clustering of multivariate binary data that 1) automatically selects feature subsets and 2) works well for unbalanced cluster sizes. We achieve this goal via Boolean matrix decomposition, or more generally, restricted latent class models.

  4. Boolean Matrix Decomposition (noise-free version; a special case of restricted latent class models)
     Broad applications:
     - Medicine: clustering based on autoantibodies in autoimmune diseases
     - Disease epidemiology: childhood pneumonia etiology estimation
     - Purchasing behavior: grocery shopping
     - Computer science: text mining
     - Educational assessment: cognitive classifications
     - Mobile health: latent constructs, e.g., engagement with interventions, vulnerability and receptivity
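The noise-free decomposition named on this slide can be sketched as a Boolean matrix product: entry (i, l) of the data matrix is 1 when subject i carries at least one latent state that the Q-matrix links to dimension l. The sketch below is illustrative only; the function name `boolean_product`, the matrix name `H`, and the toy values are assumptions and do not come from the talk or from the rewind package.

```python
def boolean_product(H, Q):
    """Boolean matrix product: Y[i][l] = OR over m of (H[i][m] AND Q[m][l])."""
    n_subjects, n_states, n_dims = len(H), len(Q), len(Q[0])
    return [
        [int(any(H[i][m] and Q[m][l] for m in range(n_states)))
         for l in range(n_dims)]
        for i in range(n_subjects)
    ]


# Hypothetical toy example: N = 3 subjects, M = 2 latent states, L = 4 dimensions.
H = [[1, 0],   # subject 1 has state 1 only
     [0, 1],   # subject 2 has state 2 only
     [1, 1]]   # subject 3 has both states
Q = [[1, 1, 0, 0],   # state 1 is linked to dimensions 1 and 2
     [0, 1, 1, 0]]   # state 2 is linked to dimensions 2 and 3
Y = boolean_product(H, Q)
```

In this toy case the third subject's row is the element-wise OR of the two rows of `Q`, and the fourth dimension is 0 for everyone because no latent state is linked to it.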

  5. Statistical Formulation

  6. Model Setup: Quick Overview
     • Data: $Y_i = (Y_{i1}, \ldots, Y_{iL}) \in \{0,1\}^L$, $i = 1, \ldots, N$
     • Latent state vector: $\eta_i \in \mathcal{A} \subset \{0,1\}^M$
     • Latent dimension: $M$
     • Latent class: $K$ distinct patterns of $\eta_i$
     • The number of clusters, $K$, is unknown (no greater than $2^M$)
     • Q-matrix ($M$ by $L$; binary): $Q$

  7. Model Setup: Quick Overview
     1) Given a latent state dimension $M$, specify the likelihood $[Y_i \mid \eta_i, Q]$ via restricted latent class models (RLCM), with conditional independence.
     For example, for dimension $l$:
     - Needs just one required state in $\{m : Q_{ml} = 1\}$ for a positive ideal response, $\Gamma_{il} = 1$.
     - Referred to as the partially latent class model in epidemiology (Wu et al., 2016); the Deterministic Input, Noisy "Or" gate (DINO) model in psychology (Junker and Sijtsma, 2001); non-negative matrix factorization if rows of Q are orthogonal (Lee and Seung, 1999).
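A minimal sketch of the "at least one required state" rule and the resulting likelihood follows. It assumes that, given the ideal response, each observed entry is an independent Bernoulli draw with a per-dimension true positive rate `theta[l]` or false positive rate `psi[l]`; these parameter names, the function names, and the toy values are illustrative assumptions rather than notation taken from the slides.

```python
import math


def ideal_response(eta_i, Q):
    """Gamma_il = 1 if subject i has at least one state m with Q[m][l] = 1."""
    return [int(any(eta_i[m] and Q[m][l] for m in range(len(Q))))
            for l in range(len(Q[0]))]


def log_likelihood(y_i, eta_i, Q, theta, psi):
    """log P(y_i | eta_i, Q), treating the L dimensions as conditionally independent.

    theta[l]: assumed P(Y_il = 1 | Gamma_il = 1), a true positive rate
    psi[l]:   assumed P(Y_il = 1 | Gamma_il = 0), a false positive rate
    """
    gamma = ideal_response(eta_i, Q)
    total = 0.0
    for l, y in enumerate(y_i):
        p = theta[l] if gamma[l] == 1 else psi[l]
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical toy values: M = 2 latent states, L = 4 dimensions.
Q = [[1, 1, 0, 0],
     [0, 1, 1, 0]]
theta = [0.9] * 4
psi = [0.1] * 4
eta_i = [1, 0]
gamma_i = ideal_response(eta_i, Q)
ll_match = log_likelihood([1, 1, 0, 0], eta_i, Q, theta, psi)
ll_mismatch = log_likelihood([0, 0, 1, 1], eta_i, Q, theta, psi)
```

An observation that agrees with the ideal response (`ll_match`) scores higher than one that contradicts it in every dimension (`ll_mismatch`), which is the sense in which the Q-matrix restricts the latent classes.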

  8. Model Setup: Quick Overview
     In two steps:
     1) Given a latent state dimension $M$, first specify the likelihood $[Y_i \mid \eta_i, Q]$ via restricted latent class models (RLCM), with conditional independence.
     2) Specify a prior distribution $[\eta_i, i = 1, \ldots, N]$ obtained from a clustering mechanism with an unknown number of clusters $K$ (represented by cluster assignment indicators $\{Z_i, i = 1, \ldots, N\}$). We use a mixture of finite mixtures (Miller and Harrison, 2017, JASA).
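The clustering mechanism in step 2 can be illustrated by drawing cluster assignments from a mixture of finite mixtures: draw the number of components, draw symmetric Dirichlet weights, then assign subjects independently. This is a sketch under stated assumptions: the shifted-Poisson prior on the number of components, the hyperparameters `rate` and `gamma`, and all function names are illustrative choices, not the settings used in the talk.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) by inverting the CDF (adequate for small rates)."""
    u, k = rng.random(), 0
    p = math.exp(-rate)
    cdf = p
    while u > cdf and k < 1000:   # the cap guards against floating-point stalls
        k += 1
        p *= rate / k
        cdf += p
    return k


def sample_mfm_assignments(n, rate=1.0, gamma=1.0, seed=0):
    """Sample (K, Z_1..Z_n) from a mixture-of-finite-mixtures prior."""
    rng = random.Random(seed)
    k_components = 1 + sample_poisson(rate, rng)          # assumed: K - 1 ~ Poisson(rate)
    draws = [rng.gammavariate(gamma, 1.0) for _ in range(k_components)]
    total = sum(draws)
    weights = [d / total for d in draws]                  # Dirichlet(gamma, ..., gamma)
    z = rng.choices(range(k_components), weights=weights, k=n)
    return k_components, z


k_components, z = sample_mfm_assignments(n=50)
n_clusters = len(set(z))   # occupied components; never more than K
```

The number of occupied clusters can be smaller than the number of components, which is why the number of clusters is treated as unknown rather than fixed in advance.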

  9. Challenges: Boolean Matrix Decomposition (an example of restricted latent class models)
     C1. High-dimensional discrete space
         • Sparse priors that encourage: 1. a small number of latent state dimensions; 2. a small number of distinct latent state patterns
     C2. Unknown number of latent state dimensions
         • Infinite-dimension model (based on the semi-ordered formulation of the Indian Buffet Process); identifiability issue
     C3. Unknown number of clusters (i.e., # latent classes)
         • Mixture of finite mixtures model
     T1. Identifiability of model parameters based on likelihood only
         • Open and frontier problem; exciting progress at Michigan
     (C: computational; T: theoretical)
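To give a feel for how an Indian Buffet Process prior keeps the number of active latent state dimensions small, here is a truncated sketch of the ordered stick-breaking representation of the IBP. It is a simplified stand-in, not the semi-ordered formulation mentioned on the slide; the truncation level `max_states`, the concentration `alpha`, and the function name are assumptions made for illustration.

```python
import random


def sample_ibp_stick_breaking(n, alpha=2.0, max_states=10, seed=0):
    """Draw decreasing state prevalences and a binary subject-by-state matrix."""
    rng = random.Random(seed)
    sticks, mu = [], 1.0
    for _ in range(max_states):
        # mu_(m) = nu_m * mu_(m-1), with nu_m ~ Beta(alpha, 1)
        mu *= rng.betavariate(alpha, 1.0)
        sticks.append(mu)
    # Each subject switches state m on independently with probability mu_(m).
    H = [[int(rng.random() < mu_m) for mu_m in sticks] for _ in range(n)]
    return sticks, H


sticks, H = sample_ibp_stick_breaking(n=50)
# States used by at least one subject; later states are used less and less often.
active = [m for m in range(len(sticks)) if any(row[m] for row in H)]
```

Because the prevalences shrink multiplicatively, states far down the ordering are rarely switched on, which is the sparsity that challenge C1 asks the prior to encourage.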

  10. Comparison of variants of latent class analysis of multivariate binary data

  11. Data

  12. Data
      - 76 autoantibody patterns from patients with rheumatic disease and cancer
      - All were negative for autoantibodies against prominent defined specificities
      Can an algorithm be developed to identify common autoantibody signatures, and to estimate clusters among patients?

  13. Raw Intensity Scan Data (20 lanes on a single gel). [Figure: one panel per lane, Lane 1 to Lane 20; axes: Location on Gel ($t_j^{(g)}$) and Intensity.]

  14. Scientific Questions
      • How many clusters? What are the clusters? [the clustering problem]
      • How many machines are there and what are the component auto-antigens? [estimation of latent state dimensions]
      • What makes the clusters different in terms of presence or absence of machines? [interpretability of the clusters]

  15. Preprocessing Step I-a: Automated Peak Detection. Example: Gel Set 1. [Figure: detected peaks (*) for Lanes 1 to 20; axes: Location on Gel ($t_j^{(g)}$) and Lane.]

  16. Align the peaks (Wu et al., 2017). R package: spotgear

  17. Posterior Computation

  18. Posterior Computation
      Designed and implemented MCMC algorithms that deal with
      • a) an unknown number of clusters (mixture of finite mixtures model; split-merge), and
      • b) an unknown number of machines (slice sampler for the infinite Indian Buffet Process).
      Also works for a pre-specified number of machines. R package: rewind

  19. Simulation

  20. Simulation Setup. [Figure: three panels. ($\Gamma_{il}$): Design Matrix, Subject (1:N) by Dimension (1:L); ($\eta_{im}$): Latent State, Subject (1:N) by Latent State (1:M); ($Q_{ml}$): True Q (ordered), Latent State (1:M) by Dimension (1:L). Axis ticks run to 50 subjects, 100 dimensions, and 3 latent states.]

  21. Recovery of the matrix Q (low noise)

  22. Recovery of the matrix Q (intermediate noise)

  23. Recovery of the matrix Q (high noise)
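The recovery results on these slides were shown as figures that did not survive in this transcript. One possible way to put a number on recovery of Q, offered here only as an illustration, is the smallest fraction of mismatched entries after trying every ordering of the estimated rows, since the latent states carry no inherent labels. The function name, the assumption that the true and estimated Q have the same number of rows, and the toy matrices are all hypothetical.

```python
from itertools import permutations


def q_recovery_error(Q_true, Q_est):
    """Smallest fraction of mismatched entries over all row orderings of Q_est.

    Assumes both matrices have the same shape; brute force, so only
    practical for a small number of latent states.
    """
    n_states, n_dims = len(Q_true), len(Q_true[0])
    best = None
    for perm in permutations(range(n_states)):
        mismatches = sum(
            Q_true[r][l] != Q_est[perm[r]][l]
            for r in range(n_states) for l in range(n_dims)
        )
        if best is None or mismatches < best:
            best = mismatches
    return best / (n_states * n_dims)


# Hypothetical toy case: the estimate has its rows swapped and one wrong entry.
Q_true = [[1, 1, 0, 0],
          [0, 1, 1, 0]]
Q_est = [[0, 1, 1, 0],
         [1, 1, 0, 1]]
error = q_recovery_error(Q_true, Q_est)
```

With only a handful of latent states the exhaustive search over orderings is cheap; for larger M an assignment solver would be a more sensible choice.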

  24. Preliminary clustering results based on machine models
      Data: CTP negative sera
      Method: Bayesian machine-based restricted latent class analysis
      Figure: Three estimated clusters (top three panels) with distinct enrichment of three distinct estimated machines (bottom panel)
      Colored labels (red, blue, green) mark clusters obtained by a standard method; this algorithm is agnostic to them.

  25. Main Points Once Again
      • Goal: Based on multivariate binary data, find scientifically structured, interpretable clusters
      • Proposed a framework for clustering using restricted latent class models
      • Designed and implemented MCMC algorithms that deal with unknown numbers of clusters and machines; Bayesian binary factorization algorithm
      • Superior clustering performance compared to standard analyses; improved estimation under unbalanced cluster sizes
